Citation: LAN Xiaotian, HE Qianhua, YAN Haikang, et al., “A Novel Re-weighted CTC Loss for Data Imbalance in Speech Keyword Spotting,” Chinese Journal of Electronics, vol.32, no.3, pp.465–473, 2023. doi: 10.23919/cje.2021.00.198
[1] J. Gao, J. Shao, Q. W. Zhao, et al., “Efficient system combination for Chinese spoken term detection,” Chinese Journal of Electronics, vol.19, no.3, pp.457–462, 2010.
[2] E. F. Huang, H. C. Wang, and F. K. Soong, “A fast algorithm for large vocabulary keyword spotting application,” IEEE Transactions on Speech and Audio Processing, vol.2, no.3, pp.449–452, 1994. doi: 10.1109/89.294361
[3] S. Tabibian, “A voice command detection system for aerospace applications,” International Journal of Speech Technology, vol.20, no.4, pp.1049–1061, 2017. doi: 10.1007/s10772-017-9467-4
[4] J. T. Foote, S. J. Young, G. J. F. Jones, et al., “Unconstrained keyword spotting using phone lattices with application to spoken document retrieval,” Computer Speech & Language, vol.11, no.3, pp.207–224, 1997. doi: 10.1006/csla.1997.0027
[5] C. L. Zhu, Q. J. Kong, L. Zhou, et al., “Sensitive keyword spotting for voice alarm systems,” in Proceedings of 2013 IEEE International Conference on Service Operations and Logistics, and Informatics, Dongguan, China, pp.350–353, 2013.
[6] J. G. Wilpon, L. R. Rabiner, C. H. Lee, et al., “Automatic recognition of keywords in unconstrained speech using hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.38, no.11, pp.1870–1878, 1990. doi: 10.1109/29.103088
[7] M. C. Silaghi, “Spotting subsequences matching a HMM using the average observation probability criteria with application to keyword spotting,” in Proceedings of the 20th National Conference on Artificial Intelligence, Pittsburgh, PA, USA, pp.1118–1123, 2005.
[8] V. Frinken, A. Fischer, R. Manmatha, et al., “A novel word spotting method based on recurrent neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, no.2, pp.211–224, 2012. doi: 10.1109/TPAMI.2011.113
[9] G. G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, pp.4087–4091, 2014.
[10] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proceedings of the 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, pp.1478–1482, 2015.
[11] M. Sun, D. Snyder, Y. X. Gao, et al., “Compressed time delay neural network for small-footprint keyword spotting,” in Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp.3607–3611, 2017.
[12] A. Graves, S. Fernández, F. Gomez, et al., “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, pp.369–376, 2006.
[13] M. Wöllmer, B. Schuller, and G. Rigoll, “Keyword spotting exploiting long short-term memory,” Speech Communication, vol.55, no.2, pp.252–265, 2013. doi: 10.1016/j.specom.2012.08.006
[14] Y. Bai, J. Y. Yi, H. Ni, et al., “End-to-end keywords spotting based on connectionist temporal classification for Mandarin,” in Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, Tianjin, China, pp.1–5, 2016.
[15] H. K. Yan, Q. H. He, and W. Xie, “CRNN-CTC based Mandarin keywords spotting,” in Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp.7489–7493, 2020.
[16] B. Liu, S. Nie, Y. P. Zhang, et al., “Focal loss and double-edge-triggered detector for robust small-footprint keyword spotting,” in Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, pp.6361–6365, 2019.
[17] J. Y. Hou, Y. Y. Shi, M. Ostendorf, et al., “Mining effective negative training samples for keyword spotting,” in Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp.7444–7448, 2020.
[18] K. Zhang, Z. Y. Wu, D. D. Yuan, et al., “Re-weighted interval loss for handling data imbalance problem of end-to-end keyword spotting,” in Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China, pp.2567–2571, 2020.
[19] C. Elkan, “The foundations of cost-sensitive learning,” in Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA, USA, vol.17, pp.973–978, 2001.
[20] Y. Cui, M. L. Jia, T. Y. Lin, et al., “Class-balanced loss based on effective number of samples,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp.9260–9269, 2019.
[21] T. Y. Lin, P. Goyal, R. Girshick, et al., “Focal loss for dense object detection,” in Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, pp.2999–3007, 2017.
[22] S. Ben-David, D. Loker, N. Srebro, et al., “Minimizing the misclassification error rate using a surrogate convex loss,” in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, pp.83–90, 2012.
[23] Y. B. Zhou, C. M. Xiong, and R. Socher, “Improving end-to-end speech recognition with policy learning,” in Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, pp.5819–5823, 2018.
[24] X. J. Feng, H. X. Yao, and S. P. Zhang, “Focal CTC loss for Chinese optical character recognition on unbalanced datasets,” Complexity, vol.2019, article no.9345861, 2019. doi: 10.1155/2019/9345861
[25] B. Y. Li, Y. Liu, and X. G. Wang, “Gradient harmonized single-stage detector,” in Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, vol.33, pp.8577–8584, 2019.
[26] M. Toneva, A. Sordoni, R. T. des Combes, et al., “An empirical study of example forgetting during deep neural network learning,” in Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 2019.
[27] L. Jiang, Z. Y. Zhou, T. Leung, et al., “MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels,” in Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp.2304–2313, 2018.
[28] H. Bu, J. Y. Du, X. Y. Na, et al., “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in Proceedings of the 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, Seoul, Korea (South), pp.1–5, 2017.
[29] J. Y. Du, X. Y. Na, X. C. Liu, et al., “AISHELL-2: Transforming Mandarin ASR research into industrial scale,” arXiv preprint, arXiv: 1808.10583, 2018.