Volume 32, Issue 3
May 2023
Citation: LAN Xiaotian, HE Qianhua, YAN Haikang, et al., “A Novel Re-weighted CTC Loss for Data Imbalance in Speech Keyword Spotting,” Chinese Journal of Electronics, vol. 32, no. 3, pp. 465–473, 2023, doi: 10.23919/cje.2021.00.198

A Novel Re-weighted CTC Loss for Data Imbalance in Speech Keyword Spotting

doi: 10.23919/cje.2021.00.198
Funds:  This work was supported by the National Natural Science Foundation of China (61571192) and Guangdong Basic and Applied Basic Research Foundation (2021A1515011454)
More Information
  • Author Biographies:

    Xiaotian LAN received the B.S. degree in communication engineering from Northeastern University, Shenyang, in 2019. He is currently pursuing the M.S. degree at South China University of Technology. His research interests include speech keyword spotting and automatic speech recognition. (Email: xtianlan@163.com)

    Qianhua HE (corresponding author) received the B.S. degree in physics from Hunan Normal University in 1987, the M.S. degree in medical instrument engineering from Xi’an Jiaotong University in 1990, and the Ph.D. degree in communication engineering from South China University of Technology in 1993, where he is currently a Full Professor with the School of Electronic and Information Engineering. From 1994 to 2001, he was with the Department of Computer Science, City University of Hong Kong, on four occasions, as a Research Assistant, a Senior Research Assistant, and a Research Fellow. From 2007 to 2008, he was a Visiting Scholar with the University of Washington, Seattle, WA, USA. His research interests include spoken term detection, audio event detection, speech coding, multimedia retrieval, and digital audio forensics. He is a Senior Member of IEEE. (Email: eeqhhe@scut.edu.cn)

    Haikang YAN received the B.S. degree in communication engineering from Southwest Jiaotong University in 2018. He is currently pursuing the M.S. degree with the South China University of Technology (SCUT). His research interests include speech keyword spotting and audio signal processing. (Email: haikangyan@163.com)

    Yanxiong LI received the B.S. and M.S. degrees in electronic engineering from Hunan Normal University, Changsha, China, in 2003 and 2006, respectively, and the Ph.D. degree in electronic engineering from SCUT, Guangzhou, China, in 2009. From 2008 to 2009, he was a Research Associate with the City University of Hong Kong. From 2013 to 2014, he was a Researcher with the University of Sheffield, UK. From Jul. to Aug. 2016, he was a Visiting Scholar with the Institute for Infocomm Research, Singapore, and from Jul. to Oct. 2019, a Visiting Scholar with the Tampere University of Technology (TUT), Finland. He is now an Associate Professor with the School of Electronic and Information Engineering, SCUT. His research interests include audio signal processing and machine learning. (Email: eeyxli@scut.edu.cn)

  • Received Date: 2021-06-02
  • Accepted Date: 2022-02-03
  • Available Online: 2022-03-19
  • Publish Date: 2023-05-05
  • Abstract: Speech keyword spotting systems are a critical component of human-computer interfaces, and connectionist temporal classification (CTC) has proven to be an effective tool for the task. However, standard training of a speech keyword spotting model faces a data imbalance issue: positive samples are usually far fewer than negative samples, and the numerous easy-to-train negative examples overwhelm the training, yielding a degenerate model. To address this, this paper reshapes the standard CTC loss into a novel re-weighted CTC loss. It evaluates each sample’s importance by its number of detection errors during training and automatically down-weights the contribution of easy examples, the majority of which are negatives, so that training focuses on the samples that deserve it most. The proposed method alleviates the imbalance naturally and makes efficient use of all available data. In experiments on several keyword sets selected from AISHELL-1 and AISHELL-2, the proposed loss achieves 16%–38% relative reductions in false rejection rate over the standard CTC loss at 0.5 false alarms per keyword per hour.
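    The exact re-weighting formula is given in the full text; the sketch below only illustrates the idea summarized in the abstract, i.e., scaling each utterance’s CTC loss by how error-prone that sample has been during training. It is a minimal PyTorch sketch: nn.CTCLoss is the standard library loss, but the ReweightedCTCLoss wrapper, the error_counts input, and the exponential weighting curve are illustrative assumptions, not the authors’ formulation.

        # Hypothetical sketch of a per-sample re-weighted CTC loss.
        import torch
        import torch.nn as nn

        class ReweightedCTCLoss(nn.Module):
            def __init__(self, blank=0, gamma=1.0, floor=0.1):
                super().__init__()
                # reduction='none' keeps one loss value per utterance so
                # each sample can be weighted before averaging.
                self.ctc = nn.CTCLoss(blank=blank, reduction="none",
                                      zero_infinity=True)
                self.gamma = gamma  # how sharply easy samples are down-weighted
                self.floor = floor  # minimum weight so easy samples still contribute

            def forward(self, log_probs, targets, input_lengths,
                        target_lengths, error_counts):
                # log_probs: (T, N, C) log-softmax outputs, the layout
                # nn.CTCLoss expects; error_counts: (N,) running tally of
                # detection errors per sample (assumed to be tracked by the
                # training loop; not part of standard CTC training).
                per_sample = self.ctc(log_probs, targets,
                                      input_lengths, target_lengths)
                # Weighting curve (an assumption): many past errors push the
                # weight toward 1; error-free (easy) samples, mostly
                # negatives, are pulled down toward the floor.
                w = self.floor + (1.0 - self.floor) * (
                    1.0 - torch.exp(-self.gamma * error_counts.float()))
                return (w * per_sample).mean()

    In use, the training loop would run keyword detection on each sample periodically, update error_counts accordingly, and pass it to the loss; uniform weighting (a plain batch-averaged CTC loss) is recovered by setting floor = 1.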
  • References:
    [1] J. Gao, J. Shao, Q. W. Zhao, et al., “Efficient system combination for Chinese spoken term detection,” Chinese Journal of Electronics, vol. 19, no. 3, pp. 457–462, 2010.
    [2] E. F. Huang, H. C. Wang, and F. K. Soong, “A fast algorithm for large vocabulary keyword spotting application,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 3, pp. 449–452, 1994. doi: 10.1109/89.294361
    [3] S. Tabibian, “A voice command detection system for aerospace applications,” International Journal of Speech Technology, vol. 20, no. 4, pp. 1049–1061, 2017. doi: 10.1007/s10772-017-9467-4
    [4] J. T. Foote, S. J. Young, G. J. F. Jones, et al., “Unconstrained keyword spotting using phone lattices with application to spoken document retrieval,” Computer Speech & Language, vol. 11, no. 3, pp. 207–224, 1997. doi: 10.1006/csla.1997.0027
    [5] C. L. Zhu, Q. J. Kong, L. Zhou, et al., “Sensitive keyword spotting for voice alarm systems,” in Proceedings of 2013 IEEE International Conference on Service Operations and Logistics, and Informatics, Dongguan, China, pp. 350–353, 2013.
    [6] J. G. Wilpon, L. R. Rabiner, C. H. Lee, et al., “Automatic recognition of keywords in unconstrained speech using hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 11, pp. 1870–1878, 1990. doi: 10.1109/29.103088
    [7] M. C. Silaghi, “Spotting subsequences matching an HMM using the average observation probability criteria with application to keyword spotting,” in Proceedings of the 20th National Conference on Artificial Intelligence, Pittsburgh, PA, USA, pp. 1118–1123, 2005.
    [8] V. Frinken, A. Fischer, R. Manmatha, et al., “A novel word spotting method based on recurrent neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 211–224, 2012. doi: 10.1109/TPAMI.2011.113
    [9] G. G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, pp. 4087–4091, 2014.
    [10] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proceedings of the 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, pp. 1478–1482, 2015.
    [11] M. Sun, D. Snyder, Y. X. Gao, et al., “Compressed time delay neural network for small-footprint keyword spotting,” in Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp. 3607–3611, 2017.
    [12] A. Graves, S. Fernández, F. Gomez, et al., “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, pp. 369–376, 2006.
    [13] M. Wöllmer, B. Schuller, and G. Rigoll, “Keyword spotting exploiting long short-term memory,” Speech Communication, vol. 55, no. 2, pp. 252–265, 2013. doi: 10.1016/j.specom.2012.08.006
    [14] Y. Bai, J. Y. Yi, H. Ni, et al., “End-to-end keywords spotting based on connectionist temporal classification for Mandarin,” in Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, Tianjin, China, pp. 1–5, 2016.
    [15] H. K. Yan, Q. H. He, and W. Xie, “CRNN-CTC based Mandarin keywords spotting,” in Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 7489–7493, 2020.
    [16] B. Liu, S. Nie, Y. P. Zhang, et al., “Focal loss and double-edge-triggered detector for robust small-footprint keyword spotting,” in Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, pp. 6361–6365, 2019.
    [17] J. Y. Hou, Y. Y. Shi, M. Ostendorf, et al., “Mining effective negative training samples for keyword spotting,” in Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 7444–7448, 2020.
    [18] K. Zhang, Z. Y. Wu, D. D. Yuan, et al., “Re-weighted interval loss for handling data imbalance problem of end-to-end keyword spotting,” in Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China, pp. 2567–2571, 2020.
    [19] C. Elkan, “The foundations of cost-sensitive learning,” in Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA, USA, vol. 17, pp. 973–978, 2001.
    [20] Y. Cui, M. L. Jia, T. Y. Lin, et al., “Class-balanced loss based on effective number of samples,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 9260–9269, 2019.
    [21] T. Y. Lin, P. Goyal, R. Girshick, et al., “Focal loss for dense object detection,” in Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, pp. 2999–3007, 2017.
    [22] S. Ben-David, D. Loker, N. Srebro, et al., “Minimizing the misclassification error rate using a surrogate convex loss,” in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, pp. 83–90, 2012.
    [23] Y. B. Zhou, C. M. Xiong, and R. Socher, “Improving end-to-end speech recognition with policy learning,” in Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, pp. 5819–5823, 2018.
    [24] X. J. Feng, H. X. Yao, and S. P. Zhang, “Focal CTC loss for Chinese optical character recognition on unbalanced datasets,” Complexity, vol. 2019, article no. 9345861, 2019. doi: 10.1155/2019/9345861
    [25] B. Y. Li, Y. Liu, and X. G. Wang, “Gradient harmonized single-stage detector,” in Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, vol. 33, pp. 8577–8584, 2019.
    [26] M. Toneva, A. Sordoni, R. T. des Combes, et al., “An empirical study of example forgetting during deep neural network learning,” in Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 2019.
    [27] L. Jiang, Z. Y. Zhou, T. Leung, et al., “MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels,” in Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 2304–2313, 2018.
    [28] H. Bu, J. Y. Du, X. Y. Na, et al., “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in Proceedings of the 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, Seoul, Korea (South), pp. 1–5, 2017.
    [29] J. Y. Du, X. Y. Na, X. C. Liu, et al., “AISHELL-2: Transforming Mandarin ASR research into industrial scale,” arXiv preprint, arXiv: 1808.10583, 2018.