Teacher-Student Training Approach Using an Adaptive Gain Mask for LSTM-Based Speech Enhancement in the Airborne Noise Environment

HUANG Ping; WU Yafeng

doi:10.23919/cje.2022.00.307

Volume 32 Issue 4

Jul. 2023


Citation: HUANG Ping and WU Yafeng, “Teacher-Student Training Approach Using an Adaptive Gain Mask for LSTM-Based Speech Enhancement in the Airborne Noise Environment,” Chinese Journal of Electronics, vol. 32, no. 4, pp. 882-895, 2023, doi: 10.23919/cje.2022.00.307


1. School of Power and Energy, Northwestern Polytechnical University, Xi’an 710072, China

Funds: This work was supported by the Fundamental Research Funds for the Central Universities (D5000210974).

Author Bio:
Ping HUANG received the M.E. degree from Northwestern Polytechnical University in 2014. She is pursuing the Ph.D. degree at the School of Power and Energy, Northwestern Polytechnical University, Xi’an, China. Her research interests include speech enhancement and signal processing. (Email: hp0409@mail.nwpu.edu.cn)

Yafeng WU (corresponding author) received the Ph.D. degree in electronic engineering from Northwestern Polytechnical University, Xi’an, China. He is a Professor and a Doctoral Supervisor at Northwestern Polytechnical University. His research interests include speech signal processing and vibration noise control. (Email: yfwu@nwpu.edu.cn)
Received Date: 2022-09-13
Accepted Date: 2022-11-30

Available Online: 2022-12-28

Publish Date: 2023-07-05

Abstract

Research on speech enhancement algorithms for the airborne environment is of great significance to the security of airborne systems. Recently, the focus of speech enhancement research has shifted from conventional unsupervised algorithms, such as the log minimum mean square error estimator (log-MMSE), to state-of-the-art masking-based long short-term memory (LSTM) methods. However, each approach has its own characteristics and limitations, so neither handles noise well in every condition. In addition, the clean speech and noise data needed to train a supervised speech enhancement model are difficult to obtain in the real-world airborne environment. Therefore, to exploit the advantages of both approaches without this data requirement, we propose a novel adaptive gain mask (AGM) based teacher-student training approach for speech enhancement. In our method, the AGM, which serves as a robust learning target for the student model, is devised by incorporating the ideal ratio mask estimated by the teacher model into the log-MMSE procedure. To strike an appropriate tradeoff between the two methods, we adaptively update the AGM using a recursive weighting coefficient. Experiments on real airborne data show that the proposed AGM-based method outperforms the other baselines on all essential objective metrics evaluated in this paper.
Keywords:
- Adaptive ideal mask
- Teacher-student learning
- Long short-term memory (LSTM)
- Speech enhancement
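
The abstract does not give the AGM formula, so the following is only a rough sketch of the general idea of trading off a teacher-estimated ratio mask against a classical gain with a recursively updated weight. Everything in it is an assumption made for illustration: the function names `ideal_ratio_mask` and `blend_gain`, the smoothing factor `beta`, the agreement-based confidence, and the convex blend are hypothetical and are not taken from the paper. The log-MMSE gain is treated as a precomputed input rather than derived here.

```python
import numpy as np


def ideal_ratio_mask(speech_power, noise_power, eps=1e-12):
    """One common IRM definition: sqrt(S / (S + N)) per time-frequency bin."""
    speech_power = np.asarray(speech_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    return np.sqrt(speech_power / (speech_power + noise_power + eps))


def blend_gain(teacher_mask, logmmse_gain, beta=0.9):
    """Hypothetical adaptive blend of two gains (NOT the paper's AGM formula).

    Both inputs have shape (frames, bins) and are clipped to [0, 1].
    The weight alpha is smoothed recursively from frame to frame.
    """
    teacher_mask = np.clip(np.asarray(teacher_mask, dtype=float), 0.0, 1.0)
    logmmse_gain = np.clip(np.asarray(logmmse_gain, dtype=float), 0.0, 1.0)
    blended = np.empty_like(teacher_mask)
    alpha = 0.5  # assumed neutral starting weight
    for t in range(teacher_mask.shape[0]):
        # Assumed confidence measure: how closely the two estimates agree
        # in this frame (1 = identical, 0 = maximally different).
        agreement = 1.0 - np.mean(np.abs(teacher_mask[t] - logmmse_gain[t]))
        # First-order recursive update of the weighting coefficient.
        alpha = beta * alpha + (1.0 - beta) * agreement
        # Convex combination of the teacher mask and the classical gain.
        blended[t] = alpha * teacher_mask[t] + (1.0 - alpha) * logmmse_gain[t]
    return blended
```

In a teacher-student setup of the kind described above, an array like the one returned by `blend_gain` would play the role of the learning target for the student model, computed from noisy recordings alone. The actual weighting rule and the way the teacher's mask enters the log-MMSE procedure should be taken from the full paper.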
