Citation: | HUANG Ping and WU Yafeng, “Teacher-Student Training Approach Using an Adaptive Gain Mask for LSTM-Based Speech Enhancement in the Airborne Noise Environment,” Chinese Journal of Electronics, vol. 32, no. 4, pp. 882-895, 2023, doi: 10.23919/cje.2022.00.307 |
[1] |
X. B. Cao, P. Yang, M. Alzenad, et al., “Airborne communication networks: A survey,” IEEE Journal on Selected Areas in Communications, vol.36, no.9, pp.1907–1926, 2018. doi: 10.1109/JSAC.2018.2864423
|
[2] |
S. F. Ou, P. Song, and Y. Gao, “Soft decision based Gaussian-Laplacian combination model for noisy speech enhancement,” Chinese Journal of Electronics, vol.27, no.4, pp.827–834, 2018. doi: 10.1049/cje.2018.05.015
|
[3] |
S. F. Ou, P. Song, and Y. Gao, “Laplacian speech model and soft decision based MMSE estimator for noise power spectral density in speech enhancement,” Chinese Journal of Electronics, vol.27, no.6, pp.1214–1220, 2018. doi: 10.1049/cje.2018.09.009
|
[4] |
W. Jiang, P. Liu, and F. Wen, “Speech magnitude spectrum reconstruction from MFCCs using deep neural network,” Chinese Journal of Electronics, vol.27, no.2, pp.393–398, 2018. doi: 10.1049/cje.2017.09.018
|
[5] |
T. Wang, H. Guo, B. Lyu, et al., “Speech signal processing on graphs: Graph topology, graph frequency analysis and denoising,” Chinese Journal of Electronics, vol.29, no.5, pp.926–936, 2020. doi: 10.1049/cje.2020.08.008
|
[6] |
X. Wang, Y. Guo, Q. Fu, et al., “Speech enhancement using multi-channel post-filtering with modified signal presence probability in reverberant environment,” Chinese Journal of Electronics, vol.25, no.3, pp.512–519, 2016. doi: 10.1049/cje.2016.05.017
|
[7] |
S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.27, no.2, pp.113–120, 1979. doi: 10.1109/TASSP.1979.1163209
|
[8] |
B. Picinbono and M. Bouvet, “Constrained Wiener filtering (Corresp.),” IEEE Transactions on Information Theory, vol.33, no.1, pp.160–166, 1987. doi: 10.1109/TIT.1987.1057267
|
[9] |
T. V. Sreenivas and P. Kirnapure, “Codebook constrained Wiener filtering for speech enhancement,” IEEE Transactions on Speech and Audio Processing, vol.4, no.5, pp.383–389, 1996. doi: 10.1109/89.536932
|
[10] |
P. Scalart and J. V. Filho, “Speech enhancement based on a priori signal to noise estimation,” in Proceedings of 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA, USA, pp.629–632, 1996.
|
[11] |
Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.32, no.6, pp.1109–1121, 1984. doi: 10.1109/TASSP.1984.1164453
|
[12] |
Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.33, no.2, pp.443–445, 1985. doi: 10.1109/TASSP.1985.1164550
|
[13] |
G. Enzner and P. Thüne, “Robust MMSE filtering for single-microphone speech enhancement,” in Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, pp.4009–4013, 2017.
|
[14] |
R. Martin, “Speech enhancement using MMSE short time spectral estimation with gamma distributed speech priors,” in Proceedings of 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, USA, pp.253–256, 2002.
|
[15] |
R. W. Li, X. Y. Sun, T. Li, et al., “A multi-objective learning speech enhancement algorithm based on IRM post-processing with joint estimation of SCNN and TCNN,” Digital Signal Processing, vol.101, article no.102731, 2020. doi: 10.1016/j.dsp.2020.102731
|
[16] |
N. Saleem and M. I. Khattak, “Multi-scale decomposition based supervised single channel deep speech enhancement,” Applied Soft Computing, vol.95, article no.106666, 2020. doi: 10.1016/j.asoc.2020.106666
|
[17] |
S. Routray and Q. R. Mao, “Phase sensitive masking-based single channel speech enhancement using conditional generative adversarial network,” Computer Speech & Language, vol.71, article no.101270, 2022. doi: 10.1016/j.csl.2021.101270
|
[18] |
Z. T. Wang, X. F. Wang, X. Li, et al., “Oracle performance investigation of the ideal masks,” in Proceedings of 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Xi’an, China, pp.1–5, 2016.
|
[19] |
Q. Wang, J. Du, L. R. Dai, et al., “A multiobjective learning and ensembling approach to high-performance speech enhancement with compact neural network architectures,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.26, no.7, pp.1185–1197, 2018. doi: 10.1109/TASLP.2018.2817798
|
[20] |
X. Y. Wang, F. Bao, and C. C. Bao, “IRM estimation based on data field of cochleagram for speech enhancement,” Speech Communication, vol.97, pp.19–31, 2018. doi: 10.1016/j.specom.2017.12.014
|
[21] |
H. J. Yu, W. P. Zhu, and B. Champagne, “Speech enhancement using a DNN-augmented colored-noise Kalman filter,” Speech Communication, vol.125, pp.142–151, 2020. doi: 10.1016/j.specom.2020.10.007
|
[22] |
W. L. Zhou and Z. Zhu, “A novel BNMF-DNN based speech reconstruction method for speech quality evaluation under complex environments,” International Journal of Machine Learning and Cybernetics, vol.12, no.4, pp.959–972, 2021. doi: 10.1007/s13042-020-01214-3
|
[23] |
G. W. Lee and H. K. Kim, “Multi-task learning U-Net for single-channel speech enhancement and mask-based voice activity detection,” Applied Sciences, vol.10, no.9, article no.3230, 2020. doi: 10.3390/app10093230
|
[24] |
N. Saleem, M. I. Khattak, M. Al-Hasan, et al., “On learning spectral masking for single channel speech enhancement using feedforward and recurrent neural networks,” IEEE Access, vol.8, pp.160581–160595, 2020. doi: 10.1109/ACCESS.2020.3021061
|
[25] |
D. Servan-Schreiber, A. Cleeremans, and J. L. McClelland, Encoding Sequential Structure in Simple Recurrent Networks. Pittsburgh: Carnegie Mellon University, 1988.
|
[26] |
L. Zhang, M. J. Wang, Q. Q. Zhang, et al., “PhaseDCN: A phase-enhanced dual-path dilated convolutional network for single-channel speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.29, pp.2561–2574, 2021. doi: 10.1109/TASLP.2021.3092585
|
[27] |
Z. Y. Wang, T. Zhang, Y. Y. Shao, et al., “LSTM-convolutional-BLSTM encoder-decoder network for minimum mean-square error approach to speech enhancement,” Applied Acoustics, vol.172, article no.107647, 2021. doi: 10.1016/j.apacoust.2020.107647
|
[28] |
Y. H. Tu, J. Du, and C. H. Lee, “Speech enhancement based on teacher-student deep learning using improved speech presence probability for noise-robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.27, no.12, pp.2080–2091, 2019. doi: 10.1109/TASLP.2019.2940662
|
[29] |
I. Cohen, “Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging,” IEEE Transactions on Speech and Audio Processing, vol.11, no.5, pp.466–475, 2003. doi: 10.1109/TSA.2003.811544
|
[30] |
H. Dinkel, S. Wang, X. N. Xu, et al., “Voice activity detection in the wild: A data-driven approach using teacher-student training,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.29, pp.1542–1555, 2021. doi: 10.1109/TASLP.2021.3073596
|
[31] |
L. Sun, J. Du, L. R. Dai, et al., “Multiple-target deep learning for LSTM-RNN based speech enhancement,” in Proceedings of 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA), San Francisco, CA, USA, pp.136–140, 2017.
|
[32] |
T. Gerkmann and R. C. Hendriks, “Unbiased MMSE-based noise power estimation with low complexity and low tracking delay,” IEEE Transactions on Audio, Speech, and Language Processing, vol.20, no.4, pp.1383–1393, 2012. doi: 10.1109/TASL.2011.2180896
|
[33] |
D. Pearce and H. G. Hirsch, “The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions,” in Proceedings of Sixth International Conference on Spoken Language Processing, Beijing, China, pp.29–32, 2000.
|
[34] |
J. S. Garofolo, L. F. Lamel, W. M. Fisher, et al., “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM,” NIST Interagency/Internal Report (NISTIR), Report No.4930, 1993.
|
[35] |
A. Varga and H. J. M. Steeneken, “Assessment for automatic speech recognition: Ⅱ. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Communication, vol.12, no.3, pp.247–251, 1993. doi: 10.1016/0167-6393(93)90095-3
|
[36] |
G. N. Hu and D. L. Wang, “A tandem algorithm for pitch estimation and voiced speech segregation,” IEEE Transactions on Audio, Speech, and Language Processing, vol.18, no.8, pp.2067–2079, 2010. doi: 10.1109/TASL.2010.2041110
|
[37] |
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 2015.
|
[38] |
A. W. Rix, J. G. Beerends, M. P. Hollier, et al., “Perceptual evaluation of speech quality (PESQ) - A new method for speech quality assessment of telephone networks and codecs,” in Proceedings of 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, USA, pp.749–752, 2001.
|
[39] |
J. Beh, R. H. Baran, and H. Ko, “Dual channel based speech enhancement using novelty filter for robust speech recognition in automobile environment,” IEEE Transactions on Consumer Electronics, vol.52, no.2, pp.583–589, 2006. doi: 10.1109/TCE.2006.1649683
|
[40] |
A. Gray and J. Markel, “Distance measures for speech processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.24, no.5, pp.380–391, 1976. doi: 10.1109/TASSP.1976.1162849
|