Citation: SUN Linhui, LIANG Wenqing, ZHANG Meng, et al., “Monaural speech separation using dual-output deep neural network with multiple joint constraint,” Chinese Journal of Electronics, vol.32, no.3, pp.493–506, 2023. doi: 10.23919/cje.2022.00.110
[1] J. Du, Y. H. Tu, Y. Xu, et al., “Speech separation of a target speaker based on deep neural networks,” in Proceedings of the 2014 12th International Conference on Signal Processing, Hangzhou, China, pp.473–477, 2014.
[2] A. Belouchrani, K. Abed-Meraim, J. F. Cardoso, et al., “A blind source separation technique using second-order statistics,” IEEE Transactions on Signal Processing, vol.45, no.2, pp.434–444, 1997. doi: 10.1109/78.554307
[3] M. Yu, A. Rhuma, S. M. Naqvi, et al., “A posture recognition-based fall detection system for monitoring an elderly person in a smart home environment,” IEEE Transactions on Information Technology in Biomedicine, vol.16, no.6, pp.1274–1286, 2012. doi: 10.1109/TITB.2012.2214786
[4] J. Du, Y. H. Tu, L. R. Dai, et al., “A regression approach to single-channel speech separation via high-resolution deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.24, no.8, pp.1424–1437, 2016. doi: 10.1109/TASLP.2016.2558822
[5] Y. Sun, W. W. Wang, J. Chambers, et al., “Two-stage monaural source separation in reverberant room environments using deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.27, no.1, pp.125–139, 2019. doi: 10.1109/TASLP.2018.2874708
[6] D. L. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines, P. Divenyi, Ed., Springer, Boston, MA, USA, pp.181–197, 2005.
[7] G. N. Hu and D. L. Wang, “A tandem algorithm for pitch estimation and voiced speech segregation,” IEEE Transactions on Audio, Speech, and Language Processing, vol.18, no.8, pp.2067–2079, 2010. doi: 10.1109/TASL.2010.2041110
[8] A. Hyvärinen and E. Oja, “Independent component analysis: Algorithms and applications,” Neural Networks, vol.13, no.4–5, pp.411–430, 2000. doi: 10.1016/S0893-6080(00)00026-5
[9] X. L. Zhang and D. L. Wang, “A deep ensemble learning method for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.24, no.5, pp.967–977, 2016. doi: 10.1109/TASLP.2016.2536478
[10] Y. X. Wang, A. Narayanan, and D. L. Wang, “On training targets for supervised speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, no.12, pp.1849–1858, 2014. doi: 10.1109/TASLP.2014.2352935
[11] H. Erdogan, J. R. Hershey, S. Watanabe, et al., “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, South Brisbane, QLD, Australia, pp.708–712, 2015.
[12] D. S. Williamson, Y. X. Wang, and D. L. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.24, no.3, pp.483–492, 2016. doi: 10.1109/TASLP.2015.2512042
[13] G. Kim, Y. Lu, and P. C. Loizou, “An algorithm that improves speech intelligibility in noise for normal-hearing listeners,” The Journal of the Acoustical Society of America, vol.126, no.3, pp.1486–1494, 2009. doi: 10.1121/1.3184603
[14] T. May and T. Dau, “Computational speech segregation based on an auditory-inspired modulation analysis,” The Journal of the Acoustical Society of America, vol.136, no.6, pp.3350–3359, 2014. doi: 10.1121/1.4901711
[15] P. S. Huang, M. Kim, M. Hasegawa-Johnson, et al., “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, no.12, pp.2136–2147, 2015. doi: 10.1109/TASLP.2015.2468583
[16] Y. N. Wang, J. Du, L. R. Dai, et al., “A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, no.7, pp.1535–1546, 2017. doi: 10.1109/TASLP.2017.2700540
[17] M. S. Chauhan, R. Mishra, M. I. Patel, et al., “Speech recognition and separation system using deep learning,” in Proceedings of 2021 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems, Chennai, India, pp.1–5, 2021.
[18] S. Wan, “Research on speech separation and recognition algorithm based on deep learning,” in Proceedings of 2021 IEEE International Conference on Power, Intelligent Computing and Systems, Shenyang, China, pp.722–725, 2021.
[19] C. H. Fan, J. H. Tao, B. Liu, et al., “End-to-end post-filter for speech separation with deep attention fusion features,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.28, pp.1303–1314, 2020. doi: 10.1109/TASLP.2020.2982029
[20] N. J. Zheng and X. L. Zhang, “Phase-aware speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.27, no.1, pp.63–76, 2019. doi: 10.1109/TASLP.2018.2870742
[21] T. G. Kang, J. W. Shin, and N. S. Kim, “DNN-based monaural speech enhancement with temporal and spectral variations equalization,” Digital Signal Processing, vol.74, pp.102–110, 2018. doi: 10.1016/j.dsp.2017.12.002
[22] G. Naithani, J. Nikunen, L. Bramsløw, et al., “Deep neural network based speech separation optimizing an objective estimator of intelligibility for low latency applications,” in Proceedings of the 2018 16th International Workshop on Acoustic Signal Enhancement, Tokyo, Japan, pp.386–390, 2018.
[23] L. H. Sun, G. Zhu, and P. A. Li, “Joint constraint algorithm based on deep neural network with dual outputs for single-channel speech separation,” Signal, Image and Video Processing, vol.14, no.7, pp.1387–1395, 2020. doi: 10.1007/s11760-020-01676-6
[24] M. Delfarah and D. L. Wang, “Features for masking-based monaural speech separation in reverberant conditions,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, no.5, pp.1085–1094, 2017. doi: 10.1109/TASLP.2017.2687829
[25] M. Cooke, J. Barker, S. Cunningham, et al., “An audio-visual corpus for speech perception and automatic speech recognition,” The Journal of the Acoustical Society of America, vol.120, no.5, pp.2421–2424, 2006. doi: 10.1121/1.2229005
[26] A. W. Rix, J. G. Beerends, M. P. Hollier, et al., “Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs,” in Proceedings of 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, USA, pp.749–752, 2001.
[27] C. H. Taal, R. C. Hendriks, R. Heusdens, et al., “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol.19, no.7, pp.2125–2136, 2011. doi: 10.1109/TASL.2011.2114881
[28] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol.14, no.4, pp.1462–1469, 2006. doi: 10.1109/TSA.2005.858005