LI Xu, TU Ming, WANG Xiaofei, WU Chao, FU Qiang, YAN Yonghong. Single-Channel Speech Separation Based on Non-negative Matrix Factorization and Factorial Conditional Random Field[J]. Chinese Journal of Electronics, 2018, 27(5): 1063-1070. doi: 10.1049/cje.2018.06.016

Single-Channel Speech Separation Based on Non-negative Matrix Factorization and Factorial Conditional Random Field

doi: 10.1049/cje.2018.06.016
Funds:  This work is supported by the National Natural Science Foundation of China (No.11461141004, No.11590770, No.11590771, No.11590772, No.11590773, No.11590774), the Strategic Priority Research Program of the Chinese Academy of Sciences (No.XDA06030100, No.XDA06030500), the National High Technology Research and Development Program of China (863 Program) (No.2015AA016306), and the Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (No.201230118-3).
More Information
  • Corresponding author: FU Qiang (corresponding author) received the Ph.D. degree in electronic engineering from Xidian University in 2000. From 2001 to 2002, he worked as a senior research associate in the Center for Spoken Language Understanding (CSLU), OGI School of Science and Engineering at Oregon Health and Science University, Oregon, USA. From 2002 to 2004, he worked as a senior postdoctoral research fellow in the Department of Electronic and Computer Engineering, University of Limerick, Ireland. He is currently a professor at the Institute of Acoustics, Chinese Academy of Sciences, China. His research interests are in speech analysis, microphone array processing, far-distant speech recognition, audio-visual signal processing, and machine learning for signal processing. (Email: qfu@hccl.ioa.ac.cn)
  • Received Date: 2016-01-25
  • Rev Recd Date: 2016-10-18
  • Publish Date: 2018-09-10
  • A new Non-negative matrix factorization (NMF) based algorithm is proposed for single-channel speech separation with a priori known speakers, aiming to better model the spectral structure and temporal continuity of the speech signal. First, NMF and k-means clustering are employed to obtain, for each speaker, multiple small dictionaries together with a state sequence that describes the temporal dynamics between these dictionaries. Then, a Factorial conditional random field (FCRF) model is trained on the state sequences and dictionaries to jointly model the temporal continuity of the two speakers' mixed signal for separation. Experiments show that the proposed algorithm outperforms the baselines on all metrics: sparse NMF (+1.12dB SDR, +2.37dB SIR, +0.40dB SAR, +0.2 MOS), nonnegative factorial hidden Markov model (+2.04dB SDR, +4.26dB SIR, +0.62dB SAR, +1.0 MOS), and standard NMF (+2.8dB SDR, +5.08dB SIR, +1.06dB SAR, +1.2 MOS).
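The first stage described in the abstract (per-speaker NMF dictionaries plus a k-means-derived state sequence) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the paper's actual implementation: the function name, the dictionary-construction heuristic, and all parameter values (number of atoms, number of states) are assumptions for demonstration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

def train_speaker_model(spectrogram, n_atoms=20, n_states=4, seed=0):
    """Hypothetical sketch: factorize a speaker's magnitude spectrogram,
    then cluster frame activations into discrete states."""
    # NMF: V (freq x frames) ~ W H, where W holds spectral atoms
    # and H holds frame-wise activations.
    nmf = NMF(n_components=n_atoms, init="nndsvda",
              max_iter=400, random_state=seed)
    W = nmf.fit_transform(spectrogram)   # (n_freq, n_atoms)
    H = nmf.components_                  # (n_atoms, n_frames)

    # k-means on activation columns assigns each frame to one state,
    # yielding a state sequence that captures temporal dynamics.
    km = KMeans(n_clusters=n_states, n_init=10, random_state=seed)
    states = km.fit_predict(H.T)         # (n_frames,)

    # One small dictionary per state: atoms reweighted by the mean
    # activation of frames in that state (an assumed heuristic).
    dictionaries = [W * H[:, states == s].mean(axis=1)
                    for s in range(n_states)]
    return W, states, dictionaries

# Toy usage on a random nonnegative "spectrogram".
rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((64, 100)))
W, states, dicts = train_speaker_model(V)
```

The state sequences produced this way for each training speaker would then serve as the label chains on which the FCRF is trained.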