Volume 32, Issue 3
May 2023
ZHAO Huijuan, YE Ning, WANG Ruchuan, “Improved Cross-Corpus Speech Emotion Recognition Using Deep Local Domain Adaptation,” Chinese Journal of Electronics, vol. 32, no. 3, pp. 640-646, 2023, doi: 10.23919/cje.2021.00.196
Improved Cross-Corpus Speech Emotion Recognition Using Deep Local Domain Adaptation

doi: 10.23919/cje.2021.00.196
Funds:  This work was supported by the National Natural Science Foundation of China (61572260), Postgraduate Research & Practice Innovation Program of Jiangsu Province (46035CX17789), and Research Project of Nanjing Vocational University of Industry Technology (YK20-05-08)
  • Author Bio:

    Huijuan ZHAO received the M.S. degree in computer software and theory from Nanjing University of Posts and Telecommunications. She is a teacher of Nanjing Vocational University of Industry Technology. She is currently pursuing the Ph.D. degree with the Nanjing University of Posts and Telecommunications. Her research interests include affective computing and deep learning. (Email: zhaohj86@126.com)

    Ning YE received the B.S. degree in computer science from Nanjing University in 1994, the M.S. degree from the School of Computer Science and Engineering, Southeast University, in 2004, and the Ph.D. degree from the Institute of Computer Science, Nanjing University of Posts and Telecommunications, in 2009, where she is currently a Professor. In 2010, she was a Visiting Scholar and Research Assistant with the Department of Computer Science, University of Victoria, Canada. Her research interests include wireless networks and Internet of Things. (Email: yening@njupt.edu.cn)

    Ruchuan WANG (corresponding author) conducted research on graphics processing at the University of Bremen and on program design theory at Ludwig-Maximilians-Universität München from 1984 to 1992. He has been a Professor and Supervisor of Ph.D. candidates with the Nanjing University of Posts and Telecommunications since 1992. His major research interests include wireless sensor networks and information security. (Email: wangrc@njupt.edu.cn)

  • Received Date: 2021-05-28
  • Accepted Date: 2021-09-28
  • Available Online: 2022-03-03
  • Publish Date: 2023-05-05
  • Abstract: Due to the scarcity of high-quality labeled speech emotion data, it is natural to apply transfer learning to speech emotion recognition. However, transfer learning-based speech emotion recognition remains challenging because of the complexity and ambiguity of emotion. Domain adaptation based on maximum mean discrepancy aligns the marginal distributions of the source and target domains, but ignores the class prior distributions in the two domains, which reduces transfer efficiency. To address this problem, this study proposes a novel cross-corpus speech emotion recognition framework based on local domain adaptation, in which a category-grained discrepancy is used to measure the distance between the two related domains. Experimental results show that the local adaptation method enhances the generalization ability of the model and significantly improves cross-corpus speech emotion recognition compared with global adaptive and non-adaptive methods.
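The category-grained discrepancy described in the abstract can be sketched as a local (per-class) maximum mean discrepancy, in the spirit of weighted and subdomain MMD [23], [24]: instead of one global MMD between source and target features, the distance is computed class by class, weighting source samples by their labels and target samples by the classifier's soft predictions. The function names, the Gaussian-kernel choice, and the uniform averaging over classes below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise RBF kernel matrix between the rows of x and y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def local_mmd(xs, ys, xt, pt, num_classes, sigma=1.0):
    """Category-grained MMD (illustrative sketch).

    xs, xt : source/target feature matrices, shape (n, d)
    ys     : source hard labels, shape (n_s,)
    pt     : target soft class probabilities (e.g. softmax outputs),
             shape (n_t, num_classes) -- a stand-in for pseudo-labels
    """
    # Per-class sample weights, normalized so each class sums to 1.
    ws = np.eye(num_classes)[ys]                        # (n_s, C) one-hot
    ws = ws / np.maximum(ws.sum(0, keepdims=True), 1e-8)
    wt = pt / np.maximum(pt.sum(0, keepdims=True), 1e-8)

    Kss = gaussian_kernel(xs, xs, sigma)
    Ktt = gaussian_kernel(xt, xt, sigma)
    Kst = gaussian_kernel(xs, xt, sigma)

    # Squared MMD of each class-conditional pair, averaged over classes.
    loss = 0.0
    for c in range(num_classes):
        a, b = ws[:, c], wt[:, c]
        loss += a @ Kss @ a + b @ Ktt @ b - 2 * a @ Kst @ b
    return loss / num_classes
```

When the two domains coincide and the target probabilities match the source labels, the loss is zero; shifting the target features drives it up, which is what makes it usable as an auxiliary training objective alongside the emotion classification loss.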
  • [1]
    M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol.44, no.3, pp.572–587, 2011. doi: 10.1016/j.patcog.2010.09.020
    [2]
    B. W. Schuller, “Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends,” Communications of the ACM, vol.61, no.5, pp.90–99, 2018. doi: 10.1145/3129340
    [3]
    M. S. Fahad, A. Ranjan, J. Yadav, et al., “A survey of speech emotion recognition in natural environment,” Digital Signal Processing, vol.110, article no.102951, 2021. doi: 10.1016/j.dsp.2020.102951
    [4]
    K. X. Feng and T. Chaspari, “A review of generalizable transfer learning in automatic emotion recognition,” Frontiers in Computer Science, vol.2, article no.9, 2020. doi: 10.3389/fcomp.2020.00009
    [5]
    H. J. Zhao, N. Ye, and R. C. Wang, “Speech emotion recognition based on hierarchical attributes using feature nets,” International Journal of Parallel, Emergent and Distributed Systems, vol.35, no.3, pp.354–364, 2020. doi: 10.1080/17445760.2019.1626854
    [6]
    B. Zhang, E. M. Provost, and G. Essl. “Cross-corpus acoustic emotion recognition from singing and speaking: A multi-task learning approach,” in Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, pp.5805−5809, 2016.
    [7]
    S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning,” in Proceedings of Interspeech 2017, Stockholm, Sweden, pp.1103−1107, 2017.
    [8]
    S. H. Liu, M. Y. Zhang, M. Fang, et al., “Speech emotion recognition based on transfer learning from the FaceNet framework,” The Journal of the Acoustical Society of America, vol.149, no.2, pp.1338–1345, 2021. doi: 10.1121/10.0003530
    [9]
    M. Abdelwahab and C. Busso, “Supervised domain adaptation for emotion recognition from speech,” in Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, pp.5058−5062, 2015.
    [10]
    J. Deng, Z. X. Zhang, F. Eyben, et al., “Autoencoder-based unsupervised domain adaptation for speech emotion recognition,” IEEE Signal Processing Letters, vol.21, no.9, pp.1068–1072, 2014. doi: 10.1109/LSP.2014.2324759
    [11]
    S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol.22, no.10, pp.1345–1359, 2010. doi: 10.1109/TKDE.2009.191
    [12]
    Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp.1180−1189, 2015.
    [13]
    P. Viola and W. M. Wells III, “Alignment by maximization of mutual information,” International Journal of Computer Vision, vol.24, no.2, pp.137–154, 1997. doi: 10.1023/A:1007958904918
    [14]
    T. Van Erven and P. Harremos, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol.60, no.7, pp.3797–3820, 2014. doi: 10.1109/TIT.2014.2320500
    [15]
    K. Saito, K. Watanabe, Y. Ushiku, et al., “Maximum classifier discrepancy for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp.3723−3732, 2018.
    [16]
    W. W. Lin, M. W. Mak, and J. T. Chien, “Multisource I-vectors domain adaptation using maximum mean discrepancy based autoencoders,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.26, no.12, pp.2412–2422, 2018. doi: 10.1109/TASLP.2018.2866707
    [17]
    E. Tzeng, J. Hoffman, N. Zhang, et al., “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint, arXiv: 1412.3474, 2014.
    [18]
    M. S. Long, Y. Cao, J. M. Wang, et al., “Learning transferable features with deep adaptation networks,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp.97–105, 2015.
    [19]
    A. Gretton, B. Sriperumbudur, D. Sejdinovic, et al., “Optimal kernel choice for large-scale two-sample tests,” in Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, pp.1205–1213, 2012.
    [20]
    M. S. Long, H. Zhu, J. M. Wang, et al., “Deep transfer learning with joint adaptation networks,” in Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, pp.2208–2217, 2017.
    [21]
    P. Song, W. M. Zheng, S. F. Ou, et al., “Cross-corpus speech emotion recognition based on transfer non-negative matrix factorization,” Speech Communication, vol.83, pp.34–41, 2016. doi: 10.1016/j.specom.2016.07.010
    [22]
    J. T. Liu, W. M. Zheng, Y. Zong, et al., “Cross-corpus speech emotion recognition based on deep domain-adaptive convolutional neural network,” IEICE Transactions on Information and Systems, vol.E103.D, no.2, pp.459–463, 2020. doi: 10.1587/transinf.2019EDL8136
    [23]
    H. L. Yan, Y. K. Ding, P. H. Li, et al., “Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation,” in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp.945−954, 2017.
    [24]
    Y. C. Zhu, F. Z. Zhuang, J. D. Wang, et al., “Deep subdomain adaptation network for image classification,” IEEE Transactions on Neural Networks and Learning Systems, vol.32, no.4, pp.1713–1722, 2021. doi: 10.1109/TNNLS.2020.2988928
    [25]
    C. Busso, M. Bulut, C. C. Lee, et al., “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol.42, no.4, pp.335–359, 2008. doi: 10.1007/s10579-008-9076-6
    [26]
    F. Burkhardt, A. Paeschke, M. Rolfes, et al., “A database of German emotional speech,” in Proceedings of the 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, pp.1517–1520, 2005.
    [27]
    M. Y. Chen, X. J. He, J. Yang, et al., “3-D convolutional recurrent neural networks with attention model for speech emotion recognition,” IEEE Signal Processing Letters, vol.25, no.10, pp.1440–1444, 2018. doi: 10.1109/LSP.2018.2860246
    [28]
    L. Lee and R. Rose, “A frequency warping approach to speaker normalization,” IEEE Transactions on Speech and Audio Processing, vol.6, no.1, pp.49–60, 1998. doi: 10.1109/89.650310
    [29]
    N. Jaitly and G. E. Hinton, “Vocal tract length perturbation (VTLP) improves speech recognition,” in Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 2013.


    Figures(1)  / Tables(3)
