Citation: ZHAO Huijuan, YE Ning, WANG Ruchuan, “Improved Cross-Corpus Speech Emotion Recognition Using Deep Local Domain Adaptation,” Chinese Journal of Electronics, vol. 32, no. 3, pp. 640–646, 2023, doi: 10.23919/cje.2021.00.196
[1] M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol.44, no.3, pp.572–587, 2011. doi: 10.1016/j.patcog.2010.09.020
[2] B. W. Schuller, “Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends,” Communications of the ACM, vol.61, no.5, pp.90–99, 2018. doi: 10.1145/3129340
[3] M. S. Fahad, A. Ranjan, J. Yadav, et al., “A survey of speech emotion recognition in natural environment,” Digital Signal Processing, vol.110, article no.102951, 2021. doi: 10.1016/j.dsp.2020.102951
[4] K. X. Feng and T. Chaspari, “A review of generalizable transfer learning in automatic emotion recognition,” Frontiers in Computer Science, vol.2, article no.9, 2020. doi: 10.3389/fcomp.2020.00009
[5] H. J. Zhao, N. Ye, and R. C. Wang, “Speech emotion recognition based on hierarchical attributes using feature nets,” International Journal of Parallel, Emergent and Distributed Systems, vol.35, no.3, pp.354–364, 2020. doi: 10.1080/17445760.2019.1626854
[6] B. Zhang, E. M. Provost, and G. Essl, “Cross-corpus acoustic emotion recognition from singing and speaking: A multi-task learning approach,” in Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, pp.5805–5809, 2016.
[7] S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning,” in Proceedings of Interspeech 2017, Stockholm, Sweden, pp.1103–1107, 2017.
[8] S. H. Liu, M. Y. Zhang, M. Fang, et al., “Speech emotion recognition based on transfer learning from the FaceNet framework,” The Journal of the Acoustical Society of America, vol.149, no.2, pp.1338–1345, 2021. doi: 10.1121/10.0003530
[9] M. Abdelwahab and C. Busso, “Supervised domain adaptation for emotion recognition from speech,” in Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, pp.5058–5062, 2015.
[10] J. Deng, Z. X. Zhang, F. Eyben, et al., “Autoencoder-based unsupervised domain adaptation for speech emotion recognition,” IEEE Signal Processing Letters, vol.21, no.9, pp.1068–1072, 2014. doi: 10.1109/LSP.2014.2324759
[11] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol.22, no.10, pp.1345–1359, 2010. doi: 10.1109/TKDE.2009.191
[12] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp.1180–1189, 2015.
[13] P. Viola and W. M. Wells III, “Alignment by maximization of mutual information,” International Journal of Computer Vision, vol.24, no.2, pp.137–154, 1997. doi: 10.1023/A:1007958904918
[14] T. van Erven and P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol.60, no.7, pp.3797–3820, 2014. doi: 10.1109/TIT.2014.2320500
[15] K. Saito, K. Watanabe, Y. Ushiku, et al., “Maximum classifier discrepancy for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp.3723–3732, 2018.
[16] W. W. Lin, M. W. Mak, and J. T. Chien, “Multisource I-vectors domain adaptation using maximum mean discrepancy based autoencoders,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.26, no.12, pp.2412–2422, 2018. doi: 10.1109/TASLP.2018.2866707
[17] E. Tzeng, J. Hoffman, N. Zhang, et al., “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint, arXiv: 1412.3474, 2014.
[18] M. S. Long, Y. Cao, J. M. Wang, et al., “Learning transferable features with deep adaptation networks,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp.97–105, 2015.
[19] A. Gretton, B. Sriperumbudur, D. Sejdinovic, et al., “Optimal kernel choice for large-scale two-sample tests,” in Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, pp.1205–1213, 2012.
[20] M. S. Long, H. Zhu, J. M. Wang, et al., “Deep transfer learning with joint adaptation networks,” in Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, pp.2208–2217, 2017.
[21] P. Song, W. M. Zheng, S. F. Ou, et al., “Cross-corpus speech emotion recognition based on transfer non-negative matrix factorization,” Speech Communication, vol.83, pp.34–41, 2016. doi: 10.1016/j.specom.2016.07.010
[22] J. T. Liu, W. M. Zheng, Y. Zong, et al., “Cross-corpus speech emotion recognition based on deep domain-adaptive convolutional neural network,” IEICE Transactions on Information and Systems, vol.E103.D, no.2, pp.459–463, 2020. doi: 10.1587/transinf.2019EDL8136
[23] H. L. Yan, Y. K. Ding, P. H. Li, et al., “Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation,” in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp.945–954, 2017.
[24] Y. C. Zhu, F. Z. Zhuang, J. D. Wang, et al., “Deep subdomain adaptation network for image classification,” IEEE Transactions on Neural Networks and Learning Systems, vol.32, no.4, pp.1713–1722, 2021. doi: 10.1109/TNNLS.2020.2988928
[25] C. Busso, M. Bulut, C. C. Lee, et al., “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol.42, no.4, pp.335–359, 2008. doi: 10.1007/s10579-008-9076-6
[26] F. Burkhardt, A. Paeschke, M. Rolfes, et al., “A database of German emotional speech,” in Proceedings of the 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, pp.1517–1520, 2005.
[27] M. Y. Chen, X. J. He, J. Yang, et al., “3-D convolutional recurrent neural networks with attention model for speech emotion recognition,” IEEE Signal Processing Letters, vol.25, no.10, pp.1440–1444, 2018. doi: 10.1109/LSP.2018.2860246
[28] L. Lee and R. Rose, “A frequency warping approach to speaker normalization,” IEEE Transactions on Speech and Audio Processing, vol.6, no.1, pp.49–60, 1998. doi: 10.1109/89.650310
[29] N. Jaitly and G. E. Hinton, “Vocal tract length perturbation (VTLP) improves speech recognition,” in Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 2013.