Volume 31, Issue 1, Jan. 2022
Citation: LIN Long and TAN Liang, “Multi-Distributed Speech Emotion Recognition Based on Mel Frequency Cepstogram and Parameter Transfer,” Chinese Journal of Electronics, vol. 31, no. 1, pp. 155–167, 2022, doi: 10.1049/cje.2020.00.080

Multi-Distributed Speech Emotion Recognition Based on Mel Frequency Cepstogram and Parameter Transfer

doi: 10.1049/cje.2020.00.080
Funds:  This work was supported by National Natural Science Foundation of China (61373162), Sichuan Science and Technology Support Project (2019YFG0183), and Visual Computing and Virtual Reality Sichuan Provincial Key Laboratory Project (KJ201402)
More Information
  • Author Bio:

    LIN Long was born in 1995. He received the B.S. degree in engineering from Sichuan Normal University, Chengdu, China, in 2018. He is currently pursuing the M.S. degree in software engineering at Sichuan Normal University, Chengdu, China. His main research interest is emotion recognition. He has also worked on a digital teaching patent. (Email: 569074330@qq.com)

    TAN Liang (corresponding author) received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China in 2007. He is a Professor in the School of Computer Science, Sichuan Normal University. His research interests include cloud computing, big data, blockchain, trusted computing, and network security. (Email: jkxy_tl@sicnu.edu.cn)

  • Received Date: 2020-03-17
  • Accepted Date: 2020-06-03
  • Available Online: 2021-10-22
  • Publish Date: 2022-01-05
  • Speech emotion recognition (SER) is the use of speech signals to estimate a speaker's emotional state. Machine learning is currently one of the main approaches to SER. Traditional machine learning assumes that training and test data share the same distribution and feature space, but in real life speech is collected from different environments and devices and therefore follows different distributions, so traditional machine learning methods perform poorly on SER. This paper proposes a multi-distributed SER method based on the Mel frequency cepstogram (MFCC) and parameter transfer. The method is built on a single-layer long short-term memory (LSTM) network, a pre-trained Inception-v3 network, and a multi-distribution corpus. The MFCC obtained from pre-processed speech is fed into the single-layer LSTM, whose output is passed to the pre-trained Inception-v3 network, which extracts features. The features are then sent to newly defined fully connected and classification layers; the parameters of the fully connected layer are fine-tuned, and the classification result is obtained. Experiments show that the method effectively classifies multi-distribution speech emotions and outperforms the traditional machine learning framework for SER.
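The following is a minimal, hedged sketch (in TensorFlow/Keras, not the authors' code) of the pipeline described in the abstract: MFCC input, a single-layer LSTM, a frozen ImageNet-pretrained Inception-v3 used as a feature extractor, and newly defined fully connected and classification layers that are fine-tuned. All dimensions, layer sizes, and the bridge from the LSTM output to Inception-v3's image-shaped input are assumptions, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_FRAMES, N_MFCC, N_EMOTIONS = 300, 40, 6   # assumed input/output dimensions

mfcc_in = layers.Input(shape=(N_FRAMES, N_MFCC))       # MFCC sequence (frames x coefficients)
x = layers.LSTM(128, return_sequences=True)(mfcc_in)   # single-layer LSTM
x = layers.Reshape((N_FRAMES, 128, 1))(x)              # treat the LSTM output as a 2-D map
x = layers.Conv2D(3, 1)(x)                             # expand to 3 channels for Inception-v3
x = layers.Resizing(299, 299)(x)                       # Inception-v3's expected input size

base = tf.keras.applications.InceptionV3(include_top=False,
                                         weights="imagenet",
                                         pooling="avg")
base.trainable = False                                 # transferred parameters stay fixed
x = base(x, training=False)                            # extract pooled Inception-v3 features

x = layers.Dense(256, activation="relu")(x)            # newly defined fully connected layer (fine-tuned)
out = layers.Dense(N_EMOTIONS, activation="softmax")(x)  # emotion classification layer

model = Model(mfcc_in, out)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Freezing the Inception-v3 base while leaving only the new fully connected and classification layers trainable mirrors the parameter-transfer idea in the abstract; the actual layer widths, training schedule, and corpus handling would follow the paper itself.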
  • [1]
    WANG Haikun, PAN Jia, and LIU Cong, “Research development and forecast of automatic speech recognition technologies,” Telecommunications Science, vol.34, no.2, pp.1–11, 2018.
    [2]
    Han WJ, Li HF, Ruan HB, et al., “Review on speech emotion recognition,” Journal of Software, vol.25, no.1, pp.37–50, 2014.
    [3]
    SONG Peng, ZHENG Wenming, and ZHAO Li, “Cross-corpus speech emotion recognition based on a feature transfer learning method,” Journal of Tsinghua University (Science and Technology), vol.56, no.11, pp.1179–1183, 2016.
    [4]
    TENG Z and JI W, “Speech emotion recognition with i-vector feature and rnn model,” 2015 IEEE China Summit and International Conference on Signal and Information Processing (China SIP), Chengdu, pp.524–528, 2015.
    [5]
    Basu A, Chakraborty J, and Aftabuddin M, “Emotion recognition from speech using convolutional neural network with recurrent neural network architecture,” 2017 2nd International Conference on Communication and Electronics Systems, Coimbatore, DOI: 10.1109/CESYS.2017.8321292, 2017.
    [6]
    Sak H, Senior A, and Beaufays F, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Proc. of the Annual Conference of the International Speech Communication Association, Singapore, pp.338–342, 2014.
    [7]
    Badshah A M, Ahmad J, and Rahim N, “Speech emotion recognition from spectrograms with deep convolutional neural network,” in Proc. of the International Conference on Platform Technology and Service, Busan, DOI: 10.1109/PlatCon.2017.7883728, 2017.
    [8]
    LU Guanming, YUAN Liang, YANG Wenjuan, et al., “Speech emotion recognition based on long-term and short-term memory and convolutional neural networks,” Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), vol.38, no.5, pp.63–69, 2018.
    [9]
    Cowie R, Douglas-Cowie E, Savvidou S, et al., “FEELTRACE: An instrument for recording perceived emotion in real time,” in Proc. of the 2000 ISCA Workshop on Speech and Emotion: A Conceptual Frame Work for Research, Newcastle, pp.19–24, 2000.
    [10]
    McGilloway S, Cowie R, Douglas-Cowie E, et al., “Approaching automatic recognition of emotion from speech: A rough benchmark,” in Proc. of the 2000 ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research, Newcastle, pp.207–212, 2000.
    [11]
    Burkhardt F, Paeschke A, Rolfes M, et al., “A database of german emotional speech,” in Proc. of the 2005 INTERSPEECH, Lisbon, pp.1517–1520, 2005.
    [12]
    Steidl S, “Automatic classification of emotion-related user states in spontaneous children’s speech,” Ph.D. Thesis, Erlangen: University at Erlangen Nurberg, 2009.
    [13]
    Grimm M, Kroschel K, and Narayanan S, “The Vera am Mittag German audio-visual emotional speech database,” in Proc. of the 2008 IEEE Int. Conf. on Multimedia and Expo (ICME), Hannover, pp.865–868, 2008
    [14]
    McKeown G, Valstar MF, Cowie R, et al., “The semaine corpus of emotionally coloured character interactions,” in Proc. of the 2010 IEEE Int. Conf. on Multimedia and Expo (ICME), Singapore, pp.1079-1084, 2010.
    [15]
    Schuller B, Valstar M, Eyben F, et al., “AVEC 2012 the continuous audio/visual emotion challenge,” in Proc. of the 2012 Int. Audio/Visual Emotion Challenge and Workshop (AVEC), Grand Challenge and Satellite of ACM ICMI 2012, Santa Monica, California, available at: https://mediatum.ub.tum.de/doc/1137896/1137896.pdf, 2012.
    [16]
    van Bezooijen R, Otto SA, and Heenan TA, “Recognition of vocal expressions of emotion: A three-nation study to identify universal characteristics,” Journal of Cross-Cultural Psychology, vol.14, no.4, pp.387–406, 1983. doi: 10.1177/0022002183014004001
    [17]
    Tolkmitt FJ and Scherer KR, “Effect of experimentally induced stress on vocal parameters,” Journal of Experimental Psychology Human Perception Performance, vol.12, no.3, pp.302–313, 1986. doi: 10.1037/0096-1523.12.3.302
    [18]
    Cahn JE, “The generation of affect in synthesized speech,” Journal of the American Speech Input/Output Society, vol.8, pp.1–19, 1990.
    [19]
    Moriyama T and Ozawa S, “Emotion recognition and synthesis system on speech,” the 1999 IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS), IEEE Computer Society, Florence, pp.840–844, 1999.
    [20]
    Cowie R, Douglas-Cowie E, Savvidou S, et al., “Feeltrace: An instrument for recording perceived emotion in real time,” the 2000 ISCA Workshop on Speech and Emotion: A Conceptual Frame Work for Research, ISCA, Belfast, pp.19–24, 2000.
    [21]
    Grimm M and Kroschel K, “Evaluation of natural emotions using self assessment manikins,” the 2005 IEEE Workshop on Automatic Speech Recognition and Understanding, Cancun, pp.381–385, 2005.
    [22]
    Grimm M, Kroschel K, and Narayanan S, “Support vector regression for automatic recognition of spontaneous emotions in speech,” the 2007 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), IEEE Computer Society, Honolulu, HI, pp.1085–1088, 2007.
    [23]
    Giannakopoulos T, Pikrakis A, and Theodoridis S, “A dimensional approach to emotion recognition of speech from movies,” the 2009 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) , IEEE Computer Society, Taipei, pp.65–68, 2009.
    [24]
    Wu D R, Parsons T D, Mower E, et al., “Speech emotion estimation in 3d space,” the 2010 IEEE Int. Conf. on Multimedia and Expo (ICME), IEEE Computer Society, Singapore, pp.737–742. 2010.
    [25]
    Eyben F, Wollmer M, Graves A, et al., “On-Line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues,” Journal on Multimodal User Interfaces, vol.3, no.1-2, pp.7–19, 2010. doi: 10.1007/s12193-009-0032-6
    [26]
    Karadogan SG and Larsen J, “Combining semantic and acoustic features for valence and arousal recognition in speech,” the 2012 Int. Workshop on Cognitive Information Processing (CIP), IEEE Computer Society, Baiona, pp.1–6, 2012.
    [27]
    Eyben F, Wollmer M, and Schuller B, “OpenSMILE—The Munich versatile and fast open-source audio feature extractor,” in Proc. of the 9th ACM International Conference on Multimedia, Firenze, pp.1459–1462, 2010.
    [28]
    Schuller B, Valstar M, Eyben F, et al., “AVEC 2011—The first international audio/visual emotion challenge,” 2011 International Conference on Affective Computing and Intelligent Interaction, Memphis, TN, pp.415–424, 2011.
    [29]
    Yamada T, Hashimoto H, and Tosa N, “Pattern recognition of emotion with neural network,” The 1995 IEEE IECON 21st International Conference on Industrial Electronics, Control, and Instrumentation, Orlando, FL, pp.183–187, 1995.
    [30]
    Shi B, Bai X, and Yao C, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, no.11, pp.12–15, 2017.
    [31]
    Chen B, Yin Q, and Guo P, “A study of deep belief network based Chinese speech emotion recognition,” 10th International Conference on Computational Intelligence and Security, IEEE, Kunming, pp.180–184, 2014.
    [32]
    Lozano-Dez A, Zazo C R, and Gonzlez D J, “An end-to-end approach to language identification in short utterances using convolutional neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.30, no.10, pp.112–115, 2015.
    [33]
    Zazo R, Lozano-Diez A, and Gonzalez D J, “Language identification in short utterances using long short-term memory (LSTM),” Recurrent Neural Networks, vol.23, no.1, pp.23–27, 2016.
    [34]
    Gelly G, Gauvain J L, Le V, et al., “A divide-and-conquer approach for language identification based on recurrent neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.29, no.5, pp.22–25, 2016.
    [35]
    Xinran Z, Peng S, and Gchen Z, “Auditory attention model based on Chirplet for cross-corpus speech emotion recognition,” Journal of Southeast University, vol.32, no.4, pp.402–407, 2016.
  • 加载中
