LI Rui, YU Jun, LI Xian, FANG Peng, WANG Zengfu. An Evaluation Framework for Virtual Articulatory Movements Based on Medical Video[J]. Chinese Journal of Electronics, 2019, 28(3): 585-592. doi: 10.1049/cje.2018.09.019

An Evaluation Framework for Virtual Articulatory Movements Based on Medical Video

doi: 10.1049/cje.2018.09.019
Funds:  This work is supported by the National Natural Science Foundation of China (No.U1736123, No.61572450), Anhui Provincial Natural Science Foundation (No.1708085QF138), and the Fundamental Research Funds for the Central Universities (No.WK2350000002).
More Information
  • Corresponding author: YU Jun. Jun Yu is an associate professor at the University of Science and Technology of China. He is a member of the Biological Information and Artificial Life technical committee of the Chinese Association for Artificial Intelligence and a member of the IEEE Signal Processing Society. His research interests include human-computer interaction and intelligent robots. He has published more than 100 papers and won the Best Paper Finalist Award at IEEE ICME 2017.
  • Received Date: 2016-01-19
  • Publish Date: 2019-05-10
  • An important aspect of a speech-tutoring-oriented talking-head system (STTS) is the accuracy of the articulatory movements it produces, yet little work has addressed the evaluation of articulatory movement accuracy (AMA) in STTSs. Subjective evaluation is reliable but time-consuming and inconvenient. The traditional objective evaluation compares the motion of several points on the surface of the synthetic articulator with electromagnetic articulography (EMA) data, which describe the motion of the corresponding points on a speaker's articulators. EMA information is too limited to describe the changing shape of a whole deformable articulator during speech. To solve this problem, we propose a substantially different objective evaluation method based on a separately recorded medical video: the synthetic articulatory shapes produced during a speech process are compared with the corresponding shapes tracked from the medical video. The method is invariant to translation, rotation, and scaling, which allows shapes from the synthetic tongue to be compared with shapes from the medical images. The timing mismatch between the synthesis results and the medical video is handled by incorporating dynamic time warping (DTW). Experimental results demonstrate that the method can evaluate the accuracy of the deformed shape over an entire articulation process, and the comparison results suggest that it is more accurate than the traditional method, especially for deformable articulators.
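The two ingredients named in the abstract, a shape comparison that ignores translation, rotation, and scaling, and DTW to absorb timing differences, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the shape descriptor, so a Procrustes-style distance is swapped in here purely for illustration. The function names (`normalize_shape`, `shape_distance`, `dtw_distance`) are hypothetical, and the sketch assumes every contour has already been resampled to the same number of corresponding 2D points.

```python
import numpy as np


def normalize_shape(points):
    """Center a contour at the origin and scale it to unit Frobenius norm."""
    pts = np.asarray(points, dtype=float)
    pts = pts - pts.mean(axis=0)          # removes translation
    norm = np.linalg.norm(pts)
    if norm == 0:
        raise ValueError("degenerate contour: all points coincide")
    return pts / norm                     # removes uniform scaling


def shape_distance(a, b):
    """Procrustes-style distance between two contours with corresponding points.

    Assumption (not from the paper): points are in one-to-one correspondence.
    The result is invariant to translation, rotation, and uniform scaling.
    """
    a, b = normalize_shape(a), normalize_shape(b)
    # Optimal rotation via SVD; the determinant check excludes reflections.
    u, s, vt = np.linalg.svd(a.T @ b)
    s = s.copy()
    s[-1] *= np.sign(np.linalg.det(u @ vt))
    # Both shapes have unit norm, so the squared residual is 2 - 2 * sum(s).
    return float(np.sqrt(max(0.0, 2.0 - 2.0 * s.sum())))


def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of contours.

    The sequences may differ in length; each frame pair is compared
    with shape_distance.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = shape_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # frame of seq_a repeated
                                 cost[i, j - 1],      # frame of seq_b repeated
                                 cost[i - 1, j - 1])  # both sequences advance
    return float(cost[n, m])
```

In this sketch, a synthetic tongue-contour sequence and a contour sequence tracked from the medical video would be passed as `seq_a` and `seq_b`; a lower accumulated cost indicates closer agreement of the deforming shapes over the whole utterance.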