Handling OOV Words in Mandarin Spoken Term Detection with an Hierarchical n-Gram Language Model

WANG Xuyang; ZHANG Pengyuan; NA Xingyu; PAN Jielin; YAN Yonghong

doi:10.1049/cje.2017.07.004

WANG Xuyang, ZHANG Pengyuan, NA Xingyu, PAN Jielin, YAN Yonghong. Handling OOV Words in Mandarin Spoken Term Detection with an Hierarchical n-Gram Language Model[J]. Chinese Journal of Electronics, 2017, 26(6): 1239-1244. DOI: 10.1049/cje.2017.07.004

Abstract

In this paper, a hierarchical n-gram Language model (LM) combining words and characters is explored to improve the detection of Out-of-vocabulary (OOV) words in Mandarin Spoken term detection (STD). The hierarchical LM is built on a word-level LM, with a character-level LM estimating the probabilities of OOV words in a class-based way. The region containing OOV words in the sentence to be decoded is detected with the help of the word-level LM, and the probabilities of the OOV words are derived from the character-level LM. The proposed approach is implemented in a dynamic decoder and evaluated in terms of Actual term weighted value (ATWV) on two Mandarin data sets. Experimental results show that a relative improvement of more than 10% in OOV word detection is achieved on both sets, while the detection of In-vocabulary (IV) words is barely affected.
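The class-based combination described in the abstract can be illustrated with a minimal sketch. It assumes bigram models, a single `<oov>` class token in the word-level LM, and a fixed probability floor in place of real smoothing. The tables, the names `WORD_BIGRAM`, `CHAR_BIGRAM`, `char_logprob` and `sentence_logprob`, and all probabilities are hypothetical and not taken from the paper. The sketch only scores an already segmented sentence; it does not reproduce the OOV region detection or the dynamic decoder.

```python
import math

# Toy, hand-picked probabilities for illustration only (not from the paper).
VOCAB = {"我", "喜欢"}
OOV = "<oov>"  # class token standing in for any out-of-vocabulary word

WORD_BIGRAM = {
    ("<s>", "我"): 0.5,
    ("我", "喜欢"): 0.4,
    ("喜欢", OOV): 0.1,
    (OOV, "</s>"): 0.3,
}
CHAR_BIGRAM = {
    ("<w>", "奥"): 0.2,
    ("奥", "巴"): 0.5,
    ("巴", "马"): 0.5,
    ("马", "</w>"): 0.4,
}
FLOOR = 1e-6  # crude stand-in for proper smoothing or back-off


def bigram_logprob(table, prev, cur):
    """Log probability of `cur` given `prev`, floored for unseen pairs."""
    return math.log(table.get((prev, cur), FLOOR))


def char_logprob(word):
    """Log P(character sequence of `word` | OOV class) under the character LM."""
    symbols = ["<w>", *word, "</w>"]
    return sum(bigram_logprob(CHAR_BIGRAM, p, c)
               for p, c in zip(symbols, symbols[1:]))


def sentence_logprob(words):
    """Word-level score in which each OOV word is replaced by the class token
    and its spelling is scored separately by the character LM."""
    classes = [w if w in VOCAB else OOV for w in words]
    tokens = ["<s>", *classes, "</s>"]
    total = sum(bigram_logprob(WORD_BIGRAM, p, c)
                for p, c in zip(tokens, tokens[1:]))
    total += sum(char_logprob(w) for w in words if w not in VOCAB)
    return total
```

For the toy sentence `["我", "喜欢", "奥巴马"]`, the word-level LM scores the sequence `我 喜欢 <oov>`, and the character-level LM supplies the probability of the characters `奥`, `巴`, `马` within the OOV class, so the OOV word receives a probability that depends on its spelling rather than a single fixed unknown-word score.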