YUAN Lichi. A New Word Clustering Algorithm Based on Word Similarity[J]. Chinese Journal of Electronics, 2017, 26(6): 1221-1226. doi: 10.1049/cje.2017.09.016

A New Word Clustering Algorithm Based on Word Similarity

doi: 10.1049/cje.2017.09.016
Funds: This work was supported by the National Natural Science Foundation of China (No. 61562034, No. 61262035), the Science and Technology Support Program of Jiangxi Province, China (No. 20151BBE50082), and the Natural Science Foundation of Jiangxi Province, China (No. 20142BAB207028).
  • Received Date: 2014-04-14
  • Rev Recd Date: 2014-10-24
  • Publish Date: 2017-11-10
  • Abstract: Category-based statistical language models are an important way to address data sparsity in statistical language modeling. However, they have two bottlenecks: 1) word clustering, for which it is hard to find a method that performs well without a large amount of computation; and 2) class-based methods always lose some predictive ability when adapting to text from different domains. To address these problems, this paper presents a novel definition of word similarity based on mutual information. Building on word similarity, it defines a similarity between word sets and proposes a bottom-up hierarchical clustering algorithm. Experimental results show that the clustering algorithm based on word similarity outperforms the conventional greedy clustering method in both speed and performance: perplexity is reduced from 283 to 207.8.
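The abstract does not give the paper's exact definitions, so the following is only a minimal sketch of the general idea of bottom-up word clustering driven by a mutual-information-based similarity. Everything specific in it is an assumption made for illustration: word similarity is taken to be the cosine of positive pointwise-mutual-information vectors over right-hand neighbours, word-set similarity is taken to be the average pairwise word similarity, and the function names (`context_vectors`, `word_similarity`, `set_similarity`, `cluster`) are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def context_vectors(sentences):
    """Map each word to positive PMI scores with the words that follow it."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    vectors = {w: {} for w in unigrams}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bi
        p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi = math.log(p_joint / p_indep)
        if pmi > 0:  # keep only positive associations
            vectors[w1][w2] = pmi
    return vectors


def word_similarity(v1, v2):
    """Assumed stand-in: cosine similarity of two sparse PMI vectors."""
    dot = sum(x * v2[k] for k, x in v1.items() if k in v2)
    norm = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(
        sum(x * x for x in v2.values())
    )
    return dot / norm if norm else 0.0


def set_similarity(a, b, vectors):
    """Assumed stand-in: average pairwise word similarity between two sets."""
    total = sum(word_similarity(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster(sentences, num_classes):
    """Merge the two most similar word sets until num_classes remain."""
    vectors = context_vectors(sentences)
    clusters = [frozenset([w]) for w in sorted(vectors)]
    while len(clusters) > num_classes:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: set_similarity(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


corpus = [
    ["the", "cat", "runs"],
    ["the", "dog", "runs"],
    ["a", "cat", "sleeps"],
    ["a", "dog", "sleeps"],
]
classes = cluster(corpus, 4)
```

On this toy corpus, words that share the same following contexts ("cat" and "dog", "a" and "the") end up in the same class. The exhaustive pair search is written for readability, not speed, and says nothing about the efficiency claims in the abstract.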