HUANG Degen, ZHANG Jing, HUANG Kaiyu. Automatic Microblog-Oriented Unknown Word Recognition with Unsupervised Method[J]. Chinese Journal of Electronics, 2018, 27(1): 1-8. doi: 10.1049/cje.2017.11.004

Automatic Microblog-Oriented Unknown Word Recognition with Unsupervised Method

doi: 10.1049/cje.2017.11.004
Funds: This work is supported by the National Natural Science Foundation of China (No. 61672127).
  • Received Date: 2016-06-12
  • Rev Recd Date: 2016-09-14
  • Publish Date: 2018-01-10
  • As a prerequisite task in Natural language processing (NLP), Chinese word segmentation (CWS) is challenged by unknown words. To detect Chinese unknown words effectively, especially low-frequency unknown words in unstructured microblog data, we modify the usage of Accessor variety (AV) to measure the context environments of core fragments and propose a novel variable, the Independence of strings, which is derived from the internal structure of segments. Our approach is unsupervised and uses no manually built resources. Because manually built resources for microblog-oriented unknown word extraction are lacking, we use a sampling approach to assess the effectiveness of our method. Experimental results suggest that our best system outperforms both the baseline system and the state-of-the-art system, with significant improvements in F1-measure and in the recall of low-frequency unknown words.
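
For readers unfamiliar with accessor variety, the following is a minimal sketch of the standard criterion (Feng et al., 2004), in which the AV of a string is the smaller of its number of distinct left neighbours and its number of distinct right neighbours. It is not the modified usage proposed in this paper, and it does not implement the Independence of strings. The function name `accessor_variety`, the `max_len` parameter, and the toy corpus are illustrative assumptions.

```python
from collections import defaultdict


def accessor_variety(corpus, max_len=4):
    """Return {substring: AV} for every substring of up to max_len characters.

    AV(s) = min(distinct left neighbours of s, distinct right neighbours of s).
    """
    left = defaultdict(set)
    right = defaultdict(set)
    for sent_id, sentence in enumerate(corpus):
        n = len(sentence)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sentence[i:j]
                # Assumption: each sentence boundary counts as its own
                # distinct neighbour, so it is tagged with the sentence id.
                left[s].add(sentence[i - 1] if i > 0 else ("<S>", sent_id))
                right[s].add(sentence[j] if j < n else ("<E>", sent_id))
    return {s: min(len(left[s]), len(right[s])) for s in left}


# Toy corpus (hypothetical): "微博" appears in varied contexts, "爱微" does not.
corpus = ["我爱微博", "他爱微博客", "微博很好"]
av = accessor_variety(corpus)
print(av["微博"], av["爱微"])  # 2 1
```

A string that occurs in many different contexts, such as "微博" above, receives a higher AV than a fragment that is always followed by the same character, which is why AV serves as evidence of a word boundary.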