Volume 31 Issue 5
Sep.  2022
YU Hao, HUANG Kaiyu, WANG Yu, et al., “Lexicon-Augmented Cross-Domain Chinese Word Segmentation with Graph Convolutional Network,” Chinese Journal of Electronics, vol. 31, no. 5, pp. 949-957, 2022, doi: 10.1049/cje.2021.00.363

Lexicon-Augmented Cross-Domain Chinese Word Segmentation with Graph Convolutional Network

doi: 10.1049/cje.2021.00.363
Funds: This work was supported by the National Key Research and Development Program of China (2020AAA0108004) and the National Natural Science Foundation of China (U1936109, 61672127).
  • Author Bio:

    YU Hao was born in 1996. He received the B.S. degree in computer science from Dalian University of Technology, China, in 2019. He is currently pursuing the master's degree in computer science and technology at Dalian University of Technology. His research interests include natural language processing, Chinese word segmentation, and machine translation. (Email: yuhaodlut@mail.dlut.edu.cn)

    HUANG Kaiyu received the Ph.D. degree from the School of Computer Science at Dalian University of Technology in 2021. He is currently a postdoctoral researcher at the Institute for AI Industry Research (AIR) of Tsinghua University. His research interests include natural language processing and machine translation. (Email: kaiyuhuang@hotmail.com)

    WANG Yu is currently an Associate Professor in the School of Software Technology of Dalian University of Technology. Her research interests include corpus linguistics, machine translation, and academic publishing and communication. She is an Executive Member of the China EAP Association and a Member of CCF. (Email: karan_wang@dlut.edu.cn)

    The corresponding author received the B.S. degree in computer science from Fuzhou University, China, in 1986, and the M.S. and Ph.D. degrees in computer science from Dalian University of Technology, China, in 1988 and 2004, respectively. He is currently a Professor with the School of Computer Science, Dalian University of Technology. His research interests include natural language processing and machine translation. He is a Senior Member of CCF, CIPS, ACM, and CAAI, and an Associate Editor of Int. J. Advanced Intelligence. (Email: huangdg@dlut.edu.cn)

  • Received Date: 2021-12-02
  • Accepted Date: 2022-01-26
  • Available Online: 2022-06-11
  • Publish Date: 2022-09-05
  • Abstract: Existing neural approaches have achieved significant progress in Chinese word segmentation (CWS). However, their performance tends to drop dramatically in cross-domain scenarios because of the data distribution mismatch across domains and the out-of-vocabulary (OOV) word problem. To address these two issues, this paper proposes a lexicon-augmented graph convolutional network for cross-domain CWS. The model captures word boundary information from all candidate words and uses domain lexicons to narrow the distribution gap across domains. Experimental results on the cross-domain CWS datasets (SIGHAN-2010 and TCM) show that the proposed method successfully incorporates domain lexicon information into neural CWS approaches and helps achieve competitive performance for cross-domain CWS. Both problems can be effectively addressed through various interactions between characters and candidate words based on graphs. Further experiments on the CWS benchmarks (Bakeoff-2005) also demonstrate the robustness and efficiency of the proposed method.
  • https://github.com/fxsjy/jieba/blob/master/jieba/dict.txt
    https://github.com/L706077/jieba-zh_TW/blob/master/jieba/dict.txt
    https://pinyin.sogou.com/dict/
    https://github.com/brightmart/roberta_zh
    https://github.com/rtmdrr/testSignificanceNLP
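
    As a rough illustration of the idea summarized in the abstract (characters interacting with lexicon-matched candidate words on a graph), the sketch below builds a small character-word graph and applies one parameter-free graph convolution step. It is not the authors' model: the edge types, the toy lexicon, the two-dimensional node features, the minimum word length of two characters, and all function names are assumptions made for illustration. The propagation rule is the standard symmetric normalization with self-loops, and the learned weight matrix is omitted.

```python
from math import sqrt


def match_candidate_words(sentence, lexicon, max_len=4):
    """Return (start, end, word) for every lexicon word of length 2..max_len."""
    spans = []
    for i in range(len(sentence)):
        for j in range(i + 2, min(len(sentence), i + max_len) + 1):
            if sentence[i:j] in lexicon:
                spans.append((i, j, sentence[i:j]))
    return spans


def build_adjacency(n_chars, spans):
    """Nodes 0..n_chars-1 are characters; one extra node per candidate word."""
    n = n_chars + len(spans)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # self-loops
    for i in range(n_chars - 1):
        adj[i][i + 1] = adj[i + 1][i] = 1.0  # adjacent characters (assumed edge type)
    for k, (start, end, _) in enumerate(spans):
        w = n_chars + k
        for c in range(start, end):
            adj[w][c] = adj[c][w] = 1.0  # word node <-> characters it covers
    return adj


def gcn_step(adj, feats):
    """One step of ReLU(D^-1/2 A D^-1/2 H), with no learned weights."""
    deg = [sum(row) for row in adj]
    dim = len(feats[0])
    out = []
    for i in range(len(feats)):
        row = [0.0] * dim
        for j in range(len(feats)):
            if adj[i][j]:
                coef = adj[i][j] / sqrt(deg[i] * deg[j])
                for k in range(dim):
                    row[k] += coef * feats[j][k]
        out.append([max(0.0, v) for v in row])
    return out


# Toy example; the lexicon here is a hypothetical stand-in for a domain lexicon.
sentence = "研究生命起源"
lexicon = {"研究", "研究生", "生命", "起源"}
spans = match_candidate_words(sentence, lexicon)
adj = build_adjacency(len(sentence), spans)
# Feature [1, 0] marks a character node, [0, 1] marks a candidate-word node.
feats = [[1.0, 0.0]] * len(sentence) + [[0.0, 1.0]] * len(spans)
hidden = gcn_step(adj, feats)
for ch, vec in zip(sentence, hidden):
    print(ch, [round(v, 3) for v in vec])
```

    After one step, the second feature of each character node is non-zero whenever some candidate word covers that character, which is one simple way word boundary evidence from overlapping candidates (for example "研究生" and "生命") can reach the character level.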