
Citation: YU Hao, HUANG Kaiyu, WANG Yu, HUANG Degen. Lexicon-Augmented Cross-Domain Chinese Word Segmentation with Graph Convolutional Network[J]. Chinese Journal of Electronics, 2022, 31(5): 949-957. DOI: 10.1049/cje.2021.00.363
Unlike most European written languages, the Chinese written language has no explicit delimiters to separate words in the context. Therefore, Chinese word segmentation is an essential task for Chinese downstream natural language processing (NLP) tasks.
Chinese word segmentation (CWS) is conventionally formalized as a sequence labelling task[1]. The label of each character is predicted to denote the position of each character in a word. Recently, with the development of deep learning techniques, neural CWS approaches have achieved significant progress on in-domain CWS benchmarks (e.g., Bakeoff-2005)[2-5].
However, the good performance of existing neural CWS approaches depends on large-scale annotated corpora. This leads to two issues for cross-domain CWS: 1) Out-of-vocabulary (OOV) words: baseline models have difficulty recognizing OOV words because the training data contain no information about them. 2) Data distribution mismatch: existing supervised methods are trained on source-domain data in the cross-domain scenario, but the source-domain and target-domain data have different distributions, so performance suffers when models trained on the source domain are applied to the target domain. For example, the Chinese character “证” is likely to mean “proof” in the news domain, whereas in traditional Chinese medicine texts it usually indicates “symptom.” In this situation, researchers need to manually annotate target-domain data to adapt source-domain models to the target domain, which is expensive and time-consuming.
To deal with these two issues, we propose a lexicon-augmented graph convolutional network (LGCN) to incorporate dictionary information into a supervised neural CWS model. First, a lexicon-augmented graph is constructed for each input sentence using external dictionaries. Then, a graph convolutional network (GCN) captures word-boundary information and contextual features from the external dictionaries.
The contributions of this paper are as follows:
• We propose LGCN to address the OOV word and data distribution mismatch problems. It is a straightforward and effective way to improve CWS performance in the cross-domain scenario.
• Our method achieves noticeable improvements in both in-domain and cross-domain CWS scenarios, compared with baseline models and existing methods.
• We further improve the accuracy of cross-domain segmentation in a specific domain by expanding the corresponding domain lexicon in the testing phase.
Since Ref.[1] first formalized CWS as a sequence labelling task, statistical machine learning methods were widely employed for CWS in the early stage[6-8], and CWS has been treated as supervised learning from annotated corpora. Ref.[6] further utilized the sequence labelling tool CRF (conditional random field) for CWS. Ref.[9] showed that different tag sets lead to different segmentation performance. Several models based on variations of CRF achieved state-of-the-art performance[7-8, 10-12].
With the development of neural networks, Ref.[13] proposed a sliding-window neural network method for CWS, which first verified the feasibility of applying deep learning to CWS tasks. Ref.[14] proposed to use LSTM (long short-term memory) networks to capture long-range dependencies, in response to the limitations of sliding windows. Ref.[3] designed a fast segmentation system based on a greedy strategy and achieved performance comparable to statistical machine learning methods. Ref.[15] improved the recognition of unknown words using unsupervised methods. Some research exploited the diverse and complementary knowledge in multiple word segmentations to alleviate word-based problems[16]. Ref.[4] showed that, with rigorous tuning, CWS can achieve good performance without overly complex model structures. Ref.[5] strengthened the model with a Gaussian-masked directional attention mechanism on the basis of a greedy decoding algorithm. Ref.[17] proposed a concise but effective unified model based on the transformer encoder, which is fully shared across different segmentation criteria.
Despite the great progress of neural CWS, OOV words and data distribution mismatch remain obvious gaps in the cross-domain scenario. Some research has attempted to utilize external resources such as dictionaries, multitask learning, unlabelled corpora and pre-trained language models to solve this problem. Ref.[18] and Ref.[19] explored neural word segmentation methods that integrate unlabelled and partially labelled data. Ref.[20] introduced contextualized character embeddings for neural domain adaptation in CWS, aiming to capture the embedding dimensions useful for the target domain. Ref.[21] proposed a semi-supervised word-based approach to improving cross-domain CWS given a baseline segmenter.
Some CWS methods fine-tune pre-trained language models to improve performance on both in-domain and cross-domain datasets[22-24]. Ref.[22] was the first to apply a pre-trained language model to the CWS task by presenting Glyce, glyph-vectors for Chinese character representations. Ref.[23] proposed a joint multiple-criteria model based on a pre-trained language model, sharing all parameters to integrate different segmentation criteria into one model. Ref.[24] utilized a wordhood memory network to incorporate contextual features on the basis of multiple pre-trained language models.
In particular, Ref.[25] defined several templates to construct feature vectors for each character based on dictionaries and contexts, and incorporated the dictionary information into a BiLSTM (bidirectional long short-term memory) model. Ref.[26] used a lattice-LSTM model to incorporate subword lexicons and their pre-trained embeddings. Ref.[27] employed subsampling and negative sampling for word embeddings and achieved a 3.0% F-measure improvement on four datasets covering the novel and medicine domains. Ref.[28] proposed a neural approach for CWS that incorporates unlabeled data and a lexicon into model training as indirect supervision by regularizing the prediction space of CWS models. Recently, graph neural networks have been explored in several kinds of NLP tasks[29]. Ref.[30] utilized a graph neural network with a multi-graph structure to resolve ambiguities in Chinese NER by capturing the information offered by gazetteer lexicons. Our proposed method uses an external dictionary to alleviate the OOV problem. Inspired by Ref.[30], we attempt to incorporate lexicon information into the context representation with a graph neural network.
The framework of our model, illustrated in Fig.1, consists of two principal components: the lexicon-augmented GCN encoder, and the baseline CWS model with a context encoder and a decoder. We describe the construction of the lexicon-augmented graph, the architecture of the baseline model, and how the two are integrated, in turn.
Lexicon-augmented graph
As shown in Fig.1, we construct the lexicon-augmented graph for each input sentence using the external dictionaries.
In addition, we introduce three additional nodes into the graph, as illustrated in Fig.1.
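The exact node and edge definitions, including the three additional nodes, follow Fig.1 and are not reproduced here. As a purely illustrative sketch, one can assume characters are nodes, neighbouring characters are connected, and every dictionary-matched word adds an edge between its first and last character; the function and parameter names below are hypothetical.

```python
# Illustrative sketch of lexicon-augmented graph construction (assumed conventions,
# not the paper's exact definition; the three additional nodes are omitted).
from typing import List, Set, Tuple

def build_lexicon_graph(sentence: str, lexicon: Set[str],
                        max_word_len: int = 6) -> Tuple[List[str], List[Tuple[int, int]]]:
    nodes = list(sentence)                      # one node per character
    edges = []
    # sequential edges between neighbouring characters
    for i in range(len(sentence) - 1):
        edges.append((i, i + 1))
    # lexicon edges: connect the first and last character of every matched word
    for i in range(len(sentence)):
        for j in range(i + 2, min(i + max_word_len, len(sentence)) + 1):
            if sentence[i:j] in lexicon:
                edges.append((i, j - 1))
    return nodes, edges

# Example with hypothetical lexicon entries:
nodes, edges = build_lexicon_graph("四神聪位于头部", {"四神聪", "位于", "头部"})
print(edges)   # sequential edges plus boundary edges for the matched words
```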
GCN encoder
The graph convolutional network[31] is a variation of the graph neural network (GNN)[32], which scales linearly in the number of graph edges and learns hidden-layer representations that encode both local graph structure and node features. After constructing the undirected lexicon-augmented graph, we apply the layer-wise propagation rule of the GCN to encode it:

$$H^{(l+1)} = \sigma\big(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)}\big) \quad (1)$$

$$\tilde{D}_{ii} = \sum_{j}\tilde{A}_{ij} \quad (2)$$

Here, $\tilde{A} = A + I$ is the adjacency matrix of the lexicon-augmented graph with added self-connections, $\tilde{D}$ is its degree matrix, $H^{(l)}$ denotes the node representations at layer $l$, $W^{(l)}$ is a layer-specific trainable weight matrix, and $\sigma(\cdot)$ is a non-linear activation function.
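As a sketch of how Eqs.(1) and (2) can be realized, the following uses PyTorch Geometric's GCNConv, which applies the symmetrically normalized propagation rule (including the added self-connections) internally; the two-layer depth matches the "GCN layers" entry in Table 2, while the dimensions and node features are illustrative assumptions.

```python
import torch
from torch_geometric.nn import GCNConv
from torch_geometric.utils import to_undirected

class LexiconGCNEncoder(torch.nn.Module):
    """Two-layer GCN over the lexicon-augmented graph (illustrative sketch)."""
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        # GCNConv implements Eq.(1): sigma(D^{-1/2} (A + I) D^{-1/2} H W)
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)

    def forward(self, x, edge_index):
        # x: [num_nodes, in_dim] node features; edge_index: [2, num_edges]
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

# Toy usage: 7 character nodes with sequential edges plus one lexicon edge (0-2).
edge_index = to_undirected(torch.tensor(
    [[0, 1, 2, 3, 4, 5, 0], [1, 2, 3, 4, 5, 6, 2]], dtype=torch.long))
x = torch.randn(7, 128)                          # initial node features (dimension assumed)
g = LexiconGCNEncoder(128, 256)(x, edge_index)   # g_i is later fed to the context encoder
```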
Following previous studies, we regard CWS as a character-based sequence labelling task. The framework predicts a tag that represents the position of each character within a word. We use the 4-tag set

$$T = \{B, E, M, S\} \quad (3)$$
for prediction (e.g., tag “B” represents the first character of a word). The context encoder we utilize, BiLSTM[33], is a mainstream neural architecture for sequence labelling and is well suited for sequential annotation tasks. Given a sentence $X = \{x_1, x_2, \ldots, x_n\}$, we form the input representation

$$e_i = [u_i \oplus b_i \oplus g_i] \quad (4)$$

for each character $x_i$, where $u_i$, $b_i$, and $g_i$ denote the unigram embedding, the bigram embedding, and the representation produced by the lexicon-augmented GCN encoder, respectively, and $\oplus$ denotes concatenation.
The BiLSTM then computes the forward and backward hidden states

$$\overrightarrow{h}_i = \mathrm{LSTM}(e_i, \overrightarrow{h}_{i-1}; \overrightarrow{\theta}) \quad (5)$$

$$\overleftarrow{h}_i = \mathrm{LSTM}(e_i, \overleftarrow{h}_{i+1}; \overleftarrow{\theta}) \quad (6)$$

$$h_i = [\overrightarrow{h}_i \oplus \overleftarrow{h}_i] \quad (7)$$

where $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ are the parameters of the forward and backward LSTMs, and $h_i$ is the final contextual representation of the $i$-th character.
Then, the baseline model adopts a dense layer followed by a softmax function to produce the tag distribution for each character.
The cross-entropy loss function can be expressed as:

$$\mathrm{Loss}(y, y^{*}) = -\sum_{x} y(x)\log y^{*}(x) \quad (8)$$

where $y$ denotes the gold label sequence and $y^{*}$ denotes the predicted tag distribution.
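To make Eqs.(4)-(8) concrete, here is a minimal sketch of the baseline tagger: the unigram, bigram and GCN representations are concatenated, encoded by a BiLSTM, and a dense layer produces scores over the 4-tag set, trained with cross-entropy. The dimensions follow Table 2, but the names and wiring are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BaselineTagger(nn.Module):
    """BiLSTM tagger over concatenated unigram, bigram and GCN features (sketch)."""
    def __init__(self, uni_dim=128, bi_dim=128, gcn_dim=256, hidden=512, num_tags=4):
        super().__init__()
        # Eq.(4): e_i = [u_i ⊕ b_i ⊕ g_i]
        self.bilstm = nn.LSTM(uni_dim + bi_dim + gcn_dim, hidden,
                              num_layers=3, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)   # dense layer over h_i

    def forward(self, uni, bi, gcn):
        e = torch.cat([uni, bi, gcn], dim=-1)      # [batch, seq_len, uni+bi+gcn]
        h, _ = self.bilstm(e)                      # Eqs.(5)-(7): h_i = [→h_i ⊕ ←h_i]
        return self.classifier(h)                  # unnormalized tag scores

# Cross-entropy loss of Eq.(8) over the 4-tag set T = {B, E, M, S}
model = BaselineTagger()
uni, bi, gcn = (torch.randn(2, 10, d) for d in (128, 128, 256))
logits = model(uni, bi, gcn)                       # [2, 10, 4]
gold = torch.randint(0, 4, (2, 10))                # gold B/E/M/S tag ids
loss = nn.functional.cross_entropy(logits.view(-1, 4), gold.view(-1))
```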
To evaluate our method, we conducted comparative experiments on both in-domain (Bakeoff-2005[34] and CTB6[35]) and cross-domain (SIGHAN-2010[36]) CWS datasets; the statistics of these datasets are shown in Table 1. We randomly selected 10% of the sentences from the training set as the validation set for tuning hyperparameters, and followed the settings of the “PKU” dataset for the cross-domain experiments. For consistency, we converted all punctuation marks, digits and Latin letters into half-width, and replaced continuous runs of English characters and digits with a unique token, similar to Ref.[3]. The evaluation metric is the F-score. We used the simplified Chinese dictionary “Dict-I①” and the traditional Chinese dictionary “Dict-II②” derived from Jieba as the external dictionaries for the PKU, MSR, AS, CITYU, and CTB6 datasets.
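A minimal sketch of the preprocessing described above: full-width punctuation, digits and Latin letters are normalized to half-width, and continuous runs of Latin letters and digits are collapsed into a single placeholder token. The placeholder string and function name are assumptions, not the authors' code.

```python
import re

def normalize(sentence: str, token: str = "<ENG>") -> str:
    """Convert full-width characters to half-width, then collapse
    continuous Latin/digit runs into a single placeholder token."""
    out = []
    for ch in sentence:
        code = ord(ch)
        if code == 0x3000:                 # full-width space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:     # full-width ASCII block
            code -= 0xFEE0
        out.append(chr(code))
    half = "".join(out)
    # replace continuous English letters and digits with one unique token
    return re.sub(r"[A-Za-z0-9]+", token, half)

print(normalize("ＰＫＵ２００５测试"))   # -> "<ENG>测试"
```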
Table 2. Hyperparameter settings

| Hyperparameter | Value |
|---|---|
| Lr | [1E−3, 1E−4, 2E−5] |
| Lr of BERT | [1E−4, 2E−5] |
| Optimizer | AdamW |
| Unigram dim | [128, 768] |
| Bigram dim | 128 |
| BiLSTM hidden dim | 512 |
| GCN hidden dim | [256, 768] |
| BiLSTM layers | 3 |
| GCN layers | 2 |
| Dropout | [0.2, 0.5] |
| Batch size | [32, 64, 128, 256] |
| Epochs | 20 |
We collected a traditional Chinese medicine domain lexicon “Dict-m” from the Sougou Cell Thesaurus③ for cross-domain word segmentation. We validate the performance of the model under three different unigram embeddings. The hyperparameters are shown in Table 2. Our method is implemented with PyTorch[37] and PyTorch Geometric[38] and runs on an Nvidia Tesla V100 GPU.
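For reference, the F-score and the OOV recall rate ($R_{oov}$) reported in Tables 3 and 4 can be computed by comparing gold and predicted word spans; the snippet below is a standard per-sentence sketch, not the authors' evaluation script.

```python
def to_spans(words):
    """Convert a word sequence into (start, end, word) character spans."""
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w), w))
        pos += len(w)
    return spans

def evaluate(gold_words, pred_words, train_vocab):
    """Word-level F-score and recall of OOV words for a single sentence."""
    gold = to_spans(gold_words)
    pred = {(s, e) for s, e, _ in to_spans(pred_words)}
    gold_set = {(s, e) for s, e, _ in gold}
    correct = gold_set & pred
    p = len(correct) / len(pred)
    r = len(correct) / len(gold_set)
    f = 2 * p * r / (p + r) if p + r else 0.0
    # OOV recall: recall restricted to gold words unseen in the training vocabulary
    oov = [(s, e) for s, e, w in gold if w not in train_vocab]
    r_oov = sum((s, e) in pred for s, e in oov) / len(oov) if oov else 0.0
    return f, r_oov

# Toy example: the OOV word "四神聪" is missed, "位于" and "头部" are correct.
print(evaluate(["四神聪", "位于", "头部"],
               ["四神", "聪", "位于", "头部"],
               train_vocab={"位于", "头部"}))
```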
Table 1. Statistics of the CWS datasets

| In-domain | | PKU | MSR | AS | CITYU | CTB6 |
|---|---|---|---|---|---|---|
| Words | Train | 1.1M | 2.4M | 5.5M | 1.5M | 0.7M |
| | Test | 0.1M | 0.1M | 0.1M | 40.9K | 52K |
| Chars | Train | 1.8M | 4.0M | 8.2M | 2.3M | 1.2M |
| | Test | 0.2M | 0.2M | 0.2M | 65.9K | 86K |
| OOV rate (%) | | 6.14 | 2.42 | 3.89 | 7.15 | 5.63 |
| Cross-domain | | Lit. | Com. | Med. | Fin. | TCM |
| Words | Train | – | – | – | – | – |
| | Test | 0.1M | 0.1M | 31K | 33K | 56.5K |
| Chars | Train | – | – | – | – | – |
| | Test | 50K | 54K | 51K | 53K | 90.2K |
| OOV rate (%) | | 6.14 | 14.04 | 10.56 | 7.14 | 19.57 |

Note: Sizes of the CWS datasets. “PKU,” “MSR,” “AS,” and “CITYU” come from Bakeoff-2005, and CTB6 is a widely used benchmark. “Lit.,” “Com.,” “Med.,” and “Fin.” denote the literature, computer, medicine and finance domains of SIGHAN-2010, respectively. “TCM” denotes our new cross-domain dataset with a high OOV rate.
We construct three baseline models to verify the validity of our proposed method “+LGCN” on SIGHAN-2010 in the cross-domain scenario. The three baseline models, which use different unigram embeddings, are denoted “Randinit,” “RoBERTa,” and “Finetune,” respectively. “Randinit” indicates that the unigram embeddings are randomly initialized. “RoBERTa” indicates that we use pre-trained embeddings from a pre-trained language model (RoBERTa-WWM[39]). “Finetune” indicates that we fine-tune the parameters of RoBERTa-WWM④.
Experimental results on the four cross-domain datasets are shown in Table 3, where the overall F-scores and the recall rates of OOV words ($R_{oov}$) are reported.
Table 3. Results on the SIGHAN-2010 cross-domain datasets

| Model | | Lit. | Com. | Med. | Fin. | Avg. |
|---|---|---|---|---|---|---|
| Ref.[13] | F | 92.89 | 93.71 | 92.16 | 95.20 | 93.49 |
| | R_oov | – | – | – | – | – |
| Ref.[3] | F | 92.90 | 94.04 | 92.10 | 95.38 | 93.61 |
| | R_oov | – | – | – | – | – |
| Ref.[18] | F | 93.23 | 95.32 | 93.73 | 95.84 | 94.53 |
| | R_oov | – | – | – | – | – |
| Ref.[25] | F | 94.76 | 94.70 | 94.18 | 96.06 | 94.93 |
| | R_oov | – | – | – | – | – |
| Ref.[23] | F | 96.13 | 96.08 | 95.21 | 96.82 | 96.06 |
| | R_oov | – | – | – | – | – |
| Randinit | F | 91.51 | 92.46 | 89.89 | 94.34 | 92.05 |
| | R_oov | 73.34 | 79.25 | 68.05 | 83.00 | 75.91 |
| +LGCN | F | 94.37* | 95.16* | 93.71* | 96.13* | 94.84 |
| | R_oov | 78.03 | 86.07 | 75.91 | 83.25 | 80.82 |
| RoBERTa | F | 95.85 | 95.29 | 95.14 | 96.73 | 95.75 |
| | R_oov | 83.47 | 86.31 | 81.25 | 89.38 | 85.10 |
| +LGCN | F | 96.19* | 95.67 | 95.38* | 96.79 | 96.01 |
| | R_oov | 85.95 | 88.08 | 82.08 | 87.87 | 86.00 |
| Finetune | F | 96.52 | 96.08 | 95.50 | 96.88 | 96.24 |
| | R_oov | 85.08 | 88.26 | 82.44 | 89.17 | 86.24 |
| +LGCN | F | 96.60 | 96.50* | 95.64 | 97.25* | 96.50 |
| | R_oov | 85.91 | 89.16 | 82.68 | 89.25 | 86.75 |

Note: The first block includes the latest domain-adaptive models. The maximum evaluation value for each pair of comparisons is shown in bold. The “*” indicates that the group of experiments passed the significance test.
Experimental results on the five benchmark datasets are shown in Table 4, which show a similar trend to Table 3. Overall, the comparison demonstrates that our proposed method (+LGCN) outperforms the corresponding baseline model for all 15 pairs in terms of F-score and for 14 pairs in terms of $R_{oov}$.
Table 4. Results on the Bakeoff-2005 and CTB6 benchmark datasets

| Model | | PKU | MSR | AS | CITYU | CTB6 |
|---|---|---|---|---|---|---|
| Ref.[4] | F | 96.1 | 98.1 | 96.2 | 97.2 | 96.7 |
| | R_oov | 78.8 | 80.0 | 70.7 | 87.5 | 85.4 |
| Ref.[25] | F | 96.5 | 97.8 | 95.9 | 96.3 | 96.4 |
| | R_oov | – | – | – | – | – |
| Lattice[26] | F | 95.8 | 97.8 | – | – | 96.1 |
| | R_oov | – | – | – | – | – |
| Glyce[22] | F | 96.7 | 98.3 | 96.7 | 97.9 | – |
| | R_oov | – | – | – | – | – |
| Ref.[17] | F | 96.41 | 98.05 | 96.44 | 96.91 | 96.99 |
| | R_oov | 78.91 | 78.92 | 76.39 | 86.91 | 87.00 |
| Ref.[24] | F | 96.53 | 98.40 | 96.62 | 97.93 | 97.25 |
| | R_oov | 85.36 | 84.87 | 79.64 | 90.15 | 88.46 |
| Ref.[23] | F | 96.85 | 98.29 | – | – | 97.56 |
| | R_oov | 82.35 | 81.75 | – | – | 88.02 |
| Randinit | F | 95.27 | 96.99 | 95.48 | 95.40 | 95.87 |
| | R_oov | 81.36 | 60.75 | 68.07 | 76.78 | 78.76 |
| +LGCN | F | 96.22* | 97.79* | 96.04* | 96.13* | 96.48* |
| | R_oov | 84.91 | 75.70 | 75.14 | 81.63 | 82.39 |
| RoBERTa | F | 96.85 | 97.88 | 96.55 | 97.42 | 97.45 |
| | R_oov | 89.00 | 73.65 | 77.82 | 87.84 | 88.47 |
| +LGCN | F | 97.03* | 98.19* | 96.88* | 97.47 | 97.55* |
| | R_oov | 89.38 | 77.90 | 81.98 | 87.81 | 89.05 |
| Finetune | F | 97.07 | 98.41 | 96.76 | 98.00 | 97.81 |
| | R_oov | 89.13 | 85.67 | 78.96 | 91.09 | 90.56 |
| +LGCN | F | 97.26 | 98.58* | 97.07* | 98.10 | 97.82 |
| | R_oov | 89.63 | 86.10 | 83.28 | 92.11 | 90.66 |

Note: Results on the Bakeoff-2005 and CTB6 benchmarks compared with previous methods. The maximum evaluation value for each pair of comparisons is shown in bold. The “*” indicates that the group of experiments passed the significance test.
Since OOV words in a specific domain may not be included in a general-purpose dictionary, a further experiment is required to investigate the robustness of our method with respect to domain lexicons, i.e., domain-specific OOV words. In the benchmark experiments above, the simplified lexicon “Dict-I” lacks sufficient entries for the traditional Chinese medicine domain, so a domain lexicon needs to be added to strengthen the domain information at test time. Hence, we constructed and manually annotated a traditional Chinese medicine domain dataset (TCM) with a high OOV rate, following the “PKU” segmentation criterion. The size of TCM is shown in Table 1. Based on the three baseline models, we conducted experiments to observe the effect of using expanded lexicons on the “TCM” dataset. The F-scores are shown in Fig.2, where “non-Dict” denotes a BiLSTM model without any lexicon, and “Dict-I” and “Dict-I+Dict-m” denote using different lexicons during the testing phase with the same model trained with “Dict-I.”
As expected, our proposed method can utilize lexicons to capture lexicon knowledge and improve the performance of the corresponding baseline model. Moreover, adding a domain lexicon in the testing phase alone further improves performance in the corresponding domain, without retraining the model.
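Since only the test-time graph construction depends on the lexicon, expanding it does not require retraining; a minimal sketch under the assumptions of the earlier graph-construction example (file names and the one-word-per-line format are hypothetical):

```python
# Hypothetical lexicon files; one word per line is an assumed format.
def load_lexicon(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

dict_i = load_lexicon("dict_i.txt")          # general-purpose lexicon used in training
dict_m = load_lexicon("dict_m.txt")          # TCM domain lexicon (test time only)
expanded = dict_i | dict_m                   # "Dict-I+Dict-m" setting

# The trained model is unchanged; only the lexicon-augmented graphs of the
# test sentences are rebuilt with the expanded lexicon before decoding.
nodes, edges = build_lexicon_graph("四神聪位于头部阙阴区", expanded)
```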
We first looked at the effect of dictionary size. We randomly selected 25%, 50%, 75%, 80%, 90% and 95% of the words from the original dictionary “Dict-I” to build new dictionaries of different sizes. Fig.3 shows F-scores for LGCN with these dictionaries on “Randinit.” As Fig.3 suggests, the performance of this model gradually improves as the dictionary size increases.
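The different dictionary sizes in this ablation can be obtained by simply subsampling the lexicon before graph construction; the small sketch below is our assumption of the obvious implementation, not the authors' script.

```python
import random

def subsample_lexicon(lexicon, ratio, seed=42):
    """Randomly keep a fixed proportion of lexicon entries (e.g., 0.25, 0.5, 0.75)."""
    random.seed(seed)
    words = sorted(lexicon)                       # sort for reproducibility
    return set(random.sample(words, int(len(words) * ratio)))

for ratio in (0.25, 0.50, 0.75, 0.80, 0.90, 0.95):
    sub_dict = subsample_lexicon(dict_i, ratio)   # dict_i from the previous sketch
    # the lexicon-augmented graphs are then rebuilt with sub_dict and re-evaluated
```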
To investigate how the proposed framework learns from the LGCN, we choose an example input sentence “四神聪/位于/头部/阙阴/区” (Sishengcong is located in the Deficiency Yin area of the head) from the traditional Chinese medicine domain as a case study. In this sentence, “四神聪” (Sishengcong) and “阙阴” (Deficiency Yin) are acupoint terms in traditional Chinese medicine. These two words are OOV words and are covered by the domain lexicon. The baseline model segments this sentence into “四神/聪位于/头部/阙阴区,” which leads to ambiguity. In the experiments on the “TCM” dataset, we added a traditional Chinese medicine domain dictionary to the simplified lexicon “Dict-I” used in the training phase, and our model identified these two words. This illustrates the ability of our model to benefit from different lexicons during the testing phase and to improve performance in cross-domain scenarios.
We counted the number of errors on the cross-domain benchmark dataset “Com.” The statistics are shown in Fig.4, where “Error in total” represents the total number of errors. “Merging error” means that multiple words are mistakenly merged, for example, “生态/网 (ecological network)” is mistakenly merged into “生态网 (ecological network).” “Splitting error” means that a single word is incorrectly split, for example, “头文件 (header files)” is incorrectly split into “头/文件 (head/file).” All these errors are counted independently by computer scripts. “In” denotes words occurring in the training corpus, and “Out” denotes OOV words. “I_dict” denotes words appearing in both the lexicon and the training corpus, and “O_dict” denotes OOV words appearing in the lexicon. Compared with the baseline, errors in most categories decreased with our LGCN model. Meanwhile, “Merging errors” on “O_dict” words increased, which demonstrates that our model prefers to merge OOV words that appear in the lexicon.
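The error counts in Fig.4 can be obtained by aligning gold and predicted word spans; the sketch below shows one way such a script might classify merging and splitting errors (the “In/Out” and “I_dict/O_dict” breakdowns would additionally check the training vocabulary and the lexicon). This is an illustrative reconstruction, not the authors' counting script.

```python
def word_spans(words):
    """Convert a word sequence into (start, end) character spans."""
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w)
    return spans

def count_errors(gold_words, pred_words):
    """Count merging errors (several gold words inside one predicted word)
    and splitting errors (one gold word split over several predicted words)."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    merging = sum(
        1 for ps, pe in pred
        if (ps, pe) not in gold and sum(1 for gs, ge in gold if gs >= ps and ge <= pe) > 1
    )
    splitting = sum(
        1 for gs, ge in gold
        if (gs, ge) not in pred and sum(1 for ps, pe in pred if ps >= gs and pe <= ge) > 1
    )
    return merging, splitting

# "生态/网" merged into "生态网", and "头文件" split into "头/文件":
print(count_errors(["生态", "网"], ["生态网"]))      # (1, 0)
print(count_errors(["头文件"], ["头", "文件"]))      # (0, 1)
```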
We empirically analyse the “Error in total” in Fig.4 from four aspects. The first is annotation inconsistency or incorrect annotation. For instance, “个人所得税 (individual income tax)” is annotated as one word in the training dataset, but as “个人 (individual)/所得税 (income tax)” in the test dataset. The word “动脉注射 (arterial injection)” in the training corpus is labelled as “动脉 (artery)/注射 (injection),” but the similar word “静脉注射 (intravenous injection)” in the test corpus is regarded as one word. With the suffix “店 (shop),” “食品店 (food shop)” and “玩具店 (toy shop)” in the training corpus are annotated as single words, whereas “热狗店 (hot dog shop)” and “饲料店 (feed shop)” are annotated as “热狗 (hot dog)/店 (shop)” and “饲料 (fodder)/店 (shop),” which makes it hard for the model to identify such suffixes. In addition, the string “蜀南 (the southern Sichuan)/竹海 (the bamboo sea)/风景区 (scenic spot)” is wrongly labelled as “蜀 (Sichuan)/南竹 (the southern bamboo)/海风 (sea breeze)/景区 (scenic spot)” in the test corpus. The second reason is that the segmentation criteria in the cross-domain dataset differ from those in the in-domain training set. For instance, “重启 (reboot)” is considered one word in the cross-domain dataset, but two words (“重/启”) under the in-domain “PKU” criterion. The third reason is that the model sometimes hesitates on affixes. For instance, “的” appears 53890 times as a single-character word in the training corpus, and only 623 times as a suffix; the former is nearly 100 times more frequent than the latter, so the model annotates “沉沉的 (sunken)” as “沉沉 (heavy)/的 (of).” Likewise, “茶花女 (La Traviata)” and “茶花 (camellia)” are OOV words in the PKU training dataset but occur in the simplified dictionary; “茶花女 (La Traviata)” was annotated as “茶花 (camellia)/女 (woman)” because “女 (woman)” occurs 523 times as a single-character word. Lastly, OOV words are not thoroughly covered by our lexicon.
To make better use of lexicon information, we explore the application of the graph convolutional network to Chinese word segmentation. We propose the LGCN model to deal with the data distribution mismatch and OOV word issues in cross-domain scenarios. We first construct a lexicon-augmented graph for each sentence, and then introduce the GCN to extract its information. In this way, the word boundary information in the sentence is explicitly represented, and OOV words and domain-specific words are handled well. Benchmark experimental results show that our model outperforms previous models on cross-domain datasets and achieves optimal performance on the benchmark datasets. To test the performance of our model with domain lexicons in cross-domain scenarios, we further experiment on the traditional Chinese medicine dataset. The results show that our method benefits from domain lexicons and achieves further improvement in this cross-domain scenario.
In summary, our model utilizes external dictionaries to improve the accuracy of the CWS task, and can further improve segmentation accuracy in a specific domain by expanding the corresponding domain lexicon in the testing phase. We will continue to study the efficiency of our LGCN, focusing on how to reduce the time complexity of segmentation and how to mitigate dataset annotation inconsistency.
[1] N. W. Xue, “Chinese word segmentation as character tagging,” International Journal of Computational Linguistics & Chinese Language Processing, vol.8, no.1, pp.29–48, 2003.
[2] D. Cai and H. Zhao, “Neural word segmentation learning for Chinese,” in Proc. of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp.409–420, 2016.
[3] D. Cai, H. Zhao, Z. S. Zhang, et al., “Fast and accurate neural word segmentation for Chinese,” in Proc. of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, pp.608–615, 2017.
[4] J. Ma, K. Ganchev, and D. Weiss, “State-of-the-art Chinese word segmentation with Bi-LSTMs,” in Proc. of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp.4902–4908, 2018.
[5] S. F. Duan and H. Zhao, “Attention is all you need for Chinese word segmentation,” in Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, pp.3862–3872, 2020.
[6] F. C. Peng, F. F. Feng, and A. McCallum, “Chinese segmentation and new word detection using conditional random fields,” in Proc. of the 20th International Conference on Computational Linguistics (COLING), Geneva, Switzerland, pp.562–568, 2004.
[7] H. Tseng, P. Chang, G. Andrew, et al., “A conditional random field word segmenter for Sighan Bakeoff 2005,” in Proc. of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, pp.168–171, 2005.
[8] H. Zhao, C. N. Huang, M. Li, et al., “A unified character-based tagging framework for Chinese word segmentation,” ACM Transactions on Asian Language Information Processing (TALIP), vol.9, no.2, pp.1–32, 2010. DOI: 10.1145/1781134.1781135
[9] H. Zhao, C. N. Huang, M. Li, et al., “Effective tag set selection in Chinese word segmentation via conditional random field modeling,” in Proc. of the 20th Pacific Asia Conference on Language, Information and Computation, Wuhan, China, pp.87–94, 2006.
[10] H. Zhao and C. Y. Kit, “Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition,” in Proc. of the Sixth SIGHAN Workshop on Chinese Language Processing, Hyderabad, India, pp.106–111, 2008.
[11] X. Sun, H. F. Wang, and W. J. Li, “Fast online training with frequency-adaptive learning rates for Chinese word segmentation and new word detection,” in Proc. of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp.253–262, 2012.
[12] L. K. Zhang, H. F. Wang, X. Sun, et al., “Exploring representations from unlabeled data with co-training for Chinese word segmentation,” in Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp.311–321, 2013.
[13] X. Q. Zheng, H. Y. Chen, and T. Y. Xu, “Deep learning for Chinese word segmentation and POS tagging,” in Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp.647–657, 2013.
[14] X. C. Chen, X. P. Qiu, C. X. Zhu, et al., “Long short-term memory neural networks for Chinese word segmentation,” in Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp.1197–1206, 2015.
[15] D. G. Huang, J. Zhang, and K. Y. Huang, “Automatic microblog-oriented unknown word recognition with unsupervised method,” Chinese Journal of Electronics, vol.27, no.1, pp.1–8, 2018. DOI: 10.1049/cje.2017.11.004
[16] N. Xi, X. Y. Dai, S. J. Huang, et al., “Discriminative word alignment over multiple word segmentations,” Chinese Journal of Electronics, vol.23, no.2, pp.263–279, 2014.
[17] L. J. Zhao, Q. Zhang, P. Wang, et al., “Neural networks incorporating unlabeled and partially-labeled data for cross-domain Chinese word segmentation,” in Proc. of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, pp.4602–4608, 2018.
[18] X. B. Wang, D. Cai, L. L. Li, et al., “Unsupervised learning helps supervised neural word segmentation,” in Proc. of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, USA, vol.33, no.1, pp.7200–7207, 2019.
[19] X. P. Qiu, H. Z. Pei, H. Yan, et al., “A concise model for multi-criteria Chinese word segmentation with transformer encoder,” in Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, Association for Computational Linguistics, Online, pp.2887–2897, 2020.
[20] Z. Y. Bao, S. Li, S. Gao, et al., “Neural domain adaptation with contextualized character embedding for Chinese word segmentation,” in Proc. of the Sixth Conference on Natural Language Processing and Chinese Computing, Dalian, China, pp.419–430, 2017.
[21] Y. X. Ye, W. K. Li, Y. Zhang, et al., “Improving cross-domain Chinese word segmentation with word embeddings,” in Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers), Minneapolis, Minnesota, USA, pp.2726–2735, 2019.
[22] Y. X. Meng, W. Wu, F. Wang, et al., “Glyce: Glyph-vectors for Chinese character representations,” in Proc. of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, pp.2742–2753, 2019.
[23] K. Y. Huang, D. G. Huang, Z. Liu, et al., “A joint multiple criteria model in transfer learning for cross-domain Chinese word segmentation,” in Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, pp.3873–3882, 2020.
[24] Y. H. Tian, Y. Song, F. Xia, et al., “Improving Chinese word segmentation with wordhood memory networks,” in Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, pp.8274–8285, 2020.
[25] Q. Zhang, X. Y. Liu, and J. L. Fu, “Neural networks incorporating dictionaries for Chinese word segmentation,” in Proc. of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, pp.5682–5689, 2018.
[26] J. Yang, Y. Zhang, and S. L. Liang, “Subword encoding in lattice LSTM for Chinese word segmentation,” in Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, USA, pp.2720–2725, 2019.
[27] J. X. Liu, F. Z. Wu, C. H. Wu, et al., “Neural Chinese word segmentation with dictionary,” Neurocomputing, vol.338, pp.46–54, 2019. DOI: 10.1016/j.neucom.2019.01.085
[28] J. X. Liu, F. Z. Wu, C. H. Wu, et al., “Neural Chinese word segmentation with lexicon and unlabeled data via posterior regularization,” in Proc. of The World Wide Web Conference, San Francisco, CA, USA, pp.3013–3019, 2019.
[29] J. Zhou, G. Q. Cui, S. D. Hu, et al., “Graph neural networks: A review of methods and applications,” AI Open, vol.1, pp.57–81, 2020. DOI: 10.1016/j.aiopen.2021.01.001
[30] R. X. Ding, P. J. Xie, X. Y. Zhang, et al., “A neural multi-digraph model for Chinese NER with gazetteers,” in Proc. of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp.1462–1467, 2019.
[31] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proc. of the 5th International Conference on Learning Representations, Toulon, France, 2017.
[32] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proc. of the 5th International Conference on Learning Representations, Toulon, France, arXiv:1609.02907, 2017.
[33] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM networks,” in Proc. of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, vol.4, pp.2047–2052, 2005.
[34] T. Emerson, “The second international Chinese word segmentation Bakeoff,” in Proc. of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, pp.123–133, 2005.
[35] N. W. Xue, F. Xia, F. D. Chiou, et al., “The Penn Chinese TreeBank: Phrase structure annotation of a large corpus,” Natural Language Engineering, vol.11, no.2, pp.207–238, 2005. DOI: 10.1017/S135132490400364X
[36] H. M. Zhao and Q. Liu, “The CIPS-SIGHAN CLP2010 Chinese word segmentation Bakeoff,” in Proc. of the CIPS-SIGHAN Joint Conference on Chinese Language Processing, Beijing, China, pp.199–209, 2010.
[37] A. Paszke, S. Gross, F. Massa, et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, pp.8024–8035, 2019.
[38] M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,” in ICLR Workshop on Representation Learning on Graphs and Manifolds, New Orleans, Louisiana, USA, arXiv:1903.02428, 2019.
[39] Y. M. Cui, W. X. Che, T. Liu, et al., “Pre-training with whole word masking for Chinese BERT,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.29, pp.3504–3514, 2021. DOI: 10.1109/TASLP.2021.3124365
[40] R. Dror, G. Baumer, S. Shlomov, et al., “The Hitchhiker’s guide to testing statistical significance in natural language processing,” in Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp.1383–1392, 2018.