YU Hao, HUANG Kaiyu, WANG Yu, HUANG Degen. Lexicon-Augmented Cross-Domain Chinese Word Segmentation with Graph Convolutional Network[J]. Chinese Journal of Electronics, 2022, 31(5): 949-957. DOI: 10.1049/cje.2021.00.363

Lexicon-Augmented Cross-Domain Chinese Word Segmentation with Graph Convolutional Network

Funds: This work was supported by the National Key Research and Development Program of China (2020AAA0108004) and the National Natural Science Foundation of China (U1936109, 61672127)
  • Author Bio:

    YU Hao: was born in 1996. He received the B.S. degree in computer science from Dalian University of Technology, China, in 2019. He is currently pursuing the master’s degree in computer science and technology at Dalian University of Technology. His research interests include natural language processing, Chinese word segmentation and machine translation. (Email: yuhaodlut@mail.dlut.edu.cn)

    HUANG Kaiyu: received the Ph.D. degree from the School of Computer Science at Dalian University of Technology in 2021. He is currently a postdoctoral researcher at the Institute for AI Industry Research (AIR) of Tsinghua University. His research interests include natural language processing and machine translation. (Email: kaiyuhuang@hotmail.com)

    WANG Yu: is currently an Associate Professor in the School of Software Technology, Dalian University of Technology. Her research interests include corpus linguistics, machine translation, and academic publishing and communication. She is an Executive Member of the China EAP Association and a Member of CCF. (Email: karan_wang@dlut.edu.cn)

    HUANG Degen: (corresponding author) received the B.S. degree in computer science from Fuzhou University, China, in 1986, and the M.S. and Ph.D. degrees in computer science from Dalian University of Technology, China, in 1988 and 2004, respectively. He is currently a Professor with the School of Computer Science, Dalian University of Technology. His research interests include natural language processing and machine translation. He is a Senior Member of CCF, CIPS, ACM, and CAAI, and an Associate Editor of the International Journal of Advanced Intelligence. (Email: huangdg@dlut.edu.cn)

  • Received Date: December 01, 2021
  • Accepted Date: January 25, 2022
  • Available Online: June 10, 2022
  • Published Date: September 04, 2022
  • Existing neural approaches have achieved significant progress for Chinese word segmentation (CWS). The performance of these methods tends to drop dramatically in cross-domain scenarios because of the data distribution mismatch across domains and the out-of-vocabulary (OOV) word problem. To address these two issues, this paper proposes a lexicon-augmented graph convolutional network for cross-domain CWS. The novel model can capture the information of word boundaries from all candidate words and utilize domain lexicons to alleviate the distribution gap across domains. Experimental results on the cross-domain CWS datasets (SIGHAN-2010 and TCM) show that the proposed method successfully models the information of domain lexicons for neural CWS approaches and helps to achieve competitive performance for cross-domain CWS. The two problems of cross-domain CWS can be effectively solved through various interactions between characters and candidate words based on graphs. Further, experiments on the CWS benchmarks (Bakeoff-2005) also demonstrate the robustness and efficiency of the proposed method.
  • Unlike most European written languages, the Chinese written language has no explicit delimiters to separate words in the context. Therefore, Chinese word segmentation is an essential task for Chinese downstream natural language processing (NLP) tasks.

    Chinese word segmentation (CWS) is conventionally formalized as a sequence labelling task[1]. The label of each character is predicted to denote the position of each character in a word. Recently, with the development of deep learning techniques, neural CWS approaches have achieved significant progress on in-domain CWS benchmarks (e.g., Bakeoff-2005)[2-5].

    However, the good performance of existing neural CWS approaches depends on large-scale annotated corpora. This raises two issues for cross-domain CWS: 1) Out of vocabulary (OOV) words: baseline models struggle to recognize OOV words because the training data contain no information about them. 2) Data distribution mismatch: existing supervised methods are trained on source-domain data in the cross-domain scenario, but source-domain and target-domain data have different distributions, so performance suffers when the methods are trained on source-domain data instead of target-domain data. For example, the Chinese character “证” is likely to mean “proof” in the news domain, whereas in traditional Chinese medicine texts it usually indicates “symptom.” In this situation, researchers need to manually annotate target-domain data to adapt the source-domain models, which is expensive and time-consuming.

    To deal with these two issues, we propose a lexicon-augmented graph convolutional network (LGCN) to incorporate dictionary information into a supervised neural CWS model. First, a lexicon-augmented graph is constructed for each input sentence using dictionaries. Then, we use the graph convolutional network (GCN) to capture the information of word boundaries and contextual features from external dictionaries.

    The contributions of this paper are as follows:

    • We propose LGCN to address the OOV word and data distribution mismatch problems. It is a straightforward and effective way to improve the performance of CWS in the cross-domain scenario.

    • Our method achieves noticeable improvement in both in-domain and cross-domain CWS scenarios, compared with baseline models and existing methods.

    • We further improve the accuracy of cross-domain segmentation in a specific domain by expanding the corresponding domain lexicon in the testing phase.

    Since Ref.[1] first formalized CWS as a sequence labelling task, statistical machine learning methods were widely employed for CWS in the early stage[6-8]. CWS has since been treated as supervised learning from annotated corpora. Ref.[6] further utilized the sequence labelling tool CRF (conditional random field) for CWS. Ref.[9] showed that different tag sets can lead to different segmentation performance. Some models based on variations of CRF achieved state-of-the-art performance[7-8, 10-12].

    With the development of neural networks, Ref.[13] proposed a sliding-window neural network CWS method, which for the first time verified the feasibility of applying deep learning methods to CWS tasks. Ref.[14] proposed to use an LSTM (long short-term memory) network to capture long-range dependencies, in response to the limitations of sliding windows. Ref.[3] designed a fast segmentation system based on a greedy strategy and achieved performance similar to statistical machine learning methods. Ref.[15] improved the recognition of unknown words using unsupervised methods. Some research attempted to employ the diverse and complementary knowledge in multiple word segmentations to alleviate word-based problems[16]. Ref.[4] showed that, with rigorous tuning, CWS can achieve good performance without overly complex model structures. Ref.[5] used a Gaussian-masked directional attention mechanism to strengthen the model on the basis of the greedy decoding algorithm. Ref.[17] proposed a concise but effective unified model based on the transformer encoder, which is fully shared across different segmentation criteria.

    Despite the great progress of neural network CWS, OOV words and data distribution mismatch remain obvious gaps in the cross-domain scenario. Some studies have attempted to utilize external resources such as dictionaries, multitask learning, unlabelled corpora, and pre-trained language models to solve this problem. Ref.[18] and Ref.[19] explored neural word segmentation methods that integrate unlabelled and partially labelled data. Ref.[20] introduced contextualized character embeddings for neural domain adaptation in CWS; the contextualized character embedding aims to capture the dimensions of the embedding that are useful for the target domain. Ref.[21] proposed a semi-supervised word-based approach to improving cross-domain CWS given a baseline segmenter.

    Some CWS methods fine-tuned pre-trained language models to improve the performance of CWS on both in-domain and cross-domain datasets[22-24]. Ref.[22] was the first to apply a pre-trained language model to the CWS task by presenting Glyce, glyph-vectors for Chinese character representations. Ref.[23] proposed a joint multiple criteria model based on the pre-trained language model, sharing all parameters to integrate different segmentation criteria into one model. Ref.[24] utilized a wordhood memory network to incorporate contextual features on the basis of multiple pre-trained language models.

    In particular, Ref.[25] defined several templates to construct feature vectors for each character based on dictionaries and contexts, and incorporated the information from dictionaries into a BiLSTM (bidirectional long short-term memory) model. Ref.[26] used a Lattice-LSTM model to incorporate subword lexicons and their pre-trained embeddings. Ref.[27] employed subsampling and negative sampling methods for word embeddings and achieved a 3.0% F-measure improvement on four datasets covering the novel and medicine domains. Ref.[28] proposed a neural approach for CWS that incorporates unlabelled data and a lexicon into model training as indirect supervision by regularizing the prediction space of CWS models. Recently, graph neural networks have been explored in several kinds of NLP tasks[29]. Ref.[30] utilized a graph neural network with a multi-graph structure to resolve ambiguities in Chinese NER by capturing the information that gazetteer lexicons offer. Our proposed method uses an external dictionary to alleviate the OOV problem. Inspired by this work, we attempt to incorporate lexicon information into the context representation with a graph neural network.

    The framework of our model is illustrated in Fig.1 and consists of two principal components: the lexicon-augmented GCN encoder, and the baseline CWS model with a context encoder and a decoder. We describe the construction of the lexicon-augmented graph, the architecture of the baseline model, and how it is integrated with the GCN, respectively.

    Figure  1.  Main architecture of the proposed framework. Consecutive nodes of the same colour are candidates in the lexicon. The nodes “B,” “M” and “E” are three additional nodes that represent “Begin,” “Mid” and “End” respectively. The symbol $\oplus$ denotes the concatenation operator

    Lexicon-augmented graph   As shown in Fig.1, we construct the graph $G$ based on the pre-defined lexicon. Given the input sentence X = [“糖尿病会引发并发症” (Diabetes can lead to complications)] containing nine characters, we utilize a pre-defined lexicon to extract candidate words in the sentence. The extracted word list is W = [“糖尿病” (diabetes), “会” (can), “引发” (cause), “并发” (concurrency), “并发症” (complications)]. The lexicon-augmented graph is defined as $G:=(V,E)$, where $V$ and $E$ are the sets of nodes and edges, respectively. Each character is represented as a character node $V_c$, and each pair of adjacent character nodes is connected by an undirected edge $E_c$.

    In addition, we introduce three additional nodes $V_d=\{V_B, V_M, V_E\}$, and the entire set of nodes is $V=\{V_c, V_d\}$. To extract lexicon information, we construct undirected edges between the character nodes of candidate words and the additional nodes. For a candidate word $w=[c_1, c_2, \ldots, c_i], i \ge 1$: when $i$ equals 1, the candidate word is a single-character word, and the graph connects both additional nodes $V_B$ and $V_E$ to the unique character node $V_{c_1}$ of this candidate word $w$. When the candidate word is a multi-character word, the graph connects the additional nodes $V_B$ and $V_E$ to the character nodes $V_{c_1}$ and $V_{c_i}$, respectively; in addition, the graph connects all character nodes of this word $w$ to the additional node $V_M$. For overlapping candidates (e.g., “并发” (concurrency) and “并发症” (complications)), undirected edges corresponding to both candidates are generated simultaneously in the graph, and only one of any duplicated edges is retained. Specifically, the character node “并” connects to the nodes $V_B$ and $V_M$, the character node “发” connects to the nodes $V_M$ and $V_E$, and the character node “症” connects to the nodes $V_M$ and $V_E$.
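    The following Python sketch illustrates this graph construction under stated assumptions: the lexicon is a plain set of words, candidate words are found by exhaustive substring matching up to a maximum word length, and the node indexing scheme (character nodes first, then the B, M, E nodes) is an illustrative choice not specified in the paper.

    def build_lexicon_graph(sentence, lexicon, max_word_len=5):
        """Return (num_nodes, edges) for one sentence.

        Nodes 0..n-1 are character nodes; the last three nodes are the
        additional B, M, E nodes. Edges are undirected and deduplicated.
        """
        n = len(sentence)
        B, M, E = n, n + 1, n + 2          # indices of the three additional nodes
        edges = set()

        def add(u, v):                      # store each undirected edge only once
            edges.add((min(u, v), max(u, v)))

        # 1) connect every pair of adjacent character nodes
        for i in range(n - 1):
            add(i, i + 1)

        # 2) match candidate words against the lexicon and link them to B/M/E
        for i in range(n):
            for j in range(i + 1, min(i + max_word_len, n) + 1):
                if sentence[i:j] not in lexicon:
                    continue
                if j - i == 1:              # single-character word: link to B and E
                    add(i, B)
                    add(i, E)
                else:                       # multi-character word
                    add(i, B)               # first character -> B
                    add(j - 1, E)           # last character  -> E
                    for k in range(i, j):   # every character -> M
                        add(k, M)
        return n + 3, sorted(edges)

    # Example with the sentence and lexicon from Fig.1
    lexicon = {"糖尿病", "会", "引发", "并发", "并发症"}
    num_nodes, edges = build_lexicon_graph("糖尿病会引发并发症", lexicon)
    print(num_nodes, edges)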

    GCN encoder The graph convolutional network[31] is a variation of the graph neural network (GNN[32]), which scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. After constructing the undirected lexicon-augmented graph G, we utilize the GCN to generate graph representation vectors. The layer-wise propagation rule of the GCN can be expressed as:

    $H^{(l+1)} = \sigma\big(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)}\big)$
    (1)
    $\tilde{D}_{ii} = \sum_{j}\tilde{A}_{ij}$
    (2)

    Here, $\tilde{A}=A+I_N$ is the adjacency matrix of the undirected lexicon-augmented graph $G$ with added self-connections, and $I_N$ is the identity matrix. $W^{(l)}$ is a layer-specific trainable weight matrix. $\sigma(\cdot)$ denotes an activation function, such as $\mathrm{ReLU}(\cdot)=\max(0,\cdot)$. $H^{(l)}$ is the matrix of activations in the $l$th layer; $H^{(0)}=u$, where $u$ is the unigram embedding of the sentence X. $\tilde{D}$ is the degree matrix of $\tilde{A}$. The GCN mainly contains two parts: the Laplacian-normalized adjacency matrix $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ and the layer-specific trainable weight matrix $W^{(l)}\in\mathbb{R}^{d_{H_{l}}\times d_{H_{l+1}}}$. The GCN can encode unstructured lexicon-augmented graphs and extract lexicon information.
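    As a concrete illustration, the following PyTorch sketch implements the propagation rule of Eq.(1) with a ReLU activation on a dense adjacency matrix; the class and variable names are illustrative assumptions, and the authors' implementation relies on PyTorch Geometric, whose GCNConv layer provides an equivalent operation.

    import torch
    import torch.nn as nn

    def normalized_adjacency(edges, num_nodes):
        """Build D^{-1/2} (A + I) D^{-1/2} as a dense matrix from an edge list."""
        A = torch.eye(num_nodes)                      # self-connections: A + I_N
        for u, v in edges:
            A[u, v] = 1.0
            A[v, u] = 1.0
        deg_inv_sqrt = A.sum(dim=1).pow(-0.5)         # diagonal of \tilde{D}^{-1/2}
        return deg_inv_sqrt.unsqueeze(1) * A * deg_inv_sqrt.unsqueeze(0)

    class GCN(nn.Module):
        def __init__(self, in_dim, hidden_dim, num_layers=2):
            super().__init__()
            dims = [in_dim] + [hidden_dim] * num_layers
            self.weights = nn.ModuleList(
                nn.Linear(dims[i], dims[i + 1], bias=False) for i in range(num_layers)
            )

        def forward(self, H, A_hat):
            # H: node features (num_nodes x in_dim); A_hat: normalized adjacency
            for W in self.weights:
                H = torch.relu(A_hat @ W(H))          # Eq.(1) with ReLU activation
            return H

    # Usage on the toy graph built above (character nodes plus B/M/E nodes):
    # A_hat = normalized_adjacency(edges, num_nodes)
    # g = GCN(in_dim=128, hidden_dim=256)(node_features, A_hat)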

    Following previous studies, we regard CWS as the character-based sequence labelling task. The framework predicts a tag that represents the position of each character in a word. We used a 4-tag set

    T = {B,E,M,S}
    (3)

    for prediction (e.g., tag “B” represents the first character in a word). The context encoder (BiLSTM[33]) that we utilize is a mainstream neural architecture for the sequence labelling task; owing to its design, BiLSTM is well suited to sequential annotation. Given a sentence $X=[x_1,\ldots,x_n]$, the vector representation

    $e_i=[u_i \oplus b_i \oplus g_i]$
    (4)

    for each character $x_i$ is the concatenation of the unigram embedding $u_i$, the bigram embedding $b_i$, and the matrix representation $g_i$ of the lexicon-augmented GCN encoder. The vector representation $e_i$ is fed into the BiLSTM, which extracts the sequential features $h=[h_1,\ldots,h_n]$:

    $\overrightarrow{h_i}=\overrightarrow{\mathrm{LSTM}}(e_i,\overrightarrow{h}_{i-1};\overrightarrow{\theta})$
    (5)
    $\overleftarrow{h_i}=\overleftarrow{\mathrm{LSTM}}(e_i,\overleftarrow{h}_{i+1};\overleftarrow{\theta})$
    (6)
    $h_i=[\overrightarrow{h_i}\oplus\overleftarrow{h_i}]$
    (7)

    where $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ are the trainable parameters of the forward and backward LSTMs, respectively.

    Then, the baseline model adopts a dense layer and the Softmax function to output the label sequence $y=[t_1,\ldots,t_n]$. The dense layer projects the BiLSTM hidden states $h_i$ onto the 4-tag set. Many previous studies used a CRF as the decoder to improve the performance of sequence labelling. Since the CRF has high time and space complexity, we adopt the Softmax function and greedy search as the decoding layer.
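    A minimal sketch of this baseline tagger, assuming the dimensions listed in Table 1, could look as follows; it illustrates the described architecture and is not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class BaselineTagger(nn.Module):
        def __init__(self, uni_dim=128, bi_dim=128, gcn_dim=256,
                     hidden_dim=512, num_layers=3, num_tags=4, dropout=0.2):
            super().__init__()
            self.bilstm = nn.LSTM(uni_dim + bi_dim + gcn_dim, hidden_dim,
                                  num_layers=num_layers, dropout=dropout,
                                  batch_first=True, bidirectional=True)
            self.dense = nn.Linear(2 * hidden_dim, num_tags)

        def forward(self, uni, bi, gcn):
            # uni, bi, gcn: (batch, seq_len, dim) character-level features
            e = torch.cat([uni, bi, gcn], dim=-1)   # Eq.(4): e_i = [u_i + b_i + g_i] concatenated
            h, _ = self.bilstm(e)                   # Eqs.(5)-(7): forward/backward states
            return self.dense(h)                    # logits over the 4-tag set {B, E, M, S}

        def decode(self, logits):
            # greedy search: pick the most probable tag for each character
            return logits.softmax(dim=-1).argmax(dim=-1)

    # Training uses the cross-entropy loss of Eq.(8), e.g.:
    # loss = nn.CrossEntropyLoss()(logits.view(-1, 4), gold_tags.view(-1))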

    The cross-entropy loss function can be expressed as:

    $\mathrm{Loss}(\hat{y},y)=-\sum_{x}y(x)\log\hat{y}(x)$
    (8)

    where $y$ denotes the gold label sequence, and $\hat{y}(x)$ denotes the output of the decoding layer.

    To evaluate our method, we conducted comparative experiments on both in-domain (Bakeoff-2005[34] and CTB6[35]) and cross-domain (SIGHAN-2010[36]) CWS datasets; their statistics are shown in Table 2. We randomly selected 10% of the sentences from the training set as the validation set for tuning hyperparameters and followed the settings of the “PKU” dataset for the cross-domain experiments. For consistency, we converted all punctuation, digits, and Latin letters into half-width, and continuous English characters and digits were converted into a unique token, similar to the previous work Ref.[3]. The evaluation metric for all experiments is the F-score. We used the simplified Chinese dictionary “Dict-I” and the traditional Chinese dictionary “Dict-II” derived from Jieba as the external dictionaries for the PKU, MSR, AS, CITYU, and CTB6 datasets.
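    The normalization step can be sketched as follows; the placeholder tokens and the exact full-width mapping are illustrative assumptions, since the paper only states that punctuation, digits, and Latin letters are converted to half-width and that continuous English characters and digits are mapped to a unique token.

    import re

    def to_halfwidth(text):
        # map full-width characters (U+FF01-U+FF5E) and the ideographic space to ASCII
        out = []
        for ch in text:
            code = ord(ch)
            if code == 0x3000:
                code = 0x20
            elif 0xFF01 <= code <= 0xFF5E:
                code -= 0xFEE0
            out.append(chr(code))
        return "".join(out)

    def normalize(sentence):
        s = to_halfwidth(sentence)
        s = re.sub(r"[A-Za-z]+", "<ENG>", s)   # continuous Latin letters -> one token
        s = re.sub(r"[0-9]+", "<NUM>", s)      # continuous digits        -> one token
        return s

    print(normalize("血糖值７．８ｍｍｏｌ／Ｌ"))  # -> "血糖值<NUM>.<NUM><ENG>/<ENG>"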

    Table  1.  The hyperparameter settings and search ranges
    Hyperparameter        Value / search range
    Lr                    [1E−3, 1E−4, 2E−5]
    Lr of BERT            [1E−4, 2E−5]
    Optimizer             AdamW
    Unigram dim           [128, 768]
    Bigram dim            128
    BiLSTM hidden dim     512
    GCN hidden dim        [256, 768]
    BiLSTM layers         3
    GCN layers            2
    Dropout               [0.2, 0.5]
    Batch size            [32, 64, 128, 256]
    Epochs                20

    We collected a traditional Chinese medicine domain lexicon “Dict-m” from the Sogou Cell Thesaurus for cross-domain word segmentation. We validate the performance of the model under three different unigram embeddings. The hyperparameters are shown in Table 1. The implementation of our proposed method is based on PyTorch[37] and PyTorch Geometric[38] in a GPU (Nvidia Tesla V100) environment.

    Table  2.  Size of different CWS datasets
    In-domain               PKU     MSR     AS      CITYU   CTB6
    Words   Train           1.1M    2.4M    5.5M    1.5M    0.7M
            Test            0.1M    0.1M    0.1M    40.9K   52K
    Chars   Train           1.8M    4.0M    8.2M    2.3M    1.2M
            Test            0.2M    0.2M    0.2M    65.9K   86K
    OOV rate (%)            6.14    2.42    3.89    7.15    5.63
    Cross-domain            Lit.    Com.    Med.    Fin.    TCM
    Words   Train           –       –       –       –       –
            Test            0.1M    0.1M    31K     33K     56.5K
    Chars   Train           –       –       –       –       –
            Test            50K     54K     51K     53K     90.2K
    OOV rate (%)            6.14    14.04   10.56   7.14    19.57
    Note: “PKU,” “MSR,” “AS,” and “CITYU” come from Bakeoff-2005, and CTB6 is a widely used benchmark; “Lit.,” “Com.,” “Med.,” and “Fin.” indicate the literature, computer, medicine, and finance domains of SIGHAN-2010, respectively. “TCM” indicates a new cross-domain dataset with a high OOV rate.

    We construct three baseline models to verify the validity of our proposed method “+LGCN” on SIGHAN-2010 in the cross-domain scenario. The three baseline models, which utilize different unigram embeddings, are denoted “Randinit,” “RoBERTa,” and “Finetune,” respectively. “Randinit” indicates that random initialization is used for the unigram embedding. “RoBERTa” indicates that we utilize pre-trained embeddings from a pre-trained language model (RoBERTa-WWM[39]). “Finetune” indicates that we fine-tune the parameters of RoBERTa-WWM.

    Experimental results on the four cross-domain datasets are shown in Table 3, where the overall F-scores and $R_{oov}$ are reported; $R_{oov}$ denotes the recall of OOV words. Several observations can be drawn from the results. First, the overall comparison indicates that our proposed method “+LGCN” outperforms the corresponding baseline model for all 12 pairs in terms of F-scores and for 11 pairs in terms of $R_{oov}$. We used the significance assessment script of Ref.[40] to perform significance tests, and eight of the experimental pairs passed. Second, our LGCN works smoothly under different unigram embeddings. For instance, when using random initialization of the unigram embedding, the lexicon-augmented GCN improves the F-score on the literature domain from 91.51% to 94.37%, and $R_{oov}$ from 73.34% to 78.03%. When “RoBERTa” or “Finetune” is used as the unigram embedding, the baseline system already performs very well, yet the improvement of LGCN on F-scores is still decent. Third, although the baseline model adopts the pre-trained language model (i.e., “Finetune”), which already encodes rich pre-trained knowledge, our proposed method “+LGCN” is able to provide further task-specific guidance for lexical information fusion and improves by an average of 0.26 percentage points over the four datasets. To further illustrate the validity and effectiveness of LGCN, we compare our best-performing model with previous studies on the cross-domain datasets. As shown in the first block of Table 3, the LGCN outperforms all existing models with respect to the F-scores and achieves state-of-the-art performance on all datasets.

    Table  3.  The F-scores on cross-domain benchmark datasets
    Model                 Lit.    Com.    Med.    Fin.    Avg.
    Ref.[13]   F          92.89   93.71   92.16   95.20   93.49
               Roov       –       –       –       –       –
    Ref.[3]    F          92.90   94.04   92.10   95.38   93.61
               Roov       –       –       –       –       –
    Ref.[18]   F          93.23   95.32   93.73   95.84   94.53
               Roov       –       –       –       –       –
    Ref.[25]   F          94.76   94.70   94.18   96.06   94.93
               Roov       –       –       –       –       –
    Ref.[23]   F          96.13   96.08   95.21   96.82   96.06
               Roov       –       –       –       –       –
    Randinit   F          91.51   92.46   89.89   94.34   92.05
               Roov       73.34   79.25   68.05   83.00   75.91
    +LGCN      F          94.37*  95.16*  93.71*  96.13*  94.84
               Roov       78.03   86.07   75.91   83.25   80.82
    RoBERTa    F          95.85   95.29   95.14   96.73   95.75
               Roov       83.47   86.31   81.25   89.38   85.10
    +LGCN      F          96.19*  95.67   95.38*  96.79   96.01
               Roov       85.95   88.08   82.08   87.87   86.00
    Finetune   F          96.52   96.08   95.50   96.88   96.24
               Roov       85.08   88.26   82.44   89.17   86.24
    +LGCN      F          96.60   96.50*  95.64   97.25*  96.50
               Roov       85.91   89.16   82.68   89.25   86.75
    Note: The first block includes the latest domain adaptive models. The maximum evaluation value for each pair of comparisons is shown in bold. The “*” indicates that the group of experiments passed the significance test.

    Experimental results on the five benchmark datasets are shown in Table 4 and exhibit a trend similar to that in Table 3. Overall, the comparison demonstrates that our proposed method (+LGCN) outperforms the baseline model for all 15 pairs in terms of F-scores and for 14 pairs in terms of $R_{oov}$. Eleven pairs of experiments passed the significance test. Our LGCN effectively improves the performance of all three baseline models. To further demonstrate the performance of our proposed model on the benchmark datasets, we compared our best-performing model with previous studies on the same benchmark datasets. The comparison is presented in the first block of Table 4, where the lexicon-augmented GCN outperforms all existing models with respect to F-scores and achieves state-of-the-art performance on all benchmark datasets.

    Table  4.  The accuracy values on in-domain benchmark datasets
    Model                 PKU     MSR     AS      CITYU   CTB6
    Ref.[4]      F        96.1    98.1    96.2    97.2    96.7
                 Roov     78.8    80.0    70.7    87.5    85.4
    Ref.[25]     F        96.5    97.8    95.9    96.3    96.4
                 Roov     –       –       –       –       –
    Lattice[26]  F        95.8    97.8    –       –       96.1
                 Roov     –       –       –       –       –
    Glyce[22]    F        96.7    98.3    96.7    97.9    –
                 Roov     –       –       –       –       –
    Ref.[17]     F        96.41   98.05   96.44   96.91   96.99
                 Roov     78.91   78.92   76.39   86.91   87.00
    Ref.[24]     F        96.53   98.40   96.62   97.93   97.25
                 Roov     85.36   84.87   79.64   90.15   88.46
    Ref.[23]     F        96.85   98.29   –       –       97.56
                 Roov     82.35   81.75   –       –       88.02
    Randinit     F        95.27   96.99   95.48   95.40   95.87
                 Roov     81.36   60.75   68.07   76.78   78.76
    +LGCN        F        96.22*  97.79*  96.04*  96.13*  96.48*
                 Roov     84.91   75.70   75.14   81.63   82.39
    RoBERTa      F        96.85   97.88   96.55   97.42   97.45
                 Roov     89.00   73.65   77.82   87.84   88.47
    +LGCN        F        97.03*  98.19*  96.88*  97.47   97.55*
                 Roov     89.38   77.90   81.98   87.81   89.05
    Finetune     F        97.07   98.41   96.76   98.00   97.81
                 Roov     89.13   85.67   78.96   91.09   90.56
    +LGCN        F        97.26   98.58*  97.07*  98.10   97.82
                 Roov     89.63   86.10   83.28   92.11   90.66
    Note: Results on the benchmarks Bakeoff-2005 and CTB6, compared with previous methods. The maximum evaluation value for each pair of comparisons is shown in bold. The “*” indicates that the group of experiments passed the significance test.

    Since OOV words in a specific domain may not be included in a general-purpose dictionary, further experiments are required to investigate the robustness of our method with a domain lexicon, i.e., on OOV words in a specific domain. In the above benchmark experiments, the simplified lexicon “Dict-I” lacks sufficient domain words for the traditional Chinese medicine domain, so a domain lexicon needs to be added to strengthen the domain information at test time. Hence, we built and manually annotated a traditional Chinese medicine domain dataset (TCM) with a high OOV rate (following the “PKU” segmentation criterion); its size is shown in Table 2. Based on the three baseline models, we conducted experiments to observe the effect of using expanded lexicons on the “TCM” dataset. The F-scores are shown in Fig.2, where “non-Dict” denotes a BiLSTM model without any lexicon, and “Dict-I” and “Dict-I+Dict-m” denote using different lexicons during the testing phase based on the same model trained with “Dict-I.”

    Figure  2.  The F-scores of LGCN using three different unigram embeddings and two different lexicons on traditional Chinese medicine domain dataset “TCM”

    As expected, our proposed method can utilize lexicons to capture lexicon knowledge and improve the performance of the corresponding baseline model. Moreover, adding the domain lexicon only in the testing phase can further improve the performance of the model in the corresponding domain, without retraining the model.

    We first looked at the effect of dictionary size. We randomly selected 25%, 50%, 75%, 80%, 90%, and 95% of the words from the original dictionary “Dict-I” to build new dictionaries of different sizes. Fig.3 shows the F-scores of LGCN with these dictionaries on “Randinit.” As Fig.3 suggests, the performance of the model gradually improves as the dictionary size increases.

    Figure  3.  The F-scores of LGCN on benchmark datasets “PKU,” “MSR,” and “CTB6” using multiple dictionaries of different sizes

    To investigate how the proposed framework learns from the LGCN, we choose an example input sentence “四神聪/位于/头部/阙阴/区” (Sishengcong is located in the Deficiency Yin area of the head) from the traditional Chinese medicine domain as a case study. In this sentence, “四神聪” (Sishengcong) and “阙阴” (Deficiency Yin) are acupoint terminology in traditional Chinese medicine. These two words are OOV words and are collected in the domain lexicon. The baseline model segments this sentence into “四神/聪位于/头部/阙阴区,” which leads to ambiguity. In the experiments on the “TCM” dataset, we added a traditional Chinese medicine domain dictionary to the simplified lexicon “Dict-I” used in the training phase, and our model correctly identified these two words. This illustrates the ability of our model to benefit from different lexicons during the testing phase and to improve performance in cross-domain scenarios.

    We counted the number of errors on the cross-domain benchmark dataset “Com.” The statistics are shown in Fig.4, where “Error in total” represents the total number of errors. “Merging error” means that multiple words are mistakenly merged; for example, “生态/网 (ecological network)” is mistakenly merged into “生态网 (ecological network).” “Splitting error” means that a single word is incorrectly split; for example, “头文件 (header file)” is incorrectly split into “头/文件 (head/file).” All these errors are counted automatically by scripts. “In” denotes words occurring in the training corpus, and “Out” denotes OOV words. “I_dict” denotes words appearing in both the lexicon and the training corpus, and “O_dict” denotes OOV words appearing in the lexicon. Compared with the baseline, errors in most categories decreased with our LGCN model, while “Merging error” increased for “O_dict.” This indicates that our model prefers to merge OOV words that appear in the lexicon.

    Figure  4.  The number of errors on computer domain dataset “Com.”

    We empirically analyse the “Error in total” in Fig.4 from four aspects. The first is annotation inconsistency or incorrect annotation. For instance, “个人所得税 (individual income tax)” is annotated as one word in the training dataset but as “个人 (individual)/所得税 (income tax)” in the test dataset. The word “动脉注射 (arterial injection)” in the training corpus is labelled as “动脉 (artery)/注射 (injection),” but the similar word “静脉注射 (intravenous injection)” in the test corpus is regarded as one word. The suffix “店 (shop)” in “食品店 (food shop)” and “玩具店 (toy shop)” is annotated as part of one word in the training corpus, whereas “热狗店 (hot dog shop)” and “饲料店 (feed shop)” are annotated as “热狗 (hot dog)/店 (shop)” and “饲料 (fodder)/店 (shop),” which makes it hard for the model to identify such suffixes. In addition, the string “蜀南 (southern Sichuan)/竹海 (bamboo sea)/风景区 (scenic spot)” is wrongly labelled in the test corpus as “蜀 (Sichuan)/南竹 (southern bamboo)/海风 (sea breeze)/景区 (scenic spot).” The second reason is that the segmentation criteria of the cross-domain dataset differ from those of the in-domain training set. For instance, “重 (weight)/启 (open)” (reboot) is considered one word in the cross-domain dataset, but is treated as two words in the in-domain dataset “PKU.” The third reason is that the model sometimes hesitates on affixes. For instance, “的” appears 53890 times as a single-character word in the training corpus and only 623 times as a suffix; the former is nearly 100 times the latter, so the model annotates “沉沉的 (sunken)” as “沉沉 (heavy)/的 (of).” Similarly, “茶花女 (La Traviata)” and “茶花 (camellia)” are OOV words in the PKU training dataset but occur in the simplified dictionary, and “茶花女 (La Traviata)” is annotated as “茶花 (camellia)/女 (woman)” because “女 (woman)” occurs 523 times as a single-character word. Lastly, OOV words are not thoroughly covered by our lexicon.

    To make better use of lexicon information, we explore the application of the graph convolutional network to Chinese word segmentation. We propose the LGCN model to deal with the data distribution mismatch and OOV word issues in cross-domain scenarios. We first construct a lexicon-augmented graph for a sentence and then introduce the GCN to extract the information. In this way, the word boundary information in the sentence is explicitly represented, and OOV words and domain-specific words are handled well. Benchmark experiment results show that our model outperforms previous models on cross-domain datasets and achieves state-of-the-art performance on the benchmark datasets. To test the performance of our model with domain lexicons in cross-domain scenarios, we further experimented on the traditional Chinese medicine dataset. The results show that our method benefits from domain lexicons and achieves further improvement in this cross-domain scenario.

    In summary, our model utilizes external dictionaries to improve the accuracy of the CWS task, and can further improve segmentation accuracy in a specific domain by expanding the corresponding domain lexicon in the testing phase. We will continue to study the efficiency of our LGCN, focusing on how to reduce the time complexity of segmentation and how to alleviate dataset annotation inconsistency.

  • [1] N. W. Xue, “Chinese word segmentation as character tagging,” International Journal of Computational Linguistics & Chinese Language Processing, vol.8, no.1, pp.29–48, 2003.
    [2] D. Cai and H. Zhao, “Neural word segmentation learning for Chinese,” in Proc. of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long papers), Berlin, Germany, pp.409–420, 2016.
    [3] D. Cai, H. Zhao, Z. S. Zhang, et al., “Fast and accurate neural word segmentation for Chinese,” in Proc. of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short papers), Vancouver, Canada, pp.608–615, 2017.
    [4] J. Ma, K. Ganchev, and D. Weiss, “State-of-the-art Chinese word segmentation with Bi-LSTMs,” in Proc. of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp.4902–4908, 2018.
    [5] S. F. Duan and H. Zhao, “Attention is all you need for Chinese word segmentation,” in Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, pp.3862–3872, 2020.
    [6] F. C. Peng, F. F. Feng, and A. McCallum, “Chinese segmentation and new word detection using conditional random fields,” in Proc. of the 20th International Conference on Computational Linguistics (COLING), Geneva, Switzerland, pp.562–568, 2004.
    [7] H. Tseng, P. Chang, G. Andrew, et al., “A conditional random field word segmenter for Sighan Bakeoff 2005,” in Proc. of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, pp.168–171, 2005.
    [8] H. Zhao, C. N. Huang, M. Li, et al., “A unified character-based tagging framework for Chinese word segmentation,” ACM Transactions on Asian Language Information Processing (TALIP), vol.9, no.2, pp.1–32, 2010. DOI: 10.1145/1781134.1781135
    [9] H. Zhao, C. N. Huang, M. Li, et al., “Effective tag set selection in Chinese word segmentation via conditional random field modeling,” in Proc. of the 20th Pacific Asia Conference on Language, Information and Computation, Wuhan, China, pp.87–94, 2006.
    [10] H. Zhao and C. Y. Kit, “Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition,” in Proc. of the Sixth SIGHAN Workshop on Chinese Language Processing, Hyderabad, India, pp.106–111, 2008.
    [11] X. Sun, H. F. Wang, and W. J. Li, “Fast online training with frequency-adaptive learning rates for Chinese word segmentation and new word detection,” in Proc. of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long papers), Jeju Island, Korea, pp.253–262, 2012.
    [12] L. K. Zhang, H. F. Wang, X. Sun, et al., “Exploring representations from unlabeled data with co-training for Chinese word segmentation,” in Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp.311–321, 2013.
    [13] X. Q. Zheng, H. Y. Chen, and T. Y. Xu, “Deep learning for Chinese word segmentation and POS tagging,” in Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp.647–657, 2013.
    [14] X. C. Chen, X. P. Qiu, C. X. Zhu, et al., “Long short-term memory neural networks for Chinese word segmentation,” in Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp.1197–1206, 2015.
    [15] D. G. Huang, J. Zhang, and K. Y. Huang, “Automatic microblog-oriented unknown word recognition with unsupervised method,” Chinese Journal of Electronics, vol.27, no.1, pp.1–8, 2018. DOI: 10.1049/cje.2017.11.004
    [16] N. Xi, X. Y. Dai, S. J. Huang, et al., “Discriminative word alignment over multiple word segmentations,” Chinese Journal of Electronics, vol.23, no.2, pp.263–279, 2014.
    [17] L. J. Zhao, Q. Zhang, P. Wang, et al., “Neural networks incorporating unlabeled and partially-labeled data for cross-domain Chinese word segmentation,” in Proc. of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, pp.4602–4608, 2018.
    [18] X. B. Wang, D. Cai, L. L. Li, et al., “Unsupervised learning helps supervised neural word segmentation,” in Proc. of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, USA, vol.33, no.1, pp.7200–7207, 2019.
    [19] X. P. Qiu, H. Z. Pei, H. Yan, et al., “A concise model for multi-criteria Chinese word segmentation with transformer encoder,” in Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, Association for Computational Linguistics, Online, pp.2887–2897, 2020.
    [20] Z. Y. Bao, S. Li, S. Gao, et al., “Neural domain adaptation with contextualized character embedding for Chinese word segmentation,” in Proc. of the Sixth Conference on Natural Language Processing and Chinese Computing, Dalian, China, pp.419–430, 2017.
    [21] Y. X. Ye, W. K. Li, Y. Zhang, et al., “Improving cross-domain Chinese word segmentation with word embeddings,” in Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and short papers), Minneapolis, Minnesota, pp.2726–2735, 2019.
    [22] Y. X. Meng, W. Wu, F. Wang, et al., “Glyce: Glyph-vectors for Chinese character representations,” in Proc. of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, pp.2742–2753, 2019.
    [23] K. Y. Huang, D. G. Huang, Z. Liu, et al., “A joint multiple criteria model in transfer learning for cross-domain Chinese word segmentation,” in Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, pp.3873–3882, 2020.
    [24] Y. H. Tian, Y. Song, F. Xia, et al., “Improving Chinese word segmentation with wordhood memory networks,” in Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, pp.8274–8285, 2020.
    [25] Q. Zhang, X. Y. Liu, and J. L. Fu, “Neural networks incorporating dictionaries for Chinese word segmentation,” in Proc. of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, pp.5682–5689, 2018.
    [26] J. Yang, Y. Zhang, and S. L. Liang, “Subword encoding in lattice LSTM for Chinese word segmentation,” in Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, USA, pp.2720–2725, 2019.
    [27] J. X. Liu, F. Z. Wu, C. H. Wu, et al., “Neural Chinese word segmentation with dictionary,” Neurocomputing, vol.338, pp.46–54, 2019. DOI: 10.1016/j.neucom.2019.01.085
    [28] J. X. Liu, F. Z. Wu, C. H. Wu, et al., “Neural Chinese word segmentation with lexicon and unlabeled data via posterior regularization,” in Proc. of The World Wide Web Conference, San Francisco, CA, USA, pp.3013–3019, 2019.
    [29] J. Zhou, G. Q. Cui, S. D. Hu, et al., “Graph neural networks: A review of methods and applications,” AI Open, vol.1, pp.57–81, 2020. DOI: 10.1016/j.aiopen.2021.01.001
    [30] R. X. Ding, P. J. Xie, X. Y. Zhang, et al., “A neural multi-digraph model for Chinese NER with gazetteers,” in Proc. of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp.1462–1467, 2019.
    [31] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proc. of the 5th International Conference on Learning Representations, Toulon, France, 2017.
    [32] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proc. of the 5th International Conference on Learning Representations, Toulon, France, arXiv:1609.02907, 2017.
    [33] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM networks,” in Proc. of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, vol.4, pp.2047–2052, 2005.
    [34] T. Emerson, “The second international Chinese word segmentation Bakeoff,” in Proc. of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, pp.123–133, 2005.
    [35] N. W. Xue, F. Xia, F. D. Chiou, et al., “The Penn Chinese TreeBank: Phrase structure annotation of a large corpus,” Natural Language Engineering, vol.11, no.2, pp.207–238, 2005. DOI: 10.1017/S135132490400364X
    [36] H. M. Zhao and Q. Liu, “The CIPS-SIGHAN CLP2010 Chinese word segmentation Bakeoff,” in Proc. of the CIPS-SIGHAN Joint Conference on Chinese Language Processing, Beijing, China, pp.199–209, 2010.
    [37] A. Paszke, S. Gross, F. Massa, et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, pp.8024–8035, 2019.
    [38] M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,” in ICLR Workshop on Representation Learning on Graphs and Manifolds, New Orleans, Louisiana, USA, arXiv:1903.02428, 2019.
    [39] Y. M. Cui, W. X. Che, T. Liu, et al., “Pre-training with whole word masking for Chinese BERT,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.29, pp.3504–3514, 2021. DOI: 10.1109/TASLP.2021.3124365
    [40] D. Rotem, B. Gili, S. Segev, et al., “The Hitchhiker’s guide to testing statistical significance in natural language processing,” in Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long papers), Melbourne, Australia, pp.1383–1392, 2018.
