ZHAO Lingling, WANG Junjie, WANG Chunyu, GUO Maozu. A Cross-Domain Ontology Semantic Representation Based on NCBI-blueBERT Embedding[J]. Chinese Journal of Electronics. doi: 10.1049/cje.2020.00.326

A Cross-Domain Ontology Semantic Representation Based on NCBI-blueBERT Embedding

doi: 10.1049/cje.2020.00.326
Funds: This work was supported by the National Natural Science Foundation of China (Nos. 62171164, 62102191, and 61872114).
  • Author Bio:

    ZHAO Lingling is an associate professor at the Faculty of Computing, Harbin Institute of Technology. She is also a commissioner of the Bioinformatics Committee of the China Computer Federation and of the Computational Design Committee of the Architectural Society of China. She received the Ph.D., M.S., and B.S. degrees from Harbin Institute of Technology. Her current research interests include machine learning and bioinformatics. She has published more than 40 academic papers. (Email: zhaoll@hit.edu.cn)

    WANG Junjie received the B.S. degree in information management and information system from the Institute of Disaster Prevention, China, in 2013, the M.S. degree in software engineering from the Harbin Institute of Technology, Harbin, China, in 2015, and the Ph.D. degree in computer science and technology from the Harbin Institute of Technology, Harbin, in 2020. Since December 2020, he has been a Lecturer with the School of Biomedical Engineering and Informatics, Nanjing Medical University, China. His current research interests include bioinformatics and deep learning. (Email: junjie2021@njmu.edu.cn)

    WANG Chunyu received his B.S., M.S., and Ph.D. degrees in computer science and technology from Harbin Institute of Technology. He is an associate professor at the Faculty of Computing, Harbin Institute of Technology. His current research interests include bioinformatics and machine learning. (Email: chunyu@hit.edu.cn)

  • Corresponding author: GUO Maozu received the Ph.D. degree from the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. He is currently a Professor with the School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China. His current research interests include machine learning, bioinformatics, and image processing. (Email: maozuguo@bucea.edu.cn)
  • Accepted Date: 2021-12-07
  • Available Online: 2021-12-22
  • Abstract: A common but critical task in the analysis of biological ontology data is comparing ontologies with one another. Numerous ontology-based semantic-similarity measures have been proposed for specific ontology domains, but comparing ontologies across domains remains a challenge. An ontology contains scientific natural-language descriptions of the biological aspect it covers. We therefore develop a new method for cross-domain semantic representation of biological ontologies, based on the natural language processing (NLP) representation model Bidirectional Encoder Representations from Transformers (BERT). The method uses the BERT model to represent ontologies at the word level as a set of vectors, which facilitates semantic analysis and comparison of the biomedical entities named in an ontology or associated with ontology terms. We evaluated the method in two experiments: calculating pairwise similarities between Disease Ontology (DO) and Human Phenotype Ontology (HPO) terms, and predicting pairwise protein interactions. The experimental results demonstrated comparable performance, which is promising for the development of NLP methods in biological data analysis.
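The idea of comparing terms through word-level vectors can be illustrated with a minimal sketch. It assumes that the per-word vectors of a term description are mean-pooled into one term vector and that two terms are compared by cosine similarity; the abstract confirms neither choice, and both are used here only for illustration. The token vectors are small made-up numbers standing in for BERT outputs, and the names `pool_term` and `cosine_similarity` are hypothetical helpers.

```python
import numpy as np

def pool_term(token_vectors):
    """Mean-pool per-token vectors of shape (n_tokens, dim) into one term vector."""
    return np.asarray(token_vectors, dtype=float).mean(axis=0)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

# Toy stand-ins for per-token embeddings (illustrative values, not BERT output).
do_term_tokens = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]  # hypothetical DO term description
hpo_term_tokens = [[2.0, 2.0, 2.0]]                  # hypothetical HPO term description
other_tokens = [[-1.0, 0.0, 1.0]]                    # hypothetical unrelated term

do_vec = pool_term(do_term_tokens)
hpo_vec = pool_term(hpo_term_tokens)
other_vec = pool_term(other_tokens)

# Vectors pointing in the same direction score near 1; orthogonal ones score near 0.
sim_close = cosine_similarity(do_vec, hpo_vec)
sim_far = cosine_similarity(do_vec, other_vec)
print(f"DO vs HPO: {sim_close:.3f}, DO vs other: {sim_far:.3f}")
```

Because both terms are mapped into the same vector space, the comparison does not depend on either ontology's graph structure, which is what makes a cross-domain comparison possible in principle.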

    Figures(7)  / Tables(4)
