Volume 31 Issue 5
Sep.  2022
ZHAO Lingling, WANG Junjie, WANG Chunyu, et al., “A Cross-Domain Ontology Semantic Representation Based on NCBI-BlueBERT Embedding,” Chinese Journal of Electronics, vol. 31, no. 5, pp. 860-869, 2022, doi: 10.1049/cje.2020.00.326

A Cross-Domain Ontology Semantic Representation Based on NCBI-BlueBERT Embedding

doi: 10.1049/cje.2020.00.326
Funds: This work was supported by the National Natural Science Foundation of China (62171164, 62102191, 61872114, 62131004).
  • Author Bio:

    ZHAO Lingling is an Associate Professor with the Faculty of Computing, Harbin Institute of Technology. She is a Commissioner of the Bioinformatics Committee of the China Computer Federation and a Commissioner of the Computational Design Committee of the Architectural Society of China. She received the Ph.D., M.S., and B.S. degrees from Harbin Institute of Technology. Her current research interests include machine learning and bioinformatics. She has published more than 40 academic papers. (Email: zhaoll@hit.edu.cn)

    WANG Junjie received the B.S. degree in information management and information systems from the Institute of Disaster Prevention, China, in 2013, the M.S. degree in software engineering from the Harbin Institute of Technology, Harbin, China, in 2015, and the Ph.D. degree in computer science and technology from the Harbin Institute of Technology, Harbin, in 2020. Since December 2020, he has been a Lecturer with the School of Biomedical Engineering and Informatics, Nanjing Medical University, China. His current research interests include bioinformatics and deep learning. (Email: junjie2021@njmu.edu.cn)

    WANG Chunyu received the B.S., M.S., and Ph.D. degrees in computer science and technology from Harbin Institute of Technology. He is an Associate Professor with the Faculty of Computing, Harbin Institute of Technology. His current research interests include bioinformatics and machine learning. (Email: chunyu@hit.edu.cn)

    (corresponding author) received the Ph.D. degree from the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. He is currently a Professor with the School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China. His current research interests include machine learning, bioinformatics, and image processing. (Email: maozuguo@bucea.edu.cn)

  • Received Date: 2020-09-30
  • Accepted Date: 2021-12-07
  • Available Online: 2021-12-22
  • Publish Date: 2022-09-05
  • A common but critical task in the analysis of biological ontology data is comparing ontologies. Numerous ontology-based semantic similarity measures have been proposed for specific ontology domains, but comparing ontologies across domains remains a challenge. An ontology contains scientific natural language descriptions of the corresponding biological aspect. We therefore develop a new method for cross-domain semantic representation of biological ontologies, based on the natural language processing (NLP) representation model bidirectional encoder representations from transformers (BERT). The method uses the BERT model to represent ontologies at the word level as a set of vectors, which facilitates semantic analysis and comparison of the biomedical entities named in an ontology or associated with ontology terms. We evaluated the method in two experiments: calculating pairwise similarities between disease ontology and human phenotype ontology terms, and predicting pairwise protein interactions. The experimental results demonstrated comparable performance. This shows the promise of NLP methods for biological data analysis.
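As a rough illustration of how word-level vectors could be compared across two ontologies, the sketch below averages the token vectors of each term and scores the pair with cosine similarity. Both the mean pooling and the cosine measure are assumptions made for this example; the abstract does not state which pooling or similarity function the authors use. The helper names (`mean_pool`, `cosine_similarity`) and the toy 3-dimensional vectors are hypothetical stand-ins for real BERT outputs.

```python
import math


def mean_pool(token_vectors):
    """Average equal-length token vectors into a single term vector (assumed pooling)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical token vectors for one disease ontology term and one
# phenotype ontology term. The values are made up for illustration;
# a real pipeline would take them from a BERT encoder.
disease_term_tokens = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]
phenotype_term_tokens = [[2.0, 2.0, 2.0]]

disease_vec = mean_pool(disease_term_tokens)
phenotype_vec = mean_pool(phenotype_term_tokens)
similarity = cosine_similarity(disease_vec, phenotype_vec)
print(f"cross-ontology similarity: {similarity:.3f}")
```

Because both terms are mapped into the same vector space, the same comparison applies whether the two terms come from one ontology or from different ones, which is the property a cross-domain representation relies on.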
    Figures(7)  / Tables(4)
