Volume 30 Issue 5
Sep.  2021
WANG Chao, ZOU Quan. A Machine Learning Method for Differentiating and Predicting Human-Infective Coronavirus Based on Physicochemical Features and Composition of the Spike Protein[J]. Chinese Journal of Electronics, 2021, 30(5): 815-823. doi: 10.1049/cje.2021.06.003

A Machine Learning Method for Differentiating and Predicting Human-Infective Coronavirus Based on Physicochemical Features and Composition of the Spike Protein

doi: 10.1049/cje.2021.06.003

This work was supported by the National Natural Science Foundation of China (Nos. 61922020, 61771331, and 62002051).

  • Received Date: 2020-09-11
    Available Online: 2021-09-02
  • Several coronaviruses (CoVs) are epidemic pathogens that cause severe respiratory syndromes and are associated with significant morbidity and mortality. In this paper, a machine learning method is developed to predict the risk of human infection posed by CoVs, serving as an early warning system. The proposed Spike-SVM (support vector machine) model achieved an accuracy of 97.36% in classifying human-infective CoVs (HCoVs) and nonhuman-infective CoVs (Non-HCoVs). The most informative features for discriminating HCoVs from Non-HCoVs were identified. Spike-SVM is anticipated to be a useful bioinformatics tool for predicting the infection risk that CoVs pose to humans.
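The abstract says the model uses composition-based features of the spike protein but does not list the exact descriptors. As a minimal sketch, assuming amino acid composition and dipeptide composition are among them, the Python below turns a protein sequence into a fixed-length numeric vector. The function names (`amino_acid_composition`, `dipeptide_composition`, `spike_features`) and the toy sequence are hypothetical and are not taken from the paper.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STANDARD = set(AMINO_ACIDS)


def amino_acid_composition(sequence):
    """Return 20 fractions: the share of each standard residue in the sequence."""
    residues = [r for r in sequence.upper() if r in STANDARD]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return 400 fractions: the share of each adjacent residue pair.

    Pairs containing a non-standard symbol (for example 'X') are skipped,
    and the fractions are normalized by the number of pairs that remain.
    """
    seq = sequence.upper()
    counts = {}
    total = 0
    for first, second in zip(seq, seq[1:]):
        if first in STANDARD and second in STANDARD:
            counts[first + second] = counts.get(first + second, 0) + 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid dipeptides")
    return [counts.get(a + b, 0) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]


def spike_features(sequence):
    """Concatenate both descriptors into one 420-dimensional feature vector."""
    return amino_acid_composition(sequence) + dipeptide_composition(sequence)


if __name__ == "__main__":
    toy_sequence = "MFVFLVLLPLVSSQ"  # illustrative fragment only, not a full spike protein
    vector = spike_features(toy_sequence)
    print(len(vector))
```

Vectors of this kind, one per spike sequence and labeled HCoV or Non-HCoV, could then be passed to any SVM implementation for training. The kernel, hyperparameters, and feature selection used by Spike-SVM are not given in the abstract and are not assumed here.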