Volume 32 Issue 1
Jan.  2023
Citation: KONG Zixiao, XUE Jingfeng, WANG Yong, ZHANG Qian, HAN Weijie, ZHU Yufen. MalFSM: Feature Subset Selection Method for Malware Family Classification[J]. Chinese Journal of Electronics, 2023, 32(1): 26-38. doi: 10.23919/cje.2022.00.038

MalFSM: Feature Subset Selection Method for Malware Family Classification

doi: 10.23919/cje.2022.00.038
Funds: This work was supported by the National Natural Science Foundation of China (62172042), the Major Scientific and Technological Innovation Projects of Shandong Province (2020CXGC010116), and the National Key Research & Development Program of China (2020YFB1712104).
  • Author Bio:

    Zixiao KONG was born in 1996. She received the B.S. degree in software engineering and is enrolled in a combined master's and doctoral program at Beijing Institute of Technology, majoring in cyberspace security. Her research interests include cyber security and machine learning. (Email: 3120185534@bit.edu.cn)

    Jingfeng XUE was born in 1975. He is a Professor and Ph.D. Supervisor at Beijing Institute of Technology. His main research interests are network security, data security, and software security. (Email: xuejf@bit.edu.cn)

    Yong WANG was born in 1975. She is an Associate Professor at Beijing Institute of Technology. Her main research interests are cyber security and machine learning. (Email: wangyong@bit.edu.cn)

    Qian ZHANG (corresponding author) was born in Inner Mongolia, China, in 1986. She graduated from the School of Software, Beijing Institute of Technology (BIT), in 2012. She is now an Assistant Experimentalist at BIT, and her research interests include software security and software testing. (Email: zhangqian16@bit.edu.cn)

    Weijie HAN received the B.E. and M.E. degrees from Space Engineering University in 2003 and 2006, respectively, and the Ph.D. degree from BIT in 2020. He is currently a Lecturer at Space Engineering University. His current research interests include malware detection and APT detection. (Email: bit_hwj2016@126.com)

    Yufen ZHU received the B.E. degree in 2006. She is currently an Engineer at the Software Evaluation Center of Beijing Institute of Technology. Her research interests include malware analysis and network anomaly detection. (Email: visc_hwj@126.com)

  • Received Date: 2022-03-09
  • Accepted Date: 2022-05-13
  • Available Online: 2022-05-27
  • Publish Date: 2023-01-05
  • Abstract: Malware detection has long been a focus of cyberspace security and academic research. Malware authors use obfuscation techniques to generate large numbers of malware variants, which imposes a heavy analysis burden on security researchers and consumes substantial resources in both time and space. To this end, we propose the MalFSM framework. We investigate the correlation between the opcode features of malicious samples and perform feature extraction, selection, and fusion, filtering out redundant features to alleviate the curse of dimensionality and to identify and classify malware families efficiently. Using the feature selection method, we reduce the 735 opcode features contained in the Kaggle dataset to 16 and then fuse them with two metadata features (file line count and file size), for a total of 18 features; machine learning classification on this feature subset is both efficient and highly accurate. We also interpret the selected features. Our comprehensive experiments on the Microsoft Kaggle malware dataset show that MalFSM reaches a classification accuracy of up to 98.6% with a classification time of only 7.76 s.
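The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea (filter redundant opcode features, then fuse the survivors with file metadata). It assumes a simple greedy filter based on variance ranking and Pearson correlation, which is not necessarily the method MalFSM uses. The function names, the 0.9 correlation cutoff, and the toy opcode counts are all hypothetical.

```python
import math


def variance(col):
    """Population variance of one feature column."""
    mean = sum(col) / len(col)
    return sum((v - mean) ** 2 for v in col) / len(col)


def pearson(x, y):
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_features(columns, k, max_corr=0.9):
    """Greedy redundancy filter (an assumed stand-in, not the paper's algorithm).

    Columns are visited in order of decreasing variance. A column is kept only
    if it is not constant and its |r| with every already kept column is at
    most max_corr. Stops once k columns are kept.
    """
    ranked = sorted(columns, key=lambda name: variance(columns[name]), reverse=True)
    kept = []
    for name in ranked:
        if variance(columns[name]) == 0:
            continue  # a constant feature carries no information
        if all(abs(pearson(columns[name], columns[o])) <= max_corr for o in kept):
            kept.append(name)
        if len(kept) == k:
            break
    return kept


def fuse(columns, selected, metadata):
    """Build one vector per sample: selected opcode counts followed by metadata."""
    n_samples = len(next(iter(metadata.values())))
    return [
        [columns[name][i] for name in selected] + [metadata[m][i] for m in metadata]
        for i in range(n_samples)
    ]


# Toy opcode counts for five samples (hypothetical values).
opcode_counts = {
    "mov": [10, 20, 30, 40, 50],
    "push": [20, 40, 60, 80, 100],  # exactly 2 * mov, so redundant with it
    "xor": [5, 1, 7, 2, 9],
    "nop": [3, 3, 3, 3, 3],         # constant, so uninformative
}
# The two metadata features named in the abstract (toy values).
metadata = {
    "line_count": [100, 200, 150, 300, 250],
    "file_size": [1.0, 2.5, 1.8, 3.1, 2.2],
}

selected = select_features(opcode_counts, k=2)
fused = fuse(opcode_counts, selected, metadata)
print(selected, len(fused[0]))
```

In the paper's setting the same shape of pipeline would take 735 opcode columns in, keep 16, and append the two metadata columns to give 18 features per sample; the fused vectors would then be passed to a classifier.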