MalFSM: Feature Subset Selection Method for Malware Family Classification

KONG Zixiao; XUE Jingfeng; WANG Yong; ZHANG Qian; HAN Weijie; ZHU Yufen

doi:10.23919/cje.2022.00.038

Citation: KONG Zixiao, XUE Jingfeng, WANG Yong, ZHANG Qian, HAN Weijie, ZHU Yufen. MalFSM: Feature Subset Selection Method for Malware Family Classification[J]. Chinese Journal of Electronics, 2023, 32(1): 26-38. DOI: 10.23919/cje.2022.00.038

Abstract

Malware detection has long been a focus of cyberspace security practice and academic research. Malware authors use obfuscation techniques to generate large numbers of malware variants, which imposes a heavy analysis burden on security researchers and consumes substantial time and storage resources. To this end, we propose the MalFSM framework. We investigate the correlation between the opcode features of malicious samples and perform feature extraction, selection, and fusion, filtering out redundant features to alleviate the curse of dimensionality and to identify malware families efficiently. Through feature selection, we reduce the 735 opcode features contained in the Kaggle dataset to 16 and then fuse them with two metadata features (file line count and file size), for a total of 18 features; machine learning classification on this subset is both efficient and highly accurate. We also interpret the selected features. Our comprehensive experiments show that MalFSM achieves a highest classification accuracy of 98.6% with a classification time of only 7.76 s on the Microsoft malware dataset from Kaggle.
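To make the idea of filtering redundant opcode features and fusing the remainder with metadata concrete, the following is a minimal sketch only. The abstract does not state the selection criterion, so the greedy Pearson-correlation filter, the 0.9 threshold, the function names (pearson, drop_redundant, fuse), and the toy data are all assumptions for illustration and should not be read as the MalFSM procedure itself.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def drop_redundant(columns, threshold=0.9):
    """Greedy redundancy filter (an assumed stand-in, not the paper's method).

    columns maps a feature name to its per-sample values. A feature is kept
    only if its absolute correlation with every already-kept feature is
    below the (hypothetical) threshold.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def fuse(columns, kept, metadata):
    """Build per-sample rows: selected opcode counts followed by metadata features."""
    names = kept + list(metadata)
    n_samples = len(next(iter(columns.values())))
    rows = [
        [columns[k][i] for k in kept] + [metadata[m][i] for m in metadata]
        for i in range(n_samples)
    ]
    return names, rows


# Toy opcode counts for four samples (made-up numbers).
opcodes = {
    "mov": [10, 20, 30, 40],
    "push": [5, 10, 15, 20],  # perfectly correlated with "mov", so it is dropped
    "xor": [3, 1, 4, 1],
}
# Toy metadata features named after those in the abstract.
meta = {
    "line_count": [100, 250, 300, 120],
    "file_size": [2048, 4096, 8192, 1024],
}

kept = drop_redundant(opcodes)
names, rows = fuse(opcodes, kept, meta)
print(names)
print(rows[0])
```

On this toy input, the filter keeps "mov" and "xor", and each fused row holds the two kept opcode counts followed by the two metadata values. The resulting rows could then be passed to any standard classifier.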