Fast Cross-Platform Binary Code Similarity Detection Framework Based on CFGs Taking Advantage of NLP and Inductive GNN

PENG Jinxue; WANG Yong; XUE Jingfeng; LIU Zhenyan

doi:10.23919/cje.2022.00.228

Volume 33 Issue 1

Jan. 2024

Jinxue PENG, Yong WANG, Jingfeng XUE, et al., “Fast Cross-Platform Binary Code Similarity Detection Framework Based on CFGs Taking Advantage of NLP and Inductive GNN,” Chinese Journal of Electronics, vol. 33, no. 1, pp. 128–138, 2024 doi: 10.23919/cje.2022.00.228

1. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Author Bio:
Jinxue PENG was born in 1999. She is a postgraduate student at Beijing Institute of Technology, China. Her main research interests focus on binary code similarity detection and machine learning. (Email: 3120201124@bit.edu.cn)

Yong WANG was born in 1975. She is an Associate Professor at Beijing Institute of Technology, China. Her main research interests focus on cyber security and machine learning. (Email: wangyong@bit.edu.cn)

Jingfeng XUE was born in 1975. He is a Professor and Ph.D. Supervisor at Beijing Institute of Technology, China. His main research interests focus on network security and software security. (Email: xuejf@bit.edu.cn)

Zhenyan LIU was born in 1975. She is an Associate Professor at Beijing Institute of Technology, China. Her main research interests focus on machine learning. (Email: zhenyanliu@bit.edu.cn)
Corresponding author: Email: wangyong@bit.edu.cn
Received Date: 2022-03-22
Accepted Date: 2023-01-05

Available Online: 2023-04-15

Publish Date: 2024-01-05

Abstract

Cross-platform binary code similarity detection aims to determine whether two or more pieces of binary code are similar. Existing approaches that combine control flow graph (CFG)-based function representation with graph convolutional network (GCN)-based similarity analysis are the best-performing ones. However, because of the large amount of convolutional computation and the loss of structural information, the use of convolutional networks inevitably brings problems such as high overhead and occasional inaccuracy. To address these issues, we propose a fast cross-platform binary code similarity detection framework that uses natural language processing (NLP) and an inductive graph neural network (GNN) for basic block embedding and function representation, respectively, by simulating the extraction of structural and temporal features. The node-centric, small-batch training of the GNN is well suited to large CFGs and can greatly reduce computational overhead. Various NLP-based basic block embedding models and GNNs are evaluated. Experimental results show that the scheme with long short-term memory (LSTM) for basic block embedding and inductive learning-based GraphSAGE (GAE) for function representation outperforms state-of-the-art works. Our framework takes only 45% overhead, improving efficiency significantly with a small performance trade-off.
Keywords:
- Control flow graph
- Natural language processing
- Inductive graph neural network
- Binary code similarity detection
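
The pipeline summarized in the abstract (embed each basic block, aggregate over the CFG, compare function vectors) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the hand-written two-dimensional block vectors stand in for LSTM outputs, the layer is a weight-free mean aggregator in the spirit of GraphSAGE (the real model uses learned weight matrices and neighbor sampling), neighbors are taken to be CFG successors, and the function names are hypothetical.

```python
import math

def mean(vectors, dim):
    """Element-wise mean of a list of vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def sage_layer(features, successors):
    """One GraphSAGE-style step with a mean aggregator and no learned weights:
    concatenate each node's vector with the mean of its neighbors' vectors,
    then L2-normalize. The vector length doubles at every layer."""
    dim = len(next(iter(features.values())))
    out = {}
    for node, h in features.items():
        agg = mean([features[n] for n in successors.get(node, [])], dim)
        out[node] = l2_normalize(h + agg)
    return out

def function_embedding(features, successors, layers=2):
    """Run a few aggregation layers, then average node vectors into one function vector."""
    h = features
    for _ in range(layers):
        h = sage_layer(h, successors)
    dim = len(next(iter(h.values())))
    return mean(list(h.values()), dim)

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy CFGs: block vectors are made-up placeholders for learned block embeddings.
cfg_a_feats = {"b0": [1.0, 0.0], "b1": [0.0, 1.0], "b2": [1.0, 1.0]}
cfg_a_succ = {"b0": ["b1", "b2"], "b1": ["b2"], "b2": []}
cfg_b_feats = {"b0": [0.0, 1.0], "b1": [0.0, 1.0], "b2": [0.0, 1.0]}
cfg_b_succ = {"b0": ["b1"], "b1": ["b2"], "b2": []}

emb_a = function_embedding(cfg_a_feats, cfg_a_succ)
emb_a_again = function_embedding(dict(cfg_a_feats), dict(cfg_a_succ))
emb_b = function_embedding(cfg_b_feats, cfg_b_succ)

sim_same = cosine(emb_a, emb_a_again)  # identical CFGs
sim_diff = cosine(emb_a, emb_b)        # different blocks and edges
print(sim_same, sim_diff)
```

Because each node only looks at its own neighborhood, the same layer can be applied to a CFG never seen before, which is the inductive property the abstract relies on; a trained model would replace the concatenation with learned projections.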
