Double-Layer Positional Encoding Embedding Method for Cross-Platform Binary Function Similarity Detection

JIANG Xunzhi; WANG Shen; YU Xiangzhan; GONG Yuxin

doi:10.1049/cje.2021.00.139

Volume 31 Issue 4

Jul. 2022



JIANG Xunzhi, WANG Shen, YU Xiangzhan, et al., “Double-Layer Positional Encoding Embedding Method for Cross-Platform Binary Function Similarity Detection,” Chinese Journal of Electronics, vol. 31, no. 4, pp. 604-611, 2022, doi: 10.1049/cje.2021.00.139



1. School of Cyberspace Science, Harbin Institute of Technology, Harbin 150001, China

Funds: This work was supported by the National Defense Basic Scientific Research Program of China (JCKY2018603B006)


Author Bio:
JIANG Xunzhi received the B.S. degree in software engineering from Harbin Engineering University, Harbin, China, in 2014. He is currently pursuing the Ph.D. degree in cyberspace security at Harbin Institute of Technology, Harbin, China. His current research interests include binary code similarity and firmware vulnerability mining. (Email: jiangxunzhi@hit.edu.cn)

WANG Shen (corresponding author) received the B.S. and M.E. degrees in electrical engineering and information technology from TU Dresden, Germany, in 2001 and 2007, respectively, and the Ph.D. degree in computer science from Harbin Institute of Technology, China, in 2012. He is currently an Associate Professor in the School of Cyberspace Science, Harbin Institute of Technology. His research interests include digital watermarking, digital forensics, and image processing. (Email: shen.wang@hit.edu.cn)

YU Xiangzhan received the B.S. and M.S. degrees in computer application from Harbin Institute of Technology (HIT), Harbin, China, in 1995 and 1997, respectively, and the Ph.D. degree in computer architecture from HIT, Harbin, China, in 2005. He is currently a Professor with the School of Cyberspace Science at HIT. His research interests include traffic identification and classification, network monitoring and emergency response, data security, and Internet of Things security. (Email: yxz@hit.edu.cn)

GONG Yuxin received the B.S. degree in software engineering from Harbin Engineering University, Harbin, China, in 2014, and the M.S. degree in software engineering from Harbin Institute of Technology (HIT), Harbin, China, in 2020. She is currently pursuing the Ph.D. degree in cyberspace security at HIT, Harbin, China. Her current research interests include adversarial attack and defense based on machine learning. (Email: gongyuxin@hit.edu.cn)
Received Date: 2021-04-20
Accepted Date: 2021-12-01

Available Online: 2022-02-17

Publish Date: 2022-07-05

Abstract


Similarity detection between cross-platform binary functions has been applied in many fields, such as vulnerability detection, software copyright protection, and malware classification. Current advanced methods for binary function similarity detection usually rely on semantic features, but they have certain limitations. For example, practical applications may encounter instructions that were not seen during training, which can easily cause the out-of-vocabulary (OOV) problem. In addition, the extracted binary semantic features may generalize poorly, which lowers the accuracy of the trained model in practical applications. To overcome these limitations, we propose a double-layer positional encoding based transformer model (DP-Transformer). The encoder of the DP-Transformer is used to extract the semantic features of the source instruction set architecture (ISA) and is called the source ISA encoder. The source ISA encoder is then fine-tuned with the triplet loss while the target ISA encoder is trained; this process is called DP-MIRROR. For basic blocks with the same semantics, the source and target ISA encoders produce similar embedding vectors. Unlike the traditional transformer, which uses single-layer positional encoding, the double-layer positional encoding embedding can solve the OOV problem while preserving the separation between instructions, so it is better suited to embedding assembly instructions. Our comparative experiments show that DP-MIRROR outperforms the state-of-the-art approach, MIRROR, by about 35% in terms of precision at 1.
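The triplet loss mentioned in the abstract can be illustrated with a minimal sketch. This is a common hinge-style formulation, not necessarily the exact variant used in the paper: the Euclidean distance, the margin value, and the toy 3-dimensional embeddings below are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than
    the negative by at least `margin`. The distance function and the
    margin value are illustrative assumptions.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical embeddings (made-up values, not from the paper).
anchor = [1.0, 0.0, 0.0]    # basic block embedded by the source ISA encoder
positive = [1.0, 0.0, 0.0]  # same-semantics block from the target ISA encoder
negative = [0.0, 1.0, 0.0]  # block with different semantics

aligned = triplet_loss(anchor, positive, negative)     # encoders agree
misaligned = triplet_loss(anchor, negative, positive)  # roles swapped

print(aligned)
print(misaligned)
```

In this toy setting the loss vanishes when the two encoders already agree on the semantically equivalent block, and it is positive when the different-semantics block lies closer to the anchor, which is the situation that fine-tuning would penalize.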
- Binary similarity
- Cross-platform
- Semantic features
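The precision at 1 metric reported in the abstract can be sketched as follows. This is a minimal illustration that assumes each query has exactly one true match, stored at the same index in the candidate list, and that candidates are ranked by cosine similarity; the paper's exact evaluation protocol may differ, and the vectors are made-up values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def precision_at_1(queries, candidates):
    """Fraction of queries whose top-ranked candidate is the true match.

    Assumption: the true match of queries[i] is candidates[i].
    """
    hits = 0
    for i, query in enumerate(queries):
        # Index of the candidate most similar to this query.
        best = max(range(len(candidates)),
                   key=lambda j: cosine(query, candidates[j]))
        if best == i:
            hits += 1
    return hits / len(queries)


# Hypothetical source-ISA query embeddings and target-ISA candidates.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[0.9, 0.1], [0.2, 0.8], [1.0, -1.0]]

score = precision_at_1(queries, candidates)
print(score)
```

Here the first two queries retrieve their own match at rank 1 and the third does not, so two of the three queries count as hits.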

