Volume 29 Issue 6
Dec.  2020
HE Kang, ZHU Yuefei, HE Yubo, LIU Long, LU Bin, LIN Wei. Detection of Malicious PDF Files Using a Two-Stage Machine Learning Algorithm[J]. Chinese Journal of Electronics, 2020, 29(6): 1165-1177. doi: 10.1049/cje.2020.10.002

Detection of Malicious PDF Files Using a Two-Stage Machine Learning Algorithm

doi: 10.1049/cje.2020.10.002
Funds: This work is supported by the National Key R&D Program of China (No. 2016YFB0801505) and the Cutting-edge Science and Technology Innovation Project of the Key Research and Development Program of China (No. 2019QY1305).
  • Corresponding author: ZHU Yuefei is currently a professor and doctoral supervisor with the State Key Laboratory of Mathematical Engineering and Advanced Computing. His research areas are intrusion detection, cryptography, and information security. (Email: yfzhu17@sina.com)
  • Received Date: 2019-10-08
  • Publish Date: 2020-12-25
  • Portable document format (PDF) files are increasingly used to launch cyberattacks because of their popularity and their growing number of vulnerabilities. Many solutions have been developed to detect malicious files, but their accuracy decreases rapidly in the face of new evasion techniques. We explore how to improve the robustness of classifiers that detect adversarial attacks in PDF files. Content replacement and n-grams are used to extract robust features according to the proposed guiding principles. In the two-stage machine learning model, the objects are divided by type, and an anomaly detection model is first trained for each type individually. The detection results of this first stage are then organized into a tree-like information structure and used as input to a convolutional neural network. Experimental results show that the accuracy of our classifier is nearly 100% and that its robustness against evasive samples is excellent. The object features also enable the identification of different vulnerabilities exploited in malicious PDF files.
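
The feature-extraction idea named in the abstract (content replacement followed by n-gram counting) can be illustrated with a minimal sketch. The replacement scheme below is an assumption made for illustration only and is not taken from the paper: every run of digits becomes `D` and every run of letters becomes `A`. The function names `replace_content` and `ngram_counts` are likewise hypothetical.

```python
import re
from collections import Counter

# Assumed replacement scheme (not the paper's): digit runs -> b"D",
# letter runs -> b"A". One regex pass avoids re-replacing placeholders.
_TOKEN = re.compile(rb"([0-9]+)|([A-Za-z]+)")


def replace_content(data: bytes) -> bytes:
    """Collapse variable content into class placeholders."""
    return _TOKEN.sub(lambda m: b"D" if m.group(1) else b"A", data)


def ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in the given byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Two object fragments that differ only in names and numbers
# map to the same normalized form and hence the same features.
a = replace_content(b"/Length 1234 obj42")
b = replace_content(b"/Size 7 xref9")
features = ngram_counts(a, n=2)
```

Under this assumed scheme, both fragments normalize to the same byte string, so an attacker who only renames tokens or changes numeric values does not change the n-gram features. A real detector would need a replacement scheme that preserves the structure relevant to each PDF object type.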