LI Siyuan, LI Ruiguang, XU Yuan, ZHOU Hao, YAN Hanbing, XU Bin, ZHANG Honggang. WAF-Based Chinese Character Recognition for Spam Image Filtering[J]. Chinese Journal of Electronics, 2018, 27(5): 1050-1055. doi: 10.1049/cje.2018.06.014

WAF-Based Chinese Character Recognition for Spam Image Filtering

doi: 10.1049/cje.2018.06.014
Funds: This work is supported by the National Natural Science Foundation of China (No. U1736218).
  • Corresponding author: YAN Hanbing received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, China, in 2006. He now works at CNCERT Coordination Center of China. His research interests include image analysis, computer network security, information security, and computer graphics. (Email: yhb@cert.org.cn)
  • Received Date: 2015-03-10
  • Rev Recd Date: 2015-10-19
  • Publish Date: 2018-09-10
  • We address the problem of filtering image spam, a rapidly spreading kind of spam in which the text is embedded in images to defeat text-based spam filters. In particular, we focus on image spam whose embedded text is Chinese, which is a more challenging task. A popular way to detect image spam is to use an Optical character recognition (OCR) system to detect and recognize the embedded text, followed by a text classifier that discriminates spam from ham. However, spammers have started to obscure the image text to prevent OCR systems from discovering the spam text. To compensate for this shortcoming of OCR systems, we propose a novel method that is essentially a keyword reconstruction algorithm based on the Word activation force (WAF) model. It is effective at discovering keywords, which benefits the later classification stage and notably improves the performance of image spam filtering. Experimental results on a personal, publicly available data set of spam images validate the effectiveness of our approach, which outperforms the original OCR system in practical use on spam images with complex backgrounds.
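
The abstract does not spell out the reconstruction algorithm, so the following is only a minimal sketch of the idea, not the authors' method. It assumes the WAF statistic in the form given by Guo et al., waf(a, b) = (f_ab / f_a) * (f_ab / f_b) / d_ab^2, where f_a and f_b are token frequencies, f_ab counts how often a precedes b within a window, and d_ab is their mean distance. The function names, the window size, the toy corpus, and the `fill_missing` step (choosing the candidate character most strongly activated by its left neighbour) are all hypothetical and introduced purely for illustration.

```python
from collections import Counter, defaultdict


def word_activation_forces(sentences, window=2):
    """Return {(a, b): waf} for ordered pairs where a precedes b within `window` tokens."""
    freq = Counter()              # f_a: how often each token occurs
    co_count = Counter()          # f_ab: how often a precedes b inside the window
    dist_sum = defaultdict(int)   # summed distances, used for the mean distance d_ab
    for tokens in sentences:
        freq.update(tokens)
        for i, a in enumerate(tokens):
            for d in range(1, window + 1):
                if i + d >= len(tokens):
                    break
                b = tokens[i + d]
                co_count[(a, b)] += 1
                dist_sum[(a, b)] += d
    waf = {}
    for (a, b), f_ab in co_count.items():
        mean_dist = dist_sum[(a, b)] / f_ab
        waf[(a, b)] = (f_ab / freq[a]) * (f_ab / freq[b]) / mean_dist ** 2
    return waf


def fill_missing(prev_char, candidates, waf):
    """Hypothetical step: pick the candidate most strongly activated by prev_char."""
    return max(candidates, key=lambda c: waf.get((prev_char, c), 0.0))


# Toy corpus of spam-like phrases, split into single characters (illustrative only).
corpus = [list("免费发票"), list("免费试用"), list("发票代开")]
forces = word_activation_forces(corpus, window=2)

# Suppose OCR read "发" but failed on the next character and offered two guesses.
best = fill_missing("发", ["试", "票"], forces)
```

In this toy setting the force from "发" to "票" dominates because the two characters always occur adjacently, so an obscured character following "发" would be restored as "票". A real system would estimate the forces from a large corpus and combine them with the OCR confidence scores.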