XU Bin, LI Ruiguang, LIU Yashu, YAN Hanbing, LI Siyuan, ZHANG Honggang. Filtering Chinese Image Spam Using Pseudo-OCR[J]. Chinese Journal of Electronics, 2015, 24(1): 134-139.

Filtering Chinese Image Spam Using Pseudo-OCR

Funds:  This work is supported by the National Natural Science Foundation of China (No.61171193, No.61175011, No.61273217), and the 111 Project (No.B08004).
More Information
  • Corresponding author: LIU Yashu is a Ph.D. candidate in the Department of Computer Science and Technology at Beijing Jiaotong University, China. She is also a lecturer at Beijing University of Civil Engineering and Architecture. Her research interests include data mining and computer vision. (Email: ly_s8020@163.com)
  • Received Date: 2014-05-01
  • Rev Recd Date: 2014-04-01
  • Publish Date: 2015-01-10
  • For image spam filtering, Optical character recognition (OCR) based methods often achieve better performance because of their more complex structure, which recognizes the text embedded in the image. However, applying traditional OCR techniques usually introduces shortcomings such as expensive computational cost and vulnerability to image noise and artificial interference, especially in Chinese image spam filtering. Therefore, by optimizing the recognition procedure of traditional OCR, we propose pseudo-OCR, which is better suited to Chinese image spam filtering. In pseudo-OCR, it is sufficient to discriminate potential spam character features from ham ones, instead of recognizing the characters. In addition, a novel key-point based Chinese character feature specific to pseudo-OCR is devised and extracted using a carefully designed algorithm, which outperforms classic corner detection methods in finding such key-points. Experimental results show that the proposed system usually performs better than the traditional OCR based method while maintaining a low false positive rate.