ZHAO Wei, GUAN Ziyu, CAO Zhengwen, LIU Zheng. Mining and Harvesting High Quality Topical Resources from the Web[J]. Chinese Journal of Electronics, 2016, 25(1): 48-57. doi: 10.1049/cje.2016.01.008

Mining and Harvesting High Quality Topical Resources from the Web

doi: 10.1049/cje.2016.01.008
Funds:  This work is supported by the National Natural Science Foundation of China (No.61373118) and Natural Fund of Science in Shaanxi (No.2013JM8036).
  • Corresponding author: GUAN Ziyu received the B.S. and Ph.D. degrees in Computer Science from Zhejiang University in 2004 and 2010, respectively. He is currently a full professor in the College of Information and Technology of China's Northwest University. His research interests include graph mining, machine learning, expertise modeling and retrieval, and recommender systems. (Email: ziyuguan@nwu.edu.cn)
  • Received Date: 2014-12-04
  • Rev Recd Date: 2015-01-05
  • Publish Date: 2016-01-10
  • Focused crawlers aim to prioritize uncrawled URLs effectively so as to harvest relevant pages while avoiding irrelevant ones. In practice, given the explosion of Web information, it is more important to harvest high-quality topical Web resources. Our study shows that the popular focused crawling strategy cannot achieve this goal. In this paper we develop a new focused crawler, On-line topical quality estimation (OTQE), which evaluates the topical quality of uncrawled pages from the observed link and content evidence and prioritizes their URLs accordingly. The new crawler is scalable and requires fewer additional resources for link-based analysis. Experimental results from crawling 3.6 million Web pages demonstrate the advantages of the proposed method over traditional focused crawlers.
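The general idea of prioritizing uncrawled URLs by observed link and content evidence can be illustrated with a minimal best-first frontier. This is only a sketch under stated assumptions and is not the OTQE algorithm, whose scoring is not specified in the abstract. The toy in-memory graph `TOY_WEB`, the term set `TOPIC_TERMS`, and the functions `relevance` and `crawl` are all hypothetical names introduced for illustration. Here an uncrawled URL simply accumulates the content relevance of every crawled page that links to it.

```python
import heapq

# Hypothetical in-memory "web": url -> (page text, outgoing links).
TOY_WEB = {
    "a": ("crawler topic news sports", ["b", "c"]),
    "b": ("crawler topic", ["d"]),
    "c": ("cooking recipes", ["e"]),
    "d": ("topic crawler quality", []),
    "e": ("travel blog", []),
}
TOPIC_TERMS = {"crawler", "topic"}  # assumed topic description


def relevance(text):
    """Toy content score: fraction of words that are topic terms."""
    words = text.split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0


def crawl(seed, budget):
    """Best-first crawl: always fetch the highest-scoring uncrawled URL."""
    scores = {seed: 1.0}        # evidence accumulated per URL
    frontier = [(-1.0, seed)]   # max-heap emulated with negated scores
    crawled, done = [], set()
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        if url in done or url not in TOY_WEB:
            continue            # stale heap entry or unknown page
        done.add(url)
        crawled.append(url)
        text, links = TOY_WEB[url]
        r = relevance(text)     # content evidence from the parent page
        for target in links:
            if target in done:
                continue
            # Link evidence: each relevant in-link raises the target's score.
            scores[target] = scores.get(target, 0.0) + r
            heapq.heappush(frontier, (-scores[target], target))
    return crawled
```

With a budget of three pages starting from `"a"`, the sketch fetches `"d"` (linked from a highly relevant page) before `"c"` (linked only from a moderately relevant one). A real crawler would replace the toy relevance function with a trained classifier and the dictionary with actual page fetches.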