Mining and Harvesting High Quality Topical Resources from the Web

Funds: This work is supported by the National Natural Science Foundation of China (No. 61373118) and Natural Fund of Science in Shaanxi (No. 2013JM8036).
  • Corresponding author: GUAN Ziyu received the B.S. and Ph.D. degrees in Computer Science from Zhejiang University in 2004 and 2010, respectively. He is currently a full professor in the College of Information and Technology of China's Northwest University. His research interests include graph mining, machine learning, expertise modeling and retrieval, and recommender systems. (Email: ziyuguan@nwu.edu.cn)
  • Received Date: 2014-12-04
  • Rev Recd Date: 2015-01-05
  • Publish Date: 2016-01-10
Abstract: Focused crawlers aim to prioritize uncrawled URLs effectively so that they harvest relevant pages while avoiding irrelevant ones. In practice, given the explosion of Web information, harvesting high-quality topical Web resources matters more than harvesting pages that are merely relevant. Our study shows that the popular focused crawling strategy cannot achieve this goal. In this paper we develop a new focused crawler, On-line topical quality estimation (OTQE), which estimates the topical quality of uncrawled pages from the link and content evidence observed so far and prioritizes their URLs accordingly. The new crawler is scalable and requires fewer additional resources for link-based analysis. Experimental results from crawling 3.6 million Web pages demonstrate the advantages of the proposed method over traditional focused crawlers.
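To make the idea of prioritizing uncrawled URLs concrete, the following is a minimal sketch of a generic best-first crawl frontier. It is not the OTQE algorithm: the function `focused_crawl`, the callbacks `fetch_links` and `score`, and the toy graph and scores are all hypothetical names introduced here for illustration, and the score is a fixed number per URL rather than an estimate updated on-line from link and content evidence.

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple


def focused_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],  # hypothetical: returns out-links of a page
    score: Callable[[str], float],                # hypothetical: estimated quality of a URL
    budget: int,
) -> List[str]:
    """Visit at most `budget` URLs, always expanding the best-scored uncrawled URL."""
    frontier: List[Tuple[float, int, str]] = []  # min-heap of (-score, insertion order, url)
    seen: Set[str] = set()
    counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(url: str) -> None:
        nonlocal counter
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-score(url), counter, url))
            counter += 1

    for url in seeds:
        push(url)

    visited: List[str] = []
    while frontier and len(visited) < budget:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            push(link)
    return visited


# Toy data (assumed, not from the paper): a five-page link graph with fixed scores.
TOY_GRAPH = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
TOY_SCORES = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}

order = focused_crawl(
    ["seed"],
    lambda u: TOY_GRAPH.get(u, []),
    lambda u: TOY_SCORES.get(u, 0.0),
    budget=4,
)
print(order)
```

With a budget of four pages, the sketch follows the higher-scored branch through `b` and `d` before falling back to `a`, and never reaches the lowest-scored page `c`. A crawler in the spirit of the abstract would replace the fixed `score` with an estimate that is revised as new pages and links are observed.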
