Citation: ZHAO Wei, GUAN Ziyu, CAO Zhengwen, et al., “Mining and Harvesting High Quality Topical Resources from the Web,” Chinese Journal of Electronics, vol. 25, no. 1, pp. 48-57, 2016, doi: 10.1049/cje.2016.01.008
S. Chakrabarti, M. van den Berg and B. Dom, “Focused crawling: A new approach to topic-specific web resource discovery”, Computer Networks, Vol.31, No.11-16, pp.1623-1640, 1999.
M. Diligenti, F.M. Coetzee, S. Lawrence, C.L. Giles and M. Gori, “Focused crawling using context graphs”, 26th International Conference on Very Large Databases, Cairo, Egypt, pp.527-534, 2000.
F. Menczer, G. Pant, P. Srinivasan and M.E. Ruiz, “Evaluating topic-driven web crawlers”, 24th Annual International ACM SIGIR Conference, New Orleans, LA, USA, pp.241-249, 2001.
G. Pant and P. Srinivasan, “Learning to crawl: Comparing classification schemes”, ACM Trans. Information Systems, Vol.23, No.4, pp.430-462, 2005.
G. Pant and P. Srinivasan, “Link contexts in classifier-guided topical crawlers”, IEEE Trans. Knowledge and Data Engineering, Vol.18, No.1, pp.107-122, 2006.
Z. Guan, C. Wang, C. Chen, J. Bu and J. Wang, “Guide focused crawler efficiently and effectively using on-line topical importance estimation”, 31st Annual International ACM SIGIR Conference, Singapore, pp.757-758, 2008.
D. Hati and A. Kumar, “An approach for identifying URLs based on division score and link score in focused crawler”, International Journal of Computer Applications, Vol.2, No.3, pp.48-53, 2010.
A.K. McCallum, K. Nigam, J. Rennie and K. Seymore, “Automating the construction of internet portals with machine learning”, Information Retrieval, Vol.3, No.2, pp.127-163, 2000.
G. Pant and F. Menczer, “Topical crawling for business intelligence”, 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2003), Trondheim, Norway, pp.233-244, 2003.
A. Ntoulas, M. Najork, M. Manasse and D. Fetterly, “Detecting spam web pages through content analysis”, 15th International Conference on World Wide Web, Edinburgh, Scotland, pp.83-92, 2006.
T.T. Tang, D. Hawking, N. Craswell and K. Griffiths, “Focused crawling for both topical relevance and quality of medical information”, 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, pp.147-154, 2005.
T.H. Haveliwala, “Topic-sensitive PageRank”, 11th International Conference on World Wide Web, Honolulu, Hawaii, USA, pp.517-526, 2002.
P. Calado, M. Cristo, E. Moura, N. Ziviani, B. Ribeiro-Neto and M. A. Gonçalves, “Combining link-based and content-based methods for web document classification”, 12th International Conference on Information and Knowledge Management, New Orleans, USA, pp.394-401, 2003.
D. Cai and X. He, “Manifold adaptive experimental design for text categorization”, IEEE Transactions on Knowledge and Data Engineering, Vol.24, No.4, pp.707-719, 2012.
P. Srinivasan, F. Menczer and G. Pant, “A general evaluation framework for topical crawlers”, Information Retrieval, Vol.8, No.3, pp.417-447, 2005.
G.T. De Assis, A.H.F. Laender, M.A. Gonçalves and A.S. Da Silva, “A genre-aware approach to focused crawling”, World Wide Web, Vol.12, No.3, pp.285-319, 2009.
J. Kleinberg, “Authoritative sources in a hyperlinked environment”, 9th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, pp.668-677, 1998.
G. Almpanidis, C. Kotropoulos and I. Pitas, “Combining text and link analysis for focused crawling - an application for vertical search engines”, Information Systems, Vol.32, No.6, pp.886-908, 2007.
M. Jamali, H. Sayyadi, B.B. Hariri and H. Abolhassani, “A method for focused crawling using combination of link structure and content similarity”, IEEE/WIC/ACM International Conference on Web Intelligence (WI'06), Hong Kong, China, pp.753-756, 2006.
C. Wang, Z. Guan, C. Chen, J. Bu, J. Wang and H. Lin, “Online topical importance estimation: an effective focused crawling algorithm combining link and content analysis”, Journal of Zhejiang University SCIENCE A, Vol.10, No.8, pp.1114-1124, 2009.
J. Cho, H. Garcia-Molina and L. Page, “Efficient crawling through URL ordering”, Computer Networks and ISDN Systems, Vol.30, No.1, pp.161-172, 1998.
L. Page, S. Brin, R. Motwani and T. Winograd, “The PageRank citation ranking: Bringing order to the web”, Stanford InfoLab Digital Libraries Project Report, 1999-66, 1999.
J. Cho and U. Schonfeld, “RankMass crawler: A crawler with high personalized PageRank coverage guarantee”, 33rd International Conference on Very Large Databases, Vienna, Austria, pp.375-386, 2007.
Z. Gyöngyi and H. Garcia-Molina, “Web spam taxonomy”, 1st International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan, pp.39-47, 2005.
G.W. Flake, S. Lawrence and C.L. Giles, “Efficient identification of web communities”, Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, pp.150-160, 2000.
S. Abiteboul, M. Preda and G. Cobena, “Adaptive on-line page importance computation”, Twelfth International Conference on World Wide Web, Budapest, Hungary, pp.280-290, 2003.
M. Porter, “An algorithm for suffix stripping”, Program, Vol.14, No.3, pp.130-137, 1980.
A.M. Martínez and A.C. Kak, “PCA versus LDA”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.23, No.2, pp.228-233, 2001.
D. Cai, X. He, J. Han and T.S. Huang, “Graph regularized nonnegative matrix factorization for data representation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.33, No.8, pp.1548-1560, 2011.
QIAN Wenbin, SHU Wenhao, YANG Bingru and ZHANG Changsheng, “An incremental algorithm to feature selection in decision systems with the variation of feature set”, Chinese Journal of Electronics, Vol.24, No.1, pp.128-133, 2015.
M.E. Tipping, “Sparse Bayesian learning and the relevance vector machine”, The Journal of Machine Learning Research, Vol.1, No.1, pp.211-244, 2001.
C.M. Bishop, Pattern Recognition and Machine Learning. Springer, New York, USA, 2006.
G. Pant, P. Srinivasan and F. Menczer, “Exploration versus exploitation in topic driven crawlers”, WWW'02 Workshop on Web Dynamics, Honolulu, Hawaii, USA, pp.86-95, 2002.
M. Najork and J.L. Wiener, “Breadth-first crawling yields high-quality pages”, 10th International Conference on World Wide Web, Hong Kong, China, pp.114-118, 2001.