TY - GEN
T1 - VRPSOFC: A framework for focused crawler using mutation improving particle swarm optimization algorithm
AU - Xu, Guangxia
AU - Jiang, Peng
AU - Ma, Chuang
AU - Daneshmand, Mahmoud
AU - Xie, Shaoci
N1 - Publisher Copyright: © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/5/17
Y1 - 2019/5/17
N2 - The focused crawler is a key technology of search engines. It filters webpages with relevance algorithms until a stopping condition is met. Current focused crawlers are prone to topic drift and low precision while crawling. This paper therefore proposes VRPSOFC, a focused crawler framework based on a mutation-improved particle swarm optimization algorithm. First, for each topic, VRPSOFC selects three types of seed pages that tend to lead to large aggregations of webpages, chosen according to page click rates in Google search: the official website, the Wikipedia page, and a forum or video page. VRPSOFC then crawls webpages with the mutation-improved particle swarm optimization algorithm proposed in the paper, using each seed page as an initial page. Finally, experiments are conducted in a real web environment and the results are analyzed. Compared with the traditional VSM and other methods, VRPSOFC obtains more accurate URL priorities and crawls higher-quality webpages, which indicates that the proposed focused crawler framework is effective.
KW - Focused crawler
KW - Mutation
KW - Particle swarm algorithm
KW - Precision
KW - Topic-drift
UR - http://www.scopus.com/inward/record.url?scp=85072824774&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85072824774&partnerID=8YFLogxK
U2 - 10.1145/3321408.3323081
DO - 10.1145/3321408.3323081
M3 - Conference contribution
AN - SCOPUS:85072824774
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the ACM Turing Celebration Conference - China, ACM TURC 2019
T2 - 2019 ACM Turing Celebration Conference - China, ACM TURC 2019
Y2 - 17 May 2019 through 19 May 2019
ER -