FICA: A novel intelligent crawling algorithm based on reinforcement learning

AMZ Bidoki, N Yazdani… - Web Intelligence and …, 2009 - content.iospress.com
Web Intelligence and Agent Systems: An International Journal, 2009
Abstract
The web is a huge and highly dynamic environment that is growing exponentially in content and developing fast in structure. No search engine can cover the whole web, so each has to focus its crawling on the most valuable pages. An efficient crawling algorithm for retrieving the most important pages therefore remains a challenging problem. Several algorithms, such as PageRank and OPIC, have been proposed. Unfortunately, they have high time complexity and low throughput. In this paper, an intelligent crawling algorithm based on reinforcement learning, called FICA, is proposed that models a random surfing user. The priority for crawling pages is based on a concept we call logarithmic distance. FICA is easy to implement, and its time complexity is O(E log V), where V and E are the number of nodes and edges in the web graph, respectively. A comparison of FICA with other proposed algorithms shows that FICA outperforms them in discovering highly important pages. Furthermore, FICA computes the importance (ranking) of each page during the crawling process, so it can also be used as a ranking method for computing page importance. A nice property of FICA is its adaptability to the web: it adjusts dynamically to changes in the web graph. We have used the UK web graph to evaluate our approach.
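The abstract names a distance-based crawl priority and an O(E log V) bound but does not define either in detail. The following is a minimal sketch of one way such a crawl frontier could look, not the authors' algorithm. It assumes that each link's weight is the base-10 logarithm of the source page's out-degree, and it uses a plain Dijkstra-style min-heap traversal in place of the paper's reinforcement-learning update. The function name `crawl_order` and the dictionary-based graph are hypothetical, chosen only for illustration.

```python
import heapq
import math


def crawl_order(graph, seed):
    """Visit pages in order of increasing assumed logarithmic distance from seed.

    graph maps a page to the list of pages it links to (hypothetical format).
    Returns (visit_order, distance_by_page).
    """
    dist = {seed: 0.0}
    frontier = [(0.0, seed)]  # min-heap of (distance, page)
    fetched = set()
    order = []

    while frontier:
        d, page = heapq.heappop(frontier)
        if page in fetched:
            continue  # stale heap entry; a shorter path was already used
        fetched.add(page)
        order.append(page)

        links = graph.get(page, [])
        if not links:
            continue
        # ASSUMPTION: link weight = log10(out-degree of the source page).
        weight = math.log10(len(links))
        for target in links:
            candidate = d + weight
            if candidate < dist.get(target, math.inf):
                dist[target] = candidate
                heapq.heappush(frontier, (candidate, target))

    return order, dist


if __name__ == "__main__":
    toy_graph = {
        "seed": ["a", "b"],
        "a": ["c"],
        "b": ["c", "d", "e"],
    }
    order, dist = crawl_order(toy_graph, "seed")
    print(order)
    print(dist)
```

In this sketch each edge adds at most one entry to the heap, and each heap operation costs O(log V), so the traversal runs in O(E log V), the same order as the bound quoted in the abstract. Pages reached through low-out-degree pages get smaller distances and are fetched earlier.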