Focused crawler - AbsoluteAstronomy.com

A focused crawler or topical crawler is a web crawler

Web crawler

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or—especially in the FOAF community—Web scutters.This process is called Web...

that attempts to download only web page

Web page

A web page or webpage is a document or information resource that is suitable for the World Wide Web and can be accessed through a web browser and displayed on a monitor or mobile device. This information is usually in HTML or XHTML format, and may provide navigation to other web pages via hypertext...

s that are relevant to a pre-defined topic or set of topics. Topical crawling generally assumes that only the topic is given, while focused crawling also assumes that some labeled examples of relevant and not relevant pages are available. Topical crawling was first introduced by Menczer .

Strategies

A focused crawler ideally would like to download only web pages that are relevant to a particular topic and avoid downloading all others.

Therefore a focused crawler may predict the probability that a link to a particular page is relevant before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web. In a review of topical crawling algorithms, Menczer et al. show that such simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning

Reinforcement learning

Inspired by behaviorist psychology, reinforcement learning is an area of machine learning in computer science, concerned with how an agent ought to take actions in an environment so as to maximize some notion of cumulative reward...

and evolutionary adaptation can give the best performance over longer crawls. Diligenti et al. propose to use the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet.

In another approach, the relevance of a page is determined after downloading its content. Relevant pages are sent to content indexing and their contained URLs are added to the crawl frontier; pages that fall below a relevance threshold are discarded.

The performance of a focused crawler depends mostly on the richness of links in the specific topic being searched, and focused crawling usually relies on a general web search engine

Search engine

A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information...

for providing starting points.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.