Web harvesting - AbsoluteAstronomy.com

Web harvesting is commonly used to describe Web scraping from a multitude of sites. It also refers to an implementation of a Web crawler that uses human expertise or machine guidance to direct the crawler to URLs which compose a specialized collection or set of knowledge. Web harvesting can be thought of as focused or directed Web crawling.

Purpose

Web harvesting allows Web-based search and retrieval applications, commonly referred to as search engines, to index content that is pertinent to the audience for which the harvest is intended. Such content is thus virtually integrated and made searchable as a separate Web application. General purpose search engines, such as Google and Yahoo!, index all possible links they encounter from the origin of their crawl. In contrast, search engines based on Web harvesting only index URLs to which they are directed. This implementation strategy creates a searchable application that is faster, because the index is smaller, and that provides higher quality and more selective results, since the indexed URLs are pre-filtered for the topic or domain of interest. In effect, harvesting makes otherwise isolated islands of information searchable as if they were an integrated whole.
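The idea of a small, pre-filtered index can be illustrated with a minimal sketch of an inverted index over already harvested pages. The function names `build_index` and `search`, and the URLs in the usage note, are hypothetical and chosen for illustration only. The sketch treats page content as plain text and leaves out HTML stripping, ranking, and stemming, all of which a real search application would need.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it.

    pages is a dict of URL -> plain text (an assumption of this sketch).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

For example, after `index = build_index({"http://a.example/1": "Solar telescope data"})`, the call `search(index, "solar telescope")` returns the set containing that one URL. Because only the harvested URLs are indexed, the index stays small and every result is already on topic.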

Another common purpose of Web harvesting is to supply content to vertical search engines.

Process

Web harvesting begins by identifying a list of URLs that define a specialized collection or set of knowledge and specifying that list as input to a computer program. The program then downloads the listed URLs. Embedded hyperlinks that are encountered can be either followed or ignored, depending on human or machine guidance. A key difference between Web harvesting and general purpose Web crawling is that in Web harvesting the crawl depth is defined in advance, so the crawl need not recursively follow URLs until all links have been exhausted. The downloaded content is then indexed by the search engine application and offered to information customers as a searchable Web application. Information customers can then access and search the Web application and follow hyperlinks to the original URLs that meet their search criteria.
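The download step described above can be sketched as a breadth-first crawl with a fixed depth limit. This is a minimal illustration, not a reference implementation: the names `harvest`, `LinkExtractor`, `fetch`, `max_depth`, and `should_follow` are assumptions introduced here. The `fetch` argument is a caller-supplied function that maps a URL to HTML text (or `None` on failure), which keeps the sketch independent of any particular HTTP library. The `should_follow` predicate stands in for the human or machine guidance that decides whether an embedded hyperlink is followed or ignored.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seeds, fetch, max_depth=1, should_follow=lambda url: True):
    """Download the seed URLs and follow links at most max_depth hops.

    Returns a dict of URL -> downloaded HTML.
    """
    seeds = list(seeds)
    pages = {}
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # download failed: skip this URL
        pages[url] = html
        if depth >= max_depth:
            continue  # crawl depth reached: do not expand this page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and should_follow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

With `max_depth=0` only the seed list itself is downloaded; larger values follow embedded hyperlinks that many hops from a seed. In practice `fetch` might wrap `urllib.request.urlopen`, and `should_follow` might restrict the crawl to a set of approved hosts.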

Focused web harvesting

Focused web harvesting is similar to a targeted web crawler. Instead of letting a general purpose crawler harvest the web, the mechanism works under certain pre-defined conditions that specify the information to collect. In particular, this mechanism is intended to realize an indirect data integration. An implementation of this kind of data integration can be found at the Indonesian Scientific Index (ISI), which integrates all information related to science and technology in Indonesia.
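As a rough illustration of harvesting under pre-defined conditions to obtain a unified view, the sketch below merges records from several sources and keeps only those that satisfy a condition. It does not describe how the Indonesian Scientific Index works. The function name `integrate`, the record fields, and the source names in the usage note are all hypothetical.

```python
def integrate(sources, condition):
    """Merge records from several sources into one unified list.

    sources is a dict of source name -> list of record dicts.
    Only records for which condition(record) is true are kept, and each
    kept record is tagged with the name of the source it came from.
    """
    unified = []
    for source_name, records in sources.items():
        for record in records:
            if condition(record):
                unified.append({**record, "source": source_name})
    return unified
```

For example, calling `integrate({"journal_site": [{"title": "Rice genomics", "topic": "science"}], "news_site": [{"title": "Football results", "topic": "sport"}]}, lambda r: r["topic"] == "science")` returns a single record, the one from `journal_site`, with a `"source"` field added.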

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.