Surface web
The surface Web is the portion of the World Wide Web that is indexed by conventional search engines. The part of the Web that is not reachable this way is called the Deep Web.

Search engines construct a database of the Web using programs called spiders or Web crawlers, which begin with a list of known Web pages. The spider retrieves a copy of each page and indexes it, storing useful information that allows the page to be quickly retrieved later. Any hyperlinks to new pages are added to the list of pages to be crawled. Eventually all reachable pages are indexed, unless the spider runs out of time or disk space. The collection of reachable pages defines the surface Web.
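The crawl described above is essentially a breadth-first traversal of the link graph. A minimal sketch in Python, assuming a hypothetical `fetch(url)` function that returns a page's text and outgoing links (or `None` when the page is unreachable):

```python
from collections import deque

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from a list of known pages.

    `fetch(url)` is an assumed interface for illustration: it returns
    (text, links) for a reachable page, or None otherwise.
    """
    frontier = deque(seed_urls)   # pages still to be crawled
    seen = set(seed_urls)         # avoid re-queueing the same URL
    index = {}                    # url -> stored copy of the page

    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        result = fetch(url)
        if result is None:        # page could not be reached; skip it
            continue
        text, links = result
        index[url] = text         # store a copy for later retrieval
        for link in links:        # hyperlinks to new pages join the list
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index
```

Pages never linked from any seed (or from pages reachable from the seeds) are simply never visited, which is one way a page ends up outside the surface Web.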
For various reasons (e.g., the Robots Exclusion Standard, links generated by JavaScript and Flash, password protection), some pages cannot be reached by the spider. These 'invisible' pages are referred to as the Deep Web.
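The Robots Exclusion Standard works through a `robots.txt` file that cooperating crawlers consult before fetching a page. Python's standard library can evaluate such a file; the sketch below uses a made-up two-line policy and a hypothetical crawler name:

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt policy directly from a list of lines
# (no network access needed). The policy hides /private/ from all robots.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A cooperating spider checks each URL before crawling it.
print(rp.can_fetch("MySpider", "http://example.com/index.html"))       # allowed
print(rp.can_fetch("MySpider", "http://example.com/private/x.html"))   # excluded
```

A page disallowed this way is publicly viewable in a browser but stays out of the search engine's index, placing it in the Deep Web.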
A 2005 study queried the Google, MSN, Yahoo!, and Ask Jeeves search engines with search terms from 75 different languages and determined that there were over 11.5 billion web pages in the publicly indexable Web as of January 2005. As of June 2008, the indexed web contained at least 63 billion pages.