Web scraping
Web scraping is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP) or embedding a fully fledged web browser, such as Internet Explorer or Mozilla Firefox.

Web scraping is closely related to web indexing, which indexes information on the web using a bot and is a universal technique adopted by most search engines. In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, weather data monitoring, website change detection, research, web mashup and web data integration.

Techniques

Web scraping is the process of automatically collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the Semantic Web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interaction. Web scraping instead favors practical solutions based on existing technologies that are often entirely ad hoc. There are therefore different levels of automation that existing web-scraping technologies can provide:
  • Human copy-and-paste: Sometimes even the best web-scraping technology cannot replace a human’s manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites being scraped explicitly set up barriers to prevent machine automation.
  • Text grepping and regular expression matching: A simple yet powerful approach to extracting information from web pages can be based on the UNIX grep command or the regular expression matching facilities of programming languages (for instance Perl or Python); see the first sketch after this list.
  • HTTP programming: Static and dynamic web pages can be retrieved by sending HTTP requests to the remote web server using socket programming; see the second sketch after this list.
  • DOM parsing: By embedding a full-fledged web browser, such as the Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve parts of the pages; see the third sketch after this list.
  • HTML parsers: Some semi-structured data query languages, such as XQuery and HTQL, can be used to parse HTML pages and to retrieve and transform page content; see the fourth sketch after this list.
  • Web-scraping software: Many software tools are available to customize web-scraping solutions. Such software may attempt to recognize the data structure of a page automatically, provide a recording interface that removes the need to write scraping code by hand, offer scripting functions for extracting and transforming content, or provide database interfaces for storing the scraped data in local databases.
  • Vertical aggregation platforms: Several companies have developed vertical-specific harvesting platforms. These platforms create and monitor a multitude of “bots” for specific verticals, with no man in the loop and no work tied to a specific target site. The preparation involves establishing a knowledge base for the entire vertical, after which the platform creates the bots automatically. A platform's robustness is measured by the quality of the information it retrieves (usually the number of fields) and by its scalability (how quickly it can scale up to hundreds or thousands of sites). This scalability is mostly used to target the long tail of sites that common aggregators find complicated or too labor-intensive to harvest content from.
  • Semantic annotation recognition: The pages being scraped may contain metadata or semantic markups and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformats are, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separately from the web pages, so scrapers can retrieve data schemas and instructions from this layer before scraping the pages.
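
First, a minimal illustration of the text-grepping approach: the following Python sketch pulls price-like tokens out of a fragment of raw markup using the standard re module. The HTML snippet and the pattern are illustrative assumptions, not taken from any real site.

    # Regular-expression extraction with Python's standard re module.
    # The markup and the price pattern are illustrative assumptions.
    import re

    html = '<li class="price">$19.99</li><li class="price">$5.49</li>'

    # Match a dollar sign followed by digits and an optional cent part.
    prices = re.findall(r"\$\d+(?:\.\d{2})?", html)
    print(prices)  # ['$19.99', '$5.49']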
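
Second, the HTTP-programming approach can be as simple as sending a GET request and reading the response body. This sketch uses only Python's standard urllib module; the target URL and User-Agent string are placeholders.

    # Retrieving a page by sending an HTTP request directly.
    # example.com and the User-Agent string are placeholder assumptions.
    import urllib.request

    request = urllib.request.Request(
        "http://example.com/",
        headers={"User-Agent": "ExampleScraper/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        page = response.read().decode("utf-8", errors="replace")

    print(page[:200])  # first 200 characters of the raw HTML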
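
Third, for DOM parsing, one way to embed a real browser engine is to drive it from code. The sketch below uses the third-party Selenium package as a stand-in for the browser controls named above (an assumption; the article does not prescribe a specific tool), so client-side scripts run before the DOM is queried.

    # DOM parsing through an automated browser, sketched with the
    # third-party Selenium package (pip install selenium).
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()       # requires Firefox and geckodriver
    driver.get("http://example.com/")  # placeholder target page
    # Client-side scripts have run; query the resulting DOM tree.
    headings = driver.find_elements(By.TAG_NAME, "h1")
    print([h.text for h in headings])
    driver.quit()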
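
Fourth, query languages such as XQuery or HTQL require their own engines; as a dependency-free stand-in, this sketch uses Python's built-in event-driven html.parser module to collect link targets from a fragment of markup (the fragment is illustrative).

    # Event-driven HTML parsing with Python's standard html.parser
    # module; the input fragment is an illustrative assumption.
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Accumulate the href target of every anchor tag seen."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    parser = LinkCollector()
    parser.feed('<p><a href="/docs">docs</a> and '
                '<a href="http://example.com/">example</a></p>')
    print(parser.links)  # ['/docs', 'http://example.com/']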

Legal issues

Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear. While outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable.
U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels, which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering Bidder's Edge to stop data mining from the eBay web site. This case involved automatic placing of bids, known as auction sniping. However, in order to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels.

One of the first major tests of screen scraping involved American Airlines (AA) and a firm called FareChase. AA successfully obtained an injunction from a Texas trial court, stopping FareChase from selling software that enables users to compare online fares if the software also searches AA's website. The airline argued that FareChase's websearch software trespassed on AA's servers when it collected the publicly available data. The injunction was appealed in 2003.

Southwest Airlines has also challenged screen-scraping practices, and has involved both FareChase and another firm, Outtask, in a legal claim. Southwest Airlines charged that the screen scraping is illegal since it is an example of "Computer Fraud and Abuse" and has led to "Damage and Loss" and "Unauthorized Access" of Southwest's site. It also constitutes "Interference with Business Relations", "Trespass" and "Harmful Access by Computer". They also claimed that screen scraping constitutes what is legally known as misappropriation and unjust enrichment, and is also a breach of the website's user agreement. Outtask denied all these claims, arguing that the prevailing law in this case should be US copyright law and that, under copyright, the pieces of information being scraped would not be subject to copyright protection. Although the cases were never resolved in the Supreme Court of the United States, FareChase was eventually shuttered by parent company Yahoo!, and Outtask was purchased by travel expense company Concur.

Although these are early scraping decisions, and the theories of liability are not uniform, it is difficult to ignore a pattern emerging that the courts are prepared to protect proprietary content on commercial sites from uses which are undesirable to the owners of such sites. However, the degree of protection for such content is not settled, and will depend on the type of access made by the scraper, the amount of information accessed and copied, the degree to which the access adversely affects the site owner’s system and the types and manner of prohibitions on such conduct.

Until the law in this area becomes more settled, entities contemplating using scraping programs to access a public web site should also consider whether such action is authorized by reviewing the terms of use and other terms or notices posted on or made available through the site. In the latest ruling, Cvent, Inc. v. Eventbrite, Inc., the United States District Court for the Eastern District of Virginia ruled that the terms of use must be brought to the users' attention in order for a browse wrap contract or license to be enforced. On the plaintiff's website during the period of this trial, the terms-of-use link was displayed among all the links of the site, at the bottom of the page, as on most sites on the internet. This ruling contradicts the Irish ruling described below. The court also rejected the plaintiff's argument that the browse wrap restrictions were enforceable in view of Virginia's adoption of the Uniform Computer Information Transactions Act (UCITA), a uniform law that many believed favored common browse wrap contracting practices.

Outside of the United States, in a February 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) found that systematic crawling, indexing and deep linking by portal site ofir.dk of real estate site Home.dk did not conflict with Danish law or the database directive of the European Union. In February 2010, in the case of Ryanair Limited v Billigfluege.de GmbH, the Irish High Court established a precedent by upholding the Terms of Use on Ryanair's website, including the prohibitions contained in those Terms of Use against screen scraping. The case is currently under appeal in the Supreme Court. In Australia, the Spam Act 2003 outlaws some forms of web harvesting, although this only applies to email addresses.

Technical measures to stop bots

The administrator of a website can use various measures to stop or slow a bot. Some techniques include:
  • Adding entries to robots.txt: if the bot is well behaved, it will adhere to the Robots Exclusion Standard; Google and other well-behaved bots can be stopped this way (see the sketch after this list).
  • Blocking an IP address. This will also block all browsing from that address.
  • Bots sometimes declare who they are and can be blocked on that basis; 'googlebot' is an example. Some bots make no distinction between themselves and a human browser.
  • Bots can be blocked by excess traffic monitoring.
  • Bots can be blocked with tools to verify that it is a real person accessing the site, like a CAPTCHA.
  • Commercial anti-bot services: Several companies, such as Distil, SiteBlackBox and Sentor, offer anti-bot and anti-scraping services for websites. A few Web Application Firewalls have limited bot detection capabilities as well.
  • Locating bots with a honeypot or other method to identify the IP addresses of automated crawlers.
  • Sometimes bots can be blocked with carefully crafted JavaScript code.
  • Using CSS Sprites to display such data as phone numbers or email addresses.
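
From the scraper's side, the robots.txt measure above only works when the bot actually consults the file. The following is a minimal sketch of doing so with Python's standard urllib.robotparser module; the site URL and bot name are placeholder assumptions.

    # Checking the Robots Exclusion Standard before fetching a page,
    # using Python's standard urllib.robotparser module.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # placeholder site
    rp.read()

    url = "http://example.com/private/page.html"
    if rp.can_fetch("ExampleScraper/0.1", url):
        print("robots.txt permits fetching", url)
    else:
        print("robots.txt disallows fetching", url)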

Notable tools

  • Apache Camel
  • Automation Anywhere
  • cURL
  • Data Toolbar
  • Firebug
  • Greasemonkey
  • HtmlUnit
  • HTTrack
  • iMacros
  • Jaxer
  • Nokogiri
  • ScraperWiki
  • Scrapy
  • SimpleTest
  • Watir
  • Wget
  • WSO2 Mashup Server
  • Yahoo! Pipes
  • Yahoo! Query Language (YQL)

See also

  • Importer (computing)
  • 80legs
  • Corpus linguistics
  • Data scraping
  • Report mining
  • Mashup (web application hybrid)
  • OpenSocial
  • Scraper site
  • Screen scraping
  • Spamdexing
  • Text corpus
  • Web crawling
  • Metadata
  • Comparison of feed aggregators


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.