Web scraping
Web scraping is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP) or embedding a fully fledged web browser, such as Internet Explorer or Mozilla Firefox.

Web scraping is closely related to web indexing, which indexes information on the web using a bot and is a universal technique adopted by most search engines. In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, weather data monitoring, website change detection, research, web mashup and web data integration.

Techniques

Web scraping is the process of automatically collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the Semantic Web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interaction. Web scraping instead favors practical solutions based on existing technologies that are often entirely ad hoc. There are therefore different levels of automation that existing web-scraping technologies can provide:
  • Human copy-and-paste: Sometimes even the best web-scraping technology cannot replace a human’s manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites being scraped explicitly set up barriers to prevent machine automation.
  • Text grepping and regular expression matching: A simple yet powerful approach to extracting information from web pages can be based on the UNIX grep command or the regular expression matching facilities of programming languages (for instance Perl or Python); see the first sketch after this list.
  • HTTP programming: Static and dynamic web pages can be retrieved by sending HTTP requests to the remote web server using socket programming; see the second sketch after this list.
  • DOM parsing: By embedding a full-fledged web browser, such as the Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve parts of the pages; see the third sketch after this list.
  • HTML parsers: Some semi-structured data query languages, such as XQuery and HTQL, can be used to parse HTML pages and to retrieve and transform page content; see the fourth sketch after this list.
  • Web-scraping software: Many software tools are available to customize web-scraping solutions. Such software may attempt to recognize the data structure of a page automatically, provide a recording interface that removes the need to write scraping code by hand, offer scripting functions for extracting and transforming content, or provide database interfaces for storing the scraped data in local databases.
  • Vertical aggregation platforms: Several companies have developed vertical-specific harvesting platforms. These platforms create and monitor a multitude of “bots” for specific verticals, with no man in the loop and no work tied to a specific target site. The preparation involves establishing a knowledge base for the entire vertical, after which the platform creates the bots automatically. A platform's robustness is measured by the quality of the information it retrieves (usually the number of fields) and by its scalability (how quickly it can scale up to hundreds or thousands of sites). This scalability is mostly used to target the long tail of sites that common aggregators find complicated or too labor-intensive to harvest content from.
  • Semantic annotation recognition: The pages being scraped may contain metadata or semantic markups and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformats are, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separately from the web pages, so scrapers can retrieve data schemas and instructions from this layer before scraping the pages.
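
First, a minimal illustration of the text-grepping approach: the following Python sketch pulls price-like tokens out of a fragment of raw markup using the standard re module. The HTML snippet and the pattern are illustrative assumptions, not taken from any real site.

    # Regular-expression extraction with Python's standard re module.
    # The markup and the price pattern are illustrative assumptions.
    import re

    html = '<li class="price">$19.99</li><li class="price">$5.49</li>'

    # Match a dollar sign followed by digits and an optional cent part.
    prices = re.findall(r"\$\d+(?:\.\d{2})?", html)
    print(prices)  # ['$19.99', '$5.49']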
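
Second, the HTTP-programming approach can be as simple as sending a GET request and reading the response body. This sketch uses only Python's standard urllib module; the target URL and User-Agent string are placeholders.

    # Retrieving a page by sending an HTTP request directly.
    # example.com and the User-Agent string are placeholder assumptions.
    import urllib.request

    request = urllib.request.Request(
        "http://example.com/",
        headers={"User-Agent": "ExampleScraper/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        page = response.read().decode("utf-8", errors="replace")

    print(page[:200])  # first 200 characters of the raw HTML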
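
Third, for DOM parsing, one way to embed a real browser engine is to drive it from code. The sketch below uses the third-party Selenium package as a stand-in for the browser controls named above (an assumption; the article does not prescribe a specific tool), so client-side scripts run before the DOM is queried.

    # DOM parsing through an automated browser, sketched with the
    # third-party Selenium package (pip install selenium).
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()       # requires Firefox and geckodriver
    driver.get("http://example.com/")  # placeholder target page
    # Client-side scripts have run; query the resulting DOM tree.
    headings = driver.find_elements(By.TAG_NAME, "h1")
    print([h.text for h in headings])
    driver.quit()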
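
Fourth, query languages such as XQuery or HTQL require their own engines; as a dependency-free stand-in, this sketch uses Python's built-in event-driven html.parser module to collect link targets from a fragment of markup (the fragment is illustrative).

    # Event-driven HTML parsing with Python's standard html.parser
    # module; the input fragment is an illustrative assumption.
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Accumulate the href target of every anchor tag seen."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    parser = LinkCollector()
    parser.feed('<p><a href="/docs">docs</a> and '
                '<a href="http://example.com/">example</a></p>')
    print(parser.links)  # ['/docs', 'http://example.com/']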

Legal issues

Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear. While outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable.
U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels, which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering Bidder's Edge to stop data mining from the eBay web site. This case involved automatic placing of bids, known as auction sniping. However, in order to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels.

One of the first major tests of screen scraping involved American Airlines (AA) and a firm called FareChase. AA successfully obtained an injunction from a Texas trial court, stopping FareChase from selling software that enables users to compare online fares if the software also searches AA's website. The airline argued that FareChase's websearch software trespassed on AA's servers when it collected the publicly available data. The injunction was appealed in 2003.

Southwest Airlines has also challenged screen-scraping practices, and has involved both FareChase and another firm, Outtask, in a legal claim. Southwest Airlines charged that the screen scraping is illegal since it is an example of "Computer Fraud and Abuse" and has led to "Damage and Loss" and "Unauthorized Access" of Southwest's site. It also constitutes "Interference with Business Relations", "Trespass" and "Harmful Access by Computer". They also claimed that screen scraping constitutes what is legally known as misappropriation and unjust enrichment, and is also a breach of the website's user agreement. Outtask denied all these claims, arguing that the prevailing law in this case should be US copyright law and that, under copyright, the pieces of information being scraped would not be subject to copyright protection. Although the cases were never resolved in the Supreme Court of the United States, FareChase was eventually shuttered by parent company Yahoo!, and Outtask was purchased by travel expense company Concur.

Although these are early scraping decisions, and the theories of liability are not uniform, it is difficult to ignore a pattern emerging that the courts are prepared to protect proprietary content on commercial sites from uses which are undesirable to the owners of such sites. However, the degree of protection for such content is not settled, and will depend on the type of access made by the scraper, the amount of information accessed and copied, the degree to which the access adversely affects the site owner’s system and the types and manner of prohibitions on such conduct.

Until the law in this area becomes more settled, entities contemplating using scraping programs to access a public web site should also consider whether such action is authorized by reviewing the terms of use and other terms or notices posted on or made available through the site. In the latest ruling, Cvent, Inc. v. Eventbrite, Inc., the United States District Court for the Eastern District of Virginia ruled that the terms of use must be brought to the users' attention in order for a browse wrap contract or license to be enforced. On the plaintiff's website during the period of this trial, the terms-of-use link was displayed among all the links of the site, at the bottom of the page, as on most sites on the internet. This ruling contradicts the Irish ruling described below. The court also rejected the plaintiff's argument that the browse wrap restrictions were enforceable in view of Virginia's adoption of the Uniform Computer Information Transactions Act (UCITA), a uniform law that many believed favored common browse wrap contracting practices.

Outside of the United States, in a February 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) found that systematic crawling, indexing and deep linking by portal site ofir.dk of real estate site Home.dk did not conflict with Danish law or the database directive of the European Union. In February 2010, in the case of Ryanair Limited v Billigfluege.de GmbH, the Irish High Court established a precedent by upholding the Terms of Use on Ryanair's website, including the prohibitions contained in those Terms of Use against screen scraping. The case is currently under appeal in the Supreme Court. In Australia, the Spam Act 2003 outlaws some forms of web harvesting, although this only applies to email addresses.

Technical measures to stop bots

The administrator of a website can use various measures to stop or slow a bot. Some techniques include:
  • Adding entries to robots.txt: if the bot is well behaved, it will adhere to the Robots Exclusion Standard; Google and other well-behaved bots can be stopped this way (see the sketch after this list).
  • Blocking an IP address. This will also block all browsing from that address.
  • Bots sometimes declare who they are and can be blocked on that basis; 'googlebot' is an example. Some bots make no distinction between themselves and a human browser.
  • Bots can be blocked by excess traffic monitoring.
  • Bots can be blocked with tools to verify that it is a real person accessing the site, like a CAPTCHA.
  • Commercial anti-bot services: Several companies, such as Distil, SiteBlackBox and Sentor, offer anti-bot and anti-scraping services for websites. A few Web Application Firewalls have limited bot detection capabilities as well.
  • Locating bots with a honeypot or other method to identify the IP addresses of automated crawlers.
  • Sometimes bots can be blocked with carefully crafted JavaScript code.
  • Using CSS Sprites to display such data as phone numbers or email addresses.
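
From the scraper's side, the robots.txt measure above only works when the bot actually consults the file. The following is a minimal sketch of doing so with Python's standard urllib.robotparser module; the site URL and bot name are placeholder assumptions.

    # Checking the Robots Exclusion Standard before fetching a page,
    # using Python's standard urllib.robotparser module.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # placeholder site
    rp.read()

    url = "http://example.com/private/page.html"
    if rp.can_fetch("ExampleScraper/0.1", url):
        print("robots.txt permits fetching", url)
    else:
        print("robots.txt disallows fetching", url)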

Notable tools

  • Apache Camel
  • Automation Anywhere
  • cURL
  • Data Toolbar
  • Firebug
  • Greasemonkey
  • HtmlUnit
  • HTTrack
  • iMacros
  • Jaxer
  • Nokogiri
  • ScraperWiki
  • Scrapy
  • SimpleTest
  • Watir
  • Wget
  • WSO2 Mashup Server
  • Yahoo! Pipes
  • Yahoo! Query Language (YQL)

See also

  • Importer (computing)
  • 80legs
  • Corpus linguistics
  • Data scraping
  • Report mining
  • Mashup (web application hybrid)
  • OpenSocial
  • Scraper site
  • Screen scraping
  • Spamdexing
  • Text corpus
  • Web crawling
  • Metadata
  • Comparison of feed aggregators


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.