Data scraping
Encyclopedia
Data scraping is a technique in which a computer program
Computer program
A computer program is a sequence of instructions written to perform a specified task with a computer. A computer requires programs to function, typically executing the program's instructions in a central processor. The program has an executable form that the computer can use directly to execute...

 extracts data
Data
The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which...

 from human-readable output
Output
Output is the term denoting either an exit or changes which exit a system and which activate/modify a process. It is an abstract concept, used in the modeling, system design and system exploitation.-In control theory:...

 coming from another program.

Description

Normally, data transfer
Data transmission
Data transmission, digital transmission, or digital communications is the physical transfer of data over a point-to-point or point-to-multipoint communication channel. Examples of such channels are copper wires, optical fibres, wireless communication channels, and storage media...

 between programs is accomplished using data structures suited for automated
Automation
Automation is the use of control systems and information technologies to reduce the need for human work in the production of goods and services. In the scope of industrialization, automation is a step beyond mechanization...

 processing by computers, not people. Such interchange formats
File format
A file format is a particular way that information is encoded for storage in a computer file.Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice-versa. There are different kinds of formats for...

 and protocols are typically rigidly structured, well-documented, easily parsed
Parsing
In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens , to determine its grammatical structure with respect to a given formal grammar...

, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable
Human-readable
A human-readable medium or human-readable format is a representation of data or information that can be naturally read by humans.In computing, human-readable data is often encoded as ASCII or Unicode text, rather than presented in a binary representation...

 at all.

Thus, the key element that distinguishes data scraping from regular parsing
Parsing
In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens , to determine its grammatical structure with respect to a given formal grammar...

 is that the output being scraped was intended for display to an end-user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display
Display device
A display device is an output device for presentation of information in visual or tactile form...

 formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.

Data scraping is most often done to either (1) interface to a legacy system
Legacy system
A legacy system is an old method, technology, computer system, or application program that continues to be used, typically because it still functions for the users' needs, even though newer technology or more efficient methods of performing a task are now available...

 which has no other mechanism which is compatible with current hardware
Computer hardware
Personal computer hardware are component devices which are typically installed into or peripheral to a computer case to create a personal computer upon which system software is installed including a firmware interface such as a BIOS and an operating system which supports application software that...

, or (2) interface to a third-party system which does not provide a more convenient API
Application programming interface
An application programming interface is a source code based specification intended to be used as an interface by software components to communicate with each other...

. In the second case, the operator of the third-party system may even see screen scraping as unwanted, due to reasons such as increased system load
Load (computing)
In UNIX computing, the system load is a measure of the amount of work that a computer system performs. The load average represents the average system load over a period of time...

, the loss of advertisement revenue
Revenue
In business, revenue is income that a company receives from its normal business activities, usually from the sale of goods and services to customers. In many countries, such as the United Kingdom, revenue is referred to as turnover....

, or the loss of control of the information content.

Data scraping is generally considered an ad hoc, inelegant technique, often used only as a "last resort" when no other mechanism is available. Aside from the higher programming
Computer programming
Computer programming is the process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages. The purpose of programming is to create a program that performs specific operations or exhibits a...

 and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but computer programs will often crash
Crash (computing)
A crash in computing is a condition where a computer or a program, either an application or part of the operating system, ceases to function properly, often exiting after encountering errors. Often the offending program may appear to freeze or hang until a crash reporting service documents...

 or produce incorrect results.

Screen scraping is normally associated with the programmatic collection of visual data from a source, instead of parsing data as in web scraping. Originally, screen scraping referred to the practice of reading text data from a computer display terminal
Computer terminal
A computer terminal is an electronic or electromechanical hardware device that is used for entering data into, and displaying data from, a computer or a computing system...

's screen
Display device
A display device is an output device for presentation of information in visual or tactile form...

. This was generally done by reading the terminal's memory through its auxiliary port
Computer port (hardware)
In computer hardware, a port serves as an interface between the computer and other computers or peripheral devices. Physically, a port is a specialized outlet on a piece of equipment to which a plug or cable connects...

, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This could be the simple cases where the controlling program navigates through the user interface, or more complex scenarios where the controlling program is entering data into an interface meant to be used by a human.

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s — the dawn of computerized data processing
Data processing
Computer data processing is any process that a computer program does to enter data and summarise, analyse or otherwise convert data into usable information. The process may be automated and run on a computer. It involves recording, analysing, sorting, summarising, calculating, disseminating and...

. Computer to user interface
User interface
The user interface, in the industrial design field of human–machine interaction, is the space where interaction between humans and machines occurs. The goal of interaction between a human and a machine at the user interface is effective operation and control of the machine, and feedback from the...

s from that era were often simply text-based dumb terminals which were not much more than virtual teleprinter
Teleprinter
A teleprinter is a electromechanical typewriter that can be used to communicate typed messages from point to point and point to multipoint over a variety of communication channels that range from a simple electrical connection, such as a pair of wires, to the use of radio and microwave as the...

s (such systems are still in use , for various reasons). The desire to interface such a system to more modern systems is common. A robust
Robustness (computer science)
In computer science, robustness is the ability of a computer system to cope with errors during execution or the ability of an algorithm to continue to operate despite abnormalities in input, calculations, etc. Formal techniques, such as fuzz testing, are essential to showing robustness since this...

 solution will often require things no longer available, such as source code
Source code
In computer science, source code is text written using the format and syntax of the programming language that it is being written in. Such a language is specially designed to facilitate the work of computer programmers, who specify the actions to be performed by a computer mostly by writing source...

, system documentation
Documentation
Documentation is a term used in several different ways. Generally, documentation refers to the process of providing evidence.Modules of Documentation are Helpful...

, API
Application programming interface
An application programming interface is a source code based specification intended to be used as an interface by software components to communicate with each other...

s, and/or programmers with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper which "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet
TELNET
Telnet is a network protocol used on the Internet or local area networks to provide a bidirectional interactive text-oriented communications facility using a virtual terminal connection...

, emulate
Emulator
In computing, an emulator is hardware or software or both that duplicates the functions of a first computer system in a different second computer system, so that the behavior of the second system closely resembles the behavior of the first system...

 the keystrokes needed to navigate the old user interface, process the
resulting display output, extract the desired data, and pass it on to the modern system.

In the 1980s, financial data providers such as Reuters
Reuters
Reuters is a news agency headquartered in New York City. Until 2008 the Reuters news agency formed part of a British independent company, Reuters Group plc, which was also a provider of financial market data...

, Telerate
Dow Jones & Company
Dow Jones & Company is an American publishing and financial information firm.The company was founded in 1882 by three reporters: Charles Dow, Edward Jones, and Charles Bergstresser. Like The New York Times and the Washington Post, the company was in recent years publicly traded but privately...

, and Quotron displayed data in 24x80 format intended for a human reader. Users of this data, particularly investment banks
Investment banking
An investment bank is a financial institution that assists individuals, corporations and governments in raising capital by underwriting and/or acting as the client's agent in the issuance of securities...

, wrote applications to capture and convert this character data as numeric data for inclusion into calculations for trading decisions without re-keying the data. The common term for this practice, especially in the United Kingdom
United Kingdom
The United Kingdom of Great Britain and Northern IrelandIn the United Kingdom and Dependencies, other languages have been officially recognised as legitimate autochthonous languages under the European Charter for Regional or Minority Languages...

, was page shredding, since the results could be imagined to have passed through a paper shredder
Paper shredder
A paper shredder is a mechanical device used to cut paper into chad, typically either strips or fine particles. Government organizations, businesses, and private individuals use shredders to destroy private, confidential, or otherwise sensitive documents...

.

More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an OCR
Optical character recognition
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping...

 engine, or in the case of GUI
Gui
Gui or guee is a generic term to refer to grilled dishes in Korean cuisine. These most commonly have meat or fish as their primary ingredient, but may in some cases also comprise grilled vegetables or other vegetarian ingredients. The term derives from the verb, "gupda" in Korean, which literally...

 applications, querying the graphical controls by programmatically obtaining references to their underlying programming objects
Object-oriented programming
Object-oriented programming is a programming paradigm using "objects" – data structures consisting of data fields and methods together with their interactions – to design applications and computer programs. Programming techniques may include features such as data abstraction,...

.

Web scraping

Web page
Web page
A web page or webpage is a document or information resource that is suitable for the World Wide Web and can be accessed through a web browser and displayed on a monitor or mobile device. This information is usually in HTML or XHTML format, and may provide navigation to other web pages via hypertext...

s are built using text-based mark-up languages (HTML
HTML
HyperText Markup Language is the predominant markup language for web pages. HTML elements are the basic building-blocks of webpages....

 and XHTML
XHTML
XHTML is a family of XML markup languages that mirror or extend versions of the widely-used Hypertext Markup Language , the language in which web pages are written....

), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper
Web scraping
Web scraping is a computer software technique of extracting information from websites...

 is an API to extract data from a web site.

Report mining

Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human readable format, such as HTML, PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining data without needing to program an API to the source system.**

See also

  • Importer (computing)
    Importer (computing)
    An importer is a software application that reads in a data file or metadata information in one format and converts it to another format via special algorithms . An importer often is not an entire program by itself, but an extension to another program, implemented as a plug-in...

  • Web scraping
    Web scraping
    Web scraping is a computer software technique of extracting information from websites...

  • Mashup (web application hybrid)
    Mashup (web application hybrid)
    In Web development, a mashup is a Web page or application that uses and combines data, presentation or functionality from two or more sources to create new services...

  • Report mining
    Report mining
    Report mining is the extraction of data from human readable computer reports. Conventional data extraction requires a connection to a working source system, suitable connectivity standards or an API, and usually complex querying...

  • Metadata
    Metadata
    The term metadata is an ambiguous term which is used for two fundamentally different concepts . Although the expression "data about data" is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at...

  • Comparison of feed aggregators
    Comparison of feed aggregators
    The following is a comparison of notable RSS feed aggregators. Often e-mail programs and web browsers have the ability to display RSS feeds. They are listed here, too.Many BitTorrent clients support RSS feeds for broadcatching ....


Further reading

  • Hemenway, Kevin and Calishain, Tara. Spidering Hacks. Cambridge, Massachusetts: O'Reilly, 2003. ISBN 0-596-00577-6.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK