Data extraction
Data extraction is the act or process of retrieving data
The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which...

 out of (usually unstructured
Unstructured data
Unstructured Data refers to information that either does not have a pre-defined data model and/or does not fit well into relational tables. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well...

 or poorly structured) data source
Data source
A data source is any of the following types of sources for digitized data:*a database** in the Java software platform, datasource is a special name for the connection set up to a database from a server*a computer file*a data stream...

s for further data processing
Data processing
Computer data processing is any process that a computer program does to enter data and summarise, analyse or otherwise convert data into usable information. The process may be automated and run on a computer. It involves recording, analysing, sorting, summarising, calculating, disseminating and...

 or data storage
Data storage device
thumb|200px|right|A reel-to-reel tape recorder .The magnetic tape is a data storage medium. The recorder is data storage equipment using a portable medium to store the data....

 (data migration
Data migration
Data migration is the process of transferring data between storage types, formats, or computer systems. Data migration is usually performed programmatically to achieve an automated migration, freeing up human resources from tedious tasks...

). The import into the intermediate extracting system is thus usually followed by data transformation
Data transformation
In metadata and data warehouse, a data transformation converts data from a source data format into destination data.Data transformation can be divided into two steps:...

 and possibly the addition of metadata
The term metadata is an ambiguous term which is used for two fundamentally different concepts . Although the expression "data about data" is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at...

 prior to export to another stage in the data workflow
A workflow consists of a sequence of connected steps. It is a depiction of a sequence of operations, declared as work of a person, a group of persons, an organization of staff, or one or more simple or complex mechanisms. Workflow may be seen as any abstraction of real work...


Usually, the term data extraction is applied when (experiment
An experiment is a methodical procedure carried out with the goal of verifying, falsifying, or establishing the validity of a hypothesis. Experiments vary greatly in their goal and scale, but always rely on repeatable procedure and logical analysis of the results...

al) data is first imported into a computer from primary sources, like measuring or recording devices. Today's electronic devices will usually present a electrical connector
Electrical connector
An electrical connector is an electro-mechanical device for joining electrical circuits as an interface using a mechanical assembly. The connection may be temporary, as for portable equipment, require a tool for assembly and removal, or serve as a permanent electrical joint between two wires or...

 (e.g. USB) through which 'raw data
Raw data
'\putang inaIn computing, it may have the following attributes: possibly containing errors, not validated; in sfferent formats; uncoded or unformatted; and suspect, requiring confirmation or citation. For example, a data input sheet might contain dates as raw data in many forms: "31st January...

' can be streamed
Data stream
In telecommunications and computing, a data stream is a sequence of digitally encoded coherent signals used to transmit or receive information that is in the process of being transmitted....

 into a personal computer
Personal computer
A personal computer is any general-purpose computer whose size, capabilities, and original sales price make it useful for individuals, and which is intended to be operated directly by an end-user with no intervening computer operator...


Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, spool files etc. Extracting data from these unstructured sources has grown into a considerable technical challenge where as historically data extraction has had to deal with changes in physical hardware formats, the majority of current data extraction deals with extracting data from these unstructured data sources, and from different software formats. This growing process of data extraction from the web is referred to as Web scraping
Web scraping
Web scraping is a computer software technique of extracting information from websites...


The act of adding structure to unstructured data takes a number of forms
  • Using text pattern matching such as regular expression
    Regular expression
    In computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...

    s to identify small or large-scale structure e.g. records in a report and their associated data from headers and footers;
  • Using a table-based approach to identify common sections within a limited domain e.g. in emailed resumes, identifying skills, previous work experience, qualifications etc using a standard set of commonly used headings (these would differ from language to language), eg Education might be found under Education/Qualification/Courses;
  • Using text analytics to attempt to understand the text and link it to other information

External links

  • Data Extraction as a part of the ETL process in a Data Warehousing environment
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.