Unstructured data
Unstructured data refers to information that does not have a pre-defined data model and/or does not fit well into relational tables. Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts. This results in irregularities and ambiguities that make it more difficult to process with traditional computer programs than data stored in fielded form in databases or annotated (semantically tagged) in documents.
The term is imprecise for several reasons:
  1. structure, while not formally defined, can still be implied;
  2. data with some form of structure may still be characterized as unstructured if its structure is not helpful for the desired processing task; and
  3. unstructured information might have some structure (semi-structured) or even be highly structured, but in ways that are unanticipated or unannounced.


Software that creates machine-processable structure exploits the linguistic, auditory, and visual structure that is inherent in all forms of human communication. This inherent structure can be inferred from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery. Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, files, and unstructured text such as the body of an e-mail message, Web page, or word processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. files or documents) that themselves have structure and are thus a mix of structured and unstructured data; collectively, this is still referred to as "unstructured data". For example, an HTML web page is tagged, but HTML mark-up is typically designed solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. XHTML tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.
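The point about HTML mark-up can be illustrated with a minimal Python sketch that uses only the standard library's `html.parser` module. The sample page, the class name `TagAndTextCollector`, and the helper `split_markup_and_text` are hypothetical and exist only for this illustration. The sketch separates the tags (the structured packaging) from the text they enclose (the unstructured content).

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Collects start-tag names and visible text from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = []        # structured part: names of the opening tags
        self.text_parts = []  # unstructured part: the text between tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def split_markup_and_text(html):
    """Return (list of tag names, plain text) for an HTML string."""
    collector = TagAndTextCollector()
    collector.feed(html)
    collector.close()
    return collector.tags, " ".join(collector.text_parts)


# Hypothetical sample page, invented for this example.
SAMPLE_PAGE = (
    "<html><body><h1>Invoice</h1>"
    "<p>Payment due <b>12 March 2010</b> for 450 USD</p>"
    "</body></html>"
)

tags, text = split_markup_and_text(SAMPLE_PAGE)
print(tags)  # ['html', 'body', 'h1', 'p', 'b']
print(text)  # Invoice Payment due 12 March 2010 for 450 USD
```

The tag names say how the page should be laid out (a heading, a paragraph, bold text), but nothing in them marks "12 March 2010" as a due date or "450 USD" as an amount. That meaning still has to be inferred from the text itself.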

In 1998, Merrill Lynch cited estimates that as much as 80% of all potentially usable business information originates in unstructured form. Such estimates may not be based on primary research, but they are nonetheless widely accepted.
More recently, multiple analysts have estimated that data volumes will grow 800% over the next five years. Unstructured information is estimated to account for 70–80% or more of all data in organizations and to be growing 10 to 50 times faster than structured data.

Dealing with unstructured data

Data mining, text analytics, and noisy text analytics are different methods used to find patterns in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging for further text mining-based structuring. UIMA provides a common framework for processing this information to extract meaning and create structured data about the information.
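As a rough illustration of creating structured data about free text, the following Python sketch pulls dates, e-mail addresses, and monetary amounts out of a sentence using regular expressions. It does not use UIMA or any text analytics product. The patterns, the function name `extract_record`, and the sample message are assumptions made for this example, and the patterns cover only a few simple surface forms.

```python
import re

# Simplified patterns, assumed for illustration; real text needs far more forms.
DATE_PATTERN = re.compile(
    r"\b(\d{1,2}) (January|February|March|April|May|June|July|August|"
    r"September|October|November|December) (\d{4})\b"
)
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
AMOUNT_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?) (USD|EUR|GBP)\b")


def extract_record(text):
    """Turn free text into a small fielded record (a dict of lists)."""
    return {
        "dates": [" ".join(m.groups()) for m in DATE_PATTERN.finditer(text)],
        "emails": EMAIL_PATTERN.findall(text),
        "amounts": [
            (float(value), currency)
            for value, currency in AMOUNT_PATTERN.findall(text)
        ],
    }


# Hypothetical e-mail body, invented for this example.
SAMPLE_MESSAGE = (
    "Hi Dana, the invoice for 450 USD is due on 12 March 2010. "
    "Questions go to billing@example.com."
)

record = extract_record(SAMPLE_MESSAGE)
print(record)
```

The resulting record has named fields that could be stored in a relational table, while the rest of the message stays unstructured. Pattern matching of this kind is brittle in the face of noisy or ambiguous text, which is one reason linguistic and statistical techniques are used instead.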

Several commercial solutions are available for analyzing and understanding unstructured data for business applications. These include products from companies such as SAS, Inxight, and SPSS, as well as more specialized offerings such as Attensity360 and Sysomos, which focuses on analyzing unstructured social media data.

See also

  • UIMA
  • Data mining
  • Metadata
  • Noisy text
  • Semi-structured data
  • General Architecture for Text Engineering


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 