Unstructured data - AbsoluteAstronomy.com

Unstructured data refers to information that either does not have a pre-defined data model and/or does not fit well into relational tables. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional computer programs, as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.
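As a small illustration of why dates and numbers embedded in text are harder to use than fielded data, the sketch below recovers them from a free-text note with regular expressions. The sample sentence, the two patterns, and the function name `extract_fields` are assumptions made for this example; real text would need far more robust handling.

```python
import re

# Hypothetical free-text note, invented for this example.
NOTE = "Met the supplier on 2011-03-14 and agreed on 250 units at 19.99 each."

# Assumed formats: ISO-style dates and plain decimal numbers only.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def extract_fields(text):
    """Return the dates and numbers found in a piece of free text."""
    dates = DATE_RE.findall(text)
    # Blank out the dates first so their digits are not counted as numbers.
    remainder = DATE_RE.sub(" ", text)
    numbers = [float(n) for n in NUMBER_RE.findall(remainder)]
    return {"dates": dates, "numbers": numbers}


print(extract_fields(NOTE))
```

In a relational table these values would sit in typed columns. Here the program has to guess at formats, and nothing in the text says which number is a quantity and which is a price.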
The term is imprecise for several reasons:

structure, while not formally defined, can still be implied;
data with some form of structure may still be characterized as unstructured if its structure is not helpful for the desired processing task; and
unstructured information might have some structure (semi-structured) or even be highly structured, but in ways that are unanticipated or unannounced.

Software that creates machine-processable structure exploits the linguistic, auditory, and visual structure that is inherent in all forms of human communication. This inherent structure can be inferred from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery.

Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, files, and unstructured text such as the body of an e-mail message, Web page, or word processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. files or documents) that themselves have structure. Such objects are thus a mix of structured and unstructured data, but collectively they are still referred to as "unstructured data". For example, an HTML web page is tagged, but HTML mark-up is typically designed solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. XHTML tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.
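The point about HTML can be made concrete with a short sketch that walks a tiny page using Python's standard `html.parser` module and records which tag encloses each piece of text. The markup in `PAGE`, the class name `TagCollector`, and the helper `collect` are all invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this example.
PAGE = "<html><body><h1>Invoice</h1><p><b>Total:</b> 42 USD</p></body></html>"


class TagCollector(HTMLParser):
    """Record each non-empty text run with its innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.items = []   # (enclosing tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            enclosing = self.stack[-1] if self.stack else None
            self.items.append((enclosing, text))


def collect(page):
    parser = TagCollector()
    parser.feed(page)
    parser.close()
    return parser.items


print(collect(PAGE))
```

The parser recovers the mark-up structure easily, but the tags only say "heading", "bold", and "paragraph". Nothing identifies 42 as an amount or USD as a currency, so the information content still has to be inferred by other means.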

In 1998, Merrill Lynch cited estimates that as much as 80% of all potentially usable business information originates in unstructured form. Such estimates may not be based on primary research, but they are nonetheless widely accepted.
More recently, multiple analysts have estimated that data will grow 800% over the next five years. Unstructured information is estimated to account for 70–80% or more of all data in organizations and to be growing 10 to 50 times faster than structured data.

Dealing with unstructured data

Data mining, text analytics, and noisy text analytics are different methods used to find patterns in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging for further text mining-based structuring. UIMA provides a common framework for processing this information to extract meaning and create structured data about the information.
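As a rough sketch of how tagging turns text into structured data, the example below assigns a part-of-speech label to each word by dictionary lookup. This is a deliberately simplified stand-in: real part-of-speech taggers also use context, and the `LEXICON` contents and the function names `tag_tokens` and `nouns` are hypothetical.

```python
import re

# Hypothetical hand-built lexicon; a real tagger would also use context.
LEXICON = {
    "the": "DET",
    "a": "DET",
    "customer": "NOUN",
    "refund": "NOUN",
    "requested": "VERB",
}


def tag_tokens(text):
    """Split text into lowercase words and pair each with a tag."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Words missing from the lexicon get the placeholder tag "UNK".
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokens]


def nouns(text):
    """Use the tags as structure: keep only the words tagged as nouns."""
    return [tok for tok, tag in tag_tokens(text) if tag == "NOUN"]


print(tag_tokens("The customer requested a refund."))
print(nouns("The customer requested a refund."))
```

Once each token carries a tag, later text mining steps can query the text as records (for example, all nouns in a complaint) instead of treating it as an opaque string.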

Several commercial solutions are available for analyzing and understanding unstructured data for business applications. These include products from companies such as SAS, Inxight, and SPSS, as well as more specialized offerings such as Attensity 360 and Sysomos, which focuses on analyzing unstructured social media data.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.