Noisy text analytics - AbsoluteAstronomy.com

Noisy text analytics is a process of information extraction whose goal is to automatically extract structured or semistructured information from noisy unstructured text data. While text analytics is a growing and mature field that has great value because of the huge amounts of data being produced, processing of noisy text is gaining in importance because many common applications produce noisy text data. Noisy unstructured text data is found in informal settings such as online chat, text messages, e-mails, message boards, newsgroups, blogs, wikis and web pages. In addition, text produced by processing spontaneous speech with automatic speech recognition, or printed or handwritten text with optical character recognition (OCR), contains processing noise. Text produced under such circumstances is typically highly noisy, containing spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing letter case information, pause-filling words such as “um” and “uh”, and other texting and speech disfluencies. Such text can be seen in large amounts in contact centers, chat rooms, OCR of text documents, short message service (SMS) text, etc. Documents in historical languages can also be considered noisy with respect to today’s knowledge about the language. Such text contains important historical, religious and ancient medical knowledge that is useful. The nature of the noisy text produced in all these contexts warrants moving beyond traditional text analysis techniques.

Techniques for noisy text analysis

Missing punctuation and the use of non-standard words can often hinder standard natural language processing tools such as part-of-speech tagging and parsing. Techniques both to learn from noisy data and to process it are only now being developed.
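
One common way to make noisy text more tractable for standard tools is to normalize it first. The following is a minimal sketch of that idea, not a method described in this article: it lowercases the input, drops pause-filling words, maps a few texting forms to standard words, and collapses immediately repeated tokens. The `FILLERS` and `TEXTING_FORMS` tables and the `normalize` function are hypothetical names chosen for illustration; a real system would need far larger resources and context-sensitive rules.

```python
import re

# Hypothetical lookup tables, for illustration only.
FILLERS = {"um", "uh"}
TEXTING_FORMS = {"u": "you", "r": "are", "gr8": "great", "pls": "please", "thx": "thanks"}


def normalize(text):
    """Return a crudely normalized version of a noisy utterance."""
    # Lowercase and keep only runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cleaned = []
    for tok in tokens:
        if tok in FILLERS:
            continue  # drop pause-filling words
        tok = TEXTING_FORMS.get(tok, tok)  # map texting forms to standard words
        if cleaned and cleaned[-1] == tok:
            continue  # drop an immediate repetition such as "to to"
        cleaned.append(tok)
    return " ".join(cleaned)


print(normalize("Um I I want uh to to thank u"))
```

Note that this sketch discards letter case and punctuation entirely and would also collapse legitimate repetitions (for example "had had"), which is one reason simple rule-based cleaning is rarely sufficient on its own.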

Possible sources of noisy text

World Wide Web: Poorly written text is found in web pages, online chat, blogs, wikis, discussion forums and newsgroups. Most of this data is unstructured, and the style of writing is very different from, say, well-written news articles. Analysis of web data is important because it is a source for market buzz analysis, market review, trend estimation, etc. Also, because of the large amount of data, it is necessary to find efficient methods of information extraction, classification, automatic summarization and analysis of this data.
Contact centers: This is a general term for help desks, information lines and customer service centers operating in domains ranging from computer sales and support to mobile phones to apparel. On average, a person in the developed world interacts with a contact center agent at least once a week. A typical contact center agent handles over a hundred calls per day. Contact centers operate in various modes such as voice, online chat and e-mail. The contact center industry produces gigabytes of data in the form of e-mails, chat logs, voice conversation transcriptions, customer feedback, etc. The bulk of contact center data is voice conversations. Transcribing these with state-of-the-art automatic speech recognition results in text with a 30-40% word error rate. Further, even written modes of communication, such as online chat between customers and agents and interactions over e-mail, tend to be noisy. Analysis of contact center data is essential for customer relationship management, customer satisfaction analysis, call modeling, customer profiling, agent profiling, etc., and it requires sophisticated techniques to handle poorly written text.
Printed documents: Many libraries, government organizations and national defence organizations have vast repositories of hard copy documents. To retrieve and process the content of such documents, they need to be processed using optical character recognition (OCR). In addition to printed text, these documents may also contain handwritten annotations. OCRed text can be highly noisy depending on the font size, quality of the print, etc. Word error rates can range from 2-3% to as high as 50-60%. Handwritten annotations can be particularly hard to decipher, and error rates can be quite high in their presence.
Short Message Service (SMS): Language usage in computer-mediated discourse, such as chats, e-mails and SMS texts, differs significantly from the standard form of the language. An urge towards shorter message length, which facilitates faster typing, and the need for semantic clarity shape the structure of this non-standard form, known as the texting language.
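
The word error rates quoted above for speech transcripts and OCR output are conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the system output and a reference transcript, divided by the number of words in the reference. The sketch below illustrates that calculation under simple assumptions: whitespace tokenization, exact string matching, and a hypothetical function name `word_error_rate`. It is not tied to any particular recognition toolkit.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fax jumps over lazy dog"
# One substitution and one deletion against nine reference words.
print(round(word_error_rate(reference, hypothesis), 3))
```

Because insertions are counted against the reference length, the rate can exceed 100% when the output contains many spurious words.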
