Noisy text analytics
Noisy text analytics is a process of information extraction whose goal is to automatically extract structured or semistructured information from noisy unstructured text data. While text analytics is a growing and mature field that has great value because of the huge amounts of data being produced, processing of noisy text is gaining in importance because many common applications produce noisy text data. Noisy unstructured text data is found in informal settings such as online chat, text messages, e-mails, message boards, newsgroups, blogs, wikis and web pages. Text produced by processing spontaneous speech with automatic speech recognition, or printed and handwritten text with optical character recognition, also contains processing noise. Text produced under such circumstances is typically highly noisy, containing spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing letter case information, pause-filling words such as “um” and “uh”, and other texting and speech disfluencies.
Such text can be seen in large amounts in contact centers, chat rooms, optical character recognition (OCR) of text documents, short message service (SMS) text, etc. Documents written in a historical language can also be considered noisy with respect to today’s knowledge about the language. Such text contains useful historical, religious and ancient medical knowledge. The nature of the noisy text produced in all these contexts warrants moving beyond traditional text analysis techniques.
Techniques for noisy text analysis
Missing punctuation and the use of non-standard words can often hinder standard natural language processing tools such as part-of-speech tagging and parsing. Techniques both to learn from noisy data and to process it are only now being developed.
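As a rough illustration of the kind of preprocessing that can help standard tools cope with such text, the following minimal sketch drops pause-filling words, expands a few texting abbreviations, and collapses immediate word repetitions before the text is handed to a tagger or parser. The abbreviation table and the function name `normalize` are hypothetical and chosen only for this example; real systems typically learn such mappings from data rather than hard-coding them.

```python
import re

# Pause-filling words mentioned in the article.
FILLERS = {"um", "uh"}

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"u": "you", "r": "are", "pls": "please", "2moro": "tomorrow"}


def normalize(text: str) -> str:
    """Return a lightly cleaned, lowercased version of a noisy message."""
    # Lowercase and keep runs of letters, digits and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cleaned = []
    for token in tokens:
        if token in FILLERS:
            continue  # drop pause fillers
        token = ABBREVIATIONS.get(token, token)  # expand known abbreviations
        if cleaned and cleaned[-1] == token:
            continue  # collapse an immediate repetition such as "you you"
        cleaned.append(token)
    return " ".join(cleaned)


print(normalize("um can u u pls call me 2moro"))
```

Collapsing every immediate repetition is a deliberate simplification: it would also merge legitimate doubled words, so a practical system would need a more careful rule.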
Possible sources of noisy text
- World Wide Web: Poorly written text is found in web pages, online chat, blogs, wikis, discussion forums and newsgroups. Most of these data are unstructured, and the style of writing is very different from, say, well-written news articles. Analysis of web data is important because it is a source for market buzz analysis, market review, trend estimation, etc. Also, because of the large amount of data, it is necessary to find efficient methods of information extraction, classification, automatic summarization and analysis of these data.
- Contact centers: This is a general term for help desks, information lines and customer service centers operating in domains ranging from computer sales and support to mobile phones to apparel. On average, a person in the developed world interacts at least once a week with a contact center agent. A typical contact center agent handles over a hundred calls per day. Contact centers operate in various modes such as voice, online chat and e-mail. The contact center industry produces gigabytes of data in the form of e-mails, chat logs, voice conversation transcriptions, customer feedback, etc. The bulk of contact center data is voice conversations. Transcription of these using state-of-the-art automatic speech recognition results in text with a 30-40% word error rate. Further, even written modes of communication, such as online chat between customers and agents and interactions over e-mail, tend to be noisy. Analysis of contact center data is essential for customer relationship management, customer satisfaction analysis, call modeling, customer profiling, agent profiling, etc., and it requires sophisticated techniques to handle poorly written text.
- Printed documents: Many libraries, government organizations and national defence organizations have vast repositories of hard copy documents. To retrieve and process the content of such documents, they need to be processed using optical character recognition. In addition to printed text, these documents may also contain handwritten annotations. OCRed text can be highly noisy depending on the font size, quality of the print, etc. Word error rates can range from 2-3% to as high as 50-60%. Handwritten annotations can be particularly hard to decipher, and error rates can be quite high in their presence.
- Short Messaging Service (SMS): Language usage in computer-mediated discourse, such as chats, e-mails and SMS texts, differs significantly from the standard form of the language. The urge towards shorter messages, which allow faster typing, and the need for semantic clarity shape the structure of this non-standard form, known as texting language.
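The word error rates quoted for speech recognition and OCR output above are conventionally computed as the number of word substitutions, deletions and insertions needed to turn the system output into a reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation using a word-level edit distance; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example, not something specified by the article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in the output
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return previous[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the reference length, the value can exceed 1.0 when the output contains many more words than the reference.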
See also
- Text analytics
- Information extraction
- Computational linguistics
- Natural language processing
- Named entity recognition
- Text mining
- Automatic summarization
- Statistical classification
- Data quality