Sentence boundary disambiguation
Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences. However, sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, a decimal point, an ellipsis, or an email address rather than the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. Likewise, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.

Languages like Japanese and Chinese have unambiguous sentence-ending markers.

Strategies

The standard 'vanilla' approach to locating the end of a sentence applies three rules in order: (a) if the token is a period, it ends a sentence; (b) if the preceding token is on a hand-compiled list of abbreviations, the period does not end a sentence; (c) if the next token is capitalized, the period ends a sentence.
This strategy gets about 95% of sentences correct.
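The vanilla rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is tokenized on whitespace, the abbreviation list is a tiny made-up sample, and the function name `split_sentences` is hypothetical rather than part of any library.

```python
# Illustrative abbreviation list (an assumption); a real one would be much longer.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "inc", "etc", "e.g", "i.e"}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split whitespace-tokenized text using the three vanilla rules."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue  # only periods are candidates in this baseline
        # Rule 1: a period ends a sentence.
        is_boundary = True
        # Rule 2: unless the token before the period is a listed abbreviation.
        if token[:-1].lower() in abbreviations:
            is_boundary = False
        # Rule 3: a capitalized next token marks a boundary after all.
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if next_token[:1].isupper():
            is_boundary = True
        if is_boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final period
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("We sell fruit, e.g. apples and pears. Prices rose 3.5 percent."))
```

Because the capitalization rule is applied last, an abbreviation followed by a proper noun is still split: `split_sentences("Dr. Smith arrived.")` returns `["Dr.", "Smith arrived."]`. Errors of this kind are one reason a rule set this simple does not reach full accuracy.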

Another approach is to learn the rules automatically from a set of documents in which the sentence breaks are pre-marked. Some solutions of this kind are based on a maximum entropy model. The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
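As a rough sketch of what learning from pre-marked documents involves, the code below turns text with known sentence breaks into labeled examples that a classifier could be trained on. The feature set and the name `training_examples` are assumptions made for illustration; they are not the features used by SATZ or by any particular maximum entropy system.

```python
def training_examples(marked_sentences):
    """Build (features, is_boundary) pairs from pre-marked text.

    `marked_sentences` holds one string per sentence, so the last token of
    each string is a true sentence boundary.
    """
    tokens, boundary_after = [], []
    for sentence in marked_sentences:
        words = sentence.split()
        if not words:
            continue
        tokens.extend(words)
        boundary_after.extend([False] * (len(words) - 1) + [True])

    examples = []
    for i, token in enumerate(tokens):
        if token[-1] not in ".?!":
            continue  # only tokens ending in punctuation are candidates
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        features = {  # hypothetical feature set, chosen for illustration
            "word": token[:-1].lower(),
            "punct": token[-1],
            "next_capitalized": next_token[:1].isupper(),
            "has_internal_period": "." in token[:-1],
            "short_word": len(token) <= 4,
        }
        examples.append((features, boundary_after[i]))
    return examples


for features, label in training_examples(["Dr. Smith paid 3.5 dollars.", "Was it enough?"]):
    print(label, features)
```

Each candidate punctuation mark becomes one example: here "Dr." is labeled as a non-boundary, while "dollars." and "enough?" are labeled as boundaries. A statistical model would then learn weights for such features instead of relying on hand-written rules.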

Software

Perl compatible regular expression ("pcre")
  • ((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
  • $sentences=preg_split("/(?
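For readers working outside PCRE, the first pattern above can be tried with Python's `re` module. In this sketch the capturing groups are made non-capturing so that `re.split` does not return the separators; the names `SENTENCE_BREAK` and `split_with_regex` are hypothetical.

```python
import re

# Adaptation of the PCRE pattern listed above (groups made non-capturing).
SENTENCE_BREAK = re.compile(
    r'(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))'  # after x. or x." (also ? and !)
    r'(?:\r\n|\s)'                                   # the whitespace to split on
    r'(?="?[A-Z])'                                   # before an optional quote and a capital
)


def split_with_regex(text):
    """Split text at whitespace that the pattern treats as a sentence break."""
    return SENTENCE_BREAK.split(text)


print(split_with_regex('It cost 3.5 dollars. He said "Stop!" Then he left.'))
```

The decimal point in "3.5" is left alone because it is not followed by whitespace, but the pattern knows nothing about abbreviations, so a string such as "Dr. Smith" would still be split after "Dr.".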


Online use, libraries, and APIs
  • SATZ: An Adaptive Sentence Segmentation System, by David D. Palmer (in C)


Toolkits that include sentence detection

External links

  • Search for 'sentence boundary disambiguation', Google Scholar
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 