American National Corpus - AbsoluteAstronomy.com

The American National Corpus (ANC) is a text corpus of American English, currently containing 22 million words of written and spoken data produced since 1990. The ANC may eventually include a range of genres comparable to that of the British National Corpus. It is currently annotated for part of speech and lemma, shallow parse, and named entities.

The ANC in its current size of 22 million words is available from the Linguistic Data Consortium. A 15 million word subset of the corpus, called the Open American National Corpus (OANC), is freely available with no restrictions on its use from the ANC Website.

The corpus and its annotations are provided according to the specifications of ISO/TC 37 SC4's Linguistic Annotation Framework. By using a freely provided transduction tool, the corpus and user-chosen annotations can be produced in multiple formats, including an XML format conformant to the XML Corpus Encoding Standard (XCES) (usable with the British National Corpus's XAIRA search engine), a UIMA-compliant format, and formats suitable for input to a wide variety of concordance software.
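As a rough illustration of what concordance software does with such input, the following Python sketch prints keyword-in-context (KWIC) lines from plain text. The function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for this example; this is not the ANC transduction tool or any particular concordancer.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for each whole-word match of `keyword`."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample text, not taken from the ANC.
sample = ("The corpus contains written and spoken data. "
          "Each corpus file can be converted to several formats.")

for line in kwic(sample, "corpus", width=20):
    print(line)
```

Aligning the keyword in a fixed column is the usual design choice for concordance output, since it lets a reader scan the left and right contexts of every occurrence at a glance.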

The ANC differs from other corpora of English in that it is richly annotated, with several part-of-speech annotations (Penn tags, CLAWS5 and CLAWS7 tags), shallow parse annotations, and annotations for several types of named entities. Additional annotations are added to all or parts of the corpus as they become available, often through contributions from other projects. Unlike online searchable corpora, which because of copyright restrictions allow access only to individual sentences, the entire ANC is available, enabling research such as the development of statistical language models and full-text linguistic annotation.
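To show why access to full texts matters for statistical language modeling, here is a minimal Python sketch of a bigram model estimated from tokenized sentences. The function names, the `<s>` and `</s>` boundary markers, and the toy sentences are assumptions for this example; a real model trained on the ANC would also need tokenization and smoothing, which are omitted here.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count bigrams over tokenized sentences, adding sentence boundary markers."""
    context_counts = Counter()           # how often each token occurs as a context
    bigram_counts = defaultdict(Counter)  # bigram_counts[prev][cur] = count
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][cur] += 1
    return context_counts, bigram_counts

def bigram_prob(model, prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 for unseen contexts."""
    context_counts, bigram_counts = model
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][cur] / context_counts[prev]

# Hypothetical toy data, not taken from the ANC.
toy = [["the", "corpus", "is", "annotated"],
       ["the", "corpus", "is", "open"]]
model = train_bigram_model(toy)

print(bigram_prob(model, "the", "corpus"))  # 1.0
print(bigram_prob(model, "is", "open"))     # 0.5
```

Counts like these depend on seeing consecutive tokens across whole documents, which is not possible when a search interface returns only isolated sentences.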

ANC annotations are automatically produced and unvalidated. A Manually Annotated Sub-Corpus (MASC), to be released in Fall 2009, includes validated annotations for the above-mentioned phenomena as well as Penn Treebank syntactic annotation, WordNet sense annotation, and FrameNet semantic frame annotations.

In Fall 2009, the OANC Ngram Search Engine will become available on the ANC Website, providing intra- and inter-sentential pattern-based searches. In early 2010, the OANC will be expanded to include an additional 20 to 30 million words of written and spoken data.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
