American National Corpus
The American National Corpus (ANC) is a text corpus of American English currently containing 22 million words of written and spoken data produced since 1990. The ANC may eventually include a range of genres comparable to the British National Corpus. It is currently annotated for part of speech and lemma, shallow parse, and named entities.
The ANC, at its current size of 22 million words, is available from the Linguistic Data Consortium. A 15-million-word subset of the corpus, called the Open American National Corpus (OANC), is freely available with no restrictions on its use from the ANC Website.
The corpus and its annotations are provided according to the specifications of ISO/TC 37 SC4's Linguistic Annotation Framework. Using a freely provided transduction tool, the corpus and user-chosen annotations can be rendered in multiple formats, including an XML format conformant to the XML Corpus Encoding Standard (XCES) (usable with the British National Corpus's XAIRA search engine), a UIMA-compliant format, and formats suitable for input to a wide variety of concordance software.
The ANC differs from other corpora of English because it is richly annotated, including different part-of-speech annotations (Penn tags, CLAWS5 and CLAWS7 tags), shallow parse annotations, and annotations for several types of named entities. Additional annotations are added to all or parts of the corpus as they become available, often through contributions from other projects. Unlike online searchable corpora, which because of copyright restrictions allow access only to individual sentences, the entire ANC is available, enabling research that involves, for example, the development of statistical language models and full-text linguistic annotation.
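As an illustration of why access to full text matters for statistical language modeling, the following is a minimal sketch of bigram counting and a maximum-likelihood estimate of P(second | first). The function names and the toy sentences are assumptions made for this example; they are not part of the ANC distribution or its tools, and real work would read tokens from one of the corpus export formats instead.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent token pairs within each tokenized sentence."""
    counts = Counter()
    for tokens in sentences:
        # zip pairs each token with its successor; pairs never cross sentences.
        counts.update(zip(tokens, tokens[1:]))
    return counts


def bigram_probability(counts, first, second):
    """Maximum-likelihood estimate of P(second | first) from bigram counts."""
    total = sum(c for (w1, _), c in counts.items() if w1 == first)
    return counts[(first, second)] / total if total else 0.0


# Toy sentences standing in for tokenized corpus text (hypothetical, not ANC data).
sentences = [
    ["the", "corpus", "is", "annotated"],
    ["the", "corpus", "is", "free"],
    ["the", "tool", "is", "free"],
]
counts = bigram_counts(sentences)
```

Estimates of this kind need running text to count from, which a search interface that returns only isolated sentences does not readily supply.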
ANC annotations are automatically produced and unvalidated. A Manually Annotated Sub-Corpus (MASC), to be released in Fall 2009, will include validated annotations for the above-mentioned phenomena as well as Penn Treebank syntactic annotation, WordNet sense annotation, and FrameNet semantic frame annotations.
In Fall 2009, the OANC Ngram Search Engine, which will provide intra- and inter-sentential pattern-based searches, will become available on the ANC Website. In early 2010, the OANC will be expanded to include an additional 20-30 million words of written and spoken data.
See also
- British National Corpus
- Oxford English Corpus
- Corpus of Contemporary American English (COCA)