Corpus of Contemporary American English

The freely searchable 425 million word Corpus of Contemporary American English (COCA) is the largest corpus of American English currently available, and the only publicly available corpus of American English to contain a wide array of texts from a number of genres.

It was created by Mark Davies, Professor of Corpus Linguistics at Brigham Young University.

Content

The corpus is composed of 425 million words from more than 160,000 texts, including 20 million words each year from 1990 to 2011. The most recent update was made in March 2011. The corpus is used by approximately 40,000 people each month, which may make it the most widely used "structured" corpus currently available.

For each year, the corpus is evenly divided among five genres: spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:

Spoken: (85 million words) Transcripts of unscripted conversation from nearly 150 different TV and radio programs.

Fiction: (81 million words) Short stories and plays, first chapters of books 1990–present, and movie scripts.

Popular magazines: (86 million words) Nearly 100 different magazines, from a range of domains such as news, health, home and gardening, women's, financial, religion, and sports.

Newspapers: (81 million words) Ten newspapers from across the US, with text from different sections of the newspapers, such as local news, opinion, sports, and the financial section.

Academic Journals: (81 million words) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system.
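Because the five genres differ slightly in size, raw hit counts are easier to compare across genres once they are converted to a rate per million words. The following is a minimal sketch in Python using the genre sizes listed above; the function name `per_million` and the example counts are hypothetical and are not taken from COCA or its interface.

```python
# Genre sizes in millions of words, as listed in the genre breakdown above.
GENRE_SIZE_MILLIONS = {
    "spoken": 85,
    "fiction": 81,
    "magazine": 86,
    "newspaper": 81,
    "academic": 81,
}


def per_million(raw_count, genre):
    """Convert a raw hit count in one genre to a rate per million words."""
    return raw_count / GENRE_SIZE_MILLIONS[genre]


# Hypothetical raw counts, for illustration only (not real COCA results).
example_counts = {"spoken": 1700, "academic": 405}
rates = {genre: per_million(count, genre) for genre, count in example_counts.items()}

for genre, rate in rates.items():
    print(f"{genre}: {rate:.1f} per million words")
```

With these made-up counts, the spoken rate comes out four times the academic rate, even though the raw counts alone would suggest a slightly larger gap.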

Queries

The interface is the same as the BYU-BNC interface for the 100 million word British National Corpus, the 100 million word TIME Magazine corpus, and the 400 million word Corpus of Historical American English (COHA), 1810s–2000s (see links below)

Queries by word, phrase, alternates, substring, part of speech, lemma, synonyms (see below), and customized lists (see below)

The corpus is tagged by CLAWS, the same tagger that was used for the BNC and the TIME corpus

Chart listings (totals for all matching forms in each genre or year, 1990–present, as well as for sub-genres) and table listings (frequency for each matching form in each genre or year)

Full collocates searching (up to ten words left and right of node word)

Re-sortable concordances, showing the most common words/strings to the left and right of the searched word

Comparisons between genres or time periods (e.g. collocates of 'chair' in fiction or academic, nouns with 'break the [N]' in newspapers or academic, adjectives that occur primarily in sports magazines, or verbs that are more common 2005–2010 than previously)

One-step comparisons of collocates of related words, to study semantic or cultural differences between words (e.g. comparison of collocates of 'small' and 'little', or 'Democrats' and 'Republicans', or 'men' and 'women', or 'rob' vs 'steal')

Users can include semantic information from a 60,000 entry thesaurus directly as part of the query syntax (e.g. frequency and distribution of synonyms of 'beautiful', synonyms of 'strong' occurring in fiction but not academic, synonyms of 'clean' + noun ('clean the floor', 'washed the dishes'))

Users can also create their own 'customized' word lists, and then re-use these as part of subsequent queries (e.g. lists related to a particular semantic category (clothes, foods, emotions), or a user-defined part of speech)

Note that the corpus is only available through the web interface, due to copyright restrictions.
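The collocate search listed above (words occurring up to ten words left and right of a node word) can be illustrated on any tokenized text. The following is a minimal sketch in Python, assuming a toy sentence rather than COCA data; the function name `collocates`, the `span` parameter, and the sample text are hypothetical and do not reflect how the COCA interface is implemented.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]    # up to `span` tokens before the node
        right = tokens[i + 1:i + 1 + span]   # up to `span` tokens after the node
        counts.update(left + right)
    return counts


# Toy text, for illustration only.
text = "the small dog chased the small cat while a little bird watched the small dog"
tokens = text.split()

small = collocates(tokens, "small", span=1)
little = collocates(tokens, "little", span=1)

print(small.most_common())
print(little.most_common())
```

Comparing the two counters side by side is a rough analogue of the one-step comparison of collocates of related words such as 'small' and 'little'.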

External links

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.