Text corpus - AbsoluteAstronomy.com

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules on a specific universe.

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.
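As a rough illustration of side-by-side alignment, the sketch below stores a tiny sentence-aligned parallel corpus as a list of pairs. The English/Spanish sentences and the helper names `side_by_side` and `lookup` are assumptions made up for this example, not part of any real corpus format.

```python
# Hypothetical toy data: each pair aligns one English sentence
# with its Spanish translation.
aligned = [
    ("Good morning.", "Buenos días."),
    ("Thank you.", "Gracias."),
]

def side_by_side(pairs, width=20):
    """Format each aligned pair on one line, source padded to `width`."""
    return [f"{src:<{width}}| {tgt}" for src, tgt in pairs]

def lookup(pairs, source_sentence):
    """Return every target sentence aligned with the given source sentence."""
    return [tgt for src, tgt in pairs if src == source_sentence]

for row in side_by_side(aligned):
    print(row)
```

Real aligned corpora usually keep alignment at the sentence or paragraph level in dedicated file formats; a list of pairs is only the simplest in-memory stand-in.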

To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
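A minimal sketch of what such annotation might look like in code, assuming a made-up tagset (`DET`, `NOUN`, `VERB`) and a hypothetical `Token` record holding the surface form, the part-of-speech tag and the lemma. Real corpora use their own tagsets and storage formats.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag (illustrative tagset, an assumption)
    lemma: str  # base form of the word

# One hand-annotated sentence (toy data).
sentence = [
    Token("The", "DET", "the"),
    Token("cats", "NOUN", "cat"),
    Token("were", "VERB", "be"),
    Token("sleeping", "VERB", "sleep"),
]

def to_tagged_text(tokens):
    """Render tokens in a simple form/POS/lemma notation."""
    return " ".join(f"{t.form}/{t.pos}/{t.lemma}" for t in tokens)

def lemmas_with_pos(tokens, pos):
    """Return the lemmas of all tokens carrying the given tag."""
    return [t.lemma for t in tokens if t.pos == pos]

tagged = to_tagged_text(sentence)
print(tagged)
```

Once tags and lemmas are attached, queries such as "all verb lemmas in this sentence" become simple filters, as `lemmas_with_pos(sentence, "VERB")` shows.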

Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called treebanks or parsed corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around 1 to 3 million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics.
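To make the idea of a parsed sentence concrete, the sketch below represents one syntactic tree as nested tuples and prints it in a bracketed notation similar to what many treebanks use. The node labels (`S`, `NP`, `VP`, ...) and the helper names are assumptions for illustration only.

```python
# A parse tree as nested tuples: (label, child, child, ...).
# Leaves are plain strings. Labels are an illustrative assumption.
tree = (
    "S",
    ("NP", ("DET", "the"), ("N", "dog")),
    ("VP", ("V", "barked")),
)

def leaves(node):
    """Collect the words of the sentence from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def bracketed(node):
    """Render the tree in a one-line bracketed notation."""
    if isinstance(node, str):
        return node
    children = " ".join(bracketed(child) for child in node[1:])
    return f"({node[0]} {children})"

print(bracketed(tree))
```

Every sentence in a treebank needs a complete, consistent structure of this kind, which is one reason such corpora tend to stay small.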

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part-of-speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.
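A frequency list can be derived from a corpus in a few lines. The sketch below assumes a toy two-sentence corpus and a deliberately naive tokenizer (lowercased runs of ASCII letters and apostrophes); the function name `frequency_list` is hypothetical, and real corpus tools apply far more careful tokenization.

```python
import re
from collections import Counter

def frequency_list(texts):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter()
    for text in texts:
        # Naive tokenization: lowercase, then take runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy corpus (an assumption for illustration).
toy_corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

freq = frequency_list(toy_corpus)
for word, count in freq:
    print(f"{count}\t{word}")
```

For language teaching, a list like this suggests which words a learner is likely to meet most often in texts of the kind the corpus samples.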

Archaeological corpora

Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Some archaeological corpora can be of such short duration that they provide a snapshot in time. One of the shortest corpora in time may be the Amarna letters texts (1350 BC), which span 15 to 30 years. The corpus of an ancient city (for example the "Kültepe Texts" of Turkey) may go through a series of corpora, determined by their find site dates.

Some notable text corpora

English language:

Google N-Grams Corpus - Largest English corpus at 155 billion words. Also has corpora for other languages. (http://ngrams.googlelabs.com/datasets)
American National Corpus
Bank of English
British National Corpus
Corpus Juris Secundum
Corpus of Contemporary American English (COCA) 425 million words, 1990-2011. Freely searchable online.
Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.
International Corpus of English
Oxford English Corpus
Scottish Corpus of Texts & Speech

Other languages:

Hamshahri Corpus (Persian, a.k.a. Farsi)
Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
TEP: Tehran English-Persian Parallel Corpus (http://ece.ut.ac.ir/nlp/)
TMC: Tehran Monolingual Corpus, Standard corpus for Persian Language Modeling (http://ece.ut.ac.ir/nlp/)
Bijankhan Corpus: a contemporary Persian corpus for NLP research
CETENFolha
Croatian Language Corpus
Croatian National Corpus
Czech National Corpus
Neo-Assyrian Text Corpus Project
Russian National Corpus
Slovenian National Corpus
Thesaurus Linguae Graecae (Ancient Greek)
Quranic Arabic Corpus (Classical Arabic)
Eastern Armenian National Corpus (EANC) 110 million words. Freely searchable online.
National Corpus of Polish
German Reference Corpus (DeReKo) More than 4 billion words of contemporary written German.
Tatoeba: a parallel corpus which contains about 913,000 sentences in 90 languages.
Spanish text corpus by Molino de Ideas, which contains 660 million words.
Kotonoha Japanese language corpus

External links

Free, web-based corpora (45-425 million words each): American (COCA, COHA, TIME), British (BNC), Spanish, Portuguese
ACL SIGLEX Resource Links: Text Corpora
The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses
Developing Linguistic Corpora: a Guide to Good Practice
An interface for querying automatically-constructed virtual corpora.
TEP: Tehran English-Persian Parallel Corpus.
An interface for querying text corpora constructed through guided crawling of online news sites, the corpora (both local and virtual) constructed using the SPARTAN technique, and publicly available collections (e.g. Reuters-21578, texts from the Gutenberg project, GENIA).
http://www.korpus.cz/intercorp/ Building synchronous parallel corpora of the languages taught at the Faculty of Arts of Charles University.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
