Russian National Corpus
The Russian National Corpus (the official English name; the Russian name, Национальный корпус русского языка, literally means "the National Corpus of the Russian language") is a corpus of the Russian language that has been available online since April 29, 2004. It is being created by the Institute of Russian Language, Russian Academy of Sciences.

It currently contains about 350 million word forms that are automatically lemmatized and tagged for part of speech (POS) and grammemes; that is, every possible morphological analysis of an orthographic form is assigned to it. Lemmata, POS, grammatical features, and their combinations are searchable. In addition, a subcorpus of 6 million word forms has manually resolved homonymy.
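As a rough illustration of what unresolved homonymy means for search, the following Python sketch stores every candidate analysis for a word form and lets a query match if any single analysis satisfies it. The class names, the tag strings, and the two-word example are assumptions made for illustration only; they are not the corpus's actual data format or tag set.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Analysis:
    """One candidate morphological reading of a word form (hypothetical format)."""
    lemma: str
    pos: str
    grammemes: frozenset = frozenset()


@dataclass
class Token:
    """An orthographic form together with all of its candidate analyses."""
    form: str
    analyses: list = field(default_factory=list)


def matches(token, lemma=None, pos=None, grammemes=()):
    """Return True if at least one analysis satisfies every given constraint."""
    wanted = frozenset(grammemes)
    return any(
        (lemma is None or a.lemma == lemma)
        and (pos is None or a.pos == pos)
        and wanted <= a.grammemes
        for a in token.analyses
    )


# The form "стали" is ambiguous: a past-tense form of the verb "стать"
# or a case form of the noun "сталь" ("steel"). Tag names are illustrative.
stali = Token("стали", [
    Analysis("стать", "V", frozenset({"past", "pl"})),
    Analysis("сталь", "S", frozenset({"gen", "sg"})),
])

sentence = [Token("они", [Analysis("они", "SPRO", frozenset({"nom", "pl"}))]), stali]

# Without disambiguation, the same form is found by both queries.
verb_hits = [t.form for t in sentence if matches(t, pos="V")]
noun_hits = [t.form for t in sentence if matches(t, lemma="сталь", grammemes=["gen"])]
```

Constraints are checked within a single analysis, so a query for a verb in the genitive does not match "стали" even though one reading is a verb and another is genitive. In a subcorpus with manually resolved homonymy, each token would keep only one analysis.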

The subcorpus with resolved morphological homonymy is also automatically accentuated. The whole corpus has searchable tagging for lexical semantics (LS), including morphosemantic POS subclasses (proper noun, reflexive pronoun, etc.), LS characteristics proper (thematic class, causativity, evaluation), and derivation (diminutive, adverb formed from adjective, etc.).

The RNC also includes the following subcorpora:
  • a treebank of syntactic dependencies (largely based on Igor Mel'čuk's Meaning-Text Theory);
  • English<=>Russian, German=>Russian, Ukrainian<=>Russian and Belorussian<=>Russian parallel corpora;
  • a large (100+ million words) separate corpus of modern newspapers (2001-2011);
  • a corpus of Russian poetry, in which rhyming words and poetic prosody (including meter, stanzas, etc.) are additionally tagged;
  • a corpus of Russian dialects with specific dialect grammar tagging;
  • a multimedia corpus with searchable tagged fragments of Russian-language movies;
  • a corpus showing the history of Russian stress;
  • an educational subcorpus reflecting school standards.


All texts carry tags with metatextual information: the author, the author's birth date, the creation date, the text size, and the text genre (general fiction, detective story, newspaper article, etc.). All these categories can be browsed and searched separately. Users can also define their own subcorpus and restrict searches for combinations of lemmata, POS/grammeme tags, and semantic tags to that subset.
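The idea of a user-defined subcorpus can be sketched in Python as a metadata filter followed by a search restricted to the surviving texts. The field names, the `subcorpus` and `count_form` helpers, and the placeholder records below are all hypothetical; the corpus's real metadata schema and query interface are not described in this article.

```python
from dataclasses import dataclass


@dataclass
class Text:
    """A corpus text with a few metatextual fields (hypothetical schema)."""
    author: str
    author_birth_year: int
    created_year: int
    genre: str
    words: list


def subcorpus(texts, **criteria):
    """Keep the texts whose metadata satisfy every criterion.

    Each criterion is either a plain value (tested for equality) or a
    callable predicate applied to the field's value.
    """
    def ok(text):
        for name, want in criteria.items():
            value = getattr(text, name)
            if callable(want):
                if not want(value):
                    return False
            elif value != want:
                return False
        return True
    return [t for t in texts if ok(t)]


def count_form(texts, form):
    """Count occurrences of a word form within the given texts only."""
    return sum(t.words.count(form) for t in texts)


# Placeholder records, not real corpus data.
corpus = [
    Text("Author A", 1850, 1890, "general fiction", ["дом", "стоял", "дом"]),
    Text("Author B", 1960, 2005, "newspaper article", ["дом", "продан"]),
    Text("Author C", 1955, 1999, "detective story", ["ночь", "дом"]),
]

modern = subcorpus(corpus, created_year=lambda y: y >= 1990)
fiction = subcorpus(corpus, genre="general fiction")
hits_modern = count_form(modern, "дом")
```

Separating the metadata filter from the search keeps the two steps independent: the same query can be rerun against different subsets without being rewritten.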

The corpus is intended to be made available offline and distributed for non-commercial purposes, but because of technical and/or copyright problems it is currently accessible only online.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 