Croatian National Corpus
The Croatian National Corpus (Croatian: Hrvatski nacionalni korpus, HNK) is the largest and most important corpus of the Croatian language. Its compilation started in 1998 at the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb, following the ideas of Marko Tadić. The theoretical foundations and calls for a general-purpose, representative, multi-million corpus of the Croatian language had begun to appear even earlier. The Croatian National Corpus is compiled from selected texts written in Croatian covering all fields, topics, genres and styles: from literary and scientific texts to textbooks, newspapers, user groups and chat rooms.

The initial composition was divided into two constituents:
  1. A 30-million corpus of contemporary Croatian language (30m), which included samples from texts dating from 1990 onward. The criteria for including text samples were that they be written by native speakers and cover different fields, genres and topics. Translated texts and poetry were excluded.
  2. The Croatian Electronic Text Archive (HETA), which included complete texts, particularly serial publications (volumes, series, editions, etc.) that would have unbalanced the 30m had they been inserted there.


Since 2004, with the adoption of the third-generation corpus concept, the two-constituent structure has been abandoned in favor of several subcorpora and a larger size. Since 2005 the HNK has contained 105 million tokens and has been composed of a number of subcorpora that can be searched individually or together as a whole corpus. In 2004 the HNK also migrated to a new server platform, the Manatee/Bonito server-client architecture. Searching the HNK (still with free test access today) requires the free client program Bonito, which was produced at the Natural Language Processing Laboratory of the Faculty of Informatics, Masaryk University in Brno, Czech Republic. Its interface offers complex, elaborate corpus queries, different types of statistical results, total or partial word lists according to different query criteria (with their frequencies), frequency distributions of types, automatic collocation detection, etc.
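To give a rough idea of what a frequency word list and automatic collocation detection involve, the following is a minimal Python sketch. It is not Bonito or Manatee code and does not use their interfaces; the function names `word_list` and `collocations`, the `min_count` parameter, the choice of pointwise mutual information (PMI) as the association score, and the tiny sample token list are all assumptions made for illustration only.

```python
from collections import Counter
import math


def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first, ties alphabetical."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def collocations(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Pairs seen fewer than `min_count` times are skipped, because PMI
    is unreliable for very rare pairs.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / (n - 1)  # n - 1 adjacent pairs in n tokens
        p_independent = (unigrams[w1] / n) * (unigrams[w2] / n)
        scored.append(((w1, w2), count, math.log2(p_pair / p_independent)))
    # Strongest association first, ties broken alphabetically.
    return sorted(scored, key=lambda item: (-item[2], item[0]))


# Hypothetical toy sample, not taken from the HNK.
sample = (
    "hrvatski nacionalni korpus je korpus hrvatskoga jezika "
    "a hrvatski nacionalni korpus raste"
).split()

for word, freq in word_list(sample)[:3]:
    print(word, freq)

for pair, count, score in collocations(sample):
    print(pair, count, round(score, 2))
```

A real corpus manager works over indexed, annotated text rather than an in-memory list, and typically offers several association measures; the sketch only shows the underlying counting idea.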

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 