Croatian Language Corpus
The Croatian Language Corpus (CLC) is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ).
Background
The CLC was initially funded from May 2005 as a sub-project of the research program Riznica (Croatian Language Repository) by the Ministry of Science, Education and Sports of the Republic of Croatia (MZOŠ), project no. 0212010. In a second development phase, since 2007, the further extension and development of the CLC has been embedded in the research program The Croatian Language Repository (CLR), awarded by the MZOŠ (cf. Ćavar and Brozović Rončević, 2012). Because the CLR is a research program (PI Dunja Brozović Rončević) comprising numerous independent research projects that make use of the CLC, the corpus is developed mainly as a by-product of those projects. Currently Dunja Brozović Rončević and Damir Ćavar are in charge of corpus development.
Goals
One of the main goals of the CLC project is to create a publicly available Croatian corpus that is annotated on multiple levels: lemmatized, morphologically segmented and morpho-syntactically annotated, phonemically transcribed and syllabified, and syntactically parsed. While the current version of the corpus provides resources from standard Croatian, several corpora from different development phases of Croatian are being created as well, including digitizations of manuscripts and Croatian dictionaries.
Format and Availability
From the outset, the collected and digitized texts in the CLC were annotated using the Text Encoding Initiative (TEI) P5 XML standard. Currently approx. 90 million tokens are available in the TEI P5 XML format. The corpus can be accessed online via the PhiloLogic interface (see The ARTFL Project, Department of Romance Languages and Literatures, The University of Chicago). It is partitioned into various virtual sub-corpora, and individual or specific definitions of sub-corpora can be provided on demand.
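As an illustration of what token-level annotation in TEI P5 XML can look like, the following is a minimal Python sketch that reads word tokens with their lemma and part-of-speech values from a small TEI-namespaced fragment. The fragment, the sample sentence, the `lemma` and `pos` attribute values, and the helper name `read_tokens` are assumptions made for this example; the actual CLC markup scheme is not described here and may differ. The fragment is also deliberately incomplete (it has no `teiHeader`), so it is not a schema-valid TEI document.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; elements in a TEI document live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: one sentence ("knjige su stare", "the books are old")
# with <w> tokens carrying assumed lemma/pos annotations.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>
        <s>
          <w lemma="knjiga" pos="N">knjige</w>
          <w lemma="biti" pos="V">su</w>
          <w lemma="star" pos="A">stare</w>
          <pc>.</pc>
        </s>
      </p>
    </body>
  </text>
</TEI>"""


def read_tokens(tei_xml):
    """Return (word form, lemma, pos) for every <w> element, in document order."""
    root = ET.fromstring(tei_xml)
    return [
        (w.text, w.get("lemma"), w.get("pos"))
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]


tokens = read_tokens(SAMPLE)
for form, lemma, pos in tokens:
    print(form, lemma, pos)
```

Punctuation is kept in a separate `<pc>` element in this sketch, so it is not returned as a word token.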
Content
The CLC is assembled from selected Croatian texts covering various functional domains and genres. It includes literature and other written sources from the period in which the standardization of the Croatian language took its final shape, i.e. from the second half of the 19th century onward.
The CLC consists of:
- fundamental Croatian literature (e.g. novels, short stories, drama, poetry)
- non-fiction
- scientific publications from various domains and university textbooks
- school books
- translated literature from outstanding Croatian translators
- online journals and newspapers
- books from the pre-standardization period of Croatian that are adapted to present-day standard Croatian
Cooperation
The realization of the CLC was made possible in cooperation with:
- Školska knjiga d.d.
- Croatian Academy of Sciences and Arts (HAZU)
- Stoljeća hrvatske književnosti, Matica hrvatska
External Links
- Croatian Language Corpus (CLC) website and PhiloLogic interface
- Croatian National Corpus, another Croatian corpus by the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb