International Corpus of English
The International Corpus of English (ICE) is a set of corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.
History
The project began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-three research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation.
Description
Each corpus contains one million words in 500 texts of 2,000 words, following the sampling methodology used for the Brown Corpus. Unlike Brown or the Lancaster-Oslo-Bergen (LOB) Corpus (or indeed mega-corpora such as the British National Corpus), however, the majority of texts are derived from spoken data.
Each ICE corpus is 60% (600,000 words) orthographically transcribed spoken English. The father of the project, Sidney Greenbaum, insisted on the primacy of the spoken word, following Randolph Quirk and Jan Svartvik's collaboration on the original London-Lund Corpus (LLC). This emphasis on word-for-word transcription distinguishes ICE from many other corpora, including those that contain paraphrases of, for example, parliamentary or legal proceedings.
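The design figures above can be cross-checked with a little arithmetic. The Python sketch below is illustrative only: the function name `corpus_composition` is hypothetical, and it assumes that every text is exactly 2,000 words long, so that the 60% spoken share applies equally to texts and to words. Neither assumption comes from the ICE documentation itself.

```python
# Design figures stated in the article text.
TEXTS_PER_CORPUS = 500
WORDS_PER_TEXT = 2000
SPOKEN_PERCENT = 60  # share of each corpus that is transcribed speech


def corpus_composition(texts=TEXTS_PER_CORPUS,
                       words_per_text=WORDS_PER_TEXT,
                       spoken_percent=SPOKEN_PERCENT):
    """Return word and text counts for one corpus.

    Assumption (ours): all texts have the same length, so the spoken
    share can be applied to the text count as well as the word count.
    """
    total_words = texts * words_per_text
    # Integer arithmetic avoids floating-point rounding surprises.
    spoken_words = total_words * spoken_percent // 100
    spoken_texts = texts * spoken_percent // 100
    return {
        "total_words": total_words,
        "spoken_words": spoken_words,
        "written_words": total_words - spoken_words,
        "spoken_texts": spoken_texts,
        "written_texts": texts - spoken_texts,
    }


composition = corpus_composition()
for name, value in composition.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the 600,000 spoken words correspond to 300 of the 500 texts, leaving 200 texts (400,000 words) of written English.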
The British Component of ICE, ICE-GB, is fully parsed with a detailed Quirk et al. phrase structure grammar, and the analyses have been thoroughly checked and completed. This analysis includes part-of-speech tagging and parsing of the entire corpus. The treebank can be thoroughly searched and explored with the ICE Corpus Utility Program or ICECUP software. More information is in the handbook.
To ensure compatibility between the individual corpora in ICE, each team is following a common corpus design, as well as a common scheme for grammatical annotation.
Participants
The current participating countries and regions are (* = available):
- Australia
- Cameroon
- Canada*
- East Africa (Kenya, Malawi, Tanzania)*
- Fiji
- Great Britain* (parsed)
- Hong Kong*
- India*
- Ireland*
- Jamaica*
- Kenya
- Malta
- Malaysia
- New Zealand*
- Nigeria
- Pakistan
- Philippines*
- Sierra Leone
- Singapore*
- South Africa
- Sri Lanka
- Trinidad and Tobago
- USA
See also
- Corpus linguistics
- British National Corpus
- BYU Corpus of American English