Word lists by frequency
Word lists by frequency are lists of a language's words grouped by frequency, either in groups or as a ranked list, serving the purpose of vocabulary acquisition. A word list by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort". These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors. Paul Nation's modern language teaching summary encourages teachers first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion."

In any case, the basic "word" unit must be defined. For Latin scripts, words are usually items of one or several characters separated by spaces or punctuation. But exceptions arise, such as English "can't", French "aujourd'hui", or idioms. It may also be preferable to group the words of a word family under its base word. Thus, possible, impossible, possibility are words of the same word family, represented by the base word *possibl*. For statistical purposes, all their occurrences are summed under the base form *possibl*, which allows a concept and its forms to be ranked together. Moreover, other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a given string of several characters can be interpreted either as a single multi-character word or as a phrase of several items.
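The counting of word families described above can be illustrated with a minimal Python sketch. The regular expression and the small WORD_FAMILIES table are assumptions made for illustration only; a real project would rely on a full lemmatiser or a published word-family list.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by an internal apostrophe or hyphen,
# so that "can't" and "aujourd'hui" each remain a single token.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

# Hypothetical word-family table (illustrative): surface form -> base form.
WORD_FAMILIES = {
    "possible": "possibl",
    "impossible": "possibl",
    "possibility": "possibl",
}

def tokenize(text):
    """Split text into lower-cased word tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]

def family_counts(text):
    """Count occurrences, summing each word family under its base form."""
    counts = Counter()
    for token in tokenize(text):
        # Words without an entry in the table are counted under themselves.
        counts[WORD_FAMILIES.get(token, token)] += 1
    return counts

sample = "It is possible. It can't be impossible; every possibility counts."
ranked = family_counts(sample).most_common()
```

Here the three forms possible, impossible and possibility are all tallied under *possibl*, which therefore ranks first in `ranked`.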

English

Word counting dates back to Hellenistic times. Thorndike and Lorge, assisted by their colleagues, counted 18,000,000 running words to provide the first large-scale frequency list in 1944, before modern computers made such projects far easier.

Major lists

The Teacher's Word Book of 30,000 Words (Thorndike and Lorge, 1944)
The TWB contains 30,000 lemmas, or about 13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18,000,000 written words was analysed by hand. The size of its source corpus increased its usefulness, but its age and subsequent language change have reduced its applicability.

The General Service List (West, 1953)
The GSL contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5,000,000 written words was analysed in the 1940s. The rate of occurrence (%) of the different meanings and parts of speech of each headword is provided. The list also carefully applies various criteria other than frequency and range. Thus, despite its age, some errors, and its solely written base, it is still an excellent database (word frequency, frequency of meanings, reduction of noise).

The American Heritage Word Frequency Book (Carroll, Davies and Richman, 1971)
It is based on a corpus of 5,000,000 running words drawn from written texts used in United States schools (various grades, various subject areas). Its value lies in its focus on school teaching materials and in its tagging of words, namely the frequency of each word in each school grade level and in each subject area.

The Brown (Francis and Kucera, 1982), LOB and related corpora
These are now a set of 1,000,000-word written corpora representing different dialects of English. They are used to produce frequency lists.

French

An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list of 1,500 high-frequency words, supplemented by a later F.F.2 list of 1,700 mid-frequency words, and the most used syntax rules. It is claimed that 70 grammatical words make up 50% of communicative sentences, while 3,680 words provide about 95–98% coverage. A list of 3,000 frequent words is available.
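Coverage figures of this kind express the share of running words in a corpus that is accounted for by the N most frequent words. A minimal Python sketch of that calculation follows; the function name and the toy sentence are illustrative assumptions, not data from the Français fondamental.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Share of running words accounted for by the top_n most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total

# Toy corpus (illustrative only): 8 running words, 5 distinct words.
tokens = "le chat et le chien et le rat".split()

top1 = coverage(tokens, 1)   # "le" alone: 3 of 8 running words
top2 = coverage(tokens, 2)   # "le" and "et": 5 of 8 running words
```

On a real corpus the same function, applied to the 70 or 3,680 most frequent words, would give figures comparable to the percentages quoted above.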

The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet. Jean Baudot carried out a study modelled on the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".

More recently, the Lexique 3 project provided a list of 135,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number (singular/plural), frequency, associated lemmas, etc.

Chinese

As frequency toolkits, Jun Da and the Taiwanese Ministry of Education have provided large databases with frequency ranks for characters and words. Two other lists of common Chinese words and characters are the HSK list of 8,848 high- and medium-frequency words in the People's Republic of China, and the Republic of China (Taiwan)'s TOP list of about 8,600 common traditional Chinese words.

Issues

Paul Nation noted the great help provided by computing capabilities, which make corpus analysis much easier. He cited several key issues that influence the construction of frequency lists:
  • the representativeness
  • the frequency and range
  • word families
  • idioms and set expressions
  • range of information
  • various other criteria
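
The distinction between frequency and range in the list above can be shown with a short Python sketch. It assumes that range means the number of different texts in which a word occurs; the three toy texts are invented for illustration.

```python
from collections import Counter

def frequency_and_range(texts):
    """Return (frequency, text_range) counters for a list of tokenised texts.

    frequency  : total occurrences of each word across all texts
    text_range : number of texts in which each word occurs at least once
    """
    frequency = Counter()
    text_range = Counter()
    for tokens in texts:
        frequency.update(tokens)
        # set() ensures a word is counted once per text for its range.
        text_range.update(set(tokens))
    return frequency, text_range

# Toy texts (illustrative only).
texts = [
    "the cell divides and the cell grows".split(),
    "the river flows".split(),
    "the market grows".split(),
]
frequency, text_range = frequency_and_range(texts)
```

In this example "cell" and "grows" have the same frequency (2), but "cell" occurs in one text only while "grows" occurs in two, so a list that weighs range would rank "grows" as the more generally useful word.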

Sources

"Vocabulary size, text coverage, and word lists". In Schmitt and McCarthy (eds.), Vocabulary: Description, Acquisition and Pedagogy. Cambridge: Cambridge University Press, 1997, pp. 6–19. ISBN 9780521585514. http://www.lextutor.ca/research/nation_waring_97.html [Accessed August 21, 2010].
八十六年常用語詞調查報告書. 1997. http://www.edu.tw/files/site_content/M0001/86news/ch2.html?open [Accessed August 21, 2010].
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 