Collating sequence
The term collating sequence refers to the order in which individual characters should be taken when sorting a collection of character strings using dictionary order. This article is concerned with the order of the alphabetical characters comprising variants of the Latin alphabet in various languages. For other writing systems, see Collation.
The Unicode Collation Algorithm can be used to obtain any of the collating sequences described in this article by tailoring its default collation table. Several such tailorings are collected in the Common Locale Data Repository.
General issues
In a computer system, each character is assigned a unique numeric code (as in the ASCII or Unicode character set), but the proper and customary ordering of strings is not performed by a simple numeric comparison of those codes. Rather, the ordering is determined by reference to the collating sequence.
A general issue in sorting in dictionary order is whether two characters having different shapes are considered the same letter or different letters. In particular:
- Majuscules (capital letters) and minuscules (lower-case letters): an upper-case A and a lower-case a are usually considered to be the same letter, so in sorting the name Abraham then comes between aardvark and accu.
- Diacritics: various languages use marks over and around letters, but again for sorting purposes the characters may be considered to be the same letter. For example, in French dictionaries the word école comes between ecmnésie and ectoplasme, and in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch. On the other hand, Turkish dictionaries treat o and ö as different letters, and oyun comes before öbür.
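The difference between comparing numeric codes and comparing letters can be sketched in Python. This is only an illustration: the helper name `base_letters_key` is hypothetical, and the approach (ignore case, strip combining marks) matches the English and French examples above but would be wrong for a language such as Turkish, where ö is a letter in its own right.

```python
import unicodedata


def base_letters_key(word: str) -> str:
    """Hypothetical sort key: ignore case and strip combining marks."""
    # NFD splits an accented letter such as "é" into "e" plus a
    # combining accent, whose Unicode category is "Mn".
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.casefold()


names = ["accu", "Abraham", "aardvark"]
french = ["ectoplasme", "école", "ecmnésie"]

# Plain sorted() compares code points, so the capital "A" of Abraham
# sorts ahead of every lower-case letter.
print(sorted(names))

# With the key, case and accents no longer decide the order.
print(sorted(names, key=base_letters_key))
print(sorted(french, key=base_letters_key))
```

Words that differ only in case or accent receive the same key here, so a real collation would still need a rule for breaking such ties.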
In some cases a digraph or trigraph is considered a single letter; for example, in Welsh the combination ch is one letter, and in dictionaries cymal comes before chwaer. Conversely, sometimes single characters may be sorted as if they are a sequence of other characters.
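Treating a digraph as one letter amounts to splitting each word into letters before comparing. The sketch below does this for the Welsh example. The alphabet list is an assumption supplied for illustration (it is not given in this article), and `welsh_key` is a hypothetical helper, not a standard function.

```python
# Assumed Welsh alphabet order; digraphs count as single letters.
WELSH_ALPHABET = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
    "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
    "t", "th", "u", "w", "y",
]
RANK = {letter: i for i, letter in enumerate(WELSH_ALPHABET)}


def welsh_key(word: str) -> list[int]:
    """Split a word into letters, preferring digraphs, and return ranks."""
    text = word.lower()
    ranks = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            # Greedy choice: read "ch", "ll", ... as one letter.
            ranks.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after every letter.
            ranks.append(RANK.get(text[i], len(WELSH_ALPHABET) + ord(text[i])))
            i += 1
    return ranks


words = ["chwaer", "cymal"]
print(sorted(words))                 # code-point order
print(sorted(words, key=welsh_key))  # ch sorts after every c- word
```

The greedy split is a simplification: it cannot tell a true digraph from two adjacent letters that merely look like one, which a dictionary-quality collation would have to handle.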
In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use. For example, German dictionaries and telephone directories use different approaches.
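The German case shows both ideas at once: a single character sorted as a sequence, and an order that depends on use. The following sketch assumes the rules given in the German entry of this article (umlauts as base vowels in dictionaries, as "ae", "oe", "ue" in telephone directories, and ß as ss in both). The table and function names are hypothetical.

```python
import unicodedata

# Dictionary rule: umlauts sort like the plain vowel.
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
# Directory rule: umlauts sort like "ae", "oe", "ue".
PHONE_BOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def german_key(word: str, table: dict) -> tuple:
    """Expand special characters, then fall back to the word itself."""
    # NFC makes sure each umlaut is a single precomposed character.
    primary = unicodedata.normalize("NFC", word).lower().translate(table)
    return (primary, word)


dictionary_words = ["Assoziation", "Arm", "Aßlar", "Ärgerlich", "Assistent", "Arg"]
directory_names = ["Uffenbach", "Üxküll", "Ueve", "Ülle", "Uell", "Übelacker", "Udet"]

in_dictionary = sorted(dictionary_words, key=lambda w: german_key(w, DICTIONARY))
in_directory = sorted(directory_names, key=lambda w: german_key(w, PHONE_BOOK))
print(in_dictionary)
print(in_directory)
```

Keeping the two rule sets as separate translation tables makes the point that the strings are the same and only the collating sequence applied to them changes.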
The basic collating sequence of the Latin alphabet
The collating sequence of the standard 26-letter Latin alphabet is as follows:
- A · B · C · D · E · F · G · H · I · J · K · L · M · N · O · P · Q · R · S · T · U · V · W · X · Y · Z
Collating sequences in various languages that use a Latin-derived alphabet
Some languages use a Latin-derived alphabet that includes modified letters, ligatures, or digraphs, for orthographic and collation purposes. This varies from language to language, and sometimes from symbol to symbol, within the same language. Listed below are the collation orders in various languages.
- In Azerbaijani, there are eight additional letters: five vowels (i, ı, ö, ü, ə) and three consonants (ç, ş, ğ). The alphabet is the same as the Turkish alphabet, with the same sounds written with the same letters, except for three additional letters: q, x and ə, for sounds that do not exist in Turkish. Although all the "Turkish letters" are collated in their "normal" alphabetical order as in Turkish, the three extra letters are collated arbitrarily after letters whose sounds approach theirs. So, q is collated just after k, x is collated just after h and ə is collated just after e.
- In Breton, there is no "c" but there are the digraphs "ch" and "c'h", which are collated between "b" and "d". For example: « buzhugenn, chug, c'hoar, daeraouenn » (earthworm, juice, sister, teardrop).
- In Bosnian, Croatian, Serbian and other related South Slavic languages, the five accented characters and three conjoined characters are sorted after the originals: ..., C, Č, Ć, D, DŽ, Đ, E, ..., L, LJ, M, N, NJ, O, ..., S, Š, T, ..., Z, Ž.
- In Czech and Slovak, accented vowels have secondary collating weight: compared to other letters, they are treated as their unaccented forms (A-Á, E-É-Ě, I-Í, O-Ó-Ô, U-Ú-Ů, Y-Ý), but then they are sorted after the unaccented letters (for example, the correct lexicographic order is baa, baá, báa, bab, báb, bac, bác, bač, báč). Accented consonants (the ones with caron) have primary collating weight and are collated immediately after their unaccented counterparts, with the exception of Ď, Ň and Ť, which again have secondary weight. CH is considered to be a separate letter and goes between H and I. In Slovak, DZ and DŽ are also considered separate letters and are positioned between Ď and E (A-Á-Ä-B-C-Č-D-Ď-DZ-DŽ-E-É…).
- In the Danish and Norwegian alphabets, the same extra vowels as in Swedish (see below) are also present but in a different order and with different glyphs (..., X, Y, Z, Æ, Ø, Å). Also, "Aa" collates as an equivalent to "Å". The Danish alphabet has traditionally seen "W" as a variant of "V", but today "W" is considered a separate letter.
- In Dutch, the combination IJ (representing the letter IJ) was formerly collated as Y (or sometimes as a separate letter, Y < IJ < Z), but is currently mostly collated as two letters (II < IJ < IK). Phone directories are an exception: there IJ is always collated as Y, because in many Dutch family names Y is used where modern spelling would require IJ. Note that a word starting with ij that is written with a capital I is also written with a capital J, for example, the town IJmuiden and the river IJssel.
- In English, diacritics may occur in loanwords, such as the word rôle. Increasingly, however, these are omitted in modern orthography. When the mark is written, the word is nevertheless sorted as if it were absent: rôle comes between rock and rose.
- In EsperantoEsperantois the most widely spoken constructed international auxiliary language. Its name derives from Doktoro Esperanto , the pseudonym under which L. L. Zamenhof published the first book detailing Esperanto, the Unua Libro, in 1887...
, consonants with circumflexCircumflexThe circumflex is a diacritic used in the written forms of many languages, and is also commonly used in various romanization and transcription schemes. It received its English name from Latin circumflexus —a translation of the Greek περισπωμένη...
accents (ĉ, ĝ, ĥ, ĵ, ŝ), as well as ŭ (u with breveBreveA breve is a diacritical mark ˘, shaped like the bottom half of a circle. It resembles the caron , but is rounded, while the caron has a sharp tip...
), are counted as separate letters and collated separately (c, ĉ, d, e, f, g, ĝ, h, ĥ, i, j, ĵ ... s, ŝ, t, u, ŭ, v, z). - In EstonianEstonian languageEstonian is the official language of Estonia, spoken by about 1.1 million people in Estonia and tens of thousands in various émigré communities...
õÕ"Õ", or "õ" is a composition of the Latin letter O with the diacritic mark tilde.The HTML entity is Õ for Õ and õ for õ.-Estonian:...
, äÄ"Ä" and "ä" are both characters that represent either a letter from several extended Latin alphabets, or the letter A with an umlaut mark or diaeresis.- Independent letter :...
, öÖ"Ö", or "ö", is a character used in several extended Latin alphabets, or the letter O with umlaut to denote the front vowels or . In languages without umlaut, the character is also used as a "O with diaeresis" to denote a syllable break, wherein its pronunciation remains an unmodified .- O-Umlaut...
and üÜÜ, or ü, is a character which can be either a letter from several extended Latin alphabets, or the letter U with an umlaut or a diaeresis...
are considered separate letters and collate after wWW is the 23rd letter in the basic modern Latin alphabet.In other Germanic languages, including German, its pronunciation is similar or identical to that of English V...
. Letters šŠThe grapheme Š, š is used in various contexts, usually denoting the voiceless postalveolar fricative. In the International Phonetic Alphabet this sound is denoted with , but the lowercase š is used in the Americanist phonetic notation, as well as in the Uralic Phonetic Alphabet.For use in computer...
, zZZ is the twenty-sixth and final letter of the basic modern Latin alphabet.-Name and pronunciation:In most dialects of English, the letter's name is zed , reflecting its derivation from the Greek zeta but in American English, its name is zee , deriving from a late 17th century English dialectal...
and žŽThe grapheme Ž is formed from Latin Z with the addition of caron . It is used in various contexts, usually denoting the voiced postalveolar fricative, a sound similar to English g in mirage, or Portuguese and French j...
appear in loanwords and foreign proper names only and follow the letter sSS is the nineteenth letter in the ISO basic Latin alphabet.-History: Semitic Šîn represented a voiceless postalveolar fricative . Greek did not have this sound, so the Greek sigma came to represent...
in the Estonian alphabetEstonian alphabetThe Estonian alphabet is used for writing the Estonian language and is based on the Latin alphabet, with German influence. As such, the Estonian alphabet has the letters Ä, Ö, and Ü , which represent the vowel sounds , and , respectively...
, which otherwise does not differ from the basic Latin alphabet. - The Faroese alphabetFaroese alphabetThe Faroese alphabet consists of 29 letters derived from the Latin alphabet:- See also :* Alphabets derived from the Latin* Icelandic alphabet* Faroese language* Faroese orthography* Danish and Norwegian alphabet...
also has some of the Danish, Norwegian, and Swedish extra letters, namely ÆÆÆ is a grapheme formed from the letters a and e. Originally a ligature representing a Latin diphthong, it has been promoted to the full status of a letter in the alphabets of some languages, including Danish, Faroese, Norwegian and Icelandic...
and ØØØ — minuscule: "ø", is a vowel and a letter used in the Danish, Faroese, Norwegian and Southern Sami languages.It's mostly used as a representation of mid front rounded vowels, such as ø œ, except for Southern Sami where it's used as an [oe] diphtong.The name of this letter is the same as the sound...
. Furthermore, the Faroese alphabetFaroese alphabetThe Faroese alphabet consists of 29 letters derived from the Latin alphabet:- See also :* Alphabets derived from the Latin* Icelandic alphabet* Faroese language* Faroese orthography* Danish and Norwegian alphabet...
uses the Icelandic eth, which follows the DDD is the fourth letter in the basic modern Latin alphabet.- History :The Semitic letter Dâlet may have developed from the logogram for a fish or a door. There are various Egyptian hieroglyphs that might have inspired this. In Semitic, Ancient Greek, and Latin, the letter represented ; in the...
. Five of the six vowels AAA is the first letter and a vowel in the basic modern Latin alphabet. It is similar to the Ancient Greek letter Alpha, from which it derives.- Origins :...
, III is the ninth letter and a vowel in the basic modern Latin alphabet.-History:In Semitic, the letter may have originated in a hieroglyph for an arm that represented a voiced pharyngeal fricative in Egyptian, but was reassigned to by Semites, because their word for "arm" began with that sound...
, OOO is the fifteenth letter and a vowel in the basic modern Latin alphabet.The letter was derived from the Semitic `Ayin , which represented a consonant, probably , the sound represented by the Arabic letter ع called `Ayn. This Semitic letter in its original form seems to have been inspired by a...
, UUU is the twenty-first letter and a vowel in the basic modern Latin alphabet.-History:The letter U ultimately comes from the Semitic letter Waw by way of the letter Y. See the letter Y for details....
and YYY is the twenty-fifth letter in the basic modern Latin alphabet and represents either a vowel or a consonant in English.-Name:In Latin, Y was named Y Graeca "Greek Y". This was pronounced as I Graeca "Greek I", since Latin speakers had trouble pronouncing , which was not a native sound...
can get accents and are after that considered separate letters. The consonants CCĈ or ĉ is a consonant in Esperanto orthography, representing the sound .Esperanto orthography uses a diacritic for all four of its postalveolar consonants, as do the Latin-based Slavic alphabets...
, QQQ is the seventeenth letter of the basic modern Latin alphabet.- History :The Semitic sound value of Qôp was , a sound common to Semitic languages, but not found in English or most Indo-European ones...
, XXX is the twenty-fourth letter in the basic modern Latin alphabet.-Uses:In mathematics, x is commonly used as the name for an independent variable or unknown value. The usage of x to represent an independent or unknown variable can be traced back to the Arabic word šay شيء = “thing,” used in Arabic...
, WWW is the 23rd letter in the basic modern Latin alphabet.In other Germanic languages, including German, its pronunciation is similar or identical to that of English V...
and ZZZ is the twenty-sixth and final letter of the basic modern Latin alphabet.-Name and pronunciation:In most dialects of English, the letter's name is zed , reflecting its derivation from the Greek zeta but in American English, its name is zee , deriving from a late 17th century English dialectal...
are not found. Therefore the first five letters are AAA is the first letter and a vowel in the basic modern Latin alphabet. It is similar to the Ancient Greek letter Alpha, from which it derives.- Origins :...
, ÁÁis a letter of the Czech, Faroese, Hungarian, Icelandic, Slovak and Sámi languages. This letter also appears in Dutch, Galician, Irish, Occitan, Portuguese, Spanish, Lakota, Navajo, and Vietnamese as a variant of the letter “a”. Some writers use á incorrectly to denote a quantity, often used on...
, BBB is the second letter in the basic modern Latin alphabet. It is used to represent a variety of bilabial sounds , most commonly a voiced bilabial plosive.-History:...
, DDD is the fourth letter in the basic modern Latin alphabet.- History :The Semitic letter Dâlet may have developed from the logogram for a fish or a door. There are various Egyptian hieroglyphs that might have inspired this. In Semitic, Ancient Greek, and Latin, the letter represented ; in the...
and ÐÐA Latin capital letter D with a stroke through its vertical bar is the uppercase form of several different letters:*D with stroke , used in Vietnamese, some South Slavic , Moro and Sami languages...
, and the last five are VVV is the twenty-second letter in the basic modern Latin alphabet.-Letter:The letter V comes from the Semitic letter Waw, as do the modern letters F, U, W, and Y. See F for details....
, YYY is the twenty-fifth letter in the basic modern Latin alphabet and represents either a vowel or a consonant in English.-Name:In Latin, Y was named Y Graeca "Greek Y". This was pronounced as I Graeca "Greek I", since Latin speakers had trouble pronouncing , which was not a native sound...
, ÝYY is the twenty-fifth letter in the basic modern Latin alphabet and represents either a vowel or a consonant in English.-Name:In Latin, Y was named Y Graeca "Greek Y". This was pronounced as I Graeca "Greek I", since Latin speakers had trouble pronouncing , which was not a native sound...
, ÆÆÆ is a grapheme formed from the letters a and e. Originally a ligature representing a Latin diphthong, it has been promoted to the full status of a letter in the alphabets of some languages, including Danish, Faroese, Norwegian and Icelandic...
, ØØØ — minuscule: "ø", is a vowel and a letter used in the Danish, Faroese, Norwegian and Southern Sami languages.It's mostly used as a representation of mid front rounded vowels, such as ø œ, except for Southern Sami where it's used as an [oe] diphtong.The name of this letter is the same as the sound... - In Filipino (Tagalog)Filipino languageThis move has drawn much criticism from other regional groups.In 1987, a new constitution introduced many provisions for the language.Article XIV, Section 6, omits any mention of Tagalog as the basis for Filipino, and states that:...
and other Philippine languages, the letter Ng is treated as a separate letter. It is pronounced as in sing, ping-pong, etc. By itself, it is pronounced nang, but in general Filipino orthography, it is spelled as if it were two separate letters (n and g). Also, letter derivatives (such as ÑÑÑ is a letter of the modern Latin alphabet, formed by an N with a diacritical tilde. It is used in the Spanish alphabet, Galician alphabet, Asturian alphabet, Basque alphabet, Aragonese old alphabet , Filipino alphabet, Chamorro alphabet and the Guarani alphabet, where it represents...
) immediately follow the base letter. FilipinoFilipino languageThis move has drawn much criticism from other regional groups.In 1987, a new constitution introduced many provisions for the language.Article XIV, Section 6, omits any mention of Tagalog as the basis for Filipino, and states that:...
also is written with diacritics, but their use is very rare (except the tildeTildeThe tilde is a grapheme with several uses. The name of the character comes from Portuguese and Spanish, from the Latin titulus meaning "title" or "superscription", though the term "tilde" has evolved and now has a different meaning in linguistics....
). (Philippine orthography also includes spelling.) - The Finnish alphabetFinnish alphabetThe Finnish alphabet is based on the Latin script, and especially the Swedish alphabet. Officially it comprises 28 letters:A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, X, Y, Z, Å, Ä, Ö...
and collating rules are the same as in Swedish, except for the addition of the letters ŠŠThe grapheme Š, š is used in various contexts, usually denoting the voiceless postalveolar fricative. In the International Phonetic Alphabet this sound is denoted with , but the lowercase š is used in the Americanist phonetic notation, as well as in the Uralic Phonetic Alphabet.For use in computer...
and ŽŽThe grapheme Ž is formed from Latin Z with the addition of caron . It is used in various contexts, usually denoting the voiced postalveolar fricative, a sound similar to English g in mirage, or Portuguese and French j...
, which are considered variants of S and Z. - For FrenchFrench languageFrench is a Romance language spoken as a first language in France, the Romandy region in Switzerland, Wallonia and Brussels in Belgium, Monaco, the regions of Quebec and Acadia in Canada, and by various communities elsewhere. Second-language speakers of French are distributed throughout many parts...
, the last accent in a given word determines the order. For example, in French, the following four words would be sorted this way: cote < côte < coté < côté. - In GermanGerman alphabetThe modern German alphabet is an extended Latin alphabet consisting of 30 letters – the same letters that are found in the Basic modern Latin alphabet plus four extra letters.In German, the individual letters have neuter gender: das A, das B etc....
letters with umlaut (ÄÄ"Ä" and "ä" are both characters that represent either a letter from several extended Latin alphabets, or the letter A with an umlaut mark or diaeresis.- Independent letter :...
, ÖÖ"Ö", or "ö", is a character used in several extended Latin alphabets, or the letter O with umlaut to denote the front vowels or . In languages without umlaut, the character is also used as a "O with diaeresis" to denote a syllable break, wherein its pronunciation remains an unmodified .- O-Umlaut...
, ÜÜÜ, or ü, is a character which can be either a letter from several extended Latin alphabets, or the letter U with an umlaut or a diaeresis...
) are treated generally just like their non-umlauted versions; ßßIn the German alphabet, ß is a letter that originated as a ligature of ss or sz. Like double "s", it is pronounced as an , but in standard spelling, it is only used after long vowels and diphthongs, while ss is used after short vowels...
is always sorted as ss. This makes the alphabetic order Arg, Ärgerlich, Arm, Assistent, Aßlar, Assoziation. For phone directories and similar lists of names, the umlauts are collated like the letter combinations "ae", "oe", "ue". This makes the alphabetic order Udet, Übelacker, Uell, Ülle, Ueve, Üxküll, Uffenbach.
- Hungarian vowels have accents, umlauts, and double accents, while consonants are written with single, double (digraph) or triple (trigraph) characters. In collating, accented vowels are equivalent to their non-accented counterparts, and double and triple characters follow their single originals. Hungarian alphabetic order is: A=Á, B, C, CS, D, DZ, DZS, E=É, F, G, GY, H, I=Í, J, K, L, LY, M, N, NY, O=Ó, Ö=Ő, P, Q, R, S, SZ, T, TY, U=Ú, Ü=Ű, V, W, X, Y, Z, ZS. (For example, the correct lexicographic order is baa, baá, bab, báb, bac, bác, bacs, bács, bad, bád, ...) (Before approx. 1988, dz and dzs were not considered single letters for collation, but two letters each: d+z and d+zs.)
- In Icelandic, Þ is added, and D is followed by Ð. Each vowel (A, E, I, O, U, Y) is followed by its counterpart with an acute accent: Á, É, Í, Ó, Ú, Ý. There is no Z, so the alphabet ends: ... X, Y, Ý, Þ, Æ, Ö.
  - Both Þ and Ð were also used by Anglo-Saxon scribes, who also used the runic letter Wynn to represent /w/.
  - Þ (called thorn; lowercase þ) is also a runic letter.
  - Ð (called eth; lowercase ð) is the letter D with an added stroke.
- In Lithuanian, specifically Lithuanian letters go after their Latin originals. Another change is that Y comes just before J: ... G, H, I, Į, Y, J, K...
- In Polish, specifically Polish letters derived from the Latin alphabet are collated after their originals: A, Ą, B, C, Ć, D, E, Ę, ..., L, Ł, M, N, Ń, O, Ó, P, ..., S, Ś, T, ..., Z, Ź, Ż. For collation purposes, digraphs are treated as two separate letters.
- In Portuguese, the collating order is just like in English, including the three letters not native to Portuguese: A, B, C, D, E, F, G, H, I, J, (K), L, M, N, O, P, Q, R, S, T, U, V, (W), X, (Y), Z. Digraphs and letters with diacritics are not included in the alphabet.
- In Romanian, special characters derived from the Latin alphabet are collated after their originals: A, Ă, Â, ..., I, Î, ..., S, Ș, T, Ț, ..., Z.
- In the Swedish alphabet, there are three extra vowels placed at its end (..., X, Y, Z, Å, Ä, Ö), similar to the Danish and Norwegian alphabet, but with different glyphs and a different collating order. The letter "W" used to be treated as a variant of "V", but in the 13th edition of Svenska Akademiens ordlista (2006) "W" was considered a separate letter.
- Spanish treated (until 1997) "CH" and "LL" as single letters, giving an ordering of cinco, credo, chispa and lomo, luz, llama. This is no longer the case: in 1997 the RAE adopted the more conventional usage, and now LL is collated between LK and LM, and CH between CG and CI. The six accented or umlauted characters Á, É, Í, Ó, Ú, Ü are treated as the original letters A, E, I, O, U, for example: radio, ráfaga, rana, rápido, rastrillo. The only collating question specific to Spanish is Ñ (eñe), which is a different letter collated after N.
- In the Turkish alphabet there are 6 additional letters: ç, ğ, ı, ö, ş, and ü (but no q, w, and x). They are collated with ç after c, ğ after g, ı before i, ö after o, ş after s, and ü after u. Originally, when the alphabet was introduced in 1928, ı was collated after i, but the order was changed later so that letters having shapes containing dots, cedillas or other adorning marks always follow the letters with corresponding bare shapes. Note that in Turkish orthography the letter I is the majuscule of dotless ı, whereas İ is the majuscule of dotted i.
- In many Turkic languages (such as Azeri or the Jaŋalif orthography for Tatar), there used to be the letter Gha, which came between G and H. It has now fallen into disuse.
- Welsh also has complex rules: the combinations CH, DD, FF, NG, LL, PH, RH and TH are all considered single letters, and each is listed after the letter which is the first character in the combination, with the exception of NG, which is listed after G. However, the situation is further complicated by these combinations not always being single letters. An example ordering is LAWR, LWCUS, LLONG, LLOM, LLONGYFARCH: the last of these words is a juxtaposition of LLON and GYFARCH, and, unlike LLONG, does not contain the letter NG.
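As a rough illustration of the rules listed above, the following Python sketch builds a sort key from an ordered list of letters. It is a sketch under stated assumptions, not a reference implementation: the names `make_key`, `WELSH` and `GERMAN_PHONEBOOK` are hypothetical, the full Welsh letter list is assumed (only the digraphs are taken from the text), and digraphs are matched greedily. Greedy matching cannot tell that LLONGYFARCH contains N followed by G rather than the letter NG; resolving that needs knowledge of the word itself.

```python
import string


def make_key(alphabet, equivalents=None):
    """Build a sort-key function from letters given in collating order.

    alphabet    -- list of letters; entries may be digraphs or trigraphs
    equivalents -- optional mapping applied per character before lookup,
                   e.g. {"ü": "ue"} to sort a letter as a sequence
    """
    equivalents = equivalents or {}
    rank = {letter: i for i, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def key(word):
        text = "".join(equivalents.get(ch, ch) for ch in word.lower())
        ranks = []
        i = 0
        while i < len(text):
            # Greedy longest match (an assumption): "ll" wins over "l" + "l".
            for size in range(longest, 0, -1):
                chunk = text[i:i + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    i += len(chunk)
                    break
            else:
                # Unknown characters sort after every known letter.
                ranks.append(len(alphabet) + ord(text[i]))
                i += 1
        return ranks

    return key


# Assumed Welsh letter order; NG is placed after G as described above.
WELSH = ["a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
         "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
         "t", "th", "u", "w", "y"]
welsh_key = make_key(WELSH)

# German phone-book style: umlauts sort as "ae", "oe", "ue"; ß sorts as "ss".
GERMAN_PHONEBOOK = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
phonebook_key = make_key(list(string.ascii_lowercase), GERMAN_PHONEBOOK)

print(sorted(["LLOM", "LWCUS", "LLONG", "LAWR"], key=welsh_key))
# ['LAWR', 'LWCUS', 'LLONG', 'LLOM']
print(sorted(["Uffenbach", "Üxküll", "Ueve", "Ülle", "Uell", "Übelacker", "Udet"],
             key=phonebook_key))
# ['Udet', 'Übelacker', 'Uell', 'Ülle', 'Ueve', 'Üxküll', 'Uffenbach']
```

The key compares only letter ranks, so it does not break ties between words that differ only in accents or case; a fuller treatment would add secondary comparison levels.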
The Unicode Collation Algorithm can be used to get any of the collation sequences described above, by tailoring its default collation table. Several such tailorings are collected in the Common Locale Data Repository.
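The idea of tailoring can be pictured with a toy model in Python. This is not the Unicode Collation Algorithm, which works with multi-level collation weights; it only mimics the notion of starting from a default order and moving individual letters relative to an anchor letter. The names `tailor` and `sort_key` and the rule format of (letter, anchor) pairs are hypothetical.

```python
import string


def tailor(default_order, rules):
    """Return a copy of default_order with each (letter, anchor) rule applied.

    Each rule places `letter` immediately after `anchor`, removing the
    letter from its previous position if it already had one.
    """
    order = list(default_order)
    for letter, anchor in rules:
        if letter in order:
            order.remove(letter)
        order.insert(order.index(anchor) + 1, letter)
    return order


def sort_key(order):
    """Build a key function that ranks each character by its position in order."""
    rank = {letter: i for i, letter in enumerate(order)}
    # Characters missing from the table sort after all known letters.
    return lambda word: [rank.get(ch, len(order)) for ch in word.lower()]


DEFAULT = list(string.ascii_lowercase)

# Swedish: Å, Ä, Ö are placed at the end of the alphabet.
swedish = tailor(DEFAULT, [("å", "z"), ("ä", "å"), ("ö", "ä")])
# Lithuanian (partial): Į follows I, and Y comes just before J.
lithuanian = tailor(DEFAULT, [("į", "i"), ("y", "į")])

print(swedish[-4:])        # ['z', 'å', 'ä', 'ö']
print(lithuanian[6:12])    # ['g', 'h', 'i', 'į', 'y', 'j']
print(sorted(["öl", "zebra", "ål", "äpple"], key=sort_key(swedish)))
# ['zebra', 'ål', 'äpple', 'öl']
```

Keeping the default order and the language-specific rules separate means one base table can serve many languages, which is the design the paragraph above describes.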
See also
- Alphabets derived from the Latin
- Collation
- Latin alphabet
- List of Latin letters
- Taxonomic sequence
External links and references
- ICU Locale Explorer: an online demonstration of sorting in different languages that uses the Unicode Collation Algorithm with International Components for Unicode