Collating sequence
The term collating sequence refers to the order in which individual characters should be taken when sorting a collection of character strings using dictionary order. This article is concerned with the order of the alphabetical characters comprising variants of the Latin alphabet in various languages. For other writing systems, see Collation.

General issues

In a computer system, each character is assigned a unique numeric code (as in the ASCII or Unicode character set), but the proper and customary ordering of strings is not performed by a simple numeric comparison of those codes. Rather, the ordering is determined by reference to the collating sequence.
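
The difference between comparing numeric codes and consulting a collating sequence can be sketched in a few lines of Python. The sequence used here (AaBb...Zz) and the names COLLATING_SEQUENCE, RANK and collation_key are assumptions made for illustration only; they are not taken from any standard.

```python
import string

# Toy collating sequence (an assumption for illustration): AaBbCc...Zz
COLLATING_SEQUENCE = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)
RANK = {ch: position for position, ch in enumerate(COLLATING_SEQUENCE)}


def collation_key(text):
    """Map each character to its position in the collating sequence."""
    # Raises KeyError for characters outside the toy sequence.
    return [RANK[ch] for ch in text]


words = ["banana", "Zebra", "apple", "Cherry"]

# Plain comparison uses the numeric codes, so every capital precedes
# every lower-case letter.
by_code = sorted(words)

# Comparison through the collating sequence interleaves the two cases.
by_sequence = sorted(words, key=collation_key)

print(by_code)      # ['Cherry', 'Zebra', 'apple', 'banana']
print(by_sequence)  # ['apple', 'banana', 'Cherry', 'Zebra']
```

Real collation libraries work with far richer tables, but the principle is the same: strings are compared through keys derived from the collating sequence rather than through their raw codes.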

A general issue in sorting in dictionary order is whether two characters having different shapes are considered the same letter or different letters. In particular:
  • Majuscules (capital letters) and minuscules (lower-case letters): an upper-case A and a lower-case a are usually considered to be the same letter, so in sorting the name Abraham then comes between aardvark and accu.
  • Diacritics: various languages use marks over and around letters, but again for sorting purposes the characters may be considered to be the same letter. For example, in French dictionaries the word école comes between ecmnésie and ectoplasme, and in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch. On the other hand, Turkish dictionaries treat o and ö as different letters, and oyun comes before öbür.
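
Treating capitals and accented characters as "the same letter" can be approximated by folding each string before comparison. The following is a minimal sketch using Python's standard unicodedata module; the helper names fold and sort_key are hypothetical, and the approach is a simplification rather than a full collation algorithm.

```python
import unicodedata


def fold(text):
    """Strip combining marks and case so variants compare as one letter."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


def sort_key(text):
    # The original string is kept as a tie-breaker so that words differing
    # only in case or accents still get a deterministic order.
    return (fold(text), text)


print(sorted(["accu", "Abraham", "aardvark"], key=sort_key))
# ['aardvark', 'Abraham', 'accu']
print(sorted(["ectoplasme", "école", "ecmnésie"], key=sort_key))
# ['ecmnésie', 'école', 'ectoplasme']
print(sorted(["olfaktorisch", "ökonomisch", "offenbar"], key=sort_key))
# ['offenbar', 'ökonomisch', 'olfaktorisch']
```

This folding would be wrong for Turkish, where o and ö are distinct letters; a language-specific key has to leave such characters unfolded and rank them separately.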


In some cases a digraph or trigraph is considered a single letter; for example, in Welsh the combination ch is one letter, and in dictionaries cymal comes before chwaer. Conversely, sometimes single characters may be sorted as if they are a sequence of other characters.
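
A digraph that counts as one letter can be handled by splitting each word into collation letters before ranking them. The sketch below is an assumption-laden illustration: it inserts only ch after c in the basic 26-letter sequence (it does not model the full Welsh alphabet), and it matches the digraph greedily, which a real implementation would need to refine. The names ALPHABET, split_letters and digraph_key are hypothetical.

```python
import string

# Basic 26-letter sequence with the digraph "ch" inserted after "c".
# Only this one digraph is modelled here.
ALPHABET = list(string.ascii_lowercase)
ALPHABET.insert(ALPHABET.index("c") + 1, "ch")
RANK = {letter: position for position, letter in enumerate(ALPHABET)}
DIGRAPHS = {"ch"}


def split_letters(word):
    """Split a word into collation letters, matching digraphs greedily."""
    lowered = word.lower()
    letters = []
    i = 0
    while i < len(lowered):
        pair = lowered[i:i + 2]
        if pair in DIGRAPHS:
            letters.append(pair)
            i += 2
        else:
            letters.append(lowered[i])
            i += 1
    return letters


def digraph_key(word):
    # Raises KeyError for characters outside the toy alphabet.
    return [RANK[letter] for letter in split_letters(word)]


print(sorted(["chwaer", "cymal"]))                   # ['chwaer', 'cymal']
print(sorted(["chwaer", "cymal"], key=digraph_key))  # ['cymal', 'chwaer']
```

Because ch ranks after every word beginning with a plain c, cymal precedes chwaer, whereas a character-by-character comparison puts them the other way round.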

In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use. For example, German dictionaries and telephone directories use different approaches.
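
The German case can be sketched with two sort keys applied to the word lists given in the German entry later in this article. The function names dictionary_key and phonebook_key are hypothetical, and the translation tables cover only the umlauts and ß; this is an illustration of the two conventions, not a complete German collation.

```python
# Dictionary convention: umlauts sort like the plain vowel, ß like "ss".
DICTIONARY_TABLE = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

# Phone-directory convention: umlauts sort like "ae", "oe", "ue".
PHONEBOOK_TABLE = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def dictionary_key(word):
    return word.lower().translate(DICTIONARY_TABLE)


def phonebook_key(word):
    return word.lower().translate(PHONEBOOK_TABLE)


dictionary_words = ["Assoziation", "Aßlar", "Arm", "Ärgerlich", "Assistent", "Arg"]
print(sorted(dictionary_words, key=dictionary_key))
# ['Arg', 'Ärgerlich', 'Arm', 'Assistent', 'Aßlar', 'Assoziation']

names = ["Uffenbach", "Üxküll", "Ueve", "Ülle", "Uell", "Übelacker", "Udet"]
print(sorted(names, key=phonebook_key))
# ['Udet', 'Übelacker', 'Uell', 'Ülle', 'Ueve', 'Üxküll', 'Uffenbach']
```

The same list of names sorted with dictionary_key would place Übelacker before Udet, which shows why the intended use has to be known before a collation is chosen.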

The basic collating sequence of the Latin alphabet

The collating sequence of the standard 26-letter Latin alphabet is as follows:
A · B · C · D · E · F · G · H · I · J · K · L · M · N · O · P · Q · R · S · T · U · V · W · X · Y · Z

Collating sequences in various languages that use a Latin-derived alphabet

Some languages use a Latin-derived alphabet that includes modified letters, ligatures, or digraphs, for orthographic and collation purposes. This varies from language to language, and sometimes from symbol to symbol, within the same language. Listed below are the collation orders in various languages.
  • In Azerbaijani, there are 8 additional letters. Five of them are vowels (i, ı, ö, ü, ə) and three are consonants (ç, ş, ğ). The alphabet is the same as the Turkish alphabet, with the same sounds written with the same letters, except for three additional letters: q, x and ə for sounds that do not exist in Turkish. Although all the "Turkish letters" are collated in their "normal" alphabetical order as in Turkish, the three extra letters are collated arbitrarily after letters whose sounds approach theirs. So, q is collated just after k, x is collated just after h and ə is collated just after e.
  • In Breton, there is no "c" but there are the digraphs "ch" and "c'h", which are collated between "b" and "d". For example: « buzhugenn, chug, c'hoar, daeraouenn » (earthworm, juice, sister, teardrop).
  • In Bosnian, Croatian, Serbian and other related South Slavic languages, the five accented characters and three conjoined characters are sorted after the originals: ..., C, Č, Ć, D, DŽ, Đ, E, ..., L, LJ, M, N, NJ, O, ..., S, Š, T, ..., Z, Ž.
  • In Czech and Slovak, accented vowels have secondary collating weight: compared to other letters, they are treated as their unaccented forms (A-Á, E-É-Ě, I-Í, O-Ó-Ô, U-Ú-Ů, Y-Ý), but then they are sorted after the unaccented letters (for example, the correct lexicographic order is baa, baá, báa, bab, báb, bac, bác, bač, báč). Accented consonants (the ones with caron) have primary collating weight and are collated immediately after their unaccented counterparts, with the exception of Ď, Ň and Ť, which again have secondary weight. CH is considered to be a separate letter and goes between H and I. In Slovak, DZ and DŽ are also considered separate letters and are positioned between Ď and E (A-Á-Ä-B-C-Č-D-Ď-DZ-DŽ-E-É…).
  • In the Danish and Norwegian alphabets, the same extra vowels as in Swedish (see below) are also present but in a different order and with different glyphs (..., X, Y, Z, Æ, Ø, Å). Also, "Aa" collates as an equivalent to "Å". The Danish alphabet has traditionally seen "W" as a variant of "V", but today "W" is considered a separate letter.
  • In Dutch, the combination IJ (representing the single letter IJ) was formerly collated as Y (or sometimes as a separate letter, Y < IJ < Z), but is currently mostly collated as two letters (II < IJ < IK). Phone directories are an exception: there IJ is always collated as Y, because in many Dutch family names Y is used where modern spelling would require IJ. Note that a word starting with ij that is written with a capital I is also written with a capital J, for example, the town IJmuiden and the river IJssel.
  • In English, diacritics may occur in loanwords, such as the word rôle. Increasingly, however, these are omitted in modern orthography. When the diacritic is written, the word is nevertheless sorted as if the mark were absent: rôle comes between rock and rose.
  • In Esperanto, consonants with circumflex accents (ĉ, ĝ, ĥ, ĵ, ŝ), as well as ŭ (u with breve), are counted as separate letters and collated separately (c, ĉ, d, e, f, g, ĝ, h, ĥ, i, j, ĵ ... s, ŝ, t, u, ŭ, v, z).
  • In Estonian, õ, ä, ö and ü are considered separate letters and collate after w. Letters š, z and ž appear in loanwords and foreign proper names only and follow the letter s in the Estonian alphabet, which otherwise does not differ from the basic Latin alphabet.
  • The Faroese alphabet also has some of the Danish, Norwegian, and Swedish extra letters, namely Æ and Ø. Furthermore, the Faroese alphabet uses the Icelandic eth, which follows the D. Five of the six vowels (A, I, O, U and Y) can take accents and are then considered separate letters. The consonants C, Q, X, W and Z are not found. Therefore the first five letters are A, Á, B, D and Ð, and the last five are V, Y, Ý, Æ, Ø.
  • In Filipino (Tagalog) and other Philippine languages, the letter Ng is treated as a separate letter. It is pronounced as in sing, ping-pong, etc. By itself, it is pronounced nang, but in general Filipino orthography, it is spelled as if it were two separate letters (n and g). Also, letter derivatives (such as Ñ) immediately follow the base letter. Filipino also is written with diacritics, but their use is very rare (except the tilde). (Philippine orthography also includes spelling.)
  • The Finnish alphabet and collating rules are the same as in Swedish, except for the addition of the letters Š and Ž, which are considered variants of S and Z.
  • For French, the last accent in a given word determines the order. For example, in French, the following four words would be sorted this way: cote < côte < coté < côté.
  • In German, letters with umlaut (Ä, Ö, Ü) are generally treated just like their non-umlauted versions; ß is always sorted as ss. This makes the alphabetic order Arg, Ärgerlich, Arm, Assistent, Aßlar, Assoziation. For phone directories and similar lists of names, the umlauts are to be collated like the letter combinations "ae", "oe", "ue". This makes the alphabetic order Udet, Übelacker, Uell, Ülle, Ueve, Üxküll, Uffenbach.
  • Hungarian vowels have accents, umlauts, and double accents, while consonants are written with single, double (digraph) or triple (trigraph) characters. In collating, accented vowels are equivalent to their non-accented counterparts, and double and triple characters follow their single originals. Hungarian alphabetic order is: A=Á, B, C, CS, D, DZ, DZS, E=É, F, G, GY, H, I=Í, J, K, L, LY, M, N, NY, O=Ó, Ö=Ő, P, Q, R, S, SZ, T, TY, U=Ú, Ü=Ű, V, W, X, Y, Z, ZS. (For example, the correct lexicographic order is baa, baá, bab, báb, bac, bác, bacs, bács, bad, bád, ...). (Before approx. 1988, dz and dzs were not considered single letters for collation, but two letters each, d+z and d+zs instead.)
  • In Icelandic, Þ is added, and D is followed by Ð. Each vowel (A, E, I, O, U, Y) is followed by its correspondent with acute: Á, É, Í, Ó, Ú, Ý. There is no Z, so the alphabet ends: ... X, Y, Ý, Þ, Æ, Ö.
    • Both letters were also used by Anglo-Saxon
      Anglo-Saxons
      Anglo-Saxon is a term used by historians to designate the Germanic tribes who invaded and settled the south and east of Great Britain beginning in the early 5th century AD, and the period from their creation of the English nation to the Norman conquest. The Anglo-Saxon Era denotes the period of...

       scribes who also used the Runic letter Wynn
      Wynn
      Wynn is a letter of the Old English alphabet, where it is used to represent the sound ....

       to represent /w/.
    • Þ
      Thorn (letter)
      Thorn or þorn , is a letter in the Old English, Old Norse, and Icelandic alphabets, as well as some dialects of Middle English. It was also used in medieval Scandinavia, but was later replaced with the digraph th. The letter originated from the rune in the Elder Fuþark, called thorn in the...

       (called thorn; lowercase þ) is also a Runic letter.
    • Р(called eth; lowercase ð) is the letter D
      D
      D is the fourth letter in the basic modern Latin alphabet.- History :The Semitic letter Dâlet may have developed from the logogram for a fish or a door. There are various Egyptian hieroglyphs that might have inspired this. In Semitic, Ancient Greek, and Latin, the letter represented ; in the...

       with an added stroke.
  • In Lithuanian
    Lithuanian language
    Lithuanian is the official state language of Lithuania and is recognized as one of the official languages of the European Union. There are about 2.96 million native Lithuanian speakers in Lithuania and about 170,000 abroad. Lithuanian is a Baltic language, closely related to Latvian, although they...

    , specifically Lithuanian letters go after their Latin originals. Another change is that Y
    Y
    Y is the twenty-fifth letter in the basic modern Latin alphabet and represents either a vowel or a consonant in English.-Name:In Latin, Y was named Y Graeca "Greek Y". This was pronounced as I Graeca "Greek I", since Latin speakers had trouble pronouncing , which was not a native sound...

     comes just before J
    J
    Ĵ or ĵ is a letter in Esperanto orthography representing the sound .While Esperanto orthography uses a diacritic for its four postalveolar consonants, as do the Latin-based Slavic alphabets, the base letters are Romano-Germanic...

    : ... G, H, I, Į, Y, J, K...
  • In Polish
    Polish language
    Polish is a language of the Lechitic subgroup of West Slavic languages, used throughout Poland and by Polish minorities in other countries...

    , specifically Polish letters derived from the Latin alphabet are collated after their originals: A, Ą, B, C, Ć, D, E, Ę, ..., L, Ł, M, N, Ń, O, Ó, P, ..., S, Ś, T, ..., Z, Ź, Ż. For collation purposes, digraphs are treated as two separate letters.
  • In Portuguese
    Portuguese alphabet
    The Portuguese alphabet, , consists of the following 23 or 26 Latin letters:In addition, the following characters with diacritics are used: Áá, Ââ, Ãã, Àà, Çç, Éé, Êê, Íí, Óó, Ôô, Õõ, Úú. These are not, however, treated as independent letters in collation, nor do they have entries of their own in...

    , the collating order is just like in English, including the three letters not native to Portuguese: A, B, C, D, E, F, G, H, I, J, (K), L, M, N, O, P, Q, R, S, T, U, V, (W), X, (Y), Z. Digraphs and letters with diacritics are not included in the alphabet.
  • In Romanian
    Romanian language
    Romanian Romanian Romanian (or Daco-Romanian; obsolete spellings Rumanian, Roumanian; self-designation: română, limba română ("the Romanian language") or românește (lit. "in Romanian") is a Romance language spoken by around 24 to 28 million people, primarily in Romania and Moldova...

    , special characters derived from the Latin alphabet are collated after their originals: A, Ă, Â, ..., I, Î, ..., S, Ș, T, Ț, ..., Z.
  • In the Swedish alphabet
    Swedish alphabet
    Modern Swedish is written with a 29-letter Latin alphabet:Prior to the 13th edition of Svenska Akademiens ordlista in 2006, the letters and were collated together....

    , there are three extra vowel
    Vowel
    In phonetics, a vowel is a sound in spoken language, such as English ah! or oh! , pronounced with an open vocal tract so that there is no build-up of air pressure at any point above the glottis. This contrasts with consonants, such as English sh! , where there is a constriction or closure at some...

    s placed at its end (..., X, Y, Z, Å
    Å
    Å represents various sounds in several languages. Å is part of the alphabets used for the Alemannic and the Bavarian-Austrian dialects of German...

    , Ä
    Ä
    "Ä" and "ä" are both characters that represent either a letter from several extended Latin alphabets, or the letter A with an umlaut mark or diaeresis.- Independent letter :...

    , Ö
    Ö
    "Ö", or "ö", is a character used in several extended Latin alphabets, or the letter O with umlaut to denote the front vowels or . In languages without umlaut, the character is also used as a "O with diaeresis" to denote a syllable break, wherein its pronunciation remains an unmodified .- O-Umlaut...

    ), similar to the Danish and Norwegian alphabets, but with different glyphs and a different collating order. The letter "W" was formerly treated as a variant of "V", but in the 13th edition of Svenska Akademiens ordlista
    Svenska Akademiens Ordlista
    Svenska Akademiens ordlista, or SAOL for short, is a dictionary published every few years by the Swedish Academy. It is a single volume that is considered the final arbiter of Swedish spelling...

    (2006) "W" was considered a separate letter.
  • Spanish treated (until 1994) "CH" and "LL" as single letters, giving an ordering of cinco, credo, chispa and lomo, luz, llama. This is no longer the case: in 1994 the RAE
    Real Academia Española
    The Royal Spanish Academy is the official royal institution responsible for regulating the Spanish language. It is based in Madrid, Spain, but is affiliated with national language academies in twenty-one other hispanophone nations through the Association of Spanish Language Academies...

     adopted the more conventional usage, and now LL is collated between LK and LM, and CH between CG and CI. The six accented or umlauted characters Á, É, Í, Ó, Ú, Ü are treated as the original letters A, E, I, O, U, for example: radio, ráfaga, rana, rápido, rastrillo. The only Spanish-specific collating question is Ñ
    Ñ
    Ñ is a letter of the modern Latin alphabet, formed by an N with a diacritical tilde. It is used in the Spanish alphabet, Galician alphabet, Asturian alphabet, Basque alphabet, Aragonese old alphabet , Filipino alphabet, Chamorro alphabet and the Guarani alphabet, where it represents...

     (eñe) as a different letter collated after N.
  • In the Turkish alphabet
    Turkish alphabet
    The Turkish alphabet is a Latin alphabet used for writing the Turkish language, consisting of 29 letters, seven of which have been modified from their Latin originals for the phonetic requirements of the language. This alphabet represents modern Turkish pronunciation with a high degree of accuracy...

     there are six additional letters: ç, ğ, ı, ö, ş, and ü (but no q, w, or x). They are collated with ç after c, ğ after g, ı before i, ö after o, ş after s, and ü after u. Originally, when the alphabet was introduced in 1928, ı was collated after i, but the order was later changed so that letters whose shapes contain dots, cedillas or other adorning marks always follow the letters with the corresponding bare shapes. Note that in Turkish orthography the letter I is the majuscule of dotless ı, whereas İ is the majuscule of dotted i.
  • In many Turkic languages
    Turkic languages
    The Turkic languages constitute a language family of at least thirty five languages, spoken by Turkic peoples across a vast area from Eastern Europe and the Mediterranean to Siberia and Western China, and are considered to be part of the proposed Altaic language family.Turkic languages are spoken...

     (such as Azeri or the Jaŋalif orthography for Tatar
    Tatar language
    The Tatar language , or more specifically Kazan Tatar, is a Turkic language spoken by the Tatars of historical Kazan Khanate, including modern Tatarstan and Bashkiria...

    ), there used to be the letter Gha
    Gha
    The letter ' has been used in various Latin orthographies for Turkic languages, such as Azeri or the Jaŋalif orthography for Tatar. It usually represents a voiced velar fricative, but is sometimes used for a voiced uvular fricative. All orthographies using it have been phased out, so the letter is...

    , which came between G
    G
    G is the seventh letter in the basic modern Latin alphabet.-History:The letter 'G' was introduced in the Old Latin period as a variant of ⟨c⟩ to distinguish voiced, from voiceless, . The recorded originator of ⟨g⟩ is freedman Spurius Carvilius Ruga, the first Roman to open a fee-paying school,...

     and H
    H
    H .) is the eighth letter in the basic modern Latin alphabet.-History:The Semitic letter ⟨ח⟩ most likely represented the voiceless pharyngeal fricative . The form of the letter probably stood for a fence or posts....

    . It has now fallen into disuse.
  • Welsh
    Welsh language
    Welsh is a member of the Brythonic branch of the Celtic languages spoken natively in Wales, by some along the Welsh border in England, and in Y Wladfa...

     also has complex rules: the combinations CH, DD, FF, NG, LL, PH, RH and TH are all considered single letters, and each is listed after the letter which is the first character in the combination, with the exception of NG which is listed after G. However, the situation is further complicated by these combinations not always being single letters. An example ordering is LAWR, LWCUS, LLONG, LLOM, LLONGYFARCH: the last of these words is a juxtaposition of LLON and GYFARCH, and, unlike LLONG, does not contain the letter NG.
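
The alphabet-based orderings listed above can be approximated in code by mapping each letter, including multi-character letters, to its position in the alphabet. The following is a minimal sketch in Python, not a reference implementation: the function name `make_sort_key` and the two alphabet lists are illustrative assumptions, accents are not handled, and the greedy longest-match tokenization is a simplification. It reproduces the traditional and modern Spanish orderings given above.

```python
def make_sort_key(alphabet):
    """Build a sort key for an alphabet given as an ordered list of letters.

    Letters may be multi-character (digraphs or trigraphs). Tokenization is
    greedy longest-match, which is an assumption of this sketch.
    """
    rank = {letter: i for i, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def key(word):
        word = word.lower()
        ranks = []
        pos = 0
        while pos < len(word):
            # Try the longest candidate letter first (e.g. "ch" before "c").
            for size in range(longest, 0, -1):
                chunk = word[pos:pos + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    pos += len(chunk)
                    break
            else:
                # Unknown character: sort after all known letters (assumption).
                ranks.append(len(alphabet) + ord(word[pos]))
                pos += 1
        return ranks

    return key


# Hypothetical alphabet tables for illustration only.
SPANISH_TRADITIONAL = "a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z".split()
SPANISH_MODERN = "a b c d e f g h i j k l m n ñ o p q r s t u v w x y z".split()

words = ["chispa", "cinco", "credo", "llama", "lomo", "luz"]
traditional = sorted(words, key=make_sort_key(SPANISH_TRADITIONAL))
modern = sorted(words, key=make_sort_key(SPANISH_MODERN))
print(traditional)  # ['cinco', 'credo', 'chispa', 'lomo', 'luz', 'llama']
print(modern)       # ['chispa', 'cinco', 'credo', 'llama', 'lomo', 'luz']
```

Greedy matching is exactly what fails for a word such as the Welsh LLONGYFARCH, where N and G are adjacent but do not form the letter NG. Handling such cases needs knowledge of word structure that a letter table alone cannot supply.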


The Unicode Collation Algorithm
Unicode collation algorithm
The Unicode collation algorithm is an algorithm defined in Unicode Technical Report #10, which defines a customizable method to compare two strings. These comparisons can then be used to collate or sort text in any writing system and language that can be represented with Unicode.Unicode Technical...

 can be used to obtain any of the collation sequences described above by tailoring its default collation table. Several such tailorings are collected in the Common Locale Data Repository
Common Locale Data Repository
The Common Locale Data Repository Project, often abbreviated as CLDR, is a project of the Unicode Consortium to provide locale data in the XML format for use in computer applications. CLDR contains locale specific information that an operating system will typically provide to applications. CLDR is...

.
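
The idea of tailoring a default ordering can be illustrated with a much-simplified sketch in Python. This is not the Unicode Collation Algorithm and it uses no CLDR data. It assumes a two-level key in which base letters are compared first and accents only break ties, and the function name `two_level_key` and its `primary_letters` parameter are hypothetical. Passing "ñ" as a tailoring makes it a letter of its own that sorts after N, as described above for Spanish.

```python
import unicodedata


def two_level_key(word, primary_letters=()):
    """Return (primary, secondary) sort levels for a word.

    primary:   base letters, with accents ignored.
    secondary: the combining marks, used only to break primary ties.
    primary_letters: precomposed characters that a tailoring treats as
        letters in their own right; each sorts right after its base letter.
    """
    primary = []
    secondary = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        if ch in primary_letters:
            # Tailored letter: distinct at the primary level.
            primary.append((base, 1))
            secondary.append("")
        else:
            primary.append((base, 0))
            secondary.append(marks)
    return (primary, secondary)


accented = sorted(["rápido", "rana", "rastrillo", "radio", "ráfaga"],
                  key=two_level_key)

sample = ["nube", "ñu", "oso", "nota"]
untailored = sorted(sample, key=two_level_key)
spanish = sorted(sample, key=lambda w: two_level_key(w, primary_letters={"ñ"}))
```

Here `accented` follows the accent-insensitive order given above for Spanish, while `untailored` and `spanish` differ only in where the word beginning with ñ falls. A production system would normally rely on an existing implementation of the algorithm, such as International Components for Unicode, rather than a hand-written key.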

See also

  • Alphabets derived from the Latin
    Alphabets derived from the Latin
    A Latin alphabet is an alphabetical writing system that uses letters of the original Roman Latin alphabet and often various extensions, the Latin script...

  • Collation
    Collation
    Collation is the assembly of written information into a standard order. One common type of collation is called alphabetization, though collation is not limited to ordering letters of the alphabet...

  • Latin alphabet
    Latin alphabet
    The Latin alphabet, also called the Roman alphabet, is the most recognized alphabet used in the world today. It evolved from a western variety of the Greek alphabet called the Cumaean alphabet, which was adopted and modified by the Etruscans who ruled early Rome...

  • List of Latin letters
  • Taxonomic sequence
    Taxonomic sequence
    Taxonomic sequence is a sequence followed in listing of taxa which aids ease of use and roughly reflects the evolutionary relationships among the taxa...



External links and references

  • ICU Locale Explorer An online demonstration of sorting in different languages that uses the Unicode Collation Algorithm
    Unicode collation algorithm
    The Unicode collation algorithm is an algorithm defined in Unicode Technical Report #10, which defines a customizable method to compare two strings. These comparisons can then be used to collate or sort text in any writing system and language that can be represented with Unicode.Unicode Technical...

     with International Components for Unicode
    International Components for Unicode
    International Components for Unicode is an open source project of mature C/C++ and Java libraries for Unicode support, software internationalization and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all...

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.