Collation
Collation is the assembly of written information into a standard order. One common type of collation is called alphabetization, though collation is not limited to ordering letters of the alphabet. Collating lists of words or names into alphabetical order is the basis of most office filing systems, library catalogs and reference books.

Collation differs from classification in that classification is concerned with arranging information into logical categories, while collation is concerned with the ordering of those categories.

Advantages of sorted lists include:
  • one can easily find the first n elements (e.g. the 5 smallest countries) and the last n elements (e.g. the 3 largest countries)
  • one can easily find the elements in a given range (e.g. countries with an area between .. and .. square km)
  • one can easily search for an element, and conclude whether it is in the list, e.g. with the binary search algorithm or interpolation search, either automatically or, roughly and perhaps unconsciously, manually.
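The advantages listed above can be sketched in a few lines of Python. This is only an illustration: the list of areas is made-up data, and the helper names contains and in_range are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Made-up, already sorted values standing in for country areas.
areas = [2, 61, 160, 316, 468, 2586]

# First n and last n elements are plain slices of a sorted list.
three_smallest = areas[:3]
two_largest = areas[-2:]

def in_range(sorted_items, low, high):
    """Return all items x with low <= x <= high, located by binary search."""
    return sorted_items[bisect_left(sorted_items, low):bisect_right(sorted_items, high)]

def contains(sorted_items, target):
    """Binary search: report whether target occurs in the sorted list."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target
```

Both helpers rely on the list already being sorted; on an unsorted list they would silently give wrong answers.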


A collation algorithm, e.g. the Unicode collation algorithm, differs from a sorting algorithm: the former is a process to define the order, which corresponds to the process of just comparing two values, while a sorting algorithm is a procedure to put a list of items in this order.
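The distinction can be illustrated with a small Python sketch. The comparison rule compare_caseless and the hand-written insertion_sort are hypothetical names chosen for this example; the point is only that the same comparison can be plugged into different sorting procedures.

```python
from functools import cmp_to_key

def compare_caseless(a, b):
    """Collation rule: negative, zero or positive, ignoring letter case."""
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)

def insertion_sort(items, compare):
    """Sorting algorithm: arranges items using whatever comparison it is given."""
    result = []
    for item in items:
        pos = len(result)
        # Walk left past every element that should come after the new item.
        while pos > 0 and compare(result[pos - 1], item) > 0:
            pos -= 1
        result.insert(pos, item)
    return result

words = ["pear", "Apple", "fig", "Banana"]
by_insertion = insertion_sort(words, compare_caseless)
by_builtin = sorted(words, key=cmp_to_key(compare_caseless))
```

Both procedures produce the same order because they share the same collation rule.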

Collation defines a total preorder on the set of possible items, typically by defining a total order on a sortkey. Note, however, that in the case of e.g. numerical sorting of strings representing numbers, the strings are only preordered, not totally ordered, because e.g. 2e3 and 2000 have the same ranking, as do 2 and 2.0. The numbers represented by the strings are totally ordered.
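As a small illustration, Python's float can serve as the sortkey for strings that represent numbers. Distinct strings may then share a key, so they tie in the ordering; the sample list is made up for this sketch.

```python
strings = ["2000", "2.0", "2e3", "2", "10"]

# The sortkey is the number each string represents.
keys = {s: float(s) for s in strings}

# sorted() is stable, so strings with equal keys keep their input order.
ordered = sorted(strings, key=float)
```

Here "2e3" and "2000" map to the same key, as do "2" and "2.0", so their relative order is decided by the input order rather than by the collation.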

History

The first effective use among scholars may have been in ancient Alexandria.
In the 1st century BC Varro wrote some alphabetic lists of authors and titles.
In the 2nd century AD Sextus Pompeius Festus wrote an encyclopedic work with entries in alphabetic order.
In the 3rd century Harpocration wrote a Homeric lexicon alphabetized by all letters.
In the 10th century the author of the Suda used alphabetic order with phonetic variations.
In the 14th century the author of the Fons memorabilium universi used a classification, but used alphabetical order within some of the books.
In 1604 Robert Cawdrey had to explain in Table Alphabeticall, the first monolingual English dictionary: "Nowe if the word, which thou art desirous to finde, begin with (a) then looke in the beginning of this Table, but if with (v) looke towards the end."
Although as late as 1803 Samuel Taylor Coleridge condemned encyclopedias with "an arrangement determined by the accident of initial letters", many lists are today based on this principle.

Numerical sorting, sorting of single characters

One collation system is numerical sorting. For example, the list of numbers 4 · 17 · 3 · -5 collates to -5 · 3 · 4 · 17.

While this might appear to work only for numbers, computers can use this method for any textual information, since computers internally use character sets which assign a numeric code point to each letter or glyph.
For example, a computer using ASCII code (or any of its supersets such as Unicode) and numerical sorting would collate the list of characters a · b · C · d · $ to $ · C · a · b · d.

The numerical values that ASCII uses are $ = 36, a = 97, b = 98, C = 67, and d = 100, resulting in what is called "ASCIIbetical order".

This style of collation is commonly used, often with the refinement of converting uppercase letters to lowercase before comparing ASCII values, since most people do not expect capitalised words to jump to the head of the list.
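The example above can be reproduced in Python, whose default string comparison uses code points. The variable names are chosen for this sketch only.

```python
chars = ["a", "b", "C", "d", "$"]

# Numeric code point of each character.
code_points = {c: ord(c) for c in chars}

# Plain code-point ("ASCIIbetical") order.
asciibetical = sorted(chars)

# Refinement: compare lowercase forms so capitals do not jump ahead.
caseless = sorted(chars, key=str.lower)
```

With the lowercase refinement, C lands between b and d, while $ still sorts first because its code point is below those of all letters.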

Alphabetical order

A collation system for multiple-character words is alphabetical order, based on the conventional order of letters in an alphabet (most of which have a single conventional order).

Each nth letter is compared with the nth letter of other words in the list, starting at the first letter of each word and advancing to the second, third, fourth, and so on, until the order is established.

The order of the Latin alphabet is A · B · C · D · E · F · G · H · I · J · K · L · M · N · O · P · Q · R · S · T · U · V · W · X · Y · Z.



The principle behind extending alphabetical order to words (lexicographical order) is that all words in a list beginning with the same letter should be grouped together; within a grouping starting with a single letter, all words beginning with the same two letters shall be grouped together; and so on, maximizing the number of common initial letters between adjacent words. The ordering principle is applied at the point where the letters differ. For instance, in the sequence:
Astrolabe
Astronomy
Astrophysics


The order of the words is determined by the first letter at which they differ, here the sixth (l, n and p respectively). Since n follows l in the alphabet, but precedes p, Astronomy comes after Astrolabe, but before Astrophysics.
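A minimal sketch of this letter-by-letter comparison follows. The function names first_difference and precedes are hypothetical, and letters are compared by code point, which matches alphabetical order only for words written in the same case with unaccented Latin letters.

```python
def first_difference(a, b):
    """Index of the first position where a and b differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One word is a prefix of the other (or they are identical).
    return None if len(a) == len(b) else min(len(a), len(b))

def precedes(a, b):
    """True if a sorts strictly before b in lexicographical order."""
    i = first_difference(a, b)
    if i is None:
        return False          # identical words
    if i == len(a):
        return True           # a is a proper prefix of b
    if i == len(b):
        return False          # b is a proper prefix of a
    return a[i] < b[i]        # decided at the first differing letter
```

For Astrolabe and Astronomy, first_difference returns 5, the zero-based index of the sixth letter.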

There has historically been some variation in the application of these rules. For instance, the prefixes Mc and M' in Irish and Scottish surnames were taken to be abbreviations for Mac, and alphabetized as if they were spelled out as Mac in full. Thus one might find in a catalog the sequence:
McKinley
Mackintosh


with McKinley preceding Mackintosh, as if it had been spelled "MacKinley". Since the advent of computer-sorted lists, this type of alphabetization is less frequently encountered, though it is still used in British phone books.

A variation in alphabetical principles applies to names consisting of two words. In some cases, names with identical first words are all alphabetized together under the first word, e.g., grouping together all names beginning with San, all those beginning with Santa, and those beginning with Santo:
San
San Cristobal
San Juan
San Teodoro
San Tomas
Santa Barbara
Santa Clara
Santa Cruz
Santo Domingo


But in another system, the names are alphabetized as if they had no spaces, e.g. as follows:
San
San Cristobal
San Juan
Santa Barbara
Santa Clara
Santa Cruz
San Teodoro
Santo Domingo
San Tomas
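The two systems can be sketched as two different sort keys in Python. This assumes plain unaccented names, as in the lists above; the variable names are chosen for illustration.

```python
names = [
    "Santo Domingo", "San Tomas", "Santa Cruz", "San", "San Juan",
    "Santa Clara", "San Teodoro", "Santa Barbara", "San Cristobal",
]

# Word-by-word: the space is kept and sorts before every letter.
word_by_word = sorted(names, key=str.casefold)

# Letter-by-letter: spaces are removed before comparing.
letter_by_letter = sorted(names, key=lambda s: s.replace(" ", "").casefold())
```

The first key yields the first listing above and the second key yields the second.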


The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
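The ñ problem can be shown with a hand-built rank table in Python. This is a simplified sketch, not a full Spanish collation: it assumes lowercase or uppercase words without accented vowels, with ñ stored as a single precomposed character, and it follows the post-1994 rule of treating ch and ll as ordinary letter pairs. The names SPANISH_ALPHABET and spanish_key are hypothetical.

```python
SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
RANK = {letter: i for i, letter in enumerate(SPANISH_ALPHABET)}

def spanish_key(word):
    """Map each letter to its position in the alphabet; unknown characters sort last."""
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]

words = ["ñu", "nube", "zorro", "aorta", "año"]

by_code_point = sorted(words)                   # ñ (U+00F1) lands after z
by_alphabet = sorted(words, key=spanish_key)    # ñ lands between n and o
```

In practice a locale-aware library would normally be used instead of a hand-written table.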

Similar differences between computer numeric sorting and alphabetic sorting occur in Danish and Norwegian (aa is ordered at the end of the alphabet when it is pronounced like å, and at the start of the alphabet when it is pronounced like a), German (ß is ordered as s + s; ä, ö, ü are ordered as a + e, o + e, u + e in phone books, but as a, o, u elsewhere, and after a, o, u in Austria), Icelandic (ð follows d), Dutch (ij is sometimes ordered as y; see IJ: Collation), English (æ is ordered as a + e), and many other languages.
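The German phone-book convention mentioned above amounts to expanding certain letters before comparing. A minimal Python sketch, with the hypothetical names PHONE_BOOK and phone_book_key and a made-up list of surnames:

```python
# Expansion table for the phone-book rule: umlauts become vowel + e, ß becomes ss.
PHONE_BOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def phone_book_key(name):
    """Lowercase the name, then expand umlauts and ß."""
    return name.lower().translate(PHONE_BOOK)

names = ["Muster", "Müller", "Mueller", "Mutter"]

by_code_point = sorted(names)                    # ü sorts after every ASCII letter
by_phone_book = sorted(names, key=phone_book_key)
```

Under this key Müller and Mueller compare as equal, so the stable sort keeps them in their input order.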

Usually the spaces or hyphens between words are ignored.

Languages that use a syllabary or abugida instead of an alphabet (for example, Cherokee) can use approximately the same system if there is a set ordering for the symbols.

Radical-and-stroke sorting

Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as Chinese hanzi and Japanese kanji, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character for "mother" (妈) is sorted as a six-stroke character under the three-stroke primary radical (女).
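The grouping described above can be sketched as a composite sort key. The table below is a tiny hand-entered sample for illustration only; a real implementation would take radicals and stroke counts from a character database and would order radicals by a fixed table rather than by the tie-break used here.

```python
# (character, primary radical, strokes in radical, total strokes) - illustrative sample.
ENTRIES = [
    ("妈", "女", 3, 6),
    ("你", "人", 2, 7),
    ("奶", "女", 3, 5),
    ("他", "人", 2, 5),
]

def radical_stroke_key(entry):
    """Group by radical (fewer-stroke radicals first), then by total stroke count."""
    _char, radical, radical_strokes, total_strokes = entry
    # Assumption: radicals with equal stroke counts are separated by code point.
    return (radical_strokes, radical, total_strokes)

ordered = [entry[0] for entry in sorted(ENTRIES, key=radical_stroke_key)]
```

Characters under the two-stroke radical 人 come first, followed by those under the three-stroke radical 女, each group in order of stroke count.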

The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are a few characters, all unambiguous. The choice of which components of a logograph comprise separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tōkyō (東京), the Japanese name of Tokyo, can be sorted as if it were spelled out in the Japanese characters of the hiragana syllabary as "to-u-ki-yo-u" (とうきょう), using the conventional sorting order for these characters.
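Sorting by a phonetic conversion can be sketched by storing each word together with its reading and sorting on the reading. The readings below are supplied by hand; comparing hiragana by code point only approximates the conventional order (it does not handle voiced or small kana the way dictionaries do), so this is an assumption of the sketch.

```python
# (written form, hiragana reading) pairs; readings entered by hand for illustration.
places = [
    ("東京", "とうきょう"),
    ("大阪", "おおさか"),
    ("京都", "きょうと"),
]

# Sort on the reading, then keep only the written form for display.
by_reading = [kanji for kanji, _reading in sorted(places, key=lambda p: p[1])]
```

The written forms themselves never take part in the comparison.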

In addition, in Greater China, surname stroke ordering is a convention in some official documents where people's names are listed without hierarchy.

The radical-and-stroke system, or some similar pattern-matching and stroke-counting method, was traditionally the only practical method for constructing dictionaries that someone could use to look up a logograph whose pronunciation was unknown. With the advent of computers, dictionary programs are now available that allow one to draw a character using a mouse or stylus.

Multilingual ordering

When lists of names or words need to be ordered, but the context does not define a particular single language or alphabet, the Unicode Collation Algorithm provides a way to put them in sequence.

Conventions in typography and in sorting systems

In typography and in the writing of scientific articles, elements such as headers, sections, lists and pages might use alphabetical numbering instead of numerical numbering. However, this does not always mean that the full alphabet of a particular language is used. Often alphabetical numbering, or enumeration, uses only a subset of the full alphabet. For example, the Russian alphabet has 33 letters, but typically only 28 are used in typographical enumeration (Ukrainian, Belarusian and Bulgarian Cyrillic enumeration shows similar features). Two Russian letters, Ъ and Ь, are used only to modify the preceding consonant, so they naturally fall out. The other three could have been used, but mostly are not: Ы never begins a Russian word; Й almost never begins a word either, is perhaps too similar to И, and is a relatively new character; Ё is also relatively new and much debated, and in proper alphabetical sorting words beginning with Ё are sometimes listed under Е. (These "rules" are relaxed in, for example, phone catalogs, where foreign (non-Russian) names may frequently begin with Й or Ы.)

This points to a simple fact: alphabets are not only tools for writing. Letters are often kept in the alphabet of a language even though they are not used in writing, not least because they are used in alphabetical enumeration. For instance, X, W and Z are not used in writing the Norwegian language, except in loanwords and names. Still, they are kept in the Norwegian alphabet and used in alphabetical lists. Likewise, earlier versions of the Russian alphabet contained letters which had only two purposes: writing Greek words and using the Greek counting system in its Cyrillic form.

Compound words and special characters

A complication in alphabetical sorting can arise due to disagreements over how groups of words (separated compound words, names, titles, etc.) should be ordered. One rule is to remove spaces for purposes of ordering, another is to consider a space as a character that is ordered before numbers and letters (this method is consistent with ordering by ASCII or Unicode codepoint), and a third is to order a space after numbers and letters. Given the following strings to alphabetize—"catch", "cattle", "cat food"—the first rule produces "catch" "cat food" "cattle", the second "cat food" "catch" "cattle", and the third "catch" "cattle" "cat food". The first rule is used in many (but not all) dictionaries, the second in telephone directories (so that Wilson, Jim K appears with other people named Wilson, Jim and not after Wilson, Jimbo). The third rule is rarely used.
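The three rules can be written as three sort keys in Python. The third key is a trick chosen for this sketch: it replaces the space with "{", whose ASCII code (123) is just above that of "z" (122), so it only works for lowercase ASCII text.

```python
items = ["cattle", "cat food", "catch"]

# Rule 1: remove spaces before comparing.
rule1 = sorted(items, key=lambda s: s.replace(" ", ""))

# Rule 2: space sorts before digits and letters (plain code-point order).
rule2 = sorted(items)

# Rule 3: space sorts after digits and lowercase letters.
rule3 = sorted(items, key=lambda s: s.replace(" ", "{"))
```

Each result matches the corresponding ordering given in the paragraph above.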

A similar complication arises when special characters such as hyphens or apostrophes appear in words or names. Any of the same rules as above can be used in this case as well; however, the strict ASCII sorting no longer corresponds exactly to any of the rules.

Name/surname ordering

The telephone directory example sheds light on another complication. In cultures where family names are written after given names, it is usually still desired to sort by family name first. In this case, names need to be reordered to be sorted properly. For example, Juan Hernandes and Brian O'Leary should be sorted as "Hernandes, Juan" and "O'Leary, Brian" even if they are not written this way. Capturing this rule in a computer collation algorithm is difficult, and simple attempts will necessarily fail. For example, unless the algorithm has at its disposal an extensive list of family names, there is no way to decide if "Gillian Lucille van der Waal" is "van der Waal, Gillian Lucille", "Waal, Gillian Lucille van der", or even "Lucille van der Waal, Gillian".

Abbreviations and common words

When abbreviations are used, it is sometimes desired to expand the abbreviations for sorting. In this case, "St. Paul" comes before "Shanghai". Obviously, to capture this behavior in a collation algorithm, a list of abbreviations is needed. It may be more practical in some cases to store two sets of strings, one for sorting and one for display. A similar problem arises when letters are replaced by numbers or special symbols in an irregular manner, for example 1337 for leet or the movie Se7en. In this case, proper sorting necessitates keeping two sets of strings.
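Both approaches can be sketched briefly in Python. The abbreviation table and the stored sort strings are hypothetical examples; a real system would need a much larger table or a maintained sort field.

```python
# Approach 1: expand known abbreviations when building the sort key.
ABBREVIATIONS = {"St.": "Saint"}  # hypothetical, deliberately tiny table

def expanded_key(display):
    """Replace abbreviated words by their expansions, then ignore case."""
    words = [ABBREVIATIONS.get(word, word) for word in display.split()]
    return " ".join(words).casefold()

places = sorted(["Shanghai", "St. Paul", "Seoul"], key=expanded_key)

# Approach 2: store a display string and a separate sort string.
films = [("Se7en", "seven"), ("Selma", "selma")]
naive = sorted(display for display, _ in films)
by_sort_string = [display for display, _ in sorted(films, key=lambda f: f[1])]
```

The naive order places Se7en first because the digit 7 has a lower code point than any letter, whereas the stored sort string treats it as "seven".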

In certain contexts, very common words (such as article
Article (grammar)
An article is a word that combines with a noun to indicate the type of reference being made by the noun. Articles specify the grammatical definiteness of the noun, in some languages extending to volume or numerical scope. The articles in the English language are the and a/an, and some...

s) at the beginning of a sequence of words are not considered for ordering, or are moved to the end. So "The Shining
The Shining (novel)
The Shining is a 1977 horror novel by American author Stephen King. The title was inspired by the John Lennon song "Instant Karma!", which contained the line "We all shine on…". It was King's third published novel, and first hardback bestseller, and the success of the book firmly established King...

" is considered "Shining" or "Shining, The" when alphabetizing and therefore is ordered before "Summer of Sam". This rule is fairly easy to capture in an algorithm, but many programs rely instead on simple lexicographic ordering. One fairly quaint exception to this rule is found at the United Nations, where the flag of The Former Yugoslav Republic of Macedonia is flown between those of Thailand and Timor Leste, so the leading "The" is counted.
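As a rough illustration of how the article-skipping rule might be captured, the following Python sketch builds a sort key that drops one leading English article. The article list, the function name title_sort_key, and the extra sample titles are assumptions made for the example; a real catalogue would need language-specific article lists and exceptions.

```python
# Assumed list of leading articles (English only); hypothetical, not exhaustive.
LEADING_ARTICLES = ("the ", "a ", "an ")


def title_sort_key(title: str) -> str:
    """Return a case-insensitive key that ignores one leading article."""
    lowered = title.casefold()
    for article in LEADING_ARTICLES:
        if lowered.startswith(article):
            return lowered[len(article):]
    return lowered


# Sample titles chosen for illustration.
titles = ["Summer of Sam", "The Shining", "A Clockwork Orange", "Taxi Driver"]

# Simple lexicographic ordering files "The Shining" under T.
lexicographic = sorted(titles)

# The article-aware key files it under S, before "Summer of Sam".
article_aware = sorted(titles, key=title_sort_key)
```

With these sample titles, plain lexicographic sorting places "The Shining" last, while the article-aware key places it directly before "Summer of Sam".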

Sorting of numbers

Ascending numerical order differs from alphabetical order: for example, 11 comes alphabetically before 2. This can be fixed by padding with leading zeros, since 02 comes alphabetically before 11. See, for example, ISO 8601, which writes dates with fixed-width, zero-padded fields.
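The effect of leading zeros can be checked with a short Python sketch. The sample numbers are arbitrary, and the padding width is simply taken from the longest number in the list, which is an assumption that only holds for non-negative integers.

```python
numbers = [11, 2, 100]
as_text = [str(n) for n in numbers]

# Alphabetical (character-by-character) order does not match numeric order.
alphabetical = sorted(as_text)

# Pad every string with leading zeros to the width of the longest one.
width = max(len(s) for s in as_text)
padded = sorted(s.zfill(width) for s in as_text)

# Stripping the padding again recovers the numbers in ascending order.
recovered = [int(s) for s in padded]
```

After padding, alphabetical order and ascending numerical order agree.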

Similarly, −13 comes alphabetically after −12 although it is the smaller number. To make ascending order agree with alphabetical order when negative numbers are present, more drastic measures are needed, such as adding a constant to every number so that none of them is negative.
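A minimal Python sketch of the constant-offset idea follows. The sample values, the choice of offset (the negated minimum), and the helper name offset_key are assumptions for illustration; any constant large enough to make every value non-negative would do.

```python
values = [-12, 5, -13, 0]

# Plain string sorting puts "-12" before "-13", which is numerically wrong.
alphabetical = sorted(str(v) for v in values)

# Shift every value by a constant so that the smallest becomes zero,
# then zero-pad so that all keys have the same width.
offset = -min(values)
width = len(str(max(values) + offset))


def offset_key(value: int) -> str:
    """Encode a value as a fixed-width, non-negative decimal string."""
    return str(value + offset).zfill(width)


encoded = sorted(offset_key(v) for v in values)

# Decoding the sorted keys gives the values in ascending numeric order.
decoded = [int(key) - offset for key in encoded]
```

The encoded keys sort alphabetically in the same order as the original numbers sort numerically, at the cost of having to know the offset in order to decode them.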

Numerical sorting of strings

Sometimes it is desirable to order text with embedded numbers in proper numerical order. For example, "Figure 7b" goes before "Figure 11a", even though '7' comes after '1' in Unicode. The same idea can be extended to Roman numerals. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly. For example, Windows XP does this when sorting file names.
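One common way to get this behavior for embedded integers is to split each string into alternating text and digit runs and compare the digit runs as numbers. The Python sketch below is only an illustration of that idea, limited to ASCII digits; it is not a description of how Windows XP or any other system implements it, and the name natural_key is an assumption.

```python
import re


def natural_key(text: str) -> list:
    """Split text into alternating text and integer parts for comparison."""
    # With a capturing group, re.split keeps the digit runs, so the result
    # alternates: text, digits, text, digits, ... (text parts may be empty).
    parts = re.split(r"([0-9]+)", text)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]


labels = ["Figure 11a", "Figure 7b", "Figure 7a"]

# Character-by-character comparison puts "Figure 11a" first.
lexicographic = sorted(labels)

# Comparing embedded integers numerically puts it last.
natural = sorted(labels, key=natural_key)
```

Because every key alternates text and integer parts in the same positions, two keys can always be compared element by element without mixing types. Building the key costs a regular-expression split per string, which is one reason this kind of sorting is slower than plain comparison.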

Sorting decimals properly is a bit more difficult, because different locales use different symbols for a decimal point, and sometimes the same character used as a decimal point is also used as a separator, for example "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.
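The ambiguity can be made concrete with a small Python sketch that sorts the same strings under two readings of the dot: as a decimal point and as a separator between section levels. Both readings are assumptions an application would have to choose between, and the helper name dotted_key is hypothetical.

```python
def dotted_key(text: str) -> tuple:
    """Treat the dot as a separator between integer section levels."""
    return tuple(int(part) for part in text.split("."))


items = ["3.10", "3.2"]

# Reading the dot as a decimal point: 3.10 equals 3.1, which is less than 3.2.
as_decimals = sorted(items, key=float)

# Reading the dot as a level separator: section 3.2 precedes section 3.10.
as_sections = sorted(items, key=dotted_key)

# Only the separator reading can handle more than one dot.
sections = sorted(["3.2.5", "3.10", "3.2"], key=dotted_key)
```

The two readings give opposite orders for the same pair of strings, which is why the rule has to come from the application rather than from the strings themselves.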

Alphabetical sorting of numbers

When numbers are used as names, rather than for their numerical properties, it is common to sort them alphabetically as they would be spelled. For example, the movie 1776 ("seventeen seventy-six") would be between Seve Ballesteros and Severus Snape. If a number is in a foreign term, it is alphabetized as it would be spelled in that language; for example, 24 heures du Mans would be between Vinge's Singularity and Vinh Airport, reflecting the French "vingt-quatre".
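Because the spelling depends on the language and on how the number is read aloud, a simple way to sketch this in Python is a hand-maintained lookup from titles to their spelled-out filing forms. The table SPELLED_TITLES and the function spelled_key are hypothetical names for this example, and the table covers only the titles mentioned above.

```python
# Hypothetical hand-written table: title -> form used for filing.
SPELLED_TITLES = {
    "1776": "seventeen seventy-six",
    "24 heures du Mans": "vingt-quatre heures du Mans",
}


def spelled_key(title: str) -> str:
    """Sort by the spelled-out form when one is known, else by the title."""
    return SPELLED_TITLES.get(title, title).casefold()


names = [
    "Severus Snape",
    "Vinh Airport",
    "1776",
    "Seve Ballesteros",
    "24 heures du Mans",
    "Vinge's Singularity",
]

ordered = sorted(names, key=spelled_key)
```

Keying the table by the whole title, rather than by the digits alone, keeps the choice of language with each entry: the same digits "24" would be spelled differently in an English title.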

External links and references

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 