Integer sorting
Encyclopedia
In computer science
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

, integer sorting is the algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

ic problem of sorting
Sorting algorithm
In computer science, a sorting algorithm is an algorithm that puts elements of a list in a certain order. The most-used orders are numerical order and lexicographical order...

 a collection of data values by numeric keys, each of which is an integer
Integer
The integers are formed by the natural numbers together with the negatives of the non-zero natural numbers .They are known as Positive and Negative Integers respectively...

. Algorithms designed for integer sorting may also often be applied to sorting problems in which the keys are floating point numbers or text strings. The ability to perform integer arithmetic on the keys allows integer sorting algorithms to be faster than comparison sort
Comparison sort
A comparison sort is a type of sorting algorithm that only reads the list elements through a single abstract comparison operation that determines which of two elements should occur first in the final sorted list...

ing algorithms in many cases, depending on the details of which operations are allowed in the model of computing and how large the integers to be sorted are.

The classical integer sorting algorithms of bucket sort
Bucket sort
Bucket sort, or bin sort, is a sorting algorithm that works by partitioning an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sorting algorithm. It is a distribution sort, and is a cousin of...

, counting sort
Counting sort
In computer science, counting sort is an algorithm for sorting a collection of objects according to keys that are small integers; that is, it is an integer sorting algorithm. It operates by counting the number of objects that have each distinct key value, and using arithmetic on those counts to...

, and radix sort
Radix sort
In computer science, radix sort is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by the individual digits which share the same significant position and value...

 are widely used and practical. Much of the subsequent research on integer sorting algorithms has focused less on practicality and more on theoretical improvements in their worst case analysis, and the algorithms that come from this line of research are not believed to be practical for current 64-bit
64-bit
64-bit is a word size that defines certain classes of computer architecture, buses, memory and CPUs, and by extension the software that runs on them. 64-bit CPUs have existed in supercomputers since the 1970s and in RISC-based workstations and servers since the early 1990s...

 computer architectures, although
experiments have shown that some of these methods may be an improvement on radix sorting for data with 128 or more bits per key. Additionally, for large data sets, the near-random memory access patterns of many integer sorting algorithms can handicap them compared to comparison sorting algorithms that have been designed with the memory hierarchy
Memory hierarchy
The term memory hierarchy is used in the theory of computation when discussing performance issues in computer architectural design, algorithm predictions, and the lower level programming constructs such as involving locality of reference. A 'memory hierarchy' in computer storage distinguishes each...

 in mind.

Integer sorting provides one of the six benchmark
Benchmark
-Geology:*Benchmark , a point of reference for a measurement**Benchmarking , an activity involving finding benchmarks*Benchmark , used in pricing crude oil-Technology:...

s in the DARPA High Productivity Computing Systems
High Productivity Computing Systems
High Productivity Computing Systems is a DARPA project for developing a new generation of economically viable high productivity computing systems for national security and industry in the 2002-2010 timeframe....

 Discrete Mathematics benchmark suite, and one of eleven benchmarks in the NAS Parallel Benchmarks suite.

Models of computation

Time bounds for integer sorting algorithms typically depend on three parameters: the number of data values to be sorted, the magnitude of the largest possible key to be sorted, and the number of bits that can be represented in a single machine word of the computer on which the algorithm is to be performed. Typically, it is assumed that ; that is, that machine words are large enough to represent an index into the sequence of input data and that they are large enough to represent a single key.

Integer sorting algorithms are usually designed to work in either the pointer machine
Pointer machine
In theoretical computer science a pointer machine is an "atomistic" abstract computational machine model akin to the Random access machine....

 or random access machine
Random access machine
In computer science, random access machine is an abstract machine in the general class of register machines. The RAM is very similar to the counter machine but with the added capability of 'indirect addressing' of its registers...

 models of computing; the main difference between these two models is that the random access machine allows any value that is stored in a register to be used as the address of memory read and write operations, with unit cost per operation, allowing certain complex operations on data to be implemented quickly using table lookups. In contrast, the pointer machine model allows memory access only via pointers to memory that may not be manipulated using arithmetic operations. In both models, addition, bitwise Boolean operations, and binary shift operations may typically also be accomplished in unit time per operation. Different integer sorting algorithms make different assumptions, however, about whether integer multiplication is also allowed as a unit-time operation. Other more specialized models of computation such as the parallel random access machine
Parallel Random Access Machine
In computer science, Parallel Random Access Machine is a shared memory abstract machine. As its name indicates, the PRAM was intended as the parallel computing analogy to the random access machine...

 have also been considered.

showed that in some cases the multiplications or table lookups required by some integer sorting algorithms could be replaced by customized operations that would be more easily implemented in hardware but that are not typically available on general-purpose computers; improved this by showing how to replace these operations by the bit-field manipulation instructions available on Pentium
Pentium
The original Pentium microprocessor was introduced on March 22, 1993. Its microarchitecture, deemed P5, was Intel's fifth-generation and first superscalar x86 microarchitecture. As a direct extension of the 80486 architecture, it included dual integer pipelines, a faster FPU, wider data bus,...

 processors.

Sorting versus integer priority queues

It is possible to base a selection sort
Selection sort
Selection sort is a sorting algorithm, specifically an in-place comparison sort. It has O time complexity, making it inefficient on large lists, and generally performs worse than the similar insertion sort...

 algorithm on any priority queue
Priority queue
A priority queue is an abstract data type in computer programming.It is exactly like a regular queue or stack data structure, but additionally, each element is associated with a "priority"....

 data structure
Data structure
In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks...

, that allows one to maintain sets of items with keys in the range from to , subject to operations that find and remove the item with the minimum key. Simply create a priority queue containing all of the items, and then repeatedly apply the delete-min operation until the queue is empty; the sequence in which items are deleted is the sorted sequence of the items. The time for this algorithm is the time for creating the queue, plus the time for delete-min operations. For instance, among comparison sort
Comparison sort
A comparison sort is a type of sorting algorithm that only reads the list elements through a single abstract comparison operation that determines which of two elements should occur first in the final sorted list...

 algorithms, heap sort has this form. However, there are also algorithms of this type that are faster for integer keys, based on priority queue data structures that are specialized to integers.

In particular, a van Emde Boas tree
Van Emde Boas tree
A van Emde Boas tree , also known as a vEB tree, is a tree data structure which implements an associative array with m-bit integer keys. It performs all operations in O time...

 may be used as a priority queue to sort a set of keys, each in the range from to , in time . This is a theoretical improvement over radix sorting when is sufficiently large. However, in order to use a van Emde Boas tree, one either needs a directly-addressable memory of words, or one needs to simulate it using a hash table
Hash table
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys , to their associated values . Thus, a hash table implements an associative array...

, reducing the space to linear but making the algorithm be randomized. Another priority queue with similar performance (including the need for randomization in the form of hash tables) is the Y-fast trie
Y-fast trie
In computer science, a y-fast trie is a data structure for storing integers from a bounded domain. It supports exact and predecessor or successor queries in time O, using O space, where n is the number of stored values and M is the maximum value in the domain...

 of .

showed that the equivalence between priority queues and sorting goes also in the other direction: if it is possible to perform integer sorting in time per key, then the same time bound applies to the time per insertion or deletion operation in a priority queue data structure. Thorup's reduction is complicated and assumes the availability of either fast multiplication operations or table lookups, but he also provides an alternative priority queue using only addition and Boolean operations with time per operation, at most multiplying the time by an iterated logarithm
Iterated logarithm
In computer science, the iterated logarithm of n, written  n , is the number of times the logarithm function must be iteratively applied before the result is less than or equal to 1...

.

Algorithms for small keys

It has long been known that bucket sort
Bucket sort
Bucket sort, or bin sort, is a sorting algorithm that works by partitioning an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sorting algorithm. It is a distribution sort, and is a cousin of...

 or counting sort
Counting sort
In computer science, counting sort is an algorithm for sorting a collection of objects according to keys that are small integers; that is, it is an integer sorting algorithm. It operates by counting the number of objects that have each distinct key value, and using arithmetic on those counts to...

 can sort a set of keys, each in the range from to , in time . In bucket sort
Bucket sort
Bucket sort, or bin sort, is a sorting algorithm that works by partitioning an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sorting algorithm. It is a distribution sort, and is a cousin of...

, pointers to the data items are distributed to a table of "buckets" (represented as collection
Collection (computing)
In computer science, a collection is a grouping of some variable number of data items that have some shared significance to the problem being solved and need to be operated upon together in some controlled fashion. Generally, the data items will be of the same type or, in languages supporting...

 data types such as linked list
Linked list
In computer science, a linked list is a data structure consisting of a group of nodes which together represent a sequence. Under the simplest form, each node is composed of a datum and a reference to the next node in the sequence; more complex variants add additional links...

s) using the keys as pointers into the table. Then, all of the buckets are concatenated together to form the output list. In counting sort, the buckets are replaced by counters that determine the number of items with each value; then, a prefix sum
Prefix sum
In computer science, the prefix sum, or scan, of a sequence of numbers is a second sequence of numbers , the sums of prefixes of the input sequence:-Parallel algorithm:A prefix sum can be calculated in parallel by the following steps....

 computation is used to determine the subarray, within an output array, where the items with each value should be placed.

Radix sort
Radix sort
In computer science, radix sort is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by the individual digits which share the same significant position and value...

 is a general technique in which some other sorting algorithm that is suited only for small keys is repeatedly applied, allowing the algorithm to be extended to larger keys. The key used for the th application of the other sorting algorithm is the th digit in the positional notation
Positional notation
Positional notation or place-value notation is a method of representing or encoding numbers. Positional notation is distinguished from other notations for its use of the same symbol for the different orders of magnitude...

 for the full key, according to some specified radix
Radix
In mathematical numeral systems, the base or radix for the simplest case is the number of unique digits, including zero, that a positional numeral system uses to represent numbers. For example, for the decimal system the radix is ten, because it uses the ten digits from 0 through 9.In any numeral...

, starting from the least significant digit and progressing to the most significant. For this algorithm to work, the base algorithm must be stable: items with equal keys should not change positions with each other. Using radix sort, with a radix chosen to be proportional to and with bucket sort or counting sort as the base algorithm, it is possible to sort a set of keys, each in the range from to , in time . Using a power of two
Power of two
In mathematics, a power of two means a number of the form 2n where n is an integer, i.e. the result of exponentiation with as base the number two and as exponent the integer n....

 as the radix allows the keys for each application of the base algorithm to be constructed using only fast binary shift and mask operations.

A more sophisticated technique with a similar flavor and with better theoretical performance was developed by . They observe that each pass of radix sort can be interpreted as a "range reduction" technique that, in linear time, reduces the maximum key size by a factor of ; instead, their technique reduces the key size to the square root of its previous value (halving the number of bits needed to represent a key), again in linear time. As in radix sort, they interpret the keys as two-digit base- numbers for a base that is approximately . They then group the items to be sorted into buckets according to their high digits, in linear time, using either a large but uninitialized direct addressed memory or a hash table. Each bucket has a representative, the item in the bucket with the largest key; they then sort the list of items using as keys the high digits for the representatives and the low digits for the non-representatives. By grouping the items from this list into buckets again, each bucket may be placed into sorted order, and by extracting the representatives from the sorted list the buckets may be concatenated together into sorted order. Thus, in linear time, the sorting problem is reduced to another recursive sorting problem in which the keys are much smaller, the square root of their previous magnitude. Repeating this range reduction until the keys are small enough to bucket sort leads to an algorithm with running time .

A complicated randomized algorithm of allows these time bounds to be reduced even farther, to .

Algorithms for large words

An integer sorting algorithm is said to be non-conservative if it requires a word size that is significantly larger than . As an extreme instance, if , and all keys are distinct, then the set of keys may be sorted by representing it as a bitvector, with a 1 bit in position when is one of the input keys, and then repeatedly removing the least significant bit.

The non-conservative "packed sorting" algorithm of uses a subroutine, based on Ken Batcher
Ken Batcher
Ken Batcher is an emeritus professor of Computer Science at Kent State University. He also worked as a computer architect at Goodyear Aerospace in Akron, Ohio for 28 years. In 1964, Batcher received his Ph.D. in electrical engineering from the University of Illinois...

's bitonic sorting network
Bitonic sorter
Bitonic mergesort is a parallel algorithm for sorting. It is also used as a construction method for building a sorting network. The algorithm was devised by Ken Batcher...

, for merging
Merge algorithm
Merge algorithms are a family of algorithms that run sequentially over multiple sorted lists, typically producing more sorted lists as output. This is well-suited for machines with tape drives...

 two sorted sequences of keys that are each short enough to be packed into a single machine word. The input to the packed sorting algorithm, a sequence of items stored one per word, is transformed into a packed form, a sequence of words each holding multiple items in sorted order, by using this subroutine repeatedly to double the number of items packed into each word. Once the sequence is in packed form, Albers and Hagerup use a form of merge sort
Merge sort
Merge sort is an O comparison-based sorting algorithm. Most implementations produce a stable sort, meaning that the implementation preserves the input order of equal elements in the sorted output. It is a divide and conquer algorithm...

 to sort it; when two sequences are being merged to form a single longer sequence, the same bitonic sorting subroutine can be used to repeatedly extract packed words consisting of the smallest remaining elements of the two sequences. This algorithm gains enough of a speedup from its packed representation to sort its input in linear time whenever it is possible for a single word to contain keys; that is, when for some constant .

Algorithms for few items

Bucket sort, counting sort, radix sort, and van Emde Boas tree sorting all work best when the key size is small; for large enough keys, they become slower than comparison sorting algorithms. However, when the key size or the word size is very large relative to the number of items (or equivalently when the number of items is small), it may again become possible to sort quickly, using different algorithms that take advantage of the parallelism inherent in the ability to perform arithmetic operations on large words.

An early result in this direction was provided by using the cell probe model of computation (an artificial model in which the complexity of an algorithm is measured only by the number of memory accesses it performs). Building on their work, described two data structures, the Q-heap and the atomic heap, that are implementable on a random access machine. The Q-heap is a bit-parallel version of a binary trie
Trie
In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the...

, and allows both priority queue operations and successor and predecessor queries to be performed in constant time for sets of items, where is the size of the precomputed tables needed to implement the data structure. The atomic heap is a B-tree
B-tree
In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children...

 in which each tree node is represented as a Q-heap; it allows constant time priority queue operations (and therefore sorting) for sets of items.

provide a randomized algorithm called signature sort that allows for linear time sorting of sets of up to items at a time, for any constant . As in the algorithm of Kirkpatrick and Reisch, they perform range reduction using a representation of the keys as numbers in base  for a careful choice of . Their range reduction algorithm replaces each digit by a "signature", a hashed value with bits such that different digit values have different signatures. If is sufficiently small, the numbers formed by this replacement process will be significantly smaller than the original keys, allowing the non-conservative packed sorting algorithm of to sort the replaced numbers in linear time. From the sorted list of replaced numbers, it is possible to form a compressed trie
Trie
In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the...

 of the keys in linear time, and the children of each node in the trie may be sorted recursively using only keys of size , after which a tree traversal produces the sorted order of the items.

Trans-dichotomous algorithms

introduced the transdichotomous model
Transdichotomous model
In computational complexity theory, and more specifically in the analysis of algorithms with integer data, the transdichotomous model is a variation of the random access machine in which the machine word size is assumed to match the problem size. The model was proposed by Michael Fredman and Dan E...

 of analysis for integer sorting algorithms, in which nothing is assumed about the range of the integer keys and one must bound the algorithm's performance by a function of the number of data values alone. Alternatively, in this model, the running time for an algorithm on a set of items is assumed to be the worst case
Worst Case
Worst Case is the 3rd book in the Michael Bennett series from James Patterson and Michael Ledwidge.-Plot summary:NYPD Detective Mike Bennett and his new partner FBI Special Agent Emily Parker are on the trail of Francis Mooney, a Manhattan trusts and estates lawyer with terminal lung cancer...

 running time for any possible combination of values of and . For instance, Fredman and Willard's fusion tree
Fusion tree
A fusion tree is a type of tree data structure in computer science. It implements an associative array with integer keys up to a fixed size; by exploiting the constant-time machine word multiplication operation available on many real processors, it is able to achieve all operations inO\lefttime ,...

sorting algorithm runs in time , an improvement over comparison sorting for any choice of and . An alternative version of their algorithm that includes the use of random numbers and integer division operations improves this to .

Since their work, even better algorithms have been developed. For instance, by repeatedly applying the Kirkpatrick–Reisch range reduction technique until the keys are small enough to apply the Albers–Hagerup packed sorting algorithm, it is possible to sort in time ; however, the range reduction part of this algorithm requires either a large memory (proportional to ) or randomization in the form of hash tables.

showed how to sort in randomized time . Their technique involves using ideas related to signature sorting to partition the data into many small sublists, of a size small enough that signature sorting can sort each of them efficiently. It is also possible to use similar ideas to sort integers deterministically in time and linear space. Using only simple arithmetic operations (no multiplications or table lookups) it is possible to sort in randomized expected time or deterministically in time for any constant .
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK