Spreadsort
Encyclopedia
Spreadsort is a sorting algorithm
invented by Steven J. Ross in 2002. It combines concepts from distribution-based sorts, such as radix sort
and bucket sort
, with partitioning concepts from comparison sorts such as quicksort and mergesort. In experimental results it was shown to be highly efficient, often outperforming traditional algorithms such as quicksort, particularly on distributions exhibiting structure.
Quicksort identifies a pivot element in the list and then partitions the list into two sublists, those elements less than the pivot and those greater than the pivot. Spreadsort generalizes this idea by partitioning the list into n/c partitions at each step, where n is the total number of elements in the list and c is a small constant (in practice usually between 4 and 8 when comparisons are slow, or much larger in situations where they are fast). It uses distribution-based techniques to accomplish this, first locating the minimum and maximum value in the list, and then dividing the region between them into n/c equal-sized bins.
Where caching is an issue, it can help to have a maximum number of bins in each recursive division step, causing this division process to take multiple steps. Though this causes more iterations, it reduces cache misses and can make the algorithm run faster overall.
In the case where the number of bins is at least the number of elements, spreadsort degenerates to bucket sort and the sort completes. Otherwise, each bin is sorted recursively. The algorithm uses heuristic tests to determine whether each bin would be more efficiently sorted by spreadsort or some other classical sort algorithm, then recursively sorts the bin.
Like other distribution-based sorts, spreadsort has the weakness that the programmer is required to provide a means of converting each element into a numeric key, for the purpose of identifying which bin it falls in. Although it is possible to do this for arbitrary-length elements such as strings by considering each element to be followed by an infinite number of minimum values, and indeed for any datatype possessing a total order
, this can be more difficult to implement correctly than a simple comparison function, especially on complex structures. Poor implementation of this value function can result in clustering that harms the algorithm's relative performance.
, and O(n2) if it uses quicksort. In the case of distributions where the size of the key in bits k times 2 is roughly the square of the log of the list size n or smaller (2k < log(n)2), it does better in the worst case, achieving O(n·(k - log(n)).5) worst-case time for the originally published version, and O(n·((k/s) + s)) for the cache aware version.
Experiments were done comparing an optimized version of spreadsort to the highly-optimized C++
In space performance, spreadsort is worse than most in-place algorithms: in its simplest form, it is not an in-place algorithm, using O(n) extra space; in experiments, about 20% more than quicksort using a c of 4-8. With a cache-aware form, less memory is used and there is an upper bound on memory usage of the maximum bin count times the maximum number of recursions, which ends up being a few kilobytes times the size of the key in bytes. Although it uses asymptotically more space than the O(log n) overhead of quicksort or the O(1) overhead of heapsort, it uses considerably less space than the basic form of mergesort, which uses auxiliary space equal to the space occupied by the list.
Spreadsort also works efficiently on problems too large to fit in memory and thus requiring disk access.
Other similar algorithms are Flashsort (which is simpler) and Adaptive Left Radix. Adaptive Left Radix is apparently quite similar, the main difference being recursive behavior, with Spreadsort checking for worst-case situations and using std::sort to avoid performance problems where necessary, and Adaptive Left Radix recursing continuously until done or the data is small enough to use insertionsort.
Sorting algorithm
In computer science, a sorting algorithm is an algorithm that puts elements of a list in a certain order. The most-used orders are numerical order and lexicographical order...
invented by Steven J. Ross in 2002. It combines concepts from distribution-based sorts, such as radix sort
Radix sort
In computer science, radix sort is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by the individual digits which share the same significant position and value...
and bucket sort
Bucket sort
Bucket sort, or bin sort, is a sorting algorithm that works by partitioning an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sorting algorithm. It is a distribution sort, and is a cousin of...
, with partitioning concepts from comparison sorts such as quicksort and mergesort. In experimental results it was shown to be highly efficient, often outperforming traditional algorithms such as quicksort, particularly on distributions exhibiting structure.
Quicksort identifies a pivot element in the list and then partitions the list into two sublists, those elements less than the pivot and those greater than the pivot. Spreadsort generalizes this idea by partitioning the list into n/c partitions at each step, where n is the total number of elements in the list and c is a small constant (in practice usually between 4 and 8 when comparisons are slow, or much larger in situations where they are fast). It uses distribution-based techniques to accomplish this, first locating the minimum and maximum value in the list, and then dividing the region between them into n/c equal-sized bins.
Where caching is an issue, it can help to have a maximum number of bins in each recursive division step, causing this division process to take multiple steps. Though this causes more iterations, it reduces cache misses and can make the algorithm run faster overall.
In the case where the number of bins is at least the number of elements, spreadsort degenerates to bucket sort and the sort completes. Otherwise, each bin is sorted recursively. The algorithm uses heuristic tests to determine whether each bin would be more efficiently sorted by spreadsort or some other classical sort algorithm, then recursively sorts the bin.
Like other distribution-based sorts, spreadsort has the weakness that the programmer is required to provide a means of converting each element into a numeric key, for the purpose of identifying which bin it falls in. Although it is possible to do this for arbitrary-length elements such as strings by considering each element to be followed by an infinite number of minimum values, and indeed for any datatype possessing a total order
Total order
In set theory, a total order, linear order, simple order, or ordering is a binary relation on some set X. The relation is transitive, antisymmetric, and total...
, this can be more difficult to implement correctly than a simple comparison function, especially on complex structures. Poor implementation of this value function can result in clustering that harms the algorithm's relative performance.
Performance
The worst-case performance of spreadsort ultimately depends on what sort it switches to on smaller bins — O(n log n) if it uses a worst-case O(n log n) sort such as mergesort or heapsortHeapsort
Heapsort is a comparison-based sorting algorithm to create a sorted array , and is part of the selection sort family. Although somewhat slower in practice on most machines than a well implemented quicksort, it has the advantage of a more favorable worst-case O runtime...
, and O(n2) if it uses quicksort. In the case of distributions where the size of the key in bits k times 2 is roughly the square of the log of the list size n or smaller (2k < log(n)2), it does better in the worst case, achieving O(n·(k - log(n)).5) worst-case time for the originally published version, and O(n·((k/s) + s)) for the cache aware version.
Experiments were done comparing an optimized version of spreadsort to the highly-optimized C++
std::sort
, implemented with introsort. On lists of integers and floats spreadsort shows a roughly 2-7X runtime improvement for random data on various operating systems.http://www.boostpro.com/vault/index.php?action=downloadfile&filename=algorithm_sorting.zip&directory=&In space performance, spreadsort is worse than most in-place algorithms: in its simplest form, it is not an in-place algorithm, using O(n) extra space; in experiments, about 20% more than quicksort using a c of 4-8. With a cache-aware form, less memory is used and there is an upper bound on memory usage of the maximum bin count times the maximum number of recursions, which ends up being a few kilobytes times the size of the key in bytes. Although it uses asymptotically more space than the O(log n) overhead of quicksort or the O(1) overhead of heapsort, it uses considerably less space than the basic form of mergesort, which uses auxiliary space equal to the space occupied by the list.
Spreadsort also works efficiently on problems too large to fit in memory and thus requiring disk access.
Implementation
Two Levels are as Good as Any
An interesting result for algorithms of this general type (splitting based on the radix, then comparison-based sorting) is that they are O(n) for any continuous integrable function. This result can be obtained by forcing Spreadsort to always iterate at least twice if the bin size after the first iteration is above some constant value. If data is known to come in based upon some continuous integrable function at all times, this modification of Spreadsort can attain some performance improvement over the basic algorithm, and will have better worst-case performance. If this restriction cannot usually be depended on, this change will add a little extra runtime overhead to the algorithm and gain little.Other similar algorithms are Flashsort (which is simpler) and Adaptive Left Radix. Adaptive Left Radix is apparently quite similar, the main difference being recursive behavior, with Spreadsort checking for worst-case situations and using std::sort to avoid performance problems where necessary, and Adaptive Left Radix recursing continuously until done or the data is small enough to use insertionsort.