Linear hash
Encyclopedia
Linear hashing is a dynamic hash table
algorithm invented by Witold Litwin (1980), and later popularized by Paul Larson
. Linear hashing allows for the expansion of the hash table one slot at a time.
The frequent single slot expansion can very effectively control the length of
the collision chain. The cost of hash table expansion is spread out across each
hash table insertion operation, as opposed to being incurred all at once. Linear hashing is therefore well suited for interactive applications.
controls the address calculation of linear hashing.
In linear hashing, the address calculation is always bounded by a size that
is a power of two
* N, where N is the chosen original number of buckets.
The number of buckets is given by N * 2 ^ Level e.g. Level 0, N; Level 1, 2N; Level 2, 4N.
address(level,key) = hash(key) mod (N * 2level)
The 'split' variable controls the read operation, and the expansion operation.
A read operation would use address(level,key) if address(level,key) is greater
than or equal to the 'split' variable. Otherwise, address(level+1,key) is used. This takes into account the fact that
buckets numbered less than split have been rehashed with address(level+1,key) after its contents split
between two new buckets (the first bucket writing over the contents of the old single bucket prior to the split).
A linear hashing table expansion operation would consist of rehashing the
entries at one slot location indicated by the 'split' variable to either of two target slot locations
of address(level+1,key). This intuitively is consistent with the assertion that
if y = x mod M and y'= x mod M * 2, then y'= y or y' = y + M.
The 'split' variable is incremented by 1 at the end of
the expansion operation. If the 'split' variable reaches N * 2level, then the 'level'
variable is incremented by 1, and the 'split' variable is reset to 0.
Thus the hash buckets are expanded round robin, and seem unrelated to where buckets overflow at the time of expansion.
Overflow buckets are used at the sites of bucket overflow (the normal bucket has a pointer to the overflow bucket),
but these are eventually reabsorbed when the round robin comes to the bucket with the overflow bucket,
and the contents of that bucket and the overflow bucket are redistributed by
the future hash function hash(key) mod (N * 2 level+1 ).
The degenerate case, which is unlikely with a randomized hash function,
is that enough entries are hashed to the same bucket so that there is enough entries
to overflow more than one overflow bucket (assuming overflow bucket size = normal bucket size),
before being absorbed when that bucket's turn to split comes in the round robin.
The point of the algorithm seems to be that overflow is preempted by gradually increasing the number of available buckets,
and overflow buckets are eventually reabsorbed during a later split, which must eventually happen because splitting occurs round robin.
There is some flexibility in choosing how often the expansion operations are performed.
One obvious choice is to perform the expansion operation each time no more slots are available at the target slot location. Another choice
is to control the expansion with a programmer defined load factor.
The hash table array for linear hashing is usually implemented with a dynamic array
algorithm.
algorithm used in linear hashing, and presented performance comparisons using a list of Icon benchmark applications.
Hash table
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys , to their associated values . Thus, a hash table implements an associative array...
algorithm invented by Witold Litwin (1980), and later popularized by Paul Larson
Paul Larson
Paul Larson is a computer scientist. He is most famous for inventing the linear hashing algorithm with Witold Litwin. Paul Larson is currently a senior researcher in the Database Group of Microsoft Research...
. Linear hashing allows for the expansion of the hash table one slot at a time.
The frequent single slot expansion can very effectively control the length of
the collision chain. The cost of hash table expansion is spread out across each
hash table insertion operation, as opposed to being incurred all at once. Linear hashing is therefore well suited for interactive applications.
Algorithm Details
A hash functionHash function
A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...
controls the address calculation of linear hashing.
In linear hashing, the address calculation is always bounded by a size that
is a power of two
Power of two
In mathematics, a power of two means a number of the form 2n where n is an integer, i.e. the result of exponentiation with as base the number two and as exponent the integer n....
* N, where N is the chosen original number of buckets.
The number of buckets is given by N * 2 ^ Level e.g. Level 0, N; Level 1, 2N; Level 2, 4N.
address(level,key) = hash(key) mod (N * 2level)
The 'split' variable controls the read operation, and the expansion operation.
A read operation would use address(level,key) if address(level,key) is greater
than or equal to the 'split' variable. Otherwise, address(level+1,key) is used. This takes into account the fact that
buckets numbered less than split have been rehashed with address(level+1,key) after its contents split
between two new buckets (the first bucket writing over the contents of the old single bucket prior to the split).
A linear hashing table expansion operation would consist of rehashing the
entries at one slot location indicated by the 'split' variable to either of two target slot locations
of address(level+1,key). This intuitively is consistent with the assertion that
if y = x mod M and y'= x mod M * 2, then y'= y or y' = y + M.
The 'split' variable is incremented by 1 at the end of
the expansion operation. If the 'split' variable reaches N * 2level, then the 'level'
variable is incremented by 1, and the 'split' variable is reset to 0.
Thus the hash buckets are expanded round robin, and seem unrelated to where buckets overflow at the time of expansion.
Overflow buckets are used at the sites of bucket overflow (the normal bucket has a pointer to the overflow bucket),
but these are eventually reabsorbed when the round robin comes to the bucket with the overflow bucket,
and the contents of that bucket and the overflow bucket are redistributed by
the future hash function hash(key) mod (N * 2 level+1 ).
The degenerate case, which is unlikely with a randomized hash function,
is that enough entries are hashed to the same bucket so that there is enough entries
to overflow more than one overflow bucket (assuming overflow bucket size = normal bucket size),
before being absorbed when that bucket's turn to split comes in the round robin.
The point of the algorithm seems to be that overflow is preempted by gradually increasing the number of available buckets,
and overflow buckets are eventually reabsorbed during a later split, which must eventually happen because splitting occurs round robin.
There is some flexibility in choosing how often the expansion operations are performed.
One obvious choice is to perform the expansion operation each time no more slots are available at the target slot location. Another choice
is to control the expansion with a programmer defined load factor.
The hash table array for linear hashing is usually implemented with a dynamic array
Dynamic array
In computer science, a dynamic array, growable array, resizable array, dynamic table, or array list is a random access, variable-size list data structure that allows elements to be added or removed...
algorithm.
Adoption in language systems
Griswold and Townsend discussed the adoption of linear hashing in the Icon language. They discussed the implementation alternatives of dynamic arrayDynamic array
In computer science, a dynamic array, growable array, resizable array, dynamic table, or array list is a random access, variable-size list data structure that allows elements to be added or removed...
algorithm used in linear hashing, and presented performance comparisons using a list of Icon benchmark applications.