Rolling hash
Encyclopedia
A rolling hash is a hash function
where the input is hashed in a window that moves through the input.
A few hash functions allow a rolling hash to be computed very quickly -- the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window -- similar to the way a moving average function can be computed much more quickly than other low-pass filters.
One of the main applications is the Rabin-Karp string search algorithm
, which uses the rolling hash described below.
Another popular application is rsync
program which uses a checksum based on Mark Adler's adler-32
as its rolling hash.
is normally used with a very simple rolling hash function that only uses multiplications and additions:
where is a constant and are the input characters.
In order to avoid manipulating huge values, all math is done modulo
. The choice of and is critical to get good hashing; see linear congruential generator
for more discussion.
Removing and adding chars simply involves adding or subtracting the first or last term. Shifting all chars by one position to the left requires multiplying the entire sum by . Shifting all chars by one position to the right requires dividing the entire sum by . Note that in modulo arithmetic, can be chosen to have a multiplicative inverse
by which can be multiplied to get the result of the division without actually performing a division.
instead. It presumes that there is some hash function from characters to integers in the interval . This hash function might be simply an array or a hash table
mapping characters to random integers. Let the function be a cyclic binary rotation (or barrel shift
): it rotates the bits by 1 to the left, pushing the latest bit in the first position. E.g., . Let be the bit-wise exclusive or. The hash values are defined as
where the multiplications by powers of two can be implemented by binary shifts. The result is a number in .
Computing the hash values in a rolling fashion is done as follows. Let be the previous hash value. Rotate once: . If is the character to be removed, rotate it times: . Then simply set
where is the new character.
Hash function
A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...
where the input is hashed in a window that moves through the input.
A few hash functions allow a rolling hash to be computed very quickly -- the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window -- similar to the way a moving average function can be computed much more quickly than other low-pass filters.
One of the main applications is the Rabin-Karp string search algorithm
Rabin-Karp string search algorithm
The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For text of length n and p patterns of combined length m, its average and best case running time is O in space O,...
, which uses the rolling hash described below.
Another popular application is rsync
Rsync
rsync is a software application and network protocol for Unix-like and Windows systems which synchronizes files and directories from one location to another while minimizing data transfer using delta encoding when appropriate. An important feature of rsync not found in most similar...
program which uses a checksum based on Mark Adler's adler-32
Adler-32
Adler-32 is a checksum algorithm which was invented by Mark Adler in 1995, and is a modification of the Fletcher checksum. Compared to a cyclic redundancy check of the same length, it trades reliability for speed. Adler-32 is more reliable than Fletcher-16, and slightly less reliable than Fletcher-32...
as its rolling hash.
Rabin-Karp rolling hash
The Rabin-Karp string search algorithmRabin-Karp string search algorithm
The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For text of length n and p patterns of combined length m, its average and best case running time is O in space O,...
is normally used with a very simple rolling hash function that only uses multiplications and additions:
where is a constant and are the input characters.
In order to avoid manipulating huge values, all math is done modulo
Modular arithmetic
In mathematics, modular arithmetic is a system of arithmetic for integers, where numbers "wrap around" after they reach a certain value—the modulus....
. The choice of and is critical to get good hashing; see linear congruential generator
Linear congruential generator
A Linear Congruential Generator represents one of the oldest and best-known pseudorandom number generator algorithms. The theory behind them is easy to understand, and they are easily implemented and fast....
for more discussion.
Removing and adding chars simply involves adding or subtracting the first or last term. Shifting all chars by one position to the left requires multiplying the entire sum by . Shifting all chars by one position to the right requires dividing the entire sum by . Note that in modulo arithmetic, can be chosen to have a multiplicative inverse
Multiplicative inverse
In mathematics, a multiplicative inverse or reciprocal for a number x, denoted by 1/x or x−1, is a number which when multiplied by x yields the multiplicative identity, 1. The multiplicative inverse of a fraction a/b is b/a. For the multiplicative inverse of a real number, divide 1 by the...
by which can be multiplied to get the result of the division without actually performing a division.
Cyclic polynomial
Hashing by cyclic polynomial—sometimes called Buzhash—is also simple, but it has the benefit of avoiding multiplications, using barrel shiftsBarrel shifter
A barrel shifter is a digital circuit that can shift a data word by a specified number of bits in one clock cycle. It can be implemented as a sequence of multiplexers , and in such an implementation the output of one mux is connected to the input of the next mux in a way that depends on the shift...
instead. It presumes that there is some hash function from characters to integers in the interval . This hash function might be simply an array or a hash table
Hash table
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys , to their associated values . Thus, a hash table implements an associative array...
mapping characters to random integers. Let the function be a cyclic binary rotation (or barrel shift
Barrel shifter
A barrel shifter is a digital circuit that can shift a data word by a specified number of bits in one clock cycle. It can be implemented as a sequence of multiplexers , and in such an implementation the output of one mux is connected to the input of the next mux in a way that depends on the shift...
): it rotates the bits by 1 to the left, pushing the latest bit in the first position. E.g., . Let be the bit-wise exclusive or. The hash values are defined as
where the multiplications by powers of two can be implemented by binary shifts. The result is a number in .
Computing the hash values in a rolling fashion is done as follows. Let be the previous hash value. Rotate once: . If is the character to be removed, rotate it times: . Then simply set
where is the new character.
Independence
At best, rolling hash values are pairwise independent. Similarly, at best the (randomized) Rabin-Karp rolling hash values are uniform. Hashing n-grams by cyclic polynomials achieves pairwise independence: simply keep the first bits.Computational complexity
All rolling hash functions are linear in the number of characters, but their complexity with respect to the length of the window () varies. Rabin-Karp rolling hash requires the multiplications of two -bit numbers, integer multiplication is in . Hashing n-grams by cyclic polynomials can be done in linear time .Software
- ngramhashing is a Free softwareFree softwareFree software, software libre or libre software is software that can be used, studied, and modified without restriction, and which can be copied and redistributed in modified or unmodified form either without restriction, or with restrictions that only ensure that further recipients can also do...
C++ implementation of several rolling hash functions - rollinghashjava is an Apache licensed Java implementation of rolling hash functions