Boyer–Moore–Horspool algorithm - AbsoluteAstronomy.com

Computer science

Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

, the Boyer–Moore–Horspool algorithm or Horspool's algorithm is an algorithm

Algorithm

In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

for finding substring

Substring

A subsequence, substring, prefix or suffix of a string is a subset of the symbols in a string, where the order of the elements is preserved...

s in strings

String (computer science)

In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

. It was published by Nigel Horspool

Nigel Horspool

R. Nigel Horspool is a professor of computer science at the University of Victoria. He invented the Boyer–Moore–Horspool algorithm, a fast string search algorithm adapted from the Boyer–Moore string search algorithm...

in 1980.

It is a simplification of the Boyer–Moore string search algorithm

Boyer–Moore string search algorithm

The Boyer–Moore string search algorithm is a particularly efficient string searching algorithm, and it has been the standard benchmark for the practical string search literature. It was developed by Bob Boyer and J Strother Moore in 1977...

which is related to the Knuth–Morris–Pratt algorithm

Knuth–Morris–Pratt algorithm

The Knuth–Morris–Pratt string searching algorithm searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing...

. The algorithm trades space for time in order to obtain an average-case complexity

Average-case complexity

Average-case complexity is a subfield of computational complexity theory that studies the complexity of algorithms on random inputs.The study of average-case complexity has applications in the theory of cryptography....

of O(N) on random text, although it has O(MN) in the worst case

Worst Case

Worst Case is the 3rd book in the Michael Bennett series from James Patterson and Michael Ledwidge.-Plot summary:NYPD Detective Mike Bennett and his new partner FBI Special Agent Emily Parker are on the trail of Francis Mooney, a Manhattan trusts and estates lawyer with terminal lung cancer...

. The length of the pattern is M and the length of the search string is N.

Example implementation

Here is an example implementation of the Boyer-Moore-Horspool algorithm, written in C

C (programming language)

C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

.

#include

include

/* The constant UCHAR_MAX is assumed to contain the maximum
* value of the input character type. Typically it's 255.
* size_t is an unsigned type for representing sizes.
* If your system doesn't have it, replace with
* unsigned int.
*/

/* Returns a pointer to the first occurrence of "needle"
* within "haystack", or NULL if not found. Works like
* memmem.
*/

/* Note: In this example needle is a Cstring. The ending
* 0x00 will be cut off, so you could call this example with
* boyermoore_horspool_memmem(haystack, hlen, "abc", sizeof("abc"))
*/
const unsigned char *
boyermoore_horspool_memmem(const unsigned char* haystack, ssize_t hlen,
const unsigned char* needle, ssize_t nlen)
{
size_t scan = 0;
size_t bad_char_skip[UCHAR_MAX + 1]; /* Officially called:
* bad character shift */

/* Sanity checks on the parameters */
if (nlen <= 0 || !haystack || !needle)
return NULL;

/* ---- Preprocess ---- */
/* Initialize the table to default value */
/* When a character is encountered that does not occur
* in the needle, we can safely skip ahead for the whole
* length of the needle.
*/
for (scan = 0; scan <= UCHAR_MAX; scan = scan + 1)
bad_char_skip[scan] = nlen;

/* C arrays have the first byte at [0], therefore:
* [nlen - 1] is the last byte of the array. */
size_t last = nlen - 1;

/* Then populate it with the analysis of the needle */
for (scan = 0; scan < last; scan = scan + 1)
bad_char_skip[needle[scan]] = last - scan;

/* ---- Do the matching ---- */

/* Search the haystack, while the needle can still be within it. */
while (hlen >= nlen)
{
/* scan from the end of the needle */
for (scan = last; haystack[scan]

needle[scan]; scan = scan - 1)
if (scan

0) /* If the first byte matches, we've found it. */
return haystack;

/* otherwise, we need to skip some bytes and start again.
Note that here we are getting the skip value based on the last byte
of needle, no matter where we didn't match. So if needle is: "abcd"
then we are skipping based on 'd' and that value will be 4, and
for "abcdd" we again skip on 'd' but the value will be only 1.
The alternative of pretending that the mismatched character was
the last character is slower in the normal case (Eg. finding
"abcd" in "...azcd..." gives 4 by using 'd' but only
4-2

2 using 'z'. */
hlen -= bad_char_skip[haystack[last]];
haystack += bad_char_skip[haystack[last]];
}

return NULL;
}

Performance
The algorithm performs best with long needle strings, when it consistently hits a non-matching character at or near the final byte of the current position in the haystack and the final byte of the needle does not occur elsewhere within the needle. For instance a 32 byte needle ending in "z" searching through a 255 byte haystack which does not have a 'z' byte in it would take up to 224 byte comparisons.

The best case is the same as for the Boyer–Moore string search algorithm

Boyer–Moore string search algorithm

in big O notation

Big O notation

In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...

, although the constant overhead of initialization and for each loop is less.

The worst case behavior happens when the bad character skip is consistently low (with the lower limit of 1 byte movement) and a large portion of the needle matches the haystack. The bad character skip is only low, on a partial match, when the final character of the needle also occurs elsewhere within the needle, with 1 byte movement happening when the same byte is in both of the last two positions.

The canonical degenerate case similar to the above "best" case is a needle of an 'a' byte followed by 31 'z' bytes in a haystack consisting of 255 'z' bytes. This will do 31 successful byte comparisons, a 1 byte comparison that fails and then move forward 1 byte. This process will repeat 223 more times (255 - 32), bringing the total byte comparisons to 7,168 (32 * 224).

The worst case is significantly higher than for the Boyer–Moore string search algorithm

Boyer–Moore string search algorithm

, although obviously this is hard to achieve in normal use cases. It is also worth noting that this worst case is also the worst case for the naive (but usual) memmem algorithm, although the implementation of that tends to be significantly optimized (and is more cache friendly).
External links

Description of the algorithm
StringSearch – high-performance pattern matching algorithms in Java – Implementations of many String-Matching-Algorithms in Java (BNDM, Boyer-Moore-Horspool, Boyer-Moore-Horspool-Raita, Shift-Or)
Boyer-Moore-Horspool implementation for Delphi - Non-visual components (TSearch and TFileSearch) which enable very fast data searches using the Boyer-Moore-Horspool search algorithm.
C++ implementation

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.

Example implementation

needle[scan]; scan = scan - 1) if (scan

2 using 'z'. */ hlen -= bad_char_skip[haystack[last]]; haystack += bad_char_skip[haystack[last]]; } return NULL; }

needle[scan]; scan = scan - 1)
if (scan

2 using 'z'. */
hlen -= bad_char_skip[haystack[last]];
haystack += bad_char_skip[haystack[last]];
}

return NULL;
}