Unlike Knuth-Moriss-Pratt linear time algorithm, Boyer-moore algorithm can search for a pattern in a sublinear time. Till KMP time, people worked hard to come up with an algorithm to search for a pattern in linear time, i.e., O(M+N) time where M is the total length of the pattern to search and N is the total length of the String searched from.
Boyer and Moore were curious to nail down an algorithm which can work more efficiently than KMP in sublinear time. By “Sublinear” Boyer-moore means, this can search a pattern from the string by inspecting minimum number of characters rather than inspecting every character in the string. They deviced an intellectual logic to skip inspecting some characters and still reached a successful search match when the pattern existed in the string.
Though Boyer-moore algorithm works efficiently in sublinear time, this has few limitations in which case it will end up working in linear time. Search efficiency increases when the number of characters from the pattern increases as it gets more characters to skip from inspecting. Efficiency falls back to linear time when it had to search only one character.
The idea of how it works:
The main idea behind the algorithm is it gains more important information by matching the pattern from right-left rather than usual left-right matching. Assume pattern has p0, p1, p2… pm characters and Searching string have s0, s1, s2… sn characters. Algorithm takes edge by comparing sm to pm, sm-1 to pm-1 and so on up to s0-p0.
Since pattern length is 7, search starts at the String location “F” and step left one by one. Searching is done by picking a character at the String pointed by the upward arrow and searches that character in the pattern.
Character “F” from string is matched with char “T” at the pattern and then with “A”, then with “H” and so on. Since char “F” is not matched with ANY of the character from the pattern, we can confirm that the whole pattern won’t be matched with the current position in the string. By current position, I mean from the char pointed by the first char of the pattern to the char pointed by last char of the pattern (that’s where the actual searching happens now). In this case, pattern is not matched from “W” to “F”.
The key point is that for match to succeed, the char “F” should exist atleast in one place in the pattern. Since it’s not appearing even in one place, we can confirm the pattern won’t match in the current place. So we can safely skip from inspecting all other left chars from “-“ to “W” in the string with the every char of the pattern.
Boyer-moore observation 1: If char is known not to occur in pattern, then we know we need not consider the possibility of an occurrence of pattern starting at string position 1, 2, 3… or patternlength: Such an occurrence would require that char to search be a character of pattern.
According to observation 1, we can safely skip first 7 chars from the string and first char of pattern is positioned at the 8th char of string.
“-“ is matched with every char in the pattern from right to left and we have match at the fifth location in the pattern. Importantly first matching with the pattern only considered during matching. When there is a match occurs pattern has to be realigned with the string so that the currently matched char is positioned at the same location. In this case, pattern is realigned to match those two hyphens. After realignment arrow too has to be repositioned down by 4 chars so as to point to the last char of the pattern.
Since our first search char “-“ matched at the fifth position in the pattern, not the first position (right most of pattern as the string), we know for sure that its worthless matching the rest of the chars in the string starting from “Y” to “I” left to right in the string. So moving pattern to right by 4 is safe.
Boyer-moore observation 2: if the last (right most) occurrence of char in pattern is delta1 characters from the right end of the pattern, then we know we can slide pattern down delta1 positions without checking for matches.
Current position after applying observation 2
First match succeed as “T” in string matches with “T” in the pattern.
Upward arrow moved to the left by one position to point to “L” and “L” is matched with “A” in the pattern but match fails.
Boyer-Moore observation 3a: if the mismatch occurs k characters from the start of the pattern and the mismatched character not in pattern, then we can advance atleast k characters.
‘L’ is not in ‘P’ and the mismatch occurred against p6, hence we can advance (atleast) 6 characters.
Current position is:
Boyer-moore observation 3b: Since we know that earlier we already matched some characters (1 in this case). If the matched characters don’t match the start of the pattern, then we can actually jump forward a little more, delta2 distance.
Current position is:
Current position is:
Now upward arrow points at the last char of the string T and when we match each character of string with the pattern, we get successful match between all the chars with the pattern. So we FOUND the pattern in the string.
Though Boyer-Moore algorithm performance is sublinear in best case, but it has O(MN) runtime in worst case when the length of characters to search is very small.
Boyer-Moore algorithm is really fast on larger alphabets and for small set of characters it’s recommended to use Knuth-Morris-Pratt algorithm.