Boyer-Moore 字符串匹配算法 - gaochundong－IT工程師數位筆記本

文章出處

字符串匹配問題的形式定義：

文本（Text）是一個長度為 n 的數組 T[1..n]；
模式（Pattern）是一個長度為 m 且 m≤n 的數組 P[1..m]；
T 和 P 中的元素都屬于有限的字母表 Σ 表；
如果 0≤s≤n-m，并且 T[s+1..s+m] = P[1..m]，即對 1≤j≤m，有 T[s+j] = P[j]，則說模式 P 在文本 T 中出現且位移為 s，且稱 s 是一個有效位移（Valid Shift）。

比如上圖中，目標是找出所有在文本 T = abcabaabcabac 中模式 P = abaa 的所有出現。該模式在此文本中僅出現一次，即在位移 s = 3 處，位移 s = 3 是有效位移。

解決字符串匹配的算法包括：樸素算法（Naive Algorithm）、Rabin-Karp 算法、有限自動機算法（Finite Automation）、 Knuth-Morris-Pratt 算法（即 KMP Algorithm）、Boyer-Moore 算法、Simon 算法、Colussi 算法、Galil-Giancarlo 算法、Apostolico-Crochemore 算法、Horspool 算法、Shift-Or 算法和 Sunday 算法等。本文中我們將主要介紹 Boyer-Moore 算法（即 BM Algorithm）。

在 1977 年，Robert S. Boyer (Stanford Research Institute) 和 J Strother Moore (Xerox Palo Alto Research Center) 共同發表了文章《A Fast String Searching Algorithm》，介紹了一種新的快速字符串匹配算法。這種算法在邏輯上相對于現有的算法有了顯著的改進，它對要搜索的字符串進行倒序的字符比較，并且當字符比較不匹配時無需對整個模式串再進行搜索。

Boyer-Moore 算法的主要特點有：

對模式字符的比較順序時從右向左；
預處理需要 O(m + σ) 的時間和空間復雜度；
匹配階段需要 O(m × n) 的時間復雜度；
匹配階段在最壞情況下需要 3n 次字符比較；
最優復雜度 O(n/m)；

下面是幾種常見的字符串匹配算法的性能比較。

在 Naive 算法中，對文本 T 和模式 P 字符串均未做預處理。而在 KMP 算法中則對模式 P 字符串進行了預處理操作，以預先計算模式串中各位置的最長相同前后綴長度的數組。Boyer–Moore 算法同樣也是對模式 P 字符串進行預處理。

我們知道，在 Naive 算法中，如果發現模式 P 中的字符與文本 T 中的字符不匹配時，需要將文本 T 的比較位置向后滑動一位，模式 P 的比較位置歸 0 并從頭開始比較。而 KMP 算法則是根據預處理的結果進行判斷以使模式 P 的比較位置可以向后滑動多個位置。Boyer–Moore 算法的預處理過程也是為了達到相同效果。

Boyer–Moore 算法在對模式 P 字符串進行預處理時，將采用兩種不同的啟發式方法。這兩種啟發式的預處理方法稱為：

壞字符（Bad Character Heuristic）：當文本 T 中的某個字符跟模式 P 的某個字符不匹配時，我們稱文本 T 中的這個失配字符為壞字符。
好后綴（Good Suffix Heuristic）：當文本 T 中的某個字符跟模式 P 的某個字符不匹配時，我們稱文本 T 中的已經匹配的字符串為好后綴。

Boyer–Moore 算法在預處理時，將為兩種不同的啟發法結果創建不同的數組，分別稱為 Bad-Character-Shift（or The Occurrence Shift）和 Good-Suffix-Shift（or Matching Shift）。當進行字符匹配時，如果發現模式 P 中的字符與文本 T 中的字符不匹配時，將比較兩種不同啟發法所建議的移動位移長度，選擇最大的一個值來對模式 P 的比較位置進行滑動。

此外，Naive 算法和 KMP 算法對模式 P 的比較方向是從前向后比較，而 Boyer–Moore 算法的設計則是從后向前比較，即從尾部向頭部方向進行比較。

下面，我們將以 J Strother Moore 提供的例子作為示例。

   Text T : HERE IS A SIMPLE EXAMPLE
Pattern P : EXAMPLE

首先將文本 T 與模式 P 頭部對齊，并從尾部開始進行比較。（注1）

這樣如果尾部的字符不匹配，則前面的字符也就無需比較了，直接跳過。我們看到，"S" 與 "E" 不匹配，我們稱文本 T 中的失配字符 "S" 為壞字符（Bad Character）。

由于字符 "S" 在模式 "EXAMPLE" 中不存在，則可將搜索位置滑動到 "S" 的后面。

仍然從尾部開始比較，發現 "P" 與 "E" 不匹配，所以 "P" 是壞字符。但此時，"P" 包含在模式 "EXAMPLE" 之中。所以，將模式后移兩位，使兩個 "P" 對齊。

由此總結壞字符啟發法的規則是：

模式后移位數 = 壞字符在模式中失配的位置 - 壞字符在模式中最后一次出現的位置

壞字符啟發法規則中的特殊情況：

如果壞字符不存在于模式中，則最后一次出現的位置為 -1。
如果壞字符在模式中的位置位于失配位置的右側，則此啟發法不提供任何建議。

The bad character shift means to shift the pattern so that the text character of the mismatch is aligned to the last occurrence of that character in the initial part of the pattern (pattern minus last pattern character), if there is such an occurrence, or one position before the pattern if the mismatched character doesn't appear in the initial part of the pattern at all.

以上面示例中壞字符 "P" 為例，它的失配位置為 "E" 的位置 6 （位置從 0 開始），在模式中最后一次出現的位置是 4，則模式后移位數為 6 - 4 = 2 位。模式移動的結果就是使模式中最后出現的 "P" 與文本中的壞字符 "P" 進行對齊。

實際上，前面的壞字符 "S" 出現時，其失配位置為 6，最后一次出現位置為 -1，所以模式后移位數為 6 - (-1) = 7 位。也就是將模式整體移過壞字符。

我們繼續上面的過程，仍然從尾部開始比較。

仍然從尾部開始比較，發現 "E" 與 "E" 匹配，則繼續倒序比較。

發現 "L" 與 "L" 匹配，則繼續倒序比較。

發現 "P" 與 "P" 匹配，則繼續倒序比較。

發現 "M" 與 "M" 匹配，則繼續倒序比較。

發現 "I" 與 "A" 不匹配，則 "I" 是壞字符。對于前面已經匹配的字符串 "MPLE"、"PLE"、"LE"、"E"，我們稱它們為好后綴（Good Suffix）。

The good suffix shift, aligns the matched part of the text with the rightmost occurrence of that character sequence in the pattern that is preceded by a different character (including none, if the matched suffix is also a prefix of the pattern) than the matched suffix of the pattern - if there is such an occurrence.

而好后綴啟發法的規則是：

模式后移位數 = 好后綴在模式中的當前位置 - 好后綴在模式中最右出現且前綴字符不同的位置

好后綴在模式中的當前位置以其最后一個字符為準。如果好后綴不存在于模式中，則最右出現的位置為 -1。

這樣，我們先來找出好后綴在模式中上一次出現的位置。

"MPLE" : 未出現，最右出現的位置為 -1；
"PLE" : 未出現在頭部，最右出現的位置為 -1；
"LE" : 未出現在頭部，最右出現的位置為 -1；
"E" : 出現在頭部，補充虛擬字符 'MPL'E，前綴字符為空，最右出現的位置為 0；

由于只有 "E" 在模式中出現，其當前位置為 6，上一次出現的位置為 0，則依據好后綴啟發法規則，模式后移位數為 6 - 0 = 6 位。

而如果依據壞字符啟發法規則，模式后移位數為 2 - (-1) = 3 位。

Boyer–Moore 算法的特點就在于此，選擇上述兩種啟發法規則計算結果中最大的一個值來對模式 P 的比較位置進行滑動。這里將選擇好后綴啟發法的計算結果，即將模式向后移 6 位。

此時，仍然從尾部開始比較。

發現 "P" 與 "E" 不匹配，則 "P" 是壞字符，則模式后移位數為 6 - 4 = 2 位。

發現 "E" 與 "E" 匹配，則繼續倒序比較，直到發現全部匹配，則匹配到的第一個完整的模式 P 被發現。

繼續下去則是依據好后綴啟示法規則計算好后綴 "E" 的后移位置為 6 - 0 = 6 位，然后繼續倒序比較時發現已超出文本 T 的范圍，搜索結束。

從上面的示例描述可以看出，Boyer–Moore 算法的精妙之處在于，其通過兩種啟示規則來計算后移位數，且其計算過程只與模式 P 有關，而與文本 T 無關。因此，在對模式 P 進行預處理時，可預先生成 "壞字符規則之向后位移表" 和 "好后綴規則之向后位移表"，在具體匹配時僅需查表比較兩者中最大的位移即可。

下面是 Boyer–Moore 算法的示例代碼，使用 C# 語言實現。

  1 namespace StringMatching
  2 {
  3   class Program
  4   {
  5     static void Main(string[] args)
  6     {
  7       char[] text1 = "BBC ABCDAB ABCDABCDABDE".ToCharArray();
  8       char[] pattern1 = "ABCDABD".ToCharArray();
  9 
 10       int firstShift1;
 11       bool isMatched1 = BoyerMooreStringMatcher.TryMatch(
 12         text1, pattern1, out firstShift1);
 13       Contract.Assert(isMatched1);
 14       Contract.Assert(firstShift1 == 15);
 15 
 16       char[] text2 = "ABABDAAAACAAAABCABAB".ToCharArray();
 17       char[] pattern2 = "AAACAAAA".ToCharArray();
 18 
 19       int firstShift2;
 20       bool isMatched2 = BoyerMooreStringMatcher.TryMatch(
 21         text2, pattern2, out firstShift2);
 22       Contract.Assert(isMatched2);
 23       Contract.Assert(firstShift2 == 6);
 24 
 25       char[] text3 = "ABAAACAAAAAACAAAABCABAAAACAAAAFDLAAACAAAAAACAAAA"
 26         .ToCharArray();
 27       char[] pattern3 = "AAACAAAA".ToCharArray();
 28 
 29       int[] shiftIndexes3 = BoyerMooreStringMatcher.MatchAll(text3, pattern3);
 30       Contract.Assert(shiftIndexes3.Length == 5);
 31       Contract.Assert(string.Join(",", shiftIndexes3) == "2,9,22,33,40");
 32 
 33       char[] text4 = "GCATCGCAGAGAGTATACAGTACG".ToCharArray();
 34       char[] pattern4 = "GCAGAGAG".ToCharArray();
 35 
 36       int firstShift4;
 37       bool isMatched4 = BoyerMooreStringMatcher.TryMatch(
 38         text4, pattern4, out firstShift4);
 39       Contract.Assert(isMatched4);
 40       Contract.Assert(firstShift4 == 5);
 41 
 42       char[] text5 = "HERE IS A SIMPLE EXAMPLE AND EXAMPLE OF BM.".ToCharArray();
 43       char[] pattern5 = "EXAMPLE".ToCharArray();
 44 
 45       int firstShift5;
 46       bool isMatched5 = BoyerMooreStringMatcher.TryMatch(
 47         text5, pattern5, out firstShift5);
 48       Contract.Assert(isMatched5);
 49       Contract.Assert(firstShift5 == 17);
 50       int[] shiftIndexes5 = BoyerMooreStringMatcher.MatchAll(text5, pattern5);
 51       Contract.Assert(shiftIndexes5.Length == 2);
 52       Contract.Assert(string.Join(",", shiftIndexes5) == "17,29");
 53 
 54       Console.WriteLine("Well done!");
 55       Console.ReadKey();
 56     }
 57   }
 58 
 59   public class BoyerMooreStringMatcher
 60   {
 61     private static int AlphabetSize = 256;
 62 
 63     private static int Max(int a, int b) { return (a > b) ? a : b; }
 64 
 65     static int[] PreprocessToBuildBadCharactorHeuristic(char[] pattern)
 66     {
 67       int m = pattern.Length;
 68       int[] badCharactorShifts = new int[AlphabetSize];
 69 
 70       for (int i = 0; i < AlphabetSize; i++)
 71       {
 72         //badCharactorShifts[i] = -1;
 73         badCharactorShifts[i] = m;
 74       }
 75 
 76       // fill the actual value of last occurrence of a character
 77       for (int i = 0; i < m; i++)
 78       {
 79         //badCharactorShifts[(int)pattern[i]] = i;
 80         badCharactorShifts[(int)pattern[i]] = m - 1 - i;
 81       }
 82 
 83       return badCharactorShifts;
 84     }
 85 
 86     static int[] PreprocessToBuildGoodSuffixHeuristic(char[] pattern)
 87     {
 88       int m = pattern.Length;
 89       int[] goodSuffixShifts = new int[m];
 90       int[] suffixLengthArray = GetSuffixLengthArray(pattern);
 91 
 92       for (int i = 0; i < m; ++i)
 93       {
 94         goodSuffixShifts[i] = m;
 95       }
 96 
 97       int j = 0;
 98       for (int i = m - 1; i >= -1; --i)
 99       {
100         if (i == -1 || suffixLengthArray[i] == i + 1)
101         {
102           for (; j < m - 1 - i; ++j)
103           {
104             if (goodSuffixShifts[j] == m)
105             {
106               goodSuffixShifts[j] = m - 1 - i;
107             }
108           }
109         }
110       }
111 
112       for (int i = 0; i < m - 1; ++i)
113       {
114         goodSuffixShifts[m - 1 - suffixLengthArray[i]] = m - 1 - i;
115       }
116 
117       return goodSuffixShifts;
118     }
119 
120     static int[] GetSuffixLengthArray(char[] pattern)
121     {
122       int m = pattern.Length;
123       int[] suffixLengthArray = new int[m];
124 
125       int f = 0, g = 0, i = 0;
126 
127       suffixLengthArray[m - 1] = m;
128 
129       g = m - 1;
130       for (i = m - 2; i >= 0; --i)
131       {
132         if (i > g && suffixLengthArray[i + m - 1 - f] < i - g)
133         {
134           suffixLengthArray[i] = suffixLengthArray[i + m - 1 - f];
135         }
136         else
137         {
138           if (i < g)
139           {
140             g = i;
141           }
142           f = i;
143 
144           // find different preceded character suffix
145           while (g >= 0 && pattern[g] == pattern[g + m - 1 - f])
146           {
147             --g;
148           }
149           suffixLengthArray[i] = f - g;
150         }
151       }
152 
153       return suffixLengthArray;
154     }
155 
156     public static bool TryMatch(char[] text, char[] pattern, out int firstShift)
157     {
158       firstShift = -1;
159       int n = text.Length;
160       int m = pattern.Length;
161       int s = 0; // s is shift of the pattern with respect to text
162       int j = 0;
163 
164       // fill the bad character and good suffix array by preprocessing
165       int[] badCharShifts = PreprocessToBuildBadCharactorHeuristic(pattern);
166       int[] goodSuffixShifts = PreprocessToBuildGoodSuffixHeuristic(pattern);
167 
168       while (s <= (n - m))
169       {
170         // starts matching from the last character of the pattern
171         j = m - 1;
172 
173         // keep reducing index j of pattern while characters of
174         // pattern and text are matching at this shift s
175         while (j >= 0 && pattern[j] == text[s + j])
176         {
177           j--;
178         }
179 
180         // if the pattern is present at current shift, then index j
181         // will become -1 after the above loop
182         if (j < 0)
183         {
184           firstShift = s;
185           return true;
186         }
187         else
188         {
189           // shift the pattern so that the bad character in text
190           // aligns with the last occurrence of it in pattern. the
191           // max function is used to make sure that we get a positive
192           // shift. We may get a negative shift if the last occurrence
193           // of bad character in pattern is on the right side of the
194           // current character.
195           //s += Max(1, j - badCharShifts[(int)text[s + j]]);
196           // now, compare bad char shift and good suffix shift to find best
197           s += Max(goodSuffixShifts[j], badCharShifts[(int)text[s + j]] - (m - 1) + j);
198         }
199       }
200 
201       return false;
202     }
203 
204     public static int[] MatchAll(char[] text, char[] pattern)
205     {
206       int n = text.Length;
207       int m = pattern.Length;
208       int s = 0; // s is shift of the pattern with respect to text
209       int j = 0;
210       int[] shiftIndexes = new int[n - m + 1];
211       int c = 0;
212 
213       // fill the bad character and good suffix array by preprocessing
214       int[] badCharShifts = PreprocessToBuildBadCharactorHeuristic(pattern);
215       int[] goodSuffixShifts = PreprocessToBuildGoodSuffixHeuristic(pattern);
216 
217       while (s <= (n - m))
218       {
219         // starts matching from the last character of the pattern
220         j = m - 1;
221 
222         // keep reducing index j of pattern while characters of
223         // pattern and text are matching at this shift s
224         while (j >= 0 && pattern[j] == text[s + j])
225         {
226           j--;
227         }
228 
229         // if the pattern is present at current shift, then index j
230         // will become -1 after the above loop
231         if (j < 0)
232         {
233           shiftIndexes[c] = s;
234           c++;
235 
236           // shift the pattern so that the next character in text
237           // aligns with the last occurrence of it in pattern.
238           // the condition s+m < n is necessary for the case when
239           // pattern occurs at the end of text
240           //s += (s + m < n) ? m - badCharShifts[(int)text[s + m]] : 1;
241           s += goodSuffixShifts[0];
242         }
243         else
244         {
245           // shift the pattern so that the bad character in text
246           // aligns with the last occurrence of it in pattern. the
247           // max function is used to make sure that we get a positive
248           // shift. We may get a negative shift if the last occurrence
249           // of bad character in pattern is on the right side of the
250           // current character.
251           //s += Max(1, j - badCharShifts[(int)text[s + j]]);
252           // now, compare bad char shift and good suffix shift to find best
253           s += Max(goodSuffixShifts[j], badCharShifts[(int)text[s + j]] - (m - 1) + j);
254         }
255       }
256 
257       int[] shifts = new int[c];
258       for (int y = 0; y < c; y++)
259       {
260         shifts[y] = shiftIndexes[y];
261       }
262 
263       return shifts;
264     }
265   }
266 }