字符串匹配算法 - gaochundong－IT工程師數位筆記本

文章出處

字符串匹配問題的形式定義：

文本（Text）是一個長度為 n 的數組 T[1..n]；
模式（Pattern）是一個長度為 m 且 m≤n 的數組 P[1..m]；
T 和 P 中的元素都屬于有限的字母表 Σ 表；
如果 0≤s≤n-m，并且 T[s+1..s+m] = P[1..m]，即對 1≤j≤m，有 T[s+j] = P[j]，則說模式 P 在文本 T 中出現且位移為 s，且稱 s 是一個有效位移（Valid Shift）。

比如上圖中，目標是找出所有在文本 T = abcabaabcabac 中模式 P = abaa 的所有出現。該模式在此文本中僅出現一次，即在位移 s = 3 處，位移 s = 3 是有效位移。

解決字符串匹配的算法包括：樸素算法（Naive Algorithm）、Rabin-Karp 算法、有限自動機算法（Finite Automation）、 Knuth-Morris-Pratt 算法（即 KMP Algorithm）、Boyer-Moore 算法、Simon 算法、Colussi 算法、Galil-Giancarlo 算法、Apostolico-Crochemore 算法、Horspool 算法和 Sunday 算法等。

樸素的字符串匹配算法（Naive String Matching Algorithm）
Knuth-Morris-Pratt 字符串匹配算法（即 KMP 算法）
Boyer-Moore 字符串匹配算法

字符串匹配算法通常分為兩個步驟：預處理（Preprocessing）和匹配（Matching）。所以算法的總運行時間為預處理和匹配的時間的總和。

上圖描述了常見字符串匹配算法的預處理和匹配時間。

樸素的字符串匹配算法（Naive String Matching Algorithm）

樸素的字符串匹配算法又稱為暴力匹配算法（Brute Force Algorithm），它的主要特點是：

沒有預處理階段；
滑動窗口總是后移 1 位；
對模式中的字符的比較順序不限定，可以從前到后，也可以從后到前；
匹配階段需要 O((n - m + 1)m) 的時間復雜度；
需要 2n 次的字符比較；

很顯然，樸素的字符串匹配算法 NAIVE-STRING-MATCHER 是最原始的算法，它通過使用循環來檢查是否在范圍 n-m+1 中存在滿足條件 P[1..m] = T [s + 1..s + m] 的有效位移 s。

1 NAIVE-STRING-MATCHER(T, P)
2  n ← length[T]
3  m ← length[P]
4  for s ← 0 to n - m
5    do if P[1 .. m] = T[s + 1 .. s + m]
6      then print "Pattern occurs with shift" s

如上圖中，對于模式 P = aab 和文本 T = acaabc，將模式 P 沿著 T 從左到右滑動，逐個比較字符以判斷模式 P 在文本 T 中是否存在。

可以看出，NAIVE-STRING-MATCHER 沒有對模式 P 進行預處理，所以預處理的時間為 0。而匹配的時間在最壞情況下為 Θ((n-m+1)m)，如果 m = [n/2]，則為 Θ(n²)。

下面是 NAIVE-STRING-MATCHER 的代碼示例。

  1 namespace StringMatching
  2 {
  3   class Program
  4   {
  5     static void Main(string[] args)
  6     {
  7       char[] text1 = "BBC ABCDAB ABCDABCDABDE".ToCharArray();
  8       char[] pattern1 = "ABCDABD".ToCharArray();
  9 
 10       int firstShift1;
 11       bool isMatched1 = NaiveStringMatcher.TryMatch1(text1, pattern1, out firstShift1);
 12       Contract.Assert(isMatched1);
 13       Contract.Assert(firstShift1 == 15);
 14 
 15       char[] text2 = "ABABDAAAACAAAABCABAB".ToCharArray();
 16       char[] pattern2 = "AAACAAAA".ToCharArray();
 17 
 18       int firstShift2;
 19       bool isMatched2 = NaiveStringMatcher.TryMatch2(text2, pattern2, out firstShift2);
 20       Contract.Assert(isMatched2);
 21       Contract.Assert(firstShift2 == 6);
 22 
 23       char[] text3 = "ABAAACAAAAAACAAAABCABAAAACAAAAFDLAAACAAAAAACAAAA".ToCharArray();
 24       char[] pattern3 = "AAACAAAA".ToCharArray();
 25 
 26       int[] shiftIndexes = NaiveStringMatcher.MatchAll(text3, pattern3);
 27       Contract.Assert(shiftIndexes.Length == 5);
 28       Contract.Assert(string.Join(",", shiftIndexes) == "2,9,22,33,40");
 29 
 30       Console.WriteLine("Well done!");
 31       Console.ReadKey();
 32     }
 33   }
 34 
 35   public class NaiveStringMatcher
 36   {
 37     public static bool TryMatch1(char[] text, char[] pattern, out int firstShift)
 38     {
 39       firstShift = -1;
 40       int n = text.Length;
 41       int m = pattern.Length;
 42       int s = 0, j = 0;
 43 
 44       // for..for..
 45       for (s = 0; s < n - m; s++)
 46       {
 47         for (j = 0; j < m; j++)
 48         {
 49           if (text[s + j] != pattern[j])
 50           {
 51             break;
 52           }
 53         }
 54         if (j == m)
 55         {
 56           firstShift = s;
 57           return true;
 58         }
 59       }
 60 
 61       return false;
 62     }
 63 
 64     public static bool TryMatch2(char[] text, char[] pattern, out int firstShift)
 65     {
 66       firstShift = -1;
 67       int n = text.Length;
 68       int m = pattern.Length;
 69       int s = 0, j = 0;
 70 
 71       // while..
 72       while (s < n && j < m)
 73       {
 74         if (text[s] == pattern[j])
 75         {
 76           s++;
 77           j++;
 78         }
 79         else
 80         {
 81           s = s - j + 1;
 82           j = 0;
 83         }
 84 
 85         if (j == m)
 86         {
 87           firstShift = s - j;
 88           return true;
 89         }
 90       }
 91 
 92       return false;
 93     }
 94 
 95     public static int[] MatchAll(char[] text, char[] pattern)
 96     {
 97       int n = text.Length;
 98       int m = pattern.Length;
 99       int s = 0, j = 0;
100       int[] shiftIndexes = new int[n - m + 1];
101       int c = 0;
102 
103       // while..
104       while (s < n && j < m)
105       {
106         if (text[s] == pattern[j])
107         {
108           s++;
109           j++;
110         }
111         else
112         {
113           s = s - j + 1;
114           j = 0;
115         }
116 
117         if (j == m)
118         {
119           shiftIndexes[c] = s - j;
120           c++;
121 
122           s = s - j + 1;
123           j = 0;
124         }
125       }
126 
127       int[] shifts = new int[c];
128       for (int y = 0; y < c; y++)
129       {
130         shifts[y] = shiftIndexes[y];
131       }
132 
133       return shifts;
134     }
135   }
136 }

上面代碼中 TryMatch1 和 TryMatch2 分別使用 for 和 while 循環達到相同效果。

Knuth-Morris-Pratt 字符串匹配算法（即 KMP 算法）

我們來觀察一下樸素的字符串匹配算法的操作過程。如下圖（a）中所描述，在模式 P = ababaca 和文本 T 的匹配過程中，模板的一個特定位移 s，q = 5 個字符已經匹配成功，但模式 P 的第 6 個字符不能與相應的文本字符匹配。

此時，q 個字符已經匹配成功的信息確定了相應的文本字符，而知道這 q 個文本字符，就使我們能夠立即確定某些位移是非法的。例如上圖（a）中，我們可以判斷位移 s+1 是非法的，因為模式 P 的第一個字符 a 將與模式的第二個字符 b 匹配的文本字符進行匹配，顯然是不匹配的。而圖（b）中則顯示了位移 s’ = s+2 處，使模式 P 的前三個字符和相應的三個文本字符對齊后必定會匹配。KMP 算法的基本思路就是設法利用這些已知信息，不要把 "搜索位置" 移回已經比較過的位置，而是繼續把它向后面移，這樣就提高了匹配效率。

The basic idea behind KMP’s algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text (since they matched the pattern characters prior to the mismatch). We take advantage of this information to avoid matching the characters that we know will anyway match.

已知模式 P[1..q] 與文本 T[s+1..s+q] 匹配，那么滿足 P[1..k] = T[s’+1..s’+k] 其中 s’+k = s+q 的最小位移 s’ > s 是多少？這樣的位移 s’ 是大于 s 的但未必非法的第一個位移，因為已知 T[s+1..s+q] 。在最好的情況下有 s’ = s+q，因此立刻能排除掉位移 s+1, s+2 .. s+q-1。在任何情況下，對于新的位移 s’，無需把 P 的前 k 個字符與 T 中相應的字符進行比較，因為它們肯定匹配。

可以用模式 P 與其自身進行比較，以預先計算出這些必要的信息。例如上圖（c）中所示，由于 T[s’+1..s’+k] 是文本中已經知道的部分，所以它是字符串 P_q 的一個后綴。

此處我們引入模式的前綴函數 π（Pai），π 包含有模式與其自身的位移進行匹配的信息。這些信息可用于避免在樸素的字符串匹配算法中，對無用位移進行測試。

π[q] = max {k : k < q and P_k ⊐ P_q}

π[q] 代表當前字符之前的字符串中，最長的共同前綴后綴的長度。（π[q] is the length of the longest prefix of P that is a proper suffix of P_q.）

下圖給出了關于模式 P = ababababca 的完整前綴函數 π，可稱為部分匹配表（Partial Match Table）。

計算過程：

π[1] = 0，a 僅一個字符，前綴和后綴為空集，共有元素最大長度為 0；
π[2] = 0，ab 的前綴 a，后綴 b，不匹配，共有元素最大長度為 0；
π[3] = 1，aba，前綴 a ab，后綴 ba a，共有元素最大長度為 1；
π[4] = 2，abab，前綴 a ab aba，后綴 bab ab b，共有元素最大長度為 2；
π[5] = 3，ababa，前綴 a ab aba abab，后綴 baba aba ba a，共有元素最大長度為 3；
π[6] = 4，ababab，前綴 a ab aba abab ababa，后綴 babab abab bab ab b，共有元素最大長度為 4；
π[7] = 5，abababa，前綴 a ab aba abab ababa ababab，后綴 bababa ababa baba aba ba a，共有元素最大長度為 5；
π[8] = 6，abababab，前綴 .. ababab ..，后綴 .. ababab ..，共有元素最大長度為 6；
π[9] = 0，ababababc，前綴和后綴不匹配，共有元素最大長度為 0；
π[10] = 1，ababababca，前綴 .. a ..，后綴 .. a ..，共有元素最大長度為 1；

KMP 算法 KMP-MATCHER 中通過調用 COMPUTE-PREFIX-FUNCTION 函數來計算部分匹配表。

 1 KMP-MATCHER(T, P)
 2 n ← length[T]
 3 m ← length[P]
 4 π ← COMPUTE-PREFIX-FUNCTION(P)
 5 q ← 0                          //Number of characters matched.
 6 for i ← 1 to n                 //Scan the text from left to right.
 7     do while q > 0 and P[q + 1] ≠ T[i]
 8             do q ← π[q]        //Next character does not match.
 9         if P[q + 1] = T[i]
10             then q ← q + 1     //Next character matches.
11         if q = m               //Is all of P matched?
12             then print "Pattern occurs with shift" i - m
13             q ← π[q]           //Look for the next match.

 1 COMPUTE-PREFIX-FUNCTION(P)
 2 m ← length[P]
 3 π[1] ← 0
 4 k ← 0
 5 for q ← 2 to m
 6      do while k > 0 and P[k + 1] ≠ P[q]
 7             do k ← π[k]
 8         if P[k + 1] = P[q]
 9            then k ← k + 1
10         π[q] ← k
11 return π

預處理過程 COMPUTE-PREFIX-FUNCTION 的運行時間為 Θ(m)，KMP-MATCHER 的匹配時間為 Θ(n)。

相比較于 NAIVE-STRING-MATCHER，KMP-MATCHER 的主要優化點就是在當確定字符不匹配時對于 pattern 的位移。

NAIVE-STRING-MATCHER 的位移效果是：文本向后移一位，模式從頭開始。

    s = s - j + 1;
    j = 0;

KMP-MATCHER 首先對模式做了獲取共同前綴后綴最大長度的預處理操作，位移過程是先將模式向后移 partial_match_length - table[partial_match_length - 1]，然后再判斷是否匹配。這樣通過對已匹配字符串的已知信息的利用，可以有效節省比較數量。

    if (j != 0)
        j = lps[j - 1];
    else
        s++;

下面描述了當發現字符 j 與 c 不匹配時的位移效果。

    // partial_match_length - table[partial_match_length - 1]
    rrababababjjjjjiiooorababababcauuu
      ||||||||-
      ababababca
    // 8-6=2
    rrababababjjjjjiiooorababababcauuu
      xx||||||-
        ababababca
    // 6-4=2
    rrababababjjjjjiiooorababababcauuu
        xx||||-
          ababababca
    // 4-2=2
    rrababababjjjjjiiooorababababcauuu
          xx||-
            ababababca
    // 2-0=2
    rrababababjjjjjiiooorababababcauuu
            xx-
              ababababca

綜上可知，KMP 算法的主要特點是：

需要對模式字符串做預處理；
預處理階段需要額外的 O(m) 空間和復雜度；
匹配階段與字符集的大小無關；
匹配階段至多執行 2n - 1 次字符比較；
對模式中字符的比較順序時從左到右；

下面是 KMP-MATCHER 的代碼示例。

  1 namespace StringMatching
  2 {
  3   class Program
  4   {
  5     static void Main(string[] args)
  6     {
  7       char[] text1 = "BBC ABCDAB ABCDABCDABDE".ToCharArray();
  8       char[] pattern1 = "ABCDABD".ToCharArray();
  9 
 10       int firstShift1;
 11       bool isMatched1 = KmpStringMatcher.TryMatch1(text1, pattern1, out firstShift1);
 12       Contract.Assert(isMatched1);
 13       Contract.Assert(firstShift1 == 15);
 14 
 15       char[] text2 = "ABABDAAAACAAAABCABAB".ToCharArray();
 16       char[] pattern2 = "AAACAAAA".ToCharArray();
 17 
 18       int firstShift2;
 19       bool isMatched2 = KmpStringMatcher.TryMatch2(text2, pattern2, out firstShift2);
 20       Contract.Assert(isMatched2);
 21       Contract.Assert(firstShift2 == 6);
 22 
 23       char[] text3 = "ABAAACAAAAAACAAAABCABAAAACAAAAFDLAAACAAAAAACAAAA".ToCharArray();
 24       char[] pattern3 = "AAACAAAA".ToCharArray();
 25 
 26       int[] shiftIndexes3 = KmpStringMatcher.MatchAll1(text3, pattern3);
 27       Contract.Assert(shiftIndexes3.Length == 5);
 28       Contract.Assert(string.Join(",", shiftIndexes3) == "2,9,22,33,40");
 29       int[] shiftIndexes4 = KmpStringMatcher.MatchAll2(text3, pattern3);
 30       Contract.Assert(shiftIndexes4.Length == 5);
 31       Contract.Assert(string.Join(",", shiftIndexes4) == "2,9,22,33,40");
 32 
 33       Console.WriteLine("Well done!");
 34       Console.ReadKey();
 35     }
 36   }
 37 
 38   public class KmpStringMatcher
 39   {
 40     public static bool TryMatch1(char[] text, char[] pattern, out int firstShift)
 41     {
 42       // KMP needs a pattern preprocess to get the Partial Match Table
 43       int[] lps = PreprocessToComputeLongestProperPrefixSuffixArray(pattern);
 44       // pattern: ABCDABD
 45       // char:  | A | B | C | D | A | B | D |
 46       // index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
 47       // lps:   | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
 48 
 49       firstShift = -1;
 50       int n = text.Length;
 51       int m = pattern.Length;
 52       int s = 0, j = 0;
 53 
 54       while (s < n && j < m)
 55       {
 56         if (j == -1 || text[s] == pattern[j])
 57         {
 58           s++;
 59           j++;
 60         }
 61         else
 62         {
 63           // here is different with naive string matcher
 64           if (j != 0)
 65             j = lps[j - 1];
 66           else
 67             s++;
 68         }
 69 
 70         if (j == m)
 71         {
 72           firstShift = s - j;
 73           return true;
 74         }
 75       }
 76 
 77       return false;
 78     }
 79 
 80     static int[] PreprocessToComputeLongestProperPrefixSuffixArray(char[] pattern)
 81     {
 82       int m = pattern.Length;
 83 
 84       // hold the longest prefix suffix values for pattern
 85       int[] lps = new int[m];
 86       lps[0] = 0;
 87 
 88       // length of the previous longest prefix suffix
 89       int k = 0;
 90       int q = 1;
 91       while (q < m)
 92       {
 93         if (pattern[k] == pattern[q])
 94         {
 95           k++;
 96           lps[q] = k;
 97           q++;
 98         }
 99         else
100         {
101           if (k != 0)
102           {
103             k = lps[k - 1];
104           }
105           else
106           {
107             lps[q] = 0;
108             q++;
109           }
110         }
111       }
112 
113       return lps;
114     }
115 
116     public static bool TryMatch2(char[] text, char[] pattern, out int firstShift)
117     {
118       // KMP needs a pattern preprocess
119       int[] next = PreprocessToGetNextArray(pattern);
120       // pattern: ABCDABD
121       // char:  | A | B | C | D | A | B | D |
122       // index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
123       // lps:   | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
124       // next:  |-1 | 0 | 0 | 0 | 0 | 1 | 2 | -> shift LPS 1 position to right 
125 
126       firstShift = -1;
127       int n = text.Length;
128       int m = pattern.Length;
129       int s = 0, j = 0;
130 
131       while (s < n && j < m)
132       {
133         if (j == -1 || text[s] == pattern[j])
134         {
135           s++;
136           j++;
137         }
138         else
139         {
140           // here is different with naive string matcher
141           j = next[j];
142         }
143 
144         if (j == m)
145         {
146           firstShift = s - j;
147           return true;
148         }
149       }
150 
151       return false;
152     }
153 
154     static int[] PreprocessToGetNextArray(char[] pattern)
155     {
156       int m = pattern.Length;
157       int[] next = new int[m];
158       next[0] = -1;
159 
160       int k = -1;
161       int q = 0;
162       while (q < m - 1)
163       {
164         if (k == -1 || pattern[k] == pattern[q])
165         {
166           k++;
167           q++;
168 
169           //next[q] = k;       // does not optimize
170 
171           if (pattern[k] != pattern[q])
172             next[q] = k;
173           else
174             next[q] = next[k]; // with optimization
175         }
176         else
177         {
178           k = next[k];
179         }
180       }
181 
182       return next;
183     }
184 
185     public static int[] MatchAll1(char[] text, char[] pattern)
186     {
187       // KMP needs a pattern preprocess
188       int[] lps = PreprocessToComputeLongestProperPrefixSuffixArray(pattern);
189       // pattern: ABCDABD
190       // char:  | A | B | C | D | A | B | D |
191       // index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
192       // lps:   | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
193 
194       int n = text.Length;
195       int m = pattern.Length;
196       int s = 0, j = 0;
197       int[] shiftIndexes = new int[n - m + 1];
198       int c = 0;
199 
200       while (s < n && j < m)
201       {
202         if (j == -1 || text[s] == pattern[j])
203         {
204           s++;
205           j++;
206         }
207         else
208         {
209           // here is different with naive string matcher
210           if (j != 0)
211             j = lps[j - 1];
212           else
213             s++;
214         }
215 
216         if (j == m)
217         {
218           shiftIndexes[c] = s - j;
219           c++;
220 
221           j = lps[j - 1];
222         }
223       }
224 
225       int[] shifts = new int[c];
226       for (int y = 0; y < c; y++)
227       {
228         shifts[y] = shiftIndexes[y];
229       }
230 
231       return shifts;
232     }
233 
234     public static int[] MatchAll2(char[] text, char[] pattern)
235     {
236       // KMP needs a pattern preprocess
237       int[] next = PreprocessToGetNextArray(pattern);
238       // pattern: ABCDABD
239       // char:  | A | B | C | D | A | B | D |
240       // index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
241       // lps:   | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
242       // next:  |-1 | 0 | 0 | 0 | 0 | 1 | 2 | -> shift LPS 1 position to right
243 
244       int n = text.Length;
245       int m = pattern.Length;
246       int s = 0, j = 0;
247       int[] shiftIndexes = new int[n - m + 1];
248       int c = 0;
249 
250       while (s < n && j < m)
251       {
252         if (j == -1 || text[s] == pattern[j])
253         {
254           s++;
255           j++;
256         }
257         else
258         {
259           // here is different with naive string matcher
260           j = next[j];
261         }
262 
263         if (j == m)
264         {
265           shiftIndexes[c] = s - j;
266           c++;
267 
268           j = next[j - 1];
269         }
270       }
271 
272       int[] shifts = new int[c];
273       for (int y = 0; y < c; y++)
274       {
275         shifts[y] = shiftIndexes[y];
276       }
277 
278       return shifts;
279     }
280   }
281 }