查看“︁KMP算法”︁的源代码

{{TA
|G1=Math
|G2=IT
}}
<noinclude></noinclude>
:<small>在本文中，将使用[[数组|始于零的数组]]来表示字符串。比如，若字符串<code>S = "ABC"</code>，则<code>S[2]</code>表示字符<code>'C'</code>。这种表示方法与[[C编程语言|C语言]]一致。</small>

在[[计算机科学]]中，'''克努斯-莫里斯-普拉特[[字符串查找算法]]'''（{{lang-en|Knuth–Morris–Pratt algorithm}}，简称为'''KMP算法'''）可在一个[[字符串]]<code>S</code>内查找一个-{zh-hans:词; zh-hant:字;}-<code>W</code>的出现位置。一个词在不匹配时本身就包含足够的信息来确定下一个匹配可能的开始位置，此算法利用这一特性以避免重新检查先前配對的[[字符]]。

这个算法由[[高德纳]]和[[沃恩·普拉特]]在1974年构思，同年[[詹姆斯·H·莫里斯]]也独立地设计出该算法，最终三人于1977年联合发表。

== 查找过程 ==
以<code>W</code>=<code>"ABCDABD"</code>，<code>S</code>=<code>"ABC ABCDAB ABCDABCDABDE"</code>为例说明查找过程。查找过程同时使用两个循环变量<code>m</code>和<code>i</code>：

* <code>m</code>代表主文字符串<code>S</code>内匹配字符串<code>W</code>的当前查找位置，
* <code>i</code>代表匹配字符串<code>W</code>当前做比较的字符位置。

图示如下：

 {{color|gray|             1         2  }}
 m: {{color|blue|0}}{{color|gray|1234567890123456789012}}
 S: {{color|blue|ABC}}{{color|red| }}{{color|gray|ABCDAB ABCDABCDABDE}}
 W: {{color|blue|ABC}}{{color|red|D}}{{color|gray|ABD}}
 i: {{color|blue|012}}{{color|red|3}}{{color|gray|456}}

從<code>W</code>與<code>S</code>的開頭比較起。比對到<code>S[3](=' ')</code>時，發現<code>W[3](='D')</code>與之不符。接著並不是從<code>S[1]</code>比較下去。已經知道<code>S[1]~S[3]</code>不與<code>W[0]</code>相合。因此，略過這些字元，令<code>m = 4</code>以及<code>i = 0</code>。

 {{color|gray|             1         2  }}
 m: 0123{{color|blue|4}}{{color|gray|567890123456789012}}
 S: ABC {{color|blue|ABCDAB}}{{color|red| }}{{color|gray|ABCDABCDABDE}}
 W:     {{color|blue|ABCDAB}}{{color|red|D}}
 i:     {{color|blue|012345}}{{color|red|6}}

如上所示，檢核了<code>"ABCDAB"</code>這個字串。然而，下一字符便不相合。可以注意到，<code>"AB"</code>在<code>"ABCDAB"</code>的頭尾處均有出現。這意味著尾端的<code>"AB"</code>可以作為下次比較的起始點。因此，令<code>m = 8, i = 2</code>，繼續比較。圖示如下：

 {{color|gray|             1         2  }}
 m: 01234567{{color|blue|8}}{{color|gray|90123456789012}}
 S: ABC ABCD{{color|gray|AB}}{{color|red| }}{{color|gray|ABCDABCDABDE}}
 W:         {{color|gray|AB}}{{color|red|C}}{{color|gray|DABD}}
 i:         {{color|gray|01}}{{color|red|2}}{{color|gray|3456}}

於<code>m = 10</code>的地方，又出現不相符的情況。類似地，令<code>m = 11, i = 0</code>繼續比較：

              1         {{color|gray|2}}  
 m: 01234567890{{color|blue|1}}{{color|gray|23456789012}}
 S: ABC ABCDAB {{color|blue|ABCDAB}}{{color|red|C}}{{color|gray|DABDE}}
 W:            {{color|blue|ABCDAB}}{{color|red|D}}
 i:            {{color|blue|012345}}{{color|red|6}}

這時，<code>S[17](='C')</code>不與<code>W[6]</code>相同，但是已匹配部分<code>"ABCDAB"</code>亦为首尾均有<code>"AB"</code>，採取一貫的作法，令<code>m = 15</code>和<code>i = 2</code>，繼續搜尋。

              1         {{color|gray|2}}  
 m: 012345678901234{{color|blue|5}}{{color|gray|6789012}}
 S: ABC ABCDAB ABCD{{color|blue|ABCDABD}}{{color|gray|E}}
 W:                {{color|blue|ABCDABD}}
 i:                {{color|blue|0123456}}

找到完全匹配的字串了，其起始位置於<code>S[15]</code>的地方。

== 部分匹配表 ==
'''部分匹配表'''，又称为'''失配函数'''，作用是让算法无需多次匹配<code>S</code>中的任何字符。能够实现线性时间搜索的关键是在主串的一些字段中检查模式串的''初始字段''，可以确切地知道在当前位置之前的一个潜在匹配的位置。换句话说，在不错过任何潜在匹配的情况下，"预搜索"这个模式串本身并将其译成一个包含所有可能失配的位置对应可以绕过最多无效字符的列表。

对于<code>W</code>中的任何位置，都希望能够查询那个位置前（不包括那个位置）有可能的<code>W</code>的最长初始字段的长度，而不是重新从<code>W[0]</code>开始比较整个字段，这长度就是查找下一个匹配时回退的距离。因此<code>T[i]</code>是<code>W</code>的可能的''适当''初始字段同时也是结束于<code>W[i - 1]</code>的子串的最大长度。使空串长度是0。当一个失配出现在模式串的最开始，这是特殊情况（无法回退），设置<code>T[0] = -1</code>，在[[#建立表算法的伪代码的解释|下面]]讨论。

===创建表算法示例===
以<code>W = "ABCDABD"</code>为例。以下将看到，部分匹配表的生成过程与前述查找过程大同小异，且出于类似原因是高效的。

首先，设定<code>T[0] = -1</code>。为求出<code>T[1]</code>，必须找到一个<code>"A"</code>的[[子串|真后缀]]（真后缀指不等于原串的后缀）兼<code>W</code>的前缀。但<code>"A"</code>没有真后缀，所以设定<code>T[1] = 0</code>。类似地，<code>T[2] = 0</code>。

继续到<code>T[3]</code>，注意到检查'''所有'''后缀有一个捷径：假设存在符合条件的前后缀，两者分别为<code>W[0..1] = W[1..2]</code>，则必有<code>W[0..0] = W[1..1]</code>。由于<code>W[0..0]</code>亦是<code>W</code>的真前缀，上一步必然已经得到<code>T[2] = 1</code>（而有<code>T[2] = 0</code>，说明假设不成立）。一般地，遍历到每个字符时，只有上一步已经发现一个长为m的有效后缀，才需要判断有无长为m+1的后缀，而毋需考虑长为m+2、m+3等的后缀。

从而，不必考虑长为2的后缀，而唯独需要考虑的长度1亦不可行，故得到<code>T[3]=0</code>。

接下来是<code>W[4] = 'A'</code>。基于同样的理由，需要考虑的最大长度为1，并且在<code>'A'</code>这个情况中有效，回退到寻找的当前字符''之前''的字段，因此<code>T[4] = 0</code>。

现在考虑下一个字符<code>W[5] = 'B'</code>，使用这样的逻辑：如果曾发现一个子模式在上一个字符<code>W[4]</code>之前出现，继续到当前字符<code>W[5]</code>，那么在它之前它本身会拥有一个结束于<code>W[4]</code>合适的初始段，与事实相反的是已经找到<code>'A'</code>是最早出现在结束于<code>W[4]</code>的合适字段。因此为了找到<code>W[5]</code>的终止串，不需要查看<code>W[4]</code>。因此<code>T[5] = 1</code>。

最后到<code>W[4] = 'A'</code>。下一个字符是<code>'B'</code>，并且这也确实是<code>W[5]</code>。此外，上面的相同参数说明为了查找<code>W[6]</code>的字段，不需要向前查看<code>W[4]</code>，所以得出<code>T[6] = 2</code>。

于是得到下面的表：

{| class="wikitable" style="background-color:white; font-family:monospace; text-align:right"
!<code>i</code>
| 0
| 1
| 2
| 3
| 4
| 5
| 6
|-
!<code>W[i]</code>
| A
| B
| C
| D
| A
| B
| D
|-
!<code>T[i+1]</code>
| 0
| 0
| 0
| 0
| 1
| 2
| 0
|}

另一个更复杂和有趣的例子：
{| class="wikitable" style="background-color:white; font-family:monospace; text-align:right"
!<code>i</code>
| 0
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
| 12
| 13
| 14
| 15
| 16
| 17
| 18
| 19
| 20
| 21
| 22
| 23
|-
!<code>W[i]</code>
| P
| A
| R
| T
| I
| C
| I
| P
| A
| T
| E
|
| I
| N
| 
| P
| A
| R
| A
| C
| H
| U
| T
| E
|-
!<code>T[i+1]</code>
| 0
| 0
| 0
| 0
| 0
| 0
| 0
| 1
| 2
| 0
| 0
| 0
| 0
| 0
| 0
| 1
| 2
| 3
| 0
| 0
| 0
| 0
| 0
| 0
|}

===建立表算法的伪代码的解释===
<!--Note to future editors: please do not replace or even supplement the pseudocode with language-specific code.  Following the WikiProject Computer Science manual of style, pseudocode is preferred over real code unless the real code illustrates a feature of the language or an implementation detail.  This algorithm is so simple that neither of these can ever be the case-->
<!--This code appears to implement the table for the Morris-Pratt algorithm, not for the Knuth-Morris-Pratt Algorithm. The KMP table has bigger shifts. See http://www-igm.univ-mlv.fr/~lecroq/string/node8.html-->
上面的例子以最少的复杂步骤展示了组织这个表格的一般性方法。这么做的原理是对整体的搜索：大多数工作已经在检测到当前位置的时候做完了，剩下需要做的很少。略微复杂的一点是找到一个共同前后缀。这就需要有一些初始化的代码。

 '''algorithm''' ''kmp_table'':
     '''input''':
         an array of characters, W (the word to be analyzed)
         an array of integers, T (the table to be filled)
     '''output''':
         nothing (but during operation, it populates the table)
 
     '''define variables''':
         an integer, pos ← 2 (the current position we are computing in T)
         an integer, cnd ← 0 (the zero-based index in W of the next <br>character of the current candidate substring)
 
     (the first few values are fixed but different from what the algorithm <br>might suggest)
     '''let''' T[0] ← -1, T[1] ← 0
 
     '''while''' pos < length(W) do
         (first case: the substring continues)
         '''if''' W[pos - 1] = W[cnd] '''then'''
             '''let''' cnd ← cnd + 1, T[pos] ← cnd, pos ← pos + 1
 
         (second case: it doesn't, but we can fall back)
         '''else''' '''if''' cnd > 0 '''then'''
             '''let''' cnd ← T[cnd]
 
         (third case: we have run out of candidates.  Note cnd = 0)
         '''else'''
             '''let''' T[pos] ← 0, pos ← pos + 1

===建立表的算法的效率===
建立表的算法的复杂度是<math>O(n)</math>，其中<math>n</math>是<code>W</code>的长度。

除去一些初始化的工作，所有工作都在循环中完成。为说明循环执行用了<math>O(n)</math>的时间，考虑<code>pos</code>和<code>pos - cnd</code>的大小。
* 在第一个分支里，<code>pos - cnd</code>不变，而<code>pos</code>与<code>cnd</code>同时自增，自然，<code>pos</code>增加了。
* 在第二个分支里，<code>cnd</code>被更小的<code>T[cnd]</code>所替代，从而增加了<code>pos - cnd</code>。
* 在第三个分支里，<code>pos</code>增加了，而<code>cnd</code>不变，所以<code>pos</code>和<code>pos - cnd</code>都增加了。
因为<code>pos ≥ pos - cnd</code>，即在每一个阶段要么<code>pos</code>增加，要么<code>pos</code>的一个下界增加。故既然算法在<code>pos = n</code>时终止，此循环必然在最多<math>2n</math>次迭代后终止。因此建立表的算法的复杂度是<math>O(n)</math>。

==另见==
*[[Boyer-Moore字符串搜索算法]]

== 外部連結 ==

* {{en}}[http://www.ics.uci.edu/~eppstein/161/960227.html An explanation of the algorithm] {{Wayback|url=http://www.ics.uci.edu/~eppstein/161/960227.html |date=20210116080759 }} and [http://www.ics.uci.edu/~eppstein/161/kmp/ sample C++ code] {{Wayback|url=http://www.ics.uci.edu/~eppstein/161/kmp/ |date=20201004125500 }} by [[David Eppstein]]
* {{en}}[http://www-igm.univ-mlv.fr/~lecroq/string/node8.html Knuth-Morris-Pratt algorithm] {{Wayback|url=http://www-igm.univ-mlv.fr/~lecroq/string/node8.html |date=20210125110301 }} description and C code by Christian Charras and Thierry Lecroq
* {{en}}[https://web.archive.org/web/20090310031611/http://www.ics.uci.edu/~goodrich/dsa/11strings/demos/pattern/ Interactive animation for Knuth-Morris-Pratt algorithm] by Mike Goodrich
* {{en}}[http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm Explanation of the algorithm from scratch] {{Wayback|url=http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm |date=20200121185531 }} by FH Flensburg.

== 引用 ==

* {{cite journal
 | author = [[高德纳]]
 | coauthors = [[James H. Morris, Jr]], [[沃恩·普拉特|Vaughan Pratt]]
 | title = Fast pattern matching in strings
 | journal = SIAM Journal on Computing
 | volume = 6
 | issue = 2
 | pages = 323-350
 | year = 1977
 | url = http://citeseer.ist.psu.edu/context/23820/0
 | access-date = 2006-07-27
 | archive-date = 2010-01-04
 | archive-url = https://web.archive.org/web/20100104101822/http://citeseer.ist.psu.edu/context/23820/0
 | dead-url = no
 }}

* {{cite book
 | author = [[Thomas H. Cormen]]
 | coauthors = [[Charles E. Leiserson]], [[Ronald L. Rivest]], [[Clifford Stein]]
 | title = [[Introduction to Algorithms]]
 | edition = Second edition
 | publisher = MIT Press and McGraw-Hill
 | year = 2001
 | id = ISBN 978-0-262-03293-3
 | chapter = Section 32.4: The Knuth-Morris-Pratt algorithm
 | pages = [https://archive.org/details/introductiontoal00corm_691/page/n945 923]-931
}}
{{算法}}
{{字符串}}
{{高德納}}

[[Category:字符串匹配算法]]
[[Category:带有伪代码示例的条目]]