跳至主要内容

Understand KMP

KMP wiki introduciton

The core thought behind the complex description is avoiding duplicate comparison. For example:

    aim:    ababababd
            ^   ^
            s   i
pattern:    ababd
                ^
                j

When aim[i] != pattern[j], what brute-force solution usually do is set i = s + 1; j = 0 to re-compare it. But actually, we already know aim[s + 1] == pattern[1] && pattern[1] != pattern[0], so this comparison will always fail. So if there any way to avoid this? kmp algorithm use a pre-processed array – next to handle this.

What does next mean

Back to upper example, if we using kmp to compare and fail the comparison, i will remain not changed, j = next[j];//2 in this case.

    aim:    ababababd
            ^   ^
            s   i
pattern:      ababd
                ^
                j

So what on earth the next is?
We try to understand it by example:

        ------------------------
aim         |/////|y|  |/////|z|    
        ------------------------
                              ^
                              i
            -------------------
pattern     |/////|y|  |/////|x|    
            -------------------
            ^      ^   ^      ^
            0      k   jj     j

Again, the we compare aim[i] and pattern[j] and fail to match. We can move pattern right by j = k; //k == next[j] if we know [0, k) == [jj, j). So the core of next array is the same sub-string in pattern. Actually, it has to be prefix and suffix( from start to current index.)

In conclusion, next can be understood by this two views:

  • how it come: longest common prefix/suffix
  • how it used: next position to compare & how many chars to skip
Some example to make sure you understand:
string next[4]
aaaa 2
aaab 2

So how next is computed?

Using dynamic programming: compute next[i] from next[i-1].

-----------------------
|/////|x|y|  |/////|x|y|  
-----------------------
^        ^          ^ ^
0        k            i

next[i-1] = k
if p[i] == p[k]
    next[i] = next[i-1] + 1
else 
    ???

The difficult part is what to do when next char not equal? Let’s see a larger example:

--------------------------------------------------
|/////|z|  |/////|y|        |/////|z|  |/////|z|  
--------------------------------------------------
       ^   ^      ^         ^          ^      ^
      k0  i2      k         i0        i1      i


p[k] != p[z]// so we have to find a shorter prefix/suffix
// Assume we finally find a k0, make
[0, k0) == [i1, i)

// So we can find a same sub-string in [i2, k), because
[0, k) == [i0, i)
// now:
[0, k0) == [i2, k)
// i.e. it is a common prefix/suffix for index k
// because next is longest common prefix/suffix,
// k0 must <= next[k]

So we can complete upper pseudo-code:

if p[i] == p[k]
    next[i] = next[i-1] + 1
else 
    k = next[k] and try again

Full source can be found in gist

Ref

  1. A blog in chinese
  2. Wikipedia
  3. Gist cpp code
  4. Application in a problem

Written with StackEdit.

评论