The core thought behind the complex description is avoiding duplicate comparison. For example:
aim: ababababd
^ ^
s i
pattern: ababd
^
j
When aim[i] != pattern[j]
, what brute-force solution usually do is set i = s + 1; j = 0
to re-compare it. But actually, we already know aim[s + 1] == pattern[1] && pattern[1] != pattern[0]
, so this comparison will always fail. So if there any way to avoid this? kmp algorithm use a pre-processed array – next
to handle this.
What does next
mean
Back to upper example, if we using kmp to compare and fail the comparison, i
will remain not changed, j = next[j];//2 in this case
.
aim: ababababd
^ ^
s i
pattern: ababd
^
j
So what on earth the next
is?
We try to understand it by example:
------------------------
aim |/////|y| |/////|z|
------------------------
^
i
-------------------
pattern |/////|y| |/////|x|
-------------------
^ ^ ^ ^
0 k jj j
Again, the we compare aim[i]
and pattern[j]
and fail to match. We can move pattern right by j = k; //k == next[j]
if we know [0, k) == [jj, j)
. So the core of next
array is the same sub-string in pattern. Actually, it has to be prefix and suffix( from start to current index.)
In conclusion, next can be understood by this two views:
- how it come: longest common prefix/suffix
- how it used: next position to compare & how many chars to skip
Some example to make sure you understand:
string | next[4] |
---|---|
aaaa | 2 |
aaab | 2 |
So how next
is computed?
Using dynamic programming: compute next[i] from next[i-1].
-----------------------
|/////|x|y| |/////|x|y|
-----------------------
^ ^ ^ ^
0 k i
next[i-1] = k
if p[i] == p[k]
next[i] = next[i-1] + 1
else
???
The difficult part is what to do when next char not equal? Let’s see a larger example:
--------------------------------------------------
|/////|z| |/////|y| |/////|z| |/////|z|
--------------------------------------------------
^ ^ ^ ^ ^ ^
k0 i2 k i0 i1 i
p[k] != p[z]// so we have to find a shorter prefix/suffix
// Assume we finally find a k0, make
[0, k0) == [i1, i)
// So we can find a same sub-string in [i2, k), because
[0, k) == [i0, i)
// now:
[0, k0) == [i2, k)
// i.e. it is a common prefix/suffix for index k
// because next is longest common prefix/suffix,
// k0 must <= next[k]
So we can complete upper pseudo-code:
if p[i] == p[k]
next[i] = next[i-1] + 1
else
k = next[k] and try again
Full source can be found in gist
Ref
Written with StackEdit.
评论
发表评论