Substring With Concatenation of Words

Problem Statement

The following is the description of substring with concatenation of words:

You are given a string, s, and a list of words, words, that are all of the same length. Find all starting indices of substring(s) in s that is a concatenation of each word in words exactly once and without any intervening characters.
For example, given:
s: "barfoothefoobarman"
words: [“foo”, “bar”]
You should return the indices: [0,9].

Thoughts

Given the problem, we can easily come up with the following solution:

Compose all possible string combinations from the words array and compare them with target string: the time and memory complexity can be $O(n!)$ ;
Or, Find index of every single word and try to connect them;
Or, Combine words using prefix tree and compare with target;

In both cases, we need to compare string effectively, so we can use KMP algorithm.

How to KMP

KMP is a somewhat complicated algorithm to write, but very efficient. It can compare string with the time efficiency – $O(m + n)$ .
If you don’t understand KMP, try reading my basic introduction and/or more interactive version.

Application

I choose the second thought because it can avoid composing duplicate¹ connected-string. This is often the case that: we met a problem, which have two basic directions

Generate many possible candidates, then filter them;
Or, more efficiently, try to generate what we need directly.

Another reason is that the first thought will not use the condition of all words are same length as far as I can see.

s: "aaa"
words: [“a”, “a”, “b”]

Problem Analysis Again – Same Length

Now, suppose we get the index of every word, how to combine them?

We notice the condition – all words have same length. What make a difference if they are not?
Right, we can’t use the first index and plus the length to get the next possible index if they are not.

// So I sort the indices in order
sort(mapping:index -> word)
// and connect indices if the differences between them is length(word) -- which will make many linked-lists.
for( index : indcies ) 
	if(mapping.has(index + length(word)))
		connect two nodes of (index, word)

Then, I can iterate every linked-list to put word in a deque. If this word appear too many times, I pop some element from head to adjust. If the size of deque equals to the count of words, we find a match.

while(node != NULL) 
	if(deque.count(node.word) >= words.count(node.word))
		pop in front until count(node.word) decrease

	deque.add(word)
	if(deque.size() == words.size())
		remember start of front node's index

	node = node->next;

Finally, the implementation of upper pseudo-code

Written with StackEdit.

For example, the following will generate duplicate string “ab” ↩︎

On teh way

Blog Search