Algorithm: Can a string be covered by the elements in a set?

I was recently asked essentially this question in an interview: given a string S and a set of strings P, can strings from P be concatenated, with repetition allowed, to form S?  Or, put another way, can strings from P (reused as needed) cover S without overlaps or gaps?

My first instinct was to use suffix tries and an overlap graph, but the solution that we eventually reached used no sophisticated data structures – just the array of strings and some recursion in a divide-and-conquer approach.  Here are my extended thoughts on the problem.

First Algorithm

bool canBeCovered(S, P)
  if (S == "") return true
  for i = 1 to |S|
    substring = substring of S from 0 with length i
    if (substring is in P)
      if (canBeCovered(substring of S from i to end, P))
        return true
  return false
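
Here's that pseudocode as runnable Python (my rendering, not the code from the interview).  P stays a plain list, so each membership test is a linear scan through the string array:

def can_be_covered(s, p):
    # True if s can be written as a concatenation of strings from p,
    # where each string may be used any number of times.
    if s == "":
        return True
    for i in range(1, len(s) + 1):        # try every non-empty prefix of s
        prefix = s[:i]
        if prefix in p:                   # linear scan, since p is a list
            if can_be_covered(s[i:], p):  # recurse on the uncovered suffix
                return True
    return False

S = "companyisgreat"
P = ["com", "compan", "pany", "yi", "sg", "reat"]
print(can_be_covered(S, P))  # True: compan + yi + sg + reat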

Graphically, the algorithm looks like this:

S = companyisgreat  (|S| = 14)
P = {com, compan, pany, yi, sg, reat}  (|P| = 6)

c ompanyisgreat
co mpanyisgreat
com p anyisgreat
com pa nyisgreat
com pan yisgreat
com pany i sgreat
com pany is great
com pany isg reat
com pany isgr eat
com pany isgre at
com pany isgrea t
com pany isgreat
com panyi sgreat
com panyis great
com panyisg reat
com panyisgr eat
com panyisgre at
com panyisgrea t
com panyisgreat
comp anyisgreat
compa nyisgreat
compan y isgreat
compan yi s great
compan yi sg r eat
compan yi sg re at
compan yi sg rea t
compan yi sg reat

In this example, the algorithm makes 6 recursive calls and tests 32 candidate prefixes in total (matched prefixes share a trace line with the call they spawn), with each test performing an iterative search through the string array for a match.

It’s a simple algorithm, but not a cheap one: each call can perform up to O(|S| * (sum of lengths of all strings in P)) work, and because the recursion can revisit the same suffix along many different paths, the total running time can grow exponentially in |S|.  I left feeling like we could do better.
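
For intuition, here's the kind of input (my own example, not from the interview) that triggers that blow-up:

# A run of one repeated character can be segmented in exponentially many ways,
# and every segmentation fails only when the trailing "b" is finally reached.
S = "a" * 25 + "b"
P = ["a", "aa", "aaa"]
# can_be_covered(S, P) re-explores the same suffixes along many different
# paths, so its running time grows exponentially with the length of the run.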

Second Algorithm

I spent some time later that afternoon thinking about this problem, and it turns out my instincts generally pointed toward an efficient approach.  I developed the following algorithm based on a trie of the strings in P (not suffix tries) and an implicit graph (not an overlap graph).

Here’s the pseudocode:

S = companyisgreat  (|S| = 14)
P = {com, compan, pany, yi, sg, reat}  (|P| = 6)

trie = Generate a trie from the strings in P

bool canBeCovered(S, trie, i)
  if (i >= |S|) return true
  currentTriePosition = root
  for j = i to |S| - 1
    if S[j] is a child of currentTriePosition
      currentTriePosition = currentTriePosition.children[S[j]]
      if currentTriePosition marks the end of a string in P
        if (canBeCovered(S, trie, j + 1))
          return true
    else
      return false

  return false
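
Here's a runnable Python version (again my rendering).  The trie is a nested dict, and a sentinel key marks each node where a string from P ends – note that such a node isn't necessarily a leaf, since "com" ends partway along the path for "compan":

END = "$"  # sentinel key marking the end of a string from P

def build_trie(p):
    root = {}
    for word in p:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # create/descend one level per character
        node[END] = True                    # a string of P ends at this node
    return root

def can_be_covered(s, trie, i=0):
    if i >= len(s):
        return True
    node = trie
    for j in range(i, len(s)):
        if s[j] not in node:       # no string in P continues this prefix
            return False
        node = node[s[j]]
        if END in node:            # a string of P ends at position j
            if can_be_covered(s, trie, j + 1):
                return True
    return False

trie = build_trie(["com", "compan", "pany", "yi", "sg", "reat"])
print(can_be_covered("companyisgreat", trie))  # True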

Graphically, the trie looks like the sketch below:
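
root
 +- c - o - m* - p - a - n*    (com and compan share a path)
 +- p - a - n - y*             (pany)
 +- y - i*                     (yi)
 +- s - g*                     (sg)
 +- r - e - a - t*             (reat)

(* marks a node where a string from P ends)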

The graphic below shows the recursion, which resembles a graph.  Each letter is a node.  Each arc runs from the letter S[i] where a call starts to the letter S[j + 1] where the recursive call it spawns starts.  By representing the calls this way, it’s pretty obvious that we are performing a search in the graph for a path from the node representing the first character to the node representing the position just past the last character of the string.
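
In text form, the arcs for our example look like this (positions are 0-based indices into S; position 14 is one past the end):

0 --com--> 3 --pany--> 7                           (dead end: no string in P starts at S[7] = 'i')
0 --compan--> 6 --yi--> 8 --sg--> 10 --reat--> 14  (success: 14 is past the end of S)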

After the initial call to canBeCovered(S, trie, 0), each outgoing arrow corresponds to a recursive call, which restarts the covering search at the position the arrow points to.  If a box has an incoming arrow but no outgoing arrow, the procedure returns false from there.  If the search reaches the last node (goes beyond the end of S), the procedure returns true.

By counting the outgoing arrows we can see that this algorithm uses only 6 recursive calls.

Analysis

As I mention in my trie article, tries give the ability to quickly find the start location in S of any string in P.  The second algorithm uses what might be considered the reverse knowledge – we know all the strings in P that start at position i, and implicitly, their lengths.  Using this knowledge, we’re able to bypass the cover check for any locations in S that could not contribute to a covering set.  The same bypass could be performed in the first algorithm with the following change (hereby declared the modified first algorithm):

bool canBeCovered(S, P)
  if (S == "") return true
  for each p in P
    substring = substring of S from 0 with length |p|
    if (substring == p)
      if (canBeCovered(substring of S from |p| to end, P))
        return true
  return false
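
In Python (my sketch; note that it assumes no string in P is empty, or the recursion would never shrink s):

def can_be_covered_modified(s, p):
    # Assumes no string in p is empty.
    if s == "":
        return True
    for word in p:              # only |P| candidates per call, not |S|
        if s.startswith(word):  # compare s[0:len(word)] against word
            if can_be_covered_modified(s[len(word):], p):
                return True
    return False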

Also, we are able to avoid iterating through all of S – the second algorithm aborts as soon as we’ve found that no path terminates beyond the end of the string.  In the best no-cover case, when none of the strings in P start with S[0], the algorithm aborts after attempting one trie traversal, essentially performing one character comparison!

S = somesuperlongstringthatgoesonandon...andon
P = {averylongstring, ..., morestrings, laststring}

Compare this to the first algorithm – the same inputs would require at least |S| * |P| character comparisons.  The modified first algorithm above is better, but still requires |P| character comparisons.

However, the second algorithm does not always perform faster than the first, due to the overhead of creating the trie, which is built in O(sum of lengths of all strings in P).  In the best case for the first algorithm, when P[0] == S, the first & modified first algorithms perform |S| character comparisons.  The second algorithm also finds a covering set in |S| trie node traversals with these inputs.  However, because of the overhead of building the trie, the first algorithm is faster.  The first algorithm’s smaller memory usage may also give it a speed advantage.

Finally, as previously mentioned, the first algorithm tests 32 candidate prefixes, each one scanned against the whole array P, while the second tests none at all – its 6 recursive calls fall directly out of the trie walk, which only ever follows strings of P that actually occur at the current position.  The modified first algorithm sits in between: in this example it makes the same 6 recursive calls, but still scans up to |P| candidates in each one.

Conclusion

By choosing better data structures, we can often dramatically improve our algorithms.  In this case, using a trie allowed us to eliminate an iterative search through a string array for a string match.  This is especially valuable when the same P is used for multiple values of S.

We also saw that, once the trie is built, the second algorithm has much better performance in the best no-cover case than the first, and better performance than the modified first.  This might be desirable if, again, we were using the same P for multiple values of S, or if we suspected that P may not contain any element that is a prefix of S.

However, the cost of building the trie has to be considered, as does the memory usage.  If the strings in P were expected to be different for each procedure call, it’s likely that the modified first algorithm would be the fastest.  If the strings in P were used for multiple strings S, then the second algorithm would probably be fastest.  Also, a trie has relatively high memory demands – if the system had tight memory constraints, it might not be possible to use this approach.  Nevertheless, choosing among the algorithms would likely require at least minimal experimentation and runtime analysis using expected data, to determine whether the overhead of building a trie is worth the cost.

As always, I welcome comments and discussion, just leave a comment below!
