When to use it: Tries

Suffix trie for string "panamabananas"

I’ll admit it up front, this is not as much of a “when to use it” post as it is a “what is it” post.  Never fear, I’ll still fill you all in on when to use it.  Also, I’m talking about the data structure for storing strings, not the synonym for attempts.

What is it?

A trie is a data structure that is used for fast string searching.  A trie has a root node, and a node for each character in the string.  The nodes are constructed such that traversing the tree from the root to a leaf reconstructs one of the strings that is stored in it.

For example, let’s say we want to store the following string:

band

We would construct the following trie:

Here we have the root node, and one node for every character in the string.  By traversing from the root to the leaf, we pass b, then a, then n, then d, creating the string “band,” which is our original string.

Let’s make it more interesting by attempting to store the following two strings:

band
pans

Now our trie looks like this:

Again, we can traverse either path to reach a leaf, and by doing so we can construct either “band” or “pans.”

To make it even more interesting, let’s store these four strings:

band
pans
banana
panstar

Now we have this:

There are two new features in this trie.

  1. We see that banana can be found by traversing b-a-n-a-n-a, and band can be found by traversing b-a-n-d.  B-a-n are shared between the two words.
  2. It looks like there is only one string on the branch starting with p.  In fact, with the current scheme, we have no way of knowing if one of the strings in the storage list is a substring of another string.

We’ll fix the problem described in #2 above by appending a character that’s not in our alphabet to the end of every string.  It’s typical to use $.  If we append $ to the end of every string in our storage set, we will get this trie:

When we traverse to a leaf, we’ll always get a string ending in $, which we can remove to recover the original string.

Uses of tries

How can we use a trie?  One use is to search a fixed string for occurrences of any string in the search set.  More formally, given an input string S and a set of strings P, return the start locations of any item in P that is a substring of S.

For example, let’s say we wanted to search the following string S for occurrences of any strings in the set P, and report the start locations of those strings:

S = string with something prohibited
P = {prohibited, some, strict}

The brute force approach to this would perform a search over S for every string in P, potentially taking O(|S|*|P|) time.  Of course this does not scale – what if we want to search an email for the 1000 most common words used by spammers?  Instead of using a brute force approach, we can process our search strings into a trie, then traverse the trie as we walk S.  If any node has a child that is a $, then the string formed by traversing to that node is present in S.

For the example above, here’s our trie:

Let’s walk through the string S = “string with something prohibited” and see how we would use the trie to find any of the prohibited words in S.  At every step when our current location is not the root, we’ll check to see if there is a child that is a $.

Current location in trie: root node
s - traverse the trie to the s node
t - traverse to the t node which is a child of the s node
r - traverse to the r node which is a child of the t node
i - traverse to the i node which is a child of the r node
n - not a child of the r node.  Reset the trie location to the root.
g, Space, w, i, t, h, Space - not children of current location
s - traverse the trie to the s node
o - traverse the trie to the o node which is a child of the s node
m - traverse the trie to the m node which is a child of the o node
e - traverse the trie to the e node which is a child of the m node.  There is a child of the e node that is a $, so we've found a word in S.  Save the start location of this string in a list for later reporting.
t - not a child of the e node.  Reset the trie location to the root.
h, i, n, g, Space - not children of current location
p - traverse to the p node which is a child of the current location
r, o, h, i, b, i, t, e - traverse to the node, which is a child of the current location
d - traverse to the d node which is a child of the e node.  There is a child of the d node that is a $, so we've found a word in S.  Save the start location of this string in a list for later reporting.

So, there you have it.  We iterated through S one time, and found the start location of all occurrences of every string in P, while minimizing string comparisons.

Suffix trie

In the previous example, we used a trie to search a fixed string S for a set of strings P.  This is useful when your set of substrings P is relatively stable but your search string S changes frequently, as in the spam use case.

There are other use cases of string searches though – one might want to find all occurrences of a substring in another string very quickly.  This is a common use case in genomics, where we want to search a genome (which can be represented as a string over the alphabet ATGC) for a genome fragment.  In this case, the search string S is relatively stable and can be on the order of 10^9 characters in length, but the members of the search set P change relatively frequently and may only be on the order of 10^5.  Let’s look at the simple case first: find all occurrences of a substring in a search string.

The more formal description of this problem is: given an input string S and a substring F, return a list of all start locations of F in S.  This problem can be solved by creating a suffix trie, which is a trie constructed by adding every suffix of S to the trie.  The steps in this procedure are to add the substring from S[0:last] to the trie, then S[1:last] to the trie, then S[2:last], and every other substring up to S[last-1:last].  Of course before constructing the trie, we will need to append the $ as before.

To find every start position of F in S, we traverse the trie using the characters in F to find the next node.  F does not appear in S if the trie’s current location does not contain a child that matches the next character of F.  If the trie can be traversed in this manner to the end of F, then F appears 1 or more times in S.

To find the start locations, the suffix trie needs to be modified during construction so the leaves are decorated with the start position of the suffix that was used to create that branch.  Then, we simply perform a DFS or BFS from the end node, and append the locations stored in the leaves to a list.

In this example, we want to find all start locations of the string “ana” in the string “panamabananas”.  This example comes from the excellent Algorithms on Strings course on coursera.

S = panamabananas
F = ana

Here’s what the suffix trie looks like after each leaf has been decorated with the start location of the suffix used to create that branch.

To find the all the start locations of “ana,” we would traverse from the root to a, then n, then a.  Since we reached the end of “ana,” we know there is 1 or more occurrence of “ana” in T.  To find these locations, we’ll just perform a DFS to the leaves from the current location, and we see that the locations are 7, 1, and 9.  This is easy to verify:

              111 
    0123456789012
S = panamabananas
     ^     ^ ^

To find the string F = “and,” we would traverse from the root to a, then n, but n has no child labeled with a “d.”  Since we could not reach the end of “and” by traversing the trie, we can conclude that “and” is not found S.

Time & space complexity

This post is already really long, so I’m not going to go into these details.  However, it’s fairly easy to show that inserting a string p into a trie can be done in O(|p|).  Probably the fastest implementation of a node, shown below, uses memory space that’s approximately O(|ALPHABET| * |N|), where |ALPHABET| is the number of characters in the alphabet and |N| is the number of nodes in the trie.

struct TrieNode
{
  struct TrieNode *children[CHARS_IN_ALPHABET];

  // rather than using a $ and a separate node to indicate the end
  // of a word or a leaf, each node contains a boolean to indicate whether
  // this node was the end of a string.
  bool isLeaf;
};

Finally, searching for a string p in a trie using the implementation above can be done in O(|p|).

When to use it

We’ve already seen how tries can be used for finding any instances of strings from a set P in a fixed string S, and how a suffix trie can be used to find all starting locations of a substring F in a string S (though a Burrows-Wheeler Transform may be better for this purpose – I may write a post on that in the future!).  It could also be used for quickly checking if a substring F is a member of a set P.  Thus, I recommend using tries when:

  • You need to quickly find any string from a set P in a string S.  This is especially valuable if P is relatively stable and you have many strings to search, such as in the spam email example.
  • You need to quickly find if a string F is a member of the set P.  This might occur if, for example, you want to check that a series of blocks placed on a Scrabble board is a valid English word.
  • It might be appropriate to use a suffix trie to find all start locations of F in a string S, especially if S is not dominated by only a few characters (aka the character count for each character that appears in the string is roughly the same).  When the character count for one character is much higher than the other characters, the branch in the trie that starts with that character might contain O(|S|^2) nodes, which would result in O(|S|^2) runtime to traverse to those leaves.

If you know of other uses for tries, please let me know by leaving a comment at the end of the article!

Conclusion

As always, I’m here to answer your questions and help you understand these concepts! I’m also always looking for corrections and discussions.  So, if you have questions, see errors, or want to discuss improvements to the content of this article, just leave a comment below!

Leave a Reply

Your email address will not be published. Required fields are marked *