I recently interviewed with a large company where I was asked how I would check that a given word was a valid English word.  Knowing that there were on the order of a few tens of thousands of stems, and that each stem may have many variations, I figured that there were maybe a few hundred thousand possible words in the English language.  With each word having average length probably in the 7 range, there might be a million or so characters in this set.  I concluded that the check could be performed by searching an in-memory dictionary, and proposed either a std::unordered_set, or a custom trie, with a performance test necessary to see which would be faster.

I decided to perform this test.

Continue reading

I recently was asked essentially this question in an interview: given a string S and a set of strings P, can the strings from P be concatenated with repetitions to form S?  Or put another way, can strings from P with repetitions cover S without overlaps?

My first instinct was to use suffix tries and an overlap graph, but the solution that we eventually reached used no sophisticated data structures, just the array of strings and some recursion in a divide and conquer approach.  Here’s my extended thoughts on the problem.

Continue reading

Suffix trie for string "panamabananas"

I’ll admit it up front, this is not as much of a “when to use it” post as it is a “what is it” post.  Never fear, I’ll still fill you all in on when to use it.  Also, I’m talking about the data structure for storing strings, not the synonym for attempts.

What is it?

A trie is a data structure that is used for fast string searching.  A trie has a root node, and a node for each character in the string.  The nodes are constructed such that traversing the tree from the root to a leaf reconstructs one of the strings that is stored in it.

Continue reading

I’m currently enrolled in Andrew Ng’s Machine Learning class on coursera.org (highly recommended!) and thought I’d write a post about one of the first topics we’re covering, linear regression.  Though Dr. Ng explains the mechanics of this technique very clearly, there’s not many examples of situations in which the technique is commonly used, or examples of the types of problems that are solved by the techniques.  I’ll do my best to fill in this gap!

What is linear regression?

Linear regression is the technique of approximating output values for a given input set using a polynomial function.  Given an input matrix X & an output vector Y, find coefficients A & B such that XA + B produces a vector that is as close to Y as possible.  Finding the coefficients A & B allows a person to then predict output values given similar input values.

When to use it?

Linear regression assumes that the relationship between the input values in X and the dependent values in Y have a linear relationship.  Also, linear regression produces values that can be used as coefficients in a continuous function.  So, if you suspect that your inputs and outputs have a linear relationship, and the output is effectively continuous (rather than discrete), try using linear regression to approximate the relationship.

Continue reading

Let’s say you’re developing a simple alternative to the TriMet  website, where riders can (among other things) check real-time location of buses and trains.  This means that your website will need to query the TriMet TransitTracker service when one of your site’s users performs a search.

So now your life has many more problems than just a moment ago.  How does your code get called when the TransitTracker service:

  • returns the real-time location information?
  • is unavailable?

Further, how do you implement your site so that it shows on-time arrivals in green, and late arrivals in red?

Continue reading