Like the Abstract Factory pattern, this enforces object cohesion – the SubclassOfCreator creates only the types that it can work with. It also allows for extensibility, because the framework is only providing the ICreator and IProduct interfaces. The client code will derive from ICreator and IProduct, which eliminates the need for client-specific behaviors in the framework’s code.

So let’s look at what the code for a Factory Method might look like in C++:

```
class IProduct {
public:
    virtual ~IProduct() = default;
    virtual void DoSomething() = 0;
    virtual void DoSomethingElse() = 0;
    ...
};

class ICreator {
public:
    virtual ~ICreator() = default;
    virtual std::unique_ptr<IProduct> CreateIProduct() = 0;
    void PerformOperation();
    ...
};
```

These two classes are abstract classes and can’t be instantiated. They must be subclassed, and the various operations must be implemented in the subclasses. The subclasses look like this:

```
class ConcreteProduct : public IProduct {
public:
    virtual void DoSomething() { mValue++; }
    virtual void DoSomethingElse() { mValue--; }
protected:
    int mValue;
};

class ConcreteCreator : public ICreator {
public:
    virtual std::unique_ptr<IProduct> CreateIProduct() {
        return std::unique_ptr<IProduct>(new ConcreteProduct());
    }
};

int main() {
    ...
    ConcreteCreator cc;
    std::unique_ptr<IProduct> myProduct = cc.CreateIProduct();
    myProduct->DoSomething();
    ...
}
```

Here we see that ConcreteCreator::CreateIProduct() creates a new ConcreteProduct. For easy memory management I’m using a std::unique_ptr, but it’s by no means required. With this scheme, the client lets the ConcreteCreator create whatever subclass of IProduct it thinks is best.

This technique is valuable when the ConcreteCreator needs to be extended with a coherent object. For example, some simple racing games like Mario Kart allow the user to choose a car, and the driver that’s shown on the screen is specific to the car. With the Factory Method pattern, there would be an ICar and an IDriver. ICar would have a method, CreateIDriver(). Each car (MarioCar, LuigiCar, PrincessCar, etc.) would implement CreateIDriver to return the corresponding MarioDriver, LuigiDriver, PrincessDriver, and so on.
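A minimal sketch of how the kart example might look. The MarioCar/MarioDriver names come from the example above, but everything else (the Name() method, the return values) is my own illustration, not the real game's code:

```cpp
#include <memory>
#include <string>

class IDriver {
public:
    virtual ~IDriver() = default;
    virtual std::string Name() const = 0;  // illustrative method, assumed
};

class MarioDriver : public IDriver {
public:
    std::string Name() const override { return "Mario"; }
};

class ICar {
public:
    virtual ~ICar() = default;
    // The factory method: each car knows which driver belongs with it.
    virtual std::unique_ptr<IDriver> CreateIDriver() const = 0;
};

class MarioCar : public ICar {
public:
    std::unique_ptr<IDriver> CreateIDriver() const override {
        return std::make_unique<MarioDriver>();
    }
};
```

Client code that holds an `ICar*` never needs to know which concrete driver it gets; the car guarantees the pairing.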

This could also be useful in other domains, such as extensible image processing software. Extensible image processing software might allow third party developers to add items in a special menu. When clicked, the menu item would open a new, extension-specific dialog. This could be implemented with IExtension and IExtensionDialog interfaces. Third-party developers would subclass the IExtension interface to initialize the extension and set the menu item text, and then implement CreateIExtensionDialog() so it returns their extension-specific dialog.

Please leave a comment if this has helped you, or if there’s anything else I can do to help you understand when to use the Factory Method!


In the real world, a factory is a building that produces one or more products. A steel factory might produce beams & sheets, while a shoe factory might produce all the colorways & sizes for both a running and a tennis shoe.

In the software world, a factory is a function or method that creates object instances. For example, a game might use an Enemy factory to randomly produce a GroundEnemy or a FlyingEnemy. Its signature might look something like this:

```
enum EnemyType { GroundEnemyType, FlyingEnemyType };

class Enemy {
public:
    virtual void move(Direction d) = 0;
    virtual void attack() = 0;
};

class GroundEnemy : public Enemy { … };

class FlyingEnemy : public Enemy { … };

Enemy* CreateEnemy(EnemyType et);
```

Notice that the factory returns a pointer to an Enemy, which is a parent class of both GroundEnemy and FlyingEnemy. The subclasses would implement the pure virtual functions move() & attack() to implement subclass-specific actions. So CreateEnemy can be used by client code to produce either of the subclasses, and then use the common interface to manipulate them.
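One way CreateEnemy might be implemented is a simple switch on the type tag. This sketch condenses the declarations above and assumes Direction is an enum defined elsewhere; the empty move()/attack() bodies stand in for the subclass-specific actions:

```cpp
enum class Direction { Left, Right };  // assumed to be defined elsewhere

enum EnemyType { GroundEnemyType, FlyingEnemyType };

class Enemy {
public:
    virtual ~Enemy() = default;
    virtual void move(Direction d) = 0;
    virtual void attack() = 0;
};

class GroundEnemy : public Enemy {
public:
    void move(Direction) override {}   // subclass-specific behavior elided
    void attack() override {}
};

class FlyingEnemy : public Enemy {
public:
    void move(Direction) override {}
    void attack() override {}
};

// The factory maps a type tag to a concrete subclass and returns it
// through the common Enemy interface.
Enemy* CreateEnemy(EnemyType et) {
    switch (et) {
        case GroundEnemyType: return new GroundEnemy();
        case FlyingEnemyType: return new FlyingEnemy();
    }
    return nullptr;  // unreachable for valid EnemyType values
}
```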

Now, we can imagine that each kind of Enemy has its own kind of weapon that shoots differently behaving Projectiles: bullets from a pistol and bombs from a carpet bomb. Let’s assume that only ground enemies can use pistols, and only flying enemies can use carpet bombs. How can the client create Projectiles that match with the Enemy?

This is the problem that the Abstract Factory attempts to solve. It does so by putting a factory in the object created by a factory.

```
class Projectile {
public:
    …
    virtual int getDamage() = 0;
    virtual float getInitialHorizontalVelocity() = 0;
    …
};

class PistolProjectile : public Projectile { … };

class CarpetBombProjectile : public Projectile { … };

class Enemy {
public:
    …
    virtual Projectile* createProjectile() = 0;
    …
};

class GroundEnemy : public Enemy {
public:
    …
    virtual Projectile* createProjectile() { return new PistolProjectile(); }
    …
};

class FlyingEnemy : public Enemy {
public:
    …
    virtual Projectile* createProjectile() { return new CarpetBombProjectile(); }
    …
};
```

The client creates a Projectile by calling the factory method createProjectile() in the Enemy class.

So to recap: client code might want to randomly get a ground or flying enemy. Then, the client code wants to shoot the enemy’s weapon and track the projectile’s path. To do so, we could create a factory method that returns either a GroundEnemy or a FlyingEnemy. In both classes, we implement createProjectile() so that it returns a new instance of the Projectile that the enemy uses. This allows the client to use one code path to simply create an Enemy, then create a matching Projectile, and ensure that no Projectiles are used with the wrong Enemy. Thus, we’ve maintained object coherence.
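Here's a sketch of that single client code path, condensing the classes above into a compilable unit. The damage values and the ShootOnce helper are my own illustrations, not anything specified in the text:

```cpp
#include <memory>

class Projectile {
public:
    virtual ~Projectile() = default;
    virtual int getDamage() = 0;
};

// Illustrative damage values -- the post doesn't specify any.
class PistolProjectile : public Projectile {
public:
    int getDamage() override { return 1; }
};

class CarpetBombProjectile : public Projectile {
public:
    int getDamage() override { return 10; }
};

class Enemy {
public:
    virtual ~Enemy() = default;
    virtual Projectile* createProjectile() = 0;
};

class GroundEnemy : public Enemy {
public:
    Projectile* createProjectile() override { return new PistolProjectile(); }
};

class FlyingEnemy : public Enemy {
public:
    Projectile* createProjectile() override { return new CarpetBombProjectile(); }
};

enum EnemyType { GroundEnemyType, FlyingEnemyType };

Enemy* CreateEnemy(EnemyType et) {
    if (et == GroundEnemyType) return new GroundEnemy();
    return new FlyingEnemy();
}

// The single client code path: whichever Enemy the factory returns,
// createProjectile() always produces the matching Projectile.
int ShootOnce(EnemyType et) {
    std::unique_ptr<Enemy> enemy(CreateEnemy(et));
    std::unique_ptr<Projectile> shot(enemy->createProjectile());
    return shot->getDamage();
}
```

Note that client code never names PistolProjectile or CarpetBombProjectile, so it cannot pair a projectile with the wrong enemy.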

This is all nice when you want to deal with multiple sets of coherent objects that have the same interface. But what if they don’t have the same interface, i.e., the object sets have inconsistent interfaces?

For example, let’s say you want to support 2 fictional UI toolkits, UIFrameworkForWindows and UIFrameworkForMac. These toolkits will allow your app to have a more or less native appearance on both Windows and Mac. They both support Windows, Scrollbars, and all the common UI elements. Thus, you can create an interface for your Abstract Factory that has methods like CreateWindow, CreateScrollbar, and so on.

But UIFrameworkForWindows also has a RightButtonContextMenu class, which you would like to use because you feel it will be a valuable addition to the UI on the Windows platform. However, there’s no corresponding class in UIFrameworkForMac. What should the Abstract Factory interface look like? If you add a CreateRightButtonContextMenu method, the UIFrameworkForMac wrapper will need to implement it, perhaps as an empty method. This feels like a workaround and not “clean code.” If you don’t add the CreateRightButtonContextMenu method, then client code will need to detect the platform that is being used, and only call CreateRightButtonContextMenu when running on Windows. This eliminates much of the value of the Abstract Factory pattern, which promises to give client code a single interface to any of the many underlying object sets.

The Abstract Factory pattern can be used when you have two object sets with very similar, if not the same, interfaces, but each set needs to be used coherently (those objects only work with other objects from the set, and not from a different set). This occurs in UI toolkits, games, and many more domains. It allows client code to get a handle to a factory that contains factory methods for creating coherent objects. As long as all the objects are created using the Abstract Factory, there is no chance that client code will attempt to create (for example) a MacWindow with a WindowsScrollbar.

Please leave any comments below, I’d love to hear any feedback on other situations where the Abstract Factory pattern is used, as well as any other drawbacks!


Frequently in science and engineering, an experiment or test generates data that is non-visual in nature. For example, in my last job, a co-worker ran tests comparing the frequency of encountering a bug while varying inputs of a 2 input, 32-bit arithmetic operation. He had a hypothesis that the frequency of encountering the bug was related to the number of on/true bits in the inputs, so he aggregated the data by the number of on/true bits in each input. This generated a matrix of values: one axis for the number of true bits in one of the inputs, one axis for number of true bits in the other input, and the value of the matrix at X, Y was the number of times that the bug was encountered when each input had that many on/true bits.

It is immediately apparent that it is difficult to detect trends from a 32×32 matrix (1024 values) of integers. Rather than look at numbers, it is far easier to detect a trend in this many values by visualizing it.

But how to visualize the data? The data originated from an arithmetic operation. There is no concept of color or shape in 32-bit inputs, even when aggregating the data.

One method of visualization is to assign a color for each frequency. So if the minimum number of times the bug was encountered was 0, and the maximum was 100, we might choose 100 different colors and create a map:

Frequency | Color |
---|---|
0 | #FFFFFF |
1 | #EEFFFF |
2 | #EEEEFF |
… | … |
99 | #000011 |
100 | #000000 |

Table 1 – Non-linear color map from white to black

Now we can plot a 32×32 square, where each element in the square contains one of the colors. This makes trends immediately apparent. It will be very visually obvious when the square shows a pattern such as a diagonal/horizontal/vertical black line, one or more black regions, and so on.

Therefore, a color map can help us translate non-visual data into a visualization that can help us spot trends or patterns.
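As a concrete sketch (my own illustration, not tied to any particular plotting tool), a linear white-to-black map over a frequency range can be computed rather than tabulated. GrayColorFor is a hypothetical helper name; it assumes lo < hi:

```cpp
#include <cstdio>
#include <string>

// Map a frequency in [lo, hi] to a gray hex color: white (#FFFFFF) at lo,
// black (#000000) at hi. Values outside the range are clamped. Assumes lo < hi.
std::string GrayColorFor(int value, int lo, int hi) {
    if (value < lo) value = lo;
    if (value > hi) value = hi;
    // Linear interpolation of the channel intensity from 255 down to 0.
    int channel = 255 - (255 * (value - lo)) / (hi - lo);
    char buf[8];
    std::snprintf(buf, sizeof(buf), "#%02X%02X%02X", channel, channel, channel);
    return buf;
}
```

Each cell of the 32×32 square would then be painted with `GrayColorFor(count, minCount, maxCount)`.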

The idea described above can easily be extended to a dataset that is poorly visualized, for example one with few colors or a small color range. A color map could be used to colorize old black & white photos (though, if I had to guess, it would need to be done region by region). Each region would be assigned a base color, for example blue: pixels closer to white would map to lighter blues, while pixels closer to black would map to darker blues. The person doing the colorizing would choose the color for the region, along with its lightest and darkest shades; a color map would then assign every intensity in the region a shade between those two extremes. Choosing very light & dark shades for every region (i.e., a large difference between the light & dark shade) would make the colors very saturated, while choosing light & dark shades close to each other would make the colors washed out.

Even brief reading on the topic of color maps has taught me that the real challenge in creating or using color maps is creating a good color map for the application that will not introduce artifacts. These artifacts include disparate values being mapped to similar colors/luminances, which can cause patterns to appear that don’t actually exist in the data, or can cause real patterns to disappear. It’s also taught me that the default color maps in applications such as Matlab and Octave may not be a good fit for every application. I’m not going to cover this topic, but here are some great resources I found on it:

- https://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/
- http://cresspahl.blogspot.com/2012/03/expanded-control-of-octaves-colormap.html

Based on my very limited research and frankly zero experience with actually using or creating color maps, I think it’s still safe to conclude that a color map can be used to visualize non-visual data. Compared to simply looking at numbers in a line or grid, visualizing data with color maps can reveal trends or patterns that are otherwise simply impossible to see. Compared to greyscale visualization, color maps can impart meaning that may otherwise be missing. This can be done with red/yellow to indicate “hot,” blue/green to indicate “cool,” light/dark to indicate infrequent/frequent – this is what data visualization experts excel at!

Color maps might also be useful to improve the visualization of poorly visualized data, by expanding/decreasing the contrast in a region, or even assigning color where none previously existed.

I’m still learning about this topic and would love to hear your additions or corrections to this post! I would also love to be pointed toward good resources on color maps. Just leave a comment below!


In C++, a simple single-threaded Singleton typically looks like this, which performs lazy initialization:

```
Logger.h
========
class Logger {
public:
    static Logger* GetInstance();
    void log(const char *msg);
protected:
    Logger();
    static Logger *instance_;
    ...
};

Logger.cpp
==========
Logger* Logger::instance_ = nullptr;

Logger* Logger::GetInstance() {
    if (instance_ == nullptr) {
        instance_ = new Logger();
    }
    return instance_;
}
...
```

What we have here is a static variable wrapped in a class. A traditional local or global `Logger` can’t be created because the constructor is protected. Instead, client code has to get an instance with the `GetInstance()` static method, which performs a lazy initialization of the static variable. Thus, during execution, this implementation allows exactly 0 or 1 instance of a `Logger` to exist on the heap. It can be globally accessed by any code by calling `Logger::GetInstance()`.

So it’s worth asking what’s the value of this when compared to a simple global variable? With a Singleton, there’s a guarantee of exactly 0 or 1 instances of the class at any given time, and there’s a difference in when the class is initialized. On the other hand, the two techniques are similar in that they provide globally-accessible state.

**Very important note!** It’s tempting to write a lazy initialized Singleton template class and create singletons like this: `Singleton<Logger>::GetInstance()`. This gives you the worst of Singletons and global variables. You basically have global state, but without the guarantee of single instantiation, because there’s nothing preventing local or global instances of `Logger`.

Search around online even a little bit and you’ll see that there’s some mostly deserved hate for the Singleton pattern for two reasons – it’s easy to misuse and it’s hard to test. In the misuse category, there’s temptation to use a Singleton for things like:

- **Caching** (ex: storing results of specific db queries). This is better handled by a simple memory sharing scheme, like having all db objects keep a (smart) pointer to some memory on the heap. This is easily enforced if the db objects are created using a factory or prototype. A Singleton for this purpose would create a globally accessible cache, but typically only the interface objects need to read or write the cache.
- **Read-only resource sharing** (ex: a list of valid words, or a configuration file). In my opinion, this is usually better handled with a local variable that is passed to the objects/methods that need it. Again, using a Singleton for something like this would create globally accessible state, but usually only a few components need access to the read-only data. In the case of a configuration file, sometimes a component needs only a small subset of the configuration data. In this case, just that data can be passed to the component, rather than making the entirety of the configuration file available globally as a Singleton does.
- **Replacing global variables.** Singletons are card-carrying members of the Gang of Four patterns book, and they provide global state. So, if we just replace the *bad* global variables with the *good* design pattern, there’s no more problem, right? Well, no – we still have most of the same problems that global variables introduce: it’s not always immediately clear which components use the global, so any component could affect any other, and it can be hard to mock the global/Singleton, which makes testing difficult.

The reason that code using Singletons is hard to test is that an object’s dependency on a Singleton can be hidden. Since the Singleton is globally accessible, an object can get ahold of it without any mention in its API.

So let’s remove the hidden dependency drawback from the analysis by saying that all testers and developers are aware of the dependency. A typical method used to isolate the code under test is to mock the dependencies. Let’s say that we have a Singleton named A. GetInstance() returns a pointer to the instance of A. Therefore, it’s possible to configure the Singleton to return a pointer to a MockA which inherits from A, and everything is fine – in fact, one might claim that this pattern *enables* good testing patterns.
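A sketch of that arrangement, using the A/MockA names from the text. The SetInstanceForTesting hook and the return values are my own additions for illustration; a real codebase might configure the instance differently:

```cpp
class A {
public:
    virtual ~A() = default;
    virtual int Compute() { return 42; }  // illustrative production behavior

    static A* GetInstance() {
        if (instance_ == nullptr) instance_ = new A();  // lazy init, as before
        return instance_;
    }
    // Test hook (my addition): lets a test substitute a mock before first use.
    static void SetInstanceForTesting(A* instance) { instance_ = instance; }

protected:
    A() = default;            // still not constructible by ordinary client code
    static A* instance_;
};
A* A::instance_ = nullptr;

class MockA : public A {
public:
    MockA() = default;                    // accessible because MockA derives from A
    int Compute() override { return 0; }  // canned value for tests
};
```

Code under test that calls `A::GetInstance()->Compute()` now receives the mock's canned value, which is exactly the isolation we wanted.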

On the other hand, now we can make the argument that if we need to be able to change our dependencies, perhaps we should use a different technique for managing the dependencies, such as Dependency Injection. The dependency can be made explicit (rather than hidden), which may reduce bugs and complications when using the code.

So we understand the significant problems with Singletons – even if there is a valid use for it, it still introduces problems with testing code that uses it. What are the reasons to use a Singleton? This can be broken into two questions: what are the valid reasons to use global state, and what are the valid reasons to restrict instantiation of an object to exactly 0 or 1 copies?

A valid reason to use global state is that the state is truly needed globally. Perhaps many components in many layers need access to the data or object, such as a logger or a component that analyzes performance. This is especially relevant if the object hierarchies or call trees are deep. When there are many layers, passing a local variable to every component of the hierarchy adds a parameter to many constructors, methods, or functions. Often it’s simply much more work to add the parameters, and it doesn’t necessarily make the code more robust or performant. In this case, global state may be the best option.

Reasons to restrict the object to exactly 0 or 1 copies of an object include:

- **Large initialization cost.** Perhaps there is a large initialization cost of an object, and you want to control exactly when the object will be initialized. This can be done with a Singleton.
- **Large ongoing resource use.** Perhaps the object consumes a lot of memory and you don’t want (or the machine can’t handle) more than one of these objects at a time. Or, maybe the object kicks off a CPU-intensive thread, and you only want one of these threads running at a time. These would be good reasons to restrict the number of instantiations of the object to 0 or 1.
- **Rare usage.** If the cost of an object is large, **and** the code only rarely uses it (as in, if the executable is run N times, the object may be used in only a small percentage of those executions), it could be especially valuable to use a lazy initialized Singleton. The object will not be instantiated most of the time, so those resources could be used by other components.

Furthermore, there could be other considerations with using a Singleton. A Singleton can be fairly easily converted to a Factory by changing the implementation of GetInstance(), so if it’s likely that the “0 or 1 instance” requirement will change to “N instances,” a Singleton might save some work in the future. And, as previously mentioned, a Singleton can easily return an interface rather than a class, which could allow it to return an object or its mock depending on context. This could be useful for testing, even if there are other options for managing dependencies.

So I propose this two part test for when to use a Singleton. A component should satisfy both of these conditions to be considered for a Singleton.

- Many components at many levels need access to the object’s state (global state is needed)
- Large object cost, especially if it’s not always needed during execution (large initialization/resource cost, especially combined with rare use).

If condition 1 is not satisfied but condition 2 is, then a local variable that is passed through parameters is probably preferable because it makes dependencies explicit. If condition 2 is not satisfied but condition 1 is, then a simple global variable might be reasonable, or local variables that are instantiated where needed should be considered.

Singletons have their place in code, but they’re easy to misuse and can introduce more problems than they solve. I propose a two part test to determine if the situation really requires a Singleton, or if the situation can be handled by more explicit or simpler solutions such as local variables that are passed through parameters or instantiated where needed, or simply global variables.

It should be clear that Singletons can provide benefits to an application or library, but the benefits need to be considered against simpler or different options for providing the same benefit, and also the inherent drawbacks with using global state.

Please leave any feedback or questions in the comments, I’d love to hear your thoughts on Singletons!

Design patterns in software development have been heavily influenced by the work of Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides, known as the Gang of Four (GoF). They literally wrote the book on patterns, Design Patterns: Elements of Reusable Object-Oriented Software. In this book, the authors describe patterns for managing object creation, composing objects into larger structures, and coordinating control flow between objects. Since publication, other developers have identified and described more patterns in practically every area of software design. Try googling your favorite software topic + “design patterns” to see what kind of patterns other developers have identified and described: android design patterns, embedded design patterns, machine learning design patterns, etc.

It’s very useful to know about the patterns in abstract, even if you don’t know the details of a particular pattern. As the authors state, knowing the patterns helps developers identify and use the “right” design faster. Knowing the patterns provides these benefits to the developer:

- They describe common problems that occur in software development. As a developer, especially a new developer, it’s easy to think of every problem as completely unique to the program that you’re writing. But very often, they are unique only because of poor modeling or simply lack of experience. Simply knowing common problems can help a developer model a system in terms of those problems, which often reduces the size, number, and complexity of the problems that remain to be solved.
- They are considered “best known methods” for solving typical/frequent problems that arise in programming & architecture. Knowing the “best known method” for solving a problem eliminates a lot of thought, effort, and time devoted to solving it, which reduces the time that a developer must spend on a particular problem.
- Knowledge of the patterns simplifies & clarifies communication between developers when talking about a particular problem or solution. When a developer who is familiar with design patterns hears “I used the singleton pattern on the LogFile class,” the developer immediately knows that (if implemented correctly) there will only be one or zero instances of the LogFile class living in the program at one time.

It’s pretty easy to describe when to use a pattern – whenever your program contains the exact problem that is solved by one of the patterns. They can even be used if your program contains a similar problem to that solved by one of the patterns, but in this case, the implementation of the pattern may need to be modified to fit the particulars of your program.

However, it’s not always obvious that your software’s problem(s) can be solved by a GoF pattern. In other words, the program may be such a mess that it needs to be refactored simply to transform a problem into one that can be solved with a GoF pattern. Hopefully by learning about the patterns, you’ll be able to recognize non-obvious applications in your own software.

I’ll cover the patterns by subject, and within a subject I’ll try to cover what I feel are the most broadly applicable patterns first. Stay updated by following me on RSS, linkedin, or twitter (@avitevet)!

The simplex method is an algorithm for finding a maximal function value given a set of constraints. We’ll start with a non-trivial example that shows why we need a rigorous method to solve this problem, then move on to a simple example that illustrates most of the main parts of the simplex method. You’ll learn when to use it, you can check out my cheatsheet, and you can review my code on github!

To look at a concrete example of this kind of non-trivial problem, suppose you are CEO of an agricultural company that grows 3 types of crops – wheat, corn, and alfalfa. Since you’re a pragmatic and conscientious farmer, you rotate crops to prevent disease and increase yield. This year, based on the crop’s locations, some soils & topologies are nearly perfect for a given crop (and adding water or fertilizer will actually reduce yields) while other soils & topologies need assistance to produce maximum yield.

Your scientists have determined, for each crop in its planned location, the change in yield for each additional $1000 spent on irrigation, fertilization, herbicide application, and pesticide application; these values are described in Table 1 below. Your job is to determine the cheapest method to satisfy your customer’s demands of 80k pounds of wheat, 50k pounds of corn, and 100k pounds of alfalfa.

Action | Wheat | Corn | Alfalfa |
---|---|---|---|
Base production (yield while taking no action) | 17000 | 50000 | 1000 |
Irrigate | +400 | -100 | +500 |
Fertilize | -300 | +500 | +100 |
Apply weed killer | -500 | -200 | +200 |
Apply pesticide | +50 | +300 | +400 |

**Table 1:** The effect on yield in pounds, per $1000 of spending on the given action

One method of solving this problem is by trial and error, but this is very likely to produce a plan that is not the cheapest. For example, you could spend $150,000 on irrigation, $0 on fertilizer, $0 on weed killer, and $60,000 on pesticide. This would result in the following yields:

Crop | Yield |
---|---|
Wheat | 17000 + 150 * 400 - 0 * 300 - 0 * 500 + 60 * 50 = 80000 |
Corn | 50000 - 150 * 100 + 0 * 500 - 0 * 200 + 60 * 300 = 53000 |
Alfalfa | 1000 + 150 * 500 + 0 * 100 + 0 * 200 + 60 * 400 = 100000 |

**Table 2:** Example yields given trial & error values for each possible action.

With this spending, you’re able to produce exactly the amount of wheat and alfalfa that you need, though you overproduce corn. However, how can we verify that this is an optimal (cheapest) solution to producing enough of each crop?

To develop the simplex method, we’ll look at a simpler example that can be easily plotted, for which the correct answer is intuitive and easily verified.

Suppose you’re in charge of a bake sale. You’ve decided to sell two products, cupcakes and mini pies. Cupcakes sell for $1, and mini pies sell for $2; note that this example does not reflect my personal opinion of the relative deliciousness of cupcakes and mini pies. You’ve determined that you have enough ingredients to make 120 cupcakes, or 40 mini pies, but you only have enough oven time to bake 50 total items. You know that whatever you make will sell out. Your goal is to maximize the revenue, so how many of each product should you make?

We can write out our optimization like this, where X is the number of cupcakes we can make and Y is the number of mini pies we can make:

```
0 <= X <= 120
0 <= Y <= 40
X + Y <= 50
```

We want to maximize our total revenue R:

```
1 * X + 2 * Y = R
```

Here’s the graphical representation of the inequalities:

This shows the intersection of the regions defined by various inequalities: X is between 0 & 120, Y is between 0 & 40, and X + Y is less than 50. So it’s pretty obvious that the optimal solution must be somewhere in this region. But how do we know what’s the optimal solution?

This problem is simple enough to be solved graphically. For any constant R, X + 2Y = R defines a line. For example, here are the lines for X + 2Y = 50, X + 2Y = 90, and X + 2Y = 100 overlaid on the region.

You can see that the line for R = 100 does not intersect the region, the line for R = 90 intersects the region at exactly one point, and the line for R = 50 intersects the region in the range X = [0, 50].

So, we can see that this region is convex. We can also see that if R = 90 + epsilon, the line X + 2Y = R will not intersect the region. Therefore 90 is the maximum value of X + 2Y over the region, and it occurs at (X, Y) = (10, 40). This is exactly what we would expect given our constraints – to maximize our revenue, sell as many $2 mini pies as possible, then fill the remaining oven time with $1 cupcakes.
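Because the optimum lies at a vertex, this tiny example can even be sanity-checked by brute force: evaluate the objective at each vertex of the feasible region. The vertices are read off the plot of the constraints; this is just a check of the graphical answer, not the simplex method itself:

```cpp
#include <utility>
#include <vector>

// Objective: R = X + 2Y
double Revenue(double x, double y) { return x + 2 * y; }

// Vertices of the region bounded by X >= 0, 0 <= Y <= 40, X + Y <= 50,
// taken from the plot above.
std::pair<double, double> BestVertex() {
    std::vector<std::pair<double, double>> vertices = {
        {0, 0}, {50, 0}, {10, 40}, {0, 40}};
    std::pair<double, double> best = vertices[0];
    for (const auto& v : vertices)
        if (Revenue(v.first, v.second) > Revenue(best.first, best.second))
            best = v;
    return best;  // (10, 40): 10 cupcakes, 40 mini pies, R = 90
}
```

Checking every vertex only works because this example is tiny; the simplex method exists precisely because real problems have far too many vertices to enumerate.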

Here are the important points to note in this example:

- The inequalities define a region in space, called the feasible region
- There is at least one solution, because the region is non-empty
- The objective function (what we’re trying to maximize) is a line
- The solution to the optimization problem occurs at a vertex of the feasible region

In fact, it sort of makes sense that the solution is at a vertex: a given inequality must have its minimum or maximum value along the boundary that it defines. This is the key insight that leads to the simplex algorithm: we will find some vertex of the feasible region, then travel along an edge to another vertex with a non-smaller value for the objective function, and so on, until we cannot find any vertices that have a larger value for the objective function.

I’m not going to cover things like the setup and terminology in detail. There are excellent explanations in (for example) Introduction to Algorithms, and you can also review my reference sheet. I just want to cover how the simplex algorithm works. But, to make this a pretty much standalone article, I’ll cover those things briefly.

Also, for consistency and without losing generality, I’m going to change most of the variable names to *xi*, where *i* is the index of the variable.

Converting a problem into standard form is basically:

- Writing the objective function as a maximization objective in a first degree polynomial
- Adding inequalities for all the variables so they are greater than or equal to zero
- Writing the rest of the inequalities as less-than-or-equal-tos

The bake sale example would be written in standard form like this:

```
Maximize:
    1*x1 + 2*x2 = z
Subject to:
    x1 <= 120
    x2 <= 40
    x1 + x2 <= 50
    x1, x2 >= 0
```

The slack form converts the standard form into an equivalent system of equalities and inequalities. This makes the simplex algorithm easier for a computer to process, because we’re dealing primarily with equalities.

To perform the conversion, we’ll transform the system of inequalities so that all the inequalities are transformed into non-negativity inequalities. This is done by introducing new variables.

For example:

```
x1 <= 40
```

converts to:

```
x3 = 40 - x1
x3 >= 0
```

Here’s the bake sale example in slack form:

```
Maximize:
    1*x1 + 2*x2 = z
Subject to:
    x3 = 120 - x1
    x4 = 40 - x2
    x5 = 50 - x1 - x2
    x1, x2, x3, x4, x5 >= 0
```

In the slack form, the basic variables are on the left side, and the non-basic variables are on the right side.

A feasible solution is a setting of the variables that satisfies all the constraints. Basically it’s a set of variable values that appear inside the shaded area of the graph.

The feasible region is the set of feasible solutions. Basically, it’s the shaded area of the graph.

There are some “primitive” (ha!) operations that we’ll use in the algorithm.

To find the basic solution, set all the non-basic variables in the slack form to zero. This is a simple procedure for generating values of the basic variables, because the basic variables will take on the values of the constants from each equality. The basic solution for the slack form above is:

x1 = 0
x2 = 0
x3 = 120
x4 = 40
x5 = 50

Given these variable values, the objective function will have the value 0.

1*0 + 2*0 = 0

A pivot swaps a basic for a non-basic variable by solving for the non-basic variable, then substituting the resulting equation into every other equation. **This produces an equivalent system of equations** because all we’re doing is shuffling things around.

For example, if we wanted to pivot x2 & x4, we would solve for x2 in the equation for x4 and find that:

x2 = 40 - x4

Then we could substitute this into all other equations:

Maximize:    x1 + 2*(40 - x4) = z
Subject to:  x3 = 120 - x1
             x2 = 40 - x4
             x5 = 50 - x1 - (40 - x4) = 10 - x1 + x4
             x1, x2, x3, x4, x5 >= 0

We can get a basic solution for this system by setting x1 = x4 = 0:

x1 = 0
x2 = 40
x3 = 120
x4 = 0
x5 = 10

And if we plug these values into the resulting objective function, we see that *z* has the value 80. This is comforting because previously z = 0, and the simplex algorithm is supposed to incrementally find non-smaller values for the objective function, which it has. An increasing value for the objective function probably means we’re doing the right things.

I glossed over a step above by choosing to pivot x2 & x4. What we’re trying to do with the simplex algorithm is gradually increase the value of the objective function until it can’t be increased any more. We previously saw that by pivoting x2 & x4, the value of the objective function increased from 0 to 80. It’s possible that a choice could cause the objective function value to remain constant, or even decrease. How do we know which variables to choose so that a pivot produces a basic solution that increases the objective function value? Let’s review the original slack form:

Maximize:    1) 1*x1 + 2*x2 = z
Subject to:  2) x3 = 120 - x1
             3) x4 = 40 - x2
             4) x5 = 50 - x1 - x2
             5) x1, x2, x3, x4, x5 >= 0

Recall that all variables *xi* must be non-negative due to the non-negativity constraints in line 5. Looking at the original objective function x1 + 2*x2 = z in line 1, we can increase z by increasing x1 or x2, because x1 & x2 have positive coefficients.

Let’s choose x2. How far can it be increased without violating the constraints? Well, it can be increased to 40 in line 3, because any value > 40 would cause x4 to become negative. It can be increased to 50 in line 4, because if we set x1 to the minimum value (0), increasing x2 > 50 would cause x5 to become negative. We’ll choose to pivot around the minimum of these possible increases to guarantee that we don’t violate any constraints, therefore we choose to pivot around x4 in line 3.

As we previously saw, this produces the new set of equations:

Maximize:    1) x1 + 80 - 2*x4 = z
Subject to:  2) x3 = 120 - x1
             3) x2 = 40 - x4
             4) x5 = 10 - x1 + x4
             5) x1, x2, x3, x4, x5 >= 0

It’s worth mentioning a final note about this pivot choice. The initial basic solution had (x1, x2) = (0, 0), which is a vertex of the feasible region in the graph. After the first pivot, the basic solution for the system above has (x1, x2) = (0, 40). This is also a vertex of the feasible region in the graph above. So we’ve used the simplex method to move from one vertex to another, increasing the value of the objective function along the way! Neat.

Now we continue with similar reasoning. In the objective function, since x4 must be non-negative and it has a negative coefficient, any valid value for x4 must cause z to remain constant or decrease. Any valid value for x1 must cause z to remain constant or increase. Therefore we’ll choose to pivot around x1.

From line 2, we see that x1 can increase to 120 without violating the non-negativity constraint on x3. From line 4, we see that x1 can increase to 10 without violating the non-negativity constraint on x5. So, choosing the minimum of these values, we will choose to pivot around x5 in line 4:

x1 = 10 - x5 + x4

Which produces the following system:

Maximize:    1) (10 - x5 + x4) + 80 - 2*x4 = z
                90 - x5 - x4 = z
Subject to:  2) x3 = 120 - (10 - x5 + x4) = 110 + x5 - x4
             3) x2 = 40 - x4
             4) x1 = 10 - x5 + x4
             5) x1, x2, x3, x4, x5 >= 0

The basic solution for this system is found by setting x4 = x5 = 0:

x1 = 10
x2 = 40
x3 = 110
x4 = 0
x5 = 0

And the value of the objective function given this basic solution is z = 90. Since all the coefficients in the objective function are non-positive, we know that the objective function value cannot be increased any further without violating the non-negativity constraints. Therefore this is the maximal value of the objective function and we are done.

**Important note:** we found the same solution graphically and algorithmically (secret sigh of relief)! We matched both the maximum value of the objective function, and the point where it occurs. So we got that going for us, which is nice.

So I hope that’s a simple explanation of the simplex method. In a nutshell, after converting the problem into slack form, we iteratively perform these operations:

- Find a basic solution
- Compute the objective function value using basic solution values
- Choose pivot variables – stop when all coefficients of the objective function are non-positive
- Perform a pivot
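To make those steps concrete, here is a minimal Python sketch of the whole loop, working directly on slack-form dictionaries. It assumes the initial basic solution is feasible and skips degeneracy handling, and it is not the code from my gist – just an illustration.

```python
def substitute(expr, var, replacement):
    """Replace `var` in a linear expression {var: coeff, None: constant}."""
    k = expr.pop(var, 0.0)
    if k:
        for v, c in replacement.items():
            expr[v] = expr.get(v, 0.0) + k * c

def simplex(objective, constraints):
    """Maximize z subject to slack-form equations and x_i >= 0.

    objective:   {None: constant, "x1": coeff, ...} representing z
    constraints: {"x3": {None: 120.0, "x1": -1.0}, ...} meaning x3 = 120 - x1
    Assumes the initial basic solution (all non-basic variables zero) is feasible.
    """
    obj = dict(objective)
    rows = {b: dict(r) for b, r in constraints.items()}
    while True:
        # Entering variable: any variable with a positive coefficient in z.
        candidates = [v for v, c in obj.items() if v is not None and c > 1e-9]
        if not candidates:
            break  # all coefficients non-positive: z can't be increased
        entering = max(candidates, key=lambda v: obj[v])
        # Leaving variable: minimum-ratio test, so no constant goes negative.
        leaving, limit = None, float("inf")
        for b, row in rows.items():
            a = row.get(entering, 0.0)
            if a < 0 and -row[None] / a < limit:
                leaving, limit = b, -row[None] / a
        if leaving is None:
            raise ValueError("objective is unbounded")
        # Pivot: solve the leaving row for the entering variable ...
        row = rows.pop(leaving)
        a = row.pop(entering)
        new_row = {leaving: 1.0 / a}
        new_row.update({v: -c / a for v, c in row.items()})
        # ... then substitute it into z and every other row.
        substitute(obj, entering, new_row)
        for b in rows:
            substitute(rows[b], entering, new_row)
        rows[entering] = new_row
    return obj[None], {b: row[None] for b, row in rows.items()}

# The bake sale problem in slack form:
best, solution = simplex(
    {None: 0.0, "x1": 1.0, "x2": 2.0},
    {
        "x3": {None: 120.0, "x1": -1.0},             # x3 = 120 - x1
        "x4": {None: 40.0, "x2": -1.0},              # x4 = 40 - x2
        "x5": {None: 50.0, "x1": -1.0, "x2": -1.0},  # x5 = 50 - x1 - x2
    },
)
print(best, solution["x1"], solution["x2"])  # 90.0 10.0 40.0
```

Running it reproduces the hand calculation above: the first pivot brings in x2 and removes x4, the second brings in x1 and removes x5, and the loop stops at z = 90.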

This technique is surprisingly powerful. It allows you to find an optimal value of a linear function given an arbitrary number of linear constraints over an arbitrary number of variables. Situations where this might be valuable include:

- Diet management – find the cheapest combinations of foods that will satisfy your nutritional requirements (warning: may produce unpalatable diets!)
- Crew scheduling – find minimum cost for airline crews subject to ensuring every flight has a crew, crews can’t work more than X hours/day, crews must have minimum time between flights, etc.
- Transportation – find the cheapest route for a good from one city to another while accounting for driver compensation, depreciation of value of goods, toll roads, etc.

Aside from being limited to linear relationships, use of this technique is truly limited only by your imagination! Perhaps a better way to determine when to use this technique is by asking some questions:

- Can the problem be expressed as minimizing or maximizing a quantity (yes/no)?
- Can the relationships between the variables be expressed linearly (yes/no)?

If the answer to these two questions is yes, the problem can be formulated as a linear program and the simplex method can be used.

I’ve posted a slightly modified version of my own simplex algorithm code as a gist on github, written while I took an Advanced Algorithms and Complexity course on Coursera. It was originally written to solve the diet problem, but is easily generalized to any linear problem. Feel free to comment here or directly on the gist if you have any questions!

I ignored many very important, perhaps even fundamental, aspects of linear programming and the simplex method, including:

- How can we determine if there are any feasible solutions to a given set of inequalities? (answer: there is an initialization procedure that I didn’t discuss that can tell us whether there are any feasible solutions)
- What if some constraints are equalities rather than inequalities? (answer: replace with 2 inequalities; replace x1 = x2 with x1 <= x2 and x1 >= x2).
- How do we handle a minimization problem instead of a maximization problem? (answer: min(z) == max(-z))
- Does the algorithm work if the initial basic solution is not a feasible solution? (answer: no – the initialization procedure can find a feasible solution if the initial basic solution is not feasible)
- How do we handle constraints that are greater-than-or-equal-to, instead of less-than-or-equal-to? (answer: x1 >= x2 is equivalent to -x1 <= -x2)
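The last three transformations are mechanical enough to express in code. A small Python sketch (the helper names are my own):

```python
def geq_to_leq(coeffs, bound):
    """x >= b  becomes  -x <= -b."""
    return [-c for c in coeffs], -bound

def eq_to_leqs(coeffs, bound):
    """x == b  becomes the pair  x <= b  and  -x <= -b."""
    return [(list(coeffs), bound), geq_to_leq(coeffs, bound)]

def min_to_max(objective_coeffs):
    """min(z) == max(-z): negate the objective (and negate the optimum back)."""
    return [-c for c in objective_coeffs]

# Example: the equality x1 + x2 = 5 as two <= inequalities
print(eq_to_leqs([1, 1], 5))  # [([1, 1], 5), ([-1, -1], -5)]
```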


FizzBuzz is a “simple” programming interview question. There are a few variants, but it goes something like this:

Tell me pseudocode that, for numbers from 1 to 100:
Prints "Fizz" for numbers evenly divisible by 3.
Prints "Buzz" for numbers evenly divisible by 5.
Prints "FizzBuzz" for numbers evenly divisible by both 3 and 5.
Otherwise prints the number.

It’s “so simple” that interviewers are usually looking for the perfect answer immediately as the first words out of your mouth. But let’s be honest, if the interviewee has never seen it before they’re probably going to say *something* that makes them look less than perfect, whether it’s taking too long, saying something that’s actually wrong, or even something that’s just suboptimal.

To test this, I asked two former colleagues of mine this question, and they both very quickly came up with a 3 branch solution: if number evenly divisible by 15, print fizzbuzz, otherwise if it’s evenly divisible by 5, print buzz, otherwise if it’s evenly divisible by 3, print fizz.

An interviewer could have taken issue with:

- “evenly divisible” – how is this determined?
- What about printing the numbers?

Anything less than immediate perfection makes the interviewee look like they’re not that sharp. But this is clearly not the case – one of them had a PhD in math, the other had a Master’s degree in math, and both had achieved among the highest levels of technical leadership at our former employer (and both were stellar programmers).

The problem is with the expectations. Who can achieve immediate perfection, especially in a relatively high-stress situation like an interview? Next to no one. So how do you ace this question? Either be really, really fast at thinking through code flows, or have seen it before.

IMO, if you’re asked this question, it’s probably more of a litmus test for you than for them; it’s an indicator you probably don’t want to work for this company. So, I think it’s an opportunity to (gently) educate them. You could tell them that you read this article :), that because it’s well specified and easily testable it’s a perfect candidate for a TDD approach, that you would write the 100 lines of expected output, and modify your code until it matched. Then, if there was any discussion about performance, you would use appropriate performance testing to evaluate the options.

Or, you could go down the road of telling one simple, correct algorithm:

# print includes a newline, console.log style
for (i = 1; i <= 100; ++i)
    if (i % 15 == 0)
        print "FizzBuzz"
    else if (i % 5 == 0)
        print "Buzz"
    else if (i % 3 == 0)
        print "Fizz"
    else
        print i
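The pseudocode above translates directly to Python, and shows how the TDD approach might look (the expected-output filename is hypothetical):

```python
def fizzbuzz(n):
    """Return the FizzBuzz string for a single number."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 5 == 0:
        return "Buzz"
    if n % 3 == 0:
        return "Fizz"
    return str(n)

lines = [fizzbuzz(i) for i in range(1, 101)]
print("\n".join(lines[:5]))  # 1, 2, Fizz, 4, Buzz

# In a TDD flow, compare against a hand-written expected file:
# assert lines == open("expected_output.txt").read().splitlines()
```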

I recently had a conversation about FizzBuzz with a person who used it during interviews with junior programmers. With the right expectation, namely that the answer may not be immediately correct, it could be a potentially valuable tool to see how a junior developer is able to work through the problem. On the other hand, he mentioned that since it is used with such prevalence, it loses its value as a thought exercise because junior programmers have probably already seen it. So, there could be value in asking junior programmers to develop FizzBuzz, but only if they have not seen it before. Perhaps ask if they have seen FizzBuzz – if the answer is yes, move on. If no, go ahead and ask, and expect that there will be some imperfections.

I decided to perform this test.

I decided to actually perform two tests: element addition performance, and element checking performance.

I wrote a simple Windows C++ application using the Windows performance analysis technique I described in a gist. In this simple C++ application, lines from an input file are added into the data structure, and lines from another input file are used to search the data structure. The application outputs a single CSV line containing information such as the number of elements added to the data structure, the number of elements for which a search was performed, the number of elements found in the data structure, time to add all elements to the data structure, time to search for all elements, and more.

Then I wrote a python script to call the C++ application using slightly modified input files from the Moby Words II data set. The data set was modified so that the files contained only letters in the ASCII table between space and tilde. This corrected apparent errors in the text, and allowed each trie node to have a smaller range of allowable children. Each of the 16 files in the data set is paired with every one of the 16 files, one serving as the insertion input and the other as the search input, for a total of 256 tests.

The files in the data set can be summarized as follows, courtesy of Project Gutenberg’s description :

**acronyms.txt**: 6,213 acronyms
common acronyms & abbreviations

**common.txt**: 74,550 common dictionary words
A list of words in common with two or more published dictionaries. This gives the developer of a custom spelling checker a good beginning pool of relatively common words.

**compound.txt**: 256,772 compound words
Over 256,700 hyphenated or other entries containing more than one word as well as all capitalized words and acronyms. Phrases were considered ‘common’ if they or variations of them occur in standard dictionaries or thesauruses.

**crosswd.txt**: 113,809 official crosswords
A list of words permitted in crossword games such as Scrabble(tm). Compatible with the first edition of the Official Scrabble Players Dictionary(tm). Since this list has all forms: -ing, -ed, -s, and so on of words, it makes a good addition when building a custom spelling dictionary.

**crswd-d.txt**: 4,160 official crosswords delta
When combined with the 113,809 crosswords file, it produces the official crossword list compatible with the second edition of the Official Scrabble Players Dictionary. (Scrabble is a registered trademark of Milton-Bradley licensed to Merriam-Webster.)

**fiction.txt**: 467 current fiction substrings
The most frequently occurring 467 substrings occurring in a best-selling novel by Amy Tan in 1990.

**freq.txt**: 1,000 by frequency
This file consists of the 1,000 most frequently used English words from a wide variety of common texts listed in decreasing order of frequency.

**freq-int.txt**: 1,000 by frequency internet
This file consists of the 1,000 most frequently used English words as used on the Internet computer network in 1992.

**KJVfreq.txt**: 1,185 King James Version frequent substrings
The most frequently occurring 1,185 substrings in the King James Version Bible ranked and counted by order of frequency.

**names.txt**: 21,986 names
This database contains the most common names used in the United States and Great Britain. Spelling checkers may want to supplement their basic word list with this one.

**names-f.txt**: 4,946 female names
frequent given names of females in English speaking countries

**names-m.txt**: 3,897 male names
frequent given names of males in English speaking countries

**oftenmis.txt**: 366 often misspelled words
many of the most commonly misspelled words in English speaking countries

**places.txt**: 10,196 places
a large selection of place names in the United States

**single.txt**: 354,984 single words
Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

**usaconst.txt**: USA Constitution
The Constitution of the United States, including the Bill of Rights and all amendments current to 1993.

All the tests were run one time on my Broadwell i7-5950HQ CPU @ 2.90GHz, a quad-core hyperthreaded CPU supporting 8 threads. This machine has 16GB of RAM.

I imported the results into an Excel spreadsheet, formatted it nicely, added some fields, and made some charts that I thought would reveal some trends.

All the source code, and the analysis Excel file, are available in my github repo.

Here are the aggregations that I thought were interesting. If you’d like to see more analysis, let me know specifics in the comments! Or, just clone the github repo and start working with the data in the Excel file.

In the tables and charts below, I used “uos” to represent std::unordered_set, “set” to represent std::set, and “trie” to represent the custom trie. This allows the tables and charts to be somewhat more concise.

| Data structure | # insert wins | Min insert winning time | Max insert winning time | Min insert time difference | Average insert time difference | Max insert time difference | Average insert % improvement |
|---|---|---|---|---|---|---|---|
| set | 162 | 142 | 117630 | 1 | 24322 | 134577 | 34 |
| uos | 94 | 111 | 2156 | 5 | 261 | 1406 | 33 |

There is no typo or missing data here – the trie did not win a single test of insertion performance with this input data. Std::set won about 63% of the tests while std::unordered_set won the other 37%. We can see by the average percent improvement that when either data structure won, it outperformed the next best performer by an average of ~33%. However, the min winning time, max winning time, average time difference, and max time difference seem to point to a conclusion that the std::set is faster at storing larger sets of strings, while maybe the std::unordered_set is faster at storing smaller sets of strings.

This chart, with number of strings stored in the set on the x-axis and number of wins on the y-axis, shows the trend precisely. Std::unordered_set wins almost every test of small set size, while std::set wins every test of larger set size.

Deeper investigation shows that std::unordered_set wins in the files **fiction.txt**, **freq.txt**, **freq-int.txt**, **KJVfreq.txt**, **oftenmis.txt**, and **usaconst.txt**. These files typically contain strings where each line contains a phrase of 1 – 3 words (1 – 20 chars), though usaconst.txt contains lines with long strings containing up to 60 or more characters.

Std::set wins in the files **acronyms.txt**, **common.txt**, **compound.txt**, **crosswd.txt**, **crswd-d.txt**, **names.txt**, **names-f.txt**, **names-m.txt**, **places.txt**, and **single.txt**. These files contain short strings of 1 – 3 words, up to ~20 characters.

Since std::set is typically implemented using some kind of binary search tree, and std::unordered_set is typically implemented with an open hash (array of buckets, with each bucket containing a list) it appears that in these implementations there is lower initial overhead with a std::unordered_set, but worse asymptotic performance; there is higher initial overhead with std::set, but better asymptotic performance. Perhaps the initial rebalances in the binary tree, especially when created using sorted data, are more expensive than filling the buckets in the open hash with very few items. However, when the buckets start to accumulate multiple items per bucket and need to be searched linearly, the binary tree can be faster.

Finally, a trie appears to have both higher initial overhead than both std containers, and worse asymptotic insertion performance; this is probably because it potentially creates multiple new nodes for each inserted string, while the std containers probably create one new node per inserted string.
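That node-creation cost is easy to see in a sketch. The toy Python trie below counts new nodes per insert (dict children stand in for the C++ child arrays, but the allocation pattern is the same); a std::set or std::unordered_set would typically allocate one node per inserted string instead:

```python
def trie_insert(root, word, stats):
    """Insert `word`, counting how many brand-new nodes are created."""
    node = root
    for ch in word + "$":            # "$" terminates each stored word
        if ch not in node:
            node[ch] = {}            # one fresh node per previously-unseen character
            stats["nodes_created"] += 1
        node = node[ch]

root, stats = {}, {"nodes_created": 0}
trie_insert(root, "band", stats)
print(stats["nodes_created"])   # 5 new nodes: b, a, n, d, $
trie_insert(root, "banana", stats)
print(stats["nodes_created"])   # 9 total: "banana" reuses b-a-n and adds 4 more
```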

| Data structure | # search wins | Min search winning time | Max search winning time | Min search time difference | Average search time difference | Max search time difference | Average search % improvement |
|---|---|---|---|---|---|---|---|
| trie | 200 | 2 | 49153 | 5 | 4066 | 31051 | 78 |
| uos | 56 | 28 | 80969 | 10 | 4481 | 101197 | 42 |

Again, no missing data in this table – std::set did not win a single search performance test. In this data set, the trie wins approximately 78% of the tests while std::unordered_set wins the other approximately 22%.

The trends in this data are far less clear, so let’s take a closer look at some interesting charts.

This one is a little hard to read but it’s got “percent of inserted strings found” on the x-axis and number of wins on the y-axis.

What this unsurprisingly shows is that when there’s very small intersection between the strings that were inserted and the strings that were searched for, the trie wins. This could be because a trie can theoretically determine mismatch between a string and the dictionary with a single array element read. For example, when comparing **names.txt**, which contains strings all starting with a capital letter, with **freq-int.txt**, which effectively contains strings starting with a number, the trie always finds that the first character of the search string has no branch in the trie’s root node, so it returns false very quickly. On the other hand, when searching a binary tree, the search takes approximately log N time to reach a leaf and determine there’s no match.

On the other hand, when there’s a very large/complete intersection between the inserted strings and the search strings, the std::unordered_set wins. This is again not too surprising, because the trie performs length(searchstring) array element reads which cannot be vectorized. The binary tree must perform more than length(searchstring) character comparisons, but they can be vectorized.

So the vast majority of the trie’s search wins occur when there is very small intersection between the inserted strings and the search strings, for the obvious reason that it is very fast at determining that a search string starts with a letter that none of the inserted strings start with. Here’s the data with those points removed, to get a better sense of general-purpose search performance:

| Data structure | # search wins | Min search winning time | Max search winning time | Min search time difference | Average search time difference | Max search time difference | Average search % improvement |
|---|---|---|---|---|---|---|---|
| trie | 47 | 11 | 49153 | 5 | 6223 | 27741 | 60 |
| uos | 33 | 51 | 80969 | 15 | 7361 | 101197 | 38 |

The trie still wins more tests than the std::unordered_set. Also, in the search tests where it wins, it outperforms the next best performer by an average of 60%. On the other hand, when std::unordered_set wins, it beats the next best performer by an average of 38%.

There’s a similar trend for “percent of searched strings found” and “winning search time” with possibly a slightly more visible trend of the trie performing better with small intersection of inserted & search sets, and std::unordered_set performing better with large/complete intersection. Those charts are in the “search wins vs % searched found” and “search wins vs winning time” tabs of the Excel spreadsheet.

With many use cases, reads/searches are performed far more often than writes/inserts – think of financial transactions, event signups, online store orders, and more. For this reason, it’s often preferable to optimize for read/search performance rather than write performance.

When performing search operations on a string set, it seems clear that the fastest data structures are either a custom trie or a std::unordered_set. When the search data is expected to have a small intersection with the inserted data, such as looking for strings indicating malware in the output of the unix “strings” command, a trie is likely to be the fastest data structure. When the search data is expected to have a large/complete intersection with the inserted data, such as the grader of a spelling test, a std::unordered_set is likely to be the fastest data structure.

However, there are use cases where write/insert performance is more important than read/search performance. For example, an algorithm that finds unique strings in a data set by inserting them into a set ought to consider using std::set for this purpose.

A final note about these performance tests: it’s likely the std::set and std::unordered_set implementations are highly optimized, and may even have specialized template implementations for std::string. The custom trie was written by a single developer in a couple of hours. There could be a lot of room in the trie implementation for improvement.

Finally, there was no attempt during these experiments to measure the memory usage of the various data structures. It’s very likely that the trie uses significantly more memory than either of the std containers, since each node uses an array of almost 100 ints, where 0 or more are used. It’s a very sparse data structure.

So that’s it! If you’re interested in this, please take a look at the github repo that contains all the source, and the Excel sheet that contains the analysis. Also, I’m always interested in discussions spurred on by my posts, and learning what I’ve done wrong or suboptimally. So if you’d like to discuss or correct anything, please let me know by leaving a comment below!


My first instinct was to use suffix tries and an overlap graph, but the solution that we eventually reached used no sophisticated data structures, just the array of strings and some recursion in a divide and conquer approach. Here are my extended thoughts on the problem.

bool canBeCovered(S, P)
    if (S == "")
        return true
    for i = 1 to |S|
        substring = substring of S from 0 with length i
        if (substring is in P)
            if (canBeCovered(substring of S from i to end, P))
                return true
    return false

Graphically, the algorithm looks like this:

S = companyisgreat (|S| = 14)
P = {com, compan, pany, yi, sg, reat} (|P| = 6)

c ompanyisgreat
co mpanyisgreat
com p anyisgreat
com pa nyisgreat
com pan yisgreat
com pany i sgreat
com pany is great
com pany isg reat
com pany isgr eat
com pany isgre at
com pany isgrea t
com pany isgreat
com panyi sgreat
com panyis great
com panyisg reat
com panyisgr eat
com panyisgre at
com panyisgrea t
com panyisgreat
comp anyisgreat
compa nyisgreat
compan y isgreat
compan yi s great
compan yi sg reat

In this example, we use 24 recursive calls, with each call performing an iterative search through a string array for a string match.

It’s a simple algorithm that performs, in worst case, O(|S| * (sum of lengths of all strings in P)) time. I left feeling like we could do better.
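For reference, here’s the first algorithm as runnable Python – a direct transcription of the pseudocode, with P held in a set so the membership test is fast:

```python
def can_be_covered(s, p):
    """True if s can be fully covered by concatenating strings from p."""
    if s == "":
        return True
    for i in range(1, len(s) + 1):   # try every prefix length, including |s|
        if s[:i] in p:
            if can_be_covered(s[i:], p):
                return True
    return False

p = {"com", "compan", "pany", "yi", "sg", "reat"}
print(can_be_covered("companyisgreat", p))  # True (compan + yi + sg + reat)
print(can_be_covered("companyisgrand", p))  # False
```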

I spent some time later that afternoon thinking about this problem, and it turns out my instincts generally pointed toward an efficient approach. I developed the following algorithm based on a trie of the strings in P (not suffix tries) and an implicit graph (not an overlap graph).

Here’s the pseudocode:

S = companyisgreat (|S| = 14)
P = {com, compan, pany, yi, sg, reat} (|P| = 6)
trie = Generate a trie from the strings in P

bool canBeCovered(S, trie, i)
    if (i >= |S|)
        return true
    currentTriePosition = root
    for j = i to |S| - 1
        if S[j] is a child of currentTriePosition
            currentTriePosition = currentTriePosition.children[S[j]]
            if currentTriePosition "is a leaf"
                if (canBeCovered(S, trie, j + 1))
                    return true
        else
            return false
    return false

Graphically, the trie looks like image below:

The graphic below shows the recursion, which resembles a graph. Each letter is a node. Each arc shows the letter at S[i] where a call starts and the letter S[j + 1] where a recursive call starts. By representing the call graph this way, it’s pretty obvious that we are performing a search in the graph for a path from the node representing the first character to the node representing the position after the last character of the string.

After the initial call to canBeCovered(S, trie, 0), there is a recursive call from each outgoing arrow, which starts the covering search again at each incoming arrow. If there’s no outgoing arrow from a box with an incoming arrow, the procedure returns false. If the search reaches the last node (goes beyond the end of S), the procedure returns true.

By counting the outgoing arrows we can see that this algorithm uses only 6 recursive calls.
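Here’s a runnable Python sketch of this second algorithm. One assumption on my part: instead of the pseudocode’s quoted “is a leaf” check, I mark word ends with an explicit "$" key, because some strings in P (like com) are prefixes of others (like compan) and so don’t end at leaves:

```python
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True              # explicit end-of-word marker
    return root

def can_be_covered(s, trie, i=0):
    if i >= len(s):
        return True
    node = trie
    for j in range(i, len(s)):
        if s[j] not in node:
            return False              # no string in P continues this way
        node = node[s[j]]
        if "$" in node:               # a string in P ends here; try covering the rest
            if can_be_covered(s, trie, j + 1):
                return True
    return False

trie = build_trie(["com", "compan", "pany", "yi", "sg", "reat"])
print(can_be_covered("companyisgreat", trie))  # True
```

Tracing this by hand makes exactly 6 recursive calls on the example, matching the count above.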

As I mention in my trie article, tries give the ability to quickly find the start location in S of any string in P. The second algorithm uses what might be considered the reverse knowledge – we know all the strings in P that start at position *i*, and implicitly, their lengths. Using this knowledge, we’re able to bypass the cover check for any locations in S that could not contribute to a covering set. The same bypass could be performed in the first algorithm with the following change (hereby declared the modified first algorithm):

bool canBeCovered(S, P)
    if (S == "")
        return true
    for each p in P
        substring = substring of S from 0 with length |p|
        if (substring == p)
            if (canBeCovered(substring of S from |p| to end, P))
                return true
    return false

Also, we are able to avoid iterating through all of S – the second algorithm aborts as soon as we’ve found that no path terminates beyond the end of the string. In the best no-cover case, when none of the strings in P start with S[0], the algorithm aborts after attempting one trie traversal, essentially performing one character comparison!

S = somesuperlongstringthatgoesonandon...andon
P = {averylongstring, ..., morestrings, laststring}

Compare this to the first algorithm – the same inputs would require at least |S| * |P| character comparisons. The modified first algorithm above is better, but still requires |P| character comparisons.

However, the second algorithm does not always perform faster than the first, due to the overhead of creating the trie, which is built in O(sum of lengths of all strings in P). In the best case for the first algorithm, when P[0] == S, the first & modified first algorithms perform |S| character comparisons. The second algorithm also finds a covering set in |S| trie node traversals with these inputs. However, because of the overhead of building the trie, the first algorithm is faster. The first algorithm’s smaller memory usage may also give it a speed advantage.

Finally, as previously mentioned, the first algorithm uses 24 recursive calls while the second & modified first algorithms use only 6 – a reduction of 75%.

By choosing better data structures, we can often dramatically improve our algorithms. In this case, using a trie allowed us to eliminate an iterative search through a string array for a string match. This is especially valuable if we were using the same P for multiple values of S.

We also saw that after the trie construction, the second algorithm had much better best no-cover case performance than the first, and better performance than the modified first. This might be desirable if, again, we were using the same P for multiple values of S, or if we suspected that P may not contain any element that is a prefix of S.

However, the cost of building the trie has to be considered, as does the memory usage. If the strings in P were expected to be different for each procedure call, it’s likely that the modified first algorithm would be the fastest. If the strings in P would be used for multiple strings S, then the second algorithm would probably be fastest. Also, a trie has relatively high memory demands – if the system had tight memory constraints it may not be possible to use this approach. Nevertheless, a decision among the algorithms would likely require at least minimal experimentation and runtime analysis using expected data, to determine if the overhead of building a trie is worth the cost.

As always, I welcome comments and discussion, just leave a comment below!

A trie is a data structure that is used for fast string searching. A trie has a root node, and a node for each character in the string. The nodes are constructed such that traversing the tree from the root to a leaf reconstructs one of the strings that is stored in it.

For example, let’s say we want to store the following string:

band

We would construct the following trie:

Here we have the root node, and one node for every character in the string. By traversing from the root to the leaf, we pass b, then a, then n, then d, creating the string “band,” which is our original string.

Let’s make it more interesting by attempting to store the following two strings:

band
pans

Now our trie looks like this:

Again, we can traverse either path to reach a leaf, and by doing so we can construct either “band” or “pans.”

To make it even more interesting, let’s store these four strings:

band
pans
banana
panstar

Now we have this:

There are two new features in this trie.

- We see that banana can be found by traversing b-a-n-a-n-a, and band can be found by traversing b-a-n-d. B-a-n are shared between the two words.
- It looks like there is only one string on the branch starting with p. In fact, with the current scheme, we have no way of knowing if one of the strings in the storage list is a prefix of another string.

We’ll fix the problem described in #2 above by appending a character that’s not in our alphabet to the end of every string. It’s typical to use $. If we append $ to the end of every string in our storage set, we will get this trie:

When we traverse to a leaf, we’ll always get a string ending in $, which we can remove to recover the original string.
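To make the construction concrete, here is a minimal C++ sketch of a trie that appends the $ sentinel to every stored string. The `Node` type and the `insert`/`contains` names are my own for illustration (the post doesn’t prescribe an implementation); a map from characters to child nodes keeps the sketch short.

```cpp
#include <map>
#include <memory>
#include <string>

// One node per character; children are keyed by character. The sentinel
// '$' is appended to every stored string, so a '$' child marks a complete word.
struct Node {
    std::map<char, std::unique_ptr<Node>> children;
};

// Insert a single string (with '$' appended), visiting one node per character.
void insert(Node& root, const std::string& s) {
    Node* cur = &root;
    for (char c : s + "$") {
        auto it = cur->children.find(c);
        if (it == cur->children.end())
            it = cur->children.emplace(c, std::make_unique<Node>()).first;
        cur = it->second.get();
    }
}

// Check whether a complete string was stored: walk its characters,
// then look for the '$' sentinel.
bool contains(const Node& root, const std::string& s) {
    const Node* cur = &root;
    for (char c : s) {
        auto it = cur->children.find(c);
        if (it == cur->children.end()) return false;
        cur = it->second.get();
    }
    return cur->children.count('$') > 0;
}
```

With the four strings from the example stored, the $ child lets us tell that “pans” is a complete word even though it is a prefix of “panstar,” while “ban” is not.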

How can we use a trie? One use is to search a fixed string for occurrences of any string in the search set. More formally, given an input string S and a set of strings P, return the start locations of any item in P that is a substring of S.

For example, let’s say we wanted to search the following string S for occurrences of any strings in the set P, and report the start locations of those strings:

S = string with something prohibited
P = {prohibited, some, strict}

The brute force approach would perform a separate search over S for every string in P, potentially taking O(|S| · |P|) time, where |P| is the total length of the strings in P. Of course this does not scale – what if we want to search an email for the 1000 most common words used by spammers? Instead of using a brute force approach, we can process our search strings into a trie, then traverse the trie as we walk S. If any node has a child that is a $, then the string formed by traversing from the root to that node is present in S.

For the example above, here’s our trie:

Let’s walk through the string S = “string with something prohibited” and see how we would use the trie to find any of the prohibited words in S. **At every step when our current location is not the root, we’ll check to see if there is a child that is a $.**

Current location in trie: root node

- s – traverse the trie to the s node
- t – traverse to the t node, which is a child of the s node
- r – traverse to the r node, which is a child of the t node
- i – traverse to the i node, which is a child of the r node
- n – not a child of the i node. Reset the trie location to the root.
- g, space, w, i, t, h, space – not children of the current location
- s – traverse the trie to the s node
- o – traverse to the o node, which is a child of the s node
- m – traverse to the m node, which is a child of the o node
- e – traverse to the e node, which is a child of the m node. There is a child of the e node that is a $, so we've found a word in S. Save the start location of this string in a list for later reporting.
- t – not a child of the e node. Reset the trie location to the root.
- h, i, n, g, space – not children of the current location
- p – traverse to the p node, which is a child of the current location
- r, o, h, i, b, i, t, e – traverse to each node in turn, each a child of the current location
- d – traverse to the d node, which is a child of the e node. There is a child of the d node that is a $, so we've found a word in S. Save the start location of this string in a list for later reporting.

So, there you have it. We iterated through S one time, and found the start location of all occurrences of every string in P, while minimizing string comparisons.
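The scan can be transcribed into a short C++ sketch. The `Node` type and function names here are illustrative, not from the post. Note that, exactly as in the walkthrough, it resets to the root on a mismatch without retrying the current character, so certain overlapping matches can slip through; a production version would add Aho–Corasick failure links.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::map<char, std::unique_ptr<Node>> children;
};

// Store one pattern, with the '$' sentinel appended.
void insert(Node& root, const std::string& s) {
    Node* cur = &root;
    for (char c : s + "$") {
        auto it = cur->children.find(c);
        if (it == cur->children.end())
            it = cur->children.emplace(c, std::make_unique<Node>()).first;
        cur = it->second.get();
    }
}

// Walk S once, following trie edges. On a mismatch, reset to the root, as in
// the walkthrough. Whenever the current node has a '$' child, record the
// position where this match started.
std::vector<size_t> findAll(const Node& root, const std::string& S) {
    std::vector<size_t> starts;
    const Node* cur = &root;
    size_t start = 0;
    for (size_t i = 0; i < S.size(); ++i) {
        auto it = cur->children.find(S[i]);
        if (it == cur->children.end()) {
            cur = &root;              // mismatch: reset to the root
            continue;
        }
        if (cur == &root) start = i;  // leaving the root: a match begins here
        cur = it->second.get();
        if (cur->children.count('$')) starts.push_back(start);
    }
    return starts;
}
```

Run on the example above, this reports start positions 12 (“some”) and 22 (“prohibited”).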

In the previous example, we used a trie to search a fixed string S for a set of strings P. This is useful when your set of substrings P is relatively stable but your search string S changes frequently, as in the spam use case.

There are other use cases of string searches though – one might want to find all occurrences of a substring in another string very quickly. This is a common use case in genomics, where we want to search a genome (which can be represented as a string over the alphabet ATGC) for a genome fragment. In this case, the search string S is relatively stable and can be on the order of 10^9 characters in length, but the members of the search set P change relatively frequently and may only be on the order of 10^5. Let’s look at the simple case first: find all occurrences of a substring in a search string.

The more formal description of this problem is: given an input string S and a substring F, return a list of all start locations of F in S. This problem can be solved by creating a suffix trie, which is a trie constructed by adding every suffix of S to the trie. The steps in this procedure are to add the substring from S[0:last] to the trie, then S[1:last] to the trie, then S[2:last], and every other substring up to S[last-1:last]. Of course before constructing the trie, we will need to append the $ as before.

To find every start position of F in S, we traverse the trie using the characters in F to find the next node. F does not appear in S if the trie’s current location does not contain a child that matches the next character of F. If the trie can be traversed in this manner to the end of F, then F appears 1 or more times in S.

To find the start locations, the suffix trie needs to be modified during construction so the leaves are decorated with the start position of the suffix that was used to create that branch. Then, we simply perform a DFS or BFS from the end node, and append the locations stored in the leaves to a list.

In this example, we want to find all start locations of the string “ana” in the string “panamabananas”. This example comes from the excellent Algorithms on Strings course on Coursera.

S = panamabananas
F = ana

Here’s what the suffix trie looks like after each leaf has been decorated with the start location of the suffix used to create that branch.

To find all the start locations of “ana,” we would traverse from the root to a, then n, then a. Since we reached the end of “ana,” we know there is 1 or more occurrence of “ana” in S. To find these locations, we’ll just perform a DFS to the leaves from the current location, and we see that the locations are 7, 1, and 9. This is easy to verify:

              111
    0123456789012
S = panamabananas
     ^     ^ ^

To find the string F = “and,” we would traverse from the root to a, then n, but the n node has no child labeled “d.” Since we could not reach the end of “and” by traversing the trie, we can conclude that “and” is not found in S.
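Here is a C++ sketch of the whole suffix-trie procedure: every suffix of S (plus the $ sentinel) is inserted, each leaf is decorated with the start index of the suffix that created it, and a query walks F and then performs a DFS to collect the decorated leaves. The names are my own for illustration; as discussed later, construction can take O(|S|^2) time and space.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// A suffix-trie node: '$' leaves carry the start index of the suffix
// that created them.
struct SuffixNode {
    std::map<char, std::unique_ptr<SuffixNode>> children;
    int suffixStart = -1;   // set only on '$' leaves
};

// Build the suffix trie of S by inserting S[i:] + "$" for every i.
std::unique_ptr<SuffixNode> buildSuffixTrie(const std::string& S) {
    auto root = std::make_unique<SuffixNode>();
    for (size_t i = 0; i < S.size(); ++i) {
        SuffixNode* cur = root.get();
        for (char c : S.substr(i) + "$") {
            auto it = cur->children.find(c);
            if (it == cur->children.end())
                it = cur->children.emplace(c, std::make_unique<SuffixNode>()).first;
            cur = it->second.get();
        }
        cur->suffixStart = static_cast<int>(i);  // decorate the leaf
    }
    return root;
}

// Collect every decorated leaf below `node` (a DFS).
void collectLeaves(const SuffixNode* node, std::vector<int>& out) {
    if (node->suffixStart >= 0) out.push_back(node->suffixStart);
    for (const auto& kv : node->children) collectLeaves(kv.second.get(), out);
}

// Find all start positions of F in S: walk F through the trie, then DFS.
std::vector<int> findStarts(const SuffixNode* root, const std::string& F) {
    const SuffixNode* cur = root;
    for (char c : F) {
        auto it = cur->children.find(c);
        if (it == cur->children.end()) return {};  // F does not occur in S
        cur = it->second.get();
    }
    std::vector<int> starts;
    collectLeaves(cur, starts);
    return starts;
}
```

Querying the “panamabananas” trie for “ana” yields the start positions 1, 7, and 9, while “and” yields an empty result, matching the example.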

This post is already really long, so I’m not going to go into these details. However, it’s fairly easy to show that inserting a string *p* into a trie can be done in O(|*p*|). Probably the fastest implementation of a node, shown below, uses memory space that’s approximately O(|ALPHABET| * |N|), where |ALPHABET| is the number of characters in the alphabet and |N| is the number of nodes in the trie.

struct TrieNode {
    struct TrieNode *children[CHARS_IN_ALPHABET];
    // Rather than using a $ and a separate node to indicate the end
    // of a word or a leaf, each node contains a boolean to indicate whether
    // this node was the end of a string.
    bool isLeaf;
};

Finally, searching for a string *p* in a trie using the implementation above can be done in O(|*p*|).
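The array-based node above drops into minimal insert and search routines, each doing one array index per character of *p*. This is an illustrative sketch: it assumes a lowercase a–z alphabet, and for brevity it never frees the nodes it allocates (a real implementation would add a destructor or use smart pointers).

```cpp
#include <cstring>
#include <string>

const int CHARS_IN_ALPHABET = 26;  // assuming a lowercase a-z alphabet

struct TrieNode {
    struct TrieNode *children[CHARS_IN_ALPHABET];
    bool isLeaf;  // true if a stored string ends at this node
    TrieNode() : isLeaf(false) { std::memset(children, 0, sizeof(children)); }
};

// Insert p in O(|p|): one array index per character.
void insert(TrieNode* root, const std::string& p) {
    TrieNode* cur = root;
    for (char c : p) {
        int idx = c - 'a';
        if (!cur->children[idx]) cur->children[idx] = new TrieNode();
        cur = cur->children[idx];
    }
    cur->isLeaf = true;
}

// Search for p in O(|p|); only complete stored strings count as found.
bool search(const TrieNode* root, const std::string& p) {
    const TrieNode* cur = root;
    for (char c : p) {
        const TrieNode* next = cur->children[c - 'a'];
        if (!next) return false;
        cur = next;
    }
    return cur->isLeaf;
}
```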

We’ve already seen how tries can be used for finding any instances of strings from a set P in a fixed string S, and how a suffix trie can be used to find all starting locations of a substring F in a string S (though a Burrows-Wheeler Transform may be better for this purpose – I may write a post on that in the future!). It could also be used for quickly checking if a substring F is a member of a set P. Thus, I recommend using tries when:

- You need to quickly find any string from a set P in a string S. This is especially valuable if P is relatively stable and you have many strings to search, such as in the spam email example.
- You need to quickly find if a string F is a member of the set P. This might occur if, for example, you want to check that a series of blocks placed on a Scrabble board is a valid English word.
- It might be appropriate to use a suffix trie to find all start locations of F in a string S, especially if S is not dominated by only a few characters (i.e., the count of each character that appears in the string is roughly the same). When one character's count is much higher than the others', the branch of the trie that starts with that character might contain O(|S|^2) nodes, which would result in O(|S|^2) runtime to traverse to those leaves.

If you know of other uses for tries, please let me know by leaving a comment at the end of the article!

As always, I’m here to answer your questions and help you understand these concepts! I’m also always looking for corrections and discussions. So, if you have questions, see errors, or want to discuss improvements to the content of this article, just leave a comment below!
