In my last article, I wrote about interleaved and planar memory layouts, and when to use each. Both layouts can contain the exact same information using the exact same amount of memory. One consideration when choosing whether to store data in interleaved or planar layout is the types of operations that will be performed on the data. Due to the principle of locality, significant performance gains can sometimes be obtained by choosing one over the other – pixel-level operations may be faster in interleaved layout, while channel-level operations may be faster in planar layout.

To exemplify this, I published unsophisticated separable 2D convolution algorithms (for an overview of separable convolution, check out this video). The code includes a performance-testing application that performs 2D separable convolutions over the same data using both interleaved and planar memory layouts, which demonstrates that the memory layout can have a significant impact on different algorithms that perform the exact same image processing computations.

Setup

I built and ran the performance testing application on my machine, which is an older Intel NUC with an Intel i5-4250U, 8GB of RAM, and 148GB SSD. At the time of running the test, Windows 10 Pro Version 1909, Build 18363.1139 was installed.

The build directory was configured to build with Visual Studio 2019 (cl version 19.27.29111) using the command below:

"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" --no-warn-unused-cli -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=TRUE -H<src dir> -B<build dir> -G "Visual Studio 16 2019" -T host=x64 -A x64

A release configuration of the application was built with the command below:

"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" --build <build dir> --config Release --target ALL_BUILD -- /maxcpucount:6

To obtain runtimes for the different algorithms, I closed all open applications except VS Code, then ran the following command in the VS Code Terminal window:

interleaved_vs_planar.exe 2000 3000 4 3

This command creates a 2000 x 3000 x 4 (H x W x D) float matrix filled with random values in [0, 1]. It then performs 3 iterations of each test.

Tests

The application performs 5 tests, which are explained in the README.md and summarized below.

  1. interleaved3: Interpret the data as interleaved, and perform per-channel 2D separable blur using a kernel size of 3
  2. planar3: Interpret the data as planar, and perform per-channel 2D separable blur using a kernel size of 3
  3. interleaved7: Interpret the data as interleaved, and perform per-channel 2D separable blur using a kernel size of 7
  4. planar7: Interpret the data as planar, and perform per-channel 2D separable blur using a kernel size of 7
  5. planar7withTranspose: Interpret the data as planar, and perform per-channel 2D separable blur using a kernel size of 7. However, rather than performing horizontal and vertical convolution, perform horizontal convolution, transpose, horizontal convolution again, and transpose again.

In these experiments, the interleaved tests are performed on an interleaved layout of the data, while the planar tests are performed on a planar layout of the same data. In other words, the data at a given (column, row, channel) tuple is the same whether the data is in interleaved or planar layout. For clarity, it’s important to note that the particular data values in these floating point computations should not make a significant difference in the runtime, especially because all the input and output data should be in the range of [0.0, 1.0]. However, this setup allows the results of the computations to be compared directly, so that the reviewers can be confident that the same computations are occurring.

Every iteration of a test is performed before the next test is run. Statistics for the iteration with minimal total runtime are saved for output.

Results

The results obtained on the test machine are shown below, with time values recorded in seconds:

test,horizontal,transpose,vertical,total
interleaved3,0.070904,0,0.911467,0.982371
planar3,0.0454543,0,0.653508,0.698962
interleaved7,0.129526,0,0.889743,1.01927
planar7,0.0996537,0,0.645452,0.745106
planar7withTranspose,0.0995849,0.519957,0.104595,0.724137

For ease of viewing, the data has been formatted as a table below:

testhorizontaltransposeverticaltotal
interleaved3 0.07090400.911467 0.982371
planar3 0.045454300.653508 0.698962
interleaved7 0.12952600.889743 1.01927
planar7 0.099653700.645452 0.745106
planar7withTranspose 0.0995849 0.519957 0.104595 0.724137

In the first 4 tests, there is no transpose, so the transpose time is 0 for those tests; only the planar7withTranspose test includes a transpose operation so it is the only test with a non-zero transpose time.

Discussion

There are clear trends in the data. Some were expected, while others were not expected.

Interleaved vs planar

The planar3 total time is ~71% of the interleaved3 total time. The planar7 total time is 73% of the interleaved7 total time. The vast majority of the runtime for any of these convolutions occurs in the vertical convolution, which is another demonstration of the principle of locality.

Without getting too deep into the details of memory retrieval and caching, the basic idea is that when the program requests a single float (4 bytes) from memory, the processor actually retrieves those bytes as well as quite a few more adjacent bytes (up to 60 on modern processors). This adjacent data is stored in the fastest part of the processor’s cache and is potentially orders of magnitude faster to access than the data in memory.

Thus, it’s likely that in the horizontal planar convolution, requesting the data for the first element in the matrix causes the processor to retrieve all the matrix data required for ~16 operations. After the first operation, the next ~15 incur a fraction of the memory access cost. The cycle repeats itself when the next element of matrix data that’s not in the cache is requested.

There is a similar, but smaller effect in the horizontal interleaved convolution. The implementation iterates completely over the data in a channel before moving to the next channel. This means in practice that the horizontal convolution data is essentially operating over moderately sparse data. In this case, since the depth is 4, requesting data for the first element in the matrix causes the processor to store all the matrix data for ~4 operations in its fastest cache.

In contrast to this, the vertical convolutions in each layout essentially need to access data in memory, rather than cache, for probably every operation. In a sense, in these implementations, the vertical convolution almost certainly never gets data for the next operation “for free.”

Kernel size, 3 vs. 7

When comparing the horizontal convolutions, ie interleaved3 vs. interleaved7 or planar3 vs. planar7, there is an expected significant increase in runtime. Interleaved7 horizontal runtime is ~182% of interleaved3 horizontal runtime, while planar7 horizontal runtime is ~219% of planar3 horizontal runtime. The likely explanation for this significant increase is because there are simply more computations occurring – for a kernel of size 7, each element in the convolution is a function of 7 elements, while a kernel of size 3 results in a function of 3 elements. The operation on 7 elements must therefore compute 7 multiplications and 6 additions (13 total operations), while the operation on 3 elements must compute 3 multiplications and 2 additions (5 total operations). Therefore under ideal conditions we would expect the runtime of the size 7 kernel to be 260% of the runtime of the size 3 kernel. We can conclude that the conditions in these computations are not ideal, likely due to delays retrieving the required data from memory.

However, when comparing vertical convolutions, there is a very unexpected decrease in runtime. The difference is small, but consistent between interleaved3 and interleaved7 as well as planar3 and planar7. In either case, the number of operations in the size 7 kernel computation is still 260% of the computations in the size 3 kernel computation: 13 operations compared to 5. Therefore these results indicate that the size 7 routines somehow performed far more operations slightly faster than the size 3 routines. This is despite the fact that the routines are template functions, meaning that the C++ code for the size 3 & size 7 routines was the same!

No investigation was performed to understand the cause of this unexpected result.

planar7withTranspose

After noticing that vertical convolution dominated the runtime of the separable convolution, this test was conducted to see if a transpose + horizontal convolution + transpose could be faster than a single vertical convolution. In these results, a transpose + horizontal convolution + transpose takes ~97% of the time of a single vertical convolution. Performing the operation this way was clearly faster, though the difference is marginal for this test.

Considerations

These convolution implementations are only intended to demonstrate that choosing interleaved or planar can have a significant effect on runtime, depending on the algorithm that is running. They are not sophisticated or optimized implementations of convolutions, and are not intended to show minimal convolution runtime.

These results imply that planar layout is faster than interleaved layout for convolution. However, it’s more accurate to say that planar layout can be faster than interleaved for convolution. It’s certainly possible to write optimized implementations of convolution for interleaved layout by changing the order of operations to exploit memory already in the cache. It’s also possible to obtain very significant reductions in vertical convolution runtime by changing the order of operations. However, writing these sophisticated or optimized routines can be time consuming and error prone, and could be highly implementation or even processor specific. On the other hand, it’s likely that these optimizations would yield far better runtime improvements than just translating between interleaved and planar layout.

Another method of improving convolution performance would be to use an existing optimized implementation, such as the free Intel IPP. Intel IPP provides processor-specific optimized implementations of various convolutions. Since they are free to obtain and are very permissive in their licensing, there’s almost no reason not to use them. The performance improvement obtained by using an implementation like this is almost guaranteed to dwarf any improvement gained by translating from interleaved to planar or vice-versa.

In defense of this experiment, it’s clear that the relatively simple operation of translating between interleaved and planar layout can produce significant runtime reduction. This is an important conclusion, because not every image processing algorithm has a free, optimized implementation readily available, and development schedules do not always allow developers to write sophisticated implementations of image processing algorithms. Knowing that bottleneck algorithms are channel-level or pixel-level operations could allow a developer to gain significant runtime improvement for very minimal investment in development and testing.

The unexpected result in the vertical convolution, where the runtime decreased between the size 3 and size 7 kernel, would warrant further investigation in a product environment. To identify the cause of this unexpected result, some reasonable steps might include:

  1. a code review
  2. reproduction by another team member
  3. additional experiments using different horizontal size, vertical size, depth, and number of iterations
  4. investigation into the assembly code that the compiler produced

The planar7withTranspose test could be interesting for a few reasons. First, it may be possible to perform multiple vertical operations in the transposed matrix. This would amortize the cost of the two transposes and likely result in a significant overall runtime reduction. Second, as with all implementations in this experiment, the transpose operation is not optimized. It may be simpler to write an optimized transpose, rather than an optimized vertical convolution. Third, using this technique would allow developers to maintain a single 1D convolution implementation. This may be ideal in some conditions, for example if the convolution is expected to perform operations on the edge pixels rather than just zero them out. However, performance of the “transpose-horizontal convolution-transpose” algorithm for vertical convolution should be tested using different values for horizontal and vertical size – it’s possible that different image aspect ratios could produce different performance results.

None of the tests included the time to translate from interleaved to planar layout, or vice-versa. It’s possible that total runtime of translating, operating, and translating back would be similar or even greater than the runtime of just performing the operation on the data using a suboptimal layout. However, when multiple operations that would be optimized in the same layout are being performed in series (ie blur followed by derivative), the runtime reduction in the operations due to optimal layout may be much greater than the runtime increase due to the layout translations.

Conclusion

Storing image data in interleaved vs. planar layout can have a significant effect on performance. Converting between the two layouts is a simple operation that may lead to large runtime reductions, especially when multiple operations can be performed in sequence using the same optimal layout.

Feel free to clone the github repo and perform your own experiments! If you want to share those results with me, I can be contacted at whentouseit@avitevet.com.

I recently interviewed with a large company where I was asked how I would check that a given word was a valid English word.  Knowing that there were on the order of a few tens of thousands of stems, and that each stem may have many variations, I figured that there were maybe a few hundred thousand possible words in the English language.  With each word having average length probably in the 7 range, there might be a million or so characters in this set.  I concluded that the check could be performed by searching an in-memory dictionary, and proposed either a std::unordered_set, or a custom trie, with a performance test necessary to see which would be faster.

I decided to perform this test.

Continue reading