Vectorise image block processing efficiently? - c++

I am curious what the most efficient method is for processing an image block by block.
At the moment I apply some vectorization techniques, such as reading one row of pixels (8 pixels per row, each 8 bits deep) from an 8x8 block. But since modern processors support 128/256-bit vector operations, I think loading two rows of pixels from the block at once could speed up the code.
The problem is that an image (for example a 16x16 image, containing four 8x8 blocks) is stored in memory contiguously from the first pixel to the last. Loading one 8-pixel row is easy, but how should I manipulate the pointer, or align the image data, so that I can load 2 rows together?
I think this figure can illustrate my problem clearly:
[figure: pixel addresses in the image]
So, when we load 8 pixels (a row) together, we simply load 8 bytes of data from the initial pointer position with one instruction. To load the 2nd row, we advance the pointer by the image stride (16 in this 16x16 example) and load again.
So, the question is: is there any method by which we could load these two rows (16 pixels) together from the initial pointer position?
Thanks!

To make each row aligned, you can pad the end of each row. Writing your code to support a shorter image width than the stride between rows lets your algorithm work on a subset of an image.
Also, you don't actually need everything to be aligned for SIMD to work well. Contiguous is sufficient. Most SIMD instruction sets (SSE, NEON, etc.) have unaligned load instructions. Depending on the specific implementation, there might not be much penalty.
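If you do want two 8-pixel rows of one block in a single register, here is a minimal sketch with SSE2 intrinsics, assuming 8-bit pixels and a row stride given in bytes (the function name is mine): two 64-bit loads plus an unpack.

    #include <emmintrin.h> // SSE2
    #include <cstddef>
    #include <cstdint>

    // Combine rows 0 and 1 of an 8x8 block into one 128-bit register.
    // src points at the block's first pixel; stride is the image width in
    // bytes (16 in the 16x16 example above).
    static inline __m128i load_two_rows(const std::uint8_t* src, std::ptrdiff_t stride)
    {
        __m128i row0 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
        __m128i row1 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src + stride));
        return _mm_unpacklo_epi64(row0, row1); // row 0 in the low 8 bytes, row 1 in the high 8
    }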
You don't load two different rows into the same SIMD vector. For example, to do an 8x8 SAD (sum of absolute differences) using AVX2 VPSADBW, each 32-byte load would get data from one row of four different 8x8 blocks. But that's fine, you just use that to produce four 8x8 SAD results in parallel, instead of wasting a lot of time shuffling to do a single 8x8 SAD.
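As a hedged sketch of that idea with AVX2 intrinsics (the function name and the shared-stride assumption are mine): each unaligned 32-byte load covers one row of four adjacent 8x8 blocks, and VPSADBW produces one partial SAD per 8-byte lane, i.e. per block.

    #include <immintrin.h> // AVX2
    #include <cstddef>
    #include <cstdint>

    // Accumulate 8x8 SADs for four horizontally adjacent blocks at once.
    // cur and ref point at the first row of the four blocks in the current
    // and reference frames; both are assumed to share the same byte stride.
    static inline __m256i sad_four_blocks_8x8(const std::uint8_t* cur,
                                              const std::uint8_t* ref,
                                              std::ptrdiff_t stride)
    {
        __m256i acc = _mm256_setzero_si256();
        for (int row = 0; row < 8; ++row) {
            __m256i c = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(cur + row * stride));
            __m256i r = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ref + row * stride));
            // VPSADBW: one 64-bit SAD per 8-byte lane, one lane per block.
            acc = _mm256_add_epi64(acc, _mm256_sad_epu8(c, r));
        }
        return acc; // four 64-bit totals, one per 8x8 block
    }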
For example, Intel's MPSADBW tutorial shows how to implement an exhaustive motion search for 4x4, 8x8, and 16x16 blocks, with C and Intel's SSE intrinsics. Apparently the actual MPSADBW instruction isn't worth using in practice, though, because it's slower than PSADBW, and you can get identical results faster with a sequential-elimination exhaustive search, as used by x264 (and mentioned by x264 developers in this forum thread about whether SSE4.1 would help x264).
Some SIMD-programming blog posts from the archives of Dark Shikari's blog: Diary Of An x264 Developer:
Cacheline splits, take two: using PALIGNR or other techniques to set up unaligned inputs for motion searches
A curious SIMD assembly challenge: the zigzag

Related

Efficient design for multiple pixel operations when decompressing an image

I maintain an image codec that requires post processing of the image with a varying number of simple pixel operations: for example, gain, colour transform, scaling, truncating. Some ops work on single channels, while others (colour transform) work on three channels at a time.
When an image is decoded, it is stored in planar format, one buffer per channel.
I want to design an efficient framework in C++ that can apply a specified series of pixel ops. Since this is the inner loop, I need it to be efficient: pixel ops should be inlined, with minimal branching.
The simplest approach is to have a fixed array of, say, 20 operands, and pass this array with the actual number of ops to the post-process method. Can someone suggest a cleverer way?
Edit: This would be a block operation, for efficiency, and I do plan on using SIMD to accelerate. So, for each pixel, I want to efficiently perform a configurable sequence of pixel ops, using SIMD instructions.
I would not recommend executing the pipeline at the pixel level; that would be horribly inefficient (and inapplicable for some operations). Do it for whole images.
As you suggested, it is an easy matter to encode the sequence of operations and associated arguments as a list, and to write a simple execution engine that calls the desired functions (a sketch follows below).
Probably some of your operations are done in place and some others require an extra buffer, so you will need to add suitable buffer management. Nothing insurmountable.
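A minimal sketch of such an execution engine, assuming planar float buffers; the names (PixelOp, apply_gain, run_pipeline) are mine, not from any particular codec:

    #include <cstddef>
    #include <vector>

    // One entry in the op list: a whole-buffer function plus its arguments.
    // Keeping the per-pixel loop inside the op keeps it inlinable and
    // SIMD-friendly, with no per-pixel dispatch branching.
    struct PixelOp {
        void (*fn)(float* dst, const float* src, std::size_t n, const void* args);
        const void* args; // op-specific parameters (e.g. the gain factor)
    };

    static void apply_gain(float* dst, const float* src, std::size_t n, const void* args)
    {
        const float g = *static_cast<const float*>(args);
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = src[i] * g; // tight loop, trivially auto-vectorized
    }

    void run_pipeline(const std::vector<PixelOp>& ops, float* buf, std::size_t n)
    {
        for (const PixelOp& op : ops)
            op.fn(buf, buf, n, op.args); // in place here; real code may ping-pong buffers
    }

    // Usage: float g = 1.5f;
    //        std::vector<PixelOp> ops = { { apply_gain, &g } };
    //        run_pipeline(ops, plane, width * height);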

Modern Modelling Formats that Support Vertex Buffers

Are there any modeling formats that directly support Vertex Buffer Objects?
Currently my game engine has been using Wavefront Models, but I have always used them with immediate mode and display lists. This works, but I wanted to upgrade my entire system to modern OpenGL, including shaders. I know that I can use immediate mode and display lists with shaders, but like most aspiring developers, I want my game to be the best it can be. After asking the question linked above, I quickly came to the realization that Wavefront Models simply don't support Vertex Buffers; this is mainly due to how the model is indexed. For a Vertex Buffer Object to be used, the vertex, texture-coordinate, and normal arrays all need to be equal in length.
I can achieve this by writing my own converter, which I have done. Essentially I unroll the indexing and create the associated arrays (sketched below). I don't even need to use glDrawElements then; I can just use glDrawArrays, which I'm perfectly fine doing. The only problem is that I am actually duplicating data; the arrays become massive (especially with large models), and this just seems wrong to me. Certainly there has to be a modern way of initializing a model into a Vertex Buffer without completely unrolling the indexing. So I have two questions.
1. Are there any modern model formats/concepts that directly support Vertex Buffer Objects?
2. Is this already an industry standard? Do most game engines unroll the indexing (and inflate the arrays, also called unpacking) at runtime to create the game world assets?
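To illustrate the unrolling mentioned above, a simplified sketch (the Vertex and FaceIndex types are made up for the example): each face corner's v/vt/vn index triple is resolved into one flat, glDrawArrays-ready vertex, duplicating shared attributes.

    #include <array>
    #include <vector>

    struct Vertex { std::array<float, 3> pos; std::array<float, 2> uv; std::array<float, 3> nrm; };
    struct FaceIndex { int v, vt, vn; }; // 0-based here; .obj files are 1-based

    std::vector<Vertex> unroll(const std::vector<std::array<float, 3>>& positions,
                               const std::vector<std::array<float, 2>>& texcoords,
                               const std::vector<std::array<float, 3>>& normals,
                               const std::vector<FaceIndex>& faces)
    {
        std::vector<Vertex> out;
        out.reserve(faces.size());
        for (const FaceIndex& f : faces)       // one output vertex per face corner,
            out.push_back({ positions[f.v],    // duplicating anything that was shared
                            texcoords[f.vt],
                            normals[f.vn] });
        return out;
    }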
The primary concern with storage formats is space efficiency. Reading from storage media, you are by and large limited by I/O bandwidth, so any CPU cycles you can invest to reduce the total amount of data read from storage will hugely benefit asset loading times. Just to give you the general idea: even the fastest SSDs you can buy at the time of writing won't get over 5 GiB/s (believe me, I tried sourcing something that can saturate 8 lanes of PCIe 3 for my work). Your typical CPU memory bandwidth is at least one order of magnitude above that, GPUs have even more memory bandwidth, and lower-level caches are faster still.
So what I'm trying to tell you is: that index-unrolling overhead? It's mostly an inconvenience for you, the developer, and it probably even shaves some time off loading the assets.
Of course, storing numbers in their text representation is not going to help with space efficiency; depending on the choice of base, a single digit represents between 3 and 5 bits (let's say 4 bits), while the text character holding it consumes 8 bits, so you have about 100% overhead there. The lowest-hanging fruit is thus storing in a binary format.
But why stop there? How about applying compression to the data? There are a number of compressed asset formats, and one particularly well-developed one is OpenCTM, although it would make sense to add one of the recently developed compression algorithms to it. I'm thinking of Zstandard here, which compresses data ridiculously well and at the same time is obscenely fast at decompression.
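For illustration, a minimal sketch of one-shot Zstandard decompression of an asset blob, assuming the uncompressed size was recorded in the frame header when the asset was packed:

    #include <zstd.h>
    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    std::vector<char> decompress_asset(const std::vector<char>& blob)
    {
        // The frame header carries the content size if the compressor stored it.
        unsigned long long size = ZSTD_getFrameContentSize(blob.data(), blob.size());
        if (size == ZSTD_CONTENTSIZE_ERROR || size == ZSTD_CONTENTSIZE_UNKNOWN)
            throw std::runtime_error("bad or size-less zstd frame");

        std::vector<char> out(static_cast<std::size_t>(size));
        std::size_t got = ZSTD_decompress(out.data(), out.size(), blob.data(), blob.size());
        if (ZSTD_isError(got))
            throw std::runtime_error(ZSTD_getErrorName(got));
        out.resize(got);
        return out;
    }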

3D FFT with data larger than cache

I have searched for an answer to this question but have not found anything that can directly help me.
I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.
My arrays consist of 2^30 data points which is much much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of overhead accessing memory.
Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?
Any help would be much appreciated.
Thanks,
Dylan
2^30 data points in a single FFT counts as being quite big!
The data plus the exponentials and the output array is several thousand times bigger than the L3 cache, and millions of times bigger than L1.
Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.
I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.
You might try contacting Mercury Systems Inc. (www.mrcy.com) and asking them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real Intel-optimised SAL is definitely not free, though.
Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.
GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (at 8 bytes per single-precision complex value that's 8 GB, the same again for the output array, plus exponentials, etc.).
I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.
Even though your arrays are very large, a 50% miss rate is excessive. The processor should avoid misses by detecting that you are iterating over an array and loading ahead of time the data elements you are likely to use.
Either the pattern of array accesses is not regular, so the prefetcher in the processor does not figure out a pattern to prefetch, or you have a cache associativity problem, that is, elements in your iteration get mapped to the same cache slots.
For example, assume a cache size of 1 MB and a set associativity of 4. In this example, the cache maps memory to an internal slot using the lower 20 bits of the address. If you stride by 1 MB, that is, your iterations are exactly 1 MB apart, then the lower 20 bits are always the same and always select the same cache slot, so each new element shares its slot with the old ones. When you get to the fifth element, all four ways are used up, and from then on you get only misses; in that case your cache is effectively a single slot. If you stride by half the cache size, the effective number of slots is 2, which might be enough to have no misses at all, or 100% misses, or anything in between, depending on whether your access pattern requires both slots simultaneously or not.
To convince yourself of this, write a toy program with varying stride sizes and you'll see that strides that divide, or are multiples of, the cache size increase misses; you can measure with valgrind --tool=cachegrind. A sketch of such a program follows below.
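For example, a toy program along those lines (stride passed on the command line; run it under cachegrind with different strides, and power-of-two strides near the cache size should show markedly worse miss rates):

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main(int argc, char** argv)
    {
        const std::size_t n = std::size_t(1) << 26; // 64 Mi floats, far bigger than cache
        const std::size_t stride = argc > 1 ? std::strtoull(argv[1], nullptr, 10) : 1;
        std::vector<float> a(n, 1.0f);

        float sum = 0.0f;
        for (std::size_t start = 0; start < stride; ++start)   // every element is touched
            for (std::size_t i = start; i < n; i += stride)    // exactly once, but in a
                sum += a[i];                                   // stride-dependent order

        std::printf("stride=%zu sum=%f\n", stride, sum);
        return 0;
    }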
You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.
The rest of this post assumes that the FFT is really at fault and we need to optimize.
The standard trick to get data locality out of an FFT is to
1. Arrange the data in a two-dimensional array
2. Do an FFT along each row
3. Apply twiddle factors
4. Do a matrix transpose
5. Do an FFT along each row
This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.
The point of this is that the recursive calls to the FFT are much much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.
Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.
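To make the recipe concrete, here is a plain C++ sketch (names are mine; a naive O(n^2) DFT stands in for the library's 1-D FFT just to keep it self-contained). It assumes the caller arranged element (i, j) of the rows x cols matrix to hold x[i + rows*j], and the result comes out transposed:

    #include <algorithm>
    #include <complex>
    #include <cstddef>
    #include <vector>

    using cplx = std::complex<double>;
    static const double kPi = 3.141592653589793;

    // Stand-in for the library's 1-D FFT (e.g. MKL's): a naive O(n^2) DFT.
    static void fft_inplace(cplx* x, std::size_t n)
    {
        std::vector<cplx> out(n);
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                out[k] += x[j] * std::polar(1.0, -2.0 * kPi * double(j) * double(k) / double(n));
        std::copy(out.begin(), out.end(), x);
    }

    static void transpose(std::vector<cplx>& a, std::size_t rows, std::size_t cols)
    {
        std::vector<cplx> t(a.size());
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                t[j * rows + i] = a[i * cols + j];
        a.swap(t);
    }

    // Steps 2-5 of the recipe; each sub-FFT is only rows or cols long, so it
    // has a far better chance of fitting in cache than one huge FFT.
    void four_step_fft(std::vector<cplx>& a, std::size_t rows, std::size_t cols)
    {
        const std::size_t n = rows * cols;
        for (std::size_t r = 0; r < rows; ++r)        // 2. FFT along each row
            fft_inplace(&a[r * cols], cols);
        for (std::size_t r = 0; r < rows; ++r)        // 3. twiddle factors
            for (std::size_t c = 0; c < cols; ++c)
                a[r * cols + c] *= std::polar(1.0, -2.0 * kPi * double(r) * double(c) / double(n));
        transpose(a, rows, cols);                     // 4. matrix transpose
        for (std::size_t c = 0; c < cols; ++c)        // 5. FFT along each (new) row
            fft_inplace(&a[c * rows], rows);
    }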
The library you're using really should be doing this already. If it's not, then some options are:
Maybe it exposes enough lower-level functionality that you can tell it to use Cooley-Tukey efficiently even though the high-level routines don't.
You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs.

Calculating (very) large matrix products with CUDA

I am just beginning to learn some CUDA programming, and I am interested in how to handle calculation of large matrices which surpass the block/thread sizes.
For example, I have seen code which shows how to perform tiled matrix multiplication, but it fails when the block size and grid size are too small. In the mentioned code, if the block size and grid size are each set to 1, then only the first element of the final matrix will be computed.
The answer is simple: call the kernel with larger block and grid sizes. But what happens when I want to perform a matrix multiplication with 8 million rows and 6 million columns, something arbitrarily large for which there cannot be a proper grid and block size for any modern GPU?
Where can I find example code, or an algorithm, for how to work with this sort of thing? I believe the simple case should be a matrix multiplication algorithm which works if called with <<<1,1>>>; any algorithm which can account for this call should be able to account for any larger matrix.
The main problem with very large matrices is not the number of blocks or the number of threads. The main problem is that you cannot fit the whole matrix in the GPU's DRAM. So to do the multiplication, you need to manually use tiling to divide the input matrices into tiles that fit in the GPU's memory. Then you run the matrix multiplication on each tile on the GPU, with as many threads as you need, and return the tile result back to the host (CPU).
When you are working on these big tiles on the GPU, you need to launch thousands of threads to get the performance you need; launching only one thread does not help you in any way. A host-side sketch of the tiling follows below.
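A plain C++ sketch of that host-side tiling (multiply_tile_on_gpu is a hypothetical placeholder for the upload / kernel launch / download step): the output C (M x N) is cut into T x T tiles, and each tile only needs a strip of A and a strip of B, so only those strips have to fit in GPU memory at once.

    #include <algorithm>
    #include <cstddef>

    // Placeholder: copy the strips to the device, launch the kernel with
    // thousands of threads, copy the finished tile back to the host.
    void multiply_tile_on_gpu(const float* a_strip, const float* b_strip,
                              float* c_tile, std::size_t tile_h, std::size_t tile_w,
                              std::size_t K, std::size_t ldb, std::size_t ldc);

    // A is M x K, B is K x N, C is M x N, all row-major on the host.
    void tiled_gemm(const float* A, const float* B, float* C,
                    std::size_t M, std::size_t N, std::size_t K, std::size_t T)
    {
        for (std::size_t i = 0; i < M; i += T) {
            const std::size_t th = std::min(T, M - i);      // tile height (edge tiles)
            for (std::size_t j = 0; j < N; j += T) {
                const std::size_t tw = std::min(T, N - j);  // tile width
                // Rows i..i+th of A, columns j..j+tw of B (leading dimension N).
                multiply_tile_on_gpu(A + i * K, B + j, C + i * N + j, th, tw, K, N, N);
            }
        }
    }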
For more information you can look at this paper:
CUDA Based Fast Implementation of Very Large Matrix Computation
I just found it by googling "large matrix multiplication CUDA"

Vertex Buffers - indexed or direct, interlaced or separate

What are some common guidelines for choosing a vertex buffer type? When should we use interleaved buffers for vertex data, and when separate ones? When should we use an index array, and when direct vertex data?
I'm searching for some common guidelines. I know some cases where one or the other fits better, but not all cases are easily solvable. What should one keep in mind when choosing the vertex buffer format while aiming for performance?
Links to web resources on the topic are also welcome.
First of all, you can find some useful information on the OpenGL wiki. Second, if in doubt, profile: there are some rules of thumb about this one, but experience can vary based on the data set, hardware, drivers, and so on.
Indexed versus direct rendering
I would almost always use the indexed method for vertex buffers by default. The main reason for this is the so-called post-transform cache: a cache kept after the vertex processing stage of your graphics pipeline. Essentially it means that if you use a vertex multiple times, you have a good chance of hitting this cache and being able to skip the vertex computation. There is one condition to even hit this cache: you need to use indexed buffers. It won't work without them, as the index is part of this cache's key.
Also, you will likely save storage: an index can be as small as you want (1 byte, 2 bytes), and you can reuse a full vertex specification. Suppose that a vertex and all its attributes total about 30 bytes of data, and you share this vertex over, let's say, 2 polygons. With indexed rendering (2-byte indices) this will cost you 2*index_size + attribute_size = 34 bytes; with non-indexed rendering it will cost you 60 bytes. Often your vertices will be shared more than twice.
Is index-based rendering always better? No, there might be scenarios where it's worse. For very simple applications it might not be worth the code overhead to set up an index-based data model. Also, when your attributes are not shared across polygons (e.g. a normal per polygon instead of per vertex) there is likely no vertex sharing at all, and IBOs won't give a benefit, only overhead.
Next to that, while indexing enables the post-transform cache, it can make generic memory cache performance worse. Because you access the attributes relatively randomly, you might have quite a few more cache misses, and memory prefetching (if this is done on the GPU) won't work decently. So it might be (but measure) that if you have enough memory and your vertex shader is extremely simple, the non-indexed version outperforms the indexed version. A minimal setup for the indexed path is sketched below.
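For reference, a minimal sketch of the indexed setup (GLEW and a core-profile context assumed; positions only):

    #include <GL/glew.h>
    #include <cstdint>
    #include <vector>

    GLuint make_indexed_mesh(const std::vector<float>& positions, // xyz per vertex
                             const std::vector<std::uint16_t>& indices)
    {
        GLuint vao, vbo, ibo;
        glGenVertexArrays(1, &vao);
        glBindVertexArray(vao);

        glGenBuffers(1, &vbo); // shared vertices, stored once
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, positions.size() * sizeof(float),
                     positions.data(), GL_STATIC_DRAW);

        glGenBuffers(1, &ibo); // small indices referencing the vertices
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo); // binding is recorded in the VAO
        glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices.size() * sizeof(std::uint16_t),
                     indices.data(), GL_STATIC_DRAW);

        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), nullptr);
        glEnableVertexAttribArray(0);
        return vao;
    }

    // Draw with: glBindVertexArray(vao);
    //            glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, nullptr);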
Interleaving vs non-interleaving vs buffer per-attribute
This story is a bit more subtle and I think it comes down to weighing some properties of your attributes.
Interleaved might be better because all attributes will be close together and likely in a few memory cachelines (maybe even a single one). Obviously, this can mean better performance. However, combined with index-based rendering your memory access is quite random anyway, and the benefit might be smaller than you'd expect.
Know which attributes are static and which are dynamic. If you have 5 attributes of which 2 are completely static, 1 changes every 15 minutes and 2 every 10 seconds, consider putting them in 2 or 3 separate buffers. You don't want to re-upload all 5 attributes every time those 2 most frequent change.
Consider that attributes should be aligned on 4 bytes, so you might want to take interleaving even one step further from time to time. Suppose you have a 3-component 1-byte attribute and some scalar 1-byte attribute; naively these will need 8 bytes, since each is padded to 4. You might gain a lot by packing them together into a single vec4, which reduces usage to 4 bytes, as sketched below.
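A sketch of that packing in struct form (field names are made up):

    #include <cstdint>

    // Naively, a 3-byte normal and a 1-byte material id would each be padded
    // to 4 bytes (8 bytes total); packed into one 4-byte "vec4" they need 4.
    struct PackedVertex {
        float position[3];                    // 12 bytes
        std::uint8_t normal_and_material[4];  // xyz normal + material id
    };
    static_assert(sizeof(PackedVertex) == 16, "tight, 4-byte-aligned layout");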
Play with buffer size, a too large buffer or too many small buffers may impact performance. But this is likely very dependent on the hardware, driver and OpenGL implementation.
Indexed vs Direct
Let's see what you get by indexing. Every repeated vertex, that is, a vertex shared across a "smooth" part of the mesh, costs you less; every singular "edge" vertex, split because its attributes differ per face, costs you more. For data that is based on the real world and is relatively dense, one vertex will belong to many triangles, and thus indexing will speed things up. For procedurally generated arbitrary data, direct mode will usually be better.
Indexed buffers also add additional complications to the code.
Interleaved vs Separate
The main difference here actually comes down to the question "will I want to update only one component?". If the answer is yes, then you shouldn't interleave, because any update will be extremely costly. If it's no, using interleaved buffers should improve locality of reference and generally be faster on most hardware.