Efficient design for multiple pixel operations when decompressing an image - c++

I maintain an image codec that requires post-processing of the image with a varying number of simple pixel operations: for example, gain, colour transform, scaling, truncation. Some ops work on single channels, while others (colour transform) work on three channels at a time.
When an image is decoded, it is stored in planar format, one buffer per channel.
I want to design an efficient framework in C++ that can apply a specified series of pixel ops. Since this is the inner loop, I need it to be efficient: pixel ops should be inlined, with minimal branching.
The simplest approach is to have a fixed array of, say, 20 operands, and pass this array, along with the actual number of ops, to the post-process method. Can someone suggest a cleverer way?
Edit: This would be a block operation, for efficiency, and I do plan on using SIMD to accelerate. So, for each pixel, I want to efficiently perform a configurable sequence of pixel ops, using SIMD instructions.

I would not recommend executing the pipeline at the pixel level; that would be horribly inefficient (and inapplicable for some operations). Do it on whole images.
As you suggested, it is an easy matter to encode the sequence of operations and associated arguments as a list, and write a simple execution engine that will call the desired functions.
Probably some of your operations are done in-place while others require an extra buffer. You will need to add suitable buffer management. Nothing insurmountable.
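As a rough illustration of that idea, here is a minimal sketch (the names PlanarImage, PixelOp, make_gain and run_pipeline are made up for illustration): each operation processes whole channel buffers, and the "engine" simply walks the configured list.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Hypothetical planar image: one buffer per channel, as produced by the decoder.
    struct PlanarImage {
        std::vector<std::vector<float>> channels;  // e.g. 3 channels
        std::size_t width = 0, height = 0;
    };

    // A pixel op processes whole buffers at a time (block/image granularity),
    // so the per-op dispatch cost is paid once per image, not once per pixel.
    using PixelOp = std::function<void(PlanarImage&)>;

    // Example op: per-channel gain, written as a tight loop the compiler can
    // auto-vectorise (or that you can replace with explicit SIMD intrinsics).
    PixelOp make_gain(float g) {
        return [g](PlanarImage& img) {
            for (auto& ch : img.channels)
                for (float& px : ch)
                    px *= g;
        };
    }

    // The execution engine: run the configured sequence in order.
    void run_pipeline(const std::vector<PixelOp>& ops, PlanarImage& img) {
        for (const auto& op : ops)
            op(img);
    }

    int main() {
        PlanarImage img;
        img.width = img.height = 8;
        img.channels.assign(3, std::vector<float>(img.width * img.height, 1.0f));

        std::vector<PixelOp> pipeline;
        pipeline.push_back(make_gain(1.5f));
        // ... push colour transform, scaling, truncation ops here ...

        run_pipeline(pipeline, img);
    }

Because the indirection happens once per buffer rather than once per pixel, the std::function call cost is negligible, and each op's inner loop stays tight enough to vectorise.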

Related

Is it thread-safe to access a Mat with multiple threads in OpenCV?

I want to speed up an algorithm (complete local binary pattern with circle neighbours) for which I iterate through all pixels and calculate some stuff with their neighbours (so I need neighbouring pixel access).
Currently I do this by iterating over all pixels with one thread/process. I want to parallelize this task by dividing the input image into multiple ROIs and calculating each ROI separately (with multiple threads).
The problem here is that the ROIs overlap (because to calculate a pixel I sometimes need to look at neighbours far away), so it is possible that multiple threads access pixel data (reading) at the same time. Is that a problem if two or more threads read the same Mat at the same indices at the same time?
Is it also a problem if I write to the same Mat in parallel, but at different indices?
As long as no writes happen simultaneously to the reads, it is safe to have multiple concurrent reads.
That holds for any sane system.
Consider the alternative:
If there was a race condition, it would mean that the memory storing your object gets modified during the read operation. If no memory (storing the object) gets written to during the read, there's no possible interaction between the threads.
Lastly, if you look at the doc,
https://docs.opencv.org/3.1.0/d3/d63/classcv_1_1Mat.html
You'll see two mentions of thread-safety:
"Thus, it is safe to operate on the same matrices asynchronously in different threads."
They mention it around ref-counting, performed during matrix assignment. So, at the very least, assigning from the same matrix to two others can be done safely in multiple threads. This pretty much guarantees that simple read access is also thread-safe.
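For illustration, here is a minimal sketch of that situation (processRows and the trivial per-pixel body are placeholders for the real LBP computation): several threads read the same cv::Mat concurrently, and each writes only to its own disjoint row range of a separate output matrix.

    #include <opencv2/core.hpp>
    #include <functional>
    #include <thread>
    #include <vector>

    // Each worker reads (possibly overlapping) regions of the shared input,
    // but writes only its own disjoint row range of the output.
    static void processRows(const cv::Mat& src, cv::Mat& dst, int rowBegin, int rowEnd) {
        for (int y = rowBegin; y < rowEnd; ++y)
            for (int x = 0; x < src.cols; ++x)
                dst.at<uchar>(y, x) = src.at<uchar>(y, x);  // stand-in for the real LBP code
    }

    int main() {
        cv::Mat input(512, 512, CV_8UC1, cv::Scalar(0));
        cv::Mat output(input.size(), input.type());

        const int numThreads = 4;
        const int rowsPerThread = input.rows / numThreads;

        std::vector<std::thread> workers;
        for (int t = 0; t < numThreads; ++t) {
            int begin = t * rowsPerThread;
            int end = (t == numThreads - 1) ? input.rows : begin + rowsPerThread;
            // Concurrent reads of `input` are fine; writes go to disjoint rows of `output`.
            workers.emplace_back(processRows, std::cref(input), std::ref(output), begin, end);
        }
        for (auto& w : workers) w.join();
    }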
Generally, parallel reading is not a problem, as a cv::Mat is just a nice wrapper around an array, just like std::vector (yes, there are differences, but I don't see how they would affect the matter at hand, so I'm going to ignore them). However, parallelization doesn't automatically give you a performance boost. There are quite a few things to consider here:
Creating a thread is resource-heavy and can have a large negative impact if the task is relatively short (in terms of computation time), so thread pooling has to be considered.
If you write high-performance code (no matter whether multi- or single-threaded) you should have a grasp of how your hardware works. In this case: memory and CPU. There is a very good talk by Timur Doumler at CppCon 2016 about that topic. It should help you avoid cache misses.
Also worth mentioning is compiler optimization. Turn it on. I know this sounds super obvious, but there are a lot of people on SO who ask questions about performance and yet don't know what compiler optimization is.
Finally, there is the OpenCV Transparent API (TAPI), which basically utilizes the GPU instead of the CPU. Almost all built-in algorithms of OpenCV support the TAPI; you just have to pass a cv::UMat instead of a cv::Mat. The two types are convertible to each other. However, the conversion is time-intensive, because a UMat is basically an array in GPU memory (VRAM), which means it has to be copied each time you convert. Also, accessing the VRAM takes longer than accessing the RAM (for the CPU, that is).
Keep in mind, though, that you cannot access VRAM data with the CPU without copying it to RAM. This means you cannot iterate over your pixels if you use cv::UMat. That is only possible if you write your own OpenCL or CUDA code so your algorithm can run on the GPU.
On most consumer-grade PCs, for sliding-window algorithms (basically anything that iterates over the pixels and performs a calculation around each pixel), using the GPU is usually by far the fastest method (but also requires the most effort to implement). Of course, this only holds if the data buffer (your image) is large enough to make it worth copying to and from the VRAM.
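To make the TAPI point concrete, here is a minimal sketch (cv::GaussianBlur is just an arbitrary built-in chosen as an example; whether it actually runs on the GPU depends on the OpenCL setup of the machine, and TAPI silently falls back to the CPU otherwise):

    #include <opencv2/core.hpp>
    #include <opencv2/imgproc.hpp>

    int main() {
        cv::Mat cpuImage(512, 512, CV_8UC1, cv::Scalar(128));

        // Copy the data into a UMat; built-in algorithms may then run on the GPU via OpenCL.
        cv::UMat gpuImage;
        cpuImage.copyTo(gpuImage);

        cv::UMat gpuResult;
        cv::GaussianBlur(gpuImage, gpuResult, cv::Size(5, 5), 1.5);

        // Converting back forces a copy from VRAM to RAM, so do it once, not per pixel.
        cv::Mat cpuResult;
        gpuResult.copyTo(cpuResult);
    }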
For parallel writing: it's generally safe as long as you don't have overlapping areas. However, cache misses and false sharing (as pointed out by NathanOliver) are problems to be considered.

OpenCL counter variable

I'm performing Otsu's method (link: https://en.wikipedia.org/wiki/Otsu%27s_method) in order to determine how many black pixels are in the raw frame. I'm trying to optimize the process and I want to do it with OpenCL. Is there any way to pass a single variable to the OpenCL kernel and increment it, instead of passing a whole buffer when that isn't necessary?
The problem you want to solve is very much like a global reduction. While you could use a single output variable and atomic read/modify/write access, it would be very slow (due to contention on the atomic) unless the thing you are counting is very sparse (not the case here). So you should use the same techniques used for reduction, which is to do partial sums in parallel, and then merge those together. Google for "OpenCL reduction" and you'll find some great examples, such as this one: https://developer.amd.com/resources/articles-whitepapers/opencl-optimization-case-study-simple-reductions/
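As a rough sketch of what such a reduction kernel can look like (kernel source only, shown as a C++ string constant; host-side setup with clCreateBuffer/clSetKernelArg is omitted, and the kernel and argument names are invented for illustration): each work-item classifies one pixel, a tree reduction in local memory produces one partial count per work-group, and the host sums the small array of partial counts at the end.

    // Assumes the local work size is a power of two; `scratch` is local memory
    // passed via clSetKernelArg(kernel, 3, localSize * sizeof(cl_uint), NULL).
    static const char* kCountBlackKernel = R"CLC(
    __kernel void count_black(__global const uchar* pixels,
                              const uint n,
                              const uchar threshold,
                              __local uint* scratch,
                              __global uint* partialCounts)
    {
        uint gid = get_global_id(0);
        uint lid = get_local_id(0);

        // Each work-item counts its own pixel (a grid-stride loop would also work).
        scratch[lid] = (gid < n && pixels[gid] <= threshold) ? 1u : 0u;
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction within the work-group.
        for (uint offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
            if (lid < offset)
                scratch[lid] += scratch[lid + offset];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // One partial sum per work-group; sum these on the host (or in a second pass).
        if (lid == 0)
            partialCounts[get_group_id(0)] = scratch[0];
    }
    )CLC";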

Vectorise image block processing efficiently?

I am curious what the most efficient method is for processing an image block by block.
At the moment, I apply some vectorization techniques, such as reading one row of pixels (8 pixels per row, each 8 bits deep) from an 8x8 block. But as modern processors support 128/256-bit vector operations, I think loading two rows of pixels from the image block could improve code speed.
The problem is that an image (for example a 16x16 image, containing four 8x8 blocks) is stored in memory contiguously from the first pixel to the last. Loading one 8-pixel row is easy, but how should I manipulate the pointer or align the image data so that I can load two rows together?
I think this figure can illustrate my problem clearly:
(figure: pixel addresses in an image)
So, when we load 8 pixels (a row) together, we simply load 8 bytes of data from the initial pointer position with one instruction. When we load the 2nd row, we simply add 9 to the pointer and load the second row.
So, the question is: is there any method by which we could load these two rows (16 pixels) together from the initial pointer position?
Thanks!
To make each row aligned, you can pad the end of each row. Writing your code to support a shorter image width than the stride between rows lets your algorithm work on a subset of an image.
Also, you don't actually need everything to be aligned for SIMD to work well. Contiguous is sufficient. Most SIMD instruction sets (SSE, NEON, etc.) have unaligned load instructions. Depending on the specific implementation, there might not be much penalty.
You don't load two different rows into the same SIMD vector. For example, to do an 8x8 SAD (sum of absolute differences) using AVX2 VPSADBW, each 32-byte load would get data from one row of four different 8x8 blocks. But that's fine, you just use that to produce four 8x8 SAD results in parallel, instead of wasting a lot of time shuffling to do a single 8x8 SAD.
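Here is a minimal AVX2 sketch of that idea (the function name and the toy main are made up; it assumes the four 8x8 blocks sit side by side, so one 32-byte load covers one row of all four): _mm256_sad_epu8 yields four per-block partial SADs per row, which are simply accumulated.

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // SAD of four adjacent 8x8 blocks at once (AVX2).
    // `cur` and `ref` point to the first row of four side-by-side 8x8 blocks;
    // `stride` is the distance in bytes between image rows.
    // Results: sads[0..3] = SAD of blocks 0..3.
    void sad_4x_8x8(const uint8_t* cur, const uint8_t* ref,
                    std::ptrdiff_t stride, uint64_t sads[4])
    {
        __m256i acc = _mm256_setzero_si256();
        for (int row = 0; row < 8; ++row) {
            // One 32-byte (unaligned) load spans one row of all four blocks.
            __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(cur + row * stride));
            __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ref + row * stride));
            // VPSADBW: per 8-byte lane, sum of |a-b| -> four 64-bit partial SADs.
            acc = _mm256_add_epi64(acc, _mm256_sad_epu8(a, b));
        }
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(sads), acc);
    }

    int main() {
        // Toy example: a 32x8 strip, i.e. four 8x8 blocks side by side.
        uint8_t cur[32 * 8], ref[32 * 8];
        for (int i = 0; i < 32 * 8; ++i) { cur[i] = uint8_t(i); ref[i] = uint8_t(i + 1); }
        uint64_t sads[4];
        sad_4x_8x8(cur, ref, 32, sads);  // stride = 32 bytes per row
    }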
For example, Intel's MPSADBW tutorial shows how to implement an exhaustive motion search for 4x4, 8x8, and 16x16 blocks, with C and Intel's SSE intrinsics. Apparently the actual MPSADBW instruction isn't worth using in practice, though, because it's slower than PSADBW and you can get identical results faster with a sequential elimination exhaustive search, as used by x264 (and mentioned by x264 developers in this forum thread about whether SSE4.1 would help x264).
Some SIMD-programming blog posts from the archives of Dark Shikari's blog: Diary Of An x264 Developer:
Cacheline splits, take two: using PALIGNR or other techniques to set up unaligned inputs for motion searches
A curious SIMD assembly challenge: the zigzag

For simple rendering: Is OpenCL faster than OpenGL?

I need to draw hundreds of semi-transparent circles as part of my OpenCL pipeline.
Currently, I'm using OpenGL (with alpha blend), synced (for portability) using clFinish and glFinish with my OpenCL queue.
Would it be faster to do this rendering task in OpenCL? (assuming the rest of the pipeline is already in OpenCL, and may run on CPU if a no OpenCL-compatible GPU is available).
It's easy to replace the rasterizer with a simple test function in the case of a circle. The blend function requires a single read from the destination texture per fragment. So a naive OpenCL implementation seems theoretically faster. But maybe OpenGL can render non-overlapping triangles in parallel (which would be harder to implement in OpenCL)?
Odds are good that OpenCL-based processing would be faster, but only because you don't have to deal with CL/GL interop. The fact that you have to execute a glFinish/clFinish at all is a bottleneck.
This has nothing to do with fixed-function vs. shader hardware. It's all about getting rid of the synchronization.
Now, that doesn't mean that there aren't wrong ways to use OpenCL to render these things.
What you don't want to do is write colors to memory with one compute operation, then read them back in another compute op, blend, and write them back out to memory. That way lies madness.
What you ought to do instead is effectively build a tile-based renderer internally. Each workgroup will represent some count of pixels (experiment to determine the best count for performance). Each invocation operates on a single pixel. They'll use their pixel position, do the math to determine whether the pixel is within the circle (and how much of it is within the circle), then blend that with a local variable the invocation keeps internally. So each invocation processes all of the circles, only writing their pixel's worth of data out at the very end.
Now if you want to be clever, you can do culling, so that each work group is given only the circles that are guaranteed to affect at least some pixel within their particular area. That is effectively a preprocessing pass, and you could even do that on the CPU, since it's probably not that expensive.
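A rough sketch of that structure (kernel source only, embedded as a C++ string; the Circle layout and kernel name are invented for illustration, and host code, culling and per-work-group tiling are omitted): one work-item per pixel, all blending done in a private accumulator, with a single global read and a single global write per pixel.

    static const char* kCircleKernel = R"CLC(
    typedef struct { float cx, cy, radius; float4 colour; } Circle;  // colour.w = alpha

    __kernel void draw_circles(__global const Circle* circles, const uint numCircles,
                               const uint width, const uint height,
                               __global float4* framebuffer)
    {
        uint x = get_global_id(0);
        uint y = get_global_id(1);
        if (x >= width || y >= height) return;

        float4 dst = framebuffer[y * width + x];   // one read
        for (uint i = 0; i < numCircles; ++i) {
            float dx = (float)x - circles[i].cx;
            float dy = (float)y - circles[i].cy;
            if (dx * dx + dy * dy <= circles[i].radius * circles[i].radius) {
                float a = circles[i].colour.w;
                dst = circles[i].colour * a + dst * (1.0f - a);  // "over" blend, kept private
            }
        }
        framebuffer[y * width + x] = dst;          // one write
    }
    )CLC";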

GPGPU - effective ping-pong technique?

I'm trying to implement an effective fluid solver on the GPU using WebGL and GLSL shader programming.
I've found an interesting article:
http://http.developer.nvidia.com/GPUGems/gpugems_ch38.html
See: 38.3.2 Slab Operations
I'm wondering if this technique of enforcing boundary conditions is possible with ping-pong rendering?
If I render only lines, what about the interior of the texture?
I've always assumed that the whole input texture must be copied to a temporary texture (of course the boundary is updated during that process), as they are swapped after that operation.
This is especially interesting considering that Example 38-5, The Boundary Condition Fragment Program (visualization: http://i.stack.imgur.com/M4Hih.jpg), shows a scheme that IMHO requires the ping-pong technique.
What do you think? Do I misunderstand something?
Generally I've found that texture writes are extremely costly, and that's why I would like to limit them somehow. Unfortunately, the ping-pong technique requires a lot of texture writes.
I've actually implemented the technique described in that chapter using FrameBuffer objects as the render to texture method (but in desktop OpenGL since WebGL didn't exist at the time), so it's definitely possible. Unfortunately I don't believe I have the code any more, but if you tag any future questions you have with [webgl] I'll see if I can provide some help.
You will need to ping-pong several times per frame (the article mentions five steps, but I seem to recall the exact number depends on the quality of the simulation you want and on your exact boundary conditions). Using FBOs is quite a bit more efficient than it was when this was written (the author mentions using a GeForce FX 5950, which was a while ago), so I wouldn't worry about the overhead he mentions in the article. As long as you aren't bringing data back to the CPU, you shouldn't find too high a cost for switching between texture and the framebuffer.
You will have some leakage if your boundaries are only a pixel thick, but that may or may not be acceptable depending on how you render your results and the velocity of your fluid. Making the boundaries thicker may help, and there are papers that have been written since this one that explore different ways of confining the fluid within boundaries (I also recall a few on more efficient diffusion/pressure solvers that you might check out after you have this version working...you'll find some interesting follow ups if you search for papers that cite the GPU gems article on google scholar).
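For reference, the ping-pong structure itself is small. Here is a hypothetical sketch in desktop OpenGL (it assumes GLEW and an already-created GL context; shader setup and the fullscreen-quad draw are omitted, and PingPong/initPingPong/step are made-up names):

    #include <GL/glew.h>
    #include <utility>

    // Two textures attached to two FBOs; each simulation step reads from one
    // and renders into the other, then the roles are swapped.
    struct PingPong {
        GLuint fbo[2] = {0, 0};
        GLuint tex[2] = {0, 0};
        int read = 0, write = 1;
    };

    void initPingPong(PingPong& pp, int width, int height) {
        glGenTextures(2, pp.tex);
        glGenFramebuffers(2, pp.fbo);
        for (int i = 0; i < 2; ++i) {
            glBindTexture(GL_TEXTURE_2D, pp.tex[i]);
            glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, width, height, 0, GL_RGBA, GL_FLOAT, nullptr);
            glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
            glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
            glBindFramebuffer(GL_FRAMEBUFFER, pp.fbo[i]);
            glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, pp.tex[i], 0);
        }
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }

    void step(PingPong& pp) {
        glBindFramebuffer(GL_FRAMEBUFFER, pp.fbo[pp.write]);   // render into "write"
        glBindTexture(GL_TEXTURE_2D, pp.tex[pp.read]);         // sample from "read"
        // ... bind the advection/pressure/boundary shader and draw a fullscreen quad ...
        std::swap(pp.read, pp.write);                          // swap roles for the next pass
    }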
Addendum: I'm not sure I entirely understand your question about boundaries. The key is that you must run a shader at each pixel of what you want to be a boundary, but it doesn't really matter how that pixel gets there, whether it's drawn with lines, points, or triangles (as long as its inputs are correct).
In the very general case (which might not apply if you only have a limited number of boundary primitives), you will likely have to draw a framebuffer-covering quad, since the interactions with the velocity and pressure fields are more complicated (any surrounding pixel could be another boundary pixel, instead of having simply defined edges). See section 38.5.4 (Arbitrary Boundaries) for some explanation of how to do it. If something isn't a boundary, you won't touch the vector field, and if it is, instead of hardcoding which directions you want to look in to sum vector values, you'll probably end up testing the surrounding pixels and only summing the ones that aren't boundaries so that you can enforce the boundary conditions.