How to access pixel values fast in C++ with OpenCV

I looked up some tutorials on accessing pixel values in C++ with OpenCV. For the example of modifying every pixel value, using .ptr is faster than using .at.
I realize that how you calculate the new value for assignment also influences performance, but I wonder: is using .ptr always faster than .at?
Even if what I'm doing is comparing a pixel with its neighboring pixels?
I'm writing code to find out whether a pixel is a maximum/minimum among its 8 neighboring pixels plus 18 more pixels from two Gaussian-blurred images with different sigmas. (Yes, for SIFT.) I'm currently using .at to access pixel values, and the code takes some time to run (because many images need to go through the same process). I wonder whether using .ptr will improve performance.

The documentation says that the pointer method is the fastest in every case; the other methods are only safer.
It also says that the .at() method is the most time-consuming, which should explain your lack of performance.
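For illustration, here is a minimal sketch of the two styles on a CV_8UC1 image (the function names and layout are made up for the example). In release builds the gap narrows, because .at() drops its bounds checks, but the row-pointer version still avoids recomputing the element address on every access:

    #include <opencv2/core.hpp>

    // Checks whether the (non-border) pixel at (y, x) is a strict maximum of
    // its 8 neighbours, once with .at and once with row pointers.
    bool isLocalMaxAt(const cv::Mat& img, int y, int x)
    {
        uchar v = img.at<uchar>(y, x);
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                if ((dy != 0 || dx != 0) && img.at<uchar>(y + dy, x + dx) >= v)
                    return false;
        return true;
    }

    bool isLocalMaxPtr(const cv::Mat& img, int y, int x)
    {
        // Fetch the three row pointers once instead of recomputing the element
        // address (and, in debug builds, bounds-checking) on every access.
        const uchar* prev = img.ptr<uchar>(y - 1);
        const uchar* curr = img.ptr<uchar>(y);
        const uchar* next = img.ptr<uchar>(y + 1);
        uchar v = curr[x];
        return prev[x - 1] < v && prev[x] < v && prev[x + 1] < v
            && curr[x - 1] < v && curr[x + 1] < v
            && next[x - 1] < v && next[x] < v && next[x + 1] < v;
    }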


Is it thread-safe to access a Mat with multiple threads in OpenCV?

I want to speed up an algorithm (complete local binary pattern with circular neighbours) for which I iterate through all pixels and calculate some values from their neighbours (so I need access to neighbouring pixels).
Currently I do this by iterating over all pixels with one thread/process. I want to parallelize this task by dividing the input image into multiple ROIs and calculating each ROI separately (with multiple threads).
The problem here is that the ROIs overlap (because to calculate a pixel, I sometimes need to look at neighbours far away), so it's possible that multiple threads access pixel data (reading) at the same time. Is it a problem if two or more threads read the same Mat at the same indices at the same time?
Is it also a problem if I write to the same Mat in parallel but at different indices?
As long as no writes happen simultaneously to the reads, it is safe to have multiple concurrent reads.
That holds for any sane system.
Consider the alternative:
If there was a race condition, it would mean that the memory storing your object gets modified during the read operation. If no memory (storing the object) gets written to during the read, there's no possible interaction between the threads.
Lastly, if you look at the doc,
https://docs.opencv.org/3.1.0/d3/d63/classcv_1_1Mat.html
You'll see two mentions of thread-safety:
Thus, it is safe to operate on the same matrices asynchronously in different threads.
They mention it around ref-counting, performed during matrix assignment. So, at the very least, assigning from the same matrix to two others can be done safely in multiple threads. This pretty much guarantees that simple read access is also thread-safe.
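As a minimal illustration of that claim (the sizes and values are arbitrary), two threads reading the same cv::Mat concurrently while nothing writes to it:

    #include <opencv2/core.hpp>
    #include <iostream>
    #include <thread>

    int main()
    {
        cv::Mat img(1000, 1000, CV_8UC1, cv::Scalar(42));
        long sums[2] = {0, 0};

        // Both threads read the same Mat, at overlapping indices, at the same
        // time; neither writes to it, so there is no data race.
        auto reader = [&](int id) {
            for (int y = 0; y < img.rows; ++y)
                for (int x = 0; x < img.cols; ++x)
                    sums[id] += img.at<uchar>(y, x);
        };
        std::thread t1(reader, 0), t2(reader, 1);
        t1.join();
        t2.join();
        std::cout << sums[0] + sums[1] << "\n";
    }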
Generally, parallel reading is not a problem, as a cv::Mat is just a nice wrapper around an array, much like std::vector (yes, there are differences, but I don't see how they would affect the matter at hand, so I'm going to ignore them). However, parallelization doesn't automatically give you a performance boost. There are quite a few things to consider here:
Creating a thread is resource-heavy and can have a large negative impact if the task is relatively short (in terms of computation time), so thread pooling has to be considered.
If you write high-performance code (no matter whether multi- or single-threaded) you should have a grasp of how your hardware works. In this case: memory and CPU. There is a very good talk by Timur Doumler at CppCon 2016 about that topic. It should help you avoid cache misses.
Also worth mentioning is compiler optimization. Turn it on. I know this sounds super obvious, but there are a lot of people on SO who ask questions about performance and yet don't know what compiler optimization is.
Finally, there is the OpenCV Transparent API (TAPI), which basically utilizes the GPU instead of the CPU. Almost all built-in algorithms of OpenCV support the TAPI; you just have to pass a cv::UMat instead of a cv::Mat. The two types are convertible to each other. However, the conversion is time-intensive, because a UMat is basically an array in GPU memory (VRAM), which means it has to be copied each time you convert. Also, accessing VRAM takes longer than accessing RAM (for the CPU, that is).
Keep in mind, though, that you cannot access VRAM data with the CPU without copying it to RAM first. This means you cannot iterate over your pixels yourself if you use cv::UMat; that is only possible if you write your own OpenCL or CUDA code, so that your algorithm runs on the GPU.
On most consumer-grade PCs, for sliding-window algorithms (basically anything that iterates over the pixels and performs a calculation around each pixel), using the GPU is usually by far the fastest method (but also requires the most implementation effort). Of course, this only holds if the data buffer (your image) is large enough to make copying to and from VRAM worthwhile.
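A minimal sketch of the TAPI route (assuming an OpenCL-capable device is available; without one, the same call simply runs on the CPU):

    #include <opencv2/core.hpp>
    #include <opencv2/imgproc.hpp>

    void blurViaTapi(const cv::Mat& input, cv::Mat& output)
    {
        cv::UMat uin, uout;
        input.copyTo(uin);                                  // upload: Mat -> UMat
        cv::GaussianBlur(uin, uout, cv::Size(9, 9), 2.0);   // may run on the GPU
        uout.copyTo(output);                                // download: UMat -> Mat
    }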
For parallel writing: it's generally safe as long as the written areas don't overlap. However, cache misses and false sharing (as pointed out by NathanOliver) are problems to consider.
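Putting the pieces together, here is a sketch of the ROI scheme using OpenCV's built-in thread pool; the neighbour computation is a simplified stand-in for a real circular LBP. Each worker reads the shared input, including pixels belonging to neighbouring row ranges, but writes only to its own rows of the output:

    #include <opencv2/core.hpp>

    void lbpLikeParallel(const cv::Mat& src, cv::Mat& dst, int radius)
    {
        CV_Assert(src.type() == CV_8UC1);
        dst = cv::Mat::zeros(src.size(), CV_8UC1);

        cv::parallel_for_(cv::Range(radius, src.rows - radius),
                          [&](const cv::Range& rows) {
            for (int y = rows.start; y < rows.end; ++y) {
                uchar* out = dst.ptr<uchar>(y);
                for (int x = radius; x < src.cols - radius; ++x) {
                    uchar c = src.at<uchar>(y, x);
                    int code = 0;
                    // Overlapping *reads* across row-range borders are fine.
                    code |= (src.at<uchar>(y - radius, x) >= c) << 0;
                    code |= (src.at<uchar>(y, x + radius) >= c) << 1;
                    code |= (src.at<uchar>(y + radius, x) >= c) << 2;
                    code |= (src.at<uchar>(y, x - radius) >= c) << 3;
                    // Each row is written by exactly one worker: disjoint writes.
                    out[x] = (uchar)code;
                }
            }
        });
    }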

Improved SGBM based on previous frames result

I was wondering if there is any good method to make the SGBM process faster by taking information from the previous video frame.
I think it could be made faster by searching for correspondences only near the disparity found in the previous frame. The problem I see with this is when, from one frame to the next, a block passes from an object to the background or vice versa. If it is possible, I think this would be an interesting improvement to make; I have looked for it but didn't find anything.
You have already identified the main problem yourself: it breaks down when the scene is in motion.
I managed to write some algorithms that take into consideration the critical zones around objects' borders; they were a little more accurate, but much slower than SGBM.
Maybe you can simply set the maximum and minimum disparity values to a reasonable range around what you found in the previous frame, instead of using "safe values". In my experience with OpenCV, StereoBM is faster but not as good as SGBM, and SGBM is better optimized than any algorithm you are likely to write yourself (again, in my experience).
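A minimal sketch of that idea (the margin and the handling of invalid pixels are left as assumptions; parameter names follow cv::StereoSGBM): take the min/max of the previous disparity map, pad it, and feed it into the next frame's matcher.

    #include <opencv2/calib3d.hpp>
    #include <opencv2/core.hpp>

    // prevDisp is the previous frame's SGBM output: CV_16S, fixed point (value*16).
    cv::Ptr<cv::StereoSGBM> sgbmForNextFrame(const cv::Mat& prevDisp, int margin)
    {
        double minVal, maxVal;
        // Note: invalid pixels (minDisparity - 1) are not masked out here.
        cv::minMaxLoc(prevDisp, &minVal, &maxVal);

        int minD = (int)(minVal / 16) - margin;
        int numD = (int)((maxVal - minVal) / 16) + 2 * margin;
        numD = ((numD + 15) / 16) * 16;   // numDisparities must be divisible by 16

        return cv::StereoSGBM::create(minD, numD /*, blockSize, P1, P2, ... */);
    }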
Maybe you can get better (faster) results using the CUDA algorithm (SGBM processed on the GPU). My group and I are working on that.

texture(...) vs textureOffset(...) performance in GLSL

Does utilizing textureOffset(...) increase performance compared to calculating offsets manually and using the regular texture(...) function?
Since there is a GL_MAX_PROGRAM_TEXEL_OFFSET property, I would guess that it can fetch offset texels in a single fetch, or at least as few as possible, making it great for e.g. blurring effects, but I can't seem to find out anywhere how it works internally.
Update:
Reformulated question: is it common among GL drivers to make any optimizations regarding texture fetches when the textureOffset(...) functions are used?
You're asking the wrong question. The question should not be whether the more specific function will always have better performance. The question is whether the more specific function will ever be slower.
And there's no reason to expect it to be slower. If the hardware has no specialized functionality for offset texture accesses, then the compiler will just offset the texture coordinate manually, exactly like you could. If there is hardware to help, then it will use it.
So if you have need of textureOffsets and can live within its limitations, there's no reason not to use it.
I would guess that it can fetch offseted texels in a single, or at least as few as possible, fetches making it superb for example blurring effects
No, that's textureGather. textureOffset does exactly what its name says: it accesses a texture at a texture coordinate, with a texel offset from that coordinate's location.
textureGather samples multiple neighboring texels all at once. If you need to read a section of a texture to do blurring, textureGather (and textureGatherOffset) are going to be more useful than textureOffset.
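A minimal GLSL sketch contrasting the two calls (the sampler and varying names are placeholders, and the final combination is a toy just so every result is used):

    #version 400 core
    uniform sampler2D tex;
    in vec2 uv;
    out vec4 fragColor;

    void main()
    {
        // textureOffset: a single fetch, displaced by a constant texel offset
        // (the offset must lie within GL_MIN/MAX_PROGRAM_TEXEL_OFFSET).
        vec4 right = textureOffset(tex, uv, ivec2(1, 0));
        vec4 up    = textureOffset(tex, uv, ivec2(0, 1));

        // textureGather: the red components (component 0) of the four texels
        // of the 2x2 footprint around uv, fetched in one call.
        vec4 reds = textureGather(tex, uv, 0);

        fragColor = 0.5 * (right + up) * dot(reds, vec4(0.25));
    }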

How to multiply each channel separately with same matrix?

I have a 1-channel and a 3-channel Mat of the same size; call them a and img. I want to multiply each channel of img with a. And since I will perform this many times, performance is an issue.
Is there a way of using the multiply() operation or the multiplication operator overloads to benefit from the optimizations in OpenCV? I am trying to avoid writing my own loop for performance reasons; using operators leads to much cleaner code, too.
I don't want to repeat a three times and merge() the copies into a single 3-channel Mat, because of the performance cost.
Is there a way of using the multiply() operations or multiply operator overloads to benefit from the optimizations in OpenCV?
OpenCV3 pushes the use of the cv::UMat class in place of cv::Mat. This should give you a little GPU acceleration where possible.
I am trying to avoid writing my own loop for performance reasons, using operators leads to much clean code too.
I would disagree: the performance argument is probably wrong, because you will depend on whatever compilation flags were used to build the libs. If the lib wasn't built with AVX2, you will lose performance. Further, you will be limited to OpenCV's primitives, which drastically increases memory traffic. Specifically, each time you do something like cv::add(A,B,C) followed by cv::sqrt(C,C), you hit memory an extra time, resulting in a noticeable performance decrease.
It is also definitely not cleaner code; it is more like writing scripts for an old Polish-notation calculator.
In summary, if you have performance concerns, grab the .data pointer (or use .ptr<>()), check that the loop vectorizes, and do your work in C++/CUDA/OpenCL.
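Along those lines, a minimal sketch of the raw loop for the question's case, assuming CV_32F data (the function name is made up for the example):

    #include <opencv2/core.hpp>

    // Multiplies each channel of the CV_32FC3 'img' by the CV_32FC1 'a',
    // in place, without any temporary Mats.
    void scaleChannels(cv::Mat& img, const cv::Mat& a)
    {
        CV_Assert(img.type() == CV_32FC3 && a.type() == CV_32FC1);
        CV_Assert(img.size() == a.size());
        for (int y = 0; y < img.rows; ++y) {
            cv::Vec3f*   p = img.ptr<cv::Vec3f>(y);
            const float* s = a.ptr<float>(y);
            for (int x = 0; x < img.cols; ++x) {
                p[x][0] *= s[x];
                p[x][1] *= s[x];
                p[x][2] *= s[x];
            }
        }
    }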

How to use the native pointer to a texture on the GPU?

I'm currently doing some GPGPU work on my GPU. I've written a shader that performs all the calculations I want, and it gives the right results. However, the engine I'm using (Unity) requires me to use a slow and cumbersome way to load the values from the GPU to the CPU, which is also memory-inefficient and loses precision. In short, it works, but it also sucks.
However, Unity also gives me the option to retrieve the texture's ID (OpenGL-specific?) or the texture's pointer (apparently not platform-specific), after which I can write a DLL in native code (C++) to get the data from the GPU to the CPU. On the GPU it's a texture in RGBAFloat (so 4 floats per pixel, but I could easily change this to just 1 float per pixel if necessary), and on the CPU I just want a two-dimensional array of floats. It seems to me that this should be pretty trivial, yet I can't seem to find useful information.
Does anyone have any ideas how I can retrieve the floats in the texture using the pointer, and let C++ output it as an array of floats?
Please ask for clarification if needed.
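For the OpenGL case, a minimal sketch of what the native side could look like, assuming the pointer Unity hands over is an OpenGL texture name and the function runs on a thread with a current GL context (the export name and parameters are made up; error handling is omitted):

    #include <GL/gl.h>

    extern "C" void ReadTextureToFloats(void* texturePtr, float* out)
    {
        // 'out' must hold width * height * 4 floats for mip level 0.
        GLuint tex = (GLuint)(size_t)texturePtr;
        glBindTexture(GL_TEXTURE_2D, tex);
        // Read the whole mip level 0 as 4 floats per pixel straight into 'out'.
        glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_FLOAT, out);
    }

Note that glGetTexImage stalls the pipeline; for repeated readbacks, going through a pixel buffer object lets the copy overlap with rendering.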