OpenCL counter variable - c++

I'm performing Otsu's method (https://en.wikipedia.org/wiki/Otsu%27s_method) to determine how many black pixels are in a raw frame. I'm trying to optimize the process and want to do it with OpenCL. Is there a way to pass a single variable to an OpenCL kernel and increment it, instead of passing a whole buffer when that isn't necessary?

The problem you want to solve is very much like a global reduction. While you could use a single output variable and atomic read/modify/write access, it would be very slow (due to contention on the atomic) unless the thing you are counting is very sparse (not the case here). So you should use the same techniques used for reduction, which is to do partial sums in parallel, and then merge those together. Google for "OpenCL reduction" and you'll find some great examples, such as this one: https://developer.amd.com/resources/articles-whitepapers/opencl-optimization-case-study-simple-reductions/
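The two-stage idea (partial sums in parallel, then a merge) can be sketched on the CPU in plain C++. This is only an illustration of the pattern, not OpenCL code: each slice stands in for a work-group, and the names `partial_black_counts`, `count_black_pixels` and the "black means below threshold" rule are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Stage 1: every "work-group" counts the black pixels of its own slice into
// its own slot, so no two groups ever update the same counter.
// Assumption: a pixel is "black" when its value is below `threshold`.
std::vector<std::size_t> partial_black_counts(const std::vector<std::uint8_t>& pixels,
                                              std::uint8_t threshold,
                                              std::size_t num_groups) {
    assert(num_groups > 0);
    std::vector<std::size_t> partial(num_groups, 0);
    const std::size_t chunk = (pixels.size() + num_groups - 1) / num_groups;
    for (std::size_t g = 0; g < num_groups; ++g) {  // one iteration per group
        const std::size_t begin = std::min(g * chunk, pixels.size());
        const std::size_t end = std::min(begin + chunk, pixels.size());
        for (std::size_t i = begin; i < end; ++i) {
            if (pixels[i] < threshold) ++partial[g];
        }
    }
    return partial;
}

// Stage 2: merge the per-group results into the final count.
std::size_t count_black_pixels(const std::vector<std::uint8_t>& pixels,
                               std::uint8_t threshold,
                               std::size_t num_groups) {
    const std::vector<std::size_t> partial =
        partial_black_counts(pixels, threshold, num_groups);
    return std::accumulate(partial.begin(), partial.end(), std::size_t{0});
}
```

In a real kernel the outer loop disappears: each work-group writes one partial count to a small output buffer, and the host (or a second kernel) sums that buffer.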

Related

Is it thread-safe to access a Mat with multiple threads in OpenCV?

I want to speed up an algorithm (complete local binary pattern with circular neighbours) in which I iterate through all pixels and compute values from each pixel's neighbours (so I need access to neighbouring pixels).
Currently I do this by iterating over all pixels with one thread/process. I want to parallelize this task by dividing the input image into multiple ROIs and computing each ROI separately (with multiple threads).
The problem is that the ROIs overlap (because computing a pixel sometimes requires looking at neighbours far away), so multiple threads may access the same pixel data (READING) at the same time. Is it a problem if two or more threads read the same Mat at the same indices at the same time?
Is it also a problem if I write to the same Mat in parallel, but at different indices?
As long as no writes happen simultaneously to the reads, it is safe to have multiple concurrent reads.
That holds for any sane system.
Consider the alternative:
If there was a race condition, it would mean that the memory storing your object gets modified during the read operation. If no memory (storing the object) gets written to during the read, there's no possible interaction between the threads.
Lastly, if you look at the doc,
https://docs.opencv.org/3.1.0/d3/d63/classcv_1_1Mat.html
You'll see two mentions of thread-safety:
Thus, it is safe to operate on the same matrices asynchronously in
different threads.
They mention it around ref-counting, performed during matrix assignment. So, at the very least, assigning from the same matrix to two others can be done safely in multiple threads. This pretty much guarantees that simple read access is also thread-safe.
Generally, parallel reading is not a problem as a cv::Mat is just a nice wrapper around an array, just like std::vector (yes there are differences but I don't see how they would affect the matter of the topic here so I'm going to ignore them). However parallelization doesn't automatically give you a performance boost. There are quite a few things to consider here:
Creating a thread is resource-heavy and can have a large negative impact if the task is relatively short (in terms of computation time), so thread pooling has to be considered.
If you write high-performance code (whether multi- or single-threaded), you should have a grasp of how your hardware works. In this case: memory and CPU. There is a very good talk by Timur Doumler at CppCon 2016 about that topic. It should help you avoid cache misses.
Also worth mentioning is compiler optimization. Turn it on. I know this sounds super obvious, but a lot of people on SO ask questions about performance without knowing what compiler optimization is.
Finally, there is the OpenCV Transparent API (TAPI) which basically utilizes the GPU instead of the CPU. Almost all built-in algorithms of OpenCV support the TAPI, you just have to pass a cv::UMat instead of a cv::Mat. Those two types are convertible to each other. However, the conversion is time intensive because a UMat is basically an array on the GPU memory (VRAM), which means it has to be copied each time you convert it. Also accessing the VRAM takes longer than accessing the RAM (for the CPU that is).
Though, you have to keep in mind that you cannot access VRAM data with the CPU without copying it to the RAM. This means you cannot iterate over your pixels if you use cv::UMat. It is only possible if you write your own OpenCL or Cuda code so your algorithm can run on the GPU.
In most consumer grade PCs, for sliding window algorithms (basically anything that iterates over the pixels and performs a calculation around each pixel), using the GPU is usually by far the fastest method (but also requires the most effort to implement). Of course this only holds if the data buffer (your image) is large enough to make it worth copying to and from the VRAM.
For parallel writing: it's generally safe as long as you don't have overlapping areas. However, cache misses and false sharing (as pointed out by NathanOliver) are problems to be considered.
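A minimal sketch of this access pattern, assuming a plain row-major buffer as a stand-in for a single-channel cv::Mat (the `Image` struct and both function names are hypothetical, and OpenCV is deliberately not used so the example stays self-contained): every thread reads freely from the shared source, including rows that belong to a neighbouring band, but writes only to its own band of destination rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a single-channel cv::Mat: row-major int pixels.
struct Image {
    int rows;
    int cols;
    std::vector<int> data;
    Image(int r, int c, int v = 0)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, v) {}
    int at(int r, int c) const { return data[static_cast<std::size_t>(r) * cols + c]; }
    int& at(int r, int c) { return data[static_cast<std::size_t>(r) * cols + c]; }
};

// Writes the 3x3 neighbourhood sum (coordinates clamped at the border) for
// rows [row_begin, row_end). Reads may reach into rows owned by other bands.
void neighbour_sum_rows(const Image& src, Image& dst, int row_begin, int row_end) {
    for (int r = row_begin; r < row_end; ++r) {
        for (int c = 0; c < src.cols; ++c) {
            int sum = 0;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const int rr = std::min(std::max(r + dr, 0), src.rows - 1);
                    const int cc = std::min(std::max(c + dc, 0), src.cols - 1);
                    sum += src.at(rr, cc);  // concurrent read-only access
                }
            }
            dst.at(r, c) = sum;  // only this thread writes this row
        }
    }
}

// Splits the rows into disjoint bands, one thread per non-empty band.
Image neighbour_sum_parallel(const Image& src, int num_threads) {
    assert(num_threads > 0);
    Image dst(src.rows, src.cols);
    const int band = (src.rows + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        const int begin = std::min(t * band, src.rows);
        const int end = std::min(begin + band, src.rows);
        if (begin == end) continue;
        workers.emplace_back([&src, &dst, begin, end] {
            neighbour_sum_rows(src, dst, begin, end);
        });
    }
    for (std::thread& w : workers) w.join();
    return dst;
}
```

Writing into a separate destination image keeps the reads and writes on different memory, which is what makes the overlapping reads safe. Splitting by whole rows also keeps most writes of different threads on different cache lines, which limits false sharing.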

Is there anything in OpenCL like the atomic operations in CUDA?

When I write CUDA code, I use an atomic operation to force a global synchronization at the last step.
Now I also have to implement the same task in OpenCL. Is there an operation in OpenCL similar to CUDA's atomic operations that I can use? My device is an FPGA board.
barrier() may be something similar to what you are looking for, but can only force a "join" on threads in the same workgroup.
See this post. You may be able to use CLK_GLOBAL_MEM_FENCE to get the results you are looking for.
Stack overflow: Barriers in OpenCL
There is no kernel-level global synchronization in OpenCL or CUDA, since entire workgroups may finish before others can be started. Only workgroup-level synchronization is available inside a kernel. For global synchronization you must use multiple kernels.
According to your comment, it seems like you want atomic operations on float values.
Please check out this link: atomic operation and floats in opencl
The idea is to use the built-in atom_cmpxchg operation to try to swap the old value of a floating-point variable with a new value, which could be the result of adding another value to it, or of a multiplication, division, subtraction, etc.
The swapping only succeeds if the old value is actually the old value (that's where the cmp comes into play). Otherwise, it will do it again in a while loop.
Notice that this atomic operation could be quite slow if many threads are doing this operation on a single value.
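The same compare-and-swap loop can be sketched in host-side C++, where `std::atomic<std::uint32_t>::compare_exchange_weak` plays the role that atom_cmpxchg plays in the kernel. This is an analogy written under assumptions, not OpenCL C: the float is stored as its raw 32-bit pattern, and the helper names (`float_to_bits`, `bits_to_float`, `atomic_add_float`) are made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");

// Reinterpret a float as its 32-bit pattern and back (memcpy avoids aliasing UB).
inline std::uint32_t float_to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

inline float bits_to_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Atomically adds `value` to the float whose bits are stored in `target`
// and returns the value that was stored before the addition.
float atomic_add_float(std::atomic<std::uint32_t>& target, float value) {
    std::uint32_t old_bits = target.load();
    for (;;) {
        const std::uint32_t new_bits = float_to_bits(bits_to_float(old_bits) + value);
        // The swap succeeds only if `target` still holds `old_bits`. On failure
        // `old_bits` is refreshed with the current contents and we try again.
        if (target.compare_exchange_weak(old_bits, new_bits)) {
            return bits_to_float(old_bits);
        }
    }
}
```

Replacing the `+` with another operation gives an atomic multiply, subtract, and so on. The retry loop is also where the slowdown under contention comes from: the more threads hit the same value, the more swaps fail and repeat.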

OpenCL Kernel performance is very bad. Why my code is better without OpenCL?

I'm writing an Ant-Simulation.
The kernel performance is very bad. Compared to the standard C++ solution it is at a big performance disadvantage.
I don't understand why. The operations in the kernel are mostly free of control structures (like if/else).
Kernels:
https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/Ant.cl
https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/Pheromon.cl
I made a benchmark, and the OpenCL Kernel Performance is very bad.
(Left Axis: Execution time in ms, Bottom Axis: number of simulated Ants)
Can you give me advice?
You can find the whole code in the git repo, if you are interested (the OpenCL stuff is happening here: https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/clInitFunctions.cpp).
Thanks :)
You have a lot of if/else. Can't you write it in a different way?
Don't go down the if/else path; it will not get you anywhere.
You need to make sure the GPU only executes useful instructions, not millions of if/else branches.
It may be better to process only the ants that are alive in the grid: keep track of them by storing their coordinates, and move them around.
You will obviously also need a map with the ant positions and status, so you will need a multi-kernel system.
In addition, you have a lot of unnecessary memory transfers, starting with the use of int variables to store single booleans. This can lead to 90% of the transferred data being useless, which can bottleneck the GPU.
Your OpenCL kernels have ifs. Current GPUs aren't supposed to do that. AFAIK an AMD GPU has n groups of 64 cores that have the same instruction pointer (they are executing the exact same part of the exact same statement). Ifs are implemented by stopping some of the cores, executing the true branch, stopping the others and executing the false branch. Imagine this with nested ifs or loops.

Improving image processing speed

I am using C++ and OpenCV to process some images taken from a Webcam in realtime and I am looking to get the best speed I can from my system.
Other than changing the processing algorithm (assume, for now, that you can't change it), is there anything that I should be doing to maximize the speed of processing?
I am thinking maybe Multithreading could help here but I'm ashamed to say I don't really know the ins and outs (although obviously I have used multithreading before but not in C++).
Assuming I have an x-core processor, does splitting the processing into x threads actually speed things up?...or would the management overhead of these threads negate it assuming that I am looking for a throughput of 20fps (I assume that will affect the answer you give as it should give you an indication of how much processing will be done per thread)
Would multithreading help here?
Are there any tips for increasing the speed of OpenCV specifically, or any pitfalls that I might be falling into that reduce speed?
Thanks.
The easiest way, I think, could be pipelining frame operations.
You could work with a thread pool: hand each incoming frame's memory buffer to the first available thread, and release the buffer back to the pool when the algorithm step on that frame has completed.
This could leave your current (debugged :) algorithm practically unchanged, but will require substantially more memory for buffering intermediate results.
Of course, without details about your task, it's hard to say if this is appropriate...
There is one important thing about increasing speed in OpenCV not related to processor nor algorithm and it is avoiding extra copying when dealing with matrices. I will give you an example taken from the documentation:
"...by constructing a header for a part of another matrix. It can be a
single row, single column, several rows, several columns, rectangular
region in the matrix (called a minor in algebra) or a diagonal. Such
operations are also O(1), because the new header will reference the
same data. You can actually modify a part of the matrix using this
feature, e.g."
// add 5-th row, multiplied by 3 to the 3rd row
M.row(3) = M.row(3) + M.row(5)*3;
// now copy 7-th column to the 1-st column
// M.col(1) = M.col(7); // this will not work
Mat M1 = M.col(1);
M.col(7).copyTo(M1);
Maybe you already knew about this, but I think it is important to highlight headers in OpenCV as an important and efficient coding tool.
Assuming I have an x-core processor, does splitting the processing into x threads actually speed things up?
Yes, although it very heavily depends on the particular algorithm being used, as well as your skill in writing threaded code to handle things like synchronization. You didn't really provide enough detail to make a better assessment than that.
Some algorithms are extremely easy to parallelize, like ones that have the form:
for (i = 0; i < DATA_SIZE; i++)
{
    output[i] = f(input[i]);
}
for some function f. These are known as embarrassingly parallel; you can simply split the data into N blocks and have N threads process each block individually. Libraries like OpenMP make this kind of threading extremely simple.
Unless the particular algorithm you are using is already optimized for a multithreaded/parallel platform, throwing it at an x-core processor will do nothing for you. The algorithm has to be inherently threadable to benefit from multiple threads. If it wasn't designed with that in mind, it would have to be altered. On the other hand, many image processing algorithms are "embarrassingly parallel", at least in concept. Can you share more details about the algorithm you have in mind?
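As a rough sketch of "split the data into N blocks and have N threads process each block", here is a version using only `std::thread` from the standard library rather than OpenMP. The name `parallel_map` and the restriction to `int` elements are assumptions made to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Applies f to every element of `input`, using up to `num_threads` threads.
// Each thread owns one contiguous block of indices, so no two threads write
// the same output element and no locking is needed.
template <typename F>
std::vector<int> parallel_map(const std::vector<int>& input, F f, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<int> output(input.size());
    const std::size_t block = (input.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * block, input.size());
        const std::size_t end = std::min(begin + block, input.size());
        if (begin == end) continue;  // fewer elements than threads
        workers.emplace_back([&input, &output, f, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                output[i] = f(input[i]);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return output;
}
```

For example, `parallel_map(values, [](int x) { return x * x; }, 4)` squares every element on up to four threads. With OpenMP the same effect is usually a single `#pragma omp parallel for` above the original loop.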
If your threads can operate on different data, it would seem reasonable to thread it off, perhaps queueing each frame object to a thread pool. You may have to add sequence numbers to the frame objects to ensure that the processed frames emerging from the pool are delivered in the same order they went in.
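The sequence-number idea can be sketched as a small reorder buffer placed behind the thread pool. The `Frame` and `FrameReorderer` types are hypothetical, and the sketch assumes sequence numbers start at 0 and increase by 1 per captured frame; it is not thread-safe on its own, so a real pipeline would call `push` from a single consumer thread or guard it with a mutex.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical processed frame: a sequence number assigned at capture time
// plus some payload standing in for the image data.
struct Frame {
    std::uint64_t seq;
    std::string payload;
};

// Collects frames that finish out of order and releases them in capture order.
class FrameReorderer {
public:
    // Accepts one finished frame and returns every frame that can now be
    // delivered, in increasing sequence order (possibly none).
    std::vector<Frame> push(Frame frame) {
        const std::uint64_t seq = frame.seq;
        assert(seq >= next_seq_);  // a frame must not arrive twice or late
        pending_.emplace(seq, std::move(frame));

        std::vector<Frame> ready;
        auto it = pending_.find(next_seq_);
        while (it != pending_.end()) {
            ready.push_back(std::move(it->second));
            pending_.erase(it);
            ++next_seq_;
            it = pending_.find(next_seq_);
        }
        return ready;
    }

private:
    std::uint64_t next_seq_ = 0;              // next frame owed to the consumer
    std::map<std::uint64_t, Frame> pending_;  // finished frames still waiting
};
```

If frames 1 and 2 finish before frame 0, the first two calls return nothing and the third returns all three frames in order.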
As example code for multi-threaded image processing with OpenCV, you might want to check out my code:
https://github.com/vmlaker/sherlock-cpp
It's what I came up with wanting to take advantage of x-core CPU to improve performance of object detection. The detect program is basically a parallel algorithm that distributes tasks among multiple threads, a separate pipelined thread for every task:
Allocation of frame memory and video capture.
Object detection (one thread per each Haar classifier.)
Augmenting output with detection result and displaying the frame.
Memory deallocation.
With memory for every captured frame shared between all threads, I got great performance and CPU utilization.

OpenCL clEnqueueTasks Parallelism

I am trying to write some code that does AES Decryption. I have the code working but I wanted to be able to add Cipher Block Chaining which requires that I do an XOR operation after the decryption.
To make the code easier to write and understand, I wrote it using two kernels: one that does the decryption on a single block and one that does the XOR for the CBC part. I then submitted these to the queue via clEnqueueTask for each 16-byte block of data, with the dependency between the decryption and the XOR specified by an event.
This turns out to be very slow. It works in the sense that the operations run in the correct order, but it does not seem to be parallelizing the execution.
Does anyone know why or how to improve the performance without losing the granularity?
clEnqueueTask is typically used for single-threaded tasks.
If your kernels can execute in parallel, use one clEnqueueNDRangeKernel call instead of lots of clEnqueueTask calls with different parameters.
Something else that might prevent good parallel performance is lots of global memory access. If you are doing lots of reads/writes of global memory in your kernel compared to the amount of computation, that might be slowing you down depending on your hardware.
A kernel executed via clEnqueueTask is essentially single-threaded: the global work size is 1, and the task occupies a whole compute unit for that single thread. This can have a large performance impact, because a typical GPU can execute 8-16 tasks/work-groups in parallel (CL_DEVICE_MAX_COMPUTE_UNITS), and each work-group can execute 256-1024 work-items (CL_DEVICE_MAX_WORK_GROUP_SIZE). So in your case you achieve at most 8-16x parallelism instead of the theoretical maximum of roughly 15000x (about 16 x 1024), because you cannot utilize the whole hardware.
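To see why a single NDRange launch fits this problem, note that in CBC decryption the XOR for block i uses the previous ciphertext block (or the IV for the first block), which is already known, so every block can be handled independently. The host-side C++ below only illustrates that data-parallel shape; it is not OpenCL code, the AES step is left out, and `Block` and `cbc_xor_all` are names invented for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Block = std::array<std::uint8_t, 16>;  // one 16-byte AES block

// CBC post-processing for all blocks at once:
//   plaintext[i] = decrypted[i] XOR ciphertext[i - 1]   (IV for i == 0)
// `decrypted` is assumed to hold the raw AES-decrypted blocks already.
std::vector<Block> cbc_xor_all(const std::vector<Block>& decrypted,
                               const std::vector<Block>& ciphertext,
                               const Block& iv) {
    assert(decrypted.size() == ciphertext.size());
    std::vector<Block> plaintext(decrypted.size());
    // No iteration depends on another one's result. In a kernel, the loop
    // body would be one work-item and `i` would come from get_global_id(0).
    for (std::size_t i = 0; i < decrypted.size(); ++i) {
        const Block& prev = (i == 0) ? iv : ciphertext[i - 1];
        for (std::size_t b = 0; b < prev.size(); ++b) {
            plaintext[i][b] = static_cast<std::uint8_t>(decrypted[i][b] ^ prev[b]);
        }
    }
    return plaintext;
}
```

The two-kernel split can stay as it is: one NDRange launch decrypts all blocks and a second one, dependent on the first through a single event, applies the XOR to all blocks.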