How to avoid constant memory copying in OpenCL - c++

I wrote C++ application which is simulating simple heat flow. It is using OpenCL for computing.
OpenCL kernel is taking two-dimensional (n x n) array of temperatures values and its size (n). It returns new array with temperatures after each cycle:
pseudocode:
int t_id = get_global_id(0);
if(t_id < n * n)
{
m_new[t_id / n][t_id % n] = average of its and its neighbors (top, bottom, left, right) temperatures
}
As You can see, every thread is computing single cell in matrix. When host application needs to perform X computing cycles it looks like this
For 1 ... X
Copy memory to OpenCL device
Call kernel
Copy memory back
I would like to rewrite kernel code to perform all X cycles without constant memory copying to/from OpenCL device.
Copy memory to OpenCL device
Call kernel X times OR call kernel one time and make it compute X cycles.
Copy memory back
I know that each thread in kernel should lock when all other threads are doing their job and after that - m[][] and m_new[][] should be swapped. I have no idea how to implement any of those two functionalities.
Or maybe there is another way to do this optimally?

Copy memory to OpenCL device
Call kernel X times
Copy memory back
this works. Make sure call kernel is not blocking(so 1-2 ms per cycle is saved) and there aren't any host-accesible buffer properties such as USE_HOST_PTR or ALLOC_HOST_PTR.
If calling kernel X times doesn't get satisfactory performance, you can try using single workgroup(such as only 256 threads) with looping X times that each cycles has a barrier() at the end so all 256 threads synchronize before starting next cycle. This way you can compute M different heat-flow problems at the same time where M is number of compute units(or workgroups) if that is a server, it can serve that many computations.
Global synchronization is not possible because when latest threads are launched, first threads are already gone. It works with (number of compute units)(number of threads per workgroup)(number of wavefronts per workgroup) threads concurrently. For example, a R7-240 gpu with 5 compute units and local-range=256, it can run maybe 5*256*20=25k threads at a time.
Then, for further performance, you can apply local-memory optimizations.

Related

why using more than 2 threads consume more time?

I want to optimize my sequential code to make a gradient.
The main thread compute gradient for the border of the image and the other threads each one compute the gradient for a chunk of the image,
using 2 threads and the main thread give result better than sequential code but using more than 2 threads, but it consume more time and looks worst than the sequential.
I tried this code to speed up the gradient process:
for (int n = 0; n<iter_outer; n++)
{
int chunk = 1 + ((row - 1) / num_threads); //ceiling
int start=0;
int end=0;
//Launch a group of threads
for (int tid = 0; tid < num_threads; ++tid)
{
start = tid * chunk;
end = start + chunk;
t[tid] = thread(gradient, tid, g, vx, vy, row, col, 1, start, end);
}
//Launched from the main;
gradient(1, g, vx, vy, row, col,0, start, end);
//Join the threads with the main thread
for (int i = 0; i < num_threads; ++i)
{
t[i].join();
}
}
For any parallel execution you have to take into account Amdahl's law. It states that the time required to do some task in parallel does not scale linear with the number of processors:
t = ( (1-p) + p/n ) * T
where
T is the time needed for the task when it is done sequentially
p fraction of time that can be parallelized
n is the number of processors
Note that I used a slightly different formulation, but the statement is the same: The total speedup you get is limited by 1/(1-p) (e.g. if p=50% your parallel version will run maximum twice as fast).
In addition to that you have to consider that adding more parallelism in reality also adds more overhead (for synchronisation, setting up threads, etc), so a more realistic estimate is:
t = ( (1-p) + p/n ) * T + o*p
^^ overhead
This t as a function of the number of processors p has a minimum for a certain number of processors. Adding more processors to the problem will not result in a speedup but rather in a slow down, because the minimum time you need to do that p portion is zero, but the overhead you add by adding more processors increases unlimited.
This does not explain why you dont get a speedup in your case, but in general it is not a big surprise that simply adding more processors on a task does not always result in a speedup.
The parallel execution is a huge benefit for tasks that are easily splittable and the threads will not depend on themselves, however creating threads does come with a price. Let's imagine that a computer does nothing else but running your program (there is not OS and no other processes). The processor has 2 cores, they are processors in their own regard and can concurrently run any code. In case of just one thread the second core sits and does absolutely nothing hence there is potential for speed up. If you spawn the second thread (and give it 50% of the task) the second core now works as well and theoretically the speedup is 2 (ignoring the sequential parts and practical aspects). Now, lets make 4 threads. Wait... we have two processors and 4 threads? Yes, now each CPU does more than one thing and before changing the task on which it works the CPU has to switch contexts (change the values of registers to hold appropriate variables values, go to different code section and so on) this takes time and if you create way too many threads it will in fact take more time than doing the job. This might be a huge impact on any threaded application and should be noted before deciding on how many threads to run.
Note that this post is as simplification many modern CPU's can run efficiently more then one thread per core (ie. HyperThreading).
It seems like your CPU is dual-core. So, actually, only 2 tasks could be done parallel

Thread indexing for image rows / columns in CUDA

So I'm working on a CUDA program, and I'm experiencing some issues when it comes to indexing blocks and threads. Basically, I'm trying to implement the Pixel Sort algorithm in CUDA. (With one modification, we are only processing either rows or columns, not both at the same time)
The way I was visualizing it was to simply run N blocks with 1 thread each (for the number of rows, or columns) and have each block process that row / column independently of each other.
So if we want to sort on columns, we launch the kernel like this (there are a couple extra parameters that are only relevant to our specific processing, so I've left them out for simplicity)
pixel_sort<<<cols, 1>>>(d_image, d_imageSort, rows, cols, ...);
Then in the kernel, I access the block index with
int tid = blockIdx.x;
This allows me to work with one row / columns data per block, but it has some issues. It's running slower than our serial implementation of the algorithm for smaller images, and straight up crashes when the image size becomes too large.
An alternative thread scheme I was considering would be to map each of the image's pixels to one thread, however I have a couple of questions on this.
If we were to launch N blocks with M threads, representing N cols with M rows, how do we avoid 512 (or 1024 ?) limit of threads per block. Can we just have each thread process multiple pixels in the column in this instance? How would the indexing look like in the kernel?
The algorithm basically requires that we work on the entire column, hence each thread cannot just do some work on that pixel, they have to communicate, presumably using shared memory. Would it be a valid strategy to have one "master" thread per block, that does the actual sorting calculations, and then have all of the other threads participate in the shared memory?
Other Notes:
Our image data is read in through OpenCV, and has the RGBA values stored in a uchar4 array
If you have a single thread per block, you very quickly run into thread occupancy issues. If your goal is to do a full row sort (for columns you could transpose the image before sending to the GPU to take advantage of global coalescing), the fastest way that gets a decent result is probably to do a radix or merge-sort on a per-row basis, basically copying the steps from http://mgarland.org/files/papers/nvr-2008-001.pdf. You could assign k blocks of m threads each for each row such that km >= image width. Then you would be launching k*(image height) blocks. Your grid would then be of size (k, height, 1).
As for your specific questions:
You cannot get around the 512/1024 thread-per-block limit, you would have to restructure your algorithm.
A "master" thread would generally be poor design, causing stalls, overhead, and not taking full advantage of the many cores. You may sometimes need to utilize a single thread, say to output/broadcast a result, but mostly you want to avoid it. See the linked article for sample algorithms that mostly avoid this.

OpenCL data parallel summation into a variable

Is it possible to use the opencl data parallel kernel to sum vector of size N, without doing the partial sum trick?
Say that if you have access to 16 work items and your vector is of size 16. Wouldn't it not be possible to just have a kernel doing the following
__kernel void summation(__global float* input, __global float* sum)
{
int idx = get_global_id(0);
sum[0] += input[idx];
}
When I've tried this, the sum variable doesn't get updated, but only overwritten. I've read something about using barriers, and i tried inserting a barrier before the summation above, it does update the variable somehow, but it doesn't reproduce the correct sum.
Let me try to explain why sum[0] is overwritten rather than updated.
In your case of 16 work items, there are 16 threads which are running simultaneously. Now sum[0] is a single memory location which is shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads, simultaneously.
Now the instruction sum[0] += input[idx] (I think) expands performs a read of sum[0], then adds input[idx] to that before writing the result back to sum[0].
There will will be a data race as multiple threads are reading from and writing to the same shared memory location. So what might happen is:
All threads may read the value of sum[0] before any other thread
writes their updated result back to sum[0], in which case the final
result of sum[0] would be the value of input[idx] of the thread
which executed the slowest. Since this will be different each time,
if you run the example multiple times you should see different
results.
Or, one thread may execute slightly more slowly, in which case
another thread may have already written an updated result back to
sum[0] before this slow thread reads sum[0], in which case there
will be an addition using the values of more than one thread, but not
all threads.
So how can you avoid this?
Option 1 - Atomics (Worse Option):
You can use atomics to force all threads to block if another thread is performing an operation on the shared memory location, but this obviously results in a loss of performance since you are making the parallel process serial (and incurring the costs of parallelisation -- such as moving memory between the host and the device and creating the threads).
Option 2 - Reduction (Better Option):
The best solution would be to reduce the array, since you can use the parallelism most effectively, and can give O(log(N)) performance. Here is a good overview of reduction using OpenCL : Reduction Example.
Option 3 (and worst of all)
__kernel void summation(__global float* input, __global float* sum)
{
int idx = get_global_id(0);
for(int j=0;j<N;j++)
{
barrier(CLK_GLOBAL_MEM_FENCE| CLK_LOCAL_MEM_FENCE);
if(idx==j)
sum[0] += input[idx];
else
doOtherWorkWhileSingleCoreSums();
}
}
using a mainstream gpu, this should sum all of them as slow as a pentium mmx . This is just like computing on a single core and giving other cores other jobs but in a slower way.
A cpu device could be better than gpu for this kind.

Apparent CUDA magic

I'm using CUDA (in reality I'm using pyCUDA if the difference matters) and performing some computation over arrays. I'm launching a kernel with a grid of 320*600 threads. Inside the kernel I'm declaring two linear arrays of 20000 components using:
float test[20000]
float test2[20000]
With these arrays I perform simple calculations, like for example filling them with constant values. The point is that the kernel executes normally and perform correctly the computations (you can see this filling an array with a random component of test and sending that array to host from device).
The problem is that my NVIDIA card has only 2GB of memory and the total amount of memory to allocate the arrays test and test2 is 320*600*20000*4 bytes that is much more than 2GB.
Where is this memory coming from? and how can CUDA perform the computation in every thread?
Thank you for your time
The actual sizing of the local/stack memory requirements is not as you suppose (for the entire grid, all at once) but is actually based on a formula described by #njuffa here.
Basically, the local/stack memory require is sized based on the maximum instantaneous capacity of the device you are running on, rather than the size of the grid.
Based on the information provided by njuffa, the available stack size limit (per thread) is the lesser of:
The maximum local memory size (512KB for cc2.x and higher)
available GPU memory/(#of SMs)/(max threads per SM)
For your first case:
float test[20000];
float test2[20000];
That total is 160KB (per thread) so we are under the maximum limit of 512KB per thread. What about the 2nd limit?
GTX 650m has 2 cc 3.0 (kepler) SMs (each Kepler SM has 192 cores). Therefore, the second limit above gives, if all the GPU memory were available:
2GB/2/2048 = 512KB
(kepler has 2048 max threads per multiprocessor)
so it is the same limit in this case. But this assumes all the GPU memory is available.
Since you're suggesting in the comments that this configuration fails:
float test[40000];
float test2[40000];
i.e. 320KB, I would conclude that your actual available GPU memory is at the point of this bulk allocation attempt is somewhere above (160/512)*100% i.e. above 31% but below (320/512)*100% i.e. below 62.5% of 2GB, so I would conclude that your available GPU memory at the time of this bulk allocation request for the stack frame would be something less than 1.25GB.
You could try to see if this is the case by calling cudaGetMemInfo right before the kernel launch in question (although I don't know how to do this in pycuda). Even though your GPU starts out with 2GB, if you are running the display from it, you are likely starting with a number closer to 1.5GB. And dynamic (e.g. cudaMalloc) and or static (e.g. __device__) allocations that occur prior to this bulk allocation request at kernel launch, will all impact available memory.
This is all to explain some of the specifics. The general answer to your question is that the "magic" arises due to the fact that the GPU does not necessarily allocate the stack frame and local memory for all threads in the grid, all at once. It need only allocate what is required for the maximum instantaneous capacity of the device (i.e. SMs * max threads per SM), which may be a number that is significantly less than what would be required for the whole grid.

Calling multiple kernels, global memory performances - CUDA

I have four CUDA kernels working on matrices in the following way:
convolution<<<>>>(A,B);
multiplybyElement1<<<>>>(B);
multiplybyElement2<<<>>>(A);
multiplybyElement3<<<>>>(C);
// A + B + C with CUBLAS' cublasSaxpy
every kernel basically (except the convolution first) performs a matrix each-element multiplication by a fixed value hardcoded in its constant memory (to speed things up).
Should I join these kernels into a single one by calling something like
multiplyBbyX_AbyY_CbyZ<<<>>>(B,A,C)
?
Global memory should already be on the device so probably that would not help, but I'm not totally sure
If I understood correctly, you're asking if you should merge the three "multiplybyElement" kernels into one, where each of those kernels reads an entire (different) matrix, multiplying each element by a constant, and storing the new scaled matrix.
Given that these kernels will be memory bandwidth bound (practically no computation, just one multiply for every element) there is unlikely to be any benefit from merging the kernels unless your matrices are small, in which case you would be making inefficient use of the GPU since the kernels will execute in series (same stream).
If merging the kernels means that you can do only one pass over the memory, then you may see a 3x speedup.
Can you multiply up the fixed values up front and then do a single multiply in a single kernel?