Parallelize a method from inside a CUDA device function / kernel - c++

I've got an already parallelized CUDA kernel that does some tasks which require frequent interpolation.
So there's a kernel
__global__ void complexStuff(...)
which calls one or more times this interpolation device function:
__device__ void interpolate(...)
The interpolation algorithm does a WENO interpolation successively over three dimensions. This is a highly parallelizable task which I would very much like to parallelize!
It is clear that the kernel complexStuff() can easily be parallelized by calling it from host code using the <<<...>>> syntax. It is also important that complexStuff() is already parallelized.
But it's not clear to me how to parallelize something / create new threads from inside a CUDA device function ... is this even possible? Does anyone know?

You might want to consider Dynamic Parallelism (some resources here, here, and here) in order to call a CUDA kernel from inside another CUDA kernel. It requires your device's compute capability to be 3.5 or higher. It also comes with a number of restrictions and limitations that may degrade performance (mentioned in the third link).
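As a rough illustration only (a minimal sketch with placeholder names and launch dimensions, not your actual code), a dynamic-parallelism version launches the interpolation work as a child grid directly from the parent kernel:

// Minimal dynamic parallelism sketch: requires compute capability >= 3.5 and
// compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true.
// Kernel names, arguments, and launch dimensions are placeholders.
__global__ void interpolateKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... one fine-grained piece of the WENO interpolation ...
    }
}

__global__ void complexStuff(float *data, int n)
{
    // ... coarse-grained work ...

    // Launch a child grid from the device; only one thread of the block
    // issues the launch so the same child grid is not launched many times.
    if (threadIdx.x == 0) {
        interpolateKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();   // device-side wait for the child grid
    }
    __syncthreads();
}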
My suggestion is to first consider launching your CUDA kernel with the amount of work in complexStuff(...) multiplied by the amount of work in interpolate(...). In other words, statically estimate the maximum number of fine-grained parallel jobs you need to do, and then configure your kernel so that its block threads perform those fine-grained jobs. Note that this is just speculation without knowing your program code.
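A sketch of that flattened approach (again with placeholder names and sizes): let one grid dimension cover the coarse-grained tasks and the other cover the fine-grained interpolation jobs, so every fine-grained job gets its own thread:

// Flattened sketch: x indexes the coarse-grained tasks, y the fine-grained
// interpolation jobs within each task. All names and sizes are placeholders.
__global__ void complexStuffFlat(float *data, int numTasks, int jobsPerTask)
{
    int task = blockIdx.x * blockDim.x + threadIdx.x;   // coarse-grained index
    int job  = blockIdx.y * blockDim.y + threadIdx.y;   // fine-grained index
    if (task < numTasks && job < jobsPerTask) {
        // ... do one fine-grained interpolation job of one task ...
    }
}

// Host side:
//   dim3 block(16, 16);
//   dim3 grid((numTasks + 15) / 16, (jobsPerTask + 15) / 16);
//   complexStuffFlat<<<grid, block>>>(d_data, numTasks, jobsPerTask);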

Related

Concurrency of cuFFT streams

So I am using cuFFT combined with the CUDA stream feature. The problem I have is that I can't seem to make the cuFFT kernels run in full concurrency. The following are the results I have from nvvp. Each of the streams is running a kernel of 2D batch FFT on 128 images of size 128x128. I set up 3 streams to run 3 independent FFT batch plans.
As can be seen from the figure, some memory copies (yellow bars) were concurrent with some kernel computations (purple, brown and pink bars). But the kernel runs were not concurrent at all. As you can see, each kernel strictly followed the previous one. The following is the code I used for the memory copies to the device and the kernel launches.
for (unsigned int j = 0; j < NUM_IMAGES; j++ ) {
    gpuErrchk( cudaMemcpyAsync( dev_pointers_in[j],
                                image_vector[j],
                                NX*NY*NZ*sizeof(SimPixelType),
                                cudaMemcpyHostToDevice,
                                streams_fft[j]) );
    gpuErrchk( cudaMemcpyAsync( dev_pointers_out[j],
                                out,
                                NX*NY*NZ*sizeof(cufftDoubleComplex),
                                cudaMemcpyHostToDevice,
                                streams_fft[j] ) );
    cufftExecD2Z( planr2c[j],
                  (SimPixelType*)dev_pointers_in[j],
                  (cufftDoubleComplex*)dev_pointers_out[j]);
}
Then I changed my code so that all memory copies are finished (synchronized) and all kernels are sent to their streams at once, and I got the following profiling result:
This confirmed that the kernels were not running concurrently.
I looked at one link which explains in detail how to get full concurrency by either passing the "--default-stream per-thread" command line argument or defining CUDA_API_PER_THREAD_DEFAULT_STREAM before including the CUDA headers in your code. It is a feature introduced in CUDA 7. I ran the sample code from the above link on my MacBook Pro Retina 15" with a GeForce GT 750M (the same machine used in the above link), and I was able to get concurrent kernel runs. But I was not able to get my cuFFT kernels running in parallel.
Then I found this link where someone says that a cuFFT kernel will occupy the whole GPU, so no two cuFFT kernels can run in parallel. Then I was stuck, since I couldn't find any formal documentation addressing whether cuFFT enables concurrent kernels. Is this true? Is there a way to get around this?
I assume you called cufftSetStream() prior to the code you have shown, appropriate for each planr2c[j], so that each plan is associated with a separate stream. I don't see it in the code you posted. If you actually want cufft kernels to overlap with other cufft kernels, it's necessary for those kernels to be launched to separate streams. So the cufft exec call for image 0 would have to be launched into a different stream than the cufft exec call for image 1, for example.
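In case that call is missing, a minimal sketch of the association (assuming the plans and streams from your snippet already exist):

// Sketch: associate each cuFFT plan with its own stream before the loop in
// the question, so each cufftExecD2Z call is launched into a separate stream.
for (unsigned int j = 0; j < NUM_IMAGES; j++) {
    cufftSetStream(planr2c[j], streams_fft[j]);
}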
In order for any two CUDA operations to have the possibility to overlap, they must be launched into different streams.
Having said that, concurrent memory copies with kernel execution, but not concurrent kernels, is about what I would expect for reasonably sized FFTs.
A 128x128 FFT to a first order approximation will spin up ~15,000 threads, so if my thread blocks are ~500 threads each, that would be 30 threadblocks, which will keep a GPU fairly occupied, leaving not much "room" for additional kernels. (You can actually discover the total blocks and threads for a kernel in the profiler itself.) Your GT750m probably has 2 Kepler SMs with a maximum of 16 blocks per SM so a max instantaneous capacity of 32 blocks. And this capacity number could be reduced for a specific kernel due to shared memory usage, register usage, or other factors.
The instantaneous capacity of whatever GPU you are running on (max blocks per SM * number of SMs) will determine the potential for overlap (concurrency) of kernels. If you exceed that capacity with a single kernel launch, then that will "fill" the GPU, preventing kernel concurrency for some time period.
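If you want to check this on your own device rather than estimate it, something along these lines (a sketch using the runtime occupancy API; myKernel and the 512-thread block size are placeholders) reports the instantaneous block capacity for a given kernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }   // placeholder kernel

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    // Maximum resident blocks of myKernel per SM for a 512-thread block and
    // 0 bytes of dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 512, 0);

    printf("SMs: %d, blocks/SM: %d, instantaneous capacity: %d blocks\n",
           prop.multiProcessorCount, blocksPerSM,
           prop.multiProcessorCount * blocksPerSM);
    return 0;
}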
It should be theoretically possible for CUFFT kernels to run concurrently. But just like any kernel concurrency scenario, CUFFT or otherwise, the resource usage of those kernels would have to be pretty low to actually witness concurrency. Typically when you have low resource usage, it implies kernels with a relatively small number of threads/threadblocks. These kernels don't usually take long to execute, making it even more difficult to actually witness concurrency (because launch latency and other latency factors may get in the way). The easiest way to witness concurrent kernels is to have kernels with unusually low resource requirements combined with unusually long run times. This is generally not the typical scenario, for CUFFT kernels or any other kernels.
Overlap of copy and compute is still a useful feature of streams with CUFFT. And the concurrency idea, without an understanding of the machine's capacity and resource constraints, is somewhat unreasonable in itself. For example, if kernel concurrency were arbitrarily achievable ("I should be able to make any 2 kernels run concurrently"), without consideration of capacity or resource specifics, then after you got two kernels running concurrently, the next logical step would be to go to 4, 8, or 16 kernels concurrently. But the reality is that the machine can't handle that much work simultaneously. Once you've exposed enough parallelism (loosely translated as "enough threads") in a single kernel launch, exposing additional work parallelism via additional kernel launches normally cannot make the machine run any faster or process the work more quickly.

How to initialize CUDA so I can make valid execution time measurements?

In my application I have implemented the same algorithm for the CPU and for the GPU with CUDA, and I have to measure the time needed to perform the algorithm on each. I noticed that some time is spent on CUDA initialization in the GPU version, and I added cudaFree(0); at the beginning of the program code, as recommended here for CUDA initialization, but the first GPU execution of the algorithm still takes more time than the second one.
Are there any other CUDA related stuff that have to be initialized at the beginning to measure actual algorithm execution time correctly?
The heuristics of lazy context initialisation in the CUDA runtime API have subtly changed since the answer you linked to was written, in two ways I am aware of:
cudaSetDevice() now initiates a context, whereas earlier it did not (hence the need for the cudaFree() call discussed in that answer)
Some device-code-related initialisation which the runtime API used to perform at context initialisation is now done at the first call to a kernel
The only solution I am aware of for the second item is to run the CUDA kernel code you want to time once as a "warm up" to absorb the setup latency, and then perform your timing on the code for benchmarking purposes.
Alternatively, you can use the driver API and have much finer grained control over when latency will occur during application start up.
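For the warm-up approach, a minimal sketch of the warm-up-then-time pattern with CUDA events (the kernel, data, and launch configuration are placeholders, not from your application):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void algorithmKernel(float *data, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Warm-up launch: absorbs context/module setup latency and is not timed.
    algorithmKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    algorithmKernel<<<(n + 255) / 256, 256>>>(d_data, n);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}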

Working with many fixed-size matrices in CUDA kernels

I am looking to work with about 4000 fixed-size (3x3, 4x4) matrices, doing things such as matrix inversion and eigendecomposition.
It seems to me the best way to parallelize this would be to let each of the many GPU threads work on a single instance of the problem.
Is there a reasonable way to do this? I have read: http://www.culatools.com/blog/2011/12/09/batched-operations/ but as far as I can tell, it's always something that is "being worked on" with no solution in sight. Three years later, I hope there is a good solution.
So far, I have looked at:
Using Eigen in CUDA kernels: http://eigen.tuxfamily.org/dox-devel/TopicCUDA.html. But this is in its infancy: thus, it doesn't seem to work well and some things are not implemented. Moreover, I am not sure if it is optimized for CUDA at all. There is almost no documentation and the only example of code is a test file (eigen/test/cuda_basic.cu). When I tried using Eigen in CUDA kernels, simple things like declaring an Eigen::MatrixXf in a kernel did not survive compilation with nvcc V7.0.27 and Eigen 3.2.90 (mercurial).
Using the cuBLAS device API library to run BLAS routines within a kernel. It seems cuBLAS and its ilk are written to be parallelized even for small matrices, which seems overkill and likely slow for the 3x3 and 4x4 matrices I am interested in. Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels).
Batch processing kernels using CUDA streams. In Section 2.1.7 "Batching Kernels" of the cuBLAS documentation for the CUDA Toolkit v7.0, this is suggested. But "in practice it is not possible to have more than 16 concurrent kernels executing at the same time", and consequently it would be terrible for processing 4000 small matrices. In the aforementioned link to the CULA blog post, I quote: "One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would be ill-performing for two reasons. First is that the number of threads per block would be far too low; [...] Second is that the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch code is as expensive (if not more expensive) as just performing the matrix on the CPU."
Implementing my own matrix multiplication and eigendecomposition in kernels. This is likely to be very slow, and may in addition be time consuming to implement.
At this point I am tempted to give up on doing this on the GPU at all. It is a pity, since I was hoping for real time performance for an algorithm that requires inverting 4000 3x3 matrices about 100 times every 0.1 seconds.
The cuBLAS functions getrfBatched and getriBatched are designed for batch inversion of small matrices. This should be quicker than either dynamic parallelism or streams (your 2nd and 3rd approaches). Also, a batch solver is available in source code form that can do matrix inversions. You will need to log in as a registered developer at developer.nvidia.com to access this link.
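For illustration, a rough sketch of that batched-inversion path with the cuBLAS host API (the matrix layout, helper name, and buffer handling are placeholder assumptions; error checking is omitted):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Sketch: invert `batch` 3x3 double matrices stored contiguously in d_A
// (column-major, one matrix after another); the inverses go to d_Ainv.
// Note: getrfBatched overwrites d_A with its LU factors.
void invertBatched3x3(double *d_A, double *d_Ainv, int batch)
{
    const int n = 3, lda = 3;
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Build device arrays of pointers, one pointer per matrix in the batch.
    std::vector<double*> hA(batch), hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = d_A    + i * n * n;
        hC[i] = d_Ainv + i * n * n;
    }
    double **d_Aarray, **d_Carray;
    int *d_pivots, *d_info;
    cudaMalloc(&d_Aarray, batch * sizeof(double*));
    cudaMalloc(&d_Carray, batch * sizeof(double*));
    cudaMalloc(&d_pivots, batch * n * sizeof(int));
    cudaMalloc(&d_info,   batch * sizeof(int));
    cudaMemcpy(d_Aarray, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Carray, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    // LU-factorize the whole batch, then form the inverses from the factors.
    cublasDgetrfBatched(handle, n, d_Aarray, lda, d_pivots, d_info, batch);
    cublasDgetriBatched(handle, n, (const double *const *)d_Aarray, lda,
                        d_pivots, d_Carray, lda, d_info, batch);

    cudaFree(d_Aarray); cudaFree(d_Carray);
    cudaFree(d_pivots); cudaFree(d_info);
    cublasDestroy(handle);
}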
Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels).
cuSOLVER provides some eigensolver functions. However, they are neither batched nor callable from device code, so you're faced with streams as the only option beyond that.

OpenCL kernel performance is very bad. Why is my code better without OpenCL?

I'm writing an ant simulation.
The kernel performance is very bad. Compared to a standard C++ solution it has a big performance disadvantage.
I don't understand why. The operations in the kernel are mostly without control structures (like if/else).
Kernels:
https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/Ant.cl
https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/Pheromon.cl
I made a benchmark, and the OpenCL kernel performance is very bad.
(Benchmark chart: left axis shows execution time in ms, bottom axis the number of simulated ants.)
Can you give me advice?
You can find the whole code in the git repo if you are interested (the OpenCL stuff happens here: https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/clInitFunctions.cpp).
Thanks :)
You have a lot of if/else statements; can't you write it in a different way?
Don't go down the if/else path, since it will get you nowhere.
You need to make sure the GPU only executes useful instructions, not millions of if/else branches.
It may be better to keep track of, and process only, the ants that are alive in the grid: store their coordinates and move them around directly.
You will obviously also need a map with the ant positions and status, so you will need a multi-kernel system.
In addition, you have a lot of useless memory transfers, starting with using int variables to store single booleans. This can mean that 90% of the transferred data is useless, which can bottleneck the GPU.
Your OpenCL kernels have ifs. Current GPUs don't handle that well. AFAIK an AMD GPU has n groups of 64 cores that share the same instruction pointer (they execute the exact same part of the exact same statement). Ifs are implemented by stopping some of the cores, executing the true branch, then stopping the others and executing the false branch. Imagine this with nested ifs or loops.
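To make the divergence point concrete, here is a small generic illustration (written in CUDA C++ syntax for consistency with the other snippets on this page; the same idea applies in OpenCL C, and the functions and values are placeholders) of expressing a two-way choice without control flow, so the compiler can emit a select instead of two serialized paths:

// Divergent version: threads in the same warp/wavefront may take different
// paths, which the hardware has to serialize.
__device__ float updateDivergent(float value, int isAlive)
{
    if (isAlive)
        return value + 1.0f;
    else
        return value * 0.5f;
}

// Branch-free version: compute both candidate results, then select one.
// There is no control flow here for the hardware to serialize.
__device__ float updateBranchFree(float value, int isAlive)
{
    float ifAlive = value + 1.0f;
    float ifDead  = value * 0.5f;
    return isAlive ? ifAlive : ifDead;
}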

OpenCL - Vectorization vs In-thread for loop

I have a problem where I need to process a known number of threads in parallel (great), but where each thread may have a vastly different number of internal iterations (not great). In my mind, this makes it better to use a kernel scheme like this:
__kernel void something(whatever)
{
    unsigned int glIDx = get_global_id(0);

    for (condition_from_whatever)
    {
        // ...
    } // alternatively, do-while
}
where id(0) is known beforehand, rather than:
__kernel void something(whatever)
{
    unsigned int glIDx = get_global_id(0);
    unsigned int glIDy = get_global_id(1); // max "unroll dimension"

    if (glIDy_meets_condition)
        do_something();
    else
        dont_do_anything();
}
which would necessarily execute for the FULL POSSIBLE RANGE of glIDy, with no way to terminate beforehand, as per this discussion:
Killing OpenCL Kernels
I can't seem to find any specific information about the cost of dynamically sized for loops / do-while statements within kernels, though I do see them everywhere in kernels in Nvidia's and AMD's SDKs. I remember reading something about how the more aperiodic an intra-kernel conditional branch is, the worse the performance.
ACTUAL QUESTION:
Is there a more efficient way to deal with this on a GPU architecture than the first scheme I proposed?
I'm also open to general information about this topic.
Thanks.
I don't think there's a general answer that can be given to that question. It really depends on your problem.
However here are some considerations about this topic:
for loop / if else statements may or may not have an impact on the performance of a kernel. The fact is that the performance cost is not at the kernel level but at the work-group level. A work-group is composed of one or more warps (NVIDIA) / wavefronts (AMD). These warps (I'll keep the NVIDIA terminology, but it's exactly the same for AMD) are executed in lock-step.
So if within a warp you have divergence because of an if else (or a for loop with a different number of iterations), the execution is serialized. That is to say, the threads within the warp following the first path do their jobs while the others idle. Once their job is finished, those threads idle while the others start working.
Another problem arises with these statements if you need to synchronize your threads with a barrier: you'll get undefined behavior if not all of the threads hit the barrier.
Now, knowing that, and depending on your specific problem, you might be able to group your threads in such a fashion that there is no divergence within the work-groups, even though there will be divergence between work-groups (no impact there).
Knowing also that a warp is composed of 32 threads and a wavefront of 64 (maybe not on old AMD GPUs - not sure), you could make the size of your well-organized work-groups equal to, or a multiple of, these numbers. Note that this is quite simplified, because other issues should also be taken into consideration. See for instance this question and the answer given by Chanakya.sun (maybe more digging on that topic would be nice).
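For instance, a host-side sketch of that sizing rule (written in CUDA terms to match the other snippets on this page; the kernel, data, and numbers are placeholders):

#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard, since the grid is rounded up past n
        data[i] *= 2.0f;
}

void launch(float *d_data, int n)
{
    // Pick a block/work-group size that is a multiple of the warp size
    // (32 on NVIDIA; AMD wavefronts are 64), then round the grid up so
    // every element is covered.
    const int blockSize = 4 * 32;                             // 128 threads
    const int gridSize  = (n + blockSize - 1) / blockSize;    // round up
    myKernel<<<gridSize, blockSize>>>(d_data, n);
}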
In case your problem cannot be organized as just described, I'd suggest considering OpenCL on CPUs, which are quite good at dealing with branching. If I recall correctly, you'll typically have one work-item per work-group there. In that case, it's better to check the documentation from Intel and AMD for CPUs. I also very much like chapter 6 of Heterogeneous Computing with OpenCL, which explains the differences between using OCL with GPUs and with CPUs when programming.
I like this article too. It's mainly a discussion about increasing performance for a simple reduction on the GPU (not your problem), but the last part of the article also examines performance on CPUs.
One last thing, regarding your comments on the answer provided by @Oak about the "intra-device thread queuing support": this is actually called dynamic parallelism. This feature would obviously solve your problem, but even using CUDA you'd need a device with compute capability 3.5 or higher, so even NVIDIA GPUs with the Kepler GK104 architecture don't support it (they are compute capability 3.0). For OCL, dynamic parallelism is part of the standard from version 2.0 (as far as I know there is no implementation yet).
I like the 2nd version more, since the for loop inserts a false dependency between iterations. If the inner iterations are independent, send each to a different work item and let the OpenCL implementation sort out how best to run them.
Two caveats:
If the average number of iterations is significantly lower than the max number of iterations, this might not be worth the extra dummy work items.
You will have a lot more work items and you still need to calculate the condition for each... if calculating the condition is complicated this might not be a good idea.
Alternatively, you can flatten the indices into the x dimension, group all the iterations into the same work-group, then calculate the condition just once per workgroup and use local memory + barriers to sync it.
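A rough sketch of that last idea, written in CUDA C++ for consistency with the other snippets on this page (in OpenCL the equivalents are __local memory and barrier(CLK_LOCAL_MEM_FENCE)); the condition array and the per-iteration work are placeholders:

#include <cuda_runtime.h>

__global__ void perGroupCondition(const int *groupFlags, float *data, int n)
{
    __shared__ int cond;                   // one condition per block/work-group

    if (threadIdx.x == 0)
        cond = groupFlags[blockIdx.x];     // evaluated/loaded once per group
    __syncthreads();                       // make it visible to the whole group

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && cond)
        data[i] *= 2.0f;                   // placeholder per-iteration work
}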