CUDA atomic operations and concurrent kernel launch - concurrency

I am currently developing a GPU-based program that uses multiple kernels launched concurrently via multiple streams.
In my application, multiple kernels need to access a queue/stack,
and I plan to use atomic operations.
But I do not know whether atomic operations work between multiple concurrently launched kernels.
I would appreciate help from anyone who knows the exact mechanism of atomic operations on the GPU,
or who has experience with this issue.

Atomics are implemented in the L2 cache hardware of the GPU, through which all memory operations must pass. There is no hardware to ensure coherency between host and device memory, or between different GPUs; but as long as the kernels are running on the same GPU and using device memory on that GPU to synchronize, atomics will work as expected.
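As a minimal sketch of this (names and sizes are illustrative, not from the question), two kernels launched into different streams can safely reserve slots in a shared device-memory queue with `atomicAdd`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Both kernels push values into the same device-memory queue; the
// atomicAdd on the shared tail index is serialized by the hardware,
// so the kernels can run concurrently in different streams without races.
__global__ void producer(int *queue, int *tail, int base) {
    int slot = atomicAdd(tail, 1);       // reserve a unique slot
    queue[slot] = base + threadIdx.x;
}

int main() {
    int *queue, *tail;
    cudaMalloc(&queue, 512 * sizeof(int));
    cudaMalloc(&tail, sizeof(int));
    cudaMemset(tail, 0, sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    producer<<<1, 128, 0, s1>>>(queue, tail, 0);     // stream 1
    producer<<<1, 128, 0, s2>>>(queue, tail, 1000);  // stream 2
    cudaDeviceSynchronize();

    int count;
    cudaMemcpy(&count, tail, sizeof(int), cudaMemcpyDeviceToHost);
    printf("queued %d items\n", count);              // 2 x 128 = 256 items
    return 0;
}
```

Regardless of how the two launches interleave, every thread gets a distinct slot, which is exactly the property a shared queue/stack needs.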

Related

OpenCL - multiple threads on a gpu

After having parallelized a C++ code via OpenMP, I am now considering using the GPU (a Radeon Pro Vega II) to speed up specific parts of my code. Being an OpenCL neophyte, I am currently searching for examples that can show me how to implement a multicore CPU - GPU interaction.
Here is what I want to achieve. Suppose you have a fixed, short array, say {1,2,3,4,5}, and that, as an exercise, you want to compute all of the possible "right shifts" of this array, i.e.,
{5,1,2,3,4}
{4,5,1,2,3}
{3,4,5,1,2}
{2,3,4,5,1}
{1,2,3,4,5}
.
The relative OpenCL code is quite straightforward.
Now, suppose that your CPU has many cores, say 56, that each core has a different starting array, and that at any random instant of time each CPU core may ask the GPU to compute the right shifts of its own array. This core, say core 21, should copy its own array into the GPU memory, run the kernel, and wait for the result. My question is: "during this operation, could the other CPU cores submit a similar request, without waiting for the completion of the task submitted by core 21?"
Also, can core 21 perform in parallel another task while waiting for the completion of the GPU task?
Would you feel like suggesting some examples to look at?
Thanks!
The GPU works with a queue of kernel calls and (PCIe-) memory transfers. Within this queue, it can work on non-blocking memory transfers and a kernel at the same time, but not on two consecutive kernels. You could do several queues (one per CPU core), then the kernels from different queues can be executed in parallel, provided that each kernel only takes up a fraction of the GPU resources. The CPU core can, while the queue is being executed on the GPU, perform a different task, and with the command queue.finish() the CPU will wait until the GPU is done.
However, letting multiple CPU cores send tasks to a single GPU is bad practice and will not give you any performance advantage, while making your code over-complicated. Each small PCIe memory transfer has a large latency overhead, and small kernels that do not sufficiently saturate the GPU have bad performance.
The multi-CPU approach is only useful if each CPU sends tasks to its own dedicated GPU, and even then I would only recommend it if the VRAM of a single GPU is not enough or if you want more parallel throughput than a single GPU allows.
A better strategy is to feed the GPU with a single CPU core and - if there is some processing to do on the CPU side - only then parallelize across multiple CPU cores. By combining small data packets into a single large PCIe memory transfer and large kernel, you will saturate the hardware and get the best possible performance.
For more details on how the parallelization on the GPU works, see https://stackoverflow.com/a/61652001/9178992
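In CUDA terms (CUDA streams play the role of OpenCL command queues, and the same batching advice applies), the recommended single-feeder pattern might look like the following sketch, where all 56 cores' arrays are combined into one transfer and one kernel launch; all names and sizes here are illustrative:

```cuda
#include <cuda_runtime.h>

// One block per array, one thread per element: each block computes the
// right shift of its own n-element array, out-of-place to avoid races.
__global__ void rightShiftAll(const int *in, int *out, int n) {
    int b = blockIdx.x, i = threadIdx.x;
    out[b * n + (i + 1) % n] = in[b * n + i];   // right shift by one
}

int main() {
    const int n = 5, batch = 56;                // 56 "CPU cores", 5-element arrays
    int h[batch * n], *d_in, *d_out;
    for (int b = 0; b < batch; b++)
        for (int i = 0; i < n; i++) h[b * n + i] = i + 1;
    cudaMalloc(&d_in, sizeof(h));
    cudaMalloc(&d_out, sizeof(h));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);  // one large transfer
    rightShiftAll<<<batch, n>>>(d_in, d_out, n);             // one large kernel
    cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost);
    return 0;
}
```

One transfer of 56 small arrays saturates PCIe far better than 56 tiny transfers, and one kernel over 56 blocks saturates the GPU far better than 56 tiny kernels.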

Linux TC eBPF and concurency

Is there a limit to how many instances of an eBPF program the kernel can run simultaneously on several CPUs (similar to the Python GIL problem)?
In particular, can eBPF tc programs work on multiple CPUs simultaneously?
How is locking of kernel data structures done by eBPF when it is running the same code on multiple CPUs?
In particular, can eBPF tc programs work on multiple CPUs simultaneously?
Yes (see details below).
How is locking of kernel data structures done by eBPF when it is running the same code on multiple CPUs?
Concurrent accesses to maps in BPF are protected by the RCU mechanism. However, there is currently no way to protect concurrent code in BPF programs themselves. So, for example, a BPF program running on a first core may update a value between the lookup and update calls of the same program running on a second core.
In some cases, to improve performance, you can use per-CPU maps (e.g., per-CPU arrays and per-CPU hashmaps). In that case, the API for lookups, updates, and deletes stays the same, but each core actually has its own copy of the map's values. This means that, for example, if you are incrementing a counter in a map, each core will see its own counter and you'll have to aggregate their values in userspace to get the total counter. Of course, this might not always fit your use case.

Concurrency of cuFFT streams

So I am using cuFFT combined with the CUDA stream feature. The problem I have is that I can't seem to make the cuFFT kernels run in full concurrency. The following are the results I have from nvvp. Each stream runs a kernel doing a 2D batch FFT on 128 images of size 128x128. I set up 3 streams to run 3 independent FFT batch plans.
As can be seen from the figure, some memory copies (yellow bars) ran concurrently with some kernel computations (purple, brown, and pink bars). But the kernel runs were not concurrent at all; as you can see, each kernel strictly followed the previous one. The following is the code I used for the memory copies to the device and the kernel launches.
for (unsigned int j = 0; j < NUM_IMAGES; j++) {
    gpuErrchk( cudaMemcpyAsync( dev_pointers_in[j],
                                image_vector[j],
                                NX*NY*NZ*sizeof(SimPixelType),
                                cudaMemcpyHostToDevice,
                                streams_fft[j] ) );
    gpuErrchk( cudaMemcpyAsync( dev_pointers_out[j],
                                out,
                                NX*NY*NZ*sizeof(cufftDoubleComplex),
                                cudaMemcpyHostToDevice,
                                streams_fft[j] ) );
    cufftExecD2Z( planr2c[j],
                  (SimPixelType*)dev_pointers_in[j],
                  (cufftDoubleComplex*)dev_pointers_out[j] );
}
Then I changed my code so that all memory copies finish (synchronize) before all kernels are sent to their streams at once, and I got the following profiling result:
This confirmed that the kernels were not running concurrently.
I looked at one link which explains in detail how to set up full concurrency, either by passing the "--default-stream per-thread" command-line argument or by #defining CUDA_API_PER_THREAD_DEFAULT_STREAM before your CUDA #includes. It is a feature introduced in CUDA 7. I ran the sample code from that link on my MacBook Pro Retina 15" with a GeForce GT750M (the same machine used in that link), and I was able to get concurrent kernel runs. But I was not able to get my cuFFT kernels running in parallel.
Then I found this link where someone says that a cuFFT kernel will occupy the whole GPU, so no two cuFFT kernels can run in parallel, and I was stuck, since I didn't find any formal documentation addressing whether cuFFT enables concurrent kernels. Is this true? Is there a way to get around it?
I assume you called cufftSetStream() prior to the code you have shown, appropriate for each planr2c[j], so that each plan is associated with a separate stream. I don't see it in the code you posted. If you actually want cufft kernels to overlap with other cufft kernels, it's necessary for those kernels to be launched to separate streams. So the cufft exec call for image 0 would have to be launched into a different stream than the cufft exec call for image 1, for example.
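A minimal sketch of what that association might look like, using the arrays from the question and assuming the plans and streams have already been created:

```cuda
// Associate each cuFFT plan with its own stream before the launch loop;
// without this, every cufftExec* call goes into the default stream and
// the exec calls serialize regardless of the cudaMemcpyAsync streams.
for (unsigned int j = 0; j < NUM_IMAGES; j++) {
    cufftSetStream(planr2c[j], streams_fft[j]);
}
```

After this, each `cufftExecD2Z(planr2c[j], ...)` call in the question's loop is issued into `streams_fft[j]`.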
In order for any two CUDA operations to have the possibility to overlap, they must be launched into different streams.
Having said that, concurrent memory copies with kernel execution, but not concurrent kernels, is about what I would expect for reasonable sized FFTs.
A 128x128 FFT to a first order approximation will spin up ~15,000 threads, so if my thread blocks are ~500 threads each, that would be 30 threadblocks, which will keep a GPU fairly occupied, leaving not much "room" for additional kernels. (You can actually discover the total blocks and threads for a kernel in the profiler itself.) Your GT750m probably has 2 Kepler SMs with a maximum of 16 blocks per SM so a max instantaneous capacity of 32 blocks. And this capacity number could be reduced for a specific kernel due to shared memory usage, register usage, or other factors.
The instantaneous capacity of whatever GPU you are running on (max blocks per SM * number of SMs) will determine the potential for overlap (concurrency) of kernels. If you exceed that capacity with a single kernel launch, then that will "fill" the GPU, preventing kernel concurrency for some time period.
It should be theoretically possible for CUFFT kernels to run concurrently. But just like any kernel concurrency scenario, CUFFT or otherwise, the resource usage of those kernels would have to be pretty low to actually witness concurrency. Typically when you have low resource usage, it implies kernels with a relatively small number of threads/threadblocks. These kernels don't usually take long to execute, making it even more difficult to actually witness concurrency (because launch latency and other latency factors may get in the way). The easiest way to witness concurrent kernels is to have kernels with unusually low resource requirements combined with unusually long run times. This is generally not the typical scenario, for CUFFT kernels or any other kernels.
Overlap of copy and compute is still a useful feature of streams with CUFFT. And the concurrency idea, without a basic understanding of machine capacity and resource constraints, is somewhat unreasonable in itself. For example, if kernel concurrency were arbitrarily achievable ("I should be able to make any 2 kernels run concurrently"), without consideration of capacity or resource specifics, then after you got two kernels running concurrently, the next logical step would be to go to 4, 8, or 16 concurrent kernels. But the reality is that the machine can't handle that much work simultaneously. Once you've exposed enough parallelism (loosely translated as "enough threads") in a single kernel launch, exposing additional work parallelism via additional kernel launches normally cannot make the machine run any faster or process the work quicker.

Locking a process to Cuda core

I'm just getting into GPU processing.
I was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
For example, you may have a small C program that performs an image filter on an index of images. Can you have that program running on each CUDA core so that it essentially runs forever, reading/writing from its own memory to system memory and disk?
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
My semantics here are probably way off. I apologize if what I've said requires some interpretation. I'm not that used to GPU stuff yet.
Thanks.
All of my comments here should be prefaced with "at the moment". Technology is constantly evolving.
was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
process is mostly a (host) operating system term. CUDA doesn't define a process separately from the host operating system definition of it, AFAIK. CUDA threadblocks, once launched on a Streaming Multiprocessor (or SM, a hardware execution resource component inside a GPU), in many cases will stay on that SM for their "lifetime", and the SM includes an array of "CUDA cores" (a bit of a loose or conceptual term). However, there is at least one documented exception today to this in the case of CUDA Dynamic Parallelism, so in the most general sense, it is not possible to "lock" a CUDA thread of execution to a CUDA core (using core here to refer to that thread of execution forever remaining on a given warp lane within a SM).
Can you have that program running on each CUDA core that essentially runs forever
You can have a CUDA program that runs essentially forever. It is a recognized programming technique sometimes referred to as persistent threads. Such a program will naturally occupy/require one or more CUDA cores (again, using the term loosely). As already stated, that may or may not imply that the program permanently occupies a particular set of physical execution resources.
reading/writing from its own memory to system memory
Yes, that's possible, extending the train of thought. Writing to its own memory is obviously possible, by definition, and writing to system memory is possible via the zero-copy mechanism (slides 21/22), given a reasonable assumption of appropriate setup activity for this mechanism.
and disk?
No, that's not directly possible today, without host system interaction, and/or without a significant assumption of atypical external resources such as a disk controller of some sort connected via a GPUDirect interface (with a lot of additional assumptions and unspecified framework). The GPUDirect exception requires so much additional framework, that I would say, for typical usage, the answer is "no", not without host system activity/intervention. The host system (normally) owns the disk drive, not the GPU.
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
In my opinion, the CPU must still be considered. One consideration is if you need to write to disk. Even if you don't, most programs derive I/O from somewhere (e.g. MPI) and so the implication of a larger framework of some sort is there. Secondly, and relatedly, the persistent threads programming model usually implies a producer/consumer relationship, and a work queue. The GPU is on the processing side (consumer side) of the work queue, but something else (usually) is on the producer side, typically the host system CPU. Again, it could be another GPU, either locally or via MPI, that is on the producer side of the work queue, but that still usually implies an ultimate producer somewhere else (i.e. the need for system I/O).
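A persistent-threads sketch of that producer/consumer relationship might look like the following; the `WorkQueue` layout and all names here are hypothetical, assuming the queue lives in zero-copy (mapped host) memory so the CPU producer can append to it while the kernel runs:

```cuda
#include <cuda_runtime.h>

// Hypothetical work-queue layout in mapped host memory: the CPU
// (producer) appends items, bumps `tail`, and finally sets `done`;
// the persistent kernel (consumer) claims items until told to stop.
struct WorkQueue {
    int head;           // next item to consume (device-incremented)
    int tail;           // number of items produced (host-written)
    int done;           // producer sets this to 1 when finished
    float items[1024];
};

__global__ void persistentConsumer(volatile WorkQueue *q, float *out) {
    while (true) {
        int idx = atomicAdd((int *)&q->head, 1);     // claim one item
        while (idx >= q->tail) {                     // spin until it is produced
            if (q->done && idx >= q->tail) return;   // producer finished: exit
        }
        out[idx] = q->items[idx] * 2.0f;             // "process" the item
    }
}
```

The kernel never exits on its own; it runs until the host-side producer signals `done`, which is the essence of the persistent-threads model described above.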
Additionally:
Can CUDA threads send packets over a network?
This is like the disk question. These questions could be viewed in a general way, in which case the answer might be "yes". But restricting ourselves to formal definitions of what a CUDA thread can do, I believe the answer is more reasonably "no". CUDA provides no direct definitions for I/O interfaces to disk or network (or many other things, such as a display!). It's reasonable to conjecture or presume the existence of a lightweight host process that simply copies packets between a CUDA GPU and a network interface. With this presumption, the answer might be "yes" (and similarly for disk I/O). But without this presumption (and/or a related, perhaps more involved presumption of a GPUDirect framework), I think the most reasonable answer is "no". According to the CUDA programming model, there is no definition of how to access a disk or network resource directly.

Simultaneous launch of Multiple Kernels using CUDA for a GPU

Is it possible to launch two kernels that do independent tasks simultaneously? For example, if I have this CUDA code:
// host and device initialization
.......
.......
// launch kernel1
myMethod1 <<<.... >>> (params);
// launch kernel2
myMethod2 <<<.....>>> (params);
Assuming that these kernels are independent, is there a facility to launch them at the same time, allocating a few grids/blocks to each? Does CUDA/OpenCL have this provision?
Only devices with CUDA compute capability 2.0 and better (i.e. Fermi) can support multiple simultaneous kernel executions. See section 3.2.6.3 of the CUDA 3.0 programming guide, which states:
Some devices of compute capability 2.0
can execute multiple kernels
concurrently. Applications may query
this capability by calling
cudaGetDeviceProperties() and checking
the concurrentKernels property.
The maximum number of kernel launches
that a device can execute concurrently
is four.
A kernel from one CUDA context cannot
execute concurrently with a kernel
from another CUDA context.
Kernels that use many textures or a
large amount of local memory are less
likely to execute concurrently with
other kernels.
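The capability query the guide describes might look like this (checking device 0; adjust the device index as needed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("compute capability %d.%d, concurrentKernels = %d\n",
           prop.major, prop.minor, prop.concurrentKernels);
    return 0;
}
```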
You will need SM 2.0 or above for concurrent kernels.
To get concurrent execution you need to manually indicate that there is no dependence between the two kernels. This is because the compiler cannot determine that one kernel will not modify data being used by the other. Such a dependence could be as simple as reading from and writing to the same buffer, but it is actually much harder to detect in general, since there can be pointers inside data structures and so on.
To express the independence you must launch the kernels in different streams. The fourth parameter in the triple-chevron syntax specifies the stream, check out the Programming Guide or the SDK concurrentKernels sample.
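Using the question's kernels, that might look like the following sketch (the grid/block dimensions are placeholders):

```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// The fourth launch parameter selects the stream; kernels in different
// streams may overlap if the device supports concurrent kernels and
// neither launch fills the GPU by itself.
myMethod1<<<grid1, block1, 0, s1>>>(params);
myMethod2<<<grid2, block2, 0, s2>>>(params);

cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```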
CUDA compute capability 2.1 = up to 16 concurrent kernels