OpenCL - multiple threads on a gpu - c++

After having parallelized a C++ code via OpenMP, I am now considering using the GPU (a Radeon Pro Vega II) to speed up specific parts of my code. Being an OpenCL neophyte, I am currently searching for examples that can show me how to implement a multicore CPU-GPU interaction.
Here is what I want to achieve. Suppose you have a fixed, short array, say {1,2,3,4,5}, and that, as an exercise, you want to compute all of the possible "right shifts" of this array, i.e.,
{5,1,2,3,4}
{4,5,1,2,3}
{3,4,5,1,2}
{2,3,4,5,1}
{1,2,3,4,5}
The corresponding OpenCL code is quite straightforward.
Now, suppose that your CPU has many cores, say 56, that each core has a different starting array, and that at any random instant of time each CPU core may ask the GPU to compute the right shifts of its own array. This core, say core 21, should copy its own array into the GPU memory, run the kernel, and wait for the result. My question is: "during this operation, could the other CPU cores submit a similar request, without waiting for the completion of the task submitted by core 21?"
Also, can core 21 perform in parallel another task while waiting for the completion of the GPU task?
Would you feel like suggesting some examples to look at?
Thanks!

The GPU works with a queue of kernel calls and (PCIe-) memory transfers. Within this queue, it can work on non-blocking memory transfers and a kernel at the same time, but not on two consecutive kernels. You could create several queues (one per CPU core); then kernels from different queues can be executed in parallel, provided that each kernel only takes up a fraction of the GPU resources. The CPU core can, while the queue is being executed on the GPU, perform a different task, and with the call queue.finish() the CPU will wait until the GPU is done.
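As a minimal sketch of that idea (using the OpenCL C++ wrapper; the kernels, buffers and data are placeholders and error handling is omitted), two queues on the same device let transfers and kernels from different CPU threads overlap:

#include <CL/cl.hpp> // OpenCL 1.2 C++ wrapper
#include <vector>

void two_queue_sketch(cl::Context& context, cl::Device& device,
                      cl::Kernel& kernelA, cl::Kernel& kernelB,
                      cl::Buffer& bufA, cl::Buffer& bufB,
                      const std::vector<int>& dataA, const std::vector<int>& dataB) {
    cl::CommandQueue queueA(context, device); // one queue per CPU thread/core
    cl::CommandQueue queueB(context, device);
    // non-blocking (CL_FALSE) transfers: the CPU continues immediately
    queueA.enqueueWriteBuffer(bufA, CL_FALSE, 0, dataA.size() * sizeof(int), dataA.data());
    queueB.enqueueWriteBuffer(bufB, CL_FALSE, 0, dataB.size() * sizeof(int), dataB.data());
    queueA.enqueueNDRangeKernel(kernelA, cl::NullRange, cl::NDRange(dataA.size()));
    queueB.enqueueNDRangeKernel(kernelB, cl::NullRange, cl::NDRange(dataB.size()));
    // the CPU can do other work here while the GPU drains both queues
    queueA.finish(); // block until everything in queue A is done
    queueB.finish();
}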
However, letting multiple CPU cores send tasks to a single GPU is bad practice; it will not give you any performance advantage while making your code over-complicated. Each small PCIe memory transfer has a large latency overhead, and small kernels that do not sufficiently saturate the GPU have bad performance.
The multi-CPU approach is only useful if each CPU sends tasks to its own dedicated GPU, and even then I would only recommend this if the VRAM of a single GPU is not enough or if you want more parallel throughput than a single GPU allows.
A better strategy is to feed the GPU with a single CPU core and - if there is some processing to do on the CPU side - only then parallelize across multiple CPU cores. By combining small data packets into a single large PCIe memory transfer and a single large kernel, you will saturate the hardware and get the best possible performance.
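As a sketch of that batching strategy (names like shiftKernel, context and queue are assumed to be set up already, and error handling is omitted): pack all 56 small arrays into one contiguous buffer, upload it in a single transfer, and let one large kernel launch handle every array:

// 56 arrays of 5 ints each, packed into one buffer
const int numArrays = 56, arrayLength = 5;
std::vector<int> packed(numArrays * arrayLength);
// ... each CPU core writes its array into packed[coreId * arrayLength] ...
cl::Buffer input(context, CL_MEM_READ_ONLY, packed.size() * sizeof(int));
cl::Buffer output(context, CL_MEM_WRITE_ONLY, packed.size() * arrayLength * sizeof(int));
queue.enqueueWriteBuffer(input, CL_FALSE, 0, packed.size() * sizeof(int), packed.data());
// one work-item per (array, shift) pair: 56 * 5 = 280 work-items in one launch
queue.enqueueNDRangeKernel(shiftKernel, cl::NullRange, cl::NDRange(numArrays * arrayLength));
queue.finish();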
For more details on how the parallelization on the GPU works, see https://stackoverflow.com/a/61652001/9178992

Related

How to optimize code for Simultaneous Multithreading?

Currently, I am learning parallel processing using CPU, which is a well-covered topic with plenty of tutorials and books.
However, I could not find a single tutorial or resource that talks about programming techniques for hyper-threaded CPUs. Not a single code sample.
I know that to utilize hyper threading, the code must be implemented such that different parts of the CPU can be used at the same time (simplest example is calculating integer and float at the same time), so it's not plug-and-play.
Which book or resource should I look at if I want to learn more about this topic? Thank you.
EDIT: when I said hyper threading, I meant Simultaneous Multithreading in general, not Intel's hyper threading specifically.
Edit 2: for example, if I have an 8-core i7 CPU, I can make a sorting algorithm that runs 8 times faster when it uses all 8 cores instead of 1. But it will run the same on a 4-core CPU and a 4c/8t CPU, so in my case SMT does nothing.
Meanwhile, Cinebench will run much better on a 4c-8t CPU than on a 4c-4t CPU.
SMT is generally most effective when one thread is loading something from memory. Depending on the memory level (L1, L2, L3 cache, RAM), read/write latency can span a lot of CPU cycles that would be wasted doing nothing if only one thread were executed per core.
So, if you want to maximize the impact of SMT, try to interleave memory access of two threads so that one of them can execute instructions, while the other reads data. Theoretically, you can also use a thread just for cache warming, i.e. loading data from RAM or main storage into cache for subsequent use by other threads.
How to apply this successfully can vary from one system to another, because the access latencies of cache, RAM and main storage, as well as their sizes, may differ by a lot.
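As an illustrative sketch only (the sizes are arbitrary, and to actually observe the SMT effect both threads would have to be pinned to the two logical cores of the same physical core, which is OS-specific and not shown here): one thread streams through a large array and is mostly waiting on memory, while the other does pure ALU work that the core can execute during those stalls.

#include <cstdint>
#include <thread>
#include <vector>

// Memory-bound: sums a large array, spends most of its time waiting on RAM/cache.
void memory_bound(const std::vector<std::uint64_t>& data, std::uint64_t& sum) {
    std::uint64_t s = 0;
    for (std::uint64_t v : data) s += v;
    sum = s;
}

// Compute-bound: a long dependency chain of integer multiplies, no memory traffic.
void compute_bound(std::uint64_t iterations, std::uint64_t& result) {
    std::uint64_t x = 1;
    for (std::uint64_t i = 0; i < iterations; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    result = x;
}

int main() {
    std::vector<std::uint64_t> data(1 << 26, 1); // ~512 MB to stream through
    std::uint64_t sum = 0, result = 0;
    std::thread t1(memory_bound, std::cref(data), std::ref(sum));
    std::thread t2(compute_bound, std::uint64_t(1) << 28, std::ref(result));
    t1.join();
    t2.join();
    return 0;
}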

OpenCL: how lightweight are GPU threads?

I keep reading that GPU threads are lightweight and you can throw many tasks at them to complete in parallel....but how lightweight are they, exactly?
Let's say I have a million-member float3 array, and I want to calculate the length of each float3 value.
Does it make sense to send essentially 1 million tasks to the GPU (so the kernel calculates a single float3 length of the global array and returns)? Or something more like 1000 tasks, and each kernel execution loops through 1000 members of the array? If there is a benefit to grouping tasks like that, is there a way to calculate the optimal size of each grouping?
If we're talking about GPUs only, the answer is - very lightweight.
Does it make sense to send essentially 1 million tasks to the GPU
You're not "sending a million tasks" to the GPU. You're sending a single request, which is a few dozen bytes, which essentially says "please launch a million copies of this code with the grid coordinates i give you here". Those "copies" are created on the fly by hardware inside the GPU, and yes it's very efficient.
1000 tasks, and each kernel execution loops through 1000 members of the array
On a GPU, you almost certainly don't want to do this. A modern high-end GPU has easily 4000+ processing units, so you need at minimum that amount of concurrency. But usually much higher. There is a scheduler which picks one hardware thread to run on each of those processing units, and usually there are several dozen hardware threads per processing unit. So it's not unusual to see a GPU with 100K+ hardware threads. This is required to hide memory latencies.
So if you launch a kernel with a 1000x1 grid size, easily 3/4 of your GPU could be unused, and the used part will spend 90% of its time waiting for memory. Go ahead and try it out. The GPU has been designed to handle ridiculous amounts of threads - don't be afraid to use them.
Now, if you're talking about CPUs, that's a slightly different matter. CPUs obviously don't have 1000s of hardware threads. Here, it depends on the OpenCL implementation - but I think most reasonable CPU OpenCL implementations today will handle this for you, by processing work in loops, in just enough hardware threads for your CPU.
TL;DR: use the "1 million tasks" solution, and perhaps try tuning the local work size.
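For reference, a minimal sketch of the "1 million tasks" version (kernel and host names are illustrative, and error handling is omitted): one work-item computes the length of one float3, and the host launches a million work-items and lets the runtime (or some tuning) pick the local work size.

// Kernel: one work-item per element; the input is a tightly packed float array
// with 3 floats per element, read via vload3 to avoid float3 padding issues.
__kernel void float3_length(__global const float* in, __global float* out) {
    const size_t i = get_global_id(0);
    const float3 v = vload3(i, in);  // reads in[3*i .. 3*i+2]
    out[i] = length(v);              // built-in length()
}

// Host side (OpenCL C++ wrapper): one launch covering all 1,000,000 elements,
// with queue and kernel assumed to be set up already.
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1000000), cl::NullRange);
queue.finish();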

Possible options to calculate a part of paralleled Program over GPU

Hi, I am not so familiar with GPUs and I just have a theoretical question.
So I am working on an application called Sassena, which calculates neutron scattering from molecular dynamics trajectories. This application is parallelized with MPI and works very well on CPUs. But I am willing to run this app on the GPU to make it faster; of course not all of it, but parts. When I look at the source code, the way it works is typical MPI, meaning the first rank sends the data to each node individually and then each node does the calculation. Now, there is a part of the calculation which uses the Fast Fourier Transform (FFT), which consumes the most time, and I want to send this part to the GPU.
I see 2 solutions ahead of me:
1. When the nodes reach the FFT part, they send the data back to the main node; when the main node has gathered all the data, it sends it to the GPU, the GPU does the FFT and sends it back to the CPU, and the CPU does the rest.
2. Each node would dynamically send its data to the GPU, and after the GPU does the FFT, it sends the result back to each node and they do the rest of their job.
So my question is which one of these two is possible. I know the first one is doable, but it involves a lot of communication, which is time consuming. As for the second way, I don't know if it is possible at all. I know that in the second case it will also depend on the computer architecture. But is CUDA or OpenCL capable of this at all?
Thanks for any idea.
To my knowledge, you are not restricted by CUDA. What you are restricted by here is the number of GPUs you have. You need to create some sort of queue that distributes your work to the available GPUs and keeps track of free resources. Depending on the ratio between the number of CPUs and the number of GPUs and the amount of time each FFT takes, you may be waiting longer for each FFT to be passed to the GPU compared to just doing it on each core.
What I mean is that you lose the asynchronous computation of the FFT which is performed on each core. Rather, CPU 2 has to wait for CPU 1 to finish its FFT computation before it can initiate a new kernel on the GPU.
Beyond that, it is possible to create a simple mutex which is locked when a CPU starts computing its FFT and unlocked when it finishes, so that the next CPU can use the GPU.
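A minimal sketch of that mutex idea, assuming the workers are threads within a single process (with separate MPI processes you would instead need an inter-process lock or a dedicated GPU-owner rank); run_fft_on_gpu is a placeholder name:

#include <mutex>

std::mutex gpuMutex; // shared by all worker threads in one process

void run_fft_on_gpu(/* thread-local data */) {
    std::lock_guard<std::mutex> lock(gpuMutex); // only one FFT on the GPU at a time
    // ... upload data, launch the FFT kernel, download the result ...
} // mutex released here; the next thread can use the GPU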
You can look at StarPU. It is a task-based API which can handle sending tasks to GPUs. It is also designed for distributed memory models.

Windows multitasking breaks OpenCL performance

I'm writing a Qt application with a simple idea: there are several OpenCL-capable devices, and each of them gets its own control thread which prepares data, executes the OpenCL kernel, and processes the results. The OpenCL code is actually a bitcoin mining kernel (for now it's this one, but it doesn't matter).
When working with 2 GPUs everything is ok.
When I use the GPU and CPU there is a problem. The CPU works at a reasonable speed, but the GPU slows down to zero performance.
There is no such problem under Linux. Under Windows, poclbm behaves in the same way: when starting multiple instances (1 for GPU, 1 for CPU), GPU performance is 0.
I'm not sure which part of the code would be helpful to post. I can only mention that the thread is a QThread subclass with run() reimplemented as a busy loop while( !_stop ) { mineBitcoins(); }. The logic of that loop is pretty much copied from poclbm's BitcoinMiner::mining_thread (here).
In which direction should I dig? Thanks.
upd:
I'm using QtOpenCL with AMD APP SDK.
If you run the kernel on the CPU with full utilization of all cores, the threads that handle the other devices might not be able to keep up with the GPU, effectively limiting performance.
Try decreasing the number of threads running the kernel on the CPU, e.g. if your program runs on a quad-core with hyper threading, limit the threads to 7.
Don't use the host device as an OpenCL device. If you really have to, restrict the number of compute units (of the CPU used as host) allocated for CL by creating a sub-device.
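A sketch of what that sub-device creation could look like with the OpenCL 1.2 C API (the count of 7 compute units is just an example, cpuDevice is a placeholder, and error handling is omitted):

// Reserve 7 of the CPU's compute units for OpenCL, leaving the rest for the host.
cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_BY_COUNTS,
    7, // compute units given to the sub-device
    CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
    0
};
cl_device_id subDevice;
cl_uint numSubDevices;
clCreateSubDevices(cpuDevice, props, 1, &subDevice, &numSubDevices);
// use subDevice (instead of cpuDevice) when creating the context and command queue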
I don't know if you are using both devices in the same context. But if that is the case, your problem can be the memory consistency inside a context and how the different OpenCL implementations handle it.
OpenCL tries to keep the memory inside a context up to date (at least on Windows), and this can cause the GPU to continuously copy the memory it uses back to the "CPU side".
I tried that long ago and, as in your case, it resulted in "~0 performance on the GPU".

how much time does it take to make a call to opencl?

I'm currently implementing an algorithm that does a lot of linear algebra on small matrices and vectors. The code is fast, but I'm wondering if it would make sense to implement it on a GPGPU instead of the CPU.
I'm able to store most of the matrices and vectors in GPU memory as a preprocessing step, and I have profiled the multiplication algorithms; the algorithms are, of course, way faster on the GPU.
But now for my real question:
How do I determine the overhead of making calls to the GPU from the CPU? How many cycles am I losing waiting for my code to be executed, and so on?
I hope someone has some input.
It is hard to determine the exact "overhead" of calling OpenCL, because operations on the GPU can be done in parallel with whatever else is running on the CPU.
Depending on your application, you can, for example, do a transfer of a chunk of data to the GPU from your application and in parallel do some preprocessing on the CPU of the following chunk of data. Similarly, while the code is executing on the GPU, you can be doing some prep work on the CPU on some data needed in the future.
The transfers to the GPU will be done via DMA transfers, which are very fast in general.
From my experience, I was able to transfer around 4 MB of data in on the order of 4 milliseconds to the GPU (modern GPU, modern motherboard), while doing some processing on the data that was sent previously.
From that, it seems safe to say you can upload and download on the order of 1 GB of data per second to the GPU and do some processing on that data.
In your case, either the GPU or the CPU side will be the bottleneck. The CPU side is the bottleneck if it cannot feed, say, 1 GB of prepared data to the GPU per second; this may very possibly be limited by your disk I/O.
To test your GPU path, set up a bunch of buffers of data ready to process. You would want to keep re-sending that data to the GPU, processing it, and downloading the results (which you will discard). Measure the throughput and compare to the throughput of your CPU version of the application.
Don't measure just the GPU processing part, because transfers and processing on the GPU will compete for GPU memory controller time and will be affecting each other's pace.
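A rough sketch of such a throughput test (OpenCL C++ wrapper; queue, kernel, buffers, sizes and host vectors are assumed to be set up already, results are discarded, and error handling is omitted):

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
for (int i = 0; i < iterations; ++i) {
    queue.enqueueWriteBuffer(input, CL_FALSE, 0, bytes, hostIn.data());
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(numElements));
    queue.enqueueReadBuffer(output, CL_FALSE, 0, bytes, hostOut.data());
}
queue.finish(); // wait for all transfers and kernels to complete
auto t1 = std::chrono::steady_clock::now();
double seconds = std::chrono::duration<double>(t1 - t0).count();
double gbPerSecond = iterations * double(bytes) / seconds / 1e9;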
Also, in case you want very good response time on small pieces of data, not good throughput, you probably won't benefit from going through the GPU, because it introduces a bit of delay to your processing.
The important thing to consider here is the time it takes to copy the data to the GPU and back. Even if the GPU implementation is much faster, the time spent doing transfers may wipe out any advantage.
Furthermore, if you are very serious about the accuracy of your algebra, you may want to consider that the operations you want to perform may not be natively available on the GPU in double precision.
Given that you say your matrices and vectors are small I suggest checking out SIMD optimisations that may improve the performance of your algorithm on CPU.
You can use cl_event objects to track the time that the actual computations take (latency). If you actually mean CPU cycles, use RDTSC (or its intrinsic, __rdtsc in MSVC) to do nanosecond-precise timing for the actual API calls. The RDTSC instruction (read time stamp counter) returns the number of clock cycles the CPU has completed since power-up.
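A minimal sketch of the event-based timing (OpenCL C++ wrapper; queue, kernel and globalSize are assumed to exist, and the queue must have been created with the CL_QUEUE_PROFILING_ENABLE property):

cl::Event event;
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(globalSize),
                           cl::NullRange, nullptr, &event);
event.wait();
cl_ulong start = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong end   = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
double ms = (end - start) * 1e-6; // profiling timestamps are in nanoseconds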
If it really is that easy to upload, then you can batch up calls and perhaps add a dimension to your NDRange to do multiple computations in one call. Of course, the details depend on your kernel implementation.
I suggest using the following to measure the elapsed CPU time:
#include <ctime>    // clock(), CLOCKS_PER_SEC
#include <iostream>
// ...
clock_t start, end;
start = clock();
// do stuff...
end = clock();
// clock() returns processor time in ticks, not cycles; divide by CLOCKS_PER_SEC for seconds
std::cout << "CPU time used: " << double(end - start) / CLOCKS_PER_SEC << " s\n";