Options for offloading part of a parallel program to the GPU - C++

Hi, I am not very familiar with GPUs and I just have a theoretical question.
I am working on an application called Sassena, which calculates neutron scattering from molecular dynamics trajectories. The application is parallelized with MPI and works very well on CPUs. I would like to run part of it (not all of it) on a GPU to make it faster. Looking at the source code, it works in the typical MPI way: the first rank sends the data to each node individually, and then each node does its calculation. One part of the calculation uses the Fast Fourier Transform (FFT); it consumes the most time, and this is the part I want to send to the GPU.
I see two solutions:
When the nodes reach the FFT part, they send their data back to the main node. Once the main node has gathered all the data, it sends it to the GPU; the GPU does the FFT and sends the result back to the CPU, which does the rest.
Each node sends its data to the GPU dynamically; after the GPU has done the FFT, it sends the result back to that node, which then does the rest of its job.
So my question is: which of these two is possible? I know the first one is doable, but it involves a lot of communication, which is time consuming. I don't know whether the second one is possible at all. I know that in the second case it will also depend on the computer architecture. But are CUDA or OpenCL capable of this at all?
Thanks for any idea.

To my knowledge you are not restricted by CUDA. What restricts you here is the number of GPUs you have. You need to create some sort of queue that distributes your work to the available GPUs and keeps track of free resources. Depending on the ratio of CPUs to GPUs and the time each FFT takes, you may wait longer for each FFT to be passed to the GPU than you would by just doing it on each core.
What I mean is that you lose the concurrent FFT computation that currently happens on every core. Instead, CPU 2 has to wait for CPU 1 to finish its FFT before it can launch a new kernel on the GPU.
Beyond that, it is possible to create a simple mutex that is locked when a CPU starts computing its FFT on the GPU and unlocked when it finishes, so that the next CPU can use the GPU.
You can also look at StarPU. It is a task-based API that can handle sending tasks to GPUs, and it is also designed for distributed-memory models.
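To make the trade-off above concrete, here is a rough back-of-the-envelope sketch in plain C++. It assumes a static rank-to-GPU mapping and a GPU that serves one FFT at a time; the function names and the cost model are illustrative assumptions, not part of Sassena, MPI, or CUDA.

```cpp
#include <cassert>

// Static assignment (an assumption of this sketch): rank r uses GPU r mod num_gpus.
int gpu_for_rank(int rank, int num_gpus) {
    assert(rank >= 0 && num_gpus > 0);
    return rank % num_gpus;
}

// Number of ranks that end up sharing the busiest GPU (ceiling division).
int ranks_per_gpu(int num_ranks, int num_gpus) {
    assert(num_ranks >= 0 && num_gpus > 0);
    return (num_ranks + num_gpus - 1) / num_gpus;
}

// Worst case under this model: the last rank in line on a GPU waits for
// every other rank sharing that GPU before it gets its own FFT back.
double worst_case_gpu_seconds(int num_ranks, int num_gpus, double gpu_fft_seconds) {
    return ranks_per_gpu(num_ranks, num_gpus) * gpu_fft_seconds;
}

// Offloading pays off in this simple model only if even the last rank in
// line is served faster than it could compute the FFT on its own core.
// gpu_fft_seconds should include the host-to-device and device-to-host copies.
bool offload_pays_off(int num_ranks, int num_gpus,
                      double gpu_fft_seconds, double cpu_fft_seconds) {
    return worst_case_gpu_seconds(num_ranks, num_gpus, gpu_fft_seconds) < cpu_fft_seconds;
}
```

For example, with 16 ranks, 2 GPUs, 1 s per GPU FFT and 10 s per CPU FFT, the last rank waits about 8 s, so offloading still wins; with a single GPU it waits about 16 s and the CPU version is faster. Real timings have to be measured, since overlap of transfers and kernels changes the picture.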

Related

OpenCL - multiple threads on a gpu

After having parallelized a C++ code via OpenMP, I am now considering using the GPU (a Radeon Pro Vega II) to speed up specific parts of my code. Being an OpenCL neophyte, I am currently searching for examples that can show me how to implement a multicore CPU - GPU interaction.
Here is what I want to achieve. Suppose you have a fixed, short array, say {1,2,3,4,5}, and that as an exercise you want to compute all of the possible "right shifts" of this array, i.e.,
{5,1,2,3,4}
{4,5,1,2,3}
{3,4,5,1,2}
{2,3,4,5,1}
{1,2,3,4,5}
The corresponding OpenCL code is quite straightforward.
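Since the kernel itself is not shown, here is a plain C++ reference of the same computation, as a sketch of what each work-item would do. The function names are made up for illustration, and the 2D index space (shift, i) is an assumption about how the kernel is laid out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors what one work-item would do for the index pair (shift, i):
// element i of the array shifted right by `shift` comes from input
// position (i - shift) mod n.
int shifted_element(const std::vector<int>& in, std::size_t shift, std::size_t i) {
    const std::size_t n = in.size();
    assert(n > 0 && i < n);
    return in[(i + n - shift % n) % n];
}

// Computes the n right shifts (shift = 1..n) by looping over the same
// 2D index space that a kernel launch would cover in parallel.
std::vector<std::vector<int>> all_right_shifts(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<std::vector<int>> out(n, std::vector<int>(n));
    for (std::size_t s = 1; s <= n; ++s)
        for (std::size_t i = 0; i < n; ++i)
            out[s - 1][i] = shifted_element(in, s, i);
    return out;
}
```

Calling all_right_shifts on {1,2,3,4,5} yields the five arrays listed above, in the same order, which makes it usable as a check against the GPU output.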
Now, suppose that your CPU has many cores, say 56, that each core has a different starting array, and that at any random instant each CPU core may ask the GPU to compute the right shifts of its own array. This core, say core 21, should copy its own array into GPU memory, run the kernel, and wait for the result. My question is: "during this operation, could the other CPU cores submit a similar request without waiting for the completion of the task submitted by core 21?"
Also, can core 21 perform another task in parallel while waiting for the completion of the GPU task?
Could you suggest some examples to look at?
Thanks!
The GPU works with a queue of kernel calls and (PCIe) memory transfers. Within one queue, it can work on non-blocking memory transfers and a kernel at the same time, but not on two consecutive kernels. You could create several queues (one per CPU core); kernels from different queues can then be executed in parallel, provided that each kernel only takes up a fraction of the GPU resources. While the queue is being executed on the GPU, the CPU core can perform a different task, and with the command queue.finish() the CPU will wait until the GPU is done.
However, letting multiple CPU cores send tasks to a single GPU is bad practice: it will not give you any performance advantage and it makes your code overcomplicated. Each small PCIe memory transfer has a large latency overhead, and small kernels that do not sufficiently saturate the GPU perform badly.
The multi-CPU approach is only useful if each CPU sends tasks to its own dedicated GPU, and even then I would only recommend it if the VRAM of a single GPU is not enough or if you want more parallel throughput than a single GPU allows.
A better strategy is to feed the GPU from a single CPU core and, if there is some processing to do on the CPU side, only then parallelize across multiple CPU cores. By combining small data packets into a single large PCIe memory transfer and a single large kernel, you will saturate the hardware and get the best possible performance.
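The "combine small packets" step can be sketched on the host side as follows. This is only one possible layout (a flat buffer plus an offset table); PackedBatch, pack and unpack are hypothetical names, and the actual upload and kernel launch are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One contiguous host buffer plus the start offset of every packet.
// offsets has (number of packets + 1) entries; packet k occupies
// the half-open range [offsets[k], offsets[k + 1]) in data.
struct PackedBatch {
    std::vector<float> data;
    std::vector<std::size_t> offsets;
};

// Concatenates all packets so they can be uploaded in a single transfer.
PackedBatch pack(const std::vector<std::vector<float>>& packets) {
    PackedBatch batch;
    batch.offsets.reserve(packets.size() + 1);
    batch.offsets.push_back(0);
    for (const auto& p : packets) {
        batch.data.insert(batch.data.end(), p.begin(), p.end());
        batch.offsets.push_back(batch.data.size());
    }
    return batch;
}

// Copies packet k back out of the (processed) batch.
std::vector<float> unpack(const PackedBatch& batch, std::size_t k) {
    assert(k + 1 < batch.offsets.size());
    const auto first = batch.data.begin() + static_cast<std::ptrdiff_t>(batch.offsets[k]);
    const auto last = batch.data.begin() + static_cast<std::ptrdiff_t>(batch.offsets[k + 1]);
    return std::vector<float>(first, last);
}
```

The data vector would go to the GPU in one transfer and the offsets vector in another, so a single kernel can find the packet each work-item belongs to. For fixed-length arrays like the right-shift example, the offset table can be replaced by a constant stride.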
For more details on how the parallelization on the GPU works, see https://stackoverflow.com/a/61652001/9178992

OpenCL Profiling timestamps are not consistent in duration compared to CPU clock

I am creating a custom tool interface with my application to profile the performance of OpenCL kernels while also integrating CPU profiling points. I'm currently working with this code on Linux using Ubuntu, and am testing using the 3 OpenCL devices in my machine: Intel CPU, Intel IGP, and Nvidia Quadro.
I am using this code std::chrono::high_resolution_clock::now().time_since_epoch().count() to produce a timestamp on the CPU, and of course for the OpenCL profiling time points, they are 64-bit nanoseconds provided from the OpenCL profiling events API. The purpose of the tool I made is to consume log output (specially formatted and generated so as not to impact performance much) from the program and generate a timeline chart to aid performance analysis.
So far in my visualization interface I had made the assumption that nanoseconds are uniform. I've realized now after getting my visual interface working and checking a few assumptions that this condition more or less does hold to a standard deviation of 0.4 microsecond for the CPU OpenCL device (which indicates that the CPU device could be implemented using the same time counter, as it has no drift), but does not hold for the two GPU devices! This is perhaps not the most surprising thing in the world, but it affects the core design of my tool, so this was an unforeseen risk.
I'll provide some eye candy since it is very interesting and it does prove to me that this is indeed happening.
This is zoomed into the beginning of the profile where the GPU has the corresponding mapBuffer for poses happening around a millisecond before the CPU calls it (impossible!)
Toward the end of the profile we see the same shapes but reversed relationship, clearly showing that GPU seconds seem to count for a little bit less compared to CPU seconds.
Because I had assumed that a GPU nanosecond is indeed a CPU nanosecond, the visualization currently works by computing the average of the deltas between the values given to me by the CPU and the GPU. Since I implemented it this way from the start, perhaps I was at least subconsciously expecting an issue like this one. Anyway, what I did was establish a sync point at the kernel dispatch by recording a CPU timestamp immediately before calling clEnqueueNDRangeKernel and then comparing it against the CL_PROFILING_COMMAND_QUEUED OpenCL profiling event time. On further inspection, this delta showed the time drift:
This screenshot from the Chrome console shows me logging the array of delta values I collected from the two GPU devices; they are shown as BigInts to avoid losing integer precision. In both cases the GPU-reported timestamp deltas are trending down.
Compare with the numbers from the CPU:
My questions:
What might be a practical way to deal with this issue? I am currently leaning toward the use of sync points when dispatching OpenCL kernels, and these sync points could be used to either locally piecewise stretch the OpenCL Profiling timestamps, or to locally sync at the beginning of, say, a kernel dispatch and just ignore the discrepancy we have, assuming it will be insignificant during the period. In particular it is unclear whether it'd be a good idea to maximize granularity by implementing a sync point for every single profiling event I want to use.
What might be some other time measuring systems I can or should use on the CPU-side to see if maybe they will end up aligning better? I don't really have much hope in this at this point because I can imagine that the profiling times being provided to me are actually generated and timed on the GPU device itself. The fluctuations would then be affected by such things as dynamic GPU clock scaling, and there would be no hope of stumbling upon a different better timekeeping scheme on the CPU.
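For reference, the "piecewise stretch" idea from the first question could look roughly like the sketch below: every sync point pairs a GPU timestamp with the CPU timestamp taken at (approximately) the same instant, and GPU timestamps in between are mapped by linear interpolation. SyncPoint and gpu_to_cpu are hypothetical names, and the sketch assumes the drift is close to linear between neighbouring sync points.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// One sync point: the same instant observed on both clocks, in nanoseconds.
struct SyncPoint {
    std::int64_t gpu_ns;
    std::int64_t cpu_ns;
};

// Maps a GPU timestamp onto the CPU timeline by linear interpolation between
// the two sync points that bracket it. Timestamps outside the recorded range
// are extrapolated from the nearest segment.
// Precondition: at least two sync points, strictly increasing in gpu_ns.
std::int64_t gpu_to_cpu(const std::vector<SyncPoint>& sync, std::int64_t gpu_ns) {
    assert(sync.size() >= 2);
    std::size_t hi = 1;
    while (hi + 1 < sync.size() && sync[hi].gpu_ns < gpu_ns) ++hi;
    const SyncPoint& a = sync[hi - 1];
    const SyncPoint& b = sync[hi];
    // Only differences are converted to double, so large epoch-based
    // timestamps do not lose precision.
    const double scale = static_cast<double>(b.cpu_ns - a.cpu_ns) /
                         static_cast<double>(b.gpu_ns - a.gpu_ns);
    const double offset = scale * static_cast<double>(gpu_ns - a.gpu_ns);
    return a.cpu_ns + static_cast<std::int64_t>(std::llround(offset));
}
```

With one sync point per kernel dispatch, each profiling event is stretched only by the drift accumulated since the previous dispatch, which is the granularity trade-off raised above. Whether a sync point per profiling event is worth the overhead would have to be measured.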

OpenCL: how lightweight are GPU threads?

I keep reading that GPU threads are lightweight and you can throw many tasks at them to complete in parallel....but how lightweight are they, exactly?
Let's say I have a million-member float3 array, and I want to calculate the length of each float3 value.
Does it make sense to send essentially 1 million tasks to the GPU (so the kernel calculates a single float3 length of the global array and returns)? Or something more like 1000 tasks, and each kernel execution loops through 1000 members of the array? If there is a benefit to grouping tasks like that, is there a way to calculate the optimal size of each grouping?
If we're talking about GPUs only, the answer is - very lightweight.
Does it make sense to send essentially 1 million tasks to the GPU
You're not "sending a million tasks" to the GPU. You're sending a single request of a few dozen bytes, which essentially says "please launch a million copies of this code with the grid coordinates I give you here". Those "copies" are created on the fly by hardware inside the GPU, and yes, it's very efficient.
1000 tasks, and each kernel execution loops through 1000 members of the array
On a GPU, you almost certainly don't want to do this. A modern high-end GPU easily has 4000+ processing units, so you need at minimum that amount of concurrency, and usually much more. There is a scheduler which picks one hardware thread to run on each of those processing units, and usually there are several dozen hardware threads per processing unit. So it's not unusual to see a GPU with 100K+ hardware threads. This is required to hide memory latencies.
So if you launch a kernel with a 1000x1 grid size, easily 3/4 of your GPU could be unused, and the used part will spend 90% of its time waiting for memory. Go ahead and try it out. The GPU has been designed to handle ridiculous numbers of threads, so don't be afraid to use them.
Now, if you're talking about the CPU, that's a slightly different matter. CPUs obviously don't have thousands of hardware threads. Here it depends on the OpenCL implementation, but I think most reasonable CPU OpenCL implementations today will handle this for you by processing the work in loops, in just enough hardware threads for your CPU.
TL;DR: use the "1 million tasks" solution, and perhaps try tuning the local work size.
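As a rough illustration of the "1 million tasks" layout, here is a CPU emulation in plain C++. length_work_item stands in for the kernel body, gid plays the role of get_global_id(0), and padded_global_size rounds the element count up to a multiple of the local work size, which OpenCL 1.x requires when a local size is given explicitly. The names are made up for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };

// Body of the "one work-item per element" kernel, written as a plain function.
// The bounds check matters because the global size may be padded past n.
void length_work_item(std::size_t gid, const std::vector<Float3>& in,
                      std::vector<float>& out) {
    if (gid >= in.size()) return;
    const Float3& v = in[gid];
    out[gid] = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Smallest multiple of local_size that is >= n.
std::size_t padded_global_size(std::size_t n, std::size_t local_size) {
    assert(local_size > 0);
    return (n + local_size - 1) / local_size * local_size;
}

// CPU emulation of the launch: on a GPU the loop body runs once per
// work-item, in parallel, instead of sequentially.
std::vector<float> lengths(const std::vector<Float3>& in, std::size_t local_size) {
    std::vector<float> out(in.size());
    const std::size_t global = padded_global_size(in.size(), local_size);
    for (std::size_t gid = 0; gid < global; ++gid) length_work_item(gid, in, out);
    return out;
}
```

For one million elements and a local size of 256, the padded global size is 1,000,192; the 192 extra work-items exit immediately through the bounds check. Tuning then means trying a few local sizes (64, 128, 256, ...) and timing each.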

Clueless on how to execute big tasks on C++ AMP

I have a task to see whether an algorithm I developed can be run faster on the GPU than on the CPU. I'm new to computing on accelerators. I was given the book "C++ AMP", which I've read thoroughly, and I thought I understood it reasonably well (I coded in C and C++ in the past, but nowadays it's mostly C#).
However, when going into real application, I seem to just not get it. So please, help me if you can.
Let's say I have a task to compute some complicated function that takes a huge matrix input (like 50000 x 50000) and some other data and outputs matrix of same size. Total calculation for the whole matrix takes several hours.
On the CPU, I'd just cut the task into several pieces (something like 100) and execute them using Parallel.For or a simple task-managing loop I wrote myself. Basically, keep several threads running (number of threads = number of cores) and start a new piece whenever a thread finishes, until all pieces are done. And it worked well!
However, on the GPU I cannot use the same approach, not only because of memory constraints (that's OK, I can partition the data into several parts) but because anything that runs for over 2 seconds is considered a "timeout" and the GPU gets reset! So I must ensure that every part of my calculation takes less than 2 seconds to run.
And that limit does not apply to each task on its own (partitioning an hour-long job into 60 tasks of 1 s each would be easy enough); it applies to every batch of tasks, because no matter which queuing mode I choose (immediate or automatic), if I run (via parallel_for_each) anything that takes more than 2 s in total to execute, the GPU gets reset.
Not only that: if my CPU program hogs all CPU resources, the UI stays interactive and the system stays responsive as long as the program runs at lower priority. When executing code on the GPU, however, the screen seems to be frozen until execution is finished!
So, what do I do? In the demonstration from the book (the N-body problem), it is supposed to be about 100x as effective (multicore calculations give 2 GFLOPS, or whatever the figure was, while AMP gives 200 GFLOPS), but in a real application I just don't see how to do it!
Do I have to partition my big task into billions of pieces, for example pieces that each take 10 ms to execute, and run 100 of them in parallel_for_each at a time?
Or am I just doing it wrong, and there is a better solution I just don't get?
Help please!
TDRs (the 2 s timeouts you see) are a reality of using a resource that is shared between rendering the display and executing your compute work. The OS protects the display from being completely locked up by your application by enforcing a timeout. This also affects applications which try to render to the screen. Moving your AMP code to a separate CPU thread will not help: it frees up your UI thread on the CPU, but rendering will still be blocked on the GPU.
You can actually see this behavior in the n-body example when you set N to be very large on a low-power system. The maximum value of N is limited in the application to prevent you from running into these types of issues in typical scenarios.
You are actually on the right track. You do indeed need to break up your work into chunks that each run for less than 2 s, or smaller ones if you want to hit a particular frame rate. You should also consider how your work is being queued. Remember that all AMP work is queued, and in automatic mode you have no control over when it runs. Using immediate mode is the way to get better control over how commands are batched.
Note: TDRs are not an issue on dedicated compute GPU hardware (like Tesla) and Windows 8 offers more flexibility when dealing with TDR timeout limits if the underlying GPU supports it.
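A minimal sketch of the chunking arithmetic, in plain C++ without any AMP calls. It assumes the work can be split by matrix rows and that the time per row has been measured beforehand; rows_per_chunk and plan_chunks are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Number of rows to submit per dispatch so that the estimated run time
// stays within the budget. Always returns at least 1.
std::size_t rows_per_chunk(double seconds_per_row, double budget_seconds) {
    assert(seconds_per_row > 0.0 && budget_seconds > 0.0);
    const double rows = budget_seconds / seconds_per_row;
    return rows < 1.0 ? 1 : static_cast<std::size_t>(rows);
}

// Splits [0, total_rows) into half-open ranges [begin, end) of at most
// `chunk` rows each; one range corresponds to one dispatch.
std::vector<std::pair<std::size_t, std::size_t>>
plan_chunks(std::size_t total_rows, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < total_rows; begin += chunk)
        ranges.emplace_back(begin, std::min(begin + chunk, total_rows));
    return ranges;
}
```

Each range would then be submitted as its own parallel_for_each, waiting for it to complete before submitting the next so the chunks are not batched together. Picking a budget well below 2 s (say 0.5 s) leaves headroom for variation and keeps the display more responsive. If a single row already exceeds the budget, rows_per_chunk returns 1 and the row itself needs a finer split, for example into tiles.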

how much time does it take to make a call to opencl?

I'm currently implementing an algorithm that does a lot of linear algebra on small matrices and vectors. The code is fast, but I'm wondering whether it would make sense to implement it on a GPGPU instead of the CPU.
I'm able to store most of the matrices and vectors in GPU memory as a preprocessing step, and I have profiled the multiplication algorithms; they are, of course, way faster on the GPU.
But now for my real question:
How do I determine the overhead of making calls to the GPU from the CPU? How many cycles am I losing while waiting for my code to be executed, and things like that?
I hope someone has some input.
It is hard to determine the exact "overhead" of calling OpenCL, because operations on the GPU can be done in parallel with whatever else is running on the CPU.
Depending on your application, you can, for example, transfer a chunk of data to the GPU and, in parallel, preprocess the following chunk on the CPU. Similarly, while the code is executing on the GPU, you can do prep work on the CPU for data needed later.
The transfers to the GPU are done via DMA, which is very fast in general.
In my experience, I was able to transfer around 4 MB of data to the GPU in about 4 milliseconds (modern GPU, modern motherboard) while processing the data that had been sent previously.
From that, it seems safe to say you can upload and download on the order of 1 GB of data per second to and from the GPU and do some processing on that data.
In your case, either the GPU or the CPU side will be the bottleneck. It will be the CPU side if it cannot feed, say, 1 GB of prepared data per second to the GPU, which may well be limited by your disk I/O.
To test your GPU path, set up a bunch of buffers of data ready to process. You would want to keep re-sending that data to the GPU, processing it, and downloading the results (which you will discard). Measure the throughput and compare to the throughput of your CPU version of the application.
Don't measure just the GPU processing part, because transfers and processing on the GPU will compete for GPU memory controller time and will be affecting each other's pace.
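A small harness for that end-to-end measurement might look like the sketch below. measure_throughput is a hypothetical helper; round_trip is whatever callable performs one upload, kernel run and download, and it must block until the results are back (for example by calling clFinish on the queue), otherwise only the enqueue cost is timed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Calls round_trip (upload + process + download for one buffer) `iterations`
// times and returns the sustained throughput in bytes per second.
template <typename RoundTrip>
double measure_throughput(RoundTrip&& round_trip, std::size_t bytes_per_call,
                          std::size_t iterations) {
    assert(iterations > 0);
    // steady_clock is monotonic, so it is not affected by system clock changes.
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) round_trip();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return static_cast<double>(bytes_per_call) * static_cast<double>(iterations) /
           elapsed.count();
}
```

Running the same harness once with the GPU path and once with the CPU implementation as round_trip gives two directly comparable numbers. Use enough iterations that the total run time is well above the timer resolution.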
Also, in case you want very good response time on small pieces of data, not good throughput, you probably won't benefit from going through the GPU, because it introduces a bit of delay to your processing.
The important thing to consider here is the time it takes to copy the data to the GPU and back. Even if the GPU implementation is much faster, the time spent doing transfers may wipe out any advantage.
Furthermore, if you are very serious about the accuracy of your algebra, then you may want to consider that the operations you want to perform may not be available natively on the GPU in double precision.
Given that you say your matrices and vectors are small I suggest checking out SIMD optimisations that may improve the performance of your algorithm on CPU.
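You can use clEvent objects to track the time that the actual computations take (latency). If you actually mean CPU cycles, use RDTSC (or its intrinsic, __rdtsc in MSVC) to time the actual API calls with very fine resolution. The RDTSC instruction (read time stamp counter) returns the value of a counter that has been incrementing since the CPU was reset. Note that on modern x86 CPUs this counter ticks at a constant reference rate, so it measures elapsed time rather than the number of core clock cycles actually executed.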
If it really is that easy to upload, then you can batch up calls and perhaps add a dimension to your NDRange to do multiple computations in one call. Of course, the details depend on your kernel implementation.
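I suggest using the following to measure the CPU time used. Note that clock() reports processor time in units of 1/CLOCKS_PER_SEC seconds, not CPU cycles:
#include <ctime>
#include <iostream>
// ...
std::clock_t start = std::clock();
// do stuff...
std::clock_t end = std::clock();
// Convert clock ticks to seconds of processor time.
double seconds = static_cast<double>(end - start) / CLOCKS_PER_SEC;
std::cout << "CPU time used: " << seconds << " s\n";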