cudaMemsetAsync strange behavior - concurrency

I observe some strange behavior when overlapping data transfer and kernel execution in CUDA.
When I call cudaMemcpyAsync after cudaMemsetAsync, the cudaMemsetAsync does overlap with the compute kernel, but the cudaMemcpyAsync doesn't.
The compute kernel finishes first, and only then is the cudaMemcpyAsync executed.
When I comment out the cudaMemsetAsync, the overlap happens as expected.
Part of the code is shown below, with some changes.
Code:
for (d = 0; d < TOTAL; ++d)
{
    gpuErrchk(cudaMemsetAsync(data_d, 0, bytes, stream1));
    for (j = 0; j < M; ++j)
    {
        gpuErrchk(cudaMemcpyAsync(&data_d[index1], &data_h[index2], bytes, H2D, stream1));
    }
    gpuErrchk(cudaStreamSynchronize(stream1));
    cufftExecR2C(plan, data_d, data_fft_d);
    gpuErrchk(cudaStreamSynchronize(stream2));
    kernel<<<dimGrid, dimBlock, 0, stream3>>>(result_d, data_fft_d, size);
}
I use an NVIDIA GTX Titan GPU, and the compute and memory operations are performed in different streams. Moreover, cudaMemsetAsync and cudaMemcpyAsync operate on the same device buffer.

Some of CUDA's memcpy functions are implemented with kernels (such as device->device memcpy), but ALL of CUDA's memset functions are implemented internally as kernels.
Assuming the cufftExecR2C call is supposed to be done in a different stream, you can bet that the kernel generated by the FFT plan was designed to fully occupy the GPU.
So you are likely hitting the same limitation in kernel concurrency that you would if you were trying to invoke a kernel in another stream. Kernels must occupy a limited amount of the GPU in order to run concurrently, but most CUDA kernels are not designed to accommodate that use case.
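As an aside: if the cufftExecR2C call is meant to run in stream2 (so that the cudaStreamSynchronize(stream2) in the loop actually waits for it), the plan has to be attached to that stream explicitly. A minimal sketch, reusing the names from the question and assuming stream2 is the intended FFT stream:

// Done once after creating the plan; without this, cuFFT launches its kernels
// in the default stream rather than in stream2.
cufftSetStream(plan, stream2);

// The loop body can then keep calling:
// cufftExecR2C(plan, data_d, data_fft_d);
// gpuErrchk(cudaStreamSynchronize(stream2));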

Related

Does NPP support overlapping streams?

I'm trying to perform multiple asynchronous 2D convolutions on a single image with multiple filters using NVIDIA's NPP library method nppiFilterBorder_32f_C1R_Ctx. However, even after creating multiple streams and assigning them to NPP's method, the overlapping isn't happening; NVIDIA's nvvp shows the same.
That said, I'm not sure whether NPP supports overlapping context operations at all.
Below is a simplification of my code, only showing the async method calls and related variables:
std::vector<NppStreamContext> streams(n_filters);

for (size_t stream_idx = 0; stream_idx < n_filters; stream_idx++)
{
    cudaStreamCreateWithFlags(&(streams[stream_idx].hStream), cudaStreamNonBlocking);
    streams[stream_idx].nStreamFlags = cudaStreamNonBlocking;
    // fill up NppStreamContext remaining fields
    // malloc image and filter pointers
}

for (size_t stream_idx = 0; stream_idx < n_filters; stream_idx++)
{
    cudaMemcpyAsync(..., streams[stream_idx].hStream);
    nppiFilterBorder_32f_C1R_Ctx(..., streams[stream_idx]);
    cudaMemcpy2DAsync(..., streams[stream_idx].hStream);
}

for (size_t stream_idx = 0; stream_idx < n_filters; stream_idx++)
{
    cudaStreamSynchronize(streams[stream_idx].hStream);
    cudaStreamDestroy(streams[stream_idx].hStream);
}
Note: All the device pointers of the output images and input filters are stored in a std::vector, where I access them via the current stream index (e.g., float *ptr_filter_d = filters[stream_idx])
To summarize and add to the comments:
The profile does show small overlaps, so the answer to the title question is clearly yes.
The reason the overlap is so small is simply that each NPP kernel already needs all the resources of the GPU for most of its runtime. At the end of each kernel you can probably see the tail effect (i.e. the number of blocks is not a multiple of the number of blocks that can be resident on the SMs at any moment in time), so blocks from the next kernel get scheduled and there is some overlap.
It can sometimes be useful (i.e. an optimization) to force overlap between a big kernel which was started first and uses the full device and a later small kernel that only needs a few resources. In that case one can use stream priorities via cudaStreamCreateWithPriority to hint the scheduler to schedule blocks from the second kernel before blocks from the first kernel (see the sketch below). An example of this can be found in this multi-GPU example (permalink).
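For reference, the stream-priority mechanism looks roughly like this (a minimal sketch with hypothetical stream names, not tied to the NPP code in the question):

// Lower numbers mean higher priority; the available range is device-dependent.
int leastPrio, greatestPrio;
cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

cudaStream_t bigKernelStream, smallKernelStream;
// Give the small, latency-critical kernel the greatest priority so its blocks
// are scheduled ahead of the big kernel's remaining blocks.
cudaStreamCreateWithPriority(&bigKernelStream, cudaStreamNonBlocking, leastPrio);
cudaStreamCreateWithPriority(&smallKernelStream, cudaStreamNonBlocking, greatestPrio);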
In this case, however, the kernels are all the same size and there is no reason to prioritize any of them over the others, so forcing an overlap like this would not decrease the total runtime, because the compute resources are limited. In the profiler view the kernels might then show more overlap, but each one would also take more time. That is why the scheduler does not overlap the kernels even though you allow it to do so by using multiple streams (see asynchronous vs. parallel).
To still increase performance, one could write a custom CUDA kernel that applies all the filters in one kernel launch. The main reason this could be better than using NPP in this case is that all the NPP kernels take the same input image. A single kernel could therefore significantly decrease the number of accesses to global memory by reading each tile of the input image only once (into shared memory, although L1 caching might suffice), then applying all the filters sequentially or in parallel (by splitting the thread block into smaller units) and writing out the results.
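A very rough sketch of what such a kernel could look like (hypothetical names and sizes: 3x3 filters, a 16x16 thread block per tile, input clamped at the borders, one output image per filter):

// Hypothetical sketch: apply N_FILTERS 3x3 filters to one input image in a single launch.
// Launch with blockDim = (TILE, TILE) and gridDim covering the image.
#define TILE 16
#define N_FILTERS 8

__global__ void multiFilter3x3(const float* __restrict__ in, int width, int height,
                               const float* __restrict__ filters,  // N_FILTERS * 9 coefficients
                               float* __restrict__ out)            // N_FILTERS output images, concatenated
{
    __shared__ float tile[TILE + 2][TILE + 2];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Load the tile plus a 1-pixel halo, clamping coordinates at the image border.
    for (int dy = threadIdx.y; dy < TILE + 2; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2; dx += TILE) {
            int gx = (int)(blockIdx.x * TILE) + dx - 1;
            int gy = (int)(blockIdx.y * TILE) + dy - 1;
            gx = min(max(gx, 0), width - 1);
            gy = min(max(gy, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x >= width || y >= height) return;

    // Each thread reads its 3x3 neighbourhood from shared memory once
    // and applies every filter to it.
    for (int f = 0; f < N_FILTERS; ++f) {
        float acc = 0.0f;
        for (int ky = 0; ky < 3; ++ky)
            for (int kx = 0; kx < 3; ++kx)
                acc += filters[f * 9 + ky * 3 + kx] * tile[threadIdx.y + ky][threadIdx.x + kx];
        out[f * width * height + y * width + x] = acc;
    }
}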

Will iterating multiple images with CUDA increase performance?

The code below has been oversimplified for the purpose of this question.
Before I set up a CUDA environment and make any changes to my code, I wanted to get input on whether executing the code below would be much faster on a GPU.
The code basically iterates through the images and copies an image's pixel value to dst only if the corresponding mask value is not zero. The number of images can be as high as 10, and each image can be around 2K by 2K.
If I use #pragma omp it does increase performance. So the question is: will performance increase significantly if I execute this code on a GPU (assuming I have a decent graphics card like a GTX 1050), with each thread handling a separate image?
for (int i = 0; i < images.size(); ++i)
{
    for (int y = 0; y < images[i].height; ++y)
    {
        for (int x = 0; x < images[i].width; ++x)
        {
            bool maskVal = masks[i][y][x];
            if (maskVal > 0)
            {
                dst[i][y][x] = images[i].data(x, y);
            }
        }
    }
}
In this case I would guess not. That piece of code would probably execute faster if the images and destinations were already in GPU memory. However, if you plan to take the images and masks from main memory, copy them over the PCIe bus, execute the code on the GPU, and then transfer the results back to the CPU, you will be much better off just running this code on the CPU. If you think you will do further parallel processing on the dst images, though, you might as well transfer them to the GPU and do this step there, since you will have to pay that transfer penalty anyway. The reason OpenMP is faster is that it uses CPU threads, which share the same memory, so no extra copying is required.
Examples of things that do run well on a GPU are image convolutions or Fourier transforms, since those tasks are much heavier than what you are doing, so the overhead of the memory transfer matters less.
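For completeness, if the data did already live in GPU memory, the masked copy itself would typically be written with one thread per pixel rather than one thread per image. A minimal sketch, assuming flat per-image layouts and hypothetical names:

// Hypothetical sketch: one thread per pixel, images stored as flat arrays
// laid out back to back (image i starts at offset i * width * height).
// Launch with a 2D block (e.g. 16x16) and gridDim.z equal to the number of images.
__global__ void maskedCopy(const unsigned char* __restrict__ images,
                           const unsigned char* __restrict__ masks,
                           unsigned char* __restrict__ dst,
                           int width, int height, int numImages)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int i = blockIdx.z;                 // one z-slice of the grid per image

    if (x >= width || y >= height || i >= numImages) return;

    size_t idx = (size_t)i * width * height + (size_t)y * width + x;
    if (masks[idx] != 0)
        dst[idx] = images[idx];         // memory-bound: one read, one conditional write
}

Even written this way, the kernel does almost no arithmetic per byte moved, which is exactly why the PCIe transfers dominate in the scenario described above.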

What is the correct way to use queue.flush() and queue.finish() after calling a Kernel?

I am using the OpenCL 1.2 C++ wrapper for my project. I want to know the correct way to call my kernels. In my case, I have 2 devices and the data should be sent to them simultaneously.
I am dividing my data into two chunks, and both devices should be able to perform computations on them separately. They have no interconnection and they don't need to know what is happening on the other device.
Once the data has been sent to both devices, I want to wait for the kernels to finish before my program goes further, because I will be using the results returned from both kernels. So I don't want to start reading the data back before the kernels have returned.
I have 2 methods. Which one is programmatically correct in my case?
Method 1:
for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[i],
             arguments etc...);
    queue[i].flush();
}
// Wait for the kernels to return.
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].finish();
}
Method 2:
for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[i],
             arguments etc...);
}
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].flush();
}
// Wait for the kernels to return.
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].finish();
}
Or is neither of them correct, and is there a better way to wait for my kernels to return?
Assuming each device computes in its own memory:
I would go for a multi-threaded (for) loop version of your Method 1, because OpenCL doesn't force vendors to enqueue asynchronously. NVIDIA, for example, enqueues synchronously for some drivers and hardware, while AMD enqueues asynchronously.
When each device is driven by a separate thread, they should enqueue write+compute together before synchronizing for reading the partial results (the second threaded loop).
Having multiple threads is also advantageous for spin-wait type synchronization (clFinish), because multiple spin-wait loops run in parallel. This should save time on the order of a millisecond.
Flush helps some vendors, such as AMD, start issuing the enqueued work early.
To have correct input and correct output for all devices, only two finish commands are enough: one after write+compute, then one after the read of the results. That way each device gets the same time step's data and produces results for the same time step. Write and compute don't need a finish between them if the queue type is in-order, because an in-order queue executes them one by one. The read operations don't need to be blocking either.
Unnecessary finish commands always kill performance.
Note: I have already written a load balancer using all of this, and it performs better when using event-based synchronization instead of finish. Finish is easier, but it has bigger synchronization overhead than an event-based approach.
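To illustrate the event-based approach, here is a minimal sketch with the C++ wrapper, using enqueueNDRangeKernel directly rather than the kernel functor; the buffer, kernel, and range names are hypothetical and error handling is omitted:

std::vector<cl::Event> computeDone(numberOfDevices);

for (int i = 0; i < numberOfDevices; i++) {
    // Non-blocking write, then the kernel; the kernel enqueue records an event.
    queue[i].enqueueWriteBuffer(inputBuf[i], CL_FALSE, 0, bytes, hostInput[i]);
    queue[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalRange, cl::NullRange,
                                  NULL, &computeDone[i]);
    queue[i].flush();                 // make sure the commands are actually issued
}

for (int i = 0; i < numberOfDevices; i++) {
    computeDone[i].wait();            // wait only for this device's compute to finish
    queue[i].enqueueReadBuffer(outputBuf[i], CL_TRUE, 0, bytes, hostOutput[i]);
}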
Also, a single queue doesn't always push a GPU to its limits. Using at least 4 queues per device ensures latency hiding of write and compute on my AMD system; sometimes even 16 queues help a bit more. For I/O-bottlenecked situations you may need even more.
Example:
Thread 1:
    Write
    Compute
    Synchronization with the other thread
Thread 2:
    Write
    Compute
    Synchronization with the other thread
Thread 1:
    Read
    Synchronization with the other thread
Thread 2:
    Read
    Synchronization with the other thread
Unnecessary synchronization kills performance, because drivers don't know your intention and leave things as they are, so you should eliminate unneeded finish commands and convert blocking writes to non-blocking ones where you can.
Zero synchronization is also wrong, because OpenCL doesn't force vendors to start computing after just a few enqueues; the queued work may grow to gigabytes of memory in minutes or even seconds.
You should use Method 1. clFlush is the only way of guaranteeing that commands are issued to the device (and not just buffered somewhere before sending).

Cuda unified memory between gpu and host

I'm writing a CUDA-based program that needs to periodically transfer a set of items from the GPU to host memory. In order to keep the process asynchronous, I was hoping to use CUDA's UMA to have a memory buffer and flag in host memory (so both the GPU and the CPU can access them). The GPU would make sure the flag is clear, add its items to the buffer, and set the flag. The CPU waits for the flag to be set, copies things out of the buffer, and clears the flag. As far as I can see, this doesn't produce any race condition, because it forces the GPU and CPU to take turns, always reading and writing the flag opposite each other.
So far I haven't been able to get this to work because there does seem to be some sort of race condition. I came up with a simpler example that has a similar issue:
#include <stdio.h>

__global__ void uva_counting_test(int n, int *h_i);

int main() {
    int *h_i;
    int n;

    cudaMallocHost(&h_i, sizeof(int));
    *h_i = 0;
    n = 2;

    uva_counting_test<<<1, 1>>>(n, h_i);

    // even numbers
    for (int i = 1; i <= n; ++i) {
        // wait for a change to odd from the GPU
        while (*h_i == (2 * (i - 1)));

        printf("host h_i: %d\n", *h_i);
        *h_i = 2 * i;
    }

    return 0;
}

__global__ void uva_counting_test(int n, int *h_i) {
    // odd numbers
    for (int i = 0; i < n; ++i) {
        // wait for a change to even from the host
        while (*h_i == (2 * (i - 1) + 1));

        *h_i = 2 * i + 1;
    }
}
For me, this case always hangs after the first print statement from the CPU (host h_i: 1). The really unusual thing (which may be a clue) is that I can get it to work in cuda-gdb. If I run it in cuda-gdb, it hangs as before. If I press Ctrl+C, it brings me to the while() loop line in the kernel. From there, surprisingly, I can tell it to continue and it will finish. For n > 2, it freezes on the while() loop in the kernel again after each step, but I can keep pushing it forward with Ctrl+C and continue.
If there's a better way to accomplish what I'm trying to do, that would also be helpful.
You are describing a producer-consumer model, where the GPU is producing some data and from time-to-time the CPU will consume that data.
The simplest way to implement this is to have the CPU be the master. The CPU launches a kernel on the GPU; when it is ready to consume data (i.e. the while loop in your example) it synchronises with the GPU, copies the data back from the GPU, launches the kernel again to generate more data, and does whatever it has to do with the data it copied. This allows you to have the GPU filling a fixed-size buffer while the CPU is processing the previous batch (since there are two copies, one on the GPU and one on the CPU).
That can be improved upon by double-buffering the data, meaning that you can keep the GPU busy producing data 100% of the time by ping-ponging between buffers as you copy the other to the CPU. That assumes the copy-back is faster than the production, but if not then you will saturate the copy bandwidth which is also good.
Neither of those are what you actually described. What you asked for is to have the GPU master the data. I'd urge caution on that since you will need to manage your buffer size carefully and you will need to think carefully about the timings and communication issues. It's certainly possible to do something like that but before you explore that direction you should read up about memory fences, atomic operations, and volatile.
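A minimal sketch of the CPU-master, double-buffered pattern described above (produce_kernel, consume_on_cpu, N, blocks, threads and numBatches are illustrative names, not the question's code; error checking omitted):

cudaStream_t computeStream, copyStream;
cudaStreamCreate(&computeStream);
cudaStreamCreate(&copyStream);

int *d_buf[2], *h_buf;
cudaMalloc(&d_buf[0], N * sizeof(int));
cudaMalloc(&d_buf[1], N * sizeof(int));
cudaMallocHost(&h_buf, N * sizeof(int));            // pinned, so the async copy can overlap

int cur = 0;
produce_kernel<<<blocks, threads, 0, computeStream>>>(d_buf[cur], N);

for (int batch = 0; batch < numBatches; ++batch) {
    cudaStreamSynchronize(computeStream);            // batch 'cur' is now complete
    int ready = cur;
    cur ^= 1;
    // Let the GPU start producing the next batch into the other buffer...
    if (batch + 1 < numBatches)
        produce_kernel<<<blocks, threads, 0, computeStream>>>(d_buf[cur], N);
    // ...while the finished batch is copied back and consumed on the CPU.
    cudaMemcpyAsync(h_buf, d_buf[ready], N * sizeof(int), cudaMemcpyDeviceToHost, copyStream);
    cudaStreamSynchronize(copyStream);
    consume_on_cpu(h_buf, N);
}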
I'd try adding a __threadfence_system(); call after the *h_i = 2*i + 1; line in the kernel.
See here for details. Without it, it is entirely possible that the modification stays in the GPU cache forever. However, you had better listen to the other answers: to improve this for multiple threads/blocks you have to deal with other problems to get a similar scheme to work reliably.
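As a sketch of what that change might look like (still only the single-thread toy case from the question; marking the pointer volatile is an additional assumption so the flag is re-read from memory on every loop iteration):

// volatile forces the flag to be re-read from memory on each check, and
// __threadfence_system() makes the write visible to the host before looping again.
__global__ void uva_counting_test(int n, volatile int *h_i) {
    // odd numbers
    for (int i = 0; i < n; ++i) {
        // wait for a change to even from the host
        while (*h_i == (2 * (i - 1) + 1));

        *h_i = 2 * i + 1;
        __threadfence_system();
    }
}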
As Tom suggested (+1), it is better to use double buffering. Streams help a lot with such a scheme, as you can find depicted here.

Blend two images using GPU

I need to blend thousands of pairs of images very fast.
My code currently does the following: _apply is a function pointer to a function like Blend. It is one of many functions we can pass. Any such function takes two values and outputs a third, and it is applied to each channel of each pixel. I would prefer a solution that is general to any such function rather than one specific to blending.
typedef byte (*Transform)(byte src1, byte src2);
Transform _apply;

for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}

byte Blend(byte src, byte blend)
{
    int resultPixel = (src + blend) / 2;
    return (byte)resultPixel;
}
I was doing this on the CPU but the performance is terrible. It is my understanding that doing this on the GPU would be very fast. My program needs to run on computers that have either NVIDIA or Intel GPUs, so whatever solution I use needs to be vendor independent. If I use the GPU, it has to be OpenGL to be platform independent as well.
I think using a GLSL pixel shader would help, but I am not familiar with pixel shaders or with how to apply them to 2D objects (like my images).
Is that a reasonable solution? If so, how do I do this in 2D?
If there is a library that already does that it is also great to know.
EDIT: I am receiving the image pairs from different sources. One always comes from a 3D graphics component in OpenGL (so it is originally on the GPU). The other one comes from system memory, either from a socket (in a compressed video stream) or from a memory-mapped file. The "sink" of the resulting image is the screen. I am expected to show the images on the screen, so going to the GPU is an option, as is using something like SDL to display them.
The blend function that is going to be executed the most is this one
byte Patch(byte delta, byte lo)
{
    int resultPixel = (2 * (delta - 127)) + lo;
    if (resultPixel > 255)
        resultPixel = 255;
    if (resultPixel < 0)
        resultPixel = 0;
    return (byte)resultPixel;
}
EDIT 2: The image coming from GPU land arrives like this, from FBO to PBO to system memory:
glBindFramebuffer(GL_FRAMEBUFFER,fbo);
glReadBuffer( GL_COLOR_ATTACHMENT0 );
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0,0,width,height,GL_BGR,GL_UNSIGNED_BYTE,0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
void* mappedRegion = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
It seems like it is probably better to just keep everything in GPU memory. The other bitmap can come from system memory. We may eventually get it from a video decoder in GPU memory as well.
EDIT 3: One of my images will come from D3D while the other comes from OpenGL. It seems that something like Thrust or OpenCL is the best option.
From the looks of your Blend function, this is an entirely memory-bound operation. The CPU caches can likely hold only a very small fraction of the thousands of images you have, meaning most of your time is spent waiting for RAM to fulfill load/store requests, and the CPU will idle a lot.
You will NOT get any speedup by copying your images from RAM to the GPU, having the GPU's arithmetic units idle while they wait for GPU RAM to feed them data, waiting for GPU RAM again to write the results, and then copying it all back to main RAM. Using the GPU for this could actually slow things down substantially.
But I could be wrong and you might not be saturating your memory bus already. You will have to try it on your system and profile it. Here are some simple things you can try to optimize.
1. Multi-thread
I would focus on optimizing the algorithm directly on the CPU. The simplest thing is to go multi-threaded, which can be as simple as enabling OpenMP in your compiler and updating your for loop:
#include <omp.h> // add this along with enabling OpenMP support in your compiler
...
#pragma omp parallel for // <--- compiler magic happens here
for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}
If your memory bandwidth is not saturated, this will likely speed up the blending by however many cores your system has.
2. Micro-optimizations
Another thing you can try is to implement your Blend using SIMD instructions which most CPUs have nowadays. I can't help you with that without knowing what CPU you are targeting.
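For example, assuming an x86 target with SSE2 available, the averaging blend maps almost directly onto a byte-averaging intrinsic (note that _mm_avg_epu8 rounds up, whereas (src + blend)/2 truncates, so results can differ by one):

#include <emmintrin.h> // SSE2 -- this sketch assumes an x86/x64 target

// Hypothetical sketch: average-blend 16 bytes per iteration.
// frameSize is assumed to be a multiple of 16 here; a scalar tail loop would handle the rest.
void BlendSSE2(const unsigned char* src, const unsigned char* blend,
               unsigned char* result, int frameSize)
{
    for (int i = 0; i < frameSize; i += 16)
    {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(blend + i));
        __m128i avg = _mm_avg_epu8(a, b); // rounded average of unsigned bytes
        _mm_storeu_si128(reinterpret_cast<__m128i*>(result + i), avg);
    }
}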
You can also try unrolling your for loop to mitigate some of the loop overhead.
One easy way to achieve both of these is to leverage the Eigen matrix library by wrapping your data in its data structures.
// initialize your data and result buffer
byte *source = ...
byte *blend = ...
byte *result = ...

// tell Eigen where your data/buffers are, and to treat them like dynamic vectors of bytes
// this is a cheap shallow copy
Map<Matrix<byte, Dynamic, 1> > sourceMap(source, _frameSize);
Map<Matrix<byte, Dynamic, 1> > blendMap(blend, _frameSize);
Map<Matrix<byte, Dynamic, 1> > resultMap(result, _frameSize);

// perform the blend using all manner of insane optimization voodoo under the covers
resultMap = (sourceMap + blendMap) / 2;
3. Use GPGPU
Finally, I will provide a direct answer to your question with an easy way to leverage the GPU without having to know much about GPU programming. The simplest thing to do is try the Thrust library. You will have to rewrite your algorithms as STL style algorithms, but that's pretty easy in your case.
// functor for blending
struct blend_functor
{
    template <typename Tuple>
    __host__ __device__
    void operator()(Tuple t)
    {
        // C[i] = (A[i] + B[i])/2;
        thrust::get<2>(t) = (thrust::get<0>(t) + thrust::get<1>(t)) / 2;
    }
};

// initialize your data and result buffer
byte *source = ...
byte *blend = ...
byte *result = NULL;

// copy the data to vectors on the GPU
thrust::device_vector<byte> A(source, source + _frameSize);
thrust::device_vector<byte> B(blend, blend + _frameSize);
// allocate the result vector on the GPU
thrust::device_vector<byte> C(_frameSize);

// process the data on the GPU device
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(
                     A.begin(), B.begin(), C.begin())),
                 thrust::make_zip_iterator(thrust::make_tuple(
                     A.end(), B.end(), C.end())),
                 blend_functor());

// copy the data back to main RAM
thrust::host_vector<byte> resultVec = C;
result = resultVec.data();
A really neat thing about thrust is that once you have written the algorithms in a generic way, it can automagically use different back ends for doing the computation. CUDA is the default back end, but you can also configure it at compile time to use OpenMP or TBB (Intel threading library).
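For instance, the back end is typically selected with a compile-time macro; a minimal sketch (the exact flags depend on your toolchain, and the macro is more commonly passed as -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP on the compile line):

// Define before any Thrust include to switch device_vector/for_each to the OpenMP back end.
// THRUST_DEVICE_SYSTEM_TBB and THRUST_DEVICE_SYSTEM_CUDA (the default) work the same way.
#define THRUST_DEVICE_SYSTEM THRUST_DEVICE_SYSTEM_OMP

#include <thrust/device_vector.h>
#include <thrust/for_each.h>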