Will iterating multiple images with CUDA increase performance? - c++

The code below is oversimplified for the purpose of this question.
Before I set up a CUDA environment and make any changes to my code, I wanted to get input on whether the code below would be much faster on a GPU.
The code iterates through a set of images and copies each image pixel value to dst only if the corresponding mask value is non-zero. There can be up to 10 images, each around 2K by 2K pixels.
Using #pragma omp does increase performance. So the question is: will performance increase significantly if I run this code on a GPU (assuming a decent graphics card such as a GTX 1050), with each thread handling a separate image?
for (int i = 0; i < images.size(); ++i)
{
    for (int y = 0; y < images[i].height; ++y)
    {
        for (int x = 0; x < images[i].width; ++x)
        {
            bool maskVal = masks[i][y][x];
            if (maskVal > 0)
            {
                dst[i][y][x] = images[i].data(x, y);
            }
        }
    }
}

In this case I would guess not. That piece of code would probably execute faster if the images and destination were already in the GPU's memory. However, if you plan to take the images and masks from main memory, copy them over the PCIe bus, execute the code on the GPU and then transfer the result back to the CPU, you will be much better off just running this code on the CPU. On the other hand, if you intend to do further parallel processing on the dst images, you might as well transfer them to the GPU and do this step there, since you will have to pay that transfer penalty anyway. The reason OpenMP is faster is that it uses CPU threads which share the same memory, so no extra copying is required.
An example of something that runs well on a GPU would be an image convolution or a Fourier transform, since these tasks are much heavier than what you are doing, so the overhead of the memory transfer matters less.
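If the images, masks and destination did already live in GPU memory, a minimal sketch of the masked copy as a CUDA kernel could look like the following. It assumes the data has been flattened into contiguous device buffers (img_d, mask_d and dst_d are made-up names) and launches one thread per pixel across all images, which maps onto the GPU much better than one thread per image:

// Minimal sketch, assuming flattened contiguous device buffers.
// One thread per pixel; img_d, mask_d and dst_d are illustrative names.
__global__ void maskedCopy(const unsigned char *img_d,
                           const unsigned char *mask_d,
                           unsigned char *dst_d,
                           int totalPixels)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < totalPixels && mask_d[idx] != 0)
        dst_d[idx] = img_d[idx];
}

// launch with one thread per pixel across all images:
// int totalPixels = numImages * width * height;
// maskedCopy<<<(totalPixels + 255) / 256, 256>>>(img_d, mask_d, dst_d, totalPixels);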

Related

Reading and storing large matrix file for GPU

Goal: Storing a large matrix in memory (Radon matrix), and transferring it into GPU memory for massively parallel operations.
Problem: Horrible reading time, and potentially sub-optimal use of space (but non-limiting for the program's usage)
I have the possibility of doing this in either C or C++.
The files I'm receiving are formatted as follows:
0.70316,0.71267,0.72221,0.73177,0.74135,0.75094,0.76053,0.77011,0.77967,0.7892,0.79868,0.80811,0.81747
and this goes on for at least 50MB.
My naïve implementation:
float **Radon;
Radon = (float **)malloc(HeightxNproj * sizeof(float *));
for (int i = 0; i < HeightxNproj; i++)
    Radon[i] = (float *)malloc(WidthSquared * sizeof(float));

FILE *radonFile;
radonFile = fopen("radon.txt", "r");
if (radonFile == NULL)
{
    printf("Radon file opening failed.");
    return -1;
}

for (int i = 0; i < HeightxNproj; i++)
{
    for (int j = 0; j < WidthSquared; j++)
    {
        fscanf(radonFile, "%f,", &Radon[i][j]);
    }
}
fclose(radonFile);
printf("Radon loaded.");
I'm programming for Windows. I've read a bit about file memory mapping, but I don't know whether this method, which doesn't actually store the matrix in memory, is compatible with GPGPU programming. I'm using CUDA, and I'll have to pass this matrix to GPU memory for parallel operations.
This file-reading method performs terribly: it takes roughly a minute to read and parse the 50 MB file. Is there a way to shorten the reading and parsing time? The matrix is also sparse; are there common ways to deal with such a matrix?
The more separate accesses you make to a file, the more performance you lose. The first step is to estimate the amount of data you need to read from the file and read it in one go; that will improve performance by a huge amount. You can also use memory-mapped files.
and this goes on for at least 50MB.
This is not that much.
The files I'm receiving are formatted as follows:
0.70316,0.71267,0.72221,0.73177,0.74135,0.75094,0.76053,0.77011,0.77967,0.7892,0.79868,0.80811,0.81747
Save it in binary to save about half of the memory (maybe even more). This will also increase reading speed.
Read the whole file at one time.
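As a rough sketch of both points (using the HeightxNproj and WidthSquared counts from the question, a made-up "radon.bin" file name, and with error handling omitted): convert the text file to a raw binary dump once, then on every later run load it with a single fread into one contiguous block. A single contiguous allocation also lets you transfer the whole matrix to the GPU with one cudaMemcpy, which the array-of-pointers layout does not.

// Sketch: one-time conversion to binary, then a single bulk read on later runs.
size_t count = (size_t)HeightxNproj * WidthSquared;
float *Radon = (float *)malloc(count * sizeof(float)); // one contiguous block

FILE *bin = fopen("radon.bin", "rb");
if (bin != NULL)
{
    // fast path: the binary dump already exists, read it in one call
    fread(Radon, sizeof(float), count, bin);
    fclose(bin);
}
else
{
    // slow path: parse the text file once, then write the binary dump for next time
    FILE *txt = fopen("radon.txt", "r");
    for (size_t i = 0; i < count; i++)
        fscanf(txt, "%f,", &Radon[i]);
    fclose(txt);

    bin = fopen("radon.bin", "wb");
    fwrite(Radon, sizeof(float), count, bin);
    fclose(bin);
}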
An example will show how naive and slow this approach is:
I was once implementing an algorithm that read an .obj 3D model. The model was about 10 MB and it took around 1-2 minutes to load. That was very strange, because Blender could load it almost immediately, maybe in 1 or 2 seconds. Mapping the whole file to memory and pre-allocating buffers allowed me to load the file in less than 5 seconds.
Note:
I can do this in either C or C++, both are ok.
Don't ever mix C with C++ when it comes to memory management, unless you are sure what you are doing. C++ exceptions can cause huge memory leaks if you don't protect C dynamically allocated memory using RAII.
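If you go the C++ route, a minimal sketch of the same storage with RAII is simply a std::vector, which frees itself even if an exception propagates and keeps the data contiguous for the later GPU copy:

#include <vector>

// the vector owns the memory; no free/delete needed, exception-safe
std::vector<float> Radon((size_t)HeightxNproj * WidthSquared);
// index as Radon[i * WidthSquared + j]; pass Radon.data() to cudaMemcpy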

How to avoid constant memory copying in OpenCL

I wrote a C++ application which simulates simple heat flow. It uses OpenCL for the computation.
The OpenCL kernel takes a two-dimensional (n x n) array of temperature values and its size (n). It returns a new array with the temperatures after each cycle:
pseudocode:
int t_id = get_global_id(0);
if (t_id < n * n)
{
    m_new[t_id / n][t_id % n] = average of its and its neighbors' (top, bottom, left, right) temperatures
}
As you can see, every thread computes a single cell of the matrix. When the host application needs to perform X computation cycles, it looks like this:
For 1 ... X
Copy memory to OpenCL device
Call kernel
Copy memory back
I would like to rewrite the code to perform all X cycles without constantly copying memory to/from the OpenCL device:
Copy memory to OpenCL device
Call kernel X times OR call kernel one time and make it compute X cycles.
Copy memory back
I know that each thread in the kernel should wait until all the other threads have finished their work, and only after that should m[][] and m_new[][] be swapped. I have no idea how to implement either of those two things.
Or maybe there is another way to do this optimally?
Copy memory to OpenCL device
Call kernel X times
Copy memory back
This works. Make sure the kernel call is not blocking (so 1-2 ms per cycle is saved) and that the buffers don't have host-accessible properties such as USE_HOST_PTR or ALLOC_HOST_PTR.
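A minimal host-side sketch of that loop (assuming float temperatures, an existing queue, a built kernel taking src, dst and n arguments, two device buffers bufA/bufB playing the roles of m and m_new, and host_result for the final read-back; error checking omitted) just swaps the kernel arguments each cycle and reads back once at the end:

// Sketch: ping-pong between two device buffers, copy back only once at the end.
size_t global = (size_t)n * n;
clSetKernelArg(kernel, 2, sizeof(int), &n);   // size argument stays the same every cycle
for (int cycle = 0; cycle < X; ++cycle)
{
    // even cycles read bufA and write bufB, odd cycles the other way round
    cl_mem src = (cycle % 2 == 0) ? bufA : bufB;
    cl_mem dst = (cycle % 2 == 0) ? bufB : bufA;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &src);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dst);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
}
// single blocking read at the very end, from whichever buffer was written last
cl_mem last = (X % 2 == 0) ? bufA : bufB;
clEnqueueReadBuffer(queue, last, CL_TRUE, 0, global * sizeof(float),
                    host_result, 0, NULL, NULL);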
If calling the kernel X times doesn't give satisfactory performance, you can try using a single workgroup (such as only 256 threads) that loops X times, with a barrier() at the end of each cycle so all 256 threads synchronize before starting the next cycle. This way you can compute M different heat-flow problems at the same time, where M is the number of compute units (or workgroups); if this is a server, it can serve that many computations.
Global synchronization is not possible because by the time the last threads are launched, the first threads are already gone. The GPU runs (number of compute units) * (number of threads per workgroup) * (number of wavefronts per workgroup) threads concurrently. For example, an R7-240 GPU with 5 compute units and a local range of 256 can run maybe 5 * 256 * 20 = 25k threads at a time.
Then, for further performance, you can apply local-memory optimizations.

cudaMemsetAsync strange behavior

I observe a strange behavior when overlapping data transfer and kernel execution in CUDA.
When calling cudaMemcpyAsync after cudaMemsetAsync, although the cudaMemsetAsync does overlap with the compute kernel, the cudaMemcpyAsync doesn't.
The compute kernel finishes and only then is the cudaMemcpyAsync executed.
When I comment out the cudaMemsetAsync, the overlap happens correctly.
Part of the code is presented below with some changes.
Code:
for (d = 0; d < TOTAL; ++d)
{
    gpuErrchk(cudaMemsetAsync(data_d, 0, bytes, stream1));
    for (j = 0; j < M; ++j)
    {
        gpuErrchk(cudaMemcpyAsync(&data_d[index1], &data_h[index2], bytes, H2D, stream1));
    }
    gpuErrchk(cudaStreamSynchronize(stream1));
    cufftExecR2C(plan, data_d, data_fft_d);
    gpuErrchk(cudaStreamSynchronize(stream2));
    kernel<<<dimGrid, dimBlock, 0, stream3>>>(result_d, data_fft_d, size);
}
I use an NVIDIA GTX Titan GPU, and the compute and memory operations are performed in different streams. Moreover, cudaMemsetAsync and cudaMemcpyAsync operate on the same device buffer.
Some of CUDA's memcpy functions are implemented with kernels (such as device->device memcpy), but ALL of CUDA's memset functions are implemented internally as kernels.
Assuming the cufftExecR2C call is supposed to be done in a different stream, you can bet that the kernel generated by the FFT plan was designed to fully occupy the GPU.
So you are likely hitting the same limitation in kernel concurrency that you would if you were trying to invoke a kernel in another stream. Kernels must occupy a limited amount of the GPU in order to run concurrently, but most CUDA kernels are not designed to accommodate that use case.
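For reference, the way a cuFFT plan gets tied to a particular stream is cufftSetStream; with the names from the question that would be something like:

// associate the plan with stream2 so cufftExecR2C is issued into that stream
cufftSetStream(plan, stream2);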

Cuda unified memory between gpu and host

I'm writing a CUDA-based program that needs to periodically transfer a set of items from the GPU to host memory. In order to keep the process asynchronous, I was hoping to use CUDA's UMA to have a memory buffer and a flag in host memory (so both the GPU and the CPU can access it). The GPU would make sure the flag is clear, add its items to the buffer, and set the flag. The CPU would wait for the flag to be set, copy things out of the buffer, and clear the flag. As far as I can see, this doesn't produce any race condition because it forces the GPU and CPU to take turns, always reading and writing the flag opposite each other.
So far I haven't been able to get this to work because there does seem to be some sort of race condition. I came up with a simpler example that has a similar issue:
#include <stdio.h>

__global__
void uva_counting_test(int n, int *h_i);

int main() {
    int *h_i;
    int n;

    cudaMallocHost(&h_i, sizeof(int));
    *h_i = 0;
    n = 2;

    uva_counting_test<<<1, 1>>>(n, h_i);

    //even numbers
    for (int i = 1; i <= n; ++i) {
        //wait for a change to odd from gpu
        while (*h_i == (2*(i - 1)));

        printf("host h_i: %d\n", *h_i);
        *h_i = 2*i;
    }

    return 0;
}

__global__
void uva_counting_test(int n, int *h_i) {
    //odd numbers
    for (int i = 0; i < n; ++i) {
        //wait for a change to even from host
        while (*h_i == (2*(i - 1) + 1));

        *h_i = 2*i + 1;
    }
}
For me, this case always hangs after the first print statement from the CPU (host h_i: 1). The really unusual thing (which may be a clue) is that I can get it to work in cuda-gdb. If I run it in cuda-gdb, it hangs as before. If I press Ctrl+C, it brings me to the while() loop line in the kernel. From there, surprisingly, I can tell it to continue and it will finish. For n > 2, it freezes on the while() loop in the kernel again after each iteration, but I can keep pushing it forward with Ctrl+C and continue.
If there's a better way to accomplish what I'm trying to do, that would also be helpful.
You are describing a producer-consumer model, where the GPU is producing some data and from time-to-time the CPU will consume that data.
The simplest way to implement this is to have the CPU be the master. The CPU launches a kernel on the GPU; when it is ready to consume data (i.e. the while loop in your example) it synchronises with the GPU, copies the data back from the GPU, launches the kernel again to generate more data, and does whatever it has to do with the data it copied. This allows you to have the GPU filling a fixed-size buffer while the CPU is processing the previous batch (since there are two copies, one on the GPU and one on the CPU).
That can be improved upon by double-buffering the data, meaning that you can keep the GPU busy producing data 100% of the time by ping-ponging between buffers while you copy the other one to the CPU. That assumes the copy-back is faster than the production, but if not then you will saturate the copy bandwidth, which is also good.
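A rough sketch of that CPU-master, double-buffered pattern in CUDA (produce(), process(), BUF_BYTES, grid/block and NUM_BATCHES are all placeholder names, and error checking is omitted):

// Sketch: the GPU fills one buffer while the previous one is copied back and consumed.
cudaStream_t compute, copy;
cudaStreamCreate(&compute);
cudaStreamCreate(&copy);

float *d_buf[2], *h_buf[2];
for (int b = 0; b < 2; ++b) {
    cudaMalloc(&d_buf[b], BUF_BYTES);
    cudaMallocHost(&h_buf[b], BUF_BYTES);   // pinned memory so the async copy really overlaps
}

int cur = 0;
produce<<<grid, block, 0, compute>>>(d_buf[cur]);            // fill the first buffer
for (int iter = 1; iter < NUM_BATCHES; ++iter) {
    int next = 1 - cur;
    cudaStreamSynchronize(compute);                          // batch 'cur' is finished
    produce<<<grid, block, 0, compute>>>(d_buf[next]);       // GPU starts the next batch right away
    cudaMemcpyAsync(h_buf[cur], d_buf[cur], BUF_BYTES,
                    cudaMemcpyDeviceToHost, copy);           // copy the finished batch concurrently
    cudaStreamSynchronize(copy);
    process(h_buf[cur]);                                     // CPU consumes it
    cur = next;
}
cudaStreamSynchronize(compute);                              // last batch is still on the GPU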
Neither of those are what you actually described. What you asked for is to have the GPU master the data. I'd urge caution on that since you will need to manage your buffer size carefully and you will need to think carefully about the timings and communication issues. It's certainly possible to do something like that but before you explore that direction you should read up about memory fences, atomic operations, and volatile.
I'd try to add
__threadfence_system();
after
*h_i = 2*i + 1;
See here for details. Without it, it's entirely possible that the modification stays in the GPU cache forever. However, you'd better follow the other answer's advice: to extend this to multiple threads/blocks you have to deal with other "problems" to get a similar scheme to work reliably.
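Applied to the example kernel, that change (plus marking the pointer volatile so the compiler re-reads the flag on every loop iteration) could look like this sketch:

// Sketch: volatile forces a fresh read of the flag each iteration,
// __threadfence_system() makes the write visible to the host.
__global__
void uva_counting_test(int n, volatile int *h_i) {
    //odd numbers
    for (int i = 0; i < n; ++i) {
        //wait for a change to even from host
        while (*h_i == (2*(i - 1) + 1));

        *h_i = 2*i + 1;
        __threadfence_system();
    }
}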
As Tom suggested (+1), it is better to use double buffering. Streams help a lot with such a scheme, as you can find depicted here.

Blend two images using GPU

I need to blend thousands of pairs of images very fast.
My code currently does the following: _apply is a function pointer to a function like Blend. It is one of many functions we can pass, but it is not the only one. Any such function takes two values and outputs a third, and it is applied to each channel of each pixel. I would prefer a solution that is general to any such function rather than a specific solution for blending.
typedef byte (*Transform)(byte src1, byte src2);
Transform _apply;

for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}

byte Blend(byte src, byte blend)
{
    int resultPixel = (src + blend) / 2;
    return (byte)resultPixel;
}
I was doing this on the CPU but the performance is terrible. It is my understanding that doing this on the GPU would be very fast. My program needs to run on computers that will have either NVIDIA or Intel GPUs, so whatever solution I use needs to be vendor independent. If I use the GPU, it has to be OpenGL to be platform independent as well.
I think using a GLSL pixel shader would help, but I am not familiar with pixel shaders or how to apply them to 2D objects (like my images).
Is that a reasonable solution? If so, how do I do this in 2D?
If there is a library that already does this, that would also be great to know.
EDIT: I am receiving the image pairs from different sources. One always comes from a 3D graphics component in OpenGL (so it is originally on the GPU). The other one comes from system memory, either from a socket (in a compressed video stream) or from a memory-mapped file. The "sink" of the resulting image is the screen. I am expected to show the images on the screen, so going to the GPU is an option, or using something like SDL to display them.
The blend function that is going to be executed the most is this one
byte Patch(byte delta, byte lo)
{
    int resultPixel = (2 * (delta - 127)) + lo;
    if (resultPixel > 255)
        resultPixel = 255;
    if (resultPixel < 0)
        resultPixel = 0;
    return (byte)resultPixel;
}
EDIT 2: The image coming from GPU land arrives in this fashion, from FBO to PBO to system memory:
glBindFramebuffer(GL_FRAMEBUFFER,fbo);
glReadBuffer( GL_COLOR_ATTACHMENT0 );
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0,0,width,height,GL_BGR,GL_UNSIGNED_BYTE,0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
void* mappedRegion = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
It seems it is probably better to just keep everything in GPU memory. The other bitmap can come from system memory. We may eventually get it from a video decoder in GPU memory as well.
EDIT 3: One of my images will come from D3D while the other one comes from OpenGL. It seems that something like Thrust or OpenCL is the best option.
From the looks of your Blend function, this is an entirely memory-bound operation. The CPU caches can likely only hold a very small fraction of the thousands of images you have, meaning most of your time is spent waiting for RAM to fulfill load/store requests, and the CPU will idle a lot.
You will NOT get any speedup by copying your images from RAM to the GPU, having the GPU's arithmetic units idle while they wait for GPU RAM to feed them data, waiting for GPU RAM again to write the results, and then copying it all back to main RAM. Using the GPU for this could actually slow things down substantially.
But I could be wrong and you might not be saturating your memory bus already. You will have to try it on your system and profile it. Here are some simple things you can try to optimize.
1. Multi-thread
I would focus on optimizing the algorithm directly on the CPU. The simplest thing is to go multi-threaded, which can be as simple as enabling OpenMP in your compiler and updating your for loop:
#include <omp.h> // add this along with enabling OpenMP support in your compiler
...
#pragma omp parallel for // <--- compiler magic happens here
for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}
If your memory bandwidth is not saturated, this will likely speed up the blending by however many cores your system has.
2. Micro-optimizations
Another thing you can try is to implement your Blend using SIMD instructions, which most CPUs have nowadays. I can't help you with the details without knowing which CPU you are targeting.
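As an illustration only, on an x86-64 target (where SSE2 is always available) a sketch of the simple averaging blend with intrinsics could look like the following, using the question's byte type. Note that _mm_avg_epu8 rounds up, while (src + blend)/2 truncates, so results can differ by one:

#include <emmintrin.h> // SSE2 intrinsics

// Sketch: averages 16 byte pairs per instruction.
// Assumes _frameSize is a multiple of 16; otherwise a scalar tail loop is needed.
void BlendSSE2(const byte *src, const byte *blend, byte *result, int frameSize)
{
    for (int i = 0; i < frameSize; i += 16)
    {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(blend + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(result + i), _mm_avg_epu8(a, b));
    }
}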
You can also try unrolling your for loop to mitigate some of the loop overhead.
One easy way to achieve both of these is to leverage the Eigen matrix library by wrapping your data in its data structures.
// initialize your data and result buffer
byte *source = ...
byte *blend = ...
byte *result = ...

// tell Eigen where your data/buffers are, and to treat them as dynamic vectors of bytes
// this is a cheap shallow copy
Map<Matrix<byte, Dynamic, 1> > sourceMap(source, _frameSize);
Map<Matrix<byte, Dynamic, 1> > blendMap(blend, _frameSize);
Map<Matrix<byte, Dynamic, 1> > resultMap(result, _frameSize);

// perform the blend using all manner of insane optimization voodoo under the covers
resultMap = (sourceMap + blendMap) / 2;
3. Use GPGPU
Finally, I will provide a direct answer to your question with an easy way to leverage the GPU without having to know much about GPU programming. The simplest thing to do is to try the Thrust library. You will have to rewrite your algorithms as STL-style algorithms, but that's pretty easy in your case.
// functor for blending
struct blend_functor
{
    template <typename Tuple>
    __host__ __device__
    void operator()(Tuple t)
    {
        // C[i] = (A[i] + B[i])/2;
        thrust::get<2>(t) = (thrust::get<0>(t) + thrust::get<1>(t)) / 2;
    }
};

// initialize your data and result buffer
byte *source = ...
byte *blend = ...
byte *result = NULL;

// copy the data to the vectors on the GPU
thrust::device_vector<byte> A(source, source + _frameSize);
thrust::device_vector<byte> B(blend, blend + _frameSize);
// allocate result vector on the GPU
thrust::device_vector<byte> C(_frameSize);

// process the data on the GPU device
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(
                     A.begin(), B.begin(), C.begin())),
                 thrust::make_zip_iterator(thrust::make_tuple(
                     A.end(), B.end(), C.end())),
                 blend_functor());

// copy the data back to main RAM
thrust::host_vector<byte> resultVec = C;
result = resultVec.data();
A really neat thing about Thrust is that once you have written the algorithms in a generic way, it can automagically use different back ends for the computation. CUDA is the default back end, but you can also configure it at compile time to use OpenMP or TBB (Intel's threading library).
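Since Patch is the blend you expect to run most often, a sketch of it in the same zip-iterator pattern (just swap it in for blend_functor in the for_each call above) could be:

// Sketch: the question's Patch function expressed as a Thrust functor.
struct patch_functor
{
    template <typename Tuple>
    __host__ __device__
    void operator()(Tuple t)
    {
        // C[i] = clamp(2*(A[i] - 127) + B[i], 0, 255)
        int resultPixel = 2 * (thrust::get<0>(t) - 127) + thrust::get<1>(t);
        if (resultPixel > 255) resultPixel = 255;
        if (resultPixel < 0)   resultPixel = 0;
        thrust::get<2>(t) = (byte)resultPixel;
    }
};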