Improving asynchronous execution in CUDA - C++

I am currently writing a program that performs large simulations on the GPU using the CUDA API. To speed things up, I tried to run my kernels concurrently in several streams and then asynchronously copy the results back into host memory. The code looks roughly like this:
#define NSTREAMS  8
#define BLOCKDIMX 16
#define BLOCKDIMY 16

// `streams` is an array of NSTREAMS cudaStream_t objects created elsewhere.
void domainUpdate(float* domain_cpu,       // pointer to domain on host
                  float* domain_gpu,       // pointer to domain on device
                  const unsigned int dimX,
                  const unsigned int dimY,
                  const unsigned int dimZ)
{
    dim3 blocks((dimX + BLOCKDIMX - 1) / BLOCKDIMX, (dimY + BLOCKDIMY - 1) / BLOCKDIMY);
    dim3 threads(BLOCKDIMX, BLOCKDIMY);

    for (unsigned int ii = 0; ii < NSTREAMS; ++ii) {
        updateDomain3D<<<blocks, threads, 0, streams[ii]>>>(domain_gpu,
            dimX, 0, dimX - 1,                                           // dimX, minX, maxX
            dimY, 0, dimY - 1,                                           // dimY, minY, maxY
            dimZ, dimZ * ii / NSTREAMS, dimZ * (ii + 1) / NSTREAMS - 1); // dimZ, minZ, maxZ

        unsigned int offset = dimX * dimY * dimZ * ii / NSTREAMS;
        cudaMemcpyAsync(domain_cpu + offset,
                        domain_gpu + offset,
                        sizeof(float) * dimX * dimY * dimZ / NSTREAMS,
                        cudaMemcpyDeviceToHost, streams[ii]);
    }

    cudaDeviceSynchronize();
}
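For completeness, here is a minimal, hedged sketch of the setup this snippet assumes but does not show: the streams must be created beforehand, and domain_cpu should be allocated as pinned (page-locked) host memory, since cudaMemcpyAsync only overlaps with kernel execution when the host buffer is pinned.

// Hypothetical setup for the snippet above (names mirror the function parameters).
cudaStream_t streams[NSTREAMS];
for (unsigned int ii = 0; ii < NSTREAMS; ++ii)
    cudaStreamCreate(&streams[ii]);

// The host buffer must be pinned for cudaMemcpyAsync to overlap with kernels.
float* domain_cpu = NULL;
cudaMallocHost((void**)&domain_cpu, sizeof(float) * dimX * dimY * dimZ);

// ... call domainUpdate(domain_cpu, domain_gpu, dimX, dimY, dimZ) as needed ...

// Cleanup.
cudaFreeHost(domain_cpu);
for (unsigned int ii = 0; ii < NSTREAMS; ++ii)
    cudaStreamDestroy(streams[ii]);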
All in all, it is just a simple for-loop over all streams (8 in this case), dividing the work among them. This actually is a good deal faster (up to a 30% performance gain), although less than I had hoped. I analysed a typical cycle in NVIDIA's Compute Visual Profiler, and the execution looks like this:
As can be seen in the picture, the kernels do overlap, although never more than two kernels are running at the same time. I tried the same thing for different numbers of streams and different sizes of the simulation domain, but this is always the case.
So my question is: is there a way to encourage/force the GPU scheduler to run more than two things at the same time, or is this a limitation of the GPU device that cannot be worked around in code?
My system specifications are: 64-bit Windows 7, and a GeForce GTX 670 graphics card (that's Kepler architecture, compute capability 3.0).

Kernels only overlap if the GPU has resources left over to run a second kernel. Once the GPU is fully loaded, there is nothing to gain from running more kernels in parallel, so the driver does not do that.
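If you want to check what a given device can do in this respect, a small hedged sketch: the CUDA runtime exposes the relevant device properties, so you can query them directly (this is standalone illustration code, not part of the simulation above).

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 1 if the device can execute multiple kernels concurrently
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    // number of copy engines available for overlapping transfers with kernels
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    return 0;
}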

Related

CUDA periodic execution time

I just started learning CUDA and I am having trouble interpreting my experiment results. I wanted to compare the CPU vs. the GPU in a simple program that adds two vectors together. The code is as follows:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>
#include <helper_timer.h>   // sdk* timer helpers from the CUDA samples

__global__ void add(int *a, int *b, int *c, long long n) {
    long long tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}

void add_cpu(int* a, int* b, int* c, long long n) {
    for (long long i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

void check_results(int* gpu, int* cpu, long long n) {
    for (long long i = 0; i < n; i++) {
        if (gpu[i] != cpu[i]) {
            printf("Different results!\n");
            return;
        }
    }
}

int main(int argc, char* argv[]) {
    long long n = atoll(argv[1]);
    int num_of_blocks = atoi(argv[2]);
    int num_of_threads = atoi(argv[3]);

    int* a = new int[n];
    int* b = new int[n];
    int* c = new int[n];
    int* c_cpu = new int[n];

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **) &dev_a, n * sizeof(int));
    cudaMalloc((void **) &dev_b, n * sizeof(int));
    cudaMalloc((void **) &dev_c, n * sizeof(int));

    for (long long i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    cudaMemcpy(dev_a, a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);

    StopWatchInterface *timer = NULL;
    sdkCreateTimer(&timer);
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);

    add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
    cudaDeviceSynchronize();

    sdkStopTimer(&timer);
    float time = sdkGetTimerValue(&timer);
    sdkDeleteTimer(&timer);

    cudaMemcpy(c, dev_c, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    clock_t start = clock();
    add_cpu(a, b, c_cpu, n);
    clock_t end = clock();

    check_results(c, c_cpu, n);
    printf("%f %f\n", (double)(end - start) * 1000 / CLOCKS_PER_SEC, time);
    return 0;
}
I ran this code in a loop with a bash script:
for i in {1..2560}
do
    n="$((1024 * i))"
    out=`./vectors $n $i 1024`
    echo "$i $out" >> "./vectors.txt"
done
Here 2560 is the maximum number of blocks that my GPU supports, and 1024 is the maximum number of threads per block. So I ran it at the maximum block size, from one block up to the maximum problem size my GPU can handle, in steps of one block (1024 ints in the vector).
Here is my GPU info:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 2070 SUPER"
CUDA Driver Version / Runtime Version 11.3 / 11.0
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 8192 MBytes (8589934592 bytes)
(040) Multiprocessors, (064) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1785 MHz (1.78 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 65536 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS
After running the experiment I gathered the results and plotted them:
So what bothers me is this 256-block-wide period in the GPU execution time. I have no clue why it happens. Why is executing 512 blocks much slower than executing 513 blocks of threads?
I also checked this with a constant number of blocks (2560) as well as with different block sizes, and it always gives this period of 256 * 1024 in vector size (so for a block size of 512 it is every 512 blocks, not every 256 blocks). So maybe it is something to do with memory, but I can't figure out what.
I would appreciate any ideas on why this is happening.
This is by no means a complete or precise answer. However, I believe the periodic pattern you are observing is at least partly due to a one-time or first-time kernel launch overhead. Good benchmarking practice is usually to do something other than what you are doing: for example, run the kernel multiple times and take an average, or use some other kind of statistical measurement.
When I run your code using your script on a GTX 960 GPU, I get the following graph (only plotting the GPU data, vertical axis is in milliseconds):
When I modify your code as follows:
cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);

// next two lines added:
add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();

StopWatchInterface *timer = NULL;
sdkCreateTimer(&timer);
sdkResetTimer(&timer);
sdkStartTimer(&timer);

add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();
Doing a "warm-up" run first, then timing the second run, I witness data like this:
So the data without the warm-up shows a periodicity; after the warm-up, the periodicity disappears. I conclude that the periodicity is due to some kind of one-time or first-time behavior. Typical things in this category are caching effects and CUDA "lazy" initialization effects (for example, the time taken to JIT-compile the GPU code, which is certainly happening in your case, or the time to load the GPU code into GPU memory). I won't be able to go further with an explanation of exactly which first-time effect gives rise to the periodicity.
Another observation is that while my data shows the expected "average slope" in each graph, indicating that the kernel duration associated with 2560 blocks is approximately 5 times the kernel duration associated with 512 blocks, I don't see that kind of trend in your data. It ought to be there, however. Your GPU will "saturate" at about 40 blocks; thereafter, the average kernel duration should increase approximately linearly, such that the kernel duration for 2560 blocks is 4-5x that for 512 blocks. I can't explain your data in this respect at all. I suspect a graphing or data-processing error, or else a characteristic of your environment (e.g. a GPU shared with other users, a broken CUDA install, etc.) that is not present in my environment and which I'm unable to guess at.
Finally, my conclusion is that GPU "expected" behavior is more evident in the presence of good benchmarking techniques.
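As a hedged illustration of the "run it several times and take an average" advice above, the timed region in the posted code could look roughly like this (REPS is an illustrative count I am introducing, not something from the original program):

// Warm-up launch absorbs JIT/lazy-initialization costs, then several runs are averaged.
const int REPS = 10;                                   // illustrative repetition count
add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();

sdkResetTimer(&timer);
sdkStartTimer(&timer);
for (int r = 0; r < REPS; ++r) {
    add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
}
cudaDeviceSynchronize();
sdkStopTimer(&timer);
float avg_ms = sdkGetTimerValue(&timer) / REPS;        // average kernel time in milliseconds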

a specific OpenCL kernel performs differently on mobile and PC

I was trying to run an OpenCL kernel on both an Adreno 630 and my laptop. It turns out that the kernel runs perfectly on the mobile device but crashes my laptop every single time. I am still trying to figure out the reason myself. Here's my kernel; I hope you can help me with it, thanks.
// `sampler_nearest` is assumed to be declared elsewhere as a sampler_t.
__kernel void gen_mapxy( __read_only image2d_t _disp, const float offsetX, __write_only image2d_t _mapxy )
{
    const int y = get_global_id(0);
    const int local_y = get_local_id(0);
    __local short temp[24][1080];
    const int imageWidth = get_image_width(_disp);

    for(int x = 0; x < imageWidth; ++x)
        temp[local_y][x] = 0;

    for(int x = imageWidth - 1; x >= 0; --x){
        int tempDisp = read_imagei(_disp, sampler_nearest, (int2)(x, y)).x;
        int newPos = clamp((int)(x + offsetX * (tempDisp) / 255), 0, imageWidth - 1);
        temp[local_y][newPos] = tempDisp;
        write_imagef(_mapxy, (int2)(newPos, y), (float4)(x, y, 0, 0));
    }
}
You are using a big local array.
__local short temp[24][1080]
2 bytes * 24 * 1080 = 50.6 kB. Some desktop GPUs (and their notebook counterparts) have lower local memory limits. For example, a GTX 1060 reports CL_DEVICE_LOCAL_MEM_SIZE = 49152 bytes. The Adreno 630, on the other hand, is either silently ignoring the oversized array or genuinely supports larger local arrays, because on those chips local arrays may be emulated inside global memory (limited to hundreds of megabytes). If it does have fast on-chip local memory, then the "silently ignoring" explanation is more likely, or the local memory limit really was doubled compared with the previous generation of Adrenos.
Even when a GPU supports exactly that amount, using all of it limits thread-level parallelism on each compute unit, which generally reduces the potential performance gain.
If the previous generation of Adreno GPUs is any guide,
https://compubench.com/device.jsp?benchmark=compu15m&os=Android&api=cs&D=Samsung+Galaxy+S7+%28SM-G930x%29&testgroup=info
this page says
CL_DEVICE_LOCAL_MEM_SIZE
32768
CL_DEVICE_LOCAL_MEM_TYPE
CL_LOCAL
So it is fast, but it is only 32 kB; either the device is silently swallowing the error, or you've missed the necessary error-checking logic on your side, or both.
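To confirm which of these it is, here is a hedged host-side sketch in plain C/C++ against the OpenCL API (it assumes you already have a cl_device_id device, a cl_command_queue queue, a cl_kernel kernel, and globalSize/localSize set up; include <CL/cl.h> and <cstdio>):

// Query the device's local memory budget and compare it with what the kernel needs.
cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL);
printf("CL_DEVICE_LOCAL_MEM_SIZE = %llu bytes\n", (unsigned long long)localMemSize);

// The kernel above needs sizeof(short) * 24 * 1080 = 51840 bytes of __local memory.
// If that exceeds localMemSize, the launch should fail (often with CL_OUT_OF_RESOURCES),
// so always check the return code instead of ignoring it.
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);
if (err != CL_SUCCESS)
    printf("clEnqueueNDRangeKernel failed: %d\n", err);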

CUDA Optimization

I developed pincushion distortion correction using CUDA to support real-time processing - more than 40 fps for 3680*2456 image sequences.
It takes 130 ms with CUDA on an nVIDIA GeForce GT 610 (2GB DDR3),
but only 60 ms with the CPU and OpenMP on a Core i7 3.4GHz quad-core.
Please tell me what to do to speed it up.
Thanks.
Full source can be downloaded here.
https://drive.google.com/file/d/0B9SEJgsu0G6QX2FpMnRja0o5STA/view?usp=sharing
https://drive.google.com/file/d/0B9SEJgsu0G6QOGNPMmVQLWpSb2c/view?usp=sharing
The codes are as follows.
__global__
void undistort(int N, float k, int width, int height, int depth, int pitch, float R, float L, unsigned char* in_bits, unsigned char* out_bits)
{
    // Get the index of the array from GPU grid/block/thread index and dimension.
    int i, j;
    i = blockIdx.y * blockDim.y + threadIdx.y;
    j = blockIdx.x * blockDim.x + threadIdx.x;

    // If out of the array
    if (i >= height || j >= width)
    {
        return;
    }

    // Calculating the undistortion equation.
    // On the CPU we used fast approximations of atan and sqrt - they make it 2 times faster.
    // But on the GPU there is no need to use approximation functions, as they are fast there.
    int cx = width * 0.5;
    int cy = height * 0.5;
    int xt = j - cx;
    int yt = i - cy;

    float distance = sqrt((float)(xt*xt + yt*yt));
    float r = distance*k / R;

    float theta = 1;
    if (r == 0)
        theta = 1;
    else
        theta = atan(r)/r;
    theta = theta*L;

    float tx = theta*xt + cx;
    float ty = theta*yt + cy;

    // When we correct the frame, its size will be greater than the original,
    // so we should crop it.
    if (tx < 0)
        tx = 0;
    if (tx >= width)
        tx = width - 1;
    if (ty < 0)
        ty = 0;
    if (ty >= height)
        ty = height - 1;

    // Output the result (bilinear interpolation of the source pixel).
    int ux = (int)(tx);
    int uy = (int)(ty);
    tx = tx - ux;
    ty = ty - uy;

    unsigned char *p   = (unsigned char*)out_bits + i*pitch + j*depth;
    unsigned char *q00 = (unsigned char*)in_bits + uy*pitch + ux*depth;
    unsigned char *q01 = q00 + depth;
    unsigned char *q10 = q00 + pitch;
    unsigned char *q11 = q10 + depth;

    unsigned char newVal[4] = {0};
    for (int k = 0; k < depth; k++)
    {
        newVal[k] = (q00[k]*(1-tx)*(1-ty) + q01[k]*tx*(1-ty) + q10[k]*(1-tx)*ty + q11[k]*tx*ty);
        memcpy(p + k, &newVal[k], 1);
    }
}
void wideframe_correction(char* bits, int width, int height, int depth)
{
    // Find the device.
    // Initialize the nVIDIA Device.
    cudaSetDevice(0);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);

    // This works for Calculating GPU Time.
    cudaProfilerStart();

    // This works for Measuring Total Time
    long int dwTime = clock();

    // Setting Distortion Parameters
    // Note that Multiplying 0.5 works faster than divide into 2.
    int cx = (int)(width * 0.5);
    int cy = (int)(height * 0.5);
    float k = -0.73f;
    float R = sqrt((float)(cx*cx + cy*cy));

    // Set the Radius of the Result.
    float L = (float)(width < height ? width : height);
    L = L/2.0f;
    L = L/R;
    L = L*L*L*0.3333f;
    L = 1.0f/(1-L);

    // Create the GPU Memory Pointers.
    unsigned char* d_img_in = NULL;
    unsigned char* d_img_out = NULL;

    // Allocate the GPU Memory2D with pitch for fast performance.
    size_t pitch;
    cudaMallocPitch( (void**) &d_img_in, &pitch, width*depth, height );
    cudaMallocPitch( (void**) &d_img_out, &pitch, width*depth, height );
    _tprintf(_T("\nPitch : %d\n"), pitch);

    // Copy RAM data to VRAM.
    cudaMemcpy2D( d_img_in, pitch,
                  bits, width*depth, width*depth, height,
                  cudaMemcpyHostToDevice );
    cudaMemcpy2D( d_img_out, pitch,
                  bits, width*depth, width*depth, height,
                  cudaMemcpyHostToDevice );

    // Create Variables for Timing
    cudaEvent_t startEvent, stopEvent;
    cudaError_t err = cudaEventCreate(&startEvent, 0);
    assert( err == cudaSuccess );
    err = cudaEventCreate(&stopEvent, 0);
    assert( err == cudaSuccess );

    // Execution of the version using global memory
    float elapsedTime;
    cudaEventRecord(startEvent);

    // Process image
    dim3 dGrid(width / BLOCK_WIDTH + 1, height / BLOCK_HEIGHT + 1);
    dim3 dBlock(BLOCK_WIDTH, BLOCK_HEIGHT);
    undistort<<< dGrid, dBlock >>> (width*height, k, width, height, depth, pitch, R, L, d_img_in, d_img_out);
    cudaThreadSynchronize();

    cudaEventRecord(stopEvent);
    cudaEventSynchronize( stopEvent );

    // Estimate the GPU Time.
    cudaEventElapsedTime( &elapsedTime, startEvent, stopEvent);

    // Calculate the Total Time.
    dwTime = clock() - dwTime;

    // Save Image data from VRAM to RAM
    cudaMemcpy2D( bits, width*depth,
                  d_img_out, pitch, width*depth, height,
                  cudaMemcpyDeviceToHost );

    _tprintf(_T("GPU Processing Time(ms) : %d\n"), (int)elapsedTime);
    _tprintf(_T("VRAM Memory Read/Write Time(ms) : %d\n"), dwTime - (int)elapsedTime);
    _tprintf(_T("Total Time(ms) : %d\n"), dwTime );

    // Free GPU Memory
    cudaFree(d_img_in);
    cudaFree(d_img_out);
    cudaProfilerStop();
    cudaDeviceReset();
}
I haven't read the source code, but there are some things you can't get around.
Your GPU has nearly the same performance as your CPU. Adapt the following figures to your actual GPU/CPU models:
Specification | GPU         | CPU
-----------------------------------------
Bandwidth     | 14.4 GB/s   | 25.6 GB/s
GFLOPS        | 155 (FMA)   | 135
We can conclude that for memory-bound kernels your GPU will never be faster than your CPU.
GPU information found here:
http://www.nvidia.fr/object/geforce-gt-610-fr.html#pdpContent=2
CPU information found here: http://ark.intel.com/products/75123/Intel-Core-i7-4770K-Processor-8M-Cache-up-to-3_90-GHz?q=Intel%20Core%20i7%204770K
and here: http://www.ocaholic.ch/modules/smartsection/item.php?page=6&itemid=1005
One does not simply optimize code just by looking at the source. First of all, you should use the NVIDIA Visual Profiler https://developer.nvidia.com/nvidia-visual-profiler and see which part of your GPU code is taking the most time. You might wish to write a unit test first, however, just to be sure that only the investigated part of your project is measured.
Additionally, you can use Callgrind http://valgrind.org/docs/manual/cl-manual.html to test your CPU code performance.
In general, it is not very surprising that your GPU-"optimized" code is slower than the "unoptimized" one. Individual CUDA cores are usually several times slower than CPU cores, and you have to introduce a lot of parallelism to see a significant speed-up.
EDIT, response to your comment:
As a unit-testing framework I strongly recommend GoogleTest. Here you can learn how to use it. Apart from its obvious functionality (code testing), it allows you to run only specific methods from your class interfaces for performance analysis.
In general, the NVIDIA profiler is just a tool that runs your code and tells you how much time each of your kernels consumes. Please refer to its documentation.
By "a lot of parallelism" I meant: on your processor you can run 8 threads at 3.4 GHz, while your GPU has one SM (streaming multiprocessor) with an 810 MHz clock and, let's say, 1024 threads per SM (I do not have exact data, but you can run the deviceQuery NVIDIA sample to get the exact parameters). Therefore, if your GPU code can run only (3.4*8)/0.81 = 33 computations in parallel, you will achieve exactly nothing; the execution time of your CPU and GPU code will be the same (neglecting GPU memory copies, which are expensive). Conclusion: your GPU code should be able to compute at least ~40 operations in parallel to give any speed-up. On the other hand, suppose you are able to fully use your GPU's potential and keep all 1024 threads on your SM busy all the time. In that case your code will run only about (0.81*1024)/(8*3.4) = 30 times faster (approximately, again neglecting memory copies), which is impossible in most cases, because usually you are not able to parallelize your serial code with such efficiency.
Wish you good luck with your research!
Yes, put nvprof to good use; it's a great tool.
What I could see from your code:
1. Consider using linear (1D) thread blocks instead of flat (2D) blocks; it could save some integer operations.
2. Manual correction of image borders and/or thread indices leads to massive divergence and/or hurts coalescing. Consider using texture fetches and/or pre-padding the data.
3. A memcpy of a single value from inside the kernel is generally a bad idea (see the sketch below).
4. Try to minimize type conversions.
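As a hedged illustration of point 3, the per-byte memcpy in the inner loop of undistort could simply become a direct store; this is only a sketch of that one change (the loop counter is renamed to c so it no longer shadows the kernel parameter k), not a tuned rewrite:

// Inner loop of undistort with the single-byte memcpy replaced by a plain store.
for (int c = 0; c < depth; c++)
{
    unsigned char newVal = (unsigned char)(q00[c]*(1-tx)*(1-ty) + q01[c]*tx*(1-ty)
                                         + q10[c]*(1-tx)*ty     + q11[c]*tx*ty);
    p[c] = newVal;   // direct write instead of memcpy(p + c, &newVal, 1)
}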

It's slower to calculate integral image using CUDA than CPU code

I am implementing an integral image calculation module using CUDA to improve performance.
But it is slower than the CPU module.
Please let me know what I did wrong.
The CUDA kernels and host code follow.
There is also another problem:
in the kernel SumH, using texture memory is slower than global memory. imageTexture was defined as below.
texture<unsigned char, 1> imageTexture;
cudaBindTexture(0, imageTexture, pbImage);
// kernels to scan the image horizontally and vertically.
__global__ void SumH(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rVSpan, int nWidth)
{
    int nStartY, nEndY, nIdx;

    if (!threadIdx.x)
    {
        nStartY = 1;
    }
    else
        nStartY = (int)(threadIdx.x * rVSpan);
    nEndY = (int)((threadIdx.x + 1) * rVSpan);

    for (int i = nStartY; i < nEndY; i ++)
    {
        for (int j = 1; j < nWidth; j ++)
        {
            nIdx = i * nWidth + j;
            pnIntImage[nIdx] = pnIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i];
            pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i] * pbImage[nIdx - nWidth - i];
            //pnIntImage[nIdx] = pnIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i);
            //pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i) * tex1Dfetch(imageTexture, nIdx - nWidth - i);
        }
    }
}
__global__ void SumV(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rHSpan, int nHeight, int nWidth)
{
    int nStartX, nEndX, nIdx;

    if (!threadIdx.x)
    {
        nStartX = 1;
    }
    else
        nStartX = (int)(threadIdx.x * rHSpan);
    nEndX = (int)((threadIdx.x + 1) * rHSpan);

    for (int i = 1; i < nHeight; i ++)
    {
        for (int j = nStartX; j < nEndX; j ++)
        {
            nIdx = i * nWidth + j;
            pnIntImage[nIdx] = pnIntImage[nIdx - nWidth] + pnIntImage[nIdx];
            pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - nWidth] + pn64SqrIntImage[nIdx];
        }
    }
}
// host code
int nW = image_width;
int nH = image_height;
unsigned char* pbImage;
int* pnIntImage;
__int64* pn64SqrIntImage;

cudaMallocManaged(&pbImage, nH * nW);
// assign image gray values to pbImage
cudaMallocManaged(&pnIntImage, sizeof(int) * (nH + 1) * (nW + 1));
cudaMallocManaged(&pn64SqrIntImage, sizeof(__int64) * (nH + 1) * (nW + 1));

float rHSpan, rVSpan;
int nHThreadNum, nVThreadNum;

if (nW + 1 <= 1024)
{
    rHSpan = 1;
    nVThreadNum = nW + 1;
}
else
{
    rHSpan = (float)(nW + 1) / 1024;
    nVThreadNum = 1024;
}

if (nH + 1 <= 1024)
{
    rVSpan = 1;
    nHThreadNum = nH + 1;
}
else
{
    rVSpan = (float)(nH + 1) / 1024;
    nHThreadNum = 1024;
}

SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaDeviceSynchronize();

SumV<<<1, nVThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rHSpan, nH + 1, nW + 1);
cudaDeviceSynchronize();
Regarding the code that is currently in the question, there are two things I'd like to mention: launch parameters and timing methodology.
1) Launch parameters
When you launch a kernel there are two main arguments that specify the number of threads you are launching. These go between the <<< and >>> markers and are the number of blocks in the grid and the number of threads per block, as follows:
foo <<< numBlocks, numThreadsPerBlock >>> (args);
For a single kernel to be efficient on a current GPU, you can use the rule of thumb that numBlocks * numThreadsPerBlock should be at least 10,000, i.e. 10,000 pieces of work. This is a rule of thumb, so you may get good results with only 5,000 threads (it varies with the GPU: cheaper GPUs can get away with fewer threads), but this is the order of magnitude you need to be looking at as a minimum. You are running 1024 threads. This is almost certainly not enough (hint: the loops inside your kernel look like scan primitives, and these can be done in parallel).
Further to this there are a few other things to consider.
The number of blocks should be large in comparison to the number of SMs on your GPU. A Kepler K40 has 15 SMs, and to avoid a significant tail effect you'd probably want at least ~100 blocks on that GPU. Other GPUs have fewer SMs, but you haven't specified which you have, so I can't be more specific.
The number of threads per block should not be too small. You can only have so many blocks on each SM, so if your blocks are too small you will use the GPU suboptimally. Furthermore, on newer GPUs up to four warps can receive instructions on an SM simultaneously, so it is often a good idea to use block sizes that are multiples of 128. (A launch-sizing sketch follows below.)
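For concreteness, a minimal hedged sketch of sizing a launch from the problem size; the kernel, the 256-thread block size, and the dev_a/dev_b/dev_c buffers are illustrative, not taken from the question's code:

// Each thread handles multiple elements via a grid-stride loop, so the grid can be capped.
__global__ void add_strided(const int* a, const int* b, int* c, long long n) {
    for (long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
         i < n;
         i += (long long)gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}

// Host side: a block size that is a multiple of 128, and enough blocks to cover the data.
int threadsPerBlock = 256;
long long blocksNeeded = (n + threadsPerBlock - 1) / threadsPerBlock;
int numBlocks = (int)(blocksNeeded > 65535 ? 65535 : blocksNeeded);  // cap; the stride loop covers the rest
add_strided<<<numBlocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, n);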
2) Timing
I'm not going to go into much depth here, but make sure your timing is sane. GPU code tends to have a one-time initialisation delay; if this falls inside your timed region, you will see erroneously large runtimes for code designed to represent a much larger application. Similarly, data transfer between the CPU and GPU takes time. In a real application you may only do this once for thousands of kernel calls, but in a test application you may do it once per kernel launch.
If you want accurate timings, you must make your example more representative of the final code, or you must be sure that you are only timing the regions that will be repeated.
The only way to be sure is to profile the code, but in this case we can probably make a reasonable guess.
You're basically just doing a single scan through some data, and doing extremely minimal processing on each item.
Given how little processing you're doing on each item, the bottleneck when you process the data with the CPU is probably just reading the data from memory.
When you do the processing on the GPU, the data still needs to be read from memory and copied into the GPU's memory. That means we still have to read all the data from main memory, just like if the CPU did the processing. Worse, it all has to be written to the GPU's memory, causing a further slowdown. By the time the GPU even gets to start doing real processing, you've already used up more time than it would have taken the CPU to finish the job.
For CUDA to make sense, you generally need to be doing a lot more processing on each individual data item. In this case, the CPU is probably already nearly idle most of the time, waiting for data from memory. In such a case, the GPU is unlikely to be of much help unless the input data is already in the GPU's memory, so the GPU can do the processing without any extra copying.
When working with CUDA there are a few things you should keep in mind.
Copying from host memory to device memory is 'slow' - when you copy data from the host to the device, do as many calculations as possible (do all the work) before you copy the result back to the host.
On the device there are several kinds of memory - global, shared, and registers - which you can rank in speed roughly as global < shared < registers (registers are the fastest; note that what CUDA calls "local memory" actually resides in global memory).
Reading from consecutive memory locations is faster than random access. When working with an array of structures, you may want to transpose it into a structure of arrays (see the sketch after this list).
You can always consult the CUDA Visual Profiler to show you the bottleneck of your program.
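A hedged sketch of the array-of-structures vs. structure-of-arrays point above (the Pixel type and field names are invented for illustration): with the SoA layout, consecutive threads read consecutive addresses, which coalesces well.

// Array of structures (AoS): thread i reads pixels[i].r, so neighbouring threads
// touch addresses several floats apart - poor coalescing.
struct Pixel { float r, g, b; };
Pixel* pixels;

// Structure of arrays (SoA): thread i reads r[i], so neighbouring threads read
// neighbouring addresses - good coalescing.
struct Image {
    float* r;
    float* g;
    float* b;
};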
The above-mentioned GTX 750 has 512 CUDA cores (these are the same as the shader units, just driven in a different mode).
http://www.nvidia.de/object/geforce-gtx-750-de.html#pdpContent=2
The task of creating integral images can only partially be parallelized, because every value in the result array depends on a large set of its predecessors. Furthermore, it is only a tiny amount of math per memory transfer, so the ALU power matters little and the unavoidable memory transfers may well be the bottleneck. Such an accelerator can provide some speed-up, but not a thrilling one, because the task itself does not allow it.
If you were computing multiple variations of integral images on the same input data, you would be much more likely to see the "thrill", thanks to the much higher parallelism options and the higher amount of math ops. But that would be a different task.
As a wild guess from a Google search - others have already worked on this: https://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=11&cad=rja&uact=8&ved=0CD8QFjAKahUKEwjjnoabw8bIAhXFvhQKHbUpA1Y&url=http%3A%2F%2Fdspace.mit.edu%2Fopenaccess-disseminate%2F1721.1%2F71883&usg=AFQjCNHBbOEB_OHAzLZI9__lXO_7FPqdqA

Slow performance of CUDA kernel vs CPU version for Julia set

I am learning CUDA from the book "CUDA by Example". In chapter 4 there is a demo that generates a Julia fractal. The example demonstrates both CPU and GPU versions. I decided to add timing to see the execution speed for both cases, and to my great surprise found that the CPU version runs 3 times faster than the GPU version.
CPU Julia generation total time: 745 milliseconds.
GPU Julia generation total time: 2456 milliseconds.
So what is going on? It is clear, at least from the CUDA kernel code, that the execution is parallel, as it is distributed across a 1000x1000 grid of blocks, each of which calculates one pixel of the 1000x1000 final image.
Here is the source code of the implementation:
#define N 10
#define DIM 1000

typedef unsigned char byte;

struct cuComplex {
    float r;
    float i;
    __host__ __device__ cuComplex(float a, float b) : r(a), i(b) {}
    __host__ __device__ float magnitude2(void) {
        return r * r + i * i;
    }
    __host__ __device__ cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    __host__ __device__ cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
    }
};

__device__ int juliaGPU(int x, int y){
    const float scale = 1.3;
    float jx = scale * (float)(DIM/2 - x)/(DIM/2);
    float jy = scale * (float)(DIM/2 - y)/(DIM/2);

    cuComplex c(-0.8, 0.156);
    cuComplex a(jx, jy);

    int i = 0;
    for(i = 0; i < 200; i++){
        a = a * a + c;
        if(a.magnitude2() > 1000){
            return 0;
        }
    }
    return 1;
}

__global__ void kernelGPU(byte *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;

    int juliaValue = juliaGPU(x, y);
    ptr[offset * 4 + 0] = 255 * juliaValue;
    ptr[offset * 4 + 1] = 0;
    ptr[offset * 4 + 2] = 0;
    ptr[offset * 4 + 3] = 255;
}

struct DataBlock {
    unsigned char *dev_bitmap;
};

void juliaGPUTestSample(){
    DataBlock data;
    CPUBitmap bitmap(DIM, DIM);
    byte *dev_bitmap; // memory on GPU
    HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap, bitmap.image_size()));
    data.dev_bitmap = dev_bitmap;

    dim3 grid(DIM, DIM);
    int starTime = glutGet(GLUT_ELAPSED_TIME);
    kernelGPU<<<grid, 1>>>(dev_bitmap);
    HANDLE_ERROR(cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost));
    int endTime = glutGet(GLUT_ELAPSED_TIME) - starTime;

    printf("Total time %d\n:", endTime);
    HANDLE_ERROR(cudaFree(dev_bitmap));
    bitmap.display_and_exit();
}

int main(void){
    juliaGPUTestSample();
    return 1;
}
Here is the CPU version:
/// the "cuComplex" struct is the same from above.
int julia(int x, int y){
    const float scale = 1.3;
    float jx = scale * (float)(DIM/2 - x)/(DIM/2);
    float jy = scale * (float)(DIM/2 - y)/(DIM/2);

    cuComplex c(-0.8, 0.156);
    cuComplex a(jx, jy);

    int i = 0;
    for(i = 0; i < 200; i++){
        a = a * a + c;
        if(a.magnitude2() > 1000){
            return 0;
        }
    }
    return 1;
}

void kernel(unsigned char *ptr){
    for(int y = 0; y < DIM; ++y){
        for(int x = 0; x < DIM; ++x){
            int offset = x + y * DIM;

            int juliaValue = julia(x, y);
            ptr[offset * 4 + 0] = juliaValue * 125;
            ptr[offset * 4 + 1] = juliaValue * x;
            ptr[offset * 4 + 2] = juliaValue * y;
            ptr[offset * 4 + 3] = 255;
        }
    }
}

void juliaCPUTestSample(){
    CPUBitmap bitmap(DIM, DIM);
    unsigned char *ptr = bitmap.get_ptr();

    int starTime = glutGet(GLUT_ELAPSED_TIME);
    kernel(ptr);
    int endTime = glutGet(GLUT_ELAPSED_TIME) - starTime;

    printf("Total time %d\n:", endTime);
    bitmap.display_and_exit();
}
Update - system configuration:
Windows 7 64-bit
CPU - Intel Core i7-3770 @ 3.40GHz, 16GB RAM
GPU - NVIDIA Quadro 4000
Others have noticed this.
First of all, when talking about performance comparisons between CPU and GPU it's a good idea to mention the system configuration, including the hardware platform and software. For example, I ran your code on an HP laptop with a Core i7 2.60GHz quad-core CPU and a Quadro 1000M GPU, running RHEL 6.2 and CUDA 5.0, and I got a score of 438 for the GPU and 441 for the CPU.
Second, and more importantly, the Julia sample in that book is a relatively early example of CUDA coding, so it is not really oriented towards maximum performance, but rather towards illustrating the concepts that have been discussed up to that point. That book and various other CUDA tutorial materials start by introducing parallel programming using CUDA at the block level. The indication of this is here:
kernelGPU<<<grid ,1 >>>(dev_bitmap);
The kernel launch parameters <<<grid, 1>>> indicate that a grid of some number of blocks (grid, which is 1 million blocks in total in this case) will be launched, with each block having a single thread. This immediately reduces the power of a Fermi-class GPU, for example, by a factor of 1/32 compared with launching a grid with a full complement of threads per block. Each SM in a Fermi-class GPU has 32 thread processors, all executing in lockstep. If you launch a block with only 16 threads in it, then 16 thread processors will execute your code and the other 16 will do nothing (i.e. nothing useful). A thread block containing only 1 thread will therefore use only 1 of 32 thread processors, with the other 31 idle.
Therefore this particular code sample is not well designed to utilize the full parallel capability of the GPU. Given that it appears relatively early in the book's exposition of CUDA concepts, this is understandable; I don't believe it was the authors' intent to have this code benchmarked or used as a legitimate representation of how to write fast code on the GPU.
In light of this factor of 1/32, and given that on your system the CPU is only 3 times faster while on my system the CPU and GPU have comparable throughput (neither of them being a particularly high-performance CUDA GPU, most likely), I think it shows the GPU in a reasonably good light: the GPU is fighting this battle with about 97% of its capability unused.
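For completeness, a hedged sketch (not from the book) of how the same work could be launched with fully populated thread blocks; the 16x16 block size is an arbitrary choice, and the guard handles a DIM that is not a multiple of it:

// One pixel per thread, 16x16 threads per block instead of one thread per block.
__global__ void kernelGPU_threads(byte *ptr){
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= DIM || y >= DIM) return;   // guard for partial blocks at the edges

    int offset = x + y * DIM;
    int juliaValue = juliaGPU(x, y);
    ptr[offset * 4 + 0] = 255 * juliaValue;
    ptr[offset * 4 + 1] = 0;
    ptr[offset * 4 + 2] = 0;
    ptr[offset * 4 + 3] = 255;
}

// Host-side launch:
dim3 block(16, 16);
dim3 grid((DIM + block.x - 1) / block.x, (DIM + block.y - 1) / block.y);
kernelGPU_threads<<<grid, block>>>(dev_bitmap);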