CUDA Optimization - c++

I developed Pincushion Distortion using CUDA to support real time - more than 40 fps for 3680*2456 Image Sequences.
But it takes 130ms if I use CUDA - nVIDIA GeForce GT 610, 2GB DDR3.
But it takes only 60ms if I use CPU and OpenMP - Core i7 3.4GHz, QuadCore.
Please tell me what to do to speed up.
Thanks.
Full source can be downloaded here.
https://drive.google.com/file/d/0B9SEJgsu0G6QX2FpMnRja0o5STA/view?usp=sharing
https://drive.google.com/file/d/0B9SEJgsu0G6QOGNPMmVQLWpSb2c/view?usp=sharing
The codes are as follows.
__global__
void undistort(int N, float k, int width, int height, int depth, int pitch, float R, float L, unsigned char* in_bits, unsigned char* out_bits)
{
// Get the Index of the Array from GPU Grid/Block/Thread Index and Dimension.
int i, j;
i = blockIdx.y * blockDim.y + threadIdx.y;
j = blockIdx.x * blockDim.x + threadIdx.x;
// If Out of Array
if (i >= height || j >= width)
{
return;
}
// Calculating Undistortion Equation.
// In CPU, We used Fast Approximation equations of atan and sqrt - It makes 2 times faster.
// But In GPU, No need to use Approximation Functions as it is faster.
int cx = width * 0.5;
int cy = height * 0.5;
int xt = j - cx;
int yt = i - cy;
float distance = sqrt((float)(xt*xt + yt*yt));
float r = distance*k / R;
float theta = 1;
if (r == 0)
theta = 1;
else
theta = atan(r)/r;
theta = theta*L;
float tx = theta*xt + cx;
float ty = theta*yt + cy;
// When we correct the frame, its size will be greater than Original.
// So We should Crop it.
if (tx < 0)
tx = 0;
if (tx >= width)
tx = width - 1;
if (ty < 0)
ty = 0;
if (ty >= height)
ty = height - 1;
// Output the Result.
int ux = (int)(tx);
int uy = (int)(ty);
tx = tx - ux;
ty = ty - uy;
unsigned char *p = (unsigned char*)out_bits + i*pitch + j*depth;
unsigned char *q00 = (unsigned char*)in_bits + uy*pitch + ux*depth;
unsigned char *q01 = q00 + depth;
unsigned char *q10 = q00 + pitch;
unsigned char *q11 = q10 + depth;
unsigned char newVal[4] = {0};
for (int k = 0; k < depth; k++)
{
newVal[k] = (q00[k]*(1-tx)*(1-ty) + q01[k]*tx*(1-ty) + q10[k]*(1-tx)*ty + q11[k]*tx*ty);
memcpy(p + k, &newVal[k], 1);
}
}
void wideframe_correction(char* bits, int width, int height, int depth)
{
// Find the device.
// Initialize the nVIDIA Device.
cudaSetDevice(0);
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, 0);
// This works for Calculating GPU Time.
cudaProfilerStart();
// This works for Measuring Total Time
long int dwTime = clock();
// Setting Distortion Parameters
// Note that Multiplying 0.5 works faster than divide into 2.
int cx = (int)(width * 0.5);
int cy = (int)(height * 0.5);
float k = -0.73f;
float R = sqrt((float)(cx*cx + cy*cy));
// Set the Radius of the Result.
float L = (float)(width<height ? width:height);
L = L/2.0f;
L = L/R;
L = L*L*L*0.3333f;
L = 1.0f/(1-L);
// Create the GPU Memory Pointers.
unsigned char* d_img_in = NULL;
unsigned char* d_img_out = NULL;
// Allocate the GPU Memory2D with pitch for fast performance.
size_t pitch;
cudaMallocPitch( (void**) &d_img_in, &pitch, width*depth, height );
cudaMallocPitch( (void**) &d_img_out, &pitch, width*depth, height );
_tprintf(_T("\nPitch : %d\n"), pitch);
// Copy RAM data to VRAM.
cudaMemcpy2D( d_img_in, pitch,
bits, width*depth, width*depth, height,
cudaMemcpyHostToDevice );
cudaMemcpy2D( d_img_out, pitch,
bits, width*depth, width*depth, height,
cudaMemcpyHostToDevice );
// Create Variables for Timing
cudaEvent_t startEvent, stopEvent;
cudaError_t err = cudaEventCreate(&startEvent, 0);
assert( err == cudaSuccess );
err = cudaEventCreate(&stopEvent, 0);
assert( err == cudaSuccess );
// Execution of the version using global memory
float elapsedTime;
cudaEventRecord(startEvent);
// Process image
dim3 dGrid(width / BLOCK_WIDTH + 1, height / BLOCK_HEIGHT + 1);
dim3 dBlock(BLOCK_WIDTH, BLOCK_HEIGHT);
undistort<<< dGrid, dBlock >>> (width*height, k, width, height, depth, pitch, R, L, d_img_in, d_img_out);
cudaThreadSynchronize();
cudaEventRecord(stopEvent);
cudaEventSynchronize( stopEvent );
// Estimate the GPU Time.
cudaEventElapsedTime( &elapsedTime, startEvent, stopEvent);
// Calculate the Total Time.
dwTime = clock() - dwTime;
// Save Image data from VRAM to RAM
cudaMemcpy2D( bits, width*depth,
d_img_out, pitch, width*depth, height,
cudaMemcpyDeviceToHost );
_tprintf(_T("GPU Processing Time(ms) : %d\n"), (int)elapsedTime);
_tprintf(_T("VRAM Memory Read/Write Time(ms) : %d\n"), dwTime - (int)elapsedTime);
_tprintf(_T("Total Time(ms) : %d\n"), dwTime );
// Free GPU Memory
cudaFree(d_img_in);
cudaFree(d_img_out);
cudaProfilerStop();
cudaDeviceReset();
}

i've not read the source code, but there is some things you can't pass through.
your GPU has nearly same performance as your CPU:
Adapt the follwing informations with your real GPU/CPU model.
Specification | GPU | CPU
----------------------------------------
Bandwith | 14,4 GB/sec | 25.6 GB/s
Flops | 155 (FMA) | 135
we can conclude that for memory bounded kernels your GPU will never be faster than your CPU.
GPU informations found here :
http://www.nvidia.fr/object/geforce-gt-610-fr.html#pdpContent=2
CPU informations found here : http://ark.intel.com/products/75123/Intel-Core-i7-4770K-Processor-8M-Cache-up-to-3_90-GHz?q=Intel%20Core%20i7%204770K
and here http://www.ocaholic.ch/modules/smartsection/item.php?page=6&itemid=1005

One does not simply optimize the code just by looking to the source. First of all, you should use Nvidia Profiler https://developer.nvidia.com/nvidia-visual-profiler and see, which part of your code on GPU is the one taking too much time. You might wish to write a UnitTest first however, just to be sure that only the investigated part of your project is tested.
Additionally, you can use CallGrind http://valgrind.org/docs/manual/cl-manual.html to test your CPU code performance.
In general, this is not very surprising that your GPU "optimized" code is slower then "not optimized" one. CUDA cores are usually several times slower than CPU and you have to actually introduce a lot of parallelism to notice a significant speed-up.
EDIT, response to your comment:
As a unit testing framework I strongly recommend GoogleTest. Here you can learn how to use it. Apart from its obvious functionalities (code testing) it allows you to run only specific methods from your class interfaces for performance analysis.
In general, Nvidia profiler is just a tool that runs your code and tells you how much time each of your kernel consume. Please look to their documentation.
By "lot of parallelism" I meant: on your processor you can run 8 x 3.4GHz threads, your GPU has one SM (streaming multiprocessor) with 810MHz clock, lets say 1024 threads per SM (I do not have exact data, but you can run deviceQuery Nvidia script to know the exact parameters), therefore if your GPU code can run (3.4*8)/0.81 = 33 computations in parallel, you will achieve exactly nothing. Execution time of your CPU and GPU code will be the same (neglecting L-cache GPU memory copying, which is expensive). Conclusion: your GPU code should be able to compute at least ~ 40 operations in parallel to introduce any speed-up. On the other hand, lets say that you are able to fully use your GPU potential and you can keep all 1024 on your SM busy all the time. In that case your code will run only (0.81*1024)/(8*3.4) = 30 times faster (approximately, remember that we neglect GPU L-cache operations), which is impossible in most cases, because usually you are not able to parallelize your serial code with such efficiency.
Wish you good luck with your research!

Yes, put nvprof to good use, it's a great tool.
What I could see from your code...
1. Consider using linear thread blocks instead of flat blocks, it could save up some integer operations.
2. Manual correction of image borders and/or thread indices leads to massive divergence and/or impacts coalescing. Consider using texture fetches and/or pre-padding data.
3. memcpy single value from inside the kernel is generally a bad idea.
4. Try to minimize type conversions.

Related

Optimizing raster algorithm in OpenCL 32seconds for a cube on nVidia RTX 3080

I'm new in OpenCl. I wrote an OpenCL software rasterizer to rasterize triangles. Now the time that is used for a cube is 32Seconds, which is too much, I'm testing on nVidia RTX3080 Laptop.
The result is very weird and it's too slow.
Here is the kernel,
___kernel void fragment_shader(__global struct Fragment* fragments, __global struct Triangle_* triangles, int triCount)
{
size_t px = get_global_id(0); // triCount
//size_t py = get_global_id(1); // triCount
int imageWidth = 256;
int imageHeight = 256;
if(px < triCount)
{
float3 v0Raster = (float3)(triangles[px].v[0].pos[0], triangles[px].v[0].pos[1], triangles[px].v[0].pos[2]);
float3 v1Raster = (float3)(triangles[px].v[1].pos[0], triangles[px].v[1].pos[1], triangles[px].v[1].pos[2]);
float3 v2Raster = (float3)(triangles[px].v[2].pos[0], triangles[px].v[2].pos[1], triangles[px].v[2].pos[2]);
float xmin = min3(v0Raster.x, v1Raster.x, v2Raster.x);
float ymin = min3(v0Raster.y, v1Raster.y, v2Raster.y);
float xmax = max3(v0Raster.x, v1Raster.x, v2Raster.x);
float ymax = max3(v0Raster.y, v1Raster.y, v2Raster.y);
float slope = (ymax - ymin) / (xmax - xmin);
// be careful xmin/xmax/ymin/ymax can be negative. Don't cast to uint32_t
unsigned int x0 = max((uint)0, (uint)(floor(xmin)));
unsigned int x1 = min((uint)(imageWidth) - 1, (uint)(floor(xmax)));
unsigned int y0 = max((uint)0, (uint)(floor(ymin)));
unsigned int y1 = min((uint)(imageHeight) - 1, (uint)(floor(ymax)));
float3 v0 = v0Raster;
float3 v1 = v1Raster;
float3 v2 = v2Raster;
float area = edgeFunction(v0Raster, v1Raster, v2Raster);
for (unsigned int y = y0; y <= y1; ++y) {
for (unsigned int x = x0; x <= x1; ++x) {
float3 p = { x + 0.5f, y + 0.5f, 0 };
float w0 = edgeFunction(v1Raster, v2Raster, p);
float w1 = edgeFunction(v2Raster, v0Raster, p);
float w2 = edgeFunction(v0Raster, v1Raster, p);
if (w0 >= 0 && w1 >= 0 && w2 >= 0) {
fragments[y * 256 + x].col[0] = 1.0f;
fragments[y * 256 + x].col[1] = 0;
fragments[y * 256 + x].col[2] = 0;
}
}
}
}
}
The kernel is supposed to run for every triangle, and does box testing and rasterize the pixels.
here is how I invoke it:
global_size[0] = triCount-1;
auto time_start = std::chrono::high_resolution_clock::now();
err = clEnqueueNDRangeKernel(commandQueue, kernel_fragmentShader, 1, NULL, global_size,
NULL, 0, NULL, NULL);
if (err < 0) {
perror("Couldn't enqueue the kernel_fragmentShader");
exit(1);
}
I tried to omit lighting and everything still it takes around 20seconds to render a cube.
This kind of approach is well suited for massively parallel rendering like on GPU. I assume you are doing this on CPU side so the performance is poor as you have no or too small parallelization and no or very little HW support for used operations. On GPU you got SIMD instructions for most of the stuff needed and a lot of is done in HW instead of in code).
To gain speed on CPU side see how to rasterize rotated rectangle this was standard way of SW rendering back in the days before GPUs. The method simply renders edges of convex polygon (or triangle) as lines into 2 buffers (start end poins per horizontal line) and then just fill or interpolate the horizontal lines. This uses much much less operations per pixel.
Your method computes point inside triangle per each pixel of BBOX which meas much more pixels are processed and each pixel need too many complicated operations which kills performance.
On top of this your code is not optimized for example
fragments[y * 256 + x].col[0] = 1.0f;
fragments[y * 256 + x].col[1] = 0;
fragments[y * 256 + x].col[2] = 0;
Why are you computing y * 256 + x 3 times? also I would feel better with (y<<8)+x but nowadays compilers should do it for you. You can also just add 256 to starting address instead of multiplying...
I do not code in OpenCL (IIRC its for computer vision and DIP not for rendering) so I hope you have direct access to fragments[] and not some constrained with additional test which kills performance a lot (similar to putpixel,setpixel,pixel[][],etc. on most gfx apis which can kill performance even up to 10000x times)

a specific OpenCL kernel performs differently on mobile and PC

I was trying to run an OpenCL kernel on both Adreno 630 and my laptop, it turns out that the kernel runs perfectly on mobile but crashes my laptop every single time. I am still trying to figure out the reason by myself. Here's my kernel. I hope you could help me with it, thanks.
__kernel void gen_mapxy( __read_only image2d_t _disp, const float offsetX, __write_only image2d_t _mapxy )
{
const int y = get_global_id(0);
const int local_y = get_local_id(0);
__local short temp[24][1080];
const int imageWidth = get_image_width(_disp);
for(int x = 0; x < imageWidth; ++x)
temp[local_y][x] = 0;
for(int x = imageWidth - 1; x >= 0; --x){
int tempDisp = read_imagei(_disp, sampler_nearest, (int2)(x, y)).x;
int newPos = clamp((int)(x + offsetX * (tempDisp) / 255), 0, imageWidth - 1);
temp[local_y][newPos] = tempDisp;
write_imagef(_mapxy, (int2)(newPos, y), (float4)(x, y, 0, 0));
}
You are using a big local array.
__local short temp[24][1080]
2 byte * 24 * 1080 = 50.6kB. Some desktop GPUs(and their notebook counterparts) have less available local memory limits. For example, GTX 1060 supports the value CL_DEVICE_LOCAL_MEM_SIZE 49152 bytes. But adreno 620, either it is ignoring the array usage silently or supporting larger local arrays because there is a possilibity that local arrays are emulated inside global arrays (limited in hundreds of megabytes) for those chips. If they do support in-chip fast local memory, then there is more possibility of "ignoring" issue or they really doubled local memory limits from last generation of Adrenos.
Even when GPU supports exact value, using all of it will limit thread-level-parallelism on each pipeline, severely reducing potential performance gains, generally.
If last generation of Adreno GPUs are same,
https://compubench.com/device.jsp?benchmark=compu15m&os=Android&api=cs&D=Samsung+Galaxy+S7+%28SM-G930x%29&testgroup=info
this page says
CL_DEVICE_LOCAL_MEM_SIZE
32768
CL_DEVICE_LOCAL_MEM_TYPE
CL_LOCAL
it is fast but it is 32kB so it is ignoring the error or you've missed adding necessary error catching logic in there, or both.

Improving asynchronous execution in CUDA

I am currently writing a programme that performs large simulations on the GPU using the CUDA API. In order to accelerate the performance, I tried to run my kernels simultaneously and then asynchronously copy the result into the host memory again. The code looks roughly like this:
#define NSTREAMS 8
#define BLOCKDIMX 16
#define BLOCKDIMY 16
void domainUpdate(float* domain_cpu, // pointer to domain on host
float* domain_gpu, // pointer to domain on device
const unsigned int dimX,
const unsigned int dimY,
const unsigned int dimZ)
{
dim3 blocks((dimX + BLOCKDIMX - 1) / BLOCKDIMX, (dimY + BLOCKDIMY - 1) / BLOCKDIMY);
dim3 threads(BLOCKDIMX, BLOCKDIMY);
for (unsigned int ii = 0; ii < NSTREAMS; ++ii) {
updateDomain3D<<<blocks,threads, 0, streams[ii]>>>(domain_gpu,
dimX, 0, dimX - 1, // dimX, minX, maxX
dimY, 0, dimY - 1, // dimY, minY, maxY
dimZ, dimZ * ii / NSTREAMS, dimZ * (ii + 1) / NSTREAMS - 1); // dimZ, minZ, maxZ
unsigned int offset = dimX * dimY * dimZ * ii / NSTREAMS;
cudaMemcpyAsync(domain_cpu + offset ,
domain_gpu+ offset ,
sizeof(float) * dimX * dimY * dimZ / NSTREAMS,
cudaMemcpyDeviceToHost, streams[ii]);
}
cudaDeviceSynchronize();
}
All in all it is just a simple for-loop, looping over all streams (8 in this case) and dividing the work. This actually is a deal faster (up to 30% performance gain), although maybe less than I had hoped. I analysed a typical cycle in Nvidia's Compute Visual Profiler, and the execution looks like this:
As can be seen in the picture, the kernels do overlap, although never more than two kernels are running at the same time. I tried the same thing for different numbers of streams and different sizes of the simulation domain, but this is always the case.
So my question is: is there a way to encourage/force the GPU scheduler to run more than two things at the same time? Or is this a limitation dependent on the GPU device that cannot be represented in the code?
My system specifications are: 64-bit Windows 7, and a GeForce GTX 670 graphics card (that's Kepler architecture, compute capability 3.0).
Kernels overlap only if the GPU has resources left to run a second kernel. Once the GPU is fully loaded, there is no gain from running more kernels in parallel, so the driver does not do that.

Applying Matrix To Image, Seeking Performance Improvements

Edited: Working on Windows platform.
Problem: Less of a problem, more about advise. I'm currently not incredibly versed in low-level program, but I am attempting to optimize the code below in an attempt to increase the performance of my overall code. This application depends on extremely high speed image processing.
Current Performance: On my computer, this currently computes at about 4-6ms for a 512x512 image. I'm trying to cut that in half if possible.
Limitations: Due to this projects massive size, fundamental changes to the application are very difficult to do, so things such as porting to DirectX or other GPU methods isn't much of an option. The project currently works, I'm simply trying to figure out how to make it work faster.
Specific information about my use for this: Images going into this method are always going to be exactly square and some increment of 128. (Most likely 512 x 512) and they will always come out the same size. Other than that, there is not much else to it. The matrix is calculated somewhere else, so this is just the applying of the matrix to my image. The original image and the new image are both being used, so copying the image is necessary.
Here is my current implementation:
void ReprojectRectangle( double *mpProjMatrix, unsigned char *pDstScan0, unsigned char *pSrcScan0,
int NewBitmapDataStride, int SrcBitmapDataStride, int YOffset, double InversedAspect, int RectX, int RectY, int RectW, int RectH)
{
int i, j;
double Xnorm, Ynorm;
double Ynorm_X_ProjMatrix4, Ynorm_X_ProjMatrix5, Ynorm_X_ProjMatrix7;;
double SrcX, SrcY, T;
int SrcXnt, SrcYnt;
int SrcXec, SrcYec, SrcYnvDec;
unsigned char *pNewPtr, *pSrcPtr1, *pSrcPtr2, *pSrcPtr3, *pSrcPtr4;
int RectX2, RectY2;
/* Compensate (or re-center) the Y-coordinate regarding the aspect ratio */
RectY -= YOffset;
/* Compute the second point of the rectangle for the loops */
RectX2 = RectX + RectW;
RectY2 = RectY + RectH;
/* Clamp values (be careful with aspect ratio */
if (RectY < 0) RectY = 0;
if (RectY2 < 0) RectY2 = 0;
if ((double)RectY > (InversedAspect * 512.0)) RectY = (int)(InversedAspect * 512.0);
if ((double)RectY2 > (InversedAspect * 512.0)) RectY2 = (int)(InversedAspect * 512.0);
/* Iterate through each pixel of the scaled re-Proj */
for (i=RectY; i<RectY2; i++)
{
/* Normalize Y-coordinate and take the ratio into account */
Ynorm = InversedAspect - (double)i / 512.0;
/* Pre-compute some matrix coefficients */
Ynorm_X_ProjMatrix4 = Ynorm * mpProjMatrix[4] + mpProjMatrix[12];
Ynorm_X_ProjMatrix5 = Ynorm * mpProjMatrix[5] + mpProjMatrix[13];
Ynorm_X_ProjMatrix7 = Ynorm * mpProjMatrix[7] + mpProjMatrix[15];
for (j=RectX; j<RectX2; j++)
{
/* Get a pointer to the pixel on (i,j) */
pNewPtr = pDstScan0 + ((i+YOffset) * NewBitmapDataStride) + j;
/* Normalize X-coordinates */
Xnorm = (double)j / 512.0;
/* Compute the corresponding coordinates in the source image, before Proj and normalize source coordinates*/
T = (Xnorm * mpProjMatrix[3] + Ynorm_X_ProjMatrix7);
SrcY = (Xnorm * mpProjMatrix[0] + Ynorm_X_ProjMatrix4)/T;
SrcX = (Xnorm * mpProjMatrix[1] + Ynorm_X_ProjMatrix5)/T;
// Compute the integer and decimal values of the coordinates in the sources image
SrcXnt = (int) SrcX;
SrcYnt = (int) SrcY;
SrcXec = 64 - (int) ((SrcX - (double) SrcXnt) * 64);
SrcYec = 64 - (int) ((SrcY - (double) SrcYnt) * 64);
// Get the values of the four pixels up down right left
pSrcPtr1 = pSrcScan0 + (SrcXnt * SrcBitmapDataStride) + SrcYnt;
pSrcPtr2 = pSrcPtr1 + 1;
pSrcPtr3 = pSrcScan0 + ((SrcXnt+1) * SrcBitmapDataStride) + SrcYnt;
pSrcPtr4 = pSrcPtr3 + 1;
SrcYnvDec = (64-SrcYec);
(*pNewPtr) = (unsigned char)(((SrcYec * (*pSrcPtr1) + SrcYnvDec * (*pSrcPtr2)) * SrcXec +
(SrcYec * (*pSrcPtr3) + SrcYnvDec * (*pSrcPtr4)) * (64 - SrcXec)) >> 12);
}
}
}
Two things that could help: multiprocessing and SIMD. With multiprocessing you could break up the output image into tiles and have each processor work on the next available tile. You can use SIMD instructions (like SSE, AVX, AltiVec, etc.) to calculate multiple things at the same time, such as doing the same matrix math to multiple coordinates at the same time. You can even combine the two - use multiple processors running SIMD instructions to do as much work as possible. You didn't mention what platform you're working on.

CUDA kernel error when increasing thread number

I am developing a CUDA ray-plane intersection kernel.
Let's suppose, my plane (face) struct is:
typedef struct _Face {
int ID;
int matID;
int V1ID;
int V2ID;
int V3ID;
float V1[3];
float V2[3];
float V3[3];
float reflect[3];
float emmision[3];
float in[3];
float out[3];
int intersects[RAYS];
} Face;
I pasted the whole struct so you can get an idea of it's size. RAYS equals 625 in current configuration. In the following code assume that the size of faces array is i.e. 1270 (generally - thousands).
Now until today I have launched my kernel in a very naive way:
const int tpb = 64; //threads per block
dim3 grid = (n +tpb-1)/tpb; // n - face count in array
dim3 block = tpb;
//.. some memory allocation etc.
theKernel<<<grid,block>>>(dev_ptr, n);
and inside the kernel I had a loop:
__global__ void theKernel(Face* faces, int faceCount) {
int offset = threadIdx.x + blockIdx.x*blockDim.x;
if(offset >= faceCount)
return;
Face f = faces[offset];
//..some initialization
int RAY = -1;
for(float alpha=0.0f; alpha<=PI; alpha+= alpha_step ){
for(float beta=0.0f; beta<=PI; beta+= beta_step ){
RAY++;
//..calculation per ray in (alpha,beta) direction ...
faces[offset].intersects[RAY] = ...; //some assignment
This is about it. I looped through all the directions and updated the faces array. I worked correctly, but was hardly any faster than CPU code.
So today I tried to optimize the code, and launch the kernel with a much bigger number of threads. Instead of having 1 thread per face I want 1 thread per face's ray (meaning 625 threads work for 1 face). The modifications were simple:
dim3 grid = (n*RAYS +tpb-1)/tpb; //before launching . RAYS = 625, n = face count
and the kernel itself:
__global__ void theKernel(Face *faces, int faceCount){
int threadNum = threadIdx.x + blockIdx.x*blockDim.x;
int offset = threadNum/RAYS; //RAYS is a global #define
int rayNum = threadNum - offset*RAYS;
if(offset >= faceCount || rayNum != 0)
return;
Face f = faces[offset];
//initialization and the rest.. again ..
And this code does not work at all. Why? Theoretically, only the 1st thread (of the 625 per Face) should work, so why does this result in bad (hardly any) computation?
Kind regards,
e.
The maximum size of a grid in any dimension is 65535 (CUDA programming guide, Appendix F). If your grid size was 1000 before the change, you have increased it to 625000. That's bigger than the limit, so the kernel won't run correctly.
If you define the grid size as
dim3 grid((n + tpb - 1) / tpb, RAYS);
then all grid dimensions will be smaller than the limit. You'll also have to change the way blockIdx is used in the kernel.
As Heatsink pointed out you are probably exceeding available resources. Good idea is to check after kernel execution whether there was no error.
Here is C++ code I use:
#include <cutil_inline.h>
void
check_error(const char* str, cudaError_t err_code) {
if (err_code != ::cudaSuccess)
std::cerr << str << " -- " << cudaGetErrorString(err_code) << "\n";
}
Then when I invole kernel:
my_kernel <<<block_grid, thread_grid >>>(args);
check_error("my_kernel", cudaGetLastError());