C++: Simple CUDA volume reconstruction code crashing

I am currently working on a more comprehensive project involving CUDA. Over the past few days I have been encountering errors that I have been desperately trying to fix, but I couldn't figure them out, so I have now made a minimal example that shows the same behaviour. I have to say I am fairly new to CUDA. I am using Visual Studio 2015 and the CUDA Toolkit 7.5.
The program creates a 3D volume in GPU memory and then calculates values and writes them to the volume. I have tried to make the code as simple as possible.
First is the main.cpp file:
#include "cuda_test.h"
int main() {
size_t const xDimension = 500;
size_t const yDimension = 500;
size_t const zDimension = 1000;
//allocate volume part memory on gpu
cudaPitchedPtr volume = ct::cuda::create3dVolumeOnGPU(xDimension, yDimension, zDimension);
//start reconstruction
ct::cuda::startReconstruction(volume,
xDimension,
yDimension,
zDimension);
return 0;
}
Then cuda_test.h, the header file for the actual .cu file:
#ifndef CT_CUDA
#define CT_CUDA

#include <cstdlib>
#include <stdio.h>
#include <cmath>

//CUDA
#include <cuda_runtime.h>

namespace ct {
    namespace cuda {
        cudaPitchedPtr create3dVolumeOnGPU(size_t xSize, size_t ySize, size_t zSize);
        void startReconstruction(cudaPitchedPtr volume,
                                 size_t xSize,
                                 size_t ySize,
                                 size_t zSize);
    }
}

#endif
And then the cuda_test.cu file that contains the actual function implementations:
#include "cuda_test.h"
namespace ct {
namespace cuda {
cudaPitchedPtr create3dVolumeOnGPU(size_t xSize, size_t ySize, size_t zSize) {
cudaExtent extent = make_cudaExtent(xSize * sizeof(float), ySize, zSize);
cudaPitchedPtr ptr;
cudaMalloc3D(&ptr, extent);
printf("malloc3D: %s\n", cudaGetErrorString(cudaGetLastError()));
cudaMemset3D(ptr, 0, extent);
printf("memset: %s\n", cudaGetErrorString(cudaGetLastError()));
return ptr;
}
__device__ void addToVolumeElement(cudaPitchedPtr volumePtr, size_t ySize, size_t xCoord, size_t yCoord, size_t zCoord, float value) {
char* devicePtr = (char*)(volumePtr.ptr);
//z * xSize * ySize + y * xSize + x
size_t pitch = volumePtr.pitch;
size_t slicePitch = pitch * ySize;
char* slice = devicePtr + zCoord*slicePitch;
float* row = (float*)(slice + yCoord * pitch);
row[xCoord] += value;
}
__global__ void reconstructionKernel(cudaPitchedPtr volumePtr, size_t xSize, size_t ySize, size_t zSize) {
size_t xIndex = blockIdx.x;
size_t yIndex = blockIdx.y;
size_t zIndex = blockIdx.z;
if (xIndex == 0 && yIndex == 0 && zIndex == 0) {
printf("kernel start\n");
}
//just make sure we're inside the volume bounds
if (xIndex < xSize && yIndex < ySize && zIndex < zSize) {
//float value = z;
float value = sqrt(sqrt(sqrt(5.3))) * sqrt(sqrt(sqrt(1.2))) * sqrt(sqrt(sqrt(10.8))) + 501 * 0.125 * 0.786 / 5.3;
addToVolumeElement(volumePtr, ySize, xIndex, yIndex, zIndex, value);
}
if (xIndex == 0 && yIndex == 0 && zIndex == 0) {
printf("kernel end\n");
}
}
void startReconstruction(cudaPitchedPtr volumePtr, size_t xSize, size_t ySize, size_t zSize) {
dim3 blocks(xSize, ySize, zSize);
reconstructionKernel <<< blocks, 1 >>>(volumePtr,
xSize,
ySize,
zSize);
printf("Kernel launch: %s\n", cudaGetErrorString(cudaGetLastError()));
cudaDeviceSynchronize();
printf("Device synchronise: %s\n", cudaGetErrorString(cudaGetLastError()));
}
}
}
The function create3dVolumeOnGPU allocates a 3-dimensional "volume" in GPU memory and returns a pointer to it. It is a host function. The second host function is startReconstruction; all it does is launch the actual kernel with as many blocks as there are voxels in the volume. The kernel function is reconstructionKernel. It just calculates an arbitrary value out of some constants and then calls addToVolumeElement (a device function) to write the result into the corresponding voxel (adding to it).
Now, the problem is that it crashes. If I launch it with the debugger (NSight), NSight breaks with the error message:
CUDA grid launch failed: CUcontext: 2358451327088 CUmodule: 2358541519888 Function: _ZN2ct4cuda20reconstructionKernelE14cudaPitchedPtryyy
The console outputs:
malloc3D: no error
memset: no error
kernel started
kernel end
If I launch it in release mode, the whole machine resets.
However, if I make the volume dimensions smaller, it works, for example:
size_t const xDimension = 100;
size_t const yDimension = 100;
size_t const zDimension = 100;
However, the amount of free GPU memory should not be the problem (the card has 4 GB of VRAM).
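For what it's worth, I could verify the free memory at runtime with something like the following (a sketch, not part of my original code; cudaMemGetInfo is a standard runtime call):
// Sketch: query free/total device memory before the allocation.
size_t freeBytes = 0, totalBytes = 0;
cudaMemGetInfo(&freeBytes, &totalBytes);
printf("free: %zu MB, total: %zu MB\n", freeBytes >> 20, totalBytes >> 20);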
It would be nice if someone could have a look at it and maybe give me a tip about what could be causing the problem.

Now, the problem is that it crashes
It would be nice if someone could have a look at it and maybe give me a tip about what could be causing the problem.
I think it's likely you are running into a WDDM TDR issue. On Windows, any time a kernel running on a WDDM GPU takes more than about 2 seconds to execute, you may run into the WDDM TDR watchdog (assuming you haven't made any changes to the watchdog).
Furthermore, launching kernels like this:
reconstructionKernel <<< blocks, 1 >>>(...);
where the threads-per-block number is 1, means that only one thread in each warp (and in each block) is active. But the GPU likes to have 32 active threads per warp. So the net effect is inefficient utilization of the GPU resources; perhaps as much as 97% of the GPU horsepower sits idle when you run kernels this way.
So if your code is flexible enough to allow this:
reconstructionKernel <<< blocks, 1 >>>(...);
or equivalently this:
reconstructionKernel <<< blocks/256, 256 >>>(...);
(this is just a representative example; I realize you have a multidimensional grid, and the above probably isn't exactly relevant for your case)
then the second invocation method will almost certainly be more efficient, leading to a shorter execution time for the same work.
So I believe when you tested your code with multiple threads per block, you did something like the above, and it reduced the execution time below the TDR limit.
That's a perfectly fine solution, but if you end up adding more work to your kernel (more total threads, or more work per thread) then you may run into the limit again. In that case, the linked article explains a possible work-around.
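For the 3D grid in your case, a minimal sketch of a multi-threaded launch might look like the following (the 8x8x8 block shape is an arbitrary choice, and the kernel would then have to compute its coordinates from both blockIdx and threadIdx, e.g. xIndex = blockIdx.x * blockDim.x + threadIdx.x):
// Sketch only: assumes the kernel indexing is updated accordingly.
dim3 threads(8, 8, 8);   // 512 threads per block (arbitrary but warp-friendly)
dim3 blocks((xSize + threads.x - 1) / threads.x,
            (ySize + threads.y - 1) / threads.y,
            (zSize + threads.z - 1) / threads.z);
reconstructionKernel<<<blocks, threads>>>(volumePtr, xSize, ySize, zSize);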
As an aside, kernel launch configurations like this:
kernel<<<1, ?>>>(...);
or this:
kernel<<<?, 1>>>(...);
are never recommended for high performance code on the GPU.

Related

CUDA Optimization

I developed pincushion distortion correction using CUDA to support real-time processing - more than 40 fps for 3680*2456 image sequences.
But it takes 130 ms if I use CUDA - an nVIDIA GeForce GT 610, 2GB DDR3.
It takes only 60 ms if I use the CPU with OpenMP - a Core i7 3.4GHz, quad-core.
Please tell me what to do to speed it up.
Thanks.
Full source can be downloaded here.
https://drive.google.com/file/d/0B9SEJgsu0G6QX2FpMnRja0o5STA/view?usp=sharing
https://drive.google.com/file/d/0B9SEJgsu0G6QOGNPMmVQLWpSb2c/view?usp=sharing
The codes are as follows.
__global__
void undistort(int N, float k, int width, int height, int depth, int pitch, float R, float L, unsigned char* in_bits, unsigned char* out_bits)
{
    // Get the Index of the Array from GPU Grid/Block/Thread Index and Dimension.
    int i, j;
    i = blockIdx.y * blockDim.y + threadIdx.y;
    j = blockIdx.x * blockDim.x + threadIdx.x;

    // If Out of Array
    if (i >= height || j >= width)
    {
        return;
    }

    // Calculating Undistortion Equation.
    // In CPU, We used Fast Approximation equations of atan and sqrt - It makes 2 times faster.
    // But In GPU, No need to use Approximation Functions as it is faster.
    int cx = width * 0.5;
    int cy = height * 0.5;
    int xt = j - cx;
    int yt = i - cy;

    float distance = sqrt((float)(xt*xt + yt*yt));
    float r = distance*k / R;

    float theta = 1;
    if (r == 0)
        theta = 1;
    else
        theta = atan(r)/r;
    theta = theta*L;

    float tx = theta*xt + cx;
    float ty = theta*yt + cy;

    // When we correct the frame, its size will be greater than Original.
    // So We should Crop it.
    if (tx < 0)
        tx = 0;
    if (tx >= width)
        tx = width - 1;
    if (ty < 0)
        ty = 0;
    if (ty >= height)
        ty = height - 1;

    // Output the Result.
    int ux = (int)(tx);
    int uy = (int)(ty);

    tx = tx - ux;
    ty = ty - uy;

    unsigned char *p = (unsigned char*)out_bits + i*pitch + j*depth;
    unsigned char *q00 = (unsigned char*)in_bits + uy*pitch + ux*depth;
    unsigned char *q01 = q00 + depth;
    unsigned char *q10 = q00 + pitch;
    unsigned char *q11 = q10 + depth;

    unsigned char newVal[4] = {0};
    for (int k = 0; k < depth; k++)
    {
        newVal[k] = (q00[k]*(1-tx)*(1-ty) + q01[k]*tx*(1-ty) + q10[k]*(1-tx)*ty + q11[k]*tx*ty);
        memcpy(p + k, &newVal[k], 1);
    }
}
void wideframe_correction(char* bits, int width, int height, int depth)
{
    // Find the device.
    // Initialize the nVIDIA Device.
    cudaSetDevice(0);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);

    // This works for Calculating GPU Time.
    cudaProfilerStart();

    // This works for Measuring Total Time
    long int dwTime = clock();

    // Setting Distortion Parameters
    // Note that Multiplying 0.5 works faster than divide into 2.
    int cx = (int)(width * 0.5);
    int cy = (int)(height * 0.5);
    float k = -0.73f;
    float R = sqrt((float)(cx*cx + cy*cy));

    // Set the Radius of the Result.
    float L = (float)(width<height ? width:height);
    L = L/2.0f;
    L = L/R;
    L = L*L*L*0.3333f;
    L = 1.0f/(1-L);

    // Create the GPU Memory Pointers.
    unsigned char* d_img_in = NULL;
    unsigned char* d_img_out = NULL;

    // Allocate the GPU Memory2D with pitch for fast performance.
    size_t pitch;
    cudaMallocPitch( (void**) &d_img_in, &pitch, width*depth, height );
    cudaMallocPitch( (void**) &d_img_out, &pitch, width*depth, height );
    _tprintf(_T("\nPitch : %d\n"), pitch);

    // Copy RAM data to VRAM.
    cudaMemcpy2D( d_img_in, pitch,
                  bits, width*depth, width*depth, height,
                  cudaMemcpyHostToDevice );
    cudaMemcpy2D( d_img_out, pitch,
                  bits, width*depth, width*depth, height,
                  cudaMemcpyHostToDevice );

    // Create Variables for Timing
    cudaEvent_t startEvent, stopEvent;
    cudaError_t err = cudaEventCreate(&startEvent, 0);
    assert( err == cudaSuccess );
    err = cudaEventCreate(&stopEvent, 0);
    assert( err == cudaSuccess );

    // Execution of the version using global memory
    float elapsedTime;
    cudaEventRecord(startEvent);

    // Process image
    dim3 dGrid(width / BLOCK_WIDTH + 1, height / BLOCK_HEIGHT + 1);
    dim3 dBlock(BLOCK_WIDTH, BLOCK_HEIGHT);
    undistort<<< dGrid, dBlock >>> (width*height, k, width, height, depth, pitch, R, L, d_img_in, d_img_out);
    cudaThreadSynchronize();

    cudaEventRecord(stopEvent);
    cudaEventSynchronize( stopEvent );

    // Estimate the GPU Time.
    cudaEventElapsedTime( &elapsedTime, startEvent, stopEvent);

    // Calculate the Total Time.
    dwTime = clock() - dwTime;

    // Save Image data from VRAM to RAM
    cudaMemcpy2D( bits, width*depth,
                  d_img_out, pitch, width*depth, height,
                  cudaMemcpyDeviceToHost );

    _tprintf(_T("GPU Processing Time(ms) : %d\n"), (int)elapsedTime);
    _tprintf(_T("VRAM Memory Read/Write Time(ms) : %d\n"), dwTime - (int)elapsedTime);
    _tprintf(_T("Total Time(ms) : %d\n"), dwTime );

    // Free GPU Memory
    cudaFree(d_img_in);
    cudaFree(d_img_out);
    cudaProfilerStop();
    cudaDeviceReset();
}
I haven't read the source code, but there are some things you can't get around.
Your GPU has nearly the same performance as your CPU (adapt the following figures to your actual GPU/CPU models):
Specification | GPU        | CPU
--------------|------------|----------
Bandwidth     | 14.4 GB/s  | 25.6 GB/s
Flops         | 155 (FMA)  | 135
We can conclude that for memory-bound kernels your GPU will never be faster than your CPU.
GPU information found here:
http://www.nvidia.fr/object/geforce-gt-610-fr.html#pdpContent=2
CPU information found here: http://ark.intel.com/products/75123/Intel-Core-i7-4770K-Processor-8M-Cache-up-to-3_90-GHz?q=Intel%20Core%20i7%204770K
and here: http://www.ocaholic.ch/modules/smartsection/item.php?page=6&itemid=1005
One does not simply optimize code just by looking at the source. First of all, you should use the Nvidia Profiler https://developer.nvidia.com/nvidia-visual-profiler and see which part of your GPU code is the one taking too much time. You might wish to write a unit test first, however, just to be sure that only the investigated part of your project is exercised.
Additionally, you can use Callgrind http://valgrind.org/docs/manual/cl-manual.html to test your CPU code performance.
In general, it is not very surprising that your GPU-"optimized" code is slower than the "not optimized" one. CUDA cores are usually several times slower than CPU cores, and you have to actually introduce a lot of parallelism to notice a significant speed-up.
EDIT, response to your comment:
As a unit-testing framework I strongly recommend GoogleTest. Here you can learn how to use it. Apart from its obvious functionality (code testing), it allows you to run only specific methods from your class interfaces for performance analysis.
In general, the Nvidia profiler is just a tool that runs your code and tells you how much time each of your kernels consumes. Please look at their documentation.
By "a lot of parallelism" I meant: on your processor you can run 8 x 3.4GHz threads, while your GPU has one SM (streaming multiprocessor) with an 810MHz clock and, let's say, 1024 threads per SM (I do not have the exact data, but you can run the deviceQuery Nvidia sample to get the exact parameters). Therefore, if your GPU code can run only (3.4*8)/0.81 = 33 computations in parallel, you will achieve exactly nothing: the execution time of your CPU and GPU code will be the same (neglecting L-cache GPU memory copying, which is expensive). Conclusion: your GPU code should be able to compute at least ~40 operations in parallel to introduce any speed-up. On the other hand, let's say that you are able to fully use your GPU's potential and you can keep all 1024 threads on your SM busy all the time. In that case your code will run only (0.81*1024)/(8*3.4) = 30 times faster (approximately, remembering that we neglect GPU L-cache operations), which is impossible in most cases, because usually you are not able to parallelize your serial code with such efficiency.
I wish you good luck with your research!
Yes, put nvprof to good use, it's a great tool.
What I could see from your code...
1. Consider using linear (1D) thread blocks instead of flat (2D) blocks; it could save some integer operations (see the sketch after this list).
2. Manual correction of image borders and/or thread indices leads to massive divergence and/or impacts coalescing. Consider using texture fetches and/or pre-padding data.
3. Calling memcpy on a single value from inside the kernel is generally a bad idea.
4. Try to minimize type conversions.
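As an illustration of point 1, here is a minimal sketch of deriving the 2D pixel coordinates from a linear (1D) thread block; the kernel name and the per-pixel work are placeholders, not taken from your code:
// Sketch: replacing a 2D (flat) block layout with linear 1D blocks.
__global__ void process_1d(unsigned char* out, int width, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one linear index per pixel
    if (idx >= width * height) return;                 // single bounds check
    int i = idx / width;                               // row
    int j = idx % width;                               // column
    out[i * width + j] = (unsigned char)(j & 0xFF);    // placeholder per-pixel work
}
// Launch: one thread per pixel, 256 threads per block (arbitrary choice).
// int total = width * height;
// process_1d<<<(total + 255) / 256, 256>>>(d_out, width, height);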

CUDA: Reduce algorithm

I am new to C++/CUDA. I tried implementing the parallel "reduce" algorithm with the ability to handle any input size and thread count without increasing the asymptotic parallel runtime, by recursing over the output of the kernel (in the kernel wrapper).
For example, in Implementing Max Reduce in Cuda, the top answer to that question, the implementation essentially becomes sequential when the thread count is small enough.
However, I keep getting a "Segmentation fault" when I compile and run it:
>> nvcc -o mycode mycode.cu
>> ./mycode
Segmentation fault.
Compiled on a K40 with CUDA 6.5.
Here is the kernel, basically the same as in the SO post I linked; only the "out of bounds" check is different:
#include <stdio.h>

/* -------- KERNEL -------- */
__global__ void reduce_kernel(float * d_out, float * d_in, const int size)
{
    // position and threadId
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // do reduction in global memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            if (pos+s < size) // Handling out of bounds
            {
                d_in[pos] = d_in[pos] + d_in[pos+s];
            }
        }
    }

    // only thread 0 writes result, as thread
    if (tid == 0)
    {
        d_out[blockIdx.x] = d_in[pos];
    }
}
Here is the kernel wrapper I mentioned, which handles the case where 1 block will not contain all of the data:
/* -------- KERNEL WRAPPER -------- */
void reduce(float * d_out, float * d_in, const int size, int num_threads)
{
    // setting up blocks and intermediate result holder
    int num_blocks = ((size) / num_threads) + 1;
    float * d_intermediate;
    cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);

    // recursively solving, will run approximately log base num_threads times.
    do
    {
        reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);

        // updating input to intermediate
        cudaMemcpy(d_in, d_intermediate, sizeof(float)*num_blocks, cudaMemcpyDeviceToDevice);

        // Updating num_blocks to reflect how many blocks we now want to compute on
        num_blocks = num_blocks / num_threads + 1;

        // updating intermediate
        cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
    }
    while(num_blocks > num_threads); // if it is too small, compute rest.

    // computing rest
    reduce_kernel<<<1, num_blocks>>>(d_out, d_in, size);
}
Main program to initialize in/out and create bogus data for testing.
/* -------- MAIN -------- */
int main(int argc, char **argv)
{
    // Setting num_threads
    int num_threads = 512;

    // Making bogus data and setting it on the GPU
    const int size = 1024;
    const int size_out = 1;
    float * d_in;
    float * d_out;
    cudaMalloc(&d_in, sizeof(float)*size);
    cudaMalloc((void**)&d_out, sizeof(float)*size_out);
    const int value = 5;
    cudaMemset(d_in, value, sizeof(float)*size);

    // Running kernel wrapper
    reduce(d_out, d_in, size, num_threads);

    printf("sum is element is: %.f", d_out[0]);
}
There are a few things I would point out with your code.
As a general rule/boilerplate, I always recommend using proper CUDA error checking and running your code with cuda-memcheck any time you are having trouble with a CUDA code. However, those methods wouldn't help much with the seg fault, although they may help later (see below).
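For reference, a minimal sketch of the kind of error-checking macro I mean (one common variant; the name cudaCheck is just illustrative, not from your code):
#include <cstdio>
#include <cstdlib>

// A minimal sketch of a CUDA error-checking macro.
#define cudaCheck(call)                                                 \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Usage:
// cudaCheck(cudaMalloc(&d_in, sizeof(float) * size));
// reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);
// cudaCheck(cudaGetLastError());        // catches launch configuration errors
// cudaCheck(cudaDeviceSynchronize());   // catches errors during kernel execution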
The actual seg fault is occurring on this line:
printf("sum is element is: %.f", d_out[0]);
You've broken a cardinal CUDA programming rule: host pointers must not be dereferenced in device code, and device pointers must not be dereferenced in host code. The latter condition applies here. d_out is a device pointer (allocated via cudaMalloc). Such pointers have no meaning if you attempt to dereference them in host code, and doing so will lead to a seg fault.
The solution is to copy the data back to the host before printing it out:
float result;
cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
printf("sum is element is: %.f", result);
Using cudaMalloc in a loop on the same variable, without any intervening cudaFree, is not good practice; it may lead to out-of-memory errors in long-running loops, and to memory leaks if such a construct is used in a larger program:
do
{
    ...
    cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
}
while...
In this case, I think a better approach and trivial fix would be to cudaFree d_intermediate right before you re-allocate:
do
{
    ...
    cudaFree(d_intermediate);
    cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
}
while...
This might not be doing what you think it is:
const int value = 5;
cudaMemset(d_in, value, sizeof(float)*size);
Probably you are aware of this, but cudaMemset, like memset, operates on byte quantities. So you are filling the d_in array with a value corresponding to 0x05050505 (and I have no idea what that bit pattern corresponds to when interpreted as a float quantity). Since you refer to bogus values, you may be cognizant of this already. But it's a common error (e.g. if you were actually trying to initialize the array with the value of 5 in every float location), so I thought I would point it out.
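If the intent actually was to put the value 5 in every float element, a minimal sketch of one straightforward way to do it (a host-side fill plus copy; h_init is a name introduced here, not from your code):
// Sketch: fill d_in with 5.0f instead of a byte-wise memset.
float *h_init = (float *)malloc(size * sizeof(float));
for (int i = 0; i < size; i++) h_init[i] = 5.0f;
cudaMemcpy(d_in, h_init, size * sizeof(float), cudaMemcpyHostToDevice);
free(h_init);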
Your code has other issues as well (which you will discover if you make the above fixes then run your code with cuda-memcheck). To learn about how to do good parallel reductions, I would recommend studying the CUDA parallel reduction sample code and presentation. Parallel reductions in global memory are not recommended for performance reasons.
For completeness, here are some of the additional issues I found:
Your kernel code needs an appropriate __syncthreads() statement to ensure that the work of all threads in a block is complete before any thread goes on to the next iteration of the for-loop.
Your final write to global memory in the kernel needs to also be conditioned on the read-location being in-bounds. Otherwise, your strategy of always launching an extra block would allow the read from this line to be out-of-bounds (cuda-memcheck will show this).
The reduction logic in your loop in the reduce function is generally messed up and needed to be re-worked in several ways.
I'm not saying this code is defect-free, but it seems to work for the given test case and produce the correct answer (1024):
#include <stdio.h>

/* -------- KERNEL -------- */
__global__ void reduce_kernel(float * d_out, float * d_in, const int size)
{
    // position and threadId
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // do reduction in global memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            if (pos+s < size) // Handling out of bounds
            {
                d_in[pos] = d_in[pos] + d_in[pos+s];
            }
        }
        __syncthreads();
    }

    // only thread 0 writes result, as thread
    if ((tid == 0) && (pos < size))
    {
        d_out[blockIdx.x] = d_in[pos];
    }
}

/* -------- KERNEL WRAPPER -------- */
void reduce(float * d_out, float * d_in, int size, int num_threads)
{
    // setting up blocks and intermediate result holder
    int num_blocks = ((size) / num_threads) + 1;
    float * d_intermediate;
    cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
    cudaMemset(d_intermediate, 0, sizeof(float)*num_blocks);
    int prev_num_blocks;

    // recursively solving, will run approximately log base num_threads times.
    do
    {
        reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);

        // updating input to intermediate
        cudaMemcpy(d_in, d_intermediate, sizeof(float)*num_blocks, cudaMemcpyDeviceToDevice);

        // Updating num_blocks to reflect how many blocks we now want to compute on
        prev_num_blocks = num_blocks;
        num_blocks = num_blocks / num_threads + 1;

        // updating intermediate
        cudaFree(d_intermediate);
        cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
        size = num_blocks*num_threads;
    }
    while(num_blocks > num_threads); // if it is too small, compute rest.

    // computing rest
    reduce_kernel<<<1, prev_num_blocks>>>(d_out, d_in, prev_num_blocks);
}

/* -------- MAIN -------- */
int main(int argc, char **argv)
{
    // Setting num_threads
    int num_threads = 512;

    // Making non-bogus data and setting it on the GPU
    const int size = 1024;
    const int size_out = 1;
    float * d_in;
    float * d_out;
    cudaMalloc(&d_in, sizeof(float)*size);
    cudaMalloc((void**)&d_out, sizeof(float)*size_out);
    //const int value = 5;
    //cudaMemset(d_in, value, sizeof(float)*size);
    float * h_in = (float *)malloc(size*sizeof(float));
    for (int i = 0; i < size; i++) h_in[i] = 1.0f;
    cudaMemcpy(d_in, h_in, sizeof(float)*size, cudaMemcpyHostToDevice);

    // Running kernel wrapper
    reduce(d_out, d_in, size, num_threads);

    float result;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum is element is: %.f\n", result);
}

Why is my CUDA kernel returning old values?

Kind of almost at the point of ripping my hair out over this issue.
I have a CUDA kernel that does some math on data stored in a 3D array. While testing this, I used to assign some values (non-zero) to the array and observe the results. I have since commented out those lines, but the result is still the same. It is as if it is completely ignoring the fact that I'm doing a memset to 0.
The code works correctly when I step through it in Debug... But not in Release! My guess is I have a memory leak from this matrix.
I allocate this array as:
cudaExtent m_extent = make_cudaExtent(sizeof(float)*matdim.x, matdim.y, matdim.z); // width, height, depth
cudaPitchedPtr m_device;
cudaMalloc3D(&m_device, m_extent);
cudaMemset3D(m_device, 0, m_extent);
I call the kernel in a loop like this:
for (int iter = 0; iter < gpu_iterations; iter++)
{
    PF_iteration_kernel<<<grids,threads>>>(m_device, m_extent, matdim);
    cudaDeviceSynchronize();
}
After which I release the m_device pitched pointer:
cudaFree(m_device.ptr);
matdim is just matrix dimensions held by a dim3.
Within the kernel I do the following (well, I commented everything functional out...):
__global__ void PF_iteration_kernel(cudaPitchedPtr mPtr, cudaExtent mExt, dim3 matrix_dimensions)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    // Find location within the pitched memory
    char *m = (char*)mPtr.ptr;
    int sof = sizeof(float);
    size_t pitch = mPtr.pitch;
    size_t slice_pitch = pitch*mExt.height;
    char* m_addroff = m + y * pitch + x * sof;

    printf("m(%d,%d) is %f \n", x, y, *m_addroff); // display the slice
    *m_addroff = 0; // WILL THIS RESET IT?!
    __syncthreads();
}
That should be just showing 0s, but it displays my old values (25, 26, 27, 28, etc).
I have cleaned and re-cleaned and re-built everything several times. I have relaunched the IDE.
My IDE is Visual Studio 2010 With NSight 4.6 (CUDA 7.0).
I am on Windows 7 x64
Consider this
char* m_addroff = m + y * pitch + x * sof;
printf("m(%d,%d) is %f \n", x, y, *m_addroff);
The compiler will see a char and promote it to an int pushed onto the stack - not the float promoted to double that the %f format requires.
The compiler does not adjust the arguments to fit the format spec, but some compilers will examine the format specs and warn of problems.
I suggest you cast the argument. I risk guessing and failing, but something like this:
printf("m(%d,%d) is %f \n", x, y, *(float*)m_addroff);
Here is a simple example.
#include <stdio.h>

int main()
{
    char car[4] = {0};
    char *cptr = car;
    printf("Hello %f\n", *(float*)cptr);
    return 0;
}

Wrong results with CUDA threads writing on private locations in global memory

EDIT 3:
I need each thread to write to and read from a private location in global memory. Below I post working code showing my problem. In the following, I'll list the main variables and structures involved.
Variables:
srcArr_h (host) --> srcArr_d (device) : array of random floats in the range [0, COLORLEVELS] with dimensions given by ARRDIM
auxD (device) : array of dimension ARRDIM * ARRDIM holding the final result in device
auxH (host) : array of dimension ARRDIM * ARRDIM holding the final result in host
c_glob_d (device) : array that reserves a private location of COLORLEVELS floats for each thread, with size given by num_threads * COLORLEVELS
idx (device) : identification number of current thread
My problem: in the kernel, I update c_glob[idx] for each value ic (ic ∈ [0, COLORLEVELS]), i.e. c_glob[idx][ic]. I use c_glob[idx][COLORLEVELS] to compute the final result g0 stored in auxD. My problem is that my final results are wrong. The results copied to auxH show that I get numbers at least one order of magnitude bigger than expected, or even weird numbers suggesting my operation is likely to overflow.
Help: what am I doing wrong? How can I make each thread write to and read from its private location in global memory? Right now I'm debugging with ARRDIM = 512, but my goal is to make it work for ARRDIM ~ 10^4, thus creating a c_glob array for 10^4 * 10^4 threads. I guess I will have issues with the total number of threads allowed per run, so I was wondering if you could suggest any other solution to my problem.
Thank you.
#include <string>
#include <stdint.h>
#include <iostream>
#include <stdio.h>
#include "cuPrintf.cu"

using namespace std;

#define ARRDIM 512
#define COLORLEVELS 4

__global__ void gpuKernel
(
    float *sa, float *aux,
    size_t memPitchAux, int w,
    float *c_glob
)
{
    float sc_loc[COLORLEVELS];
    float g0 = 0.0f;

    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;

    int idx = tidy * memPitchAux/4 + tidx;

    for(int ic=0; ic<COLORLEVELS; ic++)
    {
        sc_loc[ic] = ((float)(ic*ic));
    }

    for(int is=0; is<COLORLEVELS; is++)
    {
        int ic = fabs(sa[tidy*w + tidx]);
        c_glob[tidy * COLORLEVELS + tidx + ic] += 1.0f;
    }

    for(int ic=0; ic<COLORLEVELS; ic++)
    {
        g0 += c_glob[tidy * COLORLEVELS + tidx + ic]*sc_loc[ic];
    }

    aux[idx] = g0;
}
int main(int argc, char* argv[])
{
    /*
     * array src host and device
     */
    int heightSrc = ARRDIM;
    int widthSrc = ARRDIM;

    cudaSetDevice(0);

    float *srcArr_h, *srcArr_d;
    size_t nBytesSrcArr = sizeof(float)*heightSrc * widthSrc;

    srcArr_h = (float *)malloc(nBytesSrcArr);       // Allocate array on host
    cudaMalloc((void **) &srcArr_d, nBytesSrcArr);  // Allocate array on device
    cudaMemset((void*)srcArr_d, 0, nBytesSrcArr);   // set to zero

    int totArrElm = heightSrc*widthSrc;

    for(int ic=0; ic<totArrElm; ic++)
    {
        srcArr_h[ic] = (float)(rand() % COLORLEVELS);
    }

    cudaMemcpy(srcArr_d, srcArr_h, nBytesSrcArr, cudaMemcpyHostToDevice);

    /*
     * auxiliary buffer auxD to save final results
     */
    float *auxD;
    size_t auxDPitch;
    cudaMallocPitch((void**)&auxD, &auxDPitch, widthSrc*sizeof(float), heightSrc);
    cudaMemset2D(auxD, auxDPitch, 0, widthSrc*sizeof(float), heightSrc);

    /*
     * auxiliary buffer auxH allocation + initialization on host
     */
    size_t auxHPitch;
    auxHPitch = widthSrc*sizeof(float);
    float *auxH = (float *) malloc(heightSrc*auxHPitch);

    /*
     * kernel launch specs
     */
    int thpb_x = 16;
    int thpb_y = 16;
    int blpg_x = (int) widthSrc/thpb_x;
    int blpg_y = (int) heightSrc/thpb_y;
    int num_threads = blpg_x * thpb_x + blpg_y * thpb_y;

    /*
     * c_glob: array that reserves a private location of COLORLEVELS floats for each thread
     */
    int cglob_w = COLORLEVELS;
    int cglob_h = num_threads;

    float *c_glob_d;
    size_t c_globDPitch;
    cudaMallocPitch((void**)&c_glob_d, &c_globDPitch, cglob_w*sizeof(float), cglob_h);
    cudaMemset2D(c_glob_d, c_globDPitch, 0, cglob_w*sizeof(float), cglob_h);

    /*
     * kernel launch
     */
    dim3 dimBlock(thpb_x, thpb_y, 1);
    dim3 dimGrid(blpg_x, blpg_y, 1);
    gpuKernel<<<dimGrid,dimBlock>>>(srcArr_d, auxD, auxDPitch, widthSrc, c_glob_d);
    cudaThreadSynchronize();

    cudaMemcpy2D(auxH, auxHPitch,
                 auxD, auxDPitch,
                 auxHPitch, heightSrc,
                 cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();

    float min = auxH[0];
    float max = auxH[0];
    float f;
    string str;

    for(int i=0; i<widthSrc*heightSrc; i++)
    {
        if(min > auxH[i])
            min = auxH[i];
        if(max < auxH[i])
            max = auxH[i];
    }

    cudaFree(srcArr_d);
    cudaFree(auxD);
    cudaFree(c_glob_d);
}
You decided not to show either the whole code or a reduced-size version reproducing your problem. Therefore, it has not been possible to run tests and verify the possible solution below.
I think you have spotted the source of the problem: multiple threads are trying to write to the same memory locations in parallel. This is a situation leading to race conditions. For an example, see the fourth slide of the presentation "CUDA C: race conditions, atomics, locks, mutex, and warps".
Race conditions have a brute-force solution: atomic functions. They are described at Section B.12 of the CUDA C Programming Guide. So you can try to fix your problem by changing the line
c[ic] += 1.0f;
to
atomicAdd(&c[ic], 1.0f);
You will pay for this fix with performance: atomic operations serialize the code to avoid race conditions.
I have mentioned that atomic functions are a brute-force solution to your problem because it may be that, by properly rethinking the implementation, you can find a way to avoid them. But that is not possible to say as of now, due to the very few details you provided.
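Applied to the indexing that actually appears in your posted kernel, the change would look something like the line below (assuming that indexing expression is what you intend; whether it really gives each thread a private region is a separate question):
// Sketch: atomic update of the global counter array in the posted kernel.
atomicAdd(&c_glob[tidy * COLORLEVELS + tidx + ic], 1.0f);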

CUDA kernel error when increasing thread number

I am developing a CUDA ray-plane intersection kernel.
Let's suppose, my plane (face) struct is:
typedef struct _Face {
    int ID;
    int matID;
    int V1ID;
    int V2ID;
    int V3ID;
    float V1[3];
    float V2[3];
    float V3[3];
    float reflect[3];
    float emmision[3];
    float in[3];
    float out[3];
    int intersects[RAYS];
} Face;
I pasted the whole struct so you can get an idea of its size. RAYS equals 625 in the current configuration. In the following code, assume that the size of the faces array is e.g. 1270 (generally, thousands).
Now until today I have launched my kernel in a very naive way:
const int tpb = 64; //threads per block
dim3 grid = (n +tpb-1)/tpb; // n - face count in array
dim3 block = tpb;
//.. some memory allocation etc.
theKernel<<<grid,block>>>(dev_ptr, n);
and inside the kernel I had a loop:
__global__ void theKernel(Face* faces, int faceCount) {
    int offset = threadIdx.x + blockIdx.x*blockDim.x;
    if(offset >= faceCount)
        return;
    Face f = faces[offset];
    //..some initialization
    int RAY = -1;

    for(float alpha=0.0f; alpha<=PI; alpha+= alpha_step ){
        for(float beta=0.0f; beta<=PI; beta+= beta_step ){
            RAY++;
            //..calculation per ray in (alpha,beta) direction ...
            faces[offset].intersects[RAY] = ...; //some assignment
This is about it. I looped through all the directions and updated the faces array. It worked correctly, but was hardly any faster than the CPU code.
So today I tried to optimize the code, and launch the kernel with a much bigger number of threads. Instead of having 1 thread per face I want 1 thread per face's ray (meaning 625 threads work for 1 face). The modifications were simple:
dim3 grid = (n*RAYS +tpb-1)/tpb; //before launching . RAYS = 625, n = face count
and the kernel itself:
__global__ void theKernel(Face *faces, int faceCount){
    int threadNum = threadIdx.x + blockIdx.x*blockDim.x;
    int offset = threadNum/RAYS;            //RAYS is a global #define
    int rayNum = threadNum - offset*RAYS;
    if(offset >= faceCount || rayNum != 0)
        return;
    Face f = faces[offset];
    //initialization and the rest.. again ..
And this code does not work at all. Why? Theoretically, only the 1st thread (of the 625 per Face) should work, so why does this result in bad (hardly any) computation?
Kind regards,
e.
The maximum size of a grid in any dimension is 65535 (CUDA programming guide, Appendix F). If your grid size was 1000 before the change, you have increased it to 625000. That's bigger than the limit, so the kernel won't run correctly.
If you define the grid size as
dim3 grid((n + tpb - 1) / tpb, RAYS);
then all grid dimensions will be smaller than the limit. You'll also have to change the way blockIdx is used in the kernel.
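A minimal sketch of how the kernel indexing might look with that 2D grid (assuming blockIdx.y carries the ray number; the per-ray math is omitted and is not taken from the posted code):
// Sketch of the indexing for grid((n + tpb - 1) / tpb, RAYS).
__global__ void theKernel(Face *faces, int faceCount)
{
    int offset = blockIdx.x * blockDim.x + threadIdx.x;  // face index
    int rayNum = blockIdx.y;                             // ray index in [0, RAYS)
    if (offset >= faceCount)
        return;
    // ... derive the (alpha, beta) direction from rayNum and run the intersection test ...
    // faces[offset].intersects[rayNum] = <result of the test>;
}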
As Heatsink pointed out, you are probably exceeding available resources. It is a good idea to check after kernel execution whether there was an error.
Here is the C++ code I use:
#include <cutil_inline.h>
#include <iostream>   // for std::cerr

void
check_error(const char* str, cudaError_t err_code) {
    if (err_code != ::cudaSuccess)
        std::cerr << str << " -- " << cudaGetErrorString(err_code) << "\n";
}
Then when I invoke a kernel:
my_kernel <<<block_grid, thread_grid >>>(args);
check_error("my_kernel", cudaGetLastError());