CUDA kernel not updating the output data - c++

Ok, so the main idea of the task is to calculate the average of multiple images, I have it running in the normal way so I thought I will give it a go using CUDA, but unfortunately what I receive in the output is the first image instead of the average. (Inside the Kernel I also tried to set some pixels to 0 to make sure something is happening but no luck..)
////My kernel:
//nImages - number of images in the memory
//nBytes - number of pixels*color per image (also it's a size of dataOut)
//nImages*nBytes gives us the size of dataIn
//nBatch - dataIn has 1 milion bytes per image, we run in 6144 threads, so we need 163 batches to calc the whole dataOut
__global__
void avg_arrays(unsigned char* cuDataIn, unsigned char* cuDataOut, int nImages, int nBytes, int nBatch)
{
//get the position of the correct byte
int j = threadIdx.x + nBatch;
//if we're outside of image then give up
if(j >= nBytes) return;
//proceed averaging
long lSum = 0;
for(int i=0; i < nImages; ++i)
lSum += cuDataIn[i*nBytes + j];
lSum = lSum / nImages;
cuDataOut[j] = lSum;
}
Memory allocation etc.
unsigned char* dataIn = 0;
unsigned char* dataOut= 0;
// Allocate and Transfer memory to the devicea
gpuErrchk( cudaMalloc((void**)&dataIn, nPixelCountBGR * nNumberOfImages * sizeof(unsigned char))); //dataIn
gpuErrchk( cudaMalloc((void**)&dataOut, nPixelCountBGR * sizeof(unsigned char))); //dataOut
gpuErrchk( cudaMemcpy(dataIn, bmps, nPixelCountBGR * nNumberOfImages * sizeof(unsigned char), cudaMemcpyHostToDevice )); //dataIn
gpuErrchk( cudaMemcpy(dataOut, basePixels, nPixelCountBGR * sizeof(unsigned char), cudaMemcpyHostToDevice )); //dataOut
// Perform the array addition
dim3 dimBlock(N);
dim3 dimGrid(1);
//do it in batches, unless it's possible to run more threads at once, anyway N is a number of max threads
for(int i=0; i<nPixelCountBGR; i+=N){
cout << "Running with: nImg: "<< nNumberOfImages << ", nPixBGR " << nPixelCountBGR << ", and i = " << i << endl;
avg_arrays<<<dimGrid, dimBlock>>>(dataIn, dataOut, nNumberOfImages, nPixelCountBGR, 0);
}
// Copy the Contents from the GPU
gpuErrchk(cudaMemcpy(basePixels, dataOut, nPixelCountBGR * sizeof(unsigned char), cudaMemcpyDeviceToHost));
gpuErrchk(cudaFree(dataOut));
gpuErrchk(cudaFree(dataIn));
The error check doesn't bring any messages, all the code runs smoothly, all I get at the end is the exact copy of the first image.
Just in case if someone needs here's some console output:
Running with: nImg: 29, nPixBGR 1228800, and i = 0
...
Running with: nImg: 29, nPixBGR 1228800, and i = 1210368
Running with: nImg: 29, nPixBGR 1228800, and i = 1216512
Running with: nImg: 29, nPixBGR 1228800, and i = 1222656
Time of averaging: 0.219

If N is greater than 512 or 1024 (depending on which GPU you are running on, which you don't mention), then this is invalid:
dim3 dimBlock(N);
because you can't launch a kernel with greater than 512 or 1024 threads per block:
avg_arrays<<<dimGrid, dimBlock>>>(...
^
|
this is limited to 512 or 1024
If you study proper cuda error checking and apply it to your kernel launch, you'll trap this kind of error.

Related

2D tiled convolution taking more time than untiled version

Writing a code that perform a 2D convolution on a float matrix, in both tiled and untiled version. I'm assuming the width of the tile as
BLOCK_SIZE - MASK_WIDTH + 1
, using halo cells.
But for a 1024 matrix and masks varing from 3 to 9 I get the untiled version performing better:
untiled version
vs
tiled
Both matrix and mask are defined in a constant manner, equal for tiled and untiled. No random values/sizes used.
I guess I'm doing some wrong assumption about the tile size, but even after doing some research the implementation seems quite legit.
#define MATRIX_SIZE 1024
#define BLOCK_WIDTH 32
Here's the kernel code for the tiled version
__global__ void convolution_2D_tiled(float* in, const float* __restrict__ mask, float* out, size_t mask_width, size_t w, size_t h) {
float outputPixel = 0; //minimize write to global memory: stored in register
int tx = threadIdx.x;
int ty = threadIdx.y;
int tile_width = BLOCK_WIDTH - mask_width + 1; //since BLOCK_WIDTH = TILE_WIDTH + MASK_WIDTH - 1
int col = blockIdx.x * tile_width + tx;
int row = blockIdx.y * tile_width + ty;
//picking the starting indexes of input matrix inside the mask
//(TOP-LEFT of the mask)
int inputRow = row - (mask_width / 2);
int inputCol = col - (mask_width / 2);
__shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];
// Load tile elements
if (inputRow >= 0 && inputRow < h && inputCol >= 0 && inputCol < w)
tile[ty][tx] = in[inputRow * w + inputCol];
else
tile[ty][tx] = 0.0;
// Wait until all tile elements are loaded
__syncthreads();
//some thread won't write any outputs, only need to calculate tile_width elements
if (col < w && row < h && ty < tile_width && tx < tile_width) {
//get the neighbour in the mask
for (int i = 0; i < mask_width; ++i) {
for (int j = 0; j < mask_width; ++j) { //(Mask_Width^2) access for each thread in block -> for each block (Mask_Width^2) * (Block_width^2)
outputPixel += tile[i + ty][j + tx] * mask[i * mask_width + j];
}
}
out[(row * w) + col] = (float)(outputPixel);
}
}
The main with the matrix generation and sizes assumptions:
void errorCheck(unsigned int line){
cudaError_t cudaError = cudaGetLastError();
// if error code wasn't a code describing success
if (cudaError != cudaSuccess)
{
// output that there has been a CUDA error in the line of the CUDA function call
// and exit the program
printf("CUDA error in line %u in file %s: %s\n", line - 1, __FILE__, cudaGetErrorString(cudaError));
exit(EXIT_FAILURE);
}}
int main(int argc, char const* argv[]){
for (size_t mask_width = 3; mask_width <= 9; mask_width += 2) {
printf("Testing with mask size = %d\n\n", mask_width);
float* a;
float* b;
float* c;
cudaMallocManaged((void **) &a, sizeof(float)*MATRIX_SIZE*MATRIX_SIZE);
cudaMallocManaged((void **) &b, sizeof(int)*mask_width*mask_width);
cudaMallocManaged((void **) &c, sizeof(int)*MATRIX_SIZE*MATRIX_SIZE);
// initialize matrix A
for (int i = 0; i < MATRIX_SIZE; ++i) {
for (int j = 0; j < MATRIX_SIZE; ++j) {
a[i * MATRIX_SIZE + j] = (float)(1 +(3 * j % 20));
}
}
// initialize matrix B
for (int i = 0; i < mask_width; ++i) {
for (int j = 0; j < mask_width; ++j) {
b[i * mask_width + j] = (float)(1 + (((2 * i) + j) % mask_width));
}
}
float naive_gpu_elapsed_time_ms;
// some events to count the execution time
//clock_t st, end;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
int tile_width = BLOCK_WIDTH - mask_width + 1;
dim3 dimGrid(MATRIX_SIZE / tile_width, MATRIX_SIZE / tile_width);
dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);
errorCheck(__LINE__);
cudaEventRecord(start, 0);
convolution_2D_tiled <<<dimGrid, dimBlock >>> (a, b, c, mask_width, MATRIX_SIZE, MATRIX_SIZE);
errorCheck(__LINE__);
cudaThreadSynchronize();
//time counting terminate
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
//compute time elapsed on GPU computing
cudaEventElapsedTime(&naive_gpu_elapsed_time_ms, start, stop);
printf("Time elapsed on naive GPU convolution 2d tiled ( %d ) block %f ms.\n\n", BLOCK_WIDTH, naive_gpu_elapsed_time_ms);
//free memory
cudaFree(a);
cudaFree(b);
cudaFree(c);
printf("________________________________________________________________________\n\n");
}
return 0;
}
I'm using google colab with Tesla T4 GPU, and no CUDA error is thrown.
Also tried to use bigger masks (11, 15 ..) but no changes in comparison between tiled and untiled.
You are making inefficient usage of managed memory as discussed here and here.
Nearly all of your ~2ms of execution time is used in inefficient demand-paged copying of data from host to device. As a result, your ability to resolve the difference in performance in the two cases due to the device code changes is almost completely obscured.
If you add these 3 lines of code immediately before float naive_gpu_elapsed_time_ms;, you will observe that your reported execution times decrease dramatically, and you should be able to better judge the performance difference between the shared memory tiled version and the non-tiled version:
cudaMemPrefetchAsync(a, sizeof(float)*MATRIX_SIZE*MATRIX_SIZE, 0);
cudaMemPrefetchAsync(b, sizeof(int)*mask_width*mask_width, 0);
cudaMemPrefetchAsync(c, sizeof(int)*MATRIX_SIZE*MATRIX_SIZE, 0);
You haven't shown your non-tiled code, so I can't demonstrate that for you. Here's an example profiling output using a non-tiled convolution code that I wrote, comparing to your tiled kernel, and including the cudaMemPrefetchAsync() statements:
$ nvprof ./t2140
Testing with mask size = 3
==13236== NVPROF is profiling process 13236, command: ./t2140
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.032832 ms.
________________________________________________________________________
Testing with mask size = 5
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.061120 ms.
________________________________________________________________________
Testing with mask size = 7
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.086080 ms.
________________________________________________________________________
Testing with mask size = 9
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.118688 ms.
________________________________________________________________________
==13236== Profiling application: ./t2140
==13236== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 52.59% 311.69us 4 77.922us 41.089us 119.08us convolution_2D(float*, float const *, float*, unsigned long, unsigned long, unsigned long)
47.41% 280.97us 4 70.241us 28.449us 114.28us convolution_2D_tiled(float*, float const *, float*, unsigned long, unsigned long, unsigned long)
API calls: 96.10% 365.32ms 12 30.443ms 12.906us 365.10ms cudaMallocManaged
1.32% 5.0301ms 4 1.2575ms 586.91us 3.2433ms cuDeviceTotalMem
0.66% 2.4917ms 404 6.1670us 320ns 268.82us cuDeviceGetAttribute
0.56% 2.1277ms 12 177.31us 8.3020us 578.90us cudaMemPrefetchAsync
0.50% 1.9035ms 4 475.88us 295.08us 549.01us cudaDeviceSynchronize
0.49% 1.8594ms 12 154.95us 75.533us 328.85us cudaFree
0.14% 526.53us 4 131.63us 42.014us 220.14us cudaEventSynchronize
0.11% 399.28us 4 99.820us 61.310us 210.74us cuDeviceGetName
0.09% 351.52us 8 43.940us 11.426us 116.52us cudaLaunchKernel
0.01% 45.911us 8 5.7380us 4.1870us 10.243us cudaEventRecord
0.01% 25.946us 8 3.2430us 935ns 10.182us cudaEventCreate
0.01% 21.643us 4 5.4100us 3.1450us 8.6700us cuDeviceGetPCIBusId
0.00% 10.304us 8 1.2880us 430ns 5.0980us cuDeviceGet
0.00% 9.6790us 4 2.4190us 1.9560us 3.7180us cudaEventElapsedTime
0.00% 3.3390us 3 1.1130us 617ns 1.6520us cuDeviceGetCount
0.00% 3.2480us 4 812ns 700ns 1.0470us cuDeviceGetUuid
0.00% 3.1420us 8 392ns 229ns 1.2110us cudaGetLastError
==13236== Unified Memory profiling result:
Device "Tesla V100-PCIE-32GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
12 1.3346MB 4.0000KB 2.0000MB 16.01563MB 1.405760ms Host To Device
Total CPU Page faults: 52
$
You can see that in each case, the tiled/shared memory kernel is faster.

Cuda: XOR single bitset with array of bitsets

I want to XOR a single bitset with a bunch of other bitsets (~100k) and count the set bits of every xor-result. The size of a single bitset is around 20k bits.
The bitsets are already converted to arrays of unsigned int to be able to use the intrinsic __popc()-function. The 'bunch' is already residing contiguously in device-memory.
My current kernel code looks like this:
// Grid/Blocks used for kernel invocation
dim3 block(32);
dim3 grid((bunch_size / 31) + 32);
__global__ void kernelXOR(uint * bitset, uint * bunch, int * set_bits, int bitset_size, int bunch_size) {
int tid = blockIdx.x*blockDim.x + threadIdx.x;
if (tid < bunch_size){ // 1 Thread for each bitset in the 'bunch'
int sum = 0;
uint xor_res = 0;
for (int i = 0; i < bitset_size; ++i){ // Iterate through every uint-block of the bitsets
xor_res = bitset[i] ^ bunch[bitset_size * tid + i];
sum += __popc(xor_res);
}
set_bits[tid] = sum;
}
}
However, compared to a parallelized c++/boost version, I see no benefit using Cuda.
Is there any potential in optimizing this kernel?
Is there any potential in optimizing this kernel?
I see 2 problems here (and they are the first two classical primary optimizations objectives for any CUDA programmer):
You want to try to efficiently use global memory. Your accesses to bitset and bunch are not coalesced. (efficiently use the memory subsystems)
The use of 32 threads per block is generally not recommended and could limit your overall occupancy. One thread per bitset is also potentially problematic. (expose enough parallelism)
Whether addressing those issues will meet your definition of benefit is impossible to say without a comparison test case. Furthermore, simple memory-bound problems like this are rarely interesting in CUDA when considered by themselves. However, we can (probably) improve the performance of your kernel.
We'll use a laundry list of ideas:
have each block handle a bitset, rather than each thread, to enable coalescing
use shared memory to load the comparison bitset, and reuse it
use just enough blocks to saturate the GPU, along with striding loops
use const ... __restrict__ style decoration to possibly benefit from RO cache
Here's a worked example:
$ cat t1649.cu
#include <iostream>
#include <cstdlib>
const int my_bitset_size = 20000/(32);
const int my_bunch_size = 100000;
typedef unsigned uint;
//using one thread per bitset in the bunch
__global__ void kernelXOR(uint * bitset, uint * bunch, int * set_bits, int bitset_size, int bunch_size) {
int tid = blockIdx.x*blockDim.x + threadIdx.x;
if (tid < bunch_size){ // 1 Thread for each bitset in the 'bunch'
int sum = 0;
uint xor_res = 0;
for (int i = 0; i < bitset_size; ++i){ // Iterate through every uint-block of the bitsets
xor_res = bitset[i] ^ bunch[bitset_size * tid + i];
sum += __popc(xor_res);
}
set_bits[tid] = sum;
}
}
const int nTPB = 256;
// one block per bitset, multiple bitsets per block
__global__ void kernelXOR_imp(const uint * __restrict__ bitset, const uint * __restrict__ bunch, int * __restrict__ set_bits, int bitset_size, int bunch_size) {
__shared__ uint sbitset[my_bitset_size]; // could also be dynamically allocated for varying bitset sizes
__shared__ int ssum[nTPB];
// load shared, block-stride loop
for (int idx = threadIdx.x; idx < bitset_size; idx += blockDim.x) sbitset[idx] = bitset[idx];
__syncthreads();
// stride across all bitsets in bunch
for (int bidx = blockIdx.x; bidx < bunch_size; bidx += gridDim.x){
int my_sum = 0;
for (int idx = threadIdx.x; idx < bitset_size; idx += blockDim.x) my_sum += __popc(sbitset[idx] ^ bunch[bidx*bitset_size + idx]);
// block level parallel reduction
ssum[threadIdx.x] = my_sum;
for (int ridx = nTPB>>1; ridx > 0; ridx >>=1){
__syncthreads();
if (threadIdx.x < ridx) ssum[threadIdx.x] += ssum[threadIdx.x+ridx];}
if (!threadIdx.x) set_bits[bidx] = ssum[0];}
}
int main(){
// data setup
uint *d_cbitset, *d_bitsets, *h_cbitset, *h_bitsets;
int *d_r, *h_r, *h_ri;
h_cbitset = new uint[my_bitset_size];
h_bitsets = new uint[my_bitset_size*my_bunch_size];
h_r = new int[my_bunch_size];
h_ri = new int[my_bunch_size];
for (int i = 0; i < my_bitset_size*my_bunch_size; i++){
h_bitsets[i] = rand();
if (i < my_bitset_size) h_cbitset[i] = rand();}
cudaMalloc(&d_cbitset, my_bitset_size*sizeof(uint));
cudaMalloc(&d_bitsets, my_bitset_size*my_bunch_size*sizeof(uint));
cudaMalloc(&d_r, my_bunch_size*sizeof(int));
cudaMemcpy(d_cbitset, h_cbitset, my_bitset_size*sizeof(uint), cudaMemcpyHostToDevice);
cudaMemcpy(d_bitsets, h_bitsets, my_bitset_size*my_bunch_size*sizeof(uint), cudaMemcpyHostToDevice);
// original
// Grid/Blocks used for kernel invocation
dim3 block(32);
dim3 grid((my_bunch_size / 31) + 32);
kernelXOR<<<grid, block>>>(d_cbitset, d_bitsets, d_r, my_bitset_size, my_bunch_size);
cudaMemcpy(h_r, d_r, my_bunch_size*sizeof(int), cudaMemcpyDeviceToHost);
// improved
dim3 iblock(nTPB);
dim3 igrid(640);
kernelXOR_imp<<<igrid, iblock>>>(d_cbitset, d_bitsets, d_r, my_bitset_size, my_bunch_size);
cudaMemcpy(h_ri, d_r, my_bunch_size*sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < my_bunch_size; i++)
if (h_r[i] != h_ri[i]) {std::cout << "mismatch at i: " << i << " was: " << h_ri[i] << " should be: " << h_r[i] << std::endl; return 0;}
std::cout << "Results match." << std::endl;
return 0;
}
$ nvcc -o t1649 t1649.cu
$ cuda-memcheck ./t1649
========= CUDA-MEMCHECK
Results match.
========= ERROR SUMMARY: 0 errors
$ nvprof ./t1649
==18868== NVPROF is profiling process 18868, command: ./t1649
Results match.
==18868== Profiling application: ./t1649
==18868== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 97.06% 71.113ms 2 35.557ms 2.3040us 71.111ms [CUDA memcpy HtoD]
2.26% 1.6563ms 1 1.6563ms 1.6563ms 1.6563ms kernelXOR(unsigned int*, unsigned int*, int*, int, int)
0.59% 432.68us 1 432.68us 432.68us 432.68us kernelXOR_imp(unsigned int const *, unsigned int const *, int*, int, int)
0.09% 64.770us 2 32.385us 31.873us 32.897us [CUDA memcpy DtoH]
API calls: 78.20% 305.44ms 3 101.81ms 11.373us 304.85ms cudaMalloc
18.99% 74.161ms 4 18.540ms 31.554us 71.403ms cudaMemcpy
1.39% 5.4121ms 4 1.3530ms 675.30us 3.3410ms cuDeviceTotalMem
1.26% 4.9393ms 388 12.730us 303ns 530.95us cuDeviceGetAttribute
0.11% 442.37us 4 110.59us 102.61us 125.59us cuDeviceGetName
0.03% 128.18us 2 64.088us 21.789us 106.39us cudaLaunchKernel
0.01% 35.764us 4 8.9410us 2.9670us 18.982us cuDeviceGetPCIBusId
0.00% 8.3090us 8 1.0380us 540ns 1.3870us cuDeviceGet
0.00% 5.9530us 3 1.9840us 310ns 3.9900us cuDeviceGetCount
0.00% 2.8800us 4 720ns 574ns 960ns cuDeviceGetUuid
$
In this case, on my Tesla V100, for your problem size, I witness about a 4x improvement in kernel performance. However the kernel performance here is tiny compared to the cost of data movement. So it's unlikely that these sort of optimizations would make a significant difference in your comparison test case, if this is the only thing you are doing on the GPU.
The code above uses striding-loops at the block level and at the grid level, which means it should behave correctly for almost any choice of threadblock size (multiple of 32 please) as well as grid size. That doesn't mean that any/all choices will perform equally. The choice of the threadblock size is to allow the possibility for nearly full occupancy (so don't choose 32). The choice of the grid size is the number of blocks to achieve full occupancy per SM, times the number of SMs. These should be nearly optimal choices, but according to my testing e.g. a larger number of blocks doesn't really reduce performance, and the performance should be roughly constant for nearly any threadblock size (except 32), assuming the number of blocks is calculated accordingly.

CUDA: Reduce algorithm

I am new to C++/CUDA. I tried implementing the parallel algorithm "reduce" with ability to handle any type of inputsize, and threadsize without increasing asymptotic parallel runtime by recursing over the output of the kernel (in the kernel wrapper).
e.g. Implementing Max Reduce in Cuda the top answer to this question, his/hers implementation will essentially be sequential when threadsize is small enough.
However, I keep getting a "Segmentation fault" when I compile and run it ..?
>> nvcc -o mycode mycode.cu
>> ./mycode
Segmentail fault.
Compiled on a K40 with cuda 6.5
Here is the kernel, basically same as the SO post I linked the checker for "out of bounds" is different:
#include <stdio.h>
/* -------- KERNEL -------- */
__global__ void reduce_kernel(float * d_out, float * d_in, const int size)
{
// position and threadId
int pos = blockIdx.x * blockDim.x + threadIdx.x;
int tid = threadIdx.x;
// do reduction in global memory
for (unsigned int s = blockDim.x / 2; s>0; s>>=1)
{
if (tid < s)
{
if (pos+s < size) // Handling out of bounds
{
d_in[pos] = d_in[pos] + d_in[pos+s];
}
}
}
// only thread 0 writes result, as thread
if (tid==0)
{
d_out[blockIdx.x] = d_in[pos];
}
}
The kernel wrapper I mentioned to handle when 1 block will not contain all of the data.
/* -------- KERNEL WRAPPER -------- */
void reduce(float * d_out, float * d_in, const int size, int num_threads)
{
// setting up blocks and intermediate result holder
int num_blocks = ((size) / num_threads) + 1;
float * d_intermediate;
cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
// recursively solving, will run approximately log base num_threads times.
do
{
reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);
// updating input to intermediate
cudaMemcpy(d_in, d_intermediate, sizeof(float)*num_blocks, cudaMemcpyDeviceToDevice);
// Updating num_blocks to reflect how many blocks we now want to compute on
num_blocks = num_blocks / num_threads + 1;
// updating intermediate
cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
}
while(num_blocks > num_threads); // if it is too small, compute rest.
// computing rest
reduce_kernel<<<1, num_blocks>>>(d_out, d_in, size);
}
Main program to initialize in/out and create bogus data for testing.
/* -------- MAIN -------- */
int main(int argc, char **argv)
{
// Setting num_threads
int num_threads = 512;
// Making bogus data and setting it on the GPU
const int size = 1024;
const int size_out = 1;
float * d_in;
float * d_out;
cudaMalloc(&d_in, sizeof(float)*size);
cudaMalloc((void**)&d_out, sizeof(float)*size_out);
const int value = 5;
cudaMemset(d_in, value, sizeof(float)*size);
// Running kernel wrapper
reduce(d_out, d_in, size, num_threads);
printf("sum is element is: %.f", d_out[0]);
}
There are a few things I would point out with your code.
As a general rule/boilerplate, I always recommend using proper cuda error checking and run your code with cuda-memcheck, any time you are having trouble with a cuda code. However those methods wouldn't help much with the seg fault, although they may help later (see below).
The actual seg fault is occurring on this line:
printf("sum is element is: %.f", d_out[0]);
you've broken a cardinal CUDA programming rule: host pointers must not be dereferenced in device code, and device pointers must not be dereferenced in host code. This latter condition applies here. d_out is a device pointer (allocated via cudaMalloc). Such pointers have no meaning if you attempt to dereference them in host code, and doing so will lead to a seg fault.
The solution is to copy the data back to the host before printing it out:
float result;
cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
printf("sum is element is: %.f", result);
Using cudaMalloc in a loop, on the same variable, without doing any cudaFree operations, is not good practice, and may lead to out-of-memory errors in long-running loops, and may also lead to programs with memory leaks, if such a construct is used in a larger program:
do
{
...
cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
}
while...
in this case I think a better approach and trivial fix would be to cudaFree d_intermediate right before you re-allocate:
do
{
...
cudaFree(d_intermediate);
cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
}
while...
This might not be doing what you think it is:
const int value = 5;
cudaMemset(d_in, value, sizeof(float)*size);
probably you are aware of this, but cudaMemset, like memset, operates on byte quantities. So you are filling the d_in array with a value corresponding to 0x05050505 (and I have no idea what that bit pattern corresponds to when interpreted as a float quantity). Since you refer to bogus values, you may be cognizant of this already. But it's a common error (e.g. if you were actually trying to initialize the array with the value of 5 in every float location), so I thought I would point it out.
Your code has other issues as well (which you will discover if you make the above fixes then run your code with cuda-memcheck). To learn about how to do good parallel reductions, I would recommend studying the CUDA parallel reduction sample code and presentation. Parallel reductions in global memory are not recommended for performance reasons.
For completeness, here are some of the additional issues I found:
Your kernel code needs an appropriate __syncthreads() statement to ensure that the work of all threads in a block are complete before any threads go onto the next iteration of the for-loop.
Your final write to global memory in the kernel needs to also be conditioned on the read-location being in-bounds. Otherwise, your strategy of always launching an extra block would allow the read from this line to be out-of-bounds (cuda-memcheck will show this).
The reduction logic in your loop in the reduce function is generally messed up and needed to be re-worked in several ways.
I'm not saying this code is defect-free, but it seems to work for the given test case and produce the correct answer (1024):
#include <stdio.h>
/* -------- KERNEL -------- */
__global__ void reduce_kernel(float * d_out, float * d_in, const int size)
{
// position and threadId
int pos = blockIdx.x * blockDim.x + threadIdx.x;
int tid = threadIdx.x;
// do reduction in global memory
for (unsigned int s = blockDim.x / 2; s>0; s>>=1)
{
if (tid < s)
{
if (pos+s < size) // Handling out of bounds
{
d_in[pos] = d_in[pos] + d_in[pos+s];
}
}
__syncthreads();
}
// only thread 0 writes result, as thread
if ((tid==0) && (pos < size))
{
d_out[blockIdx.x] = d_in[pos];
}
}
/* -------- KERNEL WRAPPER -------- */
void reduce(float * d_out, float * d_in, int size, int num_threads)
{
// setting up blocks and intermediate result holder
int num_blocks = ((size) / num_threads) + 1;
float * d_intermediate;
cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
cudaMemset(d_intermediate, 0, sizeof(float)*num_blocks);
int prev_num_blocks;
// recursively solving, will run approximately log base num_threads times.
do
{
reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);
// updating input to intermediate
cudaMemcpy(d_in, d_intermediate, sizeof(float)*num_blocks, cudaMemcpyDeviceToDevice);
// Updating num_blocks to reflect how many blocks we now want to compute on
prev_num_blocks = num_blocks;
num_blocks = num_blocks / num_threads + 1;
// updating intermediate
cudaFree(d_intermediate);
cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
size = num_blocks*num_threads;
}
while(num_blocks > num_threads); // if it is too small, compute rest.
// computing rest
reduce_kernel<<<1, prev_num_blocks>>>(d_out, d_in, prev_num_blocks);
}
/* -------- MAIN -------- */
int main(int argc, char **argv)
{
// Setting num_threads
int num_threads = 512;
// Making non-bogus data and setting it on the GPU
const int size = 1024;
const int size_out = 1;
float * d_in;
float * d_out;
cudaMalloc(&d_in, sizeof(float)*size);
cudaMalloc((void**)&d_out, sizeof(float)*size_out);
//const int value = 5;
//cudaMemset(d_in, value, sizeof(float)*size);
float * h_in = (float *)malloc(size*sizeof(float));
for (int i = 0; i < size; i++) h_in[i] = 1.0f;
cudaMemcpy(d_in, h_in, sizeof(float)*size, cudaMemcpyHostToDevice);
// Running kernel wrapper
reduce(d_out, d_in, size, num_threads);
float result;
cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
printf("sum is element is: %.f\n", result);
}

NVIDIA Visual profiler does not generate a timeline

My question is almost same as the question [asked here at SO before][1]. But no answer has been provided to it so, I am asking a separate question.
I am using CUDA 7.0 toolkit on a Windows-7 OS. I am using VS-2013.
I tried to generate the timeline of vector addition sample program and it worked. But when I follow exactly same steps to generate a timeline of my own code then, it keep showing a message "Running application to generate timeline". I know that the kernel gets called and everything is working.
cudaDeviceReset() call is also there after finishing everything related to CUDA.
Program: I have changed my original question to provide a minimal working example which can produce the same problem. The following code is not generating a timeline using nvvp irrespective of the place where I put cudaDeviceReset().
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
//OpenCV
#include <opencv2/highgui.hpp>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <stdio.h>
using namespace cv;
__global__ void colorTransformation_kernel(int numChannels, int iw, int ih, unsigned char *ptr_source, unsigned char *ptr_dst)
{
// Calculate our pixel's location
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;
// Operate only if we are in the correct boundaries
if (x >= 0 && x < iw && y >= 0 && y < ih)
{
ptr_dst[numChannels* (iw*y + x) + 0] = ptr_source[numChannels* (iw*y + x) + 0];
ptr_dst[numChannels* (iw*y + x) + 1] = ptr_source[numChannels* (iw*y + x) + 1];
ptr_dst[numChannels* (iw*y + x) + 2] = ptr_source[numChannels* (iw*y + x) + 2];
}
}
int main()
{
while (1)
{
Mat image(400, 400, CV_8UC3, Scalar(0, 0, 255));
unsigned char *h_src = image.data;
size_t numBytes = image.rows * image.cols * 3;
int numChannels = 3;
unsigned char *dev_src, *dev_dst, *h_dst;
//Allocate memomry at device for SOURCE and DESTINATION and get their pointers
cudaMalloc((void**)&dev_src, numBytes * sizeof(unsigned char));
cudaMalloc((void**)&dev_dst, numBytes * sizeof(unsigned char));
////Copy the source image to the device i.e. GPU
cudaMemcpy(dev_src, h_src, numBytes * sizeof(unsigned char), cudaMemcpyHostToDevice);
////KERNEL
dim3 numOfBlocks(3 * (image.cols / 20), 3 * (image.rows / 20)); //multiplied by 3 because we have 3 channel image now
dim3 numOfThreadsPerBlocks(20, 20);
colorTransformation_kernel << <numOfBlocks, numOfThreadsPerBlocks >> >(numChannels, image.cols, image.rows, dev_src, dev_dst);
cudaDeviceSynchronize();
//Get the processed image
Mat org_dijSDK_img(image.rows, image.cols, CV_8UC3);
h_dst = org_dijSDK_img.data;
cudaMemcpy(h_dst, dev_dst, numBytes * sizeof(unsigned char), cudaMemcpyDeviceToHost);
//DISPLAY PROCESSED IMAGE
imshow("Processed dijSDK image", org_dijSDK_img);
waitKey(33);
}
cudaDeviceReset();
return 0;
}
Very Important Clue: If I comment the line while(1) and hence run the code only once then, the nvvp generates timeline. But in my original project, I cannot get the timeline profile by doing so because, it contain multi-threading and other stuff due to which, there is no image to process during the first run. So, I must need some way to generate the timeline with a code containing infinite while loop.
The problem in my code is the endless while loop due to which the cudaDeviceReset() were never being called. There are two possible solutions to deal with such situations:
If you are interested to have a look at timeline profiling only then, just comment your while loop and the nvvp would be able to reach the cudaDeviceReset() present at the end of main().
There might be a situation where you must keep a loop inside your program. For example, in my original project containing multi-threading, there is no image to process during initial 180 run of while loop. To deal with such situations, replace your while loop with the for loop which can run for limited number of times. For example, the following code has helped me to get a timeline profiling of 4 number of runs. I am posting only the modified main().
int main()
{
cudaStream_t stream_one;
cudaStream_t stream_two;
cudaStream_t stream_three;
//while (1)
for (int i = 0; i < 4; i++)
{
cudaStreamCreate(&stream_one);
cudaStreamCreate(&stream_two);
cudaStreamCreate(&stream_three);
Mat image = imread("DijSDK_test_image.jpg", 1);
//Mat image(1080, 1920, CV_8UC3, Scalar(0,0,255));
size_t numBytes = image.rows * image.cols * 3;
int numChannels = 3;
int iw = image.rows;
int ih = image.cols;
size_t totalMemSize = numBytes * sizeof(unsigned char);
size_t oneThirdMemSize = totalMemSize / 3;
unsigned char *dev_src_1, *dev_src_2, *dev_src_3, *dev_dst_1, *dev_dst_2, *dev_dst_3, *h_src, *h_dst;
//Allocate memomry at device for SOURCE and DESTINATION and get their pointers
cudaMalloc((void**)&dev_src_1, (totalMemSize) / 3);
cudaMalloc((void**)&dev_src_2, (totalMemSize) / 3);
cudaMalloc((void**)&dev_src_3, (totalMemSize) / 3);
cudaMalloc((void**)&dev_dst_1, (totalMemSize) / 3);
cudaMalloc((void**)&dev_dst_2, (totalMemSize) / 3);
cudaMalloc((void**)&dev_dst_3, (totalMemSize) / 3);
//Get the processed image
Mat org_dijSDK_img(image.rows, image.cols, CV_8UC3, Scalar(0, 0, 255));
h_dst = org_dijSDK_img.data;
//copy new data of image to the host pointer
h_src = image.data;
//Copy the source image to the device i.e. GPU
cudaMemcpyAsync(dev_src_1, h_src, (totalMemSize) / 3, cudaMemcpyHostToDevice, stream_one);
cudaMemcpyAsync(dev_src_2, h_src + oneThirdMemSize, (totalMemSize) / 3, cudaMemcpyHostToDevice, stream_two);
cudaMemcpyAsync(dev_src_3, h_src + (2 * oneThirdMemSize), (totalMemSize) / 3, cudaMemcpyHostToDevice, stream_three);
//KERNEL--stream-1
callMultiStreamingCudaKernel(dev_src_1, dev_dst_1, numChannels, iw, ih, &stream_one);
//KERNEL--stream-2
callMultiStreamingCudaKernel(dev_src_2, dev_dst_2, numChannels, iw, ih, &stream_two);
//KERNEL--stream-3
callMultiStreamingCudaKernel(dev_src_3, dev_dst_3, numChannels, iw, ih, &stream_three);
//RESULT copy: GPU to CPU
cudaMemcpyAsync(h_dst, dev_dst_1, (totalMemSize) / 3, cudaMemcpyDeviceToHost, stream_one);
cudaMemcpyAsync(h_dst + oneThirdMemSize, dev_dst_2, (totalMemSize) / 3, cudaMemcpyDeviceToHost, stream_two);
cudaMemcpyAsync(h_dst + (2 * oneThirdMemSize), dev_dst_3, (totalMemSize) / 3, cudaMemcpyDeviceToHost, stream_three);
// wait for results
cudaStreamSynchronize(stream_one);
cudaStreamSynchronize(stream_two);
cudaStreamSynchronize(stream_three);
//Assign the processed data to the display image.
org_dijSDK_img.data = h_dst;
//DISPLAY PROCESSED IMAGE
imshow("Processed dijSDK image", org_dijSDK_img);
waitKey(33);
}
cudaDeviceReset();
return 0;
}

Why is my CUDA code not working properly for zero filling a large matrix?

It is a simple CUDA code for initializing a big matrix (filling in zeros).
I output the first 1*3 matrix, if the code works. It should be all zeros.
If I set the matrix size to be small, then the program works properly. But when I make the size larger (> 43200 * 2400), what is inside the matrix are all garbage.
I had cudaDeviceSynchronize() append at the end of each CUDA functions already.
I am using NVIDIA Quadro K4200, Xeon E5-2630 with Ubuntu 14.04.
Thanks for anyone helping me here.
Attached below is my full code.
#include <stdio.h>
#include <math.h>
#include <iostream>
#include <cuComplex.h>
#define BLOCK_SIZE 16 // change it to 16 to get maximum performance
// populate the matrix using first row
__global__ void RepmatKernel (cuComplex *Mat, const unsigned int N, const unsigned int Cols)
{
unsigned int i = (unsigned int)blockIdx.x * (unsigned int)blockDim.x + (unsigned int)threadIdx.x;
if (i < N)
{
Mat[i].x = 0;
Mat[i].y = 0;
}
}
// main routine
int main ()
{
const unsigned int Rows = 43200;
const unsigned int Cols = 2400;
const unsigned int Num_thrd = 256; // max threads per block
unsigned int Mat_size = Rows * Cols; // size of array
cuComplex *vec; // supposedly the input
cuComplex *mat_debug; // for debug
vec = new cuComplex [Cols];
mat_debug = new cuComplex [Rows*Cols];
cuComplex *mat_in_d; // device array
//input in host array
for(unsigned int i = 0; i < Cols; i++)
{
vec[i].x = 3*i+4;
vec[i].y = 0.2*i+1;
}
const unsigned int size_mat_d = Rows * Cols * sizeof(cuComplex);
//create device array cudaMalloc ( (void **)&array_name, sizeofmatrixinbytes) ;
if (cudaMalloc((void **) &mat_in_d , size_mat_d) != cudaSuccess) std::cout<<"Error allocating GPU";
cudaDeviceSynchronize() ;
//copy host array to device array; cudaMemcpy ( dest , source , WIDTH , direction )
cudaMemcpy ( mat_in_d , vec , Cols , cudaMemcpyHostToDevice ) ;
cudaDeviceSynchronize() ;
// ========================================================================
cudaMemcpy(mat_debug , mat_in_d , size_mat_d , cudaMemcpyDeviceToHost) ;
cudaDeviceSynchronize() ;
std::cout<<"before repmat="<<std::endl;
std::cout<<"[";
for(unsigned int i = 0; i < 3; i++)
{
std::cout<< mat_debug[i * Cols].x <<"+"<<mat_debug[i * Cols].y <<"i, ";
std::cout<<";"<<std::endl;
}
std::cout<<"]"<<std::endl;
// ==========================================================================
RepmatKernel<<<(unsigned int)ceil((float)(Mat_size)/(float)(Num_thrd)),
(Num_thrd)>>>(mat_in_d,
Mat_size,
Cols);
cudaDeviceSynchronize();
// ========================================================================
cudaMemcpy(mat_debug , mat_in_d , size_mat_d , cudaMemcpyDeviceToHost) ;
cudaDeviceSynchronize() ;
std::cout<<"after repmat="<<std::endl;
std::cout<<"[";
for(unsigned int i = 0; i < 3; i++)
{
std::cout<< mat_debug[i * Cols].x <<"+"<<mat_debug[i * Cols].y <<"i, ";
std::cout<<";"<<std::endl;
}
std::cout<<"]"<<std::endl;
// ==========================================================================
cudaFree(mat_in_d);
delete [] vec;
delete [] mat_debug;
return 0;
}
Your call to cudaMalloc states that there is a problem, but doesn't actually terminate the computation. You should put a
if (cudaMalloc((void **) &mat_in_d , size_mat_d) != cudaSuccess)
{
std::cout<<"Error allocating GPU\n";
return 1;
}
so that the computation actually stops when you overflow the memory, rather than attempt to work anyway with only a warning to std::cout. Even better would be to use an error handling macro.
Another problem is here:
cudaMemcpy ( mat_in_d , vec , Cols , cudaMemcpyHostToDevice );
First, mat_in_d is size Rows * Cols * sizeof(cuComplex), but you are only copying Cols bytes into it. Even if you only wanted to copy vec into the first part of the mat_in_d vector, you'd need to change this to
cudaMemcpy ( mat_in_d , vec , Cols*sizeof(cuComplex) , cudaMemcpyHostToDevice );
At this point, you'd expect the first Cols entries of you matrix to be reasonable, at the rest to be garbage. (Making the suggested change shows that this is indeed the case; why you would want to do this is a better question).
Next comes your kernel call, whose entire goal is to set the entries of Mat to zero. This should be done with cudaMemset, i.e., just use
cudaMemset(mat_in_d, 0, Mat_size*sizeof(cuComplex));
We could look more carefully at the execution configuration to see what went wrong with your kernel call, but for now this fixes your problem.
For debugging CUDA errors; I find a header from samples, helper_cuda.h, quite convenient. I almost always include this header, which is located in the common directory of samples, in my projects.
Then, wrapping all CUDA calls with checkCudaErrors(), like checkCudaErrors(cudaMalloc((void **) &mat_in_d , size_mat_d)); gives explicit error messages.
In my case, since just mat_in_d is close to 1 GB and my GPU's memory is only 512 MB, it failed for sure and threw cudaErrorMemoryAllocation. However, an NVIDIA Quadro K4200 should not fail that easily!
Did you check the actual available memory information using cudaMemGetInfo ?