CUDA -- simple code but some of my warps don't run

EDIT: While rereading this question, I think I figured it out.
The root of the problem is most likely that I didn't allocate enough device memory. I will work out the correct allocation and then answer my own question. Silly me. :-[ That doesn't explain the warps missing from stdout, though...
Original question
I created a templated kernel in CUDA in which I iterate over sections of grayscale image data in global memory (shared memory optimizations will come once I get this working) to perform morphological operations with disc-shaped structuring elements. Each thread corresponds to one pixel of the image. When the data type is char, everything works as expected and all my threads do what they should. When I change it to unsigned short, it starts acting up and only computes the upper half of my image. When I put in some printfs (my device has compute capability 2.0), I found out that some of the warps that should run aren't even executed.
Here's the relevant code.
From my main.cpp I call gcuda::ErodeGpuGray8(img, radius); and gcuda::ErodeGpuGray16(img, radius); which are the following functions:
// gcuda.h
i3d::Image3d<i3d::GRAY8> ErodeGpuGray8(i3d::Image3d<i3d::GRAY8> img, const unsigned int radius);
i3d::Image3d<i3d::GRAY16> ErodeGpuGray16(i3d::Image3d<i3d::GRAY16> img, const unsigned int radius);
// call this from outside
Image3d<GRAY8> ErodeGpuGray8(Image3d<GRAY8> img, const unsigned int radius) {
    return ErodeGpu<GRAY8>(img, radius);
}
// call this from outside
Image3d<GRAY16> ErodeGpuGray16(Image3d<GRAY16> img, const unsigned int radius) {
    return ErodeGpu<GRAY16>(img, radius);
}
The library I'm using defines GRAY8 as char and GRAY16 as unsigned short.
Here's how I call the kernel (blockSize is a const int set to 128 in the relevant namespace):
template<typename T> Image3d<T> ErodeGpu(Image3d<T> img, const unsigned int radius) {
    unsigned int width = img.GetWidth();
    unsigned int height = img.GetHeight();
    unsigned int w = nextHighestPower2(width);
    unsigned int h = nextHighestPower2(height);
    const size_t n = width * height;
    const size_t N = w * h;
    Image3d<T>* rslt = new Image3d<T>(img);
    T *vx = rslt->GetFirstVoxelAddr();
    // kernel parameters
    dim3 dimBlock( blockSize );
    dim3 dimGrid( ceil( N / (float)blockSize) );
    // source voxel array on device (orig)
    T *vx_d;
    // result voxel array on device (for result of erosion)
    T *vxr1_d;
    // allocate memory on device
    gpuErrchk( cudaMalloc( (void**)&vx_d, n ) );
    gpuErrchk( cudaMemcpy( vx_d, vx, n, cudaMemcpyHostToDevice ) );
    gpuErrchk( cudaMalloc( (void**)&vxr1_d, n ) );
    gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n, cudaMemcpyDeviceToDevice ) );
    ErodeGpu<T><<<dimGrid, dimBlock>>>(vx_d, vxr1_d, n, width, radius);
    gpuErrchk( cudaMemcpy( vx, vxr1_d, n, cudaMemcpyDeviceToHost ) );
    // free device memory
    gpuErrchk( cudaFree( vx_d ) );
    gpuErrchk( cudaFree( vxr1_d ) );
    // for debug purposes
    return rslt;
}
The dimensions of my testing image are 82x82, so n = 82*82 = 6724 and N = 128*128 = 16384.
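As a sanity check on these numbers, here is a small host-side sketch in plain C++ (no CUDA calls) that reproduces the launch arithmetic. The nextHighestPower2 below is only a hypothetical stand-in for the helper used above, whose implementation is not shown, and blocksFor is an illustrative name.

```cpp
#include <cassert>

// Hypothetical stand-in for the nextHighestPower2 helper used above:
// returns the smallest power of two that is >= v (for v >= 1).
unsigned int nextHighestPower2(unsigned int v) {
    assert(v >= 1);
    unsigned int p = 1;
    while (p < v) {
        p <<= 1;
    }
    return p;
}

// Number of blocks needed to cover `total` threads with `blockSize` threads
// per block, using integer ceiling division instead of ceil() on a float.
unsigned int blocksFor(unsigned int total, unsigned int blockSize) {
    assert(blockSize > 0);
    return (total + blockSize - 1) / blockSize;
}
```

For the 82x82 image this gives w = h = 128, N = 16384 and blocksFor(16384, 128) = 128 blocks of 128 threads, of which only the first n = 6724 threads pass the tx < n check.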
This is my kernel:
// CUDA Kernel -- used for image erosion with a circular structure element of radius "erosionR"
template<typename T> __global__ void ErodeGpu(const T *in, T *out, const unsigned int n, const int width, const int erosionR)
{
    ErodeOrDilateCore<T>(ERODE, in, out, n, width, erosionR);
}
// The core of erosion or dilation. Operation is determined by the first parameter
template<typename T> __device__ void ErodeOrDilateCore(operation_t operation, const T *in, T *out, const unsigned int n, const int width, const int radius) {
    // get thread number, this method is overkill for my purposes but generally should be bulletproof, right?
    int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
    int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x;
    int tx = threadId;
    if (tx >= n) {
        printf("[%d > %d]", tx, n);
    } else {
        printf("{%d}", tx);
        … (erosion implementation, stdout is the same when this is commented out so it's probably not the root of the problem)
    }
}
To my understanding, this code should write an arbitrarily ordered mix of [X > N] and {X} strings to stdout, where X is the thread ID. There should be n curly-bracketed numbers (the output of threads with index < n) and N - n square-bracketed ones. But when I run it and count the curly-bracketed numbers using a regex, I find only 256 of them. Furthermore, they seem to occur in 32-member groups, which tells me that some warps run and some do not.
I am really baffled by this. It doesn't help that when I don't comment out the erosion implementation part, the GRAY8 erosion works and the GRAY16 erosion doesn't, even though the stdout output is exactly the same in both cases (could be input-dependent, I only tried this with 2 images).
What am I missing? What could be the cause of this? Is there some memory-management mistake on my part or is it fine that some warps don't run and the erosion stuff is possibly just a bug in the image library that only occurs with the GRAY16 type?

So this was just a stupid malloc mistake.
Instead of
const size_t n = width * height;
const size_t N = w * h;
I used
const int n = width * height;
const int N = w * h;
and instead of the erroneous
gpuErrchk( cudaMalloc( (void**)&vx_d, n ) );
gpuErrchk( cudaMemcpy( vx_d, vx, n, cudaMemcpyHostToDevice ) );
gpuErrchk( cudaMalloc( (void**)&vxr1_d, n ) );
gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n, cudaMemcpyDeviceToDevice ) );
gpuErrchk( cudaMemcpy( vx, vxr1_d, n, cudaMemcpyDeviceToHost ) );
I used
gpuErrchk( cudaMalloc( (void**)&vx_d, n * sizeof(T) ) );
gpuErrchk( cudaMemcpy( vx_d, vx, n * sizeof(T), cudaMemcpyHostToDevice ) );
gpuErrchk( cudaMalloc( (void**)&vxr1_d, n * sizeof(T) ) );
gpuErrchk( cudaMemcpy( vxr1_d, vx_d, n * sizeof(T), cudaMemcpyDeviceToDevice ) );
gpuErrchk( cudaMemcpy( vx, vxr1_d, n * sizeof(T), cudaMemcpyDeviceToHost ) );
and the erosion is working correctly now, which was the main problem I was trying to solve. I'm still not getting the stdout output I'm expecting though, so if someone could shed some light on that, please do so.
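To make the size mismatch concrete, here is a minimal host-side sketch in plain C++. The helper names bytesFor and elementsCovered are made up for illustration, and the numbers in the note assume sizeof(unsigned short) == 2, which holds on common platforms but is not guaranteed by the language.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Bytes needed to hold n elements of type T. This is the quantity that
// cudaMalloc and cudaMemcpy expect as their size argument.
template <typename T>
std::size_t bytesFor(std::size_t n) {
    assert(n <= std::numeric_limits<std::size_t>::max() / sizeof(T));
    return n * sizeof(T);
}

// Number of whole elements of type T that fit into `bytes` bytes, i.e. how
// much of the image is actually transferred when only `bytes` are copied.
template <typename T>
std::size_t elementsCovered(std::size_t bytes) {
    return bytes / sizeof(T);
}
```

With n = 6724, passing n as the byte count covers all 6724 char pixels but only 3362 unsigned short pixels, which would be consistent with only the upper half of the GRAY16 image being processed.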


Unexpected CPU utilization with OpenCL

I've written a simple OpenCL kernel to calculate the cross-correlation of two images on the GPU. However, when I execute the kernel with enqueueNDRangeKernel, the CPU usage of one core rises to 100%, even though the host code does nothing except wait for the enqueued command to finish. Is this normal behavior for an OpenCL program? What is going on there?
OpenCL kernel (if relevant):
kernel void cross_correlation(global double *f,
                              global double *g,
                              global double *res) {
    // This work item will compute the cross-correlation value for pixel w
    const int2 w = (int2)(get_global_id(0), get_global_id(1));
    // Main loop
    int xy_index = 0;
    int xy_plus_w_index = w.x + w.y * X;
    double integral = 0;
    for ( int y = 0; y + w.y < Y; ++y ) {
        for ( int x = 0; x + w.x < X; ++x, ++xy_index, ++xy_plus_w_index ) {
            // xy_index is equal to x + y * X
            // xy_plus_w_index is equal to (x + w.x) + (y + w.y) * X
            integral += f[xy_index] * g[xy_plus_w_index];
        }
        xy_index += w.x;
        xy_plus_w_index += w.x;
    }
    res[w.x + w.y * X] = integral;
}
The images f, g, res have a size of X times Y pixels, where X and Y are set at compile time. I'm testing the above kernel with X = 2048 and Y = 2048.
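For checking the kernel on small inputs, a host-side reference for a single output pixel might look like the sketch below. It is plain C++ written for this post, not part of the original program; the function name crossCorrelationAt is an assumption, and X and Y are passed as parameters here rather than fixed at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for one output pixel (wx, wy): the sum over all (x, y)
// with x + wx < X and y + wy < Y of f[x + y*X] * g[(x + wx) + (y + wy)*X].
double crossCorrelationAt(const std::vector<double>& f,
                          const std::vector<double>& g,
                          int X, int Y, int wx, int wy) {
    assert(f.size() == static_cast<std::size_t>(X) * static_cast<std::size_t>(Y));
    assert(g.size() == f.size());
    double integral = 0.0;
    for (int y = 0; y + wy < Y; ++y) {
        for (int x = 0; x + wx < X; ++x) {
            integral += f[x + y * X] * g[(x + wx) + (y + wy) * X];
        }
    }
    return integral;
}
```

The explicit index expressions correspond to the xy_index and xy_plus_w_index counters in the kernel, so the two should agree for every pixel w.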
Additional info: I am running the kernel on a Nvidia GPU with OpenCL version 1.2. The C++ program is written using the OpenCL C++ Wrapper API and executed on Debian using optirun from the bumblebee package.
As requested, here is a minimal working example:
#include <CL/cl.hpp>
#include <sstream>
#include <fstream>
using namespace std;
int main ( int argc, char **argv ) {
const int X = 2048;
const int Y = 2048;
// Create context
cl::Context context ( CL_DEVICE_TYPE_GPU );
// Read kernel from file
ifstream kernel_file ( "" );
stringstream buffer;
buffer << kernel_file.rdbuf ( );
string kernel_code = buffer.str ( );
// Build kernel
cl::Program::Sources sources;
sources.push_back ( { kernel_code.c_str ( ), kernel_code.length ( ) } );
cl::Program program ( context, sources );
program.build ( " -DX=2048 -DY=2048" );
// Allocate buffer memory
cl::Buffer fbuf ( context, CL_MEM_READ_WRITE, X * Y * sizeof(double) );
cl::Buffer gbuf ( context, CL_MEM_READ_WRITE, X * Y * sizeof(double) );
cl::Buffer resbuf ( context, CL_MEM_WRITE_ONLY, X * Y * sizeof(double) );
// Create command queue
cl::CommandQueue queue ( context );
// Create kernel
cl::Kernel kernel ( program, "cross_correlation" );
kernel.setArg ( 0, fbuf );
kernel.setArg ( 1, gbuf );
kernel.setArg ( 2, resbuf );
// Set input arguments
double *f = new double[X*Y];
double *g = new double[X*Y];
for ( int i = 0; i < X * Y; i++ )
f[i] = g[i] = 0.001 * i;
queue.enqueueWriteBuffer ( fbuf, CL_TRUE, 0, X * Y * sizeof(double), f );
queue.enqueueWriteBuffer ( gbuf, CL_TRUE, 0, X * Y * sizeof(double), g );
// Execute kernel
queue.enqueueNDRangeKernel ( kernel, cl::NullRange, cl::NDRange ( X, Y ), cl::NullRange, NULL, NULL );
queue.finish ( );
return 0;
}
You don't say how you call enqueueNDRangeKernel, which is the critical bit. As I understand it, on NVIDIA the call is blocking (although I don't think the standard requires that).
You can get around this by having a separate thread invoke enqueueNDRangeKernel and block on it while your other threads continue; the blocking thread can then signal an event when the call completes.
There's a discussion on it here - and it raises some caveats about having multiple calls to the enqueue occurring in parallel.

Memcopy multiple gpus in cuda programming [duplicate]

How can I use two devices to improve, for example, the performance of the following code (sum of vectors)?
Is it possible to use more devices "at the same time"?
If yes, how can I manage the allocations of the vectors on the global memory of the different devices?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cuda.h>
#define NB 32
#define NT 500
#define N (NB*NT)
__global__ void add( double *a, double *b, double *c);
__global__ void add( double *a, double *b, double *c){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while(tid < N){
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}
int main( void ) {
double *a, *b, *c;
double *dev_a, *dev_b, *dev_c;
// allocate the memory on the CPU
a=(double *)malloc(N*sizeof(double));
b=(double *)malloc(N*sizeof(double));
c=(double *)malloc(N*sizeof(double));
// allocate the memory on the GPU
cudaMalloc( (void**)&dev_a, N * sizeof(double) );
cudaMalloc( (void**)&dev_b, N * sizeof(double) );
cudaMalloc( (void**)&dev_c, N * sizeof(double) );
// fill the arrays 'a' and 'b' on the CPU
for (int i=0; i<N; i++) {
    a[i] = (double)i;
    b[i] = (double)i*2;
}
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy( dev_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( dev_b, b, N * sizeof(double), cudaMemcpyHostToDevice);
for(int i=0;i<10000;++i)
add<<<NB,NT>>>( dev_a, dev_b, dev_c );
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy( c, dev_c, N * sizeof(double), cudaMemcpyDeviceToHost);
// display the results
// for (int i=0; i<N; i++) {
// printf( "%g + %g = %g\n", a[i], b[i], c[i] );
// }
printf("\nGPU done\n");
// free the memory allocated on the GPU
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
// free the memory allocated on the CPU
free( a );
free( b );
free( c );
return 0;
}
Thank you in advance.
Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about are relatively easy. Prior to that, you would have needed to use a multi-threaded host application with one host thread per GPU and some sort of inter-thread communication system in order to use multiple GPUs inside the same host application.
Now it is possible to do something like this for the memory allocation part of your host code:
double *dev_a[2], *dev_b[2], *dev_c[2];
const int Ns[2] = {N/2, N-(N/2)};
// allocate the memory on the GPUs
for(int dev=0; dev<2; dev++) {
    cudaSetDevice(dev); // select the GPU that the following calls apply to
    cudaMalloc( (void**)&dev_a[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_b[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_c[dev], Ns[dev] * sizeof(double) );
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The basic idea here is that you use cudaSetDevice to select between devices when you are performing operations on a device. So in the above snippet, I have assumed two GPUs and allocated memory on each [(N/2) doubles on the first device and N-(N/2) on the second].
The transfer of data from the host to device could be as simple as:
// copy the arrays 'a' and 'b' to the GPUs
for(int dev=0,pos=0; dev<2; pos+=Ns[dev], dev++) {
    cudaSetDevice(dev); // select the GPU that receives this slice
    cudaMemcpy( dev_a[dev], a+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b[dev], b+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The kernel launching section of your code could then look something like:
for(int i=0;i<10000;++i) {
    for(int dev=0; dev<2; dev++) {
        cudaSetDevice(dev); // launch on the selected GPU
        add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );
    }
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
Note that I have added an extra argument to your kernel call, because each instance of the kernel may be called with a different number of array elements to process. I will leave it to you to work out the modifications required.
But, again, the basic idea is the same: use cudaSetDevice to select a given GPU, then run kernels on it in the normal way, with each kernel getting its own unique arguments.
You should be able to put these parts together to produce a simple multi-GPU application. There are a lot of other features which can be used in recent CUDA versions and hardware to assist multiple GPU applications (like unified addressing, the peer-to-peer facilities, and more), but this should be enough to get you started. There is also a simple multi-GPU application in the CUDA SDK you can look at for more ideas.
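The Ns array and the running pos offset in the snippets above generalize to any number of devices. Below is a minimal host-side sketch in plain C++ of that bookkeeping only; the Chunk struct and splitAcrossDevices are hypothetical names introduced for illustration, and no CUDA calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One contiguous slice of the host arrays, assigned to a single device.
struct Chunk {
    std::size_t offset;  // index of the first element in the host array
    std::size_t count;   // number of elements handled by this device
};

// Split n elements into numDevices contiguous chunks whose sizes differ by
// at most one element. For two devices this yields {n/2, n - n/2}.
std::vector<Chunk> splitAcrossDevices(std::size_t n, int numDevices) {
    assert(numDevices > 0);
    std::vector<Chunk> chunks;
    std::size_t offset = 0;
    for (int dev = 0; dev < numDevices; ++dev) {
        const std::size_t remainingDevices = static_cast<std::size_t>(numDevices - dev);
        const std::size_t count = (n - offset) / remainingDevices;
        chunks.push_back({offset, count});
        offset += count;
    }
    return chunks;
}
```

In the per-device loops, chunks[dev].count would take the place of Ns[dev] and a + chunks[dev].offset the place of a+pos, after selecting the device with cudaSetDevice.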

Passing a Constant Integer in a CUDA Kernel [duplicate]

I am having a problem with the following code. In the global kernel, loop_d, M has an integer value of 84. When I try to create a shared array, temp, and use M as the size of the array, I get the following error:
error: expression must have a constant value
I am not sure why that is. I know that if I declare M as a global variable, then it works, but the problem is that I get the value of M by calling the function d_two in a different Fortran program, so I am not sure how to get around that. I know that if I replace temp[M] with temp[84], then my program runs perfectly, but that is not very practical, since different problems might have different values of M. Thank you for your help!
The program
// Parallelized 2D Three-Point Gaussian Quadrature Numerical Integration Method
// The following program is part of two linked programs, Integral_2D_Cuda.f.
// This is a CUDA kernel that could be called in the Integral_2D_Cuda.f Fortran code to compute
// the integral of a given 2D-function
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include <cuda_runtime.h>
// The following is a definition for the atomicAddd function that is called in the loop_d kernel
// This is needed because the "regular" atomicAdd function only works for floats and integers
__device__ double atomicAddd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
// GPU kernel that computes the function of interest. This is good for a two dimensional problem.
__global__ void loop_d(double *a_sx, double *b_swx, double *c_sy, double *d_swy, double *e_ans0, int N, int M)
{
    // Declaring a shared array that threads of the same block have access to
    __shared__ double temp[M];
    int idxX = blockIdx.x * blockDim.x + threadIdx.x; // Thread indices responsible for the swx and sx arrays
    int idxY = threadIdx.y; // Thread indices responsible for the swy and sy arrays
    // Computing the multiplication of elements
    if (idxX < N && idxY < M)
    {
        temp[idxY] = a_sx[idxX] * b_swx[idxX] * c_sy[idxY] * d_swy[idxY];
    }
    // synchronizing all threads before summing all the multiplied elements in the temp array
    __syncthreads();
    // Allowing the 0th thread of y to do the summation of the multiplied elements in the temp array of one block
    if (0 == idxY)
    {
        double sum = 0.00;
        for(int k = 0; k < M; k++)
        {
            sum = sum + temp[k];
        }
        // Adding the result of this instance of calculation to the final answer, ans0
        atomicAddd(e_ans0, sum);
    }
}
extern "C" void d_two_(double *sx, double *swx, int *nptx, double *sy, double *swy, int *npty, double *ans0)
{
    // Assigning GPU pointers
    double *sx_d, *swx_d;
    int N = *nptx;
    double *sy_d, *swy_d;
    int M = *npty;
    double *ans0_d;
    dim3 threadsPerBlock(1,M); // Creating a two-dimensional block with 1 thread in the x dimension and M threads in the y dimension
    dim3 numBlocks(N); // Specifying the number of blocks (each of dimension 1xM) to use
    // Allocating GPU memory
    cudaMalloc( (void **)&sx_d, sizeof(double) * N);
    cudaMalloc( (void **)&swx_d, sizeof(double) * N);
    cudaMalloc( (void **)&sy_d, sizeof(double) * M);
    cudaMalloc( (void **)&swy_d, sizeof(double) * M);
    cudaMalloc( (void **)&ans0_d, sizeof(double) );
    // Copying information from CPU to GPU
    cudaMemcpy( sx_d, sx, sizeof(double) * N, cudaMemcpyHostToDevice );
    cudaMemcpy( swx_d, swx, sizeof(double) * N, cudaMemcpyHostToDevice );
    cudaMemcpy( sy_d, sy, sizeof(double) * M, cudaMemcpyHostToDevice );
    cudaMemcpy( swy_d, swy, sizeof(double) * M, cudaMemcpyHostToDevice );
    cudaMemcpy( ans0_d, ans0, sizeof(double), cudaMemcpyHostToDevice );
    // Calling the function on the GPU
    loop_d<<< numBlocks, threadsPerBlock >>>(sx_d, swx_d, sy_d, swy_d, ans0_d, N, M);
    // Copying from GPU to CPU
    cudaMemcpy( ans0, ans0_d, sizeof(double), cudaMemcpyDeviceToHost );
    // Freeing GPU memory
    cudaFree( sx_d );
    cudaFree( swx_d );
    cudaFree( sy_d );
    cudaFree( swy_d );
    cudaFree( ans0_d );
}
The compiler needs M to be a compile-time constant. At compile time it cannot determine what M is actually going to be (it doesn't know you will just pass it 84 eventually).
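When you want to use shared memory whose size you only know at runtime, you use dynamic shared memory: declare the array in the kernel as extern __shared__ double temp[]; (without a size) and pass the size in bytes, here M * sizeof(double), as the third launch-configuration parameter, i.e. loop_d<<< numBlocks, threadsPerBlock, M * sizeof(double) >>>(...).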
See this example here on the site or Using Shared Memory in CUDA on the Parallel4All blog.

How to use linear indexes on cv::cuda::PtrStepSzf data

I'm working with OpenCV 3.1 cv::cuda template matching, but the cv::cuda::minMaxLoc() function is too slow for my case. My match results have a minimum size of 128x128 and a maximum size of up to 512x512. On average, minMaxLoc() takes 1.65 ms for 128x128 and up to 25 ms for something like 350x350, which is too long since this is done hundreds of times.
I understand that my match sizes may be too small for what one usually runs on a GPU. But I want to test along the lines of what Robert Crovella did in "thrust::max_element slow in comparison cublasIsamax - More efficient implementation?" to see if I can get better performance.
My problem is that in all those reductions the data is read using linear indexes, and cv::cuda::PtrStepSzf does not allow this (at least I did not find out how). I tried to reshape my match result, but I cannot do that since the data is not contiguous. Do I need to go toward cudaMallocPitch and cudaMemcpy2D? If that is the case, how do I do that with a cv::cuda::GpuMat read as a cv::cuda::PtrStepSzf object?
__global__ void minLoc(const cv::cuda::PtrStepSzf data,
                       float* minVal,
                       float * minValLoc)
{
    int dsize = data.cols*data.rows;
    __shared__ volatile T vals[nTPB];
    __shared__ volatile int idxs[nTPB];
    __shared__ volatile int last_block;
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    last_block = 0;
    T my_val = FLOAT_MIN;
    int my_idx = -1;
    // sweep from global memory
    while (idx < dsize)
    {
        //data(idx) is an illegal call;The legal one is data(x,y)
        // How do I do it?
        if (data(idx) > my_val)
        {
            my_val = data(idx); my_idx = idx;
        }
        idx += blockDim.x*gridDim.x;
    }
    // ... rest of the kernel
}
void callMinLocKernel(cv::InputArray _input,
                      cv::Point& minValLoc, // passed by reference so the caller sees the result
                      float& minVal,        // passed by reference so the caller sees the result
                      cv::cuda::Stream _stream)
{
    const cv::cuda::GpuMat input = _input.getGpuMat();
    dim3 cthreads(32, 32);
    dim3 cblocks(
    static_cast<int>(std::ceil(input1.size().width /
    static_cast<int>(std::ceil(input1.size().height /
    // code that creates and upload d_min, d_minLoc
    float h_min = 9999;
    int h_minLoc = -1;
    float * d_min = 0;
    int * d_minLoc = 0;
    //gpuErrchk is defined on other place
    gpuErrchk( cudaMalloc((void**)&d_min, sizeof(h_min)));
    gpuErrchk( cudaMalloc((void**)&d_minLoc, sizeof(h_minLoc)));
    gpuErrchk( cudaMemcpy(d_min, &h_min, sizeof(h_min), cudaMemcpyHostToDevice) );
    gpuErrchk( cudaMemcpy(d_minLoc, &h_minLoc, sizeof(h_minLoc), cudaMemcpyHostToDevice) );
    cudaStream_t stream = cv::cuda::StreamAccessor::getStream(_stream);
    minLoc<<<cblocks, cthreads, 0, stream>>>(input,d_min,d_minLoc);
    //code to read the answer
    gpuErrchk( cudaMemcpy(&h_min, d_min, sizeof(h_min), cudaMemcpyDeviceToHost) );
    gpuErrchk( cudaMemcpy(&h_minLoc, d_minLoc, sizeof(h_minLoc), cudaMemcpyDeviceToHost) );
    // cv::Point takes (x, y), i.e. (column, row)
    minValLoc = cv::Point(h_minLoc % input.cols, h_minLoc / input.cols);
    minVal = h_min;
}
int main()
{
    //read Background and template
    cv::Mat input = imread("cat.jpg",0);
    cv::Mat templ = imread("catNose.jpg",0);
    //convert to floats
    cv::Mat float_input, float_templ;
    //upload Bckg and template to gpu
    cv::cuda::GpuMat d_src,d_templ, d_match;
    Size size = float_input.size();
    double min_val, max_val;
    Point min_loc, max_loc;
    Ptr<cv::cuda::TemplateMatching> alg = cuda::createTemplateMatching(d_src.type(), CV_TM_SQDIFF);
    alg->match(d_src, d_templ, d_match);
    //Too slow
    //cv::cuda::minMaxLoc(d_match, &min_val, &max_val, &min_loc, &max_loc);
    return 0;
}
I did not find a way to actually use linear indexes with cv::cuda::PtrStepSzf, and I am not sure there is one. It looks like this format can only be accessed with two subscripts. Instead, I used the ptr pointer of the cv::cuda::GpuMat input variable in the kernel wrapper, as follows:
#define nTPB 1024
#define FLOAT_MAX 9999.0f
void callMinLocKernel(cv::InputArray _input,
                      cv::Point& minValLoc, // passed by reference so the caller sees the result
                      float& minVal,        // passed by reference so the caller sees the result
                      cv::cuda::Stream _stream)
{
    const cv::cuda::GpuMat input = _input.getGpuMat();
    const float* linSrc = input.ptr<float>();
    size_t step = input.step;
    dim3 cthreads(nTPB);
    dim3 cblocks(
    static_cast<int>(std::ceil(input.size().width*input1.size().height /
    // code that creates and upload d_min, d_minLoc
    float h_min = 9999;
    int h_minLoc = -1;
    float * d_min = 0;
    int * d_minLoc = 0;
    //gpuErrchk is defined on other place
    gpuErrchk( cudaMalloc((void**)&d_min, sizeof(h_min)));
    gpuErrchk( cudaMalloc((void**)&d_minLoc, sizeof(h_minLoc)));
    gpuErrchk( cudaMemcpy(d_min, &h_min, sizeof(h_min), cudaMemcpyHostToDevice) );
    gpuErrchk( cudaMemcpy(d_minLoc, &h_minLoc, sizeof(h_minLoc), cudaMemcpyHostToDevice) );
    cudaStream_t stream = cv::cuda::StreamAccessor::getStream(_stream);
    // pass the raw pointer, the row step in bytes and the image size to the kernel
    minLoc<<<cblocks, cthreads, 0, stream>>>(linSrc, step, input.size(), d_min, d_minLoc);
    //code to read the answer
    gpuErrchk( cudaMemcpy(&h_min, d_min, sizeof(h_min), cudaMemcpyDeviceToHost) );
    gpuErrchk( cudaMemcpy(&h_minLoc, d_minLoc, sizeof(h_minLoc), cudaMemcpyDeviceToHost) );
    // cv::Point takes (x, y), i.e. (column, row)
    minValLoc = cv::Point(h_minLoc % input.cols, h_minLoc / input.cols);
    minVal = h_min;
}
And inside the Kernel as:
__global__ void minLoc(const float* data,
                       const size_t step,
                       cv::Size dataSz,
                       float* minVal,
                       int * minValLoc)
{
    __shared__ volatile float vals[nTPB];
    __shared__ volatile int idxs[nTPB];
    __shared__ volatile int last_block;
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    const int dsize = dataSz.height*dataSz.width;
    last_block = 0;
    float my_val = FLOAT_MAX;
    int my_idx = -1;
    // sweep from global memory
    while (idx < dsize)
    {
        // convert the linear index into an offset that accounts for row padding
        int row = idx / dataSz.width;
        int id = ( row*step / sizeof( float ) ) + idx % dataSz.width;
        if ( data[id] < my_val )
        {
            my_val = data[id];
            my_idx = idx;
        }
        idx += blockDim.x*gridDim.x;
    }
    // ... rest of the kernel
}
The step is in bytes, so it needs to be divided by sizeof(typeVariable) to get the row stride in elements.
I hope this helps!
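The index conversion used in the kernel can be checked on the host. The following is a small sketch in plain C++ with an illustrative function name (pitchedOffset); it assumes float data and a step that is a whole multiple of sizeof(float).

```cpp
#include <cassert>
#include <cstddef>

// Map a linear pixel index (row-major over the visible width, ignoring
// padding) to the element offset inside a pitched buffer whose consecutive
// rows start stepBytes apart.
std::size_t pitchedOffset(std::size_t idx, std::size_t width, std::size_t stepBytes) {
    assert(width > 0);
    assert(stepBytes % sizeof(float) == 0);  // assumption stated above
    const std::size_t row = idx / width;
    const std::size_t col = idx % width;
    return row * (stepBytes / sizeof(float)) + col;
}
```

For example, with 3 visible columns and rows padded to 4 floats, linear index 3 (the first pixel of the second row) maps to element offset 4. Going the other way, a linear index found by the reduction splits into column idx % width and row idx / width.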

CUDA Constant Memory Error

I am trying to do a sample code with constant memory with CUDA 5.5. I have 2 constant arrays of size 3000 each. I have another global array X of size N.
I want to compute
Y[tid] = X[tid]*A[tid%3000] + B[tid%3000]
Here is the code.
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
using namespace std;
#include <cuda.h>
__device__ __constant__ int A[3000];
__device__ __constant__ int B[3000];
__global__ void kernel( int *dc_A, int *dc_B, int *X, int *out, int N)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<N )
    {
        out[tid] = dc_A[tid%3000]*X[tid] + dc_B[tid%3000];
    }
}
int main()
{
    int N=100000;
    // set affine constants on host
    int *h_A, *h_B ; //host vectors
    h_A = (int*) malloc( 3000*sizeof(int) );
    h_B = (int*) malloc( 3000*sizeof(int) );
    for( int i=0 ; i<3000 ; i++ )
    {
        h_A[i] = (int) (drand48() * 10);
        h_B[i] = (int) (drand48() * 10);
    }
    //set X and Y on host
    int * h_X = (int*) malloc( N*sizeof(int) );
    int * h_out = (int *) malloc( N*sizeof(int) );
    //set the vector
    for( int i=0 ; i<N ; i++ )
    {
        h_X[i] = i;
        h_out[i] = 0;
    }
    // copy, A,B,X,Y to device
    int * d_X, *d_out;
    cudaMemcpyToSymbol( A, h_A, 3000 * sizeof(int) ) ;
    cudaMemcpyToSymbol( B, h_B, 3000 * sizeof(int) ) ;
    cudaMalloc( (void**)&d_X, N*sizeof(int) ) ;
    cudaMemcpy( d_X, h_X, N*sizeof(int), cudaMemcpyHostToDevice ) ;
    cudaMalloc( (void**)&d_out, N*sizeof(int) ) ;
    //call kernel for vector addition
    kernel<<< (N+1024)/1024,1024 >>>(A,B, d_X, d_out, N);
    cudaPeekAtLastError() ;
    cudaDeviceSynchronize() ;
    // D --> H
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost ) ;
    return 0;
}
I am running this code under the debugger to analyze it. It turns out that on the line which copies to constant memory, I get the following output from the debugger:
Coalescing of the CUDA commands output is off.
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5c5b700 (LWP 31200)]
Can somebody please help me out with constant memory?
There are several problems here. It is probably easier to start by showing the "correct" way to use those two constant arrays, then explain why what you did doesn't work. So the kernel should look like this:
__global__ void kernel(int *X, int *out, int N)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<N )
    {
        out[tid] = A[tid%3000]*X[tid] + B[tid%3000];
    }
}
i.e. don't try passing A and B to the kernel. The reasons are as follows:
Somewhat confusingly, A and B in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel. If you want their device memory address, you must use cudaGetSymbolAddress to retrieve it at runtime.
Even if you did call cudaGetSymbolAddress and retrieve the symbols' device addresses in constant memory, you shouldn't pass them to a kernel as an argument, because doing so would not yield uniform memory access in the running kernel. Correct use of constant memory requires the compiler to emit special PTX instructions, and the compiler will only do that when it knows that a particular global memory location is in constant memory. If you pass a constant memory address by value as an argument, the __constant__ property is lost and the compiler can't know to produce the correct load instructions.
Once you get this working, you will find it is terribly slow, and if you profile it you will find that there are very high degrees of instruction replay and serialization. The whole idea of using constant memory is that you can exploit a constant cache broadcast mechanism in cases when every thread in a warp accesses the same value in constant memory. Your example is the complete opposite of that: every thread is accessing a different value. Regular global memory will be faster in such a use case. Also be aware that the performance of the modulo operator on current GPUs is poor, and you should avoid it wherever possible.
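To see the access pattern concretely, here is a small host-side model in plain C++. It only counts how many distinct array indices the 32 threads of a warp would request; it does not measure or simulate hardware timing. The function names, and the perBlockIndex pattern used for contrast, are hypothetical and not taken from the question.

```cpp
#include <cassert>
#include <set>

// Host-side model: count how many distinct constant-array indices the 32
// threads of one warp would read, given a mapping from thread id to index.
template <typename IndexFn>
int distinctIndicesInWarp(int warpId, IndexFn indexOf) {
    assert(warpId >= 0);
    const int warpSize = 32;
    std::set<int> seen;
    for (int lane = 0; lane < warpSize; ++lane) {
        seen.insert(indexOf(warpId * warpSize + lane));
    }
    return static_cast<int>(seen.size());
}

// Pattern from the question: each thread reads A[tid % 3000].
int perThreadIndex(int tid) { return tid % 3000; }

// Hypothetical broadcast-friendly pattern for contrast: all threads of a
// 1024-thread block read the same entry.
int perBlockIndex(int tid) { return (tid / 1024) % 3000; }
```

With perThreadIndex every warp touches 32 different entries, so there is nothing for the constant cache to broadcast; with a pattern like perBlockIndex each warp touches a single entry, which is the case constant memory is designed for.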