Shared Memory CUDA for Converting RGB Image to Grayscale - C++

I am new to CUDA programming. I have code that converts an RGB image to grayscale; the algorithm for reading the RGB values of a pixel and converting them to grayscale was provided to us.
Parallelizing the code has given me around a 40-50x speedup. I want to optimize it further to achieve around a 100x speedup. For this purpose I want to use shared memory, since it is an order of magnitude faster than global memory access. I have gone through different online resources and have a basic understanding of shared memory access, but in my code I am having trouble understanding how to implement it. The code that reads the RGB values and converts them to grayscale:
for ( int y = 0; y < height; y++ ) {
    for ( int x = 0; x < width; x++ ) {
        float grayPix = 0.0f;
        float r = static_cast< float >(inputImage[(y * width) + x]);
        float g = static_cast< float >(inputImage[(width * height) + (y * width) + x]);
        float b = static_cast< float >(inputImage[(2 * width * height) + (y * width) + x]);
        grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
        grayPix = (grayPix * 0.6f) + 0.5f;
        darkGrayImage[(y * width) + x] = static_cast< unsigned char >(grayPix);
    }
}
The input image is an unsigned char* and we are using the CImg library to manipulate the image:
CImg< unsigned char > inputImage = CImg< unsigned char >(argv[1]);
where the user passes the path to the image as an argument when running the code.
This is my CUDA implementation of it:
unsigned int y = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
float grayPix = 0.0f;
float r = static_cast< float >(inputImage[(y * height) + x]);
float g = static_cast< float >(inputImage[(width * height) + (y * height) + x]);
float b = static_cast< float >(inputImage[(2 * width * height) + (y * height) + x]);
grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
grayPix = (grayPix * 0.6f) + 0.5f;
darkGrayImage[(y * height) + x] = static_cast< unsigned char >(grayPix);
The grid and block dimensions, and the kernel launch:
dim3 gridSize(width/16,height/16);
dim3 blockSize(16,16);
greyScale<<< gridSize, blockSize >>>(width,height,d_in, d_out);
where width and height are the width and height of the input image. I tried a block size of (32,32), but it slowed the code down instead of speeding it up.
Now I want to add shared memory, but the problem is that access to the input variable inputImage is quite non-linear, so what values do I add to the shared memory?
I tried something like:
unsigned int y = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
extern __shared__ int s[];
s[x]=inputImage[x];
__syncthreads();
and then replacing inputImage with s in the implementation, but that just gave wrong output (an all-black image).
Can you help me understand how I can implement shared memory here, if it is even possible and useful, and is there a way I can make my accesses more coalesced?
Any help would be appreciated.

This can't work for several reasons:
unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
extern __shared__ int s[];
s[x]=inputImage[x];
One reason is that we cannot use a global index (x) as a shared memory index, unless the data set is small enough to fit in shared memory. For an image of reasonably large dimensions, you cannot fit the entire image into a single instance of shared memory. Furthermore, you are using only a one-dimensional index (x) into a two-dimensional data set, so this can't possibly make sense.
This suggests a general lack of understanding of how to use shared memory in a program. However, rather than trying to sort this out, we can observe that for a properly written RGB->grayscale code, shared memory usage is unlikely to provide any benefit.
Shared memory bandwidth benefits (which is what you are referring to when you say "magnitude faster") are valuable when there is data re-use. An RGB->grayscale code should not require any data re-use. You load each R,G,B quantity exactly once from global memory, and you store the computed grayscale quantity exactly once in global memory. Moving some of this data temporarily to shared memory is not going to speed anything up. You still have to do the global loads and global stores, and for a properly written code, this should be all that is necessary.
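For contrast, here is a minimal, hypothetical sketch (not part of the question's code) of the kind of kernel where shared memory does pay off, because each loaded element is re-used by neighboring threads:
#define BLOCK 256

// A 1D 3-point box filter: every input element is read by three different
// threads, so staging a tile in shared memory turns roughly three global
// loads per element into one. Assumes n is a multiple of BLOCK.
__global__ void stencil3(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];                 // the block's elements plus a 1-element halo on each side
    int gid = (blockIdx.x * BLOCK) + threadIdx.x;

    tile[threadIdx.x + 1] = in[gid];                  // each thread stages one element
    if (threadIdx.x == 0)                             // one thread fetches the left halo
        tile[0] = (gid == 0) ? 0.0f : in[gid - 1];
    if (threadIdx.x == BLOCK - 1)                     // one thread fetches the right halo
        tile[BLOCK + 1] = (gid == n - 1) ? 0.0f : in[gid + 1];
    __syncthreads();

    // each staged element is now read by up to three threads: the re-use shared memory needs
    out[gid] = (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]) / 3.0f;
}
In the grayscale conversion there is no such re-use, so this pattern buys nothing there.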
However, in your question you've already suggested a possible improvement path: coalesced access. If you were to profile your posted code, you would find completely uncoalesced access patterns. For good coalescing, we want compound index calculations to have the property that the threadIdx.x variable is not multiplied by anything:
unsigned int y = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
float grayPix = 0.0f;
float r = static_cast< float >(inputImage[(y * height) + x]);
                                           ^
                                           |
                                           y depends on threadIdx.x
But in your case, your index calculation is multiplying threadIdx.x by height. This will result in non-coalesced access. Adjacent threads in a warp will have varying threadIdx.x, and we want index calculations of adjacent threads in the warp to result in adjacent locations in memory, for good coalesced access. You cannot achieve this if you multiply threadIdx.x by anything.
The solution for this problem is quite simple. You should just use kernel code that is almost an exact duplicate of the non-CUDA code you have shown, with appropriate definitions for x and y:
unsigned int x = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int y = (blockIdx.y * blockDim.y) + threadIdx.y;
if ((x < width) && (y < height)){
    float grayPix = 0.0f;
    float r = static_cast< float >(inputImage[(y * width) + x]);
    float g = static_cast< float >(inputImage[(width * height) + (y * width) + x]);
    float b = static_cast< float >(inputImage[(2 * width * height) + (y * width) + x]);
    grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
    grayPix = (grayPix * 0.6f) + 0.5f;
    darkGrayImage[(y * width) + x] = static_cast< unsigned char >(grayPix);
}
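With the bounds check in place, the kernel launch can round the grid dimensions up, so that images whose dimensions are not multiples of the block size are still fully covered (a sketch using the same 16x16 block and launch names as in the question):
dim3 blockSize(16,16);
dim3 gridSize((width  + blockSize.x - 1) / blockSize.x,   // round up; the if() guard in the kernel
              (height + blockSize.y - 1) / blockSize.y);  // discards the out-of-range threads
greyScale<<< gridSize, blockSize >>>(width, height, d_in, d_out);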
Naturally, this is not a complete code. You have not shown a complete code, so if you respond with "I tried this but it doesn't work", it's unlikely I'll be able to help you much, since I don't know what code you're actually running. But:
1. Shared memory is not the right way to go for this algorithm.
2. You undoubtedly have a coalescing issue in your posted code, for the reasons I indicated.
3. The coalescing fix should follow the path I outlined.
4. Your performance should improve with the coalescing fix.
Note that a response of "it doesn't work" means you are really asking for debugging assistance, not a conceptual explanation, in which case you are expected to provide an MCVE. What you have shown is not an MCVE. Preferably, your MCVE should not depend on an external library like CImg; that means it requires effort on your part to create a standalone test case, but one that still demonstrates the problem you are having.
Also, whenever you are having trouble with a CUDA code, I would suggest using proper CUDA error checking as well as running your code with cuda-memcheck.
(Proper CUDA error checking would have identified a problem with your attempt to use shared memory, for example, due to out-of-bounds indexing in shared memory.)
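For reference, a minimal sketch of what such error checking can look like (a common wrapper pattern, not code from the question; the launch shown reuses the names from above):
#include <cstdio>
#include <cstdlib>

/* Wrap every CUDA runtime call; report and abort on failure. */
#define cudaCheck(call)                                                  \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                         \
        }                                                                \
    } while (0)

/* After a kernel launch, check both the launch and the execution: */
greyScale<<< gridSize, blockSize >>>(width, height, d_in, d_out);
cudaCheck(cudaGetLastError());        // launch errors (bad configuration, etc.)
cudaCheck(cudaDeviceSynchronize());   // errors raised while the kernel ran
Running the binary under cuda-memcheck (cuda-memcheck ./myApp, where myApp is a placeholder for your executable name) would likewise flag out-of-bounds shared memory accesses.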

Related

Change Perlin noise algorithm to work with continuous procedural generation

Right now I have a Perlin noise function where I pass in a buffer of seeds and another buffer which the function fills with the noise values. I am using this to procedurally generate the heights of the vertices in a terrain. The problem is that right now the terrain is limited to the size of the buffer, but I want it to continuously generate chunks, with the chunks being consistent with each other. I don't see how to do that with the current function I am using. Here is the code for the algorithm; is there anything I can change to make it work?
inline void perlInNoise2D(int nWidth, int nHeight, float *Seed, int nOctaves, float fBias, float *fOutput)
{
    for(int x = 0; x < nWidth; x++)
    {
        for(int y = 0; y < nHeight; y++)
        {
            float fNoise = 0.0f;
            float fScale = 1.0f;
            float fScaleAccum = 0.0f;
            for(int o = 0; o < nOctaves; o++)
            {
                int nPitch = nWidth >> o;
                int sampleX1 = (x / nPitch) * nPitch;
                int sampleY1 = (y / nPitch) * nPitch;
                int sampleX2 = (sampleX1 + nPitch) % nWidth;
                int sampleY2 = (sampleY1 + nPitch) % nHeight; // wrap Y by the height, not the width
                float fBlendX = (float)(x - sampleX1) / (float) nPitch;
                float fBlendY = (float)(y - sampleY1) / (float) nPitch;
                float fSampleT = (1.0f - fBlendX) * Seed[sampleY1 * nWidth + sampleX1] + fBlendX * Seed[sampleY1 * nWidth + sampleX2];
                float fSampleB = (1.0f - fBlendX) * Seed[sampleY2 * nWidth + sampleX1] + fBlendX * Seed[sampleY2 * nWidth + sampleX2];
                fNoise += (fBlendY * (fSampleB - fSampleT) + fSampleT) * fScale;
                fScaleAccum += fScale;
                fScale = fScale / fBias;
            }
            fOutput[(y * nWidth) + x] = fNoise / fScaleAccum;
        }
    }
}
Presumably this is tied in to a "map reveal" mechanism?
A common technique is to generate overlapping chunks and average them together. As a simple example, you generate chunks of 2*nWidth by 2*nHeight. You'd then have 4 overlapping chunks at any XY pos. At the edge of the map, you'll have a strip where not all chunks have been generated. When this part of the map needs to be revealed, you generate those chunks on the fly. This moves the edge outwards.
The averaging process already smooths out the boundary effects. You can make this more effective by smoothing out each individual chunk near its edges. Since the chunk edges do not coincide, the smoothing of different chunks does not coincide either. A simple triangle smooth could be sufficient (i.e. the smooth window is 1 in the middle, 0 at the edge, and linear in between) but you could also use a gaussian or any other function that peaks in the middle and gradually smooths towards the chunk edge.
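A small sketch of that weighting scheme (hypothetical helper names, not code from the question; the chunk-lookup bookkeeping is omitted):
#include <cmath>

// Triangle window: 1 at the chunk centre, 0 at its edge, linear in between.
// x, y are coordinates local to the chunk.
float triangleWeight(float x, float y, float chunkW, float chunkH)
{
    float wx = 1.0f - std::fabs(2.0f * x / chunkW - 1.0f);
    float wy = 1.0f - std::fabs(2.0f * y / chunkH - 1.0f);
    return wx * wy;
}

// Blend the (up to 4) overlapping chunks covering one world-space point.
float blendedHeight(const float *chunkValues, const float *weights, int nChunks)
{
    float sum = 0.0f, wSum = 0.0f;
    for (int c = 0; c < nChunks; c++) {
        sum  += weights[c] * chunkValues[c];
        wSum += weights[c];
    }
    return sum / wSum;   // wSum > 0 because the staggered chunk centres never all hit zero weight
}
Because the windows of different chunks peak at different places, each chunk's boundary artifacts are multiplied by a weight near zero exactly where they occur.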

CUDA: multiple threads access the same global variable

#define dimG 16
#define dimB 64
// slovebyGPU
__global__ void SloveStepGPU(float* X, float* Y, int* iCons, int* jCons, int* dCons, float* wCons, int cnt, float c)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = id; i < cnt; i += dimG*dimB) {
        int I = iCons[i];
        int J = jCons[i];
        int d = dCons[i];
        float wc = 1.0f*wCons[i]*c;
        if (wc > 1.0) wc = 1.0;
        // atomicAdd(&x, 0) is used as an atomic read: it returns the old value
        float XI = atomicAdd(&(X[I]), 0);
        float XJ = atomicAdd(&(X[J]), 0);
        float YI = atomicAdd(&(Y[I]), 0);
        float YJ = atomicAdd(&(Y[J]), 0);
        float pqx = XI - XJ;
        float pqy = YI - YJ;
        float mag = sqrtf(pqx*pqx + pqy*pqy);
        float r = 1.0f*(d - mag) / 2;
        float mx = wc * r * pqx / (mag + eps);
        float my = wc * r * pqy / (mag + eps);
        if (d == 1) {
            atomicAdd(&(X[I]), mx);
            atomicAdd(&(Y[I]), my);
        }
        atomicAdd(&(X[J]), -mx);
        atomicAdd(&(Y[J]), -my);
    }
}
In this code, I know that X and Y may have data races. My previous understanding was that the reads of XI, XJ, YI, YJ might simply not return the latest data. However, I found that during the data race, XI, XJ, YI, YJ can end up reading random memory values - that is, a memory access violation. Even if I add a lock around the reads and writes, I still get the same result. Only when I reduce the sizes of dimB and dimG, so that there is almost no data race, do I get the correct result. Is there any solution?
I use 64-bit compilation under a Windows + VS2015 + CUDA 9.1 environment.
However, I ran the same code under Linux and found no problems.
There is also no problem when using the Nsight CUDA debugger under Windows. The reason is probably that running under the debugger is slow and does not trigger the data race.
------- update -------
(other code removed)
The problem was in this if (d == 1). I replaced the if with device functions such as fminf and fmaxf to solve the problem. My guess is that the branch was entered within a warp, there was a data race, and some threads were stalled, which caused the strange results.
if (d == 1) {
    atomicAdd(&(X[I]), mx);
    atomicAdd(&(Y[I]), my);
}
to
float fd = fmaxf(2.0f - d, 0.0f);
X[I] += fd * 1.0f * mx;
Y[I] += fd * 1.0f * my;
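For what it's worth, the same branch removal can be combined with the original atomic updates, so the writes stay race-free (a sketch based only on the two snippets above, assuming d >= 1 as the poster's fix implies):
float fd = fmaxf(2.0f - d, 0.0f);   // 1.0f when d == 1, 0.0f for d >= 2: a predicated weight instead of a branch
atomicAdd(&(X[I]), fd * mx);
atomicAdd(&(Y[I]), fd * my);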

CUDA - Optimize mean of matrix rows calculation using shared memory

I am trying to optimize the computation of the mean of each row in my 512 x 1024 (width x height) image, and then subtract that mean from the row from which it was computed. I wrote a piece of code which does it in 1.86 ms, but I want to reduce that time. The code works fine, but does not use shared memory, and it uses for loops. I want to do away with them.
__global__ void subtractMean (const float *__restrict__ img, float *lineImg, int height, int width) {
    // height = 1024, width = 512
    int tidy = threadIdx.x + blockDim.x * blockIdx.x;
    float sum = 0.0f;
    float sumDiv = 0.0f;
    if(tidy < height) {
        for(int c = 0; c < width; c++) {
            sum += img[tidy*width + c];
        }
        sumDiv = (sum/width)/2;
        //__syncthreads();
        for(int cc = 0; cc < width; cc++) {
            lineImg[tidy*width + cc] = img[tidy*width + cc] - sumDiv;
        }
    }
    __syncthreads();
}
I called the above kernel using:
subtractMean <<< 2, 512 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);
However, the following code I wrote uses shared memory for the optimization, but it does not work as expected. Any thoughts on what the problem might be?
__global__ void subtractMean (const float *__restrict__ img, float *lineImg, int height, int width) {
    extern __shared__ float perRow[];
    int idx = threadIdx.x;    // set idx along x
    int stride = width/2;
    while(idx < width) {
        perRow[idx] = 0;
        idx += stride;
    }
    __syncthreads();
    int tidx = threadIdx.x;   // set idx along x
    int tidy = blockIdx.x;    // set idx along y
    if(tidy < height) {
        while(tidx < width) {
            perRow[tidx] = img[tidy*width + tidx];
            tidx += stride;
        }
    }
    __syncthreads();
    tidx = threadIdx.x;       // reset idx along x
    tidy = blockIdx.x;        // reset idx along y
    if(tidy < height) {
        float sumAllPixelsInRow = 0.0f;
        float sumDiv = 0.0f;
        while(tidx < width) {
            sumAllPixelsInRow += perRow[tidx];
            tidx += stride;
        }
        sumDiv = (sumAllPixelsInRow/width)/2;
        tidx = threadIdx.x;   // reset idx along x
        while(tidx < width) {
            lineImg[tidy*width + tidx] = img[tidy*width + tidx] - sumDiv;
            tidx += stride;
        }
    }
    __syncthreads();
}
The shared memory function was called using:
subtractMean <<< 1024, 256, sizeof(float)*512 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);
Two blocks are hardly enough to saturate the GPU. You are moving towards the right approach by utilizing more blocks; however, since you are using Kepler, I would like to present an option that does not use shared memory at all.
Start with 32 threads in a block (this can be changed later using 2D blocks). With those 32 threads you should do something along the lines of this:
int rowID = blockIdx.x;
int tid = threadIdx.x;
int stride = blockDim.x;
int index = threadIdx.x;
float sum = 0.0f;
while(index < width){
    sum += img[width*rowID + index];
    index += blockDim.x;
}
At this point you will have 32 threads, each holding a partial sum. You next need to add them all together. You can do this without the use of shared memory (since we are within a warp) by using a shuffle reduction. For details on that, look here: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ - what you want is the shuffle warp reduce, but you need to change it to use the full 32 threads.
Now that thread 0 in each warp has the sum of its row, you can divide that by the width cast to a float, and broadcast it to the rest of the warp using __shfl(average, 0). See http://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-description
With the average found and the warps synchronized implicitly and explicitly (with shfl), you can continue on in a similar method with the subtract.
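Putting those pieces together, a minimal sketch of the whole kernel might look like this (one 32-thread block per row; __shfl_down and __shfl are the Kepler-era intrinsics from the linked post - on CUDA 9 and later you would use the __shfl_down_sync and __shfl_sync variants):
__global__ void subtractRowMean(const float * __restrict__ img, float *out, int width)
{
    int rowID = blockIdx.x;       // one block per row
    int tid   = threadIdx.x;      // 32 threads (one warp) per block

    // Each thread accumulates a strided partial sum of its row.
    float sum = 0.0f;
    for (int c = tid; c < width; c += 32)
        sum += img[rowID * width + c];

    // Warp shuffle reduction: after this loop, lane 0 holds the full row sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down(sum, offset);

    // Broadcast lane 0's sum to the whole warp, then compute the mean.
    float average = __shfl(sum, 0) / (float)width;

    // Subtract the mean, striding across the row again.
    for (int c = tid; c < width; c += 32)
        out[rowID * width + c] = img[rowID * width + c] - average;
}
A launch such as subtractRowMean<<< height, 32 >>>(d_img, d_out, width); then processes one row per block, with no shared memory at all.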
Possible further optimizations would be to include more than one warp in a block to improve occupancy, and to manually unroll the loops over the width to improve instruction level parallelism.
Good Luck.

C++ and CUDA: why does the code return different results each time?

Update: I found the bug. Since the code I posted before is very complicated, I have simplified it and kept only the part where the problem is.
if (number >= dim * num_points)
return;
But actually I only have num_points points and want to use num_points threads, so the correct guard should be
if (number >= num_points)
return;
Thank you all for the help.
I'm rewriting some C++ code from the CPU to the GPU, and the code is pasted below. Sorry it's long; I think the problems are easier to detect this way.
In the code, for every thread I need some intermediate results in matrix format, so I allocate device memory for these intermediate results, such as d_dir2, d_R, d_Stick, d_PStick. The results turned out not to be what I expected, so to debug, I tried to output some intermediate results R in this way:
if (k == 0)
{
results[tmp_int1 + i * dim + j] = R[tmp_int1 + i * dim + j];
}
and later in C++, I print results.
However, I found that results gives different values each time. Sometimes it gives the correct answer R, sometimes the values of PStick, sometimes a combination of R and PStick, and sometimes a combination of R and 0 (results is initialized to 0 at the beginning).
I'm very confused about what causes the problem. Any idea? Thank you very much :)
__global__ void stickvote(const int dim, const int num_points, const int gridx, float Sigma, float* input, float* dir2, float* R, float* Stick, float* PStick, float* results) {
    float threshold = 4 * Sigma;
    float c = (- 16 * log(0.1f) * (sqrt(Sigma) - 1)) / 3.1415926f / 3.1415926f;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int number = row * BLOCK_SIZE * gridx + col;
    if (number >= dim * num_points)  // The bug is here!
        return;
    /* ... rest of the kernel omitted for brevity ... */
}
extern "C" void KernelStickVote(int dim, int num_points, float Sigma, float* input, float* results) {
const int totalpoints = num_points;
const int totalpoints_input = (dim + 1)* (dim + 1) * num_points;
const int totalpoints_output = dim * dim * num_points;
size_t size_input = totalpoints_input * sizeof(float);
size_t size_output = totalpoints_output * sizeof(float);
float* d_input;
cutilSafeCall(cudaMalloc((void**)&d_input, size_input));
float* d_result;
cutilSafeCall(cudaMalloc((void**)&d_result, size_output));
// used to save dir, and calculate dir * dir'
float* d_dir2;
cutilSafeCall(cudaMalloc((void**)&d_dir2, dim * num_points * sizeof(float)));
// used to save R: dim * dim * N
float* d_R;
cutilSafeCall(cudaMalloc((void**)&d_R, size_output));
// used to save Stick: dim * dim * N
float* d_Stick;
cutilSafeCall(cudaMalloc((void**)&d_Stick, size_output));
// used to save Stick: dim * dim * N
float* d_PStick;
cutilSafeCall(cudaMalloc((void**)&d_PStick, size_output));
// Copy input data from host to device
cudaMemcpy(d_input, input, size_input, cudaMemcpyHostToDevice);
int totalblock = (totalpoints % BLOCKPOINTS==0 ? totalpoints/BLOCKPOINTS : (int(totalpoints/BLOCKPOINTS) + 1));
int gridx = (65535 < totalblock ? 65535 : totalblock);
int gridy = (totalblock % gridx == 0 ? totalblock/gridx : (int(totalblock/gridx)+1) );
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(gridx, gridy);
stickvote<<<dimGrid, dimBlock>>>(dim, num_points, gridx, Sigma, d_input, d_dir2, d_R, d_Stick, d_PStick, d_result);
cudaMemcpy(results, d_result, size_output, cudaMemcpyDeviceToHost);
cudaFree(d_input);
cudaFree(d_result);
cudaFree(d_dir2);
cudaFree(d_R);
cudaFree(d_Stick);
cudaFree(d_PStick);
}
The original poster of the question performed some further code simplification and debugging himself/herself, and discovered that the guard statement in the kernel:
if (number >= dim * num_points)
return;
was, in fact, incorrect and should have been
if (number >= num_points)
return;
This was the source of the error: with the looser guard, threads with number >= num_points were allowed to run and write out of bounds, which would explain the nondeterministic results and the stray PStick values appearing in results.
This answer has been added as a community wiki answer with the intention of removing this question from the unanswered queue.

Applying Matrix To Image, Seeking Performance Improvements

Edited: Working on the Windows platform.
Problem: Less of a problem, more a request for advice. I'm currently not incredibly versed in low-level programming, but I am attempting to optimize the code below in an attempt to increase the performance of my overall code. This application depends on extremely high-speed image processing.
Current performance: On my computer, this currently computes in about 4-6 ms for a 512x512 image. I'm trying to cut that in half if possible.
Limitations: Due to this project's massive size, fundamental changes to the application are very difficult to make, so things such as porting to DirectX or other GPU methods aren't much of an option. The project currently works; I'm simply trying to figure out how to make it work faster.
Specific information about my use case: Images going into this method will always be exactly square, with sides some multiple of 128 (most likely 512x512), and they will always come out the same size. Other than that, there is not much else to it. The matrix is calculated somewhere else, so this is just the application of the matrix to my image. The original image and the new image are both used afterwards, so copying the image is necessary.
Here is my current implementation:
void ReprojectRectangle( double *mpProjMatrix, unsigned char *pDstScan0, unsigned char *pSrcScan0,
                         int NewBitmapDataStride, int SrcBitmapDataStride, int YOffset, double InversedAspect, int RectX, int RectY, int RectW, int RectH)
{
    int i, j;
    double Xnorm, Ynorm;
    double Ynorm_X_ProjMatrix4, Ynorm_X_ProjMatrix5, Ynorm_X_ProjMatrix7;
    double SrcX, SrcY, T;
    int SrcXnt, SrcYnt;
    int SrcXec, SrcYec, SrcYnvDec;
    unsigned char *pNewPtr, *pSrcPtr1, *pSrcPtr2, *pSrcPtr3, *pSrcPtr4;
    int RectX2, RectY2;

    /* Compensate (or re-center) the Y-coordinate regarding the aspect ratio */
    RectY -= YOffset;

    /* Compute the second point of the rectangle for the loops */
    RectX2 = RectX + RectW;
    RectY2 = RectY + RectH;

    /* Clamp values (be careful with the aspect ratio) */
    if (RectY < 0) RectY = 0;
    if (RectY2 < 0) RectY2 = 0;
    if ((double)RectY > (InversedAspect * 512.0)) RectY = (int)(InversedAspect * 512.0);
    if ((double)RectY2 > (InversedAspect * 512.0)) RectY2 = (int)(InversedAspect * 512.0);

    /* Iterate through each pixel of the scaled re-Proj */
    for (i=RectY; i<RectY2; i++)
    {
        /* Normalize Y-coordinate and take the ratio into account */
        Ynorm = InversedAspect - (double)i / 512.0;

        /* Pre-compute some matrix coefficients */
        Ynorm_X_ProjMatrix4 = Ynorm * mpProjMatrix[4] + mpProjMatrix[12];
        Ynorm_X_ProjMatrix5 = Ynorm * mpProjMatrix[5] + mpProjMatrix[13];
        Ynorm_X_ProjMatrix7 = Ynorm * mpProjMatrix[7] + mpProjMatrix[15];

        for (j=RectX; j<RectX2; j++)
        {
            /* Get a pointer to the pixel on (i,j) */
            pNewPtr = pDstScan0 + ((i+YOffset) * NewBitmapDataStride) + j;

            /* Normalize X-coordinate */
            Xnorm = (double)j / 512.0;

            /* Compute the corresponding coordinates in the source image, before Proj, and normalize source coordinates */
            T = (Xnorm * mpProjMatrix[3] + Ynorm_X_ProjMatrix7);
            SrcY = (Xnorm * mpProjMatrix[0] + Ynorm_X_ProjMatrix4)/T;
            SrcX = (Xnorm * mpProjMatrix[1] + Ynorm_X_ProjMatrix5)/T;

            /* Compute the integer and fractional parts of the coordinates in the source image */
            SrcXnt = (int) SrcX;
            SrcYnt = (int) SrcY;
            SrcXec = 64 - (int) ((SrcX - (double) SrcXnt) * 64);
            SrcYec = 64 - (int) ((SrcY - (double) SrcYnt) * 64);

            /* Get pointers to the four neighbouring source pixels */
            pSrcPtr1 = pSrcScan0 + (SrcXnt * SrcBitmapDataStride) + SrcYnt;
            pSrcPtr2 = pSrcPtr1 + 1;
            pSrcPtr3 = pSrcScan0 + ((SrcXnt+1) * SrcBitmapDataStride) + SrcYnt;
            pSrcPtr4 = pSrcPtr3 + 1;
            SrcYnvDec = (64-SrcYec);

            /* Bilinear interpolation with 6-bit fixed-point weights (hence the >> 12) */
            (*pNewPtr) = (unsigned char)(((SrcYec * (*pSrcPtr1) + SrcYnvDec * (*pSrcPtr2)) * SrcXec +
                                          (SrcYec * (*pSrcPtr3) + SrcYnvDec * (*pSrcPtr4)) * (64 - SrcXec)) >> 12);
        }
    }
}
Two things could help: multiprocessing and SIMD. With multiprocessing you could break up the output image into tiles and have each processor work on the next available tile. With SIMD instructions (SSE, AVX, AltiVec, etc.) you can calculate multiple things at the same time, such as applying the same matrix math to multiple coordinates at once. You can even combine the two: use multiple processors running SIMD instructions to do as much work as possible. You didn't mention what platform you're working on; a sketch of the tiling idea follows.
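Here is a minimal sketch of the multiprocessing approach (a hypothetical wrapper, assuming the rectangle arguments are interpreted as in the function above, and that workers writing disjoint destination rows makes the calls independent):
#include <algorithm>
#include <thread>
#include <vector>

void ReprojectParallel(double *mpProjMatrix, unsigned char *pDstScan0, unsigned char *pSrcScan0,
                       int NewBitmapDataStride, int SrcBitmapDataStride, int YOffset,
                       double InversedAspect, int width, int height)
{
    unsigned nWorkers = std::max(1u, std::thread::hardware_concurrency());
    int band = (height + (int)nWorkers - 1) / (int)nWorkers;   // rows per worker, rounded up

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nWorkers; ++t) {
        int y0   = (int)t * band;
        int rows = std::min(band, height - y0);
        if (rows <= 0) break;
        // Each worker reprojects its own horizontal band of the output.
        workers.emplace_back(ReprojectRectangle, mpProjMatrix, pDstScan0, pSrcScan0,
                             NewBitmapDataStride, SrcBitmapDataStride, YOffset,
                             InversedAspect, 0, y0, width, rows);
    }
    for (auto &w : workers) w.join();
}
SIMD would then be applied inside the inner loop of ReprojectRectangle itself, for example by computing several source coordinates per iteration.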