CUDA, could using shared memory improve my performance? - c++

I'm implementing an algorithm to convert an image to grayscale using CUDA. I've got it working right now, but I'm looking for ways to improve performance.
Right now, the entire color image is transferred to the device memory, after which each thread calculates the gray pixel value by looking up the corresponding three (r,g,b) color values.
I have already made sure that global memory access is coalesced, though this barely improved performance (a 36 MB image took 0.003 s less once the accesses were coalesced). Now I'm wondering whether using shared memory could help. Here's what I have:
My CUDA kernel:
__global__ void darkenImage(const unsigned char * inputImage,
unsigned char * outputImage, const int width, const int height, int iteration){
int x = ((blockIdx.x * blockDim.x) + (threadIdx.x + (iteration * MAX_BLOCKS * nrThreads))) * 3;
if(x+2 < (3 * width*height)){
float grayPix = 0.0f;
float r = static_cast< float >(inputImage[x]);
float g = static_cast< float >(inputImage[x+1]);
float b = static_cast< float >(inputImage[x+2]);
grayPix = __fadd_rn(__fadd_rn(__fmul_rn(0.3f, r),__fmul_rn(0.59f, g)), __fmul_rn(0.11f, b));
grayPix = fma(grayPix,0.6f,0.5f);
outputImage[(x/3)] = static_cast< unsigned char >(grayPix);
}
}
My question really is, because there is no memory shared between any two threads, using shared memory shouldn't really help here now should it? Or did I misunderstand?
Regards,
Linus

If you never read the same value more than once, staging it in shared memory (which acts as a software-managed cache) will not improve performance. You can, however, try removing the iteration parameter and processing more data per block: use a single kernel launch with a loop inside the kernel, so that each thread computes more than one output value.
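The loop-inside-the-kernel idea is commonly written as a grid-stride loop. Below is a minimal host-side C++ sketch that emulates it; it is not the asker's code. The helper name `toGrayStrided` is hypothetical, `totalThreads` stands in for gridDim.x * blockDim.x, and the weights and the 0.6/0.5 darkening step are copied from the kernel in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a grid-stride loop (hypothetical helper, not CUDA code).
// totalThreads plays the role of gridDim.x * blockDim.x in a real kernel.
std::vector<unsigned char> toGrayStrided(const std::vector<unsigned char>& rgb,
                                         std::size_t totalThreads) {
    assert(totalThreads > 0 && rgb.size() % 3 == 0);
    const std::size_t pixels = rgb.size() / 3;
    std::vector<unsigned char> gray(pixels, 0);
    // Outer loop: one iteration per emulated thread (these run in parallel on a GPU).
    for (std::size_t tid = 0; tid < totalThreads; ++tid) {
        // Inner loop: the loop each thread would run inside the kernel.
        for (std::size_t p = tid; p < pixels; p += totalThreads) {
            const float r = rgb[3 * p];
            const float g = rgb[3 * p + 1];
            const float b = rgb[3 * p + 2];
            const float v = 0.3f * r + 0.59f * g + 0.11f * b;
            // Same darkening step as the question: scale by 0.6, add 0.5 to round.
            gray[p] = static_cast<unsigned char>(v * 0.6f + 0.5f);
        }
    }
    return gray;
}
```

Whatever value is chosen for `totalThreads`, every pixel is written exactly once, which is why a single launch with a fixed grid can cover an image of any size.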

No, you are correct: shared memory won't help because you are not accessing any data more than once.

Related

A specific OpenCL kernel performs differently on mobile and PC

I was trying to run an OpenCL kernel on both an Adreno 630 and my laptop. The kernel runs perfectly on the mobile device but crashes my laptop every single time. I am still trying to figure out the reason myself. Here's my kernel; I hope you can help me with it, thanks.
__kernel void gen_mapxy( __read_only image2d_t _disp, const float offsetX, __write_only image2d_t _mapxy )
{
const int y = get_global_id(0);
const int local_y = get_local_id(0);
__local short temp[24][1080];
const int imageWidth = get_image_width(_disp);
for(int x = 0; x < imageWidth; ++x)
temp[local_y][x] = 0;
for(int x = imageWidth - 1; x >= 0; --x){
int tempDisp = read_imagei(_disp, sampler_nearest, (int2)(x, y)).x;
int newPos = clamp((int)(x + offsetX * (tempDisp) / 255), 0, imageWidth - 1);
temp[local_y][newPos] = tempDisp;
write_imagef(_mapxy, (int2)(newPos, y), (float4)(x, y, 0, 0));
}
}
You are using a big local array.
__local short temp[24][1080]
2 bytes * 24 * 1080 = 51,840 bytes (about 50.6 KiB). Some desktop GPUs (and their notebook counterparts) have a lower local memory limit. For example, a GTX 1060 reports CL_DEVICE_LOCAL_MEM_SIZE as 49152 bytes. The Adreno 630 is either silently ignoring the oversized array or really does support larger local arrays; it is possible that local arrays are emulated in global memory (whose limit is in the hundreds of megabytes) on those chips. If it does have fast on-chip local memory, then it is more likely that the error is being ignored, unless the local memory limit was really doubled relative to the previous generation of Adrenos.
Even when a GPU supports the exact size, using all of its local memory limits thread-level parallelism on each pipeline, which generally reduces the potential performance gain severely.
If the previous generation of Adreno GPUs is comparable,
https://compubench.com/device.jsp?benchmark=compu15m&os=Android&api=cs&D=Samsung+Galaxy+S7+%28SM-G930x%29&testgroup=info
this page says
CL_DEVICE_LOCAL_MEM_SIZE
32768
CL_DEVICE_LOCAL_MEM_TYPE
CL_LOCAL
So the local memory is fast, but it is only 32 KiB; either the driver is ignoring the error, or you have not added the necessary error checking, or both.
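The size arithmetic above can be checked with a few lines of host code. This is only a sketch: `localArrayBytes` and `fitsInLocalMemory` are hypothetical helper names, and the limits you pass in would be whatever your devices report for CL_DEVICE_LOCAL_MEM_SIZE.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed by a 2D __local array (hypothetical helper for illustration).
std::size_t localArrayBytes(std::size_t rows, std::size_t cols, std::size_t elemSize) {
    assert(elemSize > 0);
    return rows * cols * elemSize;
}

// True if the array fits in the value a device reports for CL_DEVICE_LOCAL_MEM_SIZE.
bool fitsInLocalMemory(std::size_t neededBytes, std::size_t deviceLimitBytes) {
    return neededBytes <= deviceLimitBytes;
}
```

For `__local short temp[24][1080]` with a 2-byte short, `localArrayBytes(24, 1080, 2)` gives 51840, which exceeds both the 49152-byte and the 32768-byte limits quoted above.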

How do you iterate through a pitched CUDA array?

Having parallelized with OpenMP before, I'm trying to wrap my head around CUDA, which doesn't seem too intuitive to me. At this point, I'm trying to understand exactly how to loop through an array in a parallelized fashion.
Cuda by Example is a great start.
The snippet on page 43 shows:
__global__ void add( int *a, int *b, int *c ) {
int tid = blockIdx.x; // handle the data at this index
if (tid < N)
c[tid] = a[tid] + b[tid];
}
Whereas in OpenMP the programmer chooses the number of times the loop will run and OpenMP splits that into threads for you, in CUDA you have to tell it (via the number of blocks and number of threads in <<<...>>>) to run it sufficient times to iterate through your array, using a thread ID number as an iterator. In other words you can have a CUDA kernel always run 10,000 times which means the above code will work for any array up to N = 10,000 (and of course for smaller arrays you're wasting cycles dropping out at if (tid < N)).
For pitched memory (2D and 3D arrays), the CUDA Programming Guide has the following example:
// Host code
int width = 64, height = 64;
float* devPtr; size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
for (int r = 0; r < height; ++r) {
float* row = (float*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
float element = row[c];
}
}
}
This example doesn't seem too useful to me. First they declare an array that is 64 x 64, then the kernel is set to execute 512 x 100 times. That's fine, because the kernel does nothing other than iterate through the array (so it runs 51,200 loops through a 64 x 64 array).
According to this answer the iterator for when there are blocks of threads going on will be
int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
So if I wanted to run the first snippet in my question for a pitched array, I could just make sure I had enough blocks and threads to cover every element including the padding that I don't care about. But that seems wasteful.
So how do I iterate through a pitched array without going through the padding elements?
In my particular application I have a 2D FFT and I'm trying to calculate arrays of the magnitude and angle (on the GPU to save time).
After reviewing the valuable comments and answers from JackOLantern, and re-reading the documentation, I was able to get my head straight. Of course the answer is "trivial" now that I understand it.
In the code below, I define CFPtype (Complex Floating Point) and FPtype so that I can quickly change between single and double precision. For example, #define CFPtype cufftComplex.
I still can't wrap my head around the number of threads used to call the kernel. If it's too large, it simply won't go into the function at all. The documentation doesn't seem to say anything about what number should be used - but this is all for a separate question.
The key in getting my whole program to work (2D FFT on pitched memory and calculating magnitude and argument) was realizing that even though CUDA gives you plenty of "apparent" help in allocating 2D and 3D arrays, everything is still in units of bytes. It's obvious in a malloc call that the sizeof(type) must be included, but I totally missed it in calls of the type allocate(width, height). Noob mistake, I guess. Had I written the library I would have made the type size a separate parameter, but whatever.
So given an image of dimensions width x height in pixels, this is how it comes together:
Allocating memory
I'm using pinned memory on the host side because it's supposed to be faster. That's allocated with cudaHostAlloc which is straightforward. For pitched memory, you need to store the pitch for each different width and type, because it could change. In my case the dimensions are all the same (complex to complex transform) but I have arrays that are real numbers so I store a complexPitch and a realPitch. The pitched memory is done like this:
cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);
To copy memory to or from pitched arrays, plain cudaMemcpy is not suitable because it knows nothing about the row padding; use cudaMemcpy2D instead:
cudaMemcpy2D(inputGPU, complexPitch, //destination and destination pitch
inputPinned, width * sizeof(CFPtype), //source and source pitch (the row width in bytes, because the host buffer is not padded)
width * sizeof(CFPtype), height, cudaMemcpyKind::cudaMemcpyHostToDevice);
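Conceptually, a 2D pitched copy is a row-by-row copy in which each side has its own row stride in bytes. The host-only sketch below illustrates that idea; `copy2D` is a hypothetical helper and not part of the CUDA API, although its parameter order is modeled on cudaMemcpy2D.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-side sketch of what a 2D pitched copy does (hypothetical helper, not the CUDA API).
// Both pitches and widthBytes are in bytes, as with cudaMemcpy2D.
void copy2D(void* dst, std::size_t dstPitch,
            const void* src, std::size_t srcPitch,
            std::size_t widthBytes, std::size_t height) {
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        // Copy only the payload of each row; the padding bytes are never touched.
        std::memcpy(d + row * dstPitch, s + row * srcPitch, widthBytes);
    }
}
```

Because everything is expressed in bytes, the caller multiplies the element count by sizeof(type) for the width, exactly as in the call above.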
FFT plan for pitched arrays
JackOLantern provided this answer, which I couldn't have done without. In my case the plan looks like this:
int n[] = {height, width};
int nembed[] = {height, (int)(complexPitch / sizeof(CFPtype))}; //pitch converted from bytes to elements
result = cufftPlanMany(
&plan,
2, n, //transform rank and dimensions
nembed, 1, //input array physical dimensions and stride
1, //input distance to next batch (irrelevant because we are only doing 1)
nembed, 1, //output array physical dimensions and stride
1, //output distance to next batch
cufftType::CUFFT_C2C, 1);
Executing the FFT is trivial:
cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD);
So far I have had little to optimize. Now I wanted to get magnitude and phase out of the transform, hence the question of how to traverse a pitched array in parallel. First I define a function to call the kernel with the "correct" threads per block and enough blocks to cover the entire image. As suggested by the documentation, creating 2D structures for these numbers is a great help.
void GPUCalcMagPhase(CFPtype *data, size_t dataPitch, int width, int height, FPtype *magnitude, FPtype *phase, size_t magPhasePitch, int cudaBlockSize)
{
dim3 threadsPerBlock(cudaBlockSize, cudaBlockSize);
dim3 numBlocks((unsigned int)ceil(width / (double)threadsPerBlock.x), (unsigned int)ceil(height / (double)threadsPerBlock.y));
CalcMagPhaseKernel<<<numBlocks, threadsPerBlock>>>(data, dataPitch, width, height, magnitude, phase, magPhasePitch);
}
Setting the blocks and threads per block is equivalent to writing the (up to 3) nested for-loops. So you have to have enough blocks * threads to cover the array, and then in the kernel you must make sure that you are not exceeding the array size. By using 2D elements for threadsPerBlock and numBlocks, you avoid having to go through the padding elements in the array.
Traversing a pitched array in parallel
The kernel uses the standard pointer arithmetic from the documentation:
__global__ void CalcMagPhaseKernel(CFPtype *data, size_t dataPitch, int width, int height,
FPtype *magnitude, FPtype *phase, size_t magPhasePitch)
{
int threadX = threadIdx.x + blockDim.x * blockIdx.x;
if (threadX >= width)
return;
int threadY = threadIdx.y + blockDim.y * blockIdx.y;
if (threadY >= height)
return;
CFPtype *threadRow = (CFPtype *)((char *)data + threadY * dataPitch);
CFPtype complex = threadRow[threadX];
FPtype *magRow = (FPtype *)((char *)magnitude + threadY * magPhasePitch);
FPtype *magElement = &(magRow[threadX]);
FPtype *phaseRow = (FPtype *)((char *)phase + threadY * magPhasePitch);
FPtype *phaseElement = &(phaseRow[threadX]);
*magElement = sqrt(complex.x*complex.x + complex.y*complex.y);
*phaseElement = atan2(complex.y, complex.x);
}
The only wasted threads here are for the cases where the width or height are not multiples of the number of threads per block.
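For checking the GPU output, a CPU reference with the same pointer arithmetic can be handy. The sketch below is an assumption-laden stand-in rather than the code I actually ran: `Complex2f` substitutes for cufftComplex, single precision is assumed for FPtype, and `calcMagPhaseHost` is a hypothetical name. The two loops replace the 2D grid of threads.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct Complex2f { float x, y; };  // stand-in for cufftComplex (assumption)

// CPU reference of the kernel: same pitched pointer arithmetic, loops instead of threads.
void calcMagPhaseHost(const Complex2f* data, std::size_t dataPitch,
                      int width, int height,
                      float* magnitude, float* phase, std::size_t magPhasePitch) {
    assert(width >= 0 && height >= 0);
    assert(dataPitch >= width * sizeof(Complex2f));
    assert(magPhasePitch >= width * sizeof(float));
    for (int y = 0; y < height; ++y) {
        // Row starts are computed in bytes, then cast back to the element type.
        const Complex2f* row = reinterpret_cast<const Complex2f*>(
            reinterpret_cast<const char*>(data) + y * dataPitch);
        float* magRow = reinterpret_cast<float*>(
            reinterpret_cast<char*>(magnitude) + y * magPhasePitch);
        float* phaseRow = reinterpret_cast<float*>(
            reinterpret_cast<char*>(phase) + y * magPhasePitch);
        for (int x = 0; x < width; ++x) {  // stops at width, so padding is skipped
            magRow[x] = std::sqrt(row[x].x * row[x].x + row[x].y * row[x].y);
            phaseRow[x] = std::atan2(row[x].y, row[x].x);
        }
    }
}
```

The inner loop bound is `width`, not the pitch, which is the host-side equivalent of the `threadX >= width` check in the kernel.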

CUDA parallelizing a dependent 2D array

I have a sample loop of the following form. Notice that psi[i][j] depends on psi[i+1][j], psi[i-1][j], psi[i][j+1] and psi[i][j-1], and that I have to calculate psi for the inner matrix only. I tried writing this in CUDA, but the results are not the same as those of the sequential version.
for(i=1;i<=leni-2;i++)
for(j=1;j<=lenj-2;j++){
psi[i][j]=(omega[i][j]*(dx*dx)*(dy*dy)+(psi[i+1][j]+psi[i-1][j])*(dy*dy)+(psi[i][j+1]+psi[i][j-1])*(dx*dx) )/(2.0*(dx*dx)+2.0*(dy*dy));
}
Here's my CUDA version.
//KERNEL
__global__ void ComputePsi(double *psi, double *omega, int imax, int jmax)
{
int x = blockIdx.x;
int y = blockIdx.y;
int i = (jmax*x) + y;
double beta = 1;
double dx=(double)30/(imax-1);
double dy=(double)1/(jmax-1);
if((i)%jmax!=0 && (i+1)%jmax!=0 && i>=jmax && i<imax*jmax-jmax){
psi[i]=(omega[i]*(dx*dx)*(dy*dy)+(psi[i+jmax]+psi[i-jmax])*(dy*dy)+(psi[i+1]+psi[i-1])*(dx*dx) )/(2.0*(dx*dx)+2.0*(dy*dy));
}
}
//Code
cudaMalloc((void **) &dev_psi, leni*lenj*sizeof(double));
cudaMalloc((void **) &dev_omega, leni*lenj*sizeof(double));
cudaMemcpy(dev_psi, psi, leni*lenj*sizeof(double),cudaMemcpyHostToDevice);
cudaMemcpy(dev_omega, omega, leni*lenj*sizeof(double),cudaMemcpyHostToDevice);
dim3 grids(leni,lenj);
for(iterpsi=0;iterpsi<30;iterpsi++)
ComputePsi<<<grids,1>>>(dev_psi, dev_omega, leni, lenj);
Here psi[leni][lenj] and omega[leni][lenj] are double arrays.
The problem is that the sequential and CUDA codes give different results. Is any modification needed in the code?
You are working in global memory, and you are changing psi entries while other threads may still need the old values. Store the values of the new iteration in a separate array instead, and keep in mind that you have to swap the two arrays after each iteration.
A more sophisticated approach would use shared memory and divide the spatial domain among the threads. Search for CUDA tutorials on solving the heat/diffusion equation and you will get the idea.
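The separate-array idea can be sketched on the host as follows. This is a minimal illustration, not a drop-in replacement for the kernel: `jacobiSweep` and `solve` are hypothetical names, the arrays are assumed to be row-major with lenj elements per row, and dx and dy are passed in rather than computed.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One sweep that reads only from `in` and writes only to `out` (row-major, leni x lenj).
void jacobiSweep(const std::vector<double>& in, std::vector<double>& out,
                 const std::vector<double>& omega,
                 int leni, int lenj, double dx, double dy) {
    assert(in.size() == static_cast<std::size_t>(leni) * lenj);
    assert(out.size() == in.size() && omega.size() == in.size());
    const double dx2 = dx * dx, dy2 = dy * dy;
    for (int i = 1; i <= leni - 2; ++i) {
        for (int j = 1; j <= lenj - 2; ++j) {
            const int c = i * lenj + j;
            out[c] = (omega[c] * dx2 * dy2 +
                      (in[c + lenj] + in[c - lenj]) * dy2 +
                      (in[c + 1] + in[c - 1]) * dx2) / (2.0 * dx2 + 2.0 * dy2);
        }
    }
}

// Runs `iterations` sweeps, swapping the two buffers after each one.
void solve(std::vector<double>& psi, const std::vector<double>& omega,
           int leni, int lenj, double dx, double dy, int iterations) {
    std::vector<double> next = psi;  // copy so the boundary values exist in both buffers
    for (int it = 0; it < iterations; ++it) {
        jacobiSweep(psi, next, omega, leni, lenj, dx, dy);
        std::swap(psi, next);
    }
}
```

On the GPU the same pattern would use two device buffers and swap the two pointers between kernel launches, so no thread ever reads a value that another thread is writing in the same sweep.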
for(i=1;i<=leni-2;i++)
for(j=1;j<=lenj-2;j++){
psi[i][j]= ( omega[i][j]*(dx*dx)*(dy*dy) +
(psi[i+1][j]+psi[i-1][j]) * (dy*dy) +
(psi[i][j+1]+psi[i][j-1]) * (dx*dx)
)/(2.0*(dx*dx)+2.0*(dy*dy));
}
I think this loop is order-dependent even when it runs sequentially: the value of psi[i][j] depends on the order of the operations. Within a sweep you use the not-yet-updated psi[i+1][j] and psi[i][j+1], but psi[i-1][j] and psi[i][j-1] have already been updated in the same sweep.
With CUDA the order of the operations is different, so you should expect the result to be different.
To enforce such an ordering, if it is possible at all, you would need to insert so many synchronizations that it is probably not worthwhile for CUDA. Is this really what you need to do?
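A tiny 1D analogue makes the order dependence visible. The sketch below uses hypothetical function names and a simple neighbor average instead of the full psi formula; the only point is that an in-place sweep and a sweep that reads exclusively old values give different answers after one pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place sweep: later points see values already updated earlier in the same sweep.
std::vector<double> sweepInPlace(std::vector<double> a) {
    assert(a.size() >= 2);
    for (std::size_t i = 1; i + 1 < a.size(); ++i)
        a[i] = 0.5 * (a[i - 1] + a[i + 1]);
    return a;
}

// Buffered sweep: every point sees only the values from before the sweep.
std::vector<double> sweepBuffered(const std::vector<double>& a) {
    assert(a.size() >= 2);
    std::vector<double> out = a;
    for (std::size_t i = 1; i + 1 < a.size(); ++i)
        out[i] = 0.5 * (a[i - 1] + a[i + 1]);
    return out;
}
```

Starting from {8, 0, 0, 0}, the in-place sweep yields {8, 4, 2, 0} while the buffered sweep yields {8, 4, 0, 0}. A parallel kernel that updates psi in place falls somewhere in between, depending on thread scheduling.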

Memory allocated in assembly using malloc - want to convert it to a 3-D array in C++

I have an assembly segment of the program that does a huge malloc (typically of the order of 8 GB), populates it and does computations on it.
For debugging purposes I want to be able to view this allocated and pre-filled memory as a 3-D array in C/C++. I specifically do not want to allocate another 8 GB, because declaring unsigned char* debug_arr[crystal_size][crystal_size][crystal_size] and doing an element-by-element copy will result in a stack overflow.
Ideally I would love to typecast the memory pointer to a 3D array pointer. Is that possible?
The objective is to verify the computation results produced by the assembly segment.
My C/C++ knowledge is average. I mostly use 64-bit assembly, so please give me the C++ typecasting in some detail.
Env: Intel Core i7 2600K @ 4.4 GHz with 16 GB RAM, 64-bit assembly programming on 64-bit Windows 7, Visual Studio Express 2012
Thanks...
If you want to access a single unsigned char entry as if from a 3D array, you obviously need the relevant dimensions (call them nXDim, nYDim, nZDim for the sake of argument) and you need to know what dimension order has been assumed during writing.
If we assume that z changes less frequently than y and y less frequently than x then you can access your array via a function such as this:
unsigned char* GetEntry(size_t nX, size_t nY, size_t nZ)
{
// size_t arithmetic: an int offset would overflow for a buffer of around 8 GB
return &pYourArray[(nZ * nXDim * nYDim) + (nY * nXDim) + nX];
}
First check which ordering is used in your memory. There are two types: row-major and column-major ordering.
For row major ordering
Address = Base + ((depthindex*col_size+colindex) * row_size + rowindex) * Element_Size
For column major ordering
Address = Base + ((rowindex*col_size+colindex) * depth_size + depthindex) * Element_Size
Here is an example for you to expand on:
char array[10000]; // One dimensional array
char * mat[100]; // Matrix for 2D array
for ( int i = 0; i < 100; i++ )
mat[i] = array + i * 100;
Now, you have the matrix as a 100x100 element 2D array in the same memory as the array.
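If you know the dimensions at compile time, then something like this should work:
void * crystal_cube = 0; // set by asm magic;
typedef unsigned char (*DEBUG_CUBE)[2044][2044]; // pointer to 2044 x 2044 planes, not an array of pointers
DEBUG_CUBE debug_cube = (DEBUG_CUBE) crystal_cube; // index as debug_cube[i][j][k]; nothing is copied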
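Here is a small, self-contained sketch of the pointer-to-array approach with a 4 x 4 x 4 cube so that it can be tried quickly. The names `kDim`, `CubeView` and `asCube` are hypothetical, and the element type is assumed to be unsigned char; for the real buffer you would substitute your own compile-time dimension.

```cpp
#include <cassert>
#include <cstddef>

// Small compile-time dimension for illustration; the real cube would use its own size.
constexpr std::size_t kDim = 4;

// A pointer to kDim x kDim planes lets cube[z][y][x] index the existing buffer directly.
typedef unsigned char (*CubeView)[kDim][kDim];

// Reinterprets an already allocated flat buffer as a 3D view; nothing is copied.
CubeView asCube(void* buffer) {
    assert(buffer != nullptr);
    return static_cast<CubeView>(buffer);
}
```

With this view, `cube[z][y][x]` refers to the same byte as `buffer[(z * kDim + y) * kDim + x]`, so the last index is the one that varies fastest in memory.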

CUDA - no blocks, just threads for undefined dimensions

I have some matrices with unknown sizes varying from 10 to 20,000 in both directions.
I designed a CUDA kernel with (x;y) blocks and (x;y) threads.
Since the matrix widths and heights aren't multiples of my block dimensions, it was a terrible pain to get things working, and the code is becoming more and more complicated in order to get coalesced memory reads.
Besides all of that, the kernel is growing in size and using more and more registers to check for correctness, so I think this is not the approach I should adopt.
My question is: what if I totally eliminate blocks and just create a grid of x;y threads? Will an SM have problems without many blocks?
Can I eliminate blocks and use a large amount of threads or is the block subdivision necessary?
You can't really just make a "grid of threads", since you have to organize threads into blocks, and a block can hold at most 512 threads (1024 on devices of compute capability 2.0 and later). However, you could effectively do this by using 1 thread per block, which gives an X by Y grid of 1x1 blocks. This will result in pretty terrible performance due to several factors:
According to the CUDA Programming Guide, an SM can handle a maximum of 8 blocks at any time (the limit is higher on newer architectures). This will limit you to 8 threads per SM, which isn't enough to fill even a single warp. If you have, say, 48 SMs, you will only be able to handle 384 threads at any given time.
With only 8 threads available on a SM, there will be too few warps to hide memory latencies. The GPU will spend most of its time waiting for memory accesses to complete, rather than doing any computations.
You will be unable to coalesce memory reads and writes, resulting in poor memory bandwidth usage.
You will be effectively unable to leverage shared memory, as this is a shared resource between threads in a block.
While having to ensure correctness for threads in a block is annoying, your performance will be vastly better than your "grid of threads" idea.
Here's the code I use to divide a given task requiring num_threads threads into a block and grid configuration. Yes, you might end up launching too many blocks (but only very few), and you will probably end up with more actual threads than required, but it's easy and efficient this way. See the second code example below for my simple in-kernel boundary check.
PS: I always have block_size == 128 because it has been a good tradeoff between multiprocessor occupancy, register usage, shared memory requirements and coalesced access for all of my kernels.
Code to calculate a good grid size (host):
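#include <cassert>
#define GRID_SIZE 65535
//ceil division (rounding up), defined below
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator);
//calculate grid size (store result in grid/block)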
void kernelUtilCalcGridSize(unsigned int num_threads, unsigned int block_size, dim3* grid, dim3* block) {
//block
block->x = block_size;
block->y = 1;
block->z = 1;
//number of blocks
unsigned int num_blocks = kernelUtilCeilDiv(num_threads, block_size);
unsigned int total_threads = num_blocks * block_size;
assert(total_threads >= num_threads);
//calculate grid size
unsigned int gy = kernelUtilCeilDiv(num_blocks, GRID_SIZE);
unsigned int gx = kernelUtilCeilDiv(num_blocks, gy);
unsigned int total_blocks = gx * gy;
assert(total_blocks >= num_blocks);
//grid
grid->x = gx;
grid->y = gy;
grid->z = 1;
}
//ceil division (rounding up)
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator) {
return (numerator + denominator - 1) / denominator;
}
Code to calculate the unique thread id and check boundaries (device):
//some kernel
__global__ void kernelFoo(unsigned int num_threads, ...) {
//calculate unique id
const unsigned int thread_id = threadIdx.x;
const unsigned int block_id = blockIdx.x + blockIdx.y * gridDim.x;
const unsigned int unique_id = thread_id + block_id * blockDim.x;
//check range
if (unique_id >= num_threads) return;
//do the actual work
...
}
I don't think that's a lot of effort/registers/lines-of-code to check for correctness.
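To convince yourself that this grid layout plus the range check touches every work item exactly once, the launch can be emulated on the host. The following is only a sketch: `ceilDiv` mirrors kernelUtilCeilDiv, `coversEachItemOnce` is a hypothetical helper, and `maxGridX` stands in for GRID_SIZE so that the 2D grid case can be exercised with small numbers.

```cpp
#include <cassert>
#include <vector>

// Ceiling division, mirroring kernelUtilCeilDiv.
unsigned int ceilDiv(unsigned int numerator, unsigned int denominator) {
    assert(denominator > 0);
    return (numerator + denominator - 1) / denominator;
}

// Emulates the launch on the host and reports whether every work item
// in [0, numThreads) is visited exactly once.
bool coversEachItemOnce(unsigned int numThreads, unsigned int blockSize,
                        unsigned int maxGridX) {
    assert(numThreads > 0 && blockSize > 0 && maxGridX > 0);
    const unsigned int numBlocks = ceilDiv(numThreads, blockSize);
    const unsigned int gy = ceilDiv(numBlocks, maxGridX);
    const unsigned int gx = ceilDiv(numBlocks, gy);
    std::vector<int> visits(numThreads, 0);
    for (unsigned int by = 0; by < gy; ++by) {
        for (unsigned int bx = 0; bx < gx; ++bx) {
            for (unsigned int tx = 0; tx < blockSize; ++tx) {
                const unsigned int blockId = bx + by * gx;
                const unsigned int uniqueId = tx + blockId * blockSize;
                if (uniqueId >= numThreads) continue;  // the in-kernel range check
                ++visits[uniqueId];
            }
        }
    }
    for (unsigned int i = 0; i < numThreads; ++i)
        if (visits[i] != 1) return false;
    return true;
}
```

For example, 1000 work items with a block size of 128 and `maxGridX` of 3 produce a 3 x 3 grid of 9 blocks, one more than the 8 strictly needed; the surplus threads simply return at the range check.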