Why is this CUDA kernel slow?

I need help making my CUDA program run faster. The NVIDIA Visual Profiler reports poor performance: "Low Compute Utilization 1.4%".
The code is below. First, the preparation before the kernel launch:
void laskeSyvyydet(int& tiilet0, int& tiilet1, int& tiilet2, int& tiilet3) {
cudaArray *tekstuuriSisaan, *tekstuuriUlos;
//take care of synchronization
cudaEvent_t cEvent;
cudaEventCreate(&cEvent);
//let's take control of OpenGL textures
cudaGraphicsMapResources(1, &cuda.cMaxSyvyys);
cudaEventRecord(cEvent, 0);
cudaGraphicsMapResources(1, &cuda.cDepthTex);
cudaEventRecord(cEvent, 0);
//need to create CUDA pointers
cudaGraphicsSubResourceGetMappedArray(&tekstuuriSisaan, cuda.cDepthTex, 0, 0);
cudaGraphicsSubResourceGetMappedArray(&tekstuuriUlos, cuda.cMaxSyvyys, 0, 0);
cudaProfilerStart();
//launch kernel
cLaskeSyvyydet(tiilet0, tiilet1, tiilet2, tiilet3, tekstuuriSisaan, tekstuuriUlos);
cudaEventRecord(cEvent, 0);
cudaProfilerStop();
//release textures back to OpenGL
cudaGraphicsUnmapResources(1, &cuda.cMaxSyvyys, 0);
cudaEventRecord(cEvent, 0);
cudaGraphicsUnmapResources(1, &cuda.cDepthTex, 0);
cudaEventRecord(cEvent, 0);
//final synchronization
cudaEventSynchronize(cEvent);
cudaEventDestroy(cEvent);
}
Kernel launch:
void cLaskeSyvyydet(int& tiilet0, int& tiilet1, int& tiilet2, int& tiilet3, cudaArray* tekstuuriSisaan, cudaArray* tekstuuriUlos) {
cudaBindTextureToArray(surfRefSisaan, tekstuuriSisaan);
cudaBindSurfaceToArray(surfRefUlos, tekstuuriUlos);
int blocksW = (int)ceilf( tiilet0 / 32.0f );
int blocksH = (int)ceilf( tiilet1 / 32.0f );
dim3 gridDim( blocksW, blocksH, 1 );
dim3 blockDim(32, 32, 1 );
kLaskeSyvyydet<<<gridDim, blockDim>>>(tiilet0, tiilet1, tiilet2, tiilet3);
}
And the kernel:
__global__ void kLaskeSyvyydet(const int tiilet0, const int tiilet1, const int tiilet2, const int tiilet3) {
//first define indexes
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i >= tiilet0 || j >= tiilet1) return;
//if we are inside boundaries, let's find the greatest depth value
unsigned int takana=0;
unsigned int ddd;
uchar4 syvyys;
uchar4 dd;
//there's possibly four different tile sizes to choose between
if (j!=tiilet1-1 && i!=tiilet0-1) {
for (int y=j*BLOCK_SIZE; y<(j+1)*BLOCK_SIZE; y++) {
for (int x=i*BLOCK_SIZE; x<(i+1)*BLOCK_SIZE; x++) {
dd=tex2D(surfRefSisaan, x, y);
ddd=(dd.x << 24) | (dd.y << 16) | (dd.z << 8) | (dd.w);
takana=max(takana, ddd);
}
}
} else if (j==tiilet1-1 && i!=tiilet0-1) {
for (int y=j*BLOCK_SIZE; y<j*BLOCK_SIZE+tiilet3; y++) {
for (int x=i*BLOCK_SIZE; x<(i+1)*BLOCK_SIZE; x++) {
dd=tex2D(surfRefSisaan, x, y);
ddd=(dd.x << 24) | (dd.y << 16) | (dd.z << 8) | (dd.w);
takana=max(takana, ddd);
}
}
} else if (j!=tiilet1-1 && i==tiilet0-1) {
for (int y=j*BLOCK_SIZE; y<(j+1)*BLOCK_SIZE; y++) {
for (int x=i*BLOCK_SIZE; x<i*BLOCK_SIZE+tiilet2; x++) {
dd=tex2D(surfRefSisaan, x, y);
ddd=(dd.x << 24) | (dd.y << 16) | (dd.z << 8) | (dd.w);
takana=max(takana, ddd);
}
}
} else if (j==tiilet1-1 && i==tiilet0-1) {
for (int y=j*BLOCK_SIZE; y<j*BLOCK_SIZE+tiilet3; y++) {
for (int x=i*BLOCK_SIZE; x<i*BLOCK_SIZE+tiilet2; x++) {
dd=tex2D(surfRefSisaan, x, y);
ddd=(dd.x << 24) | (dd.y << 16) | (dd.z << 8) | (dd.w);
takana=max(takana, ddd);
}
}
}
//if there's empty texture, then we choose the maximum possible value
if (takana==0) {
takana=1000000000;
}
//after slicing the greatest 32bit depth value into four 8bit pieces we write the value into another texture
syvyys.x=(takana & 0xFF000000) >> 24;
syvyys.y=(takana & 0x00FF0000) >> 16;
syvyys.z=(takana & 0x0000FF00) >> 8;
syvyys.w=(takana & 0x000000FF) >> 0;
surf2Dwrite(syvyys, surfRefUlos, i*sizeof(syvyys), j, cudaBoundaryModeZero);
}
Please help me get this working faster, I have no ideas...

It looks like you have a 2D int input array of the size
((tiilet0-1)*BLOCK_SIZE+tiilet2, ((tiilet1-1)*BLOCK_SIZE)+tiilet3)
Each of your threads sequentially reads all elements in an input block of size
(BLOCK_SIZE, BLOCK_SIZE)
and writes the maximum of each input block to a 2D result array of size
(tiilet0, tiilet1)
Compared to coalesced memory access, this may be the worst possible way to access global memory, even with a 2D texture. You may want to read about coalesced memory access.
https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/
In general, you put too much work into one thread. Given how you map CUDA thread blocks to your input array, your gridDim will be too small to fully utilize the GPU unless you have a VERY large input.
For better performance you may want to change from one CUDA thread per input block to one CUDA thread block per input block (int[BLOCK_SIZE][BLOCK_SIZE]), and use parallel reduction to find the block-wise maximum.
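The block-wise reduction suggested above can be sketched on the host in plain C++. This is only a model of the idea, not the kernel itself: `tileMaxTreeReduction` is a hypothetical name, the vector stands in for a shared-memory array holding one value per thread, and each pass of the outer loop corresponds to one step that would be followed by a `__syncthreads()` barrier on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of a tree reduction for the maximum of one tile.
// 'values' plays the role of a shared-memory array with one slot per thread
// (hypothetical setup; the original kernel does not use shared memory).
std::uint32_t tileMaxTreeReduction(std::vector<std::uint32_t> values) {
    assert(!values.empty());
    // Pad to a power of two with 0, the identity element for max on unsigned.
    std::size_t n = 1;
    while (n < values.size()) n <<= 1;
    values.resize(n, 0u);
    // Each pass halves the number of active "threads"; on the GPU every pass
    // would be followed by a barrier.
    for (std::size_t stride = n / 2; stride > 0; stride >>= 1) {
        for (std::size_t t = 0; t < stride; ++t) {  // threads t < stride are active
            values[t] = std::max(values[t], values[t + stride]);
        }
    }
    return values[0];
}
```

For n values the outer loop runs log2(n) passes (after padding to a power of two), instead of the n sequential comparisons a single thread performs in the original kernel.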

Related

CUDA Speed Slower than expected - Image Processing

I am new to CUDA development and wanted to write a simple benchmark to test the feasibility of some image processing. I have 32 images that are each 720x540, one byte per pixel greyscale.
I run each benchmark for 10 seconds and count how many times it is able to process the images. There are three benchmarks:
The first is just transferring the images into the GPU global memory, via cudaMemcpy
The second is transferring and processing the images.
The third is running the equivalent test on a CPU.
As a simple first test, the image processing just counts the number of pixels above a certain greyscale value. I'm finding that accessing global memory on the GPU is very slow. My benchmark is structured so that it creates one block per image and one thread per row in each image. Each thread counts its pixels into a shared memory array, after which the first thread sums them up (see below).
The issue I am having is that this all runs very slowly - about 50fps. Much slower than a CPU version - about 230fps. If I comment out the pixel value comparison, resulting in just a count of all pixels, I get 6x the performance. I tried using texture memory but didn't see a performance gain. I am running a Quadro K2000. Also: the image copy only benchmark is able to copy at around 330fps, so that doesn't appear to be the issue.
Any help / pointers would be appreciated. Thank you.
__global__ void ThreadPerRowCounter(int Threshold, int W, int H, U8 **AllPixels, int *AllReturns)
{
extern __shared__ int row_counts[];//this parameter to kernel call "<<<, ,>>>" sets the size
//see here for indexing https://blog.usejournal.com/cuda-thread-indexing-fb9910cba084
int myImage = blockIdx.y * gridDim.x + blockIdx.x;
int myStartRow = (threadIdx.y * blockDim.x + threadIdx.x);
unsigned char *imageStart = AllPixels[myImage];
unsigned char *pixelStart = imageStart + myStartRow * W;
unsigned char *pixelEnd = pixelStart + W;
unsigned char *pixelItr = pixelStart;
int row_count = 0;
while(pixelItr < pixelEnd)
{
if (*pixelItr > Threshold) //REMOVING THIS LINE GIVES 6x PERFORMANCE
{
row_count++;
}
pixelItr++;
}
row_counts[myStartRow] = row_count;
__syncthreads();
if (myStartRow == 0)
{//first thread sums up for the whole image
int image_count = 0;
for (int i = 0; i < H; i++)
{
image_count += row_counts[i];
}
AllReturns[myImage] = image_count;
}
}
extern "C" void cuda_Benchmark(int nImages, int W, int H, U8** AllPixels, int *AllReturns, int Threshold)
{
ThreadPerRowCounter<<<nImages, H, sizeof(int)*H>>> (
Threshold,
W, H,
AllPixels,
AllReturns);
//wait for all blocks to finish
checkCudaErrors(cudaDeviceSynchronize());
}

Two changes to your kernel design can result in a significant speedup:
Perform the operations column-wise instead of row-wise. The general background for why this matters/helps is described here.
Replace your final operation with a canonical parallel reduction.
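To see why the column-wise layout helps, one can count how many distinct memory segments the 32 threads of a warp touch in a single loop iteration. The sketch below does this on the host for both indexing schemes. The 128-byte segment size, the assumption that the image starts on a segment boundary, and the helper name `segmentsTouchedByWarp` are all illustrative assumptions; real hardware behavior depends on the architecture and its caches.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts distinct memory segments touched by one warp (32 threads) in one
// loop iteration. Assumptions: 128-byte segments, one byte per pixel, image
// base address aligned to a segment boundary.
// rowWise == true  : thread t reads pixel (row t, column step) -> t*W + step
// rowWise == false : thread t reads pixel (row step, column t) -> t + step*W
std::size_t segmentsTouchedByWarp(bool rowWise, std::size_t W, std::size_t step) {
    assert(W > 0);
    const std::size_t kWarpSize = 32;
    const std::size_t kSegmentBytes = 128;
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t offset = rowWise ? t * W + step : t + step * W;
        segments.insert(offset / kSegmentBytes);
    }
    return segments.size();
}
```

With W = 720, the row-wise scheme touches 32 different segments in every iteration, because neighbouring threads are 720 bytes apart. The column-wise scheme reads 32 consecutive bytes, which fall into one segment (two when the warp straddles a segment boundary).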
According to my testing, those 2 changes result in ~22x speedup in kernel performance:
$ cat t49.cu
#include <iostream>
#include <helper_cuda.h>
typedef unsigned char U8;
__global__ void ThreadPerRowCounter(int Threshold, int W, int H, U8 **AllPixels, int *AllReturns)
{
extern __shared__ int row_counts[];//this parameter to kernel call "<<<, ,>>>" sets the size
//see here for indexing https://blog.usejournal.com/cuda-thread-indexing-fb9910cba084
int myImage = blockIdx.y * gridDim.x + blockIdx.x;
int myStartRow = (threadIdx.y * blockDim.x + threadIdx.x);
unsigned char *imageStart = AllPixels[myImage];
unsigned char *pixelStart = imageStart + myStartRow * W;
unsigned char *pixelEnd = pixelStart + W;
unsigned char *pixelItr = pixelStart;
int row_count = 0;
while(pixelItr < pixelEnd)
{
if (*pixelItr > Threshold) //REMOVING THIS LINE GIVES 6x PERFORMANCE
{
row_count++;
}
pixelItr++;
}
row_counts[myStartRow] = row_count;
__syncthreads();
if (myStartRow == 0)
{//first thread sums up for the whole image
int image_count = 0;
for (int i = 0; i < H; i++)
{
image_count += row_counts[i];
}
AllReturns[myImage] = image_count;
}
}
__global__ void ThreadPerColCounter(int Threshold, int W, int H, U8 **AllPixels, int *AllReturns, int rsize)
{
extern __shared__ int col_counts[];//this parameter to kernel call "<<<, ,>>>" sets the size
int myImage = blockIdx.y * gridDim.x + blockIdx.x;
unsigned char *imageStart = AllPixels[myImage];
int myStartCol = (threadIdx.y * blockDim.x + threadIdx.x);
int col_count = 0;
for (int i = 0; i < H; i++) if (imageStart[myStartCol+i*W]> Threshold) col_count++;
col_counts[threadIdx.x] = col_count;
__syncthreads();
for (int i = rsize; i > 0; i>>=1){
if ((threadIdx.x+i < W) && (threadIdx.x < i)) col_counts[threadIdx.x] += col_counts[threadIdx.x+i];
__syncthreads();}
if (!threadIdx.x) AllReturns[myImage] = col_counts[0];
}
void cuda_Benchmark(int nImages, int W, int H, U8** AllPixels, int *AllReturns, int Threshold)
{
ThreadPerRowCounter<<<nImages, H, sizeof(int)*H>>> (
Threshold,
W, H,
AllPixels,
AllReturns);
//wait for all blocks to finish
checkCudaErrors(cudaDeviceSynchronize());
}
unsigned next_power_of_2(unsigned v){
v--;
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v++;
return v;}
void cuda_Benchmark1(int nImages, int W, int H, U8** AllPixels, int *AllReturns, int Threshold)
{
int rsize = next_power_of_2(W/2);
ThreadPerColCounter<<<nImages, W, sizeof(int)*W>>> (
Threshold,
W, H,
AllPixels,
AllReturns, rsize);
//wait for all blocks to finish
checkCudaErrors(cudaDeviceSynchronize());
}
int main(){
const int my_W = 720;
const int my_H = 540;
const int n_img = 128;
const int my_thresh = 10;
U8 **img_p, **img_ph;
U8 *img, *img_h;
int *res, *res_h, *res_h1;
img_ph = (U8 **)malloc(n_img*sizeof(U8*));
cudaMalloc(&img_p, n_img*sizeof(U8*));
cudaMalloc(&img, n_img*my_W*my_H*sizeof(U8));
img_h = new U8[n_img*my_W*my_H];
for (int i = 0; i < n_img*my_W*my_H; i++) img_h[i] = rand()%20;
cudaMemcpy(img, img_h, n_img*my_W*my_H*sizeof(U8), cudaMemcpyHostToDevice);
for (int i = 0; i < n_img; i++) img_ph[i] = img+my_W*my_H*i;
cudaMemcpy(img_p, img_ph, n_img*sizeof(U8*), cudaMemcpyHostToDevice);
cudaMalloc(&res, n_img*sizeof(int));
cuda_Benchmark(n_img, my_W, my_H, img_p, res, my_thresh);
res_h = new int[n_img];
cudaMemcpy(res_h, res, n_img*sizeof(int), cudaMemcpyDeviceToHost);
cuda_Benchmark1(n_img, my_W, my_H, img_p, res, my_thresh);
res_h1 = new int[n_img];
cudaMemcpy(res_h1, res, n_img*sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < n_img; i++) if (res_h[i] != res_h1[i]) {std::cout << "mismatch at: " << i << " was: " << res_h1[i] << " should be: " << res_h[i] << std::endl; return 0;}
}
$ nvcc -o t49 t49.cu -I/usr/local/cuda/samples/common/inc
$ cuda-memcheck ./t49
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof ./t49
==1756== NVPROF is profiling process 1756, command: ./t49
==1756== Profiling application: ./t49
==1756== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   72.02%  54.325ms         1  54.325ms  54.325ms  54.325ms  ThreadPerRowCounter(int, int, int, unsigned char**, int*)
                   24.71%  18.639ms         2  9.3195ms  1.2800us  18.638ms  [CUDA memcpy HtoD]
                    3.26%  2.4586ms         1  2.4586ms  2.4586ms  2.4586ms  ThreadPerColCounter(int, int, int, unsigned char**, int*, int)
                    0.00%  3.1040us         2  1.5520us  1.5360us  1.5680us  [CUDA memcpy DtoH]
      API calls:   43.63%  59.427ms         3  19.809ms  18.514us  59.159ms  cudaMalloc
                   41.70%  56.789ms         2  28.394ms  2.4619ms  54.327ms  cudaDeviceSynchronize
                   14.02%  19.100ms         4  4.7749ms  17.749us  18.985ms  cudaMemcpy
                    0.52%  705.26us        96  7.3460us     203ns  327.21us  cuDeviceGetAttribute
                    0.05%  69.268us         1  69.268us  69.268us  69.268us  cuDeviceTotalMem
                    0.04%  50.688us         1  50.688us  50.688us  50.688us  cuDeviceGetName
                    0.04%  47.683us         2  23.841us  14.352us  33.331us  cudaLaunchKernel
                    0.00%  3.1770us         1  3.1770us  3.1770us  3.1770us  cuDeviceGetPCIBusId
                    0.00%  1.5610us         3     520ns     249ns     824ns  cuDeviceGetCount
                    0.00%  1.0550us         2     527ns     266ns     789ns  cuDeviceGet
$
(Quadro K2000, CUDA 9.2.148, Fedora Core 27)
(The next_power_of_2 code is lifted from this answer)
I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.

How to transpose in-place a bitmap in C++

I am trying to create a function to transpose a bitmap in place. But so far, the result I get is all messed up, and I can't find what I'm doing wrong.
Source bitmaps are stored as a 1D pixel array in ARGB format.
void transpose(uint8_t* buffer, const uint32_t width, const uint32_t height)
{
const size_t stride = width * sizeof(uint32_t);
for (uint32_t i = 0; i < height; i++)
{
uint32_t* row = (uint32_t*)(buffer + (stride * i));
uint8_t* section = buffer + (i * sizeof(uint32_t));
for (uint32_t j = i + 1; j < height; j++)
{
const uint32_t tmp = row[j];
row[j] = *((uint32_t*)(section + (stride * j)));
*((uint32_t*)(section + (stride * j))) = tmp;
}
}
}
UPDATE:
To clarify and avoid confusion, as some people seem to think this is just a rotate-image question: transposing an image consists of two transformations, 1) a horizontal flip and 2) a rotation by 90 degrees CCW. (As shown in the image example, see the arrow directions.)

I think the problem is more complex than you realise and is not simply a case of swapping the pixels at x, y with the pixels at y, x. If you consider a 3*7 pixel image in which I've labelled the pixels a-u:
abcdefg
hijklmn
opqrstu
Rotating this image gives:
aho
bip
cjq
dkr
els
fmt
gnu
Turning both images into a 1D array gives:
abcdefghijklmnopqrstu
ahobipcjqdkrelsfmtgnu
Notice that b has moved to the position of d but has been replaced by h.
Rethink your algorithm, draw it out for a small image and make sure it works before attempting to implement it.
Due to the complexity of the task it may actually end up being faster to create a temporary buffer, rotate into that buffer then copy back as it could end up with fewer copies (2 per pixel) than the inplace algorithm that you come up with.
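The temporary-buffer approach mentioned above can be sketched as follows. This is a minimal illustration, assuming the pixels are held in a `std::vector<std::uint32_t>` (one ARGB value per element) rather than the question's raw `uint8_t*` buffer; `transposeViaCopy` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Transposes a width x height image of 32-bit pixels stored row by row.
// After the call the buffer holds a height x width image (the new row
// length is 'height'). Uses one temporary copy instead of an in-place scheme.
void transposeViaCopy(std::vector<std::uint32_t>& pixels,
                      std::size_t width, std::size_t height) {
    assert(pixels.size() == width * height);
    std::vector<std::uint32_t> tmp(pixels.size());
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            // Source pixel (x, y) becomes destination pixel (y, x).
            tmp[x * height + y] = pixels[y * width + x];
        }
    }
    pixels.swap(tmp);  // no copy back needed, the buffers just trade places
}
```

Applied to the 7-wide, 3-high example above (`abcdefghijklmnopqrstu`), this yields `ahobipcjqdkrelsfmtgnu`. The caller has to swap width and height afterwards when interpreting the buffer.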

Mostly equivalent code that should be easier to debug:
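inline uint32_t * addr(uint8_t* buffer, const uint32_t width, uint32_t i, uint32_t j) {
// reinterpret the byte buffer as 32-bit pixels (an implicit conversion does not compile)
uint32_t * tmp = reinterpret_cast<uint32_t*>(buffer);
return tmp+i*width+j;
}
void transpose(uint8_t* buffer, const uint32_t width, const uint32_t height) {
for (uint32_t i = 0; i < std::min(width,height); i++) { // std::min needs <algorithm>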
for (uint32_t j = 0; j < i; j++) {
uint32_t * a = addr(buffer, width, i, j);
uint32_t * b = addr(buffer, width, j, i);
const uint32_t tmp = *a;
*a = *b;
*b = tmp;
}
}
}
If this doesn't work right, it is possible that it needs to know not just the width of the picture, but also the width of the underlying buffer. This only flips the square portion at the top-left, more work would be needed for non-square bitmaps. (or just pad everything to square before using...)

Note that transposing a matrix in place is not trivial when N!=M. See e.g. here for details.
The reason is that when N=M you can simply iterate through half of the matrix and swap elements. When N!=M this isn't the case.
For illustration, consider a simpler case:
First a 2d view on 1d data:
struct my2dview {
std::vector<int>& data;
int width,height;
my2dview(std::vector<int>& data,int width,int height):data(data),width(width),height(height){}
int operator()(int x,int y) const { return data[x*width + y]; }
int& operator()(int x,int y){ return data[x*width + y]; }
my2dview get_transposed() { return my2dview(data,height,width);}
};
std::ostream& operator<<(std::ostream& out, const my2dview& x){
for (int h=0;h<x.height;++h){
for (int w=0;w<x.width;++w){
out << x(h,w) << " ";
}
out << "\n";
}
return out;
}
Now a transpose that would work for N=M:
my2dview broken_transpose(my2dview x){
auto res = x.get_transposed();
for (int i=0;i<x.height;++i){
for (int j=0;j<x.width;++j){
res(j,i) = x(i,j);
}
}
return res;
}
Using it for some small matrix
int main() {
std::vector<int> x{1,2,3,4,5,6};
auto v = my2dview(x,2,3);
std::cout << v << '\n';
std::cout << v.get_transposed() << '\n';
auto v2 = broken_transpose(v);
std::cout << v2;
}
prints
1 2
3 4
5 6
1 2 3
4 5 6
1 3 2
2 2 6
Conclusion: The naive swapping elements approach does not work for non-square matrices.
Actually this answer just rephrases the one by @Alan Birtles. I felt challenged by his
Due to the complexity of the task it may actually end up being faster to create a temporary buffer [...]
just to come to the same conclusion ;).

Matrix the Rectangle Part transpose Cuda

I'm writing a CUDA program to transpose a square matrix. The idea is to do it in two parts, depending on the size of the matrix: the part whose size is a multiple of the tile size is transposed tile by tile, and the remaining rectangular part is transposed separately. For example, for a 67 x 67 matrix with a tile size of 32, the first part is the 64x64 block, and the second part is the remaining 3x67 strip.
My problem is in the rectangular part.
The code below shows the main code with the defined values:
const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;
const int NUM_REPS = 100;
const int Nx = 2024; //size of the matrix
const int Ny = 2024;
int main(int argc, char **argv)
{
const int nx = Nx;
const int ny = Ny; // Size of the Arrays
const int mem_size = nx*ny*sizeof(int);// Size of the Orig.Arr
int *h_idata = (int*)malloc(mem_size); // original Host Arr.
int *d_idata; //device Arr.
checkCuda(cudaMalloc(&d_idata, mem_size));
dim3 dimGridX(nx / TILE_DIM, 1, 1); //grid dimension used
dim3 dimBlockX(TILE_DIM, 1, 1); // number of threads used
// the Kernel Function for only the rectangle
EdgeTransposeX<<<dimGridX, dimBlockX>>>(d_idata); // use the grid/block dimensions declared above
cudaEventRecord(startEvent, 0);
cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);
cudaEventElapsedTime(&ms, startEvent, stopEvent);
cudaMemcpy(h_idata, d_idata, mem_size, cudaMemcpyDeviceToHost);
I was advised not to use shared memory in the kernel, so this is how I have done it:
__global__ void EdgeTransposeX(int *idata)
{
int tile_C[Edge][Nx];
int tile_V[Nx][Edge];
int x = blockIdx.x * TILE_DIM + threadIdx.x;
if (x == (nEven - 1))
{
for (int j = 0; j < Nx; j++)
for (int i = 1; i <= Edge; i++)
{
tile_V[j][i - 1] = idata[j*Nx + (x + i)];
tile_C[i - 1][j] = idata[(x + i)*Nx + j];}
__syncthreads();
for (int j = 0; j < Nx; j++)
for (int i = 1; i <= Edge; i++)
{
idata[j*Nx + (x + i)] = tile_C[i - 1][j];
idata[(x + i)*Nx + j] = tile_V[j][i - 1];}
} }
The code works fine until the matrix size reaches 1025; after that it stops working. Any idea why? Am I missing something here?

Your two-dimensional arrays tile_C and tile_V are physically stored in the GPU's local memory. The amount of local memory per thread is limited to 512 KB. Verify that you are not using more than 512 KB of local memory per thread.
An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. However in some cases the compiler might choose to place it in local memory. (This fragment was taken from the "CUDA C Programming Guide" 2015, page 89.)
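A quick way to check the 512 KB limit mentioned above is to compute the footprint of the two arrays by hand. The sketch below does that on the host. `Edge` is not defined in the posted code, so it is passed in as a parameter here, and both helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of per-thread storage needed by the two automatic arrays
// int tile_C[Edge][Nx] and int tile_V[Nx][Edge] from the question.
// 'edge' and 'nx' are parameters because Edge is not defined in the post.
std::size_t tileBytesPerThread(std::size_t edge, std::size_t nx) {
    assert(edge > 0 && nx > 0);
    return 2 * edge * nx * sizeof(int);
}

// True if the two arrays fit within the 512 KB per-thread limit quoted above.
bool fitsInLocalMemory(std::size_t edge, std::size_t nx) {
    return tileBytesPerThread(edge, nx) <= 512 * 1024;
}
```

For the 67 x 67 example from the question (a 3-wide edge), `tileBytesPerThread(3, 67)` is `2 * 3 * 67 * sizeof(int)` bytes, which is far below the limit. Note that this counts only these two arrays; the compiler may place other data in local memory as well, which is why checking the profiler output is still worthwhile.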
My suggestion is that you use the visual profiler to check the occupancy, register and local memory usage.
This link may be helpful for you: link.
I implemented the transpose of a square matrix using 2D CUDA surfaces; it works fine for sizes from 2 to 16384 in powers of two. If you don't mind implementing a non-tiled version, I recommend this approach.

c++: Convert 24bpp to 8 bpp or 1bpp image

I have to convert a 24bpp image to a 1bpp image or 8bpp image based on a color table. The caller expects an unsigned char* in either case (which would be further processed, or for now used as debug output, by setting BITMAPINFOHEADER.biBitCount to its proper value, 8 or 1).
I have code to extract the color index into the palette (colorIndexArray is from color conversion or dithering algorithms)... I can get the info for an 8bpp bitmap...
But my problem is, I don't know how to put this info into a 1bpp bitmap
typedef struct {
unsigned int size;
unsigned char* pixels;
} ColorIndexArray;
unsigned char* convertImage(const ColorIndexArray& colorIndexArray, unsigned int paletteSize)
{
unsigned char* outputImage;
if (paletteSize > 2)
{
outputImage = (unsigned char*)LocalAlloc(LPTR, colorIndexArray.size);
for (int i=0; i<colorIndexArray.size; i++)
*(outputImage+i) = colorIndexArray.pixels[i];
// this works great
}
else // monochrome, caller has palette colors likely b/w (or purple/magenta or anything), must be 1bpp
{
outputImage = (unsigned char*)LocalAlloc(LPTR, colorIndexArray.size / 8);
// how can i place the unsigned char* info (which is already
// determined based on desired algorithm, representing index in
// color table) into the output image inside a single bit ?
// (obviously its value for a monochrome image would be 0 or 1 but
// it is saved as unsigned char* at the algorithm output)
// And how do I advance the pointer ?
// Will it be type safe ? Aligned to byte ? or do I have to fill
// with something at the end to make multiple of 8 bits ?
}
return outputImage;
}
Trying this after comment suggestion:
#include <GdiPlus.h>
....
else {
Gdiplus::Bitmap monoBitmap(w, h, PixelFormat1bppIndexed);
Gdiplus::BitmapData monoBitmapData;
Gdiplus::Rect rect(0, 0, w, h);
monoBitmap.LockBits(&rect, Gdiplus::ImageLockModeWrite, PixelFormat1bppIndexed, &monoBitmapData);
outputImage = (unsigned char*)monoBitmapData.Scan0;
for (unsigned int y = 0; y < h; y++)
{
for (unsigned int x = 0; x < w; x++)
{
if (colorIndexArray.pixels[x + y * w])
outputImage[y*monoBitmapData.Stride + x / 8] |= (unsigned char)(0x80 >> (x % 8));
}
}
monoBitmap.UnlockBits(&monoBitmapData);
}
return outputImage;
(Also need to allocate the memory for outputImage)

Based on the example suggested by Hans Passant (thank you also for pointing out how important the stride is), I wrote this little conversion
unsigned long stride = (((w + 31) & ~31) >> 3);
outputImage = (unsigned char*)LocalAlloc(LPTR, stride * h);
for (unsigned int y = 0; y < h; y++)
{
// LPTR memory is zero-initialized, so set the bits directly in the output row
// (allocating a scratch row per iteration would leak it)
unsigned char* b = outputImage + stride * y;
for (unsigned int x = 0; x < w; x++)
if (colorIndexArray.pixels[x + y * w])
b[x / 8] |= (unsigned char)(0x80 >> (x % 8));
}
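The same packing can be written without the Windows allocation calls. The following is a portable sketch of the idea, assuming the palette indices arrive as one byte per pixel in a `std::vector`; the names `monoStride` and `packMono` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row length in bytes for a 1bpp bitmap: rows are padded to a multiple of 32 bits.
std::size_t monoStride(std::size_t width) {
    return ((width + 31) & ~static_cast<std::size_t>(31)) >> 3;
}

// Packs one palette index per byte (0 or non-zero) into 1 bit per pixel,
// most significant bit first, with each row padded to monoStride(width) bytes.
std::vector<unsigned char> packMono(const std::vector<unsigned char>& indices,
                                    std::size_t width, std::size_t height) {
    assert(indices.size() == width * height);
    const std::size_t stride = monoStride(width);
    std::vector<unsigned char> out(stride * height, 0);  // padding bits stay 0
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            if (indices[x + y * width]) {
                out[y * stride + x / 8] |= static_cast<unsigned char>(0x80 >> (x % 8));
            }
        }
    }
    return out;
}
```

A 10-pixel-wide row occupies 4 bytes in the output, not 2, because of the 32-bit row alignment. This is why allocating `size / 8` bytes, as in the question, is not enough in general.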

OpenCL computation does not match output of sequential algorithm

I'm trying to implement a naive version of LU decomposition in OpenCL. To start, I have implemented a sequential version in C++ and constructed methods to verify my result (i.e., multiplication methods). Next I implemented my algorithm in a kernel and tested it with manually verified input (i.e., a 5x5 matrix). This works fine.
However, when I run my algorithm on a randomly generated matrix bigger than 5x5 I get strange results. I've cleaned my code, checked the calculations manually but I can't figure out where my kernel is going wrong. I'm starting to think that it might have something to do with the floats and the stability of the calculations. By this I mean that error margins get propagated and get bigger and bigger. I'm well-aware that I can swap rows to get the biggest pivot value and such, but the error margin is way off sometimes. And in any case I would have expected the result - albeit a wrong one - to be the same as the sequential algorithm. I would like some help identifying where I could be doing something wrong.
I'm using a single dimensional array so addressing a matrix with two dimensions happens like this:
A(row, col) = A[row * matrix_width + col].
About the results, I should add that I decided to merge the L and U matrices into one. So given L and U:
L:       U:
1 0 0    A B C
X 1 0    0 D E
Y Z 1    0 0 F
I display them as:
A:
A B C
X D E
Y Z F
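To make the merged layout concrete, here is a small sketch of how L and U can be multiplied back together straight from the combined matrix, which is one way to check a result against the original. It is not the verification code used for the output below (that code is not shown); `multiplyMergedLU` is a hypothetical name and the matrix is assumed to be row-major.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplies L and U back together from the merged storage described above:
// entries below the diagonal belong to L (whose diagonal is an implicit 1),
// entries on and above the diagonal belong to U. Row-major, n x n.
std::vector<float> multiplyMergedLU(const std::vector<float>& lu, std::size_t n) {
    assert(lu.size() == n * n);
    std::vector<float> product(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            // L(row,k) is zero for k > row and U(k,col) is zero for k > col,
            // so only k <= min(row, col) contributes.
            const std::size_t last = row < col ? row : col;
            float sum = 0.0f;
            for (std::size_t k = 0; k <= last; ++k) {
                const float l = (k == row) ? 1.0f : lu[row * n + k];
                sum += l * lu[k * n + col];
            }
            product[row * n + col] = sum;
        }
    }
    return product;
}
```

The result should equal the original matrix up to floating-point rounding, so any comparison needs a tolerance rather than an exact equality test.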
The kernel is the following:
The parameter source is the original matrix I want to decompose.
The parameter destin is the destination. matrix_size is the total size of the matrix (so that would be 9 for a 3x3) and matrix_width is the width (3 for a 3x3 matrix).
__kernel void matrix(
__global float * source,
__global float * destin,
unsigned int matrix_size,
unsigned int matrix_width
)
{
unsigned int index = get_global_id(0);
int col_idx = index % matrix_width;
int row_idx = index / matrix_width;
if (index >= matrix_size)
return;
// First of all, copy our value to the destination.
destin[index] = source[index];
// Iterate over all the pivots.
for(int piv_idx = 0; piv_idx < matrix_width; piv_idx++)
{
// We have to be the row below the pivot row
// And we have to be the column of the pivot
// or right of that column.
if(col_idx < piv_idx || row_idx <= piv_idx)
return;
// Calculate the divisor.
float pivot_value = destin[(piv_idx * matrix_width) + piv_idx];
float below_pivot_value = destin[(row_idx * matrix_width) + piv_idx];
float divisor = below_pivot_value/ pivot_value;
// Get the value in the pivot row on this column.
float pivot_row_value = destin[(piv_idx * matrix_width) + col_idx];
float current_value = destin[index];
destin[index] = current_value - (pivot_row_value * divisor);
// Write the divisor to the memory (we won't use these values anymore!)
// if we are the value under the pivot.
barrier(CLK_GLOBAL_MEM_FENCE);
if(col_idx == piv_idx)
{
int divisor_location = (row_idx * matrix_width) + piv_idx;
destin[divisor_location] = divisor;
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
This is the sequential version:
// Decomposes a matrix into L and U but in the same matrix.
float * decompose(float* A, int matrix_width)
{
int total_length = matrix_width*matrix_width;
float *U = new float[total_length];
for (int i = 0; i < total_length; i++)
{
U[i] = A[i];
}
for (int row = 0; row < matrix_width; row++)
{
int pivot_idx = row;
float pivot_val = U[pivot_idx * matrix_width + pivot_idx];
for (int r = row + 1; r < matrix_width; r++)
{
float below_pivot = U[r*matrix_width + pivot_idx];
float divisor = below_pivot / pivot_val;
for (int row_idx = pivot_idx; row_idx < matrix_width; row_idx++)
{
float value = U[row * matrix_width + row_idx];
U[r*matrix_width + row_idx] = U[r*matrix_width + row_idx] - (value * divisor);
}
U[r * matrix_width + pivot_idx] = divisor;
}
}
return U;
}
An example output I get is the following:
Workgroup size: 1
Array dimension: 6
Original unfactorized:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 507.000000 | 718.000000 | 670.000000 | 753.000000 | 122.000000 | 941.000000 |
| 597.000000 | 449.000000 | 596.000000 | 742.000000 | 491.000000 | 212.000000 |
| 159.000000 | 944.000000 | 797.000000 | 717.000000 | 822.000000 | 219.000000 |
| 266.000000 | 755.000000 | 33.000000 | 231.000000 | 824.000000 | 785.000000 |
| 724.000000 | 408.000000 | 652.000000 | 863.000000 | 663.000000 | 113.000000 |
Sequential:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 2.880682 | 334.869324 | -571.573853 | -1663.892090 | -2006.823730 | -355.306763 |
| 3.392045 | -0.006397 | -869.627747 | -2114.569580 | -2028.558716 | -1316.693359 |
| 0.903409 | 2.460203 | -2.085742 | -357.893066 | 860.526367 | -2059.689209 |
| 1.511364 | 1.654343 | -0.376231 | -2.570729 | 4476.049805 | -5097.599121 |
| 4.113636 | -0.415427 | 1.562076 | -0.065806 | 0.003290 | 52.263515 |
Sequential multiplied matching with original?:
1
GPU:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 2.880682 | 334.869293 | -571.573914 | -1663.892212 | -2006.823975 | -355.306885 |
| 3.392045 | -0.006397 | -869.627808 | -2114.569580 | -2028.558716 | -1316.693359 |
| 0.903409 | 2.460203 | -2.085742 | -357.892578 | 5091.575684 | -2059.688965 |
| 1.511364 | 1.654343 | -0.376232 | -2.570732 | 16116.155273 | -5097.604980 |
| 4.113636 | -0.415427 | -0.737347 | 2.005755 | -3.655331 | -237.480438 |
GPU multiplied matching with original?:
Values differ: 5053.05 -- 822
0
Values differ: 5091.58 -- 860.526
Correct solution? 0
Edit
Okay, I understand why it was not working before, I think. The reason is that I only synchronize on each workgroup. When I would call my kernel with a workgroup size equal to the number of items in my matrix it would always be correct, because then the barriers would work properly. However, I decided to go with the approach as mentioned in the comments. Enqueue multiple kernels and wait for each kernel to finish before starting the next one. This would then map onto an iteration over each row of the matrix and multiplying it with the pivot element. This makes sure that I do not modify or read elements that are being modified by the kernel at that point.
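The one-launch-per-pivot idea can be modelled on the host as below. This is a sketch under stated assumptions, not the kernel from this question: it splits each pivot step into two phases so that no element is read while the same phase writes it, it does no row pivoting, and the names `eliminatePivot` and `decomposeInPlace` are invented for the example. Every loop iteration inside a phase is independent, so each phase could map to one kernel launch with a wait in between.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One elimination step for pivot p on a row-major n x n matrix.
void eliminatePivot(std::vector<float>& a, std::size_t n, std::size_t p) {
    assert(a.size() == n * n && p < n);
    const float pivot = a[p * n + p];
    // Phase 1: store the multipliers in the pivot column. Each iteration
    // writes a[r][p] and reads only that element and the pivot.
    for (std::size_t r = p + 1; r < n; ++r) {
        a[r * n + p] /= pivot;
    }
    // Phase 2: update the trailing submatrix. Each iteration writes a[r][c]
    // with c > p and reads only row p and column p, which this phase never writes.
    for (std::size_t r = p + 1; r < n; ++r) {
        for (std::size_t c = p + 1; c < n; ++c) {
            a[r * n + c] -= a[r * n + p] * a[p * n + c];
        }
    }
}

// Full decomposition into the merged L/U storage, without pivoting.
void decomposeInPlace(std::vector<float>& a, std::size_t n) {
    for (std::size_t p = 0; p < n; ++p) {
        eliminatePivot(a, n, p);
    }
}
```

Because the matrix is updated in place across pivot steps, the data written for pivot p must be the data read for pivot p+1. A zero (or tiny) pivot will still break this version, as in any LU decomposition without row exchanges.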
Again, this works but only for small matrices. So I think I was wrong in assuming that it was the synchronization only. As per the request of Baiz I am posting my entire main here that calls the kernel:
int main(int argc, char *argv[])
{
try {
if (argc != 5) {
std::ostringstream oss;
oss << "Usage: " << argv[0] << " <kernel_file> <kernel_name> <workgroup_size> <array width>";
throw std::runtime_error(oss.str());
}
// Read in arguments.
std::string kernel_file(argv[1]);
std::string kernel_name(argv[2]);
unsigned int workgroup_size = atoi(argv[3]);
unsigned int array_dimension = atoi(argv[4]);
int total_matrix_length = array_dimension * array_dimension;
// Print parameters
std::cout << "Workgroup size: " << workgroup_size << std::endl;
std::cout << "Array dimension: " << array_dimension << std::endl;
// Create matrix to work on.
// Create a random array.
int matrix_width = sqrt(total_matrix_length);
// randomMatrix returns the array itself; allocating one first would leak it
float* input_matrix = randomMatrix(total_matrix_length);
/// Debugging
//float* input_matrix = new float[9];
//int matrix_width = 3;
//total_matrix_length = matrix_width * matrix_width;
//input_matrix[0] = 10; input_matrix[1] = -7; input_matrix[2] = 0;
//input_matrix[3] = -3; input_matrix[4] = 2; input_matrix[5] = 6;
//input_matrix[6] = 5; input_matrix[7] = -1; input_matrix[8] = 5;
// Allocate memory on the host and populate source
float *gpu_result = new float[total_matrix_length];
// OpenCL initialization
std::vector<cl::Platform> platforms;
std::vector<cl::Device> devices;
cl::Platform::get(&platforms);
platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
cl::Context context(devices);
cl::CommandQueue queue(context, devices[0], CL_QUEUE_PROFILING_ENABLE);
// Load the kernel source.
std::string file_text;
std::ifstream file_stream(kernel_file.c_str());
if (!file_stream) {
std::ostringstream oss;
oss << "There is no file called " << kernel_file;
throw std::runtime_error(oss.str());
}
file_text.assign(std::istreambuf_iterator<char>(file_stream), std::istreambuf_iterator<char>());
// Compile the kernel source.
std::string source_code = file_text;
std::pair<const char *, size_t> source(source_code.c_str(), source_code.size());
cl::Program::Sources sources;
sources.push_back(source);
cl::Program program(context, sources);
try {
program.build(devices);
}
catch (cl::Error& e) {
std::string msg;
program.getBuildInfo<std::string>(devices[0], CL_PROGRAM_BUILD_LOG, &msg);
std::cerr << "Your kernel failed to compile" << std::endl;
std::cerr << "-----------------------------" << std::endl;
std::cerr << msg;
throw(e);
}
// Allocate memory on the device
cl::Buffer source_buf(context, CL_MEM_READ_ONLY, total_matrix_length*sizeof(float));
cl::Buffer dest_buf(context, CL_MEM_WRITE_ONLY, total_matrix_length*sizeof(float));
// Create the actual kernel.
cl::Kernel kernel(program, kernel_name.c_str());
// transfer source data from the host to the device
queue.enqueueWriteBuffer(source_buf, CL_TRUE, 0, total_matrix_length*sizeof(float), input_matrix);
for (int pivot_idx = 0; pivot_idx < matrix_width; pivot_idx++)
{
// set the kernel arguments
kernel.setArg<cl::Memory>(0, source_buf);
kernel.setArg<cl::Memory>(1, dest_buf);
kernel.setArg<cl_uint>(2, total_matrix_length);
kernel.setArg<cl_uint>(3, matrix_width);
kernel.setArg<cl_int>(4, pivot_idx);
// execute the code on the device
std::cout << "Enqueueing new kernel for " << pivot_idx << std::endl;
cl::Event evt;
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(total_matrix_length), cl::NDRange(workgroup_size), 0, &evt);
evt.wait();
std::cout << "Iteration " << pivot_idx << " done" << std::endl;
}
// transfer destination data from the device to the host
queue.enqueueReadBuffer(dest_buf, CL_TRUE, 0, total_matrix_length*sizeof(float), gpu_result);
// Calculate sequentially.
float* sequential = decompose(input_matrix, matrix_width);
// Print out the results.
std::cout << "Sequential:\n";
printMatrix(total_matrix_length, sequential);
// Print out the results.
std::cout << "GPU:\n";
printMatrix(total_matrix_length, gpu_result);
std::cout << "Correct solution? " << equalMatrices(gpu_result, sequential, total_matrix_length);
// compute the data throughput in GB/s
//float throughput = (2.0*total_matrix_length*sizeof(float)) / t; // t is in nano seconds
//std::cout << "Achieved throughput: " << throughput << std::endl;
// Cleanup
// Deallocate memory
delete[] gpu_result;
delete[] input_matrix;
delete[] sequential;
return 0;
}
catch (cl::Error& e) {
std::cerr << e.what() << ": " << jc::readable_status(e.err());
return 3;
}
catch (std::exception& e) {
std::cerr << e.what() << std::endl;
return 2;
}
catch (...) {
std::cerr << "Unexpected error. Aborting!\n" << std::endl;
return 1;
}
}

As maZZZu already stated, because the work items execute in parallel, you cannot be sure whether an element of the array has already been read or written by another work item.
Ordering can be enforced with barrier() and the flags
CLK_LOCAL_MEM_FENCE/CLK_GLOBAL_MEM_FENCE
however, these mechanisms only work on work items within the same work group.
Work items from different work groups cannot be synchronized within a single kernel launch.
Your problem is most likely one of the following:
you use multiple work groups for an algorithm that can most likely only be executed correctly by a single work group
you do not use enough barriers
If you already use only a single work group, try adding a
barrier(CLK_GLOBAL_MEM_FENCE);
to all parts where you read/write from/to destin.
You should restructure your algorithm:
have only one work group perform the algorithm on your matrix
use local memory for better performance (since you repeatedly access the same elements)
use barriers everywhere. Once the algorithm works, you can work out which ones you don't need and remove them.
Could you post your kernel call and the work sizes?
EDIT:
From your algorithm I came up with this code.
I haven't tested it and I doubt it'll work right away.
But it should help you in understanding how to parallelize a sequential algorithm.
It will decompose the matrix with only one kernel launch.
Some restrictions:
This code only works with a single work group.
It will only work for matrices whose number of elements (width*height) does not exceed your maximum work-group size (probably between 256 and 1024), because it uses one work item per element.
If you want to change that, you should refactor the algorithm to use only as many work items as the width of the matrix.
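The restrictions above can be checked on the host before launching. This is only a sketch: fitsSingleWorkGroup is a hypothetical helper, not an OpenCL function, and it assumes you have already queried the two limits from the device (CL_DEVICE_MAX_WORK_GROUP_SIZE and CL_DEVICE_LOCAL_MEM_SIZE). The limit for a specific kernel can be lower than the device limit, so treat a "true" result as necessary rather than sufficient.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of OpenCL): decides whether a width x height
// float matrix can be handled by ONE work group that uses one work item per
// element and keeps a full copy of the matrix in local memory.
// maxWorkGroupSize and localMemBytes are assumed to have been queried from
// the device (CL_DEVICE_MAX_WORK_GROUP_SIZE, CL_DEVICE_LOCAL_MEM_SIZE).
bool fitsSingleWorkGroup(std::size_t width, std::size_t height,
                         std::size_t maxWorkGroupSize,
                         std::size_t localMemBytes)
{
    assert(width > 0 && height > 0);
    const std::size_t elements = width * height;
    return elements <= maxWorkGroupSize &&
           elements * sizeof(float) <= localMemBytes;
}
```

For example, a 16x16 matrix needs 256 work items and 1024 bytes of local memory, so it fits a device that allows 256 work items per group, while a 32x32 matrix does not.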
The kernel arguments are set as follows; just adapt these calls to your kernel.setArg(...) code:
int nbElements = width*height;
clSetKernelArg (kernel, 0, sizeof(A), &A);
clSetKernelArg (kernel, 1, sizeof(U), &U);
clSetKernelArg (kernel, 2, sizeof(float) * width * height, NULL); // Local memory: one float per matrix element
clSetKernelArg (kernel, 3, sizeof(int), &width);
clSetKernelArg (kernel, 4, sizeof(int), &height);
clSetKernelArg (kernel, 5, sizeof(int), &nbElements);
Kernel code:
inline int indexFrom2d(const int u, const int v, const int width)
{
    return width*v + u;
}
kernel void decompose(global float* A,
                      global float* U,
                      local float* localBuffer,
                      const int widthMat,
                      const int heightMat,
                      const int nbElements)
{
    int gidx = get_global_id(0);
    int col = gidx%widthMat;
    int row = gidx/widthMat;
    // No early return: every work item of the group has to reach every barrier
    const bool inRange = gidx < nbElements;
    // Copy from global to local memory
    if(inRange)
        localBuffer[gidx] = A[gidx];
    // Sync copy process
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int rowOuter = 0; rowOuter < widthMat; ++rowOuter)
    {
        int pivotIdx = rowOuter;
        // Only work items below pivot and from pivot to the right.
        // C has no chained comparisons, so every bound is tested separately.
        const bool active = inRange &&
                            col >= pivotIdx && col < widthMat &&
                            row >= pivotIdx + 1 && row < heightMat;
        float divisor = 0.0f;
        float value = 0.0f;
        if(active)
        {
            float pivotValue = localBuffer[indexFrom2d(pivotIdx, pivotIdx, widthMat)];
            // Data for all work items in the row
            float belowPivot = localBuffer[indexFrom2d(pivotIdx, row, widthMat)];
            divisor = belowPivot / pivotValue;
            value = localBuffer[indexFrom2d(col, rowOuter, widthMat)];
        }
        // All reads of this step must be done before the pivot column is overwritten
        barrier(CLK_LOCAL_MEM_FENCE);
        if(active)
        {
            localBuffer[indexFrom2d(col, row, widthMat)] = localBuffer[indexFrom2d(col, row, widthMat)] - (value * divisor);
            if(col == pivotIdx)
                localBuffer[indexFrom2d(pivotIdx, row, widthMat)] = divisor;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // Write back to global memory
    if(inRange)
        U[gidx] = localBuffer[gidx];
}
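Since the kernel is untested, it may help to have a CPU version of the same elimination to compare against. The following is a minimal sketch, not the asker's decompose function: emulateDecompose is a name made up here, it assumes a square row-major matrix with non-zero pivots, and it splits each pivot step into a read phase and a write phase so that no emulated work item overwrites a value another one still needs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the single-work-group idea: one "work item" per matrix
// element, and every pivot step split into a read phase and a write phase.
// The end of each inner loop plays the role of barrier(CLK_LOCAL_MEM_FENCE).
// Assumptions: square row-major matrix, non-zero pivots (no row exchanges).
std::vector<float> emulateDecompose(const std::vector<float>& a, int width)
{
    assert(width > 0 && a.size() == static_cast<std::size_t>(width) * width);
    const int n = width * width;
    std::vector<float> local(a);             // stands in for local memory
    std::vector<float> divisor(n), value(n); // per-work-item private values

    for (int pivot = 0; pivot < width; ++pivot) {
        // Read phase: every active work item fetches its inputs.
        for (int gid = 0; gid < n; ++gid) {
            const int col = gid % width, row = gid / width;
            if (col >= pivot && row > pivot) {
                divisor[gid] = local[row * width + pivot] / local[pivot * width + pivot];
                value[gid] = local[pivot * width + col];
            }
        }
        // Write phase: only now may elements be overwritten.
        for (int gid = 0; gid < n; ++gid) {
            const int col = gid % width, row = gid / width;
            if (col >= pivot && row > pivot) {
                local[gid] -= value[gid] * divisor[gid];
                if (col == pivot)
                    local[gid] = divisor[gid]; // keep the multiplier in place of the zero
            }
        }
    }
    return local;
}
```

Feeding it the 3x3 debugging matrix from the question's host code (10, -7, 0, -3, 2, 6, 5, -1, 5) should give a result close to 10, -7, 0, -0.3, -0.1, 6, 0.5, -25, 155, up to float rounding.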

The errors are way too big to be caused by floating-point arithmetic.
Without any deeper understanding of your algorithm, I would say that the problem is that you are using values from the destination buffer. With sequential code this is fine, because you know which values are already there. But with OpenCL, the work items of a kernel are executed in parallel, so you cannot tell whether another work item has already stored its value to the destination buffer or not.
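To make the effect concrete, here is a small CPU sketch. The update rule is a toy chosen for illustration (each element adds its left neighbour), not the asker's kernel, and the function names are made up. The "order" argument stands for the sequence in which the work items happen to run, which OpenCL does not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy rule (an assumption for illustration): element i becomes
// old[i] + (value of element i-1); element 0 stays unchanged.
// "order" is the sequence in which the emulated work items run.

// Hazardous variant: the neighbour is read from the buffer being written,
// so the result depends on whether element i-1 was already updated.
std::vector<int> updateInPlace(std::vector<int> buf,
                               const std::vector<std::size_t>& order)
{
    for (std::size_t i : order) {
        assert(i < buf.size());
        if (i > 0)
            buf[i] += buf[i - 1];
    }
    return buf;
}

// Safe variant: inputs come only from an untouched source buffer,
// so every execution order produces the same destination buffer.
std::vector<int> updateFromSource(const std::vector<int>& src,
                                  const std::vector<std::size_t>& order)
{
    std::vector<int> dst(src);
    for (std::size_t i : order) {
        assert(i < src.size());
        if (i > 0)
            dst[i] = src[i] + src[i - 1];
    }
    return dst;
}
```

With the input 1, 2, 3, running updateInPlace in the order 0, 1, 2 and in the order 2, 1, 0 gives different last elements, while updateFromSource returns the same buffer for both orders.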
