OpenCL: Values of __local array are lost after a barrier call - c++

I have a kernel storing some partial results in a local array before reducing
them into a single value (see the example below). Before the reduction process
starts, a barrier is placed to ensure all threads have successfully written their
partial data. However, after the barrier the values of the temporary array appear
to have been reset to default values (i.e. 0.0f for floats).
Minimal example:
__kernel void simulate_plate(__local float *partial)
{
    __private int lpos;
    lpos = get_local_id(0) + get_local_id(1) * get_local_size(1);
    partial[lpos] = 1;
    barrier(CLK_LOCAL_MEM_FENCE);
    // At this point partial[i] == 0 for all i
    // reduce data...
}
The argument partial has the following initializer:
clSetKernelArg(kernel, 0, local_group_size * sizeof(float), NULL);
The clSetKernelArg() call returns a status code CL_SUCCESS and the kernel
terminates without any errors.
Another observation is that swapping the lines partial[lpos] = 1 and
barrier(CLK_LOCAL_MEM_FENCE) achieves the wanted result: all components of
the array partial are now equal to 1.
Any input on why this behaviour occurs would be much appreciated.

I think the index should be like this; the row stride has to be the work-group width in dimension 0. With get_local_size(1) as the stride, different work-items collide on the same slot or index past the end of the array, so (unless the work-group happens to be square) some of the elements you read after the barrier were simply never written:
lpos = get_local_id(0) + get_local_id(1) * get_local_size(0);
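A minimal sketch of the kernel with the corrected indexing (my own illustration, assuming a 2D work-group as in the question):
__kernel void simulate_plate(__local float *partial)
{
    // Row-major index inside the work-group: the row stride is the
    // work-group width in dimension 0.
    int lpos = get_local_id(0) + get_local_id(1) * get_local_size(0);
    partial[lpos] = 1.0f;
    barrier(CLK_LOCAL_MEM_FENCE);
    // Every element of partial is now 1.0f; the reduction can follow.
}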

Related

Opencl - Transfer Global memory Work-Group + border to Local memory

Here is a draft of the code I produced:
// NOTE: a __local array declared in the kernel body needs a compile-time
// constant size, so the work-group edge is fixed here (16x16 in our example);
// alternatively the array can be passed in as a __local kernel argument.
#define LOCAL_SIZE 16

void __kernel myKernel(__global const short* input,
                       __global short* output,
                       const int width,
                       const int height)
{
    // Always square (and 16x16 in our example).
    const uint local_size = get_local_size(0);
    // Get the work-item col/row index
    const uint wi_c = get_local_id(0);
    const uint wi_r = get_local_id(1);
    // Get the global col/row index
    const uint g_c = get_global_id(0);
    const uint g_r = get_global_id(1);
    // Declare a local NxN array
    __local short local_in[LOCAL_SIZE * LOCAL_SIZE];
    // Transfer the global memory into the local one.
    local_in[wi_c + wi_r * local_size] = input[g_c + g_r * width];
    // Wait until all work-items in the group are in sync.
    barrier(CLK_LOCAL_MEM_FENCE);
    // Now add code to process the local array (local_in).
}
As far as I understand OpenCL work-groups/work-items, this is what I need to do to copy a 16x16 ROI from global to local memory. (Please correct me if I'm wrong, since I'm a beginner at this.)
So after the barrier, each element in local_in can be accessed via wi_c + wi_r*local_size.
But now let's do something tricky. If I want each work-item in my work-group to work on a 3x3 neighborhood, I will need an 18x18 local_in array.
But how do I create this? I only have 16x16 = 256 work-items (threads), but I need 18x18 = 324 (68 threads short).
My basic idea should be to do:
if(wi_c == 0 && wi_r == 0){
    // Code that copies the border into the new array, which should be
    // local_in[(local_size+2)*(local_size+2)];
}
But this is terrible, since the first work-item (1st thread) will have to handle the whole border while the rest of the work-items in the group just wait for it to finish. (Again, this is my understanding of OpenCL and it might be wrong.)
So here are my real questions:
Is there an easier solution for this kind of problem? Like changing the NDRange local size to be overlapping or something?
I have started to read about coalesced memory access; does my first draft of code look like it? I don't think so, since I'm using a "stride" approach to load the global memory. But I don't understand how I could change the first part of that code to be efficient as well.
Once the barrier is passed, each work-item continues processing to get a final value that needs to be stored back into the global output array. Should I put another barrier before this "write", or is it fine to let each work-item finish on its own?
I tried different approaches and came up with the final version below, which has fewer "if" statements and uses the threads as much as possible (in the second phase a few threads are idle, so it might not be fully efficient, but it's the best I was able to get).
The principle is to set an origin (start pos) at the top-left corner and build the read/write indices from that position using the loop indices. The loops start at the local id position in 2D. So all 256 work-items write their first element, and in the second phase only 68 of the 256 work-items complete the 2 bottom rows + 2 right columns.
I'm not an OpenCL pro yet, so this could still be improved (maybe loop unrolling, I don't know).
// 18x18 local tile: the 16x16 work-group plus a 1-pixel halo on each side.
__local float wrkSrc[324];
const int lpitch = 18;
// halfROI is the halo size; assumed here to be 1 pixel in each direction
// for the 3x3 neighborhood. Add halfROI to handle the corner.
const int2 halfROI = (int2)( 1, 1 );
const int lcol = get_local_id(0);
const int lrow = get_local_id(1);
const int2 gid = (int2)( (int)get_global_id(0), (int)get_global_id(1) );
const int2 lid = (int2)( lcol, lrow );
// Always get the most top-left corner of the ROI to extract.
const int2 startPos = gid - lid - halfROI;
// Loop on each thread to get its right ID.
// Threads with id < 2 * halfROI process more than the others, but that is not much of an issue.
for ( int x = lid.x; x < lpitch; x += 16 ) {
    for ( int y = lid.y; y < lpitch; y += 16 ) {
        // Position to write into the local array.
        const int lidx = x + y * lpitch;
        // Position to read from global memory (src).
        const int2 readPos = startPos + (int2)( x, y );
        // Is it inside the image?
        if ( readPos.x >= 0 && readPos.x < width && readPos.y >= 0 && readPos.y < height )
            wrkSrc[lidx] = src[readPos.x + readPos.y * lab_new_pitch];
        else
            wrkSrc[lidx] = 0.0f;
    }
}
// Make sure the whole tile is loaded before any work-item reads its neighborhood.
barrier(CLK_LOCAL_MEM_FENCE);
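For context, here is a rough host-side sketch of how such a kernel could be enqueued with the C++ bindings. This is my own assumption rather than the asker's code; kernelFunction, queue, width and height stand in for your own objects:
// 16x16 work-groups, matching the 16-element stride used in the loops above.
const size_t wg = 16;
cl::NDRange localRange(wg, wg);
// Round the global range up to a multiple of the work-group size; the kernel
// then needs a bounds check before it writes to the global output.
cl::NDRange globalRange(((width  + wg - 1) / wg) * wg,
                        ((height + wg - 1) / wg) * wg);
queue.enqueueNDRangeKernel(kernelFunction, cl::NullRange, globalRange, localRange);
queue.finish();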

How do you iterate through a pitched CUDA array?

Having parallelized with OpenMP before, I'm trying to wrap my head around CUDA, which doesn't seem too intuitive to me. At this point, I'm trying to understand exactly how to loop through an array in a parallelized fashion.
CUDA by Example is a great start.
The snippet on page 43 shows:
__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x; // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
Whereas in OpenMP the programmer chooses the number of times the loop will run and OpenMP splits that into threads for you, in CUDA you have to tell it (via the number of blocks and the number of threads per block in <<<...>>>) to run it enough times to cover your array, using a thread ID number as an iterator. In other words, you could have a CUDA kernel always run 10,000 times, which means the above code will work for any array up to N = 10,000 (and of course for smaller arrays you're wasting cycles dropping out at if (tid < N)).
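As a concrete sketch of what that looks like with multiple threads per block (my own illustration, not from the book; dev_a, dev_b and dev_c are assumed to be device pointers allocated elsewhere):
// Device side: the global index combines block and thread IDs.
__global__ void add(int *a, int *b, int *c, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)                       // guard against the rounded-up excess
        c[tid] = a[tid] + b[tid];
}

// Host side: pick e.g. 256 threads per block and enough blocks to cover N.
int threadsPerBlock = 256;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, N);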
For pitched memory (2D and 3D arrays), the CUDA Programming Guide has the following example:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
This example doesn't seem too useful to me. First they declare an array that is 64 x 64, then the kernel is launched with 100 blocks of 512 threads. That's fine only because the kernel does nothing other than iterate through the array, so all 51,200 threads loop through the whole 64 x 64 array.
According to this answer the iterator for when there are blocks of threads going on will be
int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
So if I wanted to run the first snippet in my question for a pitched array, I could just make sure I had enough blocks and threads to cover every element including the padding that I don't care about. But that seems wasteful.
So how do I iterate through a pitched array without going through the padding elements?
In my particular application I have a 2D FFT and I'm trying to calculate arrays of the magnitude and angle (on the GPU to save time).
After reviewing the valuable comments and answers from JackOLantern, and re-reading the documentation, I was able to get my head straight. Of course the answer is "trivial" now that I understand it.
In the code below, I define CFPtype (Complex Floating Point) and FPtype so that I can quickly change between single and double precision. For example, #define CFPtype cufftComplex.
I still can't wrap my head around the number of threads used to call the kernel. If it's too large, it simply won't go into the function at all. The documentation doesn't seem to say anything about what number should be used - but this is all for a separate question.
The key in getting my whole program to work (2D FFT on pitched memory and calculating magnitude and argument) was realizing that even though CUDA gives you plenty of "apparent" help in allocating 2D and 3D arrays, everything is still in units of bytes. It's obvious in a malloc call that the sizeof(type) must be included, but I totally missed it in calls of the type allocate(width, height). Noob mistake, I guess. Had I written the library I would have made the type size a separate parameter, but whatever.
So given an image of dimensions width x height in pixels, this is how it comes together:
Allocating memory
I'm using pinned memory on the host side because it's supposed to be faster. That's allocated with cudaHostAlloc which is straightforward. For pitched memory, you need to store the pitch for each different width and type, because it could change. In my case the dimensions are all the same (complex to complex transform) but I have arrays that are real numbers so I store a complexPitch and a realPitch. The pitched memory is done like this:
cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);
To copy memory to/from pitched arrays you cannot use plain cudaMemcpy; you need cudaMemcpy2D:
cudaMemcpy2D(inputGPU, complexPitch,               // destination and destination pitch
             inputPinned, width * sizeof(CFPtype), // source and source pitch (the row width in bytes, because the host array is not padded)
             width * sizeof(CFPtype), height,      // width of each copied row in bytes, number of rows
             cudaMemcpyKind::cudaMemcpyHostToDevice);
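The copy back works the same way with the roles reversed; a sketch under the same names, assuming outputPinned is an unpadded host array of the same dimensions:
cudaMemcpy2D(outputPinned, width * sizeof(CFPtype), // destination and destination pitch (unpadded host array)
             outputGPU, complexPitch,               // source and source pitch (pitched device array)
             width * sizeof(CFPtype), height,
             cudaMemcpyKind::cudaMemcpyDeviceToHost);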
FFT plan for pitched arrays
JackOLantern provided this answer, which I couldn't have done without. In my case the plan looks like this:
int n[] = {height, width};
int nembed[] = {height, complexPitch/sizeof(CFPtype)};
result = cufftPlanMany(
    &plan,
    2, n,       // transform rank and dimensions
    nembed, 1,  // input array physical dimensions and stride
    1,          // input distance to next batch (irrelevant because we are only doing 1)
    nembed, 1,  // output array physical dimensions and stride
    1,          // output distance to next batch
    cufftType::CUFFT_C2C, 1);
Executing the FFT is trivial:
cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD);
So far I have had little to optimize. Now I wanted to get magnitude and phase out of the transform, hence the question of how to traverse a pitched array in parallel. First I define a function to call the kernel with the "correct" threads per block and enough blocks to cover the entire image. As suggested by the documentation, creating 2D structures for these numbers is a great help.
void GPUCalcMagPhase(CFPtype *data, size_t dataPitch, int width, int height, FPtype *magnitude, FPtype *phase, size_t magPhasePitch, int cudaBlockSize)
{
    dim3 threadsPerBlock(cudaBlockSize, cudaBlockSize);
    dim3 numBlocks((unsigned int)ceil(width / (double)threadsPerBlock.x), (unsigned int)ceil(height / (double)threadsPerBlock.y));
    CalcMagPhaseKernel<<<numBlocks, threadsPerBlock>>>(data, dataPitch, width, height, magnitude, phase, magPhasePitch);
}
Setting the blocks and threads per block is equivalent to writing the (up to 3) nested for-loops. So you have to have enough blocks * threads to cover the array, and then in the kernel you must make sure that you are not exceeding the array size. By using 2D elements for threadsPerBlock and numBlocks, you avoid having to go through the padding elements in the array.
Traversing a pitched array in parallel
The kernel uses the standard pointer arithmetic from the documentation:
__global__ void CalcMagPhaseKernel(CFPtype *data, size_t dataPitch, int width, int height,
                                   FPtype *magnitude, FPtype *phase, size_t magPhasePitch)
{
    int threadX = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadX >= width)
        return;

    int threadY = threadIdx.y + blockDim.y * blockIdx.y;
    if (threadY >= height)
        return;

    CFPtype *threadRow = (CFPtype *)((char *)data + threadY * dataPitch);
    CFPtype complex = threadRow[threadX];

    FPtype *magRow = (FPtype *)((char *)magnitude + threadY * magPhasePitch);
    FPtype *magElement = &(magRow[threadX]);

    FPtype *phaseRow = (FPtype *)((char *)phase + threadY * magPhasePitch);
    FPtype *phaseElement = &(phaseRow[threadX]);

    *magElement = sqrt(complex.x*complex.x + complex.y*complex.y);
    *phaseElement = atan2(complex.y, complex.x);
}
The only wasted threads here are for the cases where the width or height are not multiples of the number of threads per block.
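For example (with numbers of my own choosing): for cudaBlockSize = 16 and a 1000 x 700 image, numBlocks works out to (63, 44), so 1008 x 704 = 709,632 threads are launched and 709,632 - 700,000 = 9,632 of them return immediately at the bounds checks.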

CUDA Kernel running repeatedly for each launch

I'm having a very odd bug with a CUDA (v5.0) code. Basically, I am trying to use device memory to accumulate values for a program that needs to take the average of a bunch of pixels. In order to do this I have two kernels, one which accumulates a sum in a floating point array, sum_mask, and the other which does the division at the end, avg_mask. The odd thing is that both kernels do exactly the operation I want them to do, but multiplied by 14. I suspect it is somehow a synchronization or grid/block dim problem but I have checked and rechecked everything and cannot figure it out. Any help would be much appreciated.
Edit 1, Problem Statement: Running a CUDA kernel that does any accumulation process gives me what I would expect if each pixel were run consecutively by 14 threads. The specific input that is giving me trouble has width=1280, height=720.
Edit 2: Deleted some code in the snippets that was seemingly unrelated to the problem.
kernel:
__global__ void sum_mask(uint16_t * pic_d, float * mask_d, uint16_t width, uint16_t height)
{
    unsigned short col = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned short row = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned short offset = col + row*width;
    mask_d[offset] = mask_d[offset] + 1.0f; //This ends up incrementing by 14
    //mask_d[offset] = mask_d[offset] + __uint2float_rd(pic_d[offset]); //This would increment by 14*pic_d[offset]
}
code to call kernel:
uint32_t dark_subtraction_filter::update_mask_collection(uint16_t * pic_in)
{
// Synchronous
HANDLE_ERROR(cudaSetDevice(DSF_DEVICE_NUM));
HANDLE_ERROR(cudaMemcpy(pic_in_host,pic_in,width*height*sizeof(uint16_t),cudaMemcpyHostToHost));
averaged_samples++;
HANDLE_ERROR(cudaMemcpyAsync(pic_out_host,mask_device,width*height*sizeof(uint16_t),cudaMemcpyDeviceToHost,dsf_stream));
/* This part is for testing */
HANDLE_ERROR(cudaStreamSynchronize(dsf_stream));
std::cout << "#samples: " << averaged_samples << std::endl;
std::cout << "pic_in_host: " << pic_in_host[9300] << "maskval: " << pic_out_host[9300] <<std::endl;
//Asynchronous
HANDLE_ERROR(cudaMemcpyAsync(picture_device,pic_in_host,width*height*sizeof(uint16_t),cudaMemcpyHostToDevice,dsf_stream));
sum_mask<<< gridDims, blockDims,0,dsf_stream>>>(picture_device, mask_device,width,height);
return averaged_samples;
}
constructor:
dark_subtraction_filter::dark_subtraction_filter(int nWidth, int nHeight)
{
HANDLE_ERROR(cudaSetDevice(DSF_DEVICE_NUM));
width=nWidth;
height=nHeight;
blockDims = dim3(20,20,1);
gridDims = dim3(width/20, height/20,1);
HANDLE_ERROR(cudaStreamCreate(&dsf_stream));
HANDLE_ERROR(cudaHostAlloc( (void **)&pic_in_host,width*height*sizeof(uint16_t),cudaHostAllocPortable)); //cudaHostAllocPortable??
HANDLE_ERROR(cudaHostAlloc( (void **)&pic_out_host,width*height*sizeof(float),cudaHostAllocPortable)); //cudaHostAllocPortable??
HANDLE_ERROR(cudaMalloc( (void **)&picture_device, width*height*sizeof(uint16_t)));
HANDLE_ERROR(cudaMalloc( (void **)&mask_device, width*height*sizeof(float)));
HANDLE_ERROR(cudaPeekAtLastError());
}
The variable offset is declared as an unsigned short, so the offset calculation overflows the 16-bit storage class. With width=1280 and height=720 the largest offset is 921599, which wraps around the 65536-value range roughly 14 times, so about 14 different (col, row) pairs alias onto each stored offset, producing the observed behavior.
The parameter passing and offset calculation are performed on unsigned short/uint16_t. The calculations will likely be quicker if the data types and calculations are of type int.
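A minimal sketch of the fix (my own illustration of the point above, not the asker's final code): compute the offset in int and guard against out-of-range threads.
__global__ void sum_mask(uint16_t *pic_d, float *mask_d, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height)
        return;                          // guard in case the grid overshoots the image
    int offset = col + row * width;      // int arithmetic: no 16-bit wrap-around
    mask_d[offset] += 1.0f;              // each pixel is now incremented exactly once
}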

Understanding work-items and work-groups

Based on my previous question:
I'm still trying to copy an image (no practical reason, just to start with an easy one):
The image contains 200 * 300 == 60000 pixels.
The maximum number of work-items is 4100 according to CL_DEVICE_MAX_WORK_GROUP_SIZE.
kernel1:
std::string kernelCode =
"void kernel copy(global const int* image, global int* result)"
"{"
"result[get_local_id(0) + get_group_id(0) * get_local_size(0)] = image[get_local_id(0) + get_group_id(0) * get_local_size(0)];"
"}";
queue:
for (int offset = 0; offset < 30; ++offset)
queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000));
queue.finish();
Gives segfault, what's wrong?
With the last parameter cl::NDRange(20000) it doesn't, but gives back only part of the image.
Also I don't understand, why I can't use this kernel:
kernel2:
std::string kernelCode =
"void kernel copy(global const int* image, global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
Looking at this presentation, on the 31st slide:
Why can't I just simply use the global_id?
EDIT1
Platform: AMD Accelerated Parallel Processing
Device: AMD Athlon(tm) II P320 Dual-Core Processor
EDIT2
The result based on huseyin tugrul buyukisik's answer:
EDIT3
With the last parameter cl::NDRange(20000):
In both cases the kernel is the first one.
EDIT4
std::string kernelCode =
"void kernel copy(global const int* image, global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
//...
cl_int err;
err = queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(59904), cl::NDRange(128));
if (err == 0)
qDebug() << "success";
else
{
qDebug() << err;
exit(1);
}
Prints success.
Maybe this is wrong?
int size = _originalImage.width() * _originalImage.height();
int* result = new int[size];
//...
cl::Buffer resultBuffer(context, CL_MEM_READ_WRITE, size);
//...
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, size, result);
The culprit was:
cl::Buffer imageBuffer(context, CL_MEM_USE_HOST_PTR, sizeof(int) * size, _originalImage.bits());
cl::Buffer resultBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * size);
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(int) * size, result);
I used size instead of sizeof(int) * size.
Edit 2:
Please try a non-const memory specifier (maybe const is not compatible with your CPU):
std::string kernelCode =
"__kernel void copy(__global int* image, __global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
You may also need to change the buffer options.
Edit:
You have forgotten the '__' prefixes before the 'global' and 'kernel' specifiers, so please try:
std::string kernelCode =
"__kernel void copy(__global const int* image, __global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
The total number of elements is 60000, but you are doing offset + 60000, which runs past the end of the buffers and reads/writes memory you don't own.
The usual usage of NDRange with the OpenCL 1.2 C++ bindings is:
cl_int err;
err=cq.enqueueNDRangeKernel(kernelFunction,referenceRange,globalRange,localRange);
Then check err for the real error code you seek; 0 means success.
If you want to divide the work into smaller parts, you should cap the range of each part at 60000/N.
If you divide into 30 parts, then
for (int offset = 0; offset < 30; ++offset)
queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000/30));
queue.finish();
And double-check the size of each buffer, e.g. sizeof(cl_int)*arrElementNumber, because the size of a host integer may not be the same as the device integer. You need 60000 elements? Then you need 240000 bytes to pass as the size when creating the buffer.
For compatibility, you should check the size of an integer before creating buffers if you intend to run this code on another machine.
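For instance, a sketch with the question's numbers (assuming the C++ bindings):
// 60000 cl_int elements -> 60000 * sizeof(cl_int) = 240000 bytes
cl::Buffer imageBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_int) * 60000);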
You may know this already, but I'm going to say it anyway:
CL_DEVICE_MAX_WORK_GROUP_SIZE
is the number of threads that can share local/shared memory in a compute unit. You don't need to divide your work just for this. OpenCL does this automatically and gives a unique global id to each thread across the whole work, and a unique local id to each thread within a compute unit. If CL_DEVICE_MAX_WORK_GROUP_SIZE is 4100, then it can create up to 4100 threads that share the same local variables in a compute unit. You can compute all 60000 elements in a single sweep: multiple work-groups are created for this and each group has a group id.
// this should work without a problem
queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(60000));
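To connect this back to the two kernels in the question: for a 1D range with a zero global offset, the runtime guarantees that get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0), so both kernels compute the same index. A small illustration (my own, not code from the answer):
__kernel void copy(__global const int* image, __global int* result)
{
    // These two are the same value (assuming a zero global offset):
    size_t gid  = get_global_id(0);
    size_t gid2 = get_group_id(0) * get_local_size(0) + get_local_id(0);
    result[gid] = image[gid];
}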
If you have an AMD GPU or CPU and you are using MSVC, you can install CodeXL from the AMD site and choose "system info" from the drop-down menu to look at the relevant numbers.
Which device is yours? I couldn't find any device with a max work-group size of 4100! My CPU has 1024, my GPU has 256. Is it a Xeon Phi?
For example, the total number of work-items here can be as large as 256*256 times the work-group size.
CodeXL has other nice features such as performance profiling, code tracing and bug fixing, if you need maximum performance.

allocate two arrays calling cudaMalloc once

Memory allocation is one of the most time-consuming operations on a GPU, so I wanted to allocate 2 arrays by calling cudaMalloc once, using the following code:
int numElements = 50000;
size_t size = numElements * sizeof(float);
//declarations-initializations
float *d_M = NULL;
err = cudaMalloc((void **)&d_M, 2*size);
//error checking
// Allocate the device input vector A
float *d_A = d_M;
// Allocate the device input vector B
float *d_B = d_M + size;
err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
//error checking
err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
//error checking
The original code is inside the samples folder of the cuda toolkit named vectorAdd.cu so you can assume h_A, h_B are properly initiated and the code works without the modification I made.
The result was that the second cudaMemcpy returned an error with message invalid argument.
It seems that the operation "d_M + size" does not return what someone would expect as device memory behaves differently but I don't know how.
Is it possible to make my approach (calling cudaMalloc once to allocate memory for two arrays) work? Any comments/answers on whether this is a good approach are also welcome.
UPDATE
As the answers from Robert and dreamcrash suggested, I had to add the number of elements (numElements) to the pointer d_M, not the size in bytes. Just for reference, there was no observable speedup.
You just have to replace
float *d_B = d_M + size;
with
float *d_B = d_M + numElements;
This is pointer arithmetic: if you have an array of floats R = [1.0, 1.2, 3.3, 3.4], you can print its first position with printf("%f", *R);.
And the second position? You just do printf("%f\n", *(R + 1));, i.e. you advance the pointer by 1 element. You do not add sizeof(float), as you were doing: R + sizeof(float) points at position R[4], since sizeof(float) = 4.
When you declare float *d_B = d_M + numElements;, the compiler knows that d_B points into a contiguous block of floats, each with the size of a float. Hence, you do not need to specify the distance in bytes but rather in elements; the compiler does the math for you. This approach is more human-friendly, since it is more intuitive to express pointer arithmetic in elements than in bytes. It is also more portable: if the number of bytes of a given type changes with the underlying architecture, the compiler handles that for you, so your code will not break because it assumed a fixed byte size.
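Putting the pieces together, a sketch of the corrected version of the snippet from the question (same names as above, error checking elided):
int numElements = 50000;
size_t size = numElements * sizeof(float);

// One allocation for both vectors.
float *d_M = NULL;
err = cudaMalloc((void **)&d_M, 2*size);

float *d_A = d_M;               // first half
float *d_B = d_M + numElements; // second half: offset in elements, not bytes

err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);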
You said that "The result was that the second cudaMemcpy returned an error with message invalid argument":
If you print the number corresponding to this error, it will print 11, and if you check the CUDA API you will see that this error corresponds to:
cudaErrorInvalidValue
This indicates that one or more of the parameters passed to the API
call is not within an acceptable range of values.
In your example this means that float *d_B = d_M + size; goes out of range.
You allocated space for 100000 floats: d_A covers elements 0 to 49999, but according to your code d_B starts at numElements * sizeof(float) = 50000 * 4 = 200000, and since 200000 > 100000 you get the invalid argument error.