Understanding work-items and work-groups - c++

Based on my previous question:
I'm still trying to copy an image (no practical reason, just to start with an easy one):
The image contains 200 * 300 == 60000 pixels.
The maximum number of work-items is 4100 according to CL_DEVICE_MAX_WORK_GROUP_SIZE.
kernel1:
std::string kernelCode =
"void kernel copy(global const int* image, global int* result)"
"{"
"result[get_local_id(0) + get_group_id(0) * get_local_size(0)] = image[get_local_id(0) + get_group_id(0) * get_local_size(0)];"
"}";
queue:
for (int offset = 0; offset < 30; ++offset)
queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000));
queue.finish();
This gives a segfault. What's wrong?
With cl::NDRange(20000) as the last parameter it doesn't crash, but it returns only part of the image.
Also, I don't understand why I can't use this kernel:
kernel2:
std::string kernelCode =
"void kernel copy(global const int* image, global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
Looking at this presentation on the 31st slide:
Why can't I just simply use the global_id?
EDIT1
Platform: AMD Accelerated Parallel Processing
Device: AMD Athlon(tm) II P320 Dual-Core Processor
EDIT2
The result based on huseyin tugrul buyukisik's answer:
EDIT3
With the last parameter cl::NDRange(20000):
In both cases the kernel is the first one.
EDIT4
std::string kernelCode =
"void kernel copy(global const int* image, global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
//...
cl_int err;
err = queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(59904), cl::NDRange(128));
if (err == 0)
qDebug() << "success";
else
{
qDebug() << err;
exit(1);
}
Prints success.
Maybe this is wrong?
int size = _originalImage.width() * _originalImage.height();
int* result = new int[size];
//...
cl::Buffer resultBuffer(context, CL_MEM_READ_WRITE, size);
//...
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, size, result);
The culprit was:
cl::Buffer imageBuffer(context, CL_MEM_USE_HOST_PTR, sizeof(int) * size, _originalImage.bits());
cl::Buffer resultBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * size);
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(int) * size, result);
I had passed size (an element count) where a byte count, sizeof(int) * size, was required.
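The element-count versus byte-count mix-up can be sketched on the host without any OpenCL calls. The helper names below (bufferBytes, elementsThatFit, checkedReadBytes) are hypothetical and only illustrate the arithmetic; they are not part of the OpenCL API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed for `count` elements of type T.
// OpenCL buffer sizes and read/write sizes are expressed in bytes.
template <typename T>
std::size_t bufferBytes(std::size_t count) {
    return count * sizeof(T);
}

// How many whole elements of type T fit into `bytes` bytes.
template <typename T>
std::size_t elementsThatFit(std::size_t bytes) {
    return bytes / sizeof(T);
}

// Byte count for reading `count` elements, checked against the buffer size.
template <typename T>
std::size_t checkedReadBytes(std::size_t bufferSizeBytes, std::size_t count) {
    const std::size_t bytes = bufferBytes<T>(count);
    assert(bytes <= bufferSizeBytes && "read would run past the end of the buffer");
    return bytes;
}
```

With 60000 four-byte pixels, passing 60000 as the size describes only 15000 elements, which is consistent with getting back just part of the image.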

Edit 2:
Please try it without the const qualifier (it may not be compatible with your CPU):
std::string kernelCode =
"__kernel void copy(__global int* image, __global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
You may also need to change the buffer options.
Edit:
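The '__' prefixes on the 'kernel' and 'global' qualifiers are optional in OpenCL C, so their absence should not be the problem, but if your compiler rejects the short forms, please try: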
std::string kernelCode =
"__kernel void copy(__global const int* image, __global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
There are 60000 elements in total, but each launch covers the range from offset to offset + 60000, which runs past the end of the buffers and reads/writes memory you do not own.
The usual NDRange usage with the OpenCL 1.2 C++ bindings is:
cl_int err;
err=cq.enqueueNDRangeKernel(kernelFunction,referenceRange,globalRange,localRange);
Then check err for the actual error code. 0 (CL_SUCCESS) means success.
If you want to divide the work into N smaller parts, you should cap the range of each part at 60000/N.
If you divide it into 30 parts, then:
for (int offset = 0; offset < 30; ++offset)
queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000/30));
queue.finish();
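As a rough sketch of that chunking arithmetic, the helper below computes an offset and a size for each launch so that the chunks cover the whole range exactly once. Chunk and splitRange are hypothetical names for illustration; each pair would be passed as the offset and global range of one enqueueNDRangeKernel call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One launch: start offset and number of work-items.
struct Chunk {
    std::size_t offset;
    std::size_t size;
};

// Split `total` work-items into `parts` consecutive chunks.
// Any remainder is spread over the first chunks, so offset + size
// never exceeds `total`.
std::vector<Chunk> splitRange(std::size_t total, std::size_t parts) {
    assert(parts > 0);
    std::vector<Chunk> chunks;
    const std::size_t base = total / parts;
    const std::size_t extra = total % parts;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t size = base + (i < extra ? 1 : 0);
        chunks.push_back({offset, size});
        offset += size;
    }
    return chunks;
}
```

For 60000 elements and 30 parts this yields offsets 0, 2000, ..., 58000, each with size 2000, so the last chunk ends exactly at 60000.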
Also double-check the size of each buffer, e.g. sizeof(cl_int)*arrElementNumber,
because the size of a host integer may not be the same as the size of a device integer. You need 60000 elements? Then you need to pass 240000 bytes as the size when creating the buffer.
For compatibility, you should check the size of an integer before creating buffers if you intend to run this code on another machine.
You may know this already, but I'll mention it anyway:
CL_DEVICE_MAX_WORK_GROUP_SIZE
is the maximum number of work-items in one work-group, that is, the number of threads that can share local memory within a compute unit. You don't need to split your work up just because of this limit. OpenCL does that automatically: it gives each work-item a unique global id across the whole range and a local id that is unique within its work-group. If CL_DEVICE_MAX_WORK_GROUP_SIZE is 4100, then up to 4100 work-items can share the same local variables in a compute unit. You can process all 60000 elements in a single launch: multiple work-groups are created for this, and each group has its own group id.
// this should work without a problem
queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(60000));
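The relationship between the ids can be simulated on the host. The sketch below assumes the one-dimensional OpenCL rule that the global id equals the global offset plus group id times local size plus local id; globalId and coversRangeOnce are hypothetical helper names, not OpenCL functions.

```cpp
#include <cassert>
#include <cstddef>

// Global id in one dimension, built from the values that
// get_global_offset, get_group_id, get_local_size and get_local_id report.
std::size_t globalId(std::size_t offset, std::size_t groupId,
                     std::size_t localSize, std::size_t localId) {
    assert(localId < localSize);
    return offset + groupId * localSize + localId;
}

// Check that a launch with zero offset visits every index in
// [0, globalSize) exactly once, in order. With an explicit local size,
// OpenCL 1.x requires globalSize to be a multiple of localSize.
bool coversRangeOnce(std::size_t globalSize, std::size_t localSize) {
    if (localSize == 0 || globalSize % localSize != 0) {
        return false;
    }
    std::size_t expected = 0;
    for (std::size_t group = 0; group < globalSize / localSize; ++group) {
        for (std::size_t local = 0; local < localSize; ++local) {
            if (globalId(0, group, localSize, local) != expected) {
                return false;
            }
            ++expected;
        }
    }
    return expected == globalSize;
}
```

Note that the expression get_local_id(0) + get_group_id(0) * get_local_size(0) from the first kernel omits the offset term, so it matches get_global_id(0) only when the launch offset is zero.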
If you have an AMD GPU or CPU and you are using MSVC, you can install CodeXL from AMD's site and choose system info from the drop-down menu to look at the relevant numbers.
Which device is this? I couldn't find any device with a max work-group size of 4100! My CPU has 1024, my GPU has 256. Is it a Xeon Phi?
For example, the total number of work-items can be as large as 256*256 times the work-group size here.
CodeXL has other nice features too, such as performance profiling and code tracing, if you need maximum performance and bug fixing.

Related

a specific OpenCL kernel performs differently on mobile and PC

I was trying to run an OpenCL kernel on both Adreno 630 and my laptop, it turns out that the kernel runs perfectly on mobile but crashes my laptop every single time. I am still trying to figure out the reason by myself. Here's my kernel. I hope you could help me with it, thanks.
__kernel void gen_mapxy( __read_only image2d_t _disp, const float offsetX, __write_only image2d_t _mapxy )
{
const int y = get_global_id(0);
const int local_y = get_local_id(0);
__local short temp[24][1080];
const int imageWidth = get_image_width(_disp);
for(int x = 0; x < imageWidth; ++x)
temp[local_y][x] = 0;
for(int x = imageWidth - 1; x >= 0; --x){
int tempDisp = read_imagei(_disp, sampler_nearest, (int2)(x, y)).x;
int newPos = clamp((int)(x + offsetX * (tempDisp) / 255), 0, imageWidth - 1);
temp[local_y][newPos] = tempDisp;
write_imagef(_mapxy, (int2)(newPos, y), (float4)(x, y, 0, 0));
}
You are using a big local array.
__local short temp[24][1080]
2 bytes * 24 * 1080 = 51840 bytes = 50.6 kB. Some desktop GPUs (and their notebook counterparts) have lower local memory limits. For example, a GTX 1060 reports CL_DEVICE_LOCAL_MEM_SIZE as 49152 bytes. As for the Adreno 630, either it silently ignores the oversized array or it supports larger local arrays, because it is possible that local arrays are emulated in global memory (limited to hundreds of megabytes) on those chips. If they do have fast on-chip local memory, then silent ignoring is the more likely explanation, unless they really doubled the local memory limit from the previous generation of Adrenos.
Even when the GPU supports the exact amount, using all of it generally limits thread-level parallelism on each pipeline and severely reduces potential performance gains.
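The size check can be written out as a small host-side sketch. The helper names are hypothetical; in a real program the limit would come from clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE rather than from a hard-coded number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes occupied by a rows x cols local array of elements of elemSize bytes.
std::size_t localArrayBytes(std::size_t rows, std::size_t cols,
                            std::size_t elemSize) {
    assert(elemSize > 0);
    return rows * cols * elemSize;
}

// True when the array fits within the device's reported local memory size.
bool fitsInLocalMemory(std::size_t requiredBytes, std::size_t deviceLimitBytes) {
    return requiredBytes <= deviceLimitBytes;
}
```

For the temp[24][1080] array of 16-bit values this gives 51840 bytes, which exceeds both a 49152-byte and a 32768-byte limit.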
If the previous generation of Adreno GPUs is comparable,
https://compubench.com/device.jsp?benchmark=compu15m&os=Android&api=cs&D=Samsung+Galaxy+S7+%28SM-G930x%29&testgroup=info
this page says
CL_DEVICE_LOCAL_MEM_SIZE
32768
CL_DEVICE_LOCAL_MEM_TYPE
CL_LOCAL
So local memory is fast but only 32 kB, which means either the driver is ignoring the error, or you have not added the necessary error checking, or both.

correct way of copying and printing 2dim array on CUDA device [duplicate]

I just started CUDA programming, and was trying to execute the code shown below. The idea is to copy a 2-dimensional array to the device, calculate the sum of all elements, and retrieve the sum afterwards (I know that this algorithm is not parallelized. In fact it is doing more work than necessary. This is, however, just intended as practice for memcpy).
#include<stdio.h>
#include<cuda.h>
#include <iostream>
#include <cutil_inline.h>
#define height 50
#define width 50
using namespace std;
// Device code
__global__ void kernel(float* devPtr, int pitch,int* sum)
{
int tempsum = 0;
for (int r = 0; r < height; ++r) {
int* row = (int*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
int element = row[c];
tempsum = tempsum + element;
}
}
*sum = tempsum;
}
//Host Code
int main()
{
int testarray[2][8] = {{4,4,4,4,4,4,4,4},{4,4,4,4,4,4,4,4}};
int* sum =0;
int* sumhost = 0;
sumhost = (int*)malloc(sizeof(int));
cout << *sumhost << endl;
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(int), height);
cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);
cudaMalloc((void**)&sum, sizeof(int));
kernel<<<1, 4>>>(devPtr, pitch, sum);
cutilCheckMsg("kernel launch failure");
cudaMemcpy(sumhost, sum, sizeof(int), cudaMemcpyDeviceToHost);
cout << *sumhost << endl;
return 0;
}
This code compiles just fine (on the 4.0 sdk release candidate). However as soon as I try to execute, I get
0
cpexample.cu(43) : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : invalid pitch argument.
Which is unfortunate, since I have no idea how to fix it ;-(. As far as I know, the pitch is an offset in memory to allow faster copying of data. However such a pitch is only used in the device memory, not in the host memory, isn't it? Therefore the pitch of my host memory should be 0, shouldn't it?
Moreover I would also like to ask two other questions:
If I declare a variable like int* sumhost (see above), where does this pointer point to? At first to the host memory and after cudaMalloc to the device memory?
cutilCheckMsg was very handy in this case. Are there similar functions for debugging I should know of?
In this line of your code:
cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);
you're saying the source-pitch value for testarray is equal to 0, but how can that be possible when the formula for pitch is T* elem = (T*)((char*)base_address + row * pitch) + column? If we substituted a value of 0 for pitch in that formula, we will not get the right values when looking up an address at some 2-dimensional (x,y) ordered pair offset. One thing to consider is that the rule for the pitch value is pitch = width + padding. On the host, the padding is often equal to 0, but the width is not 0 unless there is nothing in your array. On the hardware side there may be extra padding, which is why the value for pitch may not equal the declared width of the array. Therefore you can conclude that pitch >= width depending on the padding value. So even on the host-side, the value for the source pitch should be at least the size of each row in bytes, meaning in the case of testarray, it should be 8*sizeof(int). Finally, the height of your 2D array in the host is also only 2 rows, not 4.
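The pitch formula can be tried out on the host alone. The sketch below is a plain C++ analogue of a pitched 2D copy, not the CUDA runtime itself; pitchedElement and copy2DHost are hypothetical names, and the padded destination pitch in the usage note is an assumed value chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Address of element (row, col) in a pitched buffer; pitch is in bytes.
template <typename T>
T* pitchedElement(T* base, std::size_t pitchBytes,
                  std::size_t row, std::size_t col) {
    char* rowStart = reinterpret_cast<char*>(base) + row * pitchBytes;
    return reinterpret_cast<T*>(rowStart) + col;
}

// Copy `height` rows of `widthBytes` bytes each. Both pitches must be at
// least widthBytes, which is why a source pitch of 0 cannot be valid.
void copy2DHost(void* dst, std::size_t dstPitch,
                const void* src, std::size_t srcPitch,
                std::size_t widthBytes, std::size_t height) {
    assert(dstPitch >= widthBytes);
    assert(srcPitch >= widthBytes);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(static_cast<char*>(dst) + row * dstPitch,
                    static_cast<const char*>(src) + row * srcPitch,
                    widthBytes);
    }
}
```

For a tightly packed int[2][8] source, the source pitch is 8 * sizeof(int) and the height is 2; the destination may use a larger pitch, for example 16 * sizeof(int), with the extra bytes in each row left as padding.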
As an answer to your question about what happens with allocated pointers, if you allocate a pointer with malloc(), then the pointer is given an address value that resides in host memory. So you can dereference it on the host-side, but not on the device side. On the other-hand, a pointer allocated with cudaMalloc() is given a pointer to memory residing on the device. Therefore if you dereference it on the host, it's not pointing to allocated memory on the host, and unpredictable results will ensue. It is okay though to pass this pointer address to the kernel on the device, since when it's dereferenced on the device-side, it's pointing to memory locally accessible to the device. Overall the CUDA runtime keeps these two memory locations separate, providing memory copy functions that will copy back and forth between the device and host, and use the address values from these pointers as the source and-or destination for the copy depending on the desired direction (host-to-device or device-to-host). Now if you took the same int*, and first allocated it with malloc(), and then (after hopefully calling free() on the pointer) with cudaMalloc(), your pointer would first have an address that pointed to host memory, and then device memory. You would have to keep track of its state in-order to avoid unpredictable results from dereferencing an address that was on the device or host depending on whether it was dereferenced in host code or device code.

CUDA Kernel running repeatedly for each launch

I'm having a very odd bug with a CUDA (v5.0) code. Basically, I am trying to use device memory to accumulate values for a program that needs to take the average of a bunch of pixels. In order to do this I have two kernels, one which accumulates a sum in a floating point array, sum_mask, and the other which does the division at the end, avg_mask. The odd thing is that both kernels do exactly the operation I want them to do, multiplied by 14. I suspect it is somehow a synchronization or grid/block dim problem, but I have checked and rechecked everything and cannot figure it out. Any help would be much appreciated.
Edit 1, Problem Statement: Running a CUDA kernel that does any accumulation process gives me what I would expect if each pixel were run consecutively by 14 threads. The specific input that is giving me trouble has width=1280, height=720
Edit 2: Deleted some code in the snippets that was seemingly unrelated to the problem.
kernel:
__global__ void sum_mask(uint16_t * pic_d, float * mask_d,uint16_t width, uint16_t height)
{
unsigned short col = blockIdx.x*blockDim.x + threadIdx.x;
unsigned short row = blockIdx.y*blockDim.y + threadIdx.y;
unsigned short offset = col + row*width;
mask_d[offset] = mask_d[offset] + 1.0f; //This ends up incrementing by 14
//mask_d[offset] = mask_d[offset] + __uint2float_rd(pic_d[offset]); //This would increment by 14*pic_d[offset]
}
code to call kernel:
uint32_t dark_subtraction_filter::update_mask_collection(uint16_t * pic_in)
{
// Synchronous
HANDLE_ERROR(cudaSetDevice(DSF_DEVICE_NUM));
HANDLE_ERROR(cudaMemcpy(pic_in_host,pic_in,width*height*sizeof(uint16_t),cudaMemcpyHostToHost));
averaged_samples++;
HANDLE_ERROR(cudaMemcpyAsync(pic_out_host,mask_device,width*height*sizeof(uint16_t),cudaMemcpyDeviceToHost,dsf_stream));
/* This part is for testing */
HANDLE_ERROR(cudaStreamSynchronize(dsf_stream));
std::cout << "#samples: " << averaged_samples << std::endl;
std::cout << "pic_in_host: " << pic_in_host[9300] << "maskval: " << pic_out_host[9300] <<std::endl;
//Asynchronous
HANDLE_ERROR(cudaMemcpyAsync(picture_device,pic_in_host,width*height*sizeof(uint16_t),cudaMemcpyHostToDevice,dsf_stream));
sum_mask<<< gridDims, blockDims,0,dsf_stream>>>(picture_device, mask_device,width,height);
return averaged_samples;
}
constructor:
dark_subtraction_filter::dark_subtraction_filter(int nWidth, int nHeight)
{
HANDLE_ERROR(cudaSetDevice(DSF_DEVICE_NUM));
width=nWidth;
height=nHeight;
blockDims = dim3(20,20,1);
gridDims = dim3(width/20, height/20,1);
HANDLE_ERROR(cudaStreamCreate(&dsf_stream));
HANDLE_ERROR(cudaHostAlloc( (void **)&pic_in_host,width*height*sizeof(uint16_t),cudaHostAllocPortable)); //cudaHostAllocPortable??
HANDLE_ERROR(cudaHostAlloc( (void **)&pic_out_host,width*height*sizeof(float),cudaHostAllocPortable)); //cudaHostAllocPortable??
HANDLE_ERROR(cudaMalloc( (void **)&picture_device, width*height*sizeof(uint16_t)));
HANDLE_ERROR(cudaMalloc( (void **)&mask_device, width*height*sizeof(float)));
HANDLE_ERROR(cudaPeekAtLastError());
}
The variable offset is declared as unsigned short, so the offset calculation overflows its 16-bit storage. With width = 1280 and height = 720 there are 921600 pixels, roughly 14 times the 65536 values a 16-bit index can address, so the computed offsets wrap around about 14 times and each of the first 65536 elements is updated about 14 times. That matches the observed behavior.
The parameter passing and offset calculation are performed on unsigned short/uint16_t. The calculations will likely also be quicker if the data types and calculations use int.
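The wrap-around can be reproduced with a sequential host-side simulation. This is only a sketch: it ignores the data races a real parallel launch would have, and offset16, offset32 and simulateLaunch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Offset as the kernel computes it: the result is truncated to 16 bits.
std::uint16_t offset16(std::uint16_t col, std::uint16_t row, std::uint16_t width) {
    return static_cast<std::uint16_t>(static_cast<std::uint32_t>(row) * width + col);
}

// Offset computed in a type wide enough for the whole image.
std::size_t offset32(std::uint16_t col, std::uint16_t row, std::uint16_t width) {
    return static_cast<std::size_t>(row) * width + col;
}

// Simulate one launch in which every pixel adds 1 at its offset and
// return how often each element was touched.
std::vector<int> simulateLaunch(std::uint16_t width, std::uint16_t height,
                                bool use16BitOffset) {
    std::vector<int> hits(static_cast<std::size_t>(width) * height, 0);
    for (std::uint32_t row = 0; row < height; ++row) {
        for (std::uint32_t col = 0; col < width; ++col) {
            const std::uint16_t c = static_cast<std::uint16_t>(col);
            const std::uint16_t r = static_cast<std::uint16_t>(row);
            const std::size_t idx = use16BitOffset ? offset16(c, r, width)
                                                   : offset32(c, r, width);
            ++hits[idx];
        }
    }
    return hits;
}
```

For a 1280 x 720 image the 16-bit version touches element 9300 fourteen times in one launch and never touches any element at index 65536 or above, while the wide version touches every element once.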

allocate two arrays calling cudaMalloc once

Memory allocation is one of the most time consuming operations in a GPU so I wanted to allocate 2 arrays by calling cudaMalloc once using the following code:
int numElements = 50000;
size_t size = numElements * sizeof(float);
//declarations-initializations
float *d_M = NULL;
err = cudaMalloc((void **)&d_M, 2*size);
//error checking
// Allocate the device input vector A
float *d_A = d_M;
// Allocate the device input vector B
float *d_B = d_M + size;
err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
//error checking
err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
//error checking
The original code is inside the samples folder of the CUDA toolkit, named vectorAdd.cu, so you can assume h_A and h_B are properly initialized and that the code works without the modification I made.
The result was that the second cudaMemcpy returned an error with message invalid argument.
It seems that the operation "d_M + size" does not return what someone would expect as device memory behaves differently but I don't know how.
Is it possible to make my approach (calling cudaMalloc once to allocate memory for two arrays) work? Any comments/answers on whether this is a good approach are also welcome.
UPDATE
As the answers of Robert and dreamcrash suggested I had to add number of elements (numElements) to the pointer d_M not the size which is the number of bytes. Just for reference there was no observable speedup.
You just have to replace
float *d_B = d_M + size;
with
float *d_B = d_M + numElements;
This is pointer arithmetic. If you have a pointer R to an array of floats [1.0,1.2,3.3,3.4], you can print its first element by doing printf("%f",*R);.
And the second element? You just do printf("%f\n",*(++R));, which is R + 1. You do not do R + sizeof(float), as you were doing. When you do R + sizeof(float) you access the element R[4], since sizeof(float) == 4.
When you declare float *d_B = d_M + numElements; the compiler assumes that the memory d_B points into is laid out contiguously and that each element has the size of a float. Hence, you do not need to specify the distance in bytes but rather in elements; the compiler will do the math for you. This approach is more human-friendly, since it is more intuitive to express pointer arithmetic in terms of elements than in terms of bytes. Moreover, it is also more portable: if the number of bytes of a given type changes with the underlying architecture, the compiler will handle that for you. Consequently, one's code will not break because one assumed a fixed byte size.
You said that "The result was that the second cudaMemcpy returned an error with message invalid argument":
If you print the number corresponding to this error, it will print 11, and if you check the CUDA API documentation you can verify that this error corresponds to:
cudaErrorInvalidValue
This indicates that one or more of the parameters passed to the API
call is not within an acceptable range of values.
In your example, this means that float *d_B = d_M + size; points outside the allocated range.
You have allocated space for 100000 floats. d_A covers elements 0 to 49999, but according to your code d_B starts at element numElements * sizeof(float) = 50000 * 4 = 200000. Since 200000 > 100000, you get the invalid argument error.
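The same element-versus-byte arithmetic can be checked with ordinary host pointers. This is a minimal sketch with hypothetical helper names (secondArray, byteDistance, fitsInAllocation); it does not call the CUDA runtime.

```cpp
#include <cassert>
#include <cstddef>

// Start of the second array inside one allocation that holds two arrays
// of numElements floats each. The offset is in elements, not bytes.
float* secondArray(float* base, std::size_t numElements) {
    return base + numElements;
}

// Distance in bytes between two pointers into the same allocation.
std::ptrdiff_t byteDistance(const float* from, const float* to) {
    return reinterpret_cast<const char*>(to) -
           reinterpret_cast<const char*>(from);
}

// True when [offsetElems, offsetElems + countElems) lies inside an
// allocation of totalElems elements.
bool fitsInAllocation(std::size_t offsetElems, std::size_t countElems,
                      std::size_t totalElems) {
    return offsetElems <= totalElems && countElems <= totalElems - offsetElems;
}
```

With 100000 allocated floats, an element offset of 50000 leaves room for the second 50000-element array, whereas an offset of 200000 (the byte count used as an element count) does not.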
