OpenCL Kernel won't work until another buffer creation call - c++

I have an OpenCL buffer that is created via:
return cl::Buffer(_context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size);
I write data to that buffer and want to use it later in a kernel.
I get strange behavior, though, because my kernel won't work with that buffer. Only when I arbitrarily add a call to
BufferContainer blah(oclEnvironment, cv::Size(width, height), 3);
which calls the above function to create a same-sized buffer again, does the kernel work. I don't call blah.Write(...) at all. The kernel seems to work with the data I wrote to the first buffer. But if I comment out that single line with the "blah" buffer, it stops working again.
Edit: both buffers are created with exactly the same dimensions.
Edit 2: does it have something to do with the command queue and the order of objects there?
Basically, I run a kernel that reduces the image to find the maximum HSV V value. After that kernel finishes and gives me the maximum, I run the next kernel with one parameter set to that maximum. So the call chain looks like:
float maxV = _maxValueReduce->GetValueMaximum(oclEnvironment, fiBuffer, width, height, true);
//start setting the parameters of the next kernel
...
_kernel.setArg(8, maxV);
oclEnvironment._commandQueue.enqueueNDRangeKernel(_kernel, cl::NullRange, global, local);
GetValueMaximum(...) itself starts a reduction kernel to find that maximum.
Edit 3:
float OclMaxValueReduce::GetValueMaximum(OclEnvironment& oclEnvironment,
BufferContainer& source, int width, int height, const bool sync)
{
//Create the result buffer
//Intel HD 530 can have a max. workgroup size of 256.
int dim1 = 16;
int dim2 = 16;
cl::NDRange local(dim1, dim2,1);
cl::NDRange global(source._size.width, source._size.height, 1);
//Calculate the number of workgroups
int numberOfWorkgroups = ceil((width * height) / (float)(dim1 * dim2));
//each workgroup reduces the data to a single element. These elements are then reduced on the host in the final reduction step.
//First create the buffer for the workgroups result
BufferContainer result(oclEnvironment, cv::Size(numberOfWorkgroups, 1), sizeof(float));
//set the kernel arguments
_kernel.setArg(0, source.GetOclBuffer());
_kernel.setArg(1, result.GetOclBuffer());
_kernel.setArg(2, width);
_kernel.setArg(3, height);
oclEnvironment._commandQueue.enqueueNDRangeKernel(_kernel, cl::NullRange, global, local);
if (sync)
oclEnvironment._commandQueue.finish();
//retrieve the reduced result array. The final reduce step is done here on host.
float* dest = new float[numberOfWorkgroups];
ReadBuffer(oclEnvironment, result.GetOclBuffer(), dest, numberOfWorkgroups);
std::vector<float> resultArray(dest, dest + numberOfWorkgroups);
delete[] dest;
//find and return the max in array.
std::vector<float>::iterator it;
it = std::max_element(resultArray.begin(), resultArray.end());
return resultArray[std::distance(resultArray.begin(), it)];
}
and this calls the read buffer:
/* Read a float array from ocl buffer*/
void OclMaxValueReduce::ReadBuffer(OclEnvironment oclEnvironment, cl::Buffer
&resultBuffer, float* dest, const size_t size) {
int errcode;
float* resultData = (float*)oclEnvironment._commandQueue.enqueueMapBuffer(resultBuffer, true, CL_MAP_READ, 0, size * sizeof(float), 0, 0, &errcode);
if (errcode)
throw std::exception(std::string("OclEnvironment::ReadBuffer: OCL could not map Buffer!").data(), errcode);
//std::copy(resultData, (resultData + size), dest);
memcpy(dest, resultData, size * sizeof(float));
cl::Event testEvent;
oclEnvironment._commandQueue.enqueueUnmapMemObject(resultBuffer, resultData, NULL, &testEvent); // Unmap Buffer
testEvent.wait();
}

Related

Image grayscale in OpenCL

I want to transform an RGB image to a grayscale image.
My problem is that when I copy the data back, the kernel has returned only zeros.
OpenCL code:
__kernel void grayscale(__global uchar * input, __global uchar * output)
{
int gid = get_global_id(0);
output[gid] = 0.0722 * input[gid][0] + 0.7152 * input[gid][1] + 0.2126 * input[gid][2];
}
Host code:
void RunKernel(char fileName[], char methodName[], Mat inputImg, Mat outputImg,
char outputLoc[], int mem_size){
/*
Initialisation of the device and read the kernel source.
*/
//Creating cl_mem objects for input and output. mem_size is the image width*height
imgInMem = clCreateBuffer(img_context, CL_MEM_READ_ONLY,
mem_size * sizeof(uchar), NULL, &err);
imgOutMem = clCreateBuffer(img_context, CL_MEM_WRITE_ONLY,
mem_size * sizeof(uchar), NULL, &err);
//copy the data into cl_mem input
err = clEnqueueWriteBuffer(img_cmd_queue, imgInMem, CL_TRUE, 0, mem_size *sizeof(uchar),
&inputImg.data, 0, NULL, NULL);
//Create the program and load the kernel source to it
img_program = clCreateProgramWithSource(img_context, 1, (const char **) &kernel_src_str,
(const size_t *) &kernel_size, &err);
err = clBuildProgram(img_program, 1, &dev_id, NULL, NULL, NULL);
img_kernel = clCreateKernel(img_program, methodName, &err);
//Setting the kernel args
err = clSetKernelArg(img_kernel, 0, sizeof(cl_mem), (void *) &imgInMem);
err = clSetKernelArg(img_kernel, 1, sizeof(cl_mem), (void *) &imgOutMem);
//define the global size and local size
size_t global_work_size = mem_size;
size_t local_work_size = 256;
//Enqueue a command to execute a kernel on a device ("1" indicates 1-dim work)
err = clEnqueueNDRangeKernel(img_cmd_queue, img_kernel, 1, NULL, &global_work_size,
&local_work_size, 0, NULL, NULL);
err = clFinish(img_cmd_queue);
//Read back the result from device
err = clEnqueueReadBuffer(img_cmd_queue, imgOutMem, CL_TRUE, 0,
mem_size *sizeof(uchar), outputImg.data, 0, NULL, NULL);
/*
Release the necessary objects.
*/
}
If I write the values to the console after clEnqueueReadBuffer, they are all zeros. My outputImg is declared like this in main:
Mat outImg(height,width,CV_8UC1,Scalar(0));
and call the method with this:
RunKernel("kernels/grayscale.cl","grayscale", inImg, outImg,"resources/grayscale_car_gpu.jpg", MEM_SIZE);
The problem is likely the 2D array syntax you're using:
0.0722 * input[gid][0] + 0.7152 * input[gid][1] + 0.2126 * input[gid][2]
What addresses do you think that is accessing exactly?
Instead, assuming you're trying to access sequential bytes as RGB (in BGR order, judging by the coefficient value), try:
0.0722 * input[3*gid+0] + 0.7152 * input[3*gid+1] + 0.2126 * input[3*gid+2]
You should add an "f" to the float constants (otherwise they are doubles, which are not supported on all devices).
You should add rounding from float back to uchar. So, together, something like:
convert_uchar_sat_rte(0.0722f * input[3*gid+0] +
0.7152f * input[3*gid+1] +
0.2126f * input[3*gid+2])
Finally, you're passing the same size buffer for the input and output images, but seemingly treating the input buffer as RGB, which is 3x larger than a single byte of monochrome. So you'll need to fix that in the host code.
Any time you're getting incorrect output from a kernel, simplify it to see if it is an input problem, a calculation problem, an output problem, or a host code issue. Keep narrowing it down until you've found your problem.
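One way to narrow things down is to compare the kernel output against a host-side reference. The following is a minimal sketch of the corrected per-pixel arithmetic in plain C++, assuming tightly packed 8-bit BGR input. The function name bgrToGray is made up for illustration, and std::nearbyint under the default rounding mode plus a clamp stands in for convert_uchar_sat_rte.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference for the corrected kernel arithmetic (hypothetical helper).
// Input: packed BGR bytes (3 per pixel). Output: one gray byte per pixel.
std::vector<std::uint8_t> bgrToGray(const std::vector<std::uint8_t>& bgr) {
    assert(bgr.size() % 3 == 0);  // input must hold whole pixels
    std::vector<std::uint8_t> gray(bgr.size() / 3);
    for (std::size_t i = 0; i < gray.size(); ++i) {
        // Same indexing as the kernel fix: 3*gid + channel.
        const float v = 0.0722f * bgr[3 * i + 0] +
                        0.7152f * bgr[3 * i + 1] +
                        0.2126f * bgr[3 * i + 2];
        // Round to nearest (ties to even by default), then saturate to [0, 255].
        const float rounded = std::nearbyint(v);
        gray[i] = static_cast<std::uint8_t>(std::min(255.0f, std::max(0.0f, rounded)));
    }
    return gray;
}
```

Note that the input vector is three times as large as the output, which is the same size relationship the two OpenCL buffers need on the host side.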

ArrayFire: function with an OpenCL kernel called from main function

The function is the following (extracted from http://arrayfire.org/docs/interop_opencl.htm):
Single main function
int main() {
size_t length = 10;
// Create ArrayFire array objects:
af::array A = af::randu(length, f32);
af::array B = af::constant(0, length, f32);
// ... additional ArrayFire operations here
// 2. Obtain the device, context, and queue used by ArrayFire
static cl_context af_context = afcl::getContext();
static cl_device_id af_device_id = afcl::getDeviceId();
static cl_command_queue af_queue = afcl::getQueue();
// 3. Obtain cl_mem references to af::array objects
cl_mem * d_A = A.device<cl_mem>();
cl_mem * d_B = B.device<cl_mem>();
// 4. Load, build, and use your kernels.
// For the sake of readability, we have omitted error checking.
int status = CL_SUCCESS;
// A simple copy kernel, uses C++11 syntax for multi-line strings.
const char * kernel_name = "copy_kernel";
const char * source = R"(
void __kernel
copy_kernel(__global float * gA, __global float * gB)
{
int id = get_global_id(0);
gB[id] = gA[id];
}
)";
// Create the program, build the executable, and extract the entry point
// for the kernel.
cl_program program = clCreateProgramWithSource(af_context, 1, &source, NULL, &status);
status = clBuildProgram(program, 1, &af_device_id, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, kernel_name, &status);
// Set arguments and launch your kernels
clSetKernelArg(kernel, 0, sizeof(cl_mem), d_A);
clSetKernelArg(kernel, 1, sizeof(cl_mem), d_B);
clEnqueueNDRangeKernel(af_queue, kernel, 1, NULL, &length, NULL, 0, NULL, NULL);
// 5. Return control of af::array memory to ArrayFire
A.unlock();
B.unlock();
// ... resume ArrayFire operations
// Because the device pointers, d_A and d_B, were returned to ArrayFire's
// control by the unlock function, there is no need to free them using
// clReleaseMemObject()
return 0;
}
That works well, since the final values of B coincide with those of A, i.e.
af_print(B); matches A. But when I write the functions separately, as follows:
Separate functions
arraycopy function
void arraycopy(af::array A, af::array B,size_t length) {
// 2. Obtain the device, context, and queue used by ArrayFire
static cl_context af_context = afcl::getContext();
static cl_device_id af_device_id = afcl::getDeviceId();
static cl_command_queue af_queue = afcl::getQueue();
// 3. Obtain cl_mem references to af::array objects
cl_mem * d_A = A.device<cl_mem>();
cl_mem * d_B = B.device<cl_mem>();
// 4. Load, build, and use your kernels.
// For the sake of readability, we have omitted error checking.
int status = CL_SUCCESS;
// A simple copy kernel, uses C++11 syntax for multi-line strings.
const char * kernel_name = "copy_kernel";
const char * source = R"(
void __kernel
copy_kernel(__global float * gA, __global float * gB)
{
int id = get_global_id(0);
gB[id] = gA[id];
}
)";
// Create the program, build the executable, and extract the entry point
// for the kernel.
cl_program program = clCreateProgramWithSource(af_context, 1, &source, NULL, &status);
status = clBuildProgram(program, 1, &af_device_id, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, kernel_name, &status);
// Set arguments and launch your kernels
clSetKernelArg(kernel, 0, sizeof(cl_mem), d_A);
clSetKernelArg(kernel, 1, sizeof(cl_mem), d_B);
clEnqueueNDRangeKernel(af_queue, kernel, 1, NULL, &length, NULL, 0, NULL, NULL);
// 5. Return control of af::array memory to ArrayFire
A.unlock();
B.unlock();
// ... resume ArrayFire operations
// Because the device pointers, d_A and d_B, were returned to ArrayFire's
// control by the unlock function, there is no need to free them using
// clReleaseMemObject()
}
main function
int main()
{
size_t length = 10;
af::array A = af::randu(length, f32);
af::array B = af::constant(0, length, f32);
arraycopy(A, B, length);
af_print(B);//does not match A
}
The final values of B have not changed. Why is this happening, and what should I do to make it work? Thanks in advance.
You pass af::array into arraycopy by value, not by reference, hence A and B in main remain unchanged regardless of what you do inside arraycopy. You can pass B by reference: af::array &B in the parameter list. I'd also recommend passing A by const reference as a habit, to avoid unnecessary copies (const af::array &A).
The reason behind the behavior you are seeing is reference counting. It is not a bug, though, and it falls in line with C++ language behavior.
When af::array objects are created by assignment or an equivalent operation, only the metadata is copied and the data is held through a shared pointer.
In the version of your code where it is a function, B is passed by value, so B inside arraycopy is a copy of the metadata of B from main and shares the pointer to the data of B from main. At this point, if the user makes a device call to fetch the pointer, we assume it is for writing to the memory behind that pointer. Therefore, when device is called on an array object whose shared pointer has a reference count > 1, we make a copy of the original array (B from main) and return the pointer to that new memory. If you call af_print(B) inside arraycopy, you will see the correct values. This is essentially copy-on-write: since B is passed by value, you do not see the results that arraycopy wrote to its own B.
In the very first line I said this falls in line with C++ behavior because, if the object B needs to be modified from a function, it has to be passed by reference. Passing it by value only makes the change visible inside the function, which is exactly how ArrayFire handles af::array objects.
Hope that clears the confusion.
Pradeep.
ArrayFire Dev Team.
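The copy-on-write behavior described above can be illustrated without ArrayFire. The following is only a sketch in plain C++: CowArray, fillByValue and fillByReference are hypothetical names, and this is not ArrayFire's actual implementation. It only mimics the rule "detach when a writable pointer is requested and the storage is shared".

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical reference-counted array handle (not ArrayFire code).
class CowArray {
public:
    CowArray(std::size_t n, float value)
        : data_(std::make_shared<std::vector<float>>(n, value)) {}

    // Request a writable pointer. If the storage is shared with another
    // handle, copy it first so the other handle is not affected.
    float* device() {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::vector<float>>(*data_);
        }
        return data_->data();
    }

    float at(std::size_t i) const {
        assert(i < data_->size());
        return (*data_)[i];
    }

private:
    std::shared_ptr<std::vector<float>> data_;  // copying the handle shares this
};

// By value: the parameter shares storage with the caller's object, so
// device() detaches and the writes land in a private copy.
void fillByValue(CowArray b, float v, std::size_t n) {
    float* p = b.device();
    for (std::size_t i = 0; i < n; ++i) p[i] = v;
}

// By reference: the caller's object is the only owner, so device() returns
// the original storage and the writes are visible to the caller.
void fillByReference(CowArray& b, float v, std::size_t n) {
    float* p = b.device();
    for (std::size_t i = 0; i < n; ++i) p[i] = v;
}
```

Calling fillByValue on an array leaves the caller's data untouched, while fillByReference changes it, which mirrors the difference between arraycopy(af::array B, ...) and arraycopy(af::array &B, ...).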

Supplying a pointer renders my OpenCL code incorrect

My computer has a GeForce 1080Ti. With 11GB of VRAM, I don't expect a memory issue, so I'm at a loss to explain why the following breaks my code.
I execute the kernel on the host with this code.
cl_mem buffer = clCreateBuffer(context.GetContext(), CL_MEM_READ_WRITE, n * n * sizeof(int), NULL, &error);
error = clSetKernelArg(context.GetKernel(myKernel), 1, n * n, m1);
error = clSetKernelArg(context.GetKernel(myKernel), 0, sizeof(cl_mem), &buffer);
error = clEnqueueNDRangeKernel(context.GetCommandQueue(0), context.GetKernel(myKernel), 1, NULL, 10, 10, 0, NULL, NULL);
clFinish(context.GetCommandQueue(0));
error = clEnqueueReadBuffer(context.GetCommandQueue(0), buffer, true, 0, n * n * sizeof(int), results, 0, NULL, NULL);
results is a pointer to an n-by-n int array. m1 is a pointer to an n-by-n-bit array. The variable n is divisible by 8, so we can interpret the array as a char array.
The first ten values of the array are set to 1025 by the kernel (the value isn't important):
__kernel void PopCountProduct (__global int *results)
{
results[get_global_id(0)] = 1025;
}
When I print out the result on the host, the first 10 indices are 1025. All is well and good.
Suddenly it stops working when I introduce an additional argument:
__kernel void PopCountProduct (__global int *results, __global char *m)
{
results[get_global_id(0)] = 1025;
}
Why is this happening? Am I missing something crucial about OpenCL?
You can't pass a host pointer to clSetKernelArg in OpenCL 1.2. Something similar is only possible in OpenCL 2.0+, using clSetKernelArgSVMPointer with an SVM pointer, if the device supports it. Most probably, what you need is to create a buffer object on the GPU and copy the host memory into it.

cudaMallocPitch and cudaMemcpy2D

I get an error when transferring a C++ 2D array into a CUDA 1D array.
Let me show my source code.
int main(void)
{
float h_arr[1024][256];
float *d_arr;
// --- Some codes to populate h_arr
// --- cudaMallocPitch
size_t pitch;
cudaMallocPitch((void**)&d_arr, &pitch, 256, 1024);
// --- Copy array to device
cudaMemcpy2D(d_arr, pitch, h_arr, 256, 256, 1024, cudaMemcpyHostToDevice);
}
I tried to run the code, but it pops up an error.
How do I use cudaMallocPitch() and cudaMemcpy2D() properly?
Talonmies has already satisfactorily answered this question. Here is some further explanation that could be useful to the community.
When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned.
CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so as to achieve the desired alignment. Please refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information.
Assuming that we want to allocate a 2D padded array of floating point (single precision) elements, the syntax for cudaMallocPitch is the following:
cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);
where
devPtr is an output pointer to float (float *devPtr).
devPitch is a size_t output variable denoting the length, in bytes, of the padded row.
Nrows and Ncols are size_t input variables representing the matrix size.
Recalling that C/C++ and CUDA store 2D matrices by row, cudaMallocPitch will allocate a memory space of size, in bytes, equal to Nrows * pitch. However, only the first Ncols * sizeof(float) bytes of each row will contain the matrix data. Accordingly, cudaMallocPitch consumes more memory than strictly necessary for the 2D matrix storage, but this is repaid by more efficient memory accesses.
CUDA provides also the cudaMemcpy2D function to copy data from/to host memory space to/from device memory space allocated with cudaMallocPitch. Under the above hypotheses (single precision 2D matrix), the syntax is the following:
cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice)
where
devPtr and hostPtr are input pointers to float (float *devPtr and float *hostPtr) pointing to the (destination) device and (source) host memory spaces, respectively;
devPitch and hostPitch are size_t input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
Nrows and Ncols are size_t input variables representing the matrix size.
Note that cudaMemcpy2D also allows for pitched memory allocation on the host side. If the host memory has no pitch, then hostPitch = Ncols * sizeof(float). Furthermore, cudaMemcpy2D is bidirectional. In the above example, we are copying data from host to device. If we want to copy data from device to host, then the above line changes to
cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost)
The access to elements of a 2D matrix allocated by cudaMallocPitch can be performed as in the following example:
int tidx = blockIdx.x*blockDim.x + threadIdx.x;
int tidy = blockIdx.y*blockDim.y + threadIdx.y;
if ((tidx < Ncols) && (tidy < Nrows))
{
float *row_a = (float *)((char*)devPtr + tidy * pitch);
row_a[tidx] = row_a[tidx] * tidx * tidy;
}
In such an example, tidx and tidy are used as column and row indices, respectively (remember that, in CUDA, x-threads span the columns and y-threads span the rows to favor coalescence). The pointer to the first element of a row is calculated by offsetting the initial pointer devPtr by the row length tidy * pitch in bytes (char * is a pointer to bytes and sizeof(char) is 1 byte), where the length of each row is computed by using the pitch information.
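The byte-offset arithmetic can be tried on the host without a GPU. The following is a sketch in plain C++, not CUDA: paddedPitch, rowPtr and copy2D are hypothetical helpers, and the alignment passed to paddedPitch is an assumption, since the real pitch is chosen by cudaMallocPitch. copy2D only mimics the row-by-row semantics of cudaMemcpy2D.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Round a row width in bytes up to a multiple of `alignment`.
// (Stand-in for the pitch that cudaMallocPitch would return.)
std::size_t paddedPitch(std::size_t widthBytes, std::size_t alignment) {
    assert(alignment > 0);
    return (widthBytes + alignment - 1) / alignment * alignment;
}

// Pointer to the first element of `row`: offset the base pointer by
// row * pitch BYTES, then reinterpret as float*.
float* rowPtr(float* base, std::size_t pitchBytes, std::size_t row) {
    return reinterpret_cast<float*>(reinterpret_cast<char*>(base) + row * pitchBytes);
}

// Copy `height` rows of `widthBytes` bytes each between two buffers whose
// rows are dpitch and spitch bytes apart, respectively.
void copy2D(void* dst, std::size_t dpitch, const void* src, std::size_t spitch,
            std::size_t widthBytes, std::size_t height) {
    assert(widthBytes <= dpitch && widthBytes <= spitch);
    for (std::size_t r = 0; r < height; ++r) {
        std::memcpy(static_cast<char*>(dst) + r * dpitch,
                    static_cast<const char*>(src) + r * spitch,
                    widthBytes);
    }
}
```

With a 3 x 5 float matrix and an assumed 512-byte alignment, each padded row is 512 bytes long although only the first 20 bytes carry data, and rowPtr(base, pitch, 2)[4] addresses the last matrix element.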
Below, I'm providing a fully worked example to show these concepts.
#include<stdio.h>
#include<cuda.h>
#include<cuda_runtime.h>
#include<device_launch_parameters.h>
#include<stdlib.h>
#define BLOCKSIZE_x 16
#define BLOCKSIZE_y 16
#define Nrows 3
#define Ncols 5
/*****************/
/* CUDA MEMCHECK */
/*****************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
// Prints the CUDA error string with file and line, then exits if abort is set.
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) { exit(code); }
}
}
/*******************/
/* iDivUp FUNCTION */
/*******************/
int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
/******************/
/* TEST KERNEL 2D */
/******************/
__global__ void test_kernel_2D(float *devPtr, size_t pitch)
{
int tidx = blockIdx.x*blockDim.x + threadIdx.x;
int tidy = blockIdx.y*blockDim.y + threadIdx.y;
if ((tidx < Ncols) && (tidy < Nrows))
{
float *row_a = (float *)((char*)devPtr + tidy * pitch);
row_a[tidx] = row_a[tidx] * tidx * tidy;
}
}
/********/
/* MAIN */
/********/
int main()
{
float hostPtr[Nrows][Ncols];
float *devPtr;
size_t pitch;
for (int i = 0; i < Nrows; i++)
for (int j = 0; j < Ncols; j++) {
hostPtr[i][j] = 1.f;
//printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);
}
// --- 2D pitched allocation and host->device memcopy
gpuErrchk(cudaMallocPitch(&devPtr, &pitch, Ncols * sizeof(float), Nrows));
gpuErrchk(cudaMemcpy2D(devPtr, pitch, hostPtr, Ncols*sizeof(float), Ncols*sizeof(float), Nrows, cudaMemcpyHostToDevice));
dim3 gridSize(iDivUp(Ncols, BLOCKSIZE_x), iDivUp(Nrows, BLOCKSIZE_y));
dim3 blockSize(BLOCKSIZE_x, BLOCKSIZE_y);
test_kernel_2D<<<gridSize, blockSize>>>(devPtr, pitch);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy2D(hostPtr, Ncols * sizeof(float), devPtr, pitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost));
for (int i = 0; i < Nrows; i++)
for (int j = 0; j < Ncols; j++)
printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);
return 0;
}
The cudaMallocPitch call you have written should request the row width in bytes as well (256 * sizeof(float) rather than 256), and this:
cudaMemcpy2D(d_arr, pitch, h_arr, 256, 256, 1024, cudaMemcpyHostToDevice);
is incorrect. Quoting from the documentation:
Copies a matrix (height rows of width bytes each) from the memory area
pointed to by src to the memory area pointed to by dst, where kind is
one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice,
cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the
direction of the copy. dpitch and spitch are the widths in memory in
bytes of the 2D arrays pointed to by dst and src, including any
padding added to the end of each row. The memory areas may not
overlap. width must not exceed either dpitch or spitch. Calling
cudaMemcpy2D() with dst and src pointers that do not match the
direction of the copy results in an undefined behavior. cudaMemcpy2D()
returns an error if dpitch or spitch exceeds the maximum allowed.
So the source pitch and the width to copy must be specified in bytes. Your host matrix has a pitch of sizeof(float) * 256 bytes, and because the source pitch and the width you will copy are the same, your cudaMemcpy2D call should look like:
cudaMemcpy2D(d_arr, pitch, h_arr, 256*sizeof(float),
256*sizeof(float), 1024, cudaMemcpyHostToDevice);

Understanding work-items and work-groups

Based on my previous question:
I'm still trying to copy an image (no practical reason, just to start with an easy one):
The image contains 200 * 300 == 60000 pixels.
The maximum number of work-items is 4100 according to CL_DEVICE_MAX_WORK_GROUP_SIZE.
kernel1:
std::string kernelCode =
"void kernel copy(global const int* image, global int* result)"
"{"
"result[get_local_id(0) + get_group_id(0) * get_local_size(0)] = image[get_local_id(0) + get_group_id(0) * get_local_size(0)];"
"}";
queue:
for (int offset = 0; offset < 30; ++offset)
queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000));
queue.finish();
This gives a segfault. What's wrong?
With cl::NDRange(20000) as the last parameter it doesn't, but it gives back only part of the image.
Also, I don't understand why I can't use this kernel:
kernel2:
std::string kernelCode =
"void kernel copy(global const int* image, global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
Looking at the 31st slide of this presentation:
Why can't I just simply use the global_id?
EDIT1
Platform: AMD Accelerated Parallel Processing
Device: AMD Athlon(tm) II P320 Dual-Core Processor
EDIT2
The result based on huseyin tugrul buyukisik's answer:
EDIT3
With the last parameter cl::NDRange(20000):
In both cases, the kernel is the first one.
EDIT4
std::string kernelCode =
"void kernel copy(global const int* image, global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
//...
cl_int err;
err = queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(59904), cl::NDRange(128));
if (err == 0)
qDebug() << "success";
else
{
qDebug() << err;
exit(1);
}
Prints success.
Maybe this is wrong?
int size = _originalImage.width() * _originalImage.height();
int* result = new int[size];
//...
cl::Buffer resultBuffer(context, CL_MEM_READ_WRITE, size);
//...
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, size, result);
The culprit was the buffer size. The corrected code is:
cl::Buffer imageBuffer(context, CL_MEM_USE_HOST_PTR, sizeof(int) * size, _originalImage.bits());
cl::Buffer resultBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * size);
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(int) * size, result);
I used size instead of sizeof(int) * size.
Edit 2:
Please try without the const qualifier (it may not be compatible with your CPU):
std::string kernelCode =
"__kernel void copy(__global int* image, __global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
You may also need to change the buffer options.
Edit:
You have omitted the '__' prefix before the 'global' and 'kernel' specifiers, so please try:
std::string kernelCode =
"__kernel void copy(__global const int* image, __global int* result)"
"{"
"result[get_global_id(0)] = image[get_global_id(0)];"
"}";
There are 60000 elements in total, but each launch covers the range from offset to offset+60000, which runs past the end of the buffers and reads/writes memory that doesn't belong to them.
The usual NDRange usage with the OpenCL 1.2 C++ bindings is:
cl_int err;
err=cq.enqueueNDRangeKernel(kernelFunction,referenceRange,globalRange,localRange);
Then check err for the actual error code. 0 means success.
If you want to divide the work into smaller parts, you should cap the range of each part at 60000/N.
If you divide it into 30 parts, then:
for (int offset = 0; offset < 30; ++offset)
queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000/30));
queue.finish();
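The offset and size arithmetic of such a split can be checked on the host before enqueueing anything. The following is a sketch in plain C++ with no OpenCL calls; splitRange is a hypothetical helper, and each returned pair would be used as the (offset, global size) arguments of one enqueueNDRangeKernel call.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split `total` work-items into `parts` consecutive (offset, size) ranges
// that cover every index exactly once.
std::vector<std::pair<std::size_t, std::size_t>> splitRange(std::size_t total,
                                                            std::size_t parts) {
    assert(parts > 0);
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    const std::size_t base = total / parts;
    const std::size_t extra = total % parts;  // first `extra` chunks get one more item
    std::size_t offset = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t size = base + (i < extra ? 1 : 0);
        chunks.emplace_back(offset, size);
        offset += size;
    }
    return chunks;
}
```

For 60000 elements and 30 parts this yields offsets 0, 2000, ..., 58000 with a size of 2000 each, so the last chunk ends exactly at 60000 and never runs past the buffer.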
Also double-check the size in bytes of each buffer, e.g. sizeof(cl_int)*arrElementNumber, because the size of a host integer may not be the same as that of a device integer. You need 60000 elements? Then you need to pass 240000 bytes as the size when creating the buffer.
For compatibility, you should check the size of an integer before creating buffers if you intend to run this code on another machine.
You may know this already, but I'm going to tell you anyway:
CL_DEVICE_MAX_WORK_GROUP_SIZE
is the number of threads that can share local/shared memory in a compute unit. You don't need to divide your work just because of this. OpenCL does this automatically: it gives each thread a unique global id across the whole range and a unique local id within its compute unit. If CL_DEVICE_MAX_WORK_GROUP_SIZE is 4100, then it can create up to 4100 threads that share the same variables in a compute unit. You can compute all 60000 elements in a single sweep, with just one addition: multiple work-groups are created for this, and each group has a group id.
// this should work without a problem
queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(60000));
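If a local size is passed explicitly, as in the question's EDIT4 with cl::NDRange(128), OpenCL 1.x requires the global size to be a multiple of it. 60000 is not a multiple of 128, and truncating to 59904 leaves the last 96 elements unprocessed. A common alternative is to round the global size up and let the kernel skip ids beyond the element count. The following is a sketch of that arithmetic in plain C++; both helper names are hypothetical, and the kernel-side guard (comparing get_global_id(0) against the element count passed as an extra argument) is an assumption, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>

// Smallest multiple of `local` that is >= n; usable as the global size
// when a local size is given explicitly.
std::size_t roundUpToMultiple(std::size_t n, std::size_t local) {
    assert(local > 0);
    return (n + local - 1) / local * local;
}

// Number of work-groups launched for n elements with the given local size.
std::size_t workGroupCount(std::size_t n, std::size_t local) {
    return roundUpToMultiple(n, local) / local;
}
```

For 60000 elements and a local size of 128 this gives a global size of 60032 in 469 work-groups; the 32 surplus work-items are the ones the guard in the kernel has to skip.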
If you have an AMD GPU or CPU and you are using MSVC, you can install CodeXL from AMD's site and choose system info from the drop-down menu to look at the relevant numbers.
Which device is yours? I couldn't find any device with a max work group size of 4100! My CPU has 1024, my GPU has 256. Is it a Xeon Phi?
For example, the total number of work-items can be as large as 256*256 times the work group size here.
CodeXL also has other nice features, such as performance profiling and code tracing, if you need maximum performance and help with bug fixing.