OpenCL, manage device buffer pointer from host? - c++

I'm trying to port code previously written in CUDA to OpenCL, to run on an Altera FPGA. I'm having a problem reading back data that is supposed to be in the buffer. I use the same structure as the CUDA version; the only difference is that cudaMalloc can allocate memory for any pointer type, while for clCreateBuffer I have to use cl_mem. My code looks like this:
cl_mem d_buffer=clCreateBuffer(...);
//CUDA version:
//float* d_buffer;
//cudaMalloc((void **)&d_buffer, MemSz);
clEnqueueWriteBuffer(queue, d_buffer, ..., h_data, );
//cudaMemcpy(d_buffer, h_Data, MemSz, cudaMemcpyHostToDevice);
#define d_buffer(index1, index2, index3) &d_buffer + index1/index2*index3
//#define d_buffer(index1, index2, index3) d_buffer + index1/index2*index3
cl_mem* d_data=d_buffer(1,2,3);
clEnqueueReadBuffer(queue, *d_data,...)// Error reading d_data
I tried clEnqueueMapBuffer and CL_MEM_ALLOC_HOST_PTR for the clCreateBuffer; neither works either.

cl_mem is an opaque object. You should not perform pointer arithmetic on it; attempting to do so will result in very nasty bugs.
I'm not familiar with how CUDA handles buffer allocation, but the implication of your commented-out code is that CUDA buffers are always host-visible. That is emphatically not the case in OpenCL. OpenCL allows you to "map" a buffer into host-visible memory, but it is not implicitly visible to the host. If you intend to read an arbitrary index of the buffer, you need to either map it first or copy it to host memory.
float * h_data = new float[1000];
cl_mem d_buffer=clCreateBuffer(...);
clEnqueueWriteBuffer(queue, d_buffer, true, 0, 1000 * sizeof(float), h_data, 0, nullptr, nullptr);
//======OR======
//float * d_data = static_cast<float*>(clEnqueueMapBuffer(queue, d_buffer, true, CL_MAP_WRITE, 0, 1000 * sizeof(float), 0, nullptr, nullptr, nullptr));
//std::copy(h_data, h_data + 1000, d_data);
//clEnqueueUnmapMemObject(queue, d_buffer, d_data, 0, nullptr, nullptr);
//clEnqueueBarrier(queue);
//Do work with buffer, probably in OpenCL Kernel...
float result;
size_t index = 1 / 2 * 3; //This is what you wrote in the original post (note: integer division makes this 0)
clEnqueueReadBuffer(queue, d_buffer, true, index * sizeof(float), 1 * sizeof(float), &result, 0, nullptr, nullptr);
//======OR======
//float * result_ptr = static_cast<float*>(clEnqueueMapBuffer(queue, d_buffer, true, CL_MAP_READ, index * sizeof(float), 1 * sizeof(float), 0, nullptr, nullptr, nullptr));
//result = *result_ptr;
//clEnqueueUnmapMemObject(queue, d_buffer, result_ptr, 0, nullptr, nullptr);
//clEnqueueBarrier(queue);
std::cout << "Result was " << result << std::endl;
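As an aside, if you really do want a handle to an interior region of a buffer (the closest OpenCL analogue to CUDA's d_buffer + offset), OpenCL 1.1+ provides clCreateSubBuffer. A minimal sketch, with a hypothetical region size, assuming the origin respects the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN alignment requirement:
cl_buffer_region region;
region.origin = index * sizeof(float); // byte offset into the parent buffer
region.size = 100 * sizeof(float);     // byte length of the region (hypothetical)
cl_int err;
cl_mem d_sub = clCreateSubBuffer(d_buffer, CL_MEM_READ_WRITE,
                                 CL_BUFFER_CREATE_TYPE_REGION, &region, &err);
// d_sub behaves like any other cl_mem; release it with clReleaseMemObject.
The sub-buffer shares storage with its parent, so writes through one are visible through the other.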

Related

How to copy a big array to memory and use it in OpenCL kernel?

I have an array of uint8_t. The size of the array is about 2,000,000. I need to do some calculations on these values, but after I call the kernel and copy the modified values back, it returns only zeros.
I'm creating the arrays; rows and columns are ints.
uint8_t arrayIn[rows * columns];
uint8_t arrayOut[rows * columns];
I'm creating the cl_mem objects and copying the array data into them.
arrayInMem = clCreateBuffer(context, CL_MEM_READ_ONLY, rows * columns * sizeof(uint8_t), NULL, &err);
arrayOutMem = clCreateBuffer(context, CL_MEM_WRITE_ONLY, rows * columns * sizeof(uint8_t), NULL, &err);
err = clEnqueueWriteBuffer(img_cmd_queue, arrayInMem, CL_TRUE, 0, rows * columns * sizeof(uint8_t), arrayIn, 0, NULL, NULL);
Setting the kernel args like this:
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&arrayInMem);
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&arrayOutMem);
Reading the modified array back to the host:
err = clEnqueueReadBuffer(img_cmd_queue, arrayOutMem, CL_TRUE, 0, MEM_SIZE * sizeof(uint8_t), arrayOut, 0, NULL, NULL);
The kernel signature looks like this:
__kernel void calculate(__global uchar * arrayInKernel, __global uchar * arrayOutKernel){
//do some calculation like this eg.
//int gid = get_global_id(0);
//arrayOutKernel[gid] = 2 * arrayInKernel[gid];
}
Could somebody help? What am I missing?
Your code is fine, assuming MEM_SIZE = rows * columns. The argument order in clEnqueueReadBuffer is also correct.
I could imagine that you forgot to call clFinish(img_cmd_queue); after clEnqueueWriteBuffer, clEnqueueNDRangeKernel, and clEnqueueReadBuffer, and before you check the results in arrayOut. All these commands end up in a queue, and without clFinish the queue may still be executing after you have already checked the results.
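In code, the ordering described above looks roughly like this (a sketch; the kernel-launch line is a hypothetical stand-in for your actual clEnqueueNDRangeKernel call):
err = clEnqueueWriteBuffer(img_cmd_queue, arrayInMem, CL_TRUE, 0,
    rows * columns * sizeof(uint8_t), arrayIn, 0, NULL, NULL);
err = clEnqueueNDRangeKernel(img_cmd_queue, kernel, 1, NULL,
    &global_work_size, NULL, 0, NULL, NULL); // hypothetical launch parameters
err = clEnqueueReadBuffer(img_cmd_queue, arrayOutMem, CL_TRUE, 0,
    rows * columns * sizeof(uint8_t), arrayOut, 0, NULL, NULL);
clFinish(img_cmd_queue); // drain the queue before touching arrayOut
// only now inspect arrayOut on the host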

Image grayscale in OpenCL

I want to transform an RGB image to a grayscale image.
My problem is that when I copy the data back, the kernel returns only zeros.
OpenCL code:
__kernel void grayscale(__global uchar * input, __global uchar * output)
{
int gid = get_global_id(0);
output[gid] = 0.0722 * input[gid][0] + 0.7152 * input[gid][1] + 0.2126 * input[gid][2];
}
Host code:
void RunKernel(char fileName[], char methodName[], Mat inputImg, Mat outputImg,
char outputLoc[], int mem_size){
/*
Initialisation of the device and read the kernel source.
*/
//Creating cl_mem objects for input and output. mem_size is the image width*height
imgInMem = clCreateBuffer(img_context, CL_MEM_READ_ONLY,
mem_size * sizeof(uchar), NULL, &err);
imgOutMem = clCreateBuffer(img_context, CL_MEM_WRITE_ONLY,
mem_size * sizeof(uchar), NULL, &err);
//copy the data into cl_mem input
err = clEnqueueWriteBuffer(img_cmd_queue, imgInMem, CL_TRUE, 0, mem_size *sizeof(uchar),
&inputImg.data, 0, NULL, NULL);
//Create the program and load the kernel source to it
img_program = clCreateProgramWithSource(img_context, 1, (const char **) &kernel_src_str,
(const size_t *) &kernel_size, &err);
err = clBuildProgram(img_program, 1, &dev_id, NULL, NULL, NULL);
img_kernel = clCreateKernel(img_program, methodName, &err);
//Setting the kernel args
err = clSetKernelArg(img_kernel, 0, sizeof(cl_mem), (void *) &imgInMem);
err = clSetKernelArg(img_kernel, 1, sizeof(cl_mem), (void *) &imgOutMem);
//define the global size and local size
size_t global_work_size = mem_size;
size_t local_work_size = 256;
//Enqueue a command to execute a kernel on a device ("1" indicates 1-dim work)
err = clEnqueueNDRangeKernel(img_cmd_queue, img_kernel, 1, NULL, &global_work_size,
&local_work_size, 0, NULL, NULL);
err = clFinish(img_cmd_queue);
//Read back the result from device
err = clEnqueueReadBuffer(img_cmd_queue, imgOutMem, CL_TRUE, 0,
mem_size *sizeof(uchar), outputImg.data, 0, NULL, NULL);
/*
Release the necessary objects.
*/
}
After the clEnqueueReadBuffer, if I write the values to the console, they are all zeros. My outputImg is declared like this in the main:
Mat outImg(height,width,CV_8UC1,Scalar(0));
and call the method with this:
RunKernel("kernels/grayscale.cl","grayscale", inImg, outImg,"resources/grayscale_car_gpu.jpg", MEM_SIZE);
The problem is likely the 2D array syntax you're using:
0.0722 * input[gid][0] + 0.7152 * input[gid][1] + 0.2126 * input[gid][2]
What addresses do you think that is accessing exactly?
Instead, assuming you're trying to access sequential bytes as RGB (in BGR order, judging by the coefficient values), try:
0.0722 * input[3*gid+0] + 0.7152 * input[3*gid+1] + 0.2126 * input[3*gid+2]
You should add an "f" to the float constants (otherwise they are doubles, which are not supported on all devices).
You should add rounding from float back to uchar. So, together, something like:
convert_uchar_sat_rte(0.0722f * input[3*gid+0] +
0.7152f * input[3*gid+1] +
0.2126f * input[3*gid+2])
Finally, you're passing the same size buffer for the input and output images, but seemingly treating the input buffer as RGB, which is 3x larger than a single byte of monochrome. So you'll need to fix that in the host code.
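Putting those pieces together, a corrected kernel might look like this (a sketch, assuming a tightly packed 3-bytes-per-pixel BGR input buffer):
__kernel void grayscale(__global uchar * input, __global uchar * output)
{
    int gid = get_global_id(0);
    // 3 input bytes (B, G, R) produce 1 output byte
    output[gid] = convert_uchar_sat_rte(0.0722f * input[3 * gid + 0] +
                                        0.7152f * input[3 * gid + 1] +
                                        0.2126f * input[3 * gid + 2]);
}
On the host side, the input buffer (and the write into it) would correspondingly use 3 * mem_size * sizeof(uchar).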
Any time you're getting incorrect output from a kernel, simplify it to see if it is an input problem, a calculation problem, an output problem, or a host code issue. Keep narrowing it down until you've found your problem.

Supplying a pointer renders my OpenCL code incorrect

My computer has a GeForce 1080Ti. With 11GB of VRAM, I don't expect a memory issue, so I'm at a loss to explain why the following breaks my code.
I execute the kernel on the host with this code.
cl_mem buffer = clCreateBuffer(context.GetContext(), CL_MEM_READ_WRITE, n * n * sizeof(int), NULL, &error);
error = clSetKernelArg(context.GetKernel(myKernel), 1, n * n, m1);
error = clSetKernelArg(context.GetKernel(myKernel), 0, sizeof(cl_mem), &buffer);
error = clEnqueueNDRangeKernel(context.GetCommandQueue(0), context.GetKernel(myKernel), 1, NULL, 10, 10, 0, NULL, NULL);
clFinish(context.GetCommandQueue(0));
error = clEnqueueReadBuffer(context.GetCommandQueue(0), buffer, true, 0, n * n * sizeof(int), results, 0, NULL, NULL);
results is a pointer to an n-by-n int array. m1 is a pointer to an n-by-n bit array. The variable n is divisible by 8, so we can interpret the array as a char array.
The first ten values of the array are set to 1025 by the kernel (the value isn't important):
__kernel void PopCountProduct (__global int *results)
{
results[get_global_id(0)] = 1025;
}
When I print out the result on the host, the first 10 indices are 1025. All is well and good.
Suddenly it stops working when I introduce an additional argument:
__kernel void PopCountProduct (__global int *results, __global char *m)
{
results[get_global_id(0)] = 1025;
}
Why is this happening? Am I missing something crucial about OpenCL?
You can't pass a host pointer to clSetKernelArg in OpenCL 1.2. Something similar is only possible in OpenCL 2.0+ via clSetKernelArgSVMPointer with an SVM pointer, if supported. But most probably, creating a buffer object on the GPU and copying the host memory into it is what you need.
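Concretely, something along these lines (a sketch reusing the question's names; the n * n bit array occupies n * n / 8 bytes, as the question notes):
cl_int error;
// create a device buffer and copy m1's bytes into it at creation time
cl_mem m1Buffer = clCreateBuffer(context.GetContext(),
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, n * n / 8, m1, &error);
// pass the cl_mem handle, not the host pointer
error = clSetKernelArg(context.GetKernel(myKernel), 1, sizeof(cl_mem), &m1Buffer);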

Passing variables between kernels in OpenCL 1.2 / Communication between kernels

I am relatively new to OpenCL. I am using the OpenCL 1.2 C++ wrapper. Say I have the following problem: I have three integer values a, b, and c, all declared on the host:
int a = 1;
int b = 2;
int c = 3;
int help;
int d;
with d being my result and help being a help variable.
I want to calculate d = (a + b)*c. To do this, I now have two kernels called 'add' and 'multiply'.
Currently, I am doing this the following way (please don't be confused by my pointer-oriented way of programming). First, I create my buffers:
bufferA = new cl::Buffer(*context, CL_MEM_READ_ONLY, buffer_length);
bufferB = new cl::Buffer(*context, CL_MEM_READ_ONLY, buffer_length);
bufferC = new cl::Buffer(*context, CL_MEM_READ_ONLY, buffer_length);
bufferHelp = new cl::Buffer(*context, CL_MEM_READ_WRITE, buffer_length);
bufferD = new cl::Buffer(*context, CL_MEM_WRITE_ONLY, buffer_length);
Then, I set my kernel arguments for the addition kernel
add->setArg(0, *bufferA);
add->setArg(1, *bufferB);
add->setArg(2, *bufferHelp);
and for the multiplication kernel
multiply->setArg(0, *bufferC);
multiply->setArg(1, *bufferHelp);
multiply->setArg(2, *bufferD);
Then I enqueue my data for the addition
queueAdd->enqueueWriteBuffer(*bufferA, CL_TRUE, 0, datasize, &a);
queueAdd->enqueueWriteBuffer(*bufferB, CL_TRUE, 0, datasize, &b);
queueAdd->enqueueNDRangeKernel(*add, cl::NullRange, global[0], local[0]);
queueAdd->enqueueReadBuffer(*bufferHelp, CL_TRUE, 0, datasize, &help);
and for the multiplication
queueMult->enqueueWriteBuffer(*bufferC, CL_TRUE, 0, datasize, &c);
queueMult->enqueueWriteBuffer(*bufferHelp, CL_TRUE, 0, datasize, &help);
queueMult->enqueueNDRangeKernel(*multiply, cl::NullRange, global[0], local[0]);
queueMult->enqueueReadBuffer(*bufferD, CL_TRUE, 0, datasize, &d);
This works fine. However, I do not want to copy the value of help back to the host and then back onto the device again. To achieve this, I thought of 3 possibilities:
1) A global variable for help on the device side. Both kernels could then access the value of help at any time.
2) Kernel add calling kernel multiply at runtime. We would then pass the value of c into the add kernel and hand both help and c over to the multiply kernel as soon as the addition has finished.
3) Simply passing the value of help over to the multiplication kernel. What I'm looking for here is something like the pipe object available in OpenCL 2.0. Does anybody know of something similar for OpenCL 1.2?
I would be very thankful if somebody could propose the smoothest way to solve my problem!
Thanks in advance!
There is no need to read and write bufferHelp; just leave it in device memory. Option 1) of your proposals is how cl::Buffer objects already behave: they are globals in device memory.
This is equivalent to your code and will produce the same results:
queueAdd->enqueueWriteBuffer(*bufferA, CL_FALSE, 0, datasize, &a);
queueAdd->enqueueWriteBuffer(*bufferB, CL_FALSE, 0, datasize, &b);
queueAdd->enqueueNDRangeKernel(*add, cl::NullRange, global[0], local[0]);
//queueAdd->enqueueReadBuffer(*bufferHelp, CL_FALSE, 0, datasize, &help);
queueMult->enqueueWriteBuffer(*bufferC, CL_FALSE, 0, datasize, &c);
//queueMult->enqueueWriteBuffer(*bufferHelp, CL_FALSE, 0, datasize, &help);
queueMult->enqueueNDRangeKernel(*multiply, cl::NullRange, global[0], local[0]);
queueMult->enqueueReadBuffer(*bufferD, CL_TRUE, 0, datasize, &d);
NOTE: I also changed the blocking write calls to non-blocking ones; this will provide much better speed, because the copy of buffer C and the execution of the "add" kernel can be parallelized.
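One caveat worth adding (not part of the original answer): with the writes made non-blocking and the two kernels on separate queues, nothing orders the "multiply" kernel after "add" anymore, so an explicit dependency is needed. With the C++ wrapper this can be expressed through events; a sketch:
cl::Event addDone;
queueAdd->enqueueNDRangeKernel(*add, cl::NullRange, global[0], local[0],
    nullptr, &addDone);
std::vector<cl::Event> waitList{addDone};
queueMult->enqueueNDRangeKernel(*multiply, cl::NullRange, global[0], local[0],
    &waitList, nullptr);
Alternatively, putting both kernels on a single in-order queue makes the dependency implicit.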

OpenCL instantiating local memory array: invalid pointer error in kernel

I'm trying to create 2 local arrays for a kernel to use. My goal is to copy a global input buffer into the first array (arr1), and instantiate the second array (arr2) so its elements can be accessed and set later.
My kernel looks like this:
__kernel void do_things (__global uchar* in, __global uchar* out,
uint numIterations, __local uchar* arr1, __local uchar* arr2)
{
size_t work_size = get_global_size(0) * get_global_size(1);
event_t event;
async_work_group_copy(arr1, in, work_size, event);
wait_group_events(1, &event);
int cIndex = (get_global_id(0) * get_global_size(1)) + get_global_id(1);
arr2[cIndex] = 0;
//Do other stuff later
}
In the C++ code I'm calling this from, I set the kernel arguments like this:
//Create input and output buffers
cl_mem inputBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, myInputVector.size(), (void*)
myInputVector.data(), NULL);
cl_mem outputBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
myInputVector.size(), NULL, NULL);
//Set kernel arguments.
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&outputBuffer);
clSetKernelArg(kernel, 2, sizeof(cl_uint), &iterations);
clSetKernelArg(kernel, 3, sizeof(inputBuffer), NULL);
clSetKernelArg(kernel, 4, sizeof(inputBuffer), NULL);
Where myInputVector is a vector full of uchars.
Then, I enqueue it with a 2D work size, rows * cols big. myInputVector has a size of rows * cols.
//Execute the kernel
size_t global_work_size[2] = { rows, cols }; //2d work size
status = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL,
global_work_size, NULL, 0, NULL, NULL);
The problem is, I'm getting crashes when I run the kernel. Specifically, this line in the kernel:
arr2[cIndex] = 0;
is responsible for the crash (omitting it makes it so it doesn't crash anymore). The error reads:
*** glibc detected *** ./MyProgram: free(): invalid pointer: 0x0000000001a28fb0 ***
All I want is to be able to access arr2 alongside arr1. arr2 should be the same size as arr1. If that's the case, why am I getting this bizarre error? Why is this an invalid pointer?
The issue is that you are allocating only sizeof(cl_mem) bytes for your local buffers, and cl_mem is simply a typedef for some pointer type (therefore 4 to 8 bytes, depending on your system).
What then happens in your kernel is that you access beyond the size of the local buffer you allocated, and the GPU raises a memory fault.
clSetKernelArg(kernel, 3, myInputVector.size(), NULL);
clSetKernelArg(kernel, 4, myInputVector.size(), NULL);
This should fix your problem. Also note that the size you are providing is in bytes, so you would need to multiply by the sizeof of the vector's element type (which is not clear from the code).
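A related sanity check (an addition, not from the original answer): __local memory is small, typically a few tens of kilobytes per work-group, so sizing the local arguments to the whole input may exceed the device limit. The available amount can be queried like this (device is a hypothetical cl_device_id for the target device):
cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
    sizeof(localMemSize), &localMemSize, NULL);
// make sure the combined size of arr1 and arr2 stays below localMemSize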