I am relatively new to OpenCL. I am using the OpenCL 1.2 C++ wrapper. Say I have the following problem: I have three integer values a, b, and c all declared on the host
int a = 1;
int b = 2;
int c = 3;
int help;
int d;
where d is my result and help is a helper variable.
I want to calculate d = (a + b)*c. To do this, I now have two kernels called 'add' and 'multiply'.
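(For context, a minimal sketch of what such kernels could look like; the actual kernel code is not shown here, so single-element int buffers are assumed.)
__kernel void add(__global const int* a, __global const int* b, __global int* help)
{
    help[0] = a[0] + b[0];
}
__kernel void multiply(__global const int* c, __global const int* help, __global int* d)
{
    d[0] = c[0] * help[0];
}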
Currently, I am doing it the following way (please don't be confused by my pointer-oriented style of programming). First, I create my buffers:
bufferA = new cl::Buffer(*context, CL_MEM_READ_ONLY, buffer_length);
bufferB = new cl::Buffer(*context, CL_MEM_READ_ONLY, buffer_length);
bufferC = new cl::Buffer(*context, CL_MEM_READ_ONLY, buffer_length);
bufferHelp = new cl::Buffer(*context, CL_MEM_READ_WRITE, buffer_length);
bufferD = new cl::Buffer(*context, CL_MEM_WRITE_ONLY, buffer_length);
Then, I set my kernel arguments for the addition kernel
add->setArg(0, *bufferA);
add->setArg(1, *bufferB);
add->setArg(2, *bufferHelp);
and for the multiplication kernel
multiply->setArg(0, *bufferC);
multiply->setArg(1, *bufferHelp);
multiply->setArg(2, *bufferD);
Then I enqueue my data for the addition
queueAdd->enqueueWriteBuffer(*bufferA, CL_TRUE, 0, datasize, &a);
queueAdd->enqueueWriteBuffer(*bufferB, CL_TRUE, 0, datasize, &b);
queueAdd->enqueueNDRangeKernel(*add, cl::NullRange, global[0], local[0]);
queueAdd->enqueueReadBuffer(*bufferHelp, CL_TRUE, 0, datasize, &help);
and for the multiplication
queueMult->enqueueWriteBuffer(*bufferC, CL_TRUE, 0, datasize, &c);
queueMult->enqueueWriteBuffer(*bufferHelp, CL_TRUE, 0, datasize, &help);
queueMult->enqueueNDRangeKernel(*multiply, cl::NullRange, global[0], local[0]);
queueMult->enqueueReadBuffer(*bufferD, CL_TRUE, 0, datasize, &d);
This works fine. However, I do not want to copy the value of help back to the host and then back to the device again. To avoid this, I thought of three possibilities:
1) A global variable for help on the device side. That way, both kernels could access the value of help at any time.
2) Having the add kernel call the multiply kernel at runtime. We would then pass the value of c into the add kernel and hand both help and c over to the multiply kernel as soon as the addition has finished.
3) Simply passing the value of help over to the multiply kernel. What I am looking for here is something like the pipe objects available in OpenCL 2.0. Does anybody know of something similar for OpenCL 1.2?
I would be very thankful if somebody could propose the smoothest way to solve my problem!
Thanks in advance!
There is no need to read and rewrite bufferHelp; just leave it in device memory. Option 1) of your proposed solutions is how cl::Buffer objects already behave: they are globals in device memory.
This is equivalent to your code and will produce the same results:
queueAdd->enqueueWriteBuffer(*bufferA, CL_FALSE, 0, datasize, &a);
queueAdd->enqueueWriteBuffer(*bufferB, CL_FALSE, 0, datasize, &b);
queueAdd->enqueueNDRangeKernel(*add, cl::NullRange, global[0], local[0]);
//queueAdd->enqueueReadBuffer(*bufferHelp, CL_FALSE, 0, datasize, &help);
queueMult->enqueueWriteBuffer(*bufferC, CL_FALSE, 0, datasize, &c);
//queueMult->enqueueWriteBuffer(*bufferHelp, CL_FALSE, 0, datasize, &help);
queueMult->enqueueNDRangeKernel(*multiply, cl::NullRange, global[0], local[0]);
queueMult->enqueueReadBuffer(*bufferD, CL_TRUE, 0, datasize, &d);
NOTE: I also changed the blocking write calls to non-blocking ones (CL_FALSE). This gives much better speed, because the copy of buffer C and the execution of the "add" kernel can be overlapped.
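One caveat (a sketch, not part of the original code): since queueAdd and queueMult are two different queues, the "multiply" kernel is not automatically ordered after the "add" kernel. An event can enforce that ordering while still keeping help on the device:
cl::Event addDone;
queueAdd->enqueueNDRangeKernel(*add, cl::NullRange, global[0], local[0], NULL, &addDone);
// Make the multiply queue wait until the add kernel has finished writing bufferHelp.
std::vector<cl::Event> waitForAdd(1, addDone);
queueMult->enqueueNDRangeKernel(*multiply, cl::NullRange, global[0], local[0], &waitForAdd, NULL);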
I have an array of uint8_t. The size of the array is about 2,000,000. I need to do some calculations on these values, but after I call the kernel and copy the modified values back, it returns only zeros.
I'm creating the arrays; rows and columns are int:
uint8_t arrayIn[rows * columns];
uint8_t arrayOut[rows * columns];
I'm creating the cl_mem objects and copying the array data into them:
arrayInMem = clCreateBuffer(context, CL_MEM_READ_ONLY, rows * columns * sizeof(uint8_t), NULL, &err);
arrayOutMem = clCreateBuffer(context, CL_MEM_WRITE_ONLY, rows * columns * sizeof(uint8_t), NULL, &err);
err = clEnqueueWriteBuffer(img_cmd_queue, arrayInMem, CL_TRUE, 0, rows * columns * sizeof(uint8_t), arrayIn, 0, NULL, NULL);
I set the kernel arguments like this:
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&arrayInMem);
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&arrayOutMem);
Then I read the modified array back to the host:
err = clEnqueueReadBuffer(img_cmd_queue, arrayOutMem, CL_TRUE, 0, MEM_SIZE * sizeof(uint8_t), arrayOut, 0, NULL, NULL);
The kernel signature looks like this:
__kernel void calculate(__global uchar * arrayInKernel, __global uchar * arrayOutKernel){
//do some calculation like this eg.
//int gid = get_global_id(0);
//arrayOutKernel[gid] = 2 * arrayInKernel[gid];
}
Could somebody help me figure out what I'm missing?
Your code is fine, assuming MEM_SIZE = rows * columns. The argument order in clEnqueueReadBuffer is also correct.
I could imagine that you forgot to call clFinish(img_cmd_queue) after clEnqueueWriteBuffer, clEnqueueNDRangeKernel, and clEnqueueReadBuffer, and before you check the results in arrayOut. All of these commands end up in a queue, and without clFinish the queue may only be executed after you have already checked the results.
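A minimal sketch of that ordering, reusing the names from the question (the NDRange call and globalSize are assumptions, since the kernel launch is not shown in the post):
size_t globalSize = rows * columns; /* assumed: 1-D launch over all elements */
err = clEnqueueWriteBuffer(img_cmd_queue, arrayInMem, CL_TRUE, 0, rows * columns * sizeof(uint8_t), arrayIn, 0, NULL, NULL);
err = clEnqueueNDRangeKernel(img_cmd_queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
err = clEnqueueReadBuffer(img_cmd_queue, arrayOutMem, CL_TRUE, 0, rows * columns * sizeof(uint8_t), arrayOut, 0, NULL, NULL);
clFinish(img_cmd_queue); /* make sure everything has completed before checking arrayOut */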
I'm working in Julia and I need to call some custom C functions that use the ArrayFire library. When I use code like:
void copy(const af::array &A, af::array &B,size_t length) {
// 2.Obtain the device, context, and queue used by ArrayFire
// 3.Obtain cl_mem references to af::array objects
cl_mem * d_A = A.device<cl_mem>();
cl_mem * d_B = B.device<cl_mem>();
// 4. Load, build, and use your kernels.
// Set arguments and launch your kernels
//kernel is the function build in step 4
clSetKernelArg(kernel, 0, sizeof(cl_mem), d_A);
clSetKernelArg(kernel, 1, sizeof(cl_mem), d_B);
clEnqueueNDRangeKernel(af_queue, kernel, 1, NULL, &length, NULL, 0, NULL, NULL);
// 5. Return control of af::array memory to ArrayFire
A.unlock();
B.unlock();
}
I used as a reference the example provided in: Interoperability with OpenCL.
I call this function in Julia as follows:
ccall((:copy,"path/to/dll"),Cvoid,(Ref{af_array},Ref{af_array}),Af.arr,Bf.arr)
Af and Bf are ArrayFire arrays. The call works as expected; the problem is when I use B = A directly, just as a test, i.e.
void copy(const af::array &A, af::array &B,size_t length) {
B=A;//only to test
}
the call stops working in Julia. This made me doubt whether I'm writing and calling these functions the correct way.
Some of the ArrayFire functions incorporated in Julia that I looked at call functions that take af_array arguments, which are different from af::array arguments. So I want to change the arguments, and I do this:
void copy(const af_array &dA, af_array &dB,size_t length) {
//this to be able to use A.device and B.device
array A=array(dA);
array B=array(dB);
//steps 2 to 5 in the original code
}
It doesn't work in C or in Julia. The question is: if I want to use af_array arguments, how do I get the device pointer? Or what is the correct way to write these functions so that I avoid problems when I call them from Julia?
Thanks in advance.
UPDATE
I changed B = A; inside the function to:
void copy(const af::array &A, af::array &B,size_t length) {
size_t len = A.dims(0);
seq idx(0, len - 1, 1);
af::copy(B, A, idx);
}
And it works! However, I still doubt that this is the correct way, since this code is very simple. I will be working with more complex code that may stop working in a similar way.
This is not a definitive answer, but I think it significantly improves on the above. The af_get_device_ptr function is the way to get the device pointer from an af_array object, and the correct way to write functions that can be called from Julia seems to be with af_array arguments (see: calling custom C ArrayFire functions in Julia #229), since the functions integrated in ArrayFire.jl do it this way. Here is a simple and complete example of how to write the function and call it from Julia.
In C:
//function for adding ArrayFire arrays
void AFire::sumaaf(af_array* out, af_array dA, af_array dB) {
//to store the result
af_array dC;
af_copy_array(&dC, dA);
// 2. Obtain the device, context, and queue used by ArrayFire
static cl_context af_context = afcl::getContext();
static cl_device_id af_device_id = afcl::getDeviceId();
static cl_command_queue af_queue = afcl::getQueue();
dim_t _order[4];
af_get_dims(&_order[0], &_order[1], &_order[2], &_order[3], dA);
size_t order = _order[0];
cl_int status = CL_SUCCESS;
// 3. Obtain cl_mem references to af_array objects
// af_get_device_ptr hands back the cl_mem that ArrayFire already owns,
// so no extra clCreateBuffer calls are needed here.
cl_mem d_A = NULL;
af_get_device_ptr((void**)&d_A, dA);
cl_mem d_B = NULL;
af_get_device_ptr((void**)&d_B, dB);
cl_mem d_C = NULL;
af_get_device_ptr((void**)&d_C, dC);
// 4. Load, build, and use your kernels.
// For the sake of readability, we have omitted error checking.
// A simple sum kernel, uses C++11 syntax for multi-line strings.
const char * kernel_name = "sum_kernel";
const char * source = R"(
void __kernel
sum_kernel(__global float * gC, __global float * gA, __global float * gB)
{
int id = get_global_id(0);
gC[id] = gA[id]+gB[id];
}
)";
// Create the program, build the executable, and extract the entry point
// for the kernel.
cl_program program = clCreateProgramWithSource(af_context, 1, &source, NULL, &status);
status = clBuildProgram(program, 1, &af_device_id, NULL, NULL, NULL);
cl_kernel sumkernel = clCreateKernel(program, kernel_name, &status);
// Set arguments and launch your kernels
clSetKernelArg(sumkernel, 0, sizeof(cl_mem), &d_C);
clSetKernelArg(sumkernel, 1, sizeof(cl_mem), &d_A);
clSetKernelArg(sumkernel, 2, sizeof(cl_mem), &d_B);
clEnqueueNDRangeKernel(af_queue, sumkernel, 1, NULL, &order, NULL, 0, NULL, NULL);
// 5. Return control of af::array memory to ArrayFire
af_unlock_array(dA);
af_unlock_array(dB);
af_unlock_array(dC);
//copy results to output argument
af_copy_array(out, dC);
// ... resume ArrayFire operations
// Because the device pointers d_A, d_B, and d_C were returned to ArrayFire's
// control by af_unlock_array, there is no need to free them using
// clReleaseMemObject()
}
In Julia, the call would be:
function sumaaf(A::AFArray{Float32,1}, B::AFArray{Float32,1})
    out = ArrayFire.RefValue{af_array}(0)
    ccall((:sumaaf, "path/to/dll"),
          Cvoid, (Ptr{af_array}, af_array, af_array), out, A.arr, B.arr)
    AFArray{Float32,1}(out[])
end
I'm trying to release allocated memory on the GPU using OpenCL.
int arraySize = 130000000;
cl_int* A = new cl_int[arraySize];
cl::Buffer gpuA(context, CL_MEM_READ_ONLY, sizeof(cl_int) * arraySize);
gpuA.setDestructorCallback(&notNeed);
cl::Event event;
queue.enqueueWriteBuffer(gpuA, CL_TRUE, 0, sizeof(cl_int) * arraySize, A, NULL, &event);
event.setCallback(CL_COMPLETE, &whenWritten);
event.wait();
After a few seconds the whenWritten callback runs (printing the text "Complete").
The program's memory usage increases, and in Task Manager on Windows 10, under GPU (Dedicated memory usage), I can see the memory level in the chart increasing too.
Very good ;-)
Then I sleep for 10 s, and now I would like to free the memory on the GPU.
For the local variable A I use
delete[] A; // host memory decreases
but when I use
clReleaseMemObject(gpuA());
I don't see any change in GPU memory.
What am I doing wrong? What is the best solution for this?
Thanks for any advice.
OK, so I replaced the buffer creation and release with the C API instead of the C++ wrapper:
cl_mem gpuA2 = clCreateBuffer(context(), CL_MEM_READ_ONLY, sizeof(cl_int) * arraySize, NULL, NULL);
clEnqueueWriteBuffer(queue(), gpuA2, CL_TRUE, 0, sizeof(cl_int) * arraySize, A, 0, NULL, NULL);
queue.finish();
clReleaseMemObject(gpuA2);
delete[] A;
and after running it I don't see any difference.
My computer has a GeForce 1080Ti. With 11GB of VRAM, I don't expect a memory issue, so I'm at a loss to explain why the following breaks my code.
I execute the kernel on the host with this code.
cl_mem buffer = clCreateBuffer(context.GetContext(), CL_MEM_READ_WRITE, n * n * sizeof(int), NULL, &error);
error = clSetKernelArg(context.GetKernel(myKernel), 1, n * n, m1);
error = clSetKernelArg(context.GetKernel(myKernel), 0, sizeof(cl_mem), &buffer);
error = clEnqueueNDRangeKernel(context.GetCommandQueue(0), context.GetKernel(myKernel), 1, NULL, 10, 10, 0, NULL, NULL);
clFinish(context.GetCommandQueue(0));
error = clEnqueueReadBuffer(context.GetCommandQueue(0), buffer, true, 0, n * n * sizeof(int), results, 0, NULL, NULL);
results is a pointer to an n-by-n int array. m1 is a pointer to an n-by-n-bit array. The variable n is divisible by 8, so we can interpret the array as a char array.
The first ten values of the array are set to 1025 by the kernel (the value isn't important):
__kernel void PopCountProduct (__global int *results)
{
results[get_global_id(0)] = 1025;
}
When I print out the result on the host, the first 10 indices are 1025. All is well and good.
Suddenly it stops working when I introduce an additional argument:
__kernel void PopCountProduct (__global int *results, __global char *m)
{
results[get_global_id(0)] = 1025;
}
Why is this happening? Am I missing something crucial about OpenCL?
You can't pass a host pointer to clSetKernelArg in OpenCL 1.2. Something similar is only possible in OpenCL 2.0+ via clSetKernelArgSVMPointer with an SVM pointer, if supported. But most probably what you need is to create a buffer object on the GPU and copy the host memory into it.
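A sketch of that last suggestion, reusing the names from the question (the n * n / 8 byte size is an assumption based on m1 being an n-by-n bit array packed into chars):
// Copy m1 into a device buffer and pass the buffer handle instead of the host pointer.
cl_mem mBuffer = clCreateBuffer(context.GetContext(), CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, n * n / 8, m1, &error);
error = clSetKernelArg(context.GetKernel(myKernel), 1, sizeof(cl_mem), &mBuffer);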
I'm trying to create 2 local arrays for a kernel to use. My goal is to copy a global input buffer into the first array (arr1), and instantiate the second array (arr2) so its elements can be accessed and set later.
My kernel looks like this:
__kernel void do_things (__global uchar* in, __global uchar* out,
uint numIterations, __local uchar* arr1, __local uchar* arr2)
{
size_t work_size = get_global_size(0) * get_global_size(1);
event_t event;
async_work_group_copy(arr1, in, work_size, event);
wait_group_events(1, &event);
int cIndex = (get_global_id(0) * get_global_size(1)) + get_global_id(1);
arr2[cIndex] = 0;
//Do other stuff later
}
In the C++ code I'm calling this from, I set the kernel arguments like this:
//Create input and output buffers
cl_mem inputBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, myInputVector.size(), (void*)
myInputVector.data(), NULL);
cl_mem outputBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
myInputVector.size(), NULL, NULL);
//Set kernel arguments.
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&outputBuffer);
clSetKernelArg(kernel, 2, sizeof(cl_uint), &iterations);
clSetKernelArg(kernel, 3, sizeof(inputBuffer), NULL);
clSetKernelArg(kernel, 4, sizeof(inputBuffer), NULL);
Where myInputVector is a vector full of uchars.
Then, I enqueue it with a 2D work size, rows * cols big. myInputVector has a size of rows * cols.
//Execute the kernel
size_t global_work_size[2] = { rows, cols }; //2d work size
status = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL,
global_work_size, NULL, 0, NULL, NULL);
The problem is, I'm getting crashes when I run the kernel. Specifically, this line in the kernel:
arr2[cIndex] = 0;
is responsible for the crash (omitting it makes the crash go away). The error reads:
*** glibc detected *** ./MyProgram: free(): invalid pointer: 0x0000000001a28fb0 ***
All I want is to be able to access arr2 alongside arr1; arr2 should be the same size as arr1. If that's the case, why am I getting this bizarre error? Why is this an invalid pointer?
The issue is that you are allocating only sizeof(cl_mem) for your local buffers, and cl_mem is simply a typedef of some pointer type (therefore 4 to 8 bytes, depending on your system).
What then happens in your kernel is that you access beyond the size of the local buffer you allocated, and the GPU raises a memory fault.
clSetKernelArg(kernel, 3, myInputVector.size(), NULL);
clSetKernelArg(kernel, 4, myInputVector.size(), NULL);
This should fix your problem. Also note that the size you provide is the size in bytes, so you need to multiply by the sizeof of the vector's element type (which is not clear from the code).
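For example (a sketch; cl_uchar is assumed here because the kernel declares __local uchar* and myInputVector holds uchars):
size_t localBytes = myInputVector.size() * sizeof(cl_uchar); // size in bytes, not element count
clSetKernelArg(kernel, 3, localBytes, NULL);
clSetKernelArg(kernel, 4, localBytes, NULL);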