OpenCL release GPU memory - C++

I am trying to release memory allocated on the GPU using OpenCL:
int arraySize = 130000000;
cl_int* A = new cl_int[arraySize];
cl::Buffer gpuA(context, CL_MEM_READ_ONLY, sizeof(cl_int) * arraySize);
gpuA.setDestructorCallback(&notNeed);
cl::Event event;
queue.enqueueWriteBuffer(gpuA, CL_TRUE, 0, sizeof(cl_int) * arraySize, A, NULL, &event);
event.setCallback(CL_COMPLETE, &whenWritten);
event.wait();
After a few seconds the whenWritten callback runs (printing "Complete").
The program's memory usage increases, and in the Windows 10 Task Manager GPU chart (Dedicated memory usage) I can see the memory level rising as well.
Very good ;-)
Then I sleep for 10 s,
and now I would like to free the memory on the GPU.
For the host array A I use
delete[] A; // host memory decreases
but when I use
clReleaseMemObject(gpuA());
I don't see any change in GPU memory.
What am I doing wrong? What is the best solution for this?
Thanks for any advice.

OK, so I replaced the buffer creation and release with the C API instead of the C++ wrapper:
cl_mem gpuA2 = clCreateBuffer(context(), CL_MEM_READ_ONLY, sizeof(cl_int) * arraySize, NULL, NULL);
clEnqueueWriteBuffer(queue(), gpuA2, CL_TRUE, 0, sizeof(cl_int) * arraySize, A, 0, NULL, NULL);
queue.finish();
clReleaseMemObject(gpuA2);
delete[] A;
and after running it I still don't see any difference in GPU memory usage.
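One sanity check (a sketch, reusing the gpuA2 handle from the C-API version above; this is not from the original post) is to confirm that the release call itself succeeds and that nothing else still holds a reference to the buffer:
cl_uint refCount = 0;
clGetMemObjectInfo(gpuA2, CL_MEM_REFERENCE_COUNT, sizeof(cl_uint), &refCount, NULL); // expect 1 if no kernel or queue still holds the buffer
cl_int err = clReleaseMemObject(gpuA2); // expect CL_SUCCESS; even then, the driver may hand the physical memory back lazily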

Related

Supplying a pointer renders my OpenCL code incorrect

My computer has a GeForce 1080 Ti. With 11 GB of VRAM, I don't expect a memory issue, so I'm at a loss to explain why the following breaks my code.
I execute the kernel on the host with this code.
cl_mem buffer = clCreateBuffer(context.GetContext(), CL_MEM_READ_WRITE, n * n * sizeof(int), NULL, &error);
error = clSetKernelArg(context.GetKernel(myKernel), 1, n * n, m1);
error = clSetKernelArg(context.GetKernel(myKernel), 0, sizeof(cl_mem), &buffer);
size_t global_work_size = 10, local_work_size = 10;
error = clEnqueueNDRangeKernel(context.GetCommandQueue(0), context.GetKernel(myKernel), 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL);
clFinish(context.GetCommandQueue(0));
error = clEnqueueReadBuffer(context.GetCommandQueue(0), buffer, true, 0, n * n * sizeof(int), results, 0, NULL, NULL);
results is a pointer to an n-by-n int array. m1 is a pointer to an n-by-n-bit array. The variable n is divisible by 8, so we can interpret the array as a char array.
The first ten values of the array are set to 1025 by the kernel (the value isn't important):
__kernel void PopCountProduct (__global int *results)
{
    results[get_global_id(0)] = 1025;
}
When I print out the result on the host, the first 10 indices are 1025. All is well and good.
Suddenly it stops working when I introduce an additional argument:
__kernel void PopCountProduct (__global int *results, __global char *m)
{
    results[get_global_id(0)] = 1025;
}
Why is this happening? Am I missing something crucial about OpenCL?
You can't pass a host pointer to clSetKernelArg in OpenCL 1.2. Something similar is only possible in OpenCL 2.0+ via clSetKernelArgSVMPointer with an SVM pointer, if supported. But most probably what you need is to create a buffer object on the GPU and copy the host memory into it.
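For example (a sketch, not the original poster's code; it assumes m1 holds n*n bits, i.e. n*n/8 bytes, and reuses the question's context helpers):
// Create a device buffer for m, copy the host data into it, and pass the
// buffer (not the raw host pointer) as the kernel argument.
cl_mem mBuffer = clCreateBuffer(context.GetContext(), CL_MEM_READ_ONLY, n * n / 8, NULL, &error);
error = clEnqueueWriteBuffer(context.GetCommandQueue(0), mBuffer, CL_TRUE, 0, n * n / 8, m1, 0, NULL, NULL);
error = clSetKernelArg(context.GetKernel(myKernel), 1, sizeof(cl_mem), &mBuffer);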

OpenCL, manage device buffer pointer from host?

I'm trying to port code previously written in CUDA to OpenCL so that it runs on an Altera FPGA. I'm having problems reading back data that are supposed to be in the buffer. I use the same structure as the CUDA version; the only difference is that cudaMalloc can allocate memory for any pointer type, while for clCreateBuffer I have to use cl_mem. My code looks like this:
cl_mem d_buffer=clCreateBuffer(...);
//CUDA version:
//float* d_buffer;
//cudaMalloc((void **)&d_buffer, MemSz);
clEnqueueWriteBuffer(queue, d_buffer, ..., h_data, );
//cudaMemcpy(d_buffer, h_Data, MemSz, cudaMemcpyHostToDevice);
#define d_buffer(index1, index2, index3) &d_buffer + index1/index2*index3
//#define d_buffer(index1, index2, index3) d_buffer + index1/index2*index3
cl_mem* d_data=d_buffer(1,2,3);
clEnqueueReadBuffer(queue, *d_data,...)// Error reading d_data
I tried clEnqueueMapBuffer and also CL_MEM_ALLOC_HOST_PTR for clCreateBuffer; neither approach works either.
cl_mem is an opaque object. You should not perform pointer arithmetic on it; attempting to do so will result in very nasty bugs.
I'm not familiar with how CUDA handles buffer allocation, but the implication of your commented-out code is that CUDA buffers are always host-visible. This is very much not the case in OpenCL. OpenCL allows you to "map" a buffer into host-visible memory, but it won't be implicitly visible to the host. If you intend to read an arbitrary index of the buffer, you need to either map it first or copy it to host memory.
float * h_data = new float[1000];
cl_mem d_buffer=clCreateBuffer(...);
clEnqueueWriteBuffer(queue, d_buffer, true, 0, 1000 * sizeof(float), h_data, 0, nullptr, nullptr);
//======OR======
//float * d_data = static_cast<float*>(clEnqueueMapBuffer(queue, d_buffer, true, CL_MAP_WRITE, 0, 1000 * sizeof(float), 0, nullptr, nullptr, nullptr));
//std::copy(h_data, h_data + 1000, d_data);
//clEnqueueUnmapMemObject(queue, d_buffer, d_data, 0, nullptr, nullptr);
//clEnqueueBarrier(queue);
//Do work with buffer, probably in OpenCL Kernel...
float result;
size_t index = 1 / 2 * 3; //This is what you wrote in the original post
clEnqueueReadBuffer(queue, d_buffer, true, index * sizeof(float), 1 * sizeof(float), &result, 0, nullptr, nullptr);
//======OR======
//float * result_ptr = static_cast<float*>(clEnqueueMapBuffer(queue, d_buffer, true, CL_MAP_READ, index * sizeof(float), 1 * sizeof(float), 0, nullptr, nullptr, nullptr));
//result = *result_ptr;
//clEnqueueUnmapMemObject(queue, d_buffer, result_ptr, 0, nullptr, nullptr);
//clEnqueueBarrier(queue);
std::cout << "Result was " << result << std::endl;

clCreateBuffer() allocating on the CPU?

I'm working with a (recurrent) neural network in C++ & OpenCL to get some low-level experience with deep learning. Right now I have a simple forward-propagation kernel that's yielding oddly low performance; the setup is memory-limited, as most deep-learning setups are, and based on some crude profiling the memory bandwidth I'm getting is around 2 GB/s. A call to clGetDeviceInfo() confirms that I'm using my onboard GPU (GTX 960M); I suspect that the memory I'm allocating with clCreateBuffer() is somehow ending up on the CPU, which would lead to transfer rates hovering around 2 GB/s, as suggested by this article. The buffers I'm allocating shouldn't be too large for the GPU; the largest are 1024*1024*4 bytes = 4 MB (weights), and only 12 of those are created.
The calls to clCreateBuffer(), with some context:
NVector::NVector(int size) {
    empty = false;
    numNeurons = size;
    activationsMem = clCreateBuffer(RNN::clContext, CL_MEM_READ_WRITE, sizeof(float) * numNeurons, NULL, NULL);
    parametersMem = clCreateBuffer(RNN::clContext, CL_MEM_READ_WRITE, sizeof(float) * numNeurons, NULL, NULL);
    derivativesMem = clCreateBuffer(RNN::clContext, CL_MEM_READ_WRITE, sizeof(float) * numNeurons, NULL, NULL);
}
//...
void NVector::connect(NVector& other) {
    int numWeights = other.numNeurons * numNeurons;
    cl_mem weightMem = clCreateBuffer(RNN::clContext, CL_MEM_READ_WRITE, sizeof(float) * numWeights, NULL, NULL);
    float weightAmplitude = 0.2f;
    float* weightData = new float[numWeights];
    for (int i = 0; i < numWeights; i++) {
        weightData[i] = ((rand() % 256) / 256.0f - 0.5f) * weightAmplitude;
    }
    clEnqueueWriteBuffer(RNN::clQueue, weightMem, CL_TRUE, 0, sizeof(float) * numWeights, weightData, 0, NULL, NULL);
    delete[] weightData; // safe after the blocking write; the host copy is no longer needed
    connections.push_back(&other);
    weightsMem.push_back(weightMem);
}
What are some reasons that OpenCL might allocate memory to the CPU instead of the active device? What can I do to force memory to be allocated on the GPU?
EDIT: a simple test yielded this value for memory bandwidth, which is in accordance with the suggested 5-6 GB/s bandwidth between the CPU and GPU.
operating device name: GeForce GTX 960M
2.09715 seconds
1.00663e+10 bytes
4.8e+09 bytes / second
Press any key to continue . . .
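For reference, this is roughly the kind of test meant above (a sketch, not the original measurement code; it assumes the RNN::clContext and RNN::clQueue objects from the snippet, plus <vector>, <chrono>, and <iostream>):
// Hypothetical host-to-device bandwidth check: time one large blocking write.
const size_t bytes = 256 * 1024 * 1024; // illustrative transfer size
std::vector<float> host(bytes / sizeof(float), 1.0f);
cl_mem buf = clCreateBuffer(RNN::clContext, CL_MEM_READ_WRITE, bytes, NULL, NULL);
auto t0 = std::chrono::high_resolution_clock::now();
clEnqueueWriteBuffer(RNN::clQueue, buf, CL_TRUE, 0, bytes, host.data(), 0, NULL, NULL);
clFinish(RNN::clQueue); // redundant after a blocking write, but makes the timing intent explicit
auto t1 = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration<double>(t1 - t0).count();
std::cout << bytes / seconds << " bytes / second" << std::endl;
clReleaseMemObject(buf);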

Passing variables between kernels in OpenCL 1.2 / Communication between kernels

I am relatively new to OpenCL and am using the OpenCL 1.2 C++ wrapper. Say I have the following problem: I have three integer values a, b, and c, all declared on the host:
int a = 1;
int b = 2;
int c = 3;
int help;
int d;
with d being my result and help being a help variable.
I want to calculate d = (a + b)*c. To do this, I now have two kernels called 'add' and 'multiply'.
Currently, I am doing this the following way (please don't be confused by my pointer-oriented way of programming). First, I create my buffers:
bufferA = new cl::Buffer(*context, CL_MEM_READ_ONLY, buffer_length);
bufferB = new cl::Buffer(*context, CL_MEM_READ_ONLY, buffer_length);
bufferC = new cl::Buffer(*context, CL_MEM_READ_ONLY, buffer_length);
bufferHelp = new cl::Buffer(*context, CL_MEM_READ_WRITE, buffer_length);
bufferD = new cl::Buffer(*context, CL_MEM_WRITE_ONLY, buffer_length);
Then, I set my kernel arguments for the addition kernel
add->setArg(0, *bufferA);
add->setArg(1, *bufferB);
add->setArg(2, *bufferHelp);
and for the multiplication kernel
multiply->setArg(0, *bufferC);
multiply->setArg(1, *bufferHelp);
multiply->setArg(2, *bufferD);
Then I enqueue my data for the addition
queueAdd->enqueueWriteBuffer(*bufferA, CL_TRUE, 0, datasize, &a);
queueAdd->enqueueWriteBuffer(*bufferB, CL_TRUE, 0, datasize, &b);
queueAdd->enqueueNDRangeKernel(*add, cl::NullRange, global[0], local[0]);
queueAdd->enqueueReadBuffer(*bufferHelp, CL_TRUE, 0, datasize, &help);
and for the multiplication
queueMult->enqueueWriteBuffer(*bufferC, CL_TRUE, 0, datasize, &c);
queueMult->enqueueWriteBuffer(*bufferHelp, CL_TRUE, 0, datasize, &help);
queueMult->enqueueNDRangeKernel(*multiply, cl::NullRange, global[0], local[0]);
queueMult->enqueueReadBuffer(*bufferD, CL_TRUE, 0, datasize, &d);
This works fine. However, I do not want to copy the value of help back to the host and then back to the device again. To achieve this, I thought of three possibilities:
1) A global variable for help on the device side. Both kernels could then access the value of help at any time.
2) Kernel add calling kernel multiply at runtime. We would then pass the value of c into the add kernel and hand both help and c over to the multiply kernel as soon as the addition has finished.
3) Simply passing the value of help over to the multiplication kernel. What I am looking for here is something like the pipe objects available in OpenCL 2.0. Does anybody know of something similar for OpenCL 1.2?
I would be very thankful if somebody could propose the smoothest way to solve my problem!
Thanks in advance!
There is no need to read and write bufferHelp; just leave it in device memory. Option 1) of your proposed solutions is how cl::Buffer objects already behave: they are globals in device memory.
This is equivalent to your code and will produce the same results:
queueAdd->enqueueWriteBuffer(*bufferA, CL_FALSE, 0, datasize, &a);
queueAdd->enqueueWriteBuffer(*bufferB, CL_FALSE, 0, datasize, &b);
queueAdd->enqueueNDRangeKernel(*add, cl::NullRange, global[0], local[0]);
//queueAdd->enqueueReadBuffer(*bufferHelp, CL_FALSE, 0, datasize, &help);
queueMult->enqueueWriteBuffer(*bufferC, CL_FALSE, 0, datasize, &c);
//queueMult->enqueueWriteBuffer(*bufferHelp, CL_FALSE, 0, datasize, &help);
queueMult->enqueueNDRangeKernel(*multiply, cl::NullRange, global[0], local[0]);
queueMult->enqueueReadBuffer(*bufferD, CL_TRUE, 0, datasize, &d);
NOTE: I also changed the blocking write calls to non-blocking ones; this will give much better speed, because the copy of buffer C and the execution of the "add" kernel can be overlapped.
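A related caveat: if queueAdd and queueMult really are two different command queues, OpenCL does not order commands across queues automatically, so the multiply kernel should either be enqueued on the same queue as add or wait on an event from it. A minimal sketch, assuming the same cl.hpp wrapper objects as above:
cl::Event addDone;
queueAdd->enqueueNDRangeKernel(*add, cl::NullRange, global[0], local[0], nullptr, &addDone); // signal when "add" finishes
std::vector<cl::Event> waitForAdd = { addDone };
queueMult->enqueueNDRangeKernel(*multiply, cl::NullRange, global[0], local[0], &waitForAdd); // "multiply" waits for "add"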

OpenCL instantiating local memory array: invalid pointer error in kernel

I'm trying to create two local arrays for a kernel to use. My goal is to copy a global input buffer into the first array (arr1) and instantiate the second array (arr2) so that its elements can be accessed and set later.
My kernel looks like this:
__kernel void do_things (__global uchar* in, __global uchar* out,
                         uint numIterations, __local uchar* arr1, __local uchar* arr2)
{
    size_t work_size = get_global_size(0) * get_global_size(1);
    event_t event;
    async_work_group_copy(arr1, in, work_size, event);
    wait_group_events(1, &event);
    int cIndex = (get_global_id(0) * get_global_size(1)) + get_global_id(1);
    arr2[cIndex] = 0;
    //Do other stuff later
}
In the C++ code I'm calling this from, I set the kernel arguments like this:
//Create input and output buffers
cl_mem inputBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY |
    CL_MEM_COPY_HOST_PTR, myInputVector.size(), (void*)
    myInputVector.data(), NULL);
cl_mem outputBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
    myInputVector.size(), NULL, NULL);
//Set kernel arguments.
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&inputBuffer);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&outputBuffer);
clSetKernelArg(kernel, 2, sizeof(cl_uint), &iterations);
clSetKernelArg(kernel, 3, sizeof(inputBuffer), NULL);
clSetKernelArg(kernel, 4, sizeof(inputBuffer), NULL);
Where myInputVector is a vector full of uchars.
Then, I enqueue it with a 2D work size, rows * cols big. myInputVector has a size of rows * cols.
//Execute the kernel
size_t global_work_size[2] = { rows, cols }; //2d work size
status = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL,
    global_work_size, NULL, 0, NULL, NULL);
The problem is, I'm getting crashes when I run the kernel. Specifically, this line in the kernel:
arr2[cIndex] = 0;
is responsible for the crash (omitting it stops the crash). The error reads:
*** glibc detected *** ./MyProgram: free(): invalid pointer: 0x0000000001a28fb0 ***
All I want is to be able to access arr2 alongside arr1, and arr2 should be the same size as arr1. If that's the case, why am I getting this bizarre error? Why is this an invalid pointer?
The issue is that you are allocating only sizeof(cl_mem) bytes for your local buffers, and cl_mem is simply a typedef for some kind of pointer type (therefore 4 to 8 bytes, depending on your system).
What then happens in your kernel is that you access beyond the size of the local buffer you allocated, and the GPU raises a memory fault.
clSetKernelArg(kernel, 3, myInputVector.size(), NULL);
clSetKernelArg(kernel, 4, myInputVector.size(), NULL);
This should fix your problem. Also note that the size you provide is in bytes, so you would need to multiply by the sizeof of the vector's element type (which is not clear from the code).
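Concretely, since myInputVector holds uchars, the local allocation for both arrays would look something like this (a sketch, using the names from the question):
size_t localBytes = myInputVector.size() * sizeof(cl_uchar); // element count * element size (1 byte here)
clSetKernelArg(kernel, 3, localBytes, NULL); // arr1
clSetKernelArg(kernel, 4, localBytes, NULL); // arr2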