CL_MEM_ALLOC_HOST_PTR slower than CL_MEM_USE_HOST_PTR - c++

So I've been playing around with OpenCL for a bit now, testing the speed of memory transfers between host and device.
I was using the Intel OpenCL SDK, running on an Intel i5 processor with integrated graphics.
I then discovered clEnqueueMapBuffer, which turned out to be almost 10 times faster than clEnqueueWriteBuffer when used with pinned memory, like so:
int amt = 16*1024*1024;
...
k_a = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(int)*amt, a, NULL);
k_b = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(int)*amt, b, NULL);
k_c = clCreateBuffer(context,CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, sizeof(int)*amt, ret, NULL);
int* map_a = (int*) clEnqueueMapBuffer(c_q, k_a, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
int* map_b = (int*) clEnqueueMapBuffer(c_q, k_b, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
int* map_c = (int*) clEnqueueMapBuffer(c_q, k_c, CL_TRUE, CL_MAP_WRITE, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
clFinish(c_q);
where a, b, and ret are 128-bit-aligned int arrays.
The time came out to about 22.026186 ms, compared to 198.604528 ms using clEnqueueWriteBuffer.
However, when I changed my code to
k_a = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, NULL);
k_b = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, NULL);
k_c = clCreateBuffer(context,CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, NULL);
int* map_a = (int*)clEnqueueMapBuffer(c_q, k_a, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
int* map_b = (int*)clEnqueueMapBuffer(c_q, k_b, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
int* map_c = (int*)clEnqueueMapBuffer(c_q, k_c, CL_TRUE, CL_MAP_WRITE, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
/* initialize map_a and map_b */
the time increases to 91.350065 ms.
What could be the problem? Or is it a problem at all?
EDIT:
This is how I initialize the arrays in the second code:
for (int i = 0; i < amt; i++)
{
    map_a[i] = i;
    map_b[i] = i;
}
And now that I check, map_a and map_b do contain the right elements at the end of the program, but map_c contains all 0's. I did this:
clEnqueueUnmapMemObject(c_q, k_a, map_a, 0, NULL, NULL);
clEnqueueUnmapMemObject(c_q, k_b, map_b, 0, NULL, NULL);
clEnqueueUnmapMemObject(c_q, k_c, map_c, 0, NULL, NULL);
and my kernel is just
__kernel void test(__global int* a, __global int* b, __global int* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

My understanding is that CL_MEM_ALLOC_HOST_PTR allocates but doesn't copy. Does the 2nd block of code actually get any data onto the device?
Also, clCreateBuffer used with CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR shouldn't require clEnqueueWriteBuffer, since the buffer is created with (or initialized from) the memory pointed to by void *host_ptr.
Using "pinned" memory in OpenCL should be a process like:
int amt = 16*1024*1024;
int *Array = new int[amt];
int Error = 0;
// Note: since we are passing NULL for the data pointer, we HAVE to use CL_MEM_ALLOC_HOST_PTR.
// This allocates the memory on the device.
cl_mem B1 = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, &Error);
// Map the device memory into host memory, aka pinning it
int *host_ptr = (int*)clEnqueueMapBuffer(queue, B1, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(int)*amt, 0, NULL, NULL, &Error);
// Copy from host memory to pinned host memory, which copies to the card automatically
memcpy(host_ptr, Array, sizeof(int)*amt);
// Call your kernel and everything else, and memcpy the pinned memory back to the host when
// you are done.
Edit: One final thing you can do to speed the program up is to make the memory reads/writes non-blocking by passing CL_FALSE instead of CL_TRUE. Just make sure to call clFinish() before the data gets copied back to the host, so that the command queue is emptied and all commands have been processed.
Source: OpenCL In Action

With the right combination of flags, you should be able to achieve "zero copy" (i.e. very fast) map/unmap on Intel integrated graphics: there is no need for a "CPU to GPU" copy, because both use the same physical memory (that's what "integrated" means). Read the Intel OpenCL Optimization Guide's section on memory.

Related

Can dedicated memory be released on the GPU using OpenCL?

Is there a way to free dedicated GPU memory when an OpenCL function ends? I have noticed that if you repeatedly call a function that runs on the GPU through OpenCL, a certain amount of dedicated memory is not released after each call, which eventually causes an error when the function is called too many times.
UPDATE 22/12/2019:
I enclose a fragment of the code that runs inside the loop. The cl_program and cl_context are configured outside the loop:
void launch_kernel(float *blockC, float *blockR, float *blockEst, float *blockEstv, float *blockL, cl_program program, cl_context context, std::vector<cl_device_id> deviceIds, int n){
    cl_kernel kernel_function = clCreateKernel(program, "function_example", &error);
    cl_event event;
    cl_command_queue queue = clCreateCommandQueue(context, deviceIds[0], 0, &error);
    std::size_t size[1] = { (size_t)n };
    cl_mem blockCBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float) * (n * s), (void *)&blockC[0], &error);
    cl_mem blockRBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float) * (n * s), (void *)&blockR[0], &error);
    cl_mem blockEstBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(float) * n, (void *)&blockEst[0], &error);
    cl_mem blockEstvBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(float) * n, (void *)&blockEstv[0], &error);
    cl_mem blockLBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(float) * n, (void *)&blockL[0], &error);
    clSetKernelArg(kernel_function, 0, sizeof(cl_mem), &blockCBuffer);
    clSetKernelArg(kernel_function, 1, sizeof(cl_mem), &blockRBuffer);
    clSetKernelArg(kernel_function, 2, sizeof(cl_mem), &blockEstBuffer);
    clSetKernelArg(kernel_function, 3, sizeof(cl_mem), &blockEstvBuffer);
    clSetKernelArg(kernel_function, 4, sizeof(cl_mem), &blockLBuffer);
    clSetKernelArg(kernel_function, 5, sizeof(int), &s);
    openclSingleton->checkError(clEnqueueNDRangeKernel(queue, kernel_function, 1, nullptr, size, nullptr, 0, nullptr, nullptr));
    clEnqueueMapBuffer(queue, blockEstBuffer, CL_TRUE, CL_MAP_READ, 0, n * sizeof(float), 0, nullptr, nullptr, &error);
    clEnqueueMapBuffer(queue, blockEstvBuffer, CL_TRUE, CL_MAP_READ, 0, n * sizeof(float), 0, nullptr, nullptr, &error);
    clEnqueueMapBuffer(queue, blockLBuffer, CL_TRUE, CL_MAP_READ, 0, n * sizeof(float), 0, nullptr, nullptr, &error);
    clReleaseMemObject(blockCBuffer);
    clReleaseMemObject(blockRBuffer);
    clReleaseMemObject(blockEstBuffer);
    clReleaseMemObject(blockEstvBuffer);
    clReleaseMemObject(blockLBuffer);
    clFlush(queue);
    clFinish(queue);
    clWaitForEvents(1, &event);
    clReleaseCommandQueue(queue);
    clReleaseKernel(kernel_function);
}
UPDATE 23/12/2019
Updated the code with the iterative process that calls the OpenCL function. The problem is that launch_kernel leaves some dedicated memory in use after it returns, so if the variable m is too large, the device memory fills up and the program crashes due to lack of resources.
std::vector<cl_device_id> deviceIds;
cl_program program;
cl_context context;
... //configure program and context
int n;
float *blockEst, *blockEstv, *blockL, *blockC, *blockR;
for(int i = 0; i < m; i++){
    blockC = (float*)std::malloc(sizeof(float)*n*s);
    blockR = (float*)std::malloc(sizeof(float)*n*s);
    blockEst = (float*)std::malloc(sizeof(float)*n);
    blockEstv = (float*)std::malloc(sizeof(float)*n);
    blockL = (float*)std::malloc(sizeof(float)*n);
    .... //Set values of blockC and blockR
    launch_kernel(blockC, blockR, blockEst, blockEstv, blockL, program, context, deviceIds, n);
    ... //Save values of blockEst, blockEstv and blockL
    std::free(blockC);
    std::free(blockR);
    std::free(blockEst);
    std::free(blockEstv);
    std::free(blockL);
}
clReleaseProgram(program);
clReleaseContext(context);
clReleaseDevice(deviceIds[0]);
clReleaseProgram(program);
clReleaseContext(context);
clReleaseDevice(deviceIds[0]);

OpenCL clEnqueueWriteBuffer pass pointer error

Is it necessary that the array pointer passed to clEnqueueWriteBuffer be malloc'd in the same scope?
Here is my code:
class int_matrix{
public:
    int_matrix(size_t size_row, size_t size_col) :
        _size_row(size_row), _size_col(size_col) {
        element = (int*)malloc(size_row * size_col * sizeof(int));
    }
    friend int_matrix cl_prod_l(int_matrix& lhs, int_matrix& rhs);
private:
    int* element;
};
int_matrix cl_prod_l(int_matrix& lhs, int_matrix& rhs) {
    ...
    int_matrix return_val(lhs._size_row, rhs._size_col, 0); // Initialize elements in return_val
    cl_mem lhs_buffer, rhs_buffer, return_buffer;
    /* config buffers */
    lhs_buffer = clCreateBuffer(int_matrix::_clconfig._context,
        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, M*K * sizeof(int), NULL, &err);
    rhs_buffer = clCreateBuffer(int_matrix::_clconfig._context,
        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, N*K * sizeof(int), NULL, &err);
    return_buffer = clCreateBuffer(int_matrix::_clconfig._context,
        CL_MEM_READ_WRITE, M*N * sizeof(int), NULL, &err);
    cl_kernel Kernel = clCreateKernel(int_matrix::_clconfig._program, ker, &err);
    /* enqueue buffers */
    clEnqueueWriteBuffer(int_matrix::_clconfig._cmdque, lhs_buffer, CL_TRUE, 0, M*K * sizeof(int), lhs.element, 0, NULL, NULL);
    clEnqueueWriteBuffer(int_matrix::_clconfig._cmdque, rhs_buffer, CL_TRUE, 0, N*K * sizeof(int), rhs.element, 0, NULL, NULL);
    clEnqueueWriteBuffer(int_matrix::_clconfig._cmdque, return_buffer, CL_TRUE, 0, M*N * sizeof(int), return_val.element, 0, NULL, NULL);
    ...
}
In this example, I find that lhs.element, rhs.element, and return_val.element cannot be passed to the kernel. But when I switch to arrays malloc'd in this same function (copying the same values), the kernel returns the right result.
So are there some limitations on the array pointer passed to clEnqueueWriteBuffer?
Emmm...
I found the answer myself: the cl_mem object and the int* element should be put in the same scope. (Note, though, that the buffers above are created with CL_MEM_COPY_HOST_PTR but a NULL host_ptr, which makes clCreateBuffer fail with CL_INVALID_HOST_PTR; checking err would show it.)

OpenCL directories declared invalid after using them once

I have a strange problem with my OpenCL project. When I add the directories to the compiler (the compiler is Dev-C++), it works at first. The problem is that if I close Dev-C++ and open it up again, all the directories are declared invalid. If I delete them and add them again, they seem to work. Why is this happening? Here is my code, if it helps:
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
// A minimalist OpenCL program.
#include <CL/cl.h>
#include <CL/cl.hpp>
#include <stdio.h>
#include <CL/cl_ext.h>
#define NWITEMS 512
// A simple memset kernel
const char *source =
"__kernel void memset( __global uint *dst ) \n"
"{ \n"
" dst[get_global_id(0)] = get_global_id(0); \n"
"} \n";
int main(int argc, char **argv)
{
    size_t global_work_size = NWITEMS;
    // 1. Get a platform.
    cl_platform_id platform;
    clGetPlatformIDs( 1, &platform, NULL );
    // 2. Find a gpu device.
    cl_device_id device;
    clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );
    // 3. Create a context and command queue on that device.
    cl_context context = clCreateContext( NULL, 1, &device, NULL, NULL, NULL );
    cl_command_queue queue = clCreateCommandQueue( context, device, 0, NULL );
    // 4. Perform runtime source compilation, and obtain kernel entry point.
    cl_program program = clCreateProgramWithSource( context, 1, &source, NULL, NULL );
    clBuildProgram( program, 1, &device, NULL, NULL, NULL );
    cl_kernel kernel = clCreateKernel( program, "memset", NULL );
    // 5. Create a data buffer.
    cl_mem buffer = clCreateBuffer( context, CL_MEM_WRITE_ONLY, ... );
    // 6. Launch the kernel. Let OpenCL pick the local work size.
    size_t KERNEL_WORK_GROUP_SIZE = NWITEMS;
    clSetKernelArg( kernel, 0, sizeof(buffer), (void*) &buffer );
    clEnqueueNDRangeKernel( queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL );
    clFinish( queue );
    // 7. Look at the results via synchronous buffer map.
    cl_uint *ptr;
    ptr = (cl_uint *) clEnqueueMapBuffer( queue, buffer, CL_TRUE, CL_MAP_READ, 0, NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );
    int i;
    for(i=0; i < NWITEMS; i++)
        printf("%d %d\n", i, ptr[i]);
    return(0);
}

Not all work-items being used opencl

So I'm able to compile and execute my kernel; the problem is that only two work-items seem to be used. I'm basically trying to fill up a float array[8] with {0,1,2,3,4,5,6,7}, so this is a very simple hello-world application. Below is my kernel.
// Highly simplified to demonstrate
__kernel void rnd_float32_matrix (
    __global float * res
) {
    uint idx = get_global_id(0);
    res[idx] = idx;
}
I then create and execute the kernel with the following code...
// Some more code
cl::Program program(context, sources, &err);
program.build(devices, NULL, NULL, NULL);
cl::Kernel kernel(program, "rnd_float32_matrix", &err);
kernel.setArg(0, src_d);
cl::CommandQueue queue(context, devices[0], 0, &err);
cl::Event event;
err = queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(8),
// I've tried cl::NDRange(8) as well
cl::NDRange(1),
NULL,
&event
);
event.wait();
err = queue.enqueueReadBuffer(
// This is:
// cl::Buffer src_d(
// context,
// CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
// mem_size,
// src_h,
// &err);
src_d,
CL_TRUE,
0,
8,
// This is float * src_h = new float[8];
src_h);
for(int i = 0; i < 8; i ++) {
std::cout << src_h[i] << std::endl;
}
I may not show it in the code, but I also select a GPU device, and context.getInfo(..) shows I'm using my NVidia GTX 770M card, which reports 1024, 1024, and 64 work-items available in dimensions 0, 1, and 2. When this array prints I keep getting 0, 1, 0, 0, 0, 0, 0, 0. I've also tried setting res[idx] = 5, and I get 5, 5, 0, 0, 0, 0, 0, 0. So it seems that only two work-items are actually being used. What am I doing wrong?
Your command to read the data back from the device is only reading 8 bytes, which is two floats:
err = queue.enqueueReadBuffer(
src_d,
CL_TRUE,
0,
8, // <- This is the number of bytes, not the number of elements!
// This is float * src_h = new float[8];
src_h);
To read 8 floats, you would need to do this:
err = queue.enqueueReadBuffer(
src_d,
CL_TRUE,
0,
8 * sizeof(cl_float),
// This is float * src_h = new float[8];
src_h);

How to print results from kernel in OpenCL?

I am new to OpenCL. I am trying to use the OpenCL C++ kernel language extension (http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/CPP_kernel_language.pdf). I am trying to print results using the code on page 10 of this document. Please find the code below from this documentation, and correct me if I am wrong anywhere.
class Test{
public:
    void setX(int value){ x = value; }
    int getX(){ return x; }
private:
    int x;
};
int main() {
    cl_mem classObj = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, sizeof(Test), &tempClass, &ret);
    void* dm_idata = clEnqueueMapBuffer(command_queue, classObj, CL_TRUE, CL_MAP_WRITE, 0, sizeof(Test), 0, NULL, NULL, &ret);
    tempClass.setX(10); // prints this value
    clEnqueueUnmapMemObject(command_queue, classObj, dm_idata, 0, NULL, NULL); // class is passed to the device
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
    clEnqueueMapBuffer(command_queue, classObj, CL_TRUE, CL_MAP_WRITE, 0, sizeof(Test), 0, NULL, NULL, &ret); // class is passed back to the host
    printf("\n temp value: %d\n", tempClass.getX());
}
Here is the kernel code.
class Test {
    setX (int value);
private:
    int x;
};
__kernel void foo(__global Test* Inclass){
    if(get_global_id(0) == 0)
        Inclass->setX(6);
}
It prints the value set from the host code; I need to get the result from the kernel. Any help is highly appreciated.
The result I got is
temp value = 10
Your second call to clEnqueueMapBuffer should be passing CL_MAP_READ, not CL_MAP_WRITE, since you want to read the data.