In my current OpenCL implementation, I wanted to save time with arguments, avoid passing them every time I want to use a buffer inside a kernel, and have a shorter argument list for my kernels.
So I made a structure (a workspace) that holds pointers to the buffers in device memory. The struct acts like an object with member variables that you want to access over time and that should stay alive for the whole execution. I never had a problem on AMD GPUs or even on CPUs, but Nvidia causes a lot of problems with this. It always seems to be an alignment problem: pointers never reaching the right buffer, etc.
Here is some code to help; see the question below.
The structure defined on the host:
#define SRC_IMG 0 // (float4 buffer) Source image
#define LAB_IMG 1 // (float4 buffer) LAB image
// NOTE: The size of this array must be the last define + 1.
#define __WRKSPC_SIZE__ 2
// Structure defined on host.
struct Workspace
{
    cl_ulong getPtr[__WRKSPC_SIZE__];
};

struct HostWorkspace
{
    cl::Buffer srcImg;
    cl::Buffer labImg;
};
The structure defined on device:
typedef struct __attribute__(( packed )) gpuWorkspace
{
    ulong getPtr[__WRKSPC_SIZE__];
} gpuWorkspace_t;
Note that on the device I use ulong and on the host I use cl_ulong, as shown here: OpenCL: using struct as kernel argument.
So once the cl::Buffer for the source image or LAB image is created, I save it into a HostWorkspace object; until that object is released, the reference to the cl::Buffer is kept, so the buffer exists for the entire lifetime of the project on the host, and de facto on the device.
Now I need to feed those to the device, so I have a simple kernel which initializes my device workspace as follows:
__kernel void Workspace_Init(__global gpuWorkspace_t* wrkspc,
                             __global float4* src,
                             __global float4* LAB)
{
    // Store the address of the first element of each buffer as a ulong.
    wrkspc->getPtr[SRC_IMG] = (ulong)&src[0];
    wrkspc->getPtr[LAB_IMG] = (ulong)&LAB[0];
}
where wrkspc is a buffer allocated with struct Workspace, and src + LAB are just buffers allocated as 1D array images.
Afterwards, in any of my kernels, if I want to use src or LAB, I do the following:
__kernel void ComputeLABFromSrc(__global gpuWorkspace_t* wrkSpc)
{
    // =============================================================
    // Get pointer from work space.
    // =============================================================
    // Cast back the pointer of first element as a normal buffer you
    // want to use along the execution of the kernel.
    __global float4* srcData = ( __global float4* )( wrkSpc->getPtr[SRC_IMG] );
    __global float4* labData = ( __global float4* )( wrkSpc->getPtr[LAB_IMG] );

    // Code kernel as usual.
}
When I started using this, I had around 4-5 images and it was going well, with a different structure like this:
struct Workspace
{
    cl_ulong imgPtr;
    cl_ulong labPtr;
};
where each image had its own pointer.
At a certain point I reached more images, and I had some problems. So I searched online, and I found a recommendation that the sizeof() of the struct could differ between device and host, so I changed it to a single array of the same type, and this worked fine up to 16 elements.
So I searched more, and I found a recommendation about __attribute__((packed)), which I put on the device structure (see above). But now, at 26 elements, when I check the sizeof of the struct either on device or on host, the size is 208 (elements * sizeof(cl_ulong) == 26 * 8). Yet I still have an issue similar to my previous model: my pointers end up reading somewhere in the middle of the previous image, etc.
So I am wondering if anyone has ever tried a similar model (maybe with a different approach), or has any tips to make this model "solid".
Note that all kernels are correctly coded; I get good results when executing on AMD or on CPU with the same code. The only issue is on Nvidia.
Don't try to store GPU-side pointer values across kernel boundaries. They are not guaranteed to stay the same. Always use indices. And if a kernel uses a specific buffer, you need to pass it as an argument to that kernel.
References:
The OpenCL 1.2 specification (as far as I'm aware, Nvidia does not implement a newer standard) does not define the behaviour of pointer-to-integer casts or vice versa.
Section 6.9p states: "Arguments to kernel functions that are declared to be a struct or union do not allow OpenCL objects to be passed as elements of the struct or union." This is exactly what you are attempting to do: passing a struct of buffers to a kernel.
Section 6.9a states: "Arguments to kernel functions in a program cannot be declared as a pointer to a pointer(s)." - This is essentially what you're trying to subvert by casting your pointers to an integer and back. (point 1) You can't "trick" OpenCL into this being well-defined by bypassing the type system.
As I suggest in the comment thread below, you will need to use indices to save positions inside a buffer object. If you want to store positions across different memory regions, you'll need to either unify multiple buffers into one and save one index into this giant buffer, or save a numeric value that identifies which buffer you are referring to.
Related
I have created a vector of a custom type whose size is equal to the number of GPU devices in the platform. Now the idea is to access the element of this vector at the index of the current device referring to it. For example, if the device with id 0 is currently referring to this vector, then I need to access the element at index 0.
The actual problem is with the device id type, which is cl_device_id. Since to access the vector with "[]" we need an int, we need the device id's numerical value. As I am a beginner, I don't know how to access the numerical value of this variable, i.e. the actual device id.
After looking at the header file (cl.h), it seems it's a struct, but its definition is nowhere to be found.
Can someone please point me in the correct direction?
OpenCL objects are Opaque Types, which means you cannot view their internal structure at runtime. You can dereference the pointer and try to read its bytes, but you probably shouldn't, since if its contents are even allocated in host memory (no guarantee of that!), it won't make much sense to you.
OpenCL Devices don't have a "number" associated with them. If you need a number associated, you have to define that yourself. There's nothing wrong with, say, writing std::vector<cl_device_id> devices {device1, device2, device3};, where you know devices[0] always refers to the first device, devices[1] refers to the second, and so on.
If you need a more "stable" way of referring to devices, using its name is a good idea, since device names are guaranteed not to change during the runtime of an application (but they can change between runs of an application if the drivers are updated!):
std::string getDeviceName(cl_device_id did) {
    size_t size;
    clGetDeviceInfo(did, CL_DEVICE_NAME, 0, nullptr, &size); // Gets the size of the name (including the nul terminator)
    std::string name(size, '\0');
    clGetDeviceInfo(did, CL_DEVICE_NAME, size, &name[0], nullptr);
    name.pop_back(); // Drop the trailing nul character
    return name;
}
//...
std::map<std::string, cl_device_id> ids { // or std::unordered_map
    {getDeviceName(device1), device1},
    {getDeviceName(device2), device2},
    {getDeviceName(device3), device3}
};
Whatever you choose, the important part is that you have to choose this for yourself: there's no intrinsic ordering to OpenCL objects, beyond whatever the API happens to provide as the value of its pointer.
You're right, the cl_device_id type is defined in the /usr/include/CL/cl.h header as:
typedef struct _cl_device_id * cl_device_id;
This is one of OpenCL's Abstract Data Types - to see the complete list of them, please look here.
The OpenCL designers intentionally hide the data structures pointed to by values of these types. All these values have the same size as a pointer; however, you are not allowed to dereference or increment/decrement these pointers - you can only assign them, copy them, compare them, and use them as arguments or return values in OpenCL functions.
I need to pass a vector<vector<string>> to an OpenCL kernel. What is the easiest way of doing it? Passing a char*** gives me an error:
__kernel void vadd(
    __global char*** sets,
    __global int* m,
    __global long* result)
{}
ERROR: clBuildProgram(CL_BUILD_PROGRAM_FAILURE)
In OpenCL 1.x, this sort of thing is basically not possible. You'll need to convert your data such that it fits into a single buffer object, or at least into a fixed number of buffer objects. Pointers on the host don't make sense on the device. (With OpenCL 2's SVM feature, you can pass pointer values between host and kernel code, but you'll still need to ensure the memory is allocated in a way that's appropriate for this.)
One option I can think of, bearing in mind I know nothing about the rest of your program, is as follows:
Create an OpenCL buffer for all the strings. Add up the number of bytes required for all your strings (possibly including nul termination, depending on what you're trying to do).
Create a buffer for looking up string start offsets (and possibly length). Looks like you have 2 dimensions of lookup (nested vectors) so how you lay this out will depend on whether your inner vectors (second dimension) are all the same size or not.
Write your strings back-to-back into the first buffer, recording start offsets (and length, if necessary) in the second buffer.
How can I use my custom struct in OpenCL? I ask because there are no arrays of objects in OpenCL, nor 2D arrays besides images.
struct Block {
    char item[4][4];
};
I would like to use an array of these structs in OpenCL and access its elements by indices, like in C/C++.
For example
Block *keys = new Block[11];
keys[3].item[2][2];
Let me explain. I am working on implementing AES-128 ECB in OpenCL.
Here is the AES description.
These structs (blocks) are used for dividing the plaintext into 4x4-byte blocks. The array of 11 blocks holds the 11 round keys.
I did the same thing with the plaintext. For example, a plaintext of 67 bytes is divided into 5 blocks. In C this works very well in sequential execution (key scheduling, subbytes, shift rows, mix columns, add round key), for both encryption and decryption.
But the problem is that it is not that simple in OpenCL. How can I use an array of these blocks in OpenCL? Or do I need to transform everything into a 1D array of char (for example)?
In OpenCL C and OpenCL C++ you can't dynamically allocate memory in a kernel - there's no malloc, and OpenCL C++ only allows placement new. You can indeed make 2D arrays like char item[4][4] and declare structs like Block in your kernels. But since you can't allocate memory, if you must have a dynamically sized array you can do one of the following things:
Declare an array (of whatever dimension) of automatic storage duration that is sufficiently large for your use. For example declare char item[100][100] if you know you won't need more than a 100x100 array.
Create a buffer on the host with clCreateBuffer and pass it in as a kernel argument.
If you want to build an array of structs on your host and then pass it in to your kernel as a buffer, you can do that too! But you must declare the structs separately in your host source and kernel source, and pay attention to size and alignment characteristics, and byte order. It's on you to make sure bits you pass in from the host are interpreted correctly on your device.
Edit:
To understand the layout of a struct take a look at this question: C struct memory layout?
OpenCL C is based on C, so you can expect the layout of structs to follow the same rules. The sizes of primitive types may differ on your host and device, but OpenCL defines several typedefs like cl_int that you should probably use when declaring a struct on your host to make sure it has the same size as on your device. For example, the size of cl_int on your host will be the same as the size of int on your device.
You can determine the endianness of your device with clGetDeviceInfo using the param_name CL_DEVICE_ENDIAN_LITTLE.
I read the formal definition on the official site, but I still don't understand: what is void* host_ptr used for? For example here. Why is the buffer here so useful? Why is buffer a pointer to char, and why is the size 4*width*height?
host_ptr is used when either of the flags CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR is specified. In those cases, host_ptr points to CPU memory containing an image to use or copy.
In the example code you linked to, buffer is the host-side (CPU memory) image being copied to the device-side (typically GPU) image (since they are using the CL_MEM_COPY_HOST_PTR flag).
It's not important that they made it a pointer to char, since they are using memcpy to fill it in, but it helps the allocation using new char, since a char is 1 byte in size.
4 * width * height is because that is the total number of bytes (chars) needed.
4 is the size of one CL_RGBA CL_UNORM_INT8 pixel (R, G, B, A are each one byte).
width * height because that is the total count of pixels.
The code is allocating host-side memory for an image, filling it in with something, then creating a device-side image using those bytes, then processing it using a kernel, then copying the bytes back to the host-side image.
I'm wondering how host-side cl::Buffer objects are instantiated in a multi-device context.
Let's say I have an OCL environment class, which generates, from cl::Platform, ONE cl::Context:
this->ocl_context = cl::Context(CL_DEVICE_TYPE_GPU, con_prop);
And then, the corresponding set of devices:
this->ocl_devices = this->ocl_context.getInfo<CL_CONTEXT_DEVICES>();
I generate one cl::CommandQueue object and one set of cl::Kernel(s) for EACH device.
Let's say I have 4 GPUs of the same type. Now I have 4x cl::Device objects in ocl_devices.
Now, what happens when I have a second handler class to manage computation on each device:
oclHandler h1(cl::Context* c, cl::CommandQueue cq1, std::vector<cl::Kernel*> k1);
...
oclHandler h2(cl::Context* c, cl::CommandQueue cq4, std::vector<cl::Kernel*> k4);
And then INSIDE EACH CLASS, I both instantiate:
oclHandler::classMemberFunction(
    ...
    this->buffer = cl::Buffer(
        *(this->ocl_context),
        CL_MEM_READ_WRITE,
        mem_size,
        NULL,
        NULL
    );
    ...
)
and then after, write to
oclHandler::classMemberFunction(
    ...
    this->ocl_cq->enqueueWriteBuffer(
        this->buffer,
        CL_FALSE,
        static_cast<unsigned int>(0),
        mem_size,
        ptr,
        NULL,
        NULL
    );
    ...
    this->ocl_cq->finish();
    ...
)
each buffer. There is a concern that, because the instantiation is for a cl::Context and not tied to a particular device, memory might be allocated four times over, once on each device. I can't determine when the operation that says "on the device, this buffer runs from 0xXXXXXXXXXXXXXXXX for N bytes" occurs.
Should I instantiate one context per device? That seems unnatural, because I'd have to instantiate a context, see how many devices there are, and then instantiate d-1 more contexts... which seems inefficient. My concern is with limiting available device-side memory. I am doing computations on massive sets, and I'd probably be using all of the 6GB available on each card.
Thanks.
EDIT: Is there a way to instantiate a buffer and fill it asynchronously without using a command queue? Let's say I have 4 devices, and one host-side buffer full of static, read-only data. Let's say that buffer is 500MB in size. If I want to just use clCreateBuffer, with
shared_buffer = new cl::Buffer(
    this->ocl_context,
    CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
    total_mem_size,
    ptr,
    NULL
);
that will start a blocking write, where I cannot do anything host-side until all of ptr's contents are copied to the newly allocated memory. I have a multithreaded device-management system and I've created one cl::CommandQueue for each device, always passing along &shared_buffer for every kernel::setArg call that requires it. I'm having a hard time wrapping my head around what to do.
When you have a context that contains multiple devices, any buffers that you create within that context are visible to all of its devices. This means that any device in the context could read from any buffer in the context, and the OpenCL implementation is in charge of making sure the data is actually moved to the correct device as and when it is needed. There are some grey areas around what should happen if multiple devices try to access the same buffer at the same time, but this kind of behaviour is generally avoided anyway.
Although all of the buffers are visible to all of the devices, this doesn't necessarily mean that they will be allocated on all of the devices. All of the OpenCL implementations that I've worked with use an 'allocate-on-first-use' policy, whereby the buffer is allocated on the device only when it is needed by that device. So in your particular case, you should end up with one buffer per device, as long as each buffer is only used by one device.
In theory an OpenCL implementation might pre-allocate all of the buffers on all the devices just in case they are needed, but I wouldn't expect this to happen in reality (and I've certainly never seen it happen). If you are running on a platform that has a GPU profiler available, you can often use the profiler to see when and where buffer allocations and data movements actually occur, to convince yourself that the system isn't doing anything undesirable.