Char*** in OpenCL kernel argument? - c++

I need to pass a vector<vector<string>> to an OpenCL kernel. What is the easiest way of doing it? Passing a char*** gives me an error:
__kernel void vadd(
__global char*** sets,
__global int* m,
__global long* result)
{}
ERROR: clBuildProgram(CL_BUILD_PROGRAM_FAILURE)

In OpenCL 1.x, this sort of thing is basically not possible. You'll need to convert your data such that it fits into a single buffer object, or at least into a fixed number of buffer objects. Pointers on the host don't make sense on the device. (With OpenCL 2's SVM feature, you can pass pointer values between host and kernel code, but you'll still need to ensure the memory is allocated in a way that's appropriate for this.)
One option I can think of, bearing in mind I know nothing about the rest of your program, is as follows:
Create an OpenCL buffer for all the strings. Add up the number of bytes required for all your strings (possibly including NUL termination, depending on what you're trying to do).
Create a buffer for looking up string start offsets (and possibly length). Looks like you have 2 dimensions of lookup (nested vectors) so how you lay this out will depend on whether your inner vectors (second dimension) are all the same size or not.
Write your strings back-to-back into the first buffer, recording start offsets (and length, if necessary) in the second buffer.
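A minimal host-side sketch of the flattening steps above, assuming the inner vectors may have different sizes. The names FlatStrings, flatten, stringAt and rowStart are made up for illustration; they are not part of OpenCL. Each of the three vectors would then be copied into its own buffer object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container for the flattened data (names are illustrative).
struct FlatStrings {
    std::vector<char> chars;              // all strings written back to back, no terminators
    std::vector<std::uint32_t> offsets;   // start of each string in chars, plus one end sentinel
    std::vector<std::uint32_t> rowStart;  // first string index of each inner vector, plus one end sentinel
};

FlatStrings flatten(const std::vector<std::vector<std::string>>& sets) {
    FlatStrings flat;
    for (const auto& row : sets) {
        flat.rowStart.push_back(static_cast<std::uint32_t>(flat.offsets.size()));
        for (const auto& s : row) {
            flat.offsets.push_back(static_cast<std::uint32_t>(flat.chars.size()));
            flat.chars.insert(flat.chars.end(), s.begin(), s.end());
        }
    }
    // Sentinels, so that length = offsets[k + 1] - offsets[k] works for the last string too.
    flat.rowStart.push_back(static_cast<std::uint32_t>(flat.offsets.size()));
    flat.offsets.push_back(static_cast<std::uint32_t>(flat.chars.size()));
    return flat;
}

// Host-side mirror of the lookup a kernel would do with the two index buffers.
std::string stringAt(const FlatStrings& f, std::size_t row, std::size_t col) {
    const std::uint32_t k = f.rowStart[row] + static_cast<std::uint32_t>(col);
    assert(k < f.rowStart[row + 1]);  // col must lie inside this row
    return std::string(f.chars.begin() + f.offsets[k], f.chars.begin() + f.offsets[k + 1]);
}
```

Storing end sentinels instead of separate lengths keeps the kernel-side lookup to two reads per string; if all inner vectors had the same size, rowStart could be dropped and replaced by a multiplication.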

Related

Can I alias an array of structs to an array of struct members?

I am wondering if it is possible to create/copy a "virtual" array of a specific member of a struct in another array. Let's say we have a struct
struct foo {
int value;
char character;
};
Now assume there is an array containing this struct foo and I have an operation that needs to add all int value's together. This would normally be very easy with a loop adding all the values with a pointer. Problem is I am using OpenCL and need to copy an array to some device. In OpenCL this is done using
clEnqueueWriteBuffer(cmdQueue, buffer, CL_TRUE, 0, datasize, A, 0, NULL, NULL);
which will copy an array buffer to the device. It doesn't make sense to copy the entire array of structs, since that would take more time, because it also sends the characters, which are not needed. It would also take up more space on the OpenCL device. Is it therefore possible to copy the "array" of values from the structs directly as an array to the device?
I know I can create a new array on the host (CPU) with all the values and then copy that array to the OpenCL device, but then I would spend time copying to a local int-array and afterwards copy that array to the OpenCL device.
Would it be possible to copy a "virtual" array of values directly from the array of foo-structs, containing only the int values?
Please be aware that this is a very simplified example of my actual problem, and I would like to avoid having the values in a separate array from the beginning, which the structs would then point to. I have big doubts that this is possible, or that my explanation even makes sense, but I look forward to feedback!
No.
clEnqueueWriteBuffer expects a contiguous container. You cannot create a "virtual" contiguous container.
[I] would like to avoid having the values in a separate array from the beginning.
At that point, you must profile and compare two implementations: one copying the array as-is with the superfluous data, and one creating a local copy of the useful data to send. Compare and choose.
If you have an array of structs, you would need a staging buffer with just the values, which means extra copies on the CPU side.
Sometimes such work is unavoidable, but if you can, it is better to have multiple arrays of contiguous values. Even in pure CPU work this is frequently more efficient for the CPU cache, as it avoids reads/writes of unneeded members, and it is often easier for SIMD instruction sets like SSE.
For example you could have int *values and char *chars of the same length (prefer some type like std::vector or std::unique_ptr<T[]> though!), then the copy is easy.
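As a rough sketch of the staging-buffer route, assuming the struct foo from the question: the helper name gatherValues is made up for illustration. The resulting vector is contiguous, so staging.data() and staging.size() * sizeof(int) are what you would hand to clEnqueueWriteBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct foo {
    int value;
    char character;
};

// Gather only the int members into a contiguous staging vector.
// gatherValues is a hypothetical helper, not an OpenCL function.
std::vector<int> gatherValues(const foo* items, std::size_t count) {
    assert(items != nullptr || count == 0);
    std::vector<int> staging;
    staging.reserve(count);  // one allocation up front
    for (std::size_t i = 0; i < count; ++i) {
        staging.push_back(items[i].value);
    }
    return staging;
}
```

Whether this extra pass is cheaper than transferring the padded structs depends on the struct size and the transfer cost, which is why profiling both variants is the way to decide.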

Using structure as buffer holder

In my current OpenCL implementation, I wanted to save time with arguments: avoid passing them every time I want to use a buffer inside a kernel, and have a shorter argument list for my kernels.
So I made a structure (workspace) that holds the pointers to the buffers in device memory. The struct acts like an object whose member variables you want to access over time and keep alive for the whole execution. I never had a problem on AMD GPUs or even on CPU, but Nvidia is causing a lot of problems with this. It always seems to be an alignment problem, never reaching the right buffer, etc.
Here is some code to help; see the question below.
The structure defined on the host:
#define SRC_IMG 0 // (float4 buffer) Source image
#define LAB_IMG 1 // (float4 buffer) LAB image
// NOTE: The size of this array should be as much as the last define + 1.
#define __WRKSPC_SIZE__ 2
// Structure defined on host.
struct Workspace
{
cl_ulong getPtr[__WRKSPC_SIZE__];
};
struct HostWorkspace
{
cl::Buffer srcImg;
cl::Buffer labImg;
};
The structure defined on device:
typedef struct __attribute__(( packed )) gpuWorkspace
{
ulong getPtr[__WRKSPC_SIZE__];
} gpuWorkspace_t;
Note that on the device I use ulong and on the host I use cl_ulong, as shown in "OpenCL: using struct as kernel argument".
Once the cl::Buffer for the source image or the LAB image is created, I save it into a HostWorkspace object. Until that object is released, the reference to the cl::Buffer is kept, so the buffer exists for the whole execution on the host, and de facto on the device.
Now I need to feed those to the device, so I have a simple kernel which initializes my device workspace as follows:
__kernel void Workspace_Init(__global gpuWorkspace_t* wrkspc,
__global float4* src,
__global float4* LAB)
{
// Get the ulong pointer on the first element of each buffer.
wrkspc->getPtr[SRC_IMG] = &src[0];
wrkspc->getPtr[LAB_IMG] = &LAB[0];
}
where wrkspc is a buffer allocated with struct Workspace, and src and LAB are just buffers allocated as 1D array images.
Afterwards, in any of my kernels, if I want to use src or LAB, I do as follows:
__kernel void ComputeLABFromSrc(__global gpuWorkspace_t* wrkSpc)
{
// =============================================================
// Get pointer from work space.
// =============================================================
// Cast back the pointer of first element as a normal buffer you
// want to use along the execution of the kernel.
__global float4* srcData = ( __global float4* )( wrkSpc->getPtr[SRC_IMG] );
__global float4* labData = ( __global float4* )( wrkSpc->getPtr[LAB_IMG] );
// Code kernel as usual.
}
When I started to use this, I had 4-5 images, which went well, with a different structure like this:
struct Workspace
{
cl_ulong imgPtr;
cl_ulong labPtr;
};
where each image had its own pointer.
At a certain point I reached more images and had some problems. So I searched online and found a recommendation saying that the sizeof() of the struct could differ between device and host, so I changed it to a single array of the same type, and this worked fine up to 16 elements.
So I searched more and found a recommendation about __attribute__((packed)), which I put on the device structure (see above). But now I have reached 26 elements. When I check the sizeof the struct on either the device or the host, the size is 208 (elements * sizeof(cl_ulong) == 26 * 8). But I still have an issue similar to my previous model: my pointer reads somewhere else, in the middle of the previous image, etc.
So I am wondering if anyone has ever tried a similar model (maybe with a different approach) or has any tips for a "solid" model like this.
Note that all kernels are well coded; I get good results when executing the same code on AMD or on CPU. The only issue is on Nvidia.
Don't try to store GPU-side pointer values across kernel boundaries. They are not guaranteed to stay the same. Always use indices. And if a kernel uses a specific buffer, you need to pass it as an argument to that kernel.
References:
The OpenCL 1.2 specification (as far as I'm aware, NVIDIA does not implement a newer standard) does not define the behaviour of pointer-to-integer casts or vice versa.
Section 6.9p states: "Arguments to kernel functions that are declared to be a struct or union do not allow OpenCL objects to be passed as elements of the struct or union." This is exactly what you are attempting to do: passing a struct of buffers to a kernel.
Section 6.9a states: "Arguments to kernel functions in a program cannot be declared as a pointer to a pointer(s)." - This is essentially what you're trying to subvert by casting your pointers to an integer and back. (point 1) You can't "trick" OpenCL into this being well-defined by bypassing the type system.
As I suggest in the comment thread below, you will need to use indices to save positions inside a buffer object. If you want to store positions across different memory regions, you'll need to either unify multiple buffers into one and save one index into this giant buffer, or save a numeric value that identifies which buffer you are referring to.
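To make the "unify multiple buffers into one and save an index" option concrete, here is a host-side sketch. The names Arena, addImage, load and firstElem are invented for illustration, and plain float stands in for the float4 pixels of the question. The table of start indices is what you would store in the workspace buffer instead of pointer values; the big data buffer is still passed to each kernel as a normal argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side layout: every image lives in one big buffer, and a
// small table records where each image starts (in elements, not bytes).
struct Arena {
    std::vector<float> data;               // contents of the single device buffer
    std::vector<std::uint32_t> firstElem;  // firstElem[id] = start index of image id
};

// Append an image and return its numeric id (usable like SRC_IMG / LAB_IMG).
std::uint32_t addImage(Arena& arena, const std::vector<float>& pixels) {
    const auto id = static_cast<std::uint32_t>(arena.firstElem.size());
    arena.firstElem.push_back(static_cast<std::uint32_t>(arena.data.size()));
    arena.data.insert(arena.data.end(), pixels.begin(), pixels.end());
    return id;
}

// What a kernel would compute: base index from the table plus a local index.
float load(const Arena& arena, std::uint32_t id, std::uint32_t i) {
    assert(id < arena.firstElem.size());
    return arena.data[arena.firstElem[id] + i];
}
```

Because only integers cross the kernel boundary, the layout stays valid no matter where the driver places the buffer in device memory between launches.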

How to use struct in OpenCL

How can I use my custom struct in OpenCL? There are no arrays of objects in OpenCL, or 2D arrays besides images.
struct Block {
char item[4][4];
};
I would like to use array of these structs in OpenCL and access its elements by indices like in C/C++.
For example
Block *keys = new Block[11];
keys[3].item[2][2];
Let me explain. I am working on implementing AES-128 ECB in OpenCL.
Here is the AES description.
I use these structs (blocks) to divide the plaintext into blocks of 4x4 bytes. This array of 11 blocks holds the 11 round keys.
I did the same thing with the plaintext. For example, a plaintext of 67 bytes is divided into 5 blocks. In C this works very well in sequential execution (key scheduling, subbytes, shift rows, mixcolumns, addround) for encryption and decryption.
But the problem now is that it is not as simple as that in OpenCL. How can I use an array of these blocks in OpenCL? Or do I need to transform everything into a 1D array of char (for example)?
In OpenCL C and OpenCL C++ you can't dynamically allocate memory in a kernel: there is no malloc, only placement new, and so on. You can indeed make 2D arrays like char item[4][4] and declare structs like Block in your kernels. But if you must have a dynamically sized array, then, since you can't allocate memory, you can do one of the following:
Declare an array (of whatever dimension) of automatic storage duration that is sufficiently large for your use. For example declare char item[100][100] if you know you won't need more than a 100x100 array.
Create a buffer on the host with clCreateBuffer and pass it in as a kernel argument.
If you want to build an array of structs on your host and then pass it in to your kernel as a buffer, you can do that too! But you must declare the structs separately in your host source and kernel source, and pay attention to size and alignment characteristics, and byte order. It's on you to make sure bits you pass in from the host are interpreted correctly on your device.
Edit:
To understand the layout of a struct take a look at this question: C struct memory layout?
OpenCL C is based on C, so you can expect the layout of structs to follow the same rules. The sizes of primitive types may differ on your host and device, but OpenCL defines several typedefs like cl_int that you should probably use when declaring a struct on your host to make sure it has the same size as on your device. For example, the size of cl_int on your host will be the same as the size of int on your device.
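As a small illustration of checking the layout on the host, here is a sketch that assumes the Block struct from the question. std::int8_t stands in for cl_char so the example compiles without the OpenCL headers, and byteOffset is a made-up helper showing where keys[block].item[row][col] lands if you treat the array as a flat 1D byte buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side mirror of the kernel's Block.
// Assumption: std::int8_t is used here in place of cl_char.
struct Block {
    std::int8_t item[4][4];
};

// Fail at compile time if the host compiler adds padding.
static_assert(sizeof(Block) == 16, "Block must be 16 tightly packed bytes");

// Byte offset of keys[block].item[row][col] inside a flat byte buffer.
std::size_t byteOffset(std::size_t block, std::size_t row, std::size_t col) {
    assert(row < 4 && col < 4);
    return block * sizeof(Block) + row * 4 + col;
}
```

With a struct made only of single bytes there is no padding or byte-order concern; those issues show up once the struct mixes wider types.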
You can determine the endianness of your device with clGetDeviceInfo using the param_name CL_DEVICE_ENDIAN_LITTLE.

Sending a part of a byte array

I am reading data from a serial port (on an Arduino) and framing it (syncing on a few bytes). To do that, I am reading the data into a big buffer.
Once I have the frame, I extract the data and want to send it to a different serial port using Serial.write(buf, len), which accepts a byte array and its size.
Since the data size can be random, I need something like a dynamic array (which is not recommended on Arduino). Any ideas?
Since the data size can be random, I need something like a dynamic array
In C you rarely need a dynamic array, because arrays passed to functions do not carry their size with them. That is why all functions that take an array also take length.
Let's say you have your data inside bigBuffer at position startPos, and you wish to send length bytes. All you need to do is
Serial.write(&bigBuffer[startPos], length);
or with pointer arithmetic syntax
Serial.write(bigBuffer+startPos, length);
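The same idea as a sketch that runs off-device. writeBytes is a hypothetical stand-in for Serial.write(buf, len) that appends to a vector so the result can be inspected, and sendFrame is an invented wrapper that adds a bounds check before forwarding the pointer into the big buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Serial.write(buf, len): appends the bytes to
// `port` instead of sending them over a real serial line.
std::size_t writeBytes(std::vector<std::uint8_t>& port,
                       const std::uint8_t* buf, std::size_t len) {
    port.insert(port.end(), buf, buf + len);
    return len;
}

// Send `length` bytes of bigBuffer starting at startPos, without copying
// them into a separate array first.
std::size_t sendFrame(std::vector<std::uint8_t>& port,
                      const std::uint8_t* bigBuffer, std::size_t bufferSize,
                      std::size_t startPos, std::size_t length) {
    assert(startPos <= bufferSize && length <= bufferSize - startPos);
    return writeBytes(port, bigBuffer + startPos, length);
}
```

No dynamic allocation is needed on the sending side: the frame is described entirely by a pointer into the existing buffer plus a length.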

Padding extra bytes to flatbuffer's buffer pointer which is to be sent over the network

Please look at this code for context.
auto normalized_log = CreateNormalizedLog(builder, pairs);
builder.Finish(normalized_log);
auto buffPtr = builder.GetBufferPointer();
SendUdpPacket(&ipInfo, reinterpret_cast<SOCKET>(ipInfo.hdle), buffPtr, builder.GetSize());
I need to pack the size of the created buffPtr (with fixed two bytes). Is there any preferred way of appending/offsetting without copying the entire buffer?
I think I cannot add size to the schema, because after receiving I should know the size without calling getRootAsNormalizedLog.
Is there any way to add extra bytes to the resulting buffer?
There's no built-in facility to make a buffer length-prefixed. You shouldn't need to either: UDP packets (and most transfer mechanisms) know the size of their payload, so prefixing it yourself would just be duplicate information.
That said, if you insist on doing this without copying, you could do something like this:
auto size = static_cast<uint16_t>(builder.GetSize());
builder.PushElement(size);
This will prefix the buffer with a 16-bit size. The problem with this approach is that the buffer will already have been aligned for its largest elements, so the buffer is now possibly unaligned at the destination. Hence, you're better off using a 32-bit (or 64-bit) length, depending on the largest scalars in your buffer.
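For the receiving side, here is a sketch of how a length-prefixed packet could be split, assuming a 32-bit little-endian prefix (FlatBuffers stores scalars little-endian). Frame and parseFrame are made-up names, not part of the FlatBuffers API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical view of one received frame: a 4-byte little-endian length
// followed by the payload (the FlatBuffer itself).
struct Frame {
    const std::uint8_t* payload;  // nullptr if the packet is malformed
    std::uint32_t size;
};

Frame parseFrame(const std::uint8_t* packet, std::size_t packetSize) {
    assert(packet != nullptr);
    const std::size_t prefix = 4;
    if (packetSize < prefix) return {nullptr, 0};
    // Assemble the length byte by byte so host endianness does not matter.
    const std::uint32_t size = static_cast<std::uint32_t>(packet[0]) |
                               (static_cast<std::uint32_t>(packet[1]) << 8) |
                               (static_cast<std::uint32_t>(packet[2]) << 16) |
                               (static_cast<std::uint32_t>(packet[3]) << 24);
    if (size > packetSize - prefix) return {nullptr, 0};  // claims more than was received
    return {packet + prefix, size};
}
```

With a 4-byte prefix the payload starts at a 4-byte offset from the start of the packet, so 32-bit scalars inside it keep their alignment as long as the receive buffer itself is suitably aligned.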