clCreateImage2D (..., void* hst_ptr,..) how to use it? - c++

I read the formal definition on the official site but I still don't understand: what is void* host_ptr used for? For example here. Why is buffer so useful here? Why is buffer a pointer to char, and why is the size 4*width*height?

host_ptr is used when either of the flags CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR is specified. In those cases, host_ptr points to CPU memory containing an image to use or copy.
In the example code you linked to, buffer is the host-side (CPU memory) image being copied to the device-side (typically GPU) image, since they are using the CL_MEM_COPY_HOST_PTR flag.
It's not important that they made it a pointer to char, since they fill it in using memcpy, but allocating with new char is convenient because a char is 1 byte in size.
4 * width * height is because that is the total number of bytes (chars) needed.
4 is the size of one CL_RGBA CL_UNORM_INT8 pixel (R, G, B, A are each one byte).
width * height because that is the total number of pixels.
The code is allocating host-side memory for an image, filling it in with something, then creating a device-side image using those bytes, then processing it using a kernel, then copying the bytes back to the host-side image.
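For illustration, here is a minimal sketch of that pattern (OpenCL 1.1 style, since clCreateImage2D was deprecated in 1.2; context is assumed to be an existing cl_context, and error handling is omitted):

const size_t width = 256, height = 256;       // illustrative dimensions
char* buffer = new char[4 * width * height];  // host-side RGBA pixels, 4 bytes each
// ... fill buffer with pixel data, e.g. via memcpy ...

cl_image_format format;
format.image_channel_order     = CL_RGBA;
format.image_channel_data_type = CL_UNORM_INT8;

cl_int err;
cl_mem image = clCreateImage2D(context,
                               CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               &format,
                               width, height,
                               0,        // row pitch: 0 means tightly packed rows
                               buffer,   // host_ptr: pixels copied to the device
                               &err);

With CL_MEM_COPY_HOST_PTR the runtime copies the bytes when the image is created, so buffer can be freed once the call returns; with CL_MEM_USE_HOST_PTR it would have to stay alive.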

Related

What's "pitch" in cudaMemcpy2DToArray and cudaMemcpy2DFromArray

I'm converting the deprecated cudaMemcpyToArray and cudaMemcpyFromArray into cudaMemcpy2DToArray and cudaMemcpy2DFromArray. Rather than the size taken by the deprecated calls, the new API asks for width, height, and pitch. The descriptions of spitch and dpitch are correspondingly "Pitch of source memory" and "Pitch of destination memory". I wonder what those values are: the size of data items, or something else?
More specifically, if I were to copy W*H floats, should I have pitch=sizeof(float), width=W, height=H, or pitch=sizeof(float)*W, width=sizeof(float)*W, height=H, or something else?
It should be:
pitch = sizeof(float)*W
width = sizeof(float)*W
height = H
The above is for cudaMemcpy2DToArray, and assumes you are transferring from host to device, which would most likely involve an unpitched allocation in host memory as the source.
The pitch of a pitched allocation is the size in bytes of one line of a 2D allocation, including any padding bytes at the end of the line. It is the value returned by cudaMallocPitch, for example. For unpitched allocations it is still the width of the line, given by W*sizeof(element), where the 2D allocation is W elements wide, each of size sizeof(element).
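For concreteness, here is a hedged sketch of the call with those values (error checks omitted; the dimensions and names are illustrative):

#include <cuda_runtime.h>

const size_t W = 1024, H = 768;                 // illustrative dimensions
float* host_src = new float[W * H];             // unpitched host source
// ... fill host_src ...

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray_t array;
cudaMallocArray(&array, &desc, W, H);           // destination CUDA array

cudaMemcpy2DToArray(array,
                    0, 0,                       // wOffset, hOffset in the array
                    host_src,
                    W * sizeof(float),          // spitch: bytes per source row
                    W * sizeof(float),          // width of each row, in bytes
                    H,                          // height: number of rows
                    cudaMemcpyHostToDevice);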
This question and the link it refers to may also be of interest.

Char*** in OpenCL kernel argument?

I need to pass a vector<vector<string>> to an OpenCL kernel. What is the easiest way of doing it? Passing a char*** gives me an error:
__kernel void vadd(
    __global char*** sets,
    __global int* m,
    __global long* result)
{}
ERROR: clBuildProgram(CL_BUILD_PROGRAM_FAILURE)
In OpenCL 1.x, this sort of thing is basically not possible. You'll need to convert your data such that it fits into a single buffer object, or at least into a fixed number of buffer objects. Pointers on the host don't make sense on the device. (With OpenCL 2's SVM feature, you can pass pointer values between host and kernel code, but you'll still need to ensure the memory is allocated in a way that's appropriate for this.)
One option I can think of, bearing in mind I know nothing about the rest of your program, is as follows:
Add up the number of bytes required for all your strings (possibly including nul termination, depending on what you're trying to do) and create one OpenCL buffer of that size for all the strings.
Create a buffer for looking up string start offsets (and possibly length). Looks like you have 2 dimensions of lookup (nested vectors) so how you lay this out will depend on whether your inner vectors (second dimension) are all the same size or not.
Write your strings back-to-back into the first buffer, recording start offsets (and length, if necessary) in the second buffer.
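A hedged sketch of those steps on the host (the names flat, offsets, and lengths are illustrative, and nul terminators are omitted):

#include <string>
#include <vector>
#include <CL/cl.h> // for cl_uint

// Pack every string end-to-end and record where each one starts.
void flatten(const std::vector<std::vector<std::string> >& sets,
             std::string& flat,             // all string bytes, back-to-back
             std::vector<cl_uint>& offsets, // start of each string in flat
             std::vector<cl_uint>& lengths) // length of each string
{
    for (size_t i = 0; i < sets.size(); ++i)
        for (size_t j = 0; j < sets[i].size(); ++j) {
            offsets.push_back((cl_uint)flat.size());
            lengths.push_back((cl_uint)sets[i][j].size());
            flat += sets[i][j];
        }
}

flat, offsets, and lengths then each become an ordinary cl::Buffer, and the kernel signature turns into something like __kernel void vadd(__global const char* flat, __global const uint* offsets, __global const uint* lengths, __global int* m, __global long* result), which OpenCL 1.x accepts. If the inner vectors are not all the same size, you also need a small table of row start indices for the second lookup dimension.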

Using structure as buffer holder

In my current OpenCL implementation, I wanted to save time with arguments: avoid passing them every time I want to use a buffer inside a kernel, and have a shorter argument list for my kernels.
So I made a structure (a workspace) that holds pointers to the buffers in device memory. The struct acts like an object with member variables that you want to access over time and that you want to stay alive for the whole execution. I have never had a problem on AMD GPUs or even on CPUs, but Nvidia causes a lot of problems with this: it always seems to be an alignment problem, never reaching the right buffer, and so on.
Here is some code to help; see the question below:
The structure defined on the host:
#define SRC_IMG 0 // (float4 buffer) Source image
#define LAB_IMG 1 // (float4 buffer) LAB image

// NOTE: The size of this array should be the last define + 1.
#define __WRKSPC_SIZE__ 2

// Structure defined on host.
struct Workspace
{
    cl_ulong getPtr[__WRKSPC_SIZE__];
};

struct HostWorkspace
{
    cl::Buffer srcImg;
    cl::Buffer labImg;
};
The structure defined on device:
typedef struct __attribute__(( packed )) gpuWorkspace
{
    ulong getPtr[__WRKSPC_SIZE__];
} gpuWorkspace_t;
Note that on the device I use ulong, while on the host I use cl_ulong, as shown here: OpenCL: using struct as kernel argument.
So once the cl::Buffer objects for the source image and the LAB image are created, I save them into a HostWorkspace object; until that object is released, the references to the cl::Buffer objects are kept, so the buffers exist for the whole execution on the host, and de facto on the device.
Now I need to feed those to the device, so I have a simple kernel which initializes my device workspace as follows:
__kernel void Workspace_Init(__global gpuWorkspace_t* wrkspc,
                             __global float4* src,
                             __global float4* LAB)
{
    // Get the ulong pointer on the first element of each buffer.
    wrkspc->getPtr[SRC_IMG] = &src[0];
    wrkspc->getPtr[LAB_IMG] = &LAB[0];
}
where wrkspc is a buffer allocated with struct Workspace, and src + LAB are just buffers allocated as 1D arrays of image data.
Afterwards, in any of my kernels, if I want to use src or LAB, I do as follows:
__kernel void ComputeLABFromSrc(__global gpuWorkspace_t* wrkSpc)
{
    // =============================================================
    // Get pointer from work space.
    // =============================================================
    // Cast back the pointer of the first element as a normal buffer
    // you want to use along the execution of the kernel.
    __global float4* srcData = ( __global float4* )( wrkSpc->getPtr[SRC_IMG] );
    __global float4* labData = ( __global float4* )( wrkSpc->getPtr[LAB_IMG] );

    // Code kernel as usual.
}
When I started using this, I had around 4-5 images, and it was going well, with a different structure like this:

struct Workspace
{
    cl_ulong imgPtr;
    cl_ulong labPtr;
};

where each image had its own pointer.
At a certain point I reached more images, and I had some problems. So I searched online, and I found a recommendation that the sizeof() of the struct could differ between device and host, so I changed it to a single array of a single type, and this worked fine up to 16 elements.
So I searched more, and I found a recommendation about __attribute__((packed)), which I put on the device structure (see above). But now I have reached 26 elements, and when I check the sizeof of the struct either on device or on host, the size is 208 (elements * sizeof(cl_ulong) == 26 * 8). Yet I still have an issue similar to the previous model: my pointers end up reading somewhere in the middle of the previous image, etc.
So I am wondering whether anyone has ever tried a similar model (maybe with a different approach), or has any tips to make this model "solid".
Note that all kernels are correctly coded; I get good results when executing on AMD or on CPU with the same code. The only issue is on Nvidia.
Don't try to store GPU-side pointer values across kernel boundaries. They are not guaranteed to stay the same. Always use indices. And if a kernel uses a specific buffer, you need to pass it as an argument to that kernel.
References:
The OpenCL 1.2 specification (as far as I'm aware, nvidia does not implement a newer standard) does not define the behaviour of pointer-to-integer casts or vice versa.
Section 6.9p states: "Arguments to kernel functions that are declared to be a struct or union do not allow OpenCL objects to be passed as elements of the struct or union." This is exactly what you are attempting to do: passing a struct of buffers to a kernel.
Section 6.9a states: "Arguments to kernel functions in a program cannot be declared as a pointer to a pointer(s)." - This is essentially what you're trying to subvert by casting your pointers to an integer and back. (point 1) You can't "trick" OpenCL into this being well-defined by bypassing the type system.
As I suggest in the comment thread below, you will need to use indices to save positions inside a buffer object. If you want to store positions across different memory regions, you'll need to either unify multiple buffers into one and save one index into this giant buffer, or save a numeric value that identifies which buffer you are referring to.
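A hedged sketch of what that index-based layout could look like (the single pooled buffer and the name offset are illustrative, not from the original code):

// Device side: the workspace stores element offsets, never device pointers.
typedef struct __attribute__(( packed ))
{
    ulong offset[__WRKSPC_SIZE__]; // element index into the unified pool
} gpuWorkspace_t;

__kernel void ComputeLABFromSrc(__global const gpuWorkspace_t* wrkSpc,
                                __global float4* pool) // single unified buffer
{
    __global float4* srcData = pool + wrkSpc->offset[SRC_IMG];
    __global float4* labData = pool + wrkSpc->offset[LAB_IMG];
    // Code kernel as usual.
}

The host writes plain integers into offset[] when it sub-allocates regions of the pooled buffer, so no device-address-dependent value ever has to survive across kernel launches.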

OpenCL: How to find out maximum size of Image3D that fits on GPU

When using OpenCL, one can check the maximum theoretical size of an image3D on a device by querying clGetDeviceInfo with
CL_DEVICE_IMAGE3D_MAX_WIDTH
CL_DEVICE_IMAGE3D_MAX_DEPTH
CL_DEVICE_IMAGE3D_MAX_HEIGHT
but for example my GPU does not provide enough memory to allocate a 3D RGBA image of that size. So obviously, when I try to push such an image, an OutOfResources error occurs.
My question: Given a vector<float> which contains an image with dimensions <2048, how do I check if it fits on my GPU?
The background of my question is, that I would like to split up the image otherwise in order to process it in parts.
You can use CL_DEVICE_MAX_MEM_ALLOC_SIZE, which returns the maximum size of a memory object allocation in bytes.
Although this doesn't take into account the currently used memory, you could try to do it yourself by keeping track of the allocations you're doing and checking against CL_DEVICE_GLOBAL_MEM_SIZE.
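A minimal sketch of that check (C API; device is assumed to be an existing cl_device_id, and the image is assumed to be RGBA float):

cl_ulong max_alloc = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                sizeof(max_alloc), &max_alloc, NULL);

size_t w = 2048, h = 2048, d = 2048;                        // illustrative dimensions
cl_ulong needed = (cl_ulong)w * h * d * 4 * sizeof(float);  // 4 channels * 4 bytes
if (needed > max_alloc) {
    // Too big for a single allocation: split the volume into parts.
}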

How and when should I use pitched pointer with the cuda API?

I have quite a good understanding of how to allocate and copy linear memory with cudaMalloc() and cudaMemcpy(). However, when I want to use the CUDA functions to allocate and copy 2D or 3D matrices, I am often befuddled by the various arguments, especially the pitched pointers which are always present when dealing with 2D/3D arrays. The documentation is good at providing a couple of examples of how to use them, but it assumes I am familiar with the notions of padding and pitch, which I am not.
I usually end up tweaking the various examples I find in the documentation or somewhere else on the web, but the blind debugging that follows is quite painful, so my question is:
What is a pitch? How do I use it? How do I allocate and copy 2D and 3D arrays in CUDA?
Here is an explanation of pitched pointers and padding in CUDA.
Linear memory vs padded memory
First, let's start with the reason for the existence of non-linear memory. When allocating memory with cudaMalloc, the result is like an allocation with malloc: we have a contiguous memory chunk of the specified size and we can put anything we want in it. If we want to allocate a vector of 10000 floats, we simply do:
float* myVector;
cudaMalloc(&myVector, 10000*sizeof(float));
and then access the i-th element of myVector by classic indexing:
float element = myVector[i];
and if we want to access the next element, we just do:
float next_element = myVector[i+1];
This works fine because accessing an element right next to the previous one is (for reasons I am not aware of, and don't wish to be for now) cheap.
Things become a little different when we use our memory as a 2D array. Let's say our vector of 10000 floats is in fact a 100x100 array. We can allocate it by using the same cudaMalloc function, and if we want to read the i-th row, we do:
float* myArray;
cudaMalloc(&myArray, 10000*sizeof(float));
float row[100]; // number of columns
for (int j=0; j<100; ++j)
    row[j] = myArray[i*100+j];
Word alignment
So we have to read memory from myArray+100*i to myArray+100*i+99. The number of memory access operations this takes depends on the number of memory words this row spans. The number of bytes in a memory word depends on the implementation. To minimize the number of memory accesses when reading a single row, we must ensure that the row starts at the start of a word; hence we must pad the memory of every row with extra bytes until the next row starts at the start of a new word.
Bank conflicts
Another reason for padding arrays is the bank mechanism in CUDA, concerning shared memory access. When an array is in shared memory, it is split into several memory banks. Two CUDA threads can access it simultaneously, provided they don't access memory belonging to the same bank. Since we usually want to treat each row in parallel, we can ensure simultaneous access by padding each row so that it starts at the beginning of a new bank.
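As a standard illustration of that idea (a sketch of my own, not from the original answer): in a shared-memory matrix transpose, padding each tile row by one element shifts consecutive rows into different banks, so reading a column of the tile no longer hits a single bank.

#define TILE 32
__global__ void transpose(const float* in, float* out, int n)
{
    // +1 padding column: rows start in different banks, so reading a
    // column of the tile touches 32 different banks instead of one.
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}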
Now, instead of allocating the 2D array with cudaMalloc, we will use cudaMallocPitch:
size_t pitch;
float* myArray;
cudaMallocPitch((void**)&myArray, &pitch, 100*sizeof(float), 100); // width in bytes, then height
Note that the pitch here is the return value of the function: cudaMallocPitch checks what it should be on your system and returns the appropriate value. What cudaMallocPitch does is the following:
Allocate the first row.
Check if the number of bytes allocated makes it correctly aligned. For example that it is a multiple of 128.
If not, allocate further bytes to reach the next multiple of 128. The pitch is then the number of bytes allocated for a single row, including the extra (padding) bytes.
Reiterate for each row.
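For illustration, the rounding in the middle steps amounts to the following (assuming a 128-byte alignment requirement; the real value is device-dependent):

size_t row_bytes = 100 * sizeof(float);                        // 400 bytes requested
size_t align     = 128;                                        // assumed alignment
size_t pitch     = ((row_bytes + align - 1) / align) * align;  // rounds up to 512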
At the end, we have typically allocated more memory than necessary, because each row is now pitch bytes long, and not w*sizeof(float) (where w is the row width in elements).
But now, when we want to access an element at a given row and column, we must do:
float* row_start = (float*)((char*)myArray + row * pitch);
float column_element = row_start[column];
The offset in bytes between two successive rows can no longer be deduced from the size of our array; that is why we want to keep the pitch returned by cudaMallocPitch. And since the pitch is a multiple of the padding size (typically the biggest of the word size and bank size), it works great. Yay.
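Inside a kernel, the indexing is the same (a minimal sketch; the kernel and its launch configuration are illustrative):

__global__ void scale(float* array, size_t pitch, int width, int height)
{
    int column = blockIdx.x * blockDim.x + threadIdx.x;
    int row    = blockIdx.y * blockDim.y + threadIdx.y;
    if (column < width && row < height) {
        // pitch is in bytes, so compute the row offset on a char*
        float* row_start = (float*)((char*)array + row * pitch);
        row_start[column] *= 2.0f;
    }
}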
Copying data to/from pitched memory
Now that we know how to create and access a single element in an array created by cudaMallocPitch, we might want to copy a whole chunk of it to and from other memory, linear or not.
Let's say we want to copy our array into a 100x100 array allocated on our host with malloc:
float* host_memory = (float*)malloc(100*100*sizeof(float));
If we use cudaMemcpy, we will copy all the memory allocated with cudaMallocPitch, including the padding bytes between rows. What we must do to avoid copying the padding is copy each row one by one. We can do it manually:
for (size_t i=0; i<100; ++i) {
    cudaMemcpy(host_memory + i*100, (char*)myArray + pitch*i,
               100*sizeof(float), cudaMemcpyDeviceToHost);
}
Or we can tell the CUDA API that we want only the useful memory from the memory we allocated with padding bytes for its convenience so if it could deal with its own mess automatically it would be very nice indeed, thank you. And here enters cudaMemcpy2D:
cudaMemcpy2D(host_memory, 100*sizeof(float)/*no pitch on host*/,
             myArray, pitch/*CUDA pitch*/,
             100*sizeof(float)/*width in bytes*/, 100/*height*/,
             cudaMemcpyDeviceToHost);
Now the copy will be done automatically. It will copy the number of bytes specified by width (here: 100*sizeof(float)), height times (here: 100), advancing by pitch bytes every time it jumps to the next row. Note that we must still provide a pitch for the destination memory, because it could be padded too. Here it is not, so the pitch is that of a non-padded array: the size of a row. Note also that the width parameter in the memcpy function is expressed in bytes, while the height parameter is expressed in number of elements. That is because of the way the copy is done, somewhat like the manual copy I wrote above: the width is the size of each contiguous copy along a row (elements that are contiguous in memory), and the height is the number of times this operation must be performed. (These inconsistencies in units, as a physicist, annoy me very much.)
Dealing with 3D arrays
3D arrays are actually no different from 2D arrays: there is no additional padding. A 3D array is just a classical 2D array of padded rows. That is why, when allocating a 3D array, you get only one pitch, which is the difference in byte count between two successive points along a row. If you want to access successive points along the depth dimension, you can safely multiply the pitch by the number of rows, which gives you the slicePitch.
The CUDA API for accessing 3D memory is slightly different from the one for 2D memory, but the idea is the same:
When using cudaMalloc3D, you receive a pitch value that you must carefully keep for subsequent access to the memory.
When copying a 3D memory chunk, you cannot use cudaMemcpy unless you are copying a single row. You must use a copy utility provided by the CUDA API that takes the pitch into account, such as cudaMemcpy3D.
When you copy your data to/from linear memory, you must provide a pitch for your pointer even though it seems irrelevant: that pitch is the size of a row, expressed in bytes.
The size parameters are expressed in bytes for the row size, and in number of elements for the column and depth dimensions.
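To make that concrete, here is a hedged sketch using cudaMalloc3D and cudaMemcpy3D for a 100x100x100 float volume (error checks omitted; the sizes are illustrative):

#include <cuda_runtime.h>
#include <cstdlib>

cudaExtent extent = make_cudaExtent(100 * sizeof(float), 100, 100); // width in bytes, height, depth
cudaPitchedPtr devVolume;
cudaMalloc3D(&devVolume, extent);            // devVolume.pitch is the row pitch to keep

float* host_volume = (float*)malloc(100 * 100 * 100 * sizeof(float));

cudaMemcpy3DParms parms = {0};
parms.srcPtr = devVolume;                                   // pitched device source
parms.dstPtr = make_cudaPitchedPtr(host_volume,
                                   100 * sizeof(float),     // host "pitch" = row size in bytes
                                   100, 100);               // width (elements), height
parms.extent = extent;
parms.kind   = cudaMemcpyDeviceToHost;
cudaMemcpy3D(&parms);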