OpenCL get_global_id - c++

I'm trying to port a piece of OpenCL kernel code over to SideFX Houdini using
its internal scripting language called VEX (which stands for vector expression).
However, I'm having trouble understanding what those indexes do and how they work.
I understand that get_global_id() returns the index into the work for a given work item (I read that somewhere), but I don't really understand what that means (perhaps something to do with the computer's cores, I guess?).
So, assuming the input is a 2D grid of 500 pixels in x and y,
and assuming every pixel has some attributes (the ones I pass in as kernel arguments named name_in, while the name_out ones update the same attribute values), what is the kernel doing with those index operations?
How exactly does it work, and how could I do the same in C, for example?
Many thanks in advance,
Alessandro
__kernel void rd_compute(__global float4 *a_in, __global float4 *b_in, __global float4 *c_in, __global float4 *d_in, __global float4 *e_in, __global float4 *f_in, __global float4 *g_in, __global float4 *h_in, __global float4 *i_in, __global float4 *a_out, __global float4 *b_out, __global float4 *c_out, __global float4 *d_out, __global float4 *e_out, __global float4 *f_out, __global float4 *g_out, __global float4 *h_out, __global float4 *i_out)
{
const int index_x = get_global_id(0);
const int index_y = get_global_id(1);
const int index_z = get_global_id(2);
const int X = get_global_size(0);
const int Y = get_global_size(1);
const int Z = get_global_size(2);
const int index_here = X*(Y*index_z + index_y) + index_x;

Please study many of the great introductory tutorials.
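In serial code, if you used a loop (e.g., for (int i=0; i<10; i++)), then int i = get_global_id(0) replaces it: it gives you the index of the current work item. The runtime ensures that all work items are run. They might run in parallel, in serial, or in groups (some combination). In your kernel, get_global_size(d) is the number of work items along dimension d, and index_here flattens the (x, y, z) position of the work item into a single offset into the 1D buffers.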
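To make the loop analogy concrete, here is a minimal serial sketch in C. It assumes the launch covers an X by Y by Z range; the function names linear_index and run_serial are made up for illustration and are not part of OpenCL or of the original kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors index_here = X*(Y*index_z + index_y) + index_x from the kernel:
   flattens a 3D coordinate into an offset into a 1D buffer. */
size_t linear_index(size_t x, size_t y, size_t z, size_t X, size_t Y)
{
    return X * (Y * z + y) + x;
}

/* Serial stand-in for the kernel launch: each innermost iteration plays the
   role of one work item. Here every "work item" just stores its own linear
   index into out, which must have room for X*Y*Z entries. */
void run_serial(size_t X, size_t Y, size_t Z, size_t *out)
{
    assert(out != NULL);
    for (size_t z = 0; z < Z; z++) {          /* index_z = get_global_id(2) */
        for (size_t y = 0; y < Y; y++) {      /* index_y = get_global_id(1) */
            for (size_t x = 0; x < X; x++) {  /* index_x = get_global_id(0) */
                size_t here = linear_index(x, y, z, X, Y);
                out[here] = here;
            }
        }
    }
}
```

For the 500 by 500 grid in the question, Z would be 1, so index_z is always 0 and the pixel at column 3, row 2 lands at offset 500*2 + 3.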

Related

OpenCL casting to convert float4 components into uchars

I have an OpenCL program that calculates pixel RGB values. Declaration as follows.
__kernel void pixel_kernel( __global uchar* r,
__global uchar* g,
__global uchar* b,
__global uchar* a,
__global int* width,
__global int* height)
During the program a float4 col variable is created and calculated. So I want to extract the RGB components and return them as the r g b and a uchar types.
At the end of the code I have
r[x]=255;
g[x]=0;
b[x]=0;
Which happily compiles and the returned color is red.
If I try to convert the float4 values into RGB, I cannot work out how to cast them. For example, the following results in a compilation error and the CL program does not run:
r[x]=(uchar)(col[0]*255);
g[x]=(uchar)(col[1]*255);
b[x]=(uchar)(col[2]*255);
What am I missing? How should this cast be declared so it correctly converts the float RGB components into uchar values between 0 and 255?
Must be a simple fix, but I have tried all permutations of casting I can think of and none of them seem to want to work. Thanks for any tips.
The OpenCL float4 data type contains four floats. To address these components, you can use either .x, .y, .z, .w or .s0, .s1, .s2, .s3:
float4 col = (float4)(0.1f, 0.2f, 0.3f, 0.4f);
r[x]=(uchar)(col.s0*255);
g[x]=(uchar)(col.s1*255);
b[x]=(uchar)(col.s2*255);
a[x]=(uchar)(col.s3*255);
float4 col; is not the same as a 4-vector float col[4];, but more like a C99 struct; this is why addressing like col[0] does not work with float4. See also the OpenCL 1.2 Reference Card page 3.
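One caveat with the plain cast: a float outside the 0..1 range, once scaled, no longer fits in a uchar. Inside a kernel, the built-in convert_uchar_sat is one way to clamp. The host-side C sketch below only illustrates the same clamping idea; float4_t, channel_to_u8 and pack_rgba are hypothetical names, not OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-in for OpenCL's float4. */
typedef struct { float x, y, z, w; } float4_t;

/* Maps a colour channel in [0, 1] to [0, 255], clamping out-of-range
   input and rounding to the nearest integer. */
unsigned char channel_to_u8(float v)
{
    if (!(v > 0.0f)) return 0;    /* negative values, zero and NaN */
    if (v >= 1.0f) return 255;
    return (unsigned char)(v * 255.0f + 0.5f);
}

/* Writes the four components of col as bytes into out[0..3]. */
void pack_rgba(float4_t col, unsigned char *out)
{
    assert(out != NULL);
    out[0] = channel_to_u8(col.x);
    out[1] = channel_to_u8(col.y);
    out[2] = channel_to_u8(col.z);
    out[3] = channel_to_u8(col.w);
}
```

Whether to round or truncate is a design choice; the cast in the answer above truncates, while this sketch rounds.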

What is the maximum allowable size of a local float array?

For OpenCL, specifically:
What is the maximum size that a local float array can be?
I set up the kernel like this:
__kernel void mykern( unsigned int N, __global float* input, __global float* output, __local float* sdata )
{
// ...
}
What is the maximum that I can set the size of sdata to be (in OpenCL)?
I did the following in C++ OpenCL:
clSetKernelArg(kf_myvred,3,(lws[0])*sizeof(cl_float),NULL);
clEnqueueNDRangeKernel(mycommandq,kf_myvred,1,NULL,work,lws,0,NULL,NULL);
If the size is too big, then clEnqueueNDRangeKernel returns an error of CL_OUT_OF_RESOURCES. But I'm not sure what the limit is.
Use clGetDeviceInfo with the CL_DEVICE_LOCAL_MEM_SIZE parameter to query the local memory size (in bytes) of your OpenCL device. Typically that is between 32 and 64 KB, so a 32 KB device can hold at most 8192 floats in sdata.
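Turning the queried byte count into an element count is simple arithmetic. The sketch below assumes the byte count has already been obtained on the host (the query is shown only in a comment); max_local_floats and the reserved_bytes parameter are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* local_mem_bytes: the value the host got back from
 *     clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
 *                     sizeof(cl_ulong), &local_mem_bytes, NULL);
 * reserved_bytes:  local memory already used elsewhere, e.g. by __local
 *                  arrays declared inside the kernel (hypothetical
 *                  parameter; pass 0 if there is nothing else).
 * Returns how many floats still fit in a __local float* argument. */
size_t max_local_floats(unsigned long long local_mem_bytes,
                        unsigned long long reserved_bytes)
{
    assert(sizeof(float) == 4);  /* cl_float is 32 bits; assume host float matches */
    if (reserved_bytes >= local_mem_bytes) {
        return 0;
    }
    return (size_t)((local_mem_bytes - reserved_bytes) / sizeof(float));
}
```

The result is an upper bound for lws[0] in the clSetKernelArg call from the question; the work-group size has its own separate device limit.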

Assigning float4 to float array opencl

I am trying to optimize a simple OpenCL kernel by using float4 instead of float.
This is the example code without float4:
__kernel void Substract (
__global const float* data,
const float val,
__global float* result
){
size_t gi = get_global_id(0);
float input_val = data[gi];
result[gi] = val - input_val;
}
My idea for float4:
__kernel void substract (
__global const float* data,
const float val,
__global float* result
){
size_t gi = get_global_id(0);
float4 val2 = float4 (val,val,val,val);
float4 input_val = data[gi*4];
result[gi] = val2 - input_val;
}
However, this does not work, because we cannot write a float4 result back into a float array. Is there a performant way to write a float4 back to a normal float array in OpenCL? The simple idea would be a for loop with 4 iterations.
I want to optimize the kernel for both GPU and CPU.
So if I have one variant with float4 and one without, both should run with the exact same kernel arguments. Is this possible?
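You can just declare your arguments as float4 pointers without changing the buffers on the host. Only the global work size changes: each work item now handles four floats, so you launch a quarter as many work items, and the element count should be a multiple of 4. Also, the compiler should automatically widen scalar values if they are used in expressions containing vectors, so you don't need to manually create a float4 version of val: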
__kernel void Substract (
__global const float4* data,
const float val,
__global float4* result
){
size_t gi = get_global_id(0);
float4 input_val = data[gi];
result[gi] = val - input_val;
}
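To see what the float4 version does to the indexing, here is a serial C sketch of the same computation. It is only an emulation under the assumption that one "work item" covers four consecutive floats; subtract_by_fours is a made-up name, and the tail loop is an addition for element counts that are not a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Serial emulation of the float4 kernel: the outer loop index gi plays the
   role of get_global_id(0), and the inner loop covers the four lanes that a
   single float4 load/store would handle at once. */
void subtract_by_fours(const float *data, float val, float *result, size_t n)
{
    assert(data != NULL && result != NULL);

    size_t items = n / 4;  /* number of float4 "work items" */
    for (size_t gi = 0; gi < items; gi++) {
        for (size_t lane = 0; lane < 4; lane++) {
            result[gi * 4 + lane] = val - data[gi * 4 + lane];
        }
    }

    /* Leftover elements when n is not a multiple of 4. */
    for (size_t i = items * 4; i < n; i++) {
        result[i] = val - data[i];
    }
}
```

With n = 1024 floats, the float4 kernel would be enqueued with a global size of 256 rather than 1024.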

OpenCL: Downsampling with bilinear interpolation

I have a problem with downsampling an image using bilinear interpolation. I've read almost all the relevant articles on Stack Overflow and searched around on Google, trying to solve or at least find the problem in my OpenCL kernel. This is my main source for the theory. I then implemented this code in OpenCL:
__kernel void downsample(__global uchar* image, __global uchar* outputImage, __global int* width, __global int* height, __global float* factor){
//image vector containing original RGB values
//outputImage vector containing "downsampled" RGB mean values
//factor - downsampling factor, downscaling the image by factor: 1024*1024 -> 1024/factor * 1024/factor
int r = get_global_id(0);
int c = get_global_id(1); //current coordinates
int oWidth = get_global_size(0);
int olc, ohc, olr, ohr; //coordinates of the original image used for bilinear interpolation
int index; //linearized index of the point
uchar q11, q12, q21, q22;
float accurate_c, accurate_r; //the exact scaled point
int k;
accurate_c = convert_float(c*factor[0]);
olc=convert_int(accurate_c);
ohc=olc+1;
if(!(ohc<width[0]))
ohc=olc;
accurate_r = convert_float(r*factor[0]);
olr=convert_int(accurate_r);
ohr=olr+1;
if(!(ohr<height[0]))
ohr=olr;
index= (c + r*oWidth)*3; //3 bytes per pixel
//Compute RGB values: take a central mean RGB values among four points
for(k=0; k<3; k++){
q11=image[(olc + olr*width[0])*3+k];
q12=image[(olc + ohr*width[0])*3+k];
q21=image[(ohc + olr*width[0])*3+k];
q22=image[(ohc + ohr*width[0])*3+k];
outputImage[index+k] = convert_uchar((q11*(ohc - accurate_c)*(ohr - accurate_r) +
q21*(accurate_c - olc)*(ohr - accurate_r) +
q12*(ohc - accurate_c)*(accurate_r - olr) +
q22*(accurate_c - olc)*(accurate_r - olr)));
}
}
The kernel works with factor = 2, 4, 5, 6 but not with factor = 3, 7 (I get missing pixels, and the image appears a little bit skewed), whereas the "identical" code written in C++ works fine with all factor values. I can't explain to myself why that happens in OpenCL. I attach my full code project here

Change OpenCL function to C++

I am trying to write some code in C++, and after some searching on the internet I found OpenCL-based code that does exactly what I want to do in C++. But since this is the first time I've seen OpenCL code, I don't know how to translate the following into C++:
const __global float4 *in_buf;
int x = get_global_id(0);
int y = get_global_id(1);
float result = y * get_global_size(0);
Is 'const __global float4 *in_buf' equivalent to 'const float *in_buf' in C++? And how should I translate the other functions above? Could anyone help? Thanks.
In general, you should take a look at the OpenCL specification (I'm assuming it's written in OpenCL 1.x) to better understand functions, types and how a kernel works.
Specifically for your question:
get_global_id returns the id of the current work item along the given dimension, and get_global_size returns the total number of work items along that dimension. Since an OpenCL work-item is roughly equivalent to a single iteration in a sequential language, the equivalent of OpenCL's:
int x = get_global_id(0);
int y = get_global_id(1);
// do something with x and y
float result = y * get_global_size(0);
Will be C's:
for (int x = 0; x < dim0; x++) {
for (int y = 0; y < dim1; y++) {
// do something with x and y
float result = y * dim0;
}
}
As for float4, it's a vector type of 4 floats, roughly equivalent to C's float[4] (except that it supports many additional operators, such as vector arithmetic). Of course, in this case it's a buffer, so an appropriate type would be a pointer to arrays of four floats (float (*)[4]), or better yet, just pack them together into a float* buffer and then load 4 at a time.
Feel free to ignore the __global modifier.
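The "pack into a float* buffer and load 4 at a time" idea could look like the following C sketch. The names float4_t, load4 and sum_x are hypothetical, and the row-major layout (index y * dim0 + x) is an assumption about how the original kernel addresses in_buf.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C stand-in for OpenCL's float4. */
typedef struct { float x, y, z, w; } float4_t;

/* Loads the i-th group of four floats from a packed float buffer,
   i.e. what in_buf[i] would yield for a float4 pointer. */
float4_t load4(const float *buf, size_t i)
{
    assert(buf != NULL);
    const float *p = buf + 4 * i;
    float4_t v = { p[0], p[1], p[2], p[3] };
    return v;
}

/* Example "kernel body" run serially over a dim0 x dim1 range:
   adds up the x component of every element. */
float sum_x(const float *in_buf, size_t dim0, size_t dim1)
{
    float total = 0.0f;
    for (size_t y = 0; y < dim1; y++) {      /* get_global_id(1) */
        for (size_t x = 0; x < dim0; x++) {  /* get_global_id(0) */
            total += load4(in_buf, y * dim0 + x).x;
        }
    }
    return total;
}
```

A buffer holding dim0 * dim1 float4 values therefore needs 4 * dim0 * dim1 floats on the C side.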
const __global float4 *in_buf is not equivalent to const float *in_buf.
OpenCL uses vector variables, e.g. floatN, where N is e.g. 2, 4, 8. So float4 is in effect struct { float x, y, z, w; } with a lot of tricks available to express vector operations.
get_global_id(0) gives you the iterator variable, so essentially replace every get_global_id(dim) with a loop such as for (int x = 0; x < max[dim]; x++), where max[dim] is what get_global_size(dim) would return.