Assigning float4 to float array opencl - c++

I am trying to optimize a simple OpenCL kernel by using float4 instead of float.
This is the example code without float4:
__kernel void Substract (
__global const float* data,
const float val,
__global float* result
){
size_t gi = get_global_id(0);
float input_val = data[gi];
result[gi] = val - input_val;
}
My idea for float4:
__kernel void substract (
__global const float* data,
const float val,
__global float* result
){
size_t gi = get_global_id(0);
float4 val2 = float4 (val,val,val,val);
float4 input_val = data[gi*4];
result[gi] = val2 - input_val;
}
However, this does not work, because a float4 result cannot be written back into a float array. Is there a performant way to write a float4 back to a normal float array in OpenCL? The simple idea would be a for loop with 4 iterations.
I want to optimize the kernel for both GPU and CPU.
So if I have one variant with float4 and one without, both should run with the exact same kernel arguments. Is this possible?

You can simply declare the kernel arguments as float4 pointers, without changing the buffers or arguments on the host; only the global work size shrinks to a quarter of the element count, which must then be a multiple of 4. Also, the compiler automatically widens scalar values when they are used in expressions containing vectors, so you don't need to manually create a float4 version of val:
__kernel void Substract (
__global const float4* data,
const float val,
__global float4* result
){
size_t gi = get_global_id(0);
float4 input_val = data[gi];
result[gi] = val - input_val;
}
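To see why the same buffers work for both variants, here is a minimal host-side sketch in plain C (not OpenCL) that emulates the two kernels above with loops. The function names subtract_scalar and subtract_float4 are made up for illustration, and the sketch assumes the element count n is a multiple of 4, so that the float4 variant can run n/4 work-items over the same memory.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates the scalar kernel: one "work-item" per float. */
void subtract_scalar(const float *data, float val, float *result, size_t n)
{
    for (size_t gi = 0; gi < n; ++gi)
        result[gi] = val - data[gi];
}

/* Emulates the float4 kernel: one "work-item" per group of four floats.
 * Assumption: n is a multiple of 4, as required when the buffer is
 * reinterpreted as an array of float4. */
void subtract_float4(const float *data, float val, float *result, size_t n)
{
    assert(n % 4 == 0);
    for (size_t gi = 0; gi < n / 4; ++gi) {
        /* The inner loop stands in for the component-wise vector
         * operation val - input_val, with val widened to all four lanes. */
        for (size_t k = 0; k < 4; ++k)
            result[4 * gi + k] = val - data[4 * gi + k];
    }
}
```

Calling both functions on the same input array of, say, 8 floats should fill the two result arrays with identical values, which is the property the question asks for.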

Related

OpenCL casting to convert float4 components into uchars

I have an OpenCL program that calculates pixel RGB values. Declaration as follows.
__kernel void pixel_kernel( __global uchar* r,
__global uchar* g,
__global uchar* b,
__global uchar* a,
__global int* width,
__global int* height)
During the program a float4 col variable is created and calculated. So I want to extract the RGB components and return them as the r g b and a uchar types.
At the end of the code I have
r[x]=255;
g[x]=0;
b[x]=0;
Which happily compiles and the returned color is red.
If I try to convert the float4 values into RGB, I cannot work out how to cast them. For example, the following results in a compilation error and the OpenCL program does not run:
r[x]=(uchar)(col[0]*255);
g[x]=(uchar)(col[1]*255);
b[x]=(uchar)(col[2]*255);
What am I missing? How should this cast be declared so it correctly converts the float RGB components into uchar values between 0 and 255?
Must be a simple fix, but I have tried all permutations of casting I can think of and none of them seem to want to work. Thanks for any tips.
The OpenCL float4 data type contains four floats. To address these components, you can use either .x, .y, .z, .w or .s0, .s1, .s2, .s3:
float4 col = (float4)(0.1f, 0.2f, 0.3f, 0.4f);
r[x]=(uchar)(col.s0*255);
g[x]=(uchar)(col.s1*255);
b[x]=(uchar)(col.s2*255);
a[x]=(uchar)(col.s3*255);
float4 col; is not the same as a 4-vector float col[4];, but more like a C99 struct; this is why addressing like col[0] does not work with float4. See also the OpenCL 1.2 Reference Card page 3.
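The plain cast above truncates and does not guard against components outside [0, 1]. As a rough illustration of a more defensive conversion, here is a sketch in plain C (not OpenCL); float4_t, to_uchar_sat and pack_rgba are hypothetical names, and rounding to nearest is a design choice of this sketch rather than what the cast in the answer does. Inside a kernel, the built-in convert_uchar_sat offers a saturating conversion.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for OpenCL's float4, addressed by named
 * components just like col.x / col.y / col.z / col.w. */
typedef struct { float x, y, z, w; } float4_t;

/* Map a float in [0, 1] to 0..255, clamping out-of-range input
 * and rounding to the nearest integer. */
unsigned char to_uchar_sat(float v)
{
    if (!(v > 0.0f)) return 0;      /* negative values, zero and NaN */
    if (v >= 1.0f) return 255;
    return (unsigned char)(v * 255.0f + 0.5f);
}

/* Convert one colour to four bytes in r, g, b, a order. */
void pack_rgba(float4_t col, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = to_uchar_sat(col.x);
    out[1] = to_uchar_sat(col.y);
    out[2] = to_uchar_sat(col.z);
    out[3] = to_uchar_sat(col.w);
}
```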

OpenCL get_global_id

I'm trying to port a piece of OpenCL kernel code over to SideFX Houdini using
its internal scripting language called VEX (which stands for vector expression).
However, I'm having trouble understanding what those indexes do and how they work.
I understand that get_global_id() returns the index of a given work-item (I read that somewhere), but I don't really understand exactly what that is (perhaps something to do with the computer's cores, I guess?).
So, assuming the input is a 2D grid of 500 pixels in x and y,
and assuming every pixel has some attributes (the ones I pass in as kernel arguments with the _in suffix, while the ones with the _out suffix update the same attribute values), what is the kernel doing with those index operations?
How exactly does it work, and how could I do the same in C, for example?
Many thanks in advance,
Alessandro
__kernel void rd_compute(__global float4 *a_in, __global float4 *b_in, __global float4 *c_in, __global float4 *d_in, __global float4 *e_in, __global float4 *f_in, __global float4 *g_in, __global float4 *h_in, __global float4 *i_in, __global float4 *a_out, __global float4 *b_out, __global float4 *c_out, __global float4 *d_out, __global float4 *e_out, __global float4 *f_out, __global float4 *g_out, __global float4 *h_out, __global float4 *i_out)
{
const int index_x = get_global_id(0);
const int index_y = get_global_id(1);
const int index_z = get_global_id(2);
const int X = get_global_size(0);
const int Y = get_global_size(1);
const int Z = get_global_size(2);
const int index_here = X*(Y*index_z + index_y) + index_x;
Please study many of the great introductory tutorials.
In serial code if you used a loop (e.g., for (int i=0; i<10; i++)) then int i = get_global_id(0) replaces that so you can get the index of the current work item. The runtime ensures that all work items are run. They might be in parallel, in serial, or in groups (some combination).
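As a rough serial equivalent of the kernel fragment in the question, the sketch below replaces the three get_global_id calls with three nested loops and computes the same flattened index. It is plain C with hypothetical names (flat_index, run_grid); the hits array only records how often each flattened index is produced.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index as computed in the kernel: x varies fastest,
 * then y, then z. X and Y play the role of get_global_size(0) and (1). */
size_t flat_index(size_t x, size_t y, size_t z, size_t X, size_t Y)
{
    return X * (Y * z + y) + x;
}

/* Serial emulation of a 3D launch of X * Y * Z work-items.
 * Each loop variable stands in for one get_global_id(dim) call.
 * hits must have room for X * Y * Z counters. */
void run_grid(size_t X, size_t Y, size_t Z, int *hits)
{
    assert(hits != NULL);
    for (size_t z = 0; z < Z; ++z)
        for (size_t y = 0; y < Y; ++y)
            for (size_t x = 0; x < X; ++x)
                hits[flat_index(x, y, z, X, Y)] += 1;
}
```

For the 500 by 500 grid in the question, Z would be 1 and the index reduces to 500 * y + x, so every pixel gets exactly one index into the attribute arrays.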

OpenCL kernel only affecting portion of entire work-space

I have been working on a 2D implementation of SPH as found here: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/code/sph.pdf
I got it working CPU side, but found that without the ability to crank up the number of iterations, the quality of the simulation is not the best.
Hence I decided to port it to OpenCL 1.2 using the C++ bindings. My kernel is compiling, and the data is being written into and read from the buffers perfectly well.
However, due to my unfamiliarity with the GPU architecture and how index-spaces and work-groups are laid out, and the scarcity of resources that address this particular topic, I've been clawing in the dark somewhat when it comes to making sure that my kernel code is doing what I think it is doing.
The problem that I am encountering is that only one work-group of particles seems to be updated the way it should.
My guess is that I am updating acceleration wrong, but I am not certain how to accomplish it given that I need to compare every work-item to every other work item, and when I try to perform the operation through the naive method the simulation ends up exploding.
I am trying to iterate through the work-groups(blocks) and then through the individual work-items in each block, but the effect seems to be that only 1 block is ever updated.
Suggestions? Ideas? Resources?
Would welcome any input at this stage.
Thanks!
Code for the kernel is attached below.
void Density_Calculation(
float ConstantDensitySumTerm,
float ConstantDensityKernelTerm,
float eps,
float H2,
global float4* position,
global float* density,
local float4* pblock
)
{
// Id of this work-item in global index space
int gid = get_global_id(0);
// Id of this work-item within its work group
int tid = get_local_id(0);
int globalSize = get_global_size(0);
int localSize = get_local_size(0);
int numTiles = globalSize/localSize;
// Zero out the density term of this work-item
density[gid] = 0;
density[gid] += ConstantDensitySumTerm;
float4 thisPosition = position[gid];
float densityTerm = 0.0;
// Outer loop iterates over all the work-group blocks
for(int i = 0; i < numTiles; ++i)
{
// Cache the particle position within the work-group
pblock[tid] = position[(i * localSize) + tid];
// synchronize to make sure data is available for processing
barrier(CLK_LOCAL_MEM_FENCE);
// Inner loop iterates over the work-items in each work-group, after all positions have been cached
for(int j = 0; j < localSize; ++j)
{
float4 otherPosition = pblock[j];
float4 deltaPosition = thisPosition - otherPosition;
float r2 = (deltaPosition.x * deltaPosition.x) + (deltaPosition.y * deltaPosition.y) + (deltaPosition.z * deltaPosition.z);
float z = (H2 - r2) + eps;
if(z > 0)
{
float rho_ij = ConstantDensityKernelTerm * z * z * z;
densityTerm += rho_ij;
}
}
// Synchronize so that next tile can be loaded
barrier(CLK_LOCAL_MEM_FENCE);
}
density[gid] += densityTerm;
}
void Acceleration_Calculation(
float eps,
float ConstantDensitySumTerm, float ConstantDensityKernelTerm,
float H2, float ReferenceDensity, float InteractionRadius,
float C0, float CP, float CV,
global float4* position,
global float4* velocity_full,
global float4* acceleration,
global float* density,
local float4* pblock
)
{
Density_Calculation(ConstantDensitySumTerm, ConstantDensityKernelTerm, eps, H2, position, density, pblock);
// Id of this work-item in global index space
int gid = get_global_id(0);
// Id of this work-item within its work group
int tid = get_local_id(0);
int globalSize = get_global_size(0);
int localSize = get_local_size(0);
int numTiles = globalSize/localSize;
// Set acceleration parameters
//acceleration[gid].x = 0.0;
acceleration[gid].y = -0.01;
float4 thisPosition = position[gid];
float4 thisVelocity = velocity_full[gid];
float rhoi = density[gid];
float accelerationTermX = 0.0;
float accelerationTermY = 0.0;
for(int i = 0; i < numTiles; ++i)
{
for(int j = 0; j < localSize; ++j)
{
float4 otherPosition = position[j];
float4 deltaPosition = thisPosition - otherPosition;
float r2 = (deltaPosition.x * deltaPosition.x) + (deltaPosition.y * deltaPosition.y) + (deltaPosition.z * deltaPosition.z);
if(r2 < (H2 + eps))
{
float rhoj = density[j];
float q = sqrt(r2) / InteractionRadius;
float u = 1 - q;
float w0 = C0 * (u / rhoi / rhoj);
float wP = w0 * CP * (rhoi + rhoj - (2 * ReferenceDensity)) * (u / q);
float wV = w0 * CV;
float4 deltaVelocity = thisVelocity - velocity_full[j];
accelerationTermX += (wP * deltaPosition.x) + (wV * deltaVelocity.x);
accelerationTermY += (wP * deltaPosition.y) + (wV * deltaVelocity.y);
}
}
}
acceleration[gid].x += accelerationTermX;
acceleration[gid].y += accelerationTermY;
}
void LeapfrogIntegrator(
const float4 dt,
global float4* position,
global float4* velocity_full,
global float4* velocity_half,
global float4* acceleration
)
{
// Id of this work-item in global index space
int gid = get_global_id(0);
velocity_half[gid] = velocity_full[gid] + (acceleration[gid] * (dt/2));
velocity_full[gid] += acceleration[gid] * dt;
position[gid] += velocity_full[gid] * dt;
}
void kernel SPH_kernel(
float dt, float eps,
float ConstantDensitySumTerm, float ConstantDensityKernelTerm,
float H2, float ReferenceDensity, float InteractionRadius,
float C0, float CP, float CV,
global float4* position,
global float4* velocity_half,
global float4* velocity_full,
global float4* acceleration,
global float* density,
local float4* pblock
)
{
const float4 dt4 = (float4)(dt,dt,dt,0.0f);
Acceleration_Calculation(eps, ConstantDensitySumTerm, ConstantDensityKernelTerm, H2, ReferenceDensity, InteractionRadius, C0, CP, CV,
position, velocity_full, acceleration, density, pblock);
LeapfrogIntegrator(dt4, position, velocity_full, velocity_half, acceleration);
}
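One detail in the code above that may be worth checking: Density_Calculation loads each tile from position[(i * localSize) + tid], but the loops in Acceleration_Calculation read position[j], density[j] and velocity_full[j] with j < localSize, so the tile index i is never used there. The sketch below, in plain C with hypothetical function names, counts which particle indices each indexing scheme visits. It only illustrates the indexing and is not a fix for the whole kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Indexing as in Acceleration_Calculation: the inner index ignores
 * the tile number i, so only the first local_size particles are read. */
void visit_without_tile(size_t num_tiles, size_t local_size, int *visits)
{
    assert(visits != NULL);
    for (size_t i = 0; i < num_tiles; ++i)
        for (size_t j = 0; j < local_size; ++j)
            visits[j] += 1;
}

/* Indexing that offsets by the tile number, matching the tile load
 * position[(i * localSize) + tid] in Density_Calculation. */
void visit_with_tile(size_t num_tiles, size_t local_size, int *visits)
{
    assert(visits != NULL);
    for (size_t i = 0; i < num_tiles; ++i)
        for (size_t j = 0; j < local_size; ++j)
            visits[i * local_size + j] += 1;
}
```

With 2 tiles of 4 particles, the first function visits particles 0 to 3 twice and never touches 4 to 7, while the second visits every particle once. That would be consistent with only one block of particles appearing to interact. Separately, barrier() only synchronizes work-items within one work-group, so density values written by other work-groups are not guaranteed to be complete when they are read later in the same kernel launch.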

MATLAB CUDA Kernel Object- Error using gather?

I have the following CUDAKernel object, which I invoke using:
kernel1 = parallel.gpu.CUDAKernel('kcc2.ptx', 'kcc2.cu');
kernel1.ThreadBlockSize = 256;
kernel1.GridSize = 4;
gpuTM = gpuArray(single(TM));
gpuLTM = gpuArray(single(LTM));
gpuLTMP = gpuArray(int32(LTMP));
rng('shuffle');
randz = abs(randi(2^53 -1, [1, r_max]));
GPUrands = gpuArray(double(randz));
[x,y] = gather(feval(kernel1, gpuLTM, gpuLTMP, F_M, Force, GPUrands, ...
(r_max), single(Lamda), single(Fixed_dt), single(r), single(q), ...
single(gama_B), single(gama_M), single(mu_B), single(mu_M), ...
single(KB_p_ref), single(KB_m_ref), single(f_ref), single(g_ref), ...
single(Kca_p_ref), single(Kca_m_ref)));
As you see above, I have 2 left hand arguments yet I get the error in MATLAB:
Error using gpuArray/gather:
Too many output arguments.
I don't get it. All my parameters line up in the CUDA kernel and in MATLAB. Just so you can see, the kernel function has the following C++ prototype:
__global__ void myKern(const float *transMatrix, const int *pointerMatrix,
float *masterForces, float *Force, const double *rands, const int r_max,
const float lamda, const float dt, const float r, const float q,
const float gama_B, const float gama_M, const float mu_B, const float mu_M,
const float KB_p_ref, const float KB_m_ref, const float f_ref,
const float g_ref, const float Kca_p_ref, const float Kca_m_ref)
It should only return masterForces and Force ([x,y] in MATLAB) since they are the only non-constant pointers.
What could be the problem?
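You can't apply gather directly to the multiple outputs of feval: a nested call passes only the first output of feval to gather, so gather receives one input and cannot return two outputs. You have to do it on separate lines (this is basic MATLAB syntax):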
[x,y] = feval(kernel1, ...);
x = gather(x);
y = gather(y);
The output of evaluating the CUDA kernel is two variables of type gpuArray (data stored on the GPU). You can then transfer the data to CPU memory using gather applied on each variable.

Change OpenCL function to C++

I am trying to write some code in C++, but after some searching on the internet I found OpenCL-based code that does exactly what I want to do in C++. Since this is the first time I have seen OpenCL code, I don't know how to convert the following into C++:
const __global float4 *in_buf;
int x = get_global_id(0);
int y = get_global_id(1);
float result = y * get_global_size(0);
Is 'const __global float4 *in_buf' equivalent to 'const float *in_buf' in C++? And how should I change the other functions above? Could anyone help? Thanks.
In general, you should take a look at the OpenCL specification (I'm assuming it's written in OpenCL 1.x) to better understand functions, types and how a kernel works.
Specifically for your question:
get_global_id(d) returns the ID of the current work-item in dimension d, and get_global_size(d) returns the total number of work-items in that dimension. Since an OpenCL work-item is roughly equivalent to a single iteration in a sequential language, the equivalent of OpenCL's:
int x = get_global_id(0);
int y = get_global_id(1);
// do something with x and y
float result = y * get_global_size(0);
Will be C's:
for (int x = 0; x < dim0; x++) {
for (int y = 0; y < dim1; y++) {
// do something with x and y
float result = y * dim0;
}
}
As for float4, it's a vector type of 4 floats, roughly equivalent to C's float[4] (except that it supports many additional operators, such as vector arithmetic). Of course in this case it's a buffer, so an appropriate type would be float** or float (*)[4], or better yet, just pack them together into a float* buffer and then load 4 at a time.
Feel free to ignore the __global modifier.
const __global float4 *in_buf is not equivalent to const float *in_buf.
OpenCL has vector types, e.g. floatN, where N is e.g. 2, 4, 8. So float4 is in effect struct { float x, y, z, w; } with a lot of tricks available to express vector operations.
get_global_id(0) gives you the iterator variable, so essentially replace every get_global_id(dim) with for(int x = 0; x< max[dim]; x++)
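Putting the two answers together, a possible C translation might look like the sketch below. The names float4_t, load4, add4 and sum_grid are hypothetical, and the summation is only a placeholder for whatever the original kernel does with each element; the sketch assumes in_buf is a packed float buffer holding dim0 * dim1 groups of four floats.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for OpenCL's float4. */
typedef struct { float x, y, z, w; } float4_t;

/* Load element `index` (four consecutive floats) from a packed buffer. */
float4_t load4(const float *buf, size_t index)
{
    assert(buf != NULL);
    const float *p = buf + 4 * index;
    float4_t v = { p[0], p[1], p[2], p[3] };
    return v;
}

/* Component-wise addition, standing in for OpenCL's a + b on float4. */
float4_t add4(float4_t a, float4_t b)
{
    float4_t r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Serial replacement for a 2D launch of dim0 x dim1 work-items:
 * x and y take the place of get_global_id(0) and get_global_id(1),
 * and dim0 takes the place of get_global_size(0). */
float4_t sum_grid(const float *in_buf, size_t dim0, size_t dim1)
{
    float4_t acc = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (size_t y = 0; y < dim1; ++y)
        for (size_t x = 0; x < dim0; ++x)
            acc = add4(acc, load4(in_buf, y * dim0 + x));
    return acc;
}
```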