I am trying to write some code in C++, and while searching the internet I found an OpenCL-based code that does exactly what I want to do in C++. But since this is the first time I have seen OpenCL code, I don't know how to convert the following into C++:
const __global float4 *in_buf;
int x = get_global_id(0);
int y = get_global_id(1);
float result = y * get_global_size(0);
Is 'const __global float4 *in_buf' equivalent to 'const float *in_buf' in C++? And how should the other functions above be converted? Could anyone help? Thanks.
In general, you should take a look at the OpenCL specification (I'm assuming it's written in OpenCL 1.x) to better understand functions, types and how a kernel works.
Specifically for your question:
get_global_id returns the id of the current work item, and get_global_size returns the total number of work items. Since an OpenCL work-item is roughly equivalent to a single iteration in a sequential language, the equivalent of OpenCL's:
int x = get_global_id(0);
int y = get_global_id(1);
// do something with x and y
float result = y * get_global_size(0);
Will be C's:
for (int x = 0; x < dim0; x++) {
    for (int y = 0; y < dim1; y++) {
        // do something with x and y
        float result = y * dim0;
    }
}
As for float4, it's a vector type of 4 floats, roughly equivalent to C's float[4] (except that it supports many additional operators, such as vector arithmetic). Of course in this case it's a buffer, so an appropriate type would be float** or float (*)[4] - or better yet, just pack the values into a flat float* buffer and load 4 at a time.
Feel free to ignore the __global modifier.
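If it helps, here is a minimal sketch of the whole fragment as plain, serial C++, under the assumption that the data is packed as a flat float buffer with 4 floats per element (the names in_buf, dim0 and dim1 are placeholders, not from the original code):
#include <vector>

// dim0/dim1 play the role of get_global_size(0)/get_global_size(1),
// and in_buf holds dim0 * dim1 elements of 4 floats each.
void process(const std::vector<float>& in_buf, int dim0, int dim1)
{
    for (int y = 0; y < dim1; y++) {
        for (int x = 0; x < dim0; x++) {
            float result = (float)y * dim0;
            // one "float4" element starts at this offset
            const float* element = &in_buf[(y * dim0 + x) * 4];
            // ... do something with element[0..3] and result
        }
    }
}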
const __global float4 *in_buf is not equivalent to const float *in_buf.
OpenCL uses vector types, e.g. floatN, where N is 2, 4, 8, etc. So float4 is in fact struct { float x, y, z, w; } with a lot of tricks available to express vector operations.
get_global_id(0) gives you the iteration variable, so essentially replace every get_global_id(dim) with for (int x = 0; x < max[dim]; x++).
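Just to illustrate the idea (this is not OpenCL's actual definition, only a sketch), a float4 could be modelled in plain C++ as:
// A stand-in for OpenCL's float4: four floats plus componentwise
// arithmetic, which OpenCL gives you for free on vector types.
struct float4 {
    float x, y, z, w;
};

inline float4 operator+(const float4& a, const float4& b) {
    return { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
}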
I'm using the following algorithm to perform nearest neighbor resizing. Is there any way to optimize its speed? The input and output buffers are in ARGB format, though the images are known to be always opaque. Thank you.
void resizeNearestNeighbor(const uint8_t* input, uint8_t* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
    const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
    const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
    const int colors = 4;

    for (int y = 0; y < targetHeight; y++)
    {
        int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
        int i_xdest = y * targetWidth;

        for (int x = 0; x < targetWidth; x++)
        {
            int x2 = (x * x_ratio) >> 16;
            int y2_x2_colors = (y2_xsource + x2) * colors;
            int i_x_colors = (i_xdest + x) * colors;

            output[i_x_colors]     = input[y2_x2_colors];
            output[i_x_colors + 1] = input[y2_x2_colors + 1];
            output[i_x_colors + 2] = input[y2_x2_colors + 2];
            output[i_x_colors + 3] = input[y2_x2_colors + 3];
        }
    }
}
The restrict keyword will help a lot, assuming there is no aliasing.
Another improvement is to declare additional pointerToOutput and pointerToInput pointers as uint32_t*, so that the four 8-bit copy assignments can be combined into a single 32-bit one, assuming the pointers are 32-bit aligned.
There's little that you can do to speed this up, as you have already arranged the loops in the right order and cleverly used fixed-point arithmetic. As others suggested, try to move the 32 bits in a single go (in case the compiler didn't already spot that).
In case of significant enlargement, there is a possibility: you can determine how many times every source pixel needs to be replicated (you'll need to work out the properties of the relation Xd = Wd.Xs/Ws in integers) and perform a single pixel read for k writes. This also works on the y's: you can memcpy the identical rows instead of recomputing them, as in the sketch below. You can precompute and tabulate the mappings of the X's and Y's using run-length coding.
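For instance, the row part of this idea could look roughly like the sketch below (assuming 4-byte ARGB pixels as in your code; the function name and structure here are only illustrative):
#include <cstdint>
#include <cstring>

// Sketch: when several destination rows map to the same source row,
// resize that row once and memcpy it into the following identical rows.
void resizeNearestNeighborRows(const uint8_t* input, uint8_t* output,
                               int sourceWidth, int sourceHeight,
                               int targetWidth, int targetHeight)
{
    const int x_ratio = (sourceWidth << 16) / targetWidth;
    const int y_ratio = (sourceHeight << 16) / targetHeight;
    const int srcRowBytes = sourceWidth * 4; // ARGB
    const int dstRowBytes = targetWidth * 4;
    int prevSourceRow = -1;

    for (int y = 0; y < targetHeight; y++)
    {
        const int sourceRow = (y * y_ratio) >> 16;
        uint8_t* dstRow = output + y * dstRowBytes;

        if (sourceRow == prevSourceRow)
        {
            // same source row as the previous destination row: duplicate it
            std::memcpy(dstRow, dstRow - dstRowBytes, dstRowBytes);
            continue;
        }

        const uint32_t* srcRow32 = (const uint32_t*)(input + sourceRow * srcRowBytes);
        uint32_t* dstRow32 = (uint32_t*)dstRow;
        for (int x = 0; x < targetWidth; x++)
            dstRow32[x] = srcRow32[(x * x_ratio) >> 16];

        prevSourceRow = sourceRow;
    }
}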
But there is a barrier that you will not pass: you still have to fill the whole destination image, so the writes alone put a floor on the run time.
If you are desperately looking for a speedup, there remains the option of using vector operations (SSE or AVX) to handle several pixels at a time. Shuffle instructions are available that might let you control the replication (or decimation) of the pixels, but due to the complicated replication pattern combined with the fixed structure of the vector registers, you will probably need to integrate a complex decision table.
The algorithm is fine, but you can exploit massive parallelism by submitting your image to the GPU. If you use OpenGL, simply creating a context of the new size and drawing a properly sized quad gives you nearest neighbor sampling for free. OpenGL also gives you access to other resizing/sampling techniques by simply changing the properties of the texture you read from (which amounts to a single GL call and could be an easy parameter to your resize function).
Later in development, you could also simply swap in a different shader for other blending techniques, which keeps you utilizing your wonderful GPU processor of image processing glory.
Also, since you aren't using any fancy geometry it can become almost trivial to write the program. It would be a little more involved than your algorithm, but it could perform magnitudes faster depending on image size.
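For what it's worth, the texture setup for that boils down to a few calls like the sketch below (context creation, the framebuffer and the quad draw are deliberately omitted; treat this as an outline rather than drop-in code):
#include <GL/gl.h>
#include <cstdint>

// Upload the source image as a texture whose sampling mode is nearest
// neighbour; drawing it as a quad at the target size then does the resize.
GLuint makeNearestTexture(const uint8_t* input, int sourceWidth, int sourceHeight)
{
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    // channel order may need adjusting to match the exact ARGB layout
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, sourceWidth, sourceHeight,
                 0, GL_RGBA, GL_UNSIGNED_BYTE, input);
    // GL_NEAREST gives nearest-neighbour sampling; changing these two lines
    // to GL_LINEAR switches the whole resize to bilinear filtering
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    return tex;
}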
I hope I didn't break anything. This combines some of the suggestions posted thus far and is about 30% faster. I'm amazed that is all we got. I did not actually check the destination image to see if it was right.
Changes:
- remove multiplies from inner loop (10% improvement)
- uint32_t instead of uint8_t (10% improvement)
- __restrict keyword (1% improvement)
This was on an i7 x64 machine running Windows, compiled with MSVC 2013. You will have to change the __restrict keyword for other compilers.
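If portability matters, a small shim along these lines (only a sketch) keeps the same source compiling on MSVC and on GCC/Clang:
// MSVC spells the extension __restrict; GCC and Clang accept __restrict__.
#if defined(_MSC_VER)
#define RESTRICT __restrict
#else
#define RESTRICT __restrict__
#endif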
void resizeNearestNeighbor2_32(const uint8_t* __restrict input, uint8_t* __restrict output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
    const uint32_t* input32 = (const uint32_t*)input;
    uint32_t* output32 = (uint32_t*)output;

    const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
    const int y_ratio = (int)((sourceHeight << 16) / targetHeight);

    for (int y = 0; y < targetHeight; y++)
    {
        int startingOffset = ((y * y_ratio) >> 16) * sourceWidth;
        const uint32_t* inputLine = input32 + startingOffset;

        int i_xdest = y * targetWidth;
        int source_x_offset = 0;

        for (int x = 0; x < targetWidth; x++)
        {
            int sourceOffset = source_x_offset >> 16;
            output32[i_xdest] = inputLine[sourceOffset]; // whole ARGB pixel in one 32-bit move
            i_xdest += 1;
            source_x_offset += x_ratio;
        }
    }
}
I'm optimizing a piece of code that moves particles on the screen around gravity fields. For this we're told to use SSE. Now after rewriting this little bit of code, I was wondering if there is an easier/smaller way of storing the values back in the array of particles.
Here's the code before:
for (unsigned int i = 0; i < PARTICLES; i++) {
    m_Particle[i]->x += m_Particle[i]->vx;
    m_Particle[i]->y += m_Particle[i]->vy;
}
And here's the code after:
for (unsigned int i = 0; i < PARTICLES; i += 4) {
    // Particle position/velocity x & y
    __m128 ppx4 = _mm_set_ps(m_Particle[i]->x,   m_Particle[i+1]->x,
                             m_Particle[i+2]->x, m_Particle[i+3]->x);
    __m128 ppy4 = _mm_set_ps(m_Particle[i]->y,   m_Particle[i+1]->y,
                             m_Particle[i+2]->y, m_Particle[i+3]->y);
    __m128 pvx4 = _mm_set_ps(m_Particle[i]->vx,   m_Particle[i+1]->vx,
                             m_Particle[i+2]->vx, m_Particle[i+3]->vx);
    __m128 pvy4 = _mm_set_ps(m_Particle[i]->vy,   m_Particle[i+1]->vy,
                             m_Particle[i+2]->vy, m_Particle[i+3]->vy);

    union { float newx[4]; __m128 pnx4; };
    union { float newy[4]; __m128 pny4; };

    pnx4 = _mm_add_ps(ppx4, pvx4);
    pny4 = _mm_add_ps(ppy4, pvy4);

    m_Particle[i+0]->x = newx[3]; // Particle i + 0
    m_Particle[i+0]->y = newy[3];
    m_Particle[i+1]->x = newx[2]; // Particle i + 1
    m_Particle[i+1]->y = newy[2];
    m_Particle[i+2]->x = newx[1]; // Particle i + 2
    m_Particle[i+2]->y = newy[1];
    m_Particle[i+3]->x = newx[0]; // Particle i + 3
    m_Particle[i+3]->y = newy[0];
}
It works, but it looks way too large for something as simple as adding a value to another value. Is there a shorter way of doing this without changing the m_Particle structure?
There's no reason why you couldn't put x and y side by side in one __m128, shortening the code somewhat:
for (unsigned int i = 0; i < PARTICLES; i += 2) {
    // Particle position/velocity x & y
    __m128 pos = _mm_set_ps(m_Particle[i]->x, m_Particle[i+1]->x,
                            m_Particle[i]->y, m_Particle[i+1]->y);
    __m128 vel = _mm_set_ps(m_Particle[i]->vx, m_Particle[i+1]->vx,
                            m_Particle[i]->vy, m_Particle[i+1]->vy);

    union { float pnew[4]; __m128 pnew4; };
    pnew4 = _mm_add_ps(pos, vel);

    // _mm_set_ps puts its first argument in the highest lane, so element 3
    // holds particle i's x and element 0 holds particle i+1's y
    m_Particle[i+0]->x = pnew[3]; // Particle i + 0
    m_Particle[i+0]->y = pnew[1];
    m_Particle[i+1]->x = pnew[2]; // Particle i + 1
    m_Particle[i+1]->y = pnew[0];
}
But really, you've encountered the "Array of structs" vs. "Struct of arrays" issue. SSE code works better with a "Struct of arrays" like:
struct Particles
{
    float x[PARTICLES];
    float y[PARTICLES];
    float xv[PARTICLES];
    float yv[PARTICLES];
};
Another option is a hybrid approach:
struct Particles4
{
    __m128 x;
    __m128 y;
    __m128 xv;
    __m128 yv;
};

Particles4 particles[PARTICLES / 4];
Either way will give simpler and faster code than your example.
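For illustration, with the struct-of-arrays layout the update loop becomes roughly this (a sketch: the alignas is my addition so the aligned load/store intrinsics are safe, and PARTICLES - the constant from your code - is assumed to be a multiple of 4):
#include <xmmintrin.h>

struct Particles
{
    alignas(16) float x[PARTICLES];
    alignas(16) float y[PARTICLES];
    alignas(16) float xv[PARTICLES];
    alignas(16) float yv[PARTICLES];
};

void update(Particles& p)
{
    for (unsigned int i = 0; i < PARTICLES; i += 4)
    {
        // four x's (and four y's) loaded, added and stored in one instruction each
        _mm_store_ps(&p.x[i], _mm_add_ps(_mm_load_ps(&p.x[i]), _mm_load_ps(&p.xv[i])));
        _mm_store_ps(&p.y[i], _mm_add_ps(_mm_load_ps(&p.y[i]), _mm_load_ps(&p.yv[i])));
    }
}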
I went a slightly different route to simplify: process 2 elements per iteration and pack them as (x,y,x,y) instead of (x,x,x,x) and (y,y,y,y) as you did.
If, in your particle class, x and y are contiguous floats and the fields are 32-bit aligned, a single operation loading x as a double will in fact load the two floats x and y at once.
for (unsigned int i = 0; i < PARTICLES; i += 2) {
    // x and y are assumed contiguous in memory, so loading a double
    // at &x loads the two floats x and y at once
    __m128d pos = _mm_setzero_pd();
    pos = _mm_loadl_pd(pos, (const double*)&m_Particle[i  ]->x);
    // a register holds 4 floats, i.e. 2 (x, y) positions
    pos = _mm_loadh_pd(pos, (const double*)&m_Particle[i+1]->x);

    // same for the velocities (vx and vy also contiguous)
    __m128d vel = _mm_setzero_pd();
    vel = _mm_loadl_pd(vel, (const double*)&m_Particle[i  ]->vx);
    vel = _mm_loadh_pd(vel, (const double*)&m_Particle[i+1]->vx);

    // reinterpret the bits as floats to do the math, then back for the store
    pos = _mm_castps_pd(_mm_add_ps(_mm_castpd_ps(pos), _mm_castpd_ps(vel)));

    // store the same way as the loads
    _mm_storel_pd((double*)&m_Particle[i  ]->x, pos);
    _mm_storeh_pd((double*)&m_Particle[i+1]->x, pos);
}
Also, since you mention particles, do you intend to draw them with OpenGL / DirectX? If so, you could perform this kind of computation on the GPU, faster and while also avoiding data transfers from main memory to the GPU, so it's a gain on all fronts.
If that's not the case and you intend to stay on the CPU, using an SSE-friendly layout, like one array for positions and one for velocities, could be a solution:
struct particle_data {
    std::vector<float> xys, vxvys;
};
But it would have the drawback of either breaking your architecture or requiring a copy from your current array of structs to a temporary struct of arrays. The compute would be faster but the additional copy might outweigh that. Only benchmarking can show...
A last option is to sacrifice a little performance and load your data as it is, and use SSE shuffle instructions to rearrange the data locally at each iteration. But arguably this would make the code even harder to maintain.
When designing for performance, you should avoid handling an array of structures and instead work with a structure of arrays.
I wrote a kernel in OpenCL where I initialise all the elements of a 3D array to i*i*i + j*j*j. I'm now having problems creating a grid of threads to do the initialisation of the elements concurrently. I know that the code I have now only uses 3 threads; how can I expand on that?
Please help. I'm new to OpenCL, so any suggestion or explanation might be handy. Thanks!
This is the code:
__kernel void initialize(int X, int Y, int Z, __global float *A)
{
    // Get global position in X direction
    int dirX = get_global_id(0);
    // Get global position in Y direction
    int dirY = get_global_id(1);
    // Get global position in Z direction
    int dirZ = get_global_id(2);

    int A[2000][100][4];

    int i, j, k;
    for (i = 0; i < 2000; i++)
    {
        for (j = 0; j < 100; j++)
        {
            for (k = 0; k < 4; k++)
            {
                A[dirX*X+i][dirY*Y+j][dirZ*Z+k] = i*i*i + j*j*j;
            }
        }
    }
}
You create the buffer to store your output 'A' in the calling (host) code. This is passed to your kernel as a pointer, which is correct in your function definition above. However you don't need to declare it again inside your kernel function, so remove the line int A[2000][100][4];.
You can simplify the code greatly. Using the 3D global ID to indicate the 3D index into the array for each work-item, you could change the loop as follows (assuming that for a given i and j, all elements along Z should have the same value):
__kernel void initialize(__global float *A) {
    // cast required so that the kernel compiler knows the array dimensions
    __global float (*a)[2000][100][4] = (__global float (*)[2000][100][4])A;

    // Get global position in X direction
    int i = get_global_id(0);
    // Get global position in Y direction
    int j = get_global_id(1);
    // Get global position in Z direction
    int k = get_global_id(2);

    (*a)[i][j][k] = i*i*i + j*j*j;
}
In your calling code you would then create the kernel with a global work-size of 2000x100x4.
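On the host side, that enqueue might look roughly like this fragment (just a sketch; queue, kernel and the buffer argument are assumed to be created and set up already):
// one work-item per array element: 2000 x 100 x 4
size_t global_work_size[3] = {2000, 100, 4};
cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    3,                // work_dim
                                    NULL,             // no global offset
                                    global_work_size,
                                    NULL,             // let the runtime choose the local size
                                    0, NULL, NULL);   // no events to wait on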
Practically this is a lot of work items to schedule, so you would likely get better performance from a global (one-dimensional) work-size of 2000 and a loop inside the kernel, e.g.:
__kernel void initialize(__global float *A) {
    // cast required so that the kernel compiler knows the array dimensions
    __global float (*a)[2000][100][4] = (__global float (*)[2000][100][4])A;

    // Get global position in X direction
    int i = get_global_id(0);

    for (int j = 0; j < 100; j++) {
        for (int k = 0; k < 4; k++) {
            (*a)[i][j][k] = i*i*i + j*j*j;
        }
    }
}
I have some C++ I'm trying to port, and I'm confused about a couple lines and what exactly they're doing. The code is as follows. The variable im is a 2D float array of size num_rows by num_cols.
for (x = 0; x < num_cols; x++) {
    float *im_x_cp = im[1] + x; //(1)
    for (y = 1; y < num_rows; y++, im_x_cp += num_cols) {
        float s1 = *im_x_cp;
        // et cetera
    }
}
The code marked (1) is particularly confusing to me. What part of the 2d array im is this referencing?
Thanks for your help in advance.
im[1] is a pointer to an array of floats; that is, it's the second row of your matrix.
im[1] + x is a pointer to the element at coordinate (1,x) (recall how pointer arithmetic works) and s1 is its value.
The type of im[1] is float *. So, according to the rules of C++ pointer arithmetic:
float* im_x_cp = im[1];
im_x_cp = im_x_cp + x;
Now it's a float* pointing to item x in that slice, i.e. the element at (1, x).
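In other words, assuming the rows of im are laid out contiguously one after another (which the im_x_cp += num_cols step relies on), the loop is just walking down column x starting at row 1; an index-based version would be roughly:
// Equivalent indexing, assuming contiguous rows.
for (int x = 0; x < num_cols; x++) {
    for (int y = 1; y < num_rows; y++) {
        float s1 = im[y][x]; // same value the pointer walk reads
        // et cetera
    }
}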
__global__ void finalImageGathering(float3 *lists[]) {
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    float3 test;
    for (int i = 0; i < Blah; i++)
        test += lists[i][y * width + x];
}
Is it possible to have a list of pointers to different float3 lists, or do I need to do something else?
You can do that; CUDA imposes no special limits on pointer indirection for anything other than function pointers (and that limitation is mostly gone on recent hardware too). What is more complex is allocating the memory for such an array of device pointers, and copying it to and from host memory, if you need to.
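A sketch of that host-side allocation and copy might look like this (numLists and elementsPerList are placeholders for whatever sizes the real code uses):
#include <cuda_runtime.h>
#include <vector>

// Allocate each list on the device, collect the device pointers on the host,
// then copy that pointer array itself to the device so the kernel can index
// lists[i][...].
float3** makeDeviceLists(int numLists, size_t elementsPerList)
{
    std::vector<float3*> hostPtrs(numLists);
    for (int i = 0; i < numLists; i++)
        cudaMalloc((void**)&hostPtrs[i], elementsPerList * sizeof(float3));

    float3** deviceLists = NULL;
    cudaMalloc((void**)&deviceLists, numLists * sizeof(float3*));
    cudaMemcpy(deviceLists, hostPtrs.data(), numLists * sizeof(float3*),
               cudaMemcpyHostToDevice);
    return deviceLists; // pass this as the kernel's "float3 *lists[]" argument
}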