SSE copy data to variables - c++

I'm optimizing a piece of code that moves particles on the screen around gravity fields. For this we're told to use SSE. Now after rewriting this little bit of code, I was wondering if there is an easier/smaller way of storing the values back in the array of particles.
Here's the code before:
for (unsigned int i = 0; i < PARTICLES; i++) {
    m_Particle[i]->x += m_Particle[i]->vx;
    m_Particle[i]->y += m_Particle[i]->vy;
}
And here's the code after:
for (unsigned int i = 0; i < PARTICLES; i += 4) {
    // Particle position/velocity x & y
    __m128 ppx4 = _mm_set_ps(m_Particle[i]->x, m_Particle[i+1]->x,
                             m_Particle[i+2]->x, m_Particle[i+3]->x);
    __m128 ppy4 = _mm_set_ps(m_Particle[i]->y, m_Particle[i+1]->y,
                             m_Particle[i+2]->y, m_Particle[i+3]->y);
    __m128 pvx4 = _mm_set_ps(m_Particle[i]->vx, m_Particle[i+1]->vx,
                             m_Particle[i+2]->vx, m_Particle[i+3]->vx);
    __m128 pvy4 = _mm_set_ps(m_Particle[i]->vy, m_Particle[i+1]->vy,
                             m_Particle[i+2]->vy, m_Particle[i+3]->vy);
    union { float newx[4]; __m128 pnx4; };
    union { float newy[4]; __m128 pny4; };
    pnx4 = _mm_add_ps(ppx4, pvx4);
    pny4 = _mm_add_ps(ppy4, pvy4);
    // _mm_set_ps stores its arguments high-to-low, so element [3]
    // belongs to particle i + 0, element [2] to i + 1, and so on.
    m_Particle[i+0]->x = newx[3]; // Particle i + 0
    m_Particle[i+0]->y = newy[3];
    m_Particle[i+1]->x = newx[2]; // Particle i + 1
    m_Particle[i+1]->y = newy[2];
    m_Particle[i+2]->x = newx[1]; // Particle i + 2
    m_Particle[i+2]->y = newy[1];
    m_Particle[i+3]->x = newx[0]; // Particle i + 3
    m_Particle[i+3]->y = newy[0];
}
It works, but it looks way too large for something as simple as adding a value to another value. Is there a shorter way of doing this without changing the m_Particle structure?

There's no reason why you couldn't put x and y side by side in one __m128, shortening the code somewhat:
for (unsigned int i = 0; i < PARTICLES; i += 2) {
    // Particle position/velocity x & y
    __m128 pos = _mm_set_ps(m_Particle[i]->x, m_Particle[i+1]->x,
                            m_Particle[i]->y, m_Particle[i+1]->y);
    __m128 vel = _mm_set_ps(m_Particle[i]->vx, m_Particle[i+1]->vx,
                            m_Particle[i]->vy, m_Particle[i+1]->vy);
    union { float pnew[4]; __m128 pnew4; };
    pnew4 = _mm_add_ps(pos, vel);
    // _mm_set_ps stores its arguments high-to-low, so pnew[3] is
    // particle i's x, pnew[1] its y, and so on.
    m_Particle[i+0]->x = pnew[3]; // Particle i + 0
    m_Particle[i+0]->y = pnew[1];
    m_Particle[i+1]->x = pnew[2]; // Particle i + 1
    m_Particle[i+1]->y = pnew[0];
}
But really, you've encountered the "Array of structs" vs. "Struct of arrays" issue. SSE code works better with a "Struct of arrays" like:
struct Particles
{
    float x[PARTICLES];
    float y[PARTICLES];
    float xv[PARTICLES];
    float yv[PARTICLES];
};
Another option is a hybrid approach:
struct Particles4
{
    __m128 x;
    __m128 y;
    __m128 xv;
    __m128 yv;
};
Particles4 particles[PARTICLES / 4];
Either way will give simpler and faster code than your example.
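For example, with the struct-of-arrays layout the whole update collapses to two vector operations per four particles. A sketch, assuming PARTICLES is a multiple of 4 and the arrays are 16-byte aligned (otherwise use _mm_loadu_ps/_mm_storeu_ps):
Particles p; // the struct of arrays from above

for (unsigned int i = 0; i < PARTICLES; i += 4) {
    // contiguous loads and stores; no per-lane shuffling needed
    _mm_store_ps(&p.x[i], _mm_add_ps(_mm_load_ps(&p.x[i]), _mm_load_ps(&p.xv[i])));
    _mm_store_ps(&p.y[i], _mm_add_ps(_mm_load_ps(&p.y[i]), _mm_load_ps(&p.yv[i])));
}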

I went a slightly different route to simplify: process 2 elements per iteration and pack them as (x,y,x,y) instead of (x,x,x,x) and (y,y,y,y) as you did.
If, in your particle class, x and y are contiguous floats and the fields are 32-bit aligned, a single load of x as a double will in fact fetch the two floats x and y at once.
for (unsigned int i = 0; i < PARTICLES; i += 2) {
    __m128d pos = _mm_setzero_pd(); // zero vector
    // I assume x and y are contiguous in memory,
    // so loading a double at &x loads 2 floats: x and the following y.
    pos = _mm_loadl_pd(pos, (double*)&m_Particle[i  ]->x);
    // a register can contain 4 floats, so 2 positions
    pos = _mm_loadh_pd(pos, (double*)&m_Particle[i+1]->x);
    // same for velocities
    __m128d vel = _mm_setzero_pd();
    vel = _mm_loadl_pd(vel, (double*)&m_Particle[i  ]->vx);
    vel = _mm_loadh_pd(vel, (double*)&m_Particle[i+1]->vx);
    // reinterpret the bits as 4 floats, do the math, reinterpret back
    pos = _mm_castps_pd(_mm_add_ps(_mm_castpd_ps(pos), _mm_castpd_ps(vel)));
    // store the same way as load
    _mm_storel_pd((double*)&m_Particle[i  ]->x, pos);
    _mm_storeh_pd((double*)&m_Particle[i+1]->x, pos);
}
Also, since you mention particles: do you intend to draw them with OpenGL / DirectX? If so, you could perform this kind of operation faster on the GPU, while also avoiding data transfers from main memory to the GPU, so it's a gain on all fronts.
If that's not the case and you intend to stay on the CPU, using an SSE friendly layout like one array for positions and one for velocities could be a solution:
struct particle_data {
    std::vector<float> xys, vxvys;
};
But it would have the drawback of either breaking your architecture or requiring a copy from your current array of structs to a temporary struct of arrays. The computation would be faster, but the additional copy might outweigh the gain. Only benchmarking can tell...
A last option is to sacrifice a little performance and load your data as it is, and use SSE shuffle instructions to rearrange the data locally at each iteration. But arguably this would make the code even harder to maintain.

For performance, you should avoid working with an array of structures; work with a structure of arrays instead.

Related

Cuda move element in array to the end

Hello, my issue is this; any advice will be gratefully accepted:
I have an array of structs (representing Particles), but to simplify, assume I have an array containing only True values at the start (Particle.exists = True). I am running my own CUDA kernel function on this array, and in some cases the True value is changed to False. After that, I have to move this value to the end of the array for better optimization (no more working with dead Particles (exists = False)).
I theoretically have two options for how to do this...
Some parallel sorting algorithm, or
Move the dead Particle to the end and shift the array.
The second option should be the better choice, but I don't know how to do it in parallel. I could have 1,000,000 Particles, so shifting in one thread is not a good idea...
Here is an example of my code. I put a TODO in the part where I need to shift the array:
struct Particle
{
    float2 position;
    float angle;
    bool exists;
};
__global__ void moveParticles(Particle* particles, const unsigned int lengthOfParticles, const Particle* leaders, const unsigned int lengthOfLeaders, const unsigned int sizeOfLeader, const float speedFactor, const cudaTextureObject_t heightMapTexture)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    while (idx < lengthOfParticles)
    {
        // If particle does not exist then do nothing and skip
        if (!particles[idx].exists) { idx += skip; continue; }
        float bestLength = 3.40282e+038;
        unsigned int bestLeaderIndex;
        for (unsigned int i = 0; i < lengthOfLeaders; i++)
        {
            float currentLength = (
                (particles[idx].position.x - leaders[i].position.x) * (particles[idx].position.x - leaders[i].position.x)
            ) + (
                (particles[idx].position.y - leaders[i].position.y) * (particles[idx].position.y - leaders[i].position.y)
            );
            if (currentLength < bestLength)
            {
                bestLength = currentLength;
                bestLeaderIndex = i;
            }
        }
        Particle bestLeader = leaders[bestLeaderIndex];
        float differenceX = bestLeader.position.x - particles[idx].position.x;
        float differenceY = bestLeader.position.y - particles[idx].position.y;
        float newLength = sqrtf(differenceX * differenceX + differenceY * differenceY);
        // If newLength is zero, the particle is at the same position as the leader
        // TODO: HERE I NEED TO SORT THE NON-EXISTING PARTICLE TO THE END
        if (newLength <= sizeOfLeader / 2) { particles[idx].exists = false; idx += skip; continue; }
        // Current height at the position
        const uchar4 texelOfHeight = tex2D<uchar4>(heightMapTexture, particles[idx].position.x, particles[idx].position.y);
        // Normalize vector
        differenceX /= newLength;
        differenceY /= newLength;
        int nextPositionOnMapX = round(particles[idx].position.x + differenceX);
        int nextPositionOnMapY = round(particles[idx].position.y + differenceY);
        // Height of the next position
        const uchar4 texelOfNextPosition = tex2D<uchar4>(heightMapTexture, nextPositionOnMapX, nextPositionOnMapY);
        float differenceHeight = texelOfHeight.x - texelOfNextPosition.x;
        float speed = sqrtf(speedFactor + 2 * fabsf(differenceHeight));
        // Multiply by speed
        differenceX *= speed;
        differenceY *= speed;
        particles[idx].position.x += differenceX;
        particles[idx].position.y += differenceY;
        idx += skip;
    }
}
One possible solution I am thinking about is writing my own kernel function that will only shift particles. Something like this:
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    //TODO: Shifting...
}
Sorting on GPUs is rather inefficient, so it is better to select the values to keep and perform a partition based on them. To do that easily, you can use CUB, which is quite efficient (it often implements the best state-of-the-art algorithms, or close to them).
You can use DevicePartition or two DeviceSelect calls (the former will likely be faster, unless you do not want to keep the dead particles at all). You could also use block primitives if you want to perform some advanced tweaks/optimizations.
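For reference, a minimal sketch of the DevicePartition route (the wrapper and selector names are illustrative; cub::DevicePartition::If writes selected items to the front of the output and the rejected ones, in reverse order, to the back):
#include <cub/cub.cuh>

struct IsAlive // illustrative selector
{
    __device__ bool operator()(const Particle& p) const { return p.exists; }
};

// d_in/d_out are device arrays of Particle; d_numAlive is a device int
// that receives the number of selected (living) particles.
void partitionParticles(const Particle* d_in, Particle* d_out, int* d_numAlive, int n)
{
    void* d_temp = nullptr;
    size_t tempBytes = 0;
    // First call only computes the required temporary storage size.
    cub::DevicePartition::If(d_temp, tempBytes, d_in, d_out, d_numAlive, n, IsAlive());
    cudaMalloc(&d_temp, tempBytes);
    // Second call performs the actual partition.
    cub::DevicePartition::If(d_temp, tempBytes, d_in, d_out, d_numAlive, n, IsAlive());
    cudaFree(d_temp);
}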
If you still want to do this yourself for some reason (e.g. reducing the number of dependencies in your project), then you can use atomic adds on relatively new devices, since they are very well optimized by the hardware. On older devices you could use scans instead, but that is a bit harder to implement. The thing is, atomics do not scale particularly well when there are a lot of SMs, so you may need an advanced blocking strategy. Here is an untested naive implementation to illustrate the idea:
// PS: what is the difference between sizeOfParticle and lengthOfParticles?
// pos must be initialized to 0; it contains the number of living particles (the pivot) once the kernel has finished executing.
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle, Particle* outParticles, int* pos) {
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= lengthOfParticles) return; // guard needed when the grid is larger than the array
    const bool exists = particles[idx].exists;
    const int localPos = atomicAdd(pos, exists); // Here is the important point
    const Particle current = particles[idx];
    // outParticles is a needed temporary or output array,
    // as the operation cannot be efficiently performed in place in parallel.
    // It should likely be allocated and passed as an argument to the kernel.
    if (exists)
    {
        // Move the current particle to the beginning
        outParticles[localPos] = current;
    }
    else
    {
        // Move the current particle to the end
        outParticles[lengthOfParticles-1-idx+localPos] = current;
    }
}
Note that the ordering is not preserved due to the atomic operations. If you need to keep the order of the particles, then it gets significantly more complicated, especially on GPUs, since it makes the algorithm more sequential. A naive solution could be to use a stable sort in that case. Another solution is to use a global scan followed by an indirection to store the values (so two passes). Implementing an efficient scan is a bit complex/tedious. Hopefully, CUB can help a lot here with its DeviceScan primitive.
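To illustrate the scan-plus-indirection variant, an untested sketch (names are hypothetical; d_flags would be filled from the exists fields first, and the usual CUB two-call temporary-storage pattern is elided):
// Scatter using the scanned indices; order is preserved in both halves.
__global__ void scatterStable(const Particle* in, Particle* out,
                              const int* flags, const int* indices,
                              int n, int numAlive)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    if (flags[idx])
        out[indices[idx]] = in[idx];                  // stable front half
    else
        out[numAlive + idx - indices[idx]] = in[idx]; // stable back half
}

void stablePartition(const Particle* d_in, Particle* d_out,
                     const int* d_flags, int* d_indices,
                     void* d_temp, size_t tempBytes, int n, int numAlive)
{
    // Pass 1: d_indices[i] = number of living particles before i.
    cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_flags, d_indices, n);
    // Pass 2: scatter (numAlive = d_indices[n-1] + d_flags[n-1]).
    scatterStable<<<(n + 255) / 256, 256>>>(d_in, d_out, d_flags, d_indices, n, numAlive);
}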
Finally note that using array of structures is not efficient, especially on hardware using SIMD instructions like GPUs. The implementation should be significantly faster with structures of arrays (due to cache lines, coalescence, contiguity of access pattern, etc.).

Threads are slow c++

I'm trying to draw a Mandelbrot set and want to use 4 threads to do the calculation at the same time, each on a different part of the image. Here are the functions:
void Mandelbrot(int x_min,int x_max,int y_min,int y_max,Image &im)
{
    for (int i = y_min; i < y_max; i++)
    {
        for (int j = x_min; j < x_max; j++)
        {
            //scaled x and y coordinate
            double x0 = mape(j, 0, W, MinX, MaxX);
            double y0 = mape(i, 0, H, MinY, MaxY);
            double x = 0.0f;
            double y = 0.0f;
            int iteration = 0;
            double z = 0;
            while (abs(z) < 2.0f && iteration < maxIteration)
            {
                double xtemp = x * x - y * y + x0;
                y = 2 * x * y + y0;
                x = xtemp;
                iteration++;
                z = x * x + y * y;
                if (z > 10) //must be 10
                    break;
            }
            int b = mape(iteration, 0, maxIteration, 0, 255);
            if (iteration == maxIteration)
                b = 0;
            im.setPixel(j, i, Color(b, b, 0));
        }
    }
}
The mape function just converts a number from one range to another.
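The exact implementation isn't shown, but a linear range mapping of the usual form would be:
double mape(double v, double in_min, double in_max, double out_min, double out_max)
{
    // linearly map v from [in_min, in_max] to [out_min, out_max]
    return out_min + (v - in_min) * (out_max - out_min) / (in_max - in_min);
}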
Here is the thread function
void th(Image& im)
{
    float size = (float)im.getSize().x / num_th;
    int x_min = 0, x_max = size, y_min = 0, y_max = im.getSize().y;
    thread t[num_th];
    for (size_t i = 0; i < num_th; i++)
    {
        t[i] = thread(Mandelbrot, x_min, x_max, y_min, y_max, ref(im));
        x_min = x_max;
        x_max += size;
    }
    for (size_t i = 0; i < num_th; i++)
    {
        t[i].join();
    }
}
The main function looks like this
int main()
{
    Image img;
    while (1) // here is while window.open()
    {
        th(img);
        // here I'm drawing
    }
}
So I am not getting any performance boost; it even gets slower. Can anyone tell me where the problem is and what I'm doing wrong? It has happened to me before, too.
I saw a question asking what Image is: it's a class from the SFML library, don't know if that is of any help.
Your code is too incomplete to answer you concretely, but there are a few suspicions:
- Spawning a thread has non-trivial overhead. If the amount of work performed by the thread is not large enough, the overhead of launching it may cost more than any gains you would get through parallelism.
- Excessive locking and contention: this does not look like a problem in your code, as you don't seem to use any locks at all. Be careful anyway (though as long as the threads don't write to the same addresses, it should be correct).
- False sharing: a possible problem in your code. Cache lines tend to be 64 bytes. Any write to any portion of a cache line causes the whole line to be committed to memory. If two threads are looking at the same cache line and one of them writes to it, even if all the other threads use a different part of that cache line, they all will have their copy invalidated and will have to re-fetch. This can cause significant problems when multiple threads work on non-overlapping data that shares a cache line, and if they iterate at the same rate through the same data, the problem can recur over and over. This problem can be significant, and is always worth considering.
- Memory layout causing your cache to be thrashed. While walking through an array, going "across" may align with the actual memory layout, reading one full cache line after another, but scanning "vertically" touches one portion of a cache line then jumps to the corresponding portion of the next one. If this happens in many threads and you have a lot of memory to churn through, your cache can end up vastly underutilized. Just something to beware of: check whether your machine is row- or column-major, write code to match it, and avoid jumping around in memory.
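For instance, in a row-major image it usually pays to hand each thread a band of whole rows rather than a vertical strip. A sketch reusing the names from your question (Mandelbrot, num_th; assuming SFML's row-major pixel storage):
#include <algorithm>
#include <thread>

void th_rows(Image& im)
{
    const int rows = im.getSize().y;
    const int cols = im.getSize().x;
    const int band = (rows + num_th - 1) / num_th; // ceiling division
    std::thread t[num_th];
    for (int i = 0; i < num_th; i++)
    {
        // each thread walks complete rows, one cache line after another
        const int y_min = i * band;
        const int y_max = std::min(rows, y_min + band);
        t[i] = std::thread(Mandelbrot, 0, cols, y_min, y_max, std::ref(im));
    }
    for (int i = 0; i < num_th; i++)
        t[i].join();
}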

Initializing a box with N particles arranged in a specific pattern

I'm new to C++, and as an exercise I'm trying to reproduce what was done by Metropolis et al. (Metropolis Monte Carlo).
What I have done thus far - Made 2 classes: Vector and Atom
class Vector {
public:
    double x;
    double y;
    Vector() {
    }
    Vector(double x_, double y_) {
        x = x_;
        y = y_;
    }
    double len() {
        return sqrt(x*x + y*y);
    }
    double lenSqr() {
        return x*x + y*y;
    }
};
class Atom {
public:
    Vector pos;
    Vector vel;
    Vector force;
    Atom(double x_, double y_) {
        pos = Vector(x_, y_);
        vel = Vector(0, 0);
        force = Vector(0, 0);
    }
    double KE() {
        return .5 * vel.lenSqr();
    }
};
I am not certain that the way I have defined the class Atom is... the best way to go about things since I will not be using a random number generator to place the atoms in the box.
My problem:
I need to initialize a box of length L (in my case L=1) and load it with 224 atoms/particles in an offset lattice (I have included a picture). I have done some reading and I was wondering if maybe an array would be appropriate here.
One thing that I am confused about is how I could normalize the array to get the appropriate distance between the particles and what would happen to the array once the particles begin to move. I am also not sure how an array could give me the x and y position of each and every atom in the box.
Metropolis offset (hexagonal) lattice
Well, it seems that generally you don't need an array to represent the lattice. In practice, it usually only makes sense to represent a lattice as an array when the atoms can naturally move only between cells (for example, like pieces on a chessboard). But it seems your atoms can move in any direction (already impractical for a structure as rigid as an array, which naturally has only 4 or 8 move directions in 2D) and by any step size (bad for arrays too, since you would need an almost countless number of cells to represent the minimal step distance).
So basically all you need is to use an array as storage for your 224 atoms and set each one's position in the lattice via its pos member:
std::vector<Atom> atoms;
// initialize atoms to be in trigonal lattice
const double x_shift = 1. / 14;
const double y_shift = 1. / 16;
double x_offset = 0;
for (double y = 0; y < 1; y += y_shift){
    for (double x = x_offset; x < 1; x += x_shift){
        // create atom in position (x, y)
        // and store it in array of atoms
        atoms.push_back(Atom(x, y));
    }
    // every new row flip offset 0 -> 1/28 -> 0 -> 1/28...
    if (x_offset == 0){
        x_offset = x_shift / 2;
    }
    else{
        x_offset = 0;
    }
}
Afterwards, you just need to process this array of atoms, changing their positions, velocities, and whatever else you need according to the algorithm.
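For example, a later position-update pass over that storage could be as simple as (dt being a hypothetical time step):
const double dt = 0.001; // hypothetical time step
for (Atom& a : atoms) {
    // advance each atom by its velocity
    a.pos.x += a.vel.x * dt;
    a.pos.y += a.vel.y * dt;
}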

NEON increasing run time

I am currently trying to optimize some of my image processing code to use NEON instructions.
Let's say I have two very large float arrays and I want to multiply each value of the first one with three consecutive values of the second one (the second one is three times as large).
float* l_ptrGauss_pf32 = [...];
float* l_ptrLaplace_pf32 = [...]; // Three times as large
for (uint64_t k = 0; k < l_numPixels_ui64; ++k)
{
    float l_weight_f32 = *l_ptrGauss_pf32;
    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;
    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;
    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;
    ++l_ptrGauss_pf32;
}
So when I replace the above code with NEON intrinsics, the run time is about 10% longer.
float32x4_t l_gaussElem_f32x4;
float32x4_t l_laplElem1_f32x4;
float32x4_t l_laplElem2_f32x4;
float32x4_t l_laplElem3_f32x4;
for (uint64_t k = 0; k < (l_lastPixelInBlock_ui64/4); ++k)
{
    l_gaussElem_f32x4 = vld1q_f32(l_ptrGauss_pf32);
    l_laplElem1_f32x4 = vld1q_f32(l_ptrLaplace_pf32);
    l_laplElem2_f32x4 = vld1q_f32(l_ptrLaplace_pf32+4);
    l_laplElem3_f32x4 = vld1q_f32(l_ptrLaplace_pf32+8);
    l_laplElem1_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem1_f32x4);
    l_laplElem2_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem2_f32x4);
    l_laplElem3_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem3_f32x4);
    vst1q_f32(l_ptrLaplace_pf32, l_laplElem1_f32x4);
    vst1q_f32(l_ptrLaplace_pf32+4, l_laplElem2_f32x4);
    vst1q_f32(l_ptrLaplace_pf32+8, l_laplElem3_f32x4);
    l_ptrLaplace_pf32 += 12;
    l_ptrGauss_pf32 += 4;
}
Both versions are compiled with -Ofast using Apple LLVM 8.0. Is the compiler really so good at optimizing this code even without NEON intrinsics?
Your code contains relatively many vector load operations and only a few multiplications, so I would recommend optimizing the loading of the vectors. There are two steps:
Use aligned memory for your arrays.
Use prefetching.
In order to do this, I would recommend using the following function:
inline float32x4_t Load(const float * p)
{
    // use prefetch:
    __builtin_prefetch(p + 256);
    // tell compiler that address is aligned:
    float * _p = (float *)__builtin_assume_aligned(p, 16);
    return vld1q_f32(_p);
}
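For the first step to help, the buffers themselves must actually be allocated with that alignment. A minimal sketch, assuming C++17's std::aligned_alloc is available (it requires the byte size to be a multiple of the alignment, so the pixel count is assumed to be a multiple of 4):
#include <cstdlib>

// 16-byte aligned allocations so the assumption made in Load() holds
float* l_ptrGauss_pf32 = static_cast<float*>(std::aligned_alloc(16, l_numPixels_ui64 * sizeof(float)));
float* l_ptrLaplace_pf32 = static_cast<float*>(std::aligned_alloc(16, 3 * l_numPixels_ui64 * sizeof(float)));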

Optimize a nearest neighbor resizing algorithm for speed

I'm using the following algorithm to perform nearest-neighbor resizing. Is there any way to optimize its speed? The input and output buffers are in ARGB format, though the images are known to be always opaque. Thank you.
void resizeNearestNeighbor(const uint8_t* input, uint8_t* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
    const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
    const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
    const int colors = 4;
    for (int y = 0; y < targetHeight; y++)
    {
        int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
        int i_xdest = y * targetWidth;
        for (int x = 0; x < targetWidth; x++)
        {
            int x2 = ((x * x_ratio) >> 16);
            int y2_x2_colors = (y2_xsource + x2) * colors;
            int i_x_colors = (i_xdest + x) * colors;
            output[i_x_colors] = input[y2_x2_colors];
            output[i_x_colors + 1] = input[y2_x2_colors + 1];
            output[i_x_colors + 2] = input[y2_x2_colors + 2];
            output[i_x_colors + 3] = input[y2_x2_colors + 3];
        }
    }
}
The restrict keyword will help a lot, assuming no aliasing.
Another improvement is to declare additional pointers to the output and input as uint32_t, so that the four 8-bit copy assignments can be combined into a single 32-bit one, assuming the pointers are 32-bit aligned.
There's little that you can do to speed this up, as you have already arranged the loops in the right order and cleverly used fixed-point arithmetic. As others suggested, try to move the 32 bits in a single go (assuming the compiler didn't already do that for you).
In case of significant enlargement, there is a possibility: you can determine how many times each source pixel needs to be replicated (you'll need to work out the properties of the relation Xd = Wd*Xs/Ws in integers) and perform a single pixel read for k writes. This also works on the y's, and you can memcpy identical rows instead of recomputing them. You can precompute and tabulate the mappings of the X's and Y's using run-length coding.
But there is a barrier that you will not pass: you need to fill the destination image.
If you are desperately looking for a speedup, there remains the option of using vector operations (SSE or AVX) to handle several pixels at a time. Shuffle instructions are available that might enable you to control the replication (or decimation) of the pixels. But due to the complicated replication pattern combined with the fixed structure of the vector registers, you will probably need to integrate a complex decision table.
The algorithm is fine, but you can utilize massive parallelization by submitting your image to the GPU. If you use OpenGL, simply creating a context of the new size and providing a properly sized quad can give you inherent nearest-neighbor calculations. Also, OpenGL gives you access to other resizing sampling techniques by simply changing the properties of the texture you read from (which would amount to a single gl command, which could be an easy parameter to your resize function).
Also, later in development, you could simply swap out a shader for other blending techniques, which also keeps you utilizing your wonderful GPU processor of image processing glory.
Also, since you aren't using any fancy geometry, it becomes almost trivial to write the program. It would be a little more involved than your algorithm, but it could perform magnitudes faster depending on image size.
I hope I didn't break anything. This combines some of the suggestions posted thus far and is about 30% faster. I'm amazed that is all we got. I did not actually check the destination image to see if it was right.
Changes:
- remove multiplies from inner loop (10% improvement)
- uint32_t instead of uint8_t (10% improvement)
- __restrict keyword (1% improvement)
This was on an i7 x64 machine running Windows, compiled with MSVC 2013. You will have to change the __restrict keyword for other compilers.
void resizeNearestNeighbor2_32(const uint8_t* __restrict input, uint8_t* __restrict output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
    const uint32_t* input32 = (const uint32_t*)input;
    uint32_t* output32 = (uint32_t*)output;
    const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
    const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
    int x_ratio_with_color = x_ratio;
    for (int y = 0; y < targetHeight; y++)
    {
        int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
        int i_xdest = y * targetWidth;
        int source_x_offset = 0;
        int startingOffset = y2_xsource;
        const uint32_t * inputLine = input32 + startingOffset;
        for (int x = 0; x < targetWidth; x++)
        {
            int sourceOffset = source_x_offset >> 16;
            // write one 32-bit pixel through the uint32_t pointer,
            // then advance the fixed-point cursors
            output32[i_xdest] = inputLine[sourceOffset];
            i_xdest += 1;
            source_x_offset += x_ratio_with_color;
        }
    }
}