Trying to convert a for loop to SIMD in C++

For a game I'm trying to make faster, I have a piece of code that makes tanks evade each other by pushing them apart with a force.
for ( unsigned int i = 0; i < (MAXP1 + MAXP2); i++ )
{
if (game->m_Tank[i] == this) continue;
float2 d = pos - game->m_Tank[i]->pos;
float dlsq = (d.x * d.x + d.y * d.y);
if (dlsq < 64) force += normalize(d) * 2.0f;
else if (dlsq < 256) force += normalize(d) * 0.4f;
}
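For context, float2 and normalize come from the game's own math code; a minimal sketch of what the loop assumes they provide (my stand-in, not the actual framework) is:
// Hypothetical stand-ins for the game's float2 helpers used above (the real
// framework types may differ) - just enough to make the loop self-contained.
#include <cmath>

struct float2 { float x, y; };

inline float2 operator-(float2 a, float2 b) { return { a.x - b.x, a.y - b.y }; }
inline float2 operator*(float2 v, float s)  { return { v.x * s, v.y * s }; }
inline float2& operator+=(float2& a, float2 b) { a.x += b.x; a.y += b.y; return a; }

inline float2 normalize(float2 v)
{
    float len = std::sqrt(v.x * v.x + v.y * v.y); // undefined for a zero vector,
    return { v.x / len, v.y / len };              // hence the dlsq adjustment in the SIMD version below
}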
Now I'm trying to process four tanks at a time using SIMD, so I have
for (unsigned int i = 0; i < (MAXP1 + MAXP2) / 4; i++)
{
union {
__m128 valuesx4;
float valuesx[4];
};
union {
__m128 valuesy4;
float valuesy[4];
};
__m128 dx4 = _mm_set_ps((pos - game->m_Tank[i*4]->pos).x, (pos - game->m_Tank[i*4+1]->pos).x, (pos - game->m_Tank[i*4+2]->pos).x, (pos - game->m_Tank[i*4+3]->pos).x);
__m128 dy4 = _mm_set_ps((pos - game->m_Tank[i*4]->pos).y, (pos - game->m_Tank[i*4+1]->pos).y, (pos - game->m_Tank[i*4+2]->pos).y, (pos - game->m_Tank[i*4+3]->pos).y);
__m128 dlsq4 = _mm_add_ps(_mm_mul_ps(dx4, dx4), _mm_mul_ps(dy4, dy4));
__m128 sixtyfour4 = _mm_set_ps1(64.0f);
__m128 twofiftysix4 = _mm_set_ps1(256.0f);
__m128 mask0 = _mm_cmpeq_ps(dlsq4,_mm_set_ps1(0.f));
__m128 adjdlsq4 = _mm_add_ps(dlsq4, _mm_and_ps(mask0, _mm_set_ps1(300.f)));
__m128 mask1 = _mm_cmplt_ps(adjdlsq4, sixtyfour4);
__m128 mask2 = _mm_and_ps(_mm_cmplt_ps(adjdlsq4, twofiftysix4), _mm_cmpge_ps(adjdlsq4,sixtyfour4));
__m128 multiplier4 = _mm_add_ps(_mm_and_ps(mask1, _mm_set_ps1(2.0f)), _mm_and_ps(mask2, _mm_set_ps1(0.4f)));
__m128 rsqrt4 = _mm_rsqrt_ps(adjdlsq4);
__m128 values4 = _mm_mul_ps(rsqrt4, multiplier4);
valuesx4 = _mm_mul_ps(values4, dx4);
valuesy4 = _mm_mul_ps(values4, dy4);
force += (valuesx[0], valuesy[0]);
force += (valuesx[1], valuesy[1]);
force += (valuesx[2], valuesy[2]);
force += (valuesx[3], valuesy[3]);
}
I adjust dlsq to account for the case where the checked tank is the current tank. For some reason, however, the SIMD version behaves differently: in the normal version the tanks gather around the target and stay in a roughly filled circle around it, but in the SIMD version they start following a sloped line and intersecting with each other. They all move on a line with direction (1,-1) from the target. Does anyone see where it goes wrong?
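For clarity, a scalar restatement of the adjustment I'm describing (evasionMultiplier is just a hypothetical helper for illustration, not code from the game):
// Scalar view of the adjusted-distance trick used in the SIMD loop: a squared
// distance of exactly zero can only mean "compared against myself", so it is
// bumped past both thresholds (300 > 256) and contributes no force.
inline float evasionMultiplier(float dlsq)
{
    if (dlsq == 0.0f) dlsq += 300.0f;
    if (dlsq < 64.0f)  return 2.0f;
    if (dlsq < 256.0f) return 0.4f;
    return 0.0f;
}
// per-tank force contribution: d * (evasionMultiplier(dlsq) / sqrt(dlsq))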


How can I force the compiler to make critical variables in a register?

My self-set task was to experiment with optimising the ReLU activation function (for neural networks), where the function activates an entire layer at a time and relies on SIMD vectorisation and loop unrolling to get the job done much faster - with success! I've seen a consistent 4x performance increase over the standard C++ way of doing this task (that I could think of, anyway).
I'm still curious about pushing it faster and faster, and I've been looking at the disassembly of the program in Release mode, x64, MSVC compiler, with all the optimisations and instruction sets on the highest setting - so in theory the compiler should be keeping commonly used variables in registers. However, it refuses to do so: they're always in memory. This seems highly inefficient to me, and one final bottleneck. I've been using intrinsics (because there is no inline assembly in x64 - if that were allowed, I could bypass this infuriating compiler issue entirely), and here is the code:
The global bitmask is just the MSB of a 64-bit value set to 1, with everything else zero - it's a mask to check whether a value is negative. The ReLU is max(in, 0.0), but I found that to be far slower, so I've been creating a mask that is set to true when the value is non-negative (ANDNOT is my instruction) and then using a maskload, which zeroes out the destination if the mask isn't true for that element. Likewise, I keep the global ones and zeros vectors as pre-stored values, which I would hope are kept in registers, as they are used often and would be more efficiently placed in the ymm registers, but instead they are being put on the stack.
I have experimented with creating the bitmask and the ones vector as local const variables, in the hope that that would force them into registers, but there is no effect - they are put in a register for the first instruction and then sent out to memory immediately afterwards.
The two loops for the activation and the derivative are separated because they use different intrinsics - I've found that the max intrinsic is far faster than a maskload - but I can't see any other way to compute the derivative (1.0 if in > 0.0, else 0.0) with a faster intrinsic. I've tested it: counter-intuitively, separating the loops makes it far faster.
void ReluTesters::reluCompBothV3(double* in, double* out, int args) {
int unrolled = args / 16;
__m256d* inPtr = (__m256d*) in;
__m256d* inPtr1 = (__m256d*) in + 1;
__m256d* inPtr2 = (__m256d*) in + 2;
__m256d* inPtr3 = (__m256d*) in + 3;
__m256d* actOutPtr = (__m256d*) out;
__m256d* actOutPtr1 = (__m256d*) out + 1;
__m256d* actOutPtr2 = (__m256d*) out + 2;
__m256d* actOutPtr3 = (__m256d*) out + 3;
__m256d* delOutPtr = (__m256d*) (out + args);
__m256d* delOutPtr1 = (__m256d*) (out + args + 4);
__m256d* delOutPtr2 = (__m256d*) (out + args + 8);
__m256d* delOutPtr3 = (__m256d*) (out + args + 12);
const __m256d zeros = reluGlobalZeroVectorAVX;
for (int i = 0; i < unrolled; ++i) {
*actOutPtr = _mm256_max_pd(*inPtr, zeros);
*actOutPtr1 = _mm256_max_pd(*inPtr1, zeros);
*actOutPtr2 = _mm256_max_pd(*inPtr2, zeros);
*actOutPtr3 = _mm256_max_pd(*inPtr3, zeros);
inPtr += 4;
inPtr1 += 4;
inPtr2 += 4;
inPtr3 += 4;
actOutPtr += 4;
actOutPtr1 += 4;
actOutPtr2 += 4;
actOutPtr3 += 4;
}
inPtr = (__m256d*) in;
inPtr1 = (__m256d*) in + 1;
inPtr2 = (__m256d*) in + 2;
inPtr3 = (__m256d*) in + 3;
const __m256d ones = reluGlobalOneVectorAVX;
const __m256i bitmask = reluGlobalBitmask;
for (int i = 0; i < unrolled ; ++i) {
__m256i mask = _mm256_andnot_si256(*(__m256i*)inPtr, bitmask);
__m256i mask1 = _mm256_andnot_si256(*(__m256i*)inPtr1, bitmask);
__m256i mask2 = _mm256_andnot_si256(*(__m256i*)inPtr2, bitmask);
__m256i mask3 = _mm256_andnot_si256(*(__m256i*)inPtr3, bitmask);
*delOutPtr = _mm256_maskload_pd((double*)&ones, mask);
*delOutPtr1 = _mm256_maskload_pd((double*)&ones, mask1);
*delOutPtr2 = _mm256_maskload_pd((double*)&ones, mask2);
*delOutPtr3 = _mm256_maskload_pd((double*)&ones, mask3);
delOutPtr += 4;
delOutPtr1 += 4;
delOutPtr2 += 4;
delOutPtr3 += 4;
}
double* inD = (double*)inPtr;
double* actOutD = (double*)actOutPtr;
double* delOutD = (double*)delOutPtr;
for (int i = inD - in; i < args; ++i, ++inD, ++actOutD, ++delOutD) {
if ((*((long long*)inD)) & (*((long long*)&bitmask))) {
*actOutD = 0.0;
*delOutD = 0.0;
}
else {
*actOutD = *inD;
*delOutD = 1.0;
}
}
}
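For completeness, the straightforward compare-and-AND alternative for the derivative would look roughly like the sketch below (the function name is just a placeholder; same globals as above, unrolling and the scalar tail omitted, not benchmarked):
// Sketch: ReLU derivative via compare + AND instead of andnot + maskload.
void reluDerivCmpSketch(double* in, double* out, int args) {
    const __m256d zeros = reluGlobalZeroVectorAVX;
    const __m256d ones = reluGlobalOneVectorAVX;
    __m256d* inV = (__m256d*) in;
    __m256d* delV = (__m256d*) (out + args); // derivatives stored after the activations, as above
    for (int i = 0; i < args / 4; ++i) {
        // all-ones lanes where the input is non-negative (matches the sign-bit mask above)
        __m256d nonNeg = _mm256_cmp_pd(inV[i], zeros, _CMP_GE_OQ);
        delV[i] = _mm256_and_pd(nonNeg, ones); // 1.0 where non-negative, 0.0 elsewhere
    }
}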

SIMD: uint16_t array to float array, work on float, then back to uint16_t

I am currently working on a project that manipulates images. To speed up the process (and increase my knowledge), I decided to write some of the basic functions using SIMD instructions.
The code using for loops is
int idx;
uint16_t *A, *B, *C;
float gAlpha = 0.8;
float alpha = 0.2;
for (size_t rw = 0; rw < height; rw++) {
for (size_t cl = 0; cl < width; cl++) {
idx = rw * width + cl;
C[idx] = static_cast<uint16_t>(gAlpha * static_cast<float>(A[idx]) + alpha * static_cast<float>(B[idx]));
}
}
This loop is probably not perfect, but it does its job and my unit tests give me the expected results.
As I said, I am trying to convert these loops using SIMD intrinsics. This is my working code and, as you will see, it is not very pretty... We have access to intrinsics up to AVX2.
size_t n_pixels = height * width;
for (size_t px = 0; px < n_pixels; px += 8) {
__m128i xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
__m128i xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
__m128 ylo = _mm_cvtepi32_ps(xlo);
__m128 yhi = _mm_cvtepi32_ps(xhi);
__m256 pxMinFl = _mm256_castps128_ps256(ylo);
pxMinFl = _mm256_insertf128_ps(pxMinFl, yhi, 1);
xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
ylo = _mm_cvtepi32_ps(xlo);
yhi = _mm_cvtepi32_ps(xhi);
__m256 pxMaxFl = _mm256_castps128_ps256(ylo);
pxMaxFl = _mm256_insertf128_ps(pxMaxFl, yhi, 1);
__m256 avGain1 = _mm256_set1_ps(gAlpha);
__m256 avGain2 = _mm256_set1_ps(alpha);
__m256 prodUp = _mm256_mul_ps(pxMinFl, avGain1);
__m256 prodBt = _mm256_mul_ps(pxMaxFl, avGain2);
__m256 pxOutFl = _mm256_add_ps(prodUp, prodBt);
__m128 ylo_ps = _mm256_castps256_ps128(pxOutFl);
__m128 yhi_ps = _mm256_extractf128_ps(pxOutFl, 1);
__m128i xlo_ep = _mm_cvtps_epi32(ylo_ps);
__m128i xhi_ep = _mm_cvtps_epi32(yhi_ps); // <- POINT 1
int* xl = reinterpret_cast<int*>(&xlo_ep); // <- POINT 2
for (int i = 0; i < 8; i++) { // <- POINT 2
C[px + i] = static_cast<uint16_t>(xl[i]); // <- POINT 2
}
}
There are probably tons of optimizations that could be done on this code, but I have checked that the output of pxOutFl corresponds to the expected value. Where it starts to look like black magic to me is how I have to save the data back into the output array C. First of all, the code doesn't work if I comment out the line at POINT 1, even though, as you can read, I don't use the variable. Secondly, I would have guessed that there is a better solution than the trick I used to store the data back into the uint16_t array (POINT 2), but I can't find one that works.
Could someone point me into the correct direction? What am I missing? How could I improve this code?
Thanks in advance!
PS: We use the Intel compiler 2017 from Parallel Studio Professional Edition 2017 on Linux (Fedora 25).
You can re-write all of POINT 2 as:
_mm_storeu_si128((__m128i *)&C[px], xlo_ep);
Also note that all instances of _mm_load_si128 should probably be _mm_loadu_si128, since you don't seem to be guaranteeing alignment anywhere.
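For the full 8 results, the two __m128i halves can also be packed down to unsigned 16-bit with saturation before the store - a sketch of that variant (assuming SSE4.1, which AVX2 implies; not tested against your code):
// Pack both int32 halves to uint16_t with saturation, then store all 8 results at once.
__m128i packed = _mm_packus_epi32(xlo_ep, xhi_ep);
_mm_storeu_si128((__m128i *)&C[px], packed);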

OpenACC present clause update data

I am trying to apply OpenACC optimizations to a many-body simulation. Currently I am facing a problem which leads to the memory errors below:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
srun: error: jrc0017: task 0: Exited with exit code 1
I am using the pgc++ compiler and my compiler flags are -acc -Minfo=accel -ta=tesla -fast -std=c++11. I don't want to use -ta=tesla:managed because I want to manage the memory myself.
#pragma acc kernels present(sim.part.rx, sim.part.ry, sim.part.rz, sim.part.vx, sim.part.vy, sim.part.vz)
{
for(int idx = 0; idx < sim.num; ++idx) { // Loop over target particle
float
prx = sim.part.rx[idx], // my position
pry = sim.part.ry[idx],
prz = sim.part.rz[idx];
float Fx = 0.f, Fy = 0.f, Fz = 0.f; // Force
#pragma acc loop
for(int jdx = 0; jdx < sim.num; ++jdx) { // Loop over interaction partners
if(idx != jdx) { // No self-force
const float dx = prx - sim.part.rx[jdx]; // Distance to partner
const float dy = pry - sim.part.ry[jdx];
const float dz = prz - sim.part.rz[jdx];
const float h = 1.f/sqrt(dx*dx + dy*dy + dz*dz + eps);
const float h3 = h*h*h;
Fx += dx*h3; // Sum up force
Fy += dy*h3;
Fz += dz*h3;
}
}
sim.part.vx[idx] += sim.mass*dt*Fx; // update velocity
sim.part.vy[idx] += sim.mass*dt*Fy;
sim.part.vz[idx] += sim.mass*dt*Fz;
}
}
If I delete the code below,
sim.part.vx[idx] += sim.mass*dt*Fx; // update velocity
sim.part.vy[idx] += sim.mass*dt*Fy;
sim.part.vz[idx] += sim.mass*dt*Fz;
my code runs without problems, but I get the memory error if I put those lines back. It seems that the writes to sim.part.vx update data the compiler doesn't know about, which leads to the memory problem.
Does anyone know how to fix this problem?
I suspect the problem is that sim and sim.part are not on the device (or the compiler doesn't realize that they're on the device). As a workaround, can you try introducing pointers to those arrays directly?
float *rx = sim.part.rx, *ry = sim.part.ry, *rz = sim.part.rz,
*vx = sim.part.vx, *vy = sim.part.vy, *vz = sim.part.vz;
#pragma acc kernels present(rx, ry, rz, vx, vy, vz)
{
for(int idx = 0; idx < sim.num; ++idx) { // Loop over target particle
float
prx = rx[idx], // my position
pry = ry[idx],
prz = rz[idx];
float Fx = 0.f, Fy = 0.f, Fz = 0.f; // Force
#pragma acc loop
for(int jdx = 0; jdx < sim.num; ++jdx) { // Loop over interaction partners
if(idx != jdx) { // No self-force
const float dx = prx - rx[jdx]; // Distance to partner
const float dy = pry - ry[jdx];
const float dz = prz - rz[jdx];
const float h = 1.f/sqrt(dx*dx + dy*dy + dz*dz + eps);
const float h3 = h*h*h;
Fx += dx*h3; // Sum up force
Fy += dy*h3;
Fz += dz*h3;
}
}
vx[idx] += sim.mass*dt*Fx; // update velocity
vy[idx] += sim.mass*dt*Fy;
vz[idx] += sim.mass*dt*Fz;
}
}
How are sim and sim.part allocated? It's possible to use unstructured data directives in the constructor and destructor to make sure that sim and sim.part are on the device too. If you've already done this, then another possible solution is to add present(sim, sim.part) to your existing present clause so the compiler knows that you've already taken care of those data structures too.
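If it helps, here is a rough sketch of what those unstructured data directives could look like - the Particles type, its members, and the constructor/destructor shape are my assumptions about how sim.part is defined, not code from the question:
// Hypothetical particle container: create the device copy of the struct first,
// then its arrays, so the device pointers are attached and present() later just works.
struct Particles {
    int num;
    float *rx, *ry, *rz, *vx, *vy, *vz;

    Particles(int n) : num(n) {
        rx = new float[n]; ry = new float[n]; rz = new float[n];
        vx = new float[n]; vy = new float[n]; vz = new float[n];
        #pragma acc enter data copyin(this)
        #pragma acc enter data create(rx[0:n], ry[0:n], rz[0:n], vx[0:n], vy[0:n], vz[0:n])
    }

    ~Particles() {
        #pragma acc exit data delete(rx[0:num], ry[0:num], rz[0:num], vx[0:num], vy[0:num], vz[0:num])
        #pragma acc exit data delete(this)
        delete[] rx; delete[] ry; delete[] rz;
        delete[] vx; delete[] vy; delete[] vz;
    }
};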

random velocities when using simulation shader

I'm trying to implement a Fruchterman-Reingold simulation using shaders. Before implementing the compute portion in a shader, I wrote it in JavaScript. It works exactly as I expect, as seen here:
http://jaredmcqueen.github.io/gpgpu-force-direction/canvas_app.html
When implementing the compute portion in a shader, I get a stable structure that randomly drifts around the screen. I cannot figure out what repulsion / attraction forces are causing my graphs to float around so unpredictably:
http://jaredmcqueen.github.io/gpgpu-force-direction/gpgpu_app.html
The core of the physics is in the repulsion / attraction functions:
//fr(x) = (k*k)/x;
vec3 addRepulsion(vec3 self, vec3 neighbor){
vec3 diff = self - neighbor;
float x = length( diff );
float f = ( k * k ) / x;
return normalize(diff) * f;
}
//fa(x) = (x*x)/k;
vec3 addAttraction(vec3 self, vec3 neighbor){
vec3 diff = self - neighbor;
float x = length( diff );
float f = ( x * x ) / k;
return normalize(diff) * f;
}
Any insight as to why gpgpu, simulation based shaders would behave seemingly random would be greatly appreciated.
It doesn't seem random: the structure stabilizes into a seemingly correct state and then moves off in a constant direction.
It looks like you apply the force in the shader and then update your model's position on the CPU side; that global model position should either stay constant or be updated by a different value.
From what I've seen in the code, I recommend eliminating the floating-point comparisons (compareNodePosition.w == -1.0 || 0.0) and the continue operator. Please say if that helped. I haven't looked into the algorithm's logic yet.
It turns out I was iterating through the edges incorrectly. Here's my new edge iteration:
float idx = selfEdgeIndices.x;
float idy = selfEdgeIndices.y;
float idz = selfEdgeIndices.z;
float idw = selfEdgeIndices.w;
float start = idx * 4.0 + idy;
float end = idz * 4.0 + idw;
if(! ( idx == idz && idy == idw ) ){
float edgeIndex = 0.0;
for(float y = 0.0; y < edgesTexWidth; y++){
for(float x = 0.0; x < edgesTexWidth; x++){
vec2 ref = vec2( x + 0.5 , y + 0.5 ) / vec2(edgesTexWidth,edgesTexWidth);
vec4 pixel = texture2D(edgeData,ref);
if (edgeIndex >= start && edgeIndex < end){
nodePosition = getNeighbor(pixel.x);
nodeDiff.xyz -= addAttraction(currentNodePosition.xyz, nodePosition);
}
edgeIndex++;
if (edgeIndex >= start && edgeIndex < end){
nodePosition = getNeighbor(pixel.y);
nodeDiff.xyz -= addAttraction(currentNodePosition.xyz, nodePosition);
}
edgeIndex++;
if (edgeIndex >= start && edgeIndex < end){
nodePosition = getNeighbor(pixel.z);
nodeDiff.xyz -= addAttraction(currentNodePosition.xyz, nodePosition);
}
edgeIndex++;
if (edgeIndex >= start && edgeIndex < end){
nodePosition = getNeighbor(pixel.w);
nodeDiff.xyz -= addAttraction(currentNodePosition.xyz, nodePosition);
}
edgeIndex++;
}
}
}

Second iteration crash - order irrelevant

To save on global memory transfers, and because all of the steps of the code work individually, I have tried to combine all of the kernels into a single kernel, with the first two (of three) steps being done as device calls rather than global calls.
This is failing in the second half of the first step.
There is a function that I need to call twice, to calculate the two halves of an image. Regardless of the order in which the halves are calculated, it crashes on the second call.
After examining the code as well as I could, and running it multiple times with different return points, I have found what makes it crash.
__device__
void IntersectCone( float* ModDistance,
float* ModIntensity,
float3 ray,
int threadID,
modParam param )
{
bool ignore = false;
float3 normal = make_float3(0.0f,0.0f,0.0f);
float3 result = make_float3(0.0f,0.0f,0.0f);
float normDist = 0.0f;
float intensity = 0.0f;
float check = abs( Dot(param.position, Cross(param.direction,ray) ) );
if(check > param.r1 && check > param.r2)
ignore = true;
float tran = param.length / (param.r2/param.r1 - 1);
float length = tran + param.length;
float Lsq = length * length;
float cosSqr = Lsq / (Lsq + param.r2 * param.r2);
//Changes the centre position?
float3 position = param.position - tran * param.direction;
float aDd = Dot(param.direction, ray);
float3 e = position * -1.0f;
float aDe = Dot(param.direction, e);
float dDe = Dot(ray, e);
float eDe = Dot(e, e);
float c2 = aDd * aDd - cosSqr;
float c1 = aDd * aDe - cosSqr * dDe;
float c0 = aDe * aDe - cosSqr * eDe;
float discr = c1 * c1 - c0 * c2;
if(discr <= 0.0f)
ignore = true;
if(!ignore)
{
float root = sqrt(discr);
float sign;
if(c1 > 0.0f)
sign = 1.0f;
else
sign = -1.0f;
//Try opposite sign....?
float3 result = (-c1 + sign * root) * ray / c2;
e = result - position;
float dot = Dot(e, param.direction);
float3 s1 = Cross(e, param.direction);
float3 normal = Cross(e, s1);
if( (dot > tran) || (dot < length) )
{
if(Dot(normal,ray) <= 0)
{
normal = Norm(normal); //This stuff (1)
normDist = Magnitude(result);
intensity = -IntensAt1m * Dot(ray, normal) / (normDist * normDist);
}
}
}
ModDistance[threadID] = normDist; // and this stuff (2)
ModIntensity[threadID] = intensity;
}
There are two things I can do to make this not crash, both of which negate the point of the function: not writing to ModDistance[] and ModIntensity[], or not writing to normDist and intensity.
First-chance exceptions are thrown by the code above, but not if either of the marked blocks is commented out.
Also, the program only crashes the second time this routine is called.
I have been trying to figure this out all day; any help would be fantastic.
The code that calls it is:
int subrow = threadIdx.y + Mod_Height/2;
int threadID = subrow * (Mod_Width+1) + threadIdx.x;
int obsY = windowY + subrow;
float3 ray = CalculateRay(obsX,obsY);
if( !IntersectSphere(ModDistance, ModIntensity, ray, threadID, param) )
{
IntersectCone(ModDistance, ModIntensity, ray, threadID, param);
}
subrow = threadIdx.y;
threadID = subrow * (Mod_Width+1) + threadIdx.x;
obsY = windowY + subrow;
ray = CalculateRay(obsX,obsY);
if( !IntersectSphere(ModDistance, ModIntensity, ray, threadID, param) )
{
IntersectCone(ModDistance, ModIntensity, ray, threadID, param);
}
The kernel is running out of resources. As posted in the comments, it was giving the error cudaErrorLaunchOutOfResources.
To avoid this, you should use a __launch_bounds__ specifier to specify the block dimensions you want for your kernel. This will force the compiler to ensure there are enough resources. See the CUDA programming guide for details on __launch_bounds__.
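A rough sketch of the specifier - the kernel name and the numbers are placeholders, not values from the question:
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells the
// compiler to limit register usage so that, here, a 256-thread block can always launch.
__global__ void
__launch_bounds__(256, 2)
ModelKernel(float* ModDistance, float* ModIntensity, modParam param)
{
    // ... the combined steps from the question would go here ...
}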