Conditional statements with SSE

I'm trying to do some calculations for my game, and I'm trying to calculate the distance between two points. Essentially, I'm using the equation of a circle to see if the points are inside of the radius that I define.
(x - x1)^2 + (y - y1)^2 <= r^2
My question is: how do I evaluate the conditional statement with SSE and interpret the results? So far I have this:
float distSqr4 = (pow(x4 - k->getPosition().x, 2) + pow(y4 - k->getPosition().y, 2));
float distSqr3 = (pow(x3 - k->getPosition().x, 2) + pow(y3 - k->getPosition().y, 2));
float distSqr2 = (pow(x2 - k->getPosition().x, 2) + pow(y2 - k->getPosition().y, 2));
float distSqr1 = (pow(x1 - k->getPosition().x, 2) + pow(y1 - k->getPosition().y, 2));
__m128 distances = _mm_set_ps(distSqr1, distSqr2, distSqr3, distSqr4);
__m128 maxDistSqr = _mm_set1_ps(k->getMaxDistance() * k->getMaxDistance());
__m128 result = _mm_cmple_ps(distances, maxDistSqr);
Once I get the result variable, I get lost. How do I use the result variable that I just got? My plan was, if the condition evaluated turned out to be true, to do some lighting calculations and then draw the pixel on the screen. How do I interpret true vs false in this case?
Any help towards the right direction is greatly appreciated!

My plan was, if the condition evaluated turned out to be true, to do some lighting calculations and then draw the pixel on the screen.
Then you really have little choice but to branch.
The big advantage of doing conditional tests using SSE is that it allows you to write branchless code, which can lead to massive speed improvements. But in your case, you pretty much have to branch because, if I'm understanding you correctly, you never want to output anything on the screen if the condition evaluated to false.
I mean, I guess you could do all of the calculations unconditionally (speculatively) and then just use the result of the conditional to twiddle bits in the pixel values, essentially causing you to draw off of the screen. That would give you branchless code, but it's pretty silly. There is a penalty for branch mispredictions, but it won't be anywhere near as expensive as all of the calculations and drawing code.
In other words, the parallelism you're using SIMD to exploit is exhausted once you have the final result. It's just a simple, scalar compare-and-branch. First you test whether the condition evaluated to true. If not, you'll jump over the code that does the lighting calculations and pixel-drawing. Otherwise, you'll just fall through to execute that code.
The tricky part is that the compiler won't let you use an __m128 variable in a regular old if statement, so you need to "convert" result to an integer that you can use as the basis for a conditional. The easiest way to do that would be the _mm_movemask_epi8 intrinsic.
So you would basically just do:
__m128 distances = _mm_set_ps(distSqr1, distSqr2, distSqr3, distSqr4);
__m128 maxDistSqr = _mm_set1_ps(k->getMaxDistance() * k->getMaxDistance());
__m128 result = _mm_cmple_ps(distances, maxDistSqr);
if (_mm_movemask_epi8(_mm_castps_si128(result)) == 0xFFFF)
{
// All distances were less-than-or-equal-to the maximum, so
// go ahead and calculate the lighting and draw the pixels.
CalcLightingAndDraw(…);
}
This works because _mm_cmple_ps sets each packed double-word to all 1s if the comparison is true, or all 0s if the comparison is false. _mm_movemask_epi8 then gathers the most significant bit of each of the 16 bytes into the low 16 bits of an integer, so each float contributes four identical mask bits and a comparison that is true in all four lanes yields 0xFFFF. You can then use that integer value in a normal conditional statement.
Note: With Clang and ICC, you can get away with passing a __m128 value to the _mm_movemask_epi8 intrinsic. On GCC, it insists upon a __m128i value. You can handle this portably with the bit-preserving cast intrinsic: _mm_movemask_epi8(_mm_castps_si128(result)).
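Since result holds floats, a variant worth considering is _mm_movemask_ps, which accepts a __m128 directly and returns one bit per float (a 4-bit mask) instead of one bit per byte. A minimal sketch, assuming an x86 target with SSE; the helper name withinRadiusMask is invented for illustration:

```cpp
#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_cmple_ps, _mm_movemask_ps

// Returns a 4-bit mask: bit i is set when lane i of `distances`
// is less-than-or-equal-to maxDistSqr.
inline int withinRadiusMask(__m128 distances, float maxDistSqr)
{
    __m128 limit = _mm_set1_ps(maxDistSqr);        // broadcast the limit to all lanes
    __m128 cmp   = _mm_cmple_ps(distances, limit); // all 1s / all 0s per lane
    return _mm_movemask_ps(cmp);                   // sign bit of each lane -> bits 0..3
}
```

With this mask, "all four are inside" is mask == 0xF. Keep in mind that lane 0 (bit 0) holds the last argument passed to _mm_set_ps, not the first.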
Of course, I'm assuming here that you are only going to do the drawing if all of the distances are less-than-or-equal-to the maximum distance. If you want to treat each of the four distances independently, then you need to add more conditional tests on the mask:
__m128 distances = _mm_set_ps(distSqr1, distSqr2, distSqr3, distSqr4);
__m128 maxDistSqr = _mm_set1_ps(k->getMaxDistance() * k->getMaxDistance());
__m128 result = _mm_cmple_ps(distances, maxDistSqr);
unsigned condition = _mm_movemask_epi8(_mm_castps_si128(result));
if (condition != 0)
{
    // One or more of the distances were less-than-or-equal-to the maximum,
    // so we have something to draw.
    // Note: _mm_set_ps stores its first argument in the highest element, so
    // distSqr1 maps to the top four mask bits and distSqr4 to the bottom four.
    if ((condition & 0xF000) != 0)
    {
        // distSqr1 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr1);
    }
    if ((condition & 0x0F00) != 0)
    {
        // distSqr2 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr2);
    }
    if ((condition & 0x00F0) != 0)
    {
        // distSqr3 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr3);
    }
    if ((condition & 0x000F) != 0)
    {
        // distSqr4 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr4);
    }
}
This won't result in very efficient code, since you have to do so many conditional test-and-branch operations. You might be able to continue parallelizing some of the lighting calculations inside of the main if block. I can't say for sure if this is workable, since I don't have enough details about your algorithm/design.
Otherwise, if you can't see any way to wring more parallelism out of the drawing code, the use of explicit SSE intrinsics isn't buying you much here. You were able to parallelize one comparison (_mm_cmple_ps), but the overhead of setting up for that comparison (_mm_set_ps, which will probably compile into vinsertps or unpcklps+movlhps instructions, assuming the inputs are already in XMM registers) will more than cancel out any trivial gains you might get. You'd arguably be just as well off writing the code like so:
float maxDistSqr = k->getMaxDistance() * k->getMaxDistance();
if (distSqr1 <= maxDistSqr)
{
CalcLightingAndDraw(distSqr1);
}
if (distSqr2 <= maxDistSqr)
{
CalcLightingAndDraw(distSqr2);
}
if (distSqr3 <= maxDistSqr)
{
CalcLightingAndDraw(distSqr3);
}
if (distSqr4 <= maxDistSqr)
{
CalcLightingAndDraw(distSqr4);
}
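If the coordinates of the four points happen to be available as arrays, the squared distances themselves can also be computed four at a time, which avoids both the pow calls and the _mm_set_ps shuffling. This is only a sketch under that assumption; insideRadiusMask and its parameters are hypothetical names, not part of the original code:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Tests four points against a circle centered at (cx, cy).
// xs and ys must each point to at least four floats.
// Returns a 4-bit mask: bit i is set when point i lies inside or on the circle.
inline int insideRadiusMask(const float* xs, const float* ys,
                            float cx, float cy, float radius)
{
    __m128 dx = _mm_sub_ps(_mm_loadu_ps(xs), _mm_set1_ps(cx));
    __m128 dy = _mm_sub_ps(_mm_loadu_ps(ys), _mm_set1_ps(cy));
    // dx*dx + dy*dy for all four points at once
    __m128 distSqr = _mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy));
    __m128 inside  = _mm_cmple_ps(distSqr, _mm_set1_ps(radius * radius));
    return _mm_movemask_ps(inside);
}
```

Here the loads place xs[0] in lane 0, so bit i of the mask lines up with array index i. Whether this pays off still depends on how much work follows the test, as discussed above.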

Related

Will if-else statements nest without brackets?

I want to write something utterly ridiculous that calls for a great depth of conditional nesting. The least disorienting way to write this is to forgo brackets entirely, but I have not been able to find any info on if nesting single-statement if-else guards is legal; the non-nested version causes people enough problems it seems.
Is it valid to write the following? (In both C and C++, please let me know if they differ on this.)
float x = max(abs(min), abs(max));
uint32 count = 0u;
// divides and conquers but, tries to shortcut toward more common values
if (x < 100'000.f)
if (x < 10.f)
count = 1u;
else
if(x < 1'000.f)
if (x < 100.f)
count = 2u;
else
count = 3u;
else
if (x < 10'000.f)
count = 4u;
else
count = 5u;
else
... // covers the IEEE-754 float32 range to ~1.0e+37 (maybe 37 end branches)
--skippable lore--
The underlying puzzle (this is for fun) is that I want to figure out the number of glyphs necessary to display a float's internal representation without rounding/truncation, in constant time. Counting the fractional part's glyph count in constant time was much neater/faster, but unfortunately I wasn't able to figure out any bit-twiddling tricks for the integer part, so I've decided to just brute-force it. Never use math when you can use your fists.

From cppreference.com:
in nested if-statements, the else is associated with the closest if that doesn't have an else
So as long as every if has an else, nesting without brackets works fine. The problem occurs when an else should not be associated with the closest if. For example:
if ( condition1 ) {
if ( condition2 )
DoSomething();
} // <-- This is needed so the else goes with the intended if.
else
DoOtherThing();
A quick scan of your code looks like it's fine.
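The rule can be checked with a small experiment. In this sketch (the function names are invented for illustration), the only difference between the two functions is the pair of braces, yet the else ends up attached to a different if. Some compilers warn about the unbraced form.

```cpp
// Without braces, the else pairs with the closest if, i.e. `if (b)`.
inline int classifyUnbraced(bool a, bool b)
{
    int r = 0;
    if (a)
        if (b)
            r = 1;
        else
            r = 2;  // runs when a is true and b is false
    return r;
}

// With braces around the inner if, the else pairs with `if (a)`.
inline int classifyBraced(bool a, bool b)
{
    int r = 0;
    if (a) {
        if (b)
            r = 1;
    }
    else
        r = 2;      // runs when a is false
    return r;
}
```

Calling both with a == false shows the difference: the unbraced version leaves r at 0, while the braced version returns 2.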

Improve numerical accuracy of LineLine intersection method (3D)

I coded the following LineLine intersection method:
double LineLineIntersection(
const Eigen::Vector3d& origin1,
const Eigen::Vector3d& ray1,
const Eigen::Vector3d& origin2,
const Eigen::Vector3d& ray2)
{
if(abs((origin1 - origin2).norm() - 1.0) < 0.000001) return 0;
auto n1 = (origin2 - origin1).cross(ray2);
auto n2 = ray1.cross(ray2);
// Use this to test whether the vectors point in the same or opposite directions
auto n = n2.normalized();
// If n2 is the 0 vector or if the cross products are not colinear, no solution exists
if(n2.norm() < 0.00001 || abs(abs(n1.dot(n)) - n1.norm()) > 0.000001)
return std::numeric_limits<double>::infinity();;
return n1.dot(n) / n2.dot(n);
}
The theory for how this works is explained here. However, the page has a mistake: taking just the absolute value keeps only the magnitude and erases the direction. So instead, the dot product with the cross direction must be taken. That way the distance can be either positive or negative depending on whether the vectors point in the same direction or not.
This technically works but I am running into big numerical errors. For example in one of my tests I am getting:
The difference between i1.x() and Eigen::Vector3d(-0.581, 1.232, 0).x() is 0.0024061184231309873, which exceeds 0.001, where
i1.x() evaluates to -0.58340611842313095,
Eigen::Vector3d(-0.581, 1.232, 0).x() evaluates to -0.58099999999999996, and
0.001 evaluates to 0.001.
An error bigger than 0.001 is huge. What can I do to make the method more accurate?
This is the value of i1: -0.583406 1.23237 0 sorry to not have included it before.

You're using the type "double". Try changing it to "long double", or to "__float128" if your version of G++ supports it. Alternatively, you could use "BigDecimal" in Java, or arbitrary-precision arithmetic in Python, for better accuracy.

Is this the correct way to test to see if two lines are colinear?

My code in the colinear test does not seem to work and it's frustrating the hell out of me. Am I going about this the best way, building my line class from two points of my point class? My test for colinearity is crashing, so I have been stuck in a rut for the past few days.
bool line::isColinear(line)
{
bool line2=false;
line l1,l2;
if (l1.slope()==l2.slope())
{
if (l1.y_int()==l2.y_int())
{
line2 =true;
return line2;
}
}
else
{
line2 =false;
}
}
// Here's a copy of my line class
class line
{
private:
point p1,p2;
public:
bool isColinear(line);
bool isParallel(line);
point solve(line);
double slope();
double y_int();
void Display(ostream&);
};

You are storing a line as the segment between two points. The slope of a line is usually defined as
slope = (y2 - y1) / ( x2 - x1 )
If x1 is equal to x2, you can get a division by zero error/exception.
Other things to be careful about:
If you are storing point coordinates as integers, you could be doing just an integer division and not get what you expect.
If you are using doubles throughout, please use a tolerance when comparing them.

There's not nearly enough here to really judge what's going wrong, so a few generalities:
Never compare floating-point values directly for equality. It won't work a surprising amount of the time. Instead, compare their difference with an amount so small that you're content to call it "zero" (normally we call it "epsilon"):
if (fabs(num1 - num2) < 0.001) {
/* pretend they're equal */
} else {
/* not equal */
}
line2 is unnecessary in this example. You might as well return true or false directly from the conclusions. Often even the return true or return false is needlessly confusing. Let's assume you rewrite this method a little, splitting it into three methods. (Which might or might not be an improvement. Just assume it for a bit.)
bool line::compare_slope(line l2) {
return fabs(l2.slope() - slope()) < 0.001; // don't hardcode this
}
bool line::compare_origin(line l2) {
return fabs(l2.y_int() - y_int()) < 0.001; // nor this
}
bool line::is_colinear(line l2) {
return compare_slope(l2) && compare_origin(l2);
}
No true or false hard coded anywhere -- instead, you rely on the value of the conditionals to compute true or false. (And incidentally, the repetition in those functions goes to show that a function floating_comparison(double f1, double f2, double epsilon) could make it far easier to modify epsilon, either project-wide or by computing an epsilon based on the absolute values of the floating point numbers in question.)
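One possible shape for such a floating_comparison helper is sketched below. The scaling rule is an assumption made for this example (the tolerance grows with the magnitude of the inputs but never drops below epsilon itself); other policies are equally defensible.

```cpp
#include <algorithm>
#include <cmath>

// Returns true when f1 and f2 differ by no more than epsilon,
// scaled by the larger magnitude of the two inputs (but at least by 1.0).
inline bool floating_comparison(double f1, double f2, double epsilon)
{
    double scale = std::max({1.0, std::fabs(f1), std::fabs(f2)});
    return std::fabs(f1 - f2) <= epsilon * scale;
}
```

compare_slope could then be written as return floating_comparison(l2.slope(), slope(), epsilon); with epsilon defined in a single place.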

My guess is that since l1 and l2 are uninitialized, their slope methods are doing a divide by zero. Initialize those properly or switch to the proper variables and you'll fix your crash.
Even once you get that working, the test is likely to fail. You can't compare floating point numbers and expect them to be equal, even if it seems they ought to be equal. You should read What Every Computer Scientist Should Know About Floating-Point Arithmetic.

A simple formula for a line (in 2D) (derived from here) is:
P1(x1,y1) and P2(x2,y2) are the points determining the line.
(y-y1) (x2-x1) + (y1-y2) (x-x1) = 0 ( let's use f(x,y) = 0 )
To test if two lines match imagine that a second line is defined by points P3(x3,y3), P4(x4,y4).
To make sure that those lines are 'quite' the same you should test if the two points (P3, P4) determining the second line are close 'enough' to the previous line.
This is easily done by computing f(x3,y3) and f(x4,y4). If those values are close to 0 then the lines are the same.
Pseudocode:
// I would choose a tolerance around 1
if ( fabs(f(x3,y3)) < tolerance && fabs(f(x4,y4)) < tolerance )
{
// line P1,P2 is the same as P3,P4
}
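The pseudocode above could be fleshed out roughly as follows. This is a sketch only: the Point struct is a hypothetical stand-in for the asker's point class, and the function names are made up. Note that f is not normalized, so a suitable tolerance depends on the scale of the coordinates and on the distance between P1 and P2.

```cpp
#include <cmath>

struct Point { double x, y; };  // hypothetical stand-in for the point class

// f(x, y) for the line through p1 and p2, evaluated at p.
// The result is zero exactly when p lies on that line.
inline double lineFunction(Point p1, Point p2, Point p)
{
    return (p.y - p1.y) * (p2.x - p1.x) + (p1.y - p2.y) * (p.x - p1.x);
}

// True when both p3 and p4 lie (within tolerance) on the line through p1 and p2.
inline bool sameLine(Point p1, Point p2, Point p3, Point p4, double tolerance)
{
    return std::fabs(lineFunction(p1, p2, p3)) < tolerance
        && std::fabs(lineFunction(p1, p2, p4)) < tolerance;
}
```

Unlike the slope-based test, this form involves no division, so vertical lines need no special handling.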

C++, most efficient way to modulate sign (+/-) in an expression

I have an expression
x += y;
and, based on a boolean, I want to be able to change it to
x -= y;
Of course I could do
if(i){x+=y;} else{x-=y;}
//or
x+=(y*sign); //where sign is either 1 or -1
But if I have to do this iteratively, I want to avoid the extra computation. Is there a more efficient way? Is it possible to modulate the operator?

if (i) {x += y;} else {x -= y;}
is probably going to be as efficient as anything else you can do. y * sign is likely to be fairly expensive (unless the compiler can figure out that sign is guaranteed to be 1 or -1).

The most efficient way to do this iteratively is to precompute the data you need.
So, precomputation:
const YourNumberType increment = (i? y : -y);
Then in your loop:
x += increment;
EDIT: re question in commentary about how to generate code, like this:
#include <stdio.h>
void display( int x ) { printf( "%d\n", x ); }
template< bool isSomething >
inline void advance( int& x, int y );
template<> inline void advance<true>( int& x, int y ) { x += y; }
template<> inline void advance<false>( int& x, int y ) { x -= y; }
template< bool isSomething >
void myFunc()
{
int x = 314;
int y = 271;
for( ;; )
{
advance< isSomething >( x, y ); // The nano-optimization.
display( x );
if( !( -10000 < x && x < 10000 ) ) { return; }
}
}
int main( int n, char*[] )
{
n > 1? myFunc<true>() : myFunc<false>();
}
E.g. with Visual C++ 10.0 this generates two versions of myFunc, one with an add instruction and the other with a sub instruction.
Cheers & hth.,

On a modern pipelined machine you want to avoid branching if at all possible in those cases where performance really does count. When the front of the pipeline hits a branch, the CPU guesses which branch to take and lets the pipeline work ahead based on that guess. Everything is fine if the guess was right. Everything is not so fine if the guess was wrong, particularly so if you're still using one of Intel's processors such as a Pentium 4 that suffered from pipeline bloat. Intel discovered that too much pipelining is not a good thing.
More modern processors still do use pipelining (the Core line has a pipeline length of 14 or so), so avoiding branching still remains one of those good things to do -- when it counts, that is. Don't make your code an ugly, prematurely optimized mess when it doesn't count.
The best thing to do is to first find out where your performance demons lie. It is not at all uncommon for a tiny fraction of one percent of the code base to be the cause of almost all of the CPU usage. Optimizing the 99.9% of the code that doesn't contribute to the CPU usage won't solve your performance problems but it will have a deleterious effect on maintenance.
You optimize once you have found the culprit code, and even then, maybe not. When performance doesn't matter, don't optimize. Performance as a metric runs counter to almost every other code quality metric out there.
So, getting off the soap box, let's suppose that little snippet of code is the performance culprit. Try both approaches and test. Try a third approach you haven't thought of yet and test. Sometimes the code that is the best performance-wise is surprisingly non-intuitive. Think Duff's device.

If i stays constant during the execution of the loop, and y doesn't, move the if outside of the loop.
So instead of...
your_loop {
y = ...;
if (i)
x += y;
else
x -= y;
}
...do the following....
if (i) {
your_loop {
y = ...;
x += y;
}
}
else {
your_loop {
y = ...;
x -= y;
}
}
BTW, a decent compiler will do that optimization for you, so you may not see the difference when actually benchmarking.

Sounds like you want to avoid branching and multiplication. Let's say the switch i is set to all 1 bits, same size as y, when you want to add, and to 0 when you want to subtract. Then:
x += (y & i) - (y & ~i)
Haven't tested it, this is just to give you the general idea. Bear in mind that this makes the code a lot harder to read in exchange for what would probably be a very small increase in efficiency.
Edit: or, as bdonlan points out in the comments, possibly even a decrease.

I put my suggestion in the comments to the test, and in a simple test bit-fiddling is faster than the branching options on an Intel(R) Xeon(R) CPU L5520 @ 2.27GHz, but slower on my laptop's Intel Core Duo.
If you are free to give i either the value 0 (for +) or ~0 (for -), these statements are equivalent:
// branching:
if ( i ) sum -= add; else sum += add;
sum += i?-add:add;
sum += (i?-1:1)*add;
// bit fiddling:
sum += (add^i)+(i&1);
sum += (add^i)+(!!i);
sum += (~i&add)-(i&add);
And as said, one method can beat the other by a factor of 2, depending on CPU and optimization level used.
Conclusion, as always, is that benchmarking is the only way to find out which is faster in your particular situation.
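For reference, the XOR-based trick can be packaged as a small function. This is a sketch using the same convention as above (mask is 0 to add and all 1 bits to subtract); the name addOrSubtract is made up, and no claim is made that it is faster than a branch.

```cpp
#include <cstdint>

// mask must be 0 (add) or -1, i.e. all bits set (subtract).
inline std::int32_t addOrSubtract(std::int32_t sum, std::int32_t add, std::int32_t mask)
{
    // (add ^ mask) - mask is add when mask == 0,
    // and ~add + 1 == -add when mask == -1 (two's complement negation).
    return sum + ((add ^ mask) - mask);
}
```

For example, addOrSubtract(10, 3, 0) adds and addOrSubtract(10, 3, -1) subtracts. The mask can be precomputed once outside the loop, for instance as -static_cast<std::int32_t>(subtractFlag) for a bool subtractFlag.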

Is It Possible To Simplify This Branch-Based Vector Math Operation?

I'm trying to achieve something like the following in C++:
class MyVector; // 3 component vector class
MyVector const kA = /* ... */;
MyVector const kB = /* ... */;
MyVector const kC = /* ... */;
MyVector const kD = /* ... */;
// I'd like to shorten the remaining lines, ideally making it readable but less code/operations.
MyVector result = kA;
MyVector const kCMinusD = kC - kD;
if(kCMinusD.X <= 0)
{
result.X = kB.X;
}
if(kCMinusD.Y <= 0)
{
result.Y = kB.Y;
}
if(kCMinusD.Z <= 0)
{
result.Z = kB.Z;
}
Paraphrasing the code into English, I have four 'known' vectors. Two of the vectors have values that I may or may not want in my result, and whether I want them or not is contingent on a branch based on the components of two other vectors.
I feel like I should be able to simplify this code with some matrix math and masking, but I can't wrap my head around it.
For now I'm going with the branch, but I'm curious to know if there's a better way that still would be understandable, and less code-verbose.
Edit:
In reference to Mark's comment, I'll explain what I'm trying to do here.
This code is an excerpt from some spring physics I'm working on. The components are as follows:
kC is the spring's current length, and kD is the minimum spring length.
kA and kB are two sets of spring tensions, each component of which may be unique (i.e., a different spring tension along X, Y, or Z). kA is the spring's tension if it's not fully compressed, and kB is the spring's tension if it IS fully compressed.
I'd like to build up a resultant 'vector' that is simply the amalgamation of kA and kB, dependent on whether the spring is compressed or not.

Depending on the platform you're on, the compiler might be able to optimize statements like
result.x = (kC.x > kD.x) ? kA.x : kB.x;
result.y = (kC.y > kD.y) ? kA.y : kB.y;
result.z = (kC.z > kD.z) ? kA.z : kB.z;
using fsel (floating point select) instructions or conditional moves. Personally, I think the code looks nicer and more concise this way too, but that's subjective.
If the code is really performance critical, and you don't mind changing your vector class to be 4 floats instead of 3, you could use SIMD (e.g. SSE on Intel platforms, VMX on PowerPC) to do the comparison and select the answers. If you went ahead with this, it would look like this (in pseudo code):
// Set each component of mask to be either 0x0 or 0xFFFFFFFF depending on the comparison
MyVector4 mask = vec_compareLessThan(kC, kD);
// Sets each component of result to either kA or kB's component, depending on whether the bits are set in mask
result = vec_select(kA, kB, mask);
This takes a while getting used to, and it might be less readable initially, but you eventually get used to thinking in SIMD mode.
The usual caveats apply, of course - don't optimize before you profile, etc.
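On x86 the pseudo code above maps fairly directly onto SSE intrinsics. The sketch below assumes each vector is stored as four consecutive floats (x, y, z plus one padding lane) and uses <= to match the condition in the question; the function name selectTension is hypothetical. Plain SSE has no select instruction, so the select is built from AND, ANDNOT and OR.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Each pointer refers to four floats: x, y, z and one padding lane.
// result[i] = (kC[i] <= kD[i]) ? kB[i] : kA[i]
inline void selectTension(const float* kA, const float* kB,
                          const float* kC, const float* kD, float* result)
{
    __m128 a = _mm_loadu_ps(kA);
    __m128 b = _mm_loadu_ps(kB);
    // Lanes where kC <= kD become all 1 bits, the others all 0 bits.
    __m128 mask = _mm_cmple_ps(_mm_loadu_ps(kC), _mm_loadu_ps(kD));
    // (mask & b) | (~mask & a): take kB where the mask is set, kA elsewhere.
    __m128 r = _mm_or_ps(_mm_and_ps(mask, b), _mm_andnot_ps(mask, a));
    _mm_storeu_ps(result, r);
}
```

The function contains no branches at all, which is the main attraction of the mask-and-select style.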

If your vector elements are ints, you can do:
MyVector result;
MyVector const kCMinusD = kC - kD;
int mask = kCMinusD.X >> 31; // either 0 or -1 (assumes 32-bit int and an arithmetic right shift)
result.X = (kB.X & mask) | (kA.X & ~mask);
mask = kCMinusD.Y >> 31;
result.Y = (kB.Y & mask) | (kA.Y & ~mask);
mask = kCMinusD.Z >> 31;
result.Z = (kB.Z & mask) | (kA.Z & ~mask);
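If your vector elements are floats or doubles instead of ints, you can do something similar, since the sign bit is likewise the most significant bit (bit 31 of a float, bit 63 of a double). You just have to reinterpret the bits as an integer of the same size, apply the mask, and reinterpret them back.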
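For a single float component, the reinterpretation could look like the sketch below. It assumes a 32-bit IEEE-754 float, and selectBySign is a made-up name. As with the integer version, a difference of exactly 0 selects the non-negative branch, and -0.0f counts as negative because its sign bit is set.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

// Returns ifNegative when the sign bit of diff is set, otherwise returns otherwise.
inline float selectBySign(float diff, float ifNegative, float otherwise)
{
    std::uint32_t d, n, o;
    std::memcpy(&d, &diff, sizeof d);        // reinterpret the bits, no value conversion
    std::memcpy(&n, &ifNegative, sizeof n);
    std::memcpy(&o, &otherwise, sizeof o);
    std::uint32_t mask = 0u - (d >> 31);     // all 1 bits when the sign bit is set, else 0
    std::uint32_t r = (n & mask) | (o & ~mask);
    float out;
    std::memcpy(&out, &r, sizeof out);
    return out;
}
```

std::memcpy is used instead of a pointer cast so that the reinterpretation stays within the language rules; compilers typically reduce it to a plain register move.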

If you're seeking a clean expression in source more than a runtime optimization, you might consider solving this problem from the "toolbox" point of view. So let's say that on MyVector you defined sign, gt (greater-than), and le (less-than-or-equal-to). Then in two lines:
MyVector const kSignCMinusD = (kC - kD).sign();
result = kSignCMinusD.gt(0) * kA + kSignCMinusD.le(0) * kB;
With operator overloading:
MyVector const kSignCMinusD = (kC - kD).sign();
result = (kSignCMinusD > 0) * kA + (kSignCMinusD <= 0) * kB;
For inspiration here's the MatLab function reference. And obviously there are many C++ vector libraries to choose from with such functions.
You can always go in and optimize further if profiling shows it necessary. But often the biggest performance issues are how well you can see the big picture and reuse intermediate computations.
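To make the "toolbox" idea concrete, here is a minimal sketch of what such a vector type might offer. Vec3 and pickTension are hypothetical stand-ins for MyVector and the calling code; gt and le return 1.0 or 0.0 per component so that they can be used as multiplicative masks.

```cpp
struct Vec3 {
    double X, Y, Z;

    // Component-wise sign: -1, 0 or +1.
    Vec3 sign() const {
        auto s = [](double v) { return double((v > 0) - (v < 0)); };
        return {s(X), s(Y), s(Z)};
    }
    // Component-wise comparisons against a scalar, yielding 1.0 or 0.0.
    Vec3 gt(double t) const { return {double(X > t), double(Y > t), double(Z > t)}; }
    Vec3 le(double t) const { return {double(X <= t), double(Y <= t), double(Z <= t)}; }
};

inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.X - b.X, a.Y - b.Y, a.Z - b.Z}; }
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.X + b.X, a.Y + b.Y, a.Z + b.Z}; }
// Component-wise product, used here to apply the 0/1 masks.
inline Vec3 operator*(Vec3 a, Vec3 b) { return {a.X * b.X, a.Y * b.Y, a.Z * b.Z}; }

// kA where kC > kD, kB where kC <= kD, component by component.
inline Vec3 pickTension(Vec3 kA, Vec3 kB, Vec3 kC, Vec3 kD)
{
    Vec3 const s = (kC - kD).sign();
    return s.gt(0) * kA + s.le(0) * kB;
}
```

One caveat of masking by multiplication: if a masked-out component is infinite or NaN, multiplying it by 0 yields NaN rather than 0, so this form is only safe for finite tensions.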

Since you are only doing a subtraction to compare the components, you can rewrite it as below:
MyVector result;
result.x = kD.x >= kC.x ? kB.x : kA.x;
result.y = kD.y >= kC.y ? kB.y : kA.y;
result.z = kD.z >= kC.z ? kB.z : kA.z;
