C++ modulus by 2 performance

I was wondering what the typical compiler's assembly reduction would be when performing an integer modulus by 2 operation such as this:
const char* integer_string = "300"; // avoid compiler optimization
int i = atoi(integer_string);
int b = i % 2; // the line in question
I'd imagine the compiler could turn it into a bit-wise operation to just check that last bit (1s place), but does it do this?

The question only makes sense in the context of a particular compiler, platform, optimization options etc.
My compiler (gcc 4.7.2 on x86_64) does do this when -O3 optimizations are turned on:
andl $1, %esi
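For reference, a minimal pair of test functions (the names are mine) that one could feed to a compiler or to Compiler Explorer to check this; note that for signed int the result of % 2 can be -1, so the plain and-with-1 shows up most cleanly for unsigned operands or when the compiler can prove the value is non-negative:
int mod2_signed(int i) { return i % 2; }               // result may be -1, 0 or 1, so extra handling for negatives may appear
unsigned mod2_unsigned(unsigned i) { return i % 2u; }  // typically compiles to a single and-with-1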


Optimal implementation of iterative Kahan summation

Intro
Kahan summation / compensated summation is a technique that compensates for the fact that floating-point addition is not associative: truncation errors mean (a+b)+c is not exactly equal to a+(b+c), and these errors accumulate into an undesired relative error over longer series of sums, which is a common obstacle in scientific computing.
Task
I desire the optimal implementation of Kahan summation. I suspect that the best performance may be achieved with handcrafted assembly code.
Attempts
The code below calculates the sum of 1000 random numbers in range [0,1] with three approaches.
Standard summation: Naive implementation which accumulates a root mean square relative error that grows as O(sqrt(N))
Kahan summation [g++]: Compensated summation using the c/c++ function "csum". Explanation in comments. Note that some compilers may have default flags that invalidate this implementation (see output below).
Kahan summation [asm]: Compensated summation implemented as "csumasm" using the same algorithm as "csum". Cryptic explanation in comments.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
extern "C" void csumasm(double&, double, double&);
__asm__( //Windows x64 calling convention: &a in rcx, b in xmm1, &c in r8
"csumasm:\n"
"movsd (%rcx), %xmm0\n" //xmm0 = a
"subsd (%r8), %xmm1\n" //xmm1 = b - c | y = b-c
"movapd %xmm0, %xmm2\n"
"addsd %xmm1, %xmm2\n" //xmm2 = a + y | b = a+y
"movapd %xmm2, %xmm3\n"
"subsd %xmm0, %xmm3\n" //xmm3 = b - a
"movapd %xmm3, %xmm0\n"
"subsd %xmm1, %xmm0\n" //xmm0 = (b-a) - y | new compensation term
"movsd %xmm0, (%r8)\n" //store new c
"movsd %xmm2, (%rcx)\n" //store new sum into a
"ret\n"
);
void csum(double &a,double b,double &c) { //this function adds a and b, and passes c as a compensation term
double y = b-c; //y is the correction of b argument
b = a+y; //add corrected b argument to a argument. The output of the current summation
c = (b-a)-y; //find new error to be passed as a compensation term
a = b;
}
double fun(double fMin, double fMax){
double f = (double)rand()/RAND_MAX;
return fMin + f*(fMax - fMin); //returns random value
}
int main(int argc, char** argv) {
int N = 1000;
srand(0); //use 0 seed for each method
double sum1 = 0;
for (int n = 0; n < N; ++n)
sum1 += fun(0,1);
srand(0);
double sum2 = 0;
double c = 0; //compensation term
for (int n = 0; n < N; ++n)
csum(sum2,fun(0,1),c);
srand(0);
double sum3 = 0;
c = 0;
for (int n = 0; n < N; ++n)
csumasm(sum3,fun(0,1),c);
printf("Standard summation:\n %.16e (error: %.16e)\n\n",sum1,sum1-sum3);
printf("Kahan compensated summation [g++]:\n %.16e (error: %.16e)\n\n",sum2,sum2-sum3);
printf("Kahan compensated summation [asm]:\n %.16e\n",sum3);
return 0;
}
The output with -O3 is:
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902127e+002 (error: 0.0000000000000000e+000)
Kahan compensated summation [asm]:
5.1991955320902127e+002
The output with -O3 -ffast-math is:
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [asm]:
5.1991955320902127e+002
It is clear that -ffast-math destroys the Kahan summation arithmetic, which is unfortunate because my program requires the use of -ffast-math.
Question
Is it possible to construct a better/faster asm x64 code for Kahan's compensated summation? Perhaps there is a clever way to skip some of the movapd instructions?
If no better asm code is possible, is there a C++ way to implement Kahan summation that can be used with -ffast-math without devolving to the naive summation? Perhaps a C++ implementation is generally more flexible for the compiler to optimize.
Ideas or suggestions are appreciated.
Further information
The contents of "fun" cannot be inlined, but the "csum" function could be.
The sum must be calculated as an iterative process (the corrected term must be applied on every single addition). This is because the intended summation function takes an input that depends on the previous sum.
The intended summation function is called indefinitely and several hundred million times per second, which motivates the pursuit of a high-performance low-level implementation.
Higher-precision arithmetic such as long double, float128 or arbitrary-precision libraries is not to be considered, for performance reasons.
Edit: Inlined csum (doesn't make much sense without the full code, but just for reference)
subsd xmm0, QWORD PTR [rsp+32]
movapd xmm1, xmm3
addsd xmm3, xmm0
movsd QWORD PTR [rsp+16], xmm3
subsd xmm3, xmm1
movapd xmm1, xmm3
subsd xmm1, xmm0
movsd QWORD PTR [rsp+32], xmm1
You can put functions that need to not use -ffast-math (like a csum loop) in a separate file that gets compiled without -ffast-math.
Possibly you could also use __attribute__((optimize("no-fast-math"))), but https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html says that optimization-level pragmas and attributes aren't "suitable in production code", unfortunately.
update: apparently part of the question was based on a misunderstanding that -O3 wasn't safe, or something? It is; ISO C++ specifies FP math rules that are like GCC's -fno-fast-math. Compiling everything with just -O3 apparently makes the OP's code run quickly and safely. See the bottom of this answer for workarounds like OpenMP to get some of the benefit of fast-math for some parts of your code without actually enabling -ffast-math.
ICC defaults to fast-math, so you have to specifically enable FP=strict for it to be safe with -O3, but gcc/clang default to fully strict FP regardless of other optimization settings (except -Ofast = -O3 -ffast-math).
You should be able to vectorize Kahan summation by keeping a vector (or four) of totals and an equal number of vectors of compensations. You can do that with intrinsics (as long as you don't enable fast-math for that file).
e.g. use SSE2 __m128d for 2 packed additions per instruction. Or AVX __m256d. On modern x86, addpd / subpd have the same performance as addsd and subsd (1 uop, 3 to 5 cycle latency depending on microarchitecture: https://agner.org/optimize/).
So you're effectively doing 8 compensated summations in parallel, each sum getting every 8th input element.
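A minimal sketch of the 2-wide SSE2 version (this assumes the inputs are already in an array, which as noted below is not quite the real use case; the names are mine, and the file has to be compiled without -ffast-math):
#include <emmintrin.h>
#include <cstddef>
double kahan_sum_sse2(const double *x, std::size_t n) // n assumed even for brevity
{
    __m128d sum  = _mm_setzero_pd();   // two independent running totals
    __m128d comp = _mm_setzero_pd();   // two compensation terms
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(x + i);
        __m128d y = _mm_sub_pd(v, comp);          // y = input - c
        __m128d t = _mm_add_pd(sum, y);           // t = sum + y (low bits of y may be lost here)
        comp = _mm_sub_pd(_mm_sub_pd(t, sum), y); // c = (t - sum) - y
        sum  = t;
    }
    // combine the two partial sums (final add shown uncompensated for brevity)
    double lo = _mm_cvtsd_f64(sum);
    double hi = _mm_cvtsd_f64(_mm_unpackhi_pd(sum, sum));
    return lo + hi;
}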
Generating random numbers on the fly with your fun() is significantly slower than reading them from memory. If your normal use-case has data in memory, you should be benchmarking that. Otherwise I guess scalar is interesting.
If you're going to use inline asm, it would be much better to actually use it inline so you can get multiple inputs and multiple outputs in XMM registers with Extended asm, not stored/reloaded through memory.
Defining a stand-alone function that actually takes args by reference looks pretty performance-defeating. (Especially when it doesn't even return either of them as a return value to avoid one of the store/reload chains). Even just making a function call introduces a lot of overhead by clobbering many registers. (Not as bad in Windows x64 as in x86-64 System V where all the XMM regs are call-clobbered, and more of the integer regs.)
Also your stand-alone function is specific to the Windows x64 calling convention so it's less portable than inline asm inside a function would be.
And BTW, clang managed to implement csum(double&, double, double&) with only two movapd instructions, instead of the 3 in your asm (which I assume you copied from GCC's asm output). https://godbolt.org/z/lw6tug. If you can assume AVX is available, you can avoid any movapd at all.
And BTW, movaps is 1 byte smaller and should be used instead. No CPUs have had separate data domains / forwarding networks for double vs. float, just vec-FP vs. vec-int (vs. GP integer)
But really, by far your best bet is to get GCC to compile a file or function without -ffast-math. https://gcc.gnu.org/wiki/DontUseInlineAsm. That lets the compiler avoid the movaps instructions when AVX is available, besides letting it optimize better when unrolling.
If you're willing to accept the overhead of a function-call for every element, you might as well let the compiler generate that asm by putting csum in a separate file. (Hopefully link-time optimization respects -fno-fast-math for one file, perhaps by not inlining that function.)
But it would be much better to disable fast-math for the whole function containing the summation loop by putting it in a separate file. You may be stuck choosing where non-inline function-call boundaries need to be, based on compiling some code with fast-math and others without.
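For example, a sketch of what that separate file might contain (simplified to an array input for illustration; the real use case feeds each term from the previous sum, and the names here are mine):
// kahan_sum.cpp -- hypothetical separate translation unit compiled without -ffast-math,
// e.g. g++ -O3 -c kahan_sum.cpp while the rest of the project uses -O3 -ffast-math
double kahan_sum(const double *x, int n)
{
    double sum = 0.0, c = 0.0;   // running total and compensation term
    for (int i = 0; i < n; ++i) {
        double y = x[i] - c;     // corrected input
        double t = sum + y;      // low-order bits of y may be lost here
        c = (t - sum) - y;       // recover what was lost
        sum = t;
    }
    return sum;
}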
Ideally compile all of your code with -O3 -march=native, and profile-guided optimization. Also -flto link-time optimization to enable cross-file inlining.
It's not surprising that -ffast-math breaks Kahan summation: treating FP math as associative is one of the main reasons to use fast-math. If you need other parts of -ffast-math like -fno-math-errno and -fno-trapping-math so math functions can inline better, then enable those manually. Those are basically always safe and a good idea; nobody checks errno after calling sqrt so that requirement to set errno for some inputs is just a terrible misdesign of C that burdens implementations unnecessarily. GCC's -ftrapping-math is on by default even though it's broken (it doesn't always exactly reproduce the number of FP exceptions you'd get if you unmasked any) so it should really be off by default. Turning it off doesn't enable any optimizations that would break NaN propagation, it only tells GCC that the number of exceptions isn't a visible side-effect.
Or maybe try -ffast-math -fno-associative-math for your Kahan summation file, but that's the main one that's needed to auto-vectorize FP loops that involve reductions, and helps in other cases. But still, there are several other valuable optimizations that you'd still get.
Another way to get optimizations that normally require fast-math is #pragma omp simd, which enables auto-vectorization with OpenMP even in files compiled without -ffast-math. You can declare an accumulator variable for a reduction to let gcc reorder operations on it as if they were associative.
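A rough sketch of that (the names are mine; compile with -fopenmp or -fopenmp-simd, no -ffast-math required; note this is an ordinary, uncompensated sum that OpenMP is allowed to reorder):
#include <cstddef>
double omp_simd_sum(const double *x, std::size_t n)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum) // explicitly permits reordering/vectorizing this reduction
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}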

Is there anything special about -1 (0xFFFFFFFF) regarding ADC?

In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag manipulating instructions, in particular, to ADC but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:
constexpr unsigned X = 0;
unsigned f1(unsigned a, unsigned b) {
b += a;
unsigned c = b < a;
return c + b + X;
}
Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks like I got lucky and the generated code (g++ -O3, version 9.1) is this:
f1(unsigned int, unsigned int):
add %edi,%esi
mov %esi,%eax
adc $0x0,%eax
retq
For all values of X that I've tested the code is as above (except, of course for the immediate value $0x0 that changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u, ... it really doesn't matter how you spell it) the generated code is:
f1(unsigned int, unsigned int):
xor %eax,%eax
add %edi,%esi
setb %al
lea -0x1(%rsi,%rax,1),%eax
retq
This seems less efficient than the initial code, as suggested by indirect measurements (not very scientific, though). Am I right? If so, is this a "missing optimization opportunity" kind of bug that is worth reporting?
For what is worth, clang -O3, version 8.8.0, always uses ADC (as I wanted) and icc -O3, version 19.0.1 never does.
I've tried using the intrinsic _addcarry_u32 but it didn't help.
unsigned f2(unsigned a, unsigned b) {
b += a;
unsigned char c = b < a;
_addcarry_u32(c, b, X, &b);
return b;
}
I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)
I might, actually, be using it correctly. For X == 0 I'm happy:
f2(unsigned int, unsigned int):
add %esi,%edi
mov %edi,%eax
adc $0x0,%eax
retq
For X == -1 I'm unhappy :-(
f2(unsigned int, unsigned int):
add %esi,%edi
mov $0xffffffff,%eax
setb %dl
add $0xff,%dl
adc %edi,%eax
retq
I do get the ADC but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)
mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs (see footnote 1).
This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.
I don't know what exactly it saw / was looking for, so yes you should report this as a missed-optimization bug. Or if you want to dig deeper yourself, you could look at the GIMPLE or RTL output after optimization passes and see what happens. If you know anything about GCC's internal representations. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".
The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)
That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry so you can't just use the sum < a+b idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.
e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc because it needs to ignore carry in the special case where a = -1 and carry_in = 1.
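To make that concrete, a hedged sketch of such a general-case full adder in plain C++ (the names are mine); the extra term exists precisely because of that wrap-around case:
unsigned add_with_carry(unsigned a, unsigned b, unsigned carry_in, unsigned *carry_out)
{
    unsigned sum = a + b + carry_in;
    // sum < a alone misses the carry when b == ~0u and carry_in == 1 (b + carry_in wraps to 0)
    *carry_out = (sum < a) || (carry_in && sum == a);
    return sum;
}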
So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.
What's the point of using it since it's up to me to provide the carry flag?
You're using _addcarry_u32 correctly.
The point of its existence is to let you express an add with carry in as well as carry out, which is hard in pure C. GCC and clang don't optimize it well; they often store the carry result to an integer register instead of just keeping it in CF.
If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.
e.g. to add two 128-bit integers in 32-bit chunks, you can do this
// bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
unsigned char carry;
carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}
(On Godbolt with GCC/clang/ICC)
That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.
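For comparison, a hedged sketch of the __int128 version (the function name is mine):
// with the unsigned __int128 extension, GCC/clang typically emit just add + adc themselves
void add_128bit_u128(unsigned long long *dst, const unsigned long long *src)
{
    unsigned __int128 a = ((unsigned __int128)dst[1] << 64) | dst[0];
    unsigned __int128 b = ((unsigned __int128)src[1] << 64) | src[0];
    a += b;
    dst[0] = (unsigned long long)a;
    dst[1] = (unsigned long long)(a >> 64);
}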
GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.
Footnote 1: for uop count, the two are equal on Intel Haswell and earlier, where adc is 2 uops except with a zero immediate, which Sandybridge-family's decoders special-case as 1 uop.
But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.
On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.
So equal total uop count but worse latency means that adc would still be a better choice.
https://agner.org/optimize/

Will a C/C++ compiler reorder commutative operators (e.g. +, *) to optimize constants?

Will the 2nd line of the following code
int bar;
int foo = bar * 3 * 5;
be optimized to
int bar;
int foo = bar * 15;
Or even more:
int foo = 3 * bar * 5;
can be optimized?
The purpose is actually to ask if I can just write
int foo = bar * 3 * 5;
instead of
int foo = bar * (3 * 5);
to save the parentheses (and to avoid having to manually reorder the constants; in many cases, grouping constants with their related variables is more meaningful than grouping constants together purely for optimization).
Almost all compilers will do it for integers: even if folding the constants could overflow differently, signed overflow is undefined by the standard, so they can do what they like.
It often will not work for floating-point values when adhering to strict floating-point math: the order of evaluation can affect the outcome, so a strictly compliant compiler can't reorder floating-point operations.
5.1.2.3 Program execution
[#1] The semantic descriptions in this International Standard describe the behavior of an abstract machine in which issues of optimization are irrelevant.
[#3] In the abstract machine, all expressions are evaluated as specified by the semantics.
[#13] EXAMPLE 5 Rearrangement for floating-point expressions is often restricted because of limitations in precision as well as range. The implementation cannot generally apply the mathematical associative rules for addition or multiplication, nor the distributive rule, because of roundoff error, even in the absence of overflow and underflow. (Source)
It's not describing the use with constants precisely, but it's clearly noting that seemingly equivalent operations are not actually equivalent in the bizarro world that is floating point arithmetic (e.g. x / 5.0 cannot be translated to x * 0.2 with complete equivalence, x + x * y cannot be equivalently represented as x * (1.0 + y)).
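For instance, comparing these two (hypothetical) functions in a compiler makes the difference visible: the integer one is typically folded to a single multiply-by-15 (or a shift/subtract), while the floating-point one typically keeps both multiplies unless something like -ffast-math permits reassociation:
int    foo_int(int x)   { return x * 3 * 5; }     // usually folded to x * 15
double foo_fp(double x) { return x * 3.0 * 5.0; } // usually left as two multiplies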
Here's an example of what an optimizer will do. Compiling this code with g++ 4.9.2 using -O2:
int calculate(int bar)
{
return bar*3*5;
}
is translated into this assembly code:
movl %edi, %eax # copy argument into eax
sall $4, %eax # shift eax left 4 bits
subl %edi, %eax # subtract original value from eax
ret # return (with eax as result)
Not only did it not do two multiplications, it didn't even do one. It converted the multiplication by 15 into something equivalent to this:
int calculate(int bar)
{
return (bar<<4)-bar;
}
A given implementation may or may not optimise any of those expressions. If you really want to know what it's doing for a given set of inputs, examine the generated assembler code.
But there's no guarantee you'll get the same optimisation from another compiler, the same compiler with different options or even the exact same compiler/options on Tuesday a week from now.
The general rule to follow is the "as if" rule, the compiler does things as if it was doing exactly what is specified in the standard. That doesn't mean it has to do it in any given way.
In other words, a compiler is free to do whatever it wants as long as it has the same effect as what the standard mandates.
The standard actually starts focusing on this aspect very early on, in the definitions section 3.4, where it defines behaviour as the "external appearance or action", and further examples pepper the document throughout.

C/C++: adding 0

I have the following line in a function to count the number of 'G' and 'C' in a sequence:
count += (seq[i] == 'G' || seq[i] == 'C');
Are compilers smart enough to do nothing when they see 'count += 0' or do they actually lose time 'adding' 0 ?
Generally
x += y;
is faster than
if (y != 0) { x += y; }
Even when y == 0, because there is no branch in the first option. If it's really important, you'll have to check the compiler output, but don't assume your way is faster just because it sometimes doesn't do an add.
Honestly, who cares??
[Edit:] This actually turned out somewhat interesting. Contrary to my initial assumption, in unoptimized compilation, n += b; is better than n += b ? 1 : 0;. However, with optimizations, the two are identical. More importantly, though, the optimized version of this form is always better than if (b) ++n;, which always creates a cmp/je instruction. [/Edit]
If you're terribly curious, put the following code through your compiler and compare the resulting assembly! (Be sure to test various optimization settings.)
int n;
void f(bool b) { n += b ? 1 : 0; }
void g(bool b) { if (b) ++n; }
I tested it with GCC 4.6.1: With g++ and with no optimization, g() is shorter. With -O3, however, f() is shorter:
g():
cmpb $0, 4(%esp)
je .L1
addl $1, n
.L1:
rep

f():
movzbl 4(%esp), %eax
addl %eax, n
Note that the optimization for f() actually does what you wrote originally: It literally adds the value of the conditional to n. This is in C++ of course. It'd be interesting to see what the C compiler would do, absent a bool type.
Another note, since you tagged this C as well: In C, if you don't use bools (from <stdbool.h>) but rather ints, then the advantage of one version over the other disappears, since both now have to do some sort of testing.
It depends on your compiler, its optimization options that you used and its optimization heuristics. Also, on some architectures it may be faster to add than to perform a conditional jump to avoid the addition of 0.
Compilers will NOT optimize away the +0 unless the expression on the right is a compile-time constant equal to zero. But adding zero is much faster on all modern processors than branching (if/then) to try to avoid the add. So the compiler ends up doing the smartest thing available in the given situation: simply adding 0.
Some are and some are not smart enough; it's highly dependent on the optimizer implementation.
The optimizer might also determine that an if is slower than the +, so it will still do the addition.

How efficient is an if statement compared to a test that doesn't use an if? (C++)

I need a program to get the smaller of two numbers, and I'm wondering if using a standard "if x is less than y"
int a, b, low;
if (a < b) low = a;
else low = b;
is more or less efficient than this:
int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));
(or the variation of putting int delta = a - b at the top and replacing instances of a - b with that).
I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.
(Disclaimer: the following deals with very low-level optimizations that are most often not necessary. If you keep reading, you waive your right to complain that computers are fast and there is never any reason to worry about this sort of thing.)
One advantage of eliminating an if statement is that you avoid branch prediction penalties.
Branch prediction penalties are generally only a problem when the branch is not easily predicted. A branch is easily predicted when it is almost always taken/not taken, or it follows a simple pattern. For example, the branch in a loop statement is taken every time except the last one, so it is easily predicted. However, if you have code like
a = random() % 10
if (a < 5)
print "Less"
else
print "Greater"
then this branch is not easily predicted, and will frequently incur the misprediction penalty associated with flushing the pipeline and rolling back instructions that were executed in the wrong part of the branch.
One way to avoid these kinds of penalties is to use the ternary (?:) operator. In simple cases, the compiler will generate conditional move instructions rather than branches.
So
int a, b, low;
if (a < b) low = a;
else low = b;
becomes
int a, b, low;
low = (a < b) ? a : b
and in the second case a branching instruction is not necessary. Additionally, it is much clearer and more readable than your bit-twiddling implementation.
Of course, this is a micro-optimization which is unlikely to have significant impact on your code.
Simple answer: One conditional jump is going to be more efficient than two subtractions, one addition, a bitwise and, and a shift operation combined. I've been sufficiently schooled on this point (see the comments) that I'm no longer even confident enough to say that it's usually more efficient.
Pragmatic answer: Either way, you're not paying nearly as much for the extra CPU cycles as you are for the time it takes a programmer to figure out what that second example is doing. Program for readability first, efficiency second.
Compiling this on gcc 4.3.4, amd64 (core 2 duo), Linux:
int foo1(int a, int b)
{
int low;
if (a < b) low = a;
else low = b;
return low;
}
int foo2(int a, int b)
{
int low;
low = b + ((a - b) & ((a - b) >> 31));
return low;
}
I get:
foo1:
cmpl %edi, %esi
cmovle %esi, %edi
movl %edi, %eax
ret
foo2:
subl %esi, %edi
movl %edi, %eax
sarl $31, %eax
andl %edi, %eax
addl %esi, %eax
ret
...which I'm pretty sure won't incur branch-prediction penalties, since the code doesn't jump. Also, the non-if-statement version is 2 instructions longer. I think I will continue coding, and let the compiler do its job.
Like with any low-level optimization, test it on the target CPU/board setup.
On my compiler (gcc 4.5.1 on x86_64), the first example becomes
cmpl %ebx, %eax
cmovle %eax, %esi
The second example becomes
subl %eax, %ebx
movl %ebx, %edx
sarl $31, %edx
andl %ebx, %edx
leal (%rdx,%rax), %esi
Not sure if the first one is faster in all cases, but I would bet it is.
The biggest problem is that your second example won't work on 64-bit machines.
However, even neglecting that, modern compilers are smart enough to consider a branchless replacement wherever possible and compare the estimated speeds. So your second example will most likely actually be slower.
There will be no difference between the if statement and using a ternary operator, as even most dumb compilers are smart enough to recognize this special case.
[Edit] Because I think this is such an interesting topic, I've written a blog post on it.
Either way, the assembly will only be a few instructions and either way it'll take picoseconds for those instructions to execute.
I would profile the application and concentrate your optimization efforts on something more worthwhile.
Also, the time saved by this type of optimization will not be worth the time wasted by anyone trying to maintain it.
For simple statements like this, I find the ternary operator very intuitive:
low = (a < b) ? a : b;
Clear and concise.
For something as simple as this, why not just experiment and try it out?
Generally, you'd profile first, identify this as a hotspot, experiment with a change, and view the result.
I wrote a simple program that compares both techniques, passing in random numbers (so that we don't see perfect branch prediction), with Visual C++ 2010. The difference between the approaches on my machine for 100,000,000 iterations? Less than 50 ms total, and the if version tended to be faster. Looking at the codegen, the compiler successfully converted the simple if to a cmovl instruction, avoiding a branch altogether.
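A minimal sketch of that kind of micro-benchmark (the timing method, counts and names here are my own assumptions, not the original test harness):
#include <chrono>
#include <cstdio>
#include <cstdlib>
int main()
{
    const int N = 100000000;
    volatile int sink = 0;   // keep the result live so the loop isn't optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        int a = rand(), b = rand();
        int low;
        if (a < b) low = a; else low = b;  // swap in the bit-twiddling version to compare
        sink = sink + low;
    }
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%lld ms\n", (long long)ms);
    return 0;
}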
One thing to be wary of when you get into really bit-fiddly kinds of hacks is how they may interact with compiler optimizations that take place after inlining. For example, the readable procedure
int foo (int a, int b) {
return ((a < b) ? a : b);
}
is likely to be compiled into something very efficient in any case, but in some cases it may be even better. Suppose, for example, that someone writes
int bar = foo (x, x+3);
After inlining, the compiler will recognize that 3 is positive, and may then make use of the fact that signed overflow is undefined to eliminate the test altogether, to get
int bar = x;
It's much less clear how the compiler should optimize your second implementation in this context. This is a rather contrived example, of course, but similar optimizations actually are important in practice. Of course you shouldn't accept bad compiler output when performance is critical, but it's likely wise to see if you can find clear code that produces good output before you resort to code that the next, amazingly improved, version of the compiler won't be able to optimize to death.
One thing I will point out, which I haven't seen mentioned, is that an optimization like this can easily be overwhelmed by other issues. For example, if you are running this routine on two large arrays of numbers (or worse yet, pairs of numbers scattered in memory), the cost of fetching the values on today's CPUs can easily stall the CPU's execution pipelines.
I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.
Desktop/server CPUs are optimized for pipelining. The second is theoretically faster because the CPU doesn't have to branch and can use multiple ALUs to evaluate parts of the expression in parallel. Non-branching code with intermixed independent operations is best for such CPUs. (But even that is negated now by modern "conditional" CPU instructions, which allow the first code to be branch-less too.)
On embedded CPUs branching is often less expensive (relative to everything else), nor do they have many spare ALUs to evaluate operations out of order (that is, if they support out-of-order execution at all). Less code/data is better, since caches are small too. (I have even seen bubble sort used in embedded applications: the algorithm uses the least memory/code and is fast enough for small amounts of data.)
Important: do not forget about the compiler optimizations. Using many tricks, the compilers sometimes can remove the branching themselves: inlining, constant propagation, refactoring, etc.
But in the end I would say that yes, the difference is too minuscule to be relevant. In the long term, readable code wins.
The way things go on the CPU front, it is more rewarding to invest time now in making the code multi-threaded and OpenCL capable.
Why low = a; in the if and low = a; in the else? And, why 31? If 31 has anything to do with CPU word size, what if the code is to be run on a CPU of different size?
The if..else way looks more readable. I like programs to be as readable to humans as they are to the compilers.
profile results with gcc -o foo -g -p -O0, Solaris 9 v240
%Time Seconds Cumsecs #Calls msec/call Name
36.8 0.21 0.21 8424829 0.0000 foo2
28.1 0.16 0.37 1 160. main
17.5 0.10 0.47 16850667 0.0000 _mcount
17.5 0.10 0.57 8424829 0.0000 foo1
0.0 0.00 0.57 4 0. atexit
0.0 0.00 0.57 1 0. _fpsetsticky
0.0 0.00 0.57 1 0. _exithandle
0.0 0.00 0.57 1 0. _profil
0.0 0.00 0.57 1000 0.000 rand
0.0 0.00 0.57 1 0. exit
code:
int
foo1 (int a, int b, int low)
{
if (a < b)
low = a;
else
low = b;
return low;
}
int
foo2 (int a, int b, int low)
{
low = (a < b) ? a : b;
return low;
}
int main()
{
int low=0;
int a=0;
int b=0;
int i=500;
while (i--)
{
for(a=rand(), b=rand(); a; a--)
{
low=foo1(a,b,low);
low=foo2(a,b,low);
}
}
return 0;
}
Based on this data, in the above environment, the exact opposite of several beliefs stated here was observed. Note the "in this environment": the if construct was faster than the ternary ?: construct.
I had written a ternary logic simulator not so long ago, and this question was relevant to me, as it directly affects my interpreter's execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.
In a binary-coded-ternary system one trit is packed into two bits: the most significant bit means negative one and the least significant means positive one. The case "11" should not occur, but it must be handled properly and treated as 0.
Consider an inline int bct_decoder( unsigned bctData ) function, which should return our formatted trit as a regular integer -1, 0 or 1. As I observed, there are 4 approaches; I called them "cond", "mod", "math" and "lut". Let's investigate them.
The first is based on jz|jnz and jl|jb conditional jumps, thus "cond". Its performance is not good at all, because it relies on the branch predictor. And even worse, it varies, because it is unknown a priori whether there will be one branch or two. Here is an example:
inline int bct_decoder_cond( unsigned bctData ) {
unsigned lsB = bctData & 1;
unsigned msB = bctData >> 1;
return
( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
( lsB > msB ) ? 1 : -1;
}
This is the slowest version: it can involve 2 branches in the worst case, and this is something where binary logic fails. On my 3770K it produces around 200 MIPS on average on random data. (Here and below, each test is the average of 1000 tries on a randomly filled 2 MB dataset.)
The next one relies on the modulo operator, and its speed is somewhere in between the first and third, but it is definitely faster: 600 MIPS:
inline int bct_decoder_mod( unsigned bctData ) {
return ( int )( ( bctData + 1 ) % 3 ) - 1;
}
The next one is the branchless approach, which involves only maths, thus "math"; it does not use jump instructions at all:
inline int bct_decoder_math( unsigned bctData ) {
return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}
This does what it should and behaves really well. To compare, the performance estimate is 1000 MIPS, and it is 5x faster than the branched version. The branched version is probably slowed down by the lack of native 2-bit signed int support. But in my application it is quite a good version in itself.
If this is not enough then we can go further with something special. Next is the lookup-table approach:
inline int bct_decoder_lut( unsigned bctData ) {
static const int decoderLUT[] = { 0, 1, -1, 0 };
return decoderLUT[ bctData & 0x3 ];
}
In my case one trit occupied only 2 bits, so the LUT was only 2b*4 = 8 bytes and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS (this is where my measurement accuracy goes down). That is a ~1.5x speedup over the fast "math" approach, because you just have a precalculated result and a single AND instruction. Sadly caches are small, and if your index length is greater than several bits you simply cannot use it.
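As a companion sketch (my own addition, not part of the original benchmark), the reverse mapping back into the 2-bit format described above can also be branchless:
inline unsigned bct_encoder( int trit ) { // -1 -> 0b10, 0 -> 0b00, +1 -> 0b01
    return ( unsigned )( trit > 0 ) | ( ( unsigned )( trit < 0 ) << 1 );
}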
So I think I answered your question about what branched/branchless code could look like; the answer comes with detailed samples, a real-world application and real performance measurement results.
Updated answer taking the current (2018) state of compiler vectorization. Please see danben's answer for the general case where vectorization is not a concern.
TLDR summary: avoiding ifs can help with vectorization.
Because SIMD would be too complex to allow branching on some elements but not others, any code containing an if statement will fail to be vectorized unless the compiler knows a "superoptimization" technique that can rewrite it into a branchless set of operations. I don't know of any compilers that do this as an integrated part of the vectorization pass (Clang does some of this independently, but not specifically toward helping vectorization AFAIK).
Using the OP's provided example:
int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));
Many compilers can vectorize this to be something approximately equivalent to:
__m128i low128i(__m128i a, __m128i b){
__m128i diff, tmp;
diff = _mm_sub_epi32(a,b);
tmp = _mm_srai_epi32(diff, 31);
tmp = _mm_and_si128(tmp,diff);
return _mm_add_epi32(tmp,b);
}
This optimization would require the data to be laid out in a fashion that would allow for it, but it could be extended to __m256i with avx2 or __m512i with avx512 (and even unroll loops further to take advantage of additional registers) or other simd instructions on other architectures. Another plus is that these instructions are all low-latency, high-throughput instructions (latencies of ~1 and reciprocal throughputs in the range of 0.33 to 0.5, so really fast relative to non-vectorized code).
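For instance, a hedged sketch of how such a kernel might be driven over whole arrays (assumes SSE2, a length that is a multiple of 4, and the low128i helper above; the names are mine):
#include <emmintrin.h>
#include <cstddef>
void min_arrays(const int *a, const int *b, int *out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) { // 4 ints per __m128i
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(out + i), low128i(va, vb));
    }
}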
I see no reason why compilers couldn't optimize an if statement to a vectorized conditional move (except that the corresponding x86 operations only work on memory locations and have low throughput and other architectures like arm may lack it entirely) but it could be done by doing something like:
void lowhi128i(__m128i *a, __m128i *b){ // writes the low (min) into *a and the high (max) into *b
__m128i _a=*a, _b=*b;
__m128i gtmask = _mm_cmpgt_epi32(_a,_b);  // lanes where a > b
_mm_maskmoveu_si128(_b,gtmask,(char*)a);  // where a > b, overwrite *a with b => *a = min
_mm_maskmoveu_si128(_a,gtmask,(char*)b);  // where a > b, overwrite *b with a => *b = max
}
However this would have a much higher latency due to memory reads and writes and lower throughput (higher/worse reciprocal throughput) than the example above.
Unless you're really trying to buckle down on efficiency, I don't think this is something you need to worry about.
My simple thought though is that the if would be quicker because it's comparing one thing, while the other code is doing several operations. But again, I imagine that the difference is minuscule.
If it is for Gnu C++, try this
int min = i <? j;
I have not profiled it but I think it is definitely the one to beat.