Efficiency of logical "or" on bool values in C++ - c++

bool x = false, y = false, z = true;
if(x || y || z){}
or
if(x | y | z){}
Does the second if statement perform a bitwise "or" operation on all the booleans, treating them as if they were bytes? e.g. (0000 | 0000 | 0001) = true...
Or does it act like a Java | on booleans, where it evaluates every bool in the expression even if the first was true?
I want to know how bitwise operators work on bool values. Is it equivalent to integer bitwise operations?

Efficiency depends. The logical or operator || is a short-circuit operator,
meaning that if x in your example is true, it will not evaluate y or z.
If it were a logical and &&, then if x is false it would not test y or z.
It's important to note that short-circuiting does not exist as a single instruction,
so it has to be implemented with test and jump instructions. That means branching, which can slow things down, since modern CPUs are pipelined.
But the real answer is that it depends, like many other questions of this nature, as sometimes the benefit of short-circuiting outweighs the cost.
In the following extremely simple example you can see that bitwise or | is superior.
#include <iostream>

bool test1(bool a, bool b, bool c)
{
    return a | b | c;
}

bool test2(bool a, bool b, bool c)
{
    return a || b || c;
}

int main()
{
    bool a = true;
    bool b = false;
    bool c = true;
    test1(a,b,c);
    test2(a,b,c);
    return 0;
}
The following are the Intel-syntax assembly listings produced by gcc 4.8 with -O3:
test1 assembly :
_Z5test1bbb:
.LFB1264:
.cfi_startproc
mov eax, edx
or eax, esi
or eax, edi
ret
.cfi_endproc
test2 assembly :
_Z5test2bbb:
.LFB1265:
.cfi_startproc
test dil, dil
jne .L6
test sil, sil
mov eax, edx
jne .L6
rep; ret
.p2align 4,,10
.p2align 3
.L6:
mov eax, 1
ret
.cfi_endproc
You can see that it has branch instructions, which mess up the pipeline.
Sometimes, however, short-circuiting is worth it, such as
return x && deep_recursion_function();
Disclaimer:
I would always use logical operators on bools, unless performance really is critical, or perhaps in a simple case like test1 and test2 but with lots of bools.
And in either case, first verify that you actually do get an improvement.

The second acts like a Java | on integers: a bitwise or. As C originally didn't have a boolean type, the if statement treats any non-zero value as true, so you can use it that way, but it is often more efficient to use the short-circuiting operator || instead, especially when calling functions to evaluate the conditions.
I would also like to point out that short-circuiting lets you check unsafe conditions, like if(myptr == NULL || myptr->struct_member < 0) return -1;, whereas using the bitwise or there would give you a segfault when myptr is null.

Related

Explanation for GCC compiler optimisation's adverse performance effect?

Please note: this question is neither about code quality and ways to improve the code, nor about the (in)significance of the runtime differences. It is about GCC, which compiler optimisation costs performance, and why.
The program
The following code counts the number of Fibonacci primes up to m:
int main() {
    unsigned int m = 500000000u;
    unsigned int i = 0u;
    unsigned int a = 1u;
    unsigned int b = 1u;
    unsigned int c = 1u;
    unsigned int count = 0u;
    while (a + b <= m) {
        for (i = 2u; i < a + b; ++i) {
            c = (a + b) % i;
            if (c == 0u) {
                i = a + b;
                // break;
            }
        }
        if (c != 0u) {
            count = count + 1u;
        }
        a = a + b;
        b = a - b;
    }
    return count; // Just to "output" (and thus use) count
}
When compiled with g++.exe (Rev2, Built by MSYS2 project) 9.2.0 and no optimisations (-O0), the resulting binary executes (on my machine) in 1.9s. With -O1 and -O3 it takes 3.3s and 1.7s, respectively.
I've tried to make sense of the resulting binaries by looking at the assembly code (godbolt.org) and the corresponding control-flow graph (hex-rays.com/products/ida), but my assembler skills don't suffice.
Additional observations
An explicit break in the innermost if makes the -O1 code fast again:
if (c == 0u) {
    i = a + b; // Not actually needed any more
    break;
}
As does "inlining" the loop's progress expression:
for (i = 2u; i < a + b; ) { // No ++i any more
    c = (a + b) % i;
    if (c == 0u) {
        i = a + b;
        ++i;
    } else {
        ++i;
    }
}
Questions
Which optimisation does/could explain the performance drop?
Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
The important thing at play here are loop-carried data dependencies.
Look at machine code of the slow variant of the innermost loop. I'm showing -O2 assembly here, -O1 is less optimized, but has similar data dependencies overall:
.L4:
xorl %edx, %edx
movl %esi, %eax
divl %ecx
testl %edx, %edx
cmove %esi, %ecx
addl $1, %ecx
cmpl %ecx, %esi
ja .L4
See how the increment of the loop counter in %ecx depends on the previous instruction (the cmov), which in turn depends on the result of the division, which in turn depends on the previous value of loop counter.
Effectively there is a chain of data dependencies on computing the value in %ecx that spans the entire loop, and since the time to execute the loop dominates, the time to compute that chain decides the execution time of the program.
Adjusting the program to compute the number of divisions reveals that it executes 434044698 div instructions. Dividing the number of machine cycles taken by the program by this number gives 26 cycles in my case, which corresponds closely to latency of the div instruction plus about 3 or 4 cycles from the other instructions in the chain (the chain is div-test-cmov-add).
In contrast, the -O3 code does not have this chain of dependencies, making it throughput-bound rather than latency-bound: the time to execute the -O3 variant is determined by the time to compute 434044698 independent div instructions.
Finally, to give specific answers to your questions:
1. Which optimisation does/could explain the performance drop?
As another answer mentioned, this is if-conversion creating a loop-carried data dependency where originally there was a control dependency. Control dependencies may be costly too, when they correspond to unpredictable branches, but in this case the branch is easy to predict.
2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Perhaps you can imagine the optimization transforming the code to
for (i = 2u; i < a + b; ++i) {
    c = (a + b) % i;
    i = (c != 0) ? i : a + b;
}
Where the ternary operator is evaluated on the CPU such that new value of i is not known until c has been computed.
3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
In those variants the code is not eligible for if-conversion, so the problematic data dependency is not introduced.
I think the problem is the -fif-conversion pass, which instructs the compiler to emit a CMOV instead of TEST/JZ for some comparisons. And CMOV is known to be not so great in the general case.
There are two points in the disassembly, that I know of, affected by this flag:
First, the if (c == 0u) { i = a + b; } in line 13 is compiled to:
test edx,edx //edx is c
cmove ecx,esi //esi is (a + b), ecx is i
Second, the if (c != 0u) { count = count + 1u; } is compiled to
cmp eax,0x1 //eax is c
sbb r8d,0xffffffff //r8d is count, but what???
Nice trick! It is subtracting -1 from count, but with borrow, and the borrow is only set if c is less than 1, which, c being unsigned, means 0. Thus, if eax is 0 it subtracts -1 from count but then subtracts 1 again: the value does not change. If eax is not 0, then it subtracts -1, which increments the variable.
Now, this avoids branches, but at the cost of missing the obvious optimization: if c == 0u you could jump directly to the next while iteration. That one is so easy that it is done even at -O0.
I believe this is caused by the "conditional move" instruction (CMOVEcc) that the compiler generates to replace branching when using -O1 and -O2.
When using -O0, the statement if (c == 0u) is compiled to a jump:
cmp DWORD PTR [rbp-16], 0
jne .L4
With -O1 and -O2:
test edx, edx
cmove ecx, esi
while -O3 produces a jump (similar to -O0):
test edx, edx
je .L5
There is a known bug in gcc where "using conditional moves instead of compare and branch result in almost 2x slower code".
As rodrigo suggested in his comment, using the flag -fno-if-conversion tells gcc not to replace branching with conditional moves, hence preventing this performance issue.

Are bitwise operators faster?, if yes then why?

How much will it affect the performance if I use:
n>>1 instead of n/2
n&1 instead of n%2!=0
n<<3 instead of n*8
n++ instead of n+=1
and so on...
and if it does increase the performance then please explain why.
Any half decent compiler will optimize the two versions into the same thing. For example, GCC compiles this:
unsigned int half1(unsigned int n) { return n / 2; }
unsigned int half2(unsigned int n) { return n >> 1; }
bool parity1(int n) { return n % 2; }
bool parity2(int n) { return n & 1; }
int mult1(int n) { return n * 8; }
int mult2(int n) { return n << 3; }
void inc1(int& n) { n += 1; }
void inc2(int& n) { n++; }
to
half1(unsigned int):
mov eax, edi
shr eax
ret
half2(unsigned int):
mov eax, edi
shr eax
ret
parity1(int):
mov eax, edi
and eax, 1
ret
parity2(int):
mov eax, edi
and eax, 1
ret
mult1(int):
lea eax, [0+rdi*8]
ret
mult2(int):
lea eax, [0+rdi*8]
ret
inc1(int&):
add DWORD PTR [rdi], 1
ret
inc2(int&):
add DWORD PTR [rdi], 1
ret
One small caveat is that in the first example, if n could be negative (in case that it is signed and the compiler can't prove that it's nonnegative), then the division and the bitshift are not equivalent and the division needs some extra instructions. Other than that, compilers are smart and they'll optimize operations with constant operands, so use whichever version makes more sense logically and is more readable.
Strictly speaking, in most cases, yes.
This is because bit manipulation is a simpler operation for a CPU to perform: the circuitry in the ALU is much simpler and needs fewer discrete steps (clock cycles) to complete.
As others have mentioned, any compiler worth a damn will automatically detect arithmetic operations on constant operands that have bitwise analogs (like those in your examples) and will convert them to the appropriate bitwise operations under the hood.
Keep in mind that if the relevant operand is a runtime value rather than a constant, such optimizations cannot occur.

Efficient symmetric comparison based on a bool toggle

My code has a lot of patterns like
int a, b.....
bool c = x ? a >= b : a <= b;
and similarly for the other inequality comparison operators. Is there a way to write this to achieve better performance/branchlessness on x86?
Please spare me the "have you benchmarked your code? Is this really your bottleneck?" type of comment. I am asking for other ways to write this so I can benchmark and test them.
EDIT:
bool x
Original expression:
x ? a >= b : a <= b
Branch-free equivalent expression without short-circuit evaluation:
!!x & a >= b | !x & a <= b
This is an example of a generic pattern without resorting to arithmetic trickery. Watch out for operator precedence; you may need parentheses for more complex examples.
Another way would be:
bool c = (2*x - 1) * (a - b) >= 0;
This generates a branch-less code here: https://godbolt.org/z/1nAp7G
#include <stdbool.h>

bool foo(int a, int b, bool x)
{
    return (2*x - 1) * (a - b) >= 0;
}
------------------------------------------
foo:
movzx edx, dl
sub edi, esi
lea eax, [rdx-1+rdx]
imul eax, edi
not eax
shr eax, 31
ret
Since you're just looking for equivalent expressions, this comes from patching @AlexanderZhang's comment:
(a==b) || (x != (a<b))
The way you currently have it is possibly unbeatable.
But for positive integral a and b and bool x you can use
a / b * x + b / a * !x
(You could adapt this, at the cost of extra cpu burn, by replacing a with a + 1 and similarly for b if you need to support zero.)
If a>=b, a-b will be non-negative and its first bit (the sign bit) is 0. Otherwise a-b is negative and the first bit is 1.
So we can simply "xor" the first bit of a-b with the value of x. (Note that a-b can overflow for extreme values, and right-shifting a negative signed value is implementation-defined before C++20.)
constexpr auto shiftBit = sizeof(int)*8 - 1;

bool foo(bool x, int a, int b){
    return x ^ bool((a-b)>>shiftBit);
}
foo(bool, int, int):
sub esi, edx
mov eax, edi
shr esi, 31
xor eax, esi
ret

Convert flag into either 0xFF or 0, based on whether flag equals 1 or 0

I have a binary flag f, equal to either zero or one.
If equal to one, I would like to convert to 0xFF, otherwise, to 0.
Current solution is f*0xFF, but I would rather use bit twiddling to achieve this.
How about just:
(unsigned char)-f
or alternately:
0xFF & -f
If f is already a char, then you just need -f.
This approach works because -0 == 0 and -1 == 0xFFFF..., so the negation gets you what you want directly, perhaps with some extra high bits set if f is wider than a char (you didn't say).
Remember though that compilers are smart. I tried all of the following solutions, and all compiled down to 3 instructions or less, and none had a branch (even the solution with a conditional):
Conditional
int remap_cond(int f) {
    return f ? 0xFF : 0;
}
Compiles to:
remap_cond:
test edi, edi
mov eax, 255
cmove eax, edi
ret
So even the "obvious" conditional works well, in three instructions and a latency of 2 or 3 cycles on most modern x86 hardware, depending on cmov performance.
Multiplication
Your original solution of:
int remap_mul(int f) {
    return f * 0xFF;
}
Actually compiles into nice code that avoids the multiplication entirely, replacing it with a shift and subtract:
remap_mul:
mov eax, edi
sal eax, 8
sub eax, edi
ret
This will generally take two cycles on machines with mov-elimination, and the mov would often be removed by inlining anyway.
Subtraction
As corn3lius pointed out, you can do some subtraction from 0x100 and a mask, like so:
int remap_shift_sub(int f) {
    return 0xFF & (0x100 - f);
}
This compiles to1:
remap_shift_sub:
neg edi
movzx eax, dil
ret
So that's the best so far I think - a latency of 2 cycles on most hosts, and the movzx can often be eliminated by inlining2 - e.g., since it could use the 8-bit register in a subsequent consuming instruction.
Note that the compiler has smartly eliminated both the masking operation (you could perhaps argue the movzx accounts for it), and the use of the 0x100 constant, because it understands that a simple negation does the same thing here (in particular, all the bits that differ between -f and 0x100 - f are masked away by the 0xFF & ... operation).
That leads directly to the following C code:
int remap_neg_mask(int f) {
    return -f;
}
which compiles down the exact same thing.
You can play with all of this on godbolt.
1 Except on clang, which inserts an extra mov to get the result in eax rather than generating it there in the first place.
2 Note that by "inlining" I mean both real inlining the compiler does if you actually write this as a function, but also what happens if you just do the remapping operation directly at the place you need it without a function.
value = 0xFF & (0x100 - f)
If f is one, subtract it from 0x100, giving you 0xFF; otherwise subtract 0, and the bitmask with 0xFF gives you 0.
Too obvious?
value = ( f == 1 ) ? 0xFF : 0;

Why is the C compiler dropping my &0xFF mask?

I have a problem where two different compilers (GCC and IAR) are dropping my mask in an if comparing variables of different sizes.
I have the following code:
uint8_t Value2;
uint16_t WriteOffset;
bool Fail;
void test(void)
{
    uint8_t buff[100];
    uint16_t r;
    for(r=0;r<Value2+1;r++)
    {
        if(buff[r]!=(WriteOffset+r)&0xFF)
        {
            Fail=true;
        }
    }
}
The test misfires (goes into the {} block) when buff[r] == 0 and WriteOffset + r == 0x100, even though (WriteOffset + r) & 0xFF would be 0.
GCC outputs the following assembly:
movzwl -0xc(%ebp),%eax ; Load 'r'->EAX
mov -0x70(%ebp,%eax,1),%al ; Load 'buff[r]'->AL
movzbl %al,%edx ; Move AL to (unsigned int)EDX
mov 0x4b19e0,%ax ; Load 'WriteOffset'->AX
movzwl %ax,%ecx ; Move AX to (unsigned int)ECX
movzwl -0xc(%ebp),%eax ; Load 'r'->EAX
lea (%ecx,%eax,1),%eax ; 'WriteOffset' + 'r'->EAX
cmp %eax,%edx ; (unsigned int)'WriteOffset+r' == (unsigned int)'buff[r]'
je 0x445e28 <Test+1254> ; If == skip {} block
My question is why is the compiler dropping my &0xFF from the if (I have already fixed the problem with a cast, but I still do not understand why it dropped it in the first place)?
It isn't; operator precedence is biting you here.
You want if( buff[r] != ((WriteOffset+r)&0xFF) )
What you currently have is the same as if( (buff[r]!=(WriteOffset+r)) & 0xFF )
The precedence confusion is causing you to mask a value that can only be 0 or 1 (the result of the comparison expression) with 0xFF, so the optimizer is quite reasonably removing it.
!= is higher precedence than &. I think you need an extra set of parentheses.
http://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B has a C/C++ operator precedence table, have a look.
Operator != has higher precedence than &, so you should write it like this:
if(buff[r]!=((WriteOffset+r)&0xFF))