I'm trying to fix a timing leak by removing an if statement from my code, but I am stuck because of the way C++ interprets integer values in if statements.
Note that I assume the compiler does create a conditional branch, which results in timing information being leaked!
The original code is:
int s;
if (s)
    r = A;
else
    r = B;
Now I'm trying to rewrite it as:
int s;
r = s*A + (1-s)*B;
Because s is not restricted to 0 or 1, the expression multiplies A and B by the wrong factors whenever s is anything else. What can I do to solve this without using an if statement on s?
Thanks in advance
What evidence do you have that the if statement is resulting in the timing leak?
If you use a modern compiler with optimizations turned on, that code should not produce a branch. You should check what your compiler is doing by looking at the assembly language output.
For instance, g++ 5.3.0 compiles this code:
int f(int s, int A, int B) {
int r;
if (s)
r = A;
else
r = B;
return r;
}
to this assembly:
movl %esi, %eax
testl %edi, %edi
cmove %edx, %eax
ret
Look, ma! No branches! ;)
If you know the number of bits in the integer, it's pretty easy, although there are a few complications making it standards-clean with the possibility of unusual integer representations.
Here's one simple solution for 32-bit integers:
uint32_t mask = s;
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask &= 1;
r = b ^ (-mask & (a ^ b));
The five shift-and-or statements propagate any set bit in mask so that in the end the low-order bit is 1 unless the mask was originally 0. Then we isolate the low-order bit, resulting in a 1 or 0. The last statement is a bit-hacking equivalent of your two multiplies and add.
Here is a faster one based on the observation that if you subtract one from a number and the sign bit changes from 0 to 1, then the number was 0:
uint32_t mask = (((uint32_t(s)-1U)&~uint32_t(s))>>31) - 1U;
That is essentially the same computation as subtracting 1 and then using the carry bit, but unfortunately the carry bit is not exposed to the C language (except possibly through compiler-specific intrinsics).
Other variations are possible.
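As a minimal sketch, the faster mask and the final selection can be put together as follows. The names nonzero_mask and ct_select are made up for this illustration, and the code assumes 32-bit unsigned arithmetic. Note that this mask is already all ones or all zeros, so it is ANDed in directly rather than negated first. Whether the compiler keeps the result branch-free still has to be checked in the generated assembly.

```cpp
#include <cstdint>

// All ones if s != 0, all zeros if s == 0 (hypothetical helper name).
constexpr std::uint32_t nonzero_mask(std::uint32_t s) {
    // (s - 1) & ~s has its top bit set only when s == 0.
    return (((s - 1U) & ~s) >> 31) - 1U;
}

// Returns a if s != 0, otherwise b, without an if statement on s.
constexpr std::uint32_t ct_select(std::uint32_t s, std::uint32_t a, std::uint32_t b) {
    return b ^ (nonzero_mask(s) & (a ^ b));
}
```

Calling ct_select(s, A, B) then plays the role of the original if/else on s.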
The only way to do it without branches when the optimization is not available is to resort to inline assembly. Assuming 8086:
mov ax, s
neg ax ; CF = (ax != 0)
sbb ax, ax ; ax = (s != 0 ? -1 : 0)
neg ax ; ax = (s != 0 ? 1 : 0)
mov s, ax ; now use s at will, it will be: s = (s != 0 ? 1 : 0)
Related
Please note: this question is neither about code quality, and ways to improve the code, nor about the (in)significance of the runtime differences. It is about GCC and why which compiler optimisation costs performance.
The program
The following code counts the number of Fibonacci primes up to m:
int main() {
unsigned int m = 500000000u;
unsigned int i = 0u;
unsigned int a = 1u;
unsigned int b = 1u;
unsigned int c = 1u;
unsigned int count = 0u;
while (a + b <= m) {
for (i = 2u; i < a + b; ++i) {
c = (a + b) % i;
if (c == 0u) {
i = a + b;
// break;
}
}
if (c != 0u) {
count = count + 1u;
}
a = a + b;
b = a - b;
}
return count; // Just to "output" (and thus use) count
}
When compiled with g++.exe (Rev2, Built by MSYS2 project) 9.2.0 and no optimisations (-O0), the resulting binary executes (on my machine) in 1.9s. With -O1 and -O3 it takes 3.3s and 1.7s, respectively.
I've tried to make sense of the resulting binaries by looking at the assembly code (godbolt.org) and the corresponding control-flow graph (hex-rays.com/products/ida), but my assembler skills don't suffice.
Additional observations
An explicit break in the innermost if makes the -O1 code fast again:
if (c == 0u) {
i = a + b; // Not actually needed any more
break;
}
As does "inlining" the loop's progress expression:
for (i = 2u; i < a + b; ) { // No ++i any more
c = (a + b) % i;
if (c == 0u) {
i = a + b;
++i;
} else {
++i;
}
}
Questions
Which optimisation does/could explain the performance drop?
Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
The important thing at play here are loop-carried data dependencies.
Look at machine code of the slow variant of the innermost loop. I'm showing -O2 assembly here, -O1 is less optimized, but has similar data dependencies overall:
.L4:
xorl %edx, %edx
movl %esi, %eax
divl %ecx
testl %edx, %edx
cmove %esi, %ecx
addl $1, %ecx
cmpl %ecx, %esi
ja .L4
See how the increment of the loop counter in %ecx depends on the previous instruction (the cmov), which in turn depends on the result of the division, which in turn depends on the previous value of loop counter.
Effectively there is a chain of data dependencies on computing the value in %ecx that spans the entire loop, and since the time to execute the loop dominates, the time to compute that chain decides the execution time of the program.
Adjusting the program to compute the number of divisions reveals that it executes 434044698 div instructions. Dividing the number of machine cycles taken by the program by this number gives 26 cycles in my case, which corresponds closely to latency of the div instruction plus about 3 or 4 cycles from the other instructions in the chain (the chain is div-test-cmov-add).
In contrast, the -O3 code does not have this chain of dependencies, making it throughput-bound rather than latency-bound: the time to execute the -O3 variant is determined by the time to compute 434044698 independent div instructions.
Finally, to give specific answers to your questions:
1. Which optimisation does/could explain the performance drop?
As another answer mentioned, this is if-conversion creating a loop-carried data dependency where originally there was a control dependency. Control dependencies may be costly too, when they correspond to unpredictable branches, but in this case the branch is easy to predict.
2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Perhaps you can imagine the optimization transforming the code to
for (i = 2u; i < a + b; ++i) {
c = (a + b) % i;
i = (c != 0) ? i : a + b;
}
Where the ternary operator is evaluated on the CPU such that new value of i is not known until c has been computed.
3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
In those variants the code is not eligible for if-conversion, so the problematic data dependency is not introduced.
I think the problem is in the -fif-conversion that instructs the compiler to do CMOV instead of TEST/JZ for some comparisons. And CMOV is known for being not so great in the general case.
There are two points in the disassembly, that I know of, affected by this flag:
First, the if (c == 0u) { i = a + b; } in line 13 is compiled to:
test edx,edx //edx is c
cmove ecx,esi //esi is (a + b), ecx is i
Second, the if (c != 0u) { count = count + 1u; } is compiled to
cmp eax,0x1 //eax is c
sbb r8d,0xffffffff //r8d is count, but what???
Nice trick! It subtracts -1 from count, but with borrow, and the carry flag is only set if c is less than 1, which for an unsigned value means 0. Thus, if eax is 0 it subtracts -1 from count but then subtracts 1 again, so count does not change. If eax is not 0, it just subtracts -1, which increments the variable.
Now, this avoids branches, but at the cost of missing the obvious optimization that if c == 0u you could jump directly to the next while iteration. This one is so easy that it is even done in -O0.
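To make the cmp/sbb pair described above concrete, here is a rough C++ model of what the two instructions compute. The function name sbb_increment is invented for this sketch, and it only mirrors the arithmetic; it says nothing about what a compiler will emit for it.

```cpp
#include <cstdint>

// Models "cmp eax, 1 ; sbb r8d, 0xffffffff":
// cmp sets the carry flag when c < 1 (unsigned), i.e. when c == 0;
// sbb then computes count - 0xffffffff - carry, modulo 2^32.
constexpr std::uint32_t sbb_increment(std::uint32_t count, std::uint32_t c) {
    return count - 0xffffffffU - static_cast<std::uint32_t>(c < 1U);
}
```

So count is left unchanged when c is 0 and incremented by one otherwise, which matches if (c != 0u) { count = count + 1u; }.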
I believe this is caused by the "conditional move" instruction (CMOVcc) that the compiler generates to replace branching when using -O1 and -O2.
When using -O0, the statement if (c == 0u) is compiled to a jump:
cmp DWORD PTR [rbp-16], 0
jne .L4
With -O1 and -O2:
test edx, edx
cmove ecx, esi
while -O3 produces a jump (similar to -O0):
test edx, edx
je .L5
There is a known bug in gcc where "using conditional moves instead of compare and branch result in almost 2x slower code"
As rodrigo suggested in his comment, using the flag -fno-if-conversion tells gcc not to replace branching with conditional moves, hence preventing this performance issue.
An array of 3 bytes is specified. Count the number of bytes in which a zero follows a one, i.e. where the bits below the most-significant 1 are not all 1.
{00000100, 00000011, 00001000} - for this array the answer is 2.
My code gives 1, but it is incorrect; how to fix that?
#include <iostream>
#include <bitset>
using namespace std;
int main() {
int res = 0, res1 = 0;
_int8 arr[3] = { 4, 3, 8 };
__asm {
mov ecx, 3
mov esi, 0
start_outer:
mov bx, 8
mov al, arr[esi]
start_inner :
shl al, 1
jnb zero
jc one
one :
dec bx
test bx, bx
jnz start_inner
jmp end_
zero :
dec bx
test bx, bx
jz end_
inc res
shl al, 1
jnb was_zero
jc start_inner
was_zero :
dec bx
dec res
jmp start_inner
end_ :
inc esi
loop start_outer
}
cout << res << endl;
system("pause");
}
Next try.
Please try to explain better next time. Many many people did not understand your question. Anyway. I hope that I understood now.
I will explain the algorithm used for one byte. Later in the program, we will simply run an outer loop 3 times to work on all values. I will, of course, also show the result in assembler. This is one of many possible solutions.
We can observe the following:
Your statement "Count the number of bytes where there's a zeros after any one." means that you want to count the number of transitions of a bit from 1 to 0 in one byte, looking at the bits from the MSB to the LSB, so from left to right.
If we formulate this vice versa, then we can also count the number of transitions from 0 to 1, if we go from right to left.
A transition from 0 to 1 can always be calculated by "and"ing the new value with the negated old value. Example:
OldValue NewValue NotOldValue And
0 0 1 0
0 1 1 1 --> Rising edge
1 0 0 0
1 1 0 0
We can also say in words: if the old, previous value was not set and the new value is set, then we have a rising edge.
We can look at one bit (of a byte) after the other if we shift the byte right. Then the new value will be the LSB. We remember the old, previous bit and then do the test. Then we set old = new, read the new value again, do the test, and so on. We do this for all bits.
In C++ this could look like this:
#include <iostream>
#include <bitset>
using byte = unsigned char;
byte countForByte(byte b) {
// Initialize counter variable to 0
byte counter{};
// Get the first old value. The lowest bit of the original array entry
byte oldValue = b & 1;
// Check all 8 bits
for (int i=0; i<8; ++i) {
// Calculate a new value. First shift to right
b = b >> 1;
// Then mask out lowest bit
byte newValue = b & 1;
// Now apply our algorithm. The result will always be 0 or one. Add to result
counter += (newValue & !oldValue);
// The next old value is the current value from this time
oldValue = newValue;
}
return counter;
}
int main() {
unsigned int x;
std::cin >> x;
std::cout << std::bitset<8>(x).to_string() << "\n";
byte s = countForByte(x);
std::cout << static_cast<int>(s) << '\n';
return 0;
}
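The same rising-edge rule ("new bit set and old bit not set") can also be applied to all bit positions of the byte at once. The following is only a sketch of that idea; the function name countRisingEdges is made up here, and it still uses a small loop to count the marked positions.

```cpp
// Counts 0 -> 1 transitions when scanning a byte from bit 0 up to bit 7.
constexpr int countRisingEdges(unsigned char b) {
    // Bit i of 'edges' is set when bit i+1 of b is 1 and bit i of b is 0.
    unsigned edges = (b >> 1) & ~static_cast<unsigned>(b);
    int count = 0;
    while (edges != 0U) {
        edges &= edges - 1U;  // clear the lowest set bit
        ++count;
    }
    return count;
}
```

Shifting the byte right by one lines up every "new" bit with its "old" neighbour, so a single AND with the inverted byte marks every rising edge.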
So, and for whatever reason, you want a solution in assembler. Also here, you need to tell the people why you want to have it, what compiler you use and what target microprocessor you use. Otherwise, how can people give the correct answer?
Anyway. Here is the solution for the x86 architecture. Tested with MS VS2019.
#include <iostream>
int main() {
int res = 0;
unsigned char arr[3] = { 139, 139, 139 };
__asm {
mov esi, 0; index in array
mov ecx, 3; We will work with 3 array values
DoArray:
mov ah, arr[esi]; Load array value at index
mov bl, ah; Old Value
and bl, 1; Get lowest bit of old value
push ecx; Save loop Counter for outer loop
mov ecx, 7; 7 loop runs to get the result for one byte
DoTest:
shr ah, 1; This was the original given byte
mov al, ah; Get the lowest bit from the new shifted value
and al, 1; This is now new value
not bl; Invert the old value
and bl, al; Check for rising edge
movzx edi, bl
add res, edi; Calculate new result
mov bl, al; Old value = new value
loop DoTest
inc esi; Next index in array
pop ecx; Get outer loop counter
loop DoArray; Outer loop
}
std::cout << res << '\n';
return 0;
}
And for this work, I want 100 upvotes and an accepted answer . . .
Basically, user #Michael gave already the correct answer. So all credits go to him.
You can find a lot of bit fiddling posts here on stack overflow. But a very good description for such kind of activities, you may find in the book "Hacker’s Delight" by "Henry S. Warren, Jr.". I have here the 2nd edition.
The solution is presented in chapter 2, "Basics", then "2–1 Manipulating Rightmost Bits"
And if you manually check which values do NOT fulfill your condition, then you will find out that these are
0,1,3,7,15,31,63,127,255,
or, in binary
0b0000'0000, 0b0000'0001, 0b0000'0011, 0b0000'0111, 0b0000'1111, 0b0001'1111, 0b0011'1111, 0b0111'1111, 0b1111'1111,
And we detect that these values have the form 2^n - 1. According to "Hacker’s Delight", a value is not of that form exactly when this simple formula is true:
(x & (x + 1)) != 0
So, we can translate that to the following code:
#include <iostream>
int main() {
unsigned char arr[3];
unsigned int x, y, z;
std::cin >> x >> y >> z;
arr[0] = static_cast<unsigned char>(x);
arr[1] = static_cast<unsigned char>(y);
arr[2] = static_cast<unsigned char>(z);
unsigned char res = ((arr[0] & (arr[0] + 1)) != 0) + ((arr[1] & (arr[1] + 1)) != 0) + ((arr[2] & (arr[2] + 1)) != 0);
std::cout << static_cast<unsigned int>(res) << '\n';
return 0;
}
Very important. You do not need assembler code. Optimizing compiler will nearly always outperform your handwritten code.
You can check many different versions on Compiler Explorer. There you can see that your code example with static values would be completely optimized away: the compiler would simply calculate everything at compile time and show 2 as the result. So, caveat: Compiler Explorer shows you the assembly language generated by different compilers for the selected hardware. You can take that if you want.
Please additionally note: The above sketched algorithm does not need any branch. Except, if you want to iterate over an array/vector. For this, you could write a small lambda and use algorithms from the C++ standard library.
C++ solution
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>
#include <iterator>
int main() {
// Define Lambda to check conditions
auto add = [](const size_t& sum, const unsigned char& x) -> size_t {
return sum + static_cast<size_t>(((x & (x + 1)) == 0) ? 0U : 1U); };
// Vector with any number of test values
std::vector<unsigned char> test{ 4, 3, 8 };
// Calculate and show result
std::cout << std::accumulate(test.begin(), test.end(), 0U, add) << '\n';
return 0;
}
Suppose x is a bit mask (ie all of its bits except for one are 0) and y is either a bit mask or is equal to 0. I need a bit hack to return x if y is non-zero and return zero if y is zero.
Here's one possible solution: take the base-2 logarithm of x and y (using de Bruijn sequence) and subtract them, storing the value in d. Then y << d will return x unless y was zero to begin with.
Two problems with this approach: 1) if y is zero, technically the base-2 logarithm is undefined. Not sure if that matters though, because even if d is some garbage value, y << d should still return zero if y is zero; 2) if d is negative, the right-shift operator does not become a left-shift operator (according to Google search), meaning that I'll have to include some sign checking.
I'm convinced that there's a simpler approach, but I can't find it and would appreciate some help.
EDIT: to clarify, I'm looking for the fastest way to do this. The obvious if (y == 0) return 0; else return x uses an if statement and hence suffers from the adverse effects of branch prediction, which is why I'm resorting to convoluted base-2 log solutions.
The use of the ternary operator would be preferred on most common processor architectures:
/* if y != 0, return x, else return 0 */
int select1 (int x, int y)
{
return y ? x : 0;
}
The use of the ternary operator does not typically involve the use of branches on modern processor architectures, since it can easily be implemented in branchless fashion by using conditional moves (e.g. on x86), instruction predication (e.g. on ARM), or select instructions (e.g. on some GPUs).
If use of the ternary operator is not desired or allowed, and a bit-twiddly solution is required, one could (assuming the platform uses two's complement representation for integers) use:
/* if y != 0, return x, else return 0 */
int select2 (int x, int y)
{
return (0 - (y != 0)) & x;
}
Note that select2() is likely to be slower than select1(). Example: If I compile the above functions for the x86-64 architecture my compiler produces this instruction sequence for select1()
test edx, edx
cmovne edx, ecx
mov eax, edx
ret
but this longer instruction sequence for select2():
mov r8d, 1
test edx, edx
cmovne edx, r8d
neg edx
and edx, ecx
mov eax, edx
ret
Note that neither instruction sequence involves branching as part of value selection, but the instruction sequence in select2() requires more instructions to be executed and also has a longer dependency chain, compared to the instruction sequence in select1().
static_cast<bool>(y) * x
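Wrapped in a function, the one-liner above might look like the sketch below. The name select_mul is invented for illustration, and unsigned operands are an assumption; whether the multiplication is cheaper than a conditional move depends on the compiler and target.

```cpp
// Returns x if y is nonzero, otherwise 0.
// The bool converts to 0 or 1 before the multiplication.
constexpr unsigned select_mul(unsigned x, unsigned y) {
    return static_cast<bool>(y) * x;
}
```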
Just take y and use its bits to form a string of all 1's if it was nonzero, then AND this with x. The silly way of doing this is linearly but you can do it with a binary method too (not given).
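#include <stdio.h>
#include <limits.h>
unsigned foo(unsigned x, unsigned y) {
    /* Smear every set bit of y in both directions, so that
       y becomes all ones if it was nonzero. Unsigned avoids
       undefined behaviour when shifting into the sign bit. */
    for (unsigned z = 1; z < CHAR_BIT * sizeof(unsigned); z++) {
        y |= (y << z) | (y >> z);
    }
    return x & y;
}
int main() {
    printf("%x\n", foo(0x1000, 0xdead));
    return 0;
}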
That should run in constant time. You can unroll the loop of course.
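The binary method mentioned above could look roughly like this for 32-bit values. It is a sketch under that width assumption, and the names smear_all and select_smear are hypothetical.

```cpp
#include <cstdint>

// Turns any nonzero y into all ones; leaves zero as zero.
constexpr std::uint32_t smear_all(std::uint32_t y) {
    // Propagate the highest set bit down to bit 0.
    y |= y >> 1;  y |= y >> 2;  y |= y >> 4;  y |= y >> 8;  y |= y >> 16;
    // Propagate bit 0 up through every higher bit.
    y |= y << 1;  y |= y << 2;  y |= y << 4;  y |= y << 8;  y |= y << 16;
    return y;
}

// Returns x if y is nonzero, otherwise 0.
constexpr std::uint32_t select_smear(std::uint32_t x, std::uint32_t y) {
    return x & smear_all(y);
}
```

This needs ten shift-and-or steps instead of one step per bit, and the number of steps does not depend on the value of y.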
Possible Duplicates:
Previous power of 2
Getting the Leftmost Bit
What I want is, suppose there is a number 5 i.e. 101. My answer should be 100. For 9 i.e. 1001, the answer should be 1000
You can't ask for the fastest sequence without giving constraints on the machine on which this has to run. For example, some machines support an instruction called "count leading zeroes" or have means to emulate it very quickly. If you can access this instruction (for example with gcc) then you can write:
#include <stdio.h>
#include <stdint.h>
uint32_t f(uint32_t x)
{
    /* GCC/Clang builtin: counts leading zero bits (x must be nonzero) */
    return ((uint64_t)1)<<(32-__builtin_clz(x)-1);
}
int main()
{
    printf("=>%u\n",(unsigned)f(5));
    printf("=>%u\n",(unsigned)f(9));
    return 0;
}
f(x) returns what you want (the greatest y with y<=x and y=2**n). The compiler will now generate the optimal code sequence for the target machine. For example, when compiling for a default x86_64 architecture, f() looks like this:
bsrl %edi, %edi
movl $31, %ecx
movl $1, %eax
xorl $31, %edi
subl %edi, %ecx
salq %cl, %rax
ret
You see, no loops here! 7 instructions, no branches.
But if I tell my compiler (gcc-4.5) to optimize for the machine I'm using right now (AMD Phenom-II), then this comes out for f():
bsrl %edi, %ecx
movl $1, %eax
salq %cl, %rax
ret
This is probably the fastest way to go for this machine.
EDIT: f(0) would have resulted in UB, I've fixed that (and the assembly). Also, uint32_t means that I can write 32 without feeling guilty :-)
From Hacker's Delight, a nice branchless solution:
uint32_t flp2 (uint32_t x)
{
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >> 16);
return x - (x >> 1);
}
This typically takes 12 instructions. You can do it in fewer if your CPU has a "count leading zeroes" instruction.
int input = 5;
std::size_t numBits = 0;
while(input)
{
input >>= 1;
numBits++;
}
int output = 1 << (numBits-1);
This is a task related to the bit counting. Check this out.
Using the 2a (which is my favorite of the algorithms; not the fastest) one can come up with this:
int highest_bit_mask (unsigned int n) {
while (n) {
if (n & (n-1)) {
n &= (n-1) ;
} else {
return n;
}
}
return 0;
}
The magic of n &= (n-1); is that it removes the least significant set bit from n. (Corollary: for nonzero n, n & (n-1) is zero only when n has precisely one bit set.) The algorithm's complexity depends on the number of bits set in the input.
Check out the link anyway. It is a very amusing and enlightening read which might give you more ideas.
Write a branchless function that returns 0, 1, or 2 if the difference between two signed integers is zero, negative, or positive.
Here's a version with branching:
int Compare(int x, int y)
{
int diff = x - y;
if (diff == 0)
return 0;
else if (diff < 0)
return 1;
else
return 2;
}
Here's a version that may be faster depending on compiler and processor:
int Compare(int x, int y)
{
int diff = x - y;
return diff == 0 ? 0 : (diff < 0 ? 1 : 2);
}
Can you come up with a faster one without branches?
SUMMARY
The 10 solutions I benchmarked had similar performance. The actual numbers and winner varied depending on compiler (icc/gcc), compiler options (e.g., -O3, -march=nocona, -fast, -xHost), and machine. Canon's solution performed well in many benchmark runs, but again the performance advantage was slight. I was surprised that in some cases some solutions were slower than the naive solution with branches.
Branchless (at the language level) code that maps negative to -1, zero to 0 and positive to +1 looks as follows
int c = (n > 0) - (n < 0);
if you need a different mapping you can simply use an explicit map to remap it
const int MAP[] = { 1, 0, 2 };
int c = MAP[(n > 0) - (n < 0) + 1];
or, for the requested mapping, use some numerical trick like
int c = 2 * (n > 0) + (n < 0);
(It is obviously very easy to generate any mapping from this as long as 0 is mapped to 0. And the code is quite readable. If 0 is mapped to something else, it becomes more tricky and less readable.)
As an additional note: comparing two integers by subtracting one from the other at the C language level is a flawed technique, since it is generally prone to overflow. The beauty of the above methods is that they can immediately be used for "subtractionless" comparisons, like
int c = 2 * (x > y) + (x < y);
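A small sketch of why the comparison form is safer: the operands below are as far apart as two ints can be, so x - y would overflow, while the two comparisons are always well defined. The function name compare_no_sub is made up for this example.

```cpp
#include <climits>

// 0 if x == y, 1 if x < y, 2 if x > y; no subtraction, so no overflow.
constexpr int compare_no_sub(int x, int y) {
    return 2 * (x > y) + (x < y);
}

// Extreme operands, where x - y cannot be represented in an int.
static_assert(compare_no_sub(INT_MAX, INT_MIN) == 2, "x > y");
static_assert(compare_no_sub(INT_MIN, INT_MAX) == 1, "x < y");
```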
int Compare(int x, int y) {
return (x < y) + ((y < x) << 1);
}
Edit: Bitwise only? Guess < and > don't count, then?
int Compare(int x, int y) {
int diff = x - y;
return (!!diff) | (!!(diff & 0x80000000) << 1);
}
But there's that pesky -.
Edit: Shift the other way around.
Meh, just to try again:
int Compare(int x, int y) {
int diff = y - x;
return (!!diff) << ((diff >> 31) & 1);
}
But I'm guessing there's no standard ASM instruction for !!. Also, the << can be replaced with +, depending on which is faster...
Bit twiddling is fun!
Hmm, I just learned about setnz.
I haven't checked the assembler output (but I did test it a bit this time), and with a bit of luck it could save a whole instruction!:
IN THEORY. MY ASSEMBLER IS RUSTY
subl %edi, %esi
setnz %eax
sarl $31, %esi
andl $1, %esi
sarl %eax, %esi
mov %esi, %eax
ret
Rambling is fun.
I need sleep.
Assuming 2s complement, arithmetic right shift, and no overflow in the subtraction:
#include <limits.h>
#define SHIFT (CHAR_BIT*sizeof(int) - 1)
int Compare(int x, int y)
{
int diff = x - y;
return -(diff >> SHIFT) - (((-diff) >> SHIFT) << 1);
}
Two's complement:
#include <limits.h>
#define INT_BITS (CHAR_BIT * sizeof (int))
int Compare(int x, int y) {
int d = y - x;
int p = (d + INT_MAX) >> (INT_BITS - 1);
d = d >> (INT_BITS - 2);
return (d & 2) + (p & 1);
}
Assuming a sane compiler, this will not invoke the comparison hardware of your system, nor is it using a comparison in the language. To verify, write d0 = y - x for the initial value of d. If x == y then d and p will clearly be 0, so the final result will be zero. If d0 > 0 (that is, x < y) then d0 + INT_MAX overflows, wrapping around on two's complement hardware and setting the high bit of the integer; otherwise the high bit stays unset. So p will have its lowest bit set if and only if x < y, giving 1. If d0 < 0 (that is, x > y) then its high bit is set, so the shifted d has its second-lowest bit set, giving 2.
Unsigned Comparison that returns -1,0,1 (cmpu) is one of the cases that is tested for by the GNU SuperOptimizer.
cmpu: compare (unsigned)
int cmpu(unsigned_word v0, unsigned_word v1)
{
return ( (v0 > v1) ? 1 : ( (v0 < v1) ? -1 : 0) );
}
A SuperOptimizer exhaustively searches the instruction space for the best possible combination of instructions that will implement a given function. It is suggested that compilers automagically replace the functions above by their superoptimized versions (although not all compilers do this). For example, in the PowerPC Compiler Writer's Guide (powerpc-cwg.pdf), the cmpu function is shown as this in Appendix D pg 204:
cmpu: compare (unsigned)
PowerPC SuperOptimized Version
subf R5,R4,R3
subfc R6,R3,R4
subfe R7,R4,R3
subfe R8,R7,R5
That's pretty good isn't it... just four subtracts (and with carry and/or extended versions). Not to mention it is genuinely branchfree at the machine opcode level. There is probably a PC / Intel X86 equivalent sequence that is similarly short since the GNU Superoptimizer runs for X86 as well as PowerPC.
Note that Unsigned Comparison (cmpu) can be turned into Signed Comparison (cmps) on a 32-bit compare by adding 0x80000000 to both Signed inputs before passing it to cmpu.
cmps: compare (signed)
int cmps(signed_word v0, signed_word v1)
{
unsigned_word offset=0x80000000;
return cmpu( (unsigned_word) v0 + offset,
(unsigned_word) v1 + offset );
}
This is just one option though... the SuperOptimizer may find a cmps that is shorter and does not have to add offsets and call cmpu.
To get the version that you requested that returns your values of {1,0,2} rather than {-1,0,1} use the following code which takes advantage of the SuperOptimized cmps function.
int Compare(int x, int y)
{
static const int retvals[]={1,0,2};
return (retvals[cmps(x,y)+1]);
}
I'm siding with Tordek's original answer:
int compare(int x, int y) {
return (x < y) + 2*(y < x);
}
Compiling with gcc -O3 -march=pentium4 results in branch-free code that uses conditional instructions setg and setl (see this explanation of x86 instructions).
push %ebp
mov %esp,%ebp
mov %eax,%ecx
xor %eax,%eax
cmp %edx,%ecx
setg %al
add %eax,%eax
cmp %edx,%ecx
setl %dl
movzbl %dl,%edx
add %edx,%eax
pop %ebp
ret
Good god, this has haunted me.
Whatever, I think I squeezed out a last drop of performance:
int compare(int a, int b) {
return (a != b) << (a > b);
}
Although, compiling with -O3 in GCC will give (bear with me I'm doing it from memory)
xorl %eax, %eax
cmpl %esi, %edi
setne %al
cmpl %esi, %edi
setgt %dl
sall %dl, %eax
ret
But the second comparison seems (according to a tiny bit of testing; I suck at ASM) to be redundant, leaving the small and beautiful
xorl %eax, %eax
cmpl %esi, %edi
setne %al
setgt %dl
sall %dl, %eax
ret
(Sall may totally not be an ASM instruction, but I don't remember exactly)
So... if you don't mind running your benchmark once more, I'd love to hear the results (mine gave a 3% improvement, but it may be wrong).
Combining Stephen Canon and Tordek's answers:
int Compare(int x, int y)
{
int diff = x - y;
return -(diff >> 31) + (2 & (-diff >> 30));
}
Yields: (g++ -O3)
subl %esi,%edi
movl %edi,%eax
sarl $31,%edi
negl %eax
sarl $30,%eax
andl $2,%eax
subl %edi,%eax
ret
Tight! However, Paul Hsieh's version has even fewer instructions:
subl %esi,%edi
leal 0x7fffffff(%rdi),%eax
sarl $30,%edi
andl $2,%edi
shrl $31,%eax
leal (%rdi,%rax,1),%eax
ret
int Compare(int x, int y)
{
int diff = x - y;
int absdiff = 0x7fffffff & diff; // diff with sign bit 0
int absdiff_not_zero = (int) (0 != absdiff); // caution: also 0 when diff == INT_MIN
return
(absdiff_not_zero << 1) // 2 iff abs(diff) > 0
-
((0x80000000 & diff) >> 31); // 1 iff diff < 0
}
For 32-bit signed integers (like in Java), try:
return 2 - ((((x >> 30) & 2) + (((x-1) >> 30) & 2))) >> 1;
where (x >> 30) & 2 returns 2 for negative numbers and 0 otherwise.
x would be the difference of the two input integers
The basic C answer is :
int v; // find the absolute value of v
unsigned int r; // the result goes here
int const mask = v >> sizeof(int) * CHAR_BIT - 1;
r = (v + mask) ^ mask;
Also :
r = (v ^ mask) - mask;
Value of sizeof(int) is often 4 and CHAR_BIT is often 8.
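As a sketch, the second form can be wrapped in a function like the one below. The name branchless_abs is invented here. The code assumes two's complement and an arithmetic right shift of negative values (implementation-defined before C++20), and it does the final arithmetic in unsigned so that the most negative int does not overflow.

```cpp
#include <climits>

// Branchless absolute value, returned as unsigned.
constexpr unsigned branchless_abs(int v) {
    // mask is -1 (all ones) for negative v and 0 otherwise.
    const int mask = v >> (sizeof(int) * CHAR_BIT - 1);
    // (v ^ mask) - mask, computed in unsigned arithmetic.
    return (static_cast<unsigned>(v) ^ static_cast<unsigned>(mask))
           - static_cast<unsigned>(mask);
}
```

For negative v the XOR flips all bits and subtracting the mask adds one, which is the two's complement negation; for non-negative v both steps do nothing.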