This question: How to de-interleave bits (UnMortonizing?) has a good answer for extracting one of the two halves of a Morton number (just the odd bits), but I need a solution which extracts both parts (the odd bits and the even bits) in as few operations as possible.
For my use I would need to take a 32 bit int and extract two 16 bit ints, where one is the even bits and the other is the odd bits shifted right by 1 bit, e.g.
input, z: 11101101 01010111 11011011 01101110
output, x: 11100001 10110111 // odd bits shifted right by 1
y: 10111111 11011010 // even bits
There seem to be plenty of solutions using shifts and masks with magic numbers for generating Morton numbers (i.e. interleaving bits), e.g. Interleave bits by Binary Magic Numbers, but I haven't yet found anything for doing the reverse (i.e. de-interleaving).
UPDATE
After re-reading the section from Hacker's Delight on perfect shuffles/unshuffles I found some useful examples which I adapted as follows:
// morton1 - extract even bits
uint32_t morton1(uint32_t x)
{
    x = x & 0x55555555;
    x = (x | (x >> 1)) & 0x33333333;
    x = (x | (x >> 2)) & 0x0F0F0F0F;
    x = (x | (x >> 4)) & 0x00FF00FF;
    x = (x | (x >> 8)) & 0x0000FFFF;
    return x;
}
// morton2 - extract odd and even bits
void morton2(uint32_t *x, uint32_t *y, uint32_t z)
{
    *x = morton1(z);
    *y = morton1(z >> 1);
}
I think this can still be improved on, both in its current scalar form and also by taking advantage of SIMD, so I'm still interested in better solutions (either scalar or SIMD).
If your processor handles 64 bit ints efficiently, you could combine the operations...
uint64_t w = ((uint64_t)(z & 0xAAAAAAAA) << 31) | (z & 0x55555555);
w = (w | (w >> 1)) & 0x3333333333333333;
w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F;
...
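For reference, one way to carry that combined approach through to the end might look like this (my own untested sketch, not the answerer's full code; deinterleave_both is a made-up name, and the remaining masks just continue the pattern above):
#include <stdint.h>

// Sketch: odd bits are pre-shifted into the high 32-bit half so both halves
// compact in lockstep; bits never cross the 32-bit boundary because each mask
// clears the positions they could spill into.
void deinterleave_both(uint32_t z, uint16_t *odd_out, uint16_t *even_out)
{
    uint64_t w = ((uint64_t)(z & 0xAAAAAAAA) << 31) | (z & 0x55555555);
    w = (w | (w >> 1)) & 0x3333333333333333;
    w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F;
    w = (w | (w >> 4)) & 0x00FF00FF00FF00FF;
    w = (w | (w >> 8)) & 0x0000FFFF0000FFFF;
    *odd_out  = (uint16_t)(w >> 32); // odd bits, already shifted right by 1
    *even_out = (uint16_t)w;         // even bits
}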
Code for the Intel Haswell and later CPUs. You can use the BMI2 instruction set which contains the pext and pdep instructions. These can (among other great things) be used to build your functions.
#include <immintrin.h>
#include <stdint.h>
// on GCC, compile with option -mbmi2, requires Haswell or better.
uint64_t xy_to_morton (uint32_t x, uint32_t y)
{
return _pdep_u64(x, 0x5555555555555555) | _pdep_u64(y, 0xaaaaaaaaaaaaaaaa);
}
void morton_to_xy (uint64_t m, uint32_t *x, uint32_t *y)
{
*x = _pext_u64(m, 0x5555555555555555);
*y = _pext_u64(m, 0xaaaaaaaaaaaaaaaa);
}
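A quick round-trip usage sketch for the two functions above (my own example; requires a BMI2-capable CPU and, on GCC, -mbmi2):
#include <stdio.h>

int main(void)
{
    uint32_t x = 0xDEAD, y = 0xBEEF;
    uint64_t m = xy_to_morton(x, y);

    uint32_t rx, ry;
    morton_to_xy(m, &rx, &ry);
    printf("%x %x\n", rx, ry); // expected to print "dead beef"
    return 0;
}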
In case someone is using Morton codes in 3D, and so needs to read one bit out of every three from a 64-bit value, here is the function I used:
uint64_t morton3(uint64_t x) {
    x = x & 0x9249249249249249;
    x = (x | (x >> 2)) & 0x30c30c30c30c30c3;
    x = (x | (x >> 4)) & 0xf00f00f00f00f00f;
    x = (x | (x >> 8)) & 0x00ff0000ff0000ff;
    x = (x | (x >> 16)) & 0xffff00000000ffff;
    x = (x | (x >> 32)) & 0x00000000ffffffff;
    return x;
}
uint64_t bits;
uint64_t x = morton3(bits);
uint64_t y = morton3(bits >> 1);
uint64_t z = morton3(bits >> 2);
You can extract 8 interleaved bits by multiplying like so:
uint8_t deinterleave_even(uint16_t x) {
return ((x & 0x5555) * 0xC00030000C0003 & 0x0600180060008001) * 0x0101010101010101 >> 56;
}
uint8_t deinterleave_odd(uint16_t x) {
return ((x & 0xAAAA) * 0xC00030000C0003 & 0x03000C003000C000) * 0x0101010101010101 >> 56;
}
It should be trivial to combine them for 32 bits or larger.
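One way to do that combination (my own untested sketch; deinterleave_even32 and deinterleave_odd32 are made-up names) is to run the 16-bit extractors above on each half of the 32-bit input and stitch the results:
#include <stdint.h>

uint16_t deinterleave_even32(uint32_t z)
{
    // Low 16 bits of z give result bits 0..7, high 16 bits give bits 8..15.
    return (uint16_t)(deinterleave_even((uint16_t)z)
                    | (deinterleave_even((uint16_t)(z >> 16)) << 8));
}

uint16_t deinterleave_odd32(uint32_t z)
{
    return (uint16_t)(deinterleave_odd((uint16_t)z)
                    | (deinterleave_odd((uint16_t)(z >> 16)) << 8));
}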
If you need speed then you can use a table lookup to convert one byte at a time (a two-byte table is faster but too big). The procedure is written in the Delphi IDE, but the assembler/algorithm is the same.
const
MortonTableLookup : array[byte] of byte = ($00, $01, $10, $11, $12, ... ;
procedure DeinterleaveBits(Input: cardinal);
//In: eax
//Out: dx = EvenBits; ax = OddBits;
asm
movzx ecx, al //Use 0th byte
mov dl, byte ptr[MortonTableLookup + ecx]
//
shr eax, 8
movzx ecx, ah //Use 2nd byte
mov dh, byte ptr[MortonTableLookup + ecx]
//
shl edx, 16
movzx ecx, al //Use 1st byte
mov dl, byte ptr[MortonTableLookup + ecx]
//
shr eax, 8
movzx ecx, ah //Use 3rd byte
mov dh, byte ptr[MortonTableLookup + ecx]
//
mov ecx, edx
and ecx, $F0F0F0F0
mov eax, ecx
rol eax, 12
or eax, ecx
rol edx, 4
and edx, $F0F0F0F0
mov ecx, edx
rol ecx, 12
or edx, ecx
end;
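For readers not using Delphi, the same byte-at-a-time table idea might look roughly like this in C (my own sketch; morton_table, init_morton_table and deinterleave_bytes are hypothetical names, and the table packs each byte's even bits into the low nibble and its odd bits into the high nibble):
#include <stdint.h>

static uint8_t morton_table[256]; // built once at startup

static void init_morton_table(void)
{
    for (int i = 0; i < 256; ++i) {
        uint8_t even = 0, odd = 0;
        for (int b = 0; b < 4; ++b) {
            even |= (uint8_t)(((i >> (2 * b))     & 1) << b);
            odd  |= (uint8_t)(((i >> (2 * b + 1)) & 1) << b);
        }
        morton_table[i] = (uint8_t)((odd << 4) | even);
    }
}

static void deinterleave_bytes(uint32_t z, uint16_t *even_bits, uint16_t *odd_bits)
{
    uint8_t b0 = morton_table[z & 0xFF];
    uint8_t b1 = morton_table[(z >> 8)  & 0xFF];
    uint8_t b2 = morton_table[(z >> 16) & 0xFF];
    uint8_t b3 = morton_table[(z >> 24) & 0xFF];

    // Each input byte contributes 4 bits to each 16-bit output.
    *even_bits = (uint16_t)((b0 & 0x0F) | ((b1 & 0x0F) << 4)
                          | ((b2 & 0x0F) << 8) | ((b3 & 0x0F) << 12));
    *odd_bits  = (uint16_t)((b0 >> 4) | ((b1 >> 4) << 4)
                          | ((b2 >> 4) << 8) | ((b3 >> 4) << 12));
}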
I didn't want to be limited to a fixed size integer and making lists of similar commands with hardcoded constants, so I developed a C++11 solution which makes use of template metaprogramming to generate the functions and the constants. The assembly code generated with -O3 seems as tight as it can get without using BMI:
andl $0x55555555, %eax
movl %eax, %ecx
shrl %ecx
orl %eax, %ecx
andl $0x33333333, %ecx
movl %ecx, %eax
shrl $2, %eax
orl %ecx, %eax
andl $0xF0F0F0F, %eax
movl %eax, %ecx
shrl $4, %ecx
orl %eax, %ecx
movzbl %cl, %esi
shrl $8, %ecx
andl $0xFF00, %ecx
orl %ecx, %esi
TL;DR: source repo and live demo.
Implementation
Basically every step in the morton1 function works by shifting and masking with a sequence of constants which look like this:
0b0101010101010101 (alternate 1 and 0)
0b0011001100110011 (alternate 2x 1 and 0)
0b0000111100001111 (alternate 4x 1 and 0)
0b0000000011111111 (alternate 8x 1 and 0)
If we were to use D dimensions, we would have a pattern with D-1 zeros and 1 one. So to generate these it's enough to generate consecutive ones and apply some bitwise or:
/// @brief Generates 0b1...1 with @tparam n ones
template <class T, unsigned n>
using n_ones = std::integral_constant<T, (~static_cast<T>(0) >> (sizeof(T) * 8 - n))>;
/// @brief Performs `@tparam input | (@tparam input << @tparam width)` @tparam repeat times.
template <class T, T input, unsigned width, unsigned repeat>
struct lshift_add :
public lshift_add<T, lshift_add<T, input, width, 1>::value, width, repeat - 1> {
};
/// @brief Specialization for 1 repetition, just does the shift-and-add operation.
template <class T, T input, unsigned width>
struct lshift_add<T, input, width, 1> : public std::integral_constant<T,
(input & n_ones<T, width>::value) | (input << (width < sizeof(T) * 8 ? width : 0))> {
};
Now we can generate the constants at compile time for arbitrary dimensions with the following:
template <class T, unsigned step, unsigned dimensions = 2u>
using mask = lshift_add<T, n_ones<T, 1 << step>::value, dimensions * (1 << step), sizeof(T) * 8 / (2 << step)>;
With the same type of recursion, we can generate functions for each of the steps of the algorithm x = (x | (x >> K)) & M:
template <class T, unsigned step, unsigned dimensions>
struct deinterleave {
static T work(T input) {
input = deinterleave<T, step - 1, dimensions>::work(input);
return (input | (input >> ((dimensions - 1) * (1 << (step - 1))))) & mask<T, step, dimensions>::value;
}
};
// Omitted specialization for step 0, where there is just a bitwise and
It remains to answer the question "how many steps do we need?". This depends also on the number of dimensions. In general, k steps compute 2^k - 1 output bits; the maximum number of meaningful bits for each dimension is given by z = sizeof(T) * 8 / dimensions, therefore it is enough to take 1 + log_2 z steps. The problem is now that we need this as constexpr in order to use it as a template parameter. The best way I found to work around this is to define log2 via metaprogramming:
template <unsigned arg>
struct log2 : public std::integral_constant<unsigned, log2<(arg >> 1)>::value + 1> {};
template <>
struct log2<1u> : public std::integral_constant<unsigned, 0u> {};
/// @brief Helper constexpr which returns the number of steps needed to fully interleave a type @tparam T.
template <class T, unsigned dimensions>
using num_steps = std::integral_constant<unsigned, log2<sizeof(T) * 8 / dimensions>::value + 1>;
And finally, we can perform one single call:
/// @brief Helper function which combines @see deinterleave and @see num_steps into a single call.
template <class T, unsigned dimensions>
T deinterleave_first(T n) {
return deinterleave<T, num_steps<T, dimensions>::value - 1, dimensions>::work(n);
}
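Usage could then look something like this (my own sketch; it assumes the definitions above are in scope and follows the same even/odd convention as morton2):
#include <cstdint>
#include <iostream>

int main()
{
    std::uint32_t z = 0xED57DB6E; // the interleaved value from the question

    // Even-position bits come out directly; odd-position bits after shifting by 1.
    std::uint32_t even = deinterleave_first<std::uint32_t, 2>(z);
    std::uint32_t odd  = deinterleave_first<std::uint32_t, 2>(z >> 1);

    std::cout << std::hex << even << ' ' << odd << '\n';
}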
Related
The following code combines two bytes into one 16 bit integer.
unsigned char byteOne = 0b00000010; // 2
unsigned char byteTwo = 0b00000011; // 3
uint16_t i = 0b0000000000000000;
i = (byteOne << 8) | byteTwo; //515
I'm trying to understand WHY this code works.
If we break this down and just focus on one byte, byteOne: this is an 8-bit value equal to 00000010. So, left-shifting it by 8 bits should always yield 00000000 (as the bits shifted off the end are lost), right? This seems to be the case with the following code:
uint8_t i = (byteOne << 8); // equal to 0, always, no matter what 8 bit value is assigned to byteOne
But if this way of thinking was correct, then
uint16_t i = (byteOne << 8) | byteTwo;
Should be equivalent to
uint16_t i = 0 | byteTwo; // Because 0b00000010 << 8 == 0b00000000
Or just
uint16_t i = byteTwo; // Because 0b00000000 | 0b00000011 == 0b00000011
But they're not equivalent and this is throwing me off. Is byteOne being cast/converted into a 16 bit int before the shifting operation? That would explain what's going on here as then
0b0000000000000010 << 8 == 0b0000001000000000 // 512
If byteOne isn't being converted into a 16 bit int before the shifting operation, then please explain why the (byteOne << 8) isn't evaluating to 0 when assigning to a 16 bit integer.
Yes: when you do almost any sort of operation on any value smaller than an int, the first thing that happens is that the value is promoted to int (or, in some cases, unsigned int).
In case you really care about the details that apply here (§[conv.prom]/1):
A prvalue of an integer type other than bool, char16_t, char32_t, or wchar_t whose integer conversion rank (6.8.4) is less than the rank of int can be converted to a prvalue of type int if int can represent all the values of the source type; otherwise, the source prvalue can be converted to a prvalue of type unsigned int.
Then the operation happens on the promoted value (§[expr.shift]/1):
The shift operators << and >> group left-to-right.
[...]
The operands shall be of integral or unscoped enumeration type and integral promotions are performed. The type of the result is that of the promoted left operand.
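A small demonstration of the promotion at work (my own example, not from the quoted standard text):
#include <cstdint>
#include <iostream>

int main()
{
    unsigned char byteOne = 0b00000010;

    // The shift happens on the promoted int, so no bits are lost here:
    int promoted = byteOne << 8;            // 512

    // Truncation only happens when the result is stored back into 8 bits:
    std::uint8_t truncated = byteOne << 8;  // 0

    std::cout << promoted << ' ' << (int)truncated << '\n'; // prints "512 0"
}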
As the shift does not happen 'in place' (byteOne = byteOne << 8), the compiler needs to use a register for the intermediate result.
In the line i = (byteOne << 8) | byteTwo; the size of the register for the intermediate is not specified (for example with a cast). Only the final result has to be uint16_t. So for the intermediate result it's up to the compiler.
When your code snippet is fed to a compiler, you could get the following assembler code:
;// copy the two bytes and the word in the stack
movb $2, -1(%rbp) ;// uint8_t byteOne = 2
movb $3, -2(%rbp) ;// uint8_t byteTwo = 3
movw $0, -4(%rbp) ;// uint16_t i = 0
;// move byteOne into the accumulator register (32-bit)
movzbl -1(%rbp), %eax ;// uint32_t temp = byteOne
;// shift left by 8
sall $8, %eax ;// temp = temp << 8
;// move temp to different register
movl %eax, %edx ;// uint32_t temp2 = temp
;// move byteTwo into the accumulator register (32-bit)
movzbl -2(%rbp), %eax ;// temp = byteTwo
;// logical or of temp2 and temp
orl %edx, %eax ;// temp2 = temp2 | temp
;// copy back to stack location of i
movw %ax, -4(%rbp) ;// i = (uint16_t)temp2
%eax is a 32-bit register, therefore there is no overflow. The cast to uint16_t is done by the word-sized move movw %ax, -4(%rbp).
I'm not sure how the compiler decides which register size to use for these intermediate results, but I suspect that it depends on your system and compiler.
The compiler on my system, g++.exe (x86_64-posix-seh-rev1, Built by MinGW-W64 project) 7.2.0, seems to use 32-bit registers by default.
The following code also used 32-bit registers and therefore did not return the expected result:
unsigned char byteOne = 0b00000010; // 2
unsigned char byteTwo = 0b00000011; // 3
uint16_t i = 0b0000000000000000;
i = ((byteOne << 32) | byteTwo << 24) >> 24; // 3
The same 32-bit %eax register is used, therefore the overflow occurred.
So if the intermediate result does not exceed 32 bits, the result is as expected, as with:
unsigned char byteOne = 0b00000010; // 2
unsigned char byteTwo = 0b00000011; // 3
uint16_t i = 0b0000000000000000;
i = ((byteOne << 16) | byteTwo << 8) >> 8; // 515
The compiler for an 8-bit microcontroller will most certainly give a different result.
I'm trying to solve a timing leak by removing an if statement in my code but because of c++'s interpretation of integer inputs in if statements I am stuck.
Note that I assume the compiler does create a conditional branch, which results in timing information being leaked!
The original code is:
int s
if (s)
r = A
else
r = B
Now I'm trying to rewrite it as:
int s;
r = s*A + (1-s)*B
Because s is not restricted to [0,1], the multiplication goes wrong whenever s is outside that range. What can I do, without using an if statement on s, to solve this?
Thanks in advance
What evidence do you have that the if statement is resulting in the timing leak?
If you use a modern compiler with optimizations turned on, that code should not produce a branch. You should check what your compiler is doing by looking at the assembly language output.
For instance, g++ 5.3.0 compiles this code:
int f(int s, int A, int B) {
int r;
if (s)
r = A;
else
r = B;
return r;
}
to this assembly:
movl %esi, %eax
testl %edi, %edi
cmove %edx, %eax
ret
Look, ma! No branches! ;)
If you know the number of bits in the integer, it's pretty easy, although there are a few complications making it standards-clean with the possibility of unusual integer representations.
Here's one simple solution for 32-bit integers:
uint32_t mask = s;
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask &= 1;
r = b ^ (-mask & (a ^ b));
The five shift-and-or statements propagate any set bit in mask so that in the end the low-order bit is 1 unless the mask was originally 0. Then we isolate the low-order bit, resulting in a 1 or 0. The last statement is a bit-hacking equivalent of your two multiplies and add.
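Put together as a function, that looks like the following (a sketch under the same 32-bit assumption; select_nonzero is a made-up name):
#include <stdint.h>

// Branchless r = (s != 0) ? a : b, using the steps described above.
uint32_t select_nonzero(uint32_t s, uint32_t a, uint32_t b)
{
    uint32_t mask = s;
    mask |= mask >> 1;
    mask |= mask >> 2;
    mask |= mask >> 4;
    mask |= mask >> 8;
    mask |= mask >> 16;
    mask &= 1;                     // 1 if s != 0, else 0
    return b ^ (-mask & (a ^ b));  // picks a when mask == 1, b when mask == 0
}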
Here is a faster one based on the observation that if you subtract one from a number and the sign bit changes from 0 to 1, then the number was 0:
uint32_t mask = (((uint32_t(s) - 1U) & ~uint32_t(s)) >> 31) - 1U;
That is essentially the same computation as subtracting 1 and then using the carry bit, but unfortunately the carry bit is not exposed to the C language (except possibly through compiler-specific intrinsics).
Other variations are possible.
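For completeness, the faster mask wired into the same select might look like this (again my own untested sketch; here the mask is already all-ones or all-zeros, so no negation is needed):
#include <stdint.h>

uint32_t select_nonzero_fast(uint32_t s, uint32_t a, uint32_t b)
{
    // All-ones when s != 0, all-zeros when s == 0.
    uint32_t mask = ((((uint32_t)s - 1U) & ~(uint32_t)s) >> 31) - 1U;
    return b ^ (mask & (a ^ b));
}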
The only way to do it without branches when the optimization is not available is to resort to inline assembly. Assuming 8086:
mov ax, s
neg ax ; CF = (ax != 0)
sbb ax, ax ; ax = (s != 0 ? -1 : 0)
neg ax ; ax = (s != 0 ? 1 : 0)
mov s, ax ; now use s at will, it will be: s = (s != 0 ? 1 : 0)
In my code the following lines are currently the hotspot:
int table1[256] = /*...*/;
int table2[512] = /*...*/;
int table3[512] = /*...*/;
int* result = /*...*/;
for(int r = 0; r < r_end; ++r)
{
    std::uint64_t bits = bit_reader.value(); // 64 bits, no assumption regarding bits.
    // The get_ functions are table lookups from the highest word of the bits variable.
    struct entry
    {
        int sign_offset : 5;
        int r_offset : 4;
        int x : 7;
    };
    // NOTE: We are only interested in the highest word in the bits variable.
    entry e;
    if(is_in_table1(bits)) // branch prediction should work well here since table1 will be hit more often than 2 or 3, and 2 more often than 3.
        e = reinterpret_cast<const entry&>(table1[get_table1_index(bits)]);
    else if(is_in_table2(bits))
        e = reinterpret_cast<const entry&>(table2[get_table2_index(bits)]);
    else
        e = reinterpret_cast<const entry&>(table3[get_table3_index(bits)]);
    r += e.r_offset; // r is 18 bits, top 14 bits are always 0.
    int x = e.x; // x is 14 bits, top 18 bits are always 0.
    int sign_offset = e.sign_offset;
    assert(sign_offset <= 16 && sign_offset > 0);
    // The following is the hotspot.
    int sign = 1 - (bits >> (63 - sign_offset) & 0x2);
    (*result++) = ((x << 18) * sign) | r; // 32 bits
    // End of hotspot
    bit_reader.skip(sign_offset); // sign_offset is the last bit used.
}
Though I haven't figured out how to further optimize this, maybe something from intrinsics for Operations at Bit-Granularity, __shiftleft128 or _rot could be useful?
Note that I am also processing the resulting data on the GPU, so the important thing is to get something into result which the GPU can then use to compute the correct values.
Suggestions?
EDIT:
Added table look-up.
EDIT:
int sign = 1 - (bits >> (63 - e.sign_offset) & 0x2);
000000013FD6B893 and ecx,1Fh
000000013FD6B896 mov eax,3Fh
000000013FD6B89B sub eax,ecx
000000013FD6B89D movzx ecx,al
000000013FD6B8A0 shr r8,cl
000000013FD6B8A3 and r8d,2
000000013FD6B8A7 mov r14d,1
000000013FD6B8AD sub r14d,r8d
I overlooked the fact that the sign is +/-1, so I'm correcting my answer.
Assuming that mask is an array with properly defined bitmasks for all possible values of sign_offset, this approach might be faster
bool sign = (bits & mask[sign_offset]) != 0;
__int64 result = r;
if (sign)
result |= -(x << 18);
else
result |= x << 18;
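The mask array itself could be prepared once along these lines (my own sketch; it selects the same bit that (bits >> (63 - sign_offset)) & 0x2 tests in the original code, and assumes sign_offset stays in [1, 16] as per the assert):
#include <stdint.h>

static uint64_t mask[17]; // index by sign_offset, entry 0 unused

static void init_sign_masks(void)
{
    for (int k = 1; k <= 16; ++k)
        mask[k] = 1ULL << (64 - k); // bit tested by (bits >> (63 - k)) & 0x2
}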
The code generated by VC2010 optimized build
OP code (11 instructions)
; 23 : __int64 sign = 1 - (bits >> (63 - sign_offset) & 0x2);
mov rax, QWORD PTR bits$[rsp]
mov ecx, 63 ; 0000003fH
sub cl, BYTE PTR sign_offset$[rsp]
mov edx, 1
sar rax, cl
; 24 : __int64 result = ((x << 18) * sign) | r; // 32 bits
; 25 : std::cout << result;
and eax, 2
sub rdx, rax
mov rax, QWORD PTR x$[rsp]
shl rax, 18
imul rdx, rax
or rdx, QWORD PTR r$[rsp]
My code (8 instructions)
; 34 : bool sign = (bits & mask[sign_offset]) != 0;
mov r11, QWORD PTR sign_offset$[rsp]
; 35 : __int64 result = r;
; 36 : if (sign)
; 37 : result |= -(x << 18);
mov rdx, QWORD PTR x$[rsp]
mov rax, QWORD PTR mask$[rsp+r11*8]
shl rdx, 18
test rax, QWORD PTR bits$[rsp]
je SHORT $LN2#Test1
neg rdx
$LN2#Test1:
; 38 : else
; 39 : result |= x << 18;
or rdx, QWORD PTR r$[rsp]
EDIT by Skizz
To get rid of branch:
shl rdx, 18
lea rbx,[rdx*2]
test rax, QWORD PTR bits$[rsp]
cmove rbx,0
sub rdx,rbx
or rdx, QWORD PTR r$[rsp]
Let's do some equivalent transformations:
int sign = 1 - (bits >> (63 - sign_offset) & 0x2);
int result = ((x << 18) * sign) | r; // 32 bits
Perhaps the processor will find shifting 32-bit values cheaper -- replace the definition of HIDWORD with whatever leads to direct access to the high-order DWORD without shifting. Also, for preparation of the next step, let's rearrange the shifting in the second assignment:
#define HIDWORD(q) ((uint32_t)((q) >> 32))
int sign = 1 - (HIDWORD(bits) >> (31 - sign_offset) & 0x2);
int result = ((x * sign) << 18) | r; // 32 bits
Observe that, in two's complement, q * (-1) equals ~q + 1, or (q ^ -1) - (-1), while q * 1 equals (q ^ 0) - 0. This justifies the second transformation which gets rid of the nasty multiplication:
int mask = -(HIDWORD(bits) >> (32 - sign_offset) & 0x1);
int result = (((x ^ mask) - mask) << 18) | r; // 32 bits
Now let's rearrange shifting again:
int mask = (-(HIDWORD(bits) >> (32 - sign_offset) & 0x1)) << 18;
int result = (((x << 18) ^ mask) - mask) | r; // 32 bits
Recall the identity concerning - and ~:
int mask = (~(HIDWORD(bits) >> (32 - sign_offset) & 0x1) + 1) << 18;
Shift rearrangement again:
int mask = ((~(HIDWORD(bits) >> (32 - sign_offset) & 0x1)) << 18) + (1 << 18);
Who can finally unfiddle this? (Are the transformations correct anyway?)
(Note that only profiling on a real CPU can assess the performance. Measures like instruction count won't do. I am not even sure that the transformations helped at all.)
Memory access is usually the root of all optimisation problems on modern CPUs. You are being misled by the performance tools as to where the slow down is happening. The compiler is probably re-ordering the code to something like this:-
int sign = 1 - (bits >> (63 - get_sign_offset(bits)) & 0x2);
(*result++) = ((get_x(bits) << 18) * sign) | (r += get_r_offset(bits));
or even:-
(*result++) = ((get_x(bits) << 18) * (1 - (bits >> (63 - get_sign_offset(bits)) & 0x2))) | (r += get_r_offset(bits));
This would highlight the lines you identified as being the hotspot.
I would look at the way you organise your memory and at what the various get_ functions do. Can you post the get_ functions at all?
To calculate the sign, I would suggest this:
int sign = (int)(((int64_t)(bits << sign_offset)) >> 63);
Which is only 2 instructions (shl and sar).
If sign_offset is one bigger than I expected:
int sign = (int)(((int64_t)(bits << (sign_offset - 1))) >> 63);
Which is still not bad. Should be only 3 instructions.
That gives an answer as 0 or -1, with which you can do this:
(*result++) = (((x << 18) ^ sign) - sign) | r;
I think this is the fastest solution:
*result++ = (_rotl64(bits, sign_offset) << 31) | (x << 18) | (r << 0); // 32 bits
And then correct x depending on whether the sign bit is set or not on the GPU.
Possible Duplicates:
Previous power of 2
Getting the Leftmost Bit
What I want is: suppose there is a number 5, i.e. 101. My answer should be 100. For 9, i.e. 1001, the answer should be 1000.
You can't ask for the fastest sequence without giving constraints on the machine on which this has to run. For example, some machines support an instruction called "count leading zeroes" or have means to emulate it very quickly. If you can access this instruction (for example with gcc) then you can write:
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
uint32_t f(uint32_t x)
{
    return ((uint64_t)1) << (32 - __builtin_clz(x) - 1);
}

int main()
{
    printf("=>%d\n", f(5));
    printf("=>%d\n", f(9));
}
f(x) returns what you want (the largest y with y <= x and y = 2**n). The compiler will now generate the optimal code sequence for the target machine. For example, when compiling for a default x86_64 architecture, f() looks like this:
bsrl %edi, %edi
movl $31, %ecx
movl $1, %eax
xorl $31, %edi
subl %edi, %ecx
salq %cl, %rax
ret
You see, no loops here! 7 instructions, no branches.
But if I tell my compiler (gcc-4.5) to optimize for the machine I'm using right now (AMD Phenom-II), then this comes out for f():
bsrl %edi, %ecx
movl $1, %eax
salq %cl, %rax
ret
This is probably the fastest way to go for this machine.
EDIT: f(0) would have resulted in UB, I've fixed that (and the assembly). Also, uint32_t means that I can write 32 without feeling guilty :-)
From Hacker's Delight, a nice branchless solution:
uint32_t flp2 (uint32_t x)
{
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);
    return x - (x >> 1);
}
This typically takes 12 instructions. You can do it in fewer if your CPU has a "count leading zeroes" instruction.
int input = 5;
std::size_t numBits = 0;
while(input)
{
input >>= 1;
numBits++;
}
int output = 1 << (numBits-1);
This is a task related to bit counting. Check this out.
Using method 2a (which is my favorite of the algorithms; not the fastest), one can come up with this:
int highest_bit_mask (unsigned int n) {
    while (n) {
        if (n & (n-1)) {
            n &= (n-1);
        } else {
            return n;
        }
    }
    return 0;
}
The magic of n &= (n-1); is that it removes the least significant set bit from n. (Corollary: n & (n-1) is false only when n has precisely one bit set.) The algorithm's complexity depends on the number of bits set in the input.
Check out the link anyway. It is a very amusing and enlightening read which might give you more ideas.
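As a quick illustration of that property (my own example, not from the linked article):
#include <stdio.h>

int main(void)
{
    // Trace for n = 22 (0b10110):
    //   22 & 21 = 20 (0b10100)  -> n = 20
    //   20 & 19 = 16 (0b10000)  -> n = 16
    //   16 & 15 = 0             -> 16 has a single bit set, return it
    printf("%d\n", highest_bit_mask(22)); // prints 16
    return 0;
}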
Write a branchless function that returns 0, 1, or 2 if the difference between two signed integers is zero, negative, or positive.
Here's a version with branching:
int Compare(int x, int y)
{
    int diff = x - y;
    if (diff == 0)
        return 0;
    else if (diff < 0)
        return 1;
    else
        return 2;
}
Here's a version that may be faster depending on compiler and processor:
int Compare(int x, int y)
{
int diff = x - y;
return diff == 0 ? 0 : (diff < 0 ? 1 : 2);
}
Can you come up with a faster one without branches?
SUMMARY
The 10 solutions I benchmarked had similar performance. The actual numbers and winner varied depending on compiler (icc/gcc), compiler options (e.g., -O3, -march=nocona, -fast, -xHost), and machine. Canon's solution performed well in many benchmark runs, but again the performance advantage was slight. I was surprised that in some cases some solutions were slower than the naive solution with branches.
Branchless (at the language level) code that maps negative to -1, zero to 0 and positive to +1 looks as follows
int c = (n > 0) - (n < 0);
if you need a different mapping you can simply use an explicit map to remap it
const int MAP[] = { 1, 0, 2 };
int c = MAP[(n > 0) - (n < 0) + 1];
or, for the requested mapping, use some numerical trick like
int c = 2 * (n > 0) + (n < 0);
(It is obviously very easy to generate any mapping from this as long as 0 is mapped to 0. And the code is quite readable. If 0 is mapped to something else, it becomes more tricky and less readable.)
As an additional note: comparing two integers by subtracting one from the other at the C language level is a flawed technique, since it is generally prone to overflow. The beauty of the above methods is that they can immediately be used for "subtractionless" comparisons, like
int c = 2 * (x > y) + (x < y);
int Compare(int x, int y) {
    return (x < y) + ((y < x) << 1);
}
Edit: Bitwise only? Guess < and > don't count, then?
int Compare(int x, int y) {
int diff = x - y;
return (!!diff) | (!!(diff & 0x80000000) << 1);
}
But there's that pesky -.
Edit: Shift the other way around.
Meh, just to try again:
int Compare(int x, int y) {
int diff = y - x;
return (!!diff) << ((diff >> 31) & 1);
}
But I'm guessing there's no standard ASM instruction for !!. Also, the << can be replaced with +, depending on which is faster...
Bit twiddling is fun!
Hmm, I just learned about setnz.
I haven't checked the assembler output (but I did test it a bit this time), and with a bit of luck it could save a whole instruction!:
IN THEORY. MY ASSEMBLER IS RUSTY
subl %edi, %esi
setnz %eax
sarl $31, %esi
andl $1, %esi
sarl %eax, %esi
mov %esi, %eax
ret
Rambling is fun.
I need sleep.
Assuming two's complement, arithmetic right shift, and no overflow in the subtraction:
#include <limits.h>

#define SHIFT (CHAR_BIT * sizeof(int) - 1)
int Compare(int x, int y)
{
int diff = x - y;
return -(diff >> SHIFT) - (((-diff) >> SHIFT) << 1);
}
Two's complement:
#include <limits.h>
#define INT_BITS (CHAR_BIT * sizeof (int))
int Compare(int x, int y) {
int d = y - x;
int p = (d + INT_MAX) >> (INT_BITS - 1);
d = d >> (INT_BITS - 2);
return (d & 2) + (p & 1);
}
Assuming a sane compiler, this will not invoke the comparison hardware of your system, nor is it using a comparison in the language. To verify: if x == y then d and p will clearly be 0 so the final result will be zero. If (x - y) > 0 then ((x - y) + INT_MAX) will set the high bit of the integer otherwise it will be unset. So p will have its lowest bit set if and only if (x - y) > 0. If (x - y) < 0 then its high bit will be set and d will set its second to lowest bit.
Unsigned Comparison that returns -1,0,1 (cmpu) is one of the cases that is tested for by the GNU SuperOptimizer.
cmpu: compare (unsigned)
int cmpu(unsigned_word v0, unsigned_word v1)
{
return ( (v0 > v1) ? 1 : ( (v0 < v1) ? -1 : 0) );
}
A SuperOptimizer exhaustively searches the instruction space for the best possible combination of instructions that will implement a given function. It is suggested that compilers automagically replace the functions above by their superoptimized versions (although not all compilers do this). For example, in the PowerPC Compiler Writer's Guide (powerpc-cwg.pdf), the cmpu function is shown as this in Appendix D pg 204:
cmpu: compare (unsigned)
PowerPC SuperOptimized Version
subf R5,R4,R3
subfc R6,R3,R4
subfe R7,R4,R3
subfe R8,R7,R5
That's pretty good isn't it... just four subtracts (and with carry and/or extended versions). Not to mention it is genuinely branchfree at the machine opcode level. There is probably a PC / Intel X86 equivalent sequence that is similarly short since the GNU Superoptimizer runs for X86 as well as PowerPC.
Note that Unsigned Comparison (cmpu) can be turned into Signed Comparison (cmps) on a 32-bit compare by adding 0x80000000 to both Signed inputs before passing it to cmpu.
cmps: compare (signed)
int cmps(signed_word v0, signed_word v1)
{
    signed_word offset = 0x80000000;
    return cmpu((unsigned_word)(v0 + offset),
                (unsigned_word)(v1 + offset));
}
This is just one option though... the SuperOptimizer may find a cmps that is shorter and does not have to add offsets and call cmpu.
To get the version that you requested that returns your values of {1,0,2} rather than {-1,0,1} use the following code which takes advantage of the SuperOptimized cmps function.
int Compare(int x, int y)
{
static const int retvals[]={1,0,2};
return (retvals[cmps(x,y)+1]);
}
I'm siding with Tordek's original answer:
int compare(int x, int y) {
return (x < y) + 2*(y < x);
}
Compiling with gcc -O3 -march=pentium4 results in branch-free code that uses conditional instructions setg and setl (see this explanation of x86 instructions).
push %ebp
mov %esp,%ebp
mov %eax,%ecx
xor %eax,%eax
cmp %edx,%ecx
setg %al
add %eax,%eax
cmp %edx,%ecx
setl %dl
movzbl %dl,%edx
add %edx,%eax
pop %ebp
ret
Good god, this has haunted me.
Whatever, I think I squeezed out a last drop of performance:
int compare(int a, int b) {
return (a != b) << (a > b);
}
Although, compiling with -O3 in GCC will give (bear with me I'm doing it from memory)
xorl %eax, %eax
cmpl %esi, %edi
setne %al
cmpl %esi, %edi
setgt %dl
sall %dl, %eax
ret
But the second comparison seems (according to a tiny bit of testing; I suck at ASM) to be redundant, leaving the small and beautiful
xorl %eax, %eax
cmpl %esi, %edi
setne %al
setgt %dl
sall %dl, %eax
ret
(Sall may totally not be an ASM instruction, but I don't remember exactly)
So... if you don't mind running your benchmark once more, I'd love to hear the results (mine gave a 3% improvement, but it may be wrong).
Combining Stephen Canon and Tordek's answers:
int Compare(int x, int y)
{
int diff = x - y;
return -(diff >> 31) + (2 & (-diff >> 30));
}
Yields: (g++ -O3)
subl %esi,%edi
movl %edi,%eax
sarl $31,%edi
negl %eax
sarl $30,%eax
andl $2,%eax
subl %edi,%eax
ret
Tight! However, Paul Hsieh's version has even fewer instructions:
subl %esi,%edi
leal 0x7fffffff(%rdi),%eax
sarl $30,%edi
andl $2,%edi
shrl $31,%eax
leal (%rdi,%rax,1),%eax
ret
int Compare(int x, int y)
{
    int diff = x - y;
    int absdiff = 0x7fffffff & diff;            // diff with sign bit 0
    int absdiff_not_zero = (int) (0 != absdiff);
    return
        (absdiff_not_zero << 1)                 // 2 iff abs(diff) > 0
        -
        ((0x80000000 & diff) >> 31);            // 1 iff diff < 0
}
For 32-bit signed integers (like in Java), try:
return 2 - ((((x >> 30) & 2) + (((x-1) >> 30) & 2))) >> 1;
where (x >> 30) & 2 returns 2 for negative numbers and 0 otherwise.
x would be the difference of the two input integers.
The basic C answer is:
int v; // find the absolute value of v
unsigned int r; // the result goes here
int const mask = v >> (sizeof(int) * CHAR_BIT - 1);
r = (v + mask) ^ mask;
Also:
r = (v ^ mask) - mask;
The value of sizeof(int) is often 4 and CHAR_BIT is often 8.
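A quick sanity check of the mask trick (my own sketch; assumes a 32-bit two's complement int):
#include <limits.h>
#include <stdio.h>

int main(void)
{
    int v = -5;
    int const mask = v >> (sizeof(int) * CHAR_BIT - 1); // -1 for negative v, 0 otherwise
    unsigned int r = (v + mask) ^ mask;                 // (-5 + -1) ^ -1 == 5
    printf("%u\n", r); // prints 5
    return 0;
}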