Branchless code that maps zero, negative, and positive to 0, 1, 2 - c++

Write a branchless function that returns 0, 1, or 2 according to whether the difference between two signed integers is zero, negative, or positive.
Here's a version with branching:
int Compare(int x, int y)
{
int diff = x - y;
if (diff == 0)
return 0;
else if (diff < 0)
return 1;
else
return 2;
}
Here's a version that may be faster depending on compiler and processor:
int Compare(int x, int y)
{
int diff = x - y;
return diff == 0 ? 0 : (diff < 0 ? 1 : 2);
}
Can you come up with a faster one without branches?
SUMMARY
The 10 solutions I benchmarked had similar performance. The actual numbers and winner varied depending on compiler (icc/gcc), compiler options (e.g., -O3, -march=nocona, -fast, -xHost), and machine. Canon's solution performed well in many benchmark runs, but again the performance advantage was slight. I was surprised that in some cases some solutions were slower than the naive solution with branches.

Branchless (at the language level) code that maps negative to -1, zero to 0 and positive to +1 looks as follows
int c = (n > 0) - (n < 0);
if you need a different mapping you can simply use an explicit map to remap it
const int MAP[] = { 1, 0, 2 };
int c = MAP[(n > 0) - (n < 0) + 1];
or, for the requested mapping, use some numerical trick like
int c = 2 * (n > 0) + (n < 0);
(It is obviously very easy to generate any mapping from this as long as 0 is mapped to 0. And the code is quite readable. If 0 is mapped to something else, it becomes more tricky and less readable.)
As an additional note: comparing two integers by subtracting one from the other at the C language level is a flawed technique, since it is generally prone to overflow. The beauty of the above methods is that they can immediately be used for "subtractionless" comparisons, like
int c = 2 * (x > y) + (x < y);

int Compare(int x, int y) {
return (x < y) + ((y < x) << 1);
}
Edit: Bitwise only? Guess < and > don't count, then?
int Compare(int x, int y) {
int diff = x - y;
return (!!diff) | (!!(diff & 0x80000000) << 1);
}
But there's that pesky -.
Edit: Shift the other way around.
Meh, just to try again:
int Compare(int x, int y) {
int diff = y - x;
return (!!diff) << ((diff >> 31) & 1);
}
But I'm guessing there's no standard ASM instruction for !!. Also, the << can be replaced with +, depending on which is faster...
Bit twiddling is fun!
Hmm, I just learned about setnz.
I haven't checked the assembler output (but I did test it a bit this time), and with a bit of luck it could save a whole instruction!:
IN THEORY. MY ASSEMBLER IS RUSTY
subl %edi, %esi
setnz %eax
sarl $31, %esi
andl $1, %esi
sarl %eax, %esi
mov %esi, %eax
ret
Rambling is fun.
I need sleep.

Assuming 2's complement, arithmetic right shift, and no overflow in the subtraction:
#include <limits.h>
#define SHIFT (CHAR_BIT*sizeof(int) - 1)
int Compare(int x, int y)
{
int diff = x - y;
return -(diff >> SHIFT) - (((-diff) >> SHIFT) << 1);
}

Two's complement:
#include <limits.h>
#define INT_BITS (CHAR_BIT * sizeof (int))
int Compare(int x, int y) {
int d = y - x;
int p = (d + INT_MAX) >> (INT_BITS - 1);
d = d >> (INT_BITS - 2);
return (d & 2) + (p & 1);
}
Assuming a sane compiler, this will not invoke the comparison hardware of your system, nor does it use a comparison in the language. To verify: if x == y then d and p will clearly be 0, so the final result will be zero. If (y - x) > 0 then ((y - x) + INT_MAX) will set the high bit of the integer; otherwise it will be unset. So p will have its lowest bit set if and only if (y - x) > 0, i.e. x < y. If (y - x) < 0 then its high bit will be set and d will set its second-to-lowest bit.

Unsigned Comparison that returns -1,0,1 (cmpu) is one of the cases that is tested for by the GNU SuperOptimizer.
cmpu: compare (unsigned)
int cmpu(unsigned_word v0, unsigned_word v1)
{
return ( (v0 > v1) ? 1 : ( (v0 < v1) ? -1 : 0) );
}
A SuperOptimizer exhaustively searches the instruction space for the best possible combination of instructions that will implement a given function. It is suggested that compilers automagically replace the functions above by their superoptimized versions (although not all compilers do this). For example, in the PowerPC Compiler Writer's Guide (powerpc-cwg.pdf), the cmpu function is shown as this in Appendix D pg 204:
cmpu: compare (unsigned)
PowerPC SuperOptimized Version
subf R5,R4,R3
subfc R6,R3,R4
subfe R7,R4,R3
subfe R8,R7,R5
That's pretty good isn't it... just four subtracts (and with carry and/or extended versions). Not to mention it is genuinely branchfree at the machine opcode level. There is probably a PC / Intel X86 equivalent sequence that is similarly short since the GNU Superoptimizer runs for X86 as well as PowerPC.
Note that Unsigned Comparison (cmpu) can be turned into Signed Comparison (cmps) on a 32-bit compare by adding 0x80000000 to both Signed inputs before passing it to cmpu.
cmps: compare (signed)
int cmps(signed_word v0, signed_word v1)
{
unsigned_word offset = 0x80000000;
return cmpu( (unsigned_word) v0 + offset,
(unsigned_word) v1 + offset );
}
This is just one option though... the SuperOptimizer may find a cmps that is shorter and does not have to add offsets and call cmpu.
To get the version that you requested that returns your values of {1,0,2} rather than {-1,0,1} use the following code which takes advantage of the SuperOptimized cmps function.
int Compare(int x, int y)
{
static const int retvals[]={1,0,2};
return (retvals[cmps(x,y)+1]);
}

I'm siding with Tordek's original answer:
int compare(int x, int y) {
return (x < y) + 2*(y < x);
}
Compiling with gcc -O3 -march=pentium4 results in branch-free code that uses conditional instructions setg and setl (see this explanation of x86 instructions).
push %ebp
mov %esp,%ebp
mov %eax,%ecx
xor %eax,%eax
cmp %edx,%ecx
setg %al
add %eax,%eax
cmp %edx,%ecx
setl %dl
movzbl %dl,%edx
add %edx,%eax
pop %ebp
ret

Good god, this has haunted me.
Whatever, I think I squeezed out a last drop of performance:
int compare(int a, int b) {
return (a != b) << (a > b);
}
Although, compiling with -O3 in GCC will give (bear with me I'm doing it from memory)
xorl %eax, %eax
cmpl %esi, %edi
setne %al
cmpl %esi, %edi
setgt %dl
sall %dl, %eax
ret
But the second comparison seems (according to a tiny bit of testing; I suck at ASM) to be redundant, leaving the small and beautiful
xorl %eax, %eax
cmpl %esi, %edi
setne %al
setgt %dl
sall %dl, %eax
ret
(Sall may totally not be an ASM instruction, but I don't remember exactly)
So... if you don't mind running your benchmark once more, I'd love to hear the results (mine gave a 3% improvement, but it may be wrong).

Combining Stephen Canon and Tordek's answers:
int Compare(int x, int y)
{
int diff = x - y;
return -(diff >> 31) + (2 & (-diff >> 30));
}
Yields: (g++ -O3)
subl %esi,%edi
movl %edi,%eax
sarl $31,%edi
negl %eax
sarl $30,%eax
andl $2,%eax
subl %edi,%eax
ret
Tight! However, Paul Hsieh's version has even fewer instructions:
subl %esi,%edi
leal 0x7fffffff(%rdi),%eax
sarl $30,%edi
andl $2,%edi
shrl $31,%eax
leal (%rdi,%rax,1),%eax
ret

int Compare(int x, int y)
{
int diff = x - y;
int absdiff = 0x7fffffff & diff; // diff with sign bit 0
int absdiff_not_zero = (int) (0 != absdiff);
return
(absdiff_not_zero << 1) // 2 iff abs(diff) > 0
-
((0x80000000 & diff) >> 31); // 1 iff diff < 0
}

For 32-bit signed integers (as in Java), try:
return (2 - (((x >> 30) & 2) + (((x - 1) >> 30) & 2))) >> 1;
where (x >> 30) & 2 returns 2 for negative numbers and 0 otherwise, and ((x - 1) >> 30) & 2 returns 2 whenever x <= 0.
x would be the difference of the two input integers. Note that, as written, this yields the {-1, 0, +1} signum rather than the requested {0, 1, 2}; remap it with an explicit table as shown earlier if needed.

The basic C answer, for computing the absolute value of v, is:
int v; // find the absolute value of v
unsigned int r; // the result goes here
int const mask = v >> sizeof(int) * CHAR_BIT - 1;
r = (v + mask) ^ mask;
Also :
r = (v ^ mask) - mask;
The value of sizeof(int) is often 4 and CHAR_BIT is often 8.


On a 64 bit machine, can I safely operate on individual bytes of a 64 bit quadword in parallel?

Background
I am doing parallel operations on rows and columns in images. My images are 8 bit or 16 bit pixels and I'm on a 64 bit machine.
When I do operations on columns in parallel, two adjacent columns may share the same 32 bit int or 64 bit long. Basically, I want to know whether I can safely operate on individual bytes of the same quadword in parallel.
Minimal Test
I wrote a minimal test function that I have not been able to make fail. For each byte in a 64 bit long, I concurrently perform successive multiplications in a finite field of order p. I know that by Fermat's little theorem a^(p-1) = 1 mod p when p is prime. I vary the values a and p for each of my 8 threads, and I perform k*(p-1) multiplications of a. When the threads finish each byte should be 1. And in fact, my test cases pass. Each time I run, I get the following output:
8
101010101010101
101010101010101
My system is Linux 4.13.0-041300-generic x86_64 with an 8 core Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz. I compiled with g++ 7.2.0 -O2 and examined the assembly. I added the assembly for the "INNER LOOP" and commented it. It seems to me that the code generated is safe because the stores are only writing the lower 8 bits to the destination instead of doing some bitwise arithmetic and storing to the entire word or quadword. g++ -O3 generated similar code.
Question:
I want to know if this code is always thread-safe, and if not, in what conditions would it not be. Maybe I am being very paranoid, but I feel that I would need to operate on quadwords at a time in order to be safe.
#include <iostream>
#include <pthread.h>
class FermatLTParams
{
public:
FermatLTParams(unsigned char *_dst, unsigned int _p, unsigned int _a, unsigned int _k)
: dst(_dst), p(_p), a(_a), k(_k) {}
unsigned char *dst;
unsigned int p, a, k;
};
void *PerformFermatLT(void *_p)
{
unsigned int j, i;
FermatLTParams *p = reinterpret_cast<FermatLTParams *>(_p);
for(j=0; j < p->k; ++j)
{
//a^(p-1) == 1 mod p
//...BEGIN INNER LOOP
for(i=1; i < p->p; ++i)
{
p->dst[0] = (unsigned char)(p->dst[0]*p->a % p->p);
}
//...END INNER LOOP
/* gcc 7.2.0 -O2 (INNER LOOP)
.L4:
movq (%rdi), %r8 # r8 = dst
xorl %edx, %edx # edx = 0
addl $1, %esi # ++i
movzbl (%r8), %eax # eax (lower 8 bits) = dst[0]
imull 12(%rdi), %eax # eax = a * eax
divl %ecx # eax = eax / ecx; edx = eax % ecx
movb %dl, (%r8) # dst[0] = edx (lower 8 bits)
movl 8(%rdi), %ecx # ecx = p
cmpl %esi, %ecx # if (i < p)
ja .L4 # goto L4
*/
}
return NULL;
}
int main(int argc, const char **argv)
{
int i;
unsigned long val = 0x0101010101010101; //a^0 = 1
unsigned int k = 10000000;
std::cout << sizeof(val) << std::endl;
std::cout << std::hex << val << std::endl;
unsigned char *dst = reinterpret_cast<unsigned char *>(&val);
pthread_t threads[8];
FermatLTParams params[8] =
{
FermatLTParams(dst+0, 11, 5, k),
FermatLTParams(dst+1, 17, 8, k),
FermatLTParams(dst+2, 43, 3, k),
FermatLTParams(dst+3, 31, 4, k),
FermatLTParams(dst+4, 13, 3, k),
FermatLTParams(dst+5, 7, 2, k),
FermatLTParams(dst+6, 11, 10, k),
FermatLTParams(dst+7, 13, 11, k)
};
for(i=0; i < 8; ++i)
{
pthread_create(threads+i, NULL, PerformFermatLT, params+i);
}
for(i=0; i < 8; ++i)
{
pthread_join(threads[i], NULL);
}
std::cout << std::hex << val << std::endl;
return 0;
}
The answer is YES, you can safely operate on individual bytes of a 64-bit quadword in parallel, by different threads.
It is amazing that it works, but it would be a disaster if it did not. All hardware acts as if a core writing a byte marks, in its own cache, not just that the cache line is dirty, but which bytes within it. When that cache line (64 or 128 or even 256 bytes) eventually gets written to main memory, only the dirty bytes actually modify the main memory. This is essential, because otherwise when two threads were working on independent data that happened to occupy the same cache line, they would trash each other's results.
This can be bad for performance, because the way it works is partly through the magic of "cache coherency," where when one thread writes a byte all the caches in the system that have that same line of data are affected. If they're dirty, they need to write to main memory, and then either drop the cache line, or capture the changes from the other thread. There are all kinds of different implementations, but it is generally expensive.

Reducing an integer to 1 if it is not equal to 0

I'm trying to solve a timing leak by removing an if statement in my code, but because of C++'s interpretation of integer inputs in if statements I am stuck.
Note that I assume the compiler does create a conditional branch, which results in timing information being leaked!
The original code is:
int s;
if (s)
r = A;
else
r = B;
Now I'm trying to rewrite it as:
int s;
r = s*A + (1-s)*B;
Because s is not bound to [0,1], the weights multiplying A and B are wrong whenever s is outside [0,1]. What can I do, without using an if statement on s, to solve this?
Thanks in advance
What evidence do you have that the if statement is resulting in the timing leak?
If you use a modern compiler with optimizations turned on, that code should not produce a branch. You should check what your compiler is doing by looking at the assembly language output.
For instance, g++ 5.3.0 compiles this code:
int f(int s, int A, int B) {
int r;
if (s)
r = A;
else
r = B;
return r;
}
to this assembly:
movl %esi, %eax
testl %edi, %edi
cmove %edx, %eax
ret
Look, ma! No branches! ;)
If you know the number of bits in the integer, it's pretty easy, although there are a few complications making it standards-clean with the possibility of unusual integer representations.
Here's one simple solution for 32-bit integers:
uint32_t mask = s;
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask &= 1;
r = b ^ (-mask & (a ^ b));
The five shift-and-or statements propagate any set bit in mask so that in the end the low-order bit is 1 unless the mask was originally 0. Then we isolate the low-order bit, resulting in a 1 or 0. The last statement is a bit-hacking equivalent of your two multiplies and add.
Here is a faster one based on the observation that if you subtract one from a number and the sign bit changes from 0 to 1, then the number was 0:
uint32_t mask = (((uint32_t(s) - 1U) & ~uint32_t(s)) >> 31) - 1U;
That is essentially the same computation as subtracting 1 and then using the carry bit, but unfortunately the carry bit is not exposed to the C language (except possibly through compiler-specific intrinsics).
Other variations are possible.
The only way to do it without branches when the optimization is not available is to resort to inline assembly. Assuming 8086:
mov ax, s
neg ax ; CF = (ax != 0)
sbb ax, ax ; ax = (s != 0 ? -1 : 0)
neg ax ; ax = (s != 0 ? 1 : 0)
mov s, ax ; now use s at will, it will be: s = (s != 0 ? 1 : 0)

How to efficiently de-interleave bits (inverse Morton)

This question: How to de-interleave bits (UnMortonizing?) has a good answer for extracting one of the two halves of a Morton number (just the odd bits), but I need a solution which extracts both parts (the odd bits and the even bits) in as few operations as possible.
For my use I would need to take a 32 bit int and extract two 16 bit ints, where one is the even bits and the other is the odd bits shifted right by 1 bit, e.g.
input, z: 11101101 01010111 11011011 01101110
output, x: 11100001 10110111 // odd bits shifted right by 1
y: 10111111 11011010 // even bits
There seem to be plenty of solutions using shifts and masks with magic numbers for generating Morton numbers (i.e. interleaving bits), e.g. Interleave bits by Binary Magic Numbers, but I haven't yet found anything for doing the reverse (i.e. de-interleaving).
UPDATE
After re-reading the section from Hacker's Delight on perfect shuffles/unshuffles I found some useful examples which I adapted as follows:
// morton1 - extract even bits
uint32_t morton1(uint32_t x)
{
x = x & 0x55555555;
x = (x | (x >> 1)) & 0x33333333;
x = (x | (x >> 2)) & 0x0F0F0F0F;
x = (x | (x >> 4)) & 0x00FF00FF;
x = (x | (x >> 8)) & 0x0000FFFF;
return x;
}
// morton2 - extract odd and even bits
void morton2(uint32_t *x, uint32_t *y, uint32_t z)
{
*x = morton1(z);
*y = morton1(z >> 1);
}
I think this can still be improved on, both in its current scalar form and also by taking advantage of SIMD, so I'm still interested in better solutions (either scalar or SIMD).
If your processor handles 64 bit ints efficiently, you could combine the operations...
uint64_t w = ((uint64_t)(z & 0xAAAAAAAA) << 31) | (z & 0x55555555);
w = (w | (w >> 1)) & 0x3333333333333333;
w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F;
...
Code for the Intel Haswell and later CPUs. You can use the BMI2 instruction set which contains the pext and pdep instructions. These can (among other great things) be used to build your functions.
#include <immintrin.h>
#include <stdint.h>
// on GCC, compile with option -mbmi2, requires Haswell or better.
uint64_t xy_to_morton (uint32_t x, uint32_t y)
{
return _pdep_u64(x, 0x5555555555555555) | _pdep_u64(y, 0xaaaaaaaaaaaaaaaa);
}
void morton_to_xy (uint64_t m, uint32_t *x, uint32_t *y)
{
*x = _pext_u64(m, 0x5555555555555555);
*y = _pext_u64(m, 0xaaaaaaaaaaaaaaaa);
}
In case someone is using Morton codes in 3D, and so needs to read one bit in every 3 out of 64 bits, here is the function I used:
uint64_t morton3(uint64_t x) {
x = x & 0x9249249249249249;
x = (x | (x >> 2)) & 0x30c30c30c30c30c3;
x = (x | (x >> 4)) & 0xf00f00f00f00f00f;
x = (x | (x >> 8)) & 0x00ff0000ff0000ff;
x = (x | (x >> 16)) & 0xffff00000000ffff;
x = (x | (x >> 32)) & 0x00000000ffffffff;
return x;
}
uint64_t bits;
uint64_t x = morton3(bits);
uint64_t y = morton3(bits >> 1);
uint64_t z = morton3(bits >> 2);
You can extract 8 interleaved bits by multiplying like so:
uint8_t deinterleave_even(uint16_t x) {
return ((x & 0x5555) * 0xC00030000C0003 & 0x0600180060008001) * 0x0101010101010101 >> 56;
}
uint8_t deinterleave_odd(uint16_t x) {
return ((x & 0xAAAA) * 0xC00030000C0003 & 0x03000C003000C000) * 0x0101010101010101 >> 56;
}
It should be trivial to combine them for 32 bits or larger.
If you need speed, then you can use a table lookup to convert one byte at a time (a two-byte table is faster but too big). The procedure was written in the Delphi IDE, but the assembler/algorithm is the same.
const
MortonTableLookup : array[byte] of byte = ($00, $01, $10, $11, $12, ... ;
procedure DeinterleaveBits(Input: cardinal);
//In: eax
//Out: dx = EvenBits; ax = OddBits;
asm
movzx ecx, al //Use 0th byte
mov dl, byte ptr[MortonTableLookup + ecx]
//
shr eax, 8
movzx ecx, ah //Use 2nd byte
mov dh, byte ptr[MortonTableLookup + ecx]
//
shl edx, 16
movzx ecx, al //Use 1st byte
mov dl, byte ptr[MortonTableLookup + ecx]
//
shr eax, 8
movzx ecx, ah //Use 3rd byte
mov dh, byte ptr[MortonTableLookup + ecx]
//
mov ecx, edx
and ecx, $F0F0F0F0
mov eax, ecx
rol eax, 12
or eax, ecx
rol edx, 4
and edx, $F0F0F0F0
mov ecx, edx
rol ecx, 12
or edx, ecx
end;
I didn't want to be limited to a fixed size integer and making lists of similar commands with hardcoded constants, so I developed a C++11 solution which makes use of template metaprogramming to generate the functions and the constants. The assembly code generated with -O3 seems as tight as it can get without using BMI:
andl $0x55555555, %eax
movl %eax, %ecx
shrl %ecx
orl %eax, %ecx
andl $0x33333333, %ecx
movl %ecx, %eax
shrl $2, %eax
orl %ecx, %eax
andl $0xF0F0F0F, %eax
movl %eax, %ecx
shrl $4, %ecx
orl %eax, %ecx
movzbl %cl, %esi
shrl $8, %ecx
andl $0xFF00, %ecx
orl %ecx, %esi
TL;DR source repo and live demo.
Implementation
Basically every step in the morton1 function works by shifting and adding to a sequence of constants which look like this:
0b0101010101010101 (alternate 1 and 0)
0b0011001100110011 (alternate 2x 1 and 0)
0b0000111100001111 (alternate 4x 1 and 0)
0b0000000011111111 (alternate 8x 1 and 0)
If we were to use D dimensions, we would have a pattern with D-1 zeros and 1 one. So to generate these it's enough to generate consecutive ones and apply some bitwise or:
/// @brief Generates 0b1...1 with @tparam n ones
template <class T, unsigned n>
using n_ones = std::integral_constant<T, (~static_cast<T>(0) >> (sizeof(T) * 8 - n))>;
/// @brief Performs `@tparam input | (@tparam input << @tparam width)` @tparam repeat times.
template <class T, T input, unsigned width, unsigned repeat>
struct lshift_add :
public lshift_add<T, lshift_add<T, input, width, 1>::value, width, repeat - 1> {
};
/// @brief Specialization for 1 repetition, just does the shift-and-add operation.
template <class T, T input, unsigned width>
struct lshift_add<T, input, width, 1> : public std::integral_constant<T,
(input & n_ones<T, width>::value) | (input << (width < sizeof(T) * 8 ? width : 0))> {
};
We can now generate the constants at compile time for arbitrary dimensions with the following:
template <class T, unsigned step, unsigned dimensions = 2u>
using mask = lshift_add<T, n_ones<T, 1 << step>::value, dimensions * (1 << step), sizeof(T) * 8 / (2 << step)>;
With the same type of recursion, we can generate functions for each of the steps of the algorithm x = (x | (x >> K)) & M:
template <class T, unsigned step, unsigned dimensions>
struct deinterleave {
static T work(T input) {
input = deinterleave<T, step - 1, dimensions>::work(input);
return (input | (input >> ((dimensions - 1) * (1 << (step - 1))))) & mask<T, step, dimensions>::value;
}
};
// Omitted specialization for step 0, where there is just a bitwise and
It remains to answer the question "how many steps do we need?". This depends also on the number of dimensions. In general, k steps compute 2^k - 1 output bits; the maximum number of meaningful bits for each dimension is given by z = sizeof(T) * 8 / dimensions, therefore it is enough to take 1 + log_2 z steps. The problem is now that we need this as constexpr in order to use it as a template parameter. The best way I found to work around this is to define log2 via metaprogramming:
template <unsigned arg>
struct log2 : public std::integral_constant<unsigned, log2<(arg >> 1)>::value + 1> {};
template <>
struct log2<1u> : public std::integral_constant<unsigned, 0u> {};
/// @brief Helper constexpr which returns the number of steps needed to fully interleave a type @tparam T.
template <class T, unsigned dimensions>
using num_steps = std::integral_constant<unsigned, log2<sizeof(T) * 8 / dimensions>::value + 1>;
And finally, we can perform one single call:
/// @brief Helper function which combines @see deinterleave and @see num_steps into a single call.
template <class T, unsigned dimensions>
T deinterleave_first(T n) {
return deinterleave<T, num_steps<T, dimensions>::value - 1, dimensions>::work(n);
}

How to swap two numbers without using temp variables or arithmetic operations?

This equation swaps two numbers without a temporary variable, but uses arithmetic operations:
a = (a+b) - (b=a);
How can I do it without arithmetic operations? I was thinking about XOR.
a=a+b;
b=a-b;
a=a-b;
This is simple yet effective....
Why not use the std libs?
std::swap(a,b);
In C this should work:
a = a^b;
b = a^b;
a = a^b;
OR a cooler/geekier looking:
a^=b;
b^=a;
a^=b;
For more details look into this. XOR is a very powerful operation that has many interesting usages cropping up here and there.
The best way to swap two numbers without using any temporary storage or arithmetic operations is to load both variables into registers, and then use the registers the other way around!
You can't do that directly from C, but the compiler is probably quite capable of working it out for you (at least, if optimisation is enabled) - if you write simple, obvious code, such as that which KennyTM suggested in his comment.
e.g.
void swap_tmp(unsigned int *p)
{
unsigned int tmp;
tmp = p[0];
p[0] = p[1];
p[1] = tmp;
}
compiled with gcc 4.3.2 with the -O2 optimisation flag gives:
swap_tmp:
pushl %ebp ; (prologue)
movl %esp, %ebp ; (prologue)
movl 8(%ebp), %eax ; EAX = p
movl (%eax), %ecx ; ECX = p[0]
movl 4(%eax), %edx ; EDX = p[1]
movl %ecx, 4(%eax) ; p[1] = ECX
movl %edx, (%eax) ; p[0] = EDX
popl %ebp ; (epilogue)
ret ; (epilogue)
I haven't seen this C solution before, but I'm sure someone has thought of it. And perhaps had more posting self-control than I do.
fprintf(fopen("temp.txt", "w"), "%d", a);
a = b;
fscanf(fopen("temp.txt", "r"), "%d", &b);
No extra variables!
It works for me, but depending on the stdio implementation you may have to do something about output buffering.
Using XOR,
void swap(int &a, int &b)
{
a = a ^ b;
b = a ^ b;
a = a ^ b;
}
One liner with XOR,
void swap(int &a, int &b)
{
a ^= b ^= a ^= b;
}
These methods appear to be clean because they don't fail for any test case, but again, since (as in method 2) the value of a variable is modified twice within the same sequence point, the one-liner has undefined behavior according to ANSI C.
C++11 allows you to:
Swap values:
std::swap(a, b);
Swap ranges:
std::swap_ranges(a.begin(), a.end(), b.begin());
Create LValue tuple with tie:
std::tie(b, a) = std::make_tuple(a, b);
std::tie(c, b, a) = std::make_tuple(a, b, c);
a =((a = a + b) - (b = a - b));
In addition to the above solutions, for a case where one of the values may be out of range for a signed integer, the two variables' values can be swapped in this way:
a = a + b;
b = b - a; // b now holds -(original a)
a = a + b; // a now holds the original b
b = -b; // b now holds the original a
Multiplication and division can also be used (though this fails if either value is zero).
int x = 10, y = 5;
// Code to swap 'x' and 'y'
x = x * y; // x now becomes 50
y = x / y; // y becomes 10
x = x / y; // x becomes 5

What is fastest method to calculate a number having only bit set which is the most significant digit set in another number? [duplicate]

Closed 12 years ago.
Possible Duplicates:
Previous power of 2
Getting the Leftmost Bit
What I want is, suppose there is a number 5 i.e. 101. My answer should be 100. For 9 i.e. 1001, the answer should be 1000
You can't ask for the fastest sequence without giving constraints on the machine on which it has to run. For example, some machines support an instruction called "count leading zeroes", or have means to emulate it very quickly. If you can access this instruction (for example with gcc) then you can write:
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
uint32_t f(uint32_t x)
{
return ((uint64_t)1)<<(32-__builtin_clz(x)-1);
}
int main()
{
printf("=>%d\n",f(5));
printf("=>%d\n",f(9));
}
f(x) returns what you want (the greatest y with y<=x and y=2**n). The compiler will now generate the optimal code sequence for the target machine. For example, when compiling for a default x86_64 architecture, f() looks like this:
bsrl %edi, %edi
movl $31, %ecx
movl $1, %eax
xorl $31, %edi
subl %edi, %ecx
salq %cl, %rax
ret
You see, no loops here! 7 instructions, no branches.
But if I tell my compiler (gcc-4.5) to optimize for the machine I'm using right now (AMD Phenom-II), then this comes out for f():
bsrl %edi, %ecx
movl $1, %eax
salq %cl, %rax
ret
This is probably the fastest way to go for this machine.
EDIT: f(0) would have resulted in UB, I've fixed that (and the assembly). Also, uint32_t means that I can write 32 without feeling guilty :-)
From Hacker's Delight, a nice branchless solution:
uint32_t flp2 (uint32_t x)
{
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >> 16);
return x - (x >> 1);
}
This typically takes 12 instructions. You can do it in fewer if your CPU has a "count leading zeroes" instruction.
int input = 5;
std::size_t numBits = 0;
while(input)
{
input >>= 1;
numBits++;
}
int output = 1 << (numBits-1);
This is a task related to bit counting. Check this out.
Using 2a (my favorite of the algorithms, though not the fastest), one can come up with this:
int highest_bit_mask (unsigned int n) {
while (n) {
if (n & (n-1)) {
n &= (n-1) ;
} else {
return n;
}
}
return 0;
}
The magic of n &= (n-1); is that it removes the least significant set bit from n. (Corollary: n & (n-1) is false only when n has precisely one bit set.) The algorithm's complexity depends on the number of bits set in the input.
Check out the link anyway. It is a very amusing and enlightening read which might give you more ideas.