Producing good add with carry code from clang - c++

I'm trying to produce code (currently using clang++-3.8) that adds two numbers consisting of multiple machine words. To simplify things for the moment I'm only adding 128bit numbers, but I'd like to be able to generalise this.
First some typedefs:
typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;
And a "result" type:
struct Result
{
unsigned_word lo;
unsigned_word hi;
};
The first function, f, takes two pairs of unsigned words and returns a Result. As an intermediate step it puts each pair of 64-bit words into a 128-bit word before adding them, like so:
Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);
unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);
unsigned_128 r1 = n1 + n2;
x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);
x.hi = r1 >> 64;
return x;
}
This actually gets inlined quite nicely like so:
movq 8(%rsp), %rsi
movq (%rsp), %rbx
addq 24(%rsp), %rsi
adcq 16(%rsp), %rbx
Now, instead, I've written a simpler function using Clang's multi-precision primitives, as below:
static Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_word carryout;
x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
return x;
}
This produces the following assembly:
movq 24(%rsp), %rsi
movq (%rsp), %rbx
addq 16(%rsp), %rbx
addq 8(%rsp), %rsi
adcq $0, %rbx
In this case, there's an extra add. Instead of doing an ordinary add on the lo-words, then an adc on the hi-words, it just adds the hi-words, then adds the lo-words, then does an adc on the hi-word again with an argument of zero.
This may not look too bad, but when you try this with larger words (say 192bit, 256bit) you soon get a mess of ors and other instructions dealing with the carries up the chain, instead of a simple chain of add, adc, adc, ... adc.
The multi-precision primitives seem to be doing a terrible job at exactly what they're intended to do.
So what I'm looking for is code that I could generalise to any length (no need to do it, just enough so I can work out how to), for which clang produces additions in a manner that is as efficient as what it does with its built-in 128-bit type (which unfortunately I can't easily generalise). I presume this should just be a chain of adcs, but I'm open to arguments and code that it should be something else.

There is an intrinsic to do this: _addcarry_u64. However, only Visual Studio and ICC (at least VS 2013 and 2015 and ICC 13 and ICC 15) do this efficiently. Clang 3.7 and GCC 5.2 still don't produce efficient code with this intrinsic.
Clang in addition has a built-in which one would think does this, __builtin_addcll, but it does not produce efficient code either.
The reason Visual Studio does this efficiently is that it does not allow inline assembly in 64-bit mode, so the compiler should provide a way to do this with an intrinsic (though Microsoft took their time implementing it).
Therefore, with Visual Studio use _addcarry_u64. With ICC use _addcarry_u64 or inline assembly. With Clang and GCC use inline assembly.
Note that since the Broadwell microarchitecture there are two new instructions, adcx and adox, which you can access with the _addcarryx_u64 intrinsic. Intel's documentation for these intrinsics used to differ from the assembly produced by the compiler, but it appears their documentation is correct now. However, Visual Studio still only appears to produce adcx with _addcarryx_u64, whereas ICC produces both adcx and adox with this intrinsic. But even though ICC produces both instructions, it does not produce the most optimal code (ICC 15), so inline assembly is still necessary.
Personally, I think the fact that a non-standard feature of C/C++, such as inline assembly or intrinsics, is required to do this is a weakness of C/C++ but others might disagree. The adc instruction has been in the x86 instruction set since 1979. I would not hold my breath on C/C++ compilers being able to optimally figure out when you want adc. Sure they can have built-in types such as __int128 but the moment you want a larger type that's not built-in you have to use some non-standard C/C++ feature such as inline assembly or intrinsics.
In terms of inline assembly code to do this, I already posted a solution for 256-bit addition of eight 64-bit integers in registers at multi-word addition using the carry flag.
Here is that code reposted.
#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4) \
__asm__ __volatile__ ( \
"addq %[v1], %[u1] \n" \
"adcq %[v2], %[u2] \n" \
"adcq %[v3], %[u3] \n" \
"adcq %[v4], %[u4] \n" \
: [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4) \
: [v1] "r" (Y1), [v2] "r" (Y2), [v3] "r" (Y3), [v4] "r" (Y4))
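For reference, a minimal usage sketch (assuming the ADD256 macro above is in scope, an x86-64 target, and limbs passed least-significant first):
#include <stdint.h>
#include <stdio.h>

int main()
{
    // x = 2^64 - 1, y = 1: the add should carry out of the low limb into x2.
    uint64_t x1 = ~0ULL, x2 = 0, x3 = 0, x4 = 0;
    uint64_t y1 = 1,     y2 = 0, y3 = 0, y4 = 0;
    ADD256(x1, x2, x3, x4, y1, y2, y3, y4);   // x += y
    printf("%llu %llu %llu %llu\n",           // expect: 0 1 0 0
           (unsigned long long)x1, (unsigned long long)x2,
           (unsigned long long)x3, (unsigned long long)x4);
    return 0;
}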
If you want to explicitly load the values from memory you can do something like this
//uint64_t dst[4] = {1,1,1,1};
//uint64_t src[4] = {1,2,3,4};
asm (
"movq (%[in]), %%rax\n"
"addq %%rax, %[out]\n"
"movq 8(%[in]), %%rax\n"
"adcq %%rax, 8%[out]\n"
"movq 16(%[in]), %%rax\n"
"adcq %%rax, 16%[out]\n"
"movq 24(%[in]), %%rax\n"
"adcq %%rax, 24%[out]\n"
: [out] "=m" (dst)
: [in]"r" (src)
: "%rax"
);
That produces nearly identical assembly to the following function in ICC
void add256(uint256 *x, uint256 *y) {
unsigned char c = 0;
c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
_addcarry_u64(c, x->x4, y->x4, &x->x4);
}
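The uint256 type isn't shown; presumably it is just four 64-bit limbs, something along these lines (an assumed layout, with x1 as the least significant limb):
#include <stdint.h>

struct uint256 {
    uint64_t x1;   // least significant 64 bits
    uint64_t x2;
    uint64_t x3;
    uint64_t x4;   // most significant 64 bits
};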
I have limited experience with GCC inline assembly (or inline assembly in general - I usually use an assembler such as NASM) so maybe there are better inline assembly solutions.
So what I'm looking for is code that I could generalize to any length
To answer this question, here is another solution using template metaprogramming. I used this same trick for loop unrolling. This produces optimal code with ICC. If Clang or GCC ever implement _addcarry_u64 efficiently, this would be a good general solution.
#include <x86intrin.h>
#include <inttypes.h>
#define LEN 4 // LEN = number of 64-bit words: 4 = 256-bit add, 3 = 192-bit add, ...
static unsigned char c = 0;
template<int START, int N>
struct Repeat {
static void add (uint64_t *x, uint64_t *y) {
c = _addcarry_u64(c, x[START], y[START], &x[START]);
Repeat<START+1, N>::add(x,y);
}
};
template<int N>
struct Repeat<LEN, N> {
static void add (uint64_t *x, uint64_t *y) {}
};
void sum_unroll(uint64_t *x, uint64_t *y) {
Repeat<0,LEN>::add(x,y);
}
Assembly from ICC
xorl %r10d, %r10d #12.13
movzbl c(%rip), %eax #12.13
cmpl %eax, %r10d #12.13
movq (%rsi), %rdx #12.13
adcq %rdx, (%rdi) #12.13
movq 8(%rsi), %rcx #12.13
adcq %rcx, 8(%rdi) #12.13
movq 16(%rsi), %r8 #12.13
adcq %r8, 16(%rdi) #12.13
movq 24(%rsi), %r9 #12.13
adcq %r9, 24(%rdi) #12.13
setb %r10b
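A small driver for sum_unroll, with hypothetical values chosen so the carry has to ripple through every limb (it assumes the code above is in the same translation unit), might look like this:
#include <stdio.h>

int main()
{
    // x = 2^192 - 1, y = 1: the carry must ripple all the way into x[3].
    uint64_t x[LEN] = { ~0ULL, ~0ULL, ~0ULL, 0 };
    uint64_t y[LEN] = { 1, 0, 0, 0 };
    sum_unroll(x, y);                      // x += y
    printf("%llu %llu %llu %llu\n",        // expect: 0 0 0 1
           (unsigned long long)x[0], (unsigned long long)x[1],
           (unsigned long long)x[2], (unsigned long long)x[3]);
    return 0;
}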
Metaprogramming is a basic feature of assemblers, so it's too bad C and C++ have no solution for this either, except through template metaprogramming hacks (the D language does).
The inline assembly I used above which referenced memory was causing some problems in a function. Here is a new version which seems to work better
void foo(uint64_t *dst, uint64_t *src)
{
__asm (
"movq (%[in]), %%rax\n"
"addq %%rax, (%[out])\n"
"movq 8(%[in]), %%rax\n"
"adcq %%rax, 8(%[out])\n"
"movq 16(%[in]), %%rax\n"
"adcq %%rax, 16(%[out])\n"
"movq 24(%[in]), %%rax\n"
"adcq %%rax, 24(%[out])\n"
:
: [in] "r" (src), [out] "r" (dst)
: "%rax", "memory", "cc"
);
}

On Clang 6, both __builtin_addcll and __builtin_add_overflow produce the same, optimal assembly.
Result g(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_word carryout;
x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
return x;
}
Result h(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_word carryout;
carryout = __builtin_add_overflow(lo1, lo2, &x.lo);
carryout = __builtin_add_overflow(hi1, carryout, &hi1);
__builtin_add_overflow(hi1, hi2, &x.hi);
return x;
}
Assembly for both:
add rdi, rdx
adc rsi, rcx
mov rax, rdi
mov rdx, rsi
ret

Starting with clang 5.0 it is possible to get good results using __uint128_t-addition and getting the carry bit by shifting:
inline uint64_t add_with_carry(uint64_t &a, const uint64_t &b, const uint64_t &c)
{
__uint128_t s = __uint128_t(a) + b + c;
a = s;
return s >> 64;
}
In many situations clang still does strange operations (I assume because of possible aliasing?), but usually copying one variable into a temporary helps.
Usage examples with
template<int size> struct LongInt
{
uint64_t data[size];
};
Manual usage:
void test(LongInt<3> &a, const LongInt<3> &b_)
{
const LongInt<3> b = b_; // need to copy b_ into local temporary
uint64_t c0 = add_with_carry(a.data[0], b.data[0], 0);
uint64_t c1 = add_with_carry(a.data[1], b.data[1], c0);
uint64_t c2 = add_with_carry(a.data[2], b.data[2], c1);
}
Generic solution:
template<int size>
void addTo(LongInt<size> &a, const LongInt<size> b)
{
__uint128_t c = __uint128_t(a.data[0]) + b.data[0];
a.data[0] = c;
for(int i=1; i<size; ++i)
{
c = __uint128_t(a.data[i]) + b.data[i] + (c >> 64);
a.data[i] = c;
}
}
Godbolt Link: All examples above are compiled to only mov, add and adc instructions (starting with clang 5.0, and at least -O2).
The examples don't produce good code with gcc (up to 8.1, which at the moment is the highest version on godbolt).
And I did not yet manage to get anything usable with __builtin_addcll ...
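For what it's worth, a small test driver for the addTo template above, with hypothetical values chosen so the carry has to propagate through every limb, could look like:
#include <stdio.h>
#include <stdint.h>

int main()
{
    // a = 2^128 - 1, b = 1: the carry must propagate into data[2].
    LongInt<3> a = {{ ~0ULL, ~0ULL, 0 }};
    LongInt<3> b = {{ 1, 0, 0 }};
    addTo(a, b);
    printf("%llu %llu %llu\n",             // expect: 0 0 1
           (unsigned long long)a.data[0],
           (unsigned long long)a.data[1],
           (unsigned long long)a.data[2]);
    return 0;
}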

The code using __builtin_addcll is fully optimized by Clang since version 10, for chains of at least 3 (which require an adc with variable carry-in that also produces a carry-out). Godbolt shows clang 9 making a mess of setc/movzx for that case.
Clang 6 and later handle it well for the much easier case of chains of 2, as shown in @zneak's answer, where no carry-out from an adc is needed.
The idiomatic code without builtins is good too. Moreover, it works with every compiler and is also fully optimized by GCC 5+ for chains of 2 (add/adc, without using the carry-out from the adc). It's tricky to write correct C that generates carry-out when there's carry-in, so this doesn't extend easily.
Result h (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
unsigned_word lo = lo1 + lo2;
bool carry = lo < lo1;
unsigned_word hi = hi1 + hi2 + carry;
return Result{lo, hi};
}
https://godbolt.org/z/ThxGj1WGK
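To illustrate why extending this is tricky: a plain-C limb addition with both carry-in and carry-out can be sketched as below; whether a given compiler folds it into a single adc is not guaranteed and varies by version.
#include <stdint.h>

// One limb of a wider add: returns a + b + carry_in and writes the carry-out.
// The two "<" tests can never both be true at once, so OR-ing them is safe.
static inline uint64_t add_limb(uint64_t a, uint64_t b,
                                unsigned carry_in, unsigned *carry_out)
{
    uint64_t s = a + b;
    unsigned c1 = (s < a);          // carry from a + b
    s += carry_in;
    unsigned c2 = (s < carry_in);   // carry from adding the carry-in
    *carry_out = c1 | c2;
    return s;
}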

Explanation for GCC compiler optimisation's adverse performance effect?

Please note: this question is neither about code quality and ways to improve the code, nor about the (in)significance of the runtime differences. It is about GCC, and about which compiler optimisation costs performance and why.
The program
The following code counts the number of Fibonacci primes up to m:
int main() {
unsigned int m = 500000000u;
unsigned int i = 0u;
unsigned int a = 1u;
unsigned int b = 1u;
unsigned int c = 1u;
unsigned int count = 0u;
while (a + b <= m) {
for (i = 2u; i < a + b; ++i) {
c = (a + b) % i;
if (c == 0u) {
i = a + b;
// break;
}
}
if (c != 0u) {
count = count + 1u;
}
a = a + b;
b = a - b;
}
return count; // Just to "output" (and thus use) count
}
When compiled with g++.exe (Rev2, Built by MSYS2 project) 9.2.0 and no optimisations (-O0), the resulting binary executes (on my machine) in 1.9s. With -O1 and -O3 it takes 3.3s and 1.7s, respectively.
I've tried to make sense of the resulting binaries by looking at the assembly code (godbolt.org) and the corresponding control-flow graph (hex-rays.com/products/ida), but my assembler skills don't suffice.
Additional observations
An explicit break in the innermost if makes the -O1 code fast again:
if (c == 0u) {
i = a + b; // Not actually needed any more
break;
}
As does "inlining" the loop's progress expression:
for (i = 2u; i < a + b; ) { // No ++i any more
c = (a + b) % i;
if (c == 0u) {
i = a + b;
++i;
} else {
++i;
}
}
Questions
Which optimisation does/could explain the performance drop?
Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
The important thing at play here is the loop-carried data dependency.
Look at machine code of the slow variant of the innermost loop. I'm showing -O2 assembly here, -O1 is less optimized, but has similar data dependencies overall:
.L4:
xorl %edx, %edx
movl %esi, %eax
divl %ecx
testl %edx, %edx
cmove %esi, %ecx
addl $1, %ecx
cmpl %ecx, %esi
ja .L4
See how the increment of the loop counter in %ecx depends on the previous instruction (the cmov), which in turn depends on the result of the division, which in turn depends on the previous value of loop counter.
Effectively there is a chain of data dependencies on computing the value in %ecx that spans the entire loop, and since the time to execute the loop dominates, the time to compute that chain decides the execution time of the program.
Adjusting the program to compute the number of divisions reveals that it executes 434044698 div instructions. Dividing the number of machine cycles taken by the program by this number gives 26 cycles in my case, which corresponds closely to latency of the div instruction plus about 3 or 4 cycles from the other instructions in the chain (the chain is div-test-cmov-add).
In contrast, the -O3 code does not have this chain of dependencies, making it throughput-bound rather than latency-bound: the time to execute the -O3 variant is determined by the time to compute 434044698 independent div instructions.
Finally, to give specific answers to your questions:
1. Which optimisation does/could explain the performance drop?
As another answer mentioned, this is if-conversion creating a loop-carried data dependency where originally there was a control dependency. Control dependencies may be costly too, when they correspond to unpredictable branches, but in this case the branch is easy to predict.
2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Perhaps you can imagine the optimization transforming the code to
for (i = 2u; i < a + b; ++i) {
c = (a + b) % i;
i = (c != 0) ? i : a + b;
}
Where the ternary operator is evaluated on the CPU such that new value of i is not known until c has been computed.
3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
In those variants the code is not eligible for if-conversion, so the problematic data dependency is not introduced.
I think the problem is the -fif-conversion pass, which instructs the compiler to use a CMOV instead of TEST/JZ for some comparisons. And CMOV is known for not being so great in the general case.
There are two points in the disassembly, that I know of, affected by this flag:
First, the if (c == 0u) { i = a + b; } in line 13 is compiled to:
test edx,edx //edx is c
cmove ecx,esi //esi is (a + b), ecx is i
Second, the if (c != 0u) { count = count + 1u; } is compiled to
cmp eax,0x1 //eax is c
sbb r8d,0xffffffff //r8d is count, but what???
Nice trick! It is subtracting -1 from count, but with borrow (the carry flag), and the carry is only set if c is less than 1, which for an unsigned value means 0. Thus, if eax is 0 it subtracts -1 from count but then subtracts 1 again: count does not change. If eax is not 0, then it only subtracts -1, which increments the variable.
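In C terms, the cmp/sbb pair computes roughly the following (a paraphrase, not compiler output):
// cmp eax, 1 sets the borrow flag to (c < 1), i.e. to (c == 0) for unsigned c;
// sbb count, -1 then computes count - (-1) - borrow = count + 1 - (c == 0).
unsigned int bump_count(unsigned int count, unsigned int c)
{
    return count + 1u - (c == 0u);   // same as count + (c != 0u)
}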
Now, this avoids branches, but at the cost of missing the obvious optimization that if c == 0u you could jump directly to the next while iteration. This one is so easy that it is even done in -O0.
I believe this is caused by the "conditional move" instruction (CMOVEcc) that the compiler generates to replace branching when using -O1 and -O2.
When using -O0, the statement if (c == 0u) is compiled to a jump:
cmp DWORD PTR [rbp-16], 0
jne .L4
With -O1 and -O2:
test edx, edx
cmove ecx, esi
while -O3 produces a jump (similar to -O0):
test edx, edx
je .L5
There is a known bug in gcc where "using conditional moves instead of compare and branch result in almost 2x slower code"
As rodrigo suggested in his comment, using the flag -fno-if-conversion tells gcc not to replace branching with conditional moves, hence preventing this performance issue.

On a 64 bit machine, can I safely operate on individual bytes of a 64 bit quadword in parallel?

Background
I am doing parallel operations on rows and columns in images. My images are 8 bit or 16 bit pixels and I'm on a 64 bit machine.
When I do operations on columns in parallel, two adjacent columns may share the same 32 bit int or 64 bit long. Basically, I want to know whether I can safely operate on individual bytes of the same quadword in parallel.
Minimal Test
I wrote a minimal test function that I have not been able to make fail. For each byte in a 64 bit long, I concurrently perform successive multiplications in a finite field of order p. I know that by Fermat's little theorem a^(p-1) = 1 mod p when p is prime. I vary the values a and p for each of my 8 threads, and I perform k*(p-1) multiplications of a. When the threads finish each byte should be 1. And in fact, my test cases pass. Each time I run, I get the following output:
8
101010101010101
101010101010101
My system is Linux 4.13.0-041300-generic x86_64 with an 8 core Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz. I compiled with g++ 7.2.0 -O2 and examined the assembly. I added the assembly for the "INNER LOOP" and commented it. It seems to me that the code generated is safe because the stores are only writing the lower 8 bits to the destination instead of doing some bitwise arithmetic and storing to the entire word or quadword. g++ -O3 generated similar code.
Question:
I want to know if this code is always thread-safe, and if not, in what conditions would it not be. Maybe I am being very paranoid, but I feel that I would need to operate on quadwords at a time in order to be safe.
#include <iostream>
#include <pthread.h>
class FermatLTParams
{
public:
FermatLTParams(unsigned char *_dst, unsigned int _p, unsigned int _a, unsigned int _k)
: dst(_dst), p(_p), a(_a), k(_k) {}
unsigned char *dst;
unsigned int p, a, k;
};
void *PerformFermatLT(void *_p)
{
unsigned int j, i;
FermatLTParams *p = reinterpret_cast<FermatLTParams *>(_p);
for(j=0; j < p->k; ++j)
{
//a^(p-1) == 1 mod p
//...BEGIN INNER LOOP
for(i=1; i < p->p; ++i)
{
p->dst[0] = (unsigned char)(p->dst[0]*p->a % p->p);
}
//...END INNER LOOP
/* gcc 7.2.0 -O2 (INNER LOOP)
.L4:
movq (%rdi), %r8 # r8 = dst
xorl %edx, %edx # edx = 0
addl $1, %esi # ++i
movzbl (%r8), %eax # eax (lower 8 bits) = dst[0]
imull 12(%rdi), %eax # eax = a * eax
divl %ecx # eax = eax / ecx; edx = eax % ecx
movb %dl, (%r8) # dst[0] = edx (lower 8 bits)
movl 8(%rdi), %ecx # ecx = p
cmpl %esi, %ecx # if (i < p)
ja .L4 # goto L4
*/
}
return NULL;
}
int main(int argc, const char **argv)
{
int i;
unsigned long val = 0x0101010101010101; //a^0 = 1
unsigned int k = 10000000;
std::cout << sizeof(val) << std::endl;
std::cout << std::hex << val << std::endl;
unsigned char *dst = reinterpret_cast<unsigned char *>(&val);
pthread_t threads[8];
FermatLTParams params[8] =
{
FermatLTParams(dst+0, 11, 5, k),
FermatLTParams(dst+1, 17, 8, k),
FermatLTParams(dst+2, 43, 3, k),
FermatLTParams(dst+3, 31, 4, k),
FermatLTParams(dst+4, 13, 3, k),
FermatLTParams(dst+5, 7, 2, k),
FermatLTParams(dst+6, 11, 10, k),
FermatLTParams(dst+7, 13, 11, k)
};
for(i=0; i < 8; ++i)
{
pthread_create(threads+i, NULL, PerformFermatLT, params+i);
}
for(i=0; i < 8; ++i)
{
pthread_join(threads[i], NULL);
}
std::cout << std::hex << val << std::endl;
return 0;
}
The answer is YES, you can safely operate on individual bytes of a 64-bit quadword in parallel, by different threads.
It is amazing that it works, but it would be a disaster if it did not. All hardware acts as if a core writing a byte in its own cache marks not just that the cache line is dirty, but which bytes within it. When that cache line (64 or 128 or even 256 bytes) eventually gets written to main memory, only the dirty bytes actually modify the main memory. This is essential, because otherwise when two threads were working on independent data that happened to occupy the same cache line, they would trash each other's results.
This can be bad for performance, because the way it works is partly through the magic of "cache coherency," where when one thread writes a byte all the caches in the system that have that same line of data are affected. If they're dirty, they need to write to main memory, and then either drop the cache line, or capture the changes from the other thread. There are all kinds of different implementations, but it is generally expensive.
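On that performance point, the usual mitigation is to give each thread's byte its own cache line, for example by padding (a sketch; 64 bytes is an assumed line size):
// Padding each slot to a full (assumed 64-byte) cache line keeps the threads'
// writes on separate lines, so they no longer contend.
struct alignas(64) PaddedByte {
    unsigned char value;
};
static_assert(sizeof(PaddedByte) == 64, "assumes a 64-byte cache line");

PaddedByte results[8];   // one independently-cached slot per worker thread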

Store four 16bit integers with SSE intrinsics

I multiply and round four 32-bit floats, then convert them to four 16-bit integers with SSE intrinsics. I'd like to store the four integer results to an array. With floats it's easy: _mm_store_ps(float_ptr, m128value). However, I haven't found any instruction to do this with 16-bit (__m64) integers.
void process(float *fptr, int16_t *sptr, __m128 factor)
{
__m128 a = _mm_load_ps(fptr);
__m128 b = _mm_mul_ps(a, factor);
__m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
__m64 s =_mm_cvtps_pi16(c);
// now store the values to sptr
}
Any help would be appreciated.
Personally, I would avoid using MMX. Also, I would use an explicit store rather than an implicit one, which often only works on certain compilers. The following code works fine in MSVC 2012 with SSE 4.1.
Note that fptr needs to be 16-byte aligned. This is not a problem if you compile in 64-bit mode but in 32-bit mode you should make sure it's aligned.
#include <stdio.h>
#include <stdint.h>
#include <smmintrin.h>
void process(float *fptr, int16_t *sptr, __m128 factor)
{
__m128 a = _mm_load_ps(fptr);
__m128 b = _mm_mul_ps(a, factor);
__m128i c = _mm_cvttps_epi32(b);
__m128i d = _mm_packs_epi32(c,c);
_mm_storel_epi64((__m128i*)sptr, d);
}
int main() {
float x[] = {1.0, 2.0, 3.0, 4.0};
int16_t y[4];
__m128 factor = _mm_set1_ps(3.14159f);
process(x, y, factor);
printf("%d %d %d %d\n", y[0], y[1], y[2], y[3]);
}
Note that _mm_cvtps_pi16 is not a simple intrinsic; the Intel Intrinsics Guide says "This intrinsic creates a sequence of two or more instructions, and may perform worse than a native instruction. Consider the performance impact of this intrinsic."
Here is the assembly output using the MMX version
mulps (%rdi), %xmm0
roundps $0, %xmm0, %xmm0
movaps %xmm0, %xmm1
cvtps2pi %xmm0, %mm0
movhlps %xmm0, %xmm1
cvtps2pi %xmm1, %mm1
packssdw %mm1, %mm0
movq %mm0, (%rsi)
ret
Here is the assembly output using the SSE-only version
mulps (%rdi), %xmm0
cvttps2dq %xmm0, %xmm0
packssdw %xmm0, %xmm0
movq %xmm0, (%rsi)
ret
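One caveat: the SSE-only version truncates (cvttps2dq) where the question rounded to nearest. If that matters, a sketch that keeps round-to-nearest using only SSE2 could look like this; cvtps2dq rounds according to MXCSR, which defaults to round-to-nearest:
#include <emmintrin.h>   // SSE2
#include <stdint.h>

// Same store strategy as above, but rounding to nearest instead of truncating.
void process_nearest(float *fptr, int16_t *sptr, __m128 factor)
{
    __m128  a = _mm_load_ps(fptr);            // fptr must be 16-byte aligned
    __m128  b = _mm_mul_ps(a, factor);
    __m128i c = _mm_cvtps_epi32(b);           // round to nearest 32-bit ints (per MXCSR)
    __m128i d = _mm_packs_epi32(c, c);        // signed saturation down to 16 bits
    _mm_storel_epi64((__m128i*)sptr, d);      // store the low four int16 values
}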
With __m64 types, you can just cast the destination pointer appropriately:
void process(float *fptr, int16_t *sptr, __m128 factor)
{
__m128 a = _mm_load_ps(fptr);
__m128 b = _mm_mul_ps(a, factor);
__m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
__m64 s =_mm_cvtps_pi16(c);
*((__m64 *) sptr) = s;
}
There is no distinction between aligned and unaligned stores with MMX instructions like there is with SSE/AVX; therefore, you don't need the intrinsics to perform a store.
I think you're safe moving that to a general 64-bit register (long long will work for both Linux LP64 and Windows LLP64) and copying it yourself.
From what I read in xmmintrin.h, gcc will handle the cast perfectly fine from __m64 to a long long.
To be sure, you can use _mm_cvtsi64_si64x.
int16_t *f = sptr; // the destination array from the question
long long b = _mm_cvtsi64_si64x(s);
f[0] = b & 0xFFFF; // lane 0 is in the low 16 bits
f[1] = (b >> 16) & 0xFFFF;
f[2] = (b >> 32) & 0xFFFF;
f[3] = (b >> 48) & 0xFFFF;
You could type-pun that with a union to make it look better, but I guess that would fall into undefined behavior.

What is fastest method to calculate a number having only bit set which is the most significant digit set in another number? [duplicate]

Possible Duplicates:
Previous power of 2
Getting the Leftmost Bit
What I want is, suppose there is a number 5 i.e. 101. My answer should be 100. For 9 i.e. 1001, the answer should be 1000
You can't ask for the fastest sequence without giving constraints on the machine on which this has to run. For example, some machines support an instruction called "count leading zeroes" or have means to emulate it very quickly. If you can access this instruction (for example with gcc) then you can write:
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
uint32_t f(uint32_t x)
{
return ((uint64_t)1)<<(32-__builtin_clz(x)-1);
}
int main()
{
printf("=>%d\n",f(5));
printf("=>%d\n",f(9));
}
f(x) returns what you want (the greatest y with y<=x and y=2**n). The compiler will now generate the optimal code sequence for the target machine. For example, when compiling for a default x86_64 architecture, f() looks like this:
bsrl %edi, %edi
movl $31, %ecx
movl $1, %eax
xorl $31, %edi
subl %edi, %ecx
salq %cl, %rax
ret
You see, no loops here! 7 instructions, no branches.
But if I tell my compiler (gcc-4.5) to optimize for the machine I'm using right now (AMD Phenom-II), then this comes out for f():
bsrl %edi, %ecx
movl $1, %eax
salq %cl, %rax
ret
This is probably the fastest way to go for this machine.
EDIT: f(0) would have resulted in UB, I've fixed that (and the assembly). Also, uint32_t means that I can write 32 without feeling guilty :-)
From Hacker's Delight, a nice branchless solution:
uint32_t flp2 (uint32_t x)
{
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >> 16);
return x - (x >> 1);
}
This typically takes 12 instructions. You can do it in fewer if your CPU has a "count leading zeroes" instruction.
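A quick sanity check against the question's examples (assuming the flp2 function above):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    printf("%u %u\n", flp2(5u), flp2(9u));   /* expected output: 4 8 */
    return 0;
}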
int input = 5;
std::size_t numBits = 0;
while(input)
{
input >>= 1;
numBits++;
}
int output = 1 << (numBits-1);
This is a task related to bit counting. Check this out.
Using algorithm 2a (which is my favorite of the algorithms; not the fastest), one can come up with this:
int highest_bit_mask (unsigned int n) {
while (n) {
if (n & (n-1)) {
n &= (n-1) ;
} else {
return n;
}
}
return 0;
}
The magic of n &= (n-1); is that it clears the least significant set bit of n. (Corollary: n & (n-1) is zero only when n has at most one bit set.) The algorithm's complexity depends on the number of bits set in the input.
Check out the link anyway. It is a very amusing and enlightening read which might give you more ideas.

Branchless code that maps zero, negative, and positive to 0, 1, 2

Write a branchless function that returns 0, 1, or 2 if the difference between two signed integers is zero, negative, or positive.
Here's a version with branching:
int Compare(int x, int y)
{
int diff = x - y;
if (diff == 0)
return 0;
else if (diff < 0)
return 1;
else
return 2;
}
Here's a version that may be faster depending on compiler and processor:
int Compare(int x, int y)
{
int diff = x - y;
return diff == 0 ? 0 : (diff < 0 ? 1 : 2);
}
Can you come up with a faster one without branches?
SUMMARY
The 10 solutions I benchmarked had similar performance. The actual numbers and winner varied depending on compiler (icc/gcc), compiler options (e.g., -O3, -march=nocona, -fast, -xHost), and machine. Canon's solution performed well in many benchmark runs, but again the performance advantage was slight. I was surprised that in some cases some solutions were slower than the naive solution with branches.
Branchless (at the language level) code that maps negative to -1, zero to 0 and positive to +1 looks as follows
int c = (n > 0) - (n < 0);
if you need a different mapping you can simply use an explicit map to remap it
const int MAP[] = { 1, 0, 2 };
int c = MAP[(n > 0) - (n < 0) + 1];
or, for the requested mapping, use some numerical trick like
int c = 2 * (n > 0) + (n < 0);
(It is obviously very easy to generate any mapping from this as long as 0 is mapped to 0. And the code is quite readable. If 0 is mapped to something else, it becomes more tricky and less readable.)
As an additional note: comparing two integers by subtracting one from another at the C language level is a flawed technique, since it is generally prone to overflow. The beauty of the above methods is that they can immediately be used for "subtractionless" comparisons, like
int c = 2 * (x > y) + (x < y);
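A quick check of the requested {0, 1, 2} mapping (a small sketch wrapping the expression above):
#include <stdio.h>

static int compare(int x, int y)
{
    return 2 * (x > y) + (x < y);   // 0 if equal, 1 if x < y, 2 if x > y
}

int main()
{
    printf("%d %d %d\n", compare(3, 3), compare(1, 5), compare(5, 1));  // 0 1 2
    return 0;
}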
int Compare(int x, int y) {
return (x < y) + ((y < x) << 1);
}
Edit: Bitwise only? Guess < and > don't count, then?
int Compare(int x, int y) {
int diff = x - y;
return (!!diff) | (!!(diff & 0x80000000) << 1);
}
But there's that pesky -.
Edit: Shift the other way around.
Meh, just to try again:
int Compare(int x, int y) {
int diff = y - x;
return (!!diff) << ((diff >> 31) & 1);
}
But I'm guessing there's no standard ASM instruction for !!. Also, the << can be replaced with +, depending on which is faster...
Bit twiddling is fun!
Hmm, I just learned about setnz.
I haven't checked the assembler output (but I did test it a bit this time), and with a bit of luck it could save a whole instruction!:
IN THEORY. MY ASSEMBLER IS RUSTY
subl %edi, %esi
setnz %eax
sarl $31, %esi
andl $1, %esi
sarl %eax, %esi
mov %esi, %eax
ret
Rambling is fun.
I need sleep.
Assuming 2s complement, arithmetic right shift, and no overflow in the subtraction:
#include <limits.h>
#define SHIFT (CHAR_BIT*sizeof(int) - 1)
int Compare(int x, int y)
{
int diff = x - y;
return -(diff >> SHIFT) - (((-diff) >> SHIFT) << 1);
}
Two's complement:
#include <limits.h>
#define INT_BITS (CHAR_BIT * sizeof (int))
int Compare(int x, int y) {
int d = y - x;
int p = (d + INT_MAX) >> (INT_BITS - 1);
d = d >> (INT_BITS - 2);
return (d & 2) + (p & 1);
}
Assuming a sane compiler, this will not invoke the comparison hardware of your system, nor is it using a comparison in the language. To verify: if x == y then d and p will clearly be 0, so the final result will be zero. If x < y, then d = (y - x) is positive and (d + INT_MAX) wraps around and sets the high bit of the integer; otherwise the high bit stays unset. So p has its lowest bit set if and only if x < y, contributing the 1. If x > y, then d is negative and its high bit is set, so after the shift by (INT_BITS - 2) its second-lowest bit is set and (d & 2) contributes the 2.
Unsigned Comparison that returns -1,0,1 (cmpu) is one of the cases that is tested for by the GNU SuperOptimizer.
cmpu: compare (unsigned)
int cmpu(unsigned_word v0, unsigned_word v1)
{
return ( (v0 > v1) ? 1 : ( (v0 < v1) ? -1 : 0) );
}
A SuperOptimizer exhaustively searches the instruction space for the best possible combination of instructions that will implement a given function. It is suggested that compilers automagically replace the functions above by their superoptimized versions (although not all compilers do this). For example, in the PowerPC Compiler Writer's Guide (powerpc-cwg.pdf), the cmpu function is shown as this in Appendix D pg 204:
cmpu: compare (unsigned)
PowerPC SuperOptimized Version
subf R5,R4,R3
subfc R6,R3,R4
subfe R7,R4,R3
subfe R8,R7,R5
That's pretty good isn't it... just four subtracts (and with carry and/or extended versions). Not to mention it is genuinely branchfree at the machine opcode level. There is probably a PC / Intel X86 equivalent sequence that is similarly short since the GNU Superoptimizer runs for X86 as well as PowerPC.
Note that Unsigned Comparison (cmpu) can be turned into Signed Comparison (cmps) on a 32-bit compare by adding 0x80000000 to both Signed inputs before passing it to cmpu.
cmps: compare (signed)
int cmps(signed_word v0, signed_word v1)
{
unsigned_word offset = 0x80000000u;
return cmpu((unsigned_word)v0 + offset, (unsigned_word)v1 + offset);
}
This is just one option though... the SuperOptimizer may find a cmps that is shorter and does not have to add offsets and call cmpu.
To get the version that you requested that returns your values of {1,0,2} rather than {-1,0,1} use the following code which takes advantage of the SuperOptimized cmps function.
int Compare(int x, int y)
{
static const int retvals[]={1,0,2};
return (retvals[cmps(x,y)+1]);
}
I'm siding with Tordek's original answer:
int compare(int x, int y) {
return (x < y) + 2*(y < x);
}
Compiling with gcc -O3 -march=pentium4 results in branch-free code that uses conditional instructions setg and setl (see this explanation of x86 instructions).
push %ebp
mov %esp,%ebp
mov %eax,%ecx
xor %eax,%eax
cmp %edx,%ecx
setg %al
add %eax,%eax
cmp %edx,%ecx
setl %dl
movzbl %dl,%edx
add %edx,%eax
pop %ebp
ret
Good god, this has haunted me.
Whatever, I think I squeezed out a last drop of performance:
int compare(int a, int b) {
return (a != b) << (a > b);
}
Although, compiling with -O3 in GCC will give (bear with me I'm doing it from memory)
xorl %eax, %eax
cmpl %esi, %edi
setne %al
cmpl %esi, %edi
setgt %dl
sall %dl, %eax
ret
But the second comparison seems (according to a tiny bit of testing; I suck at ASM) to be redundant, leaving the small and beautiful
xorl %eax, %eax
cmpl %esi, %edi
setne %al
setgt %dl
sall %dl, %eax
ret
(Sall may totally not be an ASM instruction, but I don't remember exactly)
So... if you don't mind running your benchmark once more, I'd love to hear the results (mine gave a 3% improvement, but it may be wrong).
Combining Stephen Canon and Tordek's answers:
int Compare(int x, int y)
{
int diff = x - y;
return -(diff >> 31) + (2 & (-diff >> 30));
}
Yields: (g++ -O3)
subl %esi,%edi
movl %edi,%eax
sarl $31,%edi
negl %eax
sarl $30,%eax
andl $2,%eax
subl %edi,%eax
ret
Tight! However, Paul Hsieh's version has even fewer instructions:
subl %esi,%edi
leal 0x7fffffff(%rdi),%eax
sarl $30,%edi
andl $2,%edi
shrl $31,%eax
leal (%rdi,%rax,1),%eax
ret
int Compare(int x, int y)
{
int diff = x - y;
int absdiff = 0x7fffffff & diff; // diff with sign bit 0
int absdiff_not_zero = (int) (0 != absdiff);
return
(absdiff_not_zero << 1) // 2 iff abs(diff) > 0
-
((0x80000000 & diff) >> 31); // 1 iff diff < 0
}
For 32-bit signed integers (like in Java), try:
return 2 - ((((x >> 30) & 2) + (((x-1) >> 30) & 2))) >> 1;
where (x >> 30) & 2 returns 2 for negative numbers and 0 otherwise.
x would be the difference of the two input integers
The basic C answer is:
int v; // find the absolute value of v
unsigned int r; // the result goes here
int const mask = v >> (sizeof(int) * CHAR_BIT - 1); // CHAR_BIT comes from <limits.h>
r = (v + mask) ^ mask;
Also:
r = (v ^ mask) - mask;
The value of sizeof(int) is often 4 and CHAR_BIT is often 8.