Char subtraction with clipped underflow - C++

Is there a way to compute:
/** clipped(a - b) **/
unsigned char clipped_substract(unsigned char a, unsigned char b)
{
return a > b ? a - b : 0;
}
using some binary operations instead of a test?

The original function compiles to a conditional move and is not subject to the performance concern you have. You are attempting to perform premature optimization here, and it will do absolutely no good for you. Write readable code, then profile it, then identify and optimize performance bottlenecks.
You say "tests can lower performances". If you want to do micro-optimizations like this, you should aim to understand the reasons underlying these rules of thumb, not apply them as unconditional truth.
What lowers performance about "tests" are (mispredicted) control flow branches. Modern CPUs heavily use instruction pipelining, and control flow branches mean it's unclear which instruction comes next. The common approach is that the CPU guesses (using branch prediction hardware/algorithms) which way the branch will go. If it gets it wrong, it has to flush the entire pipeline and refill it, which wastes cycles.
Well, the code in question has no such issue. It compiles to a conditional move:
https://godbolt.org/z/upkgok
clipped_substract(unsigned char, unsigned char):
mov eax, edi
mov edx, 0
sub eax, esi
cmp dil, sil
cmovbe eax, edx
ret
You can see that there is no control flow branch here. See Why is a conditional move not vulnerable for Branch Prediction Failure? for why that is not subject to the above performance concern.
I repeat: Write readable code, then profile it, then identify and optimize performance bottlenecks. What you attempted here was to fix a performance problem that didn't even exist in the first place (not to mention that you didn't prove that the performance of this code is even relevant to your program's performance).

Using clang 11.0, compiler optimizations enabled
unsigned char clipped_substract(unsigned char a, unsigned char b)
{
    long c = b - a;
    // sign bit of c is 1 exactly when a > b; multiply it by -c, which is a - b
    return ((unsigned long)c >> (sizeof(c)*8 - 1)) * -c;
}
Although the time taken by the alternatives differs only negligibly, when performing 10,000,000 iterations the other approaches take around 0.300 s on my machine and this solution takes around 0.200 s.
This is only valid if you use a type wider than char; here I took long. I don't remember the exact guarantees for char or long, but I know long is wider than char on my machine, so adjust accordingly.
Basically, I just store the result of the subtraction in a wider variable. If the result is negative, the most significant bit will be 1, otherwise 0.

unsigned int getAbs(int n)
{
    int const mask = n >> (sizeof(int) * 8 - 1);  // all ones if n is negative, else zero
    return (n + mask) ^ mask;                     // branchless absolute value
}
/** clipped(a - b) **/
unsigned char clipped_substract(unsigned char a, unsigned char b)
{
    return a - (a+b) / 2 + getAbs(a-b) / 2;       // equals a - min(a,b)
}
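For what it's worth, a quick exhaustive driver (my addition, not part of any answer above) can confirm that whichever branchless variant you pick agrees with the original branchy version over all 256*256 input pairs; link it against the variant under test:
#include <cassert>

int main()
{
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b) {
            unsigned char expected = a > b ? (unsigned char)(a - b) : 0;
            assert(clipped_substract((unsigned char)a, (unsigned char)b) == expected);
        }
}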

Related

What is the fastest way to check the leading characters in a char array?

I reached a bottleneck in my code, so the main issue of this question is performance.
I have a hexadecimal checksum and I want to check the leading zeros of an array of chars. This is what I am doing:
bool starts_with (char* cksum_hex, int n_zero) {
bool flag {true};
for (int i=0; i<n_zero; ++i)
flag &= (cksum_hex[i]=='0');
return flag;
}
The above function returns true if the cksum_hex has n_zero leading zeros. However, for my application, this function is very expensive (60% of total time). In other words, it is the bottleneck of my code. So I need to improve it.
I also checked std::string::starts_with which is available in C++20 and I observed no difference in performance:
// I have to convert cksum to string
std::string cksum_hex_s (cksum_hex);
cksum_hex_s.starts_with("000"); // checking for 3 leading zeros
For more information I am using g++ -O3 -std=c++2a and my gcc version is 9.3.1.
Questions
What is the faster way of checking the leading characters in a char array?
Is there a more efficient way of doing it with std::string::starts_with?
Do bitwise operations help here?
If you modify your function to return early
bool starts_with (char* cksum_hex, int n_zero) {
for (int i=0; i<n_zero; ++i)
{
if (cksum_hex[i] != '0') return false;
}
return true;
}
It will be faster in the case of a big n_zero and a false result. Otherwise, maybe you can try to allocate a global array of '0' characters and use std::memcmp:
// make it as big as you need
constexpr char cmp_array[4] = {'0', '0', '0', '0'};
bool starts_with (char* cksum_hex, int n_zero) {
return std::memcmp(cksum_hex, cmp_array, n_zero) == 0;
}
The problem here is that you need to assume some max possible value of n_zero.
Live example
=== EDIT ===
Considering the complaints about no profiling data to justify the suggested approaches, here you go:
Benchmark results comparing early return implementation with memcmp implementation
Benchmark results comparing memcmp implementation with OP original implementation
Data used:
const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";
memcmp is fastest in all cases except cs2 with the early-return implementation.
Presumably you also have the binary checksum? Instead of converting it to ASCII text first, look at the 4*n high bits to check n nibbles directly for 0 rather than checking n bytes for equality to '0'.
e.g. if you have the hash (or the high 8 bytes of it) as a uint64_t or unsigned __int128, right-shift it to keep only the high n nibbles.
I showed some examples of how they compile for x86-64 when both inputs are runtime variables, but these also compile nicely to other ISAs like AArch64. This code is all portable ISO C++.
bool starts_with (uint64_t cksum_high8, int n_zero)
{
int shift = 64 - n_zero * 4; // A hex digit represents a 4-bit nibble
return (cksum_high8 >> shift) == 0;
}
clang does a nice job for x86-64 with -O3 -march=haswell to enable BMI1/BMI2
high_zero_nibbles(unsigned long, int):
shl esi, 2
neg sil # x86 shifts wrap the count so 64 - c is the same as -c
shrx rax, rdi, rsi # BMI2 variable-count shifts save some uops.
test rax, rax
sete al
ret
This even works for n=16 (shift=0) to test all 64 bits. It fails for n_zero = 0 to test none of the bits; it would encounter UB by shifting a uint64_t by a shift count >= its width. (On ISAs like x86 that wrap out-of-bounds shift counts, code-gen that worked for other shift counts would result in checking all 16 bits. As long as the UB wasn't visible at compile time...) Hopefully you're not planning to call this with n_zero=0 anyway.
Other options: create a mask that keeps only the high n*4 bits, perhaps shortening the critical path through cksum_high8 if that's ready later than n_zero. Especially if n_zero is a compile-time constant after inlining, this can be as fast as checking cksum_high8 == 0. (e.g. x86-64 test reg, immediate.)
bool high_zero_nibbles_v2 (uint64_t cksum_high8, int n_zero) {
    int shift = 64 - n_zero * 4;           // A hex digit represents a 4-bit nibble
    uint64_t low4n_mask = (1ULL << shift) - 1;
    return (cksum_high8 & ~low4n_mask) == 0;
}
Or use a bit-scan function to count leading zero bits and compare for >= 4*n. Unfortunately it took ISO C++ until C++20 <bit>'s countl_zero to finally portably expose this common CPU feature that's been around for decades (e.g. 386 bsf / bsr); before that only as compiler extensions like GNU C __builtin_clz.
This is great if you want to know how many and don't have one specific cutoff threshold.
bool high_zero_nibbles_lzcnt (uint64_t cksum_high8, int n_zero) {
// UB on cksum_high8 == 0. Use x86-64 BMI1 _lzcnt_u64 to avoid that, guaranteeing 64 on input=0
return __builtin_clzll(cksum_high8) >= 4*n_zero;
}
#include <bit>
bool high_zero_nibbles_stdlzcnt (uint64_t cksum_high8, int n_zero) {
return std::countl_zero(cksum_high8) >= 4*n_zero;
}
compile to (clang for Haswell):
high_zero_nibbles_lzcnt(unsigned long, int):
lzcnt rax, rdi
shl esi, 2
cmp esi, eax
setle al # FLAGS -> boolean integer return value
ret
All these instructions are cheap on Intel and AMD, and there's even some instruction-level parallelism between lzcnt and shl.
See asm output for all 4 of these on the Godbolt compiler explorer. Clang compiles 1 and 2 to identical asm. Same for both lzcnt ways with -march=haswell. Otherwise it needs to go out of its way to handle the bsr corner case for input=0, for the C++20 version where that's not UB.
To extend these to wider hashes, you can check the high uint64_t for being all-zero, then proceed to the next uint64_t chunk.
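A minimal sketch of that chunked extension (my illustration, assuming the hash is stored as uint64_t chunks with chunk[0] holding the most significant bytes, 16 hex digits per chunk):
bool starts_with_wide(const uint64_t* chunk, int n_zero) {
    while (n_zero >= 16) {                       // 16 hex digits per uint64_t
        if (*chunk++ != 0) return false;
        n_zero -= 16;
    }
    if (n_zero == 0) return true;
    return (*chunk >> (64 - 4 * n_zero)) == 0;   // check the remaining high nibbles
}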
Using an SSE2 compare with pcmpeqb on the string, pmovmskb -> bsf could find the position of the first 1 bit, thus how many leading-'0' characters there were in the string representation, if you have that to start with. So x86 SIMD can do this very efficiently, and you can use that from C++ via intrinsics.
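As a rough illustration of that SIMD idea (my addition; assumes 16 readable bytes at s and GNU-style __builtin_ctz playing the role of bsf):
#include <immintrin.h>

int leading_zero_chars16(const char* s) {
    __m128i v  = _mm_loadu_si128((const __m128i*)s);     // load 16 bytes
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8('0'));  // 0xFF where byte == '0'
    unsigned match = (unsigned)_mm_movemask_epi8(eq);    // one bit per byte
    unsigned nonmatch = ~match & 0xFFFFu;                // bytes that are not '0'
    return nonmatch ? (int)__builtin_ctz(nonmatch) : 16; // index of first non-'0'
}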
You can make a buffer of zeros large enough for your needs, then compare with memcmp:
const char *zeroBuffer = "000000000000000000000000000000000000000000000000000";
if (memcmp(zeroBuffer, cksum_hex, n_zero) == 0) {
// ...
}
Things you want to check to make your application faster:
1. Can the compiler inline this function in places where it is called?
Either declare the function as inline in a header or put the definition in the compile unit where it is used.
2. Not computing something is faster than computing something more efficiently
Are all calls to this function necessary? High cost is generally the sign of a function called inside a high-frequency loop or in an expensive algorithm. You can often reduce the call count, and hence the time spent in the function, by optimizing the outer algorithm.
3. Is n_zero small or, even better, a constant?
Compilers are pretty good at optimizing algorithms for typically small constant values. If the constant is known to the compiler, it will most likely remove the loop entirely (see the sketch after this list).
4. Do bitwise operations help here?
They definitely have an effect and allow Clang (but not GCC, as far as I can tell) to do some vectorization. Vectorization tends to be faster, but that's not always the case, depending on your hardware and the actual data processed.
Whether it is an optimization or not might depend on how big n_zero is. Considering you are processing checksums, it should be pretty small, so it sounds like a potential optimization.
For a known n_zero, using the bitwise operation allows the compiler to remove all branching. I expect, though I did not measure, this to be faster.
std::all_of and std::string::starts_with should compile to exactly your implementation, except that they will use && instead of &.
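Here is the sketch referenced in point 3 (my illustration, hypothetical helper name): making n_zero a template parameter lets the compiler unroll or remove the loop entirely.
template <int NZero>
bool starts_with_fixed(const char* cksum_hex) {
    for (int i = 0; i < NZero; ++i)            // trip count known at compile time
        if (cksum_hex[i] != '0') return false;
    return true;
}
// usage: starts_with_fixed<3>(cksum_hex);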
Unless n_zero is quite high I would agree with others that you may be misinterpreting profiler results. But anyway:
Could the data be swapped to disk? If your system is under RAM pressure, data could be swapped out to disk and need to be loaded back to RAM when you perform the first operation on it. (Assuming this checksum check is the first access to the data in a while.)
Chances are you could use multiple threads/processes to take advantage of a multicore processor.
Maybe you could use statistics/correlation of your input data, or other structural features of your problem.
For instance, if you have a large number of digits (e.g. 50) and you know that the later digits have a higher probability of being nonzero, you can check the last one first.
If nearly all of your checksums should match, you can use [[likely]] to give a compiler hint that this is the case. (Probably won't make a difference but worth a try.)
Adding my two cents to this interesting discussion, though a little late to the game: you could use std::equal. It's a fast method with a slightly different approach, using a hardcoded string with the maximum number of zeros instead of the number of zeros.
It works by passing the function pointers to the beginning and end of the string to be searched and of the string of zeros; specifically, iterators to the beginning and to one past the wanted number of zeros, which will be used by std::equal:
Sample
bool startsWith(const char* str, const char* end, const char* substr, const char* subend) {
return std::equal(str, end, substr, subend);
}
int main() {
const char* str = "000x1234567";
const char* substr = "0000000000000000000000000000";
std::cout << startsWith(&str[0], &str[3], &substr[0], &substr[3]);
}
Using the test cases in @pptaszni's good answer and the same testing conditions:
const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";
The results were as follows:
Slower than using memcmp, but still faster (except for false results with a low number of zeros) and more consistent than your original code.
Use std::all_of
return std::all_of(cksum_hex, cksum_hex + n_zero, [](char c){ return c == '0'; });

Signed overflow in C++ and undefined behaviour (UB)

I'm wondering about the use of code like the following
int result = 0;
int factor = 1;
for (...) {
result = ...
factor *= 10;
}
return result;
If the loop is iterated over n times, then factor is multiplied by 10 exactly n times. However, factor is only ever used after having been multiplied by 10 a total of n-1 times. If we assume that factor never overflows except on the last iteration of the loop, but may overflow on the last iteration of the loop, then should such code be acceptable? In this case, the value of factor would provably never be used after the overflow has happened.
I'm having a debate on whether code like this should be accepted. It would be possible to put the multiplication inside an if-statement and just not do the multiplication on the last iteration of the loop, when it can overflow. The downside is that it clutters the code and adds an unnecessary branch that would need to be checked on all the previous loop iterations. I could also iterate one fewer time and replicate the loop body once after the loop, but again, this complicates the code.
The actual code in question is used in a tight inner-loop that consumes a large chunk of the total CPU time in a real-time graphics application.
Compilers do assume that a valid C++ program does not contain UB. Consider for example:
if (x == nullptr) {
*x = 3;
} else {
*x = 5;
}
If x == nullptr, then dereferencing it and assigning a value is UB. Hence the only way this could be part of a valid program is if x == nullptr never yields true, and the compiler can assume, under the as-if rule, that the above is equivalent to:
*x = 5;
Now in your code
int result = 0;
int factor = 1;
for (...) { // Loop until factor overflows but not more
result = ...
factor *= 10;
}
return result;
The last multiplication of factor cannot happen in a valid program (signed overflow is undefined). Hence the assignment to result cannot happen either. As there is no way to exit the loop before the last iteration, the previous iterations cannot happen either. Eventually, the part of the code that is correct (i.e., in which no undefined behaviour ever happens) is:
// nothing :(
The behaviour of int overflow is undefined.
It doesn't matter if you read factor outside the loop body; if it has overflowed by then, the behaviour of your code on, after, and somewhat paradoxically before the overflow is undefined.
One issue that might arise in keeping this code is that compilers are getting more and more aggressive when it comes to optimisation. In particular, they are developing a habit of assuming that undefined behaviour never happens. Under that assumption, they may remove the for loop altogether.
Can't you use an unsigned type for factor? Although then you'd need to worry about unwanted conversion of int to unsigned in expressions containing both.
It might be insightful to consider real-world optimizers. Loop unrolling is a known technique. The basic idea of loop unrolling is that
for (int i = 0; i != 3; ++i)
foo()
might be better implemented behind the scenes as
foo()
foo()
foo()
This is the easy case, with a fixed bound. But modern compilers can also do this for variable bounds:
for (int i = 0; i != N; ++i)
foo();
becomes
__RELATIVE_JUMP(3-N)
foo();
foo();
foo();
Obviously this only works if the compiler knows that N<=3. And that's where we get back to the original question:
int result = 0;
int factor = 1;
for (...) {
result = ...
factor *= 10;
}
return result;
Because the compiler knows that signed overflow does not occur, it knows that the loop can execute a maximum of 9 times on 32-bit architectures, since 10^10 > INT_MAX. It can therefore do a 9-iteration loop unroll. But the intended maximum was 10 iterations!
What might happen is that you get a relative jump to an assembly instruction (9-N) with N==10, so an offset of -1, which is the jump instruction itself. Oops. This is a perfectly valid loop optimization for well-defined C++, but the example given turns into a tight infinite loop.
Any signed integer overflow results in undefined behaviour, regardless of whether or not the overflowed value is or might be read.
Maybe in your use-case you can lift the first iteration out of the loop, turning this
int result = 0;
int factor = 1;
for (int n = 0; n < 10; ++n) {
result += n + factor;
factor *= 10;
}
// factor "is" 10^10 > INT_MAX, UB
into this
int factor = 1;
int result = 0 + factor; // first iteration
for (int n = 1; n < 10; ++n) {
factor *= 10;
result += n + factor;
}
// factor is 10^9 < INT_MAX
With optimization enabled, the compiler might unroll the second loop above into one conditional jump.
This is UB; in ISO C++ terms the entire behaviour of the entire program is completely unspecified for an execution that eventually hits UB. The classic example is as far as the C++ standard cares, it can make demons fly out of your nose. (I recommend against using an implementation where nasal demons are a real possibility). See other answers for more details.
Compilers can "cause trouble" at compile time for paths of execution they can see leading to compile-time-visible UB, e.g. assume those basic blocks are never reached.
See also What Every C Programmer Should Know About Undefined Behavior (LLVM blog). As explained there, signed-overflow UB lets compilers prove that for(... i <= n ...) loops are not infinite loops, even for unknown n. It also lets them "promote" int loop counters to pointer width instead of redoing sign-extension. (So the consequence of UB in that case could be accessing outside the low 64k or 4G elements of an array, if you were expecting signed wrapping of i into its value range.)
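To make that loop-termination point concrete, here is a small example of the kind of loop the blog post discusses (my illustration):
// With signed i, the compiler may assume this runs exactly n+1 times:
// i overflowing INT_MAX would be UB, so i <= n must eventually become false.
// With unsigned i and n == UINT_MAX it would be a well-defined infinite loop.
void fill(int n, long* out) {
    for (int i = 0; i <= n; ++i)
        out[i] = i;   // i can be kept in a pointer-width register, no re-sign-extension
}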
In some cases compilers will emit an illegal instruction like x86 ud2 for a block that provably causes UB if ever executed. (Note that a function might not ever be called, so compilers can't in general go berserk and break other functions, or even possible paths through a function that don't hit UB. i.e. the machine code it compiles to must still work for all inputs that don't lead to UB.)
Probably the most efficient solution is to manually peel the last iteration so the unneeded factor*=10 can be avoided.
int result = 0;
int factor = 1;
for (... i < n-1) { // stop 1 iteration early
result = ...
factor *= 10;
}
result = ... // another copy of the loop body, using the last factor
// factor *= 10; // and optimize away this dead operation.
return result;
Or if the loop body is large, consider simply using an unsigned type for factor. Then you can let the unsigned multiply overflow and it will just do well-defined wrapping to some power of 2 (the number of value bits in the unsigned type).
This is fine even if you use it with signed types, especially if your unsigned->signed conversion never overflows.
Conversion between unsigned and 2's complement signed is free (same bit-pattern for all values); the modulo wrapping for int -> unsigned specified by the C++ standard simplifies to just using the same bit-pattern, unlike for one's complement or sign/magnitude.
And unsigned->signed is similarly trivial, although it is implementation-defined for values larger than INT_MAX. If you aren't using the huge unsigned result from the last iteration, you have nothing to worry about. But if you are, see Is conversion from unsigned to signed undefined?. The value-doesn't-fit case is implementation-defined, which means that an implementation must pick some behaviour; sane ones just truncate (if necessary) the unsigned bit pattern and use it as signed, because that works for in-range values the same way with no extra work. And it's definitely not UB. So big unsigned values can become negative signed integers. e.g. after int x = u; gcc and clang don't optimize away x>=0 as always being true, even without -fwrapv, because they defined the behaviour.
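A minimal sketch of the unsigned-factor variant in the question's shape (my illustration; the loop body is invented, since the question elides it):
int result = 0;
unsigned factor = 1;
for (int n = 0; n < 10; ++n) {
    result += n + (int)factor; // implementation-defined if factor > INT_MAX, but never UB
    factor *= 10;              // last iteration wraps modulo 2^32: well defined, value unused
}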
If you can tolerate a few additional assembly instructions in the loop, instead of
int factor = 1;
for (int j = 0; j < n; ++j) {
...
factor *= 10;
}
you can write:
int factor = 0;
for (...) {
factor = 10 * factor + !factor;
...
}
to avoid the last multiplication. !factor will not introduce a branch:
xor ebx, ebx
L1:
xor eax, eax
test ebx, ebx
lea edx, [rbx+rbx*4]
sete al
add ebp, 1
lea ebx, [rax+rdx*2]
mov edi, ebx
call consume(int)
cmp r12d, ebp
jne .L1
This code
int factor = 0;
for (...) {
factor = factor ? 10 * factor : 1;
...
}
also results in branchless assembly after optimization:
mov ebx, 1
jmp .L1
.L2:
lea ebx, [rbx+rbx*4]
add ebx, ebx
.L1:
mov edi, ebx
add ebp, 1
call consume(int)
cmp r12d, ebp
jne .L2
(Compiled with GCC 8.3.0 -O3)
You didn't show what's in the parentheses of the for statement, but I'm going to assume it's something like this:
for (int n = 0; n < 10; ++n) {
result = ...
factor *= 10;
}
You can simply move the counter increment and loop termination check into the body:
for (int n = 0; ; ) {
result = ...
if (++n >= 10) break;
factor *= 10;
}
The number of assembly instructions in the loop will remain the same.
Inspired by Andrei Alexandrescu's presentation "Speed Is Found In The Minds of People".
Consider the function:
unsigned mul_mod_65536(unsigned short a, unsigned short b)
{
return (a*b) & 0xFFFFu;
}
According to the published Rationale, the authors of the Standard would have expected that if this function were invoked on (e.g.) a commonplace 32-bit computer with arguments of 0xC000 and 0xC000, promoting the operands of * to signed int would cause the computation to yield -0x70000000, which when converted to unsigned would yield 0x90000000u, the same answer as if they had made unsigned short promote to unsigned. Nonetheless, gcc will sometimes optimize that function in ways that would behave nonsensically if an overflow occurs. Any code where some combination of inputs could cause an overflow must be processed with the -fwrapv option unless it would be acceptable to allow creators of deliberately-malformed input to execute arbitrary code of their choosing.
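A common defensive rewrite (my addition, not from the answer) performs the multiplication in unsigned arithmetic, sidestepping the int promotion that makes 0xC000 * 0xC000 overflow:
unsigned mul_mod_65536_safe(unsigned short a, unsigned short b)
{
    return ((unsigned)a * b) & 0xFFFFu; // unsigned multiply wraps, no UB
}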
Why not this:
int result = 0;
int factor = 10;
for (...) {
factor *= 10;
result = ...
}
return result;
There are many different faces of Undefined Behavior, and what's acceptable depends on the usage.
tight inner-loop that consumes a large chunk of the total CPU time in a real-time graphics application
That, by itself, is a bit of an unusual thing, but be that as it may... if this is indeed the case, then the UB is most probably within the realm "allowable, acceptable". Graphics programming is notorious for hacks and ugly stuff. As long as it "works" and it doesn't take longer than 16.6ms to produce a frame, usually, nobody cares. But still, be aware of what it means to invoke UB.
First, there is the standard. From that point of view, there's nothing to discuss and no way to justify it: your code is simply invalid. There are no ifs or whens; it just isn't valid code. You might as well say that's a middle-finger-up from your point of view, and 95-99% of the time you'll be good to go anyway.
Next, there's the hardware side. There are some uncommon, weird architectures where this is a problem. I'm saying "uncommon, weird" because on the one architecture that makes up 80% of all computers (or the two architectures that together make up 95% of all computers) overflow is a "yeah, whatever, don't care" thing on the hardware level. You sure do get a garbage (although still predictable) result, but no evil things happen.
That is not the case on every architecture, you might very well get a trap on overflow (though seeing how you speak of a graphics application, the chances of being on such an odd architecture are rather small). Is portability an issue? If it is, you may want to abstain.
Last, there is the compiler/optimizer side. One reason why overflow is undefined is that simply leaving it at that was, once upon a time, the easiest way to cope with the hardware. But another reason is that, e.g., x+1 is guaranteed to always be larger than x, and the compiler/optimizer can exploit this knowledge. Now, for the previously mentioned case, compilers are indeed known to act this way and simply strip out complete blocks (there existed a Linux exploit some years ago which was based on the compiler having dead-stripped some validation code because of exactly this).
For your case, I would seriously doubt that the compiler does some special, odd, optimizations. However, what do you know, what do I know. When in doubt, try it out. If it works, you are good to go.
(And finally, there's of course code audit, you might have to waste your time discussing this with an auditor if you're unlucky.)

Is there anything special about -1 (0xFFFFFFFF) regarding ADC?

In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag-manipulating instructions, in particular to ADC, but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:
constexpr unsigned X = 0;
unsigned f1(unsigned a, unsigned b) {
b += a;
unsigned c = b < a;
return c + b + X;
}
Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks like I got lucky and the (g++ -O3, version 9.1) generated code is this:
f1(unsigned int, unsigned int):
add %edi,%esi
mov %esi,%eax
adc $0x0,%eax
retq
For all values of X that I've tested, the code is as above (except, of course, for the immediate value $0x0, which changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u, ... it really doesn't matter how you spell it), the generated code is:
f1(unsigned int, unsigned int):
xor %eax,%eax
add %edi,%esi
setb %al
lea -0x1(%rsi,%rax,1),%eax
retq
This seems less efficient than the initial code, as suggested by indirect measurements (not very scientific, though). Am I right? If so, is this a "missed optimization opportunity" kind of bug that is worth reporting?
For what it's worth, clang -O3, version 8.0.0, always uses ADC (as I wanted) and icc -O3, version 19.0.1, never does.
I've tried using the intrinsic _addcarry_u32 but it didn't help.
unsigned f2(unsigned a, unsigned b) {
b += a;
unsigned char c = b < a;
_addcarry_u32(c, b, X, &b);
return b;
}
I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). What's the point of using it, since it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)
I might, actually, be using it correctly. For X == 0 I'm happy:
f2(unsigned int, unsigned int):
add %esi,%edi
mov %edi,%eax
adc $0x0,%eax
retq
For X == -1 I'm unhappy :-(
f2(unsigned int, unsigned int):
add %esi,%edi
mov $0xffffffff,%eax
setb %dl
add $0xff,%dl
adc %edi,%eax
retq
I do get the ADC but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)
mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs (see footnote 1 below).
This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.
I don't know exactly what it saw / was looking for, so yes, you should report this as a missed-optimization bug. Or, if you want to dig deeper yourself, you could look at the GIMPLE or RTL output after optimization passes and see what happens, if you know anything about GCC's internal representations. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".
The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)
That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry, so you can't just use the sum < a idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.
e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc because it needs to ignore carry in the special case where a = -1 and carry_in = 1.
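For reference, one correct general-case formulation in pure C (my sketch, not from the answer): compute the two carries separately and OR them, since at most one of the two additions can wrap. Compilers typically won't turn this into a single adc.
unsigned add3(unsigned a, unsigned b, unsigned carry_in, unsigned* carry_out)
{
    unsigned sum = a + carry_in;   // wraps to 0 only when a == UINT_MAX and carry_in == 1
    unsigned c1 = sum < a;         // the "sum < a" idiom, applied to the first addition
    sum += b;
    unsigned c2 = sum < b;         // carry from the second addition
    *carry_out = c1 | c2;          // at most one of the two can be set
    return sum;
}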
So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.
What's the point of using it since it's up to me to provide the carry flag?
You're using _addcarry_u32 correctly.
The point of its existence is to let you express an add with carry in as well as carry out, which is hard in pure C. GCC and clang don't optimize it well, often not just keeping the carry result in CF.
If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.
e.g. to add two 128-bit integers in 32-bit chunks, you can do this
// bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
unsigned char carry;
carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}
(On Godbolt with GCC/clang/ICC)
That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.
GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.
Footnote 1: on uop count they're equal on Intel Haswell and earlier, where adc is 2 uops, except with a zero immediate, which Sandybridge-family decoders special-case as 1 uop.
But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.
On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.
So equal total uop count but worse latency means that adc would still be a better choice.
https://agner.org/optimize/

Fast multiplication/division by 2 for floats and doubles (C/C++)

In the software I'm writing, I'm doing millions of multiplications or divisions by 2 (or powers of 2) of my values. I would really like these values to be int so that I could use the bitshift operators:
int a = 1;
int b = a << 24;
However, I cannot, and I have to stick with doubles.
My question is : as there is a standard representation of doubles (sign, exponent, mantissa), is there a way to play with the exponent to get fast multiplications/divisions by a power of 2?
I can even assume that the number of bits is going to be fixed (the software will run on machines that always have 64-bit doubles).
P.S.: And yes, the algorithm mostly does these operations only. This is the bottleneck (it's already multithreaded).
Edit: Or am I completely mistaken, and clever compilers already optimize things for me?
Temporary results (with Qt to measure time, overkill, but I don't care):
#include <QtCore/QCoreApplication>
#include <QtCore/QElapsedTimer>
#include <QtCore/QDebug>
#include <iostream>
#include <math.h>
using namespace std;
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
while(true)
{
QElapsedTimer timer;
timer.start();
int n=100000000;
volatile double d=12.4;
volatile double D;
for(unsigned int i=0; i<n; ++i)
{
//D = d*32; // 200 ms
//D = d*(1<<5); // 200 ms
D = ldexp (d,5); // 6000 ms
}
qDebug() << "The operation took" << timer.elapsed() << "milliseconds";
}
return a.exec();
}
Runs suggest that D = d*(1<<5); and D = d*32; run in the same time (200 ms) whereas D = ldexp(d,5); is much slower (6000 ms). I know that this is a micro-benchmark, and that suddenly my RAM has exploded because Chrome has decided to compute Pi behind my back every single time I run ldexp(), so this benchmark is worth nothing. But I'll keep it nevertheless.
On the other hand, I'm having trouble doing reinterpret_cast<uint64_t *> because there's a const violation (it seems the volatile keyword interferes).
This is one of those highly-application specific things. It may help in some cases and not in others. (In the vast majority of cases, a straight-forward multiplication is still best.)
The "intuitive" way of doing this is just to extract the bits into a 64-bit integer and add the shift value directly into the exponent. (this will work as long as you don't hit NAN or INF)
So something like this:
union {
    uint64_t i;   // <cstdint> / <stdint.h>
    double f;
};
f = 123.;
i += 0x0010000000000000ull;   // add 1 to the biased exponent field (bit 52): multiply by 2
// Check for zero. And if it matters, denormals as well.
Note that this code is not C-compliant in any way, and is shown just to illustrate the idea. Any attempt to implement this should be done directly in assembly or SSE intrinsics.
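(A sketch of the same exponent trick done through memcpy, which at least avoids the union aliasing; my addition, still assuming IEEE-754 binary64 and inputs that are not zero, denormal, NaN or infinity:)
#include <cstdint>
#include <cstring>

double mul_pow2(double f, int p)   // multiply f by 2^p
{
    std::uint64_t i;
    std::memcpy(&i, &f, sizeof i);   // well-defined bit copy
    i += (std::uint64_t)p << 52;     // add p to the biased exponent field
                                     // (negative p works too, via modular arithmetic)
    std::memcpy(&f, &i, sizeof f);
    return f;
}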
However, in most cases the overhead of moving the data from the FP unit to the integer unit (and back) will cost much more than just doing a multiplication outright. This is especially the case for the pre-SSE era, where the value needed to be stored from the x87 FPU into memory and then read back into the integer registers.
In the SSE era, integer SSE and FP SSE use the same ISA registers (though they still have separate register files). According to Agner Fog, there's a 1-to-2-cycle penalty for moving data between the integer SSE and FP SSE execution units. So the cost is much better than in the x87 era, but it's still there.
All-in-all, it will depend on what else you have on your pipeline. But in most cases, multiplying will still be faster. I've run into this exact same problem before so I'm speaking from first-hand experience.
Now with 256-bit AVX instructions that only support FP instructions, there's even less of an incentive to play tricks like this.
How about ldexp?
Any half-decent compiler will generate optimal code on your platform.
But as @Clinton points out, simply writing it in the "obvious" way should do just as well. Multiplying and dividing by powers of two is child's play for a modern compiler.
Directly munging the floating point representation, besides being non-portable, will almost certainly be no faster (and might well be slower).
And of course, you should not waste time even thinking about this question unless your profiling tool tells you to. But the kind of people who listen to this advice will never need it, and the ones who need it will never listen.
[update]
OK, so I just tried ldexp with g++ 4.5.2. The cmath header inlines it as a call to __builtin_ldexp, which in turn...
...emits a call to the libm ldexp function. I would have thought this builtin would be trivial to optimize, but I guess the GCC developers never got around to it.
So, multiplying by 1 << p is probably your best bet, as you have discovered.
You can pretty safely assume IEEE 754 formatting, the details of which can get pretty gnarly (especially when you get into subnormals). In the common cases, however, this should work:
const int DOUBLE_EXP_SHIFT = 52;
const unsigned long long DOUBLE_MANT_MASK = (1ull << DOUBLE_EXP_SHIFT) - 1ull;
const unsigned long long DOUBLE_EXP_MASK = ((1ull << 63) - 1) & ~DOUBLE_MANT_MASK;
void unsafe_shl(double* d, int shift) {
unsigned long long* i = (unsigned long long*)d;
if ((*i & DOUBLE_EXP_MASK) && ((*i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
*i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;
} else if (*i) {
*d *= (1 << shift);
}
}
EDIT: After doing some timing, this method is oddly slower than the double method on my compiler and machine, even stripped down to the minimum executed code:
double ds[0x1000];
for (int i = 0; i != 0x1000; i++)
ds[i] = 1.2;
clock_t t = clock();
for (int j = 0; j != 1000000; j++)
for (int i = 0; i != 0x1000; i++)
#if DOUBLE_SHIFT
ds[i] *= 1 << 4;
#else
((unsigned int*)&ds[i])[1] += 4 << 20;
#endif
clock_t e = clock();
printf("%g\n", (float)(e - t) / CLOCKS_PER_SEC);
The DOUBLE_SHIFT case completes in 1.6 seconds, with an inner loop of:
movupd xmm0,xmmword ptr [ecx]
lea ecx,[ecx+10h]
mulpd xmm0,xmm1
movupd xmmword ptr [ecx-10h],xmm0
Versus 2.4 seconds otherwise, with an inner loop of:
add dword ptr [ecx],400000h
lea ecx, [ecx+8]
Truly unexpected!
EDIT 2: Mystery solved! One of the changes in VC11 is that it now always vectorizes floating-point loops, effectively forcing /arch:SSE2. VC10, even with /arch:SSE2, is still worse, at 3.0 seconds, with an inner loop of:
movsd xmm1,mmword ptr [esp+eax*8+38h]
mulsd xmm1,xmm0
movsd mmword ptr [esp+eax*8+38h],xmm1
inc eax
VC10 without /arch:SSE2 (even with /arch:SSE) is 5.3 seconds... with 1/100th of the iterations!! Inner loop:
fld qword ptr [esp+eax*8+38h]
inc eax
fmul st,st(1)
fstp qword ptr [esp+eax*8+30h]
I knew the x87 FP stack was awful, but 500 times worse is kinda ridiculous. You probably won't see these kinds of speedups converting, e.g., matrix ops to SSE or int hacks, since this is the worst case: loading into the FP stack, doing one op, and storing from it. But it's a good example of why x87 is not the way to go for anything performance-related.
The fastest way to do this is probably:
x *= (1 << p);
This sort of thing may simply be done by calling a machine instruction to add p to the exponent. Telling the compiler to instead extract some bits with a mask and fiddle with them manually will probably make things slower, not faster.
Remember, C/C++ is not assembly language. Using a bitshift operator does not necessarily compile to a bitshift assembly operation, nor does using multiplication necessarily compile to multiplication. There are all sorts of weird and wonderful things going on, like which registers are being used and which instructions can be run simultaneously, which I'm not smart enough to understand. But your compiler, with many man-years of knowledge and experience and lots of computational power, is much better at making these judgements.
p.s. Keep in mind, if your doubles are in an array or some other flat data structure, your compiler might be really smart and use SSE to multiply 2, or even 4, doubles at the same time. However, doing a lot of bit shifting is probably going to confuse your compiler and prevent this optimisation.
Since C++17 you can also use hexadecimal floating-point literals. That way you can multiply by higher powers of 2. For instance:
d *= 0x1p64;
will multiply d by 2^64. I use it to implement my fast integer arithmetic in a conversion to double.
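Negative binary exponents work the same way, so division by a power of two is just as direct; a couple of illustrative lines (my addition):
double d = 3.0;
d *= 0x1p10;   // multiply by 2^10 = 1024
d *= 0x1p-3;   // divide by 8 (exact: the literal is a power of two)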
What other operations does this algorithm require? You might be able to break your floats into int pairs (sign/mantissa and magnitude), do your processing, and reconstitute them at the end.
Multiplying by 2 can be replaced by an addition: x *= 2 is equivalent to x += x.
Division by 2 can be replaced by multiplication by 0.5. Multiplication is usually significantly faster than division.
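As a small illustration of those two rewrites (my addition; compilers often do the reciprocal-multiply rewrite themselves when the divisor is an exact power of two):
double x = 7.0;
x += x;       // instead of x *= 2
x *= 0.5;     // instead of x /= 2; exact, since 0.5 is a power of two
x *= 0.125;   // instead of x /= 8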
Although there is little/no practical benefit to treating powers of two specially for float or double types, there is a case for this for double-double types. Double-double multiplication and division are complicated in general, but trivial for multiplying and dividing by a power of two.
E.g. for
typedef struct {double hi; double lo;} doubledouble;
doubledouble x;
x.hi*=2, x.lo*=2; //multiply x by 2
x.hi/=2, x.lo/=2; //divide x by 2
In fact I have overloaded << and >> for doubledouble so that it's analogous to integers.
//x is a doubledouble type
x << 2 // multiply x by four;
x >> 3 // divide x by eight.
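Those overloads might look something like this (my sketch of the idea, assuming the doubledouble struct above and small non-negative shift counts):
doubledouble operator<<(doubledouble x, int n) {
    double f = (double)(1 << n);   // exact power of two
    x.hi *= f;
    x.lo *= f;
    return x;
}

doubledouble operator>>(doubledouble x, int n) {
    double f = 1.0 / (1 << n);     // exact: reciprocal of a power of two
    x.hi *= f;
    x.lo *= f;
    return x;
}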
Depending on what you're multiplying, if you have data that is recurring enough, a look up table might provide better performance, at the expense of memory.

How efficient is an if statement compared to a test that doesn't use an if? (C++)

I need a program to get the smaller of two numbers, and I'm wondering if using a standard "if x is less than y"
int a, b, low;
if (a < b) low = a;
else low = b;
is more or less efficient than this:
int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));
(or the variation of putting int delta = a - b at the top and replacing instances of a - b with that).
I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and about the efficiency of if-else statements versus alternatives in general.
(Disclaimer: the following deals with very low-level optimizations that are most often not necessary. If you keep reading, you waive your right to complain that computers are fast and there is never any reason to worry about this sort of thing.)
One advantage of eliminating an if statement is that you avoid branch prediction penalties.
Branch prediction penalties are generally only a problem when the branch is not easily predicted. A branch is easily predicted when it is almost always taken/not taken, or it follows a simple pattern. For example, the branch in a loop statement is taken every time except the last one, so it is easily predicted. However, if you have code like
a = random() % 10
if (a < 5)
print "Less"
else
print "Greater"
then this branch is not easily predicted, and will frequently incur the prediction penalty associated with flushing the pipeline and rolling back instructions that were speculatively executed on the wrong side of the branch.
One way to avoid these kinds of penalties is to use the ternary (?:) operator. In simple cases, the compiler will generate conditional move instructions rather than branches.
So
int a, b, low;
if (a < b) low = a;
else low = b;
becomes
int a, b, low;
low = (a < b) ? a : b;
and in the second case a branching instruction is not necessary. Additionally, it is much clearer and more readable than your bit-twiddling implementation.
Of course, this is a micro-optimization which is unlikely to have significant impact on your code.
Simple answer: One conditional jump is going to be more efficient than two subtractions, one addition, a bitwise and, and a shift operation combined. I've been sufficiently schooled on this point (see the comments) that I'm no longer even confident enough to say that it's usually more efficient.
Pragmatic answer: Either way, you're not paying nearly as much for the extra CPU cycles as you are for the time it takes a programmer to figure out what that second example is doing. Program for readability first, efficiency second.
Compiling this on gcc 4.3.4, amd64 (core 2 duo), Linux:
int foo1(int a, int b)
{
int low;
if (a < b) low = a;
else low = b;
return low;
}
int foo2(int a, int b)
{
int low;
low = b + ((a - b) & ((a - b) >> 31));
return low;
}
I get:
foo1:
cmpl %edi, %esi
cmovle %esi, %edi
movl %edi, %eax
ret
foo2:
subl %esi, %edi
movl %edi, %eax
sarl $31, %eax
andl %edi, %eax
addl %esi, %eax
ret
...which I'm pretty sure won't incur branch-prediction penalties, since the code doesn't jump. Also, the non-if-statement version is 2 instructions longer. I think I will continue coding, and let the compiler do its job.
Like with any low-level optimization, test it on the target CPU/board setup.
On my compiler (gcc 4.5.1 on x86_64), the first example becomes
cmpl %ebx, %eax
cmovle %eax, %esi
The second example becomes
subl %eax, %ebx
movl %ebx, %edx
sarl $31, %edx
andl %ebx, %edx
leal (%rdx,%rax), %esi
Not sure if the first one is faster in all cases, but I would bet it is.
The biggest problem is that your second example won't work on machines where int is not 32 bits wide (the hard-coded 31 assumes a 32-bit int).
However, even neglecting that, modern compilers are smart enough to consider branchless alternatives in every case possible and compare the estimated speeds. So your second example will most likely actually be slower.
There will be no difference between the if statement and using a ternary operator, as even most dumb compilers are smart enough to recognize this special case.
[Edit] Because I think this is such an interesting topic, I've written a blog post on it.
Either way, the assembly will only be a few instructions and either way it'll take picoseconds for those instructions to execute.
I would profile the application and concentrate your optimization efforts on something more worthwhile.
Also, the time saved by this type of optimization will not be worth the time wasted by anyone trying to maintain it.
For simple statements like this, I find the ternary operator very intuitive:
low = (a < b) ? a : b;
Clear and concise.
For something as simple as this, why not just experiment and try it out?
Generally, you'd profile first, identify this as a hotspot, experiment with a change, and view the result.
I wrote a simple program that compares both techniques passing in random numbers (so that we don't see perfect branch prediction) with Visual C++ 2010. The difference between the approaches on my machine for 100,000,000 iteration? Less than 50ms total, and the if version tended to be faster. Looking at the codegen, the compiler successfully converted the simple if to a cmovl instruction, avoiding a branch altogether.
One thing to be wary of when you get into really bit-fiddly kinds of hacks is how they may interact with compiler optimizations that take place after inlining. For example, the readable procedure
int foo (int a, int b) {
return ((a < b) ? a : b);
}
is likely to be compiled into something very efficient in any case, but in some cases it may be even better. Suppose, for example, that someone writes
int bar = foo (x, x+3);
After inlining, the compiler will recognize that 3 is positive, and may then make use of the fact that signed overflow is undefined to eliminate the test altogether, to get
int bar = x;
It's much less clear how the compiler should optimize your second implementation in this context. This is a rather contrived example, of course, but similar optimizations actually are important in practice. Of course you shouldn't accept bad compiler output when performance is critical, but it's likely wise to see if you can find clear code that produces good output before you resort to code that the next, amazingly improved, version of the compiler won't be able to optimize to death.
One thing I will point out that I haven't seen mentioned is that an optimization like this can easily be overwhelmed by other issues. For example, if you are running this routine on two large arrays of numbers (or, worse yet, pairs of numbers scattered in memory), the cost of fetching the values on today's CPUs can easily stall the CPU's execution pipelines.
I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.
Desktop/server CPUs are optimized for pipelining. The second is theoretically faster because the CPU doesn't have to branch and can use multiple ALUs to evaluate parts of the expression in parallel. More non-branching code with intermixed independent operations is best for such CPUs. (But even that is negated now by modern "conditional" CPU instructions, which allow making the first code branchless too.)
On embedded CPUs, branching is often less expensive (relative to everything else), and they don't have many spare ALUs to evaluate operations out of order (that's if they support out-of-order execution at all). Less code/data is better; caches are small too. (I have even seen uses of bubble sort in embedded applications: the algorithm uses the least memory/code and is fast enough for small amounts of information.)
Important: do not forget about the compiler optimizations. Using many tricks, the compilers sometimes can remove the branching themselves: inlining, constant propagation, refactoring, etc.
But in the end I would say that yes, the difference is too minuscule to be relevant. In the long term, readable code wins.
The way things go on the CPU front, it is more rewarding to invest time now in making the code multi-threaded and OpenCL capable.
Why low = a; in the if and low = b; in the else? And why 31? If 31 has anything to do with CPU word size, what if the code is to be run on a CPU of a different word size?
The if..else way looks more readable. I like programs to be as readable to humans as they are to the compilers.
Profile results with gcc -o foo -g -p -O0, Solaris 9, V240:
%Time Seconds Cumsecs #Calls msec/call Name
36.8 0.21 0.21 8424829 0.0000 foo2
28.1 0.16 0.37 1 160. main
17.5 0.10 0.47 16850667 0.0000 _mcount
17.5 0.10 0.57 8424829 0.0000 foo1
0.0 0.00 0.57 4 0. atexit
0.0 0.00 0.57 1 0. _fpsetsticky
0.0 0.00 0.57 1 0. _exithandle
0.0 0.00 0.57 1 0. _profil
0.0 0.00 0.57 1000 0.000 rand
0.0 0.00 0.57 1 0. exit
code:
int
foo1 (int a, int b, int low)
{
if (a < b)
low = a;
else
low = b;
return low;
}
int
foo2 (int a, int b, int low)
{
low = (a < b) ? a : b;
return low;
}
int main()
{
int low=0;
int a=0;
int b=0;
int i=500;
while (i--)
{
for(a=rand(), b=rand(); a; a--)
{
low=foo1(a,b,low);
low=foo2(a,b,low);
}
}
return 0;
}
Based on the data, in the above environment, the exact opposite of several beliefs stated here was found to be true. Note the "in this environment": the if construct was faster than the ternary ?: construct.
I wrote a ternary logic simulator not so long ago, and this question was vital to me, as it directly affects my interpreter's execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.
In a binary-coded-ternary system, one trit is packed in two bits. The most significant bit means negative and the least significant means positive. The case "11" should not occur, but it must be handled properly and treated as 0.
Consider the inline int bct_decoder( unsigned bctData ) function, which should return our formatted trit as a regular integer -1, 0 or 1. As I observed, there are 4 approaches; I called them "cond", "mod", "math" and "lut". Let's investigate them.
The first is based on jz|jnz and jl|jb conditional jumps, thus "cond". Its performance is not good at all, because it relies on the branch predictor. And even worse, it varies, because it is unknown a priori whether there will be one branch or two. Here is an example:
inline int bct_decoder_cond( unsigned bctData ) {
unsigned lsB = bctData & 1;
unsigned msB = bctData >> 1;
return
( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
( lsB > msB ) ? 1 : -1;
}
This is the slowest version; it can involve 2 branches in the worst case, and this is something where binary logic fails. On my 3770K it produces around 200 MIPS on average on random data. (Here and below, each test is the average of 1000 tries on a randomly filled 2 MB dataset.)
The next one relies on the modulo operator, and its speed is somewhere between the first and third, but it is definitely faster: 600 MIPS:
inline int bct_decoder_mod( unsigned bctData ) {
return ( int )( ( bctData + 1 ) % 3 ) - 1;
}
The next one is the branchless approach, which involves only maths, thus "math"; it does not use jump instructions at all:
inline int bct_decoder_math( unsigned bctData ) {
return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}
This does what it should and behaves really great. To compare, the performance estimate is 1000 MIPS, 5x faster than the branched version. The branched version is probably slowed down by the lack of native 2-bit signed int support. But in my application it is quite a good version in itself.
If this is not enough, then we can go further with something special. Next is the lookup-table approach:
inline int bct_decoder_lut( unsigned bctData ) {
static const int decoderLUT[] = { 0, 1, -1, 0 };
return decoderLUT[ bctData & 0x3 ];
}
In my case one trit occupied only 2 bits, so the LUT has only 4 entries (16 bytes with int elements) and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS; this is where my measurement accuracy goes down. That is a 1.5x speedup over the fast math approach, because you just have a precalculated result and a single AND instruction. Sadly, caches are small, and (if your index length is greater than several bits) you simply cannot use it.
So I think I answered your question on what branched/branchless code can look like, with detailed samples, a real-world application, and real performance measurements.
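(A quick sanity check I added, not part of the original answer, verifying that all four decoders agree on the four possible 2-bit inputs:)
#include <cassert>

int main() {
    for (unsigned bct = 0; bct < 4; ++bct) {
        int expected = bct_decoder_math(bct);   // "11" decodes to 0, like "00"
        assert(bct_decoder_cond(bct) == expected);
        assert(bct_decoder_mod(bct)  == expected);
        assert(bct_decoder_lut(bct)  == expected);
    }
}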
Updated answer taking into account the current (2018) state of compiler vectorization. Please see danben's answer for the general case where vectorization is not a concern.
TLDR summary: avoiding ifs can help with vectorization.
Because SIMD would be too complex to allow branching on some elements but not others, any code containing an if statement will fail to be vectorized unless the compiler knows a "superoptimization" technique that can rewrite it into a branchless set of operations. I don't know of any compilers that do this as an integrated part of the vectorization pass (Clang does some of this independently, but not specifically to help vectorization, AFAIK).
Using the OP's provided example:
int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));
Many compilers can vectorize this to be something approximately equivalent to:
__m128i low128i(__m128i a, __m128i b){
__m128i diff, tmp;
diff = _mm_sub_epi32(a,b);
tmp = _mm_srai_epi32(diff, 31);
tmp = _mm_and_si128(tmp,diff);
return _mm_add_epi32(tmp,b);
}
This optimization requires the data to be laid out in a fashion that allows for it, but it could be extended to __m256i with AVX2 or __m512i with AVX-512 (and loops could even be unrolled further to take advantage of additional registers), or to other SIMD instructions on other architectures. Another plus is that these instructions are all low-latency, high-throughput instructions (latencies of ~1 and reciprocal throughputs in the range of 0.33 to 0.5, so really fast relative to non-vectorized code).
I see no reason why compilers couldn't optimize an if statement to a vectorized conditional move (except that the corresponding x86 operations only work on memory locations and have low throughput, and other architectures like ARM may lack them entirely), but it could be done by doing something like:
void lowhi128i(__m128i *a, __m128i *b){ // does both low and high
    __m128i _a = *a, _b = *b;
    __m128i lomask = _mm_cmpgt_epi32(_a, _b);  // all-ones lanes where a > b
    __m128i himask = _mm_cmpgt_epi32(_b, _a);
    _mm_maskmoveu_si128(_b, lomask, (char*)a); // store b's lanes where a > b
    _mm_maskmoveu_si128(_a, himask, (char*)b);
}
However this would have a much higher latency due to memory reads and writes and lower throughput (higher/worse reciprocal throughput) than the example above.
Unless you're really trying to buckle down on efficiency, I don't think this is something you need to worry about.
My simple thought, though, is that the if would be quicker, because it's comparing one thing, while the other code is doing several operations. But again, I imagine that the difference is minuscule.
If it is for GNU C++, try this:
int min = i <? j;
I have not profiled it, but I think it is definitely the one to beat.
(Note that the <? minimum operator was a G++ extension; it was deprecated and later removed, so on modern compilers use std::min(i, j) instead.)