I'm trying to find an efficient way to check if an integer is zero without jumping.
I have two integer variables in and out. If in is zero, I want out to be zero. If in is not zero, I want out to be one.
If it may help, I know that in will be zero or a power of two (only one set bit). I also know that the most significant and the least significant bits are never set.
I could do the obvious: out = (in == 0 ? 0 : 1); But that implies a jump, which is costly.
I could do something like this out = (in * 0xFFFFFFFF) >> 63;. This implies a multiplication and shift that I would like to avoid, but I can't find a way. Maybe it's not possible.
Any other way I could do this without jump and only using bit-wise operators and arithmetic?
Thanks
This will differ by architecture, but the code doesn't compile to a jump on Intel CPUs.
This code:
int square(int in) {
    int out = (in != 0);
    return out;
}
is compiled to:
square(int):
xor eax, eax
test edi, edi
setne al
ret
or:
square, COMDAT PROC
xor eax, eax
test ecx, ecx
setne al
ret 0
square ENDP
by msvc, clang, and gcc with -O2:
msvc: https://godbolt.org/g/Mfh2Qj
clang: https://godbolt.org/g/6p7kL1
gcc: https://godbolt.org/g/vUM2Zv
It only compiles to a jump with no optimization, which you would never use anyway.
I've also found the need to do this, to index a length-2 array at 0 for zero values and 1 for non-zero values.
Cast the int to bool, and then back to int. This does not produce a jump on any compiler I've tried (gcc, clang, recent MSVC) except pre-2018 MSVC. I recommend you check the assembly code to make sure on your platform.
int one_if_nonzero_else_zero(int value) { return (bool) value; }
EDIT: This does not satisfy your constraint "only using bit-wise operators and arithmetic" but this cast takes advantage of assembly optimization and will be very efficient.
EDIT: Your "obvious" solution out = (in == 0 ? 0 : 1); results in assembly identical to the solutions posted by Jerry Jeremiah and myself on gcc, clang, and msvc. No jump after optimization! I suggest you use it for clarity.
I have two integer variables in and out. If in is zero, I want out to be zero. If in is not zero, I want out to be one.
Try this:
int in = ...;
int out = !!in;
C++ has an implicit conversion defined from int to bool, so in as a bool will be false when in is 0, and will be true otherwise.
Then the inner ! negates it: !false will be true, and !true will be false.
The outer ! negates it again: !true will be false, and !false will be true.
Then there is also an implicit conversion defined from bool to int, so true as an int will be 1, and false will be 0.
Thus, out will be 0 when in is 0, and will be 1 otherwise.
Related
I reached a bottleneck in my code, so the main issue of this question is performance.
I have a hexadecimal checksum and I want to check the leading zeros of an array of chars. This is what I am doing:
bool starts_with (char* cksum_hex, int n_zero) {
    bool flag {true};
    for (int i=0; i<n_zero; ++i)
        flag &= (cksum_hex[i]=='0');
    return flag;
}
The above function returns true if the cksum_hex has n_zero leading zeros. However, for my application, this function is very expensive (60% of total time). In other words, it is the bottleneck of my code. So I need to improve it.
I also checked std::string::starts_with, which is available in C++20, and I observed no difference in performance:
// I have to convert cksum to string
std::string cksum_hex_s (cksum_hex);
cksum_hex_s.starts_with("000"); // checking for 3 leading zeros
For reference, I am using g++ -O3 -std=c++2a and my gcc version is 9.3.1.
Questions
What is the faster way of checking the leading characters in a char array?
Is there a more efficient way of doing it with std::string::starts_with?
Do bitwise operations help here?
If you modify your function to return early
bool starts_with (char* cksum_hex, int n_zero) {
    for (int i=0; i<n_zero; ++i)
    {
        if (cksum_hex[i] != '0') return false;
    }
    return true;
}
it will be faster when n_zero is big and the result is false. Otherwise, maybe you can try allocating a global array of '0' characters and using std::memcmp:
#include <cstring>  // std::memcmp

// make it as big as you need
constexpr char cmp_array[4] = {'0', '0', '0', '0'};

bool starts_with (char* cksum_hex, int n_zero) {
    return std::memcmp(cksum_hex, cmp_array, n_zero) == 0;
}
The problem here is that you need to assume some max possible value of n_zero.
=== EDIT ===
Considering the complaints about the lack of profiling data to justify the suggested approaches, here you go:
Benchmark results comparing early return implementation with memcmp implementation
Benchmark results comparing memcmp implementation with OP original implementation
Data used:
const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";
memcmp is fastest in all cases except cs2, where the early-return implementation wins.
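For reference, a minimal harness along these lines can reproduce the comparison. The iteration count and the volatile sink are my own illustrative choices, not the exact setup behind the numbers above; a real benchmark framework (e.g. quick-bench) is more reliable:

#include <chrono>
#include <cstdio>
#include <cstring>

constexpr char cmp_array16[16] = {'0','0','0','0','0','0','0','0',
                                  '0','0','0','0','0','0','0','0'};

bool starts_with_memcmp(const char* cksum_hex, int n_zero) {
    return std::memcmp(cksum_hex, cmp_array16, n_zero) == 0;
}

int main() {
    const char* cs3 = "0000000000hsfhjshjshgj";
    volatile bool sink = false;  // volatile: the result must be stored every iteration
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000; ++i)
        sink = starts_with_memcmp(cs3, 10);
    auto t1 = std::chrono::steady_clock::now();
    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%lld ms (last result: %d)\n", ms, (int)sink);
}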
Presumably you also have the binary checksum? Instead of converting it to ASCII text first, look at the 4*n high bits to check n nibbles directly for 0 rather than checking n bytes for equality to '0'.
e.g. if you have the hash (or the high 8 bytes of it) as a uint64_t or unsigned __int128, right-shift it to keep only the high n nibbles.
I showed some examples of how they compile for x86-64 when both inputs are runtime variables, but these also compile nicely to other ISAs like AArch64. This code is all portable ISO C++.
#include <cstdint>

bool starts_with (uint64_t cksum_high8, int n_zero)
{
    int shift = 64 - n_zero * 4;  // A hex digit represents a 4-bit nibble
    return (cksum_high8 >> shift) == 0;
}
clang does a nice job for x86-64 with -O3 -march=haswell to enable BMI1/BMI2
high_zero_nibbles(unsigned long, int):
shl esi, 2
neg sil # x86 shifts wrap the count so 64 - c is the same as -c
shrx rax, rdi, rsi # BMI2 variable-count shifts save some uops.
test rax, rax
sete al
ret
This even works for n=16 (shift=0) to test all 64 bits. It fails for n_zero = 0 to test none of the bits; it would encounter UB by shifting a uint64_t by a shift count >= its width. (On ISAs like x86 that wrap out-of-bounds shift counts, code-gen that worked for other shift counts would result in checking all 16 bits. As long as the UB wasn't visible at compile time...) Hopefully you're not planning to call this with n_zero=0 anyway.
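If n_zero = 0 is a possible input, a guarded variant (my addition, not part of the code above) avoids the out-of-range shift:

#include <cstdint>

// Same shift trick, but also defined for n_zero == 0:
// an empty prefix of zeros trivially matches.
bool starts_with_guarded (uint64_t cksum_high8, int n_zero)
{
    if (n_zero == 0) return true;  // avoid shifting a uint64_t by 64 (UB)
    return (cksum_high8 >> (64 - n_zero * 4)) == 0;
}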
Other options: create a mask that keeps only the high n*4 bits, perhaps shortening the critical path through cksum_high8 if that's ready later than n_zero. Especially if n_zero is a compile-time constant after inlining, this can be as fast as checking cksum_high8 == 0. (e.g. x86-64 test reg, immediate.)
bool high_zero_nibbles_v2 (uint64_t cksum_high8, int n_zero) {
    int shift = 64 - n_zero * 4;  // A hex digit represents a 4-bit nibble
    uint64_t low4n_mask = (1ULL << shift) - 1;
    return (cksum_high8 & ~low4n_mask) == 0;  // true when the high n nibbles are all zero
}
Or use a bit-scan function to count leading zero bits and compare for >= 4*n. Unfortunately it took ISO C++ until C++20 <bit>'s countl_zero to finally portably expose this common CPU feature that's been around for decades (e.g. 386 bsf / bsr); before that only as compiler extensions like GNU C __builtin_clz.
This is great if you want to know how many and don't have one specific cutoff threshold.
bool high_zero_nibbles_lzcnt (uint64_t cksum_high8, int n_zero) {
    // UB on cksum_high8 == 0. Use x86-64 BMI1 _lzcnt_u64 to avoid that, guaranteeing 64 on input=0
    return __builtin_clzll(cksum_high8) >= 4*n_zero;  // >= so a set bit right after the zeros still counts
}
#include <bit>
bool high_zero_nibbles_stdlzcnt (uint64_t cksum_high8, int n_zero) {
    return std::countl_zero(cksum_high8) >= 4*n_zero;
}
These compile to (clang for Haswell):
high_zero_nibbles_lzcnt(unsigned long, int):
    lzcnt rax, rdi
    shl esi, 2
    cmp esi, eax
    setle al  # FLAGS -> boolean integer return value
    ret
All these instructions are cheap on Intel and AMD, and there's even some instruction-level parallelism between lzcnt and shl.
See asm output for all 4 of these on the Godbolt compiler explorer. Clang compiles 1 and 2 to identical asm. Same for both lzcnt ways with -march=haswell. Otherwise it needs to go out of its way to handle the bsr corner case for input=0, for the C++20 version where that's not UB.
To extend these to wider hashes, you can check the high uint64_t for being all-zero, then proceed to the next uint64_t chunk.
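A sketch of that chunked extension, assuming the hash is available as 64-bit words with the most significant word first (the layout and the helper name are my assumptions):

#include <cstdint>

// Check n_zero leading zero nibbles across an array of 64-bit words.
// Each word covers 16 hex digits of the checksum.
bool starts_with_wide (const uint64_t* hash_words, int n_zero)
{
    while (n_zero >= 16) {  // whole words must be all-zero
        if (*hash_words++ != 0) return false;
        n_zero -= 16;
    }
    if (n_zero == 0) return true;
    return (*hash_words >> (64 - n_zero * 4)) == 0;  // partial word
}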
Using an SSE2 compare with pcmpeqb on the string, pmovmskb -> bsf could find the position of the first 1 bit, thus how many leading-'0' characters there were in the string representation, if you have that to start with. So x86 SIMD can do this very efficiently, and you can use that from C++ via intrinsics.
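As a sketch of that SIMD idea using intrinsics (assuming x86 with SSE2, at least 16 readable bytes at the pointer, and GNU __builtin_ctz for the bit-scan):

#include <immintrin.h>

// Count leading '0' characters in the first 16 bytes of s.
int leading_zero_chars_sse2 (const char* s)
{
    __m128i data  = _mm_loadu_si128((const __m128i*)s);
    __m128i zeros = _mm_set1_epi8('0');
    __m128i eq    = _mm_cmpeq_epi8(data, zeros);       // 0xFF where byte == '0'
    unsigned mism = ~_mm_movemask_epi8(eq) & 0xFFFFu;  // one bit per non-'0' byte
    return mism ? __builtin_ctz(mism) : 16;            // position of the first mismatch
}

starts_with(cksum_hex, n_zero) then becomes leading_zero_chars_sse2(cksum_hex) >= n_zero for n_zero up to 16.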
You can make a buffer of zeros large enough for you, then compare with memcmp.
const char *zeroBuffer = "000000000000000000000000000000000000000000000000000";

if (memcmp(zeroBuffer, cksum_hex, n_zero) == 0) {
    // ...
}
Things you want to check to make your application faster:
1. Can the compiler inline this function in places where it is called?
Either declare the function as inline in a header or put the definition in the compilation unit where it is used.
2. Not computing something is faster than computing something more efficiently
Are all calls to this function necessary? High cost is generally the sign of a function called inside a high-frequency loop or in an expensive algorithm. You can often reduce the call count, and hence the time spent in the function, by optimizing the outer algorithm.
3. Is n_zero small or, even better, a constant?
Compilers are pretty good at optimizing algorithms for typically small constant values. If the constant is known to the compiler, it will most likely remove the loop entirely, as the sketch below shows.
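For instance, a hypothetical template variant (my illustration, not code from the question) makes the count a compile-time constant:

template <int NZero>
bool starts_with_fixed (const char* cksum_hex)
{
    bool flag {true};
    for (int i = 0; i < NZero; ++i)     // trip count is known at compile time,
        flag &= (cksum_hex[i] == '0');  // so the loop is typically fully unrolled
    return flag;
}

// usage: if (starts_with_fixed<3>(cksum_hex)) { /* ... */ }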
4. Does the bitwise operation help here?
It definitely has an effect and allows Clang (but not GCC, as far as I can tell) to do some vectorization. Vectorization tends to be faster, but that's not always the case, depending on your hardware and the actual data processed.
Whether it is an optimization or not might depend on how big n_zero is. Considering you are processing checksums, it should be pretty small so it sounds like a potential optimization.
For a known n_zero, using the bitwise operation allows the compiler to remove all branching. I expect, though I did not measure it, that this will be faster.
std::all_of and std::string::starts_with should compile to essentially the same code as your implementation, except that they will use && (short-circuiting) instead of &.
Unless n_zero is quite high I would agree with others that you may be misinterpreting profiler results. But anyway:
Could the data be swapped to disk? If your system is under RAM pressure, data could be swapped out to disk and need to be loaded back to RAM when you perform the first operation on it. (Assuming this checksum check is the first access to the data in a while.)
Chances are you could use multiple threads/processes to take advantage of a multicore processor.
Maybe you could use statistics/correlation of your input data, or other structural features of your problem.
For instance, if you have a large number of digits (e.g. 50) and you know that the later digits have a higher probability of being nonzero, you can check the last one first.
If nearly all of your checksums should match, you can use [[likely]] to give a compiler hint that this is the case. (Probably won't make a difference but worth a try.)
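As a sketch of that hint (C++20 attributes; whether it changes code generation here is compiler-dependent), marking the mismatch branch cold expresses the same expectation:

bool starts_with (const char* cksum_hex, int n_zero)
{
    for (int i = 0; i < n_zero; ++i)
        if (cksum_hex[i] != '0') [[unlikely]]  // mismatches assumed rare
            return false;
    return true;
}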
Adding my two cents to this interesting discussion, though a little late to the game: you could use std::equal. It's a fast method with a slightly different approach, using a hardcoded string with the maximum number of zeros instead of the number of zeros.
You pass the function pointers to the beginning and end of the string to be searched, and to the beginning and end of the string of zeros, with the end pointing one past the wanted number of zeros. These pointers are used as iterators by std::equal:
Sample
#include <algorithm>
#include <iostream>

bool startsWith(const char* str, const char* end, const char* substr, const char* subend) {
    return std::equal(str, end, substr, subend);
}

int main() {
    const char* str = "000x1234567";
    const char* substr = "0000000000000000000000000000";
    std::cout << startsWith(&str[0], &str[3], &substr[0], &substr[3]);
}
Using the test cases in @pptaszni's good answer and the same testing conditions:
const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";
The results were as follows:
It was slower than using memcmp but still faster (except for false results with a low number of zeros) and more consistent than your original code.
Use std::all_of
return std::all_of(cksum_hex, cksum_hex + n_zero, [](char c){ return c == '0'; });
In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag manipulating instructions, in particular, to ADC but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:
constexpr unsigned X = 0;
unsigned f1(unsigned a, unsigned b) {
    b += a;
    unsigned c = b < a;
    return c + b + X;
}
Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks like I got lucky, and the code generated by g++ -O3 (version 9.1) is this:
f1(unsigned int, unsigned int):
add %edi,%esi
mov %esi,%eax
adc $0x0,%eax
retq
For all values of X that I've tested the code is as above (except, of course for the immediate value $0x0 that changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u, ... it really doesn't matter how you spell it) the generated code is:
f1(unsigned int, unsigned int):
xor %eax,%eax
add %edi,%esi
setb %al
lea -0x1(%rsi,%rax,1),%eax
retq
This seems less efficient than the initial code, as suggested by indirect measurements (not very scientific, though). Am I right? If so, is this a "missed optimization opportunity" kind of bug that is worth reporting?
For what it's worth, clang -O3 (version 8.0.0) always uses ADC (as I wanted) and icc -O3 (version 19.0.1) never does.
I've tried using the intrinsic _addcarry_u32 but it didn't help.
#include <immintrin.h>  // for _addcarry_u32

unsigned f2(unsigned a, unsigned b) {
    b += a;
    unsigned char c = b < a;
    _addcarry_u32(c, b, X, &b);
    return b;
}
I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)
I might, actually, be using it correctly. For X == 0 I'm happy:
f2(unsigned int, unsigned int):
add %esi,%edi
mov %edi,%eax
adc $0x0,%eax
retq
For X == -1 I'm unhappy :-(
f2(unsigned int, unsigned int):
add %esi,%edi
mov $0xffffffff,%eax
setb %dl
add $0xff,%dl
adc %edi,%eax
retq
I do get the ADC but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)
mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs.¹
This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.
I don't know what exactly it saw / was looking for, so yes you should report this as a missed-optimization bug. Or if you want to dig deeper yourself, you could look at the GIMPLE or RTL output after optimization passes and see what happens. If you know anything about GCC's internal representations. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".
The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)
That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry so you can't just use the sum < a+b idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.
e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc because it needs to ignore carry in the special case where a = -1 and carry_in = 1.
So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.
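For completeness, a correct single-limb add-with-carry in pure C++ has to test for carry-out of both additions; a sketch (the names are mine):

#include <cstdint>

// One limb of extended-precision addition: sum = a + b + carry_in, with a
// carry_out that is correct even when a == UINT64_MAX and carry_in == 1.
uint64_t add_limb (uint64_t a, uint64_t b, unsigned carry_in, unsigned* carry_out)
{
    uint64_t partial = a + b;
    unsigned c1 = partial < a;    // carry out of a + b
    uint64_t sum = partial + carry_in;
    unsigned c2 = sum < partial;  // carry out of adding carry_in
    *carry_out = c1 | c2;         // the two carries can never both be set
    return sum;
}

Whether gcc or clang can turn this pair of checks into a single adc is exactly the pattern-recognition problem described above.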
What's the point of using it since it's up to me to provide the carry flag?
You're using _addcarry_u32 correctly.
The point of its existence is to let you express an add with carry in as well as carry out, which is hard in pure C. GCC and clang don't optimize it well, often not just keeping the carry result in CF.
If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.
e.g. to add two 128-bit integers in 32-bit chunks, you can do this
// bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
    unsigned char carry;
    carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
    carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
    carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
    carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}
(On Godbolt with GCC/clang/ICC)
That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.
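For comparison, a sketch of the __int128 version (unsigned __int128 is a GNU extension available on 64-bit targets):

// GCC and clang compile this to one 64-bit add plus one adc.
void add_128bit(unsigned __int128 *__restrict dst,
                const unsigned __int128 *__restrict src)
{
    *dst += *src;
}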
GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.
Footnote 1: On uop count, they are equal on Intel Haswell and earlier, where adc is 2 uops, except with a zero immediate, where Sandybridge-family's decoders special-case it as 1 uop.
But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.
On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.
So equal total uop count but worse latency means that adc would still be a better choice.
https://agner.org/optimize/
I ran into a bug in my program:
for (int i = 0; i < objArray.size() - 1; ++i)
In my case objArray.size() returns an unsigned long long, and for an empty vector, size() - 1 wraps around to about 18 quintillion. I was wondering: does the loop have to cast an int to an unsigned long long on every iteration? I checked the assembly, and while using an int creates different code than size_t without optimisations, with -O2 specified it generates exactly the same assembly. Does this mean it's not implicitly casting?
I don't understand assembly, but the code it generated was:
test rcx, rcx
je .L32
add rdx, rax
and then:
cmp rdx, rax
jne .L28
This may be caused by a compiler optimization. The C++ standard says that overflow of signed integral types is undefined. Here, i starts at 0. Supposing that i is not otherwise written to in the loop, the compiler can deduce that i >= 0, since overflowing would be undefined behaviour and that case can be pruned.
Normally, for a signed-unsigned comparison, the signed value would have to be converted to the unsigned type following the rules you can see here. These rules are the reason for the compiler warnings when comparing a signed and an unsigned type (which leads to confusion, e.g. -1 > 2U is true). In this case, that doesn't matter though.
With the assumption i >= 0 and two's-complement signed types, though, the compiler can safely reinterpret i as an unsigned long long, since it knows the sign bit is 0. That's what your assembly output shows.
Now we can see that there is indeed a bug. Suppose objArray.size() - 1 does not fit into a positive signed int. Then i would eventually overflow, causing undefined behaviour, which is always bad news.
Let's dissect the code:
for (int i = 0; i < objArray.size() - 1; ++i)
You are doing a comparison between an int and a size_t. size() - 1 underflows when the array is empty and results in a value of std::numeric_limits<size_t>::max(). The comparison will be signed/unsigned and use the type promotion rules outlined here: Signed/unsigned comparisons
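A sketch of a fix that avoids both the empty-vector underflow and the signed/unsigned mismatch is to compare i + 1 against size() in the size type:

// i + 1 < size() is equivalent to i < size() - 1 when size() > 0,
// but never underflows on an empty vector.
for (std::size_t i = 0; i + 1 < objArray.size(); ++i) {
    // ... loop body ...
}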
When removing conditional branches from high-performance code, it can be useful to convert a true boolean into an unsigned long with all bits set (i.e. the value -1).
I came up with a way to obtain this integer mask from an input int b (or bool b) taking the value 1 or 0:
unsigned long boolean_mask = -(!b);
To get the opposite value:
unsigned long boolean_mask = -b;
Has anybody seen this construction before? Am I on to something? When an int value of -1 (which I assume -b or -(!b) produces) is converted to a bigger unsigned integer type, is it guaranteed to set all the bits?
Here's the context:
uint64_t ffz_flipped = ~i&~(~i-1); // least sig bit unset
// only set our least unset bit if we are not pow2-1
i |= (ffz_flipped < i) ? ffz_flipped : 0;
I will inspect the generated asm before asking questions like this next time. Sounds very likely the compiler will not burden the cpu with a branch here.
The question you should be asking yourself is this: If you write:
int it_was_true = b > c;
then it_was_true will be either 1 or 0. But where did that 1 come from?
The machine's instruction set doesn't contain an instruction of the form:
Compare R1 with R2 and store either 1 or 0 in R3
or, indeed, anything like that. (I put a note on SSE at the end of this answer, illustrating that the former statement is not quite true.) The machine has an internal condition register, consisting of several condition bits, and the compare instruction -- and a number of other arithmetic operations -- cause those condition bits to be modified in specific ways. Subsequently, you can do a conditional branch, based on some condition bits, or a conditional load, and sometimes other conditional operations.
So actually, it could be a lot less efficient to store that 1 in a variable than it would have been to have directly done some conditional operation. Could have been, but maybe not, because the compiler (or at least, the guys who wrote the compiler) may well be cleverer than you, and it might just remember that it should have put a 1 into it_was_true so that when you actually get around to checking the value, the compiler can emit an appropriate branch or whatever.
So, speaking of clever compilers, you should take a careful look at the assembly code produced by:
uint64_t ffz_flipped = ~i&~(~i-1);
Looking at that expression, I can count five operations: three bitwise negations, one bitwise conjunction (and), and one subtract. But you won't find five operations in the assembly output (at least, if you use gcc -O3). You'll find three.
Before we look at the assembly output, let's do some basic algebra. Here's the most important identity:
-X == ~X + 1
Can you see why that's true? -X, in 2's complement, is just another way of saying 2ⁿ − X, where n is the number of bits in the word. In fact, that's why it's called "2's complement". What about ~X? Well, we can think of that as the result of subtracting every bit in X from the corresponding power of 2. For example, if we have four bits in our word, and X is 0101 (which is 5, or 2² + 2⁰), then ~X is 1010, which we can think of as 2³×(1−0) + 2²×(1−1) + 2¹×(1−0) + 2⁰×(1−1), which is exactly the same as 1111 − 0101. Or, in other words:
−X == 2ⁿ − X
~X == (2ⁿ − 1) − X
which means that
~X == (−X) − 1
Remember that we had
ffz_flipped = ~i&~(~i-1);
But we now know that we can change ~(~i−1) into minus operations:
~(~i − 1)
== −(~i − 1) − 1
== −(−i − 1 − 1) − 1
== (i + 2) − 1
== i + 1
How cool is that! We could have just written:
ffz_flipped = ~i & (i + 1);
which is only three operations, instead of five.
Now, I don't know if you followed that, and it took me a bit of time to get it right, but now let's look at gcc's output:
leaq 1(%rdi), %rdx # rdx = rdi + 1
movq %rdi, %rax # rax = rdi
notq %rax # rax = ~rax
andq %rax, %rdx # rdx &= rax
So gcc just went and figured all that out on its own.
The promised note about SSE: It turns out that SSE can do parallel comparisons, even to the point of doing 16 byte-wise comparisons at a time between two 16-byte registers. Condition registers weren't designed for that, and anyway no-one wants to branch when they don't have to. So the CPU does actually change one of the SSE registers (a vector of 16 bytes, or 8 "words" or 4 "double words", whatever the operation says) into a vector of boolean indicators. But it doesn't use 1 for true; instead, it uses a mask of all 1s. Why? Because it's likely that the next thing the programmer is going to do with that comparison result is use it to mask out values, which I think is just exactly what you were planning to do with your -(!B) trick, except in the parallel streaming version.
So, rest assured, it's been covered.
Has anybody seen this construction before? Am I on to something?
Many people have seen it. It's old as rocks. It's not unusual but you should encapsulate it in an inline function to avoid obfuscating your code.
And verify that your compiler is actually producing branches on the old code, that it is configured properly, and that this micro-optimization actually improves performance. (It's also a good idea to keep notes on how much time each optimization strategy cuts.)
On the plus side, it is perfectly standard-compliant.
When an int value of -1 (which I assume -b or -(!b) produces) is converted to a bigger unsigned integer type, is it guaranteed to set all the bits?
Just be careful that b is not already unsigned. Since unsigned values are always non-negative, the negation happens in the unsigned type itself, and converting the result to a bigger type zero-extends it rather than filling it with more ones.
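A small illustration of that pitfall, assuming a 32-bit unsigned int and a 64-bit unsigned long (the exact widths are an assumption):

unsigned b = 1;                           // unsigned, not bool or int
unsigned long bad  = -b;                  // -b is unsigned int 0xFFFFFFFFu
// bad == 0x00000000FFFFFFFF: zero-extended, NOT all ones.
unsigned long good = -(unsigned long)b;   // widen first, then negate
// good == 0xFFFFFFFFFFFFFFFF as intended.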
If you have different sizes and want to be anal, try this:
template< typename uint >
uint mask_cast( bool f )
{ return static_cast< uint >( - ! f ); }
#include <type_traits>

struct full_mask {
    bool b;
    full_mask(bool b_) : b(b_) {}

    template<
        typename int_type,
        typename = typename std::enable_if<std::is_unsigned<int_type>::value>::type
    >
    operator int_type() const {
        return -b;
    }
};
use:
unsigned long long_mask = full_mask(b);
unsigned char char_mask = full_mask(b);
char char_mask2 = full_mask(b); // does not compile
Basically, I use the class full_mask to deduce the type we are casting to and automatically generate a bit-filled unsigned value of that type. I tossed in some SFINAE code to check that the type I'm converting to is an unsigned integer type.
You can convert 1 / 0 to 0 / -1 just by decrementing. That inverts the boolean condition, but if you can generate the inverse of the boolean in the first place, or use the inverse of the mask, then it's only a single operation instead of two.
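For example (with b holding 1 for true, 0 for false):

// Decrementing turns 1/0 into 0/all-ones, so the mask now selects the
// false case: invert b first, or swap the operands the mask selects between.
unsigned long inverted_mask = (unsigned long)b - 1;  // b==1 -> 0, b==0 -> ~0UL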
What can be a reason for converting an integer to a boolean in this way?
bool booleanValue = !!integerValue;
instead of just
bool booleanValue = integerValue;
All I know is that in VC++7 the latter will cause C4800 warning and the former will not. Is there any other difference between the two?
The problems with the "!!" idiom are that it's terse, hard to see, easy to mistake for a typo, easy to drop one of the "!'s", and so forth. I put it in the "look how cute we can be with C/C++" category.
Just write bool isNonZero = (integerValue != 0); ... be clear.
Historically, the !! idiom was used to ensure that your bool really contained one of the two values expected in a bool-like variable, because C and C++ didn't have a true bool type and we faked it with ints. This is less of an issue now with "real" bools.
But using !! is an efficient means of documenting (for both the compiler and any future people working in your code) that yes, you really did intend to cast that int to a bool.
It is used because the C language (and some pre-standard C++ compilers too) didn't have the bool type, just int. So ints were used to represent logical values: 0 was supposed to mean false, and everything else was true. The ! operator returns 1 for 0 and 0 for everything else. Double ! was used to invert those, and it was there to make sure that the value is exactly 0 or 1, depending on its logical value.
In C++, since the introduction of a proper bool type, there's no need to do that anymore. But you cannot just update all legacy sources, and you shouldn't have to, due to the backward compatibility of C with C++ (most of the time). But many people still do it, for the same reason: to keep their code backward-compatible with old compilers which still don't understand bools.
And this is the only real answer. Other answers are misleading.
Because !integerValue means integerValue == 0, and !!integerValue thus means integerValue != 0, a valid expression returning a bool. The plain assignment, by contrast, is a cast with information loss.
Another option is the ternary operator, which appears to generate one less line of assembly code (in Visual Studio 2005, anyway):
bool ternary_test = ( int_val == 0 ) ? false : true;
which produces the assembly code:
cmp DWORD PTR _int_val$[ebp], 0
setne al
mov BYTE PTR _ternary_test$[ebp], al
Versus:
bool not_equal_test = ( int_val != 0 );
which produces:
xor eax, eax
cmp DWORD PTR _int_val$[ebp], 0
setne al
mov BYTE PTR _not_equal_test$[ebp], al
I know it isn't a huge difference but I was curious about it and just thought that I would share my findings.
A bool can only have two states, 0 and 1. An integer can have any state from -2147483648 to 2147483647, assuming a signed 32-bit integer. The unary ! operator outputs 1 if the input is 0 and outputs 0 if the input is anything except 0. So !0 = 1 and !234 = 0. The second ! simply switches the output, so 0 becomes 1 and 1 becomes 0.
So the first statement guarantees that booleanValue will be set equal to either 0 or 1 and no other value; the second statement does not.
!! is an idiomatic way to convert to bool, and it works to shut up the Visual C++ compiler's sillywarning about alleged inefficiency of such conversion.
I see by the other answers and comments that many people are not familiar with this idiom's usefulness in Windows programming. Which means they haven't done any serious Windows programming. And assume blindly that what they have encountered is representative (it is not).
#include <iostream>
using namespace std;

int main( int argc, char* argv[] )
{
    bool const b = static_cast< bool >( argc );
    (void) argv;
    (void) b;
}
> [d:\dev\test]
> cl foo.cpp
foo.cpp
foo.cpp(6) : warning C4800: 'int' : forcing value to bool 'true' or 'false' (performance warning)
[d:\dev\test]
> _
And at least one person thinks that if an utter novice does not recognize its meaning, then it's ungood. Well that's stupid. There's a lot that utter novices don't recognize or understand. Writing one's code so that it will be understood by any utter novice is not something for professionals. Not even for students. Starting on the path of excluding operators and operator combinations that utter novices don't recognize... Well I don't have the words to give that approach an appropriate description, sorry.
The answer of user143506 is correct, but to address a possible performance issue I compared the possibilities in asm:
return x;, return x != 0;, return !!x;, and even return boolean_cast<bool>(x) all result in this perfect set of asm instructions:
test edi/ecx, edi/ecx
setne al
ret
This was tested for GCC 7.1 and MSVC 19 2017. (Only the boolean_converter in MSVC 19 2017 results in more asm code, but this is caused by the templates and structures and can be neglected from a performance point of view, because the same lines as noted above are just duplicated across different functions with the same runtime.)
This means: There is no performance difference.
PS: This boolean_cast was used:
#define BOOL int

// primary template
template< class TargetT, class SourceT >
struct boolean_converter;

// full specialization
template< >
struct boolean_converter<bool, BOOL>
{
    static bool convert(BOOL b)
    {
        return b ? true : false;
    }
};
template< class TargetT, class SourceT >
TargetT boolean_cast(SourceT b)
{
    typedef boolean_converter<TargetT, SourceT> converter_t;
    return converter_t::convert(b);
}

bool is_non_zero(int x) {
    return boolean_cast< bool >(x);
}
No big reason, except being paranoid or yelling through code that it's a bool.
For the compiler, in the end, it won't make a difference.
I've never liked this technique of converting to a bool data type - it smells wrong!
Instead, we're using a handy template called boolean_cast found here. It's a flexible solution that's more explicit about what it's doing and can be used as follows:
bool IsWindow = boolean_cast< bool >(::IsWindow(hWnd));