Successfully enabling -fno-finite-math-only on NaN removal method - c++

While tracking down a bug that made everything turn into NaNs when running the optimized version of my code (compiled with g++ 4.8.2 and 4.9.3), I identified the problem as the -Ofast option, specifically the -ffinite-math-only flag it includes.
One part of the code involves reading floats from a FILE* using fscanf, and then replacing all NaNs with a numeric value. As might be expected, however, -ffinite-math-only kicks in and removes these checks, leaving the NaNs in place.
In trying to solve this problem, I stumbled upon this, which suggested adding -fno-finite-math-only as a function attribute to disable the optimization for that specific function. The following illustrates the problem and the attempted fix (which doesn't actually fix it):
#include <cstdio>
#include <cmath>
__attribute__((optimize("-fno-finite-math-only")))
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
int main(void){
const size_t cnt = 10;
float val[cnt];
for(int i = 0; i < cnt; i++) scanf("%f", val + i);
replaceNaN(val, cnt, -1.0f);
for(int i = 0; i < cnt; i++) printf("%f ", val[i]);
return 0;
}
The code does not act as desired if compiled/run using echo 1 2 3 4 5 6 7 8 nan 10 | (g++ -ffinite-math-only test.cpp -o test && ./test); specifically, it outputs a nan (which should have been replaced by a -1.0f). It behaves fine if the -ffinite-math-only flag is omitted. Shouldn't this work? Am I missing something with the syntax for attributes in gcc, or is this one of the aforementioned cases of "there being some trouble with some version of GCC related to this" (from the linked SO question)?
A few solutions I'm aware of, though I'd rather have something a bit cleaner/more portable:
Compile the whole program with -fno-finite-math-only (my interim solution): I suspect the optimization may be rather useful in the remainder of the program;
Manually look for the string "nan" in the input stream and replace the value there (the input reader is in an unrelated part of the library, so putting this test there would be poor design); see the sketch after this list.
Assume a specific floating point architecture and make my own isNaN: I may do this, but it's a bit hackish and non-portable.
Prefilter the data using a separately compiled program without the -ffinite-math-only flag, and then feed that into the main program: The added complexity of maintaining two binaries and getting them to talk to each other just isn't worth it.
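As a rough sketch of the second option, here is one way the replacement could be done at parse time, before the value ever becomes a float (the helper name and the 64-character token buffer are my own choices, not code from the question):
#include <cstdio>
#include <cstring>
#include <cstdlib>
// Read one whitespace-separated token and convert it, mapping any "nan"/"NAN"
// token to `replacement` so that isnan never has to be called later.
// Returns false on EOF or read failure.
bool readFloatReplacingNaN(FILE* in, float* out, float replacement) {
    char token[64];
    if (std::fscanf(in, "%63s", token) != 1)
        return false;
    const char* p = (token[0] == '+' || token[0] == '-') ? token + 1 : token;  // skip optional sign
    if (std::strncmp(p, "nan", 3) == 0 || std::strncmp(p, "NAN", 3) == 0) {
        *out = replacement;  // replace the NaN before it is ever stored
        return true;
    }
    *out = std::strtof(token, nullptr);
    return true;
}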
Edit: As noted in the accepted answer, it would seem this is a compiler "bug" in older versions of g++, such as 4.8.2 and 4.9.3, that is fixed in newer versions, such as 5.1 and 6.1.1.
If for some reason updating the compiler is not a reasonably easy option (e.g. no root access), or if adding this attribute to a single function still doesn't completely solve the NaN check problem, an alternative solution, provided you can be certain the code will always run in an IEEE754 floating-point environment, is to manually check the bits of the float for a NaN signature.
The accepted answer suggests doing this using a bit field; however, the order in which the compiler places the members of a bit field is not standardized, and in fact it changes between the older and newer versions of g++. The older versions (4.8.2 and 4.9.3) always place the mantissa first, regardless of the order in which the members appear in the code.
A solution using bit manipulation, however, is guaranteed to work on all IEEE754-compliant compilers. Below is my implementation, which I ultimately used to solve the problem. It checks for IEEE754 compliance, and I've extended it to handle doubles as well as other routine floating-point bit manipulations.
#include <cstdint> // uint32_t / uint64_t
#include <limits> // IEEE754 compliance test
#include <type_traits> // enable_if
template<
typename T,
typename = typename std::enable_if<std::is_floating_point<T>::value>::type,
typename = typename std::enable_if<std::numeric_limits<T>::is_iec559>::type,
typename u_t = typename std::conditional<std::is_same<T, float>::value, std::uint32_t, std::uint64_t>::type
>
struct IEEE754 {
enum class WIDTH : size_t {
SIGN = 1,
EXPONENT = std::is_same<T, float>::value ? 8 : 11,
MANTISSA = std::is_same<T, float>::value ? 23 : 52
};
enum class MASK : u_t {
SIGN = (u_t)1 << (sizeof(u_t) * 8 - 1),
EXPONENT = ((~(u_t)0) << (size_t)WIDTH::MANTISSA) ^ (u_t)MASK::SIGN,
MANTISSA = (~(u_t)0) >> ((size_t)WIDTH::SIGN + (size_t)WIDTH::EXPONENT)
};
union {
T f;
u_t u;
};
IEEE754(T f) : f(f) {}
inline u_t sign() const { return (u & (u_t)MASK::SIGN) >> ((size_t)WIDTH::EXPONENT + (size_t)WIDTH::MANTISSA); } // parentheses matter: & binds more loosely than >>
inline u_t exponent() const { return (u & (u_t)MASK::EXPONENT) >> (size_t)WIDTH::MANTISSA; }
inline u_t mantissa() const { return u & (u_t)MASK::MANTISSA; }
inline bool isNan() const {
return (mantissa() != 0) && ((u & ((u_t)MASK::EXPONENT)) == (u_t)MASK::EXPONENT);
}
};
template<typename T>
inline IEEE754<T> toIEEE754(T val) { return IEEE754<T>(val); }
And the replaceNaN function now becomes:
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++)
if (toIEEE754(arr[i]).isNan()) arr[i] = newValue;
}
An inspection of the assembly of these functions reveals that, as expected, all masks become compile-time constants, leading to the following (seemingly) efficient code:
# In loop of replaceNaN
movl (%rcx), %eax # eax = arr[i]
testl $8388607, %eax # Check if mantissa is empty
je .L3 # If it is, it's not a nan (it's inf), continue loop
andl $2139095040, %eax # Mask leaves only exponent
cmpl $2139095040, %eax # Test if exponent is all 1s
jne .L3 # If it isn't, it's not a nan, so continue loop
This is one instruction fewer than a working bit-field solution (no shift), and the same number of registers is used (although it's tempting to say this alone makes it more efficient, there are other concerns, such as pipelining, which may make one solution more or less efficient than the other).

Looks like a compiler bug to me. Up through GCC 4.9.2, the attribute is completely ignored. GCC 5.1 and later pay attention to it. Perhaps it's time to upgrade your compiler?
__attribute__((optimize("-fno-finite-math-only")))
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
Compiled with -ffinite-math-only on GCC 4.9.2:
replaceNaN(float*, int, float):
rep ret
But with the exact same settings on GCC 5.1:
replaceNaN(float*, int, float):
test esi, esi
jle .L26
sub rsp, 8
call std::isnan(float) [clone .isra.0]
test al, al
je .L2
mov rax, rdi
and eax, 15
shr rax, 2
neg rax
and eax, 3
cmp eax, esi
cmova eax, esi
cmp esi, 6
jg .L28
mov eax, esi
.L5:
cmp eax, 1
movss DWORD PTR [rdi], xmm0
je .L16
cmp eax, 2
movss DWORD PTR [rdi+4], xmm0
je .L17
cmp eax, 3
movss DWORD PTR [rdi+8], xmm0
je .L18
cmp eax, 4
movss DWORD PTR [rdi+12], xmm0
je .L19
cmp eax, 5
movss DWORD PTR [rdi+16], xmm0
je .L20
movss DWORD PTR [rdi+20], xmm0
mov edx, 6
.L7:
cmp esi, eax
je .L2
.L6:
mov r9d, esi
lea r8d, [rsi-1]
mov r11d, eax
sub r9d, eax
lea ecx, [r9-4]
sub r8d, eax
shr ecx, 2
add ecx, 1
cmp r8d, 2
lea r10d, [0+rcx*4]
jbe .L9
movaps xmm1, xmm0
lea r8, [rdi+r11*4]
xor eax, eax
shufps xmm1, xmm1, 0
.L11:
add eax, 1
add r8, 16
movaps XMMWORD PTR [r8-16], xmm1
cmp ecx, eax
ja .L11
add edx, r10d
cmp r9d, r10d
je .L2
.L9:
movsx rax, edx
movss DWORD PTR [rdi+rax*4], xmm0
lea eax, [rdx+1]
cmp eax, esi
jge .L2
add edx, 2
cdqe
cmp esi, edx
movss DWORD PTR [rdi+rax*4], xmm0
jle .L2
movsx rdx, edx
movss DWORD PTR [rdi+rdx*4], xmm0
.L2:
add rsp, 8
.L26:
rep ret
.L28:
test eax, eax
jne .L5
xor edx, edx
jmp .L6
.L20:
mov edx, 5
jmp .L7
.L19:
mov edx, 4
jmp .L7
.L18:
mov edx, 3
jmp .L7
.L17:
mov edx, 2
jmp .L7
.L16:
mov edx, 1
jmp .L7
The output is similar, although not quite identical, on GCC 6.1.
Replacing the attribute with
#pragma GCC push_options
#pragma GCC optimize ("-fno-finite-math-only")
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
#pragma GCC pop_options
makes absolutely no difference, so this is not just a quirk of the attribute syntax. These older versions of the compiler simply do not support controlling the floating-point optimization behavior at function-level granularity.
Note, however, that the generated code on GCC 5.1 and later is still significantly worse than compiling the function without the -ffinite-math-only switch:
replaceNaN(float*, int, float):
test esi, esi
jle .L1
lea eax, [rsi-1]
lea rax, [rdi+4+rax*4]
.L5:
movss xmm1, DWORD PTR [rdi]
ucomiss xmm1, xmm1
jnp .L6
movss DWORD PTR [rdi], xmm0
.L6:
add rdi, 4
cmp rdi, rax
jne .L5
rep ret
.L1:
rep ret
I have no idea why there is such a discrepancy. Something is badly throwing the compiler off its game; this is even worse code than you get with optimizations completely disabled. If I had to guess, I'd speculate it was the implementation of std::isnan. If this replaceNaN method is not speed-critical, then it probably doesn't matter. If you need to repeatedly parse values from a file, you might prefer to have a reasonably efficient implementation.
Personally, I would write my own non-portable implementation of std::isnan. The IEEE 754 formats are all quite well-documented, and assuming you thoroughly test and comment the code, I can't see the harm in this, unless you absolutely need the code to be portable to all different architectures. It will drive purists up the wall, but so should using non-standard options like -ffinite-math-only. For a single-precision float, something like:
bool my_isnan(float value)
{
union IEEE754_Single
{
float f;
struct
{
#if BIG_ENDIAN
uint32_t sign : 1;
uint32_t exponent : 8;
uint32_t mantissa : 23;
#else
uint32_t mantissa : 23;
uint32_t exponent : 8;
uint32_t sign : 1;
#endif
} bits;
} u = { value };
// In the IEEE 754 representation, a float is NaN when
// the mantissa is non-zero, and the exponent is all ones
// (2^8 - 1 == 255).
return (u.bits.mantissa != 0) && (u.bits.exponent == 255);
}
Now there is no need for annotations; just use my_isnan instead of std::isnan. This produces the following object code when compiled with -ffinite-math-only enabled:
replaceNaN(float*, int, float):
test esi, esi
jle .L6
lea eax, [rsi-1]
lea rdx, [rdi+4+rax*4]
.L13:
mov eax, DWORD PTR [rdi] ; get original floating-point value
test eax, 8388607 ; test if mantissa != 0
je .L9
shr eax, 16 ; test if exponent has all bits set
and ax, 32640
cmp ax, 32640
jne .L9
movss DWORD PTR [rdi], xmm0 ; set newValue if original was NaN
.L9:
add rdi, 4
cmp rdx, rdi
jne .L13
rep ret
.L6:
rep ret
The NaN check is slightly more complicated than a simple ucomiss followed by a test of the parity flag, but is guaranteed to be correct as long as your compiler adheres to the IEEE 754 standard. This works on all versions of GCC, and any other compiler.
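If the union type punning in my_isnan bothers you (it is well-defined in C and a documented GCC extension in C++, but not standard C++), an equivalent sketch that copies the bits out with memcpy avoids the question entirely; the helper name is mine:
#include <cstdint>
#include <cstring>
bool my_isnan_bits(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // well-defined way to inspect the representation
    const std::uint32_t exponent = (bits >> 23) & 0xFFu;
    const std::uint32_t mantissa = bits & 0x7FFFFFu;
    return exponent == 0xFFu && mantissa != 0; // all-ones exponent and non-zero mantissa
}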

Related

Compiler optimization for sum of squared numbers [duplicate]

Here is something that I find interesting:
pub fn sum_of_squares(n: i32) -> i32 {
let mut sum = 0;
for i in 1..n+1 {
sum += i*i;
}
sum
}
This is the naive implementation of the sum of squares in Rust. Here is the assembly generated by rustc 1.65.0 with -O3:
lea ecx, [rdi + 1]
xor eax, eax
cmp ecx, 2
jl .LBB0_2
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
lea eax, [rdi - 3]
imul rax, rcx
shr rax
imul eax, eax, 1431655766
shr rcx
lea ecx, [rcx + 4*rcx]
add ecx, eax
lea eax, [rcx + 4*rdi]
add eax, -3
.LBB0_2:
ret
I was expecting it to use the formula for the sum of squared numbers, but it does not. It uses a magical number 1431655766 which I don't understand at all.
Then I wanted to see what clang and gcc do in C++ for the same function
test edi, edi
jle .L8
lea eax, [rdi-1]
cmp eax, 17
jbe .L9
mov edx, edi
movdqa xmm3, XMMWORD PTR .LC0[rip]
xor eax, eax
pxor xmm1, xmm1
movdqa xmm4, XMMWORD PTR .LC1[rip]
shr edx, 2
.L4:
movdqa xmm0, xmm3
add eax, 1
paddd xmm3, xmm4
movdqa xmm2, xmm0
pmuludq xmm2, xmm0
psrlq xmm0, 32
pmuludq xmm0, xmm0
pshufd xmm2, xmm2, 8
pshufd xmm0, xmm0, 8
punpckldq xmm2, xmm0
paddd xmm1, xmm2
cmp eax, edx
jne .L4
movdqa xmm0, xmm1
mov eax, edi
psrldq xmm0, 8
and eax, -4
paddd xmm1, xmm0
add eax, 1
movdqa xmm0, xmm1
psrldq xmm0, 4
paddd xmm1, xmm0
movd edx, xmm1
test dil, 3
je .L1
.L7:
mov ecx, eax
imul ecx, eax
add eax, 1
add edx, ecx
cmp edi, eax
jge .L7
.L1:
mov eax, edx
ret
.L8:
xor edx, edx
mov eax, edx
ret
.L9:
mov eax, 1
xor edx, edx
jmp .L7
.LC0:
.long 1
.long 2
.long 3
.long 4
.LC1:
.long 4
.long 4
.long 4
.long 4
This is gcc 12.2 with -O3. GCC also does not use the sum-of-squares formula, and I don't know why it checks whether the number is greater than 17. For some reason, gcc performs a lot more operations than clang and rustc.
This is clang 15.0.0 with -O3
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
lea eax, [rdi - 3]
imul rax, rcx
shr rax
imul eax, eax, 1431655766
shr rcx
lea ecx, [rcx + 4*rcx]
add ecx, eax
lea eax, [rcx + 4*rdi]
add eax, -3
ret
.LBB0_1:
xor eax, eax
ret
I don't really understand what kind of optimization clang is doing there. But rustc, clang, and gcc don't seem to like n(n+1)(2n+1)/6.
Then I timed their performance. Rust does significantly better than gcc and clang. These are the average results of 100 executions, using an 11th-gen Intel Core i7-11800H @ 2.30 GHz:
Rust: 0.2 microseconds
Clang: 3 microseconds
gcc: 5 microseconds
Can someone explain the performance difference?
Edit
C++:
int sum_of_squares(int n){
int sum = 0;
for(int i = 1; i <= n; i++){
sum += i*i;
}
return sum;
}
EDIT 2
For everyone wondering, here is my benchmark code:
use std::time::Instant;
pub fn sum_of_squares(n: i32) -> i32 {
let mut sum = 0;
for i in 1..n+1 {
sum += i*i;
}
sum
}
fn main() {
let start = Instant::now();
let result = sum_of_squares(1000);
let elapsed = start.elapsed();
println!("Result: {}", result);
println!("Elapsed time: {:?}", elapsed);
}
And in C++:
#include <chrono>
#include <iostream>
int sum_of_squares(int n){
int sum = 0;
for(int i = 1; i <= n; i++){
sum += i*i;
}
return sum;
}
int main() {
auto start = std::chrono::high_resolution_clock::now();
int result = sum_of_squares(1000);
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Result: " << result << std::endl;
std::cout << "Elapsed time: "
<< std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
<< " microseconds" << std::endl;
return 0;
}
I was expecting it to use the formula for the sum of squared numbers, but it does not. It uses a magical number 1431655766 which I don't understand at all.
LLVM does transform that loop into a closed-form formula, but it is different from the naive sum-of-squares formula.
This article explains the formula and the generated code better than I could.
Clang does the same optimization with -O3 in C++, but GCC does not yet. See it on GodBolt. AFAIK, the default Rust compiler uses LLVM internally, like Clang; this is why they produce similar code. GCC uses a naive loop vectorized with SIMD instructions, while Clang uses a formula like the one you gave in the question.
The optimized assembly code from the C++ code is the following:
sum_of_squares(int): # #sum_of_squares(int)
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
lea eax, [rdi - 3]
imul rax, rcx
shr rax
imul eax, eax, 1431655766
shr rcx
lea ecx, [rcx + 4*rcx]
add ecx, eax
lea eax, [rcx + 4*rdi]
add eax, -3
ret
.LBB0_1:
xor eax, eax
ret
This optimization mainly comes from the IndVarSimplify optimization pass. One can see that some variables are encoded on 32 bits while others are encoded on 33 bits (requiring a 64-bit register on mainstream platforms). The code basically does:
if(edi == 0)
return 0;
eax = rdi - 1;
ecx = rdi - 2;
rcx *= rax;
eax = rdi - 3;
rax *= rcx;
rax >>= 1;
eax *= 1431655766;
rcx >>= 1;
ecx = rcx + 4*rcx;
ecx += eax;
eax = rcx + 4*rdi;
eax -= 3;
return eax;
This can be further simplified to the following equivalent C++ code:
if(n == 0)
return 0;
int64_t m = n;
int64_t tmp = ((m - 3) * (m - 1) * (m - 2)) / 2;
tmp = int32_t(int32_t(tmp) * int32_t(1431655766));
return 5 * ((m - 1) * (m - 2) / 2) + tmp + (4*m - 3);
Note that some casts and overflows are ignored for the sake of clarity.
The magic number 1431655766 comes from a correction for overflow related to a division by 3. Indeed, 1431655766 / 2**32 ~= 0.33333333348855376. Clang plays with 32-bit overflow so as to generate a fast implementation of the formula n(n+1)(2n+1)/6.
Division by a constant c on a machine with a 128-bit product is often implemented by multiplying by 2^64 / c. That's where your strange constant comes from.
Now the formula n(n+1)(2n+1) / 6 will overflow for large n (in the intermediate product) while the sum itself still fits, so this formula can only be used very, very carefully.
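To illustrate that reciprocal-multiplication idea with a smaller example of my own (unsigned 32-bit division by 3, using ceil(2^33 / 3) = 0xAAAAAAAB rather than the exact constant clang picked):
#include <cassert>
#include <cstdint>
// floor(n / 3) computed as a multiply by the scaled reciprocal followed by a
// shift; no division instruction is needed, and the result is exact for every
// 32-bit n.
std::uint32_t div3(std::uint32_t n) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(n) * 0xAAAAAAABull) >> 33);
}
int main() {
    for (std::uint32_t n = 0; n < 1000000; ++n)
        assert(div3(n) == n / 3);
    return 0;
}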

check palindrome through recursion c++

bool palindrome(char arr[],int size){
if(size<=1){
return true;
}
if(*(arr)==*(arr+size-1)){
bool small_ans=palindrome(arr+1,size-2);
return small_ans;
}
return false;
}
How efficient is this code for checking whether an array is a palindrome?
There is a compiler optimization called tail-recursion elimination (a form of tail-call optimization).
In your quite simple case the compiler spotted the possibility to use this optimization. As a result it silently turned your code into an iterative version:
https://godbolt.org/z/rsjaYhde6
palindrome(char*, int):
cmp esi, 1
jle .L4
movsx rax, esi
sub esi, 2
shr esi
lea rax, [rdi-1+rax]
lea edx, [rsi+1]
add rdx, rdi
jmp .L3
.L8:
add rdi, 1
sub rax, 1
cmp rdi, rdx
je .L4
.L3:
movzx ecx, BYTE PTR [rax]
cmp BYTE PTR [rdi], cl
je .L8
xor eax, eax
ret
.L4:
mov eax, 1
ret
Note:
there is no call instruction in the generated code, even though the source actually uses recursion
label .L8 is responsible for the loop which replaced the recursion
Remember there is the "as-if rule", so the compiler can transform your code in many ways to make it faster.
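For reference, the iterative shape the compiler produced corresponds roughly to this two-pointer version (my reading of the assembly above, not actual compiler output):
bool palindrome_iterative(const char* arr, int size) {
    if (size <= 1)
        return true;
    const char* left = arr;
    const char* right = arr + size - 1;
    while (left < right) {       // the loop around labels .L3/.L8
        if (*left != *right)
            return false;        // mismatch: xor eax, eax; ret
        ++left;
        --right;
    }
    return true;                 // .L4: mov eax, 1; ret
}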
In general, a recursive solution is often more elegant than iteration, but it usually needs more CPU time and memory: the CPU has to push data onto the stack on every recursive call.
Especially in this case, iteration seems more efficient in both time and memory.
Try something like this:
bool palindrome(char arr[], int size)
{
for (int i = 0; i < size; ++i) {
if (arr[i] != arr[size-1-i])
return false;
}
return true;
}
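A small variant of my own: the loop above compares every pair twice, once from each end; stopping at the midpoint performs the same check in half the iterations:
bool palindrome(char arr[], int size)
{
    for (int i = 0; i < size / 2; ++i) {   // only walk to the middle
        if (arr[i] != arr[size - 1 - i])
            return false;
    }
    return true;
}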

C++ Optimizer : division

Let's say I have two if statements:
if (frequency1_mhz > frequency2_hz * 1000) {// some code}
if (frequency1_mhz / 1000 > frequency2_hz ) {// some code}
I'd imagine the two to function exactly the same, yet I'm guessing the first statement, with the multiplication, is more efficient than the division.
Would a C++ compiler optimize this? Or is this something I should take into account when designing my code?
Yes and no.
The code is not identical:
due to rounding, there can be differences in results (e.g. frequency1_mhz=1001 and frequency2_hz=1)
the first version might overflow sooner than the second one. e.g. a frequency2_hz of 1000000000 would overflow an int (and cause UB)
It's still possible to perform division using multiplication.
When unsure, just look at the generated assembly.
Here's the generated assembly for both versions. The second one is longer, but still contains no division.
version1(int, int):
imul esi, esi, 1000
xor eax, eax
cmp esi, edi
setl al
ret
version2(int, int):
movsx rax, edi
imul rax, rax, 274877907 ; look ma, no idiv!
sar edi, 31
sar rax, 38
sub eax, edi
cmp eax, esi
setg al
movzx eax, al
ret
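For reference, version2's multiply/shift sequence corresponds roughly to the classic signed magic-number division below (my transcription of the assembly, using ceil(2^38 / 1000) = 274877907; it relies on arithmetic right shifts of negative values, just as the compiler-generated code does):
// Signed n / 1000 without idiv: multiply by the scaled reciprocal, take the
// high part, then correct the rounding for negative inputs.
int div1000(int n) {
    long long t = static_cast<long long>(n) * 274877907LL;  // imul rax, rax, 274877907
    int q = static_cast<int>(t >> 38);                      // sar rax, 38
    return q - (n >> 31);                                   // subtract -1 (i.e. add 1) for negative n to round toward zero
}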
No, these are not equivalent statements, because division is not a precise inverse of multiplication for floats or integers.
Integer division rounds positive fractions down:
constexpr int f1=999;
constexpr int f2=0;
static_assert(f1>f2*1000);   // the multiplied comparison says "greater"
static_assert(f1/1000==f2);  // but the divided value is merely equal
Reciprocals are not precise:
static_assert(3.0/10 != 3*(1.0/10)); // dividing differs from multiplying by the reciprocal
If the operands are floats and you build with -O3, GCC generates very similar assembly for the two versions (for better or for worse).
bool first(float frequency1_mhz,float frequency2_hz) {
return frequency1_mhz > frequency2_hz * 1000;
}
bool second(float frequency1_mhz,float frequency2_hz) {
return frequency1_mhz / 1000 > frequency2_hz;
}
The assembly
first(float, float):
mulss xmm1, DWORD PTR .LC0[rip]
comiss xmm0, xmm1
seta al
ret
second(float, float):
divss xmm0, DWORD PTR .LC0[rip]
comiss xmm0, xmm1
seta al
ret
.LC0:
.long 1148846080
So, really, it ends up as essentially the same code, just with a mulss in one and a divss in the other :-)

Unexpected x64 assembly for __atomic_fetch_or with gcc 7.3

I am attempting to use a 64-bit integer as a bitmap and to acquire/release ownership of individual bits atomically.
To this end, I have written the following lock-less code:
#include <cstdint>
#include <atomic>
static constexpr std::uint64_t NO_INDEX = ~std::uint64_t(0);
class AtomicBitMap {
public:
static constexpr std::uint64_t occupied() noexcept {
return ~std::uint64_t(0);
}
std::uint64_t acquire() noexcept {
while (true) {
auto map = mData.load(std::memory_order_relaxed);
if (map == occupied()) {
return NO_INDEX;
}
std::uint64_t index = __builtin_ctzl(~map);
auto previous =
mData.fetch_or(bit(index), std::memory_order_relaxed);
if ((previous & bit(index)) == 0) {
return index;
}
}
}
private:
static constexpr std::uint64_t bit(std::uint64_t index) noexcept {
return std::uint64_t(1) << index;
}
std::atomic_uint64_t mData{ 0 };
};
int main() {
AtomicBitMap map;
return map.acquire();
}
Which, on godbolt, yields the following assembly in isolation:
main:
mov QWORD PTR [rsp-8], 0
jmp .L3
.L10:
not rax
rep bsf rax, rax
mov edx, eax
mov eax, eax
lock bts QWORD PTR [rsp-8], rax
jnc .L9
.L3:
mov rax, QWORD PTR [rsp-8]
cmp rax, -1
jne .L10
ret
.L9:
movsx rax, edx
ret
Which is exactly what I expected1.
@Jester has heroically managed to reduce my 97-line reproducer to a much simpler 44-line reproducer, which I further reduced to 35 lines:
using u64 = unsigned long long;
struct Bucket {
u64 mLeaves[16] = {};
};
struct BucketMap {
u64 acquire() noexcept {
while (true) {
u64 map = mData;
u64 index = (map & 1) ? 1 : 0;
auto mask = u64(1) << index;
auto previous =
__atomic_fetch_or(&mData, mask, __ATOMIC_SEQ_CST);
if ((previous & mask) == 0) {
return index;
}
}
}
__attribute__((noinline)) Bucket acquireBucket() noexcept {
acquire();
return Bucket();
}
volatile u64 mData = 1;
};
int main() {
BucketMap map;
map.acquireBucket();
return 0;
}
Which generates the following assembly:
BucketMap::acquireBucket():
mov r8, rdi
mov rdx, rsi
.L2:
mov rax, QWORD PTR [rsi]
xor eax, eax
lock bts QWORD PTR [rdx], rax
setc al
jc .L2
mov rdi, r8
mov ecx, 16
rep stosq
mov rax, r8
ret
main:
sub rsp, 152
lea rsi, [rsp+8]
lea rdi, [rsp+16]
mov QWORD PTR [rsp+8], 1
call BucketMap::acquireBucket()
xor eax, eax
add rsp, 152
ret
The xor eax,eax means that the assembly here always attempts to obtain index 0... resulting in an infinite loop.
I can only see two explanations for this assembly:
I have somehow triggered Undefined Behavior.
There is a code-generation bug in gcc.
And I have exhausted all my ideas as to what could trigger UB.
Can anyone explain why gcc would generate this xor eax,eax?
Note: tentatively reported to gcc as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86314.
Compiler version used:
$ gcc --version
gcc (GCC) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is
NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
Compiler flags:
-Wall -Wextra -Werror -Wduplicated-cond -Wnon-virtual-dtor -Wvla
-rdynamic -Wno-deprecated-declarations -Wno-type-limits
-Wno-unused-parameter -Wno-unused-local-typedefs -Wno-unused-value
-Wno-aligned-new -Wno-implicit-fallthrough -Wno-deprecated
-Wno-noexcept-type -Wno-register -ggdb -fno-strict-aliasing
-std=c++17 -Wl,--no-undefined -Wno-sign-compare
-g -O3 -mpopcnt
1 Actually, it's better than I expected: the compiler understanding that fetch_or(bit(index)) followed by previous & bit(index) is the equivalent of using bts and checking the CF flag is pure gold.
This is a peephole optimization bug in gcc; see #86413, affecting versions 7.1, 7.2, 7.3 and 8.1. The fix is already in, and will be delivered in versions 7.4 and 8.2 respectively.
The short answer is that the particular code sequence (fetch_or + checking result) generates a setcc (set conditional, aka based on status of flags) followed by a movzbl (move and zero-extend); in 7.x an optimization was introduced which transforms a setcc followed by movzbl into a xor followed by setcc, however this optimization was missing some checks resulting in the xor possibly clobbering a register which was still needed (in this case, eax).
The longer answer is that fetch_or can be implemented either as a cmpxchg for full generality, or, if only setting one bit, as bts (bit test and set). As another optimization introduced in 7.x, gcc now generates a bts here (gcc 6.4 still generates a cmpxchg). bts sets the carry flag (CF) to the previous value of the bit.
That is, auto previous = a.fetch_or(bit); auto n = previous & bit; will generate:
lock bts QWORD PTR [<address of a>], <bit index> to set the bit, and capture its previous value,
setc <n>l to set the lower 8 bits of r<n>x to the value of the carry flag (CF),
movzx e<n>x, <n>l to zero-out the upper bits of r<n>x.
And then the peephole optimization will apply, which messes things up.
gcc trunk now generates proper assembly:
BucketMap::acquireBucket():
mov rdx, rdi
mov rcx, rsi
.L2:
mov rax, QWORD PTR [rsi]
and eax, 1
lock bts QWORD PTR [rcx], rax
setc al
movzx eax, al
jc .L2
mov rdi, rdx
mov ecx, 16
rep stosq
mov rax, rdx
ret
main:
sub rsp, 152
lea rsi, [rsp+8]
lea rdi, [rsp+16]
mov QWORD PTR [rsp+8], 1
call BucketMap::acquireBucket()
xor eax, eax
add rsp, 152
ret
Although unfortunately the optimization no longer applies, so we are left with setc + movzx instead of xor + setc... but at least it's correct!
As a side note, you can find the lowest 0 bit with a straight-forward bit manipulation:
template<class T>
T find_lowest_0_bit_mask(T value) {
T t = value + 1;
return (t ^ value) & t;
}
It returns a bit mask rather than a bit index.
Preconditions: T must be unsigned, and value must contain at least one zero bit.
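For example (a quick check of my own for the helper above):
#include <cassert>
#include <cstdint>
int main() {
    // 0b0111 -> the lowest zero bit is bit 3 -> mask 0b1000.
    assert(find_lowest_0_bit_mask<std::uint64_t>(0b0111u) == 0b1000u);
    // Convert the mask back to a bit index if one is needed.
    assert(__builtin_ctzll(find_lowest_0_bit_mask<std::uint64_t>(0b0111u)) == 3);
    return 0;
}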
mData.load must synchronise with mData.fetch_or, so it should be
mData.load(std::memory_order_acquire)
and
mData.fetch_or(..., std::memory_order_release)
And, IMO, there is something about these bit intrinsics that makes clang generate wrong assembly; see the .LBB0_5 loop, which is clearly wrong because it keeps trying to set the same bit rather than recalculating another bit to set. A version that generates correct assembly:
#include <cstdint>
#include <atomic>
static constexpr int NO_INDEX = -1;
template<class T>
T find_lowest_0_bit_mask(T value) {
T t = value + 1;
return (t ^ value) & t;
}
class AtomicBitMap {
public:
static constexpr std::uint64_t occupied() noexcept { return ~std::uint64_t(0); }
int acquire() noexcept {
auto map = mData.load(std::memory_order_acquire);
while(map != occupied()) {
std::uint64_t mask = find_lowest_0_bit_mask(map);
if(mData.compare_exchange_weak(map, map | mask, std::memory_order_release))
return __builtin_ffsl(mask) - 1;
}
return NO_INDEX;
}
void release(int i) noexcept {
mData.fetch_and(~bit(i), std::memory_order_release);
}
private:
static constexpr std::uint64_t bit(int index) noexcept {
return std::uint64_t(1) << index;
}
std::atomic_uint64_t mData{ 0 };
};
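A quick single-threaded usage sketch of this version (my own, just to show the interface):
#include <cassert>
int main() {
    AtomicBitMap bits;
    int a = bits.acquire();       // lowest free bit: 0
    int b = bits.acquire();       // next free bit: 1
    assert(a == 0 && b == 1);
    bits.release(a);              // bit 0 becomes free again
    assert(bits.acquire() == 0);
    return 0;
}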
xor-zero / set flags / setcc is usually the best way to create a 32-bit 0/1 integer.
Obviously it's only safe to do this if you have a spare register to xor-zero without destroying any inputs to the flag-setting instruction(s), so this is pretty clearly a bug.
(Otherwise you can setcc dl / movzx eax,dl. Separate regs are preferable so the movzx can be zero latency (mov-elimination) on some CPUs, but it's on the critical path on other CPUs so the xor/set-flags / setcc idiom is preferable because fewer instructions are on the critical path.)
IDK why gcc creates the integer value of (previous & mask) == 0 in a register at all; that's probably part of the bug.

Getting GCC/Clang to use CMOV

I have a simple tagged union of values. The values can be either int64_ts or doubles. I am performing addition on these unions, with the caveat that if both arguments represent int64_t values then the result should also hold an int64_t value.
Here is the code:
#include<stdint.h>
union Value {
int64_t a;
double b;
};
enum Type { DOUBLE, LONG };
// Value + type.
struct TaggedValue {
Type type;
Value value;
};
void add(const TaggedValue& arg1, const TaggedValue& arg2, TaggedValue* out) {
const Type type1 = arg1.type;
const Type type2 = arg2.type;
// If both args are longs then write a long to the output.
if (type1 == LONG && type2 == LONG) {
out->value.a = arg1.value.a + arg2.value.a;
out->type = LONG;
} else {
// Convert argument to a double and add it.
double op1 = type1 == LONG ? (double)arg1.value.a : arg1.value.b; // Why isn't CMOV used?
double op2 = type2 == LONG ? (double)arg2.value.a : arg2.value.b; // Why isn't CMOV used?
out->value.b = op1 + op2;
out->type = DOUBLE;
}
}
The output of gcc at -O2 is here: http://goo.gl/uTve18
Attached here in case the link doesn't work.
add(TaggedValue const&, TaggedValue const&, TaggedValue*):
cmp DWORD PTR [rdi], 1
sete al
cmp DWORD PTR [rsi], 1
sete cl
je .L17
test al, al
jne .L18
.L4:
test cl, cl
movsd xmm1, QWORD PTR [rdi+8]
jne .L19
.L6:
movsd xmm0, QWORD PTR [rsi+8]
mov DWORD PTR [rdx], 0
addsd xmm0, xmm1
movsd QWORD PTR [rdx+8], xmm0
ret
.L17:
test al, al
je .L4
mov rax, QWORD PTR [rdi+8]
add rax, QWORD PTR [rsi+8]
mov DWORD PTR [rdx], 1
mov QWORD PTR [rdx+8], rax
ret
.L18:
cvtsi2sd xmm1, QWORD PTR [rdi+8]
jmp .L6
.L19:
cvtsi2sd xmm0, QWORD PTR [rsi+8]
addsd xmm0, xmm1
mov DWORD PTR [rdx], 0
movsd QWORD PTR [rdx+8], xmm0
ret
It produced code with a lot of branches. I know that the input data is pretty random, i.e. it has a random combination of int64_ts and doubles. I'd like to have at least the conversion to a double done with an equivalent of a CMOV instruction. Is there any way I can coax gcc into producing that code? I'd ideally like to run some benchmarks on real data to see how the code with a lot of branches performs vs. one with fewer branches but more expensive CMOV instructions. It might turn out that the code generated by default by GCC works better, but I'd like to confirm that. I could inline the assembly myself, but I'd prefer not to.
The interactive compiler link is a good way to check the assembly. Any suggestions?
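One thing worth trying (a sketch of mine, not a guaranteed recipe): compute both interpretations of the operand unconditionally and select between them with a ternary, which sometimes lets the compiler emit a cmov or a blend instead of a branch. Note that reading the inactive union member relies on the GCC/Clang union-punning extension, and only a benchmark on real data can tell whether this beats the branchy code:
// Hypothetical helper: both views of the payload are computed, then selected.
static inline double asDouble(const TaggedValue& v) {
    double reinterpreted = v.value.b;                   // raw bits viewed as a double (union-punning extension)
    double converted = static_cast<double>(v.value.a);  // int64 -> double conversion
    return v.type == LONG ? converted : reinterpreted;  // branch-free select, if the compiler cooperates
}
void addBranchless(const TaggedValue& arg1, const TaggedValue& arg2, TaggedValue* out) {
    if (arg1.type == LONG && arg2.type == LONG) {
        out->value.a = arg1.value.a + arg2.value.a;
        out->type = LONG;
    } else {
        out->value.b = asDouble(arg1) + asDouble(arg2);
        out->type = DOUBLE;
    }
}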