Getting GCC/Clang to use CMOV - c++

I have a simple tagged union of values. The values can be either int64_ts or doubles. I am performing addition on these unions, with the caveat that if both arguments represent int64_t values then the result should also hold an int64_t value.
Here is the code:
#include <stdint.h>

union Value {
    int64_t a;
    double b;
};

enum Type { DOUBLE, LONG };

// Value + type.
struct TaggedValue {
    Type type;
    Value value;
};

void add(const TaggedValue& arg1, const TaggedValue& arg2, TaggedValue* out) {
    const Type type1 = arg1.type;
    const Type type2 = arg2.type;
    // If both args are longs then write a long to the output.
    if (type1 == LONG && type2 == LONG) {
        out->value.a = arg1.value.a + arg2.value.a;
        out->type = LONG;
    } else {
        // Convert argument to a double and add it.
        double op1 = type1 == LONG ? (double)arg1.value.a : arg1.value.b; // Why isn't CMOV used?
        double op2 = type2 == LONG ? (double)arg2.value.a : arg2.value.b; // Why isn't CMOV used?
        out->value.b = op1 + op2;
        out->type = DOUBLE;
    }
}
The output of GCC at -O2 is here: http://goo.gl/uTve18. It is also included below in case the link doesn't work.
add(TaggedValue const&, TaggedValue const&, TaggedValue*):
cmp DWORD PTR [rdi], 1
sete al
cmp DWORD PTR [rsi], 1
sete cl
je .L17
test al, al
jne .L18
.L4:
test cl, cl
movsd xmm1, QWORD PTR [rdi+8]
jne .L19
.L6:
movsd xmm0, QWORD PTR [rsi+8]
mov DWORD PTR [rdx], 0
addsd xmm0, xmm1
movsd QWORD PTR [rdx+8], xmm0
ret
.L17:
test al, al
je .L4
mov rax, QWORD PTR [rdi+8]
add rax, QWORD PTR [rsi+8]
mov DWORD PTR [rdx], 1
mov QWORD PTR [rdx+8], rax
ret
.L18:
cvtsi2sd xmm1, QWORD PTR [rdi+8]
jmp .L6
.L19:
cvtsi2sd xmm0, QWORD PTR [rsi+8]
addsd xmm0, xmm1
mov DWORD PTR [rdx], 0
movsd QWORD PTR [rdx+8], xmm0
ret
It produced code with a lot of branches. I know that the input data is pretty random, i.e. it has a random mix of int64_ts and doubles. I'd like to have at least the conversion to a double done with an equivalent of a CMOV instruction. Is there any way I can coax GCC into producing that code? Ideally I'd like to run a benchmark on real data to see how the branchy code compares with code that has fewer branches but more expensive CMOV instructions. It might turn out that the code GCC generates by default works better, but I'd like to confirm that. I could write the assembly inline myself, but I'd prefer not to.
The interactive compiler link is a good way to check the assembly. Any suggestions?
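For what it's worth, one direction to experiment with on the same interactive compiler is to make the selection data-flow rather than control-flow. The helpers below are my own sketch (as_double and select_double are not part of the original code): the first computes both interpretations and then selects with a ternary, which the compiler may or may not lower to a CMOV/blend; the second forces a branch-free select by masking the bit patterns, at the cost of always paying for the int-to-double conversion. Note that both read the inactive union member, which GCC tolerates but strict ISO C++ does not.
#include <cstdint>
#include <cstring>

// Sketch only: evaluate both views of the union, then select.
// Whether the final select becomes a CMOV/blend is still the compiler's call.
static double as_double(const TaggedValue& v) {
    const double converted = (double)v.value.a; // unconditional int64 -> double
    const double raw       = v.value.b;         // unconditional raw load
    return v.type == LONG ? converted : raw;
}

// Sketch only: guaranteed branch-free select on the bit patterns.
static double select_double(bool take_converted, double converted, double raw) {
    uint64_t c, r;
    std::memcpy(&c, &converted, sizeof c);
    std::memcpy(&r, &raw, sizeof r);
    const uint64_t mask = 0 - (uint64_t)take_converted; // all ones or all zeros
    const uint64_t bits = (c & mask) | (r & ~mask);
    double out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}
In add(), the else branch would then become out->value.b = as_double(arg1) + as_double(arg2); whether that actually beats the branchy version on real data is exactly what the benchmark mentioned above should decide.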

Related

Why is vzeroupper being inserted at the end of this code?

I noticed something strange when I compile this code on godbolt, with MSVC:
#include <intrin.h>
#include <cstdint>
void test(unsigned char*& pSrc) {
    __m256i data = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(pSrc));
    int32_t mask = _mm256_movemask_epi8(data);
    if (!mask) {
        ++pSrc;
    }
    else {
        unsigned long v;
        _BitScanForward(&v, mask);
        pSrc += v;
    }
}
I get this resulting assembly:
pSrc$ = 8
void test(unsigned char * &) PROC ; test, COMDAT
mov rdx, QWORD PTR [rcx]
vmovdqu ymm0, YMMWORD PTR [rdx]
vpmovmskb eax, ymm0
test eax, eax
jne SHORT $LN2#test
mov eax, 1
add rax, rdx
mov QWORD PTR [rcx], rax
vzeroupper ; Why is this being inserted?
ret 0
$LN2#test:
bsf eax, eax
add rax, rdx
mov QWORD PTR [rcx], rax
vzeroupper ; Why is this being inserted?
ret 0
void test(unsigned char * &) ENDP ; test
Why is vzeroupper being inserted at the end of each scope? I heard that it's because of switching between SSE and AVX, but I'm not doing that here. I'm using exclusively AVX code.
I was wondering, does this pose a performance problem?
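In case a comparison helps, here is a rough GCC adaptation of the same snippet (my own, not the original code): _BitScanForward is MSVC-specific, so __builtin_ctz stands in for it. Building it with -O2 -mavx2 shows the same kind of cleanup before each return, and GCC additionally exposes -mno-vzeroupper if you want to see the function without it.
#include <immintrin.h>
#include <cstdint>

// Rough GCC equivalent; compile with -O2 -mavx2
// (and optionally -mno-vzeroupper to suppress the cleanup).
void test_gcc(unsigned char*& pSrc) {
    __m256i data = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(pSrc));
    int32_t mask = _mm256_movemask_epi8(data);
    if (!mask)
        ++pSrc;
    else
        pSrc += __builtin_ctz(mask); // index of first set bit, like _BitScanForward
}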

Why aren't clang++ and g++ de-duplicating these instructions?

Consider the following function:
std::string get_value(const bool b)
{
    if (b) {
        return "Hello";
    }
    else {
        return "World";
    }
}
g++ 11.0.1 20210312 compiles this (as C++17 and with maximum optimization) into:
get_value[abi:cxx11](bool):
lea rdx, [rdi+16]
mov rax, rdi
mov QWORD PTR [rdi], rdx
test sil, sil
je .L2
mov DWORD PTR [rdi+16], 1819043144
mov BYTE PTR [rdx+4], 111
mov QWORD PTR [rax+8], 5
mov BYTE PTR [rax+21], 0
ret
.L2:
mov DWORD PTR [rdi+16], 1819438935
mov BYTE PTR [rdx+4], 100
mov QWORD PTR [rax+8], 5
mov BYTE PTR [rax+21], 0
ret
Why does it not move the two replicated mov instructions up before the jump, or even before the test, reducing the code size by two instructions?
The same thing happens with clang++ and libc++, except it only has one relevant instruction to move up.
(See this also on GodBolt)
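One experiment that may be worth adding to the Godbolt comparison (this variant is mine, not from the original question) is collapsing the two returns into a single return expression; it gives the optimizer one construction site to work with, though whether the duplicated stores disappear still depends on the compiler version.
#include <string>

// Hypothetical variant: a single return expression instead of two returns.
std::string get_value_ternary(const bool b)
{
    return b ? "Hello" : "World";
}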

Differences in custom and std fetch_add on floats

This is an attempt at implementing fetch_add on floats without C++20.
#include <cstdint>

void fetch_add(volatile float* x, float y)
{
    bool success = false;
    auto xi = (volatile std::int32_t*)x;
    while (!success)
    {
        union {
            std::int32_t sumint;
            float sum;
        };
        auto tmp = __atomic_load_n(xi, __ATOMIC_RELAXED);
        sumint = tmp;
        sum += y;
        success = __atomic_compare_exchange_n(xi, &tmp, sumint, true, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
    }
}
To my great confusion, when I compare the assembly of this function with that of a version using std::atomic<float>'s fetch_add, compiled with gcc 10.1 at -O2 -std=c++2a for x86-64, the two differ.
fetch_add(float volatile*, float):
.L2:
mov eax, DWORD PTR [rdi]
movd xmm1, eax
addss xmm1, xmm0
movd edx, xmm1
lock cmpxchg DWORD PTR [rdi], edx
jne .L2
ret
fetch_add_std(std::atomic<float>&, float):
mov eax, DWORD PTR [rdi]
movaps xmm1, xmm0
movd xmm0, eax
mov DWORD PTR [rsp-4], eax
addss xmm0, xmm1
.L9:
mov eax, DWORD PTR [rsp-4]
movd edx, xmm0
lock cmpxchg DWORD PTR [rdi], edx
je .L6
mov DWORD PTR [rsp-4], eax
movss xmm0, DWORD PTR [rsp-4]
addss xmm0, xmm1
jmp .L9
.L6:
ret
My ability to read assembly is near non-existent, but the custom version looks correct to me, which implies that it is either incorrect or inefficient in some way I can't see, or that the standard library is somehow rather broken. I don't quite believe the third case, which leads me to ask: is the custom version incorrect or inefficient?
Following some comments, a second version that does not reload after the cmpxchg was written. The two still differ.
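For reference, a sketch of what that second version might look like (my own reconstruction, untested): load once up front, let the failed compare-exchange refresh the expected value instead of reloading, and use memcpy rather than a union for the int/float punning, since reading the inactive union member is only well-defined as a GCC extension, not in ISO C++.
#include <cstdint>
#include <cstring>

void fetch_add_v2(volatile float* x, float y)
{
    auto xi = (volatile std::int32_t*)x;
    std::int32_t expected = __atomic_load_n(xi, __ATOMIC_RELAXED); // load once
    for (;;) {
        float cur;
        std::memcpy(&cur, &expected, sizeof cur);    // int bits -> float
        const float sum = cur + y;
        std::int32_t desired;
        std::memcpy(&desired, &sum, sizeof desired); // float -> int bits
        // On failure, 'expected' is updated with the current value, so no reload is needed.
        if (__atomic_compare_exchange_n(xi, &expected, desired, true,
                                        __ATOMIC_RELAXED, __ATOMIC_RELAXED))
            return;
    }
}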

Successfully enabling -fno-finite-math-only on NaN removal method

In finding a bug which made everything turn into NaNs when running the optimized version of my code (compiling in g++ 4.8.2 and 4.9.3), I identified that the problem was the -Ofast option, specifically, the -ffinite-math-only flag it includes.
One part of the code involves reading floats from a FILE* using fscanf, and then replacing all NaNs with a numeric value. As could be expected, however, -ffinite-math-only kicks in, and removes these checks, thus leaving the NaNs.
In trying to solve this problem, I stumbled upon this, which suggested adding -fno-finite-math-only as a method attribute to disable the optimization on the specific method. The following illustrates the problem and the attempted fix (which doesn't actually fix it):
#include <cstdio>
#include <cmath>
__attribute__((optimize("-fno-finite-math-only")))
void replaceNaN(float * arr, int size, float newValue){
    for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}

int main(void){
    const size_t cnt = 10;
    float val[cnt];
    for(int i = 0; i < cnt; i++) scanf("%f", val + i);
    replaceNaN(val, cnt, -1.0f);
    for(int i = 0; i < cnt; i++) printf("%f ", val[i]);
    return 0;
}
The code does not act as desired when compiled and run with echo 1 2 3 4 5 6 7 8 nan 10 | (g++ -ffinite-math-only test.cpp -o test && ./test): specifically, it outputs a nan (which should have been replaced by -1.0f). It behaves fine if the -ffinite-math-only flag is omitted. Shouldn't this work? Am I missing something with the syntax for attributes in gcc, or is this one of the aforementioned cases of "there being some trouble with some version of GCC related to this" (from the linked SO question)?
A few solutions I'm aware of, though I'd rather have something a bit cleaner/more portable:
Compile the code with -fno-finite-math-only (my interim solution): I suspect that this optimization may be rather useful in my context in the remainder of the program.
Manually look for the string "nan" in the input stream and replace the value there (the input reader is in an unrelated part of the library, so adding this test there would be poor design).
Assume a specific floating point architecture and make my own isNaN: I may do this, but it's a bit hackish and non-portable.
Prefilter the data using a separately compiled program without the -ffinite-math-only flag, and then feed that into the main program: The added complexity of maintaining two binaries and getting them to talk to each other just isn't worth it.
Edit: As noted in the accepted answer, this appears to be a compiler "bug" in older versions of g++, such as 4.8.2 and 4.9.3, that is fixed in newer versions, such as 5.1 and 6.1.1.
If for some reason updating the compiler is not a reasonably easy option (e.g.: no root access), or adding this attribute to a single function still doesn't completely solve the NaN check problem, an alternate solution, if you can be certain that the code will always run in an IEEE754 floating point environment, is to manually check the bits of the float for a NaN signature.
The accepted answer suggests doing this using a bit field, however, the order in which the compiler places the elements in a bit field is non-standard, and in fact, changes between the older and newer versions of g++, even refusing to adhere to the desired positioning in older versions (4.8.2 and 4.9.3, always placing the mantissa first), regardless of the order in which they appear in the code.
A solution using bit manipulation, however, is guaranteed to work on all IEEE754-compliant compilers. Below is my implementation, which I ultimately used to solve my problem. It checks for IEEE754 compliance, and I've extended it to support doubles, as well as other more routine floating-point bit manipulations.
#include <cstddef>     // size_t
#include <cstdint>     // uint32_t / uint64_t
#include <limits>      // IEEE754 compliance test
#include <type_traits> // enable_if

template<
    typename T,
    typename = typename std::enable_if<std::is_floating_point<T>::value>::type,
    typename = typename std::enable_if<std::numeric_limits<T>::is_iec559>::type,
    typename u_t = typename std::conditional<std::is_same<T, float>::value, uint32_t, uint64_t>::type
>
struct IEEE754 {
    enum class WIDTH : size_t {
        SIGN = 1,
        EXPONENT = std::is_same<T, float>::value ? 8 : 11,
        MANTISSA = std::is_same<T, float>::value ? 23 : 52
    };
    enum class MASK : u_t {
        SIGN = (u_t)1 << (sizeof(u_t) * 8 - 1),
        EXPONENT = ((~(u_t)0) << (size_t)WIDTH::MANTISSA) ^ (u_t)MASK::SIGN,
        MANTISSA = (~(u_t)0) >> ((size_t)WIDTH::SIGN + (size_t)WIDTH::EXPONENT)
    };
    union {
        T f;
        u_t u;
    };
    IEEE754(T f) : f(f) {}
    // Note: the (u & MASK) terms need parentheses, since >> binds tighter than &.
    inline u_t sign() const { return (u & (u_t)MASK::SIGN) >> ((size_t)WIDTH::EXPONENT + (size_t)WIDTH::MANTISSA); }
    inline u_t exponent() const { return (u & (u_t)MASK::EXPONENT) >> (size_t)WIDTH::MANTISSA; }
    inline u_t mantissa() const { return u & (u_t)MASK::MANTISSA; }
    inline bool isNan() const {
        return (mantissa() != 0) && ((u & ((u_t)MASK::EXPONENT)) == (u_t)MASK::EXPONENT);
    }
};

template<typename T>
inline IEEE754<T> toIEEE754(T val) { return IEEE754<T>(val); }
And the replaceNaN function now becomes:
void replaceNaN(float * arr, int size, float newValue){
    for(int i = 0; i < size; i++)
        if (toIEEE754(arr[i]).isNan()) arr[i] = newValue;
}
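A quick usage sketch of the helper (my own, purely illustrative; the expected values in the comment assume the standard IEEE 754 single-precision encoding):
#include <cstdio>

int main() {
    IEEE754<float> x(-1.5f);
    // -1.5f encodes as 0xBFC00000: sign = 1, biased exponent = 127, mantissa = 0x400000.
    std::printf("sign=%u exponent=%u mantissa=0x%x isNan=%d\n",
                (unsigned)x.sign(), (unsigned)x.exponent(),
                (unsigned)x.mantissa(), (int)x.isNan());
    std::printf("quiet NaN detected: %d\n",
                (int)toIEEE754(std::numeric_limits<float>::quiet_NaN()).isNan());
}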
An inspection of the assembly of these functions reveals that, as expected, all masks become compile-time constants, leading to the following (seemingly) efficient code:
# In loop of replaceNaN
movl (%rcx), %eax # eax = arr[i]
testl $8388607, %eax # Check if mantissa is empty
je .L3 # If it is, it can't be a NaN, continue loop
andl $2139095040, %eax # Mask leaves only exponent
cmpl $2139095040, %eax # Test if exponent is all 1s
jne .L3 # If it isn't, it's not a nan, so continue loop
This is one instruction less than with a working bit field solution (no shift), and the same number of registers are used (although it's tempting to say this alone makes it more efficient, there are other concerns such as pipelining which may make one solution more or less efficient than the other one).
Looks like a compiler bug to me. Up through GCC 4.9.2, the attribute is completely ignored. GCC 5.1 and later pay attention to it. Perhaps it's time to upgrade your compiler?
__attribute__((optimize("-fno-finite-math-only")))
void replaceNaN(float * arr, int size, float newValue){
    for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
Compiled with -ffinite-math-only on GCC 4.9.2:
replaceNaN(float*, int, float):
rep ret
But with the exact same settings on GCC 5.1:
replaceNaN(float*, int, float):
test esi, esi
jle .L26
sub rsp, 8
call std::isnan(float) [clone .isra.0]
test al, al
je .L2
mov rax, rdi
and eax, 15
shr rax, 2
neg rax
and eax, 3
cmp eax, esi
cmova eax, esi
cmp esi, 6
jg .L28
mov eax, esi
.L5:
cmp eax, 1
movss DWORD PTR [rdi], xmm0
je .L16
cmp eax, 2
movss DWORD PTR [rdi+4], xmm0
je .L17
cmp eax, 3
movss DWORD PTR [rdi+8], xmm0
je .L18
cmp eax, 4
movss DWORD PTR [rdi+12], xmm0
je .L19
cmp eax, 5
movss DWORD PTR [rdi+16], xmm0
je .L20
movss DWORD PTR [rdi+20], xmm0
mov edx, 6
.L7:
cmp esi, eax
je .L2
.L6:
mov r9d, esi
lea r8d, [rsi-1]
mov r11d, eax
sub r9d, eax
lea ecx, [r9-4]
sub r8d, eax
shr ecx, 2
add ecx, 1
cmp r8d, 2
lea r10d, [0+rcx*4]
jbe .L9
movaps xmm1, xmm0
lea r8, [rdi+r11*4]
xor eax, eax
shufps xmm1, xmm1, 0
.L11:
add eax, 1
add r8, 16
movaps XMMWORD PTR [r8-16], xmm1
cmp ecx, eax
ja .L11
add edx, r10d
cmp r9d, r10d
je .L2
.L9:
movsx rax, edx
movss DWORD PTR [rdi+rax*4], xmm0
lea eax, [rdx+1]
cmp eax, esi
jge .L2
add edx, 2
cdqe
cmp esi, edx
movss DWORD PTR [rdi+rax*4], xmm0
jle .L2
movsx rdx, edx
movss DWORD PTR [rdi+rdx*4], xmm0
.L2:
add rsp, 8
.L26:
rep ret
.L28:
test eax, eax
jne .L5
xor edx, edx
jmp .L6
.L20:
mov edx, 5
jmp .L7
.L19:
mov edx, 4
jmp .L7
.L18:
mov edx, 3
jmp .L7
.L17:
mov edx, 2
jmp .L7
.L16:
mov edx, 1
jmp .L7
The output is similar, although not quite identical, on GCC 6.1.
Replacing the attribute with
#pragma GCC push_options
#pragma GCC optimize ("-fno-finite-math-only")
void replaceNaN(float * arr, int size, float newValue){
    for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
#pragma GCC pop_options
makes absolutely no difference, so it is not simply a matter of the attribute being ignored. These older versions of the compiler clearly do not support controlling the floating-point optimization behavior at function-level granularity.
Note, however, that the generated code on GCC 5.1 and later is still significantly worse than compiling the function without the -ffinite-math-only switch:
replaceNaN(float*, int, float):
test esi, esi
jle .L1
lea eax, [rsi-1]
lea rax, [rdi+4+rax*4]
.L5:
movss xmm1, DWORD PTR [rdi]
ucomiss xmm1, xmm1
jnp .L6
movss DWORD PTR [rdi], xmm0
.L6:
add rdi, 4
cmp rdi, rax
jne .L5
rep ret
.L1:
rep ret
I have no idea why there is such a discrepancy. Something is badly throwing the compiler off its game; this is even worse code than you get with optimizations completely disabled. If I had to guess, I'd speculate it was the implementation of std::isnan. If this replaceNaN method is not speed-critical, then it probably doesn't matter; but if you need to repeatedly parse values from a file, you might prefer to have a reasonably efficient implementation.
Personally, I would write my own non-portable implementation of std::isnan. The IEEE 754 formats are all quite well-documented, and assuming you thoroughly test and comment the code, I can't see the harm in this, unless you absolutely need the code to be portable to all different architectures. It will drive purists up the wall, but so should using non-standard options like -ffinite-math-only. For a single-precision float, something like:
#include <cstdint>

bool my_isnan(float value)
{
    union IEEE754_Single
    {
        float f;
        struct
        {
#if BIG_ENDIAN
            uint32_t sign     : 1;
            uint32_t exponent : 8;
            uint32_t mantissa : 23;
#else
            uint32_t mantissa : 23;
            uint32_t exponent : 8;
            uint32_t sign     : 1;
#endif
        } bits;
    } u = { value };

    // In the IEEE 754 representation, a float is NaN when
    // the mantissa is non-zero, and the exponent is all ones
    // (2^8 - 1 == 255).
    return (u.bits.mantissa != 0) && (u.bits.exponent == 255);
}
Now there is no need for annotations; just use my_isnan instead of std::isnan. The modified loop (my reconstruction, assuming a straight swap of the call) would be:
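void replaceNaN(float * arr, int size, float newValue){
    for(int i = 0; i < size; i++) if (my_isnan(arr[i])) arr[i] = newValue;
}
Compiled with -ffinite-math-only enabled, this produces the following object code: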
replaceNaN(float*, int, float):
test esi, esi
jle .L6
lea eax, [rsi-1]
lea rdx, [rdi+4+rax*4]
.L13:
mov eax, DWORD PTR [rdi] ; get original floating-point value
test eax, 8388607 ; test if mantissa != 0
je .L9
shr eax, 16 ; test if exponent has all bits set
and ax, 32640
cmp ax, 32640
jne .L9
movss DWORD PTR [rdi], xmm0 ; set newValue if original was NaN
.L9:
add rdi, 4
cmp rdx, rdi
jne .L13
rep ret
.L6:
rep ret
The NaN check is slightly more complicated than a simple ucomiss followed by a test of the parity flag, but is guaranteed to be correct as long as your compiler adheres to the IEEE 754 standard. This works on all versions of GCC, and any other compiler.
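As a footnote to the above (my own addition, not part of the original answer): if the union-based punning or the bit-field layout still worries you, a memcpy-based variant expresses the same check without either, and optimizing compilers generally reduce the 4-byte memcpy to a plain load:
#include <cstdint>
#include <cstring>

bool my_isnan_memcpy(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);          // well-defined type punning
    const std::uint32_t exponent_mask = 0x7F800000u;  // 8 exponent bits
    const std::uint32_t mantissa_mask = 0x007FFFFFu;  // 23 mantissa bits
    // NaN: exponent all ones and mantissa non-zero.
    return (bits & exponent_mask) == exponent_mask && (bits & mantissa_mask) != 0;
}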

Calling convention mismatch for x64 floating point functions

I'm having a weird error. I have one module compiled by one compiler (MSVC in this case) that calls code loaded from another module compiled by a separate compiler (TCC).
The TCC code provides a callback function which, for both modules, is declared like this:
typedef float (*ScaleFunc)(float value, float _min, float _max);
The MSVC code calls it like this:
finalValue = extScale(val, _min, _max);
000007FEECAFCF52 mov rax,qword ptr [this]
000007FEECAFCF5A movss xmm2,dword ptr [rax+0D0h]
000007FEECAFCF62 mov rax,qword ptr [this]
000007FEECAFCF6A movss xmm1,dword ptr [rax+0CCh]
000007FEECAFCF72 movss xmm0,dword ptr [val]
000007FEECAFCF78 mov rax,qword ptr [this]
000007FEECAFCF80 call qword ptr [rax+0B8h]
000007FEECAFCF86 movss dword ptr [finalValue],xmm0
and the function compiled by TCC looks like this:
float linear_scale(float value, float _min, float _max)
{
    return value * (_max - _min) + _min;
}
0000000000503DC4 push rbp
0000000000503DC5 mov rbp,rsp
0000000000503DC8 sub rsp,0
0000000000503DCF mov qword ptr [rbp+10h],rcx
0000000000503DD3 mov qword ptr [rbp+18h],rdx
0000000000503DD7 mov qword ptr [rbp+20h],r8
0000000000503DDB movd xmm0,dword ptr [rbp+20h]
0000000000503DE0 subss xmm0,dword ptr [rbp+18h]
0000000000503DE5 movq xmm1,xmm0
0000000000503DE9 movd xmm0,dword ptr [rbp+10h]
0000000000503DEE mulss xmm0,xmm1
0000000000503DF2 addss xmm0,dword ptr [rbp+18h]
0000000000503DF7 jmp 0000000000503DFC
0000000000503DFC leave
0000000000503DFD ret
It seems that TCC expects the arguments in the integer registers rcx, rdx and r8, while MSVC passes them in the SSE registers. I thought x64 on Windows defined one common calling convention? What exactly is going on here, and how can I enforce the same convention on both sides?
The same code works correctly in 32-bit mode. Weirdly enough, on OSX (where the other code is compiled by LLVM) it works in both modes (32- and 64-bit). I'll see if I can fetch some assembly from there later.
---- edit ----
I have created a working solution. It is, however, without doubt the dirtiest hack I've ever made (bar questionable inline assembly, which unfortunately isn't available in 64-bit MSVC :)).
// Passes the first three (floating point) arguments in the integer registers
// rcx, rdx and r8, which is where TCC expects them.
template<typename sseType>
sseType TCCAssemblyHelper(ScaleFunc cb, sseType val, sseType _min, sseType _max)
{
    sseType xmm0(val), xmm1(_min), xmm2(_max);
    long long rcx, rdx, r8;
    rcx = *(long long*)&xmm0;
    rdx = *(long long*)&xmm1;
    r8 = *(long long*)&xmm2;

    typedef float (*interMedFunc)(long long rcx, long long rdx, long long r8);
    interMedFunc helperFunc = reinterpret_cast<interMedFunc>(cb);
    return helperFunc(rcx, rdx, r8);
}