is this a bug in sqrt function - c++

I am writing an application to compute prime numbers in the 64-bit range. When I tried to compute the square root of a 64-bit number using sqrt from math.h, the answer was not accurate: for example, when the input is ~0ull the answer should be ~0u, but the one I get is 0x100000000, which is not right. So I decided to write my own version in x86 assembly to see if this is a bug. Here is my function:
inline unsigned prime_isqrt(unsigned long long value)
{
    const unsigned one = 1;
    const unsigned two = 2;
    __asm
    {
        test dword ptr [value+4], 0x80000000
        jz ZERO
        mov eax, dword ptr [value]
        mov ecx, dword ptr [value + 4]
        shrd eax, ecx, 1
        shr ecx, 1
        mov dword ptr [value], eax
        mov dword ptr [value+4], ecx
        fild value
        fimul two
        fiadd one
        jmp REST
    ZERO:
        fild value
    REST:
        fsqrt
        fisttp value
        mov eax, dword ptr [value]
    }
}
The input is an odd number whose square root I want. When I tested my function with the same input, the result was the same.
What I don't get is why those functions round the result, or to be specific, why the sqrt instruction rounds the result?

sqrt doesn't round anything - you do, when you convert your integer into a double. A double can't represent all the numbers that a 64-bit integer can without loss of precision. Specifically, starting at 2^53, there are multiple integers that will be represented as the same double value.
So if you convert an integer above 2^53 to double, you lose some of the least significant bits, which is why (double)(~0ull) is 18446744073709552000.0, not 18446744073709551615.0 (or to be more precise the latter is actually equal to the former because they convert to the same double number).
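A quick way to watch the rounding happen (a minimal sketch; the printed values assume the usual IEEE 754 binary64 double):

#include <cstdio>
#include <cmath>

int main()
{
    unsigned long long x = ~0ull;             // 18446744073709551615
    double d = (double)x;                     // nearest double is 2^64 = 18446744073709551616.0
    printf("x      = %llu\n", x);
    printf("d      = %.1f\n", d);             // already rounded up by the conversion
    printf("sqrt d = %.1f\n", std::sqrt(d));  // sqrt(2^64) is exactly 2^32 = 4294967296.0
    return 0;
}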

You're not very clear about the C++ function that you're calling. sqrt is an overloaded name. You probably wanted sqrt(double(~0ull)). There's no sqrt overload which takes an unsigned long long.
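If what you actually need is an exact integer square root over the full 64-bit range, one portable option (just a sketch, not the only way to do it) is to use the floating-point sqrt as a first guess and repair any rounding error with integer arithmetic:

#include <cmath>
#include <cstdint>

// Returns floor(sqrt(x)) for any 64-bit value. The double sqrt is only a
// guess; the integer fix-up below removes whatever rounding error it had.
uint32_t isqrt64(uint64_t x)
{
    uint64_t r = (uint64_t)std::sqrt((double)x);
    if (r > 0xFFFFFFFFull) r = 0xFFFFFFFFull;    // sqrt can round up to 2^32 for inputs near ~0ull
    while (r * r > x) --r;                       // guess was too high
    while (r < 0xFFFFFFFFull && (r + 1) * (r + 1) <= x) ++r;  // guess was too low
    return (uint32_t)r;
}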

Related

Efficient overflow-immune arithmetic mean in C/C++

The arithmetic mean of two unsigned integers is defined as:
mean = (a+b)/2
Directly implementing this in C/C++ may overflow and produce a wrong result. A correct implementation would avoid this. One way of coding it could be:
mean = a/2 + b/2 + (a%2 + b%2)/2
But this produces rather a lot of code with typical compilers. In assembler, this usually can be done much more efficiently. For example, the x86 can do this in the following way (assembler pseudo code, I hope you get the point):
ADD a,b ; addition, leaving the overflow condition in the carry bit
RCR a,1 ; rotate right through carry, effectively a division by 2
After those two instructions, the result is in a, and the remainder of the division is in the carry bit. If correct rounding is desired, a third ADC instruction would have to add the carry into the result.
Note that the RCR instruction is used, which rotates a register through the carry. In our case, it is a rotate by one position, so that the previous carry becomes the most significant bit in the register, and the new carry holds the previous LSB from the register. It seems that MSVC doesn't even offer an intrinsic for this instruction.
Is there a known C/C++ pattern that can be expected to be recognized by an optimizing compiler so that it produces such efficient code? Or, more generally, is there a rational way how to program in C/C++ source level so that the carry bit is being used by the compiler to optimize the generated code?
EDIT:
A 1-hour lecture about std::midpoint: https://www.youtube.com/watch?v=sBtAGxBh-XI
Wow!
EDIT2: Great discussion on Microsoft blog
The following method avoids overflow and should result in fairly efficient assembly (example) without depending on non-standard features:
mean = (a&b) + (a^b)/2;
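This works because of the identity a + b = (a & b)*2 + (a ^ b): the AND keeps the bits both operands contribute (which count twice), the XOR keeps the bits only one of them contributes, and halving the XOR part before the add means nothing can overflow. A small self-check sketch:

#include <cstdint>
#include <cassert>

uint32_t mean(uint32_t a, uint32_t b)
{
    return (a & b) + (a ^ b) / 2;   // never overflows: both terms fit in 32 bits
}

int main()
{
    assert(mean(0xFFFFFFFFu, 0xFFFFFFFDu) == 0xFFFFFFFEu);  // (a+b)/2 would have wrapped
    assert(mean(7, 4) == 5);                                 // rounds down, like (7+4)/2
    return 0;
}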
There are three typical methods to compute average without overflow, one of which is limited to uint32_t (on 64-bit architectures).
// average "SWAR" / Montgomery
uint32_t avg(uint32_t a, uint32_t b) {
    return (a & b) + ((a ^ b) >> 1);
}
// in case the relative magnitudes are known
uint32_t avg2(uint32_t min, uint32_t max) {
    return min + (max - min) / 2;
}
// in case the relative magnitudes are not known (the difference must still fit in int32_t)
uint32_t avg2_constrained(uint32_t a, uint32_t b) {
    return a + (int32_t)(b - a) / 2;
}
// average by increasing the width (not applicable to uint64_t)
uint32_t avg3(uint32_t a, uint32_t b) {
    return ((uint64_t)a + b) >> 1;
}
The corresponding assembler sequences from clang on two architectures (x86-64 and ARM64) are
avg(unsigned int, unsigned int)
mov eax, esi
and eax, edi
xor esi, edi
shr esi
add eax, esi
avg2(unsigned int, unsigned int)
sub esi, edi
shr esi
lea eax, [rsi + rdi]
avg3(unsigned int, unsigned int)
mov ecx, edi
mov eax, esi
add rax, rcx
shr rax
vs.
avg(unsigned int, unsigned int)
and w8, w1, w0
eor w9, w1, w0
add w0, w8, w9, lsr #1
ret
avg2(unsigned int, unsigned int)
sub w8, w1, w0
add w0, w0, w8, lsr #1
ret
avg3(unsigned int, unsigned int):
mov w8, w1
add x8, x8, w0, uxtw
lsr x0, x8, #1
ret
Out of these three versions, avg2 performs on ARM64 as well as the optimal sequence using the carry flag would -- and avg3 is likely just as good, noting that the mov w8, w1 is only there to clear the top 32 bits, which may be unnecessary when the compiler knows they were already cleared by whatever previous instruction produced the value.
A similar statement can be made about the Intel version of avg3, which in the optimal case would compile to just the two meaningful instructions:
add rax, rcx
shr rax
See https://godbolt.org/z/5TMd3zr81 for online comparison.
The "SWAR"/Montgomery version is typically only justified when trying to compute multiple averages packed into a single (large) integer, in which case the full formula contains masking with the bit positions of the highest bits: return (a & b) + (((a ^ b) >> 1) & ~kH);.
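For instance, a sketch of that packed form for eight bytes at a time (kH here names the mask of every byte's high bit, following the formula above):

#include <cstdint>

// Average eight unsigned bytes packed in a uint64_t, lane by lane.
// After the shift, bit 7 of each byte would hold a bit leaked in from the
// neighbouring byte's bit 0, so it is masked off with ~kH.
uint64_t avg_packed_u8(uint64_t a, uint64_t b)
{
    const uint64_t kH = 0x8080808080808080ull;   // high bit of every byte
    return (a & b) + (((a ^ b) >> 1) & ~kH);
}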

Why does division by 3 require a rightshift (and other oddities) on x86?

I have the following C/C++ function:
unsigned div3(unsigned x) {
    return x / 3;
}
When compiled using clang 10 at -O3, this results in:
div3(unsigned int):
mov ecx, edi # tmp = x
mov eax, 2863311531 # result = 3^-1
imul rax, rcx # result *= tmp
shr rax, 33 # result >>= 33
ret
What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3^-1 mod 2^32, which is 2863311531.
There are some things that I don't understand though:
Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
Edit 1
For those who don't understand what I mean by 3^-1 mod 2^32, I am talking about the multiplicative inverse here.
For example:
// multiplying with inverse of 3:
15 * 2863311531 = 42949672965
42949672965 mod 2^32 = 5
// using fixed-point multiplication
15 * 2863311531 = 42949672965
42949672965 >> 33 = 5
// simply dividing by 3
15 / 3 = 5
So multiplying by 2863311531 is actually equivalent to dividing by 3. I assumed clang's optimization was based on modular arithmetic, when it's really based on fixed-point arithmetic.
Edit 2
I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 times 3^-1 is equal to 3^-1, not zero. Only fixed point arithmetic has correct rounding.
Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.
unsigned div3(unsigned x) {
    __builtin_assume(x % 3 == 0);
    return x / 3;
}
(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)
Can't we multiply rax with edi directly?
We can't imul rax, rdi because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.
(zero-extending into a different register is better because of limitations on mov-elimination, if you can't avoid it).
Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.
Why do we multiply in 64-bit mode?
(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)
You could mul edi to multiply EAX with EDI to get a 64-bit result split across EDX:EAX, but mul edi is 3 uops on Intel CPUs, vs. most modern x86-64 CPUs having fast 64-bit imul. (Although imul r64, r64 is slower on AMD Bulldozer-family, and on some low-power CPUs.) https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF)
(Fun fact: mul rdi is actually cheaper on Intel CPUs, only 2 uops. Perhaps something to do with not having to do extra splitting on the output of the integer multiply unit, like mul edi would have to split the 64-bit low half multiplier output into EDX and EAX halves, but that happens naturally for 64x64 => 128-bit mul.)
Also the part you want is in EDX so you'd need another mov eax, edx to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)
GCC 8.3 and earlier did use 32-bit mul instead of 64-bit imul (https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov instruction copying EDI to EAX, making this way look even worse :/
# gcc8.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop, stupid wasted instruction
mov edx, -1431655765 # 1 uop (same 32-bit constant, just printed differently)
mul edx # 3 uops on Sandybridge-family
mov eax, edx # 1 uop
shr eax # 1 uop
ret
# total of 7 uops on SnB-family
Would only be 6 uops with mov eax, 0xAAAAAAAB / mul edi, but still worse than:
# gcc9.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop
mov edi, 2863311531 # 1 uop
imul rax, rdi # 1 uop
shr rax, 33 # 1 uop
ret
# total 4 uops, not counting ret
Unfortunately, 64-bit 0x00000000AAAAAAAB can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB isn't encodeable. It would mean 0xFFFFFFFFAAAAAAAB.
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg doesn't produce the high half. Only the one-operand forms of mul and imul are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.
Only imul has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r whose only difference from imul r,r is the FLAGS output.
Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
No, there's no multiplier constant that would give the exact right answer for every possible input x if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x other than full range of unsigned, compilers don't have that option. (-ffast-math only applies to floating point; if you want faster approximations for integer math, code them manually like below):
See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.
For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed
// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
    int64_t invDivisor = 0x1999999A;
    return (int32_t) ((invDivisor * dividend) >> 32);
}
Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183 when 1073741829/10 is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)
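That first mismatch is easy to reproduce with a brute-force scan (a sketch; it simply walks upward until the approximation and true division disagree):

#include <cstdint>
#include <cstdio>

int32_t div10(int32_t dividend)
{
    int64_t invDivisor = 0x1999999A;              // ceil(2^32 / 10), slightly too large
    return (int32_t)((invDivisor * dividend) >> 32);
}

int main()
{
    for (int32_t x = 0; x < INT32_MAX; ++x) {
        if (div10(x) != x / 10) {
            printf("first mismatch at %d: got %d, expected %d\n", x, div10(x), x / 10);
            break;
        }
    }
    return 0;
}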
From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.
As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab = 0xaaaaaab0 when truncated to 32-bit, not 5.
unsigned div3_exact_only(unsigned x) {
    __builtin_assume(x % 3 == 0);  // or an equivalent with if() __builtin_unreachable()
    return x / 3;
}
Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.
div3_exact_only:
imul eax, edi, 0xAAAAAAAB # 1 uop, 3c latency
ret
However, it is something you can do yourself in source code, at least for known type widths like uint32_t:
uint32_t div3_exact_only(uint32_t x) {
    return x * 0xaaaaaaabU;
}
What's up with the 33-bit right shift at the end? I thought we can just drop the highest 32-bits.
Instead of 3^(-1) mod 2^32 you have to think more of 0.3333333, where the 0 before the point is located in the upper 32 bits and the 3333 part is located in the lower 32 bits.
This fixed-point operation works fine, but the result ends up in the upper part of rax, therefore the CPU must shift the result down again after the operation.
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
There is no MUL instruction equivalent to the IMUL instruction. The IMUL variant that is used takes two registers:
a <= a * b
There is no MUL instruction that does that. MUL instructions are more expensive because they store the full double-width result (128 bits, for 64-bit operands) in two registers.
Of course you could use the legacy instructions, but this does not change the fact that the result is stored in two registers.
If you look at my answer to the prior question:
Why does GCC use multiplication by a strange number in implementing integer division?
It contains a link to a pdf article that explains this (my answer clarifies the stuff that isn't explained well in this pdf article):
https://gmplib.org/~tege/divcnst-pldi94.pdf
Note that one extra bit of precision is needed for some divisors, such as 7: the multiplier would normally require 33 bits and the product 65 bits, but this can be avoided by handling the 2^32 bit separately with 3 additional instructions, as shown in my prior answer and below.
Take a look at the generated code if you change to
unsigned div7(unsigned x) {
    return x / 7;
}
So to explain the process, let L = ceil(log2(divisor)). For the question above, L = ceil(log2(3)) == 2. The right shift count would initially be 32+L = 34.
To generate a multiplier with a sufficient number of bits, two potential multipliers are generated: mhi will be the multiplier to be used, and the shift count will be 32+L.
mhi = (2^(32+L) + 2^(L))/3 = 5726623062
mlo = (2^(32+L) )/3 = 5726623061
Then a check is made to see if the number of required bits can be reduced:
while((L > 0) && ((mhi>>1) > (mlo>>1))){
    mhi = mhi>>1;
    mlo = mlo>>1;
    L = L-1;
}
if(mhi >= 2^32){
    mhi = mhi-2^32
    L = L-1;
    ; use 3 additional instructions for missing 2^32 bit
}
... mhi>>1 = 5726623062>>1 = 2863311531
... mlo>>1 = 5726623061>>1 = 2863311530 (mhi>>1) > (mlo>>1)
... mhi = mhi>>1 = 2863311531
... mlo = mlo>>1 = 2863311530
... L = L-1 = 1
... the next loop exits since now (mhi>>1) == (mlo>>1)
So the multiplier is mhi = 2863311531 and the shift count = 32+L = 33.
On a modern x86, multiply and shift instructions are constant time, so there's no point in reducing the multiplier (mhi) to fewer than 32 bits; the while(...) above is therefore changed to an if(...).
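Put together as code, the multiplier search looks roughly like this (a sketch of the procedure described above; the names choose_multiplier and MulShift are mine, not from any compiler):

#include <cstdint>

struct MulShift {
    uint64_t multiplier;   // fits in 32 bits unless needs_fixup is set
    unsigned shift;        // total right-shift count, including the implicit >>32 from taking the high half
    bool needs_fixup;      // true: the dropped 2^32 bit must be handled with extra instructions
};

MulShift choose_multiplier(uint32_t d)   // assumes 2 <= d < 2^31
{
    unsigned L = 0;
    while ((1ull << L) < d) ++L;                           // L = ceil(log2(d))
    uint64_t mhi = ((1ull << (32 + L)) + (1ull << L)) / d;
    uint64_t mlo =  (1ull << (32 + L)) / d;
    while (L > 0 && (mhi >> 1) > (mlo >> 1)) {             // shrink while the two stay distinguishable
        mhi >>= 1; mlo >>= 1; --L;
    }
    bool fixup = mhi >= (1ull << 32);                       // 33-bit multiplier (e.g. d = 7)
    if (fixup) mhi -= (1ull << 32);
    return { mhi, 32 + L, fixup };
}
// choose_multiplier(3) -> { 2863311531, 33, false }
// choose_multiplier(7) -> {  613566757, 35, true  }   (the shift by 35 ends up split across the div7 code below)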
In the case of 7, the loop exits on the first iteration, and requires 3 extra instructions to handle the 2^32 bit, so that mhi is <= 32 bits:
L = ceil(log2(7)) = 3
mhi = (2^(32+L) + 2^(L))/7 = 4908534053
mhi = mhi-2^32 = 613566757
Let ecx = dividend, the simple approach could overflow on the add:
mov eax, 613566757 ; eax = mhi
mul ecx ; edx:eax = ecx*mhi
add edx, ecx ; edx:eax = ecx*(mhi + 2^32), potential overflow
shr edx, 3
To avoid the potential overflow, note that eax = eax*2 - eax:
(ecx*eax)            = ((ecx*eax)<<1) - (ecx*eax)
(ecx*(eax+2^32))     = ((ecx*eax)<<1) + (ecx*2^32) - (ecx*eax)
(ecx*(eax+2^32))>>3  = (((ecx*eax)<<1) + (ecx*2^32) - (ecx*eax))>>3
                     = ((ecx*eax) + (((ecx*2^32) - (ecx*eax))>>1))>>2
so the actual code, using u32() to mean upper 32 bits:
... visual studio generated code for div7, dividend is ecx
mov eax, 613566757
mul ecx ; edx = u32( (ecx*eax) )
sub ecx, edx ; ecx = u32( ((ecx*2^32)-(ecx*eax)) )
shr ecx, 1 ; ecx = u32( (((ecx*2^32)-(ecx*eax))>>1) )
lea eax, DWORD PTR [edx+ecx] ; eax = u32( (ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1) )
shr eax, 2 ; eax = u32(((ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1))>>2)
If a remainder is wanted, then the following steps can be used:
mhi and L are generated based on divisor during compile time
...
quotient = (x*mhi)>>(32+L)
product = quotient*divisor
remainder = x - product
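For the divide-by-3 case of the original question, those steps boil down to (a sketch using the constants derived above):

#include <cstdint>

void divmod3(uint32_t x, uint32_t &q, uint32_t &r)
{
    q = (uint32_t)(((uint64_t)x * 2863311531u) >> 33);  // quotient  = (x*mhi) >> (32+L)
    r = x - q * 3;                                       // remainder = x - quotient*divisor
}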
x/3 is approximately (x * (2^32/3)) / 2^32. So we can perform a single 32x32->64 bit multiplication, take the higher 32 bits, and get approximately x/3.
There is some error because we cannot multiply exactly by 2^32/3, only by this number rounded to an integer. We get more precision using x/3 ≈ (x * (2^33/3)) / 2^33. (We can't use 2^34/3 because that is > 2^32). And that turns out to be good enough to get x/3 in all cases exactly. You would prove this by checking that the formula gives a result of k if the input is 3k or 3k+2.
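That proof over the extreme residues can also be confirmed by machine; a minimal brute-force sketch over every 32-bit input (it takes a few seconds):

#include <cstdint>
#include <cstdio>

int main()
{
    uint64_t m = (1ull << 33) / 3 + 1;   // 2863311531 = ceil(2^33 / 3)
    for (uint64_t x = 0; x <= 0xFFFFFFFFull; ++x) {
        if ((x * m) >> 33 != x / 3) {
            printf("mismatch at %llu\n", (unsigned long long)x);
            return 1;
        }
    }
    printf("exact for every uint32_t\n");
    return 0;
}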

Is there a branchless method to quickly find the min/max of two double-precision floating-point values?

I have two doubles, a and b, which are both in [0,1]. I want the min/max of a and b without branching for performance reasons.
Given that a and b are both positive, and below 1, is there an efficient way of getting the min/max of the two? Ideally, I want no branching.
Yes, there is a way to calculate the maximum or minimum of two doubles without any branches. The C++ code to do so looks like this:
#include <algorithm>

double FindMinimum(double a, double b)
{
    return std::min(a, b);
}

double FindMaximum(double a, double b)
{
    return std::max(a, b);
}
I bet you've seen this before. Lest you don't believe that this is branchless, check out the disassembly:
FindMinimum(double, double):
minsd xmm1, xmm0
movapd xmm0, xmm1
ret
FindMaximum(double, double):
maxsd xmm1, xmm0
movapd xmm0, xmm1
ret
That's what you get from all popular compilers targeting x86. The SSE2 instruction set is used, specifically the minsd/maxsd instructions, which branchlessly evaluate the minimum/maximum value of two double-precision floating-point values.
All 64-bit x86 processors support SSE2; it is required by the AMD64 extensions. Even most 32-bit-only x86 processors support SSE2, which was released in 2000. You'd have to go back a long way to find a processor that didn't support SSE2. But what if you did? Well, even there, you get branchless code on most popular compilers:
FindMinimum(double, double):
fld QWORD PTR [esp + 12]
fld QWORD PTR [esp + 4]
fucomi st(1)
fcmovnbe st(0), st(1)
fstp st(1)
ret
FindMaximum(double, double):
fld QWORD PTR [esp + 4]
fld QWORD PTR [esp + 12]
fucomi st(1)
fxch st(1)
fcmovnbe st(0), st(1)
fstp st(1)
ret
The fucomi instruction performs a comparison, setting flags, and then the fcmovnbe instruction performs a conditional move, based on the value of those flags. This is all completely branchless, and relies on instructions introduced to the x86 ISA with the Pentium Pro back in 1995, supported on all x86 chips since the Pentium II.
The only compiler that won't generate branchless code here is MSVC, because it doesn't take advantage of the FCMOVxx instruction. Instead, you get:
double FindMinimum(double, double) PROC
fld QWORD PTR [a]
fld QWORD PTR [b]
fcom st(1) ; compare "b" to "a"
fnstsw ax ; transfer FPU status word to AX register
test ah, 5 ; check C0 and C2 flags
jp Alt
fstp st(1) ; return "b"
ret
Alt:
fstp st(0) ; return "a"
ret
double FindMinimum(double, double) ENDP
double FindMaximum(double, double) PROC
fld QWORD PTR [b]
fld QWORD PTR [a]
fcom st(1) ; compare "b" to "a"
fnstsw ax ; transfer FPU status word to AX register
test ah, 5 ; check C0 and C2 flags
jp Alt
fstp st(0) ; return "b"
ret
Alt:
fstp st(1) ; return "a"
ret
double FindMaximum(double, double) ENDP
Notice the branching JP instruction (jump if parity bit set). The FCOM instruction is used to do the comparison, which is part of the base x87 FPU instruction set. Unfortunately, this sets flags in the FPU status word, so in order to branch on those flags, they need to be extracted. That's the purpose of the FNSTSW instruction, which stores the x87 FPU status word to the general-purpose AX register (it could also store to memory, but…why?). The code then TESTs the appropriate bit, and branches accordingly to ensure that the correct value is returned. In addition to the branch, retrieving the FPU status word will also be relatively slow. This is why the Pentium Pro introduced the FCOMI instructions.
However, it is unlikely that you would be able to improve upon the speed of any of this code by using bit-twiddling operations to determine min/max. There are two basic reasons:
The only compiler generating inefficient code is MSVC, and there's no good way to force it to generate the instructions you want it to. Although inline assembly is supported in MSVC for 32-bit x86 targets, it is a fool's errand when seeking performance improvements. I'll also quote myself:
Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.
In order to get access to the raw bits of a floating-point value, you'd have to do a domain transition, from floating-point to integer, and then back to floating-point. That's slow, especially without SSE2, because the only way to get a value from the x87 FPU to the general-purpose integer registers in the ALU is indirectly via memory.
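One alternative worth mentioning (not from the answer above): when SSE2 is available, the minsd/maxsd instructions can be requested from MSVC, GCC, and Clang alike through the <emmintrin.h> intrinsics rather than inline assembly. A sketch; whether 32-bit MSVC keeps the arguments in XMM registers still depends on the calling convention and /arch settings:

#include <emmintrin.h>

double FindMinimumSSE2(double a, double b)
{
    // _mm_min_sd / _mm_max_sd operate on the low element and map to minsd/maxsd.
    return _mm_cvtsd_f64(_mm_min_sd(_mm_set_sd(a), _mm_set_sd(b)));
}

double FindMaximumSSE2(double a, double b)
{
    return _mm_cvtsd_f64(_mm_max_sd(_mm_set_sd(a), _mm_set_sd(b)));
}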
If you wanted to pursue this strategy anyway—say, to benchmark it—you could take advantage of the fact that floating-point values are lexicographically ordered in terms of their IEEE 754 representations, except for the sign bit. So, since you are assuming that both values are positive:
FindMinimumOfTwoPositiveDoubles(double a, double b):
mov rax, QWORD PTR [a]
mov rdx, QWORD PTR [b]
sub rax, rdx ; subtract bitwise representation of the two values
shr rax, 63 ; isolate the sign bit to see if the result was negative
ret
FindMaximumOfTwoPositiveDoubles(double a, double b):
mov rax, QWORD PTR [b] ; \ reverse order of parameters
mov rdx, QWORD PTR [a] ; / for the SUB operation
sub rax, rdx
shr rax, 63
ret
Or, to avoid inline assembly:
#include <cstdint>
#include <climits>

// Note: as in the assembly above, these return which argument is the smaller/larger
// one (true when it is the first argument), not the value itself; the caller does the select.
bool FindMinimumOfTwoPositiveDoubles(double a, double b)
{
    static_assert(sizeof(a) == sizeof(uint64_t),
                  "A double must be the same size as a uint64_t for this bit manipulation to work.");
    const uint64_t aBits = *(reinterpret_cast<uint64_t*>(&a));
    const uint64_t bBits = *(reinterpret_cast<uint64_t*>(&b));
    return ((aBits - bBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}

bool FindMaximumOfTwoPositiveDoubles(double a, double b)
{
    static_assert(sizeof(a) == sizeof(uint64_t),
                  "A double must be the same size as a uint64_t for this bit manipulation to work.");
    const uint64_t aBits = *(reinterpret_cast<uint64_t*>(&a));
    const uint64_t bBits = *(reinterpret_cast<uint64_t*>(&b));
    return ((bBits - aBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}
Note that there are severe caveats to this implementation. In particular, it will break if the two floating-point values have different signs, or if both values are negative. If both values are negative, then the code can be modified to flip their signs, do the comparison, and then return the opposite value. To handle the case where the two values have different signs, code can be added to check the sign bit.
// ...
// Enforce two's-complement lexicographic ordering.
// (For these < 0 tests to fire, aBits and bBits would have to be read as signed
//  values, e.g. int64_t, rather than the uint64_t used above.)
if (aBits < 0)
{
    aBits = ((1ULL << ((sizeof(uint64_t) * CHAR_BIT) - 1)) - aBits);
}
if (bBits < 0)
{
    bBits = ((1ULL << ((sizeof(uint64_t) * CHAR_BIT) - 1)) - bBits);
}
// ...
Dealing with negative zero will also be a problem. IEEE 754 says that +0.0 is equal to −0.0, so your comparison function will have to decide if it wants to treat these values as different, or add special code to the comparison routines that ensures negative and positive zero are treated as equivalent.
Adding all of this special-case code will certainly reduce performance to the point that we will break even with a naïve floating-point comparison, and will very likely end up being slower.

Strange uint32_t to float array conversion

I have the following code snippet:
#include <cstdio>
#include <cstdint>

static const size_t ARR_SIZE = 129;

int main()
{
    uint32_t value = 2570980487;
    uint32_t arr[ARR_SIZE];
    for (int x = 0; x < ARR_SIZE; ++x)
        arr[x] = value;

    float arr_dst[ARR_SIZE];
    for (int x = 0; x < ARR_SIZE; ++x)
    {
        arr_dst[x] = static_cast<float>(arr[x]);
    }

    printf("%s\n", arr_dst[ARR_SIZE - 1] == arr_dst[ARR_SIZE - 2] ? "OK" : "WTF??!!");
    printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 2]);
    printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 1]);
    return 0;
}
If I compile it under MS Visual Studio 2015 I can see that the output is:
WTF??!!
magic = 2570980352.0000000000
magic = 2570980608.0000000000
So the last arr_dst element is different from the previous one, yet these two values were obtained by converting the same value, which populates the arr array!
Is it a bug?
I noticed that if I modify the conversion loop in the following manner, I get the "OK" result:
for (int x = 0; x < ARR_SIZE; ++x)
{
    if (x == 0)
        x = 0;
    arr_dst[x] = static_cast<float>(arr[x]);
}
So this probably is some issue with vectorizing optimisation.
This behavior does not reproduce on gcc 4.8. Any ideas?
A 32-bit IEEE-754 binary float, such as MSVC++ uses, provides only 6-7 decimal digits of precision. Your starting value is well within the range of that type, but it seems not to be exactly representable by that type, as indeed is the case for most values of type uint32_t.
At the same time, the floating-point unit of an x86 or x86_64 processor uses a wider representation even than MSVC++'s 64-bit double. It seems likely that after the loop exits, the last-computed array element remains in an FPU register, in its extended precision form. The program may then use that value directly from the register instead of reading it back from memory, which it is obligated to do with previous elements.
If the program performs the == comparison by promoting the narrower representation to the wider instead of the other way around, then the two values might indeed compare unequal, as the round-trip from extended precision to float and back loses precision. In any event, both values are converted to type double when passed to printf(); if indeed they compared unequal, then it is likely that the results of those conversions differ, too.
I'm not up on MSVC++ compile options, but very likely there is one that would quash this behavior. Such options sometimes go by names such as "strict math" or "strict fp". Be aware, however, that turning on such an option (or turning off its opposite) can be very costly in an FP-heavy program.
Converting between unsigned and float is not simple on x86; there's no single instruction for it (until AVX512). A common technique is to convert as signed and then fixup the result. There are multiple ways of doing this. (See this Q&A for some manually-vectorized methods with C intrinsics, not all of which have perfectly-rounded results.)
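In scalar C terms, the convert-as-signed-then-fix-up idea looks roughly like this (a sketch of the general technique, not MSVC's exact code; like MSVC's vector loop, the double rounding means it is not guaranteed to be correctly rounded for every input):

#include <cstdint>

float u32_to_float_fixup(uint32_t x)
{
    float f = (float)(int32_t)x;   // convert as signed: off by -2^32 whenever the top bit is set
    if ((int32_t)x < 0)
        f += 4294967296.0f;        // add 2^32 back (the 0x4f800000 constant in the vectorized code)
    return f;
}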
MSVC vectorizes the first 128 with one strategy, and then uses a different strategy (which wouldn't vectorize) for the last scalar element, which involves converting to double and then from double to float.
gcc and clang produce the 2570980608.0 result from their vectorized and scalar methods. 2570980608 - 2570980487 = 121, and 2570980487 - 2570980352 = 135 (with no rounding of inputs/outputs), so gcc and clang produce the correctly rounded result in this case (less than 0.5ulp of error). IDK if that's true for every possible uint32_t (but there are only 2^32 of them, we could exhaustively check). MSVC's end result for the vectorized loop has slightly more than 0.5ulp of error, but the scalar method is correctly rounded for this input.
IEEE math demands that + - * / and sqrt produce correctly rounded results (less than 0.5ulp of error), but other functions (like log) don't have such a strict requirement. IDK what the requirements are on rounding for int->float conversions, so IDK if what MSVC does is strictly legal (if you didn't use /fp:fast or anything).
See also Bruce Dawson's Floating-Point Determinism blog post (part of his excellent series about FP math), although he doesn't mention integer<->FP conversions.
We can see in the asm linked by the OP what MSVC did (stripped down to only the interesting instructions and commented by hand):
; Function compile flags: /Ogtp
# assembler macro constants
_arr_dst$ = -1040 ; size = 516
_arr$ = -520 ; size = 516
_main PROC ; COMDAT
00013 mov edx, 129
00018 mov eax, -1723986809 ; this is your unsigned 2570980487
0001d mov ecx, edx
00023 lea edi, DWORD PTR _arr$[esp+1088] ; edi=arr
0002a rep stosd ; memset in chunks of 4B
# arr[0..128] = 2570980487 at this point
0002c xor ecx, ecx ; i = 0
# xmm2 = 0.0 in each element (i.e. all-zero)
# xmm3 = __xmm#4f8000004f8000004f8000004f800000 (a constant repeated in each of 4 float elements)
####### The vectorized unsigned->float conversion strategy:
$LL7#main: ; do{
00030 movups xmm0, XMMWORD PTR _arr$[esp+ecx*4+1088] ; load 4 uint32_t
00038 cvtdq2ps xmm1, xmm0 ; SIGNED int to Single-precision float
0003b movaps xmm0, xmm1
0003e cmpltps xmm0, xmm2 ; xmm0 = (xmm0 < 0.0)
00042 andps xmm0, xmm3 ; mask the magic constant
00045 addps xmm0, xmm1 ; x += (x<0.0) ? magic_constant : 0.0f;
# There's no instruction for converting from unsigned to float, so compilers use inconvenient techniques like this to correct the result of converting as signed.
00048 movups XMMWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; store 4 floats to arr_dst
; and repeat the same thing again, with addresses that are 16B higher (+1104)
; i.e. this loop is unrolled by two
0006a add ecx, 8 ; i+=8 (two vectors of 4 elements)
0006d cmp ecx, 128
00073 jb SHORT $LL7#main ; }while(i<128)
#### End of vectorized loop
# and then IDK what MSVC is smoking; both these values are known at compile time. Is /Ogtp not full optimization?
# I don't see a branch target that would let execution reach this code
# other than by falling out of the loop that ends with ecx=128
00075 cmp ecx, edx
00077 jae $LN21#main ; if(i>=129): always false
0007d sub edx, ecx ; edx = 129-128 = 1
... some more ridiculous known-at-compile-time jumping later ...
######## The scalar unsigned->float conversion strategy for the last element
$LC15#main:
00140 mov eax, DWORD PTR _arr$[esp+ecx*4+1088]
00147 movd xmm0, eax
# eax = xmm0[0] = arr[128]
0014b cvtdq2pd xmm0, xmm0 ; convert the last element TO DOUBLE
0014f shr eax, 31 ; shift the sign bit to bit 1, so eax = 0 or 1
; then eax indexes a 16B constant, selecting either 0 or 0x41f0... (as whatever double that represents)
00152 addsd xmm0, QWORD PTR __xmm#41f00000000000000000000000000000[eax*8]
0015b cvtpd2ps xmm0, xmm0 ; double -> float
0015f movss DWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; and store it
00165 inc ecx ; ++i;
00166 cmp ecx, 129 ; } while(i<129)
0016c jb SHORT $LC15#main
# Yes, this is a loop, which always runs exactly once for the last element
By way of comparison, clang and gcc also don't optimize the whole thing away at compile time, but they do realize that they don't need a cleanup loop, and just do a single scalar store or convert after the respective loops. (clang actually fully unrolls everything unless you tell it not to.)
See the code on the Godbolt compiler explorer.
gcc just converts the upper and lower 16b halves to float separately, and combines them with a multiply by 65536 and add.
Clang's unsigned -> float conversion strategy is interesting: it never uses a cvt instruction at all. I think it stuffs the two 16-bit halves of the unsigned integer into the mantissas of two floats directly (with some tricks to set the exponents: bitwise boolean stuff and an ADDPS), then adds the low and high halves together like gcc does.
Of course, if you compile to 64-bit code, the scalar conversion can just zero-extend the uint32_t to 64-bit and convert that as a signed int64_t to float. Signed int64_t can represent every value of uint32_t, and x86 can convert a 64-bit signed int to float efficiently. But that doesn't vectorize.
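For reference, that 64-bit scalar path is trivial to express in source (a small sketch): int64_t can represent every uint32_t exactly, so one signed conversion does the whole job and the result is correctly rounded.

#include <cstdint>

float u32_to_float_64bit(uint32_t x)
{
    return (float)(int64_t)x;   // zero-extend to 64 bits, then a single signed int -> float conversion
}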
I did an investigation on a PowerPC implementation (Freescale MCP7450), as they are IMHO far better documented than any voodoo Intel comes up with.
As it turns out the floating point unit, FPU, and vector unit may have different rounding for floating point operations. The FPU can be configured to use one of four rounding modes; round to nearest (default), truncate, towards positive infinity and towards negative infinity. The vector unit however is only able to round to nearest, with a few select instructions having specific rounding rules. The internal precision of the FPU is 106-bit. The vector unit fulfills IEEE-754 but the documentation does not state much more.
Looking at your result the conversion 2570980608 is closer to the original integer, suggesting the FPU has better internal precision than the vector unit OR different rounding modes.

Can I expect float variable values that I set from literal constants to be unchanged after assignment to other variables?

If I do something like this:
float a = 1.5f;
float b = a;

void func(float arg)
{
    if (arg == 1.5f) printf("You are teh awresome!");
}

func(b);
Will the text print every time (and on every machine)?
EDIT
I mean, I'm not really sure if the value will pass through the FPU at some point even if I'm not doing any calculations, and if so, whether the FPU will change the binary representation of the value. I've read somewhere that the (approximate) same floating point values can have multiple binary representations in IEEE 754.
First of all, 1.5 can be stored exactly in memory, so for this specific value, yes, it will always be true.
More generally, I think that the inaccuracies only pop up when you're doing computations; if you just store a value, even one without an exact IEEE representation, it will always be mapped to the same value (so 0.3 == 0.3 even though 0.3 * 3 != 0.9).
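One subtlety worth illustrating here, because it is what trips people up in the answers further down: the literal you compare against must have the same type as the variable. A small sketch:

#include <cassert>

int main()
{
    float f = 0.3f;
    assert(f == 0.3f);   // same float literal, same conversion: compares equal
    // f == 0.3 would be false: f gets widened to double, and (double)0.3f is
    // not the same value as the double closest to 0.3.
    return 0;
}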
In the example, the value 1.5F has an exact representation in IEEE 754 (and pretty much any other conceivable binary or decimal floating point representation), so the answer is almost certainly going to be yes. However, there is no guarantee, and there could be compilers which do not manage to achieve the result.
If you change the value to one without an exact binary representation, such as 5.1F, the result is far from guaranteed.
Way, way, way back in their excellent classic book "The Elements of Programming Style", Kernighan & Plauger said:
A wise programmer once said, "Floating point numbers are like sand piles; every time you move one, you lose a little sand and you pick up a little dirt". And after a few computations, things can get pretty dirty.
(It's one of two phrases in the book that I highlighted many years ago1.)
They also observe:
10.0 times 0.1 is hardly ever 1.0.
Don't compare floating point numbers just for equality
Those observations were made in 1978 (for the second edition), but are still fundamentally valid today.
If the question is viewed at its most extremely restricted scope, you may be OK. If the question is varied very much, you are more likely to be bitten than not, and you'll probably be bitten sooner rather than later.
1 The other highlighted phrase is (minus bullets):
the subroutine call permits us to summarize the irregularities in the argument list [...]
[t]he subroutine itself summarizes the regularities of the code [...]
If it passes through the FPU at some point, it could be that, due to optimizations and register handling by the compiler, you end up comparing an FPU register with a value from memory. The former may have higher precision than the latter, and so the comparison gives you false.
This may vary depending on the compiler, compiler options and the platform you run on.
Here's a (not quite comprehensive) proof that (at least in GCC) you are guaranteed equality for floating literals.
Python code to generate file:
print """
#include <stdio.h>
int main()
{
"""
import random
chars = "abcdefghijklmnopqrstuvwxyz"
randoms = [str(random.random()) for _ in xrange(26)]
for c, r in zip(chars, randoms):
    print "float %s = %sf;" % (c, r)
for c, r in zip(chars, randoms):
    print r'if (%s != %sf) { printf("Error!\n"); }' % (c,r)
print """
return 0;
}
"""
Snippet of generated file:
#include <stdio.h>
int main()
{
    float a = 0.199698325654f;
    float b = 0.402517512357f;
    float c = 0.700489844438f;
    float d = 0.699640984356f;
    if (a != 0.199698325654f) { printf("Error!\n"); }
    if (b != 0.402517512357f) { printf("Error!\n"); }
    if (c != 0.700489844438f) { printf("Error!\n"); }
    if (d != 0.699640984356f) { printf("Error!\n"); }
    return 0;
}
And running it correctly does not print anything to the screen:
$ ./a.out
$
But here's the catch: if you don't put the f suffix on the literals in the equality check, it will fail every time! You can leave the f suffix off in the assignment, though, without problems.
I took a look at the disassembly from the VS2005 compiler. When running this simple program, I found that after float f=.1;, the condition f==.1 resulted in .... FALSE.
EDIT - this was due to the comparand being a double. When using a float literal (i.e. .1f), the comparison resulted in TRUE. This also holds when comparing double variables with double literals.
I added the source code and disassembly here:
float f=.1f;
0041363E fld dword ptr [__real#3dcccccd (415764h)]
00413644 fstp dword ptr [f]
double d=.1;
00413647 fld qword ptr [__real#3fb999999999999a (415748h)]
0041364D fstp qword ptr [d]
bool ffequal = f == .1f;
00413650 fld dword ptr [f]
00413653 fcomp qword ptr [__real#3fb99999a0000000 (415758h)]
00413659 fnstsw ax
0041365B test ah,44h
0041365E jp main+4Ch (41366Ch)
00413660 mov dword ptr [ebp-0F8h],1
0041366A jmp main+56h (413676h)
0041366C mov dword ptr [ebp-0F8h],0
00413676 mov al,byte ptr [ebp-0F8h]
0041367C mov byte ptr [fequal],al
// here, ffequal is true
bool dfequal = d == .1f;
0041367F fld qword ptr [__real#3fb99999a0000000 (415758h)]
00413685 fcomp qword ptr [d]
00413688 fnstsw ax
0041368A test ah,44h
0041368D jp main+7Bh (41369Bh)
0041368F mov dword ptr [ebp-0F8h],1
00413699 jmp main+85h (4136A5h)
0041369B mov dword ptr [ebp-0F8h],0
004136A5 mov al,byte ptr [ebp-0F8h]
004136AB mov byte ptr [dequal],al
// here, dfequal is false.
bool ddequal = d == .1;
004136AE fld qword ptr [__real#3fb999999999999a (415748h)]
004136B4 fcomp qword ptr [d]
004136B7 fnstsw ax
004136B9 test ah,44h
004136BC jp main+0AAh (4136CAh)
004136BE mov dword ptr [ebp-104h],1
004136C8 jmp main+0B4h (4136D4h)
004136CA mov dword ptr [ebp-104h],0
004136D4 mov al,byte ptr [ebp-104h]
004136DA mov byte ptr [ddequal],al
// ddequal is true
I don't think it will always compare equal:
float f1 = .1;
// compiled as
//// 64-bits literal into EAX:EBX
if( f1 == .1 )
//// load .1 into 80 bits of FPU register 1
//// copy EAX:EBX into FPU register 2 (leaving alone the last 16 bits)
//// call FPU compare instruction
////
When the compiler generates code as mentioned above, the condition will never be true: 16 bits will be different when comparing the 64-bit version of .1 with its 80-bit version.
Concluding: no: you cannot guarantee equality on each and every machine with this code.
It should print every time and on every machine. You're not using the output of any calculation, so there's no reason to assume that the values would change.
One caveat: certain "magic" numbers don't necessarily come from memory. In x86 we have the following instructions:
FLDZ - load 0.0 into ST(0)
FLD1 - load 1.0 into ST(0)
FLDL2T - load log2(10) into ST(0)
FLDL2E - load log2(e) into ST(0)
FLDPI - load pi into ST(0)
FLDLG2 - load log10(2) into ST(0)
FLDLN2 - load ln(2) into ST(0)
So if you happen to be comparing with one of these values you may possibly not get the exact same value as the const literal you referenced.