Strange output in comparison of float with float literal - C++

float f = 0.7;
if (f == 0.7)
    printf("equal");
else
    printf("not equal");
Why is the output "not equal"? Why does this happen?

This happens because in your statement
if(f == 0.7)
the 0.7 is treated as a double. Try 0.7f to ensure the value is treated as a float:
if(f == 0.7f)
But as Michael suggested in the comments, you should never test floating-point values for exact equality.

This answer complements the existing ones: note that 0.7 is not exactly representable as a float (or as a double). If it were, there would be no loss of information when converting to float and then back to double, and you wouldn't have this problem.
It could even be argued that there should be a compiler warning for literal floating-point constants that cannot be represented exactly, especially when the standard is so fuzzy regarding whether the rounding will be done at run-time, in whatever rounding mode has been set at that time, or at compile-time, possibly in another rounding mode.
All non-integer numbers that can be represented exactly have 5 as their last decimal digit. Unfortunately, the converse is not true: some numbers have 5 as their last decimal digit and cannot be represented exactly. Small integers can all be represented exactly, and division by a power of 2 transforms a representable number into another representable number, as long as you do not enter the realm of denormalized numbers.
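As a quick illustration, here is a minimal sketch (printing more digits than the value actually carries) that exposes the rounding of 0.7 at both precisions:
#include <stdio.h>

int main(void)
{
    // 0.7 is rounded once when parsed as a double, and rounded further
    // when narrowed to float; neither result is exactly 0.7.
    printf("%.17f\n", 0.7);          // typically 0.69999999999999996
    printf("%.17f\n", (double)0.7f); // typically 0.69999998807907104
    return 0;
}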

First of all, let's look inside a float. Take 0.1f: it is 4 bytes long (binary32), and in hex it is
3D CC CC CD.
By the IEEE 754 standard, to convert it to decimal we proceed like this:
In binary, 3D CC CC CD is
0 01111011 10011001100110011001101
Here the first digit is the sign bit. 0 means (-1)^0, i.e., the number is positive.
The next 8 bits are the exponent. In binary it is 01111011, in decimal 123. But the real exponent is 123 - 127 = -4 (the bias is always 127), which means we must multiply the number we get below by 2^(-4).
The last 23 bits are the significand. The first bit is multiplied by 1/(2^1) (0.5), the second by 1/(2^2) (0.25), and so on. Here is what we get:
We add up all those powers of 2 and then add 1 to the sum (the leading 1 is always implicit, by the standard). The result is
1.60000002384185791015625
Now let's multiply this number by 2^(-4) from the exponent; we just divide the number above by 2 four times:
0.100000001490116119384765625
(I used MS Calculator for the arithmetic.)
Now the second part: converting from decimal to binary.
I take the number 0.1.
It is easy because there is no integer part. The sign bit is 0.
Now I will calculate the exponent and the significand. The logic: repeatedly multiply by 2 (0.1*2 = 0.2); each integer part produced is the next binary digit, and whenever the result reaches 1 we subtract 1 and continue.
The number comes out as .00011001100110011001100110011..., and the standard says we must shift left until we get 1.(something). As you can see, we need 4 shifts, and from this we calculate the exponent (127 - 4 = 123). The significand is now 10011001100110011001100 (and some bits are lost).
Now the whole number: sign bit 0, exponent 123 (01111011), and significand 10011001100110011001100; all together it is
00111101110011001100110011001100
Let's compare it with the one we got in the first part:
00111101110011001100110011001101
As you can see, the last bits are not equal. That is because I truncated the number. The CPU and compiler know that there is something beyond what the significand can hold, and simply round the last bit up to 1.
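You can check this decoding without a calculator by copying the float's bytes into an integer and splitting the fields (a minimal sketch; memcpy avoids strict-aliasing issues):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 0.1f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  // reinterpret the 4 bytes safely
    printf("0x%08X\n", bits);        // 0x3DCCCCCD
    printf("sign=%u exp=%u frac=0x%06X\n",
           (unsigned)(bits >> 31),          // sign bit: 0
           (unsigned)((bits >> 23) & 0xFF), // biased exponent: 123
           (unsigned)(bits & 0x7FFFFF));    // 23-bit significand: 0x4CCCCD
    return 0;
}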

Another nearly identical question was linked to this one, hence the years-late answer. I don't think the above answers are complete.
int fun1 ( void )
{
    float x = 0.7;
    if (x == 0.7) return(1);
    else return(0);
}
int fun2 ( void )
{
    float x = 1.1;
    if (x == 1.1) return(1);
    else return(0);
}
int fun3 ( void )
{
    float x = 1.0;
    if (x == 1.0) return(1);
    else return(0);
}
int fun4 ( void )
{
    float x = 0.0;
    if (x == 0.0) return(1);
    else return(0);
}
int fun5 ( void )
{
    float x = 0.7;
    if (x == 0.7f) return(1);
    else return(0);
}
float fun10 ( void )
{
    return(0.7);
}
double fun11 ( void )
{
    return(0.7);
}
float fun12 ( void )
{
    return(1.0);
}
double fun13 ( void )
{
    return(1.0);
}
Disassembly of section .text:
00000000 <fun1>:
0: e3a00000 mov r0, #0
4: e12fff1e bx lr
00000008 <fun2>:
8: e3a00000 mov r0, #0
c: e12fff1e bx lr
00000010 <fun3>:
10: e3a00001 mov r0, #1
14: e12fff1e bx lr
00000018 <fun4>:
18: e3a00001 mov r0, #1
1c: e12fff1e bx lr
00000020 <fun5>:
20: e3a00001 mov r0, #1
24: e12fff1e bx lr
00000028 <fun10>:
28: e59f0000 ldr r0, [pc] ; 30 <fun10+0x8>
2c: e12fff1e bx lr
30: 3f333333 svccc 0x00333333
00000034 <fun11>:
34: e28f1004 add r1, pc, #4
38: e8910003 ldm r1, {r0, r1}
3c: e12fff1e bx lr
40: 66666666 strbtvs r6, [r6], -r6, ror #12
44: 3fe66666 svccc 0x00e66666
00000048 <fun12>:
48: e3a005fe mov r0, #1065353216 ; 0x3f800000
4c: e12fff1e bx lr
00000050 <fun13>:
50: e3a00000 mov r0, #0
54: e59f1000 ldr r1, [pc] ; 5c <fun13+0xc>
58: e12fff1e bx lr
5c: 3ff00000 svccc 0x00f00000 ; IMB
Why did fun3 and fun4 return one and not the others? Why does fun5 work?
It is about the language. The language says that 0.7 is a double unless you use the syntax 0.7f, in which case it is a single. So
float x=0.7;
the double 0.7 is converted to a single and stored in x.
if(x==0.7) return(1);
The language says we have to promote to the higher precision, so the single in x is converted to a double and compared with the double 0.7.
00000028 <fun10>:
28: e59f0000 ldr r0, [pc] ; 30 <fun10+0x8>
2c: e12fff1e bx lr
30: 3f333333 svccc 0x00333333
00000034 <fun11>:
34: e28f1004 add r1, pc, #4
38: e8910003 ldm r1, {r0, r1}
3c: e12fff1e bx lr
40: 66666666 strbtvs r6, [r6], -r6, ror #12
44: 3fe66666 svccc 0x00e66666
single 3f333333
double 3fe6666666666666
As Alexandr pointed out (if that answer remains), per IEEE 754 a single is
s eeeeeeee fffffffffffffffffffffff (1 sign bit, 8 exponent bits, 23 fraction bits)
And a double is
s eeeeeeeeeee ffffffffffffffffffffffffffffffffffffffffffffffffffff (1 sign bit, 11 exponent bits, 52 fraction bits)
with 52 bits of fraction rather than the 23 that single has.
00111111001100110011... single
001111111110011001100110... double
0 01111110 01100110011... single
0 01111111110 01100110011... double
Just like 1/3 in base 10 is 0.3333333... forever, we have a repeating pattern here: 0110.
01100110011001100110011 single, 23 bits
01100110011001100110011001100110.... double 52 bits.
And here is the answer.
if(x==0.7) return(1);
x contains 01100110011001100110011 as its fraction; when that gets converted back to double, the fraction is
01100110011001100110011000000000....
which is not equal to
01100110011001100110011001100110...
but here
if(x==0.7f) return(1);
That promotion doesn't happen; the same bit patterns are compared with each other.
Why does 1.0 work?
00000048 <fun12>:
48: e3a005fe mov r0, #1065353216 ; 0x3f800000
4c: e12fff1e bx lr
00000050 <fun13>:
50: e3a00000 mov r0, #0
54: e59f1000 ldr r1, [pc] ; 5c <fun13+0xc>
58: e12fff1e bx lr
5c: 3ff00000 svccc 0x00f00000 ; IMB
0011111110000000...
0011111111110000000...
0 01111111 0000000...
0 01111111111 0000000...
In both cases the fraction is all zeros, so converting from double to single and back to double loses no precision. The single converts to double exactly, and the bit comparison of the two values succeeds.
The highest-voted and accepted answer by halfdan is correct: this is a case of mixed precision, AND you should never do an equals comparison.
The why wasn't shown in that answer: 0.7 fails while 1.0 works, but why 0.7 fails wasn't shown. A duplicate question shows that 1.1 fails as well.
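The promotion can be demonstrated straight from C (a minimal sketch): the comparison happens in double, so the float operand is widened first.
#include <stdio.h>

int main(void)
{
    float x = 0.7f;                      // double 0.7 rounded to single
    printf("%d\n", x == 0.7);            // 0: x widened to double, != double 0.7
    printf("%d\n", x == 0.7f);           // 1: both sides are the same single
    printf("%d\n", (float)0.7 == 0.7f);  // 1: identical rounding on both sides
    return 0;
}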
Edit
The equals comparison can be taken out of the problem here; that makes it a different question that has already been answered, but it is the same underlying problem and produces the same "what the ..." initial shock.
int fun1 ( void )
{
    float x = 0.7;
    if (x < 0.7) return(1);
    else return(0);
}
int fun2 ( void )
{
    float x = 0.6;
    if (x < 0.6) return(1);
    else return(0);
}
Disassembly of section .text:
00000000 <fun1>:
0: e3a00001 mov r0, #1
4: e12fff1e bx lr
00000008 <fun2>:
8: e3a00000 mov r0, #0
c: e12fff1e bx lr
Why does one compare as less-than and the other not, when intuitively they should be equal?
From above we know the 0.7 story.
01100110011001100110011 single, 23 bits
01100110011001100110011001100110.... double 52 bits.
01100110011001100110011000000000....
is less than.
01100110011001100110011001100110...
0.6 is a different repeating pattern, 0011 rather than 0110. But when it is converted from a double to a single, or in general when represented as an IEEE 754 single:
00110011001100110011001100110011.... double 52 bits.
00110011001100110011001 is NOT the fraction for single
00110011001100110011010 IS the fraction for single
IEEE 754 defines rounding modes: round to nearest (the default), round up, round down, and round toward zero. Remember rounding in grade school: for 12345678, rounding to the 3rd digit from the top gives 12300000, but rounding to the 4th digit gives 12350000, because when the next digit is 5 or greater you round up. 5 is half of 10, the base (decimal); in binary, 1 is half of the base, so if the digit after the position we want to round to is 1, we round up, otherwise we don't. So for 0.7 we didn't round up; for 0.6 we do round up.
And now it is easy to see that
00110011001100110011010
converted to a double because of (x<0.6)
00110011001100110011010000000000....
is greater than
00110011001100110011001100110011....
So even without using equals the issue still presents itself: 0.7 is a double, 0.7f is a single, and the operation is promoted to the higher precision when they differ.

The problem you're facing is, as other commenters have noted, that it's generally unsafe to test floats for exact equivalence, since initialization errors or rounding errors in calculations can introduce minor differences that will cause the == operator to return false.
A better practice is to do something like
float f = 0.7;
if (fabs(f - 0.7) < FLT_EPSILON)
    printf("equal");
else
    printf("not equal");
This assumes that FLT_EPSILON (defined in <float.h>) is an appropriately small float value for your platform. Since the rounding or initialization errors are unlikely to exceed FLT_EPSILON, this gives you the reliable equivalence test you're looking for.
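For completeness, a compilable sketch of the same idea (fabs comes from <math.h>, FLT_EPSILON from <float.h>):
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    float f = 0.7f;
    // |f - 0.7| is about 1.2e-8 here, well under FLT_EPSILON (about 1.2e-7).
    if (fabs(f - 0.7) < FLT_EPSILON)
        printf("equal\n");
    else
        printf("not equal\n");
    return 0;
}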

A lot of the answers around the web make the mistake of looking at the absolute difference between floating-point numbers; this is only valid for special cases. The robust way is to look at the relative difference, as below:
// Floating point comparison:
#include <cmath>      // std::abs
#include <algorithm>  // std::max

bool CheckFP32Equal(float referenceValue, float value)
{
    const float fp32_epsilon = float(1E-7);
    float abs_diff = std::abs(referenceValue - value);

    // Both being exactly zero is a special case: the relative difference
    // below would divide by zero.
    if (referenceValue == 0.0f && value == 0.0f)
        return true;

    float rel_diff = abs_diff / std::max(std::abs(referenceValue), std::abs(value));
    return rel_diff < fp32_epsilon;
}
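A brief usage sketch (assuming CheckFP32Equal above is in scope; the values are chosen to show that the test scales with magnitude):
#include <cstdio>

// CheckFP32Equal as defined above.

int main()
{
    // Large values: absolute gap ~0.06, but relative error ~6e-8, so "equal".
    std::printf("%d\n", CheckFP32Equal(1000000.0f, 1000000.05f)); // 1
    // Tiny values: absolute gap ~5e-8, but relative error ~5%, so "not equal".
    std::printf("%d\n", CheckFP32Equal(1e-6f, 1.05e-6f));         // 0
    return 0;
}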

Consider this:
#include <stdio.h>

int main()
{
    float a = 0.7;
    if (0.7 > a)
        printf("Hi\n");
    else
        printf("Hello\n");
    return 0;
}
In if (0.7 > a), a is a float variable and 0.7 is a double constant. The double constant 0.7 is greater than the float variable a, because 0.7 loses precision (rounds down) when stored in a. Hence the if condition is satisfied and it prints 'Hi'.
Example:
#include <stdio.h>

int main()
{
    float a = 0.7;
    printf("%.10f %.10f\n", 0.7, a);
    return 0;
}
Output:
0.7000000000 0.6999999881

The floating-point value saved in the variable and the constant do not have the same data type; the difference is in the precision of the data types.
If you change the data type of the variable f to double, it will print "equal". This is because floating-point constants are stored as double by default (and unsuffixed integer constants as int), and double's precision is higher than float's. It becomes completely clear if you look at how floating-point numbers are converted to binary.

Related

ARM 7 Assembly - ADC with immediate 0

I have written a little C++ function on godbolt.org and I am curious about a certain line inside the assembly. Here is the function:
#include <cstdint>

unsigned long long foo(uint64_t a, uint8_t b)
{
    // unsigned long long fifteen = 15 * b;
    // unsigned long long result = a + fifteen;
    // unsigned long long resultfinal = result / 2;
    // return resultfinal;
    return (a + (15 * b)) / 2;
}
The generated assembly:
rsb r2, r2, r2, lsl #4
adds r0, r2, r0
adc r1, r1, #0
lsrs r1, r1, #1
rrx r0, r0
Now I don't understand why the line with the ADC instruction is there. It adds 0 to the high half of the 64-bit number. Why does it do that?
Here is the link if you want to play yourself:
Link to assembly
Arm32 registers are only 32 bits wide, but the value 'a' is 64 bits. The instructions you are seeing allow computations on sizes larger than 32 bits.
rsb r2, r2, r2, lsl #4 # 15*b -> b*16-b
adds r0, r2, r0 # a+(15*b) !LOW 32 bits! could carry.
adc r1, r1, #0 # add a carry bit to the high portion
lsrs r1, r1, #1 # divide high part by 2; (a+(15*b))/2
rrx r0, r0 # The opposite of add with carry flowing down.
Note: if you are confused by the adc instruction, the rrx may be confusing too. It is the 'dual' of the add-with-carry: for division, you need to take care of the bit that falls out of the higher part and carry it down into the next lower word.
I think the important point is that you can 'carry' this logic to arbitrarily large values. It has applications in cryptography, large value finance and other high accuracy science and engineering applications.
See: Gnu Multi-precision library, libtommath, etc.
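The adds/adc pair can be written out in portable C to see exactly what the carry is doing (a minimal sketch; add64 is a hypothetical helper name):
#include <stdint.h>
#include <stdio.h>

// Add two 64-bit values using only 32-bit additions, mirroring ADDS/ADC.
static void add64(uint32_t alo, uint32_t ahi, uint32_t blo, uint32_t bhi,
                  uint32_t *rlo, uint32_t *rhi)
{
    *rlo = alo + blo;              // like ADDS: the low words may wrap around
    uint32_t carry = (*rlo < alo); // wrap-around means a carry came out
    *rhi = ahi + bhi + carry;      // like ADC: fold the carry into the high words
}

int main(void)
{
    uint32_t lo, hi;
    add64(0xFFFFFFFFu, 0u, 1u, 0u, &lo, &hi);
    printf("0x%08X%08X\n", hi, lo); // 0x0000000100000000
    return 0;
}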

How does a C/C++ compiler optimize division by non-powers-of-two? [duplicate]

I've been reading about div and mul assembly operations, and I decided to see them in action by writing a simple program in C:
File division.c
#include <stdlib.h>
#include <stdio.h>

int main()
{
    size_t i = 9;
    size_t j = i / 5;
    printf("%zu\n", j);
    return 0;
}
And then generating assembly language code with:
gcc -S division.c -O0 -masm=intel
But looking at generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:
mov rax, QWORD PTR [rbp-16] ; Move i (=9) to RAX
movabs rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul rdx ; Multiply 9 by magic number
mov rax, rdx ; Take only the upper 64 bits of the result
shr rax, 2 ; Shift these bits 2 places to the right (?)
mov QWORD PTR [rbp-8], rax ; Magically, RAX contains 9/5=1 now,
; so we can assign it to j
What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?
Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency up to the dozens of cycles and bad throughput. (For x86, see Agner Fog's instruction tables and microarch guide).
If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.
Implementing the C / operator this way, instead of with a multi-instruction sequence involving div, is just GCC's default way of doing division by constants. It doesn't require optimizing across operations and doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add.
As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.
For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.
Dividing by 5 is the same as multiplying by 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is the binary representation of 4/5 if put after a hexadecimal point (i.e., the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed-point arithmetic (though note it's rounded to an integer at the end).
As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.
See Reciprocal Multiplication, a tutorial for a detailed writeup about how it works, explaining in terms of fixed-point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.
Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we get 0.001100110011..., which by trivial inspection can be added to the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x = 4/5. This is then represented as CCCCCCCCCCCCD in hex for rounding (as the binary digit beyond the last one present would be a 1).
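You can reproduce the compiler's sequence directly with a 128-bit multiply (a minimal sketch; unsigned __int128 is a GCC/Clang extension):
#include <stdint.h>
#include <stdio.h>

// Divide by 5 the way GCC does: take the high 64 bits of
// n * 0xCCCCCCCCCCCCCCCD (just over 4/5 in 0.64 fixed point), then shift by 2.
static uint64_t div5(uint64_t n)
{
    const uint64_t magic = 0xCCCCCCCCCCCCCCCDull;
    uint64_t hi = (uint64_t)(((unsigned __int128)n * magic) >> 64);
    return hi >> 2;
}

int main(void)
{
    for (uint64_t n = 0; n < 1000000; n++) {
        if (div5(n) != n / 5) {
            printf("mismatch at %llu\n", (unsigned long long)n);
            return 1;
        }
    }
    printf("ok\n");
    return 0;
}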
In general, multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead, we can significantly speed up division by a constant.
A wrinkle is that we cannot represent the reciprocal exactly (unless the division is by a power of two, but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.
-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.
When we multiply a 64-bit integer by a 0.64 fixed-point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates. By looking at the bit level, it is clear that we can treat both truncations as a single truncation.
This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?
To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.
The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5 . Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.
The error in our constant is (1/5) * 2^-64. The value of i is less than 2^64, so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2^-2.
(1/5) * 2^-2 < 1/5, so the answer will always be equal to doing an exact division and rounding towards zero.
Unfortunately this doesn't work for all divisors.
If we try to represent 4/7 as a 0.64 fixed-point number with rounding away from zero, we end up with an error of (6/7) * 2^-64. After multiplying by an i value of just under 2^64 we end up with an error just under 6/7, and after dividing by four we end up with an error of just under 1.5/7, which is greater than 1/7.
So to implement division by 7 correctly we need to multiply by a 0.65 fixed-point number. We can implement that by multiplying by the lower 64 bits of our fixed-point number, then adding the original number (this may overflow into the carry bit), then doing a rotate through carry.
Here is a link to a paper describing an algorithm that produces the values and code I see from Visual Studio (in most cases), and that I assume is still used in GCC for division of a variable integer by a constant integer.
http://gmplib.org/~tege/divcnst-pldi94.pdf
In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.
Scroll down to figure 6.2, which defines how a udword multiplier (max size N+1 bits) is generated, but doesn't clearly explain the process. I'll explain this below.
Figure 4.2 and figure 6.2 show how the multiplier can be reduced to a N bit or less multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figure 4.1 and 4.2 was derived.
In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.
Going back to Figure 6.2: the numerator (dividend) for mlow and mhigh can be larger than a udword only when the denominator (divisor) > 2^(N-1) (when ℓ == N, mlow = 2^(2N)); in this case the optimized replacement for n/d is a compare (if n >= d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64-bit mode as an example:
; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend dq 2 dup(?) ;16 byte dividend
divisor dq 1 dup(?) ; 8 byte divisor
; ...
mov rcx,divisor
mov rdx,0
mov rax,dividend+8 ;upper 8 bytes of dividend
div rcx ;after div, rax == 1
mov rax,dividend ;lower 8 bytes of dividend
div rcx
mov rdx,1 ;rdx:rax = N+1 bit value = 65 bit value
You can test this with GCC. You've already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).
On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate a N+1 bit multiplier as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:
; rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
; two instruction sequence for most divisors:
mul rbx ;rdx = upper 64 bits of product
shr rdx,cl ;rdx = quotient
;
; five instruction sequence for divisors like 7
; to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)
mul rbx ;rdx = upper 64 bits of product
sub rbx,rdx ;rbx -= rdx
shr rbx,1 ;rbx >>= 1
add rdx,rbx ;rdx = upper 64 bits of corrected product
shr rdx,cl ;rdx = quotient
; ...
To explain the 5-instruction sequence: a simple 3-instruction sequence could overflow. Let u64() mean the upper 64 bits (all that is needed for the quotient):
mul rbx ;rdx = u64(dvnd*mplr)
add rdx,rbx ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
shr rdx,cl
To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:
u64( ( rbx * (2^64 + rax) )>>(cl+1) )
u64( ( rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
u64( ( (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
u64( ( (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
u64( ( (((rbx * 2^64) - (rbx * rax))>>1) + (rbx*rax) )>>cl )
mul rbx ; (rbx*rax)
sub rbx,rdx ; (rbx*2^64)-(rbx*rax)
shr rbx,1 ;( (rbx*2^64)-(rbx*rax))>>1
add rdx,rbx ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
shr rdx,cl ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl
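The same 128-bit-multiply trick verifies this fixup sequence for 7 (a sketch; the constant is the low 64 bits of the 65-bit multiplier floor(2^67/7)+1, which should match what GCC emits for unsigned division by 7):
#include <stdint.h>
#include <stdio.h>

// Divide by 7 using a 65-bit multiplier: mulhi, then the sub/shr/add fixup
// from the five-instruction sequence above, then the post-shift (cl = 2).
static uint64_t div7(uint64_t n)
{
    const uint64_t magic = 0x2492492492492493ull; // low 64 bits of the multiplier
    uint64_t hi = (uint64_t)(((unsigned __int128)n * magic) >> 64);
    uint64_t t = ((n - hi) >> 1) + hi;  // emulates the implicit 2^64 term
    return t >> 2;
}

int main(void)
{
    for (uint64_t n = 0; n < 1000000; n++) {
        if (div7(n) != n / 7) {
            printf("mismatch at %llu\n", (unsigned long long)n);
            return 1;
        }
    }
    printf("ok\n");
    return 0;
}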
I will answer from a slightly different angle: Because it is allowed to do it.
C and C++ are defined against an abstract machine. The compiler transforms a program written in terms of that abstract machine into a concrete machine program, following the as-if rule.
The compiler is allowed to make ANY changes as long as it doesn't change the observable behaviour as specified by the abstract machine. There is no reasonable expectation that the compiler will transform your code in the most straightforward way possible (even though a lot of C programmers assume it will). Usually, it deviates because the compiler wants to optimize performance compared to the straightforward approach (as discussed in the other answers at length).
If under any circumstances the compiler "optimizes" a correct program to something that has a different observable behaviour, that is a compiler bug.
If there is any undefined behaviour in your code (signed integer overflow is a classic example), then this contract is void.

Why this comparison to zero is not working properly?

I have this code:
double A = doSomethingWonderful(); // same as doing A = 0;
if (A == 0)
{
    fprintf(stderr, "A=%llx\n", A);
}
and this output:
A=7f35959a51c0
how is this possible?
I checked the value of 7f35959a51c0, and it seems to be something like 6.91040329973658785751176861252E-310, which is very small but not zero.
EDIT:
Ok I understood that that way of printing the hex value for a double is not working. I need to find a way to print the bytes of the double.
Following the comments I modified my code:
A = doSomethingWonderful(); // same as doing A = 0;
if (A == 0)
{
    char bytes[8];
    memcpy(bytes, &A, sizeof(double));
    for (int i = 0; i < 8; i++)
        fprintf(stderr, " %x", bytes[i]);
}
and I get this output:
0 0 0 0 0 0 0 0
So finally it seems that the comparison is working properly but I was doing a bad print.
IEEE 754 double-precision floating-point values use a bias in the exponent field in order to represent both positive and negative exponents. For double-precision values, that bias is 1023, which happens to be 0x3ff in hex, and which matches the exponent bits in the value printed below for 1.0.
Two other small notes:
When printing bytes, you can use %hhx to get it to only print 2 hex digits instead of sign-extending to 8.
You can use a union to reliably print the double value as an 8-byte integer.
double A = 0;
if (A == 0)
{
    A = 1; // Note that you were setting A to 1 here!
    char bytes[8];
    memcpy(bytes, &A, sizeof(double));
    for (int i = 0; i < 8; i++)
        printf(" %hhx", bytes[i]);
}
int isZero;
union {
    unsigned long i;
    double d;
} u;

u.d = 0;
isZero = (u.d == 0.0);
printf("\n============\n");
printf("hex = %lx\nfloat = %f\nzero? %d\n", u.i, u.d, isZero);
Result:
0 0 0 0 0 0 f0 3f
============
hex = 0
float = 0.000000
zero? 1
So in the first line, we see that 1.0 is 0x3ff0000000000000 (the bytes print in little-endian order).
In the following lines, we see that when you use a union to print the hex value of the double 0.0, you get 0 as expected.
When you pass your double to printf(), you pass it as a floating point value. However, since the "%x" format is an integer format, your printf() implementation will try to read an integer argument. Due to this fundamental type mismatch, it is possible, for instance, that the calling code places your double value in a floating point register, while the printf() implementation tries to read it from an integer register. Details depend on your ABI, but apparently the bits that you see are not the bits that you passed. From a language point of view, you have undefined behavior the moment that you have a type mismatch between one printf() argument and its corresponding format specification.
Apart from that, +0.0 is indeed represented as all bits zero, both in single and in double precision formats. However, this is only positive zero, -0.0 is represented with the sign bit set.
In your last bit of code, you are inspecting the bit pattern of 1.0, because you overwrite the value of A before you do the conversion. Note also that you get fffffff0 instead of f0 for the seventh byte because of sign extension. For correct output, use an array of unsigned bytes.
The pattern that you are seeing decodes like this:
00 00 00 00 00 00 f0 3f
to big endian:
3f f0 00 00 00 00 00 00
decode fields:
sign: 0 (1 bit)
exponent: 01111111111 (11 bit), value = 1023
exponent = value - bias = 1023 - 1023 = 0
mantissa: 0...0 (52 bit), value with implicit leading 1 bit: 1.0000...
entire value: (-1)^0 * 2^0 * 1.0 = 1.0
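The same decoding can be done in code, which also sidesteps the printf type mismatch entirely (a minimal sketch using memcpy):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    double A = 1.0;
    uint64_t bits;
    memcpy(&bits, &A, sizeof bits); // well-defined, unlike printf("%llx", A)
    printf("raw:      %016llx\n", (unsigned long long)bits);                // 3ff0000000000000
    printf("sign:     %llu\n", (unsigned long long)(bits >> 63));           // 0
    printf("exponent: %llu\n", (unsigned long long)((bits >> 52) & 0x7FF)); // 1023
    printf("fraction: %013llx\n", (unsigned long long)(bits & 0xFFFFFFFFFFFFFull)); // 0
    return 0;
}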

Why does changing 0.1f to 0 slow down performance by 10x?

Why does this bit of code,
const float x[16] = { 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}
for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}
run more than 10 times faster than the following bit (identical except where noted)?
const float x[16] = { 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}
for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}
when compiling with Visual Studio 2010 SP1.
The optimization level was -O2 with SSE2 enabled.
I haven't tested with other compilers.
Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!
Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.
If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.
Here's the test code compiled on x64:
#include <iostream>
#include <cstdlib> // system
#include <omp.h>   // omp_get_wtime
using namespace std;

int main() {
    double start = omp_get_wtime();
    const float x[16] = {1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16] = {1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for (int i = 0; i < 16; i++)
    {
        y[i] = x[i];
    }
    for (int j = 0; j < 9000000; j++)
    {
        for (int i = 0; i < 16; i++)
        {
            y[i] *= x[i];
            y[i] /= z[i];
#ifdef FLOATING
            y[i] = y[i] + 0.1f;
            y[i] = y[i] - 0.1f;
#else
            y[i] = y[i] + 0;
            y[i] = y[i] - 0;
#endif
            if (j > 10000)
                cout << y[i] << " ";
        }
        if (j > 10000)
            cout << endl;
    }
    double end = omp_get_wtime();
    cout << end - start << endl;
    system("pause");
    return 0;
}
Output:
#define FLOATING
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
//#define FLOATING
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.46842e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.45208e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
Note how in the second run the numbers are very close to zero.
Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.
To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)
This means that rather than using these weird lower precision almost-zero values, we just round to zero instead.
Timings: Core i7 920 @ 3.5 GHz:
// Don't flush denormals to zero.
0.1f: 0.564067
0 : 26.7669
// Flush denormals to zero.
0.1f: 0.587117
0 : 0.341406
In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.
Using gcc and applying a diff to the generated assembly yields only this difference:
73c68,69
< movss LCPI1_0(%rip), %xmm1
---
> movabsq $0, %rcx
> cvtsi2ssq %rcx, %xmm1
81d76
< subss %xmm1, %xmm0
The cvtsi2ssq one being 10 times slower indeed.
Apparently, the float version uses an XMM register loaded from memory, while the int version converts a real int value 0 to float using the cvtsi2ssq instruction, taking a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)
(Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)
Update
Some extra tests show that it is not necessarily the cvtsi2ssq instruction. Once it is eliminated (using an int ai=0; float a=ai; and using a instead of 0), the speed difference remains. So @Mysticial is right: the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is approximately at 0.00000000000000000000000000000001, when the loop suddenly takes 10 times as long.
Update<<1
A small visualisation of this interesting phenomenon:
Column 1: a float, divided by 2 for every iteration
Column 2: the binary representation of this float
Column 3: the time taken to sum this float 1e7 times
You can clearly see the exponent (the last 9 bits) change to its lowest value, when denormalization sets in. At that point, simple addition becomes 20 times slower.
0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms
An equivalent discussion about ARM can be found in Stack Overflow question Denormalized floating point in Objective-C?.
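If you suspect your own data is wandering into this range, the classification is a single call (a sketch using std::fpclassify from <cmath>; the constants are chosen to sit just above and well below the minimum normal float, about 1.18e-38):
#include <cmath>
#include <cstdio>

int main()
{
    float normal = 1.2e-38f; // just above the minimum normal float
    float tiny   = 1.0e-44f; // far below it: subnormal (denormal)
    std::printf("%d %d\n",
                std::fpclassify(normal) == FP_SUBNORMAL,  // 0
                std::fpclassify(tiny)   == FP_SUBNORMAL); // 1
    return 0;
}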
It's due to denormalized floating-point use. How to get rid of both it and the performance penalty? Having scoured the Internet for ways of killing denormal numbers, it seems there is no "best" way to do this yet. I have found these three methods that may work best in different environments:
Might not work in some GCC environments:
// Requires #include <fenv.h>
fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);
Might not work in some Visual Studio environments:
// Requires #include <xmmintrin.h>
_mm_setcsr( _mm_getcsr() | (1<<15) | (1<<6) );
// Does both FTZ and DAZ bits. You can also use just hex value 0x8040 to do both.
// You might also want to use the underflow mask (1<<11)
Appears to work in both GCC and Visual Studio:
// Requires #include <xmmintrin.h>
// Requires #include <pmmintrin.h>
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
The Intel compiler has options to disable denormals by default on modern Intel CPUs. More details here
Compiler switches. -ffast-math, -msse or -mfpmath=sse will disable denormals and make a few other things faster, but unfortunately also make lots of other approximations that might break your code. Test carefully! The equivalent of fast-math for the Visual Studio compiler is /fp:fast, but I haven't been able to confirm whether this also disables denormals.
Dan Neely's comment ought to be expanded into an answer:
It is not the zero constant 0.0f that is denormalized or causes a slow down, it is the values that approach zero each iteration of the loop. As they come closer and closer to zero, they need more precision to represent and they become denormalized. These are the y[i] values. (They approach zero because x[i]/z[i] is less than 1.0 for all i.)
The crucial difference between the slow and fast versions of the code is the statement y[i] = y[i] + 0.1f;. As soon as this line is executed each iteration of the loop, the extra precision in the float is lost, and the denormalization needed to represent that precision is no longer needed. Afterwards, floating point operations on y[i] remain fast because they aren't denormalized.
Why is the extra precision lost when you add 0.1f? Because floating point numbers only have so many significant digits. Say you have enough storage for three significant digits, then 0.00001 = 1e-5, and 0.00001 + 0.1 = 0.1, at least for this example float format, because it doesn't have room to store the least significant bit in 0.10001.
In short, y[i]=y[i]+0.1f; y[i]=y[i]-0.1f; isn't the no-op you might think it is.
Mysticial said this as well: the content of the floats matters, not just the assembly code.
EDIT: To put a finer point on this, not every floating point operation takes the same amount of time to run, even if the machine opcode is the same. For some operands/inputs, the same instruction will take more time to run. This is especially true for denormal numbers.
In gcc you can enable FTZ and DAZ with this:
#include <xmmintrin.h>

#define FTZ 1
#define DAZ 1

void enableFtzDaz()
{
    int mxcsr = _mm_getcsr();
    if (FTZ) {
        mxcsr |= (1<<15) | (1<<11); // bit 15: flush-to-zero; bit 11: mask underflow exceptions
    }
    if (DAZ) {
        mxcsr |= (1<<6);            // bit 6: denormals-are-zero
    }
    _mm_setcsr(mxcsr);
}
also use gcc switches: -msse -mfpmath=sse
(corresponding credits to Carl Hetherington [1])
[1] http://carlh.net/plugins/denormals.php
CPUs have been only slightly slower on denormal numbers for a long time now. My Zen 2 CPU needs five clock cycles for a computation with denormal inputs and denormal outputs, and four clock cycles with normalized numbers.
This is a small benchmark, written with Visual C++, to show the slightly performance-degrading effect of denormal numbers:
#include <iostream>
#include <cstdint>
#include <cmath>  // trunc
#include <chrono>
using namespace std;
using namespace chrono;

uint64_t denScale( uint64_t rounds, bool den );

int main()
{
    auto bench = []( bool den ) -> double
    {
        constexpr uint64_t ROUNDS = 25'000'000;
        auto start = high_resolution_clock::now();
        int64_t nScale = denScale( ROUNDS, den );
        return (double)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / nScale;
    };
    double
        tDen = bench( true ),
        tNorm = bench( false ),
        rel = tDen / tNorm - 1;
    cout << tDen << endl;
    cout << tNorm << endl;
    cout << trunc( 100 * 10 * rel + 0.5 ) / 10 << "%" << endl;
}
This is the MASM assembly part.
PUBLIC ?denScale@@YA_K_K_N@Z
CONST SEGMENT
DEN DQ 00008000000000000h
ONE DQ 03FF0000000000000h
P5  DQ 03fe0000000000000h
CONST ENDS
_TEXT SEGMENT
?denScale@@YA_K_K_N@Z PROC
    xor rax, rax
    test rcx, rcx
    jz byeBye
    mov r8, ONE
    mov r9, DEN
    test dl, dl
    cmovnz r8, r9
    movq xmm1, P5
    mov rax, rcx
loopThis:
    movq xmm0, r8
    REPT 52
    mulsd xmm0, xmm1
    ENDM
    sub rcx, 1
    jae loopThis
    mov rdx, 52
    mul rdx
byeBye:
    ret
?denScale@@YA_K_K_N@Z ENDP
_TEXT ENDS
END
END
It would be nice to see some results in the comments.