using g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
I have tried different typecasting of scaledvalue2 but not until I stored the multiplication in a double variable and then to an int could I get desired result.. but I can't explain why ???
I know double precission(0.6999999999999999555910790149937383830547332763671875) is an issue but I don't understand why one way is OK and the other is not ??
I would expect both to fail if precision is a problem.
I DON'T NEED solution to fix it.. but just a WHY ??
(the problem IS fixed)
void main()
{
double value = 0.7;
int scaleFactor = 1000;
double doubleScaled = (double)scaleFactor * value;
int scaledvalue1 = doubleScaled; // = 700
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
int scaledvalue3 = (double)(1000.0 * 0.7); // = 700
std::ostringstream oss;
oss << scaledvalue2;
printf("convert FloatValue[%f] multi with %i to get %f = %i or %i or %i[%s]\r\n",
value,scaleFactor,doubleScaled,scaledvalue1,scaledvalue2,scaledvalue3,oss.str().c_str());
}
or in short:
value = 0.6999999999999999555910790149937383830547332763671875;
int scaledvalue_a = (double)(1000 * value); // = 699??
int scaledvalue_b = (double)(1000 * 0.6999999999999999555910790149937383830547332763671875); // = 700
// scaledvalue_a = 699
// scaledvalue_b = 700
I can't figure out what is going wrong here.
Output :
convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 699 or 700[699]
vendor_id : GenuineIntel
cpu family : 6
model : 54
model name : Intel(R) Atom(TM) CPU N2600 # 1.60GHz
This is going to be a bit handwaving; I was up too late last night watching the Cubs win the World Series, so don't insist on precision.
The rules for evaluating floating-point expressions are somewhat flexible, and compilers typically treat floating-point expressions even more flexibly than the rules formally allow. This makes evaluation of floating-point expressions faster, at the expense of making the results somewhat less predictable. Speed is important for floating-point calculations. Java initially made the mistake of imposing exact requirements on floating-point expressions and the numerics community screamed with pain. Java had to give in to the real world and relax those requirements.
double f();
double g();
double d = f() + g(); // 1
double dd1 = 1.6 * d; // 2
double dd2 = 1.6 * (f() + g()); // 3
On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations are in fact done with 80 bits of precision (unless you set some switches that kill performance, as Java required), even though double and float are 64 bits and 32 bits, respectively. So for arithmetic operations the operands are converted up to 80 bits and the results are converted back down to 64 or 32 bits. That's slow, so the generated code typically delays doing conversions as long as possible, doing all of the calculation with 80-bit precision.
But C and C++ both require that when a value is stored into a floating-point variable, the conversion has to be done. So, formally, in line //1, the compiler must convert the sum back to 64 bits to store it into the variable d. Then the value of dd1, calculated in line //2, must be computed using the value that was stored into d, i.e., a 64-bit value, while the value of dd2, calculated in line //3, can be calculated using f() + g(), i.e., a full 80-bit value. Those extra bits can make a difference, and the value of dd1 might be different from the value of dd2.
And often the compiler will hang on to the 80-bit value of f() + g() and use that instead of the value stored in d when it calculates the value of dd1. That's a non-conforming optimization, but as far as I know, every compiler does that sort of thing by default. They all have command-line switches to enforce the strictly-required behavior, so if you want slower code you can get it. <g>
For serious number crunching, speed is critical, so this flexibility is welcome, and number-crunching code is carefully written to avoid sensitivity to this kind of subtle difference. People get PhDs for figuring out how to make floating-point code fast and effective, so don't feel bad that the results you see don't seem to make sense. They don't, but they're close enough that, handled carefully, they give correct results without a speed penalty.
Since x86 floating-point unit performs its computations in extended precision floating point type (80 bits wide), the result might easily depend on whether the intermediate values were forcefully converted to double (64-bit floating-point type). In that respect, in non-optimized code it is not unusual to see compilers treat memory writes to double variables literally, but ignore "unnecessary" casts to double applied to temporary intermediate values.
In your example, the first part involves saving the intermediate result in a double variable
double doubleScaled = (double)scaleFactor * value;
int scaledvalue1 = doubleScaled; // = 700
The compiler takes it literally and does indeed store the product in a double variable doubleScaled, which unavoidably requires converting the 80-bit product to double. Later that double value is read from memory again and then converted to int type.
The second part
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
involves conversions that the compiler might see as unnecessary (and they indeed are unnecessary from the point of view of abstract C++ machine). The compiler ignores them, which means that the final int value is generated directly from the 80-bit product.
The presence of that intermediate conversion to double in the first variant (and its absence in the second one) is what causes that difference.
I converted mindriot's example assembly code to Intel syntax to test with Visual Studio. I could only reproduce the error by setting the floating point control word to use extended precision.
The issue is that rounding is performed when converting from extended precision to double precision when storing a double, versus truncation is performed when converting from extended precision to integer when storing an integer.
The extended precision multiply produces a product of 699.999..., but the product is rounded to 700.000... during the conversion from extended to double precision when the product is stored into doubleScaled.
double doubleScaled = (double)scaleFactor * value;
Since doubleScaled == 700.000..., when truncated to integer, it's still 700:
int scaledvalue1 = doubleScaled; // = 700
The product 699.999... is truncated when it's converted into an integer:
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
My guess here is that the compiler generated a compile time constant 0f 700.000... rather than doing the multiply at run time.
int scaledvalue3 = (double)(1000.0 * 0.7); // = 700
This truncation issue can be avoided by using the round() function from the C standard library.
int scaledvalue2 = (int)round(scaleFactor * value); // should == 700
Depending on compiler and optimization flags, scaledvalue_a, involving a variable, may be evaluated at runtime using your processors floating point instructions, whereas scaledvalue_b, involving constants only, may be evaluated at compile time using a math library (e.g. gcc uses GMP - the GNU Multiple Precision math library for this). The difference you are seeing seems to be the difference between the precision and rounding of the runtime vs compile time evaluation of that expression.
Due to rounding errors, most floating-point numbers end up being slightly imprecise.
For the below double to int conversion use std::ceil() API
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699
??
Related
This C/C++ reduced testcase:
int x = ~~~;
const double s = 1.0 / x;
const double r = 1.0 - x * s;
assert(r >= 0); // fail
isn't numerically stable, and hits the assert. The reason is that the last calculation can be done with FMA, which brings r into the negatives.
Clang has enabled FMA by default (since version 14), so it is causing some interesting regressions. Here's a running version: https://godbolt.org/z/avvnnKo5E
Interestingly enough, if one splits the last calculation in two, then the FMA is not emitted and the result is always non-negative:
int x = ~~~;
const double s = 1.0 / x;
const double tmp = x * s;
const double r = 1.0 - tmp;
assert(r >= 0); // OK
Is this a guaranteed behavior of IEEE754 / FP_CONTRACT, or is this playing with fire and one should just find a more numerically stable approach? I can't find any indication that fp contractions are only meant to happen "locally" (within one expression), and a simple split like to above is sufficient to prevent them.
(Of course, in due time, one could also think of replacing the algorithm with a more numerically stable one. Or add a clamp into the [0.0, 1.0] range, but that feels hackish.)
The C++ standard allows floating-point expressions to be evaluated with extra range and precision because C++ 2020 draft N4849 7.1 [expr.pre] 6 says:
The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
However, note 51 tells us:
The cast and assignment operators must still perform their specific conversions as described in 7.6.1.3, 7.6.3, 7.6.1.8
and 7.6.19.
The meaning of this is that an assignment or a cast must convert the value to the nominal type. So if extra range or precision has been used, that result must be converted to an actual double value when an assignment to a double is performed. (I expect, for this purpose, assignment includes initialization in a definition.)
So 1.0 - x * s may used a fused multiply-add, but const double tmp = x * s; const double r = 1.0 - tmp; must compute a double result for x * s, then subtract that double result from 1.0.
Note that this does not preclude const double tmp = x * s; from using extra precision to compute x * s and then rounding again to get a double result. In rare instances, this can produce a double-rounding error, where the result differs slightly from what one would get by rounding the real-number-arithmetic result of x•s directly to a double. That is unlikely to occur in practice; a C++ implementation would have no reason to compute x * s with extra precision and then round it to double.
Also note that C and C++ implementations do not necessarily conform to the C or C++ standard.
I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?
Huge difference.
As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.
Here's how the number of digits are calculated:
double has 52 mantissa bits + 1 hidden bit: log(253)÷log(10) = 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(224)÷log(10) = 7.22 digits
This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.
float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.7g\n", b); // prints 9.000023
while
double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.15g\n", b); // prints 8.99999999999996
Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.
[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed a IEEE single-precision floating point number (binary32), and double is a IEEE double-precision floating point number (binary64).
Here is what the standard C99 (ISO-IEC 9899 6.2.5 §10) or C++2003 (ISO-IEC 14882-2003 3.1.9 §8) standards say:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The C++ standard adds:
The value representation of floating-point types is implementation-defined.
I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic that covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The precision of the floating point representation increases as the magnitude decreases, hence floating point numbers between -1 and 1 are those with the most precision.
Given a quadratic equation: x2 − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are, r1 = 2.000316228 and r2 = 1.999683772.
Using float and double, we can write a test program:
#include <stdio.h>
#include <math.h>
void dbl_solve(double a, double b, double c)
{
double d = b*b - 4.0*a*c;
double sd = sqrt(d);
double r1 = (-b + sd) / (2.0*a);
double r2 = (-b - sd) / (2.0*a);
printf("%.5f\t%.5f\n", r1, r2);
}
void flt_solve(float a, float b, float c)
{
float d = b*b - 4.0f*a*c;
float sd = sqrtf(d);
float r1 = (-b + sd) / (2.0f*a);
float r2 = (-b - sd) / (2.0f*a);
printf("%.5f\t%.5f\n", r1, r2);
}
int main(void)
{
float fa = 1.0f;
float fb = -4.0000000f;
float fc = 3.9999999f;
double da = 1.0;
double db = -4.0000000;
double dc = 3.9999999;
flt_solve(fa, fb, fc);
dbl_solve(da, db, dc);
return 0;
}
Running the program gives me:
2.00000 2.00000
2.00032 1.99968
Note that the numbers aren't large, but still you get cancellation effects using float.
(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)
A double is 64 and single precision
(float) is 32 bits.
The double has a bigger mantissa (the integer bits of the real number).
Any inaccuracies will be smaller in the double.
I just ran into a error that took me forever to figure out and potentially can give you a good example of float precision.
#include <iostream>
#include <iomanip>
int main(){
for(float t=0;t<1;t+=0.01){
std::cout << std::fixed << std::setprecision(6) << t << std::endl;
}
}
The output is
0.000000
0.010000
0.020000
0.030000
0.040000
0.050000
0.060000
0.070000
0.080000
0.090000
0.100000
0.110000
0.120000
0.130000
0.140000
0.150000
0.160000
0.170000
0.180000
0.190000
0.200000
0.210000
0.220000
0.230000
0.240000
0.250000
0.260000
0.270000
0.280000
0.290000
0.300000
0.310000
0.320000
0.330000
0.340000
0.350000
0.360000
0.370000
0.380000
0.390000
0.400000
0.410000
0.420000
0.430000
0.440000
0.450000
0.460000
0.470000
0.480000
0.490000
0.500000
0.510000
0.520000
0.530000
0.540000
0.550000
0.560000
0.570000
0.580000
0.590000
0.600000
0.610000
0.620000
0.630000
0.640000
0.650000
0.660000
0.670000
0.680000
0.690000
0.700000
0.710000
0.720000
0.730000
0.740000
0.750000
0.760000
0.770000
0.780000
0.790000
0.800000
0.810000
0.820000
0.830000
0.839999
0.849999
0.859999
0.869999
0.879999
0.889999
0.899999
0.909999
0.919999
0.929999
0.939999
0.949999
0.959999
0.969999
0.979999
0.989999
0.999999
As you can see after 0.83, the precision runs down significantly.
However, if I set up t as double, such an issue won't happen.
It took me five hours to realize this minor error, which ruined my program.
There are three floating point types:
float
double
long double
A simple Venn diagram will explain about:
The set of values of the types
The size of the numbers involved in the float-point calculations is not the most relevant thing. It's the calculation that is being performed that is relevant.
In essence, if you're performing a calculation and the result is an irrational number or recurring decimal, then there will be rounding errors when that number is squashed into the finite size data structure you're using. Since double is twice the size of float then the rounding error will be a lot smaller.
The tests may specifically use numbers which would cause this kind of error and therefore tested that you'd used the appropriate type in your code.
Type float, 32 bits long, has a precision of 7 digits. While it may store values with very large or very small range (+/- 3.4 * 10^38 or * 10^-38), it has only 7 significant digits.
Type double, 64 bits long, has a bigger range (*10^+/-308) and 15 digits precision.
Type long double is nominally 80 bits, though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent that just ridiculously huge and should have 19 digits precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.
Generally speaking, just use type double when you need a floating point value/variable. Literal floating point values used in expressions will be treated as doubles by default, and most of the math functions that return floating point values return doubles. You'll save yourself many headaches and typecastings if you just use double.
Floats have less precision than doubles. Although you already know, read What WE Should Know About Floating-Point Arithmetic for better understanding.
When using floating point numbers you cannot trust that your local tests will be exactly the same as the tests that are done on the server side. The environment and the compiler are probably different on you local system and where the final tests are run. I have seen this problem many times before in some TopCoder competitions especially if you try to compare two floating point numbers.
The built-in comparison operations differ as in when you compare 2 numbers with floating point, the difference in data type (i.e. float or double) may result in different outcomes.
If one works with embedded processing, eventually the underlying hardware (e.g. FPGA or some specific processor / microcontroller model) will have float implemented optimally in hardware whereas double will use software routines. So if the precision of a float is enough to handle the needs, the program will execute some times faster with float then double. As noted on other answers, beware of accumulation errors.
Quantitatively, as other answers have pointed out, the difference is that type double has about twice the precision, and three times the range, as type float (depending on how you count).
But perhaps even more important is the qualitative difference. Type float has good precision, which will often be good enough for whatever you're doing. Type double, on the other hand, has excellent precision, which will almost always be good enough for whatever you're doing.
The upshot, which is not nearly as well known as it should be, is that you should almost always use type double. Unless you have some particularly special need, you should almost never use type float.
As everyone knows, "roundoff error" is often a problem when you're doing floating-point work. Roundoff error can be subtle, and difficult to track down, and difficult to fix. Most programmers don't have the time or expertise to track down and fix numerical errors in floating-point algorithms — because unfortunately, the details end up being different for every different algorithm. But type double has enough precision such that, much of the time, you don't have to worry.
You'll get good results anyway. With type float, on the other hand, alarming-looking issues with roundoff crop up all the time.
And the thing that's not necessarily different between type float and double is execution speed. On most of today's general-purpose processors, arithmetic operations on type float and double take more or less exactly the same amount of time. Everything's done in parallel, so you don't pay a speed penalty for the greater range and precision of type double. That's why it's safe to make the recommendation that you should almost never use type float: Using double shouldn't cost you anything in speed, and it shouldn't cost you much in space, and it will almost definitely pay off handsomely in freedom from precision and roundoff error woes.
(With that said, though, one of the "special needs" where you may need type float is when you're doing embedded work on a microcontroller, or writing code that's optimized for a GPU. On those processors, type double can be significantly slower, or practically nonexistent, so in those cases programmers do typically choose type float for speed, and maybe pay for it in precision.)
Unlike an int (whole number), a float have a decimal point, and so can a double.
But the difference between the two is that a double is twice as detailed as a float, meaning that it can have double the amount of numbers after the decimal point.
I've recently come across some code which has a loop of the form
for (int i = 0; i < 1e7; i++){
}
I question the wisdom of doing this since 1e7 is a floating point type, and will cause i to be promoted when evaluating the stopping condition. Should this be of cause for concern?
The elephant in the room here is that the range of an int could be as small as -32767 to +32767, and the behaviour on assigning a larger value than this to such an int is undefined.
But, as for your main point, indeed it should concern you as it is a very bad habit. Things could go wrong as yes, 1e7 is a floating point double type.
The fact that i will be converted to a floating point due to type promotion rules is somewhat moot: the real damage is done if there is unexpected truncation of the apparent integral literal. By the way of a "proof by example", consider first the loop
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 18446744073709551615ULL; ){
std::cout << i << "\n";
}
This outputs every consecutive value of i in the range, as you'd expect. Note that std::numeric_limits<std::uint64_t>::max() is 18446744073709551615ULL, which is 1 less than the 64th power of 2. (Here I'm using a slide-like "operator" ++< which is useful when working with unsigned types. Many folk consider --> and ++< as obfuscating but in scientific programming they are common, particularly -->.)
Now on my machine, a double is an IEEE754 64 bit floating point. (Such as scheme is particularly good at representing powers of 2 exactly - IEEE754 can represent powers of 2 up to 1022 exactly.) So 18,446,744,073,709,551,616 (the 64th power of 2) can be represented exactly as a double. The nearest representable number before that is 18,446,744,073,709,550,592 (which is 1024 less).
So now let's write the loop as
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 1.8446744073709551615e19; ){
std::cout << i << "\n";
}
On my machine that will only output one value of i: 18,446,744,073,709,550,592 (the number that we've already seen). This proves that 1.8446744073709551615e19 is a floating point type. If the compiler was allowed to treat the literal as an integral type then the output of the two loops would be equivalent.
It will work, assuming that your int is at least 32 bits.
However, if you really want to use exponential notation, you should better define an integer constant outside the loop and use proper casting, like this:
const int MAX_INDEX = static_cast<int>(1.0e7);
...
for (int i = 0; i < MAX_INDEX; i++) {
...
}
Considering this, I'd say it is much better to write
const int MAX_INDEX = 10000000;
or if you can use C++14
const int MAX_INDEX = 10'000'000;
1e7 is a literal of type double, and usually double is 64-bit IEEE 754 format with a 52-bit mantissa. Roughly every tenth power of 2 corresponds to a third power of 10, so double should be able to represent integers up to at least 105*3 = 1015, exactly. And if int is 32-bit then int has roughly 103*3 = 109 as max value (asking Google search it says that "2**31 - 1" = 2 147 483 647, i.e. twice the rough estimate).
So, in practice it's safe on current desktop systems and larger.
But C++ allows int to be just 16 bits, and on e.g. an embedded system with that small int, one would have Undefined Behavior.
If the intention to loop for a exact integer number of iterations, for example if iterating over exactly all the elements in an array then comparing against a floating point value is maybe not such a good idea, solely for accuracy reasons; since the implicit cast of an integer to float will truncate integers toward zero there's no real danger of out-of-bounds access, it will just abort the loop short.
Now the question is: When do these effects actually kick in? Will your program experience them? The floating point representation usually used these days is IEEE 754. As long as the exponent is 0 a floating point value is essentially an integer. C double precision floats 52 bits for the mantissa, which gives you integer precision to a value of up to 2^52, which is in the order of about 1e15. Without specifying with a suffix f that you want a floating point literal to be interpreted single precision the literal will be double precision and the implicit conversion will target that as well. So as long as your loop end condition is less 2^52 it will work reliably!
Now one question you have to think about on the x86 architecture is efficiency. The very first 80x87 FPUs came in a different package, and later a different chip and as aresult getting values into the FPU registers is a bit awkward on the x86 assembly level. Depending on what your intentions are it might make the difference in runtime for a realtime application; but that's premature optimization.
TL;DR: Is it safe to to? Most certainly yes. Will it cause trouble? It could cause numerical problems. Could it invoke undefined behavior? Depends on how you use the loop end condition, but if i is used to index an array and for some reason the array length ended up in a floating point variable always truncating toward zero it's not going to cause a logical problem. Is it a smart thing to do? Depends on the application.
When reading Lua’s source code, I noticed that Lua uses a macro to round double values to 32-bit int values. The macro is defined in the Llimits.h header file and reads as follows:
union i_cast {double d; int i[2]};
#define double2int(i, d, t) \
{volatile union i_cast u; u.d = (d) + 6755399441055744.0; \
(i) = (t)u.i[ENDIANLOC];}
Here ENDIANLOC is defined according to endianness: 0 for little endian, 1 for big endian architectures; Lua carefully handles endianness. The t argument is substituted with an integer type like int or unsigned int.
I did a little research and found that there is a simpler format of that macro which uses the same technique:
#define double2int(i, d) \
{double t = ((d) + 6755399441055744.0); i = *((int *)(&t));}
Or, in a C++-style:
inline int double2int(double d)
{
d += 6755399441055744.0;
return reinterpret_cast<int&>(d);
}
This trick can work on any machine using IEEE 754 (which means pretty much every machine today). It works for both positive and negative numbers, and the rounding follows Banker’s Rule. (This is not surprising, since it follows IEEE 754.)
I wrote a little program to test it:
int main()
{
double d = -12345678.9;
int i;
double2int(i, d)
printf("%d\n", i);
return 0;
}
And it outputs -12345679, as expected.
I would like to understand how this tricky macro works in detail. The magic number 6755399441055744.0 is actually 251 + 252, or 1.5 × 252, and 1.5 in binary can be represented as 1.1. When any 32-bit integer is added to this magic number—
Well, I’m lost from here. How does this trick work?
Update
As #Mysticial points out, this method does not limit itself to a 32-bit int, it can also be expanded to a 64-bit int as long as the number is in the range of 252. (Although the macro needs some modification.)
Some materials say this method cannot be used in Direct3D.
When working with Microsoft assembler for x86, there is an even faster macro written in assembly code (the following is also extracted from Lua source):
#define double2int(i,n) __asm {__asm fld n __asm fistp i}
There is a similar magic number for single precision numbers: 1.5 × 223.
A value of the double floating-point type is represented like so:
and it can be seen as two 32-bit integers; now, the int taken in all the versions of your code (supposing it’s a 32-bit int) is the one on the right in the figure, so what you are doing in the end is just taking the lowest 32 bits of mantissa.
Now, to the magic number; as you correctly stated, 6755399441055744 is 251 + 252; adding such a number forces the double to go into the “sweet range” between 252 and 253, which, as explained by Wikipedia, has an interesting property:
Between 252 = 4,503,599,627,370,496 and 253 = 9,007,199,254,740,992, the representable numbers are exactly the integers.
This follows from the fact that the mantissa is 52 bits wide.
The other interesting fact about adding 251 + 252 is that it affects the mantissa only in the two highest bits—which are discarded anyway, since we are taking only its lowest 32 bits.
Last but not least: the sign.
IEEE 754 floating point uses a magnitude and sign representation, while integers on “normal” machines use 2’s complement arithmetic; how is this handled here?
We talked only about positive integers; now suppose we are dealing with a negative number in the range representable by a 32-bit int, so less (in absolute value) than (−231 + 1); call it −a. Such a number is obviously made positive by adding the magic number, and the resulting value is 252 + 251 + (−a).
Now, what do we get if we interpret the mantissa in 2’s complement representation? It must be the result of 2’s complement sum of (252 + 251) and (−a). Again, the first term affects only the upper two bits, what remains in the bits 0–50 is the 2’s complement representation of (−a) (again, minus the upper two bits).
Since reduction of a 2’s complement number to a smaller width is done just by cutting away the extra bits on the left, taking the lower 32 bits gives us correctly (−a) in 32-bit, 2’s complement arithmetic.
This kind of "trick" comes from older x86 processors, using the 8087 intructions/interface for floating point. On these machines, there's an instruction for converting floating point to integer "fist", but it uses the current fp rounding mode. Unfortunately, the C spec requires that fp->int conversions truncate towards zero, while all other fp operations round to nearest, so doing an
fp->int conversion requires first changing the fp rounding mode, then doing a fist, then restoring the fp rounding mode.
Now on the original 8086/8087, this wasn't too bad, but on later processors that started to get super-scalar and out-of-order execution, altering the fp rounding mode generally serializes the CPU core and is quite expensive. So on a CPU like a Pentium-III or Pentium-IV, this overall cost is quite high -- a normal fp->int conversion is 10x or more expensive than this add+store+load trick.
On x86-64, however, floating point is done with the xmm instructions, and the cost of converting
fp->int is pretty small, so this "optimization" is likely slower than a normal conversion.
if this helps with visualization, that lua magic value
(2^52+2^51, or base2 of 110 then [50 zeros]
hex
0x 0018 0000 0000 0000 (18e12)
octal
0 300 00000 00000 00000 ( 3e17)
Here is a simpler implementation of the above Lua trick:
/**
* Round to the nearest integer.
* for tie-breaks: round half to even (bankers' rounding)
* Only works for inputs in the range: [-2^51, 2^51]
*/
inline double rint(double d)
{
double x = 6755399441055744.0; // 2^51 + 2^52
return d + x - x;
}
The trick works for numbers with absolute value < 2 ^ 51.
This is a little program to test it: ideone.com
#include <cstdio>
int main()
{
// round to nearest integer
printf("%.1f, %.1f\n", rint(-12345678.3), rint(-12345678.9));
// test tie-breaking rule
printf("%.1f, %.1f, %.1f, %.1f\n", rint(-24.5), rint(-23.5), rint(23.5), rint(24.5));
return 0;
}
// output:
// -12345678.0, -12345679.0
// -24.0, -24.0, 24.0, 24.0
I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?
Huge difference.
As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.
Here's how the number of digits are calculated:
double has 52 mantissa bits + 1 hidden bit: log(253)÷log(10) = 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(224)÷log(10) = 7.22 digits
This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.
float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.7g\n", b); // prints 9.000023
while
double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.15g\n", b); // prints 8.99999999999996
Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.
[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed a IEEE single-precision floating point number (binary32), and double is a IEEE double-precision floating point number (binary64).
Here is what the standard C99 (ISO-IEC 9899 6.2.5 §10) or C++2003 (ISO-IEC 14882-2003 3.1.9 §8) standards say:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The C++ standard adds:
The value representation of floating-point types is implementation-defined.
I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic that covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The precision of the floating point representation increases as the magnitude decreases, hence floating point numbers between -1 and 1 are those with the most precision.
Given a quadratic equation: x2 − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are, r1 = 2.000316228 and r2 = 1.999683772.
Using float and double, we can write a test program:
#include <stdio.h>
#include <math.h>
void dbl_solve(double a, double b, double c)
{
double d = b*b - 4.0*a*c;
double sd = sqrt(d);
double r1 = (-b + sd) / (2.0*a);
double r2 = (-b - sd) / (2.0*a);
printf("%.5f\t%.5f\n", r1, r2);
}
void flt_solve(float a, float b, float c)
{
float d = b*b - 4.0f*a*c;
float sd = sqrtf(d);
float r1 = (-b + sd) / (2.0f*a);
float r2 = (-b - sd) / (2.0f*a);
printf("%.5f\t%.5f\n", r1, r2);
}
int main(void)
{
float fa = 1.0f;
float fb = -4.0000000f;
float fc = 3.9999999f;
double da = 1.0;
double db = -4.0000000;
double dc = 3.9999999;
flt_solve(fa, fb, fc);
dbl_solve(da, db, dc);
return 0;
}
Running the program gives me:
2.00000 2.00000
2.00032 1.99968
Note that the numbers aren't large, but still you get cancellation effects using float.
(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)
A double is 64 and single precision
(float) is 32 bits.
The double has a bigger mantissa (the integer bits of the real number).
Any inaccuracies will be smaller in the double.
I just ran into a error that took me forever to figure out and potentially can give you a good example of float precision.
#include <iostream>
#include <iomanip>
int main(){
for(float t=0;t<1;t+=0.01){
std::cout << std::fixed << std::setprecision(6) << t << std::endl;
}
}
The output is
0.000000
0.010000
0.020000
0.030000
0.040000
0.050000
0.060000
0.070000
0.080000
0.090000
0.100000
0.110000
0.120000
0.130000
0.140000
0.150000
0.160000
0.170000
0.180000
0.190000
0.200000
0.210000
0.220000
0.230000
0.240000
0.250000
0.260000
0.270000
0.280000
0.290000
0.300000
0.310000
0.320000
0.330000
0.340000
0.350000
0.360000
0.370000
0.380000
0.390000
0.400000
0.410000
0.420000
0.430000
0.440000
0.450000
0.460000
0.470000
0.480000
0.490000
0.500000
0.510000
0.520000
0.530000
0.540000
0.550000
0.560000
0.570000
0.580000
0.590000
0.600000
0.610000
0.620000
0.630000
0.640000
0.650000
0.660000
0.670000
0.680000
0.690000
0.700000
0.710000
0.720000
0.730000
0.740000
0.750000
0.760000
0.770000
0.780000
0.790000
0.800000
0.810000
0.820000
0.830000
0.839999
0.849999
0.859999
0.869999
0.879999
0.889999
0.899999
0.909999
0.919999
0.929999
0.939999
0.949999
0.959999
0.969999
0.979999
0.989999
0.999999
As you can see after 0.83, the precision runs down significantly.
However, if I set up t as double, such an issue won't happen.
It took me five hours to realize this minor error, which ruined my program.
There are three floating point types:
float
double
long double
A simple Venn diagram will explain about:
The set of values of the types
The size of the numbers involved in the float-point calculations is not the most relevant thing. It's the calculation that is being performed that is relevant.
In essence, if you're performing a calculation and the result is an irrational number or recurring decimal, then there will be rounding errors when that number is squashed into the finite size data structure you're using. Since double is twice the size of float then the rounding error will be a lot smaller.
The tests may specifically use numbers which would cause this kind of error and therefore tested that you'd used the appropriate type in your code.
Type float, 32 bits long, has a precision of 7 digits. While it may store values with very large or very small range (+/- 3.4 * 10^38 or * 10^-38), it has only 7 significant digits.
Type double, 64 bits long, has a bigger range (*10^+/-308) and 15 digits precision.
Type long double is nominally 80 bits, though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent that just ridiculously huge and should have 19 digits precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.
Generally speaking, just use type double when you need a floating point value/variable. Literal floating point values used in expressions will be treated as doubles by default, and most of the math functions that return floating point values return doubles. You'll save yourself many headaches and typecastings if you just use double.
Floats have less precision than doubles. Although you already know, read What WE Should Know About Floating-Point Arithmetic for better understanding.
When using floating point numbers you cannot trust that your local tests will be exactly the same as the tests that are done on the server side. The environment and the compiler are probably different on you local system and where the final tests are run. I have seen this problem many times before in some TopCoder competitions especially if you try to compare two floating point numbers.
The built-in comparison operations differ as in when you compare 2 numbers with floating point, the difference in data type (i.e. float or double) may result in different outcomes.
If one works with embedded processing, eventually the underlying hardware (e.g. FPGA or some specific processor / microcontroller model) will have float implemented optimally in hardware whereas double will use software routines. So if the precision of a float is enough to handle the needs, the program will execute some times faster with float then double. As noted on other answers, beware of accumulation errors.
Quantitatively, as other answers have pointed out, the difference is that type double has about twice the precision, and three times the range, as type float (depending on how you count).
But perhaps even more important is the qualitative difference. Type float has good precision, which will often be good enough for whatever you're doing. Type double, on the other hand, has excellent precision, which will almost always be good enough for whatever you're doing.
The upshot, which is not nearly as well known as it should be, is that you should almost always use type double. Unless you have some particularly special need, you should almost never use type float.
As everyone knows, "roundoff error" is often a problem when you're doing floating-point work. Roundoff error can be subtle, and difficult to track down, and difficult to fix. Most programmers don't have the time or expertise to track down and fix numerical errors in floating-point algorithms — because unfortunately, the details end up being different for every different algorithm. But type double has enough precision such that, much of the time, you don't have to worry.
You'll get good results anyway. With type float, on the other hand, alarming-looking issues with roundoff crop up all the time.
And the thing that's not necessarily different between type float and double is execution speed. On most of today's general-purpose processors, arithmetic operations on type float and double take more or less exactly the same amount of time. Everything's done in parallel, so you don't pay a speed penalty for the greater range and precision of type double. That's why it's safe to make the recommendation that you should almost never use type float: Using double shouldn't cost you anything in speed, and it shouldn't cost you much in space, and it will almost definitely pay off handsomely in freedom from precision and roundoff error woes.
(With that said, though, one of the "special needs" where you may need type float is when you're doing embedded work on a microcontroller, or writing code that's optimized for a GPU. On those processors, type double can be significantly slower, or practically nonexistent, so in those cases programmers do typically choose type float for speed, and maybe pay for it in precision.)
Unlike an int (whole number), a float have a decimal point, and so can a double.
But the difference between the two is that a double is twice as detailed as a float, meaning that it can have double the amount of numbers after the decimal point.