Why are double preferred over float? [closed] - c++

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
In most of the code I see around, double is favourite against float, even when a high precision is not needed.
Since there are performance penalties when using double types (CPU/GPU/memory/bus/cache/...), what is the reason of this double overuse?
Example: in computational fluid dynamics all the software I worked with uses doubles. In this case a high precision is useless (because of the errors due to the approximations in the mathematical model), and there is a huge amount of data to be moved around, which could be cut in half using floats.
The fact that today's computers are powerful is meaningless, because they are used to solve more and more complex problems.

Among others:
The savings are hardly ever worth it (number-crunching is not typical).
Rounding errors accumulate, so better go to higher precision than needed from the start (experts may know it is precise enough anyway, and there are calculations which can be done exactly).
Common floating operations using the fpu internally often work on double or higher precision anyway.
C and C++ can implicitly convert from float to double, the other way needs an explicit cast.
Variadic and no-prototype functions always get double, not float. (second one is only in ancient C and actively discouraged)
You may commonly do an operation with more than needed precision, but seldom with less, so libraries generally favor higher precision too.
But in the end, YMMV: Measure, test, and decide for yourself and your specific situation.
BTW: There's even more for performance fanatics: Use the IEEE half precision type. Little hardware or compiler support for it exists, but it cuts your bandwidth requirements in half yet again.

In my opinion the answers so far don't really get the right point across, so here's my crack at it.
The short answer is C++ developers use doubles over floats:
To avoid premature optimization when they don't understand the performance trade-offs well ("they have higher precision, why not?" Is the thought process)
Habit
Culture
To match library function signatures
To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)
It's true double may be as fast as a float for a single computation because most FPUs have a wider internal representation than either the 32-bit float or 64-bit double represent.
However that's only a small piece of the picture. Now-days operational optimizations don't mean anything if you're bottle necked on cache/memory bandwidth.
Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:
They fit in half the memory. Which is like having all your caches be twice as large. (big win!!!)
If you really care about performance you'll use SSE instructions. SSE instructions that operate on floating point values have different instructions for 32-bit and 64-bit floating point representations. The 32-bit versions can fit 4 values in the 128-bit register operands, but the 64-bit versions can only fit 2 values. In this scenario you can likely double your FLOPS by using floats over double because each instruction operates on twice as much data.
In general, there is a real lack of knowledge of how floating point numbers really work in the majority of developers I've encountered. So I'm not really surprised most developers blindly use double.

double is, in some ways, the "natural" floating point type in the C language, which also influences C++. Consider that:
an unadorned, ordinary floating-point constant like 13.9 has type double. To make it float, we have to add an extra suffix f or F.
default argument promotion in C converts float function arguments* to double: this takes place when no declaration exists for an argument, such as when a function is declared as variadic (e.g. printf) or no declaration exists (old style C, not permitted in C++).
The %f conversion specifier of printf takes a double argument, not float. There is no dedicated way to print float-s; a float argument default-promotes to double and so matches %f.
On modern hardware, float and double are usually mapped, respectively, to 32 bit and 64 bit IEEE 754 types. The hardware works with the 64 bit values "natively": the floating-point registers are 64 bits wide, and the operations are built around the more precise type (or internally may be even more precise than that). Since double is mapped to that type, it is the "natural" floating-point type.
The precision of float is poor for any serious numerical work, and the reduced range could be a problem also. The IEEE 32 bit type has only 23 bits of mantissa (8 bits are consumed by the exponent field and one bit for the sign). The float type is useful for saving storage in large arrays of floating-point values provided that the loss of precision and range isn't a problem in the given application. For example, 32 bit floating-point values are sometimes used in audio for representing samples.
It is true that the use of a 64 bit type over 32 bit type doubles the raw memory bandwidth. However, that only affects programs which with a large arrays of data, which are accessed in a pattern that shows poor locality. The superior precision of the 64 bit floating-point type trumps issues of optimization. Quality of numerical results is more important than shaving cycles off the running time, in accordance with the principle of "get it right first, then make it fast".
* Note, however, that there is no general automatic promotion from float expressions to double; the only promotion of that kind is integral promotion: char, short and bitfields going to int.

This is mostly hardware dependent, but consider that the most common CPU (x86/x87 based) have internal FPU that operate on 80bits floating point precision (which exceeds both floats and doubles).
If you have to store in memory some intermediate calculations, double is the good average from internal precision and external space. Performance is more or less the same, on single values. It may be affected by the memory bandwidth on large numeric pipes (since they will have double length).
Consider that floats have a precision that approximate 6 decimal digits. On a N-cubed complexity problem (like a matrix inversion or transformation), you lose two or three more in mul and div, remaining with just 3 meaningful digits. On a 1920 pixel wide display they are simply not enough (you need at least 5 to match a pixel properly).
This roughly makes double to be preferable.

It is often relatively easy to determine that double is sufficient, even in cases where it would take significant numerical analysis effort to show that float is sufficient. That saves development cost, and the risk of incorrect results if the analysis is not done correctly.
Also any performance gain by using float is usually relatively slighter than using double,that is because most of the popular processors do all floating point arithmetic in one format that is even wider than double.

I think higher precision is the only reason. Actually most people don't think a lot about it, they just use double.
I think if float precision is good enough for particular task there is no reason to use double.

Related

How to force 32bits floating point calculation consistency across different platforms?

I have a simple piece of code that operates with floating points.
Few multiplications, divisions, exp(), subtraction and additions in a loop.
When I run the same piece of code on different platforms (like PC, Android phones, iPhones) I get slightly different results.
The result is pretty much equal on all the platforms but has a very small discrepancy - typically 1/1000000 of the floating point value.
I suppose the reason is that some phones don't have floating point registers and just simulate those calculations with integers, some do have floating point registers but have different implementations.
There are proofs to that here: http://christian-seiler.de/projekte/fpmath/
Is there a way to force all the platform to produce a consistent results?
For example a good & fast open-source library that implements floating point mechanics with integers (in software), thus I can avoid hardware implementation differences.
The reason I need an exact consistency is to avoid compound errors among layers of calculations.
Currently those compound errors do produce a significantly different result.
In other words, I don't care so much which platform has a more correct result, but rather want to force consistency to be able to reproduce equal behavior. For example a bug which was discovered on a mobile phone is much easier to debug on PC, but I need to reproduce this exact behavior
One relatively widely used and high quality software FP implementation is MPFR. It is a lot slower than hardware FP, though.
Of course, this won't solve the actual problems your algorithm has with compound errors, it will just make it produce the same errors on all platforms. Probably a better approach would be to design an algorithm which isn't as sensitive to small differences in FP arithmetic, if feasible. Or if you go the MPFR route, you can use a higher precision FP type and see if that helps, no need to limit yourself to emulating the hardware single/double precision.
32-bit floating point math, for a given calculation will, at best, have a precision of 1 in 16777216 (1 in 224). Functions such as exp are often implemented as a sequence of calculations, so may have a larger error due to this. If you do several calculations in a row, the errors will add and multiply up. In general float has about 6-7 digits of precision.
As one comment says, check the rounding mode is the same. Most FPU's have a "round to nearest" (rtn), "round to zero" (rtz) and "round to even" (rte) mode that you can choose. The default on different platforms MAY vary.
If you perform additions or subtractions of fairly small numbers to fairly large numbers, since the number has to be normalized you will have a greater error from these sort of operations.
Normalized means shifted such that both numbers have the decimal place lined up - just like if you do that on paper, you have to fill in extra zeros to line up the two numbers you are adding - but of course on paper you can add 12419818.0 with 0.000000001 and end up with 12419818.000000001 because paper has as much precision as you can be bothered with. Doing this in float or double will result in the same number as before.
There are indeed libraries that do floating point math - the most popular being MPFR - but it is a "multiprecision" library, but it will be fairly slow - because they are not really built to be "plugin replacement of float", but a tool for when you want to calculate pi with 1000s of digits, or when you want to calculate prime numbers in the ranges much larger than 64 or 128 bits, for example.
It MAY solve the problem to use such a library, but it will be slow.
A better choice would, moving from float to double should have a similar effect (double has 53 bits of mantissa, compared to the 23 in a 32-bit float, so more than twice as many bits in the mantissa). And should still be available as hardware instructions in any reasonably recent ARM processor, and as such relatively fast, but not as fast as float (FPU is available from ARMv7 - which certainly what you find in iPhone - at least from iPhone 3 and the middle to high end Android devices - I managed to find that Samsung Galaxy ACE has an ARM9 processor [first introduced in 1997] - so has no floating point hardware).

what difference between VS c++ and MinGW to implement double type

The same code run in VS c++ and MinGW got different result. The result is type of double. Example: in VS c++ got "-6.397745731873350", but in MinGW got "-6.397745731873378". There was litter different. But I don't known why?
I'd hazard a guess that it's one of two possibilities.
Back when Windows NT was new, and they supported porting to other processors (e.g., MIPS and DEC Alpha), MS had a little bit of a problem: the processors all had 64-bit floating point types, but they sometimes generated slightly different results. The DEC Alpha did computation on a 64-bit double as a 64-bit double. The default mode on an x86 was a little different: as you loaded a floating point number, any smaller type was converted to its internal 80-bit extended double format. Then all computation was done in 80-bit precision. Finally, when you stored the value, it was rounded back to 64 bits. This meant two things: first, for single- and double-precision results, the Intel was quite a bit slower. Second, double precision results often differed slightly between the processors.
To fix those "problems", Microsoft set up their standard library to adjust the floating point processor to only use 64-bit precision instead of 80-bit. Even though they've long-since dropped all support for other processors, they still (at least the last time I looked, and I'd be surprised if it's changed) set the floating point processor to only work in 64-bit precision. I haven't checked to be sure, but I'd guess that MingW may leave the floating point processor set to its default 80-bit precision instead.
There's one other possible source of difference: if you were comparing a 32-bit compiler to a 64-bit compiler, you get a different (though still somewhat similar) situation. The 32-bit compilers (both Microsoft and gcc) use the x87-style floating registers and instructions. Microsoft's 64-bit compiler does not use the x87-style floating point though (at least by default). Instead, it uses SSE instructions. I haven't done a lot of testing with this either, but I wouldn't be surprised at all if (again) there's a slight difference between x87 and SSE when it comes to things like guard bits and rounding. I wouldn't expect big differences at all, but would consider some slight differences extremely likely (bordering on inevitable).
Most floating-point numbers cannot be represented accurately by computers. They're approximation. There is a certain degree of unreliability in their representation. Different compilers may implement the unreliability differently. That is why you see those diffferences.
Read this excellent article:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
The difference is in the Precision in which MinGW and VS C++ can represent your floating point number..
What is Precision?
The precision of a floating point number is how many digits it can represent without losing any information it contains.
Consider the fraction 1/3. The decimal representation of this number is 0.33333333333333… with 3′s going out to infinity. An infinite length number would require infinite memory to be depicted with exact precision, but float or double data types typically only have 4 or 8 bytes. Thus Floating point & double numbers can only store a certain number of digits, and the rest are bound to get lost. Thus, there is no definite accurate way of representing float or double numbers with numbers that require more precision than the variables can hold.

What should i know when using floats/doubles between different machines?

I've heard that there are many problems with floats/doubles on different CPU's.
If i want to make a game that uses floats for everything, how can i be sure the float calculations are exactly the same on every machine so that my simulation will look exactly same on every machine?
I am also concerned about writing/reading files or sending/receiving the float values to different computers. What conversions there must be done, if any?
I need to be 100% sure that my float values are computed exactly the same, because even a slight difference in the calculations will result in a totally different future. Is this even possible ?
Standard C++ does not prescribe any details about floating point types other than range constraints, and possibly that some of the maths functions (like sine and exponential) have to be correct up to a certain level of accuracy.
Other than that, at that level of generality, there's really nothing else you can rely on!
That said, it is quite possible that you will not actually require binarily identical computations on every platform, and that the precision and accuracy guarantees of the float or double types will in fact be sufficient for simulation purposes.
Note that you cannot even produce a reliable result of an algebraic expression inside your own program when you modify the order of evaluation of subexpressions, so asking for the sort of reproducibility that you want may be a bit unrealistic anyway. If you need real floating point precision and accuracy guarantees, you might be better off with an arbitrary precision library with correct rounding, like MPFR - but that seems unrealistic for a game.
Serializing floats is an entirely different story, and you'll have to have some idea of the representations used by your target platforms. If all platforms were in fact to use IEEE 754 floats of 32 or 64 bit size, you could probably just exchange the binary representation directly (modulo endianness). If you have other platforms, you'll have to think up your own serialization scheme.
What every programmer should know: http://docs.sun.com/source/806-3568/ncg_goldberg.html

Emulate "double" using 2 "float"s

I am writing a program for an embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires a 64-bit double-precision addition and comparison. I am trying to emulate double datatype using a tuple of two floats. So a double d will be emulated as a struct containing the tuple: (float d.hi, float d.low).
The comparison should be straightforward using a lexicographic ordering. The addition however is a bit tricky because I am not sure which base should I use. Should it be FLT_MAX? And how can I detect a carry?
How can this be done?
Edit (Clarity): I need the extra significant digits rather than the extra range.
double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single precision arithmetic accompanied by a slight reduction of the single precision exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T.J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs, however much of the material covered in these papers is applicable independent of platform so should be useful for the task at hand.
https://hal.archives-ouvertes.fr/hal-00021443
Guillaume Da Graça, David Defour
Implementation of float-float operators on graphics hardware,
7th conference on Real Numbers and Computers, RNC7.
http://andrewthall.org/papers/df64_qf128.pdf
Andrew Thall
Extended-Precision Floating-Point Numbers for GPU Computation.
This is not going to be simple.
A float (IEEE 754 single-precision) has 1 sign bit, 8 exponent bits, and 23 bits of mantissa (well, effectively 24).
A double (IEEE 754 double-precision) has 1 sign bit, 11 exponent bits, and 52 bits of mantissa (effectively 53).
You can use the sign bit and 8 exponent bits from one of your floats, but how are you going to get 3 more exponent bits and 29 bits of mantissa out of the other?
Maybe somebody else can come up with something clever, but my answer is "this is impossible". (Or at least, "no easier than using a 64-bit struct and implementing your own operations")
It depends a bit on what types of operations you want to perform. If you only care about additions and subtractions, Kahan Summation can be a great solution.
If you need both the precision and a wide range, you'll be needing a software implementation of double precision floating point, such as SoftFloat.
(For addition, the basic principle is to break the representation (e.g. 64 bits) of each value into its three consitituent parts - sign, exponent and mantissa; then shift the mantissa of one part based on the difference in the exponents, add to or subtract from the mantissa of the other part based on the sign bits, and possibly renormalise the result by shifting the mantissa and adjusting the exponent correspondingly. Along the way, there are a lot of fiddly details to account for, in order to avoid unnecessary loss of accuracy, and deal with special values such as infinities, NaNs, and denormalised numbers.)
Given all the constraints for high precision over 23 magnitudes, I think the most fruitful method would be to implement a custom arithmetic package.
A quick survey shows Briggs' doubledouble C++ library should address your needs and then some. See this.[*] The default implementation is based on double to achieve 30 significant figure computation, but it is readily rewritten to use float to achieve 13 or 14 significant figures. That may be enough for your requirements if care is taken to segregate addition operations with similar magnitude values, only adding extremes together in the last operations.
Beware though, the comments mention messing around with the x87 control register. I didn't check into the details, but that might make the code too non-portable for your use.
[*] The C++ source is linked by that article, but only the gzipped tar was not a dead link.
This is similar to the double-double arithmetic used by many compilers for long double on some machines that have only hardware double calculation support. It's also used as float-float on older NVIDIA GPUs where there's no double support. See Emulating FP64 with 2 FP32 on a GPU. This way the calculation will be much faster than a software floating-point library.
However in most microcontrollers there's no hardware support for floats so they're implemented purely in software. Because of that, using float-float may not increase performance and introduce some memory overhead to save the extra bytes of exponent.
If you really need the longer mantissa, try using a custom floating-point library. You can choose whatever is enough for you, for example change the library to adapt a new 48-bit float type of your own if only 40 bits of mantissa and 7 bits of exponent is needed. No need to spend time for calculating/storing the unnecessary 16 bits anymore. But this library should be very efficient because compiler's libraries often have assembly level optimization for their own type of float.
Another software-based solution that might be of use: GNU MPFR
It takes care of many other special cases and allows arbitrary precision (better than 64-bit double) that you would have to otherwise take care of yourself.
That's not practical. If it was, every embedded 32-bit processor (or compiler) would emulate double precision by doing that. As it stands, none do it that I am aware of. Most of them just substitute float for double.
If you need the precision and not the dynamic range, your best bet would be to use fixed point. IF the compiler supports 64-bit this will be easier too.

How to write portable floating point arithmetic in c++?

Say you're writing a C++ application doing lots of floating point arithmetic. Say this application needs to be portable accross a reasonable range of hardware and OS platforms (say 32 and 64 bits hardware, Windows and Linux both in 32 and 64 bits flavors...).
How would you make sure that your floating point arithmetic is the same on all platforms ? For instance, how to be sure that a 32 bits floating point value will really be 32 bits on all platforms ?
For integers we have stdint.h but there doesn't seem to exist a floating point equivalent.
[EDIT]
I got very interesting answers but I'd like to add some precision to the question.
For integers, I can write:
#include <stdint>
[...]
int32_t myInt;
and be sure that whatever the (C99 compatible) platform I'm on, myInt is a 32 bits integer.
If I write:
double myDouble;
float myFloat;
am I certain that this will compile to, respectively, 64 bits and 32 bits floating point numbers on all platforms ?
Non-IEEE 754
Generally, you cannot. There's always a trade-off between consistency and performance, and C++ hands that to you.
For platforms that don't have floating point operations (like embedded and signal processing processors), you cannot use C++ "native" floating point operations, at least not portably so. While a software layer would be possible, that's certainly not feasible for this type of devices.
For these, you could use 16 bit or 32 bit fixed point arithmetic (but you might even discover that long is supported only rudimentary - and frequently, div is very expensive). However, this will be much slower than built-in fixed-point arithmetic, and becomes painful after the basic four operations.
I haven't come across devices that support floating point in a different format than IEEE 754. From my experience, your best bet is to hope for the standard, because otherwise you usually end up building algorithms and code around the capabilities of the device. When sin(x) suddenly costs 1000 times as much, you better pick an algorithm that doesn't need it.
IEEE 754 - Consistency
The only non-portability I found here is when you expect bit-identical results across platforms. The biggest influence is the optimizer. Again, you can trade accuracy and speed for consistency. Most compilers have a option for that - e.g. "floating point consistency" in Visual C++. But note that this is always accuracy beyond the guarantees of the standard.
Why results become inconsistent?
First, FPU registers often have higher resolution than double's (e.g. 80 bit), so as long as the code generator doesn't store the value back, intermediate values are held with higher accuracy.
Second, the equivalences like a*(b+c) = a*b + a*c are not exact due to the limited precision. Nonetheless the optimizer, if allowed, may make use of them.
Also - what I learned the hard way - printing and parsing functions are not necessarily consistent across platforms, probably due to numeric inaccuracies, too.
float
It is a common misconception that float operations are intrinsically faster than double. working on large float arrays is faster usually through less cache misses alone.
Be careful with float accuracy. it can be "good enough" for a long time, but I've often seen it fail faster than expected. Float-based FFT's can be much faster due to SIMD support, but generate notable artefacts quite early for audio processing.
Use fixed point.
However, if you want to approach the realm of possibly making portable floating point operations, you at least need to use controlfp to ensure consistent FPU behavior as well as ensuring that the compiler enforces ANSI conformance with respect to floating point operations. Why ANSI? Because it's a standard.
And even then you aren't guaranteeing that you can generate identical floating point behavior; that also depends on the CPU/FPU you are running on.
It shouldn't be an issue, IEEE 754 already defines all details of the layout of floats.
The maximum and minimum values storable should be defined in limits.h
Portable is one thing, generating consistent results on different platforms is another. Depending on what you are trying to do then writing portable code shouldn't be too difficult, but getting consistent results on ANY platform is practically impossible.
I believe "limits.h" will include the C library constants INT_MAX and its brethren. However, it is preferable to use "limits" and the classes it defines:
std::numeric_limits<float>, std::numeric_limits<double>, std::numberic_limits<int>, etc...
If you're assuming that you will get the same results on another system, read What could cause a deterministic process to generate floating point errors first. You might be surprised to learn that your floating point arithmetic isn't even the same across different runs on the very same machine!