I wish to declare a floating point variable that can store more significant digits than the more common doubles and long doubles, preferably something like an octuple (256 bits), that (I believe) might give about 70 significant digits.
How do I declare such a variable? And will cross-platform compatibility be an issue (as opposed to fixed-width integers)?
Any help is much appreciated.
The C++ standard only requires that long double provide at least as much precision as double; the finer details of the floating-point scheme are left to the implementation.
An IEEE754 quadruple precision long double will only give you 36 significant figures. I've never come across a system, at the time of writing, that implements octuple precision.
Your best bet is to use something like the GNU Multiple Precision Arithmetic Library, or, if you really want binary floating point, The GNU Multiple Precision Floating Point Reliable Library.
While I don't know of any C++ libraries that fully implement a proper IEEE754 octuple precision, I've found a library by the name ttmath which implements a multi-word system, allowing it to deal with much larger numbers.
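For illustration, here is a minimal sketch of what a high-precision computation might look like with MPFR (the second library mentioned above); the 256-bit precision and the number of printed digits are arbitrary choices for this example, and you link with -lmpfr -lgmp:

```
#include <stdio.h>
#include <mpfr.h>

int main(void) {
    mpfr_t x;
    mpfr_init2(x, 256);               /* 256-bit significand, roughly 77 decimal digits */
    mpfr_set_ui(x, 2, MPFR_RNDN);     /* x = 2 */
    mpfr_sqrt(x, x, MPFR_RNDN);       /* x = sqrt(2), correctly rounded to 256 bits */
    mpfr_printf("%.70Rf\n", x);       /* print 70 digits after the decimal point */
    mpfr_clear(x);
    return 0;
}
```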
In most of the code I see around, double is favoured over float, even when high precision is not needed.
Since there are performance penalties when using double (CPU/GPU/memory/bus/cache/...), what is the reason for this overuse of double?
Example: in computational fluid dynamics, all the software I have worked with uses doubles. In this case high precision is useless (because of the errors due to the approximations in the mathematical model), and there is a huge amount of data to be moved around, which could be cut in half using floats.
The fact that today's computers are powerful is beside the point, because they are used to solve more and more complex problems.
Among others:
The savings are hardly ever worth it (number-crunching is not typical).
Rounding errors accumulate, so it is better to go to higher precision than needed from the start (experts may know it is precise enough anyway, and there are calculations which can be done exactly); see the sketch at the end of this answer.
Common floating-point operations often work internally on double or higher precision in the FPU anyway.
C and C++ implicitly convert from float to double; the reverse is a narrowing conversion that typically draws compiler warnings without an explicit cast.
Variadic and unprototyped functions always receive double, not float (the latter exist only in ancient C and are actively discouraged).
You may commonly do an operation with more precision than needed, but seldom with less, so libraries generally favour higher precision too.
But in the end, YMMV: Measure, test, and decide for yourself and your specific situation.
BTW: There's even more for performance fanatics: Use the IEEE half precision type. Little hardware or compiler support for it exists, but it cuts your bandwidth requirements in half yet again.
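To illustrate the point about accumulating rounding errors, here is a small, self-contained comparison; the iteration count and increment are arbitrary choices for this sketch:

```
#include <cstdio>

int main() {
    // Sum 0.1 ten million times; the exact result would be 1,000,000.
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 10000000; ++i) {
        fsum += 0.1f;
        dsum += 0.1;
    }
    std::printf("float : %f\n", fsum);   // drifts noticeably away from 1000000
    std::printf("double: %f\n", dsum);   // much closer to 1000000
    return 0;
}
```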
In my opinion the answers so far don't really get the right point across, so here's my crack at it.
The short answer is C++ developers use doubles over floats:
To avoid premature optimization when they don't understand the performance trade-offs well ("they have higher precision, why not?" is the thought process)
Habit
Culture
To match library function signatures
To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)
It's true that a double may be as fast as a float for a single computation, because most FPUs have a wider internal representation than either the 32-bit float or the 64-bit double.
However, that's only a small piece of the picture. Nowadays, instruction-level optimizations don't mean anything if you're bottlenecked on cache/memory bandwidth.
Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:
They fit in half the memory, which is like having all your caches be twice as large (big win!!!).
If you really care about performance you'll use SSE instructions. SSE instructions that operate on floating-point values have different variants for the 32-bit and 64-bit representations. The 32-bit versions can fit 4 values in the 128-bit register operands, but the 64-bit versions can only fit 2 values. In this scenario you can likely double your FLOPS by using floats over doubles, because each instruction operates on twice as much data; see the sketch below.
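As an illustration of that packed-width difference, here is a minimal sketch using SSE/SSE2 intrinsics; the array contents are arbitrary values chosen for this example:

```
#include <immintrin.h>
#include <cstdio>

int main() {
    // 4 floats fit in one 128-bit register...
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    __m128 vf = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));   // 4 multiplications per instruction
    _mm_storeu_ps(a, vf);

    // ...but only 2 doubles do.
    double c[2] = {1.0, 2.0};
    double d[2] = {5.0, 6.0};
    __m128d vd = _mm_mul_pd(_mm_loadu_pd(c), _mm_loadu_pd(d));  // 2 multiplications per instruction
    _mm_storeu_pd(c, vd);

    std::printf("%f %f %f %f\n", a[0], a[1], a[2], a[3]);
    std::printf("%f %f\n", c[0], c[1]);
    return 0;
}
```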
In general, the majority of developers I've encountered have a real lack of knowledge of how floating-point numbers actually work, so I'm not really surprised that most blindly use double.
double is, in some ways, the "natural" floating point type in the C language, which also influences C++. Consider that:
an unadorned, ordinary floating-point constant like 13.9 has type double. To make it float, we have to add an extra suffix f or F.
default argument promotion in C converts float function arguments* to double: this takes place when no parameter type is declared for an argument, such as when a function is variadic (e.g. printf) or has no prototype (old-style C, not permitted in C++).
The %f conversion specifier of printf takes a double argument, not a float. There is no dedicated way to print a float; a float argument default-promotes to double and so matches %f. (See the sketch below.)
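A small sketch of the points above; the specific values are arbitrary:

```
#include <cstdio>

int main() {
    auto a = 13.9;     // unsuffixed literal: a is a double
    auto b = 13.9f;    // f suffix: b is a float
    float c = 0.5f;

    // %f expects double; b and c are default-promoted to double here, so this is fine.
    std::printf("%f %f %f\n", a, b, c);

    // The usual 64-bit/32-bit split on mainstream platforms.
    std::printf("sizeof(double)=%zu sizeof(float)=%zu\n", sizeof(a), sizeof(c));
    return 0;
}
```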
On modern hardware, float and double are usually mapped, respectively, to 32 bit and 64 bit IEEE 754 types. The hardware works with the 64 bit values "natively": the floating-point registers are 64 bits wide, and the operations are built around the more precise type (or internally may be even more precise than that). Since double is mapped to that type, it is the "natural" floating-point type.
The precision of float is poor for any serious numerical work, and the reduced range could be a problem also. The IEEE 32 bit type has only 23 bits of mantissa (8 bits are consumed by the exponent field and one bit for the sign). The float type is useful for saving storage in large arrays of floating-point values provided that the loss of precision and range isn't a problem in the given application. For example, 32 bit floating-point values are sometimes used in audio for representing samples.
It is true that the use of a 64 bit type over a 32 bit type doubles the raw memory bandwidth. However, that only affects programs which work with large arrays of data that are accessed in a pattern showing poor locality. The superior precision of the 64 bit floating-point type trumps issues of optimization. Quality of numerical results is more important than shaving cycles off the running time, in accordance with the principle of "get it right first, then make it fast".
* Note, however, that there is no general automatic promotion from float expressions to double; the only promotion of that kind is integral promotion: char, short and bitfields going to int.
This is mostly hardware dependent, but consider that the most common CPUs (x86/x87 based) have an internal FPU that operates on 80-bit floating-point precision (which exceeds both float and double).
If you have to store some intermediate results in memory, double is a good compromise between the internal precision and the space used. Performance is more or less the same on single values. It may be affected by the memory bandwidth on large numeric pipelines (since they will be twice as long).
Consider that floats have a precision of approximately 6 decimal digits. On an N-cubed complexity problem (like a matrix inversion or transformation), you lose two or three more in multiplications and divisions, leaving just 3 meaningful digits. On a 1920-pixel-wide display that is simply not enough (you need at least 5 to address a pixel properly).
This is roughly why double is preferable.
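If you want to see what your own platform provides, a quick sketch (on x86 with GCC or Clang, long double is typically the 80-bit x87 extended type mentioned above):

```
#include <cstdio>
#include <cfloat>

int main() {
    std::printf("sizeof(float)       = %zu, FLT_DIG  = %d\n", sizeof(float), FLT_DIG);
    std::printf("sizeof(double)      = %zu, DBL_DIG  = %d\n", sizeof(double), DBL_DIG);
    std::printf("sizeof(long double) = %zu, LDBL_DIG = %d\n", sizeof(long double), LDBL_DIG);
    return 0;
}
```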
It is often relatively easy to determine that double is sufficient, even in cases where it would take significant numerical-analysis effort to show that float is sufficient. That saves development cost, and avoids the risk of incorrect results if the analysis is not done correctly.
Also, any performance gain from using float instead of double is usually relatively slight, because most popular processors do all floating-point arithmetic in one format that is even wider than double.
I think higher precision is the only reason. Actually, most people don't think much about it; they just use double.
I think that if float precision is good enough for a particular task, there is no reason to use double.
My program has some problems with precision when using REAL(KIND=16) or REAL*16. Is there a way to get higher precision than that?
REAL*32 (kind values are not directly portable) would be a 256-bit real. There is no such IEEE floating-point type. See http://en.wikipedia.org/wiki/IEEE_floating-point_standard
I don't know of any compiler that supports such a kind as an extension, and no hardware known to me handles it natively.
At such high precisions I would already reconsider the algorithm and its stability. It is unusual for a program to need more than quad (your 16-byte) precision. Even double is normally enough; I do many of my computations in single precision.
Finally, there are some libraries that support more precision, but their use is more complicated than just recompiling with a different kind parameter. See
http://crd-legacy.lbl.gov/~dhbailey/mpdist/
Is there an arbitrary precision floating point library for C/C++ which allows arbitrary precision exponents?
By special request: the kind numbers are implementation dependent. Kind 16 may not exist, or may not denote the IEEE 128-bit float. See the many questions here:
Fortran: integer*4 vs integer(4) vs integer(kind=4)
Fortran 90 kind parameter
What does `real*8` mean? and so on.
I need to program a fixed-point processor that was used for a legacy application. New features are requested, and these features need a large dynamic range, most likely beyond the fixed-point range even after scaling. As the processor will not be changed for several reasons, I am planning to implement floating-point operations on top of the fixed-point arithmetic, i.e. a software-based approach. I want to define a few data structures to represent floating-point numbers in C on the underlying fixed-point processor. Is this possible to do at all? I am planning to use the IEEE floating-point representation. What kind of data structures would be good for achieving basic operations like multiplication, division, addition and subtraction? Are there already some open-source libraries available in C/C++?
Most C development tools for microcontrollers without native floating-point support provide software libraries for floating-point operations. Even if you do your programming in assembler, you could probably use those libraries. (Just curious, which processor and which development tools are you using?)
However, if you are serious about writing your own floating-point libraries, the most efficient way is to treat the routines as routines operating on integers. You can use unions, or code like the following, to convert between the floating-point and integer representations.
uint32_t integer_representation = *(uint32_t *)&fvalue;
Note that this pointer cast is technically undefined behavior (it breaks the aliasing rules); a union or memcpy avoids that, but the exact bit format of a floating-point number is still not specified by the C standard.
The problem is much easier if you stick to floating-point and integer types that match in size (typically 32-bit or 64-bit types); that way you can view the routines as plain integer routines, where, for example, addition takes two 32-bit integer representations of floating-point values and returns a 32-bit integer representation of the result.
Also, if you use the routines yourself, you could probably get away with leaving parts of the standard unimplemented, such as exception flags, subnormal numbers, NaN, Inf, etc.
You don’t really need any data structures to do this. You simply use integers to represent floating-point encodings and integers to represent unpacked sign, exponent, and significand fields.
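For instance, here is a minimal sketch (assuming a 32-bit IEEE 754 single and a 32-bit unsigned integer type) of unpacking the encoding into its fields with memcpy, which also sidesteps the aliasing issue mentioned above; the test value is arbitrary:

```
#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    float f = -6.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* reinterpret the float's bytes as an integer */

    uint32_t sign     = bits >> 31;            /* 1 bit  */
    uint32_t exponent = (bits >> 23) & 0xFF;   /* 8 bits, biased by 127 */
    uint32_t mantissa = bits & 0x7FFFFF;       /* 23 bits, implicit leading 1 for normal numbers */

    printf("sign=%u exponent=%u mantissa=0x%06X\n",
           (unsigned)sign, (unsigned)exponent, (unsigned)mantissa);
    return 0;
}
```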
I was wondering what kind of method was used to multiply numbers in C++. Is it the traditional schoolbook long multiplication? Fürer's algorithm? Toom-Cook?
I was wondering because I am going to need to be multiplying extremely large numbers and need a high degree of efficiency. Therefore the traditional schoolbook long multiplication O(n^2) might be too inefficient, and I would need to resort to another method of multiplication.
So what kind of multiplication does C++ use?
You seem to be missing several crucial things here:
There's a difference between native arithmetic and bignum arithmetic.
You seem to be interested in bignum arithmetic.
C++ doesn't support bignum arithmetic. The primitive datatypes are generally native arithmetic to the processor.
To get bignum (arbitrary-precision) arithmetic, you need to implement it yourself or use a library such as GMP. Unlike Java and C# (among others), C++ does not have a built-in library for arbitrary-precision arithmetic.
All of those fancy algorithms:
Karatsuba: O(n^1.585)
Toom-Cook: < O(n^1.465)
FFT-based: ~ O(n log(n))
apply only to bignum arithmetic and are implemented in bignum libraries. What the processor uses for its native arithmetic operations is somewhat irrelevant, as it's usually constant time.
In any case, I don't recommend that you try to implement a bignum library. I've done it before and it's quite demanding (especially the math). So you're better off using a library.
What do you mean by "extremely large numbers"?
C++, like most other programming languages, uses the multiplication hardware built into the processor. Exactly how that works is not specified by the C++ language. But for normal integers and floating-point numbers, you will not be able to write something faster in software.
The largest numbers that can be represented by the various data types can vary between different implementations, but some typical values are 2147483647 for int, 9223372036854775807 for long, and 1.79769e+308 for double.
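If you want to check the limits on your own implementation, a quick sketch:

```
#include <cstdio>
#include <limits>

int main() {
    std::printf("int    max: %d\n",  std::numeric_limits<int>::max());
    std::printf("long   max: %ld\n", std::numeric_limits<long>::max());
    std::printf("double max: %g\n",  std::numeric_limits<double>::max());
    return 0;
}
```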
In C++ integer multiplication is handled by the chip. There is no equivalent of Perl's BigNum in the standard language, although I'm certain such libraries do exist.
That all depends on the library and compiler used.
It is performed in hardware. For the same reason, huge numbers won't work. The largest integer C++ can represent natively on 64-bit hardware is 18446744073709551615 (an unsigned 64-bit value). If you need larger numbers, you need an arbitrary-precision library.
If you work with large numbers, the standard integer multiplication in C++ will no longer work, and you should use a library providing arbitrary-precision multiplication, like GMP: http://gmplib.org/
Also, you should not worry about performance prior to writing your application (=premature optimization). These multiplications will be fast, and most likely many other components in your software will cause much more slowdown.
Plain C++ uses the CPU's multiply instructions (or schoolbook multiplication using bit shifts and additions if your CPU does not have such an instruction).
If you need fast multiplication for large numbers, I would suggest looking at GMP (http://gmplib.org) and using the C++ interface from gmpxx.h.
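A minimal sketch of what that looks like; the operand values are arbitrary, and you link with -lgmpxx -lgmp:

```
#include <gmpxx.h>
#include <iostream>
#include <string>

int main() {
    // Two 100-digit operands; GMP chooses the multiplication algorithm internally
    // (schoolbook, Karatsuba, Toom-Cook or FFT-based) depending on operand size.
    mpz_class a("9" + std::string(99, '3'));
    mpz_class b("7" + std::string(99, '1'));
    mpz_class c = a * b;
    std::cout << c << '\n';
    return 0;
}
```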
Just how big are these numbers going to be? Even a language like Python can multiply two 100-digit arbitrary-precision integers (around 10^100) over 3 million times a second on a standard processor. That's multiplication to 100 significant places taking less than one millionth of a second. To put that into context, there are only about 10^80 atoms in the observable universe.
Write what you want to achieve first, and optimise later if necessary.