My program has some problems with precision when using REAL(KIND=16) or REAL*16. Is there a way to go higher than that with precision?
REAL*32 (kind values are not directly portable) would be a 256-bit real. There is no such IEEE floating-point type. See http://en.wikipedia.org/wiki/IEEE_floating-point_standard
I don't know of any processor (compiler) that supports such a kind as an extension, and no hardware known to me handles this natively.
At such high precision I would already reconsider the algorithm and its numerical stability. It is unusual for a program to need more than quad (your 16 bytes) precision. Even double is normally enough. I do many of my computations with single precision.
Finally, there are some libraries that support more precision, but using them is more complicated than just recompiling with a different kind parameter. See
http://crd-legacy.lbl.gov/~dhbailey/mpdist/
Is there an arbitrary precision floating point library for C/C++ which allows arbitrary precision exponents?
At a special request: the kind numbers are implementation-dependent. Kind 16 may not exist, or may not denote the IEEE 128-bit float. See many questions here:
Fortran: integer*4 vs integer(4) vs integer(kind=4)
Fortran 90 kind parameter
What does `real*8` mean? and so on.
I am writing a marshaling layer to automatically convert values between different domains. When it comes to floating point values this potentially means converting values from one floating point format to another. However, it seems that almost every modern system is using IEEE754, so I'm wondering whether it's actually worth generalising to allow other formats, or just manage marshaling between different IEEE754 formats.
Does anyone know of any commonly used floating point formats other than IEEE754 that I should consider (perhaps on ARM processors or mainframes)? If so, a reference to the format specification would be extremely helpful.
Virtually all relatively modern (within the last 15 years) general-purpose computers use IEEE 754. In the very unlikely event that you find a system you need to support which uses a non-IEEE 754 floating-point format, there will probably be a library available to convert to/from IEEE 754.
Some non-ancient systems which did not natively use IEEE 754 were the Cray SV1 (1998-2003) and IBM System 360, 370, and 390 prior to Generation 5 (ended 2002). IBM implemented IEEE 754 emulation around 2001 in a software release for prior S/390 hardware.
As of now, what systems do you actually want this to work on? If you come across one down the line that doesn't use IEEE754 (which, as @JohnZwinick says, is vanishingly unlikely) then you should be able to code for that then.
To put it another way, what you are designing here is, in effect, a communications protocol and you obviously seek to make a sensible choice for how you will represent a floating point number (both single precision and double precision, I guess) in the bytes that travel between domains.
I think @SomeProgrammerDude was trying to imply that representing these as text strings (while they are in transit) might offer the most portability, and if so I would agree, but it's obviously not the most efficient way to do it.
So, if you do decide to plump for IEEE754 as your interchange format (as I would) then the worst that can happen is that you might need to find a way to convert to and from the native format used on some antique architecture that you are almost certainly never going to encounter, and if that does happen, that problem would not be difficult to solve.
Also, floats and doubles can be big-endian or little-endian, so you need to decide what you're going to use in your byte stream and convert when marshalling if necessary. Little-endian is much more common these days so I'd go with that.
Does anyone know of any commonly used floating point formats other than IEEE754 that I should consider ...?
CCSI uses a variation on binary32 for select processors.
it seems that almost every modern system is using IEEE754,
Yes, but... various implementations fudge the particulars with edge values like subnormals, negative zero (in Visual Studio), infinity, and not-a-number.
It is this second issue that is more lethal and harder to discern: whether a given implementation has completely implemented IEEE754. See __STDC_IEC_559__.
OP has "I am writing a marshaling layer". It is in this coding that troubles likely remain for edge cases. Also, IEEE754 does not specify endianness, so that marshaling issue remains. Recall that integer endianness may not match FP endianness.
I wish to declare a floating point variable that can store more significant digits than the more common doubles and long doubles, preferably something like an octuple (256 bits), that (I believe) might give about 70 significant digits.
How do I declare such a variable? And will cross-platform compatibility be an issue (as opposed to fixed-width integers)?
Any help is much appreciated.
The C++ standard mandates precision up to and including double; the finer details of that floating-point scheme are left to the implementation.
An IEEE754 quadruple-precision long double will only give you about 36 significant figures. I've never come across a system, at the time of writing, that implements octuple precision.
Your best bet is to use something like the GNU Multiple Precision Arithmetic Library, or, if you really want binary floating point, The GNU Multiple Precision Floating Point Reliable Library.
While I don't know of any C++ library that fully implements proper IEEE754 octuple precision, I've found a library by the name of ttmath which implements a multi-word system, allowing it to deal with much larger numbers.
I have been in the process of writing a FORTRAN code for numerical simulations of an applied physics problem for more than two years and I've tried to follow the conventions described in Fortran Best Practices.
More specifically, I defined a parameter as
integer, parameter:: dp=kind(0.d0)
and then used it for all doubles in my code.
However, I found out (on this forum) that using KIND parameters does not necessarily give you the same precision if you compile your code with other compilers. In this question, I read that a possible solution is using SELECTED_REAL_KIND and SELECTED_INT_KIND, which follow some convention, as far as I understand.
Later on, though, I found out about the ISO_FORTRAN_ENV module which defines the REAL32, REAL64 and REAL128 KIND parameters.
I guess that these are indeed portable and, since they belong to the FORTRAN 2008 standard (though supported by GNU), I guess that I should use these?
Therefore, I would greatly appreciate it if someone with more knowledge and experience could clear up the confusion.
Also, I have a follow-up question about using these KINDs in HDF5. I was using H5T_NATIVE_DOUBLE and it was indeed working fine (as far as I know). However, in this document it is stated that this is now an obsolete feature and should not be used. Instead, they provide a function
INTEGER(HID_T) FUNCTION h5kind_to_type(kind, flag) RESULT(h5_type)
When I use it and print out the exact numerical value of the HID_T integer, REAL64 gives me 50331972, whereas H5T_NATIVE_DOUBLE gives me 50331963, which is different.
If I then try to use the value calculated by h5kind_to_type, the HDF5 library runs just fine and, using XDMF, I can plot the output in VisIt or ParaView without modifying the accompanying .xmf file.
So my second question would be (again): Is this correct usage?
The type double precision and the corresponding kind kind(1.d0) are perfectly well defined by the standard. But they are also not exactly fixed. Indeed, many computers in history used different kinds of native formats for their floating-point numbers, and the standard must allow for this!
So, double precision is a kind of real which has greater precision than the default real. The default real is also not fixed; it must correspond to what the computer can use.
Today we have the IEEE 754 standard for floating-point numbers, which defines the IEEE single (binary32) and IEEE double (binary64) types, among others. If the computer hardware implements this standard, as almost all computers younger than 20 years do, it is very likely that the compiler chooses these two as the real and double precision kinds.
The Fortran 2008 standard brings the two kind constants real32 and real64 (and others). They enable you to request real kinds which have a storage size of 32 and 64 bits. It is not guaranteed that these will be the IEEE types, but it is almost certain on modern computers.
To request the IEEE types (if they are available) use the intrinsic function ieee_selected_real_kind() from module ieee_arithmetic.
The IEEE types are the same on all computers (excluding endianness!), but the compiler is not required to support them, because you may have a computer which does not support them in hardware. This is only a theoretical possibility; all modern computers support them.
Now to your HDF constants: these are apparently just indexes into some table. It does not matter whether they are different; what is important is whether they mean the same thing, and in your case they do.
As I wrote above, it is extremely likely that on a computer which supports IEEE 754 the double precision will be identical to IEEE double. It may not be, if you use compiler options which change this behaviour. There are compiler options which promote the default real to double, and they may also promote double precision to quad precision (128 bit) to preserve the standard semantics, which require double precision to have greater precision and storage size.
Conclusion: you can use either, or any other way to choose your kind constants (you can also use iso_c_binding's c_float and c_double), but you should be aware of why those ways differ and what they actually mean.
In most of the code I see around, double is favoured over float, even when high precision is not needed.
Since there are performance penalties when using double types (CPU/GPU/memory/bus/cache/...), what is the reason for this overuse of double?
Example: in computational fluid dynamics, all the software I have worked with uses doubles. In this case high precision is useless (because of the errors due to the approximations in the mathematical model), and there is a huge amount of data to be moved around, which could be cut in half using floats.
The fact that today's computers are powerful is meaningless, because they are used to solve more and more complex problems.
Among others:
The savings are hardly ever worth it (number-crunching is not typical).
Rounding errors accumulate, so it is better to go to higher precision than needed from the start (experts may know it is precise enough anyway, and some calculations can be done exactly).
Common floating-point operations using the FPU internally often work in double or higher precision anyway.
C and C++ can implicitly convert from float to double, the other way needs an explicit cast.
Variadic and no-prototype functions always get double, not float (the second applies only to ancient C and is actively discouraged).
You may commonly do an operation with more than needed precision, but seldom with less, so libraries generally favor higher precision too.
But in the end, YMMV: Measure, test, and decide for yourself and your specific situation.
BTW: There's even more for performance fanatics: Use the IEEE half precision type. Little hardware or compiler support for it exists, but it cuts your bandwidth requirements in half yet again.
In my opinion the answers so far don't really get the right point across, so here's my crack at it.
The short answer is C++ developers use doubles over floats:
To avoid premature optimization when they don't understand the performance trade-offs well ("it has higher precision, why not?" is the thought process)
Habit
Culture
To match library function signatures
To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)
It's true that a double may be as fast as a float for a single computation, because most FPUs have a wider internal representation than either the 32-bit float or the 64-bit double.
However, that's only a small piece of the picture. Nowadays, operational optimizations don't mean anything if you're bottlenecked on cache/memory bandwidth.
Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:
They fit in half the memory. Which is like having all your caches be twice as large. (big win!!!)
If you really care about performance you'll use SSE instructions. SSE instructions that operate on floating point values have different instructions for 32-bit and 64-bit floating point representations. The 32-bit versions can fit 4 values in the 128-bit register operands, but the 64-bit versions can only fit 2 values. In this scenario you can likely double your FLOPS by using floats over double because each instruction operates on twice as much data.
In general, there is a real lack of knowledge of how floating point numbers really work in the majority of developers I've encountered. So I'm not really surprised most developers blindly use double.
double is, in some ways, the "natural" floating point type in the C language, which also influences C++. Consider that:
an unadorned, ordinary floating-point constant like 13.9 has type double. To make it float, we have to add an extra suffix f or F.
default argument promotion in C converts float function arguments* to double: this takes place when no declaration exists for a parameter, such as when a function is declared variadic (e.g. printf) or has no prototype at all (old-style C, not permitted in C++).
The %f conversion specifier of printf takes a double argument, not float. There is no dedicated way to print floats; a float argument default-promotes to double and so matches %f.
On modern hardware, float and double are usually mapped, respectively, to 32 bit and 64 bit IEEE 754 types. The hardware works with the 64 bit values "natively": the floating-point registers are 64 bits wide, and the operations are built around the more precise type (or internally may be even more precise than that). Since double is mapped to that type, it is the "natural" floating-point type.
The precision of float is poor for any serious numerical work, and the reduced range could be a problem also. The IEEE 32 bit type has only 23 bits of mantissa (8 bits are consumed by the exponent field and one bit for the sign). The float type is useful for saving storage in large arrays of floating-point values provided that the loss of precision and range isn't a problem in the given application. For example, 32 bit floating-point values are sometimes used in audio for representing samples.
It is true that using a 64 bit type instead of a 32 bit type doubles the raw memory bandwidth. However, that only affects programs which work with large arrays of data accessed in a pattern that shows poor locality. The superior precision of the 64 bit floating-point type trumps issues of optimization. Quality of numerical results is more important than shaving cycles off the running time, in accordance with the principle of "get it right first, then make it fast".
* Note, however, that there is no general automatic promotion from float expressions to double; the only promotion of that kind is integral promotion: char, short and bitfields going to int.
This is mostly hardware-dependent, but consider that the most common CPUs (x86/x87-based) have an internal FPU that operates at 80-bit floating-point precision (which exceeds both float and double).
If you have to store some intermediate calculations in memory, double is a good compromise between the internal precision and the external space used. Performance is more or less the same on single values. It may be affected by memory bandwidth on large numeric pipelines (since they will be twice the length).
Consider that floats have a precision of approximately 6 decimal digits. On an N-cubed complexity problem (like a matrix inversion or transformation), you lose two or three more in multiplications and divisions, leaving just 3 meaningful digits. On a 1920-pixel-wide display they are simply not enough (you need at least 5 to match a pixel properly).
This roughly makes double preferable.
It is often relatively easy to determine that double is sufficient, even in cases where it would take significant numerical analysis effort to show that float is sufficient. That saves development cost, and the risk of incorrect results if the analysis is not done correctly.
Also, any performance gain from using float over double is usually relatively slight, because most popular processors do all floating-point arithmetic in one format that is even wider than double.
I think higher precision is the only reason. Actually most people don't think a lot about it, they just use double.
I think that if float precision is good enough for a particular task, there is no reason to use double.
I need to program a fixed-point processor that was used for a legacy application. New features are requested, and these features need a large dynamic range, most likely beyond the fixed-point range even after scaling. As the processor will not be changed for several reasons, I am planning to implement floating-point operations on top of the fixed-point arithmetic: basically a software-based approach. I want to define a few data structures to represent floating-point numbers in C on the underlying fixed-point processor. Is it possible to do at all? I am planning to use the IEEE floating-point representation. What kind of data structures would be good for achieving basic operations like multiplication, division, addition and subtraction? Are there already some open-source libraries available in C/C++?
Most C development tools for microcontrollers without native floating-point support provide software libraries for floating-point operations. Even if you do your programming in assembler, you could probably use those libraries. (Just curious, which processor and which development tools are you using?)
However, if you are serious about writing your own floating-point libraries, the most efficient way is to treat the routines as routines operating on integers. You can use unions, or code like the following, to convert between the floating-point and integer representations.
uint32_t integer_representation = *(uint32_t *)&fvalue;
Note that this pointer cast is inherently undefined behavior (it violates the strict-aliasing rule), and the format of a floating-point number is not specified in the C standard anyway.
The problem is much easier if you stick to floating-point and integer types whose sizes match (typically 32-bit or 64-bit types); that way you can view the routines as plain integer routines: addition, for example, takes two 32-bit integer representations of floating-point values and returns a 32-bit integer representation of the result.
Also, if you use the routines yourself, you could probably get away with leaving parts of the standard unimplemented, such as exception flags, subnormal numbers, NaN and Inf etc.
You don’t really need any data structures to do this. You simply use integers to represent floating-point encodings and integers to represent unpacked sign, exponent, and significand fields.