The C++14 draft standard seems rather quiet about the specific requirements for float, double and long double, although these sizes seem to be common:
float: IEEE 32-bit floating-point representation (roughly 7 digits of precision, exponent range of 1e-38..1e+38)
double: IEEE 64-bit floating-point representation (roughly 16 digits of precision, exponent range of 1e-308..1e+308)
long double: 80-bit floating-point representation (roughly 19 digits of precision, exponent range of 1e-4932..1e+4932)
What C++ compilers and systems currently use floating-point sizes other than these?
I'm interested in longer, shorter, and non-binary representations using the standard types, not libraries, as my primary interest is portability of C++ programs.
It's unclear what "uncommon sizes" you're talking about.
If you're only asking about size in bits then "odd-sized" (i.e. not a power of 2) types usually exist on older platforms that don't use 8-bit (or other power-of-2) bytes.
One example is the Unisys ClearPath Dorado servers, with a 36-bit float and a 72-bit double. That platform is still in active development; the latest version was released in 2018. Mainframes and servers live very long lives, so you can still see PDP-10 and other such architectures in use today, with modern compiler support.
But even in newer platforms you can still see some examples like Intel Itanium's 82-bit extended float format. Many platforms also use a 40-bit floating-point format. It's especially common in many modern DSPs that use 40-bit accumulators like the TI C3x/C4x, SHARC ADSP-21160, Atmel TSC21020F. There are also many old 40-bit floating-point formats like the IBM extended or Microsoft MBF extended formats. See also Why did 8-bit Basic use 40-bit floating point?
In addition, a few modern C/C++ compilers for microcontrollers offer non-standard 24-bit floats. And in computer graphics, minifloat formats like 10-bit or 11-bit floats are not unusual, besides 16-bit and 24-bit floats
If you care about the formats then there are lots of standard compliant 32, 64 and 128-bit floating-point formats that aren't IEEE-754 like the hex and decimal floating point types in IBM z, Cray formats and VAX formats.
In fact IBM z is one of the very rare modern platforms with decimal float hardware, although if you use GCC and some other compilers you can use their built-in software support for decimal float. IBM also uses the special double-double format, which is still the default for long double on PowerPC
Here's the summary of most of the available floating-point formats. See also Do any real-world CPUs not use IEEE 754?. For more information continue to the next section
Types in C++ are generally mapped to hardware types for performance reasons. Therefore the floating-point types will be whatever is available on the CPU, if it has an FPU at all. In modern computers IEEE-754 is the dominant format in hardware, so float and double are in practice mapped to IEEE-754 single and double precision respectively, which comfortably satisfy the minimum range and precision requirements that the C++ standard inherits from C
Hardware support for higher-precision types is uncommon, apart from x86 and a few other platforms with 80-bit extended precision, so on most other platforms long double is mapped to the same type as double. Recently, however, long double has slowly been migrating to IEEE-754 quadruple precision in many compilers like GCC or Clang. Since that format is usually implemented by a built-in software library, performance is a lot worse. Depending on whether you favor faster execution or higher precision, you are still free to choose which type long double maps to. For example, on x86 GCC has the -mlong-double-64, -mlong-double-80 and -mlong-double-128 options to select the format of long double, and -m96bit-long-double and -m128bit-long-double to set its padding. Similar options are also available on many other architectures like S/390 and zSeries
PowerPC, OTOH, by default uses a completely different 128-bit long double format that is implemented using double-double arithmetic and has the same range as IEEE-754 double precision. Its precision is slightly lower than quadruple precision but it's a lot faster because it can utilize the hardware double arithmetic. As above, you can choose between the two formats with the -mabi=ibmlongdouble and -mabi=ieeelongdouble options. The double-double trick is also used on some platforms where only 32-bit float is supported in hardware, to get near-double precision
IBM z mainframes traditionally use IBM hex float formats and they still use it nowadays. But they do also support IEEE-754 binary and decimal floating-point types in addition to that
The format of floating-point numbers can be either base 16 S/390® hexadecimal format, base 2 IEEE-754 binary format, or base 10 IEEE-754 decimal format. The formats are based on three operand lengths for hexadecimal and binary: short (32 bits), long (64 bits), and extended (128 bits). The formats are also based on three operand lengths for decimal: _Decimal32 (32 bits), _Decimal64 (64 bits), and _Decimal128 (128 bits).
Floating-point numbers
Other architectures may have other floating-point formats, like VAX or Cray. However, since those mainframes are still being used, their newer hardware versions also include support for IEEE-754, just as IBM did with its mainframes
On modern platforms without FPU the floating-point types are usually IEEE-754 single and double precision for better interoperability and library support. However on 8-bit microcontrollers even single precision is too costly, therefore some compilers support a non-standard mode where float is a 24-bit type. For example the XC8 compiler uses a 24-bit floating-point format that is a truncated form of the 32-bit format, and NXP's MRK uses a different 24-bit float format
Due to the rise of graphics and AI applications that need a narrower floating-point type, 16-bit formats like IEEE-754 binary16 and Google's bfloat16 have also been introduced on many platforms, and compilers have some limited support for them, like __fp16 in GCC
First off, I am new to Stack Overflow, so please bear with me.
However, to answer your question. Looking at the float.h headers, which specify floating point parameters for the:
Intel Compiler
//Float:
#define FLT_MAX 3.40282347e+38F
//Double:
#define DBL_MAX 1.7976931348623157e+308
//Long Double:
#if (__IMFLONGDOUBLE == 64) || defined(__LONGDOUBLE_AS_DOUBLE)
#define LDBL_MAX 1.7976931348623157e+308L
#else
#define LDBL_MAX 1.1897314953572317650213E+4932L
#endif
GCC (MinGW actually gcc 4 or 5)
//Float:
#define FLT_MAX 3.40282347e+38F
//Double:
#define DBL_MAX 1.7976931348623157e+308
//Long Double: (same as double for gcc):
#define LDBL_MAX 1.7976931348623157e+308L
Microsoft
//Float:
#define FLT_MAX 3.40282347e+38F
//Double:
#define DBL_MAX 1.7976931348623157e+308
//Long Double: (same as double for Microsoft):
#define LDBL_MAX DBL_MAX
So, as you can see only the Intel compiler provides 80-bit representation for long double on a "standard" Windows machine.
This data is copied from the respective float.h headers from a Windows machine.
float and double are de-facto standardised on the IEEE single and double precision representations. I would put assuming these sizes in the same category as assuming CHAR_BIT==8. Some older ARM systems did have weird "mixed-endian" doubles, but unless you are working with retro stuff you are unlikely to encounter that nowadays.
long double, on the other hand, is far more variable. Sometimes it's IEEE double precision, sometimes it's 80-bit x87 extended, sometimes it's IEEE quad precision, and sometimes it's a "double double" format made up of two IEEE double precision numbers added together.
So in portable code you can't rely on long double being any better than double.
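To see which of those representations a given compiler picked, one option is to look at the significand width reported by std::numeric_limits. The following is only a sketch: describe_long_double is a hypothetical helper name, and the mapping from digit counts to formats assumes the four formats listed above (53, 64, 106 and 113 significand bits respectively); anything else is reported as unknown.

```cpp
#include <limits>
#include <string>

// Guess the long double representation from its significand width.
// The digit counts below are assumptions tied to the four formats
// named in the text; other values are reported, not guessed.
inline std::string describe_long_double() {
    constexpr int digits = std::numeric_limits<long double>::digits;
    switch (digits) {
        case 53:  return "same as IEEE double (binary64)";
        case 64:  return "x87 80-bit extended";
        case 106: return "double-double (pair of binary64)";
        case 113: return "IEEE quad (binary128)";
        default:  return "unknown (" + std::to_string(digits) + " significand bits)";
    }
}
```

Printing the returned string on each target platform gives a quick inventory before relying on any extra precision.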
Related
In the stdint.h (C99), boost/cstdint.hpp, and cstdint (C++0x) headers there is, among others, the type int32_t.
Are there similar fixed-size floating point types? Something like float32_t?
Nothing like this exists in the C or C++ standards at present. In fact, there isn't even a guarantee that float will be a binary floating-point format at all.
Some compilers guarantee that the float type will be the IEEE-754 32 bit binary format. Some do not. In reality, float is in fact the IEEE-754 single type on most non-embedded platforms, though the usual caveats about some compilers evaluating expressions in a wider format apply.
There is a working group discussing adding C language bindings for the 2008 revision of IEEE-754, which could consider recommending that such a typedef be added. If this were added to C, I expect the C++ standard would follow suit... eventually.
If you want to know whether your float is the IEEE 32-bit type, check std::numeric_limits<float>::is_iec559. It's a compile-time constant, not a function.
If you want to be more bulletproof, also check std::numeric_limits<float>::digits to make sure they aren't sneakily using the IEEE standard double-precision for float. It should be 24.
When it comes to long double, it's more important to check digits because there are a couple IEEE formats which it might reasonably be: 128 bits (digits = 113) or 80 bits (digits = 64).
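As a rough illustration of those checks, the sketch below turns them into compile-time constants and static_asserts. The names float_is_binary32, double_is_binary64 and long_double_digits are made up for this example; the expected value 53 for double is the binary64 significand width, which the text above does not state but which follows from the format.

```cpp
#include <limits>

// True only if float looks like IEEE-754 binary32 (24 significand bits).
constexpr bool float_is_binary32 =
    std::numeric_limits<float>::is_iec559 &&
    std::numeric_limits<float>::radix == 2 &&
    std::numeric_limits<float>::digits == 24;

// True only if double looks like IEEE-754 binary64 (53 significand bits).
constexpr bool double_is_binary64 =
    std::numeric_limits<double>::is_iec559 &&
    std::numeric_limits<double>::radix == 2 &&
    std::numeric_limits<double>::digits == 53;

// Refuse to compile on platforms where the assumptions do not hold.
static_assert(float_is_binary32, "float is not IEEE-754 binary32");
static_assert(double_is_binary64, "double is not IEEE-754 binary64");

// long double varies too much to assert on; expose its width instead
// (64 for x87 extended, 113 for binary128, as described above).
constexpr int long_double_digits = std::numeric_limits<long double>::digits;
```

Putting the static_asserts in a common header makes a port to an unusual platform fail loudly at build time rather than silently at run time.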
It wouldn't be practical to have float32_t as such because you usually want to use floating-point hardware, if available, and not to fall back on a software implementation.
If you think typedefs such as float32_t and float64_t are impractical for any reason, you must be so accustomed to your familiar OS and compiler that you are unable to look outside your little nest.
There exists hardware that natively runs 32-bit IEEE floating-point operations, and other hardware that does 64-bit. Sometimes such systems even have to talk to each other, in which case it is extremely important to know whether a double is 32 or 64 bits on each platform. If the 32-bit platform has to do extensive calculations based on the 64-bit values from the other, we may want to cast to the lower precision depending on timing and speed requirements.
I personally feel uncomfortable using floats and doubles unless I know exactly how many bits they are on my platform. Even more so if I am to transfer these to another platform over some communications channel.
There is currently a proposal to add the following types into the language:
decimal32
decimal64
decimal128
which may one day be accessible through #include <decimal>.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3871.html
I'm optimizing a sorting function for a numerics/statistics library based on the assumption that, after filtering out any NaNs and doing a little bit twiddling, floats can be compared as 32-bit ints without changing the result and doubles can be compared as 64-bit ints.
This seems to speed up sorting these arrays by somewhere on the order of 40%, and my assumption holds as long as the bit-level representation of floating point numbers is IEEE 754. Are there any real-world CPUs that people actually use (excluding in embedded devices, which this library doesn't target) that use some other representation that might break this assumption?
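For readers unfamiliar with the trick, here is one common variant of the "bit twiddling" for 32-bit floats. It is a sketch, not the code from the library in question: sortable_key is a hypothetical name, it maps to unsigned rather than signed integers, and it assumes float is IEEE 754 binary32 stored with the same byte order as uint32_t. NaNs must already have been filtered out.

```cpp
#include <cstdint>
#include <cstring>

// Map a non-NaN binary32 value to a key whose unsigned integer order
// matches the floating-point order of the original values.
inline std::uint32_t sortable_key(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined way to read the bits
    // Negative values: flip every bit, which reverses their order.
    // Non-negative values: set the sign bit so they sort above all negatives.
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}
```

Note that with this mapping -0.0f gets a smaller key than +0.0f, although the two compare equal as floats; whether that matters depends on the sort's requirements.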
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
(binary32, aka float in systems that use IEEE754)
https://en.wikipedia.org/wiki/Double-precision_floating-point_format
(binary64, aka double in systems that use IEEE754)
Other than flawed Pentiums, any x86 or x64-based CPU is using IEEE 754 as their floating-point arithmetic standard.
Here is a brief overview of the FPA standards and their adoption.
IEEE 754: Intel x86, and all RISC systems (IBM Power and PowerPC, Compaq/DEC Alpha, HP PA-RISC, Motorola 68xxx and 88xxx, SGI (MIPS) R-xxxx, Sun SPARC, and others);
VAX: Compaq/DEC
IBM S/390: IBM (however, in 1998, IBM added an IEEE 754 option to S/390)
Cray: X-MP, Y-MP, C-90; other Cray models have been based on Alpha and SPARC processors with IEEE-754 arithmetic.
Unless you're planning on supporting your library on fairly exotic CPU architectures, it is safe to assume that for now 99% of CPUs are IEEE 754 compliant.
It depends on where you draw the line between the "real world" and the imaginary one.
Vax G format is still supported on Alpha machines (which HP says they will support through at least 2013).
IBM hexadecimal FP is still supported by IBM z-series mainframes. They've added IEEE binary and decimal support, but from what I've heard they're rarely used, because the hexadecimal FP is quite a bit faster (IBM's been optimizing it for about 45 years now...)
Until fairly recently, Unisys still sold ClearPath IX servers that supported the Burroughs FP format, and ClearPath MCP machines that supported the Univac FP format. I believe those are now only run in emulation (on Xeons) but from a software viewpoint, they'll probably continue in active use for another decade or more.
There are even a few people still using DtCyber to run Plato on (emulated) Control Data mainframes, with their unique floating point format. (Sorry, but my first serious programming was on a CDC Cyber machine, so I couldn't resist bringing it up, even if it hasn't been "real world" for decades).
The Cell Processor's SPUs differ in a few ways (like lack of INF and NaNs), but I don't think there are differences that would break your assumptions...
PowerPC processors (Macs until about 2006-2007, tons of current IBM servers) use a 128-bit format consisting of two doubles for long double, instead of the IEEE 754 extended format.
However, in C or Objective-C, there is no portable way to interpret a 32-bit or 64-bit floating-point number as an integer (assuming float and uint32_t, or double and uint64_t, have the same number of bits). When I needed to do that kind of thing, I had to write different code depending on the compiler (one was using a union, one was by casting double* to long long*). No idea whether a reinterpret_cast in C++ will do it portably.
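For what it's worth, the approach usually regarded as well-defined in both C and C++ is to copy the bytes with memcpy rather than cast pointers. A minimal sketch follows; double_bits is a hypothetical helper name, and the values it returns are only meaningful as IEEE 754 bit patterns if double actually is binary64 on the platform.

```cpp
#include <cstdint>
#include <cstring>

// Return the object representation of a double as a 64-bit integer.
// memcpy avoids the aliasing problems of pointer casts, and compilers
// typically optimize it down to a single register move.
inline std::uint64_t double_bits(double d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "double must be 64 bits");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

In C++20 and later, std::bit_cast<std::uint64_t>(d) from the <bit> header expresses the same thing directly.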
Many real-world CPUs don't have any native floating-point format. Many implementations of C and other languages for such CPUs bundle libraries that use IEEE-754 single and double-precision formats and omit the extended-precision format despite the fact that other formats would be more suitable for many purposes.
It has been asserted that (even accounting for byte endian-ness) IEEE754 floating point is not guaranteed to be exchangeable between platforms.
So:
Why, theoretically, is IEEE floating point not exchangeable between platforms?
Are any of these concerns valid for modern hardware platforms (e.g. i686, x64, arm)?
If the concerns are valid, can you please demonstrate an example where this is the case (C or C++ is preferred)?
Motivation: Several GPS manufacturers exchange their binary formats for (e.g.) latitude, longitude and raw data in "IEEE-754 compliant floating point values". So, I don't have control to choose a text format or other "portable" format. Hence, my question is about when the differences may or may not occur.
IEEE 754 clause 3.4 specifies binary interchange format encodings. Given a floating-point format (below), the interchange format puts the sign bit in the most significant bit, biased exponent bits in the next most significant bits, and the significand encoding in the least significant bits. A mapping from bits to bytes is not specified, so a system could use little-endian, big-endian, or other ordering.
Clause 3.6 specifies format parameters for various format widths, including 64-bit binary, for which there is one sign bit, 11 exponent field bits, and 52 significand field bits. This clause also specifies the exponent bias.
Clauses 3.3 and 3.4 specify the data represented by this format.
So, to interchange IEEE-754 floating-point data, it seems systems need only to agree on two things: which format to use (e.g., 64-bit binary) and how to get the bits back and forth (e.g., how to map bits to bytes for writing to a file or a network message).
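To make the second point concrete, here is a sketch of one possible agreement: 64-bit binary, written most significant byte first. The names encode_be and decode_be are invented for this example, and the code assumes that double is binary64 and that the host stores double and uint64_t with the same byte order.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Encode a binary64 value as 8 bytes: sign and exponent bits first,
// least significant significand bits last.
inline std::array<unsigned char, 8> encode_be(double d) {
    static_assert(sizeof(double) == 8, "double must be 64 bits");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    std::array<unsigned char, 8> out{};
    for (std::size_t i = 0; i < 8; ++i)
        out[i] = static_cast<unsigned char>(bits >> (56 - 8 * i));
    return out;
}

// Inverse of encode_be: rebuild the bit pattern, then copy it into a double.
inline double decode_be(const std::array<unsigned char, 8>& in) {
    std::uint64_t bits = 0;
    for (unsigned char b : in)
        bits = (bits << 8) | b;
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

Because the shifts operate on the integer value rather than on memory, the byte sequence produced is the same on little-endian and big-endian hosts.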
The same code run in VS C++ and MinGW gave different results. The result is of type double. For example, VS C++ gave "-6.397745731873350", but MinGW gave "-6.397745731873378". There is a slight difference, but I don't know why.
I'd hazard a guess that it's one of two possibilities.
Back when Windows NT was new, and they supported porting to other processors (e.g., MIPS and DEC Alpha), MS had a little bit of a problem: the processors all had 64-bit floating point types, but they sometimes generated slightly different results. The DEC Alpha did computation on a 64-bit double as a 64-bit double. The default mode on an x86 was a little different: as you loaded a floating point number, any smaller type was converted to its internal 80-bit extended double format. Then all computation was done in 80-bit precision. Finally, when you stored the value, it was rounded back to 64 bits. This meant two things: first, for single- and double-precision results, the Intel was quite a bit slower. Second, double precision results often differed slightly between the processors.
To fix those "problems", Microsoft set up their standard library to adjust the floating point processor to only use 64-bit precision instead of 80-bit. Even though they've long since dropped all support for other processors, they still (at least the last time I looked, and I'd be surprised if it's changed) set the floating point processor to only work in 64-bit precision. I haven't checked to be sure, but I'd guess that MinGW may leave the floating point processor set to its default 80-bit precision instead.
There's one other possible source of difference: if you were comparing a 32-bit compiler to a 64-bit compiler, you get a different (though still somewhat similar) situation. The 32-bit compilers (both Microsoft and gcc) use the x87-style floating registers and instructions. Microsoft's 64-bit compiler does not use the x87-style floating point though (at least by default). Instead, it uses SSE instructions. I haven't done a lot of testing with this either, but I wouldn't be surprised at all if (again) there's a slight difference between x87 and SSE when it comes to things like guard bits and rounding. I wouldn't expect big differences at all, but would consider some slight differences extremely likely (bordering on inevitable).
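One way to get a hint about which situation applies is the FLT_EVAL_METHOD macro from <cfloat>, which reports the precision the compiler uses for intermediate results. The sketch below (eval_method_name is a hypothetical helper) only decodes that macro; it does not inspect the x87 precision-control setting mentioned above, so treat the answer as a hint rather than proof.

```cpp
#include <cfloat>

// Describe how the compiler says it evaluates floating-point expressions.
// The meanings of 0, 1 and 2 are those given by the C standard.
inline const char* eval_method_name() {
#ifdef FLT_EVAL_METHOD
    switch (FLT_EVAL_METHOD) {
        case 0:  return "each operation is evaluated in its own type";
        case 1:  return "float and double are evaluated as double";
        case 2:  return "everything is evaluated as long double";
        default: return "indeterminate or implementation-defined";
    }
#else
    return "FLT_EVAL_METHOD is not defined";
#endif
}
```

A 32-bit x87 build typically reports the long double case, while an SSE build typically reports the first one.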
Most floating-point numbers cannot be represented accurately by computers. They're approximations. There is a certain degree of unreliability in their representation. Different compilers may implement the unreliability differently. That is why you see those differences.
Read this excellent article:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
The difference is in the precision with which MinGW and VS C++ can represent your floating-point number.
What is Precision?
The precision of a floating point number is how many digits it can represent without losing any information it contains.
Consider the fraction 1/3. The decimal representation of this number is 0.33333333333333… with 3's going out to infinity. An infinite-length number would require infinite memory to be stored with exact precision, but float or double data types typically only have 4 or 8 bytes. Thus float and double numbers can only store a certain number of digits, and the rest are bound to get lost. There is therefore no exact way of representing, as a float or double, a number that requires more precision than the variable can hold.
I am interested to learn about the binary format for a single or a double type used by C++ on Intel based systems.
I have avoided the use of floating point numbers in cases where the data needs to potentially be read or written by another system (i.e. files or networking). I do realise that I could use fixed point numbers instead, and that fixed point is more accurate, but I am interested to learn about the floating point format.
Wikipedia has a reasonable summary - see http://en.wikipedia.org/wiki/IEEE_754.
But if you want to transfer numbers between systems you should avoid doing it in binary format. Either use middleware like CORBA (only joking, folks), Tibco etc. or fall back on that old favourite, textual representation.
This should get you started : http://docs.sun.com/source/806-3568/ncg_goldberg.html. (:
Floating-point format is determined by the processor, not the language or compiler. These days almost all processors (including all Intel desktop machines) either have no floating-point unit or have one that complies with IEEE 754. You get two or three different sizes (on Intel, SSE offers 32 and 64 bits, and the older x87 unit adds 80 bits) and each one has a sign bit, an exponent, and a significand. The number represented is usually given by this formula:
sign * (2**(E-k)) * (1 + S / (2**k'))
where k' is the number of bits in the significand and k is a constant around the middle range of exponents. There are special representations for zero (plus and minus zero) as well as infinities and other "not a number" (NaN) values.
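As a worked instance of that formula, the sketch below applies it to the 32-bit format, where k = 127 and k' = 23 (these two constants come from the IEEE 754 binary32 definition, not from the text above). decode_binary32 is a hypothetical name, and the function deliberately handles only normal numbers; the special encodings for zero, subnormals, infinities and NaN are not covered.

```cpp
#include <cmath>
#include <cstdint>

// Rebuild the value of a normal binary32 number from its three fields,
// following sign * 2**(E-k) * (1 + S / 2**k') with k = 127, k' = 23.
inline double decode_binary32(std::uint32_t bits) {
    const int sign = (bits >> 31) ? -1 : 1;                 // top bit
    const int E = static_cast<int>((bits >> 23) & 0xFFu);   // 8 exponent bits
    const std::uint32_t S = bits & 0x7FFFFFu;               // 23 significand bits
    // 8388608 is 2**23; ldexp(x, n) computes x * 2**n exactly.
    return sign * std::ldexp(1.0 + S / 8388608.0, E - 127);
}
```

For example, the pattern 0xC0200000 has sign bit 1, E = 128 and S / 2**23 = 0.25, so the formula gives -1 * 2 * 1.25 = -2.5.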
There are definite quirks; for example, the fraction 1/10 cannot be represented exactly as a binary IEEE standard floating-point number. For this reason the IEEE standard also provides for a decimal representation, but this is used primarily by handheld calculators and not by general-purpose computers.
Recommended reading: David Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic
As other posters have noted, there is plenty of information around about the IEEE format used by every modern processor, but that is not where your problems will arise.
You can rely on any modern system using IEEE format, but you will need to watch for byte ordering. Look up "endianness" on Wikipedia (or somewhere else). Intel systems are little-endian, a lot of RISC processors are big-endian. Swapping between the two is trivial, but you need to know what type you have.
Traditionally, people use big-endian formats for transmission. Sometimes people include a header indicating the byte order they are using.
If you want absolute portability, the simplest thing is to use a text representation. However, that can get pretty verbose for floating-point numbers if you want to capture the full precision (up to 17 significant digits for a double), for example 0.12345678901234567e+123.
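A sketch of the text approach, assuming the "C" locale and a C library whose decimal conversions are correctly rounded (true of the common ones, but an assumption nonetheless). The helper names to_text and from_text are made up for this example; std::numeric_limits<double>::max_digits10 is the number of significant digits needed for a double to survive the round trip.

```cpp
#include <cstdio>
#include <cstdlib>
#include <limits>
#include <string>

// Format a double with enough significant digits to read it back exactly.
inline std::string to_text(double d) {
    char buf[64];
    // %e prints one digit before the point, so precision is max_digits10 - 1.
    std::snprintf(buf, sizeof buf, "%.*e",
                  std::numeric_limits<double>::max_digits10 - 1, d);
    return buf;
}

// Parse the text produced by to_text back into a double.
inline double from_text(const std::string& s) {
    return std::strtod(s.c_str(), nullptr);
}
```

The price is around two dozen characters per value, which is the verbosity mentioned above.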
Intel's representation is IEEE 754 compliant.
You can find the details at http://download.intel.com/technology/itj/q41999/pdf/ia64fpbf.pdf .
Note that decimal floating-point constants may convert to different floating-point binary values on different systems (even with different compilers on the same system). The difference would be slight -- maybe only as large as 2^-54 for a double -- but is a difference nonetheless.
Use hexadecimal constants if you want to guarantee the same floating-point binary value on any platform.
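For illustration, here is what that looks like in practice. Hexadecimal floating literals are standard in C since C99 and in C++ since C++17; the "%a" conversion and strtod handle the same notation at run time. The names tenth, to_hex_text and from_hex_text are invented for this sketch, and the literal shown is the binary64 value nearest to 0.1.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Exactly specifies the bits: significand 0x1.999999999999a, exponent 2**-4.
constexpr double tenth = 0x1.999999999999ap-4;

// Write a double in hexadecimal notation; no decimal rounding is involved.
inline std::string to_hex_text(double d) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%a", d);
    return buf;
}

// Read a hexadecimal (or decimal) floating-point string back.
inline double from_hex_text(const std::string& s) {
    return std::strtod(s.c_str(), nullptr);
}
```

The exact spelling "%a" produces can differ between libraries, but any conforming strtod reads all the variants back to the same value.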