Declaring an octuple precision floating point variable in C++

Declaring an octuple precision floating point variable in C++ - c++

I wish to declare a floating point variable that can store more significant digits than the more common doubles and long doubles, preferably something like an octuple (256 bits), that (I believe) might give about 70 significant digits.
How do I declare such a variable? And will cross-platform compability be an issue (as opposed to fixed-width integers)?
Any help is much appreciated.

The C++ standard mandates precision up to and including double; and the finer details of that floating point scheme are left to the implementation.
An IEEE754 quadruple precision long double will only give you 36 significant figures. I've never come across a system, at the time of writing, that implements octuple precision.
Your best bet is to use something like the GNU Multiple Precision Arithmetic Library, or, if you really want binary floating point, The GNU Multiple Precision Floating Point Reliable Library.

While I don't know of any C++ libraries that fully implement a proper IEEE754 octuple precision, I've found a library by the name ttmath which implements a multi-word system, allowing it to deal with much larger numbers.

Related

How to create REAL(KIND=32) variables?

My program has some problems with precision when using REAL(KIND=16) or REAL*16. Is there a way to go higher than that with precision?

REAL*32 (kind values are not directly portable) would bee a 256 bit real. There is no such IEEE floating point type. See http://en.wikipedia.org/wiki/IEEE_floating-point_standard
I don't know of any processor (compiler) that supports such a kind as an extension. Also, no hardware known to me handles this natively.
At such high precisions already I would reconsider the algorithm and its stability. It is not usual for program to need more then quad (your 16 bytes) precision. Even double is normally enough. I do many of my computations with single precision.
Finally, there are some libraries that support more precision, but their use is more complicated than just recompiling with different kind parameter. See
http://crd-legacy.lbl.gov/~dhbailey/mpdist/
Is there an arbitrary precision floating point library for C/C++ which allows arbitrary precision exponents?
At a special request: The kind numbers are implementation dependent. Kind 16 may not exist or may not denote IEEE 128 bit float. See many questions here
Fortran: integer*4 vs integer(4) vs integer(kind=4)
Fortran 90 kind parameter
What does `real*8` mean? and so on.

data structures for floating point operation in fixed point processor

I need to program a fixed point processor that was used for a legacy application. New features are requested and these features need large dynamic range , most likely beyond the fixed point range even after scaling. As the processor will not be changed due to several reasons, I am planning to incorporate the floating point operation based on fixed point arithmetic-- basically software based approach. I want to define few data structures to represent a floating point numbers in C for the underlying fixed point processor. Is it possible to do at all? I am planning to use the IEEE floating point representation . What kind of data structures would be good for achieving basic operation like multiplication, division, add and sub . Are there already some open source libraries available in C /C++?

Most C development tools for microcontrollers without native floating-point support provide software libraries for floating-point operations. Even if you do your programming in assembler, you could probably use those libraries. (Just curious, which processor and which development tools are you using?)
However, if you are serious about writing your own floating-point libraries, the most efficient way is to treat the routines as routines operating on integers. You can use unions, or code like to following to convert between floating-point and integer representation.
uint32_t integer_representation = *(uint32_t *)&fvalue;
Note that this is inherently undefined behavior, as the format of a floating-point number is not specified in the C standard.
The problem is much easier if you stick to floating-point and integer types that match (typically 32 or 64 bit types), that way you can see the routines as plain integer routines, as, for example, addition takes two 32 bit integer representations of floating-point values and return a 32 bit integer representation of the result.
Also, if you use the routines yourself, you could probably get away with leaving parts of the standard unimplemented, such as exception flags, subnormal numbers, NaN and Inf etc.

You don’t really need any data structures to do this. You simply use integers to represent floating-point encodings and integers to represent unpacked sign, exponent, and significand fields.

Emulate "double" using 2 "float"s

I am writing a program for an embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires a 64-bit double-precision addition and comparison. I am trying to emulate double datatype using a tuple of two floats. So a double d will be emulated as a struct containing the tuple: (float d.hi, float d.low).
The comparison should be straightforward using a lexicographic ordering. The addition however is a bit tricky because I am not sure which base should I use. Should it be FLT_MAX? And how can I detect a carry?
How can this be done?
Edit (Clarity): I need the extra significant digits rather than the extra range.

double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single precision arithmetic accompanied by a slight reduction of the single precision exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T.J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs, however much of the material covered in these papers is applicable independent of platform so should be useful for the task at hand.
https://hal.archives-ouvertes.fr/hal-00021443
Guillaume Da Graça, David Defour
Implementation of float-float operators on graphics hardware,
7th conference on Real Numbers and Computers, RNC7.
http://andrewthall.org/papers/df64_qf128.pdf
Andrew Thall
Extended-Precision Floating-Point Numbers for GPU Computation.

This is not going to be simple.
A float (IEEE 754 single-precision) has 1 sign bit, 8 exponent bits, and 23 bits of mantissa (well, effectively 24).
A double (IEEE 754 double-precision) has 1 sign bit, 11 exponent bits, and 52 bits of mantissa (effectively 53).
You can use the sign bit and 8 exponent bits from one of your floats, but how are you going to get 3 more exponent bits and 29 bits of mantissa out of the other?
Maybe somebody else can come up with something clever, but my answer is "this is impossible". (Or at least, "no easier than using a 64-bit struct and implementing your own operations")

It depends a bit on what types of operations you want to perform. If you only care about additions and subtractions, Kahan Summation can be a great solution.

If you need both the precision and a wide range, you'll be needing a software implementation of double precision floating point, such as SoftFloat.
(For addition, the basic principle is to break the representation (e.g. 64 bits) of each value into its three consitituent parts - sign, exponent and mantissa; then shift the mantissa of one part based on the difference in the exponents, add to or subtract from the mantissa of the other part based on the sign bits, and possibly renormalise the result by shifting the mantissa and adjusting the exponent correspondingly. Along the way, there are a lot of fiddly details to account for, in order to avoid unnecessary loss of accuracy, and deal with special values such as infinities, NaNs, and denormalised numbers.)

Given all the constraints for high precision over 23 magnitudes, I think the most fruitful method would be to implement a custom arithmetic package.
A quick survey shows Briggs' doubledouble C++ library should address your needs and then some. See this.[*] The default implementation is based on double to achieve 30 significant figure computation, but it is readily rewritten to use float to achieve 13 or 14 significant figures. That may be enough for your requirements if care is taken to segregate addition operations with similar magnitude values, only adding extremes together in the last operations.
Beware though, the comments mention messing around with the x87 control register. I didn't check into the details, but that might make the code too non-portable for your use.
[*] The C++ source is linked by that article, but only the gzipped tar was not a dead link.

This is similar to the double-double arithmetic used by many compilers for long double on some machines that have only hardware double calculation support. It's also used as float-float on older NVIDIA GPUs where there's no double support. See Emulating FP64 with 2 FP32 on a GPU. This way the calculation will be much faster than a software floating-point library.
However in most microcontrollers there's no hardware support for floats so they're implemented purely in software. Because of that, using float-float may not increase performance and introduce some memory overhead to save the extra bytes of exponent.
If you really need the longer mantissa, try using a custom floating-point library. You can choose whatever is enough for you, for example change the library to adapt a new 48-bit float type of your own if only 40 bits of mantissa and 7 bits of exponent is needed. No need to spend time for calculating/storing the unnecessary 16 bits anymore. But this library should be very efficient because compiler's libraries often have assembly level optimization for their own type of float.

Another software-based solution that might be of use: GNU MPFR
It takes care of many other special cases and allows arbitrary precision (better than 64-bit double) that you would have to otherwise take care of yourself.

That's not practical. If it was, every embedded 32-bit processor (or compiler) would emulate double precision by doing that. As it stands, none do it that I am aware of. Most of them just substitute float for double.
If you need the precision and not the dynamic range, your best bet would be to use fixed point. IF the compiler supports 64-bit this will be easier too.

Large number of float digits without extra library

i have a float value that is hundreds of digits long (like the first 100 digits of pi - 3) and need a way to operate on it. is there any way to store and operate on the float that has a large number of decimals and maintain much precision with built in libraries? is there anything like python's Decimal module in c++?

The other answers all point to high precision integer libraries. There are however a few floating point libraries around:
The High Precision Arithmetic library
The GNU Multiple Precision Arithmetic Library (GMP) "Arithmetic without limitations"
The GNU multiple-precision floating-point computations with correct rounding (the GNU MPFR library). There's also a C++ wrapper.
NTL: A Library for doing Number Theory. Together with NTL::RR you can use this even within boost.
The LBNL double-double precision, quad-double precision and arbitrary precision software.
... and don't forget the possibility that you can always implement your own solution. (Might not be the most effective or fastest solution, but it's "the" solution if you want to learn something.

No built-in library, but you can do that using Bignum arithmetics :) http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic.
What a Bignum is: an array (vector) of digits. You can easily implement sum/difference....
I've actually asked something simillar here: STL big int class implementation

Unless it is some extra exotic platform, where a float is 100+ bytes long, you will find it hard to archive what you want without a library for big numbers.

What is the binary format of a floating point number used by C++ on Intel based systems?

I am interested to learn about the binary format for a single or a double type used by C++ on Intel based systems.
I have avoided the use of floating point numbers in cases where the data needs to potentially be read or written by another system (i.e. files or networking). I do realise that I could use fixed point numbers instead, and that fixed point is more accurate, but I am interested to learn about the floating point format.

Wikipedia has a reasonable summary - see http://en.wikipedia.org/wiki/IEEE_754.
Burt if you want to transfer numbers betwen systems you should avoid doing it in binary format. Either use middleware like CORBA (only joking, folks), Tibco etc. or fall back on that old favourite, textual representation.

This should get you started : http://docs.sun.com/source/806-3568/ncg_goldberg.html. (:

Floating-point format is determined by the processor, not the language or compiler. These days almost all processors (including all Intel desktop machines) either have no floating-point unit or have one that complies with IEEE 754. You get two or three different sizes (Intel with SSE offers 32, 64, and 80 bits) and each one has a sign bit, an exponent, and a significand. The number represented is usually given by this formula:
sign * (2**(E-k)) * (1 + S / (2**k'))
where k' is the number of bits in the significand and k is a constant around the middle range of exponents. There are special representations for zero (plus and minus zero) as well as infinities and other "not a number" (NaN) values.
There are definite quirks; for example, the fraction 1/10 cannot be represented exactly as a binary IEEE standard floating-point number. For this reason the IEEE standard also provides for a decimal representation, but this is used primarily by handheld calculators and not by general-purpose computers.
Recommended reading: David Golberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic

As other posters have noted, there is plenty of information about on the IEEE format used by every modern processor, but that is not where your problems will arise.
You can rely on any modern system using IEEE format, but you will need to watch for byte ordering. Look up "endianness" on Wikipedia (or somewhere else). Intel systems are little-endian, a lot of RISC processors are big-endian. Swapping between the two is trivial, but you need to know what type you have.
Traditionally, people use big-endian formats for transmission. Sometimes people include a header indicating the byte order they are using.
If you want absolute portability, the simplest thing is to use a text representation. However that can get pretty verbose for floating point numbers if you want to capture the full precision. 0.1234567890123456e+123.

Intel's representation is IEEE 754 compliant.
You can find the details at http://download.intel.com/technology/itj/q41999/pdf/ia64fpbf.pdf .

Note that decimal floating-point constants may convert to different floating-point binary values on different systems (even with different compilers on the same system). The difference would be slight -- maybe only as large as 2^-54 for a double -- but is a difference nonetheless.
Use hexadecimal constants if you want to guarantee the same floating-point binary value on any platform.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Declaring an octuple precision floating point variable in C++ - c++

While I don't know of any C++ libraries that fully implement a proper IEEE754 octuple precision, I've found a library by the name ttmath which implements a multi-word system, allowing it to deal with much larger numbers.

Related

How to create REAL(KIND=32) variables?

data structures for floating point operation in fixed point processor

Emulate "double" using 2 "float"s

Large number of float digits without extra library

What is the binary format of a floating point number used by C++ on Intel based systems?

Categories

Resources