I recently found this library that seems to provide its own types and operations on real numbers that are 2 to 3 orders of magnitude faster than normal floating point arithmetic.
The library is based on a different representation for real numbers, one described as both more efficient and more mathematically accurate than floating point: the posit.
If this representation is so efficient why isn’t it widely used in all sorts of applications and implemented in hardware, or maybe it is? As far as I know most typical hardware uses some kind of IEEE floating point representation for real numbers.
Is it somehow maybe only applicable to some very specific AI research, as they seem to list mostly that as an example?
If this representation is not only hundreds to thousands of times faster than floating point, but also much more deterministic and designed for use in concurrent systems, why isn’t it implemented in GPUs, which are basically massively concurrent calculators working on real numbers? Wouldn’t it bring huge advances in rendering performance and GPU computation capabilities?
Update: People behind the linked Universal library have released a paper about their design and implementation.
The most objective and convincing reason I know of is that posits were introduced less than 4 years ago. That's not enough time to make inroads in the marketplace (people need time to develop implementations), much less take it over (which, among other things, requires overcoming incompatibilities with existing software).
Whether or not the industry wants to make such a change is a separate issue that tends towards subjectivity.
The reason the IEEE standard seems to be slower is that it gives some concerns a higher priority. For example:
The IEEE Standard for Floating-Point Arithmetic (IEEE 754) defines:
arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special "not a number" values (NaNs)
interchange formats: encodings (bit strings) that may be used to exchange floating-point data in an efficient and compact form
rounding rules: properties to be satisfied when rounding numbers during arithmetic and conversions
operations: arithmetic and other operations (such as trigonometric functions) on arithmetic formats
exception handling: indications of exceptional conditions (such as division by zero, overflow, etc.)
The above is copied from Wikipedia: https://en.wikipedia.org/wiki/IEEE_754
Your linked library, Universal, implements the posit number system, which advertises the following strengths.
Economical - No bit patterns are redundant. There is one representation for zero and one for ±inf; all other bit patterns are valid, distinct non-zero real numbers. ±inf serves as a replacement for NaN.
Mathematically elegant - There is only one representation for zero, and the encoding is symmetric around 1.0. Associative and distributive laws are supported through deferred rounding via the quire, enabling reproducible linear algebra algorithms in any concurrency environment.
Tapered Accuracy - Tapered accuracy is when values with small exponents have more digits of accuracy and values with large exponents have fewer digits of accuracy. This concept was first introduced by Morris (1971) in his paper "Tapered Floating Point: A New Floating-Point Representation".
Parameterized precision and dynamic range -- posits are defined by a size, nbits, and the number of exponent bits, es. This gives system designers the freedom to pick the right precision and dynamic range required for the application. For example, for AI applications we may pick 5- or 6-bit posits without any exponent bits to improve performance. For embedded DSP applications, such as 5G base stations, we may select a 16-bit posit with 1 exponent bit to improve performance per Watt.
Simpler Circuitry - There are only two special cases, Not a Real and Zero. No denormalized numbers, overflow, or underflow.
The above is copied from the GitHub README: https://github.com/stillwater-sc/universal
So, in my opinion, the posit number system prefers performance, while the IEEE Standard for Floating-Point Arithmetic (IEEE 754) prefers technical compatibility and interchangeability.
I strongly challenge the claim of that library being faster than IEEE floating point:
Modern hardware includes circuitry specifically designed to handle IEEE floating point arithmetic. Depending on your CPU model, it can perform roughly 0.5 to 4 floating point operations per clock cycle. Yes, this circuitry does complex things, but because it's built in hardware and aggressively optimized for many years, it achieves this kind of speed.
Any software library that provides a different floating point format must perform the arithmetic in software. It cannot just say "please multiply these two numbers using double precision arithmetic" and see the result appear in the corresponding register two clock cycles later; it must contain code that takes the four different parts of the posit format, handles them separately, and fuses together a result. And that code takes time to execute - much more time than just two clock cycles.
The "universal" library may have corner cases where its posit number format shines. But speed is not where it can hope to compete.
Related
I have a simple piece of code that operates with floating points.
Few multiplications, divisions, exp(), subtraction and additions in a loop.
When I run the same piece of code on different platforms (like PC, Android phones, iPhones) I get slightly different results.
The result is pretty much equal on all the platforms but has a very small discrepancy - typically 1/1000000 of the floating point value.
I suppose the reason is that some phones don't have floating point registers and simulate those calculations with integers, while others do have floating point hardware but with different implementations.
There is some evidence of that here: http://christian-seiler.de/projekte/fpmath/
Is there a way to force all the platforms to produce consistent results?
For example, a good & fast open-source library that implements floating point mechanics with integers (in software), so that I can avoid hardware implementation differences.
The reason I need an exact consistency is to avoid compound errors among layers of calculations.
Currently those compound errors do produce a significantly different result.
In other words, I don't care so much which platform has the more correct result; I want to force consistency so I can reproduce equal behavior. For example, a bug discovered on a mobile phone is much easier to debug on a PC, but I need to reproduce that exact behavior.
One relatively widely used and high quality software FP implementation is MPFR. It is a lot slower than hardware FP, though.
Of course, this won't solve the actual problems your algorithm has with compound errors, it will just make it produce the same errors on all platforms. Probably a better approach would be to design an algorithm which isn't as sensitive to small differences in FP arithmetic, if feasible. Or if you go the MPFR route, you can use a higher precision FP type and see if that helps, no need to limit yourself to emulating the hardware single/double precision.
32-bit floating point math, for a given calculation, will at best have a precision of 1 in 16777216 (1 in 2^24). Functions such as exp are often implemented as a sequence of calculations, so may have a larger error due to this. If you do several calculations in a row, the errors will add and multiply up. In general float has about 6-7 digits of precision.
As one comment says, check that the rounding mode is the same. Most FPUs have a "round to nearest" (rtn), "round to zero" (rtz) and "round to even" (rte) mode that you can choose. The default on different platforms MAY vary.
If you perform additions or subtractions of fairly small numbers to fairly large numbers, since the number has to be normalized you will have a greater error from these sort of operations.
Normalized means shifted such that both numbers have the decimal place lined up - just like if you do that on paper, you have to fill in extra zeros to line up the two numbers you are adding - but of course on paper you can add 12419818.0 with 0.000000001 and end up with 12419818.000000001 because paper has as much precision as you can be bothered with. Doing this in float or double will result in the same number as before.
There are indeed libraries that do floating point math in software - the most popular being MPFR - but it is a "multiprecision" library and will be fairly slow, because such libraries are not really built to be a plug-in replacement for float, but a tool for when you want to calculate pi to thousands of digits, or work with prime numbers far beyond 64 or 128 bits, for example.
It MAY solve the problem to use such a library, but it will be slow.
A better choice would be moving from float to double, which should have a similar effect (double has 53 bits of mantissa, compared to the 24 in a 32-bit float, so more than twice as many bits). double should still be available as hardware instructions in any reasonably recent ARM processor, and as such relatively fast, but not as fast as float (an FPU is available from ARMv7 onward - which is certainly what you find in an iPhone, at least from the iPhone 3, and in middle to high end Android devices; I managed to find that the Samsung Galaxy ACE has an ARM9 processor [first introduced in 1997], so it has no floating point hardware).
I need to program a fixed point processor that was used for a legacy application. New features are requested, and these features need a large dynamic range, most likely beyond the fixed point range even after scaling. As the processor will not be changed for several reasons, I am planning to incorporate floating point operation based on fixed point arithmetic - basically a software-based approach. I want to define a few data structures to represent floating point numbers in C for the underlying fixed point processor. Is it possible to do at all? I am planning to use the IEEE floating point representation. What kind of data structures would be good for achieving basic operations like multiplication, division, addition and subtraction? Are there already some open source libraries available in C/C++?
Most C development tools for microcontrollers without native floating-point support provide software libraries for floating-point operations. Even if you do your programming in assembler, you could probably use those libraries. (Just curious, which processor and which development tools are you using?)
However, if you are serious about writing your own floating-point libraries, the most efficient way is to treat the routines as routines operating on integers. You can use unions, or code like the following, to convert between floating-point and integer representation.
uint32_t integer_representation = *(uint32_t *)&fvalue;
Note that this is inherently undefined behavior, as the format of a floating-point number is not specified in the C standard.
The problem is much easier if you stick to floating-point and integer types that match (typically 32 or 64 bit types), that way you can see the routines as plain integer routines, as, for example, addition takes two 32 bit integer representations of floating-point values and return a 32 bit integer representation of the result.
Also, if you use the routines yourself, you could probably get away with leaving parts of the standard unimplemented, such as exception flags, subnormal numbers, NaN and Inf etc.
You don’t really need any data structures to do this. You simply use integers to represent floating-point encodings and integers to represent unpacked sign, exponent, and significand fields.
I am writing a program for an embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires a 64-bit double-precision addition and comparison. I am trying to emulate double datatype using a tuple of two floats. So a double d will be emulated as a struct containing the tuple: (float d.hi, float d.low).
The comparison should be straightforward using a lexicographic ordering. The addition however is a bit tricky because I am not sure which base should I use. Should it be FLT_MAX? And how can I detect a carry?
How can this be done?
Edit (Clarity): I need the extra significant digits rather than the extra range.
double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single precision arithmetic accompanied by a slight reduction of the single precision exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T.J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs, however much of the material covered in these papers is applicable independent of platform so should be useful for the task at hand.
https://hal.archives-ouvertes.fr/hal-00021443
Guillaume Da Graça, David Defour
Implementation of float-float operators on graphics hardware,
7th conference on Real Numbers and Computers, RNC7.
http://andrewthall.org/papers/df64_qf128.pdf
Andrew Thall
Extended-Precision Floating-Point Numbers for GPU Computation.
This is not going to be simple.
A float (IEEE 754 single-precision) has 1 sign bit, 8 exponent bits, and 23 bits of mantissa (well, effectively 24).
A double (IEEE 754 double-precision) has 1 sign bit, 11 exponent bits, and 52 bits of mantissa (effectively 53).
You can use the sign bit and 8 exponent bits from one of your floats, but how are you going to get 3 more exponent bits and 29 bits of mantissa out of the other?
Maybe somebody else can come up with something clever, but my answer is "this is impossible". (Or at least, "no easier than using a 64-bit struct and implementing your own operations")
It depends a bit on what types of operations you want to perform. If you only care about additions and subtractions, Kahan Summation can be a great solution.
If you need both the precision and a wide range, you'll be needing a software implementation of double precision floating point, such as SoftFloat.
(For addition, the basic principle is to break the representation (e.g. 64 bits) of each value into its three consitituent parts - sign, exponent and mantissa; then shift the mantissa of one part based on the difference in the exponents, add to or subtract from the mantissa of the other part based on the sign bits, and possibly renormalise the result by shifting the mantissa and adjusting the exponent correspondingly. Along the way, there are a lot of fiddly details to account for, in order to avoid unnecessary loss of accuracy, and deal with special values such as infinities, NaNs, and denormalised numbers.)
Given all the constraints for high precision over 23 magnitudes, I think the most fruitful method would be to implement a custom arithmetic package.
A quick survey shows Briggs' doubledouble C++ library should address your needs and then some. See this.[*] The default implementation is based on double to achieve 30 significant figure computation, but it is readily rewritten to use float to achieve 13 or 14 significant figures. That may be enough for your requirements if care is taken to segregate addition operations with similar magnitude values, only adding extremes together in the last operations.
Beware though, the comments mention messing around with the x87 control register. I didn't check into the details, but that might make the code too non-portable for your use.
[*] The C++ source is linked from that article; only the gzipped tar link was not dead.
This is similar to the double-double arithmetic used by many compilers for long double on some machines that have only hardware double calculation support. It's also used as float-float on older NVIDIA GPUs where there's no double support. See Emulating FP64 with 2 FP32 on a GPU. This way the calculation will be much faster than a software floating-point library.
However, in most microcontrollers there's no hardware support for floats, so they're implemented purely in software. Because of that, using float-float may not increase performance, and it introduces some memory overhead to store the extra bytes of exponent.
If you really need the longer mantissa, try using a custom floating-point library. You can choose whatever is enough for you - for example, adapt the library to a new 48-bit float type of your own if only 40 bits of mantissa and 7 bits of exponent are needed, with no need to spend time calculating and storing the unnecessary 16 bits. But such a library needs to be very efficient to compete, because compilers' libraries often have assembly-level optimization for their own float types.
Another software-based solution that might be of use: GNU MPFR
It takes care of many other special cases and allows arbitrary precision (better than 64-bit double) that you would have to otherwise take care of yourself.
That's not practical. If it were, every embedded 32-bit processor (or compiler) would emulate double precision by doing that. As it stands, none do that I am aware of; most of them just substitute float for double.
If you need the precision and not the dynamic range, your best bet would be to use fixed point. If the compiler supports 64-bit integers, this will be easier too.
Say you're writing a C++ application doing lots of floating point arithmetic. Say this application needs to be portable accross a reasonable range of hardware and OS platforms (say 32 and 64 bits hardware, Windows and Linux both in 32 and 64 bits flavors...).
How would you make sure that your floating point arithmetic is the same on all platforms ? For instance, how to be sure that a 32 bits floating point value will really be 32 bits on all platforms ?
For integers we have stdint.h but there doesn't seem to exist a floating point equivalent.
[EDIT]
I got very interesting answers but I'd like to add some precision to the question.
For integers, I can write:
#include <cstdint>
[...]
int32_t myInt;
and be sure that, whatever the (C99-compatible) platform I'm on, myInt is a 32-bit integer.
If I write:
double myDouble;
float myFloat;
am I certain that these will compile to 64-bit and 32-bit floating point numbers, respectively, on all platforms?
Non-IEEE 754
Generally, you cannot. There's always a trade-off between consistency and performance, and C++ hands that to you.
For platforms that don't have floating point operations (like embedded and signal processing processors), you cannot use C++ "native" floating point operations, at least not portably so. While a software layer would be possible, that's certainly not feasible for this type of devices.
For these, you could use 16-bit or 32-bit fixed point arithmetic (but you might even discover that long is supported only rudimentarily - and frequently, div is very expensive). However, this will be much slower than hardware floating point, and becomes painful beyond the basic four operations.
I haven't come across devices that support floating point in a different format than IEEE 754. From my experience, your best bet is to hope for the standard, because otherwise you usually end up building algorithms and code around the capabilities of the device. When sin(x) suddenly costs 1000 times as much, you better pick an algorithm that doesn't need it.
IEEE 754 - Consistency
The only non-portability I found here is when you expect bit-identical results across platforms. The biggest influence is the optimizer. Again, you can trade accuracy and speed for consistency. Most compilers have an option for that - e.g. "floating point consistency" in Visual C++. But note that this is always about accuracy beyond the guarantees of the standard.
Why results become inconsistent?
First, FPU registers often have higher resolution than double (e.g. 80 bits), so as long as the code generator doesn't store the value back, intermediate values are held with higher accuracy.
Second, the equivalences like a*(b+c) = a*b + a*c are not exact due to the limited precision. Nonetheless the optimizer, if allowed, may make use of them.
Also - what I learned the hard way - printing and parsing functions are not necessarily consistent across platforms, probably due to numeric inaccuracies, too.
float
It is a common misconception that float operations are intrinsically faster than double. Working on large float arrays is usually faster simply through fewer cache misses.
Be careful with float accuracy. It can be "good enough" for a long time, but I've often seen it fail sooner than expected. Float-based FFTs can be much faster due to SIMD support, but generate noticeable artefacts quite early for audio processing.
Use fixed point.
However, if you want to approach the realm of possibly making portable floating point operations, you at least need to use controlfp to ensure consistent FPU behavior as well as ensuring that the compiler enforces ANSI conformance with respect to floating point operations. Why ANSI? Because it's a standard.
And even then you aren't guaranteeing that you can generate identical floating point behavior; that also depends on the CPU/FPU you are running on.
It shouldn't be an issue, IEEE 754 already defines all details of the layout of floats.
The maximum and minimum values storable should be defined in limits.h
Portable is one thing, generating consistent results on different platforms is another. Depending on what you are trying to do then writing portable code shouldn't be too difficult, but getting consistent results on ANY platform is practically impossible.
I believe "limits.h" will include the C library constants INT_MAX and its brethren. However, it is preferable to use "limits" and the classes it defines:
std::numeric_limits<float>, std::numeric_limits<double>, std::numeric_limits<int>, etc...
If you're assuming that you will get the same results on another system, read What could cause a deterministic process to generate floating point errors first. You might be surprised to learn that your floating point arithmetic isn't even the same across different runs on the very same machine!
I am interested to learn about the binary format for a single or a double type used by C++ on Intel based systems.
I have avoided the use of floating point numbers in cases where the data needs to potentially be read or written by another system (i.e. files or networking). I do realise that I could use fixed point numbers instead, and that fixed point is more accurate, but I am interested to learn about the floating point format.
Wikipedia has a reasonable summary - see http://en.wikipedia.org/wiki/IEEE_754.
But if you want to transfer numbers between systems you should avoid doing it in binary format. Either use middleware like CORBA (only joking, folks), Tibco etc. or fall back on that old favourite, textual representation.
This should get you started : http://docs.sun.com/source/806-3568/ncg_goldberg.html. (:
Floating-point format is determined by the processor, not the language or compiler. These days almost all processors (including all Intel desktop machines) either have no floating-point unit or have one that complies with IEEE 754. You get two or three different sizes (Intel with SSE offers 32, 64, and 80 bits) and each one has a sign bit, an exponent, and a significand. The number represented is usually given by this formula:
sign * (2**(E-k)) * (1 + S / (2**k'))
where k' is the number of bits in the significand and k is a constant around the middle range of exponents. There are special representations for zero (plus and minus zero) as well as infinities and other "not a number" (NaN) values.
There are definite quirks; for example, the fraction 1/10 cannot be represented exactly as a binary IEEE standard floating-point number. For this reason the IEEE standard also provides for a decimal representation, but this is used primarily by handheld calculators and not by general-purpose computers.
Recommended reading: David Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic
As other posters have noted, there is plenty of information about on the IEEE format used by every modern processor, but that is not where your problems will arise.
You can rely on any modern system using IEEE format, but you will need to watch for byte ordering. Look up "endianness" on Wikipedia (or somewhere else). Intel systems are little-endian, a lot of RISC processors are big-endian. Swapping between the two is trivial, but you need to know what type you have.
Traditionally, people use big-endian formats for transmission. Sometimes people include a header indicating the byte order they are using.
If you want absolute portability, the simplest thing is to use a text representation. However that can get pretty verbose for floating point numbers if you want to capture the full precision. 0.1234567890123456e+123.
Intel's representation is IEEE 754 compliant.
You can find the details at http://download.intel.com/technology/itj/q41999/pdf/ia64fpbf.pdf .
Note that decimal floating-point constants may convert to different floating-point binary values on different systems (even with different compilers on the same system). The difference would be slight -- maybe only as large as 2^-54 for a double -- but is a difference nonetheless.
Use hexadecimal constants if you want to guarantee the same floating-point binary value on any platform.