Printing exact values for floats - c++

How does a program (MySQL is an example) store a float like 0.9 and then return it to me as 0.9? Why does it not return 0.899...?
The issue I am currently experiencing is retrieving floating point values from MySQL with C++ and then reprinting those values.

There are software libraries, such as GNU MP, that implement arbitrary-precision arithmetic and can carry out floating-point calculations to any specified precision. Using GNU MP you can, for example, add 0.3 to 0.6 and get exactly 0.9. No more, no less.
Database servers do pretty much the same thing.
For normal, run-of-the-mill applications, native floating-point arithmetic is fast, and it's good enough. But database servers typically have plenty of spare CPU cycles. Their limiting factor is usually not available CPU but things like available I/O bandwidth, so they have plenty of CPU cycles to burn on complicated arbitrary-precision arithmetic calculations.

There are a number of algorithms for rounding floating point numbers in a way that will result in the same internal representation when read back in. For an overview of the subject, with links to papers with full details of the algorithms, see
Printing Floating-Point Numbers

What's happening, in a nutshell, is that the function which converts the floating-point approximation of 0.9 to decimal text is actually coming up with a value like 0.90000....0123 or 0.89999....9573. This gets rounded to 0.90000...0. And then these trailing zeros are trimmed off so you get a tidy looking 0.9.
Although floating-point numbers are inexact, and often do not use base 10 internally, they can in fact precisely save and recover a decimal representation. For instance, an IEEE 754 64-bit representation has enough precision to preserve 15 decimal digits. This is often mapped to the C language type double, and that language has the constant DBL_DIG, which will be 15 when double is this IEEE type.
If a decimal number with 15 or fewer digits is converted to double, it can be converted back to exactly that number. The conversion routine just has to round it off at 15 digits; of course, if the conversion routine uses, say, 40 digits, there will be messy trailing digits representing the error between the floating-point value and the original number. The more digits you print, the more accurately rendered is that error.
There is also the opposite problem: given a floating-point object, can it be printed into decimal such that the resulting decimal can be scanned back to reproduce that object? For an IEEE 64 bit double, the number of decimal digits required for that is 17.

Related

Is hardcode float precise if it can be represented by binary format in IEEE 754?

for example, 0, 0.5, 0.15625, 1, 2, 3... are values that can be represented exactly in IEEE 754. Are their hardcoded versions exact?
for example, does

    float a = 0;
    if (a == 0) {
        return true;
    }

always return true? Another example:

    float a = 0.5;
    float b = 0.25;
    float c = 0.125;

is a * b always equal to 0.125, and is a * b == c always true? And one more example:

    int a = 123;
    float b = 0.5;

is a * b always 61.5? Or, in general, is an integer multiplied by an IEEE 754 binary float exact?
Or a more general question: if the value is hardcoded and both the value and the result can be represented in IEEE 754 binary format (e.g. 0.5 - 0.125), is the result exact?
There is no inherent fuzziness in floating-point numbers. It's just that some, but not all, real numbers can't be exactly represented.
Compare with a fixed-width decimal representation, let's say with three digits. The integer 1 can be represented, using 1.00, and 1/10 can be represented, using 0.10, but 1/3 can only be approximated, using 0.33.
If we instead use binary digits, the integer 1 would be represented as 1.00 (binary digits), 1/2 as 0.10, 1/4 as 0.01, but 1/3 can (again) only be approximated.
There are some things to remember, though:

- It's not the same numbers as with decimal digits. 1/10 can be written exactly as 0.1 using decimal digits, but not using binary digits, no matter how many you use (short of infinity).
- In practice, it is difficult to keep track of which numbers can be represented and which can't. 0.5 can, but 0.4 can't. So when you need exact numbers, such as (often) when working with money, you shouldn't use floating-point numbers.
- According to some sources, some processors do strange things internally when performing floating-point calculations on numbers that can't be exactly represented, causing results to vary in a way that is, in practice, unpredictable.
(My opinion is that it's actually a reasonable first approximation to say that yes, floating-point numbers are inherently fuzzy, so unless you are sure your particular application can handle that, stay away from them.)
For more details than you probably need or want, read the famous What Every Computer Scientist Should Know About Floating-Point Arithmetic. Also, this somewhat more accessible website: The Floating-Point Guide.
No, but as Thomas Padron-McCarthy says, some numbers can be exactly represented using binary but not all of them can.
This is the way I explain it to non-developers who I work with (like Mahmut Ali, I also work on a very old financial package): Imagine having a very large cake that is cut into 256 slices. You can give 1 person the whole cake, or 2 people 128 slices each, but as soon as you decide to split it between 3 you can't - it's either 85 or 86 - you can't split the cake any further. The same is true with floating point. You can only get exact numbers in some representations - other numbers can only be closely approximated.
C++ does not require binary floating point representation. Built-in integers are required to have a binary representation, commonly two's complement, but one's complement and sign and magnitude are also supported. But floating point can be e.g. decimal.
This leaves open the question of whether a C++ floating-point radix can lack 2 as a prime factor (both 2 and 10 have 2 as a prime factor). Are other radixes permitted? I don't know, and the last time I tried to check, I failed.
However, assuming that the radix must be 2 or 10, then all your examples involve values that are powers of 2 and therefore can be exactly represented.
This means that the single answer to most of your questions is “yes”. The exception is the question “is integer multiply by IEEE 754 binary float [exact]”. If the result exceeds the precision available, then it can't be exact, but otherwise it is.
See the classic “What Every Computer Scientist Should Know About Floating-Point Arithmetic” for background info about floating point representation & properties in general.
If a value can be exactly represented in 32-bit or 64-bit IEEE 754, then that doesn't mean that it can be exactly represented with some other floating point representation. That's because different 32-bit representations and different 64-bit representations use different number of bits to hold the mantissa and have different exponent ranges. So a number that can be exactly represented in one way, can be beyond the precision or range of some other representation.
You can use std::numeric_limits<T>::is_iec559 (where e.g. T is double) to check whether your implementation claims to be IEEE 754 compatible. However, when floating-point optimizations are turned on, at least the g++ compiler (1) erroneously claims to be IEEE 754 while not treating e.g. NaN values correctly according to that standard. In effect, is_iec559 only tells you whether the number representation is IEEE 754, not whether the semantics conform.
(1) Essentially, instead of providing different types for different semantics, gcc and g++ try to accommodate different semantics via compiler options. And with separate compilation of parts of a program, that can't conform to the C++ standard.
In principle, yes - provided you restrict yourself to exactly this class of numbers with a finite base-2 representation.
But it is dangerous: what if someone takes your code and changes your 0.5 to 0.4, or your .0625 to .065, for whatever reason? Then your code is broken. And no, even excessive comments won't help with that - someone will always ignore them.

Rounding a float upward to an integer, how reliable is that?

I've seen static_cast<int>(std::ceil(floatValue)); before.
My question, though, is: can I absolutely count on this not "needlessly" rounding up? I've read that some whole numbers can't be perfectly represented in floating point, so my worry is that the minuscule "error" will trick ceil() into rounding upwards when it logically shouldn't. Not only that, but once rounded up, I worry it may be possible for a small "error" in representation to cause the number to be slightly less than a whole number, causing the cast to int to truncate it.
Is this worry unfounded? I remember a while back an example in Python where printing a specific whole number would cause it to print something very slightly less (like x.999, though I can't remember the exact number).
The reason I need to make sure is that I'm writing a texture buffer. The common case is whole numbers as floating point, but it'll occasionally get in-between values that need to be rounded to the nearest integer width and height that contains them. It increments in steps of powers of 2, so the cost of rounding up needlessly means what should've only taken a 256x256 texture can need a 512x512 texture.
If floatValue is exact, then there is no problem with rounding in your code. The only possible problem is overflow (if the result doesn't fit inside an int). Of course with such large values, the float will typically not have enough precision to distinguish adjacent integers anyway.
However, the danger usually lies in floatValue itself not being exact. For example, if it is the result of some computation whose exact answer is a whole number, it may end up a tiny amount greater than a whole number due to floating point rounding errors in the computation.
So whether you have a problem depends on how you got floatValue.
can I absolutely count on this not "needlessly" rounding up? I've read that some whole numbers can't be perfectly represented in floating point, so my worry is that the miniscule "error" will trick ceil()
Yes, some large numbers are impossible to represent exactly as floating-point numbers. In the zone where this happens, all floating-point numbers are integers. The error is not minuscule: the error in representing an integer by a floating-point, if error there is, is at least one. And, obviously, in the zone where some integers cannot be represented as floating-point and where all floating-point numbers are integers, ceil(f) == f.
The zone in question is |f| > 2^24 (16,777,216) for IEEE 754 single precision and |f| > 2^53 for IEEE 754 double precision.
A problem you are more likely to come across does not come from the impossibility of representing integers in floating-point format but from the cumulative effect of rounding errors. If your compiler offers IEEE 754 semantics (the floating-point standard implemented exactly by the SSE2 instructions of modern and not-so-modern Intel processors), then any +, -, *, / or sqrt operation that results in a number exactly representable as floating-point is guaranteed to produce that result. But if several of the operations you apply do not have exactly representable results, the floating-point computation may drift away from the mathematical computation, even when the final result is an integer and is exactly representable. Then you may end up with a floating-point result slightly above the target integer, causing ceil() to return something other than what you would have obtained with exact mathematical computations.
There are ways to be confident that some floating-point operations are exact (because the result is always representable). For instance (double)float1 * (double)float2, where float1 and float2 are two single-precision variables, is always exact, because the mathematical result of the multiplication of two single-precision numbers is always representable as a double. By doing the computation the “right” way, it is possible to minimize or eliminate the error in the end result.
The range is 0.0 to ~1024.0
All integers in this range can be represented exactly as float, so you'll be fine.
You'll only start having issues once you stray beyond the 24 bits of mantissa afforded by float.

Parsing decimal values with GMP?

How can I parse a decimal value exactly? That is, I have a string of the value, like "43.879" and I wish to get an exact GMP value. I'm not clear from the docs how, or whether this is actually possible. It doesn't seem to fit in the integer/rational/float value types -- though perhaps twiddling with rational is possible.
My intent is to retain exact precision decimals over operations like addition and subtraction, but to switch to high-precision floating point for operations like division or exponents.
Most libraries give you arbitrarily large precision, GMP included. However, even with large precision there are some numbers that cannot be represented exactly in binary format, much the same as you cannot represent 1/3 in decimal. For many applications it works to set the precision to a high number of digits, like 10, do the calculations, and then round the results back to the desired precision, like 3. Would it not work for you? See this - Is there a C++ equivalent to Java's BigDecimal?
You could also use http://software.intel.com/en-us/articles/intel-decimal-floating-point-math-library
EDIT:
Exact representation DOES NOT exist in binary floating point for many numbers - the kind that most current floating-point libraries offer. A number like 0.1 CANNOT be represented as a binary number, whatever the precision.
To be able to do what you suggest, the library would have to do the equivalent of 'hand addition' and 'hand division' - the kind you do with pencil and paper to add two decimal numbers. E.g. to store 0.1, the library might elect to represent it as a string itself and then do additions on strings. Needless to say, a naive implementation would make the process exceedingly slow - orders of magnitude slow. To add 0.1 + 0.1, it would have to parse the strings, add 1+1, remember the carries, remember the decimal position, etc. That is the thing the computer microcode does for you in a few CPU cycles (or a single instruction). Instead of a single instruction, your software library would end up taking something like 100 CPU cycles/instructions.
If it tries to convert 0.1 to a number, it is back to square one - 0.1 cannot be a number in binary.
However, people do recognize the need to represent 0.1 exactly. It is just that a binary number representation is NOT going to do it. That is where newer floating-point standards come in, and that is where the Intel decimal floating-point library is headed.
Repeating my previous example: suppose you had a base-10 computer that could do base-10 numbers. That computer cannot store 1/3 as a 'plain' floating-point number. It would have to store the representation that the number is 1/3 - the equivalent of how it is written on paper. Try writing 1/3 as a base-10 floating-point number on paper.
Also see Why can't decimal numbers be represented exactly in binary?

is it possible to categorize the different forms of the approximation of floating point numbers

I am just wondering if we can make rules about the form of the decimal approximation of real numbers by floating-point numbers.
For instance, can a floating-point number end in something like 1.xxx7777777... (a long run of 7s, possibly with a different digit at the very end)?
I believe that a floating-point number can only take these forms:
1. an exact value.
2. a value like 1.23900008721..., where 1.239 is approximated with digits that appear as "noise", but with 0s between the exact value and this noise
3. a value like 3.2599995, where 3.26 is approximated by appending 999... and a final digit (like 5), i.e. approximated by a floating-point number just below the real number
4. a value like 2.000001, where 2.0 is approximated by a floating-point number just above the real number
You are thinking in terms of decimal numbers, that is, numbers that can be represented as n*(10^e), with e either positive or negative. These numbers occur naturally in your thought processes for historical reasons having to do with having ten fingers.
Computer numbers are represented in binary, for technical reasons that have to do with an electrical signal being either present or absent.
When you are dealing with smallish integer numbers, it does not matter much that the computer representation does not match your own, because you are thinking of an accurate approximation of the mathematical number, and so is the computer, so by transitivity, you and the computer are thinking about the same thing.
With either very large or very small numbers, you will tend to think in terms of powers of ten, and the computer will definitely think in terms of powers of two. In these cases you can observe a difference between your intuition and what the computer does, and also, your classification is nonsense. Binary floating-point numbers are neither more dense nor less dense near numbers that happen to have a compact representation as decimal numbers. They are simply represented in binary, n*(2^p), with p either positive or negative. Many real numbers have only an approximate representation in decimal, and many real numbers have only an approximate representation in binary. These sets of numbers are not the same (binary numbers can be represented in decimal, but not always compactly; some decimal numbers cannot be represented exactly in binary at all, for instance 0.1).
If you want to understand the computer's floating-point numbers, you must stop thinking in decimal. 1.23900008721.... is not special, and neither is 1.239. 3.2599995 is not special, and neither is 3.26. You think they are special because they are either exactly or close to compact decimal numbers. But that does not make any difference in binary floating-point.
Here are a few pieces of information that may amuse you, since you tagged your question C++:
If you print a double-precision number with the format %.16e, you get a decimal number that converts back to the original double. But it does not always represent the exact value of the original double: to see the exact value of the double in decimal, you must use %.53e. If you write 0.1 in a program, the compiler interprets this as meaning 1.000000000000000055511151231257827021181583404541015625e-01, which is a relatively compact number in binary. Your question speaks of 3.2599995 and 2.000001 as if these were floating-point numbers, but they aren't. If you write these numbers in a program, the compiler will interpret them as 3.25999950000000016103740563266910612583160400390625 and 2.00000100000000013977796697872690856456756591796875. So the pattern you are looking for is simple: the decimal representation of a floating-point number is always 17 significant digits followed by 53-17=36 "noise" digits, as you call them. The noise digits are sometimes all zeroes, and the significant digits can end in a bunch of zeroes too.
Floating point is represented by bits. What this means is:
1 bit set after the (binary) point is 0.5, or 1/2
01 is 0.25, or 1/4
etc.
This means a floating-point value is always close but not exact unless it is an exact sum of powers of 2, within what the machine can handle.
Rational numbers can be represented very accurately by the machine (though not precisely, of course, if the denominator is not a power of two), but irrational numbers will always carry an error. In that respect your question is not so much related to C++ as to computer architecture.

How to convert float to double(both stored in IEEE-754 representation) without losing precision?

I mean, for example, I have the following number encoded in IEEE-754 single precision:
"0100 0001 1011 1110 1100 1100 1100 1100" (approximately 23.85 in decimal)
The binary number above is stored in literal string.
The question is, how can I convert this string into an IEEE 754 double-precision representation (somewhat like the following one, but the value is not the same), WITHOUT losing precision?
"0100 0000 0011 0111 1101 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010"
which is the same number encoded in IEEE-754 double precision.
I have tried using the following algorithm to convert the first string back to decimal number first, but it loses precision.
value = (-1)^sign * (1 + frac * 2^(-23)) * 2^(exp - 127)
I'm using Qt C++ Framework on Windows platform.
EDIT: I must apologize maybe I didn't get the question clearly expressed.
What I mean is that I don't know the true value 23.85, I only got the first string and I want to convert it to double precision representation without precision loss.
Well: keep the sign bit, rewrite the exponent (minus old bias, plus new bias), and pad the mantissa with zeros on the right...
(As @Mark says, you have to treat some special cases separately, namely when the biased exponent is either zero or max.)
IEEE-754 (and binary floating point in general) cannot represent periodic binary fractions with full precision - not even when they are, in fact, rational numbers with relatively small integer numerators and denominators. Some languages provide a rational type that can do it (these tend to be the languages that also support unbounded-precision integers).
As a consequence those two numbers you posted are NOT the same number.
They in fact are:
10111.11011001100110011000000000000000000000000000000000000000 ...
10111.11011001100110011001100110011001100110011001101000000000 ...
where ... represents an infinite sequence of 0s.
Stephen Canon in a comment above gives you the corresponding decimal values (did not check them, but I have no reason to doubt he got them right).
Therefore the conversion you want to do cannot be done, as the single-precision number does not have the information you would need (you have NO WAY to know whether the number is in fact periodic or simply looks periodic because there happens to be a repetition).
First of all, +1 for identifying the input in binary.
Second, that number does not represent 23.85, but slightly less. If you flip its last binary digit from 0 to 1, the number will still not accurately represent 23.85, but slightly more. Those differences cannot be adequately captured in a float, but they can be approximately captured in a double.
Third, what you think you are losing is called accuracy, not precision. The precision of the number always grows by conversion from single precision to double precision, while the accuracy can never improve by a conversion (your inaccurate number remains inaccurate, but the additional precision makes it more obvious).
I recommend converting to a float or rounding or adding a very small value just before displaying (or logging) the number, because visual appearance is what you really lost by increasing the precision.
Resist the temptation to round right after the cast and to use the rounded value in subsequent computation - this is especially risky in loops. While this might appear to correct the issue in the debugger, the accumulated additional inaccuracies could distort the end result even more.
It might be easiest to convert the string into an actual float, convert that to a double, and convert it back to a string.
Binary floating points cannot, in general, represent decimal fraction values exactly. The conversion from a decimal fractional value to a binary floating point (see "Bellerophon" in "How to Read Floating-Point Numbers Accurately" by William D.Clinger) and from a binary floating point back to a decimal value (see "Dragon4" in "How to Print Floating-Point Numbers Accurately" by Guy L.Steele Jr. and Jon L.White) yield the expected results because one converts a decimal number to the closest representable binary floating point and the other controls the error to know which decimal value it came from (both algorithms are improved on and made more practical in David Gay's dtoa.c. The algorithms are the basis for restoring std::numeric_limits<T>::digits10 decimal digits (except, potentially, trailing zeros) from a floating point value stored in type T.
Unfortunately, expanding a float to a double wreaks havoc on the value: trying to format the new number will in many cases not yield the decimal original, because the float padded with zeros is different from the closest double Bellerophon would create and, thus, from what Dragon4 expects. There are basically two approaches which work reasonably well, however:
As someone suggested, convert the float to a string and this string into a double. This isn't particularly efficient, but it can be proven to produce the correct results (assuming a correct implementation of the not entirely trivial algorithms, of course).
Assuming your value is in a reasonable range, you can multiply it by a power of 10 such that the least significant decimal digit is non-zero, convert this number to an integer, this integer to a double, and finally divide the resulting double by the original power of 10. I don't have a proof that this yields the correct number, but for the range of values I'm interested in, and which I want to store accurately in a float, this works.
One reasonable approach to avoid this issue entirely is to use decimal floating-point values, as described for C++ in the Decimal TR, in the first place. Unfortunately, these are not yet part of the standard, but I have submitted a proposal to the C++ standardization committee to get this changed.