How to convert float to double (both stored in IEEE-754 representation) without losing precision? - C++

I mean, for example, I have the following number encoded in IEEE-754 single precision:
"0100 0001 1011 1110 1100 1100 1100 1100" (approximately 23.85 in decimal)
The bit pattern above is stored in a literal string.
The question is, how can I convert this string into the IEEE-754 double precision representation (somewhat like the following one, but the value is not the same), WITHOUT losing precision?
"0100 0000 0011 0111 1101 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010"
which is the same number encoded in IEEE-754 double precision.
I have tried using the following algorithm to convert the first string back to a decimal number first, but it loses precision.
num in decimal = (-1)^sign * (1 + frac * 2^(-23)) * 2^(exp - 127)
I'm using Qt C++ Framework on Windows platform.
EDIT: I must apologize; maybe I didn't express the question clearly.
What I mean is that I don't know the true value 23.85; I only have the first string, and I want to convert it to the double precision representation without precision loss.

Well: keep the sign bit, rewrite the exponent (subtract the old bias, add the new bias), and pad the mantissa with zeros on the right...
(As Mark says, you have to treat some special cases separately, namely when the biased exponent is either zero or max.)
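For illustration, here is a minimal sketch of that bit-level widening in C++, assuming 32-bit float and 64-bit double in IEEE-754 format; the helper name float_bits_to_double_bits is mine, and only the normal case is handled (no zero/denormal/infinity/NaN):

#include <cstdint>
#include <cstdio>

// Widen an IEEE-754 single-precision bit pattern to double precision.
// Normal numbers only: a biased exponent of 0 (zero/denormal) or 255
// (infinity/NaN) needs the special-case handling mentioned above.
uint64_t float_bits_to_double_bits(uint32_t f)
{
    uint64_t sign     = (f >> 31) & 0x1;   // 1 bit
    uint64_t exponent = (f >> 23) & 0xFF;  // 8 bits, bias 127
    uint64_t mantissa = f & 0x7FFFFF;      // 23 bits

    // Re-bias the exponent (127 -> 1023) and pad the mantissa with
    // 29 zero bits on the right (23 -> 52 bits).
    return (sign << 63) | ((exponent - 127 + 1023) << 52) | (mantissa << 29);
}

int main()
{
    uint32_t f = 0x41BECCCC; // "0100 0001 1011 1110 1100 1100 1100 1100" from the question
    // Should print 4037d99980000000 on an IEEE-754 machine: the same leading
    // bits as the OP's double-precision example, but padded with zeros.
    printf("%016llx\n", (unsigned long long)float_bits_to_double_bits(f));
}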

IEEE-754 (and binary floating point in general) cannot represent periodic binary fractions with full precision. Not even when they are, in fact, rational numbers with a relatively small integer numerator and denominator. Some languages provide a rational type that can do it (typically the languages that also support unbounded-precision integers).
As a consequence those two numbers you posted are NOT the same number.
They in fact are:
10111.11011001100110011000000000000000000000000000000000000000 ...
10111.11011001100110011001100110011001100110011001101000000000 ...
where ... represents an infinite sequence of 0s.
Stephen Canon in a comment above gives you the corresponding decimal values (did not check them, but I have no reason to doubt he got them right).
Therefore the conversion you want to do cannot be done, as the single-precision number does not carry the information you would need (you have NO WAY to know whether the number is in fact periodic or merely looks periodic because there happens to be a repetition).

First of all, +1 for identifying the input in binary.
Second, that number does not represent 23.85, but slightly less. If you flip its last binary digit from 0 to 1, the number will still not accurately represent 23.85, but slightly more. Those differences cannot be adequately captured in a float, but they can be approximately captured in a double.
Third, what you think you are losing is called accuracy, not precision. The precision of the number always grows by conversion from single precision to double precision, while the accuracy can never improve by a conversion (your inaccurate number remains inaccurate, but the additional precision makes it more obvious).
I recommend converting back to float, or rounding, or adding a very small value just before displaying (or logging) the number, because visual appearance is what you really lost by increasing the precision.
Resist the temptation to round right after the cast and to use the rounded value in subsequent computation - this is especially risky in loops. While this might appear to correct the issue in the debugger, the accumulated additional inaccuracies could distort the end result even more.

It might be easiest to convert the string into an actual float, convert that to a double, and convert it back to a string.
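For example, a quick sketch of that round trip, assuming the 32 bits arrive as a string of '0'/'1' characters and that float and double are IEEE-754 on the platform (which they are for Qt on Windows):

#include <bitset>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>

int main()
{
    // The pattern from the question, with the spaces removed.
    std::string bits = "01000001101111101100110011001100";

    // Reinterpret the 32 bits as a float...
    uint32_t u32 = static_cast<uint32_t>(std::bitset<32>(bits).to_ulong());
    float f;
    std::memcpy(&f, &u32, sizeof f);

    // ...let the hardware widen it (exact for every finite float)...
    double d = f;

    // ...and read the resulting 64-bit pattern back out as a string.
    uint64_t u64;
    std::memcpy(&u64, &d, sizeof d);
    std::cout << std::bitset<64>(u64) << '\n';
}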

Binary floating points cannot, in general, represent decimal fraction values exactly. The conversion from a decimal fractional value to a binary floating point (see "Bellerophon" in "How to Read Floating-Point Numbers Accurately" by William D. Clinger) and from a binary floating point back to a decimal value (see "Dragon4" in "How to Print Floating-Point Numbers Accurately" by Guy L. Steele Jr. and Jon L. White) yield the expected results because one converts a decimal number to the closest representable binary floating point and the other controls the error to know which decimal value it came from (both algorithms are improved on and made more practical in David Gay's dtoa.c). The algorithms are the basis for restoring std::numeric_limits<T>::digits10 decimal digits (except, potentially, trailing zeros) from a floating point value stored in type T.
Unfortunately, expanding a float to a double wreaks havoc on the value: trying to format the new number will in many cases not yield the decimal original, because the float padded with zeros is different from the closest double Bellerophon would create and, thus, from what Dragon4 expects. There are basically two approaches which work reasonably well, however:
As someone suggested, convert the float to a string and this string into a double. This isn't particularly efficient but can be proven to produce the correct results (assuming a correct implementation of the not entirely trivial algorithms, of course).
Assuming your value is in a reasonable range, you can multiply it by a power of 10 such that the least significant decimal digit is non-zero, convert this number to an integer, this integer to a double, and finally divide the resulting double by the original power of 10. I don't have a proof that this yields the correct number, but for the range of values I'm interested in, and which I want to store accurately in a float, this works.
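A sketch of that second approach, under the assumption that the value is known to have at most two decimal places (so 100 is the scale factor; adjust it to your data):

#include <cmath>
#include <iomanip>
#include <iostream>

int main()
{
    float f = 23.85f; // value known to carry at most two decimal places

    // Scale so the interesting digits form an integer, round away the
    // representation error, then rebuild the value in double precision.
    double d = std::round(static_cast<double>(f) * 100.0) / 100.0;

    // Prints the closest double to 23.85 rather than the zero-padded float.
    std::cout << std::setprecision(17) << d << '\n';
}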
One reasonable approach to avoid this issue entirely is to use decimal floating point values as described for C++ in the Decimal TR in the first place. Unfortunately, these are not yet part of the standard, but I have submitted a proposal to the C++ standardization committee to get this changed.

Related

What exactly happens while multiplying a double value by 10

I have recently been wondering about multiplying floating point numbers.
Let's assume I have a number, for example 3.1415 with a guaranteed 3-digit precision.
Now, I multiply this value by 10, and I get 31.415X, where X is a digit I cannot
define because of the limited precision.
Now, can I be sure that the five gets carried over to the precise digits?
If a number is proven to be precise up to 3 digits I wouldn't expect this five to always pop up there, but after studying many cases in C++ I have noticed that it always happens.
From my point of view, however, this doesn't make any sense, because floating point numbers are stored base-two, so multiplication by ten isn't really possible; it will always be multiplication by 10.something.
I ask this question because I wanted to create a function that calculates how precise a type is. I have come up with something like this:
template <typename T>
unsigned accuracy() {
    unsigned acc = 0;
    T num = (T)1 / (T)3;
    while ((unsigned)(num *= 10) == 3) {
        acc++;
        num -= 3;
    }
    return acc;
}
Now, this works for any type I've used it with, but I'm still not sure that the first imprecise digit will always be carried over in an unchanged form.
I'll talk specifically about IEEE 754 doubles since that's what I think you're asking about.
Doubles are defined as a sign bit, an 11-bit exponent and a 52-bit mantissa, which are concatenated to form a 64-bit value:
sign|exponent|mantissa
Exponent bits are stored in a biased format, which means we store the actual exponent plus 1023 (for a double). The all-zeros and all-ones exponents are special, so we end up being able to represent scale factors from 2^-1022 to 2^+1023.
It's a common misconception that integer values can't be represented exactly by doubles; we can actually store any integer in [0,2^53) exactly by setting the mantissa and exponent properly (in fact, doubles in the range [2^52,2^53) can only take the integer values in that range). So 10 is easily stored exactly in a double.
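A quick sketch to convince yourself of that:

#include <cstdint>
#include <cstdio>

int main()
{
    // Every integer with magnitude below 2^53 fits in a double exactly...
    double exact = static_cast<double>((int64_t{1} << 53) - 1);    // 9007199254740991
    // ...but 2^53 + 1 needs 54 significant bits and gets rounded.
    double rounded = static_cast<double>((int64_t{1} << 53) + 1);  // becomes 9007199254740992

    printf("%.0f\n%.0f\n", exact, rounded);
    printf("10 is exact too: %d\n", 10.0 == 10);  // prints 1
}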
When it comes to multiplying doubles, we effectively have two numbers of this form:
A = (-1)^sA*mA*2^(eA-1023)
B = (-1)^sB*mB*2^(eB-1023)
where sA, mA, eA are the sign, mantissa and exponent of A (and similarly for B).
If we multiply these:
A*B = (-1)^(sA+sB)*(mA*mB)*2^((eA-1023)+(eB-1023))
We can see that we merely sum the exponents and multiply the mantissas. This actually isn't bad for precision! We might overflow the exponent bits (and thus get an infinity), but other than that we just have to round the intermediate mantissa result back to 52 bits, which will at worst only change the least significant bit of the new mantissa.
Ultimately, the error you'll see will be proportional to the magnitude of the result. But doubles have an error proportional to their magnitude anyway, so this is really as safe as we can get. A good estimate of the error in your number is |value|*2^-53. In your case, since 10 is exact, the only error comes from the representation of pi. It will have an error of ~2^-51 and thus the result will as well.
As a rule of thumb, I consider doubles to have ~15 digits of decimal precision when thinking about precision concerns.
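As a small illustration (the comments only say what to look for; the exact last digits depend on the values involved):

#include <cmath>
#include <iomanip>
#include <iostream>

int main()
{
    double pi4 = 3.1415;         // nearest double to 3.1415, already off by ~1e-16
    double scaled = pi4 * 10.0;  // one multiplication adds at most half an ulp of error

    std::cout << std::setprecision(17)
              << scaled << '\n'   // very close to 31.415
              << 31.415 << '\n';  // nearest double to 31.415; may differ from the line
                                  // above in the last bit, but not by more

    // An estimate of the ulp, the "error proportional to magnitude" unit:
    std::cout << "ulp near 31.415: "
              << std::nextafter(31.415, INFINITY) - 31.415 << '\n';
}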
Let's assume that for single precision 3.1415 is
0x40490E56
in IEEE 754 format, which is a very popular but not the only floating point format used.
01000000010010010000111001010110
0 10000000 10010010000111001010110
so the binary portion is 1.10010010000111001010110
110010010000111001010110
1100 1001 0000 1110 0101 0110
0xC90E56 * 10 = 0x7DA8F5C
Just like in grade school with decimal, you worry about the decimal (here binary) point later; you just do a multiply.
01111.10110101000111101011100
To get into IEEE 754 format it needs to be shifted to a 1.mantissa form,
so that is a shift of 3:
1.11110110101000111101011
But look at the three bits chopped off (100), specifically the leading 1: this means that, depending on the rounding mode, you round; in this case let's round up:
1.11110110101000111101100
0111 1011 0101 0001 1110 1100
0x7BA1EC
Now compare with the final single-precision result:
0x41FB51EC
0 10000011 11110110101000111101100
We moved the point by 3 and the exponent reflects that; the mantissa matches what we computed. We did lose one of the original non-zero bits off the right, but is that too much loss?
Double and extended precision work the same way, just with more exponent and mantissa bits - more precision and range. But at the end of the day the math is nothing more than what we learned in grade school; the format requires a 1.mantissa form, so you have to use your grade school math to adjust the exponent of the base to get it into that form.
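On an IEEE-754 machine you can check the walkthrough above directly; a sketch (the helper bits_of is mine, and the hex values in the comments are the ones computed above):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Helper for this example: return the raw bit pattern of a float.
static uint32_t bits_of(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

int main()
{
    float a = 3.1415f;
    float b = a * 10.0f;

    printf("%08X\n", (unsigned)bits_of(a)); // 40490E56, as above
    printf("%08X\n", (unsigned)bits_of(b)); // 41FB51EC, as above
}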
Now, can I be sure that the five gets carried over to the precise digits?
In general, no. You can only be sure about the precision of output when you know the exact representation format used by your system, and know that the correct output is exactly representable in that format.
If you want precise result for any rational input, then you cannot use finite precision.
It seems that your function attempts to calculate how accurately the floating point type can represent 1/3. This accuracy is not useful for evaluating the accuracy of representing other numbers.
because floating point numbers are stored base-two
While very common, this is not universally true. Some systems use base-10 for example.

Printing exact values for floats

How does a program (MySQL is an example) store a float like 0.9 and then return it to me as 0.9? Why does it not return 0.899...?
The issue I am currently experiencing is retrieving floating point values from MySQL with C++ and then reprinting those values.
There are software libraries, like GNU MP, that implement arbitrary precision arithmetic and calculate floating point numbers to a specified precision. Using GNU MP you can, for example, add 0.3 to 0.6, and get exactly 0.9. No more, no less.
Database servers do pretty much the same thing.
For normal, run-of-the-mill applications, native floating point arithmetic is fast, and it's good enough. But database servers typically have plenty of spare CPU cycles. Their limiting factors will not be available CPU, but things like available I/O bandwidth. They'll have plenty of CPU cycles to burn on executing complicated arbitrary precision arithmetic calculations.
There are a number of algorithms for rounding floating point numbers in a way that will result in the same internal representation when read back in. For an overview of the subject, with links to papers with full details of the algorithms, see
Printing Floating-Point Numbers
What's happening, in a nutshell, is that the function which converts the floating-point approximation of 0.9 to decimal text is actually coming up with a value like 0.90000....0123 or 0.89999....9573. This gets rounded to 0.90000...0. And then these trailing zeros are trimmed off so you get a tidy looking 0.9.
Although floating-point numbers are inexact, and often do not use base 10 internally, they can in fact precisely save and recover a decimal representation. For instance, an IEEE 754 64-bit representation has enough precision to preserve 15 decimal digits. This is often mapped to the C language type double, and that language has the constant DBL_DIG, which will be 15 when double is this IEEE type.
If a decimal number with 15 digits or fewer is converted to double, it can be converted back to exactly that number. The conversion routine just has to round it off at 15 digits; of course, if the conversion routine uses, say, 40 digits, there will be messy trailing digits representing the error between the floating-point value and the original number. The more digits you print, the more accurately rendered is that error.
There is also the opposite problem: given a floating-point object, can it be printed into decimal such that the resulting decimal can be scanned back to reproduce that object? For an IEEE 64 bit double, the number of decimal digits required for that is 17.
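In C++ those two limits are exposed as std::numeric_limits<double>::digits10 (15) and max_digits10 (17); a sketch of both round trips:

#include <iomanip>
#include <iostream>
#include <limits>
#include <sstream>
#include <string>

int main()
{
    // decimal -> double -> decimal: safe up to digits10 (15) significant digits.
    double d = 0.9;
    std::ostringstream pretty;
    pretty << std::setprecision(std::numeric_limits<double>::digits10) << d;
    std::cout << pretty.str() << '\n';                     // "0.9"

    // double -> decimal -> double: needs max_digits10 (17) digits to be lossless.
    std::ostringstream exact;
    exact << std::setprecision(std::numeric_limits<double>::max_digits10) << d;
    double back = std::stod(exact.str());
    std::cout << std::boolalpha << (back == d) << '\n';    // true
}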

VBA debugger precision

I had a Single (for which I believe the C++ equivalent is float) in VBA in an Excel workbook module. Anyway, the value I originally assigned (876.34497) is rounded off to 876.345 in the Immediate Window, in the Watch, and in the hover tooltip when I set a breakpoint on the VBA. However, if I pass this Single to a C++ DLL, C++ reports it as the original value 876.34497.
So, is it actually stored in memory as the original value? Is this some limitation of the debugger? Unsure what is going on here. Makes it difficult to test if what I'm passing is what I'm getting on the C++ side.
I tried:
?CStr(test)
876.345
?CDbl(test)
876.344970703125
?CSng(test)
876.345
VBA isn't very straightforward, so at some level it must be stored as 876.34497 in memory. Otherwise, I don't think CDbl would be correct like it is.
VBA variables of type "single" are stored as "32-bit hardware implementation of IEEE 754[-]1985 [sic]." [see: https://msdn.microsoft.com/en-us/library/ee177324.aspx].
What this means in English is, "single" precision numbers are converted to binary and then truncated to fit in a 4-byte (32-bit) sequence. The exact process is very well described in Wikipedia under http://en.wikipedia.org/wiki/Single-precision_floating-point_format . The upshot is that all single precision numbers are expressed as
(1) a 23-bit "fraction" between 0 and 1, *times*
(2) an 8-bit exponent which represents a multiplier between roughly 2^(-126) and 2^(+127), *times*
(3) one more bit for positive or negative.
The process of converting numbers to binary and back causes two types of rounding errors:
(1) Significant digits -- as you have noticed, there is a limit on significant digits. A 23-bit fraction can only take 8,388,608 distinct values. Stated another way, no number can be expressed with better than about +/- 0.000012% precision. Reaching back to high school science, you may recall that that is another way of saying you cannot count on more than six or seven significant digits (decimal digits, that is ... of course you have 24 significant binary digits, counting the implicit leading 1). So any representation of a number with more significant digits than that will get rounded off. However, it won't get rounded off to the nearest decimal digit ... it will get rounded off to the nearest binary digit. This often causes some unexpected results (like yours).
(2) Binary conversion -- The other type of error is even more pernicious. There are some numbers with significantly less than six (decimal) digits that will get rounded off. For example, 1/5 in decimal is 0.2000000. It never gets "rounded off." But the same number in binary is 0.00110011001100110011.... repeating forever. (That sequence is equivalent to 1/8 + 1/16 + 1/16*(1/8+1/16) + 1/256*(1/8+1/16) ... ) If you used an arbitrary number of binary digits to represent 0.20, then converted it back to decimal, you will NEVER get exactly 0.20. For example, if you used eight bits, you would have 0.00110011 in binary which is:
  0.12500000
  0.06250000
  0.00781250
+ 0.00390625
------------
  0.19921875
No matter how many binary digits you use, you will never get exactly 0.20, because 0.20 cannot be expressed as the sum of powers of two.
That in a nutshell explains what's going on. When you assign 876.34497 to "test," it gets converted internally to:
0 10001000 10110110001011000010100
     136        5,969,428
Which is (+1) * 2^(136-127) * (1 + 5,969,428/2^23) = 876.344970703125
Excel is automatically truncating the display of your single-precision number to show only six significant digits, because it knows that the seventh digit might be wrong. I can't tell you what the number is exactly because my Excel doesn't display enough significant digits! But you get the point.
When you coerce the value into double precision, it uses the entire binary string and then pads the fraction with another 29 zero bits on the right. It now allows you to display twice as many significant figures because it is double precision, but as you can see, the conversion from 8 decimal digits to 23 binary digits and then appending that long string of zeros has introduced some errors. Not really errors, if you understand what it's doing; just artifacts. After all, it's doing exactly what you told it to do ... you just didn't know what you were telling it to do!
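Seen from the C++ side of such a DLL, a small sketch of the same widening (876.344970703125 is the exact stored value, matching the CDbl output above):

#include <iomanip>
#include <iostream>

int main()
{
    float  f = 876.34497f;  // what VBA stores in the Single
    double d = f;           // widening is exact: same value, 29 extra zero bits

    std::cout << std::setprecision(6)  << f << '\n'   // 876.345 (what the VBA debugger shows)
              << std::setprecision(17) << d << '\n';  // 876.344970703125 (the exact stored value)
}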

floating point issue

I have a floating value as 0.1 entering from UI.
But while converting that string to float I am getting 0.10...01. The problem is the appended non-zero digit. How do I tackle this problem?
Thanks,
iSight
You need to do some background reading on floating point representations: http://docs.sun.com/source/806-3568/ncg_goldberg.html.
Given that computers are built from on-off switches, they store a rounded answer, and they work in base two, not the base ten we humans seem to like.
Your options are to:
display it back with fewer digits so you round back to base 10 (check out the Standard library's <iomanip> header, and setprecision; see the sketch after this list)
store the number in some actual decimal-capable object - you'll find plenty of C++ classes to do this via Google, but none are provided in the Standard, nor in Boost last I looked
convert the input from a string directly to an integral number of some smaller unit (like thousandths), avoiding the rounding.
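A sketch of the first option, rounding only for display:

#include <iomanip>
#include <iostream>

int main()
{
    double fromUi = 0.1;  // parsed from the UI string; stored as the nearest binary value

    // Keep the full value for computation, round only when displaying.
    std::cout << std::fixed << std::setprecision(2) << fromUi << '\n';  // 0.10
    std::cout << std::setprecision(20) << fromUi << '\n';               // the stored approximation underneath
}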
0.1 (decimal) = 0.00011001100110011... (binary)
So, in general, a number you can represent with a finite number of decimal digits may not be representable with a finite number of bits. But floating point numbers only store the N most significant bits. So, conversions between a decimal string and a "binary" float usually involve rounding.
However a lossless roundtrip conversion decimal string -> double -> decimal string is possible if you restrict yourself to decimal strings with at most 15 significant digits (assuming IEEE 754 64 bit floats). This includes the last conversion. You need to produce a string from the double with at most 15 significant digits.
It is also possible to make the roundtrip double -> string -> double lossless. But here you may need decimal strings with 17 decimal digits to make it work (again assuming IEEE-754 64bit floats).
The best site I've ever seen that explains why some numbers can't be represented exactly is Harald Schmidt's IEEE754 Converter site.
It's an online tool for showing representations of IEEE754 single precision values and I liked it so much, I wrote my own Java app to do it (and double precision as well).
Bottom line, there are only about four billion different 32-bit values you can have but there are an infinite number of real values between any two different values. So you have a problem with precision. That's something you'll have to get used to.
If you want more precision and/or better type for decimal values, you can either:
switch to a higher number of bits.
use a decimal type
use a big-number library like GMP (although I refuse to use this in production code since I discovered it doesn't handle memory shortages elegantly).
Alternatively, you can use the inaccurate values (their error rates are very low, something like one part per hundred million for floats, from memory) and just print them out with less precision. Printing out 0.10000000145 to two decimal places will get you 0.10.
You would have to do millions and millions of additions for the error to accumulate noticeably. Less of other operations of course but still a lot.
As to why you're getting that value, 0.1 is stored in IEEE754 single precision format as follows (sign, exponent and mantissa):
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm 1/n
0 01111011 10011001100110011001101
||||||||||||||||||||||+- 8388608
|||||||||||||||||||||+-- 4194304
||||||||||||||||||||+--- 2097152
|||||||||||||||||||+---- 1048576
||||||||||||||||||+----- 524288
|||||||||||||||||+------ 262144
||||||||||||||||+------- 131072
|||||||||||||||+-------- 65536
||||||||||||||+--------- 32768
|||||||||||||+---------- 16384
||||||||||||+----------- 8192
|||||||||||+------------ 4096
||||||||||+------------- 2048
|||||||||+-------------- 1024
||||||||+--------------- 512
|||||||+---------------- 256
||||||+----------------- 128
|||||+------------------ 64
||||+------------------- 32
|||+-------------------- 16
||+--------------------- 8
|+---------------------- 4
+----------------------- 2
The sign is positive, that's pretty easy.
The exponent bits give 64+32+16+8+2+1 = 123; subtracting the 127 bias gives -4, so the multiplier is 2^-4 or 1/16.
The mantissa is chunky. It consists of 1 (the implicit base) plus, for all the bits that are set (each being worth 1/2^n, with n starting at 1 and increasing to the right), {1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}.
When you add all these up, you get 1.60000002384185791015625.
When you multiply that by the multiplier, you get 0.100000001490116119384765625, matching the double precision value on Harald's site as far as it's printed:
0.10000000149011612 (out by 0.00000000149011612)
And when you turn off the least significant (rightmost) bit, which is the smallest downward movement you can make, you get:
0.09999999403953552 (out by 0.00000000596046448)
Putting those two together:
0.10000000149011612 (out by 0.00000000149011612)
|
0.09999999403953552 (out by 0.00000000596046448)
you can see that the first one is a closer match, by about a factor of four (14.9:59.6). So that's the closest value you can get to 0.1.
Since floats get stored in binary, the fractional portion is effectively in base two... and one-tenth is a repeating fraction in base two, just as one-ninth is in base ten.
The most common ways to deal with this are to store your values as appropriately-scaled integers, as in the C# or SQL currency types, or to round off floating-point numbers when you display them.
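A sketch of the scaled-integer idea, with a hypothetical price kept in cents:

#include <cstdint>
#include <cstdio>

int main()
{
    // Store money as an integer number of cents instead of a binary fraction.
    int64_t priceCents = 1999;            // $19.99, exact
    int64_t totalCents = priceCents * 3;  // integer arithmetic, no rounding

    printf("$%lld.%02lld\n",
           (long long)(totalCents / 100),
           (long long)(totalCents % 100));  // $59.97
}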

Setprecision() for a float number in C++?

In C++,
What are the random digits that are displayed after giving setprecision() for a floating point number?
Note: After setting the fixed flag.
example:
float f1 = 3.14;
cout << fixed << setprecision(10) << f1 << endl;
We get seemingly random values for the remaining 7 digits? But it is not the same case with double.
Two things to be aware of:
floats are stored in binary.
float has a maximum of 24 significant bits. This is equivalent to 7.22 significant digits.
So, to your computer, there's no such number as 3.14. The closest you can get using float is 3.1400001049041748046875.
double has 53 significant bits (~15.95 significant digits), so you get a more accurate approximation, 3.140000000000000124344978758017532527446746826171875. The "noise" digits don't show up with setprecision(10), but would with setprecision(17) or higher.
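A short sketch to see exactly that (the digits in the comments are the values quoted above):

#include <iomanip>
#include <iostream>

int main()
{
    float  f1 = 3.14f;
    double d1 = 3.14;

    std::cout << std::fixed
              << std::setprecision(10) << f1 << '\n'   // 3.1400001049 -- noise after ~7 digits
              << std::setprecision(10) << d1 << '\n'   // 3.1400000000 -- noise hidden at this precision
              << std::setprecision(25) << d1 << '\n';  // 3.1400000000000001243449788 -- there it is
}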
They're not really "random" -- they're the (best available) decimal representation of that binary fraction (will be exact only for fractions whose denominator is a power of two, e.g., 3.125 would display exactly).
Of course that changes depending on the number of bits available to represent the binary fraction that best approaches the decimal one you originally entered as a literal, i.e., single vs double precision floats.
Not really a C++ specific issue (applies to all languages using binary floats, typically to exploit the machine's underlying HW, i.e., most languages). For a very bare-bone tutorial, I recommend reading this.