Float in c++ smaller range than IEEE 754

Float in c++ smaller range than IEEE 754 - c++

I try to make the following division: 1/16777216, what is equal to 5.96046448e-8
but this:
printf("number: %f \n", 1.0f / 16777216.0f);
allways gives me 0.00000 instead of the answer I would expect.
I looked up the ranges, because I thought well, that might be a problem that float is simply to smal
to handle such a number, but IEEE 754 states it to be ±1.18×10−38.
Am I missing something and thats why the result not the expected one?

When using fixed formatting (%f) you get a format with a decimal point and up to 6 digits. Since the value you used rounds to a value smaller than 0.000001 it seems reasonable to have 0.000000 printed. You can either use more digits (I think using %.10f but I'm not that good at <stdio.h> format specifiers) or you change the format to use either scientific notation (%e) or the "better" of both options (%g).

Related

strtof() function misplacing decimal place

I have a string "1613894376.500012077" and I want to use strtof in order to convert to floating point 1613894376.500012077. The problem is when I use strtof I get the following result with the decimal misplaced 1.61389e+09. Please help me determine how to use strof properly.

A typical float is 32-bit and can only represent exactly about 232 different values. "1613894376.500012077" is not one of those.
"1.61389e+09" is the same value as "1613890000.0" and represents a close value that float can represent.
The 2 closest floats are:
1613894272.0
1613894400.0 // slightly closer to 1613894376.500012077
Print with more precision to see more digits.

The decimal point is not misplaced. The notation “1.61389e+09” means 1.61389•109, which is 1,613,890,000., which has the decimal point in the correct place.
The actual result of strtof in your computer is probably 1,613,894,400. This is the closest value to 1613894376.500012077 that the IEEE-754 binary32 (“single”) format can represent, and that is the format commonly used for float. When you print it with %g, the default is to use just six significant digits. To see it with more precision, print it with %.999g.

The number 1613894376.500012077 is equivalent (the same number up to the precision of the machine as 1.61389e+09.) The e+09 suffix means that the decimal point is located nine decimal digits right the place it has been placed (or that the number is multiplied by 10 to the ninth power). This is a common notation in computer science called scientific notation.

Is a floating-point value of 0.0 represented differently from other floating-point values?

I've been going back through my C++ book, and I came across a statement that says zero can be represented exactly as a floating-point number. I was wondering how this is possible unless the value of 0.0 is stored as a type other than a floating point value. I wrote the following code to test this:
#include <iomanip>
#include <iostream>
int main()
{
float value1 {0.0};
float value2 {0.1};
std::cout << std::setprecision(10) << std::fixed;
std::cout << value1 << '\n'
<< value2 << std::endl;
}
Running this code gave the following output:
0.0000000000
0.1000000015
To 10 digits of precision, 0.0 is still 0, and 0.1 has some inaccuracies (which is to be expected). Is a value of 0.0 different from other floating point numbers in the way it is represented, and is this a feature of the compiler or the computer's architecture?

How can 2 be represented as an exact number? 4? 15? 0.5? The answer is just that some numbers can be represented exactly in the floating-point format (which is based on base-2/binary) and others can't.
This is no different from in decimal. You can't represent 1/3 exactly in decimal, but that doesn't mean you can't represent 0.
Zero is special in a way, because (like the other real numbers) it's more trivial to prove this property than for some arbitrary fractional number. But that's about it.
So:
what is it about these values (0, 1/16, 1/2048, ...) that allows them to be represented exactly.
Simple mathematics. In any given base, in the sort of representation we're talking about, some numbers can be written out with a fixed number of decimal places; others can't. That's it.
You can play online with H. Schmidt's IEEE-754 Floating Point Converter for different numbers to see a bunch of different representations, and what errors come about as a result of encoding into those representations. For starters, try 0.5, 0.2 and 0.1.
It was my (perhaps naive) understanding that all floating point values contained some instability.
No, absolutely not.
You want to treat every floating point value in your program as potentially having some small error on it, because you generally don't know what sequence of calculations led to it. You can't trust it, in general. I expect someone half-taught this to you in the past, and that's what led to your misunderstanding.
But, if you do know the error (or lack thereof) involved at each step in the creation of the value (e.g. "all I've done is initialised it to zero"), then that's fine! No need to worry about it then.

Here is one way to look at the situation: with 64 bits to store a number, there are 2^64 bit patterns. Some of these are "not-a-number" representations, but most of the 2^64 patterns represent numbers. The number that is represented is represented exactly, with no error. This might seem strange after learning about floating point math; a caveat lurks ahead.
However, as huge as 2^64 is, there are infinitely many more real numbers. When a calculation produces a non-integer result, the odds are pretty good that the answer will not be a number represented by one of the 2^64 patterns. There are exceptions. For example, 1/2 is represented by one of the patterns. If you store 0.5 in a floating point variable, it will actually store 0.5. Let's try that for other single-digit denominators. (Note: I am writing fractions for their expressive power; I do not intend integer arithmetic.)
1/1 – stored exactly
1/2 – stored exactly
1/3 – not stored exactly
1/4 – stored exactly
1/5 – not stored exactly
1/6 – not stored exactly
1/7 – not stored exactly
1/8 – stored exactly
1/9 – not stored exactly
So with these simple examples, over half are not stored exactly. When you get into more complicated calculations, any one piece of the calculation can throw you off the islands of exact representation. Do you see why the general rule of thumb is that floating point values are not exact? It is incredibly easy to fall into that realm. It is possible to avoid it, but don't count on it.
Some numbers can be represented exactly by a floating point value. Most cannot.

How to convert float to double(both stored in IEEE-754 representation) without losing precision?

I mean, for example, I have the following number encoded in IEEE-754 single precision:
"0100 0001 1011 1110 1100 1100 1100 1100" (approximately 23.85 in decimal)
The binary number above is stored in literal string.
The question is, how can I convert this string into IEEE-754 double precision representation(somewhat like the following one, but the value is not the same), WITHOUT losing precision?
"0100 0000 0011 0111 1101 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010"
which is the same number encoded in IEEE-754 double precision.
I have tried using the following algorithm to convert the first string back to decimal number first, but it loses precision.
num in decimal = (sign) * (1 + frac * 2^(-23)) * 2^(exp - 127)
I'm using Qt C++ Framework on Windows platform.
EDIT: I must apologize maybe I didn't get the question clearly expressed.
What I mean is that I don't know the true value 23.85, I only got the first string and I want to convert it to double precision representation without precision loss.

Well: keep the sign bit, rewrite the exponent (minus old bias, plus new bias), and pad the mantissa with zeros on the right...
(As #Mark says, you have to treat some special cases separately, namely when the biased exponent is either zero or max.)

IEEE-754 (and floating point in general) cannot represent periodic binary decimals with full precision. Not even when they, in fact, are rational numbers with relatively small integer numerator and denominator. Some languages provide a rational type that may do it (they are the languages that also support unbounded precision integers).
As a consequence those two numbers you posted are NOT the same number.
They in fact are:
10111.11011001100110011000000000000000000000000000000000000000 ...
10111.11011001100110011001100110011001100110011001101000000000 ...
where ... represent an infinite sequence of 0s.
Stephen Canon in a comment above gives you the corresponding decimal values (did not check them, but I have no reason to doubt he got them right).
Therefore the conversion you want to do cannot be done as the single precision number does not have the information you would need (you have NO WAY to know if the number is in fact periodic or simply looks like being because there happens to be a repetition).

First of all, +1 for identifying the input in binary.
Second, that number does not represent 23.85, but slightly less. If you flip its last binary digit from 0 to 1, the number will still not accurately represent 23.85, but slightly more. Those differences cannot be adequately captured in a float, but they can be approximately captured in a double.
Third, what you think you are losing is called accuracy, not precision. The precision of the number always grows by conversion from single precision to double precision, while the accuracy can never improve by a conversion (your inaccurate number remains inaccurate, but the additional precision makes it more obvious).
I recommend converting to a float or rounding or adding a very small value just before displaying (or logging) the number, because visual appearance is what you really lost by increasing the precision.
Resist the temptation to round right after the cast and to use the rounded value in subsequent computation - this is especially risky in loops. While this might appear to correct the issue in the debugger, the accummulated additional inaccuracies could distort the end result even more.

It might be easiest to convert the string into an actual float, convert that to a double, and convert it back to a string.

Binary floating points cannot, in general, represent decimal fraction values exactly. The conversion from a decimal fractional value to a binary floating point (see "Bellerophon" in "How to Read Floating-Point Numbers Accurately" by William D.Clinger) and from a binary floating point back to a decimal value (see "Dragon4" in "How to Print Floating-Point Numbers Accurately" by Guy L.Steele Jr. and Jon L.White) yield the expected results because one converts a decimal number to the closest representable binary floating point and the other controls the error to know which decimal value it came from (both algorithms are improved on and made more practical in David Gay's dtoa.c. The algorithms are the basis for restoring std::numeric_limits<T>::digits10 decimal digits (except, potentially, trailing zeros) from a floating point value stored in type T.
Unfortunately, expanding a float to a double wrecks havoc on the value: Trying to format the new number will in many cases not yield the decimal original because the float padded with zeros is different from the closest double Bellerophon would create and, thus, Dragon4 expects. There are basically two approaches which work reasonably well, however:
As someone suggested convert the float to a string and this string into a double. This isn't particularly efficient but can be proven to produce the correct results (assuming a correct implementation of the not entirely trivial algorithms, of course).
Assuming your value is in a reasonable range, you can multiply it by a power of 10 such that the least significant decimal digit is non-zero, convert this number to an integer, this integer to a double, and finally divide the resulting double by the original power of 10. I don't have a proof that this yields the correct number but for the range of value I'm interested in and which I want to store accurately in a float, this works.
One reasonable approach to avoid this entirely issue is to use decimal floating point values as described for C++ in the Decimal TR in the first place. Unfortunately, these are not, yet, part of the standard but I have submitted a proposal to the C++ standardization committee to get this changed.

Determining output (printing) of float with %f in C/C++

I have gone through earlier discussions on floating point numbers in SO but that didn't clarified my problem,I knew this floating point issues may be common in every forum but my question in not concerned about Floating point arithmetic or Comparison.I am rather inquisitive about its representation and output with %f.
The question is straight forward :"How to determine the exact output of :
float = <Some_Value>f;
printf("%f \n",<Float_Variable>);
Lets us consider this code snippet:
float f = 43.2f,
f1 = 23.7f,
f2 = 58.89f,
f3 = 0.7f;
printf("f1 = %f\n",f);
printf("f2 = %f\n",f1);
printf("f3 = %f\n",f2);
printf("f4 = %f\n",f3);
Output:
f1 = 43.200001
f2 = 23.700001
f3 = 58.889999
f4 = 0.700000
I am aware that %f (is meant to be for double) has a default precision of 6, also I am aware that the problem (in this case) can be fixed by using double but I am inquisitive about the output f2 = 23.700001 and f3 = 58.889999 in float.
EDIT: I am aware that floating point number cannot be represented precisely, but what is the rule of for obtaining the closest representable value ?
Thanks,

Assuming that you're talking about IEEE 754 float, which has a precision of 24 binary digits: represent the number in binary (exactly) and round the number to the 24th most significant digit. The result will be the closest floating point.
For example, 23.7 represented in binary is
10111.1011001100110011001100110011...
After rounding you'll get
10111.1011001100110011010
Which in decimal is
23.700000762939453125
After rounding to the sixth decimal place, you'll have
23.700001
which is exactly the output of your printf.

What Every Computer Scientist Should Know About Floating-Point Arithmetic
You may interest to see other people question regarding that on SO too.
Please take a look too.
https://stackoverflow.com/search?q=floating+point

A 32-bit float (as in this case) is represented as 1 bit of sign, 8 bits of exponent and 23 bits of the fractional part of the mantissa.
First, forget the sign of what you put in. Then the rest of what you put in will be stored as a fraction of the format
(1 + x/8,388,608) * 2^(y-127) (note that the 8.388,608 is 2^23). Where x is the fractional mantissa and y is the exponent. Believe it or not, there is only one representation in this form for every value you put in. The value stored will be the closest value to the number you want, if your value cannot be represented exactly, it means you'll pick up an extra .0001 or whatever.
So, if you want to figure out the value that will actually be stored, just figure out what it will turn into.
So second thing to do (after throwing out the sign) is to find the largest power of 2 that is smaller in magnitude than the number you are representing. So let's take 43.2.
The largest power of two smaller than that is 32. So that's the "1" on the left, since it's a 32, not a 1, that means the 2^ value on the right must be 2^5 (32), which means y is 132. So now subtract off the 32, it's done for. What's left is 11.2. Now we need to represent 11.2 as a fraction over 8,338,608 times 2^5.
So
11.2 approx equals x*32/8,336,608 or x/262,144. The value you get for x is 2,938,013/262,144. The real numerator was 0.2 lower (2,938,012.8), so there will be an error of 0.2 in 262,144 or 2 in 131,072. In decmial, this value is 0.000015258789063. So if you print enough digits, you'll see this error value show up in your output.
When you see the output be too low, it's because the rounding went the other way, the value produced was nearer to the wanted value by being too low, and so you get an output that is too low. When the value can be represented exactly (like for example any power of 2), you never get an error.
It's not simple, but there you go. I'm sure you can code this up.
*note: for very small in magnitude values (roughly less than 2^-127) you get into weirdness called denormals. I'm not going to explain them, but they won't fit the pattern. Luckily they don't show up much. And once you get into that range, your accuracy goes to pot anyway.

You can control the number of decimal points that are outputted by including this in the format specifier.
So instead of having
float f = 43.2f,
printf("f1 = %f\n",f);
Have this
float f = 43.2f,
printf("f1 = %.2f\n",f);
to print two numbers after the decimal point.
Do note that floating point numbers are not precisely represented in memory.

The compiler and CPU use IEEE 754 to represent floating point values in memory. Most rational numbers cannot be expressed exactly in this format, so the compiler chooses the closest approximate representation.
To avoid unpredictable output, you should round to the appropriate precision.
// outputs "0.70"
printf("%.2f\n", 0.7f);

A floating point number or a double precision floating point number is stored as an integer numerator, and a power of 2 as denominator. The math behind it is pretty simple. It involves shifting and bit testing.
So when you declare a constant in base 10, the compiler converts it to a binary integer in 23 bits and an exponent in 8 (or 52 bit integer and 11 bit exponent).
To print it back out, it converts this fraction back into base 10.

Gross simplification: the rule is that "floats are good for 2 or 3 decimal places, doubles for 4 or 5". That is to say, the first 2 or 3 decimal places printed will be exactly what you put in. After that, you have to work out the encoding to see what you're going to get.
This is only a rule of thumb, and as it happens your test case shows one instance where the float representation is only good to 1 d.p.

The way to figure out what will be printed is to simulate exactly what the compiler / libraries / hardware will do:
Convert the number to binary, and round to 24 significant (binary) digits.
Convert that number to decimal, and round to 6 (decimal) digits after the decimal point.
Of course, this is exactly what your program does already, so what are you asking for?
Edit to illustrate, I'll work through one of your examples:
Begin by converting 23.7 to binary:
10111.1011001100110011001100110011001100110011001100110011...
Round that number to 24 significant binary digits:
10111.1011001100110011010
Note that it rounded up. Converting back to decimal gives:
23.700000762939453125
Now, round this value to 6 digits after the decimal point:
23.700001
Which is exactly what you observed.

Rounding doubles - .5 - sprintf

I'm using the following code for rounding to 2dp:
sprintf(temp,"%.2f",coef[i]); //coef[i] returns a double
It successfully rounds 6.666 to 6.67, but it doesn't work properly when rounding
5.555. It returns 5.55, whereas it should (at least in my opinion) return 5.56.
How can I get it to round up when the next digit is 5? i.e. return 5.56.
edit: I now realise that this is happening because when I enter 5.555 with cin it gets
saved as 5.554999997.
I'm going to try rounding in two stages- first to 3dp and then to 2dp. any other
(more elegant) ideas?

It seems you have to use math round function for correct rounding.
printf("%.2f %.2f\n", 5.555, round(5.555 * 100.)/100.);
This gives the following output on my machine:
5.55 5.56

The number 5.555 cannot be represented as an exact number in IEEE754. Printing out the constant 5.555 with "%.50f" results in:
5.55499999999999971578290569595992565155029300000000
so it will be rounded down. Try using this instead:
printf ("%.2f\n",x+0.0005);
although you need to be careful of numbers that can be represented exactly, since they'll be rounded up wrongly by this expression.
You need to understand the limitations of floating point representations. If it's important that you get accuracy, you can use (or code) a BCD or other decimal class that doesn't have the shortcoming of IEEE754 representation.

How about this for another possible solution:
printf("%.2f", _nextafter(n, n*2));
The idea is to increase the number away from zero (the n*2 gets the sign right) by the smallest possible amount representable by floating point math.
Eg:
double n=5.555;
printf("%.2f\n", n);
printf("%.2f\n", _nextafter(n, n*2));
printf("%.20f\n", n);
printf("%.20f\n", _nextafter(n, n*2));
With MSVC yields:
5.55
5.56
5.55499999999999970000
5.55500000000000060000

This question is tagged C++, so I'll proceed under that assumption. Note that the C++ streams will round, unlike the C printf family. All you have to do is provide the precision you want and the streams library will round for you. I'm just throwing that out there in case you don't already have a reason not to use streams.

You could also do this (saves multiply/divide):
printf("%.2f\n", coef[i] + 0.00049999);

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js