Inconsistencies with double data type in C++

This may be something really simple that I'm just missing, however I am having trouble using the double data type. It states here that a double in C++ is accurate to ~15 digits. Yet in the code:
double n = pow(2,1000);
cout << fixed << setprecision(0) << n << endl;
n stores the exact value of 2^1000, something that is 302 decimal digits long according to WolframAlpha. Yet, when I attempt to compute 100!, with the function:
double factorial(int number)
{
    double product = 1.0;
    for (int i = 1; i <= number; i++) { product *= i; }
    return product;
}
Inaccuracies begin to appear at the 16th digit. Can someone please explain this behaviour, and provide me with some kind of solution?
Thanks

You will eventually need to read the basics in What Every Computer Scientist Should Know About Floating-Point Arithmetic. The most common standard for floating-point hardware is IEEE 754; the current version is from 2008, but most CPUs do not implement the new decimal arithmetic featured in the 2008 edition.
However, a floating point number (double or float) stores an approximation to the value, using a fixed-size mantissa (fractional part) and an exponent (power of 2). The dynamic range of the exponents is such that the values that can be represented range from about 10^-300 to 10^+300 in decimal, with about 16 significant (decimal) digits of accuracy. People persistently try to print more digits than can be stored, and get interesting and machine-dependent (and library-dependent) results when they do.
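For instance, a small sketch like the following (assuming the usual IEEE-754 double) pulls a value apart into its fraction and power-of-two exponent with std::frexp, which makes the mantissa/exponent split concrete:

#include <cmath>
#include <cstdio>

int main() {
    for (double v : {0.1, 1.0, 1e10, 1e300}) {
        int exp = 0;
        double frac = std::frexp(v, &exp);          // v == frac * 2^exp, with 0.5 <= frac < 1
        std::printf("%.17g = %.17g * 2^%d\n", v, frac, exp);
    }
}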
So, what you are seeing is intrinsic to the nature of floating-point arithmetic on computers using fixed-precision values. If you need greater precision, you need to use one of the arbitrary-precision libraries - there are many of them available.

The double actually stores an exponent to be applied to 2 in its internal representation. So of course it can store 2^1000 accurately. But try adding 1 to that.
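To see that concretely, here is a quick sketch (assuming IEEE-754 doubles): the spacing between adjacent doubles near 2^1000 is about 2^948, so adding 1 simply rounds back to the same value.

#include <cmath>
#include <iostream>

int main() {
    double n = std::pow(2, 1000);
    double m = n + 1.0;                               // 1 is far smaller than the gap between doubles here
    std::cout << std::boolalpha << (n == m) << '\n';  // prints true: the +1 was lost to rounding
}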

IEEE 754 gives an algorithm for storing floating point data. Computers have a finite number of bits with which to store an infinite set of numbers, which inevitably introduces error when storing most values.
In this form of representation, there is less room for precision the larger the represented number gets (larger meaning farther from zero): the gap between adjacent representable values grows with the magnitude of the number. That is probably the loss of precision you are seeing, and as you move to even larger numbers, the loss of precision gets even larger.
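You can watch that gap grow with magnitude using std::nextafter; a minimal sketch, assuming IEEE-754 doubles:

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    for (double x : {1.0, 1.0e8, 1.0e16, 1.0e300}) {
        double gap = std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
        std::cout << "gap above " << x << " is about " << gap << '\n';
    }
}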

Related

Is a floating-point value of 0.0 represented differently from other floating-point values?

I've been going back through my C++ book, and I came across a statement that says zero can be represented exactly as a floating-point number. I was wondering how this is possible unless the value of 0.0 is stored as a type other than a floating point value. I wrote the following code to test this:
#include <iomanip>
#include <iostream>
int main()
{
    float value1 {0.0};
    float value2 {0.1};
    std::cout << std::setprecision(10) << std::fixed;
    std::cout << value1 << '\n'
              << value2 << std::endl;
}
Running this code gave the following output:
0.0000000000
0.1000000015
To 10 digits of precision, 0.0 is still 0, and 0.1 has some inaccuracies (which is to be expected). Is a value of 0.0 different from other floating point numbers in the way it is represented, and is this a feature of the compiler or the computer's architecture?
How can 2 be represented as an exact number? 4? 15? 0.5? The answer is just that some numbers can be represented exactly in the floating-point format (which is based on base-2/binary) and others can't.
This is no different from in decimal. You can't represent 1/3 exactly in decimal, but that doesn't mean you can't represent 0.
Zero is special in a way, because it's more trivial to prove that it can be represented exactly than it is for some arbitrary fraction. But that's about it.
So:
what is it about these values (0, 1/16, 1/2048, ...) that allows them to be represented exactly.
Simple mathematics. In any given base, in the sort of representation we're talking about, some numbers can be written out with a finite number of digits after the point; others can't. That's it.
You can play online with H. Schmidt's IEEE-754 Floating Point Converter for different numbers to see a bunch of different representations, and what errors come about as a result of encoding into those representations. For starters, try 0.5, 0.2 and 0.1.
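If you'd rather see it in code than in the converter, a small sketch along these lines prints the value each float actually holds (the digits in the comments assume the usual IEEE-754 single precision):

#include <iomanip>
#include <iostream>

int main() {
    std::cout << std::fixed << std::setprecision(30);
    std::cout << 0.5f << '\n'    // 0.500000000000000000000000000000  (exactly representable)
              << 0.2f << '\n'    // 0.200000002980232238769531250000  (nearest float to 0.2)
              << 0.1f << '\n';   // 0.100000001490116119384765625000  (nearest float to 0.1)
}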
It was my (perhaps naive) understanding that all floating point values contained some instability.
No, absolutely not.
You want to treat every floating point value in your program as potentially having some small error on it, because you generally don't know what sequence of calculations led to it. You can't trust it, in general. I expect someone half-taught this to you in the past, and that's what led to your misunderstanding.
But, if you do know the error (or lack thereof) involved at each step in the creation of the value (e.g. "all I've done is initialised it to zero"), then that's fine! No need to worry about it then.
Here is one way to look at the situation: with 64 bits to store a number, there are 2^64 bit patterns. Some of these are "not-a-number" representations, but most of the 2^64 patterns represent numbers. The number that is represented is represented exactly, with no error. This might seem strange after learning about floating point math; a caveat lurks ahead.
However, as huge as 2^64 is, there are infinitely many more real numbers. When a calculation produces a non-integer result, the odds are pretty good that the answer will not be a number represented by one of the 2^64 patterns. There are exceptions. For example, 1/2 is represented by one of the patterns. If you store 0.5 in a floating point variable, it will actually store 0.5. Let's try that for other single-digit denominators. (Note: I am writing fractions for their expressive power; I do not intend integer arithmetic.)
1/1 – stored exactly
1/2 – stored exactly
1/3 – not stored exactly
1/4 – stored exactly
1/5 – not stored exactly
1/6 – not stored exactly
1/7 – not stored exactly
1/8 – stored exactly
1/9 – not stored exactly
So with these simple examples, over half are not stored exactly. When you get into more complicated calculations, any one piece of the calculation can throw you off the islands of exact representation. Do you see why the general rule of thumb is that floating point values are not exact? It is incredibly easy to fall into that realm. It is possible to avoid it, but don't count on it.
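If you want to reproduce that list mechanically, note that 1/n has a finite binary expansion (and so fits a double exactly, for small n) precisely when n is a power of two; a throwaway sketch:

#include <iostream>

// 1/n terminates in binary exactly when n is a power of two.
bool one_over_n_is_exact(unsigned n) {
    return n != 0 && (n & (n - 1)) == 0;
}

int main() {
    for (unsigned n = 1; n <= 9; ++n)
        std::cout << "1/" << n
                  << (one_over_n_is_exact(n) ? " - stored exactly\n" : " - not stored exactly\n");
}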
Some numbers can be represented exactly by a floating point value. Most cannot.

Printing exact values for floats

How does a program (MySQL is an example) store a float like 0.9 and then return it to me as 0.9? Why does it not return 0.899...?
The issue I am currently experiencing is retrieving floating point values from MySQL with C++ and then reprinting those values.
There are software libraries, like Gnu MP, that implement arbitrary precision arithmetic and can calculate floating point numbers to a specified precision. Using Gnu MP you can, for example, add 0.3 to 0.6, and get exactly 0.9. No more, no less.
Database servers do pretty much the same thing.
For normal, run of the mill applications, native floating point arithmetic is fast, and it's good enough. But database servers typically have plenty of spare CPU cycles. Their limiting factors will not be available CPU, but things like available I/O bandwidth. They'll have plenty of CPU cycles to burn on executing complicated arbitrary precision arithmetic calculations.
There are a number of algorithms for rounding floating point numbers in a way that will result in the same internal representation when read back in. For an overview of the subject, with links to papers with full details of the algorithms, see
Printing Floating-Point Numbers
What's happening, in a nutshell, is that the function which converts the floating-point approximation of 0.9 to decimal text is actually coming up with a value like 0.90000....0123 or 0.89999....9573. This gets rounded to 0.90000...0. And then these trailing zeros are trimmed off so you get a tidy looking 0.9.
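You can reproduce both halves of that process with something like the following sketch (the digits assume an IEEE-754 double): printing with lots of digits exposes the noise, printing with 15 rounds it back to a tidy 0.9.

#include <iomanip>
#include <iostream>

int main() {
    double d = 0.9;
    std::cout << std::setprecision(60) << d << '\n';   // e.g. 0.900000000000000022204460492503...
    std::cout << std::setprecision(15) << d << '\n';   // 0.9
}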
Although floating-point numbers are inexact, and often do not use base 10 internally, they can in fact precisely save and recover a decimal representation. For instance, an IEEE 754 64 bit representation has enough precision to preserve 15 decimal digits. This is often mapped to the C language type double, and that language has the constant DBL_DIG, which will be 15 when double is this IEEE type.
If a decimal number with 15 significant digits or fewer is converted to double, it can be converted back to exactly that number. The conversion routine just has to round it off at 15 digits; of course, if the conversion routine uses, say, 40 digits, there will be messy trailing digits representing the error between the floating-point value and the original number. The more digits you print, the more accurately rendered is that error.
There is also the opposite problem: given a floating-point object, can it be printed into decimal such that the resulting decimal can be scanned back to reproduce that object? For an IEEE 64 bit double, the number of decimal digits required for that is 17.
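In C++ that 17 is available as std::numeric_limits<double>::max_digits10; a sketch of the round trip (assuming the usual correctly-rounded conversions in the standard library):

#include <iomanip>
#include <iostream>
#include <limits>
#include <sstream>

int main() {
    double original = 1.0 / 3.0;

    std::ostringstream out;
    out << std::setprecision(std::numeric_limits<double>::max_digits10) << original;

    double recovered;
    std::istringstream(out.str()) >> recovered;

    std::cout << out.str() << '\n'
              << std::boolalpha << (recovered == original) << '\n';   // true: 17 digits round-trip exactly
}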

Real numbers - how to determine whether float or double is required?

Given a real value, can we check if a float data type is enough to store the number, or a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?
For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose that you get this real number by specifying it in code or through user input; a way to check if a float or a double would be enough to store it without precision loss is to just count the number of significant bits and check that against the data range for float and double.
If the number is given as an expression (e.g. 1/7 or sqrt(2)), you will also want ways of detecting:
whether the number is rational, and whether its expansion repeats (is cyclic);
what happens when you have an irrational number.
Moreover, there are numbers, such as 0.9, that float / double cannot in theory represent "exactly" (at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.
Lastly, see additional discussion on float vs. double.
Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.
Single precision assigns 23 bits of "mantissa," or binary digits after the radix point (the binary counterpart of the decimal point). Since the bit before the point is always one, this equates to a 24-bit fraction. Dividing by log2(10) = 3.3, a float gets you 7.2 decimal digits of precision.
Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).
The bits besides the mantissa are used for the exponent. The number of exponent bits determines the range of numbers allowed. Single goes to about 10^±38, double goes to about 10^±308.
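Rather than working those figures out by hand, you can query them from std::numeric_limits; a small sketch (the numbers in the comments are typical for IEEE-754 hardware):

#include <iostream>
#include <limits>

template <typename T>
void report(const char* name) {
    using L = std::numeric_limits<T>;
    std::cout << name << ": " << L::digits << " mantissa bits, "
              << L::digits10 << " decimal digits, max ~" << L::max() << '\n';
}

int main() {
    report<float>("float");               // typically 24 bits, 6 digits, ~3.4e38
    report<double>("double");             // typically 53 bits, 15 digits, ~1.8e308
    report<long double>("long double");   // 64 bits / 18 digits on x86's 80-bit format
}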
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
A very detailed post that may or may not answer your question.
An entire series in floating point complexities!
Couldn't you simply store it in both a float and a double variable and then compare these two? That should implicitly convert the float back to a double - if there is no difference, the float is sufficient?
float f = value;     // the value narrowed to float
double d = value;    // the same value kept at double precision
if ((double)f == d)
{
    // float is sufficient
}
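Wrapped into a self-contained sketch (the helper name here is made up for illustration), the same check might look like this:

#include <iostream>

// True if narrowing to float and widening back loses nothing,
// i.e. float already holds this value exactly.
bool fits_in_float(double value) {
    return static_cast<double>(static_cast<float>(value)) == value;
}

int main() {
    std::cout << std::boolalpha
              << fits_in_float(0.5)  << '\n'   // true:  0.5 is exact in float
              << fits_in_float(0.1)  << '\n'   // false: the double nearest 0.1 doesn't survive the trip
              << fits_in_float(3.14) << '\n';  // false
}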
You cannot represent arbitrary real numbers with float or double variables, only a subset of the rational numbers.
When you do floating point computation, your CPU floating point unit will decide the best approximation for you.
I might be wrong, but I thought that the float (4 bytes) and double (8 bytes) floating point representations were actually specified independently of computer architectures.

Float subtraction returns incorrect value

So I have a calculation whereby two floats that are components of vector objects are subtracted and then seem to return an incorrect result.
The code I'm attempting to use is:
cout << xresult.x << " " << vec1.x << endl;
float xpart1 = xresult.x - vec1.x;
cout << xpart1 << endl;
Where running this code will return
16 17
-1.00002
As you can see, printing out the values of xresult.x and vec1.x tells you that they are 16 and 17 respectively, yet the subtraction operation seems to introduce an error.
Any ideas why?
As you can see, printing out the values of xresult.x and vec1.x tells you that they are 16 and 17 respectively, yet the subtraction operation seems to introduce an error.
No, it doesn't tell us that at all. It tells us that the input values are approximately 16 and 17. The imprecision might, generally, come from two sources: the nature of floating-point representation, and the precision with which the numbers are printed.
Output streams print floating-point values to a certain level of precision. From a description of the std::setprecision function:
On the default floating-point notation, the precision field specifies the maximum number of meaningful digits to display in total, counting both those before and those after the decimal point.
So, the values of xresult.x and vec1.x are 16 and 17 with 5 decimal digits of accuracy. In fact, one is slightly less than 16 and the other slightly more than 17. (Note that this has nothing to do with imprecise floating-point representation: the declarations float f = 16 and float g = 17 both assign exact values, since a float can hold the exact integers 16 and 17, although there are infinitely many other integers a float cannot hold.) When we subtract slightly-more-than-17 from slightly-less-than-16, we get an answer slightly more negative than -1.
To prove to yourself that this is the case, do one or both of these experiments. First, in your own code, add "cout << std::setprecision(10)" before printing those values. Second, run this test program
#include <iostream>
#include <iomanip>
int main() {
    for (int i = 0; i < 10; i++) {
        std::cout << std::setprecision(i)
                  << 15.99999f << " - " << 17.00001f << " = "
                  << 15.99999f - 17.00001f << "\n";
    }
}
Notice how the 7th line of output matches your case:
16 - 17 = -1.00002
P.S. All of the other advice about imprecise floating-point representation is valid; it just doesn't apply to your particular circumstance. You really should read "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
This is called floating point arithmetic. It is why numerical code is so "tricky" and filled with pitfalls. That result is expected, and what is more, exactly what you see, and to what extent, can depend on the processor you're working with.
I'd like to add that the floating point types (float, double, long double) each have a different precision. That is, some can represent a given value more accurately than others, which follows from how these numbers are held in memory.
When you look at a float, it contains fewer significant digits than, say, a double or long double. Hence, when you perform numerics on them, you must expect floats to suffer from larger rounding errors. When dealing with financial data, developers often use some kind of "decimal" type instead; these are much better designed to handle currency-style manipulations while keeping the significant digits accurate. That comes with a price, however.
Take a look at the IEEE 754-2008 specification.
It's because of how floating point works. http://en.wikipedia.org/wiki/Floating_point
Because you can't accurately represent all numbers using a float. Wikipedia has a good description of it: http://en.wikipedia.org/wiki/Floating_point
How much do you know about the way numbers are stored in a computer?
Also, what are xresult.x and vec1.x - are they ints or floats?
I'd be surprised if the error occurred when they were all floats, but if you are converting between types, remember that binary is not the same as decimal.
If there was a small fractional part on the 16 and 17 that wasn't printed out, then when the values are brought to a common exponent for subtraction, that could introduce extra error, especially for 32 bit types like float.
When you use floating point values, you need to be prepared within your application to deal with the fact that you won't get 100% accurate decimal results. Your results will be as accurate as possible in the internal binary representation. Addition and subtraction especially can introduce a significant amount of relative error for operands that are orders of magnitude apart and for results that should be close to 0.
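A quick sketch of that last point (cancellation), assuming IEEE-754 doubles: subtracting two nearly equal values leaves a result whose relative error is much larger than that of either operand.

#include <iomanip>
#include <iostream>

int main() {
    double a = 1.0 + 1e-10;     // stored with ~16 significant digits of accuracy
    double b = 1.0;
    std::cout << std::setprecision(17) << (a - b) << '\n';
    // The true difference is 1e-10, but this prints something like
    // 1.000000082740371e-10: most of the significant digits cancelled away.
}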
People keep talking about how computer representations cannot perfectly represent real numbers, and how computer operations on floating point numbers cannot be perfectly precise.
This is true, but the same is true of the real world.
Real measurements are approximations to some degree of precision. Operations on real measurements result in approximations to some degree of precision.
If I count 17 bowling balls, I have 17 bowling balls. If I remove 16 bowling balls, I have one bowling ball.
But if I have a stick that is 17 inches long, what I really have is a stick that is about 17 inches long. If I cut off 16 inches, what I'm really cutting off is about 16 inches, and what I'm left with is about 1 inch.
You have to keep track of the accuracy of your measurements, and the precision of your results. If I have 17.0, accurate to three significant digits, and subtract 16.0, also accurate to three significant digits, the result is 1.0, accurate to two significant digits. And that's what you got. Your mistake was in assuming that the extra precision provided by your results, beyond the accuracy you were given, was meaningful. It's not. It's meaningless noise.
This isn't something specific to computer floating point numbers, you have the same issue whether using a calculator or working out the problems by hand.
Keep track of your significant digits, and format your answers to suppress precision beyond what is significant.
Make your variables doubles instead of floats. You'll get more precision.
EDIT
Computers store numbers using a sequence of bits. The more bits you store, the higher the precision of the result. In fact a float usually has half as many bits as a double, so it has lower precision.

Setprecision() for a float number in C++?

In C++,
What are the random digits that are displayed after giving setprecision() for a floating point number?
Note: After setting the fixed flag.
example:
float f1=3.14;
cout << fixed << setprecision(10) << f1 << endl;
We get seemingly random digits for the remaining 7 places, but it is not the same case with double.
Two things to be aware of:
floats are stored in binary.
float has a maximum of 24 significant bits. This is equivalent to 7.22 significant digits.
So, to your computer, there's no such number as 3.14. The closest you can get using float is 3.1400001049041748046875.
double has 53 significant bits (~15.95 significant digits), so you get a more accurate approximation, 3.140000000000000124344978758017532527446746826171875. The "noise" digits don't show up with setprecision(10), but would with setprecision(17) or higher.
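A short sketch makes the two stored values above visible (the digits in the comments assume IEEE-754 single and double precision):

#include <iomanip>
#include <iostream>

int main() {
    float  f = 3.14f;
    double d = 3.14;
    std::cout << std::fixed << std::setprecision(25)
              << f << '\n'    // 3.1400001049041748046875000
              << d << '\n';   // 3.1400000000000001243449788
}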
They're not really "random" -- they're the (best available) decimal representation of that binary fraction (will be exact only for fractions whose denominator is a power of two, e.g., 3.125 would display exactly).
Of course that changes depending on the number of bits available to represent the binary fraction that best approaches the decimal one you originally entered as a literal, i.e., single vs double precision floats.
Not really a C++ specific issue (applies to all languages using binary floats, typically to exploit the machine's underlying HW, i.e., most languages). For a very bare-bone tutorial, I recommend reading this.