c++ incorrect floating point arithmetic

c++ incorrect floating point arithmetic - c++

For the following program:
#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
for (float a = 1.0; a < 10; a++)
cout << std::setprecision(30) << 1.0/a << endl;
return 0;
}
I recieve the following output:
1
0.5
0.333333333333333314829616256247
0.25
0.200000000000000011102230246252
0.166666666666666657414808128124
0.142857142857142849212692681249
0.125
0.111111111111111104943205418749
Which is definitely not right right for the lower place digits, particularly with respect to 1/3,1/5,1/7, and 1/9. things just start going wrong around 10^-16 I would expect to see out put more resembling:
1
0.5
0.333333333333333333333333333333
0.25
0.2
0.166666666666666666666666666666
0.142857142857142857142857142857
0.125
0.111111111111111111111111111111
Is this an inherit flaw in the float class? Is there a way to overcome this and have proper division? Is there a special datatype for doing precise decimal operations? Am I just doing something stupid or wrong in my example?

There are a lot of numbers that computers cannot represent, even if you use float or double-precision float. 1/3, or .3 repeating, is one of those numbers. So it just does the best it can, which is the result you get.
See http://floating-point-gui.de/, or google float precision, there's a ton of info out there (including many SO questions) on this subject.
To answer your questions -- yes, this is an inherent limitation in both the float class and the double class. Some mathematical programs (MathCAD, probably Mathematica) can do "symbolic" math, which allows calculation of the "correct" answers. In many cases, the round-off error can be managed, even over really complex computations, such that the top 6-8 decimal places are correct. However, the opposite is true as well -- naive computations can be constructed that return wildly incorrect answers.
For small problems like division of whole numbers, you'll get a decent number of decimal place accuracy (maybe 4-6 places). If you use double precision floats, that will go up to maybe 8. If you need more... well, I'd start questioning why you want that many decimal places.

First of all, since your code does 1.0/a, it gives you double (1.0 is a double value, 1.0f is float) as the rules of C++ (and C) always extends a smaller type to the larger one if the operands of an operation is different size (so, int + char makes the char into an int before adding the values, long + int will make the int long, etc, etc).
Second floating point values have a set number of bits for the "number". In float, that is 23 bits (+ 1 'hidden' bit), and in double it's 52 bits (+1). Yet get approximately 3 digits per bit (exactly: log2(10), if we use decimal number representation), so a 23 bit number gives approximately 7-8 digits, a 53 bit number approximately 16-17 digits. The remainder is just "noise" caused by the last few bits of the number not evening out when converting to a decimal number.
To have infinite precision, we would have to either store the value as a fraction, or have an infinite number of bits. And of course, we could have some other finite precision, such as 100 bits, but I'm sure you'd complain about that too, because it would just have another 15 or so digits before it "goes wrong".

Floats only have so much precision (23 bits worth to be precise). If you REALLY want to see "0.333333333333333333333333333333" output, you could create a custom "Fraction" class which stores the numerator and denominator separately. Then you could calculate the digit at any given point with complete accuracy.

Related

Very large differences using float and double

#include <iostream>
using namespace std;
int main() {
int steps=1000000000;
float s = 0;
for (int i=1;i<(steps+1);i++){
s += (i/2.0) ;
}
cout << s << endl;
}
Declaring s as float: 9.0072e+15
Declaring s as double: 2.5e+17 (same result as implementing it in Julia)
I understand double has double precision than float, but float should still handle numbers up to 10^38.
I did read similar topics where results where not the same, but in that cases the differences were very small, here the difference is 25x.
I also add that using long double instead gives me the same result as double. If the matter is the precision, I would have expected to have something a bit different.

The problem is the lack of precision: https://en.wikipedia.org/wiki/Floating_point
After 100 million numbers you are adding 1e8 to 1e16 (or at least numbers of that magnitude), but single precision numbers are only accurate to 7 digits - so it is the same as adding 0 to 1e16; that's why your result is considerably lower for float.
Prefer double over float in most cases.

Problem with floating point precision! Infinite real numbers cannot possibly be represented by the finite memory of a computer. Float, in general, are just approximations of the number they are meant to represent.
For more details, please check the following documentation:
https://softwareengineering.stackexchange.com/questions/101163/what-causes-floating-point-rounding-errors

You didn't mention what type of floating point numbers you are using, but I'm going to assume that you use IEEE 754, or similar.
I understand double has double precision
To be more precise with the terminology, double uses twice as many bits. That's not double the number of reprensentable values, it's 4294967296 times as many representable values, despite being named "double precision".
but float should still handle numbers up to 10^38.
Float can handle a few numbers up to that magnitude. But that does't mean that float values in that range are precise. For example, 3,4028235E+38 can be represented as a single precision float. How much would you imagine is the difference between the previous value representable by float? Is it the machine epsilon? Perhaps 0.1? Maybe 1? No. The difference is about 2E+31.
Now, your numbers aren't quite in that range. But, they're outside the continuous range of whole integers that can be precisely represented by float. The highest value in that range happens to be 16777217, or about 1.7E+7, which is way less than 2.5E+17. So, every addition beyond that range adds some error to the result. You perform a billion calculations so those errors add up.
Conclusions:
Understand that single precision is way less precise than double precision.
Avoid long sequences of calculations where precision errors can accumulate.

C++ determining if a number is an integer

I have a program in C++ where I divide two numbers, and I need to know if the answer is an integer or not. What I am using is:
if(fmod(answer,1) == 0)
I also tried this:
if(floor(answer)==answer)
The problem is that answer usually is a 5 digit number, but with many decimals. For example, answer can be: 58696.000000000000000025658 and the program considers that an integer.
Is there any way I can make this work?
I am dividing double a/double b= double answer
(sometimes there are more than 30 decimals)
Thanks!
EDIT:
a and b are numbers in the thousands (about 100,000) which are then raised to powers of 2 and 3, added together and divided (according to a complicated formula). So I am plugging in various a and b values and looking at the answer. I will only keep the a and b values that make the answer an integer. An example of what I got for one of the answers was: 218624 which my program above considered to be an integer, but it really was: 218624.00000000000000000056982 So I need a code that can distinguish integers with more than 20-30 decimals.

You can use std::modf in cmath.h:
double integral;
if(std::modf(answer, &integral) == 0.0)
The integral part of answer is stored in fraction and the return value of std::modf is the fractional part of answer with the same sign as answer.

The usual solution is to check if the number is within a very short distance of an integer, like this:
bool isInteger(double a){
double b=round(a),epsilon=1e-9; //some small range of error
return (a<=b+epsilon && a>=b-epsilon);
}
This is needed because floating point numbers have limited precision, and numbers that indeed are integers may not be represented perfectly. For example, the following would fail if we do a direct comparison:
double d=sqrt(2); //square root of 2
double answer=2.0/(d*d); //2 divided by 2
Here, answer actually holds the value 0.99999..., so we cannot compare that to an integer, and we cannot check if the fractional part is close to 0.
In general, since the floating point representation of a number can be either a bit smaller or a bit bigger than the actual number, it is not good to check if the fractional part is close to 0. It may be a number like 0.99999999 or 0.000001 (or even their negatives), these are all possible results of a precision loss. That's also why I'm checking both sides (+epsilon and -epsilon). You should adjust that epsilon variable to fit your needs.
Also, keep in mind that the precision of a double is close to 15 digits. You may also use a long double, which may give you some extra digits of precision (or not, it is up to the compiler), but even that only gets you around 18 digits. If you need more precision than that, you will need to use an external library, like GMP.

Floating point numbers are stored in memory using a very different bit format than integers. Because of this, comparing them for equality is not likely to work effectively. Instead, you need to test if the difference is smaller than some epsilon:
const double EPSILON = 0.00000000000000000001; // adjust for whatever precision is useful for you
double remainder = std::fmod(numer, denom);
if(std::fabs(0.0 - remainder) < EPSILON)
{
//...
}
Alternatively, if you want to include values that are close to integers (based on your desired precision), you can modify the if condition slightly (since the remainder returned by std::fmod will be in the range [0, 1)):
if (std::fabs(std::round(d) - d) < EPSILON)
{
// ...
}
You can see the test for this here.
Floating point numbers are generally somewhat precise to about 12-15 digits (as a double), but as they are stored as a mantissa (fraction) and a exponent, rational numbers (integers or common fractions) are not likely to be stored as such. For example,
double d = 2.0; // d might actually be 1.99999999999999995
Because of this, you need to compare the difference of what you expect to some very small number that encompasses the precision you desire (we will call this value, epsilon):
double d = 2.0;
bool test = std::fabs(2 - d) < epsilon; // will return true
So when you are trying to compare the remainder from std::fmod, you need to check it against the difference from 0.0 (not for actual equality to 0.0), which is what is done above.
Also, the std::fabs call prevents you from having to do 2 checks by asserting that the value will always be positive.
If you desire a precision that is greater than 15-18 decimal places, you cannot use double or long double; you will need to use a high precision floating point library.

Exact decimal datatype for C++?

PHP has a decimal type, which doesn't have the "inaccuracy" of floats and doubles, so that 2.5 + 2.5 = 5 and not 4.999999999978325 or something like that.
So I wonder if there is such a data type implementation for C or C++?

The Boost.Multiprecision library has a decimal based floating point template class called cpp_dec_float, for which you can specify any precision you want.
#include <iostream>
#include <iomanip>
#include <boost/multiprecision/cpp_dec_float.hpp>
int main()
{
namespace mp = boost::multiprecision;
// here I'm using a predefined type that stores 100 digits,
// but you can create custom types very easily with any level
// of precision you want.
typedef mp::cpp_dec_float_100 decimal;
decimal tiny("0.0000000000000000000000000000000000000000000001");
decimal huge("100000000000000000000000000000000000000000000000");
decimal a = tiny;
while (a != huge)
{
std::cout.precision(100);
std::cout << std::fixed << a << '\n';
a *= 10;
}
}

Yes:
There are arbitrary precision libraries for C++.
A good example is The GNU Multiple Precision arithmetic library.

If you are looking for data type supporting money / currency then try this:
https://github.com/vpiotr/decimal_for_cpp
(it's header-only solution)

There will be always some precision. On any computer in any number representation there will be always numbers which can be represented accurately, and other numbers which can't.
Computers use a base 2 system. Numbers such as 0.5 (2^-1), 0.125 (2^-3), 0.325 (2^-2 + 2^-3) will be represented accurately (0.1, 0.001, 0.011 for the above cases).
In a base 3 system those numbers cannot be represented accurately (half would be 0.111111...), but other numbers can be accurate (e.g. 2/3 would be 0.2)
Even in human base 10 system there are numbers which can't be represented accurately, for example 1/3.
You can use rational number representation and all the above will be accurate (1/2, 1/3, 3/8 etc.) but there will be always some irrational numbers too. You are also practically limited by the sizes of the integers of this representation.
For every non-representable number you can extend the representation to include it explicitly. (e.g. compare rational numbers and a representation a/b + c/d*sqrt(2)), but there will be always more numbers which still cannot be represented accurately. There is a mathematical proof that says so.
So - let me ask you this: what exactly do you need? Maybe precise computation on decimal-based numbers, e.g. in some monetary calculation?

What you're asking is anti-physics.
What phyton (and C++ as well) do is cut off the inaccuracy by rounding the result at the time to print it out, by reducing the number of significant digits:
double x = 2.5;
x += 2.5;
std::cout << x << std::endl;
just makes x to be printed with 6 decimal digit precision (while x itself has more than 12), and will be rounded as 5, cutting away the imprecision.
Alternatives are not using floating point at all, and implement data types that do just integer "scaled" arithmetic: 25/10 + 25/10 = 50/10;
Note, however, that this will reduce the upper limit represented by each integer type. The gain in precision (and exactness) will result in a faster reach to overflow.
Rational arithmetic is also possible (each number is represented by a "numarator" and a "denominator"), with no precision loss against divisions, (that -in fact- are not done unless exact) but again, with increasing values as the number of operation grows (the less "rational" is the number, the bigger are the numerator and denominator) with greater risk of overflow.
In other word the fact a finite number of bits is used (no matter how organized) will always result in a loss you have to pay on the side of small on on the side of big numbers.

I presume you are talking about the Binary Calculator in PHP. No, there isn't one in the C runtime or STL. But you can write your own if you are so inclined.
Here is a C++ version of BCMath compiled using Facebook's HipHop for PHP:
http://fossies.org/dox/facebook-hiphop-php-cf9b612/dir_2abbe3fda61b755422f6c6bae0a5444a.html

Being a higher level language PHP just cuts off what you call "inaccuracy" but it's certainly there. In C/C++ you can achieve similar effect by casting the result to integer type.

C++ function that do base 10 significant + exponent calculation from double

I need to represent numbers using the following structure. The purpose of this structure is not to lose the precision.
struct PreciseNumber
{
long significand;
int exponent;
}
Using this structure actual double value can be represented as value = significand * 10e^exponent.
Now I need to write utility function which can covert double into PreciseNumber.
Can you please let me know how to extract the exponent and significand from the double?

The prelude is somewhat flawed.
Firstly, barring any restrictions on storage space, conversion from a double to a base 10 significand-exponent form won't alter the precision in any form. To understand that, consider the following: any binary terminating fraction (like the one that forms the mantissa on a typical IEEE-754 float) can be written as a sum of negative powers of two. Each negative power of two is a terminating fraction itself, and hence it follows that their sum must be terminating as well.
However, the converse isn't necessarily true. For instance, 0.3 base 10 is equivalent to the non-terminating 0.01 0011 0011 0011 ... in base 2. Fitting this into a fixed size mantissa would blow some precision out of it (which is why 0.3 is actually stored as something that translates back to 0.29999999999999999.)
By this, we may assume that any precision that is intended by storing the numbers in decimal significand-exponent form is either lost, or isn't simply gained at all.
Of course, you might think of the apparent loss of accuracy generated by storing a decimal number as a float as loss in precision, in which case the Decimal32 and Decimal64 floating point formats may be of some interest -- check out http://en.wikipedia.org/wiki/Decimal64_floating-point_format.

This is a very difficult problem. You might want to see how much code it takes to implement a double-to-string conversion (for printf, e.g.). You might steal the code from gnu's implementation of gcc.

You cannot convert an "imprecise" double into a "precise" decimal number, because the required "precision" simply isn't there to begin with (otherwise why would you even want to convert?).
This is what happens if you try something like it in Java:
BigDecimal x = new BigDecimal(0.1);
System.out.println(x);
The output of the program is:
0.1000000000000000055511151231257827021181583404541015625

Well you're at less precision than a typical double. Your significand is a long giving you a range from -2 billion to +2 billion which is more than 9 but fewer than 10 digits of precision.
Here's an untested starting point on what you'd want to do for some simple math on PreciseNumbers
PreciseNumber Multiply(PreciseNumber lhs, PreciseNumber rhs)
{
PreciseNumber ret;
ret.s=lhs.s;
ret.e=lhs.e;
ret.s*=rhs.s;
ret.e+=lhs.e;
return ret;
}
PreciseNumber Add(PreciseNumber lhs, PreciseNumber rhs)
{
PreciseNumber ret;
ret.s=lhs.s;
ret.e=lhs.e;
ret.s+=(rhs.s*pow(10,rhs.e-lhs.e));
}
I didn't take care of any renormalization, but in both cases there are places where you have to worry about over/under flows and loss of precision. Just because you're doing it yourself rather than letting the computer take care of it in a double, doesn't meat the same pitfalls aren't there. The only way to not lose precision is to keep track of all of the digits.

Here's a very rough algorithm. I'll try to fill in some details later.
Take the log10 of the number to get the exponent. Multiply the double by 10^x if positive, or divide by 10^-x if negative.
Start with a significand of zero. Repeat the following 15 times, since a double contains 15 digits of significance:
Multiply the previous significand by 10.
Take the integer portion of the double, add it to the significand, and subtract it from the double.
Subtract 1 from the exponent.
Multiply the double by 10.
When finished, take the remaining double value and use it for rounding: if it's >= 5, add one to the significand.

double type digits in C++

The IEE754 (64 bits) floating point is supposed to correctly represent 15 significant digit although the internal representation has 17 ditigs. Is there a way to force the 16th and 17th digits to zero ??
Ref:
http://msdn.microsoft.com/en-us/library/system.double(VS.80).aspx :
.
.
Remember that a floating-point number can only approximate a decimal number, and that the precision of a floating-point number determines how accurately that number approximates a decimal number. By default, a Double value contains 15 decimal digits of precision, although a maximum of 17 digits is maintained internally. The precision of a floating-point number has several consequences:
.
.
Example nos:
d1 = 97842111437.390091
d2 = 97842111437.390076
d1 and d2 differ in 16th and 17th decimal places that are not supposed to be significant. Looking for ways to force them to zero. ie
d1 = 97842111437.390000
d2 = 97842111437.390000

No. Counter-example: the two closest floating-point numbers to a rational
1.11111111111118
(which has 15 decimal digits) are
1.1111111111111799942818834097124636173248291015625
1.1111111111111802163264883347437717020511627197265625
In other words, there is not floating-point number that starts with 1.1111111111111800.

This question is a little malformed. The hardware stores the numbers
in binary, not decimal. So in the general case you can't do precise
math in base 10. Some decimal numbers (0.1 is one of them!) do not
even have a non-repeating representation in binary. If you have
precision requirements like this, where you care about the number
being of known precision to exactly 15 decimal digits, you will need
to pick another representation for your numbers.

No, but I wonder if this is relevant to any of your issues (GCC specific):
GCC Documentation
-ffloat-store Do not store floating point variables in registers, and
inhibit other options that might
change whether a floating point value
is taken from a register or memory.
This option prevents undesirable
excess precision on machines such as
the 68000 where the floating registers
(of the 68881) keep more precision
than a double is supposed to have.
Similarly for the x86 architecture.
For most programs, the excess
precision does only good, but a few
programs rely on the precise
definition of IEEE floating point. Use
-ffloat-store for such programs, after modifying them to store all pertinent
intermediate computations into
variables.

You should be able to directly modify the bits in your number by creating a union with a field for the floating point number and an integral type of the same size. Then you can access the bits you want and set them however you want. Here is in example where I whack the sign bit; you can choose any field you want, of course.
#include <stdio.h>
union double_int {
double fp;
unsigned long long integer;
};
int main(int argc, const char *argv[])
{
double my_double = 1325.34634;
union double_int *my_union = (union double_int *)&my_double;
/* print original numbers */
printf("Float %f\n", my_double);
printf("Integer %llx\n", my_union->integer);
/* whack the sign bit to 1 */
my_union->integer |= 1ULL << 63;
/* print modified numbers */
printf("Negative float %f\n", my_double);
printf("Negative integer %llx\n", my_union->integer);
return 0;
}

Generally speaking, people only care about something like this ("I only want the first x digits") when displaying the number. That's relatively easy with stringstreams or sprintf.
If you're concerned about comparing numbers with ==; you really can't do that with floating point numbers. Instead you want to see if the numbers are close enough (say, within an epsilon() of each other).
Playing with the bits of the number directly isn't a great idea.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

c++ incorrect floating point arithmetic - c++

Related

Very large differences using float and double

C++ determining if a number is an integer

Exact decimal datatype for C++?

C++ function that do base 10 significant + exponent calculation from double

double type digits in C++

Categories

Resources