Adding double to double with more precision? - c++

I'm using double instead of float in my code, but unfortunately I ran into the following problem.
When I try to add:
1.000000000000020206059048177849 + 0.000000000000020206059048177849
I get this result:
1.000000000000040400000000000000
which loses the last 14 digits. I want the result to be more accurate.
I know this might look silly, but it really is important to me. Can anyone help?
Here's a simple code example:
#include <cstdlib>   // for system()
#include <iomanip>
#include <iostream>
using namespace std;

int main()
{
    double a = 1.000000000000020206059048177849 + 0.000000000000020206059048177849;
    cout << fixed << setprecision(30) << a << endl;
    system("pause");
    return 0;
}

Update: The answer below assumes that the expression is evaluated at run time, i.e. you are not adding compile-time constants. This is not necessarily true: your compiler may evaluate the expression at compile time, and it may use higher precision to do so. As suggested in the comments, the way you print the number might be the root cause of your problem.
If you absolutely need more precision and can't make any other changes, your only option is to increase the precision. double values provide about 16 decimal digits of precision. You have the following options:
Use a library that provides higher precision by implementing floating-point operations in software. This is slow, but you can get as precise as you want; see e.g. GMP, the GNU Multiple Precision Arithmetic Library.
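For illustration, here is a minimal sketch of the sum from the question using GMP's C++ interface (gmpxx). The 256-bit precision and the link flags (-lgmpxx -lgmp) are assumptions on my part, not something from the original question:
#include <gmpxx.h>
#include <iostream>

int main()
{
    // 256 bits of mantissa, far beyond the ~53 bits of a double
    mpf_class a("1.000000000000020206059048177849", 256);
    mpf_class b("0.000000000000020206059048177849", 256);
    mpf_class sum(0, 256);                    // result variable with the same precision
    sum = a + b;
    gmp_printf("%.30Ff\n", sum.get_mpf_t());  // print 30 digits after the decimal point
    return 0;
}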
The other option is to use a long double, which is at least as precise as double. On some platforms, long double may even provide more precision than a double, but there is no guarantee that it does. On your typical desktop PC it may be 80 bits long (compared to 64 bits), but this is not necessarily true and depends on your platform and your compiler. It is not portable.
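A minimal sketch of the long double route, with the caveat above that the extra precision is platform-dependent (on some compilers, e.g. MSVC, long double is the same 64-bit format as double, so the output may look no better):
#include <iomanip>
#include <iostream>
#include <limits>
using namespace std;

int main()
{
    // note the L suffix: without it the literals are plain doubles
    long double a = 1.000000000000020206059048177849L
                  + 0.000000000000020206059048177849L;
    cout << fixed << setprecision(30) << a << endl;
    cout << "mantissa bits of long double: "
         << numeric_limits<long double>::digits << endl;
    return 0;
}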
Maybe you can avoid the hassle and tune your implementation a bit to avoid floating-point errors. Can you reorder operations? Your intermediate results are of the form 1+x. Is there a way to compute x instead of 1+x? Subtracting 1 is not an option here, of course, because by then the precision of x has already been lost.
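As a hypothetical illustration of that last idea (whether it applies depends on where your numbers actually come from): if you can keep the small offsets separate and only add the leading 1 conceptually, the sum of the offsets is computed at full double precision:
#include <iomanip>
#include <iostream>
using namespace std;

int main()
{
    // keep only the parts below 1; the leading 1 is implicit
    double x = 0.000000000000020206059048177849;
    double y = 0.000000000000020206059048177849;
    double small_sum = x + y;   // no precision lost to the large leading 1
    cout << fixed << setprecision(30) << "1 + " << small_sum << endl;
    return 0;
}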

Related

Differences in rounded result when calling pow()

OK, I know there have been many questions about the pow function and casting its result to int, but I couldn't find an answer to this somewhat specific question.
OK, this is the C code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
    int i = 5;
    int j = 2;
    double d1 = pow(i, j);
    double d2 = pow(5, 2);
    int i1 = (int)d1;
    int i2 = (int)d2;
    int i3 = (int)pow(i, j);
    int i4 = (int)pow(5, 2);
    printf("%d %d %d %d", i1, i2, i3, i4);
    return 0;
}
And this is the output: "25 25 24 25". Notice that only in the third case, where the arguments to pow are not literals, do we get the wrong result, probably caused by rounding errors. The same thing happens without the explicit cast. Could somebody explain what happens in these four cases?
I'm using Code::Blocks on Windows 7, with the MinGW gcc compiler that came with it.
The result of the pow operation is 25.0000 plus or minus some bit of rounding error. If the rounding error is positive or zero, 25 will result from the conversion to an integer. If the rounding error is negative, 24 will result. Both answers are correct.
What is most likely happening internally is that in one case a higher-precision, 80-bit FPU value is being used directly and in the other case, the result is being written from the FPU to memory (as a 64-bit double) and then read back in (converting it to a slightly different 80-bit value). This can make a microscopic difference in the final result, which is all it takes to change a 25.0000000001 to a 24.999999997.
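Here is a hedged sketch of that mechanism; whether you actually see two different values depends on your compiler and FPU settings (x87 extended precision vs. SSE, optimization level, constant folding), so treat it as an experiment rather than a guaranteed reproduction:
#include <cmath>
#include <cstdio>

int main()
{
    int i = 5, j = 2;
    volatile double stored = std::pow(i, j);  // volatile forces a round trip through a 64-bit double in memory
    int direct = (int)std::pow(i, j);         // may be truncated from a wider intermediate
    int viamem = (int)stored;                 // truncated after rounding to double
    std::printf("%d %d\n", direct, viamem);   // "24 25" on some setups, "25 25" on others
    return 0;
}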
Another possibility is that your compiler recognizes the constants passed to pow and does the calculation itself, substituting the result for the call to pow. Your compiler may use an internal arbitrary-precision math library or it may just use one that's different.
This is caused by a combination of two problems:
The implementation of pow you are using is not high quality. Floating-point arithmetic is necessarily approximate in many cases, but good implementations take care to ensure that simple cases such as pow(5, 2) return exact results. The pow you are using is returning a result that is less than 25 by an amount greater than 0 but less than or equal to 2^-49. For example, it might be returning 25 - 2^-50.
The C implementation you are using sometimes uses a 64-bit floating-point format and sometimes uses an 80-bit floating-point format. As long as the number is kept in the 80-bit format, it retains the complete value that pow returned. If you convert this value to an integer, it produces 24, because the value is less than 25 and conversion to integer truncates; it does not round. When the number is converted to the 64-bit format, it is rounded. Converting between floating-point formats rounds, so the result is rounded to the nearest representable value, 25. After that, conversion to integer produces 25.
The compiler may switch formats whenever it is “convenient” in some sense. For example, there are a limited number of registers with the 80-bit format. When they are full, the compiler may convert some values to the 64-bit format and store them in memory. The compiler may also rearrange expressions or perform parts of them at compile-time instead of run-time, and these can affect the arithmetic performed and the format used.
It is troublesome when a C implementation mixes floating-point formats, because users generally cannot predict or control when the conversions between formats occur. This leads to results that are not easily reproducible and interferes with deriving or controlling numerical properties of software. C implementations can be designed to use a single format throughout and avoid some of these problems, but your C implementation is apparently not so designed.
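A small self-contained illustration of the truncation-versus-rounding point made above (the 1e-13 error is made up here to stand in for pow's tiny inaccuracy):
#include <cstdio>

int main()
{
    double almost25 = 25.0 - 1e-13;        // a hair below 25, like a low-quality pow result
    std::printf("%d\n", (int)almost25);    // 24: conversion to int truncates toward zero
    std::printf("%.0f\n", almost25);       // 25: formatting rounds to nearest
    return 0;
}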
To add to the other answers here: just generally be very careful when working with floating point values.
I highly recommend reading this paper (even though it is a long read):
http://hal.archives-ouvertes.fr/docs/00/28/14/29/PDF/floating-point-article.pdf
Skip to section 3 for practical examples, but don't neglect the previous chapters!
I'm fairly sure this can be explained by "intermediate rounding" and the fact that pow is not simply looping j times multiplying by i, but calculating the result as exp(log(i)*j) in floating point. Intermediate rounding may well convert 24.999999999996 into 25.000000000 - even simply storing and reloading the value may cause differences in this sort of behaviour, so depending on how the code is generated, it may make a difference to the exact result.
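You can look at the exp(log(i)*j) idea in isolation with a sketch like the following; the exact value it prints (just above or just below 25) depends on your math library, so this is only an illustration of the mechanism:
#include <cmath>
#include <cstdio>

int main()
{
    double via_explog = std::exp(std::log(5.0) * 2.0);   // one way pow(5, 2) may be computed internally
    std::printf("%.17g -> %d\n", via_explog, (int)via_explog);
    return 0;
}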
And of course, in some cases, the compiler may even "know" what pow actually achieves, and replace the calculation with a constant result.

About precision in C++ calculations

I have a prototype function that performs some calculations (integrals of the gamma function) in C++, and I need to convert it to C. The author used float variables with the suffix f in every calculation, like these statements:
float a1=.083333333f;
float vv=dif*i/1.414214f;
The program makes use of truncated series on many lines by multiplying some of those variables.
My question is: wouldn't I get more precision if I used double-precision variables? Why would the suffix be necessary in that case?
Thanks in advance!
You would get more precision with double precision, and you don't need any special suffix in C/C++. So, your code could look like
double a1=.083333333;
double vv=dif*i/1.414214;
Also, you are free to use more accurate floating-point literals if you want... so add more "3"s and expand "1.414214" to your heart's content. Bear in mind, however, that not even doubles are perfectly accurate.
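For a quick feel for the difference, here is a small sketch comparing the float constant from the question with a longer double constant (the extra digits are sqrt(2) expanded further, which is an assumption about what the original value represents):
#include <cstdio>

int main()
{
    float  vf = 1.414214f;            // ~7 significant decimal digits survive
    double vd = 1.4142135623730951;   // ~16 significant decimal digits survive
    std::printf("float : %.17g\n", vf);
    std::printf("double: %.17g\n", vd);
    return 0;
}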
float, in modern systems, is a 32 bit floating point type.
double, in modern systems, is a 64 bit floating point type.
The precision of a float is much less than that of a 64-bit floating-point type, but floats are still very useful for speed and because they occupy fewer bytes of RAM.
On 64-bit systems the difference between float and double is a lot less noticeable, but in the past it could be substantial.
You always have to find a trade-off between performance and precision; the choice between double and float is exactly that: do you want high precision, or lower precision but better performance?
In 3D games we usually use float; in numerical applications we usually use double.
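A quick sketch of that precision gap in practice: summing 0.1 ten million times drifts visibly in float but stays close in double (the exact totals depend on your platform):
#include <cstdio>

int main()
{
    float  sf = 0.0f;
    double sd = 0.0;
    for (int i = 0; i < 10000000; ++i) {
        sf += 0.1f;   // error accumulates quickly in 24-bit precision
        sd += 0.1;    // error accumulates far more slowly in 53-bit precision
    }
    std::printf("float : %f\ndouble: %f\n", sf, sd);   // both would be 1000000 in exact arithmetic
    return 0;
}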
See if the precision of double suits you; if not, I would suggest searching for a C++ library for numbers with arbitrary precision, though, to continue the point about performance, such objects perform much worse than native double or float.

Preventing Rounding Errors

I was just reading about rounding errors in C++. So, if I'm writing a math-intensive program (or doing any important calculations), should I just drop floats altogether and use only doubles, or is there an easier way to prevent rounding errors?
Obligatory lecture: What Every Programmer Should Know About Floating-Point Arithmetic.
Also, try reading IEEE Floating Point standard.
You'll always get rounding errors unless you use an arbitrary-precision library such as GMP. You have to decide whether your application really needs that kind of effort.
Or you could use integer arithmetic, converting to floating point only when needed. This is still hard to do; you have to decide whether it's worth it.
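A minimal sketch of the integer-arithmetic idea, using money in whole cents as the classic example (the amounts are made up):
#include <cstdio>

int main()
{
    long long cents = 0;   // fixed-point: the unit is one cent
    cents += 1999;         // $19.99
    cents += 250;          // $2.50
    cents *= 3;            // exact: integer math has no rounding drift
    std::printf("total: %lld.%02lld\n", cents / 100, cents % 100);
    return 0;
}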
Lastly, you can use float or double, taking care not to make assumptions about values at the limit of the representation's precision. I wish this Valgrind plugin were implemented (grep for float)...
The rounding errors are normally very insignificant, even using floats. Mathematically-intense programs like games, which do very large numbers of floating-point computations, often still use single-precision.
This might work if your highest number is less than 10 billion and you're using C++ double precision.
if (ceil(10000 * (x + 0.00001)) > ceil(10000 * (x - 0.00001))) {
    x = ceil(10000 * (x + 0.00004)) / 10000;
}
This should allow at least the last digit to be off by +/- 9. I'm assuming dividing by 10000 will always just move the decimal place. If not, then maybe it could be done in binary.
You would have to apply it after every operation that is not +, -, *, or a comparison. For example, you can't do two divisions in the same formula, because you'd have to apply it to each division.
If that doesn't work, you could work in integers by scaling the numbers up and always using integer division. If you need advanced functions, maybe there is a package that does deterministic integer math. Integer division is required in a lot of financial settings because round-off error is subject to exploits like the one in the movie "Office Space".

C++: Store large numbers in a float like PHP?

In PHP, if you go above INT_MAX the value is cast to a float, allowing very large (non-decimal) numbers to be formed. Is this possible in C++, or is the way it stores floating-point/double-precision numbers different?
The reason is that I want to benchmark large factorials, but something such as 80! is far too large for an unsigned integer.
The language will not make the switch for you, but it has the data types float and double, which are usually 32-bit and 64-bit IEEE floating-point types, respectively.
A 64 bit double has enough range for 80!, but doesn't have enough precision to represent it exactly. The language doesn't have anything built in that can do that: you would need to use a big integer library, for example GMP.
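As a sketch of the big-integer route, here is 80! computed exactly with GMP's C++ interface (gmpxx); the link flags -lgmpxx -lgmp are an assumption about your setup:
#include <gmpxx.h>
#include <iostream>

int main()
{
    mpz_class factorial = 1;          // arbitrary-precision integer
    for (int i = 2; i <= 80; ++i)
        factorial *= i;               // stays exact no matter how large it gets
    std::cout << "80! = " << factorial << std::endl;
    return 0;
}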
Try using the GMP library, or one of the several other big-integer libraries available for C++. You may also use string manipulation to calculate large factorials.
C++ doesn't have that kind of "automatic casting" facility, though you could build a class that mimics such behavior by having int and float private fields (a double would be even better; IIRC it lets you get up to about 170!) and some operator-overloading black magic.
Anyhow, going from integers to floating point you lose precision, so even if you can reach higher numbers, you aren't going to represent them exactly. Actually, if you're moving into floating-point territory with factorials, you could usually just use Stirling's approximation (but I understand that it does not apply in this case, since this is a benchmark).
If you want to get arbitrarily big factorials without losing precision, the usual solution is to use a bigint library; you can find several of them easily with Google.
Use one of the bigint libraries, which allow you to create arbitrary-precision integers at a cost in performance. Otherwise you would have to write your own class to emulate PHP's hybrid float/int behaviour.
Something like this:
class IntFloat {
    union {
        float fval;
        int   ival;
    } val;
    bool floatUsed;   // records which member of the union is currently valid
public:
    void setVal(float v)
    {
        val.fval = v;
        floatUsed = true;
    }
    void setVal(int v)
    {
        val.ival = v;
        floatUsed = false;
    }
    // additional code for getters, operators, etc.
};
However what PHP does isn't worthy of imitation.
You can find a list of bigint libraries on Wikipedia.
PS:
"or are the way they store floating point/double precision numbers different?"
Yes, it is different. C++ stores them directly in the target machine's format, while PHP uses an intermediate representation (bytecode, or in PHP's case, opcodes). Thus PHP converts numbers to the machine format under the hood.
You can use __float128 (a quadruple-precision type offered as an extension by some compilers such as GCC) if that precision is enough and your compiler supports it.

C++: how to truncate a double in an efficient way?

I would like to truncate a double to 4 decimal digits.
Is there an efficient way to do that?
My current solution is:
#include <cmath>   // for pow

double roundDBL(double d, unsigned int p = 4)
{
    unsigned int fac = pow(10, p);
    double facinv = 1.0 / static_cast<double>(fac);
    double x = static_cast<unsigned int>(d * fac) * facinv;
    return x;
}
but using pow and the division seems not very efficient to me.
kind regards
Arman.
round(d * 10000.0) / 10000.0;
or, if p must be variable:
double f = pow(10, p);
round(d * f) / f;
round will usually be compiled as a single instruction that is faster than converting to an integer and back. Profile to verify.
Note that a double may not have an accurate representation to 4 decimal places. You will not truly be able to truncate an arbitrary double, just find the nearest approximation.
Efficiency depends on your platform.
Whatever methods you try, you should profile to make sure:
the efficiency is required (and a straightforward implementation is not fast enough for you)
the method you're trying is faster than others for your application on real data
You could multiply by 10000, truncate as an integer, and divide again; converting between double and int might be faster or slower for you (a sketch of this follows below).
You could limit the digits on output, e.g. with a printf format string of "%.4f" (note that this rounds rather than truncates).
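Putting two of those suggestions into code, as a minimal sketch (the helper name truncate4 and the input value are made up for illustration):
#include <cmath>
#include <cstdio>

double truncate4(double d)
{
    return std::trunc(d * 10000.0) / 10000.0;   // multiply, drop the fraction, divide back
}

int main()
{
    double x = 3.14159265;
    std::printf("%f\n", truncate4(x));   // 3.141500 (truncated)
    std::printf("%.4f\n", x);            // 3.1416 (formatting rounds instead)
    return 0;
}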
You could replace pow with a more efficient integer-based variant instead. There's one here on Stack Overflow: The most efficient way to implement an integer based power function pow(int, int)
Also, if you can accept some inaccuracy, replace the divide with a multiply. Divisions are one of the slowest common math operations.
Other than that, I'll echo what others have said and simply truncate on output, unless you actually need to use the truncated double in calculations.
If you need to perform exact calculations that involve decimal digits, then stop using double right now! It's not the right data type for your purpose. You will not get exactly rounded decimal values. Almost all values will (after truncation, no matter what method you use) in fact be something like 1.000999999999999841, not 1.0001.
That's because double is implemented using binary fractions, not decimal ones. There are decimal types you can use instead that will work correctly. They will be a lot slower, but then, if the result does not need to be correct, I know a method to make it infinitely fast...
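A short sketch of what the paragraph above means in practice (the input value is arbitrary):
#include <cmath>
#include <cstdio>

int main()
{
    double x = 1.000149999;
    double t = std::trunc(x * 10000.0) / 10000.0;   // intended result: 1.0001
    std::printf("%.20f\n", t);   // prints the nearest binary fraction, not exactly 1.0001
    return 0;
}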