I'm trying to write a function that calculates the regression line from a data sheet with the least squares method, but I've run into some serious problems with my code.
My first issue is that I don't know why my "linear regression" function is rounding the result of the iterations, even when I try to use other "bigger" types.
My second issue is that the last part of my code gives me the wrong results for the y intercept (b) and the slope (a). I think it could be a conversion problem, but I'm not really sure. If that is the case, what should I do to avoid it?
void RegLin (const vector<double>& valuesX, const vector<double>& valuesY, vector<double>& PenOrd) {
    unsigned int N = valuesX.size();

    long double SomXi{0};
    for (unsigned i = 0; i < N; ++i) {
        SomXi += valuesX.at(i);
    }

    long double SomXiXi{0};
    for (unsigned i = 0; i < N; ++i) { // Here is a problem (number rounded). Expected value: 937352,25 / Given value: 937352
        SomXiXi += (valuesX.at(i)) * (valuesX.at(i));
    }

    long double SomYi{0};
    for (unsigned i = 0; i < N; ++i) {
        SomYi += valuesY.at(i);
    }

    long double SomXiYi{0};
    for (unsigned i = 0; i < N; ++i) { // Here is the same problem. Expected value: 334107,41 / Given value: 334107
        SomXiYi += (valuesX.at(i)) * (valuesY.at(i));
    }

    long double a = (SomYi*SomXiXi - SomXi*SomXiYi) / (N*SomXiXi - pow(SomXi, 2)); // Bad result
    long double b = (N*SomXiYi - SomYi*SomXi) / (N*SomXiXi - pow(SomXi, 2));       // Bad result

    PenOrd.push_back(a);
    PenOrd.push_back(b);
    return;
}
Thank you in advance for your support
P.S: I'm using g++ and the 2011 C++ standard.
There are several points to make about your effort. I'm a theoretical physics and numerical maths guy, so let me share some best practices with you.
First, I have never encountered the need for long double. Stick with double; if that doesn't suffice, you should consider working with log-log plots to further analyse your data.
Second, you are using unsigned int instead of int. You should never do regression work with so many values (i.e., value pairs) that int, or better std::size_t, isn't sufficient for your integer counters. Using too many values can degrade accuracy because of accumulating numerical roundoff, so do not use more than, say, 10,000 to 1,000,000 values unless you have a very good reason to do so.
Third, it quickly becomes necessary not to bluntly add up your squares (e.g., for SomXiXi and so forth), but to sort the contributions to the sum before actually summing them up. You should sum them starting with the tiniest values and proceeding to the ever-growing contributions; that is the only way to stay on top of accumulating roundoff issues.
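To illustrate, a minimal sketch of summing the squares smallest-first (the function name and the use of std::sort are just illustrative choices, not a prescription):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Accumulate x_i * x_i starting from the smallest contributions, so that
// small terms are not swallowed by an already-large running sum.
double sumOfSquaresSorted(const std::vector<double>& x) {
    std::vector<double> terms(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        terms[i] = x[i] * x[i];
    std::sort(terms.begin(), terms.end(),
              [](double a, double b) { return std::fabs(a) < std::fabs(b); });
    double sum = 0.0;
    for (double t : terms)
        sum += t;
    return sum;
}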
Fourth, control of results. A good sign of the reliability of your results is achievable if you do the work twice: once the way you did (i.e., using the multiplied-out x_i*y_i - xbar*y_i - x_i*ybar + xbar*ybar form) and then, as a second approach, using the still un-multiplied (x_i - xbar)*(y_i - ybar) form. A good-quality calculation yields very comparable results with either formula.
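As a rough sketch of such a cross-check (the function names are made up, and this is only one way to organise it), the regression slope can be computed from the raw sums and from the centered form and the two results compared:

#include <cstddef>
#include <vector>

// Slope from the expanded (raw-sum) formula.
double slopeExpanded(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}

// Slope from the centered formula sum((x_i - xbar)(y_i - ybar)) / sum((x_i - xbar)^2).
double slopeCentered(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double xbar = 0, ybar = 0;
    for (std::size_t i = 0; i < n; ++i) { xbar += x[i]; ybar += y[i]; }
    xbar /= n;
    ybar /= n;
    double num = 0, den = 0;
    for (std::size_t i = 0; i < n; ++i) {
        num += (x[i] - xbar) * (y[i] - ybar);
        den += (x[i] - xbar) * (x[i] - xbar);
    }
    return num / den;
}

If the two disagree noticeably, that is a strong hint that roundoff is eating your result.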
So, maybe that was quite a detour into numerical regression work; I hope it helps a little.
Regards, Micha
The first rule of numerical calculations with floating point says: "Work only with values of the same order".
Floating point mathematics works pretty simply in the case of, e.g., float addition:
1e6 + 1e-6 = 1000000 + 0.000001 = 1000000.000001 -> 1000000 = 1e6
The trailing 0.000001 falls beyond the precision limit of a float (roughly 7 significant decimal digits), so, as you see, the result is "rounded" back to 1e6.
It's entirely possible that the error you see comes from one of two possibilities:
1) long double == double on your compiler, so the extra precision you expect simply isn't there and you get wrong results;
2) floating point arithmetic does not represent values with 100% accuracy, and therefore 0.10 != 0.10 once written as float/double.
Depending on what kind of calculations you are doing, I would advise you to either scale your values up by a few powers of ten, or keep the data as float and store the computed values in double.
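A minimal demonstration of the addition example above; storing the sum back into a float forces the rounding, so the tiny term simply disappears:

#include <cstdio>

int main() {
    float big  = 1e6f;
    float tiny = 1e-6f;
    float sum  = big + tiny;   // rounded to float precision (~7 significant digits)
    std::printf("%.6f\n", sum);                           // prints 1000000.000000
    std::printf("%s\n", sum == big ? "equal" : "differ"); // prints "equal"
    return 0;
}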
Doing one of my first homeworks at uni, I have run into this problem:
Task: Find the sum of all numbers with n digits (for n=1 that means 1, 2, 3, ..., 8, 9, for example, and the answer is 45).
Problem: The code I wrote gets all the test answers correct up to 10 to the power of 9, but when it reaches 10 to the power of 10 territory the answers start being wrong. They are really close to what I should be getting, but not quite there (for example, my output = 49499999995499995136, expected result = 49499999995500000000).
Would really appreciate some help/insights; I'm guessing it's something to do with the variable types, but I'm not quite sure of a possible solution.
#include <iostream>
#include <cmath>
#include <iomanip>
using namespace std;

int main()
{
    int n;
    double ats = 0, maxi, mini;

    cin >> n;
    maxi = pow(10, n) - 1;
    mini = pow(10, n - 1) - 1;
    ats = (maxi * (maxi + 1)) / 2 - (mini * (mini + 1)) / 2;

    cout << setprecision(0) << fixed << ats;
}
The main source of the problem is the pow() function. It works with double, not int, and loss of accuracy is the price of representing huge numbers.
There are 3 ways to solve the problem:
For small n you can write your own long long int pow(int x, int pow) function (a sketch of this idea appears further below). The problem is that even long long int can overflow.
Use long (arbitrary-precision) arithmetic, as @rustyx said. You can write your own on top of a vector, or find and include a library.
There is a math solution specific to this topic's task. It sidesteps the big-numbers problem entirely.
You can write your formula like
((10^n - 1) * 10^n - (10^m - 1) * 10^m) / 2 , (here m = n-1)
Then multiply out the numbers in the numerator, regroup them, and extract the common factor 10^(n-1). You can then see that the answer has the structure:
X9...9Y0...0 for big enough n, where X and Y are constants.
So you can just print the answer as a string, without calculating the big number at all.
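For what it's worth, here is a sketch in the spirit of the first option, using GCC/Clang's non-standard unsigned __int128 so the exact integer formula from the question works up to roughly n = 19 (beyond that even 128 bits overflow and you need one of the other two options):

#include <iostream>
#include <string>

// Integer power of ten, no doubles involved.
unsigned __int128 ipow10(int e) {
    unsigned __int128 r = 1;
    while (e-- > 0) r *= 10;
    return r;
}

// iostreams cannot print __int128 directly, so convert by hand.
std::string to_string_u128(unsigned __int128 v) {
    if (v == 0) return "0";
    std::string s;
    while (v > 0) {
        s.insert(s.begin(), char('0' + int(v % 10)));
        v /= 10;
    }
    return s;
}

int main() {
    int n;
    std::cin >> n;
    unsigned __int128 maxi = ipow10(n) - 1;       // largest n-digit number
    unsigned __int128 mini = ipow10(n - 1) - 1;   // largest (n-1)-digit number
    unsigned __int128 ats = maxi * (maxi + 1) / 2 - mini * (mini + 1) / 2;
    std::cout << to_string_u128(ats) << '\n';
    return 0;
}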
I think you're stretching floating points beyond their precision. Let me explain:
The C pow() function takes doubles as arguments. You're passing ints, so the compiler adds the code to convert them to doubles before they reach pow(). (And in any case you're storing the result as a double when you take the return value, since you declared it that way.)
Floating point numbers are called that way precisely because the point "floats". Inside a double there's a sign bit, some bits for the mantissa and some bits for the exponent. In binary, multiplying by a power of two is equivalent to moving the fractional point to the right (or to the left when multiplying by a negative power), so the exponent basically says where the fractional point is. The great advantage of this in-memory representation of doubles is that you get a lot of absolute precision for numbers close to 0, and gradually lose absolute precision as numbers become bigger.
That last thing is exactly what's happening to you. Your number is too large to be stored exactly, so it's being rounded to the closest representable value, i.e., the closest sum of a limited number of powers of two.
Quick experiment: press F12 in your browser, open the javascript console and type 49499999995499995136. In my case, in chrome, I reproduce the same problem.
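The same experiment in C++ (just a sketch): the expected answer, written as a double literal, cannot be stored exactly, so printing it back shows a nearby representable value rather than the exact integer:

#include <cstdio>

int main() {
    // The "expected result" from the question, forced through a double.
    double expected = 49499999995500000000.0;
    // Prints the value the double actually holds: a nearby representable
    // number, not ...5500000000 exactly.
    std::printf("%.0f\n", expected);
    return 0;
}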
If you really, really want precision with such big numbers then you can try one of the arbitrary-precision libraries, but that's too advanced for a student program; you don't need it. Just add an if block and print an error message if the number the user typed is too big (professors love that, and it's actually quite correct).
Here is my code:
#include <iostream>
#include <cmath>
using namespace std;

int factorial(int);

int main()
{
    for (int k = 0; k < 100000; k++)
    {
        static double sum = 0.0;
        double term;
        term = (double)pow(-1.0, k) * (double)pow(4.0, 2*k+1) / factorial(2*k+1);
        sum = sum + term;
        cout << sum << '\n';
    }
}

int factorial(int n)
{
    if (n == 0)
    {
        return 1;
    }
    return n * factorial(n-1);
}
I'm just trying to calculate the value of sine(4) using the Maclaurin expansion of sine. For each console output, the value reads 'nan'. The console gives an error and shuts down after about 10 seconds. I don't get any errors in the IDE.
There're multiple problems with your approach.
Your factorial function can't return an int. The return value will be way too big, very quickly.
Using pow(-1, value) to get an alternating +1/-1 is very inefficient and will yield an incorrect value pretty quickly. You should just pick 1.0 or -1.0 depending on k's parity.
When you sum a long series of terms, you want to sum the terms with the smallest magnitude first. Otherwise you lose precision, because once the running sum is large, the low-order bits of a small term fall outside the range the sum can still represent. In your case the power of four is eventually dominated by the factorial, so you sum the highest-magnitude values first; you'd probably get better precision starting from the other end.
Algorithmically, if you're going to raise 4 to the 2k+1 power and then divide by (2k+1)!, you should keep both lists of factors (4, 4, 4, 4, ...) and (2, 3, 4, 5, 6, 7, 8, 9, ...) and cancel between the two sides; there are plenty of fours to remove from the numerator and the denominator at the same time (a sketch of this is below).
Even with those four fixes, I'm not sure you can get anywhere close to the 100000-term target you set without specialized code.
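As a sketch of the cancellation idea: each term of the series can be obtained from the previous one by multiplying by -x*x / ((2k+2)*(2k+3)), which removes pow and factorial entirely and keeps every intermediate value small. (The iteration count of 30 is an arbitrary choice; for x = 4 the terms are already negligible well before that.)

#include <cstdio>

int main() {
    const double x = 4.0;
    double term = x;      // k = 0 term: x^1 / 1!
    double sum  = term;
    for (int k = 0; k < 30; ++k) {
        // next term = previous term * (-x^2) / ((2k+2)(2k+3))
        term *= -x * x / ((2.0 * k + 2.0) * (2.0 * k + 3.0));
        sum  += term;
    }
    std::printf("%.15f\n", sum);   // close to sin(4) = -0.7568...
    return 0;
}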
As already stated by others, the intermediate results you get for large k are magnitudes too large to fit into a double. From a certain k on, pow will return infinity, and factorial (an int) will have overflowed long before that. This is simply what happens for very large values, and when you then divide one broken value by another you get NaN.
One common trick to deal with too large numbers is using logarithms for intermediate results and only in the end apply the exponential function once.
Some mathematical knowledge of logarithms is required here. To understand what I am doing here you need to know exp(log(x)) == x, log(a^b) == b*log(a), and log(a/b) == log(a) - log(b).
In your case you can rewrite
pow(4, 2*k+1)
to
exp((2*k+1)*log(4))
Then there is still the factorial. The lgamma function can help, because factorial(n) == gamma(n+1) and therefore log(factorial(n)) == lgamma(n+1). In short, lgamma gives you the log of a factorial without huge intermediate results.
So summing up, replace
pow(4, 2*k+1) / factorial(2*k+1)
With
exp((2*k+1)*log(4) - lgamma(2*k+2))
This should get rid of your NaNs. It should also improve performance, as lgamma runs in O(1) whereas your factorial is in O(k).
Note, however, that I still have very little confidence that your result will be numerically accurate.
A double still has a limited precision of roughly 16 decimal digits. Your 100000 iterations are very likely worthless, probably even harmful.
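Putting the pieces together, a minimal sketch of the loop with the log-space replacement (keeping a modest iteration count, per the caveat above; std::lgamma is available in <cmath> since C++11):

#include <cmath>
#include <cstdio>

int main() {
    const double x = 4.0;
    double sum = 0.0;
    for (int k = 0; k < 30; ++k) {
        double sign = (k % 2 == 0) ? 1.0 : -1.0;
        // x^(2k+1) / (2k+1)!  evaluated in log space, exponentiated only once
        double term = sign * std::exp((2 * k + 1) * std::log(x) - std::lgamma(2 * k + 2));
        sum += term;
    }
    std::printf("%.15f\n", sum);   // should land close to sin(4)
    return 0;
}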
unsigned int updateStandardStopping(unsigned int numInliers, unsigned int totPoints, unsigned int sampleSize)
{
    double max_hypotheses_ = 85000;
    double n_inliers = 1.0;
    double n_pts = 1.0;
    double conf_threshold_ = 0.95;

    for (unsigned int i = 0; i < sampleSize; ++i)
    {
        n_inliers *= numInliers - i; // n_inliers = I(I-1)...(I-m+1)
        n_pts *= totPoints - i;      // n_pts     = N(N-1)(N-2)...(N-m+1)
    }

    double prob_good_model = n_inliers / n_pts;

    if ( prob_good_model < std::numeric_limits<double>::epsilon() )
    {
        return max_hypotheses_;
    }
    else if ( 1 - prob_good_model < std::numeric_limits<double>::epsilon() )
    {
        return 1;
    }
    else
    {
        double nusample_s = log(1 - conf_threshold_) / log(1 - prob_good_model);
        return (unsigned int) ceil(nusample_s);
    }
}
Here is a selection statement:
if ( prob_good_model < std::numeric_limits<double>::epsilon() )
{...}
To my understanding, that condition is the same as (or an approximation to)
prob_good_model < 0
So: am I right about that? And where else can std::numeric_limits<double>::epsilon() be used, besides situations like this one?
The point of epsilon is to make it (fairly) easy for you to figure out the smallest difference you could see between two numbers.
You don't normally use it exactly as-is though. You need to scale it based on the magnitudes of the numbers you're comparing. If you have two numbers around 1e-100, then you'd use something on the order of: std::numeric_limits<double>::epsilon() * 1.0e-100 as your standard of comparison. Likewise, if your numbers are around 1e+100, your standard would be std::numeric_limits<double>::epsilon() * 1e+100.
If you try to use it without scaling, you can get drastically incorrect (utterly meaningless) results. For example:
if (std::abs(1e-100 - 1e-200) < std::numeric_limits<double>::epsilon())
Yup, that's going to show up as "true" (i.e., saying the two are equal) even though they differ by 100 orders of magnitude. In the other direction, if the numbers are much larger than 1, comparing to an (unscaled) epsilon is equivalent to saying if (x != y)--it leaves no room for rounding errors at all.
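A rough sketch of that scaling (the helper name and the tolerance factor of 4 are illustrative choices, not a standard recipe):

#include <algorithm>
#include <cmath>
#include <limits>

// Treat a and b as equal when they differ by no more than a few epsilons
// *at their own magnitude*, rather than by the unscaled epsilon.
bool nearlyEqual(double a, double b, double tolFactor = 4.0) {
    const double diff  = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::numeric_limits<double>::epsilon() * scale * tolFactor;
}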
At least in my experience, the epsilon specified for the floating point type isn't often of a great deal of use though. With proper scaling, it tells you the smallest difference there could possibly be between two numbers of a given magnitude (for a particular floating point implementation).
In real use, however, that's of relatively little real use. A more realistic number will typically be based on the precision of the inputs, and an estimate of the amount of precision you're likely to have lost due to rounding (and such).
For example, let's assume you started with values measured to a precision of 1 part per million, and you did only a few calculations, so you believe you may have lost as much as 2 digits of precision due to rounding errors. In this case, the "epsilon" you care about is roughly 1e-4, scaled to the magnitude of the numbers you're dealing with. That is to say, under those circumstances, you can expect on the order of 4 digits of precision to be meaningful, so if you see a difference in the first four digits, it probably means the values aren't equal, but if they differ only in the fifth (or later) digits, you should probably treat them as equal.
The fact that the floating point type you're using can represent (for example) 16 digits of precision doesn't mean that every measurement you use is going to be nearly that precise; in fact, it's relatively rare that anything based on physical measurements has any hope of being even close to that precise. It does, however, give you a limit on what you can hope for from a calculation: even if you start with a value that's precise to, say, 30 digits, the most you can hope for after the calculation is going to be defined by std::numeric_limits<T>::epsilon.
It can also be used in situations where a function is undefined at a point, but you still need a value there. You lose a bit of accuracy, especially in extreme cases, but sometimes that's alright.
For example, say you're using 1/x somewhere, but your range of x is [0, n[. You can use 1/(x + std::numeric_limits<double>::epsilon()) instead, so that x = 0 is still defined. That said, you have to be careful with how the value is used; it might not work for every case.
I am using C++ to implement transfinite interpolation algorithm (http://en.wikipedia.org/wiki/Transfinite_interpolation). Everything looks good until when I was trying to test some small numbers, the results look weird and incorrect. I guess it must have something to do with loss of precision. The code is
for (int j = 0; j < Ny+1; ++j)
{
    for (int i = 0; i < Nx+1; ++i)
    {
        int imax = Nx;
        int jmax = Ny;

        double CA1 = (double)i/imax;  // s
        double CB1 = (double)j/jmax;  // t
        double CA2 = 1.0-CA1;         // 1-s
        double CB2 = 1.0-CB1;         // 1-t

        point U = BD41[j]*CA2 + BD23[j]*CA1;
        point V = BD12[i]*CB2 + BD34[i]*CB1;
        point UV =
              (BD12[0]*CB2    + BD41[jmax]*CB1)*CA2
            + (BD12[imax]*CB2 + BD23[jmax]*CB1)*CA1;

        tfiGrid[k][j][i] = U + V - UV;
    }
}
I guess that when BD12[i] (or BD34[i] or BD41[j] or BD23[j]) is very small, the rounding error or something like it accumulates and becomes non-negligible. Any ideas how to handle this sort of situation?
PS: Even though similar questions have been asked millions of times, I still cannot figure out whether it is related to my multiplication, division, subtraction or something else.
In addition to the points that Antoine made (all very good):
it's probably worth remembering that adding two values with very different orders of magnitude will cause a very large loss of precision. For example, if CA1 is less than about 1E-16, 1.0 - CA1 is probably still 1.0, and even if it is just a little larger, you'll lose quite a few digits of precision.
If this is the problem, you should be able to isolate it just by putting a few print statements in the inner loop and looking at the values you are adding (or perhaps even with a debugger).
What to do about it is another question. There may be some numerically stable algorithms for what you are trying to do; I don't know. Otherwise, you'll probably have to detect the problem dynamically and rewrite the equations to avoid it if it occurs. (For example, to detect whether CA1 is too small for the addition, you might check whether 1.0 / CA1 is more than a couple of thousand, or million, or however much precision you can afford to lose.)
The accuracy of the arithmetic built into C/C++ is limited. Of course errors will accumulate in your case.
Have you considered using a library that provides higher precision? Maybe have a look at https://gmplib.org/
A short example that clarifies the higher accuracy:
double d, e, f;
d = 1000000000000000000000000.0;
e = d + 1.0;
f = e - d;
printf( "the difference of %f and %f is %f\n", e, d, f);
This will not print 1 but 0. With gmplib the code would look like:
#include "gmp.h"
mpz_t m, n, r;
mpz_init( m);
mpz_init( n);
mpz_init( r);
mpz_set_str( m, "1 000 000 000 000 000 000 000 000", 10);
mpz_add_ui( n, m, 1);
mpz_sub( r, n, m);
printf( "the difference of %s and %s is %s (using gmp)\n",
mpz_get_str( NULL, 10, n),
mpz_get_str( NULL, 10, m),
mpz_get_str( NULL, 10, r));
mpz_clear( m);
mpz_clear( n);
mpz_clear( r);
This will print 1.
Your algorithm doesn't appear to accumulate errors by re-using previous computations at each iteration, so it's difficult to answer your question without looking at your data.
Basically, you have 3 options:
Increase the precision of your numbers: x86 CPUs can handle float (32-bit), double (64-bit) and often long double (80-bit). Beyond that you have to use "soft" floating point, where all operations are implemented in software instead of hardware. There is a good C lib that does just that: MPFR, based on GMP (GNU recommends using MPFR). I strongly recommend using easier-to-use C++ wrappers such as Boost.Multiprecision (a tiny example is sketched below, after this list). Expect your computations to be orders of magnitude slower.
Analyze where your precision loss comes from by using something more informative than a single scalar number for your computations. Have a look at interval arithmetic and the MPFI lib, based on MPFR. CADNA is another, very promising solution, based on randomly changing the rounding modes of the hardware, and it comes with a low run-time cost.
Perform static analysis, which doesn't even require running your code and works by analyzing your algorithms. Unfortunately, I don't have experience with such tools, so I can't recommend anything beyond googling.
I think the best course is to run static or dynamic analysis while developing your algorithm, identify your precision problems, and address them by either changing the algorithm or using higher precision for the most unstable variables only (and not the others, to avoid too much performance impact at run time).
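As a tiny illustration of the first option (assuming Boost is available), Boost.Multiprecision's cpp_dec_float_50 type carries roughly 50 decimal digits and handles the kind of 1e24 + 1 example shown in the GMP answer above, where plain double cannot:

#include <boost/multiprecision/cpp_dec_float.hpp>
#include <iostream>

int main() {
    using boost::multiprecision::cpp_dec_float_50;    // ~50 decimal digits
    cpp_dec_float_50 d("1000000000000000000000000");  // 1e24, exactly
    cpp_dec_float_50 e = d + 1;
    std::cout << e - d << '\n';                       // prints 1, unlike plain double
    return 0;
}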
Numerical (i.e., floating point) computation is hard to do precisely. You have to be particularly wary of subtractions; it is mostly there that you lose precision. In this case the 1.0 - CA1 and similar expressions are suspect (if CA1 is very small, you'll just get 1). Reorganize your expressions; the Wikipedia article (a stub!) probably has them written for understanding (showing symmetries, ...) and aesthetics, not numerical robustness.
Search for courses/lecture notes on numerical computation; they should include an introductory chapter on this. And check out Goldberg's classic What Every Computer Scientist Should Know About Floating-Point Arithmetic.
I'm trying to optimize the following. The code below does this:
If a = 0.775 and I need precision 2 dp, then a => 0.78.
Basically, if the last digit is a 5, it rounds the next digit upwards, otherwise it doesn't.
My problem was that 0.45 doesn't round to 0.5 with 1 decimal point, as the value is saved as 0.44999999343... and setprecision rounds it to 0.4.
That's why setprecision is forced to be higher, setprecision(p+10), and then, if the value really ends in a 5, a small amount is added in order to round up correctly.
Once done, it compares a with the string b and returns the result. The problem is that this function is called a few billion times, making the program crawl. Any better ideas on how to rewrite/optimize this, and which functions in the code are so heavy on the machine?
bool match(double a, string b, int p) { // p = precision, no greater than 7 dp
    double t[] = {0.2, 0.02, 0.002, 0.0002, 0.00002, 0.000002, 0.0000002, 0.00000002};

    stringstream buff;
    string temp;
    buff << setprecision(p+10) << setiosflags(ios_base::fixed) << a; // 10 extra digits of precision
    buff >> temp;
    if (temp[temp.size()-10] == '5') a += t[p]; // help to round upwards

    ostringstream test;
    test << setprecision(p) << setiosflags(ios_base::fixed) << a;
    temp = test.str();

    if (b.compare(temp) == 0) return true;
    return false;
}
I wrote an integer square root subroutine with nothing more than a couple dozen lines of ASM, with no API calls whatsoever, and it could still only do about 50 million square roots per second (this was about five years ago...).
The point I'm making is that if you're going for billions of calls, even today's technology is going to choke.
But if you really want to make an effort to speed it up, remove as many API usages as humanly possible. This may require you to perform API tasks manually instead of letting the libraries do it for you. Specifically, remove any type of stream operation; those are slower than dirt in this context. You may really have to improvise there.
The only thing left to do after that is to replace as many lines of C++ as you can with custom ASM - but you'll have to be a perfectionist about it. Make sure you are taking full advantage of every CPU cycle and register - as well as every byte of CPU cache and stack space.
You may consider using integer values instead of floating point, as these are far more ASM-friendly and much more efficient. You'd have to multiply the number by 10^7 (or 10^p, depending on how you decide to form your logic) to move the decimal all the way over to the right, and then you could safely convert the floating-point value into a plain integer.
You'll have to rely on the computer hardware to do the rest.
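Going back to the integer suggestion, a rough sketch of what the scaling could look like (the helper name and the use of std::llround are illustrative; the string b is assumed to have been converted into a scaled integer once, outside the hot loop):

#include <cmath>
#include <cstdint>

// Compare a against a pre-scaled integer: move the p wanted digits to the
// left of the decimal point, round once, then compare plain integers.
bool matchFixedPoint(double a, std::int64_t bScaled, int p) {
    const double scale = std::pow(10.0, p);
    const std::int64_t aScaled = std::llround(a * scale);   // nearest integer
    return aScaled == bScaled;
}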
<--Microsoft Specific-->
I'll also add that C++ identifiers (including static ones, as Donnie DeBoer mentioned) are directly accessible from ASM blocks nested into your C++ code. This makes inline ASM a breeze.
<--End Microsoft Specific-->
Depending on what you want the numbers for, you might want to use fixed point numbers instead of floating point. A quick search turns up this.
I think you can just add 0.005 for precision to hundredths, 0.0005 for thousandths, etc., snprintf the result with something like "%1.2f" (hundredths; "%1.3f" for thousandths, etc.), and compare the strings. You should be able to table-ize or parameterize this logic.
You could save some major cycles in your posted code just by making that double t[] static, so that it isn't allocated over and over.
Try this instead:
#include <cmath>
double setprecision(double x, int prec) {
return
ceil( x * pow(10,(double)prec) - .4999999999999)
/ pow(10,(double)prec);
}
It's probably faster. Maybe try inlining it as well, but that might hurt if it doesn't help.
Example of how it works:
2.345* 100 (10 to the 2nd power) = 234.5
234.5 - .4999999999999 = 234.0000000000001
ceil( 234.0000000000001 ) = 235
235 / 100 (10 to the 2nd power) = 2.35
The .4999999999999 was chosen because of the precision of a C++ double on a 32-bit system. If you're on a 64-bit platform you'll probably need more nines. If you add more nines on a 32-bit system, the constant becomes indistinguishable from 0.5 and the function rounds down instead of up; i.e., 234.00000000000001 gets truncated to 234 in a double in (my) 32-bit environment.
Using floating point (an inexact representation) means you've lost some information about the true number. You can't simply "fix" the value stored in the double by adding a fudge value. That might fix certain cases (like .45), but it will break other cases. You'll end up rounding up numbers that should have been rounded down.
Here's a related article:
http://www.theregister.co.uk/2006/08/12/floating_point_approximation/
I'm taking a guess at what you really mean to do. I suspect you're trying to see if a string contains a decimal representation of a double to some precision. Perhaps it's an arithmetic quiz program and you're trying to see if the user's response is "close enough" to the real answer. If that's the case, then it may be simpler to convert the string to a double and see if the absolute value of the difference between the two doubles is within some tolerance.
#include <sstream>
#include <string>

double string_to_double(const std::string &s)
{
    std::stringstream buffer(s);
    double d = 0.0;
    buffer >> d;
    return d;
}

bool match(const std::string &guess, double answer, int precision)
{
    const static double thresh[] = { 0.5, 0.05, 0.005, 0.0005, /* etc. */ };
    const double g = string_to_double(guess);
    const double delta = g - answer;
    return -thresh[precision] < delta && delta <= thresh[precision];
}
Another possibility is to round the answer first (while it's still numeric) BEFORE converting it to a string.
bool match2(const std::string &guess, double answer, int precision)
{
    const static double thresh[] = { 0.5, 0.05, 0.005, 0.0005, /* etc. */ };
    const double rounded = answer + thresh[precision];
    std::stringstream buffer;
    buffer << std::setprecision(precision) << rounded;
    return guess == buffer.str();
}
Both of these solutions should be faster than your sample code, but I'm not sure if they do what you really want.
As far as I can see, you are checking whether a, rounded to p decimal places, equals b.
Instead of converting a to a string, go the other way and convert the string b to a double (just multiplications and additions, or only additions using a small table), then subtract the two numbers and check whether the difference is in the proper range (e.g. for p == 1, abs(b - a) < 0.05).
Old-time developer's trick from the dark ages of pounds, shillings and pence in the old country.
The trick was to store the value as a whole number of half-pennies (or whatever your smallest unit is). Then all your subsequent arithmetic is straightforward integer arithmetic, and rounding etc. takes care of itself.
So in your case you store your data in units of 200ths of whatever you are counting,
do simple integer calculations on these values, and divide by 200 into a float variable whenever you want to display the result.
I believe Boost does a "BigDecimal"-style library these days, but your requirement for run-time speed would probably exclude this otherwise excellent solution.
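A tiny sketch of the 200ths idea (the values are made up for illustration): 0.775 is exactly 155 two-hundredths, so all the arithmetic stays exact and only the final division for display touches floating point.

#include <cstdint>
#include <cstdio>

int main() {
    // 0.775 stored as a whole number of 200ths: 0.775 * 200 = 155, exactly.
    std::int64_t value = 155;
    std::int64_t total = value * 4;          // exact integer arithmetic throughout
    std::printf("%.3f\n", total / 200.0);    // convert only for display: prints 3.100
    return 0;
}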