Floating point overflow

Floating point overflow - fortran

I am getting floating point overflow error in this part of code. Can any of you guys help me to find out the reason.
do j=1,ny-1
do i=1,nx-1
sum = 0.0d0
do k=0,1000
n=2.0d0*dfloat(k)+ 1.0d0
sum = sum + ((dsinh(n*pi*x(i))*dcos(n*pi*y(j)))/((n*n*pi*pi)*dsinh(2*n*pi)))
end do
ue(i,j)= (x(i)/(4.0d0))- 4.0d0*sum
end do
end do

The problem is the intermediate term dsinh(2*n*pi). Consider k=1000. Then n=2001 so we need to evaluate dsinh(2001*pi) which is about 0.5*exp(6286) or over 10^2700! This is far higher than any number that can be represented in double precision. You need to reevaluate the the way you are calculating the sum. The term dsinh(n*pi*x(i)) is problematic too.
My guess is that some sort of aysmptotic expansion is required for the robust evaluation of the quotient dsinh(n*pi*x(i))/dsinh(2*n*pi). For 0<x(i)<2 this term should behave as exp(n*pi*(x(i)-2)) as n becomes large. This is will be well behaved.

Related

Should I worry about precision when I use C++ mathematical functions with integers?

For example, The code below will give undesirable result due to precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem when I was programming for an algorithm task. There I want to get
the largest integer which is power of 2 and is less than or equal to integer N
So round function can not solve my problem. I know I can solve this problem through a loop, but it seems not very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give correct result. For example if N==8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result become 4 instead of 8?

Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) to be 3 does not exist for basic operations (including *). But it exists for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so it is possible that a*3 does not produce 1.0. The product could have happened to round to 1.0, it just doesn't. However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of basic operations is the nearest representable value to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, then that representation is what you get.
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is “if you have a reasonable quality implementation of log2, you can expect the result to be 3.0`”.
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, caused by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), and thus returning 99.0 instead of 100.0.
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question the the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end-result. These functions do not come even with the guarantees that basic operations come with.

Unfortunately the default conversion from a floating point number to integer in C++ is really crazy as it works by dropping the decimal part.
This is bad for two reasons:
a floating point number really really close to a positive integer, but below it will be converted to the previous integer instead (e.g. 3-1×10-10 = 2.9999999999 will be converted to 2)
a floating point number really really close to a negative integer, but above it will be converted to the next integer instead (e.g. -3+1×10-10 = -2.9999999999 will be converted to -2)
The combination of (1) and (2) means also that using int(x + 0.5) will not work reasonably as it will round negative numbers up.
There is a reasonable round function, but unfortunately returns another floating point number, thus you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
Note that the only numbers that can be represented correctly in floating point are quotients where the denominator is an integral power of 2.
For example 1/65536 = 0.0000152587890625 can be represented correctly, but even just 0.1 is impossible to represent correctly and thus any computation involving that quantity will be approximated.
Of course when using 0.1 approximations can cancel out leaving a correct result occasionally, but even just adding ten times 0.1 will not give 1.0 as result when doing the computation using IEEE754 double-precision floating point numbers.
Even worse the compilers are allowed to use higher precision for intermediate results. This means that adding 10 times 0.1 may give back 1 when converted to an integer if the compiler decides to use higher accuracy and round to closest double at the end.
This is "worse" because despite being the precision higher the results are compiler and compiler options dependent, making reasoning about the computations harder and making the exact result non portable among different systems (even if they use the same precision and format).
Most compilers have special options to avoid this specific problem.

IEEE-754 floating point: Divide first or multiply first for best precision?

What's better if I want to preserve as much precision as possible in a calculation with IEEE-754 floating point values:
a = b * c / d
or
a = b / d * c
Is there a difference? If there is, does it depend on the magnitudes of the input values? And, if magnitude matters, how is the best ordering determined when general magnitudes of the values are known?

It depends on the magnitude of the values. Obviously if one divides by zero, all bets are off, but if a multiplication or division results in a denormal subsequent operations can lose precision.
You may find it useful to study Goldberg's seminal paper What Every Computer Scientist Should Know About Floating-Point Arithmetic which will explain things far better than any answer you're likely to receive here. (Goldberg was one of the original authors of IEEE-754.)

Assuming that none of the operations would yield an overflow or an underflow, and your input values have uniformly distributed significands, then this is equivalent. Well, I suppose that to have a rigorous proof, one should do an exhaustive test (probably not possible in practice for double precision since there are 2^156 inputs), but if there is a difference in the average error, then it is tiny. I could try in low precisions with Sipe.
In any case, in the absence of overflow/underflow, only the exact values of the significands matter, not the exponents.
However if the result a is added to (or subtracted from) another expression and not reused, then starting with the division may be more interesting since you can group the multiplication with the following addition by using a FMA (thus with a single rounding).

Computer precision: when should I have to worry about it?

In C++ programming, when do I need to worry about the precision issue? To take a small example (it might not be a perfect one though),
std::vector<double> first (50000, 0.0);
std::vector<double> second (first);
Could it be possible that second[619] = 0.00000000000000000000000000001234 (I mean a very small value). Or SUM = second[0]+second[1]+...+second[49999] => 1e-31? Or SUM = second[0]-second[1]-...-second[49999] => -7.987654321e-12?
My questions:
Could it be some small disturbances in working with the double type numbers?
What may cause these kind of small disturbances? i.e. rounding errors become large? Could you please list them? How to take precautions?
If there could be small disturbance in certain operations, does it then mean after these operations, using if (SUM == 0) is dangerous? One should then always use if (SUM < SMALL) instead, where SMALL is defined as a very small value, such as 1E-30?
Lastly, could the small disturbances result into a negative value? Because if it is possible, then I should be better use if (abs(SUM) < SMALL) instead.
Any experiences?

This is a good reference document for floating point precision: What Every Computer Scientist Should Know About Floating-Point Arithmetic
One of the more important parts is catastrophic cancellation
Catastrophic cancellation occurs when the operands are subject to
rounding errors. For example in the quadratic formula, the expression
b2 - 4ac occurs. The quantities b2 and 4ac are subject to rounding
errors since they are the results of floating-point multiplications.
Suppose that they are rounded to the nearest floating-point number,
and so are accurate to within .5 ulp. When they are subtracted,
cancellation can cause many of the accurate digits to disappear,
leaving behind mainly digits contaminated by rounding error. Hence the
difference might have an error of many ulps. For example, consider b =
3.34, a = 1.22, and c = 2.28. The exact value of b2 - 4ac is .0292. But b2 rounds to 11.2 and 4ac rounds to 11.1, hence the final answer
is .1 which is an error by 70 ulps, even though 11.2 - 11.1 is exactly
equal to .16. The subtraction did not introduce any error, but rather
exposed the error introduced in the earlier multiplications.
Benign cancellation occurs when subtracting exactly known quantities.
If x and y have no rounding error, then by Theorem 2 if the
subtraction is done with a guard digit, the difference x-y has a very
small relative error (less than 2).
A formula that exhibits catastrophic cancellation can sometimes be
rearranged to eliminate the problem. Again consider the quadratic
formula

For your specific example, 0 has an exact representation as a double, and adding exactly 0 to a double does not change its value.
Also, like any other values you put in variables, numbers that you initialize in the array are not going to mysteriously change. You only get rounding when the result of a calculation cannot be exactly represented as a floating point number.
To give a better opinion about "disturbances" I would need to know the kinds of calculations that your code performs.

Dividing two integer without casting to double

I have two integer variables, partial and total. It is a progress, so partial starts at zero and goes up one-by-one to the value of total.
If I want to get a fraction value indicating the progress(from 0.0 to 1.0) I may do the following:
double fraction = double(partial)/double(total);
But if total is too big, the conversion to double may lose information.
Actually, the amount of lost information is tolerable, but I was wondering if there is a algorithm or a std function to get the fraction between two values losing less information.

The obvious answer is to multiply partial by some scaling factor;
100 is a frequent choice, since the division then gives the percent as
an integral value (rounded down). The problem is that if the values are
so large that they can't be represented precisely in a double, there's
also a good chance that the multiplication by the scaling factor will
overflow. (For that matter, if they're that big, the initial values
will overflow an int on most machines.)

Yes, there is an algorithm losing less information. Assuming you want to find the double value closest to the mathematical value of the fraction, you need an integer type capable of holding total << 53. You can create your own or use a library like GMP for that. Then
scale partial so that (total << 52) <= numerator < (total << 53), where numerator = (partial << m)
let q be the integer quotient numerator / total and r = numerator % total
let mantissa = q if 2*r < total, = q+1 if 2*r > total and if 2*r == total, mantissa = q+1 if you want to round half up, = q if you want to round half down, the even of the two if you want round-half-to-even
result = scalbn(mantissa, -m)
Most of the time you get the same value as for (double)partial / (double)total, differences of one least significant bit are probably not too rare, two or three LSB difference wouldn't surprise me either, but are rare, a bigger difference is unlikely (that said, somebody will probably give an example soon).
Now, is it worth the effort? Usually not.

If you want a precise representation of the fraction, you'd have some sort of structure containing the numerator and the denominator as integers, and, for unique representation, you'd just factor out the greatest common divisor (with a special case for zero). If you are just worried that after repeated operations the floating point representation might not be accurate enough, you need to just find some courses on numerical analysisas that issue isn't strictly a programming issue. There are better ways than others to calculate certain results, but I can't really go into them (I've never done the coursework, just read about it).

Accurate evaluation of 1/1 + 1/2 + ... 1/n row

I need to evaluate the sum of the row: 1/1+1/2+1/3+...+1/n. Considering that in C++ evaluations are not complete accurate, the order of summation plays important role. 1/n+1/(n-1)+...+1/2+1/1 expression gives the more accurate result.
So I need to find out the order of summation, which provides the maximum accuracy.
I don't even know where to begin.
Preferred language of realization is C++.
Sorry for my English, if there are any mistakes.

For large n you'd better use asymptotic formulas, like the ones on http://en.wikipedia.org/wiki/Harmonic_number;
Another way is to use exp-log transformation. Basically:
H_n = 1 + 1/2 + 1/3 + ... + 1/n = log(exp(1 + 1/2 + 1/3 + ... + 1/n)) = log(exp(1) * exp(1/2) * exp(1/3) * ... * exp(1/n)).
Exponents and logarithms can be calculated pretty quickly and accuratelly by your standard library. Using multiplication you should get much more accurate results.
If this is your homework and you are required to use simple addition, you'll better add from the smallest one to the largest one, as others suggested.

The reason for the lack of accuracy is the precision of the float, double, and long double types. They only store so many "decimal" places. So adding a very small value to a large value has no effect, the small term is "lost" in the larger one.
The series you're summing has a "long tail", in the sense that the small terms should add up to a large contribution. But if you sum in descending order, then after a while each new small term will have no effect (even before that, most of its decimal places will be discarded). Once you get to that point you can add a billion more terms, and if you do them one at a time it still has no effect.
I think that summing in ascending order should give best accuracy for this kind of series, although it's possible there are some odd corner cases where errors due to rounding to powers of (1/2) might just so happen to give a closer answer for some addition orders than others. You probably can't really predict this, though.

I don't even know where to begin.
Here: What Every Computer Scientist Should Know About Floating-Point Arithmetic

Actually, if you're doing the summation for large N, adding in order from smallest to largest is not the best way -- you can still get into a situation where the numbers you're adding are too small relative to the sum to produce an accurate result.
Look at the problem this way: You have N summations, regardless of ordering, and you wish to have the least total error. Thus, you should be able to get the least total error by minimizing the error of each summation -- and you minimize the error in a summation by adding values as nearly close to each other as possible. I believe that following that chain of logic gives you a binary tree of partial sums:
Sum[0,i] = value[i]
Sum[1,i/2] = Sum[0,i] + Sum[0,i+1]
Sum[j+1,i/2] = Sum[j,i] + Sum[j,i+1]
and so on until you get to a single answer.
Of course, when N is not a power of two, you'll end up with leftovers at each stage, which you need to carry over into the summations at the next stage.
(The margins of StackOverflow are of course too small to include a proof that this is optimal. In part because I haven't taken the time to prove it. But it does work for any N, however large, as all of the additions are adding values of nearly identical magnitude. Well, all but log(N) of them in the worst not-power-of-2 case, and that's vanishingly small compared to N.)

http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
You can find libraries with ready for use implementation for C/C++.
For example http://www.apfloat.org/apfloat/

Unless you use some accurate closed-form representation, a small-to-large ordered summation is likely to be most accurate simple solution (it's not clear to me why a log-exp would help - that's a neat trick, but you're not winning anything with it here, as far as I can tell).
You can further gain precision by realizing that after a while, the sum will become "quantized": Effectively, when you have 2 digits of precision, adding 1.3 to 41 results in 42, not 42.3 - but you achieve almost a precision doubling by maintaining an "error" term. This is called Kahan Summation. You'd compute the error term (42-41-1.3 == -0.3) and correct that in the next addition by adding 0.3 to the next term before you add it in again.
Kahan Summation in addition to a small-to-large ordering is liable to be as accurate as you'll ever need to get. I seriously doubt you'll ever need anything better for the harmonic series - after all, even after 2^45 iterations (crazy many) you'd still only be dealing with a numbers that are at least 1/2^45 large, and a sum that's on the order of 45 (<2^6), for an order of magnitude difference of 51 powers-of-two - i.e. even still representable in a double precision variable if you add in the "wrong" order.
If you go small-to-large, and use Kahan Summation, the sun's probably going to extinguish before today's processors reach a percent of error - and you'll run into other tricky accuracy issues just due to the individual term error on that scale first anyhow (being that a number of the order of 2^53 or larger cannot be represented accurately as a double at all anyhow.)

I'm not sure about the order of summation playing an important role, I havent heard that before. I guess you want to do this in floating point arithmetic so the first thing is to think more inline of (1.0/1.0 + 1.0/2.0+1.0/3.0) - otherwise the compiler will do integer division
to determine order of evaluation, maybe a for loop or brackets?
e.g.
float f = 0.0;
for (int i=n; i>0; --i)
{
f += 1.0/static_cast<float>(i);
}
oh forgot to say, compilers will normally have switches to determine floating point evaluation mode. this is maybe related to what you say on order of summation - in visual C+ these are found in code-generation compile settings, in g++ there're options -float that handle this
actually, the other guy is right - you should do summation in order of smallest component first; so
1/n + 1/(n-1) .. 1/1
this is because the precision of a floating point number is linked to the scale, if you start at 1 you'll have 23 bits of precision relative to 1.0. if you start at a smaller number the precision is relative to the smaller number, so you'll get 23 bits of precision relative to 1xe-200 or whatever. then as the number gets bigger rounding error will occur, but the overall error will be less than the other direction

As all your numbers are rationals, the easiest (and also maybe the fastest, as it will have to do less floating point operations) would be to do the computations with rationals (tuples of 2 integers p,q), and then do just one floating point division at the end.
update to use this technique effectively you will need to use bigints for p & q, as they grow quite fast...
A fast prototype in Lisp, that has built in rationals shows:
(defun sum_harmonic (n acc)
(if (= n 0) acc (sum_harmonic (- n 1) (+ acc (/ 1 n)))))
(sum_harmonic 10 0)
7381/2520
[2.9289682]
(sum_harmonic 100 0)
14466636279520351160221518043104131447711/278881500918849908658135235741249214272
[5.1873775]
(sum_harmonic 1000 0)
53362913282294785045591045624042980409652472280384260097101349248456268889497101
75750609790198503569140908873155046809837844217211788500946430234432656602250210
02784256328520814055449412104425101426727702947747127089179639677796104532246924
26866468888281582071984897105110796873249319155529397017508931564519976085734473
01418328401172441228064907430770373668317005580029365923508858936023528585280816
0759574737836655413175508131522517/712886527466509305316638415571427292066835886
18858930404520019911543240875811114994764441519138715869117178170195752565129802
64067621009251465871004305131072686268143200196609974862745937188343705015434452
52373974529896314567498212823695623282379401106880926231770886197954079124775455
80493264757378299233527517967352480424636380511370343312147817468508784534856780
21888075373249921995672056932029099390891687487672697950931603520000
[7.485471]
So, the next better option could be to mantain the list of floating points and to reduce it summing the two smallest numbers in each step...

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js