Select ULP value in float comparison

I've read several resources online and I understand that there is no single value or universal parameter for comparing floating-point numbers. I've read several replies here and found the code Google Test uses to compare floats. I want to better understand the meaning of ULP and how its value is chosen. The comments in the source code say:
The maximum error of a single floating-point operation is 0.5 units in
the last place. On Intel CPU's, all floating-point calculations are
done with 80-bit precision, while double has 64 bits. Therefore, 4
should be enough for ordinary use.
It's not really clear why "therefore 4 should be enough". Can anyone explain why? From my understanding, we are saying that we can tolerate a difference of 4*10^-6 or 4*10^-15 between our numbers when deciding whether they are the same, taking into account the number of significant digits of float (6/7) or double (15/16). Is that correct?

It is wrong. Very wrong. Consider that every operation can accumulate some error: ½ ULP is the maximum (in round-to-nearest mode), so ¼ might be an average. So 17 operations are enough to accumulate more than 4 ULP of error just from average effects (see footnote 1). Today’s computers do billions of operations per second. How many operations will a program do between its inputs and some later comparison? That depends on the program, but it could be zero, dozens, thousands, or millions just for “ordinary” use. (Let’s say we exclude billions because then it gets slow for a human to use, so we can call that special-purpose software, not ordinary.)
But that is not all. Suppose we add a few numbers around 1 and then subtract a number that happens to be around the sum. Maybe the adds get a total error around 2 ULP. But when we subtract, the result might be around 2^-10 instead of around 1. The ULP of 2^-10 is 1024 times smaller than the ULP of 1, so an error that is 2 ULP relative to 1 is 2048 ULP relative to the result of the subtraction. Oops! 4 ULP will not cut it. It would need to be 4 ULP of some of the other numbers involved, not the ULP of the result.
In fact, characterizing the error is difficult in general and is the subject of an entire field of study, numerical analysis. 4 is not the answer.
Footnote
1 Errors will vary in direction, so some will cancel out. The behavior might be modeled as a random walk, and the average error might be proportional to the square root of the number of operations performed.
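The cancellation effect described above can be checked with a small sketch. The helper names (ordered_bits, ulp_distance, cancellation_demo) are made up for this illustration, and the code assumes IEEE-754 binary32 floats and finite inputs; it is not the Google Test implementation.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <utility>

// Map a finite float to an integer so that adjacent floats map to adjacent
// integers. Assumes IEEE-754 binary32 (an assumption of this sketch).
inline std::int64_t ordered_bits(float x) {
    std::int32_t i;
    std::memcpy(&i, &x, sizeof i);
    // Negative floats are stored sign-magnitude; mirror them below zero.
    return i < 0 ? -static_cast<std::int64_t>(i & 0x7fffffff)
                 : static_cast<std::int64_t>(i);
}

// Number of representable floats between a and b (distance in ULPs).
inline std::int64_t ulp_distance(float a, float b) {
    const std::int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

// Returns {ULP error of the sum, ULP error after subtracting 1}.
inline std::pair<std::int64_t, std::int64_t> cancellation_demo() {
    const float exact_sum = 1.0f + std::ldexp(1.0f, -10);       // 1 + 2^-10
    const float noisy_sum = exact_sum + std::ldexp(1.0f, -22);  // off by 2 ULP of 1
    return {ulp_distance(exact_sum, noisy_sum),
            ulp_distance(exact_sum - 1.0f, noisy_sum - 1.0f)};
}
```

Calling cancellation_demo() should give the pair (2, 2048): the same absolute error is 2 ULP when measured against a value near 1, but 2048 ULP when measured against the difference near 2^-10.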

Related

How can I calculate this prime product faster with PARI/GP?

I want to calculate the product of 1-1/p, where p runs over the primes up to 10^10.
I know the approximation exp(-gamma)/ln(10^10), where gamma is the Euler-Mascheroni constant and ln is the natural logarithm, but I want to calculate the exact product to see how close the approximation is.
The problem is that PARI/GP takes very long to calculate the prime numbers from about 4.2 * 10^9 to 10^10. The prodeuler command also takes very long.
Is there any method to speed up the calculation with PARI/GP?

I'm inclined to think the performance issue has mostly to do with the rational numbers rather than the generation of primes up to 10^10.
As a quick test I ran
a(n)=my(t=0);forprime(p=1,n,t+=p);t
with a(10^10) and it computed in a couple of minutes which seems reasonable.
The corresponding program for your request is:
a(n)=my(t=1);forprime(p=1,n,t*=(1-1/p));t
and this runs much slower than the first program, so my question is whether there is a way to reformulate the computation to avoid rationals until the end. Is my formulation above even what you intended? The numbers are extremely large even for 10^6, so it is no wonder it takes a long time to compute; perhaps the issue has less to do with the numbers being rational than with their size.
One trick I have used to compute large products is to split the problem so that at each stage the numbers on the left and right of the multiplication are roughly the same size. For example, to compute a large factorial, say 8!, it is much more efficient to compute ((1*8)*(2*7))*((3*6)*(4*5)) rather than use the obvious left-to-right approach.
The following is a quick attempt to do what you want using exact arithmetic. It takes approximately 8 minutes up to 10^8, but the size of the numerator is already 1.9 million digits, so it is unlikely this could ever get to 10^10 before running out of memory. (Even for this computation I needed to increase the stack size.)
xvecprod(v)={if(#v<=1, if(#v,v[1],1), xvecprod(v[1..#v\2]) * xvecprod(v[#v\2+1..#v]))}
faster(n)={my(b=10^6);xvecprod(apply(i->xvecprod(
apply(p->1-1/p, select(isprime, [i*b+1..min((i+1)*b,n)]))), [0..n\b]))}
Using decimals will definitely speed things up. The following runs reasonably quickly for up to 10^8 with 1000 digits of precision.
xvecprod(v)={if(#v<=1, if(#v,v[1],1), xvecprod(v[1..#v\2]) * xvecprod(v[#v\2+1..#v]))}
fasterdec(n)={my(b=10^6);xvecprod(apply(i->xvecprod(
apply(p->1-1.0/p,select(isprime,[i*b+1..min((i+1)*b,n)]))),[0..n\b]))}
The fastest method using decimals is the simplest:
a(n)=my(t=1);forprime(p=1,n,t*=(1-1.0/p));t
With precision set to 100 decimal digits, this produces a(10^9) in 2 minutes and a(10^10) in 22 minutes.
10^9: 0.02709315486987096878842689330617424348105764850
10^10: 0.02438386113804076644782979967638833694491163817
When working with decimals, the trick of splitting the multiplications does not improve performance because the numbers always have the same number of digits. However, I have left the code, since there is a potential for better accuracy. (at least in theory.)
I am not sure I can give any good advice on the number of digits of precision required. (I'm more of a programmer type and tend to work with whole numbers). However, my understanding is that there is a possibility of losing 1 binary digit of precision with every multiplication, although since rounding can go either way on average it won't be quite so bad. Given that this is a product of over 450 million terms, that would imply all precision is lost.
However, using the algorithm that splits the computation, each value only goes through around 30 multiplications, so that should only result in a loss of at most 30 binary digits (10 decimal digits) of precision so working with 100 digits of precision should be sufficient. Surprisingly, I get the same answers either way, so the simple naive method seems to work.
During my tests, I noticed that using forprime is much faster than using isprime. (For example, the fasterdec version took almost 2 hours, compared with the simple version, which took 22 minutes to get to the same result.) Similarly, sum(p=1,10^9,isprime(p)) takes approximately 8 minutes, compared with my(t=1);forprime(p=1,10^9,t++);t which takes just 11 seconds.
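For readers without PARI/GP at hand, the simple floating-point approach can be sketched in C++ for moderate bounds. This is only an illustration under stated assumptions: the function names are invented here, a plain sieve of Eratosthenes stands in for forprime, and memory grows linearly with n, so reaching 10^10 would need a segmented sieve.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Product of (1 - 1/p) over all primes p <= n, using a plain sieve.
// Memory is O(n), so this sketch is only meant for moderate n.
long double prime_product(std::size_t n) {
    std::vector<bool> composite(n + 1, false);
    long double product = 1.0L;
    for (std::size_t p = 2; p <= n; ++p) {
        if (composite[p]) continue;  // p is not prime
        product *= 1.0L - 1.0L / static_cast<long double>(p);
        if (p <= n / p) {            // guard against overflow of p * p
            for (std::size_t m = p * p; m <= n; m += p) composite[m] = true;
        }
    }
    return product;
}

// The approximation exp(-gamma) / ln(n) mentioned in the question.
long double mertens_estimate(std::size_t n) {
    const long double gamma = 0.57721566490153286060651209L;  // Euler-Mascheroni
    return std::exp(-gamma) / std::log(static_cast<long double>(n));
}
```

Comparing prime_product(1000000) with mertens_estimate(1000000) gives a quick feel for how close the approximation is before committing to a long run.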

Does log2(n)*(x) set compression limits?

I may get all kinds of flags and penalties thrown at me for this. So please be patient. 2 questions
If the minimal number of bits needed to represent an arbitrary number of decimals is given by log2(n)*(x), where n is the range and x is the length, then should you be able to calculate the maximum compression by converting the file from binary to decimal?
Is this result a law, so that one cannot compress below the theoretical minimum, or is it an approximate limit?
Jon Hutton

It's actually a bit (ha) trickier. That formula assumes that the number is drawn from a uniform distribution, which is often not the case, but notably is the case for what is commonly called "random data" (though that is an inaccurate name, since data may be random but drawn from a non-uniform distribution).
The entropy H of X in bits is given by the formula:
H(X) = - sum[i](P(x[i]) log2(P(x[i])))
Where P gives the probability of every value x[i] that X may take. The bounds of i are implied and irrelevant; impossible options have a probability of zero anyway. In the uniform case, P(x[i]) is (by definition) 1/N for every possible x[i], so we have H(X) = -N * (1/N log2(1/N)) = -log2(1/N) = log2(N).
The formula should in general not simply be multiplied by the length of the data; that only works if all symbols are independent and identically distributed (so, for example, on your file with IID uniform-random digits, it does work). Often for meaningful data, the probability distribution for a symbol depends on its context, and indeed a lot of compression techniques are aimed at exploiting this.
There is no law that says you cannot get lucky and thereby compress an individual file to fewer bits than are suggested by its entropy. You can arrange for it to be possible on purpose (but it won't necessarily happen). For example, let's say we expect that any letter is equally probable, but we decide to go against the flow and encode an A with the single bit 0, and any other letter as a 1 followed by 5 bits that indicate which letter it is. This is obviously a bad encoding given the expectation: there are only 26 letters and they're equally probable, but instead of log2(26) ≈ 4.7 bits on average we use (1 + 25 * 6)/26 ≈ 5.8. However, if by some accident we happen to actually get an A (there is a chance of 1/26 that this happens, so the odds are not too bad), we compress it to a single bit, which is much better than expected. Of course one cannot rely on luck; it can only come as a surprise.
For further reference you could read about entropy (information theory) on Wikipedia.
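The numbers in the letter example can be reproduced with a short sketch. The function names entropy_bits and expected_length are invented for this illustration; the probabilities and code lengths are the ones from the example (26 equally likely letters, A coded in 1 bit, every other letter in 6 bits).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy in bits: H(X) = -sum_i P(x_i) * log2(P(x_i)).
// Outcomes with probability zero contribute nothing to the sum.
double entropy_bits(const std::vector<double>& probs) {
    double h = 0.0;
    for (double p : probs) {
        if (p > 0.0) h -= p * std::log2(p);
    }
    return h;
}

// Average number of bits per symbol for a code with the given lengths.
double expected_length(const std::vector<double>& probs,
                       const std::vector<int>& lengths) {
    double avg = 0.0;
    for (std::size_t i = 0; i < probs.size(); ++i) avg += probs[i] * lengths[i];
    return avg;
}
```

With 26 probabilities of 1/26, entropy_bits returns log2(26), about 4.70 bits, while expected_length with lengths {1, 6, 6, ..., 6} returns 151/26, about 5.81 bits, matching the figures quoted above.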

Checking results of parallelized BLAS routines

I implemented some parallel BLAS routines in OpenCL. To check if the kernels are correct, I also implemented the same routines in a naive way. After executing the kernels I compare the kernel results with the results of the naive implementation.
I understand that I can not compare float values with ==. I therefore calculate the absolute difference of the two floats and check if it exceeds a limit. I already read this article that describes a few other methods of comparing floats. My problem however is, that I am unsure about the limit to use to compare the floats. In my case the limit seems highly dependent on the BLAS routine and input size.
For example, I implemented asum, which calculates the absolute sum of a vector of float values. For an input vector of size 16 777 216 the difference between the naive implementation and my parallelized implementation is 96! For an input size of 1 048 576 the difference is only 0.5. I'm fairly certain that my kernel is correct, because I checked the results by hand for small input sizes. I'm guessing the difference accumulates due to the large input vector.
My question is: is there a way to calculate the maximal difference that can originate from float inaccuracies? Is there a way to know when the difference is definitely due to an error in the kernel code?

There is a technique called interval mathematics you can use here.
Instead of having some fixed error which you deem acceptable, you keep track of the most and least value a given floating point operation could "actually" be referring to.
Wikipedia has an article on it.
If I couldn't find a library, what I'd do is create an interval float type. It contains two floats, which represent the highest and lowest (inclusive) values that the interval could represent.
It would override + and * and / and - to include the effects of rounding. It would take work to write.
So if you add {1.0,1.0} and {2.0,2.0}, the answer would be {3.0,3.0}, as the range of values in the 3.0 may be large enough to account for the errors in the 1.0 and 2.0s.
Subtract 2.0 and the answer becomes {0.9999999999997, 1.00000000003} or similar, as the error in the {3.0, 3.0} is larger than error implied by {1.0, 1.0}.
The same holds for multiplication and division.
It may be shockingly easy for these intervals to reach "every possible number including inf/nan" if you have division involved. And, as noted, subtraction leads to serious problems; and if you have large terms that cancel, you can easily end up with error bars far larger than you might expect.
In the end, if your OpenCL solution results in a value within the interval, you can say "well, it isn't wrong".
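A minimal sketch of such an interval type follows, supporting only + and -. It is an illustration under assumptions: the names Interval, widen and interval_sum are invented here, and instead of switching the hardware rounding mode each bound is simply pushed one representable value outward with std::nextafter, which is conservative when the default round-to-nearest mode is in effect.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Closed interval [lo, hi] that is meant to contain the exact result.
struct Interval {
    double lo;
    double hi;
};

inline Interval make_point(double x) { return {x, x}; }

// Push both bounds one representable value outward. With round-to-nearest,
// each computed bound is within half an ULP of the exact one, so this
// keeps the exact result inside the interval.
inline Interval widen(double lo, double hi) {
    const double inf = std::numeric_limits<double>::infinity();
    return {std::nextafter(lo, -inf), std::nextafter(hi, inf)};
}

inline Interval operator+(Interval a, Interval b) {
    return widen(a.lo + b.lo, a.hi + b.hi);
}

inline Interval operator-(Interval a, Interval b) {
    return widen(a.lo - b.hi, a.hi - b.lo);
}

inline bool contains(Interval a, double x) { return a.lo <= x && x <= a.hi; }

// Sum a vector while tracking the interval of possible exact results.
inline Interval interval_sum(const std::vector<double>& v) {
    Interval acc = make_point(0.0);
    for (double x : v) acc = acc + make_point(x);
    return acc;
}
```

A kernel result r could then be checked with contains(interval_sum(input), r). Multiplication and division need more care (sign cases, division by intervals containing zero) and are left out of this sketch.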

controlling overflow and loss in precision while multiplying doubles

ques:
I have a large number of floating point numbers (~10,000), each having 6 digits after the decimal point. Multiplying all these numbers would yield about 60,000 digits, but a double holds only about 15 significant digits. The output product has to have 6 digits of precision after the decimal point.
my approach:
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them to decimal. But this also appears cumbersome and may not yield correct result.
Is there an alternate easier way to do this?

I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial product from reaching into subnormal territory (and in this case, multiplying by a power of two would be recommended to avoid loss of accuracy due to the multiplication). There is no indication in your question that this happens, no example data set, no code, so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. At the first order, you can assume that relative inaccuracies add up, so that multiplying 10000 values produces a result that's 9999 machine epsilons away from the mathematical result in relative terms(*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the problems of underflow or overflow and leaves you with a relative accuracy on the end result such that once rounded to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
(*) In reality, relative inaccuracies compound, so that the real bound on the relative error is more like (1 + ε)^9999 - 1 where ε is the machine epsilon. Also, in reality, relative errors often cancel each other, so that you can expect the actual relative error to grow like the square root of the theoretical maximum error.
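As a rough sketch of the two points above, the following accumulates the product in long double and evaluates the bound (1 + ε)^k - 1. The function names are invented for this illustration. Whether long double is actually wider than double depends on the compiler and platform (on some it is the same 64-bit format), so the gain in accuracy is an assumption, not a guarantee.

```cpp
#include <cfloat>
#include <cmath>
#include <vector>

// Multiply all values using a long double accumulator and round once at
// the end. Only helps if long double is wider than double on the platform.
double product_wide(const std::vector<double>& values) {
    long double acc = 1.0L;
    for (double v : values) acc *= v;
    return static_cast<double>(acc);
}

// Worst-case relative error bound (1 + eps)^k - 1 for k multiplications,
// computed with log1p/expm1 so the tiny result is not lost to rounding.
double relative_error_bound(int k, double eps = DBL_EPSILON) {
    return std::expm1(k * std::log1p(eps));
}
```

For k = 9999 and double's machine epsilon, relative_error_bound returns roughly 2.2e-12, which is barely above 9999 * ε: the compounding only matters for much larger k.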

Accurate evaluation of the series 1/1 + 1/2 + ... + 1/n

I need to evaluate the sum of the series 1/1+1/2+1/3+...+1/n. Since floating-point evaluation in C++ is not completely accurate, the order of summation plays an important role. The expression 1/n+1/(n-1)+...+1/2+1/1 gives a more accurate result.
So I need to find the order of summation that provides the maximum accuracy.
I don't even know where to begin.
The preferred implementation language is C++.
Sorry for my English, if there are any mistakes.

For large n you'd better use asymptotic formulas, like the ones on http://en.wikipedia.org/wiki/Harmonic_number.
Another way is to use exp-log transformation. Basically:
H_n = 1 + 1/2 + 1/3 + ... + 1/n = log(exp(1 + 1/2 + 1/3 + ... + 1/n)) = log(exp(1) * exp(1/2) * exp(1/3) * ... * exp(1/n)).
Exponents and logarithms can be calculated pretty quickly and accurately by your standard library. Using multiplication you should get much more accurate results.
If this is your homework and you are required to use simple addition, you'd better add from the smallest one to the largest one, as others suggested.

The reason for the lack of accuracy is the precision of the float, double, and long double types. They only store so many "decimal" places. So adding a very small value to a large value has no effect, the small term is "lost" in the larger one.
The series you're summing has a "long tail", in the sense that the small terms should add up to a large contribution. But if you sum in descending order, then after a while each new small term will have no effect (even before that, most of its decimal places will be discarded). Once you get to that point you can add a billion more terms, and if you do them one at a time it still has no effect.
I think that summing in ascending order should give best accuracy for this kind of series, although it's possible there are some odd corner cases where errors due to rounding to powers of (1/2) might just so happen to give a closer answer for some addition orders than others. You probably can't really predict this, though.
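The effect is easy to observe in single precision. The sketch below (function names invented for this illustration) sums the series in both orders using float and compares against a double-precision reference; it assumes the compiler is not allowed to reorder floating-point operations (no -ffast-math or similar).

```cpp
// 1/1 + 1/2 + ... + 1/n, largest terms first.
float harmonic_large_first(int n) {
    float sum = 0.0f;
    for (int i = 1; i <= n; ++i) sum += 1.0f / static_cast<float>(i);
    return sum;
}

// 1/n + ... + 1/2 + 1/1, smallest terms first.
float harmonic_small_first(int n) {
    float sum = 0.0f;
    for (int i = n; i >= 1; --i) sum += 1.0f / static_cast<float>(i);
    return sum;
}

// Double-precision reference, accurate enough to judge the float results.
double harmonic_reference(int n) {
    double sum = 0.0;
    for (int i = n; i >= 1; --i) sum += 1.0 / i;
    return sum;
}
```

For n = 10000000 the largest-first float sum stops growing long before the loop ends, because the remaining terms are smaller than half an ULP of the running total, while the smallest-first sum stays much closer to the reference.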

I don't even know where to begin.
Here: What Every Computer Scientist Should Know About Floating-Point Arithmetic

Actually, if you're doing the summation for large N, adding in order from smallest to largest is not the best way -- you can still get into a situation where the numbers you're adding are too small relative to the sum to produce an accurate result.
Look at the problem this way: You have N summations, regardless of ordering, and you wish to have the least total error. Thus, you should be able to get the least total error by minimizing the error of each summation -- and you minimize the error in a summation by adding values as nearly close to each other as possible. I believe that following that chain of logic gives you a binary tree of partial sums:
Sum[0,i] = value[i]
Sum[1,i/2] = Sum[0,i] + Sum[0,i+1]
Sum[j+1,i/2] = Sum[j,i] + Sum[j,i+1]
and so on until you get to a single answer.
Of course, when N is not a power of two, you'll end up with leftovers at each stage, which you need to carry over into the summations at the next stage.
(The margins of StackOverflow are of course too small to include a proof that this is optimal. In part because I haven't taken the time to prove it. But it does work for any N, however large, as all of the additions are adding values of nearly identical magnitude. Well, all but log(N) of them in the worst not-power-of-2 case, and that's vanishingly small compared to N.)
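The binary tree of partial sums can be written recursively, which also takes care of the leftovers when N is not a power of two. This is a sketch of the idea (usually called pairwise summation), with an invented function name; it makes no claim about optimality either.

```cpp
#include <cstddef>
#include <vector>

// Sum v[lo, hi) by splitting the range in half and adding the two
// half-sums, so most additions combine values of similar magnitude.
float pairwise_sum(const std::vector<float>& v, std::size_t lo, std::size_t hi) {
    if (hi == lo) return 0.0f;          // empty range
    if (hi - lo == 1) return v[lo];     // single element
    const std::size_t mid = lo + (hi - lo) / 2;
    return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid, hi);
}

// Convenience overload for a whole vector.
float pairwise_sum(const std::vector<float>& v) {
    return pairwise_sum(v, 0, v.size());
}
```

The recursion depth is only about log2(N), so stack usage is not a concern even for very large inputs. Production code often switches to a plain loop for small ranges to reduce call overhead.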

http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
You can find libraries with ready-to-use implementations for C/C++.
For example http://www.apfloat.org/apfloat/

Unless you use some accurate closed-form representation, a small-to-large ordered summation is likely to be the most accurate simple solution (it's not clear to me why a log-exp would help - that's a neat trick, but you're not winning anything with it here, as far as I can tell).
You can further gain precision by realizing that after a while, the sum will become "quantized": Effectively, when you have 2 digits of precision, adding 1.3 to 41 results in 42, not 42.3 - but you achieve almost a precision doubling by maintaining an "error" term. This is called Kahan Summation. You'd compute the error term (42-41-1.3 == -0.3) and correct that in the next addition by adding 0.3 to the next term before you add it in again.
Kahan Summation in addition to a small-to-large ordering is liable to be as accurate as you'll ever need to get. I seriously doubt you'll ever need anything better for the harmonic series - after all, even after 2^45 iterations (crazy many) you'd still only be dealing with numbers that are at least 1/2^45 large, and a sum that's below 2^6, for a difference of at most 51 powers of two - i.e. still representable in a double precision variable even if you add in the "wrong" order.
If you go small-to-large, and use Kahan Summation, the sun's probably going to extinguish before today's processors reach a percent of error - and you'll run into other tricky accuracy issues just due to the individual term error on that scale first anyhow (given that integers of the order of 2^53 or larger cannot all be represented exactly as a double anyhow).
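A minimal sketch of Kahan summation as described above, using float so the effect is visible at modest sizes. The function name is invented for this illustration, and the code relies on the compiler not simplifying the compensation away, which options such as -ffast-math may do.

```cpp
#include <vector>

// Compensated (Kahan) summation: carry the rounding error of each
// addition forward and subtract it from the next term.
float kahan_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    float compensation = 0.0f;  // estimate of the low-order bits lost so far
    for (float x : values) {
        const float y = x - compensation;  // correct the incoming term
        const float t = sum + y;           // low-order bits of y may be lost here
        compensation = (t - sum) - y;      // what was actually added, minus y
        sum = t;
    }
    return sum;
}
```

In exact arithmetic the compensation term would always be zero; in floating point it captures the part of each term that did not fit into the running sum, like the -0.3 in the two-digit example.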

I'm not sure about the order of summation playing an important role; I haven't heard that before. I guess you want to do this in floating point arithmetic, so the first thing is to think more along the lines of (1.0/1.0 + 1.0/2.0 + 1.0/3.0), otherwise the compiler will do integer division.
To determine the order of evaluation, maybe use a for loop or brackets?
e.g.
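float f = 0.0f;
// i runs from n down to 1, so the smallest terms are added first
for (int i = n; i > 0; --i)
{
    f += 1.0f / static_cast<float>(i);  // float division, not integer division
}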
Oh, I forgot to say: compilers normally have switches that control the floating point evaluation mode. This may be related to what you say about the order of summation. In Visual C++ these are the /fp options found in the code-generation compile settings; in g++ there are options such as -ffloat-store and -ffast-math that affect this.
Actually, the other guy is right - you should do the summation in order of smallest component first, so
1/n + 1/(n-1) .. 1/1
This is because the precision of a floating point number is linked to its scale: if you start at 1 you'll have 23 bits of precision relative to 1.0. If you start at a smaller number the precision is relative to the smaller number, so you'll get 23 bits of precision relative to 1e-20 or whatever. Then as the number gets bigger rounding error will occur, but the overall error will be less than in the other direction.

As all your numbers are rationals, the easiest (and maybe also the fastest, as it needs fewer floating point operations) would be to do the computations with rationals (tuples of 2 integers p,q), and then do just one floating point division at the end.
Update: to use this technique effectively you will need to use bigints for p and q, as they grow quite fast...
A quick prototype in Lisp, which has built-in rationals, shows:
(defun sum_harmonic (n acc)
(if (= n 0) acc (sum_harmonic (- n 1) (+ acc (/ 1 n)))))
(sum_harmonic 10 0)
7381/2520
[2.9289682]
(sum_harmonic 100 0)
14466636279520351160221518043104131447711/278881500918849908658135235741249214272
[5.1873775]
(sum_harmonic 1000 0)
53362913282294785045591045624042980409652472280384260097101349248456268889497101
75750609790198503569140908873155046809837844217211788500946430234432656602250210
02784256328520814055449412104425101426727702947747127089179639677796104532246924
26866468888281582071984897105110796873249319155529397017508931564519976085734473
01418328401172441228064907430770373668317005580029365923508858936023528585280816
0759574737836655413175508131522517/712886527466509305316638415571427292066835886
18858930404520019911543240875811114994764441519138715869117178170195752565129802
64067621009251465871004305131072686268143200196609974862745937188343705015434452
52373974529896314567498212823695623282379401106880926231770886197954079124775455
80493264757378299233527517967352480424636380511370343312147817468508784534856780
21888075373249921995672056932029099390891687487672697950931603520000
[7.485471]
So, the next best option could be to maintain the list of floating point numbers and to reduce it by summing the two smallest numbers in each step...
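For comparison with the Lisp prototype, the rational approach can be sketched in C++ with 64-bit integers. The type and function names are invented for this illustration, and because numerator and denominator grow quickly, this version is only safe for small n (around 20 is comfortably within range); anything larger needs a bigint library, as the update above says.

```cpp
#include <cstdint>

// A fraction num/den kept in lowest terms.
struct Fraction {
    std::int64_t num;
    std::int64_t den;
};

// Greatest common divisor by Euclid's algorithm (inputs are positive here).
inline std::int64_t gcd64(std::int64_t a, std::int64_t b) {
    while (b != 0) {
        const std::int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Exact H_n = 1/1 + 1/2 + ... + 1/n as a reduced fraction.
// Overflows 64-bit integers for larger n; small n only.
inline Fraction harmonic_exact(int n) {
    Fraction acc = {0, 1};
    for (int k = 1; k <= n; ++k) {
        acc.num = acc.num * k + acc.den;  // p/q + 1/k = (p*k + q) / (q*k)
        acc.den = acc.den * k;
        const std::int64_t g = gcd64(acc.num, acc.den);
        acc.num /= g;
        acc.den /= g;
    }
    return acc;
}
```

harmonic_exact(10) yields 7381/2520, the same fraction the Lisp session prints; a single division of num by den at the end gives the floating point value.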
