An efficient way to calculate extremely large powers of 2 - c++

I am solving a problem which requires me to calculate the sum of squares of all possible subsets of a set. I am required to return this sum, modulo 10^9+7
I have understood the logic. I just need to sum the squares and multiply the result by 2^N-1, where N is the size of the set.
But the issue is that N can be as big as 10^5.
And for this, I am getting an integer overflow.
I looked into fast modular exponentiation, but where would I store something as huge as 2^100000?
Can I use the modulo as I calculate the power of 2, to keep the number down? Wouldn't that change the final value?
If anyone can tell me how to get it or what to read into, it would be really helpful.

If you take some value modulo 2^something_big, it just means that you don't have to output bits beyond something_big. For instance x%power(2,10) == x%(1<<10) == x&((1<<10) - 1) == x&1023. (Note the parentheses: in C++, 1<<10 - 1 parses as 1<<9.)
So in your case, the problem is computing the actual value before the modulo while keeping in mind that you only need 99999 bits. All higher bits are to be dropped (and should not influence the result if I understand your premise correctly).
Btw. storing 99999 bits is doable. It's just 13kB.
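For the modulus actually asked about (10^9+7, which is not a power of two), reducing after every multiplication is safe because (a*b) mod m == ((a mod m)*(b mod m)) mod m, so the full 2^100000 never has to exist. A minimal sketch of square-and-multiply under that rule; the name pow_mod and the m <= 2^32 restriction are choices made here, not something from the question:

```cpp
#include <cassert>
#include <cstdint>

// Computes (base^exp) mod m by repeated squaring.
// Every intermediate value is reduced mod m, so no product exceeds (m-1)^2,
// which fits in 64 bits as long as m <= 2^32.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    assert(m > 0 && m <= (1ULL << 32));
    std::uint64_t result = 1 % m;  // 1 % m handles m == 1
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;  // use this bit of the exponent
        base = base * base % m;                   // square for the next bit
        exp >>= 1;
    }
    return result;
}
```

With this, the power of two from the question would be something like pow_mod(2, N - 1, 1000000007) (or whichever exponent the formula needs), and the summed squares should be reduced mod 10^9+7 the same way before the final multiplication.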

Related

Fast, unbiased, integer pseudo random generator with arbitrary bounds

For a monte carlo integration process, I need to pull a lot of random samples from
a histogram that has N buckets, and where N is arbitrary (i.e. not a power of two) but
doesn't change at all during the course of the computation.
By a lot, I mean something on the order of 10^10 (10 billion), so pretty much any
kind of lengthy precomputation is likely worth it in the face of the sheer number of
samples.
I have at my disposal a very fast uniform pseudo random number generator that
typically produces unsigned 64 bits integers (all the ints in the discussion
below are unsigned).
The naive way to pull a sample : histogram[ prng() % histogram.size() ]
The naive way is very slow: the modulo operation is using an integer division (IDIV)
which is terribly expensive and the compiler, not knowing the value of histogram.size()
at compile time, can't be up to its usual magic (i.e. http://www.azillionmonkeys.com/qed/adiv.html)
As a matter of fact, the bulk of my computation time is spent extracting that darn modulo.
The slightly less naive way: I use libdivide (http://libdivide.com/) which is capable
of pulling off a very fast "divide by a constant not known at compile time".
That gives me a very nice win (25% or so), but I have a nagging feeling that I can do
better, here's why:
First intuition: libdivide computes a division. What I need is a modulo, and to get there
I have to do an additional mult and a sub : mod = dividend - divisor*(uint64_t)(dividend/divisor). I suspect there might be a small win there, using libdivide-type
techniques that produce the modulo directly.
Second intuition: I am actually not interested in the modulo itself. What I truly want is
to efficiently produce a uniformly distributed integer value that is guaranteed to be strictly smaller than N.
The modulo is a fairly standard way of getting there, because of two of its properties:
A) mod(prng(), N) is guaranteed to be uniformly distributed if prng() is
B) mod(prng(), N) is guaranteed to belong to [0,N[
But the modulo is/does much more than just satisfy the two constraints above, and in fact
it probably does too much work.
All I need is a function, any function, that obeys constraints A) and B) and is fast.
So, long intro, but here come my two questions:
Is there something out there equivalent to libdivide that computes integer modulos directly ?
Is there some function F(X, N) of integers X and N which obeys the following two constraints:
If X is a random variable uniformly distributed then F(X,N) is also uniformly distributed
F(X, N) is guaranteed to be in [0, N[
(PS : I know that if N is small, I do not need to consume all the 64 bits coming out of
the PRNG. As a matter of fact, I already do that. But like I said, even that optimization
is a minor win when compared to the big fat loss of having to compute a modulo).
Edit : prng() % N is indeed not exactly uniformly distributed. But for N large enough, I don't think it's much of a problem (or is it ?)
Edit 2 : prng() % N is indeed potentially very badly distributed. I had never realized how bad it could get. Ouch. I found a good article on this : http://ericlippert.com/2013/12/16/how-much-bias-is-introduced-by-the-remainder-technique
Under the circumstances, the simplest approach may work the best. One extremely simple approach that might work out if your PRNG is fast enough would be to pre-compute one less than the next larger power of 2 than your N to use as a mask. I.e., given some number that looks like 0001xxxxxxxx in binary (where x means we don't care if it's a 1 or a 0) we want a mask like 000111111111.
From there, we generate numbers as follows:
1. Generate a number
2. AND it with your mask
3. If result > n, go to 1
The exact effectiveness of this will depend on how close N is to a power of 2. Each successive power of 2 is (obviously enough) double its predecessor. So, in the best case N is exactly one less than a power of 2, and our test in step 3 always passes. We've added only a mask and a comparison to the time taken for the PRNG itself.
In the worst case, N is exactly equal to a power of 2. In this case, we expect to throw away roughly half the numbers we generated.
On average, N ends up roughly halfway between powers of 2. That means, on average, we throw away about one out of four inputs. We can nearly ignore the mask and comparison themselves, so our speed loss compared to the "raw" generator is basically equal to the number of its outputs that we discard, or 25% on average.
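The mask-and-retry loop above could be sketched roughly as follows. The names mask_for and bounded are made up for this sketch, and it assumes the generator is any callable returning uniform 64-bit values. Note that it rejects x >= n (exclusive bound) so the result can index a table of n entries directly, which shifts the best and worst cases described above by one:

```cpp
#include <cassert>
#include <cstdint>

// Smallest mask of the form 2^k - 1 that covers every value in [0, n).
std::uint64_t mask_for(std::uint64_t n) {
    assert(n > 0);
    std::uint64_t mask = n - 1;
    // Smear the highest set bit into every lower position.
    mask |= mask >> 1;
    mask |= mask >> 2;
    mask |= mask >> 4;
    mask |= mask >> 8;
    mask |= mask >> 16;
    mask |= mask >> 32;
    return mask;
}

// Draws a value in [0, n) by masking and redrawing on out-of-range results.
// Rng is assumed to be a callable returning uniformly distributed 64-bit values.
template <class Rng>
std::uint64_t bounded(Rng& rng, std::uint64_t n, std::uint64_t mask) {
    std::uint64_t x;
    do {
        x = rng() & mask;
    } while (x >= n);  // fewer than half of the draws are rejected
    return x;
}
```

Since the histogram size never changes, the mask would be computed once up front and reused for every sample.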
If you have fast access to the needed instruction, you could 64-bit multiply prng() by N and return the high 64 bits of the 128-bit result. This is sort of like multiplying a uniform real in [0, 1) by N and truncating, with bias on the order of the modulo version (i.e., practically negligible; a 32-bit version of this answer would have small but perhaps noticeable bias).
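A minimal sketch of that multiply-and-take-the-high-half idea, assuming a compiler that provides unsigned __int128 (a GCC/Clang extension, not standard C++); the function name scale_to_range is made up here:

```cpp
#include <cassert>
#include <cstdint>

// Maps a uniform 64-bit value x to [0, n) by taking the high 64 bits of the
// 128-bit product x * n. No division is involved.
std::uint64_t scale_to_range(std::uint64_t x, std::uint64_t n) {
    assert(n > 0);
    const unsigned __int128 product = static_cast<unsigned __int128>(x) * n;
    return static_cast<std::uint64_t>(product >> 64);
}
```

Sampling would then look like histogram[scale_to_range(prng(), histogram.size())]. On x86-64 the product typically compiles to a single widening multiply, but that is worth checking in the generated code rather than taking on faith.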
Another possibility to explore would be use word parallelism on a branchless modulo algorithm operating on single bits, to get random numbers in batches.
Libdivide, or any other complex way to optimize that modulo, is simply overkill. In a situation like yours, the only sensible approach is to
ensure that your table size is a power of two (add padding if you must!)
replace the modulo operation with a bitmask operation. Like this:
size_t tableSize = 1 << 16;
size_t tableMask = tableSize - 1;
...
histogram[prng() & tableMask]
A bitmask operation is a single cycle on any CPU that is worth its money, you can't beat its speed.
--
Note:
I don't know about the quality of your random number generator, but it may not be a good idea to use the last bits of the random number. Some RNGs produce poor randomness in the last bits and better randomness in the upper bits. If that is the case with your RNG, use a bitshift to get the most significant bits:
size_t bitCount = 16;
...
histogram[prng() >> (64 - bitCount)]
This is just as fast as the bitmask, but it uses different bits.
You could extend your histogram to a "large" power of two by cycling it, filling in the trailing spaces with some dummy value (guaranteed to never occur in the real data). E.g. given a histogram
[10, 5, 6]
extend it to length 16 like so (assuming -1 is an appropriate sentinel):
[10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, -1]
Then sampling can be done via a binary mask histogram[prng() & mask] where mask = new_length - 1 (new_length being a power of two, 16 in this example), with a check for the sentinel value to retry, that is,
int value;
do {
value = histogram[prng() & mask];
} while (value == SENTINEL);
// use `value` here
The extension is longer than necessary to make retries unlikely by ensuring that the vast majority of the elements are valid (e.g. in the example above only 1/16 lookups will "fail", and this rate can be reduced further by extending it to e.g. 64). You could even use a "branch prediction" hint (e.g. __builtin_expect in GCC) on the check so that the compiler orders code to be optimal for the case when value != SENTINEL, which is hopefully the common case.
This is very much a memory vs. speed trade-off.
Just a few ideas to complement the other good answers:
What percent of time is spent in the modulo operation, and how do you know what that percent is? I only ask because sometimes people say something is terribly slow when in fact it is less than 10% of the time and they only think it's big because they're using a silly self-time-only profiler. (I have a hard time envisioning a modulo operation taking a lot of time compared to a random number generator.)
When does the number of buckets become known? If it doesn't change too frequently, you can write a program-generator. When the number of buckets changes, automatically print out a new program, compile, link, and use it for your massive execution.
That way, the compiler will know the number of buckets.
Have you considered using a quasi-random number generator, as opposed to a pseudo-random generator? It can give you higher precision of integration in much fewer samples.
Could the number of buckets be reduced without hurting the accuracy of the integration too much?
The non-uniformity dbaupp cautions about can be side-stepped by rejecting and redrawing values no less than M*(2^64/M) (before taking the modulus).
If M can be represented in no more than 32 bits, you can get more than one value less than M by repeated multiplication (see David Eisenstat's answer) or divmod; alternatively, you can use bit operations to single out bit patterns long enough for M, again rejecting values no less than M.
(I'd be surprised at modulus not being dwarfed in time/cycle/energy consumption by random number generation.)
To fill the buckets, you can use std::binomial_distribution to draw each bucket's count directly instead of feeding the buckets one sample at a time.
The following may help:
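#include <cstddef>
#include <ctime>
#include <random>

int main() {
    int nrolls = 60; // total number of samples left to distribute
    const std::size_t N = 6; // number of buckets
    unsigned int bucket[N] = {};
    std::mt19937 generator(static_cast<unsigned>(std::time(nullptr)));
    for (std::size_t i = 0; i != N; ++i) {
        // Each remaining sample lands in bucket i with probability 1/(N - i).
        double proba = 1. / static_cast<double>(N - i);
        std::binomial_distribution<int> distribution(nrolls, proba);
        bucket[i] = distribution(generator);
        nrolls -= bucket[i];
    }
}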
Instead of integer division you can use fixed-point math, i.e. integer multiplication and a bitshift. Say your prng() returns values in the range 0-65535 and you want this quantized to the range 0-99; then you do (prng()*100)>>16. Just make sure that the multiplication doesn't overflow your integer type, so you may have to shift the result of prng() right first. Note that this mapping is better than modulo since it retains the uniform distribution.
Thanks everyone for your suggestions.
First, I am now thoroughly convinced that modulo is really evil.
It is both very slow and yields incorrect results in most cases.
After implementing and testing quite a few of the suggestions, what
seems to be the best speed/quality compromise is the solution proposed
by @Gene:
pre-compute normalizer as:
auto normalizer = histogram.size() / (1.0+urng.max());
draw samples with:
return histogram[ (uint32_t)floor(urng() * normalizer) ];
It is the fastest of all methods I've tried so far, and as far as I can tell,
it yields a distribution that's much better, even if it may not be as perfect
as the rejection method.
Edit: I implemented David Eisenstat's method, which is more or less the same as Jarkkol's suggestion : index = (rng() * N) >> 32. It works as well as the floating point normalization and it is a little faster (9% faster in fact). So it is my preferred way now.

composition of a number where a number is very large

I am trying to calculate number of ways of composition of a number using numbers 1 and 2.
This can be found using fibonacci series where F(1)=1 and F(2)=2 and
F(n)=F(n-1)+F(n-2)
Since F(n) can be very large, I just need F(n) % 1000000007. To speed up the process I am using Fibonacci exponentiation. I have written two versions of the code for the same problem (both are almost identical), but one of them fails for large numbers. I am not able to figure out which one is correct.
CODE 1
http://ideone.com/iCPEyz
CODE 2
http://ideone.com/Un5p2S
Though I have a feeling the first one should be correct. I am not able to figure out what happens in a case where we multiply, say, a and b, and the value of a has already exceeded the upper limit of its type: when we multiply this by b, how sure can I be that a*b is correct? As far as I know, if a value goes above its data type's limit, it starts again from the lowest value, as in the example below.
#include<iostream>
#include<limits.h>
using namespace std;
int main()
{
cout<<UINT_MAX<<endl;
cout<<UINT_MAX+2;
}
Output
4294967295
1
"Overflow" (you don't really call it that for unsigneds, they wrap around) of unsigned n-bit types will preserve values modulo 2^n only, not modulo an arbitrary modulus (how could they? Try to reproduce the steps with pen and paper). You therefore have to make sure that no operation ever goes over the limits of your type in order to maintain correct results mod 1000000007.

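One way to keep every operation inside the type's limits is to reduce after each addition and multiplication, so that operands are always below the modulus and products fit in 64 bits. The sketch below uses the fast-doubling identities F(2k) = F(k)(2F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2. This is not necessarily what either linked program does, and the names fib_pair and compositions are made up here:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

constexpr std::uint64_t MOD = 1000000007ULL;

// Returns {F(n) mod MOD, F(n+1) mod MOD} for the standard Fibonacci numbers
// (F(0) = 0, F(1) = 1). All operands stay below MOD < 2^30, so every product
// and sum below fits comfortably in 64 bits and never wraps.
std::pair<std::uint64_t, std::uint64_t> fib_pair(std::uint64_t n) {
    if (n == 0) return {0, 1};
    const std::pair<std::uint64_t, std::uint64_t> half = fib_pair(n / 2);
    const std::uint64_t a = half.first;   // F(k)
    const std::uint64_t b = half.second;  // F(k+1)
    // Adding MOD before subtracting keeps the difference non-negative.
    const std::uint64_t c = a * ((2 * b + MOD - a) % MOD) % MOD;  // F(2k)
    const std::uint64_t d = (a * a + b * b) % MOD;                // F(2k+1)
    if (n % 2 == 0) return {c, d};
    return {d, (c + d) % MOD};
}

// Number of compositions of n into parts 1 and 2, modulo MOD. With the
// question's indexing (F(1) = 1, F(2) = 2) this is the standard F(n+1).
std::uint64_t compositions(std::uint64_t n) {
    assert(n >= 1);
    return fib_pair(n).second;
}
```

The recursion depth is only about log2(n), so even n around 10^18 needs roughly 60 nested calls.
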
Fastest way to add numbers in a very large arithmetic series?

I'm trying to minimize overhead as much as possible when adding numbers in an arithmetic series. I'm talking about a very large set, such as from 1 to 2^128. Is there any fast way of doing this? If so, what would it be without actually using the arithmetic sequence sum formula? Just as a reference, the sum from 1 to 2^128 is:
57896044618658097711785492504343953926464851149359812787997104700240680714240
Only fast way is to use the formula:
n * (n+1) / 2
Any other method (adding naively) will take way too long! (Even if you had a million years on a supercomputer, you wouldn't finish the calculation).
For such a large integer, though, you cannot use normal integers. You will need a big-integer type, so get a big-integer library (search for one, or see e.g. https://mattmccutchen.net/bigint/).
Note: a 256-bit integer may be able to hold results up to around that scale, but it is quite platform and compiler-dependent, as to whether 256-bit integers are readily available, and how they are used.
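To illustrate the formula without pulling in a library, here is a sketch limited to 64-bit n, where the result fits in unsigned __int128 (a GCC/Clang extension). It is an assumption-laden stand-in: n = 2^128 as in the question would need a wider type, such as one from a big-integer library. The name sum_to is made up here:

```cpp
#include <cstdint>

// Sum 1 + 2 + ... + n via n(n+1)/2.
// Exactly one of n and n+1 is even; halving that factor before multiplying
// means the product is the final result rather than twice it.
unsigned __int128 sum_to(std::uint64_t n) {
    unsigned __int128 a = n;
    unsigned __int128 b = a + 1;  // computed in 128 bits, so n + 1 cannot wrap
    if (a % 2 == 0) {
        a /= 2;
    } else {
        b /= 2;
    }
    return a * b;
}
```

The same two steps (halve the even factor, then multiply once) carry over unchanged to a big-integer type.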

Dividing two integer without casting to double

I have two integer variables, partial and total. It is a progress, so partial starts at zero and goes up one-by-one to the value of total.
If I want to get a fraction value indicating the progress(from 0.0 to 1.0) I may do the following:
double fraction = double(partial)/double(total);
But if total is too big, the conversion to double may lose information.
Actually, the amount of lost information is tolerable, but I was wondering if there is an algorithm or a std function to get the fraction between two values losing less information.
The obvious answer is to multiply partial by some scaling factor;
100 is a frequent choice, since the division then gives the percent as
an integral value (rounded down). The problem is that if the values are
so large that they can't be represented precisely in a double, there's
also a good chance that the multiplication by the scaling factor will
overflow. (For that matter, if they're that big, the initial values
will overflow an int on most machines.)
Yes, there is an algorithm losing less information. Assuming you want to find the double value closest to the mathematical value of the fraction, you need an integer type capable of holding total << 53. You can create your own or use a library like GMP for that. Then
scale partial so that (total << 52) <= numerator < (total << 53), where numerator = (partial << m)
let q be the integer quotient numerator / total and r = numerator % total
let mantissa = q if 2*r < total, and q+1 if 2*r > total; if 2*r == total, take q+1 to round half up, q to round half down, or whichever of the two is even for round-half-to-even
result = scalbn(mantissa, -m)
Most of the time you get the same value as for (double)partial / (double)total. Differences of one least significant bit are probably not too rare; a difference of two or three LSBs wouldn't surprise me either, but would be rare; a bigger difference is unlikely (that said, somebody will probably give an example soon).
Now, is it worth the effort? Usually not.
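For what it's worth, the steps above can be sketched for 64-bit inputs without GMP, assuming the compiler provides unsigned __int128 (a GCC/Clang extension), which is wide enough to hold total << 53. The function name precise_fraction is made up here, std::ldexp stands in for scalbn, and ties are rounded to even:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Correctly rounded partial/total for 64-bit inputs with partial <= total.
double precise_fraction(std::uint64_t partial, std::uint64_t total) {
    assert(total != 0 && partial <= total);
    if (partial == 0) return 0.0;
    using u128 = unsigned __int128;
    const u128 t = total;
    u128 numerator = partial;
    int m = 0;
    // Scale so that (total << 52) <= numerator < (total << 53),
    // which makes the quotient exactly 53 bits wide.
    while (numerator < (t << 52)) {
        numerator <<= 1;
        ++m;
    }
    std::uint64_t q = static_cast<std::uint64_t>(numerator / t);
    const u128 r = numerator % t;
    // Round to nearest, ties to even.
    if (2 * r > t || (2 * r == t && (q & 1) != 0)) ++q;
    // q <= 2^53, so the conversion to double is exact.
    return std::ldexp(static_cast<double>(q), -m);
}
```

For small operands this should agree with plain double division; the two can only differ once partial or total no longer fits exactly in a double.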
If you want a precise representation of the fraction, you'd have some sort of structure containing the numerator and the denominator as integers, and, for unique representation, you'd just factor out the greatest common divisor (with a special case for zero). If you are just worried that after repeated operations the floating-point representation might not be accurate enough, you need to find some courses on numerical analysis, as that issue isn't strictly a programming issue. There are better ways than others to calculate certain results, but I can't really go into them (I've never done the coursework, just read about it).

Accurate evaluation of 1/1 + 1/2 + ... 1/n row

I need to evaluate the sum of the series 1/1+1/2+1/3+...+1/n. Since floating-point evaluation in C++ is not completely accurate, the order of summation plays an important role. The expression 1/n+1/(n-1)+...+1/2+1/1 gives the more accurate result.
So I need to find the order of summation that provides the maximum accuracy.
I don't even know where to begin.
The preferred implementation language is C++.
Sorry for my English, if there are any mistakes.
For large n you'd better use asymptotic formulas, like the ones on http://en.wikipedia.org/wiki/Harmonic_number;
Another way is to use exp-log transformation. Basically:
H_n = 1 + 1/2 + 1/3 + ... + 1/n = log(exp(1 + 1/2 + 1/3 + ... + 1/n)) = log(exp(1) * exp(1/2) * exp(1/3) * ... * exp(1/n)).
Exponents and logarithms can be calculated pretty quickly and accurately by your standard library. Using multiplication you should get much more accurate results.
If this is your homework and you are required to use simple addition, you'd better add from the smallest one to the largest one, as others suggested.
The reason for the lack of accuracy is the precision of the float, double, and long double types. They only store so many "decimal" places. So adding a very small value to a large value has no effect, the small term is "lost" in the larger one.
The series you're summing has a "long tail", in the sense that the small terms should add up to a large contribution. But if you sum in descending order, then after a while each new small term will have no effect (even before that, most of its decimal places will be discarded). Once you get to that point you can add a billion more terms, and if you do them one at a time it still has no effect.
I think that summing in ascending order should give best accuracy for this kind of series, although it's possible there are some odd corner cases where errors due to rounding to powers of (1/2) might just so happen to give a closer answer for some addition orders than others. You probably can't really predict this, though.
I don't even know where to begin.
Here: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Actually, if you're doing the summation for large N, adding in order from smallest to largest is not the best way -- you can still get into a situation where the numbers you're adding are too small relative to the sum to produce an accurate result.
Look at the problem this way: You have N summations, regardless of ordering, and you wish to have the least total error. Thus, you should be able to get the least total error by minimizing the error of each summation -- and you minimize the error in a summation by adding values as nearly close to each other as possible. I believe that following that chain of logic gives you a binary tree of partial sums:
Sum[0,i] = value[i]
Sum[1,i/2] = Sum[0,i] + Sum[0,i+1]
Sum[j+1,i/2] = Sum[j,i] + Sum[j,i+1]
and so on until you get to a single answer.
Of course, when N is not a power of two, you'll end up with leftovers at each stage, which you need to carry over into the summations at the next stage.
(The margins of StackOverflow are of course too small to include a proof that this is optimal. In part because I haven't taken the time to prove it. But it does work for any N, however large, as all of the additions are adding values of nearly identical magnitude. Well, all but log(N) of them in the worst not-power-of-2 case, and that's vanishingly small compared to N.)
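A sketch of the idea in its usual recursive form (pairwise summation). It halves the index range rather than pairing adjacent elements level by level, so the tree shape differs slightly from the recurrence above when N is not a power of two; the function names are made up here:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums v[lo, hi) by recursively adding the sums of the two halves, so most
// additions combine partial sums of similar magnitude.
double pairwise_sum(const std::vector<double>& v, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= v.size());
    if (hi - lo == 0) return 0.0;
    if (hi - lo == 1) return v[lo];
    const std::size_t mid = lo + (hi - lo) / 2;
    return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid, hi);
}

// Harmonic number H_n computed with pairwise summation.
double harmonic_pairwise(std::size_t n) {
    std::vector<double> terms(n);
    for (std::size_t i = 0; i < n; ++i) {
        terms[i] = 1.0 / static_cast<double>(i + 1);
    }
    return pairwise_sum(terms, 0, n);
}
```

This stores all n terms, so it trades memory for accuracy; for very large n the terms could instead be generated on the fly inside the recursion.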
http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
You can find libraries with ready for use implementation for C/C++.
For example http://www.apfloat.org/apfloat/
Unless you use some accurate closed-form representation, a small-to-large ordered summation is likely to be most accurate simple solution (it's not clear to me why a log-exp would help - that's a neat trick, but you're not winning anything with it here, as far as I can tell).
You can further gain precision by realizing that after a while, the sum will become "quantized": Effectively, when you have 2 digits of precision, adding 1.3 to 41 results in 42, not 42.3 - but you achieve almost a precision doubling by maintaining an "error" term. This is called Kahan Summation. You'd compute the error term (42-41-1.3 == -0.3) and correct that in the next addition by adding 0.3 to the next term before you add it in again.
Kahan Summation in addition to a small-to-large ordering is liable to be as accurate as you'll ever need to get. I seriously doubt you'll ever need anything better for the harmonic series - after all, even after 2^45 iterations (crazy many) you'd still only be dealing with numbers that are at least 1/2^45 large, and a sum that's on the order of 45 (<2^6), for an order of magnitude difference of 51 powers-of-two - i.e. even still representable in a double precision variable if you add in the "wrong" order.
If you go small-to-large, and use Kahan Summation, the sun's probably going to extinguish before today's processors reach a percent of error - and you'll run into other tricky accuracy issues just due to the individual term error on that scale first anyhow (being that a number of the order of 2^53 or larger cannot be represented accurately as a double at all anyhow.)
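A minimal sketch of Kahan summation combined with the small-to-large ordering; the function name harmonic_kahan is made up here. One caveat worth stating as an assumption: the compensation relies on the compiler not reassociating floating-point arithmetic, so options like -ffast-math can silently optimize it away.

```cpp
#include <cstddef>

// Harmonic number H_n, summed from the smallest term (1/n) up to 1/1,
// carrying a compensation term for the low-order bits each addition loses.
double harmonic_kahan(std::size_t n) {
    double sum = 0.0;
    double comp = 0.0;  // rounding error accumulated by previous additions
    for (std::size_t i = n; i >= 1; --i) {
        const double term = 1.0 / static_cast<double>(i) - comp;  // corrected term
        const double next = sum + term;  // low bits of term may be lost here
        comp = (next - sum) - term;      // recover what was actually lost
        sum = next;
    }
    return sum;
}
```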
I'm not sure about the order of summation playing an important role; I haven't heard that before. I guess you want to do this in floating-point arithmetic, so the first thing is to think more in line with (1.0/1.0 + 1.0/2.0 + 1.0/3.0); otherwise the compiler will do integer division.
To determine the order of evaluation, maybe a for loop or brackets?
e.g.
float f = 0.0;
for (int i=n; i>0; --i)
{
f += 1.0/static_cast<float>(i);
}
Oh, I forgot to say: compilers normally have switches that determine the floating-point evaluation mode. This may be related to what you say about the order of summation. In Visual C++ these are found in the code-generation compile settings; in g++ there are -f options (such as -ffloat-store and -ffast-math) that affect this.
Actually, the other guy is right - you should do the summation in order of smallest component first; so
1/n + 1/(n-1) .. 1/1
This is because the precision of a floating-point number is linked to its scale: if you start at 1 you'll have 23 bits of precision relative to 1.0. If you start at a smaller number the precision is relative to the smaller number, so you'll get 23 bits of precision relative to 1e-200 or whatever. Then as the number gets bigger rounding error will occur, but the overall error will be less than in the other direction.
As all your numbers are rationals, the easiest (and also maybe the fastest, as it will have to do less floating point operations) would be to do the computations with rationals (tuples of 2 integers p,q), and then do just one floating point division at the end.
Update: to use this technique effectively you will need to use bigints for p & q, as they grow quite fast...
A quick prototype in Lisp, which has built-in rationals, shows:
(defun sum_harmonic (n acc)
(if (= n 0) acc (sum_harmonic (- n 1) (+ acc (/ 1 n)))))
(sum_harmonic 10 0)
7381/2520
[2.9289682]
(sum_harmonic 100 0)
14466636279520351160221518043104131447711/278881500918849908658135235741249214272
[5.1873775]
(sum_harmonic 1000 0)
53362913282294785045591045624042980409652472280384260097101349248456268889497101
75750609790198503569140908873155046809837844217211788500946430234432656602250210
02784256328520814055449412104425101426727702947747127089179639677796104532246924
26866468888281582071984897105110796873249319155529397017508931564519976085734473
01418328401172441228064907430770373668317005580029365923508858936023528585280816
0759574737836655413175508131522517/712886527466509305316638415571427292066835886
18858930404520019911543240875811114994764441519138715869117178170195752565129802
64067621009251465871004305131072686268143200196609974862745937188343705015434452
52373974529896314567498212823695623282379401106880926231770886197954079124775455
80493264757378299233527517967352480424636380511370343312147817468508784534856780
21888075373249921995672056932029099390891687487672697950931603520000
[7.485471]
So, the next best option could be to maintain the list of floating-point values and to reduce it by summing the two smallest numbers in each step...
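That last idea (repeatedly summing the two smallest values) maps naturally onto a min-heap. A rough sketch using std::priority_queue, with a made-up function name; it costs O(n log n) time and O(n) memory, so it is more of an illustration than something to run for huge n:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Harmonic number H_n, computed by always adding the two smallest values
// currently in the pool and putting their sum back.
double harmonic_smallest_first(std::size_t n) {
    // std::greater turns the default max-heap into a min-heap.
    std::priority_queue<double, std::vector<double>, std::greater<double>> heap;
    for (std::size_t i = 1; i <= n; ++i) {
        heap.push(1.0 / static_cast<double>(i));
    }
    if (heap.empty()) return 0.0;
    while (heap.size() > 1) {
        const double a = heap.top();
        heap.pop();
        const double b = heap.top();
        heap.pop();
        heap.push(a + b);
    }
    return heap.top();
}
```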