How to optimize calculating large numbers in Python - python-2.7

When calculating a very large number in Python, such as 2^(2^1000000), the program runs out of memory. Is there a way to break this calculation into smaller chunks so that less memory is used?
EDIT:
I only want to display a modulus of this number, so I really only need to calculate the last 10 or so digits.

If you really want to compute all the digits of this number, you will not just run out of memory but out of lifetime, even using every computer in the universe: the result has on the order of 10^301029 decimal digits.

What you are asking for is one million successive squarings of 2, modulo 10^10.
So it suffices to implement the product modulo 10^10 and iterate. (The product of two 10-digit numbers can exceed 64 bits, so a fixed-width language would need a 128-bit intermediate; Python's arbitrary-precision integers handle it transparently.)
If I am right, the answer is 7841627136.
P = 2
for i in range(1000000):
    P = (P * P) % 10000000000   # after k squarings, P == 2**(2**k) % 10**10
print P
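Equivalently, Python's built-in three-argument pow performs the whole modular exponentiation in one call (the exponent 2**1000000 is itself only about 300,000 digits, so it is cheap to form):
print pow(2, 2**1000000, 10**10)   # same last ten digits via built-in modular exponentiation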

Related

An efficient way to calculate extremely large powers of 2

I am solving a problem which requires me to calculate the sum of squares of all possible subsets of a set. I am required to return this sum, modulo 10^9+7
I have understood the logic. I just need to sum the squares and multiply the result by 2^(N-1), where N is the size of the set.
But the issue is that N can be as big as 10^5.
And for this, I am getting an integer overflow.
I looked into fast modular exponentiation but still where would I store something as huge as 2^100000 ?
Can I use the modulo as I calculate the power of 2, to keep the number down? Wouldn't that change the final value?
If anyone can tell me how to get it or what to read into, it would be really helpful.
If you take some value modulo 2^something_big, it just means that you don't have to output bits beyond something_big. For instance x % power(2,10) == x % (1<<10) == x & ((1<<10) - 1) == x & 1023.
So in your case, the problem is computing the actual value before the modulo while keeping in mind that you only need 99999 bits. All higher bits are to be dropped (and should not influence the result if I understand your premise correctly).
Btw. storing 99999 bits is doable. It's just 13kB.
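For what it's worth, when the modulus is the 10^9+7 from the question rather than a power of two, you never need to store anything close to 2^100000: reducing after every multiplication keeps each intermediate below the modulus, and since (a*b) % m == ((a % m) * (b % m)) % m this does not change the final value. A minimal sketch in Python (the built-in pow(2, n - 1, 10**9 + 7) does the same thing):
MOD = 10**9 + 7

def pow2_mod(exp, mod=MOD):
    # square-and-multiply, reducing modulo mod at every step
    result, base = 1, 2
    while exp:
        if exp & 1:
            result = result * base % mod
        base = base * base % mod
        exp >>= 1
    return result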

How can I calculate this prime product faster with PARI/GP?

I want to calculate the product of 1 - 1/p, where p runs over the primes up to 10^10.
I know the approximation exp(-gamma)/ln(10^10), where gamma is the Euler-Mascheroni constant and ln the natural logarithm, but I want to calculate the exact product to see how close the approximation is.
The problem is that PARI/GP takes a very long time to calculate the prime numbers from about 4.2 * 10^9 to 10^10. The prodeuler command also takes very long.
Is there any method to speed up the calculation with PARI/GP?
I'm inclined to think the performance issue has mostly to do with the rational numbers rather than the generation of primes up to 10^10.
As a quick test I ran
a(n)=my(t=0);forprime(p=1,n,t+=p);t
with a(10^10) and it computed in a couple of minutes which seems reasonable.
The corresponding program for your request is:
a(n)=my(t=1);forprime(p=1,n,t*=(1-1/p));t
This runs much slower than the first program, so my question back would be: is there a way to reformulate the computation so that rationals are avoided until the very end? (Is my formulation above even what you intended?) The numbers are extremely large even at 10^6, so it is no wonder the computation takes a long time; perhaps the issue has less to do with the numbers being rational than simply with their size.
One trick I have used to compute large products is to split the problem so that at each stage the numbers on the left and right of the multiplication are roughly the same size. For example to compute a large factorial, say 8! it is much more efficient to compute ((1*8)*(2*7))*((3*6)*(4*5)) rather than the obvious left to right approach.
The following is a quick attempt to do what you want using exact arithmetic. It takes approximately 8 minutes up to 10^8, but the numerator already has 1.9 million digits, so it is unlikely this could ever get to 10^10 before running out of memory. [Even for this computation I needed to increase the stack size.]
xvecprod(v)={if(#v<=1, if(#v,v[1],1), xvecprod(v[1..#v\2]) * xvecprod(v[#v\2+1..#v]))}
faster(n)={my(b=10^6);xvecprod(apply(i->xvecprod(
apply(p->1-1/p, select(isprime, [i*b+1..min((i+1)*b,n)]))), [0..n\b]))}
Using decimals will definitely speed things up. The following runs reasonably quickly for up to 10^8 with 1000 digits of precision.
xvecprod(v)={if(#v<=1, if(#v,v[1],1), xvecprod(v[1..#v\2]) * xvecprod(v[#v\2+1..#v]))}
fasterdec(n)={my(b=10^6);xvecprod(apply(i->xvecprod(
apply(p->1-1.0/p,select(isprime,[i*b+1..min((i+1)*b,n)]))),[0..n\b]))}
The fastest method using decimals is the simplest:
a(n)=my(t=1);forprime(p=1,n,t*=(1-1.0/p));t
With precision set to 100 decimal digits, this produces a(10^9) in 2 minutes and a(10^10) in 22 minutes.
10^9: 0.02709315486987096878842689330617424348105764850
10^10: 0.02438386113804076644782979967638833694491163817
When working with decimals, the trick of splitting the multiplications does not improve performance because the numbers always have the same number of digits. However, I have left the code, since there is a potential for better accuracy. (at least in theory.)
I am not sure I can give any good advice on the number of digits of precision required. (I'm more of a programmer type and tend to work with whole numbers). However, my understanding is that there is a possibility of losing 1 binary digit of precision with every multiplication, although since rounding can go either way on average it won't be quite so bad. Given that this is a product of over 450 million terms, that would imply all precision is lost.
However, using the algorithm that splits the computation, each value only goes through around 30 multiplications, so that should only result in a loss of at most 30 binary digits (10 decimal digits) of precision so working with 100 digits of precision should be sufficient. Surprisingly, I get the same answers either way, so the simple naive method seems to work.
During my tests, I have noticed that using forprime is much faster than using isprime. (For example, the fasterdec version took almost 2 hours, compared with the simple version which took 22 minutes to get to the same result.) Similarly, sum(p=1,10^9,isprime(p)) takes approximately 8 minutes, compared with my(t=1);forprime(p=1,10^9,t++);t which takes just 11 seconds.

Fast, unbiased, integer pseudo random generator with arbitrary bounds

For a monte carlo integration process, I need to pull a lot of random samples from
a histogram that has N buckets, and where N is arbitrary (i.e. not a power of two) but
doesn't change at all during the course of the computation.
By a lot, I mean something on the order of 10^10 (10 billion), so pretty much any kind of lengthy precomputation is likely to be worth it in the face of the sheer number of samples.
I have at my disposal a very fast uniform pseudo random number generator that
typically produces unsigned 64 bits integers (all the ints in the discussion
below are unsigned).
The naive way to pull a sample : histogram[ prng() % histogram.size() ]
The naive way is very slow: the modulo operation is using an integer division (IDIV)
which is terribly expensive and the compiler, not knowing the value of histogram.size()
at compile time, can't be up to its usual magic (i.e. http://www.azillionmonkeys.com/qed/adiv.html)
As a matter of fact, the bulk of my computation time is spent extracting that darn modulo.
The slightly less naive way: I use libdivide (http://libdivide.com/) which is capable
of pulling off a very fast "divide by a constant not known at compile time".
That gives me a very nice win (25% or so), but I have a nagging feeling that I can do
better, here's why:
First intuition: libdivide computes a division. What I need is a modulo, and to get there
I have to do an additional mult and a sub : mod = dividend - divisor*(uint64_t)(dividend/divisor). I suspect there might be a small win there, using libdivide-type
techniques that produce the modulo directly.
Second intuition: I am actually not interested in the modulo itself. What I truly want is
to efficiently produce a uniformly distributed integer value that is guaranteed to be strictly smaller than N.
The modulo is a fairly standard way of getting there, because of two of its properties:
A) mod(prng(), N) is guaranteed to be uniformly distributed if prng() is
B) mod(prgn(), N) is guaranteed to belong to [0,N[
But the modulo is/does much more that just satisfy the two constraints above, and in fact
it does probably too much work.
All I need is a function, any function that obeys constraints A) and B) and is fast.
So, long intro, but here comes my two questions:
Is there something out there equivalent to libdivide that computes integer modulos directly ?
Is there some function F(X, N) of integers X and N which obeys the following two constraints:
If X is a random variable uniformly distributed, then F(X, N) is also uniformly distributed
F(X, N) is guaranteed to be in [0, N[
(PS: I know that if N is small, I do not need to consume all the 64 bits coming out of the PRNG. As a matter of fact, I already do that. But like I said, even that optimization is a minor win compared to the big fat loss of having to compute a modulo.)
Edit: prng() % N is indeed not exactly uniformly distributed. But for N large enough, I don't think it's much of a problem (or is it?)
Edit 2 : prng() % N is indeed potentially very badly distributed. I had never realized how bad it could get. Ouch. I found a good article on this : http://ericlippert.com/2013/12/16/how-much-bias-is-introduced-by-the-remainder-technique
Under the circumstances, the simplest approach may work the best. One extremely simple approach that might work out if your PRNG is fast enough would be to pre-compute, as a mask, one less than the next power of 2 larger than your N. I.e., given some number that looks like 0001xxxxxxxx in binary (where x means we don't care whether it's a 1 or a 0), we want a mask like 000111111111.
From there, we generate numbers as follows:
Generate a number
AND it with your mask
if the result is out of range (>= N), go back to step 1
The exact effectiveness of this will depend on how close N is to a power of 2. Each successive power of 2 is (obviously enough) double its predecessor. So, in the best case N is exactly one less than a power of 2, and our test in step 3 always passes. We've added only a mask and a comparison to the time taken for the PRNG itself.
In the worst case, N is exactly equal to a power of 2. In this case, we expect to throw away roughly half the numbers we generated.
On average, N ends up roughly halfway between powers of 2. That means, on average, we throw away about one out of four inputs. We can nearly ignore the mask and comparison themselves, so our speed loss compared to the "raw" generator is basically equal to the number of its outputs that we discard, or 25% on average.
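A sketch of the masking idea above, in Python for brevity (random.getrandbits stands in for the questioner's fast 64-bit generator, and N is the number of buckets):
import random

def prng():
    return random.getrandbits(64)      # stand-in for the fast 64-bit PRNG

def draw_index(N):
    mask = (1 << N.bit_length()) - 1   # one less than the next power of two above N
    while True:
        r = prng() & mask              # cheap: a single AND
        if r < N:                      # in range: use it; otherwise redraw
            return r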
If you have fast access to the needed instruction, you could 64-bit multiply prng() by N and return the high 64 bits of the 128-bit result. This is sort of like multiplying a uniform real in [0, 1) by N and truncating, with bias on the order of the modulo version (i.e., practically negligible; a 32-bit version of this answer would have small but perhaps noticeable bias).
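In Python terms, purely to illustrate the arithmetic (x stands for one uniform 64-bit draw from the generator):
def mulhi_index(x, N):
    # x: one uniform 64-bit draw; result is always in [0, N)
    return (x * N) >> 64   # high 64 bits of the 128-bit product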
Another possibility to explore would be to use word parallelism on a branchless modulo algorithm operating on single bits, to get random numbers in batches.
Libdivide, or any other complex way to optimize that modulo, is simply overkill. In a situation like yours, the only sensible approach is to
ensure that your table size is a power of two (add padding if you must!)
replace the modulo operation with a bitmask operation. Like this:
size_t tableSize = 1 << 16;
size_t tableMask = tableSize - 1;
...
histogram[prng() & tableMask]
A bitmask operation is a single cycle on any CPU that is worth its money; you can't beat its speed.
--
Note:
I don't know about the quality of your random number generator, but it may not be a good idea to use the last bits of the random number. Some RNGs produce poor randomness in the last bits and better randomness in the upper bits. If that is the case with your RNG, use a bitshift to get the most significant bits:
size_t bitCount = 16;
...
histogram[prng() >> (64 - bitCount)]
This is just as fast as the bitmask, but it uses different bits.
You could extend your histogram to a "large" power of two by cycling it, filling in the trailing spaces with some dummy value (guaranteed to never occur in the real data). E.g. given a histogram
[10, 5, 6]
extend it to length 16 like so (assuming -1 is an appropriate sentinel):
[10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, -1]
Then sampling can be done via a binary mask histogram[prng() & mask] where mask = new_length - 1 (new_length being a power of two), with a check for the sentinel value to retry, that is,
int value;
do {
value = histogram[prng() & mask];
} while (value == SENTINEL);
// use `value` here
The extension is longer than necessary to make retries unlikely by ensuring that the vast majority of the elements are valid (e.g. in the example above only 1/16 lookups will "fail", and this rate can be reduced further by extending it to e.g. 64). You could even use a "branch prediction" hint (e.g. __builtin_expect in GCC) on the check so that the compiler orders code to be optimal for the case when value != SENTINEL, which is hopefully the common case.
This is very much a memory vs. speed trade-off.
Just a few ideas to complement the other good answers:
What percent of time is spent in the modulo operation, and how do you know what that percent is? I only ask because sometimes people say something is terribly slow when in fact it is less than 10% of the time and they only think it's big because they're using a silly self-time-only profiler. (I have a hard time envisioning a modulo operation taking a lot of time compared to a random number generator.)
When does the number of buckets become known? If it doesn't change too frequently, you can write a program-generator. When the number of buckets changes, automatically print out a new program, compile, link, and use it for your massive execution.
That way, the compiler will know the number of buckets.
Have you considered using a quasi-random number generator, as opposed to a pseudo-random generator? It can give you higher precision of integration in much fewer samples.
Could the number of buckets be reduced without hurting the accuracy of the integration too much?
The non-uniformity dbaupp cautions about can be side-stepped by rejecting and redrawing values no less than M*floor(2^64/M) (before taking the modulus).
If M can be represented in no more than 32 bits, you can get more than one value less than M by repeated multiplication (see David Eisenstat's answer) or divmod; alternatively, you can use bit operations to single out bit patterns long enough for M, again rejecting values no less than M.
(I'd be surprised at modulus not being dwarfed in time/cycle/energy consumption by random number generation.)
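Concretely, the rejection threshold described above looks like this (a Python sketch; random.getrandbits stands in for the uniform 64-bit source, M is the number of buckets):
import random

def unbiased_mod(M):
    limit = (1 << 64) - ((1 << 64) % M)   # == M * floor(2**64 / M)
    while True:
        x = random.getrandbits(64)
        if x < limit:                     # values >= limit would bias the modulus
            return x % M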
Instead of drawing the samples one by one, you may use std::binomial_distribution to fill each bucket directly.
The following may help:
#include <cstddef>
#include <ctime>
#include <random>

int nrolls = 60; // total number of samples to distribute
const std::size_t N = 6; // number of buckets
unsigned int bucket[N] = {};
std::mt19937 generator(time(nullptr));
for (std::size_t i = 0; i != N; ++i) {
    // each remaining sample falls into bucket i with probability 1/(N - i)
    double proba = 1. / static_cast<double>(N - i);
    std::binomial_distribution<int> distribution(nrolls, proba);
    bucket[i] = distribution(generator);
    nrolls -= bucket[i]; // the rest are shared among the later buckets
}
Instead of integer division you can use fixed-point math, i.e. integer multiplication and a bitshift. Say your prng() returns values in the range 0-65535 and you want them quantized to the range 0-99: then you compute (prng()*100)>>16. Just make sure that the multiplication doesn't overflow your integer type, so you may have to shift the result of prng() right first. Note that this mapping is just as uniform as the modulo, but without the expensive division.
Thanks everyone for your suggestions.
First, I am now thoroughly convinced that modulo is really evil.
It is both very slow and yields incorrect results in most cases.
After implementing and testing quite a few of the suggestions, what
seems to be the best speed/quality compromise is the solution proposed
by #Gene:
pre-compute normalizer as:
auto normalizer = histogram.size() / (1.0+urng.max());
draw samples with:
return histogram[ (uint32_t)floor(urng() * normalizer) ];
It is the fastest of all methods I've tried so far, and as far as I can tell,
it yields a distribution that's much better, even if it may not be as perfect
as the rejection method.
Edit: I implemented David Eisenstat's method, which is more or less the same as Jarkkol's suggestion : index = (rng() * N) >> 32. It works as well as the floating point normalization and it is a little faster (9% faster in fact). So it is my preferred way now.

Sieve of Eratosthenes using precalculated primes

I have all the prime numbers that can be stored in a 32-bit unsigned int and I want to use them to generate some 64-bit prime numbers. Using trial division is too slow, even with optimizations in logic and compilation.
I'm trying to modify the Sieve of Eratosthenes to work with the predefined list, as follows:
1. in array A, the primes from 2 to 4294967291
2. in array B, the numbers from 2^32 to X, incremented by 1
3. find C, which is the first multiple of the current prime
4. from C, mark and jump by the current prime till X
5. go to 1.
The problem is step 3, which uses a modulus to find the prime multiple; such an operation is the very reason I didn't use trial division.
Is there any better way to implement step 3, or the whole algorithm?
Thank you.
Increment by 2, not 1. That's the minimal optimization you should always use - working with odds only. No need to bother with the evens.
In C++, use vector<bool> for the sieve array. It gets automatically bit-packed.
Pre-calculate your core primes with a segmented sieve. Then continue to work by big enough segments that fit in your cache, without adding new primes to the core list. For each prime p, maintain an additional long long int value: its current multiple (starting from the prime's square, of course). The step value is 2p in value, or p as an offset in the odds-packed sieve array, where the i-th entry stands for the number o + 2i, o being the least odd not below the range start. No need to sort by the multiples' values; the upper bound of the core primes in use rises monotonically.
sqrt(0xFFFFFFFFFF) = 1048576. PrimePi(1048576)=82025 primes is all you need in your core primes list. That's peanuts.
Integer arithmetic on long long ints works just fine to find the modulo, and hence the smallest multiple in range, when you first start (or resume your work).
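For step 3 of the question, the modulo is only needed once per prime per segment. A sketch (in Python; lo is the odd number at which the current segment starts, and the helper name is just illustrative):
def first_multiple(p, lo):
    # smallest odd multiple of the odd prime p that is >= max(lo, p*p)
    m = max(p * p, ((lo + p - 1) // p) * p)   # one division per prime per segment
    if m % 2 == 0:                            # even multiples never appear in an odds-only sieve
        m += p
    return m
From there you mark every 2*p-th number (every p-th entry of the odds-packed array), exactly as described above.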
See also a related answer with pseudocode, and another with C code.

Fastest way to add numbers in a very large arithmetic series?

I'm trying to minimize overhead as much as possible when adding numbers in an arithmetic series. I'm talking about a very large set, such as from 1 to 2^128. Is there any fast way of doing this? If so, what would it be without actually using the arithmetic sequence sum formula? Just as a reference, the sum from 1 to 2^128 is:
57896044618658097711785492504343953926464851149359812787997104700240680714240
The only fast way is to use the formula:
n * (n+1) / 2
Any other method (adding naively) will take way too long! (Even if you had a million years on a supercomputer, you wouldn't finish the calculation).
For such a large integer, though, you cannot use normal machine integers. You will need a big-integer object, so get a big-integer library, e.g. https://mattmccutchen.net/bigint/.
Note: a 256-bit integer type could hold results of roughly that size, but whether 256-bit integers are readily available, and how they are used, is quite platform- and compiler-dependent.
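As a quick sanity check in Python (whose integers are arbitrary precision): the value quoted in the question is n*(n+1)/2 with n = 2^128 - 1, i.e. the sum of all positive integers below 2^128.
n = 2**128 - 1
print(n * (n + 1) // 2)
# prints 57896044618658097711785492504343953926464851149359812787997104700240680714240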