Sieve of Eratosthenes using precalculated primes - c++

I've all prime numbers that can be stored in 32bit unsigned int and I want to use them to generate some 64bit prime numbers. using trial division is too slow even with optimizations in logic and compilation.
I'm trying to modify Sieve of Eratosthenes to work with the predefined list, as follow:
in array A from 2 to 4294967291
in array B from 2^32 to X inc by 1
find C which is first multiple of current prime.
from C mark and jump by current prime till X.
go to 1.
The problem is step 3 which use modulus to find the prime multiple, such operation is the reason i didn't use trail division.
Is there any better way to implement step 3 or the whole algorithm.
thank you.

Increment by 2, not 1. That's the minimal optimization you should always use - working with odds only. No need to bother with the evens.
In C++, use vector<bool> for the sieve array. It gets automatically bit-packed.
Pre-calculate your core primes with segmented sieve. Then continue to work by big enough segments that fit in your cache, without adding new primes to the core list. For each prime p maintain additional long long int value: its current multiple (starting from the prime's square, of course). The step value is twice p in value, or p offset in the odds-packed sieve array, where the i-th entry stands for the number o + 2i, o being the least odd not below the range start. No need to sort by the multiples' values, the upper bound of core primes' use rises monotonically.
sqrt(0xFFFFFFFFFF) = 1048576. PrimePi(1048576)=82025 primes is all you need in your core primes list. That's peanuts.
Integer arithmetics for long long ints should work just fine to find the modulo, and so the smallest multiple in range, when you first start (or resume your work).
See also a related answer with pseudocode, and another with C code.

Related

Bit Manipulation: Harder Flipping Coins

Recently, I saw this problem from CodeChef titled 'Flipping Coins' (Link: FLIPCOINS).
Summarily, there are N coins and we must write a program that supports two operations.
To flip coin in range [A,B]
To find the number of heads in range [A,B] respectively.
Of course, we can quickly use a segment tree (range query, range updates using lazy propagation) to solve this.
However, I faced another similar problem where after a series of flips (operation 1), we are required to output the resulting permutation of coins after the flips (e.g 100101, where 0 represents head while 1 represents tail).
More specifically, operation 2 changes from counting number of heads to producing the resulting permutation of all N coins. Also, the new operation 2 is only called after all the flips have been done (i.e operation 2 is the last to be called and is only called one time).
May I know how does one solve this? It requires some form of bit manipulation, according to the problem tags.
Edit
I attempted brute-forcing through all queries, and alas, it yield Time Limit Exceeded.
Printing out the state of the coins can be done using a Binary-indexed tree:
Initially all values are 0.
When we need to flip coins [A, B], we increment A by 1 and
decrement B + 1 by 1.
The state of coin i is then the prefix sum at i modulo 2.
This works because the prefix sum at i is always the number of flip operations done at i.

How to optimize calculating large numbers in Python

When calculating a large number in Python such as 2^(2^1000000), the program would run out of memory. Is there a way to break this calculation into smaller chunks so not as much memory gets used?
EDIT:
I only want to display a module of this number so I really only want to calculate the last 10 or so digits of the number.
If you really want to compute all digits of this number, you will not just run out of memory, but run out of lifetime, even using all the computers in the universe.
What you are asking is to perform one million squarings of 2, modulo 10^10.
So it suffices to implement product modulo 10^10 (readily available with 64 bits arithmetic) and iterate.
If I am right, 7841627136.
P= 2
for i in range(1000000):
P= (P * P) % 10000000000
print P

How to fill a matrix with prime entries in C++?

I have this task:
Design a program which fills a matrix, of size n x n, with prime
entries (its entries must be prime numbers).
Now, I have a subroutine which reads and impries any matrix, when the user gives the entries of the matrix, and also have a subroutine which impries the prime numbers less than a given number of the user (as an array). What I can't do is try to combine these subroutines. Could you give me some good advices, please?
(I admit I misunderstood the question, as probably did some other commentators of the original post. It's relatively simple but not that trivial as it looks. For small inputs a naive approach 4. may work best.)
Let me reformulate the task:
Given a number N, find first N prime numbers.
Since you already implemented the sieve of Eratosthenes, the question is which number should be chosen as the upper limit for the sieve. Essentially, this is equivalent to finding the inverse of the prime counting function or to finding x, possibly smallest, such that
pi(x) >= N
(where pi is the prime counting function).
The article in wikipedia contains some hints, for example the inequality
pi(x) >= x/log(x).
So, one approach could rely on finding an approximate solution of the equation
x/log(x) = N,
which would be later used in the sieve of Eratosthenes. This can be done relatively easy (for small N even binary search will do).
There is, however, a widening gap between x/log(x) and pi(x) (see the table in the linked wikipedia article). So if we are really concerned about memory we could try a better inequality:
pi(x) >= li(x), (true for x <= 10^19)
where li is the logarithmic integral. This one gives a better approximation but a) we'd need some external library with the function 'li' and b) the inequality may not be true for very large x (probably not an issue here).
And if we'd like to improve the estimation even further (and for all x), we may need the assumption that the Riemann Hypothesis is true (yes, it's scary).
There are some direct algorithms for calculating pi but it's not worth using them for this task.
More direct approach:
make a guess for the upper limit in the sieve, say A, and run the sieve
if number of primes is too small, choose a larger upper limit, say B, and run the sieve, starting with the primes already found, for numbers in interval (A,B]; repeat.
If in 4. you are off by very few primes, a brute force may be faster. I've just found this post with interesting answers.

Fast, unbiased, integer pseudo random generator with arbitrary bounds

For a monte carlo integration process, I need to pull a lot of random samples from
a histogram that has N buckets, and where N is arbitrary (i.e. not a power of two) but
doesn't change at all during the course of the computation.
By a lot, I mean something on the order of 10^10, 10 billions, so pretty much any
kind of lengthy precomputation is likely worth it in the face of the sheer number of
samples).
I have at my disposal a very fast uniform pseudo random number generator that
typically produces unsigned 64 bits integers (all the ints in the discussion
below are unsigned).
The naive way to pull a sample : histogram[ prng() % histogram.size() ]
The naive way is very slow: the modulo operation is using an integer division (IDIV)
which is terribly expensive and the compiler, not knowing the value of histogram.size()
at compile time, can't be up to its usual magic (i.e. http://www.azillionmonkeys.com/qed/adiv.html)
As a matter of fact, the bulk of my computation time is spent extracting that darn modulo.
The slightly less naive way: I use libdivide (http://libdivide.com/) which is capable
of pulling off a very fast "divide by a constant not known at compile time".
That gives me a very nice win (25% or so), but I have a nagging feeling that I can do
better, here's why:
First intuition: libdivide computes a division. What I need is a modulo, and to get there
I have to do an additional mult and a sub : mod = dividend - divisor*(uint64_t)(dividend/divisor). I suspect there might be a small win there, using libdivide-type
techniques that produce the modulo directly.
Second intuition: I am actually not interested in the modulo itself. What I truly want is
to efficiently produce a uniformly distributed integer value that is guaranteed to be strictly smaller than N.
The modulo is a fairly standard way of getting there, because of two of its properties:
A) mod(prng(), N) is guaranteed to be uniformly distributed if prng() is
B) mod(prgn(), N) is guaranteed to belong to [0,N[
But the modulo is/does much more that just satisfy the two constraints above, and in fact
it does probably too much work.
All need is a function, any function that obeys constraints A) and B) and is fast.
So, long intro, but here comes my two questions:
Is there something out there equivalent to libdivide that computes integer modulos directly ?
Is there some function F(X, N) of integers X and N which obeys the following two constraints:
If X is a random variable uniformly distributed then F(X,N) is also unirformly distributed
F(X, N) is guranteed to be in [0, N[
(PS : I know that if N is small, I do not need to cunsume all the 64 bits coming out of
the PRNG. As a matter of fact, I already do that. But like I said, even that optimization
is a minor win when compare to the big fat loss of having to compute a modulo).
Edit : prng() % N is indeed not exactly uniformly distributed. But for N large enough, I don't think it's much of problem (or is it ?)
Edit 2 : prng() % N is indeed potentially very badly distributed. I had never realized how bad it could get. Ouch. I found a good article on this : http://ericlippert.com/2013/12/16/how-much-bias-is-introduced-by-the-remainder-technique
Under the circumstances, the simplest approach may work the best. One extremely simple approach that might work out if your PRNG is fast enough would be to pre-compute one less than the next larger power of 2 than your N to use as a mask. I.e., given some number that looks like 0001xxxxxxxx in binary (where x means we don't care if it's a 1 or a 0) we want a mask like 000111111111.
From there, we generate numbers as follows:
Generate a number
and it with your mask
if result > n, go to 1
The exact effectiveness of this will depend on how close N is to a power of 2. Each successive power of 2 is (obviously enough) double its predecessor. So, in the best case N is exactly one less than a power of 2, and our test in step 3 always passes. We've added only a mask and a comparison to the time taken for the PRNG itself.
In the worst case, N is exactly equal to a power of 2. In this case, we expect to throw away roughly half the numbers we generated.
On average, N ends up roughly halfway between powers of 2. That means, on average, we throw away about one out of four inputs. We can nearly ignore the mask and comparison themselves, so our speed loss compared to the "raw" generator is basically equal to the number of its outputs that we discard, or 25% on average.
If you have fast access to the needed instruction, you could 64-bit multiply prng() by N and return the high 64 bits of the 128-bit result. This is sort of like multiplying a uniform real in [0, 1) by N and truncating, with bias on the order of the modulo version (i.e., practically negligible; a 32-bit version of this answer would have small but perhaps noticeable bias).
Another possibility to explore would be use word parallelism on a branchless modulo algorithm operating on single bits, to get random numbers in batches.
Libdivide, or any other complex ways to optimize that modulo are simply overkill. In a situation as yours, the only sensible approach is to
ensure that your table size is a power of two (add padding if you must!)
replace the modulo operation with a bitmask operation. Like this:
size_t tableSize = 1 << 16;
size_t tableMask = tableSize - 1;
...
histogram[prng() & tableMask]
A bitmask operation is a single cycle on any CPU that is worth its money, you can't beat its speed.
--
Note:
I don't know about the quality of your random number generator, but it may not be a good idea to use the last bits of the random number. Some RNGs produce poor randomness in the last bits and better randomness in the upper bits. If that is the case with your RNG, use a bitshift to get the most significant bits:
size_t bitCount = 16;
...
histogram[prng() >> (64 - bitCount)]
This is just as fast as the bitmask, but it uses different bits.
You could extend your histogram to a "large" power of two by cycling it, filling in the trailing spaces with some dummy value (guaranteed to never occur in the real data). E.g. given a histogram
[10, 5, 6]
extend it to length 16 like so (assuming -1 is an appropriate sentinel):
[10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, -1]
Then sampling can be done via a binary mask histogram[prng() & mask] where mask = (1 << new_length) - 1, with a check for the sentinel value to retry, that is,
int value;
do {
value = histogram[prng() & mask];
} while (value == SENTINEL);
// use `value` here
The extension is longer than necessary to make retries unlikely by ensuring that the vast majority of the elements are valid (e.g. in the example above only 1/16 lookups will "fail", and this rate can be reduced further by extending it to e.g. 64). You could even use a "branch prediction" hint (e.g. __builtin_expect in GCC) on the check so that the compiler orders code to be optimal for the case when value != SENTINEL, which is hopefully the common case.
This is very much a memory vs. speed trade-off.
Just a few ideas to complement the other good answers:
What percent of time is spent in the modulo operation, and how do you know what that percent is? I only ask because sometimes people say something is terribly slow when in fact it is less than 10% of the time and they only think it's big because they're using a silly self-time-only profiler. (I have a hard time envisioning a modulo operation taking a lot of time compared to a random number generator.)
When does the number of buckets become known? If it doesn't change too frequently, you can write a program-generator. When the number of buckets changes, automatically print out a new program, compile, link, and use it for your massive execution.
That way, the compiler will know the number of buckets.
Have you considered using a quasi-random number generator, as opposed to a pseudo-random generator? It can give you higher precision of integration in much fewer samples.
Could the number of buckets be reduced without hurting the accuracy of the integration too much?
The non-uniformity dbaupp cautions about can be side-stepped by rejecting&redrawing values no less than M*(2^64/M) (before taking the modulus).
If M can be represented in no more than 32 bits, you can get more than one value less than M by repeated multiplication (see David Eisenstat's answer) or divmod; alternatively, you can use bit operations to single out bit patterns long enough for M, again rejecting values no less than M.
(I'd be surprised at modulus not being dwarfed in time/cycle/energy consumption by random number generation.)
To feed the bucket, you may use std::binomial_distribution to directly feed each bucket instead of feeding the bucket one sample by one sample:
Following may help:
int nrolls = 60; // number of experiments
const std::size_t N = 6;
unsigned int bucket[N] = {};
std::mt19937 generator(time(nullptr));
for (int i = 0; i != N; ++i) {
double proba = 1. / static_cast<double>(N - i);
std::binomial_distribution<int> distribution (nrolls, proba);
bucket[i] = distribution(generator);
nrolls -= bucket[i];
}
Live example
Instead of integer division you can use fixed point math, i.e integer multiplication & bitshift. Say if your prng() returns values in range 0-65535 and you want this quantized to range 0-99, then you do (prng()*100)>>16. Just make sure that the multiplication doesn't overflow your integer type, so you may have to shift the result of prng() right. Note that this mapping is better than modulo since it's retains the uniform distribution.
Thanks everyone for you suggestions.
First, I am now thoroughly convinced that modulo is really evil.
It is both very slow and yields incorrect results in most cases.
After implementing and testing quite a few of the suggestions, what
seems to be the best speed/quality compromise is the solution proposed
by #Gene:
pre-compute normalizer as:
auto normalizer = histogram.size() / (1.0+urng.max());
draw samples with:
return histogram[ (uint32_t)floor(urng() * normalizer);
It is the fastest of all methods I've tried so far, and as far as I can tell,
it yields a distribution that's much better, even if it may not be as perfect
as the rejection method.
Edit: I implemented David Eisenstat's method, which is more or less the same as Jarkkol's suggestion : index = (rng() * N) >> 32. It works as well as the floating point normalization and it is a little faster (9% faster in fact). So it is my preferred way now.

composition of a number where a number is very large

I am trying to calculate number of ways of composition of a number using numbers 1 and 2.
This can be found using fibonacci series where F(1)=1 and F(2)=2 and
F(n)=F(n-1)+F(n-2)
Since F(n) can be very large I just need F(n)%1000000007.To speed up the process I am using fibonacci exponentiation .I have written two codes for the same problem(both are almost similar).But one of them fails for large numbers.I am not able to figure out which one is correct ?
CODE 1
http://ideone.com/iCPEyz
CODE 2
http://ideone.com/Un5p2S
Though I have a feeling first one should be correct.I am not able to figure what would happen when there is a case like when we are multiplying say a and b and value of a has already exceeded the upper limit of a and when we multiply this by b ,then how sure can I be that a*b is correct. As per my knowledge if a value is above its data type limits then the value starts again from the lowest value like in below example.
#include<iostream>
#include<limits.h>
using namespace std;
int main()
{
cout<<UINT_MAX<<endl;
cout<<UINT_MAX+2;
}
Output
4294967295
1
"Overflow" (you don't really call it that for unsigneds, they wrap around) of unsigned n-bit types will preserve values modulo 2^n only, not modulo an arbitrary modulus (how could they? Try to reproduce the steps with pen and paper). You therefore have to make sure that no operation ever goes over the limits of your type in order to maintain correct results mod 100000007.