I have this snippet of code that generates the primes up to "max" in reasonable time using the Sieve of Eratosthenes.
I want to give the function the possibility to use a starting value so that it calculates a range of primes. So I wonder at what point in the algorithm I have to hand over the starting value.
e.g.
get_primes(unsigned long from, unsigned long to);
get_primes(200, 5000);
-> Saves the prime numbers from 200 to 5000 in the vector.
Unfortunately I don't understand the algorithm completely. [Especially lines 3 to 5, 7 & 10 are unclear]
I tried to follow the steps with a debugger, but that did not make me any wiser.
It would be great if anyone could explain this code to me and tell me how to set a start value.
Thank you.
vector<unsigned long long> get_primes(unsigned long max) {
vector<unsigned long long> primes;
char *sieve;
sieve = new char[max / 8 + 1];
memset(sieve, 0xFF, (max / 8 + 1) * sizeof(char));
for (unsigned long long x = 2; x <= max; x++)
if (sieve[x / 8] & (0x01 << (x % 8))) {
primes.push_back(x);
for (unsigned long long j = 2 * x; j <= max; j += x)
sieve[j / 8] &= ~(0x01 << (j % 8));
}
delete[] sieve;
return primes;
}
You must start at 2 since the sieve first removes all multiples of 2 to find the next prime as 3. It then removes all multiples of 3 to find the next prime as 5 and so on.
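The lines the question flags as unclear all deal with bit packing: the sieve keeps one flag per number and stores eight flags in each char. A minimal sketch of what those inline expressions do, pulled out into helpers (the names test_bit, clear_bit and make_sieve are hypothetical and do not appear in the original snippet):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers: they only give names to the expressions that the
// question's sieve writes inline.

// Number x lives in byte x / 8, at bit position x % 8 inside that byte.
inline bool test_bit(const std::vector<unsigned char>& sieve, std::size_t x) {
    return (sieve[x / 8] & (1u << (x % 8))) != 0;
}

// Clearing the bit marks x as composite ("crossed out").
inline void clear_bit(std::vector<unsigned char>& sieve, std::size_t x) {
    sieve[x / 8] &= static_cast<unsigned char>(~(1u << (x % 8)));
}

// Equivalent of new char[max / 8 + 1] followed by memset(..., 0xFF, ...):
// every number starts out as "possibly prime".
inline std::vector<unsigned char> make_sieve(std::size_t max) {
    return std::vector<unsigned char>(max / 8 + 1, 0xFF);
}
```

With these, the inner loop of the question reads as "for every multiple j of x, clear_bit(sieve, j)".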
If you want to generate unsigned long long primes using a version of get_primes() then you are in for a very long wait.
For generating primes in the range lo ... hi (inclusive) you need to consider only factors up to sqrt(hi). Hence you need a small sieve (up to 32 bits) for factors and another small sieve of size (hi - lo + 1) for sieving the target range.
Here's an abbreviated version of a sieve that runs up to 2^64 - 1; it uses a full sieve instead of sieving only odd numbers, because it is the reference code I use to verify optimised implementations. The changes for that are straightforward but add even more pitfalls to the whole shebang. As it is it sieves the 10 million primes between 999560010209 and 999836351599 in about 3 seconds, and those between 18446744073265777349u and 18446744073709551557u (i.e. just below 2^64) in about 20 seconds
The factor sieve is global because it gets reused a lot, and sieving the factors can take a while too. I.e. prepping the factors for a range close to 2^64 means sieving all (or most) of the factors up to 2^32 - 1, and thus it can easily take up to 10 seconds.
I wrapped my bitmap code (moral equivalent to std::bitset<>) and the factor sieve into classes; using raw vectors would make the code inflexible and unreadable. I shortened my code, removed a lot of asserts and other noise, and substituted calls to external functions with inlined code (like the call to std::sqrt()), for the sake of exposition. That way you can cull answers like what to do with the offset (here called lo) directly from verified working code.
The point of having separate number_t and index_t is that number_t can be unsigned long long but index_t must be uint32_t for my current infrastructure. The member functions of bitmap_t use the name of the underlying CPU instructions. BTS ... bit test and set, BT ... bit test. Bitmaps are initialised to 0 and a set bit signifies non-prime.
typedef uint32_t index_t;
sieve_t g_factor_sieve;
template<typename number_t, typename OutputIterator>
index_t generate_primes (number_t lo, number_t hi, OutputIterator sink)
{
// ...
index_t max_factor = index_t(std::sqrt(double(hi)));
g_factor_sieve.extend_to_cover(max_factor);
number_t range = hi - lo; assert( range <= index_t(index_t(0) - 1) );
index_t range32 = index_t(range);
bitmap_t bm(range32);
if (lo < 2) bm.bts(1 - index_t(lo)); // 1 is not a prime
if (lo == 0) bm.bts(0); // 0 is not a prime
for (index_t n = 2; n <= max_factor && n > 1; n += 1 + (n & 1))
{
if (g_factor_sieve.not_prime(n)) continue;
number_t start = square(number_t(n));
index_t stride = n << (int(n) > 2 ? 1 : 0); // double stride for n > 2
if (start >= lo)
start -= lo;
else
start = (stride - (lo - start) % stride) % stride;
// double test because of the possibility of wrapping
for (index_t i = index_t(start); i <= bm.max_bit; )
{
bm.bts(i);
if ((i += stride) < stride)
{
break;
}
}
}
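// output
index_t count = 0; // number of primes written to sink (the loop variable n above is out of scope here)
for (index_t i = 0; ; ++i)
{
if (!bm.bt(i))
{
*sink = lo + i;
++sink;
++count;
}
if (i >= bm.max_bit) break;
}
return count;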
}
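For readers who want something that compiles on its own, here is a stripped-down sketch of the same idea: a small sieve for the factors up to sqrt(hi) plus a window of hi - lo + 1 flags. It is not the code above. The name primes_in_range is illustrative, it uses std::vector<bool> instead of the bitmap classes, and it assumes lo <= hi with hi small enough that p * p and hi + p do not overflow 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: all primes in [lo, hi], inclusive.
std::vector<std::uint64_t> primes_in_range(std::uint64_t lo, std::uint64_t hi) {
    // Largest root with root * root <= hi (simple loop, fine for a sketch).
    std::uint64_t root = 0;
    while ((root + 1) * (root + 1) <= hi) ++root;

    std::vector<bool> small(root + 1, true);      // factor sieve, 0..root
    std::vector<bool> window(hi - lo + 1, true);  // window[i] stands for lo + i

    for (std::uint64_t p = 2; p <= root; ++p) {
        if (!small[p]) continue;
        for (std::uint64_t m = p * p; m <= root; m += p) small[m] = false;

        // This is where the start value enters: begin at the first multiple
        // of p that is >= lo, but never below p * p, so p itself survives.
        std::uint64_t first = (lo + p - 1) / p * p;
        if (first < p * p) first = p * p;
        for (std::uint64_t m = first; m <= hi; m += p) window[m - lo] = false;
    }

    std::vector<std::uint64_t> primes;
    for (std::uint64_t n = lo; n <= hi; ++n)
        if (n >= 2 && window[n - lo]) primes.push_back(n);
    return primes;
}
```

Calling primes_in_range(200, 5000) corresponds to the get_primes(200, 5000) call asked for in the question.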
Try this one;
I used it to set starting and ending numbers
for(int x = m;x<n;x++){
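if(x%2!=0 && x%3!=0 && x%5!=0 && x%7!=0 && x%11!=0) {
// x has no prime factor up to 11, so x is prime only if 11 < x < 169 (= 13*13);
// larger values such as 169 or 221 pass the test without being prime
}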
}
where m is starting value, and n is the ending value
Well, I want to sum up the multiples of 3 and 5. That is not too hard if I just want the sum up to a given number, e.g. up to 60 the sum is 870.
But what if I want just the first 15 multiples?
Well, one way is
void summation (const unsigned long number_n, unsigned long &summe,unsigned int &counter );
void summation (const unsigned long number_n, unsigned long &summe,unsigned int &counter )
{
unsigned int howoften = 0;
summe = 0;
for( unsigned long i = 1; i <=number_n; i++ )
if (howoften <= counter-1)
{
if( !( i % 3 ) || !( i % 5 ) )
{
summe += i;
howoften++;
}
}
counter = howoften;
return;
}
But as expected the runtime is not acceptable for a counter like 1,500,000 :-/
Hm, I tried a lot of things but I cannot find a solution on my own.
I also tried a faster summation algorithm like this (don't care about overflow at this point):
int sum(int N);
int sum(int N)
{
int S1, S2, S3;
S1 = ((N / 3)) * (2 * 3 + (N / 3 - 1) * 3) / 2;
S2 = ((N / 5)) * (2 * 5 + (N / 5 - 1) * 5) / 2;
S3 = ((N / 15)) *(2 * 15 + (N / 15 - 1) * 15) / 2;
return S1 + S2 - S3;
}
or even
unsigned long sum1000 (unsigned long target);
unsigned long sum1000 (unsigned long target)
{
unsigned int summe = 0;
for (unsigned long i = 0; i<=target; i+=3) summe+=i;
for (unsigned long i = 0; i<=target; i+=5) summe+=i;
for (unsigned long i = 0; i<=target; i+=15) summe-=i;
return summe;
}
But I'm not smart enough to set up an algorithm which is fast enough (I say 5-10 sec. are ok)
The whole sum of the multiples is not my problem, the first N multiples are :)
Thanks for reading, and if you have any ideas, that would be great.
Some prerequisites:
(don't care about overflow at this point)
OK, so let's ignore that completely.
Next, the sum of all numbers from 1 to n can be calculated as follows (see e.g. here):
int sum(int n) {
return (n * (n+1)) / 2;
}
Note that n*(n+1) is an even number for any n, so using integer arithmetic for /2 is not an issue.
How does this help to get the sum of numbers divisible by 3? Let's start with even numbers (divisible by 2). We write out the long form of the sum above:
1 + 2 + 3 + 4 + ... + n
multiply each term by 2:
2 + 4 + 6 + 8 + ... + 2*n
Now I hope you see that this sum contains all numbers that are divisible by 2 up to 2*n. Those numbers are the first n numbers that are divisible by 2.
Hence, the sum of the first n numbers that are divisible by 2 is 2 * sum(n). We can generalize that to write a function that returns the sum of the first n numbers that are divisible by m:
int sum_div_m( int n, int m) {
return sum(n) * m;
}
First I want to reproduce your initial example "up to 60 the sum is 870". For that we consider that
60/3 == 20 -> there are 20 numbers divisible by 3 and we get their sum from sum_div_m(20,3)
60/5 == 12 -> there are 12 numbers divisible by 5 and we get their sum from sum_div_m(12,5)
We cannot simply add the above two results because then we would count some numbers twice. Those numbers are the ones divisible by both 3 and 5, i.e. divisible by 15
60/15 == 4 -> there are 4 numbers divisible by 3 and 5 and we get their sum from sum_div_m(4,15).
Putting it together, the sum of all numbers divisible by 3 or 5 up to 60 is
int x = sum_div_m( 20,3) + sum_div_m( 12,5) - sum_div_m( 4,15);
Finally, back to your actual question:
But what if I want just the first 15 multiples?
Above we saw that there are
n == x/3 + x/5 - x/15
numbers that are divisible by 3 or 5 in the range 1...x. All divisions use integer arithmetic. We already had the example of 60 with 20+12-4 == 28 divisible numbers. Another example is x=10 where there are n = 3 + 2 - 0 = 5 numbers divisible by 3 or 5 (3,5,6,9,10). We have to be a bit careful with integer arithmetic, but no big deal:
15*n == 5*x + 3*x - x
-> 15*n == 7*x
-> x == 15*n/7
Quick test: 15*28/7 == 60, looks correct.
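Putting it all together: 15*n/7 lands exactly on the n-th such number only when n is a multiple of 7 (n = 1 would give x = 2), so the remainder n % 7 is looked up in the pattern 3, 5, 6, 9, 10, 12, 15 that repeats every 15 numbers. The sum of the first n numbers divisible by 3 or 5 is then
int sum_div_3_5(int n) {
static const int offset[7] = {0, 3, 5, 6, 9, 10, 12}; // offset[0]: the block of 15 is complete
int x = 15*(n/7) + offset[n%7]; // the n-th number divisible by 3 or 5
return sum_div_m(x/3, 3) + sum_div_m(x/5, 5) - sum_div_m(x/15, 15);
}
To check that this is correct we can again try sum_div_3_5(28) to see that it returns 870 (because there are 28 numbers divisible by 3 or 5 up to 60 and that was the initial example).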
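Closed forms like this are easy to get wrong by one, so a brute-force cross-check is cheap insurance. The sketch below is a 64-bit variant, which also covers the counter of 1,500,000 from the question. All names are illustrative. It relies on the observation that the multiples of 3 or 5 repeat with the pattern 3, 5, 6, 9, 10, 12, 15 in every block of 15 numbers, so plain 15*n/7 hits the n-th multiple exactly only when n is a multiple of 7.

```cpp
#include <cassert>
#include <cstdint>

// 1 + 2 + ... + n
std::uint64_t sum_1_to(std::uint64_t n) { return n * (n + 1) / 2; }

// Value of the n-th positive integer divisible by 3 or 5 (0 for n == 0).
std::uint64_t nth_multiple_3_5(std::uint64_t n) {
    // offset[0] means "the current block of 15 is complete".
    static const std::uint64_t offset[7] = {0, 3, 5, 6, 9, 10, 12};
    return 15 * (n / 7) + offset[n % 7];
}

// Inclusion-exclusion over all multiples up to the n-th one.
std::uint64_t sum_first_n_3_5(std::uint64_t n) {
    const std::uint64_t x = nth_multiple_3_5(n);
    return 3 * sum_1_to(x / 3) + 5 * sum_1_to(x / 5) - 15 * sum_1_to(x / 15);
}

// Slow reference, used only for cross-checking.
std::uint64_t sum_first_n_brute(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 1, found = 0; found < n; ++i)
        if (i % 3 == 0 || i % 5 == 0) { sum += i; ++found; }
    return sum;
}
```

Comparing sum_first_n_3_5(n) with sum_first_n_brute(n) for a few hundred values of n is enough to catch an off-by-one in the table.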
PS: It turned out that the question is really only about doing the maths, though that isn't a big surprise. When you want to write efficient code you should primarily take care to use the right algorithm. Optimizations based on a given algorithm are often less effective than choosing a better algorithm. Once you have chosen an algorithm, it often does not pay off to try to be "clever", because compilers are much better at optimizing. For example this code:
int main(){
int x = 0;
int n = 60;
for (int i=0; i <= n; ++i) x += i;
return x;
}
will be optimized by most compilers to a simple return 1830; when optimizations are turned on, because compilers do know how to add all numbers from 1 to n. See here.
You can do it at compile time recursively by using class templates/metafunctions if your value is known at compile time. So there will be no runtime cost.
Ex:
template<int n>
struct Sum{
static const int value = n + Sum<n-1>::value;
};
template<>
struct Sum<0>{
static constexpr int value = 0;
};
int main()
{
constexpr auto x = Sum<100>::value;
// x is known (5050) in compile time
return 0;
}
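If a C++11 or later compiler is available, the same compile-time evaluation can probably be written more compactly with a constexpr function instead of a class template. This is an alternative sketch, not what the answer above uses, and sum_to is an illustrative name.

```cpp
// Recursive constexpr function: usable both at compile time and at run time.
constexpr int sum_to(int n) {
    return n == 0 ? 0 : n + sum_to(n - 1);
}

// Forces evaluation by the compiler; the build fails if the value differs.
static_assert(sum_to(100) == 5050, "sum of 0..100");
```

The design difference is that the function also accepts values known only at run time, where it falls back to ordinary recursion.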
Consider the following algorithm from the C++ standard library: std::shuffle that has the following signature:
template <class RandomIt, class URBG>
void shuffle(RandomIt first, RandomIt last, URBG&& g);
It reorders the elements in the given range [first, last) such that each possible permutation of those elements has equal probability of appearance.
I am trying to implement the same algorithm, but one that works at the bit level, randomly shuffling the bits of the words of the input sequence. Considering a sequence of 64-bit words, I am trying to implement:
template <class URBG>
void bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g)
Question: How to do that as efficiently as possible (using compiler intrinsics if necessary)? I am not necessarily looking for an entire implementation, but more for suggestions/directions of research, because it's really not clear to me if it's even feasible to implement that efficiently.
It's obvious that asymptotically, the speed is O(N), where N is number of bits. Our goal is to improve the constants involved in it.
Disclaimer: the description of the proposed algorithm is a rough sketch. A lot of things need to be added and, especially, a lot of details need to be taken care of in order to make it work correctly. The approximate execution time will not differ from what is claimed here, though.
Baseline Algorithm
The most obvious one is the textbook approach, which takes N operations. Each of them calls the random generator, which takes R milliseconds, reads the values of two different bits and writes new values to them, for a total of 4 * A milliseconds (A is the time to read or write one bit). Suppose that an array lookup takes C milliseconds. The total time of this algorithm is then approximately N * (R + 4 * A + 2 * C) milliseconds. It is also reasonable to assume that random number generation dominates, i.e. R >> A == C.
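For reference, the baseline could look roughly like the sketch below: a Fisher-Yates shuffle over individual bits with one random draw per bit. The helper names get_bit, set_bit and count_bits are assumptions made for this illustration, as is the convention that bit 0 is the least significant bit of the first word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

// Read bit number pos of the sequence starting at first.
inline bool get_bit(const std::uint64_t* first, std::size_t pos) {
    return ((first[pos / 64] >> (pos % 64)) & 1u) != 0;
}

// Write bit number pos of the sequence starting at first.
inline void set_bit(std::uint64_t* first, std::size_t pos, bool value) {
    const std::uint64_t mask = std::uint64_t(1) << (pos % 64);
    if (value) first[pos / 64] |= mask;
    else       first[pos / 64] &= ~mask;
}

// Baseline: Fisher-Yates over single bits, N random draws for N bits.
template <class URBG>
void bit_shuffle_baseline(std::uint64_t* first, std::uint64_t* last, URBG&& g) {
    const std::size_t n = static_cast<std::size_t>(last - first) * 64;
    for (std::size_t i = n; i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        const std::size_t j = pick(g);
        const bool a = get_bit(first, i - 1);
        const bool b = get_bit(first, j);
        set_bit(first, i - 1, b);   // swap bits i-1 and j
        set_bit(first, j, a);
    }
}

// A shuffle must preserve the number of set bits; handy as a sanity check.
inline std::size_t count_bits(const std::uint64_t* first, const std::uint64_t* last) {
    std::size_t c = 0;
    for (; first != last; ++first)
        for (std::uint64_t w = *first; w != 0; w &= w - 1) ++c;
    return c;
}
```

Any faster scheme can be validated against this version by comparing bit counts and per-position frequencies over many runs.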
Proposed Algorithm
Suppose the bits are stored in a byte storage, i.e. we will work with blocks of bytes.
unsigned char bit_field[field_size]; // field_size = N / 8 bytes
First, let's count the number of 1 bits in our bitset. For that, we can use a lookup-table and iterate through the bitset as byte array:
// Generate the lookup table; it could be made constexpr
// so that it is built at compile time.
int bitcount_lookup[256];
for (int i = 0; i < 256; ++i) {
bitcount_lookup[i] = 0;
for (int b = 0; b < 8; ++b)
bitcount_lookup[i] += (i >> b) & 1;
}
We can treat this as preprocessing overhead (as it may as well be calculated at compile-time) and say that it takes 0 milliseconds. Now, counting number of 1 bits is easy (the following will take (N / 8) * C milliseconds):
int bitcount = 0;
for (auto *it = bit_field; it != bit_field + field_size; ++it)
bitcount += bitcount_lookup[*it];
Now, we randomly generate N / 8 numbers (let's call the resulting array gencnt[N / 8]), each in the range [0..8], such that they sum up to bitcount. This is a bit tricky and hard to do uniformly (the "correct" algorithm for generating a uniform distribution is quite slow compared to the baseline algorithm). A reasonably uniform but quick solution is roughly:
Fill the gencnt[N / 8] array with values v = bitcount / (N / 8).
Randomly choose N / 16 "black" cells. The rest are "white". The algorithm is similar to random permutation, but only of half of the array.
Generate N / 16 random numbers in the range [0..v]. Let's call them tmp[N / 16].
Increase "black" cells by tmp[i] values, and decrease "white" cells by tmp[i]. This will ensure that the overall sum is bitcount.
After that, we will have a uniform-ish random-ish array gencnt[N / 8], the values of which are the number of 1 bits in a particular "cell". It was all generated in:
(N / 8) * C                (filling step)
+ (N / 16) * (4 * C)       (random coloring)
+ (N / 16) * (R + 2 * C)   (filling)
milliseconds (this estimate is based on a concrete implementation I have in mind). Lastly, we can have a lookup table of the bytes with a specified number of bits set to 1 (this can be treated as preprocessing overhead, or even built at compile time as constexpr, so let's assume that it takes 0 milliseconds):
std::vector<std::vector<unsigned char>> random_lookup(9); // a byte can have 0..8 bits set
for (int c = 0; c <= 8; c++)
random_lookup[c] = { /* numbers with `c` bits set to `1` */ };
Then, we can fill our bit_field as follows (which takes roughly (N / 8) * (R + 3 * C) milliseconds):
for (int i = 0; i < field_size; i++) {
const auto &choices = random_lookup[gencnt[i]]; // all bytes with gencnt[i] bits set
bit_field[i] = choices[rand() % choices.size()];
}
Summing everything up, we have the total execution time:
T = (N / 8) * C +
(N / 8) * C + (N / 16) * (4 * C) + (N / 16) * (R + 2 * C) +
(N / 8) * (R + 3 * C)
= N * (C + (3/16) * R) < N * (R + 4 * A + 2 * C)
(left-hand side: proposed algorithm; right-hand side: naive baseline algorithm)
Although it is not truly uniformly random, it does spread the bits out quite evenly and randomly; it is quite fast and hopefully gets the job done in your use case.
Observe that actually shuffling the bits, which involves swapping via Fisher-Yates, is not required: counting the set bits and then placing that many 1 bits at random positions produces an equivalent random distribution of the bits.
#include <iostream>
#include <vector>
#include <random>
// shuffle a vector of bools. This requires only counting the number of trues in the vector
// followed by clearing the vector and inserting bool trues to produce an equivalent to
// a bit shuffle. This is cache line friendly and doesn't require swapping.
std::vector<bool> DistributeBitsRandomly(std::vector<bool> bvector)
{
std::random_device rd;
static std::mt19937 gen(rd()); //mersenne_twister_engine seeded with rd()
// count the number of set bits and clear bvector
int set_bits_count = 0;
for (int i=0; i < bvector.size(); i++)
if (bvector[i])
{
set_bits_count++;
bvector[i] = 0;
}
// set a bit if a random value in range bvector.size()-bit_loc-1 is
// less than the number of bits remaining to be placed. This produces exactly the same
// distribution as a random shuffle but only does an insertion of a 1 bit rather than
// a swap. It requires counting the number of 1 bits. There are efficient ways
// of doing this. See https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
for (int bit_loc = 0; set_bits_count; bit_loc++)
{
std::uniform_int_distribution<int> dis(0, bvector.size()-bit_loc-1);
auto x = dis(gen);
if (x < set_bits_count)
{
bvector[bit_loc] = true;
set_bits_count--;
}
}
return bvector;
}
This performs the equivalent of shuffling the bools in a vector<bool>. It is cache-line friendly and involves no swapping. It's presented in an executable but simple algorithmic form, as requested by the OP. Much can be done to optimize this, such as improving the speed of bit counting and clearing the array.
This sets 4 bits out of 10, calls the "shuffle" routine 100,000 times, and prints the number of times a 1 bit occurs in each of the 10 locations. It should be around 40,000 in each position.
int main()
{
std::vector<bool> initial{ 1,1,1,1,0,0,0,0,0,0 };
std::vector<int> totals(initial.size());
for (int i = 0; i < 100000; i++)
{
auto a_distribution = DistributeBitsRandomly(initial);
for (int ii = 0; ii < totals.size(); ii++)
if (a_distribution[ii])
totals[ii]++;
}
for (auto cnt : totals)
std::cout << cnt << "\n";
}
Possible Output:
40116
39854
40045
39917
40105
40074
40214
39963
39946
39766
Given an array of N non-negative integers: A1, A2, …, AN. How to find a pair of integers Au, Av (1 ≤ u < v ≤ N) such that (Au AND Av), the bitwise AND, is as large as possible?
Example: Let N=4 and the array be [2 4 8 10]. Here the answer is 8.
Explanation
2 and 4 = 0
2 and 8 = 0
2 and 10 = 2
4 and 8 = 0
4 and 10 = 0
8 and 10 = 8
How can this be done if N can go up to 10^5?
I have an O(N^2) solution, but it's not efficient.
Code :
for(int i=0;i<n;i++){
for(int j=i+1;j<n;j++){
if((arr[i] & arr[j]) > ans) // parentheses needed: > binds tighter than &
{
ans=arr[i] & arr[j];
}
}
}
One way you could speed it up is to take advantage of the fact that if a high bit is set in both of two numbers, then the AND of those two numbers will ALWAYS be larger than any AND whose highest set bit is lower.
Therefore, if you order your numbers by the bits set you may decrease the number of operations drastically.
In order to find the most significant bit efficiently, GCC has a builtin intrinsic: __builtin_clz(unsigned int x), which returns the number of leading zero bits in x (the result is undefined for x == 0), so the most significant set bit sits at position BITS - 1 - __builtin_clz(x). (Other compilers have similar intrinsics, translating to a single instruction on at least x86.)
const unsigned int BITS = sizeof(unsigned int)*8; // Assuming 8 bit bytes.
// Your O(N^2) implementation goes here.
unsigned int max_and_trivial( const std::vector<unsigned int> & input);
// Partition the set.
unsigned int max_and( const std::vector<unsigned int> & input ) {
// For small input, just use the trivial algorithm.
if ( input.size() < 100 ) {
return max_and_trivial(input);
}
std::vector<unsigned int> by_bits[BITS];
for ( auto elem : input ) {
unsigned int mask = elem;
while (mask) { // Ignore elements that are 0.
unsigned int most_sig = __builtin_clz(mask); // number of leading zeros
by_bits[ most_sig ].push_back(elem);
mask ^= (1u << (BITS-1)) >> most_sig; // clear the bit just handled
}
}
// Bucket i holds the elements whose bit with i leading zeros is set,
// so the lowest index whose vector has more than one element
// will include the largest AND-value.
for ( unsigned int i = 0; i < BITS; i++) {
if ( by_bits[i].size() > 1 ) {
return max_and_trivial( by_bits[i]);
}
}
// If you get here, the largest value is 0.
return 0;
}
This algorithm still has worst case runtime O(N*N), but on average it should perform much better. You can also further increase the performance by repeating the partition step when you search through the smaller vector (just remember to ignore the most significant bit in the partition step, doing this should increase the performance to a worst case of O(N)).
Guaranteeing that there are no duplicates in the input-data will further increase the performance.
1. Sort the array in descending order.
2. Take the first two numbers. If they are both between two consecutive powers of 2 (say 2^k and 2^(k+1)), then you can remove all elements that are less than 2^k.
3. From the remaining elements, subtract 2^k.
4. Repeat steps 2 and 3 until the number of elements in the array is 2.
Note: If you find that only the largest element is between 2^k and 2^(k+1) and the second largest element is less than 2^k, then you will not remove any element, but just subtract 2^k from the largest element.
Also, determining where an element lies in the series {1, 2, 4, 8, 16, ...} can be done in O(log(log(MAX))) time where MAX is the largest number in the array.
I didn't test this, and I'm not going to. O(N) memory and O(N) complexity.
#include <vector>
#include <utility>
#include <algorithm>
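using namespace std;
// Forward declarations: both helpers are defined after their first use.
int partial_sort_for_bit(vector<unsigned> &v, unsigned importantBit, unsigned vSize);
unsigned highest_bit_index(unsigned i);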
/*
* The idea is as follows:
* 1.) Create a mathematical set A that holds integers.
* 2.) Initialize importantBit = highest bit in any integer in v
* 3.) Put into A all integers that have importantBit set to 1.
* 4.) If |A| = 2, that is our answer. If |A| < 2, --importantBit and try again. If |A| > 2, basically
* redo the problem but only on the integers in set A.
*
* Keep "set A" at the beginning of v.
*/
pair<unsigned, unsigned> find_and_sum_pair(vector<unsigned> v)
{
// Find highest bit in v.
int importantBit = 0;
for(auto num : v)
importantBit = max(importantBit, static_cast<int>(highest_bit_index(num))); // max needs both arguments to have the same type
// Move all elements with importantBit to front of vector until doing so gives us at least 2 in the set.
int setEnd;
while((setEnd = partial_sort_for_bit(v, importantBit, v.size())) < 2 && importantBit > 0)
--importantBit;
// If the set is never sufficient, no answer exists
if(importantBit == 0)
return pair<unsigned, unsigned>();
// Repeat the problem only on the subset defined by A until |A| = 2 and impBit > 0 or impBit = 0
while(importantBit > 1)
{
unsigned secondSetEnd = partial_sort_for_bit(v, --importantBit, setEnd);
if(secondSetEnd >= 2)
setEnd = secondSetEnd;
}
return pair<unsigned, unsigned>(v[0], v[1]);
}
// Returns end index (1 past last) of set A
int partial_sort_for_bit(vector<unsigned> &v, unsigned importantBit, unsigned vSize)
{
unsigned setEnd = 0;
unsigned mask = 1<<(importantBit-1);
for(decltype(v.size()) index = 0; index < vSize; ++index)
if((v[index]&mask) > 0) // parentheses needed: > binds tighter than &
swap(v[index], v[setEnd++]);
return setEnd;
}
unsigned highest_bit_index(unsigned i)
{
unsigned ret = i != 0;
while(i >>= 1)
++ret;
return ret;
}
I came upon this problem again and solved it a different way (much more understandable to me):
unsigned findMaxAnd(vector<unsigned> &input) {
    vector<unsigned> candidates;
    for(unsigned mask = 1u<<31; mask; mask >>= 1) {
        for(unsigned i : input)
            if(i&mask)
                candidates.push_back(i);
        if (candidates.size() >= 2)
            input = move(candidates); // keep only the numbers that share this bit
        candidates = vector<unsigned>(); // start empty for the next bit
    }
    if(input.size() < 2)
        return 0;
    return input[0]&input[1];
}
Here is an O(N * log MAX_A) solution:
1) Let's construct the answer greedily, iterating from the highest bit to the lowest one.
2) To do it, one can maintain a set S of numbers that currently fit. Initially, it consists of all numbers in the array. Let's also assume that initially ANS = 0.
3) Now let's iterate over all the bits from the highest to the lowest. Let's say that the current bit is B.
4) If the number of elements in S with value 1 in the B-th bit is greater than 1, it is possible to have 1 in this position without changing the values of higher bits in ANS, so we should add 2^B to ANS and remove all elements from S which have value 0 in this bit (they do not fit anymore).
5) Otherwise, it is not possible to obtain 1 in this position, so we do not change S and ANS and proceed to the next bit.
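The five steps above can be sketched as follows. This is one possible reading, assuming the values fit in 32 bits; the function name max_pair_and is made up for the example, and the set S is simply a vector that gets filtered in place.

```cpp
#include <cassert>
#include <vector>

// Largest value of (a[u] & a[v]) over all pairs u < v,
// or 0 when there are fewer than two elements.
unsigned max_pair_and(std::vector<unsigned> s) {   // s plays the role of S
    if (s.size() < 2) return 0;
    unsigned ans = 0;
    for (int b = 31; b >= 0; --b) {                // assumes 32-bit values
        const unsigned bit = 1u << b;
        std::vector<unsigned> with_bit;
        for (unsigned x : s)
            if (x & bit) with_bit.push_back(x);
        if (with_bit.size() >= 2) {                // step 4: bit B can be 1
            ans |= bit;
            s.swap(with_bit);                      // drop elements lacking bit B
        }                                          // step 5: otherwise keep S
    }
    return ans;
}
```

Each of the 32 passes touches every remaining element once, which gives the O(N * log MAX_A) bound stated above.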
I have started doing competitive programming and most of the time I find that the input size of numbers is like
1 <= n <= 10^(500).
So I understand that it would be like 500 digits, which cannot be stored in a simple int. I know C and C++.
I think I should use an array. But then I get confused about how I would find
if ( (nCr % P) == 0 ) //for all (0<=r<=n)//
I think that I would store it in an array and then find nCr, which would require coding multiplication and division on digits, but what about modulus?
Is there any other way?
Thanks.
I think you don't want to code the multiplication and division yourself, but use something like the GNU MP Bignum library http://gmplib.org/
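If all that is needed from a 500-digit input is its remainder modulo a small P, no big-number library is required: the digits can be folded one at a time (Horner's rule). A minimal sketch, assuming the number arrives as a string of decimal digits and that 0 < p < 2^32; the name mod_of_decimal is invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Remainder of a non-negative decimal number, given as a digit string,
// modulo p. With p < 2^32, rem * 10 + digit always fits in 64 bits.
std::uint64_t mod_of_decimal(const std::string& digits, std::uint64_t p) {
    std::uint64_t rem = 0;
    for (char c : digits)
        rem = (rem * 10 + static_cast<std::uint64_t>(c - '0')) % p;
    return rem;
}
```

This only covers the modulus part of the question; it does not by itself decide whether nCr % P == 0.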
Regarding large number libraries, I have used ttmath, which provides arbitrary length integers, floats, etc, and some really good operations, all with relatively little bulk.
However, if you are only trying to figure out what (n^e) mod m is, you can do this for very large values of e even without extremely large number calculation. Below is a function I added to my local ttmath lib to do just that:
/*!
mod power this = (this ^ pow) % m
binary algorithm (r-to-l)
return values:
0 - ok
1 - carry
2 - incorrect argument (0^0)
*/
uint PowMod(UInt<value_size> pow, UInt<value_size> mod)
{
if(pow.IsZero() && IsZero())
// we don't define zero^zero
return 2;
UInt<value_size> remainder;
UInt<value_size> x = 1;
uint c = 0;
while (pow != 0)
{
remainder = (pow & 1 == 1);
pow /= 2;
if (remainder != 0)
{
c += x.Mul(*this);
x = x % mod;
}
c += Mul(*this);
*this = *this % mod;
}
*this = x;
return (c==0)? 0 : 1;
}
I don't believe you ever need to store a number larger than n^2 for this algorithm. It should be easy to modify such that it removes the ttmath related aspects, if you don't want to use those headers.
You can find the details of the mathematics online by looking up modular exponentiation, if you care about it.
If we have to calculate nCr mod p (where p is a prime), we can calculate the factorials mod p and then use the modular inverse to find nCr mod p (this works as long as n < p, so that no factorial is divisible by p). If we have to find nCr mod m (where m is not prime), we can factorize m into primes and then use the Chinese Remainder Theorem (CRT) to find nCr mod m.
#include<cstdio> // scanf, printf
#include<iostream>
using namespace std;
#include<vector>
/* This function calculates (a^b)%MOD */
long long pow(int a, int b, int MOD)
{
long long x=1,y=a;
while(b > 0)
{
if(b%2 == 1)
{
x=(x*y);
if(x>=MOD) x%=MOD; // >= so that a value equal to MOD is reduced to 0
}
y = (y*y);
if(y>=MOD) y%=MOD;
b /= 2;
}
return x;
}
/* Modular Multiplicative Inverse
Using Euler's Theorem (for prime m, where phi(m) = m-1)
a^(phi(m)) = 1 (mod m)
a^(-1) = a^(m-2) (mod m) */
long long InverseEuler(int n, int MOD)
{
return pow(n,MOD-2,MOD);
}
long long C(int n, int r, int MOD)
{
vector<long long> f(n + 1,1);
for (int i=2; i<=n;i++)
f[i]= (f[i-1]*i) % MOD;
return (f[n]*((InverseEuler(f[r], MOD) * InverseEuler(f[n-r], MOD)) % MOD)) % MOD;
}
int main()
{
int n,r,p;
while (~scanf("%d%d%d",&n,&r,&p))
{
printf("%lld\n",C(n,r,p));
}
}
Here, I've used long long int to store the number.
In many, many cases in these coding competitions, the idea is that you don't actually calculate these big numbers, but figure out how to answer the question without calculating it. For example:
What are the last ten digits of 1,000,000! (factorial)?
It's a number with over five million digits. However, I can answer that question without a computer, not even using pen and paper. Or take the question: What is (2014^2014) modulo 153? Here's a simple way to calculate this in C:
int modulo = 1;
for (int i = 0; i < 2014; ++i) modulo = (modulo * 2014) % 153;
Again, you avoided doing a calculation with a 6,000 digit number. (You can actually do this considerably faster, but I'm not trying to enter a competition).
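The "considerably faster" variant hinted at is presumably exponentiation by repeated squaring, which needs about log2(exp) multiplications instead of exp. A sketch under the assumption that 0 < m < 2^32, so that every intermediate product fits in 64 bits; pow_mod and pow_mod_slow are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

// (base^exp) mod m by repeated squaring.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    std::uint64_t result = 1 % m;   // 1 % m handles m == 1
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;  // use this bit of exp
        base = base * base % m;                   // square for the next bit
        exp >>= 1;
    }
    return result;
}

// The linear loop from the text, kept as a reference for cross-checking.
std::uint64_t pow_mod_slow(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    std::uint64_t r = 1 % m;
    for (std::uint64_t i = 0; i < exp; ++i) r = r * (base % m) % m;
    return r;
}
```

For the example in the text, pow_mod(2014, 2014, 153) takes 11 loop iterations where the linear loop takes 2014.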
Some time ago I used the (blazing fast) primesieve in python that I found here: Fastest way to list all primes below N
To be precise, this implementation:
def primes2(n):
""" Input n>=6, Returns a list of primes, 2 <= p < n """
n, correction = n-n%6+6, 2-(n%6>1)
sieve = [True] * (n/3)
for i in xrange(1,int(n**0.5)/3+1):
if sieve[i]:
k=3*i+1|1
sieve[ k*k/3 ::2*k] = [False] * ((n/6-k*k/6-1)/k+1)
sieve[k*(k-2*(i&1)+4)/3::2*k] = [False] * ((n/6-k*(k-2*(i&1)+4)/6-1)/k+1)
return [2,3] + [3*i+1|1 for i in xrange(1,n/3-correction) if sieve[i]]
Now I can more or less grasp the idea of the optimization of automatically skipping multiples of 2, 3 and so on, but when it comes to porting this algorithm to C++ I get stuck (I have a good understanding of Python and a reasonable/bad understanding of C++, but good enough for rock 'n roll).
What I currently have rolled myself is this (isqrt() is just a simple integer square root function):
template <class T>
void primesbelow(T N, std::vector<T> &primes) {
T sievemax = (N-3 + (1-(N % 2))) / 2;
T i;
T sievemaxroot = isqrt(sievemax) + 1;
boost::dynamic_bitset<> sieve(sievemax);
sieve.set();
primes.push_back(2);
for (i = 0; i <= sievemaxroot; i++) {
if (sieve[i]) {
primes.push_back(2*i+3);
for (T j = 3*i+3; j < sievemax; j += 2*i+3) sieve[j] = 0; // filter multiples (valid indices are 0..sievemax-1)
}
}
for (; i < sievemax; i++) {
if (sieve[i]) primes.push_back(2*i+3);
}
}
This implementation is decent and automatically skips multiples of 2, but if I could port the Python implementation I think it could be much faster (30%-50% or so).
To compare the results (in the hope this question will be successfully answered), the current execution time with N=100000000, g++ -O3 on a Q6600 Ubuntu 10.10 is 1230ms.
Now I would love some help with either understanding what the above Python implementation does or that you would port it for me (not as helpful though).
EDIT
Some extra information about what I find difficult.
I have trouble with the techniques used, like the correction variable, and in general with how it all comes together. A link to a site explaining different Eratosthenes optimizations (apart from the simple sites that say "well you just skip multiples of 2, 3 and 5" and then slam you with a 1000 line C file) would be awesome.
I don't think I would have issues with a 100% direct and literal port, but since after all this is for learning that would be utterly useless.
EDIT
After looking at the code in the original numpy version, it actually is pretty easy to implement and with some thinking not too hard to understand. This is the C++ version I came up with. I'm posting it here in full to help future readers in case they need a pretty efficient primesieve that is not two million lines of code. This primesieve does all primes under 100000000 in about 415 ms on the same machine as above. That's a 3x speedup, better than I expected!
#include <cstdint>
#include <vector>
#include <boost/dynamic_bitset.hpp>
// http://vault.embedded.com/98/9802fe2.htm - integer square root
// The shifts below assume a 32-bit argument, hence the fixed-width type.
unsigned short isqrt(std::uint32_t a) {
unsigned long rem = 0;
unsigned long root = 0;
for (short i = 0; i < 16; i++) {
root <<= 1;
rem = ((rem << 2) + (a >> 30));
a <<= 2;
root++;
if (root <= rem) {
rem -= root;
root++;
} else root--;
}
return static_cast<unsigned short> (root >> 1);
}
// https://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
// https://stackoverflow.com/questions/5293238/porting-optimized-sieve-of-eratosthenes-from-python-to-c/5293492
template <class T>
void primesbelow(T N, std::vector<T> &primes) {
T i, j, k, l, sievemax, sievemaxroot;
sievemax = N/3;
if ((N % 6) == 2) sievemax++;
sievemaxroot = isqrt(N)/3;
boost::dynamic_bitset<> sieve(sievemax);
sieve.set();
primes.push_back(2);
primes.push_back(3);
for (i = 1; i <= sievemaxroot; i++) {
if (sieve[i]) {
k = (3*i + 1) | 1;
l = (4*k-2*k*(i&1)) / 3;
for (j = k*k/3; j < sievemax; j += 2*k) {
sieve[j] = 0;
if (j + l < sievemax) sieve[j+l] = 0; // the second index can run past the end
}
primes.push_back(k);
}
}
for (i = sievemaxroot + 1; i < sievemax; i++) {
if (sieve[i]) primes.push_back((3*i+1)|1);
}
}
I'll try to explain as much as I can. The sieve array has an unusual indexing scheme; it stores a bit for each number that is congruent to 1 or 5 mod 6. Thus, a number 6*k + 1 will be stored in position 2*k and k*6 + 5 will be stored in position 2*k + 1. The 3*i+1|1 operation is the inverse of that: it takes numbers of the form 2*n and converts them into 6*n + 1, and takes 2*n + 1 and converts it into 6*n + 5 (the +1|1 thing converts 0 to 1 and 3 to 5). The main loop iterates k through all numbers with that property, starting with 5 (when i is 1); i is the corresponding index into sieve for the number k. The first slice update to sieve then clears all bits in the sieve with indexes of the form k*k/3 + 2*m*k (for m a natural number); the corresponding numbers for those indexes start at k^2 and increase by 6*k at each step. The second slice update starts at index k*(k-2*(i&1)+4)/3 (number k * (k+4) for k congruent to 1 mod 6 and k * (k+2) otherwise) and similarly increases the number by 6*k at each step.
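The index scheme described above can be checked in isolation. The two helpers below are named for illustration only (index_to_number and number_to_index do not appear in either implementation); they spell out the 3*i+1|1 mapping and its inverse.

```cpp
#include <cassert>
#include <cstdint>

// Sieve index -> number it represents: 1, 5, 7, 11, 13, 17, ...
// In C++ (as in Python) + binds tighter than |, so this is (3*i + 1) | 1.
constexpr std::uint64_t index_to_number(std::uint64_t i) {
    return (3 * i + 1) | 1;
}

// Inverse mapping; only meaningful for n % 6 == 1 or n % 6 == 5.
// 6*m + 1 -> 2*m and 6*m + 5 -> 2*m + 1, which is just n / 3.
constexpr std::uint64_t number_to_index(std::uint64_t n) {
    return n / 3;
}
```

This is also why the slices start at k*k/3: that is number_to_index applied to k squared.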
Here's another attempt at an explanation: let candidates be the set of all numbers that are at least 5 and are congruent to either 1 or 5 mod 6. If you multiply two elements in that set, you get another element in the set. Let succ(k) for some k in candidates be the next element (in numerical order) in candidates that is larger than k. In that case, the inner loop of the sieve is basically (using normal indexing for sieve):
for k in candidates:
for (l = k; ; l += 6) sieve[k * l] = False
for (l = succ(k); ; l += 6) sieve[k * l] = False
Because of the limitations on which elements are stored in sieve, that is the same as:
for k in candidates:
for l in candidates where l >= k:
sieve[k * l] = False
which will remove all multiples of k in candidates (other than k itself) from the sieve at some point (either when the current k was used as l earlier or when it is used as k now).
Piggy-Backing onto Howard Hinnant's response, Howard, you don't have to test numbers in the set of all natural numbers not divisible by 2, 3 or 5 for primality, per se. You need simply multiply each number in the array (except 1, which self-eliminates) times itself and every subsequent number in the array. These overlapping products will give you all the non-primes in the array up to whatever point you extend the deterministic-multiplicative process. Thus the first non-prime in the array will be 7 squared, or 49. The 2nd, 7 times 11, or 77, etc. A full explanation here: http://www.primesdemystified.com
As an aside, you can "approximate" prime numbers. Call the approximate prime P. Here are a few formulas:
P = 2*k+1 // not divisible by 2
P = 6*k + {1, 5} // not divisible by 2, 3
P = 30*k + {1, 7, 11, 13, 17, 19, 23, 29} // not divisible by 2, 3, 5
The property of the sets of numbers generated by these formulas is that a given P may not be prime, but every prime (apart from the small primes the formula excludes, e.g. 2, 3 and 5 for the last one) is of the form P. I.e. if you only test numbers in the set P for primality, you won't miss any.
You can reformulate these formulas to:
P = X*k + {-i, -j, -k, k, j, i}
if that is more convenient for you.
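To make the "approximate primes" idea concrete, the sketch below enumerates the candidates of the 30*k form and filters them with plain trial division. It is only an illustration of the candidate set, not the code the links refer to, and the names is_prime and wheel30_candidates are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Plain trial division, used only to confirm which candidates are prime.
bool is_prime(std::uint64_t n) {
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// All numbers P = 30*k + r below limit, r in {1, 7, 11, 13, 17, 19, 23, 29}.
std::vector<std::uint64_t> wheel30_candidates(std::uint64_t limit) {
    static const std::uint64_t residues[8] = {1, 7, 11, 13, 17, 19, 23, 29};
    std::vector<std::uint64_t> out;
    for (std::uint64_t base = 0; base < limit; base += 30)
        for (std::uint64_t r : residues)
            if (base + r < limit) out.push_back(base + r);
    return out;
}
```

Only 8 of every 30 numbers are candidates, so a primality test has to look at roughly 27% of the range; 2, 3 and 5 must be added by hand.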
Here is some code that uses this technique with a formula for P not divisible by 2, 3, 5, 7.
This link may represent the extent to which this technique can be practically leveraged.