C/C++: Multiply, or bitshift then divide? [duplicate]

C/C++: Multiply, or bitshift then divide? [duplicate] - c++

This question already has answers here:
Is multiplication and division using shift operators in C actually faster?
(19 answers)
Closed 8 years ago.
Where it's possible to do so, I'm wondering if it's faster to replace a single multiplication with a bitshift followed by an integer division. Say I've got an int k and I want to multiply it by 2.25.
What's faster?
int k = 5;
k *= 2.25;
std::cout << k << std::endl;
or
int k = 5;
k = (k<<1) + (k/4);
std::cout << k << std::endl;
Output
11
11
Both give the same result, you can check this full example.

The first attempt
I defined functions regularmultiply() and bitwisemultiply() as follows:
int regularmultiply(int j)
{
return j * 2.25;
}
int bitwisemultiply(int k)
{
return (k << 1) + (k >> 2);
}
Upon doing profiling with Instruments (in XCode on a 2009 Macbook OS X 10.9.2), it seemed that bitwisemultiply executed about 2x faster than regularmultiply.
The assembly code output seemed to confirm this, with bitwisemultiply spending most of its time on register shuffling and function returns, while regularmultiply spent most of its time on the multiplying.
regularmultiply:
bitwisemultiply:
But the length of my trials was too short.
The second attempt
Next, I tried executing both functions with 10 million multiplications, and this time putting the loops in the functions so that all the function entry and leaving wouldn't obscure the numbers. And this time, the results were that each method took about 52 milliseconds of time. So at least for a relatively large but not gigantic number of calculations, the two functions take about the same time. This surprised me, so I decided to calculate for longer and with larger numbers.
The third attempt
This time, I only multiplied 100 million through 500 million by 2.25, but the bitwisemultiply actually came out slightly slower than the regularmultiply.
The final attempt
Finally, I switched the order of the two functions, just to see if the growing CPU graph in Instruments was perhaps slowing the second function down. But still, the regularmultiply performed slightly better:
Here is what the final program looked like:
#include <stdio.h>
int main(void)
{
void regularmultiplyloop(int j);
void bitwisemultiplyloop(int k);
int i, j, k;
j = k = 4;
bitwisemultiplyloop(k);
regularmultiplyloop(j);
return 0;
}
void regularmultiplyloop(int j)
{
for(int m = 0; m < 10; m++)
{
for(int i = 100000000; i < 500000000; i++)
{
j = i;
j *= 2.25;
}
printf("j: %d\n", j);
}
}
void bitwisemultiplyloop(int k)
{
for(int m = 0; m < 10; m++)
{
for(int i = 100000000; i < 500000000; i++)
{
k = i;
k = (k << 1) + (k >> 2);
}
printf("k: %d\n", k);
}
}
Conclusion
So what can we say about all this? One thing we can say for certain is that optimizing compilers are better than most people. And furthermore, those optimizations show themselves even more when there are a lot of computations, which is the only time you'd really want to optimize anyway. So unless you're coding your optimizations in assembly, changing multiplication to bit shifting probably won't help much.
It's always good to think about efficiency in your applications, but the gains of micro-efficiency are usually not enough to warrant making your code less readable.

Indeed it depends on a variety of factors. So I have just checked it by running and measuring time. So the string we are interested in takes only a few instructions of CPU which is very fast so I have wrapped it into the cycle - multiplied the execution time of one code by a big number, and I got the k *= 2.25; is about in 1.5 times slower than k = (k<<1) + (k/4);.
Here is my two codes to comapre:
prog1:
#include <iostream>
using namespace std;
int main() {
int k = 5;
for (unsigned long i = 0; i <= 0x2fffffff;i++)
k = (k<<1) + (k/4);
cout << k << endl;
return 0;
}
prog 2:
#include <iostream>
using namespace std;
int main() {
int k = 5;
for (unsigned long i = 0; i <= 0x2fffffff;i++)
k *= 2.25;
cout << k << endl;
return 0;
}
Prog1 takes 8 secs and Prog2 takes 14 secs. So by running this test with you architecture and compiler you can get the result which is correct to your particular environment.

That depends heavily on the CPU architecture: Floating point arithmetic, including multiplications, has become quite cheap on many CPUs. But the necessary float->int conversion can bite you: on POWER-CPUs, for instance, the regular multiplication will crawl along due to the pipeline flushes that are generated when a value is moved from the floating point unit to the integer unit.
On some CPUs (including mine, which is an AMD CPU), this version is actually the fastest:
k *= 9;
k >>= 2;
because these CPUs can do a 64 bit integer multiplication in a single cycle. Other CPUs are definitely slower with my version than with your bitshift version, because their integer multiplication is not as heavily optimized. Most CPUs aren't as bad on multiplications as they used to be, but a multiplication can still take more than four cycles.
So, if you know which CPU your program will run on, measure which is fastest. If you don't know, your bitshift version won't perform badly on any architecture (unlike both the regular version and mine), which makes it a really safe bet.

It highly depends on what hardware are you using. On modern hardware floating point multiplications may run way faster than integer ones, so you might want to change the entire algorithm and start using doubles instead of integers. If you're writing for modern hardware and you have a lot of operations like multiplying by 2.25, I'd suggest using double rather than integers, if nothing else prevents you from doing that.
And be data driven - measure performance, because it's affected by compiler, hardware and your way of implementing your algorithm.

Related

sieve or eratosthenes calculator -- running into memory issues and crashing with numbers >=1,000,000

I'm not exactly sure why this is. I tried changing the variables to long long, and I even tried doing a few other things -- but its either about the inefficiency of my code (it literally does the whole process of finding all primes up to the number, then checking against the number to see if its divisible by that prime -- very inefficient, but its my first attempt at this and I feel pretty accomplished having it work at all....)
Or the fact that it overflows the stack. Im not sure where it is exactly, but all I know is that it MUST be related to memory and the way its dealing with the number.
If I had to guess, Id say its a memory issue happening when it is dealing with the prime number generation up to that number -- thats where it dies even if I remove the check against the input number.
I'll post my code -- just be aware, I didnt change long long back to int in a few places, and I also have a SquareRoot Variable that is not used, because it was supposed to try and help memory efficiency but was not effective the way I tried to do it. I Just never deleted it. I will clean up the code when and if I can successfully finish it.
As far as I am aware though, it DOES work pretty reliably for 999,999 and down, I actually checked it up against other calculators of the same type and it seemingly does generate the proper answers.
If anyone can help or explain what I screwed up here, your helping a guy trying to learn on his own without any school or anything. so its appreciated.
#include <iostream>
#include <cmath>
void sieve(int ubound, int primes[]);
int main()
{
long long n;
int i;
std::cout << "Input Number: ";
std::cin >> n;
if (n < 2) {
return 1;
}
long long upperbound = n;
int A[upperbound];
int SquareRoot = sqrt(upperbound);
sieve(upperbound, A);
for (i = 0; i < upperbound; i++) {
if (A[i] == 1 && upperbound % i == 0) {
std::cout << " " << i << " ";
}
}
return 0;
}
void sieve(int ubound, int primes[])
{
long long i, j, m;
for (i = 0; i < ubound; i++) {
primes[i] = 1;
}
primes[0] = 0, primes[1] = 0;
for (i = 2; i < ubound; i++) {
for(j = i * i; j < ubound; j += i) {
primes[j] = 0;
}
}
}

If you used legal C++ constructs instead of non-standard variable length arrays, your code will run (whether it produces the correct answers is another question).
The issue is more than likely that you're exceeding the limits of the stack when you declare arrays with a million or more elements.
Therefore instead of this:
long long upperbound = n;
A[upperbound];
Use std::vector:
#include <vector>
//...
long long upperbound = n;
std::vector<int> A(upperbound);
and then:
sieve(upperbound, A.data());
The std::vector does not use the stack space to allocate its elements (unless you have written an allocator for it that uses the stack).
As a matter of fact, you don't even need to pass upperbound to sieve, as a std::vector knows its own size by calling the size() member function. But I leave that as an exercise.
Live example using 2,000,000

First of all, read and apply PaulMcKenzie's advice. That's the most important thing. I'm only addressing some teeny bits of your question that remained open.
It seems that you are trying to factor the number that you misleadingly called upperbound. The mysterious role of the square root of this number is related to this fact: if the number is composite at all - and hence can be computed as the product of some prime factors - then the smallest of these prime factors cannot be greater than the square root of the number. In fact, only one factor can possibly be greater, all others cannot exceed the square root.
However, in its present form your code cannot draw advantage from this fact. The trial division loop as it stands now has to run up to number_to_be_factored / 2 in order not to miss any factors because its body looks like this:
if (sieve[i] == 1 && number_to_be_factored % i == 0) {
std::cout << " " << i << " ";
}
You can factor much more efficiently if you refactor your code a bit: when you have found the smallest prime factor p of your number then the remaining factors to be found must be precisely those of rest = number_to_be_factored / p (or n = n / p, if you will), and none of the remaining factors can be smaller than p. However, don't forget that p might occur more than once as a factor.
During any round of the proceedings you only need to consider the prime factors between p and the square root of the current number; if none of those primes divides the current number then it must be prime. To test whether p exceeds the square root of some number n you can use if (p * p > n), which is computationally more efficient that actually computing the square root.
Hence the square root occurs in two different roles:
the square root of the number to be factored limits the amount of sieving that needs to be done
during the trial division loop, the square root of the current number gives an upper bound for the highest prime factor that you need to consider
That's two faces of the same coin but two different usages in the actual code.
Note: once you got your code working by applying PaulMcKenzie's advice, you might also to consider posting over on Code Review.

C++ Optimizing this Algorithm

After watching some Terence Tao videos, I wanted to try implementing algorithms into c++ code to find all the prime numbers up to a number n. In my first version, where I simply had every integer from 2 to n tested to see if they were divisible by anything from 2 to sqrt(n), I got the program to find the primes between 1-10,000,000 in ~52 seconds.
Attempting to optimize the program, and implementing what I now know to be the Sieve of Eratosthenes, I assumed the task would be done much faster than 51 seconds, but sadly, that wasn't the case. Even going up to 1,000,000 took a considerable amount of time (didn't time it, though)
#include <iostream>
#include <vector>
using namespace std;
void main()
{
vector<int> tosieve = {};
for (int i = 2; i < 1000001; i++)
{
tosieve.push_back(i);
}
for (int j = 0; j < tosieve.size(); j++)
{
for (int k = j + 1; k < tosieve.size(); k++)
{
if (tosieve[k] % tosieve[j] == 0)
{
tosieve.erase(tosieve.begin() + k);
}
}
}
//for (int f = 0; f < tosieve.size(); f++)
//{
// cout << (tosieve[f]) << endl;
//}
cout << (tosieve.size()) << endl;
system("pause");
}
Is it the repeated referencing of the vectors or something? Why is this so slow? Even if I'm completely overlooking something (could be, complete beginner at this :I) I would think that finding the primes between 2 and 1,000,000 with this horrible inefficient method would be faster than my original way of finding them from 2 to 10,000,000.
Hope someone has a clear answer to this - hopefully I can use whatever knowledge is gleaned in the future when optimizing programs using a lot of recursion.

The problem is that 'erase' moves every element in the vector down one, meaning it is an O(n) operation.
There are three alternative choices:
1) Just mark deleted elements as 'empty' (make them 0, for example). This will mean future passes have to pass over those empty positions, but that isn't that expensive.
2) Make a new vector, and push_back new values into there.
3) Use std::remove_if: This will move the elements down, but do it in a single pass so will be more efficient. If you use std::remove_if, then you will have to remember it doesn't resize the vector itself.

Most of vector operations, including erase() have a O(n) linear time complexity.
Since you have two loops of size 10^6, and a vector of size 10^6, your algorithm executes up to 10^18 operations.
Qubic algorithms for such a big N will take a huge amount of time.
N = 10^6 is even big enough for quadratic algorithms.
Please, read carefully about Sieve of Eratosthenes. The fact that both full search and Sieve of Eratosthenes algorithms took the same time, means that you have done the second one wrong.

I see two performanse issues here:
First of all, push_back() will have to reallocate the dynamic memory block once in a while. Use reserve():
vector<int> tosieve = {};
tosieve.resreve(1000001);
for (int i = 2; i < 1000001; i++)
{
tosieve.push_back(i);
}
Second erase() has to move all Elements behind the one you try to remove. You set the elements to 0 instead and do a run over the vector in the end (untested code):
for (auto& x : tosieve) {
for (auto y = tosieve.begin(); *y < x; ++y) // this check works only in
// the case of an ordered vector
if (y != 0 && x % y == 0) x = 0;
}
{ // this block will make sure, that sieved will be released afterwards
auto sieved = vector<int>{};
for(auto x : tosieve)
sieved.push_back(x);
swap(tosieve, sieved);
} // the large memory block is released now, just keep the sieved elements.
consider to use standard algorithms instead of hand written loops. They help you to state your intent. In this case I see std::transform() for the outer loop of the sieve, std::any_of() for the inner loop, std::generate_n() for filling tosieve at the beginning and std::copy_if() for filling sieved (untested code):
vector<int> tosieve = {};
tosieve.resreve(1000001);
generate_n(back_inserter(tosieve), 1000001, []() -> int {
static int i = 2; return i++;
});
transform(begin(tosieve), end(tosieve), begin(tosieve), [](int i) -> int {
return any_of(begin(tosieve), begin(tosieve) + i - 2,
[&i](int j) -> bool {
return j != 0 && i % j == 0;
}) ? 0 : i;
});
swap(tosieve, [&tosieve]() -> vector<int> {
auto sieved = vector<int>{};
copy_if(begin(tosieve), end(tosieve), back_inserter(sieved),
[](int i) -> bool { return i != 0; });
return sieved;
});
EDIT:
Yet another way to get that done:
vector<int> tosieve = {};
tosieve.resreve(1000001);
generate_n(back_inserter(tosieve), 1000001, []() -> int {
static int i = 2; return i++;
});
swap(tosieve, [&tosieve]() -> vector<int> {
auto sieved = vector<int>{};
copy_if(begin(tosieve), end(tosieve), back_inserter(sieved),
[](int i) -> bool {
return !any_of(begin(tosieve), begin(tosieve) + i - 2,
[&i](int j) -> bool {
return i % j == 0;
});
});
return sieved;
});
Now instead of marking elements, we don't want to copy afterwards, but just directly copy only the elements, we want to copy. This is not only faster than the above suggestion, but also better states the intent.

Very interesting task you have. Thanks!
With pleasure I implemented from scratch my own versions of solving it.
I created 3 separate (independent) functions, all based on Sieve of Eratosthenes. These 3 versions are different in their complexity and speed.
Just a quick note, my simplest (slowest) version finds all primes below your desired limit of 10'000'000 within just 0.025 sec (i.e. 25 milli-seconds).
I also tested all 3 versions to find primes below 2^32 (4'294'967'296), which is solved by "simple" version within 47 seconds, by "intermediate" version within 30 seconds, by "advanced" within 12 seconds. So within just 12 seconds it finds all primes below 4 Billion (there are 203'280'221 such primes below 2^32, see OEIS sequence)!!!
For simplicity I will describe in details only Simple version out of 3. Here's code:
template <typename T>
std::vector<T> GenPrimes_SieveOfEratosthenes(size_t end) {
// https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
if (end <= 2)
return {};
size_t const cnt = end >> 1;
std::vector<u8> composites((cnt + 7) / 8);
auto Get = [&](size_t i){ return bool((composites[i / 8] >> (i % 8)) & 1); };
auto Set = [&](size_t i){ composites[i / 8] |= u8(1) << (i % 8); };
std::vector<T> primes = {2};
size_t i = 0;
for (i = 1; i < cnt; ++i) {
if (Get(i))
continue;
size_t const p = 2 * i + 1, start = (p * p) >> 1;
primes.push_back(p);
if (start >= cnt)
break;
for (size_t j = start; j < cnt; j += p)
Set(j);
}
for (i = i + 1; i < cnt; ++i)
if (!Get(i))
primes.push_back(2 * i + 1);
return primes;
}
This code implements simplest but fast algorithm of finding primes, called Sieve of Eratosthenes. As a small optimization of speed and memory, I search only over odd numbers. This odd numbers optimization gives me ability to store 2x times less memory and do 2x times less steps, hence improves both speed and memory consumption exactly 2 times.
Algorithm is simple, we allocate array of bits, this array at position K has bit 1 if K is composite, or has 0 if K is probably prime. At the end all 0 bits in array signify Definite primes (that are for sure primes). Also due to odd numbers optimization this bit-array stores only odd numbers, so K-th bit is actually a number 2 * K + 1.
Then left to right we go over this array of bits and if we meet 0 bit at position K then it means we found a prime number P = 2 * K + 1 and now starting from position (P * P) / 2 we mark every P-th bit with 1. It means we mark all numbers bigger than P*P that are composite, because they are divisible by P.
We do this procedure only until P * P becomes greater or equal to our limit End (we're finding all primes < End). This limit guarantees that after reaching it ALL zero bits inside array signify prime numbers.
Second version of code does only one optimization to this Simple version, it makes all multi-core (multi-threaded). But this only optimization makes code much bigger and more complex. Basically it slices whole range of bits into all cores, so that they write bits to memory in parallel.
I'll explain only my third Advanced version, it is most complex of 3 versions. It does not only multi-threaded optimization, but also so-called Primorial optimization.
What is Primorial, it is a product of first smallest primes, for example I take primorial 2 * 3 * 5 * 7 = 210.
We can see that any primorial splits infinite range of integers into wheels by modulus of this primorial. For example primorial 210 splits into ranges [0; 210), [210; 2210), [2210; 3*210), etc.
Now it is easy to mathematically prove that inside All ranges of primorial we can mark same positions of numbers as complex, exactly we can mark all numbers that are multiple of 2 or 3 or 5 or 7 as composite.
We can see that out of 210 remainders there are 162 remainders that are for sure composite, and only 48 remainders are probably prime.
Hence it is enough for us to check primality of only 48/210=22.8% of whole search space. This reduction of search space makes task more than 4x times faster, and 4x times less memory consuming.
One can see that my first Simple version in fact due to odd-only optimization was actually using Primorial equal to 2 optimization. Yes, if we take primorial 2 instead of primorial 210, then we gain exactly first version (Simple) algorithm.
All of my 3 versions are tested for correctness and speed. Although still some tiny bugs can remain. Note. Yet it is recommended not to use my code straight away in production, unless it is tested thoroughly.
All 3 versions are tested for correctness by re-using each other answers. I thoroughly test correctness by feeding all limits (end value) from 0 to 2^18. It takes some time to do this.
See main() function to figure out how to use my functions.
Try it online!
SOURCE CODE GOES HERE. Due to StackOverflow limit of 30K symbols per post, I can't inline source code here, as it is almost 30K in size and together with English post above it takes more than 30K. So I'm providing source code on separate Github Gist server, link below. Note that Try it online! link above also contains full source code, but I reduced search limit of 2^32 to smaller one due to GodBolt limit of running time to 3 seconds.
Github Gist code
Output:
10M time 'Simple' 0.024 sec
Time 2^32 'Simple' 46.924 sec, number of primes 203280221
Time 2^32 'Intermediate' 30.999 sec
Time 2^32 'Advanced' 11.359 sec
All checked till 0
All checked till 5000
All checked till 10000
All checked till 15000
All checked till 20000
All checked till 25000

Optimizing my code for finding the factors of a given integer

Here is my code,but i'lld like to optimize it.I don't like the idea of it testing all the numbers before the square root of n,considering the fact that one could be faced with finding the factors of a large number. Your answers would be of great help. Thanks in advance.
unsigned int* factor(unsigned int n)
{
unsigned int tab[40];
int dim=0;
for(int i=2;i<=(int)sqrt(n);++i)
{
while(n%i==0)
{
tab[dim++]=i;
n/=i;
}
}
if(n>1)
tab[dim++]=n;
return tab;
}

Here's a suggestion on how to do this in 'proper' c++ (since you tagged as c++).
PS. Almost forgot to mention: I optimized the call to sqrt away :)
See it live on http://liveworkspace.org/code/6e2fcc2f7956fafbf637b54be2db014a
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>
typedef unsigned int uint;
std::vector<uint> factor(uint n)
{
std::vector<uint> tab;
int dim=0;
for(unsigned long i=2;i*i <= n; ++i)
{
while(n%i==0)
{
tab.push_back(i);
n/=i;
}
}
if(n>1)
tab.push_back(n);
return tab;
}
void test(uint x)
{
auto v = factor(x);
std::cout << x << ":\t";
std::copy(v.begin(), v.end(), std::ostream_iterator<uint>(std::cout, ";"));
std::cout << std::endl;
}
int main(int argc, const char *argv[])
{
test(1);
test(2);
test(4);
test(43);
test(47);
test(9997);
}
Output
1:
2: 2;
4: 2;2;
43: 43;
47: 47;
9997: 13;769;

There's a simple change that will cut the run time somewhat: factor out all the 2's, then only check odd numbers.

If you use
... i*i <= n; ...
It may run much faster than i <= sqrt(n)
By the way, you should try to handle factors of negative n or at least be sure you never pass a neg number

I'm afraid you cannot. There is no known method in the planet can factorize large integers in polynomial time. However, there are some methods can help you slightly (not significantly) speed up your program. Search Wikipedia for more references. http://en.wikipedia.org/wiki/Integer_factorization

As seen from your solution , you find basically all prime numbers ( the condition while (n%i == 0)) works like that , especially for the case of large numbers , you could compute prime numbers beforehand, and keep checking only those. The prime number calculation could be done using Sieve of Eratosthenes method or some other efficient method.

unsigned int* factor(unsigned int n)
If unsigned int is the typical 32-bit type, the numbers are too small for any of the more advanced algorithms to pay off. The usual enhancements for the trial division are of course worthwhile.
If you're moving the division by 2 out of the loop, and divide only by odd numbers in the loop, as mentioned by Pete Becker, you're essentially halving the number of divisions needed to factor the input number, and thus speed up the function by a factor of very nearly 2.
If you carry that one step further and also eliminate the multiples of 3 from the divisors in the loop, you reduce the number of divisions and hence increase the speed by a factor close to 3 (on average; most numbers don't have any large prime factors, but are divisible by 2 or by 3, and for those the speedup is much smaller; but those numbers are quick to factor anyway. If you factor a longer range of numbers, the bulk of the time is spent factoring the few numbers with large prime divisors).
// if your compiler doesn't transform that to bit-operations, do it yourself
while(n % 2 == 0) {
tab[dim++] = 2;
n /= 2;
}
while(n % 3 == 0) {
tab[dim++] = 3;
n /= 3;
}
for(int d = 5, s = 2; d*d <= n; d += s, s = 6-s) {
while(n % d == 0) {
tab[dim++] = d;
n /= d;
}
}
If you're calling that function really often, it would be worthwhile to precompute the 6542 primes not exceeding 65535, store them in a static array, and divide only by the primes to eliminate all divisions that are a priori guaranteed to not find a divisor.
If unsigned int happens to be larger than 32 bits, then using one of the more advanced algorithms would be profitable. You should still begin with trial divisions to find the small prime factors (whether small should mean <= 1000, <= 10000, <= 100000 or perhaps <= 1000000 would need to be tested, my gut feeling says one of the smaller values would be better on average). If after the trial division phase the factorisation is not yet complete, check whether the remaining factor is prime using e.g. a deterministic (for the range in question) variant of the Miller-Rabin test. If it's not, search a factor using your favourite advanced algorithm. For 64 bit numbers, I'd recommend Pollard's rho algorithm or an elliptic curve factorisation. Pollard's rho algorithm is easier to implement and for numbers of that magnitude finds factors in comparable time, so that's my first recommendation.

Int is way to small to encounter any performance problems. I just tried to measure the time of your algorithm with boost but couldn't get any useful output (too fast). So you shouldn't worry about integers at all.
If you use i*i I was able to calculate 1.000.000 9-digit integers in 15.097 seconds. It's good to optimize an algorithm but instead of "wasting" time (depends on your situation) it's important to consider if a small improvement really is worth the effort. Sometimes you have to ask yourself if you rally need to be able to calculate 1.000.000 ints in 10 seconds or if 15 is fine as well.

C++ Coprimes Problem. Optimize code

Hi i want to optimize the following code. It tries to find all coprimes in a given range by comparing them to n. But i want to make it run faster... any ideas?
#include <iostream>
using namespace std;
int GCD(int a, int b)
{
while( 1 )
{
a = a % b;
if( a == 0 )
return b;
b = b % a;
if( b == 0 )
return a;
}
}
int main(void){
int t;
cin >> t;
for(int i=0; i<t; i++){
int n,a,b;
cin >> n >> a >> b;
int c = 0;
for(int j=a; j<=b; j++){
if(GCD(j, n) == 1) c++;
}
cout << c << endl;
}
return 0;
}

This smells like homework, so only a hint.
You don't need to calculate GCD here. If you can factorize n (even in the crudest way of trying to divide by every odd number smaller than 2^16), then you can just count numbers which happen not to divide by factors of n.
Note that there will be at most 10 factors of a 32-bit number (we don't need to remember how many times given prime is used in factorization).
How to do that? Try to count non-coprimes using inclusion–exclusion principle. You will have at most 1023 subsets of primes to check, for every subset you need to calculate how many multiplies are in the range, which is constant time for each subset.
Anyway, my code works in no time now:
liori:~/gg% time ./moje <<< "1 1003917915 1 1003917915"
328458240
./moje <<< "1 1003917915 1 1003917915" 0,00s user 0,00s system 0% cpu 0,002 total

On a single core computer it's not going to get much faster than it currently is. So you want to utilize multiple cores or even multiple computers. Parallelize and distribute.
Since each pair of numbers you want to calculate the GCD for isn't linked to any other pair of numbers you can easily modify your program to utilize multiple cores by using threads.
If this still isn't fast enough you'd better start thinking of using distributed computing, assigning the work to many computers. This is a bit trickier but should improve the performance the most if the search space is large.

Consider giving it a try with doubles. It said that divisions with doubles are faster on typical intel chips. Integer division is the slowest instruction out there. This is a chicken egg problem. Nobody uses them because they're slow and intel doesnt make it faster because nobody uses it.

C++ program to calculate quotients of large factorials

How can I write a c++ program to calculate large factorials.
Example, if I want to calculate (100!) / (99!), we know the answer is 100, but if i calculate the factorials of the numerator and denominator individually, both the numbers are gigantically large.

expanding on Dirk's answer (which imo is the correct one):
#include "math.h"
#include "stdio.h"
int main(){
printf("%lf\n", (100.0/99.0) * exp(lgamma(100)-lgamma(99)) );
}
try it, it really does what you want even though it looks a little crazy if you are not familiar with it. Using a bigint library is going to be wildly inefficient. Taking exps of logs of gammas is super fast. This runs instantly.
The reason you need to multiply by 100/99 is that gamma is equivalent to n-1! not n!. So yeah, you could just do exp(lgamma(101)-lgamma(100)) instead. Also, gamma is defined for more than just integers.

You can use the Gamma function instead, see the Wikipedia page which also pointers to code.

Of course this particular expression should be optimized, but as for the title question, I like GMP because it offers a decent C++ interface, and is readily available.
#include <iostream>
#include <gmpxx.h>
mpz_class fact(unsigned int n)
{
mpz_class result(n);
while(n --> 1) result *= n;
return result;
}
int main()
{
mpz_class result = fact(100) / fact(99);
std::cout << result.get_str(10) << std::endl;
}
compiles on Linux with g++ -Wall -Wextra -o test test.cc -lgmpxx -lgmp

By the sounds of your comments, you also want to calculate expressions like 100!/(96!*4!).
Having "cancelled out the 96", leaving yourself with (97 * ... * 100)/4!, you can then keep the arithmetic within smaller bounds by taking as few numbers "from the top" as possible as you go. So, in this case:
i = 96
j = 4
result = i
while (i <= 100) or (j > 1)
if (j > 1) and (result % j == 0)
result /= j
--j
else
result *= i
++i
You can of course be cleverer than that in the same vein.
This just delays the inevitable, though: eventually you reach the limits of your fixed-size type. Factorials explode so quickly that for heavy-duty use you're going to need multiple-precision.

Here's an example of how to do so:
http://www.daniweb.com/code/snippet216490.html
The approach they take is to store the big #s as a character array of digits.
Also see this SO question: Calculate the factorial of an arbitrarily large number, showing all the digits

You can use a big integer library like gmp which can handle arbitrarily large integers.

The only optimization that can be made here (considering that in m!/n! m is larger than n) means crossing out everything you can before using multiplication.
If m is less than n we would have to swap the elements first, then calculate the factorial and then make something like 1 / result. Note that the result in this case would be double and you should handle it as double.
Here is the code.
if (m == n) return 1;
// If 'm' is less than 'n' we would have
// to calculate the denominator first and then
// make one division operation
bool need_swap = (m < n);
if (need_swap) std::swap(m, n);
// #note You could also use some BIG integer implementation,
// if your factorial would still be big after crossing some values
// Store the result here
int result = 1;
for (int i = m; i > n; --i) {
result *= i;
}
// Here comes the division if needed
// After that, we swap the elements back
if (need_swap) {
// Note the double here
// If m is always > n then these lines are not needed
double fractional_result = (double)1 / result;
std::swap(m, n);
}
Also to mention (if you need some big int implementation and want to do it yourself) - the best approach that is not so hard to implement is to treat your int as a sequence of blocks and the best is to split your int to series, that contain 4 digits each.
Example: 1234 | 4567 | 2323 | 2345 | .... Then you'll have to implement every basic operation that you need (sum, mult, maybe pow, division is actually a tough one).

To solve x!/y! for x > y:
int product = 1;
for(int i=0; i < x - y; i ++)
{
product *= x-i;
}
If y > x switch the variables and take the reciprocal of your solution.

I asked a similar question, and got some pointers to some libraries:
How can I calculate a factorial in C# using a library call?
It depends on whether or not you need all the digits, or just something close. If you just want something close, Stirling's Approximation is a good place to start.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js