Tesseral arithmetic/quadtree - bit-manipulation

I did a project a while back on path finding with quadtrees and I would like to improve on its performance. It seems that using tesseral arithmetic to determine node adjacency (as per this page, courtesy of the Geography department of the University of British Columbia) would be much faster than the brute force method I'm using at the moment (I'm checking for shared edges, which works fine for a static quadtree but would be too much overhead if the map were changing).
I more or less understand what's said in the Adjacency Algorithm section, but I'm not really sure how to begin. I'm primarily interested in C#, but it'd be awesome if there's already some source floating around for working with tesseral arithmetic that I could look at, regardless of language. Otherwise, could anyone give me some pointers on dealing with the addition/subtraction carries?

Well, I don't know of any way to do this efficiently, but the usual "add with bitwise operations" trick suggests the following (not tested):
static int tesseral_add(int x, int y)
{
    int a, b;
    do
    {
        a = x & y;   // positions that generate a carry
        b = x ^ y;   // sum without the carries
        x = a << 2;  // move each carry up 2 places instead of the usual 1
        y = b;
    } while (a != 0); // repeat until no carries are left
    return b;
}
Which possibly loops quite a lot, if there are carry chains.
Actually, there's a much better way to do this.
Observe that for z = interleave(a, -1); w = interleave(b, 0);, adding z and w directly gives a partially correct result, because any carries are re-carried (all the "in-between" bits are 1). The only "problem" is that it destroys the y-coordinates.
So to add two tesseral numbers z = interleave(a, b); w = interleave(c, d);, there is a nice short way to do it:
int xsum = (z | unchecked((int)0xAAAAAAAA)) + (w & 0x55555555);
int ysum = (z | 0x55555555) + (w & unchecked((int)0xAAAAAAAA));
int result = (xsum & 0x55555555) | (ysum & unchecked((int)0xAAAAAAAA));

I think, the easiest way to deal with tesseral arithmetic is to "bit-unzip" numbers, perform any number of arithmetical operations normally, and "bit-zip" them back when tesseral form is needed:
z = bit_zip(bit_unzip(x) + bit_unzip(y));
(This example works for unsigned only. For signed integers, unpack each number into two variables and do normal arithmetic on both parts separately).
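For illustration, here is a small, self-contained C++ sketch of the zip/unzip idea using the classic mask-and-shift method (the helper names are mine, and this is not the code from the book referenced below):
#include <cstdint>

// Interleave the bits of a and b: a goes to the even positions, b to the odd ones.
uint32_t bit_zip(uint16_t a, uint16_t b)
{
    uint32_t x = a, y = b;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    y = (y | (y << 8)) & 0x00FF00FF;
    y = (y | (y << 4)) & 0x0F0F0F0F;
    y = (y | (y << 2)) & 0x33333333;
    y = (y | (y << 1)) & 0x55555555;
    return x | (y << 1);
}

// Extract the bits sitting at the even positions of z (the "x" coordinate).
uint16_t bit_unzip_even(uint32_t z)
{
    z &= 0x55555555;
    z = (z | (z >> 1)) & 0x33333333;
    z = (z | (z >> 2)) & 0x0F0F0F0F;
    z = (z | (z >> 4)) & 0x00FF00FF;
    z = (z | (z >> 8)) & 0x0000FFFF;
    return uint16_t(z);
}

// Tesseral addition via unzip/add/zip, as in the one-liner above:
// uint32_t sum = bit_zip(bit_unzip_even(p) + bit_unzip_even(q),
//                        bit_unzip_even(p >> 1) + bit_unzip_even(q >> 1));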
You can find fast implementations for "bit-unzip" and "bit-zip" in "Matters Computational", chapter 1.15 "Bit-wise zip".

Related

Evaluating multiplication with exponential function

I'm trying to come up with a good way to evaluate the following function
double foo(std::vector<double> const& x, double c = 0.95)
{
    auto N = x.size(); // Small power of 2 such as 512 or 1024
    double sum = 0;
    for (auto i = 0; i != N; ++i) {
        sum += (x[i] * pow(c, double(i)/N));
    }
    return sum;
}
My two main concerns with this naive implementation are performance and accuracy. So I suspect that the most trivial improvement would be to reverse the loop order: for (auto i = N-1; i != -1; --i) (The -1 wraps around, this is OK). This improves accuracy by adding smaller terms first.
While this is good for accuracy, it keeps the performance problem of pow. Numerically, pow(c, double(i)/N) is pow(c, (i-1)/N) * pow(c, 1/N). And the latter is a constant. So in theory we can replace pow with repeated multiplication. While good for performance, this hurts accuracy - errors will accumulate.
I suspect that there's a significantly better algorithm hiding in here. For instance, the fact that N is a power of two means that there is a middle term x[N/2] that's multiplied with sqrt(c). That hints at a recursive solution.
On a somewhat related numerical observation, this looks like a signal multiplication with an exponential, so I naturally think : "FFT, trivial convolution=shift, IFFT", but that seems to offer no real benefit in terms of accuracy or performance.
So, is this a well-known problem with known solutions?
The task is a polynomial evaluation. The method for a single evaluation with the least operation count is the Horner scheme. In general a low operation count will reduce the accumulation of floating point noise.
As the example value c=0.95 is close to 1, any root of it will be even closer to 1 and thus lose accuracy. Avoid that by computing the difference to 1 directly, z = 1 - c^(1/N), via
z = -expm1(log(c)/N).
Now you have to evaluate the polynomial
sum of x[i] * (1-z)^i
which can be done by careful modification of the Horner scheme. Instead of
for(i=N; i-->0; ) {
    res = res*(1-z) + x[i];
}
use
for(i=N; i-->0; ) {
    res = (res + x[i]) - res*z;
}
which is mathematically equivalent but has the loss of digits in 1-z happen as late as possible, without using more involved methods like doubly accurate addition.
In tests, those two methods, contrary to the intent, gave almost the same results. A substantial improvement could be observed by separating the result into its value at c=1 (z=0) and a multiple of z, as in
double res0 = 0, resz = 0;
int i;
for(i=N; i-->0; ) {
    /* res0 + z*resz = (res0 + z*resz)*(1-z) + x[i]; */
    resz = resz - res0 - z*resz;
    res0 = res0 + x[i];
}
The test case that showed this improvement was for the coefficient sequence of
f(u) = (1-u/N)^(N-2)*(1-u)
where for N=1000 the evaluations result in
c         z=1-c^(1/N)        f(1-z)              diff for 1st proc       diff for 3rd proc
0.950000  0.000051291978909  0.000018898570629    1.33289104579937e-17    4.43845264361253e-19
0.951000  0.000050239954368  0.000018510931892    1.23765066121009e-16   -9.24959978401696e-19
0.952000  0.000049189034371  0.000018123700958    1.67678642238461e-17   -5.38712954453735e-19
0.953000  0.000048139216599  0.000017736876972   -2.86635949350855e-17   -2.37169225231204e-19
...
0.994000  0.000006018054217  0.000002217256601    1.31645860662263e-17    1.15619997300212e-19
0.995000  0.000005012529261  0.000001846785028   -4.15668713370839e-17   -3.5363625547867e-20
0.996000  0.000004008013365  0.000001476685973    8.48811716443534e-17    8.470329472543e-22
0.997000  0.000003004504507  0.000001106958687    1.44711343873661e-17   -2.92226366802734e-20
0.998000  0.000002002000667  0.000000737602425    5.6734266807093e-18    -6.56450534122083e-21
0.999000  0.000001000499833  0.000000368616443   -3.72557383333555e-17    1.47701370177469e-20
Yves' answer inspired me.
It seems that the best approach is to not calculate pow(c, 1.0/N) directly, but indirectly:
cc[0]=c; cc[1]=sqrt(cc[0]), cc[2]=sqrt(cc[1]),... cc[logN] = sqrt(cc[logN-1])
Or in binary,
cc[0]=c, cc[1]=c^0.1, cc[2]=c^0.01, cc[3]=c^0.001, ....
Now if we need x[0b100100] * c^0.100100, we can calculate that as x[0b100100]* c^0.1 * c^0.0001. I don't need to precalculate a table of size N, as geza suggested. A table of size log(N) is probably sufficient, and it can be created by repeatedly taking square roots.
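Here is a minimal C++ sketch of that idea, with my own names; it assumes N is a power of two, and in real use the table would of course be built once and reused:
#include <cmath>
#include <cstddef>
#include <vector>

// cc[j] = c^(1/2^j), built by repeated square roots, so the table has only
// logN + 1 entries. c^(i/N) is then the product of the entries selected by
// the binary digits of i: bit k of i contributes c^(2^k/N) = cc[logN - k].
double pow_c_frac(double c, std::size_t i, std::size_t N)
{
    std::size_t logN = 0;
    while ((std::size_t(1) << logN) < N) ++logN;

    std::vector<double> cc(logN + 1);
    cc[0] = c;
    for (std::size_t j = 1; j <= logN; ++j)
        cc[j] = std::sqrt(cc[j - 1]);

    double r = 1.0;
    for (std::size_t k = 0; k < logN; ++k)
        if (i & (std::size_t(1) << k))
            r *= cc[logN - k];
    return r;
}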
[edit]
As pointed out in a comment thread on another answer, pairwise summation is very effective in keeping errors under control. And it happens to combine extremely nicely with this answer.
We start by observing that we sum
x[0] * c^0.0000000
x[1] * c^0.0000001
x[2] * c^0.0000010
x[3] * c^0.0000011
...
So, we run log(N) iterations. In iteration 1, we add the N/2 pairs x[i] + x[i+1]*c^0.0000001 and store the result in x[i/2]. In iteration 2, we add the pairs x[i] + x[i+1]*c^0.0000010, etcetera. The chief difference from normal pairwise summation is that each step is a multiply-and-add.
We see now that in each iteration, we're using the same multiplier pow(c, 2^i/N), which means we only need to calculate log(N) multipliers. It's also quite cache-efficient, as we're doing only contiguous memory access. It also allows for easy SIMD parallelization, especially when you have FMA instructions.
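A minimal sketch of that scheme in C++ (my naming; it assumes N is a power of two and works on a copy of the input, which it overwrites):
#include <cmath>
#include <cstddef>
#include <vector>

// Each pass halves the array: x[j] <- x[2j] + x[2j+1] * m, where m = c^(2^t/N)
// in pass t, so only log2(N) multipliers are ever computed (m is squared once
// per pass). The reads are contiguous and the inner operation is an FMA.
double exp_weighted_sum(std::vector<double> x, double c)
{
    if (x.empty()) return 0.0;
    std::size_t n = x.size();
    double m = std::exp(std::log(c) / double(n)); // c^(1/N); the sqrt chain works too
    while (n > 1)
    {
        for (std::size_t j = 0; j < n / 2; ++j)
            x[j] = x[2 * j] + x[2 * j + 1] * m;
        n /= 2;
        m *= m; // next pass uses c^(2^(t+1)/N)
    }
    return x[0];
}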
If N is a power of 2, you can replace the evaluations of the powers by geometric means, using
a^((i+j)/2) = √(a^i · a^j)
and recursively subdivide starting from the endpoint weights c^(N/N) and c^(0/N). With preorder recursion, you can make sure to accumulate by increasing weights.
Anyway, the speedup of sqrt vs. pow might be marginal.
You can also stop recursion at a certain level and continue linearly, with mere products.
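A sketch of that recursive subdivision (my naming; N is assumed to be a power of two). Note that it simply sums the two halves recursively, which is itself a form of pairwise summation, rather than accumulating strictly by increasing weight:
#include <cmath>
#include <cstddef>
#include <vector>

// The weight of the midpoint of [lo, hi) is the geometric mean of the endpoint
// weights, so no pow calls are needed below the top level.
double sum_range(std::vector<double> const& x, std::size_t lo, std::size_t hi,
                 double wlo, double whi)
{
    if (hi <= lo) return 0.0;
    if (hi - lo == 1) return x[lo] * wlo;
    std::size_t mid = lo + (hi - lo) / 2;
    double wmid = std::sqrt(wlo * whi); // a^((i+j)/2) = sqrt(a^i * a^j)
    return sum_range(x, lo, mid, wlo, wmid) + sum_range(x, mid, hi, wmid, whi);
}

// Top level: weight c^(0/N) = 1 at index 0 and c^(N/N) = c at index N.
double foo_recursive(std::vector<double> const& x, double c)
{
    return sum_range(x, 0, x.size(), 1.0, c);
}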
You could mix repeated multiplication by pow(c, 1./N) with some explicit pow calls. I.e. every 16th iteration or so do a real pow and otherwise move forward with the multiply. This should yield large performance benefits at negligible accuracy cost.
Depending on how much c varies, you might even be able to precompute and replace all pow calls with a lookup, or just the ones needed in the above method (= smaller lookup table = better caching).
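A rough sketch of the periodic resynchronisation idea, assuming the interval of 16 mentioned above and running the loop backwards as suggested earlier (the names are mine, and the exact resync schedule is a tunable guess):
#include <cmath>
#include <cstddef>
#include <vector>

double foo_resync(std::vector<double> const& x, double c = 0.95)
{
    std::size_t const N = x.size();
    if (N == 0) return 0.0;
    double const step = std::pow(c, 1.0 / double(N));
    double sum = 0.0;
    double w = 0.0; // current weight c^(i/N)
    for (std::size_t i = N; i-- > 0; )
    {
        if (i == N - 1 || i % 16 == 15)
            w = std::pow(c, double(i) / double(N)); // periodic exact resync
        else
            w /= step; // cheap update; any drift is wiped out at the next resync
        sum += x[i] * w;
    }
    return sum;
}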

Return non-duplicate random values from a very large range

I would like a function that will produce k pseudo-random values from a set of n integers, zero to n-1, without repeating any previous result. k is less than or equal to n. O(n) memory is unacceptable because of the large size of n and the frequency with which I'll need to re-shuffle.
These are the methods I've considered so far:
Array:
Normally if I wanted duplicate-free random values I'd shuffle an array, but that's O(n) memory. n is likely to be too large for that to work.
long nextvalue(void) {
    static long array[4000000000];
    static long s = 0; // long, since the index can exceed INT_MAX
    if (s == 0) {
        for (long i = 0; i < 4000000000; i++) array[i] = i;
        shuffle(array, 4000000000);
    }
    return array[s++];
}
n-state PRNG:
There are a variety of random number generators that can be designed so as to have a period of n and to visit n unique states over that period. The simplest example would be:
long nextvalue(void) {
    static long s = 0;
    static const long i = 1009; // assumed co-prime to n
    s = (s + i) % n;
    return s;
}
The problem with this is that it's not necessarily easy to design a good PRNG on the fly for a given n, and it's unlikely that that PRNG will approximate a fair shuffle if it doesn't have a lot of variable parameters (even harder to design). But maybe there's a good one I don't know about.
m-bit hash:
If the size of the set is a power of two, then it's possible to devise a perfect hash function f() which performs a 1:1 mapping from any value in the range to some other value in the range, where every input produces a unique output. Using this function I could simply maintain a static counter s, and implement a generator as:
long nextvalue(void) {
    static long s = 0;
    return f(s++);
}
This isn't ideal because the order of the results is determined by f(), rather than random values, so it's subject to all the same problems as above.
NPOT hash:
In principle I can use the same design principles as above to define a version of f() which works in an arbitrary base, or even a composite, that is compatible with the range needed; but that's potentially difficult, and I'm likely to get it wrong. Instead a function can be defined for the next power of two greater than or equal to n, and used in this construction:
long nextvalue(void) {
    static long s = 0;
    long x = s++;
    do { x = f(x); } while (x >= n);
    return x;
}
But this still has the same problem (unlikely to give a good approximation of a fair shuffle).
Is there a better way to handle this situation? Or perhaps I just need a good function for f() that is highly parameterisable and easy to design to visit exactly n discrete states.
One thing I'm thinking of is a hash-like operation where I contrive to have the first j results perfectly random through carefully designed mapping, and then any results between j and k would simply extrapolate on that pattern (albeit in a predictable way). The value j could then be chosen to find a compromise between a fair shuffle and a tolerable memory footprint.
First of all, it seems unreasonable to discount anything that uses O(n) memory and then discuss a solution that refers to an underlying array. You have an array. Shuffle it. If that doesn't work or isn't fast enough, come back to us with a question about it.
You only need to perform a complete shuffle once. After that, draw from index n, swap that element with an element located randomly before it and increase n, modulo element count. For example, with such a large dataset I'd use something like this.
Prime numbers are an option for hashes, but probably not the same way you think. Using two Mersenne primes (low and high, perhaps 0x1fff and 0x7fffffff) you should be able to come up with a much more general-purpose hashing algorithm.
size_t hash(unsigned char *value, size_t value_size, size_t low, size_t high) {
    size_t x = 0;
    while (value_size--) {
        x += *value++;
        x *= low;
    }
    return x % high;
}
#define hash(value, value_size, low, high) (hash((void *) value, value_size, low, high))
This should produce something fairly well distributed for all inputs larger than about two octets for example, with the minor troublesome exception for zero byte prefixes. You might want to treat those differently.
So... what I've ended up doing is digging deeper into pre-existing methods to
try to confirm their ability to approximate a fair shuffle.
I take a simple counter, which itself is guaranteed to visit
every in-range value exactly once, and then 'encrypt' it with an n-bit block
cypher. Rather, I round the range up to a power of two, and apply some 1:1
function; then if the result is out of range I repeat the permutation until the
result is in range.
This can be guaranteed to complete eventually because there are only a finite
number of out-of-range values within the power-of-two range, and they cannot
enter into an always-out-of-range cycle because that would imply that something
in the cycle was mapped from two different previous states (one from the
in-range set, and another from the out-of-range set), which would make the
function not bijective.
So all I need to do is devise a parameterisable function which I can tune to an
arbitrary number of bits. Like this one:
uint64_t mix(uint64_t x, uint64_t k) {
    const int s0 = BITS * 4 / 5;
    const int s1 = BITS / 5 + (k & 1);
    const int s2 = BITS * 2 / 5;
    k |= 1;
    x *= k;
    x ^= (x & BITMASK) >> s0;
    x ^= (x << s1) & BITMASK;
    x ^= (x & BITMASK) >> s2;
    x += 0x9e3779b97f4a7c15;
    return x & BITMASK;
}
I know it's bijective because I happen to have its inverse function handy:
uint64_t unmix(uint64_t x, uint64_t k) {
    const int s0 = BITS * 4 / 5;
    const int s1 = BITS / 5 + (k & 1);
    const int s2 = BITS * 2 / 5;
    k |= 1;
    uint64_t kp = k * k;
    while ((kp & BITMASK) > 1) {
        k *= kp;
        kp *= kp;
    }
    x -= 0x9e3779b97f4a7c15;
    x ^= ((x & BITMASK) >> s2) ^ ((x & BITMASK) >> s2 * 2);
    x ^= (x << s1) ^ (x << s1 * 2) ^ (x << s1 * 3) ^ (x << s1 * 4) ^ (x << s1 * 5);
    x ^= (x & BITMASK) >> s0;
    x *= k;
    return x & BITMASK;
}
This allows me to define a simple parameterisable PRNG like this:
uint64_t key[ROUNDS];
uint64_t seed = 0;

uint64_t rand_no_rep(void) {
    uint64_t x = seed++;
    do {
        for (int i = 0; i < ROUNDS; i++) x = mix(x, key[i]);
    } while (x >= RANGE);
    return x;
}
Initialise seed and key to random values and you're good to go.
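For what it's worth, one possible way to do that initialisation, assuming the ROUNDS, key, seed and BITMASK declarations above are in scope (this is only a sketch, not part of the original code):
#include <cstdint>
#include <random>

void init_rand_no_rep(void)
{
    std::random_device rd;
    std::mt19937_64 gen((uint64_t(rd()) << 32) ^ rd());
    for (int i = 0; i < ROUNDS; i++) key[i] = gen(); // random round keys
    seed = gen() & BITMASK; // start the counter at an arbitrary in-range point
}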
Using the inverse function lets me determine what seed must be to force
rand_no_rep() to produce a given output, making it much easier to test.
So far I've checked the cases where a constant a is followed by a constant
b. For ROUNDS==1, pairs collide on exactly 50% of the keys (and each pair
of collisions involves a different pair of a and b; they don't all converge
on 0, 1 or whatever). That is, for various k, a specific a-followed-by-b
case occurs for more than one k (this must happen at least once).
Subsequent values do not collide in that case, so different keys aren't
falling into the same cycle at different positions. Every k gives a unique
cycle.
The 50% collision rate comes from 25% of entries being non-unique when they're added to the list (counting both the new entry and the one it ran into). That might sound bad but it's actually lower than birthday paradox logic would suggest. Selecting randomly, the percentage of new entries that fail to be unique looks to converge between 36% and 37%. Being "better than random" is obviously worse than random, as far as randomness goes, but that's why they're called pseudo-random numbers.
Extending that to ROUNDS==2, we want to make sure that a second round doesn't
cancel out or simply repeat the effects of the first.
This is important because it would mean that multiple rounds are a waste of
time and memory, and that the function cannot be parameterised to any
substantial degree. It could happen trivially if mix() contained all linear
operations (say, multiply and add, mod RANGE). In that case all of the
parameters could be multiplied/added together to produce a single parameter for
a single round that would have the same effect. That would be disappointing,
as it would reduce the number of attainable permutations to the size of just
that one parameter, and if the set is as small as that then more work would be
needed to ensure that it's a good, representative set.
So what we want to see from two rounds is a large set of outcomes that could
never be achieved by one round. One way to demonstrate this is to look for the
original b-follows-a cases with an additional parameter c, where we want
to see every possible c following a and b.
We know from the one-round testing that in 50% of cases there is only one c
that can follow a and b because there is only one k that places b
immediately after a. We also know that 25% of the pairs of a and b were
unreachable (being the gap left behind by half the pairs that went into
collisions rather than new unique values), and the last 25% appear for two
different k.
The result that I get is that given a free choice of both keys, it's possible
to find about five eighths of the values of c following a given a and b.
About a quarter of the a/b pairs are unreachable (it's less predictable
now, because of the potential intermediate mappings into or out of the
duplicate or unreachable cases) and a quarter have a, b, and c appear
together in two sequences (which diverge afterwards).
I think there's a lot to be inferred from the difference between one round and
two, but I could be wrong about that and I need to double-check. Further
testing gets harder; or at least slower unless I think more carefully about how
I'm going to do it.
I haven't yet demonstrated that, amongst the set of permutations it can produce, they're all equally likely; but this is normally not guaranteed for other PRNGs either.
It's fairly slow for a PRNG, but it would fit SIMD trivially.

Finding the remainder of a large multiplication in C++

I would like to ask some help concerning the following problem. I need to create a function that will multiply two integers and extract the remainder of this multiplication divided by a certain number (in short, (x*y)%A).
I am using unsigned long long int for this problem, but A = 15! in this case, and both x and y have been calculated modulo A previously. Thus, x*y can be greater than 2^64 - 1, therefore overflowing.
I did not want to use external libraries. Could anyone help me designing a short algorithm to solve this problem?
Thanks in advance.
If you already have mod A of x and y, why not use them? Something like:
if
x = int_x*A + mod_x
y = int_y*A + mod_y
then
(x*y)%A = ((int_x*A + mod_x) * (int_y*A + mod_y))%A = (mod_x*mod_y)%A
since every term containing a factor of A vanishes modulo A.
mod_x*mod_y should be much smaller, right?
EDIT:
If you are trying to find the modulus wrt a large number like 10e11, I guess you would have to use another method. But while not really efficient, something like this would work
const int MAX_INT = 10e22 // get max int
int larger = max(mod_x, mod_y)   // get the larger number
int smaller = min(mod_x, mod_y)  // and the smaller one
int largest_part = floor(MAX_INT/smaller)
if (largest_part > larger):
    // no prob of overflow. use normal routine
else:
    int larger_array = []
    while(largest_part < larger):
        larger_array.append(largest_part)
        larger -= largest_part
        largest_part = floor(MAX_INT/smaller)
    // now use the parts array to calculate the mod by going through each element's mod and adding them etc
If you understand this code and the setup, you should be able to figure out the rest
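For completeness, a standard binary ("double and add") variant of the same idea avoids building an explicit array of parts; it assumes A < 2^63 so the intermediate sums cannot overflow (true for A = 15!). On compilers that provide unsigned __int128, (unsigned __int128)x * y % A achieves the same thing directly.
#include <cstdint>

// Computes (x * y) % A without overflow, by adding x for each set bit of y
// and doubling x modulo A as it goes.
uint64_t mulmod(uint64_t x, uint64_t y, uint64_t A)
{
    uint64_t result = 0;
    x %= A;
    while (y > 0)
    {
        if (y & 1)
            result = (result + x) % A; // add this power-of-two multiple of x
        x = (x + x) % A;               // double x modulo A
        y >>= 1;
    }
    return result;
}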

How to approximate the count of distinct values in an array in a single pass through it

I have several huge arrays (millions++ members). All those are arrays of numbers and they are not sorted (and I cannot do that). Some are uint8_t, some uint16_t/32/64. I would like to approximate the count of distinct values in these arrays. The conditions are following:
speed is VERY important, I need to do this in one pass through the array and I must go through it sequentially (cannot jump back and forth) (I am doing this in C++, if that's important)
I don't need EXACT counts. What I want to know is that if it is an uint32_t array if there are like 10 or 20 distinct numbers or if there are thousands or millions.
I have quite a bit of memory that I can use, but the less is used the better
the smaller the array data type, the more accurate I need to be
I don't mind STL, but if I can do it without it that would be great (no BOOST though, sorry)
if the approach can be easily parallelized, that would be cool (but it's not a mandatory condition)
Examples of perfect output:
ArrayA [uint32_t, 3M members]: ~128 distinct values
ArrayB [uint32_t, 9M members]: 100000+ distinct values
ArrayC [uint8_t, 50K members]: 2-5 distinct values
ArrayD [uint8_t, 700K members]: 64+ distinct values
I understand that some of the constraints may seem illogical, but that's the way it is.
As a side note, I also want the top X (3 or 10) most used and least used values, but that is far easier to do and I can do it on my own. However if someone has thoughts for that too, feel free to share them!
EDIT: a bit of clarification regarding STL. If you have a solution using it, please post it. Not using STL would be just a bonus for us; we don't fancy it too much. However if it is a good solution, it will be used!
For 8- and 16-bit values, you can just make a table of the count of each value; every time you write to a table entry that was previously zero, that's a different value found.
For larger values, if you are not interested in counts above 100000, std::map is suitable, if it's fast enough. If that's too slow for you, you could program your own B-tree.
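A minimal sketch of the table-of-counts idea for 16-bit values (the function name is mine): one pass, a fixed 2^16-entry table, and a new distinct value is detected exactly when its count goes from zero to one.
#include <cstddef>
#include <cstdint>
#include <vector>

std::size_t count_distinct_u16(const uint16_t* data, std::size_t n)
{
    std::vector<uint32_t> counts(std::size_t(1) << 16, 0);
    std::size_t distinct = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (counts[data[i]]++ == 0)
            ++distinct;
    return distinct;
}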
I'm pretty sure you can do it by:
Create a Bloom filter
Run through the array inserting each element into the filter (this is a "slow" O(n), since it requires computing several independent decent hashes of each value)
Count how many bits are set in the Bloom Filter
Compute back from the density of the filter to an estimate of the number of distinct values. I don't know the calculation off the top of my head, but any treatment of the theory of Bloom filters goes into this, because it's vital to the probability of the filter giving a false positive on a lookup.
Presumably if you're simultaneously computing the top 10 most frequent values, then if there are less than 10 distinct values you'll know exactly what they are and you don't need an estimate.
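Here is a rough sketch of that approach; the filter size, the number of hashes and the hash mixer are my own choices, and the last step uses the usual estimate n ≈ -(m/k) * ln(1 - t/m) for a filter of m bits with k hashes and t bits set:
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BloomCounter
{
    std::vector<bool> bits;
    std::size_t k;

    BloomCounter(std::size_t m, std::size_t hashes) : bits(m, false), k(hashes) {}

    // Cheap 64-bit mixer (the splitmix64 finalizer).
    static uint64_t mix(uint64_t x)
    {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    void insert(uint64_t value)
    {
        for (std::size_t i = 0; i < k; ++i)
            bits[mix(value + i * 0x9e3779b97f4a7c15ULL) % bits.size()] = true;
    }

    double estimate_distinct() const
    {
        std::size_t t = 0;
        for (std::size_t i = 0; i < bits.size(); ++i) t += bits[i];
        if (t == bits.size()) return HUGE_VAL; // filter saturated, estimate unusable
        double m = double(bits.size());
        return -(m / double(k)) * std::log(1.0 - double(t) / m);
    }
};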
I believe the "most frequently used" problem is difficult (well, memory-consuming). Suppose for a moment that you only want the top 1 most frequently used value. Suppose further that you have 10 million entries in the array, and that after the first 9.9 million of them, none of the numbers you've seen so far has appeared more than 100k times. Then any of the values you've seen so far might be the most-frequently used value, since any of them could have a run of 100k values at the end. Even worse, any two of them could have a run of 50k each at the end, in which case the count from the first 9.9 million entries is the tie-breaker between them. So in order to work out in a single pass which is the most frequently used, I think you need to know the exact count of each value that appears in the 9.9 million. You have to prepare for that freak case of a near-tie between two values in the last 0.1 million, because if it happens you aren't allowed to rewind and check the two relevant values again. Eventually you can start culling values -- if there's a value with a count of 5000 and only 4000 entries left to check, then you can cull anything with a count of 1000 or less. But that doesn't help very much.
So I might have missed something, but I think that in the worst case, the "most frequently used" problem requires you to maintain a count for every value you have seen, right up until nearly the end of the array. So you might as well use that collection of counts to work out how many distinct values there are.
One approach that can work, even for big values, is to spread them into lazily allocated buckets.
Suppose that you are working with 32 bits integers, creating an array of 2**32 bits is relatively impractical (2**29 bytes, hum). However, we can probably assume that 2**16 pointers is still reasonable (2**19 bytes: 500kB), so we create 2**16 buckets (null pointers).
The big idea therefore is to take a "sparse" approach to counting, and hope that the integers won't be too dispersed, and thus that many of the bucket pointers will remain null.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<int32_t, int32_t> Pair;
typedef std::vector<Pair> Bucket;
typedef std::vector<Bucket*> Vector;

struct Comparator {
    bool operator()(Pair const& left, Pair const& right) const {
        return left.first < right.first;
    }
};

void add(Bucket& v, int32_t value) {
    Pair const pair(value, 1);
    Bucket::iterator it = std::lower_bound(v.begin(), v.end(), pair, Comparator());
    if (it == v.end() or it->first > value) {
        v.insert(it, pair);
        return;
    }
    it->second += 1;
}

void gather(Vector& v, int32_t const* begin, int32_t const* end) {
    // v must have been created with 2**16 null bucket pointers (see above)
    for (; begin != end; ++begin) {
        uint16_t const index = *begin >> 16;
        Bucket*& bucket = v[index];
        if (bucket == 0) { bucket = new Bucket(); }
        add(*bucket, *begin);
    }
}
Once you have gathered your data, then you can count the number of different values or find the top or bottom pretty easily.
A few notes:
the number of buckets is completely customizable (thus letting you control the amount of original memory)
the strategy of repartition is customizable as well (this is just a cheap hash table I have made here)
it is possible to monitor the number of allocated buckets and abandon, or switch gear, if it starts blowing up.
if each value is different, then it just won't work, but when you realize it, you will already have collected many counts, so you'll at least be able to give a lower bound of the number of different values, and you'll also have a starting point for the top/bottom.
If you manage to gather those statistics, then the work is cut out for you.
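As a small follow-up sketch (reusing the typedefs from the code above): once gather() has filled the buckets, counting the distinct values is just a matter of summing the bucket sizes.
std::size_t count_distinct(Vector const& v)
{
    std::size_t distinct = 0;
    for (Vector::const_iterator it = v.begin(); it != v.end(); ++it)
        if (*it != 0)
            distinct += (*it)->size();
    return distinct;
}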
For 8 and 16 bit it's pretty obvious, you can track every possibility every iteration.
When you get to 32 and 64 bit integers, you don't really have the memory to track every possibility.
Here's a few natural suggestions that are likely outside the bounds of your constraints.
I don't really understand why you can't sort the array. RadixSort is O(n) and once sorted it would be one more pass to get accurate distinct-value and top X information. In reality it would be 6 passes altogether for 32-bit if you used a 1-byte radix (1 pass for counting + 4 passes, one per byte + 1 pass for getting values).
In the same cusp as above, why not just use SQL. You could create a stored procedure that takes the array in as a table valued parameter and return the number of distinct values and the top x values in one go. This stored procedure could also be called in parallel.
-- number of distinct
SELECT COUNT(DISTINCT(n)) FROM #tmp
-- top x
SELECT TOP 10 n, COUNT(n) FROM #tmp GROUP BY n ORDER BY COUNT(n) DESC
I've just thought of an interesting solution. It's based on a law of Boolean algebra called idempotence of multiplication, which states that:
X * X = X
From it, and using the commutative property of boolean multiplication, we can deduce that:
X * Y * X = X * X * Y = X * Y
Now, you see where I'm going to? This is how the algorithm would work (I'm terrible with pseudo-code):
make c = element1 & element2 (binary AND between the binary representations of the integers)
for i=3 until i == size_of_array
    make b = c & element[i];
    if b != c then different_values++;
    c = b;
In first iteration, we make (element1*element2) * element3. We could represent it as:
(X * Y) * Z
If Z (element3) is equal to X (element1), then:
(X * Y) * Z = X * Y * X = X * Y
And if Z is equal to Y (element2), then:
(X * Y) * Z = X * Y * Y = X * Y
So, if Z isn't different from X or Y, then X * Y won't change when we multiply it by Z.
This remains valid for big expressions, like:
(X * A * Z * G * T * P * S) * S = X * A * Z * G * T * P * S
If we receive a value which is a factor of our big multiplicand (that means it has already been computed), then the big multiplicand won't change when we multiply it by the received input, so there's no new distinct value.
So that's how it will go. Each time a different value is computed, the multiplication of our big multiplicand and that distinct value will be different to the big operand. So, with b = c & element[i], if b != c we just increment our distinct-values counter.
I guess I'm not being clear enough. If that's the case, please let me know.

C++ program to calculate quotients of large factorials

How can I write a C++ program to calculate large factorials?
For example, if I want to calculate (100!) / (99!), we know the answer is 100, but if I calculate the factorials of the numerator and denominator individually, both the numbers are gigantically large.
expanding on Dirk's answer (which imo is the correct one):
#include <math.h>
#include <stdio.h>

int main(){
    printf("%lf\n", (100.0/99.0) * exp(lgamma(100)-lgamma(99)) );
}
try it, it really does what you want even though it looks a little crazy if you are not familiar with it. Using a bigint library is going to be wildly inefficient. Taking exps of logs of gammas is super fast. This runs instantly.
The reason you need to multiply by 100/99 is that gamma(n) is equivalent to (n-1)!, not n!. So yeah, you could just do exp(lgamma(101)-lgamma(100)) instead. Also, gamma is defined for more than just integers.
You can use the Gamma function instead; see the Wikipedia page, which also points to code.
Of course this particular expression should be optimized, but as for the title question, I like GMP because it offers a decent C++ interface, and is readily available.
#include <iostream>
#include <gmpxx.h>

mpz_class fact(unsigned int n)
{
    mpz_class result(n);
    while(n --> 1) result *= n;
    return result;
}

int main()
{
    mpz_class result = fact(100) / fact(99);
    std::cout << result.get_str(10) << std::endl;
}
compiles on Linux with g++ -Wall -Wextra -o test test.cc -lgmpxx -lgmp
By the sounds of your comments, you also want to calculate expressions like 100!/(96!*4!).
Having "cancelled out the 96!", leaving yourself with (97 * ... * 100)/4!, you can then keep the arithmetic within smaller bounds by taking as few numbers "from the top" as possible as you go. So, in this case:
i = 97
j = 4
result = 1
while (i <= 100) or (j > 1)
    if (j > 1) and (result % j == 0)
        result /= j
        --j
    else
        result *= i
        ++i
You can of course be cleverer than that in the same vein.
This just delays the inevitable, though: eventually you reach the limits of your fixed-size type. Factorials explode so quickly that for heavy-duty use you're going to need multiple-precision.
Here's an example of how to do so:
http://www.daniweb.com/code/snippet216490.html
The approach they take is to store the big #s as a character array of digits.
Also see this SO question: Calculate the factorial of an arbitrarily large number, showing all the digits
You can use a big integer library like gmp which can handle arbitrarily large integers.
The only optimization that can be made here (considering that in m!/n!, m is larger than n) is to cross out everything you can before multiplying.
If m is less than n we would have to swap the elements first, then calculate the factorial and then take something like 1 / result. Note that the result in this case would be a double and you should handle it as a double.
Here is the code.
if (m == n) return 1;

// If 'm' is less than 'n' we would have
// to calculate the denominator first and then
// make one division operation
bool need_swap = (m < n);
if (need_swap) std::swap(m, n);

// #note You could also use some BIG integer implementation,
// if your factorial would still be big after crossing some values

// Store the result here
int result = 1;
for (int i = m; i > n; --i) {
    result *= i;
}

// Here comes the division if needed
// After that, we swap the elements back
if (need_swap) {
    // Note the double here
    // If m is always > n then these lines are not needed
    double fractional_result = (double)1 / result;
    std::swap(m, n);
}
Also to mention (if you need some big int implementation and want to do it yourself) - the best approach that is not so hard to implement is to treat your int as a sequence of blocks and the best is to split your int to series, that contain 4 digits each.
Example: 1234 | 4567 | 2323 | 2345 | .... Then you'll have to implement every basic operation that you need (sum, mult, maybe pow, division is actually a tough one).
To solve x!/y! for x > y:
int product = 1;
for (int i = 0; i < x - y; i++)
{
    product *= x - i;
}
If y > x switch the variables and take the reciprocal of your solution.
I asked a similar question, and got some pointers to some libraries:
How can I calculate a factorial in C# using a library call?
It depends on whether or not you need all the digits, or just something close. If you just want something close, Stirling's Approximation is a good place to start.
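For instance, a tiny illustration of the Stirling route (my own example; the result is deliberately approximate, coming out close to but not exactly 100 for 100!/99!):
#include <cmath>
#include <cstdio>

// Stirling's approximation: ln(n!) ~ n*ln(n) - n + 0.5*ln(2*pi*n)
static double ln_factorial_stirling(double n)
{
    const double pi = 3.14159265358979323846;
    return n * std::log(n) - n + 0.5 * std::log(2.0 * pi * n);
}

int main()
{
    std::printf("%f\n", std::exp(ln_factorial_stirling(100) - ln_factorial_stirling(99)));
    return 0;
}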