Fast divisibility tests (by 2,3,4,5,.., 16)?

Fast divisibility tests (by 2,3,4,5,.., 16)? - c++

What are the fastest divisibility tests? Say, given a little-endian architecture and a 32-bit signed integer: how to calculate very fast that a number is divisible by 2,3,4,5,... up to 16?
WARNING: given code is EXAMPLE only. Every line is independent! Just obvious solution using modulo operation is slow on many processors, which don't have DIV hardware (like many ARMs). Some compilers are also cannot make such optimizations (say, if divisor is a function's argument or is dependent on something).
Divisible_by_1 = do();
Divisible_by_2 = if (!(number & 1)) do();
Divisible_by_3 = ?
Divisible_by_4 = ?
Divisible_by_5 = ?
Divisible_by_6 = ?
Divisible_by_7 = ?
Divisible_by_8 = ?
Divisible_by_9 = ?
Divisible_by_10 = ?
Divisible_by_11 = ?
Divisible_by_12 = ?
Divisible_by_13 = ?
Divisible_by_14 = ?
Divisible_by_15 = ?
Divisible_by_16 = if(!number & 0x0000000F) do();
and special cases:
Divisible_by_2k = if(number & (tk-1)) do(); //tk=2**k=(2*2*2*...) k times

In every case (including divisible by 2):
if (number % n == 0) do();
Anding with a mask of low order bits is just obfuscation, and with a modern compiler will not be any faster than writing the code in a readable fashion.
If you have to test all of the cases, you might improve performance by putting some of the cases in the if for another: there's no point it testing for divisibility by 4 if divisibility by 2 has already failed, for example.

It is not a bad idea AT ALL to figure out alternatives to division instructions (which includes modulo on x86/x64) because they are very slow. Slower (or even much slower) than most people realize. Those suggesting "% n" where n is a variable are giving foolish advice because it will invariably lead to the use of the division instruction. On the other hand "% c" (where c is a constant) will allow the compiler to determine the best algorithm available in its repertoire. Sometimes it will be the division instruction but a lot of the time it won't.
In this document Torbjörn Granlund shows that the ratio of clock cycles required for unsigned 32-bit mults:divs is 4:26 (6.5x) on Sandybridge and 3:45 (15x) on K10. for 64-bit the respective ratios are 4:92 (23x) and 5:77 (14.4x).
The "L" columns denote latency. "T" columns denote throughput. This has to do with the processor's ability to handle multiple instructions in parallell. Sandybridge can issue one 32-bit multiplication every other cycle or one 64-bit every cycle. For K10 the corresponding throughput is reversed. For divisions the K10 needs to complete the entire sequence before it may begin another. I suspect it is the same for Sandybridge.
Using the K10 as an example it means that during the cycles required for a 32-bit division (45) the same number (45) of multiplications can be issued and the next-to-last and last one of these will complete one and two clock cycles after the division has completed. A LOT of work can be performed in 45 multiplications.
It is also interesting to note that divs have become less efficient with the evolution from K8-K9 to K10: from 39 to 45 and 71 to 77 clock cycles for 32- and 64-bit.
Granlund's page at gmplib.org and at the Royal Institute of Technology in Stockholm contain more goodies, some of which have been incorporated into the gcc compiler.

As #James mentioned, let the compiler simplify it for you. If n is a constant, any decent compiler is able to recognize the pattern and change it to a more efficient equivalent.
For example, the code
#include <stdio.h>
int main() {
size_t x;
scanf("%u\n", &x);
__asm__ volatile ("nop;nop;nop;nop;nop;");
const char* volatile foo = (x%3 == 0) ? "yes" : "no";
__asm__ volatile ("nop;nop;nop;nop;nop;");
printf("%s\n", foo);
return 0;
}
compiled with g++-4.5 -O3, the relevant part of x%3 == 0 will become
mov rcx,QWORD PTR [rbp-0x8] # rbp-0x8 = &x
mov rdx,0xaaaaaaaaaaaaaaab
mov rax,rcx
mul rdx
lea rax,"yes"
shr rdx,1
lea rdx,[rdx+rdx*2]
cmp rcx,rdx
lea rdx,"no"
cmovne rax,rdx
mov QWORD PTR [rbp-0x10],rax
which, translated back to C code, means
(hi64bit(x * 0xaaaaaaaaaaaaaaab) / 2) * 3 == x ? "yes" : "no"
// equivalatent to: x % 3 == 0 ? "yes" : "no"
no division involved here. (Note that 0xaaaaaaaaaaaaaaab == 0x20000000000000001L/3)
Edit:
The magic constant 0xaaaaaaaaaaaaaaab can be computed in http://www.hackersdelight.org/magic.htm
For divisors of the form 2n - 1, check http://graphics.stanford.edu/~seander/bithacks.html#ModulusDivision

A bit tongue in cheek, but assuming you get the rest of the answers:
Divisible_by_6 = Divisible_by_3 && Divisible_by_2;
Divisible_by_10 = Divisible_by_5 && Divisible_by_2;
Divisible_by_12 = Divisible_by_4 && Divisible_by_3;
Divisible_by_14 = Divisible_by_7 && Divisible_by_2;
Divisible_by_15 = Divisible_by_5 && Divisible_by_3;

Assume number is unsigned (32-bits). Then the following are very fast ways to compute divisibility up to 16. (I haven't measured but the assembly code indicates so.)
bool divisible_by_2 = number % 2 == 0;
bool divisible_by_3 = number * 2863311531u <= 1431655765u;
bool divisible_by_4 = number % 4 == 0;
bool divisible_by_5 = number * 3435973837u <= 858993459u;
bool divisible_by_6 = divisible_by_2 && divisible_by_3;
bool divisible_by_7 = number * 3067833783u <= 613566756u;
bool divisible_by_8 = number % 8 == 0;
bool divisible_by_9 = number * 954437177u <= 477218588u;
bool divisible_by_10 = divisible_by_2 && divisible_by_5;
bool divisible_by_11 = number * 3123612579u <= 390451572u;
bool divisible_by_12 = divisible_by_3 && divisible_by_4;
bool divisible_by_13 = number * 3303820997u <= 330382099u;
bool divisible_by_14 = divisible_by_2 && divisible_by_7;
bool divisible_by_15 = number * 4008636143u <= 286331153u;
bool divisible_by_16 = number % 16 == 0;
Regarding divisibility by d the following rules hold:
When d is a power of 2:
As pointed out by James Kanze, you can use is_divisible_by_d = (number % d == 0). Compilers are clever enough to implement this as (number & (d - 1)) == 0 which is very efficient but obfuscated.
However, when d is not a power of 2 it looks like the obfuscations shown above are more efficient than what current compilers do. (More on that later).
When d is odd:
The technique takes the form is_divisible_by_d = number * a <= b where a and b are cleverly obtained constants. Notice that all we need is 1 multiplication and 1 comparison:
When d is even but not a power of 2:
Then, write d = p * q where p is a power of 2 and q is odd and use the "tongue in cheek" suggested by unpythonic, that is, is_divisible_by_d = is_divisible_by_p && is_divisible_by_q. Again, only 1 multiplication (in the calculation of is_divisible_by_q) is performed.
Many compilers (I've tested clang 5.0.0, gcc 7.3, icc 18 and msvc 19 using godbolt) replace number % d == 0 by (number / d) * d == number. They use a clever technique (see references in Olof Forshell's answer) to replace the division by a multiplication and a bit shift. They end up doing 2 multiplications. In contrast the techniques above perform only 1 multiplication.
Update 01-Oct-2018
Looks like the algorithm above is coming to GCC soon (already in trunk):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853
The GCC's implementation seems even more efficient. Indeed, the implementation above has three parts: 1) divisibility by the divisor's even part; 2) divisibility by the divisor's odd part; 3) && to connect the results of the two previous steps. By using an assembler instruction which is not efficiently available in standard C++ (ror), GCC wraps up the three parts into a single one which is very similar to that of divisibility by the odd part. Great stuff! Having this implementation available, it's better (for both clarity and performance) to fall back to % all times.
Update 05-May-2020
My articles on the subject have been published:
Quick Modular Calculations (Part 1), Overload Journal 154, December 2019, pages 11-15.
Quick Modular Calculations (Part 2), Overload Journal 155, February 2020, pages 14-17.
Quick Modular Calculations (Part 3), Overload Journal 156, April 2020, pages 10-13.

First of all, I remind you that a number in the form bn...b2b1b0 in binary has value:
number = bn*2^n+...+b2*4+b1*2+b0
Now, when you say number%3, you have:
number%3 =3= bn*(2^n % 3)+...+b2*1+b1*2+b0
(I used =3= to indicate congruence modulo 3). Note also that b1*2 =3= -b1*1
Now I will write all the 16 divisions using + and - and possibly multiplication (note that multiplication could be written as shift or sum of same value shifted to different locations. For example 5*x means x+(x<<2) in which you compute x once only)
Let's call the number n and let's say Divisible_by_i is a boolean value. As an intermediate value, imagine Congruence_by_i is a value congruent to n modulo i.
Also, lets say n0 means bit zero of n, n1 means bit 1 etc, that is
ni = (n >> i) & 1;
Congruence_by_1 = 0
Congruence_by_2 = n&0x1
Congruence_by_3 = n0-n1+n2-n3+n4-n5+n6-n7+n8-n9+n10-n11+n12-n13+n14-n15+n16-n17+n18-n19+n20-n21+n22-n23+n24-n25+n26-n27+n28-n29+n30-n31
Congruence_by_4 = n&0x3
Congruence_by_5 = n0+2*n1-n2-2*n3+n4+2*n5-n6-2*n7+n8+2*n9-n10-2*n11+n12+2*n13-n14-2*n15+n16+2*n17-n18-2*n19+n20+2*n21-n22-2*n23+n24+2*n25-n26-2*n27+n28+2*n29-n30-2*n31
Congruence_by_7 = n0+2*n1+4*n2+n3+2*n4+4*n5+n6+2*n7+4*n8+n9+2*n10+4*n11+n12+2*n13+4*n14+n15+2*n16+4*n17+n18+2*n19+4*n20+n21+2*n22+4*n23+n24+2*n25+4*n26+n27+2*n28+4*n29+n30+2*n31
Congruence_by_8 = n&0x7
Congruence_by_9 = n0+2*n1+4*n2-n3-2*n4-4*n5+n6+2*n7+4*n8-n9-2*n10-4*n11+n12+2*n13+4*n14-n15-2*n16-4*n17+n18+2*n19+4*n20-n21-2*n22-4*n23+n24+2*n25+4*n26-n27-2*n28-4*n29+n30+2*n31
Congruence_by_11 = n0+2*n1+4*n2+8*n3+5*n4-n5-2*n6-4*n7-8*n8-5*n9+n10+2*n11+4*n12+8*n13+5*n14-n15-2*n16-4*n17-8*n18-5*n19+n20+2*n21+4*n22+8*n23+5*n24-n25-2*n26-4*n27-8*n28-5*n29+n30+2*n31
Congruence_by_13 = n0+2*n1+4*n2+8*n3+3*n4+6*n5-n6-2*n7-4*n8-8*n9-3*n10-6*n11+n12+2*n13+4*n14+8*n15+3*n16+6*n17-n18-2*n19-4*n20-8*n21-3*n22-6*n3+n24+2*n25+4*n26+8*n27+3*n28+6*n29-n30-2*n31
Congruence_by_16 = n&0xF
Or when factorized:
Congruence_by_1 = 0
Congruence_by_2 = n&0x1
Congruence_by_3 = (n0+n2+n4+n6+n8+n10+n12+n14+n16+n18+n20+n22+n24+n26+n28+n30)-(n1+n3+n5+n7+n9+n11+n13+n15+n17+n19+n21+n23+n25+n27+n29+n31)
Congruence_by_4 = n&0x3
Congruence_by_5 = n0+n4+n8+n12+n16+n20+n24+n28-(n2+n6+n10+n14+n18+n22+n26+n30)+2*(n1+n5+n9+n13+n17+n21+n25+n29-(n3+n7+n11+n15+n19+n23+n27+n31))
Congruence_by_7 = n0+n3+n6+n9+n12+n15+n18+n21+n24+n27+n30+2*(n1+n4+n7+n10+n13+n16+n19+n22+n25+n28+n31)+4*(n2+n5+n8+n11+n14+n17+n20+n23+n26+n29)
Congruence_by_8 = n&0x7
Congruence_by_9 = n0+n6+n12+n18+n24+n30-(n3+n9+n15+n21+n27)+2*(n1+n7+n13+n19+n25+n31-(n4+n10+n16+n22+n28))+4*(n2+n8+n14+n20+n26-(n5+n11+n17+n23+n29))
// and so on
If these values end up being negative, add it with i until they become positive.
Now what you should do is recursively feed these values through the same process we just did until Congruence_by_i becomes less than i (and obviously >= 0). This is similar to what we do when we want to find remainder of a number by 3 or 9, remember? Sum up the digits, if it had more than one digit, some up the digits of the result again until you get only one digit.
Now for i = 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 16:
Divisible_by_i = (Congruence_by_i == 0);
And for the rest:
Divisible_by_6 = Divisible_by_3 && Divisible_by_2;
Divisible_by_10 = Divisible_by_5 && Divisible_by_2;
Divisible_by_12 = Divisible_by_4 && Divisible_by_3;
Divisible_by_14 = Divisible_by_7 && Divisible_by_2;
Divisible_by_15 = Divisible_by_5 && Divisible_by_3;
Edit: Note that some of the additions could be avoided from the very beginning. For example n0+2*n1+4*n2 is the same as n&0x7, similarly n3+2*n4+4*n5 is (n>>3)&0x7 and thus with each formula, you don't have to get each bit individually, I wrote it like that for the sake of clarity and similarity in operation. To optimize each of the formulas, you should work on it yourself; group operands and factorize operation.

The LCM of these numbers seems to be 720720. Its quite small, so that you can perform a single modulus operation and use the remainder as the index in the precomputed LUT.

You should just use (i % N) == 0 as your test.
My compiler (a fairly old version of gcc) generated good code for all the cases I tried.
Where bit tests were appropriate it did that. Where N was a constant it didn't generate the obvious "divide" for any case, it always used some "trick".
Just let the compiler generate the code for you, it will almost certainly know more about the architecture of the machine than you do :) And these are easy optimisations where you are unlikely to think up something better than the compiler does.
It's an interesting question though. I can't list the tricks used by the compiler for each constant as I have to compile on a different computer.. But I'll update this reply later on if nobody beats me to it :)

This probably won't help you in code, but there's a neat trick which can help do this in your head in some cases:
For divide by 3: For a number represented in decimal, you can sum all the digits, and check if the sum is divisible by 3.
Example: 12345 => 1+2+3+4+5 = 15 => 1+5 = 6, which is divisible by 3 (3 x 4115 = 12345).
More interestingly the same technique works for all factors of X-1, where X is the base in which the number is represented. So for decimal number, you can check divide by 3 or 9. For hex, you can check divide by 3,5 or 15. And for octal numbers, you can check divide by 7.

In a previous question, I showed a fast algorithm to check in base N for divisors that are factors of N-1. Base transformations between different powers of 2 are trivial; that's just bit grouping.
Therefore, checking for 3 is easy in base 4; checking for 5 is easy in base 16, and checking for 7 (and 9) is easy in base 64.
Non-prime divisors are trivial, so only 11 and 13 are hard cases. For 11, you could use base 1024, but at that point it's not really efficient for small integers.

A method that can help modulo reduction of all integer values uses bit-slicing and popcount.
mod3 = pop(x & 0x55555555) + pop(x & 0xaaaaaaaa) << 1; // <- one term is shared!
mod5 = pop(x & 0x99999999) + pop(x & 0xaaaaaaaa) << 1 + pop(x & 0x44444444) << 2;
mod7 = pop(x & 0x49249249) + pop(x & 0x92492492) << 1 + pop(x & 0x24924924) << 2;
modB = pop(x & 0x5d1745d1) + pop(x & 0xba2e8ba2) << 1 +
pop(x & 0x294a5294) << 2 + pop(x & 0x0681a068) << 3;
modD = pop(x & 0x91b91b91) + pop(x & 0xb2cb2cb2) << 1 +
pop(x & 0x64a64a64) << 2 + pop(x & 0xc85c85c8) << 3;
The maximum values for these variables are 48, 80, 73, 168 and 203, which all fit into 8-bit variables. The second round can be carried in parallel (or some LUT method can be applied)
mod3 mod3 mod5 mod5 mod5 mod7 mod7 mod7 modB modB modB modB modD modD modD modD
mask 0x55 0xaa 0x99 0xaa 0x44 0x49 0x92 0x24 0xd1 0xa2 0x94 0x68 0x91 0xb2 0x64 0xc8
shift *1 *2 *1 *2 *4 *1 *2 *4 *1 *2 *4 *8 *1 *2 *4 *8
sum <-------> <------------> <-----------> <-----------------> <----------------->

You can replace division by a non-power-of-two constant by a multiplication, essentially multiplying by the reciprocal of your divisor. The details to get the exact result by this method are complicated.
Hacker's Delight discusses this at length in chapter 10 (unfortunately not available online).
From the quotient you can get the modulus by another multiplication and a subtraction.

One thing to consider: since you only care about divisibility up to 16, you really only need to check divisibility by the primes up to 16. These are 2, 3, 5, 7, 11, and 13.
Divide your number by each of the primes, keeping track with a boolean (such as div2 = true). The numbers two and three are special cases. If div3 is true, try dividing by 3 again, setting div9. Two and its powers are very simple (note: '&' is one of the fastest things a processor can do):
if n & 1 == 0:
div2 = true
if n & 3 == 0:
div4 = true
if n & 7 == 0:
div8 = true
if n & 15 == 0:
div16 = true
You now have the booleans div2, div3, div4, div5, div7, div8, div9, div11, div13, and div16. All
other numbers are combinations; for instance div6 is the same as (div2 && div3)
So, you only need to do either 5 or 6 actual divisions (6 only if your number is divisible by 3).
For myself, i would probably use bits in a single register for my booleans; for instance
bit_0 means div2. I can then use masks:
if (flags & (div2+div3)) == (div2 + div3): do_6()
note that div2+div3 can be a precomputed constant. If div2 is bit0, and div3 is bit1,
then div2+div3 == 3. This makes the above 'if' optimize to:
if (flags & 3) == 3: do_6()
So now... mod without a divide:
def mod(n,m):
i = 0
while m < n:
m <<= 1
i += 1
while i > 0:
m >>= 1
if m <= n: n -= m
i -= 1
return n
div3 = mod(n,3) == 0
...
btw: the worst case for the above code is 31 times through either loop for a 32-bit number
FYI: Just looked at Msalter's post, above. His technique can be used instead of mod(...) for some of the primes.

Fast tests for divisibility depend heavily on the base in which the number is represented. In case when base is 2, I think you can only do "fast tests" for divisibility by powers of 2. A binary number is divisible by 2n iff the last n binary digits of that number are 0. For other tests I don't think you can generally find anything faster than %.

A bit of evil, obfuscated bit-twiddling can get you divisbility by 15.
For a 32-bit unsigned number:
def mod_15ish(unsigned int x) {
// returns a number between 0 and 21 that is either x % 15
// or 15 + (x % 15), and returns 0 only for x == 0
x = (x & 0xF0F0F0F) + ((x >> 4) & 0xF0F0F0F);
x = (x & 0xFF00FF) + ((x >> 8) & 0xFF00FF);
x = (x & 0xFFFF) + ((x >> 16) & 0xFFFF);
// *1
x = (x & 0xF) + ((x >> 4) & 0xF);
return x;
}
def Divisible_by_15(unsigned int x) {
return ((x == 0) || (mod_15ish(x) == 15));
}
You can build similar divisibility routines for 3 and 5 based on mod_15ish.
If you have 64-bit unsigned ints to deal with, extend each constant above the *1 line in the obvious way, and add a line above the *1 line to do a right shift by 32 bits with a mask of 0xFFFFFFFF. (The last two lines can stay the same) mod_15ish then obeys the same basic contract, but the return value is now between 0 and 31. (so what's maintained is that x % 15 == mod_15ish(x) % 15)

Here are some tips I haven't see anyone else suggest yet:
One idea is to use a switch statement, or precompute some array. Then, any decent optimizer can simply index each case directly. For example:
// tests for (2,3,4,5,6,7)
switch (n % 8)
{
case 0: break;
case 1: break;
case 2: do(2); break;
case 3: do(3); break;
case 4: do(2); do(4) break;
case 5: do(5); break;
case 6: do(2); do(3); do(4); break;
case 7: do(7); break;
}
Your application is a bit ambiguous, but you may only need to check prime numbers less than n=16. This is because all numbers are factors of the current or previous prime numbers. So for n=16, you might be able to get away with only checking 2, 3, 5, 7, 11, 13 somehow. Just a thought.

Related

Incorrect Multiplication/Division in Galois Field (2^8)

I'm attempting to implement multiplication and division in GF(2^8) using log and exponential tables. I'm using the exponent of 3 as my generator, using instructions from here.
However I'm failing some trivial test cases.
example:
//passes
assert((GF256elm(4) / GF256elm(1)) == GF256elm(4));
assert((GF256elm(32) / GF256elm(16)) == GF256elm(2));
assert((GF256elm(15) / GF256elm(5)) == GF256elm(3));
assert((GF256elm(88) / GF256elm(8)) == GF256elm(11));
//fails, but should pass
assert((GF256elm(77) / GF256elm(11)) == GF256elm(7));
assert((GF256elm(77) / GF256elm(7)) == GF256elm(11));
The first four line passes, however it fails on both 5th and 6th line.
Upon further investigation I found out that these error occur when there is a 'wrap over', i.e. log3(a) + log3(b) > 255 (multiplication case) or log3(a) - log3(b) < 0. However the value is "modded" such that they remain in 0~255 using true modulus.
GF256elm& GF256elm::operator/=(const GF256elm& other) { //C++ operator override for division
int t = _logTable[val] - _logTable[other.val]; //log3(a) - log3(b)
int temp = ((t % 255) + 255) % 255; //this wraps the value to between 0~254 inclusive.
val = _expTable[temp];
return *this;
}
the / operator is implemented using the /= override above so nothing special happens there.
I have checked that the generated log/exp tables are correct.
What am I missing here? Thanks!

First, read this question and all its answers and comments carefully:
Addition and multiplication in a Galois Field
I think your code is OK, but you have two problems.
First, the comments are wrong; you are keeping the exponent in the range 0-254, not 0-255.
Second, your "trivial" test cases are wrong.
In this field, think of numbers as polynomials whose coefficients you get from the binary representation of the number. For example, since 5 = 2^2 + 1, in this field "5" means x^2 + 1.
So "5" * "3" = (x^2 + 1) * (x + 1) = x^3 + x^2 + x + 1, or "15". This is why your test case assert((GF256elm(15) / GF256elm(5)) == GF256elm(3)); works. It has nothing to do with your usual notion that five times three equals fifteen. Similarly for your other working test cases, which you will notice mostly involve powers of two.
However, "7" * "11" = (x^2 + x + 1) * (x^3 + x + 1) = x^5 + x^4 + 2x^3 + 2x^2 +2x + 1
But the coefficients are all modulo 2, so this is actually x^5 + x^4 + 1 = "49". This is why your last two test cases fail.
If you try assert(GF256elm(49) / GF256elm(7) == GF256elm(11)); you should find it checks out.

x % n evaluates to an integer between 0 and (n - 1), inclusive.
This means that x % 255 evaluates to an integer between 0 and 254, not 0 and 255.
You should replace 255 with 256, or alternatively, perform a bitwise AND with 0xff for the same result. The latter is faster, though it is quite likely that compilers are smart enough to optimize them to the same bytecode.

There is nothing wrong with the code. Finite field multiplication/division is different from normal arithmetic. Please refer to this question in cryptostackxchange.

C++: Binary to Decimal Conversion

I am trying to convert a binary array to decimal in following way:
uint8_t array[8] = {1,1,1,1,0,1,1,1} ;
int decimal = 0 ;
for(int i = 0 ; i < 8 ; i++)
decimal = (decimal << 1) + array[i] ;
Actually I have to convert 64 bit binary array to decimal and I have to do it for million times.
Can anybody help me, is there any faster way to do the above ? Or is the above one is nice ?

Your method is adequate, to call it nice I would just not mix bitwise operations and "mathematical" way of converting to decimal, i.e. use either
decimal = decimal << 1 | array[i];
or
decimal = decimal * 2 + array[i];

It is important, before attempting any optimisation, to profile the code. Time it, look at the code being generated, and optimise only when you understand what is going on.
And as already pointed out, the best optimisation is to not do something, but to make a higher level change that removes the need.
However...
Most changes you might want to trivially make here, are likely to be things the compiler has already done (a shift is the same as a multiply to the compiler). Some may actually prevent the compiler from making an optimisation (changing an add to an or will restrict the compiler - there are more ways to add numbers, and only you know that in this case the result will be the same).
Pointer arithmetic may be better, but the compiler is not stupid - it ought to already be producing decent code for dereferencing the array, so you need to check that you have not in fact made matters worse by introducing an additional variable.
In this case the loop count is well defined and limited, so unrolling probably makes sense.
Further more it depends on how dependent you want the result to be on your target architecture. If you want portability, it is hard(er) to optimise.
For example, the following produces better code here:
unsigned int x0 = *(unsigned int *)array;
unsigned int x1 = *(unsigned int *)(array+4);
int decimal = ((x0 * 0x8040201) >> 20) + ((x1 * 0x8040201) >> 24);
I could probably also roll a 64-bit version that did 8 bits at a time instead of 4.
But it is very definitely not portable code. I might use that locally if I knew what I was running on and I just wanted to crunch numbers quickly. But I probably wouldn't put it in production code. Certainly not without documenting what it did, and without the accompanying unit test that checks that it actually works.

The binary 'compression' can be generalized as a problem of weighted sum -- and for that there are some interesting techniques.
X mod (255) means essentially summing of all independent 8-bit numbers.
X mod 254 means summing each digit with a doubling weight, since 1 mod 254 = 1, 256 mod 254 = 2, 256*256 mod 254 = 2*2 = 4, etc.
If the encoding was big endian, then *(unsigned long long)array % 254 would produce a weighted sum (with truncated range of 0..253). Then removing the value with weight 2 and adding it manually would produce the correct result:
uint64_t a = *(uint64_t *)array;
return (a & ~256) % 254 + ((a>>9) & 2);
Other mechanism to get the weight is to premultiply each binary digit by 255 and masking the correct bit:
uint64_t a = (*(uint64_t *)array * 255) & 0x0102040810204080ULL; // little endian
uint64_t a = (*(uint64_t *)array * 255) & 0x8040201008040201ULL; // big endian
In both cases one can then take the remainder of 255 (and correct now with weight 1):
return (a & 0x00ffffffffffffff) % 255 + (a>>56); // little endian, or
return (a & ~1) % 255 + (a&1);
For the sceptical mind: I actually did profile the modulus version to be (slightly) faster than iteration on x64.
To continue from the answer of JasonD, parallel bit selection can be iteratively utilized.
But first expressing the equation in full form would help the compiler to remove the artificial dependency created by the iterative approach using accumulation:
ret = ((a[0]<<7) | (a[1]<<6) | (a[2]<<5) | (a[3]<<4) |
(a[4]<<3) | (a[5]<<2) | (a[6]<<1) | (a[7]<<0));
vs.
HI=*(uint32_t)array, LO=*(uint32_t)&array[4];
LO |= (HI<<4); // The HI dword has a weight 16 relative to Lo bytes
LO |= (LO>>14); // High word has 4x weight compared to low word
LO |= (LO>>9); // high byte has 2x weight compared to lower byte
return LO & 255;
One more interesting technique would be to utilize crc32 as a compression function; then it just happens that the result would be LookUpTable[crc32(array) & 255]; as there is no collision with this given small subset of 256 distinct arrays. However to apply that, one has already chosen the road of even less portability and could as well end up using SSE intrinsics.

You could use accumulate, with a doubling and adding binary operation:
int doubleSumAndAdd(const int& sum, const int& next) {
return (sum * 2) + next;
}
int decimal = accumulate(array, array+ARRAY_SIZE,
doubleSumAndAdd);
This produces big-endian integers, whereas OP code produces little-endian.

Try this, I converted a binary digit of up to 1020 bits
#include <sstream>
#include <string>
#include <math.h>
#include <iostream>
using namespace std;
long binary_decimal(string num) /* Function to convert binary to dec */
{
long dec = 0, n = 1, exp = 0;
string bin = num;
if(bin.length() > 1020){
cout << "Binary Digit too large" << endl;
}
else {
for(int i = bin.length() - 1; i > -1; i--)
{
n = pow(2,exp++);
if(bin.at(i) == '1')
dec += n;
}
}
return dec;
}
Theoretically this method will work for a binary digit of infinate length

Logarithm of the very-very large number

I have to find log of very large number.
I do this in C++
I have already made a function of multiplication, addition, subtraction, division, but there were problems with the logarithm. I do not need code, I need a simple idea how to do it using these functions.
Thanks.
P.S.
Sorry, i forgot to tell you: i have to find only binary logarithm of that number
P.S.-2
I found in Wikipedia:
int floorLog2(unsigned int n) {
if (n == 0)
return -1;
int pos = 0;
if (n >= (1 <<16)) { n >>= 16; pos += 16; }
if (n >= (1 << 8)) { n >>= 8; pos += 8; }
if (n >= (1 << 4)) { n >>= 4; pos += 4; }
if (n >= (1 << 2)) { n >>= 2; pos += 2; }
if (n >= (1 << 1)) { pos += 1; }
return pos;
}
if I remade it under the big numbers, it will work correctly?

I assume you're writing a bignum class of your own. If you only care about an integral result of log2, it's quite easy. Take the log of the most significant digit that's not zero, and add 8 for each byte after that one. This is assuming that each byte holds values 0-255. These are only accurate within ±.5, but very fast.
[0][42][53] (10805 in bytes)
log2(42) = 5
+ 8*1 = 8 (because of the one byte lower than MSB)
= 13 (Actual: 13.39941145)
If your values hold base 10 digits, that works out to log2(MSB)+3.32192809*num_digits_less_than_MSB.
[0][5][7][6][2] (5762)
log2(5) = 2.321928095
+ 3.32192809*3 = 9.96578427 (because 3 digits lower than MSB)
= 12.28771 (Actual: 12.49235395)
(only accurate for numbers with less than ~10 million digits)
If you used the algorithm you found on wikipedia, it will be IMMENSELY slow. (but accurate if you need decimals)
It's been pointed out that my method is inaccurate when the MSB is small (still within ±.5, but no farther), but this is easily fixed by simply shifting the top two bytes into a single number, taking the log of that, and doing the multiplication for the bytes less than that number. I believe this will be accurate within half a percent, and still significantly faster than a normal logarithm.
[1][42][53] (76341 in bytes)
log2(1*256+42) = ?
log2(298) = 8.21916852046
+ 8*1 = 8 (because of the one byte lower than MSB)
= 16.21916852046 (Actual: 16.2201704643)
For base 10 digits, it's log2( [mostSignificantDigit]*10+[secondMostSignifcantDigit] ) + 3.32192809*[remainingDigitCount].
If performance is still an issue, you can use lookup tables for the log2 instead of using a full logarithm function.

I assume you want to know how to compute the logarithm "by hand". So I tell you what I've found for this.
Have a look over here, where it is described how to logarithmize by hand. You can implement this as an algorithm. Here's an article by "How Euler did it". I also find this article promising.
I suppose there are more sophisticated methods to do this, but they are so involved you probably don't want to implement them.

How can I add and subtract 128 bit integers in C or C++ if my compiler does not support them?

I'm writing a compressor for a long stream of 128 bit numbers. I would like to store the numbers as differences -- storing only the difference between the numbers rather than the numbers themselves because I can pack the differences in fewer bytes because they are smaller.
However, for compression then I need to subtract these 128 bit values, and for decompression I need to add these values. Maximum integer size for my compiler is 64 bits wide.
Anyone have any ideas for doing this efficiently?

If all you need is addition and subtraction, and you already have your 128-bit values in binary form, a library might be handy but isn't strictly necessary. This math is trivial to do yourself.
I don't know what your compiler uses for 64-bit types, so I'll use INT64 and UINT64 for signed and unsigned 64-bit integer quantities.
class Int128
{
public:
...
Int128 operator+(const Int128 & rhs)
{
Int128 sum;
sum.high = high + rhs.high;
sum.low = low + rhs.low;
// check for overflow of low 64 bits, add carry to high
if (sum.low < low)
++sum.high;
return sum;
}
Int128 operator-(const Int128 & rhs)
{
Int128 difference;
difference.high = high - rhs.high;
difference.low = low - rhs.low;
// check for underflow of low 64 bits, subtract carry to high
if (difference.low > low)
--difference.high;
return difference;
}
private:
INT64 high;
UINT64 low;
};

Take a look at GMP.
#include <stdio.h>
#include <gmp.h>
int main (int argc, char** argv) {
mpz_t x, y, z;
char *xs, *ys, *zs;
int i;
int base[4] = {2, 8, 10, 16};
/* setting the value of x in base 10 */
mpz_init_set_str(x, "100000000000000000000000000000000", 10);
/* setting the value of y in base 16 */
mpz_init_set_str(y, "FF", 16);
/* just initalizing the result variable */
mpz_init(z);
mpz_sub(z, x, y);
for (i = 0; i < 4; i++) {
xs = mpz_get_str(NULL, base[i], x);
ys = mpz_get_str(NULL, base[i], y);
zs = mpz_get_str(NULL, base[i], z);
/* print all three in base 10 */
printf("x = %s\ny = %s\nz = %s\n\n", xs, ys, zs);
free(xs);
free(ys);
free(zs);
}
return 0;
}
The output is
x = 10011101110001011010110110101000001010110111000010110101100111011111000000100000000000000000000000000000000
y = 11111111
z = 10011101110001011010110110101000001010110111000010110101100111011111000000011111111111111111111111100000001
x = 235613266501267026547370040000000000
y = 377
z = 235613266501267026547370037777777401
x = 100000000000000000000000000000000
y = 255
z = 99999999999999999999999999999745
x = 4ee2d6d415b85acef8100000000
y = ff
z = 4ee2d6d415b85acef80ffffff01

Having stumbled across this relatively old post entirely by accident, I thought it pertinent to elaborate on Volte's previous conjecture for the benefit of inexperienced readers.
Firstly, the signed range of a 128-bit number is -2127 to 2127-1 and not -2127 to 2127 as originally stipulated.
Secondly, due to the cyclic nature of finite arithmetic the largest required differential between two 128-bit numbers is -2127 to 2127-1, which has a storage prerequisite of 128-bits, not 129. Although (2127-1) - (-2127) = 2128-1 which is clearly greater than our maximum 2127-1 positive integer, arithmetic overflow always ensures that the nearest distance between any two n-bit numbers always falls within the range 0 to 2n-1 and thus implicitly -2n-1 to 2n-1-1.
In order to clarify, let us first examine how a hypothetical 3-bit processor would implement binary addition. As an example, consider the following table which depicts the absolute unsigned range of a 3-bit integer.
0 = 000b
1 = 001b
2 = 010b
3 = 011b
4 = 100b
5 = 101b
6 = 110b
7 = 111b ---> [Cycles back to 000b on overflow]
From the above table it is readily apparent that:
001b(1) + 010b(2) = 011b(3)
It is also apparent that adding any of these numbers with its numeric complement always yields 2n-1:
010b(2) + 101b([complement of 2] = 5) = 111b(7) = (23-1)
Due to the cyclic overflow which occurs when the addition of two n-bit numbers results in an (n+1)-bit result, it therefore follows that adding any of these numbers with its numeric complement + 1 will always yield 0:
010b(2) + 110b([complement of 2] + 1) = 000b(0)
Thus we can say that [complement of n] + 1 = -n, so that n + [complement of n] + 1 = n + (-n) = 0. Furthermore, if we now know that n + [complement of n] + 1 = 0, then n + [complement of n - x] + 1 must = n - (n-x) = x.
Applying this to our original 3-bit table yields:
0 = 000b = [complement of 0] + 1 = 0
1 = 001b = [complement of 7] + 1 = -7
2 = 010b = [complement of 6] + 1 = -6
3 = 011b = [complement of 5] + 1 = -5
4 = 100b = [complement of 4] + 1 = -4
5 = 101b = [complement of 3] + 1 = -3
6 = 110b = [complement of 2] + 1 = -2
7 = 111b = [complement of 1] + 1 = -1 ---> [Cycles back to 000b on overflow]
Whether the representational abstraction is positive, negative or a combination of both as implied with signed twos-complement arithmetic, we now have 2n n-bit patterns which can seamlessly serve both positive 0 to 2n-1 and negative 0 to -(2n)-1 ranges as and when required. In point of fact, all modern processors employ just such a system in order to implement common ALU circuitry for both addition and subtraction operations. When a CPU encounters an i1 - i2 subtraction instruction, it internally performs a [complement + 1] operation on i2 and subsequently processes the operands through the addition circuitry in order to compute i1 + [complement of i2] + 1. With the exception of an additional carry/sign XOR-gated overflow flag, both signed and unsigned addition, and by implication subtraction, are each implicit.
If we apply the above table to the input sequence [-2n-1, 2n-1-1, -2n-1] as presented in Volte's original reply, we are now able to compute the following n-bit differentials:
diff #1:
(2n-1-1) - (-2n-1) =
3 - (-4) = 3 + 4 =
(-1) = 7 = 111b
diff #2:
(-2n-1) - (2n-1-1) =
(-4) - 3 = (-4) + (5) =
(-7) = 1 = 001b
Starting with our seed -2n-1, we are now able to reproduce the original input sequence by applying each of the above differentials sequentially:
(-2n-1) + (diff #1) =
(-4) + 7 = 3 =
2n-1-1
(2n-1-1) + (diff #2) =
3 + (-7) = (-4) =
-2n-1
You may of course wish to adopt a more philosophical approach to this problem and conjecture as to why 2n cyclically-sequential numbers would require more than 2n cyclically-sequential differentials?
Taliadon.

Boost 1.53 now includes multiprecision:
#include <boost/multiprecision/cpp_int.hpp>
#include <iostream>
// Requires Boost 1.53 or higher
// build: g++ text.cpp
int main()
{
namespace mp = boost::multiprecision;
mp::uint128_t a = 4294967296;
mp::uint256_t b(0);
mp::uint512_t c(0);
b = a * a;
c = b * b;
std::cout << "c: " << c << "\n";
return 0;
}
Output:
./a.out
c: 340282366920938463463374607431768211456

There is a lot of literature regarding large integer math. You can use one of the libraries freely available (see links) or you can roll your own. Although I should warn you, for anything more complicated than addition and subtraction (and shifts), you'll need to use non-trivial algorithms.
To add and subtract, you can create a class/structure that holds two 64-bit integers. You can use simple school math to do the addition and subtraction. Basically, do what you do with a pencil and paper to add or subtract, with careful consideration to carries/borrows.
Search for large integer. Btw recent versions of VC++, IntelC++ and GCC compilers have 128-bit integer types, although I'm not sure they are as easily accessible as you might like (they are intended to be used with sse2/xmms registers).
http://en.wikipedia.org/wiki/Arbitrary_precision_arithmetic
http://orion.math.iastate.edu/cbergman/crypto/bignums.html
http://www.mathgoodies.com/tutorial/

TomsFastMath is a bit like GMP (mentioned above), but is public domain, and was designed from the ground up to be extremely fast (it even contains assembly code optimizations for x86, x86-64, ARM, SSE2, PPC32, and AVR32).

Also worth noting: if the goal is merely to improve the compression of a stream of numbers by preprocessing it, then the preprocessed stream doesn't have to be made of exactly arithmetic differences. You can use XOR (^) instead of + and -. The advantage is that a 128-bit XOR is exactly the same as two independent XORs on the 64-bit parts, so it is both simple and efficient.

Is there any alternative to using % (modulus) in C/C++?

I read somewhere once that the modulus operator is inefficient on small embedded devices like 8 bit micro-controllers that do not have integer division instruction. Perhaps someone can confirm this but I thought the difference is 5-10 time slower than with an integer division operation.
Is there another way to do this other than keeping a counter variable and manually overflowing to 0 at the mod point?
const int FIZZ = 6;
for(int x = 0; x < MAXCOUNT; x++)
{
if(!(x % FIZZ)) print("Fizz\n"); // slow on some systems
}
vs:
The way I am currently doing it:
const int FIZZ = 6;
int fizzcount = 1;
for(int x = 1; x < MAXCOUNT; x++)
{
if(fizzcount >= FIZZ)
{
print("Fizz\n");
fizzcount = 0;
}
}

Ah, the joys of bitwise arithmetic. A side effect of many division routines is the modulus - so in few cases should division actually be faster than modulus. I'm interested to see the source you got this information from. Processors with multipliers have interesting division routines using the multiplier, but you can get from division result to modulus with just another two steps (multiply and subtract) so it's still comparable. If the processor has a built in division routine you'll likely see it also provides the remainder.
Still, there is a small branch of number theory devoted to Modular Arithmetic which requires study if you really want to understand how to optimize a modulus operation. Modular arithmatic, for instance, is very handy for generating magic squares.
So, in that vein, here's a very low level look at the math of modulus for an example of x, which should show you how simple it can be compared to division:
Maybe a better way to think about the problem is in terms of number
bases and modulo arithmetic. For example, your goal is to compute DOW
mod 7 where DOW is the 16-bit representation of the day of the
week. You can write this as:
DOW = DOW_HI*256 + DOW_LO
DOW%7 = (DOW_HI*256 + DOW_LO) % 7
= ((DOW_HI*256)%7 + (DOW_LO % 7)) %7
= ((DOW_HI%7 * 256%7) + (DOW_LO%7)) %7
= ((DOW_HI%7 * 4) + (DOW_LO%7)) %7
Expressed in this manner, you can separately compute the modulo 7
result for the high and low bytes. Multiply the result for the high by
4 and add it to the low and then finally compute result modulo 7.
Computing the mod 7 result of an 8-bit number can be performed in a
similar fashion. You can write an 8-bit number in octal like so:
X = a*64 + b*8 + c
Where a, b, and c are 3-bit numbers.
X%7 = ((a%7)*(64%7) + (b%7)*(8%7) + c%7) % 7
= (a%7 + b%7 + c%7) % 7
= (a + b + c) % 7
since 64%7 = 8%7 = 1
Of course, a, b, and c are
c = X & 7
b = (X>>3) & 7
a = (X>>6) & 7 // (actually, a is only 2-bits).
The largest possible value for a+b+c is 7+7+3 = 17. So, you'll need
one more octal step. The complete (untested) C version could be
written like:
unsigned char Mod7Byte(unsigned char X)
{
X = (X&7) + ((X>>3)&7) + (X>>6);
X = (X&7) + (X>>3);
return X==7 ? 0 : X;
}
I spent a few moments writing a PIC version. The actual implementation
is slightly different than described above
Mod7Byte:
movwf temp1 ;
andlw 7 ;W=c
movwf temp2 ;temp2=c
rlncf temp1,F ;
swapf temp1,W ;W= a*8+b
andlw 0x1F
addwf temp2,W ;W= a*8+b+c
movwf temp2 ;temp2 is now a 6-bit number
andlw 0x38 ;get the high 3 bits == a'
xorwf temp2,F ;temp2 now has the 3 low bits == b'
rlncf WREG,F ;shift the high bits right 4
swapf WREG,F ;
addwf temp2,W ;W = a' + b'
; at this point, W is between 0 and 10
addlw -7
bc Mod7Byte_L2
Mod7Byte_L1:
addlw 7
Mod7Byte_L2:
return
Here's a liitle routine to test the algorithm
clrf x
clrf count
TestLoop:
movf x,W
RCALL Mod7Byte
cpfseq count
bra fail
incf count,W
xorlw 7
skpz
xorlw 7
movwf count
incfsz x,F
bra TestLoop
passed:
Finally, for the 16-bit result (which I have not tested), you could
write:
uint16 Mod7Word(uint16 X)
{
return Mod7Byte(Mod7Byte(X & 0xff) + Mod7Byte(X>>8)*4);
}
Scott

If you are calculating a number mod some power of two, you can use the bit-wise and operator. Just subtract one from the second number. For example:
x % 8 == x & 7
x % 256 == x & 255
A few caveats:
This only works if the second number is a power of two.
It's only equivalent if the modulus is always positive. The C and C++ standards don't specify the sign of the modulus when the first number is negative (until C++11, which does guarantee it will be negative, which is what most compilers were already doing). A bit-wise and gets rid of the sign bit, so it will always be positive (i.e. it's a true modulus, not a remainder). It sounds like that's what you want anyway though.
Your compiler probably already does this when it can, so in most cases it's not worth doing it manually.

There is an overhead most of the time in using modulo that are not powers of 2.
This is regardless of the processor as (AFAIK) even processors with modulus operators are a few cycles slower for divide as opposed to mask operations.
For most cases this is not an optimisation that is worth considering, and certainly not worth calculating your own shortcut operation (especially if it still involves divide or multiply).
However, one rule of thumb is to select array sizes etc. to be powers of 2.
so if calculating day of week, may as well use %7 regardless
if setting up a circular buffer of around 100 entries... why not make it 128. You can then write % 128 and most (all) compilers will make this & 0x7F

Unless you really need high performance on multiple embedded platforms, don't change how you code for performance reasons until you profile!
Code that's written awkwardly to optimize for performance is hard to debug and hard to maintain. Write a test case, and profile it on your target. Once you know the actual cost of modulus, then decide if the alternate solution is worth coding.

#Matthew is right. Try this:
int main() {
int i;
for(i = 0; i<=1024; i++) {
if (!(i & 0xFF)) printf("& i = %d\n", i);
if (!(i % 0x100)) printf("mod i = %d\n", i);
}
}

x%y == (x-(x/y)*y)
Hope this helps.

Do you have access to any programmable hardware on the embedded device? Like counters and such? If so, you might be able to write a hardware based mod unit, instead of using the simulated %. (I did that once in VHDL. Not sure if I still have the code though.)
Mind you, you did say that division was 5-10 times faster. Have you considered doing a division, multiplication, and subtraction to simulated the mod? (Edit: Misunderstood the original post. I did think it was odd that division was faster than mod, they are the same operation.)
In your specific case, though, you are checking for a mod of 6. 6 = 2*3. So you could MAYBE get some small gains if you first checked if the least significant bit was a 0. Something like:
if((!(x & 1)) && (x % 3))
{
print("Fizz\n");
}
If you do that, though, I'd recommend confirming that you get any gains, yay for profilers. And doing some commenting. I'd feel bad for the next guy who has to look at the code otherwise.

You should really check the embedded device you need. All the assembly language I have seen (x86, 68000) implement the modulus using a division.
Actually, the division assembly operation returns the result of the division and the remaining in two different registers.

In the embedded world, the "modulus" operations you need to do are often the ones that break down nicely into bit operations that you can do with &, | and sometimes >>.

#Jeff V: I see a problem with it! (Beyond that your original code was looking for a mod 6 and now you are essentially looking for a mod 8). You keep doing an extra +1! Hopefully your compiler optimizes that away, but why not just test start at 2 and go to MAXCOUNT inclusive? Finally, you are returning true every time that (x+1) is NOT divisible by 8. Is that what you want? (I assume it is, but just want to confirm.)

For modulo 6 you can change the Python code to C/C++:
def mod6(number):
while number > 7:
number = (number >> 3 << 1) + (number & 0x7)
if number > 5:
number -= 6
return number

Not that this is necessarily better, but you could have an inner loop which always goes up to FIZZ, and an outer loop which repeats it all some certain number of times. You've then perhaps got to special case the final few steps if MAXCOUNT is not evenly divisible by FIZZ.
That said, I'd suggest doing some research and performance profiling on your intended platforms to get a clear idea of the performance constraints you're under. There may be much more productive places to spend your optimisation effort.

The print statement will take orders of magnitude longer than even the slowest implementation of the modulus operator. So basically the comment "slow on some systems" should be "slow on all systems".
Also, the two code snippets provided don't do the same thing. In the second one, the line
if(fizzcount >= FIZZ)
is always false so "FIZZ\n" is never printed.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js