Need an efficient subtraction algorithm modulo a number - c++

For given numbers x,y and n, I would like to calculate x-y mod n in C. Look at this example:
int substract_modulu(int x, int y, int n)
{
return (x-y) % n;
}
As long as x>y, we are fine. In the other case, however, the modulo operation is undefined.
You can think of x,y,n>0. I would like the result to be positive, so if (x-y)<0, then ((x-y)-substract_modulu(x,y,n))/ n shall be an integer.
What is the fastest algorithm you know for that? Is there one which avoids any use of if and the ?: operator?

As many have pointed out, in current C and C++ standards, x % n is no longer implementation-defined for any values of x and n. It is undefined behaviour in the cases where x / n is undefined [1]. Also, x - y is undefined behaviour in the case of integer overflow, which is possible if the signs of x and y might differ.
So the main problem for a general solution is avoiding integer overflow, either in the division or the subtraction. If we know that x and y are non-negative and n is positive, then overflow and division by zero are not possible, and we can confidently say that (x - y) % n is defined. Unfortunately, x - y might be negative, in which case so will be the result of the % operator.
It's easy to correct for the result being negative if we know that n is positive; all we have to do is unconditionally add n and do another modulo operation. That's unlikely to be the best solution, unless you have a computer where division is faster than branching.
If a conditional load instruction is available (pretty common these days), then the compiler will probably do well with the following code, which is portable and well-defined, subject to the constraints that x,y ≥ 0 ∧ n > 0:
((x - y) % n) + ((x >= y) ? 0 : n)
For example, gcc produces this code for my Core i5 (although it's generic enough to work on any non-Paleozoic Intel chip):
idivq %rcx
cmpq %rsi, %rdi
movl $0, %eax
cmovge %rax, %rcx
leaq (%rdx,%rcx), %rax
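which is cheerfully branch-free. (Conditional move is usually a lot faster than branching.) One caveat: when x < y and x - y is an exact multiple of n, the remainder is 0 and the expression yields n rather than 0; selecting on the sign of the remainder instead of on x >= y avoids that.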
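The conditional-add idea can be sketched as a complete function. This is only a minimal sketch under the same constraints as above (x, y >= 0 and n > 0); the name sub_mod is hypothetical. It selects on the sign of the remainder rather than on x >= y, so the result always lies in [0, n).

```cpp
#include <cassert>

// Hypothetical helper: (x - y) mod n with a result in [0, n).
// Assumes x >= 0, y >= 0 and n > 0, so neither x - y nor % can overflow.
int sub_mod(int x, int y, int n) {
    assert(x >= 0 && y >= 0 && n > 0);
    int r = (x - y) % n;         // in (-n, n); the sign follows x - y (C++11)
    return r + (r < 0 ? n : 0);  // a select that compilers can turn into a cmov
}
```

Whether the compiler really emits a conditional move here depends on the target and optimisation level, so it is worth checking the generated assembly.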
Another way of doing this would be (except that the function sign needs to be written):
((x - y) % n) + (sign(x - y) & (unsigned long)n)
where sign is all 1s if its argument is negative, and otherwise 0. One possible implementation of sign (adapted from bithacks) is
unsigned long sign(unsigned long x) {
// The shift isolates the sign bit (0 or 1); unsigned negation turns 1 into all 1s.
return -(x >> (sizeof(long) * CHAR_BIT - 1));
}
This is portable (casting negative integer values to unsigned is defined), but it may be slow on architectures which lack high-speed shift. It's unlikely to be faster than the previous solution, but YMMV. TIAS.
Neither of these produce correct results for the general case where integer overflow is possible. It's very difficult to deal with integer overflow. (One particularly annoying case is n == -1, although you can test for that and return 0 without any use of %.) Also, you need to decide your preference for the result of modulo of negative n. I personally prefer the definition where x%n is either 0 or has the same sign as n -- otherwise why would you bother with a negative divisor -- but applications differ.
The three-modulo solution proposed by Tom Tanner will work if n is not -1 and n + n does not overflow. n == -1 will fail if either x or y is INT_MIN, and the simple fix of using abs(n) instead of n will fail if n is INT_MIN. The cases where n has a large absolute value could be replaced with comparisons, but there are a lot of corner cases, and made more complicated by the fact that the standard does not require 2's complement arithmetic, so it's not easily predictable what the corner cases are [2].
As a final note, some tempting solutions do not work. You cannot just take the absolute value of (x - y):
(-z) % n == -(z % n) == n - (z % n) ≠ z % n (unless z % n happens to be n / 2)
And, for the same reason, you cannot just take the absolute value of the result of modulo.
Also, you cannot just cast (x - y) to unsigned:
(unsigned)z == z + 2^k (for some k) if z < 0
(z + 2^k) % n == (z % n) + (2^k % n) ≠ z % n unless (2^k % n) == 0
[1] x/n and x%n are both undefined if n==0. But x%n is also undefined if x/n is "not representable" (i.e. there was integer overflow), which will happen on twos-complement machines (that is, all the ones you care about) if x is the most negative representable number and n == -1. It's clear why x/n should be undefined in this case, but slightly less so in the case of x%n, since that value is (mathematically) 0.
[2] Most people who complain about the difficulty of predicting the results of floating-point arithmetic haven't spent much time trying to write truly portable integer arithmetic code :)

If you want to avoid undefined behaviour, without an if, the following would work
return (x % n - y % n + n) % n;
The efficiency depends on the implementation of the modulo operation, but I'd suspect algorithms involving if would be rather faster.
Alternatively you could treat x and y as unsigned. In which case there are no negative numbers involved and no undefined behaviour.

With C++11, the result of % for negative operands is no longer implementation-defined. Depending on the exact behavior you want, you can therefore just stick with
return (x-y) % n;
For a full explanation read this answer:
https://stackoverflow.com/a/13100805/1149664
You still get undefined behavior for n==0 or if x-y can not be stored in the type you are using.

Whether branching is going to matter will depend on the CPU to some degree. According to the documentation abs (on MSDN) has intrinsic behavior and it might not be a bottleneck at all. This you'll have to test.
If you want to compute things unconditionally, there are several nice methods that can be adapted from the Bit Twiddling Hacks site.
int v; // we want to find the absolute value of v
unsigned int r; // the result goes here
int const mask = v >> sizeof(int) * CHAR_BIT - 1;
r = (v + mask) ^ mask;
However, I don't know if this will be helpful to your situation without more information about hardware targets and testing.
Just out of curiosity I had to test this myself and when you look at the assembly generated by the compiler we can see there's no real overhead in the use of abs.
unsigned r = abs(i);
====
00381006 cdq
00381007 xor eax,edx
00381009 sub eax,edx
The following is just an alternate form of the above example which according to the Bit Twiddling Site is not patented (while the version used by the Visual C++ 2008 compiler is).
Throughout my answer I have been using MSDN and Visual C++ but I would assume that any sane compiler has similar behavior.

Assuming 0 <= x < n and 0 <= y < n, how about (x + n - y) % n? Then x + n will certainly be larger than y, subtracting y will always result in a positive integer, and the final mod n reduces the result if necessary.
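For operands that are already reduced, the add-n-first idea might look like the sketch below. The name sub_mod_reduced is hypothetical, and the intermediate sum is widened to 64 bits on the assumption that n may be close to 2^32, where x + n would otherwise wrap.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: (x - y) mod n for operands already reduced to [0, n).
// x + n is computed in 64 bits so it cannot wrap even when n is near 2^32.
std::uint32_t sub_mod_reduced(std::uint32_t x, std::uint32_t y, std::uint32_t n) {
    assert(n > 0 && x < n && y < n);
    std::uint64_t t = static_cast<std::uint64_t>(x) + n - y;  // in [1, 2n)
    return static_cast<std::uint32_t>(t % n);
}
```

Since t is always below 2n, the final % could also be replaced by a single conditional subtraction of n.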

I'm going to guess that it's not really the case here, but I'd like to mention that if the value you are taking modulo with is a power of two, then using the "AND" method is a lot quicker (I'm going to ignore the x-y, and just show how it works for a single x, as x-y is not part of the equation here):
int modpow2(int x, int n)
{
return x & (n-1);
}
If you want to ensure that your code doesn't do anything daft, you could add ASSERT(!(n & n-1)); - this checks that there is only a single bit set in n (so, n is a power of two).
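Put together, the mask and the power-of-two check could be sketched as follows. The names mod_pow2 and is_power_of_two are made up for illustration, and unsigned operands are assumed so that the mask is well-behaved. Unlike the bare !(n & n-1) test, this check also rejects n == 0.

```cpp
#include <cassert>

// True if n has exactly one bit set.
constexpr bool is_power_of_two(unsigned n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Hypothetical helper: x mod n for unsigned x when n is a power of two.
unsigned mod_pow2(unsigned x, unsigned n) {
    assert(is_power_of_two(n));
    return x & (n - 1);  // keeps only the bits below the single set bit of n
}
```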

Here is the CPP Code I use in competitive programming:
#include <iostream>
#include<bits/stdc++.h>
using namespace std;
#define ll long long
#define mod 1000000007
ll subtraction_modulo(ll x, ll y ){
return ( ( (x - y) % mod ) + mod ) % mod;
}
Here,
ll -> long long int
mod -> globally defined mod value to be used.

Related

Correctness of multiplication with overflow detection

The following C++ template detects overflows from multiplying two unsigned integers.
template<typename UInt> UInt safe_multiply(UInt a, UInt b) {
UInt x = a * b; // x := ab mod n, for n := 2^#bits > 0
if (a != 0 && x / a != b)
cerr << "Overflow for " << a << " * " << b << "." << endl;
return x;
}
Can you give a proof that this algorithm detects every potential overflow, regardless of how many bits UInt uses?
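The case a = 0 cannot result in overflows, so we can consider a > 0.
It seems that the correctness proof boils down to leading a*b >= n together with floor(x/a) = b to a contradiction, since x / a actually means floor(x/a).
When assuming that the division is exact, i.e. x/a = b, this leads to the straightforward consequence x = a*b, thus a*b - x = 0 would have to be a positive multiple of n, which contradicts n > 0.
So it remains to show floor(x/a) = x/a, or there must be another way.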
If the last equation is true, WolframAlpha fails to confirm that (also with exponents).
However, it asserts that the original assumptions have no integer solutions, so the algorithm seems to be correct indeed.
But it doesn't provide an explanation. So why is it correct?
I am looking for the smallest possible explanation that is still mathematically profound, ideally that it fits in a single-line comment. Maybe I am missing something trivial, or the problem is not as easy as it looks.
On a side note, I used Codecogs Equation Editor for the LaTeX markup images, which apparently looks bad in dark mode, so consider switching to light mode or, if you know, please tell me how to use different images depending on the client settings. It is just \bg{white} vs. \bg{black} as part of the image URLs.
To be clear, I'll use the multiplication and division symbols (*, /) mathematically.
Also, for convenience let's name the set N = {0, 1, ..., n - 1}.
Let's clear up what unsigned multiplication is:
Unsigned multiplication for some magnitude, n, is a modular n operation on unsigned-n inputs (inputs that are in N) that results in an unsigned-n output (ie. also in N).
In other words, the result of unsigned multiplication, x, is x = a*b (mod n), and, additionally, we know that x,a,b are in N.
It's important to be able to expand many modular formulas like so: x = a*b - k*n, where k is an integer - but in our case x,a,b are in N so this implies that k is in N.
Now, let's mathematically say what we're trying to prove:
Given positive integers, a,n, and non-negative integers x,b, where x,a,b are in N, and x = a*b (mod n), then a*b >= n (overflow) implies floor(x/a) != b.
Proof:
If overflow (a*b >= n) then x >= n - k*n = (1 - k)*n (for k in N),
As x < n then (1 - k)*n < n, so k > 0.
This means x <= a*b - n.
So, floor(x/a) <= floor([a*b - n]/a) = floor(b - n/a) = b - ceil(n/a) <= b - 1 (because n/a > 0),
Abbreviated: floor(x/a) <= b - 1
Therefore floor(x/a) != b
The multiplication gives either the mathematically correct result, or it is off by some multiple of 2^64. Since you check for a=0, the division always gives the correct result for its input. But in the case of overflow, the input is off by 2^64 or more, so the test will fail as you hoped.
The last bit is that unsigned operations don’t have undefined behaviour except for division by zero, so your code is fine.
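The argument can be condensed into a comment on the test itself. The sketch below is a hypothetical variant of the question's template that returns a flag instead of printing; the name multiply_overflows is made up. It multiplies in unsigned long long, an assumption made here so that narrow types such as uint16_t are not promoted to signed int before the multiplication.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical variant of safe_multiply: returns true if a * b does not fit.
template <typename UInt>
bool multiply_overflows(UInt a, UInt b) {
    static_assert(std::is_unsigned<UInt>::value, "unsigned types only");
    static_assert(sizeof(UInt) <= sizeof(unsigned long long), "type too wide");
    // The cast back to UInt reduces the product modulo n = 2^bits.
    const UInt x = static_cast<UInt>(1ULL * a * b);
    // If a*b >= n then x <= a*b - n, so x/a <= b - n/a < b: the test fires.
    // If a*b < n then x == a*b and x/a == b exactly: the test stays quiet.
    return a != 0 && x / a != b;
}
```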

How to find the fixed points of a simple mod function elegantly?

Here is a function, which expressed in C is:
uint32_t f(uint32_t x) {
return (x * 0x156) ^ 0xfca802c7;
}
Then I came across a challenge: How to find all its fixed points?
I know we can test every uint32_t value to solve this problem, but I still want to know if there is another way that is more elegant - especially when uint32_t becomes uint64_t and (0x156, 0xfca802c7) is an arbitrary pair of values.
Python code:
def f(x, n):
    return ((x*0x156)^0xfca802c7) % n
solns = [1] # The one solution modulo 2, see text for explanation
n = 1
while n < 2**32:
    prev_n = n
    n = n * 2
    lifted_solns = []
    for soln in solns:
        if f(soln, n) == soln:
            lifted_solns.append(soln)
        if f(soln + prev_n, n) == soln + prev_n:
            lifted_solns.append(soln + prev_n)
    solns = lifted_solns
for soln in solns:
    print(soln, "evaluates to", f(soln, 2**32))
Output: 150129329 evaluates to 150129329
Idea behind the algorithm: We are trying to find x XOR 0xfca802c7 = x*0x156 modulo n, where in our case n=2^32. I wrote it this way because the right side is a simple modular multiplication that behaves nicely with the left side.
The main property we are going to use is that a solution to x XOR 0xfca802c7 = x*0x156 modulo 2^(i+1) reduces to a solution to x XOR 0xfca802c7 = x*0x156 modulo 2^i. Another way of saying that is that a solution to x XOR 0xfca802c7 = x*0x156 modulo 2^i translates to one or two solutions modulo 2^(i+1): those possibilities are either x and/or x+2^i (if we want to be more precise, we are only looking at integers between 0, ..., modulus size - 1 when we say "solution").
We can easily solve this for i=1: x XOR 0xfca802c7 = x*0x156 modulo 2^1 is the same as x XOR 1 = x*0 mod 2, which means x=1 is the only solution. From there we know that only 1 and 3 are the possible solutions modulo 2^2 = 4. So we only have two to try. It turns out that only one works. That's our current solution modulo 4. We can then lift that solution to the possibilities modulo 8. And so on. Eventually we get all such solutions.
Remark1: This code finds all solutions. In this case, there is only one, but for more general parameters there may be more than one.
Remark2: the running time is O(max[number of solutions, modulus size in bits]), assuming I have not made an error. So it is fast unless there are many, many fixed points. In this case, there seems to only be one.
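For readers who prefer C++, the same lifting loop might be transcribed roughly as below. This is a sketch, not the answer's original code: the function name fixed_points is made up, and it starts from the single residue modulo 2^0 rather than from the hand-derived solution modulo 2.

```cpp
#include <cstdint>
#include <vector>

// The function from the question, evaluated with 32-bit wraparound.
std::uint32_t f(std::uint32_t x) {
    return (x * 0x156u) ^ 0xfca802c7u;
}

// Returns every x with f(x) == x by lifting solutions modulo 2^i to 2^(i+1).
std::vector<std::uint32_t> fixed_points() {
    std::vector<std::uint32_t> solns{0};  // the single residue modulo 2^0
    for (int i = 0; i < 32; ++i) {
        const std::uint32_t bit = std::uint32_t{1} << i;
        const std::uint32_t mask = bit | (bit - 1);  // the low i+1 bits
        std::vector<std::uint32_t> lifted;
        for (std::uint32_t s : solns) {
            // A solution modulo 2^i can only extend to s or s + 2^i.
            for (std::uint32_t cand : {s, s | bit}) {
                if (((f(cand) ^ cand) & mask) == 0) lifted.push_back(cand);
            }
        }
        solns = lifted;
    }
    return solns;
}
```

For these constants the multiplier is even, so bit i of the product depends only on lower bits of x; each step therefore has exactly one valid lift, which is why a single fixed point comes out.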
Let's use Z3 solver:
(declare-const x (_ BitVec 32))
(assert (= x (bvxor (bvmul x #x00000156) #xfca802c7)))
(check-sat)
(get-model)
The result is '#x08f2cab1' = 150129329.
Since input bits at position n only affect output bits at positions ≥ n you know that you can find a solution by choosing the first bit, then the second bit, etc.
Here is how you could solve it in C++ for 64-bit integers (of course it also works with 32-bit integers):
#include <cstdint>
#include <cstdio>
uint64_t f(uint64_t x) {
return (x * 0x7ef93a76ULL) ^ 0x3550e08f8a9c89c7ULL;
}
static void search(uint64_t x, uint64_t bit)
{
if (bit == 0)
{
printf("Fixed point: 0x%llx\n", (long long unsigned)x);
return;
}
if (f(x + bit) & bit) search(x + bit, bit << 1);
if ((f(x) & bit) == 0) search(x, bit << 1);
}
int main()
{
search(0x0, 1);
}
With this output:
Fixed point: 0xb9642f1d99863811

Multiplication overflow test [duplicate]

How to correctly check if overflow occurs in integer multiplication?
int i = X(), j = Y();
i *= j;
How to check for overflow, given values of i, j and their type? Note that the check must work correctly for both signed and unsigned types. Can assume that both i and j are of the same type. Can also assume that the type is known while writing the code, so different solutions can be provided for signed / unsigned cases (no need for template juggling, if it works in "C", it is a bonus).
EDIT:
The answer by @pmg is the correct one. It took me a while to wrap my head around its simplicity, so I will share the reasoning here. Suppose we want to check:
i * j > MAX
But we can't really check because i * j would cause overflow and the result would be incorrect (and always less or equal to MAX). So we modify it like this:
i > MAX / j
But this is not quite correct, as in the division, there is some rounding involved. Rather, we want to know the result of this:
i > floor(MAX / j) + float(MAX % j) / j
So we have the division itself, which is implicitly rounded down by the integer arithmetics (the floor is no-op there, merely as an illustration), and we have the remainder of the division which was missing in the previous inequality (which evaluates to less than 1).
Assume that i and j are two numbers at the limit, so that if either of them increases by 1, an overflow will occur. Assuming neither of them is zero (in which case no overflow would occur anyway), (i + 1) * j and i * (j + 1) are both at least 1 + (i * j). We can therefore safely ignore the roundoff error of the division, which is less than 1.
Alternately, we can reorganize as such:
i - floor(MAX / j) > float(MAX % j) / j
Basically, this tells us that i - floor(MAX / j) must be greater than a number in a [0, 1) interval. That can be written exactly, as:
i - floor(MAX / j) >= 1
Because 1 is just after the interval. We can rewrite as:
i - floor(MAX / j) > 0
Or as:
i > floor(MAX / j)
So we have shown equivalence of the simple test and the floating-point version. It is because the division does not cause significant roundoff error. We can now use the simple test and live happily ever after.
You cannot test afterwards. If the multiplication overflows, it triggers Undefined Behaviour which can render tests inconclusive.
You need to test before doing the multiplication
if (x > INT_MAX / y) /* multiplication of x and y will overflow (assuming x > 0 and y > 0) */;
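Extending that pre-check to all sign combinations takes a few more comparisons. The following is a sketch only, with hypothetical names mul_would_overflow and checked_mul; it arranges each division so that the divisor is never zero and INT_MIN is never divided by -1.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper: true if x * y would not fit in an int.
bool mul_would_overflow(int x, int y) {
    if (x == 0 || y == 0) return false;
    if (x > 0) {
        return y > 0 ? x > INT_MAX / y    // both positive
                     : y < INT_MIN / x;   // x positive, y negative
    }
    return y > 0 ? x < INT_MIN / y        // x negative, y positive
                 : y < INT_MAX / x;       // both negative
}

// Multiplies only after the pre-check has passed.
int checked_mul(int x, int y) {
    assert(!mul_would_overflow(x, y));
    return x * y;
}
```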
If your compiler has a type that is at least twice as big as int then you can do this:
long long r = 1LL * x * y;
if ( r > INT_MAX || r < INT_MIN ) {
    // overflowed...
} else {
    x = r;
}
For portability you should STATIC_ASSERT( sizeof(long long) >= 2 * sizeof(int) ); or something similar but more extreme if you're worried about padding bits!
Try this
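bool willoverflow(uint32_t a, uint32_t b) {
    // highestOneBitPosition() is not defined here; it is assumed to return the
    // 1-based position of the highest set bit (0 for an input of 0).
    size_t a_bits = highestOneBitPosition(a),
           b_bits = highestOneBitPosition(b);
    return (a_bits + b_bits > 32); // conservative: a sum of 33 may still fit
}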
It is possible to see if overflow occurred post facto by using a division. In the case of unsigned values, the multiplication z=x*y has overflowed if y!=0 and:
bool overflow_occurred = (y!=0)? z/y!=x : false;
(if y did equal zero, no overflow occurred). For the case of signed values, it is a little trickier.
if(y!=0){
    bool overflow_occurred = (y<0 && x==INT32_MIN) || (z/y != x);
}
We need the first part of the expression because the division test will fail if x=-2^31 and y=-1. In this case the multiplication overflows, but the machine may give a result of -2^31. Therefore we test for it separately.
This is true for 32 bit values. Extending the code to the 64 bit case is left as an exercise for the reader.

Fast fixed point pow, log, exp and sqrt

I've got a fixed point class (10.22) and I have a need of a pow, a sqrt, an exp and a log function.
Alas I have no idea where to even start on this. Can anyone provide me with some links to useful articles or, better yet, provide me with some code?
I'm assuming that once I have an exp function then it becomes relatively easy to implement pow and sqrt as they just become.
pow( x, y ) => exp( y * log( x ) )
sqrt( x ) => pow( x, 0.5 )
It's just those exp and log functions that I'm finding difficult (though I remember a few of my log rules, I can't remember much else about them).
Presumably, there would also be a faster method for sqrt and pow, so any pointers on that front would be appreciated, even if it's just to say use the methods I outline above.
Please note: This HAS to be cross platform and in pure C/C++ code so I cannot use any assembler optimisations.
A very simple solution is to use a decent table-driven approximation. You don't actually need a lot of data if you reduce your inputs correctly. exp(a)==exp(a/2)*exp(a/2), which means you really only need to calculate exp(x) for 1 < x < 2. Over that range, a Runge-Kutta approximation would give reasonable results with ~16 entries IIRC.
Similarly, sqrt(a) == 2 * sqrt(a/4) == sqrt(4*a) / 2 which means you need only table entries for 1 < a < 4. Log(a) is a bit harder: log(a) == 1 + log(a/e). This is a rather slow iteration, but log(1024) is only 6.9 so you won't have many iterations.
You'd use a similar "integer-first" algorithm for pow: pow(x,y)==pow(x, floor(y)) * pow(x, frac(y)). This works because pow(double, int) is trivial (divide and conquer).
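The "trivial" integer-exponent half can be sketched as repeated squaring. For brevity this sketch uses double rather than the 10.22 fixed-point class, and the name ipow is hypothetical; the same loop should carry over to a fixed-point type that provides multiplication.

```cpp
// Hypothetical helper: x^n for a non-negative integer n by repeated squaring,
// using O(log n) multiplications.
double ipow(double x, unsigned n) {
    double result = 1.0;
    while (n != 0) {
        if (n & 1u) result *= x;  // this bit of the exponent is set
        x *= x;                   // x, x^2, x^4, x^8, ...
        n >>= 1;
    }
    return result;
}
```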
[edit] For the integral component of log(a), it may be useful to store a table 1, e, e^2, e^3, e^4, e^5, e^6, e^7 so you can reduce log(a) == n + log(a/e^n) by a simple hardcoded binary search of a in that table. The improvement from 7 to 3 steps isn't so big, but it means you only have to divide once by e^n instead of n times by e.
[edit 2]
And for that last log(a/e^n) term, you can use log(a/e^n) = log((a/e^n)^8)/8 - each iteration produces 3 more bits by table lookup. That keeps your code and table size small. This is typically code for embedded systems, and they don't have large caches.
[edit 3]
That's still not too smart on my side. log(a) = log(2) + log(a/2). You can just store the fixed-point value log2=0.6931471805599, count the number of leading zeroes, shift a into the range used for your lookup table, and multiply that shift (integer) by the fixed-point constant log2. Can be as low as 3 instructions.
Using e for the reduction step just gives you a "nice" log(e)=1.0 constant but that's false optimization. 0.6931471805599 is just as good a constant as 1.0; both are 32 bits constants in 10.22 fixed point. Using 2 as the constant for range reduction allows you to use a bit shift for a division.
[edit 5]
And since you're storing it in Q10.22, you can better store log(65536)=11.09035488. (16 x log(2)). The "x16" means that we've got 4 more bits of precision available.
You still get the trick from edit 2, log(a/2^n) = log((a/2^n)^8)/8. Basically, this gets you a result (a + b/8 + c/64 + d/512) * 0.6931471805599 - with b,c,d in the range [0,7]. a.bcd really is an octal number. Not a surprise since we used 8 as the power. (The trick works equally well with power 2, 4 or 16.)
[edit 4]
Still had an open end. pow(x, frac(y)) is just pow(sqrt(x), 2 * frac(y)) and we have a decent 1/sqrt(x). That gives us the far more efficient approach. Say frac(y)=0.101 binary, i.e. 1/2 plus 1/8. Then that means x^0.101 is (x^(1/2) * x^(1/8)). But x^(1/2) is just sqrt(x) and x^(1/8) is sqrt(sqrt(sqrt(x))). Saving one more operation, Newton-Raphson NR(x) gives us 1/sqrt(x) so we calculate 1.0/(NR(x)*NR(NR(NR(x)))). We only invert the end result, don't use the sqrt function directly.
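The binary expansion of the fractional exponent can be illustrated with a short sketch. It uses double and std::sqrt as stand-ins for the fixed-point type and the Newton-Raphson reciprocal square root, and the name pow_frac and the 32-bit encoding of the fraction are assumptions made for this example only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: x^f where f = frac_bits / 2^32 lies in [0, 1).
// Bit 31 of frac_bits has weight 1/2, bit 30 has weight 1/4, and so on.
double pow_frac(double x, std::uint32_t frac_bits) {
    assert(x > 0.0);
    double result = 1.0;
    double root = x;
    for (std::uint32_t bit = 0x80000000u; bit != 0 && frac_bits != 0; bit >>= 1) {
        root = std::sqrt(root);   // x^(1/2), x^(1/4), x^(1/8), ...
        if (frac_bits & bit) {
            result *= root;       // this binary digit of f is 1
            frac_bits &= ~bit;    // lets the loop stop once no digits remain
        }
    }
    return result;
}
```

With frac_bits = 0xA0000000 (binary 0.101) the loop multiplies sqrt(x) by sqrt(sqrt(sqrt(x))), matching the x^(1/2) * x^(1/8) example.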
Below is an example C implementation of Clay S. Turner's fixed-point log base 2 algorithm[1]. The algorithm doesn't require any kind of look-up table. This can be useful on systems where memory constraints are tight and the processor lacks an FPU, such as is the case with many microcontrollers. Log base e and log base 10 are then also supported by using the property of logarithms that, for any base n:
logₙ(x) = logₘ(x) / logₘ(n)
where, for this algorithm, m equals 2.
A nice feature of this implementation is that it supports variable precision: the precision can be determined at runtime, at the expense of range. The way I've implemented it, the processor (or compiler) must be capable of doing 64-bit math for holding some intermediate results. It can be easily adapted to not require 64-bit support, but the range will be reduced.
When using these functions, x is expected to be a fixed-point value scaled according to the specified precision. For instance, if precision is 16, then x should be scaled by 2^16 (65536). The result is a fixed-point value with the same scale factor as the input. A return value of INT32_MIN represents negative infinity. A return value of INT32_MAX indicates an error and errno will be set to EINVAL, indicating that the input precision was invalid.
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include "log2fix.h"
#define INV_LOG2_E_Q1DOT31 UINT64_C(0x58b90bfc) // Inverse log base 2 of e
#define INV_LOG2_10_Q1DOT31 UINT64_C(0x268826a1) // Inverse log base 2 of 10
int32_t log2fix (uint32_t x, size_t precision)
{
    int32_t b;
    int32_t y = 0;
    if (precision < 1 || precision > 31) {
        errno = EINVAL;
        return INT32_MAX; // indicates an error
    }
    if (x == 0) {
        return INT32_MIN; // represents negative infinity
    }
    b = 1U << (precision - 1); // shift only once precision is known to be valid
    // Normalise x into [1, 2), accumulating the integer part of the result.
    while (x < 1U << precision) {
        x <<= 1;
        y -= 1U << precision;
    }
    // Compare in 64 bits: 2U << precision would wrap to 0 when precision is 31.
    while (x >= UINT64_C(2) << precision) {
        x >>= 1;
        y += 1U << precision;
    }
    uint64_t z = x;
    // Each squaring yields one more fractional bit of the logarithm.
    for (size_t i = 0; i < precision; i++) {
        z = z * z >> precision;
        if (z >= UINT64_C(2) << precision) {
            z >>= 1;
            y += b;
        }
        b >>= 1;
    }
    return y;
}
int32_t logfix (uint32_t x, size_t precision)
{
    uint64_t t;
    t = log2fix(x, precision) * INV_LOG2_E_Q1DOT31;
    return t >> 31;
}
int32_t log10fix (uint32_t x, size_t precision)
{
    uint64_t t;
    t = log2fix(x, precision) * INV_LOG2_10_Q1DOT31;
    return t >> 31;
}
The code for this implementation also lives at Github, along with a sample/test program that illustrates how to use this function to compute and display logarithms from numbers read from standard input.
[1] C. S. Turner, "A Fast Binary Logarithm Algorithm", IEEE Signal Processing Mag., pp. 124,140, Sep. 2010.
A good starting point is Jack Crenshaw's book, "Math Toolkit for Real-Time Programming". It has a good discussion of algorithms and implementations for various transcendental functions.
Check my fixed point sqrt implementation using only integer operations.
It was fun to invent. Quite old now.
https://groups.google.com/forum/?hl=fr%05aacf5997b615c37&fromgroups#!topic/comp.lang.c/IpwKbw0MAxw/discussion
Otherwise check the CORDIC set of algorithms. That's the way to implement all the functions you listed and the trigonometric functions.
EDIT : I published the reviewed source on GitHub here

Is there any alternative to using % (modulus) in C/C++?

I read somewhere once that the modulus operator is inefficient on small embedded devices like 8 bit micro-controllers that do not have integer division instruction. Perhaps someone can confirm this but I thought the difference is 5-10 times slower than with an integer division operation.
Is there another way to do this other than keeping a counter variable and manually overflowing to 0 at the mod point?
const int FIZZ = 6;
for(int x = 0; x < MAXCOUNT; x++)
{
if(!(x % FIZZ)) print("Fizz\n"); // slow on some systems
}
vs:
The way I am currently doing it:
const int FIZZ = 6;
int fizzcount = 1;
for(int x = 1; x < MAXCOUNT; x++)
{
if(fizzcount >= FIZZ)
{
print("Fizz\n");
fizzcount = 0;
}
}
Ah, the joys of bitwise arithmetic. A side effect of many division routines is the modulus - so in few cases should division actually be faster than modulus. I'm interested to see the source you got this information from. Processors with multipliers have interesting division routines using the multiplier, but you can get from division result to modulus with just another two steps (multiply and subtract) so it's still comparable. If the processor has a built in division routine you'll likely see it also provides the remainder.
Still, there is a small branch of number theory devoted to Modular Arithmetic which requires study if you really want to understand how to optimize a modulus operation. Modular arithmetic, for instance, is very handy for generating magic squares.
So, in that vein, here's a very low level look at the math of modulus for an example of x, which should show you how simple it can be compared to division:
Maybe a better way to think about the problem is in terms of number
bases and modulo arithmetic. For example, your goal is to compute DOW
mod 7 where DOW is the 16-bit representation of the day of the
week. You can write this as:
DOW = DOW_HI*256 + DOW_LO
DOW%7 = (DOW_HI*256 + DOW_LO) % 7
= ((DOW_HI*256)%7 + (DOW_LO % 7)) %7
= ((DOW_HI%7 * 256%7) + (DOW_LO%7)) %7
= ((DOW_HI%7 * 4) + (DOW_LO%7)) %7
Expressed in this manner, you can separately compute the modulo 7
result for the high and low bytes. Multiply the result for the high by
4 and add it to the low and then finally compute result modulo 7.
Computing the mod 7 result of an 8-bit number can be performed in a
similar fashion. You can write an 8-bit number in octal like so:
X = a*64 + b*8 + c
Where a, b, and c are 3-bit numbers.
X%7 = ((a%7)*(64%7) + (b%7)*(8%7) + c%7) % 7
= (a%7 + b%7 + c%7) % 7
= (a + b + c) % 7
since 64%7 = 8%7 = 1
Of course, a, b, and c are
c = X & 7
b = (X>>3) & 7
a = (X>>6) & 7 // (actually, a is only 2-bits).
The largest possible value for a+b+c is 7+7+3 = 17. So, you'll need
one more octal step. The complete (untested) C version could be
written like:
unsigned char Mod7Byte(unsigned char X)
{
X = (X&7) + ((X>>3)&7) + (X>>6);
X = (X&7) + (X>>3);
return X>=7 ? X-7 : X; // X can be 7 or 8 here (e.g. 127 gives 8)
}
I spent a few moments writing a PIC version. The actual implementation
is slightly different than described above
Mod7Byte:
movwf temp1 ;
andlw 7 ;W=c
movwf temp2 ;temp2=c
rlncf temp1,F ;
swapf temp1,W ;W= a*8+b
andlw 0x1F
addwf temp2,W ;W= a*8+b+c
movwf temp2 ;temp2 is now a 6-bit number
andlw 0x38 ;get the high 3 bits == a'
xorwf temp2,F ;temp2 now has the 3 low bits == b'
rlncf WREG,F ;shift the high bits right 4
swapf WREG,F ;
addwf temp2,W ;W = a' + b'
; at this point, W is between 0 and 10
addlw -7
bc Mod7Byte_L2
Mod7Byte_L1:
addlw 7
Mod7Byte_L2:
return
Here's a little routine to test the algorithm
clrf x
clrf count
TestLoop:
movf x,W
RCALL Mod7Byte
cpfseq count
bra fail
incf count,W
xorlw 7
skpz
xorlw 7
movwf count
incfsz x,F
bra TestLoop
passed:
Finally, for the 16-bit result (which I have not tested), you could
write:
uint16 Mod7Word(uint16 X)
{
return Mod7Byte(Mod7Byte(X & 0xff) + Mod7Byte(X>>8)*4);
}
Scott
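Since the C versions in the quoted message are described as untested, an exhaustive check is cheap. The sketch below restates both routines in C++ under hypothetical names (mod7_byte, mod7_word), with a final step that subtracts 7 whenever the second digit sum reaches 7 or 8, so that all 256 and 65536 inputs can be compared against %.

```cpp
#include <cstdint>

// Octal digit-sum reduction: since 8 % 7 == 1 and 64 % 7 == 1, a byte and
// the sum of its octal digits leave the same remainder modulo 7.
std::uint8_t mod7_byte(std::uint8_t x) {
    unsigned s = (x & 7u) + ((x >> 3) & 7u) + (x >> 6);  // at most 7 + 7 + 3 = 17
    s = (s & 7u) + (s >> 3);                             // at most 8
    return static_cast<std::uint8_t>(s >= 7 ? s - 7 : s);
}

// 256 % 7 == 4, so the high byte's remainder is weighted by 4.
std::uint16_t mod7_word(std::uint16_t x) {
    const unsigned lo = mod7_byte(static_cast<std::uint8_t>(x & 0xff));
    const unsigned hi = mod7_byte(static_cast<std::uint8_t>(x >> 8));
    return mod7_byte(static_cast<std::uint8_t>(lo + hi * 4));  // lo + hi*4 <= 30
}
```

Looping over every input and comparing with x % 7 is enough to confirm the reduction for both widths.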
If you are calculating a number mod some power of two, you can use the bit-wise and operator. Just subtract one from the second number. For example:
x % 8 == x & 7
x % 256 == x & 255
A few caveats:
This only works if the second number is a power of two.
It's only equivalent if the first number is non-negative. Before C99 and C++11, the standards left the sign of the remainder implementation-defined when the first number is negative; since then the result is guaranteed to be negative or zero, which is what most compilers were already doing. A bit-wise and gets rid of the sign bit, so it will always be non-negative (i.e. it's a true modulus, not a remainder). It sounds like that's what you want anyway though.
Your compiler probably already does this when it can, so in most cases it's not worth doing it manually.
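The difference between the two operations for negative inputs is easy to see side by side. A tiny sketch follows, with made-up names rem8 and mask8, assuming a two's complement int (which C++20 guarantees and which practically every earlier implementation uses).

```cpp
// Remainder and mask agree for non-negative x but differ for negative x.
int rem8(int x)  { return x % 8; }  // sign follows x since C++11
int mask8(int x) { return x & 7; }  // always in [0, 7] on two's complement
```

For example, rem8(-13) is -5 while mask8(-13) is 3, so the compiler can only substitute the mask on its own when it knows the operand is unsigned or non-negative.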
There is an overhead most of the time in using modulo with values that are not powers of 2.
This is regardless of the processor as (AFAIK) even processors with modulus operators are a few cycles slower for divide as opposed to mask operations.
For most cases this is not an optimisation that is worth considering, and certainly not worth calculating your own shortcut operation (especially if it still involves divide or multiply).
However, one rule of thumb is to select array sizes etc. to be powers of 2.
so if calculating day of week, may as well use %7 regardless
if setting up a circular buffer of around 100 entries... why not make it 128. You can then write % 128 and most (all) compilers will make this & 0x7F
Unless you really need high performance on multiple embedded platforms, don't change how you code for performance reasons until you profile!
Code that's written awkwardly to optimize for performance is hard to debug and hard to maintain. Write a test case, and profile it on your target. Once you know the actual cost of modulus, then decide if the alternate solution is worth coding.
@Matthew is right. Try this:
int main() {
int i;
for(i = 0; i<=1024; i++) {
if (!(i & 0xFF)) printf("& i = %d\n", i);
if (!(i % 0x100)) printf("mod i = %d\n", i);
}
}
x%y == (x-(x/y)*y)
Hope this helps.
Do you have access to any programmable hardware on the embedded device? Like counters and such? If so, you might be able to write a hardware based mod unit, instead of using the simulated %. (I did that once in VHDL. Not sure if I still have the code though.)
Mind you, you did say that division was 5-10 times faster. Have you considered doing a division, multiplication, and subtraction to simulate the mod? (Edit: Misunderstood the original post. I did think it was odd that division was faster than mod, they are the same operation.)
In your specific case, though, you are checking for a mod of 6. 6 = 2*3. So you could MAYBE get some small gains if you first checked if the least significant bit was a 0. Something like:
if((!(x & 1)) && !(x % 3))
{
print("Fizz\n");
}
If you do that, though, I'd recommend confirming that you get any gains, yay for profilers. And doing some commenting. I'd feel bad for the next guy who has to look at the code otherwise.
You should really check the embedded device you need. All the assembly language I have seen (x86, 68000) implement the modulus using a division.
Actually, the division assembly operation returns the result of the division and the remainder in two different registers.
In the embedded world, the "modulus" operations you need to do are often the ones that break down nicely into bit operations that you can do with &, | and sometimes >>.
@Jeff V: I see a problem with it! (Beyond that your original code was looking for a mod 6 and now you are essentially looking for a mod 8). You keep doing an extra +1! Hopefully your compiler optimizes that away, but why not just start at 2 and go to MAXCOUNT inclusive? Finally, you are returning true every time that (x+1) is NOT divisible by 8. Is that what you want? (I assume it is, but just want to confirm.)
For modulo 6 you can change the Python code to C/C++:
def mod6(number):
    while number > 7:
        number = (number >> 3 << 1) + (number & 0x7)
    if number > 5:
        number -= 6
    return number
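A direct C++ transcription of that loop might look like this (a sketch; unsigned is assumed so the shifts are well-defined, and the function keeps the name mod6 from the Python version):

```cpp
// Shift-and-add reduction modulo 6. Because 8 % 6 == 2, the value
// 8*q + r is congruent to 2*q + r modulo 6.
unsigned mod6(unsigned number) {
    while (number > 7) {
        number = ((number >> 3) << 1) + (number & 0x7u);
    }
    if (number > 5) {
        number -= 6;  // 6 -> 0, 7 -> 1
    }
    return number;
}
```

Each pass shrinks the value to roughly a quarter, so a 32-bit input needs about sixteen iterations; whether that beats a software division routine on a given micro-controller is something to measure.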
Not that this is necessarily better, but you could have an inner loop which always goes up to FIZZ, and an outer loop which repeats it all some certain number of times. You've then perhaps got to special case the final few steps if MAXCOUNT is not evenly divisible by FIZZ.
That said, I'd suggest doing some research and performance profiling on your intended platforms to get a clear idea of the performance constraints you're under. There may be much more productive places to spend your optimisation effort.
The print statement will take orders of magnitude longer than even the slowest implementation of the modulus operator. So basically the comment "slow on some systems" should be "slow on all systems".
Also, the two code snippets provided don't do the same thing. In the second one, the line
if(fizzcount >= FIZZ)
is always false so "FIZZ\n" is never printed.