Is multiplication and division using shift operators in C actually faster?

Is multiplication and division using shift operators in C actually faster? - c++

Multiplication and division can be achieved using bit operators, for example
i*2 = i<<1
i*3 = (i<<1) + i;
i*10 = (i<<3) + (i<<1)
and so on.
Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly? Is there any sort of input that can't be multiplied or divided in this way?

Short answer: Not likely.
Long answer:
Your compiler has an optimizer in it that knows how to multiply as quickly as your target processor architecture is capable. Your best bet is to tell the compiler your intent clearly (i.e. i*2 rather than i << 1) and let it decide what the fastest assembly/machine code sequence is. It's even possible that the processor itself has implemented the multiply instruction as a sequence of shifts & adds in microcode.
Bottom line--don't spend a lot of time worrying about this. If you mean to shift, shift. If you mean to multiply, multiply. Do what is semantically clearest--your coworkers will thank you later. Or, more likely, curse you later if you do otherwise.

Just a concrete point of measure: many years back, I benchmarked two
versions of my hashing algorithm:
unsigned
hash( char const* s )
{
unsigned h = 0;
while ( *s != '\0' ) {
h = 127 * h + (unsigned char)*s;
++ s;
}
return h;
}
and
unsigned
hash( char const* s )
{
unsigned h = 0;
while ( *s != '\0' ) {
h = (h << 7) - h + (unsigned char)*s;
++ s;
}
return h;
}
On every machine I benchmarked it on, the first was at least as fast as
the second. Somewhat surprisingly, it was sometimes faster (e.g. on a
Sun Sparc). When the hardware didn't support fast multiplication (and
most didn't back then), the compiler would convert the multiplication
into the appropriate combinations of shifts and add/sub. And because it
knew the final goal, it could sometimes do so in less instructions than
when you explicitly wrote the shifts and the add/subs.
Note that this was something like 15 years ago. Hopefully, compilers
have only gotten better since then, so you can pretty much count on the
compiler doing the right thing, probably better than you could. (Also,
the reason the code looks so C'ish is because it was over 15 years ago.
I'd obviously use std::string and iterators today.)

In addition to all the other good answers here, let me point out another reason to not use shift when you mean divide or multiply. I have never once seen someone introduce a bug by forgetting the relative precedence of multiplication and addition. I have seen bugs introduced when maintenance programmers forgot that "multiplying" via a shift is logically a multiplication but not syntactically of the same precedence as multiplication. x * 2 + z and x << 1 + z are very different!
If you're working on numbers then use arithmetic operators like + - * / %. If you're working on arrays of bits, use bit twiddling operators like & ^ | >> . Don't mix them; an expression that has both bit twiddling and arithmetic is a bug waiting to happen.

This depends on the processor and the compiler. Some compilers already optimize code this way, others don't.
So you need to check each time your code needs to be optimized this way.
Unless you desperately need to optimize, I would not scramble my source code just to save an assembly instruction or processor cycle.

Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly?
It might or might not be on your machine - if you care, measure in your real-world usage.
A case study - from 486 to core i7
Benchmarking is very difficult to do meaningfully, but we can look at a few facts. From http://www.penguin.cz/~literakl/intel/s.html#SAL and http://www.penguin.cz/~literakl/intel/i.html#IMUL we get an idea of x86 clock cycles needed for arithmetic shift and multiplication. Say we stick to "486" (the newest one listed), 32 bit registers and immediates, IMUL takes 13-42 cycles and IDIV 44. Each SAL takes 2, and adding 1, so even with a few of those together shifting superficially looks like a winner.
These days, with the core i7:
(from http://software.intel.com/en-us/forums/showthread.php?t=61481)
The latency is 1 cycle for an integer addition and 3 cycles for an integer multiplication. You can find the latencies and thoughput in Appendix C of the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", which is located on http://www.intel.com/products/processor/manuals/.
(from some Intel blurb)
Using SSE, the Core i7 can issue simultaneous add and multiply instructions, resulting in a peak rate of 8 floating-point operations (FLOP) per clock cycle
That gives you an idea of how far things have come. The optimisation trivia - like bit shifting versus * - that was been taken seriously even into the 90s is just obsolete now. Bit-shifting is still faster, but for non-power-of-two mul/div by the time you do all your shifts and add the results it's slower again. Then, more instructions means more cache faults, more potential issues in pipelining, more use of temporary registers may mean more saving and restoring of register content from the stack... it quickly gets too complicated to quantify all the impacts definitively but they're predominantly negative.
functionality in source code vs implementation
More generally, your question is tagged C and C++. As 3rd generation languages, they're specifically designed to hide the details of the underlying CPU instruction set. To satisfy their language Standards, they must support multiplication and shifting operations (and many others) even if the underlying hardware doesn't. In such cases, they must synthesize the required result using many other instructions. Similarly, they must provide software support for floating point operations if the CPU lacks it and there's no FPU. Modern CPUs all support * and <<, so this might seem absurdly theoretical and historical, but the significance thing is that the freedom to choose implementation goes both ways: even if the CPU has an instruction that implements the operation requested in the source code in the general case, the compiler's free to choose something else that it prefers because it's better for the specific case the compiler's faced with.
Examples (with a hypothetical assembly language)
source literal approach optimised approach
#define N 0
int x; .word x xor registerA, registerA
x *= N; move x -> registerA
move x -> registerB
A = B * immediate(0)
store registerA -> x
...............do something more with x...............
Instructions like exclusive or (xor) have no relationship to the source code, but xor-ing anything with itself clears all the bits, so it can be used to set something to 0. Source code that implies memory addresses may not entail any being used.
These kind of hacks have been used for as long as computers have been around. In the early days of 3GLs, to secure developer uptake the compiler output had to satisfy the existing hardcore hand-optimising assembly-language dev. community that the produced code wasn't slower, more verbose or otherwise worse. Compilers quickly adopted lots of great optimisations - they became a better centralised store of it than any individual assembly language programmer could possibly be, though there's always the chance that they miss a specific optimisation that happens to be crucial in a specific case - humans can sometimes nut it out and grope for something better while compilers just do as they've been told until someone feeds that experience back into them.
So, even if shifting and adding is still faster on some particular hardware, then the compiler writer's likely to have worked out exactly when it's both safe and beneficial.
Maintainability
If your hardware changes you can recompile and it'll look at the target CPU and make another best choice, whereas you're unlikely to ever want to revisit your "optimisations" or list which compilation environments should use multiplication and which should shift. Think of all the non-power-of-two bit-shifted "optimisations" written 10+ years ago that are now slowing down the code they're in as it runs on modern processors...!
Thankfully, good compilers like GCC can typically replace a series of bitshifts and arithmetic with a direct multiplication when any optimisation is enabled (i.e. ...main(...) { return (argc << 4) + (argc << 2) + argc; } -> imull $21, 8(%ebp), %eax) so a recompilation may help even without fixing the code, but that's not guaranteed.
Strange bitshifting code implementing multiplication or division is far less expressive of what you were conceptually trying to achieve, so other developers will be confused by that, and a confused programmer's more likely to introduce bugs or remove something essential in an effort to restore seeming sanity. If you only do non-obvious things when they're really tangibly beneficial, and then document them well (but don't document other stuff that's intuitive anyway), everyone will be happier.
General solutions versus partial solutions
If you have some extra knowledge, such as that your int will really only be storing values x, y and z, then you may be able to work out some instructions that work for those values and get you your result more quickly than when the compiler's doesn't have that insight and needs an implementation that works for all int values. For example, consider your question:
Multiplication and division can be achieved using bit operators...
You illustrate multiplication, but how about division?
int x;
x >> 1; // divide by 2?
According to the C++ Standard 5.8:
-3- The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
So, your bit shift has an implementation defined result when x is negative: it may not work the same way on different machines. But, / works far more predictably. (It may not be perfectly consistent either, as different machines may have different representations of negative numbers, and hence different ranges even when there are the same number of bits making up the representation.)
You may say "I don't care... that int is storing the age of the employee, it can never be negative". If you have that kind of special insight, then yes - your >> safe optimisation might be passed over by the compiler unless you explicitly do it in your code. But, it's risky and rarely useful as much of the time you won't have this kind of insight, and other programmers working on the same code won't know that you've bet the house on some unusual expectations of the data you'll be handling... what seems a totally safe change to them might backfire because of your "optimisation".
Is there any sort of input that can't be multiplied or divided in this way?
Yes... as mentioned above, negative numbers have implementation defined behaviour when "divided" by bit-shifting.

Just tried on my machine compiling this :
int a = ...;
int b = a * 10;
When disassembling it produces output :
MOV EAX,DWORD PTR SS:[ESP+1C] ; Move a into EAX
LEA EAX,DWORD PTR DS:[EAX+EAX*4] ; Multiply by 5 without shift !
SHL EAX, 1 ; Multiply by 2 using shift
This version is faster than your hand-optimized code with pure shifting and addition.
You really never know what the compiler is going to come up with, so it's better to simply write a normal multiplication and let him optimize the way he wants to, except in very precise cases where you know the compiler cannot optimize.

Shifting is generally a lot faster than multiplying at an instruction level but you may well be wasting your time doing premature optimisations. The compiler may well perform these optimisations at compiletime. Doing it yourself will affect readability and possibly have no effect on performance. It's probably only worth it to do things like this if you have profiled and found this to be a bottleneck.
Actually the division trick, known as 'magic division' can actually yield huge payoffs. Again you should profile first to see if it's needed. But if you do use it there are useful programs around to help you figure out what instructions are needed for the same division semantics. Here is an example : http://www.masm32.com/board/index.php?topic=12421.0
An example which I have lifted from the OP's thread on MASM32:
include ConstDiv.inc
...
mov eax,9999999
; divide eax by 100000
cdiv 100000
; edx = quotient
Would generate:
mov eax,9999999
mov edx,0A7C5AC47h
add eax,1
.if !CARRY?
mul edx
.endif
shr edx,16

Shift and integer multiply instructions have similar performance on most modern CPUs - integer multiply instructions were relatively slow back in the 1980s but in general this is no longer true. Integer multiply instructions may have higher latency, so there may still be cases where a shift is preferable. Ditto for cases where you can keep more execution units busy (although this can cut both ways).
Integer division is still relatively slow though, so using a shift instead of division by a power of 2 is still a win, and most compilers will implement this as an optimisation. Note however that for this optimisation to be valid the dividend needs to be either unsigned or must be known to be positive. For a negative dividend the shift and divide are not equivalent!
#include <stdio.h>
int main(void)
{
int i;
for (i = 5; i >= -5; --i)
{
printf("%d / 2 = %d, %d >> 1 = %d\n", i, i / 2, i, i >> 1);
}
return 0;
}
Output:
5 / 2 = 2, 5 >> 1 = 2
4 / 2 = 2, 4 >> 1 = 2
3 / 2 = 1, 3 >> 1 = 1
2 / 2 = 1, 2 >> 1 = 1
1 / 2 = 0, 1 >> 1 = 0
0 / 2 = 0, 0 >> 1 = 0
-1 / 2 = 0, -1 >> 1 = -1
-2 / 2 = -1, -2 >> 1 = -1
-3 / 2 = -1, -3 >> 1 = -2
-4 / 2 = -2, -4 >> 1 = -2
-5 / 2 = -2, -5 >> 1 = -3
So if you want to help the compiler then make sure the variable or expression in the dividend is explicitly unsigned.

It completely depends on target device, language, purpose, etc.
Pixel crunching in a video card driver? Very likely, yes!
.NET business application for your department? Absolutely no reason to even look into it.
For a high performance game for a mobile device it might be worth looking into, but only after easier optimizations have been performed.

Don't do unless you absolutely need to and your code intent requires shifting rather than multiplication/division.
In typical day - you could potentialy save few machine cycles (or loose, since compiler knows better what to optimize), but the cost doesn't worth it - you spend time on minor details rather than actual job, maintaining the code becomes harder and your co-workers will curse you.
You might need to do it for high-load computations, where each saved cycle means minutes of runtime. But, you should optimize one place at a time and do performance tests each time to see if you really made it faster or broke compilers logic.

As far as I know in some machines multiplication can need upto 16 to 32 machine cycle. So Yes, depending on the machine type, bitshift operators are faster than multiplication / division.
However certain machine do have their math processor, which contains special instructions for multiplication/division.

I agree with the marked answer by Drew Hall. The answer could use some additional notes though.
For the vast majority of software developers the processor and compiler are no longer relevant to the question. Most of us are far beyond the 8088 and MS-DOS. It is perhaps only relevant for those who are still developing for embedded processors...
At my software company Math (add/sub/mul/div) should be used for all mathematics.
While Shift should be used when converting between data types eg. ushort to byte as n>>8 and not n/256.

In the case of signed integers and right shift vs division, it can make a difference. For negative numbers, the shift rounds rounds towards negative infinity whereas division rounds towards zero. Of course the compiler will change the division to something cheaper, but it will usually change it to something that has the same rounding behavior as division, because it is either unable to prove that the variable won't be negative or it simply doesn't care.
So if you can prove that a number won't be negative or if you don't care which way it will round, you can do that optimization in a way that is more likely to make a difference.

Python test performing same multiplication 100 million times against the same random numbers.
>>> from timeit import timeit
>>> setup_str = 'import scipy; from scipy import random; scipy.random.seed(0)'
>>> N = 10*1000*1000
>>> timeit('x=random.randint(65536);', setup=setup_str, number=N)
1.894096851348877 # Time from generating the random #s and no opperati
>>> timeit('x=random.randint(65536); x*2', setup=setup_str, number=N)
2.2799630165100098
>>> timeit('x=random.randint(65536); x << 1', setup=setup_str, number=N)
2.2616429328918457
>>> timeit('x=random.randint(65536); x*10', setup=setup_str, number=N)
2.2799630165100098
>>> timeit('x=random.randint(65536); (x << 3) + (x<<1)', setup=setup_str, number=N)
2.9485139846801758
>>> timeit('x=random.randint(65536); x // 2', setup=setup_str, number=N)
2.490908145904541
>>> timeit('x=random.randint(65536); x / 2', setup=setup_str, number=N)
2.4757170677185059
>>> timeit('x=random.randint(65536); x >> 1', setup=setup_str, number=N)
2.2316000461578369
So in doing a shift rather than multiplication/division by a power of two in python, there's a slight improvement (~10% for division; ~1% for multiplication). If its a non-power of two, there's likely a considerable slowdown.
Again these #s will change depending on your processor, your compiler (or interpreter -- did in python for simplicity).
As with everyone else, don't prematurely optimize. Write very readable code, profile if its not fast enough, and then try to optimize the slow parts. Remember, your compiler is much better at optimization than you are.

There are optimizations the compiler can't do because they only work for a reduced set of inputs.
Below there is c++ sample code that can do a faster division doing a 64bits "Multiplication by the reciprocal". Both numerator and denominator must be below certain threshold. Note that it must be compiled to use 64 bits instructions to be actually faster than normal division.
#include <stdio.h>
#include <chrono>
static const unsigned s_bc = 32;
static const unsigned long long s_p = 1ULL << s_bc;
static const unsigned long long s_hp = s_p / 2;
static unsigned long long s_f;
static unsigned long long s_fr;
static void fastDivInitialize(const unsigned d)
{
s_f = s_p / d;
s_fr = s_f * (s_p - (s_f * d));
}
static unsigned fastDiv(const unsigned n)
{
return (s_f * n + ((s_fr * n + s_hp) >> s_bc)) >> s_bc;
}
static bool fastDivCheck(const unsigned n, const unsigned d)
{
// 32 to 64 cycles latency on modern cpus
const unsigned expected = n / d;
// At least 10 cycles latency on modern cpus
const unsigned result = fastDiv(n);
if (result != expected)
{
printf("Failed for: %u/%u != %u\n", n, d, expected);
return false;
}
return true;
}
int main()
{
unsigned result = 0;
// Make sure to verify it works for your expected set of inputs
const unsigned MAX_N = 65535;
const unsigned MAX_D = 40000;
const double ONE_SECOND_COUNT = 1000000000.0;
auto t0 = std::chrono::steady_clock::now();
unsigned count = 0;
printf("Verifying...\n");
for (unsigned d = 1; d <= MAX_D; ++d)
{
fastDivInitialize(d);
for (unsigned n = 0; n <= MAX_N; ++n)
{
count += !fastDivCheck(n, d);
}
}
auto t1 = std::chrono::steady_clock::now();
printf("Errors: %u / %u (%.4fs)\n", count, MAX_D * (MAX_N + 1), (t1 - t0).count() / ONE_SECOND_COUNT);
t0 = t1;
for (unsigned d = 1; d <= MAX_D; ++d)
{
fastDivInitialize(d);
for (unsigned n = 0; n <= MAX_N; ++n)
{
result += fastDiv(n);
}
}
t1 = std::chrono::steady_clock::now();
printf("Fast division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);
t0 = t1;
count = 0;
for (unsigned d = 1; d <= MAX_D; ++d)
{
for (unsigned n = 0; n <= MAX_N; ++n)
{
result += n / d;
}
}
t1 = std::chrono::steady_clock::now();
printf("Normal division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);
getchar();
return result;
}

I think in the one case that you want to multiply or divide by a power of two, you can't go wrong with using bitshift operators, even if the compiler converts them to a MUL/DIV, because some processors microcode (really, a macro) them anyway, so for those cases you will achieve an improvement, especially if the shift is more than 1. Or more explicitly, if the CPU has no bitshift operators, it will be a MUL/DIV anyway, but if the CPU has bitshift operators, you avoid a microcode branch and this is a few instructions less.
I am writing some code right now that requires a lot of doubling/halving operations because it is working on a dense binary tree, and there is one more operation that I suspect might be more optimal than an addition - a left (power of two multiply) shift with an addition. This can be replaced with a left shift and an xor if the shift is wider than the number of bits you want to add, easy example is (i<<1)^1, which adds one to a doubled value. This does not of course apply to a right shift (power of two divide) because only a left (little endian) shift fills the gap with zeros.
In my code, these multiply/divide by two and powers of two operations are very intensively used and because the formulae are quite short already, each instruction that can be eliminated can be a substantial gain. If the processor does not support these bitshift operators, no gain will happen but neither will there be a loss.
Also, in the algorithms I am writing, they visually represent the movements that occur so in that sense they are in fact more clear. The left hand side of a binary tree is bigger, and the right is smaller. As well as that, in my code, odd and even numbers have a special significance, and all left-hand children in the tree are odd and all right hand children, and the root, are even. In some cases, which I haven't encountered yet, but may, oh, actually, I didn't even think of this, x&1 may be a more optimal operation compared to x%2. x&1 on an even number will produce zero, but will produce 1 for an odd number.
Going a bit further than just odd/even identification, if I get zero for x&3 I know that 4 is a factor of our number, and same for x%7 for 8, and so on. I know that these cases have probably got limited utility but it's nice to know that you can avoid a modulus operation and use a bitwise logic operation instead, because bitwise operations are almost always the fastest, and least likely to be ambiguous to the compiler.
I am pretty much inventing the field of dense binary trees so I expect that people may not grasp the value of this comment, as very rarely do people want to only perform factorisations on only powers of two, or only multiply/divide powers of two.

Whether it is actually faster depends on the hardware and compiler actually used.

If you compare output for x+x , x*2 and x<<1 syntax on a gcc compiler, then you would get the same result in x86 assembly : https://godbolt.org/z/JLpp0j
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
add eax, eax
pop rbp
ret
So you can consider gcc as smart enought to determine his own best solution independently from what you typed.

I too wanted to see if I could Beat the House. this is a more general bitwise for any-number by any number multiplication. the macros I made are about 25% more to twice as slower than normal * multiplication. as said by others if it's close to a multiple of 2 or made up of few multiples of 2 you might win. like X*23 made up of (X<<4)+(X<<2)+(X<<1)+X is going to be slower then X*65 made up of (X<<6)+X.
#include <stdio.h>
#include <time.h>
#define MULTIPLYINTBYMINUS(X,Y) (-((X >> 30) & 1)&(Y<<30))+(-((X >> 29) & 1)&(Y<<29))+(-((X >> 28) & 1)&(Y<<28))+(-((X >> 27) & 1)&(Y<<27))+(-((X >> 26) & 1)&(Y<<26))+(-((X >> 25) & 1)&(Y<<25))+(-((X >> 24) & 1)&(Y<<24))+(-((X >> 23) & 1)&(Y<<23))+(-((X >> 22) & 1)&(Y<<22))+(-((X >> 21) & 1)&(Y<<21))+(-((X >> 20) & 1)&(Y<<20))+(-((X >> 19) & 1)&(Y<<19))+(-((X >> 18) & 1)&(Y<<18))+(-((X >> 17) & 1)&(Y<<17))+(-((X >> 16) & 1)&(Y<<16))+(-((X >> 15) & 1)&(Y<<15))+(-((X >> 14) & 1)&(Y<<14))+(-((X >> 13) & 1)&(Y<<13))+(-((X >> 12) & 1)&(Y<<12))+(-((X >> 11) & 1)&(Y<<11))+(-((X >> 10) & 1)&(Y<<10))+(-((X >> 9) & 1)&(Y<<9))+(-((X >> 8) & 1)&(Y<<8))+(-((X >> 7) & 1)&(Y<<7))+(-((X >> 6) & 1)&(Y<<6))+(-((X >> 5) & 1)&(Y<<5))+(-((X >> 4) & 1)&(Y<<4))+(-((X >> 3) & 1)&(Y<<3))+(-((X >> 2) & 1)&(Y<<2))+(-((X >> 1) & 1)&(Y<<1))+(-((X >> 0) & 1)&(Y<<0))
#define MULTIPLYINTBYSHIFT(X,Y) (((((X >> 30) & 1)<<31)>>31)&(Y<<30))+(((((X >> 29) & 1)<<31)>>31)&(Y<<29))+(((((X >> 28) & 1)<<31)>>31)&(Y<<28))+(((((X >> 27) & 1)<<31)>>31)&(Y<<27))+(((((X >> 26) & 1)<<31)>>31)&(Y<<26))+(((((X >> 25) & 1)<<31)>>31)&(Y<<25))+(((((X >> 24) & 1)<<31)>>31)&(Y<<24))+(((((X >> 23) & 1)<<31)>>31)&(Y<<23))+(((((X >> 22) & 1)<<31)>>31)&(Y<<22))+(((((X >> 21) & 1)<<31)>>31)&(Y<<21))+(((((X >> 20) & 1)<<31)>>31)&(Y<<20))+(((((X >> 19) & 1)<<31)>>31)&(Y<<19))+(((((X >> 18) & 1)<<31)>>31)&(Y<<18))+(((((X >> 17) & 1)<<31)>>31)&(Y<<17))+(((((X >> 16) & 1)<<31)>>31)&(Y<<16))+(((((X >> 15) & 1)<<31)>>31)&(Y<<15))+(((((X >> 14) & 1)<<31)>>31)&(Y<<14))+(((((X >> 13) & 1)<<31)>>31)&(Y<<13))+(((((X >> 12) & 1)<<31)>>31)&(Y<<12))+(((((X >> 11) & 1)<<31)>>31)&(Y<<11))+(((((X >> 10) & 1)<<31)>>31)&(Y<<10))+(((((X >> 9) & 1)<<31)>>31)&(Y<<9))+(((((X >> 8) & 1)<<31)>>31)&(Y<<8))+(((((X >> 7) & 1)<<31)>>31)&(Y<<7))+(((((X >> 6) & 1)<<31)>>31)&(Y<<6))+(((((X >> 5) & 1)<<31)>>31)&(Y<<5))+(((((X >> 4) & 1)<<31)>>31)&(Y<<4))+(((((X >> 3) & 1)<<31)>>31)&(Y<<3))+(((((X >> 2) & 1)<<31)>>31)&(Y<<2))+(((((X >> 1) & 1)<<31)>>31)&(Y<<1))+(((((X >> 0) & 1)<<31)>>31)&(Y<<0))
int main()
{
int randomnumber=23;
int randomnumber2=23;
int checknum=23;
clock_t start, diff;
srand(time(0));
start = clock();
for(int i=0;i<1000000;i++)
{
randomnumber = rand() % 10000;
randomnumber2 = rand() % 10000;
checknum=MULTIPLYINTBYMINUS(randomnumber,randomnumber2);
if (checknum!=randomnumber*randomnumber2)
{
printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
}
}
diff = clock() - start;
int msec = diff * 1000 / CLOCKS_PER_SEC;
printf("MULTIPLYINTBYMINUS Time %d milliseconds", msec);
start = clock();
for(int i=0;i<1000000;i++)
{
randomnumber = rand() % 10000;
randomnumber2 = rand() % 10000;
checknum=MULTIPLYINTBYSHIFT(randomnumber,randomnumber2);
if (checknum!=randomnumber*randomnumber2)
{
printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
}
}
diff = clock() - start;
msec = diff * 1000 / CLOCKS_PER_SEC;
printf("MULTIPLYINTBYSHIFT Time %d milliseconds", msec);
start = clock();
for(int i=0;i<1000000;i++)
{
randomnumber = rand() % 10000;
randomnumber2 = rand() % 10000;
checknum= randomnumber*randomnumber2;
if (checknum!=randomnumber*randomnumber2)
{
printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
}
}
diff = clock() - start;
msec = diff * 1000 / CLOCKS_PER_SEC;
printf("normal * Time %d milliseconds", msec);
return 0;
}

Related

ADD vs OR vs XOR to calculate an index

When working with two dimensional arrays with a width that is a power of two, there are multiple ways to calculate the index for a (x, y) pair:
given
size_t width, height, stride_shift;
auto array = std::vector<T>(width*height);
size_t x, y;
assert(width == ((size_t)1 << stride_shift));
assert(x < width)
assert(y < height);
1st way - Addition
size_t index = x + (y << stride_shift);
2nd way - Bitwise OR
size_t index = x | (y << stride_shift);
3rd way - Bitwise XOR
size_t index = x ^ (y << stride_shift);
It is then used like this
auto& ref = array[index];
...
All yield the same result, because a+b, a|b and a^b result in the same value, as long as not both bits are set at the same bit position.
It works because the index is just a concatenation of the bits of x and y:
lowest bits
v v
index = 0b...yyyyyyyyyxxxxxxxxx
\___ ___/
'
stride_shift bits
So the question is, what is potentially faster in x86(-64)? What is recommended for readable code?
Could there be a speedup by using +, because the compiler might use both the ADD and LEA instruction, such that multiple additions can be performed at the same time on both the address calculation unit and the ALU.
Could OR and XOR be potentially faster? Because the calculation of every new bit is independent of all others, that means no bit is dependent on the carry of lower bits.
Consider a nested for loop, y is iterated on the outside. Are there compilers that do an optimization in loops where row_pointer = array.data()+(y << stride_shift) is calculated for each iteration of y. And then pointer = row_pointer+x. I think this optimization wouldn't be possible using | or ^.
Mixing bitwise operations and arithmetic operations is sometimes considered bad code, so should one choose | to go with the <<?
As a bonus: How does this apply in general/to other architectures?

I would expect no difference between the 3 versions; on modern x86 architectures +, | and ^ all have identical latency of 1, throughput 0.5 (including the LEA r+r*s variant) and are all executed on the ALU.
So I would concentrate on code readability instead.
x + (y << stride_shift) definitely looks more readable (perhaps even better: x + y * width, which should get optimized into a shift).
But just to be sure, we can benchmark it real quick.
Test results from quick-bench (Clang 11 results, GCC is acting up at the moment):
The apparent speedup in the + version seems to be due to 1 less MOV thanks to the LEA's output register (see godbolt link). But that's probably just an artifact of the contrived test. In real-life code there would be plenty of room for the optimizer to hide this latency.
But anyway, it seems the + version wins either way.

Fast method to multiply integer by proper fraction without floats or overflow

My program frequently requires the following calculation to be performed:
Given:
N is a 32-bit integer
D is a 32-bit integer
abs(N) <= abs(D)
D != 0
X is a 32-bit integer of any value
Find:
X * N / D as a rounded integer that is X scaled to N/D (i.e. 10 * 2 / 3 = 7)
Obviously I could just use r=x*n/d directly but I will often get overflow from the x*n. If I instead do r=x*(n/d) then I only get 0 or x due to integer division dropping the fractional component. And then there's r=x*(float(n)/d) but I can't use floats in this case.
Accuracy would be great but isn't as critical as speed and being a deterministic function (always returning the same value given the same inputs).
N and D are currently signed but I could work around them being always unsigned if it helps.
A generic function that works with any value of X (and N and D, as long as N <= D) is ideal since this operation is used in various different ways but I also have a specific case where the value of X is a known constant power of 2 (2048, to be precise), and just getting that specific call sped up would be a big help.
Currently I am accomplishing this using 64-bit multiply and divide to avoid overflow (essentially int multByProperFraction(int x, int n, int d) { return (__int64)x * n / d; } but with some asserts and extra bit fiddling for rounding instead of truncating).
Unfortunately, my profiler is reporting the 64-bit divide function as taking up way too much CPU (this is a 32-bit application). I've tried to reduce how often I need to do this calculation but am running out of ways around it, so I'm trying to figure out a faster method, if it is even possible. In the specific case where X is a constant 2048, I use a bit shift instead of multiply but that doesn't help much.

Tolerate imprecision and use the 16 MSBits of n,d,x
Algorithm
while (|n| > 0xffff) n/2, sh++
while (|x| > 0xffff) x/2, sh++
while (|d| > 0xffff) d/2, sh--
r = n*x/d // A 16x16 to 32 multiply followed by a 32/16-bit divide.
shift r by sh.
When 64 bit divide is expensive, the pre/post processing here may be worth to do a 32-bit divide - which will certainly be the big chunk of CPU.
If the compiler cannot be coaxed into doing a 32-bit/16-bit divide, then skip the while (|d| > 0xffff) d/2, sh-- step and do a 32/32 divide.
Use unsigned math as possible.

The basic correct approach to this is simply (uint64_t)x*n/d. That's optimal assuming d is variable and unpredictable. But if d is constant or changes infrequently, you can pre-generate constants such that exact division by d can be performed as a multiplication followed by a bitshift. A good description of the algorithm, which is roughly what GCC uses internally to transform division by a constant into multiplication, is here:
http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html
I'm not sure how easy it is to make it work for a "64/32" division (i.e. dividing the result of (uint64_t)x*n), but you should be able to just break it up into high and low parts if nothing else.
Note that these algorithms are also available as libdivide.

I've now benchmarked several possible solutions, including weird/clever ones from other sources like combining 32-bit div & mod & add or using peasant math, and here are my conclusions:
First, if you are only targeting Windows and using VSC++, just use MulDiv(). It is quite fast (faster than directly using 64-bit variables in my tests) while still being just as accurate and rounding the result for you. I could not find any superior method to do this kind of thing on Windows with VSC++, even taking into account restrictions like unsigned-only and N <= D.
However, in my case having a function with deterministic results even across platforms is even more important than speed. On another platform I was using as a test, the 64-bit divide is much, much slower than the 32-bit one when using the 32-bit libraries, and there is no MulDiv() to use. The 64-bit divide on this platform takes ~26x as long as a 32-bit divide (yet the 64-bit multiply is just as fast as the 32-bit version...).
So if you have a case like me, I will share the best results I got, which turned out to be just optimizations of chux's answer.
Both of the methods I will share below make use of the following function (though the compiler-specific intrinsics only actually helped in speed with MSVC in Windows):
inline u32 bitsRequired(u32 val)
{
#ifdef _MSC_VER
DWORD r = 0;
_BitScanReverse(&r, val | 1);
return r+1;
#elif defined(__GNUC__) || defined(__clang__)
return 32 - __builtin_clz(val | 1);
#else
int r = 1;
while (val >>= 1) ++r;
return r;
#endif
}
Now, if x is a constant that's 16-bit in size or smaller and you can pre-compute the bits required, I found the best results in speed and accuracy from this function:
u32 multConstByPropFrac(u32 x, u32 nMaxBits, u32 n, u32 d)
{
//assert(nMaxBits == 32 - bitsRequired(x));
//assert(n <= d);
const int bitShift = bitsRequired(n) - nMaxBits;
if( bitShift > 0 )
{
n >>= bitShift;
d >>= bitShift;
}
// Remove the + d/2 part if don't need rounding
return (x * n + d/2) / d;
}
On the platform with the slow 64-bit divide, the above function ran ~16.75x as fast as return ((u64)x * n + d/2) / d; and with an average 99.999981% accuracy (comparing difference in return value from expected to range of x, i.e. returning +/-1 from expected when x is 2048 would be 100 - (1/2048 * 100) = 99.95% accurate) when testing it with a million or so randomized inputs where roughly half of them would normally have been an overflow. Worst-case accuracy was 99.951172%.
For the general use case, I found the best results from the following (and without needing to restrict N <= D to boot!):
u32 scaleToFraction(u32 x, u32 n, u32 d)
{
u32 bits = bitsRequired(x);
int bitShift = bits - 16;
if( bitShift < 0 ) bitShift = 0;
int sh = bitShift;
x >>= bitShift;
bits = bitsRequired(n);
bitShift = bits - 16;
if( bitShift < 0 ) bitShift = 0;
sh += bitShift;
n >>= bitShift;
bits = bitsRequired(d);
bitShift = bits - 16;
if( bitShift < 0 ) bitShift = 0;
sh -= bitShift;
d >>= bitShift;
// Remove the + d/2 part if don't need rounding
u32 r = (x * n + d/2) / d;
if( sh < 0 )
r >>= (-sh);
else //if( sh > 0 )
r <<= sh;
return r;
}
On the platform with the slow 64-bit divide, the above function ran ~18.5x as fast as using 64-bit variables and with 99.999426% average and 99.947479% worst-case accuracy.
I was able to get more speed or more accuracy by messing with the shifting, such as trying to not shift all the way down to 16-bit if it wasn't strictly necessary, but any increase in speed came at a high cost in accuracy and vice versa.
None of the other methods I tested came even close to the same speed or accuracy, most being slower than just using the 64-bit method or having huge loss in precision, so not worth going into.
Obviously, no guarantee that anyone else will get similar results on other platforms!
EDIT: Replaced some bit-twiddling hacks with plain code that actually ran faster anyway by letting the compiler do its job.

C++: Binary to Decimal Conversion

I am trying to convert a binary array to decimal in following way:
uint8_t array[8] = {1,1,1,1,0,1,1,1} ;
int decimal = 0 ;
for(int i = 0 ; i < 8 ; i++)
decimal = (decimal << 1) + array[i] ;
Actually I have to convert 64 bit binary array to decimal and I have to do it for million times.
Can anybody help me, is there any faster way to do the above ? Or is the above one is nice ?

Your method is adequate, to call it nice I would just not mix bitwise operations and "mathematical" way of converting to decimal, i.e. use either
decimal = decimal << 1 | array[i];
or
decimal = decimal * 2 + array[i];

It is important, before attempting any optimisation, to profile the code. Time it, look at the code being generated, and optimise only when you understand what is going on.
And as already pointed out, the best optimisation is to not do something, but to make a higher level change that removes the need.
However...
Most changes you might want to trivially make here, are likely to be things the compiler has already done (a shift is the same as a multiply to the compiler). Some may actually prevent the compiler from making an optimisation (changing an add to an or will restrict the compiler - there are more ways to add numbers, and only you know that in this case the result will be the same).
Pointer arithmetic may be better, but the compiler is not stupid - it ought to already be producing decent code for dereferencing the array, so you need to check that you have not in fact made matters worse by introducing an additional variable.
In this case the loop count is well defined and limited, so unrolling probably makes sense.
Further more it depends on how dependent you want the result to be on your target architecture. If you want portability, it is hard(er) to optimise.
For example, the following produces better code here:
unsigned int x0 = *(unsigned int *)array;
unsigned int x1 = *(unsigned int *)(array+4);
int decimal = ((x0 * 0x8040201) >> 20) + ((x1 * 0x8040201) >> 24);
I could probably also roll a 64-bit version that did 8 bits at a time instead of 4.
But it is very definitely not portable code. I might use that locally if I knew what I was running on and I just wanted to crunch numbers quickly. But I probably wouldn't put it in production code. Certainly not without documenting what it did, and without the accompanying unit test that checks that it actually works.

The binary 'compression' can be generalized as a problem of weighted sum -- and for that there are some interesting techniques.
X mod (255) means essentially summing of all independent 8-bit numbers.
X mod 254 means summing each digit with a doubling weight, since 1 mod 254 = 1, 256 mod 254 = 2, 256*256 mod 254 = 2*2 = 4, etc.
If the encoding was big endian, then *(unsigned long long)array % 254 would produce a weighted sum (with truncated range of 0..253). Then removing the value with weight 2 and adding it manually would produce the correct result:
uint64_t a = *(uint64_t *)array;
return (a & ~256) % 254 + ((a>>9) & 2);
Other mechanism to get the weight is to premultiply each binary digit by 255 and masking the correct bit:
uint64_t a = (*(uint64_t *)array * 255) & 0x0102040810204080ULL; // little endian
uint64_t a = (*(uint64_t *)array * 255) & 0x8040201008040201ULL; // big endian
In both cases one can then take the remainder of 255 (and correct now with weight 1):
return (a & 0x00ffffffffffffff) % 255 + (a>>56); // little endian, or
return (a & ~1) % 255 + (a&1);
For the sceptical mind: I actually did profile the modulus version to be (slightly) faster than iteration on x64.
To continue from the answer of JasonD, parallel bit selection can be iteratively utilized.
But first expressing the equation in full form would help the compiler to remove the artificial dependency created by the iterative approach using accumulation:
ret = ((a[0]<<7) | (a[1]<<6) | (a[2]<<5) | (a[3]<<4) |
(a[4]<<3) | (a[5]<<2) | (a[6]<<1) | (a[7]<<0));
vs.
HI=*(uint32_t)array, LO=*(uint32_t)&array[4];
LO |= (HI<<4); // The HI dword has a weight 16 relative to Lo bytes
LO |= (LO>>14); // High word has 4x weight compared to low word
LO |= (LO>>9); // high byte has 2x weight compared to lower byte
return LO & 255;
One more interesting technique would be to utilize crc32 as a compression function; then it just happens that the result would be LookUpTable[crc32(array) & 255]; as there is no collision with this given small subset of 256 distinct arrays. However to apply that, one has already chosen the road of even less portability and could as well end up using SSE intrinsics.

You could use accumulate, with a doubling and adding binary operation:
int doubleSumAndAdd(const int& sum, const int& next) {
return (sum * 2) + next;
}
int decimal = accumulate(array, array+ARRAY_SIZE,
doubleSumAndAdd);
This produces big-endian integers, whereas OP code produces little-endian.

Try this, I converted a binary digit of up to 1020 bits
#include <sstream>
#include <string>
#include <math.h>
#include <iostream>
using namespace std;
long binary_decimal(string num) /* Function to convert binary to dec */
{
long dec = 0, n = 1, exp = 0;
string bin = num;
if(bin.length() > 1020){
cout << "Binary Digit too large" << endl;
}
else {
for(int i = bin.length() - 1; i > -1; i--)
{
n = pow(2,exp++);
if(bin.at(i) == '1')
dec += n;
}
}
return dec;
}
Theoretically this method will work for a binary digit of infinate length

How can I safely average two unsigned ints in C++?

Using integer math alone, I'd like to "safely" average two unsigned ints in C++.
What I mean by "safely" is avoiding overflows (and anything else that can be thought of).
For instance, averaging 200 and 5000 is easy:
unsigned int a = 200;
unsigned int b = 5000;
unsigned int average = (a + b) / 2; // Equals: 2600 as intended
But in the case of 4294967295 and 5000 then:
unsigned int a = 4294967295;
unsigned int b = 5000;
unsigned int average = (a + b) / 2; // Equals: 2499 instead of 2147486147
The best I've come up with is:
unsigned int a = 4294967295;
unsigned int b = 5000;
unsigned int average = (a / 2) + (b / 2); // Equals: 2147486147 as expected
Are there better ways?

Your last approach seems promising. You can improve on that by manually considering the lowest bits of a and b:
unsigned int average = (a / 2) + (b / 2) + (a & b & 1);
This gives the correct results in case both a and b are odd.

If you know ahead of time which one is higher, a very efficient way is possible. Otherwise you're better off using one of the other strategies, instead of conditionally swapping to use this.
unsigned int average = low + ((high - low) / 2);
Here's a related article: http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html

Your method is not correct if both numbers are odd eg 5 and 7, average is 6 but your method #3 returns 5.
Try this:
average = (a>>1) + (b>>1) + (a & b & 1)
with math operators only:
average = a/2 + b/2 + (a%2) * (b%2)

And the correct answer is...
(A&B)+((A^B)>>1)

If you don't mind a little x86 inline assembly (GNU C syntax), you can take advantage of supercat's suggestion to use rotate-with-carry after an add to put the high 32 bits of the full 33-bit result into a register.
Of course, you usually should mind using inline-asm, because it defeats some optimizations (https://gcc.gnu.org/wiki/DontUseInlineAsm). But here we go anyway:
// works for 64-bit long as well on x86-64, and doesn't depend on calling convention
unsigned average(unsigned x, unsigned y)
{
unsigned result;
asm("add %[x], %[res]\n\t"
"rcr %[res]"
: [res] "=r" (result) // output
: [y] "%0"(y), // input: in the same reg as results output. Commutative with next operand
[x] "rme"(x) // input: reg, mem, or immediate
: // no clobbers. ("cc" is implicit on x86)
);
return result;
}
The % modifier to tell the compiler the args are commutative doesn't actually help make better asm in the case I tried, calling the function with y being a constant or pointer-deref (memory operand). Probably using a matching constraint for an output operand defeats that, since you can't use it with read-write operands.
As you can see on the Godbolt compiler explorer, this compiles correctly, and so does a version where we change the operands to unsigned long, with the same inline asm. clang3.9 makes a mess of it, though, and decides to use the "m" option for the "rme" constraint, so it stores to memory and uses a memory operand.
RCR-by-one is not too slow, but it's still 3 uops on Skylake, with 2 cycle latency. It's great on AMD CPUs, where RCR has single-cycle latency. (Source: Agner Fog's instruction tables, see also the x86 tag wiki for x86 performance links). It's still better than #sellibitze's version, but worse than #Sheldon's order-dependent version. (See code on Godbolt)
But remember that inline-asm defeats optimizations like constant-propagation, so any pure-C++ version will be better in that case.

What you have is fine, with the minor detail that it will claim that the average of 3 and 3 is 2. I'm guessing that you don't want that; fortunately, there's an easy fix:
unsigned int average = a/2 + b/2 + (a & b & 1);
This just bumps the average back up in the case that both divisions were truncated.

In C++20, you can use std::midpoint:
template <class T>
constexpr T midpoint(T a, T b) noexcept;
The paper P0811R3 that introduced std::midpoint recommended this snippet (slightly adopted to work with C++11):
#include <type_traits>
template <typename Integer>
constexpr Integer midpoint(Integer a, Integer b) noexcept {
using U = std::make_unsigned<Integer>::type;
return a>b ? a-(U(a)-b)/2 : a+(U(b)-a)/2;
}
For completeness, here is the unmodified C++20 implementation from the paper:
constexpr Integer midpoint(Integer a, Integer b) noexcept {
using U = make_unsigned_t<Integer>;
return a>b ? a-(U(a)-b)/2 : a+(U(b)-a)/2;
}

If the code is for an embedded micro, and if speed is critical, assembly language may be helpful. On many microcontrollers, the result of the add would naturally go into the carry flag, and instructions exist to shift it back into a register. On an ARM, the average operation (source and dest. in registers) could be done in two instructions; any C-language equivalent would likely yield at least 5, and probably a fair bit more than that.
Incidentally, on machines with shorter word sizes, the differences can be even more substantial. On an 8-bit PIC-18 series, averaging two 32-bit numbers would take twelve instructions. Doing the shifts, add, and correction, would take 5 instructions for each shift, eight for the add, and eight for the correction, so 26 (not quite a 2.5x difference, but probably more significant in absolute terms).

int[] array = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
decimal avg = 0;
for (int i = 0; i < array.Length; i++){
avg = (array[i] - avg) / (i+1) + avg;
}
expects avg == 5.0 for this test

(((a&b << 1) + (a^b)) >> 1) is also a nice way.
Courtesy: http://www.ragestorm.net/blogs/?p=29

Is there any alternative to using % (modulus) in C/C++?

I read somewhere once that the modulus operator is inefficient on small embedded devices like 8 bit micro-controllers that do not have integer division instruction. Perhaps someone can confirm this but I thought the difference is 5-10 time slower than with an integer division operation.
Is there another way to do this other than keeping a counter variable and manually overflowing to 0 at the mod point?
const int FIZZ = 6;
for(int x = 0; x < MAXCOUNT; x++)
{
if(!(x % FIZZ)) print("Fizz\n"); // slow on some systems
}
vs:
The way I am currently doing it:
const int FIZZ = 6;
int fizzcount = 1;
for(int x = 1; x < MAXCOUNT; x++)
{
if(fizzcount >= FIZZ)
{
print("Fizz\n");
fizzcount = 0;
}
}

Ah, the joys of bitwise arithmetic. A side effect of many division routines is the modulus - so in few cases should division actually be faster than modulus. I'm interested to see the source you got this information from. Processors with multipliers have interesting division routines using the multiplier, but you can get from division result to modulus with just another two steps (multiply and subtract) so it's still comparable. If the processor has a built in division routine you'll likely see it also provides the remainder.
Still, there is a small branch of number theory devoted to Modular Arithmetic which requires study if you really want to understand how to optimize a modulus operation. Modular arithmatic, for instance, is very handy for generating magic squares.
So, in that vein, here's a very low level look at the math of modulus for an example of x, which should show you how simple it can be compared to division:
Maybe a better way to think about the problem is in terms of number
bases and modulo arithmetic. For example, your goal is to compute DOW
mod 7 where DOW is the 16-bit representation of the day of the
week. You can write this as:
DOW = DOW_HI*256 + DOW_LO
DOW%7 = (DOW_HI*256 + DOW_LO) % 7
= ((DOW_HI*256)%7 + (DOW_LO % 7)) %7
= ((DOW_HI%7 * 256%7) + (DOW_LO%7)) %7
= ((DOW_HI%7 * 4) + (DOW_LO%7)) %7
Expressed in this manner, you can separately compute the modulo 7
result for the high and low bytes. Multiply the result for the high by
4 and add it to the low and then finally compute result modulo 7.
Computing the mod 7 result of an 8-bit number can be performed in a
similar fashion. You can write an 8-bit number in octal like so:
X = a*64 + b*8 + c
Where a, b, and c are 3-bit numbers.
X%7 = ((a%7)*(64%7) + (b%7)*(8%7) + c%7) % 7
= (a%7 + b%7 + c%7) % 7
= (a + b + c) % 7
since 64%7 = 8%7 = 1
Of course, a, b, and c are
c = X & 7
b = (X>>3) & 7
a = (X>>6) & 7 // (actually, a is only 2-bits).
The largest possible value for a+b+c is 7+7+3 = 17. So, you'll need
one more octal step. The complete (untested) C version could be
written like:
unsigned char Mod7Byte(unsigned char X)
{
X = (X&7) + ((X>>3)&7) + (X>>6);
X = (X&7) + (X>>3);
return X==7 ? 0 : X;
}
I spent a few moments writing a PIC version. The actual implementation
is slightly different than described above
Mod7Byte:
movwf temp1 ;
andlw 7 ;W=c
movwf temp2 ;temp2=c
rlncf temp1,F ;
swapf temp1,W ;W= a*8+b
andlw 0x1F
addwf temp2,W ;W= a*8+b+c
movwf temp2 ;temp2 is now a 6-bit number
andlw 0x38 ;get the high 3 bits == a'
xorwf temp2,F ;temp2 now has the 3 low bits == b'
rlncf WREG,F ;shift the high bits right 4
swapf WREG,F ;
addwf temp2,W ;W = a' + b'
; at this point, W is between 0 and 10
addlw -7
bc Mod7Byte_L2
Mod7Byte_L1:
addlw 7
Mod7Byte_L2:
return
Here's a liitle routine to test the algorithm
clrf x
clrf count
TestLoop:
movf x,W
RCALL Mod7Byte
cpfseq count
bra fail
incf count,W
xorlw 7
skpz
xorlw 7
movwf count
incfsz x,F
bra TestLoop
passed:
Finally, for the 16-bit result (which I have not tested), you could
write:
uint16 Mod7Word(uint16 X)
{
return Mod7Byte(Mod7Byte(X & 0xff) + Mod7Byte(X>>8)*4);
}
Scott

If you are calculating a number mod some power of two, you can use the bit-wise and operator. Just subtract one from the second number. For example:
x % 8 == x & 7
x % 256 == x & 255
A few caveats:
This only works if the second number is a power of two.
It's only equivalent if the modulus is always positive. The C and C++ standards don't specify the sign of the modulus when the first number is negative (until C++11, which does guarantee it will be negative, which is what most compilers were already doing). A bit-wise and gets rid of the sign bit, so it will always be positive (i.e. it's a true modulus, not a remainder). It sounds like that's what you want anyway though.
Your compiler probably already does this when it can, so in most cases it's not worth doing it manually.

There is an overhead most of the time in using modulo that are not powers of 2.
This is regardless of the processor as (AFAIK) even processors with modulus operators are a few cycles slower for divide as opposed to mask operations.
For most cases this is not an optimisation that is worth considering, and certainly not worth calculating your own shortcut operation (especially if it still involves divide or multiply).
However, one rule of thumb is to select array sizes etc. to be powers of 2.
so if calculating day of week, may as well use %7 regardless
if setting up a circular buffer of around 100 entries... why not make it 128. You can then write % 128 and most (all) compilers will make this & 0x7F

Unless you really need high performance on multiple embedded platforms, don't change how you code for performance reasons until you profile!
Code that's written awkwardly to optimize for performance is hard to debug and hard to maintain. Write a test case, and profile it on your target. Once you know the actual cost of modulus, then decide if the alternate solution is worth coding.

#Matthew is right. Try this:
int main() {
int i;
for(i = 0; i<=1024; i++) {
if (!(i & 0xFF)) printf("& i = %d\n", i);
if (!(i % 0x100)) printf("mod i = %d\n", i);
}
}

x%y == (x-(x/y)*y)
Hope this helps.

Do you have access to any programmable hardware on the embedded device? Like counters and such? If so, you might be able to write a hardware based mod unit, instead of using the simulated %. (I did that once in VHDL. Not sure if I still have the code though.)
Mind you, you did say that division was 5-10 times faster. Have you considered doing a division, multiplication, and subtraction to simulated the mod? (Edit: Misunderstood the original post. I did think it was odd that division was faster than mod, they are the same operation.)
In your specific case, though, you are checking for a mod of 6. 6 = 2*3. So you could MAYBE get some small gains if you first checked if the least significant bit was a 0. Something like:
if((!(x & 1)) && (x % 3))
{
print("Fizz\n");
}
If you do that, though, I'd recommend confirming that you get any gains, yay for profilers. And doing some commenting. I'd feel bad for the next guy who has to look at the code otherwise.

You should really check the embedded device you need. All the assembly language I have seen (x86, 68000) implement the modulus using a division.
Actually, the division assembly operation returns the result of the division and the remaining in two different registers.

In the embedded world, the "modulus" operations you need to do are often the ones that break down nicely into bit operations that you can do with &, | and sometimes >>.

#Jeff V: I see a problem with it! (Beyond that your original code was looking for a mod 6 and now you are essentially looking for a mod 8). You keep doing an extra +1! Hopefully your compiler optimizes that away, but why not just test start at 2 and go to MAXCOUNT inclusive? Finally, you are returning true every time that (x+1) is NOT divisible by 8. Is that what you want? (I assume it is, but just want to confirm.)

For modulo 6 you can change the Python code to C/C++:
def mod6(number):
while number > 7:
number = (number >> 3 << 1) + (number & 0x7)
if number > 5:
number -= 6
return number

Not that this is necessarily better, but you could have an inner loop which always goes up to FIZZ, and an outer loop which repeats it all some certain number of times. You've then perhaps got to special case the final few steps if MAXCOUNT is not evenly divisible by FIZZ.
That said, I'd suggest doing some research and performance profiling on your intended platforms to get a clear idea of the performance constraints you're under. There may be much more productive places to spend your optimisation effort.

The print statement will take orders of magnitude longer than even the slowest implementation of the modulus operator. So basically the comment "slow on some systems" should be "slow on all systems".
Also, the two code snippets provided don't do the same thing. In the second one, the line
if(fizzcount >= FIZZ)
is always false so "FIZZ\n" is never printed.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js