I was wondering if there is a simpler (single) way to calculate the remaining space in a circular buffer than this:
int remaining = (end > start)
? end-start
: bufferSize - start + end;
If you're worried about poorly-predicted conditionals slowing down your CPU's pipeline, you could use this:
int remaining = (end - start) + (-((int) (end <= start)) & bufferSize);
But that's likely to be premature optimisation (unless you have really identified this as a hotspot). Stick with your current technique, which is much more readable.
Hmmm....
int remaining = (end - start + bufferSize) % bufferSize;
13 tokens, do I win?
If your circular buffer size is a power of two, you can do even better by having start and end represent positions in a virtual stream instead of indices into the circular buffer's storage. Assuming that start and end are unsigned, the above becomes:
int remaining = bufferSize - (end - start);
Actually getting elements out of the buffer is a little more complicated, but the overhead is usually small enough with a power of 2 sized circular buffer (just masking with bufferSize - 1) to make all the other logic of your circular buffer much simpler and cleaner. Plus, you get to use all the elements since you no longer worry about end==start!
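A minimal sketch of that scheme (my own illustration, not from the original answer), assuming bufferSize is a power of two and unsigned indices that are allowed to wrap:

#include <cstdint>

struct Ring {
    static const unsigned bufferSize = 256;   // must be a power of two
    uint8_t storage[bufferSize];
    unsigned start = 0, end = 0;              // free-running virtual stream positions

    unsigned used() const      { return end - start; }               // exact even after wrap
    unsigned remaining() const { return bufferSize - (end - start); }

    void push(uint8_t v) { storage[end++ & (bufferSize - 1)] = v; }
    uint8_t pop()        { return storage[start++ & (bufferSize - 1)]; }
};

Because the indices only ever grow (wrapping modulo 2^32 as unsigned), end - start is always the exact element count as long as you never push into a full buffer, and end == start unambiguously means empty.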
According to the C++ Standard, section 5.6, paragraph 4:
The binary / operator yields the quotient, and the binary % operator yields the remainder from the division of the first expression by the second. If the second operand of / or % is zero the behavior is undefined; otherwise (a/b)*b + a%b is equal to a. If both operands are nonnegative then the remainder is nonnegative; if not, the sign of the remainder is implementation-defined.
A footnote suggests that rounding the quotient towards zero is preferred, which would leave the remainder negative.
Therefore, the (end - start) % bufferSize approaches do not work reliably. C++ does not have modular arithmetic (except in the sense offered by unsigned integral types).
The approach recommended by j_random_hacker is different, and looks good, but I don't know that it's any actual improvement in simplicity or speed. The conversion of a boolean to an int is ingenious, but requires mental parsing, and that fiddling could be more expensive than the use of ?:, depending on compiler and machine.
I think you've got the simplest and best version right there, and I wouldn't change it.
Lose the conditional:
int remaining = (end + bufferSize - start - 1) % bufferSize + 1;
Edit: The -1 and +1 are for the case when end == start. In that case, this method will assume the buffer is empty. Depending on the specific implementation of your buffer, you may need to adjust these to avoid an off-by-1 situation.
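For example, with bufferSize = 8 and end == start, the expression gives (8 - 1) % 8 + 1 = 8, i.e. the whole buffer is reported as free, whereas the same formula without the -1/+1 would give 8 % 8 = 0.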
Older thread, I know, but I thought this might be helpful.
Not sure how fast this is in C++, but in RTL we do the following if the size is 2^n:
remaining = (end[n] ^ start[n])
? start[n-1:0] - end[n-1:0]
: end[n-1:0] - start[n-1:0];
or
remaining = if (end[n] ^ start[n]) {
start[n-1:0] - end[n-1:0]
} else {
end[n-1:0] - start[n-1:0]
};
On the DMOJ online judge, used for competitive programming, one of the tips for a faster execution time (C++) was to add this macro on top if the problem only requires unsigned integral data types to be read.
How does this work and what are the advantages and disadvantages of using this?
#define scan(x) do{while((x=getchar())<'0'); for(x-='0'; '0'<=(_=getchar()); x= (x<<3)+(x<<1)+_-'0');}while(0)
char _;
Source: https://dmoj.ca/tips/#cpp-io
First let's reformat this a bit:
#define scan(dest) \
do { \
while((dest = getchar()) < '0'); \
for(dest -= '0'; '0' <= (temp = getchar()); dest = (dest<<3) + (dest<<1) + temp - '0'); \
} while(0)
char temp;
First, the outer do{...}while(0) is just to ensure proper parsing of the macro. See here for more info.
Next, while((dest = getchar()) < '0'); - this might as well just be dest = getchar(), but it does some additional work by discarding any characters below (but not above) the '0' character. This can be useful since whitespace characters are all "less than" the '0' character in ASCII.
The meat of the macro is the for loop. First, the initialization expression, dest -= '0', sets dest to the actual integer value represented by the character, taking advantage of the fact that the '0'-'9' characters in ASCII encoding are adjacent and sequential. So if the first character were '5' (value 53), subtracting '0' (value 48) results in the integer value 5.
The condition expression, '0' <= (temp = getchar()), does several things: first, it gets the next character and assigns it to temp, then checks whether it is greater than or equal to the '0' character (so it will fail on whitespace).
As long as the character is a numeral (or at least equal to '0'), the increment expression is evaluated. In dest = (dest<<3) + (dest<<1) + temp - '0', the temp - '0' expression does the same adjustment as before from ASCII to numeric value, and the shifts and adds are just an obscure way of multiplying by 10. In other words, it is equivalent to temp -= '0'; dest = dest * 10 + temp;. Multiplying by 10 and adding the next digit's value is what builds the final value.
Finally, char temp; declares the temporary character storage for use in subsequent macro invocations in the program.
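Written as a plain function (my own equivalent of the macro, not from the DMOJ page), the logic is easier to see:

#include <cstdio>

// Reads a non-negative integer, skipping anything below '0' first.
// Like the macro, this does no EOF handling: getchar() returning EOF (-1)
// is "less than '0'", so the first loop would spin forever at end of input.
template <typename T>
void scan_fn(T &x) {
    int c;
    while ((c = std::getchar()) < '0');   // skip whitespace and other low characters
    x = c - '0';
    while ((c = std::getchar()) >= '0')   // like the macro, anything >= '0' counts as a digit
        x = x * 10 + (c - '0');
}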
As far as why you'd use it, I'm skeptical that it would provide any measurable benefit compared to something like scanf or atoi.
What this does is read a number character by character, with a bunch of premature optimizations. See @MooseBoys' answer for more details.
About its advantages and disadvantages, I don't see any benefit to using this at all. Stuff like (x<<3)+(x<<1) which is equal to x * 10 are optimizations that should be done by the compiler, not you.
As far as I know, cin and cout are fast enough for all competitive programming purposes, especially if you disable syncing with stdio (see the snippet at the end of this answer). I've been using them since I started competitive programming and have never had any problems.
Also, my own testing shows cin and cout aren't slower than C I/O, despite the popular belief. You can try testing the performance of this yourself; make sure you have optimizations enabled.
Apparently, some competitive programmers focus way too much on stuff like fast I/O when their algorithm is the thing that matters most.
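For reference, the usual way to decouple the C++ streams from stdio (the "disable syncing" mentioned above):

#include <iostream>

int main() {
    std::ios_base::sync_with_stdio(false);  // stop synchronizing C++ streams with C stdio
    std::cin.tie(nullptr);                  // don't flush cout before every cin read
    long long x;
    while (std::cin >> x) {
        // process x
    }
    return 0;
}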
Is there any good way to optimize this function in terms of execution time? My final goal is to parse a long string composed of several integers (thousands of integer per line, and thousands of lines). This was my initial solution.
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int64_t get_next_int(char *newLine) {
    char *token = strtok(newLine, " ");
    if (token == NULL) {
        exit(0);
    }
    return atoll(token);
}
More details: I need the "state"-based behaviour of strtok, so the padding implemented by strtok should exist in the final string. atoll does not need any kind of verification.
Target system: Intel x86_64 (Xeon series)
Related topics:
atoi optimization: C++ most efficient way to convert string to int (faster than atoi)
First off: I find optimizing string conversion routines in signal processing chains to be in vain most of the time. Consider the speed at which your system loads data in string form: it will probably come from some mass storage, where it was put by something that didn't care about performance (otherwise it wouldn't have chosen a string format in the first place). If you compare the read speeds of anything short of clusters of SSDs attached via PCIe with how fast atoll is, you'll notice that you're losing a negligible amount of time on inefficient conversion. And if you pipeline loading parts of that string with conversion, the time spent waiting for storage won't even remotely be filled up with converting, so even without any algorithmic optimization, pipelining/multi-threading will eliminate practically all time spent on conversion.
I'm going to go ahead and assume your integer-containing string is sufficiently large. Like, tens of millions of integers. Otherwise, all optimization might be pretty premature, considering there's little to complain about std::iostream performance.
Now, the trick is that no performance optimization can be done once the performance of your conversion routine hits the memory bandwidth barrier. To push that barrier as far as possible, it's crucial to optimize usage of CPU caches – hence, doing linear access and shuffling memory as little as possible is crucial here. Also, if you care for speed, you don't want to call a function every time you need to convert a few-digit number – the call overhead (saving/restoring stack, jumping back and forth) will be significant. So if you're after performance, you'll do the conversion of the whole string at once, and then just access the resulting integer array.
So you'd have roughly something like, on a modern, SSE4.2 capable x86 processor
Outer loop, jumps in steps of 16:
load 128 bit of input string into 128 bit SIMD register
run something like _mm_cmpestri to find the indices of delimiters and the \0 terminator in all these 16 bytes at once
inner loop over the found indices
Use SSE copy/shift/immediate instructions to isolate substrings; fill the others with 0
prepend saved "last characters" from previous iteration (if any – should only be the case for first inner loop iteration per outer loop iteration)
subtract '0' from each of the digits, again using SSE instructions to do up to 16 subtractions with a single instruction (_mm_sub_epi8)
convert the eight 16-bit subwords to eight 128-bit words containing two packed 64-bit integers each (one instruction per subword, _mm_cvtepi8_epi64, I think)
initialize a __m128i register with [10^15 10^14], let's call it powers
loop over pairs dual-64bit words: (each step should be one SSE instruction)
multiply first with powers
divide powers by [100 100]
multiply second with powers
add results to dual-64bit accumulator
sum the two values in accumulator
store the result to integer array
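Even without SIMD, the "convert the whole string at once, then access the resulting integer array" advice can be sketched in scalar code (my own illustration; the function name is made up):

#include <cstdint>
#include <vector>

// One linear pass over a NUL-terminated buffer: cache-friendly, no per-token call overhead.
std::vector<int64_t> parse_all(const char *p)
{
    std::vector<int64_t> out;
    while (*p) {
        while (*p && (*p < '0' || *p > '9')) ++p;   // skip delimiters
        if (!*p) break;
        int64_t v = 0;
        while (*p >= '0' && *p <= '9')
            v = 10 * v + (*p++ - '0');
        out.push_back(v);
    }
    return out;
}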
I'd rather use something along the lines of a std::istringstream:
#include <cstdint>
#include <cstdlib>
#include <sstream>

int64_t get_next_int(std::istringstream& line) {
    int64_t token;
    if (!(line >> token))
        exit(0);
    return token;
}

// Usage:
std::istringstream line(newLine);
int64_t i = get_next_int(line);
strtok() has well-known drawbacks, and you don't want to use it at all.
What about something like this (wrapped in a function so it compiles; the reference parameter lets the caller resume after the token):

#include <cstdlib>

int get_next_int(const char *&newline)
{
    int n = 0;
    // Find the token
    for ( ; *newline == ' '; newline++)
        ;
    if (*newline == 0)
        // Not found
        exit(0);
    // Scan and convert the token
    for ( ; unsigned(*newline - '0') < 10; newline++)
        n = 10 * n + *newline - '0';
    return n;
}
As far as I can tell from your code, it returns at the first split. At that first parse (before the space character), atoll will return 0 for a non-numeric entry, or for a mixed entry where the alphabetic part comes first; if the number comes first, it will return just the number. In other words, you just need a string for the conversion, so you don't need tokenizing; just check whether the string is null. You can change the return type as well: if you need a type with _exactly_ 64 bits, use (u)int64_t; if you need _at least_ 64 bits, (unsigned) long long is perfectly fine, as would be (u)int_least64_t. I think your code is a little convoluted; show exactly what you want without simplification.
/*
* ascii-to-longlong conversion
*
* no error checking; assumes decimal digits
*
* efficient conversion:
* start with value = 0
* then, starting at first character, repeat the following
* until the end of the string:
*
* new value = (10 * (old value)) + decimal value of next character
*
*/
long long my_atoll(const char *instr)
{
    if (instr[0] == '\0')
        return -1;
    long long retval = 0;
    for (; *instr; instr++) {
        retval = 10*retval + (*instr - '0');
    }
    return retval;
}
My goal is as the following,
Generate successive values, such that each new one was never generated before, until all possible values are generated. At this point, the counter starts the same sequence again. The main point here is that all possible values are generated without repetition (until the period is exhausted). It does not matter if the sequence is simply 0, 1, 2, 3, ..., or in some other order.
For example, if the range can be represented simply by an unsigned, then
void increment (unsigned &n) {++n;}
is enough. However, the integer range is larger than 64 bits. For example, in one place I need to generate a 256-bit sequence. A simple implementation is like the following, just to illustrate what I am trying to do:
#include <array>
#include <cstdint>

typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
if (ctr[0] < max) {++ctr[0]; return;}
if (ctr[1] < max) {++ctr[1]; return;}
if (ctr[2] < max) {++ctr[2]; return;}
if (ctr[3] < max) {++ctr[3]; return;}
ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}
So if ctr starts with all zeros, first ctr[0] is increased one by one until it reaches max, and then ctr[1], and so on. If all 256 bits are set, we reset it to all zeros and start again.
The problem is that such an implementation is surprisingly slow. My current improved version is sort of equivalent to the following:
void increment (ctr_type &ctr)
{
std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]));
if (k < 4)
++ctr[k];
else
memset(ctr.data(), 0, 32);
}
If the counter is only manipulated with the above increment function, and always start with zero, then ctr[k] == 0 if ctr[k - 1] == 0. And thus the value k will be the index of the first element that is less than the maximum.
I expected the first to be faster, since branch misprediction should happen only once in every 2^64 iterations. In the second, though misprediction happens only once every 2^256 iterations, that should make little difference; and apart from the branching, it needs four bitwise negations, four boolean negations, and three additions, which might cost much more than the first.
However, clang, gcc, and Intel icpc all generate binaries in which the second is much faster.
My main question is whether anyone knows a faster way to implement such a counter. It does not matter whether the counter starts by increasing the first integer, or whether it is implemented as an array of integers at all, as long as the algorithm generates all 2^256 combinations of 256 bits.
What makes things more complicated, I also need non-uniform increments. For example, each time the counter is incremented by K where K > 1, but K almost always remains a constant. My current implementation is similar to the above.
To provide some more context, one place I am using the counters is as input to AES-NI aesenc instructions. So each distinct 128-bit integer (loaded into __m128i), after going through 10 (or 12 or 14, depending on the key size) rounds of the instruction, yields a distinct 128-bit integer. If I generate one __m128i integer at a time, then the cost of the increment matters little. However, since aesenc has quite a bit of latency, I generate integers in blocks. For example, I might have 4 blocks, ctr_type block[4], initialized equivalent to the following:
block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);
And each time I need new output, I increment each block[i] by 4, and generate 4 __m128i outputs at once. By interleaving instructions, overall I was able to increase the throughput, and reduce the cycles per byte of output (cpB) from 6 to 0.9 when using two 64-bit integers as the counter and 8 blocks. However, if I instead use four 32-bit integers as the counter, the throughput, measured as bytes per second, is reduced to half. I know for a fact that on x86-64, 64-bit integers can be faster than 32-bit in some situations, but I did not expect such a simple increment operation to make such a big difference. I have carefully benchmarked the application, and the increment is indeed what slows down the program. Since loading into __m128i and storing the __m128i output into usable 32-bit or 64-bit integers are done through aligned pointers, the only difference between the 32-bit and 64-bit versions is how the counter is incremented. I expected that the AES-NI instructions, after loading the integers into __m128i, would dominate the performance, but when using 4 or 8 blocks that was clearly not the case.
So to summarize, my main question is whether anyone knows a way to improve the above counter implementation.
It's not only slow, but impossible: the total energy of the universe is insufficient for 2^256 bit changes, and even that estimate assumes a Gray counter, which flips only one bit per step.
The next thing to do, before any optimization, is to fix the original implementation:
void increment (ctr_type &ctr)
{
if (++ctr[0] != 0) return;
if (++ctr[1] != 0) return;
if (++ctr[2] != 0) return;
++ctr[3];
}
If each ctr[i] is not allowed to overflow to zero, the period would be just 4*(2^64) instead of 2^256, analogous to a decimal counter going 0-9, then 19, 29, 39, ..., 99, then 199, 299, ..., 999, and then 1999, 2999, 3999, ..., 9999.
As a reply to the comment: it takes 2^64 iterations to have the first overflow. Being generous, up to 2^32 iterations could take place in a second, meaning that the program would have to run 2^32 seconds to see the first carry out. That's about 136 years.
EDIT
If the original implementation with 2^66 states is really what is wanted, then I'd suggest to change the interface and the functionality to something like:
(*counter) += 1;
while (*counter == 0)       // this word overflowed: propagate the carry
{
    counter++;              // Move to next word
    if (counter > tail_of_array) {
        counter = head_of_array;
        memset(counter, 0, 16);
        break;
    }
    (*counter) += 1;
}
The point being, that the overflow is still very infrequent. Almost always there's just one word to be incremented.
If you're using GCC or compilers with __int128 like Clang or ICC
unsigned __int128 H = 0, L = 0;
L++;
if (L == 0) H++;
On systems where __int128 isn't available
std::array<uint64_t, 4> c{};
c[0]++;
if (c[0] == 0)
{
c[1]++;
if (c[1] == 0)
{
c[2]++;
if (c[2] == 0)
{
c[3]++;
}
}
}
In inline assembly it's much easier to do this using the carry flag. Unfortunately most high-level languages don't have means to access it directly. Some compilers do have intrinsics for adding with carry, like __builtin_uaddll_overflow in GCC and __builtin_addcll in Clang.
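For instance, a carry-propagating 256-bit increment with the overflow builtin might look like this (a sketch, not tested against the question's benchmark; I use unsigned long long words to match the builtin's signature):

#include <array>
#include <cstddef>

using ctr_type = std::array<unsigned long long, 4>;

void increment(ctr_type &ctr)
{
    // Add 1 to the low word, then propagate the carry only while it is set.
    bool carry = __builtin_uaddll_overflow(ctr[0], 1ULL, &ctr[0]);
    for (std::size_t i = 1; carry && i < ctr.size(); ++i)
        carry = __builtin_uaddll_overflow(ctr[i], 1ULL, &ctr[i]);
}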
Anyway, this is rather a waste of time, since the total number of particles in the universe is only about 10^80 and you cannot even count up a 64-bit counter in your lifetime.
Neither of your counter versions increment correctly. Instead of counting up to UINT256_MAX, you are actually just counting up to UINT64_MAX 4 times and then starting back at 0 again. This is apparent from the fact that you do not bother to clear any of the indices that has reached the max value until all of them have reached the max value. If you are measuring performance based on how often the counter reaches all bits 0, then this is why. Thus your algorithms do not generate all combinations of 256 bits, which is a stated requirement.
You mention "Generate successive values, such that each new one was never generated before"
To generate a set of such values, look at linear congruential generators
the sequence x = (x*1 + 1) % power_of_2 is the one you already thought of: these are simply sequential numbers.
the sequence x = (x*13 + 137) % power_of_2 generates unique numbers with a predictable full period (power_of_2), and the numbers look more "random", kind of pseudo-random. You need to resort to arbitrary-precision arithmetic to get it working at 256 bits, and also all the trickery of multiplication by constants. This will get you a nice way to start.
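For illustration, at 64 bits (where no arbitrary-precision arithmetic is needed) such a generator is one line; the constants 13 and 137 satisfy the classic full-period conditions for a power-of-two modulus (odd increment, multiplier congruent to 1 mod 4):

#include <cstdint>

// Visits every uint64_t value exactly once before repeating (period 2^64).
uint64_t next(uint64_t x) {
    return x * 13u + 137u;   // the % 2^64 is implicit in unsigned overflow
}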
You also complain that your simple code is "slow"
At a 4.2 GHz frequency, running 4 instructions per cycle and using AVX-512 vectorization, on a 64-core computer with a multithreaded version of your program doing nothing else than increments, you get only 64 * 8 * 4 * 2^32 = 8796093022208 increments per second; that is, 2^64 increments would be reached in 25 days. This post is old; you might have reached 841632698362998292480 by now, running such a program on such a machine, and you will gloriously reach 1683265396725996584960 in 2 years' time.
You also require "until all possible values are generated".
You can only generate a finite number of values, depending how much you are willing to pay for the energy to power your computers. As mentioned in the other responses, with 128 or 256-bit numbers, even being the richest man in the world, you will never wrap around before the first of these conditions occurs:
getting out of money
end of humankind (nobody will get the outcome of your software)
burning the energy from the last particles of the universe
Multi-word addition can easily be accomplished in portable fashion by using three macros that mimic three types of addition instructions found on many processors:
ADDcc adds two words, and sets the carry if there was unsigned overflow
ADDC adds two words plus carry (from a previous addition)
ADDCcc adds two words plus carry, and sets the carry if there was unsigned overflow
A multi-word addition of two words uses ADDcc on the least significant words followed by ADDC on the most significant words. A multi-word addition with more than two words forms the sequence ADDcc, ADDCcc, ..., ADDC. The MIPS architecture is a processor architecture without condition codes and therefore without a carry flag. The macro implementations shown below basically follow the techniques used on MIPS processors for multi-word additions.
The ISO-C99 code below shows the operation of a 32-bit counter and a 64-bit counter based on 16-bit "words". I chose arrays as the underlying data structure, but one might also use struct, for example. Use of a struct will be significantly faster if each operand only comprises a few words, as the overhead of array indexing is eliminated. One would want to use the widest available integer type for each "word" for best performance. In the example from the question that would likely be a 256-bit counter comprising four uint64_t components.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
typedef uint16_t T;
/* increment a multi-word counter comprising n words */
void inc_array (T *counter, const T *increment, int n)
{
T cy, t0, t1;
counter [0] = ADDcc (counter [0], increment [0], cy, t0, t1);
for (int i = 1; i < (n - 1); i++) {
counter [i] = ADDCcc (counter [i], increment [i], cy, t0, t1);
}
counter [n-1] = ADDC (counter [n-1], increment [n-1], cy, t0, t1);
}
#define INCREMENT (10)
#define UINT32_ARRAY_LEN (2)
#define UINT64_ARRAY_LEN (4)
int main (void)
{
uint32_t count32 = 0, incr32 = INCREMENT;
T count_arr2 [UINT32_ARRAY_LEN] = {0};
T incr_arr2 [UINT32_ARRAY_LEN] = {INCREMENT};
do {
count32 = count32 + incr32;
inc_array (count_arr2, incr_arr2, UINT32_ARRAY_LEN);
} while (count32 < (0U - INCREMENT - 1));
printf ("count32 = %08x arr_count = %08x\n",
count32, (((uint32_t)count_arr2 [1] << 16) +
((uint32_t)count_arr2 [0] << 0)));
uint64_t count64 = 0, incr64 = INCREMENT;
T count_arr4 [UINT64_ARRAY_LEN] = {0};
T incr_arr4 [UINT64_ARRAY_LEN] = {INCREMENT};
do {
count64 = count64 + incr64;
inc_array (count_arr4, incr_arr4, UINT64_ARRAY_LEN);
} while (count64 < 0xa987654321ULL);
printf ("count64 = %016llx arr_count = %016llx\n",
count64, (((uint64_t)count_arr4 [3] << 48) +
((uint64_t)count_arr4 [2] << 32) +
((uint64_t)count_arr4 [1] << 16) +
((uint64_t)count_arr4 [0] << 0)));
return EXIT_SUCCESS;
}
Compiled with full optimization, the 32-bit example executes in about a second, while the 64-bit example runs for about a minute on a modern PC. The output of the program should look like so:
count32 = fffffffa arr_count = fffffffa
count64 = 000000a987654326 arr_count = 000000a987654326
Non-portable code that is based on inline assembly or proprietary extensions for wide integer types may execute about two to three times as fast as the portable solution presented here.
I am trying to determine the next and previous even number with bitwise operations.
So for example for the next function:
x nextEven(x)
1 2
2 2
3 4
4 4
and for the previous:
x previousEven(x)
1 0
2 2
3 2
4 4
I had the idea, for the nextEven function, of something like: value = ((value+1)>>1)<<1;
And for the previousEven function, something like: value = ((value)>>1)<<1;
Is there a better approach, without comparing and seeing whether the values are even or odd?
Thank you.
Doing a right shift followed by a left shift to clear the LSB isn't very efficient.
I'd use something like:
previous: value &= ~1;
next: value = (value +1) & ~1;
The ~1 can (and normally will) be pre-computed at compile time, so the previous will end up as a single bitwise operation at run-time. The next will probably end up as two operations (increment, AND), but should still be quite fast.
About the best you can hope for from the shifts is that the compiler will recognize that you're just clearing the LSB, and optimize it to about what you'd expect this to produce anyway.
you could do something like this
for previous even
unsigned prevev(unsigned x)
{
return x-(x%2);//bitwise counterpart x-(x&1);
}
for next even
unsigned nxtev(unsigned x)
{
return (x%2)+x; //bitwise counterpart x+(x&1);
}
Say you're using unsigned ints, previous even (matching your values - we could argue about whether previous even of 2 should be 0 etc) is simply x & ~1u. Next even is previous even of x + 1.
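As functions, that is simply (a direct transcription of the above):

unsigned prevEven(unsigned x) { return x & ~1u; }
unsigned nextEven(unsigned x) { return (x + 1) & ~1u; }  // previous even of x + 1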
Tricks like Duff's Device, or swapping two variables with XOR, or working out next and previous even number with bitwise operations seem clever, but they rarely are.
The best thing you can do as a developer is to optimise for readability first and only tackle performance once you've identified a specific bottleneck that is causing real problems.
The best code for getting the previous even number (by your definition where the previous even number of 2 is 2) is simply writing something like:
if ((num % 2) == 1) num--; // num++ for next.
or (slightly more advanced):
num -= num % 2; // += for next.
and letting the insane optimising compilers figure out the best underlying code.
Unless you need to do these operations billions of times per second, readability should always be your prime concern.
Previous even number:
For previous even number I prefer Jerry Coffin's answer
// Get previous even number
unsigned prevEven(unsigned no)
{
return (no & ~1);
}
Next even number:
I tried to use only bitwise operators, but I still use one unary minus (-) operator to get the next number.
// Get next even number
unsigned nextEven(unsigned no)
{
return (no & 1) ? (-(~no)) : no ;
}
How nextEven() works:
If the number is even, its LSB is 0; otherwise it is 1. Get the LSB with number & 1.
If the number is even, return the same number.
If the number is odd, return number + 1, computed as -(~number): in two's complement, ~number == -(number + 1), so -(~number) == number + 1. For example, ~5 is -6, and -(-6) is 6.
unsigned int previous(unsigned int x)
{
return x & 0xfffffffe;
}
unsigned int next(unsigned int x)
{
return previous(x + 2);
}
I read somewhere once that the modulus operator is inefficient on small embedded devices like 8-bit microcontrollers that do not have an integer division instruction. Perhaps someone can confirm this, but I thought it was 5-10 times slower than an integer division operation.
Is there another way to do this other than keeping a counter variable and manually overflowing to 0 at the mod point?
const int FIZZ = 6;
for(int x = 0; x < MAXCOUNT; x++)
{
if(!(x % FIZZ)) print("Fizz\n"); // slow on some systems
}
vs:
The way I am currently doing it:
const int FIZZ = 6;
int fizzcount = 1;
for(int x = 1; x < MAXCOUNT; x++)
{
if(fizzcount >= FIZZ)
{
print("Fizz\n");
fizzcount = 0;
}
}
Ah, the joys of bitwise arithmetic. A side effect of many division routines is the modulus - so in few cases should division actually be faster than modulus. I'm interested to see the source you got this information from. Processors with multipliers have interesting division routines using the multiplier, but you can get from division result to modulus with just another two steps (multiply and subtract) so it's still comparable. If the processor has a built in division routine you'll likely see it also provides the remainder.
Still, there is a small branch of number theory devoted to modular arithmetic which requires study if you really want to understand how to optimize a modulus operation. Modular arithmetic, for instance, is very handy for generating magic squares.
So, in that vein, here's a very low-level look at the math of modulus for an example x, which should show you how simple it can be compared to division:
Maybe a better way to think about the problem is in terms of number
bases and modulo arithmetic. For example, your goal is to compute DOW
mod 7 where DOW is the 16-bit representation of the day of the
week. You can write this as:
DOW = DOW_HI*256 + DOW_LO
DOW%7 = (DOW_HI*256 + DOW_LO) % 7
= ((DOW_HI*256)%7 + (DOW_LO % 7)) %7
= ((DOW_HI%7 * 256%7) + (DOW_LO%7)) %7
= ((DOW_HI%7 * 4) + (DOW_LO%7)) %7
Expressed in this manner, you can separately compute the modulo 7
result for the high and low bytes. Multiply the result for the high by
4 and add it to the low and then finally compute result modulo 7.
Computing the mod 7 result of an 8-bit number can be performed in a
similar fashion. You can write an 8-bit number in octal like so:
X = a*64 + b*8 + c
Where a, b, and c are 3-bit numbers.
X%7 = ((a%7)*(64%7) + (b%7)*(8%7) + c%7) % 7
= (a%7 + b%7 + c%7) % 7
= (a + b + c) % 7
since 64%7 = 8%7 = 1
Of course, a, b, and c are
c = X & 7
b = (X>>3) & 7
a = (X>>6) & 7 // (actually, a is only 2-bits).
The largest possible value for a+b+c is 7+7+3 = 17. So, you'll need
one more octal step. The complete (untested) C version could be
written like:
unsigned char Mod7Byte(unsigned char X)
{
    X = (X&7) + ((X>>3)&7) + (X>>6);
    X = (X&7) + (X>>3);
    return X >= 7 ? X - 7 : X;  /* the two folds can leave a value as large as 8
                                   (e.g. X = 127), so reduce anything >= 7 */
}
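A quick exhaustive check of the C version (my own test harness, not part of the original post):

#include <stdio.h>

unsigned char Mod7Byte(unsigned char X);  /* the function above */

int main(void)
{
    unsigned x;
    for (x = 0; x < 256; x++)
        if (Mod7Byte((unsigned char)x) != x % 7)
            printf("mismatch at %u\n", x);
    return 0;
}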
I spent a few moments writing a PIC version. The actual implementation
is slightly different than described above
Mod7Byte:
movwf temp1 ;
andlw 7 ;W=c
movwf temp2 ;temp2=c
rlncf temp1,F ;
swapf temp1,W ;W= a*8+b
andlw 0x1F
addwf temp2,W ;W= a*8+b+c
movwf temp2 ;temp2 is now a 6-bit number
andlw 0x38 ;get the high 3 bits == a'
xorwf temp2,F ;temp2 now has the 3 low bits == b'
rlncf WREG,F ;shift the high bits right 4
swapf WREG,F ;
addwf temp2,W ;W = a' + b'
; at this point, W is between 0 and 10
addlw -7
bc Mod7Byte_L2
Mod7Byte_L1:
addlw 7
Mod7Byte_L2:
return
Here's a little routine to test the algorithm:
clrf x
clrf count
TestLoop:
movf x,W
RCALL Mod7Byte
cpfseq count
bra fail
incf count,W
xorlw 7
skpz
xorlw 7
movwf count
incfsz x,F
bra TestLoop
passed:
Finally, for the 16-bit result (which I have not tested), you could
write:
uint16 Mod7Word(uint16 X)
{
return Mod7Byte(Mod7Byte(X & 0xff) + Mod7Byte(X>>8)*4);
}
Scott
If you are calculating a number mod some power of two, you can use the bit-wise and operator. Just subtract one from the second number. For example:
x % 8 == x & 7
x % 256 == x & 255
A few caveats:
This only works if the second number is a power of two.
It's only equivalent if the modulus is always positive. The C and C++ standards don't specify the sign of the modulus when the first number is negative (until C++11, which guarantees it will be negative, matching what most compilers were already doing). A bitwise AND gets rid of the sign bit, so the result is always positive (i.e. it's a true modulus, not a remainder); see the example after this list. It sounds like that's what you want anyway, though.
Your compiler probably already does this when it can, so in most cases it's not worth doing it manually.
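A concrete example of the sign caveat (my own):

#include <stdio.h>

int main(void)
{
    printf("%d %d\n", -9 % 8, -9 & 7);  /* typically prints "-1 7": & gives the true modulus */
    return 0;
}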
There is an overhead most of the time in using a modulus that is not a power of 2.
This is regardless of the processor: (AFAIK) even on processors with a hardware modulus operator, a divide is a few cycles slower than a mask operation.
For most cases this is not an optimisation that is worth considering, and certainly not worth calculating your own shortcut operation (especially if it still involves divide or multiply).
However, one rule of thumb is to select array sizes etc. to be powers of 2.
so if calculating day of week, may as well use %7 regardless
if setting up a circular buffer of around 100 entries, why not make it 128? You can then write % 128, and most (all) compilers will turn this into & 0x7F
Unless you really need high performance on multiple embedded platforms, don't change how you code for performance reasons until you profile!
Code that's written awkwardly to optimize for performance is hard to debug and hard to maintain. Write a test case, and profile it on your target. Once you know the actual cost of modulus, then decide if the alternate solution is worth coding.
@Matthew is right. Try this:
#include <stdio.h>

int main() {
    int i;
    for (i = 0; i <= 1024; i++) {
        if (!(i & 0xFF))  printf("&   i = %d\n", i);
        if (!(i % 0x100)) printf("mod i = %d\n", i);
    }
    return 0;
}
x%y == (x-(x/y)*y)
Hope this helps.
Do you have access to any programmable hardware on the embedded device? Like counters and such? If so, you might be able to write a hardware based mod unit, instead of using the simulated %. (I did that once in VHDL. Not sure if I still have the code though.)
Mind you, you did say that division was 5-10 times faster. Have you considered doing a division, multiplication, and subtraction to simulate the mod? (Edit: I misunderstood the original post. I did think it was odd that division was faster than mod; they are the same operation.)
In your specific case, though, you are checking for a mod of 6. 6 = 2*3. So you could MAYBE get some small gains if you first checked if the least significant bit was a 0. Something like:
if (!(x & 1) && !(x % 3)) // even AND divisible by 3, i.e. divisible by 6
{
print("Fizz\n");
}
If you do that, though, I'd recommend confirming that you get any gains, yay for profilers. And doing some commenting. I'd feel bad for the next guy who has to look at the code otherwise.
You should really check the embedded device you need to support. All the assembly languages I have seen (x86, 68000) implement the modulus using a division.
Actually, the division assembly operation returns the result of the division and the remainder in two different registers.
In the embedded world, the "modulus" operations you need to do are often the ones that break down nicely into bit operations that you can do with &, | and sometimes >>.
@Jeff V: I see a problem with it! (Beyond that your original code was looking for a mod 6 and now you are essentially looking for a mod 8.) You keep doing an extra +1! Hopefully your compiler optimizes that away, but why not just test starting at 2 and going to MAXCOUNT inclusive? Finally, you are returning true every time that (x+1) is NOT divisible by 8. Is that what you want? (I assume it is, but just want to confirm.)
For modulo 6 you can change the Python code to C/C++:
def mod6(number):
while number > 7:
number = (number >> 3 << 1) + (number & 0x7)
if number > 5:
number -= 6
return number
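A direct C/C++ transcription (mine) of that Python:

unsigned mod6(unsigned number)
{
    while (number > 7)
        number = (number >> 3 << 1) + (number & 0x7);  /* 8 == 2 (mod 6), so 8q+r -> 2q+r */
    if (number > 5)
        number -= 6;
    return number;
}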
Not that this is necessarily better, but you could have an inner loop which always goes up to FIZZ, and an outer loop which repeats it all some certain number of times. You've then perhaps got to special case the final few steps if MAXCOUNT is not evenly divisible by FIZZ.
That said, I'd suggest doing some research and performance profiling on your intended platforms to get a clear idea of the performance constraints you're under. There may be much more productive places to spend your optimisation effort.
The print statement will take orders of magnitude longer than even the slowest implementation of the modulus operator. So basically the comment "slow on some systems" should be "slow on all systems".
Also, the two code snippets provided don't do the same thing. In the second one, the line
if(fizzcount >= FIZZ)
is always false (fizzcount is never incremented), so "Fizz\n" is never printed.