Timing difference when using `%` vs `&` - C++

In trying to determine the cache size for a given CPU, I tried to time access to memory/cache like this:
lengthMod = sizes[i]/sizeof(int) - 1; // where sizes[i] is something like 1024, 2048 ...
for (unsigned int k = 0; k < REPS; k++) {
    data[(k * 16) & lengthMod]++;
}
The timings I measure look like this (first column is the size, second the time):
1, 0.52
4, 0.52
8, 0.52
16, 0.52
32, 0.52
64, 1.11 // << note the jump in timing. L1 cache size is 32K
128, 1.12
256, 1.19
So I think that if lengthMod is not a power of 2, I can't do this. So I tried:
lengthMod = sizes[i]/sizeof(int);
for (unsigned int k = 0; k < REPS; k++) {
    data[(k * 16) % lengthMod]++;
}
These are the timings I get now:
1, 2.67
4, 2.57
8, 2.55
16, 2.51
32, 2.42
64, 2.42 // << no jump anymore ...
128, 2.42
256, 2.42
Then I find that the timing increase I expected is non-existent. I expected the time to increase, but shouldn't that apply to all values? That is, if it takes x seconds when using &, I'd expect roughly x+c seconds when using % (where c is approximately constant). But that's not the case; in fact, the timing difference between the sizes disappears. Why is that?

What you're seeing is a trade-off of bottlenecks.
In the first example, you are bottlenecked by your cache bandwidth.
In the second example, you are bottlenecked by integer division.
Before we continue, let's look at the difference between the two examples:
In the first case, you use & which is a fast bitwise operation.
In the second case, you use % which is very slow division.
Divisions are very slow. Modern compilers will try to optimize them when the divisor/modulus is a compile-time constant.
But that's not the case here, so you pay the full cost of a hardware division. This is why the times in your second example are much higher than in the first.
With the &, the code is fast enough to max out the cache bandwidth. However, with %, the code is much slower - not fast enough to keep up with the cache. So you see the same times all the way up.
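As a sketch of a middle ground (my own illustration, not from the answers; it reuses data, REPS and lengthMod = sizes[i]/sizeof(int) from the question): keep the cheap & whenever the size is a power of two, and only fall back to % for the other sizes. The power-of-two test could also be hoisted out of the loop.
for (unsigned int k = 0; k < REPS; k++) {
    unsigned int index;
    if ((lengthMod & (lengthMod - 1)) == 0)        // lengthMod is a power of two
        index = (k * 16) & (lengthMod - 1);        // cheap bitwise AND, same result as %
    else
        index = (k * 16) % lengthMod;              // other sizes still pay for the division
    data[index]++;
}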

Looks like in the second case the compiler generates less optimized code (because of the division with remainder).

Related

INTEL SIMD: why is inplace multiplication so slow?

I have written some vector methods that do simple math either in place or into a separate destination, and they share the same penalty for the in-place variant.
The simplest can be boiled down to something like this:
#include <xmmintrin.h>

void scale(float* dst, const float* src, int count, float factor)
{
    __m128 factorV = _mm_set1_ps(factor);   // broadcast the scale factor
    for (int i = 0; i < count; i += 4)
    {
        __m128 in = _mm_load_ps(src);       // aligned load of 4 floats
        in = _mm_mul_ps(in, factorV);
        _mm_store_ps(dst, in);              // aligned store of 4 floats
        dst += 4;
        src += 4;
    }
}
testing code:
for(int i = 0; i < 1000000; i++)
{
scale(alignedMemPtrDst, alignedMemPtrSrc, 256, randomFloatAbsRange1);
}
When testing, i.e. repeatedly running this function on the SAME buffers, I found that if dst and src are different buffers it is about a factor of 70 faster than when they are the same buffer. The main cycles are burned on the write (i.e. _mm_store_ps).
Interestingly, the same behaviour does not hold for addition, i.e. += works nicely; only *= is a problem.
--
This has been answered in the comments. It's denormals during artificial testing.
Does your factor produce a subnormal result? Non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to require slow FP assists.
(Turns out, yes that was the problem for the OP).
Repeated in-place multiply makes the numbers smaller and smaller with a factor below 1.0. Copy-and-scale to a different buffer uses the same inputs every time.
It doesn't take extra time to produce a +-Inf or NaN result, but gradual underflow to a subnormal does, on Intel CPUs at least. That's one reason -ffast-math sets DAZ/FTZ - flush-to-zero on underflow.
I think I've read that AMD doesn't have FP-assist microcoded handling of subnormals, but Intel does.
There's a performance counter on Intel CPUs for fp_assist.any which counts when a sub-normal result requires extra microcode uops to handle the special case. (I think it's as intrusive as that for the front-end and OoO exec. It's definitely slow, though.)
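For completeness, here is a minimal sketch (my own, not from the original answer) of setting FTZ/DAZ by hand with the standard SSE intrinsics, which avoids the FP assists in an artificial test like this without needing -ffast-math:
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

int main()
{
    // Flush subnormal results to zero and treat subnormal inputs as zero,
    // the same MXCSR bits that the fast-math startup code sets.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    // ... run the scale() benchmark from the question here ...
    return 0;
}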
Why denormalized floats are so much slower than other floats, from hardware architecture viewpoint?
Why is icc generating weird assembly for a simple main? (shows how ICC sets FTZ/DAZ at the start of main, with its default fast-math setting.)

Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better)
uint8_t MyArray[10000000];
when the value at any position in the array is
0 or 1 for 95% of all cases,
2 in 4% of cases,
between 3 and 255 in the other 1% of cases?
So, is there anything better than a uint8_t array to use for this? It should be as quick as possible to loop over the whole array in a random order, and this is very heavy on RAM bandwidth, so when having more than a few threads doing that at the same time for different arrays, currently the whole RAM bandwidth is quickly saturated.
I'm asking since it feels very inefficient to have such a big array (10 MB) when it's actually known that almost all values, apart from 5%, will be either 0 or 1. So when 95% of all values in the array would only actually need 1 bit instead of 8 bit, this would reduce memory usage by almost an order of magnitude.
It feels like there has to be a more memory efficient solution that would greatly reduce RAM bandwidth required for this, and as a result also be significantly quicker for random access.
A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separate sorted array of 4 bytes per value (24 bits for the original element index, 8 bits for the actual value, so (idx << 8) | value) for the other ones.
When you look up a value, you first do a lookup in the 2bpp array (O(1)); if you find 0, 1 or 2 it's the value you want; if you find 3 it means that you have to look it up in the secondary array. Here you'll perform a binary search for the index of your interest left-shifted by 8 (O(log(n)), with a small n, as this should be the 1%), and extract the value from the 4-byte thingie.
std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;
uint8_t lookup(unsigned idx) {
// extract the 2 bits of our interest from the main array
uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
// usual (likely) case: value between 0 and 2
if(v != 3) return v;
// bad case: lookup the index<<8 in the secondary array
// lower_bound finds the first >=, so we don't need to mask out the value
auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
// some coherency checks
if(ptr == sec_arr.end()) std::abort();
if((*ptr >> 8) != idx) std::abort();
#endif
// extract our 8-bit value from the 32 bit (index, value) thingie
return (*ptr) & 0xff;
}
void populate(uint8_t *source, size_t size) {
main_arr.clear(); sec_arr.clear();
// size the main storage (round up)
main_arr.resize((size+3)/4);
for(size_t idx = 0; idx < size; ++idx) {
uint8_t in = source[idx];
uint8_t &target = main_arr[idx>>2];
// if the input doesn't fit, cap to 3 and put in secondary storage
if(in >= 3) {
// top 24 bits: index; low 8 bit: value
sec_arr.push_back((idx << 8) | in);
in = 3;
}
// store in the target according to the position
target |= in << ((idx & 3)*2);
}
}
For an array such as the one you proposed, this should take 10000000 / 4 = 2500000 bytes for the first array, plus 10000000 * 1% * 4 B = 400000 bytes for the second array; hence 2900000 bytes, i.e. less than one third of the original array, and the most used portion is all kept together in memory, which should be good for caching (it may even fit L3).
If you need more than 24-bit addressing, you'll have to tweak the "secondary storage"; a trivial way to extend it is to have a 256 element pointer array to switch over the top 8 bits of the index and forward to a 24-bit indexed sorted array as above.
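A rough sketch of that extension (my own illustration with made-up names, building on main_arr from the code above): the top 8 bits of the index pick one of 256 secondary arrays, and the remaining 24 bits are looked up exactly as before.
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

extern std::vector<uint8_t> main_arr;              // the 2-bit-per-value array from above
std::array<std::vector<uint32_t>, 256> sec_banks;  // per-bank sorted (idx24 << 8) | value entries

uint8_t lookup_wide(uint32_t idx) {
    uint8_t v = (main_arr[idx >> 2] >> (2 * (idx & 3))) & 3;
    if (v != 3) return v;                          // common case: 0, 1 or 2
    const auto &bank = sec_banks[idx >> 24];       // top 8 bits select the bank
    uint32_t key = (idx & 0xffffff) << 8;          // low 24 bits, shifted past the value byte
    auto ptr = std::lower_bound(bank.begin(), bank.end(), key);
    return uint8_t(*ptr & 0xff);
}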
Quick benchmark
#include <algorithm>
#include <vector>
#include <stdint.h>
#include <chrono>
#include <stdio.h>
#include <math.h>
using namespace std::chrono;
/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
/// than an LCG but fails some test suites
struct XorShift32 {
/// This stuff allows to use this class wherever a library function
/// requires a UniformRandomBitGenerator (e.g. std::shuffle)
typedef uint32_t result_type;
static uint32_t min() { return 1; }
static uint32_t max() { return uint32_t(-1); }
/// PRNG state
uint32_t y;
/// Initializes with seed
XorShift32(uint32_t seed = 0) : y(seed) {
if(y == 0) y = 2463534242UL;
}
/// Returns a value in the range [1, 1<<32)
uint32_t operator()() {
y ^= (y<<13);
y ^= (y>>17);
y ^= (y<<15);
return y;
}
/// Returns a value in the range [0, limit); this conforms to the RandomFunc
/// requirements for std::random_shuffle
uint32_t operator()(uint32_t limit) {
return (*this)()%limit;
}
};
struct mean_variance {
double rmean = 0.;
double rvariance = 0.;
int count = 0;
void operator()(double x) {
++count;
double ormean = rmean;
rmean += (x-rmean)/count;
rvariance += (x-ormean)*(x-rmean);
}
double mean() const { return rmean; }
double variance() const { return rvariance/(count-1); }
double stddev() const { return std::sqrt(variance()); }
};
std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;
uint8_t lookup(unsigned idx) {
// extract the 2 bits of our interest from the main array
uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
// usual (likely) case: value between 0 and 2
if(v != 3) return v;
// bad case: lookup the index<<8 in the secondary array
// lower_bound finds the first >=, so we don't need to mask out the value
auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
// some coherency checks
if(ptr == sec_arr.end()) std::abort();
if((*ptr >> 8) != idx) std::abort();
#endif
// extract our 8-bit value from the 32 bit (index, value) thingie
return (*ptr) & 0xff;
}
void populate(uint8_t *source, size_t size) {
main_arr.clear(); sec_arr.clear();
// size the main storage (round up)
main_arr.resize((size+3)/4);
for(size_t idx = 0; idx < size; ++idx) {
uint8_t in = source[idx];
uint8_t &target = main_arr[idx>>2];
// if the input doesn't fit, cap to 3 and put in secondary storage
if(in >= 3) {
// top 24 bits: index; low 8 bit: value
sec_arr.push_back((idx << 8) | in);
in = 3;
}
// store in the target according to the position
target |= in << ((idx & 3)*2);
}
}
volatile unsigned out;
int main() {
XorShift32 xs;
std::vector<uint8_t> vec;
int size = 10000000;
for(int i = 0; i<size; ++i) {
uint32_t v = xs();
if(v < 1825361101) v = 0; // 42.5%
else if(v < 4080218931) v = 1; // 95.0%
else if(v < 4252017623) v = 2; // 99.0%
else {
while((v & 0xff) < 3) v = xs();
}
vec.push_back(v);
}
populate(vec.data(), vec.size());
mean_variance lk_t, arr_t;
for(int i = 0; i<50; ++i) {
{
unsigned o = 0;
auto beg = high_resolution_clock::now();
for(int i = 0; i < size; ++i) {
o += lookup(xs() % size);
}
out += o;
int dur = (high_resolution_clock::now()-beg)/microseconds(1);
fprintf(stderr, "lookup: %10d µs\n", dur);
lk_t(dur);
}
{
unsigned o = 0;
auto beg = high_resolution_clock::now();
for(int i = 0; i < size; ++i) {
o += vec[xs() % size];
}
out += o;
int dur = (high_resolution_clock::now()-beg)/microseconds(1);
fprintf(stderr, "array: %10d µs\n", dur);
arr_t(dur);
}
}
fprintf(stderr, " lookup | ± | array | ± | speedup\n");
printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
lk_t.mean(), lk_t.stddev(),
arr_t.mean(), arr_t.stddev(),
arr_t.mean()/lk_t.mean());
return 0;
}
(code and data always updated in my Bitbucket)
The code above populates a 10M element array with random data distributed as OP specified in their post, initializes my data structure and then:
performs a random lookup of 10M elements with my data structure
does the same through the original array.
(notice that in case of sequential lookup the array always wins by a huge measure, as it's the most cache-friendly lookup you can do)
These last two blocks are repeated 50 times and timed; at the end, the mean and standard deviation for each type of lookup are calculated and printed, along with the speedup (lookup_mean/array_mean).
I compiled the code above with g++ 5.4.0 (-O3 -static, plus some warnings) on Ubuntu 16.04, and ran it on several machines; most of them are running Ubuntu 16.04, some an older Linux, some a newer one. I don't think the OS should be relevant at all in this case.
CPU | cache | lookup (µs) | array (µs) | speedup (x)
Xeon E5-1650 v3 @ 3.50GHz | 15360 KB | 60011 ± 3667 | 29313 ± 2137 | 0.49
Xeon E5-2697 v3 @ 2.60GHz | 35840 KB | 66571 ± 7477 | 33197 ± 3619 | 0.50
Celeron G1610T @ 2.30GHz | 2048 KB | 172090 ± 629 | 162328 ± 326 | 0.94
Core i3-3220T @ 2.80GHz | 3072 KB | 111025 ± 5507 | 114415 ± 2528 | 1.03
Core i5-7200U @ 2.50GHz | 3072 KB | 92447 ± 1494 | 95249 ± 1134 | 1.03
Xeon X3430 @ 2.40GHz | 8192 KB | 111303 ± 936 | 127647 ± 1503 | 1.15
Core i7 920 @ 2.67GHz | 8192 KB | 123161 ± 35113 | 156068 ± 45355 | 1.27
Xeon X5650 @ 2.67GHz | 12288 KB | 106015 ± 5364 | 140335 ± 6739 | 1.32
Core i7 870 @ 2.93GHz | 8192 KB | 77986 ± 429 | 106040 ± 1043 | 1.36
Core i7-6700 @ 3.40GHz | 8192 KB | 47854 ± 573 | 66893 ± 1367 | 1.40
Core i3-4150 @ 3.50GHz | 3072 KB | 76162 ± 983 | 113265 ± 239 | 1.49
Xeon X5650 @ 2.67GHz | 12288 KB | 101384 ± 796 | 152720 ± 2440 | 1.51
Core i7-3770T @ 2.50GHz | 8192 KB | 69551 ± 1961 | 128929 ± 2631 | 1.85
The results are... mixed!
In general, on most of these machines there is some kind of speedup, or at least they are on a par.
The two cases where the array truly trumps the "smart structure" lookup are on machines with lots of cache that are not particularly busy: the Xeon E5-1650 above (15 MB cache) is a nightly build machine, at the moment quite idle; the Xeon E5-2697 (35 MB cache) is a machine for high-performance calculations, in an idle moment as well. It does make sense: the original array fits completely in their huge cache, so the compact data structure only adds complexity.
At the opposite side of the "performance spectrum" - but where again the array is slightly faster - there's the humble Celeron that powers my NAS; it has so little cache that neither the array nor the "smart structure" fits in it at all. Other machines with a small enough cache perform similarly.
The Xeon X5650 results must be taken with some caution - they are virtual machines on a quite busy dual-socket virtual machine server; it may well be that, although nominally it has a decent amount of cache, during the test it got preempted several times by completely unrelated virtual machines.
Another option could be
check if the result is 0, 1 or 2
if not do a regular lookup
In other words something like:
unsigned char lookup(int index) {
int code = (bmap[index>>2]>>(2*(index&3)))&3;
if (code != 3) return code;
return full_array[index];
}
where bmap uses 2 bits per element with the value 3 meaning "other".
This structure is trivial to update, uses 25% more memory but the big part is looked up only in 5% of the cases. Of course, as usual, if it's a good idea or not depends on a lot of other conditions so the only answer is experimenting with real usage.
This is more of a "long comment" than a concrete answer
Unless your data is something well-known, I doubt anyone can DIRECTLY answer your question (and I'm not aware of anything that matches your description, but then I don't know EVERYTHING about all kinds of data patterns for all kinds of use-cases). Sparse data is a common problem in high performance computing, but it's typically "we have a very large array, but only some values are non-zero".
For not-well-known patterns like what I think yours is, nobody will KNOW directly which is better, and it depends on the details: how random is the random access - is the system accessing clusters of data items, or is it completely random, like from a uniform random number generator? Is the table data completely random, or are there sequences of 0 and sequences of 1, with a scattering of other values? Run-length encoding would work well if you have reasonably long sequences of 0 and 1, but it won't work if you have a "checkerboard" of 0/1. Also, you'd have to keep a table of "starting points" so you can work your way to the relevant place reasonably quickly.
I know from a long time back that some big databases are just a large table in RAM (telephone exchange subscriber data in this example), and one of the problems there is that caches and page-table optimisations in the processor are pretty useless. The caller is so rarely the same as someone who recently called that there is no pre-loaded data of any kind; it's just purely random. Big page tables are the best optimisation for that type of access.
In a lot of cases, compromising between "speed" and "small size" is one of those trade-offs you have to make in software engineering [in other engineering it's not necessarily so much of a compromise]. So, "wasting memory for simpler code" is quite often the preferred choice. In this sense, the "simple" solution is quite likely better for speed, but if you have a "better" use for the RAM, then optimising for the size of the table would give you sufficient performance and a good improvement on size. There are lots of different ways you could achieve this - as suggested in a comment, a 2-bit field where the two or three most common values are stored, and then some alternative data format for the other values - a hash table would be my first approach, but a list or binary tree may work too - again, it depends on where your "not 0, 1 or 2" values are and how they are "scattered" in the table: are they in clusters, or are they more of an evenly distributed pattern?
But a problem with that is that you are still reading the data from RAM. You are then spending more code processing the data, including some code to cope with the "this is not a common value".
The problem with most common compression algorithms is that they are based on unpacking sequences, so you can't random access them. And the overhead of splitting your big data into chunks of, say, 256 entries at a time, and uncompressing the 256 into a uint8_t array, fetching the data you want, and then throwing away your uncompressed data, is highly unlikely to give you good performance - assuming that's of some importance, of course.
In the end, you will probably have to implement one or a few of the ideas in comments/answers to test out, see if it helps solving your problem, or if memory bus is still the main limiting factor.
What I've done in the past is use a hashmap in front of a bitset.
This halves the space compared to Matteo's answer, but may be slower if "exception" lookups are slow (i.e. there are many exceptions).
Often, however, "cache is king".
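A minimal sketch of that idea (my own illustration, names are made up): one bit per element for the 0/1 majority, plus a hash map holding only the rare indices whose value is 2 or more.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct BitsetWithExceptions {
    std::vector<uint64_t> bits;                    // one bit per element (0 or 1)
    std::unordered_map<uint32_t, uint8_t> other;   // index -> value for the ~5% exceptions

    uint8_t lookup(uint32_t idx) const {
        auto it = other.find(idx);
        if (it != other.end()) return it->second;   // rare case: 2..255
        return (bits[idx >> 6] >> (idx & 63)) & 1;  // common case: 0 or 1
    }
};
Note that in this layout the map is probed on every access (a single bit cannot signal "this is an exception"), so the hash lookup needs to be cheap for the halved storage to pay off.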
Unless there is a pattern to your data it is unlikely that there is any sensible speed or size optimisation, and - assuming you are targeting a normal computer - 10 MB isn't that big a deal anyway.
There are two assumptions in your question:
The data is being poorly stored because you aren't using all the bits
Storing it better would make things faster.
I think both of these assumptions are false. In most cases the appropriate way to store data is to store the most natural representation. In your case, this is the one you've gone for: a byte for a number between 0 and 255. Any other representation will be more complex and therefore - all other things being equal - slower and more error-prone. To divert from this general principle you need a stronger reason than the potentially six "wasted" bits on 95% of your data.
For your second assumption, it will be true if, and only if, changing the size of the array results in substantially fewer cache misses. Whether this will happen can only be definitively determined by profiling working code, but I think it's highly unlikely to make a substantial difference. Because you will be randomly accessing the array in either case, the processor will struggle to know which bits of data to cache and keep in either case.
If the data and accesses are uniformly randomly distributed, performance is probably going to depend upon what fraction of accesses avoid an outer-level cache miss. Optimizing that will require knowing what size array can be reliably accommodated in cache. If your cache is large enough to accommodate one byte for every five cells, the simplest approach may be to have one byte hold five base-three encoded values in the range 0-2 (there are 243 combinations of 5 values, so that will fit in a byte), along with a 10,000,000-byte array that would be queried whenever the base-3 value indicates "2".
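As an illustrative sketch of that packing (my own code, not the answer's): five base-3 digits fit in one byte (3^5 = 243 <= 256), and a digit of 2 sends the query to the full-size array.
#include <cstdint>
#include <vector>

std::vector<uint8_t> packed;   // one byte per 5 elements, each byte in [0, 242]
std::vector<uint8_t> big;      // the plain 10,000,000-byte array, consulted on a 2

uint8_t get(uint32_t idx) {
    static const uint8_t pow3[5] = {1, 3, 9, 27, 81};
    uint8_t digit = (packed[idx / 5] / pow3[idx % 5]) % 3;   // extract one base-3 digit
    return digit < 2 ? digit : big[idx];                     // 2 means "look in the big array"
}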
If the cache isn't that big, but could accommodate one byte per 8 cells, then it would not be possible to use one byte value to select from among all 6,561 possible combinations of eight base-3 values, but since the only effect of changing a 0 or 1 to a 2 would be to cause an otherwise-unnecessary lookup, correctness wouldn't require supporting all 6,561. Instead, one could focus on the 256 most "useful" values.
Especially if 0 is more common than 1, or vice versa, a good approach might be to use 219 values to encode the combinations of 0 and 1 that contain 5 or fewer 1's, 16 values to encode xxxx0000 through xxxx1111, 16 to encode 0000xxxx through 1111xxxx, and one for xxxxxxxx. Four values would remain for whatever other use one might find. If the data are randomly distributed as described, a slight majority of all queries would hit bytes which contained just zeroes and ones (in about 2/3 of all groups of eight, all bits would be zeroes and ones, and about 7/8 of those would have six or fewer 1 bits); the vast majority of those that didn't would land in a byte which contained four x's, and would have a 50% chance of landing on a zero or a one. Thus, only about one in four queries would necessitate a large-array lookup.
If the data are randomly distributed but the cache isn't big enough to handle one byte per eight elements, one could try to use this approach with each byte handling more than eight items, but unless there is a strong bias toward 0 or toward 1, the fraction of values that can be handled without having to do a lookup in the big array will shrink as the number handled by each byte increases.
I'll add to @o11c's answer, as his wording might be a bit confusing.
If I need to squeeze the last bit and CPU cycle I'd do the following.
We will start by constructing a balanced binary search tree that holds the 5% "something else" cases. For every lookup, you walk the tree quickly: you have 10000000 elements, 5% of which are in the tree, hence the tree data structure holds 500000 elements. Walking this in O(log(n)) time gives you 19 iterations. I'm no expert at this, but I guess there are some memory-efficient implementations out there. Let's guesstimate:
Balanced tree, so subtree position can be calculated (indices do not need to be stored in the nodes of the tree). The same way a heap (data structure) is stored in linear memory.
1 byte value (2 to 255)
3 bytes for the index (10000000 takes 23 bits, which fits 3 bytes)
Totalling 4 bytes per node: 500000*4 = 1953 kB. Fits the cache!
For all the other cases (0 or 1), you can use a bitvector. Note that you cannot leave out the 5% other cases for random access: 1.19 MB.
The combination of these two uses approximately 3.099 MB. Using this technique, you will save a factor of 3.08 in memory.
However, this doesn't beat the answer of @Matteo Italia (which uses 2.76 MB), which is a pity. Is there anything extra we can do? The most memory-consuming part is the 3 bytes of index in the tree. If we can get this down to 2, we would save 488 kB and the total memory usage would be 2.622 MB, which is smaller!
How do we do this? We have to reduce the indexing to 2 bytes. Again, 10000000 takes 23 bits. We need to be able to drop 7 bits. We can simply do this by partitioning the range of 10000000 elements into 2^7 (=128) regions of 78125 elements. Now we can build a balanced tree for each of these regions, with 3906 elements on average. Picking the right tree is done by a simple division of the target index by 2^7 (or a bitshift >> 7). Now the required index to store can be represented by the remaining 16 bits. Note that there is some overhead for the length of the tree that needs to be stored, but this is negligible. Also note that this splitting mechanism reduces the number of iterations needed to walk the tree: because we dropped 7 bits, the walk now takes 7 fewer iterations, leaving only 12.
Note that you could theoretically repeat the process to cut off the next 8 bits, but this would require you to create 2^15 balanced trees, with ~305 elements on average. This would result in 2.143 MB, with only 4 iterations to walk the tree, which is a considerable speedup, compared to the 19 iterations we started with.
As a final conclusion: this beats the 2-bit vector strategy by a tiny bit of memory usage, but is a whole struggle to implement. But if it can make the difference between fitting the cache or not, it might be worth the try.
If you only perform read operations it would be better not to assign a value to a single index but to an interval of indices.
For example:
[0, 15000] = 0
[15001, 15002] = 153
[15003, 26876] = 2
[26877, 31578] = 0
...
This can be done with a struct. You also might want to define a class similar to this if you like an OO approach.
class Interval {
private:
    uint32_t start;  // First element of interval
    uint32_t end;    // Last element of interval
    uint8_t value;   // Assigned value
public:
    Interval(uint32_t start, uint32_t end, uint8_t value);
    bool isInInterval(uint32_t item);  // Checks if item lies within interval
    uint8_t getValue();                // Returns the assigned value
};
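The member functions aren't spelled out above; a minimal sketch of the obvious semantics (my own filling-in, not the original answer's code) would be:
Interval::Interval(uint32_t start, uint32_t end, uint8_t value)
    : start(start), end(end), value(value) {}

bool Interval::isInInterval(uint32_t item) {
    return item >= start && item <= end;   // inclusive bounds, matching the comments above
}

uint8_t Interval::getValue() {
    return value;
}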
Now you just have to iterate through a list of intervals and check whether your index lies within one of them, which can be much less memory intensive on average but costs more CPU resources.
Interval intervals[INTERVAL_COUNT];
intervals[0] = Interval(0, 15000, 0);
intervals[1] = Interval(15001, 15002, 153);
intervals[2] = Interval(15003, 26876, 2);
intervals[3] = Interval(26877, 31578, 0);
...
uint8_t checkIntervals(uint32_t item)
{
    for (int i = 0; i < INTERVAL_COUNT; i++)
    {
        if (intervals[i].isInInterval(item) == true)
        {
            return intervals[i].getValue();
        }
    }
    return DEFAULT_VALUE;
}
If you order the intervals by descending size you increase the probability that the item you are looking for is found early which further decreases your average memory and CPU resource usage.
You could also remove all intervals with a size of 1. Put the corresponding values into a map and check them only if the item you are looking for wasn't found in the intervals. This should also raise the average performance a bit.
Long long time ago, I can just remember...
In university we got a task to accelerate a ray-tracer program that had to read over and over again from buffer arrays. A friend told me to always use RAM reads that are multiples of 4 bytes. So I changed the array from a pattern of [x1,y1,z1,x2,y2,z2,...,xn,yn,zn] to a pattern of [x1,y1,z1,0,x2,y2,z2,0,...,xn,yn,zn,0], i.e. I added an empty field after each 3D coordinate. After some performance testing: it was faster.
So long story short: read multiples of 4 bytes from your array in RAM, and maybe also from the right starting position, so you read a little cluster that contains the searched index and read the searched index from this little cluster in the CPU. (In your case you will not need to insert fill fields, but the concept should be clear.)
Maybe other multiples could also be the key in newer systems.
I don't know if this will work in your case, so if it doesn't: sorry. If it works I would be happy to hear about some test results.
PS: Oh, and if there is any access pattern or nearby accessed indices, you can reuse the cached cluster.
PPS: It could be that the multiple factor was more like 16 bytes or something like that; it's too long ago for me to remember exactly.
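To make the layout concrete, here is a hypothetical struct version of the same padding idea (my own sketch, not the original poster's code): pad each xyz coordinate out to 16 bytes so every element starts on a 16-byte boundary.
struct alignas(16) PaddedVec3 {
    float x, y, z;
    float pad;   // unused filler so that sizeof(PaddedVec3) == 16
};
static_assert(sizeof(PaddedVec3) == 16, "one coordinate per 16-byte slot");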
Looking at this, you could split your data, for example:
a bitset which gets indexed and represents the value 0 (std::vector<bool> would be useful here)
a bitset which gets indexed and represents the value 1
a std::vector for the values of 2, containing the indexes which refer to this value
a map for the other values (or a std::vector of (index, value) pairs)
In this case, all values appear up to a given index, so you could even remove one of the bitsets and represent that value as being missing from the other ones.
This will save you some memory for this case, though would make the worst case worse.
You'll also need more CPU power to do the lookups.
Make sure to measure!
Like Mats mentions in his comment-answer, it is hard to say what is actually the best solution without knowing specifically what kind of data you have (e.g., are there long runs of 0's, and so on), and what your access pattern looks like (does "random" mean "all over the place" or just "not strictly in completely linear fashion" or "every value exactly once, just randomized" or ...).
That said, there are two mechanisms coming to mind:
Bit arrays; i.e., if you only had two values, you could trivially compress your array by a factor of 8; if you have 4 values (or "3 values + everything else") you can compress by a factor of two. Which might just not be worth the trouble and would need benchmarks, especially if you have really random access patterns which escape your caches and hence do not change the access time at all.
(index,value) or (value,index) tables. I.e., have one very small table for the 1% case, maybe one table for the 5% case (which only needs to store the indexes as all have the same value), and a big compressed bit array for the final two cases. And with "table" I mean something which allows relatively quick lookup; i.e., maybe a hash, a binary tree, and so on, depending on what you have available and your actual needs. If these subtables fit into your 1st/2nd level caches, you might get lucky.
I am not very familiar with C, but in C++ you can use unsigned char to represent an integer in the range 0 - 255.
Compared to normal int (again, I am coming from Java and C++ world) in which 4 byte (32 bit) are required, an unsigned char requires 1 byte (8 bits).
so it might reduce the total size of the array by 75%.
You have succinctly described all the distribution characteristics of your array; toss the array.
You can easily replace the array with a randomized method that produces the same probabilistic output as the array.
If consistency matters (producing the same value for the same random index), consider using a bloom filter and/or hash map to track repeat hits. If your array accesses really are random, though, this is totally unnecessary.
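A sketch of that idea (my own illustration; it assumes an even split between 0 and 1, which the question doesn't actually specify): draw the value from the stated distribution instead of storing it at all.
#include <cstdint>
#include <random>

uint8_t sample(std::mt19937 &rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double p = u(rng);
    if (p < 0.95) return p < 0.475 ? 0 : 1;   // 95%: 0 or 1 (even split assumed)
    if (p < 0.99) return 2;                   // 4%: 2
    return uint8_t(3 + rng() % 253);          // 1%: 3..255
}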

Fast implementation of a large integer counter (in C/C++)

My goal is the following:
Generate successive values such that each new one was never generated before, until all possible values are generated. At this point, the counter starts the same sequence again. The main point here is that all possible values are generated without repetition (until the period is exhausted). It does not matter if the sequence is simply 0, 1, 2, 3, ..., or in some other order.
For example, if the range can be represented simply by an unsigned, then
void increment (unsigned &n) {++n;}
is enough. However, the integer range is larger than 64 bits. For example, in one place, I need to generate a 256-bit sequence. A simple implementation is like the following, just to illustrate what I am trying to do:
typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
if (ctr[0] < max) {++ctr[0]; return;}
if (ctr[1] < max) {++ctr[1]; return;}
if (ctr[2] < max) {++ctr[2]; return;}
if (ctr[3] < max) {++ctr[3]; return;}
ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}
So if ctr starts with all zeros, then first ctr[0] is increased one by one until it reaches max, then ctr[1], and so on. If all 256 bits are set, then we reset it to all zeros and start again.
The problem is that such an implementation is surprisingly slow. My current improved version is roughly equivalent to the following:
void increment (ctr_type &ctr)
{
std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]));
if (k < 4)
++ctr[k];
else
memset(ctr.data(), 0, 32);
}
If the counter is only manipulated with the above increment function, and always starts with zero, then ctr[k] == 0 if ctr[k - 1] == 0. Thus the value k will be the index of the first element that is less than the maximum.
I expected the first to be faster, since branch misprediction should happen only once every 2^64 iterations. For the second, though misprediction only happens once every 2^256 iterations, that shouldn't make a difference; and apart from the branching, it needs four bitwise negations, four boolean negations, and three additions, which might cost much more than the first.
However, clang, gcc, and Intel icpc all generate binaries in which the second is much faster.
My main question is whether anyone knows a faster way to implement such a counter. It does not matter if the counter starts by increasing the first integer, or whether it is implemented as an array of integers at all, as long as the algorithm generates all 2^256 combinations of 256 bits.
What makes things more complicated is that I also need non-uniform increments. For example, each time the counter is incremented by K where K > 1, but K almost always remains a constant. My current implementation is similar to the above.
To provide some more context, one place I am using the counters is as input to the AES-NI aesenc instruction. So a distinct 128-bit integer (loaded into an __m128i), after going through 10 (or 12 or 14, depending on the key size) rounds of the instruction, yields a distinct 128-bit integer. If I generate one __m128i integer at a time, then the cost of the increment matters little. However, since aesenc has quite a bit of latency, I generate integers in blocks. For example, I might have 4 blocks, ctr_type block[4], initialized equivalently to the following:
block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);
And each time I need new output, I increment each block[i] by 4 and generate 4 __m128i outputs at once. By interleaving instructions, overall I was able to increase the throughput and reduce the cycles per byte of output (cpB) from 6 to 0.9 when using 2 64-bit integers as the counter and 8 blocks. However, if I instead use 4 32-bit integers as the counter, the throughput, measured as bytes per second, is reduced by half. I know for a fact that on x86-64, 64-bit integers can be faster than 32-bit in some situations, but I did not expect such a simple increment operation to make such a big difference. I have carefully benchmarked the application, and the increment is indeed what slows the program down. Since loading into __m128i and storing the __m128i output into usable 32-bit or 64-bit integers are done through aligned pointers, the only difference between the 32-bit and 64-bit versions is how the counter is incremented. I expected that the AES-NI rounds, after loading the integers into __m128i, would dominate the performance, but when using 4 or 8 blocks that was clearly not the case.
So to summarize, my main question is whether anyone knows a way to improve the above counter implementation.
It's not only slow, but impossible. The total energy of the universe is insufficient for 2^256 bit changes. And that would require a Gray counter.
Next thing before optimization is to fix the original implementation
void increment (ctr_type &ctr)
{
if (++ctr[0] != 0) return;
if (++ctr[1] != 0) return;
if (++ctr[2] != 0) return;
++ctr[3];
}
If each ctr[i] was not allowed to overflow to zero, the period would be just 4*(2^64), as in 0-9, 19, 29, 39, 49, ..., 99, 199, 299, ..., and 1999, 2999, 3999, ..., 9999.
As a reply to the comment - it takes 2^64 iterations to have the first overflow. Being generous, up to 2^32 iterations could take place in a second, meaning that the program would have to run for 2^32 seconds to see the first carry out. That's about 136 years.
EDIT
If the original implementation with 2^66 states is really what is wanted, then I'd suggest changing the interface and the functionality to something like:
(*counter) += 1;
while (*counter == 0)
{
counter++; // Move to next word
if (counter > tail_of_array) {
counter = head_of_array;
memset(counter,0, 16);
break;
}
}
The point being, that the overflow is still very infrequent. Almost always there's just one word to be incremented.
If you're using GCC or compilers with __int128 like Clang or ICC
unsigned __int128 H = 0, L = 0;
L++;
if (L == 0) H++;
On systems where __int128 isn't available
std::array<uint64_t, 4> c{};
c[0]++;
if (c[0] == 0)
{
c[1]++;
if (c[1] == 0)
{
c[2]++;
if (c[2] == 0)
{
c[3]++;
}
}
}
In inline assembly it's much easier to do this using the carry flag. Unfortunately most high-level languages don't have a means to access it directly. Some compilers do have intrinsics for adding with carry, like __builtin_uaddll_overflow in GCC and __builtin_addcll in Clang.
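For instance, a sketch using __builtin_uaddll_overflow (GCC/Clang; this is my own illustration, not the question's code) that propagates the carry across the four 64-bit words and also handles increments by K > 1:
#include <array>
#include <cstddef>

using ctr_type = std::array<unsigned long long, 4>;

void increment(ctr_type &ctr, unsigned long long k = 1)
{
    unsigned long long carry = k;
    for (std::size_t i = 0; i < ctr.size() && carry; ++i) {
        // __builtin_uaddll_overflow returns true if the addition wrapped,
        // in which case a carry of 1 must propagate into the next word
        carry = __builtin_uaddll_overflow(ctr[i], carry, &ctr[i]) ? 1ULL : 0ULL;
    }
}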
Anyway this is rather a waste of time, since the total number of particles in the universe is only about 10^80 and you cannot even count a 64-bit counter up to its maximum in your lifetime.
Neither of your counter versions increments correctly. Instead of counting up to UINT256_MAX, you are actually just counting up to UINT64_MAX 4 times and then starting back at 0 again. This is apparent from the fact that you do not bother to clear any of the indices that have reached the max value until all of them have reached it. If you are measuring performance based on how often the counter reaches all-bits-0, then this is why. Thus your algorithms do not generate all combinations of 256 bits, which is a stated requirement.
You mention "Generate successive values, such that each new one was never generated before"
To generate a set of such values, look at linear congruential generators
the sequence x = (x*1 + 1) % power_of_2 - you already thought about it; these are simply sequential numbers.
the sequence x = (x*13 + 137) % power_of_2 - this generates unique numbers with a predictable period (the full power_of_2), and the unique numbers look more "random", kind of pseudo-random. You need to resort to arbitrary-precision arithmetic to get it working, and also all the trickery of multiplication by constants. This will give you a nice way to start.
You also complain that your simple code is "slow"
At a 4.2 GHz frequency, running 4 instructions per cycle and using AVX512 vectorization, on a 64-core computer with a multithreaded version of your program doing nothing else than increments, you get only 64*8*4*2^32 = 8796093022208 increments per second; that is, 2^64 increments are reached in 25 days. This post is old: you might have reached 841632698362998292480 by now, running such a program on such a machine, and you will gloriously reach 1683265396725996584960 in 2 years' time.
You also require "until all possible values are generated".
You can only generate a finite number of values, depending how much you are willing to pay for the energy to power your computers. As mentioned in the other responses, with 128 or 256-bit numbers, even being the richest man in the world, you will never wrap around before the first of these conditions occurs:
getting out of money
end of humankind (nobody will get the outcome of your software)
burning the energy from the last particles of the universe
Multi-word addition can easily be accomplished in portable fashion by using three macros that mimic three types of addition instructions found on many processors:
ADDcc adds two words, and sets the carry if there was unsigned overflow
ADDC adds two words plus carry (from a previous addition)
ADDCcc adds two words plus carry, and sets the carry if there was unsigned overflow
A multi-word addition with two words uses ADDcc of the least significant words followed by ADDC of the most significant words. A multi-word addition with more than two words forms the sequence ADDcc, ADDCcc, ..., ADDC. The MIPS architecture is a processor architecture without condition codes and therefore without a carry flag. The macro implementations shown below basically follow the techniques used on MIPS processors for multi-word additions.
The ISO-C99 code below shows the operation of a 32-bit counter and a 64-bit counter based on 16-bit "words". I chose arrays as the underlying data structure, but one might also use struct, for example. Use of a struct will be significantly faster if each operand only comprises a few words, as the overhead of array indexing is eliminated. One would want to use the widest available integer type for each "word" for best performance. In the example from the question that would likely be a 256-bit counter comprising four uint64_t components.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
typedef uint16_t T;
/* increment a multi-word counter comprising n words */
void inc_array (T *counter, const T *increment, int n)
{
T cy, t0, t1;
counter [0] = ADDcc (counter [0], increment [0], cy, t0, t1);
for (int i = 1; i < (n - 1); i++) {
counter [i] = ADDCcc (counter [i], increment [i], cy, t0, t1);
}
counter [n-1] = ADDC (counter [n-1], increment [n-1], cy, t0, t1);
}
#define INCREMENT (10)
#define UINT32_ARRAY_LEN (2)
#define UINT64_ARRAY_LEN (4)
int main (void)
{
uint32_t count32 = 0, incr32 = INCREMENT;
T count_arr2 [UINT32_ARRAY_LEN] = {0};
T incr_arr2 [UINT32_ARRAY_LEN] = {INCREMENT};
do {
count32 = count32 + incr32;
inc_array (count_arr2, incr_arr2, UINT32_ARRAY_LEN);
} while (count32 < (0U - INCREMENT - 1));
printf ("count32 = %08x arr_count = %08x\n",
count32, (((uint32_t)count_arr2 [1] << 16) +
((uint32_t)count_arr2 [0] << 0)));
uint64_t count64 = 0, incr64 = INCREMENT;
T count_arr4 [UINT64_ARRAY_LEN] = {0};
T incr_arr4 [UINT64_ARRAY_LEN] = {INCREMENT};
do {
count64 = count64 + incr64;
inc_array (count_arr4, incr_arr4, UINT64_ARRAY_LEN);
} while (count64 < 0xa987654321ULL);
printf ("count64 = %016llx arr_count = %016llx\n",
count64, (((uint64_t)count_arr4 [3] << 48) +
((uint64_t)count_arr4 [2] << 32) +
((uint64_t)count_arr4 [1] << 16) +
((uint64_t)count_arr4 [0] << 0)));
return EXIT_SUCCESS;
}
Compiled with full optimization, the 32-bit example executes in about a second, while the 64-bit example runs for about a minute on a modern PC. The output of the program should look like so:
count32 = fffffffa arr_count = fffffffa
count64 = 000000a987654326 arr_count = 000000a987654326
Non-portable code that is based on inline assembly or proprietary extensions for wide integer types may execute about two to three times as fast as the portable solution presented here.

Is multiplication and division using shift operators in C actually faster?

Multiplication and division can be achieved using bit operators, for example
i*2 = i<<1
i*3 = (i<<1) + i;
i*10 = (i<<3) + (i<<1)
and so on.
Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly? Is there any sort of input that can't be multiplied or divided in this way?
Short answer: Not likely.
Long answer:
Your compiler has an optimizer in it that knows how to multiply as quickly as your target processor architecture is capable. Your best bet is to tell the compiler your intent clearly (i.e. i*2 rather than i << 1) and let it decide what the fastest assembly/machine code sequence is. It's even possible that the processor itself has implemented the multiply instruction as a sequence of shifts & adds in microcode.
Bottom line--don't spend a lot of time worrying about this. If you mean to shift, shift. If you mean to multiply, multiply. Do what is semantically clearest--your coworkers will thank you later. Or, more likely, curse you later if you do otherwise.
Just a concrete point of measure: many years back, I benchmarked two versions of my hashing algorithm:
unsigned
hash( char const* s )
{
unsigned h = 0;
while ( *s != '\0' ) {
h = 127 * h + (unsigned char)*s;
++ s;
}
return h;
}
and
unsigned
hash( char const* s )
{
unsigned h = 0;
while ( *s != '\0' ) {
h = (h << 7) - h + (unsigned char)*s;
++ s;
}
return h;
}
On every machine I benchmarked it on, the first was at least as fast as the second. Somewhat surprisingly, it was sometimes faster (e.g. on a Sun Sparc). When the hardware didn't support fast multiplication (and most didn't back then), the compiler would convert the multiplication into the appropriate combinations of shifts and add/sub. And because it knew the final goal, it could sometimes do so in fewer instructions than when you explicitly wrote the shifts and the add/subs.
Note that this was something like 15 years ago. Hopefully, compilers have only gotten better since then, so you can pretty much count on the compiler doing the right thing, probably better than you could. (Also, the reason the code looks so C'ish is because it was over 15 years ago; I'd obviously use std::string and iterators today.)
In addition to all the other good answers here, let me point out another reason to not use shift when you mean divide or multiply. I have never once seen someone introduce a bug by forgetting the relative precedence of multiplication and addition. I have seen bugs introduced when maintenance programmers forgot that "multiplying" via a shift is logically a multiplication but not syntactically of the same precedence as multiplication. x * 2 + z and x << 1 + z are very different!
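A tiny illustration of that pitfall (my own example with hypothetical values): << binds more loosely than +, so x << 1 + z parses as x << (1 + z).
#include <cstdio>

int main() {
    int x = 3, z = 4;
    std::printf("%d\n", x * 2 + z);    // prints 10
    std::printf("%d\n", x << 1 + z);   // parses as 3 << (1 + 4) and prints 96
    return 0;
}
(GCC and Clang will typically warn about the second expression with -Wparentheses, which is part of -Wall.)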
If you're working on numbers then use arithmetic operators like + - * / %. If you're working on arrays of bits, use bit twiddling operators like & ^ | >> . Don't mix them; an expression that has both bit twiddling and arithmetic is a bug waiting to happen.
This depends on the processor and the compiler. Some compilers already optimize code this way, others don't.
So you need to check each time your code needs to be optimized this way.
Unless you desperately need to optimize, I would not scramble my source code just to save an assembly instruction or processor cycle.
Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly?
It might or might not be on your machine - if you care, measure in your real-world usage.
A case study - from 486 to core i7
Benchmarking is very difficult to do meaningfully, but we can look at a few facts. From http://www.penguin.cz/~literakl/intel/s.html#SAL and http://www.penguin.cz/~literakl/intel/i.html#IMUL we get an idea of x86 clock cycles needed for arithmetic shift and multiplication. Say we stick to the "486" (the newest one listed), with 32-bit registers and immediates: IMUL takes 13-42 cycles and IDIV 44, while each SAL takes 2 and an ADD takes 1, so even with a few of those together shifting superficially looks like a winner.
These days, with the core i7:
(from http://software.intel.com/en-us/forums/showthread.php?t=61481)
The latency is 1 cycle for an integer addition and 3 cycles for an integer multiplication. You can find the latencies and throughput in Appendix C of the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", which is located on http://www.intel.com/products/processor/manuals/.
(from some Intel blurb)
Using SSE, the Core i7 can issue simultaneous add and multiply instructions, resulting in a peak rate of 8 floating-point operations (FLOP) per clock cycle
That gives you an idea of how far things have come. The optimisation trivia - like bit shifting versus * - that was taken seriously even into the 90s is just obsolete now. Bit-shifting is still faster, but for non-power-of-two mul/div, by the time you do all your shifts and add the results it's slower again. Then, more instructions means more cache faults, more potential issues in pipelining, more use of temporary registers may mean more saving and restoring of register content from the stack... it quickly gets too complicated to quantify all the impacts definitively, but they're predominantly negative.
functionality in source code vs implementation
More generally, your question is tagged C and C++. As 3rd generation languages, they're specifically designed to hide the details of the underlying CPU instruction set. To satisfy their language Standards, they must support multiplication and shifting operations (and many others) even if the underlying hardware doesn't. In such cases, they must synthesize the required result using many other instructions. Similarly, they must provide software support for floating point operations if the CPU lacks it and there's no FPU. Modern CPUs all support * and <<, so this might seem absurdly theoretical and historical, but the significant thing is that the freedom to choose the implementation goes both ways: even if the CPU has an instruction that implements the operation requested in the source code in the general case, the compiler's free to choose something else that it prefers because it's better for the specific case the compiler's faced with.
Examples (with a hypothetical assembly language)
source            literal approach             optimised approach
#define N 0
int x;            .word x                      xor registerA, registerA
x *= N;           move x -> registerA
                  move x -> registerB
                  A = B * immediate(0)
                  store registerA -> x
          ...............do something more with x...............
Instructions like exclusive or (xor) have no relationship to the source code, but xor-ing anything with itself clears all the bits, so it can be used to set something to 0. Source code that implies memory addresses may not entail any being used.
These kind of hacks have been used for as long as computers have been around. In the early days of 3GLs, to secure developer uptake the compiler output had to satisfy the existing hardcore hand-optimising assembly-language dev. community that the produced code wasn't slower, more verbose or otherwise worse. Compilers quickly adopted lots of great optimisations - they became a better centralised store of it than any individual assembly language programmer could possibly be, though there's always the chance that they miss a specific optimisation that happens to be crucial in a specific case - humans can sometimes nut it out and grope for something better while compilers just do as they've been told until someone feeds that experience back into them.
So, even if shifting and adding is still faster on some particular hardware, then the compiler writer's likely to have worked out exactly when it's both safe and beneficial.
Maintainability
If your hardware changes you can recompile and it'll look at the target CPU and make another best choice, whereas you're unlikely to ever want to revisit your "optimisations" or list which compilation environments should use multiplication and which should shift. Think of all the non-power-of-two bit-shifted "optimisations" written 10+ years ago that are now slowing down the code they're in as it runs on modern processors...!
Thankfully, good compilers like GCC can typically replace a series of bitshifts and arithmetic with a direct multiplication when any optimisation is enabled (i.e. ...main(...) { return (argc << 4) + (argc << 2) + argc; } -> imull $21, 8(%ebp), %eax) so a recompilation may help even without fixing the code, but that's not guaranteed.
Strange bitshifting code implementing multiplication or division is far less expressive of what you were conceptually trying to achieve, so other developers will be confused by that, and a confused programmer's more likely to introduce bugs or remove something essential in an effort to restore seeming sanity. If you only do non-obvious things when they're really tangibly beneficial, and then document them well (but don't document other stuff that's intuitive anyway), everyone will be happier.
General solutions versus partial solutions
If you have some extra knowledge, such as that your int will really only be storing values x, y and z, then you may be able to work out some instructions that work for those values and get you your result more quickly than when the compiler doesn't have that insight and needs an implementation that works for all int values. For example, consider your question:
Multiplication and division can be achieved using bit operators...
You illustrate multiplication, but how about division?
int x;
x >> 1; // divide by 2?
According to the C++ Standard 5.8:
-3- The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
So, your bit shift has an implementation defined result when x is negative: it may not work the same way on different machines. But, / works far more predictably. (It may not be perfectly consistent either, as different machines may have different representations of negative numbers, and hence different ranges even when there are the same number of bits making up the representation.)
You may say "I don't care... that int is storing the age of the employee, it can never be negative". If you have that kind of special insight, then yes - the compiler might pass over a safe >> optimisation unless you do it explicitly in your code. But it's risky and rarely useful, as much of the time you won't have this kind of insight, and other programmers working on the same code won't know that you've bet the house on some unusual expectations about the data you'll be handling... what seems a totally safe change to them might backfire because of your "optimisation".
Is there any sort of input that can't be multiplied or divided in this way?
Yes... as mentioned above, negative numbers have implementation defined behaviour when "divided" by bit-shifting.
Just tried on my machine compiling this :
int a = ...;
int b = a * 10;
When disassembling it produces output :
MOV EAX,DWORD PTR SS:[ESP+1C] ; Move a into EAX
LEA EAX,DWORD PTR DS:[EAX+EAX*4] ; Multiply by 5 without shift !
SHL EAX, 1 ; Multiply by 2 using shift
This version is faster than your hand-optimized code with pure shifting and addition.
You really never know what the compiler is going to come up with, so it's better to simply write a normal multiplication and let it optimize the way it wants to, except in very precise cases where you know the compiler cannot optimize.
Shifting is generally a lot faster than multiplying at an instruction level, but you may well be wasting your time doing premature optimisations. The compiler may well perform these optimisations at compile time. Doing it yourself will affect readability and possibly have no effect on performance. It's probably only worth doing things like this if you have profiled and found this to be a bottleneck.
Actually the division trick, known as 'magic division', can yield huge payoffs. Again you should profile first to see if it's needed. But if you do use it, there are useful programs around to help you figure out what instructions are needed for the same division semantics. Here is an example: http://www.masm32.com/board/index.php?topic=12421.0
An example which I have lifted from the OP's thread on MASM32:
include ConstDiv.inc
...
mov eax,9999999
; divide eax by 100000
cdiv 100000
; edx = quotient
Would generate:
mov eax,9999999
mov edx,0A7C5AC47h
add eax,1
.if !CARRY?
mul edx
.endif
shr edx,16
Shift and integer multiply instructions have similar performance on most modern CPUs - integer multiply instructions were relatively slow back in the 1980s but in general this is no longer true. Integer multiply instructions may have higher latency, so there may still be cases where a shift is preferable. Ditto for cases where you can keep more execution units busy (although this can cut both ways).
Integer division is still relatively slow though, so using a shift instead of division by a power of 2 is still a win, and most compilers will implement this as an optimisation. Note however that for this optimisation to be valid the dividend needs to be either unsigned or must be known to be positive. For a negative dividend the shift and divide are not equivalent!
#include <stdio.h>
int main(void)
{
int i;
for (i = 5; i >= -5; --i)
{
printf("%d / 2 = %d, %d >> 1 = %d\n", i, i / 2, i, i >> 1);
}
return 0;
}
Output:
5 / 2 = 2, 5 >> 1 = 2
4 / 2 = 2, 4 >> 1 = 2
3 / 2 = 1, 3 >> 1 = 1
2 / 2 = 1, 2 >> 1 = 1
1 / 2 = 0, 1 >> 1 = 0
0 / 2 = 0, 0 >> 1 = 0
-1 / 2 = 0, -1 >> 1 = -1
-2 / 2 = -1, -2 >> 1 = -1
-3 / 2 = -1, -3 >> 1 = -2
-4 / 2 = -2, -4 >> 1 = -2
-5 / 2 = -2, -5 >> 1 = -3
So if you want to help the compiler then make sure the variable or expression in the dividend is explicitly unsigned.
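A minimal sketch of what that signedness hint buys you (my illustration, not from the answer above; the function names are made up):
/* With an unsigned dividend the compiler can emit a plain logical shift;
   with a signed one it must add fix-up instructions to preserve the
   round-toward-zero behaviour of / for negative values. */
unsigned half_unsigned(unsigned x) { return x / 2; }  /* typically one shr  */
int      half_signed(int x)        { return x / 2; }  /* shift plus fix-up  */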
It completely depends on target device, language, purpose, etc.
Pixel crunching in a video card driver? Very likely, yes!
.NET business application for your department? Absolutely no reason to even look into it.
For a high performance game for a mobile device it might be worth looking into, but only after easier optimizations have been performed.
Don't do it unless you absolutely need to and your code's intent calls for shifting rather than multiplication/division.
In typical day-to-day code you could potentially save a few machine cycles (or lose some, since the compiler knows better what to optimize), but the cost isn't worth it: you spend time on minor details rather than the actual job, maintaining the code becomes harder, and your co-workers will curse you.
You might need to do it for high-load computations, where each saved cycle means minutes of runtime. But you should optimize one place at a time and do performance tests each time to see whether you really made it faster or broke the compiler's optimizations.
As far as I know, on some machines multiplication can take up to 16 to 32 machine cycles. So yes, depending on the machine type, bit-shift operators are faster than multiplication/division.
However, certain machines have a math coprocessor with special instructions for multiplication/division.
I agree with the marked answer by Drew Hall. The answer could use some additional notes though.
For the vast majority of software developers the processor and compiler are no longer relevant to the question. Most of us are far beyond the 8088 and MS-DOS. It is perhaps only relevant for those who are still developing for embedded processors...
At my software company, math (add/sub/mul/div) should be used for all mathematics, while shift should be used when converting between data types, e.g. ushort to byte as n>>8 and not n/256.
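As a hedged illustration of that convention (the names here are mine, not the company's):
#include <stdint.h>

/* Shift when the intent is a data-type conversion, divide when the intent
   is arithmetic (illustrative only). */
uint8_t  high_byte(uint16_t n)           { return (uint8_t)(n >> 8); }
uint16_t average(uint16_t a, uint16_t b) { return (uint16_t)((a + b) / 2); }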
In the case of signed integers and right shift vs division, it can make a difference. For negative numbers, the shift rounds towards negative infinity whereas division rounds towards zero. Of course the compiler will change the division to something cheaper, but it will usually change it to something that has the same rounding behavior as division, because it is either unable to prove that the variable won't be negative or it simply doesn't care.
So if you can prove that a number won't be negative or if you don't care which way it will round, you can do that optimization in a way that is more likely to make a difference.
Python test performing the same multiplication 10 million times against the same random numbers.
>>> from timeit import timeit
>>> setup_str = 'import scipy; from scipy import random; scipy.random.seed(0)'
>>> N = 10*1000*1000
>>> timeit('x=random.randint(65536);', setup=setup_str, number=N)
1.894096851348877 # Time for generating the random #s with no operation
>>> timeit('x=random.randint(65536); x*2', setup=setup_str, number=N)
2.2799630165100098
>>> timeit('x=random.randint(65536); x << 1', setup=setup_str, number=N)
2.2616429328918457
>>> timeit('x=random.randint(65536); x*10', setup=setup_str, number=N)
2.2799630165100098
>>> timeit('x=random.randint(65536); (x << 3) + (x<<1)', setup=setup_str, number=N)
2.9485139846801758
>>> timeit('x=random.randint(65536); x // 2', setup=setup_str, number=N)
2.490908145904541
>>> timeit('x=random.randint(65536); x / 2', setup=setup_str, number=N)
2.4757170677185059
>>> timeit('x=random.randint(65536); x >> 1', setup=setup_str, number=N)
2.2316000461578369
So when doing a shift rather than multiplication/division by a power of two in Python, there's a slight improvement (~10% for division; ~1% for multiplication). If it's not a power of two, there's likely a considerable slowdown.
Again, these #s will change depending on your processor and your compiler (or interpreter; this was done in Python for simplicity).
As with everyone else, don't prematurely optimize. Write very readable code, profile if it's not fast enough, and then try to optimize the slow parts. Remember, your compiler is much better at optimization than you are.
There are optimizations the compiler can't do because they only work for a reduced set of inputs.
Below is C++ sample code that performs a faster division using a 64-bit "multiplication by the reciprocal". Both numerator and denominator must be below a certain threshold. Note that it must be compiled to use 64-bit instructions to actually be faster than normal division.
#include <stdio.h>
#include <chrono>
static const unsigned s_bc = 32;
static const unsigned long long s_p = 1ULL << s_bc;
static const unsigned long long s_hp = s_p / 2;
static unsigned long long s_f;
static unsigned long long s_fr;
static void fastDivInitialize(const unsigned d)
{
s_f = s_p / d;
s_fr = s_f * (s_p - (s_f * d));
}
static unsigned fastDiv(const unsigned n)
{
return (s_f * n + ((s_fr * n + s_hp) >> s_bc)) >> s_bc;
}
static bool fastDivCheck(const unsigned n, const unsigned d)
{
// 32 to 64 cycles latency on modern cpus
const unsigned expected = n / d;
// At least 10 cycles latency on modern cpus
const unsigned result = fastDiv(n);
if (result != expected)
{
printf("Failed for: %u/%u != %u\n", n, d, expected);
return false;
}
return true;
}
int main()
{
unsigned result = 0;
// Make sure to verify it works for your expected set of inputs
const unsigned MAX_N = 65535;
const unsigned MAX_D = 40000;
const double ONE_SECOND_COUNT = 1000000000.0;
auto t0 = std::chrono::steady_clock::now();
unsigned count = 0;
printf("Verifying...\n");
for (unsigned d = 1; d <= MAX_D; ++d)
{
fastDivInitialize(d);
for (unsigned n = 0; n <= MAX_N; ++n)
{
count += !fastDivCheck(n, d);
}
}
auto t1 = std::chrono::steady_clock::now();
printf("Errors: %u / %u (%.4fs)\n", count, MAX_D * (MAX_N + 1), (t1 - t0).count() / ONE_SECOND_COUNT);
t0 = t1;
for (unsigned d = 1; d <= MAX_D; ++d)
{
fastDivInitialize(d);
for (unsigned n = 0; n <= MAX_N; ++n)
{
result += fastDiv(n);
}
}
t1 = std::chrono::steady_clock::now();
printf("Fast division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);
t0 = t1;
count = 0;
for (unsigned d = 1; d <= MAX_D; ++d)
{
for (unsigned n = 0; n <= MAX_N; ++n)
{
result += n / d;
}
}
t1 = std::chrono::steady_clock::now();
printf("Normal division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);
getchar();
return result;
}
I think in the one case where you want to multiply or divide by a power of two, you can't go wrong with using bit-shift operators, even if the compiler converts them to a MUL/DIV, because some processors microcode (really, macro) them anyway, so for those cases you will achieve an improvement, especially if the shift is more than 1. Or more explicitly: if the CPU has no bit-shift operators, it will be a MUL/DIV anyway, but if the CPU has bit-shift operators, you avoid a microcode branch and save a few instructions.
I am writing some code right now that requires a lot of doubling/halving operations because it works on a dense binary tree, and there is one more operation that I suspect might be more optimal than an addition: a left (power-of-two multiply) shift combined with an addition. The addition can be replaced with an xor if the shift is wider than the number of bits you want to add; an easy example is (i<<1)^1, which adds one to a doubled value. This does not, of course, apply to a right shift (power-of-two divide), because only a left shift fills the vacated bits with zeros.
In my code, these multiply/divide-by-two and power-of-two operations are used very intensively, and because the formulae are quite short already, each instruction that can be eliminated can be a substantial gain. If the processor does not support these bit-shift operators, no gain will happen, but neither will there be a loss.
Also, the algorithms I am writing visually represent the movements that occur, so in that sense they are in fact clearer. The left hand side of a binary tree is bigger and the right is smaller. As well as that, in my code, odd and even numbers have a special significance: all left-hand children in the tree are odd, and all right-hand children, and the root, are even. In some cases, which I haven't encountered yet, x&1 may be a more optimal operation compared to x%2: x&1 on an even number produces zero, and 1 for an odd number.
Going a bit further than just odd/even identification, if I get zero for x&3 I know that 4 is a factor of our number, and the same for x&7 for 8, and so on. I know these cases probably have limited utility, but it's nice to know that you can avoid a modulus operation and use a bitwise logic operation instead, because bitwise operations are almost always the fastest and least likely to be ambiguous to the compiler. A tiny sketch of those mask tests follows.
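The sketch (my illustration, not from the original post):
/* x is divisible by 2^k exactly when its low k bits are all zero. */
int divisible_by_4(unsigned x) { return (x & 3) == 0; }
int divisible_by_8(unsigned x) { return (x & 7) == 0; }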
I am pretty much inventing the field of dense binary trees, so I expect people may not grasp the value of this comment, as very rarely does anyone want to perform factorisations only on powers of two, or only multiply/divide by powers of two.
Whether it is actually faster depends on the hardware and compiler actually used.
If you compare the output for x+x, x*2 and x<<1 with a gcc compiler, you get the same result in x86 assembly: https://godbolt.org/z/JLpp0j
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
add eax, eax
pop rbp
ret
So you can consider gcc smart enough to determine its own best solution independently of what you typed.
I too wanted to see if I could beat the house. This is a more general bitwise approach for any-number-by-any-number multiplication. The macros I made are about 25% to twice as slow as normal * multiplication. As said by others, if the multiplier is close to a power of 2, or made up of few powers of 2, you might win: X*23, made up of (X<<4)+(X<<2)+(X<<1)+X, is going to be slower than X*65, made up of (X<<6)+X.
#include <stdio.h>
#include <time.h>
#include <stdlib.h> /* for srand() and rand() */
#define MULTIPLYINTBYMINUS(X,Y) (-((X >> 30) & 1)&(Y<<30))+(-((X >> 29) & 1)&(Y<<29))+(-((X >> 28) & 1)&(Y<<28))+(-((X >> 27) & 1)&(Y<<27))+(-((X >> 26) & 1)&(Y<<26))+(-((X >> 25) & 1)&(Y<<25))+(-((X >> 24) & 1)&(Y<<24))+(-((X >> 23) & 1)&(Y<<23))+(-((X >> 22) & 1)&(Y<<22))+(-((X >> 21) & 1)&(Y<<21))+(-((X >> 20) & 1)&(Y<<20))+(-((X >> 19) & 1)&(Y<<19))+(-((X >> 18) & 1)&(Y<<18))+(-((X >> 17) & 1)&(Y<<17))+(-((X >> 16) & 1)&(Y<<16))+(-((X >> 15) & 1)&(Y<<15))+(-((X >> 14) & 1)&(Y<<14))+(-((X >> 13) & 1)&(Y<<13))+(-((X >> 12) & 1)&(Y<<12))+(-((X >> 11) & 1)&(Y<<11))+(-((X >> 10) & 1)&(Y<<10))+(-((X >> 9) & 1)&(Y<<9))+(-((X >> 8) & 1)&(Y<<8))+(-((X >> 7) & 1)&(Y<<7))+(-((X >> 6) & 1)&(Y<<6))+(-((X >> 5) & 1)&(Y<<5))+(-((X >> 4) & 1)&(Y<<4))+(-((X >> 3) & 1)&(Y<<3))+(-((X >> 2) & 1)&(Y<<2))+(-((X >> 1) & 1)&(Y<<1))+(-((X >> 0) & 1)&(Y<<0))
#define MULTIPLYINTBYSHIFT(X,Y) (((((X >> 30) & 1)<<31)>>31)&(Y<<30))+(((((X >> 29) & 1)<<31)>>31)&(Y<<29))+(((((X >> 28) & 1)<<31)>>31)&(Y<<28))+(((((X >> 27) & 1)<<31)>>31)&(Y<<27))+(((((X >> 26) & 1)<<31)>>31)&(Y<<26))+(((((X >> 25) & 1)<<31)>>31)&(Y<<25))+(((((X >> 24) & 1)<<31)>>31)&(Y<<24))+(((((X >> 23) & 1)<<31)>>31)&(Y<<23))+(((((X >> 22) & 1)<<31)>>31)&(Y<<22))+(((((X >> 21) & 1)<<31)>>31)&(Y<<21))+(((((X >> 20) & 1)<<31)>>31)&(Y<<20))+(((((X >> 19) & 1)<<31)>>31)&(Y<<19))+(((((X >> 18) & 1)<<31)>>31)&(Y<<18))+(((((X >> 17) & 1)<<31)>>31)&(Y<<17))+(((((X >> 16) & 1)<<31)>>31)&(Y<<16))+(((((X >> 15) & 1)<<31)>>31)&(Y<<15))+(((((X >> 14) & 1)<<31)>>31)&(Y<<14))+(((((X >> 13) & 1)<<31)>>31)&(Y<<13))+(((((X >> 12) & 1)<<31)>>31)&(Y<<12))+(((((X >> 11) & 1)<<31)>>31)&(Y<<11))+(((((X >> 10) & 1)<<31)>>31)&(Y<<10))+(((((X >> 9) & 1)<<31)>>31)&(Y<<9))+(((((X >> 8) & 1)<<31)>>31)&(Y<<8))+(((((X >> 7) & 1)<<31)>>31)&(Y<<7))+(((((X >> 6) & 1)<<31)>>31)&(Y<<6))+(((((X >> 5) & 1)<<31)>>31)&(Y<<5))+(((((X >> 4) & 1)<<31)>>31)&(Y<<4))+(((((X >> 3) & 1)<<31)>>31)&(Y<<3))+(((((X >> 2) & 1)<<31)>>31)&(Y<<2))+(((((X >> 1) & 1)<<31)>>31)&(Y<<1))+(((((X >> 0) & 1)<<31)>>31)&(Y<<0))
int main()
{
int randomnumber=23;
int randomnumber2=23;
int checknum=23;
clock_t start, diff;
srand(time(0));
start = clock();
for(int i=0;i<1000000;i++)
{
randomnumber = rand() % 10000;
randomnumber2 = rand() % 10000;
checknum=MULTIPLYINTBYMINUS(randomnumber,randomnumber2);
if (checknum!=randomnumber*randomnumber2)
{
printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
}
}
diff = clock() - start;
int msec = diff * 1000 / CLOCKS_PER_SEC;
printf("MULTIPLYINTBYMINUS Time %d milliseconds", msec);
start = clock();
for(int i=0;i<1000000;i++)
{
randomnumber = rand() % 10000;
randomnumber2 = rand() % 10000;
checknum=MULTIPLYINTBYSHIFT(randomnumber,randomnumber2);
if (checknum!=randomnumber*randomnumber2)
{
printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
}
}
diff = clock() - start;
msec = diff * 1000 / CLOCKS_PER_SEC;
printf("MULTIPLYINTBYSHIFT Time %d milliseconds", msec);
start = clock();
for(int i=0;i<1000000;i++)
{
randomnumber = rand() % 10000;
randomnumber2 = rand() % 10000;
checknum= randomnumber*randomnumber2;
if (checknum!=randomnumber*randomnumber2)
{
printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
}
}
diff = clock() - start;
msec = diff * 1000 / CLOCKS_PER_SEC;
printf("normal * Time %d milliseconds", msec);
return 0;
}

Is there any alternative to using % (modulus) in C/C++?

I read somewhere once that the modulus operator is inefficient on small embedded devices like 8-bit micro-controllers that do not have an integer division instruction. Perhaps someone can confirm this, but I thought the difference is 5-10 times slower than with an integer division operation.
Is there another way to do this other than keeping a counter variable and manually overflowing to 0 at the mod point?
const int FIZZ = 6;
for(int x = 0; x < MAXCOUNT; x++)
{
if(!(x % FIZZ)) print("Fizz\n"); // slow on some systems
}
vs:
The way I am currently doing it:
const int FIZZ = 6;
int fizzcount = 1;
for(int x = 1; x < MAXCOUNT; x++)
{
if(fizzcount >= FIZZ)
{
print("Fizz\n");
fizzcount = 0;
}
}
Ah, the joys of bitwise arithmetic. A side effect of many division routines is the modulus - so in few cases should division actually be faster than modulus. I'm interested to see the source you got this information from. Processors with multipliers have interesting division routines using the multiplier, but you can get from division result to modulus with just another two steps (multiply and subtract) so it's still comparable. If the processor has a built in division routine you'll likely see it also provides the remainder.
Still, there is a small branch of number theory devoted to Modular Arithmetic which requires study if you really want to understand how to optimize a modulus operation. Modular arithmetic, for instance, is very handy for generating magic squares.
So, in that vein, here's a very low level look at the math of modulus for an example of x, which should show you how simple it can be compared to division:
Maybe a better way to think about the problem is in terms of number bases and modulo arithmetic. For example, your goal is to compute DOW mod 7 where DOW is the 16-bit representation of the day of the week. You can write this as:
DOW = DOW_HI*256 + DOW_LO
DOW%7 = (DOW_HI*256 + DOW_LO) % 7
= ((DOW_HI*256)%7 + (DOW_LO % 7)) %7
= ((DOW_HI%7 * 256%7) + (DOW_LO%7)) %7
= ((DOW_HI%7 * 4) + (DOW_LO%7)) %7
Expressed in this manner, you can separately compute the modulo 7 result for the high and low bytes. Multiply the result for the high byte by 4, add it to the low, and then finally compute the result modulo 7.
Computing the mod 7 result of an 8-bit number can be performed in a similar fashion. You can write an 8-bit number in octal like so:
X = a*64 + b*8 + c
Where a, b, and c are 3-bit numbers.
X%7 = ((a%7)*(64%7) + (b%7)*(8%7) + c%7) % 7
= (a%7 + b%7 + c%7) % 7
= (a + b + c) % 7
since 64%7 = 8%7 = 1
Of course, a, b, and c are
c = X & 7
b = (X>>3) & 7
a = (X>>6) & 7 // (actually, a is only 2-bits).
The largest possible value for a+b+c is 7+7+3 = 17, so you'll need one more octal step. The complete (untested) C version could be written like:
unsigned char Mod7Byte(unsigned char X)
{
X = (X&7) + ((X>>3)&7) + (X>>6);
X = (X&7) + (X>>3);
return X==7 ? 0 : X;
}
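As a sanity check (my addition, not part of the quoted derivation), a small C loop, assuming a standard hosted environment, can verify the routine against the % operator:
#include <assert.h>

/* Exhaustively compare Mod7Byte with % 7 for every 8-bit input. */
void testMod7Byte(void)
{
    unsigned x;
    for (x = 0; x < 256; ++x)
        assert(Mod7Byte((unsigned char)x) == x % 7);
}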
I spent a few moments writing a PIC version. The actual implementation is slightly different than described above.
Mod7Byte:
movwf temp1 ;
andlw 7 ;W=c
movwf temp2 ;temp2=c
rlncf temp1,F ;
swapf temp1,W ;W= a*8+b
andlw 0x1F
addwf temp2,W ;W= a*8+b+c
movwf temp2 ;temp2 is now a 6-bit number
andlw 0x38 ;get the high 3 bits == a'
xorwf temp2,F ;temp2 now has the 3 low bits == b'
rlncf WREG,F ;shift the high bits right 4
swapf WREG,F ;
addwf temp2,W ;W = a' + b'
; at this point, W is between 0 and 10
addlw -7
bc Mod7Byte_L2
Mod7Byte_L1:
addlw 7
Mod7Byte_L2:
return
Here's a little routine to test the algorithm:
clrf x
clrf count
TestLoop:
movf x,W
RCALL Mod7Byte
cpfseq count
bra fail
incf count,W
xorlw 7
skpz
xorlw 7
movwf count
incfsz x,F
bra TestLoop
passed:
Finally, for the 16-bit result (which I have not tested), you could write:
uint16 Mod7Word(uint16 X)
{
return Mod7Byte(Mod7Byte(X & 0xff) + Mod7Byte(X>>8)*4);
}
Scott
If you are calculating a number mod some power of two, you can use the bit-wise and operator. Just subtract one from the second number. For example:
x % 8 == x & 7
x % 256 == x & 255
A few caveats:
This only works if the second number is a power of two.
It's only equivalent if the modulus is always positive. The C and C++ standards don't specify the sign of the result when the first number is negative (until C++11, which guarantees the result takes the sign of the dividend, i.e. negative or zero here, which is what most compilers were already doing). A bit-wise and clears the sign bit, so the result is always non-negative (i.e. it's a true modulus, not a remainder). It sounds like that's what you want anyway though.
Your compiler probably already does this when it can, so in most cases it's not worth doing it manually.
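If you do write the mask by hand, a guarded sketch (assuming C11 for _Static_assert; the names are illustrative) looks like this:
/* Mask form of modulo; the guard catches a modulus that is not a power of
   two, which would silently give wrong answers. */
#define MOD_POW2(x, n) ((x) & ((n) - 1u))

unsigned wrap256(unsigned x)
{
    _Static_assert((256u & (256u - 1u)) == 0, "modulus must be a power of two");
    return MOD_POW2(x, 256u);   /* same as x % 256 for unsigned x */
}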
There is an overhead most of the time when using a modulus that is not a power of 2.
This is regardless of the processor, as (AFAIK) even processors with modulus operators take a few cycles more for a divide than for mask operations.
For most cases this is not an optimisation that is worth considering, and certainly not worth calculating your own shortcut operation (especially if it still involves divide or multiply).
However, one rule of thumb is to select array sizes etc. to be powers of 2.
so if calculating day of week, may as well use %7 regardless
if setting up a circular buffer of around 100 entries... why not make it 128. You can then write % 128 and most (all) compilers will make this & 0x7F
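A minimal sketch of that ring-buffer convention (the names are illustrative):
#define BUF_SIZE 128u   /* power of two on purpose */

/* Because the size is a constant power of two, the compiler turns the
   % into & 0x7F. */
unsigned next_index(unsigned i) { return (i + 1u) % BUF_SIZE; }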
Unless you really need high performance on multiple embedded platforms, don't change how you code for performance reasons until you profile!
Code that's written awkwardly to optimize for performance is hard to debug and hard to maintain. Write a test case, and profile it on your target. Once you know the actual cost of modulus, then decide if the alternate solution is worth coding.
#Matthew is right. Try this:
#include <stdio.h>

int main() {
    int i;
    for(i = 0; i <= 1024; i++) {
        if (!(i & 0xFF)) printf("& i = %d\n", i);
        if (!(i % 0x100)) printf("mod i = %d\n", i);
    }
    return 0;
}
x%y == (x-(x/y)*y)
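That identity as code (a sketch; potentially useful when the platform gives you a division routine but no remainder):
/* Recover the remainder from a single division; this matches % for the
   same operands under C's truncating division. */
int mod_via_div(int x, int y)
{
    return x - (x / y) * y;
}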
Hope this helps.
Do you have access to any programmable hardware on the embedded device? Like counters and such? If so, you might be able to write a hardware based mod unit, instead of using the simulated %. (I did that once in VHDL. Not sure if I still have the code though.)
Mind you, you did say that division was 5-10 times faster. Have you considered doing a division, multiplication, and subtraction to simulate the mod? (Edit: Misunderstood the original post. I did think it was odd that division would be faster than mod, since they are the same operation.)
In your specific case, though, you are checking for a mod of 6. 6 = 2*3. So you could MAYBE get some small gains if you first checked if the least significant bit was a 0. Something like:
if (!(x & 1) && !(x % 3))
{
print("Fizz\n");
}
If you do that, though, I'd recommend confirming that you get any gains, yay for profilers. And doing some commenting. I'd feel bad for the next guy who has to look at the code otherwise.
You should really check the embedded device you need. All the assembly language I have seen (x86, 68000) implement the modulus using a division.
Actually, the division assembly operation returns the result of the division and the remainder in two different registers.
In the embedded world, the "modulus" operations you need to do are often the ones that break down nicely into bit operations that you can do with &, | and sometimes >>.
#Jeff V: I see a problem with it! (Beyond that your original code was looking for a mod 6 and now you are essentially looking for a mod 8). You keep doing an extra +1! Hopefully your compiler optimizes that away, but why not just test start at 2 and go to MAXCOUNT inclusive? Finally, you are returning true every time that (x+1) is NOT divisible by 8. Is that what you want? (I assume it is, but just want to confirm.)
For modulo 6 you can change the Python code to C/C++:
def mod6(number):
while number > 7:
number = (number >> 3 << 1) + (number & 0x7)
if number > 5:
number -= 6
return number
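A literal C translation of that routine might look like this (my sketch, assuming an unsigned input):
/* Repeatedly fold 8*q + r into 2*q + r, since 8 is congruent to 2 mod 6. */
unsigned mod6(unsigned number)
{
    while (number > 7)
        number = ((number >> 3) << 1) + (number & 0x7u);
    if (number > 5)
        number -= 6;
    return number;
}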
Not that this is necessarily better, but you could have an inner loop which always goes up to FIZZ, and an outer loop which repeats it all some certain number of times. You've then perhaps got to special case the final few steps if MAXCOUNT is not evenly divisible by FIZZ.
That said, I'd suggest doing some research and performance profiling on your intended platforms to get a clear idea of the performance constraints you're under. There may be much more productive places to spend your optimisation effort.
The print statement will take orders of magnitude longer than even the slowest implementation of the modulus operator. So basically the comment "slow on some systems" should be "slow on all systems".
Also, the two code snippets provided don't do the same thing. In the second one, the line
if(fizzcount >= FIZZ)
is always false, so "Fizz\n" is never printed.