knuth multiplicative hash

knuth multiplicative hash - c++

Is this a correct implementation of the Knuth multiplicative hash?
int hash(int v)
{
v *= 2654435761;
return v >> 32;
}
Does overflow in the multiplication affect the algorithm?
How can the performance of this method be improved?

The Knuth multiplicative hash is used to compute a hash value in {0, 1, 2, ..., 2^p - 1} from an integer k.
Suppose that p is between 0 and 32. The algorithm goes like this:
Compute alpha as the closest integer to 2^32 (-1 + sqrt(5)) / 2. We get alpha = 2 654 435 769.
Compute k * alpha and reduce the result modulo 2^32:
k * alpha = n0 * 2^32 + n1 with 0 <= n1 < 2^32
Keep the highest p bits of n1:
n1 = m1 * 2^(32-p) + m2 with 0 <= m2 < 2^(32 - p)
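So, a correct implementation of the Knuth multiplicative algorithm in C++ is:
#include <cassert>
#include <cstdint>
std::uint32_t knuth(int x, int p) {
    assert(p >= 0 && p <= 32);
    if (p == 0) return 0;  // a 32-bit shift by 32 is undefined; with p = 0 the only slot is 0
    const std::uint32_t knuth = 2654435769u;
    const std::uint32_t y = x;  // converts modulo 2^32, so negative x is fine
    return (y * knuth) >> (32 - p);  // keep the highest p bits
}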
Forgetting to shift the result by (32 - p) is a major mistake, as you would lose all the good properties of the hash. It would transform an even sequence into an even sequence, which would be very bad as all the odd slots would stay unoccupied. That's like taking a good wine and mixing it with Coke. By the way, the web is full of people misquoting Knuth and using a multiplication by 2 654 435 761 without taking the higher bits. I just opened the Knuth and he never said such a thing. It looks like someone who thought he was being "smart" decided to take a prime number close to 2 654 435 769.
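To make the even-sequence problem concrete, here is a minimal sketch that compares the two reductions for a table of 2^p slots. The names `knuth_top_bits` and `knuth_low_bits` are made up for this illustration, and the second function is the mistake described above, shown only for contrast.

```cpp
#include <cassert>
#include <cstdint>

// Multiplier from the answer: closest integer to 2^32 * (sqrt(5) - 1) / 2.
constexpr std::uint32_t kAlpha = 2654435769u;

// Keeps the highest p bits of (x * alpha mod 2^32); valid for 1 <= p <= 32.
inline std::uint32_t knuth_top_bits(std::uint32_t x, int p) {
    assert(p >= 1 && p <= 32);
    return (x * kAlpha) >> (32 - p);
}

// The common mistake: keeps the lowest p bits instead (1 <= p <= 31 here).
inline std::uint32_t knuth_low_bits(std::uint32_t x, int p) {
    assert(p >= 1 && p <= 31);
    return (x * kAlpha) & ((std::uint32_t{1} << p) - 1u);
}
```

With p = 3, even keys only ever reach even slots through `knuth_low_bits`, because the multiplier is odd and the lowest bit of the product is the lowest bit of the key. `knuth_top_bits(2, 3)` already lands in slot 1.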
Bear in mind that most hash table implementations don't allow this kind of signature in their interface, as they only allow
uint32_t hash(int x);
and reduce hash(x) modulo 2^p to compute the hash value for x. Those hash tables cannot accept the Knuth multiplicative hash. This might be a reason why so many people completely ruined the algorithm by forgetting to take the higher p bits.
So you can't use the Knuth multiplicative hash with std::unordered_map or std::unordered_set. But I think that those hash tables use a prime number as a size, so the Knuth multiplicative hash is not useful in this case. Using hash(x) = x would be a good fit for those tables.
Source: "Introduction to Algorithms, third edition", Cormen et al., 11.3.2 p:263
Source: "The Art of Computer Programming, Volume 3, Sorting and Searching", D.E. Knuth, 6.4 p:516

Ok, I looked it up in TAOCP volume 3 (2nd edition), section 6.4, page 516.
This implementation is not correct, though as I mentioned in the comments it may give the correct result anyway.
A correct way (I think - feel free to read the relevant chapter of TAOCP and verify this) is something like this: (important: yes, you must shift the result right to reduce it, not use bitwise AND. However, that is not the responsibility of this function - range reduction is not properly part of hashing itself)
uint32_t hash(uint32_t v)
{
return v * UINT32_C(2654435761);
// do not comment about the lack of right shift. I'm not ignoring it. read on.
}
Note the uint32_t's (as opposed to int's) - they make sure the multiplication overflows modulo 2^32, as it is supposed to do if you choose 32 as the word size. There is also no right shift by k here, because there is no reason to give responsibility for range-reduction to the basic hashing function and it is actually more useful to get the full result. The constant 2654435761 is from the question, the actual suggested constant is 2654435769, but that's a small difference that as far as I know does not affect the quality of the hash.
Other valid implementations shift the result right by some amount (not the full word size though, that doesn't make sense and C++ doesn't like it), depending on how many bits of hash you need. Or they may use another constant (subject to certain conditions) or another word size. Reducing the hash modulo something is not a valid implementation but a common mistake, likely because it is the de facto standard way to do range reduction on a hash. The bottom bits of a multiplicative hash are the worst-quality bits (they depend on less of the input). You only want to use them if you really need more bits, while reducing the hash modulo a power of two would return only the worst bits. Indeed, that is equivalent to throwing away most of the input bits too. Reducing modulo a non-power-of-two is not so bad since it does mix in the higher bits, but it's not how the multiplicative hash was defined.
So to be clear, yes there is a right shift, but that is range reduction not hashing and can only be the responsibility of the hash table, since it depends on its internal size.
The type should be unsigned, otherwise the overflow is undefined behaviour (thus possibly wrong, not just on non-2's-complement architectures but also on overly clever compilers) and the optional right shift would be a signed shift (wrong).
On the page I mention at the top, there is this formula:
h(K) = floor(M * ((A / w) * K mod 1))
Here we have A = 2654435761 (or 2654435769), w = 2^32 and M = 2^32. Calculating AK/w gives a fixed-point result with the format Q32.32, and the mod 1 step takes only the 32 fraction bits. But that's just the same thing as doing a modular multiplication and then saying that the result is the fraction bits. Of course when multiplied by M, all the fraction bits become integer bits because of how M was chosen, and so it simplifies to just a plain old modular multiplication. When M is a lower power of two, that just right-shifts the result, as mentioned.
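The simplification described above can be checked in a few lines. This is only a sketch: `reduce_to_range` is a hypothetical helper name, and it assumes w = 2^32 with the 32-bit modular product standing for the fraction bits. It computes floor(M * fraction) with one 64-bit multiply, which for M = 2^p gives the same value as shifting right by 32 - p.

```cpp
#include <cassert>
#include <cstdint>

// h is the 32-bit modular product (A * K mod 2^32), read as the fraction h / 2^32.
// Returns floor(M * h / 2^32), which always lies in [0, M).
inline std::uint32_t reduce_to_range(std::uint32_t h, std::uint32_t M) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(h) * M) >> 32);
}
```

The same helper also works when M is not a power of two, which is the case where a plain shift is not available.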

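Might be late, but here's a Java implementation of Knuth's method:
For a hashtable of size N:
public long hash(int key) {
    // Multiply modulo 2^32; the low 32 bits of the product are the fraction bits.
    long fraction = (key * 2654435769L) & 0xFFFFFFFFL;
    // Scale the fraction to [0, N): floor(N * fraction / 2^32).
    return (fraction * N) >>> 32;
}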

If the input argument is a pointer then I use this:
#include <stdint.h>
uint32_t knuth_mul_hash(void* k) {
    uintptr_t v = (uintptr_t)k * UINT32_C(2654435761); // unsigned, so the product wraps instead of overflowing
    v >>= ((sizeof(uintptr_t) - sizeof(uint32_t)) * 8); // Right-shift v by the size difference between a pointer and a 32-bit integer (0 for x86, 32 for x64)
    return (uint32_t)(v & UINT32_MAX);
}
I usually use this as the default fallback hashing function in hashmap implementations, dictionaries, sets, etc...

Related

Operating on two shorts at once by combining them into an integer

I'm using the following code to map two signed 16-bit integers to the upper and lower 16 bits of an unsigned 32 bit integer.
inline uint32_t to_score(int16_t mg, int16_t eg) {
return ((1u * mg) << 16 | (eg & 0xFFFF));
}
inline int16_t extract_mg(uint32_t score) {
return int16_t(score >> 16);
}
inline int16_t extract_eg(uint32_t score) {
return int16_t(score & 0xFFFF);
}
I need to perform various calculations on both the mg and eg parts simultaneously, before interpolating the two parts at the end of a function.
As I understand it, as long as there is no overflow, it should be safe to add two uint32_ts created by to_score, and then extract the int16_ts to find the results of the individual calculations, i.e. the results I would get if I added the values for mg and eg separately.
I'm not sure whether this assumption holds if either mg or eg are negative, or whether this method can be used for subtraction, multiplication and/or division.
Which operations can I expect to function correctly? Are there alternative ways of representing two integers which can be added/subtracted/multiplied quickly?

There will be a problem with a carry going from the low half into the high half, but it can be avoided with extra operations, as detailed, for example, at chessprogramming.org/SIMD_and_SWAR_Techniques:
z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
Where in this case H = 0x80008000.
As another alternative, it could be done with two additions, but with optimized extraction/recombination:
// low half addition, leaving upper half corrupted but it will be ignored
l = x + y
// high half addition, adding 0 to the bottom so no carry
h = x + (y & 0xFFFF0000)
// recombine
z = (l & 0xFFFF) | (h & 0xFFFF0000)
Subtraction is a minor variation on addition.
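Both addition variants can be tried out against the packing helpers from the question. The sketch below is an illustration, not a tuned implementation: `swar_add` and `split_add` are names chosen here, and the extraction relies on the usual two's-complement narrowing to int16_t.

```cpp
#include <cassert>
#include <cstdint>

// Packing helpers as given in the question.
inline std::uint32_t to_score(std::int16_t mg, std::int16_t eg) {
    return ((1u * mg) << 16) | (eg & 0xFFFF);
}
inline std::int16_t extract_mg(std::uint32_t score) { return std::int16_t(score >> 16); }
inline std::int16_t extract_eg(std::uint32_t score) { return std::int16_t(score & 0xFFFF); }

// Lane-wise add: clear each lane's top bit, add, then patch the top bits back with XOR.
inline std::uint32_t swar_add(std::uint32_t x, std::uint32_t y) {
    const std::uint32_t H = 0x80008000u;
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}

// Two plain additions, then recombine the halves.
inline std::uint32_t split_add(std::uint32_t x, std::uint32_t y) {
    const std::uint32_t l = x + y;                  // low half correct, high half may absorb a carry
    const std::uint32_t h = x + (y & 0xFFFF0000u);  // high half correct, nothing added below
    return (l & 0xFFFFu) | (h & 0xFFFF0000u);
}
```

For example, adding to_score(0, -1) and to_score(0, 1) with a plain `+` leaks a carry and yields mg = 1, while both functions above return mg = 0 and eg = 0.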
Multiplication unfortunately cares about absolute bit-positions, so values have to be moved (shifted) to their notional position for it to work. Actual SIMD can still be used though, such as _mm_mullo_epi16 with SSE2.

C++ signed integers are two's complement in practice. C++20 makes this a requirement, and you may already assume it.
Some cases of addition and subtraction would work: those that cause none of the following: eg overflowing, mg overflowing, or mg changing sign.
The optimization does not make much sense.
If there's a larger array, you can try to get your operations vectorized with proper SIMD instructions, if they are available for your platform, by enabling compiler optimization or by using intrinsics (_mm_adds_pi16 might be the one you need).
If you have just two integers, just compute them one by one.

Obtain values from multiple distributions with a single generator roll

I am trying to implement the Alias method, also described here. This is an algorithm which allows sampling from a weighted N-sided die in O(1).
The algorithm calls for the generation of two values:
A uniformly distributed integer i in [0, N)
A uniformly distributed real y in [0, 1)
The paper specifies that these two numbers can be obtained from a single real number x in [0, N). From x one can then derive two values as:
i = floor(x)
y = x - i
Now, the other implementations that I have seen call for the random number generator two times, one to generate i and one to generate y. Given that I am using a fairly expensive generator (std::mt19937) and that I need to sample many times, I was wondering if there was a better approach in terms of performance, while preserving the quality of the result.
I'm not sure whether using a uniform_real_distribution to generate x makes sense: if N is large, then y's distribution is going to get sparser, as doubles are not uniformly distributed. Is there maybe a way to call the engine, get the random bits out, and then generate i and y from them directly?

You are correct, with their method the distribution of y will become less and less uniform with increasing N.
In fact, for N above 2^52 y will be exactly 0, as all numbers above that value are integers for double precision. 2^52 is 4,503,599,627,370,496 (4.5 quadrillion).
It will not matter at all for reasonable values of N though. You should be fine if your N is less than 2^26 (67 million), intuitively. Your die does not have an astronomical number of sides, does it?
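The loss of resolution is easy to see by splitting a few hand-picked values of x. This is a small sketch with a hypothetical helper name, `split_index_fraction`; it assumes IEEE 754 doubles, where values in [2^26, 2^27) are spaced 2^-26 apart, so y keeps roughly 52 - log2(N) bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Splits x in [0, N) into the integer part i and the fractional part y in [0, 1).
inline std::pair<std::uint64_t, double> split_index_fraction(double x) {
    const double whole = std::floor(x);
    return { static_cast<std::uint64_t>(whole), x - whole };
}
```

Near x = 2^26 the smallest nonzero fraction is 2^-26, and from 2^52 upward the fraction is always exactly 0.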

I had a similar problem, and I'll tell you how I solved it in my case. It might be applicable to you or not, but here is the story.
I didn't use any kind of 32-bit RNG: basically, there was no 32-bit platform or software to care about. So I used std::mt19937_64 as the baseline generator, one 64-bit unsigned int per call. Later I tried one of the PCG 64-bit RNGs, which was faster overall with a good outcome.
The top N bits are used directly for selection from the table (the die in your case). You could suffer from modulo bias, so I extended the table to an exact power of 2 (2^10 in my case, 10 bits for index sampling).
The remaining 54 bits were used to get a uniform double random number, following S. Vigna's suggestion.
If you need more than 11 bits for the index, you could either live with reduced randomness in the mantissa, or replace double y with a carefully crafted integer comparison.
Along those lines, some pseudocode (not tested!):
#include <cstdint>
#include <iostream>
#include <random>
int main() {
    const std::uint64_t mask = (1ULL << 53) - 1ULL;
    const std::uint64_t seed{ 98765432101ULL };
    auto rng = std::mt19937_64{seed};
    for (int k = 0; k != 1000; ++k) {
        auto rv = rng();
        auto idx = rv >> (64 - 10); // top 10 bits select the table index
        double y = (rv & mask) * (1. / (1ULL << 53)); // low 53 bits become a uniform double in [0, 1)
        std::cout << idx << "," << y << '\n';
    }
}
Reference to S.Vigna integer2double conversion for RNG: http://xoshiro.di.unimi.it/, at the very end of the page

Fast way to generate pseudo-random bits with a given probability of 0 or 1 for each bit

Normally, a random number generator returns a stream of bits for which the probability to observe a 0 or a 1 in each position is equal (i.e. 50%). Let's call this an unbiased PRNG.
I need to generate a string of pseudo-random bits with the following property: the probability to see a 1 in each position is p (i.e. the probability to see a 0 is 1-p). The parameter p is a real number between 0 and 1; in my problem it happens that it has a resolution of 0.5%, i.e. it can take the values 0%, 0.5%, 1%, 1.5%, ..., 99.5%, 100%.
Note that p is a probability and not an exact fraction. The actual number of bits set to 1 in a stream of n bits must follow the binomial distribution B(n, p).
There is a naive method that can use an unbiased PRNG to generate the value of each bit (pseudocode):
generate_biased_stream(n, p):
    result = []
    for i in 1 to n:
        if random_uniform(0, 1) < p:
            result.append(1)
        else:
            result.append(0)
    return result
Such an implementation is much slower than one generating an unbiased stream, since it calls the random number generator function once per bit, while an unbiased stream generator calls it once per word (e.g. it can generate 32 or 64 random bits with a single call).
I want a faster implementation, even if it sacrifices randomness slightly. An idea that comes to mind is to precompute a lookup table: for each of the 200 possible values of p, compute C 8-bit values using the slower algorithm and save them in a table. Then the fast algorithm would just pick one of these at random to generate 8 skewed bits.
A back of the envelope calculation to see how much memory is needed:
C should be at least 256 (the number of possible 8-bit values), probably more to avoid sampling effects; let's say 1024. Maybe the number should vary depending on p, but let's keep it simple and say the average is 1024.
Since there are 200 values of p => total memory usage is 200 KB. This is not bad, and might fit in the L2 cache (256 KB). I still need to evaluate it to see if there are sampling effects that introduce biases, in which case C will have to be increased.
A deficiency of this solution is that it can generate only 8 bits at once, even that with a lot of work, while an unbiased PRNG can generate 64 at once with just a few arithmetic instructions.
I would like to know if there is a faster method, based on bit operations instead of lookup tables. For example modifying the random number generation code directly to introduce a bias for each bit. This would achieve the same performance as an unbiased PRNG.
Edit March 5
Thank you all for your suggestions, I got a lot of interesting ideas and suggestions. Here are the top ones:
Change the problem requirements so that p has a resolution of 1/256 instead of 1/200. This allows using bits more efficiently, and also gives more opportunities for optimization. I think I can make this change.
Use arithmetic coding to efficiently consume bits from an unbiased generator. With the above change of resolution this becomes much easier.
A few people suggested that PRNGs are very fast, thus using arithmetic coding might actually make the code slower due to the introduced overhead. Instead I should always consume the worst-case number of bits and optimize that code. See the benchmarks below.
@rici suggested using SIMD. This is a nice idea, which works only if we always consume a fixed number of bits.
Benchmarks (without arithmetic decoding)
Note: as many of you have suggested, I changed the resolution from 1/200 to 1/256.
I wrote several implementations of the naive method that simply takes 8 random unbiased bits and generates 1 biased bit:
Without SIMD
With SIMD using Agner Fog's vectorclass library, as suggested by @rici
With SIMD using intrinsics
I use two unbiased pseudo-random number generators:
xorshift128plus
Ranvec1 (Mersenne Twister-like) from Agner Fog's library.
I also measure the speed of the unbiased PRNG for comparison. Here are the results:
RNG: Ranvec1(Mersenne Twister for Graphics Processors + Multiply with Carry)
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 16.081 16.125 16.093 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 0.778 0.783 0.812 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 2.176 2.184 2.145 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 2.129 2.151 2.183 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
SIMD increases performance by a factor of 3 compared to the scalar method. It is 8 times slower than the unbiased generator, as expected.
The fastest biased generator achieves 2.1 Gb/s.
RNG: xorshift128plus
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 18.300 21.486 21.483 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 22.660 22.661 24.662 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 1.065 1.102 1.078 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 4.972 4.971 4.970 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 4.955 4.971 4.971 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical : 104,857,600
For xorshift, SIMD increases performance by a factor of 5 compared to the scalar method. It is 4 times slower than the unbiased generator. Note that this is a scalar implementation of xorshift.
The fastest biased generator achieves 4.9 Gb/s.
RNG: xorshift128plus_avx2
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 18.754 21.494 21.878 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 54.126 54.071 54.145 [Gb/s]
Number of ones: 536,874,540 536,880,718 536,891,316
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 1.093 1.103 1.063 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 19.567 19.578 19.555 [Gb/s]
Number of ones: 104,836,115 104,846,215 104,835,129
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 19.551 19.589 19.557 [Gb/s]
Number of ones: 104,831,396 104,837,429 104,851,100
Theoretical : 104,857,600
This implementation uses AVX2 to run 4 unbiased xorshift generators in parallel.
The fastest biased generator achieves 19.5 Gb/s.
Benchmarks for arithmetic decoding
Simple tests show that the arithmetic decoding code is the bottleneck, not the PRNG. So I am only benchmarking the most expensive PRNG.
RNG: Ranvec1(Mersenne Twister for Graphics Processors + Multiply with Carry)
Method: Arithmetic decoding (floating point)
Gbps/s: 0.068 0.068 0.069 [Gb/s]
Number of ones: 10,235,580 10,235,580 10,235,580
Theoretical : 10,240,000
Method: Arithmetic decoding (fixed point)
Gbps/s: 0.263 0.263 0.263 [Gb/s]
Number of ones: 10,239,367 10,239,367 10,239,367
Theoretical : 10,240,000
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 12.687 12.686 12.684 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 14.536 14.536 14.536 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 0.754 0.754 0.754 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 2.094 2.095 2.094 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 2.094 2.094 2.095 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
The simple fixed point method achieves 0.25 Gb/s, while the naive scalar method is 3x faster, and the naive SIMD method is 8x faster. There might be ways to optimize and/or parallelize the arithmetic decoding method further, but due to its complexity I have decided to stop here and choose the naive SIMD implementation.
Thank you all for the help.

One thing you can do is to sample from the underlying unbiased generator multiple times, getting several 32-bit or 64-bit words, and then performing bitwise boolean arithmetic. As an example, for 4 words b1,b2,b3,b4, you can get the following distributions:
expression             | p(bit is 1)
-----------------------+-------------
b1 & b2 & b3 & b4      |  6.25%
b1 & b2 & b3           | 12.50%
b1 & b2 & (b3 | b4)    | 18.75%
b1 & b2                | 25.00%
b1 & (b2 | (b3 & b4))  | 31.25%
b1 & (b2 | b3)         | 37.50%
b1 & (b2 | b3 | b4)    | 43.75%
b1                     | 50.00%
Similar constructions can be made for finer resolutions. It gets a bit tedious and still requires more generator calls, but at least not one per bit. This is similar to a3f's answer, but is probably easier to implement and, I suspect, faster than scanning words for 0xF nybbles.
Note that for your desired 0.5% resolution, you would need 8 unbiased words for one biased word, which would give you a resolution of (0.5^8) = 0.390625%.
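One way to generate such expressions mechanically, rather than writing each row by hand, is to walk the binary digits of the numerator: OR in a fresh word for a 1 digit, AND for a 0 digit. The sketch below does this for sixteenths. `biased_word` is a name invented here, and the operand order differs from the table, although the probabilities are the same.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>

// Combines four independent unbiased words so that every output bit is 1
// with probability k/16, for k in 0..16.
inline std::uint64_t biased_word(unsigned k, const std::array<std::uint64_t, 4>& words) {
    if (k >= 16) return ~std::uint64_t{0};
    std::uint64_t r = 0;  // probability 0 so far
    for (int i = 0; i < 4; ++i) {
        // Digits of k, least significant first: OR maps p to (1 + p) / 2, AND maps p to p / 2.
        if ((k >> i) & 1u) r |= words[i];
        else               r &= words[i];
    }
    return r;
}
```

Feeding it the four words 0xAAAA, 0xCCCC, 0xF0F0 and 0xFF00, whose bit columns enumerate all 16 input combinations once, produces exactly k set bits, which is a convenient deterministic check.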

If you're prepared to approximate p based on 256 possible values, and you have a PRNG which can generate uniform values in which the individual bits are independent of each other, then you can use vectorized comparison to produce multiple biased bits from a single random number.
That's only worth doing if (1) you worry about random number quality and (2) you are likely to need a large number of bits with the same bias. The second requirement seems to be implied by the original question, which criticizes a proposed solution, as follows: "A deficiency of this solution is that it can generate only 8 bits at once, even that with a lot of work, while an unbiased PRNG can generate 64 at once with just a few arithmetic instructions." Here, the implication seems to be that it is useful to generate a large block of biased bits in a single call.
Random-number quality is a difficult subject. It's hard if not impossible to measure, and therefore different people will propose different metrics which emphasize and/or devalue different aspects of "randomness". It is generally possible to trade off speed of random-number generation for lower "quality"; whether this is worth doing depends on your precise application.
The simplest possible tests of random number quality involve the distribution of individual values and the cycle length of the generator. Standard implementations of the C library rand and Posix random functions will typically pass the distribution test, but the cycle lengths are not adequate for long-running applications.
These generators are typically extremely fast, though: the glibc implementation of random requires only a few cycles, while the classic linear congruential generator (LCG) requires a multiply and an addition. (Or, in the case of the glibc implementation, three of the above to generate 31 bits.) If that's sufficient for your quality requirements, then there is little point trying to optimize, particularly if the bias probability changes frequently.
Bear in mind that the cycle length should be a lot longer than the number of samples expected; ideally, it should be greater than the square of that number, so a linear-congruential generator (LCG) with a cycle length of 2^31 is not appropriate if you expect to generate gigabytes of random data. Even the Gnu trinomial nonlinear additive-feedback generator, whose cycle length is claimed to be approximately 2^35, shouldn't be used in applications which will require millions of samples.
Another quality issue, which is much harder to test, relates to the independence of consecutive samples. Short cycle lengths completely fail on this metric, because once the repeat starts, the generated random numbers are precisely correlated with historical values. The Gnu trinomial algorithm, although its cycle is longer, has a clear correlation as a result of the fact that the ith random number generated, r_i, is always one of the two values r_(i-3) + r_(i-31) or r_(i-3) + r_(i-31) + 1. This can have surprising or at least puzzling consequences, particularly with Bernoulli experiments.
Here's an implementation using Agner Fog's useful vector class library, which abstracts away a lot of the annoying details in SSE intrinsics, and also helpfully comes with a fast vectorized random number generator (found in special.zip inside the vectorclass.zip archive), which lets us generate 256 bits from eight calls to the 256-bit PRNG. You can read Dr. Fog's explanation of why he finds even the Mersenne twister to have quality issues, and his proposed solution; I'm not qualified to comment, really, but it does at least appear to give expected results in the Bernoulli experiments I have tried with it.
#include "vectorclass/vectorclass.h"
#include "vectorclass/ranvec1.h"
class BiasedBits {
  public:
    // Default constructor, seeded with fixed values
    BiasedBits() : BiasedBits(1) {}
    // Seed with a single seed; other possibilities exist.
    BiasedBits(int seed) : rng(3) { rng.init(seed); }
    // Generate 256 random bits, each with probability `p/256` of being 1.
    Vec8ui random256(unsigned p) {
        if (p >= 256) return Vec8ui{ 0xFFFFFFFF };
        Vec32c output{ 0 };
        Vec32c threshold{ 127 - p };
        for (int i = 0; i < 8; ++i) {
            output += output;
            output -= Vec32c(Vec32c(rng.uniform256()) > threshold);
        }
        return Vec8ui(output);
    }
  private:
    Ranvec1 rng;
};
In my test, that produced and counted 268435456 bits in 260 ms, or one bit per nanosecond. The test machine is an i5, so it doesn't have AVX2; YMMV.
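For readers without the vector class library, the same threshold idea can be sketched in scalar C++. This is not the library's API: `biased_byte` is a made-up name, it handles one 64-bit word at a time, and it compares each unsigned byte against p directly instead of comparing signed bytes against 127 - p.

```cpp
#include <cassert>
#include <cstdint>

// Turns the 8 bytes of one unbiased 64-bit word into 8 biased bits.
// Output bit i is 1 when byte i is below p, i.e. with probability p/256.
inline std::uint8_t biased_byte(std::uint64_t r, unsigned p) {
    unsigned out = 0;
    for (int i = 0; i < 8; ++i) {
        const unsigned byte = static_cast<unsigned>((r >> (8 * i)) & 0xFFu);
        if (byte < p) out |= 1u << i;
    }
    return static_cast<std::uint8_t>(out);
}
```

Each output bit consumes 8 input bits, which matches the "1/8 efficiency" of the naive method in the question's benchmarks.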
In the actual use case, with 201 possible values for p, the computation of 8-bit threshold values will be annoyingly imprecise. If that imprecision is undesired, you could adapt the above to use 16-bit thresholds, at the cost of generating twice as many random numbers.
Alternatively, you could hand-roll a vectorization based on 10-bit thresholds, which would give you a very good approximation to 0.5% increments, using the standard bit-manipulation hack of doing the vectorized threshold comparison by checking for borrow on every 10th bit of the subtraction of the vector of values and the repeated threshold. Combined with, say, std::mt19937_64, that would give you an average of six bits per 64-bit random number.

From an information-theoretic point of view, a biased stream of bits (with p != 0.5) has less information in it than an unbiased stream, so in theory it should take (on average) less than 1 bit of the unbiased input to produce a single bit of the biased output stream. For example, the entropy of a Bernoulli random variable with p = 0.1 is -0.1 * log2(0.1) - 0.9 * log2(0.9) bits, which is around 0.469 bits. That suggests that for the case p = 0.1 we should be able to produce a little over two bits of the output stream per unbiased input bit.
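The entropy figure is quick to reproduce. A minimal sketch, with `bernoulli_entropy` as an illustrative name:

```cpp
#include <cassert>
#include <cmath>

// Shannon entropy, in bits, of a Bernoulli variable with P(1) = p.
inline double bernoulli_entropy(double p) {
    if (p <= 0.0 || p >= 1.0) return 0.0;  // a constant stream carries no information
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}
```

For p = 0.1 this gives about 0.469, so the reciprocal, a little over 2.1, is the best achievable number of biased output bits per unbiased input bit.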
Below, I give two methods for producing the biased bits. Both achieve close to optimal efficiency, in the sense of requiring as few input unbiased bits as possible.
Method 1: arithmetic (de)coding
A practical method is to decode your unbiased input stream using arithmetic (de)coding, as already described in the answer from alexis. For this simple a case, it's not hard to code something up. Here's some unoptimised pseudocode (cough, Python) that does this:
import random
def random_bits():
    """
    Infinite generator generating a stream of random bits,
    with 0 and 1 having equal probability.
    """
    global bit_count  # keep track of how many bits were produced
    while True:
        bit_count += 1
        yield random.choice([0, 1])
def bernoulli(p):
    """
    Infinite generator generating 1-bits with probability p
    and 0-bits with probability 1 - p.
    """
    bits = random_bits()
    low, high = 0.0, 1.0
    while True:
        if high <= p:
            # Generate 1, rescale to map [0, p) to [0, 1)
            yield 1
            low, high = low / p, high / p
        elif low >= p:
            # Generate 0, rescale to map [p, 1) to [0, 1)
            yield 0
            low, high = (low - p) / (1 - p), (high - p) / (1 - p)
        else:
            # Use the next random bit to halve the current interval.
            mid = 0.5 * (low + high)
            if next(bits):
                low = mid
            else:
                high = mid
Here's an example usage:
import itertools
bit_count = 0
# Generate a million deviates.
results = list(itertools.islice(bernoulli(0.1), 10**6))
print("First 50:", ''.join(map(str, results[:50])))
print("Biased bits generated:", len(results))
print("Unbiased bits used:", bit_count)
print("mean:", sum(results) / len(results))
The above gives the following sample output:
First 50: 00000000000001000000000110010000001000000100010000
Biased bits generated: 1000000
Unbiased bits used: 469036
mean: 0.100012
As promised, we've generated 1 million bits of our output biased stream using fewer than five hundred thousand from the source unbiased stream.
For optimisation purposes, when translating this into C / C++ it may make sense to code this up using integer-based fixed-point arithmetic rather than floating-point.
Method 2: integer-based algorithm
Rather than trying to convert the arithmetic decoding method to use integers directly, here's a simpler approach. It's not quite arithmetic decoding any more, but it's not totally unrelated, and it achieves close to the same output-biased-bit / input-unbiased-bit ratio as the floating-point version above. It's organised so that all quantities fit into an unsigned 32-bit integer, so should be easy to translate to C / C++. The code is specialised to the case where p is an exact multiple of 1/200, but this approach would work for any p that can be expressed as a rational number with reasonably small denominator.
def bernoulli_int(p):
    """
    Infinite generator generating 1-bits with probability p
    and 0-bits with probability 1 - p.
    p should be an integer multiple of 1/200.
    """
    bits = random_bits()
    # Assuming that p has a resolution of 0.005, find p / 0.005.
    p_int = int(round(200*p))
    value, high = 0, 1
    while True:
        if high < 2**31:
            high = 2 * high
            value = 2 * value + next(bits)
        else:
            # Throw out everything beyond the last multiple of 200, to
            # avoid introducing a bias.
            discard = high - high % 200
            split = high // 200 * p_int
            if value >= discard:  # rarer than 1 time in 10 million
                value -= discard
                high -= discard
            elif value >= split:
                yield 0
                value -= split
                high = discard - split
            else:
                yield 1
                high = split
The key observation is that every time we reach the beginning of the while loop, value is uniformly distributed amongst all integers in [0, high), and is independent of all previously-output bits. If you care about speed more than perfect correctness, you can get rid of discard and the value >= discard branch: that's just there to ensure that we output 0 and 1 with exactly the right probabilities. Leave that complication out, and you'll just get almost the right probabilities instead. Also, if you make the resolution for p equal to 1/256 rather than 1/200, then the potentially time-consuming division and modulo operations can be replaced with bit operations.
With the same test code as before, but using bernoulli_int in place of bernoulli, I get the following results for p=0.1:
First 50: 00000010000000000100000000000000000000000110000100
Biased bits generated: 1000000
Unbiased bits used: 467997
mean: 0.099675
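Since the answer notes that this method should translate easily to C / C++, here is one possible translation. It is a sketch under assumptions that are not in the original: the class name `BernoulliInt` is invented, `std::mt19937` is used as the unbiased source, and its 32-bit outputs are consumed one bit at a time.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Emits 1 with probability p_num / 200, following the Python bernoulli_int above.
class BernoulliInt {
public:
    BernoulliInt(std::uint32_t p_num, std::uint32_t seed) : p_num_(p_num), rng_(seed) {
        assert(p_num <= 200);
    }

    int next() {
        for (;;) {
            if (high_ < (std::uint32_t{1} << 31)) {
                // value_ is uniform in [0, high_); pull in one more unbiased bit.
                high_ *= 2;
                value_ = 2 * value_ + next_bit();
                continue;
            }
            const std::uint32_t discard = high_ - high_ % 200;  // last multiple of 200
            const std::uint32_t split = high_ / 200 * p_num_;
            if (value_ >= discard) {       // rare: keep the leftover, emit nothing
                value_ -= discard;
                high_ -= discard;
            } else if (value_ >= split) {  // emit 0
                value_ -= split;
                high_ = discard - split;
                return 0;
            } else {                       // emit 1
                high_ = split;
                return 1;
            }
        }
    }

    // Number of unbiased bits consumed so far.
    std::uint64_t bits_used() const { return bits_used_; }

private:
    std::uint32_t next_bit() {
        if (avail_ == 0) {
            buffer_ = static_cast<std::uint32_t>(rng_());
            avail_ = 32;
        }
        const std::uint32_t bit = buffer_ & 1u;
        buffer_ >>= 1;
        --avail_;
        ++bits_used_;
        return bit;
    }

    std::uint32_t p_num_;
    std::mt19937 rng_;
    std::uint32_t value_ = 0;
    std::uint32_t high_ = 1;
    std::uint32_t buffer_ = 0;
    int avail_ = 0;
    std::uint64_t bits_used_ = 0;
};
```

Constructing it with p_num = 20 corresponds to p = 0.1. As in the Python version, all quantities stay within an unsigned 32-bit integer, because high_ is only doubled while it is below 2^31.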

Let's say the probability of a 1 appearing is 6.25% (1/16). There are 16 possible bit patterns for a 4-bit number:
0000,0001, ..., 1110,1111.
Now, just generate a random number like you used to and replace every 1111 at a nibble-boundary with a 1 and turn everything else to a 0.
Adjust accordingly for other probabilities.
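For the 1/16 case, the nibble replacement could look like the following sketch. The function name `nibbles_to_bits` is an assumption made for this example; it packs the 16 resulting bits into one 16-bit value.

```cpp
#include <cassert>
#include <cstdint>

// Maps one unbiased 64-bit word to 16 biased bits: output bit i is 1 exactly
// when nibble i of the input is 1111, which happens with probability 1/16.
inline std::uint16_t nibbles_to_bits(std::uint64_t word) {
    unsigned out = 0;
    for (int i = 0; i < 16; ++i) {
        if (((word >> (4 * i)) & 0xFu) == 0xFu) out |= 1u << i;
    }
    return static_cast<std::uint16_t>(out);
}
```

Other probabilities of the form k/16 follow by accepting k of the 16 nibble patterns instead of just 1111.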

You'll get theoretically optimal behavior, i.e. make truly minimal use of the random number generator and be able to model any probability p exactly, if you approach this using arithmetic coding.
Arithmetic coding is a form of data compression that represents the message as a sub-interval of a number range. It provides theoretically optimal encoding, and can use a fractional number of bits for each input symbol.
The idea is this: Imagine that you have a sequence of random bits, which are 1 with probability p. For convenience, I will instead use q for the probability of the bit being zero (q = 1 - p). Arithmetic coding assigns to each bit part of the number range. For the first bit, assign the interval [0, q) if the input is 0, and the interval [q, 1) if the input is 1. Subsequent bits assign proportional sub-intervals of the current range. For example, suppose that q = 1/3. The input 1 0 0 will be encoded like this:
Initially [0, 1), range = 1
After 1 [0.333, 1), range = 0.6666
After 0 [0.333, 0.5555), range = 0.2222
After 0 [0.333, 0.407407), range = 0.074074
The first digit, 1, selects the top two-thirds (1-q) of the range; the second digit, 0, selects the bottom third of that, and so on.
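The interval table above can be reproduced with a few lines of exact arithmetic. This is just a sketch of the interval narrowing step (the function name encode_interval is mine), not a full encoder: it does not emit compressed bits or handle an EOF symbol.

```python
from fractions import Fraction

def encode_interval(bits, q):
    """Narrow [0, 1) according to bits, where q is the probability of a 0.

    Returns the final interval as (low, high)."""
    low, width = Fraction(0), Fraction(1)
    for b in bits:
        if b == 0:
            width = width * q            # keep the bottom q of the range
        else:
            low = low + width * q        # skip the bottom q ...
            width = width * (1 - q)      # ... and keep the top 1 - q
    return low, low + width

if __name__ == "__main__":
    low, high = encode_interval([1, 0, 0], Fraction(1, 3))
    print(low, high, float(high - low))  # 1/3 11/27 0.07407...
```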
After the first and second step, the interval straddles the midpoint; but after the third step it is entirely below the midpoint, so the first compressed digit can be output: 0. The process continues, and a special EOF symbol is added as a terminator.
What does this have to do with your problem? The compressed output will have random zeros and ones with equal probability. So, to obtain bits with probability p, just pretend that the output of your RNG is the result of arithmetic coding as above, and apply the decoder process to it. That is, read bits as if they subdivide the line interval into smaller and smaller pieces. For example, after we read 01 from the RNG, we will be in the range [0.25, 0.5). Keep reading bits until enough output is "decoded". Since you're mimicking decompressing, you'll get more random bits out than you put in. Because arithmetic coding is theoretically optimal, there's no possible way to turn the RNG output into more biased bits without sacrificing randomness: you're getting the true maximum.
The catch is that you can't do this in a couple of lines of code, and I don't know of a library I can point you to (though there must be some you could use). Still, it's pretty simple. The above article provides code for a general-purpose encoder and decoder, in C. It's pretty straightforward, and it supports multiple input symbols with arbitrary probabilities; in your case a far simpler implementation is possible (as Mark Dickinson's answer now shows), since the probability model is trivial. For extended use, a bit more work would be needed to produce a robust implementation that does not do a lot of floating-point computation for each bit.
Wikipedia also has an interesting discussion of arithmetic encoding considered as change of radix, which is another way to view your task.

Uh, pseudo-random number generators are generally quite fast. I'm not sure what language this is (Python, perhaps), but "result.append" (which almost certainly contains memory allocation) is likely slower than "random_uniform" (which just does a little math).
If you want to optimize the performance of this code:
Verify that it is a problem. Optimizations are a bit of work and make the code harder to maintain. Don't do them unless necessary.
Profile it. Run some tests to determine which parts of the code are actually the slowest. Those are the parts you need to speed up.
Make your changes, and verify that they actually are faster. Compilers are pretty smart; often clear code will compile into better code than something complex that might appear faster.
If you are working in a compiled language (even JIT compiled), you take a performance hit for every transfer of control (if, while, function call, etc). Eliminate what you can. Memory allocation is also (usually) quite expensive.
If you are working in an interpreted language, all bets are off. The simplest code is very likely the best. The overhead of the interpreter will dwarf whatever you are doing, so reduce its work as much as possible.
I can only guess where your performance problems are:
Memory allocation. Pre-allocate the array at its full size and fill in the entries later. This ensures that the memory won't need to be reallocated while you're adding the entries.
Branches. You might be able to avoid the "if" by casting the result or something similar. This will depend a lot on the compiler. Check the assembly (or profile) to verify that it does what you want.
Numeric types. Find out the type your random number generator uses natively, and do your arithmetic in that type. For example, if the generator naturally returns 32-bit unsigned integers, scale "p" to that range first, then use it for the comparison.
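To make the "memory allocation" and "numeric types" points concrete, here is a rough sketch. It assumes the generator natively hands out 32-bit unsigned integers (modelled here with Python's rng.getrandbits(32)); the function name biased_bits is made up for the example. Whether this is actually faster than what you have is something only a profiler can tell you.

```python
import random

def biased_bits(n, p, rng):
    """Return a bytearray of n values, each 1 with probability about p."""
    threshold = int(p * 2**32)   # scale p once into the generator's range
    out = bytearray(n)           # pre-allocated at full size, all zeros
    for i in range(n):
        # Pure integer comparison; no float conversion per sample.
        if rng.getrandbits(32) < threshold:
            out[i] = 1
    return out

if __name__ == "__main__":
    bits = biased_bits(200000, 0.1, random.Random(42))
    print("mean:", sum(bits) / len(bits))
```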
By the way, if you really want to use the least bits of randomness possible, use "arithmetic coding" to decode your random stream. It won't be fast.

One way that would give a precise result is to first randomly generate for a k-bit block the number of 1 bits following the binomial distribution, and then generate a k-bit word with exactly that many bits using one of the methods here. For example the method by mic006 needs only about log k k-bit random numbers, and mine needs only one.

If p is close to 0, you can calculate the probability that the n-th bit is the first bit that is 1; then you calculate a random number between 0 and 1 and pick n accordingly. For example if p = 0.005 (0.5%), and the random number is 0.638128, you might calculate (I'm guessing here) n = 321, so you fill with 321 0 bits and one bit set.
If p is close to 1, use 1-p instead of p, and set 1 bits plus one 0 bit.
If p isn't close to 1 or 0, make a table of all 256 sequences of 8 bits, calculate their cumulative probabilities, then get a random number, do a binary search in the array of cumulative probabilities, and you can set 8 bits.
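For the first case (p close to 0), one possible way to pick n is to invert the geometric distribution. The sketch below is my own reading of the idea, not a tested production routine: the name sparse_biased_bits is hypothetical, and it draws the number of 0 bits before each 1 as floor(log(u) / log(1 - p)) for a uniform u in (0, 1].

```python
import math
import random

def sparse_biased_bits(n, p, rng):
    """Return a list of n bits, each 1 with probability p (0 < p < 1),
    using roughly one random number per 1 bit instead of one per bit."""
    bits = [0] * n
    log_q = math.log1p(-p)            # log(1 - p), which is negative
    i = -1
    while True:
        u = 1.0 - rng.random()        # uniform in (0, 1], avoids log(0)
        gap = int(math.log(u) / log_q)  # number of 0 bits before the next 1
        i += gap + 1
        if i >= n:
            break
        bits[i] = 1
    return bits

if __name__ == "__main__":
    bits = sparse_biased_bits(1000000, 0.005, random.Random(3))
    print("ones:", sum(bits))
```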

Assuming that you have access to a generator of random bits, you can generate a value to compare with p bit by bit, and abort as soon as you can prove that the generated value is less-than or greater-or-equal-to p.
Proceed as follows to create one item in a stream with given probability p:
Start with 0. in binary
Append a random bit; assuming that a 1 has been drawn, you'll get 0.1
If the result (in binary notation) is provably smaller than p, output a 1
If the result is provably larger or equal to p, output a 0
Otherwise (if neither can be ruled out), proceed with step 2.
Let's assume that p in binary notation is 0.1001101...; if this process generates any of 0.0, 0.1000, 0.10010, ..., the value can no longer become larger than or equal to p; if any of 0.11, 0.101, 0.100111, ... is generated, the value can no longer become smaller than p.
To me, it looks like this method uses about two random bits in expectation. Arithmetic coding (as shown in the answer by Mark Dickinson) consumes at most one random bit per biased bit (on average) for fixed p; the cost of modifying p is unclear.
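The steps above can be sketched as follows, under the assumption that p is given as a finite list of its binary digits after the point (the names bernoulli_lazy and p_bits are made up for this example). The function also returns how many random bits it consumed, so the "about two bits in expectation" estimate can be checked empirically.

```python
import random

def bernoulli_lazy(p_bits, rng):
    """Return (bit, used): bit is 1 with probability p, where p_bits holds
    the binary digits of p after the point; used counts random bits drawn."""
    used = 0
    for p_digit in p_bits:
        r = rng.getrandbits(1)
        used += 1
        if r < p_digit:
            return 1, used   # value is provably smaller than p
        if r > p_digit:
            return 0, used   # value is provably larger than p
    # Every digit matched: the value is at least p, so output 0.
    return 0, used

if __name__ == "__main__":
    rng = random.Random(7)
    p_bits = [1, 0, 0, 1, 1, 0, 1]    # p = 0.1001101 in binary = 0.6015625
    results = [bernoulli_lazy(p_bits, rng) for _ in range(200000)]
    print("mean bit:", sum(b for b, _ in results) / len(results))
    print("mean bits used:", sum(u for _, u in results) / len(results))
```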

What it does
This implementation makes a single call to the random device kernel module, via the "/dev/urandom" special character file, to get the amount of random data needed to represent all values at the given resolution. The maximum possible resolution is 1/256^2, so that 0.005 can be represented by:
328/256^2,
i.e:
resolution: 256*256
x: 328
with error 0.000004883.
How it does that
The implementation calculates bits_per_byte, the number of uniformly distributed bits needed to handle the given resolution, i.e. to represent all #resolution values. It then makes a single call to the randomization device ("/dev/urandom" if URANDOM_DEVICE is defined; otherwise it reads "/dev/random", which draws additional noise from device drivers and may block if there is not enough entropy) to get the required number of uniformly distributed bytes, and fills the array rnd_bytes with them. Finally, for each Bernoulli sample it reads the needed bits from the next bytes_per_byte bytes of rnd_bytes and compares the integer value of these bits to the probability of success in a single Bernoulli outcome, given by x/resolution. If the value hits, i.e. it falls in a segment of length x/resolution, which we arbitrarily choose to be [0, x/resolution), then we note a success and insert 1 into the resulting array.
Read from random device:
/* if defined use /dev/urandom (will not block),
* if not defined use /dev/random (may block)*/
#define URANDOM_DEVICE 1
/*
* @brief Read #outlen bytes from random device
* to array #out.
*/
int
get_random_samples(char *out, size_t outlen)
{
ssize_t res;
#ifdef URANDOM_DEVICE
int fd = open("/dev/urandom", O_RDONLY);
if (fd == -1) return -1;
res = read(fd, out, outlen);
if (res < 0) {
close(fd);
return -2;
}
#else
size_t read_n;
int fd = open("/dev/random", O_RDONLY);
if (fd == -1) return -1;
read_n = 0;
while (read_n < outlen) {
res = read(fd, out + read_n, outlen - read_n);
if (res < 0) {
close(fd);
return -3;
}
read_n += res;
}
#endif /* URANDOM_DEVICE */
close(fd);
return 0;
}
Fill in vector of Bernoulli samples:
/*
* @brief Draw vector of Bernoulli samples.
* @details #x and #resolution determine the probability
* of success in the Bernoulli distribution
* and the accuracy of results: p = x/resolution.
* @param resolution: number of segments per sample of output array
* as power of 2: max resolution supported is 2^24=16777216
* @param x: determines used probability, x = [0, resolution - 1]
* @param n: number of samples in result vector
*/
int
get_bernoulli_samples(char *out, uint32_t n, uint32_t resolution, uint32_t x)
{
int res;
size_t i, j;
uint32_t bytes_per_byte, word;
unsigned char *rnd_bytes;
uint32_t uniform_byte;
uint8_t bits_per_byte;
if (out == NULL || n == 0 || resolution == 0 || x > (resolution - 1))
return -1;
bits_per_byte = log_int(resolution);
bytes_per_byte = bits_per_byte / BITS_PER_BYTE +
(bits_per_byte % BITS_PER_BYTE ? 1 : 0);
rnd_bytes = malloc(n * bytes_per_byte);
if (rnd_bytes == NULL)
return -2;
res = get_random_samples(rnd_bytes, n * bytes_per_byte);
if (res < 0)
{
free(rnd_bytes);
return -3;
}
i = 0;
while (i < n)
{
/* get Bernoulli sample */
/* read byte */
j = 0;
word = 0;
while (j < bytes_per_byte)
{
word |= ((uint32_t)rnd_bytes[i * bytes_per_byte + j] << (BITS_PER_BYTE * j)); /* cast so the shift is done in an unsigned 32-bit type */
++j;
}
uniform_byte = word & ((1u << bits_per_byte) - 1);
/* decision */
if (uniform_byte < x)
out[i] = 1;
else
out[i] = 0;
++i;
}
free(rnd_bytes);
return 0;
}
Usage:
int
main(void)
{
int res;
char c[256];
res = get_bernoulli_samples(c, sizeof(c), 256*256, 328); /* 328/(256^2) = 0.0050 */
if (res < 0) return -1;
return 0;
}
Complete code, results.

Although this question is 5 years old, I believe I have something of value to add. While SIMD and arithmetic decoding are undoubtedly great techniques, it's hard to ignore that the bitwise boolean arithmetic suggested by @mindriot is very simple and easy to grasp.
However, it's not immediately apparent how you would go about efficiently and quickly implementing this solution. For a resolution of 1/256 (0.00390625), you could write a switch statement with 256 cases and then determine the required boolean expression by hand for each case. It would take a while to program this, but it will compile down to a very fast jump table in C/C++.
But what if you want a resolution of 1/2^16, or even 1/2^64? The latter is a resolution of 5.4210109E-20, more precise than most of us would ever need. The task is absolutely impossible by hand, but we can actually construct a small virtual machine to do this quickly in just 30 lines of C code.
Let's construct the machine for a resolution of 1/256. I'll define probability = resolution/256, e.g., when resolution = 64, then probability = 0.25. As it turns out, the numerator (resolution) actually implicitly encodes the required boolean operations in its binary representation.
For example, what expression generates probability = 0.69140625 = 177/256? The resolution is 177, which in binary is 10110001. Let AND = 0 and OR = 1. We start after the first nonzero least significant bit and read toward the most significant bit. Map the 0/1 to AND/OR. Thus, starting from b1 and reading right to left, we generate the boolean expression (((((((b1 and b2) and b3) and b4) or b5) or b6) and b7) or b8). A computer-generated truth table will confirm 177 cases yield True. To give another example, probability = 0.4375 = 112/256 gives the resolution in binary as 01110000. Reading the 3 bits in order after the first nonzero LSB (011) gives (((b1 | b2) | b3) & b4).
Since all we need are the two AND and OR operations, and since the resolution encodes the exact boolean expression we need, a virtual machine can be programmed which interprets the resolution as bitcode. AND and OR are just opcodes that act immediately on the output of an unbiased random number generator. Here is my sample C code:
uint64_t rng_bias (uint64_t *state, const uint8_t resolution)
{
if (state == NULL) return 0;
//registers
uint64_t R0 = 0;
uint8_t PC = __builtin_ctz(resolution|0x80);
//opcodes
enum
{
OP_ANDI = 0,
OP_ORI = 1,
};
//execute instructions in sequence from LSB -> MSB
while (PC != (uint8_t) 0x8)
{
switch((resolution >> PC++) & (uint8_t) 0x1)
{
case OP_ANDI:
R0 &= rng_generator(state);
break;
case OP_ORI:
R0 |= rng_generator(state);
break;
}
}
return R0;
}
The virtual machine is nothing more than 2 registers and 2 opcodes. I am using GCC's builtin function ctz which counts the trailing zero bits so that I can easily find the first nonzero LSB. I bitwise-or the ctz argument with 0x80 because passing zero is undefined. Any other decent compiler should have a similar function. Notice, that unlike the examples I showed by hand, the VM interprets the bitcode starting on the first nonzero LSB, not after. This is because I need to make at least one call to the PRNG to generate the base p=0.5 and p=0.0 cases.
The state pointer and rng_generator() calls are used to interface with your random number generator. For example, for demonstration purposes I can use Marsaglia's Xorshift64:
uint64_t rng_generator(uint64_t *state)
{
uint64_t x = *state;
x ^= x << 13;
x ^= x >> 7;
x ^= x << 17;
return *state = x;
}
All the user/you need to do is manage a separate uint64_t state variable, which must be appropriately seeded prior to using either function.
It is extremely easy to scale to a resolution of 1/2^64 or whatever other arbitrary resolution is desired. Use __builtin_ctzll instead for unsigned long long arguments, change the uint8_t types to uint64_t, and change the while loop check to 64 instead of 8. That's it! Now with at most 64 calls to the PRNG, which is fairly fast, we have access to 5.4210109E-20 resolution.
The key here is that we get the bitcode practically for free. No lexing, parsing, or any other typical VM interpreter tasks. The user provides it via the resolution, without ever realizing. As far as they're concerned, it's just the numerator of the probability. As far as we, the implementers, are concerned, it's nothing more than a string of bitcode for our VM to interpret.
Explaining why the bitcode works requires a whole different and much longer essay. In probability theory, the problem is to determine the generating event (the set of all sample points) of a given probability, not unlike the usual inverse CDF problem for generating random numbers from a density function. From a computer science viewpoint, in the 1/256 resolution case, we are traversing a depth-8 binary tree where each node represents a probability. The root node is p=0.5. Left traversal indicates AND operations, right traversal indicates OR. The traversal and node depth map directly to the LSB->MSB bit encoding that we discussed several paragraphs before.
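Short of the full essay, one quick sanity check is to run the same opcode sequence on probabilities instead of on random words. The sketch below is my own illustration (the name output_probability is hypothetical): if the current bit is 1 with probability p, then ANDing with a fresh fair bit gives p/2 and ORing gives (1 + p)/2, so the VM can be traced exactly with fractions.

```python
from fractions import Fraction

def output_probability(resolution, width=8):
    """Probability that one output bit of the VM is 1, for the numerator
    `resolution` out of 2**width, following the VM's opcode order."""
    masked = resolution | (1 << (width - 1))     # same trick as | 0x80
    pc = (masked & -masked).bit_length() - 1     # count trailing zeros
    p = Fraction(0)                              # R0 starts as all zeros
    while pc != width:
        if (resolution >> pc) & 1:
            p = (1 + p) / 2    # OP_ORI: OR with an independent fair bit
        else:
            p = p / 2          # OP_ANDI: AND with an independent fair bit
        pc += 1
    return p

if __name__ == "__main__":
    print(output_probability(177))   # 177/256
    print(all(output_probability(r) == Fraction(r, 256) for r in range(256)))
```

Each step computes p_new = (bit + p) / 2, which is exactly how a binary fraction is built up from its least significant digit, and that is why the numerator doubles as the program.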

Analysis of the usage of prime numbers in hash functions

I was studying hash-based sort and I found that using prime numbers in a hash function is considered a good idea, because multiplying each character of the key by a prime number and adding the results up would produce a unique value (because primes are unique) and a prime number like 31 would produce better distribution of keys.
key(s) = s[0]*31^(len-1) + s[1]*31^(len-2) + ... + s[len-1]
Sample code:
public int hashCode( )
{
int h = hash;
if (h == 0)
{
for (int i = 0; i < chars.length; i++)
{
h = MULT*h + chars[i];
}
hash = h;
}
return h;
}
I would like to understand why the use of even numbers for multiplying each character is a bad idea in the context of this explanation below (found on another forum; it sounds like a good explanation, but I'm failing to grasp it). If the reasoning below is not valid, I would appreciate a simpler explanation.
Suppose MULT were 26, and consider
hashing a hundred-character string.
How much influence does the string's
first character have on the final
value of 'h'? The first character's value
will have been multiplied by MULT 99
times, so if the arithmetic were done
in infinite precision the value would
consist of some jumble of bits
followed by 99 low-order zero bits --
each time you multiply by MULT you
introduce another low-order zero,
right? The computer's finite
arithmetic just chops away all the
excess high-order bits, so the first
character's actual contribution to 'h'
is ... precisely zero! The 'h' value
depends only on the rightmost 32
string characters (assuming a 32-bit
int), and even then things are not
wonderful: the first of those final 32
bytes influences only the leftmost bit
of `h' and has no effect on the
remaining 31. Clearly, an even-valued
MULT is a poor idea.

I think it's easier to see if you use 2 instead of 26. They both have the same effect on the lowest-order bit of h. Consider a 33 character string of some character c followed by 32 zero bytes (for illustrative purposes). Since the string isn't wholly null you'd hope the hash would be nonzero.
For the first character, your computed hash h is equal to c[0]. For the second character, you take h * 2 + c[1]. So now h is 2*c[0]. For the third character h is now h*2 + c[2] which works out to 4*c[0]. Repeat this 30 more times, and you can see that the multiplier uses more bits than are available in your destination, meaning effectively c[0] had no impact on the final hash at all.
The end math works out exactly the same with a different multiplier like 26, except that the intermediate hashes will modulo 2^32 every so often during the process. Since 26 is even it still adds one 0 bit to the low end each iteration.
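This is easy to check experimentally. The sketch below mimics 32-bit wraparound by masking after every step (poly_hash is just an illustrative name, not a library function) and compares two 33-character strings that differ only in their first character.

```python
def poly_hash(s, mult, bits=32):
    """h = mult*h + c for each character, truncated to `bits` bits."""
    mask = (1 << bits) - 1
    h = 0
    for ch in s:
        h = (mult * h + ord(ch)) & mask
    return h

if __name__ == "__main__":
    a = "a" + "x" * 32
    b = "b" + "x" * 32
    # Even multipliers: the first character has been shifted out entirely.
    print(poly_hash(a, 2) == poly_hash(b, 2))     # True
    print(poly_hash(a, 26) == poly_hash(b, 26))   # True
    # Odd multiplier: the first character still influences the result.
    print(poly_hash(a, 31) == poly_hash(b, 31))   # False
```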

This hash can be described like this (here ^ is exponentiation, not xor).
hash(string) = sum_over_i(s[i] * MULT^(strlen(s) - i - 1)) % (2^32).
Look at the contribution of the first character. It's
(s[0] * MULT^(strlen(s) - 1)) % (2^32).
If MULT is even and the string is long enough (strlen(s) > 32) then this is zero, because MULT^(strlen(s) - 1) is then a multiple of 2^32.

Other people have posted the answer -- if you use an even multiplier, then only the last characters in the string matter for computing the hash, as the early characters' influence will have shifted out of the register.
Now lets consider what happens when you use a multiplier like 31. Well, 31 is 32-1 or 2^5 - 1. So when you use that, your final hash value will be:
\sum_i c_i 2^{5(len-i)} - \sum_i c_i
Unfortunately, Stack Overflow doesn't understand TeX math notation, so the above is hard to read, but it's two summations over the characters in the string, where the first one shifts each character by 5 bits for each subsequent character in the string. So on a 32-bit machine, that will shift off the top for all except the last seven characters of the string.
The upshot of this is that using a multiplier of 31 means that while characters other than the last seven have an effect on the hash, it's completely independent of their order. If you take two strings that have the same last 7 characters, and whose other characters are also the same but in a different order, you'll get the same hash for both. You'll also get the same hash for things like "az" and "by" other than in the last 7 chars.
So using a prime multiplier, while much better than an even multiplier, is still not very good. Better is to use a rotate instruction, which shifts the bits back into the bottom when they shift out the top. Something like:
public unsigned hashCode(string chars)
{
unsigned h = 0;
for (int i = 0; i < chars.length; i++) {
h = (h<<5) + (h>>27); // ROL by 5, assuming 32 bits here
h += chars[i];
}
return h;
}
Of course, this depends on your compiler being smart enough to recognize the idiom for a rotate instruction and turn it into a single instruction for maximum efficiency.
This also still has the problem that swapping 32-character blocks in the string will give the same hash value, so it's far from strong, but probably adequate for most non-cryptographic purposes.

would produce a unique value
Stop right there. Hashes are not unique. A good hash algorithm will minimize collisions, but the pigeonhole principle assures us that perfectly avoiding collisions is not possible (for any datatype with non-trivial information content).

Is there any alternative to using % (modulus) in C/C++?

I read somewhere once that the modulus operator is inefficient on small embedded devices like 8 bit micro-controllers that do not have an integer division instruction. Perhaps someone can confirm this, but I thought it was 5-10 times slower than an integer division operation.
Is there another way to do this other than keeping a counter variable and manually overflowing to 0 at the mod point?
const int FIZZ = 6;
for(int x = 0; x < MAXCOUNT; x++)
{
if(!(x % FIZZ)) print("Fizz\n"); // slow on some systems
}
vs:
The way I am currently doing it:
const int FIZZ = 6;
int fizzcount = 1;
for(int x = 1; x < MAXCOUNT; x++)
{
if(fizzcount >= FIZZ)
{
print("Fizz\n");
fizzcount = 0;
}
}

Ah, the joys of bitwise arithmetic. A side effect of many division routines is the modulus - so in few cases should division actually be faster than modulus. I'm interested to see the source you got this information from. Processors with multipliers have interesting division routines using the multiplier, but you can get from division result to modulus with just another two steps (multiply and subtract) so it's still comparable. If the processor has a built in division routine you'll likely see it also provides the remainder.
Still, there is a small branch of number theory devoted to Modular Arithmetic which requires study if you really want to understand how to optimize a modulus operation. Modular arithmetic, for instance, is very handy for generating magic squares.
So, in that vein, here's a very low level look at the math of modulus for an example of x, which should show you how simple it can be compared to division:
Maybe a better way to think about the problem is in terms of number
bases and modulo arithmetic. For example, your goal is to compute DOW
mod 7 where DOW is the 16-bit representation of the day of the
week. You can write this as:
DOW = DOW_HI*256 + DOW_LO
DOW%7 = (DOW_HI*256 + DOW_LO) % 7
= ((DOW_HI*256)%7 + (DOW_LO % 7)) %7
= ((DOW_HI%7 * 256%7) + (DOW_LO%7)) %7
= ((DOW_HI%7 * 4) + (DOW_LO%7)) %7
Expressed in this manner, you can separately compute the modulo 7
result for the high and low bytes. Multiply the result for the high by
4 and add it to the low and then finally compute result modulo 7.
Computing the mod 7 result of an 8-bit number can be performed in a
similar fashion. You can write an 8-bit number in octal like so:
X = a*64 + b*8 + c
Where a, b, and c are 3-bit numbers.
X%7 = ((a%7)*(64%7) + (b%7)*(8%7) + c%7) % 7
= (a%7 + b%7 + c%7) % 7
= (a + b + c) % 7
since 64%7 = 8%7 = 1
Of course, a, b, and c are
c = X & 7
b = (X>>3) & 7
a = (X>>6) & 7 // (actually, a is only 2-bits).
The largest possible value for a+b+c is 7+7+3 = 17. So, you'll need
one more octal step. The complete (untested) C version could be
written like:
unsigned char Mod7Byte(unsigned char X)
{
X = (X&7) + ((X>>3)&7) + (X>>6);
X = (X&7) + (X>>3);
return X>=7 ? X-7 : X; // X can be 7 or 8 here (e.g. input 127 gives 15, then 8)
}
I spent a few moments writing a PIC version. The actual implementation
is slightly different than described above
Mod7Byte:
movwf temp1 ;
andlw 7 ;W=c
movwf temp2 ;temp2=c
rlncf temp1,F ;
swapf temp1,W ;W= a*8+b
andlw 0x1F
addwf temp2,W ;W= a*8+b+c
movwf temp2 ;temp2 is now a 6-bit number
andlw 0x38 ;get the high 3 bits == a'
xorwf temp2,F ;temp2 now has the 3 low bits == b'
rlncf WREG,F ;shift the high bits right 4
swapf WREG,F ;
addwf temp2,W ;W = a' + b'
; at this point, W is between 0 and 10
addlw -7
bc Mod7Byte_L2
Mod7Byte_L1:
addlw 7
Mod7Byte_L2:
return
Here's a little routine to test the algorithm
clrf x
clrf count
TestLoop:
movf x,W
RCALL Mod7Byte
cpfseq count
bra fail
incf count,W
xorlw 7
skpz
xorlw 7
movwf count
incfsz x,F
bra TestLoop
passed:
Finally, for the 16-bit result (which I have not tested), you could
write:
uint16 Mod7Word(uint16 X)
{
return Mod7Byte(Mod7Byte(X & 0xff) + Mod7Byte(X>>8)*4);
}
Scott
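Since the C versions in the quoted post are marked as untested, here is a small exhaustive check of the same digit-folding idea, written as a sketch with my own function names (mod7_byte, mod7_word). One detail worth noting: after the second folding step the value can still be 7 or 8 (for example 127 folds to 15 and then to 8), so this sketch finishes with a conditional subtraction of 7, much like the addlw -7 at the end of the PIC routine.

```python
def mod7_byte(x):
    """x % 7 for 0 <= x <= 255 using only shifts, masks and adds."""
    x = (x & 7) + ((x >> 3) & 7) + (x >> 6)   # sum of octal digits, at most 17
    x = (x & 7) + (x >> 3)                    # fold once more, at most 8
    return x - 7 if x >= 7 else x

def mod7_word(x):
    """x % 7 for 0 <= x <= 65535, using 256 % 7 == 4 for the high byte."""
    return mod7_byte(mod7_byte(x & 0xFF) + mod7_byte(x >> 8) * 4)

if __name__ == "__main__":
    print(all(mod7_byte(x) == x % 7 for x in range(256)))
    print(all(mod7_word(x) == x % 7 for x in range(65536)))
```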

If you are calculating a number mod some power of two, you can use the bit-wise and operator. Just subtract one from the second number. For example:
x % 8 == x & 7
x % 256 == x & 255
A few caveats:
This only works if the second number is a power of two.
It's only equivalent if the modulus is always positive. Older C and C++ standards don't specify the sign of the modulus when the first number is negative (C99 and C++11 guarantee that it takes the sign of the first number, so it will be negative or zero, which is what most compilers were already doing). A bit-wise and gets rid of the sign bit, so it will always be positive (i.e. it's a true modulus, not a remainder). It sounds like that's what you want anyway though.
Your compiler probably already does this when it can, so in most cases it's not worth doing it manually.
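To illustrate the caveat about negative numbers, here is a small sketch. Python's own % already behaves like a true modulus, so the C behaviour is emulated with a helper I made up for this example (c_remainder), which truncates toward zero the way C99 and C++11 do.

```python
def c_remainder(x, n):
    """x % n with C99/C++11 semantics for n > 0: the result has the sign of x."""
    r = abs(x) % n
    return -r if x < 0 else r

if __name__ == "__main__":
    # For non-negative x, the mask and the remainder agree.
    print(all(c_remainder(x, 8) == (x & 7) for x in range(1000)))   # True
    # For negative x they differ: C gives -3, the mask gives 5.
    print(c_remainder(-3, 8), -3 & 7)
```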

There is an overhead most of the time in using a modulus that is not a power of 2.
This is regardless of the processor, as (AFAIK) even processors with modulus operators are a few cycles slower for divide as opposed to mask operations.
For most cases this is not an optimisation that is worth considering, and certainly not worth calculating your own shortcut operation (especially if it still involves divide or multiply).
However, one rule of thumb is to select array sizes etc. to be powers of 2.
So if calculating the day of the week, you may as well use % 7 regardless.
If setting up a circular buffer of around 100 entries, why not make it 128? You can then write % 128 and most (all) compilers will make this & 0x7F.
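As a rough illustration of the circular buffer suggestion (the class and method names here are made up for the example), the wrap-around index becomes a single mask once the capacity is a power of two:

```python
class RingBuffer:
    """Fixed-size circular buffer whose capacity is 2**size_log2."""

    def __init__(self, size_log2):
        self.mask = (1 << size_log2) - 1        # e.g. 0x7F for 128 entries
        self.slots = [None] * (self.mask + 1)
        self.count = 0                          # total number of pushes so far

    def push(self, item):
        # count & mask is the same as count % capacity, without a division.
        self.slots[self.count & self.mask] = item
        self.count += 1

    def latest(self):
        return self.slots[(self.count - 1) & self.mask]

if __name__ == "__main__":
    rb = RingBuffer(7)          # 128 entries instead of "around 100"
    for i in range(300):
        rb.push(i)
    print(len(rb.slots), rb.latest())
```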

Unless you really need high performance on multiple embedded platforms, don't change how you code for performance reasons until you profile!
Code that's written awkwardly to optimize for performance is hard to debug and hard to maintain. Write a test case, and profile it on your target. Once you know the actual cost of modulus, then decide if the alternate solution is worth coding.

@Matthew is right. Try this:
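#include <stdio.h>

int main() {
    int i;
    for(i = 0; i<=1024; i++) {
        /* Both tests fire for the same values of i: 0, 256, 512, 768, 1024 */
        if (!(i & 0xFF)) printf("& i = %d\n", i);
        if (!(i % 0x100)) printf("mod i = %d\n", i);
    }
    return 0;
}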

x%y == (x-(x/y)*y)
Hope this helps.

Do you have access to any programmable hardware on the embedded device? Like counters and such? If so, you might be able to write a hardware based mod unit, instead of using the simulated %. (I did that once in VHDL. Not sure if I still have the code though.)
Mind you, you did say that division was 5-10 times faster. Have you considered doing a division, multiplication, and subtraction to simulate the mod? (Edit: Misunderstood the original post. I did think it was odd that division was faster than mod, they are the same operation.)
In your specific case, though, you are checking for a mod of 6. 6 = 2*3. So you could MAYBE get some small gains if you first checked if the least significant bit was a 0. Something like:
if((!(x & 1)) && (!(x % 3)))
{
print("Fizz\n");
}
If you do that, though, I'd recommend confirming that you get any gains, yay for profilers. And doing some commenting. I'd feel bad for the next guy who has to look at the code otherwise.

You should really check the embedded device you need. All the assembly languages I have seen (x86, 68000) implement the modulus using a division.
Actually, the division assembly operation returns the result of the division and the remainder in two different registers.

In the embedded world, the "modulus" operations you need to do are often the ones that break down nicely into bit operations that you can do with &, | and sometimes >>.

@Jeff V: I see a problem with it! (Beyond that your original code was looking for a mod 6 and now you are essentially looking for a mod 8). You keep doing an extra +1! Hopefully your compiler optimizes that away, but why not just start at 2 and go to MAXCOUNT inclusive? Finally, you are returning true every time that (x+1) is NOT divisible by 8. Is that what you want? (I assume it is, but just want to confirm.)

For modulo 6 you can change the Python code to C/C++:
def mod6(number):
    while number > 7:
        number = (number >> 3 << 1) + (number & 0x7)
    if number > 5:
        number -= 6
    return number

Not that this is necessarily better, but you could have an inner loop which always goes up to FIZZ, and an outer loop which repeats it all some certain number of times. You've then perhaps got to special case the final few steps if MAXCOUNT is not evenly divisible by FIZZ.
That said, I'd suggest doing some research and performance profiling on your intended platforms to get a clear idea of the performance constraints you're under. There may be much more productive places to spend your optimisation effort.

The print statement will take orders of magnitude longer than even the slowest implementation of the modulus operator. So basically the comment "slow on some systems" should be "slow on all systems".
Also, the two code snippets provided don't do the same thing. In the second one, the line
if(fizzcount >= FIZZ)
is always false so "Fizz\n" is never printed.
