I am attempting to vectorize this fairly expensive function (scalar version now working!):
template<typename N, typename POW>
inline constexpr bool isPower(const N n, const POW p) noexcept
{
double x = std::log(static_cast<double>(n)) / std::log(static_cast<double>(p));
return (x - std::trunc(x)) < 0.000001;
}//End of isPower
Here's what I have so far (for 32-bit int only):
template<typename RETURN_T>
inline RETURN_T count_powers_of(const std::vector<int32_t>& arr, const int32_t power)
{
RETURN_T cnt = 0;
const __m256 _MAGIC = _mm256_set1_ps(0.000001f);
const __m256 _POWER_D = _mm256_set1_ps(static_cast<float>(power));
const __m256 LOG_OF_POWER = _mm256_log_ps(_POWER_D);
__m256i _count = _mm256_setzero_si256();
__m256i _N_INT = _mm256_setzero_si256();
__m256 _N_DBL = _mm256_setzero_ps();
__m256 LOG_OF_N = _mm256_setzero_ps();
__m256 DIVIDE_LOG = _mm256_setzero_ps();
__m256 TRUNCATED = _mm256_setzero_ps();
__m256 CMP_MASK = _mm256_setzero_ps();
const size_t end = arr.size();
for (size_t i = 0uz; (i + 8uz) < end; i += 8uz)
{
//Set Values
_N_INT = _mm256_load_si256((__m256i*) &arr[i]);
_N_DBL = _mm256_cvtepi32_ps(_N_INT);
LOG_OF_N = _mm256_log_ps(_N_DBL);
DIVIDE_LOG = _mm256_div_ps(LOG_OF_N, LOG_OF_POWER);
TRUNCATED = _mm256_sub_ps(DIVIDE_LOG, _mm256_trunc_ps(DIVIDE_LOG));
CMP_MASK = _mm256_cmp_ps(TRUNCATED, _MAGIC, _CMP_LT_OQ);
_count = _mm256_sub_epi32(_count, _mm256_castps_si256(CMP_MASK));
}//End for
cnt = static_cast<RETURN_T>(util::_mm256_sum_epi32(_count));
return cnt;
}//End of count_powers_of
The scalar version runs in about 14.1 seconds.
The scalar version called from std::count_if with par_unseq runs in 4.5 seconds.
The vectorized version runs in just 155 milliseconds but produces the wrong result, albeit one that is vastly closer now.
Testing:
int64_t count = 0;
for (size_t i = 0; i < vec.size(); ++i)
{
if (isPower(vec[i], 4))
{
++count;
}//End if
}//End for
std::cout << "Counted " << count << " powers of 4.\n";//produces 4,996,215 powers of 4 in a vector of 1 billion 32-bit ints consisting of a uniform distribution of 0 to 1000
std::cout << "Counted " << count_powers_of<int32_t>(vec, 4) << " powers of 4.\n";//produces 4,996,865 powers of 4 on the same array
This new, vastly simplified code often produces results that are slightly off from the correct number of powers found (usually higher). I think the problem is my reinterpret cast from __m256 to __m256i, but when I try to use a conversion (with floor) instead I get a number that's way off (in the billions again).
It could also be this sum function (based on code by @PeterCordes):
inline uint32_t _mm_sum_epi32(__m128i& x)
{
__m128i hi64 = _mm_unpackhi_epi64(x, x);
__m128i sum64 = _mm_add_epi32(hi64, x);
__m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
__m128i sum32 = _mm_add_epi32(sum64, hi32);
return _mm_cvtsi128_si32(sum32);
}
inline uint32_t _mm256_sum_epi32(__m256i& v)
{
__m128i sum128 = _mm_add_epi32(
_mm256_castsi256_si128(v),
_mm256_extracti128_si256(v, 1));
return _mm_sum_epi32(sum128);
}
I know this has got to be a floating-point precision/comparison issue; Is there a better way to approach this?
Thanks for all your insights and suggestions thus far.
A more sensible unit-test would be non-random: check all powers in a loop to make sure they're all true, like x *= base;, and count how many powers there are <= n. Then check all numbers from 0..n in a loop, once each, to verify the right total. If both those checks succeed, that means it returned false in all the cases it should have, otherwise the count would be wrong.
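A minimal sketch of that two-part check, assuming the predicate under test can be called as pred(n, base). The names is_power_exact and check_power_predicate are made up for illustration, and the exact integer predicate is only there as a reference to compare against:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Exact integer reference (hypothetical helper): true when n == base^k for some k >= 0.
bool is_power_exact(std::uint64_t n, std::uint64_t base) {
    assert(base >= 2);
    if (n == 0) return false;
    while (n % base == 0) n /= base;  // strip factors of base
    return n == 1;
}

// Returns true when pred passes both checks for all values in [0, limit].
// Intended for small limits: the second loop visits every value once.
bool check_power_predicate(const std::function<bool(std::uint64_t, std::uint64_t)>& pred,
                           std::uint64_t base, std::uint64_t limit) {
    assert(base >= 2);
    // Check 1: every exact power <= limit must be accepted; count them.
    std::uint64_t expected = 0;
    for (std::uint64_t x = 1; x <= limit; x *= base) {
        if (!pred(x, base)) return false;
        ++expected;
        if (x > limit / base) break;  // the next power would exceed limit
    }
    // Check 2: the number of accepted values in 0..limit must match,
    // which rules out false positives.
    std::uint64_t counted = 0;
    for (std::uint64_t n = 0; n <= limit; ++n) {
        if (pred(n, base)) ++counted;
    }
    return counted == expected;
}
```

Passing the floating-point isPower as pred (wrapped in a lambda) would show directly whether it has false positives or false negatives for a given base and limit.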
Re: the original version:
This seems to depend on there being no floating-point rounding error. You do d == (N)d which (if N is an integral type) checks that the ratio of two logs is an exact integer; even 1 bit in the mantissa will make it unequal. Hardly surprising that a different log implementation would give different results, if one has different rounding error.
Except your scalar code at least is even more broken because it takes d = floor(log ratio) so it's already always an exact integer.
I just tried your scalar version for a testcase like return isPower(5, 4) to ask if 5 is a power of 4. It returns true: https://godbolt.org/z/aMT94ro6o . So yeah, your code is super broken, and is in fact only checking that n>0 or something. That would explain why 999 of 1000 of your "random" inputs from 0..999 were counted as powers of 4, which is obviously super broken.
I think it's impossible to achieve correctness with your FP log ratio idea: FP rounding error means you can't expect exact equality, but allowing a range would probably let in non-exact powers.
You might want to special-case integral N with a power-of-2 pow. That can go vastly faster by checking that n has a single bit set ((n & (n-1)) == 0, with n != 0) and that the bit is at a valid position (e.g. for pow=4, (n & 0b...01010101) != 0). You can construct the constant by multiplying and adding until overflow or something. Or 32/pow times? Anyway, one psubd/pand/pcmpeqd, pand/pcmpeqd, and pand/psubd per 8 elements, with maybe some room to optimize that further.
Otherwise, in the general case, you can brute-force check 32-bit integers one at a time against the 32 or fewer possible powers that fit in an int32_t. e.g. broadcast-load, 4x vpcmpeqd / vpsubd into multiple accumulators. (The smallest possible base, 2, can have powers up to 2^31 and still fit in an unsigned int.) log_3(2^31) is about 19.6, so you'd only need three AVX2 vectors of powers. Or log_4(2^31) is 15.5, so you'd only need 2 vectors to hold every non-overflowing power.
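A scalar sketch of the power-of-2-base special case, to make the two conditions concrete. The function names and the shift parameter (log2 of the base, so 2 for base 4) are my own for illustration. For base 4 the valid positions are the even bit positions, i.e. the mask 0x55555555:

```cpp
#include <cassert>
#include <cstdint>

// Mask with a 1 at every bit position that is a multiple of `shift`,
// where shift = log2(base). For shift = 2 (base 4) this is 0x55555555.
inline std::uint32_t valid_position_mask(unsigned shift) {
    assert(shift >= 1 && shift <= 31);
    std::uint32_t mask = 0;
    for (unsigned bit = 0; bit < 32; bit += shift) {
        mask |= std::uint32_t{1} << bit;
    }
    return mask;
}

// True when n is a power of 2^shift: exactly one bit set, at a valid position.
inline bool is_power_of_pow2_base(std::uint32_t n, unsigned shift) {
    const bool single_bit = n != 0 && (n & (n - 1)) == 0;
    return single_bit && (n & valid_position_mask(shift)) != 0;
}
```

In a vectorized loop the mask would be computed once outside the loop and broadcast; this version recomputes it per call only to stay short.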
That only handles 1 input element per vector instead of 4 doubles, but it's probably faster than your current FP attempt, as well as fixing the correctness problems. I could see that running more than 4x the throughput per iteration of what you're doing now, or even 8x, so it should be good for speed. And of course has the advantage that correctness is possible!!
Speed gets even better for bases of 4 or greater, only 2x compare/sub per input element, or 1x for bases of 16 or greater. (<= 8 elements to compare against can fit in one vector).
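Here is a portable scalar sketch of the compare-against-every-power idea, not the AVX2 version itself. The names powers_that_fit and count_powers_exact are hypothetical. The inner loop plays the role of the vpcmpeqd compares against the broadcast element:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// All powers base^0, base^1, ... that fit in a non-negative int32_t.
std::vector<std::int32_t> powers_that_fit(std::int32_t base) {
    assert(base >= 2);
    std::vector<std::int32_t> powers;
    // 64-bit accumulator so the last multiplication cannot overflow.
    for (std::int64_t p = 1; p <= std::numeric_limits<std::int32_t>::max(); p *= base) {
        powers.push_back(static_cast<std::int32_t>(p));
    }
    return powers;
}

// Counts the elements of values that are exact powers of base.
std::size_t count_powers_exact(const std::vector<std::int32_t>& values, std::int32_t base) {
    const std::vector<std::int32_t> powers = powers_that_fit(base);
    std::size_t count = 0;
    for (std::int32_t v : values) {
        // Branch-free: OR together the equality results, at most 31 compares.
        std::size_t hit = 0;
        for (std::int32_t p : powers) {
            hit |= static_cast<std::size_t>(v == p);
        }
        count += hit;
    }
    return count;
}
```

Because only integer equality is involved, the result is exact for every int32_t input, including 0 and negative values (which are never counted).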
Implementation mistakes in the attempt to vectorize this probably-unfixable algorithm:
_mm256_rem_epi32 is a slow library function, but you're using it with a constant divisor of 2! Integer mod 2 is just n & 1 for non-negative n. Or if you need to handle negative remainders, you can use the tricks compilers use to implement int % 2: https://godbolt.org/z/b89eWqEzK where it shifts down the sign bit as a correction to do signed division.
Updated version using (x - std::trunc(x)) < 0.000001;
This might work, especially if you limit it to small n. I'd worry that with large n, the difference between an exact power and off-by-1 would be a small ratio. (I haven't really looked at the details, though.)
Your vectorization with __m256 vectors of single-precision float is doomed for large n, but could be ok for small n: float32 can't represent every int32_t, so large odd integers (above 2^24) get rounded to multiples of 2, or multiples of 4 above 2^25, etc.
float has less relative precision in general, so it might not have enough to spare for this algorithm. Or maybe there's something that could be fixed, IDK, I haven't looked closely since the update.
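The rounding of large integers is easy to see in isolation. A small sketch (the helper name is made up) that reports whether an int32_t survives the trip through float:

```cpp
#include <cstdint>

// True when converting n to float and back gives the same integer,
// i.e. when n is exactly representable as a single-precision float.
bool survives_float_round_trip(std::int32_t n) {
    const float f = static_cast<float>(n);
    // Compare in 64 bits so values that round up to 2^31 stay in range.
    return static_cast<std::int64_t>(f) == static_cast<std::int64_t>(n);
}
```

With the usual IEEE round-to-nearest-even mode, 16777216 (2^24) and 16777218 survive, while 16777217 (2^24 + 1) does not, so _mm256_cvtepi32_ps would already feed the log a different number than the one in the array.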
I'd still recommend trying a simple compare-for-equality against all possible powers in the range, broadcast-loading each element. That will definitely work exactly, and if it's as fast then there's no need to try to fix this version using FP logs.
__m256 _N_DBL = _mm256_setzero_ps(); is a confusing name; it's a vector of float, not double. (And it's not part of a standard library header so it shouldn't be using a leading underscore.)
Also, there's zero point initializing it with zero there, since it gets written unconditionally inside the loop. In fact it's only ever used inside the loop, so it could just be declared at that scope, when you're ready to give it a value. Only declare variables in outer scopes if you need them after a loop.
I am currently running a multithreading simulation application with 8+ pipes (threads). These pipes run a very complex code that depends on a random sequence generated by a seed. The sequence is then boiled down to a single 0/1.
I want this "random processing" to be 100% deterministic after passing a seed to the processing pipe from the main thread. So, I can replicate the results in a second run.
So, for example: (I have this coded and it works)
Pipe 1 -> Seed: 123 -> Result: 0
Pipe 2 -> Seed: 123 -> Result: 0
Pipe 3 -> Seed: 589 -> Result: 1
The problem arises when I need to run 100M or more of these processes and then average the results. It may be the case only 1 of the 100M is a 1, and the rest are 0.
Obviously, I cannot sample 100M random values with 32-bit seeds fed to srand().
Is it possible to seed srand() with a 64-bit seed in VS2010, or to use an equivalent approach?
Does rand() repeat itself after 2^32 values, or does it not (i.e. does it have some hidden inner state)?
Thanks
You can use C++11's random facilities to generate random numbers of a given size and seed size, though the process is a bit too complicated to summarize here.
For example, you can construct a std::mt19937_64 (a typedef for std::mersenne_twister_engine<uint64_t, ...>) and seed it with a 64-bit integer, then acquire random numbers within a specified distribution, which seems to be what you're looking for.
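A minimal sketch of what that could look like per pipe. run_pipe and its reduction to a single bit are placeholders for the real simulation, not something from the question. Each call owns its engine, so the result depends only on the seed and not on thread scheduling:

```cpp
#include <cstdint>
#include <random>

// Hypothetical stand-in for one simulation run: consumes `draws` random
// values from an engine seeded with a 64-bit seed and boils them down to 0/1.
int run_pipe(std::uint64_t seed, int draws) {
    std::mt19937_64 engine(seed);  // state is private to this call
    std::uint64_t acc = 0;
    for (int i = 0; i < draws; ++i) {
        acc ^= engine();           // placeholder for the real processing
    }
    return static_cast<int>(acc >> 63);  // top bit as the 0/1 result
}
```

One caveat if results must match across compilers: the raw engine output is fixed by the standard, but the sequences produced by the distribution classes (std::uniform_int_distribution and friends) are implementation-defined.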
A simple 64-bit LCG should meet your needs. Bit n (counting from the least significant as bit 1) of an LCG has period at most (and, if parameters are chosen correctly, then exactly) 2^n, so avoid using the lower bits if you don't need them, and/or use a tempering function on the output. A sample implementation can be found in my answer to another question here:
https://stackoverflow.com/a/19083740/379897
And reposted:
static uint32_t temper(uint32_t x)
{
x ^= x>>11;
x ^= x<<7 & 0x9D2C5680;
x ^= x<<15 & 0xEFC60000;
x ^= x>>18;
return x;
}
uint32_t lcg64_temper(uint64_t *seed)
{
*seed = 6364136223846793005ULL * *seed + 1;
return temper(*seed >> 32);
}
You could use an xorshift pseudorandom number generator.
It is fast and works a treat. This is the actual generation part from my implementation class of it. I found the information on this algorithm in a Wikipedia search on pseudorandom number generators...
uint64_t XRS_64::generate(void)
{
seed ^= seed >> 12; // a
seed ^= seed << 25; // b
seed ^= seed >> 27; // c
return seed * UINT64_C(2685821657736338717);
}
it is fast and for initialisation you do that inside the constructor
XRS_64::XRS_64()
{
seed = 6394358446697381921;
}
seed is an unsigned int 64 bit variable and it is declared inside the class.
class XRS_64
{
public:
XRS_64();
~XRS_64();
void init(uint64_t newseed);
uint64_t generate();
private :
uint64_t seed; /* The state must be seeded with a nonzero value. */
};
I can't answer your questions, but if you find out you can't do what you want, you can implement your own pseudo-random algorithm generator which takes a uint64_t as a seed.
There are better algorithms for this purpose if you want some more serious generator (for cryptography purposes, for instance), but LCG is the easiest I've seen to be implemented.
EDIT
Actually you cannot use a 64-bit seed for the rand() function. You will have to write your own generator. This Wikipedia table lists some parameters, including those Donald Knuth used for MMIX. Be aware that, depending on the parameters you use, your generator's period may be much smaller than 2^64. Also, the multiplications overflow 64 bits: with a modulus of 2^64 the wraparound of unsigned 64-bit arithmetic does the reduction for you, but for other moduli you may need wider arithmetic or a big-number library.
My recommendation is that you take direct control over the process and set up your own high-quality random number generator. None of the answers here have been properly tested or validated - and that is an important criterion that needs to be taken into account.
High-quality random number generators can be made for large periods even on 16-bit and 32-bit machines by just running several of them in parallel - subject to certain preconditions. This is described, in further depth, here
P. L'Ecuyer, "Efficient and portable combined random number generators", CACM 31(6), June 1988, 742-751.
with testing & validation results also provided. Accessible versions of the article can be found on the net.
For a 32-bit implementation the recommendation issued there was to take M₀ = 1 + 2×3×7×631×81031 (= 2³¹ - 85) and M₁ = 1 + 2×19×31×1019×1789 (= 2³¹ - 249) to produce a random number generator of period (M₀ - 1)(M₁ - 1)/2 = 2×3×7×19×31×631×1019×1789×81031 = 2⁶¹ - 360777242114. They also posted a recommendation for 16-bit CPUs.
The seeds are updated as (S₀, S₁) ← ((A₀×S₀) mod M₀, (A₁×S₁) mod M₁), and a 32-bit value may be produced from this as S₀ - S₁ with the result adjusted upward by M₀ - 1 if S₀ ≤ S₁. If (S₀, S₁) is initialized to integer values in the interval [1,M₀)×[1,M₁), then it remains in that interval with each update (a seed of 0 would stay at 0 forever). You'll have to modify the output value to suit your needs, since their version is specifically geared toward producing strictly positive results, with no 0's.
The preconditions are that (M₀ - 1)/2 and (M₁ - 1)/2 be relatively prime and that A₀² < M₀, A₁² < M₁; and the values (A₀, A₁) = (40014, 40692) were recommended, based on their analysis. Also listed were optimized routines that allow all the computations to be done with 16-bit or 32-bit arithmetic.
For 32-bits the updates were done as (S₀, S₁) ← (A₀×(S₀ - K₀×Q₀) - K₀×R₀, A₁×(S₁ - K₁×Q₁) - K₁×R₁) with any S₀ < 0 or S₁ < 0 results adjusted upward, respectively, to S₀ + M₀ or S₁ + M₁; where (K₀, K₁) = (S₀ div Q₀, S₁ div Q₁), (Q₀, Q₁) = (M₀ div A₀, M₁ div A₁) and (R₀, R₁) = (M₀ mod A₀, M₁ mod A₁).
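As a rough sketch of the update and output rules above, using only 32-bit signed arithmetic. The struct and function names are mine, Q and R are computed from M and A rather than hard-coded, and the seeds are assumed to be non-zero:

```cpp
#include <cassert>
#include <cstdint>

// Moduli and multipliers quoted above.
constexpr std::int32_t M0 = 2147483563, A0 = 40014;
constexpr std::int32_t M1 = 2147483399, A1 = 40692;

struct CombinedLcg {
    std::int32_t s0;  // assumed to be in [1, M0 - 1]
    std::int32_t s1;  // assumed to be in [1, M1 - 1]
};

// (a * s) mod m without leaving 32-bit signed range, using
// q = m div a, r = m mod a, k = s div q as in the text. Needs a*a < m.
inline std::int32_t mul_mod_32(std::int32_t a, std::int32_t s, std::int32_t m) {
    assert(s > 0 && s < m);
    const std::int32_t q = m / a;
    const std::int32_t r = m % a;
    const std::int32_t k = s / q;
    std::int32_t next = a * (s - k * q) - k * r;
    if (next < 0) next += m;  // adjust negative results upward
    return next;
}

// Advances both generators and returns a value in [1, M0 - 1].
inline std::int32_t next_value(CombinedLcg& g) {
    g.s0 = mul_mod_32(A0, g.s0, M0);
    g.s1 = mul_mod_32(A1, g.s1, M1);
    std::int32_t z = g.s0 - g.s1;
    if (z < 1) z += M0 - 1;   // the S0 <= S1 case
    return z;
}
```

On a machine with 64-bit integers the body of mul_mod_32 could simply be (int64_t(a) * s) % m; the decomposition only matters when the product must not exceed 32 bits.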
In this StackOverflow question:
Generating random integer from a range
the accepted answer suggests the following formula for generating a random integer in between given min and max, with min and max being included into the range:
output = min + (rand() % (int)(max - min + 1))
But it also says that
This is still slightly biased towards lower numbers ... It's also
possible to extend it so that it removes the bias.
But it doesn't explain why it's biased towards lower numbers or how to remove the bias. So, the question is: is this the most optimal approach to generation of a random integer within a (signed) range while not relying on anything fancy, just rand() function, and in case if it is optimal, how to remove the bias?
EDIT:
I've just tested the while-loop algorithm suggested by @Joey against floating-point extrapolation:
static const double s_invRandMax = 1.0/((double)RAND_MAX + 1.0);
return min + (int)(((double)(max + 1 - min))*rand()*s_invRandMax);
to see how uniformly "balls" fall into a number of "buckets", with one test for the floating-point extrapolation and another for the while-loop algorithm. But the results varied with the number of "balls" (and "buckets"), so I couldn't easily pick a winner. The working code can be found at this Ideone page. For example, with 10 buckets and 100 balls the maximum deviation from the ideal probability among buckets is smaller for the floating-point extrapolation than for the while-loop algorithm (0.04 and 0.05 respectively), but with 1000 balls the maximum deviation of the while-loop algorithm is smaller (0.024 and 0.011), and with 10000 balls the floating-point extrapolation again does better (0.0034 and 0.0053), and so on without much consistency. Since neither algorithm seems to produce a consistently more uniform distribution than the other, I lean towards the floating-point extrapolation, because it appears to run faster than the while-loop algorithm. So is it fine to choose the floating-point extrapolation algorithm, or are my tests/conclusions not completely correct?
The problem is that you're doing a modulo operation. This would be no problem if the number of values rand() can return (RAND_MAX + 1) were evenly divisible by your modulus, but usually that is not the case. As a very contrived example, assume RAND_MAX to be 10 and your modulus to be 3. You'll get the following possible random numbers and the following resulting remainders:
0 1 2 3 4 5 6 7 8 9 10
0 1 2 0 1 2 0 1 2 0 1
As you can see, 0 and 1 are slightly more probable than 2.
One option to solve this is rejection sampling: By disallowing the numbers 9 and 10 above you can cause the resulting distribution to be uniform again. The tricky part is figuring out how to do so efficiently. A very nice example (one that took me two days to understand why it works) can be found in Java's java.util.Random.nextInt(int) method.
The reason why Java's algorithm is a little tricky is that they avoid slow operations like multiplication and division for the check. If you don't care too much you can also do it the naïve way:
int n = (int)(max - min + 1);
int remainder = RAND_MAX % n;
int x, output;
do {
x = rand();
output = x % n;
} while (x >= RAND_MAX - remainder);
return min + output;
EDIT: Corrected a fencepost error in above code, now it works as it should. I also created a little sample program (C#; taking a uniform PRNG for numbers between 0 and 15 and constructing a PRNG for numbers between 0 and 6 from it via various ways):
using System;
class Rand {
static Random r = new Random();
static int Rand16() {
return r.Next(16);
}
static int Rand7Naive() {
return Rand16() % 7;
}
static int Rand7Float() {
return (int)(Rand16() / 16.0 * 7);
}
// corrected
static int Rand7RejectionNaive() {
int n = 7, remainder = 16 % n, x, output;
do {
x = Rand16();
output = x % n;
} while (x >= 16 - remainder);
return output;
}
// adapted to fit the constraints of this example
static int Rand7RejectionJava() {
int n = 7, x, output;
do {
x = Rand16();
output = x % n;
} while (x - output + 6 > 15);
return output;
}
static void Test(Func<int> rand, string name) {
var buckets = new int[7];
for (int i = 0; i < 10000000; i++) buckets[rand()]++;
Console.WriteLine(name);
for (int i = 0; i < 7; i++) Console.WriteLine("{0}\t{1}", i, buckets[i]);
}
static void Main() {
Test(Rand7Naive, "Rand7Naive");
Test(Rand7Float, "Rand7Float");
Test(Rand7RejectionNaive, "Rand7RejectionNaive");
}
}
The result is as follows (pasted into Excel and added conditional coloring of cells so that differences are more apparent):
Now that I fixed my mistake in above rejection sampling it works as it should (before it would bias 0). As you can see, the float method isn't perfect at all, it just distributes the biased numbers differently.
The problem occurs when the number of outputs from the random number generator (RAND_MAX+1) is not evenly divisible by the desired range (max-min+1). Since there will be a consistent mapping from a random number to an output, some outputs will be mapped to more random numbers than others. This is regardless of how the mapping is done - you can use modulo, division, conversion to floating point, whatever voodoo you can come up with, the basic problem remains.
The magnitude of the problem is very small, and undemanding applications can generally get away with ignoring it. The smaller the range and the larger RAND_MAX is, the less pronounced the effect will be.
I took your example program and tweaked it a bit. First I created a special version of rand that only has a range of 0-255, to better demonstrate the effect. I made a few tweaks to rangeRandomAlg2. Finally I changed the number of "balls" to 1000000 to improve the consistency. You can see the results here: http://ideone.com/4P4HY
Notice that the floating-point version produces two tightly grouped probabilities, near either 0.101 or 0.097, nothing in between. This is the bias in action.
I think calling this "Java's algorithm" is a bit misleading - I'm sure it's much older than Java.
int rangeRandomAlg2 (int min, int max)
{
int n = max - min + 1;
int remainder = RAND_MAX % n;
int x;
do
{
x = rand();
} while (x >= RAND_MAX - remainder);
return min + x % n;
}
It's easy to see why this algorithm produces a biased sample. Suppose your rand() function returns uniform integers from the set {0, 1, 2, 3, 4}. If I want to use this to generate a random bit 0 or 1, I would say rand() % 2. The set {0, 2, 4} gives me 0, and the set {1, 3} gives me 1 -- so clearly I sample 0 with 60% and 1 with 40% likelihood, not uniform at all!
To fix this you have to either make sure that your desired range divides the range of the random number generator, or otherwise discard the result whenever the random number generator returns a number that's larger than the largest possible multiple of the target range.
In the above example, the target range is 2, the largest multiple that fits into the random generation range is 4, so we discard any sample that is not in the set {0, 1, 2, 3} and roll again.
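The discard-and-roll-again rule can be sketched generically like this. uniform_below, its parameter names, and the idea of passing the source as a callable are my own framing. It assumes next() returns uniform integers in [0, source_max]:

```cpp
#include <cassert>
#include <cstdint>

// Draws a value in [0, n) from a source that returns uniform integers
// in [0, source_max], rejecting the leftover values at the top.
template <typename Source>
std::uint32_t uniform_below(Source&& next, std::uint32_t source_max, std::uint32_t n) {
    assert(n >= 1 && n - 1 <= source_max);
    // The source has source_max + 1 possible values; use 64 bits so that
    // the + 1 cannot overflow.
    const std::uint64_t source_count = std::uint64_t{source_max} + 1;
    // Largest multiple of n that fits into the source range.
    const std::uint64_t limit = source_count - source_count % n;
    std::uint64_t x;
    do {
        x = next();
    } while (x >= limit);  // roll again for values past the last full block
    return static_cast<std::uint32_t>(x % n);
}
```

For the {0, 1, 2, 3, 4} example, source_max is 4 and n is 2, so limit is 4 and only the value 4 is ever rejected. With rand() as the source you would pass RAND_MAX as source_max.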
By far the easiest solution is std::uniform_int_distribution<int>(min, max).
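For completeness, a minimal sketch of that approach. The wrapper name random_in_range is made up, and std::mt19937 is just one possible engine choice:

```cpp
#include <cassert>
#include <random>

// Returns a uniformly distributed integer in the closed range [min, max].
int random_in_range(std::mt19937& engine, int min, int max) {
    assert(min <= max);
    std::uniform_int_distribution<int> dist(min, max);  // both bounds inclusive
    return dist(engine);
}
```

The engine should be created once (for example seeded from std::random_device) and reused across calls, rather than constructed per call.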
You have touched on two points involving a random integer algorithm: Is it optimal, and is it unbiased?
Optimal
There are many ways to define an "optimal" algorithm. Here we look at "optimal" algorithms in terms of the number of random bits it uses on average. In this sense, rand is a poor method to use for randomly generated numbers because, among other problems with rand(), it need not necessarily produce random bits (because RAND_MAX is not exactly specified). Instead, we will assume we have a "true" random generator that can produce unbiased and independent random bits.
In 1976, D. E. Knuth and A. C. Yao showed that any algorithm that produces random integers with a given probability, using only random bits, can be represented as a binary tree, where random bits indicate which way to traverse the tree and each leaf (endpoint) corresponds to an outcome. (Knuth and Yao, "The complexity of nonuniform random number generation", in Algorithms and Complexity, 1976.) They also gave bounds on the number of bits a given algorithm will need on average for this task. In this case, an optimal algorithm to generate integers in [0, n) uniformly, will need at least log2(n) and at most log2(n) + 2 bits on average.
There are many examples of optimal algorithms in this sense. See the following answer of mine:
How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?
Unbiased
However, any optimal integer generator that is also unbiased will, in general, run forever in the worst case, as also shown by Knuth and Yao. Going back to the binary tree, each one of the n outcomes labels leaves in the binary tree so that each integer in [0, n) can occur with probability 1/n. But if 1/n has a non-terminating binary expansion (which will be the case if n is not a power of 2), this binary tree will necessarily either:
have an "infinite" depth, or
include "rejection" leaves at the end of the tree.
In either case, the algorithm won't run in constant time and will run forever in the worst case. (On the other hand, when n is a power of 2, the optimal binary tree will have a finite depth and no rejection nodes.)
And for general n, there is no way to "fix" this worst case time complexity without introducing bias. For instance, modulo reductions (including the min + (rand() % (int)(max - min + 1)) in your question) are equivalent to a binary tree in which rejection leaves are replaced with labeled outcomes — but since there are more possible outcomes than rejection leaves, only some of the outcomes can take the place of the rejection leaves, introducing bias. The same kind of binary tree — and the same kind of bias — results if you stop rejecting after a set number of iterations. (However, this bias may be negligible depending on the application. There are also security aspects to random integer generation, which are too complicated to discuss in this answer.)
Without loss of generality, the problem of generating random integers on [a, b] can be reduced to the problem of generating random integers on [0, s). The state of the art for generating random integers on a bounded range from a uniform PRNG is represented by the following recent publication:
Daniel Lemire,"Fast Random Integer Generation in an Interval." ACM Trans. Model. Comput. Simul. 29, 1, Article 3 (January 2019) (ArXiv draft)
Lemire shows that his algorithm provides unbiased results and, motivated by the growing popularity of very fast high-quality PRNGs such as Melissa O'Neill's PCG generators, shows how the results can be computed fast, avoiding slow division operations almost all of the time.
An exemplary ISO-C implementation of his algorithm is shown in randint() below. Here I demonstrate it in conjunction with George Marsaglia's older KISS64 PRNG. For performance reasons, the required 64×64→128 bit unsigned multiplication is typically best implemented via machine-specific intrinsics or inline assembly that map directly to appropriate hardware instructions.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
/* PRNG state */
typedef struct Prng_T *Prng_T;
/* Returns uniformly distributed integers in [0, 2**64-1] */
uint64_t random64 (Prng_T);
/* Multiplies two 64-bit factors into a 128-bit product */
void umul64wide (uint64_t, uint64_t, uint64_t *, uint64_t *);
/* Generate in bias-free manner a random integer in [0, s) with Lemire's fast
algorithm that uses integer division only rarely. s must be in [0, 2**64-1].
Daniel Lemire, "Fast Random Integer Generation in an Interval," ACM Trans.
Model. Comput. Simul. 29, 1, Article 3 (January 2019)
*/
uint64_t randint (Prng_T prng, uint64_t s)
{
uint64_t x, h, l, t;
x = random64 (prng);
umul64wide (x, s, &h, &l);
if (l < s) {
t = (0 - s) % s;
while (l < t) {
x = random64 (prng);
umul64wide (x, s, &h, &l);
}
}
return h;
}
#define X86_INLINE_ASM (0)
/* Multiply two 64-bit unsigned integers into a 128-bit unsigned product. Return
the least significant 64 bits of the product to the location pointed to by
lo, and the most significant 64 bits of the product to the location pointed
to by hi.
*/
void umul64wide (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
#if X86_INLINE_ASM
uint64_t l, h;
__asm__ (
"movq %2, %%rax;\n\t" // rax = a
"mulq %3;\n\t" // rdx:rax = a * b
"movq %%rax, %0;\n\t" // l = (a * b)<63:0>
"movq %%rdx, %1;\n\t" // h = (a * b)<127:64>
: "=r"(l), "=r"(h)
: "r"(a), "r"(b)
: "%rax", "%rdx");
*lo = l;
*hi = h;
#else // X86_INLINE_ASM
uint64_t a_lo = (uint64_t)(uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint64_t)(uint32_t)b;
uint64_t b_hi = b >> 32;
uint64_t p0 = a_lo * b_lo;
uint64_t p1 = a_lo * b_hi;
uint64_t p2 = a_hi * b_lo;
uint64_t p3 = a_hi * b_hi;
uint32_t cy = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);
*lo = p0 + (p1 << 32) + (p2 << 32);
*hi = p3 + (p1 >> 32) + (p2 >> 32) + cy;
#endif // X86_INLINE_ASM
}
/* George Marsaglia's KISS64 generator, posted to comp.lang.c on 28 Feb 2009
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
*/
struct Prng_T {
uint64_t x, c, y, z, t;
};
struct Prng_T kiss64 = {1234567890987654321ULL, 123456123456123456ULL,
362436362436362436ULL, 1066149217761810ULL, 0ULL};
/* KISS64 state equations */
#define MWC64 (kiss64->t = (kiss64->x << 58) + kiss64->c, \
kiss64->c = (kiss64->x >> 6), kiss64->x += kiss64->t, \
kiss64->c += (kiss64->x < kiss64->t), kiss64->x)
#define XSH64 (kiss64->y ^= (kiss64->y << 13), kiss64->y ^= (kiss64->y >> 17), \
kiss64->y ^= (kiss64->y << 43))
#define CNG64 (kiss64->z = 6906969069ULL * kiss64->z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
uint64_t random64 (Prng_T kiss64)
{
return KISS64;
}
int main (void)
{
int i;
Prng_T state = &kiss64;
for (i = 0; i < 1000; i++) {
printf ("%llu\n", (unsigned long long) randint (state, 10));
}
return EXIT_SUCCESS;
}
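To see why the rejection test in randint() removes the bias, it helps to run the same step on a tiny word size where every input can be enumerated. The following is a sketch under my own naming (lemire_step, lemire_bounded), generalised to an L-bit word with L ≤ 32 so that a plain 64-bit product suffices. Unlike the code above, it computes the threshold on every call rather than only when l < s:

```cpp
#include <cassert>
#include <cstdint>

// One step of Lemire's method on an L-bit random word x (1 <= L <= 32).
// Returns a value in [0, s), or -1 when x has to be rejected.
// Assumes x < 2^L and 1 <= s <= 2^L.
inline std::int64_t lemire_step(std::uint64_t x, std::uint64_t s, unsigned L) {
    assert(L >= 1 && L <= 32);
    const std::uint64_t range = std::uint64_t{1} << L;
    assert(x < range && s >= 1 && s <= range);
    const std::uint64_t m = x * s;                    // full product, below 2^64
    const std::uint64_t low = m & (range - 1);        // m mod 2^L
    const std::uint64_t threshold = (range - s) % s;  // 2^L mod s
    if (low < threshold) return -1;                   // would cause bias: reject
    return static_cast<std::int64_t>(m >> L);         // high part is the result
}

// Repeats the step with fresh 32-bit words from gen() until one is accepted.
template <typename Gen32>
std::uint32_t lemire_bounded(Gen32& gen, std::uint32_t s) {
    std::int64_t r;
    do {
        r = lemire_step(static_cast<std::uint32_t>(gen()), s, 32);
    } while (r < 0);
    return static_cast<std::uint32_t>(r);
}
```

Enumerating all 256 inputs for L = 8 and s = 10 gives 6 rejected inputs (256 mod 10) and exactly 25 accepted inputs per output value, which is the uniformity the paper proves in general.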
If you really want a perfect generator, assuming the rand() function that you have is perfect, you need to apply the method explained below.
We will create a random number, r, from 0 to max-min=b-1, which is then easy to move to the range that you want: just take r+min.
We will create a random number where b < RAND_MAX, but the procedure can easily be adapted to produce a random number for any base.
PROCEDURE:
Take a random number r in its original RAND_MAX size without any truncation
Display this number in base b
Take first m=floor(log_b(RAND_MAX)) digits of this number for m random numbers from 0 to b-1
Shift each by min (i.e. r+min) to get them into the range [min, max] as you wanted
Since log_b(RAND_MAX) is not necessarily an integer, the last digit in the representation is wasted.
The original approach of just using mod (%) is mistaken exactly by
(log_b(RAND_MAX) - floor(log_b(RAND_MAX)))/ceil(log_b(RAND_MAX))
which you might agree is not that much, but if you insist on being precise, that is the procedure.
Is this a correct implementation of the Knuth multiplicative hash?
int hash(int v)
{
v *= 2654435761;
return v >> 32;
}
Does overflow in the multiplication affect the algorithm?
How can the performance of this method be improved?
The Knuth multiplicative hash is used to compute a hash value in {0, 1, 2, ..., 2^p - 1} from an integer k.
Suppose that p is between 0 and 32; the algorithm goes like this:
Compute alpha as the closest integer to 2^32 (-1 + sqrt(5)) / 2. We get alpha = 2 654 435 769.
Compute k * alpha and reduce the result modulo 2^32:
k * alpha = n0 * 2^32 + n1 with 0 <= n1 < 2^32
Keep the highest p bits of n1:
n1 = m1 * 2^(32-p) + m2 with 0 <= m2 < 2^(32 - p)
So, a correct implementation of Knuth multiplicative algorithm in C++ is:
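std::uint32_t knuth(int x, int p) {
assert(p >= 0 && p <= 32);
if (p == 0) return 0; // shifting a 32-bit value by 32 would be undefined behavior
const std::uint32_t alpha = 2654435769u; // closest integer to 2^32 * (sqrt(5) - 1) / 2
const std::uint32_t y = x; // convert to unsigned so the product wraps modulo 2^32
return (y * alpha) >> (32 - p); // keep the highest p bits
}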
Forgetting to shift the result by (32 - p) is a major mistake, as you would lose all the good properties of the hash. It would transform an even sequence into an even sequence, which would be very bad as all the odd slots would stay unoccupied. That's like taking a good wine and mixing it with Coke. By the way, the web is full of people misquoting Knuth and using a multiplication by 2 654 435 761 without taking the higher bits. I just opened the Knuth and he never said such a thing. It looks like some guy who decided he was "smart" decided to take a prime number close to 2 654 435 769.
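The even-sequence problem is easy to reproduce. In this sketch knuth_hash keeps the top p bits as described, and low_bits_hash is a deliberately wrong variant that keeps the bottom p bits instead. Both names are mine, and p is restricted to 1..32 to keep the shift well defined:

```cpp
#include <cassert>
#include <cstdint>

// Top p bits of (k * alpha) mod 2^32, with 1 <= p <= 32.
inline std::uint32_t knuth_hash(std::uint32_t k, unsigned p) {
    assert(p >= 1 && p <= 32);
    const std::uint32_t alpha = 2654435769u;
    return (k * alpha) >> (32 - p);  // unsigned product wraps modulo 2^32
}

// The mistaken reduction: bottom p bits of the same product.
inline std::uint32_t low_bits_hash(std::uint32_t k, unsigned p) {
    assert(p >= 1 && p <= 32);
    const std::uint32_t alpha = 2654435769u;
    const std::uint32_t product = k * alpha;
    return p == 32 ? product : (product & ((std::uint32_t{1} << p) - 1));
}
```

Since alpha is odd, the lowest bit of the product equals the lowest bit of k, so low_bits_hash sends every even key to an even slot. The top bits depend on the whole key: for instance, knuth_hash(2, 4) lands in slot 3.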
Bear in mind that most hash table implementations don't allow this kind of signature in their interface, as they only allow
uint32_t hash(int x);
and reduce hash(x) modulo 2^p to compute the hash value for x. Those hash tables cannot accept the Knuth multiplicative hash. This might be a reason why so many people completely ruined the algorithm by forgetting to take the higher p bits.
So you can't use the Knuth multiplicative hash with std::unordered_map or std::unordered_set. But I think that those hash tables use a prime number as a size, so the Knuth multiplicative hash is not useful in this case. Using hash(x) = x would be a good fit for those tables.
Source: "Introduction to Algorithms, third edition", Cormen et al., 13.3.2 p:263
Source: "The Art of Computer Programming, Volume 3, Sorting and Searching", D.E. Knuth, 6.4 p:516
Ok, I looked it up in TAOCP volume 3 (2nd edition), section 6.4, page 516.
This implementation is not correct, though as I mentioned in the comments it may give the correct result anyway.
A correct way (I think - feel free to read the relevant chapter of TAOCP and verify this) is something like this: (important: yes, you must shift the result right to reduce it, not use bitwise AND. However, that is not the responsibility of this function - range reduction is not properly part of hashing itself)
uint32_t hash(uint32_t v)
{
return v * UINT32_C(2654435761);
// do not comment about the lack of right shift. I'm not ignoring it. read on.
}
Note the uint32_t's (as opposed to int's) - they make sure the multiplication overflows modulo 2^32, as it is supposed to do if you choose 32 as the word size. There is also no right shift by k here, because there is no reason to give responsibility for range-reduction to the basic hashing function and it is actually more useful to get the full result. The constant 2654435761 is from the question, the actual suggested constant is 2654435769, but that's a small difference that as far as I know does not affect the quality of the hash.
Other valid implementations shift the result right by some amount (not the full word size though; that doesn't make sense and C++ doesn't like it), depending on how many bits of hash you need. Or they may use another constant (subject to certain conditions) or another word size. Reducing the hash modulo something is not a valid implementation but a common mistake, likely because it is the de facto standard way to do range reduction on a hash. The bottom bits of a multiplicative hash are the worst-quality bits (they depend on less of the input); you only want to use them if you really need more bits, while reducing the hash modulo a power of two would return only the worst bits. Indeed, that is equivalent to throwing away most of the input bits too. Reducing modulo a non-power-of-two is not so bad since it does mix in the higher bits, but it's not how the multiplicative hash was defined.
So to be clear, yes there is a right shift, but that is range reduction not hashing and can only be the responsibility of the hash table, since it depends on its internal size.
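To illustrate the difference between the two range reductions, here is a minimal sketch. The helper names bucket_by_shift and bucket_by_mask are made up for this example, and the table is assumed to have 2^p slots.

```cpp
#include <cstdint>

// Full-width multiplicative hash, same as the function above.
constexpr std::uint32_t knuth_hash(std::uint32_t v) {
    return v * UINT32_C(2654435761);
}

// Range reduction by right shift: keeps the top p bits (assumes 1 <= p <= 32).
constexpr std::uint32_t bucket_by_shift(std::uint32_t h, unsigned p) {
    return h >> (32u - p);
}

// Range reduction modulo 2^p: keeps only the bottom p bits (assumes 1 <= p <= 31).
constexpr std::uint32_t bucket_by_mask(std::uint32_t h, unsigned p) {
    return h & ((UINT32_C(1) << p) - 1u);
}
```

With p = 4, the keys 0, 16, 32 and 48 all land in bucket 0 under bucket_by_mask, because the low 4 bits of the product depend only on the low 4 bits of the key. Under bucket_by_shift the same keys land in buckets 0, 14, 12 and 10.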
The type should be unsigned, otherwise the overflow is undefined behavior (thus possibly wrong, not just on non-2's-complement architectures but also on overly clever compilers) and the optional right shift would be a signed shift (wrong).
On the page I mention at the top, there is this formula:
h(K) = floor(M * (((A / w) * K) mod 1))
Here we have A = 2654435761 (or 2654435769), w = 2^32 and M = 2^32. Calculating AK/w gives a fixed-point result with the format Q32.32; the mod 1 step takes only the 32 fraction bits. But that's just the same thing as doing a modular multiplication and then saying that the result is the fraction bits. Of course, when multiplied by M, all the fraction bits become integer bits because of how M was chosen, and so it simplifies to just a plain old modular multiplication. When M is a lower power of two, that just right-shifts the result, as mentioned.
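As a sanity check of that simplification, the sketch below evaluates the formula literally in 64-bit fixed point and compares it with the multiply-and-shift shortcut. It assumes M = 2^p with 1 <= p <= 32, and the function names are mine, not Knuth's.

```cpp
#include <cstdint>

// A, using the suggested constant 2654435769.
const std::uint64_t kA = UINT64_C(2654435769);

// Literal evaluation of floor(M * ((A * K / w) mod 1)) with w = 2^32, M = 2^p.
inline std::uint32_t hash_by_formula(std::uint32_t K, unsigned p) {
    // The 64-bit product A*K, read as Q32.32, is the value A*K/w.
    const std::uint64_t product = kA * K;
    // "mod 1" keeps only the 32 fraction bits.
    const std::uint64_t fraction = product & UINT64_C(0xFFFFFFFF);
    // Multiplying by M = 2^p shifts left by p; floor drops what is still fractional.
    return static_cast<std::uint32_t>((fraction << p) >> 32);
}

// Shortcut: 32-bit modular multiplication, then keep the top p bits.
inline std::uint32_t hash_by_shift(std::uint32_t K, unsigned p) {
    const std::uint32_t h = static_cast<std::uint32_t>(kA) * K;
    return h >> (32u - p);
}
```

Both functions should agree for every key and every p in that range; for p = 32 they both reduce to the plain modular multiplication.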
It might be late, but here's a Java implementation of Knuth's method.
For a hash table of size N:
public long hash(int key) {
long l = 2654435769L;
return (key * l >> 32) % N ;
}
If the input argument is a pointer, then I use this:
#include <stdint.h>
uint32_t knuth_mul_hash(void* k) {
    // uintptr_t is unsigned, so the multiplication wraps around instead of overflowing a signed type
    uintptr_t v = (uintptr_t)k * UINT32_C(2654435761);
    v >>= ((sizeof(uintptr_t) - sizeof(uint32_t)) * 8); // Right-shift v by the size difference between a pointer and a 32-bit integer (0 for x86, 32 for x64)
    return (uint32_t)(v & UINT32_MAX);
}
I usually use this as the default fallback hashing function in hashmap implementations, dictionaries, sets, etc...
What would be the fastest way to generate a large number of (pseudo-)random bits? Each bit must be independent and be zero or one with equal probability. I could obviously do some variation on
randbit=rand()%2;
but I feel like there should be a faster way, generating several random bits from each call to the random number generator. Ideally I'd like to get an int or a char where each bit is random and independent, but other solutions are also possible.
The application is not cryptographic in nature so strong randomness isn't a major factor, whereas speed and getting the correct distribution is important.
Convert a random number into binary
Why not get just one number (of an appropriate size to give you enough bits) and then convert it to binary? You'll actually get bits from a random number, which means they are random as well.
Zeros and ones also each have a probability of 50%: across all numbers from 0 up to some 2^n limit, the counts of zero bits and one bits are equal, so both values are equally likely.
Regarding speed
This would probably be very fast, since a single call to the generator gives you many bits at once. After that, it depends purely on your binary conversion.
Take a look at Boost.Random, especially boost::uniform_int<>.
As you say just generate random integers.
Then you have 32 random bits with ones and zeroes all equally probable.
Get the bits in a loop:
for (int i = 0; i < 32; i++)
{
randomBit = (randomNum >> i) & 1;
...
// Do your thing
}
Repeat this as many times as you need to get the required number of bits.
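A sketch of that idea in C++, assuming std::mt19937 as the generator (the answer does not name one). BitSource is a hypothetical helper that refills a 32-bit buffer only once every 32 bits, so the generator is called far less often than with rand()%2.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: hands out one bit at a time from buffered 32-bit words.
class BitSource {
public:
    explicit BitSource(std::uint32_t seed) : engine_(seed) {}

    bool next_bit() {
        if (remaining_ == 0) {
            // std::mt19937 produces values in [0, 2^32 - 1], so all 32 bits are usable.
            buffer_ = static_cast<std::uint32_t>(engine_());
            remaining_ = 32;
        }
        const bool bit = (buffer_ & 1u) != 0;  // take the lowest buffered bit
        buffer_ >>= 1;
        --remaining_;
        return bit;
    }

private:
    std::mt19937 engine_;
    std::uint32_t buffer_ = 0;
    int remaining_ = 0;
};
```

Constructing it with a fixed seed, as in BitSource src(42u), makes the bit stream reproducible; seeding from std::random_device would be the usual choice otherwise.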
Here's a very fast one I coded in Java based on George Marsaglia's XORShift algorithm: gets you 64 bits at a time!
/**
* State for random number generation
*/
private static volatile long state=xorShift64(System.nanoTime()|0xCAFEBABE);
/**
* Gets a long random value
* @return Random long value based on static state
*/
public static final long nextLong() {
long a=state;
state = xorShift64(a);
return a;
}
/**
* XORShift algorithm - credit to George Marsaglia!
* @param a Initial state
* @return new state
*/
public static final long xorShift64(long a) {
a ^= (a << 21);
a ^= (a >>> 35);
a ^= (a << 4);
return a;
}
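For readers working in C++, the same step could be ported roughly as follows. This is only a sketch: the XorShift64 wrapper is a made-up name, and unlike the Java version it keeps the state in a plain member rather than a shared static, so each thread would need its own instance.

```cpp
#include <cstdint>

// One xorshift step with the same shift triple (21, 35, 4) as the Java code.
// std::uint64_t makes >> a logical shift, matching Java's >>> operator.
inline std::uint64_t xorshift64(std::uint64_t a) {
    a ^= a << 21;
    a ^= a >> 35;
    a ^= a << 4;
    return a;
}

// Hypothetical wrapper holding the generator state.
struct XorShift64 {
    std::uint64_t state;  // must be nonzero: a zero state maps to zero forever

    // Returns the current state and advances it, like nextLong() above.
    std::uint64_t next() {
        const std::uint64_t out = state;
        state = xorshift64(state);
        return out;
    }
};
```

The nonzero requirement is presumably why the Java code ORs the seed with 0xCAFEBABE.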
SMP safe (i.e. the fastest way possible these days) and good bits
Note the use of the [ThreadStatic] attribute: this object automatically handles new threads, with no locking. That's the only way you're going to get high-performance random numbers that are SMP-safe and lock-free.
http://blogs.msdn.com/pfxteam/archive/2009/02/19/9434171.aspx
If I remember correctly, the least significant bits normally have a "less random" distribution for most pseudo-random number generators, so using modulo and/or each bit in the generated number would be bad if you are worried about the distribution.
(Maybe you should at least google what Knuth says...)
If that holds (and it's hard to tell without knowing exactly which algorithm you are using), just use the highest bit in each generated number.
http://en.wikipedia.org/wiki/Pseudo-random
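The low-bit weakness is easy to see with a linear congruential generator whose modulus is a power of two. The sketch below uses the multiplier and increment from Numerical Recipes as an assumed example; it is not necessarily what your rand() does, and Lcg32 is a name made up here.

```cpp
#include <cstdint>

// Example LCG modulo 2^32 (constants from Numerical Recipes, chosen for illustration).
struct Lcg32 {
    std::uint32_t state;

    std::uint32_t next() {
        // Unsigned arithmetic wraps modulo 2^32 automatically.
        state = state * UINT32_C(1664525) + UINT32_C(1013904223);
        return state;
    }
};

// Lowest bit: with an odd multiplier and odd increment it just alternates.
inline bool low_bit(std::uint32_t x) { return (x & 1u) != 0; }

// Highest bit: has the longest period of any bit in this kind of generator.
inline bool high_bit(std::uint32_t x) { return (x >> 31) != 0; }
```

With this generator, low_bit of successive outputs goes 1, 0, 1, 0, ... for any seed parity pattern, which is why taking high_bit (or the top few bits) is the safer choice when the generator is of this type.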
#include <iostream>
#include <bitset>
#include <random>
int main()
{
const size_t nrOfBits = 256;
std::bitset<nrOfBits> randomBits;
std::default_random_engine generator;
std::uniform_real_distribution<float> distribution(0.0,1.0);
float randNum;
for(size_t i = 0; i < nrOfBits; i++)
{
randNum = distribution(generator);
if(randNum < 0.5) {
randNum = 0;
} else {
randNum = 1;
}
randomBits.set(i, randNum);
}
std::cout << "randomBits =\n" << randomBits << std::endl;
return 0;
}
This took 4.5886e-05s in my test with 256 bits.
You can generate a random number and keep right-shifting it and testing the least significant bit to get the random bits, instead of doing a mod operation.
How large do you need the number of generated bits to be? If it is not larger than a few million, and keeping in mind that you are not using the generator for cryptography, then I think the fastest possible way would be to precompute a large set of integers with the correct distribution, convert it to a text file like this:
unsigned int numbers[] =
{
0xABCDEF34, ...
};
and then compile the array into your program and go through it one number at a time.
That way you get 32 bits with every call (on a 32-bit processor), the generation time is the shortest possible because all the numbers are generated in advance, and the distribution is controlled by you. The downside is, of course, that these numbers are not random at all, which depending on what you are using the PRNG for may or may not matter.
If you only need a bit at a time, try
bool coinToss()
{
return rand()&1;
}
It would technically be a faster way to generate bits, because it replaces the %2 with a &1; the two are equivalent here.
Just read some memory: take an n-bit section of raw memory. It will be pretty random.
Alternatively, generate a large random int x and just use the bit values.
for (int i = bitcount - 1; i >= 0; i--) bin += ((x >> i) & 1) ? '1' : '0'; // assumes bin is a std::string collecting the bits, highest first