Efficiently randomly shuffling the bits of a sequence of words - c++

Consider the following algorithm from the C++ standard library, std::shuffle, which has the following signature:
template <class RandomIt, class URBG>
void shuffle(RandomIt first, RandomIt last, URBG&& g);
It reorders the elements in the given range [first, last) such that each possible permutation of those elements has equal probability of appearance.
I am trying to implement the same algorithm, but one that works at the bit level, randomly shuffling the bits of the words of the input sequence. Considering a sequence of 64-bit words, I am trying to implement:
template <class URBG>
void bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g)
Question: How to do that as efficiently as possible (using compiler intrinsics if necessary)? I am not necessarily looking for an entire implementation, but more for suggestions/directions of research, because it's really not clear to me if it's even feasible to implement that efficiently.

It's obvious that asymptotically the speed is O(N), where N is the number of bits. Our goal is to improve the constants involved.
Disclaimer: the proposed algorithm is described only as a rough sketch. A lot of things need to be added and, especially, a lot of details need to be taken care of to make it work correctly. The approximate execution time will not differ from what is claimed here, though.
Baseline Algorithm
The most obvious one is the textbook approach, which takes N operations, each of which calls the random generator (taking R milliseconds) and reads and writes the values of two different bits, in total 4 * A milliseconds (A is the time to read/write one bit). Suppose the array lookup operation takes C milliseconds. So the total time of this algorithm is approximately N * (R + 4 * A + 2 * C) milliseconds. It is also reasonable to assume that the random number generation takes more time, i.e. R >> A == C.
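For reference, a minimal sketch of that baseline, i.e. a bit-level Fisher-Yates (the helper name and the get/set lambdas are only for illustration):
#include <cstdint>
#include <random>

// Straightforward bit-level Fisher-Yates: one RNG call and two bit reads/writes
// per iteration, i.e. the N * (R + 4 * A + 2 * C) cost counted above.
template <class URBG>
void bit_shuffle_baseline(std::uint64_t* first, std::uint64_t* last, URBG&& g) {
    const std::size_t nbits = 64u * static_cast<std::size_t>(last - first);
    if (nbits < 2) return;
    auto get = [&](std::size_t i) -> std::uint64_t {
        return (first[i / 64] >> (i % 64)) & 1u;
    };
    auto set = [&](std::size_t i, std::uint64_t b) {
        first[i / 64] &= ~(std::uint64_t{1} << (i % 64));
        first[i / 64] |= b << (i % 64);
    };
    for (std::size_t i = nbits - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i);
        const std::size_t j = pick(g);
        const std::uint64_t bi = get(i), bj = get(j);
        set(i, bj);
        set(j, bi);
    }
}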
Proposed Algorithm
Suppose the bits are stored in a byte storage, i.e. we will work with blocks of bytes.
unsigned char bit_field[field_size = N / 8];
First, let's count the number of 1 bits in our bitset. For that, we can use a lookup-table and iterate through the bitset as byte array:
// Generate the lookup table; you may mark it `constexpr`
// to have it computed at compile time.
int bitcount_lookup[256];
for (int i = 0; i < 256; ++i) {
    bitcount_lookup[i] = 0;
    for (int b = 0; b < 8; ++b)
        bitcount_lookup[i] += (i >> b) & 1;
}
We can treat this as preprocessing overhead (as it may as well be calculated at compile time) and say that it takes 0 milliseconds. Now, counting the number of 1 bits is easy (the following will take (N / 8) * C milliseconds):
int bitcount = 0;
for (auto *it = bit_field; it != bit_field + field_size; ++it)
    bitcount += bitcount_lookup[*it];
Now, we randomly generate N / 8 numbers (let's call the resulting array gencnt[N / 8]), each in the range [0..8], such that they sum up to bitcount. This is a bit tricky and kind of hard to do uniformly (the "correct" algorithm to generate a uniform distribution is quite slow compared to the baseline algo). A quite uniform-ish but quick solution is roughly:
Fill the gencnt[N / 8] array with values v = bitcount / (N / 8).
Randomly choose N / 16 "black" cells. The rest are "white". The algorithm is similar to a random permutation, but only of half of the array.
Generate N / 16 random numbers in the range [0..v]. Let's call them tmp[N / 16].
Increase "black" cells by tmp[i] values, and decrease "white" cells by tmp[i]. This will ensure that the overall sum is bitcount. (A sketch of these steps in code follows.)
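A rough sketch of those four steps (the function name is mine; clamping cells to [0..8] and distributing the bitcount % (N / 8) leftover are among the details deliberately left out, as per the disclaimer above):
#include <numeric>
#include <random>
#include <vector>

// Generate `cells` counts that sum (roughly) to `bitcount`, following the steps above.
std::vector<int> make_gencnt(int cells, int bitcount, std::mt19937_64& g) {
    const int v = bitcount / cells;
    std::vector<int> gencnt(cells, v);            // step 1: fill with v
    std::vector<int> idx(cells);
    std::iota(idx.begin(), idx.end(), 0);
    for (int i = 0; i < cells / 2; ++i) {         // step 2: pick cells/2 "black" cells
        std::uniform_int_distribution<int> d(i, cells - 1);
        std::swap(idx[i], idx[d(g)]);             // partial Fisher-Yates over the indices
    }
    std::uniform_int_distribution<int> amount(0, v);
    for (int i = 0; i < cells / 2; ++i) {         // steps 3-4: move a random amount from a
        const int t = amount(g);                  // "white" cell to a "black" one, keeping
        gencnt[idx[i]] += t;                      // the total sum unchanged
        gencnt[idx[cells / 2 + i]] -= t;
    }
    return gencnt;
}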
After that, we will have a uniform-ish, random-ish array gencnt[N / 8], the values of which are the numbers of 1 bits in each particular "cell". It was all generated in:
(N / 8) * C   +   (N / 16) * (4 * C)   +   (N / 16) * (R + 2 * C)
^^^^^^^^^^^       ^^^^^^^^^^^^^^^^^^       ^^^^^^^^^^^^^^^^^^^^^^
filling step        random coloring                filling
milliseconds (this estimation is done with a concrete implementation in mind). Lastly, we can have a lookup table of the bytes with a specified number of bits set to 1 (this can be precomputed overhead, or even done at compile time as constexpr, so let's assume it takes 0 milliseconds):
std::vector<std::vector<unsigned char>> random_lookup(9); // 9 buckets: a byte can have 0..8 bits set
for (int c = 0; c <= 8; c++)
    random_lookup[c] = { /* bytes with `c` bits set to `1` */ };
Then, we can fill our bit_field as follows (which takes roughly (N / 8) * (R + 3 * C) milliseconds):
for (int i = 0; i < field_size; i++) {
    auto const &bucket = random_lookup[gencnt[i]];
    bit_field[i] = bucket[rand() % bucket.size()];
}
Summing everything up, we have the total execution time:
T = (N / 8) * C +
(N / 8) * C + (N / 16) * (4 * C) + (N / 16) * (R + 2 * C) +
(N / 8) * (R + 3 * C)
  = N * (C + (3/16) * R)  <  N * (R + 4 * A + 2 * C)
    ^^^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^^
     proposed algorithm        naive baseline algo
Although it's not truly uniformly random, it does spread the bits out quite evenly and randomly, it's quite fast, and it will hopefully get the job done in your use case.

Observe that actually shuffling the bits, which involves swapping via Fisher-Yates, is not required to produce the exact equivalent: a random distribution of the bits.
#include <iostream>
#include <vector>
#include <random>

// shuffle a vector of bools. This requires only counting the number of trues in the vector
// followed by clearing the vector and inserting bool trues to produce an equivalent to
// a bit shuffle. This is cache line friendly and doesn't require swapping.
std::vector<bool> DistributeBitsRandomly(std::vector<bool> bvector)
{
    std::random_device rd;
    static std::mt19937 gen(rd()); // mersenne_twister_engine seeded with rd()

    // count the number of set bits and clear bvector
    int set_bits_count = 0;
    for (int i = 0; i < bvector.size(); i++)
        if (bvector[i])
        {
            set_bits_count++;
            bvector[i] = false;
        }

    // set a bit if a random value in range bvector.size()-bit_loc-1 is
    // less than the number of bits remaining to be placed. This produces exactly the same
    // distribution as a random shuffle but only does an insertion of a 1 bit rather than
    // a swap. It requires counting the number of 1 bits. There are efficient ways
    // of doing this. See https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
    for (int bit_loc = 0; set_bits_count; bit_loc++)
    {
        std::uniform_int_distribution<int> dis(0, bvector.size()-bit_loc-1);
        auto x = dis(gen);
        if (x < set_bits_count)
        {
            bvector[bit_loc] = true;
            set_bits_count--;
        }
    }
    return bvector;
}
This performs the equivalent of shuffling the bools in a vector<bool>. It is cache-line friendly and involves no swapping. It's presented in executable but simple algorithmic form, as requested by the OP. Much can be done to optimize this, such as improving the speed of bit counting and clearing the array.
The test below sets 4 bits out of 10, calls the "shuffle" routine 100,000 times, and prints the number of times a 1 bit occurs in each of the 10 locations. It should be around 40,000 in each position.
int main()
{
    std::vector<bool> initial{ 1,1,1,1,0,0,0,0,0,0 };
    std::vector<int> totals(initial.size());
    for (int i = 0; i < 100000; i++)
    {
        auto a_distribution = DistributeBitsRandomly(initial);
        for (int ii = 0; ii < totals.size(); ii++)
            if (a_distribution[ii])
                totals[ii]++;
    }
    for (auto cnt : totals)
        std::cout << cnt << "\n";
}
Possible Output:
40116
39854
40045
39917
40105
40074
40214
39963
39946
39766

Related

Return non-duplicate random values from a very large range

I would like a function that will produce k pseudo-random values from a set of n integers, zero to n-1, without repeating any previous result. k is less than or equal to n. O(n) memory is unacceptable because of the large size of n and the frequency with which I'll need to re-shuffle.
These are the methods I've considered so far:
Array:
Normally if I wanted duplicate-free random values I'd shuffle an array, but that's O(n) memory. n is likely to be too large for that to work.
long nextvalue(void) {
    static long array[4000000000];
    static long s = 0;
    if (s == 0) {
        for (long i = 0; i < 4000000000; i++) array[i] = i;
        shuffle(array, 4000000000);
    }
    return array[s++];
}
n-state PRNG:
There are a variety of random number generators that can be designed so as to have a period of n and to visit n unique states over that period. The simplest example would be:
long nextvalue(void) {
    static long s = 0;
    static const long i = 1009; // assumed co-prime to n
    s = (s + i) % n;
    return s;
}
The problem with this is that it's not necessarily easy to design a good PRNG on the fly for a given n, and it's unlikely that that PRNG will approximate a fair shuffle if it doesn't have a lot of variable parameters (even harder to design). But maybe there's a good one I don't know about.
m-bit hash:
If the size of the set is a power of two, then it's possible to devise a perfect hash function f() which performs a 1:1 mapping from any value in the range to some other value in the range, where every input produces a unique output. Using this function I could simply maintain a static counter s, and implement a generator as:
long nextvalue(void) {
    static long s = 0;
    return f(s++);
}
This isn't ideal because the order of the results is determined by f(), rather than random values, so it's subject to all the same problems as above.
NPOT hash:
In principle I can use the same design principles as above to define a version of f() which works in an arbitrary base, or even a composite, that is compatible with the range needed; but that's potentially difficult, and I'm likely to get it wrong. Instead a function can be defined for the next power of two greater than or equal to n, and used in this construction:
long nextvalue(void) {
    static long s = 0;
    long x = s++;
    do { x = f(x); } while (x >= n);
    return x;
}
But this still has the same problem (it's unlikely to give a good approximation of a fair shuffle).
Is there a better way to handle this situation? Or perhaps I just need a good function for f() that is highly parameterisable and easy to design to visit exactly n discrete states.
One thing I'm thinking of is a hash-like operation where I contrive to have the first j results perfectly random through carefully designed mapping, and then any results between j and k would simply extrapolate on that pattern (albeit in a predictable way). The value j could then be chosen to find a compromise between a fair shuffle and a tolerable memory footprint.
First of all, it seems unreasonable to discount anything that uses O(n) memory and then discuss a solution that refers to an underlying array. You have an array. Shuffle it. If that doesn't work or isn't fast enough, come back to us with a question about it.
You only need to perform a complete shuffle once. After that, to draw a value, take the element at index n, swap it with an element located randomly before it, and increase n modulo the element count. For example, with such a large dataset I'd use something like this.
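The answerer's link isn't reproduced here, but a minimal sketch of that draw-and-swap idea might look like this (class and member names are mine; assumes a non-empty vector):
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// After one complete shuffle, each draw returns the element at the cursor,
// swaps it with a randomly chosen earlier element, and advances the cursor
// modulo the size, so the array keeps being re-randomized as it is consumed.
class lazy_reshuffler {
public:
    explicit lazy_reshuffler(std::vector<long> values)
        : values_(std::move(values)), rng_(std::random_device{}()) {
        std::shuffle(values_.begin(), values_.end(), rng_);   // the one complete shuffle
    }
    long next() {
        const long out = values_[cursor_];
        if (cursor_ > 0) {
            std::uniform_int_distribution<std::size_t> d(0, cursor_ - 1);
            std::swap(values_[cursor_], values_[d(rng_)]);     // re-randomize as we go
        }
        cursor_ = (cursor_ + 1) % values_.size();
        return out;
    }
private:
    std::vector<long> values_;
    std::size_t cursor_ = 0;
    std::mt19937_64 rng_;
};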
Prime numbers are an option for hashes, but probably not the same way you think. Using two Mersenne primes (low and high, perhaps 0xefff and 0xefffffff) you should be able to come up with a much more general-purpose hashing algorithm.
size_t hash(unsigned char *value, size_t value_size, size_t low, size_t high) {
    size_t x = 0;
    while (value_size--) {
        x += *value++;
        x *= low;
    }
    return x % high;
}
#define hash(value, value_size, low, high) (hash((void *) value, value_size, low, high))
This should produce something fairly well distributed for all inputs larger than about two octets for example, with the minor troublesome exception for zero byte prefixes. You might want to treat those differently.
So... what I've ended up doing is digging deeper into pre-existing methods to
try to confirm their ability to approximate a fair shuffle.
I take a simple counter, which itself is guaranteed to visit
every in-range value exactly once, and then 'encrypt' it with an n-bit block
cypher. Or rather, I round the range up to a power of two, and apply some 1:1
function; then if the result is out of range I repeat the permutation until the
result is in range.
This can be guaranteed to complete eventually because there are only a finite
number of out-of-range values within the power-of-two range, and they cannot
enter an always-out-of-range cycle because that would imply that something
in the cycle was mapped from two different previous states (one from the
in-range set, and another from the out-of-range set), which would make the
function not bijective.
So all I need to do is devise a parameterisable function which I can tune to an
arbitrary number of bits. Like this one:
uint64_t mix(uint64_t x, uint64_t k) {
    const int s0 = BITS * 4 / 5;
    const int s1 = BITS / 5 + (k & 1);
    const int s2 = BITS * 2 / 5;
    k |= 1;
    x *= k;
    x ^= (x & BITMASK) >> s0;
    x ^= (x << s1) & BITMASK;
    x ^= (x & BITMASK) >> s2;
    x += 0x9e3779b97f4a7c15;
    return x & BITMASK;
}
I know it's bijective because I happen to have its inverse function handy:
uint64_t unmix(uint64_t x, uint64_t k) {
    const int s0 = BITS * 4 / 5;
    const int s1 = BITS / 5 + (k & 1);
    const int s2 = BITS * 2 / 5;
    k |= 1;
    uint64_t kp = k * k;
    while ((kp & BITMASK) > 1) {
        k *= kp;
        kp *= kp;
    }
    x -= 0x9e3779b97f4a7c15;
    x ^= ((x & BITMASK) >> s2) ^ ((x & BITMASK) >> s2 * 2);
    x ^= (x << s1) ^ (x << s1 * 2) ^ (x << s1 * 3) ^ (x << s1 * 4) ^ (x << s1 * 5);
    x ^= (x & BITMASK) >> s0;
    x *= k;
    return x & BITMASK;
}
This allows me to define a simple parameterisable PRNG like this:
uint64_t key[ROUNDS];
uint64_t seed = 0;

uint64_t rand_no_rep(void) {
    uint64_t x = seed++;
    do {
        for (int i = 0; i < ROUNDS; i++) x = mix(x, key[i]);
    } while (x >= RANGE);
    return x;
}
Initialise seed and key to random values and you're good to go.
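For completeness, that initialisation might look something like this (assuming the ROUNDS and BITMASK constants used above are defined; the seeding scheme is just an example):
#include <random>
#include <stdint.h>

void init_rand_no_rep(void) {
    std::random_device rd;
    for (int i = 0; i < ROUNDS; i++)
        key[i] = ((uint64_t)rd() << 32) | rd();        // any random 64-bit values will do
    seed = ((((uint64_t)rd() << 32) | rd()) & BITMASK); // starting point of the counter
}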
Using the inverse function lets me determine what the seed must be to force
rand_no_rep() to produce a given output, making it much easier to test.
So far I've checked the cases where a constant a is followed by a constant
b. For ROUNDS==1, pairs collide on exactly 50% of the keys (and each
pair of collisions is with a different pair of a and b; they don't all converge on 0, 1 or whatever). That is, for
various k, a specific a-followed-by-b case occurs for more than one k
(this must happen at least once). Subsequent values do not collide in
that case, so different keys aren't falling into the same cycle at different
positions. Every k gives a unique cycle.
The 50% collision rate comes from 25% of entries being not unique when they're added to the list (count the entry itself, and count the one it ran into). That might sound bad, but it's actually lower than birthday-paradox logic would suggest. Selecting randomly, the percentage of new entries that fail to be unique looks to converge between 36% and 37%. Being "better than random" is obviously worse than random, as far as randomness goes, but that's why they're called pseudo-random numbers.
Extending that to ROUNDS==2, we want to make sure that a second round doesn't
cancel out or simply repeat the effects of the first.
This is important because it would mean that multiple rounds are a waste of
time and memory, and that the function cannot be parameterised to any
substantial degree. It could happen trivially if mix() contained only linear
operations (say, multiply and add, mod RANGE). In that case all of the
parameters could be multiplied/added together to produce a single parameter for
a single round that would have the same effect. That would be disappointing,
as it would reduce the number of attainable permutations to the size of just
that one parameter, and if the set is as small as that then more work would be
needed to ensure that it's a good, representative set.
So what we want to see from two rounds is a large set of outcomes that could
never be achieved by one round. One way to demonstrate this is to look for the
original b-follows-a cases with an additional parameter c, where we want
to see every possible c following a and b.
We know from the one-round testing that in 50% of cases there is only one c
that can follow a and b because there is only one k that places b
immediately after a. We also know that 25% of the pairs of a and b were
unreachable (being the gap left behind by half the pairs that went into
collisions rather than new unique values), and the last 25% appear for two
different k.
The result that I get is that given a free choice of both keys, it's possible
to find about five eights of the values of c following a given a and b.
About a quarter of the a/b pairs are unreachable (it's less predictable
now, because of the potential intermediate mappings into or out of the
duplicate or unreachable cases) and a quarter have a, b, and c appear
together in two sequences (which diverge afterwards).
I think there's a lot to be inferred from the difference between one round and
two, but I could be wrong about that and I need to double-check. Further
testing gets harder; or at least slower unless I think more carefully about how
I'm going to do it.
I haven't yet demonstrated that, amongst the set of permutations it can produce, they're all equally likely; but this is normally not guaranteed for any other PRNG either.
It's fairly slow for a PRNG, but it would fit SIMD trivially.

How to generate a list of ascending random integers

I have an external collection containing n elements that I want to select some number (k) of them at random, outputting the indices of those elements to some serialized data file. I want the indices to be output in strict ascending order, and for there to be no duplicates. Both n and k may be quite large, and it is generally not feasible to simply store entire arrays in memory of that size.
The first algorithm I came up with was to pick a random number r[0] from 1 to n-k... and then pick successive random numbers r[i] from r[i-1]+1 to n-k+i, only needing to store two entries of r at any one time. However, a fairly simple analysis reveals that the probability of selecting small numbers is inconsistent with what it would be if the entire set were equally distributed. For example, if n were a billion and k half a billion, the probability of selecting the first entry with the approach I've just described is very tiny (1 in half a billion), whereas in actuality, since half of the entries are being selected, the first should be selected 50% of the time. Even if I use external sorting to sort k random numbers, I would have to discard any duplicates and try again. As k approaches n, the number of retries would continue to grow, with no guarantee of termination.
I would like to find a O(k) or O(k log k) algorithm to do this, if it is at all possible. The implementation language I will be using is C++11, but descriptions in pseudocode may still be helpful.
If in practice k has the same order of magnitude as n, perhaps a very straightforward O(n) algorithm will suffice:
assert(k <= n);
std::mt19937 engine{std::random_device{}()}; // any URBG will do
std::uniform_real_distribution<double> rnd;
for (int i = 0; i < n; i++) {
    if (rnd(engine) * (n - i) < k) {
        std::cout << i << std::endl;
        k--;
    }
}
It produces all ascending sequences with equal probability.
You can solve this recursively in O(k log k) if you partition in the middle of your range, and randomly sample from the hypergeometric probability distribution to choose how many values lie above and below the middle point (i.e. the values of k for each subsequence), then recurse for each:
int sample_hypergeometric(int n, int K, int N) // samples hypergeometric distribution and
// returns number of "successes" where there are n draws without replacement from
// a population of N with K possible successes.
// Something similar to scipy.stats.hypergeom.rvs in Python.
// In this case, "success" means the selected value lying below the midpoint.
{
    // static so the engine is seeded once, not re-created (with the same sequence) on every call
    static std::default_random_engine generator(std::random_device{}());
    static std::uniform_real_distribution<double> distribution(0.0, 1.0);
    int successes = 0;
    for (int trial = 0; trial < n; trial++)
    {
        if ((int)(distribution(generator) * N) < K)
        {
            successes++;
            K--;
        }
        N--;
    }
    return successes;
}
select_k_from_n(int start, int k, int n)
{
    if (k == 0)
        return;
    if (k == 1)
    {
        output start + random(1 to n);
        return;
    }
    // find the number of results below the mid-point:
    int k1 = sample_hypergeometric(k, n >> 1, n);
    select_k_from_n(start, k1, n >> 1);
    select_k_from_n(start + (n >> 1), k - k1, n - (n >> 1));
}
Sampling from the binomial distribution could also be used to approximate the hypergeometric distribution with p = (n >> 1) / n, rejecting samples where k1 > (n >> 1).
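A hedged sketch of that binomial approximation (the function name is mine; the second rejection condition is the symmetric one needed so that the upper half never receives more picks than it has slots):
#include <random>

// Approximate sample_hypergeometric(k, n >> 1, n) with a binomial draw of k trials
// at p = (n >> 1) / n, rejecting samples that are impossible for the hypergeometric.
int sample_hypergeometric_approx(int k, int n, std::mt19937 &gen)
{
    const int half = n >> 1;
    std::binomial_distribution<int> dist(k, double(half) / double(n));
    for (;;) {
        int k1 = dist(gen);
        if (k1 <= half && (k - k1) <= n - half)   // reject impossible splits
            return k1;
    }
}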
As mentioned in my comment, use a std::set<int> to store the randomly generated integers such that the resulting container is inherently sorted and contains no duplicates. Example code snippet:
#include <random>
#include <set>

int main(void) {
    std::set<int> random_set;
    std::random_device rd;
    std::mt19937 mt_eng(rd());
    // min and max of random set range
    const int m = 0;   // min
    const int n = 100; // max
    std::uniform_int_distribution<> dist(m, n);
    // number to generate
    const int k = 50;
    for (int i = 0; i < k; ++i) {
        // only non-previously occurring values will be inserted
        if (!random_set.insert(dist(mt_eng)).second)
            --i;
    }
}
Assuming that you can't store k random numbers in memory, you'll have to generate the numbers in strictly ascending order as you go. One way to do it would be to generate a number between 0 and n/k. Call that number x. The next number you have to generate is between x+1 and (n-x)/(k-1). Continue in that fashion until you've selected k numbers.
Basically, you're dividing the remaining range by the number of values left to generate, and then generating a number in the first section of that range.
An example. You want to generate 3 numbers between 0 and 99, inclusive. So you first generate a number between 0 and 33. Say you pick 10.
So now you need a number between 11 and 99. The remaining range consists of 89 values, and you have two values left to pick. So, 89/2 = 44. You need a number between 11 and 54. Say you pick 36.
Your remaining range is from 37 to 99, and you have one number left to choose. So pick a number at random between 37 and 99.
This won't give you a truly uniform distribution over all possible selections, as once you choose a number it's impossible to get a number less than that in a subsequent choice. But it might be good enough for your purposes.
This pseudocode shows the basic idea.
pick_k_from_n(n, k)
{
    num_left = k
    last_k = 0;
    while num_left > 0
    {
        // divide the remaining range into num_left partitions
        range_size = (n - last_k) / num_left
        // pick a number in the first partition
        r = random(range_size) + last_k + 1
        output(r)
        last_k = r
        num_left = num_left - 1
    }
}
Note that this takes O(k) time and requires O(1) extra space.
You can do it in O(k) time with Floyd's algorithm (not Floyd-Warshall, that's a shortest-path thing). The only data structure you need is something that tells you whether or not a number has already been selected, e.g. a hash table. Lookups in a hash table can be O(1), so this will not be a burden, and it can be kept in memory even for very large n (if n is truly huge, you'll have to use a b-tree or bloom filter or something).
To select k items from among n:
for j = n-k+1 to n:
    select random x from 1 to j
    if x is already in hash:
        insert j into hash
    else:
        insert x into hash
That's it. At the end, your hash table will contain a uniformly selected sample of k items from among n. Read them out in order (you may have to pick a type of hash table that allows that).
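A small C++ sketch of Floyd's sampling using std::set, which also gives the sorted read-out for free (the pseudocode above is 1-based; this version samples from [0, n)):
#include <random>
#include <set>

// Floyd's algorithm: returns k distinct values uniformly sampled from [0, n),
// stored in an ordered set so they can be read out in ascending order.
std::set<long> floyd_sample(long n, long k, std::mt19937_64 &gen)
{
    std::set<long> sample;
    for (long j = n - k; j < n; ++j) {
        std::uniform_int_distribution<long> dist(0, j);   // random x in [0, j]
        long x = dist(gen);
        if (!sample.insert(x).second)   // x was already chosen: take j instead
            sample.insert(j);
    }
    return sample;
}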
Could you adjust each ascending index selection in a way that compensates for the probability distortion you are describing?
IANAS, but my guess would be that if you pick a random number r between 0 and 1 (that you'll scale to the full remaining index range after the adjustment), you might be able to adjust it by calculating r^(x) (keeping the range in 0..1, but increasing the probability of smaller numbers), with x selected by solving the equation for the probability of the first entry?
Here's an O(k log k + √n)-time algorithm that uses O(√n) words of space. This can be generalized to an O(k + n^(1/c))-time, O(n^(1/c))-space algorithm for any integer constant c.
For intuition, imagine a simple algorithm that uses (e.g.) Floyd's sampling algorithm to generate k of n elements and then radix sorts them in base √n. Instead of remembering what the actual samples are, we'll do a first pass where we run a variant of Floyd's where we remember only the number of samples in each bucket. The second pass is, for each bucket in order, to randomly resample the appropriate number of elements from the bucket range. There's a short proof involving conditional probability that this gives a uniform distribution.
# untested Python code for illustration
# b is the number of buckets (e.g., b ~ sqrt(n))
import random

def first_pass(n, k, b):
    counts = [0] * b  # list of b zeros
    for j in range(n - k, n):
        t = random.randrange(j + 1)
        if t // b >= counts[t % b]:  # intuitively, "t is not in the set"
            counts[t % b] += 1
        else:
            counts[j % b] += 1
    return counts

C++ Optimizing this Algorithm

After watching some Terence Tao videos, I wanted to try implementing algorithms into c++ code to find all the prime numbers up to a number n. In my first version, where I simply had every integer from 2 to n tested to see if they were divisible by anything from 2 to sqrt(n), I got the program to find the primes between 1-10,000,000 in ~52 seconds.
Attempting to optimize the program, and implementing what I now know to be the Sieve of Eratosthenes, I assumed the task would be done much faster than 51 seconds, but sadly, that wasn't the case. Even going up to 1,000,000 took a considerable amount of time (didn't time it, though)
#include <iostream>
#include <vector>
using namespace std;

void main()
{
    vector<int> tosieve = {};
    for (int i = 2; i < 1000001; i++)
    {
        tosieve.push_back(i);
    }
    for (int j = 0; j < tosieve.size(); j++)
    {
        for (int k = j + 1; k < tosieve.size(); k++)
        {
            if (tosieve[k] % tosieve[j] == 0)
            {
                tosieve.erase(tosieve.begin() + k);
            }
        }
    }
    //for (int f = 0; f < tosieve.size(); f++)
    //{
    //    cout << (tosieve[f]) << endl;
    //}
    cout << (tosieve.size()) << endl;
    system("pause");
}
Is it the repeated referencing of the vectors or something? Why is this so slow? Even if I'm completely overlooking something (could be, complete beginner at this :I) I would think that finding the primes between 2 and 1,000,000 with this horrible inefficient method would be faster than my original way of finding them from 2 to 10,000,000.
Hope someone has a clear answer to this - hopefully I can use whatever knowledge is gleaned in the future when optimizing programs using a lot of recursion.
The problem is that 'erase' moves every element in the vector down one, meaning it is an O(n) operation.
There are three alternative choices:
1) Just mark deleted elements as 'empty' (make them 0, for example). This will mean future passes have to pass over those empty positions, but that isn't that expensive.
2) Make a new vector, and push_back new values into there.
3) Use std::remove_if: This will move the elements down, but do it in a single pass so will be more efficient. If you use std::remove_if, then you will have to remember it doesn't resize the vector itself.
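For option 3, the usual pattern is the erase-remove idiom; a small sketch (assuming the composites have already been overwritten with 0 during the sieve pass):
#include <algorithm>
#include <vector>

// Compact the vector in one pass: remove_if shifts the survivors down,
// erase() then trims the leftover tail.
void drop_zeros(std::vector<int>& tosieve)
{
    tosieve.erase(std::remove_if(tosieve.begin(), tosieve.end(),
                                 [](int x) { return x == 0; }),
                  tosieve.end());
}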
Most vector operations, including erase(), have O(n) linear time complexity.
Since you have two loops over a vector of size 10^6, and each erase() is itself up to 10^6 operations, your algorithm executes up to 10^18 operations.
Cubic algorithms for such a big N take a huge amount of time.
N = 10^6 is big enough even for quadratic algorithms to struggle.
Please read carefully about the Sieve of Eratosthenes. The fact that both the full search and your Sieve of Eratosthenes took the same time means that you have implemented the second one wrong.
I see two performance issues here:
First of all, push_back() will have to reallocate the dynamic memory block once in a while. Use reserve():
vector<int> tosieve = {};
tosieve.reserve(1000001);
for (int i = 2; i < 1000001; i++)
{
    tosieve.push_back(i);
}
Second, erase() has to move all elements behind the one you try to remove. Set the elements to 0 instead and do one pass over the vector at the end (untested code):
for (auto& x : tosieve) {
    for (auto y = tosieve.begin(); *y < x; ++y) // this check works only in
                                                // the case of an ordered vector
        if (*y != 0 && x % *y == 0) x = 0;
}
{ // this block will make sure that sieved will be released afterwards
    auto sieved = vector<int>{};
    for (auto x : tosieve)
        if (x != 0)
            sieved.push_back(x);
    swap(tosieve, sieved);
} // the large memory block is released now, just keep the sieved elements.
Consider using standard algorithms instead of hand-written loops; they help you state your intent. In this case I would use std::transform() for the outer loop of the sieve, std::any_of() for the inner loop, std::generate_n() for filling tosieve at the beginning, and std::copy_if() for filling sieved (untested code):
vector<int> tosieve = {};
tosieve.reserve(1000001);
generate_n(back_inserter(tosieve), 1000001, []() -> int {
    static int i = 2; return i++;
});
transform(begin(tosieve), end(tosieve), begin(tosieve), [&tosieve](int i) -> int {
    return any_of(begin(tosieve), begin(tosieve) + i - 2,
                  [&i](int j) -> bool {
                      return j != 0 && i % j == 0;
                  }) ? 0 : i;
});
tosieve = [&tosieve]() -> vector<int> {
    auto sieved = vector<int>{};
    copy_if(begin(tosieve), end(tosieve), back_inserter(sieved),
            [](int i) -> bool { return i != 0; });
    return sieved;
}();
EDIT:
Yet another way to get that done:
vector<int> tosieve = {};
tosieve.reserve(1000001);
generate_n(back_inserter(tosieve), 1000001, []() -> int {
    static int i = 2; return i++;
});
tosieve = [&tosieve]() -> vector<int> {
    auto sieved = vector<int>{};
    copy_if(begin(tosieve), end(tosieve), back_inserter(sieved),
            [&tosieve](int i) -> bool {
                return !any_of(begin(tosieve), begin(tosieve) + i - 2,
                               [&i](int j) -> bool {
                                   return i % j == 0;
                               });
            });
    return sieved;
}();
Now, instead of marking elements we don't want and copying the rest afterwards, we directly copy only the elements we want to keep. This is not only faster than the above suggestion, but also states the intent better.
Very interesting task you have. Thanks!
With pleasure I implemented from scratch my own versions of solving it.
I created 3 separate (independent) functions, all based on Sieve of Eratosthenes. These 3 versions are different in their complexity and speed.
Just a quick note, my simplest (slowest) version finds all primes below your desired limit of 10'000'000 within just 0.025 sec (i.e. 25 milli-seconds).
I also tested all 3 versions to find primes below 2^32 (4'294'967'296), which is solved by "simple" version within 47 seconds, by "intermediate" version within 30 seconds, by "advanced" within 12 seconds. So within just 12 seconds it finds all primes below 4 Billion (there are 203'280'221 such primes below 2^32, see OEIS sequence)!!!
For simplicity, I will describe in detail only the Simple version of the 3. Here's the code:
#include <cstdint>
#include <vector>

using u8 = std::uint8_t;

template <typename T>
std::vector<T> GenPrimes_SieveOfEratosthenes(size_t end) {
    // https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
    if (end <= 2)
        return {};
    size_t const cnt = end >> 1;
    std::vector<u8> composites((cnt + 7) / 8);
    auto Get = [&](size_t i){ return bool((composites[i / 8] >> (i % 8)) & 1); };
    auto Set = [&](size_t i){ composites[i / 8] |= u8(1) << (i % 8); };
    std::vector<T> primes = {2};
    size_t i = 0;
    for (i = 1; i < cnt; ++i) {
        if (Get(i))
            continue;
        size_t const p = 2 * i + 1, start = (p * p) >> 1;
        primes.push_back(p);
        if (start >= cnt)
            break;
        for (size_t j = start; j < cnt; j += p)
            Set(j);
    }
    for (i = i + 1; i < cnt; ++i)
        if (!Get(i))
            primes.push_back(2 * i + 1);
    return primes;
}
This code implements the simplest but still fast algorithm for finding primes, called the Sieve of Eratosthenes. As a small optimization of speed and memory, I search only over odd numbers. This odd-numbers optimization lets me store 2x less memory and do 2x fewer steps, hence improving both speed and memory consumption exactly 2 times.
The algorithm is simple: we allocate an array of bits; this array has a 1 bit at position K if K is composite, or a 0 bit if K is probably prime. At the end, all 0 bits in the array signify definite primes (numbers that are for sure prime). Also, due to the odd-numbers optimization, this bit array stores only odd numbers, so the K-th bit actually represents the number 2 * K + 1.
Then, going left to right over this array of bits, if we meet a 0 bit at position K it means we have found a prime number P = 2 * K + 1, and starting from position (P * P) / 2 we mark every P-th bit with 1. That marks all composite numbers of at least P*P as composite, because they are divisible by P.
We do this procedure only until P * P becomes greater than or equal to our limit End (we're finding all primes < End). This limit guarantees that after reaching it, ALL zero bits in the array signify prime numbers.
The second version of the code adds only one optimization to this Simple version: it makes everything multi-core (multi-threaded). But this single optimization makes the code much bigger and more complex. Basically it slices the whole range of bits across all cores, so that they write bits to memory in parallel.
I'll explain only my third, Advanced version; it is the most complex of the 3 versions. It adds not only the multi-threaded optimization, but also the so-called Primorial optimization.
What is a primorial? It is a product of the first smallest primes; for example, I take the primorial 2 * 3 * 5 * 7 = 210.
We can see that any primorial splits the infinite range of integers into wheels by the modulus of this primorial. For example, primorial 210 splits it into the ranges [0; 210), [210; 2 * 210), [2 * 210; 3 * 210), etc.
Now it is easy to prove mathematically that inside ALL ranges of the primorial we can mark the same positions as composite; precisely, we can mark all numbers that are a multiple of 2, 3, 5 or 7 as composite.
We can see that out of 210 remainders there are 162 remainders that are for sure composite, and only 48 remainders that are possibly prime.
Hence it is enough for us to check the primality of only 48/210 = 22.8% of the whole search space. This reduction of the search space makes the task more than 4x faster and 4x less memory-consuming.
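The 48-out-of-210 count is easy to verify with a few lines (a quick side check, not part of the sieve code itself):
#include <iostream>

int main() {
    int probably_prime = 0;
    for (int r = 0; r < 210; ++r)
        if (r % 2 != 0 && r % 3 != 0 && r % 5 != 0 && r % 7 != 0)
            ++probably_prime;                     // residue not divisible by 2, 3, 5 or 7
    std::cout << probably_prime << " of 210\n";   // prints "48 of 210"
}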
One can see that my first, Simple version, due to the odd-only optimization, was in fact already using the primorial-equal-to-2 optimization. Yes, if we take primorial 2 instead of primorial 210, then we get exactly the first (Simple) version of the algorithm.
All of my 3 versions are tested for correctness and speed, although some tiny bugs may still remain. Note: it is still recommended not to use my code straight away in production unless it is tested thoroughly.
All 3 versions are tested for correctness by cross-checking each other's answers. I thoroughly test correctness by feeding in all limits (end values) from 0 to 2^18. This takes some time to do.
See main() function to figure out how to use my functions.
Try it online!
Due to the StackOverflow limit of 30K symbols per post, I can't inline the source code here, as it is almost 30K in size and, together with the English post above, it would take more than 30K. So I'm providing the source code on a separate GitHub Gist, linked below. Note that the Try it online! link above also contains the full source code, but I reduced the search limit of 2^32 to a smaller one due to GodBolt's running-time limit of 3 seconds.
Github Gist code
Output:
10M time 'Simple' 0.024 sec
Time 2^32 'Simple' 46.924 sec, number of primes 203280221
Time 2^32 'Intermediate' 30.999 sec
Time 2^32 'Advanced' 11.359 sec
All checked till 0
All checked till 5000
All checked till 10000
All checked till 15000
All checked till 20000
All checked till 25000

Maximise the AND

Given an array of n non-negative integers: A1, A2, …, AN. How to find a pair of integers Au, Av (1 ≤ u < v ≤ N) such that (Au and Av) is as large as possible.
Example: Let N = 4 and the array be [2, 4, 8, 10]. Here the answer is 8.
Explanation
2 and 4 = 0
2 and 8 = 0
2 and 10 = 2
4 and 8 = 0
4 and 10 = 0
8 and 10 = 8
How to do it if N can go up to 10^5?
I have an O(N^2) solution, but it's not efficient.
Code:
for (int i = 0; i < n; i++) {
    for (int j = i + 1; j < n; j++) {
        if ((arr[i] & arr[j]) > ans) // note the parentheses: & binds looser than >
        {
            ans = arr[i] & arr[j];
        }
    }
}
One way you could speed it up is to take advantage of the fact that if the same high bit is set in any two numbers, then the AND of those two numbers will ALWAYS be larger than any combination using only lower bits.
Therefore, if you order your numbers by the bits set you may decrease the number of operations drastically.
In order to find the most significant bit efficiently, GCC has a builtin intrinsic, __builtin_clz(unsigned int x), which returns the number of leading zero bits (from which the index of the most significant set bit follows). Other compilers have similar intrinsics, translating to a single instruction on at least x86.
const unsigned int BITS = sizeof(unsigned int)*8; // Assuming 8 bit bytes.

// Your implementation above.
unsigned int max_and_trivial( const std::vector<unsigned int> & input);

// Partition the set.
unsigned int max_and( const std::vector<unsigned int> & input ) {
    // For small input, just use the trivial algorithm.
    if ( input.size() < 100 ) {
        return max_and_trivial(input);
    }
    // by_bits[i] holds every element that has bit i set.
    std::vector<unsigned int> by_bits[BITS];
    for ( auto elem : input ) {
        unsigned int mask = elem;
        while (mask) { // Ignore elements that are 0.
            // __builtin_clz counts leading zeros, so convert it to a bit index.
            unsigned int most_sig = BITS - 1 - __builtin_clz(mask);
            by_bits[ most_sig ].push_back(elem);
            mask ^= 0x1u << most_sig;
        }
    }
    // Now, if any of the vectors in by_bits has more
    // than one element, the one with the highest index
    // will include the largest AND-value.
    for ( int i = BITS-1; i >= 0; i--) {
        if ( by_bits[i].size() > 1 ) {
            return max_and_trivial( by_bits[i]);
        }
    }
    // If you get here, the largest value is 0.
    return 0;
}
This algorithm still has worst case runtime O(N*N), but on average it should perform much better. You can also further increase the performance by repeating the partition step when you search through the smaller vector (just remember to ignore the most significant bit in the partition step, doing this should increase the performance to a worst case of O(N)).
Guaranteeing that there are no duplicates in the input-data will further increase the performance.
Sort the array in descending order.
Take the first two numbers. If they are both between two consecutive powers of 2 (say 2^k and 2^(k+1)), then you can remove all elements that are less than 2^k.
From the remaining elements, subtract 2^k.
Repeat steps 2 and 3 until the number of elements in the array is 2.
Note: If you find that only the largest element is between 2^k and 2^(k+1) and the second largest element is less than 2^k, then you will not remove any element, but just subtract 2^k from the largest element.
Also, determining where an element lies in the series {1, 2, 4, 8, 16, ...} can be done in O(log(log(MAX))) time where MAX is the largest number in the array.
I didn't test this, and I'm not going to. O(N) memory and O(N) complexity.
#include <vector>
#include <utility>
#include <algorithm>
using namespace std;

// Forward declarations of the helpers defined below.
int partial_sort_for_bit(vector<unsigned> &v, unsigned importantBit, unsigned vSize);
unsigned highest_bit_index(unsigned i);

/*
 * The idea is as follows:
 * 1.) Create a mathematical set A that holds integers.
 * 2.) Initialize importantBit = highest bit in any integer in v
 * 3.) Put into A all integers that have importantBit set to 1.
 * 4.) If |A| = 2, that is our answer. If |A| < 2, --importantBit and try again. If |A| > 2, basically
 *     redo the problem but only on the integers in set A.
 *
 * Keep "set A" at the beginning of v.
 */
pair<unsigned, unsigned> find_and_sum_pair(vector<unsigned> v)
{
    // Find highest bit in v.
    int importantBit = 0;
    for (auto num : v)
        importantBit = max(importantBit, (int)highest_bit_index(num));

    // Move all elements with importantBit to the front of the vector until doing so gives us at least 2 in the set.
    int setEnd;
    while ((setEnd = partial_sort_for_bit(v, importantBit, v.size())) < 2 && importantBit > 0)
        --importantBit;

    // If the set is never sufficient, no answer exists
    if (importantBit == 0)
        return pair<unsigned, unsigned>();

    // Repeat the problem only on the subset defined by A until |A| = 2 and impBit > 0 or impBit = 0
    while (importantBit > 1)
    {
        unsigned secondSetEnd = partial_sort_for_bit(v, --importantBit, setEnd);
        if (secondSetEnd >= 2)
            setEnd = secondSetEnd;
    }
    return pair<unsigned, unsigned>(v[0], v[1]);
}

// Returns end index (1 past last) of set A
int partial_sort_for_bit(vector<unsigned> &v, unsigned importantBit, unsigned vSize)
{
    unsigned setEnd = 0;
    unsigned mask = 1u << (importantBit - 1);
    for (decltype(v.size()) index = 0; index < vSize; ++index)
        if ((v[index] & mask) > 0)
            swap(v[index], v[setEnd++]);
    return setEnd;
}

unsigned highest_bit_index(unsigned i)
{
    unsigned ret = i != 0;
    while (i >>= 1)
        ++ret;
    return ret;
}
I came upon this problem again and solved it a different way (much more understandable to me):
unsigned findMaxAnd(vector<unsigned> &input) {
    vector<unsigned> candidates;
    for (unsigned mask = 1u << 31; mask; mask >>= 1) {
        for (unsigned i : input)
            if (i & mask)
                candidates.push_back(i);
        if (candidates.size() >= 2)
            input = move(candidates);
        candidates = vector<unsigned>();
    }
    if (input.size() < 2)
        return 0;
    return input[0] & input[1];
}
Here is an O(N * log MAX_A) solution:
1) Let's construct the answer greedily, iterating from the highest bit to the lowest one.
2) To do it, one can maintain a set S of numbers that currently fit. Initially, it consists of all numbers in the array. Let's also assume that initially ANS = 0.
3) Now let's iterate over all the bits from the highest to the lowest. Let's say that the current bit is B.
4) If the number of elements in S with a 1 in the B-th bit is greater than 1, it is possible to have a 1 in this position without changing the values of higher bits in ANS, so we should add 2^B to ANS and remove from S all elements which have a 0 in this bit (they do not fit anymore).
5) Otherwise, it is not possible to obtain a 1 in this position, so we do not change S and ANS and proceed to the next bit. (A compact sketch of this follows.)
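A compact sketch of that greedy construction (the set S is kept implicitly: an element is still "in S" if it contains every bit already committed to ANS):
#include <vector>

// Greedy bit-by-bit construction of the maximum pairwise AND, O(N * log MAX_A).
unsigned max_pair_and(const std::vector<unsigned>& a)
{
    unsigned ans = 0;
    for (int b = 31; b >= 0; --b) {
        unsigned candidate = ans | (1u << b);
        int fitting = 0;
        for (unsigned x : a)                 // elements of S: those matching `candidate`
            if ((x & candidate) == candidate)
                ++fitting;
        if (fitting >= 2)                    // at least two elements can keep this bit
            ans = candidate;
    }
    return ans;
}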

How to generate random permutations with CUDA

What parallel algorithms could I use to generate random permutations from a given set?
Especially proposals or links to papers suitable for CUDA would be helpful.
A sequential version of this would be the Fisher-Yates shuffle.
Example:
Let S={1, 2, ..., 7} be the set of source indices.
The goal is to generate n random permutations in parallel.
Each of the n permutations contains each of the source indices exactly once,
e.g. {7, 6, ..., 1}.
Fisher-Yates shuffle could be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle a vector of 8 elements. On the first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on the second iteration 0<->2, 1<->3, 4<->6, 5<->7; and on the last iteration 0<->4, 1<->5, 2<->6, 3<->7.
This could be easily implemented as CUDA __device__ code (inspired by standard min/max reduction):
const int id = threadIdx.x;
__shared__ int perm_shared[2 * BLOCK_SIZE];
perm_shared[2 * id] = 2 * id;
perm_shared[2 * id + 1] = 2 * id + 1;
__syncthreads();

unsigned int shift = 1;
unsigned int pos = id * 2;
while (shift <= BLOCK_SIZE)
{
    if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
    shift = shift << 1;
    pos = (pos & ~shift) | ((pos & shift) >> 1);
    __syncthreads();
}
Here the curand initialization code is omitted, and method swap(int *p, int i, int j) exchanges values p[i] and p[j].
Note that the code above has the following assumptions:
The length of permutation is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
2 * BLOCK_SIZE integers fit into __shared__ memory of CUDA device
BLOCK_SIZE is a valid size of CUDA block (usually something between 32 and 512)
To generate more than one permutation I would suggest utilizing different CUDA blocks. If the goal is to permute just 7 elements (as mentioned in the original question), then I believe it would be faster to do it in a single thread.
If the length of s = s_L, a very crude way of doing this could be implemented in thrust:
http://thrust.github.com.
First, create a vector val of length s_L x n that repeats s n times.
Create a vector val_keys that associates n unique keys, each repeated s_L times, with the elements of val, e.g.,
val = {1,2,...,7,1,2,...,7,....,1,2,...7}
val_keys = {0,0,0,0,0,0,0,1,1,1,1,1,1,2,2,2,...., n,n,n}
Now the fun part: create a vector of length s_L x n of uniformly distributed random variables
U = {0.24, 0.1, .... , 0.83}
then you can make a zip iterator over val, val_keys and sort them according to U:
http://codeyarns.com/2011/04/04/thrust-zip_iterator/
After that sort, both val and val_keys will be all over the place, so you have to put them back together again using thrust::stable_sort_by_key() to make sure that if val[i] and val[j] both belong to key[k] and val[i] precedes val[j] following the random sort, then in the final version val[i] still precedes val[j]. If all goes according to plan, val_keys should look just as before, but val should reflect the shuffling. (A rough sketch of these two sorts follows.)
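A rough sketch of those two sorts, assuming val, val_keys and U are thrust::device_vectors already filled as described (the names follow the text above):
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>

// The first sort scatters (val, val_keys) by the random draws U; the stable sort
// then regroups by key while preserving the now-random order inside each group.
void keyed_shuffle(thrust::device_vector<int>& val,
                   thrust::device_vector<int>& val_keys,
                   thrust::device_vector<float>& U)
{
    thrust::sort_by_key(U.begin(), U.end(),
                        thrust::make_zip_iterator(
                            thrust::make_tuple(val.begin(), val_keys.begin())));
    thrust::stable_sort_by_key(val_keys.begin(), val_keys.end(), val.begin());
}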
For large sets, using a sort primitive on a vector of randomized keys might be efficient enough for your needs. First, set up some vectors:
const int N = 65535;
thrust::device_vector<uint16_t> d_cards(N);
thrust::device_vector<uint16_t> d_keys(N);
thrust::sequence(d_cards.begin(), d_cards.end());
Then, each time you want to shuffle the d_cards call the pair of:
thrust::tabulate(d_keys.begin(), d_keys.end(), PRNFunc(rand()*rand()));
thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_cards.begin());
// d_cards now freshly shuffled
The random keys are generated from a functor that uses a seed (evaluated in host-code and copied to the kernel at launch-time) and a key number (which tabulate passes in at thread-creation time):
struct PRNFunc
{
    uint32_t seed;
    PRNFunc(uint32_t s) { seed = s; }
    __device__ __host__ uint32_t operator()(uint32_t kn) const
    {
        thrust::minstd_rand randEng(seed);
        randEng.discard(kn);
        return randEng();
    }
};
I have found that performance could be improved (by probably 30%) if I could figure out how to cache the allocations that thrust::sort_by_key does internally.
Any corrections or suggestions welcome.