Pick a unique random subset from a set of unique values - c++

C++. Visual Studio 2010.
I have a std::vector V of N unique elements (heavy structs). How can I efficiently pick M random, unique elements from it?
E.g. V contains 10 elements: { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 } and I pick three...
4, 0, 9
0, 7, 8
But NOT this: 0, 5, 5 <--- not unique!
STL is preferred. So, something like this?
std::minstd_rand gen; // linear congruential engine??
std::uniform_int<int> unif(0, V.size() - 1);
gen.seed((unsigned int)time(NULL));
// ...?
// Or is there a good solution using std::random_shuffle for heavy objects?

Create a random permutation of the range 0, 1, ..., N - 1 and pick the first M of them; use those as indices into your original vector.
A random permutation is easily made with the standard library by using std::iota together with std::random_shuffle:
std::vector<Heavy> V; // given
std::vector<unsigned int> indices(V.size());
std::iota(indices.begin(), indices.end(), 0);
std::random_shuffle(indices.begin(), indices.end());
// use V[indices[0]], V[indices[1]], ..., V[indices[M-1]]
You can supply random_shuffle with a random number generator of your choice; check the documentation for details.

Most of the time, the method provided by Kerrek is sufficient. But if N is very large, and M is orders of magnitude smaller, the following method may be preferred.
Create a set of unsigned integers, and add random numbers to it in the range [0,N-1] until the size of the set is M. Then use the elements at those indexes.
std::set<unsigned int> indices;
std::mt19937 gen(std::random_device{}());
std::uniform_int_distribution<unsigned int> dist(0, N - 1);
while (indices.size() < M)
    indices.insert(dist(gen));

Since you wanted it to be efficient, I think you can get amortised O(M), assuming you have to perform the operation many times. However, this approach is not reentrant.
First of all create a local (i.e. static) vector of std::vector<...>::size_type (i.e. unsigned will do) values.
Each time you enter your function, grow the vector to N elements if necessary, filling it with values from the old size up to N-1:
static std::vector<unsigned> indices;
if (indices.size() < N) {
    indices.reserve(N);
    for (unsigned i = indices.size(); i < N; i++) {
        indices.push_back(i);
    }
}
Then, randomly pick M unique numbers from that vector:
std::vector<unsigned> result;
result.reserve(M);
for (unsigned i = 0; i < M; i++) {
    unsigned const r = getRandomNumber(0, N-i); // random number < N-i
    result.push_back(indices[r]);
    indices[r] = indices[N-i-1];  // move the tail value into the hole
    indices[N-i-1] = r;           // remember which slot was changed
}
Now, your result is sitting in the result vector.
However, you still have to repair your changes to indices for the next run, so that indices is monotonic again:
for (unsigned i = N-M; i < N; i++) {
    // restore previously changed values
    indices[indices[i]] = indices[i];
    indices[i] = i;
}
But this approach is only useful if you run the algorithm often, and if N doesn't grow so big that you cannot live with indices eating up RAM the whole time.

Related

Create Array as Vector with 10 Million elements and assign random numbers without duplicates

I'm trying to code a table of randomly generated numbers. While that in itself is simple, ensuring that the vector contains no duplicates isn't as easy as I thought. So far my code looks like this:
QStringList generatedTable;
srand (QTime::currentTime().msec());
std::vector<int> array(10000000);
for(std::size_t i = 0; i < array.size(); i++){
    array[i] = (rand() % 10000000000)+1;
}
It generates numbers just fine, but because I'm generating a large number of array elements (10 million), it will create duplicates even though I'm drawing from 10 billion possible values. I browsed around a bit and found something that seemed handy, but it doesn't work properly in my program. The code is from another Stack Overflow user:
#include <iostream>
#include <algorithm>
#include <functional>
#include <set>

int main()
{
    int arr[] = {0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1};
    std::set<int> duplicates;
    auto it = std::remove_if(std::begin(arr), std::end(arr), [&duplicates](int i) {
        return !duplicates.insert(i).second;
    });
    size_t n = std::distance(std::begin(arr), it);
    for (size_t i = 0; i < n; i++)
        std::cout << arr[i] << " ";
    return 0;
}
It basically moves all the duplicates to the end of the array, but for some reason it no longer works when the array gets bigger. The code will always place the iterator n at 32,768 as long as the array stays above a million; below a million it drops slightly to ~31,000. So while the code is nice, it doesn't really help me a lot. Does someone have a better option I could use? Since I'm still a Qt and C++ beginner, I don't know how to solve this problem properly.
If you want to sample N integers without replacement from the range [low, high) you can write this:
std::vector<int> array(N); // or reserve space for N elements up front
auto gen = std::mt19937{std::random_device{}()};
std::ranges::sample(std::views::iota(low, high),
                    array.begin(),
                    N,
                    gen);
std::ranges::shuffle(array, gen); // only if you want the samples in random order
Note that this requires C++20, otherwise the range to be sampled from can't be generated lazily, which would require it to be stored in memory. If you want to write something similar before C++20, you can use the range-v3 library.
Another simple option is a binary search tree (e.g. std::set): generate a random number in your range and check whether it's already there, retrying on collisions. Note that insertion and lookup in a balanced tree are performed in time O(log n) each.

Fastest sorting method for k digits and N elements, k <<< N

Question: There are n balls, each labeled 0, 1, or 2, in chaotic order, and I want to sort them from small to large. Balls:
1, 2, 0, 1, 1, 2, 2, 0, 1, 2, ...
We must use the fastest way and cannot use the sort() function. I thought of many ways, like bubble sort, insertion sort, etc., but they are not fast. Is there an algorithm that makes the time complexity O(log n) or O(n)?
given balls list A[] and length n
void sortBalls(int A[], int n)
{
//code here
}
Given the very limited number of item types (0, 1, and 2), you can just count the number of occurrences of each. Then, to produce the "sorted" array, you repeatedly emit each label the number of times it occurred. Running time is O(N).
int balls[N] = {...};    // array of balls: initialized to whatever
int sorted_balls[N];     // sorted array of balls (to be set below)
int counts[3] = {};      // count of each label, zero-initialized array

// enumerate over the input array and count each label's occurrence
for (int i = 0; i < N; i++)
{
    counts[balls[i]]++;
}

// "sort" the items by emitting each label the number of times it was counted above
int k = 0;
for (int j = 0; j < 3; j++)
{
    for (int x = 0; x < counts[j]; x++)
    {
        cout << j << ", ";      // print
        sorted_balls[k] = j;    // store into the final sorted array
        k++;
    }
}
If you have a small number of possible values known in advance, and the value is everything you need to know about the ball (they carry no other attributes), "sorting" becomes equivalent to "counting how many of each value there are". So you generate a histogram - an array from 0 to 2, in your case - go through your values and increase the corresponding count. Then you generate an array of n_0 balls with number 0, n_1 balls with number 1 and n_2 with number 2, and voila, they're sorted.
It's trivially obvious that you cannot go below O(n) - at the very least, you have to look at each value once to count it, and for n values, that's n operations right away.

How to uniform spread every k values over a collection of n values with k <= n?

I have a collection of k elements. I need to spread them uniformly at random into a collection of n elements, where k <= n.
So for example, with this k-collection (with k = 3):
{ 3, 5, 6 }
and given n = 7, a valid result (with n = 7 elements) could be:
{ 6, 5, 6, 3, 3, 6, 5 }
Notice that every item from the k-collection must be used in the result.
So this is not a valid result:
{ 6, 3, 6, 3, 3, 6, 6 } // it lacks "5"
What's a fast way to accomplish this?
The simplest way I can think of.
Add one of each item to the array. So with your example, your initial array is [3,5,6]. This guarantees that every element is represented at least once.
Then, successively pick an element at random and add it to the array. Do this n - k times (n - 3 in your example), i.e. fill the rest of the array with randomly selected items from the element list.
Shuffle the array.
This takes O(n) to fill the array, and O(n) to shuffle it.
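The steps above might be sketched like this (the function name and signature are illustrative, not part of the question):

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Fill n slots: one copy of every source element first, then random
// picks from source, then a shuffle so positions are random too.
std::vector<int> spread(const std::vector<int>& source, std::size_t n)
{
    std::mt19937 gen(std::random_device{}());
    std::vector<int> result(source);  // every element at least once
    std::uniform_int_distribution<std::size_t> pick(0, source.size() - 1);
    while (result.size() < n)         // fill the remaining n-k slots
        result.push_back(source[pick(gen)]);
    std::shuffle(result.begin(), result.end(), gen);
    return result;
}
```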
Let's assume you have a
std::vector<int> input;
that contains the k elements you need to spread and
std::vector<int> output;
that will be filled with n elements.
I used the following approach for a similar problem. (Edit: thinking about it, here is a simpler and probably faster version than the original.)
First we satisfy the condition that every item from input must occur at least once in output, so we put every element from input into output once.
output.resize(n); // fill with n 0's
std::copy(input.begin(), input.end(), output.begin()); // fill k first items
Now we can fill up the remaining n - k slots with random elements from input:
std::random_device rd;
std::mt19937 rand(rd()); // get seed from random device
std::uniform_int_distribution<> dist(0, k - 1); // for random numbers in [0, k-1]
for(size_t i = k; i < n; i++) {
    output[i] = input[dist(rand)];
}
At the end shuffle the whole thing, to randomize the position of the first k elements:
std::shuffle(output.begin(), output.end(), rand); // note: std::shuffle takes an engine like mt19937; std::random_shuffle does not
I hope this is what you wanted.
You can also try just putting random values into the n-collection and then verifying that it contains all the k-collection values; if not, try again. However, that isn't always fast. Alternatively, put the missing values into random places in the n-collection, but remember to verify again afterwards.
Simply make an array of the k elements, say {3,5,6} in the given example. Make a counter variable, initially zero. To spread the values over n elements, iterate over the n output slots, incrementing the counter as
counter=(counter+1)%k;
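For completeness, here is what that counter loop looks like in full. Note that it yields a deterministic round-robin spread, not a random one; shuffle the result afterwards if you need randomness (the function name is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Cycle through the k source values until all n output slots are filled.
std::vector<int> roundRobinSpread(const std::vector<int>& source, std::size_t n)
{
    std::vector<int> result;
    result.reserve(n);
    std::size_t counter = 0;
    for (std::size_t i = 0; i < n; i++) {
        result.push_back(source[counter]);
        counter = (counter + 1) % source.size();
    }
    return result;
}
```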

Performance optimization nested loops

I am implementing rather complicated code, and in one of the critical sections I basically need to consider all possible strings of numbers following a certain rule. A naive nested-loop implementation to explain what I do would be:
std::array<int,3> max = { 3, 4, 6};
for(int i = 0; i <= max.at(0); ++i){
    for(int j = 0; j <= max.at(1); ++j){
        for(int k = 0; k <= max.at(2); ++k){
            DoSomething(i, j, k);
        }
    }
}
Obviously, I actually need more nested fors and the "max" rule is more complicated, but I think the idea is clear.
I implemented this idea using a recursive function approach:
std::array<int,3> max = { 3, 4, 6};
std::array<int,3> index = {0, 0, 0};
int total_depth = 3;
recursive_nested_for(0, index, max, total_depth);
where
void recursive_nested_for(int depth, std::array<int,3>& index,
                          std::array<int,3>& max, int total_depth)
{
    if(depth != total_depth){
        for(int i = 0; i <= max.at(depth); ++i){
            index.at(depth) = i;
            recursive_nested_for(depth+1, index, max, total_depth);
        }
    }
    else
        DoSomething(index);
}
In order to save as much as possible, I declare all the variables I use as globals in the actual code.
Since this part of the code takes really long, is it possible to do anything to speed it up?
I would even be open to writing 24 nested fors if necessary, at least to avoid the overhead!
I thought that maybe an approach like expression templates, to actually generate these nested fors at compile time, could be more elegant. But is it possible?
Any suggestion would be greatly appreciated.
Thanks to all.
The recursive_nested_for() is a nice idea. It's a bit inflexible as it is currently written. However, you could use std::vector<int> for the array dimensions and indices, or make it a template to handle any size std::array<>. The compiler might be able to inline all recursive calls if it knows how deep the recursion is, and then it will probably be just as efficient as the three nested for-loops.
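A templated version of the recursion, sketched here under the assumption that DoSomething can take the whole index array, might look like this:

```cpp
#include <array>
#include <cstddef>

// Recursion over an arbitrary, compile-time-known number of dimensions.
// Bounds are inclusive, matching the question's `i <= max.at(depth)` loops.
template <std::size_t N, typename Visit>
void recursive_nested_for(std::array<int, N>& index,
                          const std::array<int, N>& max,
                          Visit visit, std::size_t depth = 0)
{
    if (depth == N) {
        visit(index);   // innermost body, e.g. DoSomething(index)
        return;
    }
    for (int i = 0; i <= max[depth]; ++i) {
        index[depth] = i;
        recursive_nested_for(index, max, visit, depth + 1);
    }
}
```

Because the dimension count and recursion depth are compile-time constants, the compiler has a good chance of inlining the whole recursion.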
Another option is to use a single loop with an odometer-style increment of the indices (note that max is treated as an exclusive upper bound here, unlike the inclusive <= loops in the question):
void nested_for(std::array<int,3>& index, std::array<int,3>& max)
{
    while (index.at(2) < max.at(2)) {
        DoSomething(index);
        // Increment indices, carrying into the next dimension on overflow
        for (int i = 0; i < 3; ++i) {
            if (++index.at(i) < max.at(i))
                break;
            if (i < 2)
                index.at(i) = 0; // don't wrap the last index, or the loop never terminates
        }
    }
}
However, you can also consider creating a linear sequence that visits all possible combinations of the iterators i, j, k and so on. For example, with array dimensions {3, 4, 6}, there are 3 * 4 * 6 = 72 possible combinations. So you can have a single counter going from 0 to 72, and then "split" that counter into the three iterator values you need, like so:
for (int c = 0; c < 72; c++) {
    int k = c % 6;
    int j = (c / 6) % 4;
    int i = c / 6 / 4;
    DoSomething(i, j, k);
}
You can generalize this to as many dimensions as you want. Of course, the more dimensions you have, the higher the cost of splitting the linear iterator. But if your array dimensions are powers of two, it might be very cheap to do so. Also, it might be that you don't need to split it at all; for example, if you are calculating the sum of all elements of a multidimensional array, you don't care about the actual indices i, j, k and so on, you just want to visit all elements once. If the array is laid out linearly in memory, then you just need a linear iterator.
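A generalized split for an arbitrary dimension count, sketched with exclusive upper bounds as in the 72-combination example, could look like:

```cpp
#include <array>
#include <cstddef>

// Convert a linear counter c in [0, dims[0] * ... * dims[N-1]) into one
// index per dimension; the last dimension varies fastest.
template <std::size_t N>
std::array<int, N> split_index(long long c, const std::array<int, N>& dims)
{
    std::array<int, N> index{};
    for (std::size_t d = N; d-- > 0; ) {  // peel off the innermost dimension first
        index[d] = static_cast<int>(c % dims[d]);
        c /= dims[d];
    }
    return index;
}
```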
Of course, if you have 24 nested for loops, you'll notice that the product of all the dimension's sizes will become a very large number. If it doesn't fit in a 32 bit integer, your code is going to be very slow. If it doesn't fit into a 64 bit integer anymore, it will never finish.

How to get intersection of two Arrays

I have two integer arrays
int A[] = {2, 4, 3, 5, 6, 7};
int B[] = {9, 2, 7, 6};
And I have to get the intersection of these arrays,
i.e. the output will be: 2, 6, 7.
I am thinking of solving it by saving array A in a data structure, then comparing all the elements up to the size of A or B, and then I will get the intersection.
Now I have a problem: I need to first store the elements of array A in a container.
Shall I do something like
int size = sizeof(A)/sizeof(int);
to get the size? By doing this I will get the size, but after that I also want to access all the elements and store them in a container.
Here is the code which I am using to find the intersection:
#include <iostream>
using namespace std;

int A[] = {2, 4, 3, 5, 6, 7};
int B[] = {9, 2, 7, 6};

int main()
{
    int sizeA = sizeof(A)/sizeof(int);
    int sizeB = sizeof(B)/sizeof(int);
    int big = (sizeA > sizeB) ? sizeA : sizeB;
    int small = (sizeA > sizeB) ? sizeB : sizeA;
    for (int i = 0; i < big; ++i)
    {
        for (int j = 0; j < small; ++j)
        {
            if(A[i] == B[j])
            {
                cout << "Element is --> " << A[i] << endl;
            }
        }
    }
    return 0;
}
Just use a hash table:
#include <unordered_set> // needs C++11 or TR1
// ...
unordered_set<int> setOfA(A, A + sizeA);
Then you can just check for every element in B, whether it's also in A:
for (int i = 0; i < sizeB; ++i) {
    if (setOfA.find(B[i]) != setOfA.end()) {
        cout << B[i] << endl;
    }
}
Runtime is expected O(sizeA + sizeB).
You can sort the two arrays
sort(A, A+sizeA);
sort(B, B+sizeB);
and use a merge-like algorithm to find their intersection:
#include <vector>
...
std::vector<int> intersection;
int idA = 0, idB = 0;
while(idA < sizeA && idB < sizeB) {
    if (A[idA] < B[idB]) idA++;
    else if (B[idB] < A[idA]) idB++;
    else { // A[idA] == B[idB], we have a common element
        intersection.push_back(A[idA]);
        idA++;
        idB++;
    }
}
The time complexity of this part of the code is linear. However, due to the sorting of the arrays, the overall complexity becomes O(n * log n), where n = max(sizeA, sizeB).
The additional memory required for this algorithm is optimal (equal to the size of the intersection).
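For sorted ranges, the standard library already packages this merge-like walk as std::set_intersection; a sketch (the function name is illustrative):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Sort copies of both inputs, then let std::set_intersection do the
// merge-like walk and collect the common elements.
std::vector<int> intersect(std::vector<int> a, std::vector<int> b)
{
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<int> result;
    std::set_intersection(a.begin(), a.end(),
                          b.begin(), b.end(),
                          std::back_inserter(result));
    return result;
}
```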
saving array A in a data strcture
Arrays are data structures; there's no need to save A into one.
i want to compare all the element till size A or B and then i will get intersection
This is extremely vague but isn't likely to yield the intersection; notice that you must examine every element in both A and B but "till size A or B" will ignore elements.
What approach i should follow to get size of an unkown size array and store it in a container??
It isn't possible to deal with arrays of unknown size in C unless they have some end-of-array sentinel that allows counting the number of elements (as is the case with NUL-terminated character arrays, commonly referred to in C as "strings"). However, the sizes of your arrays are known because their compile-time sizes are known. You can calculate the number of elements in such arrays with a macro:
#define ARRAY_ELEMENT_COUNT(a) (sizeof(a)/sizeof *(a))
...
int *ptr = new sizeof(A);
[Your question was originally tagged [C], and my comments below refer to that]
This isn't valid C -- new is a C++ keyword.
If you wanted to make copies of your arrays, you could simply do it with, e.g.,
int Acopy[ARRAY_ELEMENT_COUNT(A)];
memcpy(Acopy, A, sizeof A);
or, if for some reason you want to put the copy on the heap,
int* pa = malloc(sizeof A);
if (!pa) { /* handle out-of-memory */ }
memcpy(pa, A, sizeof A);
/* After you're done using pa: */
free(pa);
[In C++ you would use new and delete]
However, there's no need to make copies of your arrays in order to find the intersection, unless you need to sort them (see below) but also need to preserve the original order.
There are a few ways to find the intersection of two arrays. If the values fall within the range of 0-63, you can use two unsigned longs and set the bits corresponding to the values in each array, then use & (bitwise "and") to find the intersection. If the values aren't in that range but the difference between the largest and smallest is < 64, you can use the same method but subtract the smallest value from each value to get the bit number. If the range is not that small but the number of distinct values is <= 64, you can maintain a lookup table (array, binary tree, hash table, etc.) that maps the values to bit numbers and a 64-element array that maps bit numbers back to values.
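The 0-63 bitmask variant might be sketched like this (the function name is illustrative):

```cpp
#include <cstdint>
#include <vector>

// Intersect two arrays whose values all lie in [0, 63]: set one bit per
// value, AND the two masks, then read the surviving bits back out
// (sorted order, duplicates collapsed).
std::vector<int> intersect_small(const std::vector<int>& a,
                                 const std::vector<int>& b)
{
    std::uint64_t ma = 0, mb = 0;
    for (int v : a) ma |= std::uint64_t(1) << v;
    for (int v : b) mb |= std::uint64_t(1) << v;
    std::uint64_t common = ma & mb;
    std::vector<int> result;
    for (int v = 0; v < 64; v++)
        if (common & (std::uint64_t(1) << v))
            result.push_back(v);
    return result;
}
```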
If your arrays may contain more than 64 distinct values, there are two effective approaches:
1) Sort each array and then compare them element by element to find the common values -- this algorithm resembles a merge sort.
2) Insert the elements of one array into a fast lookup table (hash table, balanced binary tree, etc.), and then look up each element of the other array in the lookup table.
Sort both arrays (e.g., qsort()) and then walk through both arrays one element at a time.
Where there is a match, add it to a third array, which needs room for at most the smaller of the two input arrays plus a terminator slot (the intersection can be no larger than the smaller array). Use a negative or other "dummy" value as your terminator.
When walking through input arrays, where one value in the first array is larger than the other, move the index of the second array, and vice versa.
When you're done walking through both arrays, your third array has your answer, up to the terminator value.
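That qsort-and-walk scheme might be sketched like this (the -1 sentinel and function names are assumptions; pick any terminator value that cannot occur in your data):

```cpp
#include <cstdlib>

// Three-way comparator for qsort.
static int cmp_int(const void* pa, const void* pb)
{
    int a = *static_cast<const int*>(pa);
    int b = *static_cast<const int*>(pb);
    return (a > b) - (a < b);
}

// Sort both arrays in place, walk them in step, and write matches to
// 'out', which needs room for min(na, nb) + 1 entries; -1 terminates.
void intersect_sorted(int* a, int na, int* b, int nb, int* out)
{
    std::qsort(a, na, sizeof(int), cmp_int);
    std::qsort(b, nb, sizeof(int), cmp_int);
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) i++;
        else if (b[j] < a[i]) j++;
        else { out[k++] = a[i]; i++; j++; }
    }
    out[k] = -1; // terminator
}
```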