Removing duplicates from an array using std::map - c++

I'm posting code that I wrote on collabedit in under five minutes (including figuring out the algorithm), so even at the risk of being made fun of for its efficiency, I wanted to ask my fellow experienced Stack Overflow algorithm enthusiasts about the problem:
The task is removing duplicate elements from an array. My approach: use std::map as my hash table, and for each element in the duplicate-laden array, if the value has not been assigned yet, add it to our new array; if it has, just skip it. At the end, return the unique array. Here is my code, and the only thing I'm asking, in terms of an interview question: can my solution be more efficient?
#include <iostream>
#include <vector>
#include <map>
using namespace std;

vector<int> uniqueArr(int arr[], int size) {
    std::map<int, int> storedValues;
    vector<int> uniqueArr;
    for (int i = 0; i < size; i++) {
        if (storedValues[arr[i]] == 0) {
            uniqueArr.push_back(arr[i]);
            storedValues[arr[i]] = 1;
        }
    }
    return uniqueArr;
}

int main()
{
    const int size = 10;
    int arr[size] = {1, 2, 2, 4, 2, 5, 6, 5, 7, 1};
    vector<int> uniArr = uniqueArr(arr, size);
    cout << "Result: ";
    for (size_t i = 0; i < uniArr.size(); i++) cout << uniArr[i] << " ";
    cout << endl;
    return 0;
}

First of all, there is no need for a map; a set is conceptually more correct, since you don't want to store any values, only the keys.
Performance-wise, it might be a better idea to use a std::unordered_set instead of a std::set, as the former is hashed and can give you O(1) insert and lookup on average, whereas the latter is a balanced binary search tree, giving you only O(log n) access.
#include <unordered_set>
#include <vector>

std::vector<int> uniqueArr(int arr[], int size)
{
    std::unordered_set<int> storedValues;
    std::vector<int> result;
    for (int i = 0; i < size; ++i) {
        // insert().second is true only when the value was not yet present
        if (storedValues.insert(arr[i]).second)
            result.push_back(arr[i]);
    }
    return result;
}
But if you are allowed to use the C++ standard library more extensively, you may also consider the other answers using std::sort and std::unique, although they are O(n log n) (instead of the roughly O(n) solution above) and do not preserve the order of the elements.
If you want a more flexible, std-driven approach with roughly O(n) complexity and without destroying the order of the elements, you can transform the above routine into the following std-like algorithm, even if it is a bit far-fetched for a simple interview question:
#include <algorithm>
#include <iterator>
#include <unordered_set>

template<typename ForwardIterator>
ForwardIterator unordered_unique(ForwardIterator first, ForwardIterator last)
{
    typedef typename std::iterator_traits<ForwardIterator>::value_type value_type;
    std::unordered_set<value_type> unique;
    // remove_if shifts every already-seen value to the back of the range
    return std::remove_if(first, last,
                          [&unique](const value_type &arg)
                          { return !unique.insert(arg).second; });
}
Which you can then apply like std::unique in the usual erase-remove way:
std::vector<int> values(...);
values.erase(unordered_unique(values.begin(), values.end()), values.end());
This removes the duplicate values without copying the vector and without needing to sort it beforehand.

Since you are asking in terms of an interview question, I will say that you don't get the job.
#include <algorithm>
#include <iostream>
#include <iterator>

const int size = 10;
int arr[size] = {1, 2, 2, 4, 2, 5, 6, 5, 7, 1};

std::sort(&arr[0], &arr[size]);
int* new_end = std::unique(&arr[0], &arr[size]);
std::copy(&arr[0], new_end,
          std::ostream_iterator<int>(std::cout, " "));
No temporary maps, no temporary vectors, no dynamic memory allocations, and a lot less code, so it's easier both to write and to maintain.

#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> vec({1, 2, 3, 2, 4, 4, 5, 7, 6, 6});
    std::sort(vec.begin(), vec.end());
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
    // vec = {1,2,3,4,5,6,7}
    return 0;
}
// works with C++11
// O(n log n)

In-place removal's nice for speed - something like this (returning the new size):
#include <cstddef>
#include <unordered_set>

template <typename T, size_t N>
size_t keep_unique(T (&array)[N])
{
    std::unordered_set<T> found;
    size_t j = 0;
    for (size_t i = 0; i < N; ++i) {
        if (found.insert(array[i]).second) {
            if (j != i)   // (optional) avoid copy to self, as it may be slower or unsupported by T
                array[j++] = array[i];
            else
                ++j;
        }
    }
    return j;
}
(For larger objects, or those that can't be safely copied, it may be necessary and/or faster and more space efficient to store T*s in the unordered_set; you must then also provide a dereferencing equality comparison and hash function - see the sketch after the visualisation below.)
To visualise how this works, consider processing the following input:
1 3 6 3 5 6 0 2 1
<--+<----+ |
<-----+
The arrows above represent the minimal in-place compaction necessary to produce the answer:
1 3 6 5 0 2
That's precisely what the algorithm above does, looking at all the elements at [i], and keeping track of where they need to be copied to (and how many non-duplicates there are) in [j].
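To make the earlier parenthetical concrete, here is a hedged sketch of the pointer-storing variant (functor and function names are mine, not from the answer). The set holds pointers into the already-compacted prefix of the array, whose slots are never overwritten later, and hashes/compares through the pointee so large T objects are never copied into the set:

#include <cstddef>
#include <functional>
#include <unordered_set>

struct deref_hash {
    template <typename T>
    std::size_t operator()(const T* p) const { return std::hash<T>()(*p); }
};
struct deref_equal {
    template <typename T>
    bool operator()(const T* a, const T* b) const { return *a == *b; }
};

template <typename T, std::size_t N>
std::size_t keep_unique_by_ptr(T (&array)[N])
{
    std::unordered_set<const T*, deref_hash, deref_equal> found;
    std::size_t j = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (found.count(&array[i]) == 0) {  // compare against kept elements only
            if (j != i)
                array[j] = array[i];
            found.insert(&array[j]);        // slot j is stable from now on
            ++j;
        }
    }
    return j;
}

Pointers must reference the kept slots [0, j), not the scan position, because the scan region gets overwritten during compaction; that is the one subtlety this variant adds.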

Related

How to efficiently delete elements from a vector given another vector

What is the best way to delete elements from a vector given another vector?
I have come up with the following code:
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

void remove_elements(vector<int>& vDestination, const vector<int>& vSource)
{
    if (!vDestination.empty() && !vSource.empty())
    {
        for (auto i : vSource) {
            vDestination.erase(std::remove(vDestination.begin(), vDestination.end(), i), vDestination.end());
        }
    }
}

int main()
{
    vector<int> v1 = {1, 2, 3};
    vector<int> v2 = {4, 5, 6};
    vector<int> v3 = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    remove_elements(v3, v1);
    remove_elements(v3, v2);
    for (auto i : v3)
        cout << i << endl;
    return 0;
}
Here the output will be:
7
8
9
My version is the following: I apply erase only once, after all elements from vSource have been shifted to the end by std::remove, and I keep track of the logical end of vDestination so each pass doesn't iterate over the already-removed tail for nothing.
void remove_elements(vector<int>& vDestination, const vector<int>& vSource)
{
    auto last = std::end(vDestination);
    std::for_each(std::begin(vSource), std::end(vSource), [&](const int & val) {
        last = std::remove(std::begin(vDestination), last, val);
    });
    vDestination.erase(last, std::end(vDestination));
}
See on coliru : http://coliru.stacked-crooked.com/a/6e86893babb6759c
Update
Here is a template version, so you don't care about the container type :
template <class ContainerA, class ContainerB>
void remove_elements(ContainerA & vDestination, const ContainerB & vSource)
{
    auto last = std::end(vDestination);
    std::for_each(std::begin(vSource), std::end(vSource), [&](typename ContainerB::const_reference val) {
        last = std::remove(std::begin(vDestination), last, val);
    });
    vDestination.erase(last, std::end(vDestination));
}
Note
This version works for vectors without any constraints; if your vectors are sorted you can take some shortcuts and avoid iterating over the vector again and again to delete each element.
I assume that by best you mean the fastest one that works. Since it's a question about efficiency, I performed a simple benchmark to compare the efficiency of several algorithms. Note that they differ a little, since the problem is a bit underspecified - the questions that arise (and the assumptions taken for the benchmark) are:
is it guaranteed that vDestination contains all elements from vSource? (assumption: no)
are duplicates allowed in either vDestination or vSource? (assumption: yes, in both)
does the order of the elements in the result vector matter? (algorithms for both cases tested)
should every element from vDestination be removed if it is equal to any element from vSource, or only one-for-one? (assumption: every matching element is removed)
are the sizes of vDestination and vSource somehow bounded? Is one of them always bigger or much bigger? (several cases tested)
in the comments it's already explained that the vectors don't need to be sorted, but I've included this point, as it's not immediately visible from the question (no sorting assumed in either vector)
As you can see, there are a few points in which the algorithms differ, and consequently, as you can guess, the best algorithm will depend on your use case. The compared algorithms include:
original one (proposed in question) - baseline
proposed in #dkg answer
proposed in #Revolver_Ocelot's answer + additional sorting (required by the algorithm) and pre-reservation of space for the result vector
proposed in #Jarod42 answer
set-based algorithm (presented below - mostly optimization of #Jarod42 algorithm)
counting algorithm (presented below)
set-based algorithm:
std::unordered_set<int> elems(vSource.begin(), vSource.end());

auto i = vDestination.begin();
auto target = vDestination.end();
while (i < target) {                 // note: i < target, not i <= target,
    if (elems.count(*i) > 0)         // so we never dereference past the live range
        std::swap(*i, *(--target));
    else
        i++;
}
vDestination.erase(target, vDestination.end());
counting algorithm:
std::unordered_map<int, int> counts;
counts.max_load_factor(0.3);
counts.reserve(vDestination.size());
for (auto v : vDestination) {
    counts[v]++;
}
for (auto v : vSource) {
    counts[v]--;
}

auto i = vDestination.begin();
for (auto k : counts) {
    if (k.second < 1) continue;
    i = std::fill_n(i, k.second, k.first);
}
vDestination.resize(std::distance(vDestination.begin(), i));
The benchmarking procedure was executed using the Celero library and was the following:
Generate n pseudo-random ints (n in {10, 100, 1000, 10000, 20000, 200000}) and put them into a vector
Copy a fraction m of these ints to a second vector (fractions from {0.01, 0.1, 0.2, 0.4, 0.6, 0.8}, min. 1 element)
Start timer
Execute removal procedure
Stop timer
Only algorithms 3, 5 and 6 were executed on datasets larger than 10,000 elements, as the rest of them took too long for me to comfortably measure (feel free to try it yourself).
Long story short: if your vectors contain fewer than 1000 elements, pick whichever you prefer. If they are longer, rely on the size of vSource: if it's less than 50% of vDestination, choose the set-based algorithm; if it's more, sort them and pick #Revolver_Ocelot's solution (they tie around 60%, with the set-based one being over 2x faster when vSource is 1% of the size of vDestination). Please don't rely on order, or provide a vector that is sorted from the beginning - the requirement that ordering remain the same slows the process down dramatically. Benchmark on your use case, your compiler, your flags and your hardware. I've attached a link to my benchmarks, in case you want to reproduce them.
Complete results (file vector-benchmarks.csv) are available on GitHub together with benchmarking code (file tests/benchmarks/vectorRemoval.cpp) here.
Please keep in mind that these are results that I've obtained on my computer, my compiler etc. - in your case they will differ (especially when it comes to point in which one algorithm is better than another).
I've used GCC 6.1.1 with -O3 on Fedora 24, on top of VirtualBox.
If your vectors are always sorted, you can use set_difference:
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>

void remove_elements(std::vector<int>& vDestination, const std::vector<int>& vSource)
{
    std::vector<int> result;
    std::set_difference(vDestination.begin(), vDestination.end(),
                        vSource.begin(), vSource.end(),
                        std::back_inserter(result));
    vDestination.swap(result);
}

int main()
{
    std::vector<int> v1 = {1, 2, 3};
    std::vector<int> v2 = {4, 5, 6};
    std::vector<int> v3 = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    remove_elements(v3, v1);
    remove_elements(v3, v2);
    for (auto i : v3)
        std::cout << i << '\n';
}
int main()
{
std::vector<int> v1={1,2,3};
std::vector<int> v2={4,5,6};
std::vector<int> v3={1,2,3,4,5,6,7,8,9};
remove_elements(v3,v1);
remove_elements(v3,v2);
for(auto i:v3)
std::cout << i << '\n';
}
If not for the requirement that the output range must not overlap with either input range, we could even avoid the additional vector. Potentially you could roll your own version of set_difference that is allowed to output to a range starting at vDestination.begin(), but that is outside the scope of this answer.
Can be written with STL as:
void remove_elements(vector<int>& vDestination, const vector<int>& vSource)
{
    const auto isInSource = [&](int e) {
        return std::find(vSource.begin(), vSource.end(), e) != vSource.end();
    };
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(), isInSource),
        vDestination.end());
}
If vSource is sorted, you may replace std::find with std::binary_search, as sketched below.
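A minimal sketch of that replacement (the _sorted suffix is my naming; it assumes vSource is sorted ascending):

void remove_elements_sorted(vector<int>& vDestination, const vector<int>& vSource)
{
    // O(log n) membership test, valid only because vSource is sorted
    const auto isInSource = [&](int e) {
        return std::binary_search(vSource.begin(), vSource.end(), e);
    };
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(), isInSource),
        vDestination.end());
}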

Simultaneous in-place std::sort on a vector of keys and a vector of values

I have a vector<uint64_t> keys and a vector<char> vals, both of size N. I would like to sort keys and vals based on entries in keys.
An obvious solution is copying into a vector<pair<uint64_t, char>>, sorting that, and copying the sorted data back out, but I would like to avoid copying, and I would like to avoid the alignment padding: sizeof(pair<uint64_t, char>) is 2*sizeof(uint64_t), or 16 bytes, due to alignment; much more than the 9 bytes needed.
In other words, although the following C++11 implementation is correct, it is not efficient enough:
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>
using namespace std;

void aux_sort(vector<uint64_t> & k, vector<char> & v) {
    vector<pair<uint64_t, char> > kv(k.size());
    for (size_t i = 0; i < k.size(); ++i) kv[i] = make_pair(k[i], v[i]);
    sort(kv.begin(), kv.end());
    for (size_t i = 0; i < k.size(); ++i) tie(k[i], v[i]) = kv[i];
}
Although the following C++11 implementation is correct, I want to use std::sort instead of hand-coding my own sorting algorithm:
#include <algorithm>
#include <cstdint>
#include <vector>
using namespace std;

void aux_sort(vector<uint64_t> & k, vector<char> & v) {
    // insertion sort via adjacent swaps, applied to both vectors in lockstep
    for (size_t i = 0; i < k.size(); ++i)
        for (size_t j = i; j--;)
            if (k[j] > k[j + 1]) {
                iter_swap(&k[j], &k[j + 1]);
                iter_swap(&v[j], &v[j + 1]);
            }
}
(Edit to add, in response to #kfsone) Although the following implementation is correct, it is not in-place, since permutation according to indices needs a copy (or alternatively, a prohibitively complex linear time in-place permutation algorithm that I am not going to implement):
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <tuple>
#include <vector>
using namespace std;

void aux_sort(vector<uint64_t> & k, vector<char> & v) {
    vector<size_t> indices(k.size());
    iota(indices.begin(), indices.end(), 0);
    sort(indices.begin(), indices.end(),
         [&](size_t a, size_t b) { return k[a] < k[b]; });
    vector<uint64_t> k2 = k;
    vector<char> v2 = v;
    for (size_t i = 0; i < k.size(); ++i)
        tie(k[i], v[i]) = make_pair(k2[indices[i]], v2[indices[i]]);
}
What is the easiest way to apply STL algorithms such as std::sort to a sequence of key/value-pairs in-place, with keys and values stored in separate vectors?
Background: My application is reading large (40 000 by 40 000) rasters that represent terrains, one row at a time. One raster assigns each cell a label between 0 and 10 000 000 such that labels are contiguous, and another raster assigns each cell a value between 0 and 255. I want to sum the values for each label in an efficient manner, and I think the fastest way is to sort the label row, and for each swap during the sort, apply the same swap in the value row. I want to avoid coding std::sort, std::set_intersection and others by hand.
Range adapters. The most direct route would be a zip range that takes two equal-length ranges over T and U respectively and produces a range over pair<T&, U&>. (Containers are a kind of range - a range that owns its contents.)
You then sort this by .first (or use the default sort, where .second determines ties).
The range is never a container; the wrapping into pairs happens on the fly with each dereference of the zip iterator.
Boost has zip iterators and zip ranges, but you can write them yourself. The Boost iterators/ranges may be read-only, but the link also contains an implementation of zipping that is not, and maybe Boost has been upgraded since.
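For illustration, a sketch of the zip approach using the standard library's own zip rather than Boost - an assumption on my part, since it requires a C++23 standard library with std::views::zip (e.g. GCC 13's libstdc++):

#include <algorithm>
#include <cstdint>
#include <ranges>
#include <vector>

void aux_sort(std::vector<std::uint64_t>& k, std::vector<char>& v) {
    // zip yields tuples of references; sorting the view permutes both
    // vectors in lockstep, comparing by k first (v breaks ties).
    std::ranges::sort(std::views::zip(k, v));
}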
You can use the Thrust library and its sort-by-key function. It's not STL, but it has the (dubious) advantage of being easily ported to an nVidia GPU.
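A minimal sketch of that route, assuming the Thrust headers that ship with the CUDA toolkit; thrust::sort_by_key sorts the key sequence and applies the same permutation to the values:

#include <cstdint>
#include <thrust/host_vector.h>
#include <thrust/sort.h>

int main() {
    thrust::host_vector<std::uint64_t> keys(3);
    keys[0] = 30; keys[1] = 10; keys[2] = 20;
    thrust::host_vector<char> vals(3);
    vals[0] = 'c'; vals[1] = 'a'; vals[2] = 'b';

    // Sorts keys in place and applies the same permutation to vals.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
    // keys = {10, 20, 30}, vals = {'a', 'b', 'c'}
}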
In fact it is easy to permute the input vectors according to indices in-place (contrary to the claim in the question):
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>
using namespace std;

void aux_sort(vector<uint64_t> & k, vector<char> & v) {
    vector<size_t> indices(k.size());
    iota(indices.begin(), indices.end(), 0);
    sort(indices.begin(), indices.end(),
         [&](size_t a, size_t b) { return k[a] < k[b]; });
    // apply the permutation by following its cycles in place
    for (size_t i = 0; i < k.size(); ++i)
        while (indices[i] != i) {
            swap(k[i], k[indices[i]]);
            swap(v[i], v[indices[i]]);
            swap(indices[i], indices[indices[i]]);
        }
}
However, this solution is perhaps undesirable since it incurs many more cache misses than the sorting itself: the input is traversed in the order of indices, which can incur up to one cache miss per element. Quicksort, by contrast, incurs far fewer cache misses (O((n/B) log(n/M)) when pivots are random, where B is the size of a cache line and M is the size of the cache).
I do not believe that it is possible to satisfy all the constraints that you have set up for the solution. It is almost certainly possible to hack the STL to sort the arrays. However, the solution is likely to be both clumsy and slower than just copying the data, sorting it, and copying it back.
If you have the option, you might want to consider just storing the data in a single vector to begin with.

Efficient way to get the indices of the k highest values in vector<float>

How can I create a std::map<int, float> from a vector<float>, so that the map contains the k highest values from the vector, with the keys being the indices of those values in the vector?
A naive approach would be to traverse the vector and extract-and-erase the highest element (O(n)) k times, leading to a complexity of O(k*n), which is suboptimal, I guess.
Alternatively, one could just copy the vector (O(n)) and repeatedly remove the smallest element until the size is k, which would lead to O((n-k)*n). Still polynomial...
Any ideas?
The following should do the job:
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Compare: greater T2 first.
struct greater_by_second
{
    template <typename T1, typename T2>
    bool operator () (const std::pair<T1, T2>& lhs, const std::pair<T1, T2>& rhs)
    {
        return std::tie(lhs.second, lhs.first) > std::tie(rhs.second, rhs.first);
    }
};

std::map<std::size_t, float> get_index_pairs(const std::vector<float>& v, int k)
{
    std::vector<std::pair<std::size_t, float>> indexed_floats;
    indexed_floats.reserve(v.size());
    for (std::size_t i = 0, size = v.size(); i != size; ++i) {
        indexed_floats.emplace_back(i, v[i]);
    }
    // nth_element is O(n): it partitions so the k greatest pairs come first
    std::nth_element(indexed_floats.begin(),
                     indexed_floats.begin() + k,
                     indexed_floats.end(), greater_by_second());
    return std::map<std::size_t, float>(indexed_floats.begin(), indexed_floats.begin() + k);
}
Let's test it:
int main(int argc, char *argv[])
{
    const std::vector<float> fs {45.67f, 12.34f, 67.8f, 4.2f, 123.4f};
    for (const auto& elem : get_index_pairs(fs, 2)) {
        std::cout << elem.first << " " << elem.second << std::endl;
    }
    return 0;
}
Output:
2 67.8
4 123.4
You can keep a list of the k highest values so far and update it for each of the values in your vector, which takes you down to O(n*log k) (assuming log k per update of the list of highest values) or, for a naive list, O(k*n).
You can probably get closer to O(n), but assuming k is probably pretty small, it may not be worth the effort.
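A minimal sketch of that bounded-list idea, using std::priority_queue as the list (the helper name and heap layout are my choices, not from the answer):

#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <utility>
#include <vector>

std::map<int, float> top_k_indices(const std::vector<float>& v, std::size_t k)
{
    // min-heap of (value, index): the smallest of the current k candidates on top
    std::priority_queue<std::pair<float, std::size_t>,
                        std::vector<std::pair<float, std::size_t>>,
                        std::greater<>> heap;
    for (std::size_t i = 0; i < v.size(); ++i) {
        heap.emplace(v[i], i);
        if (heap.size() > k) heap.pop();  // evict the current minimum, O(log k)
    }
    std::map<int, float> result;          // key = index, value = element
    while (!heap.empty()) {
        result.emplace(static_cast<int>(heap.top().second), heap.top().first);
        heap.pop();
    }
    return result;
}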
Your optimal solution will have a complexity of O(n+k*log(k)), since sorting the k elements can be reduced to this, and you will have to look at each of the elements at least once.
Two possible solutions come to mind:
Iterate through the vector while adding all elements to a bounded (size k) priority-queue/heap, also keeping their indices.
Create a copy of your vector including the original indices, i.e. std::vector<std::pair<float, std::size_t>>, and use std::nth_element to move the k highest values to the front, with a comparator that compares only the first element. Then insert those elements into your target map. Ironically, that last step adds the k*log(k) to the overall complexity, while nth_element is linear (but will permute your indices).
Maybe I did not get it, but in case the incremental approach is not an option, why not use std::partial_sort?
That should be O(n log k), and since k is very likely to be small, that makes it practically O(n).
Edit: thanks to Mike Seymour for the update.
Edit (bis):
The idea is to use an intermediate vector for sorting and then put it into the map. Trying to reduce the order of the computation would only be justified for a significant amount of data, so I guess the copy time (O(n)) could be lost in background noise.
Edit (bis):
That's actually what the selected answer does, without the theoretical explanations :). A sketch of the partial_sort variant follows.
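For concreteness, here is the accepted answer's routine with std::partial_sort swapped in for std::nth_element, reusing the includes and greater_by_second from above (the _sorted suffix is my naming). Unlike nth_element, the leading k pairs come out fully sorted, matching the O(n log k) estimate:

std::map<std::size_t, float> get_index_pairs_sorted(const std::vector<float>& v, int k)
{
    std::vector<std::pair<std::size_t, float>> indexed_floats;
    indexed_floats.reserve(v.size());
    for (std::size_t i = 0, size = v.size(); i != size; ++i) {
        indexed_floats.emplace_back(i, v[i]);
    }
    // partial_sort: O(n log k); the first k pairs end up sorted greatest-first
    std::partial_sort(indexed_floats.begin(),
                      indexed_floats.begin() + k,
                      indexed_floats.end(), greater_by_second());
    return std::map<std::size_t, float>(indexed_floats.begin(),
                                        indexed_floats.begin() + k);
}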

How to randomly shuffle values in a map?

I have a std::map with both keys and values as integers. Now I want to randomly shuffle the map, so that keys point to different values at random. I tried random_shuffle but it doesn't compile. Note that I am not trying to shuffle the keys, which makes no sense for a map. I'm trying to randomise the values.
I could push the values into a vector, shuffle that and then copy back. Is there a better way?
You can push all the keys into a vector, shuffle the vector and use it to swap the values in the map.
Here is an example:
#include <iostream>
#include <string>
#include <vector>
#include <map>
#include <algorithm>
#include <random>
#include <cstdlib>
#include <ctime>
using namespace std;

int myrandom (int i) { return std::rand() % i; }

int main ()
{
    srand(time(0));
    map<int, string> m;
    vector<int> v;
    for (int i = 0; i < 10; i++)
        m.insert(pair<int, string>(i, ("v" + to_string(i))));
    for (auto i : m)
    {
        cout << i.first << ":" << i.second << endl;
        v.push_back(i.first);
    }
    random_shuffle(v.begin(), v.end(), myrandom);
    vector<int>::iterator it = v.begin();
    cout << endl;
    for (auto& i : m)
    {
        // swap this entry's value with the value stored at the shuffled key
        string ts = i.second;
        i.second = m[*it];
        m[*it] = ts;
        it++;
    }
    for (auto i : m)
    {
        cout << i.first << ":" << i.second << endl;
    }
    return 0;
}
The complexity of your proposal is O(N) (both the copies and the shuffle have linear complexity), which seems optimal: looking at fewer elements would introduce non-randomness into your shuffle.
If you want to repeatedly shuffle your data, you could maintain a map of type <Key, size_t> (i.e. the proverbial level of indirection) that indexes into a std::vector<Value>, and then just shuffle that vector repeatedly. That saves you all the copying in exchange for O(N) space overhead. If the Value type itself is expensive, keep an extra vector<size_t> of indices into the real data and do the shuffling on that.
For convenience's sake, you could encapsulate the map and vector inside one class that exposes a shuffle() member function. Such a wrapper would also need to expose the basic lookup / insertion / erase functionality of the underlying map.
EDIT: As pointed out by #tmyklebu in the comments, maintaining (raw or smart) pointers to secondary data can be subject to iterator invalidation (e.g. when inserting new elements at the end that causes the vector's capacity to be resized). Using indices instead of pointers solves the "insertion at the end" problem. But when writing the wrapper class you need to make sure that insertions of new key-value pairs never cause "insertions in the middle" for your secondary data because that would also invalidate the indices. A more robust library solution would be to use Boost.MultiIndex, which is specifically designed to allow multiple types of view over a data structure.
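A minimal sketch of such a wrapper, assuming append-only insertion to sidestep the invalidation issue just described (class and member names are hypothetical):

#include <algorithm>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

template <class Key, class Value>
class shuffled_map {
    std::map<Key, std::size_t> index_;   // key -> slot in values_
    std::vector<Value> values_;
    std::mt19937 rng_{std::random_device{}()};

public:
    void insert(const Key& k, Value v) {
        // Append-only: existing slots never move, so stored indices stay valid.
        if (index_.emplace(k, values_.size()).second)
            values_.push_back(std::move(v));
    }

    Value& at(const Key& k) { return values_.at(index_.at(k)); }

    // Re-shuffles all values in O(N) without touching the key map.
    void shuffle() { std::shuffle(values_.begin(), values_.end(), rng_); }
};

Erasure is deliberately omitted: removing a slot from the middle of values_ would invalidate the stored indices, which is exactly the problem Boost.MultiIndex solves more robustly.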
Well, using only the map, I think of this:
Make a flag array with one cell per map entry, randomly generate two integers i, j such that 0 <= i, j < size of the map, swap their values and mark those cells as swapped; iterate until all cells are done.
EDIT: the array is allocated with the size of the map, and is a local array.
I doubt it...
But... why not write a quick class that has two vectors in it: a sorted std::vector of keys and a std::random_shuffle'd std::vector of values? Look up the key using std::lower_bound and use std::distance and std::advance to get the value. Easy!
Without thinking too deeply, this should have similar complexity to std::map and possibly better locality of reference.
Some untested and unfinished code to get you started.
template <class Key, class T>
class random_map
{
public:
    T& at(Key const& key);
    void shuffle();

private:
    std::vector<Key> d_keys;  // Hold the keys of the *map*; MUST be sorted.
    std::vector<T> d_values;
};

template <class Key, class T>
T& random_map<Key, T>::at(Key const& key)
{
    auto lb = std::lower_bound(d_keys.begin(), d_keys.end(), key);
    if (lb == d_keys.end() || key < *lb) {
        throw std::out_of_range("key not found");
    }
    auto delta = std::distance(d_keys.begin(), lb);
    auto it = std::next(d_values.begin(), delta);
    return *it;
}

template <class Key, class T>
void random_map<Key, T>::shuffle()
{
    // shuffle the values; the keys must stay sorted for lower_bound to work
    std::random_shuffle(d_values.begin(), d_values.end());
}
If you want to shuffle the map in place, you can implement your own version of random_shuffle for your map. The solution still requires placing the keys into a vector, which is done below using transform:
typedef std::map<int, std::string> map_type;

map_type m;
m[10] = "hello";
m[20] = "world";
m[30] = "!";

std::vector<map_type::key_type> v(m.size());
std::transform(m.begin(), m.end(), v.begin(),
               [](const map_type::value_type &x){
                   return x.first;
               });

srand48(time(0));
auto n = m.size();
// Fisher-Yates over the mapped values, addressed through the key vector
for (auto i = n - 1; i > 0; --i) {
    map_type::size_type r = drand48() * (i + 1);
    std::swap(m[v[i]], m[v[r]]);
}
I used drand48()/srand48() for a uniform pseudo random number generator, but you can use whatever is best for you.
Alternatively, you can shuffle v, and then rebuild the map, such as:
std::random_shuffle(v.begin(), v.end());
map_type m2 = m;
int i = 0;
for (auto &x : m) {
    x.second = m2[v[i++]];
}
But, I wanted to illustrate that implementing shuffle on the map in place isn't overly burdensome.
Here is my solution using std::reference_wrapper of C++11.
First, let's make a version of std::random_shuffle that shuffles references. It is a small modification of version 1 from here: using the get method to get to the referenced values.
template< class RandomIt >
void shuffleRefs( RandomIt first, RandomIt last ) {
    typename std::iterator_traits<RandomIt>::difference_type i, n;
    n = last - first;
    for (i = n - 1; i > 0; --i) {
        using std::swap;
        swap(first[i].get(), first[std::rand() % (i + 1)].get());
    }
}
}
Now it's easy:
template <class MapType>
void shuffleMap(MapType &map) {
    std::vector<std::reference_wrapper<typename MapType::mapped_type>> v;
    for (auto &el : map) v.push_back(std::ref(el.second));
    shuffleRefs(v.begin(), v.end());
}

How to efficiently select a random element from a std::set

How can I efficiently select a random element from a std::set?
A std::set::iterator is not a random access iterator, so I can't directly index a randomly chosen element as I could for a std::deque or std::vector.
I could take the iterator returned from std::set::begin() and increment it a random number of times in the range [0,std::set::size()), but that seems to be doing a lot of unnecessary work. For an "index" close to the set's size, I would end up traversing the entire first half of the internal tree structure, even though it's already known the element won't be found there.
Is there a better approach?
In the name of efficiency, I am willing to define "random" as less random than whatever approach I might have used to choose a random index in a vector. Call it "reasonably random".
Edit...
Many insightful answers below.
The short version is that even though you can find a specific element in log(n) time, you can't find an arbitrary element in that time through the std::set interface.
Use boost::container::flat_set instead:
boost::container::flat_set<int> set;
// ...
auto it = set.begin() + rand() % set.size();
Insertions and deletions become O(N), though; I don't know if that's a problem. You still have O(log N) lookups, and the fact that the container is contiguous gives an overall improvement that often outweighs the loss of O(log N) insertions and deletions.
What about a predicate for find (or lower_bound) which causes a random tree traversal? You'd have to tell it the size of the set so it could estimate the height of the tree and sometimes terminate before leaf nodes.
Edit: I realized the problem with this is that std::lower_bound takes a predicate but does not have any tree-like behavior (internally it uses std::advance which is discussed in the comments of another answer). std::set<>::lower_bound uses the predicate of the set, which cannot be random and still have set-like behavior.
Aha, you can't use a different predicate, but you can use a mutable predicate. Since std::set passes the predicate object around by value, you must use a RandomPredicate& as the predicate type so you can reach in and modify it (setting it to "randomize" mode).
Here's a quasi-working example. Unfortunately I can't wrap my brain around the right random predicate so my randomness is not excellent, but I'm sure someone can figure that out:
#include <iostream>
#include <set>
#include <stdlib.h>
#include <time.h>
using namespace std;

template <typename T>
struct RandomPredicate {
    RandomPredicate() : size(0), randomize(false) { }
    bool operator () (const T& a, const T& b) {
        if (!randomize)
            return a < b;
        int r = rand();
        if (size == 0)
            return false;
        else if (r % size == 0) {
            size = 0;
            return false;
        } else {
            size /= 2;
            return r & 1;
        }
    }
    size_t size;
    bool randomize;
};

int main()
{
    srand(time(0));
    RandomPredicate<int> pred;
    set<int, RandomPredicate<int> & > s(pred);
    for (int i = 0; i < 100; ++i)
        s.insert(i);
    pred.randomize = true;
    for (int i = 0; i < 100; ++i) {
        pred.size = s.size();
        set<int, RandomPredicate<int> & >::iterator it = s.lower_bound(0);
        cout << *it << endl;
    }
}
My half-baked randomness test is ./demo | sort -u | wc -l to see how many unique integers I get out. With a larger sample set try ./demo | sort | uniq -c | sort -n to look for unwanted patterns.
If you could access the underlying red-black tree (assuming that one exists) then you could access a random node in O(log n) choosing L/R as the successive bits of a ceil(log2(n))-bit random integer. However, you can't, as the underlying data structure is not exposed by the standard.
Xeo's solution of placing iterators in a vector is O(n) time and space to set up, but amortized constant overall. This compares favourably to std::next, which is O(n) time.
You can use the std::advance method:
set<int> myset;
// insert some elements into myset
int rnd = rand() % myset.size();
set<int>::const_iterator it(myset.begin());
advance(it, rnd);
// now 'it' points to your random element
Another way to do this, probably less random:
int mini = *myset.begin(), maxi = *myset.rbegin();
int rnd = rand() % (maxi - mini + 1) + mini;
int rndresult = *myset.lower_bound(rnd);
If either the set doesn't update frequently or you don't need to run this algorithm frequently, keep a mirrored copy of the data in a vector (or just copy the set to a vector on need) and randomly select from that.
Another approach, as seen in a comment, is to keep a vector of iterators into the set (they're only invalidated on element deletion for sets) and randomly select an iterator.
Finally if you don't need a tree-based set, you could use vector or deque as your underlying container and sort/unique-ify when needed.
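A sketch of the mirrored-iterators idea from the second paragraph above (set iterators stay valid until their element is erased, so a parallel vector of them gives O(1) picks; the variable names are mine):

#include <random>
#include <set>
#include <vector>

int main() {
    std::set<int> s{1, 2, 3, 5, 8, 13};

    // Build the mirror once; it must be rebuilt/patched when s changes.
    std::vector<std::set<int>::iterator> mirror;
    for (auto it = s.begin(); it != s.end(); ++it)
        mirror.push_back(it);

    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> dist(0, mirror.size() - 1);
    int random_element = *mirror[dist(rng)];  // O(1) selection
    (void)random_element;
}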
You can do this by maintaining a normal array of values; when you insert to the set, you append the element to the end of the array (O(1)), then when you want to generate a random number you can grab it from the array in O(1) as well.
The issue comes when you want to remove elements from the array. The most naive method would take O(n), which might be efficient enough for your needs. However, this can be improved to O(log n) using the following method:
Keep, for each index i in the array, prfx[i], which represents the number of non-deleted elements in the range 0...i of the array. Keep a segment tree, where each node stores the maximum prfx[i] contained in its range.
Updating the segment tree can be done in O(log n) per deletion. Now, when you want to access a random element, you query the segment tree to find the "real" index of the number (by finding the earliest range in which the maximum prfx equals the random index). This makes the random selection O(log n) as well.
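As a concrete (hedged) sketch of this bookkeeping, here is the same idea implemented with a Fenwick (binary indexed) tree counting live slots instead of the max-prefix segment tree described above; the class, its fixed capacity, and all names are my simplifications:

#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

class SampleableBag {
    std::vector<int> values_;  // slot -> value; erased slots keep stale data
    std::vector<int> tree_;    // Fenwick tree over live flags, 1-indexed
    std::size_t live_ = 0;

    void add(std::size_t slot, int delta) {
        for (std::size_t i = slot + 1; i < tree_.size(); i += i & (~i + 1))
            tree_[i] += delta;
    }

public:
    explicit SampleableBag(std::size_t capacity) : tree_(capacity + 1, 0) {}

    std::size_t size() const { return live_; }

    std::size_t insert(int value) {        // appends a live slot, O(log n)
        assert(values_.size() + 1 < tree_.size());
        values_.push_back(value);
        add(values_.size() - 1, +1);
        ++live_;
        return values_.size() - 1;
    }

    void erase_slot(std::size_t slot) {    // marks a slot dead, O(log n)
        add(slot, -1);
        --live_;
    }

    // Value of the r-th live slot (r in [1, live_]) via Fenwick descent, O(log n).
    int nth_live(std::size_t r) const {
        std::size_t pos = 0, step = 1;
        while (step * 2 < tree_.size()) step *= 2;
        for (; step > 0; step /= 2) {
            std::size_t next = pos + step;
            if (next < tree_.size() && static_cast<std::size_t>(tree_[next]) < r) {
                pos = next;
                r -= tree_[next];
            }
        }
        return values_[pos];               // pos is the 0-based slot found
    }

    template <typename Rng>
    int sample(Rng& rng) const {
        std::uniform_int_distribution<std::size_t> d(1, live_);
        return nth_live(d(rng));
    }
};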
Average O(1)/O(log N) (hashable/unhashable) insert/delete/sample with off-the-shelf containers
The idea is simple: use rejection sampling while upper-bounding the rejection rate, which is achievable with an amortized O(1) compaction operation.
However, unlike solutions based on augmented trees, this approach cannot be extended to support weighted sampling.
#include <cstddef>
#include <map>
#include <unordered_map>
#include <unordered_set>
#include <utility>

template <typename T>
class UniformSamplingSet {
    size_t max_id = 0;
    std::unordered_set<size_t> unused_ids;
    std::unordered_map<size_t, T> id2value;
    std::map<T, size_t> value2id;

    // Reassign ids 0..size-1 so the fraction of unused ids (rejections) stays bounded.
    void compact() {
        size_t id = 0;
        std::map<T, size_t> new_value2id;
        std::unordered_map<size_t, T> new_id2value;
        for (auto [_, value] : id2value) {
            new_value2id.emplace(value, id);
            new_id2value.emplace(id, value);
            ++id;
        }
        max_id = id;
        unused_ids.clear();
        std::swap(id2value, new_id2value);
        std::swap(value2id, new_value2id);
    }

public:
    size_t size() {
        return id2value.size();
    }

    void insert(const T& value) {
        size_t id;
        if (!unused_ids.empty()) {
            id = *unused_ids.begin();
            unused_ids.erase(unused_ids.begin());
        } else {
            id = max_id++;
        }
        if (!value2id.emplace(value, id).second) {
            unused_ids.insert(id);  // value already present; recycle the id
        } else {
            id2value.emplace(id, value);
        }
    }

    void erase(const T& value) {
        auto it = value2id.find(value);
        if (it == value2id.end()) return;
        unused_ids.insert(it->second);
        id2value.erase(it->second);
        value2id.erase(it);
        if (unused_ids.size() * 2 > max_id) {
            compact();
        }
    }

    // uniform(n): uniform random in [0, n)
    template <typename F>
    T sample(F&& uniform) {
        size_t i;
        do { i = uniform(max_id); } while (unused_ids.find(i) != unused_ids.end());
        return id2value.at(i);
    }
};
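A possible usage sketch (not part of the answer), plugging a <random>-based generator into the uniform(n) callback:

#include <iostream>
#include <random>

int main() {
    UniformSamplingSet<int> s;
    for (int x : {1, 2, 3, 5, 8, 13}) s.insert(x);
    s.erase(5);

    std::mt19937 rng{std::random_device{}()};
    auto uniform = [&rng](size_t n) {
        return std::uniform_int_distribution<size_t>(0, n - 1)(rng);
    };
    for (int i = 0; i < 5; ++i)
        std::cout << s.sample(uniform) << ' ';  // uniform over {1, 2, 3, 8, 13}
    std::cout << '\n';
}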