I have an assignment where I must create a random shuffle algorithm using rand() or another random number generator. I cannot use the built-in std::shuffle algorithm, nor can I use Fisher-Yates. Are there any lesser-known shuffle algorithms that I can implement?
Complexity and speed are not too important to me for this specific project, so a shuffle algorithm that works well but has poor complexity is completely fine with me.
Thank you!
I would do it like this:
template<typename DType>
struct Shufflable
{
DType data;
int random;
};
std::vector<Shufflable<float>> vec;
...init vec from input float data array...
// give random values to them
for(auto & v:vec)
v.random = your_mersenne_twister_random_generator(); // std::mt19937
// shuffle
std::sort(vec.begin(),vec.end(),[](auto & e1, auto & e2){ return e1.random<e2.random;});
... copy float datas to original array from vec ...
Assign random numbers to each element, then sort the elements on those random values.
If bias from duplicated random numbers is a problem, you can improve the sort by breaking ties with the data too, or even apply a second randomization pass to all of the duplicates.
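For completeness, here is a minimal, self-contained sketch of this sort-by-random-key shuffle (the struct, names and the float example are just illustrative):
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

template <typename DType>
struct Shufflable {
    DType data;
    std::uint32_t random; // random sort key
};

int main() {
    std::vector<float> input{1.f, 2.f, 3.f, 4.f, 5.f};

    // Pair each element with a random key.
    std::mt19937 gen(std::random_device{}());
    std::vector<Shufflable<float>> vec;
    vec.reserve(input.size());
    for (float f : input)
        vec.push_back({f, static_cast<std::uint32_t>(gen())});

    // Sorting on the random keys yields a shuffled order.
    std::sort(vec.begin(), vec.end(),
              [](const auto& e1, const auto& e2) { return e1.random < e2.random; });

    // Copy the shuffled data back to the original array.
    for (std::size_t i = 0; i < input.size(); ++i)
        input[i] = vec[i].data;
}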
Suppose I have a vector<Point> p of some objects.
I can pick a uniformly random element simply with p[rand() % p.size()].
Now suppose I have another vector of doubles of the same size, vector<double> chances.
I want to randomly sample from p with each element having a probability proportional to its value in chances (which may not sum to 1.0). How can I achieve that in C++?
You are looking for std::discrete_distribution. Forget about rand().
#include <random>
#include <vector>
struct Point {};
int main() {
std::mt19937 gen(std::random_device{}());
std::vector<double> chances{1.0, 2.0, 3.0};
// Initialize to same length.
std::vector<Point> points(chances.size());
// size_t is suitable for indexing.
std::discrete_distribution<std::size_t> d{chances.begin(), chances.end()};
auto sampled_value = points[d(gen)];
}
Conveniently for you, the weights do not have to sum to 1.
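As a quick sanity check, here is a small self-contained sketch that draws many samples and tallies them; with weights {1.0, 2.0, 3.0} the counts should come out roughly in a 1 : 2 : 3 ratio:
#include <cstddef>
#include <iostream>
#include <map>
#include <random>
#include <vector>

int main() {
    std::mt19937 gen(std::random_device{}());
    std::vector<double> chances{1.0, 2.0, 3.0}; // weights, not probabilities
    std::discrete_distribution<std::size_t> d{chances.begin(), chances.end()};

    std::map<std::size_t, int> histogram; // index -> number of times drawn
    for (int i = 0; i < 60000; ++i)
        ++histogram[d(gen)];

    for (const auto& entry : histogram)
        std::cout << "index " << entry.first << " drawn " << entry.second << " times\n";
}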
I'm making a program that tests and compares stats of Multi-Key Sequential search and Interpolation binary search. I'm asking for advice:
What is the best way to sort a randomly generated array of integers, or even generate it already sorted (if that makes any sense), in the given context?
I was looking into some sorting techniques, but, if you keep in mind that the emphasis is on searching (not sorting) performance, all of the advanced sorts seem rather complicated to be used in just one utility method. Considering that the array has to be larger than 10^6 (for testing purposes), Modified/Bubble, Selection or Insertion sorts are not an option.
Additional constraint is that all of the array members must be unique.
Now, my initial idea was to split the interval [INT_MIN, INT_MAX] into n intervals (n being the array length) and then add a random integer, from 0 to 2^32/n (rounded down), to every interval beginning.
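For concreteness, a minimal sketch of that stratified idea (assuming a 32-bit int and n no larger than 2^32; the function name is just illustrative):
#include <climits>
#include <cstdint>
#include <random>
#include <vector>

// One value per stratum: unique by construction, and already sorted.
std::vector<int> sorted_random_array(std::uint64_t n)
{
    std::mt19937_64 gen(std::random_device{}());
    const std::uint64_t range = 1ULL << 32;   // number of distinct 32-bit int values
    const std::uint64_t width = range / n;    // stratum width (rounded down)
    std::uniform_int_distribution<std::uint64_t> offset(0, width - 1);

    std::vector<int> out;
    out.reserve(n);
    for (std::uint64_t i = 0; i < n; ++i) {
        std::uint64_t value = i * width + offset(gen); // position in [0, 2^32)
        out.push_back(static_cast<int>(static_cast<std::int64_t>(value) + INT_MIN));
    }
    return out;
}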
The problem is this:
I presume that, as n rises closer to 2^32, like mine does, interpolation search begins to give better and better results, as its interpolation gets more accurate.
However:
If I rely solely on pseudo-random number generators (like rand()), their dispersion characteristics dictate the same tendency for a generated-then-sorted array, that is, interpolation gets better at pinpointing the most likely location as the size gets closer to the int limit. Uniformity/dispersion characteristics get lost as n rises towards INT_MAX, so, due to the stated limitations, interpolation seems to always win.
Feel free to discuss, criticize and clarify this question as you see fit, but I'm rather desperate for an answer, because the test seems to be rigged in interpolation's favor either way and I want to analyze them fairly. In short: I want to be convinced that my initial idea doesn't tilt the scales in interpolation's favor even further, and I want to use it because it's O(n).
Here is a method to generate an ordered random sequence. This uses Knuth's algorithm S and is taken from the book Programming Pearls.
This requires a function that returns a random double in the range [0,1). I included my_rand() as an example. I've also modified it to take an output iterator for the destination.
namespace
{
std::random_device rd;
std::mt19937 eng{ rd() };
std::uniform_real_distribution<> dist; // [0,1)
double my_rand() { return dist(eng); }
}
// Programming Pearls column 11.2
// Knuth's algorithm S (3.4.2)
// output M integers (in order) in range 1..N
template <typename OutIt>
void knuth_s(int M, int N, OutIt dest)
{
double select = M, remaining = N;
for (int i = 1; i <= N; ++i) {
if (my_rand() < select / remaining) {
*dest++ = i;
--select;
}
--remaining;
}
}
int main()
{
std::vector<int> data;
knuth_s(20, 200, back_inserter(data)); // 20 values in [1,200]
}
Demo in ideone.com
So you want to generate an "array" that has N unique random numbers and they must be in sorted order? This sounds like a perfect use for a std::set. When inserting elements into a set they are sorted for us automatically, and a set can only contain unique elements, so it takes care of checking whether the random number has already been generated.
std::set<std::mt19937::result_type> random_numbers;
std::random_device rd;
std::mt19937 mt(rd());
while (random_numbers.size() < number_of_random_numbers_needed)
{
random_numbers.insert(mt());
}
Then you can convert the set to something else like a std::vector or std::array if you don't want to keep it as a set.
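For instance, the conversion mentioned above could look like this (a self-contained sketch; the count constant is just an example):
#include <cstddef>
#include <random>
#include <set>
#include <vector>

int main() {
    std::random_device rd;
    std::mt19937 mt(rd());

    const std::size_t number_of_random_numbers_needed = 1000; // example size
    std::set<std::mt19937::result_type> random_numbers;
    while (random_numbers.size() < number_of_random_numbers_needed)
        random_numbers.insert(mt());

    // A set iterates in ascending order, so the vector comes out already sorted.
    std::vector<std::mt19937::result_type> sorted_values(random_numbers.begin(),
                                                         random_numbers.end());
}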
What about generating a sorted array from statistical properties?
This probably needs some digging, but you should be able to generate the integers in order by adding a random difference whose mean is the standard deviation of your overall sample.
That raises some problems at range boundaries, but given the size of your sample you can probably ignore them.
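A minimal sketch of that incremental idea (this version simply uses range/n as the mean gap and an exponential distribution for the random differences, both of which are just one possible choice; it assumes a 32-bit int, and the function name is illustrative):
#include <climits>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Generate up to n unique ints in ascending order by accumulating random positive gaps.
std::vector<int> sorted_by_increments(std::size_t n)
{
    std::mt19937_64 gen(std::random_device{}());
    const double mean_gap = static_cast<double>(1ULL << 32) / static_cast<double>(n);
    std::exponential_distribution<double> gap(1.0 / mean_gap); // strictly non-negative gaps

    std::vector<int> out;
    out.reserve(n);
    double position = static_cast<double>(INT_MIN);
    for (std::size_t i = 0; i < n; ++i) {
        position += gap(gen) + 1.0;   // the +1 keeps the floored values strictly increasing
        if (position > INT_MAX)       // ignore the boundary issue, as noted above
            break;
        out.push_back(static_cast<int>(std::floor(position)));
    }
    return out;
}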
OK, I've decided to transfer the responsibility to the built-in PRNG and do the following:
Add n rand() results to a binary tree and fill the array by traversing it in order (from the leftmost leaf).
I am looking for an efficient method of accessing each element of a std::vector<T> in random order, without reshuffling or copying the elements (i.e. no use of std::random_shuffle on the vector itself), while ensuring that each element is selected only once.
I don't want to copy or reshuffle as a) each instance of T is likely to be a very large object and b) for other operations I will be doing on the elements of the vector, it is easier for them to remain in the same order.
Furthermore, I don't really want to go down the road of continuously picking and rejecting duplicates. It is likely I will have lots of these large objects stored in the vector, and efficiency is key as I will be looking to call this random selection method many times a second.
Create a vector of the same size as the existing one that holds pointers to its elements. Randomly shuffle the pointer vector instead and read through it - it's low cost.
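A minimal sketch of that pointer-vector approach (huge_t just stands in for the large element type):
#include <algorithm>
#include <random>
#include <vector>

struct huge_t { /* large object */ };

int main() {
    std::vector<huge_t> v(100);            // the original vector stays untouched

    // Build a parallel vector of pointers into v.
    std::vector<const huge_t*> ptrs;
    ptrs.reserve(v.size());
    for (const huge_t& x : v)
        ptrs.push_back(&x);

    // Shuffling pointers is cheap and leaves v's order intact.
    std::mt19937 gen(std::random_device{}());
    std::shuffle(ptrs.begin(), ptrs.end(), gen);

    // Visit every element exactly once, in random order.
    for (const huge_t* p : ptrs) {
        const huge_t& element = *p;
        (void)element;                     // ... use element ...
    }
}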
You did not tell us whether you want to iterate over the whole array randomly, or if you only need some elements at random.
I assume the first case. You'll need extra storage for bookkeeping, and you'll need linear time for the shuffling anyway. So create a permutation, and keep its memory alive so that you can reshuffle it as you wish. With C++11:
#include <algorithm>
#include <random>
#include <numeric>
struct permutation
{
permutation(size_t n)
: perm(n), g(std::random_device{}())
{
std::iota(perm.begin(), perm.end(), size_t(0));
}
void shuffle() { std::shuffle(perm.begin(), perm.end(), g); }
size_t operator[](size_t n) const { return perm[n]; }
private:
std::vector<size_t> perm;
std::mt19937 g;
};
Usage:
std::vector<huge_t> v;
...
permutation sigma(v.size());
sigma.shuffle();
const huge_t& x = v[sigma[0]];
...
sigma.shuffle(); // No extra allocation
const huge_t& y = v[sigma[0]];
You can adapt the code to use C++03 std::random_shuffle, but please note that there are very few guarantees on the random number generator.
I think the easiest (and one of the more efficient) solutions would be to either create a std::vector<size_t> holding indices into your vector<T>, or a std::vector<T*> holding pointers into your vector. Then you can shuffle that one using std::random_shuffle, iterate over it and pick the corresponding elements from your original vector. That way you don't change the order of your original vector, and shuffling pointers or size_t values is pretty cheap.
I was asked an interview question to find the number of distinct absolute values among the elements of the array. I came up with the following solution (in C++) but the interviewer was not happy with the code's run time efficiency.
I would appreciate pointers on how I can improve the run-time efficiency of this code.
Also, how do I calculate the efficiency of the code below? The for loop executes A.size() times. However, I am not sure about the efficiency of STL std::find (in the worst case it could be O(n), so does that make this code O(n²)?).
Code is:
int countAbsoluteDistinct ( const std::vector<int> &A ) {
using namespace std;
list<int> x;
vector<int>::const_iterator it;
for(it = A.begin();it < A.end();it++)
if(find(x.begin(),x.end(),abs(*it)) == x.end())
x.push_back(abs(*it));
return x.size();
}
To propose an alternative to the set-based code:
Note that since we don't want to alter the caller's vector, we take it by value; it's better to let the compiler copy for us than to make our own copy. If it's OK to destroy the caller's data, we can take it by non-const reference instead.
#include <vector>
#include <algorithm>
#include <iterator>
#include <cstdlib>
using namespace std;
int count_distinct_abs(vector<int> v)
{
transform(v.begin(), v.end(), v.begin(), abs); // O(n) where n = distance(v.end(), v.begin())
sort(v.begin(), v.end()); // Average case O(n log n), worst case O(n^2) (usually implemented as quicksort).
// To guarantee worst case O(n log n) replace with make_heap, then sort_heap.
// Unique will take a sorted range, and move things around to get duplicated
// items to the back and returns an iterator to the end of the unique section of the range
auto unique_end = unique(v.begin(), v.end()); // Again n comparisons
return distance(v.begin(), unique_end); // Constant time for random access iterators (like vector's)
}
The advantage here is that we only allocate/copy once if we decide to take by value, and the rest is all done in-place while still giving you an average complexity of O(n log n) on the size of v.
std::find() is linear (O(n)). I'd use a sorted associative container to handle this, specifically std::set.
#include <vector>
#include <set>
using namespace std;
int distict_abs(const vector<int>& v)
{
std::set<int> distinct_container;
for(auto curr_int = v.begin(), end = v.end(); // no need to call v.end() multiple times
curr_int != end;
++curr_int)
{
// std::set only allows single entries
// since that is what we want, we don't care that this fails
// if the second (or more) of the same value is attempted to
// be inserted.
distinct_container.insert(abs(*curr_int));
}
return distinct_container.size();
}
There is still some runtime penalty with this approach. Using a separate container incurs the cost of dynamic allocations as the container size increases. You could do this in place and not incur this penalty; however, with code at this level it's sometimes better to be clear and explicit and let the optimizer (in the compiler) do its work.
Yes, this will be O(N²) -- you'll end up with a linear search for each element.
A couple of reasonably obvious alternatives would be to use an std::set or std::unordered_set. If you don't have C++0x, you can replace std::unordered_set with tr1::unordered_set or boost::unordered_set.
Each insertion in an std::set is O(log N), so your overall complexity is O(N log N).
With unordered_set, each insertion has constant (expected) complexity, giving linear complexity overall.
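A minimal sketch of the unordered_set variant (expected linear time overall; the function name is just illustrative):
#include <cstdlib>
#include <unordered_set>
#include <vector>

// Expected O(N): one hash insertion per element; duplicates are ignored by the set.
int count_distinct_abs_hashed(const std::vector<int>& v)
{
    std::unordered_set<int> seen;
    seen.reserve(v.size());          // avoid rehashing while inserting
    for (int x : v)
        seen.insert(std::abs(x));
    return static_cast<int>(seen.size());
}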
Basically, replace your std::list with a std::set. This gives you O(log(set.size())) searches + O(1) insertions, if you do things properly. Also, for efficiency, it makes sense to cache the result of abs(*it), although this will have only a minimal (negligible) effect. The efficiency of this method is about as good as you can get it, without using a really nice hash (std::set uses binary trees) or more information about the values in the vector.
Since I was not happy with the previous answers, here is mine. Your initial question does not mention how big your vector is. Suppose your std::vector<> is extremely large and has very few duplicates (why not?). This means that using another container (e.g. std::set<>) will basically duplicate your memory consumption. Why would you do that, when your goal is simply to count the non-duplicates?
I like @Flame's answer, but I was not really happy with the call to std::unique: you've spent lots of time carefully sorting your vector and then simply discard the sorted array, while you could be re-using it afterwards.
I could not find anything really elegant in the standard library, so here is my proposal (a mixture of std::transform + std::abs + std::sort, but without touching the sorted array afterwards).
// count the number of distinct values in a sorted range
// (for the absolute-value problem, apply std::abs and std::sort first)
template<class ForwardIt>
typename std::iterator_traits<ForwardIt>::difference_type
count_unique(ForwardIt first, ForwardIt last)
{
if (first == last)
return 0;
typename std::iterator_traits<ForwardIt>::difference_type
count = 1;
ForwardIt previous = first;
while (++first != last) {
if (!(*previous == *first) ) ++count;
++previous;
}
return count;
}
Bonus point: it works with forward iterators:
#include <iostream>
#include <list>
int main()
{
std::list<int> nums {1, 3, 3, 3, 5, 5, 7,8};
std::cout << count_unique( std::begin(nums), std::end(nums) ) << std::endl;
const int array[] = { 0,0,0,1,2,3,3,3,4,4,4,4};
const int n = sizeof array / sizeof * array;
std::cout << count_unique( array, array + n ) << std::endl;
return 0;
}
Two points.
std::list is very bad for search. Each search is O(n).
Use std::set. Insertion is logarithmic, it removes duplicates, and it keeps the values sorted. Inserting every value is O(n log n); then use set::size to find how many distinct values there are.
EDIT:
To answer part 2 of your question, the C++ standard mandates the worst case for operations on containers and algorithms.
Find: Since you are using the free-function version of find, which takes iterators, it cannot assume anything about the passed-in sequence (it cannot assume that the range is sorted), so it must traverse every item until it finds a match, which is O(n).
If you are using set::find on the other hand, this member find can utilize the structure of the set, and its performance is required to be O(log N), where N is the size of the set.
To answer your second question first, yes the code is O(n^2) because the complexity of find is O(n).
You have options to improve it. If the range of numbers is low you can just set up a large enough array and increment counts while iterating over the source data. If the range is larger but sparse, you can use a hash table of some sort to do the counting. Both of these options are linear complexity.
Otherwise, I would do one iteration to take the abs value of each item, then sort them, and then you can do the aggregation in a single additional pass. The complexity here is n log(n) for the sort. The other passes don't matter for complexity.
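For the low-range case mentioned above, a minimal sketch (max_abs is a hypothetical bound on the absolute values that the caller must know in advance):
#include <cstddef>
#include <cstdlib>
#include <vector>

// Direct-address counting: only valid if every |value| is known to be <= max_abs.
int count_distinct_abs_small_range(const std::vector<int>& v, int max_abs)
{
    std::vector<char> seen(static_cast<std::size_t>(max_abs) + 1, 0);
    int distinct = 0;
    for (int x : v) {
        int a = std::abs(x);  // assumes x != INT_MIN
        if (!seen[a]) {
            seen[a] = 1;
            ++distinct;
        }
    }
    return distinct;
}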
I think a std::map could also be interesting:
int absoluteDistinct(const vector<int> &A)
{
map<int, char> my_map;
for (vector<int>::const_iterator it = A.begin(); it != A.end(); it++)
{
my_map[abs(*it)] = 0;
}
return my_map.size();
}
As #Jerry said, to improve a little on the theme of most of the other answers, instead of using a std::map or std::set you could use a std::unordered_map or std::unordered_set (or the boost equivalent).
This would reduce the runtime from O(n lg n) to O(n).
Another possibility, depending on the range of the data given, you might be able to do a variant of a radix sort, though there's nothing in the question that immediately suggests this.
Sort the list with a Radix style sort for O(n)ish efficiency. Compare adjacent values.
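A sketch of that idea: an LSD (byte-by-byte) radix sort of the absolute values, followed by one scan that compares adjacent entries (the function name is just illustrative):
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Count distinct absolute values via an LSD radix sort (one counting pass per byte).
int count_distinct_abs_radix(const std::vector<int>& v)
{
    // Work on absolute values widened to unsigned to avoid overflow on INT_MIN.
    std::vector<std::uint32_t> keys(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        keys[i] = static_cast<std::uint32_t>(std::llabs(static_cast<long long>(v[i])));

    std::vector<std::uint32_t> buffer(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> count{};        // histogram, then prefix sums
        for (std::uint32_t k : keys)
            ++count[((k >> shift) & 0xFF) + 1];
        for (std::size_t b = 1; b < count.size(); ++b)
            count[b] += count[b - 1];
        for (std::uint32_t k : keys)                  // stable scatter by current byte
            buffer[count[(k >> shift) & 0xFF]++] = k;
        keys.swap(buffer);
    }

    // Keys are now sorted; count boundaries between adjacent distinct values.
    int distinct = 0;
    for (std::size_t i = 0; i < keys.size(); ++i)
        if (i == 0 || keys[i] != keys[i - 1])
            ++distinct;
    return distinct;
}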
The best way is to customize the quicksort algorithm so that, while partitioning, whenever we find two equal elements we overwrite the second duplicate with the last element in the range and then shrink the range. This ensures you will not process duplicate elements twice. Also, after the quicksort is done, the size of the remaining range is the answer.
Complexity is still O(n log n), but this should save at least two passes over the array.
Also, the savings are proportional to the percentage of duplicates. Imagine if they twist the original question with, say, "90% of the elements are duplicates"...
One more approach:
Space efficient: use a map (balanced tree). Inserts are O(log N) each, O(n log N) overall; just keep a count of the number of elements successfully inserted.
Time efficient: use a hash table. Inserts are O(1) on average, O(n) overall; again, just keep a count of the number of elements successfully inserted.
You have nested loops in your code. If you scan each element over the whole array it gives you O(n^2) time complexity, which is not acceptable in most scenarios. That is the reason merge sort and quicksort came about: to save processing cycles and machine effort. I suggest you go through the suggested links and redesign your program.
I have a data structure like this:
struct X {
float value;
int id;
};
a vector of those (size N, think 100000), sorted by value (it stays constant during the execution of the program):
std::vector<X> values;
Now, I want to write a function
void subvector(std::vector<X> const& values,
std::vector<int> const& ids,
std::vector<X>& out /*,
helper data here */);
that fills the out parameter with a sorted subset of values, given by the passed ids (size M < N (about 0.8 times N)), fast (memory is not an issue, and this will be done repeatedly, so building lookuptables (the helper data from the function parameters) or something else that is done only once is entirely ok).
My solution so far:
Build lookuptable lut containing id -> offset in values (preparation, so constant runtime)
create std::vector<X> tmp, size N, filled with invalid ids (linear in N)
for each id, copy values[lut[id]] to tmp[lut[id]] (linear in M)
loop over tmp, copying items to out (linear in N)
this is linear in N (as it's bigger than M), but the temporary variable and repeated copying bug me. Is there a way to do it quicker than this? Note that M will be close to N, so things that are O(M log N) are unfavourable.
Edit: http://ideone.com/xR8Vp is a sample implementation of mentioned algorithm, to make the desired output clear and prove that it's doable in linear time - the question is about the possibility of avoiding the temporary variable or speeding it up in some other way, something that is not linear is not faster :).
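Not having the linked sample handy, here is a hypothetical sketch of the lookup-table approach described in the question (it marks selected offsets in a reusable flag buffer instead of copying whole X objects into a temporary, but is otherwise the same linear-in-N idea; it assumes the ids are non-negative and bounded by max_id):
#include <algorithm>
#include <cstddef>
#include <vector>

struct X {
    float value;
    int id;
};

// Preparation, done once: id -> offset into the sorted `values` vector.
std::vector<int> build_lut(const std::vector<X>& values, int max_id)
{
    std::vector<int> lut(max_id + 1, -1);
    for (int i = 0; i < static_cast<int>(values.size()); ++i)
        lut[values[i].id] = i;
    return lut;
}

// Linear in N: mark the selected offsets, then sweep `values` in order.
void subvector(const std::vector<X>& values,
               const std::vector<int>& ids,
               std::vector<X>& out,
               const std::vector<int>& lut,      // helper data, built once
               std::vector<char>& selected)      // helper buffer of size values.size(), reused
{
    std::fill(selected.begin(), selected.end(), 0);
    for (int id : ids)
        selected[lut[id]] = 1;                   // mark offsets to keep (linear in M)

    out.clear();
    out.reserve(ids.size());
    for (std::size_t i = 0; i < values.size(); ++i)
        if (selected[i])
            out.push_back(values[i]);            // the sweep preserves sorted order (linear in N)
}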
An alternative approach you could try is to use a hash table instead of a vector to look up ids in:
void subvector(std::vector<X> const& values,
std::unordered_set<int> const& ids,
std::vector<X>& out) {
out.clear();
out.reserve(ids.size());
for(std::vector<X>::const_iterator i = values.begin(); i != values.end(); ++i) {
if(ids.find(i->id) != ids.end()) {
out.push_back(*i);
}
}
}
This runs in linear time since unordered_set::find is constant expected time (assuming that we have no problems hashing ints). However I suspect it might not be as fast in practice as the approach you described initially using vectors.
Since your vector is sorted, and you want a subset of it sorted the same way, I assume we can just slice out the chunk you want without rearranging it.
Why not just use find_if() twice: once to find the start of the range you want, and once to find the end? This will give you the start and end iterators of the subvector. Construct a new vector using those iterators; one of the vector constructor overloads takes two iterators.
That or the partition algorithm should work.
If I understood your problem correctly, you are actually trying to create a linear-time sorting algorithm (with respect to the input size M).
That is NOT possible.
Your current approach is to have a sorted list of possible values.
This takes linear time to the number of possible values N (theoretically, given that the map search takes O(1) time).
The best you could do is to sort the values (that you found from the map) with a fast sorting method (O(M log M), e.g. quicksort, mergesort, etc.) for small values of M, and maybe do that linear scan for bigger values of M.
For example, if N is 100000 and M is 100 it is much faster to just use a sorting algorithm.
I hope you can understand what I say. If you still have questions I will try to answer them :)
edit: (comment)
I will further explain what I mean.
Say you know that your numbers will range from 1 to 100.
You have them sorted somewhere (actually they are "naturally" sorted) and you want to get a subset of them in sorted form.
If it were possible to do it faster than O(N) or O(M log M), sorting algorithms would just use this method to sort.
For example, given the set of numbers {5,10,3,8,9,1,7} and knowing that they are a subset of the sorted set of numbers {1,2,3,4,5,6,7,8,9,10}, you still can't sort them faster than O(N) (N = 10) or O(M log M) (M = 7).