I'm making a program that tests and compares stats of Multi-Key Sequential search and Interpolation binary search. I'm asking for advice:
What is the best way to sort a random-generated array of integers, or even generate it like a sorted one (if that makes any sense) in given context?
I was looking into some sorting techniques, but, if you keep in mind that the accent is on searching (not sorting) performance, all of the advanced sorts seem rather complicated to be used in just one utility method. Considering that the array has to be larger than 10^6 elements (for testing purposes), modified Bubble, Selection, or Insertion sorts are not an option.
Additional constraint is that all of the array members must be unique.
Now, my initial idea was to split the interval [INT_MIN, INT_MAX] into n intervals (n being the array length) and then add a random integer from 0 to 2^32/n (rounded down) to every interval beginning.
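For concreteness, a rough sketch of that idea (my own, untested rendering; the function name is made up): each bucket contributes exactly one value, so the result comes out sorted and unique by construction.
#include <climits>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

std::vector<int> make_sorted_random(std::size_t n)
{
    std::mt19937 eng{ std::random_device{}() };
    const std::uint64_t width = (std::uint64_t{1} << 32) / n;          // bucket width, rounded down
    std::uniform_int_distribution<std::uint64_t> offset(0, width - 1); // random offset inside a bucket

    std::vector<int> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        // bucket start + offset, then shift so the values span [INT_MIN, INT_MAX]
        const std::uint64_t v = i * width + offset(eng);
        out.push_back(static_cast<int>(static_cast<std::int64_t>(v) + INT_MIN));
    }
    return out; // sorted and unique by construction
}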
The problem is this:
I presume that, as n rises closer to 2^32, like mine does, Interpolation search begins to give better and better results, as its interpolation gets more accurate.
However:
If I rely solely on pseudo-random number generators (like rand()), their dispersion characteristics dictate the same tendency for a generated-then-sorted array, that is, Interpolation gets better at pinpointing the most likely location as the size gets closer to the int limit. Uniformity/dispersion characteristics get lost as n rises towards INT_MAX, so, due to the stated limitations, Interpolation seems to always win.
Feel free to discuss, criticize and clarify this question as you see fit, but I'm rather desperate for an answer, because the test seems to be rigged in Interpolation's favor either way and I want to analyze them fairly. In short: I want to be convinced that my initial idea doesn't tilt the scales in Interpolation's favor even further, and I want to use it because it's O(n).
Here is a method to generate an ordered random sequence. It uses Knuth's algorithm S and is taken from the book Programming Pearls.
This requires a function that returns a random double in the range [0,1). I included my_rand() as an example. I've also modified it to take an output iterator for the destination.
#include <iterator>
#include <random>
#include <vector>

namespace
{
    std::random_device rd;
    std::mt19937 eng{ rd() };
    std::uniform_real_distribution<> dist; // [0,1)
    double my_rand() { return dist(eng); }
}

// Programming Pearls column 11.2
// Knuth's algorithm S (3.4.2)
// output M integers (in order) in range 1..N
template <typename OutIt>
void knuth_s(int M, int N, OutIt dest)
{
    double select = M, remaining = N;
    for (int i = 1; i <= N; ++i) {
        if (my_rand() < select / remaining) {
            *dest++ = i;
            --select;
        }
        --remaining;
    }
}

int main()
{
    std::vector<int> data;
    knuth_s(20, 200, std::back_inserter(data)); // 20 values in [1,200]
}
Demo in ideone.com
So you want to generate an "array" that has N unique random numbers and they must be in a sorted order? This sounds like a perfect use for a std::set. When inserting elements into a set they are sorted for us automatically and a set can only contain unique elements so it takes care of checking if the random number has already been generated.
std::set<unsigned int> random_numbers;
std::random_device rd;
std::mt19937 mt(rd());
while (random_numbers.size() < number_of_random_numbers_needed)
{
random_numbers.insert(mt());
}
Then you can convert the set to something else like a std::vector or std::array if you don't want to keep it as a set.
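For example, assuming the set above, the conversion to a vector is a one-liner (the iterator-range constructor keeps the sorted order):
std::vector<unsigned int> sorted_values(random_numbers.begin(), random_numbers.end());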
What about generating a sorted array from statistical properties?
This probably needs some digging but you should be able to generate the integers in order by adding a random difference whose mean is the standard deviation of your overall sample.
That raises some problem at range boundaries, but given the size of your sample you can probably ignore it.
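A hypothetical sketch of that suggestion (the function name and the choice of a uniform gap distribution are mine): accumulate random gaps whose mean is the range width divided by n, and keep every gap at least 1 so the values stay unique. As noted, the sequence may stop short of n near the range boundary.
#include <climits>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

std::vector<int> generate_by_gaps(std::size_t n)
{
    // assumes 1 <= n and n is well below 2^32, so mean_gap >= 1
    std::mt19937_64 eng{ std::random_device{}() };
    const std::int64_t mean_gap = (std::int64_t{1} << 32) / static_cast<std::int64_t>(n);
    std::uniform_int_distribution<std::int64_t> gap(1, 2 * mean_gap - 1); // mean is roughly mean_gap

    std::vector<int> out;
    out.reserve(n);
    std::int64_t current = INT_MIN;
    for (std::size_t i = 0; i < n && current <= INT_MAX; ++i) {
        out.push_back(static_cast<int>(current));
        current += gap(eng); // gaps >= 1 keep the values strictly increasing and unique
    }
    return out; // may contain fewer than n values if the walk hits INT_MAX early
}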
OK, I've decided to transfer the responsibility to the built-in PRNG and do the following:
Add n rand() results to a binary tree and fill the array by traversing it in order (from the leftmost leaf).
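A minimal sketch of that approach, using std::set as the tree (it also enforces uniqueness; note that rand()'s range must comfortably exceed n, otherwise the loop cannot finish):
#include <cstdlib>
#include <set>
#include <vector>

std::vector<int> from_prng(std::size_t n)
{
    std::set<int> tree;
    while (tree.size() < n)
        tree.insert(std::rand());                       // duplicates are silently dropped
    return std::vector<int>(tree.begin(), tree.end());  // in-order traversal => sorted
}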
I was set a homework challenge as part of an application process (I was rejected, by the way; I wouldn't be writing this otherwise) in which I was to implement the following functions:
// Store a collection of integers
class IntegerCollection {
public:
    // Insert one entry with value x
    void Insert(int x);

    // Erase one entry with value x, if one exists
    void Erase(int x);

    // Erase all entries, x, from <= x < to
    void Erase(int from, int to);

    // Return the count of all entries, x, from <= x < to
    size_t Count(int from, int to) const;
The functions were then put through a bunch of tests, most of which were trivial. The final test was the real challenge as it performed 500,000 single insertions, 500,000 calls to count and 500,000 single deletions.
The member variables of IntegerCollection were not specified and so I had to choose how to store the integers. Naturally, an STL container seemed like a good idea and keeping it sorted seemed an easy way to keep things efficient.
Here is my code for the four functions using a vector:
// Previous bit of code shown goes here
private:
    std::vector<int> integerCollection;
};

void IntegerCollection::Insert(int x) {
    /* using lower_bound to find the right place for x to be inserted
       keeps the vector sorted and makes life much easier */
    auto it = std::lower_bound(integerCollection.begin(), integerCollection.end(), x);
    integerCollection.insert(it, x);
}

void IntegerCollection::Erase(int x) {
    // find the location of the first element containing x and delete if it exists
    auto it = std::find(integerCollection.begin(), integerCollection.end(), x);
    if (it != integerCollection.end()) {
        integerCollection.erase(it);
    }
}

void IntegerCollection::Erase(int from, int to) {
    if (integerCollection.empty()) return;
    // lower_bound points to the first element of integerCollection >= from/to
    auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
    auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
    /* std::vector::erase deletes entries between the two iterators
       fromBound (included) and toBound (not included) */
    integerCollection.erase(fromBound, toBound);
}

size_t IntegerCollection::Count(int from, int to) const {
    if (integerCollection.empty()) return 0;
    int count = 0;
    // lower_bound points to the first element of integerCollection >= from/to
    auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
    auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
    // advance the iterator until fromBound == toBound (we don't count elements of value = to)
    while (fromBound != toBound) {
        ++count;
        ++fromBound;
    }
    return count;
}
The company got back to me saying that they wouldn't be moving forward because my choice of container meant the runtime complexity was too high. I also tried using list and deque and compared the runtime. As I expected, I found that list was dreadful and that vector took the edge over deque. So as far as I was concerned I had made the best of a bad situation, but apparently not!
I would like to know what the correct container to use in this situation is? deque only makes sense if I can guarantee insertion or deletion to the ends of the container and list hogs memory. Is there something else that I'm completely overlooking?
We cannot know what would make the company happy. If they reject std::vector without concise reasoning, I wouldn't want to work for them anyway. Moreover, we don't really know the precise requirements. Were you asked to provide one reasonably well-performing implementation? Did they expect you to squeeze out the last percent of the provided benchmark by profiling a bunch of different implementations?
The latter is probably too much for a homework challenge as part of an application process. If it is the former, you can either:
roll your own. It is unlikely that the interface you were given can be implemented more efficiently than one of the std containers does... unless your requirements are so specific that you can write something that performs well under that specific benchmark.
std::vector for data locality. See eg here for Bjarne himself advocating std::vector rather than linked lists.
std::set for ease of implementation. It seems like you want the container sorted, and the interface you have to implement fits that of std::set quite well (a rough sketch of this option follows below).
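A rough sketch of the std::set option, under my assumption that duplicates are allowed (the interface speaks of inserting and erasing individual entries), so std::multiset is used; note that Count still walks the range, because std::distance is linear for (multi)set iterators:
#include <cstddef>
#include <iterator>
#include <set>

class IntegerCollection {
public:
    void Insert(int x) { entries.insert(x); }            // O(log N)

    void Erase(int x) {                                   // erase a single entry, O(log N)
        auto it = entries.find(x);
        if (it != entries.end()) entries.erase(it);
    }

    void Erase(int from, int to) {                        // erase all entries in [from, to)
        entries.erase(entries.lower_bound(from), entries.lower_bound(to));
    }

    size_t Count(int from, int to) const {                // count entries in [from, to)
        return std::distance(entries.lower_bound(from), entries.lower_bound(to));
    }

private:
    std::multiset<int> entries;
};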
Let's compare only insertion and erasure, assuming the container needs to stay sorted:
operation    std::set    std::vector
insert       log(N)      N
erase        log(N)      N
Note that the log(N) for the binary_search to find the position to insert/erase in the vector can be neglected compared to the N.
Now you have to consider that the asymptotic complexity listed above completely neglects the non-linearity of memory access. In reality, data can be far away in memory (std::set), leading to many cache misses, or it can be local as with std::vector. The log(N) only wins for huge N. To get an idea of the difference: 500000/log(500000) is roughly 26410, while 1000/log(1000) is only ~100.
I would expect std::vector to outperform std::set for fairly small container sizes, but at some point the log(N) wins over the cache. The exact location of this turning point depends on many factors and can only be reliably determined by profiling and measuring.
Nobody knows which container is MOST efficient for multiple insertions/deletions. That is like asking what the most fuel-efficient design for a car engine is; people are always innovating on car engines and make more efficient ones all the time. However, I would recommend a splay tree. The time required for an insertion or deletion in a splay tree is not constant: some insertions take a long time and some take only a very short time. However, the average time per insertion/deletion is always guaranteed to be O(log n), where n is the number of items being stored in the splay tree. Logarithmic time is extremely efficient, and it should be good enough for your purposes.
The first thing that comes to mind is to hash the integer value so single look ups can be done in constant time.
The integer value can be hashed to compute an index into an array of bools or bits, used to tell if the integer value is in the container or not.
Counting and deleting large ranges could be sped up from there, by using multiple hash tables for specific integer ranges.
If you had 0x10000 hash tables, that each stored ints from 0 to 0xFFFF and were using 32 bit integers you could then mask and shift the upper half of the int value and use that as an index to find the correct hash table to insert / delete values from.
IntHashTable containers[0x10000];

u_int32 hashIndex = (u_int32)value / 0x10000;                   // upper 16 bits select the table
u_int32 valueInTable = (u_int32)value - (hashIndex * 0x10000);  // lower 16 bits within the table
containers[hashIndex].insert(valueInTable);
Count, for example, could be implemented like so, if each hash table kept a count of the number of elements it contained:
int indexStart = startRange / 0x10000;
int indexEnd = endRange / 0x10000;
int countTotal = 0;
// Note: the first and last tables straddle the range boundaries, so an exact
// answer would need a partial count for those two; this sums whole tables only.
for (int i = indexStart; i <= indexEnd; ++i) {
    countTotal += containers[i].count();
}
Not sure if sorting really is a requirement for removing the range; it might be based on position. Anyway, here is a link with some hints on which STL container to use.
In which scenario do I use a particular STL container?
Just FYI.
Vector may be a good choice, but it does a lot of reallocation, as you know. I prefer deque instead, as it doesn't require a big chunk of memory to allocate all items. For such requirements as you had, list probably fits better.
A basic solution for this problem might be std::map<int, int>,
where the key is the integer you are storing and the value is the number of occurrences.
The problem with this is that you cannot quickly remove/count ranges; in other words, the complexity is linear.
For a quick count you would need to implement your own complete binary tree where you can know the number of nodes between two nodes (the upper- and lower-bound nodes), because you know the size of the tree and how many left and right turns you took to reach the upper- and lower-bound nodes. Note that we are talking about a complete binary tree; in a general binary tree you cannot make this calculation fast.
For quick range remove I do not know how to make it faster than linear.
If I used (with appropriate #includes)
int main()
{
    srand(time(0));
    int arr[1000];
    for (int i = 0; i < 1000; i++)
    {
        arr[i] = rand() % 100000;
    }
    return 0;
}
To generate random 5-digit ID numbers (disregard iomanip stuff here), would those ID numbers be guaranteed by rand() to be unique? I've been running another loop to check all the values of the array vs the recently generated ID number, but it takes forever to run, considering the nested 1000-iteration loops. By the way, is there a simple way to do that check?
Since the question was tagged c++11,
you should consider using <random> in place of rand().
Using a standard distribution engine, you can't guarantee that you will get back unique values. If you use a std::set, you can keep retrying until you have the right amount. Depending on your distribution range, and the amount of unique values you are requesting, that may be adequate.
For example, here is a customized function to get n unique values from range [x,y].
#include <unordered_set>
#include <iostream>
#include <random>
template <typename T>
std::unordered_set<T> GetUniqueNumbers(int amount, T low, T high) {
    static std::random_device random_device;
    static std::mt19937 engine{ random_device() };
    std::uniform_int_distribution<T> dist(low, high);
    std::unordered_set<T> uniques;
    while (uniques.size() < static_cast<std::size_t>(amount)) {
        uniques.insert(dist(engine));
    }
    return uniques;
}

int main() {
    // get 10 unique numbers between [0,100]
    auto numbers = GetUniqueNumbers(10, 0, 100);
    for (auto number : numbers) {
        std::cout << number << " ";
    }
}
No, because any guarantee about the output of a random source makes it less random.
There are specific mathematical formulas that have the behavior known as a random permutation. This site seems to have quite a good write-up about it: http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/
No, there is definitely no guarantee that rand will not produce duplicate numbers. Designing it that way would not only be expensive in terms of remembering all the numbers it has returned so far, but would also greatly reduce its randomness (after it had returned many numbers, you could guess what it is likely to return next from what it has already returned).
If uniqueness is your only goal, just use an incrementing ID number for each thing. If the numbers must also be arbitrary and hard to guess you will have to use some kind of random generator or hash, but should make the numbers much longer to make the chance of a collision much closer to 0.
However if you absolutely must do it the current way I would suggest storing all the numbers you have generated so far into a std::unordered_map and generating another random number if it is already in it.
There is a common uniqueness guarantee in most PRNGs, but it won't help you here. A generator will typically iterate over a finite number of states and not visit the same state twice until every other state has been visited once.
However, a state is not the same thing as the number you get to see. Many states can map to the same number and in the worst possible case two consecutive states could map to the same number.
That said, there are specific configurations of PRNG that can visit every value in a range you specify exactly once before revisiting an old state. Notably, an LCG designed with a modulo that is a multiple of your range can be reduced to exactly your range with another modulo operation. Since most LCG implementations have a power-of-two period, this means that the low-order bits repeat with shorter periods. However, 10000 is not a power of two, so that won't help you.
A simple method is to use an LCG, bitmask it down to a power of two larger than your desired range, and just throw away results that it produces that are out of range.
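A hypothetical sketch of that mask-and-reject idea (the LCG constants below are the common Numerical Recipes ones, modulus 2^32; any full-period power-of-two LCG behaves the same way). Because the low k bits of such an LCG are themselves a full-period generator mod 2^k, each masked value appears exactly once per window of 2^k outputs, so the accepted in-range values are unique as long as you request no more than range of them.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint32_t> unique_sequence(std::uint32_t range, std::size_t count,
                                           std::uint32_t seed)
{
    std::uint32_t mask = 1;
    while (mask < range) mask <<= 1;   // smallest power of two >= range...
    --mask;                            // ...minus one, e.g. range 100000 -> mask 0x1FFFF

    std::vector<std::uint32_t> out;
    if (count > range) return out;     // cannot produce more unique values than the range holds
    out.reserve(count);

    std::uint32_t state = seed;
    while (out.size() < count) {
        state = state * 1664525u + 1013904223u;   // LCG step, wraps mod 2^32
        const std::uint32_t candidate = state & mask;
        if (candidate < range)
            out.push_back(candidate);             // throw away out-of-range results
    }
    return out;
}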
Is there a way using the C++ standard library built in random generator to get a specific random number in a sequence, without saving them all?
Like
srand(cTime);
getRand(1); // 10
getRand(2); // 8995
getRand(3); // 65464456
getRand(1); // 10
getRand(2); // 8995
getRand(1); // 10
getRand(3); // 65464456
C++11 random number engines are required to implement a member function discard(unsigned long long z) (§26.5.1.4) that advances the random number sequence by z steps. The complexity guarantee is quite weak: "no worse than the complexity of z consecutive calls e()". This member obviously exists solely to make it possible to expose more performant implementations when possible, as note 274 states:
This operation is common in user code, and can often be implemented
in an engine-specific manner so as to provide significant performance
improvements over an equivalent naive loop that makes z consecutive
calls e().
Given discard you can easily implement your requirement to retrieve the nth number in sequence by reseeding a generator, discarding n-1 values and using the next generated value.
I'm unaware of which - if any - of the standard RNG engines are amenable to efficient implementations of discard. It may be worth your time to do a bit of investigation and profiling.
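For example, a minimal sketch of the reseed-and-discard approach (the fixed seed value is arbitrary, and nothing here guarantees that values at different indexes are unique):
#include <random>

std::mt19937::result_type getRand(unsigned long long n)   // n is a 1-based index
{
    std::mt19937 eng{ 999 };   // fixed seed so the sequence is reproducible
    eng.discard(n - 1);        // skip the first n-1 values (may cost n-1 engine calls)
    return eng();              // the n-th value of the sequence
}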
You have to save the numbers. There may be other variants, but it still requires saving a list of numbers (e.g. using different seeds based on the argument to getRand() - but that wouldn't really be beneficial over saving them).
Something like this would work reasonably well, I'd say:
int getRand(int n)
{
    static std::map<int, int> mrand;
    // Check if it's there.
    auto it = mrand.find(n);
    if (it != mrand.end())
    {
        return it->second;
    }
    int r = rand();
    mrand[n] = r;
    return r;
}
(I haven't compiled this code, just written it up as a "this sort of thing might work")
Implement getRand() to always seed and then return the given number. This will interfere with all other random numbers in a system, though, and will be slow, especially for large indexes. Assuming a 1-based index:
int getRand(int index)
{
    srand(999); // fix the seed
    for (int loop = 1; loop < index; ++loop)
        rand();
    return rand();
}
Similar to cdmh's post, the following, using C++11, could also be used:
#include <random>

long getrand(int index)
{
    std::default_random_engine e;
    for (auto i = 1; i < index; i++)
        e();
    return e();
}
Check out:
Random123
From the documentation:
Random123 is a library of "counter-based" random number generators (CBRNGs), in which the Nth random number can be obtained by applying a stateless mixing function to N.
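The counter-based idea itself is easy to sketch with an ordinary 64-bit mixing function; the constants below are the well-known splitmix64 ones, and this is only an illustration of the concept, not Random123 code:
#include <cstdint>

// The n-th "random" number is a pure function of (seed, n): there is no state
// to advance, so any index can be queried directly and repeat queries agree.
std::uint64_t counter_based(std::uint64_t seed, std::uint64_t n)
{
    std::uint64_t z = seed + n * 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}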
I have a data structure like this:
struct X {
float value;
int id;
};
a vector of those (size N, think 100000), sorted by value (this stays constant during the execution of the program):
std::vector<X> values;
Now, I want to write a function
void subvector(std::vector<X> const& values,
std::vector<int> const& ids,
std::vector<X>& out /*,
helper data here */);
that fills the out parameter with a sorted subset of values, given by the passed ids (size M < N, about 0.8 times N), fast (memory is not an issue, and this will be done repeatedly, so building lookup tables (the helper data from the function parameters) or anything else that is done only once is entirely ok).
My solution so far:
Build lookuptable lut containing id -> offset in values (preparation, so constant runtime)
create std::vector<X> tmp, size N, filled with invalid ids (linear in N)
for each id, copy values[lut[id]] to tmp[lut[id]] (linear in M)
loop over tmp, copying items to out (linear in N)
this is linear in N (as it's bigger than M), but the temporary variable and repeated copying bugs me. Is there a way to do it quicker than this? Note that M will be close to N, so things that are O(M log N) are unfavourable.
Edit: http://ideone.com/xR8Vp is a sample implementation of mentioned algorithm, to make the desired output clear and prove that it's doable in linear time - the question is about the possibility of avoiding the temporary variable or speeding it up in some other way, something that is not linear is not faster :).
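For reference, a rough sketch of the algorithm described above (my own rendering, assuming ids are non-negative and small enough to index a plain vector used as the lookup table):
#include <cstddef>
#include <vector>

struct X {            // as in the question
    float value;
    int id;
};

void subvector(std::vector<X> const& values,
               std::vector<int> const& ids,
               std::vector<X>& out,
               std::vector<std::size_t> const& lut)        // lut[id] == offset of id in values
{
    const int invalid = -1;
    std::vector<X> tmp(values.size(), X{0.0f, invalid});   // filled with an invalid id
    for (int id : ids)
        tmp[lut[id]] = values[lut[id]];                     // linear in M
    out.clear();
    out.reserve(ids.size());
    for (X const& x : tmp)                                  // linear in N, keeps the value order
        if (x.id != invalid)
            out.push_back(x);
}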
An alternative approach you could try is to use a hash table instead of a vector to look up ids in:
void subvector(std::vector<X> const& values,
               std::unordered_set<int> const& ids,
               std::vector<X>& out) {
    out.clear();
    out.reserve(ids.size());
    for (std::vector<X>::const_iterator i = values.begin(); i != values.end(); ++i) {
        if (ids.find(i->id) != ids.end()) {
            out.push_back(*i);
        }
    }
}
This runs in linear time since unordered_set::find is constant expected time (assuming that we have no problems hashing ints). However I suspect it might not be as fast in practice as the approach you described initially using vectors.
Since your vector is sorted, and you want a subset of it sorted the same way, I assume we can just slice out the chunk you want without rearranging it.
Why not just use find_if() twice. Once to find the start of the range you want and once to find the end of the range. This will give you the start and end iterators of the sub vector. Construct a new vector using those iterators. One of the vector constructor overloads takes two iterators.
That or the partition algorithm should work.
If I understood your problem correctly, you are actually trying to create a linear-time sorting algorithm (with respect to the input size M).
That is NOT possible.
Your current approach is to have a sorted list of possible values.
This takes time linear in the number of possible values N (theoretically, given that the map search takes O(1) time).
The best you could do is to sort the values (you found from the map) with a fast sorting method (O(M log M), e.g. quicksort, mergesort, etc.) for small values of M, and maybe do that linear scan for bigger values of M.
For example, if N is 100000 and M is 100 it is much faster to just use a sorting algorithm.
I hope you can understand what I say. If you still have questions I will try to answer them :)
edit: (comment)
I will further explain what I mean.
Say you know that your numbers will range from 1 to 100.
You have them sorted somewhere (actually they are "naturally" sorted) and you want to get a subset of them in sorted form.
If it were possible to do it faster than O(N) or O(M log M), sorting algorithms would just use this method to sort.
E.g., given the set of numbers {5,10,3,8,9,1,7} and knowing that they are a subset of the sorted set of numbers {1,2,3,4,5,6,7,8,9,10}, you still can't sort them faster than O(N) (N = 10) or O(M log M) (M = 7).
I'm intersecting some sets of numbers, and doing this by storing a count of each time I see a number in a map.
I'm finding the performance be very slow.
Details:
- One of the sets has 150,000 numbers in it
- The intersection of that set and another set takes about 300ms the first time, and about 5000ms the second time
- I haven't done any profiling yet, but every time I break into the debugger while doing the intersection, it's in malloc.c!
So, how can I improve this performance? Switch to a different data structure? Somehow improve the memory allocation performance of map?
Update:
Is there any way to ask std::map or boost::unordered_map to pre-allocate some space?
Or, are there any tips for using these efficiently?
Update2:
See Fast C++ container like the C# HashSet<T> and Dictionary<K,V>?
Update3:
I benchmarked set_intersection and got horrible results:
(set_intersection) Found 313 values in the intersection, in 11345ms
(set_intersection) Found 309 values in the intersection, in 12332ms
Code:
int runIntersectionTestAlgo()
{
    set<int> set1;
    set<int> set2;
    set<int> intersection;

    // Create 100,000 values for set1
    for (int i = 0; i < 100000; i++)
    {
        int value = 1000000000 + i;
        set1.insert(value);
    }

    // Create 1,000 values for set2
    for (int i = 0; i < 1000; i++)
    {
        int random = rand() % 200000 + 1;
        random *= 10;
        int value = 1000000000 + random;
        set2.insert(value);
    }

    set_intersection(set1.begin(), set1.end(), set2.begin(), set2.end(),
                     inserter(intersection, intersection.end()));

    return intersection.size();
}
You should definitely be using preallocated vectors which are way faster. The problem with doing set intersection with stl sets is that each time you move to the next element you're chasing a dynamically allocated pointer, which could easily not be in your CPU caches. With a vector the next element will often be in your cache because it's physically close to the previous element.
The trick with vectors is that if you don't preallocate the memory for a task like this, it'll perform EVEN WORSE because it'll go on reallocating memory as it resizes itself during your initialization step.
Try something like this instead - it'll be WAY faster.
int runIntersectionTestAlgo() {
    vector<int> vector1; vector1.reserve(100000);
    vector<int> vector2; vector2.reserve(1000);

    // Create 100,000 values for vector1
    for (int i = 0; i < 100000; i++) {
        int value = 1000000000 + i;
        vector1.push_back(value);
    }
    sort(vector1.begin(), vector1.end());

    // Create 1,000 values for vector2
    for (int i = 0; i < 1000; i++) {
        int random = rand() % 200000 + 1;
        random *= 10;
        int value = 1000000000 + random;
        vector2.push_back(value);
    }
    sort(vector2.begin(), vector2.end());

    // Reserve at most 1,000 spots for the intersection
    vector<int> intersection; intersection.reserve(min(vector1.size(), vector2.size()));
    set_intersection(vector1.begin(), vector1.end(), vector2.begin(), vector2.end(),
                     back_inserter(intersection));

    return intersection.size();
}
Without knowing any more about your problem, "check with a good profiler" is the best general advice I can give. Beyond that...
If memory allocation is your problem, switch to some sort of pooled allocator that reduces calls to malloc. Boost has a number of custom allocators that should be compatible with std::allocator<T>. In fact, you may even try this before profiling, if you've already noticed debug-break samples always ending up in malloc.
If your number-space is known to be dense, you can switch to using a vector- or bitset-based implementation, using your numbers as indexes in the vector.
If your number-space is mostly sparse but has some natural clustering (this is a big if), you may switch to a map-of-vectors. Use higher-order bits for map indexing, and lower-order bits for vector indexing. This is functionally very similar to simply using a pooled allocator, but it is likely to give you better caching behavior. This makes sense, since you are providing more information to the machine (clustering is explicit and cache-friendly, rather than a random distribution you'd expect from pool allocation).
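A hypothetical sketch of that map-of-vectors idea (the 16/16 bit split is my choice): the high bits select a bucket in the map, and the low bits index a bit-vector inside the bucket.
#include <cstdint>
#include <map>
#include <vector>

struct ClusteredSet {
    std::map<std::uint32_t, std::vector<bool>> buckets;

    void insert(std::uint32_t v) {
        std::vector<bool>& bits = buckets[v >> 16];       // high 16 bits pick the bucket
        if (bits.empty()) bits.resize(1u << 16, false);
        bits[v & 0xFFFFu] = true;                          // low 16 bits index within it
    }

    bool contains(std::uint32_t v) const {
        auto it = buckets.find(v >> 16);
        return it != buckets.end() && it->second[v & 0xFFFFu];
    }
};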
I would second the suggestion to sort them. There are already STL set algorithms that operate on sorted ranges (like set_intersection, set_union, etc):
set_intersection
I don't understand why you have to use a map to do intersection. Like people have said, you could put the sets in std::set's, and then use std::set_intersection().
Or you can put them into hash_set's. But then you would have to implement intersection manually: technically you only need to put one of the sets into a hash_set, and then loop through the other one, and test if each element is contained in the hash_set.
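With C++11 that hash_set would be std::unordered_set; a minimal sketch of the one-sided test (put the larger set into the hash set once, then probe with each element of the smaller one):
#include <unordered_set>
#include <vector>

std::vector<int> intersect(std::unordered_set<int> const& big,
                           std::vector<int> const& small_values)
{
    std::vector<int> result;
    for (int v : small_values)
        if (big.count(v))          // expected O(1) per lookup
            result.push_back(v);
    return result;
}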
Intersection with maps is slow; try a hash_map (however, this is not provided in all STL implementations).
Alternatively, sort both maps and do it in a merge-sort-like way.
What is your intersection algorithm? Maybe there are some improvements to be made?
Here is an alternate method
I do not know it to be faster or slower, but it could be something to try. Before doing so, I also recommend using a profiler to ensure you really are working on the hotspot. Change the sets of numbers you are intersecting to use std::set<int> instead. Then iterate through the smallest one looking at each value you find. For each value in the smallest set, use the find method to see if the number is present in each of the other sets (for performance, search from smallest to largest).
This is optimised in the case that the number is not found in all of the sets, so if the intersection is relatively small, it may be fast.
Then, store the intersection in std::vector<int> instead - insertion using push_back is also very fast.
Here is another alternate method
Change the sets of numbers to std::vector<int> and use std::sort to sort from smallest to largest. Then use std::binary_search to find the values, using roughly the same method as above. This may be faster than searching a std::set since the array is more tightly packed in memory. Actually, never mind that, you can then just iterate through the values in lock-step, looking at the ones with the same value. Increment only the iterators which are less than the minimum value you saw at the previous step (if the values were different).
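A small sketch of that lock-step walk over two sorted vectors (essentially what std::set_intersection does internally):
#include <vector>

std::vector<int> intersect_sorted(std::vector<int> const& a, std::vector<int> const& b)
{
    std::vector<int> result;
    std::vector<int>::const_iterator ia = a.begin(), ib = b.begin();
    while (ia != a.end() && ib != b.end()) {
        if (*ia < *ib)      ++ia;                        // advance the smaller side
        else if (*ib < *ia) ++ib;
        else { result.push_back(*ia); ++ia; ++ib; }      // equal: part of the intersection
    }
    return result;
}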
Might be your algorithm. As I understand it, you are spinning over each set (which I'm hoping is a standard set), and throwing them into yet another map. This is doing a lot of work you don't need to do, since the keys of a standard set are in sorted order already. Instead, take a "merge-sort"-like approach. Spin over each iterator, dereferencing to find the min. Count the number that have that min, and increment those. If the count was N, add it to the intersection. Repeat until the first map hits its end. (If you compare the sizes before starting, you won't have to check every map's end each time.)
Responding to update: There do exist facilities to speed up memory allocation by pre-reserving space, like boost::pool_alloc. Something like:
std::map<int, int, std::less<int>, boost::pool_allocator< std::pair<int const, int> > > m;
But honestly, malloc is pretty good at what it does; I'd profile before doing anything too extreme.
Look at your algorithms, then choose the proper data type. If you're going to have set-like behaviour, and want to do intersections and the like, std::set is the container to use.
Since its elements are stored in a sorted way, insertion may cost you O(log N), but intersection with another (sorted!) std::set can be done in linear time.
I figured something out: if I attach the debugger to either RELEASE or DEBUG builds (e.g. hit F5 in the IDE), then I get horrible times.