Choosing N random elements from a set - C++

I have a sorted set (std::set to be precise) that contains elements with an assigned weight. I want to randomly choose N elements from this set, while the elements with higher weight should have a bigger probability of being chosen. Any element can be chosen multiple times.
I want to do this as efficiently as possible: I want to avoid any copying of the set (it might get very large) and run in O(N) time if possible. I'm using C++ and would like to stick to an STL + Boost only solution.
Does anybody know if there is a function in STL/Boost that performs this task? If not, how to implement one?

You need to calculate (and possibly cache, if you care about performance) the sum of all weights in your set. Then generate N random numbers uniformly distributed between 0 and this total, and sort them. Finally, iterate the set once, keeping a running sum of the weights encountered so far: whenever the next random number falls below the running sum, emit the current element and advance to the following random number. Stop when the list of random numbers is exhausted or you've reached the end of the set.
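A minimal sketch of this approach (the Item struct and its weight field are illustrative assumptions; C++11's <random> is used for brevity, and Boost.Random offers equivalents):

#include <algorithm>
#include <random>
#include <set>
#include <vector>

struct Item {                    // illustrative element type
    int value;
    double weight;
    bool operator<(const Item& other) const { return value < other.value; }
};

// Draw n elements (with replacement) in one sweep over the set.
std::vector<Item> sample(const std::set<Item>& items, std::size_t n) {
    double total = 0.0;
    for (const Item& it : items) total += it.weight;  // cache if weights are stable

    std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<double> dist(0.0, total);

    std::vector<double> draws(n);
    for (double& d : draws) d = dist(gen);
    std::sort(draws.begin(), draws.end());  // sorted draws allow a single pass

    std::vector<Item> result;
    double running = 0.0;
    std::vector<double>::const_iterator next = draws.begin();
    for (const Item& it : items) {
        running += it.weight;
        while (next != draws.end() && *next <= running) {
            result.push_back(it);   // this element's weight interval covers *next
            ++next;
        }
    }
    return result;
}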

I don't know about any libraries, but it sounds like you have a weighted roulette wheel. Here's a reference with some pseudo-code, although the context is related to genetic algorithms: http://www.cse.unr.edu/~banerjee/selection.htm
As for "as efficiently as possible," that would depend on some characteristics of the data. In the application of the weighted roulette wheel, when searching for the index you could consider a binary search instead. However, it is not the case that each slot of the roulette wheel is equally likely, so it may make sense to examine them in order of their weights.

A lot depends on the amount of extra storage you're willing to expend to make the selection faster.
If you're not willing to use any extra storage, @Alex Emelianov's answer is pretty much what I was thinking of posting. If you're willing to use some extra storage (and possibly a different data structure than std::set), you could create a tree (like the one a set uses internally) but store at each node the total weight of the items to its left. This lets you map from a generated number to the associated value with logarithmic (rather than linear) complexity.
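As a sketch of this idea in an array-backed form (a Fenwick tree rather than an augmented std::set, since the standard containers expose no hooks for node annotations), both updating a weight and mapping a number to an element run in O(log n):

#include <cstddef>
#include <vector>

// Fenwick tree over the weights: both a point update and "find the element
// at a given cumulative weight" walk one root-leaf path, O(log n).
class WeightTree {
public:
    explicit WeightTree(std::size_t n) : tree_(n + 1, 0.0) {}

    void add(std::size_t i, double delta) {   // weight[i] += delta
        for (++i; i < tree_.size(); i += i & (~i + 1))  // add lowest set bit
            tree_[i] += delta;
    }

    // Index of the first element whose cumulative weight exceeds target.
    std::size_t find(double target) const {
        std::size_t pos = 0;
        for (std::size_t mask = highestBit(); mask; mask >>= 1) {
            std::size_t next = pos + mask;
            if (next < tree_.size() && tree_[next] <= target) {
                target -= tree_[next];
                pos = next;
            }
        }
        return pos;  // 0-based index into the weight array
    }

private:
    std::size_t highestBit() const {  // largest power of two <= element count
        std::size_t b = 1;
        while (b * 2 < tree_.size()) b *= 2;
        return b;
    }
    std::vector<double> tree_;
};

Sampling is then find(u) for u drawn uniformly from [0, total weight).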

Related

LRU sorted by score in C++, is there such a container?

I need to implement a very efficient LRU cache with the following properties: entries are indices into a vector of cache entries, and each cache hit updates an empirical score computed from values that can be kept in the container's value type, such as the number of hits, the size of the matched object, etc.
I need to be able to quickly pick a victim for cache eviction from the bottom of such an LRU, and to quickly iterate over some number of the best-performing entries from the top, so the container needs to be sorted.
So far, I have only been able to come up with a vector of structures that hold the values for score calculation plus bidirectional links, where I move an updated element into place after score recalculation by a linear search from its current position with score comparisons. This search may go upwards (when a score is updated, since it only ever grows) or downwards (when an element is evicted and its score resets to 0). Linear search may not be so bad here: the cache runs for a long time, the scores of surviving elements grow large, and each increment is small, so an element does not have to move very far to reach its place; and in the reset case I can start the search from the bottom.
I am aware of STL sorted containers, folly's cache LRU implementation, and Boost.Bimap (this last one seems to be an overkill for what I need).
Can I do better than a linear search here? Does anyone know of an implementation?
Thanks in advance!
Update: I implemented a solution involving a vector of iterators into a std::set whose elements hold an index back into the vector (for uniqueness) plus the data needed to compute the score, with a comparator that sorts by score.
Seems to work well, maybe there is a more elegant solution out there?
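For concreteness, a minimal sketch of that design (all names are illustrative): the set keeps (index, score) nodes ordered by score, and the vector maps an entry index back to its set node in O(1). Re-scoring becomes erase + re-insert, i.e. O(log n) instead of a linear search.

#include <set>
#include <vector>

struct Node {
    std::size_t index;   // position in the cache's entry vector (unique)
    double score;
    bool operator<(const Node& other) const {
        if (score != other.score) return score < other.score;
        return index < other.index;   // tie-break keeps nodes distinct
    }
};

std::set<Node> ranking;                           // sorted view, worst first
std::vector<std::set<Node>::iterator> byIndex;    // entry index -> set node

void updateScore(std::size_t i, double newScore) {
    Node n = *byIndex[i];
    ranking.erase(byIndex[i]);    // set keys are immutable while stored,
    n.score = newScore;           // so a score change is erase + re-insert
    byIndex[i] = ranking.insert(n).first;
}

The eviction victim is then *ranking.begin(), and the best performers are reached by iterating from ranking.rbegin().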

Efficient data structure to map integer-to-integer with find & insert, no allocations and fixed upper bound

I am looking for input on an associative data structure that might take advantage of the specific criteria of my use case.
Currently I am using a red/black tree to implement a dictionary that maps keys to values (in my case integers to addresses).
In my use case, the maximum number of elements is known up front (1024), and I will only ever be inserting and searching. Searching happens twenty times more often than inserting. At the end of the process I clear the structure and repeat again. There can be no allocations during use - only the initial up front one. Unfortunately, the STL and recent versions of C++ are not available.
Any insight?
I ended up implementing a simple linear-probe HashTable from an example here. I used the MurmurHash3 hash function since my data is randomized.
I modified the hash table in the following ways (a sketch reflecting these changes follows the list):
The size is a template parameter. Internally, the size is doubled. The implementation requires power-of-2 sizes and traditionally resizes at 75% occupancy; since I know I am going to fill the hash table up and cannot resize it, I pre-emptively double its capacity to keep it sparse enough. This may be less efficient when adding a small number of objects, but it is more efficient once the capacity starts to fill up.
I do not allow keys with a value of zero to be stored. This is okay for my application and it keeps the code simple.
All resizing and deleting is removed, replaced by a single clear operation which performs a memset.
I chose to inline the insert and lookup functions since they are quite small.
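A compact sketch incorporating those modifications (class and member names are illustrative; the hash shown is the finalizer mix from MurmurHash3, and the STL is avoided as the constraints require):

#include <cstring>

// Fixed-capacity linear-probe table. MaxElements must be a power of two;
// the internal capacity is doubled so the table never exceeds 50% load.
// Key 0 is reserved to mean "empty slot", and there is no resize/delete.
template <unsigned MaxElements>
class FixedHashTable {
    static const unsigned kCapacity = MaxElements * 2;
    static const unsigned kMask = kCapacity - 1;
    unsigned keys_[kCapacity];
    void*    values_[kCapacity];

    static unsigned hash(unsigned k) {   // MurmurHash3 finalizer mix
        k ^= k >> 16; k *= 0x85ebca6bu;
        k ^= k >> 13; k *= 0xc2b2ae35u;
        k ^= k >> 16;
        return k;
    }

public:
    FixedHashTable() { clear(); }

    void clear() { std::memset(keys_, 0, sizeof(keys_)); }  // the one reset op

    inline void insert(unsigned key, void* value) {  // key must be non-zero
        unsigned i = hash(key) & kMask;
        while (keys_[i] != 0 && keys_[i] != key) i = (i + 1) & kMask;
        keys_[i] = key;
        values_[i] = value;
    }

    inline void* find(unsigned key) const {
        unsigned i = hash(key) & kMask;
        while (keys_[i] != 0) {
            if (keys_[i] == key) return values_[i];
            i = (i + 1) & kMask;
        }
        return 0;  // not found
    }
};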
It is faster than my previous red/black tree implementation. The only change I might make is to revisit the hashing scheme to see if there is something in the source keys that would make for a cheaper hash.
Billy ONeal suggested that, given a small number of elements (1024), a simple linear search over a fixed array would be faster. I followed his advice and implemented one for a side-by-side comparison. On my target hardware (roughly a first-generation iPhone) the hash table outperformed the linear search two to one, and at smaller sizes (256 elements) the hash table was still superior. Of course these values are hardware dependent; cache line sizes and memory access speeds are terrible in my environment. Still, others looking for a solution to this problem would be smart to follow his advice, try both, and profile.

c++ discrete distribution sampling with frequently changing probabilities

Problem: I need to sample from a discrete distribution built from weights, e.g. {w1, w2, w3, ...}, and thus from the probability distribution {p1, p2, p3, ...} where pi = wi / (w1 + w2 + ...).
Some of the wi change very frequently, but only a very low proportion of all the wi at a time. The distribution nevertheless has to be renormalised every time this happens, which is why I believe the alias method does not work efficiently here: the whole distribution would have to be rebuilt from scratch after every change.
The method I am currently considering is a binary tree (the heap method), where all the wi are stored at the lowest level, each pair's sum one level up, and so on. The sum of all of them sits at the top and is also the normalisation constant. Updating the tree after a change to some wi then takes log(n) operations, and drawing a sample from the distribution takes the same.
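A sketch of that heap method as a flat array (leaf i lives at tree_[size_ + i]; names are illustrative, and <random> stands in for whatever uniform generator you use):

#include <random>
#include <vector>

// Complete binary tree in an array: leaves hold the weights, every internal
// node the sum of its two children, the root (tree_[1]) the normalisation
// constant. Both update and sampling walk one root-leaf path: O(log n).
class SumTree {
public:
    explicit SumTree(std::size_t n) {
        size_ = 1;
        while (size_ < n) size_ *= 2;     // round up to a power of two
        tree_.assign(2 * size_, 0.0);
    }

    void set(std::size_t i, double w) {   // change weight i, repair sums above
        i += size_;
        tree_[i] = w;
        for (i /= 2; i >= 1; i /= 2)
            tree_[i] = tree_[2 * i] + tree_[2 * i + 1];
    }

    std::size_t sample(std::mt19937& gen) const {
        double x = std::uniform_real_distribution<double>(0.0, tree_[1])(gen);
        std::size_t i = 1;
        while (i < size_) {               // descend: left subtree first
            if (x < tree_[2 * i]) i = 2 * i;
            else { x -= tree_[2 * i]; i = 2 * i + 1; }
        }
        return i - size_;                 // leaf offset = weight index
    }

private:
    std::size_t size_;
    std::vector<double> tree_;
};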
Question:
Q1. Do you have a better idea on how to achieve it faster?
Q2. The most important part: I am looking for a library which has already done this.
Explanation: I did this myself several years ago by building a heap structure in a vector, but since then I have learned many things, including discovering libraries ( :) ) and containers such as map... Now I need to rewrite that code with more functionality, and I want to get it right this time:
So Q2.1: is there a nice way to make a C++ map ordered and searched not by index but by the cumulative sum of its elements (this is how we sample, right?..)? (That is my current theory of how I would like to do it, but it doesn't have to be this way...)
Q2.2 Maybe there is an even nicer way to do the same? This problem seems so common that I am very surprised I could not find a library that would do it for me...
Thank you very much, and I am very sorry if this has been asked in some other form, please direct me towards it, but I have spent a good while looking...
-z
Edit: There is a possibility that I might need to remove or add elements as well, but I think I can avoid that if it makes a huge difference, leaving only changes to the values of the weights.
Edit2: weights are reals in general, I would have to think if I could make them integers...
I would actually use a hash set of strings (std::unordered_set in more recent C++, though you might need to implement your own). For each i, put in wi elements with the values "wi_1", "wi_2", ... through "wi_[wi]" (that is, wi elements whose names start with "wi_").
When you need to sample, pick an element at random using a uniform distribution. If you picked w5_*, say you picked element 5. Because of the number of elements in the hash, this will give you the distribution you were looking for.
Now, when wi changes from A to B, just add B-A elements to the hash (if B>A), or remove the last A-B elements of wi (if A>B).
Adding new elements and removing old elements is trivial in this case.
Obviously the problem is 'pick an element at random'. If your hash is a closed (open-addressing) hash, you pick an array cell at random; if it's empty, just pick again. If you keep your hash 3 or 4 times larger than the total sum of weights, your complexity will be pretty good: O(1) for retrieving a random sample, O(|A-B|) for modifying the weights.
Another option, since only a small part of your weights change, is to split the weights into two - the fixed part and the changed part. Then you only need to worry about changes in the changed part, and the difference between the total weight of changed parts and the total weight of unchanged parts. Then for the fixed part your hash becomes a simple array of numbers: 1 appears w1 times, 2 appears w2 times, etc..., and picking a random fixed element is just picking a random number.
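To illustrate the 'pick an element at random' step on a closed hash (integer weights assumed; slots holds an element index per occupied cell or -1 for empty, and <random> stands in for any uniform generator):

#include <random>
#include <vector>

// With the table kept ~3-4x larger than the total weight, roughly one in
// three or four cells is occupied, so the expected number of re-draws is a
// small constant: O(1) per sample.
int randomElement(const std::vector<int>& slots, std::mt19937& gen) {
    std::uniform_int_distribution<std::size_t> pick(0, slots.size() - 1);
    for (;;) {
        int s = slots[pick(gen)];
        if (s != -1) return s;  // occupied cell: element index is the sample
    }                           // empty cell: just draw again
}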
Updating your normalisation factor when you change a value is trivial. This might suggest an algorithm.
w_sum = w_sum_old - w_i_old + w_i_new;
If you leave p_i as a computed property, p_i = w_i / w_sum, you avoid recalculating the entire p_i array, at the cost of computing p_i every time it is needed. You would, however, be able to update many statistical properties without recalculating the entire sum:
expected_something = (something_1 * w_1 + something_2 * w_2 + ...) / w_sum;
With a bit of algebra you can update expected_something by subtracting the contribution with the old weight and add the contribution with the new weight, multiplying and dividing with the normalization factors as required.
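For instance, keeping the unnormalised numerator as a running value (names match the formulas above; this is a sketch of the algebra, not library code):

weighted_sum += something_i * (w_i_new - w_i_old);  // swap old contribution for new
w_sum += w_i_new - w_i_old;                         // updated normalisation factor
expected_something = weighted_sum / w_sum;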
If, during sampling, you keep track of which outcomes are part of the sample, it would be possible to propagate the probability updates to the generated sample. Would this make it possible for you to update, rather than recalculate, values related to the sample? I think a bitmap could provide an efficient way to store an index of which outcomes were used to build the sample.
One way of storing the probabilities together with the sums is to start with all N probabilities, store the N/2 pairwise sums in the next N/2 positions, then the N/4 sums of those, and so on. Where the sums are located can, obviously, be calculated in O(1) time. This data structure is a sort of heap, but upside down.

Implementation of a locally ordered set or priority queue?

I have a rather large set of objects that represent numbers, and I want to select such numbers according to a custom ordering. This ordering includes several criteria, such as the type of their representation (some numbers are represented by an interval), their integrality, and ultimately their value. These numbers are shared throughout the program (shared pointers) and there is nothing I can do about that.
However, the elements' properties can change at any time, so the order can change without my being able to notify the container. For example, some operations require a refinement of a number that is represented by an interval, and during this refinement the exact value may be found. Thereby the number changes from the interval representation to a rational number, possibly even an integer. Due to the shared instance, this change immediately propagates to the number in the container and breaks the ordering (and I don't even know which number changed). This totally breaks std::set.
So what I'd like to have is a container that tries to be sorted but does not rely on it. Whenever an operation detects an incorrect ordering, that ordering should be corrected locally. For example, insert would insert the element (using binary search) and always check whether the ordering of the current element is correct with respect to its neighbors.
I'd be willing to accept that "give me the smallest element" would then be only "give me a small element" and that find or remove would have linear complexity: I only need front, insert and remove_front to be particularly efficient.
Is there any implementation that does something like this?
How would you implement this?
If you are looking for an algorithm in the standard library, you should take a look at:
std::make_heap
std::pop_heap
std::push_heap
In <algorithm>. They might fit your need, and even if they don't, I'm quite sure you will find what you are looking for in some kind of heap structure. Which one will probably depend on how your code is structured, how often you expect a value to change, etc.
In short:
A heap is a data structure in which it is fast to find and extract the smallest (or largest) element. For most heaps it is also possible to restructure the heap in linear time or better. You could start out from this page on Wikipedia if you want to learn more about heaps.
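A minimal sketch of those three algorithms in use (the values are arbitrary):

#include <algorithm>
#include <vector>

void demo() {
    int data[] = {3, 1, 4, 1, 5};
    std::vector<int> v(data, data + 5);

    std::make_heap(v.begin(), v.end());   // O(n); largest element at v.front()

    v.push_back(9);                       // append a new value...
    std::push_heap(v.begin(), v.end());   // ...and sift it up, O(log n)

    std::pop_heap(v.begin(), v.end());    // move the max to v.back()...
    v.pop_back();                         // ...then discard it
}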

Perfect hash function for a set of integers with no updates

In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
    // Return whether iTest appears in the set of numbers.
}
The number list is known at application startup (but is not always the same between two instances of the application) and will not change, or be added to, for the lifetime of the program. The integers themselves may be large and span a large range, so a vector<bool> is not efficient. Performance is an issue, as the function sits in a hot spot. I have heard about perfect hashing but could not find any good advice on it. Any pointers would be helpful. Thanks.
p.s. I'd ideally like if the solution isn't a third party library because I can't use them here. Something simple enough to be understood and manually implemented would be great if it were possible.
I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the Bloom filter is not part of the standard library, so you'll have to implement it yourself. However, it turns out to be quite a simple structure!
A Bloom filter is a data structure specialized for one question: is this element part of the set? It answers with an incredibly tight memory requirement, and quite fast too.
The slight catch is that the answer is... special: is this element part of the set?
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several parameters) to lower the false-positive probability, but...
What is really interesting for you is that whenever it answers No, you have the guarantee that the element isn't part of the set.
As such, a Bloom filter is ideal as a doorman for a binary tree or a hash map. Carefully tuned, it will let only very few false positives pass. For example, gcc uses one.
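A minimal sketch of such a filter for integer keys (M, K and the mixing constants are illustrative and would need tuning for a target false-positive rate):

#include <bitset>
#include <cstddef>

// K hash probes into an M-bit array. insert() sets K bits; a lookup answers
// "definitely not present" or "maybe present".
template <std::size_t M, int K>
class BloomFilter {
    std::bitset<M> bits_;

    static std::size_t probe(unsigned key, unsigned seed) {
        key ^= seed;                        // derive K different hashes
        key *= 0x85ebca6bu; key ^= key >> 13;
        key *= 0xc2b2ae35u; key ^= key >> 16;
        return key % M;
    }

public:
    void insert(unsigned key) {
        for (int k = 0; k < K; ++k)
            bits_.set(probe(key, 0x9e3779b9u * (k + 1)));
    }

    bool maybeContains(unsigned key) const {
        for (int k = 0; k < K; ++k)
            if (!bits_.test(probe(key, 0x9e3779b9u * (k + 1)))) return false;
        return true;   // maybe: confirm against the backing map if needed
    }
};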
What comes to my mind is gperf. However, it is based on strings, not numbers; still, part of the calculation can be tweaked to use numbers as input for the hash generator.
integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The dot-product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time there; find something that is FAST for your datatype and that offers an adjustable SEED for hashing. A good combination there is better than the alternative of growing the hash table.
At 54:30 the professor draws a picture of the standard way of doing a perfect hash. Minimal perfect hashing is beyond this lecture. (Good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
With std::map you get very good performance in 99.9% of scenarios. If your hot spot looks up the same iTest value(s) multiple times, combine the map result with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.
A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.
For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
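Sketched against the IsInList signature from the question (g_numbers and PrepareList are illustrative names):

#include <algorithm>
#include <vector>

std::vector<int> g_numbers;   // filled once at application load

void PrepareList() {
    std::sort(g_numbers.begin(), g_numbers.end());   // one-time O(n log n)
}

bool IsInList(int iTest) {
    // O(log n) membership test against the sorted list
    return std::binary_search(g_numbers.begin(), g_numbers.end(), iTest);
}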
It's not necessary or practical to aim for mapping N distinct, randomly dispersed integers to N contiguous buckets - i.e. a minimal perfect hash - the important thing is to identify an acceptable ratio. To do this at run time, you can start by configuring a worst-acceptable ratio (say 1 to 20) and a no-point-being-better-than-this ratio (say 1 to 4), then randomly vary a fast-to-calculate hash algorithm (e.g. by changing the prime numbers used) to see how easily you can meet increasingly difficult ratios. For the worst-acceptable ratio you don't time out; instead you fall back on something slower but reliable (a container, or displacement lists to resolve collisions). Then allow a second or ten (configurable) for each X% improvement until you either fail at that ratio or reach the no-point-being-better ratio...
Just so everyone's clear, this works for inputs known only at run time with no useful patterns known beforehand, which is why different hash functions have to be trialled or actively derived at run time. It is not acceptable to simply say "integer inputs form a hash", because there are collisions once they are %-ed into any sane array size. But you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers into a packed array, so there's little memory wasted for large objects.
After working with this for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in a unique mapping - a perfect hash.
Let's say the values ranged from L to H in the array. This yields a range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator with moduli from H down to L + 1, looking for a mapping that kept the values unique but had a smaller range.
In your case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fallback position... a binary search, or a map, or something like that, so that you can guarantee the container will work in all cases.
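A sketch of that search (scanning upward from the element count, which finds the smallest collision-free table; scanning down from R as described works the same way):

#include <set>
#include <vector>

// Returns the smallest modulus m such that key % m is unique for all keys,
// or 0 if none at most `range` exists (the fall-back case above).
unsigned findPerfectModulus(const std::vector<unsigned>& keys, unsigned range) {
    for (unsigned m = static_cast<unsigned>(keys.size()); m <= range; ++m) {
        std::set<unsigned> seen;
        bool unique = true;
        for (std::size_t i = 0; i < keys.size() && unique; ++i)
            unique = seen.insert(keys[i] % m).second;
        if (unique) return m;   // a table of size m indexed by key % m works
    }
    return 0;
}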
A trie, or perhaps a van Emde Boas tree, might be a better bet for creating a space-efficient set of integers whose lookup time is constant in the number of objects in the data structure, assuming that even a std::bitset would be too large.