My overall goal is to create a round-based game. So I need some seemingly randomly generated numbers, for example in fights, but also on several other occasions. In any case, I want those numbers to be reliably the same whenever certain parameters are the same.
One of those parameters will be an integer seed: it will take the value of a generalRandomSeed that changes every round; but I need more parameters, like the IDs of the attacker and defender. It would be very convenient to call this function with the parameters (maybe all combined in a vector), like getRandom(generalRandomSeed, id1, id2).
So, in the end, I am hoping for a function that takes one or more ints as parameters (ideally, a vector) and returns a single integer: int getRandom(std::vector<int> parameters);
I cannot quite figure out how to solve this problem; if it were only about one parameter, I might just create a new mt19937 every time, seeded with my generalRandomSeed.
To explain Maarten's issues with std::seed_seq (and to make sure I understand things correctly!) I'll attempt an answer.
The main issue I'd have with suggesting it (and it sounds like Maarten has the same one) is that the algorithm used by std::seed_seq isn't defined by the standard. Hence it's possible for other standard libraries (e.g. if you use a different compiler, or even a different version of the same compiler) to change the implementation, and you'd get different values back from the same inputs. Depending on your use case, this lack of stability may or may not matter. That's a domain-specific issue you'd have to decide on, as you haven't specified it.
The suggested cryptographic approach would be to use something like an HKDF, where you use a cryptographic hash (like SHA-256) to extract the entropy (i.e. "randomness") from your input values in a deterministic manner (e.g. taking care of endianness), and then use another cryptographic primitive to "stretch" this entropy out to produce your random output. If you're not familiar with the cryptographic world, these things can be awkward, as there's a lot of terminology.
As a minor point, I'd suggest against using the MT19937 PRNG, as it's relatively expensive to seed and it sounds like you'd be doing this a lot. There are other algorithms that are much cheaper to seed; I personally like Sebastiano Vigna's xoshiro256 family, or you could use Melissa O'Neill's PCG family. O'Neill's site is a very useful resource if you're new to PRNGs.
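To illustrate how cheap such a generator can be, here is the public-domain splitmix64 step (Vigna's algorithm, often used to seed the xoshiro256 generators); a minimal sketch, one multiply/xor-shift chain per 64-bit output:

#include <cstdint>

// splitmix64: cheap to seed (the state is just one 64-bit word)
// and cheap to step. Public-domain algorithm by Sebastiano Vigna.
uint64_t splitmix64(uint64_t& state) {
    uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}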
That said, if you're going the HKDF route, you may as well just use the "expansion" step as a PRNG, as it will directly produce uniform values. These can easily be transformed into bounded integers or uniform floats: Melissa has a good review for ints here, and Vigna describes a conventional transform for binary IEEE-754 floats at the bottom of here (so it is valid for float and double on most common CPUs).
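As a sketch of those two transforms (assuming some uniform 64-bit generator output x; the 128-bit multiply uses a GCC/Clang extension and the hex float literal needs C++17):

#include <cstdint>

// Uniform double in [0, 1) from the top 53 bits, as Vigna describes
// for IEEE-754 binary doubles.
double toUnitDouble(uint64_t x) {
    return (x >> 11) * 0x1.0p-53;
}

// Bounded integer in [0, n) via multiply-and-take-the-high-word
// (one of the methods O'Neill reviews). Slightly biased unless a
// rejection step is added, which this sketch omits.
uint64_t toBounded(uint64_t x, uint64_t n) {
    return static_cast<uint64_t>((static_cast<__uint128_t>(x) * n) >> 64);
}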
update: The MT19937 would seem to be difficult to predict given its enormous state space, but prediction has in fact been shown to be almost trivial. For example, searching for "predicting mt19937" leads to https://github.com/kmyk/mersenne-twister-predictor, which will give you the state of the RNG from 624 consecutive 32-bit integer draws. Using a CSPRNG would protect you from this, and that is what using the output of a HKDF would give you. PCG makes prediction more difficult than the Mersenne Twister does, but given that it's optimised for speed, it can't expend too much work on this.
Apparently, it was indeed a seed_seq that I was looking for.
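For future readers, a minimal sketch of that solution (with the cross-implementation stability caveat from the answer above still applying):

#include <random>
#include <vector>

// Derive one deterministic integer from several parameters.
// Caveat: std::seed_seq's algorithm is unspecified by the standard,
// so different standard libraries may produce different values.
int getRandom(std::vector<int> parameters) {
    std::seed_seq seq(parameters.begin(), parameters.end());
    std::mt19937 engine(seq);
    return static_cast<int>(engine()); // may be negative after the cast
}

// Usage: int r = getRandom({generalRandomSeed, id1, id2});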
Related
I'm trying to create a general memoizer for multiple, arbitrary functions.
For each function std::function<ReturnType(Args...)> that we want to memoize, we keep an unordered_map<Args..., ReturnType> (I'm keeping things simple on purpose).
The big problem comes when our memoized function has some really big arguments Args...: for example, let's suppose that our function sorts a vector of 10 million numbers and then returns the sorted vector, so something like std::function<vector<double>(vector<double>)>.
As you can imagine, after inserting fewer than 100 vectors, we have already filled 8 GB of memory. Notice that this may come from the combination of huge vectors and the memory required by the sorting algorithm (I didn't investigate the cause).
So what if, instead of the structure described above, we define unordered_map<UUID(Args...), ReturnType> (where UUID = Universally Unique Identifier)? We would have to relax the deterministic guarantee (so we might occasionally return a wrong result), but with a very low probability.
The problem is that since I have never used UUIDs, I don't know whether there are implementations suitable for this application.
So my questions are:
Is there a better solution than UUIDs for this problem?
Which UUID implementation is best suited to this problem?
Is Boost.Uuid a possible candidate?
Unfortunately, this approach could work for Args... but not for ReturnType, so is there also a solution for the memoized result?
Notice that the UUID generated for an object x should be the same even across different runs and machines.
Notice that if we get the same UUID for two different objects (and so return the wrong value) only with a really low probability, that could be acceptable... let's say this would be a "probabilistic memoizer".
I know that this application doesn't make much sense in a memoization context (what are the odds that a user asks twice to sort the same 10-million-element vector?), but it's time- and memory-expensive (so good for benchmarking and for illustrating the memory problem stated above), so please don't whip and crucify me for this being an absurd memoization application.
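To make the idea concrete, here is a minimal sketch of such a probabilistic memoizer for the sorting example; I've used a 64-bit FNV-1a content hash in place of a UUID (the function names are mine, chosen for illustration). Since the key is computed from the bytes of the input, it is the same across runs, and across machines with the same endianness and double representation. Note that this only shrinks the key side; the results themselves are still stored in full.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// FNV-1a over the raw bytes of the vector. A 64-bit key gives a
// small but nonzero collision probability, as the question accepts.
uint64_t digest(const std::vector<double>& v) {
    uint64_t h = 0xcbf29ce484222325ULL; // FNV offset basis
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(v.data());
    for (std::size_t i = 0; i < v.size() * sizeof(double); ++i)
        h = (h ^ p[i]) * 0x100000001b3ULL; // FNV prime
    return h;
}

std::vector<double> memoizedSort(const std::vector<double>& input) {
    static std::unordered_map<uint64_t, std::vector<double>> cache;
    uint64_t key = digest(input);
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second; // possibly a wrong result on a collision
    std::vector<double> result = input;
    std::sort(result.begin(), result.end());
    cache.emplace(key, result);
    return result;
}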
Identifying any object is easy. The address is "object identity" in C++. This is also the reason that even empty classes cannot have zero size.
Now, what you want is value equivalence. That's strictly not in the language domain. It's solidly in the application/library logic domain.
You should consider using something like Boost.Flyweight. It has precisely this facility, and makes it "easy" to customize the equivalence semantics for your types.
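A minimal sketch of Boost.Flyweight in action (equal values share a single stored representative, so comparison reduces to comparing handles):

#include <boost/flyweight.hpp>
#include <cassert>
#include <string>

int main() {
    boost::flyweight<std::string> a("hello");
    boost::flyweight<std::string> b("hello");
    assert(a == b);               // value equivalence, cheap to test
    assert(&a.get() == &b.get()); // both refer to the same shared object
}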
I am new to hashing in general and also to the STL world, and I came across the new std::unordered_set and the SGI hash_set, both of which use the hasher hash. I understand that to get a good load factor, you might need to write your own hash function, and I have been able to write one.
However, I am trying to dig into how the original default hash functions are written.
My questions are:
1) How is the original default HashFcn written; more concretely, how is the hash generated?
Is it based on some pseudo-random number? Can anyone point me to a header file (I am a bit lost with the documentation) where I can look up how the hasher hash is implemented?
2) How does it guarantee that you will get the same key each time?
Please let me know if I can make my questions clearer in any way.
In the version of gcc that I happen to have installed here, the required hash functions are in /usr/lib/gcc/i686-pc-cygwin/4.7.3/include/c++/bits/functional_hash.h
The hashers for integer types are defined using the macro _Cxx_hashtable_define_trivial_hash. As you might expect from the name, this just casts the input value to size_t.
This is how gcc does it. If you're using gcc then you should have a similarly-named file somewhere. If you're using a different compiler then the source will be somewhere else. It is not required that every implementation uses a trivial hash for integer types, but I suspect that it is very common.
It's not based on a random number generator, and hopefully it's now pretty obvious to you how this function guarantees to return the same key for the same input every time! The reason for using a trivial hash is that it's as fast as it gets. If it gives a bad distribution for your data (because your values tend to collide modulo the number of buckets) then you can either use a different, slower hash function or a different number of buckets (std::unordered_set doesn't let you specify the exact number of buckets, but it does let you set a minimum). Since library implementers don't know anything about your data, I think they will tend not to introduce slower hash functions as the default.
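A stand-alone equivalent of such a trivial integer hash might look like this (a sketch, not libstdc++'s actual code):

#include <cstddef>
#include <unordered_set>

// The value is its own hash: deterministic and as fast as it gets.
struct TrivialIntHash {
    std::size_t operator()(int x) const noexcept {
        return static_cast<std::size_t>(x);
    }
};

// Behaves like std::unordered_set<int> with the default hasher.
std::unordered_set<int, TrivialIntHash> numbers;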
A hash function must be deterministic -- i.e., the same input must always produce the same result.
Generally speaking, you want the hash function to produce all outputs with about equal probability for arbitrary inputs (but while desirable, this is not mandatory -- and for any given hash function, there will always be arbitrarily many inputs that produce identical outputs).
Generally speaking, you want the hashing function to be fast, and to depend (to at least some degree) on the entirety of the input.
A fairly frequently seen pattern is: start with some semi-random initial value. Combine one byte of input with the current value. Do something that will move the bits around (multiplication, rotation, etc.). Repeat for all bytes of the input.
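The well-known FNV-1a hash is one concrete instance of exactly this pattern (a sketch):

#include <cstddef>
#include <cstdint>

// Start from a fixed "semi-random" basis, fold in one byte at a
// time, and move the bits around with a multiply by a large prime.
uint64_t fnv1a(const void* data, std::size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint64_t h = 0xcbf29ce484222325ULL; // offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= p[i];             // combine one byte with the current value
        h *= 0x100000001b3ULL; // FNV prime: scrambles the bits
    }
    return h;
}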
I am using the new C++11 <random> header in my application, and in one class I need random numbers with different distributions in different methods. I just put a random engine std::default_random_engine as a class member, seed it in the class constructor with std::random_device, and use it for different distributions in my methods. Is it OK to use the random engine this way, or should I declare a different engine for every distribution I use?
It's ok.
Reasons to not share the generator:
threading (standard RNG implementations are not thread safe)
determinism of random sequences:
If you wish to be able (for testing/bug hunting) to control the exact sequences generated, you will likely have fewer troubles by isolating the RNGs used, especially when not all RNG consumption is deterministic.
You should be careful when using one pseudo random number generator for different random variables, because in doing so they become correlated.
Here is an example: If you want to simulate Brownian motion in two dimensions (e.g. x and y) you need randomness in both dimensions. If you take the random numbers from one generator (noise()) and assign them successively
while (simulating) {
    x = x + noise();
    y = y + noise();
}
then the variables x and y become correlated, because the quality guarantees of pseudo-random number generators apply to the sequence as a whole, i.e. if you take every single number generated, not only every second one as in this example. Here, the Brownian particles could move in the positive x and y directions with a higher probability than in the negative directions, and thus introduce an artificial drift.
For two further reasons to use different generators look at sehe's answer.
MosteM's answer isn't correct. It is fine to do this as long as you want the draws from the distributions to be independent. If for some reason you need exactly the same random input for draws from different distributions, then you may want different RNGs. If you want correlation between two random variables, it's better to build them from a common random variable using a mathematical principle: e.g., if A and B are independent normal(0,1), then A and a*A + sqrt(1-a^2)*B are both normal(0,1) with correlation a.
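A sketch of that construction with <random> (the helper name is mine):

#include <cmath>
#include <random>
#include <utility>

// Returns a pair of standard normals with correlation a, built from
// two independent draws: A and a*A + sqrt(1 - a^2)*B.
std::pair<double, double> correlatedPair(std::mt19937& gen, double a) {
    std::normal_distribution<double> normal(0.0, 1.0);
    double A = normal(gen);
    double B = normal(gen);
    return {A, a * A + std::sqrt(1.0 - a * a) * B};
}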
EDIT: I found a great resource on the C++11 random library which may be useful to you.
There is no reason not to do it like this. Depending on which random generator you use, the period is quite huge (2^19937 - 1 in the case of the Mersenne Twister), so in most cases you won't even get through one period during the execution of your program. And even if you did, it is not a given that exhausting the period with all distributions sharing one generator is worse than having 3 generators each going through 1/3 of their periods.
In my programs, I use one generator per thread, and it works fine. I think that's the main reason the generator and the distributions were split up in C++11: if you weren't allowed to share a generator, there would be no benefit in keeping them separate, since you'd need one generator for each distribution anyway.
I'm thinking about generating random strings without any duplicates.
My first thought was to use a binary tree: create it, and look for duplicates in the tree, if any.
But this may not be very effective.
My second thought was to use an MD5-like hash method that creates messages based only on time, but this may introduce another problem: different machines have different time accuracy.
And on a modern processor, more than one string could be created within a single timestamp.
Is there any better way to do this?
Generate N sequential strings, then do a random shuffle to pull them out in random order. If they need to be unique across separate generators, mix a unique generator ID into the string.
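A sketch in C++ (the string format and seeding are arbitrary choices):

#include <algorithm>
#include <random>
#include <string>
#include <vector>

// N distinct strings by construction; shuffling provides the random
// order, so no duplicate checking is needed at all.
std::vector<std::string> uniqueRandomStrings(int n,
                                             const std::string& generatorId) {
    std::vector<std::string> out;
    out.reserve(n);
    for (int i = 0; i < n; ++i)
        out.push_back(generatorId + "-" + std::to_string(i));
    std::mt19937 gen(std::random_device{}());
    std::shuffle(out.begin(), out.end(), gen);
    return out;
}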
Beware of MD5: there's no guarantee that two different strings won't generate the same hash.
As for your problem, it depends on a number of constraints: are the strings short or long? Do they have to be meaningful? Etc. Two solutions off the top of my head:
1) Generate UUIDs, then turn them into strings with a binary representation or a base64 algorithm.
2) Simply generate random strings and put them in a searchable structure (HashMap) so that you can check very quickly (O(1)-O(log n)) whether a generated string already has a duplicate, in which case it is discarded.
A tree probably won't be the most efficient, especially for insertions, as it will have to constantly re-balance itself (a somewhat "expensive" operation).
I'd recommend using a HashSet-type data structure. The hashing algorithm should already be quite efficient (much more so than something like MD5), and all operations are constant-time on average. Insert all your strings into the set; if you create a new string, check whether it already exists in the set.
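In C++ terms that is roughly the following sketch, where std::unordered_set plays the role of the HashSet and makeRandomString() is a placeholder for whatever generator you use:

#include <string>
#include <unordered_set>

std::string makeRandomString(); // placeholder: your generator here

// Keep generating until the candidate is new; each membership test
// is O(1) on average.
std::string freshString(std::unordered_set<std::string>& seen) {
    for (;;) {
        std::string candidate = makeRandomString();
        if (seen.insert(candidate).second) // true if newly inserted
            return candidate;
    }
}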
It sounds like you want to generate a uuid? See http://docs.python.org/library/uuid.html
>>> import uuid
>>> uuid.uuid4()
UUID('dafd3cb8-3163-4734-906b-a33671ce52fe')
You should specify which programming language you're coding in. For instance, in Java this will work nicely: UUID.randomUUID().toString(). UUID identifiers are unique in practice, as stated on Wikipedia:
The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. In this context the word unique should be taken to mean "practically unique" rather than "guaranteed unique". Since the identifiers have a finite size it is possible for two differing items to share the same identifier. The identifier size and generation process need to be selected so as to make this sufficiently improbable in practice.
A binary tree is probably better than usual here - no rebalancing necessary, because your strings are random, and it's on random data that binary trees work their best. However, it's still O(log(n)) for lookup and addition.
But maybe more efficient, if you know in advance how many random strings you'll need and don't mind a little probability in the mix, is to use a bloom filter.
Bloom filters give an efficient, probabilistic set-membership test with memory requirements as low as one bit per element saved in the set. Basically, a Bloom filter can say with 100% certainty that an element does not belong to a set, but only with high (not quite 100%) certainty that an element is in a set. In your case, throwing out an extra candidate or two shouldn't hurt at all, so the probabilistic nature isn't a problem.
Bloom filters are also relatively unique in that they can test for set membership in constant time.
For a while, I listed treaps here, but that's silly - they do a lot of operations in O(log(n)) again, and would only be relevant if your data isn't truly random.
If you don't need your strings to be saved in order for some reason (and it sounds like you probably don't), a traditional hash table is a good way to go. They like to know how big your final dataset will be in advance (to avoid slow hash table resizes), but they too are constant time for insertion and lookup.
http://stromberg.dnsalias.org/svn/bloom-filter/trunk/
In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
    // Return whether iTest appears in a set of numbers.
}
The number list is known at app load-up (but is not always the same between two instances of the application) and will not change (or be added to) throughout the program's run. The integers themselves may be large and have a large range, so it is not efficient to have a vector<bool>. Performance is an issue, as the function sits in a hot spot. I have heard about perfect hashing but could not find any good advice. Any pointers would be helpful. Thanks.
p.s. I'd ideally like the solution not to be a third-party library, because I can't use them here. Something simple enough to be understood and implemented manually would be great, if possible.
I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!
A Bloom filter is a data structure specialized in answering one question: is this element part of the set? It does so with an incredibly tight memory requirement, and quite fast too.
The slight catch is that the answer is... special. Is this element part of the set?
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several parameters) to lower the probability, but...
What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.
As such, a Bloom filter is ideal as a doorman for a binary tree or a hash map. Carefully tuned, it will let only very few false positives through. For example, gcc uses one.
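A minimal sketch of the structure (the bit-array size, probe count, and mixing constant are tuning parameters; since std::hash on integers is often trivial, a real filter would want a stronger mixer):

#include <cstddef>
#include <functional>
#include <vector>

// Bloom filter over ints: k probed bits per element in an m-bit
// array. "false" is definite; "true" means "maybe".
class BloomFilter {
    std::vector<bool> bits;
    std::size_t k;

    std::size_t probe(int value, std::size_t i) const {
        // Double hashing: derive k probe positions from two hashes.
        std::size_t h1 = std::hash<int>{}(value);
        std::size_t h2 = h1 * 0x9E3779B97F4A7C15ULL + 1;
        return (h1 + i * h2) % bits.size();
    }

public:
    BloomFilter(std::size_t m, std::size_t numProbes)
        : bits(m, false), k(numProbes) {}

    void insert(int value) {
        for (std::size_t i = 0; i < k; ++i)
            bits[probe(value, i)] = true;
    }

    bool maybeContains(int value) const {
        for (std::size_t i = 0; i < k; ++i)
            if (!bits[probe(value, i)])
                return false; // definitely not in the set
        return true;          // possibly in the set
    }
};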
What comes to my mind is gperf. However, it is based on strings, not numbers, although part of the calculation can be tweaked to use numbers as input for the hash generator.
integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The dot-product hash function is demonstrated, since it has an elegant proof. Most hash functions are like voodoo black magic; don't waste time there. Find something that is FAST for your datatype and that offers an adjustable SEED for hashing. A good combination there is better than the alternative of growing the hash table.
At 54:30, the professor draws a picture of a standard way of doing perfect hashing. Minimal perfect hashing is beyond this lecture. (Good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
With std::map you get very good performance in 99.9% of scenarios. If your hot spot looks up the same iTest value(s) multiple times, combine the map result with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.
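Since a 4 GiB automatic array really would overflow the stack, the practical form of the same idea is a heap-allocated bitmap: one bit per possible value, about 512 MiB for the full 32-bit range (this sketch assumes a 64-bit size_t):

#include <climits>
#include <cstddef>
#include <vector>

// std::vector<bool> is bit-packed: 2^32 bits is roughly 512 MiB,
// allocated on the heap rather than the stack.
std::vector<bool> present(static_cast<std::size_t>(UINT_MAX) + 1, false);

bool IsInList(int iTest) {
    return present[static_cast<unsigned int>(iTest)];
}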
A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.
For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
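With the standard library there is nothing to translate; a sketch:

#include <algorithm>
#include <vector>

std::vector<int> numbers; // filled and sorted once at app load-up

// O(log N) per query on the sorted list.
bool IsInList(int iTest) {
    return std::binary_search(numbers.begin(), numbers.end(), iTest);
}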
It's not necessary or practical to aim for mapping N distinct, randomly dispersed integers to N contiguous buckets - i.e. a minimal perfect hash - the important thing is to identify an acceptable ratio. To do this at run time, you can start by configuring a worst-acceptable ratio (say 1 to 20) and a no-point-being-better-than-this ratio (say 1 to 4), then randomly vary a fast-to-calculate hash algorithm (e.g. changing the prime numbers used) to see how easily you can meet increasingly difficult ratios. If you can't meet even the worst-acceptable ratio in time, you fall back on something slower but reliable (a container, or displacement lists to resolve collisions). Then allow a second or ten (configurable) for each X% improvement until you can't succeed at that ratio, or you reach the no-point-being-better ratio.
Just so everyone's clear, this works for inputs only known at run time, with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptable to simply say "integer inputs form a hash", because there are collisions once they're %-ed into any sane array size. But you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.
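A sketch of that run-time trialing (the multiplicative hash form, the fixed seed, and the budget handling are illustrative choices, not a fixed recipe):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Try random odd multipliers until one maps every key into the table
// without collisions, or the attempt budget runs out. Returns the
// multiplier, or 0 on failure (then fall back on something slower
// but reliable).
uint64_t findCollisionFreeMultiplier(const std::vector<uint32_t>& keys,
                                     std::size_t tableSize,
                                     int maxAttempts) {
    std::mt19937_64 gen(12345); // fixed seed: reproducible trials
    std::vector<bool> used(tableSize);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        uint64_t mul = gen() | 1; // odd multiplier
        std::fill(used.begin(), used.end(), false);
        bool ok = true;
        for (uint32_t key : keys) {
            std::size_t slot =
                static_cast<std::size_t>((key * mul) >> 32) % tableSize;
            if (used[slot]) { ok = false; break; }
            used[slot] = true;
        }
        if (ok)
            return mul;
    }
    return 0;
}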
Original Question
After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in unique, perfect hashing.
Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In your case, you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fallback position... binary search, or map, or something like that, so that you can guarantee the container will work in all cases.
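A sketch of that modulus search (scanning moduli from small to large, so the first hit is the smallest collision-free range):

#include <cstddef>
#include <vector>

// Smallest modulus m <= maxM under which all values stay distinct,
// i.e. v % m is collision-free. Returns 0 if none is found.
std::size_t smallestUniqueModulus(const std::vector<unsigned>& values,
                                  std::size_t maxM) {
    for (std::size_t m = values.size(); m <= maxM; ++m) {
        std::vector<bool> used(m, false);
        bool ok = true;
        for (unsigned v : values) {
            std::size_t slot = v % m;
            if (used[slot]) { ok = false; break; }
            used[slot] = true;
        }
        if (ok)
            return m; // values map uniquely into [0, m)
    }
    return 0;
}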
A trie, or perhaps a van Emde Boas tree, might be a better bet for creating a space-efficient set of integers with lookup time that is constant in the number of objects in the data structure, assuming that even a std::bitset would be too large.