Comparing different implementations with random seeds - C++

I have two implementations of a program, one using lists and one using vectors, in order to compare their runtimes. The class functions in each implementation are different, since the list implementation allows more flexibility in code. They both also use random number generators.
I set both to have random seed 0 and ran them, but the results I get are not the same.
One question I have is, if both implementations call a function using a random seed, e.g.
    boost::mt19937 seed(0);  // engine seeded with 0, as described above
    boost::variate_generator<boost::mt19937&, boost::exponential_distribution<>> random_n(seed, boost::exponential_distribution<>());
and one calls it more times than the other implementation, will that cause desynchronization with respect to random seeds?
To be more specific, the vector implementation simulates a Poisson process on a continuous real segment, e.g. [0,1], whereas the list implementation simulates the PP on separate partitions {[0,0.1], [0.1,0.2], [0.2,0.3], ..., [0.9,1]} and then combines the results. Simulating a PP on the whole segment could mean as few as one exponential draw, but simulating on the 10 partitions requires at least 10 draws, even if some of them are never used (e.g. when they overshoot their partition).
Even though, probabilistically, these methods should generate the same kind of results, would the seeds of the two programs become desynchronized? And if so, is there any way to resynchronize them without changing the implementation?
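To illustrate the effect, here is a minimal sketch using the standard <random> analogues of the Boost types above (not the asker's actual code): two engines start identically seeded, but as soon as one side draws more variates than the other, the streams diverge for good.

    #include <iostream>
    #include <random>

    int main() {
        std::mt19937 engA(0), engB(0);   // both seeded with 0
        std::exponential_distribution<double> exp_dist(1.0);

        exp_dist(engA);                  // "whole segment": one draw
        for (int i = 0; i < 10; ++i)     // "10 partitions": ten draws
            exp_dist(engB);

        // The engines are now in different states, so every subsequent
        // draw differs between the two implementations:
        std::cout << exp_dist(engA) << " vs " << exp_dist(engB) << '\n';
    }

So yes: drawing a different number of variates desynchronizes the engines. Resynchronizing without changing either implementation is hard, because the number of draws is itself random; padding the shorter stream with engine.discard(n) only works if you can count the draws on both sides.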


How to detect that the entire period of C++ random engine has been consumed

I want to write a small (fast) C++ program which essentially "detects" that the entire period of a std::minstd_rand0 engine has been consumed. In other words I want to detect when the sequence is repeating (and also measure the time required for the sequence to repeat). I don't care about the target distribution (it can be uniform).
Any ideas on how I should proceed? One option I was considering is to create two std::array variables. In the first std::array I would store, say, the first 10000 pseudo-random variates returned by std::minstd_rand0. I would then proceed by filling the other std::array with successive blocks of 10000 variates and compare the contents of the two arrays after each pass of 10000 variates. I would consider that the entire period has been consumed once the two arrays are equal.
Would this approach be sensible?
Standard random number engines can be compared to each other--they compare equal if and only if they have the same state.
Therefore, you can pretty easily measure the period with code that:
Creates two default-constructed generators.
Repeatedly generates a number from one of the generators, incrementing a count,
until the two generators compare equal again.
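A minimal sketch of that loop (the do/while advances the runner at least once before the first comparison):

    #include <cstdint>
    #include <iostream>
    #include <random>

    int main() {
        std::minstd_rand0 reference;  // left at its initial state
        std::minstd_rand0 runner;     // advanced until it matches again

        std::uint64_t count = 0;
        do {
            runner();
            ++count;
        } while (runner != reference);

        std::cout << "period: " << count << '\n';
    }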
At least in my quick test of std::minstd_rand0, I get a result of
2147483646
Needless to say, this is a lot more practical with std::minstd_rand0 than with std::mt19937 (for one obvious example).

C++ hash function, how is the original hasher i.e. hash<int xkey> implemented

I am new to hashing in general and also to the STL world, and saw the new std::unordered_set and the SGI hash_set, both of which use the hasher hash. I understand that to get a good load factor you might need to write your own hash function, and I have been able to write one.
However, I am trying to go deeper into how the original default hash functions are written.
My questions are:
1) How is the original default HashFcn written; more concretely, how is the hash generated? Is it based on some pseudo-random number? Can anyone point me to some header file (I am a bit lost with the documentation) where I can look up how the hasher hash is implemented?
2) How does it guarantee that each time you will be able to get the same key?
Please let me know if I can make my questions clearer in any way.
In the version of gcc that I happen to have installed here, the required hash functions are in /usr/lib/gcc/i686-pc-cygwin/4.7.3/include/c++/bits/functional_hash.h
The hashers for integer types are defined using the macro _Cxx_hashtable_define_trivial_hash. As you might expect from the name, this just casts the input value to size_t.
This is how gcc does it. If you're using gcc then you should have a similarly-named file somewhere. If you're using a different compiler then the source will be somewhere else. It is not required that every implementation uses a trivial hash for integer types, but I suspect that it is very common.
It's not based on a random number generator, and hopefully it's now pretty obvious to you how this function guarantees to return the same key for the same input every time! The reason for using a trivial hash is that it's as fast as it gets. If it gives a bad distribution for your data (because your values tend to collide modulo the number of buckets) then you can either use a different, slower hash function or a different number of buckets (std::unordered_set doesn't let you specify the exact number of buckets, but it does let you set a minimum). Since library implementers don't know anything about your data, I think they will tend not to introduce slower hash functions as the default.
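In effect, the integer hashers reduce to something like the following (a sketch of what the macro produces, not the exact libstdc++ expansion):

    #include <cstddef>

    // The "trivial hash" idea: the hash of an integer is the integer
    // itself, widened to size_t.
    struct trivial_int_hash {
        std::size_t operator()(int value) const noexcept {
            return static_cast<std::size_t>(value);
        }
    };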
A hash function must be deterministic -- i.e., the same input must always produce the same result.
Generally speaking, you want the hash function to produce all outputs with about equal probability for arbitrary inputs (but while desirable, this is not mandatory -- and for any given hash function, there will always be arbitrarily many inputs that produce identical outputs).
Generally speaking, you want the hashing function to be fast, and to depend (to at least some degree) on the entirety of the input.
A fairly frequently seen pattern is: start with some semi-random input. Combine one byte of input with the current value. Do something that will move the bits around (multiplication, rotation, etc.) Repeat for all bytes of the input.
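FNV-1a is one well-known instance of that pattern (shown here purely as an illustration; it is not what any particular standard library uses for integers):

    #include <cstddef>
    #include <cstdint>

    std::uint64_t fnv1a(const unsigned char* data, std::size_t len) {
        std::uint64_t h = 14695981039346656037ull;  // semi-random start (the FNV offset basis)
        for (std::size_t i = 0; i < len; ++i) {
            h ^= data[i];            // combine one byte with the current value
            h *= 1099511628211ull;   // multiply to move the bits around
        }
        return h;
    }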

How to use shuffle in KFold in scikit-learn

I am running 10-fold CV using the KFold function provided by scikit-learn in order to select some kernel parameters. I am implementing this (grid search) procedure:
1- pick a selection of parameters
2- generate a svm
3- generate a KFold
4- get the data that corresponds to training/cv_test
5- train the model (clf.fit)
6- classify with the cv_test data
7- calculate the cv-error
8- repeat 1-7
9- when ready, pick the parameters that provide the lowest average cv-error
If I do not use shuffle in the KFold generation, I get very much the same average cv-errors if I repeat the same runs, and the "best results" are repeatable.
If I use shuffle, I get different values for the average cv-errors if I repeat the same run several times, and the "best values" are not repeatable.
I can understand that I should get different cv-errors for each KFold pass, but the final average should be the same.
How does the KFold with shuffle really work?
Each time the KFold is called, it shuffles my indexes and it generates training/test data. How does it pick the different folds for "training/testing"? Does it have a random way to pick the different folds for training/testing?
Are there situations where using "shuffle" is advantageous, and situations where it is not?
If shuffle is True, the whole data is first shuffled and then split into the K-Folds. For repeatable behavior, you can set the random_state, for example to an integer seed (random_state=0).
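Conceptually, the mechanism looks like this (sketched in C++ for illustration, since scikit-learn's internals are not reproduced here; the helper name is made up). The index list is shuffled once with a seeded engine, the analogue of random_state, and then cut into K contiguous folds:

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // Illustration only: shuffled K-fold index generation.
    std::vector<std::vector<int>> kfold_indices(int n, int k, unsigned seed) {
        std::vector<int> idx(n);
        std::iota(idx.begin(), idx.end(), 0);   // 0, 1, ..., n-1

        std::mt19937 rng(seed);                 // fixed seed == repeatable folds
        std::shuffle(idx.begin(), idx.end(), rng);

        std::vector<std::vector<int>> folds(k);
        int pos = 0;
        for (int f = 0; f < k; ++f) {
            int size = n / k + (f < n % k ? 1 : 0);  // spread the remainder
            folds[f].assign(idx.begin() + pos, idx.begin() + pos + size);
            pos += size;
        }
        return folds;                           // each fold serves once as the test set
    }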
If your parameters depend on the shuffling, this means your parameter selection is very unstable. Probably you have very little training data or you use too few folds (like 2 or 3).
The "shuffle" is mainly useful if your data is somehow sorted by classes, because then each fold might contain only samples from one class (in particular for stochastic gradient decent classifiers sorted classes are dangerous).
For other classifiers, it should make no differences. If shuffling is very unstable, your parameter selection is likely to be uninformative (aka garbage).

using one random engine for multiple distributions in C++11

I am using the new C++11 <random> header in my application, and in one class I need different random numbers with different distributions in different methods. I just put a random engine std::default_random_engine as a class member, seed it in the class constructor with std::random_device, and use it for different distributions in my methods. Is it OK to use the random engine this way, or should I declare different engines for every distribution I use?
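A minimal sketch of the pattern being described (the class name and methods are made up for illustration):

    #include <random>

    class Simulator {
    public:
        Simulator() : engine_(std::random_device{}()) {}  // seed once, in the constructor

        double exponential(double lambda) {
            return std::exponential_distribution<double>(lambda)(engine_);
        }

        int uniformInt(int lo, int hi) {
            return std::uniform_int_distribution<int>(lo, hi)(engine_);
        }

    private:
        std::default_random_engine engine_;  // single engine shared by all distributions
    };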
It's ok.
Reasons to not share the generator:
threading (standard RNG implementations are not thread safe)
determinism of random sequences:
If you wish to be able (for testing/bug hunting) to control the exact sequences generated, you will likely have fewer troubles by isolating the RNGs used, especially when not all of the code consuming the RNG is deterministic.
You should be careful when using one pseudo random number generator for different random variables, because in doing so they become correlated.
Here is an example: If you want to simulate Brownian motion in two dimensions (e.g. x and y) you need randomness in both dimensions. If you take the random numbers from one generator (noise()) and assign them successively
    while (simulating) {
        x = x + noise();
        y = y + noise();
    }
then the variables x and y become correlated, because the quality guarantees of pseudo-random number generators are stated for the full sequence of numbers generated, not for subsequences such as every second number, which is what each dimension sees in this example. Here, the Brownian particles could maybe move in the positive x and y directions with a higher probability than in the negative directions, and thus introduce an artificial drift.
For two further reasons to use different generators look at sehe's answer.
MosteM's answer isn't correct. It's fine to do this so long as you want the draws from the distributions to be independent. If for some reason you need exactly the same random input into draws of different distributions, then you may want different RNGs. If you want correlation between two random variables, it's better to build them from a common random variable using mathematical principles: e.g., if A and B are independent normal(0,1), then A and a*A + sqrt(1 - a^2)*B are both normal(0,1) with correlation a.
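A small sketch of that construction (the function name is illustrative):

    #include <cmath>
    #include <random>
    #include <utility>

    // Returns a pair of normal(0,1) variables with correlation a (|a| <= 1),
    // built from two independent draws A and B.
    std::pair<double, double> correlated_normals(std::mt19937& rng, double a) {
        std::normal_distribution<double> n01(0.0, 1.0);
        double A = n01(rng);
        double B = n01(rng);
        double C = a * A + std::sqrt(1.0 - a * a) * B;  // corr(A, C) == a
        return {A, C};
    }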
EDIT: I found a great resource on the C++11 random library which may be useful to you.
There is no reason not to do it like this. Depending on which random generator you use, the period is quite huge (2^19937 - 1 in the case of the Mersenne Twister), so in most cases you won't even reach the end of one period during the execution of your program. And even then, it is not as if exhausting the period with all distributions sharing one generator were any worse than having 3 generators each consume 1/3 of their period.
In my programs, I use one generator per thread, and it works fine. I think that's the main reason the generator and the distributions were split up in C++11: if you weren't allowed to do this, there would be no benefit in having them separate, since you would need one generator for each distribution anyway.

random_shuffle algorithm - are identical results produced without random generator function?

If a random generator function is not supplied to the random_shuffle algorithm in the standard library, will successive runs of the program produce the same random sequence if supplied with the same data?
For example, if
std::random_shuffle(filenames.begin(), filenames.end());
is performed on the same list of filenames from a directory in successive runs of the program, is the random sequence produced the same as that in the prior run?
If you use the same random generator, with the same seed, and the same starting sequence, the results will be the same. A computer is, after all, deterministic in its behavior (modulo threading issues and a few other odds and ends).
If you do not specify a generator, the default generator is implementation defined. Most implementations, I think, use std::rand() (which can cause problems, particularly when the number of elements in the sequence is larger than RAND_MAX). I would recommend getting a generator of known quality, and using it.
If you don't correctly seed the generator which is being used (another reason not to use the default, since how you seed it will depend on the implementation), then you'll get what you get. In the case of std::rand(), the default always uses the same seed. How you seed depends on the generator used. What you use to seed it should vary from one run to the other; for many applications, time(NULL) is sufficient; on a Unix platform, I'd recommend reading however many bytes it takes from /dev/random. Otherwise, hashing other information (IP address of the machine, process id, etc.) can also improve things -- it means that two users starting the program at exactly the same second will still get different sequences. (But this is really only relevant if you're working in a networked environment.)
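A sketch of the Unix suggestion (std::random_device is the portable C++11 alternative to reading /dev/random by hand):

    #include <fstream>
    #include <random>

    // Seed an engine with bytes read from /dev/random, as suggested above.
    // Note that /dev/random may block until enough entropy is available.
    std::mt19937 seeded_engine() {
        unsigned int seed = 0;
        std::ifstream dev("/dev/random", std::ios::binary);
        dev.read(reinterpret_cast<char*>(&seed), sizeof seed);
        return std::mt19937(seed);
    }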
25.2.11 just says that the elements are shuffled with uniform distribution. It makes no guarantees as to which RNG is used behind the scenes (unless you pass one in) so you can't rely on any such behavior.
In order to guarantee the same shuffle outcome you'll need to provide your own RNG that provides those guarantees, but I suspect that even then, if you update your standard library, the random_shuffle algorithm itself could change its behavior.
You may get an identical result on every run of the program. If this is a problem, you can add a custom random number generator (which can be seeded from an external source) as an additional argument to std::random_shuffle; the generator would be the third argument. Some people recommend calling srand(unsigned(time(NULL))) before random_shuffle, but the results of that are often implementation defined (and unreliable).
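A sketch of passing a seeded generator as that third argument (in C++11 and later, std::shuffle takes the engine directly and is the preferred replacement):

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> filenames = {"a.txt", "b.txt", "c.txt"};

        std::mt19937 rng(42);  // fixed seed: the same order on every run
        auto pick = [&rng](std::ptrdiff_t n) {
            // random_shuffle expects a callable returning a value in [0, n)
            return std::uniform_int_distribution<std::ptrdiff_t>(0, n - 1)(rng);
        };
        std::random_shuffle(filenames.begin(), filenames.end(), pick);

        // std::shuffle(filenames.begin(), filenames.end(), rng);  // modern equivalent
    }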