Best ways to evaluate randomness of a shuffled list [closed] - c++

I'm trying to quantify the randomness of a shuffled list. Given a list of distinct integers, I want to shuffle it with different random generators or methods and evaluate the quality of the resulting shuffle.
So far I have tried a kind of dice experiment. Given a list of input_size, I select one bucket in the list to be the "observed" one, then I shuffle the initial list num_runs * input_size times (always starting from a fresh copy). I then look at the frequencies of the elements that fell into the observed bucket and report the result on a plot. You can see the results below for three different methods (line plots of the frequencies; I tried histograms but they looked bad).
The dice experiment over three methods
Reporting plots alone is not formal enough; I would like to report some numbers. What are the best ways to do this (or the ones used in academic publications)?
Thanks in advance.

Quantifying randomness isn't trivial.
You generate a huge amount of random numbers and then test whether they have the properties exhibited by truly random numbers. There are tons of properties you can test for, e.g., each bit should occur with 50% probability.
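As a toy illustration of one such property check, a "monobit" frequency test over a byte stream read from stdin might look like the sketch below (real suites run far stronger tests than this):

    // Sketch: count set bits in a byte stream from stdin and compare against
    // the expected 50%. Only a toy check; PractRand etc. go much further.
    #include <cstdint>
    #include <cstdio>

    int main() {
        uint64_t ones = 0, total_bits = 0;
        int c;
        while ((c = std::getchar()) != EOF) {
            for (int b = 0; b < 8; ++b)
                ones += (c >> b) & 1;
            total_bits += 8;
        }
        std::printf("fraction of 1 bits: %f (expected ~0.5)\n",
                    total_bits ? double(ones) / total_bits : 0.0);
    }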
There are randomness test suites that combine a bunch of these tests and try to find statistical flaws in pseudo-random numbers.
PractRand is, to my knowledge, currently the most sophisticated randomness test suite.
I'd suggest you write a program that uses your method to repeatedly shuffle an array of, e.g., [0..255] and writes the raw bytes to stdout, so the output is a pseudo-random bit stream. Then you can pipe that into PractRand and it will quit once it finds statistical flaws: ./a.out | ./PractRand stdin.
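A minimal sketch of such a driver, using std::shuffle with std::mt19937 as a stand-in for whatever shuffle method you actually want to test:

    // Sketch: emit a raw byte stream from repeated shuffles, suitable for
    // piping into PractRand. Swap std::shuffle/std::mt19937 for your own method.
    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <numeric>
    #include <random>

    int main() {
        std::array<unsigned char, 256> buf;
        std::mt19937 gen(std::random_device{}());
        for (;;) {
            std::iota(buf.begin(), buf.end(), 0);           // fresh [0..255] each round
            std::shuffle(buf.begin(), buf.end(), gen);      // the shuffle under test
            std::fwrite(buf.data(), 1, buf.size(), stdout); // raw bytes to stdout
        }
    }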
TestU01's "Big Crush" is also a pretty good test suite, but it takes a very long time to run, and in my experience PractRand finds more statistical flaws.
I suggest not using the Diehard or the newer Dieharder test suites, because they aren't as powerful and they produce false positives, even when fed CSPRNGs or true random number generators.


How can I benchmark the performance of C++ code? [closed]

I am starting to study algorithms and data structures seriously, and I am interested in learning how to compare the performance of the different ways I can implement them.
For simple tests, I can get the time before/after something runs, run that thing 10^5 times, and average the running times. I can parametrize the input by size, or sample random input, and get a list of running times vs. input size. I can output that as a CSV file and feed it into pandas.
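For reference, a minimal sketch of that manual timing loop with std::chrono (work() is a hypothetical stand-in for the routine under test; note that the answers below warn that trivial work may be optimized away):

    // Sketch: time a routine over many repetitions and report the average.
    #include <chrono>
    #include <iostream>

    void work() { /* hypothetical routine under test */ }

    int main() {
        const int runs = 100000;                       // 10^5 repetitions
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < runs; ++i)
            work();
        auto stop = std::chrono::steady_clock::now();
        std::chrono::duration<double> total = stop - start;   // seconds
        std::cout << "average: " << total.count() * 1e9 / runs << " ns per run\n";
    }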
I am not sure whether there are caveats I'm missing. I am also not sure what to do about measuring space complexity.
I am learning to program in C++. Are there user-friendly tools to achieve what I am trying to do?
Benchmarking code is not easy. What I found most useful was the Google Benchmark library. Even if you are not planning to use it, it is worth reading through some of its examples. It has many options for parametrizing tests, writing results to a file, and even estimating the Big O complexity of your algorithm, to name just a few. If you are at all familiar with the Google Test framework, I would recommend using it. It also lets you manage compiler optimizations, so you can be sure that your code wasn't optimized away.
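A minimal Google Benchmark sketch, assuming the library is installed and linked; BM_SortVector is just a hypothetical example benchmark:

    // Sketch: benchmark std::sort over parametrized input sizes with Google Benchmark.
    #include <benchmark/benchmark.h>
    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    static void BM_SortVector(benchmark::State& state) {
        std::vector<int> base(state.range(0));
        for (auto& x : base) x = std::rand();
        for (auto _ : state) {
            std::vector<int> v = base;              // copy so each iteration sorts fresh data
            std::sort(v.begin(), v.end());
            benchmark::DoNotOptimize(v.data());     // keep the result from being optimized away
        }
        state.SetComplexityN(state.range(0));
    }
    BENCHMARK(BM_SortVector)->Range(1 << 10, 1 << 18)->Complexity();
    BENCHMARK_MAIN();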
There is also a great talk about benchmarking code from CppCon 2015: Chandler Carruth, "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!". It offers many insights into possible mistakes you can make (and it also uses Google Benchmark).
It is operating system and compiler specific (so implementation specific). You could use profiling tools, you could use timing tools, etc.
On Linux, see time(1), time(7), perf(1), gprof(1), pmap(1), mallinfo(3) and proc(5) and about Invoking GCC.
See also this. In practice, be sure that your runs last long enough (e.g. at least one second of process time).
Be aware that optimizing compilers can drastically transform your program. See the CppCon 2017 talk by Matt Godbolt, "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid".
Speaking from an architectural point of view, you can also benchmark your C++ code using tools such as Intel Pin or perf. You can use these tools to study how architecture-dependent your code is. For example, you can compile your code at different optimization levels and check the IPC/CPI, cache accesses, and load/store accesses. You can even check whether your code suffers a performance hit due to library functions. These tools are powerful and can give you huge insights into your code.
You can also try disassembling your code, study where it spends most of its time, and try to optimize that. You can look at different techniques to ensure that frequently accessed data stays in the cache, giving a high hit rate.
Say you realize that your code is heavily dominated by loops: you can run it with different loop bounds and compare the metrics in the two cases. For example, set the loop bound to 100,000 and record the desired performance metric X, then set the loop bound to 200,000 and record the metric Y. Now calculate Y - X. This gives you much better insight into the behavior of the loops, because by subtracting the two metrics you have effectively removed the static effects of the code.
Say you run your code 10 times with different input sizes. You can compute the runtime per unit of input size, sort this new metric in ascending order, remove the first and last values (to drop outliers), and take the average. Finally, compute the coefficient of variation to understand how the run times behave.
On a side note, more often than not we use the terms 'average' and 'arithmetic mean' carelessly. Look at the metric you plan to average and consider whether the harmonic, arithmetic, or geometric mean is appropriate in each case. For example, taking the arithmetic mean of rates gives incorrect answers, and simply averaging two events which do not occur equally often in time can also give incorrect results; use a weighted arithmetic mean instead.
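A small sketch of the trimmed mean and coefficient of variation described above, over a hypothetical vector of measured runtimes:

    // Sketch: sort the runtimes, drop the min and max, then compute the mean
    // and the coefficient of variation (stddev / mean) of the remaining values.
    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<double> runtimes = {12.1, 11.8, 12.4, 30.0, 11.9,
                                        12.2, 12.0, 12.3, 11.7, 12.1};  // hypothetical data
        std::sort(runtimes.begin(), runtimes.end());
        std::vector<double> trimmed(runtimes.begin() + 1, runtimes.end() - 1); // drop outliers

        double sum = 0.0;
        for (double t : trimmed) sum += t;
        double mean = sum / trimmed.size();

        double var = 0.0;
        for (double t : trimmed) var += (t - mean) * (t - mean);
        var /= trimmed.size();

        double cv = std::sqrt(var) / mean;   // coefficient of variation
        std::cout << "trimmed mean: " << mean << ", CV: " << cv << "\n";
    }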

Algorithm for generating unique number across system [closed]

I have a problem where I need to generate unique numbers throughout the system. Say an application 'X' generates a value 'A' from some inputs, and this value 'A' will be used by some other application as input to generate another value 'B'.
Both 'A' and 'B' will be saved later in kdb. The purpose is to identify which value of 'A' triggered the generation of which value of 'B'. Values of 'A' are generated at very high speed, so I am looking for an algorithm that is fast and doesn't hamper the performance of application 'X'.
What you want is a UUID. See https://en.m.wikipedia.org/wiki/Universally_unique_identifier. They are generally based on things like MAC addresses, timestamps, hashes, and randomness, and their intent is to be globally unique. Depending on the platform, there are often built-in functions for generating them. I can expand on this more when I'm not on my phone if necessary, but start there.
You have likely run into them from time to time, https://www.uuidgenerator.net can give you some examples.
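A minimal sketch of generating a random (version 4) UUID in C++ using Boost.UUID, assuming Boost is available (the standard library has no UUID facility of its own):

    // Sketch: random UUID generation with Boost.UUID.
    #include <boost/uuid/uuid.hpp>
    #include <boost/uuid/uuid_generators.hpp>
    #include <boost/uuid/uuid_io.hpp>
    #include <iostream>

    int main() {
        boost::uuids::random_generator gen;   // produces version-4 (random) UUIDs
        boost::uuids::uuid id = gen();
        std::cout << boost::uuids::to_string(id) << "\n";
    }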
That said, if you're inserting them into a database, another strategy to investigate is using the database's auto-assigned primary key ID numbers. That is not always possible, since you have to store a row first to get an ID assigned, but it is philosophically sound for your application.
You could also roll your own, although there are many caveats: e.g. the timestamp of application startup concatenated with some internal counter (a rough sketch follows). Just be aware of collision risks, even unlikely ones, e.g. two applications starting at the same time, or an incorrect system clock. I wouldn't consider this approach for serious usage given the more reliable strategies above.
No matter what you use, I recommend ultimately also using it as the primary key in your database. It will simplify things for you overall; also, having two unique IDs in a database (e.g. a UUID plus an auto-generated primary key) denormalizes your database a bit (https://en.m.wikipedia.org/wiki/Database_normalization).
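A rough sketch of that roll-your-own idea (startup timestamp plus an atomic counter; the 44/20 bit split is arbitrary, and the collision caveats above still apply):

    // Sketch: ids from "startup time in ms" combined with a per-process counter.
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <iostream>

    uint64_t next_id() {
        static const uint64_t startup_ms =
            std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::system_clock::now().time_since_epoch()).count();
        static std::atomic<uint64_t> counter{0};
        // High bits: timestamp; low 20 bits: counter. A hypothetical split, not a standard.
        return (startup_ms << 20) | (counter.fetch_add(1) & 0xFFFFF);
    }

    int main() {
        std::cout << next_id() << "\n" << next_id() << "\n";
    }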

How do I choose the right checksum for a simple C++ program, and how do I implement it? [closed]

I'm very new to checksums and fairly new to programming. I have a fairly simple C++ program (measuring psi) that I'm transferring to an Arduino board. Would CRC-16 be OK, should I go with CRC-32, or would that be overkill?
A possible way to check that an executable has been correctly transmitted to your Arduino board is to use a simple checksum like MD5, or something even simpler, such as a crude hash function computing a 16-bit hash. See e.g. this answer for inspiration.
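For illustration, a bitwise CRC-16/CCITT-FALSE (polynomial 0x1021, init 0xFFFF) is small enough for an Arduino-class target; a sketch in plain C++:

    // Sketch: bitwise CRC-16/CCITT-FALSE over a byte buffer.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    uint16_t crc16_ccitt(const uint8_t* data, size_t len) {
        uint16_t crc = 0xFFFF;
        for (size_t i = 0; i < len; ++i) {
            crc ^= static_cast<uint16_t>(data[i]) << 8;
            for (int bit = 0; bit < 8; ++bit)
                crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                                     : static_cast<uint16_t>(crc << 1);
        }
        return crc;
    }

    int main() {
        const uint8_t msg[] = "123456789";
        std::printf("%04X\n", crc16_ccitt(msg, 9));  // standard check value: 0x29B1
    }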
Checksums come up in the context of unreliable communication channels. A communication channel is an abstraction; bits go in one end and come out on the other end. An unreliable channel simply means that the bits which come out aren't the same as those which went in.
Now obviously the most extreme unreliable channel has random bits come out. That's unusable, so we focus on the channels where the input and output are correlated.
Still, we have many different corruption models. One common model is that each bit is P% likely to fail. Another common model considers that bit errors typically come in bursts: each bit then has a P% chance to start a burst of errors of length N, in which each bit is 50% likely to be wrong. Quite a few more models exist, depending on your problem; more advanced models also consider the chance of bits going missing entirely.
The correct checksum has a very, very high likelihood of detecting the type of error predicted by the model, but might not work well for other types of errors.
For instance, I think that at the Internet's IP layer, the most common error is an entire IP packet going missing. That's why TCP uses sequence numbers to detect this particular error.

How should I generate random numbers for a genetic algorithm? [closed]

I'm writing a genetic algorithm to solve the Mastermind game. I've done lots of research on the best approaches, and it's incredibly important to have a diverse population, so I'm trying to determine how to get really good random numbers in C++. I've called srand(time(NULL)) at the start of my program to set the seed, and then I've simply used rand(). What I would like to know is: how random is that, really? Is it pretty good? Are there other, better libraries for random numbers?
I know number theory and randomness is a very complicated subject; do you have any pointers on writing your own version of rand()?
For crypto, you need very strong properties from your random numbers. Much of the literature out there focuses on these sorts of requirements. A typical solution is to seed iterated applications of SHA-256 with environmental noise (hard drive delays, network packets, mouse movements, RDRAND, HAVEGE, ...).
For Monte Carlo simulations or AI applications, the randomness requirements are much lower; in fact, very simple generators are good enough. The most basic is the infamous linear congruential generator, which is considered a bit old-fashioned now because its output patterns sometimes produce noticeable and unwanted sampling effects (in particular, some experimental studies done in the 70s and 80s are quite probably flawed because of this). These days the Mersenne Twister is more popular, and it is more than adequate for a computer game. It's available in the C++ standard library: see std::mt19937.
rand()'s randomness is really bad. It is typically a bog-standard LCG, with poor randomness and poor quality in its low-dimensional projections. If you are serious about the quality of randomness for your application, you need something better. Then it depends on whether you want to stick to the standard library or not.
If you want to stay within the standard library, go with the <random> header and the Mersenne Twister.
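A minimal example of seeding std::mt19937 once and drawing uniform integers from it (the six-value range is just a Mastermind-flavored placeholder):

    // Sketch: seed a Mersenne Twister once, then draw uniformly distributed ints.
    #include <iostream>
    #include <random>

    int main() {
        std::random_device rd;                          // non-deterministic seed source
        std::mt19937 gen(rd());                         // Mersenne Twister engine
        std::uniform_int_distribution<int> color(0, 5); // e.g. six Mastermind peg colors
        for (int i = 0; i < 4; ++i)
            std::cout << color(gen) << ' ';
        std::cout << '\n';
    }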
But I would recommend using the PCG family of generators instead. It's fast, it has good quality, and it fixes most of the mistakes of <random>.

Efficient approach in the grid [closed]

Problem: we have to fill a 2D grid of size m*n with characters from the set S such that the number of distinct sub-matrices in the resulting grid is close to a given number k.
This question is derived from http://www.codechef.com/JULY14/problems/GERALD09
Limits:
1 <= n, m < 16
1 <= k <= m*n*m*n
|S| = 4
time limit = 0.1 sec
Assumption: two sub-matrices are distinct if they have different dimensions, or if at least one pair of characters at corresponding locations doesn't match.
My approach: we can start with a random grid and loop until an acceptable solution is found; in each iteration we can increase or decrease the randomness depending on the current state (but we can get stuck in local optima).
The problem is that I don't know an efficient way to calculate the number of distinct sub-matrices of a sub-grid. I tried hashing for counting, which is pretty fast: O(n^2 * m^2) times the cost of generating/looking up a hash value for a sub-grid.
But this approach doesn't give exact answers because of hash collisions, and even after correcting it following the comment by @Vaughn Cato, I can only run 15-25 iterations of the optimization, which is not enough.
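For reference, a collision-free (though slower) way to count distinct sub-matrices exactly is to serialize each sub-matrix, dimensions included, into a set; a sketch:

    // Sketch: exact count of distinct sub-matrices via serialization into a set.
    // Slower than hashing, but collision-free; usable as a reference oracle for n, m < 16.
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    int count_distinct_submatrices(const std::vector<std::string>& grid) {
        int n = static_cast<int>(grid.size());
        int m = static_cast<int>(grid[0].size());
        std::set<std::string> seen;
        for (int h = 1; h <= n; ++h)
            for (int w = 1; w <= m; ++w)
                for (int r = 0; r + h <= n; ++r)
                    for (int c = 0; c + w <= m; ++c) {
                        std::string key = std::to_string(h) + "x" + std::to_string(w) + ":";
                        for (int i = 0; i < h; ++i)
                            key += grid[r + i].substr(c, w);  // row slice of the sub-matrix
                        seen.insert(key);
                    }
        return static_cast<int>(seen.size());
    }

    int main() {
        std::vector<std::string> grid = {"abca", "bcab", "cabc"};  // example 3x4 grid
        std::cout << count_distinct_submatrices(grid) << "\n";
    }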
Recently I learned that simulated annealing can be used to solve these kinds of problems.
http://www.theprojectspot.com/tutorial-post/simulated-annealing-algorithm-for-beginners/6
I am searching for an efficient approach to this optimization problem.
Thanks in advance.
I think they will post an editorial at some point, but here is a possible idea for this particular problem:
I generated locally all achievable numbers of distinct sub-matrices for particular values of n and m.
For n=m=3 I got only 11 out of 81 possibilities.
For n=3,m=4 I got only 19 out of possible 144 values.
What's more, when I generated the values, I obtained all 19 possible options very early on: after 263,000 matrices out of the possible 16M I already had them all (I generated them in lexicographical order).
So one possible solution might be to precompute as many achievable values of k as possible for each given n and m, store either the seed of the random generator or some other representation that needs only O(1) characters per (n, m, k) triplet, and for a particular test case just check the two neighboring values: the first k larger and the first k smaller than the given one.
What's more, since the number of achievable k values is not large, it may be possible to generate them another way: given all achievable values of k for an n x m table, along with an example table for each, we can backtrack only over the values in the next row and try to obtain matrices covering all distinct values of k for n x (m+1).