Generating random numbers: CPU vs GPU, which currently wins? - c++

I've been working on a physics simulation requiring the generation of a very large number of random numbers (at least 10^13, to give an idea). I've been using the C++11 implementation of the Mersenne Twister. I've also read that GPU implementations of this same algorithm are now part of the CUDA libraries and that GPUs can be extremely efficient at this task; but I couldn't find explicit numbers or a benchmark comparison. For example, compared to an 8-core i7, are recent-generation Nvidia cards more performant at generating random numbers? If so, by how much, and in which price range?
I'm thinking that my simulation could gain from having a GPU generating a huge pile of random numbers and the CPU doing the rest.

Some comparisons can be found here:
https://developer.nvidia.com/cuRAND
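Since that answer is just a link: for orientation, here is a hedged sketch of what generating numbers on the GPU with cuRAND's host API looks like (the buffer size and seed are illustrative assumptions; compile with nvcc and link against -lcurand):

    #include <cuda_runtime.h>
    #include <curand.h>

    // Fill a device buffer with uniform floats using cuRAND's GPU
    // Mersenne Twister variant (MTGP32); the numbers stay on the GPU,
    // ready to be consumed by simulation kernels.
    int main() {
        const size_t n = 1 << 24;  // 16M numbers, an arbitrary batch size
        float* dev = nullptr;
        cudaMalloc(&dev, n * sizeof(float));

        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MTGP32);
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
        curandGenerateUniform(gen, dev, n);  // (0,1] floats, generated on the GPU

        curandDestroyGenerator(gen);
        cudaFree(dev);
        return 0;
    }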

If you have a new enough Intel CPU (Ivy Bridge or newer), you can use the RDRAND instruction.
This can be used via the _rdrand16_step(), _rdrand32_step() and _rdrand64_step() intrinsic functions.
These are available in VS2012/13, the Intel compiler, and gcc.
The generator is seeded from an on-chip hardware entropy source. It is designed for NIST SP 800-90A compliance, so its randomness is very high.
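A minimal sketch of using one of the intrinsics (the retry loop is there because RDRAND can transiently report failure; on gcc/clang, compile with -mrdrnd):

    #include <immintrin.h>  // _rdrand64_step
    #include <cstdint>
    #include <cstdio>

    // RDRAND signals failure by returning 0 from the intrinsic, so retry
    // a few times (Intel suggests up to 10 attempts) before giving up.
    static bool hw_random64(std::uint64_t& out) {
        unsigned long long v;
        for (int attempt = 0; attempt < 10; ++attempt) {
            if (_rdrand64_step(&v)) {
                out = v;
                return true;
            }
        }
        return false;  // hardware RNG unavailable
    }

    int main() {
        std::uint64_t r;
        if (hw_random64(r))
            std::printf("%llu\n", static_cast<unsigned long long>(r));
    }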
Some numbers for reference:
On an Ivy Bridge dual-core laptop with HT (2.3 GHz), generating 2^32 (4 billion) random 32-bit numbers took 5.7 seconds single-threaded and 1.7 seconds with OpenMP.

Related

What weights for a multi-part benchmark?

I'm writing a benchmark for a school project. It's very simple, but I am wondering: in real life, what are the typical weights used for the various types of benchmarks? For instance, if I am combining an integer test, a cache test, and a floating-point test, should they be equally weighted in the final "score"? My hunch is that for many things the cache test matters more than raw arithmetic, and that for many things RAM speed is a big factor. Is there a consensus?
There is no universal set of weights.
Different real-world workloads have different bottlenecks, and therefore call for different weightings.
There is no single number that can tell you how fast a computer is. It's possible (and happens in real life) that program X runs faster on computer A than on computer B, while program Y runs faster on computer B.
Choosing a set of weights for microbenchmarks totally comes down to what you want your number to mean, and what kind of workload you want it to be a rough indicator for.
e.g. a dense matmul can usually saturate FMA execution unit throughput because it does O(N^3) work over N^2 data. With careful cache-blocking you can get mostly L1d cache hits, and avoid doing more than 1 SIMD vector load per FMA. DRAM / cache bandwidth has to be high enough to keep up, but most of the stores/reloads hit in L1d cache (which of course also has to be able to keep up).
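A hedged sketch of the cache-blocking idea for that matmul case (the tile size BS is an illustrative assumption, not a tuned value):

    #include <algorithm>
    #include <vector>

    // Cache-blocked matmul: work on BS x BS tiles so the active parts of
    // A, B and C stay resident in L1d, turning most of the O(N^3) memory
    // accesses into cache hits. C must be zero-initialized on entry.
    void matmul_blocked(const std::vector<double>& A,
                        const std::vector<double>& B,
                        std::vector<double>& C, int N) {
        const int BS = 64;  // tile size; tune so three tiles fit in cache
        for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int i = ii; i < std::min(ii + BS, N); ++i)
                for (int k = kk; k < std::min(kk + BS, N); ++k) {
                    const double a = A[i * N + k];  // reused across the whole j loop
                    for (int j = jj; j < std::min(jj + BS, N); ++j)
                        C[i * N + j] += a * B[k * N + j];
                }
    }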
But other workloads might bottleneck on memory bandwidth or latency and not care about FPU throughput at all. e.g. AMD Zen 1 (Ryzen) can do 2x 128-bit FMA per clock while Intel Haswell and later can do 2x 256-bit FMA per clock. But Ryzen is faster than or nearly equal to Intel clock-for-clock for some other workloads.
And on multi-core systems some programs are single-threaded and care only about single-core throughput, while others scale well and get a big speedup on a machine with lots of slower cores. Or they might care about inter-core latency vs. aggregate memory bandwidth.

Can I generate cryptographically secure random data from a combination of random_device and mt19937 with reseeding?

I need to generate cryptographically secure random data in C++11, and I'm worried that using random_device for all the data would severely limit performance (see slide 23 of Stephan T. Lavavej's "rand() Considered Harmful", where he says that when he tested it (on his system), random_device ran at 1.93 MB/s and mt19937 at 499 MB/s), as this code will be running on mobile devices (Android via JNI and iOS), which are probably slower than the numbers above.
In addition, I'm aware that mt19937 is not cryptographically secure; from Wikipedia: "observing a sufficient number of iterations (624 in the case of MT19937, since this is the size of the state vector from which future iterations are produced) allows one to predict all future iterations".
Taking all of the above information into account, can I generate cryptographically secure random data by generating a new random seed from random_device every 624 iterations of mt19937? Or (possibly) better yet, every X iterations where X is a random number (from random_device or mt19937 seeded by random_device) between 1 and 624?
Don't do this. Seriously, just don't. This isn't just asking for trouble--it's more like asking and begging for trouble by going into the most crime-ridden part of the most dangerous city you can find, carrying lots of valuables.
Instead of trying to re-seed MT 19937 often enough to cover up how insecure it is, I'd advise generating your random numbers by running AES in counter mode. This requires that you get one (but only one) good random number of the right size to use as your initial "seed" for your generator.
You use that as the key for AES, and simply use it to encrypt sequential numbers to get a stream of output that looks random, but is easy to reproduce.
This has many advantages. First, it uses an algorithm that's actually been studied heavily, and is generally believed to be quite secure. Second, it means you only need to distribute one (fairly small) random number as the "key" for the system as a whole. Third, it probably gives better throughput. Both Intel's and (seemingly) independent tests show a range of throughput that starts out competitive with what you're quoting for MT 19937 at the low end, and up to around 4-5 times faster at the top end. Given the way you're using it, I'd expect to see results close to (and possibly even exceeding [1]) the top end of the range they show.
Bottom line: AES in counter mode is obviously a better solution to the problem at hand. The very best you can hope for is that MT 19937 ends up close to as fast and close to as secure. In reality, it'll probably disappoint both those hopes, and end up both slower and substantially less secure.
[1] How would it exceed those results? Those are undoubtedly based on encrypting bulk data--i.e., reading a block of data from RAM, encrypting it, and writing the result back to RAM. In your case, you don't need to read the input from RAM--you just generate consecutive numbers in the CPU, encrypt them, and write out the results.
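For concreteness, a minimal sketch of this scheme using OpenSSL's EVP interface (the library choice and the key/IV handling are illustrative assumptions, not part of the answer itself; keep the key if you need to reproduce the stream):

    #include <openssl/evp.h>
    #include <openssl/rand.h>
    #include <stdexcept>
    #include <vector>

    // AES-256 in CTR mode as a PRNG: the key is the single good random
    // "seed"; CTR mode encrypts an incrementing counter, so encrypting a
    // zero buffer yields the raw keystream as the random output.
    std::vector<unsigned char> aes_ctr_random(std::size_t nbytes) {
        unsigned char key[32], iv[16];
        if (RAND_bytes(key, sizeof key) != 1 || RAND_bytes(iv, sizeof iv) != 1)
            throw std::runtime_error("seeding failed");

        std::vector<unsigned char> zeros(nbytes, 0), out(nbytes);
        EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
        EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), nullptr, key, iv);
        int outlen = 0;
        EVP_EncryptUpdate(ctx, out.data(), &outlen, zeros.data(),
                          static_cast<int>(nbytes));
        EVP_CIPHER_CTX_free(ctx);
        return out;
    }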
The short answer is: no, you cannot. The requirements for a cryptographically secure RNG are very stringent, and if you have to ask this question here, then you are not sufficiently aware of those requirements. As Jerry says, AES-CTR is one option, if you need repeatability. Another option, which does not allow repeatability, would be to look for an implementation of Yarrow or Fortuna for your system. In general it is much better to find a CSRNG in a library than to roll your own. Library writers are sufficiently aware of the requirements for a good CSRNG.

Which Pseudo-Random Number Generator to use in C++11?

C++11 comes with a set of PRNG engines.
In what situations should one choose one over another? What are their advantages, disadvantages, etc.?
I think the Mersenne twister std::mt19937 engine is just fine as the "default" PRNG.
You can just use std::random_device to get a non-deterministic seed for mt19937.
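In code, that default setup is just:

    #include <random>

    int main() {
        std::random_device rd;                          // non-deterministic seed source
        std::mt19937 gen(rd());                         // Mersenne Twister engine
        std::uniform_int_distribution<int> dist(1, 6);  // e.g. a fair die
        int roll = dist(gen);
        (void)roll;
    }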
There is a very interesting talk from GoingNative 2013 by Stephan T. Lavavej:
rand() Considered Harmful
You can download the slides as well from that web site. In particular, slide #23 clearly compares mt19937 vs. random_device:
mt19937 is:
Fast (499 MB/s = 6.5 cycles/byte for me)
Extremely high quality, but not cryptographically secure
Seedable (with more than 32 bits if you want)
Reproducible (Standard-mandated algorithm)
random_device is:
Possibly slow (1.93 MB/s = 1683 cycles/byte for me)
Strongly platform-dependent (GCC 4.8 can use IVB RDRAND)
Possibly crypto-secure (check documentation, true for VC)
Non-seedable, non-reproducible
The trade-off is between speed, memory footprint, and period of the PRNG.
Linear congruential generators: fast, low memory, small period
Lagged Fibonacci (subtract-with-carry): fast, large memory, large period
Mersenne Twister: slow, very large memory, very large period
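All three families have standard engines, so the trade-off is easy to try for yourself; a minimal sketch (the seeds are arbitrary):

    #include <random>

    int main() {
        std::minstd_rand lcg(42);    // linear congruential: tiny state, period 2^31 - 2
        std::ranlux48_base swc(42);  // raw subtract-with-carry: fast, larger state
        std::mt19937 mt(42);         // Mersenne Twister: ~2.5 KB state, period 2^19937 - 1
        unsigned long long sink = lcg() + swc() + mt();
        (void)sink;
    }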

Optimizing the use of the C++11 random generator

I'm coding a physics simulation that makes heavy use of random numbers. I just profiled my code for the first time, so I may be wrong in reading the output, but I see this line coming first:
      %    cumulative   self               self     total
    time     seconds   seconds    calls   ms/call  ms/call  name
    90.09      21.88     21.88   265536      0.08     0.08  std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>::operator()()
It seems to mean that the random number generator takes 90% of the time.
I had already written a previous post asking whether not constructing the random probability distributions at each loop iteration could save me time, but after trying and timing it, it didn't help (Is defining a probability distribution costly?). Are there common options for optimizing random number generation?
Thank you in advance; my simulation (in its current state) runs for days, so cutting into that 90% of the computation time would be significant progress.
There is always a trade-off between efficiency, i.e. speed and size (number of bytes of state), on the one hand, and the "randomness" of any RNG on the other. The Mersenne twister has quite good randomness (provided you use a high-entropy seed, such as one provided by std::random_device), but it is slow and has a large state. std::minstd_rand or std::knuth_b (linear congruential) are faster but less random (they pass fewer tests for randomness, i.e. have some non-random spectral properties); std::ranlux48 (subtract-with-carry with block discarding) has very good randomness but is slower still. Just experiment and test whether you're happy with the randomness provided (i.e. that you have no unsuspected correlations in the random data).
Edit: 1. None of these RNGs is truly random, of course, and none is random enough for cryptography. If you need that, use std::random_device, but don't complain about speed. 2. In parallel code (which you should consider), use thread_local RNGs, each initialised with a different seed.
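A hedged sketch of that thread_local approach (seeding each engine from random_device is one simple choice, not the only correct one):

    #include <random>

    // Each thread lazily constructs its own engine, so no locking is
    // needed and threads never share generator state.
    double thread_safe_uniform() {
        thread_local std::mt19937 gen{std::random_device{}()};
        thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
        return dist(gen);
    }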
If your code spends most of its time generating random numbers, you may want to take some time to choose the best algorithm for your application and implement it yourself. The Mersenne Twister is a pretty fast algorithm, and has good randomness, but you can always trade off some quality of the random numbers generated for more speed. It will depend on what your simulation requires and on the type of numbers you are generating (ints or floats). If you absolutely need good randomness, Mersenne Twister is probably already one of your best options. Otherwise, you may want to implement a simple linear congruential generator in your code.
Another thing to watch out for: if your code is parallel, you should be using a reentrant version of the random number generator and make sure that different threads use their own internal state variables for their generators. Otherwise, the mutexes needed to avoid overwriting the generator's internal state will slow down your code a lot. Many library generators are not reentrant, mind you. If your code is not parallel, you should probably parallelize it and use a separate thread to populate a list of random numbers for your simulation to consume. Another option is to use the GPU to generate random numbers in parallel.
Here are some links comparing the performance of different generators:
http://www.boost.org/doc/libs/1_38_0/libs/random/random-performance.html
https://www.gnu.org/software/gsl/manual/html_node/Random-Number-Generator-Performance.html
Use a dedicated random number library.
I would suggest WELL512 (link contains the paper and source code).
Marsaglia's KISS RNG is fast and is fine for simulation work. I am assuming that you don't need cryptographic quality.
If the randomness requirements allow it, you can use the RDTSC instruction to get random numbers, e.g. int from0to9 = rdtsc() % 10.
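For completeness, that trick looks like the following (a sketch only; the low bits of the timestamp counter are not statistically random, so use this solely when the requirements really are that loose):

    #include <x86intrin.h>  // __rdtsc; use <intrin.h> with MSVC

    int main() {
        // The timestamp counter increments continuously; its low bits give
        // a "random-ish" value with no statistical guarantees whatsoever.
        int from0to9 = static_cast<int>(__rdtsc() % 10);
        (void)from0to9;
    }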

How to calculate ANSI C code performance?

I have written some simple code in ANSI C and now want to perform some measurements.
I have measured execution time (using the clock() function under Windows and clock_gettime() under Linux).
Now I want to calculate how many IPS (Instructions Per Second) my CPU executes while running this code of mine. (Yes, I know MIPS is a poor metric, but even so, I want to calculate it.)
It would also be nice to see how many CPI (Cycles Per Instruction) it takes to perform e.g. an addition of 3 elements, and the other operations I perform.
Google says how to calculate the number of MIPS using a calculator, some knowledge about my CPU (its clock speed), simple math, and a bunch of other parameters (like CPI), but doesn't say HOW to obtain those!
I also haven't found any C/C++ function that would return the number of clock cycles needed to perform e.g. an access to a local variable.
There is also the problem of finding a reference manual from Intel/AMD for a modern CPU that would contain information about opcodes and the like.
I have manually calculated that my ANSI C code takes 37 operations, but those are ANSI C operations, not CPU instructions.
The easiest way of getting high-accuracy timing on Windows is QueryPerformanceCounter; see How to use QueryPerformanceCounter?
Then you simply need some functions that perform the operations you are interested in timing. You have to be a little careful of caching etc., so run the calculation several times and look at the distribution of times.
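A minimal sketch along those lines (the repetition count is an arbitrary assumption; examine the distribution of timings rather than a single run):

    #include <windows.h>
    #include <cstdio>

    // Time a piece of work with QueryPerformanceCounter and convert the
    // tick delta to seconds via the fixed counter frequency.
    int main() {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&t0);
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; ++i)  // the operation under test
            x += i * 0.5;
        QueryPerformanceCounter(&t1);

        double seconds = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
        std::printf("%f s\n", seconds);
    }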