Is std::mt19937_64 faster than std::mt19937?

Does mt19937_64 have a higher throughput (bit/s) than the 32 bit version, mt19937, assuming a 64 bit architecture?
What about after vectorization?

As @byjoe points out, this obviously depends on the compiler.
In this case, it seems to be considerably more dependent on the compiler than is typical though. For example, the Boost test linked in the comments uses the compiler from VC++ 2010, and shows only a fairly slight increase in random bits per second from using mt19937_64.
To get more up-to-date information, I whipped up a simple test:
#include <random>
#include <chrono>
#include <iostream>
#include <iomanip>
template <class T, class U>
U test(char const *label, U count) {
    using namespace std::chrono;
    T gen(100);
    U result = 0;
    auto start = high_resolution_clock::now();
    for (U i = 0; i < count; i++)
        result ^= gen();
    auto stop = high_resolution_clock::now();
    std::cout << "Time for " << std::left << std::setw(12) << label
              << duration_cast<milliseconds>(stop - start).count() << "\n";
    return result;
}
int main(int argc, char **argv) {
    unsigned long long limit = 1000000000;
    auto result1 = test<std::mt19937>("mt19937: ", limit);
    auto result2 = test<std::mt19937_64>("mt19937_64: ", limit);
    std::cout << "Ignore: " << result1 << ", " << result2 << "\n";
}
With VC++ 2015 Update 3 (with /O2b2 /GL, though it probably doesn't matter), I got results like these:
Time for mt19937: 4339
Time for mt19937_64: 4215
Ignore: 2598366015, 13977046647333287932
This shows mt19937_64 as being slightly faster per call, so over twice as fast per bit as mt19937. With MinGW (using -O3), the results were much more like those linked from the Boost site:
Time for mt19937: 2211
Time for mt19937_64: 4183
Ignore: 2598366015, 13977046647333287932
In this case, mt19937_64 takes just a little less than twice the time per call, so it's only slightly faster per bit. The highest overall speed seems to be from g++ with mt19937_64, but the difference between g++ and VC++ (on these runs) is less than 1%, so I'm not sure it's reproducible.
For what it's worth, the difference in speed (per call) between mt19937 and mt19937_64 with VC++ is also pretty small, but does seem to be reproducible--it happened quite consistently in my testing. I did wonder about whether that might be (at least partially) a matter of clock management--that when the code first started, the CPU was idle, and the clock had been slowed, so the first part of the first run was at a lower clock speed. To check, I reversed the order to test mt19937_64 first. I think my hypothesis was at least partially correct--when I reversed the order, mt19937_64 slowed down compared to mt19937, so they were nearly identical on a per-call basis with VC++.
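To compare the two engines per bit rather than per call, the timings above can be turned into a throughput ratio. The sketch below is only arithmetic on the reported numbers; the helper name `per_bit_speedup` is made up for illustration and is not part of `<random>`.

```cpp
// Ratio of bit throughput: mt19937_64 (64 bits per call) versus
// mt19937 (32 bits per call), given the time each engine needed for the
// same number of calls. A value above 1.0 means mt19937_64 delivers
// more random bits per unit of time. (Hypothetical helper.)
constexpr double per_bit_speedup(double ms_32, double ms_64) {
    return (64.0 / ms_64) / (32.0 / ms_32);
}
```

With the VC++ timings (4339 and 4215) the ratio comes out at roughly 2.06, and with the MinGW timings (2211 and 4183) at roughly 1.06, which matches the "over twice as fast per bit" and "only slightly faster per bit" readings.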

It clearly depends on your compiler and its implementation. I just tested and the 64-bit version takes about 60% longer call-for-call, which makes the 64-bit version about 25% faster bit-for-bit. I tested with an i7 CPU.
If you need max speed, you may want to consider using something else. Especially if the numbers don't need to be very high quality.

Related

Why it is appropriate to use `std::uniform_real_distribution`?

I'm trying to write Metropolis Monte Carlo simulation code.
Since the simulation will be very long, I'd like to think seriously about the performance for generating random numbers in [0, 1].
So I decided to check the performance of two methods by the following code:
#include <cfloat>
#include <chrono>
#include <iostream>
#include <random>
int main()
{
constexpr auto Ntry = 5000000;
std::mt19937 mt(123);
std::uniform_real_distribution<double> dist(0.0, std::nextafter(1.0, DBL_MAX));
double test1, test2;
// method 1
auto start1 = std::chrono::system_clock::now();
for (int i=0; i<Ntry; i++) {
test1 = dist(mt);
}
auto end1 = std::chrono::system_clock::now();
auto elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();
std::cout << elapsed1 << std::endl;
// method 2
auto start2 = std::chrono::system_clock::now();
for (int i=0; i<Ntry; i++) {
test2 = 1.0*mt() / mt.max();
}
auto end2 = std::chrono::system_clock::now();
auto elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();
std::cout << elapsed2 << std::endl;
}
Then the result is
295489 micro sec for method 1
79884 micro sec for method 2
I understand that there are many posts that recommend to use std::uniform_real_distribution.
But performance-wise, it is tempting to use the latter as this result shows.
Would you tell me what is the point of using std::uniform_real_distribution?
What is the disadvantage of using 1.0*mt() / mt.max()?
And in the current purpose, is it acceptable to use 1.0*mt() / mt.max() instead?
Edit:
I compiled this code with g++-11 test.cpp.
When I compile with the -O3 flag, the result is qualitatively the same (method 1 is approx. 1.8 times slower).
I would like to discuss the advantage of the widely used method.
I do care about the performance trend, but a specific performance comparison is out of my scope.

You use the standard random library because it is extremely difficult to do numerical calculations correctly and you don't want the burden of proving and maintaining your own random library.
Case in point, your random distribution is wrong. std::mt19937 produces 32-bit integers, yet you're expecting a double, which has a 53-bit significand (usually). There are values in the range [0, 1] that you will never obtain from 1.0*mt() / mt.max().
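The resolution gap can be made concrete with a small sketch. `naive_unit` and `canonical_unit` are illustrative names, not from the question; the second one uses `std::generate_canonical`, which is one standard way (an alternative to the distribution, not necessarily what it does internally) to fill all significand bits of a double.

```cpp
#include <cstdint>
#include <limits>
#include <random>

// Method 2 from the question, applied to one raw 32-bit engine output.
inline double naive_unit(std::uint_fast32_t raw) {
    return 1.0 * raw / std::mt19937::max();
}

// Standard-library alternative: calls the engine as often as needed to
// supply 53 bits of randomness for the double's significand.
inline double canonical_unit(std::mt19937& gen) {
    return std::generate_canonical<double, std::numeric_limits<double>::digits>(gen);
}
```

`naive_unit` can return at most 2^32 distinct values, spaced about 2.3e-10 apart, whereas adjacent doubles just below 1.0 are about 1.1e-16 apart, so most representable values in the interval are unreachable.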

Your testing methodology is flawed. You don't use the result that you produce, so a smart optimiser may simply skip producing a result.
Would you tell me what is the point of using std::uniform_real_distribution?
The clue is in the name. It produces a uniform distribution.
Furthermore, it allows you to specify the minimum and maximum between which you want the distribution to lie.
What is the disadvantage of using 1.0*mt() / mt.max()?
You cannot specify a minimum and a maximum.
It produces a less uniform distribution.
It produces less randomness.
is it acceptable to use 1.0*mt() / mt.max() instead?
In some use cases, it could be acceptable. In some other cases, it isn't acceptable. In the rest, it won't matter.
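The methodology point (unused results may be optimised away) can be addressed by folding every generated value into a checksum that is later printed or returned. This is a minimal sketch under that assumption; `timed_sum` is a made-up helper name.

```cpp
#include <chrono>
#include <random>
#include <utility>

// Draw n values from dist(gen) and accumulate them so the optimiser
// cannot discard the loop. Returns {checksum, elapsed microseconds}.
template <class Engine, class Dist>
std::pair<double, long long> timed_sum(Engine& gen, Dist& dist, int n) {
    using clock = std::chrono::steady_clock;
    double sum = 0.0;
    auto start = clock::now();
    for (int i = 0; i < n; ++i)
        sum += dist(gen);
    auto stop = clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    return {sum, static_cast<long long>(us)};
}
```

Printing the returned checksum next to the timing keeps the computation observable; the same wrapper can time either method by passing a different callable distribution.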

May changing unsigned int to size_t impact performances?

After porting some legacy code from win32 to win64, and after discussing the best strategy to remove the "possible loss of data" warning (What's the best strategy to get rid of "warning C4267 possible loss of data"?), I'm about to replace many unsigned int by size_t in my code.
However, my code is critical in terms of performance (I can't even run it in Debug...too slow).
I did a quick benchmarking:
#include "stdafx.h"
#include <cstdlib>
#include <iostream>
#include <chrono>
#include <string>
template<typename T> void testSpeed()
{
    auto start = std::chrono::steady_clock::now();
    T big = 0;
    for ( T i = 0; i != 100000000; ++i )
        big *= std::rand();
    std::cout << "Elapsed " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count() << "ms" << std::endl;
}
int main()
{
    testSpeed<size_t>();
    testSpeed<unsigned int>();
    std::string str;
    std::getline( std::cin, str ); // pause
    return 0;
}
Compiled for x64, it outputs:
Elapsed 2185ms
Elapsed 2157ms
Compiled for x86, it outputs:
Elapsed 2756ms
Elapsed 2748ms
So apparently using size_t instead of unsigned int has an insignificant performance impact. But is that really always the case? (It's hard to benchmark performance this way.)
Does/may changing unsigned int into size_t impact CPU performance (now a 64-bit object will be manipulated instead of a 32-bit one)?

Definitely not. On modern (and even older) CPUs, 64-bit integer operations perform as fast as 32-bit operations.
Example on my i7 4600u for arithmetic operation a * b / c :
(int32_t) * (int32_t) / (int32_t) : 1.3 nsec
(int64_t) * (int64_t) / (int64_t) : 1.3 nsec
Both tests compiled for x64 target (same target as yours).
However, if your code manages big objects full of integers (big arrays of integers, for example), using size_t instead of unsigned int may have an impact on performance if the number of cache misses increases (bigger data may exceed cache capacity). The most reliable way to check the impact on performance is to test your app in both cases. Use your own type typedef'ed to either size_t or unsigned int, then benchmark your application.
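A minimal sketch of the switchable typedef, assuming a build flag named `USE_WIDE_INDEX` (a made-up macro) selects the index type. The `footprint_bytes` helper is also illustrative; it only shows how the memory occupied by stored indices changes with the element type, which is where the cache effect would come from.

```cpp
#include <cstddef>

// Build with -DUSE_WIDE_INDEX (hypothetical macro) to switch the whole
// code base between the two index types and benchmark both variants.
#ifdef USE_WIDE_INDEX
using index_t = std::size_t;
#else
using index_t = unsigned int;
#endif

// Bytes occupied by n stored values of type T; this is what grows when
// arrays of indices move from 32-bit to 64-bit elements.
template <class T>
constexpr std::size_t footprint_bytes(std::size_t n) {
    return n * sizeof(T);
}
```

On a typical x64 target, an array of one million `size_t` values takes twice the space of the same array of `unsigned int`, so the same data covers twice as many cache lines.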

How much performance difference when using string vs char array?

I have the following code:
char fname[255] = {0};
snprintf(fname, 255, "%s_test_no.%d.txt", baseLocation, i);
vs
std::string fname = baseLocation + "_test_no." + std::to_string(i) + ".txt";
Which one performs better? Does the second one involve temporary creation? Is there any better way to do this?

Let's run the numbers:
2022 edit:
Using Quick-Bench with GCC 10.3 and compiling with C++20 (with some minor changes for constness) demonstrates that std::string is now faster, almost 3x as much:
Original answer (2014)
The code (I used PAPI Timers)
main.cpp
#include <iostream>
#include <string>
#include <stdio.h>
#include "papi.h"
#include <vector>
#include <cmath>
#define TRIALS 10000000
class Clock
{
public:
typedef long_long time;
time start;
Clock() : start(now()){}
void restart(){ start = now(); }
time usec() const{ return now() - start; }
time now() const{ return PAPI_get_real_usec(); }
};
int main()
{
int eventSet = PAPI_NULL;
PAPI_library_init(PAPI_VER_CURRENT);
if(PAPI_create_eventset(&eventSet)!=PAPI_OK)
{
std::cerr << "Failed to initialize PAPI event" << std::endl;
return 1;
}
Clock clock;
std::vector<long_long> usecs;
const char* baseLocation = "baseLocation";
//std::string baseLocation = "baseLocation";
char fname[255] = {};
for (int i=0;i<TRIALS;++i)
{
clock.restart();
snprintf(fname, 255, "%s_test_no.%d.txt", baseLocation, i);
//std::string fname = baseLocation + "_test_no." + std::to_string(i) + ".txt";
usecs.push_back(clock.usec());
}
long_long sum = 0;
for(auto vecIter = usecs.begin(); vecIter != usecs.end(); ++vecIter)
{
sum+= *vecIter;
}
double average = static_cast<double>(sum)/static_cast<double>(TRIALS);
std::cout << "Average: " << average << " microseconds" << std::endl;
//compute variance
double variance = 0;
for(auto vecIter = usecs.begin(); vecIter != usecs.end(); ++vecIter)
{
variance += (*vecIter - average) * (*vecIter - average);
}
variance /= static_cast<double>(TRIALS);
std::cout << "Variance: " << variance << " microseconds" << std::endl;
std::cout << "Std. deviation: " << sqrt(variance) << " microseconds" << std::endl;
double CI = 1.96 * sqrt(variance)/sqrt(static_cast<double>(TRIALS));
std::cout << "95% CI: " << average-CI << " usecs to " << average+CI << " usecs" << std::endl;
}
Play with the comments to get one way or the other.
10 million iterations of both methods on my machine with the compile line:
g++ main.cpp -lpapi -DUSE_PAPI -std=c++0x -O3
Using char array:
Average: 0.240861 microseconds
Variance: 0.196387 microseconds
Std. deviation: 0.443156 microseconds
95% CI: 0.240586 usecs to 0.241136 usecs
Using string approach:
Average: 0.365933 microseconds
Variance: 0.323581 microseconds
Std. deviation: 0.568842 microseconds
95% CI: 0.365581 usecs to 0.366286 usecs
So at least on MY machine with MY code and MY compiler settings, character arrays were about 34% faster than strings, using the following formula:
((time for string) - (time for char array) ) / (time for string)
This gives the difference in time between the approaches as a percentage of the time for string alone. My original figure was also correct: it used the character array approach as the reference point instead, which shows a 52% slowdown when moving to string, but I found it misleading.
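The two percentages describe the same pair of measurements and differ only in the reference point. A small sketch of the arithmetic (function names are illustrative):

```cpp
// Time saved by the char array, as a fraction of the string time.
constexpr double speedup_relative_to_string(double t_string, double t_array) {
    return (t_string - t_array) / t_string;
}

// Extra time needed by the string, as a fraction of the char array time.
constexpr double slowdown_relative_to_array(double t_string, double t_array) {
    return (t_string - t_array) / t_array;
}
```

With the averages above (0.365933 and 0.240861 microseconds), the first function gives about 0.34 and the second about 0.52.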
I'll take any and all comments for how I did this wrong :)
2015 Edit
Compiled with GCC 4.8.4:
string
Average: 0.338876 microseconds
Variance: 0.853823 microseconds
Std. deviation: 0.924026 microseconds
95% CI: 0.338303 usecs to 0.339449 usecs
character array
Average: 0.239083 microseconds
Variance: 0.193538 microseconds
Std. deviation: 0.439929 microseconds
95% CI: 0.238811 usecs to 0.239356 usecs
So the character array approach remains significantly faster although less so. In these tests, it was about 29% faster.

The snprintf() version will almost certainly be quite a bit faster. Why? Simply because no memory allocation takes place. The new operator is surprisingly expensive, roughly 250ns on my system - snprintf() will have finished quite a bit of work in the meantime.
That is not to say that you should use the snprintf() approach: The price you pay is safety. It is just so easy to get things wrong with the fixed buffer size you are supplying to snprintf(), and you absolutely need to supply code for the case that the buffer is not large enough. So, only think about using snprintf() when you have identified this part of code to be really performance critical.
If your system provides asprintf() (a GNU and BSD extension that was not part of POSIX.1-2008), you may also think about trying it instead of snprintf(): it will malloc() the memory for you, giving you pretty much the same comfort as C++ strings. At least on my system, malloc() is quite a bit faster than the builtin new-operator (don't ask me why, though).
Edit:
Just saw, that you used filenames in your example. If filenames are your concern, forget about the performance of string operation! Your code will spend virtually no time in them. Unless you have on the order of 100000 such string operations per second, they are irrelevant to your performance.
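One way to keep snprintf() while handling the "buffer not large enough" case is to ask it for the required length first. This is a sketch, assuming a C99/C++11 conforming snprintf (a null buffer with size 0 returns the length that would be written); `make_name` is a made-up helper.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Format "<base>_test_no.<i>.txt" without guessing a buffer size:
// the first snprintf call only reports how many characters are needed.
inline std::string make_name(const char* base, int i) {
    int needed = std::snprintf(nullptr, 0, "%s_test_no.%d.txt", base, i);
    if (needed < 0)
        return std::string();  // snprintf reported an encoding error
    std::string out(static_cast<std::size_t>(needed) + 1, '\0');
    std::snprintf(&out[0], out.size(), "%s_test_no.%d.txt", base, i);
    out.resize(static_cast<std::size_t>(needed));  // drop the trailing '\0'
    return out;
}
```

This trades the fixed 255-byte buffer for one allocation and a second formatting pass, so it gives up part of the speed advantage in exchange for never truncating.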

If it's REALLY important, measure the two solutions. If not, whichever you think makes most sense from what data you have, company/private coding style standards, etc. Make sure you use an optimised build [with the same optimisation you are going to use in the actual production build, not -O3 because that is the highest, if your production build is using -O1]
I expect that either will be pretty close if you only do a few. If you have several millions, there may be a difference. Which is faster? I'd guess the second [1], but it depends on who wrote the implementation of snprintf and who wrote the std::string implementation. Both certainly have the potential to take a lot longer than you would expect from a naive approach to how the function works (and possibly also run faster than you'd expect)
[1] Because I have worked with printf, and it's not a simple function: it spends a lot of time messing about with various grokking of the format string. It's not very efficient (and I have looked at the ones in glibc and such too, and they are not noticeably better).
On the other hand, std::string functions are often inlined since they are template implementations, which improves the efficiency. The joker in the pack is the memory allocation for std::string that is likely to happen. Of course, if somehow baseLocation turns out to be rather large, you probably don't want to store it as a fixed-size local array anyway, so that evens out in that case.
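If the allocation is the concern, the std::string version can be written so that the result buffer is reserved once and then appended to. Whether this is measurably faster than the operator+ chain depends on the implementation and would need measuring; `make_name_reserved` is an illustrative name.

```cpp
#include <string>

// Same result as the operator+ chain, but the output buffer is reserved
// once up front so the appends into it do not reallocate.
inline std::string make_name_reserved(const std::string& base, int i) {
    const std::string number = std::to_string(i);
    std::string out;
    out.reserve(base.size() + number.size() + 13);  // "_test_no." (9) + ".txt" (4)
    out += base;
    out += "_test_no.";
    out += number;
    out += ".txt";
    return out;
}
```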

I would recommend using strcat in that case. It is by far the fastest method:

How to zero a vector<bool>?

I have a vector<bool> and I'd like to zero it out. I need the size to stay the same.
The normal approach is to iterate over all the elements and reset them. However, vector<bool> is a specially optimized container that, depending on implementation, may store only one bit per element. Is there a way to take advantage of this to clear the whole thing efficiently?
bitset, the fixed-length variant, has the set function. Does vector<bool> have something similar?

There seem to be a lot of guesses but very few facts in the answers that have been posted so far, so perhaps it would be worthwhile to do a little testing.
#include <algorithm>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <time.h>
// Set random bits, then count them so the work cannot be optimized away.
int seed(std::vector<bool> &b) {
    srand(1);
    for (std::size_t i = 0; i < b.size(); i++)
        b[i] = ((rand() & 1) != 0);
    int count = 0;
    for (std::size_t i = 0; i < b.size(); i++)
        if (b[i])
            ++count;
    return count;
}
int main() {
    std::vector<bool> bools(1024 * 1024 * 32);
    int count1 = seed(bools);
    clock_t start = clock();
    bools.assign(bools.size(), false);
    double using_assign = double(clock() - start) / CLOCKS_PER_SEC;
    int count2 = seed(bools);
    start = clock();
    for (std::size_t i = 0; i < bools.size(); i++)
        bools[i] = false;
    double using_loop = double(clock() - start) / CLOCKS_PER_SEC;
    int count3 = seed(bools);
    start = clock();
    size_t size = bools.size();
    bools.clear();
    bools.resize(size);
    double using_clear = double(clock() - start) / CLOCKS_PER_SEC;
    int count4 = seed(bools);
    start = clock();
    std::fill(bools.begin(), bools.end(), false);
    double using_fill = double(clock() - start) / CLOCKS_PER_SEC;
    std::cout << "Time using assign: " << using_assign << "\n";
    std::cout << "Time using loop: " << using_loop << "\n";
    std::cout << "Time using clear: " << using_clear << "\n";
    std::cout << "Time using fill: " << using_fill << "\n";
    std::cout << "Ignore: " << count1 << "\t" << count2 << "\t" << count3 << "\t" << count4 << "\n";
}
So this creates a vector, sets some randomly selected bits in it, counts them, and clears them (and repeats). The setting/counting/printing is done to ensure that even with aggressive optimization, the compiler can't/won't optimize out our code to clear the vector.
I found the results interesting, to say the least. First the result with VC++:
Time using assign: 0.141
Time using loop: 0.068
Time using clear: 0.141
Time using fill: 0.087
Ignore: 16777216 16777216 16777216 16777216
So, with VC++, the fastest method is what you'd probably initially think of as the most naive -- a loop that assigns to each individual item. With g++, the results are just a tad different though:
Time using assign: 0.002
Time using loop: 0.08
Time using clear: 0.002
Time using fill: 0.001
Ignore: 16777216 16777216 16777216 16777216
Here, the loop is (by far) the slowest method (and the others are basically tied -- the 1 ms difference in speed isn't really repeatable).
For what it's worth, in spite of this part of the test showing up as much faster with g++, the overall times were within 1% of each other (4.944 seconds for VC++, 4.915 seconds for g++).
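For reference, the clearing methods compared above can be isolated into small functions, together with a check that the size is kept and no bit remains set. This is a sketch for experimenting, not the benchmark itself; the function names are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Three ways to clear every element while keeping the size unchanged.
inline void zero_with_assign(std::vector<bool>& v) { v.assign(v.size(), false); }
inline void zero_with_fill(std::vector<bool>& v) { std::fill(v.begin(), v.end(), false); }
inline void zero_with_loop(std::vector<bool>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = false;
}

// Number of elements still set; confirms that a method cleared everything.
inline std::size_t count_set(const std::vector<bool>& v) {
    return static_cast<std::size_t>(std::count(v.begin(), v.end(), true));
}
```

Since the ranking differed between VC++ and g++ in the measurements above, it is worth timing these on the compiler actually in use.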

Try
v.assign(v.size(), false);
Have a look at this link:
http://www.cplusplus.com/reference/vector/vector/assign/
Or the following
std::fill(v.begin(), v.end(), 0)

You are out of luck. std::vector<bool> is a specialization that apparently does not even guarantee contiguous memory or random access iterators (or even forward?!), at least based on my reading of cppreference -- decoding the standard would be the next step.
So write implementation specific code, pray and use some standard zeroing technique, or do not use the type. I vote 3.
The received wisdom is that it was a mistake, and may become deprecated. Use a different container if possible. And definitely do not mess around with the internal guts, or rely on its packing. Check if you have dynamic bitset in your std library mayhap, or roll your own wrapper around std::vector<unsigned char>.

I ran into this as a performance issue recently. I hadn't tried looking for answers on the web but did find that using assignment with the constructor was 10x faster using g++ O3 (Debian 4.7.2-5) 4.7.2. I found this question because I was looking to avoid the additional malloc. Looks like the assign is optimized as well as the constructor and about twice as good in my benchmark.
unsigned sz = v.size(); for (unsigned ii = 0; ii != sz; ++ii) v[ii] = false;
v = std::vector<bool>(sz, false); // 10x faster
v.assign(sz, false); // 20x faster
So, I wouldn't say to shy away from using the specialization of vector<bool>; just be very cognizant of the bit vector representation.

Use the std::vector<bool>::assign method, which is provided for this purpose.
If an implementation is specialized for bool, then assign is most likely also implemented appropriately.

If you're able to switch from vector<bool> to a custom bit vector representation, then you can use a representation designed specifically for fast clear operations, and get some potentially quite significant speedups (although not without tradeoffs).
The trick is to use integers per bit vector entry and a single 'rolling threshold' value that determines which entries actually then evaluate to true.
You can then clear the bit vector by just increasing the single threshold value, without touching the rest of the data (until the threshold overflows).
A more complete write up about this, and some example code, can be found here.
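One possible reading of the rolling-threshold idea, sketched under stated assumptions: each entry stores a 32-bit stamp, an entry counts as true only if its stamp equals the current value, and clearing just advances that value. The class name `StampedBits` and its interface are hypothetical and are not taken from the linked write-up.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

class StampedBits {
public:
    explicit StampedBits(std::size_t n) : stamps_(n, 0), current_(1) {}

    std::size_t size() const { return stamps_.size(); }
    bool test(std::size_t i) const { return stamps_[i] == current_; }
    void set(std::size_t i, bool value) { stamps_[i] = value ? current_ : 0; }

    // O(1) clear: advance the current stamp so every stored stamp is stale.
    // Only when the counter wraps around is the storage actually rewritten.
    void clear() {
        if (++current_ == 0) {
            std::fill(stamps_.begin(), stamps_.end(), 0);
            current_ = 1;
        }
    }

private:
    std::vector<std::uint32_t> stamps_;  // 0 is reserved to mean "false"
    std::uint32_t current_;              // always >= 1
};
```

The tradeoff is memory: this sketch spends 32 bits per entry where a packed vector<bool> typically spends one, so it only pays off when clears are frequent relative to the vector's size.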

It seems that one nice option hasn't been mentioned yet:
auto size = v.size();
v.resize(0);
v.resize(size);
The STL implementer will supposedly have picked the most efficient means of zeroising, so we don't even need to know which particular method that might be. And this works with real vectors as well (think templates), not just the std::vector<bool> monstrosity.
There can be a minuscule added advantage for reused buffers in loops (e.g. sieves, whatever), where you simply resize to whatever will be needed for the current round, instead of to the original size.

As an alternative to std::vector<bool>, check out boost::dynamic_bitset (https://www.boost.org/doc/libs/1_72_0/libs/dynamic_bitset/dynamic_bitset.html). You can zero one (ie, set each element to false) out by calling the reset() member function.
Like clearing, say, std::vector<int>, reset on a boost::dynamic_bitset can also compile down to a memset, whereas you probably won't get that with std::vector<bool>. For example, see https://godbolt.org/z/aqSGCi

Random Engine Differences

The C++11 standard specifies a number of different engines for random number generation: linear_congruential_engine, mersenne_twister_engine, subtract_with_carry_engine and so on. Obviously, this is a large change from the old usage of std::rand.
Obviously, one of the major benefits of (at least some) of these engines is the massively increased period length (it's built into the name for std::mt19937).
However, the differences between the engines are less clear. What are the strengths and weaknesses of the different engines? When should one be used over the other? Is there a sensible default that should generally be preferred?

From the explanations below, a linear engine seems to be faster but less random while the Mersenne Twister has a higher complexity and randomness. Subtract-with-carry random number engine is an improvement to the linear engine and it is definitely more random. In the last reference, it is stated that Mersenne Twister has higher complexity than the Subtract-with-carry random number engine.
Linear congruential random number engine
A pseudo-random number generator engine that produces unsigned integer numbers.
This is the simplest generator engine in the standard library. Its state is a single integer value, with the following transition algorithm:
x = (ax+c) mod m
Where x is the current state value, a and c are their respective template parameters, and m is its respective template parameter if this is greater than 0, or numeric_limits<UIntType>::max() + 1, otherwise.
Its generation algorithm is a direct copy of the state value.
This makes it an extremely efficient generator in terms of processing and memory consumption, but producing numbers with varying degrees of serial correlation, depending on the specific parameters used.
The random numbers generated by linear_congruential_engine have a period of m.
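The transition x = (ax+c) mod m can be written out directly. The sketch below hard-codes the std::minstd_rand parameters (a = 48271, c = 0, m = 2147483647); `lcg_step` is an illustrative name, and the product is computed in 64 bits so it cannot overflow before the reduction.

```cpp
#include <cstdint>
#include <random>

// One step of x = (a*x + c) mod m with the std::minstd_rand parameters.
constexpr std::uint32_t lcg_step(std::uint32_t x) {
    return static_cast<std::uint32_t>((48271ULL * x + 0ULL) % 2147483647ULL);
}
```

Starting from x = 1 (the default seed), repeated calls reproduce the sequence of a default-constructed std::minstd_rand, since the engine's output is a direct copy of its state.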
Mersenne twister random number engine
A pseudo-random number generator engine that produces unsigned integer numbers in the closed interval [0,2^w-1].
The algorithm used by this engine is optimized to compute large series of numbers (such as in Monte Carlo experiments) with an almost uniform distribution in the range.
The engine has an internal state sequence of n integer elements, which is filled with a pseudo-random series generated on construction or by calling member function seed.
The internal state sequence becomes the source for n elements: When the state is advanced (for example, in order to produce a new random number), the engine alters the state sequence by twisting the current value using xor mask a on a mix of bits determined by parameter r that come from that value and from a value m elements away (see operator() for details).
The random numbers produced are tempered versions of these twisted values. The tempering is a sequence of shift and xor operations defined by parameters u, d, s, b, t, c and l applied on the selected state value (see operator()).
The random numbers generated by mersenne_twister_engine have, for suitably chosen parameters, a period equal to the Mersenne number 2^(n*w-r)-1 (for mt19937, with n = 624, w = 32 and r = 31, this is 2^19937-1).
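The 19937 in the engine names can be checked against the engines' public constants: n words of w bits each, minus the r bits masked off, gives the exponent of the period. A small sketch (`period_exponent` is a made-up helper; `state_size`, `word_size` and `mask_bits` are the standard static members for n, w and r):

```cpp
#include <cstddef>
#include <random>

// Exponent p of the period 2^p - 1, computed as n*w - r from the
// engine's static members.
template <class MT>
constexpr std::size_t period_exponent() {
    return MT::state_size * MT::word_size - MT::mask_bits;
}
```

Both std::mt19937 (624 words of 32 bits) and std::mt19937_64 (312 words of 64 bits) yield 19937.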
Subtract-with-carry random number engine
A pseudo-random number generator engine that produces unsigned integer numbers.
The algorithm used by this engine is a lagged fibonacci generator, with a state sequence of r integer elements, plus one carry value.
Lagged Fibonacci generators have a maximum period of (2^k - 1) * 2^(M-1) if addition or subtraction is used (where k is the longer lag and 2^M is the modulus). The initialization of LFGs is a very complex problem. The output of LFGs is very sensitive to initial conditions, and statistical defects may appear initially but also periodically in the output sequence unless extreme care is taken. Another potential problem with LFGs is that the mathematical theory behind them is incomplete, making it necessary to rely on statistical tests rather than theoretical performance.
And finally from the documentation of random:
The choice of which engine to use involves a number of tradeoffs: the linear congruential engine is moderately fast and has a very small storage requirement for state. The lagged Fibonacci generators are very fast even on processors without advanced arithmetic instruction sets, at the expense of greater state storage and sometimes less desirable spectral characteristics. The Mersenne Twister is slower and has greater state storage requirements but with the right parameters has the longest non-repeating sequence with the most desirable spectral characteristics (for a given definition of desirable).

I think that the point is that random generators have different properties, which can make them more suitable or not for a given problem.
The period length is one of the properties.
The quality of the random numbers can also be important.
The performance of the generator can also be an issue.
Depending on your need, you might take one generator or another one. E.g., if you need fast random numbers but do not really care for the quality, an LCG might be a good option. If you want better quality random numbers, the Mersenne Twister is probably a better option.
To help you make your choice, there are some standard tests and results (I definitely like the table on p. 29 of this paper).
EDIT: From the paper,
The LCG (LCG(***) in the paper) family are the fastest generators, but with the poorest quality.
The Mersenne Twister (MT19937) is a little bit slower, but yields better random numbers.
The subtract-with-carry generators (SWB(***), I think) are way slower, but can yield better random properties when well tuned.

As the other answers forget about ranlux, here is a small note by an AMD developer that recently ported it to OpenCL:
https://community.amd.com/thread/139236
RANLUX is also one of very few (the only one I know of actually) PRNGs that has an underlying theory explaining why it generates "random" numbers, and why they are good. Indeed, if the theory is correct (and I don't know of anyone who has disputed it), RANLUX at the highest luxury level produces completely decorrelated numbers down to the last bit, with no long-range correlations as long as we stay well below the period (10^171). Most other generators can say very little about their quality (like Mersenne Twister, KISS etc.) They must rely on passing statistical tests.
Physicists at CERN are fans of this PRNG. 'nuff said.

Some of the information in these other answers conflicts with my findings. I've run tests on Windows 8.1 using Visual Studio 2013, and consistently I've found mersenne_twister_engine to be both higher quality and significantly faster than either linear_congruential_engine or subtract_with_carry_engine. This leads me to believe, when the information in the other answers is taken into account, that the specific implementation of an engine has a significant impact on performance.
This is of great surprise to nobody, I'm sure, but it's not mentioned in the other answers where mersenne_twister_engine is said to be slower. I have no test results for other platforms and compilers, but with my configuration, mersenne_twister_engine is clearly the superior choice when considering period, quality, and speed performance. I have not profiled memory usage, so I cannot speak to the space requirement property.
Here's the code I'm using to test with (to make portable, you should only have to replace the windows.h QueryPerformanceXxx() API calls with an appropriate timing mechanism):
// compile with: cl.exe /EHsc
#include <random>
#include <iostream>
#include <windows.h>
using namespace std;
void test_lc(const int a, const int b, const int s) {
/*
typedef linear_congruential_engine<unsigned int, 48271, 0, 2147483647> minstd_rand;
*/
minstd_rand gen(1729);
uniform_int_distribution<> distr(a, b);
for (int i = 0; i < s; ++i) {
distr(gen);
}
}
void test_mt(const int a, const int b, const int s) {
/*
typedef mersenne_twister_engine<unsigned int, 32, 624, 397,
31, 0x9908b0df,
11, 0xffffffff,
7, 0x9d2c5680,
15, 0xefc60000,
18, 1812433253> mt19937;
*/
mt19937 gen(1729);
uniform_int_distribution<> distr(a, b);
for (int i = 0; i < s; ++i) {
distr(gen);
}
}
void test_swc(const int a, const int b, const int s) {
/*
typedef subtract_with_carry_engine<unsigned int, 24, 10, 24> ranlux24_base;
*/
ranlux24_base gen(1729);
uniform_int_distribution<> distr(a, b);
for (int i = 0; i < s; ++i) {
distr(gen);
}
}
int main()
{
int a_dist = 0;
int b_dist = 1000;
int samples = 100000000;
cout << "Testing with " << samples << " samples." << endl;
LARGE_INTEGER ElapsedTime;
double ElapsedSeconds = 0;
LARGE_INTEGER Frequency;
QueryPerformanceFrequency(&Frequency);
double TickInterval = 1.0 / ((double) Frequency.QuadPart);
LARGE_INTEGER StartingTime;
LARGE_INTEGER EndingTime;
QueryPerformanceCounter(&StartingTime);
test_lc(a_dist, b_dist, samples);
QueryPerformanceCounter(&EndingTime);
ElapsedTime.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedSeconds = ElapsedTime.QuadPart * TickInterval;
cout << "linear_congruential_engine time: " << ElapsedSeconds << endl;
QueryPerformanceCounter(&StartingTime);
test_mt(a_dist, b_dist, samples);
QueryPerformanceCounter(&EndingTime);
ElapsedTime.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedSeconds = ElapsedTime.QuadPart * TickInterval;
cout << " mersenne_twister_engine time: " << ElapsedSeconds << endl;
QueryPerformanceCounter(&StartingTime);
test_swc(a_dist, b_dist, samples);
QueryPerformanceCounter(&EndingTime);
ElapsedTime.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedSeconds = ElapsedTime.QuadPart * TickInterval;
cout << "subtract_with_carry_engine time: " << ElapsedSeconds << endl;
}
Output:
Testing with 100000000 samples.
linear_congruential_engine time: 10.0821
mersenne_twister_engine time: 6.11615
subtract_with_carry_engine time: 9.26676
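To make the timing portable, the QueryPerformanceXxx() calls could be replaced with std::chrono::steady_clock. A minimal sketch of such a wrapper follows; the name `elapsed_seconds` is made up.

```cpp
#include <chrono>
#include <utility>

// Run f() once and return the elapsed wall time in seconds, measured
// with std::chrono::steady_clock instead of QueryPerformanceCounter.
template <class F>
double elapsed_seconds(F&& f) {
    auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Each measurement in main() would then reduce to a single call such as `elapsed_seconds([&] { test_mt(a_dist, b_dist, samples); })`.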

I just saw this answer from Marnos and decided to test it myself. I used std::chrono::high_resolution_clock to time 100000 samples 100 times to produce an average. I measured everything in std::chrono::nanoseconds and ended up with different results:
std::minstd_rand had an average of 28991658 nanoseconds
std::mt19937 had an average of 29871710 nanoseconds
ranlux48_base had an average of 29281677 nanoseconds
This is on a Windows 7 machine. Compiler is Mingw-Builds 4.8.1 64bit. This is obviously using the C++11 flag and no optimisation flags.
When I turn on -O3 optimisations, std::minstd_rand and ranlux48_base actually run faster than the implementation of high_resolution_clock can measure; however, std::mt19937 still takes 730045 nanoseconds, or about 3/4 of a millisecond.
So, as he said, it's implementation specific, but at least in GCC the average time seems to stick to what the descriptions in the accepted answer say. Mersenne Twister seems to benefit the least from optimizations, whereas the other two really just throw out the random numbers unbelievably fast once you factor in compiler optimizations.
As an aside, I'd been using Mersenne Twister engine in my noise generation library (it doesn't precompute gradients), so I think I'll switch to one of the others to really see some speed improvements. In my case, the "true" randomness doesn't matter.
Code:
#include <iostream>
#include <iomanip>
#include <chrono>
#include <random>
using namespace std;
using namespace std::chrono;
int main()
{
    minstd_rand linearCongruentialEngine;
    mt19937 mersenneTwister;
    ranlux48_base subtractWithCarry;
    uniform_real_distribution<float> distro;
    int numSamples = 100000;
    int repeats = 100;
    long long int avgL = 0;
    long long int avgM = 0;
    long long int avgS = 0;
    cout << "results:" << endl;
    for(int j = 0; j < repeats; ++j)
    {
        cout << "start of sequence: " << j << endl;
        auto start = high_resolution_clock::now();
        for(int i = 0; i < numSamples; ++i)
            distro(linearCongruentialEngine);
        auto stop = high_resolution_clock::now();
        auto L = duration_cast<nanoseconds>(stop-start).count();
        avgL += L;
        cout << "Linear Congruential:\t" << L << endl;
        start = high_resolution_clock::now();
        for(int i = 0; i < numSamples; ++i)
            distro(mersenneTwister);
        stop = high_resolution_clock::now();
        auto M = duration_cast<nanoseconds>(stop-start).count();
        avgM += M;
        cout << "Mersenne Twister:\t" << M << endl;
        start = high_resolution_clock::now();
        for(int i = 0; i < numSamples; ++i)
            distro(subtractWithCarry);
        stop = high_resolution_clock::now();
        auto S = duration_cast<nanoseconds>(stop-start).count();
        avgS += S;
        cout << "Subtract With Carry:\t" << S << endl;
    }
    cout << setprecision(10) << "\naverage:\nLinear Congruential: " << (long double)(avgL/repeats)
         << "\nMersenne Twister: " << (long double)(avgM/repeats)
         << "\nSubtract with Carry: " << (long double)(avgS/repeats) << endl;
}

It's a trade-off really. A PRNG like Mersenne Twister is better because it has an extremely large period and other good statistical properties.
But a large-period PRNG takes up more memory (for maintaining the internal state) and also takes more time to generate a random number (due to complex transitions and post-processing).
Choose a PRNG depending on the needs of your application. When in doubt, use Mersenne Twister; it's the default in many tools.

In general, mersenne twister is the best (and fastest) RNG, but it requires some space (about 2.5 kilobytes). Which one suits your need depends on how many times you need to instantiate the generator object. (If you need to instantiate it only once, or a few times, then MT is the one to use. If you need to instantiate it millions of times, then perhaps something smaller.)
Some people report that MT is slower than some of the others. According to my experiments, this depends a lot on your compiler optimization settings. Most importantly the -march=native setting may make a huge difference, depending on your host architecture.
I ran a small program to test the speed of different generators, and their sizes, and got this:
std::mt19937 (2504 bytes): 1.4714 s
std::mt19937_64 (2504 bytes): 1.50923 s
std::ranlux24 (120 bytes): 16.4865 s
std::ranlux48 (120 bytes): 57.7741 s
std::minstd_rand (4 bytes): 1.04819 s
std::minstd_rand0 (4 bytes): 1.33398 s
std::knuth_b (1032 bytes): 1.42746 s
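The byte counts in such a table are simply the object sizes of the engine types, which vary between standard library implementations. A sketch of how they could be obtained (`engine_bytes` is a made-up helper):

```cpp
#include <cstddef>
#include <random>

// In-memory size of an engine object. The value depends on the standard
// library in use, so figures from one platform need not match another.
template <class Engine>
constexpr std::size_t engine_bytes() {
    return sizeof(Engine);
}
```

For example, `engine_bytes<std::mt19937_64>()` must be at least 312 * 8 bytes, because the engine has to hold 312 state words of 64 bits each.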
