Using C++11's <random> library, I encountered an odd performance drop when using std::mt19937 (32- and 64-bit versions) in combination with a uniform_real_distribution (float or double, doesn't matter). Compared to a g++ compile, it's more than an order of magnitude slower!
The culprit isn't just the mt generator, as it's fast with a uniform_int_distribution. And it isn't a general flaw in the uniform_real_distribution, since that's fast with other generators like default_random_engine. Just that specific combination is oddly slow.
I'm not very familiar with the internals, but the Mersenne Twister algorithm is more or less strictly defined, so I don't see how a difference in implementation could account for this. The measurement program follows below, but here are my results for clang 3.4 and gcc 4.8.1 on a 64-bit Linux machine:
gcc 4.8.1
runtime_int_default: 185.6
runtime_int_mt: 179.198
runtime_int_mt_64: 175.195
runtime_float_default: 45.375
runtime_float_mt: 58.144
runtime_float_mt_64: 94.188
clang 3.4
runtime_int_default: 215.096
runtime_int_mt: 201.064
runtime_int_mt_64: 199.836
runtime_float_default: 55.143
runtime_float_mt: 744.072 <--- this and
runtime_float_mt_64: 783.293 <- this is slow
The program that generates these numbers, so you can try it out yourself:
#include <iostream>
#include <vector>
#include <chrono>
#include <random>

template <typename T_rng, typename T_dist>
double time_rngs(T_rng& rng, T_dist& dist, int n) {
    std::vector<typename T_dist::result_type> vec(n, 0);
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i)
        vec[i] = dist(rng);
    auto t2 = std::chrono::high_resolution_clock::now();
    auto runtime = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() / 1000.0;
    auto sum = vec[0]; // access to keep the compiler from optimizing the loop away
    return runtime;
}

int main() {
    const int n = 10000000;
    unsigned seed = std::chrono::system_clock::now().time_since_epoch().count();
    std::default_random_engine rng_default(seed);
    std::mt19937 rng_mt(seed);
    std::mt19937_64 rng_mt_64(seed);
    std::uniform_int_distribution<int> dist_int(0, 1000);
    std::uniform_real_distribution<float> dist_float(0.0, 1.0);

    // print max values
    std::cout << "rng_default.max(): " << rng_default.max() << std::endl;
    std::cout << "rng_mt.max(): " << rng_mt.max() << std::endl;
    std::cout << "rng_mt_64.max(): " << rng_mt_64.max() << std::endl << std::endl;

    std::cout << "runtime_int_default: " << time_rngs(rng_default, dist_int, n) << std::endl;
    std::cout << "runtime_int_mt: " << time_rngs(rng_mt, dist_int, n) << std::endl;
    std::cout << "runtime_int_mt_64: " << time_rngs(rng_mt_64, dist_int, n) << std::endl;
    std::cout << "runtime_float_default: " << time_rngs(rng_default, dist_float, n) << std::endl;
    std::cout << "runtime_float_mt: " << time_rngs(rng_mt, dist_float, n) << std::endl;
    std::cout << "runtime_float_mt_64: " << time_rngs(rng_mt_64, dist_float, n) << std::endl;
}
Compile via clang++ -O3 -std=c++11 random.cpp, or with g++ respectively. Any ideas?
edit: Finally, Matthieu M. had a great idea: The culprit is inlining, or rather a lack thereof. Increasing the clang inlining limit eliminated the performance penalty. That actually solved a number of performance oddities I encountered. Thanks, I learned something new.
As already stated in the comments, the problem is caused by the fact that gcc inlines more aggressively than clang. If we make clang inline very aggressively, the effect disappears:
Compiling your code with g++ -O3 yields
runtime_int_default: 3000.32
runtime_int_mt: 3112.11
runtime_int_mt_64: 3069.48
runtime_float_default: 859.14
runtime_float_mt: 1027.05
runtime_float_mt_64: 1777.48
while clang++ -O3 -mllvm -inline-threshold=10000 yields
runtime_int_default: 3623.89
runtime_int_mt: 751.484
runtime_int_mt_64: 751.132
runtime_float_default: 1072.53
runtime_float_mt: 968.967
runtime_float_mt_64: 1781.34
Apparently, clang now out-inlines gcc in the int_mt cases, but all of the other runtimes are now in the same order of magnitude. I used gcc 4.8.3 and clang 3.4 on Fedora 20 64 bit.
I'm trying to parallelize some old code using the execution policies from C++17. My sample code is below:
#include <cstdlib>
#include <chrono>
#include <iostream>
#include <algorithm>
#include <execution>
#include <vector>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::duration<double>;

constexpr auto NUM = 100'000'000U;

double func()
{
    return rand();
}

int main()
{
    std::vector<double> v(NUM);

    // ------ feature testing
    std::cout << "__cpp_lib_execution : " << __cpp_lib_execution << std::endl;
    std::cout << "__cpp_lib_parallel_algorithm: " << __cpp_lib_parallel_algorithm << std::endl;

    // ------ fill the vector with random numbers sequentially
    auto const startTime1 = Clock::now();
    std::generate(std::execution::seq, v.begin(), v.end(), func);
    Duration const elapsed1 = Clock::now() - startTime1;
    std::cout << "std::execution::seq: " << elapsed1.count() << " sec." << std::endl;

    // ------ fill the vector with random numbers in parallel
    auto const startTime2 = Clock::now();
    std::generate(std::execution::par, v.begin(), v.end(), func);
    Duration const elapsed2 = Clock::now() - startTime2;
    std::cout << "std::execution::par: " << elapsed2.count() << " sec." << std::endl;
}
The program output on my Linux desktop:
__cpp_lib_execution : 201902
__cpp_lib_parallel_algorithm: 201603
std::execution::seq: 0.971162 sec.
std::execution::par: 25.0349 sec.
Why does the parallel version perform 25 times worse than the sequential one?
Compiler: g++ (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0
The thread-safety of rand is implementation-defined, which means either:
Your code is wrong in the parallel case, or
It's effectively serial, with a highly contended lock, which dramatically increases the overhead in the parallel case and gives incredibly poor performance.
Based on your results, I'm guessing #2 applies, but it could be both.
Either way, the answer is: rand is a terrible test case for parallelism.
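For instance, here is a minimal sketch of a better-behaved generator (assuming a standard library whose parallel algorithms are actually enabled, e.g. a recent libstdc++ built against TBB): give each thread its own engine via thread_local, so there is no shared state to contend on. The function name threadSafeRandom is just an illustration, not part of the question's code.

#include <algorithm>
#include <execution>
#include <iostream>
#include <random>
#include <vector>

// Hypothetical replacement for func(): each thread gets its own engine and
// distribution, so parallel calls never touch shared state.
double threadSafeRandom()
{
    thread_local std::mt19937_64 engine{std::random_device{}()};
    thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(engine);
}

int main()
{
    std::vector<double> v(100'000'000);
    // Same std::generate call as in the question, but with a contention-free generator.
    std::generate(std::execution::par, v.begin(), v.end(), threadSafeRandom);
    std::cout << v.front() << std::endl;
}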
This program, built with -std=c++20 flag:
#include <iostream>

using namespace std;

int main() {
    auto [n, m] = minmax(3, 4);
    cout << n << " " << m << endl;
}
produces the expected result 3 4 when no -Ox optimization flags are used. With optimization flags it outputs 0 0. I tried it with multiple gcc versions and the -O1, -O2 and -O3 flags.
Clang 13 works fine, but clang 10 and 11 output 0 4198864 with optimization level -O2 and higher. ICC works fine. What is happening here?
The code is here: https://godbolt.org/z/Wd4ex8bej
The overload of std::minmax taking two arguments returns a pair of references to the arguments. The lifetimes of the arguments, however, end at the end of the full expression, since they are temporaries.
Therefore the output line is reading dangling references, causing your program to have undefined behavior.
Instead, you can use std::tie to receive the results by value:
#include <iostream>
#include <tuple>
#include <algorithm>

int main() {
    int n, m;
    std::tie(n, m) = std::minmax(3, 4);
    std::cout << n << " " << m << std::endl;
}
Or you can use the std::initializer_list overload of std::minmax, which returns a pair of values:
#include <iostream>
#include <algorithm>

int main() {
    auto [n, m] = std::minmax({3, 4});
    std::cout << n << " " << m << std::endl;
}
This question already has an answer here:
How to use RDRAND intrinsics?
I would like to be able to use the hardware random number generator, when available and regardless of whether an Intel or AMD CPU is running the code, with the C++ <random> library:
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <random>

void randomDeviceBernouilli(double bernoulliParameter, uint64_t trialCount) {
    std::random_device randomDevice;
    std::cout << "operator():" << randomDevice() << std::endl;
    std::cout << "default random_device characteristics:" << std::endl;
    std::cout << "minimum: " << randomDevice.min() << std::endl;
    std::cout << "maximum: " << randomDevice.max() << std::endl;
    std::cout << "entropy: " << randomDevice.entropy() << std::endl;

    std::bernoulli_distribution bernoulliDist(bernoulliParameter);
    uint64_t trueCount{ 0ull };
    for (uint64_t i = 0; i < trialCount; i++)
        if (bernoulliDist(randomDevice))
            trueCount++;

    std::cout << "Success " << trueCount << " out of " << trialCount << std::endl;
    const double successRate = (double)trueCount / (double)trialCount;
    std::cout << "Empirical: " << std::fixed << std::setprecision(8) << std::setw(10) << successRate
              << " Theoretical: " << bernoulliParameter << std::endl;
}
According to this article, entropy() should have returned 0 on a CPU without RDRAND, such as the Sandy Bridge i7-2670QM (on which I've tested; RDRAND first appeared in its successor, Ivy Bridge), but it is always 32 in Visual Studio as documented here. It has been suggested that the absence of a random device may cause operator() to throw an exception, but that does not happen either.
One can use the intrinsic int _rdrand32_step(unsigned int* val), for example, but that only draws from a uniform distribution; I need to be able to make use of the distributions available in the C++ <random> library.
Also, the code should make use of the hardware generator on an AMD CPU.
What is the proper way to use the hardware random number generator (RDRAND) with the C++ <random> library in Visual Studio (2017, 2019)?
In my opinion you should create your own random number engine that can be used by all the distributions available in <random>.
Something like
#include <exception>
#include <immintrin.h>
#include <iostream>
#include <limits>
#include <ostream>
#include <random>
#include <stdexcept>
#include <string>

using namespace std::string_literals;

class rdrand
{
public:
    using result_type = unsigned int;

    static constexpr result_type min() { return std::numeric_limits<result_type>::min(); }
    static constexpr result_type max() { return std::numeric_limits<result_type>::max(); }

    result_type operator()()
    {
        result_type x;
        auto success = _rdrand32_step(&x);
        if (!success)
        {
            throw std::runtime_error("rdrand32_step failed to generate a valid random number");
        }
        return x;
    }
};

int main()
try
{
    auto engine = rdrand();
    auto distribution = std::bernoulli_distribution(0.75);

    // Flip a biased coin with probability of heads = 0.75
    for (auto i = 0; i != 20; ++i)
    {
        auto outcome = distribution(engine);
        std::cout << (outcome ? "heads"s : "tails"s) << std::endl;
    }
    return 0;
}
catch (std::exception const& exception)
{
    std::cerr << "Standard exception thrown with message: \""s << exception.what() << "\""s << std::endl;
}
catch (...)
{
    std::cerr << "Unknown exception thrown"s << std::endl;
}
Whether the rdrand instruction is available should be external to the class. You could add code in main to test for it.
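For example, here is a minimal sketch of such a check using the __cpuid intrinsic from <intrin.h> (MSVC): CPUID leaf 1 reports RDRAND support in ECX bit 30 on both Intel and AMD. The function name rdrand_supported is just an illustration, not part of the answer's engine.

#include <intrin.h>
#include <iostream>

// Returns true if the CPU advertises RDRAND (CPUID leaf 1, ECX bit 30).
bool rdrand_supported()
{
    int cpuInfo[4] = { 0, 0, 0, 0 };
    __cpuid(cpuInfo, 1);
    return (cpuInfo[2] & (1 << 30)) != 0;
}

int main()
{
    if (rdrand_supported())
        std::cout << "RDRAND available; the rdrand engine can be used" << std::endl;
    else
        std::cout << "RDRAND not available; fall back to std::mt19937 or similar" << std::endl;
}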
Can somebody tell me how to compile example code for std::reduce, which is on cppreference?
#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <execution_policy>

int main()
{
    std::vector<double> v(10'000'007, 0.5);
    {
        auto t1 = std::chrono::high_resolution_clock::now();
        double result = std::accumulate(v.begin(), v.end(), 0.0);
        auto t2 = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double, std::milli> ms = t2 - t1;
        std::cout << std::fixed << "std::accumulate result " << result
                  << " took " << ms.count() << " ms\n";
    }
    {
        auto t1 = std::chrono::high_resolution_clock::now();
        double result = std::reduce(std::par, v.begin(), v.end());
        auto t2 = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double, std::milli> ms = t2 - t1;
        std::cout << "std::reduce result "
                  << result << " took " << ms.count() << " ms\n";
    }
}
I am working under Mint 18 with a fresh g++ 6.2 installed. During compilation I use only the -std=c++17 flag. What should I do if I want to use an execution policy (#include <execution_policy>)? Currently my compiler tells me that reduce is not a member of namespace std and that there is no such file as execution_policy.
Neither libstdc++ (used by default with g++) nor libc++ (used by default with clang) has implemented execution policy support yet.
Progress towards implementing new features in libstdc++ is tracked on this page. libc++ progress is tracked here. As you can see, P0024R2 "The Parallelism TS Should be Standardized" has not yet been implemented by either library as of the time of writing.
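For reference, here is a sketch of the standardized C++17 spelling, usable once you have a library that implements it (for example a later libstdc++ with TBB installed, compiled with something like g++ -std=c++17 reduce.cpp -ltbb): the header is <execution> and the policy is std::execution::par, rather than <execution_policy> and std::par.

#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> v(10'000'007, 0.5);
    // std::execution::par is the standardized parallel execution policy.
    double result = std::reduce(std::execution::par, v.begin(), v.end());
    std::cout << "std::reduce result " << result << '\n';
}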
I am writing some functionality for distributions and used the normal distribution to run tests comparing my implementation and C++ Boost.
Given the probability density function (pdf: http://www.mathworks.com/help/stats/normpdf.html), which I wrote like this:
double NormalDistribution1D::prob(double x) {
    return (1 / (sigma * (std::sqrt(boost::math::constants::pi<double>() * 2))))
         * std::exp((-1 / 2) * (((x - mu) / sigma) * ((x - mu) / sigma)));
}
To compare my results with the way it is done in C++ Boost:
boost::math::normal_distribution <> d(mu, sigma);
return boost::math::pdf(d, x);
I was not pleasantly surprised: my version took 44278 nanoseconds, Boost only 326.
So I played around a bit, wrote the method probboost in my NormalDistribution1D class, and compared all three:
void MATTest::runNormalDistribution1DTest1() {
    double mu = 0;
    double sigma = 1;
    double x = 0;

    std::chrono::high_resolution_clock::time_point tn_start = std::chrono::high_resolution_clock::now();
    NormalDistribution1D *n = new NormalDistribution1D(mu, sigma);
    double nres = n->prob(x);
    std::chrono::high_resolution_clock::time_point tn_end = std::chrono::high_resolution_clock::now();

    std::chrono::high_resolution_clock::time_point tdn_start = std::chrono::high_resolution_clock::now();
    NormalDistribution1D *n1 = new NormalDistribution1D(mu, sigma);
    double nres1 = n1->probboost(x);
    std::chrono::high_resolution_clock::time_point tdn_end = std::chrono::high_resolution_clock::now();

    std::chrono::high_resolution_clock::time_point td_start = std::chrono::high_resolution_clock::now();
    boost::math::normal_distribution<> d(mu, sigma);
    double dres = boost::math::pdf(d, x);
    std::chrono::high_resolution_clock::time_point td_end = std::chrono::high_resolution_clock::now();

    std::cout << "Mu : " << mu << "; Sigma: " << sigma << "; x" << x << std::endl;
    if (nres == dres) {
        std::cout << "Result" << nres << std::endl;
    } else {
        std::cout << "\033[1;31mRes incorrect: " << nres << "; Correct: " << dres << "\033[0m" << std::endl;
    }

    auto duration_n = std::chrono::duration_cast<std::chrono::nanoseconds>(tn_end - tn_start).count();
    auto duration_d = std::chrono::duration_cast<std::chrono::nanoseconds>(td_end - td_start).count();
    auto duration_dn = std::chrono::duration_cast<std::chrono::nanoseconds>(tdn_end - tdn_start).count();

    std::cout << "own boost: " << duration_dn << std::endl;
    if (duration_n < duration_d) {
        std::cout << "Boost: " << (duration_d) << "; own implementation: " << duration_n << std::endl;
    } else {
        std::cout << "\033[1;31mBoost faster: " << (duration_d) << "; than own implementation: " << duration_n << "\033[0m" << std::endl;
    }
}
The results (I compiled and ran the checking method three times):
own boost: 1082 Boost faster: 326; than own implementation: 44278
own boost: 774 Boost faster: 216; than own implementation: 34291
own boost: 769 Boost faster: 230; than own implementation: 33456
Now this puzzles me: how is it possible that the method from the class takes three times longer than the statements called directly?
My compiling options:
g++ -O2 -c -g -std=c++11 -MMD -MP -MF "build/Debug/GNU-Linux-x86/main.o.d" -o build/Debug/GNU-Linux-x86/main.o main.cpp
g++ -O2 -o ***Classes***
First of all, you are allocating your object dynamically, with new:
NormalDistribution1D *n = new NormalDistribution1D(mu, sigma);
double nres = n->prob(x);
If you did it just like you did with Boost, that alone would be sufficient to get the same (or comparable) speed:
NormalDistribution1D n(mu, sigma);
double nres = n.prob(x);
Now, I don't know if the way you spelled out your expression in NormalDistribution1D::prob() is significant, but I am skeptical that it would make any difference to write it in a more "optimized" way, because arithmetic expressions like that are the kind of thing compilers optimize very well. It may get faster if you use the -ffast-math switch, which gives the compiler more optimization freedom.
Also, if the definition of double NormalDistribution1D::prob(double x) is in another compilation unit (another .cpp file), the compiler will be unable to inline it, and that will also add noticeable overhead (maybe making it twice as slow, or less). In Boost, almost everything is implemented inside headers, so inlining will always happen when the compiler sees fit. You can overcome this problem if you compile and link with gcc's -flto switch.
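As an illustration of that last point, here is a sketch of a header-only version (the real class layout is not shown in the question, so the member names mu and sigma are assumed) that the compiler can inline from any translation unit without -flto:

// NormalDistribution1D.h -- sketch; member names mu and sigma are assumed
#include <boost/math/constants/constants.hpp>
#include <cmath>

class NormalDistribution1D {
public:
    NormalDistribution1D(double mu, double sigma) : mu(mu), sigma(sigma) {}

    // Defined in the header, so calls from any .cpp file can be inlined.
    double prob(double x) const {
        const double z = (x - mu) / sigma;
        return std::exp(-0.5 * z * z)
             / (sigma * std::sqrt(2 * boost::math::constants::pi<double>()));
    }

private:
    double mu;
    double sigma;
};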
You didn't compile with the -ffast-math option. That means the compiler cannot (in fact, must not!) simplify (-1 / 2)*(((x - mu) / sigma)*((x - mu) / sigma)) to a form similar to that used in boost::math::pdf:

expo = (x - mu) / sigma
expo *= -expo
expo /= 2
result = std::exp(expo)
result /= sigma * std::sqrt(2 * boost::math::constants::pi<double>())
The above form does the fast (but potentially unsafe / inaccurate) calculation by hand, without resorting to -ffast-math.
Secondly, the time difference between the above and your code is minimal compared to the time needed to allocate from the heap (new) vs. the stack (local variable). You are timing the cost of allocating dynamic memory.
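To make that concrete, here is a minimal sketch (reusing the hypothetical header-only NormalDistribution1D sketched in the previous answer) that constructs the object on the stack before the clock starts, so only prob() itself is measured:

#include <chrono>
#include <iostream>
#include "NormalDistribution1D.h" // hypothetical header-only class from the sketch above

int main() {
    NormalDistribution1D n(0.0, 1.0); // constructed outside the timed region, no heap allocation

    auto t0 = std::chrono::high_resolution_clock::now();
    double res = n.prob(0.0);
    auto t1 = std::chrono::high_resolution_clock::now();

    std::cout << res << " took "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
              << " ns" << std::endl;
}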