Why is my C++ code so much slower than R?

I have written the following code in R and C++, both implementing the same algorithm:
a) Simulate the random variable X 500 times. (X takes the value 0.9 with probability 0.5 and 1.1 with probability 0.5.)
b) Multiply the 500 simulated values together to get a single value, and save that value in a container.
c) Repeat 10000000 times, so that the container ends up holding 10000000 values.
R:
ptm <- proc.time()
steps <- 500
MCsize <- 10000000
a <- rbinom(MCsize,steps,0.5)
b <- rep(500,times=MCsize) - a
result <- rep(1.1,times=MCsize)^a*rep(0.9,times=MCsize)^b
proc.time()-ptm
C++:
#include <numeric>
#include <vector>
#include <iostream>
#include <random>
#include <thread>
#include <mutex>
#include <cmath>
#include <algorithm>
#include <chrono>
const size_t MCsize = 10000000;
std::mutex mutex1;
std::mutex mutex2;
unsigned seed_;
std::vector<double> cache;
void generatereturns(size_t steps, int RUNS){
mutex2.lock();
// setting seed
try{
std::mt19937 tmpgenerator(seed_);
seed_ = tmpgenerator();
std::cout << "SEED : " << seed_ << std::endl;
}catch(int exception){
mutex2.unlock();
}
mutex2.unlock();
// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(seed_);
for(int i = 0; i!= RUNS; ++i){
double power;
double returns;
power = distribution(generator);
returns = pow(0.9,power) * pow(1.1,(double)steps - power);
std::lock_guard<std::mutex> guard(mutex1);
cache.push_back(returns);
}
}
int main(){
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
size_t steps = 500;
seed_ = 777;
unsigned concurentThreadsSupported = std::max(std::thread::hardware_concurrency(),(unsigned)1);
int remainder = MCsize % concurentThreadsSupported;
std::vector<std::thread> threads;
// starting sub-thread simulations
if(concurentThreadsSupported != 1){
for(int i = 0 ; i != concurentThreadsSupported - 1; ++i){
if(remainder != 0){
threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported + 1));
remainder--;
}else{
threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported));
}
}
}
//starting main thread simulation
if(remainder != 0){
generatereturns(steps, MCsize / concurentThreadsSupported + 1);
remainder--;
}else{
generatereturns(steps, MCsize / concurentThreadsSupported);
}
for (auto& th : threads) th.join();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now() ;
typedef std::chrono::duration<int,std::milli> millisecs_t ;
millisecs_t duration( std::chrono::duration_cast<millisecs_t>(end-start) ) ;
std::cout << "Time elapsed : " << duration.count() << " milliseconds.\n" ;
return 0;
}
I can't understand why my R code is so much faster than my C++ code (3.29 s vs 12 s), even though I used four threads in the C++ version. Can anyone enlighten me? How should I improve my C++ code to make it run faster?
EDIT:
Thanks for all the advice! I reserved capacity for my vectors and reduced the amount of locking in my code. The crucial update in the generatereturns() function is:
std::vector<double> cache(MCsize);
std::vector<double>::iterator currit = cache.begin();
//.....
// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(seed_);
std::vector<double> tmpvec(RUNS);
for(int i = 0; i!= RUNS; ++i){
double power;
double returns;
power = distribution(generator);
returns = pow(0.9,power) * pow(1.1,(double)steps - power);
tmpvec[i] = returns;
}
std::lock_guard<std::mutex> guard(mutex1);
std::move(tmpvec.begin(),tmpvec.end(),currit);
currit += RUNS;
Instead of locking on every iteration, I fill a temporary vector and then use std::move to shift its elements into cache under a single lock. The elapsed time has now dropped to 1.9 seconds.

First of all, are you running it in release mode?
Switching from debug to release reduced the running time from ~15 s to ~4.5 s on my laptop (Windows 7, i5-3210M).
Also, reducing the number of threads from 4 to 2 in my case (I have only 2 cores, but with hyper-threading) further reduced the running time to ~2.4 s.
Changing the variable power to int (as jimifiki also suggested) offered a slight additional boost, reducing the time to ~2.3 s.
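For reference (my addition, not from the original answer): Release mode essentially just turns the optimizer on, so a rough command-line equivalent with the MSVC compiler would be
cl /O2 /EHsc main.cpp
where /O2 enables optimizations and /EHsc selects the standard C++ exception-handling model.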

I really enjoyed your question and I tried the code at home. I tried changing the random number generator; my implementation of std::binomial_distribution needs on average about 9.6 calls to generator().
I know the question is more about comparing R and C++ performance, but since you ask "How should I improve my C++ code to make it run faster?", I will insist on the pow optimization. You can easily avoid half of the pow calls by precomputing either 0.9^steps or 1.1^steps before the for loop. This makes your code run a bit faster:
double power1 = pow(0.9,steps);
double ratio = 1.1/0.9;
for(int i = 0; i!= RUNS; ++i){
...
returns = power1 * pow(ratio, (double)steps - power);
Analogously you can improve the R code:
...
ratio <- 1.1/0.9
pow1 <- 0.9^steps
result <- pow1 * ratio^a
...

Probably doesn't help you that much, but
start by using pow(double,int) when your exponent is an int.
int power;
returns = pow(0.9,power) * pow(1.1,(int)steps - power);
Can you see any improvement?


C++ Header file not creating random number

This is my first attempt at creating a header file. The solution is nonsense and nothing more than practice. It receives two numbers from the main file and is supposed to return a random entry from the vector. When I call it from a loop in the main file, the result increments by 3 each call instead of being random (diagnosed by returning the value of getEntry). The Randomizer code works correctly if I pull it out of the header file and run it directly as a program.
int RandomNumber::Randomizer(int a, int b){
std::vector < int > vecArray{};
int range = (b - a) + 1;
time_t nTime;
srand((unsigned)time(&nTime));
for (int i = a-1; i < b+1; i++) {
vecArray.push_back(i);
}
int getEntry = rand() % range + 1;
int returnValue = vecArray[getEntry];
vecArray.clear();
return returnValue;
}
From what I read, header files should generally not contain function and variable definitions. I suspect rand(), being a function, is the source of the problem.
How, if possible, can I get my header file to produce random numbers?
void random(){
double rangeMin = 1;
double rangeMax = 10;
size_t numSamples = 10;
thread_local std::mt19937 mt(std::random_device{}());
std::uniform_real_distribution<double> dist(rangeMin, rangeMax);
for (size_t i = 1; i <= numSamples; ++i) {
std::cout << dist(mt) << std::endl;
}
}
This function gives you a way to generate random numbers between two bounds; to use it you have to include <random>.
There are many cases where you will want to generate a random number. There are really two functions you need to know about for random number generation. The first is rand(); this function will only return a pseudo-random number. The way to seed it is to first call the srand() function.
Here is an example:
#include <iostream>
#include <ctime>
#include <cstdlib>
using namespace std;
int main () {
int i,j;
srand( (unsigned)time( NULL ) );
for( i = 0; i < 10; i++ ) {
j = rand();
cout <<" Random Number : " << j << endl;
}
return 0;
}
Use srand( (unsigned)time( NULL ) ); instead of passing your own time value, use NULL for the default setting.
You can also go here for more info.
I hope I answered your question! Have a nice day!
Ted Lyngmo gave me the idea that fixed the problem. Using <random> appears to work correctly in a header file.
I removed/changed the following:
time_t nTime;
srand((unsigned)time(&nTime));
int getEntry = rand() % range + 1;
and replaced them with:
std::random_device rd;
std::mt19937 gen(rd());
int getEntry = gen() % range + 1;
Issue resolved. Thank you everybody for your suggestions and comments!
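A side note, not part of the original answers: gen() % range still has a slight modulo bias. The idiomatic <random> way to pick an integer uniformly is std::uniform_int_distribution; a minimal sketch, assuming the same a and b bounds as in the question:
#include <random>

// Sketch: return a uniformly distributed integer in [a, b] without modulo bias.
int Randomizer(int a, int b) {
    static std::mt19937 gen(std::random_device{}());   // engine seeded once
    std::uniform_int_distribution<int> dist(a, b);      // inclusive bounds
    return dist(gen);
}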
As an experiment, I removed the vector and focused on the randomizer `srand(T)`, where `T` is the system time (`volatile time_t T = time(NULL)`). It turns out the system time does NOT change while the program runs (execution is simply too fast).
The function `rand()` generates a pseudo-random integer using a linear congruential generator: essentially it multiplies the seed by a larger unsigned integer, adds an increment, and truncates the result to the finite width of the `seed`. The randomizer `srand(T)` initializes the seed from the system time, or from any fixed number such as `srand(12345);`. A given seed produces a fixed sequence of random numbers. Without calling `srand(T)`, the generator behaves as if it had been seeded with 1. The seed is then advanced on every call to `rand()`.
In your code, you call the randomizer `srand(T)`, resetting the seed to the system time, on every run. But the system time has not changed, so you keep resetting the `seed` to the same number.
Run this test.
#include <cstdlib>
#include <iostream>
#include <ctime>
int Randomizer(int a, int b){
volatile time_t T = time(NULL);
std::cout << "time = " << T << std::endl;
srand(T);
std::cout << "rand() = " << rand() << std::endl;
return rand();
}
int main()
{
int n1 = 1, n2 = 8;
for(int i=0; i<5; ++i)
{
std::cout << Randomizer(n1, n2) << std::endl;
}
}
The seed is reset to the system time, which does not change during execution, so the same random number is produced every time.
$ ./a.exe
time = 1608049336
rand() = 9468
15874
time = 1608049336
rand() = 9468
15874
time = 1608049336
rand() = 9468
15874
time = 1608049336
rand() = 9468
15874
time = 1608049336
rand() = 9468
15874
In order to see the system time change, we add a pause in main():
int main()
{
int n1 = 1, n2 = 8;
for(int i=0; i<5; ++i)
{
std::cout << Randomizer(n1, n2) << std::endl;
system("pause");
}
}
We can observe the system time moving on...
$ ./a.exe
time = 1608050805
rand() = 14265
11107
Press any key to continue . . .
time = 1608050809
rand() = 14279
21332
Press any key to continue . . .
time = 1608050815
rand() = 14298
20287
Press any key to continue . . .
Because the system times differ only slightly, the first rand() value of each congruential sequence is also quite close, but the numbers that follow will be "seemingly" random. The principle for a congruential random generator is: once you have set the seed, don't change it until you need another independent series of random numbers. Therefore, call the srand(T) function just once, in main() or somewhere else that executes only once.
int main()
{
srand(time(NULL)); // >>>> just for this once <<<<
int n1 = 1, n2 = 8;
for(int i=0; i<5; ++i)
{
std::cout << Randomizer(n1, n2) << std::endl;
}
}

Why does threading floating point computations on the CPU make them take significantly longer?

I am currently working on a scientific simulation (gravitational n-body). I first wrote it with a naive single-threaded algorithm, and this performed acceptably for a small number of particles. I then multi-threaded this algorithm (it is embarrassingly parallel), and the program took about 3x as long. What follows is a minimal, complete, verifiable example of a trivial algorithm with similar properties, which writes its output to a file in /tmp (it is designed to run on Linux, but the C++ itself is standard). Be warned that if you decide to run this code, it will produce a 152.62 MB file. The data is written out to prevent the compiler from optimizing the computation out of the program.
#include <iostream>
#include <functional>
#include <thread>
#include <vector>
#include <atomic>
#include <random>
#include <fstream>
#include <chrono>
constexpr unsigned ITERATION_COUNT = 2000;
constexpr unsigned NUMBER_COUNT = 10000;
void runThreaded(unsigned count, unsigned batchSize, std::function<void(unsigned)> callback){
unsigned threadCount = std::thread::hardware_concurrency();
std::vector<std::thread> threads;
threads.reserve(threadCount);
std::atomic<unsigned> currentIndex(0);
for(unsigned i=0;i<threadCount;++i){
threads.emplace_back([&currentIndex, batchSize, count, callback]{
unsigned startAt = currentIndex.fetch_add(batchSize);
if(startAt >= count){
return;
}else{
for(unsigned i=0;i<count;++i){
unsigned index = startAt+i;
if(index >= count){
return;
}
callback(index);
}
}
});
}
for(std::thread &thread : threads){
thread.join();
}
}
void threadedTest(){
std::mt19937_64 rnd(0);
std::vector<double> numbers;
numbers.reserve(NUMBER_COUNT);
for(unsigned i=0;i<NUMBER_COUNT;++i){
numbers.push_back(rnd());
}
std::vector<double> newNumbers = numbers;
std::ofstream fout("/tmp/test-data.bin");
for(unsigned i=0;i<ITERATION_COUNT;++i) {
std::cout << "Iteration: " << i << "/" << ITERATION_COUNT << std::endl;
runThreaded(NUMBER_COUNT, 100, [&numbers, &newNumbers](unsigned x){
double total = 0;
for(unsigned y=0;y<NUMBER_COUNT;++y){
total += numbers[y]*(y-x)*(y-x);
}
newNumbers[x] = total;
});
fout.write(reinterpret_cast<char*>(newNumbers.data()), newNumbers.size()*sizeof(double));
std::swap(numbers, newNumbers);
}
}
void unThreadedTest(){
std::mt19937_64 rnd(0);
std::vector<double> numbers;
numbers.reserve(NUMBER_COUNT);
for(unsigned i=0;i<NUMBER_COUNT;++i){
numbers.push_back(rnd());
}
std::vector<double> newNumbers = numbers;
std::ofstream fout("/tmp/test-data.bin");
for(unsigned i=0;i<ITERATION_COUNT;++i){
std::cout << "Iteration: " << i << "/" << ITERATION_COUNT << std::endl;
for(unsigned x=0;x<NUMBER_COUNT;++x){
double total = 0;
for(unsigned y=0;y<NUMBER_COUNT;++y){
total += numbers[y]*(y-x)*(y-x);
}
newNumbers[x] = total;
}
fout.write(reinterpret_cast<char*>(newNumbers.data()), newNumbers.size()*sizeof(double));
std::swap(numbers, newNumbers);
}
}
int main(int argc, char *argv[]) {
if(argv[1][0] == 't'){
threadedTest();
}else{
unThreadedTest();
}
return 0;
}
When I run this (compiled with clang 7.0.1 on Linux), I get the following times from the Linux time command. The difference between them is similar to what I see in my real program. The entry labelled "real" is the relevant one for this question, as it is the wall-clock time the program takes to run.
Single-threaded:
real 6m27.261s
user 6m27.081s
sys 0m0.051s
Multi-threaded:
real 14m32.856s
user 216m58.063s
sys 0m4.492s
So I ask: what is causing this massive slowdown, when I expect a significant speedup (roughly a factor of 8, as I have an 8-core, 16-thread CPU)? I am not implementing this on the GPU because the next step is to change the algorithm from O(n²) to O(nlogn), and those changes are not amenable to a GPU either. The changed algorithm will differ less from my currently implemented O(n²) algorithm than the included example does. Lastly, I want to note that the subjective time per iteration (judged by the time between the iteration lines appearing) varies significantly in both the threaded and unthreaded runs.
It's kind of hard to follow this code, but I think you're duplicating work on a massive scale because each thread does nearly all the work, just skipping a small portion of it at the start.
I'm presuming the inner loop of runThreaded should be:
unsigned startAt = currentIndex.fetch_add(batchSize);
while (startAt < count) {
if (startAt >= count) {
return;
} else {
for(unsigned i=0;i<batchSize;++i){
unsigned index = startAt+i;
if(index >= count){
return;
}
callback(index);
}
}
startAt = currentIndex.fetch_add(batchSize);
}
Where i < batchSize is the key here. You should only do as much work as the batch dictates, not count times, which is the whole list minus the initial offset.
With this update the code runs significantly faster. I'm not sure whether it does all the required work, because the output is very minimal and it's hard to tell.
For easy parallelization over multiple CPUs I recommend using tbb::parallel_for. It uses the correct number of CPUs and splits the range for you, completely eliminating the risk of implementing it wrong. Alternatively, there is a parallel for_each in C++17. In other words, this problem has a number of good solutions.
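As a rough sketch of the C++17 route (my addition, not the answerer's code; with GCC's libstdc++ the parallel execution policies typically require linking against TBB, e.g. -ltbb):
#include <algorithm>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Sketch: a drop-in alternative to runThreaded(count, batchSize, callback).
// The standard library chooses the thread count and splits the range itself.
void runParallel(unsigned count, const std::function<void(unsigned)> &callback) {
    std::vector<unsigned> indices(count);
    std::iota(indices.begin(), indices.end(), 0u);   // 0, 1, ..., count-1
    std::for_each(std::execution::par, indices.begin(), indices.end(), callback);
}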
Vectorizing code is a difficult problem, and neither clang++-6 nor g++-8 auto-vectorizes the baseline code. Hence, for the SIMD version below I used the excellent Vc library ("portable, zero-overhead C++ types for explicitly data-parallel programming").
Below is a working benchmark that compares:
The baseline version.
SIMD version.
SIMD + multi-threading version.
#include <Vc/Vc>
#include <tbb/parallel_for.h>
#include <algorithm>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
constexpr int ITERATION_COUNT = 20;
constexpr int NUMBER_COUNT = 20000;
double baseline() {
double result = 0;
std::vector<double> newNumbers(NUMBER_COUNT);
std::vector<double> numbers(NUMBER_COUNT);
std::mt19937 rnd(0);
for(auto& n : numbers)
n = rnd();
for(int i = 0; i < ITERATION_COUNT; ++i) {
for(int x = 0; x < NUMBER_COUNT; ++x) {
double total = 0;
for(int y = 0; y < NUMBER_COUNT; ++y) {
auto d = (y - x);
total += numbers[y] * (d * d);
}
newNumbers[x] = total;
}
result += std::accumulate(newNumbers.begin(), newNumbers.end(), 0.);
swap(numbers, newNumbers);
}
return result;
}
double simd() {
double result = 0;
constexpr int SIMD_NUMBER_COUNT = NUMBER_COUNT / Vc::double_v::Size;
using vector_double_v = std::vector<Vc::double_v, Vc::Allocator<Vc::double_v>>;
vector_double_v newNumbers(SIMD_NUMBER_COUNT);
vector_double_v numbers(SIMD_NUMBER_COUNT);
std::mt19937 rnd(0);
for(auto& n : numbers) {
alignas(Vc::VectorAlignment) double t[Vc::double_v::Size];
for(double& v : t)
v = rnd();
n.load(t, Vc::Aligned);
}
Vc::double_v const incv(Vc::double_v::Size);
for(int i = 0; i < ITERATION_COUNT; ++i) {
Vc::double_v x(Vc::IndexesFromZero);
for(auto& new_n : newNumbers) {
Vc::double_v totals;
int y = 0;
for(auto const& n : numbers) {
for(unsigned j = 0; j < Vc::double_v::Size; ++j) {
auto d = y - x;
totals += n[j] * (d * d);
++y;
}
}
new_n = totals;
x += incv;
}
result += std::accumulate(newNumbers.begin(), newNumbers.end(), Vc::double_v{}).sum();
swap(numbers, newNumbers);
}
return result;
}
double simd_mt() {
double result = 0;
constexpr int SIMD_NUMBER_COUNT = NUMBER_COUNT / Vc::double_v::Size;
using vector_double_v = std::vector<Vc::double_v, Vc::Allocator<Vc::double_v>>;
vector_double_v newNumbers(SIMD_NUMBER_COUNT);
vector_double_v numbers(SIMD_NUMBER_COUNT);
std::mt19937 rnd(0);
for(auto& n : numbers) {
alignas(Vc::VectorAlignment) double t[Vc::double_v::Size];
for(double& v : t)
v = rnd();
n.load(t, Vc::Aligned);
}
Vc::double_v const v0123(Vc::IndexesFromZero);
for(int i = 0; i < ITERATION_COUNT; ++i) {
constexpr int SIMD_STEP = 4;
tbb::parallel_for(0, SIMD_NUMBER_COUNT, SIMD_STEP, [&](int ix) {
Vc::double_v xs[SIMD_STEP];
for(int is = 0; is < SIMD_STEP; ++is)
xs[is] = v0123 + (ix + is) * Vc::double_v::Size;
Vc::double_v totals[SIMD_STEP];
int y = 0;
for(auto const& n : numbers) {
for(unsigned j = 0; j < Vc::double_v::Size; ++j) {
for(int is = 0; is < SIMD_STEP; ++is) {
auto d = y - xs[is];
totals[is] += n[j] * (d * d);
}
++y;
}
}
std::copy_n(totals, SIMD_STEP, &newNumbers[ix]);
});
result += std::accumulate(newNumbers.begin(), newNumbers.end(), Vc::double_v{}).sum();
swap(numbers, newNumbers);
}
return result;
}
struct Stopwatch {
using Clock = std::chrono::high_resolution_clock;
using Seconds = std::chrono::duration<double>;
Clock::time_point start_ = Clock::now();
Seconds elapsed() const {
return std::chrono::duration_cast<Seconds>(Clock::now() - start_);
}
};
std::ostream& operator<<(std::ostream& s, Stopwatch::Seconds const& a) {
auto precision = s.precision(9);
s << std::fixed << a.count() << std::resetiosflags(std::ios_base::floatfield) << 's';
s.precision(precision);
return s;
}
void benchmark() {
Stopwatch::Seconds baseline_time;
{
Stopwatch s;
double result = baseline();
baseline_time = s.elapsed();
std::cout << "baseline: " << result << ", " << baseline_time << '\n';
}
{
Stopwatch s;
double result = simd();
auto time = s.elapsed();
std::cout << " simd: " << result << ", " << time << ", " << (baseline_time / time) << "x speedup\n";
}
{
Stopwatch s;
double result = simd_mt();
auto time = s.elapsed();
std::cout << " simd_mt: " << result << ", " << time << ", " << (baseline_time / time) << "x speedup\n";
}
}
int main() {
benchmark();
benchmark();
benchmark();
}
Timings:
baseline: 2.76582e+257, 6.399848397s
simd: 2.76582e+257, 1.600373449s, 3.99897x speedup
simd_mt: 2.76582e+257, 0.168638435s, 37.9501x speedup
Notes:
My machine supports AVX but not AVX-512, so it is roughly 4x speedup when using SIMD.
The simd_mt version uses 8 threads on my machine and larger SIMD steps. The theoretical speedup is 128x; in practice it is 38x.
Neither clang++-6 nor g++-8 can auto-vectorize the baseline code.
g++-8 generates considerably faster code for the SIMD versions than clang++-6.
Your heart is certainly in the right place minus a bug or two.
par_for is a complex issue that depends on the payload of your loop; there is no one-size-fits-all solution. The payload can be anything from a couple of additions to almost unbounded mutex blocks, for example when doing memory allocation.
The atomic-variable-as-work-item pattern has always worked well for me, but remember that atomic operations have a high cost on x86 (~400 cycles), and can even incur a high cost in an unexecuted branch, as I found to my peril.
Some permutation of the following is usually good. Choosing the right chunks_per_thread (as in your batchSize) is critical. If you don't trust your users, you can time a few test iterations of the loop to guess the best chunking level.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <future>
#include <thread>
#include <vector>
#include <stdio.h>
template<typename Func>
void par_for(int start, int end, int step, int chunks_per_thread, Func func) {
using namespace std;
using namespace chrono;
atomic<int> work_item{start};
vector<future<void>> futures(std::thread::hardware_concurrency());
for (auto &fut : futures) {
fut = async(std::launch::async, [&work_item, end, step, chunks_per_thread, &func]() {
for(;;) {
int wi = work_item.fetch_add(step * chunks_per_thread);
if (wi > end) break;
int wi_max = std::min(end, wi+step * chunks_per_thread);
while (wi < wi_max) {
func(wi);
wi += step;
}
}
});
}
for (auto &fut : futures) {
fut.wait();
}
}
int main() {
using namespace std;
using namespace chrono;
for (int k = 0; k != 2; ++k) {
auto t0 = high_resolution_clock::now();
constexpr int loops = 100000000;
if (k == 0) {
for (int i = 0; i != loops; ++i ) {
if (i % 10000000 == 0) printf("%d\n", i);
}
} else {
par_for(0, loops, 1, 100000, [](int i) {
if (i % 10000000 == 0) printf("%d\n", i);
});
}
auto t1 = high_resolution_clock::now();
duration<double, milli> ns = t1 - t0;
printf("k=%d %fms total\n", k, ns.count());
}
}
results
...
k=0 174.925903ms total
...
k=1 27.924738ms total
About a 6x speedup.
I avoid the term "embarrassingly parallel" as it is almost never the case. You pay exponentially higher costs the more resources you use on your journey from the level-1 cache (ns latency) to a globe-spanning cluster (ms latency). But I hope this code snippet is useful as an answer.

Performance issue with boost transform_iterator and counting_iterator

I am currently trying to benchmark various implementations of a large loop performing arbitrary work, and I found myself with a very slow version when using Boost transform_iterator and counting_iterator.
I designed a small benchmark with two loops that sum the products of all integers between 0 and SIZE-1 with an arbitrary integer (which I chose to be 1 in my example in order to avoid overflow).
Here's my code:
//STL
#include <iostream>
#include <algorithm>
#include <functional>
#include <numeric>
#include <limits>
#include <string>
#include <cstdlib>
#include <chrono>
//Boost
#include <boost/iterator/transform_iterator.hpp>
#include <boost/iterator/counting_iterator.hpp>
//Compile using
// g++ ./main.cpp -o test -std=c++11
//Launch using
// ./test 1
#define NRUN 10
#define SIZE 128*1024*1024
struct MultiplyByN
{
MultiplyByN( size_t N ): m_N(N){};
size_t operator()(int i) const { return i*m_N; }
const size_t m_N;
};
int main(int argc, char* argv[] )
{
int N = std::stoi( argv[1] );
size_t sum = 0;
//Initialize chrono helpers
auto start = std::chrono::steady_clock::now();
auto stop = std::chrono::steady_clock::now();
auto diff = stop - start;
double msec=std::numeric_limits<double>::max(); //Set min runtime to ridiculously high value
MultiplyByN op(N);
//Perform multiple run in order to get minimal runtime
for(int k = 0; k< NRUN; k++)
{
sum = 0;
start = std::chrono::steady_clock::now();
for(int i=0;i<SIZE;i++)
{
sum += op(i);
}
stop = std::chrono::steady_clock::now();
diff = stop - start;
//Compute minimum runtime
msec = std::min( msec, std::chrono::duration<double, std::milli>(diff).count() );
}
std::cout << "First version : Sum of values is "<< sum << std::endl;
std::cout << "First version : Minimal Runtime was "<< msec << " msec "<< std::endl;
msec=std::numeric_limits<double>::max(); //Reset min runtime to ridiculously high value
//Perform multiple run in order to get minimal runtime
for(int k = 0; k< NRUN; k++)
{
start = std::chrono::steady_clock::now();
//Functional way to express the summation
sum = std::accumulate( boost::make_transform_iterator(boost::make_counting_iterator(0), op ),
boost::make_transform_iterator(boost::make_counting_iterator(SIZE), op ),
(size_t)0, std::plus<size_t>() );
stop = std::chrono::steady_clock::now();
diff = stop - start;
//Compute minimum runtime
msec = std::min( msec, std::chrono::duration<double, std::milli>(diff).count() );
}
std::cout << "Second version : Sum of values is "<< sum << std::endl;
std::cout << "Second version version : Minimal Runtime was "<< msec << " msec "<< std::endl;
return EXIT_SUCCESS;
}
And the output I get:
./test 1
First version : Sum of values is 9007199187632128
First version : Minimal Runtime was 433.142 msec
Second version : Sum of values is 9007199187632128
Second version version : Minimal Runtime was 10910.7 msec
The "functional" version of my loop that uses std::accumulate is 25 times slower than the simple loop version, why so ?
Thank you in advance for your help
Based on your comment in the code, you've compiled this with
g++ ./main.cpp -o test -std=c++11
Since you didn't specify the optimization level, g++ used the default setting, which is -O0, i.e. no optimization.
That means the compiler didn't inline anything. Template libraries like the standard library or Boost depend on inlining for performance. Additionally, at -O0 the compiler produces a lot of extra, far-from-optimal code; it doesn't make any sense to run performance comparisons on such binaries.
Recompile with optimization enabled, and try your test again to get meaningful results.
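For example (my addition; the exact level is a matter of taste, but -O2 or -O3 is typical):
g++ ./main.cpp -o test -std=c++11 -O2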

Armadillo+OpenBLAS slower than MATLAB?

New to SO. I am test-driving Armadillo+OpenBLAS, and a simple Monte Carlo geometric Brownian motion simulation shows a much longer runtime than MATLAB. I believe something must be wrong.
Environment:
Intel i-5 4 core,
8GB ram,
VS 2012 Express,
Armadillo 4.2,
OpenBLAS (official x64 binary) v0.2.9.rc2
MATLAB takes 2 seconds for the same logic, but Armadillo+OpenBLAS takes 12 seconds. I also noticed that the program runs on a single thread, but I turned to OpenBLAS because I had heard of its multi-core capability.
Thanks for any advice.
#include <iostream>
#include <armadillo>
#include <ctime>
using namespace std;
using namespace arma;
int main()
{
clock_t start;
start = clock();
unsigned int R=100000;
vec Spre = 100*ones<vec> (R);
vec S = zeros<vec> (R);
double r = 0.03;
double Vol = 0.2;
double TTM = 5;
unsigned int T=260*TTM;
double dt = TTM/T;
for (unsigned int iT=0; iT<T; ++iT)
{
S = Spre%exp((r-0.5*Vol*Vol)*dt + Vol*sqrt(dt)*randn(R));
Spre = S;
}
cout << mean(S) << endl;
cout << (clock()-start) / (double) CLOCKS_PER_SEC << endl;
system("pause");
return 0;
}
First, the bottleneck is not exp(), though std::exp is slow. The problem is randn().
On my machine, randn() takes most of the time. When I use MKL VSL's implementation of randn, the time cost drops from 12 s to 4 s, comparable to MATLAB's 3 s or so.
#include <iostream>
#include <armadillo>
#include <ctime>
#include "mkl_vml.h"
#include "mkl_vsl.h"
using namespace std;
using namespace arma;
#define SEED 0
#define BRNG VSL_BRNG_MCG31
#define METHOD 0
int main()
{
clock_t start;
VSLStreamStatePtr stream;
start = clock();
vslNewStream(&stream, BRNG, SEED);
unsigned int R=100000;
vec Spre = 100*ones<vec> (R);
vec S = zeros<vec> (R);
double r = 0.03;
double Vol = 0.2;
double TTM = 5;
unsigned int T=260*TTM;
double dt = TTM/T;
double tmp = sqrt(dt);
vec tmp2=100*zeros<vec>(R);
vec tmp3=100*zeros<vec>(R);
for (unsigned int iT=0; iT<T; ++iT)
{
vdRngGaussian(METHOD,stream, R, tmp3.memptr(), 0, 1);
tmp2 =(r - 0.5 * Vol * Vol) * dt + Vol * tmp * tmp3;
vdExp(R, tmp2.memptr(), tmp3.memptr());
S = Spre%tmp3;
Spre = S;
}
cout << mean(S) << endl;
cout << (clock()-start) / (double) CLOCKS_PER_SEC << endl;
vslDeleteStream(&stream);
//system("pause");
return 0;
}
The key observation is that Armadillo's exp() function is much slower than MATLAB's.
Similar overhead is observed in log(), pow() and sqrt().
Just a guess, but it looks like you need to set the number of threads to use in OpenBLAS via the OPENBLAS_NUM_THREADS environment variable.
Try something like:
set OPENBLAS_NUM_THREADS=4
...on the command line before you run your program. Substitute the number of cores in your system where I put "4" (some would say set it to twice the number of cores in your system--YMMV).
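If you prefer to control this from code rather than from the environment, OpenBLAS also exports a setter you can call at the start of main() (a sketch, assuming you link against an OpenBLAS build that provides it):
// Declared by OpenBLAS (also exposed via cblas.h in OpenBLAS builds).
extern "C" void openblas_set_num_threads(int num_threads);

int main()
{
    openblas_set_num_threads(4); // substitute your core count
    // ... rest of the simulation ...
}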
Make sure you have Streaming SIMD Extensions enabled when you compile your code. In Visual Studio, check your project C/C++ compiler code generation options.

Random Number Generator - Histogram Construction (Poisson Distribution and Counting Variables)

This Problem Has Now Been Resolved - Revised Code is Shown Below
I have a problem here which I'm sure only requires a small amount of tweaking to the code, but I have not been able to correct the program myself.
Basically, what I want to do is write a C++ program that constructs a histogram with nbin = 20 (number of bins) for the number of counts of a Geiger counter in 10000 intervals of a time interval dt (delta t) = 1 s, assuming an average count rate of 5 s^(-1). In order to determine the number of counts in some time interval deltat I use a while statement of the form shown below:
while((t-=tau*log(zscale*double(iran=IM*iran+IC)))<deltat)count++;
As a bit of background to this problem, I should mention that the total number of counts is given by n*mu, which is proportional to the total counting time T = n*deltat. In this problem n has been chosen to be 10000 and deltat is 1 s, giving T = 10000 s.
The issue I am having is that the output of my code (shown below) simply gives 10000 "hits" for element 0 (corresponding to 0 counts in the time deltat) and then, of course, 0 "hits" for every other element of the hist[] array. The output I am expecting is a Poisson distribution with the peak number of "hits" at 5 counts (per second).
Thank you in advance for any help you can offer, and I apologise for my poor explanation of the problem at hand! My code is shown below:
#include <iostream> // Pre-processor directives to include
#include <ctime> //... input/output, time,
#include <fstream> //... file streaming and
#include <cmath> //... mathematical function headers
using namespace std;
int main(void) {
const unsigned IM = 1664525; // Integer constants for
const unsigned IC = 1013904223; //... the RNG algorithm
const double zscale = 1.0/0xFFFFFFFF; // Scaling factor for random double between 0 and 1
const double lambda = 5; // Count rate = 5s^-1
const double tau = 1/lambda; // Average time tau is inverse of count rate
const int deltat = 1; // Time intervals of 1s
const int nbin = 20; // Number of bins in histogram
const int nsteps = 1E4;
clock_t start, end;
int count(0);
double t = 0; // Time variable declaration
unsigned iran = time(0); // Seeds the random-number generator from the system time
int hist[nbin]; // Declare array of size nbin for histogram
// Create output stream and open output file
ofstream rout;
rout.open("geigercounterdata.txt");
// Initialise the hist[] array, each element is given the value of zero
for ( int i = 0 ; i < nbin ; i++ )
hist[i] = 0;
start = clock();
// Construction of histogram using RNG process
for ( int i = 1 ; i <= nsteps ; i++ ) {
t = 0;
count = 0;
while((t -= tau*log(zscale*double(iran=IM*iran+IC))) < deltat)
count++; // Increase count variable by 1
hist[count]++; // Increase element "count" of hist array by 1
}
// Print histogram to console window and save to output file
for ( int i = 0 ; i < nbin ; i++ ) {
cout << i << "\t" << hist[i] << endl;
rout << i << "\t" << hist[i] << endl;
}
end = clock();
cout << "\nTime taken for process completion = "
<< (end - start)/double(CLOCKS_PER_SEC)
<< " seconds.\n";
rout.close();
return 1;
} // End of main() routine
I do not entirely follow you on the mathematics of your while loop; however the problem is indeed in the condition of the while loop. I broke your while loop down as follows:
count--;
do
{
iran=IM * iran + IC; //Time generated pseudo-random
double mulTmp = zscale*iran; //Pseudo-random double 0 to 1
double logTmp = log(mulTmp); //Always negative (see graph of ln(x))
t -= tau*logTmp; //Always increases, as we subtract a negative
count++;
} while(t < deltat);
From the code it is apparent that you will always end up with count = 0 when t > 1, and with a run-time error when t < 1, as you will be writing past the end of your hist array and corrupting memory.
Unfortunately, I do not entirely follow the mathematics behind your calculation and I don't understand why a Poisson distribution should be expected. Given the issue mentioned above, you should either go ahead and solve your problem (and then share your answer with the community), or provide me with more mathematical background and references and I will edit my answer with corrected code. If you decide on the former, keep in mind that the Poisson distribution's domain is [0, infinity), so you will need to check whether the value of count is smaller than 20 (or your nbin, for that matter).
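As a minimal sketch of that bounds check (my addition), the histogram update inside the loop could be guarded like this:
if (count < nbin)   // ignore intervals with more counts than the histogram covers
    hist[count]++;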