Armadillo+OpenBLAS slower than MATLAB? - c++

New to SO. I am test-driving Armadillo+OpenBLAS, and a simple Monte Carlo geometric Brownian motion simulation shows a much longer runtime than MATLAB. I believe something must be wrong.
Environment:
Intel i5, 4 cores
8 GB RAM
VS 2012 Express
Armadillo 4.2
OpenBLAS (official x64 binary) v0.2.9.rc2
MATLAB takes 2 seconds for the same logic, but Armadillo+OpenBLAS takes 12 seconds. I also noticed that the program runs on a single thread, even though I turned to OpenBLAS precisely because I had heard of its multi-core capability.
Thanks for any advice.
#include <iostream>
#include <armadillo>
#include <ctime>

using namespace std;
using namespace arma;

int main()
{
    clock_t start = clock();

    unsigned int R = 100000;          // number of Monte Carlo paths
    vec Spre = 100 * ones<vec>(R);    // spot price at the previous step
    vec S = zeros<vec>(R);

    double r   = 0.03;                // risk-free rate
    double Vol = 0.2;                 // volatility
    double TTM = 5;                   // time to maturity in years
    unsigned int T = 260 * TTM;       // number of (daily) time steps
    double dt = TTM / T;

    for (unsigned int iT = 0; iT < T; ++iT)
    {
        // one Euler step of geometric Brownian motion, elementwise over all paths
        S = Spre % exp((r - 0.5 * Vol * Vol) * dt + Vol * sqrt(dt) * randn(R));
        Spre = S;
    }

    cout << mean(S) << endl;
    cout << (clock() - start) / (double) CLOCKS_PER_SEC << endl;

    system("pause");
    return 0;
}

First, the bottleneck is not exp(), even though std::exp is slow. The problem is randn().
On my machine, randn() takes most of the time. When I switched to MKL VSL's implementation of the Gaussian RNG, the runtime dropped from 12 s to 4 s, comparable to MATLAB's 3 s or so.
#include <iostream>
#include <armadillo>
#include <ctime>
#include "mkl_vml.h"
#include "mkl_vsl.h"

using namespace std;
using namespace arma;

#define SEED 0
#define BRNG VSL_BRNG_MCG31
#define METHOD 0

int main()
{
    clock_t start = clock();

    VSLStreamStatePtr stream;
    vslNewStream(&stream, BRNG, SEED);

    unsigned int R = 100000;
    vec Spre = 100 * ones<vec>(R);
    vec S = zeros<vec>(R);

    double r   = 0.03;
    double Vol = 0.2;
    double TTM = 5;
    unsigned int T = 260 * TTM;
    double dt = TTM / T;
    double sqrtdt = sqrt(dt);

    vec tmp2 = zeros<vec>(R);   // drift + diffusion term
    vec tmp3 = zeros<vec>(R);   // Gaussian draws, then exp() of tmp2

    for (unsigned int iT = 0; iT < T; ++iT)
    {
        // fill tmp3 with R standard normal draws using MKL VSL
        vdRngGaussian(METHOD, stream, R, tmp3.memptr(), 0, 1);
        tmp2 = (r - 0.5 * Vol * Vol) * dt + Vol * sqrtdt * tmp3;
        // vectorized exponential from MKL VML: tmp3 = exp(tmp2)
        vdExp(R, tmp2.memptr(), tmp3.memptr());
        S = Spre % tmp3;
        Spre = S;
    }

    cout << mean(S) << endl;
    cout << (clock() - start) / (double) CLOCKS_PER_SEC << endl;

    vslDeleteStream(&stream);
    //system("pause");
    return 0;
}

The key observation is that Armadillo's exp() function is much slower than MATLAB's.
Similar overhead shows up in log(), pow(), and sqrt().
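If you want to check the exp() overhead on your own machine, here is a small timing sketch (the 10-million-element size is an arbitrary choice, not from the original posts). It times arma::exp() on a large vector, with a plain std::exp loop as a baseline; running exp() on an equally sized vector in MATLAB gives the comparison point.
#include <iostream>
#include <cmath>
#include <ctime>
#include <armadillo>

int main()
{
    const arma::uword n = 10000000;          // arbitrary test size
    arma::vec x = arma::randu<arma::vec>(n); // values in [0, 1)

    // vectorized Armadillo exp()
    clock_t t0 = clock();
    arma::vec y1 = arma::exp(x);
    double armaTime = (clock() - t0) / (double) CLOCKS_PER_SEC;

    // explicit element-by-element std::exp loop as a baseline
    t0 = clock();
    arma::vec y2(n);
    for (arma::uword i = 0; i < n; ++i)
        y2(i) = std::exp(x(i));
    double loopTime = (clock() - t0) / (double) CLOCKS_PER_SEC;

    arma::vec diff = arma::abs(y1 - y2);
    std::cout << "arma::exp : " << armaTime << " s\n"
              << "std::exp  : " << loopTime << " s\n"
              << "max diff  : " << diff.max() << std::endl;
    return 0;
}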

Just a guess, but it looks like you need to set the number of threads to use in OpenBLAS via the OPENBLAS_NUM_THREADS environment variable.
Try something like:
set OPENBLAS_NUM_THREADS=4
...on the command line before you run your program. Substitute the number of cores in your system where I put "4" (some would say set it to twice the number of cores in your system--YMMV).
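If you would rather control the thread count from code than from the environment, OpenBLAS also exports an openblas_set_num_threads() function. A minimal sketch, assuming Armadillo is linked against OpenBLAS (the 2000x2000 matrix size is just an illustrative workload):
#include <iostream>
#include <armadillo>

// Exported by OpenBLAS (declared in its cblas.h); a manual declaration keeps
// this sketch self-contained.
extern "C" void openblas_set_num_threads(int num_threads);

int main()
{
    openblas_set_num_threads(4);   // match this to your core count

    // A large matrix multiply is dispatched to OpenBLAS (dgemm) and should
    // now run on 4 threads.
    arma::mat A = arma::randu<arma::mat>(2000, 2000);
    arma::mat B = arma::randu<arma::mat>(2000, 2000);
    arma::mat C = A * B;

    std::cout << C(0, 0) << std::endl;
    return 0;
}
Note that elementwise operations such as randn() and exp() do not go through BLAS, so this mainly helps matrix products and other BLAS-backed calls.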

Make sure you have Streaming SIMD Extensions (SSE) enabled when you compile your code. In Visual Studio, check the code generation options under your project's C/C++ compiler settings (Enable Enhanced Instruction Set).

Related

Down and In Call Option using Monte Carlo in C++

I am trying to write a C++ program that runs a Monte Carlo simulation to approximate the theoretical price of a down-and-in call option, where the barrier can be hit at any time between the moment of pricing and the option expiry. I implemented a BarrOption constructor, but I don't know if I implemented it correctly. If anyone has an idea about what should be corrected, please leave a comment. Code below:
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <algorithm>
#include "random.h"
#include "function.h"

using namespace std;

// definition of constructor
BarrOption::BarrOption(
    int nInt_,
    double strike_,
    double spot_,
    double vol_,
    double r_,
    double expiry_,
    double barr_){
    nInt = nInt_;
    strike = strike_;
    spot = spot_;
    vol = vol_;
    r = r_;
    expiry = expiry_;
    barr = barr_;
    generatePath();
}

void BarrOption::generatePath(){
    double thisDrift = (r * expiry - 0.5 * vol * vol * expiry) / double(nInt);
    double cumShocks = 0;
    thisPath.clear();
    for(int i = 0; i < nInt; i++){
        cumShocks += (thisDrift + vol * sqrt(expiry / double(nInt)) * GetOneGaussianByBoxMuller());
        thisPath.push_back(spot * exp(cumShocks));
    }
}

// definition of getBarrOptionPrice(int nReps) method:
double BarrOption::getBarrOptionPrice(int nReps){
    double rollingSum = 0.0;
    for(int i = 0; i < nReps; i++){
        generatePath();
        std::vector<double>::iterator minPrice;
        minPrice = std::min_element(thisPath.begin(), thisPath.end());
        if (thisPath[std::distance(thisPath.end(), minPrice)] <= barr & thisPath.back() > strike) {
            rollingSum += (thisPath.back() - strike);
        }
    }
    return exp(-r*expiry)*rollingSum/double(nReps);
}

// definition of printPath() method:
void BarrOption::printPath(){
    for(int i = 0; i < nInt; i++){
        std::cout << thisPath[i] << "\n";
    }
}
Hi, some comments from a first read. I am pretty sure there are other mistakes as well, but here are a few:
You are using push_back instead of reserving (or sizing) the vector and assigning into it.
Your diffusion is better done in log-space.
You are recalculating vol * sqrt(expiry / double(nInt)) on every iteration; compute it once outside the loop.
Just use auto instead of spelling out std::vector<double>::iterator for minPrice.
std::distance(thisPath.end(), minPrice) is negative, no? To index the minimum you would need std::distance(thisPath.begin(), minPrice), or simply dereference the iterator (*minPrice <= barr).
You check the minimum first, which is expensive; first check whether the call is triggered at all (S > strike), and only then check the barrier, otherwise you are just wasting time.
You are finding the minimum, while all you need to know is whether any value is at or below the barrier, so a for loop with an early break is better than std::min_element.
Really, this is bad code :) but it's a good start. A sketch applying these points follows below.
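To make these points concrete, here is a minimal sketch of how the two methods could look after applying them. It reuses the BarrOption members and the GetOneGaussianByBoxMuller() helper from the question; everything else is illustrative, not taken from the original code.
// Sketch only: assumes the BarrOption members (nInt, strike, spot, vol, r,
// expiry, barr, thisPath) and GetOneGaussianByBoxMuller() from the question.
void BarrOption::generatePath(){
    const double dt        = expiry / double(nInt);
    const double thisDrift = (r - 0.5 * vol * vol) * dt;   // per-step drift
    const double diffusion = vol * sqrt(dt);                // hoisted out of the loop
    double cumShocks = 0.0;
    thisPath.resize(nInt);                                  // no repeated push_back
    for (int i = 0; i < nInt; ++i){
        cumShocks += thisDrift + diffusion * GetOneGaussianByBoxMuller();
        thisPath[i] = spot * exp(cumShocks);
    }
}

double BarrOption::getBarrOptionPrice(int nReps){
    double rollingSum = 0.0;
    for (int i = 0; i < nReps; ++i){
        generatePath();
        if (thisPath.back() > strike){                      // cheap payoff check first
            // only then scan for a barrier hit, stopping early
            for (double s : thisPath){
                if (s <= barr){
                    rollingSum += thisPath.back() - strike;
                    break;
                }
            }
        }
    }
    return exp(-r * expiry) * rollingSum / double(nReps);
}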

How to benchmark CUDA programs?

I was trying to benchmark my first CUDA application, which adds two arrays, first using the CPU and then using the GPU.
Here is the program.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include<iostream>
#include<chrono>
using namespace std;
using namespace std::chrono;
// add two arrays
void add(int n, float *x, float *y) {
for (int i = 0; i < n; i++) {
y[i] += x[i];
}
}
__global__ void addParallel(int n, float *x, float *y) {
int i = threadIdx.x;
if (i < n)
y[i] += x[i];
}
void printElapseTime(std::chrono::microseconds elapsed_time) {
cout << "completed in " << elapsed_time.count() << " microseconds" << endl;
}
int main() {
// generate two arrays of million float values each
cout << "Generating two lists of a million float values ... ";
int n = 1 << 28;
float *x, *y;
cudaMallocManaged(&x, sizeof(float)*n);
cudaMallocManaged(&y, sizeof(float)*n);
// begin benchmark array generation
auto begin = high_resolution_clock::now();
for (int i = 0; i < n; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// end benchmark array generation
auto end = high_resolution_clock::now();
auto elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
// begin benchmark addition cpu
begin = high_resolution_clock::now();
cout << "Adding both arrays using CPU ... ";
add(n, x, y);
// end benchmark addition cpu
end = high_resolution_clock::now();
elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
// begin benchmark addition gpu
begin = high_resolution_clock::now();
cout << "Adding both arrays using GPU ... ";
addParallel << <1, 1024 >> > (n, x, y);
cudaDeviceSynchronize();
// end benchmark addition gpu
end = high_resolution_clock::now();
elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
cudaFree(x);
cudaFree(y);
return 0;
}
Surprisingly though, the program is generating the following output.
Generating two lists of a million float values ... completed in 13343211 microseconds
Adding both arrays using CPU ... completed in 543994 microseconds
Adding both arrays using GPU ... completed in 3030147 microseconds
I wonder where exactly I am going wrong. Why does the GPU computation take 6 times longer than the one running on the CPU?
For your reference, I'm running Windows 10 on Intel i7 8750H and Nvidia GTX 1060.
Note that each of your unified memory arrays contains 268 million floats, so on the order of a gigabyte per array has to be migrated to the device when you invoke your kernel. Use a GPU profiler (nvprof, nvvp, or Nsight) and you should see host-to-device transfers taking the bulk of what you measured as computation time.
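If you want to separate the kernel's execution time from the one-time managed-memory migration without a profiler, one common approach is to do a warm-up launch first and then time a second launch with CUDA events. A sketch of that idea, reusing the question's addParallel kernel and launch configuration (illustrative only, and not a fix for the launch configuration itself):
// Sketch: CUDA event timing around the question's kernel launch.
// Assumes x, y, n and addParallel are set up as in the original program.
cudaEvent_t startEvent, stopEvent;
cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);

// Warm-up launch so the managed pages are already resident on the device.
addParallel<<<1, 1024>>>(n, x, y);
cudaDeviceSynchronize();

// Timed launch: events are recorded on the same stream as the kernel.
cudaEventRecord(startEvent);
addParallel<<<1, 1024>>>(n, x, y);
cudaEventRecord(stopEvent);
cudaEventSynchronize(stopEvent);

float ms = 0.0f;
cudaEventElapsedTime(&ms, startEvent, stopEvent);
std::cout << "kernel only: " << ms << " ms" << std::endl;

cudaEventDestroy(startEvent);
cudaEventDestroy(stopEvent);
On the command line, nvprof ./your_program prints a per-kernel and per-transfer breakdown that shows the same thing.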

Unable to find the machine epsilon for float in c++ in codeblocks

I wanted to find the machine epsilon for the float and double types in C++, but I get the same answer regardless of the data type of the variable x: the long double value, on the order of 1e-20. I am running it on my Windows 10 machine using Code::Blocks.
I tried the same code in Ubuntu and also in Dev-C++ on Windows itself, and there I get the correct answer. What am I doing wrong in Code::Blocks? Is there some default setting?
#include <iostream>
#include <string>
#include <typeinfo>

using namespace std;

int main()
{
    //double x = 5;
    //double one = 1;
    //double fac = 0.5;
    float x = 1;
    float one = 1.0;
    float fac = 0.5;

    // cout << "What is the input of number you are giving" << endl;
    // cin >> x;
    cout << "The no. you have given is: " << x << endl;

    int iter = 1;
    while (one + x != one)
    {
        x = x * fac;
        iter = iter + 1;
    }
    cout << "The value of machine epsilon for the given data type is " << x << endl;
    cout << "The no. of iterations taken place are: " << iter << endl;
}
while(one+x != one)
The computation of one+x might well be carried out in an extended-precision type; the compiler is quite free to do so. In such an implementation, you will indeed see the same value of iter regardless of the declared type of one and x.
The following works quite nicely on my computer.
#include <iostream>
#include <limits>

template <typename T> void machine_epsilon()
{
    T one = 1.0;
    T eps = 1.0;
    T fac = 0.5;
    int iter = 0;
    T one_plus_eps = one + eps;
    while (one_plus_eps != one)
    {
        ++iter;
        eps *= fac;
        one_plus_eps = one + eps;
    }
    --iter;
    eps /= fac;
    std::cout << iter << ' '
              << eps << ' '
              << std::numeric_limits<T>::epsilon() << '\n';
}

int main()
{
    machine_epsilon<float>();
    machine_epsilon<double>();
    machine_epsilon<long double>();
}
You could try this code to obtain the machine epsilon for float values:
#include <iostream>
#include <limits>

int main(){
    std::cout << "machine epsilon (float): "
              << std::numeric_limits<float>::epsilon() << std::endl;
}

Why is my C++ code so much slower than R?

I have written the following code in R and in C++, both performing the same algorithm:
a) Simulate the random variable X 500 times (X is 0.9 with probability 0.5 and 1.1 with probability 0.5).
b) Multiply these 500 simulated values together and save the product in a container.
c) Repeat 10,000,000 times so the container holds 10,000,000 values.
R:
ptm <- proc.time()
steps <- 500
MCsize <- 10000000
a <- rbinom(MCsize,steps,0.5)
b <- rep(500,times=MCsize) - a
result <- rep(1.1,times=MCsize)^a*rep(0.9,times=MCsize)^b
proc.time()-ptm
C++
#include <numeric>
#include <vector>
#include <iostream>
#include <random>
#include <thread>
#include <mutex>
#include <cmath>
#include <algorithm>
#include <chrono>

const size_t MCsize = 10000000;
std::mutex mutex1;
std::mutex mutex2;
unsigned seed_;
std::vector<double> cache;

void generatereturns(size_t steps, int RUNS){
    mutex2.lock();
    // setting seed
    try{
        std::mt19937 tmpgenerator(seed_);
        seed_ = tmpgenerator();
        std::cout << "SEED : " << seed_ << std::endl;
    }catch(int exception){
        mutex2.unlock();
    }
    mutex2.unlock();

    // Creating generator
    std::binomial_distribution<int> distribution(steps, 0.5);
    std::mt19937 generator(seed_);
    for(int i = 0; i != RUNS; ++i){
        double power;
        double returns;
        power = distribution(generator);
        returns = pow(0.9, power) * pow(1.1, (double)steps - power);
        std::lock_guard<std::mutex> guard(mutex1);
        cache.push_back(returns);
    }
}

int main(){
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    size_t steps = 500;
    seed_ = 777;
    unsigned concurentThreadsSupported = std::max(std::thread::hardware_concurrency(), (unsigned)1);
    int remainder = MCsize % concurentThreadsSupported;
    std::vector<std::thread> threads;

    // starting sub-thread simulations
    if(concurentThreadsSupported != 1){
        for(int i = 0; i != concurentThreadsSupported - 1; ++i){
            if(remainder != 0){
                threads.push_back(std::thread(generatereturns, steps, MCsize / concurentThreadsSupported + 1));
                remainder--;
            }else{
                threads.push_back(std::thread(generatereturns, steps, MCsize / concurentThreadsSupported));
            }
        }
    }

    // starting main thread simulation
    if(remainder != 0){
        generatereturns(steps, MCsize / concurentThreadsSupported + 1);
        remainder--;
    }else{
        generatereturns(steps, MCsize / concurentThreadsSupported);
    }
    for (auto& th : threads) th.join();

    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    typedef std::chrono::duration<int, std::milli> millisecs_t;
    millisecs_t duration(std::chrono::duration_cast<millisecs_t>(end - start));
    std::cout << "Time elapsed : " << duration.count() << " milliseconds.\n";
    return 0;
}
I can't understand why my R code is so much faster than my C++ code (3.29 s vs 12 s), even though I used four threads in the C++ code. Can anyone enlighten me? How should I improve my C++ code to make it run faster?
EDIT:
Thanks for all the advice! I reserved capacity for my vectors and reduced the amount of locking in my code. The crucial update in the generatereturns() function is:
std::vector<double> cache(MCsize);
std::vector<double>::iterator currit = cache.begin();
//.....

// Creating generator
std::binomial_distribution<int> distribution(steps, 0.5);
std::mt19937 generator(seed_);

std::vector<double> tmpvec(RUNS);
for(int i = 0; i != RUNS; ++i){
    double power;
    double returns;
    power = distribution(generator);
    returns = pow(0.9, power) * pow(1.1, (double)steps - power);
    tmpvec[i] = returns;
}
std::lock_guard<std::mutex> guard(mutex1);
std::move(tmpvec.begin(), tmpvec.end(), currit);
currit += RUNS;
Instead of locking on every iteration, I created a temporary vector and then used std::move to shift its elements into cache. The elapsed time has now dropped to 1.9 seconds.
First of all, are you running it in release mode?
Switching from debug to release reduced the running time from ~15s to ~4.5s on my laptop (Windows 7, i5 3210M).
Also, reducing the number of threads to 2 instead of 4 in my case (I just have 2 cores but with hyperthreading) further reduced the running time to ~2.4s.
Changing the variable power to int (as jimifiki also suggested) also offered a slight boost, reducing the time to ~2.3s.
I really enjoyed your question and tried the code at home. I tried changing the random number generator; my implementation of std::binomial_distribution requires on average about 9.6 calls to generator().
I know the question is more about comparing R and C++ performance, but since you ask "How should I improve my C++ code to make it run faster?", I insist on the pow optimization. You can easily avoid half of the pow calls by precomputing either 0.9^steps or 1.1^steps before the for loop. This makes your code run a bit faster:
double power1 = pow(0.9, steps);   // hoisted out of the loop
double ratio  = 1.1 / 0.9;
for(int i = 0; i != RUNS; ++i){
    ...
    returns = power1 * pow(ratio, (double)(steps - power));
Analogously you can improve the R code:
...
ratio <- 1.1/0.9
pow1 <- 0.9^steps
result <- pow1 * ratio^a
...
Probably doesn't help you that much, but
start by using pow(double,int) when your exponent is an int.
int power;
returns = pow(0.9,power) * pow(1.1,(int)steps - power);
Can you see any improvement?

Performance swapping integers vs double

For some reason my code performs swaps on doubles faster than on integers. I have no idea why this is happening.
On my machine the double-swap loop completes 11 times faster than the integer-swap loop. What property of doubles/integers makes them perform this way?
Test setup
Visual Studio 2012 x64
CPU: Core i7 950
Build as Release and run the exe directly; the VS debug hooks skew things
Output:
Process time for ints 1.438 secs
Process time for doubles 0.125 secs
#include <iostream>
#include <ctime>

using namespace std;

#define N 2000000000

void swap_i(int *x, int *y) {
    int tmp = *x;
    *x = *y;
    *y = tmp;
}

void swap_d(double *x, double *y) {
    double tmp = *x;
    *x = *y;
    *y = tmp;
}

int main () {
    int a = 1, b = 2;
    double d = 1.0, e = 2.0, iTime, dTime;
    clock_t c0, c1;

    // Time int swaps
    c0 = clock();
    for (int i = 0; i < N; i++) {
        swap_i(&a, &b);
    }
    c1 = clock();
    iTime = (double)(c1 - c0) / CLOCKS_PER_SEC;

    // Time double swaps
    c0 = clock();
    for (int i = 0; i < N; i++) {
        swap_d(&d, &e);
    }
    c1 = clock();
    dTime = (double)(c1 - c0) / CLOCKS_PER_SEC;

    cout << "Process time for ints " << iTime << " secs" << endl;
    cout << "Process time for doubles " << dTime << " secs" << endl;
}
It seems that VS only optimized away one of the loops, as Blastfurnace explained.
When I disable all compiler optimizations and put the swap code inline inside the loops, I get the following results (I also switched my timer to std::chrono::high_resolution_clock):
Process time for ints 1449 ms
Process time for doubles 1248 ms
You can find the answer by looking at the generated assembly.
Using Visual C++ 2012 (32-bit Release build), the body of swap_i is three mov instructions, but the body of swap_d is optimized away completely, leaving an empty loop. The compiler is smart enough to see that an even number of swaps has no visible effect. I don't know why it doesn't do the same with the int loop.
Just changing #define N 2000000000 to #define N 2000000001 and rebuilding causes the swap_d body to perform actual work. The final times are close on my machine, with swap_d about 3% slower.
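If you want a variant of the benchmark where neither loop can be silently removed, the usual trick is to use an odd iteration count (as noted above) and to make the swapped values observable after each timed loop, for example by printing them. This is a sketch of that idea rather than a guarantee against every optimizer, but it matches the behavior described above:
// Sketch: keep the compiler from eliding either swap loop by making the
// results observable. Uses an odd iteration count so the net effect of the
// swaps is not a no-op.
#include <iostream>
#include <ctime>

void swap_i(int *x, int *y) { int t = *x; *x = *y; *y = t; }
void swap_d(double *x, double *y) { double t = *x; *x = *y; *y = t; }

int main() {
    const long long N = 2000000001;   // odd count: an even number of swaps is a no-op
    int a = 1, b = 2;
    double d = 1.0, e = 2.0;

    clock_t c0 = clock();
    for (long long i = 0; i < N; i++) swap_i(&a, &b);
    clock_t c1 = clock();
    // printing the results makes the loop's effect observable
    std::cout << "ints:    " << (double)(c1 - c0) / CLOCKS_PER_SEC
              << " s (a=" << a << ", b=" << b << ")\n";

    c0 = clock();
    for (long long i = 0; i < N; i++) swap_d(&d, &e);
    c1 = clock();
    std::cout << "doubles: " << (double)(c1 - c0) / CLOCKS_PER_SEC
              << " s (d=" << d << ", e=" << e << ")\n";
    return 0;
}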