For some reason my code performs swaps on doubles faster than on integers. I have no idea why this is happening.
On my machine the double swap loop completes 11 times faster than the integer swap loop. What property of doubles/integers makes them perform this way?
Test setup
Visual Studio 2012 x64
cpu core i7 950
Build as Release and run exe directly, VS Debug hooks skew things
Output:
Process time for ints 1.438 secs
Process time for doubles 0.125 secs
#include <iostream>
#include <ctime>
using namespace std;
#define N 2000000000
void swap_i(int *x, int *y) {
int tmp = *x;
*x = *y;
*y = tmp;
}
void swap_d(double *x, double *y) {
double tmp = *x;
*x = *y;
*y = tmp;
}
int main () {
int a = 1, b = 2;
double d = 1.0, e = 2.0, iTime, dTime;
clock_t c0, c1;
// Time int swaps
c0 = clock();
for (int i = 0; i < N; i++) {
swap_i(&a, &b);
}
c1 = clock();
iTime = (double)(c1-c0)/CLOCKS_PER_SEC;
// Time double swaps
c0 = clock();
for (int i = 0; i < N; i++) {
swap_d(&d, &e);
}
c1 = clock();
dTime = (double)(c1-c0)/CLOCKS_PER_SEC;
cout << "Process time for ints " << iTime << " secs" << endl;
cout << "Process time for doubles " << dTime << " secs" << endl;
}
It seems that VS only optimized one of the loops as Blastfurnace explained.
When I disable all compiler optimizations and have my swap code inline inside the loops, I got the following results (I also switched my timer to std::chrono::high_resolution_clock):
Process time for ints 1449 ms
Process time for doubles 1248 ms
You can find the answer by looking at the generated assembly.
Using Visual C++ 2012 (32-bit Release build) the body of swap_i is three mov instructions but the body of swap_d is completely optimized away to an empty loop. The compiler is smart enough to see that an even number of swaps has no visible effect. I don't know why it doesn't do the same with the int loop.
Just changing #define N 2000000000 to #define N 2000000001 and rebuilding causes the swap_d body to perform actual work. The final times are close on my machine with swap_d being about 3% slower.
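One way to keep both loops honest is to make every load and store observable, so the optimizer cannot collapse an even number of swaps into nothing. The following is a minimal sketch of that idea, not the original poster's code: `swap_volatile` and `run_swaps` are hypothetical names, and `volatile` is used here purely as a benchmarking aid.

```cpp
#include <cstdint>
#include <utility>

// Swap through volatile pointers: each read and write is an observable
// side effect, so the compiler must actually perform it.
template <typename T>
void swap_volatile(volatile T* x, volatile T* y) {
    T tmp = *x;
    *x = *y;
    *y = tmp;
}

// Hypothetical helper: perform `count` swaps and return the final values
// so the caller can check (and print) the result.
template <typename T>
std::pair<T, T> run_swaps(T a, T b, std::int64_t count) {
    volatile T va = a;
    volatile T vb = b;
    for (std::int64_t i = 0; i < count; ++i) {
        swap_volatile(&va, &vb);
    }
    T ra = va;  // copy out of the volatile objects before returning
    T rb = vb;
    return std::make_pair(ra, rb);
}
```

Timing `run_swaps<int>(1, 2, N)` against `run_swaps<double>(1.0, 2.0, N)` then compares two loops that both do real work. Volatile accesses go through memory, so the absolute numbers will differ from a register-only version.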
Related
The code below was taken from an example compiled with g++. There, the multi-threaded version was 2x faster than the single-threaded one.
I'm executing it in Visual Studio 2019 and the results are the opposite: the single-threaded is 2x faster than the multi-threaded.
#include<thread>
#include<iostream>
#include<chrono>
using namespace std;
using ll = long long;
ll odd, even;
void par(const ll ini, const ll fim)
{
for (auto i = ini; i <= fim; i++)
if (!(i & 1))
even += i;
}
void impar(const ll ini, const ll fim)
{
for (auto i = ini; i <= fim; i++)
if (i & 1)
odd += i;
}
int main()
{
const ll start = 0;
const ll end = 190000000;
/* SINGLE THREADED */
auto start_single = chrono::high_resolution_clock::now();
par(start, end);
impar(start, end);
auto end_single = chrono::high_resolution_clock::now();
auto single_duration = chrono::duration_cast<chrono::microseconds>(end_single - start_single).count();
cout << "SINGLE THREADED\nEven sum: " << even << "\nOdd sum: " << odd << "\nTime: " << single_duration << "us\n\n\n"; // duration is in microseconds
/* END OF SINGLE*/
/* MULTI THREADED */
even = odd = 0;
auto start_multi= chrono::high_resolution_clock::now();
thread t(par, start, end);
thread t2(impar, start, end);
t.join();
t2.join();
auto end_multi = chrono::high_resolution_clock::now();
auto multi_duration = chrono::duration_cast<chrono::microseconds>(end_multi - start_multi).count();
cout << "MULTI THREADED\nEven sum: " << even << "\nOdd sum: " << odd << "\nTime: " << multi_duration << "us\n"; // duration is in microseconds
/*END OF MULTI*/
cout << "\n\nIs multi faster than single? => " << boolalpha << (multi_duration < single_duration) << '\n';
}
However, if I make a small modification to my functions, as shown below:
void par(const ll ini, const ll fim)
{
ll temp = 0;
for (auto i = ini; i <= fim; i++)
if (!(i & 1))
temp += i;
even = temp;
}
void impar(const ll ini, const ll fim)
{
ll temp = 0;
for (auto i = ini; i <= fim; i++)
if (i & 1)
temp += i;
odd = temp;
}
With this change the multi-threaded version performs better. I would like to know what leads to this behavior (what possible differences in implementation explain it).
Also, I compiled with gcc on www.onlinegdb.com and the results are similar to Visual Studio's on my machine.
You are a victim of false sharing.
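odd and even reside next to each other in memory, most likely in the same cache line. When two threads on different cores keep writing to them, that line bounces back and forth between the cores' caches (cache line contention, a.k.a. false sharing).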
You can fix it by spreading them by 64 bytes to make sure they reside in different cache lines, for example, like this:
alignas(64) ll odd, even;
With that change I get good speedup with 2 threads:
SINGLE THREADED
Even sum: 9025000095000000
Odd sum: 9025000000000000
Time: 825954ms
MULTI THREADED
Even sum: 9025000095000000
Odd sum: 9025000000000000
Time: 532420ms
As for the g++ result: it might be performing for you the optimization you made manually (accumulating in a local variable and storing once). MSVC is more conservative when it comes to optimizing accesses to global variables.
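The padding idea can also be sketched with a small wrapper type, so that each accumulator occupies a cache line of its own. This is only an illustration: `PaddedSum`, `sum_parity` and `parity_sums` are hypothetical names, and the 64-byte line size is an assumption carried over from the answer above.

```cpp
#include <functional>
#include <thread>
#include <utility>

// Assumption: 64-byte cache lines, as in the answer above.
struct alignas(64) PaddedSum {
    long long value = 0;
};

static_assert(sizeof(PaddedSum) == 64, "each counter fills its own line");

// Add every i in [first, last] whose low bit equals `parity` to `out`.
void sum_parity(long long first, long long last, int parity, PaddedSum& out) {
    for (long long i = first; i <= last; ++i) {
        if ((i & 1) == parity) {
            out.value += i;
        }
    }
}

// Hypothetical driver: returns {even sum, odd sum} computed by two threads.
std::pair<long long, long long> parity_sums(long long first, long long last) {
    PaddedSum even_sum;
    PaddedSum odd_sum;
    std::thread t_even(sum_parity, first, last, 0, std::ref(even_sum));
    std::thread t_odd(sum_parity, first, last, 1, std::ref(odd_sum));
    t_even.join();
    t_odd.join();
    return std::make_pair(even_sum.value, odd_sum.value);
}
```

Because the two accumulators can never share a line, the threads no longer invalidate each other's cache on every addition, whichever compiler is used.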
I was trying to benchmark my first CUDA application that adds two arrays first using the CPU and then using the GPU.
Here is the program.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include<iostream>
#include<chrono>
using namespace std;
using namespace std::chrono;
// add two arrays
void add(int n, float *x, float *y) {
for (int i = 0; i < n; i++) {
y[i] += x[i];
}
}
__global__ void addParallel(int n, float *x, float *y) {
int i = threadIdx.x;
if (i < n)
y[i] += x[i];
}
void printElapseTime(std::chrono::microseconds elapsed_time) {
cout << "completed in " << elapsed_time.count() << " microseconds" << endl;
}
int main() {
// generate two arrays of n = 1 << 28 (about 268 million) float values each
cout << "Generating two lists of a million float values ... ";
int n = 1 << 28;
float *x, *y;
cudaMallocManaged(&x, sizeof(float)*n);
cudaMallocManaged(&y, sizeof(float)*n);
// begin benchmark array generation
auto begin = high_resolution_clock::now();
for (int i = 0; i < n; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// end benchmark array generation
auto end = high_resolution_clock::now();
auto elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
// begin benchmark addition cpu
begin = high_resolution_clock::now();
cout << "Adding both arrays using CPU ... ";
add(n, x, y);
// end benchmark addition cpu
end = high_resolution_clock::now();
elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
// begin benchmark addition gpu
begin = high_resolution_clock::now();
cout << "Adding both arrays using GPU ... ";
addParallel<<<1, 1024>>>(n, x, y);
cudaDeviceSynchronize();
// end benchmark addition gpu
end = high_resolution_clock::now();
elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
cudaFree(x);
cudaFree(y);
return 0;
}
Surprisingly though, the program is generating the following output.
Generating two lists of a million float values ... completed in 13343211 microseconds
Adding both arrays using CPU ... completed in 543994 microseconds
Adding both arrays using GPU ... completed in 3030147 microseconds
I wonder where exactly I am going wrong. Why is the GPU computation taking 6 times longer than the one running on the CPU?
For your reference, I'm running Windows 10 on Intel i7 8750H and Nvidia GTX 1060.
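Note that each of your unified memory arrays contains about 268 million floats (n = 1 << 28), which is 1 GiB per array, so a large amount of data has to be migrated to the device when you invoke your kernel. Use a GPU profiler (nvprof, nvvp, or Nsight) and you should see an HtoD transfer taking the bulk of your computation time. Also note that addParallel is launched with a single block of 1024 threads and indexes with threadIdx.x only, so it adds just the first 1024 elements; the two timings do not measure the same amount of work.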
I have run into a problem while trying to optimize code that finds Nmin, the smallest N for which the approximation error drops below a given tolerance, by trying increasing values of N.
I do not have a programming background and have only just started.
The original calculation was inefficient because it kept computing even after Nmin had been found.
To reduce the run time I made the changes below to cut down on function calls, but saw no improvement:
#include<iostream>
#include<cmath>
#include<time.h>
#include<iomanip>
using namespace std;
double f(int);
int main(void)
{
double err;
double pi = 4.0*atan(1.0);
cout<<fixed<<setprecision(7);
clock_t start = clock();
for (int n=1;;n++)
{
if((f(n)-pi)>= 1e-6)
{
cout<<"n_min is "<< n <<"\t"<<f(n)-pi<<endl;
}
else
{
break;
}
}
clock_t stop = clock();
//double elapsed = (double)(stop - start) * 1000.0 / CLOCKS_PER_SEC; //this one in ms
cout << "time: " << (stop-start)/double(CLOCKS_PER_SEC)*1000 << endl; // elapsed time in ms (seconds * 1000)
return 0;
}
double f(int n)
{
double sum=0;
for (int i=1;i<=n;i++)
{
sum += 1/(1+pow((i-0.5)/n,2));
}
return (4.0/n)*sum;
}
Is there any way to reduce the time and make the second query efficient?
Any help would be greatly appreciated.
I do not see any immediate way of optimizing the algorithm itself. You could however reduce the time significantly by not writing to the standard output for every iteration. Also, do not calculate f(n) more than once per iteration.
for (int n=1;;n++)
{
double val = f(n);
double diff = val-pi;
if(diff < 1e-6)
{
cout<<"n_min is "<< n <<"\t"<<diff<<endl;
break;
}
}
Note however that this will yield a higher n_min (increased by 1 compared to the result of your version) since we changed the condition to diff < 1e-6.
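The same search can be packaged as a function that returns n_min instead of printing inside the loop. The sketch below is one possible arrangement, not the original poster's code: `midpoint_pi` and `find_n_min` are hypothetical names, and `x * x` is swapped in for `pow(x, 2)` as a small extra saving.

```cpp
#include <cmath>

// Midpoint-rule approximation of pi: (4/n) * sum of 1 / (1 + x_i^2),
// with x_i = (i - 0.5) / n. Uses x * x instead of pow(x, 2).
double midpoint_pi(int n) {
    double sum = 0.0;
    for (int i = 1; i <= n; ++i) {
        const double x = (i - 0.5) / n;
        sum += 1.0 / (1.0 + x * x);
    }
    return (4.0 / n) * sum;
}

// Hypothetical helper: smallest n whose error drops below `tolerance`.
// f(n) is evaluated exactly once per candidate n.
int find_n_min(double tolerance) {
    const double pi = 4.0 * std::atan(1.0);
    for (int n = 1;; ++n) {
        if (midpoint_pi(n) - pi < tolerance) {
            return n;
        }
    }
}
```

A caller would print the result once, for example `std::cout << find_n_min(1e-6)`, which keeps all I/O out of the timed loop.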
I have written the following codes in R and C++ which perform the same algorithm:
a) To simulate the random variable X 500 times. (X has value 0.9 with prob 0.5 and 1.1 with prob 0.5)
b) Multiply these 500 simulated values together to get a value. Save that value in a container
c) Repeat 10000000 times such that the container has 10000000 values
R:
ptm <- proc.time()
steps <- 500
MCsize <- 10000000
a <- rbinom(MCsize,steps,0.5)
b <- rep(500,times=MCsize) - a
result <- rep(1.1,times=MCsize)^a*rep(0.9,times=MCsize)^b
proc.time()-ptm
C++
#include <numeric>
#include <vector>
#include <iostream>
#include <random>
#include <thread>
#include <mutex>
#include <cmath>
#include <algorithm>
#include <chrono>
const size_t MCsize = 10000000;
std::mutex mutex1;
std::mutex mutex2;
unsigned seed_;
std::vector<double> cache;
void generatereturns(size_t steps, int RUNS){
unsigned localSeed;
{
// setting seed: lock_guard releases mutex2 exactly once, even on exceptions
std::lock_guard<std::mutex> seedGuard(mutex2);
std::mt19937 tmpgenerator(seed_);
seed_ = tmpgenerator();
localSeed = seed_; // copy while locked so each thread keeps its own seed
std::cout << "SEED : " << seed_ << std::endl;
}
// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(localSeed);
for(int i = 0; i!= RUNS; ++i){
double power;
double returns;
power = distribution(generator);
returns = pow(0.9,power) * pow(1.1,(double)steps - power);
std::lock_guard<std::mutex> guard(mutex1);
cache.push_back(returns);
}
}
int main(){
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
size_t steps = 500;
seed_ = 777;
unsigned concurentThreadsSupported = std::max(std::thread::hardware_concurrency(),(unsigned)1);
int remainder = MCsize % concurentThreadsSupported;
std::vector<std::thread> threads;
// starting sub-thread simulations
if(concurentThreadsSupported != 1){
for(int i = 0 ; i != concurentThreadsSupported - 1; ++i){
if(remainder != 0){
threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported + 1));
remainder--;
}else{
threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported));
}
}
}
//starting main thread simulation
if(remainder != 0){
generatereturns(steps, MCsize / concurentThreadsSupported + 1);
remainder--;
}else{
generatereturns(steps, MCsize / concurentThreadsSupported);
}
for (auto& th : threads) th.join();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now() ;
typedef std::chrono::duration<int,std::milli> millisecs_t ;
millisecs_t duration( std::chrono::duration_cast<millisecs_t>(end-start) ) ;
std::cout << "Time elapsed : " << duration.count() << " milliseconds.\n" ;
return 0;
}
I can't understand why my R code is so much faster than my C++ code (3.29s vs 12s) even though I have used four threads in the C++ code? Can anyone enlighten me please? How should I improve my C++ code to make it run faster?
EDIT:
Thanks for all the advice! I reserved capacity for my vectors and reduced the amount of locking in my code. The crucial update in the generatereturns() function is :
std::vector<double> cache(MCsize);
std::vector<double>::iterator currit = cache.begin();
//.....
// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(seed_);
std::vector<double> tmpvec(RUNS);
for(int i = 0; i!= RUNS; ++i){
double power;
double returns;
power = distribution(generator);
returns = pow(0.9,power) * pow(1.1,(double)steps - power);
tmpvec[i] = returns;
}
std::lock_guard<std::mutex> guard(mutex1);
std::move(tmpvec.begin(),tmpvec.end(),currit);
currit += RUNS;
Instead of locking on every iteration, I fill a temporary vector and then use std::move to transfer the elements of tmpvec into cache under a single lock. The elapsed time is now down to 1.9 seconds.
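A further variation on the same idea is to give every thread its own slice of a preallocated result vector, so that no mutex is needed at all. This is a sketch under stated assumptions rather than the poster's code: `fill_slice` and `simulate` are hypothetical names, and seeding each thread with `base_seed + t` is a simplification chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <thread>
#include <vector>

// Fill out[begin, end) with simulated returns. Each thread owns a disjoint
// slice and its own generator, so no mutex is needed.
void fill_slice(std::vector<double>& out, std::size_t begin, std::size_t end,
                int steps, unsigned seed) {
    std::mt19937 generator(seed);
    std::binomial_distribution<int> distribution(steps, 0.5);
    for (std::size_t i = begin; i < end; ++i) {
        const int power = distribution(generator);
        out[i] = std::pow(0.9, power) * std::pow(1.1, steps - power);
    }
}

// Hypothetical driver: split `mc_size` samples across `thread_count` threads.
std::vector<double> simulate(std::size_t mc_size, int steps,
                             unsigned thread_count, unsigned base_seed) {
    if (thread_count == 0) thread_count = 1;
    std::vector<double> out(mc_size);
    std::vector<std::thread> threads;
    const std::size_t chunk = mc_size / thread_count;
    std::size_t begin = 0;
    for (unsigned t = 0; t < thread_count; ++t) {
        // The last thread also takes the remainder.
        const std::size_t end =
            (t + 1 == thread_count) ? mc_size : begin + chunk;
        // Assumption: base_seed + t is good enough to separate the streams.
        threads.emplace_back(fill_slice, std::ref(out), begin, end, steps,
                             base_seed + t);
        begin = end;
    }
    for (auto& th : threads) th.join();
    return out;
}
```

A call such as `simulate(10000000, 500, std::thread::hardware_concurrency(), 777)` would correspond to the setup in the question; the result is reproducible for a fixed seed and thread count.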
First of all, are you running it in release mode?
Switching from debug to release reduced the running time from ~15s to ~4.5s on my laptop (windows 7, i5 3210M).
Also, reducing the number of threads to 2 instead of 4 in my case (I just have 2 cores but with hyperthreading) further reduced the running time to ~2.4s.
Changing the variable power to int (as jimifiki also suggested) also offered a slight boost, reducing the time to ~2.3s.
I really enjoyed your question and tried the code at home. I also tried changing the random number generator: my implementation of std::binomial_distribution requires about 9.6 calls to generator() on average.
I know the question is more about comparing R with C++ performance, but since you ask "How should I improve my C++ code to make it run faster?" I insist on the pow optimization. You can easily avoid half of the pow calls by precomputing either 0.9^steps or 1.1^steps before the for loop. This makes your code run a bit faster:
double power1 = pow(0.9,steps); // 0.9^steps, computed once
double ratio = 1.1/0.9;
for(int i = 0; i!= RUNS; ++i){
...
// equals pow(1.1,power) * pow(0.9,steps - power), as in the R code; the roles of
// 0.9 and 1.1 are swapped relative to the question's C++, which is harmless for p = 0.5
returns = power1 * pow(ratio, (double)power);
Analogously you can improve the R code:
...
ratio <- 1.1/0.9
pow1 <- 0.9^steps
result <- pow1 * ratio^a
...
Probably doesn't help you that much, but
start by using pow(double,int) when your exponent is an int.
int power = distribution(generator); // keep the draw as an int
returns = pow(0.9,power) * pow(1.1,(int)steps - power);
Can you see any improvement?
New to SO. I am test-driving Armadillo+OpenBLAS, and a simple Monte-Carlo geometric Brownian motion logic shows much longer runtime than MATLAB. I believe something must be wrong.
Environment:
Intel i5 (4 cores)
8 GB RAM
VS 2012 Express
Armadillo 4.2
OpenBLAS (official x64 binary) v0.2.9.rc2
MATLAB takes 2 seconds for the same logic, but Armadillo+OpenBLAS takes 12 seconds. I also noticed that the program runs on a single thread, even though I turned to OpenBLAS because I had heard of its multi-core capability.
Thanks for any advice.
#include <iostream>
#include <armadillo>
#include <ctime>
using namespace std;
using namespace arma;
int main()
{
clock_t start;
start = clock();
unsigned int R=100000;
vec Spre = 100*ones<vec> (R);
vec S = zeros<vec> (R);
double r = 0.03;
double Vol = 0.2;
double TTM = 5;
unsigned int T=260*TTM;
double dt = TTM/T;
for (unsigned int iT=0; iT<T; ++iT)
{
S = Spre%exp((r-0.5*Vol*Vol)*dt + Vol*sqrt(dt)*randn(R));
Spre = S;
}
cout << mean(S) << endl;
cout << (clock()-start) / (double) CLOCKS_PER_SEC << endl;
system("pause");
return 0;
}
First, the bottleneck is not exp(), though std::exp is slow. The problem is randn().
On my machine, randn() takes most of the time. When I use MKL VSL's Gaussian generator instead of randn(), the run time drops from 12 s to 4 s, comparable to MATLAB's 3 s or so.
#include <iostream>
#include <armadillo>
#include <ctime>
#include "mkl_vml.h"
#include "mkl_vsl.h"
using namespace std;
using namespace arma;
#define SEED 0
#define BRNG VSL_BRNG_MCG31
#define METHOD 0
int main()
{
clock_t start;
VSLStreamStatePtr stream;
start = clock();
vslNewStream(&stream, BRNG, SEED);
unsigned int R=100000;
vec Spre = 100*ones<vec> (R);
vec S = zeros<vec> (R);
double r = 0.03;
double Vol = 0.2;
double TTM = 5;
unsigned int T=260*TTM;
double dt = TTM/T;
double tmp = sqrt(dt);
vec tmp2 = zeros<vec>(R);
vec tmp3 = zeros<vec>(R);
for (unsigned int iT=0; iT<T; ++iT)
{
vdRngGaussian(METHOD,stream, R, tmp3.memptr(), 0, 1);
tmp2 =(r - 0.5 * Vol * Vol) * dt + Vol * tmp * tmp3;
vdExp(R, tmp2.memptr(), tmp3.memptr());
S = Spre%tmp3;
Spre = S;
}
cout << mean(S) << endl;
cout << (clock()-start) / (double) CLOCKS_PER_SEC << endl;
vslDeleteStream(&stream);
//system("pause");
return 0;
}
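As a cross-check that depends on neither Armadillo nor MKL, the same per-step update can be sketched with only the standard library. This is an illustration rather than a tuned implementation: `gbm_mean` is a hypothetical name, and `std::mt19937_64` with `std::normal_distribution` is simply a convenient choice of generator.

```cpp
#include <cmath>
#include <random>

// Mean terminal price over `paths` geometric Brownian motion paths, using
// the same update as above: S *= exp((r - 0.5*vol^2)*dt + vol*sqrt(dt)*z).
double gbm_mean(unsigned paths, unsigned steps, double s0, double r,
                double vol, double ttm, unsigned seed) {
    std::mt19937_64 generator(seed);
    std::normal_distribution<double> normal(0.0, 1.0);
    const double dt = ttm / steps;
    const double drift = (r - 0.5 * vol * vol) * dt;  // loop-invariant
    const double diffusion = vol * std::sqrt(dt);     // loop-invariant
    double total = 0.0;
    for (unsigned p = 0; p < paths; ++p) {
        double s = s0;
        for (unsigned t = 0; t < steps; ++t) {
            s *= std::exp(drift + diffusion * normal(generator));
        }
        total += s;
    }
    return total / paths;
}
```

With r = 0.03 and TTM = 5 the expected terminal price is 100 * exp(0.15), so the returned mean should land near that value, which gives a quick sanity check for the mean(S) printed by either program.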
The key observation is that Armadillo's exp() function is much slower than MATLAB's.
Similar overhead is observed in log(), pow() and sqrt().
Just a guess, but it looks like you need to set the number of threads to use in OpenBLAS via the OPENBLAS_NUM_THREADS environment variable.
Try something like:
set OPENBLAS_NUM_THREADS=4
...on the command line before you run your program. Substitute the number of cores in your system where I put "4" (some would say set it to twice the number of cores in your system--YMMV).
Make sure you have Streaming SIMD Extensions enabled when you compile your code. In Visual Studio, check your project C/C++ compiler code generation options.