This question already has answers here:
C: using clock() to measure time in multi-threaded programs
(2 answers)
Closed 2 years ago.
I am implementing a pattern matching algorithm that slides the template's gradient information over the target's entire gradient image, and does so at each rotation (-60 to 60). The template information for each rotation is already saved, i.e. 121 templates are preprocessed and stored.
The problem is that this takes a lot of time (approx. 110 ms), so I decided to split the matching into four sets of rotations (-60 to -30, -30 to 0, 0 to 30 and 30 to 60) and run them in 4 threads. But the threaded version takes more time than the single-threaded one (approx. 115 ms to 120 ms).
A snippet of the code:
MatchResultA totalResultsTemp[MAXTARGETNUM];
void CShapeMatch::match(ShapeInfo *ShapeInfoVec, search_region SearchRegion, float MinScore, float Greediness, int width,int height, int16_t *pBufGradX ,int16_t *pBufGradY,float *pBufMag, bool corr)
MatchResultA resultsPerDeg[MAXTARGETNUM];
int startX = SearchRegion.StartX;
int startY = SearchRegion.StartY;
int endX = SearchRegion.EndX;
int endY = SearchRegion.EndY;
float AngleStep = SearchRegion.AngleStep;
float AngleStart = SearchRegion.AngleStart;
float AngleStop = SearchRegion.AngleStop;
int startIndex = (int)(ShapeInfoVec[0].AngleNum/2) + ShapeInfoVec[0].AngleNum%2+(int)AngleStart/AngleStep;
int stopIndex = (int)(ShapeInfoVec[0].AngleNum/2) + ShapeInfoVec[0].AngleNum%2+(int)AngleStop/AngleStep;
for (int k = startIndex; k < stopIndex ; k++){
for(int j = startY; j < endY; j++){
for(int i = startX; i < endX; i++){
for(int m = 0; m < ShapeInfoVec[k].NoOfCordinates; m++)
curX = i + (ShapeInfoVec[k].Coordinates + m)->x; // template X coordinate
curY = j + (ShapeInfoVec[k].Coordinates + m)->y ; // template Y coordinate
iTx = *(ShapeInfoVec[k].EdgeDerivativeX + m); // template X derivative
iTy = *(ShapeInfoVec[k].EdgeDerivativeY + m); // template Y derivative
iTm = *(ShapeInfoVec[k].EdgeMagnitude + m); // template gradients magnitude
if(curX < 0 ||curY < 0||curX > width-1 ||curY > height-1)
offSet = curY*width + curX;
iSx = *(pBufGradX + offSet); // get corresponding X derivative from source image
iSy = *(pBufGradY + offSet); // get corresponding Y derivative from source image
iSm = *(pBufMag + offSet);
if (PartialScore > MinScore)
float Angle = ShapeInfoVec[k].Angel;
bool hasFlag = false;
for(int n = 0; n < resultsNumPerDegree; n++)
if(abs(resultsPerDeg[n].CenterLocX - i) < 5 && abs(resultsPerDeg[n].CenterLocY - j) < 5)
hasFlag = true;
if(resultsPerDeg[n].ResultScore < PartialScore)
resultsPerDeg[n].Angel = Angle;
resultsPerDeg[n].CenterLocX = i;
resultsPerDeg[n].CenterLocY = j;
resultsPerDeg[n].ResultScore = PartialScore;
resultsPerDeg[resultsNumPerDegree].Angel = Angle;
resultsPerDeg[resultsNumPerDegree].CenterLocX = i;
resultsPerDeg[resultsNumPerDegree].CenterLocY = j;
resultsPerDeg[resultsNumPerDegree].ResultScore = PartialScore;
resultsNumPerDegree ++;
minScoreTemp = minScoreTemp < PartialScore ? PartialScore : minScoreTemp;
for(int i = 0; i < resultsNumPerDegree; i++)
totalResultsTemp[totalResultsNum] = resultsPerDeg[i];
void CallerFunction(){
int16_t *pBufGradX = (int16_t *) malloc(bufferSize * sizeof(int16_t));
int16_t *pBufGradY = (int16_t *) malloc(bufferSize * sizeof(int16_t));
float *pBufMag = (float *) malloc(bufferSize * sizeof(float));
clock_t start = clock();
float temp_stop = SearchRegion->AngleStop;
SearchRegion->AngleStop = -30;
thread t1(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX ,pBufGradY,pBufMag, corr);
SearchRegion->AngleStart = -30;
thread t2(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX ,pBufGradY,pBufMag, corr);
SearchRegion->AngleStart = 0;
thread t3(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness,width, height, pBufGradX ,pBufGradY,pBufMag, corr);
SearchRegion->AngleStart = 30;
thread t4(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness,width, height, pBufGradX ,pBufGradY,pBufMag, corr);
clock_t end = clock();
cout << 1000*(double)(end-start)/CLOCKS_PER_SEC << endl;
As we can see, there are plenty of heap accesses, but they are all read-only. Only totalResultsTemp and totalResultsNum are shared global resources that are written to.
My PC configuration is,
i5-7200U CPU @ 2.50GHz 4 cores
4 Gig RAM
Ubuntu 18
for(int i = 0; i < resultsNumPerDegree; i++)
totalResultsTemp[totalResultsNum] = resultsPerDeg[i];
You are writing into a static array, and mutexes are really time-consuming. Instead of taking locks, try std::atomic_int or, better in my opinion, pass the function the exact place where it should store its results, so that synchronization is no longer your problem.
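The "pass each thread its own place to store results" idea can be sketched as follows. This is a minimal illustration, not the asker's code: `Match`, `matchRange` and `matchAll` are hypothetical names, and the dummy score stands in for the real gradient matching. Each worker appends only to its own vector, and the merge happens after all threads have been joined, so no lock or atomic is needed.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for MatchResultA from the question.
struct Match {
    int angleIndex;
    float score;
};

// Stand-in for the real scoring loop: scans template indices [first, last)
// and appends hits to `out`, a vector that only this thread touches.
void matchRange(int first, int last, float minScore, std::vector<Match>& out) {
    for (int k = first; k < last; ++k) {
        float score = static_cast<float>(k % 10) / 10.0f;  // dummy score
        if (score > minScore) out.push_back({k, score});
    }
}

// Splits [0, numAngles) into contiguous ranges, one per thread, and merges
// the per-thread results once every thread has finished.
std::vector<Match> matchAll(int numAngles, int numThreads, float minScore) {
    assert(numThreads > 0);
    std::vector<std::vector<Match>> perThread(numThreads);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        int first = numAngles * t / numThreads;
        int last = numAngles * (t + 1) / numThreads;
        workers.emplace_back(matchRange, first, last, minScore,
                             std::ref(perThread[t]));
    }
    for (auto& w : workers) w.join();

    std::vector<Match> all;
    for (const auto& part : perThread)
        all.insert(all.end(), part.begin(), part.end());
    return all;
}
```

Calling `matchAll(121, 4, 0.5f)` would correspond to the four rotation ranges in the question. The merged vector keeps the ranges in order because the parts are concatenated by thread index.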
POSIX threads in C/C++ are not concurrent, since the time the operating system assigns to each parent process must be split among its threads. Thus, your algorithm is executing on only one core. To leverage multicore hardware, you must use OpenMP. This library lets you split your algorithm across different physical cores. This is a good OpenMP tutorial
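A note on the measurement itself, in line with the duplicate linked at the top of the question: on Linux, clock() reports the processor time consumed by the whole process, summed over its threads, so a threaded run can report as much time as the serial one even when the elapsed wall-clock time has dropped. The sketch below is only an illustration under assumptions: `spin` is a dummy busy loop standing in for the matching, and `Timing` and `timeThreads` are hypothetical names. It records both clocks and stops them only after every thread has been joined.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>
#include <vector>

struct Timing {
    double cpuMs;   // std::clock(): CPU time of the process (all threads on Linux)
    double wallMs;  // steady_clock: elapsed real time
};

// Busy loop that keeps one core occupied; `volatile` keeps the optimizer
// from deleting the loop.
void spin(long iterations) {
    volatile double sink = 0.0;
    for (long i = 0; i < iterations; ++i) sink = sink + 1.0;
}

// Runs numThreads busy loops concurrently and measures them with both clocks.
Timing timeThreads(int numThreads, long iterationsPerThread) {
    assert(numThreads > 0);
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back(spin, iterationsPerThread);
    for (auto& w : workers) w.join();  // the timed region includes the joins

    auto w1 = std::chrono::steady_clock::now();
    std::clock_t c1 = std::clock();

    Timing result;
    result.cpuMs = 1000.0 * static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
    result.wallMs = std::chrono::duration<double, std::milli>(w1 - w0).count();
    return result;
}
```

On a multicore machine, comparing `timeThreads(1, n)` with `timeThreads(4, n)` would be expected to show `cpuMs` growing with the thread count while `wallMs` stays roughly flat. Only the wall-clock figure says whether threading helped.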
I'm wondering if there is a simple way to run a function multiple times in parallel. I've tried multithreading, but either there is something I don't understand or it doesn't actually speed up the calculations (quite the opposite, actually). Here is the function that I want to run in parallel:
void heun_update_pos(vector<planet>& planets, vector<double> x_i, vector<double> y_i, vector<double> mass, size_t n_planets, double h, int i)
if (planets[i].mass != 0) {
double sum_gravity_x = 0;
double sum_gravity_y = 0;
//loop for collision check and gravitational contribution
for (int j = 0; j < n_planets; j++) {
if (planets[j].mass != 0) {
double delta_x = planets[i].x_position - x_i[j];
double delta_y = planets[i].y_position - y_i[j];
//computing the distances between two planets in x and y
if (delta_x != 0 && delta_y != 0) {
//collision test
if (collision_test(planets[i], planets[j], delta_x, delta_y) == true) {
planets[i].mass += planets[j].mass;
planets[j].mass = 0;
//sum of the gravity contributions from other planets
sum_gravity_x += gravity_x(delta_x, delta_y, mass[j]);
sum_gravity_y += gravity_y(delta_x, delta_y, mass[j]);
double sx_ip1 = planets[i].x_speed + (h / 2) * sum_gravity_x;
double sy_ip1 = planets[i].y_speed + (h / 2) * sum_gravity_y;
double x_ip1 = planets[i].x_position + (h / 2) * (planets[i].x_speed + sx_ip1);
double y_ip1 = planets[i].y_position + (h / 2) * (planets[i].y_speed + sy_ip1);
planets[i].update_position(x_ip1, y_ip1, sx_ip1, sy_ip1);
And here is how I tried to use multithreading with it:
const int cores = 6;
vector<thread> threads(cores);
int active_threads = 0;
int closing_threads = 1;
for (int i = 0; i < n_planets; i++) {
threads[active_threads] = thread(&heun_update_pos, ref(planets), x_i, y_i, mass, n_planets, h, i);
if (i > cores - 2) threads[closing_threads].join();
//There should only be as many threads as there are cores
if (closing_threads > cores - 1) closing_threads = 0;
active_threads++; // counting the number of active threads
if (active_threads >= cores) active_threads = 0;
for (int k = 0; k < cores; k++) {
if (threads[k].joinable()) threads[k].join();
I just started learning C++ today (used Python before), this is my first code, so I am not very familiar with all the C++ functionalities.
Creating a new thread takes a lot of time, typically 50-100 microseconds. Depending on how long your serial version takes, this may not help much. If you run this code many times, it is worth trying a thread pool, since waking up an existing thread takes at most about 5 microseconds.
Check out a similar answer here:
Is there a performance benefit in using a pool of threads over simply creating threads?
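Short of a full thread pool, a simpler way to cut the thread-creation cost is to start one thread per core and give each a contiguous block of indices, instead of one thread per planet. The sketch below is not a pool, and `parallelFor` and `squares` are hypothetical names. It assumes that the body for index i writes only to element i, which the collision handling in the question does not satisfy as written, since it also modifies planets[j].

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs body(i) for every i in [0, n) using at most numThreads threads.
// Each thread handles one contiguous block of indices.
void parallelFor(std::size_t n, unsigned numThreads,
                 const std::function<void(std::size_t)>& body) {
    assert(numThreads > 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        std::size_t first = n * t / numThreads;
        std::size_t last = n * (t + 1) / numThreads;
        if (first == last) continue;  // more threads than items
        workers.emplace_back([first, last, &body] {
            for (std::size_t i = first; i < last; ++i) body(i);
        });
    }
    for (auto& w : workers) w.join();
}

// Example body: every index writes only its own output slot, so the
// threads never touch the same element.
std::vector<double> squares(std::size_t n, unsigned numThreads) {
    std::vector<double> out(n);
    parallelFor(n, numThreads, [&out](std::size_t i) {
        out[i] = static_cast<double>(i) * static_cast<double>(i);
    });
    return out;
}
```

With 6 cores this creates 6 threads per call, however many planets there are. Whether that beats the serial loop still depends on how much work each index does.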
There is a framework for multithreading calculation in C++ called OpenMP. You might think about using it.
I'm trying to find the fastest way to compute an MxN matrix of dot products in C++ using Eigen.
3rd Update: This is part of a Maya 2018 plugin (for which I'm using Visual Studio 2017 with the v141 toolset, which is apparently what they recommend for working properly with their devkit).
Lets say we have 2 Matrices:
A(3,5812) and B(3,23686)
What I'm looking for is achieved in NumPy using:
result = np.dot(A.T, B), which creates a result matrix of 5812 x 23686
np.dot takes 394 ms
So my progress so far:
Loop in loop method:
for (int i = 0; i < (int)A.cols(); i++) {
for (int j = 0; j < (int)B.cols(); j++) {
result(i, j) = A.col(i).cwiseProduct(B.col(j)).sum();
Process completed in: 891 ms
and using Eigen broadcasting:
for (int i = 0; i < (int)A.cols(); i++) {
result.row(i) = A.col(i).replicate(1, B.cols()).cwiseProduct(B).colwise().sum();
Process completed in: 844 ms. #
result = A.transpose() * B
Process completed in: 690 ms. # ( Update 3 .. 170ms)
I used OpenMP to multithread the loops, but I removed that from the code above for readability.
There has to be a faster way to do this properly, maybe without using loops at all? I've been searching for a couple of days now, so in the end I decided to post here.
Second Update:
Managed to reduce the time to 170ms using A.transpose() * B... with openblas, and using MatrixXf instead of Xd..since for this particular implementation that precision is fine... still working on it trying various optimizations...
Third Update:
So for the sake of making things more clear I will post the function I'm currently working on...
Python version:
def getCorrespondences(X, Y, Cx, Cy, Rx):
timer = time.time()
X_ = np.dot(Rx, X - Cx)
Y_ = Y - Cy
ab = np.dot(X_.T, Y_) # each cell is X_i dot Y_j
xx = np.sum(X_*X_, 0)
yy = np.sum(Y_*Y_, 0)
D = (xx[:, np.newaxis] + yy[np.newaxis, :]) - 2*ab
idx = np.argmin(D, 1)
elapsed_time = time.time() - timer
print (elapsed_time)
return idx
Python computes this in 1.377s
And my current c++ implementation
DWORD tcStart1 = GetTickCount();
int vtxCountX = (int)X.cols();
int vtxCountY = (int)Y.cols();
MatrixXf X_ = Rx * (X - Cx.col(0).replicate(1, vtxCountX));
MatrixXf Y_ = Y - Cy.col(0).replicate(1, vtxCountY);
MatrixXf ab = X_.transpose() * Y_;
MatrixXf xx = X_.cwiseProduct(X_).colwise().sum();
MatrixXf yy = Y_.cwiseProduct(Y_).colwise().sum();
MatrixXf D(vtxCountX, vtxCountY);
MatrixXf::Index index;
VectorXi idx(D.rows());
#ifdef _OPENMP
int a, b;
if (vtxCountX > vtxCountY) { b = threads / 4; a = threads - b; }
else { a = threads / 4; b = threads - a; }
#pragma omp parallel for num_threads(a)
for (int i = 0; i < vtxCountX; i++) {
#ifdef _OPENMP
#pragma omp parallel for num_threads(b)
for (int j = 0; j < vtxCountY; j++) {
D(i, j) = xx(i) + yy(j) - 2 * ab(i,j);
idx(i) = (int)index;
DWORD tcTotal1 = GetTickCount() - tcStart1;
//cout << idx.transpose() << endl;
cout << "Correspondence computed in: " << tcTotal1 << endl;
return idx;
This seems to do the job in 906 ms with MatrixXf, and 1297 ms with MatrixXd.
I still need to figure out how to squeeze more speed out of this, since this function will end up being called hundreds of times.
I will try the other things proposed in the comments, and if you have any other suggestions I'm all ears :). I ended up casting to float since the result is a bunch of indices, so I don't need the precision. The function basically computes distances and finds point correspondences for Procrustes alignment.
At a certain point I ended up using loops instead of Eigen's replicate and cwise broadcasting, since those seemed to be slower. Anyway, I hope this gives more context. Cheers
Update 4...
int vtxCountX = (int)X.rows(); int vtxCountY = (int)Y.rows();
MatrixXf X_ = (X - Cx.row(0).replicate(vtxCountX, 1)) * Rx.transpose();
MatrixXf Y_ = Y - Cy.row(0).replicate(vtxCountY, 1);
MatrixXf ab = Y_* X_.transpose();
MatrixXf xx = X_.cwiseProduct(X_).rowwise().sum();
MatrixXf yy = Y_.cwiseProduct(Y_).rowwise().sum();
MatrixXf D(vtxCountX, vtxCountY);
MatrixXf::Index index;
VectorXi idx(D.rows());
#ifdef _OPENMP
int a, b;
if (vtxCountX > vtxCountY) { b = threads / 4; a = threads - b; }
else { a = threads / 4; b = threads - a; }
#pragma omp parallel for num_threads(a)
for (int i = 0; i < vtxCountX; i++) {
#ifdef _OPENMP
#pragma omp parallel for num_threads(4)
for (int j = 0; j < vtxCountY; j++) {
D(i, j) = xx(i) + yy(j) - 2 * ab(j,i);
idx(i) = (int)index;
This does the job in 469 ms... 17 faster than my original implementation... and ~3 times faster than numpy...
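The distance expansion used in these versions is D(i, j) = xx(i) + yy(j) - 2 * ab(i, j), followed by an argmin over each row. A minimal plain C++ sketch of the same idea (no Eigen; `Point3`, `dot` and `nearestIndices` are hypothetical names) shows that the full D matrix does not need to be stored, and that xx(i) can be dropped because it is constant within a row and cannot change which column is smallest:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Point3 {
    float x, y, z;
};

inline float dot(const Point3& a, const Point3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// For each point in X, returns the index of the closest point in Y.
// Uses |x - y|^2 = x.x + y.y - 2 x.y; the x.x term is the same for every
// candidate y, so only y.y - 2 x.y is compared.
std::vector<int> nearestIndices(const std::vector<Point3>& X,
                                const std::vector<Point3>& Y) {
    assert(!Y.empty());
    std::vector<float> yy(Y.size());
    for (std::size_t j = 0; j < Y.size(); ++j) yy[j] = dot(Y[j], Y[j]);

    std::vector<int> idx(X.size(), -1);
    for (std::size_t i = 0; i < X.size(); ++i) {
        float best = std::numeric_limits<float>::max();
        for (std::size_t j = 0; j < Y.size(); ++j) {
            float d = yy[j] - 2.0f * dot(X[i], Y[j]);
            if (d < best) {
                best = d;
                idx[i] = static_cast<int>(j);
            }
        }
    }
    return idx;
}
```

The outer loop over i is independent per row, so it is the natural place for a single `#pragma omp parallel for`. This sketch gives up the BLAS-backed matrix product, so it is not necessarily faster than the Eigen version. It only illustrates the arithmetic and the memory saving.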
The error message I'm getting:
Unhandled exception at 0x7712A9F2 in eye_tracking.exe: Microsoft C++ exception: std::future_error at memory location 0x010FEA50.
Code snippet of where I fork and join:
std::vector<costGrad*> threadGrads;
std::vector<std::thread> threads;
std::vector<std::future<costGrad*>> ftr(maxThreads);
for (int i = 0; i < maxThreads; i++) //Creating threads
int start = floor(xValsB.rows() / (double)maxThreads * i);
int end = floor(xValsB.rows() / (double)maxThreads * (i+1));
int length = end-start;
std::promise<costGrad*> prms;
ftr[i] = prms.get_future();
costThread(std::move(prms), params, xValsB.block(start, 0, length, xValsB.cols()), yVals.block(start, 0, length, yVals.cols()), lambda, m);
for (int i = 0; i < maxThreads; i++) //Collecting future
threadGrads.push_back(ftr[i].get()); // <--- I think this is where I'm messing up
for (int i = 0; i < maxThreads; i++) //Joining threads
Following is the costThread function:
void costThread(std::promise<costGrad*> && pmrs,
const std::vector<Eigen::MatrixXd>& params,
const Eigen::MatrixXd& xValsB,
const Eigen::MatrixXd& yVals,
const double lambda,
const int m)
costGrad* temp = new costGrad; //"Cost / Gradient" struct to be returned at end
temp->forw = 0;
temp->back = 0;
std::vector<Eigen::MatrixXd> matA; //Contains the activation values including bias, first entry will be xVals
std::vector<Eigen::MatrixXd> matAb; //Contains the activation values excluding bias, first entry will be xVals
std::vector<Eigen::MatrixXd> matZ; //Contains the activation values prior to sigmoid
std::vector<Eigen::MatrixXd> paramTrunc; //Contains the parameters exluding bias terms
clock_t t1, t2, t3;
t1 = clock();
Eigen::MatrixXd xVals = Eigen::MatrixXd::Constant(xValsB.rows(), xValsB.cols() + 1, 1); //Add bias units onto xVal
xVals.block(0, 1, xValsB.rows(), xValsB.cols()) = xValsB;
for (int i = 0; i < params.size(); i++)
Eigen::MatrixXd paramTemp = params[i].block(0, 1, params[i].rows(), params[i].cols() - 1); //Setting up paramTrunc
matZ.push_back(matA.back() * params[i].transpose());
Eigen::MatrixXd tempA = Eigen::MatrixXd::Constant(matAb.back().rows(), matAb.back().cols() + 1, 1); //Add bias units
tempA.block(0, 1, matAb.back().rows(), matAb.back().cols()) = matAb.back();
t2 = clock();
temp->J = (yVals.array()*(0 - log(matAb.back().array())) - (1 - yVals.array())*log(1 - matAb.back().array())).sum() / m;
std::vector<Eigen::MatrixXd> del;
std::vector<Eigen::MatrixXd> grad;
del.push_back(matAb.back() - yVals);
for (int i = 0; i < params.size() - 1; i++)
del.push_back((del.back() * paramTrunc[paramTrunc.size() - 1 - i]).array() * sigmoidGrad(matZ[matZ.size() - 2 - i]).array());
for (int i = 0; i < params.size(); i++)
grad.push_back(del.back().transpose() * matA[i] / m);
for (int i = 0; i < params.size(); i++)
int rws = grad[i].rows();
int cls = grad[i].cols() - 1;
Eigen::MatrixXd tmp = grad[i].block(0, 1, rws, cls);
grad[i].block(0, 1, rws, cls) = tmp.array() + lambda / m*paramTrunc[i].array();
temp->grad = grad;
t3 = clock();
temp->forw = ((float)t2 - (float)t1) / 1000;
temp->back = ((float)t3 - (float)t2) / 1000;
catch (...)
//return temp;
Figured out the exception is a broken promise. I'm still having problems understanding what I'm getting wrong here. At the end of costThread() I use
And I expect the following to get temp:
for (int i = 0; i < maxThreads; i++) //Collecting future
But somehow I'm getting it all wrong.
You have a race condition: you are passing a local variable to a thread by reference and moving from it inside the thread. This works only if the new thread manages to execute the move before the local variable is destroyed by going out of scope. Given the code, the destructor will normally win.
If you can use C++14, you can move the promise in the lambda initializer:
std::thread([prms = std::move(prms)]() mutable {
costThread(std::move(prms), /* etc */);
});
If you're limited to C++11, wrap the promise into a std::shared_ptr and pass it by value.
I would additionally handle exceptions in the worker threads and pass them to the processing thread through std::promise::set_exception(), though it's a matter of preference.
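Both suggestions, moving the promise into the thread and forwarding exceptions with set_exception(), can be combined in a small self-contained sketch. The names `worker` and `runInThread` are hypothetical and the doubling stands in for the real cost computation. It assumes C++14 for the init-capture.

```cpp
#include <cassert>
#include <exception>
#include <future>
#include <stdexcept>
#include <thread>
#include <utility>

// Worker in the style of costThread: takes the promise by rvalue reference
// and always fulfils it, either with a value or with the caught exception.
void worker(std::promise<int>&& result, int input) {
    try {
        if (input < 0) throw std::invalid_argument("negative input");
        result.set_value(input * 2);
    } catch (...) {
        result.set_exception(std::current_exception());
    }
}

// Starts the worker in a thread that owns the promise, so the promise
// outlives the local variable it was moved from.
int runInThread(int input) {
    std::promise<int> prms;
    std::future<int> ftr = prms.get_future();
    assert(ftr.valid());

    // `mutable` is required: without it the captured promise is const
    // and cannot be moved on into worker().
    std::thread t([p = std::move(prms), input]() mutable {
        worker(std::move(p), input);
    });

    t.join();
    return ftr.get();  // returns the value or rethrows the stored exception
}
```

Because the lambda owns the promise, there is no window in which the thread refers to a destroyed local, and a failure inside the worker surfaces at `ftr.get()` instead of as a broken promise.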
I wrote a (probably-inefficient, but anyway..) Rcpp code using inline to simulate a stochastic SEIR model.
The serial version compiles and works perfectly, but since I need to simulate from it a large number of times and since it seems to me like an embarrassingly parallel problem (just need to simulate again for other parameter values and return a matrix with the results) I tried to add #pragma omp parallel for and to compile with -fopenmp -lgomp but ... boom!
I get a segfault even for very small examples!
I tried to add setenv("OMP_STACKSIZE","24M",1); and values well over 24M but still the segfault happens.
I'll briefly explain the code since it's a bit long (I tried to shorten it, but then the result changes and I can't reproduce the problem):
I have two nested loops: the inner one executes the model for a given parameter set and the outer one changes the parameters.
The only way a race condition could happen is if the code tried to execute the instructions inside the inner loop in parallel (which cannot be done because of the model structure: iteration t depends on iteration t-1) instead of parallelizing the outer loop. But if I'm not mistaken, parallelizing only the outer loop is what the parallel for construct does by default when it is placed just outside the outer loop...
This is basically the form of the code I'm trying to run:
mat result(n_param,T_MAX);
#pragma omp parallel for
for(int i=0;i<n_param_set;i++){
rowvec jnk(T_MAX);
while(t < T_MAX){
jnk(t) = something(jnk(t-1));
return wrap(result);
And my question is: how do I tell the compiler that I just want to compute the outer loop in parallel (even distributing the iterations statically, like n_loops/n_threads for each thread) and not the inner one (which is actually non-parallelizable)?
The real code is a bit more involved and I'll present it here for the sake of reproducibility if you're really willing, but I'm only asking about the behavior of OpenMP. Please notice that the only OpenMP instruction appears at line 122.
#include <math.h>
#include <omp.h>
using namespace arma;
template <typename T> int sgn(T val) {
return (T(0) < val) - (val < T(0));
uvec rmultinomial(int n,vec prob)
int K = prob.n_elem;
uvec rN = zeros<uvec>(K);
double p_tot = sum(prob);
double pp;
for(int k = 0; k < K-1; k++) {
if(prob(k)>0) {
pp = prob[k] / p_tot;
rN(k) = ((pp < 1.) ? (rbinom(1,(double) n, pp))(0) : n);
n -= rN[k];
} else
rN[k] = 0;
if(n <= 0) /* we have all*/
return rN;
p_tot -= prob[k]; /* i.e. = sum(prob[(k+1):K]) */
rN[K-1] = n;
return rN;
mat SEIR_sim_plus_summaries()
vec alpha;
alpha << 0.002 << 0.0045;
vec beta;
beta << 0.01 << 0.01;
vec gamma;
gamma << 1.0/14.0 << 1.0/14.0;
vec sigma;
sigma << 1.0/(3.5) << 1.0/(3.5);
vec phi;
phi << 0.8 << 0.8;
int S_0 = 800;
int E_0 = 100;
int I_0 = 100;
int R_0 = 0;
int pop = 1000;
double tau = 0.01;
double t_0 = 0;
vec obs_time;
obs_time << 1 << 2 << 3 << 4 << 5 << 6 << 7 << 8 << 9 << 10 << 11 << 12 << 13 << 14 << 15 << 16 << 17 << 18 << 19 << 20 << 21 << 22 << 23 << 24;
const int n_obs = obs_time.n_elem;
const int n_part = alpha.n_elem;
mat stat(n_part,6);
//#pragma omp parallel for
for(int k=0;k<n_part;k++) {
ivec INC_i(n_obs);
ivec INC_o(n_obs);
// Event variables
double alpha_t;
int nX; //current number of people moving
vec rates(8);
uvec trans(4); // current transitions, e.g. from S to E,I,R,Universe
vec r(4); // rates e.g. from S to E, I, R, Univ.
/*********************** Initialize **********************/
int S_curr = S_0;
int S_prev = S_0;
int E_curr = E_0;
int E_prev = E_0;
int I_curr = I_0;
int I_prev = I_0;
int R_curr = R_0;
int R_prev = R_0;
int IncI_curr = 0;
int IncI_prev = 0;
int IncO_curr = 0;
int IncO_prev = 0;
double t_curr = t_0;
int t_idx =0;
while( t_idx < n_obs ) {
// next time preparation
t_curr += tau;
S_prev = S_curr;
E_prev = E_curr;
I_prev = I_curr;
R_prev = R_curr;
IncI_prev = IncI_curr;
IncO_prev = IncO_curr;
/*********************** description (rates) of the events **********************/
alpha_t = alpha(k)*(1+phi(k)*sin(2*M_PI*(t_curr+0)/52)); //real contact rate, time expressed in weeks
rates(0) = (alpha_t * ((double)I_curr / (double)pop ) * ((double)S_curr)); //e+1, s-1, r,i one s get infected (goes in E, not yey infectous)
rates(1) = (sigma(k) * E_curr); //e-1, i+1, r,s one exposed become infectous (goes in I) INCIDENCE!!
rates(2) = (gamma(k) * I_curr); //i-1, s,e, r+1 one i recover
rates(3) = (beta(k) * I_curr); //i-1, s, r,e one i dies
rates(4) = (beta(k) * R_curr); //i,e, s, r-1 one r dies
rates(5) = (beta(k) * E_curr); //e-1, s, r,i one e dies
rates(6) = (beta(k) * S_curr); //s-1 e, i ,r one s dies
rates(7) = (beta(k) * pop); //s+1 one susc is born
// Let the events occour
/*********************** S compartement **********************/
nX = rbinom(1,S_prev,1-exp(-(rates(0)+rates(6))*tau))(0);
r(0) = rates(0)/(rates(0)+rates(6)); r(1) = 0.0; r(2) = 0; r(3) = rates(6)/(rates(0)+rates(6));
trans = rmultinomial(nX, r);
S_curr -= nX;
E_curr += trans(0);
I_curr += trans(1);
R_curr += trans(2);
//trans(3) contains dead individual, who disappear...we could avoid this using sequential conditional binomial
/*********************** E compartement **********************/
nX = rbinom(1,E_prev,1-exp(-(rates(1)+rates(5))*tau))(0);
r(0) = 0.0; r(1) = rates(1)/(rates(1)+rates(5)); r(2) = 0.0; r(3) = rates(5)/(rates(1)+rates(5));
trans = rmultinomial(nX, r);
S_curr += trans(0);
E_curr -= nX;
I_curr += trans(1);
R_curr += trans(2);
IncI_curr += trans(1);
/*********************** I compartement **********************/
nX = rbinom(1,I_prev,1-exp(-(rates(2)+rates(3))*tau))(0);
r(0) = 0.0; r(1) = 0.0; r(2) = rates(2)/(rates(2)+rates(3)); r(3) = rates(3)/(rates(2)+rates(3));
trans = rmultinomial(nX, r);
S_curr += trans(0);
E_curr += trans(1);
I_curr -= nX;
R_curr += trans(2);
IncO_curr += trans(2);
/*********************** R compartement **********************/
nX = rbinom(1,R_prev,1-exp(-rates(4)*tau))(0);
r(0) = 0.0; r(1) = 0.0; r(2) = 0.0; r(3) = rates(4)/rates(4);
trans = rmultinomial(nX, r);
S_curr += trans(0);
E_curr += trans(1);
I_curr += trans(2);
R_curr -= nX;
/*********************** Universe **********************/
S_curr += pop - (S_curr+E_curr+I_curr+R_curr); //it should be poisson, but since the pop is fixed...
/*********************** Save & Continue **********************/
// Check if the time is interesting for us
if(t_curr > obs_time[t_idx]){
INC_i(t_idx) = IncI_curr;
INC_o(t_idx) = IncO_curr;
IncI_curr = IncI_prev = 0;
IncO_curr = IncO_prev = 0;
//else just go on...
/*********************** Finished - Starting w/ stats **********************/
// INC_i is the useful variable, how can I change its reference withour copying it?
ivec incidence = INC_i; //just so if I want to use INC_o i have to change just this...
//Scan the epidemics to recover the summary stats (naively divide the data each 52 weeks)
double n_years = ceil((double)obs_time(n_obs-1)/52.0);
vec mu_attack(n_years);
vec ratio_attack(n_years-1);
vec peak(n_years);
vec atk(52);
vec tmpExplo(52); //explosiveness
vec explo(n_years);
int year=0;
int week;
for(week=0 ; week<n_obs ; week++){
if(week - 52*year > 51){
mu_attack(year) = sum( atk )/(double)pop;
ratio_attack(year-1) = mu_attack(year)/mu_attack(year-1);
for(int i=0;i<52;i++){
tmpExplo(i) = 1.0;
} else {
tmpExplo(i) = 0.0;
explo(year) = sum(tmpExplo);
atk(week-52*year) = incidence(week);
if( peak(year) < incidence(week) )
if(week - 52*year > 51){
mu_attack(year) = sum( atk )/(double)pop;
} else {
ivec idx(52);
for(int i=0;i<52;i++)
{ idx(i) = i; } //take just the updated ones...
vec tmp = atk.elem(find(idx<(week - 52*year)));
mu_attack(year) = sum( tmp )/((double)pop * (tmp.n_elem/52.0));
ratio_attack(year-1) = mu_attack(year)/mu_attack(year-1);
for(int i=0;i<tmp.n_elem;i++){
tmpExplo(i) = 1.0;
} else {
tmpExplo(i) = 0.0;
for(int i=tmp.n_elem;i<52;i++)
tmpExplo(i) = 0.0; //to reset the others
explo(year) = sum(tmpExplo);
double correlation2;
double correlation4;
vec autocorr = acf(peak);
/***** ACF *****/
} else {
correlation2 = autocorr(1);
correlation4 = 0.0;
} else {
correlation2 = autocorr(1);
correlation4 = autocorr(3);
rowvec jnk(6);
jnk << sum(mu_attack)/(year+1.0)
<< (sum( log(ratio_attack)%log(ratio_attack) )/(n_years-1)) - (pow(sum( log(ratio_attack) )/(n_years-1),2))
<< correlation2 << correlation4 << max(peak) << sum(explo)/n_years;
stat.row(k) = jnk;
return stat;
std::cout << "max_num_threads " << omp_get_max_threads() << std::endl;
RNGScope scope;
mat summaries = SEIR_sim_plus_summaries();
return wrap(summaries);
plug = getPlugin("RcppArmadillo")
## modify the plugin for Rcpp to support OpenMP
plug$env$PKG_CXXFLAGS <- paste('-fopenmp', plug$env$PKG_CXXFLAGS)
plug$env$PKG_LIBS <- paste('-fopenmp -lgomp', plug$env$PKG_LIBS)
SEIR_sim_summary = cxxfunction(sig=signature(),main,settings=plug,inc = paste(misc,model_and_summary),verbose=TRUE)
Thanks for the help!
NB: before you ask, I slightly modified the Rcpp multinomial sampling function just because I liked that way more than the one using pointer...not any other particular reason! :)
The core pseudo-random number generators (PRNGs) in R are not designed to be used in multithreaded environments. That is, their state is stored in a static array (dummy from src/main/PRNG.c) and therefore is shared among all threads. Moreover several other static structures are used to store states for the higher-level interfaces to the core PRNGs.
A possible solution could be that you put each call to rnorm() or other sampling functions inside named critical sections with all having the same name, e.g.:
#pragma omp critical(random)
rN(k) = ((pp < 1.) ? (rbinom(1,(double) n, pp))(0) : n);
#pragma omp critical(random)
nX = rbinom(1,S_prev,1-exp(-(rates(0)+rates(6))*tau))(0);
Note that the critical construct operates on the structured block following it and therefore locks the entire statement. If a random number is being drawn inline inside a call to a time-consuming function, e.g.
#pragma omp critical(random)
x = slow_computation(rbinom(...));
this is better transformed to:
#pragma omp critical(random)
rb = rbinom(...);
x = slow_computation(rb);
That way only the rb = rbinom(...); statement will be protected.
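The pattern can be tried outside R with a small stand-in. In the sketch below, a single global std::mt19937 plays the role of R's shared, non-thread-safe generator state. That is an assumption for illustration, as are the names `drawBinomial`, `slowComputation` and `simulate`; the real code would call rbinom. The pragmas are guarded so the file also builds without -fopenmp.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Stand-in for R's global RNG state: one engine shared by all threads.
static std::mt19937 sharedEngine(12345);

// Not thread-safe on its own, because it mutates sharedEngine.
int drawBinomial(int n, double p) {
    std::binomial_distribution<int> dist(n, p);
    return dist(sharedEngine);
}

// Placeholder for the expensive per-draw work that can run concurrently.
double slowComputation(int x) {
    return std::sqrt(static_cast<double>(x));
}

std::vector<double> simulate(int nParamSets, int n, double p) {
    assert(nParamSets >= 0);
    std::vector<double> result(nParamSets);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int i = 0; i < nParamSets; ++i) {
        int rb;
        // Only the draw is serialized; all draws share the name "random".
#ifdef _OPENMP
#pragma omp critical(random)
#endif
        {
            rb = drawBinomial(n, p);
        }
        result[i] = slowComputation(rb);  // outside the critical section
    }
    return result;
}
```

With OpenMP enabled, the order in which threads reach the critical section is not fixed, so which draw lands in which `result[i]` can differ between runs even with the same seed.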
I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:
inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }
For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.
Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bits system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.
To speed this up, I've attempted the following:
Compile with g++ -O3. (Well, I'd been doing this from the beginning.)
Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.
Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?
(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")
Update: as requested, here's a full program:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
void solve(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
cout << d_x[0] << endl; // prevent the thing from being optimized away
I compile and run it as follows:
$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
real 0m1.052s
user 0m1.050s
sys 0m0.010s
(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)
Update 2: I've been told that uninitialized values may not be representative, because NaN and Inf values may slow things down. I am now clearing the memory in the example code. It makes no difference in execution speed for me, though.
A couple of ideas:
1. Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage is that some of the d_x values will be "stale", so your convergence rate will suffer (but not as badly as with a Jacobi iteration); it's hard to say whether the speedup offsets that.
2. Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5).
3. Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing incompressibility), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
4. Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using FFT-based methods.
If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.
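The SOR suggestion amounts to a one-line change to the Gauss-Seidel update: compute the Gauss-Seidel value as before, then move the cell past it by a relaxation factor omega. The sketch below is a minimal illustration with hypothetical names (`Grid`, `sorStep`). It assumes a row-major layout with row stride nx and the same five-point stencil and coefficient arrays as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container mirroring the question's d_x, d_b, d_w, d_e, d_s, d_n.
struct Grid {
    std::size_t nx, ny;
    std::vector<float> x, b, w, e, s, n;
    Grid(std::size_t nx_, std::size_t ny_)
        : nx(nx_), ny(ny_),
          x(nx_ * ny_, 0.0f), b(nx_ * ny_, 0.0f), w(nx_ * ny_, 0.0f),
          e(nx_ * ny_, 0.0f), s(nx_ * ny_, 0.0f), n(nx_ * ny_, 0.0f) {}
};

// One SOR sweep over the interior cells. omega == 1 reproduces the plain
// Gauss-Seidel update; values between 1 and 2 over-relax.
void sorStep(Grid& g, float omega) {
    assert(omega > 0.0f && omega < 2.0f);
    for (std::size_t y = 1; y + 1 < g.ny; ++y) {
        for (std::size_t x = 1; x + 1 < g.nx; ++x) {
            std::size_t ic = y * g.nx + x;
            // Gauss-Seidel value for this cell, using already-updated neighbours.
            float gs = g.b[ic]
                     - g.w[ic] * g.x[ic - 1] - g.e[ic] * g.x[ic + 1]
                     - g.s[ic] * g.x[ic - g.nx] - g.n[ic] * g.x[ic + g.nx];
            // Relaxation: x_new = (1 - omega) * x_old + omega * gs.
            g.x[ic] += omega * (gs - g.x[ic]);
        }
    }
}
```

The extra cost per cell is one multiply and two adds, so any gain comes from needing fewer sweeps. The best omega depends on the system and has to be found by experiment or analysis.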
I think I've managed to optimize it. Here's the code: create a new project in VC++, add this code and simply compile under "Release".
#include <iostream>
#include <cstdlib>
#include <cstring>
#define _WIN32_WINNT 0x0400
#include <windows.h>
#include <conio.h>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step_original() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}
void step_new() {
    // Start each pointer at the same offset the index version uses:
    // ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1
    float *d_b_ic = d_b + d_ny + 1;
    float *d_w_ic = d_w + d_ny + 1;
    float *d_e_ic = d_e + d_ny + 1;
    float *d_x_ic = d_x + d_ny + 1;
    float *d_x_iw = d_x + d_ny;
    float *d_x_ie = d_x + d_ny + 2;
    float *d_x_is = d_x + 1;
    float *d_x_in = d_x + 2 * d_ny + 1;
    float *d_n_ic = d_n + d_ny + 1;
    float *d_s_ic = d_s + d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            /*d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];*/
            *d_x_ic = *d_b_ic
                - *d_w_ic * *d_x_iw - *d_e_ic * *d_x_ie
                - *d_s_ic * *d_x_is - *d_n_ic * *d_x_in;
            //++ic; ++iw; ++ie; ++is; ++in;
            ++d_b_ic; ++d_w_ic; ++d_e_ic; ++d_n_ic; ++d_s_ic;
            ++d_x_ic; ++d_x_iw; ++d_x_ie; ++d_x_is; ++d_x_in;
        }
        //ic += 2; iw += 2; ie += 2; is += 2; in += 2;
        d_b_ic += 2; d_w_ic += 2; d_e_ic += 2; d_n_ic += 2; d_s_ic += 2;
        d_x_ic += 2; d_x_iw += 2; d_x_ie += 2; d_x_is += 2; d_x_in += 2;
    }
}
void solve_original(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step_original();
    }
}
void solve_new(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step_new();
    }
}
void clear(float *a) {
    memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d_b = new float[n]; clear(d_b);
    d_w = new float[n]; clear(d_w);
    d_e = new float[n]; clear(d_e);
    d_s = new float[n]; clear(d_s);
    d_n = new float[n]; clear(d_n);
    if (argc < 3) {
        printf("app.exe (x)iters (o/n)algo\n");
        return 1;
    }
    bool bOriginalStep = (argv[2][0] == 'o');
    size_t iters = atoi(argv[1]);
    /*printf("Press any key to start!");
    printf(" Running speed test..\n");*/
    __int64 freq, start, end, diff;
    if (!::QueryPerformanceFrequency((LARGE_INTEGER*)&freq))
        throw "Not supported!";
    freq /= 1000000; // microseconds!
    ::QueryPerformanceCounter((LARGE_INTEGER*)&start);
    if (bOriginalStep)
        solve_original(iters);
    else
        solve_new(iters);
    ::QueryPerformanceCounter((LARGE_INTEGER*)&end);
    diff = (end - start) / freq;
    // diff is 64-bit, so print it as long long rather than with %u
    printf("Speed (%s)\t\t: %lld\n", (bOriginalStep ? "original" : "new"), (long long)diff);
    //cout << d_x[0] << endl; // prevent the thing from being optimized away
    return 0;
}
Run it like this:
app.exe 10000 o
app.exe 10000 n
"o" means old code, yours.
"n" is mine, the new one.
My results:
Speed (original):
Speed (new):
Improvement of about 30%.
The logic behind:
You've been using index counters to access/manipulate.
I use pointers.
While running, set a breakpoint on one of the calculation lines in VC++'s debugger and press F8. You'll get the disassembler window.
Then you'll see the produced opcodes (assembly code).
Anyway, look:
int *x = ...;
x[3] = 123;
This tells the PC to put the pointer x in a register (say EAX).
Then add (3 * sizeof(int)) to it.
Only then, set the value to 123.
The pointer approach is better, as you can see, because we cut out the adding step; we handle it ourselves, and so are able to optimize it as needed.
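To make the comparison concrete, here is a small sketch of the same loop written both ways. The function names are made up for illustration. Whether the pointer form is actually faster depends on the compiler and the optimization level: optimizing compilers often emit the same machine code for both, so it is worth checking the disassembly as described above rather than assuming a gain.

```cpp
#include <cassert>
#include <cstddef>

// Index form: each access is expressed as base + i.
float sum_indexed(const float *a, std::size_t n) {
    assert(a != nullptr || n == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}

// Pointer form: the pointer itself is advanced and no index is kept.
float sum_pointer(const float *a, std::size_t n) {
    assert(a != nullptr || n == 0);
    float total = 0.0f;
    for (const float *p = a, *end = a + n; p != end; ++p)
        total += *p;
    return total;
}
```

Both functions visit the elements in the same order, so they return the same value for the same input.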
I hope this helps.
Sidenote to the site's staff:
Great website, I wish I had heard of it long ago!
For one thing, there seems to be a pipelining issue here. The loop reads from the value in d_x that has just been written to, but apparently it has to wait for that write to complete. Just rearranging the order of the computation, doing something useful while it's waiting, makes it almost twice as fast:
d_x[ic] = d_b[ic]
- d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in]
- d_w[ic] * d_x[iw] /* d_x[iw] has just been written to, process this last */;
It was Eamon Nerbonne who figured this out. Many upvotes to him! I would never have guessed.
Poni's answer looks like the right one to me.
I just want to point out that in this type of problem, you often gain benefits from memory locality. Right now, the b,w,e,s,n arrays are all at separate locations in memory. If you could not fit the problem in L3 cache (mostly in L2), then this would be bad, and a solution of this sort would be helpful:
size_t d_nx = 128, d_ny = 128;
float *d_x;
struct D { float b,w,e,s,n; };
D *d;
void step() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d[ic].b
                - d[ic].w * d_x[iw] - d[ic].e * d_x[ie]
                - d[ic].s * d_x[is] - d[ic].n * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}
void solve(size_t iters) { for (size_t i = 0; i < iters; ++i) step(); }
void clear(float *a) { memset(a, 0, d_nx * d_ny * sizeof(float)); }
int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d = new D[n]; memset(d, 0, n * sizeof(D));
    solve(atoi(argv[1])); // iteration count from the command line, as in the question
    cout << d_x[0] << endl; // prevent the thing from being optimized away
}
For example, this solution at 1280x1280 is a little less than 2x faster than Poni's solution (13s vs 23s in my test--your original implementation is then 22s), while at 128x128 it's 30% slower (7s vs. 10s--your original is 10s).
(Iterations were scaled up to 80000 for the base case, and 800 for the 100x larger case of 1280x1280.)
I think you're right about memory being a bottleneck. It's a pretty simple loop with just some simple arithmetic per iteration. The ic, iw, ie, is, and in indices seem to be on opposite sides of the matrix, so I'm guessing that there are a bunch of cache misses there.
I'm no expert on the subject, but I've seen that there are several academic papers on improving the cache usage of the Gauss-Seidel method.
Another possible optimization is the use of the red-black variant, where points are updated in two sweeps in a chessboard-like pattern. In this way, all updates in a sweep are independent and can be parallelized.
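A minimal sketch of the red-black ordering, assuming the same five-point stencil and row-major layout as the question's code. The `Grid` struct and the names `make_grid`, `sweep_color` and `red_black_step` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the question's arrays; each holds nx * ny values.
struct Grid {
    std::size_t nx, ny;
    std::vector<float> x, b, w, e, s, n;
};

Grid make_grid(std::size_t nx, std::size_t ny) {
    const std::vector<float> zeros(nx * ny, 0.0f);
    return Grid{nx, ny, zeros, zeros, zeros, zeros, zeros, zeros};
}

// Update only the interior cells whose (col + row) parity equals `color`.
// All four neighbours of such a cell have the other parity, so the updates
// inside one call do not depend on each other.
void sweep_color(Grid &g, std::size_t color) {
    assert(color < 2 && g.x.size() == g.nx * g.ny);
    for (std::size_t row = 1; row + 1 < g.ny; ++row) {
        // First interior column in this row with the requested parity.
        const std::size_t first = 1 + ((1 + row + color) % 2);
        for (std::size_t col = first; col + 1 < g.nx; col += 2) {
            const std::size_t ic = row * g.nx + col;
            g.x[ic] = g.b[ic]
                - g.w[ic] * g.x[ic - 1] - g.e[ic] * g.x[ic + 1]
                - g.s[ic] * g.x[ic - g.nx] - g.n[ic] * g.x[ic + g.nx];
        }
    }
}

// One full iteration: all "red" cells first, then all "black" cells.
void red_black_step(Grid &g) {
    sweep_color(g, 0);
    sweep_color(g, 1);
}
```

Because the cells within one color are independent, the row loop in sweep_color could be split across threads, for example with an OpenMP `#pragma omp parallel for`. Note that the update order differs from the lexicographic sweep, so intermediate iterates will not match the original code exactly.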
I suggest putting in some prefetch statements and also researching "data oriented design":
void step_original() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    float dw_ic, db_ic, de_ic, dn_ic, ds_ic;
    float dx_iw, dx_is, dx_ie, dx_in;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            // Perform the prefetch
            // Sorting these statements by array may increase speed;
            // although sorting by index name may increase speed too.
            db_ic = d_b[ic];
            dw_ic = d_w[ic];
            dx_iw = d_x[iw];
            de_ic = d_e[ic];
            dx_ie = d_x[ie];
            ds_ic = d_s[ic];
            dx_is = d_x[is];
            dn_ic = d_n[ic];
            dx_in = d_x[in];
            // Calculate
            d_x[ic] = db_ic
                - dw_ic * dx_iw - de_ic * dx_ie
                - ds_ic * dx_is - dn_ic * dx_in;
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}
This differs from your second method since the values are copied to local temporary variables before the calculation is performed.
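If an explicit hardware prefetch hint is wanted instead of copies into locals, GCC and Clang provide the `__builtin_prefetch` builtin. The following is only a sketch, assuming g++ as in the question; the function name `step_row_prefetch`, its parameters and the prefetch distance of 16 elements are illustrative guesses that would need to be measured. Other compilers need a different intrinsic.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical prefetch distance in elements; the right value has to be measured.
const std::size_t kPrefetchAhead = 16;

// Gauss-Seidel update for one row with prefetch hints on the coefficient
// arrays. `stride` is the row length, `row_start` the index of the first
// interior cell of the row, `count` the number of interior cells.
void step_row_prefetch(float *x, const float *b, const float *w,
                       const float *e, const float *s, const float *n,
                       std::size_t stride, std::size_t row_start,
                       std::size_t count) {
    assert(stride >= 1 && row_start >= stride);
    for (std::size_t k = 0; k < count; ++k) {
        const std::size_t ic = row_start + k;
        // Only hint at addresses that are still inside this row.
        if (k + kPrefetchAhead < count) {
            __builtin_prefetch(&b[ic + kPrefetchAhead]);
            __builtin_prefetch(&w[ic + kPrefetchAhead]);
            __builtin_prefetch(&e[ic + kPrefetchAhead]);
            __builtin_prefetch(&s[ic + kPrefetchAhead]);
            __builtin_prefetch(&n[ic + kPrefetchAhead]);
        }
        x[ic] = b[ic]
            - w[ic] * x[ic - 1] - e[ic] * x[ic + 1]
            - s[ic] * x[ic - stride] - n[ic] * x[ic + stride];
    }
}
```

The arrays are read sequentially, which hardware prefetchers usually handle well on their own, so the hints may make no difference or even slow things down; timing both variants is the only way to know.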