Broken promise, having trouble figuring it out (C++)

The error message I'm getting:
Unhandled exception at 0x7712A9F2 in eye_tracking.exe: Microsoft C++ exception: std::future_error at memory location 0x010FEA50.
Code snippet of where I fork and join:
//CONCURRENCY
std::vector<costGrad*> threadGrads;
std::vector<std::thread> threads;
std::vector<std::future<costGrad*>> ftr(maxThreads);
for (int i = 0; i < maxThreads; i++) // Creating threads
{
    int start = floor(xValsB.rows() / (double)maxThreads * i);
    int end = floor(xValsB.rows() / (double)maxThreads * (i + 1));
    int length = end - start;
    std::promise<costGrad*> prms;
    ftr[i] = prms.get_future();
    threads.push_back(std::thread([&]()
    {
        costThread(std::move(prms), params, xValsB.block(start, 0, length, xValsB.cols()), yVals.block(start, 0, length, yVals.cols()), lambda, m);
    }));
}
for (int i = 0; i < maxThreads; i++) // Collecting futures
    threadGrads.push_back(ftr[i].get()); // <------- I THINK THIS IS WHERE I'M MESSING UP
for (int i = 0; i < maxThreads; i++) // Joining threads
    threads[i].join();
Following is the costThread function:
void costThread(std::promise<costGrad*>&& pmrs,
                const std::vector<Eigen::MatrixXd>& params,
                const Eigen::MatrixXd& xValsB,
                const Eigen::MatrixXd& yVals,
                const double lambda,
                const int m)
{
    try
    {
        costGrad* temp = new costGrad; // "Cost / Gradient" struct to be returned at end
        temp->forw = 0;
        temp->back = 0;
        std::vector<Eigen::MatrixXd> matA;       // Activation values including bias; first entry will be xVals
        std::vector<Eigen::MatrixXd> matAb;      // Activation values excluding bias; first entry will be xVals
        std::vector<Eigen::MatrixXd> matZ;       // Activation values prior to sigmoid
        std::vector<Eigen::MatrixXd> paramTrunc; // Parameters excluding bias terms
        clock_t t1, t2, t3;
        t1 = clock();
        //FORWARD PROPAGATION PREP
        Eigen::MatrixXd xVals = Eigen::MatrixXd::Constant(xValsB.rows(), xValsB.cols() + 1, 1); // Add bias units onto xVals
        xVals.block(0, 1, xValsB.rows(), xValsB.cols()) = xValsB;
        matA.push_back(xVals);
        matAb.push_back(xValsB);
        //FORWARD PROPAGATION
        for (int i = 0; i < params.size(); i++)
        {
            Eigen::MatrixXd paramTemp = params[i].block(0, 1, params[i].rows(), params[i].cols() - 1); // Setting up paramTrunc
            paramTrunc.push_back(paramTemp);
            matZ.push_back(matA.back() * params[i].transpose());
            matAb.push_back(sigmoid(matZ.back()));
            Eigen::MatrixXd tempA = Eigen::MatrixXd::Constant(matAb.back().rows(), matAb.back().cols() + 1, 1); // Add bias units
            tempA.block(0, 1, matAb.back().rows(), matAb.back().cols()) = matAb.back();
            matA.push_back(tempA);
        }
        t2 = clock();
        //COST CALCULATION
        temp->J = (yVals.array()*(0 - log(matAb.back().array())) - (1 - yVals.array())*log(1 - matAb.back().array())).sum() / m;
        //BACK PROPAGATION
        std::vector<Eigen::MatrixXd> del;
        std::vector<Eigen::MatrixXd> grad;
        del.push_back(matAb.back() - yVals);
        for (int i = 0; i < params.size() - 1; i++)
        {
            del.push_back((del.back() * paramTrunc[paramTrunc.size() - 1 - i]).array() * sigmoidGrad(matZ[matZ.size() - 2 - i]).array());
        }
        for (int i = 0; i < params.size(); i++)
        {
            grad.push_back(del.back().transpose() * matA[i] / m);
            del.pop_back();
        }
        for (int i = 0; i < params.size(); i++)
        {
            int rws = grad[i].rows();
            int cls = grad[i].cols() - 1;
            Eigen::MatrixXd tmp = grad[i].block(0, 1, rws, cls);
            grad[i].block(0, 1, rws, cls) = tmp.array() + lambda / m * paramTrunc[i].array();
        }
        temp->grad = grad;
        t3 = clock();
        temp->forw = ((float)t2 - (float)t1) / 1000;
        temp->back = ((float)t3 - (float)t2) / 1000;
        pmrs.set_value(temp);
    }
    catch (...)
    {
        pmrs.set_exception(std::current_exception());
    }
    //return temp;
}
EDIT:
Figured out the exception is a broken promise. I'm still having problems understanding what I'm getting wrong here. At the end of costThread() I use
pmrs.set_value(temp);
And I expect the following to get temp:
for (int i = 0; i < maxThreads; i++) //Collecting future
threadGrads.push_back(ftr[i].get());
But somehow I'm getting it all wrong.

You have a race condition: you are passing a local variable to the thread by reference and moving from it inside the thread. That works only if the new thread manages to execute the move before the local variable is destroyed at the end of the loop iteration; normally, given the code, the destructor wins the race.
If you can use C++14, you can move the promise into the lambda in its capture initializer:
threads.push_back(
    std::thread([prms = std::move(prms)]() mutable // mutable so the captured promise can be moved from
    {
        costThread(std::move(prms), /* etc */);
    })
);
If you're limited to C++11, wrap the promise into a std::shared_ptr and pass it by value.
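For example, a minimal sketch of that C++11 variant, assuming the same costThread signature as in the question:

#include <memory> // for std::make_shared

auto prms = std::make_shared<std::promise<costGrad*>>();
ftr[i] = prms->get_future();
threads.push_back(std::thread([prms]() // copying the shared_ptr keeps the promise alive
{
    costThread(std::move(*prms), /* etc */);
}));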
I would additionally handle exceptions in the worker threads and pass them to the processing thread through std::promise::set_exception(), though it's a matter of preference.

Related

Loops optimization

I have a loop with an inner loop. How can I optimize it to reduce execution time, for example by avoiding repeated memory accesses to the same data and minimizing additions and multiplications as much as possible?
int n, m, x1, y1, x2, y2, cnst;
int N = 9600;
int M = 1800;
int temp11, temp12, temp13, temp14;
int temp21, temp22, temp23, temp24;
int *arr1 = new int [32000]; // suppose it's already filled
int *arr2 = new int [32000]; // suppose it's already filled
int sumFirst = 0;
int maxFirst = 0;
int indexFirst = 0;
int sumSecond = 0;
int maxSecond = 0;
int indexSecond = 0;
int jump = 2400;
for( n = 0; n < N; n++)
{
    temp14 = 0;
    temp24 = 0;
    for( m = 0; m < M; m++)
    {
        x1 = m + cnst;
        y1 = m + n + cnst;
        temp11 = arr1[x1];
        temp12 = arr2[y1];
        temp13 = temp11 * temp12;
        temp14 += temp13;
        x2 = m + cnst + jump;
        y2 = m + n + cnst + jump;
        temp21 = arr1[x2];
        temp22 = arr2[y2];
        temp23 = temp21 * temp22;
        temp24 += temp23;
    }
    sumFirst += temp14;
    if (temp14 > maxFirst)
    {
        maxFirst = temp14;
        indexFirst = m;
    }
    sumSecond += temp24;
    if (temp24 > maxSecond)
    {
        maxSecond = temp24;
        indexSecond = n;
    }
}
// At the end we use sum, index and max for first and second;
You are multiplying array elements and accumulating the result.
This can be optimized by:
SIMD (doing multiple operations in a single CPU step)
Parallel execution (using multiple physical/logical CPUs at once)
Look for a CPU-specific SIMD way of doing this; for example, _mm_mul_epi32 from SSE4.1 could be used on x86-64. Before trying to write your own SIMD version with compiler intrinsics, make sure the compiler doesn't already do it for you.
As for parallel execution, look into OpenMP, or use the C++17 parallel algorithms, e.g. a parallel transform-reduce.
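As a rough sketch of the C++17 route (rowDot is a hypothetical helper, not the poster's code; arr1, arr2, cnst, n and M are borrowed from the question and assumed valid): the inner multiply-accumulate is an inner product, which std::transform_reduce can evaluate in parallel:

#include <execution>
#include <numeric>

// Computes the question's inner loop: sum over m of arr1[m + cnst] * arr2[m + n + cnst].
long long rowDot(const int* arr1, const int* arr2, int cnst, int n, int M)
{
    return std::transform_reduce(std::execution::par_unseq,
                                 arr1 + cnst, arr1 + cnst + M, // first range
                                 arr2 + n + cnst,              // second range (pairwise multiply)
                                 0LL);                         // initial value; defaults to plus/multiplies
}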

Multiple threads taking more time than single process [duplicate]

This question already has answers here:
C: using clock() to measure time in multi-threaded programs
(2 answers)
Closed 2 years ago.
I am implementing a pattern matching algorithm that moves template gradient info over the entire target's gradient image, at each rotation (-60 to 60). I have already saved the template info for each rotation, i.e. 121 templates are preprocessed and saved.
The issue is that this consumes a lot of time (approx. 110 ms), so I decided to split the matching into sets of rotations (-60 to -30, -30 to 0, 0 to 30 and 30 to 60) across 4 threads, but the threaded version takes more time than the single-threaded one (approx. 115 ms to 120 ms).
Snippet of code is...
#define MAXTARGETNUM 64

MatchResultA totalResultsTemp[MAXTARGETNUM];

void CShapeMatch::match(ShapeInfo *ShapeInfoVec, search_region SearchRegion, float MinScore, float Greediness, int width, int height, int16_t *pBufGradX, int16_t *pBufGradY, float *pBufMag, bool corr)
{
    MatchResultA resultsPerDeg[MAXTARGETNUM];
    ....
    ....
    int startX = SearchRegion.StartX;
    int startY = SearchRegion.StartY;
    int endX = SearchRegion.EndX;
    int endY = SearchRegion.EndY;
    float AngleStep = SearchRegion.AngleStep;
    float AngleStart = SearchRegion.AngleStart;
    float AngleStop = SearchRegion.AngleStop;
    int startIndex = (int)(ShapeInfoVec[0].AngleNum/2) + ShapeInfoVec[0].AngleNum%2 + (int)AngleStart/AngleStep;
    int stopIndex = (int)(ShapeInfoVec[0].AngleNum/2) + ShapeInfoVec[0].AngleNum%2 + (int)AngleStop/AngleStep;
    for (int k = startIndex; k < stopIndex; k++){
        ....
        for(int j = startY; j < endY; j++){
            for(int i = startX; i < endX; i++){
                for(int m = 0; m < ShapeInfoVec[k].NoOfCordinates; m++)
                {
                    curX = i + (ShapeInfoVec[k].Coordinates + m)->x; // template X coordinate
                    curY = j + (ShapeInfoVec[k].Coordinates + m)->y; // template Y coordinate
                    iTx = *(ShapeInfoVec[k].EdgeDerivativeX + m);    // template X derivative
                    iTy = *(ShapeInfoVec[k].EdgeDerivativeY + m);    // template Y derivative
                    iTm = *(ShapeInfoVec[k].EdgeMagnitude + m);      // template gradients magnitude
                    if(curX < 0 || curY < 0 || curX > width-1 || curY > height-1)
                        continue;
                    offSet = curY*width + curX;
                    iSx = *(pBufGradX + offSet); // get corresponding X derivative from source image
                    iSy = *(pBufGradY + offSet); // get corresponding Y derivative from source image
                    iSm = *(pBufMag + offSet);
                    if (PartialScore > MinScore)
                    {
                        float Angle = ShapeInfoVec[k].Angel;
                        bool hasFlag = false;
                        for(int n = 0; n < resultsNumPerDegree; n++)
                        {
                            if(abs(resultsPerDeg[n].CenterLocX - i) < 5 && abs(resultsPerDeg[n].CenterLocY - j) < 5)
                            {
                                hasFlag = true;
                                if(resultsPerDeg[n].ResultScore < PartialScore)
                                {
                                    resultsPerDeg[n].Angel = Angle;
                                    resultsPerDeg[n].CenterLocX = i;
                                    resultsPerDeg[n].CenterLocY = j;
                                    resultsPerDeg[n].ResultScore = PartialScore;
                                    break;
                                }
                            }
                        }
                        if(!hasFlag)
                        {
                            resultsPerDeg[resultsNumPerDegree].Angel = Angle;
                            resultsPerDeg[resultsNumPerDegree].CenterLocX = i;
                            resultsPerDeg[resultsNumPerDegree].CenterLocY = j;
                            resultsPerDeg[resultsNumPerDegree].ResultScore = PartialScore;
                            resultsNumPerDegree++;
                        }
                        minScoreTemp = minScoreTemp < PartialScore ? PartialScore : minScoreTemp;
                    }
                }
            }
            for(int i = 0; i < resultsNumPerDegree; i++)
            {
                mtx.lock();
                totalResultsTemp[totalResultsNum] = resultsPerDeg[i];
                totalResultsNum++;
                mtx.unlock();
            }
            n++;
        }
    }
}
void CallerFunction(){
    int16_t *pBufGradX = (int16_t *) malloc(bufferSize * sizeof(int16_t));
    int16_t *pBufGradY = (int16_t *) malloc(bufferSize * sizeof(int16_t));
    float *pBufMag = (float *) malloc(bufferSize * sizeof(float));
    clock_t start = clock();
    float temp_stop = SearchRegion->AngleStop;
    SearchRegion->AngleStop = -30;
    thread t1(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX, pBufGradY, pBufMag, corr);
    SearchRegion->AngleStart = -30;
    SearchRegion->AngleStop = 0;
    thread t2(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX, pBufGradY, pBufMag, corr);
    SearchRegion->AngleStart = 0;
    SearchRegion->AngleStop = 30;
    thread t3(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX, pBufGradY, pBufMag, corr);
    SearchRegion->AngleStart = 30;
    SearchRegion->AngleStop = temp_stop;
    thread t4(&CShapeMatch::match, this, ShapeInfoVec, *SearchRegion, MinScore, Greediness, width, height, pBufGradX, pBufGradY, pBufMag, corr);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    clock_t end = clock();
    cout << 1000*(double)(end-start)/CLOCKS_PER_SEC << endl;
}
As we can see there are plenty of heap accesses, but they are all read-only. Only totalResultsTemp and totalResultsNum are shared global resources that are written to.
My PC configuration is:
i5-7200U CPU @ 2.50 GHz (2 cores, 4 hardware threads)
4 GB RAM
Ubuntu 18
for(int i = 0; i < resultsNumPerDegree; i++)
{
    mtx.lock();
    totalResultsTemp[totalResultsNum] = resultsPerDeg[i];
    totalResultsNum++;
    mtx.unlock();
}
You are writing into a static array, and mutexes are really time consuming. Instead of creating locks, try std::atomic_int for the index, or, in my opinion even better, just pass each call the exact place where to store its results, so synchronization is not your problem anymore.
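A minimal sketch of the atomic variant (assuming totalResultsTemp is sized for the worst case across all threads):

#include <atomic>

std::atomic<int> totalResultsNum{0};
...
for (int i = 0; i < resultsNumPerDegree; i++)
{
    // Reserve a unique slot without a lock; only the counter itself is contended.
    int slot = totalResultsNum.fetch_add(1, std::memory_order_relaxed);
    totalResultsTemp[slot] = resultsPerDeg[i];
}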
Also check how you are timing: clock() does not measure wall-clock time; it measures CPU time summed over all threads of the process (this is exactly the linked duplicate). With four busy threads it reports roughly their combined CPU time, so it cannot show a speed-up even when the threads really do run on separate cores. Measure elapsed time instead, e.g. with std::chrono::steady_clock.
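For instance, a minimal sketch of the caller timed with wall-clock time (same thread setup as the question's CallerFunction):

#include <chrono>

auto t_begin = std::chrono::steady_clock::now();
// ... create and join t1..t4 exactly as in CallerFunction ...
auto t_end = std::chrono::steady_clock::now();
std::cout << std::chrono::duration<double, std::milli>(t_end - t_begin).count() << " ms" << std::endl;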

How to parallelize a nested loop

This is a part of a C++ code for solving a problem in computational mathematics of large dimension, say more than 100,000 variables. I'd like to parallelize it using OpenMP. What is the best way of parallelizing the following nested loop with OpenMP?
e = 0;
// m and n are big numbers, 200000 - 10000000
int i, k, r, s, t;
// hpk, hqk, pk_x0, n2pk_x0, dk, sk are doubles declared before.
for (k = 0; k < m; k++)
{
    hpk = 0;
    hqk = 0;
    n2pk_x0 = 0;
    dk = 0;
    sk = 0;
    for (int i = 0; i < n; i++)
    {
        if (lamb[i] <= lam[k])
        {
            if (h[i] < 0)
            {
                pk[i] = xu[i];
            }
            else if (h[i] > 0)
            {
                pk[i] = xl[i];
            }
            qk[i] = 0;
        }
        else
        {
            pk[i] = x0[i];
            qk[i] = -h[i];
        }
        hpk += h[i]*pk[i];
        hqk += h[i]*qk[i];
        pk_x0 = pk[i] - x0[i];
        n2pk_x0 += pk_x0*pk_x0;
        dk += pk_x0*qk[i];
        sk += qk[i]*qk[i];
    }
    //}//p
    /* ------- Compute ak, bk, ck, dk and sk to construct e(lam) -------- */
    ak = - (gamma + hpk);
    bk = - hqk;
    ck = q0 + 0.5 * n2pk_x0;
    sk = 0.5 * sk;
    // some calculation based on index k
} // end of first for
I followed some of the advice and made the local variables in the nested loop private. The CPU time decreased by a factor of 2, but the output is not correct! Is there any way to improve the code so that it gets the correct result with less CPU time? (In the nested loop, if we set m=1 the output is correct, but for m>1 the output is incorrect.)
This is the whole code:
static void subboconcpp(
    double u[],
    double *Egh,
    double h[],
    double gamma,
    double x0[],
    double q0,
    double xl[],
    double xu[],
    int dim
)
{
    int n, m, infinity = INT_MAX, i, k, r, s, t;
    double e;
    double hpk, hqk, dk1, sk1, n2pk_x0;
    double ak, bk, ck, dk, sk;
    double lam_hat, phik, ek1, ek2;
    double *pk = new double[dim];
    double *qk = new double[dim];
    double *lamb = new double[dim];
    double *lamb1 = new double[dim];
    double *lam = new double[dim];
    /* ------------------ Computing lambl(i) and lambu(i) ------------------ */
    /* n is the length of x0 */
    n = dim;
    #pragma omp parallel for shared(n,h,x0,xl,xu) //num_threads(8)
    for (int i = 0; i < n; i++)
    {
        double lamb_flag;
        if (h[i] > 0)
        {
            lamb_flag = (x0[i] - xl[i])/h[i];
            lamb[i] = lamb_flag;
            lamb1[i] = lamb_flag;
        }
        else if (h[i] < 0)
        {
            lamb_flag = (x0[i] - xu[i])/h[i];
            lamb[i] = lamb_flag;
            lamb1[i] = lamb_flag;
        }
        //cout << "lamb:" << lamb[i];
    }
    /* --------------------------------------------------------------------- */
    /* ----------------- Sorting lamb and constructing lam ----------------- */
    /* lamb = sort(lamb,1); */
    sort(lamb1, lamb1 + n);
    int q = 0;
    double lam_flag = 0;
    #pragma omp parallel for shared(n) firstprivate(q) lastprivate(m)
    for (int j = 0; j < n; j++)
    {
        if (lamb1[j] > lam_flag)
        {
            lam_flag = lamb1[j];
            q = q + 1;
            lam[q] = lam_flag;
            //cout << "lam: \n" << lam[q];
        }
        if (j == n-1)
        {
            if (lam_flag < infinity)
            {
                m = q + 1;
                lam[m] = + infinity;
            }
            else
            {
                m = q;
            }
        }
        //cout << "q: \n" << q;
    }
    /* --------------------------------------------------------------------- */
    /* -- Finding the global maximizer of e(lam) for lam in [-inf, +inf] -- */
    e = 0;
    #pragma omp parallel shared(m,n,h,x0,xl,xu,lamb,lam) \
        private(i,r,s,t,hpk, hqk, dk1, sk1, n2pk_x0,ak, bk, ck, dk, sk,lam_hat, phik, ek1, ek2)
    {
        #pragma omp for nowait
        for (k = 0; k < 1; k++)
        {
            /*double hpk=0, hqk=0, dk1=0, sk1=0, n2pk_x0=0;
            double ak, bk, ck, dk, sk;
            double lam_hat, phik, ek1, ek2;
            double *pk = new double[dim];
            double *qk = new double[dim];*/
            hpk = 0;
            hqk = 0;
            n2pk_x0 = 0;
            dk1 = 0;
            sk1 = 0;
            for (int i = 0; i < n; i++)
            {
                double pk_x0;
                if (lamb[i] <= lam[k])
                {
                    if (h[i] < 0)
                    {
                        pk[i] = xu[i];
                    }
                    else if (h[i] > 0)
                    {
                        pk[i] = xl[i];
                    }
                    qk[i] = 0;
                }
                else
                {
                    pk[i] = x0[i];
                    qk[i] = -h[i];
                }
                hpk += h[i]*pk[i];
                hqk += h[i]*qk[i];
                pk_x0 = pk[i] - x0[i];
                n2pk_x0 += pk_x0*pk_x0;
                dk1 += pk_x0*qk[i];
                sk1 += qk[i]*qk[i];
            }
            /* ------- Compute ak, bk, ck, dk and sk to construct e(lam) -------- */
            ak = - (gamma + hpk);
            bk = - hqk;
            ck = q0 + 0.5 * n2pk_x0;
            dk = dk1;
            sk = 0.5 * sk1;
            /* ----------------------------------------------------------------- */
            /* - Finding the global maximizer of e(lam) for [lam(k), lam(k+1)] - */
            /* --------------------- using Proposition 4 ----------------------- */
            if (bk != 0)
            {
                double w = ak*ak - bk*(ak*dk - bk*ck)/sk;
                if (w == 0)
                {
                    lam_hat = -ak / bk;
                    phik = 0;
                }
                else
                {
                    double w = ak*ak - bk*(ak*dk - bk*ck)/sk;
                    lam_hat = (-ak + sqrt(w))/bk;
                    phik = bk / (2*sk*lam_hat + dk);
                }
            }
            else
            {
                if (ak > 0)
                {
                    lam_hat = -dk / (2 * sk);
                    phik = 4*ak*sk / (4*ck*sk + (sk - 2)*(dk*dk));
                }
                else
                {
                    lam_hat = + infinity;
                    phik = 0;
                }
            }
            /* ----------------------------------------------------------------- */
            /* --- Checking the feasibility of the solution of Proposition 4 --- */
            if (lam[k] <= lam_hat && lam_hat <= lam[k + 1])
            {
                if (phik > e)
                {
                    for (r = 0; r < n; r++)
                    {
                        u[r] = pk[r] + lam_hat * qk[r];
                    }
                    e = phik;
                }
            }
            else
            {
                ek1 = (ak + bk*lam[k])/(ck + (dk + sk*lam[k])*lam[k]);
                ek2 = (ak + bk*lam[k+1])/(ck + (dk + sk*lam[k+1])*lam[k+1]);
                if (ek1 >= ek2)
                {
                    lam_hat = lam[k];
                    if (ek1 > e)
                    {
                        for (s = 0; s < n; s++)
                        {
                            u[s] = pk[s] + lam_hat * qk[s];
                        }
                        e = ek1;
                    }
                }
                else
                {
                    lam_hat = lam[k + 1];
                    if (ek2 > e)
                    {
                        for (t = 0; t < n; t++)
                        {
                            u[t] = pk[t] + lam_hat * qk[t];
                        }
                        e = ek2;
                    }
                }
            }
            /* ------------------------------------------------------------------ */
        } /* ------------------------- End of for (k) --------------------------- */
    } //p
    /* --------- The global maximizer by searching all m intervals --------- */
    *Egh = e;
    delete[] pk;
    delete[] qk;
    delete[] lamb1;
    delete[] lamb;
    delete[] lam;
    return;
    /* --------------------------------------------------------------------- */
}
Please note that the first two parallel blocks work well; only the output of the nested loop is incorrect.
Any suggestion or comment is appreciated.
The outermost loop: I do not know all the code, but it looks like the variables hpk, hqk, n2pk_x0, dk, sk should be private. If you do not specify them as private, it will break correctness.
OpenMP is not always very good for nested parallelism. Depending on the OpenMP settings, a nested loop can create p*p threads, where p is the default concurrency of your machine, and such heavy oversubscription may lead to significant performance degradation. In most cases it is OK to parallelize the outermost loop and leave the nested loops serial.
One reason for parallelizing nested loops is to achieve better work balancing. But your case seems to have balanced work, so you should not face a work balancing problem if you parallelize only the outermost loop.
But if you still want to parallelize both loops, may I suggest using Intel TBB instead of OpenMP? You can use tbb::parallel_for for the outermost loop and tbb::parallel_reduce for the nested one. Intel TBB uses one thread pool for all its algorithms, so it will not lead your application into oversubscription; a sketch follows.
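For illustration, a minimal sketch (not the poster's code; it computes just one of the inner sums, sk1, and assumes qk is already filled) of what the nested reduction could look like with TBB:

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <functional>

double sk1 = tbb::parallel_reduce(
    tbb::blocked_range<int>(0, n), 0.0,
    [&](const tbb::blocked_range<int>& r, double acc) {
        for (int i = r.begin(); i != r.end(); ++i)
            acc += qk[i] * qk[i]; // partial sum over this chunk
        return acc;
    },
    std::plus<double>());         // combine the partial sums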
[updated] Some parallelization advice:
Until you achieve correctness, the execution time means nothing, since a correctness fix can change it significantly (sometimes even for the better);
Do not try to parallelize "everything at once": parallelize loop by loop. It will be easier to understand where correctness breaks;
Do not modify shared variables concurrently. If you really need to, rethink your algorithm and use dedicated constructions such as reductions (see the sketch after this list), atomic operations, locks/mutexes/semaphores, and so on.
Be careful when writing to shared arrays at privately computed indices, since different threads may end up with the same index.
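As an illustration of the reduction advice above, a minimal sketch (names are borrowed from the question's code, not the full algorithm) of the inner accumulation loop written with an OpenMP reduction clause:

double hpk = 0, hqk = 0, n2pk_x0 = 0, dk1 = 0, sk1 = 0;
#pragma omp parallel for reduction(+:hpk,hqk,n2pk_x0,dk1,sk1)
for (int i = 0; i < n; i++)
{
    // Each pk[i]/qk[i] is written by exactly one iteration, so the arrays are race-free.
    if (lamb[i] <= lam[k])
    {
        if (h[i] < 0)      pk[i] = xu[i];
        else if (h[i] > 0) pk[i] = xl[i];
        qk[i] = 0;
    }
    else
    {
        pk[i] = x0[i];
        qk[i] = -h[i];
    }
    double pk_x0 = pk[i] - x0[i];
    hpk     += h[i]*pk[i];   // the five scalar sums are accumulated per thread
    hqk     += h[i]*qk[i];   // and merged once at the end by the reduction
    n2pk_x0 += pk_x0*pk_x0;
    dk1     += pk_x0*qk[i];
    sk1     += qk[i]*qk[i];
}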
I think your idea of nested parallelization does not fit the OpenMP mindset very well. Although nested parallelism can be achieved in OpenMP, it brings more complications than necessary. Typically in OpenMP you only parallelize a single loop at once.
Parallelization should be done at the level with the fewest interleaving dependencies. Often this turns out to be the top level. In your particular case this is true as well, since the steps of the outer loop are not strongly coupled.
I don't know what the rest of your code does, especially what happens to the values of hpk, hqk, n2pk_x0, dk and sk, but all you should need is to add #pragma omp parallel for (with the appropriate data-sharing clauses) to the outermost loop, as sketched below.
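As a sketch of the single-level approach both answers converge on (assumptions: pk/qk become per-thread buffers because every iteration of k overwrites them, and the running maximum in e and u[] still needs a guard):

#pragma omp parallel
{
    std::vector<double> pk(dim), qk(dim); // per-thread scratch buffers
    #pragma omp for
    for (int k = 0; k < m; k++)
    {
        // Scalars declared inside the loop body are automatically private.
        double hpk = 0, hqk = 0, n2pk_x0 = 0, dk1 = 0, sk1 = 0;
        // ... original body of the k-loop, using the local pk/qk ...
        #pragma omp critical
        {
            // Compare phik/ek1/ek2 against e and update e and u[] here,
            // since several threads compete to record the global maximizer.
        }
    }
}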

OpenMP generate segfault in Rcpp code for the SEIR model

I wrote a (probably inefficient, but anyway...) Rcpp code using inline to simulate a stochastic SEIR model.
The serial version compiles and works perfectly, but since I need to simulate from it a large number of times, and since it seems to me an embarrassingly parallel problem (I just need to simulate again for other parameter values and return a matrix with the results), I tried to add #pragma omp parallel for and to compile with -fopenmp -lgomp, but... boom!
I get a segfault even for very small examples!
I tried to add setenv("OMP_STACKSIZE","24M",1); and values well over 24M, but the segfault still happens.
I'll explain the code briefly since it's a bit long (I tried to shorten it, but then the results change and I can't reproduce the problem):
I have two nested loops; the inner one executes the model for a given parameter set, and the outer one changes the parameters.
A race condition could only happen if the code tried to execute the instructions inside the inner loop in parallel (which cannot be done because of the model structure: iteration t depends on iteration t-1), rather than parallelizing the outer loop; but if I'm not mistaken, parallelizing the outer loop is exactly what the parallel for construct does by default when placed just outside it...
This is basically the form of the code I'm trying to run:
mat result(n_param, T_MAX);
#pragma omp parallel for
for (int i = 0; i < n_param_set; i++) {
    int t = 0;
    rowvec jnk(T_MAX);
    while (t < T_MAX) {
        ...
        jnk(t) = something(jnk(t-1));
        ...
        t++;
    }
    result.row(i) = jnk;
}
return wrap(result);
And my question is: how do I tell the compiler that I want to compute only the outer loop in parallel (even distributing the iterations statically, like n_loops/n_threads per thread) and not the inner one (which is actually non-parallelizable)?
The real code is a bit more involved, and I'll present it here for the sake of reproducibility if you're really willing, but I'm only asking about the behavior of OpenMP. Please notice that the only OpenMP directive is the #pragma omp parallel for on the outer loop of SEIR_sim_plus_summaries() (shown commented out below).
library(Rcpp); library(RcppArmadillo); library(inline)
misc='
#include <math.h>
#define _USE_MATH_DEFINES
#include <omp.h>
using namespace arma;

template <typename T> int sgn(T val) {
    return (T(0) < val) - (val < T(0));
}

uvec rmultinomial(int n, vec prob)
{
    int K = prob.n_elem;
    uvec rN = zeros<uvec>(K);
    double p_tot = sum(prob);
    double pp;
    for(int k = 0; k < K-1; k++) {
        if(prob(k) > 0) {
            pp = prob[k] / p_tot;
            rN(k) = ((pp < 1.) ? (rbinom(1, (double) n, pp))(0) : n);
            n -= rN[k];
        } else
            rN[k] = 0;
        if(n <= 0) /* we have all */
            return rN;
        p_tot -= prob[k]; /* i.e. = sum(prob[(k+1):K]) */
    }
    rN[K-1] = n;
    return rN;
}
'
model_and_summary='
mat SEIR_sim_plus_summaries()
{
    vec alpha;
    alpha << 0.002 << 0.0045;
    vec beta;
    beta << 0.01 << 0.01;
    vec gamma;
    gamma << 1.0/14.0 << 1.0/14.0;
    vec sigma;
    sigma << 1.0/(3.5) << 1.0/(3.5);
    vec phi;
    phi << 0.8 << 0.8;
    int S_0 = 800;
    int E_0 = 100;
    int I_0 = 100;
    int R_0 = 0;
    int pop = 1000;
    double tau = 0.01;
    double t_0 = 0;
    vec obs_time;
    obs_time << 1 << 2 << 3 << 4 << 5 << 6 << 7 << 8 << 9 << 10 << 11 << 12 << 13 << 14 << 15 << 16 << 17 << 18 << 19 << 20 << 21 << 22 << 23 << 24;
    const int n_obs = obs_time.n_elem;
    const int n_part = alpha.n_elem;
    mat stat(n_part, 6);
    //#pragma omp parallel for
    for(int k = 0; k < n_part; k++) {
        ivec INC_i(n_obs);
        ivec INC_o(n_obs);
        // Event variables
        double alpha_t;
        int nX;        // current number of people moving
        vec rates(8);
        uvec trans(4); // current transitions, e.g. from S to E,I,R,Universe
        vec r(4);      // rates e.g. from S to E, I, R, Univ.
        /*********************** Initialize **********************/
        int S_curr = S_0;
        int S_prev = S_0;
        int E_curr = E_0;
        int E_prev = E_0;
        int I_curr = I_0;
        int I_prev = I_0;
        int R_curr = R_0;
        int R_prev = R_0;
        int IncI_curr = 0;
        int IncI_prev = 0;
        int IncO_curr = 0;
        int IncO_prev = 0;
        double t_curr = t_0;
        int t_idx = 0;
        while( t_idx < n_obs ) {
            // next time preparation
            t_curr += tau;
            S_prev = S_curr;
            E_prev = E_curr;
            I_prev = I_curr;
            R_prev = R_curr;
            IncI_prev = IncI_curr;
            IncO_prev = IncO_curr;
            /*********** description (rates) of the events ***********/
            alpha_t = alpha(k)*(1 + phi(k)*sin(2*M_PI*(t_curr + 0)/52)); // real contact rate, time expressed in weeks
            rates(0) = (alpha_t * ((double)I_curr / (double)pop) * ((double)S_curr)); // e+1, s-1, r,i: one S gets infected (goes in E, not yet infectious)
            rates(1) = (sigma(k) * E_curr); // e-1, i+1, r,s: one exposed becomes infectious (goes in I) INCIDENCE!!
            rates(2) = (gamma(k) * I_curr); // i-1, s,e, r+1: one I recovers
            rates(3) = (beta(k) * I_curr);  // i-1, s, r,e: one I dies
            rates(4) = (beta(k) * R_curr);  // i,e, s, r-1: one R dies
            rates(5) = (beta(k) * E_curr);  // e-1, s, r,i: one E dies
            rates(6) = (beta(k) * S_curr);  // s-1, e, i, r: one S dies
            rates(7) = (beta(k) * pop);     // s+1: one susceptible is born
            // Let the events occur
            /*********************** S compartment **********************/
            if((rates(0) + rates(6)) > 0){
                nX = rbinom(1, S_prev, 1 - exp(-(rates(0) + rates(6))*tau))(0);
                r(0) = rates(0)/(rates(0) + rates(6)); r(1) = 0.0; r(2) = 0; r(3) = rates(6)/(rates(0) + rates(6));
                trans = rmultinomial(nX, r);
                S_curr -= nX;
                E_curr += trans(0);
                I_curr += trans(1);
                R_curr += trans(2);
                // trans(3) contains dead individuals, who disappear... we could avoid this using sequential conditional binomials
            }
            /*********************** E compartment **********************/
            if((rates(1) + rates(5)) > 0){
                nX = rbinom(1, E_prev, 1 - exp(-(rates(1) + rates(5))*tau))(0);
                r(0) = 0.0; r(1) = rates(1)/(rates(1) + rates(5)); r(2) = 0.0; r(3) = rates(5)/(rates(1) + rates(5));
                trans = rmultinomial(nX, r);
                S_curr += trans(0);
                E_curr -= nX;
                I_curr += trans(1);
                R_curr += trans(2);
                IncI_curr += trans(1);
            }
            /*********************** I compartment **********************/
            if((rates(2) + rates(3)) > 0){
                nX = rbinom(1, I_prev, 1 - exp(-(rates(2) + rates(3))*tau))(0);
                r(0) = 0.0; r(1) = 0.0; r(2) = rates(2)/(rates(2) + rates(3)); r(3) = rates(3)/(rates(2) + rates(3));
                trans = rmultinomial(nX, r);
                S_curr += trans(0);
                E_curr += trans(1);
                I_curr -= nX;
                R_curr += trans(2);
                IncO_curr += trans(2);
            }
            /*********************** R compartment **********************/
            if(rates(4) > 0){
                nX = rbinom(1, R_prev, 1 - exp(-rates(4)*tau))(0);
                r(0) = 0.0; r(1) = 0.0; r(2) = 0.0; r(3) = rates(4)/rates(4);
                trans = rmultinomial(nX, r);
                S_curr += trans(0);
                E_curr += trans(1);
                I_curr += trans(2);
                R_curr -= nX;
            }
            /*********************** Universe **********************/
            S_curr += pop - (S_curr + E_curr + I_curr + R_curr); // it should be Poisson, but since the pop is fixed...
            /*********************** Save & Continue **********************/
            // Check if the time is interesting for us
            if(t_curr > obs_time[t_idx]){
                INC_i(t_idx) = IncI_curr;
                INC_o(t_idx) = IncO_curr;
                IncI_curr = IncI_prev = 0;
                IncO_curr = IncO_prev = 0;
                t_idx++;
            }
            // else just go on...
        }
        /*********************** Finished - Starting w/ stats **********************/
        // INC_i is the useful variable; how can I change its reference without copying it?
        ivec incidence = INC_i; // just so if I want to use INC_o I have to change just this...
        // Scan the epidemics to recover the summary stats (naively divide the data each 52 weeks)
        double n_years = ceil((double)obs_time(n_obs - 1)/52.0);
        vec mu_attack(n_years);
        vec ratio_attack(n_years - 1);
        vec peak(n_years);
        vec atk(52);
        peak(0) = 0.0;
        vec tmpExplo(52); // explosiveness
        vec explo(n_years);
        int year = 0;
        int week;
        for(week = 0; week < n_obs; week++){
            if(week - 52*year > 51){
                mu_attack(year) = sum( atk )/(double)pop;
                if(year > 0)
                    ratio_attack(year - 1) = mu_attack(year)/mu_attack(year - 1);
                for(int i = 0; i < 52; i++){
                    if(atk(i) > (peak(year)/2.0)){
                        tmpExplo(i) = 1.0;
                    } else {
                        tmpExplo(i) = 0.0;
                    }
                }
                explo(year) = sum(tmpExplo);
                year++;
                peak(year) = 0.0;
            }
            atk(week - 52*year) = incidence(week);
            if( peak(year) < incidence(week) )
                peak(year) = incidence(week);
        }
        if(week - 52*year > 51){
            mu_attack(year) = sum( atk )/(double)pop;
        } else {
            ivec idx(52);
            for(int i = 0; i < 52; i++)
                { idx(i) = i; } // take just the updated ones...
            vec tmp = atk.elem(find(idx < (week - 52*year)));
            mu_attack(year) = sum( tmp )/((double)pop * (tmp.n_elem/52.0));
            ratio_attack(year - 1) = mu_attack(year)/mu_attack(year - 1);
            for(int i = 0; i < tmp.n_elem; i++){
                if(tmp(i) > (peak(year)/2.0)){
                    tmpExplo(i) = 1.0;
                } else {
                    tmpExplo(i) = 0.0;
                }
            }
            for(int i = tmp.n_elem; i < 52; i++)
                tmpExplo(i) = 0.0; // to reset the others
            explo(year) = sum(tmpExplo);
        }
        double correlation2;
        double correlation4;
        vec autocorr = acf(peak);
        /***** ACF *****/
        if(n_years < 3){
            correlation2 = 0.0;
            correlation4 = 0.0;
        } else {
            if(n_years < 5){
                correlation2 = autocorr(1);
                correlation4 = 0.0;
            } else {
                correlation2 = autocorr(1);
                correlation4 = autocorr(3);
            }
        }
        rowvec jnk(6);
        jnk << sum(mu_attack)/(year + 1.0)
            << (sum( log(ratio_attack)%log(ratio_attack) )/(n_years - 1)) - (pow(sum( log(ratio_attack) )/(n_years - 1), 2))
            << correlation2 << correlation4 << max(peak) << sum(explo)/n_years;
        stat.row(k) = jnk;
    }
    return stat;
}
'
main='
std::cout << "max_num_threads " << omp_get_max_threads() << std::endl;
RNGScope scope;
mat summaries = SEIR_sim_plus_summaries();
return wrap(summaries);
'
plug = getPlugin("RcppArmadillo")
## modify the plugin for Rcpp to support OpenMP
plug$env$PKG_CXXFLAGS <- paste('-fopenmp', plug$env$PKG_CXXFLAGS)
plug$env$PKG_LIBS <- paste('-fopenmp -lgomp', plug$env$PKG_LIBS)
SEIR_sim_summary = cxxfunction(sig=signature(),main,settings=plug,inc = paste(misc,model_and_summary),verbose=TRUE)
SEIR_sim_summary()
Thanks for the help!
NB: before you ask, I slightly modified the Rcpp multinomial sampling function just because I liked that way more than the one using pointers... no other particular reason! :)
The core pseudo-random number generators (PRNGs) in R are not designed to be used in multithreaded environments. That is, their state is stored in a static array (dummy from src/main/PRNG.c) and is therefore shared among all threads. Moreover, several other static structures are used to store states for the higher-level interfaces to the core PRNGs.
A possible solution is to put each call to rnorm() or the other sampling functions inside named critical sections, all with the same name, e.g.:
...
#pragma omp critical(random)
rN(k) = ((pp < 1.) ? (rbinom(1, (double) n, pp))(0) : n);
...
if((rates(0) + rates(6)) > 0){
    #pragma omp critical(random)
    nX = rbinom(1, S_prev, 1 - exp(-(rates(0) + rates(6))*tau))(0);
...
Note that the critical construct operates on the structured block following it and therefore locks the entire statement. If a random number is being drawn inline inside a call to a time-consuming function, e.g.
#pragma omp critical(random)
x = slow_computation(rbinom(...));
this is better transformed to:
#pragma omp critical(random)
rb = rbinom(...);
x = slow_computation(rb);
That way only the rb = rbinom(...); statement will be protected.

structs within structs, dynamic memory allocation

I am making a 3D application where a boat has to drive through buoy tracks. I also need to store the tracks in groups or "layouts". The buoys class is basically a list of "buoy layouts" inside of which is a list of "buoy tracks", inside of which is a list of buoys.
I checked the local variable watcher and all memory allocations in the constructor appear to work. Later, when the calculateCoordinates function is called, it enters a for loop. On the first iteration of the for loop the function pointers are used and work fine, but then on this line
ctMain[j+1][1] = 0;
the function pointers are set to NULL. I am guessing it has something to do with the structs not being allocated or addressed correctly. I am not sure what to do from here. Maybe I am not understanding how malloc works.
Update
I replaced the M3DVector3d main_track with double ** main_track, thinking malloc might not be handling the typedef correctly. But I get the same error when trying to access the main_track variable later in calculateCoordinates.
Update
It ended up being memory corruption caused by accessing a pointer incorrectly in the line
rotatePointD(&(cTrack->main_track[j]), rotation);
It only led to an error later when I tried to access it.
// Buoys.h
////////////////////////////////////////////
struct buoy_layout_t;
struct buoy_track_t;

typedef double M3DVector3d[3];

class Buoys {
public:
    Buoys();
    struct buoy_layout_t ** buoyLayouts;
    int nTrackLayouts;
    int currentLayoutID;
    void calculateCoordinates();
};

struct buoy_track_t {
    int nMain, nYellow, nDistract;
    M3DVector3d * main_track,
        yellow_buoys,
        distraction_buoys;
    double (*f)(double x);
    double (*fp)(double x);
    double thickness;
    M3DVector3d start, end;
};

struct buoy_layout_t {
    int nTracks;
    buoy_track_t ** tracks;
};

// Buoys.cpp
/////////////////////////////
// polynomial and its derivative, for shape of track
double buoyfun1(double x) { return (1.0/292.0)*x*(x - 12.0)*(x - 24.0); }
double buoyfun1d(double x) { return (1.0/292.0)*((3.0*pow(x,2)) - (72.0*x) + 288.0); }
// ... rest of buoy shape functions go here ...

Buoys::Buoys() {
    struct buoy_layout_t * cLayout;
    struct buoy_track_t * cTrack;
    nTrackLayouts = 1;
    buoyLayouts = (buoy_layout_t **) malloc(nTrackLayouts*sizeof(*buoyLayouts));
    for (int i = 0; i < nTrackLayouts; i++) {
        buoyLayouts[i] = (buoy_layout_t *) malloc(sizeof(*(buoyLayouts[0])));
    }
    currentLayoutID = 0;
    // ** Layout 1 **
    cLayout = buoyLayouts[0];
    cLayout->nTracks = 1;
    cLayout->tracks = (buoy_track_t **) malloc(sizeof(*(cLayout->tracks)));
    for (int i = 0; i < 1; i++) {
        cLayout->tracks[i] = (buoy_track_t *) malloc(sizeof(*(cLayout->tracks)));
    }
    cTrack = cLayout->tracks[0];
    cTrack->main_track = (M3DVector3d *) malloc(30*sizeof(*(cTrack->main_track)));
    cTrack->nMain = 30;
    cTrack->f = buoyfun1;
    cTrack->fp = buoyfun1d;
    cTrack->thickness = 5.5;
    cTrack->start[0] = 0; cTrack->start[1] = 0; cTrack->start[2] = 0;
    cTrack->end[0] = 30; cTrack->end[1] = 0; cTrack->end[2] = -19;
    // ... initialize rest of layouts here ...
    // ** Layout 2 **
    // ** Layout 3 **
    // ...
    // ** Layout N **
    calculateCoordinates();
}
void Buoys::calculateCoordinates()
{
    int i, j;
    buoy_layout_t * cLayout = buoyLayouts[0];
    for (i = 0; i < (cLayout->nTracks); i++) {
        buoy_track_t * cTrack = cLayout->tracks[i];
        M3DVector3d * ctMain = cTrack->main_track;
        double thickness = cTrack->thickness;
        double rotation = getAngleD(cTrack->start[0], cTrack->start[2],
                                    cTrack->end[0], cTrack->end[2]);
        double full_disp = sqrt(pow((cTrack->end[0] - cTrack->start[0]), 2)
                              + pow((cTrack->end[2] - cTrack->start[2]), 2));
        // nBuoys is nBuoys per side. So one side has nBuoys/2 buoys.
        for (j = 0; j < cTrack->nMain; j += 2) {
            double id = j*((full_disp)/(cTrack->nMain));
            double y = (*(cTrack->f))(id);
            double yp = (*(cTrack->fp))(id);
            double normal, normal_a;
            if (yp != 0) {
                normal = -1.0/yp;
            }
            else {
                normal = 999999999;
            }
            if (normal > 0) {
                normal_a = atan(normal);
            }
            else {
                normal_a = atan(normal) + PI;
            }
            ctMain[j][0] = id + ((thickness/2.0)*cos(normal_a));
            ctMain[j][1] = 0;
            ctMain[j][2] = y + ((thickness/2.0)*sin(normal_a));
            ctMain[j+1][0] = id + ((thickness/2.0)*cos(normal_a + PI));
            ctMain[j+1][1] = 0; // function pointers get set to null here
            ctMain[j+1][2] = y + ((thickness/2.0)*sin(normal_a + PI));
        }
        for (j = 0; j < cTrack->nMain; j++) {
            rotatePointD(&(cTrack->main_track[j]), rotation);
        }
    }
}
Unless there is a requirement to learn raw pointers, or you cannot use the STL: given that you are using C++, I'd strongly recommend you use more of the STL, it is your friend (see the sketch at the end of this answer). But anyway...
First, the type of ctMain is M3DVector3d *. So you can safely access ctMain[0], but you cannot access ctMain[1]; maybe you meant for the type of ctMain to be M3DVector3d **, in which case the initialization line you had written, which is:
cTrack->main_track = (M3DVector3d *) malloc(30*sizeof(*(cTrack->main_track)));
would make sense.
More Notes
Why are you allocating 30 of these here?
cTrack->main_track = (M3DVector3d *) malloc(30*sizeof(*(cTrack->main_track)));
Given the type of main_track, you only need:
cTrack->main_track = (M3DVector3d *) malloc(sizeof(M3DVector3d));
In addition, for organizational purposes, when calling sizeof you may want to name the actual type rather than the variable (there should be no difference, it is purely organizational), i.e. these two changes:
buoyLayouts = (buoy_layout_t **) malloc(nTrackLayouts*sizeof(buoy_layout_t*));
for (int i = 0; i < nTrackLayouts; i++) {
    buoyLayouts[i] = (buoy_layout_t *) malloc(sizeof(buoy_layout_t));
}
cLayout->tracks = (buoy_track_t **) malloc(cLayout->nTracks * sizeof(buoy_track_t*));
for (int i = 0; i < 1; i++) {
    cLayout->tracks[i] = (buoy_track_t *) malloc(sizeof(buoy_track_t));
}