I am new to OpenMP; any suggestions for parallelizing the following code with OpenMP?
I want to speed up the code with OpenMP, and I tried adding #pragma omp for to the two loops containing "sum -= a[i][k]*a[k][j]" below, since the hotspot analysis shows that these two loops take up a large portion of the time. However, it seems that race conditions produced wrong results. Any suggestions?
void ludcmp(float **a, int n, int *indx, float *d)
{
    int i, imax, j, k;
    float big, dum, sum, temp;
    float *vv;

    vv = vector(1, n);
    *d = 1.0;
    for (j = 1; j <= n; j++) {
        for (i = 1; i < j; i++) {
            sum = a[i][j];
            for (k = 1; k < i; k++) sum -= a[i][k]*a[k][j]; //here
            a[i][j] = sum;
        }
        big = 0.0;
        for (i = j; i <= n; i++) {
            sum = a[i][j];
            for (k = 1; k < j; k++)
                sum -= a[i][k]*a[k][j]; //here
            a[i][j] = sum;
            if ((dum = vv[i]*fabs(sum)) >= big) {
                big = dum;
                imax = i;
            }
        }
    }
Your variables are all declared at the top of the function, so every thread will share them, resulting in little or no benefit from the threading.
You should declare variables as close as possible to where you use them. In particular, sum and k are used in the innermost loops and should be declared right there, so that every thread gets its own copy of those variables. The same goes for i and dum. Also, that last if (the search for the largest value) can and should be placed in a separate loop and either run single-threaded or given proper OpenMP directives for handling big and imax.
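As a rough illustration of that advice (a sketch only, not a drop-in replacement for the full ludcmp), the loop nest from the question could be reorganised as below. Note that only the second inner loop is parallelised, because in the first one each row i reads the a[k][j] values written by earlier iterations of that same loop:

for (j=1; j<=n; j++) {
    for (i=1; i<j; i++) {                 /* left sequential: depends on earlier rows of column j */
        float sum = a[i][j];
        for (int k=1; k<i; k++) sum -= a[i][k]*a[k][j];
        a[i][j] = sum;
    }
    #pragma omp parallel for              /* rows j..n are independent of each other */
    for (i=j; i<=n; i++) {
        float sum = a[i][j];              /* sum and k are now local to each iteration */
        for (int k=1; k<j; k++) sum -= a[i][k]*a[k][j];
        a[i][j] = sum;
    }
    big = 0.0;                            /* separate, single-threaded pivot search */
    for (i=j; i<=n; i++) {
        float dum = vv[i]*fabs(a[i][j]);
        if (dum >= big) { big = dum; imax = i; }
    }
    /* ... remainder of ludcmp (row interchange, division by the pivot) as before ... */
}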
I am trying to parallelise code that I have been using for a while, without parallelisation and without issues, in a relatively large C++ project (compiled as C++11). The code relies heavily on the Eigen package (version 3.3.9 used here).
Here is a minimalistic version of the code that has to be parallelised and for which I get crashes (I hope I have not introduced errors while condensing it...):
The main function with the two for loops that need to be parallelised:
Data_eigensols solve(const VectorXd& nu_l0_in, const int el, const long double delta0l,
        const long double DPl, const long double alpha, const long double q, const long double sigma_p,
        const long double resol){

    const int Nmmax=5000;

    int np, ng, s, s0m;
    double nu_p, nu_g;
    VectorXi test;
    VectorXd nu_p_all, nu_g_all, nu_m_all(Nmmax);
    Data_coresolver sols_iter;
    Data_eigensols nu_sols;

    //----
    // Some code that initialises several variables (not shown here for clarity).
    // This includes initialising the freq_max, freq_min, tol, nu_p_all, nu_g_all, deriv_p and deriv_g structures.
    //----

    // Problematic loops that crash with omp parallel but work fine when not using omp
    s0m=0;
    #pragma omp parallel for collapse(2) num_threads(4) default(shared) private(np, ng)
    for (np=0; np<nu_p_all.size(); np++)
    {
        for (ng=0; ng<nu_g_all.size(); ng++)
        {
            nu_p=nu_p_all[np];
            nu_g=nu_g_all[ng];
            sols_iter=solver_mm(nu_p, nu_g); // Depends on several other functions, but for clarity I do not show them here (all those 'const long double' or 'const int')
            //printf("np = %d, ng= %d, threadId = %d \n", np, ng, omp_get_thread_num());
            for (s=0; s<sols_iter.nu_m.size(); s++)
            {
                // Cleaning duplicates: assuming exact matches or matches within a tolerance range
                if ((sols_iter.nu_m[s] >= freq_min) && (sols_iter.nu_m[s] <= freq_max))
                {
                    test=where_dbl(nu_m_all, sols_iter.nu_m[s], tol, 0, s0m); // This function returns -1 if there is no match for the condition (here, that means sols_iter.nu_m[s] is not found in nu_m_all)
                    if (test[0] == -1) // Keep the solution only if it was not pre-existing in nu_m_all
                    {
                        nu_m_all[s0m]=sols_iter.nu_m[s];
                        s0m=s0m+1;
                    }
                }
            }
        }
    }

    nu_m_all.conservativeResize(s0m); // Reduce the size of nu_m_all to its final size
    nu_sols.nu_m=nu_m_all;
    nu_sols.nu_p=nu_p_all;
    nu_sols.nu_g=nu_g_all;
    nu_sols.dnup=deriv_p.deriv;
    nu_sols.dPg=deriv_g.deriv;

    return nu_sols;
}
The types Data_coresolver and Data_eigensols are defined as:
struct Data_coresolver{
    VectorXd nu_m, ysol, nu, pnu, gnu;
};
struct Data_eigensols{
    VectorXd nu_p, nu_g, nu_m, dnup, dPg;
};
The where_dbl() is as follows:
VectorXi where_dbl(const VectorXd& vec, double value, const double tolerance){
    /*
     * Gives the indexes of values of an array that match the value.
     * A tolerance parameter allows you to control how close the match
     * is considered as acceptable. The tolerance is in the same unit
     * as the value.
     */
    int cpt;
    VectorXi index_out;

    index_out.resize(vec.size());

    cpt=0;
    for(int i=0; i<vec.size(); i++){
        if(vec[i] > value - tolerance && vec[i] < value + tolerance){
            index_out[cpt]=i;
            cpt=cpt+1;
        }
    }
    if(cpt >= 1){
        index_out.conservativeResize(cpt);
    } else{
        index_out.resize(1);
        index_out[0]=-1;
    }
    return index_out;
}
Regarding solver_mm():
I am not detailing this function, as it calls a few subroutines and would be too long to show here, and I do not think it is relevant. It is basically a function that searches for the solution of an implicit equation.
What the main function is supposed to do:
The main function solve() calls solver_mm() iteratively in order to solve an implicit equation under different conditions, where the only variables are nu_p and nu_g. Sometimes the solutions for a pair (nu_p(i), nu_g(j)) duplicate the solutions for another pair (nu_p(k), nu_g(l)). This is why there is a section calling where_dbl() to detect those duplicated solutions and discard them, keeping only unique solutions.
What is the problem:
Without the #pragma call, the code works fine. With it, the code fails at a random point of the execution.
After a few tests, it seems that the culprit is related to the part that removes duplicate solutions. My guess is that there is concurrent writing to the nu_m_all VectorXd. I tried to use #pragma omp barrier without success, but I am quite new to OpenMP and I may have misunderstood how the barrier works.
Can someone let me know why I get a crash here and how to solve it? The solution might be obvious to someone with good experience in OpenMP.
nu_p, nu_g, sols_iter and test should be private. Since these variables are declared as shared, multiple threads might write to the same memory region in a non-thread-safe manner. This might be your problem.
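For illustration, a minimal sketch of what that suggestion could look like (only the directive changes; the loop body stays exactly as posted). The innermost index s is also listed, since it is declared outside the loops; whether the shared writes to nu_m_all and s0m need extra protection, as the question itself suspects, is a separate issue:

#pragma omp parallel for collapse(2) num_threads(4) default(shared) \
        private(np, ng, s, nu_p, nu_g, sols_iter, test)
for (np=0; np<nu_p_all.size(); np++)
{
    for (ng=0; ng<nu_g_all.size(); ng++)
    {
        // ... same body as above: solver_mm(), where_dbl(), duplicate filtering ...
    }
}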
I am currently working on parallelizing a nested for loop using C++ and OpenMP. Without going into the actual details of the program, I have constructed a basic example of the concepts I am using below:
float var = 0.f;
float distance = some float array;
float temp[] = some float array;

for(int i=0; i < distance.size; i++){
    // some work
    for(int j=0; j < temp.size; j++){
        var += temp[i]/distance[j];
    }
}
I attempted to parallelize the above code in the following way:
float var = 0.f;
float distance = some float array;
float temp[] = some float array;

#pragma omp parallel for default(shared)
for(int i=0; i < distance.size; i++){
    // some work
    #pragma omp parallel for reduction(+:var)
    for(int j=0; j < temp.size; j++){
        var += temp[i]/distance[j];
    }
}
I then compared the serial program output with the parallel program output and got incorrect results. I know that this is mainly due to the fact that floating-point arithmetic is not associative. But are there any workarounds that give exact results?
Although the lack of associativity of floating-point arithmetic might be an issue in some cases, the code you show here exposes a much more fundamental problem which you need to address first: the status of the var variable in the outer loop.
Indeed, since var is modified inside the i loop, even if only in the j part of it, it needs to be "privatized" somehow. The exact status it needs depends on the value you expect it to store upon exit of the enclosing parallel region:
If you don't care about its value at all, just declare it private (or better, declare it inside the parallel region).
If you need its final value at the end of the i loop, and considering that it accumulates a sum of values, you'll most likely need to declare it reduction(+:), although lastprivate might also be what you want (impossible to say without further details).
If private or lastprivate was all you needed, but you also need its initial value upon entrance of the parallel region, then you'll have to add firstprivate too (no need for that if you went for reduction, as that is already taken care of).
That should be enough for fixing your issue.
Now, in your snippet, you also parallelized the inner loop. Going for nested parallelism is usually a bad idea, so unless you have a very compelling reason for doing so, you will likely get much better performance by parallelizing only the outer loop and leaving the inner loop alone. That doesn't mean the inner loop won't benefit from the parallelization, but rather that several instances of the inner loop will be computed in parallel (each one being sequential, admittedly, but the whole process is parallel).
A nice side effect of removing the inner loop's parallelization (in addition to making the code faster) is that all accumulations into the private var variables are now done in the same order as in the sequential code. Therefore, your (hypothetical) floating-point arithmetic issues inside the outer loop will have disappeared, and only the final reduction upon exit of the parallel region might still expose them.
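Putting the above together, a sketch of the corrected snippet could look like the following (assuming distance and temp are, say, std::vector<float>, since the question's declarations are pseudo-code; the indexing mirrors the question's snippet):

float var = 0.f;

#pragma omp parallel for reduction(+:var)       // outer loop only; var privatized and reduced
for(int i=0; i < (int)distance.size(); i++){
    // some work
    for(int j=0; j < (int)temp.size(); j++){    // inner loop left sequential
        var += temp[i]/distance[j];
    }
}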
I have implemented the Jacobi algorithm based on the routine described in the book Numerical Recipes, but since I plan to work with very large matrices I am trying to parallelize it using OpenMP.
void ROTATE(MatrixXd &a, int i, int j, int k, int l, double s, double tau)
{
    double g, h;
    g = a(i,j);
    h = a(k,l);
    a(i,j) = g - s*(h + g*tau);
    a(k,l) = h + s*(g - h*tau);
}

void jacobi(int n, MatrixXd &a, MatrixXd &v, VectorXd &d)
{
    int j, iq, ip, i;
    double tresh, theta, tau, t, sm, s, h, g, c;

    VectorXd b(n);
    VectorXd z(n);
    v.setIdentity();
    z.setZero();

    #pragma omp parallel for
    for (ip=0; ip<n; ip++)
    {
        d(ip) = a(ip,ip);
        b(ip) = d(ip);
    }

    for (i=0; i<50; i++)
    {
        sm = 0.0;
        for (ip=0; ip<n-1; ip++)
        {
            #pragma omp parallel for reduction (+:sm)
            for (iq=ip+1; iq<n; iq++)
                sm += fabs(a(ip,iq));
        }
        if (sm == 0.0) {
            break;
        }

        if (i < 3)
            tresh = 0.2*sm/(n*n);
        else
            tresh = 0.0;

        #pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
        for (ip=0; ip<n-1; ip++)
        {
            //#pragma omp parallel for private (g,h,t,theta,c,s,tau)
            for (iq=ip+1; iq<n; iq++)
            {
                g = 100.0*fabs(a(ip,iq));
                if (i > 3 && (fabs(d(ip))+g) == fabs(d[ip]) && (fabs(d[iq])+g) == fabs(d[iq]))
                    a(ip,iq) = 0.0;
                else if (fabs(a(ip,iq)) > tresh)
                {
                    h = d(iq) - d(ip);
                    if ((fabs(h)+g) == fabs(h))
                    {
                        t = (a(ip,iq))/h;
                    }
                    else
                    {
                        theta = 0.5*h/(a(ip,iq));
                        t = 1.0/(fabs(theta) + sqrt(1.0 + theta*theta));
                        if (theta < 0.0)
                        {
                            t = -t;
                        }
                        c = 1.0/sqrt(1 + t*t);
                        s = t*c;
                        tau = s/(1.0 + c);
                        h = t*a(ip,iq);
                        #pragma omp critical
                        {
                            z(ip) = z(ip) - h;
                            z(iq) = z(iq) + h;
                            d(ip) = d(ip) - h;
                            d(iq) = d(iq) + h;
                            a(ip,iq) = 0.0;
                            for (j=0; j<ip; j++)
                                ROTATE(a, j, ip, j, iq, s, tau);
                            for (j=ip+1; j<iq; j++)
                                ROTATE(a, ip, j, j, iq, s, tau);
                            for (j=iq+1; j<n; j++)
                                ROTATE(a, ip, j, iq, j, s, tau);
                            for (j=0; j<n; j++)
                                ROTATE(v, j, ip, j, iq, s, tau);
                        }
                    }
                }
            }
        }
    }
}
I wanted to parallelize the loop that does most of the calculations, and the two pragmas quoted below:
//#pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
//#pragma omp parallel for private (g,h,t,theta,c,s,tau)
are my attempts at it. Unfortunately both of them end up producing incorrect results. I suspect the problem may be in this block:
z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;
because usually this sort of accumulation would need a reduction, but since each thread accesses a different part of the array, I am not certain of this.
I am not really sure whether I am doing the parallelization correctly, because I have only recently started working with OpenMP, so any suggestion or recommendation would be welcome.
Sidenote: I know there are faster algorithms for eigenvalue and eigenvector determination including the SelfAdjointEigenSolver in Eigen, but those are not giving me the precision I need in the eigenvectors and this algorithm is.
My thanks in advance.
Edit: I considered the correct answer to be the one provided by The Quantum Physicist, because what I did does not reduce the computation time for systems of size up to 4096x4096. In any case I corrected the code in order to make it work, and maybe for big enough systems it could be of some use. I would advise the use of timers to test whether the
#pragma omp for
actually decreases the computation time.
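For reference, a minimal sketch of such a timing check using OpenMP's own wall-clock timer (wrapping a call to jacobi() is the obvious candidate here):

#include <omp.h>
#include <cstdio>

// ... inside whatever function drives the solver ...
double t0 = omp_get_wtime();
jacobi(n, a, v, d);                        // the region whose runtime is in question
double t1 = omp_get_wtime();
std::printf("jacobi took %f s\n", t1 - t0);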
I'll try to help, but I'm not sure this is the answer to your question.
There are tons of problems with your code. My friendly advice for you is: Don't do parallel things if you don't understand the implications of what you're doing.
For some reason, it looks like you think that putting everything in a parallel #pragma for will make it faster. This is VERY wrong, because spawning threads is an expensive thing to do and costs (relatively) lots of memory and time. So if you redo that #pragma for for every loop, you'll respawn threads for every loop, which will significantly reduce the speed of your program... UNLESS your matrices are REALLY huge and the computation time is much greater than the cost of spawning the threads.
I ran into a similar issue when I wanted to multiply huge matrices element-wise (and then I needed the sum for some expectation value in quantum mechanics). To use OpenMP for that, I had to flatten the matrices into linear arrays, distribute the array chunks among the threads, and then run a for loop in which every iteration uses elements that are guaranteed to be independent of the others, letting them all evolve independently. This was quite fast. Why? Because I never had to respawn threads twice.
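As a hedged illustration of that pattern (not the answerer's actual code): the matrices are stored as flat arrays, a single parallel region is opened, and each thread sums its own share of the element-wise products via a reduction:

#include <vector>
#include <cstddef>

// Element-wise multiply-and-sum over two flattened matrices.
// Every index is touched by exactly one thread; the partial sums are reduced at the end.
double weighted_sum(const std::vector<double>& A, const std::vector<double>& B)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (std::ptrdiff_t k = 0; k < (std::ptrdiff_t)A.size(); ++k)
        total += A[k] * B[k];
    return total;
}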
Why are you getting wrong results? I believe the reason is that you're not respecting the shared-memory rules. You have some variable(s) that is being modified by multiple threads simultaneously. It's hiding somewhere, and you have to find it! For example, what does the function z do? Does it take stuff by reference? Because what I see here:
z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;
looks VERY thread-unsafe, and I don't understand what you're doing. Are you returning a reference that you then modify? That is a recipe for thread non-safety. Why don't you create clean arrays and deal with them instead of this?
How to debug: Start with a small example (2x2 matrix, maybe), and use only 2 threads, and try to understand what's going on. Use a debugger and define break points, and check what information is shared between threads.
Also consider using a mutex to check what data gets ruined when it becomes shared. Here is how to do it.
My recommendation: Don't use OpenMP unless you plan to spawn the threads ONLY ONCE. I actually believe that OpenMP is going to die very soon because of C++11. OpenMP was beautiful back when C++ didn't have any native multi-threading implementation. So learn how to use std::thread, and if you need to run many things in threads, learn how to create a thread pool with std::thread. This is a good book for learning multithreading.
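A minimal sketch of the std::thread route recommended above (the function name and the per-element work are placeholders): the index range is split into contiguous chunks, each chunk is handed to one thread, and the threads are spawned and joined exactly once:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

void apply_chunked(std::vector<double>& data, unsigned nthreads)
{
    if (nthreads == 0) nthreads = 1;                   // guard against a zero thread count
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t lo = t * chunk;
        const std::size_t hi = std::min(data.size(), lo + chunk);
        pool.emplace_back([&data, lo, hi] {
            for (std::size_t k = lo; k < hi; ++k)
                data[k] *= 2.0;                        // placeholder for the real per-element work
        });
    }
    for (auto& th : pool) th.join();                   // threads created once, joined once
}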
I am running my code on an Intel® Xeon(R) CPU X5680 @ 3.33GHz × 12. Here is fairly simple OpenMP pseudo-code (the OpenMP parts are exact; the normal code in between has been changed for compactness and clarity):
vector<int> myarray(arraylength, something);

omp_set_num_threads(3);
#pragma omp parallel
{
    #pragma omp for schedule(dynamic)
    for(int j=0; j<pr.max_iteration_limit; j++)
    {
        vector<int> temp_array(updated_array(a, b, myarray));

        for(int i=0; i<arraylength; i++)
        {
            #pragma omp atomic
            myarray[i] += temp_array[i];
        }
    }
}
All parameters taken by the updated_array function are copied so that there are no clashes. Basic structure of the updated_array function:
vector<int> updated_array(myClass1 a, vector<myClass2> b, vector<int> myarray)
{
    //lots of preparations, but obviously there are only local variables, since
    //the function only takes copies

    //the core code taking most of the time, which I will be measuring:
    double time_s = time(NULL);
    while(waiting_time < t_wait) //as long as needed
    {
        //a fairly short computation
        //generates variable: vector<int> another_array
        waiting_time++;
    }
    double time_f = time(NULL);
    cout << "Thread " << omp_get_thread_num() << " / " << omp_get_num_threads()
         << " runtime " << time_f - time_s << endl;

    //a few more changes to another_array
    return another_array;
}
Questions and my attempts to resolve it:
Adding more threads (with omp_set_num_threads(3);) does create more threads, but each thread does the job more slowly. E.g. 1 thread: 6s, 2: 10s, 3: 15s, ..., 12: 60s.
(By "job" I refer to the exact part of the code I pointed out as the core (NOT the whole omp loop), since it takes most of the time; this makes sure I am not missing anything additional.)
There are no rand() things happening inside the core code.
A dynamic or static schedule doesn't make a difference here, of course (and I tried both).
There seems to be no sharing possible in any way or form, so I am running out of ideas completely... What can it be? I would be extremely grateful if you could help me with this (even with just ideas)!
P.S. The point of the code is to take myarray, do a bit of Monte Carlo on it with a single thread, and then collect the tiny changes and add/subtract them to/from the original array.
OpenMP may implement the atomic access using a mutex, in which case your code will suffer from heavy contention on that mutex. This results in a significant performance hit.
If the work in updated_array() dominates the cost of the parallel loop, you'd be better off putting the whole of the second loop inside a critical section:
{ // body of parallel loop
    vector<int> temp_array = updated_array(a, b, myarray);

    #pragma omp critical(UpDateMyArray)
    for(int i=0; i<arraylength; i++)
        myarray[i] += temp_array[i];
}
However, your code looks broken (essentially not threadsafe), see my comment.
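For completeness, an alternative sketch (not part of the answer above, and it changes when other iterations see the updates, which matters for the Monte-Carlo scheme described in the question): each thread accumulates into its own buffer and merges into myarray only once, so there is neither a per-element atomic nor a per-iteration critical section:

#pragma omp parallel
{
    vector<int> local(arraylength, 0);            // private accumulation buffer per thread
    #pragma omp for schedule(dynamic)
    for(int j=0; j<pr.max_iteration_limit; j++)
    {
        vector<int> temp_array(updated_array(a, b, myarray));
        for(int i=0; i<arraylength; i++)
            local[i] += temp_array[i];
    }
    #pragma omp critical(MergeMyArray)            // one merge per thread, not per iteration
    for(int i=0; i<arraylength; i++)
        myarray[i] += local[i];
}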
I'm using C++ to create a finite element analysis routine, and I'm trying to use OpenMP to parallelise some for loops in my code.
I have an array of structures called Elements, with each entry of the array containing a structure holding all of the information for that particular element. Some of the required information is a stiffness matrix for each element (called kt in the code below). This is then assembled into a global stiffness matrix for the whole system.
The calculation of the element stiffness matrix is pretty involved and lengthy so I reckon I could get some good speed gains by parallelising its calculation.
The code below works fine when everything related to OpenMP is commented out, but fails when it isn't, despite the fact that I am not writing to Elements at the same time, and kt and Elementi (the ith element) are private to the thread they are being used in.
I'm using Armadillo for the matrix algebra, so that is what 'mat' means.
I'm pretty new to C++ so any help will be much appreciated.
mat KtCalc(struct Element Elements[], mat Nodes, double ngamma, double nbeta, double hhtalpha, int nel, double dt)
//Stiffness matrix calculation routine
{
    int nn = Nodes.n_rows;
    mat Kt(nn*6, nn*6, fill::zeros);
    int i;
    struct Element Elementi;
    mat kt;

    #pragma omp parallel private(Elementi,kt) shared(nel,i,hhtalpha,ngamma,nbeta,dt,Elements)
    {
        #pragma omp for
        for(i=0; i<nel; i++)
        {
            #pragma omp critical(dataupdate)
            {
                Elementi = Elements[i];
            }
            kt = KtEl(Elementi, ngamma, nbeta, hhtalpha, dt);
            #pragma omp critical(dataupdate)
            {
                Elements[i].kt = kt;
            }
        }
    }

    for(int k=0; k<nel; k++){
        //Use the stuff calculated above in a non-parallel way to calculate Kt
    }
    return Kt;
}
Your problem was the shared declaration of the i of the for loop. In C++ you're allowed to declare variables anywhere, so the following code is equivalent and should work:
mat KtCalc(struct Element Elements[], mat Nodes, double ngamma, double nbeta, double hhtalpha, int nel, double dt)
//Stiffness matrix calculation routine
{
    int nn = Nodes.n_rows;
    mat Kt(nn*6, nn*6, fill::zeros);

    #pragma omp parallel for
    for(int i=0; i<nel; i++)
    {
        Elements[i].kt = KtEl(Elements[i], ngamma, nbeta, hhtalpha, dt);
    }

    for(int k=0; k<nel; k++){
        //Use the stuff calculated above in a non-parallel way to calculate Kt
    }
    return Kt;
}
Also, you're allowed to modify the elements of an array simultaneously, as long as you're sure you never modify the same element from two threads (which is the case here, since each iteration only touches the ith element). The critical sections were thus unnecessary.
On a side note, you'd generally want your variables to be declared as late as possible. Declaring them at the top is old C style. Declaring them as late as possible means:
It makes RAII (Resource Acquisition Is Initialization) easier.
It keeps the scope of the variable tight. This lets the optimizer work better.