Using omp parallel in a nested loop crashes due to concurrent writing? - c++

I am trying to parallelise code that I have been using for a while, without parallelisation and without issues, in a relatively large C++ project (compiled with C++11). The code relies heavily on the Eigen package (version 3.3.9 used here).
Here is a minimalistic version of the code that has to be parallelised and for which I get crashes (I hope I have not introduced errors while squashing things down...):
The main function with the two for loops that need to be parallelised:
Data_eigensols solve(const VectorXd& nu_l0_in, const int el, const long double delta0l,
                     const long double DPl, const long double alpha, const long double q, const long double sigma_p,
                     const long double resol){

    const int Nmmax=5000;

    int np, ng, s, s0m;
    double nu_p, nu_g;
    VectorXi test;
    VectorXd nu_p_all, nu_g_all, nu_m_all(Nmmax);
    Data_coresolver sols_iter;
    Data_eigensols nu_sols;

    //----
    // Some code that initialises several variables (not shown here for clarity).
    // This includes initialising the freq_max, freq_min, tol, nu_p_all, nu_g_all, deriv_p and deriv_g structures.
    //----

    // Problematic loops that crash with omp parallel but work fine when not using omp
    s0m=0;
    #pragma omp parallel for collapse(2) num_threads(4) default(shared) private(np, ng)
    for (np=0; np<nu_p_all.size(); np++)
    {
        for (ng=0; ng<nu_g_all.size(); ng++)
        {
            nu_p=nu_p_all[np];
            nu_g=nu_g_all[ng];
            sols_iter=solver_mm(nu_p, nu_g); // Depends on several other functions; for clarity, I do not show them here (all those 'const long double' or 'const int')
            //printf("np = %d, ng= %d, threadId = %d \n", np, ng, omp_get_thread_num());
            for (s=0; s<sols_iter.nu_m.size(); s++)
            {
                // Cleaning doubles: assuming exact matches or matches within a tolerance range
                if ((sols_iter.nu_m[s] >= freq_min) && (sols_iter.nu_m[s] <= freq_max))
                {
                    test=where_dbl(nu_m_all, sols_iter.nu_m[s], tol, 0, s0m); // This function returns -1 if there is no match for the condition (here, it means that sols_iter.nu_m[s] is not found in nu_m_all)
                    if (test[0] == -1) // Keep the solution only if it was not pre-existing in nu_m_all
                    {
                        nu_m_all[s0m]=sols_iter.nu_m[s];
                        s0m=s0m+1;
                    }
                }
            }
        }
    }

    nu_m_all.conservativeResize(s0m); // Reduce nu_m_all to its final size
    nu_sols.nu_m=nu_m_all;
    nu_sols.nu_p=nu_p_all;
    nu_sols.nu_g=nu_g_all;
    nu_sols.dnup=deriv_p.deriv;
    nu_sols.dPg=deriv_g.deriv;
    return nu_sols;
}
The types Data_coresolver and Data_eigensols are defined as:
struct Data_coresolver{
    VectorXd nu_m, ysol, nu, pnu, gnu;
};
struct Data_eigensols{
    VectorXd nu_p, nu_g, nu_m, dnup, dPg;
};
The where_dbl() function is as follows:
VectorXi where_dbl(const VectorXd& vec, double value, const double tolerance){
    /*
     * Gives the indexes of values of an array that match the value.
     * A tolerance parameter allows you to control how close the match
     * is considered as acceptable. The tolerance is in the same unit
     * as the value.
     */
    int cpt;
    VectorXi index_out;

    index_out.resize(vec.size());

    cpt=0;
    for(int i=0; i<vec.size(); i++){
        if(vec[i] > value - tolerance && vec[i] < value + tolerance){
            index_out[cpt]=i;
            cpt=cpt+1;
        }
    }
    if(cpt >= 1){
        index_out.conservativeResize(cpt);
    } else{
        index_out.resize(1);
        index_out[0]=-1;
    }
    return index_out;
}
Regarding the solver_mm():
I won't detail this function, as it calls a few subroutines and would be too long to show here, and I don't think it is relevant. It is basically a function that searches for the solution of an implicit equation.
What the main function is supposed to do:
The main function solve() calls solver_mm() iteratively in order to solve an implicit equation under different conditions, where the only variables are nu_p and nu_g. Sometimes the solution for a pair (nu_p(i), nu_g(j)) duplicates the solution of another pair (nu_p(k), nu_g(l)). This is why there is a section calling where_dbl() to detect those duplicated solutions and discard them, keeping only unique solutions.
What is the problem:
Without the #pragma directive, the code works fine. With it, it fails at a random point of the execution.
After a few tests, it seems that the culprit is somehow related to the part that removes duplicate solutions. My guess is that there is concurrent writing on the nu_m_all VectorXd. I tried to use #pragma omp barrier without success, but I am quite new to omp and I may have misunderstood how the barrier works.
Can someone let me know why I get a crash here and how to solve it? The solution might be obvious to someone with good experience in omp.

nu_p, nu_g, sols_iter and test should be private. Since these variables are currently shared, multiple threads may write to the same memory region in a non-thread-safe manner. This might be your problem.
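A minimal sketch of that change, assuming the rest of solve() stays exactly as posted. Note that the check-and-append on the shared nu_m_all/s0m pair (which the question itself suspects) is also a race, so this sketch additionally wraps it in a critical section so that only one thread updates it at a time:
s0m=0;
#pragma omp parallel for collapse(2) num_threads(4) default(shared) \
        private(np, ng, s, nu_p, nu_g, sols_iter, test)
for (np=0; np<nu_p_all.size(); np++)
{
    for (ng=0; ng<nu_g_all.size(); ng++)
    {
        nu_p=nu_p_all[np];
        nu_g=nu_g_all[ng];
        sols_iter=solver_mm(nu_p, nu_g);
        for (s=0; s<sols_iter.nu_m.size(); s++)
        {
            if ((sols_iter.nu_m[s] >= freq_min) && (sols_iter.nu_m[s] <= freq_max))
            {
                #pragma omp critical(nu_m_update) // serialise the shared duplicate bookkeeping
                {
                    test=where_dbl(nu_m_all, sols_iter.nu_m[s], tol, 0, s0m);
                    if (test[0] == -1)
                    {
                        nu_m_all[s0m]=sols_iter.nu_m[s];
                        s0m=s0m+1;
                    }
                }
            }
        }
    }
}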

Related

OpenMP reduction on container elements

I have a nested loop with few outer and many inner iterations. In the inner loop I need to calculate a sum, so I want to use an OpenMP reduction. The outer loop is over a container, so the reduction is supposed to happen on an element of that container.
Here's a minimal contrived example:
#include <omp.h>
#include <vector>
#include <iostream>

int main(){
    constexpr int n { 128 };
    std::vector<int> vec (4, 0);
    for (unsigned int i {0}; i<vec.size(); ++i){
        /* this does not work */
        //#pragma omp parallel for reduction (+:vec[i])
        //for (int j=0; j<n; ++j)
        //    vec[i] +=j;

        /* this works */
        int* val { &vec[0] };
        #pragma omp parallel for reduction (+:val[i])
        for (int j=0; j<n; ++j)
            val[i] +=j;

        /* this is allowed, but looks very wrong. Produces wrong results
         * for std::vector, but on an Eigen type, it worked. */
        #pragma omp parallel for reduction (+:val[i])
        for (int j=0; j<n; ++j)
            vec[i] +=j;
    }
    for (unsigned int i=0; i<vec.size(); ++i) std::cout << vec[i] << " ";
    std::cout << "\n";
    return 0;
}
The problem is that if I write the reduction clause as (+:vec[i]), I get the error ‘vec’ does not have pointer or array type, which is descriptive enough to find a workaround. However, that means I have to introduce a new variable and somewhat change the code logic, and I find it less obvious to see what the code is supposed to do.
My main question is, whether there is a better/cleaner/more standard way to write a reduction for container elements.
I'd also like to know why and how the third way shown in the code above somewhat works. I'm actually working with the Eigen library, on whose containers that variant seems to work just fine (haven't extensively tested it though), but on std::vector, it produces results somewhere between zero and the actual result (8128). I thought it should work, because vec[i] and val[i] should both evaluate to dereferencing the same address. But alas, apparently not.
I'm using OpenMP 4.5 and gcc 9.3.0.
I'll answer your question in three parts:
1. What is the best way to perform OpenMP reductions in your example above with a std::vector?
i) Use your approach, i.e. create a pointer int* val { &vec[0] };
ii) Declare a new shared variable like #1201ProgramAlarm answered.
iii) Declare a user-defined reduction (which is not really applicable in your simple case, but see 3. below for a more efficient pattern).
2. Why doesn't the third loop work, and why does it work with Eigen?
As the other answer states, you are telling OpenMP to perform a reduction sum on a memory address X, but you are performing additions on memory address Y, which means that the reduction declaration is ignored and your addition is subject to the usual race conditions.
You don't really provide much detail about your Eigen venture, but here are some possible explanations:
i) You're not really using multiple threads (check n = Eigen::nbThreads( )).
ii) You didn't disable Eigen's own parallelism, which can disrupt your own usage of OpenMP, e.g. via the EIGEN_DONT_PARALLELIZE compile-time define.
iii) The race condition is there, but you're not seeing it because Eigen operations take longer, you're using a low number of threads and only writing a low number of values, hence a lower occurrence of threads interfering with each other to produce a wrong result.
3. How should I parallelize this scenario using OpenMP (technically not a question you asked explicitly)?
Instead of parallelizing only the inner loop, you should parallelize both at the same time. The less serial code you have, the better. In this scenario each thread has its own private copy of the vec vector, which gets reduced after all the elements have been summed by their respective thread. This solution is optimal for your presented example, but might run into RAM problems if you're using a very large vector and very many threads (or have very limited RAM).
#pragma omp parallel for collapse(2) reduction(vsum : vec)
for (unsigned int i {0}; i<vec.size(); ++i){
    for (int j = 0; j < n; ++j) {
        vec[i] += j;
    }
}
where vsum is a user defined reduction, i.e.
#pragma omp declare reduction(vsum : std::vector<int> : \
        std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), \
                       omp_out.begin(), std::plus<int>())) \
        initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))
Declare the reduction before the function where you use it, and you'll be good to go.
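For completeness, here is a self-contained sketch of that approach applied to the toy example from the question (assuming a compiler with OpenMP support, e.g. g++ -fopenmp); every element of vec should come out as 8128:
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

// User-defined reduction: merge two vectors element-wise with +.
#pragma omp declare reduction(vsum : std::vector<int> : \
        std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), \
                       omp_out.begin(), std::plus<int>())) \
        initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

int main(){
    constexpr int n { 128 };
    std::vector<int> vec (4, 0);

    // Both loops are collapsed; each thread sums into its own private copy
    // of vec, and the private copies are merged element-wise at the end.
    #pragma omp parallel for collapse(2) reduction(vsum : vec)
    for (unsigned int i = 0; i < vec.size(); ++i){
        for (int j = 0; j < n; ++j){
            vec[i] += j;
        }
    }

    for (unsigned int i = 0; i < vec.size(); ++i) std::cout << vec[i] << " ";
    std::cout << "\n"; // expected: 8128 8128 8128 8128
    return 0;
}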
For the second example, rather than storing a pointer then always accessing the same element, just use a local variable:
int val = vec[i];
#pragma omp parallel for reduction (+:val)
for (int j=0; j<n; ++j)
    val += j;
vec[i] = val;
With the 3rd loop, I suspect that the problem is that the reduction clause names a variable, but you never update that variable by that name in the loop, so there is nothing that the compiler sees to reduce. Using Eigen may make the code a bit more complicated to analyze, which may be why that loop appears to work.

How to parallelize nearest neighbour search using OpenMP

Basically, I have a collection std::vector<std::pair<std::vector<float>, unsigned int>> which contains pairs of templates std::vector<float> of size 512 (2048 bytes) and their corresponding identifier unsigned int.
I am writing a function in which I am provided with a template and I need to return the identifier of the most similar template in the collection. I am using dot product to compute the similarity.
My naive implementation looks as follows:
// Should return false if no match is found (ie. similarity is 0 for all templates in collection)
bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
bool found = false;
similarity = 0.f;
for (size_t i = 0; i < collection.size(); ++i) {
const float* candidateTemplate = collection[i].first.data();
float consinSimilarity = getSimilarity(data, candidateTemplate, length); // computes cosin sim between two vectors, implementation depends on architecture.
if (consinSimilarity > similarity) {
found = true;
similarity = consinSimilarity;
label = collection[i].second;
}
}
return found;
}
How can I speed this up using parallelization? My collection can potentially contain millions of templates. I have read that you can add #pragma omp parallel for reduction, but I am not entirely sure how to use it (or whether this is even the best option).
Also note:
For my dot product implementation, if the base architecture supports AVX & FMA, I am using this implementation.
Will this affect performance when we parallelize since there are only a limited number of SIMD registers?
Since we don't have access to an example that actually compiles (which would have been nice), I didn't actually try to compile the example below. Nevertheless, some minor typos (maybe) aside, the general idea should be clear.
The task is to find the highest value of similarity and the corresponding label. For this we can indeed use a reduction, but since we need to find the maximum of one value and then store the corresponding label, we use a pair to hold both values at once, in order to implement this as a reduction in OpenMP.
I have slightly rewritten your code, which possibly makes things a bit harder to read given the original naming of the variable (temp). Basically, we perform the search in parallel, so each thread finds an optimal value; we then ask OpenMP to find the optimal solution between the threads (the reduction) and we are done.
//Reduce by finding the maximum and also storing the corresponding label, this is why we use a std::pair.
void reduce_custom (std::pair<float, unsigned int>& output, std::pair<float, unsigned int>& input) {
    if (input.first > output.first) output = input;
}

//Declare an OpenMP reduction with our pair and our custom reduction function.
#pragma omp declare reduction(custom_reduction : \
        std::pair<float, unsigned int>: \
        reduce_custom(omp_out, omp_in)) \
        initializer(omp_priv(omp_orig))

bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    std::pair<float, unsigned int> temp(0.0, label); //Stores thread local similarity and corresponding best label.

    #pragma omp parallel for reduction(custom_reduction:temp)
    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data();
        float consinSimilarity = getSimilarity(data, candidateTemplate, length);
        if (consinSimilarity > temp.first) {
            temp.first = consinSimilarity;
            temp.second = collection[i].second;
        }
    }
    if (temp.first > 0.f) {
        similarity = temp.first;
        label = temp.second;
        return true;
    }
    return false;
}
Regarding your concern about the limited number of SIMD registers: their number depends on the specific CPU you are using. To the best of my understanding, each core has a set number of vector registers available, so as long as you were not using more than were available before, it should be fine now as well. Besides, AVX-512 for instance provides 32 vector registers and 2 arithmetic units for vector operations per core, so running out of compute resources is not trivial; you are more likely to suffer from poor memory locality (particularly in your case, with the vectors being saved all over the place). I might of course be wrong; if so, please feel free to correct me in the comments.
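To illustrate the memory-locality point, here is a purely hypothetical layout (my own assumption, not code from the question): packing all templates into one flat buffer lets the dot products stream through contiguous memory instead of chasing one heap allocation per template.
#include <cstddef>
#include <vector>

// Hypothetical contiguous storage: template i occupies the half-open range
// [i*length, (i+1)*length) of 'data', and labels[i] is its identifier.
struct TemplateStore {
    std::vector<float> data;            // count * length floats, back to back
    std::vector<unsigned int> labels;   // one identifier per template
    unsigned int length = 512;

    const float* tmpl(std::size_t i) const { return data.data() + i * length; }
    std::size_t count() const { return labels.size(); }
};
The parallel loop above would then iterate over store.count() and call getSimilarity(data, store.tmpl(i), store.length) instead of dereferencing collection[i].first.data().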

new to openMP, any suggestions to parallel the following code with openMP?

I want to speed up the code below with OpenMP, and tried to add #pragma omp for to the two loops computing "sum -= a[i][k]*a[k][j]", since the hotspot analysis shows these two loops take a large share of the time. But it seems some race conditions led to wrong results. Any suggestions?
void ludcmp(float **a, int n, int *indx, float *d)
{
    int i,imax,j,k;
    float big,dum,sum,temp;
    float *vv;

    vv=vector(1,n);
    *d=1.0;
    for (j=1;j<=n;j++) {
        for (i=1;i<j;i++) {
            sum=a[i][j];
            for (k=1;k<i;k++) sum -= a[i][k]*a[k][j]; //here
            a[i][j]=sum;
        }
        big=0.0;
        for (i=j;i<=n;i++) {
            sum=a[i][j];
            for (k=1;k<j;k++)
                sum -= a[i][k]*a[k][j]; //here
            a[i][j]=sum;
            if ( (dum=vv[i]*fabs(sum)) >= big) {
                big=dum;
                imax=i;
            }
        }
    }
    // ... (remainder of the routine not shown in the question)
}
Your variables are all declared at the top of the function, so every thread will share them, resulting in little or no benefit from the threading (and, worse, in race conditions).
You should declare variables as close as possible to where you use them. In particular, sum and k are used in the innermost loops and should be declared right there (so that every thread will have its own copy of those variables). This can be extended to i and dum as well. Also, that last if (looking for the largest value) can/should be placed in a separate loop and either run single-threaded or with proper OpenMP directives for handling big and imax.
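A minimal sketch of what that suggests for the second hotspot (variable names, 1-based indexing and loop bounds follow the question's code; acc is a helper introduced only for this sketch): the dot product is accumulated with a reduction and written back to a[i][j], and the pivot search for big/imax runs afterwards in a separate, serial loop.
for (i = j; i <= n; i++) {
    float acc = 0.0f;                         // thread-combined partial dot product
    #pragma omp parallel for reduction(+:acc)
    for (int k = 1; k < j; k++)
        acc += a[i][k] * a[k][j];
    a[i][j] -= acc;                           // same effect as the original sum -= ...
}
// Pivot search kept serial so big and imax are not raced on.
big = 0.0;
for (i = j; i <= n; i++) {
    dum = vv[i] * fabs(a[i][j]);
    if (dum >= big) {
        big = dum;
        imax = i;
    }
}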

Parallelise 'for' loop reading and writing to data structure with Open MP

I'm using C++ to create a finite element analysis routine, and I'm trying to use Open MP to parallelise some 'for' loops in my code.
I have an array of structures called Elements, with each section of the array containing a structure with all of the information for that particular element. Some of the information that is required is a stiffness matrix for each element (called kt in the code below). This is then assembled into a global stiffness matrix for the whole system.
The calculation of the element stiffness matrix is pretty involved and lengthy so I reckon I could get some good speed gains by parallelising its calculation.
The code below works fine when everything related to Open MP is commented out, but fails when it isn't, despite the fact that I am not writing to Elements at the same time and that kt and Elementi (the ith element) are private to the thread they are being used in.
I'm using Armadillo for the matrix algebra so that is what the 'mat' means.
I'm pretty new to C++ so any help will be much appreciated.
mat KtCalc(struct Element Elements[], mat Nodes, double ngamma, double nbeta, double hhtalpha, int nel, double dt)
//Stiffness matrix calculation routine
{
    int nn=Nodes.n_rows;
    mat Kt(nn*6, nn*6, fill::zeros);
    int i;
    struct Element Elementi;
    mat kt;

    #pragma omp parallel private(Elementi,kt) shared(nel,i,hhtalpha,ngamma,nbeta,dt,Elements)
    {
        #pragma omp for
        for(i=0;i<nel;i++)
        {
            #pragma omp critical(dataupdate)
            {
                Elementi=Elements[i];
            }
            kt=KtEl(Elementi, ngamma, nbeta, hhtalpha, dt);
            #pragma omp critical(dataupdate)
            {
                Elements[i].kt=kt;
            }
        }
    }
    for(int k=0;k<nel;k++){
        //Use the stuff calculated above in a non parallel way to calculate Kt
    }
    return Kt;
}
Your problem was the shared declaration of the i of the for loop. In C++ you're allowed to declare variables anywhere. The following code is equivalent and should work:
mat KtCalc(struct Element Elements[], mat Nodes, double ngamma, double nbeta, double hhtalpha, int nel, double dt)
//Stiffness matrix calculation routine
{
    int nn=Nodes.n_rows;
    mat Kt(nn*6, nn*6, fill::zeros);

    #pragma omp parallel for
    for(int i=0;i<nel;i++)
    {
        Elements[i].kt=KtEl(Elements[i], ngamma, nbeta, hhtalpha, dt);
    }
    for(int k=0;k<nel;k++){
        //Use the stuff calculated above in a non parallel way to calculate Kt
    }
    return Kt;
}
Also, you're allowed to modify the elements of an array simultaneously, as long as you're sure you never modify the same element from two threads (which is the case here, since each iteration only touches the ith element). The critical sections were thus unnecessary.
On a side note, you'd generally want your variables to be declared as late as possible. Declaring them all at the top is old C style. Declaring them as late as possible means:
It makes RAII (Resource Acquisition Is Initialization) easier.
It keeps the scope of the variable tight. This lets the optimizer work better.

openmp parallel for with non-PODs

I'm trying to speed up a program, at the heart of which is a trivial-looking loop:
double sum=0.;
#pragma omp parallel for reduction(+:sum) // fails
for( size_t i=0; i<_S.size(); ++i ){
    sum += _S[i].first * R(atoms, _S[i].second);
}
While the looping itself is trivial, the objects inside it are not PODs: here _S is in fact an
std::vector< std::pair<double, std::vector<size_t> > >, and R(...) is an overloaded operator()(...) const of some object. Both of its arguments are qualified as const, so the call should not have any side effects.
Since some 90% of the runtime is spent in this call, it seemed a simple thing to throw in an OpenMP pragma as shown above and enjoy a speedup by a factor of two or three;
but of course, the code works OK with a single thread and gives plainly wrong results for more than one thread :-).
There is no data dependency, and both _S and R(...) seem to be safe to share between threads, but it still produces nonsense.
I'd really appreciate any pointers on how to find out what goes wrong.
UPD2:
Figured it out. Like all bugs, it's trivial. R(...) was calling the operator() of something of this sort:
class objR{
    public:
        objR(const size_t N){
            _buffer.reserve(N);
        };

        double operator()(...) const{
            // do something, using the _buffer to store intermediaries
        }

    private:
        std::vector<double> _buffer;
};
Clearly, different threads use the _buffer at the same time and mess it up. My solution so far is to allocate more space (memory is not a problem, the code is CPU-bound):
class objR{
    public:
        objR(const size_t N){
            int nth=1;
            #ifdef _OPENMP
            nth=omp_get_max_threads();
            #endif
            _buffer.resize(nth);              // one scratch buffer per thread
            for(int t=0; t<nth; ++t){
                _buffer[t].reserve(N);
            }
        }

        double operator()(...) const{
            int thread_id=0;
            #ifdef _OPENMP
            thread_id = omp_get_thread_num();
            #endif
            // do something, using _buffer[thread_id] to store intermediaries
        }

    private:
        std::vector< std::vector<double> > _buffer;
};
This seems to work correctly. Still, since this is my first foray into multithreaded things, I'd appreciate it if somebody knowledgeable could comment on whether there is a better approach.
The access to _S[i].first and _S[i].second is perfectly safe (I can't guarantee anything about atoms). This means that your call to R must be what's causing the problem. You need to find out what R is and post what it's doing.
On another point, names which begin with an underscore followed by an uppercase character are reserved for the implementation, and you invoke undefined behaviour by using them.