Parallelizing nested loops with OpenMP - c++

Perhaps the solution to my problem is obvious to someone with experience with OpenMP, but I don't have it. I want to accelerate the following subroutine using OpenMP:
void Build_ERIS(vector<double> &eris, vector<Atomic_Orbital> &Basis)
{
  int basis_size = Basis.size();
  int m = basis_size*(basis_size+1)/2;
  eris.resize(m*(m+1)/2);
  bool compute;
  std::fill(eris.begin(), eris.end(), 0);
  int i_orbital,j_orbital, k_orbital,l_orbital, i_primitive, j_primitive, k_primitive,l_primitive,ij,kl, ijkl,ijij,klkl;
  #pragma omp parallel
  {
    #pragma omp for ordered
    for(i_orbital=0; i_orbital<basis_size; i_orbital++){
      for(j_orbital=0; j_orbital<i_orbital+1; j_orbital++){
        ij = i_orbital*(i_orbital+1)/2 + j_orbital;
        for(k_orbital=0; k_orbital<basis_size; k_orbital++){
          for(l_orbital=0; l_orbital<k_orbital+1; l_orbital++){
            kl = k_orbital*(k_orbital+1)/2 + l_orbital;
            if (ij >= kl) {
              ijkl = composite_index(i_orbital,j_orbital,k_orbital,l_orbital);
              ijij = composite_index(i_orbital,j_orbital,i_orbital,j_orbital);
              klkl = composite_index(k_orbital,l_orbital,k_orbital,l_orbital);
              for(i_primitive=0; i_primitive<Basis[i_orbital].contraction.size; i_primitive++)
                for(j_primitive=0; j_primitive<Basis[j_orbital].contraction.size; j_primitive++)
                  for(k_primitive=0; k_primitive<Basis[k_orbital].contraction.size; k_primitive++)
                    for(l_primitive=0; l_primitive<Basis[l_orbital].contraction.size; l_primitive++)
                      eris[ijkl] +=
                        normconst(Basis[i_orbital].contraction.exponent[i_primitive],Basis[i_orbital].angular.l, Basis[i_orbital].angular.m, Basis[i_orbital].angular.n)*
                        normconst(Basis[j_orbital].contraction.exponent[j_primitive],Basis[j_orbital].angular.l, Basis[j_orbital].angular.m, Basis[j_orbital].angular.n)*
                        normconst(Basis[k_orbital].contraction.exponent[k_primitive],Basis[k_orbital].angular.l, Basis[k_orbital].angular.m, Basis[k_orbital].angular.n)*
                        normconst(Basis[l_orbital].contraction.exponent[l_primitive],Basis[l_orbital].angular.l, Basis[l_orbital].angular.m, Basis[l_orbital].angular.n)*
                        Basis[i_orbital].contraction.coef[i_primitive]*
                        Basis[j_orbital].contraction.coef[j_primitive]*
                        Basis[k_orbital].contraction.coef[k_primitive]*
                        Basis[l_orbital].contraction.coef[l_primitive]*
                        ERI_int(Basis[i_orbital].contraction.center.x, Basis[i_orbital].contraction.center.y, Basis[i_orbital].contraction.center.z, Basis[i_orbital].contraction.exponent[i_primitive],Basis[i_orbital].angular.l, Basis[i_orbital].angular.m, Basis[i_orbital].angular.n,
                                Basis[j_orbital].contraction.center.x, Basis[j_orbital].contraction.center.y, Basis[j_orbital].contraction.center.z, Basis[j_orbital].contraction.exponent[j_primitive],Basis[j_orbital].angular.l, Basis[j_orbital].angular.m, Basis[j_orbital].angular.n,
                                Basis[k_orbital].contraction.center.x, Basis[k_orbital].contraction.center.y, Basis[k_orbital].contraction.center.z, Basis[k_orbital].contraction.exponent[k_primitive],Basis[k_orbital].angular.l, Basis[k_orbital].angular.m, Basis[k_orbital].angular.n,
                                Basis[l_orbital].contraction.center.x, Basis[l_orbital].contraction.center.y, Basis[l_orbital].contraction.center.z, Basis[l_orbital].contraction.exponent[l_primitive],Basis[l_orbital].angular.l, Basis[l_orbital].angular.m, Basis[l_orbital].angular.n);
              /**/
            }
          }
        }
      }
    }
  }
}
My concern is how to be sure that, after the OpenMP parallelization, the accumulations into eris[ijkl] still give the same values as the serial version of the routine. How can I do a loop fusion in a way that is numerically safe?

Several things I see.
1) #pragma omp for ordered means: execute every single one of the iterations of this loop in order. This essentially means that while you're executing "in parallel," all of your work will be done in serial. Remove it.
2) You have not declared any of your variables shared or private. Note that all variables by default will be shared, so in your case ij and kl for instance will be accessible by any thread working on any iteration. You can no doubt see how this would cause a race condition if, say, iteration 100 changed variable ij while iteration 1 thought it was using it.
3) Your variable eris[ijkl] as you rightly noted must be reduced properly. If ijkl can never be the same value for two different iterations in your i_orbital loop, then you're fine as-is; no two threads will ever be changing the same variable eris[ijkl] potentially at the same time. If it can be the same value, then you have to carefully handle reduction on the array.
4) Here's what you should work with for starters. This is assuming that ijkl will never be the same value for two different iterations, and that your functions do not take any non-constant references (which could turn what I'm assuming are input variables into output variables).
#pragma omp parallel for private(i_orbital, j_orbital, ij, k_orbital, l_orbital, kl, ijkl, ijij, klkl, i_primitive, j_primitive, k_primitive, l_primitive)
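To make the pattern concrete, here is a small, self-contained toy analogue (not your actual integrals; the names packed, n and ij are illustrative only). Each iteration writes to a distinct packed index, so a plain parallel for is enough and no reduction is needed. Declaring the loop variables inside the loops, as done here, also makes the long private(...) list unnecessary.
#include <omp.h>
#include <vector>
#include <iostream>

int main()
{
    const int n = 1000;
    std::vector<double> packed(n * (n + 1) / 2, 0.0);    // same packed layout as ij = i*(i+1)/2 + j
    #pragma omp parallel for schedule(dynamic)            // inner work grows with i, so dynamic helps balance load
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            const int ij = i * (i + 1) / 2 + j;           // unique for every (i,j) pair with j <= i
            packed[ij] += static_cast<double>(i) * j;     // accumulation touches only this iteration's element
        }
    }
    std::cout << packed.back() << std::endl;
    return 0;
}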

Related

reduction with string type in OpenMP

I am using OpenMP to parallelize a for loop like so:
std::string stringType = "somevalue";
#pragma omp parallel for reduction(+ : stringType)
//a for loop here which every loop appends a string to stringType
The only way I can think to do this is to convert to an int representation in some way first and then convert back at the end but this has obvious overhead. Is there any better ways to perform this style of operation?
As mentioned in comments, reduction assumes that the operation is associative and commutative. The values may be computed in any order and be "accumulated" through any kind of partial results and the final result will be the same.
There is no guarantee that an OpenMP for loop will distribute contiguous iterations to each thread unless the loop schedule explicitly requests that. There is no guarantee either that contiguous blocks will be distributed by increasing thread number (i.e. thread #0 might go through iterations 1000-1999 while thread #1 goes through 0-999). If you need that behavior, then you should define your own schedule.
Something like:
int N=1000;
std::string globalString("initial value");
#pragma omp parallel shared(N,globalString)
{
    std::string localString; //Empty string
    // Set schedule
    int iterTo, iterFrom;
    iterFrom = omp_get_thread_num() * (N / omp_get_num_threads());
    if (omp_get_num_threads() == omp_get_thread_num()+1)
        iterTo = N;
    else
        iterTo = (1+omp_get_thread_num()) * (N / omp_get_num_threads());
    // Loop - concatenate a number of neighboring values in the right order
    // No #pragma omp for: each thread goes through the loop, but loop
    // boundaries change according to the thread ID
    for (int ii=iterFrom; ii<iterTo; ii++){
        localString += get_some_string(ii);
    }
    // Dirty trick to concatenate strings from all threads in the right order
    for (int ii=0; ii<omp_get_num_threads(); ii++){
        #pragma omp barrier
        if (ii==omp_get_thread_num())
            globalString += localString;
    }
}
A better way would be to have a shared array of std::string, each thread using one as a local accumulator. At the end, a single thread can run the concatenation part (and avoid the dirty trick and all its overhead-heavy barrier calls).
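A minimal sketch of that idea follows (get_some_string() is just a stand-in for whatever produces each piece). With schedule(static) and no chunk size, the iterations are split into contiguous blocks assigned in increasing thread order, so concatenating the per-thread strings by thread index afterwards preserves the original order:
#include <omp.h>
#include <string>
#include <vector>
#include <iostream>

std::string get_some_string(int i) { return std::to_string(i) + " "; }   // stand-in

int main()
{
    const int N = 1000;
    std::vector<std::string> partial;                 // one accumulator per thread
    #pragma omp parallel
    {
        #pragma omp single
        partial.resize(omp_get_num_threads());        // implicit barrier after single
        #pragma omp for schedule(static)              // contiguous blocks, in thread order
        for (int i = 0; i < N; ++i)
            partial[omp_get_thread_num()] += get_some_string(i);
    }
    std::string globalString("initial value");
    for (const std::string& s : partial)              // serial, ordered concatenation
        globalString += s;
    std::cout << globalString.size() << std::endl;
    return 0;
}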

Parallelize Algorithm with OpenMP in C++

my problem is this:
I want to solve TSP with the Ant Colony Optimization Algorithm in C++.
Right now I've implemented an algorithm that solves this problem iteratively.
For example: I generate 500 ants - and they find their routes one after the other.
Each ant does not start until the previous ant has finished.
Now I want to parallelize the whole thing - and I thought about using OpenMP.
So my first question is: Can I generate a large number of threads that work
simultaneously (for the number of ants > 500)?
I already tried something out. So this is my code from my main.cpp:
#pragma omp parallel for
for (auto ant = antarmy.begin(); ant != antarmy.end(); ++ant) {
    #pragma omp ordered
    if (ant->getIterations() < ITERATIONSMAX) {
        ant->setNumber(currentAntNumber);
        currentAntNumber++;
        ant->antRoute();
    }
}
And this is the code in my Ant class that is "critical" because each Ant reads and writes into the same Matrix (pheromone-Matrix):
void Ant::antRoute()
{
    this->route.setCity(0, this->getStartIndex());
    int nextCity = this->getNextCity(this->getStartIndex());
    this->routedistance += this->data->distanceMatrix[this->getStartIndex()][nextCity];
    int tempCity;
    int i = 2;
    this->setProbability(nextCity);
    this->setVisited(nextCity);
    this->route.setCity(1, nextCity);
    updatePheromone(this->getStartIndex(), nextCity, routedistance, 0);
    while (this->getVisitedCount() < datacitycount) {
        tempCity = nextCity;
        nextCity = this->getNextCity(nextCity);
        this->setProbability(nextCity);
        this->setVisited(nextCity);
        this->route.setCity(i, nextCity);
        this->routedistance += this->data->distanceMatrix[tempCity][nextCity];
        updatePheromone(tempCity, nextCity, routedistance, 0);
        i++;
    }
    this->routedistance += this->data->distanceMatrix[nextCity][this->getStartIndex()];
    // updatePheromone(-1, -1, -1, 1);
    ShortestDistance(this->routedistance);
    this->iterationsshortestpath++;
}
void Ant::updatePheromone(int i, int j, double distance, bool reduce)
{
    #pragma omp critical(pheromone)
    if (reduce == 1) {
        for (int x = 0; x < datacitycount; x++) {
            for (int y = 0; y < datacitycount; y++) {
                if (REDUCE * this->data->pheromoneMatrix[x][y] < 0)
                    this->data->pheromoneMatrix[x][y] = 0.0;
                else
                    this->data->pheromoneMatrix[x][y] -= REDUCE * this->data->pheromoneMatrix[x][y];
            }
        }
    }
    else {
        double currentpheromone = this->data->pheromoneMatrix[i][j];
        double updatedpheromone = (1 - PHEROMONEREDUCTION)*currentpheromone + (PHEROMONEDEPOSIT / distance);
        if (updatedpheromone < 0.0) {
            this->data->pheromoneMatrix[i][j] = 0;
            this->data->pheromoneMatrix[j][i] = 0;
        }
        else {
            this->data->pheromoneMatrix[i][j] = updatedpheromone;
            this->data->pheromoneMatrix[j][i] = updatedpheromone;
        }
    }
}
So for some reason the omp parallel for loop won't work on these range-based loops. So this is my second question - if you have any suggestions on how to get the range-based loops working, I'm happy to hear them.
Thanks for your help
So my first question is: Can I generate a large number of threads that work simultaneously (for the number of ants > 500)?
In OpenMP you typically shouldn't care how many threads are active, instead you make sure to expose enough parallel work through work-sharing constructs such as omp for or omp task. So while you may have a loop with 500 iterations, your program could be run with anything between one thread and 500 (or more, but they would just idle). This is a difference to other parallelization approaches such as pthreads where you have to manage all the threads and what they do.
Now your example uses ordered incorrectly. Ordered is only useful if you have a small part of your loop body that needs to be executed in-order. Even then it can be very problematic for performance. Also you need to declare a loop to be ordered if you want to use ordered inside. See also this excellent answer.
You should not use ordered. Instead make sure that the ants know their number beforehand, write the code such that they don't need a number, or at the very least such that the order of numbers doesn't matter for the ants. In the latter case you can use omp atomic capture.
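A hedged sketch of that last option, written against the loop from your question (antarmy, ITERATIONSMAX and currentAntNumber are your identifiers, and I'm assuming antarmy has random-access iterators, which OpenMP 3.0 and later accept in a canonical for loop):
#pragma omp parallel for
for (auto ant = antarmy.begin(); ant != antarmy.end(); ++ant) {
    if (ant->getIterations() < ITERATIONSMAX) {
        int myNumber;
        #pragma omp atomic capture
        myNumber = currentAntNumber++;   // each thread atomically grabs a unique number
        ant->setNumber(myNumber);
        ant->antRoute();                 // still serialized internally by the critical section
    }
}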
As to the access to shared data: try to avoid it as much as possible. Adding omp critical is a first step to get a correct parallel program, but it often leads to performance problems. Measure your parallel efficiency and use parallel performance analysis tools to find out if this is the case for you. Then you can use atomic data access or reduction (each thread has its own data it works on, and only after the main work is finished is the data from all threads merged).

OpenMP double for loop array with stored results

I've spent time going over other posts but I still can't get this simple program to go.
#include<iostream>
#include<cmath>
#include<omp.h>
using namespace std;
int main()
{
    int threadnum =4;//want manual control
    int steps=100000,cumulative=0, counter;
    int a,b,c;
    float dum1, dum2, dum3;
    float pos[10000][3] = {0};
    float non=0;
    //RNG declared
    #pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction (+: non, cumulative) num_threads(threadnum)
    {
        for(int dummy=0;dummy<(10000/threadnum);dummy++)
        {
            dum1=0,dum2=0,dum3=0;
            a=0,b=0,c=0;
            for (counter=0;counter<steps;counter++)
            {
                dum1 = somefunct1()+rand();
                dum2=somefunct2()+rand();
                dum3 = somefunct3(dum1, dum2, ...);
                a += somefunct4(dum1,dum2,dum3, ...);
                b += somefunct5(dum1,dum2,dum3, ...);
                c += somefunct6(dum1,dum2,dum3, ...);
                cumulative++; //count number of loops executed
            }
            pos[dummy][0] = a;//saves results of second loop to array
            pos[dummy][1] = b;
            pos[dummy][2] = c;
            non+= pos[dummy][0];//holds the summed a values
        }
    }
}
I've cut down the program to get it to fit here. A lot of the time, if I make changes (and I've tried a lot), the inner loop simply does not execute the correct number of times and I get cumulative equal to something like 32,532,849 instead of 1 billion. Scaling is about 2x for the code above but should be much higher.
I want the code to simply break up the first 10000-iteration for loop so that each thread runs a certain number of iterations in parallel (if this could be dynamic that would be nice) and saves the results of each iteration of the second for loop to the results array. The second for loop is composed of dependents and cannot be broken up. Currently the order of the 'dummy' iterations does not matter (pos[345] can be switched with pos[3456] as long as all three indices are switched), but I will have to modify it later so that it does matter.
The numerous variables and initializations in the inner loop are confusing me terribly. There are a lot of random calls and functions/math functions in the inner loop - is there overhead here that is causing a problem? I'm using GCC 4.9.2 on Windows.
Any help would be greatly appreciated.
Edit: finally fixed. Moved the RNG declaration inside the first for loop. Now I get 3.75x scaling going to 4 threads and 5.72x scaling on 8 threads (hyperthreads). Not perfect but I will take it. I still think there is an issue with thread locking and syncing.
......
float non=0;
#pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction (+: non, cumulative) num_threads(threadnum)
{
    //RNG declared
    #pragma omp for
    for(int dummy=0;dummy<(10000/threadnum);dummy++)
    {
        ....
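For reference, a self-contained sketch of the pattern the edit describes: the RNG is declared inside the parallel region (one engine per thread), and the work-sharing #pragma omp for then iterates over the full range, since OpenMP divides those iterations among the threads. The somefunct*() calls from the question are replaced by hypothetical per-thread random draws, and the step count is reduced for illustration:
#include <omp.h>
#include <random>
#include <array>
#include <vector>
#include <iostream>

int main()
{
    const int outer = 10000, steps = 1000;
    std::vector<std::array<float, 3>> pos(outer);
    long long cumulative = 0;
    float non = 0.0f;
    #pragma omp parallel reduction(+ : non, cumulative)
    {
        // per-thread RNG, seeded differently on each thread
        std::mt19937 gen(12345u + omp_get_thread_num());
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        #pragma omp for
        for (int dummy = 0; dummy < outer; ++dummy) {    // full range; OpenMP splits it up
            float a = 0, b = 0, c = 0;
            for (int counter = 0; counter < steps; ++counter) {
                a += dist(gen);                          // stand-ins for somefunct4/5/6
                b += dist(gen);
                c += dist(gen);
                ++cumulative;
            }
            pos[dummy] = {a, b, c};                      // each iteration owns its own row
            non += a;
        }
    }
    std::cout << cumulative << " " << non << std::endl;
    return 0;
}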

OpenMP parallel code has not the same output as the serial code

I had to change and extend my algorithm for some signal analysis (using the polyphase filterbank technique) and couldn't reuse my old OpenMP code. In the new code the results are not as expected (the results at the beginning positions of the array are incorrect compared with a serial run [the serial code shows the expected result]).
So in the first loop I have some FFT data in tFFTin, which I'm multiplying by a window function.
The goal is that a thread runs the inner loops for each polyphase factor. To avoid locks I use the reduction pragma (no complex reduction is defined by the standard, so I use my own, where each thread's omp_priv variable gets initialized with omp_orig [so with tFFTin]). The reason I'm using the ordered pragma is that the results should be added to the output vector in an ordered way.
typedef std::complex<float> TComplexType;
typedef std::vector<TComplexType> TFFTContainer;
#pragma omp declare reduction(complexMul:TFFTContainer:\
    transform(omp_in.begin(), omp_in.end(),\
              omp_out.begin(), omp_out.begin(),\
              std::multiplies<TComplexType>()))\
    initializer (omp_priv(omp_orig))
void ConcreteResynthesis::ApplyPolyphase(TFFTContainer& tFFTin, TFFTContainer& tFFTout, TWindowContainer& tWindow, *someparams*) {;
    #pragma omp parallel for shared(tWindow) firstprivate(sFFTParams) reduction(complexMul: tFFTin) ordered if(iFFTRawDataLen>cMinParallelSize)
    for (int p = 0; p < uPolyphase; ++p) {
        int iPolyphaseOffset = p * uFFTLength;
        for (int i = 0; i < uFFTLength; ++i) {
            tFFTin[i] *= tWindow[iPolyphaseOffset + i]; ///< get FFT input data from raw data
        }
        #pragma omp ordered
        {
            //using the overlap and add method
            for (int i = 0; i < sFFTParams.uFFTLength; ++i) {
                pDataPool->GetFullSignalData(workSignal)[mSignalPos + iPolyphaseOffset + i] += tFFTin[i];
            }
        }
    }
    mSignalPos = mSignalPos + mStep;
}
Is there a race condition or something, which makes wrong outputs at the beginning? Or do I have some logic error?
Another issue is that I don't really like my solution of using the ordered pragma; is there a better approach? (I also tried to use the reduction model for this, but the compiler doesn't allow me to use a pointer type for that.)
I think your problem is that you have implemented a very cool custom reduction for tFFTin. But this reduction is only applied at the end of the parallel region,
which is after you already use the data in tFFTin inside the ordered block. Another thing, as H. Iliev mentions, is that the second iteration of the outer loop relies on data which is computed in the previous iteration - a classic dependency.
I think you should try parallelizing the inner loops.
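A hedged sketch of what that might look like, using the identifiers from your code (I'm assuming GetFullSignalData() returns something indexable and that uFFTLength fits in an int). The outer loop over p stays serial, so both the accumulation into tFFTin and the order of the overlap-add are preserved, while the independent indices i are handled in parallel:
for (int p = 0; p < uPolyphase; ++p) {
    const int iPolyphaseOffset = p * uFFTLength;
    #pragma omp parallel for
    for (int i = 0; i < (int)uFFTLength; ++i)
        tFFTin[i] *= tWindow[iPolyphaseOffset + i];        // independent per i
    auto&& outSignal = pDataPool->GetFullSignalData(workSignal);
    #pragma omp parallel for
    for (int i = 0; i < (int)sFFTParams.uFFTLength; ++i)   // each i writes a distinct output sample
        outSignal[mSignalPos + iPolyphaseOffset + i] += tFFTin[i];
}
mSignalPos = mSignalPos + mStep;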

OpenMP parallelization

I'm writing a C++ program with scientific purposes. The program works well and it returns good results, so I decided to improve its perfomance using OpenMP. The loop I want to optimize is the following one:
//== #pragma omp parallel for private(i,j)
for (k=0; k < number; k++)
{
    for (i=0; i < L; i++)
    {
        for (j=0; j < L; j++)
        {
            red[i][j] = UNDEFINED;
        }
    }
    Point inicial = {L/2, L/2, OCCUPIED};
    red[L/2][L/2] = OCCUPIED;
    addToList(inicial, red, list, L,f);
    oc.push_back(inicial);
    while (list.size() > 0 && L > 0)
    {
        punto = selectPoint(red, list, generator, prob, p);
        if (punto.state == OCCUPIED)
        {
            addToList(punto, red, list, L,f);
            oc.push_back(punto);
        }
        else
        {
            out.push_back(punto);
        }
    }
    L = auxL;
    oc.clear();
    out.clear();
    list.clear();
}
f = f*1.0/(number*1.0);
if (f > 0.5)
{
    inta = inta;
    intb = p;
    p = (inta + intb) / 2.0;
}
else if (f < 0.5)
{
    intb = intb;
    inta = p;
    p = (inta + intb) / 2.0;
}
cout << p << endl;
}
My try with OpenMP is commented above. As you can see I've declared i and j as private because they're declared before the parallel section. I've also tried to make L private, with no results. Only segmentation faults and bad pointers everywhere.
I think the problem is the while loop nested inside. My questions are: Is the omp parallel for correct in this case, or should I try to optimize only that while loop? Are the std::vectors interfering with OpenMP?
NOTE: list, oc and out are std::vector<Point>, and Point is a simple struct with three int properties. addToList is a function with no loops inside.
You might want to go over an OpenMP tutorial. When you look at OpenMP code, you need to imagine what can happen in parallel. Take
oc.push_back(inicial);
Can two threads try to do this at the same time? Yes. Does std::vector support parallelism? No.
The code above is full of these things.
If you want to use data structures within your OpenMP code, you need to use locks. From my personal experience, when this happens, it is far better to refactor the algorithm than to actually use them. While OpenMP + locks is possible, it is usually an indication that there's a problem with the idea (= a possibly subjective view).
The current answer points out the concurrency in the code, but please note that not all data structures have to be implemented with locks to attain thread safety. There are also lock-free data structures. For this particular case, we could use the Harris lock-free linked list: https://timharris.uk/papers/2001-disc.pdf
While I know that pointing out concurrency issues to the OP is of great assistance at this point, I want to make sure we don't convey a wrong message by saying that locks are absolutely necessary to attain thread safety.
The directive #pragma omp parallel defines a piece of code that can be executed simultaneously by various threads. In your case, as you have not specified any further directive, your parallel region will be executed once by every thread. In order to achieve parallel behavior you could try to break the loop into smaller tasks (the taskloop directive will do the job). Those tasks will remain in a task pool until a thread starts executing them. This way your loop will be fragmented and executed by your threads instead of each thread executing the whole loop.
https://www.openmp.org/spec-html/5.0/openmpsu47.html - here's the official OpenMP documentation for the taskloop directive.
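A minimal, self-contained sketch of that construct (taskloop needs an OpenMP 4.5 compiler; note that for your code this only helps if every k iteration works on its own private copies of red, list, oc and out):
#include <omp.h>
#include <cstdio>

int main()
{
    const int number = 100;
    #pragma omp parallel
    #pragma omp single             // one thread creates the tasks...
    #pragma omp taskloop           // ...the loop iterations become tasks executed by the whole team
    for (int k = 0; k < number; ++k) {
        // one independent realization of the outer k-loop would go here
        std::printf("iteration %d ran on thread %d\n", k, omp_get_thread_num());
    }
    return 0;
}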