My problem is this:
I want to solve the TSP with the Ant Colony Optimization algorithm in C++.
Right now I've implemented an algorithm that solves this problem iteratively.
For example: I generate 500 ants, and they find their routes one after the other.
Each ant does not start until the previous ant has finished.
Now I want to parallelize the whole thing, and I thought about using OpenMP.
So my first question is: Can I generate a large number of threads that work
simultaneously (for the number of ants > 500)?
I already tried something out. So this is my code from my main.cpp:
#pragma omp parallel for
for (auto ant = antarmy.begin(); ant != antarmy.end(); ++ant) {
    #pragma omp ordered
    if (ant->getIterations() < ITERATIONSMAX) {
        ant->setNumber(currentAntNumber);
        currentAntNumber++;
        ant->antRoute();
    }
}
And this is the code in my Ant class that is "critical", because each ant reads from and writes into the same matrix (the pheromone matrix):
void Ant::antRoute()
{
    this->route.setCity(0, this->getStartIndex());
    int nextCity = this->getNextCity(this->getStartIndex());
    this->routedistance += this->data->distanceMatrix[this->getStartIndex()][nextCity];
    int tempCity;
    int i = 2;
    this->setProbability(nextCity);
    this->setVisited(nextCity);
    this->route.setCity(1, nextCity);
    updatePheromone(this->getStartIndex(), nextCity, routedistance, 0);

    while (this->getVisitedCount() < datacitycount) {
        tempCity = nextCity;
        nextCity = this->getNextCity(nextCity);
        this->setProbability(nextCity);
        this->setVisited(nextCity);
        this->route.setCity(i, nextCity);
        this->routedistance += this->data->distanceMatrix[tempCity][nextCity];
        updatePheromone(tempCity, nextCity, routedistance, 0);
        i++;
    }

    this->routedistance += this->data->distanceMatrix[nextCity][this->getStartIndex()];
    // updatePheromone(-1, -1, -1, 1);
    ShortestDistance(this->routedistance);
    this->iterationsshortestpath++;
}
void Ant::updatePheromone(int i, int j, double distance, bool reduce)
{
    #pragma omp critical(pheromone)
    if (reduce == 1) {
        for (int x = 0; x < datacitycount; x++) {
            for (int y = 0; y < datacitycount; y++) {
                if (REDUCE * this->data->pheromoneMatrix[x][y] < 0)
                    this->data->pheromoneMatrix[x][y] = 0.0;
                else
                    this->data->pheromoneMatrix[x][y] -= REDUCE * this->data->pheromoneMatrix[x][y];
            }
        }
    }
    else {
        double currentpheromone = this->data->pheromoneMatrix[i][j];
        double updatedpheromone = (1 - PHEROMONEREDUCTION) * currentpheromone + (PHEROMONEDEPOSIT / distance);
        if (updatedpheromone < 0.0) {
            this->data->pheromoneMatrix[i][j] = 0;
            this->data->pheromoneMatrix[j][i] = 0;
        }
        else {
            this->data->pheromoneMatrix[i][j] = updatedpheromone;
            this->data->pheromoneMatrix[j][i] = updatedpheromone;
        }
    }
}
For some reason the omp parallel for loop won't work on these range-based loops. So this is my second question: if you have any suggestions on how to get the range-based loops working, I'm happy to hear them.
Thanks for your help
So my first question is: Can I generate a large number of threads that work simultaneously (for the number of ants > 500)?
In OpenMP you typically shouldn't care how many threads are active; instead, you make sure to expose enough parallel work through work-sharing constructs such as omp for or omp task. So while you may have a loop with 500 iterations, your program could be run with anything between one thread and 500 (or more, but they would just idle). This differs from other parallelization approaches such as pthreads, where you have to manage all the threads and what they do yourself.
Now your example uses ordered incorrectly. ordered is only useful if you have a small part of your loop body that needs to be executed in order. Even then it can be very problematic for performance. Also, you need to declare the loop itself as ordered if you want to use an ordered region inside it. See also this excellent answer.
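For reference, a correct (though still not recommended here) use of ordered would look roughly like this sketch; do_work() and do_in_order() are placeholders, and the ordered clause on the loop directive is required before an ordered region may appear in the body:

#pragma omp parallel for ordered
for (int i = 0; i < n; ++i) {
    do_work(i);          // may run out of order on any thread
    #pragma omp ordered
    {
        do_in_order(i);  // only this block is executed in loop order
    }
}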
You should not use ordered. Instead, make sure that the ants know their number beforehand, write the code so that they don't need a number, or at the very least so that the order of the numbers doesn't matter for the ants. In the latter case you can use omp atomic capture, as sketched below.
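Here is a minimal sketch of the atomic capture variant, based on the loop from the question and assuming antarmy is a random-access container such as std::vector&lt;Ant&gt; (an index-based loop also sidesteps the range-based-loop issue mentioned in the question):

int currentAntNumber = 0;

#pragma omp parallel for
for (int a = 0; a < (int)antarmy.size(); ++a) {
    if (antarmy[a].getIterations() < ITERATIONSMAX) {
        int myNumber;
        #pragma omp atomic capture
        myNumber = currentAntNumber++;   // atomically fetch and increment the shared counter
        antarmy[a].setNumber(myNumber);
        antarmy[a].antRoute();
    }
}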
As for the access to shared data: try to avoid it as much as possible. Adding omp critical is a first step towards a correct parallel program, but it often leads to performance problems. Measure your parallel efficiency and use parallel performance analysis tools to find out whether this is the case for you. Then you can use atomic data access or a reduction (each thread has its own data to work on, and only after the main work is finished is the data from all threads merged).
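As a small sketch of a reduction (min reductions are supported since OpenMP 3.1), the search for the shortest tour could be merged across threads like this; it again assumes an indexable antarmy, routeDistance() is a hypothetical accessor for an ant's tour length (not a function from the question), and &lt;limits&gt; is needed for std::numeric_limits:

double shortest = std::numeric_limits<double>::max();

#pragma omp parallel for reduction(min : shortest)
for (int a = 0; a < (int)antarmy.size(); ++a) {
    double d = antarmy[a].routeDistance();  // hypothetical accessor
    if (d < shortest)
        shortest = d;
}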
Related
I have a C++ code that performs a time evolution of four variables that live on a 2D spatial grid. To save some time, I tried to parallelise my code with OpenMP but I just cannot get it to work: No matter how many cores I use, the runtime stays basically the same or increases. (My code does use 24 cores or however many I specify, so the compilation is not a problem.)
I have the feeling that the runtime for one individual time-step is too short and the overhead of producing threads kills the potential speed-up.
The layout of my code is:
for (int t = 0; t < max_time_steps; t++) {
    // do some book-keeping
    ...

    // perform time step
    // (1) calculate righthand-side of ODE:
    for (int i = 0; i < nr; i++) {
        for (int j = 0; j < ntheta; j++) {
            rhs[0][i][j] = A0[i][j] + B0[i][j] + ...;
            rhs[1][i][j] = A1[i][j] + B1[i][j] + ...;
            rhs[2][i][j] = A2[i][j] + B2[i][j] + ...;
            rhs[3][i][j] = A3[i][j] + B3[i][j] + ...;
        }
    }

    // (2) perform Euler step (or Runge-Kutta, ...)
    for (int d = 0; d < 4; d++) {
        for (int i = 0; i < nr; i++) {
            for (int j = 0; j < ntheta; j++) {
                next[d][i][j] = current[d][i][j] + time_step * rhs[d][i][j];
            }
        }
    }
}
I thought this code should be fairly easy to parallelise... I put "#pragma omp parallel for" in front of the (1) and (2) loops, and I also specified the number of cores (e.g. 4 cores for loop (2) since there are four variables), but there is simply no speed-up whatsoever.
I have found that OpenMP is fairly smart about when to create/destroy threads, i.e. it realises that threads will be required again soon, and then they are only put to sleep to save overhead.
I think one "problem" is that my time step is coded in a subroutine (I'm using RK4 instead of Euler) and the computation of the righthand-side is again in another subroutine that is called by the time_step() function. So, I believe that due to this, OpenMP cannot see that the threads should be kept open for longer and hence the threads are created and destroyed at every time step.
Would it be helpful to put a "#pragma omp parallel" in front of the time-loop so that the threads are created at the very beginning? And then do the actual parallelisation for the righthand-side (1) and the Euler step (2)? But how do I do that?
I have found numerous examples for how to parallelise nested for loops, but none of them were concerned with the setup where the inner loops have been sourced out to separate modules. Would this be an obstacle for parallelising?
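To illustrate the idea being asked about here, a structural sketch that mirrors the pseudocode above (not the real code): the parallel region is opened once outside the time loop, the book-keeping is restricted to one thread with omp single, and the two grid loops are shared among the already-running threads with omp for; the implicit barriers of single and for keep the time steps in sync. In step (2) the d loop has been moved innermost so that i and j can be collapsed:

#pragma omp parallel
{
    for (int t = 0; t < max_time_steps; t++) {
        #pragma omp single
        {
            // do some book-keeping (one thread only; implicit barrier afterwards)
        }

        // (1) calculate righthand-side of ODE
        #pragma omp for collapse(2)
        for (int i = 0; i < nr; i++) {
            for (int j = 0; j < ntheta; j++) {
                rhs[0][i][j] = A0[i][j] + B0[i][j];   // plus the remaining terms
                rhs[1][i][j] = A1[i][j] + B1[i][j];
                rhs[2][i][j] = A2[i][j] + B2[i][j];
                rhs[3][i][j] = A3[i][j] + B3[i][j];
            }
        }

        // (2) perform Euler step
        #pragma omp for collapse(2)
        for (int i = 0; i < nr; i++) {
            for (int j = 0; j < ntheta; j++) {
                for (int d = 0; d < 4; d++) {
                    next[d][i][j] = current[d][i][j] + time_step * rhs[d][i][j];
                }
            }
        }
    }
}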
I have now removed the d loops (by making the indices explicit) and collapsed the i and j loops (by running over the entire 2D array with one variable only).
The code looks like:
for (int t = 0; t < max_time_steps; t++) {
    // do some book-keeping
    ...

    // perform time step
    // (1) calculate righthand-side of ODE:
    #pragma omp parallel for
    for (int i = 0; i < nr*ntheta; i++) {
        rhs[0][0][i] = A0[0][i] + B0[0][i] + ...;
        rhs[1][0][i] = A1[0][i] + B1[0][i] + ...;
        rhs[2][0][i] = A2[0][i] + B2[0][i] + ...;
        rhs[3][0][i] = A3[0][i] + B3[0][i] + ...;
    }

    // (2) perform Euler step (or Runge-Kutta, ...)
    #pragma omp parallel for
    for (int i = 0; i < nr*ntheta; i++) {
        next[0][0][i] = current[0][0][i] + time_step * rhs[0][0][i];
        next[1][0][i] = current[1][0][i] + time_step * rhs[1][0][i];
        next[2][0][i] = current[2][0][i] + time_step * rhs[2][0][i];
        next[3][0][i] = current[3][0][i] + time_step * rhs[3][0][i];
    }
}
The size of nr*ntheta is 400*40 = 16,000 and I make max_time_steps = 1000 time steps. Still, the parallelisation does not result in a speed-up:
Runtime without OpenMP (result of time on the command line):
real 0m23.597s
user 0m23.496s
sys 0m0.076s
Runtime with OpenMP (24 cores)
real 0m23.162s
user 7m47.026s
sys 0m0.905s
I do not understand what's happening here.
One peculiarity that I don't show in my code snippet above is that my variables are not actually doubles but a self-defined struct of two doubles, which represent the real and imaginary parts. But I think this should not make a difference.
Just wanted to report some success after I left the parallelisation alone for a while. The code evolved for a year and now I went back to parallelisation. This time, I can say that OpenMP does its job and reduces the required wall time.
While the code evolved overall, this particular loop that I've shown above did not really change; only two things did: a) the resolution is higher, so it covers about 10 times as many points, and b) the number of calculations per loop is also about 10-fold (maybe even more).
My only explanation for why it works now but didn't a little over a year ago is that, when I tried to parallelise the code last time, it wasn't computationally expensive enough and the speed-up was killed by the OpenMP overhead. A single loop now requires about 200-300 ms, whereas last time it must have been in the single-digit ms range.
I can see such an effect when comparing gcc and the Intel compiler (which do a very different job when vectorizing):
a) Using gcc, one loop needs about 300 ms without OpenMP, and on two cores only 52% of that time is required --> near-perfect parallel scaling.
b) Using icpc, one loop needs about 160 ms without OpenMP, and on two cores it needs 60% of that time --> good scaling but about 20% less effective.
When going for more than two cores, the speed-up is not large enough to make it worthwhile.
I have the following recursive function (note: it is stripped of all unimportant details):
int recursion(...) {
    int minimum = INFINITY;
    for (int i = 0; i < C; i++) {
        int foo = recursion(...);
        if (foo < minimum) {
            minimum = foo;
        }
    }
    return minimum;
}
Note 2: The recursion is finite, just not in this simplified example, so please ignore that. The point of this question is how to approach this problem correctly.
I was thinking about using tasks, but I am not sure how to use them correctly, i.e. how to parallelize the inner loop.
EDIT 1: The recursion tree isn't well balanced. It is being used with a dynamic programming approach, so as time goes on, a lot of values are re-used from previous passes. This worries me a lot and I think it will be a big bottleneck.
C is somewhere around 20.
Metric for the best is fastest :)
It will run on 2x Xeon, so there is plenty of HW power available.
Yes, you can use OpenMP tasks to exploit parallelism on multiple recursion levels and to ensure that imbalances don't cause wasted cycles.
I would collect the results in a vector and compute the minimum outside. You could also perform a guarded (critical / lock) minimum computation within the task.
Avoid spawning tasks / allocating memory for the minimum if you are too deep in the recursion, where the overhead / work ratio becomes too bad. The strongest solution is to create two separate (parallel/serial) recursive functions. That way you have zero runtime overhead once you switch to the serial function - as opposed to checking the recursion depth against a threshold every time in a unified function.
int recursion(...) {
    int result;
    #pragma omp parallel
    #pragma omp single
    result = recursion_par(..., 0);   // store the result; returning from inside the parallel region is not allowed
    return result;
}

int recursion_ser(...) {
    int minimum = INFINITY;
    for (int i = 0; i < C; i++) {
        int foo = recursion_ser(...);
        if (foo < minimum) {
            minimum = foo;
        }
    }
    return minimum;
}

int recursion_par(..., int depth) {
    std::vector<int> foos(C);
    for (int i = 0; i < C; i++) {
        #pragma omp task shared(foos)
        {
            if (depth < threshold) {
                foos[i] = recursion_par(..., depth + 1);
            } else {
                foos[i] = recursion_ser(...);
            }
        }
    }
    #pragma omp taskwait
    return *std::min_element(std::begin(foos), std::end(foos));
}
Obviously you must not do any nasty things with global / shared state within the unimportant details.
I am trying to parallelize my own C++ implementation of the Travelling Salesman Problem using OpenMP.
I have a function cost() to calculate the cost of a route, and a vector [0,1,2,...,N], where N is the number of nodes of the route.
In main(), I am trying to find the best road:
do
{
    cost();
} while (std::next_permutation(permutation_base, permutation_base + operations_number));
I was trying to use #pragma omp parallel to parallelize that code, but it only made it more time-consuming.
Is there any way to parallelize that code?
#pragma omp parallel doesn't automatically divide the computation onto separate threads. If you want to divide the computation, you additionally need to use #pragma omp for; otherwise the whole computation is done multiple times, once per thread. For instance, the following code prints "Hello World!" four times on my laptop, since it has 4 cores.
#include <iostream>

int main(int argc, char* argv[]) {
    #pragma omp parallel
    std::cout << "Hello World!\n";
}
The same thing happens to your code if you simply write #pragma omp parallel: your code gets executed multiple times, once for each thread, and therefore your program won't be faster. If you want to divide the work among the threads (each thread does different things), you have to use something like #pragma omp parallel for.
Now we can look at your code. As it stands, it isn't suited for parallelization. Let's see why. You start with your array permutation_base and calculate the costs. Then you manipulate permutation_base with next_permutation. You actually have to wait for the cost computation to finish before you are allowed to manipulate the array, because otherwise the cost computation would be wrong. So the whole thing wouldn't work on separate threads.
One possible solution would be to keep multiple copies of your array permutation_base, where each copy only runs through a part of all permutations. For instance:
vector<int> permutation_base{1, 2, 3, 4};
int n = permutation_base.size();

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    // Make a copy of permutation_base
    auto perm = permutation_base;
    // rotate the i'th element to the front,
    // keep the other elements sorted
    std::rotate(perm.begin(), perm.begin() + i, perm.begin() + i + 1);
    // Now go through all permutations of the last `n-1` elements.
    // Keep the first element fixed.
    do {
        cost();
    } while (std::next_permutation(perm.begin() + 1, perm.end()));
}
Most definitely.
The big problem with parallelizing these permutation problems is that in order to parallelize well, you need to "index" into an arbitrary permutation. In short, you need to find the kth permutation. You can take advantage of some cool math properties and you'll find this:
// k is the (0-based) index of the desired permutation; V holds the elements to permute.
// fact() is assumed to be a factorial helper returning n! as a long long.
std::vector<int> kth_perm(long long k, std::vector<int> V) {
    long long int index;
    long long int next;
    std::vector<int> new_v;
    while (V.size()) {
        // each block of (|V|-1)! consecutive permutations starts with the same element
        index = k / fact(V.size() - 1);
        new_v.push_back(V.at(index));
        next = k % fact(V.size() - 1);
        V.erase(V.begin() + index);
        k = next;
    }
    return new_v;
}
So then your logic might look something like this:
long long int start = (numperms * threadnum) / numthreads;
long long int end = threadnum == numthreads - 1 ? numperms : (numperms * (threadnum + 1)) / numthreads;
perm = kth_perm(start, perm); // jump to this thread's starting permutation

for (long long int j = start; j < end; ++j) {
    if (is_valid_tour(adj_list, perm, startingVertex, endingVertex)) {
        isValidTour = true;
        return perm;
    }
    std::next_permutation(perm.begin(), perm.end());
}

isValidTour = false;
return perm;
Obviously there's a lot of code, but the idea of parallelizing it can be captured by the little code I've posted. You can visualize "indexing" like this:
|--------------------------------|
^ ^ ^
t1 t2 ... tn
Each thread finds the kth permutation for its starting index and then calls std::next_permutation until it reaches the starting point of the next thread.
Note that you'll want to wrap the function that contains the code above in a #pragma omp parallel region.
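For instance, a minimal sketch of such a wrapper (search_block() is a hypothetical function containing the loop above; omp_get_num_threads() and omp_get_thread_num() come from &lt;omp.h&gt;):

#pragma omp parallel
{
    int numthreads = omp_get_num_threads();
    int threadnum  = omp_get_thread_num();
    // each thread searches its own contiguous block of permutations
    search_block(threadnum, numthreads);  // hypothetical wrapper around the code above
}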
I'm writing a C++ program for scientific purposes. The program works well and returns good results, so I decided to improve its performance using OpenMP. The loop I want to optimize is the following one:
//== #pragma omp parallel for private(i,j)
for (k = 0; k < number; k++)
{
    for (i = 0; i < L; i++)
    {
        for (j = 0; j < L; j++)
        {
            red[i][j] = UNDEFINED;
        }
    }

    Point inicial = {L/2, L/2, OCCUPIED};
    red[L/2][L/2] = OCCUPIED;
    addToList(inicial, red, list, L, f);
    oc.push_back(inicial);

    while (list.size() > 0 && L > 0)
    {
        punto = selectPoint(red, list, generator, prob, p);
        if (punto.state == OCCUPIED)
        {
            addToList(punto, red, list, L, f);
            oc.push_back(punto);
        }
        else
        {
            out.push_back(punto);
        }
    }

    L = auxL;
    oc.clear();
    out.clear();
    list.clear();
}

f = f*1.0/(number*1.0);

if (f > 0.5)
{
    inta = inta;
    intb = p;
    p = (inta + intb) / 2.0;
}
else if (f < 0.5)
{
    intb = intb;
    inta = p;
    p = (inta + intb) / 2.0;
}

cout << p << endl;
}
My attempt with OpenMP is the commented line above. As you can see, I've declared i and j as private because they're declared before the parallel section. I've also tried to make L private, with no results: only segmentation faults and bad pointers everywhere.
I think the problem is the while loop nested inside. My questions are: Is the omp parallel for correct in this case, or should I try to parallelise only that while loop? Are the std::vectors interfering with OpenMP?
NOTE: list, oc and out are std::vector<Point>, and Point is a simple struct with three int properties. addToList is a function with no loops inside.
You might want to go over an OpenMP tutorial. When you look at OpenMP code, you need to imagine what can happen in parallel. Take
oc.push_back(inicial);
Can two threads try to do this at the same time? Yes. Does std::vector support parallelism? No.
The code above is full of these things.
If you want to mutate shared data structures within your OpenMP code, you need to use locks. From my personal experience, when this happens it is far better to refactor the algorithm than to actually use them. While OpenMP + locks is possible, it is usually an indication that there's a problem with the idea (a possibly subjective view).
The current answer points out the concurrency in the code, but please note that not all data structures have to be implemented with locks to attain thread safety. There are also lock-free data structures. For this particular case, we could use the Harris lock-free linked list: https://timharris.uk/papers/2001-disc.pdf
While I know that pointing out concurrency issues to the OP is of great assistance at this point, I want to make sure we don't convey a wrong message by saying that locks are absolutely necessary to attain thread safety.
The directive #pragma omp parallel defines a piece of code that can be executed simultaneously by various threads. In your case, as you have not specified any further directive, your parallel region will be executed once by every thread. In order to achieve parallel behaviour you could try to break the loop into smaller tasks (the taskloop directive will do the job). Those tasks remain in a task pool until a thread starts executing them. This way your loop is fragmented and executed by your threads instead of each thread executing the whole loop.
Here is the official OpenMP documentation for the taskloop directive: https://www.openmp.org/spec-html/5.0/openmpsu47.html
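A structural sketch of how taskloop could be applied to the outer loop of the question (taskloop requires OpenMP 4.5 or later; note that red, list, oc and out would still have to become per-task copies to avoid the data races discussed above):

#pragma omp parallel
#pragma omp single
{
    #pragma omp taskloop
    for (int k = 0; k < number; k++) {
        // body of one independent run, operating on task-local
        // copies of red, list, oc and out
    }
}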
I'm trying to compute the distance matrix in parallel using OpenMP, calculating the distance between each point and all the other points. The best algorithm I have thought of so far costs O(n^2), and the performance of my OpenMP implementation using 10 threads on an 8-processor machine isn't better than the serial approach in terms of running time. I wonder whether there is any mistake in my OpenMP implementation, as this is my first time using OpenMP, so please let me know if there is any mistake in my approach or a better, faster approach. The following is my code, where "dat" is a vector that contains the data points.
map<int, map<int, double> > dist; // construct the distance matrix
int c = count(dat.at(0).begin(), dat.at(0).end(), delm) + 1;

#pragma omp parallel for shared(c, dist)
for (int p = 0; p < dat.size(); p++)
{
    for (int j = p + 1; j < dat.size(); j++)
    {
        double ecl = 0;
        string line1 = dat.at(p);
        string line2 = dat.at(j);
        for (int i = 0; i < c; i++)
        {
            double num1 = atof(line1.substr(0, line1.find_first_of(delm)).c_str());
            line1 = line1.substr(line1.find_first_of(delm) + 1).c_str();
            double num2 = atof(line2.substr(0, line2.find_first_of(delm)).c_str());
            line2 = line2.substr(line2.find_first_of(delm) + 1).c_str();
            ecl += (num1 - num2) * (num1 - num2);
        }
        ecl = sqrt(ecl);
        #pragma omp critical
        {
            dist[p][j] = ecl;
            dist[j][p] = ecl;
        }
    }
}
#pragma omp critical has the effect of serializing your loop, so getting rid of it should be your first goal. This should be a step in the right direction:
ptrdiff_t const c = count(dat[0].begin(), dat[0].end(), delm) + 1;
vector<vector<double> > dist(dat.size(), vector<double>(dat.size()));

#pragma omp parallel for
for (size_t p = 0; p < dat.size(); ++p)
{
    for (size_t j = p + 1; j < dat.size(); ++j)
    {
        double ecl = 0.0;
        string line1 = dat[p];
        string line2 = dat[j];
        for (ptrdiff_t i = 0; i != c; ++i)
        {
            double const num1 = atof(line1.substr(0, line1.find_first_of(delm)).c_str());
            double const num2 = atof(line2.substr(0, line2.find_first_of(delm)).c_str());
            line1 = line1.substr(line1.find_first_of(delm) + 1);
            line2 = line2.substr(line2.find_first_of(delm) + 1);
            ecl += (num1 - num2) * (num1 - num2);
        }
        ecl = sqrt(ecl);
        dist[p][j] = ecl;
        dist[j][p] = ecl;
    }
}
There are a few other obvious things that could be done to make this faster overall, but fixing your parallelization is the most important thing.
As already pointed out, using critical sections will slow things down, as only one thread is allowed in that section at a time. There is absolutely no need for critical sections here, because each thread writes to mutually exclusive sections of data, and reading non-modified data obviously doesn't need protection.
My suspicion as to the slowness of the code comes down to uneven work distribution over the threads. By default I think OpenMP divides the iterations equally among threads. As an example, consider when you have 8 threads and 8 points:
-thread 0 will get 7 distance calculations
-thread 1 will get 6 distance calculations
...
-thread 7 will get 0 distance calculations
Even with more iterations, a similar inequality still exists. If you need to convince yourself, make a thread-private counter to track how many distance calculations are actually done by each thread.
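A quick sketch of such a counter, just for measuring the imbalance (omp_get_max_threads() and omp_get_thread_num() come from &lt;omp.h&gt;; the loop mirrors the structure from the question):

vector<long> calcs(omp_get_max_threads(), 0);

#pragma omp parallel for
for (int p = 0; p < (int)dat.size(); p++)
    for (int j = p + 1; j < (int)dat.size(); j++)
        calcs[omp_get_thread_num()]++;   // one distance calculation would happen here

for (size_t t = 0; t < calcs.size(); t++)
    cout << "thread " << t << " : " << calcs[t] << " calculations" << endl;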
With work-sharing constructs like parallel for, you can specify various work distribution strategies. In your case, it's probably best to go with
#pragma omp for schedule(guided)
When each thread requests some iterations of the for loop, it will get the number of remaining iterations (not already given to a thread) divided by the number of threads. So initially you get big blocks, and later you get smaller blocks. It's a form of automatic load balancing; mind you, there's some (probably small) overhead in dynamically assigning iterations to the threads.
To avoid the first thread getting an unfairly large amount of work, your looping structure should be changed so that lower iterations have fewer calculations, e.g. change the inner for loop to
for (j = 0; j < p; j++)
Another thing to consider is that when working with a lot of cores, memory can become the bottleneck. You have 8 processors fighting over probably 2 or maybe 3 channels of DRAM (separate memory sticks on the same channel still compete for bandwidth). The on-chip CPU cache is at best shared between all the processors, so you still have no more cache than the serial version of this program.