I'm writing a C++ program with scientific purposes. The program works well and it returns good results, so I decided to improve its perfomance using OpenMP. The loop I want to optimize is the following one:
//== #pragma omp parallel for private(i,j)
for (k=0; k < number; k++)
{
for (i=0; i < L; i++)
{
for (j=0; j < L; j++)
{
red[i][j] = UNDEFINED;
}
}
Point inicial = {L/2, L/2, OCCUPIED};
red[L/2][L/2] = OCCUPIED;
addToList(inicial, red, list, L,f);
oc.push_back(inicial);
while (list.size() > 0 && L > 0)
{
punto = selectPoint(red, list, generator, prob, p);
if (punto.state == OCCUPIED)
{
addToList(punto, red, list, L,f);
oc.push_back(punto);
}
else
{
out.push_back(punto);
}
}
L = auxL;
oc.clear();
out.clear();
list.clear();
}
f = f*1.0/(number*1.0);
if (f > 0.5)
{
inta = inta;
intb = p;
p = (inta + intb) / 2.0;
}
else if (f < 0.5)
{
intb = intb;
inta = p;
p = (inta + intb) / 2.0;
}
cout << p << endl;
}
My try with OpenMP is commented above. As you can see I've declared i and j as private because they're declared before the parallel section. I've also tried to make L private, with no results. Only segmentation faults and bad pointers everywhere.
I think the problem is that while loop nested inside. My questions are: Is the omp parallel for correct in this case? or should I try to optimize only that while loop? Are the std::vector interfering with OpenMP?
NOTE: list, oc and out are std::vector<Point>, and Point is a simple struct with three int properties. addToList is a function with no loops inside.
You might want to go over an OpenMP tutorial. When you look at OpenMP code, you need to imagine what can happen in parallel. Take
oc.push_back(inicial);
Can two threads try to do this at the same time? Yes. Does std::vector support parallelism? No.
The code above is full of these things.
If you want to use data-structures within your OpenMP ode, you need to use locks. From my personal experience, when this happens, it is far better to refactor the algorithm than actually use them. While OpenMP + locks is possible, it is usually an indication that there's a problem with the idea (= a possibly subjective view).
The current answer points out the concurrency in the code, but please note that not all data-structures have to be implemented with locks to attain thread-safety. There are also lock-free data structures. For this particular case, we could the Harris lock free linked list: https://timharris.uk/papers/2001-disc.pdf
While I know that pointing out concurrency issues to the OP is of great assistance at this point, I want to make sure we don't convey a wrong message by saying that locks are absolutely necessary to attain thread safety.
The directive #pragma omp parallel defines a piece of code that can be executed simultaneously by various threads. In your case, as you have not specified any further directive, your parallel region will be executed once by every thread. In order to achieve a parallel behavior you could try to break the loop into smaller tasks(the taskloop directive will do the job). Those tasks will remain in a task pool until a thread starts executing them. This way your loop will be fragmented and executed by your threads instead of making each thread execute the whole loop.
https://www.openmp.org/spec-html/5.0/openmpsu47.html here's the official openMP documentation for the taskloop directive.
Related
my problem is this:
I want to solve TSP with the Ant Colony Optimization Algorithm in C++.
Right now Ive implemented a algorithm that solve this problem iterative.
For example: I generate 500 ants - and they find their route one after the other.
Each ant starts not until the previous ant finished.
Now I want to parallelize the whole thing - and I thought about using OpenMP.
So my first question is: Can I generate a large number of threads that work
simultaneously (for the number of ants > 500)?
I already tried something out. So this is my code from my main.cpp:
#pragma omp parallel for
for (auto ant = antarmy.begin(); ant != antarmy.end(); ++ant) {
#pragma omp ordered
if (ant->getIterations() < ITERATIONSMAX) {
ant->setNumber(currentAntNumber);
currentAntNumber++;
ant->antRoute();
}
}
And this is the code in my Ant class that is "critical" because each Ant reads and writes into the same Matrix (pheromone-Matrix):
void Ant::antRoute()
{
this->route.setCity(0, this->getStartIndex());
int nextCity = this->getNextCity(this->getStartIndex());
this->routedistance += this->data->distanceMatrix[this->getStartIndex()][nextCity];
int tempCity;
int i = 2;
this->setProbability(nextCity);
this->setVisited(nextCity);
this->route.setCity(1, nextCity);
updatePheromone(this->getStartIndex(), nextCity, routedistance, 0);
while (this->getVisitedCount() < datacitycount) {
tempCity = nextCity;
nextCity = this->getNextCity(nextCity);
this->setProbability(nextCity);
this->setVisited(nextCity);
this->route.setCity(i, nextCity);
this->routedistance += this->data->distanceMatrix[tempCity][nextCity];
updatePheromone(tempCity, nextCity, routedistance, 0);
i++;
}
this->routedistance += this->data->distanceMatrix[nextCity][this->getStartIndex()];
// updatePheromone(-1, -1, -1, 1);
ShortestDistance(this->routedistance);
this->iterationsshortestpath++;
}
void Ant::updatePheromone(int i, int j, double distance, bool reduce)
{
#pragma omp critical(pheromone)
if (reduce == 1) {
for (int x = 0; x < datacitycount; x++) {
for (int y = 0; y < datacitycount; y++) {
if (REDUCE * this->data->pheromoneMatrix[x][y] < 0)
this->data->pheromoneMatrix[x][y] = 0.0;
else
this->data->pheromoneMatrix[x][y] -= REDUCE * this->data->pheromoneMatrix[x][y];
}
}
}
else {
double currentpheromone = this->data->pheromoneMatrix[i][j];
double updatedpheromone = (1 - PHEROMONEREDUCTION)*currentpheromone + (PHEROMONEDEPOSIT / distance);
if (updatedpheromone < 0.0) {
this->data->pheromoneMatrix[i][j] = 0;
this->data->pheromoneMatrix[j][i] = 0;
}
else {
this->data->pheromoneMatrix[i][j] = updatedpheromone;
this->data->pheromoneMatrix[j][i] = updatedpheromone;
}
}
}
So for some reasons the omp parallel for loop wont work on these range-based loops. So this is my second question - if you guys have any suggestions on the code how the get the range-based loops done im happy.
Thanks for your help
So my first question is: Can I generate a large number of threads that work simultaneously (for the number of ants > 500)?
In OpenMP you typically shouldn't care how many threads are active, instead you make sure to expose enough parallel work through work-sharing constructs such as omp for or omp task. So while you may have a loop with 500 iterations, your program could be run with anything between one thread and 500 (or more, but they would just idle). This is a difference to other parallelization approaches such as pthreads where you have to manage all the threads and what they do.
Now your example uses ordered incorrectly. Ordered is only useful if you have a small part of your loop body that needs to be executed in-order. Even then it can be very problematic for performance. Also you need to declare a loop to be ordered if you want to use ordered inside. See also this excellent answer.
You should not use ordered. Instead make sure that the ants know there number beforehand, write the code such that they don't need a number, or at the very least that the order of numbers doesn't matter for ants. In the latter case you can use omp atomic capture.
As to the access to shared data. Try to avoid it as much as possible. Adding omp critical is a first step to get a correct parallel program, but often leads to performance problems. Measure your parallel efficiency, use parallel performance analysis tools to find out if this is the case for you. Then you can use atomic data access or reduction (each threads has their own data they work on and only after the main work is finished, data from all threads is merged).
I'm working on a small Collatz conjecture calculator using C++ and GMP, and I'm trying to implement parallelism on it using OpenMP, but I'm coming across issues regarding thread safety. As it stands, attempting to run the code will yield this:
*** Error in `./collatz': double free or corruption (fasttop): 0x0000000001140c40 ***
*** Error in `./collatz': double free or corruption (fasttop): 0x00007f4d200008c0 ***
[1] 28163 abort (core dumped) ./collatz
This is the code to reproduce the behaviour.
#include <iostream>
#include <gmpxx.h>
mpz_class collatz(mpz_class n) {
if (mpz_odd_p(n.get_mpz_t())) {
n *= 3;
n += 1;
} else {
n /= 2;
}
return n;
}
int main() {
mpz_class x = 1;
#pragma omp parallel
while (true) {
//std::cout << x.get_str(10);
while (true) {
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
x = collatz(x);
}
x++;
//std::cout << " OK" << std::endl;
}
}
Given that I did not get this error when I uncomment the outputs to screen, which are slow, I assume the issue at hand has to do with thread safety, and in particular with concurrent threads trying to increment x at the same time.
Am I correct in my assumptions? How can I fix this and make it safe to run?
I assume what you want to do is to check if the collatz conjecture holds for all numbers. The program you posted is wrong on many levels both serially and in parallel.
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
Means that it will break when x != 1. If you replace it with the correct 0 == mpz_cmp_ui, the code will just continue to test 2 over and over again. You have to have two variables anyway, one for the outer loop that represents what you want to check, and one for the inner loop performing the check. It's easier to get this right if you make a function for that:
void check_collatz(mpz_class n) {
while (n != 1) {
n = collatz(n);
}
}
int main() {
mpz_class x = 1;
while (true) {
std::cout << x.get_str(10);
check_collatz(x);
x++;
}
}
The while (true) loop is bad to reason about and parallelize, so let's just make an equivalent for loop:
for (mpz_class x = 1;; x++) {
check_collatz(x);
}
Now, we can talk about parallelizing the code. The basis for OpenMP parallelizing is a worksharing construct. You cannot just slap #pragma omp parallel on a while loop. Fortunately you can easily mark certain canonical for loops with #pragma omp parallel for. For that, however, you cannot use mpz_class as a loop variable, and you must specify an end for the loop:
#pragma omp parallel for
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
check_collatz(check);
}
Note that check is implicitly private, there is a copy for each thread working on it. Also OpenMP will take care of distributing the work [1 ... 2^63] among threads. When a thread calls check_collatz a new, private, mpz_class object will be created for it.
Now, you might notice, that repeatedly creating a new mpz_class object in each loop iteration is costly (memory allocation). You can reuse that (by breaking check_collatz again) and creating a thread-private mpz_class working object. For this, you split the compound parallel for into separate parallel and for pragmas:
#include <gmpxx.h>
#include <iostream>
#include <limits>
// Avoid copying objects by taking and modifying a reference
void collatz(mpz_class& n)
{
if (mpz_odd_p(n.get_mpz_t()))
{
n *= 3;
n += 1;
}
else
{
n /= 2;
}
}
int main()
{
#pragma omp parallel
{
mpz_class x;
#pragma omp for
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
// Note: The structure of this fits perfectly in a for loop.
for (x = check; x != 1; collatz(x));
}
}
}
Note that declaring x in the parallel region will make sure it is implicitly private and properly initialized. You should prefer that to declaring it outside and marking it private. This will often lead to confusion because explicitly private variables from outside scope are unitialized.
You might complain that this only checks the first 2^63 numbers. Just let it run. This gives you enough time to master OpenMP to expert level and write your own custom worksharing for GMP objects.
You were concerned about having extra objects for each thread. This is essential for good performance. You cannot solve this efficiently with locks/critical sections/atomics. You would have to protect each and every read and write to your only relevant variable. There would be no parallelism left.
Note: The huge for loop will likely have a load imbalance. So some threads will probably finish a few centuries earlier than the others. You could fix that with dynamic scheduling, or smaller static chunks.
Edit: For academic sake, here is one idea how to implement the worksharing directly on GMP objects:
#pragma omp parallel
{
// Note this is not a "parallel" loop
// these are just separate loops on distinct strided
int nthreads = omp_num_threads();
mpz_class check = 1;
// we already checked those in the other program
check += std::numeric_limits<long>::max();
check += omp_get_thread_num();
mpz_class x;
for (; ; check += nthreads)
{
// Note: The structure of this fits perfectly in a for loop.
for (x = check; x != 1; collatz(x));
}
}
You could well be right about collisions with x. You can mark x as private by:
#pragma omp parallel private(x)
This way each thread gets their own "version" of the variable x, which should make this thread-safe. By default, variables declared before a #pragma omp parallel are public, so there is one shared instance between all of the threads.
You might want to touch x only with atomic instructions.
#pragma omp atomic
x++;
This ensures that all threads see the same value of x without requires mutexes or other synchronization techniques.
I had to change and extend my algorithm for some signal analysis (using the polyfilterbank technique) and couldn't use my old OpenMP code, but in the new code the results are not as expected (the results in the beginning positions in the array are somehow incorrect in comparison with a serial run [serial code shows the expected result]).
So in the first loop tFFTin I have some FFT data, which I'm multiplicating with a window function.
The goal is that a thread runs the inner loops for each polyphase factor. To avoid locks I use the reduction pragma (no complex reduction is defined by standard, so I use my one where each thread's omp_priv variable gets initialized with the omp_orig [so with tFFTin]). The reason I'm using the ordered pragma is that the results should be added to the output vector in an ordered way.
typedef std::complex<float> TComplexType;
typedef std::vector<TComplexType> TFFTContainer;
#pragma omp declare reduction(complexMul:TFFTContainer:\
transform(omp_in.begin(), omp_in.end(),\
omp_out.begin(), omp_out.begin(),\
std::multiplies<TComplexType>()))\
initializer (omp_priv(omp_orig))
void ConcreteResynthesis::ApplyPolyphase(TFFTContainer& tFFTin, TFFTContainer& tFFTout, TWindowContainer& tWindow, *someparams*) {;
#pragma omp parallel for shared(tWindow) firstprivate(sFFTParams) reduction(complexMul: tFFTin) ordered if(iFFTRawDataLen>cMinParallelSize)
for (int p = 0; p < uPolyphase; ++p) {
int iPolyphaseOffset = p * uFFTLength;
for (int i = 0; i < uFFTLength; ++i) {
tFFTin[i] *= tWindow[iPolyphaseOffset + i]; ///< get FFT input data from raw data
}
#pragma omp ordered
{
//using the overlap and add method
for (int i = 0; i < sFFTParams.uFFTLength; ++i) {
pDataPool->GetFullSignalData(workSignal)[mSignalPos + iPolyphaseOffset + i] += tFFTin[i];
}
}
}
mSignalPos = mSignalPos + mStep;
}
Is there a race condition or something, which makes wrong outputs at the beginning? Or do I have some logic error?
Another issue is, I don't really like my solution with using the ordered pragma, is there a better approach( i tried to use for this also the reduction-model, but the compiler doesn't allow me to use a pointer type for that)?
I think your problem is that you have implemented a very cool custom reduction for tFFTin. But this reduction is applied at the end of the parallel region.
Which is after you use the data in tFFTin. Another thing is what H. Iliev mentions that the second iteration of the outer loop relies on data which is computed in the previous iteration - a classic dependency.
I think you should try parallelizing the inner loops.
I'm trying to parallelize qsort() function in c++ with openMP, but I have some problems. For of all, for a high number of element to sort (ex: V=1000) qsort doesn't work at all, I never exit this function (infinite loop). Then, I notice that my parallel version of qsort is much slower then the serial one.
Anyone could help me?
Here the code: https://dpaste.de/EaH0.
here the average of time elapsed where kruscalP is the parallel version:
thanks
You have multiple race conditions in your code when you write to A, exch0, and exch1 which are all shared. Not only can this hurt performance it likely will give the wrong result. Unfortunately, you have to use a critical section to fix this which will also likely hurt performance but at least it will give the right result.
#pragma omp for
for (i = 0; i < N-1; i += 2) {
#pragma critical
{
if (A[i].weight > A[i+1].weight) {
temp = A[i];
A[i] = A[i+1];
A[i+1] = temp;
exch0 = 1;
}
}
}
The same applies to the next for loop. If you want to do this efficiently you have to change your algorithm. Maybe consider merge sort.
I'm trying to implement the distance matrix in parallel using openmp in which I calculate the distance between each point and all the other points, so the best algorithm I thought of till now cost O(n^2) and the performance of my algorithm using openmp using 10 thread on 8processor machine isn't better than the serial approach in terms of running time, so I wonder if there is any mistake in my implementation on the openmp approach as this is my first time to use openmp, so please if there is any mistake in my apporach or any better "faster" approach please let me know. The following is my code where "dat" is a vector that contains the data points.
map <int, map< int, double> > dist; //construct the distance matrix
int c=count(dat.at(0).begin(),dat.at(0).end(),delm)+1;
#pragma omp parallel for shared (c,dist)
for(int p=0;p<dat.size();p++)
{
for(int j=p+1;j<dat.size();j++)
{
double ecl=0;
string line1=dat.at(p);
string line2=dat.at(j);
for (int i=0;i<c;i++)
{
double num1=atof(line1.substr(0,line1.find_first_of(delm)).c_str());
line1=line1.substr(line1.find_first_of(delm)+1).c_str();
double num2=atof(line2.substr(0,line2.find_first_of(delm)).c_str());
line2=line2.substr(line2.find_first_of(delm)+1).c_str();
ecl += (num1-num2)*(num1-num2);
}
ecl=sqrt(ecl);
#pragma omp critical
{
dist[p][j]=ecl;
dist[j][p]=ecl;
}
}
}
#pragma omp critical has the effect of serializing your loop so getting rid of that should be your first goal. This should be a step in the right direction:
ptrdiff_t const c = count(dat[0].begin(), dat[0].end(), delm) + 1;
vector<vector<double> > dist(dat.size(), vector<double>(dat.size()));
#pragma omp parallel for
for (size_t p = 0; p != dat.size(); ++p)
{
for (size_t j = p + 1; j != dat.size(); ++j)
{
double ecl = 0.0;
string line1 = dat[p];
string line2 = dat[j];
for (ptrdiff_t i = 0; i != c; ++i)
{
double const num1 = atof(line1.substr(0, line1.find_first_of(delm)).c_str());
double const num2 = atof(line2.substr(0, line2.find_first_of(delm)).c_str());
line1 = line1.substr(line1.find_first_of(delm) + 1);
line2 = line2.substr(line2.find_first_of(delm) + 1);
ecl += (num1 - num2) * (num1 - num2);
}
ecl = sqrt(ecl);
dist[p][j] = ecl;
dist[j][p] = ecl;
}
}
There are a few other obvious things that could be done to make this faster overall, but fixing your parallelization is the most important thing.
As already pointed out, using critical sections will slow things down as only 1 thread is allowed in that section at a time. There is absolutely no need for using critical sections because each thread writes to mutually exclusive sections of data, reading non-modified data obviously doesn't need protection.
My suspicion as to the slowness of the code comes down to uneven work distribution over the threads. By default I think openmp divides the iterations equally among threads. As an example, consider when you have 8 threads and 8 points:
-thread 0 will get 7 distance calculations
-thread 1 will get 6 distance calculations
...
-thread 7 will get 0 distance calculations
Even with more iterations, a similar inequality still exists. If you need to convince yourself, make a thread private counter to track how many distance calculations are actually done by each thread.
With work-sharing constructs like parallel for, you can specify various work distribution strategies. In your case, probably best to go with
#pragma omp for schedule(guided)
When each thread requests some iterations of the for loop, it will get the number of remaining loops (not already given to a thread) divided by the number of threads. So initially you get big blocks, later you get smaller blocks. It's a form of automatic load balancing, mind you there's some (probably small) overhead in dynamically allocating iterations to the threads.
To avoid the first thread getting an unfair large amount of work, your looping structure should be changed so that lower iterations have fewer calculations, e.g. change the inner for loop to
for (j=0; j<p-1; j++)
Another thing to consider is when working with a lot of cores, memory can become the bottleneck. You have 8 processors fighting for probably 2 or maybe 3 channels of DRAM (separate memory sticks on the same channel still compete for bandwidth). On-chip CPU cache is at best shared between all the processors, so you still have no more cache than the serial version of this program.