Thread safety while looping with OpenMP - c++

I'm working on a small Collatz conjecture calculator using C++ and GMP, and I'm trying to implement parallelism on it using OpenMP, but I'm coming across issues regarding thread safety. As it stands, attempting to run the code will yield this:
*** Error in `./collatz': double free or corruption (fasttop): 0x0000000001140c40 ***
*** Error in `./collatz': double free or corruption (fasttop): 0x00007f4d200008c0 ***
[1] 28163 abort (core dumped) ./collatz
This is the code to reproduce the behaviour.
#include <iostream>
#include <gmpxx.h>
mpz_class collatz(mpz_class n) {
if (mpz_odd_p(n.get_mpz_t())) {
n *= 3;
n += 1;
} else {
n /= 2;
}
return n;
}
int main() {
mpz_class x = 1;
#pragma omp parallel
while (true) {
//std::cout << x.get_str(10);
while (true) {
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
x = collatz(x);
}
x++;
//std::cout << " OK" << std::endl;
}
}
Given that I did not get this error when I uncomment the outputs to screen, which are slow, I assume the issue at hand has to do with thread safety, and in particular with concurrent threads trying to increment x at the same time.
Am I correct in my assumptions? How can I fix this and make it safe to run?

I assume what you want to do is to check if the collatz conjecture holds for all numbers. The program you posted is wrong on many levels both serially and in parallel.
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
Means that it will break when x != 1. If you replace it with the correct 0 == mpz_cmp_ui, the code will just continue to test 2 over and over again. You have to have two variables anyway, one for the outer loop that represents what you want to check, and one for the inner loop performing the check. It's easier to get this right if you make a function for that:
void check_collatz(mpz_class n) {
while (n != 1) {
n = collatz(n);
}
}
int main() {
mpz_class x = 1;
while (true) {
std::cout << x.get_str(10);
check_collatz(x);
x++;
}
}
The while (true) loop is bad to reason about and parallelize, so let's just make an equivalent for loop:
for (mpz_class x = 1;; x++) {
check_collatz(x);
}
Now, we can talk about parallelizing the code. The basis for OpenMP parallelizing is a worksharing construct. You cannot just slap #pragma omp parallel on a while loop. Fortunately you can easily mark certain canonical for loops with #pragma omp parallel for. For that, however, you cannot use mpz_class as a loop variable, and you must specify an end for the loop:
#pragma omp parallel for
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
check_collatz(check);
}
Note that check is implicitly private, there is a copy for each thread working on it. Also OpenMP will take care of distributing the work [1 ... 2^63] among threads. When a thread calls check_collatz a new, private, mpz_class object will be created for it.
Now, you might notice, that repeatedly creating a new mpz_class object in each loop iteration is costly (memory allocation). You can reuse that (by breaking check_collatz again) and creating a thread-private mpz_class working object. For this, you split the compound parallel for into separate parallel and for pragmas:
#include <gmpxx.h>
#include <iostream>
#include <limits>
// Avoid copying objects by taking and modifying a reference
void collatz(mpz_class& n)
{
if (mpz_odd_p(n.get_mpz_t()))
{
n *= 3;
n += 1;
}
else
{
n /= 2;
}
}
int main()
{
#pragma omp parallel
{
mpz_class x;
#pragma omp for
for (long check = 1; check <= std::numeric_limits<long>::max(); check++)
{
// Note: The structure of this fits perfectly in a for loop.
for (x = check; x != 1; collatz(x));
}
}
}
Note that declaring x in the parallel region will make sure it is implicitly private and properly initialized. You should prefer that to declaring it outside and marking it private. This will often lead to confusion because explicitly private variables from outside scope are unitialized.
You might complain that this only checks the first 2^63 numbers. Just let it run. This gives you enough time to master OpenMP to expert level and write your own custom worksharing for GMP objects.
You were concerned about having extra objects for each thread. This is essential for good performance. You cannot solve this efficiently with locks/critical sections/atomics. You would have to protect each and every read and write to your only relevant variable. There would be no parallelism left.
Note: The huge for loop will likely have a load imbalance. So some threads will probably finish a few centuries earlier than the others. You could fix that with dynamic scheduling, or smaller static chunks.
Edit: For academic sake, here is one idea how to implement the worksharing directly on GMP objects:
#pragma omp parallel
{
// Note this is not a "parallel" loop
// these are just separate loops on distinct strided
int nthreads = omp_num_threads();
mpz_class check = 1;
// we already checked those in the other program
check += std::numeric_limits<long>::max();
check += omp_get_thread_num();
mpz_class x;
for (; ; check += nthreads)
{
// Note: The structure of this fits perfectly in a for loop.
for (x = check; x != 1; collatz(x));
}
}

You could well be right about collisions with x. You can mark x as private by:
#pragma omp parallel private(x)
This way each thread gets their own "version" of the variable x, which should make this thread-safe. By default, variables declared before a #pragma omp parallel are public, so there is one shared instance between all of the threads.

You might want to touch x only with atomic instructions.
#pragma omp atomic
x++;
This ensures that all threads see the same value of x without requires mutexes or other synchronization techniques.

Related

How to stop parallel region of OpenMP 2.0

I look for a better way to cancel my threads.
In my approach, I use a shared variable and if this variable is set, I just throw a continue. This finishes my threads fast, but threads keep theoretically spawning and ending, which seems not elegant.
So, is there a better way to solve the issue (break is not supported by my OpenMP)?
I have to work with Visual, so my OpenMP Lib is outdated and there is no way around that. Consequently, I think #omp cancel will not work
int progress_state = RunExport;
#pragma omp parallel
{
#pragma omp for
for (int k = 0; k < foo.z; k++)
for (int j = 0; j < foo.y; j++)
for (int i = 0; i < foo.x; i++) {
if (progress_state == StopExport) {
continue;
}
// do some fancy shit
// yeah here is a condition for speed due to the critical
#pragma omp critical
if (condition) {
progress_state = StopExport;
}
}
}
You should do it the simple way of "just continue in all remaining iterations if cancellation is requested". That can just be the first check in the outermost loop (and given that you have several nested loops, that will probably not have any measurable overhead).
std::atomic<int> progress_state = RunExport;
// You could just write #pragma omp parallel for instead of these two nested blocks.
#pragma omp parallel
{
#pragma omp for
for (int k = 0; k < foo.z; k++)
{
if (progress_state == StopExport)
continue;
for (int j = 0; j < foo.y; j++)
{
// You can add break statements in these inner loops.
// OMP only parallelizes the outermost loop (at least given the way you wrote this)
// so it won't care here.
for (int i = 0; i < foo.x; i++)
{
// ...
if (condition) {
progress_state = StopExport;
}
}
}
}
}
Generally speaking, OMP will not suddenly spawn new threads or end existing ones, especially not within one parallel region. This means there is little overhead associated with running a few more tiny iterations. This is even more true given that the default scheduling in your case is most likely static, meaning that each thread knows its start and end index right away. Other scheduling modes would have to call into the OMP runtime every iteration (or every few iterations) to request more work, but that won't happen here. The compiler will basically see this code for the threaded work:
// Not real omp functions.
int myStart = __omp_static_for_my_start();
int myEnd = __omp_static_for_my_end();
for (int k = myStart; k < myEnd; ++k)
{
if (progress_state == StopExport)
continue;
// etc.
}
You might try a non-atomic thread-local "should I cancel?" flag that starts as false and can only be changed to true (which the compiler may understand and fold into the loop condition). But I doubt you will see significant overhead either way, at least on x86 where int is atomic anyway.
which seems not elegant
OMP 2.0 does not exactly shine with respect to elegance. I mean, iterating over a std::vector requires at least one static_cast to silence signed -> unsigned conversion warnings. So unless you have specific evidence of this pattern causing a performance problem, there is little reason not to use it.

Parallelize Algorithm with OpenMP in C++

my problem is this:
I want to solve TSP with the Ant Colony Optimization Algorithm in C++.
Right now Ive implemented a algorithm that solve this problem iterative.
For example: I generate 500 ants - and they find their route one after the other.
Each ant starts not until the previous ant finished.
Now I want to parallelize the whole thing - and I thought about using OpenMP.
So my first question is: Can I generate a large number of threads that work
simultaneously (for the number of ants > 500)?
I already tried something out. So this is my code from my main.cpp:
#pragma omp parallel for
for (auto ant = antarmy.begin(); ant != antarmy.end(); ++ant) {
#pragma omp ordered
if (ant->getIterations() < ITERATIONSMAX) {
ant->setNumber(currentAntNumber);
currentAntNumber++;
ant->antRoute();
}
}
And this is the code in my Ant class that is "critical" because each Ant reads and writes into the same Matrix (pheromone-Matrix):
void Ant::antRoute()
{
this->route.setCity(0, this->getStartIndex());
int nextCity = this->getNextCity(this->getStartIndex());
this->routedistance += this->data->distanceMatrix[this->getStartIndex()][nextCity];
int tempCity;
int i = 2;
this->setProbability(nextCity);
this->setVisited(nextCity);
this->route.setCity(1, nextCity);
updatePheromone(this->getStartIndex(), nextCity, routedistance, 0);
while (this->getVisitedCount() < datacitycount) {
tempCity = nextCity;
nextCity = this->getNextCity(nextCity);
this->setProbability(nextCity);
this->setVisited(nextCity);
this->route.setCity(i, nextCity);
this->routedistance += this->data->distanceMatrix[tempCity][nextCity];
updatePheromone(tempCity, nextCity, routedistance, 0);
i++;
}
this->routedistance += this->data->distanceMatrix[nextCity][this->getStartIndex()];
// updatePheromone(-1, -1, -1, 1);
ShortestDistance(this->routedistance);
this->iterationsshortestpath++;
}
void Ant::updatePheromone(int i, int j, double distance, bool reduce)
{
#pragma omp critical(pheromone)
if (reduce == 1) {
for (int x = 0; x < datacitycount; x++) {
for (int y = 0; y < datacitycount; y++) {
if (REDUCE * this->data->pheromoneMatrix[x][y] < 0)
this->data->pheromoneMatrix[x][y] = 0.0;
else
this->data->pheromoneMatrix[x][y] -= REDUCE * this->data->pheromoneMatrix[x][y];
}
}
}
else {
double currentpheromone = this->data->pheromoneMatrix[i][j];
double updatedpheromone = (1 - PHEROMONEREDUCTION)*currentpheromone + (PHEROMONEDEPOSIT / distance);
if (updatedpheromone < 0.0) {
this->data->pheromoneMatrix[i][j] = 0;
this->data->pheromoneMatrix[j][i] = 0;
}
else {
this->data->pheromoneMatrix[i][j] = updatedpheromone;
this->data->pheromoneMatrix[j][i] = updatedpheromone;
}
}
}
So for some reasons the omp parallel for loop wont work on these range-based loops. So this is my second question - if you guys have any suggestions on the code how the get the range-based loops done im happy.
Thanks for your help
So my first question is: Can I generate a large number of threads that work simultaneously (for the number of ants > 500)?
In OpenMP you typically shouldn't care how many threads are active, instead you make sure to expose enough parallel work through work-sharing constructs such as omp for or omp task. So while you may have a loop with 500 iterations, your program could be run with anything between one thread and 500 (or more, but they would just idle). This is a difference to other parallelization approaches such as pthreads where you have to manage all the threads and what they do.
Now your example uses ordered incorrectly. Ordered is only useful if you have a small part of your loop body that needs to be executed in-order. Even then it can be very problematic for performance. Also you need to declare a loop to be ordered if you want to use ordered inside. See also this excellent answer.
You should not use ordered. Instead make sure that the ants know there number beforehand, write the code such that they don't need a number, or at the very least that the order of numbers doesn't matter for ants. In the latter case you can use omp atomic capture.
As to the access to shared data. Try to avoid it as much as possible. Adding omp critical is a first step to get a correct parallel program, but often leads to performance problems. Measure your parallel efficiency, use parallel performance analysis tools to find out if this is the case for you. Then you can use atomic data access or reduction (each threads has their own data they work on and only after the main work is finished, data from all threads is merged).

Best way to parallelize this recursion using OpenMP

I have following recursive function (NOTE: It is stripped of all unimportant details)
int recursion(...) {
int minimum = INFINITY;
for(int i=0; i<C; i++) {
int foo = recursion(...);
if (foo < minimum) {
minimum = foo;
}
}
return minimum;
}
Note 2: It is finite, but not in this simplified example, so please ignore it. Point of this question is how to aproach this problem correctly.
I was thinking about using tasks, but I am not sure, how to use it correctly - how to paralelize the inner cycle.
EDIT 1: The recursion tree isn't well balanced. It is being used with dynamic programing approach, so as time goes on, a lot of values are re-used from previous passes. This worries me a lot and I think it will be a big bottleneck.
C is somewhere around 20.
Metric for the best is fastest :)
It will run on 2x Xeon, so there is plenty of HW power availible.
Yes, you can use OpenMP tasks exploit parallelism on multiple recursion levels and ensure that imbalances don't cause wasted cycles.
I would collect the results in a vector and compute the minimum outside. You could also perform a guarded (critical / lock) minimum computation within the task.
Avoid spawning tasks / allocating memory for the minimum if you are too deep in the recursion, where the overhead / work ratio becomes too bad. The strongest solution it to create two separate (parallel/serial) recursive functions. That way you have zero runtime overhead once you switch to the serial function - as opposed to checking the recursion depth against a threshold every time in a unified function.
int recursion(...) {
#pragma omp parallel
#pragma omp single
return recursion_par(..., 0);
}
int recursion_ser(...) {
int minimum = INFINITY;
for(int i=0; i<C; i++) {
int foo = recursion_ser(...);
if (foo < minimum) {
minimum = foo;
}
}
return minimum;
}
int recursion_par(..., int depth) {
std::vector<int> foos(C);
for(int i=0; i<C; i++) {
#pragma omp task
{
if (depth < threshhold) {
foos[i] = recursion_par(..., depth + 1);
} else {
foos[i] = recursion_ser(...);
}
}
}
#pragma omp taskwait
return *std::min_element(std::begin(foos), std::end(foos));
}
Obviously you must not do any nasty things with global / shared state within the unimportant details.

OpenMP parallelization

I'm writing a C++ program with scientific purposes. The program works well and it returns good results, so I decided to improve its perfomance using OpenMP. The loop I want to optimize is the following one:
//== #pragma omp parallel for private(i,j)
for (k=0; k < number; k++)
{
for (i=0; i < L; i++)
{
for (j=0; j < L; j++)
{
red[i][j] = UNDEFINED;
}
}
Point inicial = {L/2, L/2, OCCUPIED};
red[L/2][L/2] = OCCUPIED;
addToList(inicial, red, list, L,f);
oc.push_back(inicial);
while (list.size() > 0 && L > 0)
{
punto = selectPoint(red, list, generator, prob, p);
if (punto.state == OCCUPIED)
{
addToList(punto, red, list, L,f);
oc.push_back(punto);
}
else
{
out.push_back(punto);
}
}
L = auxL;
oc.clear();
out.clear();
list.clear();
}
f = f*1.0/(number*1.0);
if (f > 0.5)
{
inta = inta;
intb = p;
p = (inta + intb) / 2.0;
}
else if (f < 0.5)
{
intb = intb;
inta = p;
p = (inta + intb) / 2.0;
}
cout << p << endl;
}
My try with OpenMP is commented above. As you can see I've declared i and j as private because they're declared before the parallel section. I've also tried to make L private, with no results. Only segmentation faults and bad pointers everywhere.
I think the problem is that while loop nested inside. My questions are: Is the omp parallel for correct in this case? or should I try to optimize only that while loop? Are the std::vector interfering with OpenMP?
NOTE: list, oc and out are std::vector<Point>, and Point is a simple struct with three int properties. addToList is a function with no loops inside.
You might want to go over an OpenMP tutorial. When you look at OpenMP code, you need to imagine what can happen in parallel. Take
oc.push_back(inicial);
Can two threads try to do this at the same time? Yes. Does std::vector support parallelism? No.
The code above is full of these things.
If you want to use data-structures within your OpenMP ode, you need to use locks. From my personal experience, when this happens, it is far better to refactor the algorithm than actually use them. While OpenMP + locks is possible, it is usually an indication that there's a problem with the idea (= a possibly subjective view).
The current answer points out the concurrency in the code, but please note that not all data-structures have to be implemented with locks to attain thread-safety. There are also lock-free data structures. For this particular case, we could the Harris lock free linked list: https://timharris.uk/papers/2001-disc.pdf
While I know that pointing out concurrency issues to the OP is of great assistance at this point, I want to make sure we don't convey a wrong message by saying that locks are absolutely necessary to attain thread safety.
The directive #pragma omp parallel defines a piece of code that can be executed simultaneously by various threads. In your case, as you have not specified any further directive, your parallel region will be executed once by every thread. In order to achieve a parallel behavior you could try to break the loop into smaller tasks(the taskloop directive will do the job). Those tasks will remain in a task pool until a thread starts executing them. This way your loop will be fragmented and executed by your threads instead of making each thread execute the whole loop.
https://www.openmp.org/spec-html/5.0/openmpsu47.html here's the official openMP documentation for the taskloop directive.

C OpenMP parallel quickSort

Once again I'm stuck when using openMP in C++. This time I'm trying to implement a parallel quicksort.
Code:
#include <iostream>
#include <vector>
#include <stack>
#include <utility>
#include <omp.h>
#include <stdio.h>
#define SWITCH_LIMIT 1000
using namespace std;
template <typename T>
void insertionSort(std::vector<T> &v, int q, int r)
{
int key, i;
for(int j = q + 1; j <= r; ++j)
{
key = v[j];
i = j - 1;
while( i >= q && v[i] > key )
{
v[i+1] = v[i];
--i;
}
v[i+1] = key;
}
}
stack<pair<int,int> > s;
template <typename T>
void qs(vector<T> &v, int q, int r)
{
T pivot;
int i = q - 1, j = r;
//switch to insertion sort for small data
if(r - q < SWITCH_LIMIT)
{
insertionSort(v, q, r);
return;
}
pivot = v[r];
while(true)
{
while(v[++i] < pivot);
while(v[--j] > pivot);
if(i >= j) break;
std::swap(v[i], v[j]);
}
std::swap(v[i], v[r]);
#pragma omp critical
{
s.push(make_pair(q, i - 1));
s.push(make_pair(i + 1, r));
}
}
int main()
{
int n, x;
int numThreads = 4, numBusyThreads = 0;
bool *idle = new bool[numThreads];
for(int i = 0; i < numThreads; ++i)
idle[i] = true;
pair<int, int> p;
vector<int> v;
cin >> n;
for(int i = 0; i < n; ++i)
{
cin >> x;
v.push_back(x);
}
cout << v.size() << endl;
s.push(make_pair(0, v.size()));
#pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads, p)
{
bool done = false;
while(!done)
{
int id = omp_get_thread_num();
#pragma omp critical
{
if(s.empty() == false && numBusyThreads < numThreads)
{
++numBusyThreads;
//the current thread is not idle anymore
//it will get the interval [q, r] from stack
//and run qs on it
idle[id] = false;
p = s.top();
s.pop();
}
if(numBusyThreads == 0)
{
done = true;
}
}
if(idle[id] == false)
{
qs(v, p.first, p.second);
idle[id] = true;
#pragma omp critical
--numBusyThreads;
}
}
}
return 0;
}
Algorithm:
To use openMP for a recursive function I used a stack to keep track of the next intervals on which the qs function should run. I manually add the 1st interval [0, size] and then let the threads get to work when a new interval is added in the stack.
The problem:
The program ends too early, not sorting the array after creating the 1st set of intervals ([q, i - 1], [i+1, r] if you look on the code. My guess is that the threads which get the work, considers the local variables of the quicksort function(qs in the code) shared by default, so they mess them up and add no interval in the stack.
How I compile:
g++ -o qs qs.cc -Wall -fopenmp
How I run:
./qs < in_100000 > out_100000
where in_100000 is a file containing 100000 on the 1st line followed by 100k intergers on the next line separated by spaces.
I am using gcc 4.5.2 on linux
Thank you for your help,
Dan
I didn't actually run your code, but I see an immediate mistake on p, which should be private not shared. The parallel invocation of qs: qs(v, p.first, p.second); will have races on p, resulting in unpredictable behavior. The local variables at qs should be okay because all threads have their own stack. However, the overall approach is good. You're on the right track.
Here are my general comments for the implementation of parallel quicksort. Quicksort itself is embarrassingly parallel, which means no synchronization is needed. The recursive calls of qs on a partitioned array is embarrassingly parallel.
However, the parallelism is exposed in a recursive form. If you simply use the nested parallelism in OpenMP, you will end up having thousand threads in a second. No speedup will be gained. So, mostly you need to turn the recursive algorithm into an interative one. Then, you need to implement a sort of work-queue. This is your approach. And, it's not easy.
For your approach, there is a good benchmark: OmpSCR. You can download at http://sourceforge.net/projects/ompscr/
In the benchmark, there are several versions of OpenMP-based quicksort. Most of them are similar to yours. However, to increase parallelism, one must minimize the contention on a global queue (in your code, it's s). So, there could be a couple of optimizations such as having local queues. Although the algorithm itself is purely parallel, the implementation may require synchronization artifacts. And, most of all, it's very hard to gain speedups.
However, you still directly use recursive parallelism in OpenMP in two ways: (1) Throttling the total number of the threads, and (2) Using OpenMP 3.0's task.
Here is pseudo code for the first approach (This is only based on OmpSCR's benchmark):
void qsort_omp_recursive(int* begin, int* end)
{
if (begin != end) {
// Partition ...
// Throttling
if (...) {
qsort_omp_recursive(begin, middle);
qsort_omp_recursive(++middle, ++end);
} else {
#pragma omp parallel sections nowait
{
#pragma omp section
qsort_omp_recursive(begin, middle);
#pragma omp section
qsort_omp_recursive(++middle, ++end);
}
}
}
}
In order to run this code, you need to call omp_set_nested(1) and omp_set_num_threads(2). The code is really simple. We simply spawn two threads on the division of the work. However, we insert a simple throttling logic to prevent excessive threads. Note that my experimentation showed decent speedups for this approach.
Finally, you may use OpenMP 3.0's task, where a task is a logically concurrent work. In the above all OpenMP's approaches, each parallel construct spawns two physical threads. You may say there is a hard 1-to-1 mapping between a task to a work thread. However, task separates logical tasks and workers.
Because OpenMP 3.0 is not popular yet, I will use Cilk Plus, which is great to express this kind of nested and recursive parallelisms. In Cilk Plus, the parallelization is extremely easy:
void qsort(int* begin, int* end)
{
if (begin != end) {
--end;
int* middle = std::partition(begin, end,
std::bind2nd(std::less<int>(), *end));
std::swap(*end, *middle);
cilk_spawn qsort(begin, middle);
qsort(++middle, ++end);
// cilk_sync; Only necessay at the final stage.
}
}
I copied this code from Cilk Plus' example code. You will see a single keyword cilk_spawn is everything to parallelize quicksort. I'm skipping the explanations of Cilk Plus and spawn keyword. However, it's easy to understand: the two recursive calls are declared as logically concurrent tasks. Whenever the recursion takes place, the logical tasks are created. But, the Cilk Plus runtime (which implements an efficient work-stealing scheduler) will handle all kinds of dirty job. It optimally queues the parallel tasks and maps to the work threads.
Note that OpenMP 3.0's task is essentially similar to the Cilk Plus's approach. My experimentation shows that pretty nice speedups were feasible. I got a 3~4x speedup on a 8-core machine. And, the speedup was scale. Cilk Plus' absolute speedups are greater than those of OpenMP 3.0's.
The approach of Cilk Plus (and OpenMP 3.0) and your approach are essentially the same: the separation of parallel task and workload assignment. However, it's very difficult to implement efficiently. For example, you must reduce the contention and use lock-free data structures.