I have parallel code that does some computation and then adds a double to an accumulator variable declared outside the loop. I tried using std::atomic, but it has no support for arithmetic operations on std::atomic<double> variables.
double dResCross = 0.0;
std::atomic<double> dResCrossAT{0.0};
Concurrency::parallel_for(0, iExperimentalVectorLength, [&](size_t m)
{
double value;
//some computation of the double value
atomic_fetch_add(&dResCrossAT, value);
});
dResCross += dResCrossAT;
Simply writing
dResCross += value;
obviously outputs nonsense. My question is: how can I solve this problem without making the code serial?
A typical way to atomically perform arithmetic operations on a floating-point type is with a compare-and-swap (CAS) loop.
double value;
//some computation of the double value
double expected = dResCrossAT.load();
while (!dResCrossAT.compare_exchange_weak(expected, expected + value))
    ; // on failure, `expected` is reloaded with the current value
A detailed explanation can be found in Jeff Preshing's article about this class of operation.
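Since C++20, std::atomic<double> supports fetch_add directly, so the CAS loop is only needed on earlier standards. For reference, here is the loop as a self-contained sketch:
#include <atomic>

// CAS-loop add for pre-C++20 code; from C++20 on you can simply
// call target.fetch_add(value).
void atomic_add(std::atomic<double>& target, double value)
{
    double expected = target.load();
    while (!target.compare_exchange_weak(expected, expected + value))
        ; // retry: expected now holds the value another thread wrote
}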
I believe preventing partial (torn) writes to a non-atomic variable requires a mutex. I am not certain that is the only way to avoid write conflicts, but it can be done like this:
#include <mutex>
#include <thread>

std::mutex mtx;

void threadFunction(double* d) {
    while (true) {
        std::lock_guard<std::mutex> lock(mtx); // released at the end of each iteration
        if (*d >= 100.0) break;
        *d += 1.0;
    }
}

int main() {
    double d = 0.0;
    std::thread thread(threadFunction, &d);
    // join() waits for the worker; polling *d from here without
    // holding the mutex would itself be a data race.
    thread.join();
}
This adds 1.0 to d 100 times in a thread-safe way; the mutex ensures that only one thread accesses d at a given time. However, it is significantly slower than an atomic equivalent, because locking and unlocking are expensive. The exact cost varies by operating system, processor, and contention: an uncontended lock/unlock pair is on the order of tens of clock cycles, while a contended lock that requires a system call can cost a couple of thousand.
Moral: use with caution.
If your vector has many elements per thread, you should consider implementing a reduction rather than using an atomic operation for every element. Atomic operations are much more expensive than normal stores.
double global_value{0.0};
std::vector<double> private_values(num_threads, 0.0);

// Phase 1: each thread accumulates into its own slot; no sharing, no atomics.
parallel_for(size_t k=0; k<n; ++k) {
    private_values[my_thread] += ...;
}

// Phase 2: after a barrier (so all partial sums are complete),
// one thread combines the partial results.
if (my_thread == 0) {
    for (int t = 0; t < num_threads; ++t) {
        global_value += private_values[t];
    }
}
This algorithm requires no atomic operations and will be faster in many cases (note that adjacent elements of private_values share cache lines, so padding each slot can help). You can replace the second phase with a tree reduction or atomics if the thread count is very high (e.g. on a GPU).
Concurrency libraries like TBB and Kokkos both provide parallel reduce templates that do the right thing internally.
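For example, here is a minimal sketch of a sum reduction with tbb::parallel_reduce (the vector data and the plain summation body are assumptions for illustration):
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <functional>
#include <vector>

double parallel_sum(const std::vector<double>& data)
{
    return tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, data.size()),
        0.0,                                  // identity of the reduction
        [&](const tbb::blocked_range<size_t>& r, double local) {
            // each subrange accumulates into a private running sum
            for (size_t i = r.begin(); i != r.end(); ++i)
                local += data[i];
            return local;
        },
        std::plus<double>());                 // combines partial sums
}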
Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but the tasks take different amounts of time, so it is not easy to divide the work evenly over the threads. Also, I need the results of all simulations in a single vector (or array) at the end.
So I came up with the approach below:
int Max{1000000};
//SimResult is some struct with well-defined default value.
std::vector<SimResult> vec(/*length*/Max);//Initialize with default values of SimResult
int LastAdded{0};
void fill(int RandSeed)
{
Simulator sim{RandSeed};
while(LastAdded < Max)
{
// Do some work to bring foo to the desired state
//The duration of this work is subject to randomness
vec[LastAdded++]
= sim.GetResult();//Produces SimResult.
}
}
int main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1);
auto fut2 = std::async(fill,2);
//maybe some more tasks.
fut1.get();
fut2.get();
//do something with the results in vec.
}
The above code will give race conditions, I guess. I am looking for a performant approach to avoid that. Requirements: avoid race conditions (fill the entire array, no skips); the final result is immediately in the array; performant.
Reading about various approaches, atomic seems a good candidate, but I am not sure which settings will be most performant in my case. And I am not even sure whether atomic will cut it; maybe a mutex guarding LastAdded is needed?
One thing I would say is that you need to be very careful with the standard library random number functions. If your Simulator class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you will likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.
The best practice in this area is to create N Simulator objects with the same properties and give each one a different random seed. Then you can farm these objects out to multiple threads using OpenMP, a common parallel programming model in scientific software development.
std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
std::vector<SimResult> results(N_runs);
#pragma omp parallel for
for (long long i = 0; i < static_cast<long long>(N_runs); i++)
{
auto sim = Simulator(seed + i);
results[i] = sim.GetResult();
}
    return results;
}
Edit: With OpenMP you can choose different scheduling models, which allow you, for example, to split work between threads dynamically. You can do this with:
#pragma omp parallel for schedule(dynamic, 16)
which gives each thread chunks of 16 items to work on at a time.
Since you already know how many elements you are going to work with and never change the size of the vector, the easiest solution is to let each thread work on its own part of the vector; see the Simple Version below.
Update
To accommodate vastly varying calculation times, you should keep your current code but avoid race conditions via a std::lock_guard. You will need a std::mutex that is shared by all threads, for example a global variable, or else pass a reference to the mutex to each thread.
void fill(int RandSeed, std::mutex &nextItemMutex)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
// enter critical area
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
// Acquire next item
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded++;
}
else
{
break;
}
// lock is released when nextItemLock goes out of scope
}
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[workingIndex] = sim.GetResult();//Produces SimResult.
}
}
The problem with this is that synchronisation is quite expensive. But it is probably cheap compared to the simulation you run, so it shouldn't be too bad.
Version 2:
To reduce the amount of synchronisation that is required, you could acquire blocks to work on, instead of single items:
void fill(int RandSeed, std::mutex &nextItemMutex, size_t blockSize)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded += blockSize;
}
else
{
break;
}
}
for(size_t i = workingIndex; i < workingIndex + blockSize && i < Max; i++)
vec[i] = sim.GetResult();//Produces SimResult.
}
}
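For completeness, and to answer the atomic part of the question: the same block acquisition can also be done lock-free, with a std::atomic counter in place of LastAdded and the mutex. A sketch under the same assumptions (globals vec, Max, and Simulator as above):
#include <atomic>

std::atomic<size_t> NextIndex{0};

void fill(int RandSeed, size_t blockSize)
{
    Simulator sim{RandSeed};
    while (true)
    {
        // fetch_add hands each thread a disjoint block; no mutex needed
        size_t workingIndex = NextIndex.fetch_add(blockSize);
        if (workingIndex >= static_cast<size_t>(Max))
            break;
        for (size_t i = workingIndex; i < workingIndex + blockSize && i < static_cast<size_t>(Max); i++)
            vec[i] = sim.GetResult();
    }
}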
Simple Version
void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
Simulator sim{RandSeed};
for(size_t i = partitionStart; i < partitionEnd; i++)
{
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[i] = sim.GetResult();//Produces SimResult.
}
}
int main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1, 0, Max / 2);
auto fut2 = std::async(fill,2, Max / 2, Max);
// ...
}
If I have an array that I want to be updated by multiple threads simultaneously, what's the best/fastest way to go about doing that? For example, say I have the following code:
std::vector<float> vec;
vec.push_back(0.f);
for(int i = 0; i < 10000; i++) {
std::thread([&]{
// SAFETY CONSTRUCTS GO HERE
vec[0] += 1; // OR MAYBE HERE
// AND HERE?
});
}
// wait a little while, i.e. I was too lazy to write out joins
std::cout << vec[0];
If I want this to be safe and finally print the value 10000, what would be the best/fastest way to do this?
In the example you've given, the best/safest way would be to not launch threads at all and simply update vec[0] in the loop. The overhead of launching and synchronising threads will probably exceed any benefit you get by doing some operations in parallel.
vec is a non-atomic object (std::vector<float>) and vec[0] is actually a function call. Such objects, and their non-static member functions, cannot protect themselves from concurrent access by multiple threads. To use them from multiple threads, every direct use of vec (and vec[0]) must be synchronised.
Generally, safety involving concurrently executing threads is achieved by synchronising access to any variables (or, more generally, memory) that are updated and accessed by multiple threads.
If using a mutex, that normally means every thread that accesses the shared data must first grab the mutex, do the operation on the shared variables (e.g. update vec[0]), and then release the mutex. A thread that has not grabbed (or has grabbed and then released) the mutex must not touch the shared variables.
If you want performance from threading, a significant amount of the work in each thread must proceed without ANY access to shared variables. That work, since parts of it can execute concurrently, can potentially complete in less total elapsed time. For a net performance benefit, the gains (e.g. doing many operations concurrently) need to exceed the costs (launching threads, synchronising access to shared data). That is highly unlikely in anything resembling the code you have shown.
The point is that there is always a trade-off between speed and safety, when threads share any data. Safety requires updating of shared variables to be synchronised - without exception. A performance gain is generally derived from the things that do not need to be synchronised (i.e. that don't access variables shared between threads) and can be executed in parallel.
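To make the trade-off concrete, here is a minimal sketch of what synchronising the question's loop would look like. It is safe and finally prints 10000, but launching 10,000 threads to increment one float is almost certainly slower than the serial loop, which reinforces the point above:
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main()
{
    std::vector<float> vec;
    vec.push_back(0.f);
    std::mutex mtx;
    std::vector<std::thread> threads;
    for (int i = 0; i < 10000; i++) {
        threads.emplace_back([&] {
            std::lock_guard<std::mutex> lock(mtx); // one writer at a time
            vec[0] += 1;
        });
    }
    for (auto& t : threads)
        t.join();
    std::cout << vec[0] << '\n'; // prints 10000
}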
There's no single magic technique to have highly performant parallel access to shared data, but there are a few general techniques you'll see fairly often.
I'll use the example of summing an array in parallel for my answer, but these techniques apply pretty generally to many parallel algorithms.
1) Avoid sharing data in the first place
This is likely to be the safest and fastest method. Instead of having your worker threads directly update the shared state, have each of them work with their own local state, and then have your main thread combine the results. For the array sum example, this could look something like this:
int main() {
std::vector<int> toSum = getSomeVector();
std::vector<int> sums(NUM_THREADS);
std::vector<std::thread> threads;
int chunkSize = std::ceil(toSum.size() / (float)NUM_THREADS);
for (int i = 0; i < NUM_THREADS; ++i) {
        // clamp both iterators so the last chunk cannot run past the end
        auto chunkBegin = toSum.begin() + std::min<size_t>(i * chunkSize, toSum.size());
        auto chunkEnd = toSum.begin() + std::min<size_t>((i + 1) * chunkSize, toSum.size());
threads.emplace_back([chunkBegin, chunkEnd](int& result) mutable {
for (; chunkBegin != chunkEnd; ++chunkBegin) {
result += *chunkBegin;
}
}, std::ref(sums[i]));
}
for (std::thread& thd : threads) {
thd.join();
}
int finalSum = 0;
for (int partialSum : sums) {
finalSum += partialSum;
}
std::cout << finalSum << '\n';
}
Since each thread only ever operates on its own partial sum, the threads cannot interfere with each other, and no extra synchronization is needed. You have to do a little bit of extra work at the end to add up all the partial sums, but the number of partial results is small, so this overhead should be minimal.
2) Mutual exclusion
Instead of having each thread operate on its own state, you can protect shared state with a locking mechanism. Fairly often this is a mutex, but there are lots of locking primitives with slightly different roles. The point is to make sure only one thread is ever working with the shared state at a time. Be very careful when using this technique to avoid accessing the shared state within a tight loop: since only one thread can hold the lock at a time, it's very easy to accidentally turn your fancy parallel code back into single-threaded code by making it so that only one thread can ever be working at a time.
For example, consider the following:
int main() {
std::vector<int> toSum = getSomeVector();
int sum = 0;
std::vector<std::thread> threads;
int chunkSize = std::ceil(toSum.size() / (float)NUM_THREADS);
std::mutex mtx;
for (int i = 0; i < NUM_THREADS; ++i) {
        // clamp both iterators so the last chunk cannot run past the end
        auto chunkBegin = toSum.begin() + std::min<size_t>(i * chunkSize, toSum.size());
        auto chunkEnd = toSum.begin() + std::min<size_t>((i + 1) * chunkSize, toSum.size());
threads.emplace_back([chunkBegin, chunkEnd, &mtx, &sum]() mutable {
for (; chunkBegin != chunkEnd; ++chunkBegin) {
std::lock_guard guard(mtx);
sum += *chunkBegin;
}
});
}
for (std::thread& thd : threads) {
thd.join();
}
std::cout << sum << '\n';
}
Since each thread locks mtx within its loop, only one thread can ever be doing any work at a time. There is no parallelization here, and this code is likely to be slower than the equivalent single-threaded code due to the extra overhead of allocating threads and locking and unlocking the mutex.
Instead, try to do as much as possible independently, and access shared state as infrequently as possible. For this example, you can do something similar to the example in (1) and build up partial sums within each thread, adding them to the shared sum only once at the end:
int main() {
std::vector<int> toSum = getSomeVector();
int sum = 0;
std::vector<std::thread> threads;
int chunkSize = std::ceil(toSum.size() / (float)NUM_THREADS);
std::mutex mtx;
for (int i = 0; i < NUM_THREADS; ++i) {
        // clamp both iterators so the last chunk cannot run past the end
        auto chunkBegin = toSum.begin() + std::min<size_t>(i * chunkSize, toSum.size());
        auto chunkEnd = toSum.begin() + std::min<size_t>((i + 1) * chunkSize, toSum.size());
threads.emplace_back([chunkBegin, chunkEnd, &mtx, &sum]() mutable {
int partialSum = 0;
for (; chunkBegin != chunkEnd; ++chunkBegin) {
partialSum += *chunkBegin;
}
{
std::lock_guard guard(mtx);
sum += partialSum;
}
});
}
for (std::thread& thd : threads) {
thd.join();
}
std::cout << sum << '\n';
}
3) Atomic variables
Atomic variables are variables that can be "safely" shared between threads. They are very powerful, but also very easy to get wrong. You have to worry about things like memory-ordering constraints, and when you get them wrong it can be very difficult to debug and figure out what you did wrong.
At their core, atomic variables could be implemented as a simple variable whose operations are guarded by a mutex or similar. The magic all lies in the implementation, which often uses special CPU instructions to coordinate access to the variables at the CPU level to avoid a lot of the overhead of locking and unlocking.
Atomics aren't a magic bullet, though. There is still overhead involved, and you can still shoot yourself in the foot by accessing your atomics too frequently. Your CPU does a lot of caching, and having multiple threads write to the same atomic variable forces the cache line holding it to bounce between cores. Once again, if you can avoid accessing your shared state within tight loops in your threads, you should do so:
int main() {
std::vector<int> toSum = getSomeVector();
std::atomic<int> sum(0);
std::vector<std::thread> threads;
int chunkSize = std::ceil(toSum.size() / (float)NUM_THREADS);
for (int i = 0; i < NUM_THREADS; ++i) {
        // clamp both iterators so the last chunk cannot run past the end
        auto chunkBegin = toSum.begin() + std::min<size_t>(i * chunkSize, toSum.size());
        auto chunkEnd = toSum.begin() + std::min<size_t>((i + 1) * chunkSize, toSum.size());
threads.emplace_back([chunkBegin, chunkEnd, &sum]() mutable {
int partialSum = 0;
for (; chunkBegin != chunkEnd; ++chunkBegin) {
partialSum += *chunkBegin;
}
// Since we don't care about the order that the threads update the sum,
// we can use memory_order_relaxed. This is a rabbit-hole I won't get
// too deep into here though.
sum.fetch_add(partialSum, std::memory_order_relaxed);
});
}
for (std::thread& thd : threads) {
thd.join();
}
std::cout << sum << '\n';
}
I'm working on a small Collatz conjecture calculator using C++ and GMP, and I'm trying to implement parallelism on it using OpenMP, but I'm coming across issues regarding thread safety. As it stands, attempting to run the code will yield this:
*** Error in `./collatz': double free or corruption (fasttop): 0x0000000001140c40 ***
*** Error in `./collatz': double free or corruption (fasttop): 0x00007f4d200008c0 ***
[1] 28163 abort (core dumped) ./collatz
This is the code to reproduce the behaviour.
#include <iostream>
#include <gmpxx.h>
mpz_class collatz(mpz_class n) {
if (mpz_odd_p(n.get_mpz_t())) {
n *= 3;
n += 1;
} else {
n /= 2;
}
return n;
}
int main() {
mpz_class x = 1;
#pragma omp parallel
while (true) {
//std::cout << x.get_str(10);
while (true) {
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
x = collatz(x);
}
x++;
//std::cout << " OK" << std::endl;
}
}
Given that I do not get this error when I uncomment the (slow) outputs to screen, I assume the issue at hand has to do with thread safety, in particular with concurrent threads trying to increment x at the same time.
Am I correct in my assumptions? How can I fix this and make it safe to run?
I assume what you want to do is check whether the Collatz conjecture holds for all numbers. The program you posted is wrong on many levels, both serially and in parallel.
if (mpz_cmp_ui(x.get_mpz_t(), 1)) break;
This means it will break when x != 1. If you replace it with the correct 0 == mpz_cmp_ui, the code will just continue to test 2 over and over again. You need two variables anyway: one for the outer loop that represents what you want to check, and one for the inner loop performing the check. It's easier to get this right if you make a function for it:
void check_collatz(mpz_class n) {
while (n != 1) {
n = collatz(n);
}
}
int main() {
mpz_class x = 1;
while (true) {
std::cout << x.get_str(10);
check_collatz(x);
x++;
}
}
The while (true) loop is hard to reason about and parallelize, so let's make an equivalent for loop:
for (mpz_class x = 1;; x++) {
check_collatz(x);
}
Now we can talk about parallelizing the code. The basis for OpenMP parallelization is a worksharing construct; you cannot just slap #pragma omp parallel on a while loop. Fortunately, you can easily mark certain canonical for loops with #pragma omp parallel for. For that, however, you cannot use mpz_class as a loop variable, and you must specify an end for the loop:
#pragma omp parallel for
for (long check = 1; check < std::numeric_limits<long>::max(); check++) // note <, so check++ never overflows
{
check_collatz(check);
}
Note that check is implicitly private: there is a copy for each thread working on it. Also, OpenMP takes care of distributing the work [1 ... 2^63] among threads. When a thread calls check_collatz, a new, private mpz_class object is created for it.
Now, you might notice that repeatedly creating a new mpz_class object in each loop iteration is costly (memory allocation). You can reuse one instead (by breaking check_collatz apart again) and create a thread-private mpz_class working object. For this, you split the combined parallel for into separate parallel and for pragmas:
#include <gmpxx.h>
#include <iostream>
#include <limits>
// Avoid copying objects by taking and modifying a reference
void collatz(mpz_class& n)
{
if (mpz_odd_p(n.get_mpz_t()))
{
n *= 3;
n += 1;
}
else
{
n /= 2;
}
}
int main()
{
#pragma omp parallel
{
mpz_class x;
#pragma omp for
for (long check = 1; check < std::numeric_limits<long>::max(); check++)
{
// Note: The structure of this fits perfectly in a for loop.
for (x = check; x != 1; collatz(x));
}
}
}
Note that declaring x inside the parallel region makes it implicitly private and properly constructed. Prefer that to declaring it outside and marking it private; the latter often leads to confusion, because explicitly private variables from an outside scope are uninitialized.
You might complain that this only checks the first 2^63 numbers. Just let it run. This gives you enough time to master OpenMP to expert level and write your own custom worksharing for GMP objects.
You were concerned about having extra objects for each thread. This is essential for good performance. You cannot solve this efficiently with locks/critical sections/atomics. You would have to protect each and every read and write to your only relevant variable. There would be no parallelism left.
Note: The huge for loop will likely have a load imbalance. So some threads will probably finish a few centuries earlier than the others. You could fix that with dynamic scheduling, or smaller static chunks.
Edit: For the sake of completeness, here is one idea for implementing the worksharing directly on GMP objects:
#pragma omp parallel
{
// Note: this is not a worksharing loop; each thread runs its own
// loop over a distinct strided sequence of values
int nthreads = omp_get_num_threads();
mpz_class check = 1;
// we already checked those in the other program
check += std::numeric_limits<long>::max();
check += omp_get_thread_num();
mpz_class x;
for (; ; check += nthreads)
{
// Note: The structure of this fits perfectly in a for loop.
for (x = check; x != 1; collatz(x));
}
}
You could well be right about collisions on x. You can mark x as private with:
#pragma omp parallel private(x)
This way each thread gets its own "version" of the variable x, which should make this thread-safe. By default, variables declared before a #pragma omp parallel are shared, so there is a single instance shared between all of the threads.
You might want to touch x only with atomic instructions.
#pragma omp atomic
x++;
This ensures that all threads see a consistent value of x without requiring mutexes or other synchronization techniques. (Note that #pragma omp atomic applies only to scalar types, so it would not work directly on an mpz_class.)
I am trying to multithread a piece of code using the boost library. The problem is that each thread has to access and modify a couple of global variables. I am using a mutex to lock the shared resources, but the program ends up taking more time than when it was not multithreaded. Any advice on how to optimize the shared access?
Thanks a lot!
In the example below, the choose_ecount variable has to be locked, and I cannot take it out of the loop and lock it only for an update at the end, because the inside function needs it with the newest values.
for(int sidx = startStep; sidx <= endStep && sidx < d.sents[lang].size(); sidx ++){
sentence s = d.sents[lang][sidx];
int senlen = s.words.size();
int end_symb = s.words[senlen-1].pos;
inside(s, lbeta);
outside(s,lbeta, lalpha);
long double sen_prob = lbeta[senlen-1][F][NO][0][senlen-1];
if (lambda[0] == 0){
mtx_.lock();
d.sents[lang][sidx].prob = sen_prob;
mtx_.unlock();
}
for(int size = 1; size <= senlen; size++)
for(int i = 0; i <= senlen - size ; i++)
{
int j = i + size - 1;
for(int k = i; k < j; k++)
{
int hidx = i; int head = s.words[hidx].pos;
for(int r = k+1; r <=j; r++)
{
int aidx = r; int arg = s.words[aidx].pos;
mtx_.lock();
for(int kids = ONE; kids <= MAX; kids++)
{
long double num = lalpha[hidx][R][kids][i][j] * get_choose_prob(s, hidx, aidx) *
lbeta[hidx][R][kids - 1][i][k] * lbeta[aidx][F][NO][k+1][j];
long double gen_right_prob = (num / sen_prob);
choose_ecount[lang][head][arg] += gen_right_prob; //LOCK
order_ecount[lang][head][arg][RIGHT] += gen_right_prob; //LOCK
}
mtx_.unlock();
}
}
From the code you have posted I can see only writes to choose_ecount and order_ecount. So why not use local per-thread buffers to compute the sums, add them up after the outermost loop, and synchronize only that final operation?
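A minimal sketch of that pattern, with hypothetical flat buffers standing in for the real multi-dimensional choose_ecount/order_ecount tables:
#include <thread>
#include <vector>

constexpr int kNumThreads = 4;     // hypothetical sizes for illustration
constexpr int kNumEntries = 1024;

int main()
{
    std::vector<long double> choose_ecount(kNumEntries, 0.0L);
    // one private buffer per thread: no locks needed while accumulating
    std::vector<std::vector<long double>> local(
        kNumThreads, std::vector<long double>(kNumEntries, 0.0L));

    std::vector<std::thread> threads;
    for (int t = 0; t < kNumThreads; ++t) {
        threads.emplace_back([t, &local] {
            // stand-in for the real per-sentence computation
            for (int e = 0; e < kNumEntries; ++e)
                local[t][e] += 0.5L;
        });
    }
    for (auto& th : threads)
        th.join();

    // single-threaded merge: the only step that touches shared state
    for (int t = 0; t < kNumThreads; ++t)
        for (int e = 0; e < kNumEntries; ++e)
            choose_ecount[e] += local[t][e];
}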
Edit:
If you need to access the intermediate values of choose_ecount, how do you ensure the correct intermediate value is present? One thread might have finished two iterations of its loop in the meantime, producing different results in another thread.
It kind of sounds like you need to use a barrier for your computation instead.
It's unlikely you're going to get acceptable performance using a mutex in an inner loop. Concurrent programming is difficult, not just for the programmer but also for the computer. A large portion of the performance of modern CPUs comes from being able to treat blocks of code as sequences independent of external data. Algorithms that are efficient for single-threaded execution are often unsuitable for multi-threaded execution.
You might want to have a look at boost::atomic, which can provide lock-free synchronization, but the memory barriers required for atomic operations are still not free, so you may still run into problems, and you will probably have to re-think your algorithm.
I guess that you divide your complete problem into chunks ranging from startStep to endStep to get processed by each thread.
Since you have that locked mutex there, you're effectively serializing all threads: you divide your problem into chunks which are processed serially, in an unspecified order. All you gain is the overhead of multithreading.
Since you're operating on doubles, atomic operations are not an option: they're typically implemented for integral types only. The only viable solution is to follow Kratz's suggestion: keep a copy of choose_ecount and order_ecount per thread and reduce them into a single one after your threads have finished.
Once again I'm stuck using OpenMP in C++. This time I'm trying to implement a parallel quicksort.
Code:
#include <iostream>
#include <vector>
#include <stack>
#include <utility>
#include <omp.h>
#include <stdio.h>
#define SWITCH_LIMIT 1000
using namespace std;
template <typename T>
void insertionSort(std::vector<T> &v, int q, int r)
{
    T key; // T rather than int, so the template works for non-int element types
    int i;
for(int j = q + 1; j <= r; ++j)
{
key = v[j];
i = j - 1;
while( i >= q && v[i] > key )
{
v[i+1] = v[i];
--i;
}
v[i+1] = key;
}
}
stack<pair<int,int> > s;
template <typename T>
void qs(vector<T> &v, int q, int r)
{
T pivot;
int i = q - 1, j = r;
//switch to insertion sort for small data
if(r - q < SWITCH_LIMIT)
{
insertionSort(v, q, r);
return;
}
pivot = v[r];
while(true)
{
while(v[++i] < pivot);
while(v[--j] > pivot);
if(i >= j) break;
std::swap(v[i], v[j]);
}
std::swap(v[i], v[r]);
#pragma omp critical
{
s.push(make_pair(q, i - 1));
s.push(make_pair(i + 1, r));
}
}
int main()
{
int n, x;
int numThreads = 4, numBusyThreads = 0;
bool *idle = new bool[numThreads];
for(int i = 0; i < numThreads; ++i)
idle[i] = true;
pair<int, int> p;
vector<int> v;
cin >> n;
for(int i = 0; i < n; ++i)
{
cin >> x;
v.push_back(x);
}
cout << v.size() << endl;
s.push(make_pair(0, (int)v.size() - 1)); // qs treats r as the last valid index
#pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads, p)
{
bool done = false;
while(!done)
{
int id = omp_get_thread_num();
#pragma omp critical
{
if(s.empty() == false && numBusyThreads < numThreads)
{
++numBusyThreads;
//the current thread is not idle anymore
//it will get the interval [q, r] from stack
//and run qs on it
idle[id] = false;
p = s.top();
s.pop();
}
if(numBusyThreads == 0)
{
done = true;
}
}
if(idle[id] == false)
{
qs(v, p.first, p.second);
idle[id] = true;
#pragma omp critical
--numBusyThreads;
}
}
}
return 0;
}
Algorithm:
To use OpenMP for a recursive function, I used a stack to keep track of the next intervals on which the qs function should run. I manually add the first interval [0, size-1] and then let the threads get to work as new intervals are added to the stack.
The problem:
The program ends too early, without sorting the array, after creating the first set of intervals ([q, i-1], [i+1, r] if you look at the code). My guess is that the threads that get the work consider the local variables of the quicksort function (qs in the code) shared by default, so they mess them up and add no intervals to the stack.
How I compile:
g++ -o qs qs.cc -Wall -fopenmp
How I run:
./qs < in_100000 > out_100000
where in_100000 is a file containing 100000 on the first line, followed by 100k integers on the next line, separated by spaces.
I am using gcc 4.5.2 on Linux.
Thank you for your help,
Dan
I didn't actually run your code, but I see an immediate mistake: p should be private, not shared. The parallel invocations of qs(v, p.first, p.second); will race on p, resulting in unpredictable behavior. The local variables in qs should be okay, because all threads have their own stacks. However, the overall approach is good; you're on the right track.
Here are my general comments on implementing parallel quicksort. Quicksort itself is embarrassingly parallel: the recursive calls of qs on a partitioned array need no synchronization.
However, the parallelism is exposed in recursive form. If you simply use nested parallelism in OpenMP, you will end up with a thousand threads within a second, and no speedup will be gained. So you mostly need to turn the recursive algorithm into an iterative one and implement a work queue of sorts. This is your approach, and it's not easy.
For your approach, there is a good benchmark: OmpSCR. You can download it at http://sourceforge.net/projects/ompscr/
The benchmark contains several versions of OpenMP-based quicksort, most of them similar to yours. However, to increase parallelism, you must minimize contention on the global queue (in your code, s). Possible optimizations include having local queues. Although the algorithm itself is purely parallel, the implementation may require synchronization artifacts, and most of all, it's very hard to gain speedups.
However, you can still use recursive parallelism directly in OpenMP in two ways: (1) throttling the total number of threads, and (2) using OpenMP 3.0's task.
Here is pseudo code for the first approach (This is only based on OmpSCR's benchmark):
void qsort_omp_recursive(int* begin, int* end)
{
if (begin != end) {
// Partition ...
// Throttling
if (...) {
qsort_omp_recursive(begin, middle);
qsort_omp_recursive(++middle, ++end);
} else {
#pragma omp parallel sections
{
#pragma omp section
qsort_omp_recursive(begin, middle);
#pragma omp section
qsort_omp_recursive(++middle, ++end);
}
}
}
}
In order to run this code, you need to call omp_set_nested(1) and omp_set_num_threads(2). The code is really simple: we spawn two threads on the division of the work, but insert a simple throttling check to prevent excessive thread creation. Note that my experiments showed decent speedups for this approach.
Finally, you may use OpenMP 3.0's task, where a task is a logically concurrent unit of work. In all of the OpenMP approaches above, each parallel construct spawns two physical threads, so there is a hard 1-to-1 mapping between a task and a worker thread. A task construct, by contrast, separates logical tasks from workers.
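For illustration, a minimal sketch of the task version might look like the following (my own sketch, not from OmpSCR; it assumes an OpenMP 3.0 compiler, and a real implementation would add a serial cutoff like your SWITCH_LIMIT):
#include <algorithm>

void qsort_omp_task(int* begin, int* end)
{
    if (end - begin < 2) return;
    int* last = end - 1;
    // partition around the last element, then put the pivot in place
    int* middle = std::partition(begin, last,
                                 [last](int x) { return x < *last; });
    std::swap(*last, *middle);
    #pragma omp task                  // left half becomes a stealable task
    qsort_omp_task(begin, middle);
    qsort_omp_task(middle + 1, end);  // right half runs in the current task
    #pragma omp taskwait              // wait for the spawned child
}

// Typical invocation: one thread seeds the recursion, the team steals tasks.
// #pragma omp parallel
// #pragma omp single nowait
// qsort_omp_task(data, data + n);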
Because OpenMP 3.0 is not yet widespread, I will use Cilk Plus, which is great for expressing this kind of nested, recursive parallelism. In Cilk Plus, the parallelization is extremely easy:
void qsort(int* begin, int* end)
{
if (begin != end) {
--end;
int* middle = std::partition(begin, end,
std::bind2nd(std::less<int>(), *end));
std::swap(*end, *middle);
cilk_spawn qsort(begin, middle);
qsort(++middle, ++end);
// cilk_sync; only necessary at the final stage.
}
}
I copied this code from Cilk Plus's example code. You will see that a single keyword, cilk_spawn, is everything needed to parallelize quicksort. I'm skipping the explanation of Cilk Plus and the spawn keyword, but it's easy to understand: the two recursive calls are declared as logically concurrent tasks. Whenever the recursion takes place, logical tasks are created, and the Cilk Plus runtime (which implements an efficient work-stealing scheduler) handles all the dirty work: it queues the parallel tasks and maps them to worker threads.
Note that OpenMP 3.0's task is essentially similar to the Cilk Plus approach. In my experiments, pretty nice speedups were feasible: I got a 3-4x speedup on an 8-core machine, and the speedup scaled. Cilk Plus's absolute speedups were greater than OpenMP 3.0's.
The approach of Cilk Plus (and OpenMP 3.0) and your approach are essentially the same: separating parallel tasks from workload assignment. However, it's very difficult to implement efficiently; for example, you must reduce contention and use lock-free data structures.