OpenMP - std::next_permutation - c++

I am trying to parallelize my own C++ implementation of Travelling Salesman Problem using OpenMP.
I have a function to calculate cost of road cost() and vector [0,1,2,...,N], where N is a number of nodes of the road.
In main(), I am trying to find the best road:
do
{
cost();
} while (std::next_permutation(permutation_base, permutation_base + operations_number));
I was trying to use #pragma omp parallel to parallelize that code, but it only made it more time consuming.
Is there any way to parallelize that code?

#pragma omp parallel doesn't automatically divide the computation on separate threads. If you want to divide the computation you need do additionally use #pragma omp for, otherwise the hole computation is done multiple times, one time for each thread. For instance the following code prints "Hello World!" four times on my laptop, since it has 4 cores.
int main(int argc, char* argv[]){
#pragma omp parallel
cout << "Hello World!\n";
}
The same thing happens to your code, if you simple write #pragma omp parallel. Your code gets executed multiple times, once for each thread. And therefore your program won't be faster. If you want to divide the work onto the threads (each thread does different things), you have to use something like #pragma omp parallel for.
Now we can look at your code. It isn't suited for parallelization. Lets see why. You start with your array permutation_base and calculate the costs. Then you manipulate permutation_base with next_permutation. You actually have to wait for the finished cost computations, before you are allowed to manipulate the the array, because otherwise the cost computation would be wrong. So the whole thing wouldn't work on separate threads.
One possible solution would be, to keep multiple copies of your array permutation_base, and each possible permutation base only runs through a part of all permutations. For instance:
vector<int> permutation_base{1, 2, 3, 4};
int n = permutation_base.size();
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
// Make a copy of permutation_base
auto perm = permutation_base;
// rotate the i'th element to the front
// keep the other elements sorted
std::rotate(perm.begin(), perm.begin() + i, perm.begin() + i + 1);
// Now go through all permutations of the last `n-1` elements.
// Keep the first element fixed.
do {
cost()
}
while (std::next_permutation(perm.begin() + 1, perm.end()));
}

Most definitely.
The big problem with parallelizing these permutation problems is that in order to parallelize well, you need to "index" into an arbitrary permutation. In short, you need to find the kth permutation. You can take advantage of some cool math properties and you'll find this:
std::vector<int> kth_perm(long long k, std::vector<int> V) {
long long int index;
long long int next;
std::vector<int> new_v;
while(V.size()) {
index = k / fact(V.size() - 1);
new_v.push_back(V.at(index));
next = k % fact(V.size() - 1);
V.erase(V.begin() + index);
k = next;
}
return new_v;
}
So then your logic might look something like this:
long long int start = (numperms*threadnum)/ numthreads;
long long int end = threadnum == numthreads-1 ? numperms : (numperms*(threadnum+1))/numthreads;
perm = kth_perm(start, perm); // perm is your list of permutations
for (int j = start; j < end; ++j){
if (is_valid_tour(adj_list, perm, startingVertex, endingVertex)) {
isValidTour=true;
return perm;
}
std::next_permutation(perm.begin(),perm.end());
}
isValidTour = false;
return perm;
Obviously there's a lot of code, but the idea of parallelizing it can be captured by the little code I've posted. You can visualize "indexing" like this:
|--------------------------------|
^ ^ ^
t1 t2 ... tn
Find the ith permutation and let a thread call std::next_permutation until it finds the starting point of the next thread.
Note that you'll want to wrap the function that contains the bottom code in #pragma omp parallel

Related

How to optimize omp parallelization when batching

I am generating class Objects and putting them into std::vector. Before adding, I need to check if they intersect with the already generated objects. As I plan to have millions of them, I need to parallelize this function as it takes a lot of time (The function must check each new object against all previously generated).
Unfortunately, the speed increase is not significant. The profiler also shows very low efficiency (all overhead). Any advise would be appreciated.
bool
Generator::_check_cube (std::vector<Cube> &cubes, const cube &cube)
{
auto ptr_cube = &cube;
auto npol = cubes.size();
auto ptr_cubes = cubes.data();
const auto nthreads = omp_get_max_threads();
bool check = false;
#pragma omp parallel shared (ptr_cube, ptr_cubes, npol, check)
{
#pragma omp single nowait
{
const auto batch_size = npol / nthreads;
for (int32_t i = 0; i < nthreads; i++)
{
const auto bstart = batch_size * i;
const auto bend = ((bstart + batch_size) > npol) ? npol : bstart + batch_size;
#pragma omp task firstprivate(i, bstart, bend) shared (check)
{
struct bd bd1{}, bd2{};
bd1 = allocate_bd();
bd2 = allocate_bd();
for (auto j = bstart; j < bend; j++)
{
bool loc_check;
#pragma omp atomic read
loc_check = check;
if (loc_check) break;
if (ptr_cube->cube_intersecting(ptr_cubes[j], &bd1, &bd2))
{
#pragma omp atomic write
check = true;
break;
}
}
free_bd(&bd1);
free_bd(&bd2);
}
}
}
}
return check;
}
UPDATE: The Cube is actually made of smaller objects Cuboids, each of them have size (L, W, H), position coordinates and rotation. The intersect function:
bool
Cube::cube_intersecting(Cube &other, struct bd *bd1, struct bd *bd2) const
{
const auto nom = number_of_cuboids();
const auto onom = other.number_of_cuboids();
for (int32_t i = 0; i < nom; i++)
{
get_mcoord(i, bd1);
for (int32_t j = 0; j < onom; j++)
{
other.get_mcoord(j, bd2);
if (check_gjk_intersection(bd1, bd2))
{
return true;
}
}
}
return false;
}
//get_mcoord calculates vertices of the cuboids
void
Cube::get_mcoord(int32_t index, struct bd *bd) const
{
for (int32_t i = 0; i < 8; i++)
{
for (int32_t j = 0; j < 3; j++)
{
bd->coord[i][j] = _cuboids[index].get_coord(i)[j];
}
}
}
inline struct bd
allocate_bd()
{
struct bd bd{};
bd.numpoints = 8;
bd.coord = (double **) malloc(8 * sizeof(double *));
for (int32_t i = 0; i < 8; i++)
{
bd.coord[i] = (double *) malloc(3 * sizeof(double));
}
return bd;
}
Typical values: npol > 1 million, threads 32, and each npol Cube consists of 1 - 3 smaller cuboids which are directly checked against other if intersect.
The problem with your search is that OpenMP really likes static loops, where the number of iterations is predetermined. Thus, maybe one task will break early, but all the other will go through their full search.
With recent versions of OpenMP (5, I think) there is a solution for that.
(Not sure about this one: Make your tasks much more fine-grained, for instance one for each intersection test);
Spawn your tasks in a taskloop;
Once you find your intersection (or any condition that causes you to break), do cancel taskloop.
Small problem: cancelling is disabled by default. Set the environment variable OMP_CANCELLATION to true.
Do you have more intersections being true or more being false ? If most are true, you're flooding your hardware with requests to write to a shared resource, and what you are doing is essentially sequential. One way to address this is to avoid using a shared resource so there is no mutex and you let all threads run and at the end you take a decision given the results; this will likely run faster but the benefit depends also on arbitrary choices such as few metrics (eg., nthreads, ncuboids).
It is possible that on another architecture (eg., gpu), your algorithm works well as it is. I may be worth it to benchmark it on a gpu, and see if you will benefit from that migration, given the production sizes (millions of cuboids, 24 dimensions).
You also have a complexity problem, which is, for every new cuboid you compare up to the whole set of existing cuboids. One way to address this is to gather all the cuboids size (range) by dimension and order them, and add the new cuboids ranges ordered. If there is intersection in one dimension, you test the next one etc. You also can runs them in parallel. Before running through the ranges, you test if you are hitting inside the global range, if not it's useless to test locally the intersection.
Here and in general you want to parallelize with minimum of dependency (shared resources, mutex). So you want to try to find a point of view where this will happen. Parallelising over dimensions over ordered ranges (segments) might be better that parallelizing over cuboids.
Algorithms and benefits of parallelism also depend on the values of your objects. This does not mean that complexity predictions are not relevant, but that one may find a smarter approach given those values.
I think your code is memory bound, so its bottleneck is memory read/write not calculations. This can be the main reason of poor speed increase. As already mentioned by #Soleil a different hardware (GPU) can be beneficial here.
You mentioned in the comments that Generator::_check_cub called many times. To reduce OpenMP overheads my suggestion is moving the parallel region out of this function, you can even use it in your main function:
main(){
#pragma omp parallel
#pragma omp single nowait
{
//your code
}
}
In this case you have to use #pragma omp taskwait to wait for the tasks to complete.
for (int32_t i = 0; i < nthreads; i++)
{
#pragma omp task default(none) firstprivate(...) shared (..)
{
//your code comes here
}
}
#pragma omp taskwait
I also suggest using default(none) clause in #pragma omp task directive so you have to explicitly tell the sharing attribute of all your variables.
Do you really need function get_mcoord? It seems a redunant memory copy to me. I think it may be better to write a check_gjk_intersection function which takes _cuboids or its indices as parameters. In this case you get rid of many memory allocations/deallocations of bd1 and bd2, which also can be time consuming as #Victor pointed out.

How to run all threads in sequence as static with out using opemMP for?

I'm new to openMP and multi-threading.
I have been given a task to run a method as static, dynamic, and guided without using OpenMPfor loop which means I cant use scheduled clauses.!
I could create parallel threads with parallel and could assign loop iterations to threads equally
but how to make it static and dynamic(1000 block) and guided?
void static_scheduling_function(const int start_count,
const int upper_bound,
int *results)
{
int i, tid, numt;
#pragma omp parallel private(i,tid)
{
int from, to;
tid = omp_get_thread_num();
numt = omp_get_num_threads();
from = (upper_bound / numt) * tid;
to = (upper_bound / numt) * (tid + 1) - 1;
if (tid == numt - 1)
to = upper_bound - 1;
for (i = from; i < to; i++)
{
//compute one iteration (i)
int start = i;
int end = i + 1;
compute_iterations(start, end, results);
}
}
}
======================================
For dynamic i have tried something like this
void chunk_scheduling_function(const int start_count, const int upper_bound, int* results) {
int numt, shared_lower_iteration_counter=start_count;
for (int shared_lower_iteration_counter=start_count; shared_lower_iteration_counter<upper_bound;){
#pragma omp parallel shared(shared_lower_iteration_counter)
{
int tid = omp_get_thread_num();
int from,to;
int chunk = 1000;
#pragma omp critical
{
from= shared_lower_iteration_counter; // 10, 1010
to = ( shared_lower_iteration_counter + chunk ); // 1010,
shared_lower_iteration_counter = shared_lower_iteration_counter + chunk; // 1100 // critical is important while incrementing shared variable which decides next iteration
}
for(int i = from ; (i < to && i < upper_bound ); i++) { // 10 to 1009 , i< upperbound prevents other threads from executing call
int start = i;
int end = i + 1;
compute_iterations(start, end, results);
}
}
}
}
This looks like a university assignment (and a very good one IMO), I will not provide the complete solution, instead I will provide what you should be looking for.
The static scheduler looks okey; Notwithstanding, it can be improved by taking into account the chunk size as well.
For the dynamic and guided schedulers, they can be implemented by using a variable (let us name it shared_iteration_counter) that will be marking the current loop iteration that should pick up next by the threads. Therefore, when a thread needs to request a new task to work with (i.e., a new loop iteration) it queries that variable for that. In pseudo code would look like the following:
int thread_current_iteration = shared_iteration_counter++;
while(thread_current_iteration < MAX_SIZE)
{
// do work
thread_current_iteration = shared_iteration_counter++;
}
The pseudo code is assuming chunk size of 1 (i.e., shared_iteration_counter++) you will have to adapt to your use-case. Now, because that variable will be shared among threads, and every thread will be updating it, you need to ensure mutual exclusion during the updates of that variable. Fortunately, OpenMP offers means to achieve that, for instance, using #pragma omp critical, explicitly locks, and atomic operations. The latter is the better option for your use-case:
#pragma omp atomic
shared_iteration_counter = shared_iteration_counter + 1;
For the guided scheduler:
Similar to dynamic scheduling, but the chunk size starts off large and
decreases to better handle load imbalance between iterations. The
optional chunk parameter specifies them minimum size chunk to use. By
default the chunk size is approximately loop_count/number_of_threads.
In this case, not only you have to guarantee mutual exclusion of the variable that will be used to count the current loop iteration to be pick up by threads, but also guarantee mutual exclusion of the chunk size variable, since it also changes.
Without given it way too much bear in mind that you may need to considered how to deal with edge-cases such as your current thread_current_iteration= 1000 and your chunks_size=1000 with a MAX_SIZE=1500. Hence, thread_current_iteration + chunks_size > MAX_SIZE, but there is still 500 iterations to be computed.

Parallelize Algorithm with OpenMP in C++

my problem is this:
I want to solve TSP with the Ant Colony Optimization Algorithm in C++.
Right now Ive implemented a algorithm that solve this problem iterative.
For example: I generate 500 ants - and they find their route one after the other.
Each ant starts not until the previous ant finished.
Now I want to parallelize the whole thing - and I thought about using OpenMP.
So my first question is: Can I generate a large number of threads that work
simultaneously (for the number of ants > 500)?
I already tried something out. So this is my code from my main.cpp:
#pragma omp parallel for
for (auto ant = antarmy.begin(); ant != antarmy.end(); ++ant) {
#pragma omp ordered
if (ant->getIterations() < ITERATIONSMAX) {
ant->setNumber(currentAntNumber);
currentAntNumber++;
ant->antRoute();
}
}
And this is the code in my Ant class that is "critical" because each Ant reads and writes into the same Matrix (pheromone-Matrix):
void Ant::antRoute()
{
this->route.setCity(0, this->getStartIndex());
int nextCity = this->getNextCity(this->getStartIndex());
this->routedistance += this->data->distanceMatrix[this->getStartIndex()][nextCity];
int tempCity;
int i = 2;
this->setProbability(nextCity);
this->setVisited(nextCity);
this->route.setCity(1, nextCity);
updatePheromone(this->getStartIndex(), nextCity, routedistance, 0);
while (this->getVisitedCount() < datacitycount) {
tempCity = nextCity;
nextCity = this->getNextCity(nextCity);
this->setProbability(nextCity);
this->setVisited(nextCity);
this->route.setCity(i, nextCity);
this->routedistance += this->data->distanceMatrix[tempCity][nextCity];
updatePheromone(tempCity, nextCity, routedistance, 0);
i++;
}
this->routedistance += this->data->distanceMatrix[nextCity][this->getStartIndex()];
// updatePheromone(-1, -1, -1, 1);
ShortestDistance(this->routedistance);
this->iterationsshortestpath++;
}
void Ant::updatePheromone(int i, int j, double distance, bool reduce)
{
#pragma omp critical(pheromone)
if (reduce == 1) {
for (int x = 0; x < datacitycount; x++) {
for (int y = 0; y < datacitycount; y++) {
if (REDUCE * this->data->pheromoneMatrix[x][y] < 0)
this->data->pheromoneMatrix[x][y] = 0.0;
else
this->data->pheromoneMatrix[x][y] -= REDUCE * this->data->pheromoneMatrix[x][y];
}
}
}
else {
double currentpheromone = this->data->pheromoneMatrix[i][j];
double updatedpheromone = (1 - PHEROMONEREDUCTION)*currentpheromone + (PHEROMONEDEPOSIT / distance);
if (updatedpheromone < 0.0) {
this->data->pheromoneMatrix[i][j] = 0;
this->data->pheromoneMatrix[j][i] = 0;
}
else {
this->data->pheromoneMatrix[i][j] = updatedpheromone;
this->data->pheromoneMatrix[j][i] = updatedpheromone;
}
}
}
So for some reasons the omp parallel for loop wont work on these range-based loops. So this is my second question - if you guys have any suggestions on the code how the get the range-based loops done im happy.
Thanks for your help
So my first question is: Can I generate a large number of threads that work simultaneously (for the number of ants > 500)?
In OpenMP you typically shouldn't care how many threads are active, instead you make sure to expose enough parallel work through work-sharing constructs such as omp for or omp task. So while you may have a loop with 500 iterations, your program could be run with anything between one thread and 500 (or more, but they would just idle). This is a difference to other parallelization approaches such as pthreads where you have to manage all the threads and what they do.
Now your example uses ordered incorrectly. Ordered is only useful if you have a small part of your loop body that needs to be executed in-order. Even then it can be very problematic for performance. Also you need to declare a loop to be ordered if you want to use ordered inside. See also this excellent answer.
You should not use ordered. Instead make sure that the ants know there number beforehand, write the code such that they don't need a number, or at the very least that the order of numbers doesn't matter for ants. In the latter case you can use omp atomic capture.
As to the access to shared data. Try to avoid it as much as possible. Adding omp critical is a first step to get a correct parallel program, but often leads to performance problems. Measure your parallel efficiency, use parallel performance analysis tools to find out if this is the case for you. Then you can use atomic data access or reduction (each threads has their own data they work on and only after the main work is finished, data from all threads is merged).

OpenMP parallel code has not the same output as the serial code

I had to change and extend my algorithm for some signal analysis (using the polyfilterbank technique) and couldn't use my old OpenMP code, but in the new code the results are not as expected (the results in the beginning positions in the array are somehow incorrect in comparison with a serial run [serial code shows the expected result]).
So in the first loop tFFTin I have some FFT data, which I'm multiplicating with a window function.
The goal is that a thread runs the inner loops for each polyphase factor. To avoid locks I use the reduction pragma (no complex reduction is defined by standard, so I use my one where each thread's omp_priv variable gets initialized with the omp_orig [so with tFFTin]). The reason I'm using the ordered pragma is that the results should be added to the output vector in an ordered way.
typedef std::complex<float> TComplexType;
typedef std::vector<TComplexType> TFFTContainer;
#pragma omp declare reduction(complexMul:TFFTContainer:\
transform(omp_in.begin(), omp_in.end(),\
omp_out.begin(), omp_out.begin(),\
std::multiplies<TComplexType>()))\
initializer (omp_priv(omp_orig))
void ConcreteResynthesis::ApplyPolyphase(TFFTContainer& tFFTin, TFFTContainer& tFFTout, TWindowContainer& tWindow, *someparams*) {;
#pragma omp parallel for shared(tWindow) firstprivate(sFFTParams) reduction(complexMul: tFFTin) ordered if(iFFTRawDataLen>cMinParallelSize)
for (int p = 0; p < uPolyphase; ++p) {
int iPolyphaseOffset = p * uFFTLength;
for (int i = 0; i < uFFTLength; ++i) {
tFFTin[i] *= tWindow[iPolyphaseOffset + i]; ///< get FFT input data from raw data
}
#pragma omp ordered
{
//using the overlap and add method
for (int i = 0; i < sFFTParams.uFFTLength; ++i) {
pDataPool->GetFullSignalData(workSignal)[mSignalPos + iPolyphaseOffset + i] += tFFTin[i];
}
}
}
mSignalPos = mSignalPos + mStep;
}
Is there a race condition or something, which makes wrong outputs at the beginning? Or do I have some logic error?
Another issue is, I don't really like my solution with using the ordered pragma, is there a better approach( i tried to use for this also the reduction-model, but the compiler doesn't allow me to use a pointer type for that)?
I think your problem is that you have implemented a very cool custom reduction for tFFTin. But this reduction is applied at the end of the parallel region.
Which is after you use the data in tFFTin. Another thing is what H. Iliev mentions that the second iteration of the outer loop relies on data which is computed in the previous iteration - a classic dependency.
I think you should try parallelizing the inner loops.

C OpenMP parallel quickSort

Once again I'm stuck when using openMP in C++. This time I'm trying to implement a parallel quicksort.
Code:
#include <iostream>
#include <vector>
#include <stack>
#include <utility>
#include <omp.h>
#include <stdio.h>
#define SWITCH_LIMIT 1000
using namespace std;
template <typename T>
void insertionSort(std::vector<T> &v, int q, int r)
{
int key, i;
for(int j = q + 1; j <= r; ++j)
{
key = v[j];
i = j - 1;
while( i >= q && v[i] > key )
{
v[i+1] = v[i];
--i;
}
v[i+1] = key;
}
}
stack<pair<int,int> > s;
template <typename T>
void qs(vector<T> &v, int q, int r)
{
T pivot;
int i = q - 1, j = r;
//switch to insertion sort for small data
if(r - q < SWITCH_LIMIT)
{
insertionSort(v, q, r);
return;
}
pivot = v[r];
while(true)
{
while(v[++i] < pivot);
while(v[--j] > pivot);
if(i >= j) break;
std::swap(v[i], v[j]);
}
std::swap(v[i], v[r]);
#pragma omp critical
{
s.push(make_pair(q, i - 1));
s.push(make_pair(i + 1, r));
}
}
int main()
{
int n, x;
int numThreads = 4, numBusyThreads = 0;
bool *idle = new bool[numThreads];
for(int i = 0; i < numThreads; ++i)
idle[i] = true;
pair<int, int> p;
vector<int> v;
cin >> n;
for(int i = 0; i < n; ++i)
{
cin >> x;
v.push_back(x);
}
cout << v.size() << endl;
s.push(make_pair(0, v.size()));
#pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads, p)
{
bool done = false;
while(!done)
{
int id = omp_get_thread_num();
#pragma omp critical
{
if(s.empty() == false && numBusyThreads < numThreads)
{
++numBusyThreads;
//the current thread is not idle anymore
//it will get the interval [q, r] from stack
//and run qs on it
idle[id] = false;
p = s.top();
s.pop();
}
if(numBusyThreads == 0)
{
done = true;
}
}
if(idle[id] == false)
{
qs(v, p.first, p.second);
idle[id] = true;
#pragma omp critical
--numBusyThreads;
}
}
}
return 0;
}
Algorithm:
To use openMP for a recursive function I used a stack to keep track of the next intervals on which the qs function should run. I manually add the 1st interval [0, size] and then let the threads get to work when a new interval is added in the stack.
The problem:
The program ends too early, not sorting the array after creating the 1st set of intervals ([q, i - 1], [i+1, r] if you look on the code. My guess is that the threads which get the work, considers the local variables of the quicksort function(qs in the code) shared by default, so they mess them up and add no interval in the stack.
How I compile:
g++ -o qs qs.cc -Wall -fopenmp
How I run:
./qs < in_100000 > out_100000
where in_100000 is a file containing 100000 on the 1st line followed by 100k intergers on the next line separated by spaces.
I am using gcc 4.5.2 on linux
Thank you for your help,
Dan
I didn't actually run your code, but I see an immediate mistake on p, which should be private not shared. The parallel invocation of qs: qs(v, p.first, p.second); will have races on p, resulting in unpredictable behavior. The local variables at qs should be okay because all threads have their own stack. However, the overall approach is good. You're on the right track.
Here are my general comments for the implementation of parallel quicksort. Quicksort itself is embarrassingly parallel, which means no synchronization is needed. The recursive calls of qs on a partitioned array is embarrassingly parallel.
However, the parallelism is exposed in a recursive form. If you simply use the nested parallelism in OpenMP, you will end up having thousand threads in a second. No speedup will be gained. So, mostly you need to turn the recursive algorithm into an interative one. Then, you need to implement a sort of work-queue. This is your approach. And, it's not easy.
For your approach, there is a good benchmark: OmpSCR. You can download at http://sourceforge.net/projects/ompscr/
In the benchmark, there are several versions of OpenMP-based quicksort. Most of them are similar to yours. However, to increase parallelism, one must minimize the contention on a global queue (in your code, it's s). So, there could be a couple of optimizations such as having local queues. Although the algorithm itself is purely parallel, the implementation may require synchronization artifacts. And, most of all, it's very hard to gain speedups.
However, you still directly use recursive parallelism in OpenMP in two ways: (1) Throttling the total number of the threads, and (2) Using OpenMP 3.0's task.
Here is pseudo code for the first approach (This is only based on OmpSCR's benchmark):
void qsort_omp_recursive(int* begin, int* end)
{
if (begin != end) {
// Partition ...
// Throttling
if (...) {
qsort_omp_recursive(begin, middle);
qsort_omp_recursive(++middle, ++end);
} else {
#pragma omp parallel sections nowait
{
#pragma omp section
qsort_omp_recursive(begin, middle);
#pragma omp section
qsort_omp_recursive(++middle, ++end);
}
}
}
}
In order to run this code, you need to call omp_set_nested(1) and omp_set_num_threads(2). The code is really simple. We simply spawn two threads on the division of the work. However, we insert a simple throttling logic to prevent excessive threads. Note that my experimentation showed decent speedups for this approach.
Finally, you may use OpenMP 3.0's task, where a task is a logically concurrent work. In the above all OpenMP's approaches, each parallel construct spawns two physical threads. You may say there is a hard 1-to-1 mapping between a task to a work thread. However, task separates logical tasks and workers.
Because OpenMP 3.0 is not popular yet, I will use Cilk Plus, which is great to express this kind of nested and recursive parallelisms. In Cilk Plus, the parallelization is extremely easy:
void qsort(int* begin, int* end)
{
if (begin != end) {
--end;
int* middle = std::partition(begin, end,
std::bind2nd(std::less<int>(), *end));
std::swap(*end, *middle);
cilk_spawn qsort(begin, middle);
qsort(++middle, ++end);
// cilk_sync; Only necessay at the final stage.
}
}
I copied this code from Cilk Plus' example code. You will see a single keyword cilk_spawn is everything to parallelize quicksort. I'm skipping the explanations of Cilk Plus and spawn keyword. However, it's easy to understand: the two recursive calls are declared as logically concurrent tasks. Whenever the recursion takes place, the logical tasks are created. But, the Cilk Plus runtime (which implements an efficient work-stealing scheduler) will handle all kinds of dirty job. It optimally queues the parallel tasks and maps to the work threads.
Note that OpenMP 3.0's task is essentially similar to the Cilk Plus's approach. My experimentation shows that pretty nice speedups were feasible. I got a 3~4x speedup on a 8-core machine. And, the speedup was scale. Cilk Plus' absolute speedups are greater than those of OpenMP 3.0's.
The approach of Cilk Plus (and OpenMP 3.0) and your approach are essentially the same: the separation of parallel task and workload assignment. However, it's very difficult to implement efficiently. For example, you must reduce the contention and use lock-free data structures.