I am studying OpenMP and have written an implementation of shaker (cocktail) sort. There are two consecutive loops here, and to make them run one after the other I added locks in the form of omp_init_lock and omp_destroy_lock, but the result is still incorrect. Please tell me how to parallelize two consecutive loops. My code is below:
int Left, Right;
Left = 1;
Right = ARR_SIZE;
while (Left <= Right)
{
    omp_init_lock(&lock);
    #pragma omp parallel reduction(+:Left) num_threads(4)
    {
        #pragma omp for
        for (int i = Right; i >= Left; i--) {
            if (Arr[i - 1] > Arr[i]) {
                int temp;
                temp = Arr[i];
                Arr[i] = Arr[i - 1];
                Arr[i - 1] = temp;
            }
        }
        Left++;
    }
    omp_destroy_lock(&lock);
    omp_init_lock(&lock);
    #pragma omp parallel reduction(+:Right) num_threads(4)
    {
        #pragma omp for
        for (int i = Left; i <= Right; i++) {
            if (Arr[i - 1] > Arr[i]) {
                int temp;
                temp = Arr[i];
                Arr[i] = Arr[i - 1];
                Arr[i - 1] = temp;
            }
        }
        Right--;
    }
    omp_destroy_lock(&lock);
}
You can only make something an omp for if the iterations are independent. Yours clearly aren't.
Your locks serve no purpose. Two parallel regions are always done in sequence. So you can remove the locks.
You seem to have several misconceptions about how OpenMP works.
Two parallel sections don't execute in parallel. This is fork-join parallelism. The parallel section itself is executed by multiple threads which then join back up at the end of the parallel section.
Your code looks like you expected them to work like pragma omp sections. Side note: Unless you have absolutely no other choice and/or you know exactly what you are doing, don't use sections. They don't scale well.
Your use of the lock API is wrong. omp_init_lock initializes a lock object. It doesn't acquire it. Likewise the destroy function deallocates it, it doesn't release the lock. If you ever want to acquire a lock, use omp_set_lock and omp_unset_lock on locks that you initialize once before you enter a parallel section.
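For reference, a minimal sketch of correct lock usage: initialize the lock once before the parallel region, acquire and release it inside, and destroy it afterwards.

omp_lock_t lock;
omp_init_lock(&lock);               // allocate and initialize the lock once
#pragma omp parallel
{
    omp_set_lock(&lock);            // blocks until this thread owns the lock
    // ... code that must not run concurrently ...
    omp_unset_lock(&lock);          // release the lock for other threads
}
omp_destroy_lock(&lock);            // deallocate the lock when no thread needs it anymore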
Generally speaking, if you need a lock for an extended section of your code, it will not parallelize. Read up on Amdahl's law. Locks are only useful if used rarely or if the chance of two threads competing for the same lock at the same time is low.
Your code contains race conditions. Since you used pragma omp for, two different threads may execute the i'th and (i-1)'th iteration at the same time. That means they will touch the same integers. That's undefined behavior and will lead to them stepping on each other's toes, so to speak.
I have no idea what you wanted to do with those reductions.
How to solve this
Well, traditional shaker sort cannot work in parallel because within one iteration of the outer loop, an element may travel the whole distance to the end of the range. That requires an amount of inter-thread coordination that is infeasible.
What you can do is a variation of bubble sort where each thread looks at two values and swaps them. Move this window back and forth and values will slowly travel towards their correct position.
This should work:
#include <utility>
// using std::swap

void shake_sort(int* arr, int n) noexcept
{
    using std::swap;
    const int even_to_odd = n / 2;
    const int odd_to_even = (n - 1) / 2;
    bool any_swap;

    do {
        any_swap = false;
        # pragma omp parallel for reduction(|:any_swap)
        for(int i = 0; i < even_to_odd; ++i) {
            int left = i * 2;
            int right = left + 1;
            if(arr[left] > arr[right]) {
                swap(arr[left], arr[right]);
                any_swap = true;
            }
        }
        # pragma omp parallel for reduction(|:any_swap)
        for(int i = 0; i < odd_to_even; ++i) {
            int left = i * 2 + 1;
            int right = left + 1;
            if(arr[left] > arr[right]) {
                swap(arr[left], arr[right]);
                any_swap = true;
            }
        }
    } while(any_swap);
}
Note how you can't exclude the left and right border because one outer iteration cannot guarantee that the value there is correct.
Other remarks:
Others have already commented on how std::swap makes the code more readable
You don't need to specify num_threads. OpenMP can figure this out itself
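For example, a quick (hypothetical) call of the function above:

int data[] = { 5, 3, 8, 1, 9, 2 };
shake_sort(data, 6);   // data is now { 1, 2, 3, 5, 8, 9 }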
I'm new to OpenMP and multi-threading.
I have been given a task to run a method with static, dynamic, and guided scheduling without using an OpenMP for loop, which means I can't use the schedule clauses.
I could create parallel threads with parallel and assign loop iterations to the threads equally,
but how do I make it static, dynamic (with a block size of 1000), and guided?
void static_scheduling_function(const int start_count,
                                const int upper_bound,
                                int *results)
{
    int i, tid, numt;
    #pragma omp parallel private(i,tid)
    {
        int from, to;
        tid = omp_get_thread_num();
        numt = omp_get_num_threads();
        from = (upper_bound / numt) * tid;
        to = (upper_bound / numt) * (tid + 1) - 1;
        if (tid == numt - 1)
            to = upper_bound - 1;
        for (i = from; i < to; i++)
        {
            //compute one iteration (i)
            int start = i;
            int end = i + 1;
            compute_iterations(start, end, results);
        }
    }
}
======================================
For dynamic I have tried something like this:
void chunk_scheduling_function(const int start_count, const int upper_bound, int* results) {
    int numt, shared_lower_iteration_counter = start_count;
    for (int shared_lower_iteration_counter = start_count; shared_lower_iteration_counter < upper_bound;) {
        #pragma omp parallel shared(shared_lower_iteration_counter)
        {
            int tid = omp_get_thread_num();
            int from, to;
            int chunk = 1000;
            #pragma omp critical
            {
                from = shared_lower_iteration_counter;           // 10, 1010
                to = (shared_lower_iteration_counter + chunk);   // 1010,
                shared_lower_iteration_counter = shared_lower_iteration_counter + chunk; // 1100 // critical is important while incrementing the shared variable which decides the next iteration
            }
            for (int i = from; (i < to && i < upper_bound); i++) { // 10 to 1009, i < upper_bound prevents other threads from executing the call
                int start = i;
                int end = i + 1;
                compute_iterations(start, end, results);
            }
        }
    }
}
This looks like a university assignment (and a very good one IMO), so I will not provide the complete solution; instead I will point out what you should be looking for.
The static scheduler looks okay; nonetheless, it can be improved by taking the chunk size into account as well.
The dynamic and guided schedulers can be implemented using a variable (let us name it shared_iteration_counter) that marks the next loop iteration to be picked up by the threads. When a thread needs a new task to work on (i.e., a new loop iteration), it queries that variable. In pseudo code it would look like the following:
int thread_current_iteration = shared_iteration_counter++;
while(thread_current_iteration < MAX_SIZE)
{
    // do work
    thread_current_iteration = shared_iteration_counter++;
}
The pseudo code assumes a chunk size of 1 (i.e., shared_iteration_counter++); you will have to adapt it to your use-case. Now, because that variable will be shared among threads, and every thread will be updating it, you need to ensure mutual exclusion during updates of that variable. Fortunately, OpenMP offers means to achieve that, for instance #pragma omp critical, explicit locks, and atomic operations. The latter is the best option for your use-case:
#pragma omp atomic
shared_iteration_counter = shared_iteration_counter + 1;
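Note that the pseudo code above also needs the old value of the counter when incrementing it. If your OpenMP supports it (3.1 or later), the capture form of the atomic construct does both in one step; a minimal sketch with chunk size 1:

int thread_current_iteration;
#pragma omp atomic capture
thread_current_iteration = shared_iteration_counter++;  // atomically read the old value and increment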
For the guided scheduler:
Similar to dynamic scheduling, but the chunk size starts off large and
decreases to better handle load imbalance between iterations. The
optional chunk parameter specifies the minimum chunk size to use. By
default the chunk size is approximately loop_count/number_of_threads.
In this case, not only do you have to guarantee mutual exclusion on the variable that counts the current loop iteration to be picked up by threads, but you also have to guarantee mutual exclusion on the chunk size variable, since it changes as well.
Without giving too much away, bear in mind that you may need to consider how to deal with edge cases such as thread_current_iteration = 1000 and chunk_size = 1000 with MAX_SIZE = 1500. In that case thread_current_iteration + chunk_size > MAX_SIZE, but there are still 500 iterations left to be computed.
I am looking for a better way to cancel my threads.
In my approach, I use a shared variable, and if this variable is set, I just issue a continue. This finishes my threads quickly, but the threads theoretically keep spawning and ending, which seems not elegant.
So, is there a better way to solve the issue (break is not supported by my OpenMP)?
I have to work with Visual Studio, so my OpenMP library is outdated and there is no way around that. Consequently, I think #pragma omp cancel will not work.
int progress_state = RunExport;
#pragma omp parallel
{
    #pragma omp for
    for (int k = 0; k < foo.z; k++)
        for (int j = 0; j < foo.y; j++)
            for (int i = 0; i < foo.x; i++) {
                if (progress_state == StopExport) {
                    continue;
                }
                // do some fancy shit
                // yeah here is a condition for speed due to the critical
                #pragma omp critical
                if (condition) {
                    progress_state = StopExport;
                }
            }
}
You should do it the simple way of "just continue in all remaining iterations if cancellation is requested". That can just be the first check in the outermost loop (and given that you have several nested loops, that will probably not have any measurable overhead).
std::atomic<int> progress_state = RunExport;

// You could just write #pragma omp parallel for instead of these two nested blocks.
#pragma omp parallel
{
    #pragma omp for
    for (int k = 0; k < foo.z; k++)
    {
        if (progress_state == StopExport)
            continue;

        for (int j = 0; j < foo.y; j++)
        {
            // You can add break statements in these inner loops.
            // OMP only parallelizes the outermost loop (at least given the way you wrote this)
            // so it won't care here.
            for (int i = 0; i < foo.x; i++)
            {
                // ...
                if (condition) {
                    progress_state = StopExport;
                }
            }
        }
    }
}
Generally speaking, OMP will not suddenly spawn new threads or end existing ones, especially not within one parallel region. This means there is little overhead associated with running a few more tiny iterations. This is even more true given that the default scheduling in your case is most likely static, meaning that each thread knows its start and end index right away. Other scheduling modes would have to call into the OMP runtime every iteration (or every few iterations) to request more work, but that won't happen here. The compiler will basically see this code for the threaded work:
// Not real omp functions.
int myStart = __omp_static_for_my_start();
int myEnd = __omp_static_for_my_end();

for (int k = myStart; k < myEnd; ++k)
{
    if (progress_state == StopExport)
        continue;
    // etc.
}
You might try a non-atomic thread-local "should I cancel?" flag that starts as false and can only be changed to true (which the compiler may understand and fold into the loop condition). But I doubt you will see significant overhead either way, at least on x86 where int is atomic anyway.
which seems not elegant
OMP 2.0 does not exactly shine with respect to elegance. I mean, iterating over a std::vector requires at least one static_cast to silence signed -> unsigned conversion warnings. So unless you have specific evidence of this pattern causing a performance problem, there is little reason not to use it.
I have the following recursive function (NOTE: it is stripped of all unimportant details):
int recursion(...) {
    int minimum = INFINITY;
    for(int i=0; i<C; i++) {
        int foo = recursion(...);
        if (foo < minimum) {
            minimum = foo;
        }
    }
    return minimum;
}
Note 2: It is finite, but not in this simplified example, so please ignore that. The point of this question is how to approach this problem correctly.
I was thinking about using tasks, but I am not sure how to use them correctly - how to parallelize the inner loop.
EDIT 1: The recursion tree isn't well balanced. It is being used with a dynamic programming approach, so as time goes on, a lot of values are re-used from previous passes. This worries me a lot and I think it will be a big bottleneck.
C is somewhere around 20.
The metric for "best" is fastest :)
It will run on 2x Xeon, so there is plenty of HW power available.
Yes, you can use OpenMP tasks to exploit parallelism on multiple recursion levels and to ensure that imbalances don't cause wasted cycles.
I would collect the results in a vector and compute the minimum outside. You could also perform a guarded (critical / lock) minimum computation within the task.
Avoid spawning tasks / allocating memory for the minimum if you are too deep in the recursion, where the overhead / work ratio becomes too bad. The strongest solution is to create two separate (parallel / serial) recursive functions. That way you have zero runtime overhead once you switch to the serial function - as opposed to checking the recursion depth against a threshold every time in a unified function.
int recursion(...) {
    int result;
    // A return statement must not branch out of the parallel region,
    // so store the result and return it afterwards.
    #pragma omp parallel
    #pragma omp single
    result = recursion_par(..., 0);
    return result;
}

int recursion_ser(...) {
    int minimum = INFINITY;
    for(int i=0; i<C; i++) {
        int foo = recursion_ser(...);
        if (foo < minimum) {
            minimum = foo;
        }
    }
    return minimum;
}

int recursion_par(..., int depth) {
    std::vector<int> foos(C);
    for(int i=0; i<C; i++) {
        // foos must be shared: by default it would be firstprivate in the task,
        // so the results would be written into a copy and lost.
        #pragma omp task shared(foos)
        {
            if (depth < threshold) {
                foos[i] = recursion_par(..., depth + 1);
            } else {
                foos[i] = recursion_ser(...);
            }
        }
    }
    #pragma omp taskwait
    return *std::min_element(std::begin(foos), std::end(foos));
}
Obviously you must not do any nasty things with global / shared state within the unimportant details.
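As for the guarded (critical / lock) minimum computation mentioned above, a minimal sketch of that variant could look like this; it avoids the vector at the cost of one critical section per task:

int recursion_par(..., int depth) {
    int minimum = INFINITY;
    for(int i=0; i<C; i++) {
        #pragma omp task shared(minimum)
        {
            int foo = (depth < threshold) ? recursion_par(..., depth + 1)
                                          : recursion_ser(...);
            // Guard the update of the shared minimum.
            #pragma omp critical
            if (foo < minimum) {
                minimum = foo;
            }
        }
    }
    #pragma omp taskwait
    return minimum;
}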
I want to parallelize the following function with OpenMP:
void calculateAll() {
    int k;
    int nodeId1, minCost1, lowerLimit1, upperLimit8;

    for (k = mostUpperLevel; k > 0; k--) {
        int myStart = borderNodesArrayStartGlobal[k - 1];
        int size = myStart + borderNodesArraySizeGlobal[k - 1];
        /* this loop may be parallel */
        for (nodeId1 = myStart; nodeId1 < size; nodeId1++) {
            if (getNodeScanned(nodeId1)) {
                setNodeScannedFalse(nodeId1);
            } else {
                minCost1 = myMax;
                lowerLimit1 = getNode3LevelsDownAll(nodeId1);
                upperLimit8 = getUpperLimit3LevelsDownAll(nodeId1);
                changeNodeValue(nodeId1, lowerLimit1, upperLimit8, minCost1, minCost1);
            }
        }
    }

    int myStart = restNodesArrayStartGlobal;
    int size = myStart + restNodesArraySizeGlobal;
    /* this loop may also be parallel */
    for (nodeId1 = myStart; nodeId1 < size; nodeId1++) {
        if (getNodeScanned(nodeId1)) {
            setNodeScannedFalse(nodeId1);
        } else {
            minCost1 = myMax;
            lowerLimit1 = getNode3LevelsDownAll(nodeId1);
            upperLimit8 = getUpperLimit3LevelsDownAll(nodeId1);
            changeNodeValue(nodeId1, lowerLimit1, upperLimit8, minCost1, minCost1);
        }
    }
}
Although I can use "#pragma omp parallel for" on the two inner loops, the code is too slow due to the constant overhead of creating new threads. Is there a way to separate "#pragma omp parallel", so that at the beginning of the function I create the necessary threads and then use "#pragma omp for" to get the best possible results? I am using gcc 4.6.
Thanks in advance
The creation of the threads is normally not the bottleneck in OpenMP programs; it is the distribution of the work to the threads. The threads are actually created at the first parallel region (e.g., your first #pragma omp parallel for); you can verify that with a profiler like VTune. At each loop the work is assigned to the threads, and this assignment is often the problem, as it is a costly operation.
However, you should try to play around with the schedulers, as this might have a big impact on performance. E.g., play with schedule(dynamic, chunksize) vs. schedule(static, chunksize), and also try different chunk sizes.
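That said, the structure the question asks about is possible: open one parallel region at the top of the function and put a #pragma omp for on each inner loop, so the threads are created once and reused. A minimal sketch, assuming the node helper functions (getNodeScanned, changeNodeValue, etc.) are safe to call concurrently for different nodes:

void calculateAll() {
    #pragma omp parallel
    {
        for (int k = mostUpperLevel; k > 0; k--) {
            int myStart = borderNodesArrayStartGlobal[k - 1];
            int size = myStart + borderNodesArraySizeGlobal[k - 1];
            // Every thread executes the k loop; only the node loop below is divided up.
            // The implicit barrier at the end of the "omp for" keeps the levels in order.
            #pragma omp for
            for (int nodeId1 = myStart; nodeId1 < size; nodeId1++) {
                if (getNodeScanned(nodeId1)) {
                    setNodeScannedFalse(nodeId1);
                } else {
                    int minCost1 = myMax;
                    int lowerLimit1 = getNode3LevelsDownAll(nodeId1);
                    int upperLimit8 = getUpperLimit3LevelsDownAll(nodeId1);
                    changeNodeValue(nodeId1, lowerLimit1, upperLimit8, minCost1, minCost1);
                }
            }
        }

        int myStart = restNodesArrayStartGlobal;
        int size = myStart + restNodesArraySizeGlobal;
        #pragma omp for
        for (int nodeId1 = myStart; nodeId1 < size; nodeId1++) {
            if (getNodeScanned(nodeId1)) {
                setNodeScannedFalse(nodeId1);
            } else {
                int minCost1 = myMax;
                int lowerLimit1 = getNode3LevelsDownAll(nodeId1);
                int upperLimit8 = getUpperLimit3LevelsDownAll(nodeId1);
                changeNodeValue(nodeId1, lowerLimit1, upperLimit8, minCost1, minCost1);
            }
        }
    }
}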
Once again I'm stuck when using OpenMP in C++. This time I'm trying to implement a parallel quicksort.
Code:
#include <iostream>
#include <vector>
#include <stack>
#include <utility>
#include <omp.h>
#include <stdio.h>

#define SWITCH_LIMIT 1000

using namespace std;

template <typename T>
void insertionSort(std::vector<T> &v, int q, int r)
{
    int key, i;
    for(int j = q + 1; j <= r; ++j)
    {
        key = v[j];
        i = j - 1;
        while( i >= q && v[i] > key )
        {
            v[i+1] = v[i];
            --i;
        }
        v[i+1] = key;
    }
}
stack<pair<int,int> > s;

template <typename T>
void qs(vector<T> &v, int q, int r)
{
    T pivot;
    int i = q - 1, j = r;

    //switch to insertion sort for small data
    if(r - q < SWITCH_LIMIT)
    {
        insertionSort(v, q, r);
        return;
    }

    pivot = v[r];
    while(true)
    {
        while(v[++i] < pivot);
        while(v[--j] > pivot);
        if(i >= j) break;
        std::swap(v[i], v[j]);
    }
    std::swap(v[i], v[r]);

    #pragma omp critical
    {
        s.push(make_pair(q, i - 1));
        s.push(make_pair(i + 1, r));
    }
}
int main()
{
    int n, x;
    int numThreads = 4, numBusyThreads = 0;
    bool *idle = new bool[numThreads];
    for(int i = 0; i < numThreads; ++i)
        idle[i] = true;
    pair<int, int> p;
    vector<int> v;

    cin >> n;
    for(int i = 0; i < n; ++i)
    {
        cin >> x;
        v.push_back(x);
    }
    cout << v.size() << endl;
    s.push(make_pair(0, v.size()));

    #pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads, p)
    {
        bool done = false;
        while(!done)
        {
            int id = omp_get_thread_num();
            #pragma omp critical
            {
                if(s.empty() == false && numBusyThreads < numThreads)
                {
                    ++numBusyThreads;
                    //the current thread is not idle anymore
                    //it will get the interval [q, r] from stack
                    //and run qs on it
                    idle[id] = false;
                    p = s.top();
                    s.pop();
                }
                if(numBusyThreads == 0)
                {
                    done = true;
                }
            }
            if(idle[id] == false)
            {
                qs(v, p.first, p.second);
                idle[id] = true;
                #pragma omp critical
                --numBusyThreads;
            }
        }
    }
    return 0;
}
Algorithm:
To use OpenMP for a recursive function, I used a stack to keep track of the next intervals on which the qs function should run. I manually add the 1st interval [0, size] and then let the threads get to work when a new interval is added to the stack.
The problem:
The program ends too early, not sorting the array, after creating the 1st set of intervals ([q, i - 1], [i + 1, r] if you look at the code). My guess is that the threads which get the work consider the local variables of the quicksort function (qs in the code) shared by default, so they mess them up and add no interval to the stack.
How I compile:
g++ -o qs qs.cc -Wall -fopenmp
How I run:
./qs < in_100000 > out_100000
where in_100000 is a file containing 100000 on the 1st line, followed by 100k integers on the next line separated by spaces.
I am using gcc 4.5.2 on Linux.
Thank you for your help,
Dan
I didn't actually run your code, but I see an immediate mistake on p, which should be private, not shared. The parallel invocation of qs (qs(v, p.first, p.second);) will have races on p, resulting in unpredictable behavior. The local variables in qs should be okay because all threads have their own stack. However, the overall approach is good. You're on the right track.
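For example, a minimal sketch of that one change - declaring p inside the parallel region so each thread gets its own copy:

#pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads)
{
    pair<int, int> p;   // now private to each thread
    bool done = false;
    while(!done)
    {
        // ... body unchanged ...
    }
}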
Here are my general comments for the implementation of parallel quicksort. Quicksort itself is embarrassingly parallel, which means no synchronization is needed: the recursive calls of qs on a partitioned array are embarrassingly parallel.
However, the parallelism is exposed in a recursive form. If you simply use the nested parallelism in OpenMP, you will end up having thousands of threads in a second, and no speedup will be gained. So, mostly you need to turn the recursive algorithm into an iterative one. Then, you need to implement a sort of work queue. This is your approach, and it's not easy.
For your approach, there is a good benchmark: OmpSCR. You can download it at http://sourceforge.net/projects/ompscr/
In the benchmark, there are several versions of OpenMP-based quicksort. Most of them are similar to yours. However, to increase parallelism, one must minimize the contention on a global queue (in your code, it's s). So, there could be a couple of optimizations such as having local queues. Although the algorithm itself is purely parallel, the implementation may require synchronization artifacts. And, most of all, it's very hard to gain speedups.
However, you can still directly use recursive parallelism in OpenMP in two ways: (1) throttling the total number of threads, and (2) using OpenMP 3.0's task.
Here is pseudo code for the first approach (This is only based on OmpSCR's benchmark):
void qsort_omp_recursive(int* begin, int* end)
{
    if (begin != end) {
        // Partition ...

        // Throttling
        if (...) {
            qsort_omp_recursive(begin, middle);
            qsort_omp_recursive(++middle, ++end);
        } else {
            #pragma omp parallel sections
            {
                #pragma omp section
                qsort_omp_recursive(begin, middle);
                #pragma omp section
                qsort_omp_recursive(++middle, ++end);
            }
        }
    }
}
In order to run this code, you need to call omp_set_nested(1) and omp_set_num_threads(2). The code is really simple. We simply spawn two threads on the division of the work. However, we insert a simple throttling logic to prevent excessive threads. Note that my experimentation showed decent speedups for this approach.
Finally, you may use OpenMP 3.0's task, where a task is a unit of logically concurrent work. In all the above OpenMP approaches, each parallel construct spawns two physical threads; you may say there is a hard 1-to-1 mapping between a task and a worker thread. However, task separates logical tasks from workers.
Because OpenMP 3.0 is not popular yet, I will use Cilk Plus, which is great for expressing this kind of nested and recursive parallelism. In Cilk Plus, the parallelization is extremely easy:
void qsort(int* begin, int* end)
{
    if (begin != end) {
        --end;
        int* middle = std::partition(begin, end,
                                     std::bind2nd(std::less<int>(), *end));
        std::swap(*end, *middle);
        cilk_spawn qsort(begin, middle);
        qsort(++middle, ++end);
        // cilk_sync; Only necessary at the final stage.
    }
}
I copied this code from Cilk Plus' example code. You will see that a single keyword, cilk_spawn, is everything needed to parallelize quicksort. I'm skipping the explanations of Cilk Plus and the spawn keyword. However, it's easy to understand: the two recursive calls are declared as logically concurrent tasks. Whenever the recursion takes place, the logical tasks are created, and the Cilk Plus runtime (which implements an efficient work-stealing scheduler) handles all the dirty work: it queues the parallel tasks and maps them to the worker threads.
Note that OpenMP 3.0's task is essentially similar to Cilk Plus' approach. My experimentation shows that pretty nice speedups were feasible: I got a 3~4x speedup on an 8-core machine, and the speedup scaled. Cilk Plus' absolute speedups were greater than OpenMP 3.0's.
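For completeness, a minimal sketch of what the OpenMP 3.0 task version of the same idea could look like, mirroring the Cilk Plus code above and reusing the question's SWITCH_LIMIT as a cutoff below which it falls back to a plain serial sort:

#include <algorithm>
#include <functional>
#include <utility>

void qsort_task(int* begin, int* end)
{
    if (end - begin < SWITCH_LIMIT) {   // small ranges: sort serially
        std::sort(begin, end);
        return;
    }
    --end;
    int* middle = std::partition(begin, end,
                                 std::bind2nd(std::less<int>(), *end));
    std::swap(*end, *middle);
    #pragma omp task       // logically concurrent, like cilk_spawn
    qsort_task(begin, middle);
    qsort_task(middle + 1, end + 1);
    #pragma omp taskwait   // like cilk_sync
}

// Called once from inside a parallel region, e.g.:
// #pragma omp parallel
// #pragma omp single
// qsort_task(&v[0], &v[0] + v.size());

The pointers begin and middle are firstprivate in the task by default, which is exactly what is needed here; only the array contents are shared.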
The approach of Cilk Plus (and OpenMP 3.0) and your approach are essentially the same: the separation of parallel tasks from workload assignment. However, it's very difficult to implement efficiently; for example, you must reduce contention and use lock-free data structures.