OpenMP: Include atomic section into parallel region declaration - c++

I have a parallel region where I monitor the progress. This means I use the variable iteration to calculate the current state of the loop (percentage: 0 - 100 until the loop is finished).
For this I increment with an atomic operation. Is there a way to make the code shorter, maybe by including iteration++ into the #pragma omp parallel for clause?
int iteration = 0;
#pragma omp parallel for
for (int64_t ip = 0; ip < num_voxels; ip++)
{
// calc stuff
#pragma omp atomic
iteration++;
// output stuff
// if thread == 0:
// Progress(iteration / num_voxels * 100);
}

I don't think it's possible to increment iteration elsewhere than inside the loop body. For instance, this is not allowed:
std::atomic<int> iteration{0};
#pragma omp parallel for
for (int64_t ip = 0; ip < num_voxels; ip++, iteration++) { ...
since OpenMP requires the so-called canonical loop form, in which the increment expression may not update multiple variables (see Section 2.6 of the OpenMP 4.5 Specification).
Also, I would strongly advise against incrementing iteration in every loop iteration, since it would be very inefficient (atomic memory operations mean memory fences and cache contention).
I would prefer, e.g.:
int64_t iteration = 0;
int64_t local_iteration = 0;
#pragma omp parallel for firstprivate(local_iteration)
for (int64_t ip = 0; ip < num_voxels; ip++) {
... // calc stuff
if (++local_iteration % 1024 == 0) { // modulo using bitwise AND
#pragma omp atomic
iteration += 1024;
}
// output stuff
// if thread == 0:
// Progress(iteration * 100 / num_voxels); // multiply first: integer division would truncate to 0
}
Also, output only when the progress percentage changes. This can be tricky as well, since you need to read iteration atomically and you likely don't want to do that in every iteration. A possible solution, which also saves many cycles otherwise spent on "expensive" integer division:
int64_t iteration = 0;
int64_t local_iteration = 0;
int64_t last_progress = 0;
#pragma omp parallel for firstprivate(local_iteration)
for (int64_t ip = 0; ip < num_voxels; ip++) {
... // calc stuff
if (++local_iteration % 1024 == 0) { // modulo using bitwise AND
#pragma omp atomic
iteration += 1024;
// output stuff:
if (omp_get_thread_num() == 0) {
int64_t progress;
#pragma omp atomic read
progress = iteration;
progress = progress * 100 / num_voxels; // multiply first: integer division would truncate to 0
if (progress != last_progress) {
Progress(progress);
last_progress = progress;
}
}
}
}
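The batching idea above can be put into a self-contained sketch. The function name count_processed and the batch parameter are illustrative assumptions, not part of the original code, and the per-voxel work is left out. Unlike the snippets above, each thread also flushes its leftover count after the loop, so the shared counter ends at exactly num_voxels even when a thread's share is not a multiple of the batch size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: runs the loop and returns the final value of the
// shared progress counter.
int64_t count_processed(int64_t num_voxels, int64_t batch = 1024) {
    assert(batch > 0);
    int64_t iteration = 0;  // shared progress counter
    #pragma omp parallel
    {
        int64_t local_iteration = 0;  // private to each thread
        #pragma omp for nowait
        for (int64_t ip = 0; ip < num_voxels; ip++) {
            // ... calc stuff ...
            if (++local_iteration == batch) {
                #pragma omp atomic
                iteration += batch;  // one atomic update per full batch
                local_iteration = 0;
            }
        }
        // Flush whatever is left of the last, incomplete batch.
        #pragma omp atomic
        iteration += local_iteration;
    }
    return iteration;
}
```

The code is also valid when compiled without OpenMP support, in which case the pragmas are ignored and the loop runs serially.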

Related

How to parallelize two serial for loops such that the work of both loops is distributed over the threads

I have written the below code to parallelize two 'for' loops.
#include <iostream>
#include <omp.h>
#define SIZE 100
int main()
{
int arr[SIZE];
int sum = 0;
int i, tid, numt, prod;
double t1, t2;
for (i = 0; i < SIZE; i++)
arr[i] = 0;
t1 = omp_get_wtime();
#pragma omp parallel private(tid, prod)
{
tid = omp_get_thread_num();
numt = omp_get_num_threads();
std::cout << "Tid: " << tid << " Thread: " << numt << std::endl;
#pragma omp for reduction(+: sum)
for (i = 0; i < 50; i++) {
prod = arr[i]+1;
sum += prod;
}
#pragma omp for reduction(+: sum)
for (i = 50; i < SIZE; i++) {
prod = arr[i]+1;
sum += prod;
}
}
t2 = omp_get_wtime();
std::cout << "Time taken: " << (t2 - t1) << ", Parallel sum: " << sum << std::endl;
return 0;
}
In this case the execution of 1st 'for' loop is done in parallel by all the threads and the result is accumulated in sum variable. After the execution of the 1st 'for' loop is done, threads start executing the 2nd 'for' loop in parallel and the result is accumulated in sum variable. In this case clearly the execution of the 2nd 'for' loop waits for the execution of the 1st 'for' loop to get over.
I want to do the processing of the two 'for' loop simultaneously over threads. How can I do that? Is there any other way I can write this code more efficiently. Ignore the dummy work that I am doing inside the 'for' loop.
You can declare the loops nowait and move the reduction to the parallel directive, so that it is performed once at the end of the parallel region. Something like this:
# pragma omp parallel private(tid, prod) reduction(+: sum)
{
# pragma omp for nowait
for (i = 0; i < 50; i++) {
prod = arr[i]+1;
sum += prod;
}
# pragma omp for nowait
for (i = 50; i < SIZE; i++) {
prod = arr[i]+1;
sum += prod;
}
}
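A self-contained variant of this nowait approach might look as follows. The function name sum_two_loops and the split at the middle of the array are assumptions made for illustration; the dummy work arr[i] + 1 is taken from the question.

```cpp
#include <vector>

// Hypothetical helper: sums (arr[i] + 1) over the whole array using two
// worksharing loops inside one parallel region.
int sum_two_loops(const std::vector<int>& arr) {
    const long n = static_cast<long>(arr.size());
    const long half = n / 2;
    int sum = 0;
    // The reduction lives on the parallel directive: every thread owns a
    // private sum for the whole region, combined once at its end.
    #pragma omp parallel reduction(+ : sum)
    {
        #pragma omp for nowait
        for (long i = 0; i < half; i++) sum += arr[i] + 1;
        // No barrier above: a thread that finishes early continues here.
        #pragma omp for nowait
        for (long i = half; i < n; i++) sum += arr[i] + 1;
    }
    return sum;
}
```

Because each thread accumulates into its own private copy of sum in both loops, no synchronization is needed between the loops.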
If you use #pragma omp for nowait, all threads are assigned to the first loop, and a thread starts on the second loop only once it has finished its share of the first. Unfortunately, there is no way to tell the omp for construct to use, e.g., only half of the threads.
Fortunately, there is a way to run the two loops in parallel by using tasks. The following code uses the taskloop construct with the num_tasks clause to split each loop into n/2 tasks, so that roughly half of the threads work on the first loop and the other half on the second. The nogroup clause on the first taskloop removes its implicit taskgroup, which would otherwise make the generating thread wait for the first loop before creating the tasks of the second. You have to test which solution is faster in your case.
#pragma omp parallel
#pragma omp single
{
int n=omp_get_num_threads(); // num_tasks(n/2) assumes at least 2 threads
// nogroup: do not wait for these tasks before creating those of the second loop
#pragma omp taskloop num_tasks(n/2) nogroup
for (int i = 0; i < 50; i++) {
//do something
}
#pragma omp taskloop num_tasks(n/2)
for (int i = 50; i < SIZE; i++) {
//do something
}
}
UPDATE: The first paragraph is not entirely correct: by changing the chunk_size you have some control over how many threads are used in the first loop. It can be done by using, e.g., the schedule(static, chunk_size) clause. So I thought setting the chunk_size would do the trick:
#pragma omp parallel
{
int n=omp_get_num_threads();
#pragma omp single
printf("num_threads=%d\n",n);
#pragma omp for schedule(static,2) nowait
for (int i = 0; i < 4; i++) {
printf("thread %d running 1st loop\n", omp_get_thread_num());
}
#pragma omp for schedule(static,2)
for (int i = 4; i < SIZE; i++) {
printf("thread %d running 2nd loop\n", omp_get_thread_num());
}
}
BUT at first the result seems surprising:
num_threads=4
thread 0 running 1st loop
thread 0 running 1st loop
thread 0 running 2nd loop
thread 0 running 2nd loop
thread 1 running 1st loop
thread 1 running 1st loop
thread 1 running 2nd loop
thread 1 running 2nd loop
What is going on? Why are threads 2 and 3 not used? The OpenMP runtime guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread receives exactly the same iteration ranges in both loops.
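The output is easier to follow once the assignment rule of schedule(static, chunk_size) is written down: the iterations are cut into chunks of chunk_size, and the chunks are handed to the threads round-robin in thread-number order. The helper below is a hypothetical illustration of that rule, not an OpenMP API. With 4 iterations, a chunk size of 2 and 4 threads there are only two chunks, so only threads 0 and 1 receive work in the first loop.

```cpp
#include <cassert>

// Hypothetical helper: number of the thread that executes iteration `i`
// (counted from the start of the loop) under schedule(static, chunk).
int static_owner(int i, int chunk, int num_threads) {
    assert(i >= 0 && chunk > 0 && num_threads > 0);
    return (i / chunk) % num_threads;  // chunk index, dealt out round-robin
}
```

For the first loop above, static_owner(0, 2, 4) and static_owner(1, 2, 4) give thread 0, while static_owner(2, 2, 4) and static_owner(3, 2, 4) give thread 1.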
On the other hand, the result of using the schedule(dynamic,2) clause was quite surprising: only one thread is used.

openMP: call parallel function from parallel region

I'm trying to parallelize my serial program with OpenMP. Here is the code, where I have a big parallel region with a number of internal "#pragma omp for" sections. In the serial version I have a function fftw_shift() which contains "for" loops too.
The question is how to rewrite the fftw_shift() function properly so that the already existing threads of the enclosing parallel region split the "for" loops inside it, without creating nested threads.
I'm not sure that my implementation works correctly. One option is to inline the whole function into the parallel region, but I'm trying to understand how to handle the situation described above.
int fftw_shift(fftw_complex *pulse, fftw_complex *shift_buf, int array_size)
{
int j = 0; //counter
if ((pulse != nullptr) || (shift_buf != nullptr)){
if (omp_in_parallel()) {
//shift the array
#pragma omp for private(j) //schedule(dynamic)
for (j = 0; j < array_size / 2; j++) {
//left to right
shift_buf[(array_size / 2) + j][REAL] = pulse[j][REAL]; //real
shift_buf[(array_size / 2) + j][IMAG] = pulse[j][IMAG]; //imaginary
//right to left
shift_buf[j][REAL] = pulse[(array_size / 2) + j][REAL]; //real
shift_buf[j][IMAG] = pulse[(array_size / 2) + j][IMAG]; //imaginary
}
//rewrite the array
#pragma omp for private(j) //schedule(dynamic)
for (j = 0; j < array_size; j++) {
pulse[j][REAL] = shift_buf[j][REAL]; //real
pulse[j][IMAG] = shift_buf[j][IMAG]; //imaginary
}
return 0;
}
}
....
#pragma omp parallel firstprivate(x, phase) if(array_size >= OMP_THREASHOLD)
{
// First half-step
#pragma omp for schedule(dynamic)
for (x = 0; x < array_size; x++) {
..
}
// Forward FTW
fftw_shift(pulse_x, shift_buf, array_size);
#pragma omp master
{
fftw_execute(dft);
}
#pragma omp barrier
fftw_shift(pulse_kx, shift_buf, array_size);
...
}
If you call fftw_shift from within a parallel region, but not from inside a work-sharing construct (i.e. not from a parallel for), then you can use omp for in the function just as if it were written directly inside the parallel region. This is called an orphaned directive.
However, your loops just copy data, so they are likely limited by memory bandwidth; depending on your system, don't expect a perfect speedup.
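A minimal sketch of an orphaned worksharing directive follows. It uses plain double arrays instead of fftw_complex, and the names swap_halves and shift_in_parallel are hypothetical; it assumes an even array length.

```cpp
#include <cassert>
#include <vector>

// Orphaned worksharing: the "omp for" directives bind to whatever parallel
// region is active in the caller. Called outside of one, they run serially.
void swap_halves(std::vector<double>& data, std::vector<double>& buf) {
    assert(data.size() % 2 == 0 && buf.size() == data.size());
    const long half = static_cast<long>(data.size() / 2);
    #pragma omp for
    for (long j = 0; j < half; ++j) {
        buf[half + j] = data[j];  // left to right
        buf[j] = data[half + j];  // right to left
    }
    // The implicit barrier of the loop above guarantees buf is complete here.
    #pragma omp for
    for (long j = 0; j < 2 * half; ++j) data[j] = buf[j];
}

// Hypothetical caller: every thread of the team calls swap_halves, and the
// team splits the loop iterations among its members.
void shift_in_parallel(std::vector<double>& data) {
    std::vector<double> buf(data.size());
    #pragma omp parallel
    {
        swap_halves(data, buf);
    }
}
```

Every thread of the team must reach the call, since the worksharing loops inside the function contain barriers.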

Simple task-based OpenMP application hangs

The following small program (online version) attempts to calculate the area of a 64 by 64 square by recursively dividing it into four squares until the smallest square has unit length (hardly optimal). But for some reason the program hangs. What am I doing wrong?
#include <iostream>
unsigned compute( unsigned length )
{
if( length == 1 ) return length * length;
unsigned a[4] , area = 0 , len = length/2;
for( unsigned i = 0; i < 4; ++i )
{
#pragma omp task
{
a[i] = compute( len );
}
#pragma omp single
{
area += a[i];
}
}
return area;
}
int main()
{
unsigned area , length = 64;
#pragma omp parallel
{
area = compute( length );
}
std::cout << area << std::endl;
}
The single construct has an implicit barrier at its end for all threads in the team. However, not all threads in the team encounter this single block, because different threads are working at different recursion depths. This is why your application hangs.
In any case your code is not correct. After your task block, a[i] is not yet assigned, so you cannot immediately use it! You must wait for the task to be completed. Of course you shouldn't do that inside the loop, otherwise the tasking wouldn't exploit any parallelism. The solution is to do this at the end of the loop. Also you must specify a as shared for the output to become visible:
for( unsigned i = 0; i < 4; ++i )
{
#pragma omp task shared(a)
{
a[i] = compute( len );
}
}
#pragma omp taskwait
for( unsigned i = 0; i < 4; ++i )
{
area += a[i];
}
Note that the reduction is not wrapped in a single construct! compute is executed by a task, so only one thread ever works on a given local area. However, you need one single construct before you first spawn any tasks:
#pragma omp parallel
#pragma omp single
{
area = compute( length );
}
Simply speaking this opens a parallel region with a team of threads, and only one thread begins the initial computation. The other threads will pick up the tasks that are later spawned by this initial thread with the task construct. This is what tasking is all about.
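Putting the pieces of this answer together gives the following self-contained sketch. The wrapper name square_area is an assumption added for illustration; compute follows the corrected code above and assumes the length is a power of two.

```cpp
#include <cassert>

// Recursively splits the square into four quadrants, one task per quadrant.
unsigned compute(unsigned length) {
    assert(length >= 1);
    if (length == 1) return 1;
    unsigned a[4];
    unsigned len = length / 2;
    for (unsigned i = 0; i < 4; ++i) {
        // a must be shared so the result is visible to the parent;
        // i and len are firstprivate by default.
        #pragma omp task shared(a)
        a[i] = compute(len);
    }
    // Wait for the four child tasks before reading their results.
    #pragma omp taskwait
    return a[0] + a[1] + a[2] + a[3];
}

// Hypothetical wrapper: one thread starts the recursion, and the rest of
// the team picks up the spawned tasks.
unsigned square_area(unsigned length) {
    unsigned area = 0;
    #pragma omp parallel
    #pragma omp single
    area = compute(length);
    return area;
}
```

For a side length of 64 the result is the number of unit squares, 64 * 64.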
Motivated by the discussion about taskwait and how it can be avoided, I show below a slightly modified version of the original code. Please note that the implied barrier at the end of the single construct is really necessary in this case.
unsigned tp_area = 0;
#pragma omp threadprivate(tp_area)
void compute (unsigned length)
{
if (length == 1)
{
tp_area += 1;
return;
}
unsigned len = length / 2;
for (unsigned i = 0; i < 4; ++i)
{
#pragma omp task
{
compute (len);
}
}
}
int main ()
{
unsigned area = 0, length = 64; // area must start at 0 before the atomic additions
#pragma omp parallel
{
#pragma omp single
{
compute (length);
}
#pragma omp atomic
area += tp_area;
}
std::cout << area << std::endl;
}

OpenMP/C++: Parallel for loop with reduction afterwards - best practice?

Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d or velocity being a class member). Also, I read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction at the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
// these variables are local to each thread
std::vector<Eigen::Vector3d> velocity_local(velocity.size());
std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));
#pragma omp for
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity_local[j] += f(j); // save results from the previous calculations
}
// now each thread can save its results to the global variable
#pragma omp critical
{
for (size_t i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
}
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also the question whether this is the best approach still holds.
Edit: As requested in a comment: the reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
You're doing an array reduction. I have described this several times (e.g. reducing an array in openmp and fill histograms array reduction in parallel with openmp without using a critical section). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int vsize = velocity.size();
#pragma omp single
velocitya.resize(vsize*nthreads);
std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
Eigen::Vector3d(0,0,0));
#pragma omp for schedule(static)
for (size_t i = 0; i < clusters.size(); i++) {
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
}
#pragma omp for schedule(static)
for(int i=0; i<vsize; i++) {
for(int t=0; t<nthreads; t++) {
velocity[i] += velocitya[vsize*t + i];
}
}
}
This method requires extra care/tuning to avoid false sharing, which I have not done.
As to which method is better you will have to test.
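For comparison, the critical-section variant from the question can be written as a self-contained sketch. To stay independent of Eigen it uses double instead of Eigen::Vector3d, and contribution is a hypothetical stand-in for f(j); both are assumptions made only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the expensive per-element function f(j).
static double contribution(int j) { return 0.5 * j; }

// Adds the contribution of every cluster member to a result array of size n.
std::vector<double> accumulate_clusters(
        const std::vector<std::vector<int>>& clusters, std::size_t n) {
    std::vector<double> velocity(n, 0.0);
    #pragma omp parallel
    {
        std::vector<double> local(n, 0.0);  // private buffer per thread
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(clusters.size()); ++i) {
            for (int j : clusters[i]) {
                assert(j >= 0 && static_cast<std::size_t>(j) < n);
                local[j] += contribution(j);
            }
        }
        // Merge the private buffers one thread at a time.
        #pragma omp critical
        {
            for (std::size_t k = 0; k < n; ++k) velocity[k] += local[k];
        }
    }
    return velocity;
}
```

The merge step costs one pass over the array per thread, so this variant tends to pay off when the per-cluster work dominates.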

Decreasing number of iterations in OpenMP parallel for

I have a parallel for in a C++ program that has to loop up to some number of iterations. Each iteration computes a possible solution for an algorithm, and I want to exit the loop once I find a valid one (it is OK if a few extra iterations are done). I know the number of iterations should be fixed from the beginning in the parallel for, but since I'm not increasing the number of iterations in the following code, is there any guarantee that threads check the condition before proceeding with their current iteration?
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition)
max_its = t; // valid to make threads exit the for?
}
}
Modifying the loop counter works for most implementations of OpenMP worksharing constructs, but the program will no longer be conforming to OpenMP and there is no guarantee that the program works with other compilers.
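Since the OP is OK with some extra iterations, OpenMP cancellation is the way to go. OpenMP 4.0 introduced the "cancel" construct exactly for this purpose. It will request termination of the worksharing construct and teleport the threads to the end of it. Note that cancellation only takes effect if it is enabled at program start by setting the OMP_CANCELLATION environment variable to true; otherwise the cancel construct is ignored.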
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition) {
#pragma omp cancel for
}
#pragma omp cancellation point for
}
}
Be aware that there might be a price to pay in terms of performance, but you might want to accept this if the overall performance is better when aborting the loop.
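A self-contained sketch of the cancellation pattern is shown below. The function name find_any_multiple and the divisibility test are hypothetical stand-ins for "some condition". Which match is returned depends on thread timing, and if cancellation is not enabled at run time the loop simply runs to completion and still returns a valid match.

```cpp
#include <cassert>

// Returns some t in [1, max_its) that is divisible by `divisor`, or -1 if
// there is none. Not necessarily the smallest such t.
int find_any_multiple(int max_its, int divisor) {
    assert(divisor > 0);
    int found_at = -1;
    #pragma omp parallel for schedule(dynamic, 1) shared(found_at)
    for (int t = 1; t < max_its; ++t) {
        if (t % divisor == 0) {  // hypothetical "some condition"
            #pragma omp atomic write
            found_at = t;
            // Ask the team to stop working on this loop.
            #pragma omp cancel for
        }
        // Threads check here whether cancellation has been requested.
        #pragma omp cancellation point for
    }
    return found_at;
}
```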
In pre-4.0 implementations of OpenMP, the only OpenMP-compliant solution would be to have an if statement to approach the regular end of the loop as quickly as possible without executing the actual loop body:
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
if(!some condition) {
... loop body ...
}
}
}
You can't modify max_its, as the standard says it must be a loop-invariant expression.
What you can do, though, is use a shared boolean variable as a flag:
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel for schedule(dynamic, 1) shared(found)
for(int t = 0; t < max_its; ++t)
{
if( ! found ) {
...
}
if(some condition) {
#pragma omp atomic write
found = true; // a plain store needs "atomic write"; remaining iterations then skip their work
}
}
}
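The flag approach can be sketched as a runnable function as follows. The name find_square_root and the test t * t == target are hypothetical stand-ins for "some condition"; a single int is used both as the flag and as the result, and it is read and written atomically.

```cpp
#include <cassert>

// Returns a t in [0, max_its) with t * t == target, or -1 if there is none.
int find_square_root(int max_its, int target) {
    assert(target >= 0);
    int hit = -1;  // doubles as the "found" flag: >= 0 means a solution exists
    #pragma omp parallel for schedule(dynamic, 1) shared(hit)
    for (int t = 0; t < max_its; ++t) {
        int seen;
        #pragma omp atomic read
        seen = hit;
        if (seen >= 0) continue;  // a solution exists: skip the work
        if (t * t == target) {    // hypothetical "some condition"
            #pragma omp atomic write
            hit = t;
        }
    }
    return hit;
}
```

All iterations are still dispatched, but once the flag is set each remaining iteration costs only one atomic read.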
Logic of this kind may also be implemented with tasks instead of a work-sharing construct. A sketch of the code would look like the following:
void algorithm(int t, bool& found) {
#pragma omp task shared(found)
{
if( !found ) {
// Do work
if ( /* condition */ ) {
#pragma omp atomic write
found = true;
}
}
} // task
} // function
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel
{
#pragma omp single
{
for(int t = 0; t < max_its; ++t)
{
algorithm(t,found);
}
} // single
} // parallel
}
The idea is that a single thread creates max_its tasks. Each task will be assigned to a waiting thread. If one of the tasks finds a valid solution, all the others are informed through the shared variable found.
If some_condition is a logical expression that is "always valid", then you could do:
for(int t = 0; t < max_its && !some_condition; ++t)
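That way, it's very clear that !some_condition is required to continue the loop, and there is no need to read the rest of the code to find out that "if some_condition, loop ends".
Otherwise (for example, if some_condition is the result of some calculation inside the loop and it's complicated to "move" some_condition into the for-loop condition), using break is clearly the right thing to do. Note that both suggestions apply to serial loops only: a loop associated with #pragma omp parallel for must be in canonical loop form, which allows neither a compound loop condition nor a break out of the loop.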