The following small program attempts to calculate the area of a 64 by 64 square by recursively dividing it into four squares until the smallest square has unit length (hardly optimal). But for some reason the program hangs. What am I doing wrong?
#include <iostream>

unsigned compute( unsigned length )
{
    if( length == 1 ) return length * length;
    unsigned a[4] , area = 0 , len = length/2;
    for( unsigned i = 0; i < 4; ++i )
    {
        #pragma omp task
        {
            a[i] = compute( len );
        }
        #pragma omp single
        {
            area += a[i];
        }
    }
    return area;
}

int main()
{
    unsigned area , length = 64;
    #pragma omp parallel
    {
        area = compute( length );
    }
    std::cout << area << std::endl;
}
The single construct ends with an implicit barrier for all threads in the team. However, not all threads in the team encounter this single block, because different threads are working at different recursion depths. This is why your application hangs.
In any case your code is not correct. After your task block, a[i] is not yet assigned, so you cannot use it immediately! You must wait for the task to complete. Of course you shouldn't do that inside the loop, otherwise the tasking wouldn't exploit any parallelism. The solution is to wait once, after the loop. You must also specify a as shared for the output to become visible:
for( unsigned i = 0; i < 4; ++i )
{
    #pragma omp task shared(a)
    {
        a[i] = compute( len );
    }
}
#pragma omp taskwait
for( unsigned i = 0; i < 4; ++i )
{
    area += a[i];
}
Note that the reduction is not wrapped in a single construct! Each call to compute is executed by a single task, so only the one thread running that task ever touches its local area. However, you do need one single construct before you first spawn any tasks:
#pragma omp parallel
#pragma omp single
{
    area = compute( length );
}
Simply speaking, this opens a parallel region with a team of threads, and only one thread begins the initial computation. The other threads pick up the tasks that this initial thread later spawns with the task construct. This is what tasking is all about.
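Assembled from the fragments above, the corrected program as a whole would look like this:

#include <iostream>

unsigned compute( unsigned length )
{
    if( length == 1 ) return length * length;
    unsigned a[4] , area = 0 , len = length/2;
    for( unsigned i = 0; i < 4; ++i )
    {
        #pragma omp task shared(a)
        {
            a[i] = compute( len );
        }
    }
    #pragma omp taskwait   // wait for all four child tasks before the sequential reduction
    for( unsigned i = 0; i < 4; ++i )
    {
        area += a[i];
    }
    return area;
}

int main()
{
    unsigned area = 0 , length = 64;
    #pragma omp parallel
    #pragma omp single      // exactly one thread starts the recursion
    {
        area = compute( length );
    }
    std::cout << area << std::endl;
}

The taskwait guarantees that all four child tasks have written their results into a before they are summed, and the single construct guarantees that the recursion is started exactly once.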
Motivated by the discussion about taskwait and how it can be avoided, I show below a slightly modified version of the original code. Please note that the implied barrier at the end of the single construct is really necessary in this case.
unsigned tp_area = 0;
#pragma omp threadprivate(tp_area)

void compute (unsigned length)
{
    if (length == 1)
    {
        tp_area += 1;
        return;
    }
    unsigned len = length / 2;
    for (unsigned i = 0; i < 4; ++i)
    {
        #pragma omp task
        {
            compute (len);
        }
    }
}

int main ()
{
    unsigned area = 0, length = 64;
    #pragma omp parallel
    {
        #pragma omp single
        {
            compute (length);
        }
        #pragma omp atomic
        area += tp_area;
    }
    std::cout << area << std::endl;
}
I have this simple, self-contained example of a very rudimentary stencil application to work with OpenMP tasks and the depend clause. In two update steps, each location of an array is incremented by three values from another array: the corresponding location and its left and right neighbours. To avoid data races I have set up dependencies so that, for every section in the second update step, its task can only be scheduled once the relevant tasks for the sections from the first update step have executed. I get the expected results, but I am not sure whether my assumptions are correct, because these tasks might be executed immediately by the encountering threads rather than deferred. So my question is: are the tasks created in worksharing loops all sibling tasks, and are the dependencies therefore retained just as when the tasks are generated inside a single construct?
#include <iostream>
#include <omp.h>
#include <math.h>
#include <stdlib.h>

typedef double value_type;

int main(int argc, char * argv[]){
    std::size_t size = 100000;
    std::size_t part_size = 25;
    std::size_t parts = ceil(float(size)/part_size);
    std::size_t num_threads = 4;
    value_type * A = (value_type *) malloc(sizeof(value_type)*size);
    value_type * B = (value_type *) malloc(sizeof(value_type)*size);
    value_type * C = (value_type *) malloc(sizeof(value_type)*size);
    for (int i = 0; i < size; ++i) {
        A[i] = 1;
        B[i] = 1;
        C[i] = 0;
    }

    #pragma omp parallel num_threads(num_threads)
    {
        #pragma omp for schedule(static)
        for(int part=0; part<parts; part++){
            std::size_t current_part = part * part_size;
            std::size_t left_part = part != 0 ? (part-1)*part_size : current_part;
            std::size_t right_part = part != parts-1 ? (part+1)*part_size : current_part;
            std::size_t start = current_part;
            std::size_t end = part == parts-1 ? size-1 : start+part_size;
            if(part==0) start = 1;
            #pragma omp task depend(in: A[current_part], A[left_part], A[right_part]) depend(out: B[current_part])
            {
                for(int i=start; i<end; i++){
                    B[i] += A[i] + A[i-1] + A[i+1];
                }
            }
        }

        #pragma omp for schedule(static)
        for(int part=0; part<parts; part++){
            std::size_t current_part = part * part_size;
            std::size_t left_part = part != 0 ? (part-1)*part_size : current_part;
            std::size_t right_part = part != parts-1 ? (part+1)*part_size : current_part;
            std::size_t start = current_part;
            std::size_t end = part == parts-1 ? size-1 : start+part_size;
            if(part==0) start = 1;
            #pragma omp task depend(in: B[current_part], B[left_part], B[right_part]) depend(out: C[current_part])
            {
                for(int i=start; i<end; i++){
                    C[i] += B[i] + B[i-1] + B[i+1];
                }
            }
        }
    }

    value_type sum = 0;
    value_type max = -1000000000000;
    value_type min = 1000000000000;
    for(int i = 0; i < size; i++){
        sum += C[i];
        if(C[i]<min) min = C[i];
        if(C[i]>max) max = C[i];
    }
    std::cout << "sum: " << sum << std::endl;
    std::cout << "min: " << min << std::endl;
    std::cout << "max: " << max << std::endl;
    std::cout << "avg: " << sum/(size) << std::endl;
    return 0;
}
In the OpenMP specification you can find the corresponding definitions:
sibling tasks - Tasks that are child tasks of the same task region.
child task - A task is a child task of its generating task region. A child task region is not part of its generating task region.
task region - A region consisting of all code encountered during the
execution of a task. COMMENT: A parallel region consists of one or
more implicit task regions
In the description of the parallel construct you can read that:
A set of implicit tasks, equal in number to the number of threads in
the team, is generated by the encountering thread. The structured
block of the parallel construct determines the code that will be
executed in each implicit task.
In practice this means that within the parallel region many implicit task regions are generated, and with #pragma omp for each of these task regions generates its own explicit tasks (i.e. #pragma omp task ...). However, only tasks generated by the same task region are sibling tasks (not all of them!). If you want to be sure that all generated tasks are sibling tasks, you have to use a single task region (e.g. using the single construct) to generate all the explicit tasks.
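For illustration, here is a sketch of that restructuring for the code in the question: one single region generates every explicit task, so all of them are siblings and the depend clauses alone order the two sweeps (the function wrapper and its signature are invented for this sketch):

#include <cstddef>

typedef double value_type;

// All tasks are generated from one single region, so they are sibling tasks and
// the depend clauses order the two update sweeps without relying on barriers.
void stencil_tasks( value_type * A, value_type * B, value_type * C,
                    std::size_t size, std::size_t part_size, std::size_t parts )
{
    #pragma omp parallel
    #pragma omp single
    {
        for( std::size_t part = 0; part < parts; part++ ){
            std::size_t current_part = part * part_size;
            std::size_t left_part  = part != 0       ? (part-1)*part_size : current_part;
            std::size_t right_part = part != parts-1 ? (part+1)*part_size : current_part;
            std::size_t start = part == 0       ? 1      : current_part;
            std::size_t end   = part == parts-1 ? size-1 : current_part + part_size;
            #pragma omp task depend(in: A[current_part], A[left_part], A[right_part]) \
                             depend(out: B[current_part])
            for( std::size_t i = start; i < end; i++ )
                B[i] += A[i] + A[i-1] + A[i+1];
        }
        for( std::size_t part = 0; part < parts; part++ ){
            std::size_t current_part = part * part_size;
            std::size_t left_part  = part != 0       ? (part-1)*part_size : current_part;
            std::size_t right_part = part != parts-1 ? (part+1)*part_size : current_part;
            std::size_t start = part == 0       ? 1      : current_part;
            std::size_t end   = part == parts-1 ? size-1 : current_part + part_size;
            #pragma omp task depend(in: B[current_part], B[left_part], B[right_part]) \
                             depend(out: C[current_part])
            for( std::size_t i = start; i < end; i++ )
                C[i] += B[i] + B[i-1] + B[i+1];
        }
    } // implicit barrier: all generated tasks have completed here
}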
Note that your code gives the correct result because there is an implicit barrier at the end of the worksharing-loop construct (#pragma omp for). If you remove this barrier with the nowait clause, you will see that the result becomes incorrect.
Another comment is that in your case the workload is smaller than the parallelization overhead, so my guess is that your parallel code will be slower than the serial one.
Hi, I am new to C++ and I wrote code which runs, but it is slow because of many nested for loops. I want to speed it up with OpenMP; can anyone guide me? I tried to use '#pragma omp parallel' before the ip loop, and inside that loop I used '#pragma omp parallel for' before the it loop, but it does not work.
#pragma omp parallel
for(int ip=0; ip !=nparticle; ip++){
    inf14>>r>>xp>>yp>>zp;
    zp/=sqrt(gamma2);
    counter++;
    double para[7]={0,0,Vz,x0-xp,y0-yp,z0-zp,0};
    if(ip>=0 && ip<=43){
        #pragma omp parallel for
        for(int it=0;it<NT;it++){
            para[6]=PosT[it];
            for(int ix=0;ix<NumX;ix++){
                para[3]=PosX[ix]-xp;
                for(int iy=0;iy<NumY;iy++){
                    para[4]=PosY[iy]-yp;
                    for(int iz=0;iz<NumZ;iz++){
                        para[5]=PosZ[iz]-zp;
                        int position=it*NumX*NumY*NumZ+ix*NumY*NumZ+iy*NumZ+iz;
                        rotation(para,&Field[3*position]);
                        MagX[position] +=chg*Field[3*position];
                        MagY[position] +=chg*Field[3*position+1];
                        MagZ[position] +=chg*Field[3*position+2];
                    }
                }
            }
        }
    }
}
My rotation function also contains an open-ended integration loop, as given below:
for(int i=1;;i++){
    gsl_integration_qag(&F, 10*i, 10*i+10, 1.0e-8, 1.0e-8, 100, 2, w, &temp, &error);
    result+=temp;
    if(std::abs(temp/result)<ACCURACY){
        break;
    }
}
I am using the GSL libraries as well. So how can I speed this up, or how should I apply OpenMP?
If you don't have inter-loop dependences, you can use the collapse clause to parallelize multiple loops altogether. Example:
// Assumes this function is called from inside an enclosing parallel region;
// otherwise use "#pragma omp parallel for collapse(2)".
void scale( int N, int M, float A[N][M], float B[N][M], float alpha ) {
    #pragma omp for collapse(2)
    for( int i = 0; i < N; i++ ) {
        for( int j = 0; j < M; j++ ) {
            A[i][j] = alpha * B[i][j];
        }
    }
}
I suggest you check out the OpenMP C/C++ cheat sheet (PDF), which contains all the directives for loop parallelization.
Do not put parallel pragmas inside another parallel pragma. You might overload the machine by creating more threads than it can handle. I would establish the parallelization in the outer loop (if it is big enough):
#pragma omp parallel for
for(int ip=0; ip !=nparticle; ip++)
Also make sure you do not have any race conditions between threads (e.g. read-after-write, RAW).
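As a minimal, self-contained sketch of that advice (all names here are illustrative, not taken from the question): parallelize only the outer loop, and guard the shared output array with an atomic update so that different iterations of the outer loop cannot race on the same element.

#include <vector>
#include <cstdio>

int main() {
    const int nparticle = 1000, npos = 64;
    std::vector<double> field(npos, 0.0);
    #pragma omp parallel for              // parallelize the outer loop only
    for (int ip = 0; ip < nparticle; ip++) {
        for (int pos = 0; pos < npos; pos++) {
            double contribution = (ip + 1) * 0.001 * pos;  // stand-in for the real computation
            #pragma omp atomic                             // different ip may update the same pos
            field[pos] += contribution;
        }
    }
    std::printf("field[1] = %f\n", field[1]);
    return 0;
}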
Advice: if you do not get a good speed-up, a common practice is to iterate in chunks rather than by single increments. For instance:
int num_threads = 1;
#pragma omp parallel
{
    #pragma omp single
    {
        num_threads = omp_get_num_threads();
    }
}

int chunkSize = 20; // Define your own chunk here
for (int position = 0; position < total; position += (chunkSize * num_threads)) {
    int endOfChunk = position + (chunkSize * num_threads);
    if (endOfChunk > total) endOfChunk = total; // clamp so the last pass does not run past the end
    #pragma omp parallel for
    for (int ip = position; ip < endOfChunk; ip += chunkSize) {
        // Code
    }
}
Hi,
I'm trying to multithread the function below. I fail to get the counter properly shared among OpenMP threads; I tried atomic and plain int, but neither seems to work. I'm lost, how can I solve this?
std::vector<myStruct> _myData(100);
int counter = 0;
int index;
#pragma omp parallel for private(index)
for (index = 0; index < 500; ++index) {
    if (data[index].type == "xx") {
        myStruct s;
        s.innerData = data[index].rawData;
        processDataA(s); // processDataA(myStruct &data)
        processDataB(s);
        _myData[counter++] = s; // each thread should get a unique int, not exceeding the 100 initially allocated items in _myData
    }
}
Edit: updated bad syntax/missing parts.
If you cannot use OpenMP atomic capture, I would try:
std::vector<myStruct> _myData(100);
int counter = 0;
#pragma omp parallel for schedule(dynamic)
for (int index = 0; index < 500; ++index) {
    if (data[index].type == "xx") {
        myStruct s;
        s.innerData = data[index].rawData;
        processDataA(s);
        processDataB(s);
        int temp;
        #pragma omp critical
        temp = counter++;
        assert(temp < _myData.size());
        _myData[temp] = s;
    }
}
Or:
#pragma omp parallel for schedule(dynamic,c)
and experiment with chunk size c.
However, atomics would likely be more efficient than critical sections. Some form of atomics should be supported by your compiler.
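If your compiler does support it, the critical section above can be replaced by an atomic capture (OpenMP 3.1 or later), for example:

int temp;
#pragma omp atomic capture
temp = counter++;   // atomically read the old value and increment it in one step
assert(temp < _myData.size());
_myData[temp] = s;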
Note that your solution is kind of fragile, since it works only if the condition inside the loop evaluates to true fewer than 101 times. That's why I added an assertion to the code. Maybe a better solution:
std::vector<myStruct> _myData;
size_t size = 0;
#pragma omp parallel for reduction(+:size)
for (int index = 0; index < data.size(); ++index)
    if (data[index].type == "xx") size++;
_myData.resize(size);
...
Then, you don't need to care about the vector size and also don't waste memory space.
I have a parallel region whose progress I monitor. This means I use the variable iteration to calculate the current state of the loop (percentage: 0 - 100 until the loop is finished).
For this I increment it with an atomic operation. Is there a way to make the code shorter, maybe by including iteration++ in the #pragma omp parallel for clause?
int iteration = 0;
#pragma omp parallel for
for (int64_t ip = 0; ip < num_voxels; ip++)
{
    // calc stuff
    #pragma omp atomic
    iteration++;
    // output stuff
    // if thread == 0:
    //   Progress(iteration / num_voxels * 100);
}
I don't think it's possible to increment iteration elsewhere than inside the loop body. For instance, this is not allowed:
std::atomic<int> iteration{0};
#pragma omp parallel for
for (int64_t ip = 0; ip < num_voxels; ip++, iteration++) { ...
since OpenMP requires the so-called Canonical Loop Form, where the increment expression may not update multiple variables (see Section 2.6 of the OpenMP 4.5 Specification).
Also, I would strongly advise against incrementing iteration in every single loop iteration, since it would be very inefficient (atomic memory operations mean memory fences and cache contention).
I would prefer, e.g.:
int64_t iteration = 0;
int64_t local_iteration = 0;
#pragma omp parallel for firstprivate(local_iteration)
for (int64_t ip = 0; ip < num_voxels; ip++)
{
    ... // calc stuff
    if (++local_iteration % 1024 == 0) { // 1024 is a power of two, so the modulo compiles to a cheap bitwise AND
        #pragma omp atomic
        iteration += 1024;
    }
    // output stuff
    // if thread == 0:
    //   Progress(iteration / num_voxels * 100);
}
Also, output only if the progress in percent changes. This might also be tricky, since you need to read iteration atomically, and you likely don't want to do that in every iteration. A possible solution, which also saves a lot of cycles on "expensive" integer division:
int64_t iteration = 0;
int64_t local_iteration = 0;
int64_t last_progress = 0;
#pragma omp parallel for firstprivate(local_iteration)
for (int64_t ip = 0; ip < num_voxels; ip++)
{
    ... // calc stuff
    if (++local_iteration % 1024 == 0) { // 1024 is a power of two, so the modulo compiles to a cheap bitwise AND
        #pragma omp atomic
        iteration += 1024;
        // output stuff:
        if (omp_get_thread_num() == 0) {
            int64_t progress;
            #pragma omp atomic read
            progress = iteration;
            progress = progress * 100 / num_voxels; // multiply first so the integer division does not truncate to 0
            if (progress != last_progress) {
                Progress(progress);
                last_progress = progress;
            }
        }
    }
}
I am trying to implement an argmax with OpenMP. In short, I have a function that computes a floating point value:
double toOptimize(int val);
I can get the integer maximizing the value with:
double best = 0;
#pragma omp parallel for reduction(max: best)
for(int i = 2 ; i < MAX ; ++i)
{
    double v = toOptimize(i);
    if(v > best) best = v;
}
Now, how can I get the value i corresponding to the maximum?
Edit:
I am trying this, but would like to make sure it is valid:
double best_value = 0;
int best_arg = 0;
#pragma omp parallel
{
    double local_best = 0;
    int ba = 0;
    #pragma omp for reduction(max: best_value)
    for(size_t n = 2 ; n <= MAX ; ++n)
    {
        double v = toOptimize(n);
        if(v > best_value)
        {
            best_value = v;
            local_best = v;
            ba = n;
        }
    }
    #pragma omp barrier
    #pragma omp critical
    {
        if(local_best == best_value)
            best_arg = ba;
    }
}
And in the end, I should have best_arg the argmax of toOptimize.
Your solution is completely standard conformant. Anyhow, if you are willing to add a bit of syntactic sugar, you may try something like the following:
#include<iostream>
using namespace std;

double toOptimize(int arg) {
    return arg * (arg%100);
}

class MaximumEntryPair {
public:
    MaximumEntryPair(size_t index = 0, double value = 0.0) : index_(index), value_(value){}
    void update(size_t arg) {
        double v = toOptimize(arg);
        if( v > value_ ) {
            value_ = v;
            index_ = arg;
        }
    }
    bool operator<(const MaximumEntryPair& other) const {
        if( value_ < other.value_ ) return true;
        return false;
    }
    size_t index_;
    double value_;
};

int main() {
    MaximumEntryPair best;
    #pragma omp parallel
    {
        MaximumEntryPair thread_best; // note: thread_local is a C++11 keyword, so use a different name
        #pragma omp for
        for(size_t ii = 0 ; ii < 1050 ; ++ii) {
            thread_best.update(ii);
        } // implicit barrier
        #pragma omp critical
        {
            if ( best < thread_best ) best = thread_best;
        }
    } // implicit barrier
    cout << "The maximum is " << best.value_ << " obtained at index " << best.index_ << std::endl;
    cout << "\t toOptimize(" << best.index_ << ") = " << toOptimize(best.index_) << std::endl;
    return 0;
}
I would just create a separate buffer for each thread to store a value and an index, and then select the max from the buffers afterwards.
std::vector<double> thread_maxes(omp_get_max_threads());
std::vector<int> thread_max_ids(omp_get_max_threads());
#pragma omp parallel for
for(size_t n = 2 ; n <= MAX ; ++n)
{
    int thread_num = omp_get_thread_num();
    double v = toOptimize(n);
    if(v > thread_maxes[thread_num])
    {
        thread_maxes[thread_num] = v;
        thread_max_ids[thread_num] = n;
    }
}
std::vector<double>::iterator max =
    std::max_element(thread_maxes.begin(), thread_maxes.end());
best.val = *max;
best.idx = thread_max_ids[max - thread_maxes.begin()];
Your solution is fine. It has O(nthreads) convergence with the critical section. However, it's possible to do this with O(Log(nthreads)) convergence.
For example imagine there were 32 threads.
You would first find the local max for each of the 32 threads. Then you could combine pairs using 16 threads, then 8, then 4, then 2, then 1. In five steps you could merge the local max values without a critical section, freeing threads in the process. By contrast, your method merges the local max values in 32 steps inside a critical section and uses all the threads.
The same logic applies to a reduction. That's why it's best to let OpenMP do the reduction rather than doing it manually with a critical section. But at least in C/C++ OpenMP there is no easy built-in way to get the max/min together with its index in O(log(nthreads)). It might be possible using tasks, but I have not tried that.
In practice this might not make a difference, since the time to merge the local values even with a critical section is probably negligible compared to the time to do the parallel loop. It probably makes more of a difference on a GPU, though, where the number of "threads" is much larger.
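As a side note, with a compiler that supports OpenMP 4.0 user-defined reductions, the per-thread argmax pairs can be merged by the runtime itself instead of in a manual critical section. A sketch follows (the ArgMax struct and the argmax reduction identifier are invented here; toOptimize is the function from the question):

#include <cstdio>

struct ArgMax { double value; int index; };

// User-defined reduction: keep whichever pair holds the larger value.
#pragma omp declare reduction(argmax : ArgMax :                        \
        omp_out = (omp_in.value > omp_out.value ? omp_in : omp_out))   \
        initializer(omp_priv = ArgMax{ -1e300, -1 })

double toOptimize(int arg) { return arg * (arg % 100); }

int main() {
    const int MAX = 1050;
    ArgMax best = { -1e300, -1 };
    #pragma omp parallel for reduction(argmax : best)
    for (int i = 2; i < MAX; ++i) {
        double v = toOptimize(i);
        if (v > best.value) { best.value = v; best.index = i; }
    }
    std::printf("max %f at index %d\n", best.value, best.index);
    return 0;
}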