OpenMP tasks & data environment - c++

I'm trying to use a task construct for my C++/OpenMP program:
#pragma omp parallel
{
typename ClusterNormal<DIM>::VectorMean ResultMeanThread;
ResultMeanThread.setZero();
#pragma omp single
for(list<unsigned int>::const_iterator it=IDLeft.cbegin(); it!=IDLeft.cend(); it++)
{
#pragma omp task
{
ResultMeanThread += Data[*it];
}
}
}
This code computes the sum of some VectorMean objects (it doesn't matter what they are, but they have operator + defined) for the elements of Data indicated in IDLeft.
Every thread initializes its VectorMean with all zeros. My problem is that after the for loop, ResultMeanThread is still composed of all zeros.
When a task is executed, the sum is computed correctly, but after the task finishes, ResultMeanThread is always reset to zeros.
How can I fix it? I'm using tasks because of the lists, but my code isn't working.

I've found that the problem was declaring ResultMeanThread as a private variable.
I tried this code, declaring a vector of ResultMeanThread as a shared variable (the length of the vector is the number of threads), so every thread accesses only one element of the vector (no race conditions).
In the previous code, every ResultMeanThread was zero because of the task construct: a variable that is private in the enclosing parallel region is firstprivate in a task by default, so each task updates its own copy and the thread's variable never changes. I have to use the task construct because of the list.
Here's the code:
vector<typename ClusterNormal<DIM>::VectorMean> ResultMeanThread;
typename ClusterNormal<DIM>::VectorMean ResultMeanFinal;
// here I set the initial values to zero and the number of vector elements to the total number of threads
#pragma omp parallel
{
#pragma omp single
for(list<unsigned int>::const_iterator it=IDLeft.cbegin(); it!=IDLeft.cend(); it++)
{
#pragma omp task
{
ResultMeanThread[omp_get_thread_num()] += Data[*it];
}
}
#pragma omp taskwait
// Final sum
#pragma omp critical
{
ResultMeanFinal+=ResultMeanThread[omp_get_thread_num()];
}
}
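A minimal, self-contained sketch of the per-thread-slot idea above, with plain double values standing in for VectorMean. The names sum_selected, thread_count and thread_id are hypothetical helpers introduced only for this illustration; the preprocessor guards are an assumption so that the sketch also builds when OpenMP is disabled.

```cpp
#include <cassert>
#include <list>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helpers: fall back to a single "thread" without OpenMP.
static int thread_count() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
static int thread_id() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Sums data[id] for every id in ids, one task per list element.
double sum_selected(const std::vector<double>& data,
                    const std::list<unsigned int>& ids) {
    // One slot per thread, declared outside the parallel region so it is shared.
    std::vector<double> partial(thread_count(), 0.0);
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (auto it = ids.cbegin(); it != ids.cend(); ++it) {
                assert(*it < data.size());
                // Each task gets its own copy of the iterator; partial stays shared.
                #pragma omp task firstprivate(it)
                partial[thread_id()] += data[*it];
            }
        } // implicit barrier: all tasks have completed here
    }
    double total = 0.0;
    for (double p : partial) total += p;  // final serial reduction
    return total;
}
```

Because the slots live outside the parallel region, the updates made inside the tasks survive after the tasks finish, which is what the private variable in the first attempt could not provide.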

Related

OpenMP: how to realize thread local object in task?

What I am trying to do is iterate over all elements of a container in parallel, comparable to #pragma omp for; however, the container in question does not offer a random access iterator. I therefore use a workaround via tasks described in a Stack Overflow answer:
for (std::map<A,B>::iterator it = my_map.begin();
it != my_map.end();
++it)
{
#pragma omp task
{ /* do work with it */ }
}
My problem is that a 'scratch space' object is needed in each iteration; this object is expensive to construct or copy into the data environment of the task. It would only be necessary to have a single thread local object for each thread; in the sense that each task uses the object of the thread it is executed on. private requires a copy and shared results in a race condition. Is there a way to realize this with OpenMP?
I researched #pragma omp threadprivate, however the objects are not static, as the structure of the program looks something like this:
method(int argument_for_scratch_object){
#pragma omp parallel
{
Object scratch(argument_for_scratch_object);
//some computations are done here...
#pragma omp single nowait
{
//here goes the for loop creating the tasks above
//each task uses the scratch space object
}
}
}
If scratch was declared static (and then made threadprivate) before the parallel region, it would be initialized with argument_for_scratch_object of the first method call; which might not be correct for the subsequent method calls.
Based on your update, I would suggest using a global/static thread-private pointer and then initializing it in each thread within your parallel section.
static Object* scratch_ptr;
#pragma omp threadprivate(scratch_ptr)
void method(int argument_for_scratch_object)
{
#pragma omp parallel
{
scratch_ptr = new Object(argument_for_scratch_object);
...
delete scratch_ptr;
}
}
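A self-contained sketch of how the thread-private pointer might be combined with the task loop from the question. Scratch and scale_values are hypothetical names invented for this illustration, and the explicit barrier is an assumption added here so that no task runs on a thread that has not yet created its scratch object.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical scratch object that is expensive to build in real code.
struct Scratch {
    explicit Scratch(int n) : buf(n, 0) {}
    std::vector<int> buf;
};

static Scratch* scratch_ptr = nullptr;
#pragma omp threadprivate(scratch_ptr)

// Multiplies every mapped value by factor, one task per map element.
void scale_values(std::map<int, int>& m, int factor) {
    #pragma omp parallel
    {
        scratch_ptr = new Scratch(1);   // one object per thread
        #pragma omp barrier             // every thread has its scratch before tasks start
        #pragma omp single
        {
            for (auto it = m.begin(); it != m.end(); ++it) {
                #pragma omp task firstprivate(it)
                {
                    // Refers to the scratch object of the thread running this task.
                    Scratch& s = *scratch_ptr;
                    assert(!s.buf.empty());
                    s.buf[0] = it->second * factor;
                    it->second = s.buf[0];
                }
            }
        } // implicit barrier: all tasks are done before the scratch is freed
        delete scratch_ptr;
        scratch_ptr = nullptr;
    }
}
```

Note that this sketch uses single without nowait, so the implicit barrier guarantees that no task still needs a scratch object when it is deleted.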

Optimize loop with openmp

I've got the following loop:
while (a != b) {
#pragma omp parallel
{
#pragma omp for
// first for
#pragma omp for
// second for
}
}
This way, a team is created on every iteration of the while loop. Is it possible to rearrange the code so that there is a single team? The variable "a" is accessed with omp atomic inside the loop and "b" is a constant.
The only thing that comes to my mind is something like this:
#pragma omp parallel
{
while (a != b) {
#pragma omp barrier
// This barrier ensures that the threads
// wait for each other after evaluating the condition
// in the while loop
#pragma omp for
// first for (implicit barrier)
#pragma omp for
// second for (implicit barrier)
// The second implicit barrier ensures that every
// thread will have the same view of a
} // while
} // omp parallel
In this way each thread will evaluate the condition, but every evaluation will be consistent with the others. If you really want a single thread to evaluate the condition, then you should think of transforming your worksharing constructs into task constructs.
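A runnable sketch of the single-team pattern, under the assumption that "a" only changes inside the second worksharing loop. The function iterate_rounds and the cells vector are hypothetical and exist only to make the control flow observable.

```cpp
#include <cassert>
#include <vector>

// Runs rounds until the shared counter a reaches b; each round adds 1 to every cell.
std::vector<int> iterate_rounds(int b, int n) {
    assert(b >= 0 && n >= 1);  // otherwise the loop below would never terminate
    std::vector<int> cells(n, 0);
    int a = 0;
    #pragma omp parallel shared(a, cells)
    {
        while (a != b) {
            // Nobody may modify a until every thread has evaluated the condition.
            #pragma omp barrier
            #pragma omp for
            for (int i = 0; i < n; ++i) {
                cells[i] += 1;
            } // implicit barrier
            #pragma omp for
            for (int i = 0; i < n; ++i) {
                if (i == 0) {
                    #pragma omp atomic
                    a += 1;
                }
            } // implicit barrier: all threads now see the same a
        }
    }
    return cells;
}
```

The team is created once, and every thread leaves the while loop in the same round because the condition is only evaluated between barriers.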

Openmp: nested loops and allocation

I'd like to parallelize a for loop within another for loop. I can simply use "#pragma omp parallel for" directly on the inner loop, but I fear that creating a new set of threads each time is not the optimal thing to do. In the outer loop (before the inner one) there is an allocation and some other instructions to be done by a single thread (I allocate a matrix that is shared by the threads in the inner loop, so every thread should have access to it). I tried to do something like this:
#pragma omp parallel
{
for (auto t=1;t<=time_step;++t){
#pragma omp single {
Matrix<unsigned int> newField(rows,cols);
//some instructions
}
unsigned int j;
#pragma omp for
for (unsigned int i = 2;i<=rows-1;++i){
for ( j = 1;j<=cols;++j){
//Work on NewField (i,j)
}
}
#pragma omp single {
//Instruction
}
}
}
This code doesn't work. Would this approach (if I make it work) be more efficient than creating the threads every time? What am I doing wrong?
Thank you!
Many OpenMP implementations keep a pool of threads instead of creating them before every parallel region.
So you can just go with
for (auto t=1;t<=time_step;++t){
Matrix<unsigned int> newField(rows,cols);
//some instructions
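#pragma omp parallel for
for (unsigned int i = 2;i<=rows-1;++i){
// j is declared in the loop so that each thread has its own copy;
// declared before the parallel for it would be shared and cause a race
for (unsigned int j = 1;j<=cols;++j){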
//Work on NewField (i,j)
}
}
//Instruction
}
and it could even be faster because there are no single directives.
The way your code is written now is going to cause syntax errors. When you use OpenMP directives such as single or critical, the opening brace must be on a new line.
So instead of this
#pragma omp single {
}
You need to do this
#pragma omp single
{
}
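For comparison, a sketch of the single-team variant with the braces placed on their own lines. The function evolve, the flat std::vector standing in for Matrix, and the per-step increment are assumptions made for this illustration; the field is declared before the parallel region so that every thread can see it.

```cpp
#include <cassert>
#include <vector>

// Adds t to every cell at time step t, for t = 1..time_steps.
std::vector<unsigned int> evolve(int rows, int cols, int time_steps) {
    assert(rows >= 0 && cols >= 0);
    std::vector<unsigned int> field(static_cast<std::size_t>(rows) * cols, 0u);
    unsigned int increment = 0;  // shared: written by one thread, read by all
    #pragma omp parallel
    {
        for (int t = 1; t <= time_steps; ++t) {   // t is private to each thread
            #pragma omp single
            {
                increment = static_cast<unsigned int>(t);  // serial part of the step
            } // implicit barrier: increment is visible to the whole team
            #pragma omp for
            for (int i = 0; i < rows; ++i) {
                for (int j = 0; j < cols; ++j) {
                    field[static_cast<std::size_t>(i) * cols + j] += increment;
                }
            } // implicit barrier before the next step overwrites increment
        }
    }
    return field;
}
```

Whether this beats a plain "#pragma omp parallel for" on the inner loop depends on the implementation, so it is worth measuring both.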

OpenMP tasks passing "shared" pointers

I would like to use the OpenMP task pragmas for the following code:
std::vector<Class*> myVectorClass;
#pragma omp parallel
{
#pragma omp single nowait
{
for (std::list<Class*>::iterator it = myClass.begin(); it != myClass.end();) {
#pragma omp task firstprivate(it)
(*it)->function(t, myVectorClass);
++it;
}
}
#pragma omp taskwait
}
The problem, or one of them, is that myVectorClass is a pointer to an object, so it is not possible to set this vector as shared. myVectorClass is modified by the function. The previous code crashes. Could you tell me how to modify it (without using the for-loop pragmas)?
Thanks
myVectorClass is a vector of pointers. In your current code, it is shared. Since your code crashes, I suppose you change the length of myVectorClass in function(). However, std::vector is not thread-safe, so modifying its length from multiple threads will corrupt its data structure.
Depending on what exactly function() does, you could have simple solutions. The basic idea is to use one thread-local vector per thread to collect the result of function() first, then concatenate/merge these vectors into a single one.
The code from "C++ OpenMP Parallel For Loop - Alternatives to std::vector" gives a good example:
std::vector<int> vec;
#pragma omp parallel
{
std::vector<int> vec_private;
#pragma omp for nowait //fill vec_private in parallel
for(int i=0; i<100; i++) {
vec_private.push_back(i);
}
#pragma omp critical
vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}
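The same collect-then-merge idea can be sketched with tasks instead of a worksharing loop, which is closer to the list-based code in the question. The function collect_squares is hypothetical; squaring stands in for whatever function() produces.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// Returns the squares of all items, sorted because task order is unspecified.
std::vector<int> collect_squares(const std::list<int>& items) {
    std::vector<int> out;  // shared result, only touched inside the critical section
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (auto it = items.begin(); it != items.end(); ++it) {
                #pragma omp task firstprivate(it) shared(out)
                {
                    std::vector<int> local;            // private to this task
                    local.push_back((*it) * (*it));
                    #pragma omp critical
                    out.insert(out.end(), local.begin(), local.end());
                }
            }
        } // implicit barrier: all tasks have merged their results
    }
    std::sort(out.begin(), out.end());
    assert(out.size() == items.size());
    return out;
}
```

Only the short merge step is serialized; the work that fills each local vector runs without locking.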

Elegantly initializing openmp threads in parallel for loop

I have a for loop that uses a (somewhat complicated) counter object sp_ct to initialize an array. The serial code looks like
sp_ct.depos(0);
for(int p=0;p<size; p++, sp_ct.increment() ) {
in[p]=sp_ct.parable_at_basis();
}
My counter supports parallelization because it can be initialized to the state after p increments, leading to the following working code-fragment:
int firstloop=-1;
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(sp_ct,firstloop)
for(int p=0;p<size;p++) {
if( firstloop == -1 ) {
sp_ct.depos(p); firstloop=0;
} else {
sp_ct.increment();
}
in[p]=sp_ct.parable_at_basis();
} // end omp parallel for
I dislike this because of the clutter that obscures what is really going on, and because it has an unnecessary branch inside the loop (Yes, I know that this is likely to not have a measurable influence on running time because it is so predictable...).
I would prefer to write something like
#pragma omp parallel for default(none) shared(size,in) firstprivate(sp_ct)
for(int p=0;p<size;p++) {
#pragma omp initialize // or something
{ sp_ct.depos(p); }
in[p]=sp_ct.parable_at_basis();
sp_ct.increment();
} // end omp parallel for
Is this possible?
If I generalize your problem, the question is "How do I execute some initialization code for each thread of a parallel section?", is that right? You may use a property of the firstprivate clause: "the initialization or construction of the given variable happens as if it were done once per thread, prior to the thread's execution of the construct".
struct thread_initializer
{
explicit thread_initializer(
int size /*initialization params*/) : size_(size) {}
//Copy constructor that does the init
thread_initializer(const thread_initializer& _it) : size_(_it.size_)
{
//Here goes once per thread initialization
for(int p=0;p<size_;p++)
sp_ct.depos(p);
}
int size_;
scp_type sp_ct;
};
Then the loop may be written:
thread_initializer init(size);
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(init)
for(int p=0;p<size;p++) {
in[p]=init.sp_ct.parable_at_basis();
init.sp_ct.increment();
}
The downsides are that you have to write this extra initializer and that some code is moved away from its actual execution point. The upside is that you can reuse it, and the loop syntax is cleaner.
From what I can tell, you can do this by manually defining the chunks. This looks somewhat like what I was trying to do in "Induction with OpenMP: getting range values for a parallized for loop in OpenMP".
So you probably want something like this:
#pragma omp parallel
{
const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int start = ithread*size/nthreads;
const int finish = (ithread+1)*size/nthreads;
Counter_class_name sp_ct;
sp_ct.depos(start);
for(int p=start; p<finish; p++, sp_ct.increment()) {
in[p]=sp_ct.parable_at_basis();
}
}
Notice that except for some declarations and changing the range values this code is almost identical to the serial code.
Also you don't have to declare anything shared or private. Everything declared inside the parallel block is private and everything declared outside is shared. You don't need firstprivate either. This makes the code cleaner and more clear (IMHO).
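A self-contained version of the manual chunking, as a sketch. Counter is a hypothetical stand-in for the asker's counter class (its parable_at_basis simply returns the square of the position), and fill_chunked is an invented name; the guards are an assumption so the code also builds without OpenMP.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical counter: can jump to a position or advance by one.
struct Counter {
    long pos = 0;
    void depos(long p) { pos = p; }
    void increment() { ++pos; }
    long parable_at_basis() const { return pos * pos; }
};

std::vector<long> fill_chunked(int size) {
    assert(size >= 0);
    std::vector<long> in(size);
    #pragma omp parallel
    {
#ifdef _OPENMP
        const int nthreads = omp_get_num_threads();
        const int ithread = omp_get_thread_num();
#else
        const int nthreads = 1;
        const int ithread = 0;
#endif
        // Each thread owns exactly one contiguous range [start, finish).
        const long start = static_cast<long>(ithread) * size / nthreads;
        const long finish = static_cast<long>(ithread + 1) * size / nthreads;
        Counter sp_ct;          // private: declared inside the parallel block
        sp_ct.depos(start);     // one jump per thread, then cheap increments
        for (long p = start; p < finish; ++p, sp_ct.increment()) {
            in[p] = sp_ct.parable_at_basis();
        }
    }
    return in;
}
```

The ranges of consecutive threads meet exactly, so every index from 0 to size-1 is written once regardless of how many threads the runtime provides.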
I see what you're trying to do, and I don't think it is possible. I'm just going to write some code that I believe would achieve the same thing, and is somewhat clean, and if you like it, sweet!
sp_ct.depos(0);
in[0]=sp_ct.parable_at_basis();
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(sp_ct,firstloop)
for(int p = 1; p < size; p++) {
sp_ct.increment();
in[p]=sp_ct.parable_at_basis();
} // end omp parallel for
Riko, implement sp_ct.depos(), so it will invoke .increment() only as often as necessary to bring the counter to the passed parameter. Then you can use this code:
sp_ct.depos(0);
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(sp_ct)
for(int p=0;p<size;p++) {
sp_ct.depos(p);
in[p]=sp_ct.parable_at_basis();
} // end omp parallel for
This solution has one additional benefit: your implementation only works if each thread receives exactly one chunk out of 0 - size, which is the case when specifying schedule(static) without a chunk size (OpenMP 4.0 Specification, chapter 2.7.1, page 57). But since you did not specify a schedule, the schedule used is implementation dependent (OpenMP 4.0 Specification, chapter 2.3.2). If the implementation chooses dynamic or guided, threads will receive multiple chunks with gaps between them. So one thread could receive chunk 0-20 and then chunk 70-90, which makes p and sp_ct out of sync on the second chunk. The solution above is compatible with all schedules.
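One way such a depos() could look, as a sketch only: StepCounter is a hypothetical counter whose state can only be advanced step by step, and fill_any_schedule is an invented name. The schedule(dynamic, 4) clause is chosen here on purpose to exercise the case where a thread receives several chunks with gaps between them.

```cpp
#include <cassert>
#include <vector>

// Hypothetical counter that keeps value == pos * pos using increments only.
struct StepCounter {
    long pos = 0;
    long value = 0;
    void increment() { value += 2 * pos + 1; ++pos; }
    // Moves to target, calling increment() only as often as necessary.
    void depos(long target) {
        if (target < pos) { pos = 0; value = 0; }  // going backwards needs a restart
        while (pos < target) increment();
    }
    long parable_at_basis() const { return value; }
};

std::vector<long> fill_any_schedule(int size) {
    assert(size >= 0);
    std::vector<long> in(size);
    StepCounter sp_ct;
    sp_ct.depos(0);
    #pragma omp parallel for firstprivate(sp_ct) schedule(dynamic, 4)
    for (int p = 0; p < size; ++p) {
        sp_ct.depos(p);   // no-op inside a chunk, catches up across gaps
        in[p] = sp_ct.parable_at_basis();
    }
    return in;
}
```

Within a chunk, depos(p) costs a single increment, so the per-iteration overhead stays close to that of the hand-written version.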