OpenMP: nested loops and allocation - c++

I'd like to parallelize a for loop nested inside another for loop. I can simply put "#pragma omp parallel for" directly on the inner loop, but I fear that creating a new set of threads each time is not the optimal thing to do. In the outer loop (before the inner one) there is an allocation and some other instructions to be done by a single thread (I allocate a matrix that the threads work on together in the inner loop, so every thread should have access to it). I tried to do something like this:
#pragma omp parallel
{
for (auto t=1;t<=time_step;++t){
#pragma omp single {
Matrix<unsigned int> newField(rows,cols);
//some instructions
}
unsigned int j;
#pragma omp for
for (unsigned int i = 2;i<=rows-1;++i){
for ( j = 1;j<=cols;++j){
//Work on NewField (i,j)
}
}
#pragma omp single {
//Instruction
}
}
}
This code doesn't work. Is this way (if I make it work) more efficient than creating the threads every time? What am I doing wrong?
Thank you!

Many OpenMP implementations keep a pool of threads alive instead of creating new threads before every parallel region.
So you can just go with
for (auto t=1;t<=time_step;++t){
Matrix<unsigned int> newField(rows,cols);
//some instructions
#pragma omp parallel for
for (unsigned int i = 2;i<=rows-1;++i){
// j is declared in the loop so that each thread gets its own copy
for (unsigned int j = 1;j<=cols;++j){
//Work on NewField (i,j)
}
}
//Instruction
}
and it could even be faster because there are no single directives.

The way your code is written now will cause syntax errors. An OpenMP directive ends at the end of its line, so when you use directives such as single or critical, the opening brace must go on a new line.
So instead of this
#pragma omp single {
}
You need to do this
#pragma omp single
{
}
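For reference, here is a minimal sketch of what the attempt in the question could look like once the braces are fixed. It is an illustration under assumptions, not the asker's real code: `run_steps`, the flat `std::vector` standing in for `Matrix`, and the "add 1 to every cell" work are all made up. The shared field is declared before the parallel region, because anything declared inside the `single` block would go out of scope at its closing brace.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's time loop: every step adds 1 to
// each cell of a rows x cols field stored row by row in a flat vector.
std::vector<unsigned> run_steps(int rows, int cols, int time_steps)
{
    assert(rows >= 0 && cols >= 0 && time_steps >= 0);
    // Declared before the parallel region, so all threads share it.
    std::vector<unsigned> field(static_cast<std::size_t>(rows) * cols, 0u);
    int steps_started = 0;  // also shared; only touched inside 'single'

    #pragma omp parallel
    {
        // Every thread executes the time loop; the constructs inside divide the work.
        for (int t = 1; t <= time_steps; ++t) {
            #pragma omp single
            {
                ++steps_started;  // serial per-step work, done by one thread
            }
            // Implicit barrier: the serial part is finished before the loop starts.

            #pragma omp for
            for (int i = 0; i < rows; ++i) {
                for (int j = 0; j < cols; ++j) {  // j is declared in the loop, so it is private
                    field[static_cast<std::size_t>(i) * cols + j] += 1u;
                }
            }
            // Implicit barrier again, so the next step sees a complete field.
        }
    }
    assert(steps_started == time_steps);
    return field;
}
```

Whether this beats a plain `parallel for` inside the time loop depends on the implementation; with a thread pool the difference is often small, so it is worth measuring both.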

Related

OpenMP thread initialization and de-initialization before doing work

I am using an API that needs to be started and stopped for every thread in which it is used. So if I want to do something with the API in a specific thread I have to call api_start() (and api_stop() afterwards).
Now I have a very simple problem that can be solved in parallel, and I want to try it with OpenMP. Suppose the problem looks like this:
#pragma omp parallel for num_threads(NUM_THREADS), default(none)
for (auto i = 0; i < count; i++)
{
api_process(i);
}
This will not work because the worker threads of OpenMP did not call api_start() or api_stop() so a working solution would be:
#pragma omp parallel for num_threads(NUM_THREADS), default(none)
for (auto i = 0; i < count; i++)
{
api_start();
api_process(i);
api_stop();
}
But this solution adds overhead because now a thread calls api_start() and api_stop() multiple times (if NUM_THREADS < count).
So my question is: Is there a way in OpenMP to define a function to call for every created thread once on startup and once on deletion?
Thanks in advance!
You can call the functions manually at the beginning/end of the first/last iteration, respectively, or use something like std::call_once. However, this would add some overhead to each iteration (branching).
EDIT: Actually, this wouldn't work since only a single thread would call those functions. You would need to define some thread-local flags and check them in each iteration. Same downside.
A much better alternative is simply to split the combined parallel for into separate parallel and for constructs:
#pragma omp parallel
{
api_start();
#pragma omp for
for (auto i = 0; i < count; i++)
{
api_process(i);
}
api_stop();
}
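A runnable sketch of the split version, assuming made-up stand-ins for the API: here `api_start`, `api_stop` and `api_process` only count calls and double their input, and `process_all` is a hypothetical wrapper. The counters make it possible to check that start and stop run once per thread rather than once per iteration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the API in the question; they only count calls.
static std::atomic<int> g_starts{0};
static std::atomic<int> g_stops{0};
static thread_local bool t_started = false;

void api_start() { t_started = true;  ++g_starts; }
void api_stop()  { t_started = false; ++g_stops; }
int  api_process(int i)
{
    assert(t_started);  // the calling thread must have started the API
    return 2 * i;
}

std::vector<int> process_all(int count)
{
    assert(count >= 0);
    std::vector<int> out(static_cast<std::size_t>(count));
    #pragma omp parallel
    {
        api_start();  // runs once in every thread of the team
        #pragma omp for
        for (int i = 0; i < count; ++i) {
            out[i] = api_process(i);
        }
        // The loop ends with an implicit barrier, so no thread stops the API early.
        api_stop();   // runs once in every thread of the team
    }
    return out;
}
```

Note that this runs the start/stop pair once per thread per parallel region, not once per thread lifetime; if the region is entered repeatedly, the calls are repeated as well.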

OpenMP tasks & data environment

I'm trying to use a task construct for my C++/OpenMP program:
#pragma omp parallel
{
typename ClusterNormal<DIM>::VectorMean ResultMeanThread;
ResultMeanThread.setZero();
#pragma omp single
for(list<unsigned int>::const_iterator it=IDLeft.cbegin(); it!=IDLeft.cend(); it++)
{
#pragma omp task
{
ResultMeanThread += Data[*it];
}
}
}
This code computes the sum of some VectorMean objects (it doesn't matter what they are, but they have operator+ defined) for the elements of Data indicated in IDLeft.
Every thread initializes its VectorMean with all zeros. My problem is that after the for loop, ResultMeanThread is still composed of all zeros.
When a task is executed, the sum is computed correctly, but after the task finishes, ResultMeanThread is always reset to zeros.
How could I fix it? I'm using tasks because of the lists, but my code isn't working.
I've found that the problem was the declaration of ResultMeanThread as a private variable.
I tried this code, declaring a vector of ResultMeanThread as a shared variable (the length of the vector is the number of threads), so every thread accesses only one element of the vector (no race conditions).
In the previous code, every ResultMeanThread was zero because of the task construct. A variable that is private in the enclosing parallel region is firstprivate in a task by default, so each task adds to its own copy, which is discarded when the task ends. I have to use the task construct because of the list.
Here's the code:
vector<typename ClusterNormal<DIM>::VectorMean> ResultMeanThread;
typename ClusterNormal<DIM>::VectorMean ResultMeanFinal;
//here i set initial values to zero, and number of vector elements equal to total number of threads
#pragma omp parallel
{
#pragma omp single
for(list<unsigned int>::const_iterator it=IDLeft.cbegin(); it!=IDLeft.cend(); it++)
{
#pragma omp task
{
ResultMeanThread[omp_get_thread_num()] += Data[*it];
}
}
#pragma omp taskwait
// Final sum
#pragma omp critical
{
ResultMeanFinal+=ResultMeanThread[omp_get_thread_num()];
}
}
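The same idea as a self-contained sketch, with plain `double` values standing in for `VectorMean` (an assumption made only to keep the example short; `sum_selected`, `data` and `ids` are hypothetical names). The per-thread slots are summed serially after the region, which removes the need for the critical section.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP enabled.
inline int omp_get_thread_num()  { return 0; }
inline int omp_get_max_threads() { return 1; }
#endif

// Sums data[id] for every id in ids, creating one task per list element.
double sum_selected(const std::vector<double>& data, const std::list<unsigned>& ids)
{
    // One slot per thread; shared because it is declared outside the region.
    std::vector<double> partial(static_cast<std::size_t>(omp_get_max_threads()), 0.0);

    #pragma omp parallel
    {
        #pragma omp single
        for (auto it = ids.cbegin(); it != ids.cend(); ++it) {
            // 'it' is firstprivate in the task by default: each task keeps its own copy.
            #pragma omp task
            {
                assert(*it < data.size());
                // Each task writes only to the slot of the thread that executes it.
                partial[omp_get_thread_num()] += data[*it];
            }
        }
        // The implicit barrier at the end of 'single' waits for all tasks.
    }

    double total = 0.0;
    for (double p : partial) total += p;  // serial final sum
    return total;
}
```

Neighbouring slots of `partial` may share a cache line, so for heavy workloads padding the slots or using a task reduction (OpenMP 5.0) may perform better.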

OpenMP tasks passing "shared" pointers

I would like to use the OpenMP task pragmas for the following code:
std::vector<Class*> myVectorClass;
#pragma omp parallel
{
#pragma omp single nowait
{
for (std::list<Class*>::iterator it = myClass.begin(); it != myClass.end();) {
#pragma omp task firstprivate(it)
(*it)->function(t, myVectorClass);
++it;
}
}
#pragma omp taskwait
}
The problem, or one of them, is that myVectorClass is a pointer to an object, so it is not possible to set this vector as shared. myVectorClass is modified by the function. The code above crashes. So, could you tell me how to modify it (without using the for-loop pragmas)?
Thanks
myVectorClass is a vector of pointers. In your current code, it is shared. Since your code crashes, I suppose you change the length of myVectorClass in function(). However, std::vector is not thread-safe, so modifying its length from multiple threads will corrupt its data structure.
Depending on what exactly function() does, you could have simple solutions. The basic idea is to use one thread-local vector per thread to collect the result of function() first, then concatenate/merge these vectors into a single one.
The code shown in the question "C++ OpenMP Parallel For Loop - Alternatives to std::vector" gives a good example:
std::vector<int> vec;
#pragma omp parallel
{
std::vector<int> vec_private;
#pragma omp for nowait //fill vec_private in parallel
for(int i=0; i<100; i++) {
vec_private.push_back(i);
}
#pragma omp critical
vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}
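Since the question asks for tasks rather than a worksharing loop, the same collect-then-merge idea can be sketched with one result vector per thread. Everything here is an assumption for illustration: `Item` and its `function` stand in for `Class`, and `collect` is a hypothetical wrapper.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP enabled.
inline int omp_get_thread_num()  { return 0; }
inline int omp_get_max_threads() { return 1; }
#endif

// Hypothetical stand-in for Class: function() appends results to the vector it is given.
struct Item {
    int value;
    void function(std::vector<int>& out) const
    {
        out.push_back(value);
        out.push_back(-value);
    }
};

std::vector<int> collect(const std::list<Item*>& items)
{
    // One private result vector per thread, indexed by thread number.
    std::vector<std::vector<int>> per_thread(
        static_cast<std::size_t>(omp_get_max_threads()));

    #pragma omp parallel
    {
        #pragma omp single
        for (auto it = items.begin(); it != items.end(); ++it) {
            #pragma omp task firstprivate(it)
            {
                assert(*it != nullptr);
                // Only the executing thread touches its own vector: no resize race.
                (*it)->function(per_thread[omp_get_thread_num()]);
            }
        }
        // The implicit barrier at the end of 'single' waits for all tasks.
    }

    // Merge serially after the parallel region.
    std::vector<int> merged;
    for (const auto& v : per_thread) merged.insert(merged.end(), v.begin(), v.end());
    return merged;
}
```

The order of elements in the merged vector depends on which thread ran which task, so sort afterwards if the order matters.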

Creating threads within a multithreaded for loop using openMP

I am new to OpenMP and I am not able to create threads within each threaded loop iteration. My question may sound naive, please bear with me.
#pragma omp parallel private(a,b) shared(f)
{
#pragma omp for
for(...)
{
//some operations
// I want to parallelize the four statements below as well, inside the multithreaded for loop
int x=func1(a,b);
int val1=validate(x);
int y=func2(a,b);
int val2=validate(y);
}
}
Within the for loop all threads are busy with loop iterations, so there are no resources left to execute the work inside an iteration in parallel. And if the work is well balanced, you won't gain any performance.
If it is hard or impossible to balance the work with a parallel for, you can try generating tasks within the loop and doing the work afterwards. But be aware of the overhead of task generation.
#pragma omp parallel private(a,b) shared(f)
{
#pragma omp for nowait
for(...)
{
//some operations
#pragma omp task
{
int x=func1(a,b);
int val1=validate(x);
}
#pragma omp task
{
int y=func2(a,b);
int val2=validate(y);
}
}
// wait for all tasks to be finished (implicit at the end of the parallel region (here))
#pragma omp taskwait
}
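A self-contained variant of the task version that keeps its results, as a sketch under assumptions: the bodies of `func1`, `func2` and `validate` are invented, the inputs `a` and `b` are derived from the loop index, and `run_iterations` is a hypothetical name. Each task writes to a distinct array element, so no synchronization beyond the closing barrier is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for func1, func2 and validate from the question.
int func1(int a, int b) { return a + b; }
int func2(int a, int b) { return a * b; }
int validate(int x)     { return x >= 0 ? 1 : 0; }

// Runs the two independent halves of every iteration as separate tasks.
void run_iterations(int n, std::vector<int>& val1, std::vector<int>& val2)
{
    assert(n >= 0);
    val1.assign(static_cast<std::size_t>(n), 0);
    val2.assign(static_cast<std::size_t>(n), 0);

    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            int a = i;      // per-iteration inputs, made up for this sketch
            int b = i - 3;
            // firstprivate copies a, b and i into each task when it is created.
            #pragma omp task firstprivate(a, b, i)
            { val1[i] = validate(func1(a, b)); }
            #pragma omp task firstprivate(a, b, i)
            { val2[i] = validate(func2(a, b)); }
        }
        // All tasks are guaranteed to finish at the barrier that ends the parallel region.
    }
}
```

With bodies this small the task overhead dominates; the pattern only pays off when each half of an iteration does substantial, unevenly distributed work.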

Elegantly initializing openmp threads in parallel for loop

I have a for loop that uses a (somewhat complicated) counter object sp_ct to initialize an array. The serial code looks like
sp_ct.depos(0);
for(int p=0;p<size; p++, sp_ct.increment() ) {
in[p]=sp_ct.parable_at_basis();
}
My counter supports parallelization because it can be initialized to the state after p increments, leading to the following working code-fragment:
int firstloop=-1;
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(sp_ct,firstloop)
for(int p=0;p<size;p++) {
if( firstloop == -1 ) {
sp_ct.depos(p); firstloop=0;
} else {
sp_ct.increment();
}
in[p]=sp_ct.parable_at_basis();
} // end omp parallel for
I dislike this because of the clutter that obscures what is really going on, and because it has an unnecessary branch inside the loop (Yes, I know that this is likely to not have a measurable influence on running time because it is so predictable...).
I would prefer to write something like
#pragma omp parallel for default(none) shared(size,in) firstprivate(sp_ct,firstloop)
for(int p=0;p<size;p++) {
#pragma omp initialize // or something
{ sp_ct.depos(p); }
in[p]=sp_ct.parable_at_basis();
sp_ct.increment();
} // end omp parallel for
Is this possible?
If I generalize your problem, the question is "How to execute some initialization code for each thread of a parallel section?", is that right? You may use a property of the firstprivate clause: "the initialization or construction of the given variable happens as if it were done once per thread, prior to the thread's execution of the construct".
struct thread_initializer
{
explicit thread_initializer(
int size /*initialization params*/) : size_(size) {}
//Copy constructor that does the init
thread_initializer(const thread_initializer& _it) : size_(_it.size_)
{
//Here goes once per thread initialization
for(int p=0;p<size_;p++)
sp_ct.depos(p);
}
int size_;
scp_type sp_ct;
};
Then the loop may be written :
thread_initializer init(size);
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(init)
for(int p=0;p<size;p++) {
in[p]=init.sp_ct.parable_at_basis();
init.sp_ct.increment();
}
The bad things are that you have to write this extra initializer and that some code is moved away from its actual execution point. The good things are that you can reuse it and that the loop syntax is cleaner.
From what I can tell, you can do this by manually defining the chunks. This looks somewhat like something I was trying to do with induction in OpenMP, in the question "Induction with OpenMP: getting range values for a parallized for loop in OpenMP".
So you probably want something like this:
#pragma omp parallel
{
const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int start = ithread*size/nthreads;
const int finish = (ithread+1)*size/nthreads;
Counter_class_name sp_ct;
sp_ct.depos(start);
for(int p=start; p<finish; p++, sp_ct.increment()) {
in[p]=sp_ct.parable_at_basis();
}
}
Notice that except for some declarations and changing the range values this code is almost identical to the serial code.
Also, you don't have to declare anything shared or private. Everything declared inside the parallel block is private and everything declared outside is shared. You don't need firstprivate either. This makes the code cleaner and clearer (IMHO).
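To try the manual chunking end to end, here is a sketch with a made-up counter. The `Counter` class below is purely hypothetical (its position is an integer and `parable_at_basis()` returns the square of it), and `fill` is an invented wrapper. The only deviation from the code above is that the chunk bounds are computed in 64-bit arithmetic, since `ithread*size` can overflow an `int` for large arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP enabled.
inline int omp_get_thread_num()  { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical counter: depos(p) jumps to position p, increment() advances by one.
class Counter {
public:
    void depos(int p)             { pos_ = p; }
    void increment()              { ++pos_; }
    int  parable_at_basis() const { return pos_ * pos_; }
private:
    int pos_ = 0;
};

std::vector<int> fill(int size)
{
    assert(size >= 0);
    std::vector<int> in(static_cast<std::size_t>(size));
    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int ithread  = omp_get_thread_num();
        // One contiguous chunk per thread; 64-bit math avoids overflow in the product.
        const int start  = static_cast<int>(1LL * ithread * size / nthreads);
        const int finish = static_cast<int>(1LL * (ithread + 1) * size / nthreads);

        Counter sp_ct;       // private: declared inside the parallel region
        sp_ct.depos(start);  // the expensive jump happens once per thread
        for (int p = start; p < finish; ++p, sp_ct.increment()) {
            in[p] = sp_ct.parable_at_basis();
        }
    }
    return in;
}
```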
I see what you're trying to do, and I don't think it is possible. I'm just going to write some code that I believe would achieve the same thing, and is somewhat clean, and if you like it, sweet!
sp_ct.depos(0);
in[0]=sp_ct.parable_at_basis();
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(sp_ct,firstloop)
for(int p = 1; p < size; p++) {
sp_ct.increment();
in[p]=sp_ct.parable_at_basis();
} // end omp parallel for
Riko, implement sp_ct.depos(), so it will invoke .increment() only as often as necessary to bring the counter to the passed parameter. Then you can use this code:
sp_ct.depos(0);
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(sp_ct)
for(int p=0;p<size;p++) {
sp_ct.depos(p);
in[p]=sp_ct.parable_at_basis();
} // end omp parallel for
This solution has one additional benefit. Your implementation only works if each thread receives exactly one contiguous chunk of the range 0 to size, which is the case when you specify schedule(static) without a chunk size (OpenMP 4.0 Specification, chapter 2.7.1, page 57). But since you did not specify a schedule, the schedule that is used is implementation dependent (OpenMP 4.0 Specification, chapter 2.3.2). If the implementation chooses dynamic or guided, threads will receive multiple chunks with gaps between them. So one thread could receive chunk 0-20 and then chunk 70-90, which puts p and sp_ct out of sync on the second chunk. The solution above is compatible with all schedules.
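A sketch of what such a depos() could look like, using an invented counter whose position is a plain integer (the `Counter` class, its restart-from-zero behaviour and `fill_any_schedule` are all assumptions, not the asker's real class). The loop deliberately requests `schedule(dynamic, 4)` so that threads receive several separated chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counter whose depos() steps forward only as far as needed.
class Counter {
public:
    void increment() { ++pos_; }
    void depos(int p)
    {
        if (p < pos_) pos_ = 0;        // target lies behind us: restart from the beginning
        while (pos_ < p) increment();  // no-op when already at p, one step within a chunk
    }
    int parable_at_basis() const { return pos_ * pos_; }
private:
    int pos_ = 0;
};

std::vector<int> fill_any_schedule(int size)
{
    assert(size >= 0);
    std::vector<int> in(static_cast<std::size_t>(size));
    Counter sp_ct;
    sp_ct.depos(0);
    // Each thread starts from its own copy of sp_ct and may get many small chunks.
    #pragma omp parallel for schedule(dynamic, 4) firstprivate(sp_ct)
    for (int p = 0; p < size; ++p) {
        sp_ct.depos(p);  // cheap inside a chunk, a longer walk at chunk boundaries
        in[p] = sp_ct.parable_at_basis();
    }
    return in;
}
```

The cost of the jumps at chunk boundaries depends entirely on how the real counter implements depos(), so larger chunks may be preferable if that jump is expensive.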