Elegantly initializing openmp threads in parallel for loop - c++

I have a for loop that uses a (somewhat complicated) counter object sp_ct to initialize an array. The serial code looks like
sp_ct.depos(0);
for(int p=0;p<size; p++, sp_ct.increment() ) {
in[p]=sp_ct.parable_at_basis();
}
My counter supports parallelization because it can be initialized to the state after p increments, leading to the following working code-fragment:
int firstloop=-1;
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(sp_ct,firstloop)
for(int p=0;p<size;p++) {
if( firstloop == -1 ) {
sp_ct.depos(p); firstloop=0;
} else {
sp_ct.increment();
}
in[p]=sp_ct.parable_at_basis();
} // end omp parallel for
I dislike this because of the clutter that obscures what is really going on, and because it has an unnecessary branch inside the loop (Yes, I know that this is likely to not have a measurable influence on running time because it is so predictable...).
I would prefer to write something like
#pragma omp parallel for default(none) shared(size,in) firstprivate(sp_ct,firstloop)
for(int p=0;p<size;p++) {
#pragma omp initialize // or something
{ sp_ct.depos(p); }
in[p]=sp_ct.parable_at_basis();
sp_ct.increment();
} // end omp parallel for
Is this possible?

If I generalize your problem, the question is "How do I execute some initialization code for each thread of a parallel section?", is that right? You may use a property of the firstprivate clause: "the initialization or construction of the given variable happens as if it were done once per thread, prior to the thread's execution of the construct".
struct thread_initializer
{
explicit thread_initializer(
int size /*initialization params*/) : size_(size) {}
//Copy constructor that does the init
thread_initializer(const thread_initializer& _it) : size_(_it.size_)
{
//Here goes once per thread initialization
for(int p=0;p<size_;p++)
sp_ct.depos(p);
}
int size_;
scp_type sp_ct;
};
Then the loop may be written :
thread_initializer init(size);
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(init)
for(int p=0;p<size;p++) {
init.sp_ct.increment();
in[p]=init.sp_ct.parable_at_basis();
}
The downsides are that you have to write this extra initializer and that some code is moved away from its actual execution point. The upside is that you can reuse it, and the loop syntax is cleaner.

From what I can tell, you can do this by manually defining the chunks. This looks somewhat like something I was trying to do with induction in OpenMP, see "Induction with OpenMP: getting range values for a parallized for loop in OpenMP".
So you probably want something like this:
#pragma omp parallel
{
const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int start = ithread*size/nthreads;
const int finish = (ithread+1)*size/nthreads;
Counter_class_name sp_ct;
sp_ct.depos(start);
for(int p=start; p<finish; p++, sp_ct.increment()) {
in[p]=sp_ct.parable_at_basis();
}
}
Notice that except for some declarations and changing the range values this code is almost identical to the serial code.
Also, you don't have to declare anything shared or private. Everything declared inside the parallel block is private and everything declared outside is shared. You don't need firstprivate either. This makes the code cleaner and clearer (IMHO).
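A self-contained sketch of the manual chunking above, so it can be compiled and checked against the serial result. The Counter type here is a hypothetical stand-in for the question's counter (only the method names come from the post), and the #ifdef fallback is an assumption added so the code also builds without OpenMP enabled.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the question's counter type.
struct Counter {
    long state = 0;
    void depos(long p) { state = p; }   // jump to the state after p increments
    void increment() { ++state; }
    long parable_at_basis() const { return state * state; }
};

std::vector<long> fill_chunked(int size) {
    assert(size >= 0);
    std::vector<long> in(size);          // declared outside the region: shared
    #pragma omp parallel
    {
#ifdef _OPENMP
        const int nthreads = omp_get_num_threads();
        const int ithread  = omp_get_thread_num();
#else
        const int nthreads = 1;          // serial fallback when OpenMP is off
        const int ithread  = 0;
#endif
        // 64-bit arithmetic avoids overflow of ithread*size for large sizes.
        const int start  = static_cast<int>(static_cast<long long>(ithread) * size / nthreads);
        const int finish = static_cast<int>(static_cast<long long>(ithread + 1) * size / nthreads);
        Counter sp_ct;                   // declared inside the region: private
        sp_ct.depos(start);
        for (int p = start; p < finish; ++p, sp_ct.increment()) {
            in[p] = sp_ct.parable_at_basis();
        }
    }
    return in;
}
```

Because each thread positions its own counter once and then only increments, the loop body stays identical to the serial version.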

I see what you're trying to do, and I don't think it is possible. I'm just going to write some code that I believe would achieve the same thing, and is somewhat clean, and if you like it, sweet!
sp_ct.depos(0);
in[0]=sp_ct.parable_at_basis();
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(sp_ct,firstloop)
for(int p = 1; p < size; p++) {
sp_ct.increment();
in[p]=sp_ct.parable_at_basis();
} // end omp parallel for

Riko, implement sp_ct.depos(), so it will invoke .increment() only as often as necessary to bring the counter to the passed parameter. Then you can use this code:
sp_ct.depos(0);
#pragma omp parallel for \
default(none) shared(size,in) firstprivate(sp_ct)
for(int p=0;p<size;p++) {
sp_ct.depos(p);
in[p]=sp_ct.parable_at_basis();
} // end omp parallel for
This solution has one additional benefit. Your implementation only works if each thread receives a single contiguous chunk of the range 0 to size, which is the case when schedule(static) is specified without a chunk size (OpenMP 4.0 Specification, chapter 2.7.1, page 57). But since you did not specify a schedule, the schedule used is implementation-dependent (OpenMP 4.0 Specification, chapter 2.3.2). If the implementation chooses dynamic or guided, threads will receive multiple chunks with gaps between them. One thread could receive chunk 0-20 and then chunk 70-90, which would put p and sp_ct out of sync on the second chunk. The solution above is compatible with all schedules.
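One possible shape for such a depos(), as a sketch only: the Counter below is a hypothetical stand-in, and the plain assignment in the last branch stands for whatever (presumably more expensive) direct positioning the real counter performs. The loop uses schedule(dynamic, 16) on purpose so that threads receive several separate chunks.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the question's counter; only the method names
// come from the post.
struct Counter {
    long state = 0;
    void increment() { ++state; }
    void depos(long p) {
        assert(p >= 0);
        if (p == state) return;                       // already in position
        if (p == state + 1) { increment(); return; }  // next index of the same chunk
        state = p;                                    // first index of a new chunk: full reposition
    }
    long parable_at_basis() const { return state * state; }
};

std::vector<long> fill_any_schedule(int size) {
    assert(size >= 0);
    std::vector<long> in(size);
    Counter sp_ct;
    sp_ct.depos(0);
    // dynamic scheduling hands each thread multiple non-adjacent chunks
    #pragma omp parallel for default(none) shared(size, in) firstprivate(sp_ct) schedule(dynamic, 16)
    for (int p = 0; p < size; ++p) {
        sp_ct.depos(p);                   // cheap within a chunk, full jump at chunk starts
        in[p] = sp_ct.parable_at_basis();
    }
    return in;
}
```

The branch the question wanted to avoid has not disappeared; it has moved into depos(), where it is at least hidden from the loop body.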

Related

OpenMP thread initialization and de-initialization before doing work

I am using an API that needs to be started and stopped for every thread in which it is used. So if I want to do something with the API in a specific thread I have to call api_start() (and api_stop() afterwards).
Now I have a very trivial problem I can solve in parallel which I want to try with OpenMP. Consider the problem is looking like this:
#pragma omp parallel for num_threads(NUM_THREADS), default(none)
for (auto i = 0; i < count; i++)
{
api_process(i);
}
This will not work because the worker threads of OpenMP did not call api_start() or api_stop() so a working solution would be:
#pragma omp parallel for num_threads(NUM_THREADS), default(none)
for (auto i = 0; i < count; i++)
{
api_start();
api_process(i);
api_stop();
}
But this solution will bring up overhead because now a thread calls api_start() and api_stop() multiple times (if NUM_THREADS < count).
So my question is: Is there a way in OpenMP to define a function to call for every created thread once on startup and once on deletion?
Thanks in advance!
You can call the functions manually at the beginning/end of the first/last iteration, respectively, or use something like std::call_once. However, this would add some overhead to each iteration (branching).
EDIT: Actually, this wouldn't work since only a single thread would call those functions. You would need to define some thread-local flags and check them in iterations. Same downside.
A much better alternative would be simply to split parallel and for OpenMP code blocks:
#pragma omp parallel
{
api_start();
#pragma omp for
for (auto i = 0; i < count; i++)
{
api_process(i);
}
api_stop();
}
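To check that the split really calls the start/stop functions once per thread, here is a minimal sketch. The api_start(), api_stop() and api_process() bodies are hypothetical stand-ins that only count calls and record a per-thread flag; the real API is not shown in the question.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the API in the question.
std::atomic<int> g_starts{0};
std::atomic<int> g_stops{0};
thread_local bool t_started = false;   // has this thread called api_start()?

void api_start() { t_started = true;  ++g_starts; }
void api_stop()  { t_started = false; ++g_stops; }
int  api_process(int i) {
    assert(t_started);                 // must run between start and stop on this thread
    return 2 * i;
}

std::vector<int> process_all(int count) {
    std::vector<int> out(count);
    #pragma omp parallel
    {
        api_start();                   // once per thread, before any iteration
        #pragma omp for
        for (int i = 0; i < count; ++i) {
            out[i] = api_process(i);
        }                              // implicit barrier at the end of the for
        api_stop();                    // once per thread, after all iterations
    }
    return out;
}
```

After a call, g_starts and g_stops should both equal the number of threads in the team, regardless of count.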

OpenMP: how to realize thread local object in task?

What I am trying to do is to iterate over all elements of a container in parallel comparable to #pragma omp for; however the container in question does not offer a random access iterator. I therefore use a workaround via tasks described in this stackoverflow answer:
for (std::map<A,B>::iterator it = my_map.begin();
it != my_map.end();
++it)
{
#pragma omp task
{ /* do work with it */ }
}
My problem is that a 'scratch space' object is needed in each iteration; this object is expensive to construct or copy into the data environment of the task. It would only be necessary to have a single thread local object for each thread; in the sense that each task uses the object of the thread it is executed on. private requires a copy and shared results in a race condition. Is there a way to realize this with OpenMP?
I researched #pragma omp threadprivate, however the objects are not static, as the structure of the program looks something like this:
method(int argument_for_scratch_object){
#pragma omp parallel
{
Object scratch(argument_for_scratch_object);
//some computations are done here...
#pragma omp single nowait
{
//here goes the for loop creating the tasks above
//each task uses the scratch space object
}
}
}
If scratch was declared static (and then made threadprivate) before the parallel region, it would be initialized with argument_for_scratch_object of the first method call; which might not be correct for the subsequent method calls.
Given your update, I would suggest using a global/static thread-private pointer and having each thread initialize it within your parallel section.
static Object* scratch_ptr;
#pragma omp threadprivate(scratch_ptr)
void method(int argument_for_scratch_object)
{
#pragma omp parallel
{
scratch_ptr = new Object(argument_for_scratch_object);
...
delete scratch_ptr;
}
}
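A fuller sketch of how this could combine with the task loop from the question. Scratch and sum_with_scratch are hypothetical names, and the work done per task is invented for illustration. The explicit barrier before the delete is an addition: with single nowait, a thread could otherwise free its object and then pick up a task that still needs it.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical scratch object: costly to build, reused by every task on a thread.
struct Scratch {
    std::vector<int> buffer;
    explicit Scratch(int n) : buffer(n, 0) {}
};

static Scratch* scratch_ptr = nullptr;
#pragma omp threadprivate(scratch_ptr)

long sum_with_scratch(const std::map<int, int>& my_map, int scratch_size) {
    assert(scratch_size > 0);
    long total = 0;                                  // shared accumulator
    #pragma omp parallel
    {
        scratch_ptr = new Scratch(scratch_size);     // one object per thread
        #pragma omp single nowait
        {
            for (auto it = my_map.begin(); it != my_map.end(); ++it) {
                #pragma omp task firstprivate(it)
                {
                    Scratch* s = scratch_ptr;        // pointer of the executing thread
                    s->buffer[0] = it->second;       // use the scratch space
                    const long contribution = it->first + s->buffer[0];
                    #pragma omp atomic
                    total += contribution;
                }
            }
        }
        // All tasks of the region are complete once the barrier is passed,
        // so the scratch objects are no longer in use.
        #pragma omp barrier
        delete scratch_ptr;
        scratch_ptr = nullptr;
    }
    return total;
}
```

Since the tasks are tied (the default), each one runs start to finish on a single thread and therefore sees one consistent scratch object.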

Openmp: nested loops and allocation

I'd like to parallelize a for loop within another for loop. I can simply use the instruction "#pragma omp parallel for" directly on the inner loop, but I fear that creating a new set of threads each time is not the optimal thing to do. In the outer loop (before the inner one) there is the allocation and some other instructions to be done by a single thread (I allocate a matrix that is shared by the threads in the inner loop, so every thread should have access to it). I tried to do something like this:
#pragma omp parallel
{
for (auto t=1;t<=time_step;++t){
#pragma omp single {
Matrix<unsigned int> newField(rows,cols);
//some instructions
}
unsigned int j;
#pragma omp for
for (unsigned int i = 2;i<=rows-1;++i){
for ( j = 1;j<=cols;++j){
//Work on NewField (i,j)
}
}
#pragma omp single {
//Instruction
}
}
}
This code doesn't work. Is this approach (if I make it work) more efficient than creating the threads every time? What am I doing wrong?
Thank you!
Many OpenMP implementations keep a pool of threads instead of creating them before every parallel region.
So you can just go with
for (auto t=1;t<=time_step;++t){
Matrix<unsigned int> newField(rows,cols);
//some instructions
#pragma omp parallel for
for (unsigned int i = 2;i<=rows-1;++i){
for (unsigned int j = 1;j<=cols;++j){ // j declared here so each thread has its own copy
//Work on NewField (i,j)
}
}
//Instruction
}
and it could even be faster because the single directives, each of which ends in an implicit barrier, are gone.
The way your code is written now will cause syntax errors. A directive such as single or critical ends at the end of its line, so the opening brace must go on a new line.
So instead of this
#pragma omp single {
}
You need to do this
#pragma omp single
{
}
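For completeness, a sketch of the single-region variant from the question with the braces fixed. It assumes a flat std::vector in place of the question's Matrix<unsigned int>, and the values written are invented for illustration. Note that the field is declared before the parallel region: a variable declared inside the single block would go out of scope at its closing brace and would not be visible in the work-shared loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Matrix<unsigned int>: a flat row-major buffer.
std::vector<unsigned int> run_steps(int rows, int cols, int time_steps) {
    assert(rows > 0 && cols > 0);
    std::vector<unsigned int> field;      // declared before the region: shared
    #pragma omp parallel
    {
        for (int t = 1; t <= time_steps; ++t) {   // every thread runs the t loop
            #pragma omp single
            {
                // serial setup, executed by one thread only
                field.assign(static_cast<std::size_t>(rows) * cols, 0u);
            }                             // implicit barrier: all threads see the new field
            #pragma omp for
            for (int i = 0; i < rows; ++i) {
                for (int j = 0; j < cols; ++j) {  // j is private: declared in the loop
                    field[static_cast<std::size_t>(i) * cols + j] =
                        static_cast<unsigned int>(t * (i + j));
                }
            }                             // implicit barrier before the next single
        }
    }
    return field;
}
```

Whether this beats a plain parallel for inside the time loop depends on the implementation's thread pool; measuring both is the only reliable answer.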

Forking once and then use the threads in a different procedure?

If I fork inside my main program and then call a subroutine inside a single directive, what is the behavior if I enter an OMP parallel directive in this subroutine?
My guess/hope is that existing threads are used, as they all should have nothing to do at the moment.
Pseudo-Example:
double A[];
int main() {
#pragma omp parallel num_threads(2)
{
#pragma omp single
{
for (int t=0; t<1000; t++) {
evolve();
}
}
}
}
void evolve() {
#pragma omp parallel for num_threads(2)
for (int i=0; i<100; i++) {
do_stuff(i);
}
}
void do_stuff(int i) {
// expensive calculation on array element A[i]
}
As evolve() is called very often, forking here would cause way too much overhead, so I would like to do it only once, then call evolve() from a single thread and split the work of the calls to do_stuff() over the existing threads.
For Fortran this seems to work. I get a roughly 80-90% speed increase on a simple example using 2 threads. But for C++ I get a different behavior: only the thread which executes the single directive is used for the loop in evolve().
I fixed the problem using the task directive in the main program and passing the limits to evolve(), but this looks like a clumsy solution...
Why is the behavior in Fortran and C++ different and what would be the solution in C++?
I believe orphaned directives are the cleanest solution in your case:
double A[];
int main() {
#pragma omp parallel num_threads(2)
{
// Each thread calls evolve() a thousand times
for (int t=0; t<1000; t++) {
evolve();
}
}
}
void evolve() {
// The orphaned construct inside evolve()
// will bind to the innermost parallel region
#pragma omp for
for (int i=0; i<100; i++) {
do_stuff(i);
} // Implicit thread synchronization
}
void do_stuff(int i) {
// expensive calculation on array element A[i]
}
This will work because (section 2.6.1 of the standard):
A loop region binds to the innermost enclosing parallel region
That said, in your code you are using nested parallel constructs. To be sure to enable them you must set the environment variable OMP_NESTED to true, otherwise (quoting Appendix E of the latest standard):
OMP_NESTED environment variable: if the value is neither true nor false the behavior is implementation defined
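A compilable sketch of the orphaned-directive version, with a trivial do_stuff() invented for illustration so the result can be checked. The array size and the run() wrapper are assumptions, not part of the original code.

```cpp
#include <cassert>
#include <vector>

std::vector<double> A(100, 0.0);

void do_stuff(int i) { A[i] += 1.0; }     // stand-in for the expensive calculation

void evolve() {
    const int n = static_cast<int>(A.size());
    // Orphaned worksharing construct: binds to the parallel region of the caller.
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        do_stuff(i);
    }                                     // implicit barrier: the step is finished here
}

void run(int steps) {
    assert(steps >= 0);
    #pragma omp parallel num_threads(2)
    {
        // Every thread calls evolve(); the iterations inside are split among them.
        for (int t = 0; t < steps; ++t) {
            evolve();
        }
    }
}
```

If evolve() is called outside any parallel region, the orphaned for simply runs serially on the calling thread.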
Unfortunately, your code will likely not work as expected in all cases. If you have a code structure like this:
void foo() {
#pragma omp parallel
#pragma omp single
bar();
}
void bar() {
#pragma omp parallel
printf("...");
}
OpenMP is requested to create a new team of threads when entering the parallel region in bar. OpenMP calls that "nested parallelism". However, what exactly happens depends on the OpenMP implementation you actually use and on the setting of OMP_NESTED.
OpenMP implementations are not required to support nested parallelism. It would be perfectly legal for an implementation to ignore the parallel region in bar and just execute it with one thread. OMP_NESTED can be used to turn nesting on and off, if the implementation supports it.
In your case, things by chance went well, since you sent all threads to sleep except one. This thread then created a new team of threads of full size (potentially NEW threads, not reusing the old ones). If you omitted the single construct, you would easily get thousands of threads.
Unfortunately, OpenMP does not support your pattern to create a parallel team, have one thread executing the call stacks, and then distribute work across the other team members through a worksharing construct like for. If you need this code pattern, the only solution will be OpenMP tasks.
Cheers,
-michael
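A minimal sketch of the task-based pattern mentioned above: one thread walks the call stack and generates tasks, and the rest of the team executes them. The chunk parameter, the run() wrapper and the trivial do_stuff() are assumptions added for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

std::vector<double> A(100, 0.0);

void do_stuff(int i) { A[i] += 1.0; }     // stand-in for the expensive calculation

// Called by ONE thread; splits the index range into tasks for the team.
void evolve(int chunk) {
    assert(chunk > 0);
    const int n = static_cast<int>(A.size());
    for (int lo = 0; lo < n; lo += chunk) {
        int hi = std::min(lo + chunk, n);
        #pragma omp task firstprivate(lo, hi)
        {
            for (int i = lo; i < hi; ++i) do_stuff(i);
        }
    }
    #pragma omp taskwait                   // finish this step before the next one starts
}

void run(int steps, int chunk) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            // Only one thread runs the time loop; the others wait at the
            // barrier of the single construct and execute tasks meanwhile.
            for (int t = 0; t < steps; ++t) {
                evolve(chunk);
            }
        }
    }
}
```

The chunk size trades task-creation overhead against load balance, so it is worth tuning for the actual cost of do_stuff().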
Your example doesn't actually call fork(), so I suspect you don't mean fork in the system-call sense (i.e. duplicating your process). However, if that really is what you meant, I suspect that most OpenMP implementations will not work correctly in a forked process. Typically, threads are not preserved across fork() calls. If the OpenMP implementation you use registers pthread_atfork() handlers, it may work correctly following a fork() call, but it will not use the same threads as the parent process.

OpenMP tasks passing "shared" pointers

I would like to use the task pragmas of openMP for the next code:
std::vector<Class*> myVectorClass;
#pragma omp parallel
{
#pragma omp single nowait
{
for (std::list<Class*>::iterator it = myClass.begin(); it != myClass.end();) {
#pragma omp task firstprivate(it)
(*it)->function(t, myVectorClass);
++it;
}
}
#pragma omp taskwait
}
The problem, or one of them, is that myVectorClass holds pointers to objects, so it does not seem possible to set this vector as shared. myVectorClass is modified by the function. The code above crashes. So, could you tell me how to modify it (without using the for-loop pragmas)?
Thanks
myVectorClass is a vector of pointers. In your current code, it is shared. Since your code crashes, I suppose you change the length of myVectorClass in function(). However, std::vector is not thread-safe, so modifying its length from multiple threads will corrupt its data structure.
Depending on what exactly function() does, you could have simple solutions. The basic idea is to use one thread-local vector per thread to collect the result of function() first, then concatenate/merge these vectors into a single one.
The code shown in "C++ OpenMP Parallel For Loop - Alternatives to std::vector" gives a good example:
std::vector<int> vec;
#pragma omp parallel
{
std::vector<int> vec_private;
#pragma omp for nowait //fill vec_private in parallel
for(int i=0; i<100; i++) {
vec_private.push_back(i);
}
#pragma omp critical
vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}