I am trying to parallelize a for-loop with OpenMP. Usually this should be fairly straightforward. However, I need to perform thread-specific initializations prior to executing the for-loop.
Specifically, I have the following problem: I have a random number generator which is not thread-safe, so I need to create an instance of the RNG for every thread. But I want to make sure that not every thread produces the same random numbers.
So I tried the following:
#pragma omp parallel
{
    int rndseed = 42;
#ifdef _OPENMP
    rndseed += omp_get_thread_num();
#endif
    // initialize random number generator

    #pragma omp for
    for (int sampleid = 0; sampleid < numsamples; ++sampleid)
    {
        // do stuff
    }
}
If I use this construct I get the following error message at runtime:
Fatal User Error 1002: '#pragma omp for' improperly nested in a work-sharing construct
So is there a way to do thread-specific initializations?
Thanks
The error you have:
Fatal User Error 1002: '#pragma omp for' improperly nested in a work-sharing construct
refers to an illegal nesting of worksharing constructs. In fact, the OpenMP 3.1 standard gives the following restrictions in section 2.5:
Each worksharing region must be encountered by all threads in a team or by none at all.
The sequence of worksharing regions and barrier regions encountered must be the same for every thread in a team.
From the restrictions quoted above it follows that nesting one worksharing construct inside another within the same parallel region is not conforming.
Even though the illegal nesting is not visible in your snippet, I assume it was hidden by an oversimplification of the post with respect to the actual code. To give you a hint, the most common cases are:
loop worksharing constructs nested inside a single construct (similar to the example here)
loop worksharing constructs nested inside another loop construct
In case you are interested, the latter case is discussed in more detail in this answer.
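For the first case, here is a minimal non-conforming sketch (the function names zero_fill and illegal_nesting are placeholders, not taken from the question). The orphaned omp for binds to the enclosing parallel region but is encountered only by the one thread executing the single construct, which is exactly the restriction quoted above:

void zero_fill(int n, double *data)
{
    // Orphaned worksharing loop: legal on its own, but its binding is
    // determined by the innermost enclosing parallel region at run time.
    #pragma omp for
    for (int i = 0; i < n; ++i)
        data[i] = 0.0;
}

void illegal_nesting(int n, double *data)
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            // Non-conforming: the worksharing loop inside zero_fill() is
            // dynamically nested in the single construct, so it is
            // encountered by only one thread of the team.
            zero_fill(n, data);
        }
    }
}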
I think there's a design error.
A parallel for loop is not just N threads (with N the number of cores, for example), but potentially N*X threads, with 1 <= N*X < numsamples.
If you want an "iteration private" variable, then declare it just inside the loop body (but you know that already); declaring a thread-private variable for use inside a parallel for loop is probably not justified enough.
Related
I am trying to parallelize my C++ code using OpenMP.
So this is my first time with OpenMP, and I have a couple of questions about how to use private / shared properly.
Below is just a sample code I wrote to understand what is going on. Correct me if I am wrong.
#pragma omp parallel for
for (int x = 0; x < 100; x++)
{
    for (int y = 0; y < 100; y++)
    {
        for (int z = 0; z < 100; z++)
        {
            a[x][y][z] = U[x] + U[y] + U[z];
        }
    }
}
So by using #pragma omp parallel for I can use multiple threads to do this loop, i.e. with 5 threads, thread #1 handles 0<=x<20, thread #2 handles 20<=x<40, ..., thread #5 handles 80<=x<100.
And the threads run at the same time, so this should make the code faster.
Since x, y, and z are declared inside the loop, they are private (each thread has its own copy of these variables), while a and U are shared.
So each thread reads a shared variable U and writes to a shared variable a.
I have a couple of questions.
What would be the difference between #pragma omp parallel for and #pragma omp parallel for private(y,z)? I think since x, y, and z are already private, they should be the same.
If I use #pragma omp parallel for private(a, U), does this mean each thread will have a copy of a and U?
For example, with 2 threads that each have a copy of a and U, thread #1 uses 0<=x<50 so that it writes from a[0][0][0] to a[49][99][99], and thread #2 writes from a[50][0][0] to a[99][99][99]. Do they then merge these two results so that they end up with the complete version of a[x][y][z]?
Any variable declared within a parallel block will be private. Variables mentioned in the private clause of a parallel directive follow the normal rules for variables: the variable must already be declared at the point it is used.
The effect of private is to create a copy of the variable for each thread. The threads can then update their copies without worrying about changes made by other threads. At the end of the parallel block these per-thread values are generally lost unless other clauses are included in the parallel directive. The reduction clause is the most common, as it can combine the results from each thread into a final result for the loop.
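To make the distinction concrete, here is a minimal sketch (not the asker's code) contrasting a plain private variable with a reduction:

#include <cstdio>

int main()
{
    const int n = 100;
    double sum = 0.0;
    double scratch = 0.0;

    // scratch: each thread works on its own copy with an indeterminate
    //          initial value; the per-thread copies are discarded afterwards.
    // sum:     reduction(+:sum) gives each thread a private accumulator
    //          initialized to 0 and adds them into the shared sum at the end.
    #pragma omp parallel for private(scratch) reduction(+:sum)
    for (int i = 0; i < n; ++i)
    {
        scratch = static_cast<double>(i); // written before being read, so the
        sum += scratch;                   // indeterminate initial value is harmless
    }

    std::printf("sum = %f\n", sum); // prints 4950
    return 0;
}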
What does it mean in OpenMP that
Nested parallel regions are serialized by default
Does it mean the threads execute it one after another? I also cannot understand this part:
A throw executed inside a parallel region must cause execution to resume within
the dynamic extent of the same structured block, and it must be caught by the
same thread that threw the exception.
As explained here (scroll down to "17.1 Nested parallelism"), by default a nested parallel region is not parallelized and thus runs sequentially. Nested thread creation can be enabled with either OMP_NESTED=true (as an environment variable) or omp_set_nested(1) (in your code).
EDIT: also see this answer to a similar question.
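A minimal sketch of what "serialized by default" means in practice (omp_set_nested is the pre-OpenMP-5.0 API mentioned above; newer runtimes may prefer omp_set_max_active_levels):

#include <cstdio>
#include <omp.h>

int main()
{
    omp_set_nested(1); // comment this out (and leave OMP_NESTED unset) to see
                       // the inner region run with a team of just one thread

    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(2)
        {
            // With nesting disabled this prints only inner id 0 for each
            // outer thread; with nesting enabled, inner ids 0 and 1 appear.
            std::printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}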
I am using OpenMP successfully to parallelize for loops in my C++ code. I tried to
take a step further and use OpenMP tasks. Unfortunately my code behaves
really strangely, so I wrote a minimal example and found a problem.
I would like to define a couple of tasks. Each task should be executed once
by an idle thread.
Unfortunately I can only make either all threads execute every task, or
only one thread perform all tasks sequentially.
Here is my code which basically runs sequentially:
#include <iostream>
#include <omp.h>
using namespace std;

int main() {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp single nowait
        {
            #pragma omp task
            cout << "My id is " << id << endl;
            #pragma omp task
            cout << "My id is " << id << endl;
            #pragma omp task
            cout << "My id is " << id << endl;
            #pragma omp task
            cout << "My id is " << id << endl;
        }
    }
    return 0;
}
Only worker 0 shows up and gives its id four times.
I expected to see "My id is 0; My id is 1; My id is 2; My id is 3".
If I delete #pragma omp single I get 16 messages: all threads execute
every single cout.
Is this a problem with my OpenMP setup, or did I not understand something about
tasks? I am using gcc 6.3.0 on Ubuntu with the -fopenmp flag.
Your basic usage of OpenMP tasks (parallel -> single -> task) is correct; what you are missing are the intricacies of data-sharing attributes for variables.
First, you can easily confirm that your tasks are run by different threads by calling omp_get_thread_num() inside the task instead of accessing id.
What happens in your example is that id becomes implicitly private within the parallel construct. However, inside the task it becomes implicitly firstprivate. This means the task copies the value from the thread that executes the single construct. A more elaborate discussion of a similar issue can be found here.
Note that if you used private on a nested task construct, it would not be the same private variable as the one of the outer parallel construct. Simply put, private does not refer to the thread, but to the construct. That is the difference from threadprivate. However, threadprivate is not an attribute of a construct but its own directive, and it only applies to variables with file scope, namespace scope, or static variables with block scope.
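Here is a hedged sketch of the first suggestion, moving the omp_get_thread_num() call inside the task so each task reports the thread that actually runs it (output order may interleave):

#include <iostream>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            for (int i = 0; i < 4; ++i)
            {
                #pragma omp task
                {
                    // Queried inside the task: reports the executing thread,
                    // not a firstprivate copy captured from the single thread.
                    std::cout << "My id is " << omp_get_thread_num() << std::endl;
                }
            }
        }
    }
    return 0;
}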
Consider the following scenario:
Function A creates one level of OMP parallel region, and each OMP thread makes a call to a function B, which itself contains another level of OMP parallel region.
If, within the parallel region of function B, there is an OMP critical region, is that region critical "globally" with respect to all threads created by functions A and B, or only locally to function B?
And what if B is a pre-built function (e.g. in a static or dynamically linked library)?
Critical regions in OpenMP have global binding and their scope extends to all occurrences of the critical construct that have the same name (in that respect all unnamed constructs share the same special internal name), no matter where they occur in the code. You can read about the binding of each construct in the corresponding Binding section of the OpenMP specification. For the critical construct you have:
The binding thread set for a critical region is all threads. Region execution is restricted to a single thread at a time among all the threads in the program, without regard to the team(s) to which the threads belong.
(HI: emphasis mine)
That's why it is strongly recommended that named critical regions should be used, especially if the sets of protected resources are disjoint, e.g.:
// This one located inside a parallel region in fun1
#pragma omp critical(fun1)
{
    // Modify shared variables a and b
}
...
// This one located inside a parallel region in fun2
#pragma omp critical(fun2)
{
    // Modify shared variables c and d
}
Naming the regions eliminates the chance that two unrelated critical constructs could block each other.
As to the second part of your question: to support the dynamic scoping requirements of the OpenMP specification, critical regions are usually implemented with named mutexes that are resolved at run time. Therefore it is possible to have homonymous critical regions in a prebuilt library function and in your code, and they will work as expected as long as both codes use the same OpenMP runtime, e.g. both were built with the same compiler suite. Cross-suite OpenMP compatibility is usually not guaranteed. Also, if there is an unnamed critical region in B(), it will interfere with all unnamed critical regions in the rest of the code, no matter whether they are part of the same library code or belong to the user code.
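To illustrate that last point, a sketch with hypothetical function names (library_update standing in for code inside a prebuilt B(), user_update for your own code): both unnamed critical constructs map to the same global lock, so they serialize against each other even though they protect unrelated data.

// Hypothetical prebuilt library code, e.g. somewhere inside B()
void library_update()
{
    #pragma omp critical   // unnamed: uses the one global unnamed lock
    {
        // ... modify library-internal state ...
    }
}

// Your own code
void user_update()
{
    #pragma omp critical   // also unnamed: mutually exclusive with the one
    {                      // above, although the protected data is unrelated
        // ... modify user state ...
    }
}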
I'm trying to parallelize a pretty massive for-loop in OpenMP. About 20% of the time it runs through fine, but the rest of the time it crashes with various segfaults such as:
*** glibc detected *** ./execute: double free or corruption (!prev): <address> ***
*** glibc detected *** ./execute: free(): invalid next size (fast): <address> ***
[2] <PID> segmentation fault ./execute
My general code structure is as follows:
<declare and initialize shared variables here>
#pragma omp parallel private(list of private variables which are initialized in for loop) shared(much shorter list of shared variables)
{
    #pragma omp for
    for (index = 0; index < end; index++) {
        // Lots of functionality (science!)
        // Calls to other deep functions which manipulate private variables
        // Finally generate some calculated_values
        shared_array1[index] = calculated_value1;
        shared_array2[index] = calculated_value2;
        shared_array3[index] = calculated_value3;
    } // end for

    // final tidy up
} // end parallel
In terms of what's going on, each loop iteration is totally independent of every other iteration, other than the fact that they pull data from shared matrices (but different columns on each iteration). Where I call other functions, they only change private variables (although they occasionally read shared variables), so I'd assume they're thread-safe as they only touch data local to a specific thread. The only writing to shared variables happens right at the end, where we write various calculated values to shared arrays whose elements are indexed by the for-loop index. This code is in C++, although the code it calls is both C and C++.
I've been trying to identify the source of the problem, but no luck so far. If I set num_threads(1) it runs fine, as it does if I enclose the contents of the for-loop in a single critical section:
#pragma omp for
for (index = 0; index < end; index++) {
    #pragma omp critical(whole_loop)
    {
        // loop body
    }
}
which presumably gives the same effect (i.e. only one thread can pass through the loop at any one time).
If, on the other hand, I enclose the for-loop's contents in two critical directives e.g.
#pragma omp for
for (index = 0; index < end; index++) {
    #pragma omp critical(whole_loop)
    {
        // first half of loop body
    }
    #pragma omp critical(whole_loop2)
    {
        // second half of loop body
    }
}
I get unpredictable segfaults. Similarly, if I enclose EVERY function call in a critical directive it still doesn't work.
The reason I think the problem is linked to a function call is that when I profile with Valgrind (using valgrind --tool=drd --check-stack-var=yes --read-var-info=yes ./execute), as well as SIGSEGVing, I get an insane number of load and store errors, such as:
Conflicting load by thread 2 at <address> size <number>
at <address> : function which is ultimately called from within my for loop
which, according to the Valgrind manual, is exactly what you'd expect with race conditions. Certainly this kind of weirdly appearing/disappearing issue seems consistent with the non-deterministic errors that race conditions produce, but I don't understand how, given that every call which shows apparent race conditions is inside a critical section.
Things which could be wrong but I don't think are include:
All private() variables are initialized inside the for-loops (because they're thread local).
I've checked that shared variables have the same memory address while private variables have different memory addresses.
I'm not sure synchronization would help, but given that critical directives enforce mutual exclusion on their contents, and I've tried versions of my code where every function call is enclosed in a (uniquely named) critical section, I think we can rule that out.
Any thoughts on how best to proceed would be hugely appreciated. I've been banging my head against this all day. Obviously I'm not looking for an "Oh, here's the problem" type answer, but more for how best to proceed in terms of debugging/deconstructing.
Things which could be an issue, or might be helpful:
There are some std::vectors in the code which use push_back() to add elements. I remember reading that resizing vectors isn't thread-safe, but the vectors are only private variables, so not shared between threads. I figured this would be OK?
If I enclose the entire for-loop body in a critical directive and slowly shrink back the end of the code block (so an ever-growing region at the end of the for-loop is outside the critical section), it runs fine until I expose one of the function calls, at which point the segfaulting resumes. Analyzing this binary with Valgrind shows race conditions in many other function calls, not just the one I exposed.
One of the function calls is to a GSL function, which doesn't trigger any race conditions according to Valgrind.
Do I need to go and explicitly define private and shared variables in the functions being called? If so, this seems somewhat limiting for OpenMP - would this not mean you need to have OpenMP compatibility for any legacy code you call?
Is parallelizing a big for-loop just not something that works?
If you've read this far, thank you and Godspeed.
There was no way anyone could have answered this, but having figured it out I hope it helps someone, given how bizarre my system's behaviour was.
One of the (C) functions I was ultimately calling (my_function->intermediate_function->lower_function->BAD_FUNCTION) declared a number of its variables as static, which meant that they retained the same memory address between calls and so essentially acted as shared variables. It is interesting that static storage overrides OpenMP's data-sharing clauses (a minimal sketch of the pattern follows the list below).
I discovered all this by:
Using Valgrind to identify where the errors were happening, and looking at the specific variables involved.
Defining the entire for-loop as a critical section and then gradually exposing more code at the top and bottom.
Talking to my boss. More sets of eyes always help, not least because you're forced to verbalize the problem (which ended up with me opening the culprit function and pointing at the declarations).
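For anyone hitting the same thing, here is a hypothetical reconstruction of the pattern (bad_function / good_function are illustrative names, not the real code):

// Roughly what BAD_FUNCTION was doing: a static local has a single storage
// location shared by every thread, no matter what private() clauses appear
// further up the call chain, so concurrent calls race on it.
double bad_function(double x)
{
    static double workspace;  // one shared instance for all threads
    workspace = x * x;
    return workspace;         // may return another thread's value
}

// Thread-safe alternative: an ordinary automatic (stack) local is private
// to each call and therefore to each thread.
double good_function(double x)
{
    double workspace = x * x;
    return workspace;
}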