correct use of internal function and openmp - c++

I have a for loop which calls an internal function:
some variables
for(int i=0; i< 10000000; i++)
func(variables)
Basically, func gets a reference to some array A and inserts values at A[i] - so I'm assured
that each call to func tries to insert a value at a different place in A, and all other input variables stay the same as they were before the for loop. So func is thread-safe.
Can I safely change the code to
some variables
#pragma omp parallel for
for(int i=0; i< 10000000; i++)
func(variables)
From what I understand from the OpenMP tutorials, this isn't good enough - since the OpenMP library wouldn't know that the variables given to func are really thread-safe, this would cause attempts at synchronization that would slow things down, and I would need to declare variables private, etc. But actually, when trying the code above, it does seem to run faster and in parallel - is this expected? I just wanted to make sure I'm not missing something.
The declaration of func:
func(int i, int client_num, const vector<int>& vec)

First of all, OpenMP cannot magically determine the dependencies in your code. It is your responsibility to ensure that the code is correct for parallelization.
In order to safely parallelize the for loop, func must not have loop-carried flow dependences (inter-iteration dependencies), especially of the read-after-write kind. Also, you must check that there are no static variables. (The full conditions for safe parallelization are too complex to write down in this short answer.)
Your description of func says that each call writes to a different place. If so, you can safely parallelize it with #pragma omp parallel for, as long as the remaining computations have no dependences that prohibit parallelization.
Your prototype of func: func(int i, int client_num, const vector<int>& vec)
There is a vector, but it is const, so vec should not introduce any dependency. Simultaneous reads from different threads are safe.
However, you say that the output is different. That means something is wrong. It's impossible to say what the problem is from here; showing the prototype of the function does not help much, because we need to know what kind of computations are done inside func.
Nonetheless, some steps for the diagnosis are:
Check the dependences in your code. You must not have dependences like the one shown below, where the array A has a loop-carried dependence that prevents parallelization:
for (int k = 1; k <N; ++k) A[k] = A[k-1]+1;
Check that func is re-entrant (thread-safe). Most commonly, static and global variables will break your code. If so, you can solve this by privatization: in OpenMP, you may list such variables in a private clause, or use the threadprivate pragma.
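As a concrete illustration of that last point, here is a minimal sketch. The scratch variable and the extra A parameter are my own additions (the original func presumably reaches A some other way); the point is that a static variable would be shared by every thread unless it is made threadprivate:

#include <omp.h>
#include <vector>

// Hypothetical per-call scratch state: without the pragma below, all threads
// would share this single object and race on it.
static int scratch;
#pragma omp threadprivate(scratch)

// Assumed signature: the output array A is passed explicitly here only to keep
// the sketch self-contained.
void func(int i, int client_num, const std::vector<int>& vec, std::vector<int>& A)
{
    scratch = vec[i % vec.size()] + client_num;  // each thread uses its own scratch
    A[i] = scratch;                              // each iteration writes a distinct A[i]
}

void run(int client_num, const std::vector<int>& vec, std::vector<int>& A)
{
    #pragma omp parallel for
    for (int i = 0; i < (int)A.size(); ++i)
        func(i, client_num, vec, A);
}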

You do not change your loop variable i anywhere, so it is no problem for the compiler to parallelize the loop. As i is only copied into your function (passed by value), it cannot be changed outside.
The only thing you need to make sure of is that inside your function you write only to position A[i] and read only from position A[i]. Otherwise you might get race conditions.

Related

Can a compiler read twice from a global variable, instead of storing a local one?

I've been trying to get re-familiarized with multi-threading recently and found this paper. One of the examples says to be careful when using code like this:
int my_counter = counter; // Read global
int (* my_func) (int);
if (my_counter > my_old_counter) {
... // Consume data
my_func = ...;
... // Do some more consumer work
}
... // Do some other work
if (my_counter > my_old_counter) {
... my_func(...) ...
}
Stating that:
If the compiler decides that it needs to spill the register
containing my_counter between the two tests, it may well decide to
avoid storing the value (it’s just a copy of counter, after all), and
to instead simply re-read the value of counter for the second
comparison involving my_counter [...]
Doing this would turn the code into:
int my_counter = counter; // Read global
int (* my_func) (int);
if (my_counter > my_old_counter) {
... // Consume data
my_func = ...;
... // Do some more consumer work
}
... // Do some other work
my_counter = counter; // Reread global!
if (my_counter > my_old_counter) {
... my_func(...) ...
}
I, however, am skeptical about this. I don't understand why the compiler is allowed to do this, since to my understanding a data race only occurs when the same memory location is accessed concurrently by any number of reads and at least one write. The author goes on to motivate it as follows:
the core problem arises from the compiler taking advantage of the
assumption that variable values cannot asynchronously change without
an explicit assignment
It seems to me that the condition is respected in this case, as the local variable my_counter is never accessed twice and cannot be accessed by other threads. How would the compiler know that the global variable cannot be set elsewhere, in another translation unit, by another thread? It cannot, and in fact I assumed that the second if would actually just be optimized away.
Is the author wrong, or am I missing something?
Unless counter is explicitly volatile, the compiler may assume that it never changes if nothing in the current scope of execution could change it. That means that if there is no alias to the variable, and no function call in between whose effects the compiler cannot know, any external modification is undefined behavior. With volatile you would be declaring external changes as possible, even if the compiler can't know how they happen.
So that optimization is perfectly valid. In fact, even if it did actually keep the copy, it still wouldn't be thread-safe: the value may change partially mid-read, or might even be completely stale, as cache coherency is not guaranteed without synchronization primitives or atomics.
Well, actually on x86 you won't get an intermediate value for an integer, at least as long as it is aligned; that's one of the guarantees the architecture makes. Stale caches still apply: the value may already have been modified by another thread.
Use either a mutex or an atomic if you need this behavior.
Compilers [are allowed to] optimize presuming that anything which is "undefined behavior" simply cannot happen: that the programmer will prevent the code from being executed in such a way that would invoke the undefined behavior.
This can lead to rather silly executions where, for example, the following loop never terminates!
int vals[10];
for(int i = 0; i < 11; i++) {
vals[i] = i;
}
This is because the compiler knows that accessing vals[10] would be undefined behavior, therefore it assumes it cannot happen: i never reaches 10 inside the loop body, so i can never equal or exceed 11, the condition i < 11 is always true, and the loop never terminates. Not all compilers will aggressively optimize a loop like this in this way, though I do know that GCC does.
In the specific case you're working with, reading a global variable in this way can be undefined behavior iff [sic] it is possible for another thread to modify it in the interim. As a result, the compiler is assuming that cross-thread modifications never happen (because it's undefined behavior, and compilers can optimize presuming UB doesn't happen), and thus it's perfectly safe to reread the value (which it knows doesn't get modified in its own code).
The solution is to make counter atomic (std::atomic<int>), which forces the compiler to acknowledge that there might be some kind of cross-thread manipulation of the variable.
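For illustration, a minimal sketch of that fix, split into a consumer and a producer as in the paper's example (the function names and memory orders below are my own choices, not from the paper):

#include <atomic>

std::atomic<int> counter{0};   // shared, written by the producer thread
int my_old_counter = 0;        // only touched by the consumer thread

void consumer()
{
    int my_counter = counter.load(std::memory_order_acquire); // one explicit read
    if (my_counter > my_old_counter) {
        // consume data; my_counter is a plain local, so it is not re-read
        my_old_counter = my_counter;
    }
}

void producer()
{
    counter.fetch_add(1, std::memory_order_release); // publishes the update
}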

c++ pragma omp segmentation fault (data race?) with arrays

I used #pragma directives to parallelize my program. Without it, everything works fine.
Unfortunately, I use complex arrays which I have to declare globally because they are used in several functions within the parallelization. As far as I understand, technically this doesn't make a difference since they are stored globally anyway.
However, the problematic array(s) are used privately. From what I understood from other discussions, I have to assign memory to the arrays before the parallelization starts to ensure that the program reserves the memory correctly for every thread. Within the threads I then assign memory again. The size (matrixsize) does not change.
But even if I set num_threads(1) for testing, the data (the array "degree") in the thread eventually becomes corrupted.
In an earlier version I declared the arrays within the threads and didn't use any functions. Everything worked fine too, but this is getting too messy now.
I tried to reduce the code. Hope it's understandable. I use gcc to compile it.
Sorry, I can't figure out the problem. I would be thankful for some advice.
best,
Mathias
#include <omp.h>
#include <list>
#include <vector>
using namespace std;
int matrixsize=200;
vector<int> degree;
vector<list<int> >adjacency;
vector<vector<bool> >admatrix;
vector<vector<float> > geopos;
// [...]
void netgen();
void runanalyses();
// [...]
int main(int argc, char *argv[])
{
// [...]
adjacency.assign(matrixsize,list<int>());
admatrix.assign(matrixsize, vector<bool>(matrixsize, 0));
degree.assign(matrixsize,0);
geopos.assign(matrixsize,vector<float> (dim,0));
#pragma omp parallel for shared(degdist,ADC,ADCnorm,ACC,ACCnorm,its,matrixsize) private(adjacency,admatrix,degree,geopos) num_threads(1)
for (int a=0;a<its;a++)
{
adjacency.assign(matrixsize,list<int>());
admatrix.assign(matrixsize, vector<bool>(matrixsize, 0));
degree.assign(matrixsize,0);
geopos.assign(matrixsize,vector<float> (dim,0));
netgen();
runanalyses();
} // for parallelization
// [...]
}
Unfortunately, I use complex arrays which I have to declare globally because they are used in several functions within the parallelization. As far as I understand, technically this doesn't make a difference since they are stored globally anyway.
You really should not do that! Modifying global data structures within parallel regions makes it very difficult to reason about data races. Instead, define proper interfaces, e.g. pass the vectors by (const) reference. For example, you can safely operate on a const std::vector& in a parallel region.
Once you have gotten rid of the global state, if you still encounter issues, feel free to ask a proper follow-up question, but make sure to include a Minimal, Complete, and Verifiable example (read that page very carefully) as well as a description of the specific error you are getting and your attempts to debug it.
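To make the suggestion concrete, here is a minimal sketch of the refactoring, assuming netgen() and runanalyses() can be changed to take the containers as parameters (these signatures are my guess; the original functions take no arguments, which is exactly the problem):

#include <list>
#include <vector>
using std::list;
using std::vector;

// Assumed signatures: passing the containers explicitly replaces the globals.
void netgen(vector<int>& degree, vector<list<int> >& adjacency,
            vector<vector<bool> >& admatrix, vector<vector<float> >& geopos)
{ /* build the network (placeholder) */ }

void runanalyses(const vector<int>& degree, const vector<list<int> >& adjacency,
                 const vector<vector<bool> >& admatrix,
                 const vector<vector<float> >& geopos)
{ /* analyse it (placeholder) */ }

void run(int its, int matrixsize, int dim)
{
    #pragma omp parallel for
    for (int a = 0; a < its; ++a)
    {
        // Each iteration owns its containers: no globals, no private() clause,
        // and therefore no data race on the containers themselves.
        vector<int> degree(matrixsize, 0);
        vector<list<int> > adjacency(matrixsize);
        vector<vector<bool> > admatrix(matrixsize, vector<bool>(matrixsize, false));
        vector<vector<float> > geopos(matrixsize, vector<float>(dim, 0.0f));

        netgen(degree, adjacency, admatrix, geopos);
        runanalyses(degree, adjacency, admatrix, geopos);
    }
}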

Should I specify the volatile keyword for every object that shares its memory between different threads

I just read the Do not use volatile as a synchronization primitive article on the CERT site and noticed that a compiler can theoretically optimize the following code in such a way that it stores the flag variable in a register instead of modifying the actual memory shared between different threads:
bool flag = false; // Not declaring it as volatile is wrong. But even by declaring it volatile this code is still erroneous
void test() {
while (!flag) {
Sleep(1000); // sleeps for 1000 milliseconds
}
}
void Wakeup() {
flag = true;
}
int account_balance; // shared between threads (not declared in the original snippet)
void debit(int amount){
test();
account_balance -= amount;//We think it is safe to go inside the critical section
}
Am I right?
Is it true that I need to use the volatile keyword for every object in my program that shares its memory between different threads? Not because it does any kind of synchronization for me (I need to use mutexes or other synchronization primitives for that anyway), but just because the compiler can possibly optimize my code and keep shared variables in registers, so other threads would never see the updated values?
It's not just about keeping them in registers; there are all sorts of levels of caching between shared main memory and the CPU. Much of that caching is per CPU core, so any change made there will not be seen by other cores for a long time (or, if other cores are modifying the same memory, those changes may potentially be lost completely).
There are no guarantees about how that caching will behave, and even if something is true for current processors it may well not be true for older processors or for the next generation of processors. In order to write safe multi-threaded code you need to do it properly. The easiest way is to use the libraries and tools provided for that purpose. Trying to do it yourself using low-level primitives like volatile is very hard and requires a lot of in-depth knowledge.
It is actually very simple, but confusing at the same time. At a high level, there are two optimizing entities at play when you write C++ code: the compiler and the CPU. Within the compiler, there are two major optimization techniques with regard to variable access: omitting a variable access even though it is written in the code, and moving other instructions around that particular variable access.
In particular, the following example demonstrates those two techniques:
int k;
bool flag;
void foo() {
flag = true;
int i = k;
k++;
k = i;
flag = false;
}
In the code provided, the compiler is free to skip the first modification of flag (leaving only the final assignment to false) and to completely remove any modification of k. If you make k volatile, you require the compiler to preserve all accesses to k: it will be incremented, and then the original value put back. If you make flag volatile as well, both assignments, first to true, then to false, will remain in the code. However, reordering would still be possible, and the effective code might look like:
void foo() {
flag = true;
flag = false;
int i = k;
k++;
k = i;
}
This will have an unpleasant effect if another thread expects flag to indicate whether k is currently being modified.
One way to achieve the desired effect is to declare both variables as atomic. This prevents the compiler from applying either optimization, ensuring that the code executed is the same as the code written. Note that atomic is, in effect, "volatile plus": it does everything volatile does, and more.
Another thing to note is that compiler optimizations are indeed a very powerful and desirable tool, and one should not impede them just for the fun of it; so atomics should be used only when they are required.
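A minimal sketch of that suggestion, using the same foo() as above (the choice of the default sequentially consistent operations is mine):

#include <atomic>

std::atomic<bool> flag{false};
std::atomic<int>  k{0};

void foo()
{
    // The compiler must treat these as observable atomic accesses, and with the
    // default sequentially consistent ordering they are not reordered past one
    // another, so a thread that sees flag == true can rely on it.
    flag.store(true);
    int i = k.load();
    k.fetch_add(1);
    k.store(i);
    flag.store(false);
}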
On your particular
bool flag = false;
example, declaring it as volatile will universally work and is 100% correct.
But it will not buy you that all the time.
volatile imposes on the compiler that each and every access to the object (or plain C variable) is performed on memory rather than on a cached copy in a register, i.e. every read really re-reads the value. In some cases the code and memory footprint can become quite a bit larger, but the real issue is that it's not enough.
When some time-based context-switching is going on (e.g. threads), and your volatile object/variable is aligned and fits in a CPU register, you get what you intended. Under these strict conditions, a change or evaluation is atomically done, so in a context switching scenario the other thread will be immediately "aware" of any changes.
However, if your object or variable does not fit in a CPU register (because of its size or alignment), a thread context switch around a volatile access may still be a problem: a read in the concurrent thread can catch the object in the middle of being changed. For example, while a 5-member struct is being copied, the concurrent thread may be scheduled just as the 3rd member is being written. Boom!
The conclusion (back to "Operating Systems 101") is: you need to identify your shared objects, choose a concurrent-access strategy (preemptive and blocking, non-preemptive, or otherwise), and make your readers/writers atomic. The access methods usually incorporate the make-it-atomic strategy, or (if the object is aligned and small) you can simply declare it volatile.
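As a rough sketch of the "struct bigger than a register" case, assuming a hypothetical five-member struct (this example is mine, not from the answer above), a mutex is the straightforward way to keep readers from observing a half-written value:

#include <mutex>

// Hypothetical shared object too large to be read or written in one step.
struct State {
    int a, b, c, d, e;
};

State      shared_state;
std::mutex state_mutex;

void writer(const State& s)
{
    std::lock_guard<std::mutex> lock(state_mutex); // readers never see a partial update
    shared_state = s;
}

State reader()
{
    std::lock_guard<std::mutex> lock(state_mutex);
    return shared_state;                           // consistent snapshot
}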

OpenMP causes heisenbug segfault

I'm trying to parallelize a pretty massive for-loop in OpenMP. About 20% of the time it runs through fine, but the rest of the time it crashes with various segfaults, such as:
*** glibc detected *** ./execute: double free or corruption (!prev): <address> ***
*** glibc detected *** ./execute: free(): invalid next size (fast): <address> ***
[2] <PID> segmentation fault ./execute
My general code structure is as follows:
<declare and initialize shared variables here>
#pragma omp parallel private(list of private variables which are initialized in for loop) shared(much shorter list of shared variables)
{
#pragma omp for
for (index = 0 ; index < end ; index++) {
// Lots of functionality (science!)
// Calls to other deep functions which manipulate private variables
// Finally generated some calculated_values
shared_array1[index] = calculated_value1;
shared_array2[index] = calculated_value2;
shared_array3[index] = calculated_value3;
} // end for
}
// final tidy up
}
In terms of what's going on, each loop iteration is totally independent of each other loop iteration, other than the fact they pull data from shared matrices (but different columns on each loop iteration). Where I call other functions, they're only changing private variables (although occasionally reading shared variables) so I'd assume they'd be thread safe as they're only messing with stuff local to a specific thread? The only writing to any shared variables happens right at the end, where we write various calculated values to some shared arrays, where array elements are indexed by the for-loop index. This code is in C++, although the code it calls is both C and C++ code.
I've been trying to identify the source of the problem, but no luck so far. If I set num_threads(1) it runs fine, as it does if I enclose the contents of the for-loop in a single
#pragma omp for
for(index = 0 ; index < end ; index++) {
#pragma omp critical(whole_loop)
{
// loop body
}
}
which presumably gives the same effect (i.e. only one thread can pass through the loop at any one time).
If, on the other hand, I enclose the for-loop's contents in two critical directives e.g.
#pragma omp for
for(index = 0 ; index < end ; index++) {
#pragma omp critical(whole_loop)
{
// first half of loop body
}
#pragma omp critical(whole_loop2)
{
// second half of loop body
}
}
I get the unpredictable segfaulting. Similarly, if I enclose EVERY function call in a critical directive it still doesn't work.
The reason I think the problem may be linked to a function call is that when I profile with Valgrind (using valgrind --tool=drd --check-stack-var=yes --read-var-info=yes ./execute), as well as SIGSEGVing I get an insane number of load and store errors, such as:
Conflicting load by thread 2 at <address> size <number>
at <address> : function which is ultimately called from within my for loop
which, according to the Valgrind manual, is exactly what you'd expect with race conditions. Certainly this kind of weirdly appearing/disappearing issue seems consistent with the non-deterministic errors race conditions give, but I don't understand how that can be if every call which shows apparent race conditions is in a critical section.
Things which could be wrong but I don't think are include:
All private() variables are initialized inside the for-loops (because they're thread local).
I've checked that shared variables have the same memory address while private variables have different memory addresses.
I'm not sure synchronization would help, but given there are implicit barrier directives on entry and exit to critical directives and I've tried versions of my code where every function call is enclosed in a (uniquely named) critical section I think we can rule that out.
Any thoughts on how to best proceed would be hugely appreciated. I've been banging my head against this all day. Obviously I'm not looking for a, "Oh - here's the problem" type answer, but more how best to proceed in terms of debugging/deconstructing.
Things which could be an issue, or might be helpful:
There are some std::vectors in the code which use push_back() to add elements. I remember reading that resizing vectors isn't thread-safe, but the vectors are only private variables, so not shared between threads. I figured this would be OK?
If I enclose the entire for-loop body in a critical directive and slowly shrink back the end of the code block (so an ever-growing region at the end of the for-loop is outside the critical section), it runs fine until I expose one of the function calls, at which point the segfaulting resumes. Analyzing this binary with Valgrind shows race conditions in many other function calls, not just the one I exposed.
One of the function calls is to a GSL function, which doesn't trigger any race conditions according to Valgrind.
Do I need to go and explicitly define private and shared variables in the functions being called? If so, this seems somewhat limiting for OpenMP - would this not mean you need to have OpenMP compatibility for any legacy code you call?
Is parallelizing a big for-loop just not something that works?
If you've read this far, thank you and Godspeed.
So there is no way anyone could have answered this, but having figured it out I hope this helps someone, given how bizarre my system's behavior was.
One of the (C) functions I was ultimately calling (my_function->intermediate_function->lower_function->BAD_FUNCTION) declared a number of its variables as static, which meant that they retained the same memory address and so were essentially acting as shared variables. Interesting that static storage overrides OpenMP's privatization.
I discovered all this by:
Using Valgrind to identify where errors were happening, and looking at the specific variables involved.
Defining the entire for-loop as a critical section and then exposing more code at the top and bottom.
Talking to my boss. More sets of eyes always help, not least because you're forced to verbalize the problem (which ended up with me opening the culprit function and pointing at the declarations).
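For anyone hitting the same thing, here is a hypothetical reconstruction of the pattern (BAD_FUNCTION's real contents are not shown above, so the variable and computation below are invented for illustration):

// Racy version: the static local has one instance shared by every thread.
double bad_function(double x)
{
    static double work;   // same address for all threads -> data race
    work = x * x;
    return work;
}

// Fixed version: an ordinary automatic local is private to each call (and thread).
double good_function(double x)
{
    double work = x * x;
    return work;
}

// Alternative, if the value really must persist between calls:
//   static double work;
//   #pragma omp threadprivate(work)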

is this proper openMP usage? (or: can I trust the default settings?)

I am presently using openMP for the first time, and have hit my head against the "data members cannot be private" rule.
I would like to know whether the below is valid, or if it will eventually break:
class network
{
double tau;
uint nNeurons;          // assumed member: used as the loop bound below
vector<neuron> neurons; // assumed member: the question states neurons is a data member
void SomeFunction();
};
void network::SomeFunction()
{
#pragma omp parallel for // <-the openMP call
for (uint iNeu=0;iNeu<nNeurons;++iNeu)
{
neurons[iNeu].timeSinceSpike+=tau; //tau is defined in some other place
neurons[iNeu].E+=tau*tau;
}
}
So, I am using the minimal syntax, and letting openMP figure out everything on its own. This version compiles, and the output is correct (so far).
What I tried before that was
void network::SomeFunction()
{
#pragma omp parallel for default(none) shared(neurons) firstprivate(tau) // <-the openMP call
for (uint iNeu=0;iNeu<nNeurons;++iNeu)
{
neurons[iNeu].timeSinceSpike+=tau; //tau is defined in some other place
neurons[iNeu].E+=tau*tau;
}
}
However, as hinted, that won't compile, presumably because tau and neurons are data members of network.
The question then is whether I have really just been lucky in my runs of the first version, and whether I have to do something like
void network::SomeFunction()
{
double tempTau=tau;
vector <neuron> tempNeurons=neurons; //in reality this copy process would be quite involved
#pragma omp parallel for shared(tempNeurons) firstprivate(tempTau)// <-the openMP call
for (uint iNeu=0;iNeu<nNeurons;++iNeu)
{
tempNeurons[iNeu].timeSinceSpike+=tempTau;
tempNeurons[iNeu].E+=tempTau*tempTau;
}
}
Naturally, I would much prefer to stick with the present version, as it is so short and easy to read, but I would also like to trust my output :)
I am using gcc 4.6.1
Hope someone can educate me on the proper way to do it.
In this example, what you are initially doing should be fine:
The reason is that you aren't modifying the tau member at all. So there's no reason to make it private in the first place. It's safe to asynchronously share the same value if it isn't modified.
As for neurons, you are modifying the elements independently. So there's no problem here either.
When you declare a variable as firstprivate, it gets copy constructed into all the threads. So shared(tempNeurons) is definitely not what you want to do.
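Putting this together, a minimal sketch of a version that both compiles and avoids copying the whole container might look like the following (assuming nNeurons and neurons are members of network, as in the question; copying tau into a local at all is optional, since it is only read):

void network::SomeFunction()
{
    const double localTau = tau;   // plain local, so it may appear in a clause
    #pragma omp parallel for firstprivate(localTau)
    for (uint iNeu = 0; iNeu < nNeurons; ++iNeu)
    {
        // neurons stays shared; each iteration touches a distinct element
        neurons[iNeu].timeSinceSpike += localTau;
        neurons[iNeu].E              += localTau * localTau;
    }
}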
http://www.openmp.org/mp-documents/OpenMP3.1.pdf
Section 2.9.1, Data-sharing Attribute Rules
Variables appearing in threadprivate directives are threadprivate.
Variables with automatic storage duration that are declared in a scope inside the construct are private.
Objects with dynamic storage duration are shared.
Static data members are shared.
The loop iteration variable(s) in the associated for-loop(s) of a for or parallel for construct is (are) private.
Variables with const-qualified type having no mutable member are shared.
Variables with static storage duration that are declared in a scope inside the construct are shared.
...
The loop iteration variable(s) in the associated for-loop(s) of a for or parallel for construct may be listed in a private or lastprivate clause.
Variables with const-qualified type having no mutable member may be listed in a firstprivate clause.
However, I still miss the default sharing attribute for automatic variables declared outside a construct.