I'm trying to parallelize a pretty massive for-loop in OpenMP. About 20% of the time it runs through fine, but the rest of the time it crashes with various segfaults such as:
*** glibc detected *** ./execute: double free or corruption (!prev): <address> ***
*** glibc detected *** ./execute: free(): invalid next size (fast): <address> ***
[2] <PID> segmentation fault ./execute
My general code structure is as follows:
<declare and initialize shared variables here>

#pragma omp parallel private(list of private variables which are initialized in the for loop) shared(much shorter list of shared variables)
{
    #pragma omp for
    for (index = 0; index < end; index++) {
        // Lots of functionality (science!)
        // Calls to other deep functions which manipulate private variables
        // Finally generate some calculated_values
        shared_array1[index] = calculated_value1;
        shared_array2[index] = calculated_value2;
        shared_array3[index] = calculated_value3;
    } // end for
}
// final tidy up
}
In terms of what's going on, each loop iteration is totally independent of every other iteration, other than the fact that they pull data from shared matrices (but different columns on each iteration). Where I call other functions, they're only changing private variables (although occasionally reading shared variables), so I'd assume they're thread safe as they're only messing with stuff local to a specific thread. The only writing to any shared variables happens right at the end, where various calculated values are written to shared arrays whose elements are indexed by the for-loop index. This code is in C++, although the code it calls is both C and C++.
I've been trying to identify the source of the problem, but no luck so far. If I set num_threads(1) it runs fine, as it does if I enclose the contents of the for-loop in a single critical section:
#pragma omp for
for (index = 0; index < end; index++) {
    #pragma omp critical(whole_loop)
    {
        // loop body
    }
}
which presumably gives the same effect (i.e. only one thread can pass through the loop at any one time).
If, on the other hand, I enclose the for-loop's contents in two critical directives e.g.
#pragma omp for
for (index = 0; index < end; index++) {
    #pragma omp critical(whole_loop)
    {
        // first half of loop body
    }
    #pragma omp critical(whole_loop2)
    {
        // second half of loop body
    }
}
I get the unpredictable segfaulting. Similarly, if I enclose EVERY function call in a critical directive it still doesn't work.
The reason I think the problem may be linked to a function call is that when I profile with Valgrind (using valgrind --tool=drd --check-stack-var=yes --read-var-info=yes ./execute), as well as SIGSEGVing, I get an insane number of load and store errors, such as:
Conflicting load by thread 2 at <address> size <number>
at <address> : function which is ultimately called from within my for loop
Which, according to the Valgrind manual, is exactly what you'd expect with race conditions. Certainly this kind of weirdly appearing/disappearing issue seems consistent with the non-deterministic errors race conditions would give, but I don't understand how that can be the case if every call which shows apparent race conditions is inside a critical section.
Things which could be wrong but I don't think are include:
All private() variables are initialized inside the for-loops (because they're thread local).
I've checked that shared variables have the same memory address while private variables have different memory addresses.
I'm not sure synchronization would help, but given there are implicit flushes on entry and exit to critical directives, and I've tried versions of my code where every function call is enclosed in a (uniquely named) critical section, I think we can rule that out.
Any thoughts on how to best proceed would be hugely appreciated. I've been banging my head against this all day. Obviously I'm not looking for a, "Oh - here's the problem" type answer, but more how best to proceed in terms of debugging/deconstructing.
Things which could be an issue, or might be helpful:
There are some std::vectors in the code which use push_back() to add elements. I remember reading that resizing vectors isn't thread-safe, but these vectors are only private variables, so not shared between threads. I figured this would be OK?
If I enclose the entire for-loop body in a critical directive and slowly shrink back the end of the code block (so an ever-growing region at the end of the for-loop is outside the critical section), it runs fine until I expose one of the function calls, at which point the segfaulting resumes. Analyzing this binary with Valgrind shows race conditions in many other function calls, not just the one I exposed.
One of the function calls is to a GSL function, which doesn't trigger any race conditions according to Valgrind.
Do I need to go and explicitly define private and shared variables in the functions being called? If so, this seems somewhat limiting for OpenMP - would this not mean you need to have OpenMP compatibility for any legacy code you call?
Is parallelizing a big for-loop just not something that works?
If you've read this far, thank you and Godspeed.
So there was no way anyone could have answered this, but having figured it out I hope this helps someone, given that my system's behavior was so bizarre.
One of the (C) functions I was ultimately calling (my_function -> intermediate_function -> lower_function -> BAD_FUNCTION) declared a number of its variables as static, which meant that they retained the same memory address between calls and so were essentially acting as shared variables. Interesting that static storage overrides OpenMP's private clauses.
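For anyone hitting something similar, here is a minimal sketch of the kind of pattern that caused it (the function name and contents are made up for illustration):

/* Hypothetical sketch: BAD_FUNCTION keeps its scratch space in static
 * locals, so every thread reads and writes the SAME memory even though
 * the callers only ever touch their "private" variables. */
double bad_function(double x)
{
    static double scratch[16];   /* one copy shared by all threads */
    static int initialized = 0;  /* racy check-then-act */

    if (!initialized) {
        for (int i = 0; i < 16; i++) scratch[i] = 0.0;
        initialized = 1;
    }
    scratch[0] = x * x;          /* data race: concurrent writes */
    return scratch[0] + 1.0;
}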
I discovered all this by:
Using Valgrind to identify where errors were happening, and looking at the specific variables involved.
Defining the entire for-loop as a critical section and then exposing more code at the top and bottom.
Talking to my boss. More sets of eyes always help, not least because you're forced to verbalize the problem (which ended up with me opening the culprit function and pointing at the declarations).
Related
I used #pragma directives to parallelize my program. Without them, everything works fine.
Unfortunately, I use complex arrays which I have to declare globally because they are used in several functions within the parallelization. As far as I understand, technically this doesn't make a difference since they are stored globally anyway.
However, the problematic array(s) are used privately. From what I understood from other discussions, I have to assign memory to the arrays before the parallelization starts to ensure that the program reserves the memory correctly for every thread. Within the threads I then assign memory again. The size (matrixsize) does not change.
But even if I set num_threads(1) for testing the data (the array "degree") in the thread becomes corrupted eventually.
In an earlier version I declared the arrays within the threads and didn't use any functions. Everything worked fine too, but this is getting too messy now.
I tried to reduce the code. Hope it's understandable. I use gcc to compile it.
Sorry, I can't figure out the problem. I would be thankful for some advice.
best,
Mathias
#include <omp.h>
#include <vector>
#include <list>
using namespace std;

int matrixsize = 200;
vector<int> degree;
vector<list<int> > adjacency;
vector<vector<bool> > admatrix;
vector<vector<float> > geopos;
// [...]
void netgen();
void runanalyses();
// [...]
int main(int argc, char *argv[])
{
    // [...]
    adjacency.assign(matrixsize, list<int>());
    admatrix.assign(matrixsize, vector<bool>(matrixsize, 0));
    degree.assign(matrixsize, 0);
    geopos.assign(matrixsize, vector<float>(dim, 0));

    #pragma omp parallel for shared(degdist,ADC,ADCnorm,ACC,ACCnorm,its,matrixsize) private(adjacency,admatrix,degree,geopos) num_threads(1)
    for (int a = 0; a < its; a++)
    {
        adjacency.assign(matrixsize, list<int>());
        admatrix.assign(matrixsize, vector<bool>(matrixsize, 0));
        degree.assign(matrixsize, 0);
        geopos.assign(matrixsize, vector<float>(dim, 0));
        netgen();
        runanalyses();
    } // for parallelization
    // [...]
}
Unfortunately, I use complex arrays which I have to declare globally because they are used in several functions within the parallelization. As far as I understand, technically this doesn't make a difference since they are stored globally anyway.
You really should not do that! Modifying global data structures within parallel regions makes it very difficult to reason about data races. Instead, define proper interfaces, e.g. passing vectors by (const) reference. For example, you can safely operate on a const std::vector& in a parallel region.
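For illustration, a minimal sketch of that approach; the function and variable names below are hypothetical, not from the question:

#include <omp.h>
#include <cstddef>
#include <vector>

// Read-only data is passed by const reference; each iteration writes only
// to its own slot of the results vector, so there is no shared mutable state.
double analyse_row(const std::vector<std::vector<bool> >& admatrix, int row)
{
    double sum = 0.0;
    for (std::size_t j = 0; j < admatrix[row].size(); ++j)
        sum += admatrix[row][j] ? 1.0 : 0.0;   // stand-in for the real analysis
    return sum;
}

void run_analyses(const std::vector<std::vector<bool> >& admatrix,
                  std::vector<double>& results)
{
    #pragma omp parallel for
    for (int i = 0; i < (int)admatrix.size(); ++i)
        results[i] = analyse_row(admatrix, i);
}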
Once you have gotten rid of global state and still encounter issues, feel free to ask a proper follow-up question, but make sure to include a Minimal, Complete, and Verifiable example (read that page very carefully) as well as a description of the specific error you are getting and your attempts to debug it.
When I compile with the configuration set to release (for both x86 and x64), my program fails to complete. To clarify, there are no build errors or execution errors.
After looking for a cause and solution for the issue, I found Program only crashes as release build -- how to debug?, which proposes that it is an array issue. Though this didn't solve my problem, it gave me some insight on the matter (which I leave here for the next person).
To further muddle matters, the problem only occurs when a subroutine on the main thread has an execution time greater than about 0 ms.
Here are the relevant sections of code:
// Startup Progress Bar Thread
nPC_Current = 0; // global int
nPC_Max = nPC; // global int (max value nPC_Current will reach)
DWORD myThreadID;
HANDLE progressBarHandle = CreateThread(0, 0, printProgress, &nPC_Current, 0, &myThreadID);
/* Do stuff and time how long it takes (this is what increments nPC_Current) */
// Wait for Progress Bar Thread to Terminate
WaitForSingleObject(progressBarHandle, INFINITE);
Where the offending line that my program gets stuck on is that last statement, where the program waits for the created thread to terminate:
WaitForSingleObject(progressBarHandle, INFINITE);
And here is the code for the progress bar function:
DWORD WINAPI printProgress(LPVOID lpParameter)
{
int lastProgressPercent = -1; // Only reprint bar when there is a change to display.
// Core Progress Bar Loop
while (nPC_Current <= nPC_Max)
{
// Do stuff to print a text progress bar
}
return 0;
}
The 'Core' while loop generally won't get a single iteration if the execution time of the measured subroutine is about 0 ms. To clarify: if the execution time of the timed subroutine is about 0 ms, nPC_Current will be greater than nPC_Max before printProgress executes even once. This means the thread will terminate before the main thread begins to wait for it.
If anyone would help with this, or provide some further insight on the matter, that would be fantastic as I'm having quite some trouble figuring this out.
Thanks!
My guess would be that you forgot to declare your shared global variables volatile (nPC_Current specifically). Since the thread function itself never modifies nPC_Current, in the release version of the code the compiler optimized your progress bar loop into an infinite loop with a never-changing value of nPC_Current.
This is why your progress bar never updates from 0% value in release version of the code and this is why your progress bar thread never terminates.
P.S. Also, it appears that you originally intended to pass your nPC_Current counter to the thread function as a thread parameter (judging by your CreateThread call). However, in the thread function you ignore the parameter and access nPC_Current directly as a global variable. It might be a better idea to stick to the original idea of passing and accessing it as a thread parameter.
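For what it's worth, a minimal sketch of the idea, using std::atomic rather than plain volatile (std::atomic is the more robust way to get the reload-on-every-read behaviour this answer describes); the surrounding program is assumed:

#include <windows.h>
#include <atomic>

// Shared progress counters; std::atomic forces the worker loop to reload
// the value on every iteration instead of caching it in a register.
std::atomic<int> nPC_Current(0);
int nPC_Max = 0;   // set once before the thread is created

DWORD WINAPI printProgress(LPVOID lpParameter)
{
    // Read the counter through the thread parameter, as the original
    // CreateThread call appears to intend.
    std::atomic<int>* counter = static_cast<std::atomic<int>*>(lpParameter);
    while (counter->load() <= nPC_Max)
    {
        // print/update the text progress bar here
    }
    return 0;
}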
The number one rule in writing software is:
Leave nothing to chance; check for every single possible error, everywhere.
Note: this is the number one rule not when troubleshooting software; when there is trouble, it is already too late; this is the number one rule when writing software, that is, before there is even a need to troubleshoot.
There are a number of problems with your code; I cannot tell for sure that any one of them is what is causing the problem you are experiencing, but I would be willing to bet that if you fixed them, and if you developed the mentality of fixing problems like those, then you would not have the problem you are experiencing.
The documentation for WaitForSingleObject says: "If this handle is closed while the wait is still pending, the function's behavior is undefined." However, you do not appear to be asserting that CreateThread() returned a valid handle. You are not even showing us where and how you are closing that handle. (And when you do close the handle, do you assert that CloseHandle() did not fail?)
Not only are you using global variables (which is something I would strongly advise against), but you also happily make a multitude of assumptions about their values, without ever asserting any one of those assumptions.
What guarantees do you have that nPC_Current is in fact less than nPC_Max at the beginning of your function?
What guarantees do you have that nPC_Current keeps incrementing over time?
What guarantees do you have that the calculation of lastProgressPercent does not in fact keep yielding -1 during your loop?
What guarantees do you have that nPC_Max is not zero? (Division by zero on a separate thread is kind of hard to catch.)
What guarantees do you have that nPC_Max does not get also modified while your thread is running?
What guarantees do you have that nPC_Current gets incremented atomically? (I hope you understand that if it does not get incremented atomically, then at the moment that you read it from another thread, you may read garbage.)
You have tagged this question with [C++], and I do see a few C++ features being used, but I do not really see any object-oriented programming. The thread function accepts an LPVOID parameter precisely so that you can pass an object to it and thus continue being object-oriented in your second thread, with all the benefits that this entails, like for example encapsulation. I would suggest that you use it.
You can use (with some limitations) breakpoints in release...
Does this part of the code:
/* Do stuff and time how long it takes (this is what increments nPC_Current) */
depend on what the printProgress thread does? (If so, you have to ensure the timing and ordering of the two threads are handled correctly.) Are you sure this is always incrementing nPC_Current? Is it a time-dependent algorithm?
Have you tested the effect that a Sleep() has here?
I have OpenCV 3.0.0 installed. My code is multithreaded using OpenMP.
Each thread accesses the same opencv function ("convertTo").
This causes a segmentation fault.
The error does not occur
if I print a simple statement using std::cout at the beginning of each thread or
if I use only a single thread.
Can anyone help, what the reason might be?
Many OpenCV functions and data structures share memory between variables; for example, if you have a matrix Mat A and you do Mat B = A, the data of B is stored in the same memory locations as A. When you use OpenMP you must make sure that each memory location is only ever written from a single thread, otherwise you will get an error at runtime.
When you use a single thread there is no problem, since only one thread reads or writes any given memory location.
On the other hand, when you use functions that print to the screen, such as printf() or std::cout, the threads get delayed, i.e. while one thread prints, another thread writes to the memory locations, so the probability of an error at runtime decreases, but that does not mean it cannot happen in the future.
When you use OpenMP in a loop, the way to protect writes to the same memory locations from different threads is:
#pragma omp critical
{
// code that only one thread executes at a time
}
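Alternatively, if each thread writes to its own destination matrix, no critical section is needed at all. A minimal sketch, assuming OpenCV's cv::Mat and convertTo (the sizes, types, and iteration count are made up):

#include <opencv2/core.hpp>
#include <omp.h>
#include <vector>

int main()
{
    cv::Mat shared = cv::Mat::ones(100, 100, CV_8U);   // read-only input
    std::vector<cv::Mat> results(8);                    // one output per iteration

    #pragma omp parallel for
    for (int i = 0; i < 8; ++i)
    {
        // convertTo writes into results[i], which no other thread touches
        shared.convertTo(results[i], CV_32F);
    }
    return 0;
}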
I have a curious situation (at least for me :D) in C++.
My code is:
static void startThread(Object* r){
while(true)
{
while(!r->commands->empty())
{
doSomething();
}
}
}
I start this function as a thread using Boost, where commands in r is a queue... I fill this queue in another thread....
The problem is that if I fill the queue first and then start this thread, everything works fine... But if I run startThread first and fill up the commands queue afterwards, it does not work... doSomething() will not run...
However, if I modify startThread:
static void startThread(Object* r){
while(true)
{
std::cout << "c" << std::endl;
while(!r->commands->empty())
{
doSomething();
}
}
}
I just added cout... and it is working... Can anybody explain why it works with cout and not without? Or does anybody have an idea what could be wrong?
Maybe the compiler is doing some kind of optimization? I do not think so... :(
Thanks
But if I run startThread first and fill up the commands queue afterwards, it does not work... doSomething() will not run
Of course not! What did you expect? Your queue is empty, so !r->commands->empty() will be false.
I just added cout... and it is working
You got lucky. cout is comparatively slow, so your main thread had a chance to fill the queue before the inner while test was executed for the first time.
So why does the thread not see an updated version of r->commands after it has been filled by the main thread? Because nothing in your code indicates that your variable is going to change from the outside, so the compiler assumes that it doesn’t.
In fact, the compiler sees that your r’s pointee cannot change, so it can just remove the redundant checks from the inner loop. When working with multithreaded code, you explicitly need to tell C++ that variables can be changed from a different context, using atomic memory access.
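A minimal sketch of one way to do that, assuming a simplified Object with a mutex-protected std::queue (the real Object and doSomething() are not shown in the question):

#include <iostream>
#include <mutex>
#include <queue>

// Simplified stand-in for the question's Object.
struct Object {
    std::queue<int> commands;
    std::mutex m;
};

static void startThread(Object* r)
{
    while (true)
    {
        int cmd = 0;
        bool have = false;
        {
            // The lock both serialises access to the queue and establishes
            // the happens-before relationship the original code lacks.
            std::lock_guard<std::mutex> lock(r->m);
            if (!r->commands.empty()) {
                cmd = r->commands.front();
                r->commands.pop();
                have = true;
            }
        }
        if (have)
            std::cout << "processing " << cmd << std::endl;  // stands in for doSomething()
    }
}

The producer thread must take the same mutex before pushing; a condition variable would also avoid the busy-wait, but that is beyond the point being made here.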
When you first run the thread and then fill up the queue, not entering the inner loop is logical, since r->commands->empty() is still true at that point. After you add the cout statement it works because it takes some time to print the output, and meanwhile the other thread fills up the queue, so the condition !r->commands->empty() becomes true. But relying on such timing in a multi-threading environment is not good programming.
There are two inter-related issues:
You are not forcing a reload of r->commands or r->commands->empty(), thus your compiler, diligent as it is in its search for the pinnacle of performance, cached the result. Adding some more code might make the compiler remove this optimisation if it cannot prove the caching is still valid.
You have a data race, so your program has undefined behavior. (I am assuming doSomething() removes an element and some other thread adds elements.)
1.10 Multi-threaded executions and data races § 21
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior. [ Note: It can be shown that programs that correctly use mutexes and memory_order_seq_cst operations to prevent all data races and use no other synchronization operations behave as if the operations executed by their constituent threads were simply interleaved, with each value computation of an object being taken from the last side effect on that object in that interleaving. This is normally referred to as "sequential consistency". However, this applies only to data-race-free programs, and data-race-free programs cannot observe most program transformations that do not change single-threaded program semantics. In fact, most single-threaded program transformations continue to be allowed, since any program that behaves differently as a result must perform an undefined operation. —end note ]
I have a for loop which calls an internal function:
some variables
for(int i=0; i< 10000000; i++)
func(variables)
Basically, func gets a reference to some array A and inserts values into A[i] - so I'm assured that each call to func actually tries to insert a value into a different place in A, and all other input variables stay the same as they were before the for loop. So func is thread-safe.
Can I safely change the code to
some variables
#pragma omp parallel for
for(int i=0; i< 10000000; i++)
func(variables)
From what I understand from the OpenMP tutorials, this isn't good enough - since the OpenMP libraries wouldn't know that the variables given to func are really thread-safe, this would yield attempts to perform synchronization which would slow things down, and I would need to declare variables private, etc. But actually, when trying the code above, it does indeed seem to run faster and in parallel - is this as expected? I just wanted to make sure I'm not missing something.
The declaration of func :
func(int i, int client_num, const vector<int>& vec)
First of all, OpenMP cannot magically determine the dependency on your code. It is your responsibility that the code is correct for parallelization.
In order to safely parallelize the for loop, func must not have loop-carried flow dependences (inter-iteration dependencies), especially read-after-write patterns. Also, you must check that there are no static variables. (Actually, it's much more complex to write down all the conditions for safe parallelization in this short answer.)
Your description of func says that func writes a value into a different place on each iteration. If so, you can safely parallelize by putting #pragma omp parallel for on the loop, unless the other computations have dependences that prohibit parallelization.
Your prototype of func: func(int i, int client_num, const vector<int>& vec)
There is a vector, but it is a constant reference, so vec should not create any dependency. Simultaneous reads from different threads are safe.
However, you say that the output is different. That means something went wrong. It's impossible to say what the problems are; showing the prototype of the function never helps. We need to know what kind of computations are done in func.
Nonetheless, some steps for the diagnosis are:
Check the dependencies in your code. You must not have dependences like the one shown below. Note that this array A has a loop-carried dependence that prevents parallelization:
for (int k = 1; k < N; ++k) A[k] = A[k-1] + 1;
Check that func is re-entrant or thread-safe. Mostly, static and global variables may kill your code. If so, you can solve this problem by privatization. In OpenMP, you may declare these variables in a private clause; there is also a threadprivate pragma in OpenMP.
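As an illustration of the privatization idea, a small sketch using threadprivate on a file-scope static (the variable and function names here are made up):

#include <omp.h>
#include <cstdio>

// A static counter that would otherwise be shared by every thread gets
// one copy per thread via the threadprivate directive.
static int call_count = 0;
#pragma omp threadprivate(call_count)

void func(int i)
{
    ++call_count;   // each thread increments its own copy
}

int main()
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 1000; ++i)
            func(i);

        #pragma omp critical
        std::printf("thread %d handled %d calls\n",
                    omp_get_thread_num(), call_count);
    }
    return 0;
}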
You do not change your loop variable i anywhere, so there is no problem for the compiler in parallelizing the loop. As i is only copied into your function, it cannot be changed from outside.
The only thing you need to make sure of is that inside your function you write only to position A[i] and read only from position A[i]. Otherwise you might get race conditions.
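A minimal sketch of that safe pattern (this func is hypothetical, not the asker's):

#include <omp.h>
#include <vector>

// Each call touches only its own element A[i], so no two iterations conflict.
void func(int i, std::vector<int>& A)
{
    A[i] = 2 * i;
}

int main()
{
    std::vector<int> A(10000000);

    #pragma omp parallel for
    for (int i = 0; i < (int)A.size(); ++i)
        func(i, A);

    return 0;
}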