C++ pragma omp segmentation fault (data race?) with arrays

I used #pragma directives to parallelize my program. Without them, everything works fine.
Unfortunately, I use complex arrays which I have to declare globally because they are used in several functions within the parallelization. As far as I understand, technically this doesn't make a difference since they are stored globally anyway.
However, the problematic array(s) are used privately. From what I understood from other discussions, I have to assign memory to the arrays before the parallelization starts to ensure that the program reserves the memory correctly for every thread. Within the threads I then assign memory again. The size (matrixsize) does not change.
But even if I set num_threads(1) for testing, the data in the thread (the array "degree") eventually becomes corrupted.
In an earlier version I declared the arrays within the threads and didn't use any functions. Everything worked fine too, but this is getting too messy now.
I tried to reduce the code. Hope it's understandable. I use gcc to compile it.
Sorry, I can't figure out the problem. I would be thankful for some advice.
best,
Mathias
#include <omp.h>
#include <vector>
#include <list>
using namespace std;

int matrixsize = 200;
vector<int> degree;
vector<list<int> > adjacency;
vector<vector<bool> > admatrix;
vector<vector<float> > geopos;
// [...]
void netgen();
void runanalyses();
// [...]
int main(int argc, char *argv[])
{
    // [...]
    adjacency.assign(matrixsize, list<int>());
    admatrix.assign(matrixsize, vector<bool>(matrixsize, 0));
    degree.assign(matrixsize, 0);
    geopos.assign(matrixsize, vector<float>(dim, 0));

#pragma omp parallel for shared(degdist,ADC,ADCnorm,ACC,ACCnorm,its,matrixsize) private(adjacency,admatrix,degree,geopos) num_threads(1)
    for (int a = 0; a < its; a++)
    {
        adjacency.assign(matrixsize, list<int>());
        admatrix.assign(matrixsize, vector<bool>(matrixsize, 0));
        degree.assign(matrixsize, 0);
        geopos.assign(matrixsize, vector<float>(dim, 0));
        netgen();
        runanalyses();
    } // for parallelization
    // [...]
}

Unfortunately, I use complex arrays which I have to declare globally because they are used in several functions within the parallelization. As far as I understand, technically this doesn't make a difference since they are stored globally anyway.
You really should not do that! Modifying global data structures within parallel regions makes it very difficult to reason about data races. Instead, define proper interfaces, e.g. passing vectors by (const) reference. For example, you can safely operate on a const std::vector& in a parallel region.
Once you have gotten rid of the global state and still encounter issues, feel free to ask a proper follow-up question, but make sure to include a Minimal, Complete, and Verifiable example (read that page very carefully) as well as a description of the specific error you are getting and your attempts to debug it.
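For illustration, here is a minimal sketch of that suggestion, not the asker's actual program: the netgen()/runanalyses() bodies below are made-up stand-ins, and the point is only that each iteration owns its containers and passes them by reference instead of touching globals.

#include <omp.h>
#include <cstdio>
#include <list>
#include <vector>

// Hypothetical stand-ins for the asker's netgen()/runanalyses(): each
// receives the containers it works on by reference, so every thread
// operates on its own local objects and no global state is needed.
static void netgen(std::vector<int>& degree,
                   std::vector<std::list<int> >& adjacency)
{
    for (std::size_t i = 0; i + 1 < degree.size(); ++i) {
        adjacency[i].push_back(static_cast<int>(i + 1)); // toy edge
        ++degree[i];
    }
}

static void runanalyses(const std::vector<int>& degree)
{
    long sum = 0;
    for (std::size_t i = 0; i < degree.size(); ++i) sum += degree[i];
    std::printf("degree sum: %ld\n", sum);
}

int main()
{
    const int matrixsize = 200;
    const int its = 8;

    #pragma omp parallel for
    for (int a = 0; a < its; a++)
    {
        // Thread-local containers: declared inside the loop body,
        // so no private() clause and no shared global state is needed.
        std::vector<int> degree(matrixsize, 0);
        std::vector<std::list<int> > adjacency(matrixsize);

        netgen(degree, adjacency);
        runanalyses(degree);
    }
    return 0;
}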

Related

Can I edit a global vector using multiple threads in C++?

I currently have code which works well, but I am learning C++ and hence would like to rid myself of any newbie mistakes. Basically, the code is:
#include <algorithm>
#include <thread>
#include <vector>
using namespace std;

vector<vector<float>> gAbs;

void functionThatAddsEntryTogAbs(){
    ...
    gAbs.push_back(value);
}

int main(){
    thread thread1 = thread(functionThatAddsEntryTogAbs, args);
    thread thread2 = thread(functionThatAddsEntryTogAbs, args);
    thread1.join(); thread2.join();
    std::sort(gAbs.begin(), gAbs.end());
    writeDataToFile(gAbs, "filename.dat");
}
For instance, I remember learning that there are only a few cases where global variables are the right choice. My initial thought was just to have the threads write to the file directly, but then I cannot guarantee that the data is sorted (which I need), which is why I use std::sort. Are there any suggestions for how to improve this, and what alternatives would more experienced programmers use instead?
The code needs to be as fast as possible.
Thanks in advance
You can access and modify global resources, including containers, from different threads, but you have to protect them from being modified at the same time. Some exceptions: nothing is being modified, the container itself is not changed (only existing elements are), or the threads work on separate entries.
In your code, entries are added to the container, so you need a mutex; but with a mutex your parallel code probably doesn't gain much speed. A better approach could be to determine in advance how many entries need to be added, resize the container up front (default-initializing the entries), and then assign ranges to the threads so each thread fills in its own entries.
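Here is a minimal sketch of the second approach, simplified to a flat std::vector<float> and an invented fillRange() worker; the real computation and container type would differ:

#include <algorithm>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical producer: fills a pre-assigned range [begin, end) of the
// shared vector. No locking is needed because the ranges do not overlap
// and the vector itself is never resized while the threads run.
static void fillRange(std::vector<float>& out, std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        out[i] = static_cast<float>(i) * 0.5f;  // placeholder computation
}

int main()
{
    const std::size_t total = 1000;     // number of entries known in advance
    std::vector<float> gAbs(total);     // pre-size once, before the threads start

    std::thread t1(fillRange, std::ref(gAbs), 0, total / 2);
    std::thread t2(fillRange, std::ref(gAbs), total / 2, total);
    t1.join();
    t2.join();

    std::sort(gAbs.begin(), gAbs.end());
    std::printf("first=%f last=%f\n", gAbs.front(), gAbs.back());
    return 0;
}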

C++ - successfully printing out-of-range string indexes

I am new to C++ and I was playing around with the string class. I realized that when I run the following code in CodeBlocks with the GNU compiler:
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string test = "hi";
    cout << "char is : " << test[100];
    return 0;
}
I actually get a value. I played with indexes (I tried from 100 to 10000) and I may get other characters or I may get null. Does that mean that this way you can read parts of memory that you are not supposed to? Can you use it for exploitation? Or is it just my mind playing tricks on me?
The answer is simple: undefined behavior. No, you can't trust this information, and it is strongly discouraged. Don't do it.
This is undefined behavior, so anything can happen. The compiler can optimize away the line, insert abort(), or do anything else.
If the compiler does not make big changes to the code, and std::string implements short-string-optimization, then test[100] will access the stack frame of one of the functions that call main().
These functions are responsible for loading shared libraries, arranging the environment variables, constructing global objects such as std::cout, and creating and passing argc, argv, to main(). This code peeks into the stack of these functions. On a system with memory protection, such as Linux or Windows, with far enough out-of-bounds access, the application will crash.
Don't rely on it, since the compiler can do something totally unexpected with it.
And yes, it can lead to exploitation. If the out-of-bounds index depends on user input, then that user may be able to read or write data they were not supposed to. This is one of the ways a worm or a virus might spread: they read passwords, or write code that will execute upon a return from a function.
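As a side note (not part of the original answers): if you want a bounds check, std::string::at() throws std::out_of_range instead of silently reading past the buffer, e.g.:

#include <iostream>
#include <stdexcept>
#include <string>

int main()
{
    std::string test = "hi";

    // operator[] with an out-of-range index is undefined behavior:
    // it may print garbage, crash, or appear to "work".
    // std::cout << test[100];   // don't do this

    // at() performs a bounds check and throws instead.
    try {
        std::cout << "char is : " << test.at(100) << '\n';
    } catch (const std::out_of_range& e) {
        std::cout << "out of range: " << e.what() << '\n';
    }
    return 0;
}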

Implementing Thread Local Storage in Software

We are porting an embedded application from Windows CE to a different system. The current processor is an STM32F4. Our current codebase heavily uses TLS. The new prototype is running KEIL CMSIS RTOS which has very reduced functionality.
On http://www.keil.com/support/man/docs/armcc/armcc_chr1359124216560.htm it says that thread local storage has been supported since 5.04. Right now we are using 5.04. The problem is that when linking our program with a variable definition of __thread int a;, the linker cannot find __aeabi_read_tp, which makes sense to me.
My question is: Is it possible to implement __aeabi_read_tp and it will work or is there more to it?
If it simply is not possible for us: Is there a way to implement TLS only in software? Let's not talk about performance there for now.
EDIT
I tried implementing __aeabi_read_tp by looking at old FreeBSD sources and other code. While the function is mostly implemented in assembly, I found a version in C which boils down to this:
extern "C"
{
extern osThreadId svcThreadGetId(void);
void *__aeabi_read_tp()
{
return (void*)svcThreadGetId();
}
}
What this basically does is give me the ID (void*) of my currently executing thread. If I understand correctly that is what we want. Can this possibly work?
Not considering performance and not going into CMSIS RTOS specifics (which are unknown to me): you can allocate the space needed for your variables, either on the heap or as a static or global variable; I would suggest an array of structures. Then, when you create a thread, pass a pointer to the next unused structure to your thread function.
In the case of a static or global variable, it helps to know how many threads run in parallel, so you can limit the size of the preallocated memory.
EDIT: Added sample of TLS implementation based on pthreads:
#include <pthread.h>
#include <stddef.h>

#define MAX_PARALLEL_THREADS 10

/* Placeholder for whatever per-thread data is needed. */
struct tls_data {
    int value;
};

static pthread_t threads[MAX_PARALLEL_THREADS];
static struct tls_data tls_data[MAX_PARALLEL_THREADS];
static int tls_data_free_index = 0;

static void *worker_thread(void *arg) {
    /* Not static: each thread must get its own pointer to its own slot. */
    struct tls_data *data = (struct tls_data *) arg;
    /* Code omitted. */
    (void) data;
    return NULL;
}

static int spawn_thread(void) {
    if (tls_data_free_index >= MAX_PARALLEL_THREADS) {
        // Consider increasing MAX_PARALLEL_THREADS
        return -1;
    }
    /* Prepare thread data - code omitted. */
    pthread_create(&threads[tls_data_free_index], NULL,
                   worker_thread, &tls_data[tls_data_free_index]);
    tls_data_free_index++;
    return 0;
}
The not-so-impressive solution is a std::map<threadID, T>. Needs to be wrapped with a mutex to allow new threads.
For something more convoluted, see this idea
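A rough sketch of the std::map idea, using std::thread/std::mutex for illustration (a CMSIS RTOS port would need its own thread-id and locking primitives):

#include <cstdio>
#include <map>
#include <mutex>
#include <thread>

// Minimal software "TLS": one slot per thread, keyed by the thread id and
// protected by a mutex. T is whatever per-thread state is needed.
template <typename T>
class SoftTls {
public:
    T& local() {
        std::lock_guard<std::mutex> lock(mutex_);
        // std::map never invalidates references to existing elements, and
        // each thread only ever touches its own slot, so returning the
        // reference after the lock is released is safe here.
        return slots_[std::this_thread::get_id()];
    }
private:
    std::mutex mutex_;
    std::map<std::thread::id, T> slots_;
};

int main()
{
    SoftTls<int> counter;
    std::thread t1([&] { counter.local() += 1; });
    std::thread t2([&] { counter.local() += 2; });
    t1.join();
    t2.join();
    // The main thread has its own slot, untouched by t1/t2.
    std::printf("main thread slot: %d\n", counter.local());
    return 0;
}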
I believe this is possible, but probably tricky.
Here's a paper describing how __thread or thread_local behaves in ELF images (though it doesn't cover the ARM AEABI specifically):
https://www.akkadia.org/drepper/tls.pdf
The executive summary is:
The linker creates .tbss and/or .tdata sections in the resulting executable to provide a prototype image of the thread local data needed for each thread.
At runtime, each thread control block (TCB) has a pointer to a dynamic thread-local vector table (dtv in the paper) that contains the thread-local storage for that thread. It is lazily allocated and initialized the first time a thread attempts to access a thread-local variable. (presumably by __aeabi_read_tp())
Initialization copies the prototype .tdata image and memsets the .tbss image into the allocated storage.
When source code accesses thread-local variables, the compiler generates code to read the thread pointer from __aeabi_read_tp() and do all the appropriate indirection to get at the storage for that thread-local variable.
The compiler and linker are doing all the work you'd expect them to, but you need to initialize and return a "thread pointer" that is structured and filled out the way the compiler expects, because the compiler generates instructions that follow those hops directly.
There are a few ways that TLS variables are accessed, as mentioned in this paper, which, again, may or may not totally apply to your compiler and architecture:
http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
But, the problems are roughly the same. When you have runtime-loaded libraries that may bring their own .tbss and .tdata sections, it gets more complicated. You have to expand the thread-local storage for any thread that suddenly tries to access a variable introduced by a library loaded after the storage for that thread was initialized. The compiler has to generate different access code depending on where the TLS variable is declared. You'd need to handle and test all the cases you would want to support.
It's years later, so you probably already solved or didn't solve your problem. In this case, it is (was) probably easiest to use your OS's TLS API directly.

OpenMP causes heisenbug segfault

I'm trying to parallelize a pretty massive for-loop in OpenMP. About 20% of the time it runs through fine, but the rest of the time it crashes with various segfaults, such as:
*** glibc detected *** ./execute: double free or corruption (!prev): <address> ***
*** glibc detected *** ./execute: free(): invalid next size (fast): <address> ***
[2] <PID> segmentation fault ./execute
My general code structure is as follows:
<declare and initialize shared variables here>

#pragma omp parallel private(list of private variables which are initialized in for loop) shared(much shorter list of shared variables)
{
    #pragma omp for
    for (index = 0; index < end; index++) {

        // Lots of functionality (science!)
        // Calls to other deep functions which manipulate private variables
        // Finally generated some calculated_values

        shared_array1[index] = calculated_value1;
        shared_array2[index] = calculated_value2;
        shared_array3[index] = calculated_value3;

    } // end for
}

// final tidy up
}
In terms of what's going on, each loop iteration is totally independent of each other loop iteration, other than the fact they pull data from shared matrices (but different columns on each loop iteration). Where I call other functions, they're only changing private variables (although occasionally reading shared variables) so I'd assume they'd be thread safe as they're only messing with stuff local to a specific thread? The only writing to any shared variables happens right at the end, where we write various calculated values to some shared arrays, where array elements are indexed by the for-loop index. This code is in C++, although the code it calls is both C and C++ code.
I've been trying to identify the source of the problem, but no luck so far. If I set num_threads(1) it runs fine, as it does if I enclose the contents of the for-loop in a single
#pragma omp for
for (index = 0; index < end; index++) {
    #pragma omp critical(whole_loop)
    {
        // loop body
    }
}
which presumably gives the same effect (i.e. only one thread can pass through the loop at any one time).
If, on the other hand, I enclose the for-loop's contents in two critical directives e.g.
#pragma omp for
for (index = 0; index < end; index++) {
    #pragma omp critical(whole_loop)
    {
        // first half of loop body
    }
    #pragma omp critical(whole_loop2)
    {
        // second half of loop body
    }
}
I get the unpredictable segfaulting. Similarly, if I enclose EVERY function call in a critical directive it still doesn't work.
The reason I think the problem may be linked to a function call is that when I profile with Valgrind (using valgrind --tool=drd --check-stack-var=yes --read-var-info=yes ./execute), as well as SIGSEGVing, I get an insane number of load and store errors, such as:
Conflicting load by thread 2 at <address> size <number>
at <address> : function which is ultimately called from within my for loop
According to the Valgrind manual, this is exactly what you'd expect with race conditions. Certainly this kind of weirdly appearing/disappearing issue seems consistent with the non-deterministic errors race conditions would give, but I don't understand how that can be if every call which shows apparent race conditions is inside a critical section.
Things which could be wrong but I don't think are:
All private() variables are initialized inside the for-loops (because they're thread local).
I've checked that shared variables have the same memory address while private variables have different memory addresses.
I'm not sure synchronization would help, but given there are implicit barrier directives on entry and exit to critical directives and I've tried versions of my code where every function call is enclosed in a (uniquely named) critical section I think we can rule that out.
Any thoughts on how to best proceed would be hugely appreciated. I've been banging my head against this all day. Obviously I'm not looking for a, "Oh - here's the problem" type answer, but more how best to proceed in terms of debugging/deconstructing.
Things which could be an issue, or might be helpful:
There are some std::vectors in the code which use push_back() to add elements. I remember reading that resizing vectors isn't thread-safe, but the vectors are only private variables, so not shared between threads. I figured this would be OK?
If I enclose the entire for-loop body in a critical directive and slowly shrink back the end of the code block (so an ever-growing region at the end of the for-loop is outside the critical section), it runs fine until I expose one of the function calls, at which point segfaulting resumes. Analyzing this binary with Valgrind shows race conditions in many other function calls, not just the one I exposed.
One of the function calls is to a GSL function, which doesn't trigger any race conditions according to Valgrind.
Do I need to go and explicitly define private and shared variables in the functions being called? If so, this seems somewhat limiting for OpenMP - would this not mean you need to have OpenMP compatibility for any legacy code you call?
Is parallelizing a big for-loop just not something that works?
If you've read this far, thank you and Godspeed.
So there is no way anyone could have answered this, but having figured it out I hope this helps someone, given how bizarre my system's behavior was.
One of the (C) functions I was ultimately calling (my_function->intermediate_function->lower_function->BAD_FUNCTION) declared a number of its variables as static, which meant they retained the same memory address and so essentially acted as shared variables. Interesting that static storage effectively overrides OpenMP's privatization.
I discovered all this by:
Using Valgrind to identify where the errors were happening, and looking at the specific variables involved.
Defining the entire for-loop as a critical section and then exposing more code at the top and bottom.
Talking to my boss. More sets of eyes always help, not least because you're forced to verbalize the problem (which ended up with me opening the culprit function and pointing at the declarations).
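To make the failure mode concrete, here is a small illustrative sketch (not the asker's code; the function names are invented): a function-local static has a single address shared by all threads, whereas state passed in from the caller stays per-thread.

#include <cstdio>
#include <omp.h>

// A function-local static has one instance shared by every thread, even
// when the caller's own variables are private -- the hidden sharing
// described above.
static int next_id_racy(void)
{
    static int counter = 0;   // one instance shared by all threads
    return ++counter;         // data race if called from several threads
}

// Race-free variant: the state is passed in by the caller, so each
// thread (or iteration) owns its own copy.
static int next_id(int& counter)
{
    return ++counter;
}

int main()
{
    next_id_racy();               // fine here: only one thread is running

    int total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < 1000; ++i) {
        int counter = 0;          // per-iteration (and per-thread) state
        total += next_id(counter);
        // next_id_racy();        // would race: the static is shared
    }

    std::printf("total: %d\n", total);
    return 0;
}

If the static cannot be removed, OpenMP's threadprivate directive is another way to give each thread its own copy.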

Correct use of internal function and OpenMP

I have a for loop which calls an internal function:
some variables
for(int i=0; i< 10000000; i++)
func(variables)
Basically, func gets a reference to some array A and inserts values in A[i], so I'm assured
that each call to func actually tries to insert a value to a different place in A, and all other input variables stay the same as they were before the for loop. So func is thread-safe.
Can I safely change the code to
some variables
#pragma omp parallel for
for(int i=0; i< 10000000; i++)
func(variables)
From what I understand from the OpenMP tutorials, this isn't good enough, since the OpenMP libraries wouldn't know that the variables given to func are really thread-safe, so this would yield attempts to perform synchronization which would slow things down, and I would need to declare variables private, etc. But actually, when trying the code above, it indeed seems to run faster and in parallel. Is this as expected? I just wanted to make sure I'm not missing something.
The declaration of func :
func(int i, int client_num, const vector<int>& vec)
First of all, OpenMP cannot magically determine the dependencies in your code. It is your responsibility to ensure the code is correct for parallelization.
In order to safely parallelize the for loop, func must not have loop-carried flow dependences (inter-iteration dependencies), especially read-after-write patterns. Also, you must check that there are no static variables. (The full conditions for safe parallelization are too complex to write down in this short answer.)
Your description of func says that each call writes to a different place. If so, you can safely parallelize with #pragma omp parallel for, as long as the other computations have no dependences that prohibit parallelization.
Your prototype of func: func(int i, int client_num, const vector<int>& vec)
There is a vector, but it is passed as const, so vec should not create any dependency. Simultaneous reads from different threads are safe.
However, you say that the output is different. That means something went wrong. It's impossible to say what the problems are: showing the prototype of the function doesn't help much; we need to know what kind of computations are done in func.
Nonetheless, some steps for the diagnosis are:
Check the dependences in your code. You must not have dependences like the one shown below, where the array A has a loop-carried dependence that prevents parallelization:
for (int k = 1; k < N; ++k) A[k] = A[k-1] + 1;
Check that func is re-entrant (thread-safe). Most often, static and global variables are what breaks the code. If so, you can solve this by privatization: in OpenMP you can declare such variables in a private clause, or use the threadprivate pragma (a minimal sketch follows below).
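A minimal sketch of the threadprivate option mentioned above, assuming GCC with -fopenmp; the variable name is made up for illustration:

#include <cstdio>
#include <omp.h>

// A file-scope variable made per-thread with threadprivate: each thread
// keeps its own copy instead of racing on one shared instance.
static int scratch = 0;
#pragma omp threadprivate(scratch)

int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 8; ++i) {
        scratch = i;                       // each thread writes its own copy
        std::printf("thread %d sees scratch = %d\n",
                    omp_get_thread_num(), scratch);
    }
    return 0;
}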
You do not change your loop variable i anywhere, so the compiler has no problem parallelizing the loop. As i is only copied into your function, it cannot be changed outside.
The only thing you need to make sure of is that inside your function you write only to position A[i] and read only from position A[i]. Otherwise you might get race conditions.
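As an illustration of that pattern, here is a hedged sketch: the real func is not shown in the question, so its body is invented here, and the vector parameter is made non-const so the function can actually write. Each call touches only element i, so the iterations do not conflict.

#include <cstdio>
#include <omp.h>
#include <vector>

// Hypothetical stand-in for the asker's func(): it writes only to slot i,
// so concurrent calls with different i values never touch the same element.
static void func(int i, int client_num, std::vector<int>& vec)
{
    vec[i] = i * client_num;   // touches only element i
}

int main()
{
    const int n = 10000000;
    const int client_num = 3;
    std::vector<int> A(n, 0);   // sized once, before the parallel loop

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        func(i, client_num, A);

    std::printf("A[42] = %d\n", A[42]);
    return 0;
}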