OpenCV in multithreaded environment (OpenMP) causes segmentation fault - c++

I have OpenCV 3.0.0 installed. My code is multithreaded using OpenMP.
Each thread calls the same OpenCV function ("convertTo").
This causes a segmentation fault.
The error does not occur
if I print a simple statement using std::cout at the beginning of each thread, or
if I use only a single thread.
Can anyone help me understand what the reason might be?

Many OpenCV functions and data structures share memory between variables. For example, if you have a matrix Mat A and you do Mat B = A, the data of matrix B is stored in the same memory locations as A's. When you use OpenMP you must make sure that each memory location is written from only a single thread, otherwise you will get an error at runtime.
When you use a single thread there is no problem, since only one thread reads or writes a given memory location.
On the other hand, when you use functions that print to the screen, such as printf() or std::cout, the threads may be delayed: while one thread prints, another thread writes to the memory locations, so the chance of an error at runtime goes down, but that does not mean it will not happen in the future.
The solution, when you use OpenMP in a loop, is to protect writes to the same memory locations from different threads with:
#pragma omp critical
{
   // code that only one thread at a time may execute
}
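As a hedged illustration (this is not code from the question), the following sketch shows both points: cv::Mat assignment sharing the underlying buffer, and a critical section serializing convertTo calls that write into a shared output matrix. The matrix sizes, types, and loop count are arbitrary assumptions; compile with OpenMP enabled (e.g. -fopenmp) and link against opencv_core.

#include <opencv2/core.hpp>

int main()
{
    cv::Mat A = cv::Mat::ones(512, 512, CV_8UC1);
    cv::Mat B = A;          // B shares A's pixel buffer: assignment makes no deep copy
    cv::Mat C = A.clone();  // C owns its own copy, safe to modify independently

    cv::Mat shared_out;     // a single output matrix written by every thread

    #pragma omp parallel for
    for (int i = 0; i < 8; ++i)
    {
        // Only one thread at a time may write the shared output matrix.
        #pragma omp critical
        {
            A.convertTo(shared_out, CV_32FC1);
        }
    }
    return 0;
}

Giving each thread its own clone (as with C above) avoids the critical section entirely, since each thread then writes memory it owns.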

Related

CUDA Segmentation fault in threads with no CUDA code

I have this code:
#include <mutex>
#include <unistd.h>

std::mutex mutexCudaExecution;   // serializes the CUDA calls
void runCuda();                  // defined below

__global__ void testCuda() {}

void wrapperLock()
{
    std::lock_guard<std::mutex> lock(mutexCudaExecution);
    // changing this value to 20000 does NOT trigger "Segmentation fault"
    usleep(5000);
    runCuda();
}

void runCuda()
{
    testCuda<<<1, 1>>>();
    cudaDeviceSynchronize();
}
When these functions are executed from around 20 threads, I get a segmentation fault. As noted in the comment, changing the value in usleep() to 20000 makes it work fine.
Is there an issue with CUDA and threads?
It looks to me like CUDA needs a bit of time to recover after an execution has finished, even when there was nothing to do.
When sharing a single CUDA context, multiple host threads should either delegate their CUDA work to a context-owning thread (similar to a worker thread) or bind the context in each thread with cuCtxSetCurrent (driver API) or cudaSetDevice (runtime API), in order not to overwrite the context's resources.
UPDATE:
According to http://docs.nvidia.com/cuda/cuda-c-programming-guide/#um-gpu-exclusive, the problem was concurrent access to the Unified Memory I am using. I had to wrap the CUDA kernel calls and the access to the Unified Memory with a std::lock_guard, and now the program runs for 4 days under heavy thread load without any problems.
I also have to call cudaSetDevice in each thread, as suggested by Marco & Robert; otherwise it crashes again.
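For illustration only (this is not the asker's code), a minimal sketch of that fix might look like the following: each host thread binds to the device with cudaSetDevice, and a single std::mutex guards both the kernel launch and the subsequent host access to unified memory. The kernel, counter variable, device index 0, and thread count are assumptions; build with nvcc.

#include <mutex>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

std::mutex cudaMutex;           // guards kernel launches and unified-memory access
__managed__ int sharedCounter;  // unified memory, visible to host and device

__global__ void incrementCounter() { ++sharedCounter; }

void workerThread()
{
    cudaSetDevice(0);  // bind this host thread to the device (and its context)

    for (int i = 0; i < 1000; ++i)
    {
        std::lock_guard<std::mutex> lock(cudaMutex);
        incrementCounter<<<1, 1>>>();
        cudaDeviceSynchronize();
        // On devices without concurrent managed access, the host may only
        // touch unified memory after the kernel has finished.
        int snapshot = sharedCounter;
        (void)snapshot;
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 20; ++i)
        threads.emplace_back(workerThread);
    for (auto& t : threads)
        t.join();
    return 0;
}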

Using pointers to local variables in a multi threaded program in C++

Let's say there are 4 consumer threads that run in a loop continuously:
void consumerLoop(int threadIndex)
{
    int myArray[100];
    while (true) {   // main loop
        // ..process data..
        myArray[myIndex] += newValue;
    }
}
I have another monitor thread which does other background tasks.
I need to access myArray for each of these threads from the monitor thread.
Assume that the loops will run forever (so the local variables will always exist) and that the only operation required from the monitor thread is to read the array contents of all the threads.
One alternative is to change myArray to a global array of arrays. But I am guessing that would slow down the consumer loops.
What are the ill effects of declaring a global pointer array
int *p[4]; and assigning each element to the address of the thread's local array by adding a line in consumerLoop like p[threadIndex] = myArray, and then accessing p from the monitor thread?
Note: I am running it on a Linux system and the language is C++. I am not concerned about synchronization/validity of the array contents when I am accessing them from the monitor thread. Let's stay away from a discussion of locking.
If you are really interested in the performance difference, you have to measure. I would guess that there is nearly no difference.
Both approaches are correct, as long as the monitor thread doesn't access stack local variables that are invalid because the function returned.
You cannot access myArray from a different thread because it is a local variable.
You can either 1) use a global variable or 2) malloc the memory and pass the address to all threads.
Please protect the critical section when all the threads rush to use the common memory.
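To make the global-pointer-array idea from the question concrete, here is a hedged sketch (not the asker's code) that follows the question's own assumption of no synchronization: the monitor's reads are formally a data race, which the asker explicitly accepts. The question's myIndex/newValue are replaced by a simple counter for illustration.

#include <thread>
#include <vector>
#include <cstdio>
#include <unistd.h>

int* p[4];  // global pointer array: one slot per consumer thread

void consumerLoop(int threadIndex)
{
    int myArray[100] = {0};
    p[threadIndex] = myArray;   // publish the address of the thread-local array

    for (int i = 0; ; i = (i + 1) % 100)
    {
        myArray[i] += 1;        // stand-in for "..process data.."
    }
}

void monitorLoop()
{
    while (true)
    {
        sleep(1);
        for (int t = 0; t < 4; ++t)
        {
            if (p[t] != nullptr)
                std::printf("thread %d, slot 0 = %d\n", t, p[t][0]);  // unsynchronized read
        }
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(consumerLoop, t);
    std::thread monitor(monitorLoop);
    monitor.join();             // runs forever, as in the question
    for (auto& t : threads) t.join();
    return 0;
}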

OpenMP causes heisenbug segfault

I'm trying to parallelize a pretty massive for-loop in OpenMP. About 20% of the time it runs through fine, but the rest of the time it crashes with various segfaults such as:
*** glibc detected *** ./execute: double free or corruption (!prev): <address> ***
*** glibc detected *** ./execute: free(): invalid next size (fast): <address> ***
[2] <PID> segmentation fault ./execute
My general code structure is as follows:
<declare and initialize shared variables here>

#pragma omp parallel private(list of private variables which are initialized in for loop) shared(much shorter list of shared variables)
{
    #pragma omp for
    for (index = 0 ; index < end ; index++) {
        // Lots of functionality (science!)
        // Calls to other deep functions which manipulate private variables
        // Finally generated some calculated_values
        shared_array1[index] = calculated_value1;
        shared_array2[index] = calculated_value2;
        shared_array3[index] = calculated_value3;
    } // end for

    // final tidy up
} // end parallel region
In terms of what's going on, each loop iteration is totally independent of each other loop iteration, other than the fact they pull data from shared matrices (but different columns on each loop iteration). Where I call other functions, they're only changing private variables (although occasionally reading shared variables) so I'd assume they'd be thread safe as they're only messing with stuff local to a specific thread? The only writing to any shared variables happens right at the end, where we write various calculated values to some shared arrays, where array elements are indexed by the for-loop index. This code is in C++, although the code it calls is both C and C++ code.
I've been trying to identify the source of the problem, but no luck so far. If I set num_threads(1) it runs fine, as it does if I enclose the contents of the for-loop in a single
#pragma omp for
for(index = 0 ; index < end ; index++) {
    #pragma omp critical(whole_loop)
    {
        // loop body
    }
}
which presumably gives the same effect (i.e. only one thread can pass through the loop at any one time).
If, on the other hand, I enclose the for-loop's contents in two critical directives e.g.
#pragma omp for
for(index = 0 ; index < end ; index++) {
    #pragma omp critical(whole_loop)
    {
        // first half of loop body
    }
    #pragma omp critical(whole_loop2)
    {
        // second half of loop body
    }
}
I get the unpredictable segfaulting. Similarly, if I enclose EVERY function call in a critical directive it still doesn't work.
The reason I think the problem may be linked to a function call is because when I profile with Valgrind (using valgrind --tool=drd --check-stack-var=yes --read-var-info=yes ./execute), as well as SIGSEGVing I get an insane number of load and store errors, such as:
Conflicting load by thread 2 at <address> size <number>
at <address> : function which is ultimately called from within my for loop
According to the Valgrind manual, this is exactly what you'd expect with race conditions. Certainly this kind of weirdly appearing/disappearing issue seems consistent with the non-deterministic errors race conditions would give, but I don't understand how that can be if every call which shows apparent race conditions is inside a critical section.
Things which could be wrong but I don't think are include:
All private() variables are initialized inside the for-loops (because they're thread local).
I've checked that shared variables have the same memory address while private variables have different memory addresses.
I'm not sure synchronization would help, but given there are implicit barrier directives on entry and exit to critical directives and I've tried versions of my code where every function call is enclosed in a (uniquely named) critical section I think we can rule that out.
Any thoughts on how to best proceed would be hugely appreciated. I've been banging my head against this all day. Obviously I'm not looking for a, "Oh - here's the problem" type answer, but more how best to proceed in terms of debugging/deconstructing.
Things which could be an issue, or might be helpful:
There are some std::vectors in the code which use push_back() to add elements. I remember reading that resizing vectors isn't thread safe, but the vectors are only private variables, so not shared between threads. I figured this would be OK?
If I enclose the entire for-loop body in a critical directive and slowly shrink back the end of the code block (so an ever-growing region at the end of the for-loop is outside the critical section), it runs fine until I expose one of the function calls, at which point the segfaulting resumes. Analyzing this binary with Valgrind shows race conditions in many other function calls, not just the one I exposed.
One of the function calls is to a GSL function, which doesn't trigger any race conditions according to Valgrind.
Do I need to go and explicitly define private and shared variables in the functions being called? If so, this seems somewhat limiting for OpenMP - would this not mean you need to have OpenMP compatibility for any legacy code you call?
Is parallelizing a big for-loop just not something that works?
If you've read this far, thank you and Godspeed.
So there is no way anyone could have answered this, but having figured it out I hope this helps someone, given that my system's behavior was so bizarre.
One of the (C) functions I was ultimately calling (my_function->intermediate_function->lower_function->BAD_FUNCTION) declared a number of its variables as static, which meant that they retained the same memory address and so essentially acted as shared variables. Interestingly, static storage overrides OpenMP's private scoping (a hypothetical reconstruction is sketched after the list below).
I discovered all this by:
Using Valgrind to identify where errors were happening, and looking at the specific variables involved.
Defining the entire for-loop as a critical section and then exposing more code at the top and bottom.
Talking to my boss. More sets of eyes always help, not least because you're forced to verbalize the problem (which ended up with me opening the culprit function and pointing at the declarations).
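The culprit function is not shown in the question, but as a hedged reconstruction of the failure mode, the sketch below shows how a static local in a called function silently becomes shared state across all OpenMP threads, while an ordinary local stays on each thread's private stack. Function names and the arithmetic are invented for illustration; compile with -fopenmp.

#include <cstdio>

// Hypothetical reconstruction of the culprit: the static local keeps ONE
// shared copy across all threads, even though every caller treats the
// function as if it only touched private data.
double bad_function(double x)
{
    static double scratch;   // one instance shared by every thread -> data race
    scratch = x * x;
    return scratch + 1.0;
}

// Race-free version: an ordinary local lives on each thread's own stack.
double good_function(double x)
{
    double scratch = x * x;
    return scratch + 1.0;
}

int main()
{
    double results[1000];

    #pragma omp parallel for
    for (int i = 0; i < 1000; ++i)
        results[i] = bad_function(i);   // racy: swap in good_function to fix

    std::printf("%f\n", results[999]);
    return 0;
}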

C++ Creating a SIGSEGV for debug purposes

I am working on a lock-free shared variable class, and I want to be able to generate a SIGSEGV fault to see if my implementation works as I planned. I've tried creating a function that modifies a pointer and reads it 100 times. I then call this function in both threads and have the threads run infinitely within my program. This doesn't generate the error I want. How should I go about doing this?
edit
I don't handle segfaults at all, but they are generated in my program if I remove locks. I want to use a lock-less design, therefore I created a shared variable class that uses CAS to remain lockless. Is there a way that I can have a piece of code that will generate segfaults, so that I can use my class to test that it fixes the problem?
#include <signal.h>
raise(SIGSEGV);
Will cause an appropriate signal to be raised.
malloc + mprotect + dereference pointer
The mprotect man page has an example.
Dereferencing a pointer to unallocated memory (at least on my system):
int *a;
*a = 0;
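If you want a segfault that is reproducible rather than dependent on whatever an uninitialized pointer happens to hold, a hedged sketch of the mmap + mprotect approach (assuming a POSIX system such as Linux; error checks omitted for brevity):

#include <sys/mman.h>
#include <unistd.h>

int main()
{
    // Reserve one page of memory, then remove all access rights to it.
    long pageSize = sysconf(_SC_PAGESIZE);
    char* page = static_cast<char*>(mmap(nullptr, pageSize,
                                         PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    mprotect(page, pageSize, PROT_NONE);

    page[0] = 'x';   // touching the protected page raises SIGSEGV
    return 0;        // never reached
}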

Having a function that returns int, how can I run it in a separate thread using boost?

I know it does not look necessary, but I hope that it will help me find a memory leak.
So, having a function inside a class that returns int, how can I call it from another function of that class so that the int-returning function runs in another thread?
You are trying to find a memory leak in a function by having it called from another thread? That is like trying to find a needle in a haystack by adding more hay to the stack.
Thread programming 101:
Spawn a new thread ("thread2") that invokes a new function ("foo").
Have the original thread join against thread2 immediately after the spawn.
Read a global variable that foo() has written its final value to.
Notice that foo() cannot return its value to the original thread; it must write the value to some shared memory (i.e., a global variable), as in the sketch below. Also note that this will not solve your memory leak problem, or even make it obvious where your memory leak is coming from.
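As a hedged sketch of those steps using boost::thread (the Worker class, compute(), and result_ member are invented names; a member variable plays the role of the shared memory here):

#include <boost/thread.hpp>
#include <iostream>

class Worker
{
public:
    int compute()                  // the function that returns int
    {
        return 42;
    }

    void runInBackground()
    {
        // Spawn a thread that calls compute() and stores the result in a
        // member variable, since the return value itself is discarded.
        boost::thread t([this] { result_ = compute(); });
        t.join();                  // join immediately, as in step 2 above
        std::cout << "result = " << result_ << "\n";
    }

private:
    int result_ = 0;               // shared memory holding compute()'s value
};

int main()
{
    Worker w;
    w.runInBackground();
    return 0;
}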
Look for memory leaks with Valgrind. And read a book or tutorial about multithreading.
The operating system will not reclaim memory leaks in worker threads. That's not how it works.
Fix your bugs. The world doesn't need any more crappy software.