EDIT: I can run the same program twice, simultaneously without any problem - how can I duplicate this with OpenMP or with some other method?
This is the basic framework of the problem.
//Defined elsewhere
class SomeClass
{
public:
    void Function()
    {
        // Allocate some memory
        float *Data;
        Data = new float[1024];
        // Declare a struct which will be used by functions defined in the DLL
        SomeStruct Obj;
        Obj = MemAllocFunctionInDLL(Obj);
        // Call it
        FunctionDefinedInDLL(Data, Obj);
        // Clean up
        MemDeallocFunctionInDLL(Obj);
        delete [] Data;
    }
};

void Bar()
{
    #pragma omp parallel for
    for(int j = 0; j < 10; ++j)
    {
        SomeClass X;
        X.Function();
    }
}
I've verified that when some memory is attempted to be deallocated through MemDeallocFunctionInDLL(), the _CrtIsValidHeapPointer() assertion fails.
Is this because both threads are writing to the same memory?
So to fix this, I thought I'd make SomeClass private (this is totally alien to me, so any help is appreciated).
void Bar()
{
    SomeClass X;
    #pragma omp parallel for default(shared) private(X)
    for(int j = 0; j < 10; ++j)
    {
        X.Function();
    }
}
And now it fails when it tries to allocate memory in the beginning for Data.
Note: I can make changes to the DLL if required
Note: It runs perfectly without #pragma omp parallel for
EDIT: Now Bar looks like this:
void Bar()
{
    int j;
    #pragma omp parallel for default(none) private(j)
    for(j = 0; j < 10; ++j)
    {
        SomeClass X;
        X.Function();
    }
}
Still no luck.
Check whether MemAllocFunctionInDLL, FunctionDefinedInDLL, and MemDeallocFunctionInDLL are thread-safe, or at least re-entrant. In other words, do these functions use static or shared variables? If so, you need to make sure those variables are not corrupted by other threads.
The fact that it runs fine without the omp for suggests that some of these functions were not written to be thread-safe.
I'd like to see what kind of memory allocation/free functions have been used in Mem(Alloc|Dealloc)FunctionInDLL.
Added: I'm pretty sure the functions in your DLL are not thread-safe. You can run two instances of this program concurrently without problems: that should be okay unless your program uses system-wide shared resources (such as shared memory among processes), which is very rare. In that case there are no variables shared between the two runs, so each works fine.
But invoking these functions from multiple threads (that is, within a single process) crashes your program. It means there is some state shared among the threads, and it is getting corrupted.
This is not a problem with OpenMP, just a multithreading bug, and it may be simple to solve. Please check whether the DLL functions are safe to be called concurrently by many threads.
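For illustration only (the real DLL code is not shown, so the struct member and buffer names below are hypothetical), this is the kind of non-thread-safe pattern inside a DLL that could produce exactly this kind of failure:
struct SomeStruct { float *Scratch; };   // hypothetical layout

static float *g_scratch = nullptr;       // one pointer shared by every caller of the DLL

SomeStruct MemAllocFunctionInDLL(SomeStruct Obj)
{
    g_scratch = new float[256];          // thread B overwrites thread A's pointer here
    Obj.Scratch = g_scratch;
    return Obj;
}

void MemDeallocFunctionInDLL(SomeStruct Obj)
{
    delete [] g_scratch;                 // may free a block the other thread still owns, or free it twice
    g_scratch = nullptr;
}
If the DLL keeps all of its state inside the SomeStruct it hands back (or in thread_local storage), each thread works on its own data and the assertion should go away.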
How to privatize static variables
Say that we have such global variables:
static int g_data;
static int* g_vector = new int[100];
Privatization is nothing but a creating private copy for each thread.
int g_data[num_threads];
int* g_vector[num_threads];
for (int i = 0; i < num_threads; ++i)
    g_vector[i] = new int[100];
And, then any references on such variables are
// Thread: tid
g_data[tid] = ...
.. = g_vector[tid][..]
Yes, it's pretty simple. However, this sort of code may have a false sharing problem. But false sharing is a matter of performance, not correctness.
First, just try to privatize any static and global variables. Then check its correctness. Next, see what speedup you get. If the speedup is scalable (say 3.7x faster on a quad core), then it's okay. But in case of a low speedup (such as 2x on a quad core), you should probably look at false sharing. To solve a false sharing problem, all you need to do is put some padding into the data structures, as sketched below.
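A minimal sketch of that padding (this assumes 64-byte cache lines, which is typical but should be checked for your target CPU):
// Give each thread's element its own cache line so that writes from one
// thread no longer invalidate the line holding another thread's element.
struct alignas(64) PaddedInt {
    int value;
};

PaddedInt g_data[num_threads];   // per-thread access becomes g_data[tid].value = ...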
Instead of
delete Data
you must write
delete [] Data;
Wherever you do new [], make sure to use delete [].
It looks like your problem is not specific to OpenMP. Did you try to run your application without the #pragma omp parallel for?
default(shared) means all variables are shared between threads, which is not what you want. Change that to default(none).
private(X) gives each thread its own X, but those copies are not copies of your original: for a class type they are default-constructed, so any set-up performed on the original X is not carried over (firstprivate(X) would copy-construct each thread's X from the original instead); see the sketch below.
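Purely to illustrate the difference (this is not the fix for your crash, which according to the other answers lies in the DLL), the loop with a copy-constructed per-thread object would look like:
void Bar()
{
    SomeClass X;   // constructed once by the master thread
    #pragma omp parallel for default(none) firstprivate(X)
    for(int j = 0; j < 10; ++j)
    {
        X.Function();   // each thread calls Function() on its own copy of X
    }
}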
I think you'd be better off with your initial approach: put a breakpoint in the Dealloc call, and see what the memory pointer is and what it contains. You can inspect the guard bytes to tell whether the memory has been overwritten at the end of a single call, or after a thread.
Incidentally, I am assuming this works if you run it once, without the omp loop?
Related
I'm calling a class function in parallel using OpenMP. It works okay in serial, or when the my_func part is put in a critical section (slow, apparently). Running in parallel, however, keeps giving the following error message:
malloc: *** error for object 0x7f961dcef750: pointer being freed was
not allocated
I think the problem is with the new operators in my_func, i.e. the pointer myDyna seems to be shared among threads. My questions are:
1) Isn't everything inside the parallel region private, including all the pointers in my_func? That is, each thread should have its own copy of myDyna, so why does the error occur?
2) Without changing my_func too much, what can be done to make the parallelization work at the main level? For example, would adding a copy constructor to myDyna work?
void my_func(){
    Variables *theVars = new Variables();
    // trDynaPS Constructor for the class.
    trDynaPS *myDyna = new trDynaPS(theVars, starttime, stepsize, stoptime);
    ResultRate = myDyna->trDynaPS_Drive(1, 1);
    if (theVars != nullptr) {
        myDyna->theVars = nullptr;
        delete theVars;
    }
    delete myDyna;
}
int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 10; i++){
        // I have multiple copies of myclass as a vector
        myclass[i]->run(); // my_func is in the run operation
    }
    return 0;
}
I have a function f in which I can use parallel processing. For this purpose, I used OpenMP.
However, this function is called many times, and it seems that thread creation is done on every call.
How can I reuse the threads?
void f(X &src, Y &dest) {
    ... // do processing based on "src"
    #pragma omp parallel for
    for (...) {
    }
    ... // put output into "dest"
}

int main() {
    ...
    for(...) { // It is impossible to parallelize this outer loop itself.
        f(...);
    }
    ...
    return 0;
}
OpenMP implements a thread pool internally; it tries to reuse threads unless you change some of its settings in between, or call parallel regions from different application threads while others are still active.
One can verify that the threads are indeed the same by using thread-locals (see the sketch below), so I'd recommend verifying your claim that the threads are being recreated. The OpenMP runtime does lots of smart optimizations beyond the obvious thread-pool idea; you just need to know how to tune and control it properly.
While it is unlikely that threads are recreated, it is easy to see how threads can go to sleep by the time you call the parallel region again, and it takes a noticeable amount of time to wake them up. You can prevent threads from going to sleep by using OMP_WAIT_POLICY=active and/or implementation-specific environment variables like KMP_BLOCKTIME=infinite (for the Intel/LLVM runtimes).
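A minimal sketch of such a check using a thread-local counter (illustrative only): if the runtime reuses its threads, each worker reports a count close to the number of calls (1000 here); if threads were being recreated, the counts would be near zero.
#include <omp.h>
#include <cstdio>

int main() {
    static thread_local int times_used = 0;   // one counter per OS thread

    for (int call = 0; call < 1000; ++call) {
        #pragma omp parallel
        {
            ++times_used;   // each team member bumps its own counter
        }
    }

    #pragma omp parallel
    {
        #pragma omp critical
        printf("thread %d entered a parallel region %d times\n",
               omp_get_thread_num(), times_used);
    }
    return 0;
}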
This is just in addition to Anton's correct answer. If you are really concerned about the issue, for most programs you can easily move the parallel region to the outside and keep serial work serial, as follows:
void f(X &src, Y &dest) {
    // You can also do simple computations
    // without side effects outside of the single section
    #pragma omp single
    {
        ... // do processing based on "src"
    }
    #pragma omp for // note parallel is missing
    for (...) {
    }
    #pragma omp critical
    ... // each thread puts its own part of the output into "dest"
}

int main() {
    ...
    // make sure to declare loop variable locally or explicitly private
    #pragma omp parallel
    for(type variable; ...; ...) {
        f(...);
    }
    ...
    return 0;
}
Use this only if you have measured evidence that you are suffering from the overhead of reopening parallel regions. You may have to juggle shared variables, or manually inline f, because all variables declared within f will be private - so how it looks in detail depends on your specific application.
If I fork inside my main program and then call a subroutine inside a single directive, what is the behavior if I enter an OMP parallel directive in this subroutine?
My guess/hope is that existing threads are used, as they all should have nothing to do at the moment.
Pseudo-Example:
double A[];

int main() {
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        {
            for (int t = 0; t < 1000; t++) {
                evolve();
            }
        }
    }
}

void evolve() {
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 100; i++) {
        do_stuff(i);
    }
}

void do_stuff(int i) {
    // expensive calculation on array element A[i]
}
As evolve() is called very often, forking here would cause way too much overhead, so I would like to do it only once, then call evolve() from a single thread and split the work of the calls to do_stuff() over the existing threads.
For Fortran this seems to work; I get a roughly 80-90% speed increase on a simple example using 2 threads. But for C++ I get a different behavior: only the thread which executes the single directive is used for the loop in evolve().
I fixed the problem using the task directive in the main program and passing the limits to evolve(), but this looks like a clumsy solution...
Why is the behavior in Fortran and C++ different and what would be the solution in C++?
I believe orphaned directives are the cleanest solution in your case:
double A[];

int main() {
    #pragma omp parallel num_threads(2)
    {
        // Each thread calls evolve() a thousand times
        for (int t = 0; t < 1000; t++) {
            evolve();
        }
    }
}

void evolve() {
    // The orphaned construct inside evolve()
    // will bind to the innermost enclosing parallel region
    #pragma omp for
    for (int i = 0; i < 100; i++) {
        do_stuff(i);
    } // Implicit thread synchronization
}

void do_stuff(int i) {
    // expensive calculation on array element A[i]
}
This will work because (section 2.6.1 of the standard):
A loop region binds to the innermost enclosing parallel region
That said, in your code you are using nested parallel constructs. To be sure to enable them you must set the environment variable OMP_NESTED to true, otherwise (quoting Appendix E of the latest standard):
OMP_NESTED environment variable: if the value is neither true nor false the behavior is implementation defined
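If you prefer to control this from code rather than from the environment, the runtime library offers the equivalent switches (omp_set_nested is the classic call; OpenMP 5.0 deprecates it in favour of omp_set_max_active_levels):
#include <omp.h>

int main() {
    omp_set_nested(1);            // allow nested parallel regions
    omp_set_max_active_levels(2); // permit at most two nested active levels
    // ... rest of the program as above
}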
Unfortunately, your code will likely not work as expected in all cases. If you have a code structure like this:
void foo() {
    #pragma omp parallel
    #pragma omp single
    bar();
}

void bar() {
    #pragma omp parallel
    printf("...");
}
OpenMP is requested to create a new team of threads when entering the parallel region in bar. OpenMP calls that "nested parallelism". However, what exactly happens depends on the actual OpenMP implementation you use and the setting of OMP_NESTED.
OpenMP implementations are not required to support nested parallelism. It would be perfectly legal for an implementation to ignore the parallel region in bar and just execute it with one thread. OMP_NESTED can be used to turn nesting on and off, if the implementation supports it.
In your case, things by chance went well, since you sent all threads to sleep except one. This thread then created a new team of threads of full size (potentially NEW threads, not reusing the old ones). If you omitted the single construct, you would easily get thousands of threads.
Unfortunately, OpenMP does not support your pattern of creating a parallel team, having one thread execute the call stack, and then distributing work across the other team members through a worksharing construct like for. If you need this code pattern, the only solution is OpenMP tasks - see the sketch below.
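For illustration, a sketch of what the task-based variant might look like with taskloop (available since OpenMP 4.5; this is my own sketch, not the asker's actual fix):
void evolve() {
    // Reached by the one thread inside the single region; taskloop slices the
    // iterations into tasks that any thread of the enclosing team can run.
    #pragma omp taskloop
    for (int i = 0; i < 100; i++) {
        do_stuff(i);
    }
    // taskloop has an implicit taskgroup, so all iterations complete before returning.
}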
Cheers,
-michael
Your example doesn't actually call fork(), so I suspect you don't mean fork in the system-call sense (i.e. duplicating your process). However, if that really is what you meant, I suspect that most OpenMP implementations will not work correctly in a forked process. Typically, threads are not preserved across fork() calls. If the OpenMP implementation you use registers pthread_atfork() handlers, it may work correctly following a fork() call, but it will not use the same threads as the parent process.
Consider the following serial function. When I parallelize my code, every thread will call this function from within the parallel region (not shown). I am trying to make this thread-safe and efficient (fast).
float get_stored_value__or__calculate_if_does_not_yet_exist( int A )
{
    static std::map<int, float> my_map;
    std::map<int, float>::iterator it_find = my_map.find(A); //many threads do this often.
    bool found_A = it_find != my_map.end();
    if (found_A)
    {
        return it_find->second;
    }
    else
    {
        float result_for_A = calculate_value(A); //should only be done once, really.
        my_map[A] = result_for_A;
        return result_for_A;
    }
}
Almost every single time this function is called, the threads will successfully "find" the stored value for their "A" (whatever it is). Every once in a while, when a "new A" is called, a value will have to be calculated and stored.
So where should I put the #pragma omp critical ?
Though easy, it is very inefficient to put a #pragma omp critical around all of this, since each thread will be doing this constantly and it will often be the read-only case.
Is there any way to implement a "one-way" critical, or a "one-way" lock routine? That is, the above operations involving the iterator should only be "locked" when writing to my_map in the else statement. But multiple threads should be able to execute the .find call simultaneously.
I hope I make sense.
Thank you.
According to this link on Stack Overflow, inserting into a std::map doesn't invalidate iterators. The same goes for the end() iterator. Here's a supporting link.
Unfortunately, insertion can happen multiple times if you don't use a critical section. Also, since your calculate_value routine might be computationally expensive, you will have to lock to avoid the else clause being executed twice with the same value of A and the result being inserted twice.
Here's a sample function where you can replicate this incorrect multiple insertion:
void testFunc(std::map<int,float> &theMap, int i)
{
    std::map<int,float>::iterator ite = theMap.find(i);
    if(ite == theMap.end())
    {
        theMap[i] = 3.14 * i * i;
    }
}
Then called like this:
std::map<int,float> myMap;
int i;
#pragma omp parallel for
for(i = 1; i <= 100000; ++i)
{
    testFunc(myMap, i % 100);
}
if(myMap.size() != 100)
{
    std::cout << "Problem!" << std::endl;
}
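For completeness, a minimal sketch of the same demo made correct with a critical section (this is the simple but slow fix the question is trying to avoid):
void testFuncSafe(std::map<int,float> &theMap, int i)
{
    #pragma omp critical
    {
        // find and insert now happen as one indivisible step per thread
        if(theMap.find(i) == theMap.end())
        {
            theMap[i] = 3.14 * i * i;
        }
    }
}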
Edit: edited to correct an error in the earlier version.
OpenMP is a compiler "tool" for directive-based loop parallelization, not a thread communication or synchronization library; so it doesn't have sophisticated mutexes such as a read/write mutex (acquire the lock for writing, but allow concurrent readers).
Here's an implementation example.
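For illustration (this is not the linked example, just a minimal sketch using C++17's std::shared_mutex, which is outside OpenMP), a read-mostly cache along the lines the question asks for could look like this:
#include <map>
#include <mutex>
#include <shared_mutex>

float calculate_value(int A); // assumed to exist, as in the question

float get_stored_value__or__calculate_if_does_not_yet_exist(int A)
{
    static std::map<int, float> my_map;
    static std::shared_mutex my_mutex;

    {
        // Many readers may hold the shared lock at the same time.
        std::shared_lock<std::shared_mutex> read_lock(my_mutex);
        auto it = my_map.find(A);
        if (it != my_map.end())
            return it->second;
    }

    // Rare path: take the exclusive lock and re-check, because another
    // thread may have inserted A between releasing and re-acquiring.
    std::unique_lock<std::shared_mutex> write_lock(my_mutex);
    auto it = my_map.find(A);
    if (it == my_map.end())
        it = my_map.emplace(A, calculate_value(A)).first;
    return it->second;
}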
Anyway Chris A.'s answer is better than mine though :)
While #ChrisA's answer may solve your problem, I'll leave my answer here in case any future searchers find it useful.
If you'd like, you can give #pragma omp critical sections a name. Then, any section with that name is considered the same critical section. If this is what you would like to do, you can easily make only small portions of your method critical.
std::map<int, float>::iterator it_find;
bool found_A;
#pragma omp critical(map_protect)
{
    it_find = my_map.find(A); //many threads do this often.
    found_A = it_find != my_map.end();
}
...
#pragma omp critical(map_protect)
{
    float result_for_A = calculate_value(A); //should only be done once, really.
    my_map[A] = result_for_A;
}
The #pragma omp atomic and #pragma omp flush directives may also be useful.
atomic causes an update of a memory location (the lvalue in the statement that follows the directive) to be performed atomically.
flush ensures that values expected to be visible to all threads have actually been written back to memory, rather than sitting in a register or processor cache where other threads cannot see them.
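A tiny sketch of atomic in use (illustrative):
int hits = 0;
#pragma omp parallel for
for (int i = 0; i < 1000; ++i)
{
    #pragma omp atomic // the increment itself is performed atomically
    ++hits;
}
// hits is exactly 1000 here; without the atomic, increments could be lost to races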
According to the OpenMP Memory Model, the following is incorrect:
int *p0 = NULL, *p1 = NULL;
#pragma omp parallel shared(p0,p1)
{
int x;
// THREAD 0 // THREAD 1
p0 = &x; p1 = &x;
*p1 ... *p0 ...
}
My example looks like the following though:
int *p0 = NULL, *p1 = NULL;
#pragma omp parallel shared(p0,p1)
{
int x;
// THREAD 0 // THREAD 1
p0 = &x; p1 = &x;
#pragma omp flush
#pragma omp barrier
*p1 ... *p0 ...
#pragma omp barrier
}
Would this be incorrect? I cannot find anything in the memory model that would disallow it.
I assume that my toy example is correct, as the memory model in 3.1 allows a task to access a private variable as long as the programmer ensures that it is still alive. Given that tasks can be untied, they can in theory execute on a different worker thread, therefore allowing one OpenMP thread access to the private memory of another.
This should work. flush synchronizes the shared variables, and the barrier guarantees that every thread has made its assignment (and that x is still alive) before any thread dereferences the other's pointer. As long as you don't use p0 in p1's assignment or vice versa it should be fine, although I cannot imagine why one would do something like that. Maybe you can tell us more about the reasoning behind that construct.
Since p0 and p1 are still alive after the parallel region, you could do all the assignments there as well, without barriers etc.
As a side thought, this is analogous to trying to read a local variable inside some function you've called by assigning that local variable to a global variable, then reading the global variable.
The analogy here is that a global variable acts like a shared variable in multithreading, essentially giving access to what was supposed to be thread private (like the local variable that should only be visible within the function).
So, to answer the question as asked: dereferencing a pointer into thread-private memory is completely valid. It is allowed because pointer aliasing is allowed (this is where two or more variables provide access to the same location in memory; in your case one is a thread-private integer and the other is a shared pointer).
Although completely valid, beware that this can cause some difficult-to-detect race conditions, as one typically would not use a lock to protect access to thread-private variables.