Accessing a thread's private memory in OpenMP - c++

According to the OpenMP Memory Model, the following is incorrect:
int *p0 = NULL, *p1 = NULL;
#pragma omp parallel shared(p0,p1)
{
    int x;
    // THREAD 0        // THREAD 1
    p0 = &x;           p1 = &x;
    *p1 ...            *p0 ...
}
My example looks like the following though:
int *p0 = NULL, *p1 = NULL;
#pragma omp parallel shared(p0,p1)
{
    int x;
    // THREAD 0        // THREAD 1
    p0 = &x;           p1 = &x;
    #pragma omp flush
    #pragma omp barrier
    *p1 ...            *p0 ...
    #pragma omp barrier
}
Would this be incorrect? I cannot find anything in the memory model that would disallow it.
I assume that my toy example is correct: in Section 3.1 the memory model allows a task to access a private variable as long as the programmer ensures that it is still alive. Given that tasks can be untied and can therefore, in theory, execute on a different worker thread, an OpenMP thread is already allowed access to the private memory of another.

This should work. The flush synchronizes all shared variables and the barrier guarantees that the parallel region, with all of its threads, is still active. As long as you don't use p0 in p1's assignment or vice versa, it should be fine, although I cannot imagine why one would do something like that. Maybe you can tell us more about the reasoning behind that construct.
Since p0 and p1 are still alive after the parallel region, you could do all the assignments there as well, without barriers etc.

As a side thought, this is analogous to trying to read a local variable inside some function you've called by assigning that local variable to a global variable, then reading the global variable.
The analogy here is that a global variable acts like a shared variable in multithreading, essentially giving access to what was supposed to be thread private (like the local variable that should only be visible within the function).
So to answer the question as asked: dereferencing pointers into thread-private memory is completely valid. It is allowed because pointer aliasing is allowed (this is where two or more variables provide access to the same location in memory; in your case one is a thread-private integer and the other is a shared pointer).
Although completely valid, beware that this can cause some difficult-to-detect race conditions, as typically one would not use a lock to protect access to thread-private variables.
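For concreteness, here is a minimal runnable sketch of the pattern under discussion (the num_threads(2) clause, the initial values and the printfs are my additions; note that a barrier already implies a flush, so the explicit flush in the question is not strictly required):

#include <cstdio>
#include <omp.h>

int main()
{
    int *p0 = NULL, *p1 = NULL;
    #pragma omp parallel num_threads(2) shared(p0, p1)
    {
        int x = omp_get_thread_num() + 100;       // thread-private

        if (omp_get_thread_num() == 0) p0 = &x;   // publish the address
        else                           p1 = &x;

        #pragma omp barrier   // implies a flush; both pointers now visible

        if (omp_get_thread_num() == 0 && p1) printf("thread 0 sees %d\n", *p1);
        if (omp_get_thread_num() == 1 && p0) printf("thread 1 sees %d\n", *p0);

        #pragma omp barrier   // keep x alive until both reads have finished
    }
    return 0;
}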

Related

Is using volatile on shared memory safe?

Let's suppose the following:
I have two processes on Linux / Mac OS.
I have mmap'ed shared memory (or a file).
Then in both processes I have the following:
struct Data{
    volatile int reload = 0; // using int because it is more standard
    // more things in the future...
};

void *mmap_memory = mmap(...);
Data *data = static_cast<Data *>(mmap_memory); // suppose size is sufficient and all OK
Then in one of the processes I do:
//...
data->reload = 1;
//...
And in the other I do:
while(...){
    do_some_work();
    //...
    if (data->reload == 1)
        do_reload();
}
Will this be thread / inter process safe?
Idea is from here:
https://embeddedartistry.com/blog/2019/03/11/improve-volatile-usage-with-volatile_load-and-volatile_store/
Note:
This cannot be safe with std::atomic<>, since it does not "promise" anything about shared memory. Also, constructing/destructing from two different processes is not clear at all.
Will this be thread / inter process safe?
No.
From your own link:
One problematic and common assumption is that volatile is equivalent to “atomic”. This is not the case. All the volatile keyword denotes is that the variable may be modified externally, and thus reads/writes cannot be optimized.
Your code needs atomic access to the value. if (data->reload == 1) won't work if it reads some partial/intermediate value from data->reload.
And never mind what happens if multiple threads do read 1 from data->reload; your posted code doesn't handle that at all.
Also see Why is volatile not considered useful in multithreaded C or C++ programming?
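If you need atomic access to that flag without std::atomic, one approach on GCC and Clang is the __atomic builtins, which operate on an ordinary int in place. A minimal sketch, assuming int is lock-free on the target (the writer/reader split mirrors the question; volatile is dropped because it is no longer needed):

struct Data {
    int reload = 0;
    // more things in the future...
};

// In the writing process:
__atomic_store_n(&data->reload, 1, __ATOMIC_RELEASE);

// In the reading process:
if (__atomic_load_n(&data->reload, __ATOMIC_ACQUIRE) == 1)
    do_reload();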

C++: __sync_synchronize() still needed with std::atomic?

I've been running into an infrequent but recurring race condition.
The program has two threads and uses std::atomic. I'll simplify the critical parts of the code to look like:
std::atomic<uint64_t> b; // flag, initialized to 0
uint64_t data[100]; // shared data, initialized to 0
thread 1 (publishing):
// set various shared variables here, for example
data[5] = 10;
uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
if (b.load() != 0) { // signal that data is ready
    // read various shared variables here, for example:
    uint64_t x = data[5];
    // race condition sometimes (x sometimes not consistent)
}
The odd thing is that when I add __sync_synchronize() to each thread, then the race condition goes away. I've seen this happen on two different servers.
i.e. when I change the code to look like the following, then the problem goes away:
thread 1 (publishing):
// set various shared variables here, for example
data[5] = 10;
__sync_synchronize();
uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
if (b.load() != 0) { // signal that data is ready
    __sync_synchronize();
    // read various shared variables here, for example:
    uint64_t x = data[5];
}
Why is __sync_synchronize() necessary? It seems redundant as I thought both exchange and load ensured the correct sequential ordering of logic.
Architecture: x86_64 processors, Linux, g++ 4.6.2.
Whilst it is impossible to say from your simplified code what actually goes on in your application, the fact that __sync_synchronize helps, and that this function is a memory barrier, tells me that one thread is writing things that the other thread reads, in a way that isn't atomic.
An example:
thread_1:
    object *p = new object;
    p->x = 1;
    b.exchange(p); /* give pointer p to the other thread */
thread_2:
    object *p = b.load();
    if (p->x == 1) do_stuff();
    else error("Huh?");
This may very well trigger the error path in thread_2, because the write to p->x has not actually been completed when thread_2 reads the new pointer value p.
Adding a memory barrier, in this case in the thread_1 code, should fix this. Note that for THIS case, a memory barrier in thread_2 will not do anything: it may alter the timing and appear to fix the problem, but it won't be the right thing. You may still need memory barriers on both sides if you are reading and writing memory that is shared between two threads.
I understand that this may not be precisely what your code is doing, but the concept is the same: __sync_synchronize ensures that memory reads and memory writes have completed for ALL of the instructions before that function call [which isn't a real function call; it inlines a single instruction that waits for any pending memory operations to complete].
Noteworthy is that operations on std::atomic order other reads and writes only according to the memory order used: with std::memory_order_relaxed they affect ONLY the actual data stored in the atomic object, not reads/writes of other data.
Sometimes you also need a "compiler barrier" to avoid the compiler moving stuff from one side of an operation to another:
std::atomic<bool> flag(false);
value = 42;
flag.store(true);
....
another thread:
while(!flag.load());
print(value);
Now, in the absence of such a barrier there would be a chance that the compiler generates the first form as:
flag.store(true);
value = 42;
That wouldn't be good, would it? std::atomic is guaranteed to act as a "compiler barrier" here, but in other cases the compiler may well shuffle things around in a similar way.
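For reference, the portable C++11 way to express the questioner's pattern is a release/acquire pair, which already forbids both the compiler and the CPU from the reordering described above. A minimal sketch (variable names from the question; the function wrappers are mine):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> b{0};   // flag, initialized to 0
uint64_t data[100] = {};      // shared data, initialized to 0

void publisher()   // thread 1
{
    data[5] = 10;                                            // plain write
    uint64_t a = b.exchange(1, std::memory_order_release);   // publishes it
    (void)a;
}

void receiver()    // thread 2
{
    if (b.load(std::memory_order_acquire) != 0)
    {
        uint64_t x = data[5];   // guaranteed to read 10
        (void)x;
    }
}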

Proper compiler intrinsics for double-checked locking?

When implementing double-checked locking for initialization, what is the proper way to handle the memory and/or compiler barriers?
Something like std::call_once isn't what I want; it's way too slow. It's typically just implemented on top of pthread_mutex_lock or EnterCriticalSection, depending on the OS.
In my programs, I often run into initialization cases where the initialization is safe to repeat, as long as exactly one thread gets to set the final pointer. If another thread beats it to setting the final pointer to the singleton object, it deletes what it created and uses the other thread's instead. I also often use this in cases where it doesn't matter which thread "wins", because they all come up with the same result.
Here's an unsafe, overly-contrived example, using Visual C++ intrinsics:
MyClass *GetGlobalMyClass()
{
    static MyClass *const UNSET_POINTER = reinterpret_cast<MyClass *>(
        static_cast<intptr_t>(-1));
    static MyClass *volatile s_object = UNSET_POINTER;
    if (s_object == UNSET_POINTER)
    {
        MyClass *newObject = MyClass::Create();
        if (_InterlockedCompareExchangePointer(&s_object, newObject,
                                               UNSET_POINTER) != UNSET_POINTER)
        {
            // Another thread beat us. If Create didn't return null, destroy.
            if (newObject)
            {
                newObject->Destroy(); // calls "delete this;", presumably
            }
        }
    }
    return s_object;
}
On a weakly-ordered memory architecture, my understanding is that it's possible that the new value of s_object is visible to other threads before other variables written inside MyClass::Create or MyClass::MyClass are visible. Also, the compiler itself could arrange the code this way in the absence of a compiler barrier (in Visual C++, _WriteBarrier, but _InterlockedCompareExchange acts as a barrier).
Do I need a store-fence intrinsic or something similar in there to ensure that MyClass's variables are visible to all threads before s_object becomes something besides -1?
Fortunately, the rules in C++ are very simple:
If there is a data race, the behaviour is undefined.
In your code, the data race is caused by the following read, which conflicts with the write operation in _InterlockedCompareExchangePointer:
if (s_object == UNSET_POINTER)
A thread-safe solution without blocking might look as follows. Note that on x86 a load operation with sequential consistency has basically no overhead compared to a regular load operation. If you care about other architectures, you can also use acquire/release semantics instead of sequential consistency.
static std::atomic<MyClass*> s_object{nullptr};

MyClass* GetGlobalMyClass()
{
    MyClass* o = s_object.load(std::memory_order_seq_cst);
    if (o == nullptr) {
        o = new MyClass{...};
        MyClass* expected = nullptr;
        if (!s_object.compare_exchange_strong(expected, o,
                                              std::memory_order_seq_cst)) {
            delete o;       // another thread won the race; use its object
            o = expected;
        }
    }
    return o;
}
In a conforming C++11 implementation, any function-local static variable is constructed in a thread-safe fashion by the first thread that passes through its declaration.
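For completeness, under that guarantee the whole function shrinks to something like this (assuming MyClass is default-constructible):

MyClass& GetGlobalMyClass()
{
    static MyClass instance;   // constructed exactly once, thread-safely
    return instance;
}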

Do I need to use volatile keyword if I declare a variable between mutexes and return it?

Let's say I have the following function.
std::mutex mutex;

int getNumber()
{
    mutex.lock();
    int size = someVector.size();
    mutex.unlock();
    return size;
}
Is this a place to use the volatile keyword when declaring size? Will return value optimization or something else break this code if I don't use volatile? The size of someVector can be changed from any of the numerous threads the program has, and it is assumed that only one thread (other than the modifiers) calls getNumber().
No. But beware that the returned size may not reflect the actual size AFTER the mutex is released.
Edit: If you need to do some work that relies on the size being correct, you will need to wrap that whole task with the mutex.
You haven't mentioned what the type of the mutex variable is, but assuming it is an std::mutex (or something similar meant to guarantee mutual exclusion), the compiler is prevented from performing a lot of optimizations. So you don't need to worry about return value optimization or some other optimization causing the size() query to be performed outside of the mutex block.
However, as soon as the mutex lock is released, another waiting thread is free to access the vector and possibly mutate it, thus changing the size. Now, the number returned by your function is outdated. As Mats Petersson mentions in his answer, if this is an issue, then the mutex lock needs to be acquired by the caller of getNumber(), and held until the caller is done using the result. This will ensure that the vector's size does not change during the operation.
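For example, a caller that needs the size to remain accurate would hold the lock itself around the whole task (a sketch; useResult is a hypothetical stand-in for whatever the caller does with the value):

{
    std::lock_guard<std::mutex> l(mutex);
    int size = someVector.size();
    useResult(size);   // someVector cannot change size while the lock is held
}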
Explicitly calling mutex::lock followed by mutex::unlock quickly becomes infeasible for more complicated functions involving exceptions, multiple return statements, etc. A much easier alternative is to use std::lock_guard to acquire the mutex lock.
int getNumber()
{
    std::lock_guard<std::mutex> l(mutex); // lock is acquired
    int size = someVector.size();
    return size;
} // lock is released automatically when l goes out of scope
Volatile is a keyword that you use to tell the compiler to literally perform every read and write of the variable, without optimizing any of them away. Here is an example:
int example_function() {
    int a;
    volatile int b;
    a = 1; // this is ignored because nothing reads it before it is assigned again
    a = 2; // same here
    a = 3; // this is the last one, so a write takes place
    b = 1; // b gets written here, because b is volatile
    b = 2; // and again
    b = 3; // and again
    return a + b;
}
What is the real use of this? I've seen it in delay functions (keeping the CPU busy for a bit by making it count up to a number) and in systems where several threads might look at the same variable. It can sometimes help a bit with multi-threaded code, but it isn't really a threading mechanism and is certainly not a silver bullet.
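To illustrate the delay-function use mentioned above, a sketch (the name and the iteration count are arbitrary):

void crude_delay()
{
    // Without volatile, the compiler could delete this loop entirely,
    // since i is never otherwise used.
    for (volatile int i = 0; i < 1000000; ++i)
        ;
}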

OpenMP: Causes for heap corruption, anyone?

EDIT: I can run the same program twice, simultaneously without any problem - how can I duplicate this with OpenMP or with some other method?
This is the basic framework of the problem.
//Defined elsewhere
class SomeClass
{
public:
    void Function()
    {
        // Allocate some memory
        float *Data;
        Data = new float[1024];
        // Declare a struct which will be used by functions defined in the DLL
        SomeStruct Obj;
        Obj = MemAllocFunctionInDLL(Obj);
        // Call it
        FunctionDefinedInDLL(Data, Obj);
        // Clean up
        MemDeallocFunctionInDLL(Obj);
        delete [] Data;
    }
};

void Bar()
{
    #pragma omp parallel for
    for(int j = 0; j < 10; ++j)
    {
        SomeClass X;
        X.Function();
    }
}
I've verified that the _CrtIsValidHeapPointer() assertion fails when memory is deallocated through MemDeallocFunctionInDLL().
Is this because both threads are writing to the same memory?
So to fix this, I thought I'd make SomeClass private (this is totally alien to me, so any help is appreciated).
void Bar()
{
    SomeClass X;
    #pragma omp parallel for default(shared) private(X)
    for(int j = 0; j < 10; ++j)
    {
        X.Function();
    }
}
And now it fails when it tries to allocate memory in the beginning for Data.
Note: I can make changes to the DLL if required
Note: It runs perfectly without #pragma omp parallel for
EDIT: Now Bar looks like this:
void Bar()
{
    int j;
    #pragma omp parallel for default(none) private(j)
    for(j = 0; j < 10; ++j)
    {
        SomeClass X;
        X.Function();
    }
}
Still no luck.
Check whether MemAllocFunctionInDLL, FunctionDefinedInDLL and MemDeallocFunctionInDLL are thread-safe, or re-entrant. In other words, do these functions use static or shared variables? If so, you need to make sure these variables are not corrupted by other threads.
The fact that it is fine without the omp for could mean you didn't correctly write some functions to be thread-safe.
I'd like to see what kind of memory allocation/free functions have been used in Mem(Alloc|Dealloc)FunctionInDLL.
Added: I'm pretty sure the functions in your DLL are not thread-safe. Yes, you can run this program concurrently as two processes without problem; that should be okay unless your program uses system-wide shared resources (such as shared memory among processes), which is very rare. In that case there are no variables shared between the two runs, so your program works fine.
But invoking these functions from multiple threads, that is, within a single process, crashes your program. It means there are some variables shared among the threads, and they could have been corrupted.
It's not a problem of OpenMP, but just a multithreading bug. It could be simple to solve. Please take a look at the DLL functions to check whether they are safe to be called concurrently by many threads.
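In sketch form, the kind of bug to look for might be the following (purely hypothetical; the real DLL's internals are unknown, and the buffer member is invented here):

struct SomeStruct { float *buffer; };

// NOT thread-safe: state lives in a static variable shared by all callers.
SomeStruct MemAllocFunctionInDLL(SomeStruct obj)
{
    static float *scratch = NULL;   // one copy for every thread!
    scratch = new float[256];       // thread A's pointer can be overwritten
    obj.buffer = scratch;           // by thread B between these two lines
    return obj;
}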
How to privatize static variables
Say that we have global variables like these:
static int g_data;
static int* g_vector = new int[100];
Privatization is nothing but creating a private copy for each thread:
int g_data[num_threads];
int* g_vector[num_threads];
for (int i = 0; i < num_threads; ++i)
    g_vector[i] = new int[100];
And then any references to such variables become:
// Thread: tid
g_data[tid] = ...
.. = g_vector[tid][..]
Yes, it's pretty simple. However, this sort of code may have a false sharing problem. But false sharing is a matter of performance, not correctness.
First, just try to privatize all static and global variables. Then, check its correctness. Next, see what speedup you get. If the speedup is scalable (say, 3.7x faster on a quad core), then it's okay. But in case of a low speedup (such as 2x on a quad core), you should probably look at false sharing. To solve a false sharing problem, all you need to do is put some padding into the data structures, as sketched below.
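In sketch form, the padding might look like this (alignas(64) assumes a 64-byte cache line, and num_threads must be a compile-time constant here):

struct alignas(64) PaddedInt
{
    int value;   // each element now occupies its own cache line
};

static PaddedInt g_data[num_threads];   // use g_data[tid].value instead of g_data[tid]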
Instead of
delete Data
you must write
delete [] Data;
Wherever you do new [], make sure to use delete [].
It looks like your problem is not specific to OpenMP. Did you try to run your application without the #pragma omp parallel for?
default(shared) means all variables are shared between threads, which is not what you want. Change that to default(none).
Private(X) will make a copy of X for each thread; however, none of them will be initialised, so any construction will not necessarily be performed.
I think you'd be better off with your initial approach: put a breakpoint in the Dealloc call and see what the memory pointer is and what it contains. You can check the guard bytes to tell whether the memory has been overwritten at the end of a single call, or after a thread.
Incidentally, I am assuming this works if you run it once, without the omp loop?