C++: __sync_synchronize() still needed with std::atomic?

I've been running into an infrequent but recurring race condition.
The program has two threads and uses std::atomic. I'll simplify the critical parts of the code to look like:
std::atomic<uint64_t> b; // flag, initialized to 0
uint64_t data[100]; // shared data, initialized to 0
thread 1 (publishing):
// set various shared variables here, for example
data[5] = 10;
uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
if (b.load() != 0) { // signal that data is ready
    // read various shared variables here, for example:
    uint64_t x = data[5];
    // race condition sometimes (x sometimes not consistent)
}
The odd thing is that when I add __sync_synchronize() to each thread, then the race condition goes away. I've seen this happen on two different servers.
I.e., when I change the code to look like the following, the problem goes away:
thread 1 (publishing):
// set various shared variables here, for example
data[5] = 10;
__sync_synchronize();
uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
if (b.load() != 0) { // signal that data is ready
    __sync_synchronize();
    // read various shared variables here, for example:
    uint64_t x = data[5];
}
Why is __sync_synchronize() necessary? It seems redundant as I thought both exchange and load ensured the correct sequential ordering of logic.
Architecture: x86_64 processors, Linux, g++ 4.6.2

Whilst it is impossible to say from your simplified code what is actually going on in your application, the fact that __sync_synchronize helps, and the fact that this function is a memory barrier, tells me that you are writing things in one thread that the other thread is reading, in a way that isn't atomic.
An example:
thread_1:
    object *p = new object;
    p->x = 1;
    b.exchange(p); /* give pointer p to the other thread (here b would be a std::atomic<object*>) */
thread_2:
    object *p = b.load();
    if (p->x == 1) do_stuff();
    else error("Huh?");
This may very well trigger the error path in thread_2, because the write to p->x has not actually been completed when thread 2 reads the new pointer value p.
Adding a memory barrier, in this case in the thread_1 code, should fix this. Note that for THIS case a memory barrier in thread_2 will not do anything - it may alter the timing and appear to fix the problem, but it won't be the right thing. You may still need memory barriers on both sides if you are reading and writing memory that is shared between the two threads.
I understand that this may not be precisely what your code is doing, but the concept is the same - __sync_synchronize ensures that memory reads and memory writes have completed for ALL of the instructions before that function call [which isn't a real function call; it inlines a single instruction that waits for any pending memory operations to complete].
Noteworthy is that operations on std::atomic ONLY affect the actual data stored in the atomic object. Not reads/writes of other data.
Sometimes you also need a "compiler barrier" to avoid the compiler moving stuff from one side of an operation to another:
std::atomic<bool> flag(false);
value = 42;
flag.store(true);
....
another thread:
while(!flag.load());
print(value);
Now, there is a chance that the compiler generates the first thread's two statements as:
flag.store(true);
value = 42;
Now, that wouldn't be good, would it? std::atomic is guaranteed to act as a "compiler barrier", but in other cases the compiler may well shuffle things around in a similar way.
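For completeness, here is a minimal sketch of the usual std::atomic-only way to write the original publish/consume handshake, with the release/acquire orderings made explicit (the defaulted seq_cst orderings are at least as strong); this assumes a compiler with a complete C++11 atomics implementation, which may or may not hold for the g++ 4.6.2 mentioned in the question:
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> b{0};  // flag, initialized to 0
uint64_t data[100] = {};     // shared data, initialized to 0

// thread 1 (publishing)
void publish()
{
    data[5] = 10;                              // plain (non-atomic) write
    b.exchange(1, std::memory_order_release);  // release: all earlier writes become visible with the flag
}

// thread 2 (receiving)
void receive()
{
    if (b.load(std::memory_order_acquire) != 0) { // acquire: pairs with the release above
        uint64_t x = data[5];                      // sees 10 whenever the load saw the flag set
        (void)x;
    }
}
This is the same happens-before mechanism described in the acquire/release discussion further down.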

Related

Is using volatile on shared memory safe?

Let's suppose the following:
I have two processes on Linux / Mac OS.
I have memory obtained with mmap on shared memory (or backed by a file).
Then in both processes I have following:
struct Data{
    volatile int reload = 0; // using int because it is more standard
    // more things in the future...
};
void *mmap_memory = mmap(...);
Data *data = static_cast<Data *>(mmap_memory); // suppose the size is sufficient and all is OK
Then in one of the processes I do:
//...
data->reload = 1;
//...
And in the other I do:
while(...){
    do_some_work();
    //...
    if (data->reload == 1)
        do_reload();
}
Will this be thread / inter process safe?
Idea is from here:
https://embeddedartistry.com/blog/2019/03/11/improve-volatile-usage-with-volatile_load-and-volatile_store/
Note:
This cannot be done with std::atomic<>, since it does not "promise" anything about shared memory. Also, constructing/destructing it from two different processes is not well defined at all.
Will this be thread / inter process safe?
No.
From your own link:
One problematic and common assumption is that volatile is equivalent to “atomic”. This is not the case. All the volatile keyword denotes is that the variable may be modified externally, and thus reads/writes cannot be optimized.
Your code needs atomic access to the value. if (data->reload == 1) won't work if it reads some partial/intermediate value from data->reload.
And never mind what happens if multiple threads do read 1 from data->reload - your posted code doesn't handle that at all.
Also see Why is volatile not considered useful in multithreaded C or C++ programming?
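If atomic access is what's needed, one commonly seen sketch is to place a lock-free std::atomic<int> directly in the mapped memory. Whether this is formally guaranteed to work across processes is exactly the gray area the question's note raises, so treat this as an implementation-specific pattern (Linux/x86 style), not something the standard spells out:
#include <atomic>
#include <new>
#include <sys/mman.h>

struct Data {
    std::atomic<int> reload{0};  // assumed lock-free (check ATOMIC_INT_LOCK_FREE / is_always_lock_free)
    // more things in the future...
};

Data *map_shared_data()
{
    // Error checking omitted. MAP_SHARED | MAP_ANONYMOUS is only shared with forked
    // children; unrelated processes would mmap the same shm_open/file object instead.
    void *mem = mmap(nullptr, sizeof(Data), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return new (mem) Data{};  // construct exactly once, in one process only
}

// writer: data->reload.store(1, std::memory_order_release);
// reader: if (data->reload.exchange(0, std::memory_order_acquire) == 1) do_reload();
Using exchange(0) rather than a plain load also consumes the flag, which addresses the "what if several readers see 1" point above.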

How does atomic seq_cst memory order actually work?

For example, there are shared variables.
int val;
Cls obj;
An atomic bool variable acts as a data indicator.
std::atomic_bool flag = false;
Thread 1 only sets these variables.
while (flag == true) { /* Sleep */ }
val = ...;
obj = ...;
flag = true; /* Set flag to true after setting shared variables. */
Thread 2 only gets these variables.
while (flag != true) { /* Sleep */ }
int local_val = val;
Cls local_obj = obj;
flag = false; /* Set flag to false after using shared variables. */
My questions are:
For std::memory_order_seq_cst, which is the default for std::atomic_bool, is it safe to set or get the shared variables after while (...) {}?
Is using bool instead of std::atomic_bool correct or not?
Yes, the code is fine, and as ALX23z says, it would still be fine if all the loads of flag were std::memory_order_acquire and all the stores were std::memory_order_release. The extra semantics that std::memory_order_seq_cst provides are only relevant to observing the ordering between loads and stores to two or more different atomic variables. When your whole program only has one atomic variable, it has no useful effect.
Roughly, the idea is that acquire/release suffice to ensure that the accesses to your non-atomic variables really do happen in between the load/store of the atomic flag, and do not "leak out". More formally, you could use the C++11 memory model axioms to prove that given any read of the objects in thread 1, and any write of the objects in thread 2, one of them happens before the other. The details are a little tedious, but the key idea is that an acquire load and a release store is exactly what you need to get synchronization, which is how you get a happens-before relation between operations in two different threads.
No, if you replace the atomic_bool with a plain bool then your code has undefined behavior. You would then be reading and writing all your variables in different threads, with no mutexes or atomic variables that could possibly create synchronization, and that is the definition of a data race.
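As an illustration of that answer, here is the same handshake written with the explicit acquire/release orderings it says are sufficient (a minimal sketch: Cls/obj omitted for brevity, and the value 42 is arbitrary):
#include <atomic>

int val;
std::atomic<bool> flag{false};

void thread1() // setter
{
    while (flag.load(std::memory_order_acquire)) { /* Sleep */ }
    val = 42;                                     // plain write, ordered before the store below
    flag.store(true, std::memory_order_release);  // release: publishes val
}

void thread2() // getter
{
    while (!flag.load(std::memory_order_acquire)) { /* Sleep */ }
    int local_val = val;                          // happens-after thread1's write of val
    flag.store(false, std::memory_order_release); // release: hands the slot back to thread1
    (void)local_val;
}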

c++ atomic: would function call act as memory barrier?

I'm reading the article Memory Ordering at Compile Time, which says:
In fact, the majority of function calls act as compiler barriers,
whether they contain their own compiler barrier or not. This excludes inline functions, functions declared with the pure attribute, and cases where link-time code generation is used. Other than those cases, a call to an external function is even stronger than a compiler barrier, since the compiler has no idea what the function's side effects will be.
Is this a true statement? Think about this sample -
std::atomic_bool flag = false;
int value = 0;
void th1 () { // running in thread 1
    value = 1;
    // use atomic & release to prevent the above statement being reordered below
    flag.store(true, std::memory_order_release);
}
void th2 () { // running in thread 2
    // use atomic & acquire to prevent assert(..) being reordered above
    while (!flag.load(std::memory_order_acquire)) {}
    assert (value == 1); // should never fail!
}
Then we can remove the atomic and replace it with a function call:
bool flag = false;
int value = 0;
void writeflag () {
    flag = true;
}
void readflag () {
    while (!flag) {}
}
void th1 () {
    value = 1;
    writeflag(); // would function call prevent reordering?
}
void th2 () {
    readflag(); // would function call prevent reordering?
    assert (value == 1); // would this fail???
}
Any idea?
A compiler barrier is not the same thing as a memory barrier. A compiler barrier prevents the compiler from moving code across the barrier. A memory barrier (loosely speaking) prevents the hardware from moving reads and writes across the barrier. For atomics you need both, and you also need to ensure that values don't get torn when read or written.
Formally, no, if only because Link-Time Code Generation is a valid implementation choice and need not be optional.
There's also a second oversight, and that's escape analysis. The claim is that "the compiler has no idea what the function's side effects will be", but if no pointers to my local variables escape from my function, then the compiler does know for sure that no other function changes them.
In the second example, even if we assume no reordering of any kind, the behavior is undefined.
The writes to and reads from the variable flag are not atomic, and there is a race condition1. Having no reordering doesn't guarantee that both threads don't access the variable flag at the same time. This happens when one thread hits the while loop in the function readflag and reads flag, and the other thread writes to flag in writeflag.
1 (Quoted from: ISO/IEC 14882:2011(E) 1.10 Multi-threaded executions and data races 21)
The execution of a program contains a data race if it contains two conflicting actions in different threads,
at least one of which is not atomic, and neither happens before the other. Any such data race results in
undefined behavior.
You are confusing a memory barrier, used for inter-thread memory visibility, with a compiler barrier, which isn't a threading device, just a device (or trick) to prevent reordering of side effects by the compiler.
You need a memory barrier for your threading example.
You can use a compiler barrier to ensure that memory side effects are performed in a given order (on the local CPU) for other purposes, like benchmarking, getting around a type aliasing violation, integrating assembly code, or signal handling (for a signal only handled in that same thread).
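To make the distinction concrete, here is a small sketch contrasting the two (the asm statement is the GCC/Clang compiler-barrier idiom; its placement here is purely illustrative):
#include <atomic>
#include <cassert>

int value = 0;
std::atomic<bool> flag{false};  // atomic, so accessing flag from two threads is not a data race

void th1()
{
    value = 1;
    asm volatile("" ::: "memory");  // compiler barrier only: stops compiler reordering,
                                    // says nothing to the CPU or to the other thread
    flag.store(true, std::memory_order_release);  // this is what actually orders 'value' for th2
}

void th2()
{
    while (!flag.load(std::memory_order_acquire)) {}
    assert(value == 1);  // holds because of the release/acquire pair, not the asm barrier
}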

Do I need to use volatile keyword if I declare a variable between mutexes and return it?

Let's say I have the following function.
std::mutex mutex;
int getNumber()
{
    mutex.lock();
    int size = someVector.size();
    mutex.unlock();
    return size;
}
Is this a place to use the volatile keyword while declaring size? Will return value optimization or something else break this code if I don't use volatile? The size of someVector can be changed from any of the numerous threads the program has, and it is assumed that only one thread (other than the modifiers) calls getNumber().
No. But beware that the size may not reflect the actual size AFTER the mutex is released.
Edit: If you need to do some work that relies on size being correct, you will need to wrap that whole task with a mutex.
You haven't mentioned what the type of the mutex variable is, but assuming it is an std::mutex (or something similar meant to guarantee mutual exclusion), the compiler is prevented from performing a lot of optimizations. So you don't need to worry about return value optimization or some other optimization allowing the size() query to be performed outside of the mutex block.
However, as soon as the mutex lock is released, another waiting thread is free to access the vector and possibly mutate it, thus changing the size. Now, the number returned by your function is outdated. As Mats Petersson mentions in his answer, if this is an issue, then the mutex lock needs to be acquired by the caller of getNumber(), and held until the caller is done using the result. This will ensure that the vector's size does not change during the operation.
Explicitly calling mutex::lock followed by mutex::unlock quickly becomes unfeasible for more complicated functions involving exceptions, multiple return statements etc. A much easier alternative is to use std::lock_guard to acquire the mutex lock.
int getNumber()
{
    std::lock_guard<std::mutex> l(mutex); // lock is acquired
    int size = someVector.size();
    return size;
} // lock is released automatically when l goes out of scope
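And, as noted above, if the caller needs the size to stay accurate while it uses it, the lock has to be held by the caller for the whole task; a minimal sketch (the vector's element type is assumed here for illustration):
#include <mutex>
#include <vector>

std::mutex mutex;
std::vector<int> someVector;  // element type assumed

void useSize()
{
    std::lock_guard<std::mutex> l(mutex);  // lock held for the whole task
    std::size_t size = someVector.size();
    // ... do the work that relies on 'size' while still holding the lock ...
    (void)size;
}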
Volatile is a keyword that you use to tell the compiler to literally perform every read and write of the variable and not to optimize those accesses away. Here is an example:
int example_function() {
    int a;
    volatile int b;
    a = 1; // this is ignored because nothing reads it before it is assigned again
    a = 2; // same here
    a = 3; // this is the last one, so a write takes place
    b = 1; // b gets written here, because b is volatile
    b = 2; // and again
    b = 3; // and again
    return a + b;
}
What is the real use of this? I've seen it in delay functions (keep the CPU busy for a bit by making it count up to a number) and in systems where several threads might look at the same variable. It can sometimes help a bit with multi-threaded things, but it isn't really a threading tool and is certainly not a silver bullet.
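A tiny sketch of the delay-loop use mentioned above (function name hypothetical): without volatile, a compiler is allowed to notice that the loop has no observable effect and remove it entirely.
void delay(unsigned long iterations)
{
    volatile unsigned long count = iterations;  // volatile: every read and write must really happen
    while (count-- > 0) {
        // spin; the loop cannot be optimized away
    }
}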

Can I switch the test and modification part in wait/signal semaphore?

The classic non-busy-waiting version of the wait() and signal() semaphore primitives is implemented as below. In this version, value can be negative.
//primitive
wait(semaphore* S)
{
    S->value--;
    if (S->value < 0)
    {
        add this process to S->list;
        block();
    }
}
//primitive
signal(semaphore* S)
{
    S->value++;
    if (S->value <= 0)
    {
        remove a process P from S->list;
        wakeup(P);
    }
}
Question: Is the following version also correct? Here I test first and then modify the value. It would be great if you could show me a scenario where it doesn't work.
//primitive wait().
//If (S->value > 0), the whole function is atomic
//otherwise, only the if(){} section is atomic
wait(semaphore* S)
{
    if (S->value <= 0)
    {
        add this process to S->list;
        block();
    }
    // here I decrement the value after the previous test and possible blocking
    S->value--;
}
//similar to wait()
signal(semaphore* S)
{
    if (S->list is not empty)
    {
        remove a process P from S->list;
        wakeup(P);
    }
    // here I increment the value after the previous test and possible waking up
    S->value++;
}
Edit:
My motivation is to figure out whether I can use this latter version to achieve mutual exclusion, with no deadlock and no starvation.
Your modified version introduces a race condition:
Thread A: if (S->value <= 0) // value = 1
Thread B: if (S->value <= 0) // value = 1
Thread A: S->value--; // value = 0
Thread B: S->value--; // value = -1
Both threads have acquired a count=1 semaphore. Oops. Note that there's another problem even if they're non-preemptible (see below), but for completeness, here's a discussion on atomicity and how real locking protocols work.
When working with protocols like this, it's very important to nail down exactly what atomic primitives you are using. Atomic primitives are such that they seem to execute instantaneously, without being interleaved with any other operations. You cannot just take a big function and call it atomic; you have to make it atomic somehow, using other atomic primitives.
Most CPUs offer a primitive called 'atomic compare and exchange'. I'll abbreviate it cmpxchg from here on. The semantics are like so:
bool cmpxchg(long *ptr, long old, long new_value) {
    if (*ptr == old) {
        *ptr = new_value;
        return true;
    } else {
        return false;
    }
}
cmpxchg is not implemented with this code; it is implemented in the CPU hardware, but behaves a bit like this, only atomically.
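To get a feel for how it is used, here is a minimal sketch built only on the cmpxchg above: a lock-free increment that loops until its speculative read wins.
void atomic_increment(long *ptr)
{
    while (1) {
        long old = *ptr;                 // speculative read
        if (cmpxchg(ptr, old, old + 1))  // ATOMIC: succeeds only if nobody changed *ptr in between
            return;
        // *ptr changed under us; re-read and retry
    }
}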
Now, let's add to this some additional helpful functions (built out of other primitives):
add_waitqueue(waitqueue) - Sets our process state to sleeping and adds us to a wait queue, but continues executing (ATOMIC)
schedule() - Switch threads. If we're in a sleeping state, we don't run again until awakened (BLOCKING)
remove_waitqueue(waitqueue) - removes our process from a wait queue, then sets our state to awakened if it isn't already (ATOMIC)
memory_barrier() - ensures that any reads/writes logically before this point actually are performed before this point, avoiding nasty memory ordering issues (we'll assume all other atomic primitives come with a free memory barrier, although this isn't always true) (CPU/COMPILER PRIMITIVE)
Here's how a typical semaphore acquisition routine will look. It's a bit more complex than your example, because I've explicitly nailed down what atomic operations I'm using:
void sem_down(sem *pSem)
{
    while (1) {
        long spec_count = pSem->count;
        read_memory_barrier(); // make sure spec_count doesn't start changing on us! pSem->count may keep changing though
        if (spec_count > 0)
        {
            if (cmpxchg(&pSem->count, spec_count, spec_count - 1)) // ATOMIC
                return; // got the semaphore without blocking
            else
                continue; // count is stale, try again
        } else { // semaphore count is zero
            add_waitqueue(pSem->wqueue); // ATOMIC
            // recheck the semaphore count, now that we're in the waitqueue - it may have changed
            if (pSem->count == 0) schedule(); // NOT ATOMIC
            remove_waitqueue(pSem->wqueue); // ATOMIC
            // loop around again to try to acquire the semaphore
        }
    }
}
You'll note that the actual test for a non-zero pSem->count, in a real-world semaphore_down function, is accomplished by cmpxchg. You can't trust any other read; the value can change an instant after you read the value. We simply can't separate the value check and the value modification.
The spec_count here is speculative. This is important. I'm essentially making a guess at what the count will be. It's a pretty good guess, but it's a guess. cmpxchg will fail if my guess is wrong, at which point the routine has to loop and try again. If I guess 0, then I will either be woken up (as it ceases to be zero while I'm on the waitqueue), or I will notice it's not zero anymore in the schedule test.
You should also note that there is no possible way to make a function that contains a blocking operation atomic. It's nonsensical. Atomic functions, by definition, appear to execute instantaneously, not interleaved with anything else whatsoever. But a blocking function, by definition, waits for something else to happen. This is inconsistent. Likewise, no atomic operation can be 'split up' across a blocking operation, which it is in your example.
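For symmetry, here is a sketch of the matching sem_up; the wake_waitqueue primitive is hypothetical (it is not in the list above) and is assumed to atomically wake one sleeper, if any.
void sem_up(sem *pSem)
{
    while (1) {
        long spec_count = pSem->count;  // speculative read, same idea as in sem_down
        if (cmpxchg(&pSem->count, spec_count, spec_count + 1)) { // ATOMIC
            wake_waitqueue(pSem->wqueue); // hypothetical ATOMIC primitive: wake one sleeper if any
            return;
        }
        // count was stale; try again
    }
}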
Now, you could do away with a lot of this complexity by declaring the function non-preemptable. By using locks or other methods, you simply ensure only one thread is ever running (not including blocking of course) in the semaphore code at a time. But a problem still remains then. Start with a value of 0, where C has taken the semaphore down twice, then:
Thread A: if (S->value <= 0) // value = 0
Thread A: Block....
Thread B: if (S->value <= 0) // value = 0
Thread B: Block....
Thread C: S->value++ // value = 1
Thread C: Wakeup(A)
(Thread C calls signal() again)
Thread C: S->value++ // value = 2
Thread C: Wakeup(B)
(Thread C calls wait())
Thread C: if (S->value <= 0) // value = 2
Thread C: S->value-- // value = 1
// A and B have been woken
Thread A: S->value-- // value = 0
Thread B: S->value-- // value = -1
You could probably fix this with a loop to recheck S->value - again, assuming you are on a single-processor machine and your semaphore code is non-preemptable. Unfortunately, these assumptions are false on all desktop OSes :)
For more discussion on how real locking protocols work, you might be interested in the paper "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux"