How does atomic seq_cst memory order actually work? - c++

For example, suppose there are some shared variables:
int val;
Cls obj;
An atomic bool variable acts as a data indicator.
std::atomic_bool flag = false;
Thread 1 only sets these variables:
while (flag == true) { /* Sleep */ }
val = ...;
obj = ...;
flag = true; /* Set flag to true after setting shared variables. */
Thread 2 only gets these variables:
while (flag != true) { /* Sleep */ }
int local_val = val;
Cls local_obj = obj;
flag = false; /* Set flag to false after using shared variables. */
My questions are:
For std::memory_order_seq_cst, which is the default for std::atomic_bool, is it safe to set or get the shared variables after the while (...) {} loops?
Is using a plain bool instead of std::atomic_bool correct or not?

Yes, the code is fine, and as ALX23z says, it would still be fine if all the loads of flag were std::memory_order_acquire and all the stores were std::memory_order_release. The extra semantics that std::memory_order_seq_cst provides are only relevant to observing the ordering between loads and stores to two or more different atomic variables. When your whole program only has one atomic variable, it has no useful effect.
Roughly, the idea is that acquire/release suffice to ensure that the accesses to your non-atomic variables really do happen in between the load/store of the atomic flag, and do not "leak out". More formally, you could use the C++11 memory model axioms to prove that given any read of the objects in thread 1, and any write of the objects in thread 2, one of them happens before the other. The details are a little tedious, but the key idea is that an acquire load and a release store is exactly what you need to get synchronization, which is how you get a happens-before relation between operations in two different threads.
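To make that concrete, here is a minimal sketch of the acquire/release variant described above (Cls and the stored values are placeholders, not from the question):

#include <atomic>

struct Cls { int x = 0; };  // stand-in for the question's Cls

int val;
Cls obj;
std::atomic<bool> flag{false};

void thread1() {  // writer
    while (flag.load(std::memory_order_acquire)) { /* sleep */ }
    val = 42;     // plain writes to the shared data...
    obj.x = 7;
    flag.store(true, std::memory_order_release);   // ...are published by this release store
}

void thread2() {  // reader
    while (!flag.load(std::memory_order_acquire)) { /* sleep */ }
    int local_val = val;   // safe: the acquire load synchronizes with the release store
    Cls local_obj = obj;
    (void)local_val; (void)local_obj;
    flag.store(false, std::memory_order_release);  // hand the flag back to thread 1
}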
No, if you replace the atomic_bool with a plain bool then your code has undefined behavior. You would then be reading and writing all your variables in different threads, with no mutexes or atomic variables that could possibly create synchronization, and that is the definition of a data race.

Related

do sequentially-consistent atomic loads (load-load pair) form an inter-thread synchronisation point?

I am trying to understand what sequentially-consistent ordering means for loads. Consider this artificial example:
#include <atomic>
#include <thread>
#include <cassert>
static std::atomic<bool> preStop {false};
static std::atomic<bool> stop {false};
static std::atomic<int> counter{0};
void waiter() {
    preStop.store(true, std::memory_order_relaxed);
    while (counter.load() > 0);
    stop.store(true, std::memory_order_relaxed);
}

void performer() {
    while (true) {
        counter.fetch_add(1);
        const bool visiblePreStop = preStop.load();
        if (stop.load()) {
            assert(visiblePreStop);
            return;
        }
        counter.fetch_sub(1);
        std::this_thread::yield();
    }
}

int main() {
    std::thread performerThread(performer);
    std::thread waiterThread(waiter);
    waiterThread.join();
    performerThread.join();
}
Can the assert fail? Or does counter.fetch_add() synchronise with counter.load()?
It is my understanding that had the operations on counter used std::memory_order_relaxed or std::memory_order_acq_rel, the load-load pair would not create a synchronisation point. Does std::memory_order_seq_cst make any difference for load-load pairs?
The assert can fail. The sequential consistency of counter.load() implies that it is acquire, but that does not prevent the relaxed preStop.store(true) from being reordered after it. Then preStop.store(true) may also be reordered after stop.store(true). If we have a weakly ordered machine with a store buffer, then nothing in waiter() ever drains the store buffer, so preStop.store() may languish there for an arbitrarily long time.
If so, then it is entirely possible that the code does something like
waiter
======
preStop.store(true, relaxed); // suppose not globally visible for a while
while (counter.load() > 0); // terminates immediately
stop.store(true, relaxed); // suppose globally visible right away
performer
=========
counter.fetch_add(1);
preStop.load(); // == false
stop.load(); // == true
I don't quite understand the rest of the question, though. Synchronization is established by a release store and an acquire load that reads the stored value (or another value later in the release sequence); when this occurs, it proves that the store happened before the load. Two loads cannot synchronize with each other, not even if they are sequentially consistent.
But counter.fetch_add(1) is a read-modify-write; it consists of a load and a store. Since it is seq_cst, the load is acquire and the store is release. And counter.load() is likewise acquire, so if it returns 1, it does synchronize with counter.fetch_add(1), proving that counter.fetch_add(1) happens before counter.load().
But that doesn't really help. The problem is that waiter doesn't do any release stores at all, so nothing in performer can synchronize with it. Therefore neither of the relaxed stores in waiter can be proved to happen before the corresponding loads in performer, and so we cannot be assured that either load will return true. In particular it is quite possible that preStop.load() returns false and stop.load() returns true.
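To illustrate the synchronizes-with rule just described, here is a minimal sketch (the variable names are made up for illustration):

#include <atomic>
#include <cassert>

std::atomic<bool> ready{false};
int payload = 0;

void writer() {
    payload = 42;                                  // A: plain write
    ready.store(true, std::memory_order_release);  // B: release store
}

void reader() {
    while (!ready.load(std::memory_order_acquire)) // C: acquire load that reads B's value
        ;
    assert(payload == 42);  // A happens before this read, so it cannot fail
}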
You have a problem where you're reading two different atomic variables but expect their state to be inter-dependent. This is similar to, but not the same as, time-of-check-to-time-of-use (TOCTOU) bugs.
Here's a valid interleaving where the assert fires:
PerformerThread (PT) created
WaiterThread (WT) created
PT executes the following:
while (true) {
counter.fetch_add(1);
const bool visiblePreStop = preStop.load();
PT sees that visiblePreStop is false
PT is suspended
WT executes the following:
preStop.store(true, std::memory_order_relaxed);
while (counter.load() > 0);
stop.store(true, std::memory_order_relaxed);
WT is suspended
PT executes the following:
if (stop.load()) {
PT sees that stop is true
PT hits the assert because visiblePreStop is false and stop is true.

Using compare and read/write operations for std::atomic<bool> in c++?

Assume there are two threads, threadA and threadB, and we're going to use a std::atomic<bool> in both of them. So now we have some critical sections as below:
My global variable (threads access it concurrently):
std::atomic<bool> flag;
threadA :
void *threadA(void *arg)
{
    bool ttt = true;
    if (flag == true) { // comparison operator ==
        // do something
    }
    // something to do
    flag = false; // assign a value
    ttt = flag;   // read operation
    return 0;
}
threadB :
void *threadB(void *arg)
{
    bool m = true;
    if (flag == true) // comparison operator ==
        flag = false;
    // something to do
    flag = true; // assign a value
    m = !flag;   // read operation
    return 0;
}
Anyway, I know that std::atomic<>, unlike ordinary data types, is race-free, but I want to be sure about these:
Will there be any trouble when using ==, assignment, and plain reads/writes instead of (for example) std::atomic_load or exchange operations?
Is it possible for any trouble to occur, such as memory problems, while reading or writing flag?
Is it absolutely safe on any platform with any CPU architecture, i.e. portable code? I ask because atomic<bool> may not be needed on some x86 architectures.
I just want to use the atomic feature instead of a mutex.
Will there be any trouble when using ==, assignment, and plain reads/writes instead of (for example) std::atomic_load or exchange operations?
When operator== is used with std::atomic<T> and T, it first invokes atomic<T>::operator T() to load the atomic value using the strongest memory ordering, std::memory_order_seq_cst. Next, operator==(T, T) is used. This sequence is not atomic, which means that by the time the comparison actually happens, the std::atomic<T> may have already changed.
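A hedged sketch of the difference (the function names are made up for illustration):

#include <atomic>

std::atomic<bool> flag{true};

void check_then_act() {  // NOT atomic as a whole
    if (flag == true) {  // seq_cst load, then a plain bool comparison;
        flag = false;    // flag may have changed before this store runs
    }
}

void atomic_check_and_act() {
    bool expected = true;
    // Succeeds (and sets flag to false) only if flag is still true,
    // all in a single atomic read-modify-write.
    if (flag.compare_exchange_strong(expected, false)) {
        // we observed true and claimed the flag
    }
}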
I just want to use the atomic feature instead of a mutex.
You can implement a spin-lock with an atomic using std::atomic::compare_exchange_weak (there is an example), but it won't be able to put the thread to sleep like std::mutex does.
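For illustration only, a minimal spin-lock sketch along those lines (a busy-waiting sketch, not a drop-in std::mutex replacement):

#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        // Try to flip locked from false to true; compare_exchange_weak
        // may fail spuriously, which is harmless inside a loop.
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
            expected = false;  // reset after a failed attempt
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};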

C++: __sync_synchronize() still needed with std::atomic?

I've been running into an infrequent but recurring race condition.
The program has two threads and uses std::atomic. I'll simplify the critical parts of the code to look like:
std::atomic<uint64_t> b; // flag, initialized to 0
uint64_t data[100]; // shared data, initialized to 0
thread 1 (publishing):
// set various shared variables here, for example
data[5] = 10;
uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
if (b.load() != 0) { // signal that data is ready
    // read various shared variables here, for example:
    uint64_t x = data[5];
    // race condition sometimes (x sometimes not consistent)
}
The odd thing is that when I add __sync_synchronize() to each thread, then the race condition goes away. I've seen this happen on two different servers.
i.e. when I change the code to look like the following, then the problem goes away:
thread 1 (publishing):
// set various shared variables here, for example
data[5] = 10;
__sync_synchronize();
uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
if (b.load() != 0) { // signal that data is ready
__sync_synchronize();
// read various shared variables here, for example:
uint64_t x = data[5];
}
Why is __sync_synchronize() necessary? It seems redundant as I thought both exchange and load ensured the correct sequential ordering of logic.
Architecture is x86_64 processors, linux, g++ 4.6.2
Whilst it is impossible to say from your simplified code what actually goes on in your actual application, the fact that __sync_synchronize helps, and the fact that this function is a memory barrier, tells me that you are writing things in one thread that the other thread is reading, in a way that isn't atomic.
An example:
thread_1:
object *p = new object;
p->x = 1;
b.exchange(p); /* give pointer p to other thread */
thread_2:
object *p = b.load();
if (p->x == 1) do_stuff();
else error("Huh?");
This may very well trigger the error-path in thread2, because the write to p->x has not actually been completed when thread 2 reads the new pointer value p.
Adding a memory barrier, in this case in the thread_1 code, should fix this. Note that for THIS case a memory barrier in thread_2 will not do anything - it may alter the timing and appear to fix the problem, but it won't be the right thing. You may still need memory barriers on both sides if you are reading and writing memory that is shared between two threads.
I understand that this may not be precisely what your code is doing, but the concept is the same - __sync_synchronize ensures that memory reads and memory writes have completed for ALL of the instructions before that function call [which isn't a real function call; it will inline a single instruction that waits for any pending memory operations to complete].
Noteworthy is that operations on std::atomic with std::memory_order_relaxed only affect the actual data stored in the atomic object, not reads or writes of other data; it is the stronger orderings (the default is std::memory_order_seq_cst) that also constrain the surrounding accesses.
Sometimes you also need a "compiler barrier" to avoid the compiler moving stuff from one side of an operation to another:
std::atomic<bool> flag(false);
value = 42;
flag.store(true);
....
another thread:
while(!flag.load());
print(value);
Now, if flag were not atomic (or the store used std::memory_order_relaxed), there would be a chance that the compiler generates the first form as:
flag.store(true);
value = 42;
That wouldn't be good, would it? A seq_cst store to a std::atomic is guaranteed to act as a "compiler barrier" for the accesses before it, but in other cases the compiler may well shuffle stuff around in a similar way.
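For reference, the portable C++11 spelling of such a full barrier is std::atomic_thread_fence. Here is a hedged sketch on the question's publishing pattern (note that, per the C++11 model, the default seq_cst exchange and load should already provide this ordering, so the explicit fences are belt-and-braces):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> b{0};
uint64_t data[100] = {};

void publisher() {
    data[5] = 10;
    std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier
    b.exchange(1);       // default seq_cst; already a release operation
}

void receiver() {
    if (b.load() != 0) { // default seq_cst; already an acquire operation
        std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier
        uint64_t x = data[5]; // must read 10 if the load above saw 1
        (void)x;
    }
}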

Shared vector Synchronization with an atomic boolean

I have a shared vector that is accessed by two threads.
A function in thread A pushes into the vector, and a function in thread B swaps the vector out completely for processing.
void MovetoVec(PInfo* pInfo)
{
    while (1)
    {
        if (GetSwitch())
        {
            swapBucket->push_back(pInfo);
            toggles = true;
            break;
        }
        else if (pInfo->tryMove == 5)
        {
            delete pInfo;
            break;
        }
        pInfo->tryMove++;
        Sleep(25);
    }
}
Thread A waits for the atomic boolean toggles to become true and then pushes into the vector (the above MovetoVec function will be called by many threads). The function GetSwitch is defined as
bool GetSwitch()
{
    if (toggles)
    {
        toggles = false;
        return TRUE;
    }
    else
        return FALSE;
}
toggles here is an atomic_bool. And the other function, from thread B, that swaps the vector is
void GetClassObj(vector<ProfiledInfo*>* toSwaps)
{
    if (GetSwitch())
    {
        toSwaps->swap(*swapBucket);
        toggles = true;
    }
}
If GetSwitch returns false then thread B does nothing. I didn't use any locking here. It works in most cases, but sometimes one of the pInfo objects in swapBucket is NULL. I learned that this is because of poor synchronization.
I followed this type of GetSwitch() logic just to avoid the overhead caused by locking. Should I drop it and go back to mutexes or critical sections?
Your GetSwitch implementation is wrong. It is possible for multiple threads to acquire the switch simultaneously.
An example of such a scenario with just two threads:
Thread 1 | Thread 2
--------------------------|--------------------------
if (toggles) |
| if (toggles)
toggles = false; |
| toggles = false;
The if-test and assignment are not an atomic operation and therefore cannot be used to synchronize threads on their own.
If you want to use an atomic boolean as a means of synchronization, you need to compare and exchange the value in one atomic operation. Luckily, C++ provides such an operation: std::atomic's compare_exchange, which is available in a weak and a strong flavor (the weak one may fail spuriously but is cheaper when called in a loop).
Using this operation, your GetSwitch method would become:
bool GetSwitch()
{
    bool expected = true;  // The value we expect 'toggles' to have
    bool desired = false;  // The value we want 'toggles' to get
    // Check if 'toggles' is as expected, and if it is, update it to the desired value
    bool result = toggles.compare_exchange_strong(expected, desired);
    // The result is true if the value was updated and false if it was not
    return result;
}
This will ensure that comparing and updating the value happens atomically.
Note that the C++ standard does not guarantee an atomic boolean to be lock-free. In your case, you could also use std::atomic_flag, which is guaranteed to be lock-free by the standard! Read its documentation carefully though; it works a tad differently than atomic variables, as the sketch below illustrates.
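A hedged sketch of what GetSwitch could look like with std::atomic_flag (note the inverted polarity: here a set flag means the switch is taken, whereas the question's toggles == true means it is available):

#include <atomic>

std::atomic_flag taken = ATOMIC_FLAG_INIT;

bool GetSwitch()
{
    // test_and_set atomically sets the flag and returns its previous
    // value; reading 'false' means this thread is the one that claimed it.
    return !taken.test_and_set(std::memory_order_acquire);
}

void ReleaseSwitch()
{
    taken.clear(std::memory_order_release);  // the equivalent of toggles = true
}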
Writing lock-free code, as you are attempting to do, is quite complex and error-prone.
My advice would be to write the code with locks first and ensure it is 100% correct. Mutexes are actually surprisingly fast, so performance should be okay in most cases. A good read on lock performance: http://preshing.com/20111118/locks-arent-slow-lock-contention-is
Only once you have profiled your code and convinced yourself that the locks are impacting performance should you attempt to write the code lock-free. Then profile again, because lock-free code is not necessarily faster.
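In that spirit, here is a minimal mutex-based sketch of the two operations (PInfo and swapBucket as in the question; the retry/tryMove logic is omitted):

#include <mutex>
#include <vector>

struct PInfo;  // as in the question

std::mutex bucketMutex;
std::vector<PInfo*>* swapBucket;  // as in the question

void MovetoVec(PInfo* pInfo)
{
    std::lock_guard<std::mutex> lock(bucketMutex);
    swapBucket->push_back(pInfo);
}

void GetClassObj(std::vector<PInfo*>* toSwaps)
{
    std::lock_guard<std::mutex> lock(bucketMutex);
    toSwaps->swap(*swapBucket);
}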

Effective placement of lock_guard - from item 16 from Effective Modern C++

In item 16: "Make const member functions thread safe" there is a code as follows:
class Widget {
public:
    int magicValue() const
    {
        std::lock_guard<std::mutex> guard(m); // lock m
        if (cacheValid) return cachedValue;
        else {
            auto val1 = expensiveComputation1();
            auto val2 = expensiveComputation2();
            cachedValue = val1 + val2;
            cacheValid = true;
            return cachedValue;
        }
    } // unlock m

private:
    mutable std::mutex m;
    mutable int cachedValue;          // no longer atomic
    mutable bool cacheValid{ false }; // no longer atomic
};
I wonder why the std::lock_guard should always be executed on each magicValue() call; wouldn't the following work as expected?
class Widget {
public:
    int magicValue() const
    {
        if (cacheValid) return cachedValue;
        else {
            std::lock_guard<std::mutex> guard(m); // lock m
            if (cacheValid) return cachedValue;
            auto val1 = expensiveComputation1();
            auto val2 = expensiveComputation2();
            cachedValue = val1 + val2;
            cacheValid = true;
            return cachedValue;
        }
    } // unlock m

private:
    mutable std::atomic<bool> cacheValid{false};
    mutable std::mutex m;
    mutable int cachedValue; // no longer atomic
};
This way fewer mutex locks would be required, making the code more efficient. I assume here that atomics are always faster than mutexes.
[edit]
For completeness, I measured the efficiency of both approaches, and the second looks like it is only 6% faster: http://coliru.stacked-crooked.com/a/e8ce9c3cfd3a4019
Your second code snippet shows a perfectly valid implementation of the Double Checked Locking Pattern (DCLP) and is (probably) more efficient than Meyers' solution, since it avoids locking the mutex unnecessarily after cachedValue is set.
It is guaranteed that the expensive computations are not performed more than once.
Also, it is important that the cacheValid flag is atomic because it creates a happens-before relationship between writing-to and reading-from cachedValue.
In other words, it synchronizes cachedValue (which is accessed outside of the mutex) with other threads calling magicValue().
Had cacheValid been a regular 'bool', you would have had a data race on both cacheValid and cachedValue (causing undefined behavior per the C++11 standard).
Using default sequential consistent memory ordering on the cacheValid memory operations is fine, since it implies acquire/release semantics.
In theory, you could optimize by using weaker memory orderings on the atomic loads and store:
int Widget::magicValue() const
{
    if (cacheValid.load(std::memory_order_acquire)) return cachedValue;
    else {
        std::lock_guard<std::mutex> guard(m); // lock m
        if (cacheValid.load(std::memory_order_relaxed)) return cachedValue;
        auto val1 = expensiveComputation1();
        auto val2 = expensiveComputation2();
        cachedValue = val1 + val2;
        cacheValid.store(true, std::memory_order_release);
        return cachedValue;
    }
}
Note that this is only a minor optimization since reading an atomic is a regular load on many platforms (making it as efficient as reading from a non-atomic).
As pointed out by Nir Friedman, this only works one way; you cannot invalidate cacheValid and restart calculations. But that was not part of Meyers' example.
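For reference, the invalidation both answers have in mind would be a hypothetical member like the following (not part of Meyers' example); as explained below, combining it with the double-checked version reintroduces a race on cachedValue:

// Hypothetical member of Widget (not in Meyers' example):
void invalidateCache()
{
    cacheValid = false;  // default seq_cst store; the next magicValue() call recomputes
}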
I actually think that your snippet is correct in isolation, but it relies on an assumption that is usually not true in a real-world example: it assumes that cacheValid goes from false to true but can never make the reverse transition, that is, become invalidated.
In the old code, the mutex protects all reads and writes on cachedValue. In your new code, there's actually a read access of cachedValue outside the mutex. That means that it is possible for one thread to read this value, while another thread is writing it. The catch is that the reading outside the mutex will only occur if cacheValid is true. But if cacheValid is true, no writing will occur; cacheValid can only become true after all writing is complete (note that this is enforced, because the assignment operator on cacheValid will use the strictest memory ordering guarantee, so it cannot be reordered with the earlier instructions in the block).
But suppose some other piece of code is written, that can invalidate the cache: Widget::invalidateCache(). This piece of code does nothing but set cacheValid to false again. In the old code, if you called invalidateCache and magicValue repeatedly from different threads, the latter function might recalculate the value or not at any given point. But even if your complex calculations are returning different values each time they are called (because they use global state, say), you will always get either the old or new value, and nothing else. But now consider the following execution order in your code:
Thread 1 calls magicValue, and checks the value of cacheValid. It's true. It gets interrupted before it can continue.
Thread 2 calls invalidateCache, and then immediately calls magicValue. magicValue sees that the cache is invalid, acquires the mutex, starts computing, and begins writing to cachedValue.
Thread 1 resumes, reading a partially written cachedValue.
I actually don't think this example works on most modern computers, because int will typically be 32 bits, and typically 32 bit writes and reads will be atomic. So it's not really possible to intersperse or "tear" the value of cachedValue. But on different architectures, or if you use a type other than integer (anything over 64 bits, for example), writing or reading is not guaranteed to be atomic. So you can get, as a return for magicValue, something which is neither the old value nor the new value but some weird bitwise hybrid that is not even a valid object.
So, good on you for finding this. I guess that in trying to boil down the example for simplicity, the author forgot that it was no longer necessary to be strict about putting the mutex on the outside.