What guarantees that a weak relaxed uncontended CAS loop terminates?

cppreference's page on compare_exchange gives the following example code (paraphrased snippet):
while (!head.compare_exchange_weak(new_node->next,
                                   new_node,
                                   std::memory_order_release,
                                   std::memory_order_relaxed))
    ; // empty body
Suppose you run this once on thread A, and once on thread B. Nothing else is touching head or its associated data. Thread B's invocation happens to start after thread A's has happened (in real time), but A's changes have not yet propagated to the cache of the CPU running Thread B.
What forces A's changes to get to B? That is to say, why is the execution where B's weak compare exchange simply fails indefinitely and the CPU cache remains stale not allowed? Or is it allowed?
It seems like the CPU running B is not being forced to go out and sync in the changes made by A, because the failure memory ordering is relaxed. So why does the hardware ever do so? Is this an implicit guarantee of the C++ spec, or of the hardware, or is bounded-staleness-of-memory a standard documented guarantee?

Good question! This is actually specifically addressed by paragraph 25 of §1.10 in the C++11 standard:
An implementation should ensure that the last value (in modification order) assigned by an atomic or
synchronization operation will become visible to all other threads in a finite period of time.
So the answer is yes, the value is guaranteed to eventually propagate to the other thread, even with relaxed memory ordering.
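For context, here is a fuller sketch of the push operation the cppreference snippet is taken from (the Node type and the push_front name are illustrative here, not part of the quoted snippet):

#include <atomic>

struct Node { int data; Node* next; };
std::atomic<Node*> head{nullptr};

void push_front(int data) {
    Node* new_node = new Node{data, head.load(std::memory_order_relaxed)};
    // On failure, compare_exchange_weak reloads the current head into
    // new_node->next, so the next iteration retries against whatever
    // value has propagated from the other thread.
    while (!head.compare_exchange_weak(new_node->next, new_node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
        ; // empty body
}

Each failed iteration re-reads head, so once A's store becomes visible (which the quoted wording says should happen within a finite period of time), B's CAS is comparing against the up-to-date value and can succeed.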

Related

Does std::mutex enforce cache coherence?

I have a non-atomic variable my_var and an std::mutex my_mut. I assume up to this point in the code, the programmer has followed this rule:
Each time the programmer modifies or writes to my_var, he locks
and unlocks my_mut.
Assuming this, Thread1 performs the following:
my_mut.lock();
my_var.modify();
my_mut.unlock();
Here is the sequence of events I imagine in my mind:
Prior to my_mut.lock();, there were possibly multiple copies of my_var in main memory and some local caches. These values do not necessarily agree, even if the programmer followed the rule.
By the instruction my_mut.lock();, all writes from the previously executed my_mut critical section are visible in memory to this thread.
my_var.modify(); executes.
After my_mut.unlock();, there are possibly multiple copies of my_var in main memory and some local caches. These values do not necessarily agree, even if the programmer followed the rule. The value of my_var at the end of this thread will be visible to the next thread that locks my_mut, by the time it locks my_mut.
I have been having trouble finding a source that verifies that this is exactly how std::mutex should work. I consulted the C++ standard. From ISO 2013, I found this section:
[ Note: For example, a call that acquires a mutex will perform an
acquire operation on the locations comprising the mutex.
Correspondingly, a call that releases the same mutex will perform a
release operation on those same locations. Informally, performing a
release operation on A forces prior side effects on other memory
locations to become visible to other threads that later perform a
consume or an acquire operation on A.
Is my understanding of std::mutex correct?
C++ is specified in terms of relations between operations, not in terms of particular hardware mechanisms (like cache coherence). The C++ Standard defines a happens-before relationship, which roughly means that whatever happened before has completed all of its side effects and is therefore visible to whatever happens after.
And because you have an exclusive critical section, whatever happens within it happens before the next time that critical section is entered. So any subsequent entry into it will see everything that happened before. That's what the Standard mandates. Everything else (including cache coherence) is the implementation's duty: it has to make sure that the described behavior is consistent with what actually happens.
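A minimal sketch of that guarantee using the question's names (the writer/reader functions and the value 42 are just for illustration):

#include <mutex>
#include <thread>

std::mutex my_mut;
int my_var = 0; // non-atomic, guarded by my_mut

void writer() {
    std::lock_guard<std::mutex> lk(my_mut); // lock: acquire operation
    my_var = 42;                            // side effect inside the critical section
}                                           // unlock: release operation

void reader() {
    std::lock_guard<std::mutex> lk(my_mut); // synchronizes with the prior unlock, if any
    int copy = my_var; // sees 42 if writer's unlock came first in the mutex's total order, else 0
    (void)copy;
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}

Whichever thread locks my_mut second is guaranteed to see everything the other thread did inside its critical section; there is never a data race on my_var.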

C++20: How is the return from atomic::wait() guaranteed by the standard?

This is a language-lawyer question.
First of all, does the a.wait() in the following code always get to return?
#include <atomic>
#include <thread>

std::atomic_int a{ 0 };

void f()
{
    a.store(1, std::memory_order_relaxed);
    a.notify_one();
}

int main()
{
    std::thread thread(f);
    a.wait(0, std::memory_order_relaxed); // always returns?
    thread.join();
}
I believe the standard's intention is that a.wait() always gets to return. (Otherwise atomic::wait/notify would be useless, wouldn't it?) But I think the current standard text cannot guarantee this.
The relevant part of the standard is in §31.6 [atomics.wait] paragraph 4:
A call to an atomic waiting operation on an atomic object M is eligible to be unblocked by a call to an atomic notifying operation on M if there exist side effects X and Y on M such that:
(4.1) — the atomic waiting operation has blocked after observing the result of X,
(4.2) — X precedes Y in the modification order of M, and
(4.3) — Y happens before the call to the atomic notifying operation.
and §31.8.2 [atomics.types.operations] paragraphs 29-33:
void wait(T old, memory_order order = memory_order::seq_cst) const volatile noexcept;
void wait(T old, memory_order order = memory_order::seq_cst) const noexcept;
Effects: Repeatedly performs the following steps, in order:
(30.1) — Evaluates load(order) and compares its value representation for equality against that of old.
(30.2) — If they compare unequal, returns.
(30.3) — Blocks until it is unblocked by an atomic notifying operation or is unblocked spuriously.
void notify_one() volatile noexcept;
void notify_one() noexcept;
Effects: Unblocks the execution of at least one atomic waiting operation that is eligible to be unblocked (31.6) by this call, if any such atomic waiting operations exist.
With the above wording, I see two problems:
If the wait() thread saw the value in step (30.1), compared it equal to old in step (30.2), and got scheduled out; then in another thread notify_one() stepped in and saw no blocking thread, doing nothing; the subsequent blocking in step (30.3) would never be unblocked. Here isn't it necessary for the standard to say "wait() function atomically performs the evaluation-compare-block operation", similar to what is said about condition_variable::wait()?
There's no synchronization between notify_*() and unblocking of wait(). If in step (30.3), the thread was unblocked by an atomic notifying operation, it would repeat step (30.1) to evaluate load(order). Here there is nothing preventing it from getting the old value. (Or is there?) Then it would block again. Now no one would wake it.
Is the above concern just nit-picking, or defect of the standard?
#1 is pretty much addressed by C++20 thread possibly waiting on std::atomic forever. The wait() operation is clearly eligible to be unblocked by the notify(), and is the only such operation, so the notify() must unblock it. The eligible "wait operation" is the entire call, not only step 30.3.
If an implementation performs steps 30.1-3 in a non-atomic fashion, such that the notify can happen "between" steps 1 and 3, then it has to somehow ensure that step 3 unblocks anyway.
#2 is stickier. At this point I think you are right: the standard doesn't guarantee that the second load gets the value 1; and if it doesn't, then it will presumably block again and never be woken up.
The use of the relaxed memory ordering makes it pretty clear in this example. If we wanted to prove that the second load must see 1, the only way I can see is to invoke write-read coherence (intro.races p18) which requires that we prove the store happens before the load, in the sense of intro.races p10. This in turn requires that somewhere along the way, we have some operation in one thread that synchronizes with some operation in the other (you can't get inter-thread happens before without a synchronizes with, unless there are consume operations which is not the case here). Usually you get synchronizes with from a pairing of an acquire load with a release store (atomics.order p2), and here we have no such thing; nor, as far as I can tell, anything else that would synchronize. So we don't have a proof.
In fact, I think the problem persists even if we upgrade to seq_cst operations. We could then have both loads coherence-ordered before the store, and the total order S of atomics.order p4 would go "first load, second load, store". I don't see that contradicting anything. We would still have to show a synchronizes with to rule this out, and again we can't. There might appear to be a better chance than in the relaxed case, since seq_cst loads and stores are acquire and release respectively. But the only way to use this would be if one of the loads were to take its value from the store, i.e. if one of the loads were to return 1, and we are assuming that is not the case. So again this undesired behavior seems consistent with all the rules.
It does make you wonder if the Standard authors meant to require the notification to synchronize with the unblocking. That would fix the problem, and I would guess real-life implementations already include the necessary barriers.
But indeed, I am not seeing this specified anywhere.
The only possible way out that I can see is that "eligible to be unblocked" applies to the entire wait operation, not just to a single iteration of it. But it seems clear that the intent was that if you are unblocked by a notify and the value has not changed, then you block again until a second notify occurs (or spurious wakeup).
It's starting to look to me like a defect.
1. wait atomicity: Since the wait on M is "eligible to be unblocked by a call to an atomic notifying operation" in the situation you described (after logical step 30.2 but before 30.3), the implementation has to comply.
If there is "a waiting operation that is eligible to be unblocked", then notify_one has to unblock at least one wait call - not just its internal block - regardless of whether that call has reached step (30.3) yet.
So in the described case the implementation must ensure that the notification is delivered and that steps (30.1)-(30.3) are repeated.
2. store/notify order: First, to clarify the terms the standard uses:
"the value of an atomic object M" is used to refer to the underlying T object.
"the value pointed to by this" is used to refer to the underlying T object too. I find this misleading, as the internal representation of a std::atomic<T> object is not required to be identical to a T object, so it really refers to something like - just an example - this->m_underlying_data (the object can have other members, and since sizeof(T) != sizeof(std::atomic<T>) can be true, it cannot simply mean *this).
the term "an atomic object M" is used to refer to the whole std::atomic<T> object, the real *this, as explained in this thread.
It is guaranteed by the standard that even with the relaxed order the sequence of the modifications on the same atomic object has to be consistent among different threads:
C++20 standard 6.9.2.1 (19):
[Note: The four preceding coherence requirements effectively disallow
compiler reordering of atomic operations to a single object, even if
both operations are relaxed loads. This effectively makes the cache
coherence guarantee provided by most hardware available to C++ atomic
operations. — end note]
As the standard doesn't say how atomic should be implemented, notify could modify the atomic object. This is why it is not a const function and I think the following applies:
C++20 standard 6.9.2.1 (15):
If an operation A that modifies an atomic object M happens before an operation B that modifies M, then A shall be earlier than B in the modification order of M.
This is why the following statement is wrong:
Here there is nothing preventing it from getting the old value.
As the store and notify_one operate on the same atomic object (and modify it), their order has to be preserved. So it is guaranteed that the notify_one occurs after the value is 1.
This question was asked on std-discussions, so I'll post here my answer in that discussion.
If the wait() thread saw the value in step (30.1), compared it equal to old in step (30.2), and got scheduled out; then in another thread notify_one() stepped in and saw no blocking thread, doing nothing; the subsequent blocking in step (30.3) would never be unblocked. Here isn't it necessary for the standard to say "wait() function atomically performs the evaluation-compare-block operation", similar to what is said about condition_variable::wait()?
The operation is called an "atomic waiting operation". I agree, this can be read as a "waiting operation on an atomic", but practically speaking, it is clear that the steps described in the operation description need to be performed atomically (or have the effect of being atomic). You could argue that the standard wording could be better, but I don't see that as a logical error in the standard.
There's no synchronization between notify_*() and unblocking of wait(). If in step (30.3), the thread was unblocked by an atomic notifying operation, it would repeat step (30.1) to evaluate load(order). Here there is nothing preventing it from getting the old value. (Or is there?) Then it would block again. Now no one would wake it.
There is no synchronization between notify_* and wait because such synchronization would be redundant.
The synchronizes-with relation is used to determine the order of events happening in different threads, including events on different objects. It is used in definition of inter-thread happens before and, by induction, of happens before. In short, "a release operation on an atomic in one thread synchronizes with an acquire operation on the atomic in another thread" means that effects on objects (including other than the atomic) that were made prior to the release operation will be observable by operations that are performed following the acquire operation.
This release-acquire memory ordering semantics is redundant and irrelevant for the notify_* and wait operations because you already can achieve the synchronizes-with relation by performing a store(release) in the notifying thread and wait(acquire) in the waiting thread. There is no reasonable use case for notify_* without a prior store or read-modify-write operation. In fact, as you quoted yourself, an atomic waiting operation is eligible to be unblocked by an atomic notifying operation only if there was a side effect on the atomic that was not first observed by the waiting operation.
What gives the guarantee that the load in the wait observes the effect of the store in the notifying thread is the sequential execution guarantee within the notifying thread. The store call is sequenced before the notify_one call because those are two full expressions. Consequently, in the waiting thread, store happens before notify_one (following from the definitions of happens before and inter-thread happens before). This means that the waiting thread will observe the effects of store and notify_one in that order, never in reverse.
So, to recap, there are two scenarios possible:
The notifying thread stores 1 to the atomic first, and the waiting thread observes that store on entry into wait. The waiting thread does not block, and the notify_one call is ignored.
The waiting thread does not observe the store and blocks. The notifying thread issues store and notify_one, in that order. At some point, the waiting thread observes the notify_one (due to the eligible to be unblocked relation between notify_one and wait). At this point, it must also observe the effect of store (due to the happens before relation between store and notify_one) and return.
If the waiting thread observed notify_one but not store, it would violate the happens before relation between those operations, and therefore it is not possible.
Update 2022-07-28:
As was pointed out in the comments by Broothy, the happens before relation between store and notify_one may be guaranteed only in the notifying thread but not in the waiting thread. Indeed, the standard is unclear in this regard: it requires the happens before relation, but it doesn't specify in which thread this relation has to be observed. My answer above assumed that the relation has to be maintained in every thread, but this may not be the correct interpretation of the standard. So, in the end I agree the standard needs to be clarified in this regard. Specifically, it needs to require that store inter-thread happens before notify_one.
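For completeness, here is a sketch of the store(release)/wait(acquire) pairing mentioned above, which makes the synchronizes-with relation explicit rather than relying on the notification (this is an assumption about how one would typically write it, not a quote from the standard):

#include <atomic>
#include <thread>

std::atomic_int a{0};

void f() {
    a.store(1, std::memory_order_release); // release: pairs with the acquire in wait()
    a.notify_one();
}

int main() {
    std::thread thread(f);
    a.wait(0, std::memory_order_acquire); // returns once a load observes a value other than 0
    thread.join();
}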

How effective a barrier is an atomic write followed by an atomic read of the same variable?

Consider the following:
#include <atomic>

std::atomic<unsigned> var;
unsigned foo;
unsigned bar;

unsigned is_this_a_full_fence() {
    var.store(1, std::memory_order_release);
    var.load(std::memory_order_acquire);
    bar = 5;
    return foo;
}
My thought is the dummy load of var should prevent the subsequent variable accesses of foo and bar from being reordered before the store.
It seems the code creates a barrier against reordering - and at least on x86, release and acquire require no special fencing instructions.
Is this a valid way to code a full fence (LoadStore/StoreStore/StoreLoad/LoadLoad)? What am I missing?
I think the release creates a LoadStore and StoreStore barrier. The acquire creates a LoadStore and LoadLoad barrier. And the dependency between the two variable accesses creates a StoreLoad barrier?
EDIT: change barrier to full fence. Make snippet C++.
One major issue with this code is that the store and subsequent load to the same memory location are clearly not synchronizing with any other thread. In the C++ memory model races are undefined behavior, and the compiler can therefore assume your code didn't have a race. The only way that your load could observe a value different from what was stored is if you had a race. The compiler can therefore, under the C++ memory model, assume that the load observes the stored value.
This exact atomic code sequence appears in my C++ standards committee paper "No Sane Compiler Would Optimize Atomics", under "Redundant load eliminated". There's a longer CppCon version of this paper on YouTube.
Now imagine C++ weren't such a pedant, and the load / store were guaranteed to stay there despite the inherent racy nature. Real-world ISAs offer such guarantees which C++ doesn't. You provide some happens-before relationship with other threads with acquire / release, but you don't provide a unique total order which all threads agree on. So yes this would act as a fence, but it wouldn't be the same as obtaining sequential consistency, or even total store order. Some architectures could have threads which observe events in a well-defined but different order. That's perfectly fine for some applications! You'll want to look into IRIW (independent reads of independent writes) to learn more about this topic. The x86-TSO paper discusses it specifically in the context of the ad-hoc x86 memory model, as implemented in various processors.
Your pseudo-code (which is not valid C++) is not atomic as a whole.
For example, a context switch could happen between the store and the load and some other thread would become scheduled (or is already running on some other core) and would then change the variable in between. Context switches and interrupts can happen at every machine instruction.
Is this a valid way to code a barrier
No, it is not. See also pthread_barrier_init(3p), pthread_barrier_wait(3p) and related functions.
You should read some pthread tutorial (in practice, C++11 threads are a tiny abstraction above them) and consider using mutexes.
Notice that std::memory_order affects mostly the current thread (and what it is observing), and does not forbid it from being interrupted or context-switched ...
See also this answer.
Assuming that you run this code in multiple threads, using ordering like this is not correct because the atomic operations do not synchronize (see link below) and hence foo and bar are not protected.
But it still may have some value to look at guarantees that apply to individual operations.
As an acquire operation, var.load is not reordered (inter-thread) with the operations on foo and bar (hence #LoadStore and #LoadLoad, you got that right).
However, var.store, is not protected against any reordering (in this context).
#StoreLoad reordering can be prevented by tagging both atomic operations seq_cst. In that case, all threads will observe the order as defined (still incorrect though because the non-atomics are unprotected).
EDIT
var.store is not protected against reordering because it acts as a one-way barrier only for operations that are sequenced before it (i.e. earlier in program order), and in your code there are no operations before that store.
var.load acts as a one-way barrier for operations that are sequenced after it (i.e. the accesses to foo and bar).
Here is a basic example of how a variable (foo) is protected by an atomic store/load pair:
// thread 1
foo = 42;
var.store(1, std::memory_order_release);
// thread 2
while (var.load(std::memory_order_acquire) != 1);
assert(foo == 42);
Thread 2 only continues after it observes the value set by thread 1. The store is then said to have synchronized with the load, and the assert cannot fire.
For a complete overview, check Jeff Preshing's blog articles.
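If what is wanted really is a full fence (including #StoreLoad), the portable way in C++ is a seq_cst fence rather than a store/load pair on the same atomic. A sketch based on the question's variables (the function name is illustrative, and note that the fence alone still does not make the plain accesses to foo and bar race-free in a multi-threaded program):

#include <atomic>

std::atomic<unsigned> var;
unsigned foo;
unsigned bar;

unsigned with_full_fence() {
    var.store(1, std::memory_order_relaxed);
    // Typically compiles to a full barrier on mainstream hardware
    // (e.g. MFENCE or a locked instruction on x86), ordering the store
    // above against the accesses below within this thread.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    bar = 5;
    return foo;
}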

std::promise set_value and thread safety

I'm a bit confused about the requirements in terms of thread-safety placed on std::promise::set_value().
The standard says:
Effects: Atomically stores the value r in the shared state and makes
that state ready
However, it also says that promise::set_value() can only be used to set a value once. If it is called multiple times, a std::future_error is thrown. So you can only set the value of a promise once.
And indeed, just about every tutorial, online code sample, or actual use case for std::promise involves a communication channel between 2 threads, where one thread calls std::future::get(), and the other thread calls std::promise::set_value().
I've never seen a use case where multiple threads might call std::promise::set_value(), and even if they did, all but one would cause a std::future_error exception to be thrown.
So why does the standard mandate that calls to std::promise::set_value() are atomic? What is the use case for calling std::promise::set_value() from multiple threads concurrently?
EDIT:
Since the top-voted answer here is not really answering my question, I assume what I'm asking is unclear. So, to clarify: I'm aware of what futures and promises are for and how they work. My question is why, specifically, does the standard insist that std::promise::set_value() must be atomic? This is a more subtle question than "why must there not be a race between calls to promise::set_value() and calls to future::get()"?
In fact, many of the answers here (incorrectly) respond that the reason is because if std::promise::set_value() wasn't atomic, then std::future::get() could potentially cause a race condition. But this is not true.
The only requirement to avoid a race condition is that std::promise::set_value() must have a happens-before relationship with std::future::get() - in other words, it must be guaranteed that when std::future::wait() returns, std::promise::set_value() has completed.
This is completely orthogonal to std::promise::set_value() itself being atomic or not. In a typical implementation using condition variables, std::future::get()/wait() would wait on a condition variable. Then, std::promise::set_value() could non-atomically perform any arbitrarily complex computation to set the actual value. Then it would notify the shared condition variable, (implying a memory fence with release semantics), and std::future::get() would wake up and safely read the result.
So, std::promise::set_value() itself does not need to be atomic to avoid a race condition here - it simply needs to satisfy a happens-before relationship with std::future::get().
So again, my question is: why does the C++ standard insist that std::promise::set_value() must actually be an atomic operation, as if a call to std::promise::set_value() was performed entirely under a mutex lock? I see no reason why this requirement should exist, unless there is some reason or use case for multiple threads calling std::promise::set_value() concurrently. And I can't think of such a use-case, hence this question.
If it was not an atomic store, then two threads could simultaneously call promise::set_value, which does the following:
1. check that the future is not already ready (i.e., does not yet have a stored value or exception)
2. store the value
3. mark the state ready
4. release anything blocking on the shared state becoming ready
By making this sequence atomic, the first thread to execute (1) gets all the way through to (3), and any other thread calling promise::set_value at the same time will fail at (1) and raise a future_error with promise_already_satisfied.
Without the atomicity, two threads could potentially store their value, and then one would successfully mark the state ready and the other would raise an exception - i.e. the same outcome as before, except that the value that ends up stored might be the one from the thread that got the exception.
In many cases that might not matter which thread 'wins', but when it does matter, without the atomicity guarantee you would need to wrap another mutex around the promise::set_value call. Other approaches such as compare-and-exchange wouldn't work because you can't check the future (unless it's a shared_future) to see if your value won or not.
When it doesn't matter which thread 'wins', you could give each thread its own future, and use std::experimental::when_any to collect the first result that happened to become available.
Edit after some historical research:
Although the above (two threads using the same promise object) doesn't seem like a good use-case, it was certainly envisaged by one of the contemporary papers of the introduction of future to C++: N2744. This paper proposed a couple of use-cases which had such conflicting threads calling set_value, and I'll quote them here:
Second, consider use cases where two or more asynchronous operations are performed in parallel and "compete" to satisfy the promise. Some examples include:
A sequence of network operations (e.g. request a web page) is performed in conjunction with a wait on a timer.
A value may be retrieved from multiple servers. For redundancy, all servers are tried but only the first value obtained is needed.
In both examples, the first asynchronous operation to complete is the one that satisfies the promise. Since either operation may complete second, the code for both must be written to expect that calls to set_value() may fail.
I've never seen a use case where multiple threads might call std::promise::set_value(), and even if they did, all but one would cause a std::future_error exception to be thrown.
You missed the whole idea of promises and futures.
Usually, we have a promise/future pair. The promise is the object you push the asynchronous result or exception into, and the future is the object you pull the asynchronous result or exception from.
In most cases, the future and the promise do not reside on the same thread (otherwise we would use a simple pointer). So you might pass the promise to some thread, thread pool, or third-party asynchronous function, set the result from there, and pull the result in the caller thread.
Setting the result with std::promise::set_value must be atomic, not because many threads may set the result, but because an object (the future) which resides on another thread must read the result, and doing that non-atomically is undefined behavior. So setting the value and pulling it (either by calling std::future::get or std::future::then) must happen atomically.
Remember, every future and promise pair has a shared state; setting the result from one thread updates the shared state, and getting the result reads from the shared state. Like any shared state/memory in C++, when it is accessed from multiple threads, the updates and reads must happen under a lock; otherwise it's undefined behavior.
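To make the usual one-producer/one-consumer pairing concrete, here is a minimal sketch (the worker name and the value 42 are illustrative):

#include <future>
#include <iostream>
#include <thread>

void worker(std::promise<int> p) {
    // ... do the asynchronous work ...
    p.set_value(42); // atomically stores the result in the shared state and makes it ready
}

int main() {
    std::promise<int> p;
    std::future<int> f = p.get_future();
    std::thread t(worker, std::move(p));
    std::cout << f.get() << '\n'; // blocks until set_value has completed, then reads 42
    t.join();
}

Only one thread calls set_value and only one thread calls get, yet the shared state is touched from both, which is exactly why the store into it has to be synchronized.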
These are all good answers, but there's one additional point that's essential. Without atomicity of setting a value, reading the value may be subject to observability side-effects.
E.g., in a naive implementation:
// shared, non-atomic state (declarations added so the sketch is self-contained)
int value = 0; // the result being produced
int v = 0;
int v2 = 0;
bool flag = false;

void thread1()
{
    // do something. Maybe read from disk, or perform computation to populate value
    v = value;
    flag = true;
}

void thread2()
{
    if (flag)
    {
        v2 = v; // Here we have a read problem.
    }
}
Atomicity in std::promise<> allows you to avoid the very basic race condition between writing a value in one thread and reading it in another. Of course, if flag were std::atomic<> and the proper memory orderings were used, you would no longer have those visibility problems, and std::promise guarantees that for you.
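For comparison, a sketch of the corrected version alluded to above, with flag made atomic and release/acquire ordering (variable names kept from the naive example):

#include <atomic>

int value = 42; // stands in for the computed result
int v = 0;
int v2 = 0;
std::atomic<bool> flag{false};

void thread1() {
    v = value;                                   // populate the result
    flag.store(true, std::memory_order_release); // publish: pairs with the acquire below
}

void thread2() {
    if (flag.load(std::memory_order_acquire)) {  // if this reads true...
        v2 = v;                                  // ...the write to v above is guaranteed visible
    }
}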

C++ standard: can relaxed atomic stores be lifted above a mutex lock?

Is there any wording in the standard that guarantees that relaxed stores to atomics won't be lifted above the locking of a mutex? If not, is there any wording that explicitly says that it's kosher for the compiler or CPU to do so?
For example, take the following program (which could potentially use acq/rel for foo_has_been_set and avoid the lock, and/or make foo itself atomic. It's written this way to illustrate this question.)
#include <atomic>
#include <cassert>
#include <mutex>

std::mutex mu;
int foo = 0; // Guarded by mu
std::atomic<bool> foo_has_been_set{false};

void SetFoo() {
    mu.lock();
    foo = 1;
    foo_has_been_set.store(true, std::memory_order_relaxed);
    mu.unlock();
}

void CheckFoo() {
    if (foo_has_been_set.load(std::memory_order_relaxed)) {
        mu.lock();
        assert(foo == 1);
        mu.unlock();
    }
}
Is it possible for CheckFoo to crash in the above program if another thread is calling SetFoo concurrently, or is there some guarantee that the store to foo_has_been_set can't be lifted above the call to mu.lock by the compiler and CPU?
This is related to an older question, but it's not 100% clear to me that the answer there applies to this. In particular, the counter-example in that question's answer may apply to two concurrent calls to SetFoo, but I'm interested in the case where the compiler knows that there is one call to SetFoo and one call to CheckFoo. Is that guaranteed to be safe?
I'm looking for specific citations in the standard.
I think I've figured out the particular partial order edges that guarantee the program can't crash. In the answer below I'm referencing version N4659 of the draft standard.
The code involved for the writer thread A and reader thread B is:
A1: mu.lock()
A2: foo = 1
A3: foo_has_been_set.store(relaxed)
A4: mu.unlock()
B1: foo_has_been_set.load(relaxed) <-- (stop if false)
B2: mu.lock()
B3: assert(foo == 1)
B4: mu.unlock()
We seek a proof that if B3 executes, then A2 happens before B3, as defined in [intro.races]/10. By [intro.races]/10.2, it's sufficient to prove that A2 inter-thread happens before B3.
Because lock and unlock operations on a given mutex happen in a single total order ([thread.mutex.requirements.mutex]/5), we must have either A1 or B2 coming first. The two cases:
Case 1: Assume that A1 happens before B2. Then by [thread.mutex.class]/1 and [thread.mutex.requirements.mutex]/25, we know that A4 will synchronize with B2. Therefore by [intro.races]/9.1, A4 inter-thread happens before B2. Since B2 is sequenced before B3, by [intro.races]/9.3.1 we know that A4 inter-thread happens before B3. Since A2 is sequenced before A4, by [intro.races]/9.3.2, A2 inter-thread happens before B3.
Case 2: Assume that B2 happens before A1. Then by the same logic as above, we know that B4 synchronizes with A1. So since A1 is sequenced before A3, by [intro.races]/9.3.1, B4 inter-thread happens before A3. Therefore since B1 is sequenced before B4, by [intro.races]/9.3.2, B1 inter-thread happens before A3. Therefore by [intro.races]/10.2, B1 happens before A3. But then according to [intro.races]/16, B1 must take its value from the pre-A3 state. Therefore the load will return false, and B2 will never run in the first place. In other words, this case can't happen.
So if B3 executes at all (case 1), A2 happens before B3 and the assert will pass. ∎
No memory operation inside a mutex protected region can 'escape' from that area. That applies to all memory operations, atomic and non-atomic.
In section 1.10.1:
a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex
Correspondingly, a call that releases the same mutex will perform a release operation on those same locations
Furthermore, in section 1.10.1.6:
All operations on a given mutex occur in a single total order. Each mutex acquisition “reads the value written” by the last mutex release.
And in 30.4.3.1
A mutex object facilitates protection against data races and allows safe synchronization of data between execution agents
This means, acquiring (locking) a mutex sets a one-way barrier that prevents operations that are sequenced after the acquire (inside the protected area) from moving up across the mutex lock.
Releasing (unlocking) a mutex sets a one-way barrier that prevents operations that are sequenced before the release (inside the protected area) from moving down across the mutex unlock.
In addition, memory operations that are released by a mutex are synchronized (visible) with another thread that acquires the same mutex.
In your example, foo_has_been_set is checked in CheckFoo. If it reads true, you know that the value 1 has been assigned to foo by SetFoo, but it is not synchronized yet.
The mutex lock that follows will acquire foo; synchronization is then complete and the assert cannot fire.
The standard does not directly guarantee that, but you can read it between the lines of [thread.mutex.requirements.mutex].:
For purposes of determining the existence of a data race, these behave as atomic operations ([intro.multithread]).
The lock and unlock operations on a single mutex shall appear to occur in a single total order.
Now the second sentence looks like a hard guarantee, but it really isn't. Single total order is very nice, but it only means that there is a well-defined single total order of acquiring and releasing one particular mutex. Alone by itself, that doesn't mean that the effects of any atomic operations, or related non-atomic operations, should or must be globally visible at some particular point related to the mutex. Or, whatever. The only thing that is guaranteed is about the order of code execution (specifically, the execution of a single pair of functions, lock and unlock); nothing is being said about what may or may not happen with data, or otherwise.
One can, however, read between the lines that this is nevertheless the very intention from the "behave as atomic operations" part.
From other places, it is also pretty clear that this is the exact idea and that an implementation is expected to work that way, without explicitly saying that it must. For example, [intro.races] reads:
[ Note: For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations.
Note the unlucky little, harmless word "Note:". Notes are not normative. So, while it's clear that this is how it's intended to be understood (mutex lock = acquire; unlock = release), this is not actually a guarantee.
I think the best, although non-straightforward guarantee comes from this sentence in [thread.mutex.requirements.general]:
A mutex object facilitates protection against data races and allows safe synchronization of data between execution agents.
So that's what a mutex does (without saying how exactly). It protects against data races. Full stop.
Thus, no matter what subtleties one comes up with and no matter what else is written or isn't explicitly said, using a mutex protects against data races (... of any kind, since no specific type is given). That's what is written. So, in conclusion, as long as you use a mutex, you are good to go even with relaxed ordering or no atomic ops at all. Loads and stores (of any kind) cannot be moved around because then you couldn't be sure no data races occur. Which, however, is exactly what a mutex protects against.
Thus, without saying so, this says that a mutex must be a full barrier.
The answer seem to lie in http://eel.is/c++draft/intro.multithread#intro.races-3
The two pertinent parts are
[...] In addition, there are relaxed atomic operations, which are not synchronization operations [...]
and
[...] performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A. [...]
While relaxed-order atomics are not considered synchronization operations, that's all the standard has to say about them in this context. Since they are still memory locations, the general rule of them being governed by other synchronization operations still applies.
So in conclusion, the standard does not seem to have anything specifically in there to prevent the reordering you described, but the wording as it stands would prevent it naturally.
Edit: Whoops, I linked to the draft. The C++11 paragraph covering this is 1.10-5, using the same language.
Reordering within the critical section is of course possible:
void SetFoo() {
    mu.lock();
    // REORDERED:
    foo_has_been_set.store(true, std::memory_order_relaxed);
    PAUSE(); // imagine a scheduler pause here
    foo = 1;
    mu.unlock();
}
Now, the question is about CheckFoo - can the read of foo_has_been_set fall into the lock? Normally a read like that can (things can fall into locks, just not out), but the lock should never be taken if the if is false, so it would be a strange ordering. Does anything say "speculative locks" are not allowed? Or can the CPU speculate that the if is true before reading foo_has_been_set?
void CheckFoo() {
    // REORDER???
    mu.lock();
    if (foo_has_been_set.load(std::memory_order_relaxed)) {
        assert(foo == 1);
    }
    mu.unlock();
}
That ordering is probably not OK, but only because of "logic order", not memory order. If mu.lock() were inlined (and became some atomic ops), what would stop them from being reordered?
I'm not too worried about your current code, but I worry about any real code that uses something like this. It is too close to wrong.
i.e. if the OP's code were the real code, you would just change foo to atomic and get rid of the rest. So the real code must be different. More complicated? ...
CheckFoo() cannot cause the program to crash (i.e. trigger the assert()) but there is also no guarantee the assert() will ever be executed.
If the condition at the start of CheckFoo() triggers (see below) the visible value of foo will be 1 because of the memory barriers and synchronization between mu.unlock() in SetFoo() and mu.lock() in CheckFoo().
I believe that is covered by the description of mutex cited in other answers.
However there is no guarantee that the if condition (foo_has_been_set.load(std::memory_order_relaxed))) will ever be true.
Relaxed memory order makes no ordering guarantees; only the atomicity of the operation is assured. Consequently, in the absence of some other barrier, there's no guarantee when the relaxed store in SetFoo() will become visible in CheckFoo(). But if it is visible, that can only be because the store was executed, and then the mu.lock() that follows must be ordered after mu.unlock(), making the writes before it visible.
Please note this argument relies on the fact that foo_has_been_set is only ever set from false to true. If there were another function called UnsetFoo() that set it back to false:
void UnsetFoo() {
    mu.lock();
    foo = 0;
    foo_has_been_set.store(false, std::memory_order_relaxed);
    mu.unlock();
}
If that were called from the other (or yet a third) thread, then there would be no guarantee that checking foo_has_been_set without synchronization ensures that foo is set.
To be clear (and assuming foo_has_been_set is never unset):
void CheckFoo() {
    if (foo_has_been_set.load(std::memory_order_relaxed)) {
        assert(foo == 1); // <- All bets are off. data-race UB
        mu.lock();
        assert(foo == 1); // Guaranteed to succeed.
        mu.unlock();
    }
}
In practice, on any real platform in any long-running application, it is probably inevitable that the relaxed store will eventually become visible to the other thread. But there is no formal guarantee regarding if or when that will happen unless other barriers exist to assure it.
Formal References:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf
Refer to the notes at the end of p. 13 and the start of p. 14, particularly notes 17-20. They essentially assure coherence of 'relaxed' operations: their visibility is relaxed, but the visibility that does occur will be coherent, and the use of the phrase 'happens before' fits within the overall principle of program ordering, and in particular the acquire and release barriers of mutexes.
Note 19 is particularly relevant:
The four preceding coherence requirements effectively disallow
compiler reordering of atomic operations to a single object, even if
both operations are relaxed loads. This effectively makes the cache
coherence guarantee provided by most hardware available to C++ atomic
operations.