Does std::mutex enforce cache coherence? - c++

I have a non-atomic variable my_var and an std::mutex my_mut. I assume up to this point in the code, the programmer has followed this rule:
Each time the programmer modifies or writes to my_var, he locks
and unlocks my_mut.
Assuming this, Thread1 performs the following:
my_mut.lock();
my_var.modify();
my_mut.unlock();
Here is the sequence of events I imagine in my mind:
Prior to my_mut.lock();, there were possibly multiple copies of my_var in main memory and some local caches. These values do not necessarily agree, even if the programmer followed the rule.
By the instruction my_mut.lock();, all writes from the previously executed my_mut critical section are visible in memory to this thread.
my_var.modify(); executes.
After my_mut.unlock();, there are possibly multiple copies of my_var in main memory and some local caches. These values do not necessarily agree, even if the programmer followed the rule. The value of my_var at the end of this thread will be visible to the next thread that locks my_mut, by the time it locks my_mut.
I have been having trouble finding a source that verifies that this is exactly how std::mutex should work. I consulted the C++ standard. From ISO 2013, I found this section:
[ Note: For example, a call that acquires a mutex will perform an
acquire operation on the locations comprising the mutex.
Correspondingly, a call that releases the same mutex will perform a
release operation on those same locations. Informally, performing a
release operation on A forces prior side effects on other memory
locations to become visible to other threads that later perform a
consume or an acquire operation on A.
Is my understanding of std::mutex correct?

C++ is defined in terms of relations between operations, not in particular hardware terms (like cache coherence). The C++ Standard has a happens-before relationship, which roughly means that whatever happened before has completed all its side effects and is therefore visible at the moment the later operation happens.
And having entered an exclusive critical section means that whatever happens within it happens before the next time that critical section is entered. So any subsequent entry to it will see everything that happened before. That's what the Standard mandates. Everything else (including cache coherence) is the implementation's duty: it has to make sure that the described behavior is consistent with what actually happens.
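As a concrete illustration, here is a minimal sketch (my addition, assuming my_var is a plain int and a done flag is guarded by the same mutex): the writer's unlock synchronizes with the reader's next lock of my_mut, so once the reader observes done it is also guaranteed to see the write to my_var.

#include <cassert>
#include <mutex>
#include <thread>

int my_var = 0;     // non-atomic, only ever touched under my_mut
bool done = false;  // also protected by my_mut
std::mutex my_mut;

int main() {
    std::thread writer([] {
        std::lock_guard<std::mutex> lk(my_mut);
        my_var = 42;   // modification inside the critical section
        done = true;
    });
    std::thread reader([] {
        for (;;) {
            std::lock_guard<std::mutex> lk(my_mut);
            if (done) {                // this lock synchronizes with the writer's unlock,
                assert(my_var == 42);  // so the earlier write to my_var is visible here
                break;
            }
        }
    });
    writer.join();
    reader.join();
}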

Related

Why does this acquire and release memory fence not give a consistent value?

I'm just exploring the use of acquire and release memory fences and don't understand why I sometimes get an output of zero instead of the value 2 every time.
I ran the program a number of times, and assumed the atomic store before the release barrier and the atomic load after the acquire barrier would ensure the values always synchronise.
#include <iostream>
#include <thread>
#include <atomic>

std::atomic<int> x;

void write()
{
    x.store(2, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
}

void read()
{
    std::atomic_thread_fence(std::memory_order_acquire);
    // THIS DOES NOT GIVE THE EXPECTED VALUE OF 2 SOMETIMES
    std::cout << x.load(std::memory_order_relaxed) << std::endl;
}

int main()
{
    std::thread t1(write);
    std::thread t2(read);
    t1.join();
    t2.join();
    return 0;
}
the atomic variable x sometimes gives a value of 0
I think you are misunderstanding the purpose of fences. Fences only enforce a certain ordering of memory operations for the compiler and processor within a single thread of execution. Your acquire fence will not magically make the thread wait until the other thread performs the release.
Some literature will describe that a release operation in one thread "synchronizes with" a subsequent acquire operation in another thread. The key to this is that the acquire action is a subsequent action (i.e. the acquire is ordered "after" the release). If the release action is ordered after your acquire action, then there is no synchronizes-with relation between the write and read operations.
The reason why your code doesn't consistently return what you're expecting is because the thread interleavings sometimes order the write before the read, sometimes the read before the write.
If you want to guarantee that thread t2 reads the value 2 that thread t1 publishes, you're going to have to force t2 to wait for the publish to happen. The textbook example almost invariably uses a guard variable that notifies t2 that the data is ready to be consumed.
I recommend you read a very well-written blog post about release and acquire semantics and the synchronizes-with relation at Preshing on Programming's The Synchronizes-With Relation.
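For reference, here is a minimal sketch of what that guard-variable pattern could look like for this example (the ready flag and the spin-wait are my additions, not part of the original code): the release store to ready pairs with the acquire load in the reader, so the earlier relaxed store to x is guaranteed to be visible.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x{0};
std::atomic<bool> ready{false};

void write()
{
    x.store(2, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release); // publish: pairs with the acquire below
}

void read()
{
    while (!ready.load(std::memory_order_acquire))
        ; // wait until the writer has published
    std::cout << x.load(std::memory_order_relaxed) << std::endl; // now guaranteed to print 2
}

int main()
{
    std::thread t1(write);
    std::thread t2(read);
    t1.join();
    t2.join();
}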
Looks like you are misusing the fence. You are trying to use it as a mutex, right? If you expect the code to always output 2, you are assuming the load operation can never be executed before the store. But that is not what a memory fence does; that is what synchronization primitives do.
Fences are much trickier: they just don't allow the compiler/processor to reorder certain types of operations within one thread. At the end of the day, the relative order of execution of two separate threads is undefined.
The reason is simple: your fences accomplish exactly nothing and cannot have any use here, because there is no write that the fence would make visible (on the release side) to the acquiring side.
The simple answer is that the reading thread can run first, and obviously it will not see any write if it does.
The longer answer is that when your code has a race, as any code that uses either mutexes or atomics in a non-trivial way does, it must be prepared for any race outcome! So you have to make sure that not reading the value written by the writer will not break your code.
ADDITIONAL EXPLANATION
A way to explain release/acquire semantics is:
release means "I have accomplished something", and I set that atomic object to some value to publish that claim;
acquire means "have you accomplished something?", and I read that atomic object to see if it contains the claim.
So a release before you have accomplished anything is meaningless, and an acquire that throws away the information containing the claim, as in (void)x.load(memory_order_acquire), is generally meaningless, as there is (in general) no knowledge of what was acquired, which is to say of what was accomplished. (The exception to that rule is when a thread has made relaxed loads or RMW operations.)

How effective a barrier is an atomic write followed by an atomic read of the same variable?

Consider the following:
#include <atomic>

std::atomic<unsigned> var;
unsigned foo;
unsigned bar;

unsigned is_this_a_full_fence() {
    var.store(1, std::memory_order_release);
    var.load(std::memory_order_acquire);
    bar = 5;
    return foo;
}
My thought is the dummy load of var should prevent the subsequent variable accesses of foo and bar from being reordered before the store.
It seems the code creates a barrier against reordering - and at least on x86, release and acquire require no special fencing instructions.
Is this a valid way to code a full fence (LoadStore/StoreStore/StoreLoad/LoadLoad)? What am I missing?
I think the release creates a LoadStore and StoreStore barrier. The acquire creates a LoadStore and LoadLoad barrier. And the dependency between the two variable accesses creates a StoreLoad barrier?
EDIT: change barrier to full fence. Make snippet C++.
One major issue with this code is that the store and subsequent load to the same memory location are clearly not synchronizing with any other thread. In the C++ memory model races are undefined behavior, and the compiler can therefore assume your code didn't have a race. The only way that your load could observe a value different from what was stored is if you had a race. The compiler can therefore, under the C++ memory model, assume that the load observes the stored value.
This exact atomic code sequence appears in my C++ standards committee paper "No Sane Compiler Would Optimize Atomics", under "Redundant load eliminated". There's a longer CppCon version of this paper on YouTube.
Now imagine C++ weren't such a pedant, and the load / store were guaranteed to stay there despite the inherent racy nature. Real-world ISAs offer such guarantees which C++ doesn't. You provide some happens-before relationship with other threads with acquire / release, but you don't provide a unique total order which all threads agree on. So yes this would act as a fence, but it wouldn't be the same as obtaining sequential consistency, or even total store order. Some architectures could have threads which observe events in a well-defined but different order. That's perfectly fine for some applications! You'll want to look into IRIW (independent reads of independent writes) to learn more about this topic. The x86-TSO paper discusses it specifically in the context of the ad-hoc x86 memory model, as implemented in various processors.
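If a genuine full fence is what you are after, the portable way under the C++ memory model is a seq_cst fence rather than a store/load pair on the same atomic. A minimal sketch follows (my addition; note that foo and bar are still unprotected non-atomics if other threads touch them, as the answers below point out):

#include <atomic>

std::atomic<unsigned> var;
unsigned foo;
unsigned bar;

unsigned with_a_real_full_fence() {
    var.store(1, std::memory_order_release);
    // A seq_cst fence acts as a full barrier, including #StoreLoad,
    // which the release-store / acquire-load pair does not guarantee.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    bar = 5;
    return foo;
}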
Your pseudo-code (which is not valid C++) is not atomic as a whole.
For example, a context switch could happen between the store and the load and some other thread would become scheduled (or is already running on some other core) and would then change the variable in between. Context switches and interrupts can happen at every machine instruction.
Is this a valid way to code a barrier
No, it is not. See also pthread_barrier_init(3p), pthread_barrier_wait(3p) and related functions.
You should read some pthread tutorial (in practice, C++11 threads are a tiny abstraction above them) and consider using mutexes.
Notice that std::memory_order affects mostly the current thread (and what it is observing), and does not forbid it from being interrupted/context-switched ...
See also this answer.
Assuming that you run this code in multiple threads, using ordering like this is not correct because the atomic operations do not synchronize (see link below) and hence foo and bar are not protected.
But it still may have some value to look at guarantees that apply to individual operations.
As an acquire operation, var.load is not reordered (inter-thread) with the operations on foo and bar (hence #LoadStore and #LoadLoad, you got that right).
However, var.store is not protected against any reordering (in this context).
#StoreLoad reordering can be prevented by tagging both atomic operations seq_cst. In that case, all threads will observe the order as defined (still incorrect though because the non-atomics are unprotected).
EDIT
var.store is not protected against reordering because it acts as a one-way barrier for operations that are sequenced before it (i.e. earlier in program order), and in your code there are no operations before that store.
var.load acts as a one-way barrier for operations that are sequenced after it (i.e. foo and bar).
Here is a basic example of how a variable (foo) is protected by an atomic store/load pair:
// thread 1
foo = 42;
var.store(1, std::memory_order_release);
// thread 2
while (var.load(std::memory_order_acquire) != 1);
assert(foo == 42);
Thread 2 only continues after it observes the value set by thread 1. The store is then said to have synchronized with the load, and the assert cannot fire.
For a complete overview, check Jeff Preshing's blog articles.

C++: if one thread toggles a bool once it is done, is it safe to read that bool in a loop in a single other thread?

I am building a very simple program as an exercise.
The idea is to compute the total size of a directory by recursively iterating over all its contents, and summing the sizes of all files contained in the directory (and its subdirectories).
To show to a user that the program is still working, this computation is performed on another thread, while the main thread prints a dot . once every second.
Now the main thread of course needs to know when it should stop printing dots and can look up a result.
It is possible to use e.g. a std::atomic<bool> done(false); and pass this to the thread that will perform the computation, which will set it to true once it is finished. But I am wondering if in this simple case (one thread writes once completed, one thread reads periodically until nonzero) it is necessary to use atomic data types for this. Obviously if multiple threads might write to it, it needs to be protected. But in this case, there's only one writing thread and one reading thread.
Is it necessary to use an atomic data type here, or is it overkill and could a normal data type be used instead?
Yes, it's necessary.
The issue is that the different cores of the processor can have different views of the "same" data, notably data that's been cached within the CPU. The atomic part ensures that these caches are properly flushed so that you can safely do what you are trying to do.
Otherwise, it's quite possible that the other thread will never actually see the flag change from the first thread.
Yes it is necessary. The rule is that if two threads could potentially be accessing the same memory at the same time, and at least one of the threads is a writer, then you have a data race. Any execution of a program with a data race has undefined behavior.
Relevant quotes from the C++14 standard:
1.10/23
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
1.10/6
Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location.
Yes, it is necessary. Otherwise it is not guaranteed that changes to the bool in one thread will be observable in the other thread. In fact, if the compiler sees that the bool variable is, apparently, not ever used again in the execution thread that sets it, it might completely optimize away the code that sets the value of the bool.
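Putting the pieces together, here is a minimal sketch of the pattern the question describes (compute_directory_size is a hypothetical stand-in for the real directory walk): the release store to done pairs with the acquire load in main, so result is visible once the loop exits.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for the recursive directory-size computation.
long long compute_directory_size() {
    std::this_thread::sleep_for(std::chrono::seconds(3));
    return 123456789;
}

int main() {
    std::atomic<bool> done{false};
    long long result = 0;

    std::thread worker([&] {
        result = compute_directory_size();          // written before the release store
        done.store(true, std::memory_order_release);
    });

    while (!done.load(std::memory_order_acquire)) {
        std::cout << '.' << std::flush;             // progress indicator
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    worker.join();

    std::cout << "\nTotal size: " << result << " bytes\n";
    return 0;
}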

What guarantees that a weak relaxed uncontended CAS loop terminates?

cppreference's page on compare_exchange gives the following example code (paraphrased snippet):
while (!head.compare_exchange_weak(new_node->next,
                                   new_node,
                                   std::memory_order_release,
                                   std::memory_order_relaxed))
    ; // empty body
Suppose you run this once on thread A, and once on thread B. Nothing else is touching head or its associated data. Thread B's invocation happens to start after thread A's has happened (in real time), but A's changes have not yet propagated to the cache of the CPU running Thread B.
What forces A's changes to get to B? That is to say, why is the execution where B's weak compare exchange simply fails indefinitely and the CPU cache remains stale not allowed? Or is it allowed?
It seems like the CPU running B is not being forced to go out and sync in the changes made by A, because the failure memory ordering is relaxed. So why does the hardware ever do so? Is this an implicit guarantee of the C++ spec, or of the hardware, or is bounded-staleness-of-memory a standard documented guarantee?
Good question! This is actually specifically addressed by clause 25 in §1.10 of the C++11 standard:
An implementation should ensure that the last value (in modification order) assigned by an atomic or
synchronization operation will become visible to all other threads in a finite period of time.
So the answer is yes, the value is guaranteed to eventually propagate to the other thread, even with relaxed memory ordering.
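For context, the cppreference snippet paraphrases a lock-free stack push along these lines (a sketch with a minimal node type; not the exact code from that page): on each failure, compare_exchange_weak reloads the current head into new_node->next, so the retry always uses the freshly observed value.

#include <atomic>

template <typename T>
class lockfree_stack {
    struct node {
        T data;
        node* next;
    };
    std::atomic<node*> head{nullptr};

public:
    void push(const T& data) {
        node* new_node = new node{data, head.load(std::memory_order_relaxed)};
        // On failure, the expected value (new_node->next) is updated to the
        // current head, so the loop retries against the latest state.
        while (!head.compare_exchange_weak(new_node->next, new_node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed))
            ; // empty body
    }
};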

Does std::mutex create a fence?

If I lock a std::mutex, will I always get a memory fence? I am unsure whether locking merely implies a fence or actually enforces one.
Update:
Found this reference following up on RMF's comments.
Multithreaded programming and memory visibility
As I understand this is covered in:
1.10 Multi-threaded executions and data races
Para 5:
The library defines a number of atomic operations (Clause 29) and operations on mutexes (Clause 30)
that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation on one or more memory locations is either a consume operation, an acquire operation, a release operation, or both an acquire and release operation. A synchronization operation without an associated memory location is a fence and can be either an acquire fence, a release fence, or both an acquire and release fence. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics. [Note: For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A. “Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races. —end note]
Unlocking a mutex synchronizes with locking the mutex. I don't know what options the compiler has for the implementation, but you get the same effect as a fence.
A mutex operation (lock or unlock) on a particular mutex M is only useful for any purpose related to synchronization or memory visibility if M is shared by different threads and they perform these operations. A mutex defined locally and only used by one thread does not provide any meaningful synchronization.
[Note: The optimizations I describe here are probably not done by many compilers, which might view these mutex and atomic synchronization operations as "black boxes" that cannot be optimized (or even that should not be optimized, in order to preserve the predictability of code generation and some particular patterns, which is a bogus argument). I would not be surprised if no compiler did the optimization even in the simpler cases, but there is no doubt that they are legal.]
A compiler can easily determine that some variables are never used by multiple threads (or any asynchronous execution), notably an automatic variable whose address is not taken (and to which no reference exists). Such objects are called here "thread private". (All automatic variables that are candidates for register allocation are thread private.)
For a thread-private mutex, no meaningful code needs to be generated for lock/unlock operations: no atomic compare-and-exchange, no fencing, and often no state needs to be kept at all, except in the case of a "safe mutex" where the behavior of recursive locking is well defined and should fail (to make the sequence m.lock(); bool locked = m.try_lock(); work you need to keep at least a boolean state).
This is also true for any thread-private atomic object: only the naked non-atomic type is needed and normal operations can be performed (so a fetch-add of 1 becomes a regular post-increment).
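As a minimal illustration of the thread-private case just described (the transformation mentioned in the comments is one a compiler would be allowed to make, not one that any particular compiler is claimed to make):

#include <atomic>

int count_evens(const int* data, int n) {
    // Automatic atomic whose address never escapes this function: no other
    // thread or signal handler can observe it, so it is "thread private".
    std::atomic<int> counter{0};
    for (int i = 0; i < n; ++i)
        if (data[i] % 2 == 0)
            counter.fetch_add(1);
    // A compiler would be allowed to treat counter as a plain int here,
    // turning the fetch_add into an ordinary increment with no fencing.
    return counter.load();
}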
The reasons why these transformations are legal:
the obvious observation that if an object is accessed by only one thread of execution (and not even by an asynchronous signal handler), then the use of non-atomic operations in assembly is not detectable by any means;
the less obvious remark that no ordering/memory visibility is implied by any use of a thread-private synchronization object.
All synchronization objects are specified as tools for inter thread communication: they can guarantee that the side effects in one thread are visible in another thread; they cause a well defined order of operations to exist not just in one thread (the sequential order of execution of operations of one thread) but in multiple threads.
A common example is the publication of an information with an atomic pointer type:
The shared data is:
atomic<T*> shared; // null pointer by default
The publishing thread does:
T *p = new T;
*p = load_info();
shared.store(p, memory_order_release);
The consuming thread can check whether the data is available by loading the atomic object value, as a consumer:
T *p = shared.load(memory_order_acquire);
if (p) use *p;
(There is no defined way to wait for availability here, it's a simple example to illustrate publication and consumption of the published value.)
The publishing thread only needs to set the atomic variable after it has finished the initialization of all fields; the memory order is a release to communicate the fact that the memory manipulations are finished.
The other threads only need an acquire memory order to "connect" with the release operation, if there was one. If the value is still null, the thread has learned nothing about the world and the acquire is meaningless; it can't act on it. (By the time the thread checks the pointer and sees a null, the shared variable might have been changed already. It doesn't matter, as the designer considered that not having a value in that thread is manageable, or it would have performed the operations in a sequence.)
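Here is a self-contained version of the publish/consume sketch above (a simple struct stands in for T, and a spin-wait is added only so the example terminates; both are my additions):

#include <atomic>
#include <iostream>
#include <thread>

struct T {
    int a = 0;
    int b = 0;
};

std::atomic<T*> shared{nullptr};  // null pointer by default

void publisher() {
    T* p = new T;
    p->a = 1;                                    // finish initializing every field...
    p->b = 2;
    shared.store(p, std::memory_order_release);  // ...then publish
}

void consumer() {
    T* p;
    while (!(p = shared.load(std::memory_order_acquire)))
        ;  // spin until published (added only so this example terminates)
    std::cout << p->a << ' ' << p->b << '\n';    // guaranteed to print "1 2"
    delete p;
}

int main() {
    std::thread t1(publisher);
    std::thread t2(consumer);
    t1.join();
    t2.join();
}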
All atomic operations are intended to be possibly lock-free, that is, to finish in a short finite time whatever other threads are doing, even if they are stuck. It means that you can't depend on another thread having finished a job.
At the other end of the spectrum of thread-communication primitives, mutexes don't hold a value that can be used to carry information between threads (*), but they ensure that one thread can enter a lock-ops-unlock sequence only after another thread has finished its own lock-ops-unlock sequence.
(*) not even a boolean value, as using a mutex as a general boolean signal (= binary semaphore) between threads is specifically prohibited
A mutex is always used in connection with a set of shared variables: the protected variables or objects V; these V are used to carry information between threads, and the mutex makes access to that information well ordered (or serialized) between threads. In technical terms, all but the first lock operation (on M) pairs with a previous unlock operation on M:
a lock of M is an acquire operation on M
an unlock of M is a release operation on M
The semantics of locking/unlocking are defined for a single M, so let's stop repeating "on M"; we have threads A and B. The lock by B is an acquire that pairs with the unlock by A. Both operations together form an inter-thread synchronization.
[What about a thread that happens to lock M often and will often re-lock M without any other thread acting on M in the meantime? Nothing interesting: the acquire is still paired with a release, but A = B, so nothing is accomplished. The unlock was sequenced in the same thread of execution, so it's pointless in that particular case, but in general a thread can't tell that it's pointless. It isn't even special-cased by the language semantics.]
The synchronization that occurs is between the set of threads T acting on a mutex: no other thread is guaranteed to be able to view any memory operation performed by these T. Note that in practice, on most real computers, once a memory modification hits the cache, all CPUs will view it if they check that same address, by the power of cache coherence. But C/C++ threads(#) are specified neither in terms of a globally consistent cache nor in terms of ordering visible on a CPU, as the compiler itself can assume that non-atomic objects are not mutated in arbitrary ways by the program without synchronization (the CPU cannot assume any such thing, as it doesn't have a concept of atomic vs. non-atomic memory locations). That means a guarantee available on the CPU/memory system you are targeting is not in general available in the C/C++ high-level model. You absolutely cannot use normal C/C++ code as a high-level assembly; only by dousing your code with volatile (almost everywhere) can you even vaguely approach high-level assembly (but not quite).
(#) "C/C++ threads" here means the thread semantics shared by C and C++, not "the thread semantics of a C/C++ programming language": C and C++ are based on the same specification for their synchronization primitives; that doesn't mean that there is a C/C++ language.
Since the effect of mutex operations on M is only to serialize access to some data by the threads that use M, it's clear that other threads don't see any effect. In technical terms, the synchronizes-with relation is between threads using the same synchronization objects (mutexes in this context, the same atomic objects in the context of atomics).
Even when the compiler emits memory fences in assembly language, it doesn't have to assume that an unlock operation makes changes performed before the unlock visible to threads outside the set T.
That allows the decomposition of sets of threads for program analysis: if a program runs two sets of threads U and V in parallel, and U and V are created such that they can't access any common synchronization object (but they can access common non-atomic objects), then you can analyse the interactions of U and of V separately from the point of view of thread semantics, as U and V cannot exchange information in well-defined inter-thread ways (they can still exchange information via the system, for example via disk files, sockets, or system-specific shared memory).
(That observation might allow the compiler to optimize some threads without doing a whole-program analysis, even if some common mutable objects are "pulled in" via a third-party class that has static members.)
Another way to explain that is to say that the semantics of these primitives do not leak: only those threads that participate get a defined result.
Note that this is only true at the specification level for acquire and release operations, not for sequentially consistent operations (which are the default ordering for operations on an atomic object when you don't specify a memory order): all sequentially consistent actions (operations on atomic objects or fences) occur in a well-defined global order. That, however, doesn't mean anything for independent threads having no atomic objects in common.
An order of operations is unlike an order of elements in a container, where you can really navigate the container, or files presented as ordered by name. Only objects are observable; the orders of operations are not. Saying that there is a well-defined order only means that values never appear to have provably changed backward (vs. some abstract order).
If you have two unrelated ordered sets, say the integers with the usual order and the words with lexicographical order, you can define the sum of these sets as having an order compatible with both orders. You might put the numbers before, after, or alternated with the words. You are free to do what you want, because the elements in the sum of two sets don't have any relation with each other when they don't come from the same set.
You could say that there is a global order of all mutex operations; it's just not useful, like defining an order on the sum of unrelated sets.