If I lock a std::mutex, will I always get a memory fence? I am unsure whether locking implies or enforces a fence.
Update:
Found this reference following up on RMF's comments.
Multithreaded programming and memory visibility
As I understand this is covered in:
1.10 Multi-threaded executions and data races
Para 5:
The library defines a number of atomic operations (Clause 29) and operations on mutexes (Clause 30)
that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation on one or more memory locations is either a consume operation, an acquire operation, a release operation, or both an acquire and release operation. A synchronization operation without an associated memory location is a fence and can be either an acquire fence, a release fence, or both an acquire and release fence. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics. [Note: For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A. “Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races. —end note]
Unlocking a mutex synchronizes with locking the mutex. I don't know what options the compiler has for the implementation, but you get the same effect as a fence.
A mutex operation (lock or unlock) on a particular mutex M is only useful for any purpose related to synchronization or memory visibility if M is shared by different threads and they perform these operations. A mutex defined locally and only used by one thread does not provide any meaningful synchronization.
[Note: The optimizations I describe here are probably not done by many compilers, which might view these mutex and atomic synchronization operations as "black boxes" that cannot be optimized (or even that should not be optimized in order to preserve the predictability of code generation and some particular patterns, which is a bogus argument). I would not be surprised if no compiler did the optimization even in the simplest case, but there is no doubt that it is legal.]
A compiler can easily determine that some variables are never used by multiple threads (or any asynchronous execution), notably an automatic variable whose address is not taken (and to which no reference is bound). Such objects are called here "thread private". (All automatic variables that are candidates for register allocation are thread private.)
For a thread private mutex, no meaningful code needs to be generated for lock/unlock operations: no atomic compare and exchange, no fencing, and often no state needs to be kept at all, except for the case of "safe mutex" where the behavior of recursive locking is well defined and should fail (to make the sequence m.lock(); bool locked = m.try_lock(); work you need to keep at least a boolean state).
This is also true for any thread-private atomic objects: only the underlying non-atomic type is needed and normal operations can be performed (so a fetch-add of 1 becomes a regular increment).
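To make that concrete, here is a minimal sketch (the function and names are purely illustrative, not from any compiler's test suite) of the kind of code where these transformations would be legal: the mutex and the atomic never escape the function, so the lock/unlock and the atomicity may be dropped entirely.

#include <atomic>
#include <mutex>

int thread_private_sum(const int *data, int n)
{
    std::mutex m;            // never escapes this function: thread private
    std::atomic<int> sum{0}; // same: no other thread can observe it

    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(m); // lock/unlock may legally compile to nothing
        sum.fetch_add(data[i]);               // may legally become a plain add
    }
    return sum.load();
}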
The reasons why these transformations are legal:
the obvious observation that if an object is accessed by only one thread or parallel execution (and not even by an asynchronous signal handler), then the use of non-atomic operations in assembly is not detectable by any means;
the less obvious remark that no ordering/memory visibility is implied by any use of a thread-private synchronization object.
All synchronization objects are specified as tools for inter thread communication: they can guarantee that the side effects in one thread are visible in another thread; they cause a well defined order of operations to exist not just in one thread (the sequential order of execution of operations of one thread) but in multiple threads.
A common example is the publication of information with an atomic pointer type:
The shared data is:
atomic<T*> shared; // null pointer by default
The publishing thread does:
T *p = new T;
*p = load_info();
shared.store(p, memory_order_release);
The consuming thread can check whether the data is available by loading the atomic object value, as a consumer:
T *p = shared.load(memory_order_acquire);
if (p) use *p;
(There is no defined way to wait for availability here, it's a simple example to illustrate publication and consumption of the published value.)
The publishing thread only needs to set the atomic variable after it has finished the initialization of all fields; the memory order is a release to communicate the fact that the memory manipulations are finished.
The other threads only need an acquire memory order to "connect" with the release operation if there was one. If the value is still zero, the thread has learned nothing about the world and the acquire is meaningless; it can't act on it. (By the time the thread checks the pointer and sees a null, the shared variable might have been changed already. It doesn't matter, as the designer considered that not having a value in that thread is manageable, or it would have performed the operations in sequence.)
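Putting the fragments together, here is a rough but complete sketch (the struct fields and the direct writes stand in for load_info() and use; the pointer is intentionally leaked to keep the example short):

#include <atomic>
#include <iostream>
#include <thread>

struct T { int a; int b; };

std::atomic<T*> shared{nullptr}; // null pointer by default

void publisher()
{
    T *p = new T;                               // leaked in this sketch
    p->a = 1;                                   // initialize all fields first
    p->b = 2;
    shared.store(p, std::memory_order_release); // then publish
}

void consumer()
{
    T *p = shared.load(std::memory_order_acquire);
    if (p)                                      // may legitimately see null
        std::cout << p->a + p->b << '\n';       // if non-null, the fields are visible
}

int main()
{
    std::thread t1(publisher), t2(consumer);
    t1.join();
    t2.join();
}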
All atomic operations are intended to be possibly lock-free, that is, to finish in a short finite time whatever other threads are doing, even if they are stuck. It means that you can't depend on another thread having finished a job.
At the other end of the spectrum of thread communication primitives, mutexes don't hold a value that can be used to carry information between threads (*), but they ensure that one thread can enter a lock-ops-unlock sequence only after another thread has finished its own lock-ops-unlock sequence.
(*) not even a boolean value, as using a mutex as a general boolean signal (= binary semaphore) between threads is specifically prohibited
A mutex is always used in connection with a set of shared variables: the protected variables or objects V; these V are used to carry information between threads, and the mutex makes access to that information well ordered (or serialized) between threads. In technical terms, all but the first lock operation on M pairs with the previous unlock operation on M:
a lock of M is an acquire operation on M
an unlock of M is a release operation on M
The semantics of locking/unlocking are defined on a single M, so let's stop repeating "on M"; we have threads A and B. The lock by B is an acquire that pairs with the unlock by A. Both operations together form an inter-thread synchronization.
[What about a thread that happens to lock M often and will often re-lock M without any other thread acting on M in the meantime? Nothing interesting: the acquire is still paired with a release, but A = B, so nothing is accomplished. The unlock was sequenced in the same thread of execution, so it's pointless in that particular case, but in general a thread can't tell that it's pointless. It isn't even special-cased by the language semantics.]
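A minimal sketch of that pairing, with an illustrative mutex M and a plain int V as the protected variable (any two threads will do):

#include <iostream>
#include <mutex>
#include <thread>

std::mutex M;
int V = 0; // protected variable: only accessed while holding M

void thread_A()
{
    std::lock_guard<std::mutex> lk(M); // acquire operation on M
    V = 42;
}                                      // release operation on M

void thread_B()
{
    std::lock_guard<std::mutex> lk(M); // acquire on M: pairs with A's release
    std::cout << V << '\n';            // sees either 0 or 42, never a torn value
}

int main()
{
    std::thread a(thread_A), b(thread_B);
    a.join();
    b.join();
}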
The synchronization that occurs is between the set of threads T acting on a mutex: no other thread is guaranteed to be able to view any memory operation performed by those T. Note that in practice, on most real computers, once a memory modification hits the cache, all CPUs will see it if they check that same address, by the power of cache coherence. But C/C++ threads(#) are not specified in terms of a globally consistent cache, nor in terms of ordering visible on a CPU, as the compiler itself can assume that non-atomic objects are not mutated in arbitrary ways by the program without synchronization (the CPU cannot assume any such thing, as it doesn't have a concept of atomic vs. non-atomic memory locations). That means that a guarantee available at the level of the CPU/memory system you are targeting is not in general available in the C/C++ high-level model. You absolutely cannot use normal C/C++ code as a high-level assembly; only by dousing your code with volatile (almost everywhere) can you even vaguely approach high-level assembly (but not quite).
(#) "C/C++ thread/semantics" not "C/C++ programming language thread semantics": C and C++ are based on the same specification for synchronization primitives, that doesn't mean that there is a C/C++ language)
Since the effect of mutex operations on M is only to serialize access to some data by the threads that use M, it's clear that other threads don't see any effect. In technical terms, the synchronizes-with relation is between threads using the same synchronization objects (mutexes in that context, the same atomic objects in the context of atomic use).
Even when the compiler emits memory fences in assembly language, it doesn't have to assume that an unlock operation makes changes performed before the unlock visible to threads outside the set T.
That allows the decomposition of sets of threads for program analysis: if a program runs two sets of threads U and V in parallel, and U and V are created such that they can't access any common synchronization object (but they can access common non-atomic objects), then you can analyse the interactions within U and within V separately from the point of view of thread semantics, as U and V cannot exchange information in well-defined inter-thread ways (they can still exchange information via the system, for example via disk files, sockets, or system-specific shared memory).
(That observation might allow the compiler to optimize some threads without doing a full program analysis, even if some common mutable objects are "pulled in" via a third-party class that has static members.)
Another way to explain that is to say that the semantics of these primitives is not leaking: only those threads that participate get a defined result.
Note that this is only true at the specification level for acquire and release operations, not for sequentially consistent operations (which is the default order for operations on an atomic object when you don't specify a memory order): all sequentially consistent actions (operations on an atomic object or fences) occur in one well-defined global order. That however doesn't mean anything for independent threads having no atomic objects in common.
An order of operations is unlike an order of elements in a container, where you can actually navigate the container, or files being presented as ordered by name. Only objects are observable; the orders of operations are not. Saying that there is a well-defined order only means that values never appear to have provably changed backward (relative to some abstract order).
If you have two unrelated sets that are ordered, say the integers with the usual order and the words with lexicographical order, you can define the sum of these sets as having an order compatible with both orders. You might put the numbers before, after, or interleaved with the words. You are free to do what you want, because the elements in the sum of two sets don't have any relation with each other when they don't come from the same set.
You could say that there is a global order of all mutex operations; it's just not useful, like defining an order over the sum of unrelated sets.
Related
I have a non-atomic variable my_var and an std::mutex my_mut. I assume up to this point in the code, the programmer has followed this rule:
Each time the programmer modifies or writes to my_var, he locks
and unlocks my_mut.
Assuming this, Thread1 performs the following:
my_mut.lock();
my_var.modify();
my_mut.unlock();
Here is the sequence of events I imagine in my mind:
Prior to my_mut.lock();, there were possibly multiple copies of my_var in main memory and some local caches. These values do not necessarily agree, even if the programmer followed the rule.
By the instruction my_mut.lock();, all writes from the previously executed my_mut critical section are visible in memory to this thread.
my_var.modify(); executes.
After my_mut.unlock();, there are possibly multiple copies of my_var in main memory and some local caches. These values do not necessarily agree, even if the programmer followed the rule. The value of my_var at the end of this thread will be visible to the next thread that locks my_mut, by the time it locks my_mut.
I have been having trouble finding a source that verifies that this is exactly how std::mutex should work. I consulted the C++ standard. From ISO 2013, I found this section:
[ Note: For example, a call that acquires a mutex will perform an
acquire operation on the locations comprising the mutex.
Correspondingly, a call that releases the same mutex will perform a
release operation on those same locations. Informally, performing a
release operation on A forces prior side effects on other memory
locations to become visible to other threads that later perform a
consume or an acquire operation on A.
Is my understanding of std::mutex correct?
C++ operates on relations between operations, not on particular hardware terms (like cache coherence). So the C++ Standard has a happens-before relationship, which roughly means that whatever happened before has completed all its side effects and is therefore visible at the moment of whatever happens after.
And having an exclusive critical section that you have entered means that whatever happens within it happens before the next time this critical section is entered. So any subsequent entry into it will see everything that happened before. That's what the Standard mandates. Everything else (including cache coherence) is the implementation's duty: it has to make sure that the described behavior is consistent with what actually happens.
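As a sketch, reusing the question's names and assuming thread 2 happens to lock my_mut after thread 1 has unlocked it (my_var is shown here as a plain int for brevity):

#include <mutex>

std::mutex my_mut;
int my_var = 0; // non-atomic, always accessed under my_mut

void thread1()
{
    my_mut.lock();
    ++my_var;          // (A) the modification
    my_mut.unlock();   // release: everything above becomes visible to the
}                      // next thread that acquires my_mut

void thread2()         // assumed to lock my_mut after thread1 has unlocked it
{
    my_mut.lock();     // acquire: synchronizes-with thread1's unlock
    int seen = my_var; // (B): (A) happens-before (B), so seen == 1
    my_mut.unlock();
    (void)seen;
}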
I'm just exploring the use of acquire and release memory fences and don't understand why the output is sometimes zero and not the value of 2 all the time.
I ran the program a number of times, and assumed the atomic store before the release barrier and the atomic load after the acquire barrier would ensure the values would always synchronise.
#include <iostream>
#include <thread>
#include <atomic>
std::atomic<int> x;
void write()
{
x.store(2, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
}
void read()
{
std::atomic_thread_fence(std::memory_order_acquire);
// THIS DOES NOT GIVE THE EXPECTED VALUE OF 2 SOMETIMES
std::cout << x.load(std::memory_order_relaxed) << std::endl;
}
int main()
{
std::thread t1(write);
std::thread t2(read);
t1.join();
t2.join();
return 0;
}
the atomic variable x gives a value of 0 sometimes
I think you are misunderstanding the purpose of fences. Fences only enforce a certain ordering of memory operations for the compiler and processor within a single thread of execution. Your acquire fence will not magically make the thread wait until the other thread performs the release.
Some literature will describe that a release operation in one thread "synchronizes with" a subsequent acquire operation in another thread. The key to this is that the acquire action is a subsequent action (i.e. the acquire is ordered "after" the release). If the release action is ordered after your acquire action, then there is no synchronizes-with relation between the write and read operations.
The reason why your code doesn't consistently return what you're expecting is because the thread interleavings sometimes order the write before the read, sometimes the read before the write.
If you want to guarantee that thread t2 reads the value 2 that thread t1 publishes, you're going to have to force t2 to wait for the publish to happen. The textbook example almost invariably uses a guard variable that notifies t2 that the data is ready to be consumed.
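Here is a minimal sketch of that guard-variable approach applied to your program; the ready flag is a name I'm introducing, and it is the relaxed store/load on it that lets the two fences pair up:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x{0};
std::atomic<bool> ready{false};

void write()
{
    x.store(2, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release); // everything above ...
    ready.store(true, std::memory_order_relaxed);        // ... is published by this store
}

void read()
{
    while (!ready.load(std::memory_order_relaxed))       // wait for the publish
        ;
    std::atomic_thread_fence(std::memory_order_acquire); // pairs with the release fence
    std::cout << x.load(std::memory_order_relaxed) << '\n'; // now guaranteed to print 2
}

int main()
{
    std::thread t1(write);
    std::thread t2(read);
    t1.join();
    t2.join();
}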
I recommend you read a very well-written blog post about release and acquire semantics and the synchronizes-with relation at Preshing on Programming's The Synchronizes-With Relation.
Looks like you are misusing the fence. You are trying to use it as a mutex, right? If you expect the code to always output 2, you are assuming that the load operation can never be executed before the store. But that is not what a memory fence does; that is what synchronization primitives do.
Fences are much trickier: they just don't allow the compiler/processor to reorder certain types of operations within one thread. At the end of the day, the relative order of execution of two separate threads is undefined.
The reason is simple: your fences accomplish exactly nothing and cannot have any use here anyway, because there is no write that the fence would make visible (on the release side) to the acquiring side.
The simple answer is that the reading thread can run first and obviously will not see any write if it does.
The longer answer is that when your code has a race, as does any code that uses either mutexes or atomics in a non-trivial way, it must be prepared for any race outcome! So you have to make sure that not reading the value written by a write will not break your code.
ADDITIONAL EXPLANATION
A way to explain rel/acq semantics is:
release means "I have accomplished something" and I set that atomic object to some value to publish that claim;
acquire means "have you accomplished something?" and I read that atomic object to see if it contains the claim.
So a release before you have accomplished anything is meaningless, and an acquire that throws away the information containing the claim, as in (void)x.load(memory_order_acquire), is generally meaningless too, as there is no knowledge (in general) of what was acquired, which is to say of what was accomplished. (The exception to that rule is when the thread had earlier relaxed loads or RMW operations.)
I have a few threads writing to a vector. It's possible that different threads try to write the same byte. There are no reads. Can I use only an atomic_fetch_or(), like in the example, so the vector will become thread safe? It compiled with GCC without errors or warnings.
std::vector<std::atomic<uint8_t>> MapVis(1024*1024);
void threador()
{
...
std::atomic_fetch_or(&MapVis[i], testor1);
}
It compiled with GCC without errors or warnings
That doesn't mean anything because compilers don't perform that sort of concurrency analysis. There are dedicated static analysis tools that may do this with varying levels of success.
Can I use only an atomic_fetch_or ...
you certainly can, and it will be safe at the level of each individual std::atomic<uint8_t>.
... the vector will become thread safe?
it's not sufficient that each element is accessed safely. You specifically need to avoid any operation that invalidates iterators (swap, resize, insert, push_back etc.).
I'd hesitate to say vector is thread-safe in this context - but you're limiting yourself to a thread-safe subset of its interface, so it will work correctly.
Note that as VTT suggests, keeping a separate partial vector per thread is better if possible. Partly because it's easier to prove correct, and partly because it avoids false sharing between cores.
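As a sketch of that per-thread alternative (the slicing into workers and the final merge are illustrative, not prescriptive): each thread fills its own plain vector, and the results are OR-merged single-threaded at the end, so no atomics are needed at all.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

constexpr std::size_t kSize = 1024 * 1024;

// Each worker fills its own plain vector; nothing is shared while it runs.
void worker(std::vector<std::uint8_t>& local)
{
    // ... set whatever bits this worker computes ...
    local[123] |= 0x01;
}

int main()
{
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::vector<std::uint8_t>> partial(n, std::vector<std::uint8_t>(kSize));

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t)
        pool.emplace_back(worker, std::ref(partial[t]));
    for (auto& th : pool)
        th.join();

    // Single-threaded merge: OR every partial result into the final map.
    std::vector<std::uint8_t> MapVis(kSize);
    for (const auto& p : partial)
        for (std::size_t i = 0; i < kSize; ++i)
            MapVis[i] |= p[i];
}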
Yes, this is guaranteed to be thread safe because atomic operations guarantee:
Isolation from interrupts, signals, concurrent processes and threads
Thus when you access an element of MapVis atomically, you're guaranteed that any other thread's write to it has already completed, and that your own read-modify-write will not be interleaved with another thread's.
The concern if you were using non-atomic variables would be that:
Thread A fetches the value of MapVis[i]
Thread B fetches the value of MapVis[i]
Thread A writes the ored value to MapVis[i]
Thread B writes the ored value to MapVis[i]
As you can see, Thread B needed to wait until Thread A had finished writing; otherwise it just stomps on Thread A's changes to MapVis[i]. With atomic variables the fetch and write form one indivisible operation, meaning that Thread B cannot interleave with Thread A's read-modify-write.
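A small sketch of the difference (the bit values are illustrative): with std::atomic the OR is one indivisible read-modify-write, so neither thread can lose the other's bit, whereas with a plain uint8_t and `cell |= bit` both threads could read 0 and one bit would be lost.

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

std::atomic<std::uint8_t> cell{0};

void set_bit(std::uint8_t bit)
{
    // Indivisible read-OR-write: another thread's OR cannot slip in between.
    cell.fetch_or(bit, std::memory_order_relaxed);
}

int main()
{
    std::thread a(set_bit, std::uint8_t{0x01});
    std::thread b(set_bit, std::uint8_t{0x02});
    a.join();
    b.join();
    assert(cell.load() == 0x03); // both bits always present; a plain uint8_t
                                 // could end up holding only 0x01 or only 0x02
}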
Consider the following:
#include <atomic>
std::atomic<unsigned> var;
unsigned foo;
unsigned bar;
unsigned is_this_a_full_fence() {
var.store(1, std::memory_order_release);
var.load(std::memory_order_acquire);
bar = 5;
return foo;
}
My thought is the dummy load of var should prevent the subsequent variable accesses of foo and bar from being reordered before the store.
It seems the code creates a barrier against reordering - and at least on x86, release and acquire require no special fencing instructions.
Is this a valid way to code a full fence (LoadStore/StoreStore/StoreLoad/LoadLoad)? What am I missing?
I think the release creates a LoadStore and StoreStore barrier. The acquire creates a LoadStore and LoadLoad barrier. And the dependency between the two variable accesses creates a StoreLoad barrier?
EDIT: change barrier to full fence. Make snippet C++.
One major issue with this code is that the store and subsequent load to the same memory location are clearly not synchronizing with any other thread. In the C++ memory model races are undefined behavior, and the compiler can therefore assume your code didn't have a race. The only way that your load could observe a value different from what was stored is if you had a race. The compiler can therefore, under the C++ memory model, assume that the load observes the stored value.
This exact atomic code sequence appears in my C++ standards committee paper "No sane compiler would optimize atomics", under "Redundant load eliminated". There's a longer CppCon version of this paper on YouTube.
Now imagine C++ weren't such a pedant, and the load / store were guaranteed to stay there despite the inherent racy nature. Real-world ISAs offer such guarantees which C++ doesn't. You provide some happens-before relationship with other threads with acquire / release, but you don't provide a unique total order which all threads agree on. So yes this would act as a fence, but it wouldn't be the same as obtaining sequential consistency, or even total store order. Some architectures could have threads which observe events in a well-defined but different order. That's perfectly fine for some applications! You'll want to look into IRIW (independent reads of independent writes) to learn more about this topic. The x86-TSO paper discusses it specifically in the context of the ad-hoc x86 memory model, as implemented in various processors.
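As a sketch of the transformation the paper describes as legal (whether any given compiler actually performs it is a separate question), the function could be compiled as if it had been written like this:

#include <atomic>
std::atomic<unsigned> var;
unsigned foo;
unsigned bar;

// What the compiler may legally turn is_this_a_full_fence() into:
unsigned after_optimization() {
    var.store(1, std::memory_order_release);
    // Per the reasoning above, the load can be assumed to read the 1 just
    // stored, so it is removed -- and with it any hoped-for #StoreLoad
    // ordering between the write to `bar` and the read of `foo`.
    bar = 5;
    return foo;
}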
Your pseudo-code (which is not valid C++) is not atomic as a whole.
For example, a context switch could happen between the store and the load and some other thread would become scheduled (or is already running on some other core) and would then change the variable in between. Context switches and interrupts can happen at every machine instruction.
Is this a valid way to code a barrier
No, it is not. See also pthread_barrier_init(3p), pthread_barrier_wait(3p) and related functions.
You should read some pthread tutorial (in practice, C++11 threads are a tiny abstraction above them) and consider using mutexes.
Notice that std::memory_order mostly affects the current thread (and what it is observing), and does not forbid it from being interrupted/context-switched ...
See also this answer.
Assuming that you run this code in multiple threads, using ordering like this is not correct because the atomic operations do not synchronize (see link below) and hence foo and bar are not protected.
But it still may have some value to look at guarantees that apply to individual operations.
As an acquire operation, var.load is not reordered (inter-thread) with the operations on foo and bar (hence #LoadStore and #LoadLoad, you got that right).
However, var.store, is not protected against any reordering (in this context).
#StoreLoad reordering can be prevented by tagging both atomic operations seq_cst. In that case, all threads will observe the order as defined (still incorrect though because the non-atomics are unprotected).
EDIT
var.store is not protected against reordering because it acts as a one-way barrier for operations that are sequenced before it (i.e. earlier in program order), and in your code there are no operations before that store.
var.load acts as a one-way barrier for operations that are sequenced after it (i.e. foo and bar).
Here is a basic example of how a variable (foo) is protected by an atomic store/load pair:
// thread 1
foo = 42;
var.store(1, std::memory_order_release);
// thread 2
while (var.load(std::memory_order_acquire) != 1);
assert(foo == 42);
Thread 2 only continues after it observes the value set by thread 1. The store is then said to have synchronized with the load, and the assert cannot fire.
For a complete overview, check Jeff Preshing's blog articles.
I think I mostly understand the semantics of the various memory_order flags in the C++ atomic library.
However, I'm confused about the following situation:
Suppose we have two threads - Thread A, which is a "main execution" thread, and Thread B, which is some arbitrary thread that is part of a thread pool where tasks can be scheduled and run.
If I perform a "read-modify-write" atomic operation using std::memory_order_acq_rel, then perform a non-atomic write on a boolean variable, is the non-atomic write immediately visible to other threads? I would think the answer is no, unless the other threads also access the atomic variable that did the "read-modify-write" operation.
So, for example, given a global std::atomic_flag variable X, a global bool value B, and a thread pool object THREADPOOL that has a member function dispatch, which will execute arbitrary function handlers in another thread:
if (!X.test_and_set(std::memory_order_acq_rel))
{
if (SOME_CONDITION) B = true;
THREADPOOL.dispatch([]() {
// This executes in Thread B
if (B) { /* do something */ } // are we guaranteed to see changes to B?
});
}
So in this example, the code inside the lambda function will be executed in a different thread. Is that thread necessarily going to see the (non-atomic) update to B made in the first thread? Note that the second thread does not access the atomic_flag, so my understanding is that changes to B will not necessarily be seen in the second thread.
Is my understanding correct here? And if so, would using std::memory_order_seq_cst change that?
A correct implementation of the dispatch method in THREADPOOL should provide a happens-before relation between all operations executed by the caller before this method call and all operations executed by the function (a lambda in your case) passed to the method.
So the auxiliary thread that executes your lambda function will definitely see the value of B assigned by the main thread.
Without a happens-before order, the only way to guarantee immediate visibility of a variable modification is to use std::memory_order_seq_cst for both the modification and the reading. See, e.g., this question.
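As an illustrative sketch (not any particular library's implementation), a typical dispatch pushes the task under a mutex and the worker pops it under the same mutex; that mutex hand-off is exactly where the happens-before edge comes from:

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

class ThreadPoolLike {
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> tasks_;

public:
    // Called by the producing thread: everything it did before dispatch()
    // (e.g. the write to B) is sequenced before the unlock of m_.
    void dispatch(std::function<void()> f)
    {
        {
            std::lock_guard<std::mutex> lk(m_);
            tasks_.push_back(std::move(f));
        }                      // release of m_
        cv_.notify_one();
    }

    // Called by the worker thread: its lock of m_ synchronizes-with the
    // producer's unlock, so the producer's writes happen-before the task runs.
    void run_one()
    {
        std::function<void()> f;
        {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !tasks_.empty(); });
            f = std::move(tasks_.front());
            tasks_.pop_front();
        }
        f();
    }
};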
No memory order specification makes future memory accesses visible. At most, they prevent them from becoming visible before the atomic access is visible.
If you want to ensure a particular access does become visible, you must either enforce a particular memory ordering on that access or you must have a future access that uses memory ordering to ensure it is sequenced after the access you want to make visible.
All atomic operations are atomic. Memory ordering only allows you to do three things:
Establish ordering of this atomic operation with respect to prior operations, atomic or not -- this operation is guaranteed to come after them.
Establish ordering of this operation with respect to future operations, atomic or not -- this operation is guaranteed to come before them.
Establish ordering with other atomic operations.
None of these ensure future non-atomic operations occur "soon" or become visible at any particular time.
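For example, a minimal sketch of the second option (the published flag is a name I'm introducing): the worker's own acquire load is the future access that makes the earlier plain write to B visible to it.

#include <atomic>

std::atomic<bool> published{false};
bool B = false;

// Producing thread:
void produce()
{
    B = true;                                          // plain write
    published.store(true, std::memory_order_release);  // release: orders the write to B before it
}

// Worker thread (e.g. inside the dispatched lambda):
void consume()
{
    if (published.load(std::memory_order_acquire))     // acquire: pairs with the release store
    {
        // here B == true is guaranteed to be visible
    }
}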