Take a seq_cst fence for example. The main explanations I've seen are:
1. It gives you the synchronizes-with relationships of acquire and release (if you include the appropriate loads and stores) and also all the fences happen in the same order for all threads (which doesn't seem very useful).
2. also this. It prevents all memory reads or writes on the current thread from being reordered with ones on the other side of the fence (which seems very useful).
3. Incomprehensible standardese, except for the line "in many cases, memory_order_seq_cst atomic operations are reorderable with respect to other atomic operations performed by the same thread" which seems to contradict number 2.
How can these definitions mean the same thing? I find synchronizes-with a useful way to think about acquire and release; is there a similarly elegant mental model for seq_cst?
Your (1) is the easy-to-understand explanation.
I don't see how your links relate to your statement in (2). I don't see the statement you wrote there anywhere in the Rust article you linked. You also linked an explanation of a #LoadStore fence, but it doesn't say anything about how this relates to sequentially consistent operations.
The C++11 standard seems to support your statement, though:
There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happens before” order.
In C++20, your (2) definitely holds: memory_order_seq_cst operations and fences can't be reordered with respect to each other in any way.
Regarding (3) I don't know if I can help you with comprehending the standardese. Reading more of Preshing's blog posts might help – for example The Synchronizes-With Relation. Regarding this statement:
in many cases, memory_order_seq_cst atomic operations are reorderable with respect to other atomic operations performed by the same thread
The "other" atomic operations are non-seq_cst operations. They can still be reordered in respect to the seq_cst operations. For example, these two statements are allowed to be reordered:
std::atomic<int> a, b;
b.load(std::memory_order_seq_cst);
a.store(1, std::memory_order_relaxed);
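By contrast, if both operations are seq_cst they both participate in the single total order S, so (per the C++20 wording above) they may not be reordered with respect to each other. A minimal sketch, reusing the hypothetical variables a and b:

std::atomic<int> a, b;
int r = b.load(std::memory_order_seq_cst);   // part of the total order S
a.store(1, std::memory_order_seq_cst);       // also part of S: may not be reordered before the load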
Related
I have basically two questions that are closely related and they are both based on this SO question:
Thread synchronization problem with c++ std::atomic variables
As cppreference.com explains:
For memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread.
For memory_order_release: A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable.
Why do people say that memory_order_seq_cst MUST be used in order for that example to work properly? What's the purpose of memory_order_acquire if it doesn't work as the official documentation says?
The documentation clearly says: All writes in other threads that release the same atomic variable are visible in the current thread.
Why should that example from the SO question never print "bad\n"? It just doesn't make sense to me.
I did my homework by reading all the available documentation, SO questions/answers, googling, etc., but I'm still not able to understand some things.
Your linked question has two atomic variables, while your cppreference quote specifically mentions the "same atomic variable". That's why the reference text doesn't cover the linked question.
Quoting further from cppreference: memory_order_seq_cst : "...a single total order exists in which all threads observe all modifications in the same order".
So that does cover modifications to two atomic variables.
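To make that concrete, here is a sketch of the two-variable pattern the linked question boils down to (the names x, y and the exact thread bodies are illustrative, not the original code). With seq_cst stores and loads the single total order forbids both threads from reading 0; with release stores and acquire loads, both could read 0 and the "bad" case would be allowed.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

void t1() {
    x.store(1, std::memory_order_seq_cst);
    if (y.load(std::memory_order_seq_cst) == 0)
        std::puts("t1 read y == 0");
}

void t2() {
    y.store(1, std::memory_order_seq_cst);
    if (x.load(std::memory_order_seq_cst) == 0)
        std::puts("t2 read x == 0");
}

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    // With seq_cst, at most one of the two messages can appear in a run.
    // With acquire/release (or weaker), both could appear in the same run.
}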
Essentially, the design problem with memory_order_release is that it's the data equivalent of a GOTO, which we have known to be a problem since Dijkstra. And memory_order_acquire is the equivalent of a COMEFROM, which is usually reserved for April Fools' jokes. I'm not yet convinced that they're good additions to C++.
For example, is calling std::mutex::lock() required by the Standard to provide a sequentially consistent fence, an acquire fence, or neither?
cppreference.com doesn't seem to address this topic. Is it addressed in any reference documentation that's easier to use than the Standard or working papers?
I'm not sure about an easier source, but here's a quote from a note in the standard:
[...] a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A.
I think that answers the question about memory fences reasonably well (and although it's "only" a note, not a normative part of the standard, I'd say it's as reliable a description of the standard as any other site could hope to provide).
std::atomic and std::mutex operations never require full 2-way fences. That does happen in practice on some ISAs as an implementation detail, notably x86, but not AArch64.
Even std::atomic<T> atomic RMWs with the default memory_order_seq_cst aren't as strong as full 2-way fences, I think. On real ISAs where SC RMWs can be done without being much stronger than required (specifically AArch64), I'm not sure they stop relaxed operations on opposite sides from reordering with each other. (Happening between the load and store parts of the atomic RMW).
As Jerry Coffin says, taking a std::mutex is only an acquire operation in the ISO C++ standard, not an acquire fence. It's not like std::atomic_thread_fence(std::memory_order_acquire), it's only required to be as strong as foo.exchange(std::memory_order_acquire).
The lack of mention of requiring a 2-way fence makes it clear that one isn't required or guaranteed by the standard. An acquire operation like taking a mutex allows 1-way reordering with itself, so relaxed operations before/after it can potentially reorder with each other. (That's why fences and operations are different things.)
Being any stronger than that is an implementation detail. For example, on x86 any atomic RMW operation is a full barrier, waiting for the store buffer to drain and for all earlier loads to complete before RMWing the cache line. So it's like a std::atomic_thread_fence(seq_cst) tied to the foo.exchange(); in fact a dummy lock add byte [rsp], 0 is how most compilers implement that C++ fence, because unfortunately mfence is slower on most CPUs.
Taking a mutex always requires an atomic RMW, but some machines can do that in ways that allow limited reordering with surrounding operations. E.g. AArch64 can use ldaxr (sequential-acquire load-linked) / stxr (plain store-conditional, not stlxr with release semantics) to implement .exchange(acquire) or .compare_exchange_weak(acquire). See an example compiled to asm for AArch64 on Godbolt, and also "atomic exchange with memory_order_acquire and memory_order_release" and "For purposes of ordering, is atomic read-modify-write one operation or two?"
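A small sketch of the practical difference (mtx, a and b are hypothetical names; this is about what the standard permits, not about what any particular ISA actually does):

#include <atomic>
#include <mutex>

std::mutex mtx;
std::atomic<int> a{0}, b{0};

void example() {
    a.store(1, std::memory_order_relaxed);       // allowed to sink below the acquire...
    mtx.lock();                                  // acquire *operation*, not an acquire fence
    int r = b.load(std::memory_order_relaxed);   // ...so this load and the store above are
                                                 // not ordered with respect to each other
    mtx.unlock();
    (void)r;
}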
If T is a C++ fundamental type, and if std::atomic<T>::is_lock_free() returns true, then is there anything in std::atomic<T> that is wait-free (not just lock-free)? Like, load, store, fetch_add, fetch_sub, compare_exchange_weak, and compare_exchange_strong.
Can you also answer based on what is specified in the C++ Standard, and on what is implemented in Clang and/or GCC (your version of choice)?
My favorite definitions for lock-free and wait-free are taken from C++ Concurrency in Action (available for free). An algorithm is lock-free if it satisfies the first condition below, and it is wait-free if it satisfies both conditions below:
If one of the threads accessing the data structure is suspended by the scheduler midway through its operation, the other threads must still be able to complete their operations without waiting for the suspended thread.
Every thread accessing the data structure can complete its operation within a bounded number of steps, regardless of the behavior of other threads.
There exist universally accepted definitions of lock-freedom and wait-freedom, and the definitions provided in your question are consistent with those. And I would strongly assume that the C++ standard committee sticks to the definitions that are universally accepted in the scientific community.
In general, publications on lock-free/wait-free algorithms assume that CPU instructions are wait-free. Instead, the arguments about progress guarantees focus on the software algorithm.
Based on this assumption I would argue that any std::atomic method that can be translated to a single atomic instruction for some architecture is wait-free on that specific architecture. Whether such a translation is possible sometimes depends on how the method is used though. Take for example fetch_or. On x86 this can be translated to lock or, but only if you do not use its return value, because this instruction does not provide the original value. If you use the return value, then the compiler will create a CAS-loop, which is lock-free, but not wait-free. (And the same goes for fetch_and/fetch_xor.)
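A sketch of that difference (whether a compiler actually emits a single lock or versus a CAS loop depends on the compiler, flags and target, so treat this as illustrative):

#include <atomic>

std::atomic<unsigned> flags{0};

void set_bit_discard_old() {
    // Return value unused: on x86, gcc/clang can emit a single `lock or` (wait-free).
    flags.fetch_or(0x4);
}

unsigned set_bit_need_old() {
    // Return value used: typically compiled to a CAS loop (`lock cmpxchg`),
    // which is lock-free but not wait-free.
    return flags.fetch_or(0x4);
}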
So which methods are actually wait-free depends not only on the compiler, but mostly on the target architecture.
Whether it is technically correct to assume that a single instruction is actually wait-free or not is a rather philosophical question IMHO. True, it might not be guaranteed that an instruction finishes execution within a bounded number of "steps" (whatever such a step might be), but the machine instruction is still the smallest unit on the lowest level that we can see and control. Actually, if you cannot assume that a single instruction is wait-free, then strictly speaking it is not possible to run any real-time code on that architecture, because real-time also requires strict bounds on time/the number of steps.
This is what the C++17 standard states in [intro.progress]:
Executions of atomic functions that are either defined to be lock-free (32.8) or indicated as lock-free (32.5) are lock-free executions.
If there is only one thread that is not blocked (3.6) in a standard library function, a lock-free execution in that thread shall complete. [ Note: Concurrently executing threads may prevent progress of a lock-free execution. For example, this situation can occur with load-locked store-conditional implementations. This property is sometimes termed obstruction-free. — end note ]
When one or more lock-free executions run concurrently, at least one should complete. [ Note: It is difficult for some implementations to provide absolute guarantees to this effect, since repeated and particularly inopportune interference from other threads may prevent forward progress, e.g., by repeatedly stealing a cache line for unrelated purposes between load-locked and store-conditional instructions. Implementations should ensure that such effects cannot indefinitely delay progress under expected operating conditions, and that such anomalies can therefore safely be ignored by programmers. Outside this document, this property is sometimes termed lock-free. — end note ]
The other answer correctly pointed out that my original answer was a bit imprecise, since there exist two stronger subtypes of wait-freedom.
wait-free - A method is wait-free if it guarantees that every call finishes its execution in a finite number of steps, i.e., it is not necessarily possible to determine an upper bound, but the number of steps must still be guaranteed to be finite.
wait-free bounded - A method is wait-free bounded if it guarantees that every call finishes its execution in a bounded number of steps, where this bound may depend on the number of threads.
wait-free population oblivious - A method is wait-free population oblivious if it guarantees that every call finishes its execution in a bounded number of steps, and this bound does not depend on the number of threads.
So strictly speaking, the definition in the question is consistent with the definition of wait-free bounded.
In practice, most wait-free algorithms are actually wait-free bounded or even wait-free population oblivious, i.e., it is possible to determine an upper bound on the number of steps.
Since there are many definitions of wait-freedom1 and people choose different ones, I think that a precise definition is paramount, and a distinction between its specializations is necessary and useful.
These are the universally accepted definitions of wait-freedom and its specializations:
wait-free:
All threads will make progress in a finite number of steps.
wait-free bounded:
All threads will make progress in a bounded number of steps, which may depend on the number of threads.
wait-free population-oblivious3:
All threads will make progress in a fixed number of steps that does not depend on the number of threads.
Overall, the C++ standard makes no distinction between lock-free and wait-free (see this other answer). It always gives guarantees no stronger than lock-free.
When std::atomic<T>::is_lock_free() returns true, the implementation uses atomic instructions, possibly with CAS loops or LL/SC loops, instead of mutexes.
Atomic instructions are wait-free. CAS and LL/SC loops are lock-free.
How a method is implemented depends on many factors, including its usage, the compiler, the target architecture and its version. For example:
As someone says, on x86 gcc, fetch_add() for std::atomic<double> uses a CAS loop (lock cmpxchg), while for std::atomic<int> it uses lock add or lock xadd (a sketch of such a CAS loop follows this list).
As someone else says, on architectures featuring LL/SC instructions, fetch_add() uses a LL/SC loop if no better instructions are available. For example, this is not the case on ARM versions 8.1 and above, where ldaddal is used for non-relaxed std::atomic<int> and ldadd is used if relaxed.
As stated in this other answer, on x86 gcc fetch_or() uses lock or if the return value is not used, otherwise it uses a CAS loop (lock cmpxchg).
As explained in this answer of mine:
The store() method and lock add, lock xadd, lock or instructions are wait-free population-oblivious, while their "algorithm", that is the work performed by the hardware to lock the cache line, is wait-free bounded.
The load() method is always wait-free population-oblivious.
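For reference, this is roughly the shape of the CAS loop that the std::atomic<double> fetch_add() example above refers to (a hand-written sketch, not the actual code gcc generates; the fetch_add member for floating-point atomics is a C++20 feature, which is why the sketch writes the loop out explicitly):

#include <atomic>

double fetch_add_via_cas(std::atomic<double>& a, double arg) {
    double expected = a.load(std::memory_order_relaxed);
    // Lock-free but not wait-free: another thread can keep modifying `a`,
    // forcing this loop to retry an unbounded number of times.
    while (!a.compare_exchange_weak(expected, expected + arg,
                                    std::memory_order_seq_cst,
                                    std::memory_order_relaxed)) {
    }
    return expected;   // the value observed just before our successful update
}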
1 For example:
all threads will make progress in a finite number of steps (source)
all threads will make progress in a bounded number of steps2
per "step" that they all execute, all threads will make forward progress without any starvation (source)
2 It is not clear whether the bound is constant, or it may depend on the number of threads.
3 A strange name and not good for an acronym, so maybe another one should be chosen.
My understanding of std::memory_order_acquire and std::memory_order_release is as follows:
Acquire means that no memory accesses which appear after the acquire fence can be reordered to before the fence.
Release means that no memory accesses which appear before the release fence can be reordered to after the fence.
What I don't understand is why with the C++11 atomics library in particular, the acquire fence is associated with load operations, while the release fence is associated with store operations.
To clarify, the C++11 <atomic> library enables you to specify memory fences in two ways: either you can specify a fence as an extra argument to an atomic operation, like:
x.load(std::memory_order_acquire);
Or you can use std::memory_order_relaxed and specify the fence separately, like:
x.load(std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_acquire);
What I don't understand is, given the above definitions of acquire and release, why does C++11 specifically associate acquire with load, and release with store? Yes, I've seen many of the examples that show how you can use an acquire/load with a release/store to synchronize between threads, but in general it seems that the idea of acquire fences (prevent memory reordering after statement) and release fences (prevent memory reordering before statement) is orthogonal to the idea of loads and stores.
So, why, for example, won't the compiler let me say:
x.store(10, std::memory_order_acquire);
I realize I can accomplish the above by using memory_order_relaxed, and then a separate atomic_thread_fence(memory_order_acquire) statement, but again, why can't I use store directly with memory_order_acquire?
A possible use case for this might be if I want to ensure that some store, say x = 10, happens before some other statement executes that might affect other threads.
Say I write some data, and then I write an indication that the data is now ready. It's imperative that any other thread that sees the indication that the data is ready also sees the write of the data itself. So prior writes cannot move past that write.
Say I read that some data is ready. It's imperative that any reads I issue after seeing that take place after the read that saw that the data was ready. So subsequent reads cannot be moved before that read.
So when you do a synchronized write, you typically need to make sure that all writes you did before that are visible to anyone who sees the synchronized write. And when you do a synchronized read, it's typically imperative that any reads you do after that take place after the synchronized read.
Or, to put it another way, an acquire is typically reading that you can take or access the resource, and subsequent reads and writes must not be moved before it. A release is typically writing that you are done with the resource, and preceding writes must not be moved to after it.
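In code, the classic shape of what this answer describes looks like the following sketch (data and ready are hypothetical names):

#include <atomic>

int data = 0;                       // plain, non-atomic payload
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                     // (A) write the payload
    ready.store(true, std::memory_order_release);  // (B) release store: (A) cannot move after (B)
}

void consumer() {
    if (ready.load(std::memory_order_acquire)) {   // (C) acquire load: (D) cannot move before (C)
        int r = data;                              // (D) guaranteed to see 42 whenever (C) saw true
        (void)r;
    }
}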
(Partial answer correcting a mistake in the early part of the question. David Schwartz's answer already nicely covers the main question you're asking. Jeff Preshing's article on acquire / release is also good reading for another take on it.)
The definitions you gave for acquire / release are wrong for fences; they only apply to acquire operations and release operations, like x.store(mo_release), not std::atomic_thread_fence(mo_release).
Acquire means that no memory accesses which appear after the acquire fence can be reordered to before the fence. [wrong, would be correct for acquire operation]
Release means that no memory accesses which appear before the release fence can be reordered to after the fence. [wrong, would be correct for release operation]
They're insufficient for fences, which is why ISO C++ has stronger ordering rules for acquire fences (blocking LoadStore / LoadLoad reordering) and release fences (LoadStore / StoreStore).
Of course ISO C++ doesn't define "reordering"; that would imply there is some global coherent state that you're accessing. ISO C++ instead defines the allowed behaviours in terms of relations like happens-before and which values a given load is allowed to observe.
Jeff Preshing's articles are relevant here:
Acquire and Release Semantics (acquire / release operations such as loads, stores, and RMWs)
Acquire and Release Fences Don't Work the Way You'd Expect explains why those one-way barrier definitions are incorrect and insufficient for fences, unlike for operations. (Because it would let the fence reorder all the way to one end of your program and leave all the operations unordered wrt. each other, because it's not tied to an operation itself.)
A possible use case for this might be if I want to ensure that some store, say x = 10, happens before some other statement executes that might affect other threads.
If that "other statement" is a load from an atomic shared variable, you actually need std::memory_order_seq_cst to avoid StoreLoad reordering. acquire / release / acq_rel won't block that.
If you mean make sure the atomic store is visible before some other atomic store, the normal way is to make the 2nd atomic store use mo_release.
If the 2nd store isn't atomic, it's unlikely any reader could safely sync with anything in a way that it could observe the value without data-race UB.
(Although you do run into a use case for a release fence when hacking up a SeqLock that uses plain non-atomic objects for the payload, to allow a compiler to optimize. But that's an implementation-specific behaviour that depends on knowing how std::atomic stuff compiles for real CPUs. See Implementing 64 bit atomic counter with 32 bit atomics for example.)
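For concreteness, here is a minimal single-writer SeqLock sketch of that idea. To stay within what ISO C++ guarantees, the payload words here are relaxed atomics rather than the plain non-atomic objects mentioned above; the release/acquire fences are the part the parenthetical is about. All names are illustrative.

#include <atomic>
#include <cstdint>

struct SeqLock {
    std::atomic<std::uint32_t> seq{0};
    std::atomic<std::uint32_t> lo{0}, hi{0};   // 64-bit payload split into two words

    void write(std::uint32_t l, std::uint32_t h) {            // assumes a single writer
        std::uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);           // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);   // keep payload stores after the odd store
        lo.store(l, std::memory_order_relaxed);
        hi.store(h, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);           // even again: publish
    }

    bool try_read(std::uint32_t& l, std::uint32_t& h) {        // returns false => caller retries
        std::uint32_t s0 = seq.load(std::memory_order_acquire);
        if (s0 & 1) return false;                               // writer currently active
        l = lo.load(std::memory_order_relaxed);
        h = hi.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);    // keep payload loads before the re-check
        return seq.load(std::memory_order_relaxed) == s0;       // seq changed => possibly torn read
    }
};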
A std::memory_order_acquire fence only ensures that reads and writes after the fence are not reordered before loads that come before the fence; thus memory_order_acquire cannot ensure that a store becomes visible to other threads before later loads are executed. This is why memory_order_acquire is not supported for store operations; you may need memory_order_seq_cst to get the effect of an "acquire store".
As an alternative, you may say
x.store(10, std::memory_order_relaxed);
x.load(std::memory_order_acquire); // this load introduces an ordering dependency
to ensure that later loads are not reordered before the store. Again, a fence does not work here.
Besides, a memory order on an atomic operation can be cheaper than a memory fence, because it only constrains ordering relative to that atomic instruction, not all instructions before and after a fence.
See also the formal description and explanation for details.
I'd like to write a C++ lock-free object where there are many logger threads logging to a large global (non-atomic) ring buffer, with an occasional reader thread which wants to read as much data in the buffer as possible. I ended up having a global atomic counter where loggers get locations to write to, and each logger increments the counter atomically before writing. The reader tries to read the buffer and per-logger local (atomic) variable to know whether particular buffer entries are busy being written by some logger, so as to avoid using them.
So I have to do synchronization between a pure reader thread and many writer threads. I sense that the problem can be solved without using locks, and I can rely on "happens after" relation to determine whether my program is correct.
I've tried plain release/acquire atomic operations, but they won't work: atomic variable stores are releases and loads are acquires, and the guarantee is that some acquire (and its subsequent work) always "happens after" some release (and its preceding work). That means there is no way for the reader thread (doing no store at all) to guarantee that something "happens after" the time it reads the buffer, which means I don't know whether some logger has overwritten part of the buffer while the thread is reading it.
So I turned to sequential consistency. For me, "atomic" means Boost.Atomic, whose notion of sequential consistency has a documented "pattern":
The third pattern for coordinating threads via Boost.Atomic uses seq_cst for coordination: If ...
thread1 performs an operation A,
thread1 subsequently performs any operation with seq_cst,
thread1 subsequently performs an operation B,
thread2 performs an operation C,
thread2 subsequently performs any operation with seq_cst,
thread2 subsequently performs an operation D,
then either "A happens-before D" or "C happens-before B" holds.
Note that the second and fifth lines say "any operation", without saying whether it modifies anything or what it operates on. This provides the guarantee that I wanted.
All was happy until I watched Herb Sutter's talk titled "atomic<> Weapons". What he implies is that seq_cst is just acq_rel with the additional guarantee of a consistent ordering of atomic stores. I turned to cppreference.com, which has a similar description.
So my questions:
Does C++11 and Boost Atomic implement the same memory model?
If (1) is "yes", does it mean the "pattern" described by Boost is somehow implied by the C++11 memory model? How? Or does it mean the documentation of either Boost or C++11 in cppreference is wrong?
If (1) is "no", or (2) is "yes, but Boost documentation is incorrect", is there any way to achieve the effect I want in C++11, namely to have guarantee that (the work subsequent to) some atomic store happens after (the work preceding) some atomic load?
I saw no answer here, so I asked again on the Boost user mailing list. I saw no answer there either (apart from a suggestion to look into Boost.Lockfree), so I planned to ask Herb Sutter (expecting no answer anyway). But before doing that, I Googled "C++ memory model" a little more deeply. After reading a page by Hans Boehm (http://www.hboehm.info/c++mm/), I could answer most of my own question. I Googled a bit more, this time for "C++ data race", and landed on a page by Bartosz Milewski (http://bartoszmilewski.com/2014/10/25/dealing-with-benign-data-races-the-c-way/). Then I could answer even more of my own question. Unluckily, I still don't know how to do what I want to do given that knowledge. Perhaps what I want to do is actually unachievable in standard C++.
The first part of my question: "Do C++11 and Boost.Atomic implement the same memory model?" The answer is, mostly, yes. The second part: "If (1) is 'yes', does it mean the 'pattern' described by Boost is somehow implied by the C++11 memory model?" The answer is again yes. "How?" is answered by a proof found here (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2392.html). Essentially, for data-race-free programs, the little bit added to acq_rel is sufficient to guarantee the behavior required by seq_cst. So both pieces of documentation, although perhaps confusing, are correct.
Now the real problem: although both (1) and (2) get "yes" answers, my original program is wrong! I neglected (actually, I was unaware of) an important rule of C++: a program with a data race has undefined behavior (rather than "unspecified" or "implementation-defined" behavior). That is, the compiler guarantees the behavior of my program only if my program has absolutely no data race. Without a lock, my program contains a data race: the pure reader thread can read at any time, even while a logger thread is busy writing. This is undefined behavior, and the rule says the computer can do anything (the "catch fire" rule). To fix it, one has to use the ideas found in the page by Bartosz Milewski mentioned earlier, i.e., change the ring buffer to contain only atomic content, so that the compiler knows its ordering is important and must not be reordered with the operations marked as requiring sequential consistency. If overhead minimization is desired, one can write to it using relaxed atomic operations.
Unluckily, this applies to the reader thread too. I can no longer just memcpy the whole memory buffer; instead I must also use relaxed atomic operations to read the buffer, one word after another.
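A sketch of that word-by-word read (the buffer size and element type are illustrative):

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 1024;
std::atomic<std::uint64_t> ring[N];    // the buffer content itself is atomic now

// Instead of memcpy(dst, ring, sizeof(ring)): relaxed loads impose no ordering,
// they only make the concurrent accesses data-race free.
void dump(std::uint64_t* dst) {
    for (std::size_t i = 0; i < N; ++i)
        dst[i] = ring[i].load(std::memory_order_relaxed);
}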
This kills performance, but I have no choice. Luckily for me, the dumper's performance is not important at all: it rarely gets run anyway. But if I did want the performance of memcpy, the answer would be "no solution": C++ provides no way to say "I know there is a data race, you can return anything to me here, but don't screw up my program". Either you ensure that there is no data race and pay the cost to get everything well defined, or you have a data race and the compiler is allowed to put you in jail.