I have two closely related questions, both based on this SO question:
Thread synchronization problem with c++ std::atomic variables
As cppreference.com explains:
For memory_order_acquire:
A load operation with this memory order performs the acquire operation
on the affected memory location: no reads or writes in the current
thread can be reordered before this load. All writes in other
threads that release the same atomic variable are visible in the
current thread
For memory_order_release: A store operation with this memory order
performs the release operation: no reads or writes in the current
thread can be reordered after this store. All writes in the current
thread are visible in other threads that acquire the same atomic
variable
Why do people say that memory_order_seq_cst MUST be used for that example to work properly? What's the purpose of memory_order_acquire if it doesn't work the way the official documentation says it does?
The documentation clearly says: All writes in other threads that release the same atomic variable are visible in the current thread.
Why should that example from the SO question never print "bad\n"? It just doesn't make sense to me.
I did my homework by reading all the available documentation, SO questions/answers, googling, etc., but I'm still not able to understand some things.
Your linked question has two atomic variables, while your cppreference quote specifically mentions the "same atomic variable". That's why the reference text doesn't cover the linked question.
Quoting further from cppreference: memory_order_seq_cst : "...a single total order exists in which all threads observe all modifications in the same order".
So that does cover modifications to two atomic variables.
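To make this concrete, here is a minimal sketch of the store-buffering pattern that the linked question is based on (the variable names and the exact output are my own, for illustration only). With release/acquire, both threads may observe the other's flag as false, which is the "bad" outcome; if all four operations use memory_order_seq_cst instead, the single total order makes that outcome impossible.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> x{false}, y{false};

void thread1()
{
    x.store(true, std::memory_order_release);   // seq_cst here rules out "bad"
    if (!y.load(std::memory_order_acquire))     // seq_cst here rules out "bad"
        std::puts("thread1 saw y == false");
}

void thread2()
{
    y.store(true, std::memory_order_release);
    if (!x.load(std::memory_order_acquire))
        std::puts("thread2 saw x == false");
}

int main()
{
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
}

// "bad" here means both messages are printed in the same run: with
// release/acquire that is allowed, with seq_cst everywhere it is not.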
Essentially, the design problem with memory_order_release is that it's a data equivalent of GOTO, which we know is a problem since Dijkstra. And memory_order_acquire is the equivalent of a COMEFROM, which is usually reserved for April Fools. I'm not yet convinced that they're good additions to C++.
Related
Do atomic operations synchronize between threads? I know that no thread can see such an operation undone, but do they synchronize? For example, if I write to some variable in one thread and read it afterwards (in the time domain) from another, is there a possibility that I still see the old value?
Atomics by default provide sequential consistency (SC). SC doesn't need to respect the real-time order. So it could be that after a write has executed (and even retired), when a different CPU does a load, it will not see that write. So in the real-time order the load occurred after the write, but in the memory order it has 'happened before' the write.
See the following answer for more info:
Does 'volatile' guarantee that any thread reads the most recently written value?
In general, load-acquire/store-release synchronization is one of the most common forms of memory-ordering based synchronization in the C++11 memory model. It's basically how a mutex provides memory ordering. The "critical section" between a load-acquire and a store-release is always synchronized among different observer threads, in the sense that all observer threads will agree on what happens after the acquire and before the release.
Generally, this is achieved with a read-modify-write instruction (like compare-exchange) along with an acquire barrier when entering the critical section, and another read-modify-write instruction with a release barrier when exiting the critical section.
But there are some situations where you might have a similar critical section[1] between a load-acquire and a release-store, except only one thread actually modifies the synchronization variable. Other threads may read the synchronization variable, but only one thread actually modifies it. In this case, when entering the critical section, you don't need a read-modify-write instruction. You would just need a simple store, since you are not racing with other threads that are attempting to modify the synchronization flag. (This may seem odd, but note that many lock-free memory reclamation deferral patterns, like user-space RCU or epoch based reclamation, use thread-local synchronization variables that are written to only by one thread, but read by many threads, so this isn't too weird of a situation.)
So, when entering the critical section, you could just do something like:
sync_var.store(true, ...);
.... critical section ....
sync_var.store(false, std::memory_order_release);
There is no race, because, again, there is no need for a read-modify-write when only one thread needs to set/unset the critical section variable. Other threads can simply read the critical section variable with a load-acquire.
The problem is, when you're entering the critical section, you need an acquire operation or fence. But you don't need to do a LOAD, only a STORE. So what is a good way to produce acquire ordering when you only really need a STORE? I see only two real options that fall within the C++ memory model. Either:
Use an exchange instead of a store, so you can do sync_var.exchange(true, std::memory_order_acquire). The downside here is that exchange is a more heavy-weight read-modify-write operation, when all you really need is a simple store.
Insert a "dummy" load-acquire, like:
(void)sync_var.load(std::memory_order_acquire);
sync_var.store(true, std::memory_order_relaxed);
The "dummy" load-acquire seems better. Presumably, the compiler can't optimize away the unused load, because it's an atomic instruction that has the side-effect of producing a "synchronizes-with" relationship with a release operation on sync_var. But it also seems very hacky, and the intention is unclear without comments explaining what's going on.
So what is the best way to produce acquire semantics when all we need to do is a simple store?
[1] I use the term "critical section" loosely. I don't necessarily mean a section that is always accessed via mutual exclusion. Rather, I just mean any section where memory ordering is synchronized via acquire-release semantics. This could refer to a mutex, or it could just mean something like RCU, where the critical section can be accessed concurrently by multiple readers.
The flaw in your logic is the assumption that an atomic RMW is not required just because the data in the critical section is modified by a single thread while all other threads only have read access.
This is not true; there still needs to be a well-defined order between reading and writing.
You don't want data to be modified while another thread is still reading it. Therefore, each thread needs to inform other threads when it has finished accessing the data.
If you only use an atomic store to enter the critical section, the 'synchronizes-with' relationship cannot be established.
Acquire/release synchronization is based on a runtime relationship where the acquirer knows that synchronization is complete only after observing a particular value returned by the atomic load.
This can never be achieved by a single atomic store, since the one modifying thread can change the atomic variable sync_var at any time
and as such has no way of knowing whether another thread is still reading the data.
The option with a 'dummy' load/acquire is also invalid because it fails to inform other threads that it wants exclusive access. You attempt to solve that by using a single (relaxed) store,
but the load and the store are separate operations that can be interrupted by other threads (i.e. multiple threads simultaneously accessing the critical area).
An atomic RMW must be used by each thread to load a particular value and at the same time update the variable to inform all other threads it has now exclusive access
(regardless of whether that is for reading or writing).
std::atomic<bool> sync_var{false};

void lock()
{
    // spin until the exchange observes 'false', i.e. we atomically flip the
    // flag from false to true and thereby gain exclusive access
    while (sync_var.exchange(true, std::memory_order_acquire));
}

void unlock()
{
    sync_var.store(false, std::memory_order_release);
}
Optimizations are possible where multiple threads have read-access at the same time (eg. std::shared_mutex).
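As a rough illustration of that kind of optimization (the names here are mine, not from the question), std::shared_mutex lets any number of readers hold the lock at the same time while a writer still gets exclusive access:

#include <mutex>
#include <shared_mutex>

std::shared_mutex rw_mutex;
int shared_data = 0;

int read_data()
{
    std::shared_lock<std::shared_mutex> lock(rw_mutex);  // many readers at once
    return shared_data;
}

void write_data(int value)
{
    std::unique_lock<std::shared_mutex> lock(rw_mutex);  // exclusive access
    shared_data = value;
}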
My understanding of std::memory_order_acquire and std::memory_order_release is as follows:
Acquire means that no memory accesses which appear after the acquire fence can be reordered to before the fence.
Release means that no memory accesses which appear before the release fence can be reordered to after the fence.
What I don't understand is why with the C++11 atomics library in particular, the acquire fence is associated with load operations, while the release fence is associated with store operations.
To clarify, the C++11 <atomic> library enables you to specify memory fences in two ways: either you can specify a fence as an extra argument to an atomic operation, like:
x.load(std::memory_order_acquire);
Or you can use std::memory_order_relaxed and specify the fence separately, like:
x.load(std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_acquire);
What I don't understand is, given the above definitions of acquire and release, why does C++11 specifically associate acquire with load, and release with store? Yes, I've seen many of the examples that show how you can use an acquire/load with a release/store to synchronize between threads, but in general it seems that the idea of acquire fences (prevent memory reordering after statement) and release fences (prevent memory reordering before statement) is orthogonal to the idea of loads and stores.
So, why, for example, won't the compiler let me say:
x.store(10, std::memory_order_acquire);
I realize I can accomplish the above by using memory_order_relaxed, and then a separate atomic_thread_fence(memory_order_acquire) statement, but again, why can't I use store directly with memory_order_acquire?
A possible use case for this might be if I want to ensure that some store, say x = 10, happens before some other statement executes that might affect other threads.
Say I write some data, and then I write an indication that the data is now ready. It's imperative that any other thread that sees the indication that the data is ready also sees the write of the data itself. So prior writes cannot move past that write.
Say I read that some data is ready. It's imperative that any reads I issue after seeing that take place after the read that saw that the data was ready. So subsequent reads cannot be reordered to before that read.
So when you do a synchronized write, you typically need to make sure that all writes you did before that are visible to anyone who sees the synchronized write. And when you do a synchronized read, it's typically imperative that any reads you do after that take place after the synchronized read.
Or, to put it another way, an acquire is typically reading that you can take or access the resource, and subsequent reads and writes must not be moved before it. A release is typically writing that you are done with the resource, and preceding writes must not be moved to after it.
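As a minimal sketch of that pattern (the names data and ready are mine, purely for illustration): the release store publishes the indication only after the data write, and the acquire load guarantees the subsequent read of the data happens after the load that saw the indication.

#include <atomic>

int data = 0;                       // plain, non-atomic payload
std::atomic<bool> ready{false};

void producer()
{
    data = 42;                                      // write the data first
    ready.store(true, std::memory_order_release);   // then publish the indication
}

int consumer()
{
    while (!ready.load(std::memory_order_acquire))  // wait for the indication
        ;
    return data;   // guaranteed to read 42: the acquire synchronized with the release
}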
(Partial answer correcting a mistake in the early part of the question. David Schwartz's answer already nicely covers the main question you're asking. Jeff Preshing's article on acquire / release is also good reading for another take on it.)
The definitions you gave for acquire / release are wrong for fences; they only apply to acquire operations and release operations, like x.store(mo_release), not std::atomic_thread_fence(mo_release).
Acquire means that no memory accesses which appear after the acquire fence can be reordered to before the fence. [wrong, would be correct for acquire operation]
Release means that no memory accesses which appear before the release fence can be reordered to after the fence. [wrong, would be correct for release operation]
They're insufficient for fences, which is why ISO C++ has stronger ordering rules for acquire fences (blocking LoadStore / LoadLoad reordering) and release fences (LoadStore / StoreStore).
Of course ISO C++ doesn't define "reordering"; that would imply there is some global coherent state that you're accessing. ISO C++ instead defines ordering in terms of which values a load is allowed to see, via happens-before and synchronizes-with relationships.
Jeff Preshing's articles are relevant here:
Acquire and Release Semantics (acquire / release operations such as loads, stores, and RMWs)
Acquire and Release Fences Don't Work the Way You'd Expect explains why those one-way barrier definitions are incorrect and insufficient for fences, unlike for operations. (Because it would let the fence reorder all the way to one end of your program and leave all the operations unordered wrt. each other, because it's not tied to an operation itself.)
A possible use case for this might be if I want to ensure that some store, say x = 10, happens before some other statement executes that might affect other threads.
If that "other statement" is a load from an atomic shared variable, you actually need std::memory_order_seq_cst to avoid StoreLoad reordering. acquire / release / acq_rel won't block that.
If you mean make sure the atomic store is visible before some other atomic store, the normal way is to make the 2nd atomic store use mo_release.
If the 2nd store isn't atomic, it's unlikely any reader could safely sync with anything in a way that it could observe the value without data-race UB.
(Although you do run into a use case for a release fence when hacking up a SeqLock that uses plain non-atomic objects for the payload, to allow a compiler to optimize. But that's an implementation-specific behaviour that depends on knowing how std::atomic stuff compiles for real CPUs. See Implementing 64 bit atomic counter with 32 bit atomics for example.)
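For contrast with acquire and release operations, here is a sketch of the same publish pattern written with relaxed operations plus standalone fences (variable names are mine); the release fence before the relaxed store and the acquire fence after the relaxed load are what establish the synchronizes-with relationship:

#include <atomic>

int payload = 0;
std::atomic<bool> flag{false};

void writer()
{
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);  // orders the payload write
    flag.store(true, std::memory_order_relaxed);           // before this store
}

int reader()
{
    while (!flag.load(std::memory_order_relaxed))
        ;
    std::atomic_thread_fence(std::memory_order_acquire);   // orders later reads after
    return payload;                                         // the load that saw true
}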
A std::memory_order_acquire fence only ensures that load operations after the fence are not reordered before load operations before the fence; thus memory_order_acquire cannot ensure that the store is visible to other threads when the later loads are executed. This is why memory_order_acquire is not supported for store operations; you may need memory_order_seq_cst to achieve an acquire-like store.
As an alternative, you may say
x.store(10, std::memory_order_relaxed);
x.load(std::memory_order_acquire); // this introduces a data dependency
to ensure that later loads are not reordered before the store. Again, a fence does not work here.
Besides, a memory order on an atomic operation can be cheaper than a memory fence, because it only constrains ordering relative to that atomic instruction, not all instructions before and after a fence.
See also the formal description and explanation for details.
I am new to using threads and have read a lot about how data is shared and protected, but I still haven't really got a good grasp of using mutexes and locks to protect data.
Below is a description of the problem I will be working on. The important thing to note is that it will be time-critical, so I need to reduce overheads as much as possible.
I have two fixed-size double arrays.
The first array will provide data for subsequent calculations.
Threads will read values from it, but it will never be modified. An element may be read at some time by any of the threads.
The second array will be used to store the results of the calculations performed by the threads. An element of this array will only ever be updated by one thread, and probably only once when the result value
is written to it.
My questions then:
Do I really need to use a mutex in a thread each time I access the data from the read-only array? If so, could you explain why?
Do I need to use a mutex in a thread when it writes to the result array even though this will be the only thread that ever writes to this element?
Should I use atomic data types, and will there be any significant time overhead if I do?
Many answers to this type of question seem to be - no, you don't need the mutex if your variables are aligned. Would my array elements in this example be aligned, or is there some way to ensure they are?
The code will be implemented on 64-bit Linux. I am planning on using the Boost libraries for multithreading.
I have been mulling this over and looking all over the web for days, and once posted, the answer and clear explanations came back in literally seconds. There is an "accepted answer," but all the answers and comments were equally helpful.
Do I really need to use a mutex in a thread each time I access the data from the read-only array? If so, could you explain why?
No. Because the data is never modified, there cannot be a synchronization problem.
Do I need to use a mutex in a thread when it writes to the result array even though this will be the only thread that ever writes to this element?
Depends.
If any other thread is going to read the element, you need synchronization.
If any thread may modify the size of the vector, you need synchronization.
In any case, take care not to have different threads write to adjacent memory locations frequently. That could destroy performance; see "false sharing". Considering you probably don't have a lot of cores (and therefore not a lot of threads) and you say the write is done only once, this is probably not going to be a significant problem, though.
Should I use atomic data types, and will there be any significant time overhead if I do?
If you use locks (mutex), atomic variables are not necessary (and they do have overhead). If you need no synchronization, atomic variables are not necessary. If you need synchronization, then atomic variables can be used to avoid locks in some cases. In which cases can you use atomics instead of locks... is more complicated and beyond the scope of this question I think.
Given the description of your situation in the comments, it seems that no synchronization is required at all and therefore no atomics nor locks.
...Would my array elements in this example be aligned, or is there some way to ensure they are?
As pointed out by Arvid, you can request specific alignment using the alignas keyword, which was introduced in C++11. Pre-C++11, you may resort to compiler-specific extensions: https://gcc.gnu.org/onlinedocs/gcc-5.1.0/gcc/Variable-Attributes.html
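If false sharing in the result array did turn out to matter, one option (a sketch only, assuming a 64-byte cache line; the names are mine) is to pad each per-thread result slot to its own cache line with alignas:

#include <cstddef>

// Assumes 64-byte cache lines; C++17's std::hardware_destructive_interference_size
// can be used instead where available.
struct alignas(64) PaddedResult {
    double value;
};

PaddedResult results[8];   // one slot per thread, each on its own cache line

void store_result(std::size_t thread_index, double computed)
{
    results[thread_index].value = computed;   // no false sharing between threads
}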
Under the two conditions given, there's no need for mutexes. Remember every use of a mutex (or any synchronization construct) is a performance overhead. So you want to avoid them as much as possible (without compromising correct code, of course).
No. Mutexes are not needed since threads are only reading the array.
No. Since each thread only writes to a distinct memory location, no race condition is possible.
No. There's no need for atomic access to objects here. In fact, using atomic objects could affect performance negatively, as it prevents optimizations such as reordering operations.
The only time you need to use locks is when data on a shared resource is modified. E.g. if some threads were used to write data and some to read data (in both cases from the same resource), then you only need a lock for when writing is done. This is to prevent what's known as a "race".
There is good information about races on Google for when you write programs that manipulate data on a shared resource.
You are on the right track.
1) For the first array (read-only), you do not need a mutex lock. Since the threads are only reading, not altering, the data, there is no way one thread can corrupt the data for another thread.
2) I'm a little confused by this question. If you know that thread 1 will only write an element to array slot 1 and thread 2 will only write to array slot 2, then you do not need a mutex lock. However, I'm not sure how you're achieving this property. If my above statement is not correct for your situation, you would definitely need a mutex lock.
3) Given the definition of atomic:
Atomic types are types that encapsulate a value whose access is guaranteed to not cause data races and can be used to synchronize memory accesses among different threads.
Key note: grabbing a mutex lock is atomic, meaning that only one assembly instruction is needed to grab/release a lock. If it required two assembly instructions to grab/release a lock, the lock would not be thread-safe. For example, if thread 1 attempted to grab the lock and was switched out in favour of thread 2, thread 2 could grab the lock.
Use of atomic data types would decrease your overhead but not significantly.
4) I'm not sure how you can ensure your variables are aligned, since threads can switch at any moment in your program (your OS determines when a thread switches).
Hope this helps
I'd like to write a C++ lock-free object where there are many logger threads logging to a large global (non-atomic) ring buffer, with an occasional reader thread which wants to read as much data in the buffer as possible. I ended up having a global atomic counter where loggers get locations to write to, and each logger increments the counter atomically before writing. The reader tries to read the buffer and per-logger local (atomic) variable to know whether particular buffer entries are busy being written by some logger, so as to avoid using them.
So I have to do synchronization between a pure reader thread and many writer threads. I sense that the problem can be solved without using locks, and I can rely on "happens after" relation to determine whether my program is correct.
I've tried relaxed atomic operations, but they won't work: atomic variable stores are releases and loads are acquires, and the guarantee is that some acquire (and its subsequent work) always "happens after" some release (and its preceding work). That means there is no way for the reader thread (which does no store at all) to guarantee that something "happens after" the time it reads the buffer, which means I don't know whether some logger has overwritten part of the buffer while the thread is reading it.
So I turned to sequential consistency. For me, "atomic" means Boost.Atomic, whose notion of sequential consistency has a documented "pattern":
The third pattern for coordinating threads via Boost.Atomic uses
seq_cst for coordination: If ...
thread1 performs an operation A,
thread1 subsequently performs any operation with seq_cst,
thread1 subsequently performs an operation B,
thread2 performs an operation C,
thread2 subsequently performs any operation with seq_cst,
thread2 subsequently performs an operation D,
then either "A happens-before D" or "C happens-before B" holds.
Note that the second and fifth lines say "any operation", without saying whether it modifies anything, or what it operates on. This provides the guarantee that I wanted.
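As a sketch of how I read that pattern (the operations and names below are mine, purely illustrative, with all data accesses kept atomic to avoid data races): the quoted guarantee says that either A happens-before D or C happens-before B, i.e. the two relaxed loads below cannot both return 0. Whether the strict C++11 wording gives exactly the same guarantee for seq_cst operations on different variables is part of what I'm asking.

#include <atomic>

std::atomic<int> x{0}, y{0};     // the data, accessed with relaxed atomics
std::atomic<int> s1{0}, s2{0};   // variables touched by the seq_cst operations

void thread1()
{
    x.store(1, std::memory_order_relaxed);      // operation A
    s1.store(1, std::memory_order_seq_cst);     // "any operation with seq_cst"
    int b = y.load(std::memory_order_relaxed);  // operation B
    (void)b;
}

void thread2()
{
    y.store(1, std::memory_order_relaxed);      // operation C
    s2.store(1, std::memory_order_seq_cst);     // "any operation with seq_cst"
    int d = x.load(std::memory_order_relaxed);  // operation D
    (void)d;
}

// Per the quoted Boost pattern, either A happens-before D or C happens-before B,
// so b and d cannot both be 0.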
All was happy until I watched Herb Sutter's talk titled "atomic<> Weapons". What he implies is that seq_cst is just an acq_rel, with the additional guarantee of a consistent ordering of atomic stores. I turned to cppreference.com, which has a similar description.
So my questions:
Do C++11 and Boost.Atomic implement the same memory model?
If (1) is "yes", does it mean the "pattern" described by Boost is somehow implied by the C++11 memory model? How? Or does it mean the documentation of either Boost or C++11 in cppreference is wrong?
If (1) is "no", or (2) is "yes, but Boost documentation is incorrect", is there any way to achieve the effect I want in C++11, namely to have guarantee that (the work subsequent to) some atomic store happens after (the work preceding) some atomic load?
I saw no answer here, so I asked again in the Boost user mailing list.
I saw no answer there either (apart from a suggestion to look into
Boost lockfree), so I planned to ask Herb Sutter (expecting no answer
anyway). But before doing that, I Googled "C++ memory model" a little
more deeply. After reading a page of Hans Boehm
(http://www.hboehm.info/c++mm/), I could answer most of my own
question. I Googled a bit more, this time for "C++ Data Race", and
landed at a page by Bartosz Milewski
(http://bartoszmilewski.com/2014/10/25/dealing-with-benign-data-races-the-c-way/).
Then I could answer even more of my own question. Unluckily, I still
don't know how to do what I want to do given that knowledge. Perhaps
what I want to do is actually unachievable in standard C++.
The first part of my question: "Do C++11 and Boost.Atomic implement
the same memory model?" The answer is, mostly, "yes". The second part
of my question: "If (1) is 'yes', does it mean the "pattern"
described by Boost is somehow implied by the C++11 memory model?" The
answer is again, yes. "How?" is answered by a proof found here
(http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2392.html).
Essentially, for data race free programs, the little bit added to
acq_rel is sufficient to guarantee the behavior required by seq_cst.
So both sets of documentation, although perhaps confusing, are correct.
Now the real problem: although both (1) and (2) get "yes" answers, my
original program is wrong! I neglected (actually, I was unaware of) an
important rule of C++: a program with a data race has undefined behavior
(rather than "unspecified" or "implementation-defined" behavior). That
is, the compiler guarantees behavior of my program only if my program
has absolutely no data race. Without a lock, my program contains a
data race: the pure reader thread can read any time, even at a time
when the logger thread is busy writing. This is "undefined behavior",
and the rule says that the computer can do anything (the "catch fire"
rule). To fix it, one has to use ideas found in the page of Bartosz
Milewski I mentioned earlier, i.e., change the ring buffer to contain
only atomic content, so that the compiler knows that its ordering is
important and must not be reordered with the operations marked to
require sequential consistency. If overhead minimization is desired,
one can write to it using relaxed atomic operations.
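A minimal sketch of what I mean (the size, types, and names are arbitrary): the ring buffer holds std::atomic elements so that the loggers' writes and the reader's copies are data-race free, with the per-element accesses done as relaxed operations and the slot counter providing the coordination.

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBufSize = 1024;          // arbitrary size
std::atomic<std::uint64_t> ring[kBufSize];      // atomic content: no data race
std::atomic<std::size_t> next_slot{0};          // global counter shared by loggers

void log_entry(std::uint64_t value)
{
    // claim a slot atomically, then fill it with a relaxed store
    std::size_t slot = next_slot.fetch_add(1, std::memory_order_seq_cst) % kBufSize;
    ring[slot].store(value, std::memory_order_relaxed);
}

void dump(std::uint64_t* out)
{
    // the reader must also copy element by element with atomic loads;
    // a plain memcpy over the buffer would reintroduce the data race
    for (std::size_t i = 0; i < kBufSize; ++i)
        out[i] = ring[i].load(std::memory_order_relaxed);
}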
Unluckily, this applies to the reader thread too. I can no longer
just "memcpy" the whole memory buffer. Instead I must also use
relaxed atomic operations to read the buffer, one word after another.
This kills performance, but I have no choice actually. Luckily for
me, the dumper's performance is not important to me at all: it rarely
gets run anyway. But if I do want the performance of "memcpy", I
would get an answer of "no solution": C++ provides no semantics of "I
know there is a data race; you can return anything to me here but don't
screw up my program". Either you ensure that there is no data race
and pay the cost to get everything well defined, or you have a data
race and the compiler is allowed to put you to jail.