How do changes (reads/writes) to std::atomic variables propagate across threads - C++

I recently asked this question: do-i-need-to-use-memory-barriers-to-protect-a-shared-resource
To that question I got a very interesting answer that uses this hypothesis:
Changes to std::atomic variables are guaranteed to propagate across threads.
Why is this so? How is it done? How does this behavior fit within the MESI protocol?

They don't actually have to propagate, the cache coherency model (MESI or something more advanced) provides you a guarantee that the memory behaves coherently, almost as if it's flat and no cached copies exist. Sequential consistency adds to that a guarantee of the same observation order by all agents in the system (notice - most CPUs don't provide sequential consistency through HW alone).
If a thread does a memory write (not even atomic), the core it runs on will fetch the line and obtain ownership over it. Once the write is done, any thread that attempts to observe the line is guaranteed to see the updated value, even if the line still resides in the modifying core - usually this is achieved through snooping the core and getting the line from it as a response. The cache coherency protocols will guarantee that if such a modification exists locally in some core - any other core looking for that line is bound to see it eventually. To do that, the CPU may use snoop filters, directory management (often for cross-socket coherency), or other methods.
Now, you're asking why atomic is important. For two reasons. First - all the above applies only if the variable resides in memory, and not a register. This is a compiler decision, so the correct type tells it to do so. Other paradigms (like OpenMP or POSIX threads) have other ways to tell the compiler that a variable needs to be shared through memory.
Second - modern cores execute operations out-of-order, and we don't want any other operation to pass that write and expose stale data. std::atomic tells the compiler to enforce the strongest memory ordering (through the use of explicit fencing or locking - check out the generated assembly code), which means that all your memory operations from all threads will have the same global ordering. If you didn't do that, strange things can happen, like core A and core B disagreeing on the order of 2 writes to the same location (meaning that they may see different final values in it).
Last, of course, is the actual atomicity - if your data type is not one that has atomicity guaranteed, or it's not properly aligned - this will also solve that problem for you (otherwise the coherency problem intensifies - think of some thread trying to change a value split between 2 cache lines, and different cores seeing partial values)
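
To make the first two points concrete, here is a minimal sketch (the producer/consumer/ready/payload names are mine, not from the answer above): the atomic flag forces the value to live in memory and orders the surrounding accesses, whereas with a plain bool the compiler would be free to cache the flag in a register and spin forever.

    #include <atomic>
    #include <thread>

    std::atomic<bool> ready{false};   // lives in memory, not hoisted into a register
    int payload = 0;                  // ordinary data, published through the flag

    void producer() {
        payload = 42;                                   // plain write
        ready.store(true, std::memory_order_release);   // publish: the payload write can't move past this
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) { }  // re-reads memory on every iteration
        int seen = payload;   // guaranteed to see 42: acquire pairs with the release above
        (void)seen;
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join(); t2.join();
    }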

Related

MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?

In regards to std::atomic, the C++11 standard states that stores to an atomic variable will become visible to loads of that variable in a "reasonable amount of time".
From 29.3p13:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
However I was curious to know what actually happens when dealing with specific CPU architectures which are based on the MESI cache coherency protocol (x86, x86-64, ARM, etc.).
If my understanding of the MESI protocol is correct, a core will always read the value previously written/being written by another core immediately, possibly by snooping it. (because writing a value means issuing a RFO request which in turn invalidates other cache lines)
Does it mean that when a thread A stores a value into an std::atomic, another thread B which does a load on that atomic successively will in fact always observe the new value written by A on MESI architectures? (Assuming no other threads are doing operations on that atomic)
By “successively” I mean after thread A has issued the atomic store. (Modification order has been updated)
I'll answer for what happens on real implementations on real CPUs, because an answer based only on the standard can barely say anything useful about time or "immediacy".
MESI is just an implementation detail that ISO C++ doesn't have anything to say about. The guarantees provided by ISO C++ only involve order, not actual time. ISO C++ is intentionally non-specific to avoid assuming that it will execute on a "normal" CPU. An implementation on a non-coherent machine that required explicit flushes for store visibility might be theoretically possible (although probably horrible for performance of release / acquire and seq-cst operations)
C++ is non-specific enough about timing to even allow an implementation on a single-core cooperative multi-tasking system (no pre-emption), with the compiler inserting voluntary yields occasionally. (Infinite loops without any volatile accesses or I/O are UB). C++ on a system where only one thread can actually be executing at once is totally fine and possible, assuming you consider a scheduler timeslice to still be a "reasonable" amount of time. (Or less if you yield or otherwise block.)
Even the model of formalism ISO C++ uses to give the guarantees it does about ordering is very different from the way hardware ISAs define their memory models. C++ formal guarantees are purely in terms of happens-before and synchronizes-with, not "re"-ordering litmus tests or any kind of stuff like that. e.g. How to achieve a StoreLoad barrier in C++11? is impossible to answer for pure ISO C++ formalism. The "option C" in that Q&A serves to show just how weak the C++ guarantees are; that case with store then load of two different SC variables is not sufficient to imply happens-before based on it, according to the C++ formalism, even though there has to be a total order of all SC operations. But it is sufficient in real life on systems with coherent cache and only local (within each CPU core) memory reordering, even AArch64 where the SC load right after the SC store does still essentially give us a StoreLoad barrier.
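
For a concrete picture of that case, here is a sketch of the store-buffering litmus test (the variable and thread names are mine): each thread does a seq_cst store to one variable and then a seq_cst load of the other. The single total order of seq_cst operations forbids both loads returning 0, even though no happens-before edge between the threads is implied.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() { x.store(1); r1 = y.load(); }   // both operations are seq_cst by default
    void t2() { y.store(1); r2 = x.load(); }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        assert(!(r1 == 0 && r2 == 0));   // forbidden under seq_cst; relaxed (or even acquire/release) would allow it
    }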
when a thread A stores a value into an std::atomic
It depends what you mean by "doing" a store.
If you mean committing from the store buffer into L1d cache, then yes, that's the moment when a store becomes globally visible, on a normal machine that uses MESI to give all CPU cores a coherent view of memory.
Although note that on some ISAs, some other threads are allowed to see stores before they become globally visible via cache. (i.e. the hardware memory model may not be "multi-copy atomic", and allow IRIW reordering. POWER is the only example I know of that does this in real life. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for details on the HW mechanism: Store forwarding for retired aka graduated stores between SMT threads.)
If you mean executing locally so later loads in this thread can see it, then no. std::atomic can use a memory_order weaker than seq_cst.
All mainstream ISAs have memory-ordering rules weak enough to allow for a store buffer to decouple instruction execution from commit to cache. This also allows speculative out-of-order execution by giving stores somewhere private to live after execution, before we're sure that they were on the correct path of execution. (Stores can't commit to L1d until after the store instruction retires from the out-of-order part of the back end, and thus is known to be non-speculative.)
If you want to wait for your store to be visible to other threads before doing any later loads, use atomic_thread_fence(memory_order_seq_cst);. (Which on "normal" ISAs with standard choice of C++ -> asm mappings will compile to a full barrier).
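A sketch of where such a fence would sit (the flag names here are mine, for illustration): even with relaxed atomics around it, the fence keeps the later load from completing before the earlier store is visible.

    #include <atomic>

    std::atomic<int> my_flag{0}, other_flag{0};   // hypothetical shared flags

    int publish_then_check() {
        my_flag.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);   // full barrier on the usual C++ -> asm mappings
        return other_flag.load(std::memory_order_relaxed);     // not satisfied before the store above is visible
    }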
On most ISAs, a seq_cst store (the default) will also stall all later loads (and stores) in this thread until the store is globally visible. But on AArch64, STLR is a sequential-release store and execution of later loads/stores doesn't have to stall unless / until a LDAR (acquire load) is about to execute while the STLR is still in the store buffer. This implements SC semantics as weakly as possible, assuming AArch64 hardware actually works that way instead of just treating it as a store + full barrier.
But note that only blocking later loads/stores is necessary; out-of-order exec of ALU instructions on registers can still continue. But if you were expecting some kind of timing effect due to dependency chains of FP operations, for example, that's not something you can depend on in C++.
Even if you do use seq_cst so nothing happens in this thread before the store is visible to others, that's still not instant. Inter-core latency on real hardware can be on the order of maybe 40ns on mainstream modern Intel x86, for example. (This thread doesn't have to stall that long on a memory barrier instruction; some of that time is the cache miss on the other thread trying to read the line that was invalidated by this core's RFO to get exclusive ownership.) Or of course much cheaper for logical cores that share the L1d cache of a physical core: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?
From 29.3p13:
Implementations should make atomic stores visible to atomic loads
within a reasonable amount of time.
The C and C++ standards are all over the place on threads, hence not usable as formal specifications. They use the concept of time, and somewhat imply that everything runs step by step, sequentially (if not, you wouldn't have sound program semantics), and then say that some constructs can see effects out of order, without ever saying which is which.
When effects are seen out of order, thread time is ill defined as you don't have a chronometer that would also be out of order: you wouldn't do sport with out of order execution of actions!
Even "out of order" suggests that some things are purely sequential and some other operations can be "out of order" with respect to the firsts. That is not how std::atomic is defined.
What the standards try to say is that there is a notion of progress for each thread, with a CPU time or cost index, which increases as more stuff is done, and that stuff can only be slightly reordered by the implementation: now reordering is well defined, not in terms of other sequential instructions, but in terms of cost/cycles/CPU time.
So if two instructions are close to each other in the sequential intra-thread execution, they will be close in CPU time too. A reasonable compiler shouldn't move a volatile operation, a file output, or an atomic operation past a very costly "pure" computation (one that has no externally visible side effect).
A basic idea that many committee members sadly couldn't even spell out!

Will other threads see a write to a `volatile` word sized variable in reasonable time?

When asking about a more specific problem I discovered this is the core issue where people are not exactly sure.
The following assumptions can be made:
CPU does use a cache coherency protocol like MESI(F) (examples: x86/x86_64 and ARMv7mp)
variable is assumed to be of a size which is atomically written/read by the processor (aligned and native word size)
The variable is declared volatile
The questions are:
If I write to the variable in one thread, will other threads see the change?
What is the order of magnitude of the timeframe in which the other threads will see the change?
Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?
The question is NOT:
Is it safe to use such a variable?
about reordering issues
about C++11 atomics
This might be considered a duplicate of In C/C++, are volatile variables guaranteed to have eventually consistent semantics betwen threads? and other similar questions, but I think none of these have those clear requirements regarding the target architecture which leads to a lot of confusion about differing assumptions.
Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?
I"m not aware of any single processor with multiple cores that has cache coherency issues. It might be possible for someone to use the wrong type of processor in a multi-processor board, for example an Intel processor that has what Intel calls external QPI disabled, but this would cause all sorts of issues.
Wiki article about Intel's QPI and which processors have it enabled or disabled:
http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect
If I write to the variable in one thread, will other threads see the change?
There is no guarantee. If you think there is, show me where you found it.
What is the order of magnitude of the timeframe in which the other threads will see the change?
It can be never. There is no guarantee.
Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?
This is an incoherent question because you are talking about operations in C++ code that has to be compiled into assembly code. Even if you have hardware guarantees that apply to assembly code, there's no guarantee those guarantees "pass through" to C++ code.
But to the extent the question can be answered, the answer is yes. Posted writes, read prefetching, and other kinds of caching (such as what compilers do with registers) exist in real platforms.
I'd say no, there is no guarantee. There are implementations using multiple, independent computers where shared data has to be transmitted over a (usually very fast) connection between computers. In that situation, you'd try to transmit data only when it is needed. This might be triggered by mutexes, for example, and by the standard atomic functions, but hopefully not by stores into arbitrary local memory, and maybe not by stores into volatile memory.
I may be wrong, but you'd have to prove me wrong.
Assuming nowadays x86/64:
If I write to the variable in one thread, will other threads see the change?
Yes. Assuming you use a modern and not very old / buggy compiler.
What is the order of magnitude of the timeframe in which the other threads will see the change?
It really depends on how you measure.
Basically, this would be the memory latency time, roughly 200 cycles on the same NUMA node. About double that on another node, on a 2-node box. It might differ on bigger boxes.
If your write gets reordered relative to the point of time measurement, you can get +/-50 cycles.
I measured this a few years back and got 60-70ns on 3GHz boxes, and double that on the other node.
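One rough way to take such a measurement is a ping-pong microbenchmark between two threads (this sketch and its names are mine, not the original poster's setup). The number printed is a full round trip, so the one-way visibility latency is roughly half, and the result only means much if the two threads are pinned to known cores.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    std::atomic<int> ball{0};
    constexpr int kIters = 1000000;

    void pong() {
        for (int i = 0; i < kIters; ++i) {
            while (ball.load(std::memory_order_acquire) != 1) { }   // wait for the other side
            ball.store(0, std::memory_order_release);               // hand it back
        }
    }

    int main() {
        std::thread t(pong);
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < kIters; ++i) {
            ball.store(1, std::memory_order_release);
            while (ball.load(std::memory_order_acquire) != 0) { }
        }
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("round trip: ~%lld ns\n", (long long)(ns / kIters));
        t.join();
    }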
Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?
I think the meaning of cache coherency is visibility. Having said that, I'm not sure Sun RISC machines (SPARC) have the same cache coherency and memory-ordering guarantees as x86, so I'd test very carefully on them. Specifically, you might need to add memory release barriers to force flushing of memory writes.
Given the assumptions you have described, there is no guarantee that a write of a volatile variable in one thread will be "seen" in another.
Given that, your second question (about the timeframe) is not applicable.
With (multi-processor) PowerPC architectures, cache coherency is not sufficient to ensure cross-core visibility of a volatile variable. There are explicit instructions that need to be executed to ensure state is flushed (and to make it visible across multiple processors and their caches).
In practice, on architectures that require such instructions to be executed, the implementation of data synchronisation primitives (mutexes, semaphores, critical sections, etc) does - among other things - use those instructions.
More broadly, the volatile keyword in C++ has nothing to do with multithreading at all, let alone anything to do with cross-cache coherency. volatile, within a given thread of execution, translates to a need for things like fetches and writes of the variable not being eliminated or reordered by the compiler (which affects optimisation). It does not translate into any requirement about ordering or synchronisation of the completion of fetches or writes between threads of execution - and such requirements are necessary for cache coherency.
Notionally, a compiler might be implemented to provide such guarantees. I've yet to see any information about one that does so - which is not surprising, as providing such a guarantee would seriously affect performance of multithreaded code by forcing synchronisation between threads - even if the programmer has not used synchronisation (mutexes, etc) in their code.
Similarly, the host platform could also notionally provide such guarantees with volatile variables - even if the instructions being executed don't specifically require them. Again, that would tend to reduce performance of multithreaded programs - including modern operating systems - on those platforms. It would also affect (or negate) the benefits of various features that contribute to performance of modern processors, such as pipelining, by forcing processors to wait on each other.
If, as a C++ developer (as distinct from someone writing code that exploits specific features offered by your particular compiler or host platform), you want a variable written in one thread to be coherently read by another thread, then don't bother with volatile. Perform synchronisation between threads - when they need to access the same variable concurrently - using provided techniques such as mutexes. And follow the usual guidelines on using those techniques (e.g. use mutexes sparingly and minimise the time for which they are held, and do as much as possible in your threads without accessing variables that are shared between threads at all).
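As a concrete version of that advice, a minimal sketch (the names are mine): a plain int shared through a std::mutex, with no volatile anywhere; the lock/unlock pair provides both the visibility and the ordering that volatile does not.

    #include <mutex>
    #include <thread>

    std::mutex m;
    int shared_value = 0;   // plain int, not volatile

    void writer() {
        std::lock_guard<std::mutex> lock(m);
        shared_value = 42;     // the unlock at scope exit releases this write to the next locker
    }

    int reader() {
        std::lock_guard<std::mutex> lock(m);   // acquiring the lock makes prior protected writes visible
        return shared_value;
    }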

Is memory ordering in C++11 about main memory flush ordering?

I'm not sure I fully understand (and I may have it all wrong) the concepts of atomicity and memory ordering in C++11.
Let's take this simple single-threaded example:
#include <atomic>

int main()
{
    std::atomic<int> a(0);
    std::atomic<int> b(0);
    a.store(16);
    b.store(10);
    return 0;
}
In this single-threaded code, if a and b were not atomic types, the compiler could have reordered the instructions so that, in the assembly code, I would have for instance a move instruction assigning 10 to 'b' before the move instruction assigning 16 to 'a'.
So to me, their being atomic variables guarantees that I'd have the 'a' move instruction before the 'b' move instruction, as I stated in my source code.
After that, there is the processor with its execution units, instruction prefetching, and out-of-order machinery. And this processor can process the 'b' instruction before the 'a' instruction, whatever the instruction ordering in the assembly code.
So I can have 10 stored in a register, in the store buffer of a processor, or in cache memory before I have 16 stored in a register / store buffer or in cache.
As I understand it, this is where the memory ordering model comes in. From that moment, if I keep the default model (sequentially consistent), I'm guaranteed that flushing these values (10 and 16) out to main memory will respect the order in which I did the stores in my source code: the processor will first flush the register or cache holding 16 out to main memory to update 'a', and after that it will flush 10 to main memory for 'b'.
That understanding would also suggest that if I use a relaxed memory model, only this last part is no longer guaranteed, so the flushes to main memory can happen in any order.
Sorry if I'm hard to read; my English is still poor. Thank you for your time.
The C++ memory model is about the abstract machine and value visibility, not about concrete things like "main memory", "write queues" or "flushing".
In your example, the memory model states that since the write to a happens-before the write to b, any thread that reads the 10 from b must, on subsequent reads from a, see 16 (unless this has since been overwritten, of course).
The important thing here is establishing happens-before relationships and value visibility. How this maps to caches and memory is up to the compiler. In my opinion, it's better to stay on that abstract level instead of trying to map the model to your understanding of the hardware, because
Your understanding of the hardware might be wrong. Hardware is even more complicated than the C++ memory model.
Even if your understanding is correct now, a later version of the hardware might have a different model, at least in subsystems.
By mapping to a hardware model, you might then make wrong assumptions about the implications for a different hardware model. E.g. if you understand how the memory model maps to x86 hardware, you will not understand the subtle difference between consume and acquire on PowerPC.
The C++ model is very well suited for reasoning about correctness.
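
A sketch of what that visibility guarantee means for the snippet above (the reader thread and the global placement of a and b are mine, added for illustration): if the reader observes the 10 in b, it is guaranteed to also observe the 16 in a.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> a(0), b(0);

    void writer() {
        a.store(16);   // seq_cst by default
        b.store(10);
    }

    void reader() {
        if (b.load() == 10)          // if the 10 is visible...
            assert(a.load() == 16);  // ...the earlier store to a must be visible too
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join(); t2.join();
    }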
You didn't specify which architecture you work with, but basically each has its own memory ordering model (sometimes more than one that you can choose from), and that serves as a "contract". The compiler should be aware of that and use lightweight or heavyweight instructions accordingly to guarantee what it needs in order to provide the memory model of the language.
The HW implementation under the hood can be quite complicated, but in a nutshell - you don't need to flush in order to get global visibility. Modern cache systems provide snooping capabilities, so that a value can be globally visible and globally ordered while still residing in some private core cache (and having stale copies in lower cache levels), the MESI protocols control how this is handled correctly.
The life cycle of a write begins in the out-of-order engine, where it is still speculative (i.e. it can be cleared due to an older branch misprediction or fault). Naturally, during that time the write can not be seen from the outside, so out-of-order execution here is not relevant. Once it commits, if the system guarantees store ordering (like x86), it still has to wait in line for its turn to become visible, so it is buffered. Other cores can't see it since its observation time hasn't arrived yet (although local loads in that core might see it in some implementations of x86 - that's one of the differences between TSO and real sequential consistency).
Once the older stores are done, the store may become globally visible - it doesn't have to go anywhere outside of the core for that, it can remain cached internally. In fact, some CPUs may even make it observable while still in the store buffer, or write it to the cache speculatively - the actual decision point is when to make it respond to external snoops, the rest is implementation details. Architectures with more relaxed ordering may change the order unless explicitly blocked by a fence/barrier.
Based on that, your code snippet can not reorder stores on x86, since stores don't reorder with each other there, but it may be able to do so on ARM, for example. If the language requires strong ordering in that case, the compiler will have to decide if it can rely on the HW, or add a fence. Either way, anyone reading this value from another thread (or socket) will have to snoop for it, and can only see the writes that respond.

Using Mutex for shared memory of 1 word

I have an application where multiple threads access and write to a shared memory of 1 word (16 bits).
Can I expect that the processor reads and writes a word from/to memory in an atomic operation? So I don't need mutex protection of the shared memory/variable?
The target is an embedded device running VxWorks.
EDIT: There's only one CPU and it is an old one (>7 years) - I am not exactly sure about the architecture and model, but I am also more interested in the general way that "most" CPUs will work. If it is a 16-bit CPU, would it then, in most cases, be fair to expect that it will read/write a 16-bit variable in one operation? Or should I always, in any case, use mutex protection? And let's say that I don't care about portability, and we're talking about C++98.
All processors will read and write aligned machine words atomically, in the sense that you won't get half the bits of the old value and half the bits of the new value if read by another processor.
To achieve good speed, modern processors will NOT synchronize read-modify-write operations to a particular location unless you actually ask for it - since nearly all reads and writes go to "non-shared" locations.
So, if the value is, say, a counter of how many times we've encountered a particular condition, or some other "if we read/write an old value, it'll go wrong" situation, then you need to ensure that two processors don't simultaneously update the value. This can typically be done with atomic instructions (or some other form of atomic updates) - this will ensure that one, and only one, processor touches the value at any given time, and that all the other processors DO NOT hold a copy of the value that they think is accurate and up to date when another has just made an update. See the C++11 std::atomic set of functions.
Note the distinction between atomically reading or writing the machine word's value and atomically performing the whole update.
The problem is not the atomicity of the access (which you can usually assume unless you are using an 8-bit MCU), but the missing synchronization, which leads to undefined behavior.
If you want to write portable code, use atomics instead. If you want to achieve maximal performance for your specific platform, read the documentation of your OS and compiler very carefully and see what additional mechanisms or guarantees they provide for multithreaded programs (But I really doubt that you will find anything more efficient than std::atomic that gives you sufficient guarantees).
Can I expect that the processor reads and writes a word from/to memory in an atomic operation?
Yes.
So I don't need mutex protection of the shared memory/variable?
No. Consider:
++i;
Even if the read and write are atomic, two threads doing this at the same time can each read, each increment, and then each write, resulting in only one increment where two are needed.
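A sketch of the failure mode and the usual fix, using C++11 std::atomic as the other answers suggest (the question mentions C++98 and VxWorks, where a mutex or the OS's own atomic primitives would play the same role; the names here are mine):

    #include <atomic>
    #include <cstdint>
    #include <thread>

    std::atomic<std::uint16_t> counter{0};   // the 16-bit shared word from the question

    void hit_many_times() {
        for (int i = 0; i < 10000; ++i)
            counter.fetch_add(1);   // atomic read-modify-write: no increments are lost
        // with a plain (non-atomic) ++counter, two threads can both read the same old
        // value, both add 1, and both write it back, losing one of the increments
    }

    int main() {
        std::thread t1(hit_many_times), t2(hit_many_times);
        t1.join(); t2.join();
        // counter is exactly 20000 here; with a plain variable it could be less
    }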
Can I expect that the processor reads and writes a word from/to memory in an atomic operation?
Yes, if the data's properly aligned and no bigger than a machine word, most CPU instructions will operate on it atomically in the sense you describe.
So I don't need mutex protection of the shared memory/variable?
You do need some synchronisation - whether a mutex or using atomic operations ala std::atomic.
The reasons for this include:
if your variable is not volatile, the compiler might not even emit read and write instructions for the memory address nominally holding that variable at the places you might expect, instead reusing values read or set earlier that are saved in CPU registers or known at compile time
if you use a mutex or std::atomic type you do not need to use volatile as well
further, even if the data is written towards memory, it may not leave the CPU caches and be written to actual RAM where other cores and CPUs can see it unless you explicitly use a memory barrier (std::mutex and std::atomic types do that for you)
finally, delays between reading and writing values can cause unexpected results, so operations like ++x can fail as explained by David Schwartz.

How does a mutex ensure a variable's value is consistent across cores?

If I have a single int which I want to write to from one thread and read from on another, I need to use std::atomic, to ensure that its value is consistent across cores, regardless of whether or not the instructions that read from and write to it are conceptually atomic. If I don't, it may be that the reading core has an old value in its cache, and will not see the new value. This makes sense to me.
If I have some complex data type that cannot be read/written to atomically, I need to guard access to it using some synchronisation primitive, such as std::mutex. This will prevent the object getting into (or being read from) an inconsistent state. This makes sense to me.
What doesn't make sense to me is how mutexes help with the caching problem that atomics solve. They seem to exist solely to prevent concurrent access to some resource, but not to propagate any values contained within that resource to other cores' caches. Is there some part of their semantics I've missed which deals with this?
The right answer to this is magic pixies - i.e. It Just Works. The implementation of std::atomic for each platform must do the right thing.
The right thing is a combination of 3 parts.
Firstly, the compiler needs to know that it can't move instructions across boundaries [in fact it can in some cases, but assume that it doesn't].
Secondly, the cache/memory subsystem needs to know - this is generally done using memory barriers, although x86/x64 generally have such strong memory guarantees that this isn't necessary in the vast majority of cases (which is a big shame, as it's nice for wrong code to actually go wrong).
Finally the CPU needs to know it cannot reorder instructions. Modern CPUs are massively aggressive at reordering operations and making sure in the single threaded case that this is unnoticeable. They may need more hints that this cannot happen in certain places.
For most CPUs part 2 and 3 come down to the same thing - a memory barrier implies both. Part 1 is totally inside the compiler, and is down to the compiler writers to get right.
See Herb Sutter's talk 'Atomic Weapons' for a lot more interesting info.
The consistency across cores is ensured by memory barriers (which also prevent instruction reordering). When you use std::atomic, not only do you access the data atomically, but the compiler (and library) also insert the relevant memory barriers.
Mutexes work the same way: the mutex implementations (eg. pthreads or WinAPI or what not) internally also insert memory barriers.
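To make that concrete for the "complex data type" case from the question, a minimal sketch (the Config type and names are mine): the unlock acts as a release and the lock as an acquire, so whoever locks next sees everything written under the previous lock, never a half-updated object.

    #include <mutex>
    #include <string>
    #include <thread>

    struct Config { std::string name; int version = 0; };   // can't be updated as one atomic word

    std::mutex m;
    Config cfg;

    void update() {
        std::lock_guard<std::mutex> lock(m);
        cfg.name = "new";
        cfg.version = 2;
        // the unlock at scope exit publishes both writes to the next thread that locks m
    }

    Config snapshot() {
        std::lock_guard<std::mutex> lock(m);   // the lock makes the previous holder's writes visible
        return cfg;                            // sees either the old or the new Config, never a mix
    }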
Most modern multicore processors (including x86 and x64) are cache coherent. If two cores hold the same memory location in cache and one of them updates the value, the change is automatically propagated to other cores' caches. It's inefficient (writing to the same cache line at the same time from two cores is really slow) but without cache coherence it would be very difficult to write multithreaded software.
And like syam said, memory barriers are also required. They prevent the compiler or processor from reordering memory accesses, and also force the write into memory (or at least into cache), when, for example, a variable is held in a register because of compiler optimizations.