Can memory store be reordered really, in an OoOE processor?

Can memory store be reordered really, in an OoOE processor? - c++

We know that two instructions can be reordered by an OoOE processor. For example, there are two global variables shared among different threads.
int data;
bool ready;
A writer thread produce data and turn on a flag ready to allow readers to consume that data.
data = 6;
ready = true;
Now, on an OoOE processor, these two instructions can be reordered (instruction fetch, execution). But what about the final commit/write-back of the results? i.e., will the store be in-order?
From what I learned, this totally depends on a processor's memory model. E.g., x86/64 has a strong memory model, and reorder of stores is disallowed. On the contrary, ARM typically has a weak model where store reordering can happen (along with several other reorderings).
Also, the gut feeling tells me that I am right because otherwise we won't need a store barrier between those two instructions as used in typical multi-threaded programs.
But, here is what our wikipedia says:
.. In the outline above, the OoOE processor avoids the stall that
occurs in step (2) of the in-order processor when the instruction is
not completely ready to be processed due to missing data.
OoOE processors fill these "slots" in time with other instructions
that are ready, then re-order the results at the end to make it appear
that the instructions were processed as normal.
I'm confused. Is it saying that the results have to be written back in-order? Really, in an OoOE processor, can store to data and ready be reordered?

The simple answer is YES on some processor types.
Before the CPU, your code faces an earlier problem, compiler reordering.
data = 6;
ready = true;
The compiler is free to rearrange these statements since, as far as it knows, they do not affect each other (it is not thread-aware).
Now down to the processor level:
1) An out-of-order processor can process these instructions in different order, including reversing the order of the stores.
2) Even if the CPU performs them in order, they memory controller may not perform them in order because it may need to flush or bring in new cache lines or do an address translation before it can write them.
3) Even if this doesn't happen, another CPU in the system may not see them in the same order. In order to observe them, it may need to bring in the modified cache lines from the core that wrote them. It may not be able to bring one cache line in earlier than another if it is held be another core or if there is contention for that line by multiple cores, and its own out of order execution will read one before the other.
4) Finally, speculative execution on other cores may read the value of data before ready was set by the writing core, and by the time it gets around to reading ready, it was already set but data was also modified.
These problems are all solved by memory barriers. Platforms with weakly-ordered memory must make use of memory barriers to ensure memory coherence for thread synchronization.

The consistency model (or memory model) for the architecture determines what memory operations can be reordered. The idea is always to achieve the best performance from the code, while preserving the semantics expected by the programmer. That is the point from wikipedia, the memory operations appear in order to the programmer, even though they may have been reordered. Reordering is generally safe when the code is single-threaded, as the processor can easily detect potential violations.
On x86, the common model is that writes are not reordered with other writes. Yet, the processor is using out of order execution (OoOE), so instructions are being reordered constantly. Generally, the processor has several additional hardware structures to support OoOE, like a reorder buffer and load-store queue. The reorder buffer ensures that all instructions appear to execute in order, such that interrupts and exceptions break a specific point in the program. The load-store queue functions similarly, in that it can restore the order of memory operations according to the memory model. The load-store queue also disambiguates addresses, so that the processor can identify when the operations are made to the same or different addresses.
Back to OoOE, the processor is executing 10s to 100s of instructions in every cycle. Loads and stores are computing their addresses, etc. The processor may prefetch the cache lines for the accesses (which may include cache coherence), but it cannot actually access the line either to read or write until it is safe (according to the memory model) to do so.
Inserting store barriers, memory fences, etc tell both the compiler and processor about further restrictions to reordering the memory operations. The compiler is part of implementing the memory model, as some languages like java have specific memory model, while others like C obey the "memory accesses should appear as if they were executed in order".
In conclusion, yes, data and ready can be reordered in an OoOE. But it depends on the memory model as to whether they actually are. So if you need a specific order, provide the appropriate indication using barriers, etc such that the compiler, processor, etc will not choose a different order for higher performance.

On modern processor, the storing action itself is async (think of it like submit a change to the L1 cache and continue execution, the cache system further propagate in async manner). So the changes on two object lies on different cache block may be realised OoO from other CPU's perspective.
Furthermore, even the instruction to store those data, can be executed OoO. For example when two object is stored "at the same time", but the bus line of one object is retained/locked by other CPU or bus mastering, thus other other object may be committed earlier.
Therefore, to properly share data across threads, you need some kind of memory barrier or make use of transactional memory feature found in latest CPU like TSX.

I think you're misinterpreting "appear that the instructions were processed as normal." What that means is that if I have:
add r1 + 7 -> r2
move r3 -> r1
and the order of those is effectively reversed by out-of-order execution, the value that participates in the add operation will still be the value of r1 that was present prior to the move. Etc. The CPU will cache register values and/or delay register stores to assure that the "meaning" of a sequential instruction stream is not changed.
This says nothing about the order of stores as visible from another processor.

Related

Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

TL;DR: In a producer-consumer queue does it ever make sense to put an unnecessary (from C++ memory model viewpoint) memory fence, or unnecessarily strong memory order to have better latency at the expense of possibly worse throughput?
C++ memory model is executed on the hardware by having some sort of memory fences for stronger memory orders and not having them on weaker memory orders.
In particular, if producer does store(memory_order_release), and consumer observes the stored value with load(memory_order_acquire), there are no fences between load and store. On x86 there are no fences at all, on ARM fences are put operation before store and after load.
The value stored without a fence will eventually be observed by load without a fence (possibly after few unsuccessful attempts)
I'm wondering if putting a fence on either of sides of the queue can make the value to be observed faster?
What is the latency with and without fence, if so?
I expect that just having a loop with load(memory_order_acquire) and pause / yield limited to thousands of iterations is the best option, as it is used everywhere, but want to understand why.
Since this question is about hardware behavior, I expect there's no generic answer. If so, I'm wondering mostly about x86 (x64 flavor), and secondarily about ARM.
Example:
T queue[MAX_SIZE]
std::atomic<std::size_t> shared_producer_index;
void producer()
{
std::size_t private_producer_index = 0;
for(;;)
{
private_producer_index++; // Handling rollover and queue full omitted
/* fill data */;
shared_producer_index.store(
private_producer_index, std::memory_order_release);
// Maybe barrier here or stronger order above?
}
}
void consumer()
{
std::size_t private_consumer_index = 0;
for(;;)
{
std::size_t observed_producer_index = shared_producer_index.load(
std::memory_order_acquire);
while (private_consumer_index == observed_producer_index)
{
// Maybe barrier here or stronger order below?
_mm_pause();
observed_producer_index= shared_producer_index.load(
std::memory_order_acquire);
// Switching from busy wait to kernel wait after some iterations omitted
}
/* consume as much data as index difference specifies */;
private_consumer_index = observed_producer_index;
}
}

Basically no significant effect on inter-core latency, and definitely never worth using "blindly" without careful profiling, if you suspect there might be any contention from later loads missing in cache.
It's a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact barriers just make this core wait for something that was already going to happen on its own, before doing later loads and/or stores. For a full barrier, blocking later loads and stores until the store buffer is drained.
Size of store buffers on Intel hardware? What exactly is a store buffer?
In the bad old days before std::atomic, compiler barriers were one way to stop the compiler from keeping values in registers (private to a CPU core / thread, not coherent), but that's a compilation issue not asm. CPUs with non-coherent caches are possible in theory (where std::atomic would need to do explicit flushing to make a store visible), but in practice no implementation runs std::thread across cores with non-coherent caches.
If I don't use fences, how long could it take a core to see another core's writes? is highly related, I've written basically this answer at least a few times before. (But this looks like a good place for an answer specifically about this, without getting into the weeds of which barriers do what.)
There might be some very minor secondary effects of blocking later loads that could maybe compete with RFOs (for this core to get exclusive access to a cache line to commit a store). The CPU always tries to drain the store buffer as fast as possible (by committing to L1d cache). As soon as a store commits to L1d cache, it becomes globally visible to all other cores. (Because they're coherent; they'd still have to make a share request...)
Getting the current core to write-back some store data to L3 cache (especially in shared state) could reduce the miss penalty if the load on another core happens somewhat after this store commits. But there are no good ways to do that. Creating a conflict miss in L1d and L2 maybe, if producer performance is unimportant other than creating low latency for the next read.
On x86, Intel Tremont (low power Silvermont series) will introduce cldemote (_mm_cldemote) that writes back a line as far as an outer cache, but not all the way to DRAM. (clwb could possibly help, but does force the store to go all the way to DRAM. Also, the Skylake implementation is just a placeholder and works like clflushopt.)
Is there any way to write for Intel CPU direct core-to-core communication code?
How to force cpu core to flush store buffer in c?
x86 MESI invalidate cache line latency issue
Force a migration of a cache line to another core (not possible)
Fun fact: non-seq_cst stores/loads on PowerPC can store-forward between logical cores on the same physical core, making stores visible to some other cores before they become globally visible to all other cores. This is AFAIK the only real hardware mechanism for threads to not agree on a global order of stores to all objects. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?. On other ISAs, including ARMv8 and x86, it's guaranteed that stores become visible to all other cores at the same time (via commit to L1d cache).
For loads, CPUs already prioritize demand loads over any other memory accesses (because of course execution has to wait for them.) A barrier before a load could only delay it.
That might happen to be optimal by coincidence of timing, if that makes it see the store it was waiting for instead of going "too soon" and seeing the old cached boring value. But there's generally no reason to assume or ever predict that a pause or barrier could be a good idea ahead of a load.
A barrier after a load shouldn't help either. Later loads or stores might be able to start, but out-of-order CPUs generally do stuff in oldest-first priority so later loads probably can't fill up all the outstanding load buffers before this load gets a chance to get its load request sent off-core (assuming a cache miss because another core stored recently.)
I guess I could imagine a benefit to a later barrier if this load address wasn't ready for a while (pointer-chasing situation) and the max number of off-core requests were already in-flight when the address did become known.
Any possible benefit is almost certainly not worth it; if there was that much useful work independent of this load that it could fill up all the off-core request buffers (LFBs on Intel) then it might well not be on the critical path and it's probably a good thing to have those loads in flight.

MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?

In regards to std::atomic, the C++11 standard states that stores to an atomic variable will become visible to loads of that variable in a "reasonable amount of time".
From 29.3p13:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
However I was curious to know what actually happens when dealing with specific CPU architectures which are based on the MESI cache coherency protocol (x86, x86-64, ARM, etc..).
If my understanding of the MESI protocol is correct, a core will always read the value previously written/being written by another core immediately, possibly by snooping it. (because writing a value means issuing a RFO request which in turn invalidates other cache lines)
Does it mean that when a thread A stores a value into an std::atomic, another thread B which does a load on that atomic successively will in fact always observe the new value written by A on MESI architectures? (Assuming no other threads are doing operations on that atomic)
By “successively” I mean after thread A has issued the atomic store. (Modification order has been updated)

I'll answer for what happens on real implementations on real CPUs, because an answer based only on the standard can barely say anything useful about time or "immediacy".
MESI is just an implementation detail that ISO C++ doesn't have anything to say about. The guarantees provided by ISO C++ only involve order, not actual time. ISO C++ is intentionally non-specific to avoid assuming that it will execute on a "normal" CPU. An implementation on a non-coherent machine that required explicit flushes for store visibility might be theoretically possible (although probably horrible for performance of release / acquire and seq-cst operations)
C++ is non-specific enough about timing to even allow an implementation on a single-core cooperative multi-tasking system (no pre-emption), with the compiler inserting voluntary yields occasionally. (Infinite loops without any volatile accesses or I/O are UB). C++ on a system where only one thread can actually be executing at once is totally fine and possible, assuming you consider a scheduler timeslice to still be a "reasonable" amount of time. (Or less if you yield or otherwise block.)
Even the model of formalism ISO C++ uses to give the guarantees it does about ordering is very different from the way hardware ISAs define their memory models. C++ formal guarantees are purely in terms of happens-before and synchronizes-with, not "re"-ordering litmus tests or any kind of stuff like that. e.g. How to achieve a StoreLoad barrier in C++11? is impossible to answer for pure ISO C++ formalism. The "option C" in that Q&A serves to show just how weak the C++ guarantees are; that case with store then load of two different SC variables is not sufficient to imply happens-before based on it, according to the C++ formalism, even though there has to be a total order of all SC operations. But it is sufficient in real life on systems with coherent cache and only local (within each CPU core) memory reordering, even AArch64 where the SC load right after the SC store does still essentially give us a StoreLoad barrier.
when a thread A stores a value into an std::atomic
It depends what you mean by "doing" a store.
If you mean committing from the store buffer into L1d cache, then yes, that's the moment when a store becomes globally visible, on a normal machine that uses MESI to give all CPU cores a coherent view of memory.
Although note that on some ISAs, some other threads are allowed to see stores before they become globally visible via cache. (i.e. the hardware memory model may not be "multi-copy atomic", and allow IRIW reordering. POWER is the only example I know of that does this in real life. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for details on the HW mechanism: Store forwarding for retired aka graduated stores between SMT threads.)
If you mean executing locally so later loads in this thread can see it, then no. std::atomic can use a memory_order weaker than seq_cst.
All mainstream ISAs have memory-ordering rules weak enough to allow for a store buffer to decouple instruction execution from commit to cache. This also allows speculative out-of-order execution by giving stores somewhere private to live after execution, before we're sure that they were on the correct path of execution. (Stores can't commit to L1d until after the store instruction retires from the out-of-order part of the back end, and thus is known to be non-speculative.)
If you want to wait for your store to be visible to other threads before doing any later loads, use atomic_thread_fence(memory_order_seq_cst);. (Which on "normal" ISAs with standard choice of C++ -> asm mappings will compile to a full barrier).
On most ISAs, a seq_cst store (the default) will also stall all later loads (and stores) in this thread until the store is globally visible. But on AArch64, STLR is a sequential-release store and execution of later loads/stores doesn't have to stall unless / until a LDAR (acquire load) is about to execute while the STLR is still in the store buffer. This implements SC semantics as weakly as possible, assuming AArch64 hardware actually works that way instead of just treating it as a store + full barrier.
But note that only blocking later loads/stores is necessary; out-of-order exec of ALU instructions on registers can still continue. But if you were expecting some kind of timing effect due to dependency chains of FP operations, for example, that's not something you can depend on in C++.
Even if you do use seq_cst so nothing happens in this thread before the store is visible to others, that's still not instant. Inter-core latency on real hardware can be on the order of maybe 40ns on mainstream modern Intel x86, for example. (This thread doesn't have to stall that long on a memory barrier instruction; some of that time is the cache miss on the other thread trying to read the line that was invalidated by this core's RFO to get exclusive ownership.) Or of course much cheaper for logical cores that share the L1d cache of a physical core: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?

From 29.3p13:
Implementations should make atomic stores visible to atomic loads
within a reasonable amount of time.
The C and C++ standards are all over the place on threads, hence not usable as formal specifications. They use the concept of time, and somewhat imply that everything runs step by step, sequentially (if not, you wouldn't have a sound program semantic) and then say that some constructs can see effects out of order, without ever telling which is which.
When effects are seen out of order, thread time is ill defined as you don't have a chronometer that would also be out of order: you wouldn't do sport with out of order execution of actions!
Even "out of order" suggests that some things are purely sequential and some other operations can be "out of order" with respect to the firsts. That is not how std::atomic is defined.
What the standards try to say is that there is a notion of progress for each thread, with a CPU time or cost index, and as it increases as more stuff is done, and stuff can only be slightly reordered by the implementation: now reordering is well defined, not in term of other sequential instructions, but in term of cost/cycles/CPU time.
So if two instructions are close to each other in the sequential intra-thread execution, they will be close in CPU time too. A reasonable compiler shouldn't move a volatile operation, a file output, or an atomic operation past a very costly "pure" computation (one that has no externally visible side effect).
A basic idea that many committee members sadly couldn't even spell out!

Is memory ordering in C++11 about main memory flush ordering?

I'm not sure i fully understand (and i may have all wrong) the concepts of atomicity and memory ordering in C++11.
Let's take this simple example single threaded :
int main()
{
std::atomic<int> a(0);
std::atomic<int> b(0);
a.store(16);
b.store(10);
return 0;
}
In this single threaded code, if a and b were not atomic types, the compiler could have reordered the instruction in a way that in the assembly code, i have for instance a move instruction to assigned 10 to 'b' before a move instruction to assigned 16 to 'a'.
So for me, being atomic variables, it guarantees me that i'd have the "a move instruction" before the "b move instruction" as i stated in my source code.
After that, there is the processor with his execution unit, prefetching instructions, and with his out-of-order box. And this processor can process the "b instruction" before the "a instruction", whatever is the instruction ordering in the assembly code.
So i can have 10 stored in a register or in the store buffer of a processor or in cache memory before i have 16 stored in a register / store buffer or in cache.
And with my understanding, it's where memory ordering model come out. From that moment, if i let the default model sequentially consistent. One guarantees me that flush out these values (10 and 16) in main memory will respect the order i did the store in my source code. So that the processor will start flushing out the register or cache where 16 is stored into main memory for update 'a' and after that it will flush 10 in the main memory for 'b'.
So that behavior does allow me to understand that if i use a relaxed memory model. Only the last part is not guarantee so that the flush in main memory can be in total disorder.
Sorry if you get trouble to read me, my english is still poor. But thank you guys for your time.

The C++ memory model is about the abstract machine and value visibility, not about concrete things like "main memory", "write queues" or "flushing".
In your example, the memory model states that since the write to a happens-before the write to b, any thread that reads the 10 from b must, on subsequent reads from a, see 16 (unless this has since been overwritten, of course).
The important thing here is establishing happens-before relationships and value visibility. How this maps to caches and memory is up to the compiler. In my opinion, it's better to stay on that abstract level instead of trying to map the model to your understanding of the hardware, because
Your understanding of the hardware might be wrong. Hardware is even more complicated than the C++ memory model.
Even if your understanding is correct now, a later version of the hardware might have a different model, at least in subsystems.
By mapping to a hardware model, you might then make wrong assumptions about the implications for a different hardware model. E.g. if you understand how the memory model maps to x86 hardware, you will not understand the subtle difference between consume and acquire on PowerPC.
The C++ model is very well suited for reasoning about correctness.

You didn't specify which architecture you work with, but basically each has its own memory ordering model (some times more than one that you can choose from), and that serves as a "contract". The compiler should be aware of that and use lightweight or heavyweight instructions accordingly to guarantee what it needs in order to provide the memory model of the language.
The HW implementation under the hood can be quite complicated, but in a nutshell - you don't need to flush in order to get global visibility. Modern cache systems provide snooping capabilities, so that a value can be globally visible and globally ordered while still residing in some private core cache (and having stale copies in lower cache levels), the MESI protocols control how this is handled correctly.
The life cycle of a write begins in the out of order engine, where it is still speculative (i.e. - can be cleared due to an older branch misprediction or fault). Naturally, during that time the write can not be seen from the outside, so out-of-order execution here is not relevant. Once it commits, if the system guarantees store ordering (like x86), it still has to wait in line for its turn to become visible, so it is buffered. Other cores can't see it since its observation time hasn't reached yet (although local loads in that core might see it in some implementations of x86 - that's one of the differences between TSO and real sequential consistency).
Once the older stores are done, the store may become globally visible - it doesn't have to go anywhere outside of the core for that, it can remain cached internally. In fact, some CPUs may even make it observable while still in the store buffer, or write it to the cache speculatively - the actual decision point is when to make it respond to external snoops, the rest is implementation details. Architectures with more relaxed ordering may change the order unless explicitly blocked by a fence/barrier.
Based on that, your code snippet can not reorder stores on x86 since stores don't reorder with each other there, but it may be able to do so on arm for example. If the language requires strong ordering in that case, the compiler will have to decide if it can rely on the HW, or add a fence. Either way, anyone reading this value from another thread (or socket) will have to snoop for it, and can only see the writes that respond.

Is memory reordering visible to other threads on a uniprocessor?

It's common that modern CPU architectures employ performance optimizations that can result in out-of-order execution. In single threaded applications memory reordering may also occur, but it's invisible to programmers as if memory was accessed in program order. And for SMP, memory barriers come to the rescue which are used to enforce some sort of memory ordering.
What I'm not sure, is about multi-threading in a uniprocessor. Consider the following example: When thread 1 runs, the store to f could take place before the store to x. Let's say context switch happens after f is written, and right before x is written. Now thread 2 starts to run, and it ends the loop and print 0, which is undesirable of course.
// Both x, f are initialized w/ 0.
// Thread 1
x = 42;
f = 1;
// Thread 2
while (f == 0)
;
print x;
Is the scenario described above possible? Or is there a guarantee that physical memory is committed during thread context switch?
According to this wiki,
When a program runs on a single-CPU machine, the hardware performs
the necessary bookkeeping to ensure that the program execute as if all
memory operations were performed in the order specified by the
programmer (program order), so memory barriers are not necessary.
Although it didn't explicitly mention uniprocessor multi-threaded applications, it includes this case.
I'm not sure it's correct/complete or not. Note that this may highly depend on the hardware(weak/strong memory model). So you may want to include the hardware you know in the answers. Thanks.
PS. device I/O, etc are not my concern here. And it's a single-core uniprocessor.
Edit: Thanks Nitsan for the reminder, we assume no compiler reordering here(just hardware reordering), and loop in thread 2 is not optimized away..Again, devil is in the details.

As a C++ question the answer must be that the program contains a data race, so the behavior is undefined. In reality that means that it could print something other than 42.
That is independent of the underlying hardware. As has been pointed out, the loop can be optimized away and the compiler can reorder the assignments in thread 1, so that result can occur even on uniprocessor machines.
[I'll assume that with "uniprocessor" machine, you mean processors with a single core and hardware thread.]
You now say, that you want to assume compiler reordering or loop elimination does not happen. With this, we have left the realm of C++ and are really asking about corresponding machine instructions. If you want to eliminate compiler reordering, we can probably also rule out any form of SIMD instructions and consider only instructions operating on a single memory location at a time.
So essentially thread1 has two store instructions in the order store-to-x store-to-f, while thread2 has test-f-and-loop-if-not-zero (this may be multiple instructions, but involves a load-from-f) and then a load-from-x.
On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42.
One reason is that, if instructions processed by a single processors are not sequentially consistent among themselves, you could hardly assert anything about the effects of a program.
The only event that could interfere here, is an interrupt (as is used to trigger a preemptive context switch). A hypothetical machine that stores the entire state of its current execution pipeline state upon an interrupt and restores it upon return from the interrupt, could produce a different result, but such a machine is impractical and afaik does not exist. These operations would create quite a bit of additional complexity and/or require additional redundant buffers or registers, all for no good reason - except to break your program. Real processors either flush or roll back the current pipeline upon interrupt, which is enough to guarantee sequential consistency for all instructions on a single hardware thread.
And there is no memory model issue to worry about. The weaker memory models originate from the separate buffers and caches that separate the separate hardware processors from the main memory or nth level cache they actually share. A single processor has no similarly partitioned resources and no good reason to have them for multiple (purely software) threads. Again there is no reason to complicate the architecture and waste resources to make the processor and/or memory subsystem aware of something like separate thread contexts, if there aren't separate processing resources (processors/hardware threads) to keep these resources busy.

A strong memory ordering execute memory access instructions with the exact same order as defined in the program, it is often referred as "program ordering".
Weaker memory ordering may be employed to allow the processor reorder memory access for better performance, it is often referred as "processor ordering".
AFAIK, the scenario described above is NOT possible in the Intel ia32 architecture, whose processor ordering outlaws such cases. The relevant rules are (intel ia-32 software development manual Vol3A 8.2 Memory Ordering) :
writes are not reordered with other writes, with the exception of streaming stores, CLFLUSH and string operations.
To illustrate the rule, it gives an example similar to this:
memory location x, y, initialized to 0;
thread 1:
mov [x] 1
mov [y] 1
thread 2:
mov r1 [y]
mov r2 [x]
r1 == 1 and r2 == 0 is not allowed
In your example, thread 1 cannot store f before storing x.
#Eric in respond to your comments.
fast string store instruction "stosd", may store string out of order inside its operation. In a multiprocessor environment, when a processor store a string "str", another processor may observe str[1] being written before str[0], while the logic order presumed to be writing str[0] before str[1];
But these instructions are not reorder with any other stores. and must have precise exception handling. When exception occurs in the middle of stosd, the implementation may choose to delay it so that all out-of-order sub-stores (doesn't necessarily mean the whole stosd instruction) must commit before the context switch.
Edited to address the claims made on as if this is a C++ question:
Even this is considered in the context of C++, As I understand, a standard confirming compiler should NOT reorder the assignment of x and f in thread 1.
$1.9.14
Every value computation and side effect associated with a full-expression is sequenced before every value
computation and side effect associated with the next full-expression to be evaluated.

This isn't really a C or C++ question, since you've explicitly assumed no load/store re-ordering, which compilers for both languages are perfectly allowed to do.
Allowing that assumption for the sake of argument, note that loop may anyway never exit, unless you either:
give the compiler some reason to believe f may change (eg, by passing its address to some non-inlineable function which could modify it)
mark it volatile, or
make it an explicitly atomic type and request acquire semantics
On the hardware side, your worry about physical memory being "committed" during a context switch isn't an issue. Both software threads share the same memory hardware and cache, so there's no risk of inconsistency there whatever consistency/coherence protocol pertains between cores.
Say both stores were issued, and the memory hardware decides to re-order them. What does this really mean? Perhaps f's address is already in cache, so it can be written immediately, but x's store is deferred until that cache line is fetched. Well, a read from x is dependent on the same address, so either:
the load can't happen until the fetch happens, in which case a sane implementation must issue the queued store before the queued load
or the load can peek into the queue and fetch x's value without waiting for the write
Consider anyway that the kernel pre-emption required to switch threads will itself issue whatever load/store barriers are required for consistency of the kernel scheduler state, and it should be obvious that hardware re-ordering can't be a problem in this situation.
The real issue (which you're trying to avoid) is your assumption that there is no compiler re-ordering: this is simply wrong.

You would only need a compiler fence. From the Linux kernel docs on Memory Barriers (link):
SMP memory barriers are reduced to compiler barriers on uniprocessor
compiled systems because it is assumed that a CPU will appear to be
self-consistent, and will order overlapping accesses correctly with
respect to itself.
To expand on that, the reason why synchronization is not required at the hardware level is that:
All threads on a uniprocessor system share the same memory and thus there are no cache-coherency issues (such as propagation latency) that can occur on SMP systems, and
Any out-of-order load/store instructions in the CPU's execution pipeline would either be committed or rolled back in full if the pipeline is flushed due to a preemptive context switch.

This code may well never finish(in thread 2) as the compiler can decide to hoist the whole expression out of the loop(this is similar to using an isRunning flag which is not volatile).
That said you need to worry about 2 types of re-orderings here: compiler and CPU, both are free to move the stores around. See here: http://preshing.com/20120515/memory-reordering-caught-in-the-act for an example. At this point the code you describe above is at the mercy of compiler, compiler flags, and particular architecture. The wiki quoted is misleading as it may suggest internal re-ordering is not at the mercy of the cpu/compiler which is not the case.

As far as the x86 is concerned, the out-of-order-stores are made consistent from the viewpoint of the executing code with regards to program flow. In this case, "program flow" is just the flow of instructions that a processor executes, not something constrained to a "program running in a thread". All the instructions necessary for context switching, etc. are considered part of this flow so the consistency is maintained across threads.

A context switch has to store the complete machine state so that it can be restored before the suspended thread resumes execution. The machine states includes the processor registers but not the processor pipeline.
If you assume no compiler reordering, this means that all hardware instructions that are "on-the-fly" have to be completed before a context switch (i.e. an interrupt) takes place, otherwise they get lost and are not stored by the context switch mechanism. This is independend of hardware reordering.
In your example, even if the processor swaps the two hardware instructions "x=42" and "f=1", the instruction pointer is already after the second one, and therefore both instructions must be completed before the context switch begins. if it were not so, since the content of the pipeline and of the cache are not part of the "context", they would be lost.
In other words, if the interrupt that causes the ctx switch happens when the IP register points at the instruction following "f=1", then all instructions before that point must have completed all their effects.

From my point of view, processor fetch instructions one by one.
In your case, if "f = 1" was speculatively executed before "x = 42", that means both these two instructions are already in the processor's pipeline. The only possible way to schedule current thread out is interrupt. But the processor(at least on X86) will flush the pipeline's instructions before serving the interrupt.
So no need to worry about the reordering in a uniprocessor.

Compare and swap C++0x

From the C++0x proposal on C++ Atomic Types and Operations:
29.1 Order and Consistency [atomics.order]
Add a new sub-clause with the following paragraphs.
The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in [the new section added by N2334 or its adopted successor] and may provide for operation ordering. Its enumerated values and their meanings are as follows.
memory_order_relaxed
The operation does not order memory.
memory_order_release
Performs a release operation on the affected memory locations, thus making regular memory writes visible to other threads through the atomic variable to which it is applied.
memory_order_acquire
Performs an acquire operation on the affected memory locations, thus making regular memory writes in other threads released through the atomic variable to which it is applied, visible to the current thread.
memory_order_acq_rel
The operation has both acquire and release semantics.
memory_order_seq_cst
The operation has both acquire and release semantics, and in addition, has sequentially-consistent operation ordering.
Lower in the proposal:
bool A::compare_swap( C& expected, C desired,
memory_order success, memory_order failure ) volatile
where one can specify memory order for the CAS.
My understanding is that “memory_order_acq_rel” will only necessarily synchronize those memory locations which are needed for the operation, while other memory locations may remain unsynchronized (it will not behave as a memory fence).
Now, my question is - if I choose “memory_order_acq_rel” and apply compare_swap to integral types, for instance, integers, how is this typically translated into machine code on modern consumer processors such as a multicore Intel i7? What about the other commonly used architectures (x64, SPARC, ppc, arm)?
In particular (assuming a concrete compiler, say gcc):
How to compare-and-swap an integer location with the above operation?
What instruction sequence will such a code produce?
Is the operation lock-free on i7?
Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7? Or will it just synchronize the memory locations needed by this operation?
Related to previous question - is there any performance advantage to using acq_rel semantics on i7? What about the other architectures?
Thanks for all the answers.

The answer here is not trivial. Exactly what happens and what is meant is dependent on many things. For basic understanding of cache coherence/memory perhaps my recent blog entries might be helpful:
CPU Reordering – What is actually being reordered?
CPU Memory – Why do I need a mutex?
But that aside, let me try to answer a few questions. First off the below function is being very hopeful as to what is supported: very fine-grained control over exactly how strong a memory-order guarantee you get. That's reasonable for compile-time reordering but often not for runtime barriers.
compare_swap( C& expected, C desired,
memory_order success, memory_order failure )
Architectures won't all be able to implement this exactly as you requested; many will have to strengthen it to something strong enough that they can implement. When you specify memory_order you are specifying how reordering may work. To use Intel's terms you will be specifying what type of fence you want, there are three of them, the full fence, load fence, and store fence. (But on x86, load fence and store fence are only useful with weakly-ordered instructions like NT stores; atomics don't use them. Regular load/store give you everything except that stores can appear after later loads.) Just because you want a particular fence on that operation won't mean it is supported, in which I'd hope it always falls back to a full fence. (See Preshing's article on memory barriers)
An x86 (including x64) compiler will likely use the LOCK CMPXCHG instruction to implement the CAS, regardless of memory ordering. This implies a full barrier; x86 doesn't have a way to make a read-modify-write operation atomic without a lock prefix, which is also a full barrier. Pure-store and pure-load can be atomic "on their own", with many ISAs needing barriers for anything above mo_relaxed, but x86 does acq_rel "for free" in asm.
This instruction is lock-free, although all cores trying to CAS the same location will contend for access to it so you could argue it's not really wait-free. (Algorithms that use it might not be lock-free, but the operation itself is wait-free, see wikipedia's non-blocking algorithm article). On non-x86 with LL/SC instead of locked instructions, C++11 compare_exchange_weak is normally wait-free but compare_exchange_strong requires a retry loop in case of spurious failure.
Now that C++11 has existed for years, you can look at the asm output for various architectures on the Godbolt compiler explorer.
In terms of memory sync you need to understand how cache-coherence works (my blog may help a bit). New CPUs use a ccNUMA architecture (previously SMP). Essentially the "view" on the memory never gets out-of-sync. The fences used in the code don't actually force any flushing of cache to happen per-se, only of the store buffer committing in flight stores to cache before later loads.
If two cores both have the same memory location cached in a cache-line, a store by one core will get exclusive ownership of the cache line (invalidating all other copies) and marking its own as dirty. A very simple explanation for a very complex process
To answer your last question you should always use the memory semantics that you logically need to be correct. Most architectures won't support all the combinations you use in your program. However, in many cases you'll get great optimizations, especially in cases where the order you requested is guaranteed without a fence (which is quite common).
-- Answers to some comments:
You have to distinguish between what it means to execute a write instruction and write to a memory location. This is what I attempt to explain in my blog post. By the time the "0" is committed to 0x100, all cores see that zero. Writing integers is also atomic, that is even without a lock, when you write to a location all cores will immediately have that value if they wish to use it.
The trouble is that to use the value you have likely loaded it into a register first, any changes to the location after that obviously won't touch the register. This is why one needs mutexes or atomic<T> despite a cache coherent memory: the compiler is allowed to keep plain variable values in private registers. (In C++11, that's because a data-race on non-atomic variables is Undefined Behaviour.)
As to contradictory claims, generally you'll see all sorts of claims. Whether they are contradictory comes right down to exactly what "see" "load" "execute" mean in the context. If you write "1" to 0x100, does that mean you executed the write instruction or did the CPU actually commit that value. The difference created by the store buffer is one major cause of reordering (the only one x86 allows). The CPU can delay writing the "1", but you can be sure that the moment it does finally commit that "1" all cores see it. The fences control this ordering by making the thread wait until a store commits before doing later operations.

Your whole worldview seems off base: your question insinuates that cache consistency is controlled by memory orders at the C++ level and fences or atomic operations at the CPU level.
But cache consistency is one of the most important invariants for the physical architecture, and it's provided at all time by the memory system that consists of the interconnection of all CPUs and the RAM. You can never beat it from code running on a CPU, or even see its detail of operation. Of course, by observing RAM directly and running code elsewhere you might see stale data at some level of memory: by definition the RAM doesn't have the newest value of all memory locations.
But code running on a CPU can't access DRAM directly, only through the memory hierarchy which includes caches that communicate with each other to maintain coherency of this shared view of memory. (Typically with MESI). Even on a single core, a write-back cache lets DRAM values be stale, which can be an issue for non-cache-coherent DMA but not for reading/writing memory from a CPU.
So the issue exists only for external devices, and only ones that do non-coherent DMA. (DMA is cache-coherent on modern x86 CPUs; the memory controller being built-in to the CPU makes this possible).
Will such an operation run a full cache coherence protocol,
synchronizing caches of different processor cores as if it were a
memory fence on i7?
They are already synchronized. See Does a memory barrier ensure that the cache coherence has been completed? - memory barriers only do local things inside the core running the barrier, like flush the store buffer.
Or will it just synchronize the memory locations
needed by this operation?
An atomic operation applies to exactly one memory location. What others locations do you have in mind?
On a weakly-ordered CPU, a memory_order_relaxed atomic increment could avoid making earlier loads/stores visible before that increment. But x86's strongly-ordered memory model doesn't allow that.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js