Is memory reordering visible to other threads on a uniprocessor? - c++

It's common for modern CPU architectures to employ performance optimizations that can result in out-of-order execution. In a single-threaded application memory reordering may also occur, but it is invisible to the programmer: memory appears to be accessed in program order. On SMP systems, memory barriers come to the rescue; they are used to enforce some sort of memory ordering.
What I'm not sure about is multi-threading on a uniprocessor. Consider the following example: when thread 1 runs, the store to f could take place before the store to x. Say a context switch happens after f is written and right before x is written. Now thread 2 starts to run, exits the loop, and prints 0, which is undesirable of course.
// Both x and f are initialized to 0.

// Thread 1
x = 42;
f = 1;

// Thread 2
while (f == 0)
    ;
print x;
Is the scenario described above possible? Or is there a guarantee that physical memory is committed during a thread context switch?
According to this wiki,
When a program runs on a single-CPU machine, the hardware performs
the necessary bookkeeping to ensure that the program executes as if all
memory operations were performed in the order specified by the
programmer (program order), so memory barriers are not necessary.
Although it doesn't explicitly mention uniprocessor multi-threaded applications, it appears to include this case.
I'm not sure whether that's correct/complete. Note that this may depend heavily on the hardware (weak/strong memory model), so you may want to state which hardware you're describing in your answers. Thanks.
PS. Device I/O, etc. is not my concern here. And it's a single-core uniprocessor.
Edit: Thanks Nitsan for the reminder: we assume no compiler reordering here (just hardware reordering), and that the loop in thread 2 is not optimized away. Again, the devil is in the details.
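For concreteness, here is one way to write the example as a runnable C++11 program (my own harness, not part of the original question). Making x and f relaxed atomics removes the undefined behavior and keeps the loop from being optimized away, although in principle a compiler is still free to reorder the two relaxed stores:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0};
std::atomic<int> f{0};

int main() {
    std::thread t1([] {
        x.store(42, std::memory_order_relaxed);
        f.store(1, std::memory_order_relaxed);
    });
    std::thread t2([] {
        while (f.load(std::memory_order_relaxed) == 0)
            ; // spin until f is set
        std::printf("%d\n", x.load(std::memory_order_relaxed)); // 0 or 42?
    });
    t1.join();
    t2.join();
}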

As a C++ question the answer must be that the program contains a data race, so the behavior is undefined. In reality that means that it could print something other than 42.
That is independent of the underlying hardware. As has been pointed out, the loop can be optimized away and the compiler can reorder the assignments in thread 1, so that result can occur even on uniprocessor machines.
[I'll assume that with "uniprocessor" machine, you mean processors with a single core and hardware thread.]
You now say that you want to assume compiler reordering or loop elimination does not happen. With this, we have left the realm of C++ and are really asking about the corresponding machine instructions. If we eliminate compiler reordering, we can probably also rule out any form of SIMD instructions and consider only instructions operating on a single memory location at a time.
So essentially thread 1 has two store instructions in the order store-to-x, store-to-f, while thread 2 has test-f-and-loop-if-zero (this may be multiple instructions, but involves a load-from-f) and then a load-from-x.
On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42.
One reason is that, if instructions processed by a single processor were not sequentially consistent among themselves, you could hardly assert anything about the effects of a program.
The only event that could interfere here is an interrupt (as used to trigger a preemptive context switch). A hypothetical machine that stores the entire state of its current execution pipeline upon an interrupt and restores it upon return could produce a different result, but such a machine is impractical and afaik does not exist. These operations would create quite a bit of additional complexity and/or require additional redundant buffers or registers, all for no good reason - except to break your program. Real processors either flush or roll back the current pipeline upon interrupt, which is enough to guarantee sequential consistency for all instructions on a single hardware thread.
And there is no memory model issue to worry about. The weaker memory models originate from the per-core buffers and caches that sit between the individual hardware processors and the main memory or nth-level cache they actually share. A single processor has no such partitioned resources and no good reason to have them for multiple (purely software) threads. Again, there is no reason to complicate the architecture and waste resources making the processor and/or memory subsystem aware of something like separate thread contexts if there aren't separate processing resources (processors/hardware threads) to keep those resources busy.

A strong memory ordering executes memory access instructions in the exact order defined by the program; this is often referred to as "program ordering".
A weaker memory ordering may be employed to let the processor reorder memory accesses for better performance; this is often referred to as "processor ordering".
AFAIK, the scenario described above is NOT possible on the Intel IA-32 architecture, whose processor ordering outlaws such cases. The relevant rule is (Intel IA-32 Software Developer's Manual, Vol. 3A, 8.2 Memory Ordering):
writes are not reordered with other writes, with the exception of streaming stores, CLFLUSH and string operations.
To illustrate the rule, it gives an example similar to this:
memory location x, y, initialized to 0;
thread 1:
mov [x], 1
mov [y], 1
thread 2:
mov r1, [y]
mov r2, [x]
r1 == 1 and r2 == 0 is not allowed
In your example, thread 1 cannot store f before storing x.
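For comparison, the same litmus test can be expressed in C++11 with relaxed atomics (a sketch of mine, not from the manual; run the two functions in two threads). On IA-32 the forbidden outcome should never appear, while a weaker architecture such as ARM could permit it:

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    y.store(1, std::memory_order_relaxed);
}

void thread2() {
    r1 = y.load(std::memory_order_relaxed); // if this sees 1...
    r2 = x.load(std::memory_order_relaxed); // ...IA-32 guarantees this sees 1 too
}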
@Eric, in response to your comments:
The fast string store instruction stosd may store data out of order within its own operation. In a multiprocessor environment, when one processor stores a string str, another processor may observe str[1] being written before str[0], even though the logical order is presumed to be writing str[0] before str[1].
But these instructions are not reordered with any other stores, and they must have precise exception handling. When an exception occurs in the middle of a stosd, the implementation may choose to delay it so that all out-of-order sub-stores (which doesn't necessarily mean the whole stosd instruction) commit before the context switch.
Edited to address the claims made as if this were a C++ question:
Even when this is considered in the context of C++, as I understand it, a standard-conforming compiler should NOT reorder the assignments of x and f in thread 1.
§1.9/14:
Every value computation and side effect associated with a full-expression is sequenced before every value
computation and side effect associated with the next full-expression to be evaluated.

This isn't really a C or C++ question, since you've explicitly assumed no load/store re-ordering, which compilers for both languages are perfectly allowed to do.
Allowing that assumption for the sake of argument, note that the loop may never exit anyway, unless you either:
give the compiler some reason to believe f may change (eg, by passing its address to some non-inlineable function which could modify it),
mark it volatile, or
make it an explicitly atomic type and request acquire semantics (see the sketch after this list).
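For the third option, a minimal sketch (assuming C++11; the function names are mine, for illustration):

#include <atomic>

int x = 0;
std::atomic<int> f{0};

// Thread 1
void producer() {
    x = 42;
    f.store(1, std::memory_order_release); // publish x
}

// Thread 2
int consumer() {
    while (f.load(std::memory_order_acquire) == 0)
        ; // spin until published
    return x; // the release/acquire pair guarantees this reads 42
}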
On the hardware side, your worry about physical memory being "committed" during a context switch isn't an issue. Both software threads share the same memory hardware and cache, so there's no risk of inconsistency there whatever consistency/coherence protocol pertains between cores.
Say both stores were issued, and the memory hardware decides to re-order them. What does this really mean? Perhaps f's address is already in cache, so it can be written immediately, but x's store is deferred until that cache line is fetched. Well, a read from x is dependent on the same address, so either:
the load can't happen until the fetch happens, in which case a sane implementation must issue the queued store before the queued load
or the load can peek into the queue and fetch x's value without waiting for the write
Consider anyway that the kernel pre-emption required to switch threads will itself issue whatever load/store barriers are required for consistency of the kernel scheduler state, and it should be obvious that hardware re-ordering can't be a problem in this situation.
The real issue (which you're trying to avoid) is your assumption that there is no compiler re-ordering: this is simply wrong.

You would only need a compiler fence. From the Linux kernel docs on Memory Barriers (link):
SMP memory barriers are reduced to compiler barriers on uniprocessor
compiled systems because it is assumed that a CPU will appear to be
self-consistent, and will order overlapping accesses correctly with
respect to itself.
To expand on that, the reason why synchronization is not required at the hardware level is that:
All threads on a uniprocessor system share the same memory and thus there are no cache-coherency issues (such as propagation latency) that can occur on SMP systems, and
Any out-of-order load/store instructions in the CPU's execution pipeline would either be committed or rolled back in full if the pipeline is flushed due to a preemptive context switch.
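The compiler-only fence referred to above is typically spelled like this on GCC/Clang (shown as an illustration; the Linux kernel's barrier() macro boils down to the same idiom):

// Emits no machine instruction; it only forbids the compiler from
// moving memory accesses across this point.
inline void compiler_barrier() {
    asm volatile("" ::: "memory");
}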

This code may well never finish (in thread 2), as the compiler can decide to hoist the whole expression out of the loop (this is similar to using an isRunning flag which is not volatile).
That said, you need to worry about 2 types of re-orderings here: compiler and CPU; both are free to move the stores around. See here for an example: http://preshing.com/20120515/memory-reordering-caught-in-the-act. At this point the code you describe above is at the mercy of the compiler, compiler flags, and particular architecture. The wiki quoted is misleading, as it may suggest internal re-ordering is not at the mercy of the CPU/compiler, which is not the case.

As far as the x86 is concerned, the out-of-order-stores are made consistent from the viewpoint of the executing code with regards to program flow. In this case, "program flow" is just the flow of instructions that a processor executes, not something constrained to a "program running in a thread". All the instructions necessary for context switching, etc. are considered part of this flow so the consistency is maintained across threads.

A context switch has to store the complete machine state so that it can be restored before the suspended thread resumes execution. The machine state includes the processor registers but not the processor pipeline.
If you assume no compiler reordering, this means that all hardware instructions that are "in flight" have to be completed before a context switch (i.e. an interrupt) takes place; otherwise they would get lost and would not be stored by the context-switch mechanism. This is independent of hardware reordering.
In your example, even if the processor swaps the two hardware instructions "x=42" and "f=1", the instruction pointer is already past the second one, and therefore both instructions must be completed before the context switch begins. If it were not so, then since the contents of the pipeline and of the cache are not part of the "context", they would be lost.
In other words, if the interrupt that causes the ctx switch happens when the IP register points at the instruction following "f=1", then all instructions before that point must have completed all their effects.

From my point of view, processors fetch instructions one by one.
In your case, if "f = 1" was speculatively executed before "x = 42", that means both instructions are already in the processor's pipeline. The only way to schedule the current thread out is an interrupt, but the processor (at least on x86) will flush the pipeline's instructions before serving the interrupt.
So no need to worry about the reordering in a uniprocessor.

Related

MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?

In regards to std::atomic, the C++11 standard states that stores to an atomic variable will become visible to loads of that variable in a "reasonable amount of time".
From 29.3p13:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
However I was curious to know what actually happens when dealing with specific CPU architectures which are based on the MESI cache coherency protocol (x86, x86-64, ARM, etc.).
If my understanding of the MESI protocol is correct, a core will always read the value previously written/being written by another core immediately, possibly by snooping it. (because writing a value means issuing a RFO request which in turn invalidates other cache lines)
Does it mean that when a thread A stores a value into an std::atomic, another thread B which does a load on that atomic successively will in fact always observe the new value written by A on MESI architectures? (Assuming no other threads are doing operations on that atomic)
By “successively” I mean after thread A has issued the atomic store. (Modification order has been updated)
I'll answer for what happens on real implementations on real CPUs, because an answer based only on the standard can barely say anything useful about time or "immediacy".
MESI is just an implementation detail that ISO C++ doesn't have anything to say about. The guarantees provided by ISO C++ only involve order, not actual time. ISO C++ is intentionally non-specific to avoid assuming that it will execute on a "normal" CPU. An implementation on a non-coherent machine that required explicit flushes for store visibility might be theoretically possible (although probably horrible for performance of release/acquire and seq_cst operations).
C++ is non-specific enough about timing to even allow an implementation on a single-core cooperative multi-tasking system (no pre-emption), with the compiler inserting voluntary yields occasionally. (Infinite loops without any volatile accesses or I/O are UB). C++ on a system where only one thread can actually be executing at once is totally fine and possible, assuming you consider a scheduler timeslice to still be a "reasonable" amount of time. (Or less if you yield or otherwise block.)
Even the model of formalism ISO C++ uses to give the guarantees it does about ordering is very different from the way hardware ISAs define their memory models. C++ formal guarantees are purely in terms of happens-before and synchronizes-with, not "re"-ordering litmus tests or any kind of stuff like that. e.g. How to achieve a StoreLoad barrier in C++11? is impossible to answer for pure ISO C++ formalism. The "option C" in that Q&A serves to show just how weak the C++ guarantees are; that case with store then load of two different SC variables is not sufficient to imply happens-before based on it, according to the C++ formalism, even though there has to be a total order of all SC operations. But it is sufficient in real life on systems with coherent cache and only local (within each CPU core) memory reordering, even AArch64 where the SC load right after the SC store does still essentially give us a StoreLoad barrier.
when a thread A stores a value into an std::atomic
It depends what you mean by "doing" a store.
If you mean committing from the store buffer into L1d cache, then yes, that's the moment when a store becomes globally visible, on a normal machine that uses MESI to give all CPU cores a coherent view of memory.
Although note that on some ISAs, some other threads are allowed to see stores before they become globally visible via cache. (i.e. the hardware memory model may not be "multi-copy atomic", and allow IRIW reordering. POWER is the only example I know of that does this in real life. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for details on the HW mechanism: Store forwarding for retired aka graduated stores between SMT threads.)
If you mean executing locally so later loads in this thread can see it, then no. std::atomic can use a memory_order weaker than seq_cst.
All mainstream ISAs have memory-ordering rules weak enough to allow for a store buffer to decouple instruction execution from commit to cache. This also allows speculative out-of-order execution by giving stores somewhere private to live after execution, before we're sure that they were on the correct path of execution. (Stores can't commit to L1d until after the store instruction retires from the out-of-order part of the back end, and thus is known to be non-speculative.)
If you want to wait for your store to be visible to other threads before doing any later loads, use atomic_thread_fence(memory_order_seq_cst);. (Which on "normal" ISAs with standard choice of C++ -> asm mappings will compile to a full barrier).
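As a sketch of that pattern (my example, not from the question; names are hypothetical):

#include <atomic>

std::atomic<int> flag{0};
std::atomic<int> other{0};

int store_then_load() {
    flag.store(1, std::memory_order_relaxed);
    // On typical mappings this becomes a full barrier (e.g. mfence on x86):
    // the following load can't take its value until the store above is
    // globally visible.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return other.load(std::memory_order_relaxed);
}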
On most ISAs, a seq_cst store (the default) will also stall all later loads (and stores) in this thread until the store is globally visible. But on AArch64, STLR is a sequential-release store and execution of later loads/stores doesn't have to stall unless / until a LDAR (acquire load) is about to execute while the STLR is still in the store buffer. This implements SC semantics as weakly as possible, assuming AArch64 hardware actually works that way instead of just treating it as a store + full barrier.
But note that only blocking later loads/stores is necessary; out-of-order exec of ALU instructions on registers can still continue. But if you were expecting some kind of timing effect due to dependency chains of FP operations, for example, that's not something you can depend on in C++.
Even if you do use seq_cst so nothing happens in this thread before the store is visible to others, that's still not instant. Inter-core latency on real hardware can be on the order of maybe 40ns on mainstream modern Intel x86, for example. (This thread doesn't have to stall that long on a memory barrier instruction; some of that time is the cache miss on the other thread trying to read the line that was invalidated by this core's RFO to get exclusive ownership.) Or of course much cheaper for logical cores that share the L1d cache of a physical core: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?
From 29.3p13:
Implementations should make atomic stores visible to atomic loads
within a reasonable amount of time.
The C and C++ standards are all over the place on threads, hence not usable as formal specifications. They use the concept of time, and somewhat imply that everything runs step by step, sequentially (if not, you wouldn't have a sound program semantic) and then say that some constructs can see effects out of order, without ever telling which is which.
When effects are seen out of order, thread time is ill defined as you don't have a chronometer that would also be out of order: you wouldn't do sport with out of order execution of actions!
Even "out of order" suggests that some things are purely sequential and some other operations can be "out of order" with respect to the firsts. That is not how std::atomic is defined.
What the standards try to say is that there is a notion of progress for each thread, with a CPU time or cost index that increases as more stuff is done, and that stuff can only be slightly reordered by the implementation: now reordering is well defined, not in terms of other sequential instructions, but in terms of cost/cycles/CPU time.
So if two instructions are close to each other in the sequential intra-thread execution, they will be close in CPU time too. A reasonable compiler shouldn't move a volatile operation, a file output, or an atomic operation past a very costly "pure" computation (one that has no externally visible side effect).
A basic idea that many committee members sadly couldn't even spell out!

Is memory ordering in C++11 about main memory flush ordering?

I'm not sure I fully understand (and I may have it all wrong) the concepts of atomicity and memory ordering in C++11.
Let's take this simple single-threaded example:
#include <atomic>

int main()
{
    std::atomic<int> a(0);
    std::atomic<int> b(0);

    a.store(16);
    b.store(10);
    return 0;
}
In this single-threaded code, if a and b were not atomic types, the compiler could have reordered the instructions so that in the assembly code I have, for instance, a move instruction assigning 10 to b before a move instruction assigning 16 to a.
So for me, their being atomic variables guarantees that I'll have the "a move instruction" before the "b move instruction", as I stated in my source code.
After that, there is the processor with its execution units, prefetched instructions, and its out-of-order machinery. And this processor can process the "b instruction" before the "a instruction", whatever the instruction ordering in the assembly code.
So I can have 10 stored in a register, in the store buffer of a processor, or in cache memory before I have 16 stored in a register / store buffer or in cache.
And with my understanding, that's where the memory ordering model comes in. If I keep the default, sequential consistency, I'm guaranteed that flushing these values (10 and 16) to main memory will respect the order in which I did the stores in my source code. So the processor will start flushing the register or cache where 16 is stored into main memory to update a, and after that it will flush 10 to main memory for b.
That behavior lets me understand that if I use a relaxed memory model, only this last part is not guaranteed, so the flush to main memory can be in total disorder.
Sorry if you have trouble reading me; my English is still poor. But thank you guys for your time.
The C++ memory model is about the abstract machine and value visibility, not about concrete things like "main memory", "write queues" or "flushing".
In your example, the memory model states that since the write to a happens-before the write to b, any thread that reads the 10 from b must, on subsequent reads from a, see 16 (unless this has since been overwritten, of course).
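To make that concrete, here is a hedged sketch with a hypothetical reader thread added to the question's example:

#include <atomic>
#include <cassert>

std::atomic<int> a{0};
std::atomic<int> b{0};

void writer() {
    a.store(16); // seq_cst by default
    b.store(10);
}

void reader() {
    if (b.load() == 10) {
        // Reading 10 from b synchronizes-with the write to b, and the
        // write to a happens-before that write, so this must see 16.
        assert(a.load() == 16);
    }
}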
The important thing here is establishing happens-before relationships and value visibility. How this maps to caches and memory is up to the compiler. In my opinion, it's better to stay on that abstract level instead of trying to map the model to your understanding of the hardware, because
Your understanding of the hardware might be wrong. Hardware is even more complicated than the C++ memory model.
Even if your understanding is correct now, a later version of the hardware might have a different model, at least in subsystems.
By mapping to a hardware model, you might then make wrong assumptions about the implications for a different hardware model. E.g. if you understand how the memory model maps to x86 hardware, you will not understand the subtle difference between consume and acquire on PowerPC.
The C++ model is very well suited for reasoning about correctness.
You didn't specify which architecture you work with, but basically each has its own memory ordering model (sometimes more than one that you can choose from), and that serves as a "contract". The compiler should be aware of that and use lightweight or heavyweight instructions accordingly to guarantee what it needs in order to provide the memory model of the language.
The HW implementation under the hood can be quite complicated, but in a nutshell - you don't need to flush in order to get global visibility. Modern cache systems provide snooping capabilities, so that a value can be globally visible and globally ordered while still residing in some private core cache (and having stale copies in lower cache levels), the MESI protocols control how this is handled correctly.
The life cycle of a write begins in the out of order engine, where it is still speculative (i.e. - can be cleared due to an older branch misprediction or fault). Naturally, during that time the write can not be seen from the outside, so out-of-order execution here is not relevant. Once it commits, if the system guarantees store ordering (like x86), it still has to wait in line for its turn to become visible, so it is buffered. Other cores can't see it since its observation time hasn't reached yet (although local loads in that core might see it in some implementations of x86 - that's one of the differences between TSO and real sequential consistency).
Once the older stores are done, the store may become globally visible - it doesn't have to go anywhere outside of the core for that, it can remain cached internally. In fact, some CPUs may even make it observable while still in the store buffer, or write it to the cache speculatively - the actual decision point is when to make it respond to external snoops, the rest is implementation details. Architectures with more relaxed ordering may change the order unless explicitly blocked by a fence/barrier.
Based on that, your code snippet cannot reorder stores on x86, since stores don't reorder with each other there, but it may be able to do so on ARM, for example. If the language requires strong ordering in that case, the compiler will have to decide whether it can rely on the HW or must add a fence. Either way, anyone reading this value from another thread (or socket) will have to snoop for it, and can only see the writes that respond.
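As an illustration of that compiler decision, the same sequentially consistent store typically maps to different instruction sequences on different architectures (these are typical mappings, not guarantees):

#include <atomic>

std::atomic<int> g{0};

void publish() {
    g.store(1, std::memory_order_seq_cst);
    // x86-64:  commonly a single xchg (or mov + mfence); plain stores are
    //          already ordered, only StoreLoad needs the barrier.
    // AArch64: commonly stlr, a release store with extra SC ordering
    //          against later acquire loads.
    // ARMv7:   commonly dmb; str; dmb -- explicit fences around the store.
}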

Can memory store be reordered really, in an OoOE processor?

We know that two instructions can be reordered by an OoOE processor. For example, there are two global variables shared among different threads.
int data;
bool ready;
A writer thread produces data and turns on a flag, ready, to allow readers to consume that data.
data = 6;
ready = true;
Now, on an OoOE processor, these two instructions can be reordered (instruction fetch, execution). But what about the final commit/write-back of the results? i.e., will the store be in-order?
From what I learned, this totally depends on a processor's memory model. E.g., x86-64 has a strong memory model, and reordering of stores is disallowed. On the contrary, ARM typically has a weak model where store reordering can happen (along with several other reorderings).
Also, my gut feeling tells me that I am right, because otherwise we wouldn't need a store barrier between those two instructions, as used in typical multi-threaded programs.
But here is what Wikipedia says:
.. In the outline above, the OoOE processor avoids the stall that
occurs in step (2) of the in-order processor when the instruction is
not completely ready to be processed due to missing data.
OoOE processors fill these "slots" in time with other instructions
that are ready, then re-order the results at the end to make it appear
that the instructions were processed as normal.
I'm confused. Is it saying that the results have to be written back in order? Really, in an OoOE processor, can the stores to data and ready be reordered?
The simple answer is YES on some processor types.
Before the CPU, your code faces an earlier problem, compiler reordering.
data = 6;
ready = true;
The compiler is free to rearrange these statements since, as far as it knows, they do not affect each other (it is not thread-aware).
Now down to the processor level:
1) An out-of-order processor can process these instructions in different order, including reversing the order of the stores.
2) Even if the CPU performs them in order, the memory controller may not perform them in order, because it may need to flush or bring in new cache lines or do an address translation before it can write them.
3) Even if this doesn't happen, another CPU in the system may not see them in the same order. To observe them, it may need to bring in the modified cache lines from the core that wrote them. It may not be able to bring in one cache line earlier than another if it is held by another core, or if there is contention for that line by multiple cores; and its own out-of-order execution may read one before the other.
4) Finally, speculative execution on other cores may read the value of data before ready was set by the writing core, and by the time it gets around to reading ready, it was already set but data was also modified.
These problems are all solved by memory barriers. Platforms with weakly-ordered memory must make use of memory barriers to ensure memory coherence for thread synchronization.
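One hedged way to express that barrier fix in C++11 (a sketch; the release/acquire pair plays the role of the store/load barriers described above):

#include <atomic>

int data = 0;
std::atomic<bool> ready{false};

void writer() {
    data = 6;
    ready.store(true, std::memory_order_release); // orders the data store first
}

int reader() {
    while (!ready.load(std::memory_order_acquire))
        ; // spin; acquire orders the data load after the flag load
    return data; // guaranteed to read 6
}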
The consistency model (or memory model) for the architecture determines what memory operations can be reordered. The idea is always to achieve the best performance from the code, while preserving the semantics expected by the programmer. That is the point from wikipedia, the memory operations appear in order to the programmer, even though they may have been reordered. Reordering is generally safe when the code is single-threaded, as the processor can easily detect potential violations.
On x86, the common model is that writes are not reordered with other writes. Yet, the processor is using out-of-order execution (OoOE), so instructions are being reordered constantly. Generally, the processor has several additional hardware structures to support OoOE, like a reorder buffer and load-store queue. The reorder buffer ensures that all instructions appear to execute in order, such that interrupts and exceptions break at a specific point in the program. The load-store queue functions similarly, in that it can restore the order of memory operations according to the memory model. The load-store queue also disambiguates addresses, so that the processor can identify when operations are made to the same or different addresses.
Back to OoOE, the processor is executing 10s to 100s of instructions in every cycle. Loads and stores are computing their addresses, etc. The processor may prefetch the cache lines for the accesses (which may include cache coherence), but it cannot actually access the line either to read or write until it is safe (according to the memory model) to do so.
Inserting store barriers, memory fences, etc tell both the compiler and processor about further restrictions to reordering the memory operations. The compiler is part of implementing the memory model, as some languages like java have specific memory model, while others like C obey the "memory accesses should appear as if they were executed in order".
In conclusion, yes, data and ready can be reordered in an OoOE. But it depends on the memory model as to whether they actually are. So if you need a specific order, provide the appropriate indication using barriers, etc such that the compiler, processor, etc will not choose a different order for higher performance.
On modern processors, the store action itself is asynchronous (think of it as submitting a change to the L1 cache and continuing execution, with the cache system propagating it further asynchronously). So changes to two objects that lie on different cache lines may become visible out of order from another CPU's perspective.
Furthermore, even the instructions that store those data can be executed out of order. For example, when two objects are stored "at the same time" but the bus line of one object is retained/locked by another CPU or by bus mastering, the other object may be committed earlier.
Therefore, to properly share data across threads, you need some kind of memory barrier, or you can make use of a transactional memory feature found in the latest CPUs, like TSX.
I think you're misinterpreting "appear that the instructions were processed as normal." What that means is that if I have:
add r1 + 7 -> r2
move r3 -> r1
and the order of those is effectively reversed by out-of-order execution, the value that participates in the add operation will still be the value of r1 that was present prior to the move. Etc. The CPU will cache register values and/or delay register stores to assure that the "meaning" of a sequential instruction stream is not changed.
This says nothing about the order of stores as visible from another processor.

Thread Synchronisation 101

Previously I've written some very simple multithreaded code, and I've always been aware that at any time there could be a context switch right in the middle of what I'm doing, so I've always guarded access to the shared variables through a CCriticalSection class that enters the critical section on construction and leaves it on destruction. I know this is fairly aggressive, and I enter and leave critical sections quite frequently and sometimes egregiously (e.g. at the start of a function when I could put the CCriticalSection inside a tighter code block), but my code doesn't crash and it runs fast enough.
At work my multithreaded code needs to be tighter, only locking/synchronising at the lowest level needed.
At work I was trying to debug some multithreaded code, and I came across this:
EnterCriticalSection(&m_Crit4);
m_bSomeVariable = true;
LeaveCriticalSection(&m_Crit4);
Now, m_bSomeVariable is a Win32 BOOL (not volatile), which as far as I know is defined to be an int, and on x86 reading and writing these values is a single instruction, and since context switches occur on an instruction boundary then there's no need for synchronising this operation with a critical section.
I did some more research online to see whether this operation did not need synchronisation, and I came up with two scenarios where it did:
The CPU implements out of order execution or the second thread is running on a different core and the updated value is not written into RAM for the other core to see; and
The int is not 4-byte aligned.
I believe number 1 can be solved using the "volatile" keyword. In VS2005 and later the C++ compiler surrounds access to this variable using memory barriers, ensuring that the variable is always completely written/read to the main system memory before using it.
Number 2 I cannot verify: I don't know why the byte alignment would make a difference. I don't know the x86 instruction set, but does mov need to be given a 4-byte aligned address? If not, do you need to use a combination of instructions? That would introduce the problem.
So...
QUESTION 1: Does using the "volatile" keyword (implicity using memory barriers and hinting to the compiler not to optimise this code) absolve a programmer from the need to synchronise a 4-byte/8-byte on x86/x64 variable between read/write operations?
QUESTION 2: Is there the explicit requirement that the variable be 4-byte/8-byte aligned?
I did some more digging into our code and the variables defined in the class:
class CExample
{
private:
    CRITICAL_SECTION m_Crit1; // Protects variable a
    CRITICAL_SECTION m_Crit2; // Protects variable b
    CRITICAL_SECTION m_Crit3; // Protects variable c
    CRITICAL_SECTION m_Crit4; // Protects variable d
    // ...
};
Now, to me this seems excessive. I thought critical sections synchronised threads within a process, so if you've got one you can enter it and no other thread in that process can execute. There is no need for a critical section for each variable you want to protect; if you're in a critical section then nothing else can interrupt you.
I think the only thing that can change the variables from outside a critical section is if the process shares a memory page with another process (can you do that?) and the other process starts to change the values. Mutexes would also help here; named mutexes are shared across processes, or only across processes of the same name?
QUESTION 3: Is my analysis of critical sections correct, and should this code be rewritten to use mutexes? I have had a look at other synchronisation objects (semaphores and spinlocks), are they better suited here?
QUESTION 4: Where are critical sections/mutexes/semaphores/spinlocks best suited? That is, which synchronisation problem should they be applied to. Is there a vast performance penalty for choosing one over the other?
And while we're on it, I read that spinlocks should not be used in a single-core multithreaded environment, only a multi-core multithreaded environment. So, QUESTION 5: Is this wrong, or if not, why is it right?
Thanks in advance for any responses :)
1) No. volatile just says to re-load the value from memory each time; it is STILL possible for it to be half updated.
Edit:
2) Windows provides some atomic functions. Look up the "Interlocked" functions.
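A minimal sketch of what that looks like (the flag variable is hypothetical; these are real Win32 APIs, but treat the snippet as an illustration):

#include <windows.h>

volatile LONG g_flag = 0;

void SetFlag() {
    InterlockedExchange(&g_flag, 1);   // atomic store with a full barrier
}

LONG ReadFlag() {
    // Atomic read: a compare-exchange with equal exchange/comparand values
    // leaves the variable unchanged but returns its current value.
    return InterlockedCompareExchange(&g_flag, 0, 0);
}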
The comments led me to do a bit more reading up. If you read through the Intel System Programming Guide you can see that aligned reads and writes ARE atomic.
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following
basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following
additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following
additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache
line
Accesses to cacheable memory that are split across bus widths, cache lines, and
page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel
Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and
Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M,
Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that
permit external memory subsystems to make split accesses atomic; however,
nonaligned data accesses will seriously impact the performance of the processor and
should be avoided.
An x87 instruction or an SSE instructions that accesses data larger than a quadword
may be implemented using multiple memory accesses. If such an instruction stores
to memory, some of the accesses may complete (writing to memory) while another
causes the operation to fault for architectural reasons (e.g. due an page-table entry
that is marked “not present”). In this case, the effects of the completed accesses
may be visible to software even though the overall instruction caused a fault. If TLB
invalidation has been delayed (see Section 4.10.3.4), such page faults may occur
even if all accesses are to the same page.
So basically yes: if you do an 8-bit read/write from any address, a 16-bit read/write from a 16-bit aligned address, etc., you ARE getting atomic operations. It's also interesting to note that you can do unaligned memory reads/writes within a cache line on a modern machine. The rules seem quite complex though, so I wouldn't rely on them if I were you. Cheers to the commenters; that's a good learning experience for me, that one :)
3) A critical section will attempt to spin for its lock a few times and then fall back to a mutex. Spin locking can suck CPU power doing nothing, and a mutex can take a while to do its stuff. Critical sections are a good choice if you can't use the interlocked functions.
4) There are performance penalties for choosing one over another. It's a pretty big ask to go through the benefits of everything here. The MSDN help has lots of good information on each of these; I suggest reading them:
Semaphores
Critical Sections & Spin locks
Events
Mutexes
5) You can use a spin lock in a single-core environment, but it's not usually sensible, because thread management means you can't have two processors accessing the same data simultaneously. It just isn't possible.
1: Volatile in itself is practically useless for multithreading. It guarantees that the read/write will be executed, rather than storing the value in a register, and it guarantees that the read/write won't be reordered with respect to other volatile reads/writes. But it may still be reordered with respect to non-volatile ones, which is basically 99.9% of your code. Microsoft have redefined volatile to also wrap all accesses in memory barriers, but that is not guaranteed to be the case in general. It will just silently break on any compiler which defines volatile as the standard does. (The code will compile and run, it just won't be thread-safe any longer)
Apart from that, reads/writes to integer-sized objects are atomic on x86 as long as the object is well aligned. (You have no guarantee of when the write will occur though. The compiler and CPU may reorder it, so it's atomic, but not thread-safe)
2: Yes, the object has to be aligned for the read/write to be atomic.
3: Not really. Only one thread can execute code inside a given critical section at a time. Other threads can still execute other code. So you can have four variables each protected by a different critical section. If they all shared the same critical section, I'd be unable to manipulate object 1 while you're manipulating object 2, which is inefficient and constrains parallelism more than necessary. If they are protected by different critical sections, we just can't both manipulate the same object simultaneously.
4: Spinlocks are rarely a good idea. They are useful if you expect a thread to have to wait only a very short time before being able to acquire the lock, and you absolutely need minimal latency. It avoids the OS context switch, which is a relatively slow operation. Instead, the thread just sits in a loop constantly polling a variable. So higher CPU usage (the core isn't freed up to run another thread while waiting for the spinlock), but the thread will be able to continue as soon as the lock is released.
As for the others, the performance characteristics are pretty much the same: just use whichever has the semantics best suited for your needs. Typically critical sections are most convenient for protecting shared variables, and mutexes can be easily used to set a "flag" allowing other threads to proceed.
As for not using spinlocks in a single-core environment: remember that the spinlock doesn't actually yield. Thread A waiting on a spinlock isn't actually put on hold allowing the OS to schedule thread B to run. But since A is waiting on this spinlock, some other thread is going to have to release that lock. If you only have a single core, then that other thread can only run when A is switched out. With a sane OS, that's going to happen sooner or later anyway as part of the regular context switching. But since we know that A won't be able to get the lock until B has had time to execute and release the lock, we'd be better off if A just yielded immediately, was put in a wait queue by the OS, and restarted when B has released the lock. And that's what all other lock types do.
A spinlock will still work in a single core environment (assuming an OS with preemptive multitasking), it'll just be very very inefficient.
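For illustration, a toy C++11 spinlock showing why a waiter burns its whole timeslice on a single core (a sketch, not production code):

#include <atomic>

class Spinlock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (locked.test_and_set(std::memory_order_acquire))
            ; // no yield: on one core the holder can't run until we're preempted
    }
    void unlock() {
        locked.clear(std::memory_order_release);
    }
};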
Q1: Using the "volatile" keyword
In VS2005 and later the C++ compiler surrounds access to this variable using memory barriers, ensuring that the variable is always completely written/read to the main system memory before using it.
Exactly. If you are not creating portable code, Visual Studio implements it exactly this way. If you want to be portable, your options are currently "limited". Until C++0x there is no portable way to specify atomic operations with guaranteed read/write ordering, and you need to implement per-platform solutions. That said, Boost has already done the dirty work for you, and you can use its atomic primitives.
Q2: Variable needs to be 4-byte/8-byte aligned?
If you keep them aligned, you are safe. If you do not, the rules are complicated (cache lines, ...); therefore the safest way is to keep them aligned, as this is easy to achieve.
Q3: Should this code be rewritten to use mutexes?
A critical section is a lightweight mutex. Unless you need to synchronize between processes, use critical sections.
Q4: Where are critical sections/mutexes/semaphores/spinlocks best suited?
Critical sections can even do spin waits for you.
Q5: Spinlocks should not be used in a single-core
A spin lock exploits the fact that while the waiting CPU is spinning, another CPU may release the lock. This cannot happen with only one CPU, so there it is just a waste of time. On multiple CPUs, spin locks can be a good idea, but it depends on how often the spin wait will be successful. The idea is that waiting for a short while is a lot faster than doing a context switch there and back again; therefore if the wait is likely to be short, it is better to wait.
Don't use volatile. It has virtually nothing to do with thread-safety. See here for the low-down.
The assignment to BOOL doesn't need any synchronisation primitives. It'll work fine without any special effort on your part.
If you want to set the variable and then make sure that another thread sees the new value, you need to establish some kind of communication between the two threads. Just locking immediately before assigning achieves nothing because the other thread might have come and gone before you acquired the lock.
One final word of caution: threading is extremely hard to get right. The most experienced programmers tend to be the least comfortable with using threads, which should set alarm bells ringing for anyone who is inexperienced with their use. I strongly suggest you use some higher-level primitives to implement concurrency in your app. Passing immutable data structures via synchronised queues is one approach that substantially reduces the danger.
Volatile does not imply memory barriers.
It only means that it will be part of the perceived state of the memory model. The implication of this is that the compiler cannot optimize the variable away, nor can it perform operations on the variable only in CPU registers (it will actually load and store to memory).
As there are no memory barriers implied, the compiler can reorder instructions at will. The only guarantee is that the order in which different volatile variables are read/written will be the same as in the code:
void test()
{
    volatile int a;
    volatile int b;
    int c;

    c = 1;
    a = 5;
    b = 3;
}
With the code above (assuming that c is not optimized away), the update to c can happen before or after the updates to a and b, providing 3 possible outcomes. The a and b updates are guaranteed to be performed in order. c can be optimized away easily by any compiler. With enough information, the compiler can even optimize away a and b, if it can be proven that no other threads read the variables and that they are not bound to a hardware address (in which case they can in fact be removed). Note that the standard does not require a specific behavior, but rather a perceivable state under the as-if rule.
Questions 3: CRITICAL_SECTIONs and Mutexes work, pretty much, the same way. A Win32 mutex is a kernel object, so it can be shared between processes, and waited on with WaitForMultipleObjects, which you can't do with a CRITICAL_SECTION. On the other hand, a CRITICAL_SECTION is lighter-weight and therefore faster. But the logic of the code should be unaffected by which you use.
You also commented that "there is no need for a critical section for each variable you want to protect, if you're in a critical section then nothing else can interrupt you." This is true, but the tradeoff is that accesses to any of the variables would need you to hold that lock. If the variables can meaningfully be updated independently, you are losing an opportunity for parallelising those operations. (Since these are members of the same object, though, I would think hard before concluding that they can really be accessed independently of each other.)

Overhead of a Memory Barrier / Fence

I'm currently writing C++ code and use a lot of memory barriers/fences in my code. I know that an MB tells the compiler and the hardware not to reorder writes/reads around it. But I don't know how complex this operation is for the processor at runtime.
My question is: what is the runtime overhead of such a barrier? I didn't find any useful answer with Google...
Is the overhead negligible? Or does heavy usage of MBs lead to serious performance problems?
Best regards.
Compared to arithmetic and "normal" instructions I understand these to be very costly, but do not have numbers to back up that statement. I like jalf's answer by describing effects of the instructions, and would like to add a bit.
There are in general a few different types of barriers, so understanding the differences could be helpful. A barrier like the one that jalf mentioned is required for example in a mutex implementation before clearing the lock word (lwsync on ppc, or st4.rel on ia64 for example). All reads and writes must be complete, and only instructions later in the pipeline that have no memory access and no dependencies on in progress memory operations can be executed.
Another type of barrier is the sort that you'd use in a mutex implementation when acquiring a lock (examples, isync on ppc, or instr.acq on ia64). This has an effect on future instructions, so if a non-dependent load has been prefetched it must be discarded. Example:
if ( pSharedMem->atomic.bit_is_set() ) // use a bit to flag that somethingElse is "ready"
{
    foo( pSharedMem->somethingElse ) ;
}
Without an acquire barrier (borrowing ia64 lingo), your program may have unexpected results if somethingElse made it into a register before the check of the flagging bit is complete.
There is a third type of barrier, generally less used, and is required to enforce store load ordering. Examples of instructions for such an ordering enforcing instruction are, sync on ppc (heavyweight sync), MF on ia64, membar #storeload on sparc (required even for TSO).
Using ia64 like pseudocode to illustrate, suppose one had
st4.rel
ld4.acq
without an mf in between, one has no guarantee that the load follows the store. You know that loads and stores preceding the st4.rel are done before that store or the "subsequent" load, but that load or other future loads (and perhaps stores, if non-dependent?) could sneak in, completing earlier, since nothing prevents that otherwise.
Because mutex implementations very likely use only acquire and release barriers in their implementations, I'd expect that an observable effect of this is that memory access following a lock release may actually sometimes occur while "still in the critical section".
Try thinking about what the instruction does. It doesn't make the CPU do anything complicated in terms of logic, but it forces it to wait until all reads and writes have been committed to main memory. So the cost really depends on the cost of accessing main memory (and the number of outstanding reads/writes).
Accessing main memory is generally pretty expensive (10-200 clock cycles), but in a sense, that work would have to be done without the barrier as well, it could just be hidden by executing some other instructions simultaneously so you didn't feel the cost so much.
It also limits the CPU's (and compilers) ability to reschedule instructions, so there may be an indirect cost as well in that nearby instructions can't be interleaved which might otherwise yield a more efficient execution schedule.