Memory ordering behavior of std::atomic::load

Memory ordering behavior of std::atomic::load - c++

Am I wrong to assume that the atomic::load should also act as a memory barrier ensuring that all previous non-atomic writes will become visible by other threads?
To illustrate:
volatile bool arm1 = false;
std::atomic_bool arm2 = false;
bool triggered = false;
Thread1:
arm1 = true;
//std::std::atomic_thread_fence(std::memory_order_seq_cst); // this would do the trick
if (arm2.load())
triggered = true;
Thread2:
arm2.store(true);
if (arm1)
triggered = true;
I expected that after executing both 'triggered' would be true. Please don't suggest to make arm1 atomic, the point is to explore the behavior of atomic::load.
While I have to admit I don't fully understand the formal definitions of the different relaxed semantics of memory order I thought that the sequentially consistent ordering was pretty straightforward in that it guarantees that "a single total order exists in which all threads observe all modifications in the same order." To me this implies that the std::atomic::load with the default memory order of std::memory_order_seq_cst will also act as a memory fence. This is further corroborated by the following statement under "Sequentially-consistent ordering":
Total sequential ordering requires a full memory fence CPU instruction on all multi-core systems.
Yet, my simple example below demonstrates this is not the case with MSVC 2013, gcc 4.9 (x86) and clang 3.5.1 (x86), where the atomic load simply translates to a load instruction.
#include <atomic>
std::atomic_long al;
#ifdef _WIN32
__declspec(noinline)
#else
__attribute__((noinline))
#endif
long load() {
return al.load(std::memory_order_seq_cst);
}
int main(int argc, char* argv[]) {
long r = load();
}
With gcc this looks like:
load():
mov rax, QWORD PTR al[rip] ; <--- plain load here, no fence or xchg
ret
main:
call load()
xor eax, eax
ret
I'll omit the msvc and clang which are essentially identical. Now on gcc for ARM we get what I expected:
load():
dmb sy ; <---- data memory barrier here
movw r3, #:lower16:.LANCHOR0
movt r3, #:upper16:.LANCHOR0
ldr r0, [r3]
dmb sy ; <----- and here
bx lr
main:
push {r3, lr}
bl load()
movs r0, #0
pop {r3, pc}
This is not an academic question, it results in a subtle race condition in our code which called into question my understanding of the behavior of std::atomic.

Sigh, this was too long for a comment:
Isn't the meaning of atomic "to appear to occur instantaneously to the rest of the system"?
I'd say yes and no to that one, depending on how you think of it. For writes with SEQ_CST, yes. But as far as how atomic loads are handled, check out 29.3 of the C++11 standard. Specifically, 29.3.3 is really good reading, and 29.3.4 might be specifically what you're looking for:
For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_-
cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M
preceding X in the total order S or a later modification of M in its modification order.
Basically, SEQ_CST forces a global order just like the standard says, but reads can return and old value without violating the 'atomic' constraint.
To accomplish 'getting the absolute latest value' you'll need to perform an operation that forces the hardware coherency protocol to lock(the lock instruction on x86_64). This is what the atomic compare-and-exchange operations do, if you look at the assembly output.

Am I wrong to assume that the atomic::load should also act as a memory barrier ensuring that all previous non-atomic writes will become visible by other threads?
Yes. atomic::load(SEQ_CST) just enforces that the read cannot load an 'invalid' value, and neither writes nor loads may be reordered by the compiler or the cpu around that statement. It does not mean you'll always get the most up to date value.
I would expect your code to have a data race because again, barriers do not ensure the most up to date value is seen at a given time, they just prevent reordering.
Its perfectly valid for Thread1 to not see the write by Thread2 and therefore not set triggered, and for Thread2 to not see the write by Thread1 (again, not setting triggered), because you only write 'atomically' from one thread.
With two threads writing and reading shared values, you'll need a barrier in each thread to maintain consistency. It looks like you knew this already based in your code comments, so I'll just leave it at "the C++ standard is somewhat misleading when it comes to accurately describing meaning of atomic / multithreaded operations".
Even though you're writing C++, its still best, in my opinion, to think about what you're doing on the underlying architecture.
Not sure I explained that well, but I'd be happy to go into more detail if you'd like.

Related

Can num++ be atomic for 'int num'?

In general, for int num, num++ (or ++num), as a read-modify-write operation, is not atomic. But I often see compilers, for example GCC, generate the following code for it (try here):
void f()
{
int num = 0;
num++;
}
f():
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
add DWORD PTR [rbp-4], 1
nop
pop rbp
ret
Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?
And if so, does it mean that so-generated num++ can be used in concurrent (multi-threaded) scenarios without any danger of data races (i.e. we don't need to make it, for example, std::atomic<int> and impose the associated costs, since it's atomic anyway)?
UPDATE
Notice that this question is not whether increment is atomic (it's not and that was and is the opening line of the question). It's whether it can be in particular scenarios, i.e. whether one-instruction nature can in certain cases be exploited to avoid the overhead of the lock prefix. And, as the accepted answer mentions in the section about uniprocessor machines, as well as this answer, the conversation in its comments and others explain, it can (although not with C or C++).

This is absolutely what C++ defines as a Data Race that causes Undefined Behaviour, even if one compiler happened to produce code that did what you hoped on some target machine. You need to use std::atomic for reliable results, but you can use it with memory_order_relaxed if you don't care about reordering. See below for some example code and asm output using fetch_add.
But first, the assembly language part of the question:
Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?
Memory-destination instructions (other than pure stores) are read-modify-write operations that happen in multiple internal steps. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc., etc.
Memory operations from other CPUs can become globally visible between the load and store. I.e. two threads running add dword [num], 1 in a loop would step on each other's stores. (See #Margaret's answer for a nice diagram). After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.
"Atomic", from the Greek word meaning indivisible, means that no observer can see the operation as separate steps. Happening physically / electrically instantaneously for all bits simultaneously is just one way to achieve this for a load or store, but that's not even possible for an ALU operation. I went into a lot more detail about pure loads and pure stores in my answer to Atomicity on x86, while this answer focuses on read-modify-write.
The lock prefix can be applied to many read-modify-write (memory destination) instructions to make the entire operation atomic with respect to all possible observers in the system (other cores and DMA devices, not an oscilloscope hooked up to the CPU pins). That is why it exists. (See also this Q&A).
So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line pinned in Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.
Without the lock prefix, another core could take ownership of the cache line and modify it after our load but before our store, so that other store would become globally visible in between our load and store. Several other answers get this wrong, and claim that without lock you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches.
(If a locked instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don't misalign your atomic variables!)
Note that the lock prefix also turns an instruction into a full memory barrier (like MFENCE), stopping all run-time reordering and thus giving sequential consistency. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)
On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another question, and the comments under it, where I explain this in more detail.
Back to C++:
It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation:
;; Valid compiler output for num++
mov eax, [num]
inc eax
mov [num], eax
This is very likely if you use the value of num later: the compiler will keep it live in a register after the increment. So even if you check how num++ compiles on its own, changing the surrounding code can affect it.
(If the value isn't needed later, inc dword [num] is preferred; modern x86 CPUs will run a memory-destination RMW instruction at least as efficiently as using three separate instructions. Fun fact: gcc -O3 -m32 -mtune=i586 will actually emit this, because (Pentium) P5's superscalar pipeline didn't decode complex instructions to multiple simple micro-operations the way P6 and later microarchitectures do. See the Agner Fog's instruction tables / microarchitecture guide for more info, and the x86 tag wiki for many useful links (including Intel's x86 ISA manuals, which are freely available as PDF)).
Don't confuse the target memory model (x86) with the C++ memory model
Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.
Classic example: Storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.
You might be expecting that this code will synchronize with other threads:
// int flag; is just a plain global, not std::atomic<int>.
flag--; // Pretend this is supposed to be some kind of locking attempt
modify_a_data_structure(&foo); // doesn't look at flag, and the compiler knows this. (Assume it can see the function def). Otherwise the usual don't-break-single-threaded-code rules come into play!
flag++;
But it won't. The compiler is free to move the flag++ across the function call (if it inlines the function or knows that it doesn't look at flag). Then it can optimize away the modification entirely, because flag isn't even volatile.
(And no, C++ volatile is not a useful substitute for std::atomic. std::atomic does make the compiler assume that values in memory can be modified asynchronously similar to volatile, but there's much more to it than that. (In practice there are similarities between volatile int to std::atomic with mo_relaxed for pure-load and pure-store operations, but not for RMWs). Also, volatile std::atomic<int> foo is not necessarily the same as std::atomic<int> foo, although current compilers don't optimize atomics (e.g. 2 back-to-back stores of the same value) so volatile atomic wouldn't change the code-gen.)
Defining data races on non-atomic variables as Undefined Behaviour is what lets the compiler still hoist loads and sink stores out of loops, and many other optimizations for memory that multiple threads might have a reference to. (See this LLVM blog for more about how UB enables compiler optimizations.)
As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but it can be much more efficient on other architectures (like ARM). Even on x86, relaxed allows more compile-time reordering.
This is what GCC actually does on x86, for a few functions that operate on a std::atomic global variable.
See the source + assembly language code formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of assembly language code you get from atomics for those targets.
#include <atomic>
std::atomic<int> num;
void inc_relaxed() {
num.fetch_add(1, std::memory_order_relaxed);
}
int load_num() { return num; } // Even seq_cst loads are free on x86
void store_num(int val){ num = val; }
void store_num_release(int val){
num.store(val, std::memory_order_release);
}
// Can the compiler collapse multiple atomic operations into one? No, it can't.
# g++ 6.2 -O3, targeting x86-64 System V calling convention. (First argument in edi/rdi)
inc_relaxed():
lock add DWORD PTR num[rip], 1 #### Even relaxed RMWs need a lock. There's no way to request just a single-instruction RMW with no lock, for synchronizing between a program and signal handler for example. :/ There is atomic_signal_fence for ordering, but nothing for RMW.
ret
inc_seq_cst():
lock add DWORD PTR num[rip], 1
ret
load_num():
mov eax, DWORD PTR num[rip]
ret
store_num(int):
mov DWORD PTR num[rip], edi
mfence ##### seq_cst stores need an mfence
ret
store_num_release(int):
mov DWORD PTR num[rip], edi
ret ##### Release and weaker doesn't.
store_num_relaxed(int):
mov DWORD PTR num[rip], edi
ret
Notice how MFENCE (a full barrier) is needed after a sequential-consistency stores. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.
Re: discussion in comments on #Richard Hodges' answer about compilers merging std::atomic num++; num-=2; operations into one num--; instruction:
A separate Q&A on this same subject: Why don't compilers merge redundant std::atomic writes?, where my answer restates a lot of what I wrote below.
Current compilers don't actually do this (yet), but not because they aren't allowed to. C++ WG21/P0062R1: When should compilers optimize atomics? discusses the expectation that many programmers have that compilers won't make "surprising" optimizations, and what the standard can do to give programmers control. N4455 discusses many examples of things that can be optimized, including this one. It points out that inlining and constant-propagation can introduce things like fetch_or(0) which may be able to turn into just a load() (but still has acquire and release semantics), even when the original source didn't have any obviously redundant atomic ops.
The real reasons compilers don't do it (yet) are: (1) nobody's written the complicated code that would allow the compiler to do that safely (without ever getting it wrong), and (2) it potentially violates the principle of least surprise. Lock-free code is hard enough to write correctly in the first place. So don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It's not always easy easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).
Getting back to num++; num-=2; compiling as if it were num--:
Compilers are allowed to do this, unless num is volatile std::atomic<int>. If a reordering is possible, the as-if rule allows the compiler to decide at compile time that it always happens that way. Nothing guarantees that an observer could see the intermediate values (the num++ result).
I.e. if the ordering where nothing becomes globally visible between these operations is compatible with the ordering requirements of the source
(according to the C++ rules for the abstract machine, not the target architecture), the compiler can emit a single lock dec dword [num] instead of lock inc dword [num] / lock sub dword [num], 2.
num++; num-- can't disappear, because it still has a Synchronizes With relationship with other threads that look at num, and it's both an acquire-load and a release-store which disallows reordering of other operations in this thread. For x86, this might be able to compile to an MFENCE, instead of a lock add dword [num], 0 (i.e. num += 0).
As discussed in PR0062, more aggressive merging of non-adjacent atomic ops at compile time can be bad (e.g. a progress counter only gets updated once at the end instead of every iteration), but it can also help performance without downsides (e.g. skipping the atomic inc / dec of ref counts when a copy of a shared_ptr is created and destroyed, if the compiler can prove that another shared_ptr object exists for entire lifespan of the temporary.)
Even num++; num-- merging could hurt fairness of a lock implementation when one thread unlocks and re-locks right away. If it's never actually released in the asm, even hardware arbitration mechanisms won't give another thread a chance to grab the lock at that point.
With current gcc6.2 and clang3.9, you still get separate locked operations even with memory_order_relaxed in the most obviously optimizable case. (Godbolt compiler explorer so you can see if the latest versions are different.)
void multiple_ops_relaxed(std::atomic<unsigned int>& num) {
num.fetch_add( 1, std::memory_order_relaxed);
num.fetch_add(-1, std::memory_order_relaxed);
num.fetch_add( 6, std::memory_order_relaxed);
num.fetch_add(-5, std::memory_order_relaxed);
//num.fetch_add(-1, std::memory_order_relaxed);
}
multiple_ops_relaxed(std::atomic<unsigned int>&):
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
ret

Without many complications an instruction like add DWORD PTR [rbp-4], 1 is very CISC-style.
It perform three operations: load the operand from memory, increment it, store the operand back to memory.
During these operations the CPU acquire and release the bus twice, in between any other agent can acquire it too and this violates the atomicity.
AGENT 1 AGENT 2
load X
inc C
load X
inc C
store X
store X
X is incremented only once.

...and now let's enable optimisations:
f():
rep ret
OK, let's give it a chance:
void f(int& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
result:
f(int&):
mov DWORD PTR [rdi], 0
ret
another observing thread (even ignoring cache synchronisation delays) has no opportunity to observe the individual changes.
compare to:
#include <atomic>
void f(std::atomic<int>& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
where the result is:
f(std::atomic<int>&):
mov DWORD PTR [rdi], 0
mfence
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
lock sub DWORD PTR [rdi], 1
ret
Now, each modification is:-
observable in another thread, and
respectful of similar modifications happening in other threads.
atomicity is not just at the instruction level, it involves the whole pipeline from processor, through the caches, to memory and back.
Further info
Regarding the effect of optimisations of updates of std::atomics.
The c++ standard has the 'as if' rule, by which it is permissible for the compiler to reorder code, and even rewrite code provided that the outcome has the exact same observable effects (including side-effects) as if it had simply executed your code.
The as-if rule is conservative, particularly involving atomics.
consider:
void incdec(int& num) {
++num;
--num;
}
Because there are no mutex locks, atomics or any other constructs that influence inter-thread sequencing, I would argue that the compiler is free to rewrite this function as a NOP, eg:
void incdec(int&) {
// nada
}
This is because in the c++ memory model, there is no possibility of another thread observing the result of the increment. It would of course be different if num was volatile (might influence hardware behaviour). But in this case, this function will be the only function modifying this memory (otherwise the program is ill-formed).
However, this is a different ball game:
void incdec(std::atomic<int>& num) {
++num;
--num;
}
num is an atomic. Changes to it must be observable to other threads that are watching. Changes those threads themselves make (such as setting the value to 100 in between the increment and decrement) will have very far-reaching effects on the eventual value of num.
Here is a demo:
#include <thread>
#include <atomic>
int main()
{
for (int iter = 0 ; iter < 20 ; ++iter)
{
std::atomic<int> num = { 0 };
std::thread t1([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
++num;
--num;
}
});
std::thread t2([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
num = 100;
}
});
t2.join();
t1.join();
std::cout << num << std::endl;
}
}
sample output:
99
99
99
99
99
100
99
99
100
100
100
100
99
99
100
99
99
100
100
99

The add instruction is not atomic. It references memory, and two processor cores may have different local cache of that memory.
IIRC the atomic variant of the add instruction is called lock xadd

Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?
It is dangerous to draw conclusions based on "reverse engineering" generated assembly. For example, you seem to have compiled your code with optimization disabled, otherwise the compiler would have thrown away that variable or loaded 1 directly to it without invoking operator++. Because the generated assembly may change significantly, based on optimization flags, target CPU, etc., your conclusion is based on sand.
Also, your idea that one assembly instruction means an operation is atomic is wrong as well. This add will not be atomic on multi-CPU systems, even on the x86 architecture.

Even if your compiler always emitted this as an atomic operation, accessing num from any other thread concurrently would constitute a data race according to the C++11 and C++14 standards and the program would have undefined behavior.
But it is worse than that. First, as has been mentioned, the instruction generated by the compiler when incrementing a variable may depend on the optimization level. Secondly, the compiler may reorder other memory accesses around ++num if num is not atomic, e.g.
int main()
{
std::unique_ptr<std::vector<int>> vec;
int ready = 0;
std::thread t{[&]
{
while (!ready);
// use "vec" here
});
vec.reset(new std::vector<int>());
++ready;
t.join();
}
Even if we assume optimistically that ++ready is "atomic", and that the compiler generates the checking loop as needed (as I said, it's UB and therefore the compiler is free to remove it, replace it with an infinite loop, etc.), the compiler might still move the pointer assignment, or even worse the initialization of the vector to a point after the increment operation, causing chaos in the new thread. In practice, I would not be surprised at all if an optimizing compiler removed the ready variable and the checking loop completely, as this does not affect observable behavior under language rules (as opposed to your private hopes).
In fact, at last year's Meeting C++ conference, I've heard from two compiler developers that they very gladly implement optimizations that make naively written multi-threaded programs misbehave, as long as language rules allow it, if even a minor performance improvement is seen in correctly written programs.
Lastly, even if you didn't care about portability, and your compiler was magically nice, the CPU you are using is very likely of a superscalar CISC type and will break down instructions into micro-ops, reorder and/or speculatively execute them, to an extent only limited by synchronizing primitives such as (on Intel) the LOCK prefix or memory fences, in order to maximize operations per second.
To make a long story short, the natural responsibilities of thread-safe programming are:
Your duty is to write code that has well-defined behavior under language rules (and in particular the language standard memory model).
Your compiler's duty is to generate machine code which has the same well-defined (observable) behavior under the target architecture's memory model.
Your CPU's duty is to execute this code so that the observed behavior is compatible with its own architecture's memory model.
If you want to do it your own way, it might just work in some cases, but understand that the warranty is void, and you will be solely responsible for any unwanted outcomes. :-)
PS: Correctly written example:
int main()
{
std::unique_ptr<std::vector<int>> vec;
std::atomic<int> ready{0}; // NOTE the use of the std::atomic template
std::thread t{[&]
{
while (!ready);
// use "vec" here
});
vec.reset(new std::vector<int>());
++ready;
t.join();
}
This is safe because:
The checks of ready cannot be optimized away according to language rules.
The ++ready happens-before the check that sees ready as not zero, and other operations cannot be reordered around these operations. This is because ++ready and the check are sequentially consistent, which is another term described in the C++ memory model and that forbids this specific reordering. Therefore the compiler must not reorder the instructions, and must also tell the CPU that it must not e.g. postpone the write to vec to after the increment of ready. Sequentially consistent is the strongest guarantee regarding atomics in the language standard. Lesser (and theoretically cheaper) guarantees are available e.g. via other methods of std::atomic<T>, but these are definitely for experts only, and may not be optimized much by the compiler developers, because they are rarely used.

On a single-core x86 machine, an add instruction will generally be atomic with respect to other code on the CPU1. An interrupt can't split a single instruction down the middle.
Out-of-order execution is required to preserve the illusion of instructions executing one at a time in order within a single core, so any instruction running on the same CPU will either happen completely before or completely after the add.
Modern x86 systems are multi-core, so the uniprocessor special case doesn't apply.
If one is targeting a small embedded PC and has no plans to move the code to anything else, the atomic nature of the "add" instruction could be exploited. On the other hand, platforms where operations are inherently atomic are becoming more and more scarce.
(This doesn't help you if you're writing in C++, though. Compilers don't have an option to require num++ to compile to a memory-destination add or xadd without a lock prefix. They could choose to load num into a register and store the increment result with a separate instruction, and will likely do that if you use the result.)
Footnote 1: The lock prefix existed even on original 8086 because I/O devices operate concurrently with the CPU; drivers on a single-core system need lock add to atomically increment a value in device memory if the device can also modify it, or with respect to DMA access.

Back in the day when x86 computers had one CPU, the use of a single instruction ensured that interrupts would not split the read/modify/write and if the memory would not be used as a DMA buffer too, it was atomic in fact (and C++ did not mention threads in the standard, so this wasn’t addressed).
When it was rare to have a dual processor (e.g. dual-socket Pentium Pro) on a customer desktop, I effectively used this to avoid the LOCK prefix on a single-core machine and improve performance.
Today, it would only help against multiple threads that were all set to the same CPU affinity, so the threads you are worried about would only come into play via time slice expiring and running the other thread on the same CPU (core). That is not realistic.
With modern x86/x64 processors, the single instruction is broken up into several micro ops and furthermore the memory reading and writing is buffered. So different threads running on different CPUs will not only see this as non-atomic but may see inconsistent results concerning what it reads from memory and what it assumes other threads have read to that point in time: you need to add memory fences to restore sane behavior.

No.
https://www.youtube.com/watch?v=31g0YE61PLQ
(That's just a link to the "No" scene from "The Office")
Do you agree that this would be a possible output for the program:
sample output:
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
If so, then the compiler is free to make that the only possible output for the program, in whichever way the compiler wants. ie a main() that just puts out 100s.
This is the "as-if" rule.
And regardless of output, you can think of thread synchronization the same way - if thread A does num++; num--; and thread B reads num repeatedly, then a possible valid interleaving is that thread B never reads between num++ and num--. Since that interleaving is valid, the compiler is free to make that the only possible interleaving. And just remove the incr/decr entirely.
There are some interesting implications here:
while (working())
progress++; // atomic, global
(ie imagine some other thread updates a progress bar UI based on progress)
Can the compiler turn this into:
int local = 0;
while (working())
local++;
progress += local;
probably that is valid. But probably not what the programmer was hoping for :-(
The committee is still working on this stuff. Currently it "works" because compilers don't optimize atomics much. But that is changing.
And even if progress was also volatile, this would still be valid:
int local = 0;
while (working())
local++;
while (local--)
progress++;
:-/

That a single compiler's output, on a specific CPU architecture, with optimizations disabled (since gcc doesn't even compile ++ to add when optimizing in a quick&dirty example), seems to imply incrementing this way is atomic doesn't mean this is standard-compliant (you would cause undefined behavior when trying to access num in a thread), and is wrong anyways, because add is not atomic in x86.
Note that atomics (using the lock instruction prefix) are relatively heavy on x86 (see this relevant answer), but still remarkably less than a mutex, which isn't very appropriate in this use-case.
Following results are taken from clang++ 3.8 when compiling with -Os.
Incrementing an int by reference, the "regular" way :
void inc(int& x)
{
++x;
}
This compiles into :
inc(int&):
incl (%rdi)
retq
Incrementing an int passed by reference, the atomic way :
#include <atomic>
void inc(std::atomic<int>& x)
{
++x;
}
This example, which is not much more complex than the regular way, just gets the lock prefix added to the incl instruction - but caution, as previously stated this is not cheap. Just because assembly looks short doesn't mean it's fast.
inc(std::atomic<int>&):
lock incl (%rdi)
retq

Yes, but...
Atomic is not what you meant to say. You're probably asking the wrong thing.
The increment is certainly atomic. Unless the storage is misaligned (and since you left alignment to the compiler, it is not), it is necessarily aligned within a single cache line. Short of special non-caching streaming instructions, each and every write goes through the cache. Complete cache lines are being atomically read and written, never anything different.
Smaller-than-cacheline data is, of course, also written atomically (since the surrounding cache line is).
Is it thread-safe?
This is a different question, and there are at least two good reasons to answer with a definite "No!".
First, there is the possibility that another core might have a copy of that cache line in L1 (L2 and upwards is usually shared, but L1 is normally per-core!), and concurrently modifies that value. Of course that happens atomically, too, but now you have two "correct" (correctly, atomically, modified) values -- which one is the truly correct one now?
The CPU will sort it out somehow, of course. But the result may not be what you expect.
Second, there is memory ordering, or worded differently happens-before guarantees. The most important thing about atomic instructions is not so much that they are atomic. It's ordering.
You have the possibility of enforcing a guarantee that everything that happens memory-wise is realized in some guaranteed, well-defined order where you have a "happened before" guarantee. This ordering may be as "relaxed" (read as: none at all) or as strict as you need.
For example, you can set a pointer to some block of data (say, the results of some calculation) and then atomically release the "data is ready" flag. Now, whoever acquires this flag will be led into thinking that the pointer is valid. And indeed, it will always be a valid pointer, never anything different. That's because the write to the pointer happened-before the atomic operation.

When your compiler uses only a single instruction for the increment and your machine is single-threaded, your code is safe. ^^

Try compiling the same code on a non-x86 machine, and you'll quickly see very different assembly results.
The reason num++ appears to be atomic is because on x86 machines, incrementing a 32-bit integer is, in fact, atomic (assuming no memory retrieval takes place). But this is neither guaranteed by the c++ standard, nor is it likely to be the case on a machine that doesn't use the x86 instruction set. So this code is not cross-platform safe from race conditions.
You also don't have a strong guarantee that this code is safe from Race Conditions even on an x86 architecture, because x86 doesn't set up loads and stores to memory unless specifically instructed to do so. So if multiple threads tried to update this variable simultaneously, they may end up incrementing cached (outdated) values
The reason, then, that we have std::atomic<int> and so on is so that when you're working with an architecture where the atomicity of basic computations is not guaranteed, you have a mechanism that will force the compiler to generate atomic code.

GCC reordering up across load with `memory_order_seq_cst`. Is this allowed?

Using a simplified version of a basic seqlock , gcc reorders a nonatomic load up across an atomic load(memory_order_seq_cst) when compiling the code with -O3. This reordering isn't observed when compiling with other optimization levels or when compiling with clang ( even on O3 ). This reordering seems to violate a synchronizes-with relationship that should be established and I'm curious to know why gcc reorders this particular load and if this is even allowed by the standard.
Consider this following load function:
auto load()
{
std::size_t copy;
std::size_t seq0 = 0, seq1 = 0;
do
{
seq0 = seq_.load();
copy = value;
seq1 = seq_.load();
} while( seq0 & 1 || seq0 != seq1);
std::cout << "Observed: " << seq0 << '\n';
return copy;
}
Following seqlock procedure, this reader spins until it is able to load two instances of seq_, which is defined to be a std::atomic<std::size_t>, that are even ( to indicate that a writer is not currently writing ) and equal ( to indicate that a writer has not written to value in between the two loads of seq_ ). Furthermore, because these loads are tagged with memory_order_seq_cst ( as a default argument ), I would imagine that the instruction copy = value; would be executed on each iteration as it cannot be reordered up across the initial load, nor can it reordered down below the latter.
However, the generated assembly issues the load from value before the first load from seq_ and is even performed outside of the loop. This could lead to improper synchronization or torn reads of value that do not get resolved by the seqlock algorithm. Additionally, I've noticed that this only occurs when sizeof(value) is below 123 bytes. Modifying value to be of some type >= 123 bytes yields the correct assembly and is loaded upon each loop iteration in between the two loads of seq_. Is there any reason why this seemingly arbitrary threshold dictates which assembly is generated?
This test harness exposes the behavior on my Xeon E3-1505M, in which "Observed: 2" will be printed from the reader and the value 65535 will be returned. This combination of observed values of seq_ and the returned load from value seem to violate the synchronizes-with relationship that should be established by the writer thread publishing seq.store(2) with memory_order_release and the reader thread reading seq_ with memory_order_seq_cst.
Is it valid for gcc to reorder the load, and if so, why does it only do so when sizeof(value) is < 123? clang, no matter the optimization level or the sizeof(value) will not reorder the load. Clang's codegen, I believe, is the appropriate and correct approach.

Congratulations, I think you've hit a bug in gcc!
Now I think you can make a reasonable argument, as the other answer does, that the original code you showed could perhaps have been correctly optimized that way by gcc by relying on a fairly obscure argument about the unconditional access to value: essentially you can't have been relying on a synchronizes-with relationship between the load seq0 = seq_.load(); and the subsequent read of value, so reading it "somewhere else" shouldn't change the semantics of a race-free program. I'm not actually sure of this argument, but here's a "simpler" case I got from reducing your code:
#include <atomic>
#include <iostream>
std::atomic<std::size_t> seq_;
std::size_t value;
auto load()
{
std::size_t copy;
std::size_t seq0;
do
{
seq0 = seq_.load();
if (!seq0) continue;
copy = value;
seq0 = seq_.load();
} while (!seq0);
return copy;
}
This isn't a seqlock or anything - it just waits for seq0 to change from zero to non-zero, and then reads value. The second read of seq_ is superfluous as is the while condition, but without them the bug goes away.
This is now the read-side of the well known idiom which does work and is race-free: one thread writes to value, then sets seq0 non-zero with a release store. The threads calling load see the non-zero store, and synchronize with it, and so can safely read value. Of course, you can't keep writing to value, it's a "one time" initialization, but this a common pattern.
With the above code, gcc is still hoisting the read of value:
load():
mov rax, QWORD PTR value[rip]
.L2:
mov rdx, QWORD PTR seq_[rip]
test rdx, rdx
je .L2
mov rdx, QWORD PTR seq_[rip]
test rdx, rdx
je .L2
rep ret
Oops!
This behavior occurs up to gcc 7.3, but not in 8.1. Your code also compiles as you wanted in 8.1:
mov rbx, QWORD PTR seq_[rip]
mov rbp, QWORD PTR value[rip]
mov rax, QWORD PTR seq_[rip]

Reordering such operations is not allowed in general, but it is allowed in this case because any concurrently executing code which would yield a different result must invoke undefined behavior by creating a race condition in the read by interleaving a non-atomic read and (atomic or non-atomic) write in different threads.
The C++11 standard says:
Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one
accesses or modifies the same memory location.
And also that:
The execution of a program contains a data race if it contains two conflicting actions in different threads,
at least one of which is not atomic, and neither happens before the other. Any such data race results in
undefined behavior.
This applies even to things that occur before the undefined behavior:
A conforming implementation executing a well-formed program shall produce the same observable behavior
as one of the possible executions of the corresponding instance of the abstract machine with the same program
and the same input. However, if any such execution contains an undefined operation, this International
Standard places no requirement on the implementation executing that program with that input (not even
with regard to operations preceding the first undefined operation).
Because non-atomic reading from the write there creates undefined behavior (even if you overwrite and ignore the value), GCC is allowed to assume it does not occur and thus optimize out the seqlock. It can do so because any initial (acquired) state which would cause the loop to execute multiple times does not guard against subsequent race conditions from the non-atomic read as any subsequent atomic or non-atomic write to the variable beyond the initially acquired state does not establish a guaranteed synchronize-with relationship with the load operation before the non-atomic read. That is to say, the write could occur to the non-atomic read variable inbetween the execution of the seq cst load and the subsequent read, which is a race condition. The fact this "could" occur is a pointer to the lack of synchronizes with relationship and hence undefined behavior, so the compiler may assume it doesn't happen, which allows it to assume that no concurrent write whatsoever will happen to that variable during the loop.

The cost of atomic counters and spinlocks on x86(_64)

Preface
I recently came across some synchronization problems, which led me to spinlocks and atomic counters. Then I was searching a bit more, how these work and found std::memory_order and memory barriers (mfence, lfence and sfence).
So now, it seems that I should use acquire/release for the spinlocks and relaxed for the counters.
Some reference
x86 MFENCE - Memory Fence
x86 LOCK - Assert LOCK# Signal
Question
What is the machine code (edit: see below) for those three operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with default (seq_cst) memory order and with acquire/release/relaxed (in that order for those three operations). What is the difference (which memory barriers where) and the cost (how many CPU cycles)?
Purpose
I was just wondering how bad my old code (not specifying memory order = seq_cst used) really is and if I should create some class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as good spinlock with acquire/release instead of mutexes on some places ...or to use something from boost library - I have avoided boost so far).
My Knowledge
So far I do understand that spinlocks protect more than itself (but some shared resource/memory as well), so, there must be something that makes some memory view coherent for multiple threads/cores (that would be those acquire/release and memory fences). Atomic counter just lives for itself and only need that atomic increment (no other memory involved and I do not really care about the value when I read it, it is informative and can be few cycles old, no problem). There is some LOCK prefix and some instructions like xchg implicitly have it. Here my knowledge ends, I don't know how the cache and buses really work and what is behind (but I know that modern CPUs can reorder instructions, execute them in parallel and use memory cache and some synchronization). Thank you for explanation.
P.S.: I have old 32bit PC now, can only see lock addl and simple xchg, nothing else - all versions look the same (except unlock), memory_order makes no difference on my old PC (except unlock, release uses move instead of xchg). Will that be true for 64bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much, release on unlock saves few cycles, that's all.)
The Code:
#include <atomic>
using namespace std;
atomic_flag spinlock;
atomic<int> counter;
void inc1() {
counter++;
}
void inc2() {
counter.fetch_add(1, memory_order_relaxed);
}
void lock1() {
while(spinlock.test_and_set()) ;
}
void lock2() {
while(spinlock.test_and_set(memory_order_acquire)) ;
}
void unlock1() {
spinlock.clear();
}
void unlock2() {
spinlock.clear(memory_order_release);
}
int main() {
inc1();
inc2();
lock1();
unlock1();
lock2();
unlock2();
}
g++ -std=c++11 -O1 -S (32bit Cygwin, shortened output)
__Z4inc1v:
__Z4inc2v:
lock addl $1, _counter ; both seq_cst and relaxed
ret
__Z5lock1v:
__Z5lock2v:
movl $1, %edx
L5:
movl %edx, %eax
xchgb _spinlock, %al ; both seq_cst and acquire
testb %al, %al
jne L5
rep ret
__Z7unlock1v:
movl $0, %eax
xchgb _spinlock, %al ; seq_cst
ret
__Z7unlock2v:
movb $0, _spinlock ; release
ret
UPDATE for x86_64bit: (see mfence in unlock1)
_Z4inc1v:
_Z4inc2v:
lock addl $1, counter(%rip) ; both seq_cst and relaxed
ret
_Z5lock1v:
_Z5lock2v:
movl $1, %edx
.L5:
movl %edx, %eax
xchgb spinlock(%rip), %al ; both seq_cst and acquire
testb %al, %al
jne .L5
ret
_Z7unlock1v:
movb $0, spinlock(%rip)
mfence ; seq_cst
ret
_Z7unlock2v:
movb $0, spinlock(%rip) ; release
ret

x86 has mostly strong memory model, all the usual stores/loads have release/acquire semantics implicitly. The exception is only SSE non-temporal store operations which require sfence to be ordered as usual. All the read-modify-write (RMW) instructions with the LOCK prefix imply full memory barrier, i.e. seq_cst.
Thus on x86, we have
test_and_set can be coded with lock bts (for bit-wise operations), lock cmpxchg, or lock xchg (or just xchg which implies the lock). Other spin-lock implementations can use instructions like lock inc (or dec) if they need e.g. fairness. It is not possible to implement try_lock with release/acquire fence (at least you'd need standalone memory barrier mfence anyway).
clear is coded with lock and (for bit-wise) or lock xchg, though, more efficient implementations would use plain write (mov) instead of locked instruction.
fetch_add is coded with lock add.
Removing the lock prefix will not guarantee atomicity for RMW operations thus such operations cannot be interpreted strictly as having memory_order_relaxed in C++ view. However in practice, you might want to access atomic variable via faster non-atomic operation when it is safe (in constructor, under lock).
In our experience, it does not really matter which exactly RMW atomic operation is performed they take almost the same number of cycles to execute (and mfence is about x0.5 of a lock operation). You can estimate performance of synchronization algorithms by counting the number of atomic operations (and mfences), and the number of memory indirections (cache misses).

I recommend: x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors.
Your x86 and x86_64 are indeed pretty "well behaved". In particular, they do not re-order write operations (and any speculative writes are discarded while they are in the cpu/core's write-queue), and they do not re-order read operations. However, they will start read operations as early as they can, which means that reads and writes can be re-ordered. (A read of something sitting in the write-queue reads the queued value, so reads/writes of the same location are not re-ordered.) So:
read-modify-write operations require LOCKs which makes them, implicitly, memory_order_seq_cst.
So for these operations you gain nothing by weakening the memory ordering (on the x86/x86_64). The general advice is to "keep it simple" and stick with memory_order_seq_cst, which happily is not costing anything extra for the x86 and x86_64.
For anything newer than a Pentium, if the cpu/core already has "exclusive" access to the affected memory, the LOCK does not affect other cpus/cores, and may be a relatively simple operation.
memory_order_acquire/_release do not require an mfence or any other overhead.
So, for atomic load/store, if acquire/release is sufficient, then for the x86/x86_64 those operations are "tax free".
memory_order_seq_cst does require mfence...
...which is worth understanding.
(NB: we are here talking about what the processor does with the instructions generated by the compiler. The compiler's re-ordering of operations is a very similar issue, but not addressed here.)
An mfence stalls the cpu/core until all pending writes are cleared out of the write-queue. In particular, any read operations which follow the mfence will not start until the write-queue is empty. Consider two threads:
initial state: wa = wb = 0
thread 'A' thread 'B'
wa = 1 ; (mov [wa] ← 1) wb = 1 ; (mov [wb] ← 1)
a = wb ; (mov ebx ← [wb]) b = wa ; (mov ebx ← [wa])
Left to their own devices, the x86/x86_64 can produce any of (a = 1, b = 1), (a = 0, b = 1), (a = 1, b = 0) and (a = 0, b = 0). The last is invalid if you expect memory_order_seq_cst -- since you cannot get that by any interleaving of the operations. The reason this can happen is that the writes of wa and wb are queued in the respective cpu's/core's queue, and the reads of wa and wb can both be scheduled and can both complete before either write. To achieve memory_order_seq_cst you need an mfence:
thread 'A' thread 'B'
wa = 1 ; (mov [wa] ← 1) wb = 1 ; (mov [wb] ← 1)
mfence ; mfence
a = wb ; (mov ebx ← [wb]) b = wa ; (mov ebx ← [wa])
Since there is no synchronization between the threads, the result may be anything except (a = 0, b = 0). Interestingly, the mfence is for the benefit of the thread itself, because it prevents the read operation starting before the write completes. The only thing that other threads care about is the order in which writes occur, and the x86/x86_64 does not re-order those in any case.
So, to implement memory_order_seq_cst atomic_load() and atomic_store(), it is necessary to insert an mfence after one or more stores and before a load. Where these operations are implemented as library functions, the common convention is to add the mfence to all stores, leaving the load "naked". (The logic being that loads are more common than stores, and it seems better to add the overhead to the store.)
For spin-locks, at least, your question seems to boil down to whether a spin-unlock operation requires an mfence, or not, and what difference it makes.
The C11 atomic_flag_clear() is, implicitly, memory_order_seq_cst, for which an mfence is required. The C11 atomic_flag_test_and_set() is not only a read-modify-write operation but is also implictly memory_order_seq_cst -- and LOCK does that.
C11 does not offer a spin-lock in the threads.h library. But you can use an atomic_flag -- though for your x86/x86_64 you have PAUSE instruction problem to deal with. The question is, do you need memory_order_seq_cst for this, in particular for the unlock ? I think the answer is no, and that the trick is to do: atomic_flag_test_and_set_explicit(xxx, memory_order_acquire) and atomic_flag_clear(xxx, memory_order_release).
FWIW, the glibc pthread_spin_unlock() does not have an mfence. Nor does the gcc __sync_lock_release() (which is explicitly a "release" operation). But the gcc _atomic_clear() is aligned with the C11 atomic_flag_clear(), and takes a memory order parameter.
What difference does the mfence make to the unlock ? Clearly it's very disruptive to the pipe-line, and since it's not necessary, there's not much to be gained working out the exact scale of its impact, which will depend on the circumstances.

spinlock do not use mfence, mfence only enforce serialise/flush of data of current core. The fence itself do not in any way relate to atomic operation.
For spinlock you need some kind of atomic action to exchange data to a memory place. There are many different implementation, targeted for different requirement: for example, do it work on kernel or user-space? is it fair-lock?
A very simple and dumb spinlock for x86 looks like this (my kernel use this):
typedef volatile uint32_t _SPINLOCK __attribute__ ((aligned(16)));
static inline void _SPIN_LOCK(_SPINLOCK* lock) {
__asm (
"cli\n"
"lock bts %0, 0\n"
"jnc 1f\n"
"0:\n"
"pause\n"
"test %0, 1\n"
"je 0b\n"
"lock bts %0, 0\n"
"jc 0b\n"
"1:\n"
:
: "m"(lock)
:
);
}
The logic is simple
test and exchange a bit, if zero it mean the lock not taken, and we got it.
if bit is not zero, it mean the lock is taken by other, pause is a hint recommended by cpu manufacture so that it doesn't burn the cpu with a tight look.
loop until you got the lock
Note 1. you may also implement spinlock with intrinsics and extensions, it should be fairly similar.
Note 2. Spinlock is not judge by cycles, a sane implementation should be quite fast, for instant, the above implementation you should grab the lock on first try in well designed usage, if not, fix the algorithm or split the lock to prevent/reduce lock contention.
Note 3. You should also consider other things like fairness.

Re
and the cost (how many CPU cycles)?
On x86 at least, instructions that perform memory synchronization (atomic ops, fences) have a very variable CPU cycle latency. They wait for the processor store buffers to be flushed to memory, and this varies dramatically depending on the store buffer content.
E.g., if an atomic op is straight after a memcpy() that pushes multiple cache lines out to main memory, the delay may be in the 100's of nanoseconds. The same atomic op, but after a series of register-only arithmetic instructions, may take only a few clock cycles.

C++ memory_order with fences and aquire/release

I have the following C++ 2011 code:
std::atomic<bool> x, y;
std::atomic<int> z;
void f() {
x.store(true, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
y.store(true, std::memory_order_relaxed);
}
void g() {
while (!y.load(std::memory_order_relaxed)) {}
std::atomic_thread_fence(std::memory_order_acquire);
if (x.load(std::memory_order_relaxed)) ++z;
}
int main() {
x = false;
y = false;
z = 0;
std::thread t1(f);
std::thread t2(g);
t1.join();
t2.join();
assert(z.load() !=0);
return 0;
}
At my computer architecture class, we've been told that the assert in this code always comes true. But after reviewing it thouroughly now, I can't really understand why it's so.
For what I know:
A fence with 'memory_order_release' will not allow previous stores to be executed after it
A fence with 'memory_order_acquire' will not allow that any load that comes after it to be executed before it.
If my understanding is correct, why can't the following sequence of actions occur?
Inside t1, y.store(true, std::memory_order_relaxed); is called
t2 runs entirely, and will see a 'false' when loading 'x', therefore not increasing z in a unit
t1 finishes execution
In the main thread, the assert fails because z.load() returns 0
I think this complies with 'acquire'-'release' rules, but, for example in the best answer in this question: Understanding c++11 memory fences which is very similar to my case, it hints that something like step 1 in my sequence of actions cannot happen before the 'memory_order_release', but doesn't get into details for the reason behind it.
I'm terribly puzzled about this, and will be very glad if anyone could shed some light on it :)

Exactly what happens in each of these cases depends on what processor you are actually using. For example, x86 would probably not assert on this, since it is a cache-coherent architecture (you can have race-conditions, but once a value is written out to cache/memory from the processor, all other processors will read that value - of course, doesn't stop another processor from writing a different value immediately after, etc).
So assuming this is running on an ARM or similar processor that isn't guaranteed to be cache-coherent by itself:
Because the write to x is done before the memory_order_release, the t2 loop will not exit the while(y...) until x is also true. This means that when x is being read later on, it is guaranteed to be one, so z is updated. My only slight query is as to if you don't need a release for z as well... If main is running on a different processor than t1 and t2, then z may stil have a stale value in main.
Of course, that's not GUARANTEED to happen if you have a multitasking OS (or just interrupts that do enough stuff, etc) - since if the processor that ran t1 gets its cache flushed, then t2 may well read the new value of x.
And like I said, this won't have that effect on x86 processors (AMD or Intel ones).
So, to explain barrier instructions in general (also applicable to Intel and AMD process0rs):
First, we need to understand that although instructions can start and finish out of order, the processor does have a general "understanding" of order. Let's say we have this "pseudo-machine-code":
...
mov $5, x
cmp a, b
jnz L1
mov $4, x
L1:
...
THe processor could speculatively execute mov $4, x before it completes the "jnz L1" - so, to solve this fact, the processor would have to roll-back the mov $4, x in the case where the jnz L1 was taken.
Likewise, if we have:
mov $1, x
wmb // "write memory barrier"
mov $1, y
the processor has rules to say "do not execute any store instruction issued AFTER wmb until all stores before it has been completed". It is a "special" instruction - it's there for the precise purpose of guaranteeing memory ordering. If it's not doing that, you have a broken processor, and someone in the design department has "his ass on the line".
Equally, the "read memory barrier" is an instruction which guarantees, by the designers of the processor, that the processor will not complete another read until we have completed the pending reads before the barrier instruction.
As long as we're not working on "experimental" processors or some skanky chip that doesn't work correctly, it WILL work that way. It's part of the definition of that instruction. Without such guarantees, it would be impossible (or at least extremely complicated and "expensive") to implement (safe) spinlocks, semaphores, mutexes, etc.
There are often also "implicit memory barriers" - that is, instructions that cause memory barriers even if they are not. Software interrupts ("INT X" instruction or similar) tend to do this.

I don't like arguing about C++ concurrency questions in terms of "this processor does this, that processor does that". C++11 has a memory model, and we should be using this memory model to determine what is valid and what isn't. CPU architectures and memory models are usually even harder to understand. Plus there's more than one of them.
With this in mind, consider this: thread t2 is blocked in the while loop until t1 executes the y.store and the change has propagated to t2. (Which, by the way, could in theory be never. But that's not realistic.) Therefore we have a happens-before relationship between the y.store in t1 and the y.load in t2 that allows it to leave the loop.
Furthermore, we have simple intra-thread happens-before relations between the x.store and the release barrier and the barrier and the y.store.
In t2, we have a happens-before between the true-returning load and the acquire barrier and the x.load.
Because happens-before is transitive, the release barrier happens-before the acquire barrier, and the x.store happens-before the x.load. Because of the barriers, the x.store synchronizes-with the x.load, which means the load has to see the value stored.
Finally, the z.add_and_fetch (post-increment) happens-before the thread termination, which happens-before the main thread wakes from t2.join, which happens-before the z.load in the main thread, so the modification to z must be visible in the main thread.

Reading interlocked variables

Assume:
A. C++ under WIN32.
B. A properly aligned volatile integer incremented and decremented using InterlockedIncrement() and InterlockedDecrement().
__declspec (align(8)) volatile LONG _ServerState = 0;
If I want to simply read _ServerState, do I need to read the variable via an InterlockedXXX function?
For instance, I have seen code such as:
LONG x = InterlockedExchange(&_ServerState, _ServerState);
and
LONG x = InterlockedCompareExchange(&_ServerState, _ServerState, _ServerState);
The goal is to simply read the current value of _ServerState.
Can't I simply say:
if (_ServerState == some value)
{
// blah blah blah
}
There seems to be some confusion WRT this subject. I understand register-sized reads are atomic in Windows, so I would assume the InterlockedXXX function is unnecessary.
Matt J.
Okay, thanks for the responses. BTW, this is Visual C++ 2005 and 2008.
If it's true I should use an InterlockedXXX function to read the value of _ServerState, even if just for the sake of clarity, what's the best way to go about that?
LONG x = InterlockedExchange(&_ServerState, _ServerState);
This has the side effect of modifying the value, when all I really want to do is read it. Not only that, but there is a possibility that I could reset the flag to the wrong value if there is a context switch as the value of _ServerState is pushed on the stack in preparation of calling InterlockedExchange().
LONG x = InterlockedCompareExchange(&_ServerState, _ServerState, _ServerState);
I took this from an example I saw on MSDN.
See http://msdn.microsoft.com/en-us/library/ms686355(VS.85).aspx
All I need is something along the lines:
lock mov eax, [_ServerState]
In any case, the point, which I thought was clear, is to provide thread-safe access to a flag without incurring the overhead of a critical section. I have seen LONGs used this way via the InterlockedXXX() family of functions, hence my question.
Okay, we are thinking a good solution to this problem of reading the current value is:
LONG Cur = InterlockedCompareExchange(&_ServerState, 0, 0);

It depends on what you mean by "goal is to simply read the current value of _ServerState" and it depends on what set of tools and the platform you use (you specify Win32 and C++, but not which C++ compiler, and that may matter).
If you simply want to read the value such that the value is uncorrupted (ie., if some other processor is changing the value from 0x12345678 to 0x87654321 your read will get one of those 2 values and not 0x12344321) then simply reading will be OK as long as the variable is :
marked volatile,
properly aligned, and
read using a single instruction with a word size that the processor handles atomically
None of this is promised by the C/C++ standard, but Windows and MSVC do make these guarantees, and I think that most compilers that target Win32 do as well.
However, if you want your read to be synchronized with behavior of the other thread, there's some additional complexity. Say that you have a simple 'mailbox' protocol:
struct mailbox_struct {
uint32_t flag;
uint32_t data;
};
typedef struct mailbox_struct volatile mailbox;
// the global - initialized before wither thread starts
mailbox mbox = { 0, 0 };
//***************************
// Thread A
while (mbox.flag == 0) {
/* spin... */
}
uint32_t data = mbox.data;
//***************************
//***************************
// Thread B
mbox.data = some_very_important_value;
mbox.flag = 1;
//***************************
The thinking is Thread A will spin waiting for mbox.flag to indicate mbox.data has a valid piece of information. Thread B will write some data into mailbox.data then will set mbox.flag to 1 as a signal that mbox.data is valid.
In this case a simple read in Thread A of mbox.flag might get the value 1 even though a subsequent read of mbox.data in Thread A does not get the value written by Thread B.
This is because even though the compiler will not reorder the Thread B writes to mbox.data and mbox.flag, the processor and/or cache might. C/C++ guarantees that the compiler will generate code such that Thread B will write to mbox.data before it writes to mbox.flag, but the processor and cache might have a different idea - special handling called 'memory barriers' or 'acquire and release semantics' must be used to ensure ordering below the level of the thread's stream of instructions.
I'm not sure if compilers other than MSVC make any claims about ordering below the instruction level. However MS does guarantee that for MSVC volatile is enough - MS specifies that volatile writes have release semantics and volatile reads have acquire semantics - though I'm not sure at which version of MSVC this applies - see http://msdn.microsoft.com/en-us/library/12a04hfd.aspx?ppud=4.
I have also seen code like you describe that uses Interlocked APIs to perform simple reads and writes to shared locations. My take on the matter is to use the Interlocked APIs. Lock free inter-thread communication is full of very difficult to understand and subtle pitfalls, and trying to take a shortcut on a critical bit of code that may end up with a very difficult to diagnose bug doesn't seem like a good idea to me. Also, using an Interlocked API screams to anyone maintaining the code, "this is data access that needs to be shared or synchronized with something else - tread carefully!".
Also when using the Interlocked API you're taking the specifics of the hardware and the compiler out of the picture - the platform makes sure all of that stuff is dealt with properly - no more wondering...
Read Herb Sutter's Effective Concurrency articles on DDJ (which happen to be down at the moment, for me at least) for good information on this topic.

Your way is good:
LONG Cur = InterlockedCompareExchange(&_ServerState, 0, 0);
I'm using similar solution:
LONG Cur = InterlockedExchangeAdd(&_ServerState, 0);

Interlocked instructions provide atomicity and inter-processor synchronization. Both writes and reads must be synchronized, so yes, you should be using interlocked instructions to read a value that is shared between threads and not protected by a lock. Lock-free programming (and that's what you're doing) is a very tricky area, so you might consider using locks instead. Unless this is really one of your program's bottlenecks that must be optimized?

32-bit read operations are already atomic on some 32-bit systems (Intel spec says these operations are atomic, but there's no guarantee that this will be true on other x86-compatible platforms). So you shouldn't use this for threads synchronization.
If you need a flag some sort you should consider using Event object and WaitForSingleObject function for that purpose.

To anyone who has to revisit this thread I want to add to what was well explained by Bartosz that _InterlockedCompareExchange() is a good alternative to standard atomic_load() if standard atomics are not available. Here is the code for atomically reading my_uint32_t_var in C on i86 Win64. atomic_load() is included as a benchmark:
long debug_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
00000001401A6955 mov eax,dword ptr [rbp+30h]
00000001401A6958 xor edi,edi
00000001401A695A mov dword ptr [rbp-0Ch],eax
debug_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
00000001401A695D xor eax,eax
00000001401A695F lock cmpxchg dword ptr [rbp+30h],edi
00000001401A6964 mov dword ptr [rbp-0Ch],eax
debug_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
00000001401A6967 prefetchw [rbp+30h]
00000001401A696B mov eax,dword ptr [rbp+30h]
00000001401A696E xchg ax,ax
00000001401A6970 mov ecx,eax
00000001401A6972 lock cmpxchg dword ptr [rbp+30h],ecx
00000001401A6977 jne foo+30h (01401A6970h)
00000001401A6979 mov dword ptr [rbp-0Ch],eax
long release_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
00000001401A6955 mov eax,dword ptr [rbp+30h]
release_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
00000001401A6958 mov dword ptr [rbp-0Ch],eax
00000001401A695B xor edi,edi
00000001401A695D mov eax,dword ptr [rbp-0Ch]
00000001401A6960 xor eax,eax
00000001401A6962 lock cmpxchg dword ptr [rbp+30h],edi
00000001401A6967 mov dword ptr [rbp-0Ch],eax
release_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
00000001401A696A prefetchw [rbp+30h]
00000001401A696E mov eax,dword ptr [rbp+30h]
00000001401A6971 mov ecx,eax
00000001401A6973 lock cmpxchg dword ptr [rbp+30h],ecx
00000001401A6978 jne foo+31h (01401A6971h)
00000001401A697A mov dword ptr [rbp-0Ch],eax

you should be okay. It's volatile, so the optimizer shouldn't savage you, and it's a 32-bit value so it should be at least approximately atomic. The one possible surprise is if the instruction pipeline can get around that.
On the other hand, what's the additional cost of using the guarded routines?

Current value reading may not need any lock.

The Interlocked* functions prevent two different processors from accessing the same piece of memory. In a single processor system you are going to be ok. If you have a dual-core system where you have threads on different cores both accessing this value, you might have problems doing what you think is atomic without the Interlocked*.

Your initial understanding is basically correct. According to the memory model which Windows requires on all MP platforms it supports (or ever will support), reads from a naturally-aligned variable marked volatile are atomic as long as they are smaller than the size of a machine word. Same with writes. You don't need a 'lock' prefix.
If you do the reads without using an interlock, you are subject to processor reordering. This can even occur on x86, in a limited circumstance: reads from a variable may be moved above writes of a different variable. On pretty much every non-x86 architecture that Windows supports, you are subject to even more complicated reordering if you don't use explicit interlocks.
There's also a requirement that if you're using a compare exchange loop, you must mark the variable you're compare exchanging on as volatile. Here's a code example to demonstrate why:
long g_var = 0; // not marked 'volatile' -- this is an error
bool foo () {
long oldValue;
long newValue;
long retValue;
// (1) Capture the original global value
oldValue = g_var;
// (2) Compute a new value based on the old value
newValue = SomeTransformation(oldValue);
// (3) Store the new value if the global value is equal to old?
retValue = InterlockedCompareExchange(&g_var,
newValue,
oldValue);
if (retValue == oldValue) {
return true;
}
return false;
}
What can go wrong is that the compiler is well within its rights to re-fetch oldValue from g_var at any time if it's not volatile. This 'rematerialization' optimization is great in many cases because it can avoid spilling registers to the stack when register pressure is high.
Thus, step (3) of the function would become:
// (3) Incorrectly store new value regardless of whether the global
// is equal to old.
retValue = InterlockedCompareExchange(&g_var,
newValue,
g_var);

Read is fine. A 32-bit value is always read as a whole as long as it's not split on a cache line. Your align 8 guarantees that it's always within a cache line so you'll be fine.
Forget about instructions reordering and all that non-sense. Results are always retired in-order. It would be a processor recall otherwise!!!
Even for a dual CPU machine (i.e. shared via the slowest FSBs), you'll still be fine as the CPUs guarantee cache coherency via MESI Protocol. The only thing you're not guaranteed is the value you read may not be the absolute latest. BUT, what is the latest anyway? That's something you likely won't need to know in most situations if you're not writing back to the location based on the value of that read. Otherwise, you'd have used interlocked ops to handle it in the first place.
In short, you gain nothing by using Interlocked ops on a read (except perhaps reminding the next person maintaining your code to tread carefully - then again, that person may not be qualified to maintain your code to begin with).
EDIT: In response to a comment left by Adrian McCarthy.
You're overlooking the effect of compiler optimizations. If the
compiler thinks it has the value already in a register, then it's
going to re-use that value instead of re-reading it from memory. Also,
the compiler may do instruction re-ordering for optimization if it
believes there are no observable side effects.
I did not say reading from a non-volatile variable is fine. All the question was asking was if interlocked was required. In fact, the variable in question was clearly declared with volatile. Or were you overlooking the effect of the keyword volatile?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js