What does std::atomic::is_always_lock_free = true really mean?

What does std::atomic::is_always_lock_free = true really mean? - c++

I have the following code:
#include <atomic>
int main () {
std::atomic<uint32_t> value(0);
value.fetch_add(1, std::memory_order::relaxed);
static_assert(std::atomic<uint32_t>::is_always_lock_free);
return 0;
}
It compiles and so it means std::atomic<uint32_t>::is_always_lock_free is true.
Then, the assembly code looks like this with gcc 10 and -std=c++20 -O3 -mtune=skylake-avx512 -march=skylake-avx512:
0000000000401050 <main>:
401050: c7 44 24 fc 00 00 00 mov DWORD PTR [rsp-0x4],0x0
401057: 00
401058: f0 ff 44 24 fc lock inc DWORD PTR [rsp-0x4]
40105d: 31 c0 xor eax,eax
40105f: c3 ret
Many posts point out that read-modify-write operation (fetch_add() here) can't be an atomic operation without a lock.
My question is what std::atomic::is_always_lock_free being true really means.
This page states Equals true if this atomic type is always lock-free and false if it is never or sometimes lock-free.
What does it mean by "this atomic type is always lock-free" then?

"Lock" here is in the sense of "mutex", not specifically in reference to the x86 instruction prefix named lock.
A trivial and generic way to implement std::atomic<T> for arbitrary types T would be as a class containing a T member together with a std::mutex, which is locked and unlocked around every operation on the object (load, store, exchange, fetch_add, etc). Those operations can then be done in any old way, and need not use atomic machine instructions, because the lock protects them. This implementation would be not lock free.
A downside of such an implementation, besides being slow in general, is that if two threads try to operate on the object at the same time, one of them will have to wait for the lock, which may actually block and cause it to be scheduled out for a while. Or, if a thread gets scheduled out while holding the lock, every other thread that wants to operate on the object will have to wait for the first thread to get scheduled back in and complete its work first.
So it is desirable if the machine supports truly atomic operations on T: a single instruction or sequence that other threads cannot interfere with, and which will not block other threads if interrupted (or perhaps cannot be interrupted at all). If for some type T the library has been able to specialize std::atomic<T> with such an implementation, then that is what we mean by saying it is lock free. (It is just confusing on x86 because the atomic instructions used for such implementations are named lock. On other architectures they might be called something else, e.g. ARM64's ldxr/stxr exclusive load/store instructions.)
The C++ standard allows for types to be "sometimes lock free": maybe it is not known at compile time whether std::atomic<T> will be lock-free, because it depends on special machine features that will be detected at runtime. It's even possible that some objects of type std::atomic<T> are lock-free and others are not. That's why atomic_is_lock_free is a function and not a constant. It checks whether this particular object is lock-free on this particular day.
However, it might be the case for some implementations that certain types can be guaranteed, at compile time, to always be lock free. That's what is_always_lock_free is used to indicate, and note that it's a constexpr bool instead of a function.

Related

Is 11 a valid output under ISO c++ for x86_64,arm or other arch?

This question is based on Can't relaxed atomic fetch_add reorder with later loads on x86, like store can?
I agree with answer given.
On x86 00 will never occur because a.fetch_add has a lock prefix/full barrier and loads can't reorder above fetch_add but on other architectures like arm/mips it can print 00. I have a two followup question about store buffer on x86 and arm.
I never get 11 on my pc (core i3 x86_64) i.e is 11 a valid output on
x86 in iso c++ , so am i missing something ? #Daniel Langr demonstrated 11 is a valid output on x86.
Now x86_64 has an advantage fetch_add acting as a full barrier.
For arm64 , output can be 00 sometimes due to cpu instruction reordering.
For arm64 or some other arch, can the output be 00 if without reordering ?. My question
is based on this. The store buffer values for function foo
a.fetch_add(1) is not visible to bar's a.load() and b.fetch_add(1) is
not visible to foo's b.load(). Hence we get 00 without reordering. Can this happen under ISO C++ on different archs ?
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;
void foo(){
a.fetch_add(1,memory_order_relaxed); //add to a is stored in store buffer of cpu0
//a.store(1,memory_order_relaxed);
retb=b.load(memory_order_relaxed);
}
void bar(){
b.fetch_add(1,memory_order_relaxed); //add to b is stored in store buffer of cpu1
//b.store(1,memory_order_relaxed);
reta=a.load(memory_order_relaxed);
}
int main(){
thread t[2]{ thread(foo),thread(bar) };
t[0].join(); t[1].join();
printf("%d%d\n",reta,retb);
return 0;
}

Yes, ISO C++ allows this; as Daniel pointed out, one easy way is to put some slow stuff after the RMWs, before the loads, so they don't execute until both threads have had a chance to increment. This should be obvious because it doesn't require any run-time reordering to happen, just a simple interleaving of program-order. (So ISO C++ allows 11 even with seq_cst.) i.e. you could have exercised this by single-stepping each thread separately.
Are you wondering about how to create a practical demonstration on x86 without delay loops?
Try putting your atomic vars in separate cache lines so two different cores can be writing them in parallel.
alignas(64) std::atomic<int> a, b; // the alignas applies to each separately
With them in the same cache line, like probably happens by default, the core that won ownership of the cache line they're both in is going to be able to execute the load of the other var as soon as the full-barrier part of the increment is done. Completing the RMW means that cache line is already hot in this core's L1d cache. (A core can only modify a cache line after gaining exclusive ownership of it via MESI, which will also make it valid for reads.)
So both operations by one thread are extremely likely to happen before either operation by the other thread. (In the x86 asm, each operation has the same asm as its seq_cst equivalent would have had, so we can usefully talk about a global order of operations without losing anything.)
Probably the only thing that would stop this from happening is an interrupt arriving at just the right moment, between the RMW and the load.
You also asked a separate question:
can the output be 00 if without reordering ?
Clearly no. No interleaving of program-order can put both loads before either increment, so either run-time or compile-time reordering is necessary to create the 00 effect.
a.fetch_add(1,memory_order_relaxed); // foo1
retb=b.load(memory_order_relaxed); // foo2
b.fetch_add(1,memory_order_relaxed); // bar1
reta=a.load(memory_order_relaxed); // bar2
Mix those however you want without putting foo2 before foo1 or bar2 before bar1.
i.e. if you single-stepped each thread separately, you could never see 00. Of course the whole point of mo_relaxed is that it can reorder. Specifying "without reordering" is the same as saying "with seq_cst".
The effects of a store buffer are a kind of reordering, specifically Store Load reordering. mo_seq_cst prevents even that, which is part of the point of seq_cst and what makes it so expensive, especially for pure-store operations.

Are C++ atomics preemption safe?

From what I understand of atomics, they are special assembly instructions which guarantee that two processors in an SMP system cannot write to the same memory region at the same time. For example in PowerPC an atomic increment would look something like:
retry:
lwarx r4, 0, r3 // Read integer from RAM location r3 into r4, placing reservation.
addi r4, r4, 1 // Add 1 to r4.
stwcx. r4, 0, r3 // Attempt to store incremented value back to RAM.
bne- retry // If the store failed (unlikely), retry.
However this does not protect the four instructions from begin preempted by an interrupt, and another task being scheduled in. To protect from preemption you need to do an interrupt-lock before entering the code.
From what I can see of C++ atomics, they seem to be enforcing locks if needed. So my first question is -
Does the C++ standard guarantee no preemption will happen during an atomic operation? If so, where in the standard can I find this?
I checked the atomic<int>::is_always_lock_free on my Intel PC, and it came true. With my assumption of the above assembly block, this confused me. After digging into Intel assembly (which I am unfamiliar with) I found lock xadd DWORD PTR [rdx], eax to be happening. So my question is -
Do some architectures provide atomic related instructions which guarantee no-preepmtion? Or is my understanding wrong?
Finally I was wondering about the compare_exchange_weak and compare_exchange_strong semantics -
Does the difference lie int the retry mechanism or is it something else?
EDIT: After reading the answers, I am curious about one more thing
The atomic member function operations fetch_add, operator++ etc. are they strong or weak?

This is similar to this question: Anything in std::atomic is wait-free?
Here are some definitions of lock-freedom and wait-freedom (both taken from Wikipedia):
An algorithm is lock-free if, when the program threads are run for a sufficiently long time, at least one of the threads makes progress.
An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.
Your code with the retry loop is lock-free: a thread only has to perform a retry if the store fails, but that implies that the value must have been updated in the meantime, so some other thread must have made progress.
With regard to lock-freedom it doesn't matter whether a thread can be preempted in the middle of an atomic operation or not.
Some operations can be translated to a single atomic operation, in which case this operation is wait-free and therefore cannot be preempted midway. However, which operations are actually wait-free depends on the compiler and the target architecture (as described in my answer in the referenced SO question).
Regarding the difference between compare_exchange_weak and compare_exchange_strong - the weak version can fail spuriously, i.e., it may fail even though the comparison is actually true. This can happen on architectures with LL/SC. Suppose we use compare_exchange_weak to update some variable with the expected value A. LL loads the value A from the variable, and before the SC is executed, the variable is changed to B and then back to A. So even though the variable contains the same value as before, the intermediate change to B causes the SC (and therefore the compare_exchange_weak) to fail. compare_exchange_strong cannot fail spuriously, but to achieve that it has to use a retry-loop on architectures with LL/SC.
I am not entirely sure what you mean by fetch_add being "strong or weak". fetch_add cannot fail - it simply performs an atomic update of some variable by adding the provided value, and returns the old value of the variable. Whether this can be translated to a single instruction (like on Intel) or to a retry loop with LL/SC (Power) or CAS (Sparc) depends on the target architecture. Either way, the variable is guaranteed to be updated correctly.

Does the C++ standard guarantee no preemption will happen during an atomic operation? If so, where in the standard can I find this?
No, it doesn't. Since there's really no way code could tell whether or not this happened (it's indistinguishable from pre-emption either before or after the atomic operation depending on circumstances) there is no reason for it to.
Do some architectures provide atomic related instructions which guarantee no-preemption? Or is my understanding wrong?
There would be no point since the operation must appear to be atomic anyway, so pre-emption during would always be identical in observed behavior to pre-emption before or after. If you can write code that ever sees a case where pre-emption during the atomic operation causes observable effects different from either pre-emption before or pre-emption after, that platform is broken since the operation does not behave atomically.

Can num++ be atomic for 'int num'?

In general, for int num, num++ (or ++num), as a read-modify-write operation, is not atomic. But I often see compilers, for example GCC, generate the following code for it (try here):
void f()
{
int num = 0;
num++;
}
f():
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
add DWORD PTR [rbp-4], 1
nop
pop rbp
ret
Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?
And if so, does it mean that so-generated num++ can be used in concurrent (multi-threaded) scenarios without any danger of data races (i.e. we don't need to make it, for example, std::atomic<int> and impose the associated costs, since it's atomic anyway)?
UPDATE
Notice that this question is not whether increment is atomic (it's not and that was and is the opening line of the question). It's whether it can be in particular scenarios, i.e. whether one-instruction nature can in certain cases be exploited to avoid the overhead of the lock prefix. And, as the accepted answer mentions in the section about uniprocessor machines, as well as this answer, the conversation in its comments and others explain, it can (although not with C or C++).

This is absolutely what C++ defines as a Data Race that causes Undefined Behaviour, even if one compiler happened to produce code that did what you hoped on some target machine. You need to use std::atomic for reliable results, but you can use it with memory_order_relaxed if you don't care about reordering. See below for some example code and asm output using fetch_add.
But first, the assembly language part of the question:
Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?
Memory-destination instructions (other than pure stores) are read-modify-write operations that happen in multiple internal steps. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc., etc.
Memory operations from other CPUs can become globally visible between the load and store. I.e. two threads running add dword [num], 1 in a loop would step on each other's stores. (See #Margaret's answer for a nice diagram). After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.
"Atomic", from the Greek word meaning indivisible, means that no observer can see the operation as separate steps. Happening physically / electrically instantaneously for all bits simultaneously is just one way to achieve this for a load or store, but that's not even possible for an ALU operation. I went into a lot more detail about pure loads and pure stores in my answer to Atomicity on x86, while this answer focuses on read-modify-write.
The lock prefix can be applied to many read-modify-write (memory destination) instructions to make the entire operation atomic with respect to all possible observers in the system (other cores and DMA devices, not an oscilloscope hooked up to the CPU pins). That is why it exists. (See also this Q&A).
So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line pinned in Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.
Without the lock prefix, another core could take ownership of the cache line and modify it after our load but before our store, so that other store would become globally visible in between our load and store. Several other answers get this wrong, and claim that without lock you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches.
(If a locked instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don't misalign your atomic variables!)
Note that the lock prefix also turns an instruction into a full memory barrier (like MFENCE), stopping all run-time reordering and thus giving sequential consistency. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)
On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another question, and the comments under it, where I explain this in more detail.
Back to C++:
It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation:
;; Valid compiler output for num++
mov eax, [num]
inc eax
mov [num], eax
This is very likely if you use the value of num later: the compiler will keep it live in a register after the increment. So even if you check how num++ compiles on its own, changing the surrounding code can affect it.
(If the value isn't needed later, inc dword [num] is preferred; modern x86 CPUs will run a memory-destination RMW instruction at least as efficiently as using three separate instructions. Fun fact: gcc -O3 -m32 -mtune=i586 will actually emit this, because (Pentium) P5's superscalar pipeline didn't decode complex instructions to multiple simple micro-operations the way P6 and later microarchitectures do. See the Agner Fog's instruction tables / microarchitecture guide for more info, and the x86 tag wiki for many useful links (including Intel's x86 ISA manuals, which are freely available as PDF)).
Don't confuse the target memory model (x86) with the C++ memory model
Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.
Classic example: Storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.
You might be expecting that this code will synchronize with other threads:
// int flag; is just a plain global, not std::atomic<int>.
flag--; // Pretend this is supposed to be some kind of locking attempt
modify_a_data_structure(&foo); // doesn't look at flag, and the compiler knows this. (Assume it can see the function def). Otherwise the usual don't-break-single-threaded-code rules come into play!
flag++;
But it won't. The compiler is free to move the flag++ across the function call (if it inlines the function or knows that it doesn't look at flag). Then it can optimize away the modification entirely, because flag isn't even volatile.
(And no, C++ volatile is not a useful substitute for std::atomic. std::atomic does make the compiler assume that values in memory can be modified asynchronously similar to volatile, but there's much more to it than that. (In practice there are similarities between volatile int to std::atomic with mo_relaxed for pure-load and pure-store operations, but not for RMWs). Also, volatile std::atomic<int> foo is not necessarily the same as std::atomic<int> foo, although current compilers don't optimize atomics (e.g. 2 back-to-back stores of the same value) so volatile atomic wouldn't change the code-gen.)
Defining data races on non-atomic variables as Undefined Behaviour is what lets the compiler still hoist loads and sink stores out of loops, and many other optimizations for memory that multiple threads might have a reference to. (See this LLVM blog for more about how UB enables compiler optimizations.)
As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but it can be much more efficient on other architectures (like ARM). Even on x86, relaxed allows more compile-time reordering.
This is what GCC actually does on x86, for a few functions that operate on a std::atomic global variable.
See the source + assembly language code formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of assembly language code you get from atomics for those targets.
#include <atomic>
std::atomic<int> num;
void inc_relaxed() {
num.fetch_add(1, std::memory_order_relaxed);
}
int load_num() { return num; } // Even seq_cst loads are free on x86
void store_num(int val){ num = val; }
void store_num_release(int val){
num.store(val, std::memory_order_release);
}
// Can the compiler collapse multiple atomic operations into one? No, it can't.
# g++ 6.2 -O3, targeting x86-64 System V calling convention. (First argument in edi/rdi)
inc_relaxed():
lock add DWORD PTR num[rip], 1 #### Even relaxed RMWs need a lock. There's no way to request just a single-instruction RMW with no lock, for synchronizing between a program and signal handler for example. :/ There is atomic_signal_fence for ordering, but nothing for RMW.
ret
inc_seq_cst():
lock add DWORD PTR num[rip], 1
ret
load_num():
mov eax, DWORD PTR num[rip]
ret
store_num(int):
mov DWORD PTR num[rip], edi
mfence ##### seq_cst stores need an mfence
ret
store_num_release(int):
mov DWORD PTR num[rip], edi
ret ##### Release and weaker doesn't.
store_num_relaxed(int):
mov DWORD PTR num[rip], edi
ret
Notice how MFENCE (a full barrier) is needed after a sequential-consistency stores. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.
Re: discussion in comments on #Richard Hodges' answer about compilers merging std::atomic num++; num-=2; operations into one num--; instruction:
A separate Q&A on this same subject: Why don't compilers merge redundant std::atomic writes?, where my answer restates a lot of what I wrote below.
Current compilers don't actually do this (yet), but not because they aren't allowed to. C++ WG21/P0062R1: When should compilers optimize atomics? discusses the expectation that many programmers have that compilers won't make "surprising" optimizations, and what the standard can do to give programmers control. N4455 discusses many examples of things that can be optimized, including this one. It points out that inlining and constant-propagation can introduce things like fetch_or(0) which may be able to turn into just a load() (but still has acquire and release semantics), even when the original source didn't have any obviously redundant atomic ops.
The real reasons compilers don't do it (yet) are: (1) nobody's written the complicated code that would allow the compiler to do that safely (without ever getting it wrong), and (2) it potentially violates the principle of least surprise. Lock-free code is hard enough to write correctly in the first place. So don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It's not always easy easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).
Getting back to num++; num-=2; compiling as if it were num--:
Compilers are allowed to do this, unless num is volatile std::atomic<int>. If a reordering is possible, the as-if rule allows the compiler to decide at compile time that it always happens that way. Nothing guarantees that an observer could see the intermediate values (the num++ result).
I.e. if the ordering where nothing becomes globally visible between these operations is compatible with the ordering requirements of the source
(according to the C++ rules for the abstract machine, not the target architecture), the compiler can emit a single lock dec dword [num] instead of lock inc dword [num] / lock sub dword [num], 2.
num++; num-- can't disappear, because it still has a Synchronizes With relationship with other threads that look at num, and it's both an acquire-load and a release-store which disallows reordering of other operations in this thread. For x86, this might be able to compile to an MFENCE, instead of a lock add dword [num], 0 (i.e. num += 0).
As discussed in PR0062, more aggressive merging of non-adjacent atomic ops at compile time can be bad (e.g. a progress counter only gets updated once at the end instead of every iteration), but it can also help performance without downsides (e.g. skipping the atomic inc / dec of ref counts when a copy of a shared_ptr is created and destroyed, if the compiler can prove that another shared_ptr object exists for entire lifespan of the temporary.)
Even num++; num-- merging could hurt fairness of a lock implementation when one thread unlocks and re-locks right away. If it's never actually released in the asm, even hardware arbitration mechanisms won't give another thread a chance to grab the lock at that point.
With current gcc6.2 and clang3.9, you still get separate locked operations even with memory_order_relaxed in the most obviously optimizable case. (Godbolt compiler explorer so you can see if the latest versions are different.)
void multiple_ops_relaxed(std::atomic<unsigned int>& num) {
num.fetch_add( 1, std::memory_order_relaxed);
num.fetch_add(-1, std::memory_order_relaxed);
num.fetch_add( 6, std::memory_order_relaxed);
num.fetch_add(-5, std::memory_order_relaxed);
//num.fetch_add(-1, std::memory_order_relaxed);
}
multiple_ops_relaxed(std::atomic<unsigned int>&):
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
ret

Without many complications an instruction like add DWORD PTR [rbp-4], 1 is very CISC-style.
It perform three operations: load the operand from memory, increment it, store the operand back to memory.
During these operations the CPU acquire and release the bus twice, in between any other agent can acquire it too and this violates the atomicity.
AGENT 1 AGENT 2
load X
inc C
load X
inc C
store X
store X
X is incremented only once.

...and now let's enable optimisations:
f():
rep ret
OK, let's give it a chance:
void f(int& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
result:
f(int&):
mov DWORD PTR [rdi], 0
ret
another observing thread (even ignoring cache synchronisation delays) has no opportunity to observe the individual changes.
compare to:
#include <atomic>
void f(std::atomic<int>& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
where the result is:
f(std::atomic<int>&):
mov DWORD PTR [rdi], 0
mfence
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
lock sub DWORD PTR [rdi], 1
ret
Now, each modification is:-
observable in another thread, and
respectful of similar modifications happening in other threads.
atomicity is not just at the instruction level, it involves the whole pipeline from processor, through the caches, to memory and back.
Further info
Regarding the effect of optimisations of updates of std::atomics.
The c++ standard has the 'as if' rule, by which it is permissible for the compiler to reorder code, and even rewrite code provided that the outcome has the exact same observable effects (including side-effects) as if it had simply executed your code.
The as-if rule is conservative, particularly involving atomics.
consider:
void incdec(int& num) {
++num;
--num;
}
Because there are no mutex locks, atomics or any other constructs that influence inter-thread sequencing, I would argue that the compiler is free to rewrite this function as a NOP, eg:
void incdec(int&) {
// nada
}
This is because in the c++ memory model, there is no possibility of another thread observing the result of the increment. It would of course be different if num was volatile (might influence hardware behaviour). But in this case, this function will be the only function modifying this memory (otherwise the program is ill-formed).
However, this is a different ball game:
void incdec(std::atomic<int>& num) {
++num;
--num;
}
num is an atomic. Changes to it must be observable to other threads that are watching. Changes those threads themselves make (such as setting the value to 100 in between the increment and decrement) will have very far-reaching effects on the eventual value of num.
Here is a demo:
#include <thread>
#include <atomic>
int main()
{
for (int iter = 0 ; iter < 20 ; ++iter)
{
std::atomic<int> num = { 0 };
std::thread t1([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
++num;
--num;
}
});
std::thread t2([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
num = 100;
}
});
t2.join();
t1.join();
std::cout << num << std::endl;
}
}
sample output:
99
99
99
99
99
100
99
99
100
100
100
100
99
99
100
99
99
100
100
99

The add instruction is not atomic. It references memory, and two processor cores may have different local cache of that memory.
IIRC the atomic variant of the add instruction is called lock xadd

Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?
It is dangerous to draw conclusions based on "reverse engineering" generated assembly. For example, you seem to have compiled your code with optimization disabled, otherwise the compiler would have thrown away that variable or loaded 1 directly to it without invoking operator++. Because the generated assembly may change significantly, based on optimization flags, target CPU, etc., your conclusion is based on sand.
Also, your idea that one assembly instruction means an operation is atomic is wrong as well. This add will not be atomic on multi-CPU systems, even on the x86 architecture.

Even if your compiler always emitted this as an atomic operation, accessing num from any other thread concurrently would constitute a data race according to the C++11 and C++14 standards and the program would have undefined behavior.
But it is worse than that. First, as has been mentioned, the instruction generated by the compiler when incrementing a variable may depend on the optimization level. Secondly, the compiler may reorder other memory accesses around ++num if num is not atomic, e.g.
int main()
{
std::unique_ptr<std::vector<int>> vec;
int ready = 0;
std::thread t{[&]
{
while (!ready);
// use "vec" here
});
vec.reset(new std::vector<int>());
++ready;
t.join();
}
Even if we assume optimistically that ++ready is "atomic", and that the compiler generates the checking loop as needed (as I said, it's UB and therefore the compiler is free to remove it, replace it with an infinite loop, etc.), the compiler might still move the pointer assignment, or even worse the initialization of the vector to a point after the increment operation, causing chaos in the new thread. In practice, I would not be surprised at all if an optimizing compiler removed the ready variable and the checking loop completely, as this does not affect observable behavior under language rules (as opposed to your private hopes).
In fact, at last year's Meeting C++ conference, I've heard from two compiler developers that they very gladly implement optimizations that make naively written multi-threaded programs misbehave, as long as language rules allow it, if even a minor performance improvement is seen in correctly written programs.
Lastly, even if you didn't care about portability, and your compiler was magically nice, the CPU you are using is very likely of a superscalar CISC type and will break down instructions into micro-ops, reorder and/or speculatively execute them, to an extent only limited by synchronizing primitives such as (on Intel) the LOCK prefix or memory fences, in order to maximize operations per second.
To make a long story short, the natural responsibilities of thread-safe programming are:
Your duty is to write code that has well-defined behavior under language rules (and in particular the language standard memory model).
Your compiler's duty is to generate machine code which has the same well-defined (observable) behavior under the target architecture's memory model.
Your CPU's duty is to execute this code so that the observed behavior is compatible with its own architecture's memory model.
If you want to do it your own way, it might just work in some cases, but understand that the warranty is void, and you will be solely responsible for any unwanted outcomes. :-)
PS: Correctly written example:
int main()
{
std::unique_ptr<std::vector<int>> vec;
std::atomic<int> ready{0}; // NOTE the use of the std::atomic template
std::thread t{[&]
{
while (!ready);
// use "vec" here
});
vec.reset(new std::vector<int>());
++ready;
t.join();
}
This is safe because:
The checks of ready cannot be optimized away according to language rules.
The ++ready happens-before the check that sees ready as not zero, and other operations cannot be reordered around these operations. This is because ++ready and the check are sequentially consistent, which is another term described in the C++ memory model and that forbids this specific reordering. Therefore the compiler must not reorder the instructions, and must also tell the CPU that it must not e.g. postpone the write to vec to after the increment of ready. Sequentially consistent is the strongest guarantee regarding atomics in the language standard. Lesser (and theoretically cheaper) guarantees are available e.g. via other methods of std::atomic<T>, but these are definitely for experts only, and may not be optimized much by the compiler developers, because they are rarely used.

On a single-core x86 machine, an add instruction will generally be atomic with respect to other code on the CPU1. An interrupt can't split a single instruction down the middle.
Out-of-order execution is required to preserve the illusion of instructions executing one at a time in order within a single core, so any instruction running on the same CPU will either happen completely before or completely after the add.
Modern x86 systems are multi-core, so the uniprocessor special case doesn't apply.
If one is targeting a small embedded PC and has no plans to move the code to anything else, the atomic nature of the "add" instruction could be exploited. On the other hand, platforms where operations are inherently atomic are becoming more and more scarce.
(This doesn't help you if you're writing in C++, though. Compilers don't have an option to require num++ to compile to a memory-destination add or xadd without a lock prefix. They could choose to load num into a register and store the increment result with a separate instruction, and will likely do that if you use the result.)
Footnote 1: The lock prefix existed even on original 8086 because I/O devices operate concurrently with the CPU; drivers on a single-core system need lock add to atomically increment a value in device memory if the device can also modify it, or with respect to DMA access.

Back in the day when x86 computers had one CPU, the use of a single instruction ensured that interrupts would not split the read/modify/write and if the memory would not be used as a DMA buffer too, it was atomic in fact (and C++ did not mention threads in the standard, so this wasn’t addressed).
When it was rare to have a dual processor (e.g. dual-socket Pentium Pro) on a customer desktop, I effectively used this to avoid the LOCK prefix on a single-core machine and improve performance.
Today, it would only help against multiple threads that were all set to the same CPU affinity, so the threads you are worried about would only come into play via time slice expiring and running the other thread on the same CPU (core). That is not realistic.
With modern x86/x64 processors, the single instruction is broken up into several micro ops and furthermore the memory reading and writing is buffered. So different threads running on different CPUs will not only see this as non-atomic but may see inconsistent results concerning what it reads from memory and what it assumes other threads have read to that point in time: you need to add memory fences to restore sane behavior.

No.
https://www.youtube.com/watch?v=31g0YE61PLQ
(That's just a link to the "No" scene from "The Office")
Do you agree that this would be a possible output for the program:
sample output:
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
If so, then the compiler is free to make that the only possible output for the program, in whichever way the compiler wants. ie a main() that just puts out 100s.
This is the "as-if" rule.
And regardless of output, you can think of thread synchronization the same way - if thread A does num++; num--; and thread B reads num repeatedly, then a possible valid interleaving is that thread B never reads between num++ and num--. Since that interleaving is valid, the compiler is free to make that the only possible interleaving. And just remove the incr/decr entirely.
There are some interesting implications here:
while (working())
progress++; // atomic, global
(ie imagine some other thread updates a progress bar UI based on progress)
Can the compiler turn this into:
int local = 0;
while (working())
local++;
progress += local;
probably that is valid. But probably not what the programmer was hoping for :-(
The committee is still working on this stuff. Currently it "works" because compilers don't optimize atomics much. But that is changing.
And even if progress was also volatile, this would still be valid:
int local = 0;
while (working())
local++;
while (local--)
progress++;
:-/

That a single compiler's output, on a specific CPU architecture, with optimizations disabled (since gcc doesn't even compile ++ to add when optimizing in a quick&dirty example), seems to imply incrementing this way is atomic doesn't mean this is standard-compliant (you would cause undefined behavior when trying to access num in a thread), and is wrong anyways, because add is not atomic in x86.
Note that atomics (using the lock instruction prefix) are relatively heavy on x86 (see this relevant answer), but still remarkably less than a mutex, which isn't very appropriate in this use-case.
Following results are taken from clang++ 3.8 when compiling with -Os.
Incrementing an int by reference, the "regular" way :
void inc(int& x)
{
++x;
}
This compiles into :
inc(int&):
incl (%rdi)
retq
Incrementing an int passed by reference, the atomic way :
#include <atomic>
void inc(std::atomic<int>& x)
{
++x;
}
This example, which is not much more complex than the regular way, just gets the lock prefix added to the incl instruction - but caution, as previously stated this is not cheap. Just because assembly looks short doesn't mean it's fast.
inc(std::atomic<int>&):
lock incl (%rdi)
retq

Yes, but...
Atomic is not what you meant to say. You're probably asking the wrong thing.
The increment is certainly atomic. Unless the storage is misaligned (and since you left alignment to the compiler, it is not), it is necessarily aligned within a single cache line. Short of special non-caching streaming instructions, each and every write goes through the cache. Complete cache lines are being atomically read and written, never anything different.
Smaller-than-cacheline data is, of course, also written atomically (since the surrounding cache line is).
Is it thread-safe?
This is a different question, and there are at least two good reasons to answer with a definite "No!".
First, there is the possibility that another core might have a copy of that cache line in L1 (L2 and upwards is usually shared, but L1 is normally per-core!), and concurrently modifies that value. Of course that happens atomically, too, but now you have two "correct" (correctly, atomically, modified) values -- which one is the truly correct one now?
The CPU will sort it out somehow, of course. But the result may not be what you expect.
Second, there is memory ordering, or worded differently happens-before guarantees. The most important thing about atomic instructions is not so much that they are atomic. It's ordering.
You have the possibility of enforcing a guarantee that everything that happens memory-wise is realized in some guaranteed, well-defined order where you have a "happened before" guarantee. This ordering may be as "relaxed" (read as: none at all) or as strict as you need.
For example, you can set a pointer to some block of data (say, the results of some calculation) and then atomically release the "data is ready" flag. Now, whoever acquires this flag will be led into thinking that the pointer is valid. And indeed, it will always be a valid pointer, never anything different. That's because the write to the pointer happened-before the atomic operation.

When your compiler uses only a single instruction for the increment and your machine is single-threaded, your code is safe. ^^

Try compiling the same code on a non-x86 machine, and you'll quickly see very different assembly results.
The reason num++ appears to be atomic is because on x86 machines, incrementing a 32-bit integer is, in fact, atomic (assuming no memory retrieval takes place). But this is neither guaranteed by the c++ standard, nor is it likely to be the case on a machine that doesn't use the x86 instruction set. So this code is not cross-platform safe from race conditions.
You also don't have a strong guarantee that this code is safe from Race Conditions even on an x86 architecture, because x86 doesn't set up loads and stores to memory unless specifically instructed to do so. So if multiple threads tried to update this variable simultaneously, they may end up incrementing cached (outdated) values
The reason, then, that we have std::atomic<int> and so on is so that when you're working with an architecture where the atomicity of basic computations is not guaranteed, you have a mechanism that will force the compiler to generate atomic code.

"pseudo-atomic" operations in C++

So I'm aware that nothing is atomic in C++. But I'm trying to figure out if there are any "pseudo-atomic" assumptions I can make. The reason is that I want to avoid using mutexes in some simple situations where I only need very weak guarantees.
1) Suppose I have globally defined volatile bool b, which
initially I set true. Then I launch a thread which executes a loop
while(b) doSomething();
Meanwhile, in another thread, I execute b=true.
Can I assume that the first thread will continue to execute? In other words, if b starts out as true, and the first thread checks the value of b at the same time as the second thread assigns b=true, can I assume that the first thread will read the value of b as true? Or is it possible that at some intermediate point of the assignment b=true, the value of b might be read as false?
2) Now suppose that b is initially false. Then the first thread executes
bool b1=b;
bool b2=b;
if(b1 && !b2) bad();
while the second thread executes b=true. Can I assume that bad() never gets called?
3) What about an int or other builtin types: suppose I have volatile int i, which is initially (say) 7, and then I assign i=7. Can I assume that, at any time during this operation, from any thread, the value of i will be equal to 7?
4) I have volatile int i=7, and then I execute i++ from some thread, and all other threads only read the value of i. Can I assume that i never has any value, in any thread, except for either 7 or 8?
5) I have volatile int i, from one thread I execute i=7, and from another I execute i=8. Afterwards, is i guaranteed to be either 7 or 8 (or whatever two values I have chosen to assign)?

There are no threads in standard C++, and Threads cannot be implemented as a library.
Therefore, the standard has nothing to say about the behaviour of programs which use threads. You must look to whatever additional guarantees are provided by your threading implementation.
That said, in threading implementations I've used:
(1) yes, you can assume that irrelevant values aren't written to variables. Otherwise the whole memory model goes out the window. But be careful that when you say "another thread" never sets b to false, that means anywhere, ever. If it does, that write could perhaps be re-ordered to occur during your loop.
(2) no, the compiler can re-order the assignments to b1 and b2, so it is possible for b1 to end up true and b2 false. In such a simple case I don't know why it would re-order, but in more complex cases there might be very good reasons.
[Edit: oops, by the time I got to answering (2) I'd forgotten that b was volatile. Reads from a volatile variable won't be re-ordered, sorry, so yes on a typical threading implementation (if there is any such thing), you can assume that you won't end up with b1 true and b2 false.]
(3) same as 1. volatile in general has nothing to do with threading at all. However, it is quite exciting in some implementations (Windows), and might in effect imply memory barriers.
(4) on an architecture where int writes are atomic yes, although volatile has nothing to do with it. See also...
(5) check the docs carefully. Likely yes, and again volatile is irrelevant, because on almost all architectures int writes are atomic. But if int write is not atomic, then no (and no for the previous question), even if it's volatile you could in principle get a different value. Given those values 7 and 8, though, we're talking a pretty weird architecture for the byte containing the relevant bits to be written in two stages, but with different values you could more plausibly get a partial write.
For a more plausible example, suppose that for some bizarre reason you have a 16 bit int on a platform where only 8bit writes are atomic. Odd, but legal, and since int must be at least 16 bits you can see how it could come about. Suppose further that your initial value is 255. Then increment could legally be implemented as:
read the old value
increment in a register
write the most significant byte of the result
write the least significant byte of the result.
A read-only thread which interrupted the incrementing thread between the third and fourth steps of that, could see the value 511. If the writes are in the other order, it could see 0.
An inconsistent value could be left behind permanently if one thread is writing 255, another thread is concurrently writing 256, and the writes get interleaved. Impossible on many architectures, but to know that this won't happen you need to know at least something about the architecture. Nothing in the C++ standard forbids it, because the C++ standard talks about execution being interrupted by a signal, but otherwise has no concept of execution being interrupted by another part of the program, and no concept of concurrent execution. That's why threads aren't just another library - adding threads fundamentally changes the C++ execution model. It requires the implementation to do things differently, as you'll eventually discover if for example you use threads under gcc and forget to specify -pthreads.
The same could happen on a platform where aligned int writes are atomic, but unaligned int writes are permitted and not atomic. For example IIRC on x86, unaligned int writes are not guaranteed atomic if they cross a cache line boundary. x86 compilers will not mis-align a declared int variable, for this reason and others. But if you play games with structure packing you could probably provoke an example.
So: pretty much any implementation will give you the guarantees you need, but might do so in quite a complicated way.
In general, I've found that it is not worth trying to rely on platform-specific guarantees about memory access, that I don't fully understand, in order to avoid mutexes. Use a mutex, and if that's too slow use a high-quality lock-free structure (or implement a design for one) written by someone who really knows the architecture and compiler. It will probably be correct, and subject to correctness will probably outperform anything I invent myself.

Most of the answers correctly address the CPU memory ordering issues you're going to experience, but none have detailed how the compiler can thwart your intentions by re-ordering your code in ways that break your assumptions.
Consider an example taken from this post:
volatile int ready;
int message[100];
void foo(int i)
{
message[i/10] = 42;
ready = 1;
}
At -O2 and above, recent versions of GCC and Intel C/C++ (don't know about VC++) will do the store to ready first, so it can be overlapped with computation of i/10 (volatile does not save you!):
leaq _message(%rip), %rax
movl $1, _ready(%rip) ; <-- whoa Nelly!
movq %rsp, %rbp
sarl $2, %edx
subl %edi, %edx
movslq %edx,%rdx
movl $42, (%rax,%rdx,4)
This isn't a bug, it's the optimizer exploiting CPU pipelining. If another thread is waiting on ready before accessing the contents of message then you have a nasty and obscure race.
Employ compiler barriers to ensure your intent is honored. An example that also exploits the relatively strong ordering of x86 are the release/consume wrappers found in Dmitriy Vyukov's Single-Producer Single-Consumer queue posted here:
// load with 'consume' (data-dependent) memory ordering
// NOTE: x86 specific, other platforms may need additional memory barriers
template<typename T>
T load_consume(T const* addr)
{
T v = *const_cast<T const volatile*>(addr);
__asm__ __volatile__ ("" ::: "memory"); // compiler barrier
return v;
}
// store with 'release' memory ordering
// NOTE: x86 specific, other platforms may need additional memory barriers
template<typename T>
void store_release(T* addr, T v)
{
__asm__ __volatile__ ("" ::: "memory"); // compiler barrier
*const_cast<T volatile*>(addr) = v;
}
I suggest that if you are going to venture into the realm of concurrent memory access, use a library that will take care of these details for you. While we all wait for n2145 and std::atomic check out Thread Building Blocks' tbb::atomic or the upcoming boost::atomic.
Besides correctness, these libraries can simplify your code and clarify your intent:
// thread 1
std::atomic<int> foo; // or tbb::atomic, boost::atomic, etc
foo.store(1, std::memory_order_release);
// thread 2
int tmp = foo.load(std::memory_order_acquire);
Using explicit memory ordering, foo's inter-thread relationship is clear.

May be this thread is ancient, but the C++ 11 standard DOES have a thread library and also a vast atomic library for atomic operations. The purpose is specifically for concurrency support and avoid data races.
The relevant header is atomic

It's generally a really, really bad idea to depend on this, as you could end up with bad things happening and only one some architectures. The best solution would be to use a guaranteed atomic API, for example the Windows Interlocked api.

If your C++ implementation supplies the library of atomic operations specified by n2145 or some variant thereof, you can presumably rely on it. Otherwise, you cannot in general rely on "anything" about atomicity at the language level, since multitasking of any kind (and therefore atomicity, which deals with multitasking) is not specified by the existing C++ standard.

Volatile in C++ do not plays the same role than in Java. All the cases are undefined behavior as Steve saids. Some cases can be Ok for a compiler, oa given processor architecture and with a multi-threading system, but switching the optimization flags can make your program behave differently, as the C++03 compilers don't know about threads.
C++0x defines the rules that avoid race conditions and the operations that help you to master that, but to may knowledge there is no yet a compiler that implement yet all the part of the standard related to this subject.

My answer is going to be frustrating: No, No, No, No, and No.
1-4) The compiler is allowed to do ANYTHING it pleases with a variable it writes to. It may store temporary values in it, so long as ends up doing something that would do the same thing as that thread executing in a vacuum. ANYTHING is valid
5) Nope, no guarantee. If a variable is not atomic, and you write to it on one thread, and read or write to it on another, it is a race case. The spec declares such race cases to be undefined behavior, and absolutely anything goes. That being said, you will be hard pressed to find a compiler that does not give you 7 or 8, but it IS legal for a compiler to give you something else.
I always refer to this highly comical explanation of race cases.
http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong

Reading interlocked variables

Assume:
A. C++ under WIN32.
B. A properly aligned volatile integer incremented and decremented using InterlockedIncrement() and InterlockedDecrement().
__declspec (align(8)) volatile LONG _ServerState = 0;
If I want to simply read _ServerState, do I need to read the variable via an InterlockedXXX function?
For instance, I have seen code such as:
LONG x = InterlockedExchange(&_ServerState, _ServerState);
and
LONG x = InterlockedCompareExchange(&_ServerState, _ServerState, _ServerState);
The goal is to simply read the current value of _ServerState.
Can't I simply say:
if (_ServerState == some value)
{
// blah blah blah
}
There seems to be some confusion WRT this subject. I understand register-sized reads are atomic in Windows, so I would assume the InterlockedXXX function is unnecessary.
Matt J.
Okay, thanks for the responses. BTW, this is Visual C++ 2005 and 2008.
If it's true I should use an InterlockedXXX function to read the value of _ServerState, even if just for the sake of clarity, what's the best way to go about that?
LONG x = InterlockedExchange(&_ServerState, _ServerState);
This has the side effect of modifying the value, when all I really want to do is read it. Not only that, but there is a possibility that I could reset the flag to the wrong value if there is a context switch as the value of _ServerState is pushed on the stack in preparation of calling InterlockedExchange().
LONG x = InterlockedCompareExchange(&_ServerState, _ServerState, _ServerState);
I took this from an example I saw on MSDN.
See http://msdn.microsoft.com/en-us/library/ms686355(VS.85).aspx
All I need is something along the lines:
lock mov eax, [_ServerState]
In any case, the point, which I thought was clear, is to provide thread-safe access to a flag without incurring the overhead of a critical section. I have seen LONGs used this way via the InterlockedXXX() family of functions, hence my question.
Okay, we are thinking a good solution to this problem of reading the current value is:
LONG Cur = InterlockedCompareExchange(&_ServerState, 0, 0);

It depends on what you mean by "goal is to simply read the current value of _ServerState" and it depends on what set of tools and the platform you use (you specify Win32 and C++, but not which C++ compiler, and that may matter).
If you simply want to read the value such that the value is uncorrupted (ie., if some other processor is changing the value from 0x12345678 to 0x87654321 your read will get one of those 2 values and not 0x12344321) then simply reading will be OK as long as the variable is :
marked volatile,
properly aligned, and
read using a single instruction with a word size that the processor handles atomically
None of this is promised by the C/C++ standard, but Windows and MSVC do make these guarantees, and I think that most compilers that target Win32 do as well.
However, if you want your read to be synchronized with behavior of the other thread, there's some additional complexity. Say that you have a simple 'mailbox' protocol:
struct mailbox_struct {
uint32_t flag;
uint32_t data;
};
typedef struct mailbox_struct volatile mailbox;
// the global - initialized before wither thread starts
mailbox mbox = { 0, 0 };
//***************************
// Thread A
while (mbox.flag == 0) {
/* spin... */
}
uint32_t data = mbox.data;
//***************************
//***************************
// Thread B
mbox.data = some_very_important_value;
mbox.flag = 1;
//***************************
The thinking is Thread A will spin waiting for mbox.flag to indicate mbox.data has a valid piece of information. Thread B will write some data into mailbox.data then will set mbox.flag to 1 as a signal that mbox.data is valid.
In this case a simple read in Thread A of mbox.flag might get the value 1 even though a subsequent read of mbox.data in Thread A does not get the value written by Thread B.
This is because even though the compiler will not reorder the Thread B writes to mbox.data and mbox.flag, the processor and/or cache might. C/C++ guarantees that the compiler will generate code such that Thread B will write to mbox.data before it writes to mbox.flag, but the processor and cache might have a different idea - special handling called 'memory barriers' or 'acquire and release semantics' must be used to ensure ordering below the level of the thread's stream of instructions.
I'm not sure if compilers other than MSVC make any claims about ordering below the instruction level. However MS does guarantee that for MSVC volatile is enough - MS specifies that volatile writes have release semantics and volatile reads have acquire semantics - though I'm not sure at which version of MSVC this applies - see http://msdn.microsoft.com/en-us/library/12a04hfd.aspx?ppud=4.
I have also seen code like you describe that uses Interlocked APIs to perform simple reads and writes to shared locations. My take on the matter is to use the Interlocked APIs. Lock free inter-thread communication is full of very difficult to understand and subtle pitfalls, and trying to take a shortcut on a critical bit of code that may end up with a very difficult to diagnose bug doesn't seem like a good idea to me. Also, using an Interlocked API screams to anyone maintaining the code, "this is data access that needs to be shared or synchronized with something else - tread carefully!".
Also when using the Interlocked API you're taking the specifics of the hardware and the compiler out of the picture - the platform makes sure all of that stuff is dealt with properly - no more wondering...
Read Herb Sutter's Effective Concurrency articles on DDJ (which happen to be down at the moment, for me at least) for good information on this topic.

Your way is good:
LONG Cur = InterlockedCompareExchange(&_ServerState, 0, 0);
I'm using similar solution:
LONG Cur = InterlockedExchangeAdd(&_ServerState, 0);

Interlocked instructions provide atomicity and inter-processor synchronization. Both writes and reads must be synchronized, so yes, you should be using interlocked instructions to read a value that is shared between threads and not protected by a lock. Lock-free programming (and that's what you're doing) is a very tricky area, so you might consider using locks instead. Unless this is really one of your program's bottlenecks that must be optimized?

32-bit read operations are already atomic on some 32-bit systems (Intel spec says these operations are atomic, but there's no guarantee that this will be true on other x86-compatible platforms). So you shouldn't use this for threads synchronization.
If you need a flag some sort you should consider using Event object and WaitForSingleObject function for that purpose.

To anyone who has to revisit this thread I want to add to what was well explained by Bartosz that _InterlockedCompareExchange() is a good alternative to standard atomic_load() if standard atomics are not available. Here is the code for atomically reading my_uint32_t_var in C on i86 Win64. atomic_load() is included as a benchmark:
long debug_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
00000001401A6955 mov eax,dword ptr [rbp+30h]
00000001401A6958 xor edi,edi
00000001401A695A mov dword ptr [rbp-0Ch],eax
debug_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
00000001401A695D xor eax,eax
00000001401A695F lock cmpxchg dword ptr [rbp+30h],edi
00000001401A6964 mov dword ptr [rbp-0Ch],eax
debug_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
00000001401A6967 prefetchw [rbp+30h]
00000001401A696B mov eax,dword ptr [rbp+30h]
00000001401A696E xchg ax,ax
00000001401A6970 mov ecx,eax
00000001401A6972 lock cmpxchg dword ptr [rbp+30h],ecx
00000001401A6977 jne foo+30h (01401A6970h)
00000001401A6979 mov dword ptr [rbp-0Ch],eax
long release_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
00000001401A6955 mov eax,dword ptr [rbp+30h]
release_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
00000001401A6958 mov dword ptr [rbp-0Ch],eax
00000001401A695B xor edi,edi
00000001401A695D mov eax,dword ptr [rbp-0Ch]
00000001401A6960 xor eax,eax
00000001401A6962 lock cmpxchg dword ptr [rbp+30h],edi
00000001401A6967 mov dword ptr [rbp-0Ch],eax
release_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
00000001401A696A prefetchw [rbp+30h]
00000001401A696E mov eax,dword ptr [rbp+30h]
00000001401A6971 mov ecx,eax
00000001401A6973 lock cmpxchg dword ptr [rbp+30h],ecx
00000001401A6978 jne foo+31h (01401A6971h)
00000001401A697A mov dword ptr [rbp-0Ch],eax

you should be okay. It's volatile, so the optimizer shouldn't savage you, and it's a 32-bit value so it should be at least approximately atomic. The one possible surprise is if the instruction pipeline can get around that.
On the other hand, what's the additional cost of using the guarded routines?

Current value reading may not need any lock.

The Interlocked* functions prevent two different processors from accessing the same piece of memory. In a single processor system you are going to be ok. If you have a dual-core system where you have threads on different cores both accessing this value, you might have problems doing what you think is atomic without the Interlocked*.

Your initial understanding is basically correct. According to the memory model which Windows requires on all MP platforms it supports (or ever will support), reads from a naturally-aligned variable marked volatile are atomic as long as they are smaller than the size of a machine word. Same with writes. You don't need a 'lock' prefix.
If you do the reads without using an interlock, you are subject to processor reordering. This can even occur on x86, in a limited circumstance: reads from a variable may be moved above writes of a different variable. On pretty much every non-x86 architecture that Windows supports, you are subject to even more complicated reordering if you don't use explicit interlocks.
There's also a requirement that if you're using a compare exchange loop, you must mark the variable you're compare exchanging on as volatile. Here's a code example to demonstrate why:
long g_var = 0; // not marked 'volatile' -- this is an error
bool foo () {
long oldValue;
long newValue;
long retValue;
// (1) Capture the original global value
oldValue = g_var;
// (2) Compute a new value based on the old value
newValue = SomeTransformation(oldValue);
// (3) Store the new value if the global value is equal to old?
retValue = InterlockedCompareExchange(&g_var,
newValue,
oldValue);
if (retValue == oldValue) {
return true;
}
return false;
}
What can go wrong is that the compiler is well within its rights to re-fetch oldValue from g_var at any time if it's not volatile. This 'rematerialization' optimization is great in many cases because it can avoid spilling registers to the stack when register pressure is high.
Thus, step (3) of the function would become:
// (3) Incorrectly store new value regardless of whether the global
// is equal to old.
retValue = InterlockedCompareExchange(&g_var,
newValue,
g_var);

Read is fine. A 32-bit value is always read as a whole as long as it's not split on a cache line. Your align 8 guarantees that it's always within a cache line so you'll be fine.
Forget about instructions reordering and all that non-sense. Results are always retired in-order. It would be a processor recall otherwise!!!
Even for a dual CPU machine (i.e. shared via the slowest FSBs), you'll still be fine as the CPUs guarantee cache coherency via MESI Protocol. The only thing you're not guaranteed is the value you read may not be the absolute latest. BUT, what is the latest anyway? That's something you likely won't need to know in most situations if you're not writing back to the location based on the value of that read. Otherwise, you'd have used interlocked ops to handle it in the first place.
In short, you gain nothing by using Interlocked ops on a read (except perhaps reminding the next person maintaining your code to tread carefully - then again, that person may not be qualified to maintain your code to begin with).
EDIT: In response to a comment left by Adrian McCarthy.
You're overlooking the effect of compiler optimizations. If the
compiler thinks it has the value already in a register, then it's
going to re-use that value instead of re-reading it from memory. Also,
the compiler may do instruction re-ordering for optimization if it
believes there are no observable side effects.
I did not say reading from a non-volatile variable is fine. All the question was asking was if interlocked was required. In fact, the variable in question was clearly declared with volatile. Or were you overlooking the effect of the keyword volatile?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js