std::atomic in a union with another character - c++

I recently read some code that had an atomic and a character in the same union. Something like this
union U {
std::atomic<char> atomic;
char character;
};
I am not entirely sure of the rules here, but the code comments said that since a character can alias anything, we can safely operate on the atomic variable if we promise not to change the last few bits of the byte. And the character only uses those last few bytes.
Is this allowed? Can we overlay an atomic integer over a character and have them both be active? If so what happens when one thread is trying to load the value from the atomic integer and another goes and makes a write to the character (only the last few bytes), will the char write be an atomic write? What happens there? Will the cache have to be flushed for the thread that is trying to load the atomic integer?
(This code looks smelly to me too, and I am not advocating using this. Just want to understand what parts of the above scheme can be defined and under what circumstances)
As requested the code is doing something like this
// thread using the atomic
while (!atomic.compare_exchange_weak(old_value, old_value | mask, ...) { ... }
// thread using the character
character |= 0b1; // set the 1st bit or something

the code comments said that since a character can alias anything, we can safely operate on the atomic variable if we promise not to change the last few bits of the byte.
Those comments are wrong. char-can-alias-anything doesn't stop this from being a data race on a non-atomic variable, so it's not allowed in theory, and worse, it's actually broken when compiled by any normal compiler (like gcc, clang, or MSVC) for any normal CPU (like x86).
The unit of atomicity is the memory location, not bits within a memory location. The ISO C++11 standard defines "memory location" carefully, so adjacent elements in a char[] array or a struct are separate locations (and thus it's not a race if two threads write c[0] and c[1] without synchronization). But adjacent bit-fields in a struct are not separate memory locations, and using |= on a non-atomic char aliased to the same address as an atomic<char> is definitely the same memory location, regardless of which bits are set in the right-hand side of the |=.
For a program to be free of data-race UB, if a memory location is written by any thread, all other threads that access that memory location (potentially) simultaneously must do so with atomic operations. (And probably also through the exact same object, i.e. changing the middle byte of an atomic<int> by type-punning to atomic<char> isn't guaranteed to be safe either. On most implementations on hardware similar to "normal" modern CPUs, type-punning to a different atomic type might happen to still be atomic if atomic<int/char> are both lock-free, but memory-ordering semantics might actually be broken, especially if it isn't perfectly overlapping.
Also, union type-punning in general is not allowed in ISO C++. I think you actually need to pointer-cast to char*, not make unions with char. Union type-punning is allowed in ISO C99, and as a GNU extension in GNU C89 and GNU C++, and also in some other C++ implementations.
So that takes care of the theory, but does any of this work on current CPUs? No, it's totally unsafe in practice, too.
character |= 1 will (on normal computers) compile to asm that loads the whole char, modifies the temporary, then stores the value back. On x86 this can all happen within one memory-destination or instruction if the compiler chooses to do that (which it won't if it also wants the value later). But even so, it's still a non-atomic RMW which can step on modifications to other bits.
Atomicity is expensive and optional for read-modify-write operations, and the only way to set some bits in a byte without affecting others is a read-modify-write on current CPUs. Compilers only emit asm that does it atomically if you specifically request it. (Unlike pure stores or pure loads, which are often naturally atomic. But always use std::atomic to get the other semantics you want...)
Consider this sequence of events:
thread A | thread B
-------------------|--------------
read tmp=c=0000 |
|
| c|=0b1100 # atomically, leaving c = 1100
tmp |= 1 # tmp=1 |
store c = tmp
Leaving c = 1, not the 1101 you're hoping for. i.e. the non-atomic load/store of the high bits stepped on the modification by thread B.
We get asm that can do exactly that, from compiling the source snippets from the question (on the Godbolt compiler explorer):
void t1(U &v, unsigned mask) {
// thread using the atomic
char old_value = v.atomic.load(std::memory_order_relaxed);
// with memory_order_seq_cst as the default for CAS
while (!v.atomic.compare_exchange_weak(old_value, old_value | mask)) {}
// v.atomic |= mask; // would have been easier and more efficient than CAS
}
t1(U&, unsigned int):
movzx eax, BYTE PTR [rdi] # atomic load of the old value
.L2:
mov edx, eax
or edx, esi # esi = mask (register arg)
lock cmpxchg BYTE PTR [rdi], dl # atomic CAS, uses AL implicitly as the expected value, same semantics as C++11 comp_exg seq_cst
jne .L2
ret
void t2(U &v) {
// thread using the character
v.character |= 0b1; // set the 1st bit or something
}
t2(U&):
or BYTE PTR [rdi], 1 # NON-ATOMIC RMW of the whole byte.
ret
It would be straightforward to write a program that ran v.character |= 1 in one thread, and an atomic v.atomic ^= 0b1100000 (or the equivalent with a CAS loop) in another thread.
If this code was safe, you'd always find that an even number of XOR operations modifying only the high bits left them zero. But you wouldn't find that, because the non-atomic or in the other thread might have stepped on an odd number of XOR operations. Or to make the problem easier to see, use addition with 0x10 or something, so instead of a 50% chance of being right by accident, you only have a 1 in 16 chance of the upper 4 bits being right.
This is pretty much exactly the same problem as losing counts when one of the increment operations is non-atomic.
Will the cache have to be flushed for the thread that is trying to load the atomic integer?
No, that's not how atomicity works. The problem isn't cache, it's that unless the CPU does something special, nothing stops other CPUs from reading or writing the location between when you load the old value and when you store the update value. You'd have the same problem on a multi-core system with no cache.
Of course, all systems do use cache, but cache is coherent so there's a hardware protocol (MESI) that stops different cores from simultaneously having conflicting values. When a store commits to L1D cache, it becomes globally visible. See Can num++ be atomic for 'int num'? for the details.

Related

How can I show that volatile assignment is not atomic?

I understand that assignment may not be atomic in C++.
I am trying to trigger a race condition to show this.
However my code below does not seem to trigger any such.
How can I change it so that it will eventually trigger a race condition?
#include <iostream>
#include <thread>
volatile uint64_t sharedValue = 1;
const uint64_t value1 = 13;
const uint64_t value2 = 1414;
void write() {
bool even = true;
for (;;) {
uint64_t value;
if (even)
value = value1;
else
value = value2;
sharedValue = value;
even = !even;
}
}
void read() {
for (;;) {
uint64_t value = sharedValue;
if (value != value1 && value != value2) {
std::cout << "Race condition! Value: " << value << std::endl << std::flush;
}
}
}
int main()
{
std::thread t1(write);
std::thread t2(read);
t1.join();
}
I am using VS 2017 and compiling in Release x86.
Here is the disassembly of the assignment:
sharedValue = value;
00D54AF2 mov eax,dword ptr [ebp-18h]
00D54AF5 mov dword ptr [sharedValue (0D5F000h)],eax
00D54AFA mov ecx,dword ptr [ebp-14h]
00D54AFD mov dword ptr ds:[0D5F004h],ecx
I guess this means that the assignment is not atomic?
Seems 32 of the bits are copied into the general purpose 32 bit register eax and the other 32 bits are copied into another general purpose 32 bit register ecx before being copied into sharedValue which is in the data segment register?
I also tried with uint32_t and all data was copied in one go.
So I guess that on x86 there is no need to use std::atomic for 32 bit data types?
Some answers/comments suggested sleeping in the writer. This is not useful; hammering away on the cache line changing it as often as possible is what you want. (And what you get with volatile assignments and reads.) An assignment will be torn when a MESI share request for the cache line arrives at the writer core between committing two halves of a store from the store buffer to L1d cache.
If you sleep, you're waiting a long time without creating a window for that to happen. Sleeping between halves would make it even more easy to detect, but you can't do that unless you use separate memcpy to write halves of the 64-bit integer or something.
Tearing between reads in the reader is also possible even if writes are atomic. This may be less likely, but still happens plenty in practice. Modern x86 CPUs can execute two loads per clock cycle (Intel since Sandybridge, AMD since K8). I tested with atomic 64-bit stores but split 32-bit loads on Skylake and tearing is still frequent enough to spew lines of text in a terminal. So the CPU didn't manage to run everything in lock-step with corresponding pairs of reads always executing in the same clock cycle. So there is a window for the reader getting its cache line invalidated between a pair of loads. (However, all the pending cache-miss loads while the cache line is owned by the writer core probably complete all at once when the cache line arrives. And the total number of load buffers available is an even number in existing microarchitectures.)
As you discovered, your test values both had the same upper half of 0, so this made it impossible to observe any tearing; only the 32-bit aligned low half was ever changing, and was changing atomically because your compiler guarantees at least 4-byte alignment for uint64_t, and x86 guarantees that 4-byte aligned loads/stores are atomic.
0 and -1ULL are the obvious choices. I used the same thing in a test-case for this GCC C11 _Atomic bug for a 64-bit struct.
For your case, I'd do this. read() and write() are POSIX system call names so I picked something else.
#include <cstdint>
volatile uint64_t sharedValue = 0; // initializer = one of the 2 values!
void writer() {
for (;;) {
sharedValue = 0;
sharedValue = -1ULL; // unrolling is vastly simpler than an if
}
}
void reader() {
for (;;) {
uint64_t val = sharedValue;
uint32_t low = val, high = val>>32;
if (low != high) {
std::cout << "Tearing! Value: " << std::hex << val << '\n';
}
}
}
MSVC 19.24 -O2 compiles the writer to using a movlpd 64-bit store for the = 0, but two separate 32-bit stores of -1 for the = -1. (And the reader to two separate 32-bit loads). GCC uses a total of four mov dword ptr [mem], imm32 stores in the writer, like you'd expect. (Godbolt compiler explorer)
Terminology: it's always a race condition (even with atomicity you don't know which of the two values you're going to get). With std::atomic<> you'd only have that garden-variety race condition, no Undefined Behaviour.
The question is whether you actually see tearing from the data race Undefined Behaviour on the volatile object, on a specific C++ implementation / set of compile options, for a specific platform. Data race UB is a technical term with a more specific meaning than "race condition". I changed the error message to report the one symptom we check for. Note that data-race UB on a non-volatile object can have way weirder effects, like hosting the load or store out of loops, or even inventing extra reads leading to code that thinks one read was both true and false at the same time. (https://lwn.net/Articles/793253/)
I removed 2 redundant cout flushes: one from std::endl and one from std::flush. cout is line-buffered by default, or full buffered if writing to a file, which is fine. And '\n' is just as portable as std::endl as far as DOS line endings are concerned; text vs. binary stream mode handles that. endl is still just \n.
I simplified your check for tearing by checking that high_half == low_half. Then the compiler just has to emit one cmp/jcc instead of two extended-precision compares to see if the value is either 0 or -1 exactly. We know that there's no plausible way for false-negatives like high = low = 0xff00ff00 to happen on x86 (or any other mainstream ISA with any sane compiler).
So I guess that on x86 there is no need to use std::atomic for 32 bit data types?
Incorrect.
Hand-rolled atomics with volatile int can't give you atomic RMW operations (without inline asm or special functions like Windows InterlockedIncrement or GNU C builtin __atomic_fetch_add), and can't give you any ordering guarantees wrt. other code. (Release / acquire semantics)
When to use volatile with multi threading? - pretty much never.
Rolling your own atomics with volatile is still possible and de-facto supported by many mainstream compilers (e.g. the Linux kernel still does that, along with inline asm). Real-world compilers do effectively define the behaviour of data races on volatile objects. But it's generally a bad idea when there's a portable and guaranteed-safe way. Just use std::atomic<T> with std::memory_order_relaxed to get asm that's just as efficient as what you could get with volatile (for the cases where volatile works), but with guarantees of safety and correctness from the ISO C++ standard.
atomic<T> also lets you ask the implementation whether a given type can be cheaply atomic or not, with C++17 std::atomic<T>::is_always_lock_free or the older member function. (In practice C++11 implementations decided not to let some but not all instances of any given atomic be lock free based on alignment or something; instead they just give atomic the required alignas if there is one. So C++17 made a constant per-type constant instead of per-object member function way to check lock-freedom).
std::atomic can also give cheap lock-free atomicity for types wider than a normal register. e.g. on ARM, using ARMv6 strd / ldrd to store/load a pair of registers.
On 32-bit x86, a good compiler can implement std::atomic<uint64_t> by using SSE2 movq to do atomic 64-bit loads and stores, without falling back to the non-lock_free mechanism (a table of locks). In practice, GCC and clang9 do use movq for atomic<uint64_t> load/store. clang8.0 and earlier uses lock cmpxchg8b unfortunately. MSVC uses lock cmpxchg8b in an even more inefficient way. Change the definition of sharedVariable in the Godbolt link to see it. (Or if you use one each of default seq_cst and memory_order_relaxed stores in the loop, MSVC for some reason calls a ?store#?$_Atomic_storage#_K$07#std##QAEX_KW4memory_order#2##Z helper function for one of them. But when both stores are the same ordering, it inlines lock cmpxchg8b with much clunkier loops than clang8.0) Note that this inefficient MSVC code-gen is for a case where volatile wasn't atomic; in cases where it is, atomic<T> with mo_relaxed compiles nicely, too.
You generally can't get that wide-atomic code-gen from volatile. Although GCC does actually use movq for your if() bool write function (see the earlier Godbolt compiler explorer link) because it can't see through the alternating or something. It also depends on what values you use. With 0 and -1 it uses separate 32-bit stores, but with 0 and 0x0f0f0f0f0f0f0f0fULL you get movq for a usable pattern. (I used this to verify that you can still get tearing from just the read side, instead of hand-writing some asm.) My simple unrolled version compiles to just use plain mov dword [mem], imm32 stores with GCC. This is a good example of there being zero guarantee of how volatile really compiles in this level of detail.
atomic<uint64_t> will also guarantee 8-byte alignment for the atomic object, even if plain uint64_t might have only been 4-byte aligned.
In ISO C++, a data race on a volatile object is still Undefined Behaviour. (Except for volatile sig_atomic_t racing with a signal handler.)
A "data race" is any time two unsynchronized accesses happen and they're not both reads. ISO C++ allows for the possibility of running on machines with hardware race detection or something; in practice no mainstream systems do that so the result is just tearing if the volatile object isn't "naturally atomic".
ISO C++ also in theory allows for running on machines that don't have coherent shared memory and require manual flushes after atomic stores, but that's not really plausible in practice. No real-world implementations are like that, AFAIK. Systems with cores that have non-coherent shared memory (like some ARM SoCs with DSP cores + microcontroller cores) don't start std::thread across those cores.
See also Why is integer assignment on a naturally aligned variable atomic on x86?
It's still UB even if you don't observe tearing in practice, although as I said real compilers do de-facto define the behaviour of volatile.
Skylake experiments to try to detect store-buffer coalescing
I wondered if store coalescing in the store buffer could maybe create an atomic 64-bit commit to L1d cache out of two separate 32-bit stores. (No useful results so far, leaving this here in case anyone's interested or wants to build on it.)
I used a GNU C __atomic builtin for the reader, so if the stores also ended up being atomic we'd see no tearing.
void reader() {
for (;;) {
uint64_t val = __atomic_load_n(&sharedValue, __ATOMIC_ACQUIRE);
uint32_t low = val, high = val>>32;
if (low != high) {
std::cout << "Tearing! Value: " << std::hex << val << '\n';
}
}
}
This was one attempt get the microachitecture to group the stores.
void writer() {
volatile int separator; // in a different cache line, has to commit separately
for (;;) {
sharedValue = 0;
_mm_mfence();
separator = 1234;
_mm_mfence();
sharedValue = -1ULL; // unrolling is vastly simpler than an if
_mm_mfence();
separator = 1234;
_mm_mfence();
}
}
I still see tearing with this. (mfence on Skylake with updated microcode is like lfence, and blocks out-of-order exec as well as draining the store buffer. So later stores shouldn't even enter the store buffer before the later ones leave. That might actually be a problem, because we need time for merging, not just committing a 32-bit store as soon as it "graduates" when the store uops retire).
Probably I should try to measure the rate of tearing and see if it's less frequent with anything, because any tearing at all is enough to spam a terminal window with text on a 4GHz machine.
Grab the disassembly and then check the documentation for your architecture; on some machines you'll find even standard "non-atomic" operations (in terms of C++) are actually atomic when it hits the hardware (in terms of assembly).
With that said, your compiler will know what is and isn't safe and it is therefore a better idea to use the std::atomic template to make your code more portable across architectures. If you're on a platform that doesn't require anything special, it'll typically get optimised down to a primitive type anyway (putting memory ordering aside).
I don't remember the details of x86 operations off hand, but I would guess that you have a data race if the 64 bit integer is written in 32 bit "chunks" (or less); it is possible to get a torn read that case.
There are also tools called thread sanitizer to catch it in the act. I don't believe they're supported on Windows with MSVC, but if you can get GCC or clang working then you might have some luck there. If your code is portable (it looks it), then you can run it on a Linux system (or VM) using these tools.
I changed the code to:
volatile uint64_t sharedValue = 0;
const uint64_t value1 = 0;
const uint64_t value2 = ULLONG_MAX;
and now the code triggers the race condition in less than a second.
The problem was that both 13 and 1414 has the 32 MSB = 0.
13=0xd
1414=0x586
0=0x0
ULLONG_MAX=0xffffffffffffffff
First, your code has a data race condition when reading and writing to sharedValue variable without any synchronization which is Undefined Behavior in C++. This can be fixed by making sharedValue an atomic variable:
std::atomic<uint64_t> sharedValue{1};
You can trigger a logical race condition by using an artificial delay before writing to sharedValue ("Race condition! Value: ..." message will be printed in reader thread). You can use std::this_thread::sleep_for for this:
void write() {
bool even = true;
for (;;) {
uint64_t value;
if (even)
value = value1;
else
value = value2;
using namespace std::chrono_literals;
std::this_thread::sleep_for(1ms);
sharedValue = value;
even = !even;
}
}

Are Reads and Writes of an int in C++ Atomic on x86-64 multi-core machine

I've read this, my question is quite similar yet somewhat different.
Note, I know C++0x does not guarantee that but I'm asking particularly for a multi-core machine like x86-64.
Let's say we have 2 threads (pinned to 2 physical cores) running the following code:
// I know people may delcare volatile useless, but here I do NOT care memory reordering nor synchronization/
// I just want to suppress complier optimization of using register.
volatile int n;
void thread1() {
for (;;)
n = 0xABCD1234;
// NOTE, I know ++n is not atomic,
// but I do NOT care here.
// what I cares is whether n can be 0x00001234, i.e. in the middle of the update from core-1's cache lines to main memory,
// will core-2 see an incomplete value(like the first 2 bytes lost)?
++n;
}
}
void thread2() {
while (true) {
printf('%d', n);
}
}
Is it possible for thread 2 to see n to be something like 0x00001234, i.e. in the middle of the update from core-1's cache lines to main memory, will core-2 see an incomplete value?
I know a single 4-byte int definitely fits into a typically 128-byte-long cache line, and if that int does store inside one cache line then I believe no issues here... yet what if it acrosses the cache line boundary? i.e. will it be possbile that some char already sit inside that cache line which makes first part of the n in one cache line and the other part in the next line? If that is the case, then core-2 may have a chance seeing an incomplete value, right?
Also, I think unless making every char or short or other less-than-4-bytes types padded to be 4-byte-long, one can never guarantee a single int does not pass the cache line boundary, isn't it?
If so, would that suggest generally even setting a single int is not guaranteed to be atomic on a x86-64 multi-core machine?
I got this question because as I researched on this topic, various people in various posts seem agreed on that as long as the machine architecture is proper(e.g. x86-64) setting an int should be atomic. But as I argued above that does not hold, right?
UPDATE
I'd like to give some background of my question. I'm dealing with a real-time system, which is sampling some signal and putting the result into one global int, this is of course done in one thread. And in yet another thread I read this value and process it.
I do not care the ordering of set and get, all I need is just a complete (vs. a corrrupted integer value) value.
x86 guarantees this. C++ doesn't. If you write x86 assembly you will be fine. If you write C++ it is undefined behavior. Since you can't reason about undefined behavior (it is undefined after all) you have to go lower and look at the assembler instructions that were generated. If they do what you want then this is fine. Note, however, that compilers tend to change generated assembly when you change compilers, compiler versions, compiler flags or any code which might change the optimizer's behavior, so you will constantly have to check the assembler code to make sure it is still correct.
The easier way is to use std::atomic<int> which will guarantee that the correct assembler instructions are generated so you don't have to constantly check.
The other question talks about variables "properly aligned". If it crosses a cache-line, the variable is not properly aligned. An int will not do that unless you specifically ask the compiler to pack a struct, for example.
You also assume that using volatile int is better than atomic<int>. If volatile int is the perfect way to sync variables on your platform, surely the library implementer would also know that and store a volatile x inside atomic<x>.
There is no requirement that atomic<int> has to be extra slow just because it is standard. :-)
Why worry so much?
Rely on your implementation. std::atomic<int> will reduce to an int if int is atomic on your platform (and in x86-64 they are, if properly aligned).
I'd also be concerned about the possibility of int overflow with your code (which is undefined behaviour), if I were you.
In other words std::atomic<unsigned> is the appropriate type here.
If you're looking for atomicity guarantee, std::atomic<> is your friend. Don't rely on volatile qualifier.
The question is almost a duplicate of Why is integer assignment on a naturally aligned variable atomic on x86?. The answer there does answer everything you ask, but this question is more focused on the ABI / compiler question of whether an int (or other type?) will be sufficiently-aligned, rather than what happens when it is. There's other stuff in this question that's worth answering specifically, too.
Yes, they almost invariably will be on machines where an int fits in a single register (e.g. not AVR: an 8-bit RISC), because compilers typically choose not to use multiple store instructions when they could use 1.
Normal x86 ABIs will align an int to a 4B boundary, even inside structs (unless you use GNU C __attribute__((packed)) or the equivalent for other dialects). But beware that the i386 System V ABI only aligns double to 4 bytes; it's only outside structs that modern compilers can go beyond that and give it natural alignment, making load/store atomic.
But nothing you can legally do in C++ can ever depend on this fact (because by definition it will involve a data race on a non-atomic type so it's Undefined Behaviour). Fortunately, there are efficient ways to get the same result (i.e. about the same compiler-generated asm, without mfence instructions or other slow stuff) that don't cause undefined behaviour.
You should use atomic instead of volatile or hoping that the compiler doesn't optimize away stores or loads on a non-volatile int, because the assumption of async modification is one of the ways that volatile and atomic overlap.
I'm dealing with a real-time system, which is sampling some signal and putting the result into one global int, this is of course done in one thread. And in yet another thread I read this value and process it.
std::atomic with .store(val, std::memory_order_relaxed) and .load(std::memory_order_relaxed) will give you exactly what you want here. The HW-access thread runs free and does plain ordinary x86 store instructions into the shared variable, while the reader thread does plain ordinary x86 load instructions.
This is the C++11 way to express that this is what you want, and you should expect it to compile to the same asm as with volatile. (With maybe a couple instructions difference if you use clang, but nothing important.) If there was any case where volatile int wouldn't have sufficient alignment, or any other corner cases, atomic<int> will work (barring compiler bugs). Except maybe in a packed struct; IDK if compilers stop you from breaking atomicity by packing atomic types in structs.
In theory, you might want to use volatile std::atomic<int> to make sure the compiler doesn't optimize out multiple stores to the same variable. See Why don't compilers merge redundant std::atomic writes?. But for now, compilers don't do that kind of optimization. (volatile std::atomic<int> should still compile to the same light-weight asm.)
I know a single 4-byte int definitely fits into a typically 128-byte-long cache line, and if that int does store inside one cache line then I believe no issues here...
Cache lines are 64B on all mainstream x86 CPUs since PentiumIII; before that 32B lines were typical. (Well AMD Geode still uses 32B lines...) Pentium4 uses 64B lines, although it prefers to transfer them in pairs or something? Still, I think it's accurate to say that it really does use 64B lines, not 128B. This page lists it as 64B per line.
AFAIK, there are no x86 microarchitectures that used 128B lines in any level of cache.
Also, only Intel CPUs guarantee that cached unaligned stores / loads are atomic if they don't cross a cache-line boundary. The baseline atomicity guarantee for x86 in general (AMD/Intel/other) is don't cross an 8-byte boundary. See Why is integer assignment on a naturally aligned variable atomic on x86? for quotes from Intel/AMD manuals.
Natural alignment works on pretty much any ISA (not just x86) up to the maximum guaranteed-atomic width.
The code in your question wants a non-atomic read-modify write where the load and store are separately atomic, and impose no ordering on surrounding loads/stores.
As everyone has said, the right way to do this is with atomic<int>, but nobody has pointed out exactly how. If you just n++ on atomic_int n, you will get (for x86-64) lock add [n], 1, which will be much slower than what you get with volatile, because it makes the entire RMW operation atomic. (Perhaps this is why you were avoiding std::atomic<>?)
#include <atomic>
volatile int vcount;
std::atomic <int> acount;
static_assert(alignof(vcount) == sizeof(vcount), "under-aligned volatile counter");
void inc_volatile() {
while(1) vcount++;
}
void inc_separately_atomic() {
while(1) {
int t = acount.load(std::memory_order_relaxed);
t++;
acount.store(t, std::memory_order_relaxed);
}
}
asm output from the Godbolt compiler explorer with gcc7.2 and clang5.0
Unsurprisingly, they both compile to equivalent asm with gcc/clang for x86-32 and x86-64. gcc makes identical asm for both, except for the address to increment:
# x86-64 gcc -O3
inc_volatile:
.L2:
mov eax, DWORD PTR vcount[rip]
add eax, 1
mov DWORD PTR vcount[rip], eax
jmp .L2
inc_separately_atomic():
.L5:
mov eax, DWORD PTR acount[rip]
add eax, 1
mov DWORD PTR acount[rip], eax
jmp .L5
clang optimizes better, and uses
inc_separately_atomic():
.LBB1_1:
add dword ptr [rip + acount], 1
jmp .LBB1_1
Note the lack of a lock prefix, so inside the CPU this decodes to separate load, ALU add, and store uops. (See Can num++ be atomic for 'int num'?).
Besides smaller code-size, some of these uops can be micro-fused when they come from the same instruction, reducing front-end bottlenecks. (Totally irrelevant here; the loop bottlenecks on the 5 or 6 cycle latency of a store/reload. But if used as part of a larger loop, it would be relevant.) Unlike with a register operand, add [mem], 1 is better than inc [mem] on Intel CPUs because it micro-fuses even more: INC instruction vs ADD 1: Does it matter?.
It's interesting that clang uses the less efficient inc dword ptr [rip + vcount] for inc_volatile().
And how does an actual atomic RMW compile?
void inc_atomic_rmw() {
while(1) acount++;
}
# both gcc and clang do this:
.L7:
lock add DWORD PTR acount[rip], 1
jmp .L7
Alignment inside structs:
#include <stdint.h>
struct foo {
int a;
volatile double vdouble;
};
// will fail with -m32, in the SysV ABI.
static_assert(alignof(foo) == sizeof(double), "under-aligned volatile counter");
But atomic<double> or atomic<unsigned long long> will guarantee atomicity.
For 64-bit integer load/store on 32-bit machines, gcc uses SSE2 instructions. Some other compilers unfortunately use lock cmpxchg8b, which is far less efficient for separate stores or loads. volatile long long wouldn't give you that.
volatile double usually would normally be atomic to load/store when aligned correctly, because the normal way is already to use single 8B load/store instructions.

VC++, /volatile:ms on x86

The documentation on volatile says:
When the /volatile:ms compiler option is used—by default when architectures other than ARM are targeted—the compiler generates extra code to maintain ordering among references to volatile objects in addition to maintaining ordering to references to other global objects.
What exact code could compile differently with /volatile:ms and /volatile:iso?
A complete understanding of this requires a bit of a history lesson. (And who doesn't like history?…says the guy who majored in history.) The /volatile:ms semantics were first added to the compiler with Visual Studio 2005. Starting with that version, variables marked volatile automatically imposed acquire semantics on reads, and release semantics on writes, through that variable.
What does this mean? It has to do with the memory model, and specifically, how aggressively the compiler is permitted to reorder memory-access operations. An operation that has acquire semantics prevents subsequent memory operations from being hoisted above it; an operation that has release semantics prevents preceding memory operations from being delayed until after it. As the names suggest, acquire semantics are typically used when you are acquiring a resource, whereas release semantics are typically used when you are releasing a resource. MSDN has a more complete description of acquire and release semantics; it says:
An operation has acquire semantics if other processors will always see
its effect before any subsequent operation's effect. An operation has
release semantics if other processors will see every preceding
operation's effect before the effect of the operation itself. Consider
the following code example:
a++;
b++;
c++;
From another processor's point of view, the preceding operations can
appear to occur in any order. For example, the other processor might
see the increment of b before the increment of a.
For example, the InterlockedIncrementAcquire routine uses acquire semantics to increment a variable. If you rewrote the preceding code example as follows:
InterlockedIncrementAcquire(&a);
b++;
c++;
other processors would always see the increment of a before the increments of b and c.
Likewise, the InterlockedIncrementRelease routine uses release semantics to increment a variable. If you rewrote the code example once again, as follows:
a++;
b++;
InterlockedIncrementRelease(&c);
other processors would always see the increments of a and b before the increment of c.
Now, like MSDN says, atomic operations have both acquire and release semantics. And, in fact, on x86, there is no way to give an instruction only acquire or release semantics, so achieving even one of these requires that the instruction is made atomic (which the compiler will generally do by emitting a LOCK CMPXCHG instruction).
Before Visual Studio 2005's enhancement of the volatile semantics, developers who wanted to write correct code needed to use the Interlocked* family of functions, as described in the MSDN article. Unfortunately, many developers failed to do this and got code that worked mostly by accident (or didn't work at all). But there was a good chance that it did work by accident, given the x86's relatively strict memory model. You often get the semantics you want for free since, on x86, most loads and stores already have acquire/release semantics, so you don't even need to make anything atomic. (Non-temporal stores are the obvious exception, but in this case, those wouldn't matter anyway.) I suspect this ease of implementation on x86 is what, combined with the realization that programmers generally failed to understand and do the right thing, persuaded Microsoft to strengthen the semantics of volatile in VS 2005.
Another potential reason for the change was the growing importance of multi-threaded code. 2005 was around the time that Pentium 4 chips with HyperThreading were beginning to become popular, effectively bringing simultaneous multi-threading to every users' desktop. Probably not coincidentally, VS 2005 also removed the option to link to single-threaded version of the C run-time libraries. It is when you have multi-threaded code—with the possibility of executing on multiple processors—that you really have to start worrying about getting the memory-access semantics correct.
With VS 2005 and later, you could just mark a pointer parameter as volatile and get the desired acquire semantics. The volatility implied/imposed the acquire semantics, which made multi-threaded code running in multi-processing environments safe. Prior to 2011, this was extremely important, since the C and C++ language standards had absolutely nothing to say about threading and gave you no portable way of writing correct code.
And this brings us right to the answer to your question. If your code assumes these extended semantics for volatile, then you need to pass the /volatile:ms switch to ensure that the compiler continues to apply them. If you have written C++11-style code that uses modern primitives for atomic, thread-safe operations, then don't need volatile to have these extended semantics and are safe passing /volatile:iso. In other words, as manni66 quipped, if your code "misuses volatile as std::atomic", then you will see a difference in behavior and need /volatile:ms to guarantee that volatile does have the same effect as std::atomic.
As it turns out, it has proven very difficult for me to find an example of a case where /volatile:iso actually changes the generated code, as compared to /volatile:ms. Microsoft's optimizer is actually very conservative with respect to reordering instructions, which is the type of thing that the acquire/release semantics are supposed to protect against.
Here's a simple example (where we're using a volatile global variable to guard a critical section, as you might find in a simplistic "lock-free" implementation) that should demonstrate the difference:
volatile bool CriticalSection;
int Data[100];
void FillData(int i)
{
Data[i] = 42; // fill data item at index 'i'
CriticalSection = false; // release critical section
}
If you compile this with GCC at -O2, it will generate the following machine code:
FillData(int):
mov eax, DWORD PTR [esp+4] // retrieve parameter 'i' from stack
mov BYTE PTR [CriticalSection], 0 // store '0' in 'CriticalSection'
mov DWORD PTR [Data+eax*4], 42 // store '42' at index 'i' in 'Data'
ret
Even if you aren't fluent in assembly language, you should be able to see that the optimizer has re-ordered the stores, such that the critical section is released (CriticalSection = false) before the data is filled in (Data[i] = 42)—precisely the opposite of the order in which the statements appeared in the original C code. The volatile is having no effect on this re-ordering, because GCC follows the ISO semantics, just like /volatile:iso will (in theory).
By the way, notice how…um…volatile :-) this ordering is. If we compile at -O1 in GCC, we get instructions that do everything in the same order as our original C code:
FillData(int):
mov eax, DWORD PTR [esp+4] // retrieve parameter 'i' from stack
mov DWORD PTR [Data+eax*4], 42 // store '42' at index 'i' in 'Data'
mov BYTE PTR [CriticalSection], 0 // store '0' in 'CriticalSection'
ret
When you start throwing more instructions in there for the compiler to rearrange, and especially if this code were to get inlined, you can imagine how unlikely it is that the original order is preserved.
But, like I said, MSVC is actually very conservative with regard to re-ordering instructions. Regardless of whether I specify /volatile:ms or /volatile:iso, I get the exactly same machine code:
FillData, COMDAT PROC
mov eax, DWORD PTR [esp+4]
mov DWORD PTR [Data+eax*4], 42
mov BYTE PTR [CriticalSection], 0
ret
FillData ENDP
where the stores are done in the original order. I've played with all sorts of different permutations, introducing additional variables and operations, all without being able to find the magic sequence that causes MSVC to re-order the stores. So, it is very likely that, currently, in practice, you won't see a very big difference with the /volatile:iso switch set when targeting x86 architectures. But that's a very loose guarantee, to say the least.
Note that this empirical observation is consistent with Alexander Gutenev's speculation that a difference in semantics is observed only on ARM, and that the whole reason these switches were introduced was to avoid paying a performance penalty on this newly-supported platform. Meanwhile, on the x86 side, there have been no actual changes to the semantics in generated code, since there is essentially no cost. (Save for some extremely trivial optimization possibilities, but that would require that their optimizer have two completely separate schedulers, which probably isn't a good use of developer time.)
The point is that, with /volatile:iso, MSVC is permitted to act like GCC and re-order the stores. With /volatile:ms, you are guaranteed that it won't because volatile implies acquire/release semantics for that variable.
Bonus Reading: So, what is volatile supposed to be used for, in strictly ISO-compliant code (i.e., when the /volatile:iso switch is used)? Well, volatile is basically meant for memory-mapped I/O. That's what it was originally meant for when it was first introduced, and it remains its principal purpose. I've heard it jokingly said that volatile is for reading/writing a tape drive. Basically, you mark the pointer volatile in order to prevent the compiler from optimizing reads and writes away. For example:
volatile char* pDeviceIOAddr = ...;
void Wait()
{
while (*pDeviceIOAddr)
{ }
}
Qualifying the parameter's type with volatile prevents the compiler from assuming that subsequent reads return the same value, forcing it to do a new read each time through the loop. In other words:
mov eax, DWORD PTR [pDeviceIoAddr] // get pointer
Wait:
cmp BYTE PTR [eax], 0 // dereference pointer, read 1 byte,
jnz Wait // and compare to 0
If pDeviceIoAddr wasn't volatile, the entire loop could have been elided. Optimizers definitely do this in practice, including MSVC. Or, you could get the following pathological code:
mov eax, DWORD PTR [pDeviceIoAddr] // get pointer
mov al, BYTE PTR [eax] // dereference pointer, read 1 byte
Wait:
cmp al, 0 // compare it to 0
jnz Wait
where the pointer is dereferenced once, outside of the loop, caching the byte in a register. The instruction at the top of the loop just tests that enregistered value, creating either no loop or an infinite loop. Oops.
Notice, however, that this use of volatile in ISO-standard C++ does not obviate the need for critical sections, mutexes, or other types of locks. Even the correct version of the above code wouldn't work correctly if another thread could potentially modify pDeviceIOAddr, since the read of that address/pointer does not have acquire semantics. Acquire semantics would look like this:
Wait:
mov eax, DWORD PTR [pDeviceIoAddr] // get pointer (acquire semantics)
cmp BYTE PTR [eax], 0 // dereference pointer, read 1 byte,
jnz Wait // and compare to 0
and to get that, you would need C++11's std::atomic.
I suspect it might start matter from some point in time, if not yet already.
There's undocumented option /volatileMetadata- vs /volatileMetadata. When it is on (default), some metadata is generated for emulation of x86 on ARM.
The information is from my issue report, some links from there:
Visual Studio 2019 version 16.10 Preview 2 enabled volatile metadata by default when targeting x64 to improve emulation performance
ARM64EC Support in Visual Studio
So, at least hypothetically, /volatile:iso may matter for volatile metadata, and affect x86 code execution on ARM.
I have no confirmation that it happens. By compiling the same binary with and without /volatileMetadata- I at least confirm by binary size that the mentioned metadata does exist.

Can num++ be atomic for 'int num'?

In general, for int num, num++ (or ++num), as a read-modify-write operation, is not atomic. But I often see compilers, for example GCC, generate the following code for it (try here):
void f()
{
int num = 0;
num++;
}
f():
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
add DWORD PTR [rbp-4], 1
nop
pop rbp
ret
Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?
And if so, does it mean that so-generated num++ can be used in concurrent (multi-threaded) scenarios without any danger of data races (i.e. we don't need to make it, for example, std::atomic<int> and impose the associated costs, since it's atomic anyway)?
UPDATE
Notice that this question is not whether increment is atomic (it's not and that was and is the opening line of the question). It's whether it can be in particular scenarios, i.e. whether one-instruction nature can in certain cases be exploited to avoid the overhead of the lock prefix. And, as the accepted answer mentions in the section about uniprocessor machines, as well as this answer, the conversation in its comments and others explain, it can (although not with C or C++).
This is absolutely what C++ defines as a Data Race that causes Undefined Behaviour, even if one compiler happened to produce code that did what you hoped on some target machine. You need to use std::atomic for reliable results, but you can use it with memory_order_relaxed if you don't care about reordering. See below for some example code and asm output using fetch_add.
But first, the assembly language part of the question:
Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?
Memory-destination instructions (other than pure stores) are read-modify-write operations that happen in multiple internal steps. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc., etc.
Memory operations from other CPUs can become globally visible between the load and store. I.e. two threads running add dword [num], 1 in a loop would step on each other's stores. (See #Margaret's answer for a nice diagram). After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.
"Atomic", from the Greek word meaning indivisible, means that no observer can see the operation as separate steps. Happening physically / electrically instantaneously for all bits simultaneously is just one way to achieve this for a load or store, but that's not even possible for an ALU operation. I went into a lot more detail about pure loads and pure stores in my answer to Atomicity on x86, while this answer focuses on read-modify-write.
The lock prefix can be applied to many read-modify-write (memory destination) instructions to make the entire operation atomic with respect to all possible observers in the system (other cores and DMA devices, not an oscilloscope hooked up to the CPU pins). That is why it exists. (See also this Q&A).
So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line pinned in Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.
Without the lock prefix, another core could take ownership of the cache line and modify it after our load but before our store, so that other store would become globally visible in between our load and store. Several other answers get this wrong, and claim that without lock you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches.
(If a locked instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don't misalign your atomic variables!)
Note that the lock prefix also turns an instruction into a full memory barrier (like MFENCE), stopping all run-time reordering and thus giving sequential consistency. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)
On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another question, and the comments under it, where I explain this in more detail.
Back to C++:
It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation:
;; Valid compiler output for num++
mov eax, [num]
inc eax
mov [num], eax
This is very likely if you use the value of num later: the compiler will keep it live in a register after the increment. So even if you check how num++ compiles on its own, changing the surrounding code can affect it.
(If the value isn't needed later, inc dword [num] is preferred; modern x86 CPUs will run a memory-destination RMW instruction at least as efficiently as using three separate instructions. Fun fact: gcc -O3 -m32 -mtune=i586 will actually emit this, because (Pentium) P5's superscalar pipeline didn't decode complex instructions to multiple simple micro-operations the way P6 and later microarchitectures do. See the Agner Fog's instruction tables / microarchitecture guide for more info, and the x86 tag wiki for many useful links (including Intel's x86 ISA manuals, which are freely available as PDF)).
Don't confuse the target memory model (x86) with the C++ memory model
Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.
Classic example: Storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.
You might be expecting that this code will synchronize with other threads:
// int flag; is just a plain global, not std::atomic<int>.
flag--; // Pretend this is supposed to be some kind of locking attempt
modify_a_data_structure(&foo); // doesn't look at flag, and the compiler knows this. (Assume it can see the function def). Otherwise the usual don't-break-single-threaded-code rules come into play!
flag++;
But it won't. The compiler is free to move the flag++ across the function call (if it inlines the function or knows that it doesn't look at flag). Then it can optimize away the modification entirely, because flag isn't even volatile.
(And no, C++ volatile is not a useful substitute for std::atomic. std::atomic does make the compiler assume that values in memory can be modified asynchronously similar to volatile, but there's much more to it than that. (In practice there are similarities between volatile int to std::atomic with mo_relaxed for pure-load and pure-store operations, but not for RMWs). Also, volatile std::atomic<int> foo is not necessarily the same as std::atomic<int> foo, although current compilers don't optimize atomics (e.g. 2 back-to-back stores of the same value) so volatile atomic wouldn't change the code-gen.)
Defining data races on non-atomic variables as Undefined Behaviour is what lets the compiler still hoist loads and sink stores out of loops, and many other optimizations for memory that multiple threads might have a reference to. (See this LLVM blog for more about how UB enables compiler optimizations.)
As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but it can be much more efficient on other architectures (like ARM). Even on x86, relaxed allows more compile-time reordering.
This is what GCC actually does on x86, for a few functions that operate on a std::atomic global variable.
See the source + assembly language code formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of assembly language code you get from atomics for those targets.
#include <atomic>
std::atomic<int> num;
void inc_relaxed() {
num.fetch_add(1, std::memory_order_relaxed);
}
int load_num() { return num; } // Even seq_cst loads are free on x86
void store_num(int val){ num = val; }
void store_num_release(int val){
num.store(val, std::memory_order_release);
}
// Can the compiler collapse multiple atomic operations into one? No, it can't.
# g++ 6.2 -O3, targeting x86-64 System V calling convention. (First argument in edi/rdi)
inc_relaxed():
lock add DWORD PTR num[rip], 1 #### Even relaxed RMWs need a lock. There's no way to request just a single-instruction RMW with no lock, for synchronizing between a program and signal handler for example. :/ There is atomic_signal_fence for ordering, but nothing for RMW.
ret
inc_seq_cst():
lock add DWORD PTR num[rip], 1
ret
load_num():
mov eax, DWORD PTR num[rip]
ret
store_num(int):
mov DWORD PTR num[rip], edi
mfence ##### seq_cst stores need an mfence
ret
store_num_release(int):
mov DWORD PTR num[rip], edi
ret ##### Release and weaker doesn't.
store_num_relaxed(int):
mov DWORD PTR num[rip], edi
ret
Notice how MFENCE (a full barrier) is needed after a sequential-consistency stores. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.
Re: discussion in comments on #Richard Hodges' answer about compilers merging std::atomic num++; num-=2; operations into one num--; instruction:
A separate Q&A on this same subject: Why don't compilers merge redundant std::atomic writes?, where my answer restates a lot of what I wrote below.
Current compilers don't actually do this (yet), but not because they aren't allowed to. C++ WG21/P0062R1: When should compilers optimize atomics? discusses the expectation that many programmers have that compilers won't make "surprising" optimizations, and what the standard can do to give programmers control. N4455 discusses many examples of things that can be optimized, including this one. It points out that inlining and constant-propagation can introduce things like fetch_or(0) which may be able to turn into just a load() (but still has acquire and release semantics), even when the original source didn't have any obviously redundant atomic ops.
The real reasons compilers don't do it (yet) are: (1) nobody's written the complicated code that would allow the compiler to do that safely (without ever getting it wrong), and (2) it potentially violates the principle of least surprise. Lock-free code is hard enough to write correctly in the first place. So don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It's not always easy easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).
Getting back to num++; num-=2; compiling as if it were num--:
Compilers are allowed to do this, unless num is volatile std::atomic<int>. If a reordering is possible, the as-if rule allows the compiler to decide at compile time that it always happens that way. Nothing guarantees that an observer could see the intermediate values (the num++ result).
I.e. if the ordering where nothing becomes globally visible between these operations is compatible with the ordering requirements of the source
(according to the C++ rules for the abstract machine, not the target architecture), the compiler can emit a single lock dec dword [num] instead of lock inc dword [num] / lock sub dword [num], 2.
num++; num-- can't disappear, because it still has a Synchronizes With relationship with other threads that look at num, and it's both an acquire-load and a release-store which disallows reordering of other operations in this thread. For x86, this might be able to compile to an MFENCE, instead of a lock add dword [num], 0 (i.e. num += 0).
As discussed in PR0062, more aggressive merging of non-adjacent atomic ops at compile time can be bad (e.g. a progress counter only gets updated once at the end instead of every iteration), but it can also help performance without downsides (e.g. skipping the atomic inc / dec of ref counts when a copy of a shared_ptr is created and destroyed, if the compiler can prove that another shared_ptr object exists for entire lifespan of the temporary.)
Even num++; num-- merging could hurt fairness of a lock implementation when one thread unlocks and re-locks right away. If it's never actually released in the asm, even hardware arbitration mechanisms won't give another thread a chance to grab the lock at that point.
With current gcc6.2 and clang3.9, you still get separate locked operations even with memory_order_relaxed in the most obviously optimizable case. (Godbolt compiler explorer so you can see if the latest versions are different.)
void multiple_ops_relaxed(std::atomic<unsigned int>& num) {
num.fetch_add( 1, std::memory_order_relaxed);
num.fetch_add(-1, std::memory_order_relaxed);
num.fetch_add( 6, std::memory_order_relaxed);
num.fetch_add(-5, std::memory_order_relaxed);
//num.fetch_add(-1, std::memory_order_relaxed);
}
multiple_ops_relaxed(std::atomic<unsigned int>&):
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
ret
Without many complications an instruction like add DWORD PTR [rbp-4], 1 is very CISC-style.
It perform three operations: load the operand from memory, increment it, store the operand back to memory.
During these operations the CPU acquire and release the bus twice, in between any other agent can acquire it too and this violates the atomicity.
AGENT 1 AGENT 2
load X
inc C
load X
inc C
store X
store X
X is incremented only once.
...and now let's enable optimisations:
f():
rep ret
OK, let's give it a chance:
void f(int& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
result:
f(int&):
mov DWORD PTR [rdi], 0
ret
another observing thread (even ignoring cache synchronisation delays) has no opportunity to observe the individual changes.
compare to:
#include <atomic>
void f(std::atomic<int>& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
where the result is:
f(std::atomic<int>&):
mov DWORD PTR [rdi], 0
mfence
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
lock sub DWORD PTR [rdi], 1
ret
Now, each modification is:-
observable in another thread, and
respectful of similar modifications happening in other threads.
atomicity is not just at the instruction level, it involves the whole pipeline from processor, through the caches, to memory and back.
Further info
Regarding the effect of optimisations of updates of std::atomics.
The c++ standard has the 'as if' rule, by which it is permissible for the compiler to reorder code, and even rewrite code provided that the outcome has the exact same observable effects (including side-effects) as if it had simply executed your code.
The as-if rule is conservative, particularly involving atomics.
consider:
void incdec(int& num) {
++num;
--num;
}
Because there are no mutex locks, atomics or any other constructs that influence inter-thread sequencing, I would argue that the compiler is free to rewrite this function as a NOP, eg:
void incdec(int&) {
// nada
}
This is because in the c++ memory model, there is no possibility of another thread observing the result of the increment. It would of course be different if num was volatile (might influence hardware behaviour). But in this case, this function will be the only function modifying this memory (otherwise the program is ill-formed).
However, this is a different ball game:
void incdec(std::atomic<int>& num) {
++num;
--num;
}
num is an atomic. Changes to it must be observable to other threads that are watching. Changes those threads themselves make (such as setting the value to 100 in between the increment and decrement) will have very far-reaching effects on the eventual value of num.
Here is a demo:
#include <thread>
#include <atomic>
int main()
{
for (int iter = 0 ; iter < 20 ; ++iter)
{
std::atomic<int> num = { 0 };
std::thread t1([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
++num;
--num;
}
});
std::thread t2([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
num = 100;
}
});
t2.join();
t1.join();
std::cout << num << std::endl;
}
}
sample output:
99
99
99
99
99
100
99
99
100
100
100
100
99
99
100
99
99
100
100
99
The add instruction is not atomic. It references memory, and two processor cores may have different local cache of that memory.
IIRC the atomic variant of the add instruction is called lock xadd
Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?
It is dangerous to draw conclusions based on "reverse engineering" generated assembly. For example, you seem to have compiled your code with optimization disabled, otherwise the compiler would have thrown away that variable or loaded 1 directly to it without invoking operator++. Because the generated assembly may change significantly, based on optimization flags, target CPU, etc., your conclusion is based on sand.
Also, your idea that one assembly instruction means an operation is atomic is wrong as well. This add will not be atomic on multi-CPU systems, even on the x86 architecture.
Even if your compiler always emitted this as an atomic operation, accessing num from any other thread concurrently would constitute a data race according to the C++11 and C++14 standards and the program would have undefined behavior.
But it is worse than that. First, as has been mentioned, the instruction generated by the compiler when incrementing a variable may depend on the optimization level. Secondly, the compiler may reorder other memory accesses around ++num if num is not atomic, e.g.
int main()
{
std::unique_ptr<std::vector<int>> vec;
int ready = 0;
std::thread t{[&]
{
while (!ready);
// use "vec" here
});
vec.reset(new std::vector<int>());
++ready;
t.join();
}
Even if we assume optimistically that ++ready is "atomic", and that the compiler generates the checking loop as needed (as I said, it's UB and therefore the compiler is free to remove it, replace it with an infinite loop, etc.), the compiler might still move the pointer assignment, or even worse the initialization of the vector to a point after the increment operation, causing chaos in the new thread. In practice, I would not be surprised at all if an optimizing compiler removed the ready variable and the checking loop completely, as this does not affect observable behavior under language rules (as opposed to your private hopes).
In fact, at last year's Meeting C++ conference, I've heard from two compiler developers that they very gladly implement optimizations that make naively written multi-threaded programs misbehave, as long as language rules allow it, if even a minor performance improvement is seen in correctly written programs.
Lastly, even if you didn't care about portability, and your compiler was magically nice, the CPU you are using is very likely of a superscalar CISC type and will break down instructions into micro-ops, reorder and/or speculatively execute them, to an extent only limited by synchronizing primitives such as (on Intel) the LOCK prefix or memory fences, in order to maximize operations per second.
To make a long story short, the natural responsibilities of thread-safe programming are:
Your duty is to write code that has well-defined behavior under language rules (and in particular the language standard memory model).
Your compiler's duty is to generate machine code which has the same well-defined (observable) behavior under the target architecture's memory model.
Your CPU's duty is to execute this code so that the observed behavior is compatible with its own architecture's memory model.
If you want to do it your own way, it might just work in some cases, but understand that the warranty is void, and you will be solely responsible for any unwanted outcomes. :-)
PS: Correctly written example:
int main()
{
std::unique_ptr<std::vector<int>> vec;
std::atomic<int> ready{0}; // NOTE the use of the std::atomic template
std::thread t{[&]
{
while (!ready);
// use "vec" here
});
vec.reset(new std::vector<int>());
++ready;
t.join();
}
This is safe because:
The checks of ready cannot be optimized away according to language rules.
The ++ready happens-before the check that sees ready as not zero, and other operations cannot be reordered around these operations. This is because ++ready and the check are sequentially consistent, which is another term described in the C++ memory model and that forbids this specific reordering. Therefore the compiler must not reorder the instructions, and must also tell the CPU that it must not e.g. postpone the write to vec to after the increment of ready. Sequentially consistent is the strongest guarantee regarding atomics in the language standard. Lesser (and theoretically cheaper) guarantees are available e.g. via other methods of std::atomic<T>, but these are definitely for experts only, and may not be optimized much by the compiler developers, because they are rarely used.
On a single-core x86 machine, an add instruction will generally be atomic with respect to other code on the CPU1. An interrupt can't split a single instruction down the middle.
Out-of-order execution is required to preserve the illusion of instructions executing one at a time in order within a single core, so any instruction running on the same CPU will either happen completely before or completely after the add.
Modern x86 systems are multi-core, so the uniprocessor special case doesn't apply.
If one is targeting a small embedded PC and has no plans to move the code to anything else, the atomic nature of the "add" instruction could be exploited. On the other hand, platforms where operations are inherently atomic are becoming more and more scarce.
(This doesn't help you if you're writing in C++, though. Compilers don't have an option to require num++ to compile to a memory-destination add or xadd without a lock prefix. They could choose to load num into a register and store the increment result with a separate instruction, and will likely do that if you use the result.)
Footnote 1: The lock prefix existed even on original 8086 because I/O devices operate concurrently with the CPU; drivers on a single-core system need lock add to atomically increment a value in device memory if the device can also modify it, or with respect to DMA access.
Back in the day when x86 computers had one CPU, the use of a single instruction ensured that interrupts would not split the read/modify/write and if the memory would not be used as a DMA buffer too, it was atomic in fact (and C++ did not mention threads in the standard, so this wasn’t addressed).
When it was rare to have a dual processor (e.g. dual-socket Pentium Pro) on a customer desktop, I effectively used this to avoid the LOCK prefix on a single-core machine and improve performance.
Today, it would only help against multiple threads that were all set to the same CPU affinity, so the threads you are worried about would only come into play via time slice expiring and running the other thread on the same CPU (core). That is not realistic.
With modern x86/x64 processors, the single instruction is broken up into several micro ops and furthermore the memory reading and writing is buffered. So different threads running on different CPUs will not only see this as non-atomic but may see inconsistent results concerning what it reads from memory and what it assumes other threads have read to that point in time: you need to add memory fences to restore sane behavior.
No.
https://www.youtube.com/watch?v=31g0YE61PLQ
(That's just a link to the "No" scene from "The Office")
Do you agree that this would be a possible output for the program:
sample output:
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
If so, then the compiler is free to make that the only possible output for the program, in whichever way the compiler wants. ie a main() that just puts out 100s.
This is the "as-if" rule.
And regardless of output, you can think of thread synchronization the same way - if thread A does num++; num--; and thread B reads num repeatedly, then a possible valid interleaving is that thread B never reads between num++ and num--. Since that interleaving is valid, the compiler is free to make that the only possible interleaving. And just remove the incr/decr entirely.
There are some interesting implications here:
while (working())
progress++; // atomic, global
(ie imagine some other thread updates a progress bar UI based on progress)
Can the compiler turn this into:
int local = 0;
while (working())
local++;
progress += local;
probably that is valid. But probably not what the programmer was hoping for :-(
The committee is still working on this stuff. Currently it "works" because compilers don't optimize atomics much. But that is changing.
And even if progress was also volatile, this would still be valid:
int local = 0;
while (working())
local++;
while (local--)
progress++;
:-/
That a single compiler's output, on a specific CPU architecture, with optimizations disabled (since gcc doesn't even compile ++ to add when optimizing in a quick&dirty example), seems to imply incrementing this way is atomic doesn't mean this is standard-compliant (you would cause undefined behavior when trying to access num in a thread), and is wrong anyways, because add is not atomic in x86.
Note that atomics (using the lock instruction prefix) are relatively heavy on x86 (see this relevant answer), but still remarkably less than a mutex, which isn't very appropriate in this use-case.
Following results are taken from clang++ 3.8 when compiling with -Os.
Incrementing an int by reference, the "regular" way :
void inc(int& x)
{
++x;
}
This compiles into :
inc(int&):
incl (%rdi)
retq
Incrementing an int passed by reference, the atomic way :
#include <atomic>
void inc(std::atomic<int>& x)
{
++x;
}
This example, which is not much more complex than the regular way, just gets the lock prefix added to the incl instruction - but caution, as previously stated this is not cheap. Just because assembly looks short doesn't mean it's fast.
inc(std::atomic<int>&):
lock incl (%rdi)
retq
Yes, but...
Atomic is not what you meant to say. You're probably asking the wrong thing.
The increment is certainly atomic. Unless the storage is misaligned (and since you left alignment to the compiler, it is not), it is necessarily aligned within a single cache line. Short of special non-caching streaming instructions, each and every write goes through the cache. Complete cache lines are being atomically read and written, never anything different.
Smaller-than-cacheline data is, of course, also written atomically (since the surrounding cache line is).
Is it thread-safe?
This is a different question, and there are at least two good reasons to answer with a definite "No!".
First, there is the possibility that another core might have a copy of that cache line in L1 (L2 and upwards is usually shared, but L1 is normally per-core!), and concurrently modifies that value. Of course that happens atomically, too, but now you have two "correct" (correctly, atomically, modified) values -- which one is the truly correct one now?
The CPU will sort it out somehow, of course. But the result may not be what you expect.
Second, there is memory ordering, or worded differently happens-before guarantees. The most important thing about atomic instructions is not so much that they are atomic. It's ordering.
You have the possibility of enforcing a guarantee that everything that happens memory-wise is realized in some guaranteed, well-defined order where you have a "happened before" guarantee. This ordering may be as "relaxed" (read as: none at all) or as strict as you need.
For example, you can set a pointer to some block of data (say, the results of some calculation) and then atomically release the "data is ready" flag. Now, whoever acquires this flag will be led into thinking that the pointer is valid. And indeed, it will always be a valid pointer, never anything different. That's because the write to the pointer happened-before the atomic operation.
When your compiler uses only a single instruction for the increment and your machine is single-threaded, your code is safe. ^^
Try compiling the same code on a non-x86 machine, and you'll quickly see very different assembly results.
The reason num++ appears to be atomic is because on x86 machines, incrementing a 32-bit integer is, in fact, atomic (assuming no memory retrieval takes place). But this is neither guaranteed by the c++ standard, nor is it likely to be the case on a machine that doesn't use the x86 instruction set. So this code is not cross-platform safe from race conditions.
You also don't have a strong guarantee that this code is safe from Race Conditions even on an x86 architecture, because x86 doesn't set up loads and stores to memory unless specifically instructed to do so. So if multiple threads tried to update this variable simultaneously, they may end up incrementing cached (outdated) values
The reason, then, that we have std::atomic<int> and so on is so that when you're working with an architecture where the atomicity of basic computations is not guaranteed, you have a mechanism that will force the compiler to generate atomic code.

Is Updating double operation atomic

In Java, updating double and long variable may not be atomic, as double/long are being treated as two separate 32 bits variables.
http://java.sun.com/docs/books/jls/second_edition/html/memory.doc.html#28733
In C++, if I am using 32 bit Intel Processor + Microsoft Visual C++ compiler, is updating double (8 byte) operation atomic?
I cannot find much specification mention on this behavior.
When I say "atomic variable", here is what I mean :
Thread A trying to write 1 to variable x.
Thread B trying to write 2 to variable x.
We shall get value 1 or 2 out from variable x, but not an undefined value.
This is hardware specific and depends an the architecture. For x86 and x86_64 8 byte writes or reads are guaranteed to be atomic, if they are aligned. Quoting from the Intel Architecture Memory Ordering White Paper:
Intel 64 memory ordering guarantees
that for each of the following
memory-access instructions, the
constituent memory operation appears
to execute as a single memory access
regardless of memory type:
Instructions that read or write a single byte.
Instructions that read or write a word (2 bytes) whose address is
aligned on a 2 byte boundary.
Instructions that read or write a doubleword (4 bytes) whose address is
aligned on a 4 byte boundary.
Instructions that read or write a quadword (8 bytes) whose address is
aligned on an 8 byte boundary.
All locked instructions (the implicitly
locked xchg instruction and other
read-modify-write instructions with a
lock prefix) are an indivisible and
uninterruptible sequence of load(s)
followed by store(s) regardless of
memory type and alignment.
It's safe to assume that updating a double is never atomic, even if it's size is the same as an int with atomic guarantee. The reason is that if has different processing path since it's a non-critical and expensive data type. For example even data barriers usually mention that they don't apply to floating point data/operations in general.
Visual C++ will allign primitive types (see article) and while that should guarantee that it's bits won't get garbled while writing to memory (8 byte allignment is always in one 64 or 128 bit cache line) the rest depends on how CPU handles non-atomic data in it's cache and whether reading/flushing a cache line is interruptable. So if you dig through Intel docs for the kind of core you are using and it gives you that guarantee then you are safe to go.
The reason why Java spec is so conservative is that it's supposed to run the same way on an old 386 and on Corei7. Which is of course delusional but a promise is a promise, therefore it promisses less :-)
The reason I'm saying that you have to look up CPU doc is that your CPU might be an old 386, or alike :-)) Don't forget that on a 32-bit CPU your 8-byte block takes 2 "rounds" to access so you are down to the mercy of the mechanics of the cache access.
Cache line flushing giving much higher data consistency guarantee applies only to a reasonably recent CPU with Intel-ian guarantee (automatic cache consistency).
I wouldn't think in any architecture, thread/context switching would interrupt updating a register halfway so that you are left with for example 18bits updated of the 32bits it was going to update. Same for updating a memory location ( provided that it's a basic access unit, 8,16,32,64 bits etc).
So has this question been answered? I ran a simple test program changing a double:
#include <stdio.h>
int main(int argc, char** argv)
{
double i = 3.14159265358979323;
i += 84626.433;
}
I compiled it without optimizations (gcc -O0), and all assignment operations are performed with single assembler instructions such as fldl .LC0 and faddp %st, %st(1). (i += 84626.433 is of course done two operations, faddp and fstpl).
Can a thread really get interrupted inside a single instruction such as faddp?
On a multicore, besides being atomic, you have to worry about cache coherence, so that the thread reading sees the new value in its cache when the writer has updated.