benchmarking, code reordering, volatile - c++

I decide I want to benchmark a particular function, so I naïvely write code like this:
#include <ctime>
#include <iostream>
int SlowCalculation(int input) { ... }
int main() {
std::cout << "Benchmark running..." << std::endl;
std::clock_t start = std::clock();
int answer = SlowCalculation(42);
std::clock_t stop = std::clock();
double delta = (stop - start) * 1.0 / CLOCKS_PER_SEC;
std::cout << "Benchmark took " << delta << " seconds, and the answer was "
<< answer << '.' << std::endl;
return 0;
}
A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering. He suggested that the optimizer could, for example, effectively reorder the code like this:
std::clock_t start = std::clock();
std::clock_t stop = std::clock();
int answer = SlowCalculation(42);
At first I was skeptical that such extreme reordering was allowed, but after some research and experimentation, I learned that it was.
But volatile didn't feel like the right solution; isn't volatile really just for memory mapped I/O?
Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms. The disassembly listings for the two versions showed virtually no difference other than a different selection of registers. This makes me wonder if there is another way to avoid the code reordering that doesn't have such overwhelming side effects.
My question is:
What is the best way to prevent code reordering in benchmarking code like this?
My question is similar to this one (which was about using volatile to avoid elision rather than reordering), this one (which didn't answer how to prevent reordering), and this one (which debated whether the issue was code reordering or dead code elimination). While all three are on this exact topic, none actually answer my question.
Update: The answer appears to be that my colleague was mistaken and that reordering like this isn't consistent with the standard. I've upvoted everyone who said so and am awarding the bounty to the Maxim.
I've seen one case (based on the code in this question) where Visual Studio 2010 reordered the clock calls as I illustrated (only in 64-bit builds). I'm trying to make a minimal case to illustrate that so that I can file a bug on Microsoft Connect.
For those who said that volatile should be much slower because it forces reads and writes to memory, this isn't quite consistent with the code being emitted. In my answer on this question, I show the disassembly for the the code with and without volatile. Inside the loop, everything is kept in registers. The only significant differences appear to be register selection. I do not understand x86 assembly well enough to know why the performance of the non-volatile version is consistently fast while the volatile version is inconsistently (and sometimes dramatically) slower.

A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering.
Sorry, but your colleague is wrong.
The compiler does not reorder calls to functions whose definitions are not available at compile time. Simply imagine the hilarity that would ensue if compiler reordered such calls as fork and exec or moved code around these.
In other words, any function with no definition is a compile time memory barrier, that is, the compiler does not move subsequent statements before the call or prior statements after the call.
In your code calls to std::clock end up calling a function whose definition is not available.
I can not recommend enough watching atomic Weapons: The C++ Memory Model and Modern Hardware because it discusses misconceptions about (compile time) memory barriers and volatile among many other useful things.
Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms
Not sure if volatile is to blame here.
The reported run-time depends on how the benchmark is run. Make sure you disable CPU frequency scaling so that it does not turn on turbo mode or switches frequency in the middle of the run. Also, micro-benchmarks should be run as real-time priority processes to avoid scheduling noise. It could be that during another run some background file indexer starts competing with your benchmark for the CPU time. See this for more details.
A good practice is to measure times it takes to execute the function a number of times and report min/avg/median/max/stdev/total time numbers. High standard deviation may indicate that the above preparations are not performed. The first run often is the longest because the CPU cache may be cold and it may take many cache misses and page faults and also resolve dynamic symbols from shared libraries on the first call (lazy symbol resolution is the default run-time linking mode on Linux, for example), while subsequent calls are going to execute with much less overhead.

The usual way to prevent reordering is a compile barrier i.e asm volatile ("":::"memory"); (with gcc). This is an asm instruction which does nothing, but we tell the compiler it will clobber memory, so it is not permitted to reorder code across it. The cost of this is only the actual cost of removing the reorder, which obviously is not the case for changing the optimisation level etc as suggested elsewhere.
I believe _ReadWriteBarrier is equivalent for Microsoft stuff.
Per Maxim Yegorushkin's answer, reordering is unlikely to be the cause of your issues though.

Volatile ensures one thing, and one thing only: reads from a volatile variable will be read from memory every time -- the compiler won't assume that the value can be cached in a register. And likewise, writes will be written through to memory. The compiler won't keep it around in a register "for a while, before writing it out to memory".
In order to prevent compiler reordering you may use so called compiler fences.
MSVC includes 3 compiler fences:
_ReadWriteBarrier() - full fence
_ReadBarrier() - two-sided fence for loads
_WriteBarrier() - two-sided fence for stores
ICC includes __memory_barrier() full fence.
Full fences are usually the best choice because there is no need in finer-granularity on this level (compiler fences are basically costless in run-time).
Statment reordering (which most compiler do when optimization is enabled), thats the also main reason why certain program fail to operation operation when compiled with compiler optimzation.
Will suggest to read http://preshing.com/20120625/memory-ordering-at-compile-time to see potential issues we can face with compiler reoredering etc.

Related problem: how to stop the compiler from hoisting a tiny repeated calculation out of a loop
I couldn't find this anywhere - so adding my own answer 11 years after the question was asked ;).
Using volatile on variables is not what you want for that. That will cause the compiler to load and store those variable from and to RAM every single time (assuming there is a side effect of that that must be preserved: aka - good for I/O registers). When you are bench marking you are not interested in measuring how long it takes to get something from memory, or write it there. Often you just want your variable to be in CPU registers.
volatile is usable if you assign to it once outside a loop that doesn't get optimized away (like summing an array), as an alternative to printing the result. (Like the long-running function in the question). But not inside a tiny loop; that will introduce store/reload instructions and store-forwarding latency.
I think that the ONLY way to submit your compiler into not optimizing your benchmark code to hell is by using asm. This allows you to fool the compiler into thinking it doesn't know anything about your variables content, or usage, so it has to do everything every single time, as often as your loop asks it to.
For example, if I wanted to benchmark m & -m where m is some uint64_t, I could try:
uint64_t const m = 0x0000080e70100000UL;
for (int i = 0; i < loopsize; ++i)
{
uint64_t result = m & -m;
}
The compiler would obviously say: I'm not even going to calculate that;
since you're not using the result. Aka, it would actually do:
for (int i = 0; i < loopsize; ++i)
{
}
Then you can try:
uint64_t const m = 0x0000080e70100000UL;
static uint64_t volatile result;
for (int i = 0; i < loopsize; ++i)
{
result = m & -m;
}
and the compiler says, ok - so you want me to write to result every time
and do
uint64_t const m = 0x0000080e70100000UL;
uint64_t tmp = m & -m;
static uint64_t volatile result;
for (int i = 0; i < loopsize; ++i)
{
result = tmp;
}
Spending a lot of time writing to the memory address of result loopsize times, just as you asked.
Finally you could also make m volatile, but the result would look like this in assembly:
507b: ba e8 03 00 00 mov $0x3e8,%edx
# top of loop
5080: 48 8b 05 89 ef 20 00 mov 0x20ef89(%rip),%rax # 214010 <m_test>
5087: 48 8b 0d 82 ef 20 00 mov 0x20ef82(%rip),%rcx # 214010 <m_test>
508e: 48 f7 d8 neg %rax
5091: 48 21 c8 and %rcx,%rax
5094: 48 89 44 24 28 mov %rax,0x28(%rsp)
5099: 83 ea 01 sub $0x1,%edx
509c: 75 e2 jne 5080 <main+0x120>
Reading from memory twice and writing to it once, besides the requested calculation with registers.
The correct way to do this is therefore:
for (int i = 0; i < loopsize; ++i)
{
uint64_t result = m & -m;
asm volatile ("" : "+r" (m) : "r" (result));
}
which results in the assembly code (from gcc8.2 on the Godbolt compiler explorer):
# gcc8.2 -O3 -fverbose-asm
movabsq $8858102661120, %rax #, m
movl $1000, %ecx #, ivtmp_9 # induction variable tmp_9
.L2:
mov %rax, %rdx # m, tmp91
neg %rdx # tmp91
and %rax, %rdx # m, result
# asm statement here, m=%rax result=%rdx
subl $1, %ecx #, ivtmp_9
jne .L2
ret
Doing exactly the three requested assembly instructions inside the loop, plus a sub and jne for the loop overhead.
The trick here is that by using the asm volatile1 and tell the compiler
"r" input operand: it uses the value of result as input so the compiler has to materialize it in a register.
"+r" input/output operand: m stays in the same register but is (potentially) modified.
volatile: it has some mysterious side effect and/or is not a pure function of the inputs; the compiler must execute it as many times as the source does. This forces the compiler to leave your test snippet alone and inside the loop. See the gcc manual's Extended Asm#Volatile section.
footnote 1: The volatile is required here or the compiler will turn this into an empty loop. Non-volatile asm (with any output operands) is considered a pure function of its inputs that can be optimized away if the result is unused. Or CSEd to only run once if used multiple times with the same input.
Everything below is not mine-- and I do not necessarily agree with it. --Carlo Wood
If you had used asm volatile ("" : "=r" (m) : "r" (result)); (with an "=r" write-only output), the compiler might choose the same register for m and result, creating a loop-carried dependency chain that tests the latency, not throughput, of the calculation.
From that, you'd get this asm:
5077: ba e8 03 00 00 mov $0x3e8,%edx
507c: 0f 1f 40 00 nopl 0x0(%rax) # alignment padding
# top of loop
5080: 48 89 e8 mov %rbp,%rax # copy m
5083: 48 f7 d8 neg %rax # -m
5086: 48 21 c5 and %rax,%rbp # m &= -m instead of using the tmp as the destination.
5089: 83 ea 01 sub $0x1,%edx
508c: 75 f2 jne 5080 <main+0x120>
This will run at 1 iteration per 2 or 3 cycles (depending on whether your CPU has mov-elimination or not.) The version without a loop-carried dependency can run at 1 per clock cycle on Haswell and later, and Ryzen. Those CPUs have the ALU throughput to run at least 4 uops per clock cycle.
This asm corresponds to C++ that looks like this:
for (int i = 0; i < loopsize; ++i)
{
m = m & -m;
}
By misleading the compiler with a write-only output constraint, we've created asm that doesn't look like the source (which looked like it was computing a new result from a constant every iteration, not using result as an input to the next iteration..)
You might want to microbenchmark latency, so you can more easily detect the benefit of compiling with -mbmi or -march=haswell to let the compiler use blsi %rax, %rax and calculate m &= -m; in one instruction. But it's easier to keep track of what you're doing if the C++ source has the same dependency as the asm, instead of fooling the compiler into introducing a new dependency.

You could make two C files, SlowCalculation compiled with g++ -O3 (high level of optimization), and the benchmark one compiled with g++ -O1 (lower level, still optimized - that may be sufficient for that benchmarking part).
According to the man page, reordering of code happens during -O2 and -O3 optimizations levels.
Since optimization happens during compilation, not linkage, the benchmark side should not be affected by code reordering.
Assuming you are using g++ - but there should be something equivalent in another compiler.

The correct way to do this in C++ is to use a class, e.g. something like
class Timer
{
std::clock_t startTime;
std::clock_t* targetTime;
public:
Timer(std::clock_t* target) : targetTime(target) { startTime = std::clock(); }
~Timer() { *target = std::clock() - startTime; }
};
and use it like this:
std::clock_t slowTime;
{
Timer timer(&slowTime);
int answer = SlowCalculation(42);
}
Mind you, I don't actually believe your compiler will ever re-order like this.

There are a couple of ways that I can think of. The idea is to create compile time barriers so that compiler does not reorder a set of instructions.
One possible way to avoid reordering would be to enforce dependency among instructions that cannot be resolved by compiler (e.g. passing a pointer to the function and using that pointer in later instruction). I'm not sure how that would affect the performance of the actual code that you are interested in benchmarking.
Another possibility is to make the function SlowCalculation(42); an extern function (define this function in a separate .c/.cpp file and link the file to your main program) and declare start and stop as global variables. I do not know what are the optimizations offered by the link-time/inter-procedural optimizer of your compiler.
Also, if you compile at O1 or O0, most probably the compiler would not bother reordering instructions.
Your question is somewhat related to (Compile time barriers - compiler code reordering - gcc and pthreads)

Reordering described by your colleague just breaks 1.9/13
Sequenced before is an asymmetric, transitive, pair-wise relation between evaluations executed by a single
thread (1.10), which induces a partial order among those evaluations. Given any two evaluations A and B, if
A is sequenced before B, then the execution of A shall precede the execution of B. If A is not sequenced before
B and B is not sequenced before A, then A and B are unsequenced . [ Note: The execution of unsequenced
evaluations can overlap. —end note ] Evaluations A and B are indeterminately sequenced when either A
is sequenced before B or B is sequenced before A, but it is unspecified which. [ Note: Indeterminately
sequenced evaluations cannot overlap, but either could be executed first. —end note ]
So basically you should not think about reordering while you don't use threads.

Related

Inline assembly array sum benchmark near-zero time for large arrays with optimization enabled, even though result is used

I have written two functions that gets the sum of an array, the first one is written in C++ and the other is written with inline assembly (x86-64), I compared the performance of the two functions on my device.
If the -O flag is not enabled during compilation the function with inline assembly is almost 4-5x faster than the C++ version.
cpp time : 543070068 nanoseconds
cpp time : 547990578 nanoseconds
asm time : 185495494 nanoseconds
asm time : 188597476 nanoseconds
If the -O flag is set to -O1 they produce the same performance.
cpp time : 177510914 nanoseconds
cpp time : 178084988 nanoseconds
asm time : 179036546 nanoseconds
asm time : 181641378 nanoseconds
But if I try to set the -O flag to -O2 or -O3 I'm getting an unusual 2-3 digit nanoseconds performance for the function written with inline assembly which is sketchy fast (at least for me, please bear with me since I have no rock solid experience with assembly programming so I don't know how fast or how slow it can be compared to a program written in C++.
)
cpp time : 177522894 nanoseconds
cpp time : 183816275 nanoseconds
asm time : 125 nanoseconds
asm time : 75 nanoseconds
My Questions
Why is this array sum function written with inline assembly so fast after enabling -O2 or -O3?
Is this a normal reading or there is something wrong with the timing/measurement of the performance?
Or maybe there is something wrong with my inline assembly function?
And if the inline assembly function for the array sum is correct and the performance reading is correct, why does the C++ compiler failed to optimize a simple array sum function for the C++ version and make it as fast as the inline assembly version?
I have also speculated that maybe the memory alignment and cache misses are improved during compilation to increase the performance but my knowledge on this one is still very very limited.
Apart from answering my questions, if you have something to add please feel free to do so, I hope somebody can explain, thanks!
[EDIT]
So I have removed the use of macro and isolated running the two version and also tried to add volatile keyword, a "memory" clobber and "+&r" constraint for the output and the performance was now the same with the cpp_sum.
Though if I remove back the volatile keyword and "memory" clobber it I'm still getting those 2-3 digit nanoseconds performance.
code:
#include <iostream>
#include <random>
#include <chrono>
uint64_t sum_cpp(const uint64_t *numbers, size_t length) {
uint64_t sum = 0;
for(size_t i=0; i<length; ++i) {
sum += numbers[i];
}
return sum;
}
uint64_t sum_asm(const uint64_t *numbers, size_t length) {
uint64_t sum = 0;
asm volatile(
"xorq %%rax, %%rax\n\t"
"%=:\n\t"
"addq (%[numbers], %%rax, 8), %[sum]\n\t"
"incq %%rax\n\t"
"cmpq %%rax, %[length]\n\t"
"jne %=b"
: [sum]"+&r"(sum)
: [numbers]"r"(numbers), [length]"r"(length)
: "%rax", "memory", "cc"
);
return sum;
}
int main() {
std::mt19937_64 rand_engine(1);
std::uniform_int_distribution<uint64_t> random_number(0,5000);
size_t length = 99999999;
uint64_t *arr = new uint64_t[length];
for(size_t i=1; i<length; ++i) arr[i] = random_number(rand_engine);
uint64_t cpp_total = 0, asm_total = 0;
for(size_t i=0; i<5; ++i) {
auto start = std::chrono::high_resolution_clock::now();
#ifndef _INLINE_ASM
cpp_total += sum_cpp(arr, length);
#else
asm_total += sum_asm(arr,length);
#endif
auto end = std::chrono::high_resolution_clock::now();
auto dur = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start);
std::cout << "time : " << dur.count() << " nanoseconds\n";
}
#ifndef _INLINE_ASM
std::cout << "cpp sum = " << cpp_total << "\n";
#else
std::cout << "asm sum = " << asm_total << "\n";
#endif
delete [] arr;
return 0;
}
The compiler is hoisting the inline asm out of your repeat loop, and thus out of your timed region.
If your goal is performance, https://gcc.gnu.org/wiki/DontUseInlineAsm. The useful thing to spend your time learning first is SIMD intrinsics (and how they compile to asm) like _mm256_add_epi64 to add 4x uint64_t with a single AVX2 instruction. See https://stackoverflow.com/tags/sse/info (Compilers can auto-vectorize decently for a simple sum like this, which you could see the benefit from if you used a smaller array and put a repeat loop inside the timed region to get some cache hits.)
If you want to play around with asm to test what's actually fast on various CPUs, you can do that in a stand-alone static executable, or a function you call from C++. https://stackoverflow.com/tags/x86/info has some good performance links.
Re: benchmarking at -O0, yes the compiler makes slow asm with the default -O0 of consistent debugging and not trying at all to optimize. It's not much of a challenge to beat it when it has its hands tied behind its back.
Why your asm can get hoisted out of the timed regions
Without being asm volatile, your asm statement is a pure function of the inputs you've told the compiler about, which are a pointer, a length, and the initial value of sum=0. It does not include the pointed-to memory because you didn't use a dummy "m" input for that. (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)
Without a "memory" clobber, your asm statement isn't ordered wrt. function calls, so GCC is hoisting the asm statement out of the loop. See How does Google's `DoNotOptimize()` function enforce statement ordering for more details about that effect of the "memory" clobber.
Have a look at the compiler output on https://godbolt.org/z/KeEMfoMvo and see how it inlined into main. -O2 and higher enables -finline-functions, while -O1 only enables -finline-functions-called-once and this isn't static or inline so it has to emit a stand-alone definition in case of calls from other compilation units.
75ns is just the timing overhead of std::chrono functions around a nearly-empty timed region. It is actually running, just not inside the timed regions. You can see this if you single-step the asm of your whole program, or for example set a breakpoint on the asm statement. When doing asm-level debugging of the executable, you could help yourself find it by putting a funky instruction like mov $0xdeadbeef, %eax before xor %eax,%eax, something you can search for in the debugger's disassembly output (like GDB's layout asm or layout reg; see asm debugging tips at the bottom of https://stackoverflow.com/tags/x86/info). And yes, you do often want to look at what the compiler did when debugging inline asm, how it filled in your constraints, because stepping on its toes is a very real possibility.
Note that a "memory" clobber without asm volatile would still let GCC do Common Subexpression Elimination (CSE) between two invocations of the asm statement, if there was no function call in between. Like if you put a repeat loop inside a timed region to test performance on an array small enough to fit in some level of cache.
Sanity-checking your benchmark
Is this a normal reading
It's wild that you even have to ask that. 99999999 8-byte integers in 75ns would be a memory bandwidth of 99999999 * 8 B / 75 ns = 10666666 GB/s, while fast dual-channel DDR4 might hit 32 GB/s. (Or cache bandwidth if it was that large, but it's not, so your code bottlenecks on memory).
Or a 4GHz CPU would have had to run at 99999999 / (75*4) = 333333.33 add instructions per clock cycle, but the pipeline is only 4 to 6 uops wide on modern CPUs, with taken-branch throughputs of at best 1 for a loop branch. (https://uops.info/ and https://agner.org/optimize/)
Even with AVX-512, that's 2/clock 8x uint64_t additions per core, but compilers don't rewrite your inline asm; that would defeat its purpose compared to using plain C++ or intrinsics.
This is pretty obviously just std::chrono timing overhead from a near-empty timed region.
Asm code-review: correctness
As mentioned above, How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
You're also missing an & early clobber declaration in "+&r"(sum) which would in theory let it pick the same register for sum as for one of the inputs. But since sum is also an input, it could only do that if numbers or length were also 0.
It's kind of a toss-up whether it's better to xor-zero inside the asm for an "=&r" output, or better to use "+&r" and leave that zeroing to the compiler. For your loop counter, it makes sense because the compiler doesn't need to know about that at all. But by manually picking RAX for it (with a clobber), you're preventing the compiler from choosing to have your code produce sum in RAX, like it would want for a non-inline function. A dummy [idx] "=&r" (dummy) output operand will get the compiler to pick a register for you, of the appropriate width, e.g. intptr_t.
Asm code review: performance
As David Wohlferd said: xor %eax, %eax to zero RAX. Implicit zero-extension saves a REX prefix. (1 byte of code-size in the machine code. Smaller machine-code is generally better.)
It doesn't seem worth hand-writing asm if you're not going to do anything smarter than what GCC would on its own without -ftree-vectorize or with -mgeneral-regs-only or -mno-sse2 (even though it's baseline for x86-64, kernel code generally needs to avoid SIMD registers). But I guess it works as a learning exercise in how inline asm constraints work, and a starting point for measuring. And to get a benchmark working so you can then test better loops.
Typical x86-64 CPUs can do 2 loads per clock cycle (Intel since Sandybridge, AMD since K8) Or 3/clock on Alder Lake. On modern CPUs with AVX/AVX2, each load can be 32 bytes wide (or 64 bytes with AVX-512) best case on L1d hits. Or more like 1/clock with only L2 hits on recent Intel, which is a reasonable cache-blocking target.
But your loop can at best run 1x 8-byte load per clock cycle, because loop branches can run 1/clock, and add mem, %[sum] has a 1 cycle loop-carried dependency through sum.
That might max out DRAM bandwidth (with the help of HW prefetchers), e.g. 8 B / cycle * 4GHz = 32GB/s, which modern desktop/laptop Intel CPUs can manage for a single core (but not big Xeons). But with fast enough DRAM and/or a slower CPU relative to it, even DRAM can avoid being a bottleneck. But aiming for DRAM bandwidth is quite a low bar compared to L3 or L2 cache bandwidth.
So even if you want to keep using scalar code without movdqu / paddq (or better get to an alignment boundary for memory-source paddq, if you want to spend some code-size to optimize this loop), you could still unroll with two register accumulators for sum which you add at the end. This exposes some instruction-level parallelism, allowing two memory-source loads per clock cycle.
You can also avoid the cmp, which can reduce loop overhead. Fewer uops lets out-of-order exec see farther.
Get a pointer to the end of the array and index from -length up towards zero. Like (arr+len)[idx] with for(idx=-len ; idx != 0 ; idx++). Looping backwards through the array is on some CPUs a little worse for some of the HW prefetchers, so generally not recommended for loops that are often memory bound.
See also Micro fusion and addressing modes - an indexed addressing mode can only stay micro-fused in the back-end on Intel Haswell and later, and only for instructions like add that RMW their destination register.
So your best bet would be a loop with one pointer increment and 2 to 4 add instructions using it, and a cmp/jne at the bottom.

Is 11 a valid output under ISO c++ for x86_64,arm or other arch?

This question is based on Can't relaxed atomic fetch_add reorder with later loads on x86, like store can?
I agree with answer given.
On x86 00 will never occur because a.fetch_add has a lock prefix/full barrier and loads can't reorder above fetch_add but on other architectures like arm/mips it can print 00. I have a two followup question about store buffer on x86 and arm.
I never get 11 on my pc (core i3 x86_64) i.e is 11 a valid output on
x86 in iso c++ , so am i missing something ? #Daniel Langr demonstrated 11 is a valid output on x86.
Now x86_64 has an advantage fetch_add acting as a full barrier.
For arm64 , output can be 00 sometimes due to cpu instruction reordering.
For arm64 or some other arch, can the output be 00 if without reordering ?. My question
is based on this. The store buffer values for function foo
a.fetch_add(1) is not visible to bar's a.load() and b.fetch_add(1) is
not visible to foo's b.load(). Hence we get 00 without reordering. Can this happen under ISO C++ on different archs ?
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;
void foo(){
a.fetch_add(1,memory_order_relaxed); //add to a is stored in store buffer of cpu0
//a.store(1,memory_order_relaxed);
retb=b.load(memory_order_relaxed);
}
void bar(){
b.fetch_add(1,memory_order_relaxed); //add to b is stored in store buffer of cpu1
//b.store(1,memory_order_relaxed);
reta=a.load(memory_order_relaxed);
}
int main(){
thread t[2]{ thread(foo),thread(bar) };
t[0].join(); t[1].join();
printf("%d%d\n",reta,retb);
return 0;
}
Yes, ISO C++ allows this; as Daniel pointed out, one easy way is to put some slow stuff after the RMWs, before the loads, so they don't execute until both threads have had a chance to increment. This should be obvious because it doesn't require any run-time reordering to happen, just a simple interleaving of program-order. (So ISO C++ allows 11 even with seq_cst.) i.e. you could have exercised this by single-stepping each thread separately.
Are you wondering about how to create a practical demonstration on x86 without delay loops?
Try putting your atomic vars in separate cache lines so two different cores can be writing them in parallel.
alignas(64) std::atomic<int> a, b; // the alignas applies to each separately
With them in the same cache line, like probably happens by default, the core that won ownership of the cache line they're both in is going to be able to execute the load of the other var as soon as the full-barrier part of the increment is done. Completing the RMW means that cache line is already hot in this core's L1d cache. (A core can only modify a cache line after gaining exclusive ownership of it via MESI, which will also make it valid for reads.)
So both operations by one thread are extremely likely to happen before either operation by the other thread. (In the x86 asm, each operation has the same asm as its seq_cst equivalent would have had, so we can usefully talk about a global order of operations without losing anything.)
Probably the only thing that would stop this from happening is an interrupt arriving at just the right moment, between the RMW and the load.
You also asked a separate question:
can the output be 00 if without reordering ?
Clearly no. No interleaving of program-order can put both loads before either increment, so either run-time or compile-time reordering is necessary to create the 00 effect.
a.fetch_add(1,memory_order_relaxed); // foo1
retb=b.load(memory_order_relaxed); // foo2
b.fetch_add(1,memory_order_relaxed); // bar1
reta=a.load(memory_order_relaxed); // bar2
Mix those however you want without putting foo2 before foo1 or bar2 before bar1.
i.e. if you single-stepped each thread separately, you could never see 00. Of course the whole point of mo_relaxed is that it can reorder. Specifying "without reordering" is the same as saying "with seq_cst".
The effects of a store buffer are a kind of reordering, specifically Store Load reordering. mo_seq_cst prevents even that, which is part of the point of seq_cst and what makes it so expensive, especially for pure-store operations.

Why is ONE basic arithmetic operation in for loop body executed SLOWER THAN TWO arithmetic operations?

While I experimented with measuring time of execution of arithmetic operations, I came across very strange behavior. A code block containing a for loop with one arithmetic operation in the loop body was always executed slower than an identical code block, but with two arithmetic operations in the for loop body. Here is the code I ended up testing:
#include <iostream>
#include <chrono>
#define NUM_ITERATIONS 100000000
int main()
{
// Block 1: one operation in loop body
{
int64_t x = 0, y = 0;
auto start = std::chrono::high_resolution_clock::now();
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=31;}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end-start;
std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
}
// Block 2: two operations in loop body
{
int64_t x = 0, y = 0;
auto start = std::chrono::high_resolution_clock::now();
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=17; y-=37;}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end-start;
std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
}
return 0;
}
I tested this with different levels of code optimization (-O0,-O1,-O2,-O3), with different online compilers (for example onlinegdb.com), on my work machine, on my hame PC and laptop, on RaspberryPi and on my colleague's computer. I rearranged these two code blocks, repeated them, changed constants, changed operations (+, -, <<, =, etc.), changed integer types. But I always got similar result: the block with one line in loop is SLOWER than block with two lines:
1.05681 seconds. x,y = 3100000000,0
0.90414 seconds. x,y = 1700000000,-3700000000
I checked the assembly output on https://godbolt.org/ but everything looked like I expected: second block just had one more operation in assembly output.
Three operations always behaved as expected: they are slower than one and faster than four. So why two operations produce such an anomaly?
Edit:
Let me repeat: I have such behaviour on all of my Windows and Unix machines with code not optimized. I looked at assembly I execute (Visual Studio, Windows) and I see the instructions I want to test there. Anyway if the loop is optimized away, there is nothing I ask about in the code which left. I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about. The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away. The difference in time of execution is 5-25% on my tests (quite noticeable).
This effect only happens at -O0 (or with volatile), and is a result of the compiler keeping your variables in memory (not registers). You'd expect that to just introduce a fixed amount of extra latency into a loop-carried dependency chains through i, x, and y, but modern CPUs are not that simple.
On Intel Sandybridge-family CPUs, store-forwarding latency is lower when the load uop runs some time after the store whose data it's reloading, not right away. So an empty loop with the loop counter in memory is the worst case. I don't understand what CPU design choices could lead to that micro-architectural quirk, but it's a real thing.
This is basically a duplicate of Adding a redundant assignment speeds up code when compiled without optimization, at least for Intel Sandybridge-family CPUs.
This is is one of the major reasons why you shouldn't benchmark at -O0: the bottlenecks are different than in realistically optimized code. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why compilers make such terrible asm on purpose.
Micro-benchmarking is hard; you can only measure something properly if you can get compilers to emit realistically optimized asm loops for the thing you're trying to measure. (And even then you're only measuring throughput or latency, not both; those are separate things for single operations on out-of-order pipelined CPUs: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)
See #rcgldr's answer for measurement + explanation of what would happen with loops that keep variables in registers.
With clang, benchmark::DoNotOptimize(x1 += 31) also de-optimizes into keeping x in memory, but with GCC it does just stay in a register. Unfortunately #SashaKnorre's answer used clang on QuickBench, not gcc, to get results similar to your -O0 asm. It does show the cost of lots of short-NOPs being hidden by the bottleneck through memory, and a slight speedup when those NOPs delay the reload next iteration just long enough for store-forwarding to hit the lower latency good case. (QuickBench I think runs on Intel Xeon server CPUs, with the same microarchitecture inside each CPU core as desktop version of the same generation.)
Presumably all the x86 machines you tested on had Intel CPUs from the last 10 years, or else there's a similar effect on AMD. It's plausible there's a similar effect on whichever ARM CPU your RPi uses, if your measurements really were meaningful there. Otherwise, maybe another case of seeing what you expected (confirmation bias), especially if you tested with optimization enabled there.
I tested this with different levels of code optimization (-O0,-O1,-O2,-O3) [...] But I always got similar result
I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about.
(later from comments) About optimizations: yes, I reproduced that with different optimization levels, but as the loops were optimized away, the execution time was too fast to say for sure.
So actually you didn't reproduce this effect for -O1 or higher, you just saw what you wanted to see (confirmation bias) and mostly made up the claim that the effect was the same. If you'd accurately reported your data (measurable effect at -O0, empty timed region at -O1 and higher), I could have answered right away.
See Idiomatic way of performance evaluation? - if your times don't increase linearly with increasing repeat count, you aren't measuring what you think you're measuring. Also, startup effects (like cold caches, soft page faults, lazy dynamic linking, and dynamic CPU frequency) can easily lead to the first empty timed region being slower than the second.
I assume you only swapped the loops around when testing at -O0, otherwise you would have ruled out there being any effect at -O1 or higher with that test code.
The loop with optimization enabled:
As you can see on Godbolt, gcc fully removes the loop with optimization enabled. Sometimes GCC leaves empty loops alone, like maybe it thinks the delay was intentional, but here it doesn't even loop at all. Time doesn't scale with anything, and both timed regions look the same like this:
orig_main:
...
call std::chrono::_V2::system_clock::now() # demangled C++ symbol name
mov rbp, rax # save the return value = start
call std::chrono::_V2::system_clock::now()
# end in RAX
So the only instruction in the timed region is saving start to a call-preserved register. You're measuring literally nothing about your source code.
With Google Benchmark, we can get asm that doesn't optimize the work away, but which doesn't store/reload to introduce new bottlenecks:
#include <benchmark/benchmark.h>
static void TargetFunc(benchmark::State& state) {
uint64_t x2 = 0, y2 = 0;
// Code inside this loop is measured repeatedly
for (auto _ : state) {
benchmark::DoNotOptimize(x2 += 31);
benchmark::DoNotOptimize(y2 += 31);
}
}
// Register the function as a benchmark
BENCHMARK(TargetFunc);
# just the main loop, from gcc10.1 -O3
.L7: # do{
add rax, 31 # x2 += 31
add rdx, 31 # y2 += 31
sub rbx, 1
jne .L7 # }while(--count != 0)
I assume benchmark::DoNotOptimize is something like asm volatile("" : "+rm"(x) ) (GNU C inline asm) to make the compiler materialize x in a register or memory, and to assume the lvalue has been modified by that empty asm statement. (i.e. forget anything it knew about the value, blocking constant-propagation, CSE, and whatever.) That would explain why clang stores/reloads to memory while GCC picks a register: this is a longstanding missed-optimization bug with clang's inline asm support. It likes to pick memory when given the choice, which you can sometimes work around with multi-alternative constraints like "+r,m". But not here; I had to just drop the memory alternative; we don't want the compiler to spill/reload to memory anyway.
For GNU C compatible compilers, we can use asm volatile manually with only "+r" register constraints to get clang to make good scalar asm (Godbolt), like GCC. We get an essentially identical inner loop, with 3 add instructions, the last one being an add rbx, -1 / jnz that can macro-fuse.
static void TargetFunc(benchmark::State& state) {
uint64_t x2 = 0, y2 = 0;
// Code inside this loop is measured repeatedly
for (auto _ : state) {
x2 += 16;
y2 += 17;
asm volatile("" : "+r"(x2), "+r"(y2));
}
}
All of these should run at 1 clock cycle per iteration on modern Intel and AMD CPUs, again see #rcgldr's answer.
Of course this also disables auto-vectorization with SIMD, which compilers would do in many real use cases. Or if you used the result at all outside the loop, it might optimize the repeated increment into a single multiply.
You can't measure the cost of the + operator in C++ - it can compile very differently depending on context / surrounding code. Even without considering loop-invariant stuff that hoists work. e.g. x + (y<<2) + 4 can compile to a single LEA instruction for x86.
The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away
TL:DR: it's not the operations, it's the loop-carried dependency chain through memory that stops the CPU from running the loop at 1 clock cycle per iteration, doing all 3 adds in parallel on separate execution ports.
Note that the loop counter increment is just as much of an operation as what you're doing with x (and sometimes y).
ETA: This was a guess, and Peter Cordes has made a very good argument about why it's incorrect. Go upvote Peter's answer.
I'm leaving my answer here because some found the information useful. Though this doesn't correctly explain the behavior seen in the OP, it highlights some of the issues that make it infeasible (and meaningless) to try to measure the speed of a particular instruction on a modern processor.
Educated guess:
It's the combined effect of pipelining, powering down portions of a core, and dynamic frequency scaling.
Modern processors pipeline so that multiple instructions can be executing at the same time. This is possible because the processor actually works on micro-ops rather than the assembly-level instructions we usually think of as machine language. Processors "schedule" micro-ops by dispatching them to different portions of the chip while keeping track of the dependencies between the instructions.
Suppose the core running your code has two arithmetic/logic units (ALUs). A single arithmetic instruction repeated over and over requires only one ALU. Using two ALUs doesn't help because the next operation depends on completion of the current one, so the second ALU would just be waiting around.
But in your two-expression test, the expressions are independent. To compute the next value of y, you do not have to wait for the current operation on x to complete. Now, because of power-saving features, that second ALU may be powered down at first. The core might run a few iterations before realizing that it could make use of the second ALU. At that point, it can power up the second ALU and most of the two-expression loop will run as fast as the one-expression loop. So you might expect the two examples to take approximately the same amount of time.
Finally, many modern processors use dynamic frequency scaling. When the processor detects that it's not running hard, it actually slows its clock a little bit to save power. But when it's used heavily (and the current temperature of the chip permits), it might increase the actual clock speed as high as its rated speed.
I assume this is done with heuristics. In the case where the second ALU stays powered down, the heuristic may decide it's not worth boosting the clock. In the case where two ALUs are powered up and running at top speed, it may decide to boost the clock. Thus the two-expression case, which should already be just about as fast as the one-expression case, actually runs at a higher average clock frequency, enabling it to complete twice as much work in slightly less time.
Given your numbers, the difference is about 14%. My Windows machine idles at about 3.75 GHz, and if I push it a little by building a solution in Visual Studio, the clock climbs to about 4.25GHz (eyeballing the Performance tab in Task Manager). That's a 13% difference in clock speed, so we're in the right ballpark.
I split up the code into C++ and assembly. I just wanted to test the loops, so I didn't return the sum(s). I'm running on Windows, the calling convention is rcx, rdx, r8, r9, the loop count is in rcx. The code is adding immediate values to 64 bit integers on the stack.
I'm getting similar times for both loops, less than 1% variation, same or either one up to 1% faster than the other.
There is an apparent dependency factor here: each add to memory has to wait for the prior add to memory to the same location to complete, so two add to memories can be performed essentially in parallel.
Changing test2 to do 3 add to memories, ends up about 6% slower, 4 add to memories, 7.5% slower.
My system is Intel 3770K 3.5 GHz CPU, Intel DP67BG motherboard, DDR3 1600 9-9-9-27 memory, Win 7 Pro 64 bit, Visual Studio 2015.
.code
public test1
align 16
test1 proc
sub rsp,16
mov qword ptr[rsp+0],0
mov qword ptr[rsp+8],0
tst10: add qword ptr[rsp+8],17
dec rcx
jnz tst10
add rsp,16
ret
test1 endp
public test2
align 16
test2 proc
sub rsp,16
mov qword ptr[rsp+0],0
mov qword ptr[rsp+8],0
tst20: add qword ptr[rsp+0],17
add qword ptr[rsp+8],-37
dec rcx
jnz tst20
add rsp,16
ret
test2 endp
end
I also tested with add immediate to register, 1 or 2 registers within 1% (either could be faster, but we'd expect them both to execute at 1 iteration / clock on Ivy Bridge, given its 3 integer ALU ports; What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?).
3 registers 1.5 times as long, somewhat worse than the ideal 1.333 cycles / iterations from 4 uops (including the loop counter macro-fused dec/jnz) for 3 back-end ALU ports with perfect scheduling.
4 registers, 2.0 times as long, bottlenecked on the front-end: Is performance reduced when executing loops whose uop count is not a multiple of processor width?. Haswell and later microarchitectures would handle this better.
.code
public test1
align 16
test1 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst10: add rdx,17
dec rcx
jnz tst10
ret
test1 endp
public test2
align 16
test2 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst20: add rdx,17
add r8,-37
dec rcx
jnz tst20
ret
test2 endp
public test3
align 16
test3 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst30: add rdx,17
add r8,-37
add r9,47
dec rcx
jnz tst30
ret
test3 endp
public test4
align 16
test4 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst40: add rdx,17
add r8,-37
add r9,47
add r10,-17
dec rcx
jnz tst40
ret
test4 endp
end
#PeterCordes proved this answer to be wrong in many assumptions, but it could still be useful as some blind research attempt of the problem.
I set up some quick benchmarks, thinking it may somehow be connected to code memory alignment, truly a crazy thought.
But it seems that #Adrian McCarthy got it right with the dynamic frequency scaling.
Anyway benchmarks tell that inserting some NOPs could help with the issue, with 15 NOPs after the x+=31 in Block 1 leading to nearly the same performance as the Block 2. Truly mind blowing how 15 NOPs in single instruction loop body increase performance.
http://quick-bench.com/Q_7HY838oK5LEPFt-tfie0wy4uA
I also tried -OFast thinking compilers might be smart enough to throw away some code memory inserting such NOPs, but it seems not to be the case.
http://quick-bench.com/so2CnM_kZj2QEWJmNO2mtDP9ZX0
Edit: Thanks to #PeterCordes it was made clear that optimizations were never working quite as expected in benchmarks above (as global variable required add instructions to access memory), new benchmark http://quick-bench.com/HmmwsLmotRiW9xkNWDjlOxOTShE clearly shows that Block 1 and Block 2 performance is equal for stack variables. But NOPs could still help with single-threaded application with loop accessing global variable, which you probably should not use in that case and just assign global variable to local variable after the loop.
Edit 2: Actually optimizations never worked due to quick-benchmark macros making variable access volatile, preventing important optimizations. It is only logical to load the variable once as we are only modifying it in the loop, so it is volatile or disabled optimizations being the bottleneck. So this answer is basically wrong, but at least it shows how NOPs could speed-up unoptimized code execution, if it makes any sense in the real world (there are better ways like bucketing counters).
Processors are so complex these days that we can only guess.
The assembly emitted by your compiler is not what is really executed. The microcode/firmware/whatever of your CPU will interpret it and turn it into instructions for its execution engine, much like JIT languages such as C# or java do.
One thing to consider here is that for each loop, there is not 1 or 2 instructions, but n + 2, as you also increment and compare i to your number of iteration. In the vast majority of case it wouldn't matter, but here it does, as the loop body is so simple.
Let's see the assembly :
Some defines:
#define NUM_ITERATIONS 1000000000ll
#define X_INC 17
#define Y_INC -31
C/C++ :
for (long i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }
ASM :
mov QWORD PTR [rbp-32], 0
.L13:
cmp QWORD PTR [rbp-32], 999999999
jg .L12
add QWORD PTR [rbp-24], 17
add QWORD PTR [rbp-32], 1
jmp .L13
.L12:
C/C++ :
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=X_INC; y+=Y_INC;}
ASM:
mov QWORD PTR [rbp-80], 0
.L21:
cmp QWORD PTR [rbp-80], 999999999
jg .L20
add QWORD PTR [rbp-64], 17
sub QWORD PTR [rbp-72], 31
add QWORD PTR [rbp-80], 1
jmp .L21
.L20:
So both Assemblies look pretty similar. But then let's think twice : modern CPUs have ALUs which operate on values which are wider than their register size. So there is a chance than in the first case, the operation on x and i are done on the same computing unit. But then you have to read i again, as you put a condition on the result of this operation. And reading means waiting.
So, in the first case, to iterate on x, the CPU might have to be in sync with the iteration on i.
In the second case, maybe x and y are treated on a different unit than the one dealing with i. So in fact, your loop body runs in parallel than the condition driving it. And there goes your CPU computing and computing until someone tells it to stop. It doesn't matter if it goes too far, going back a few loops is still fine compared to the amount of time it just gained.
So, to compare what we want to compare (one operation vs two operations), we should try to get i out of the way.
One solution is to completely get rid of it by using a while loop:
C/C++:
while (x < (X_INC * NUM_ITERATIONS)) { x+=X_INC; }
ASM:
.L15:
movabs rax, 16999999999
cmp QWORD PTR [rbp-40], rax
jg .L14
add QWORD PTR [rbp-40], 17
jmp .L15
.L14:
An other one is to use the antequated "register" C keyword:
C/C++:
register long i;
for (i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }
ASM:
mov ebx, 0
.L17:
cmp rbx, 999999999
jg .L16
add QWORD PTR [rbp-48], 17
add rbx, 1
jmp .L17
.L16:
Here are my results:
x1 for: 10.2985 seconds. x,y = 17000000000,0
x1 while: 8.00049 seconds. x,y = 17000000000,0
x1 register-for: 7.31426 seconds. x,y = 17000000000,0
x2 for: 9.30073 seconds. x,y = 17000000000,-31000000000
x2 while: 8.88801 seconds. x,y = 17000000000,-31000000000
x2 register-for :8.70302 seconds. x,y = 17000000000,-31000000000
Code is here: https://onlinegdb.com/S1lAANEhI

C/C++: relaxed std::atomic<bool> vs unlocked bool on X64 architecture

Is there any efficency benefit to using an unlocked boolean over using an std::atomic<bool> where the operations are always done with relaxed memory order? I would assume that both eventually compile to the same machine code, since a single byte is actually atomic on X64 hardware. Am I wrong?
Yes, there's are potentially massive advantages, especially for local variables, or any variable used repeatedly in the same function. An atomic<> variable can't be optimized into a register.
If you compiled without optimization, the code-gen would be similar, but compiling with normal optimization enabled there can be massive differences. Un-optimized code is similar to making every variable volatile.
Current compilers also never combine multiple reads of an atomic variable into one, as if you'd used volatile atomic<T>, because that's what people expect and the dust hasn't settled yet on how to allow useful optimizations while prohibiting ones you don't want. (Why don't compilers merge redundant std::atomic writes? and Can and does the compiler optimize out two atomic loads?).
This isn't a great example, but imagine that checking the boolean is done inside an inlined function, and that there's something else inside the loop. (Otherwise you'd put the if around the loop like a normal person.)
int sumarr_atomic(int arr[]) {
int sum = 0;
for(int i=0 ; i<10000 ; i++) {
if (atomic_bool.load (std::memory_order_relaxed)) {
sum += arr[i];
}
}
return sum;
}
See the asm output on Godbolt.
But with a non-atomic bool, the compiler can make that transformation for you by hoisting the load, and then auto-vectorize the simple sum loop (or not run it at all).
With atomic_bool, it can't. With atomic_bool, the asm loop is much like the C++ source, actually doing a test and branch on the value of the variable inside every loop iteration. And this of course defeats auto-vectorization.
(The C++ as-if rules would allow the compiler to hoist the load because it's relaxed so it can reorder with non-atomic accesses. And merge because reading the same value every time is one possible result of a global order that reads one value. But as I said, compilers don't do that.)
Loops over an array of bool can auto-vectorize, but not over atomic<bool> [].
Also, inverting a boolean with something like b ^= 1; or b++ can be just a regular RMW, not atomic RMW, so it doesn't have to use lock xor or lock btc. (x86 atomic RMW is only possible with sequential-consistency vs. runtime reordering, i.e. the lock prefix is also a full memory barrier.)
Code that modifies a non-atomic boolean can optimize away the actual modifications, e.g.
void loop() {
for(int i=0 ; i<10000 ; i++) {
regular_bool ^= 1;
}
}
compiles to asm that keeps regular_bool in a register. Unfortunately it doesn't optimize away to nothing (which it could because flipping a boolean an even number of times sets it back to its original value). But it could with a smarter compiler.
loop():
movzx edx, BYTE PTR regular_bool[rip] # load into a register
mov eax, 10000
.L17: # do {
xor edx, 1 # flip the boolean
sub eax, 1
jne .L17 # } while(--i);
mov BYTE PTR regular_bool[rip], dl # store back the result
ret
Even if written as atomic_b.store( !atomic_b.load(mo_relaxed), mo_relaxed) (separate atomic loads/stores), you'd still get a store/reload in the loop, creating a 6-cycle loop-carried dependency chain through the store/reload (on Intel CPUs with 5-cycle store-forwarding latency) instead of a 1-cycle dep chain through a register.
Checking over at Godbolt, loading a regular bool and a std::atomic<bool> generate different code, although not because of synchronisation issues. Instead, the compiler (gcc) seems unwilling to assume that a std::atomic<bool> is guaranteed to be either 0 or 1. Strange, that.
Clang does the same thing, although the code generated is slightly different in detail.

Can num++ be atomic for 'int num'?

In general, for int num, num++ (or ++num), as a read-modify-write operation, is not atomic. But I often see compilers, for example GCC, generate the following code for it (try here):
void f()
{
int num = 0;
num++;
}
f():
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
add DWORD PTR [rbp-4], 1
nop
pop rbp
ret
Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?
And if so, does it mean that so-generated num++ can be used in concurrent (multi-threaded) scenarios without any danger of data races (i.e. we don't need to make it, for example, std::atomic<int> and impose the associated costs, since it's atomic anyway)?
UPDATE
Notice that this question is not whether increment is atomic (it's not and that was and is the opening line of the question). It's whether it can be in particular scenarios, i.e. whether one-instruction nature can in certain cases be exploited to avoid the overhead of the lock prefix. And, as the accepted answer mentions in the section about uniprocessor machines, as well as this answer, the conversation in its comments and others explain, it can (although not with C or C++).
This is absolutely what C++ defines as a Data Race that causes Undefined Behaviour, even if one compiler happened to produce code that did what you hoped on some target machine. You need to use std::atomic for reliable results, but you can use it with memory_order_relaxed if you don't care about reordering. See below for some example code and asm output using fetch_add.
But first, the assembly language part of the question:
Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?
Memory-destination instructions (other than pure stores) are read-modify-write operations that happen in multiple internal steps. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc., etc.
Memory operations from other CPUs can become globally visible between the load and store. I.e. two threads running add dword [num], 1 in a loop would step on each other's stores. (See #Margaret's answer for a nice diagram). After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.
"Atomic", from the Greek word meaning indivisible, means that no observer can see the operation as separate steps. Happening physically / electrically instantaneously for all bits simultaneously is just one way to achieve this for a load or store, but that's not even possible for an ALU operation. I went into a lot more detail about pure loads and pure stores in my answer to Atomicity on x86, while this answer focuses on read-modify-write.
The lock prefix can be applied to many read-modify-write (memory destination) instructions to make the entire operation atomic with respect to all possible observers in the system (other cores and DMA devices, not an oscilloscope hooked up to the CPU pins). That is why it exists. (See also this Q&A).
So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line pinned in Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.
Without the lock prefix, another core could take ownership of the cache line and modify it after our load but before our store, so that other store would become globally visible in between our load and store. Several other answers get this wrong, and claim that without lock you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches.
(If a locked instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don't misalign your atomic variables!)
Note that the lock prefix also turns an instruction into a full memory barrier (like MFENCE), stopping all run-time reordering and thus giving sequential consistency. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)
On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another question, and the comments under it, where I explain this in more detail.
Back to C++:
It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation:
;; Valid compiler output for num++
mov eax, [num]
inc eax
mov [num], eax
This is very likely if you use the value of num later: the compiler will keep it live in a register after the increment. So even if you check how num++ compiles on its own, changing the surrounding code can affect it.
(If the value isn't needed later, inc dword [num] is preferred; modern x86 CPUs will run a memory-destination RMW instruction at least as efficiently as using three separate instructions. Fun fact: gcc -O3 -m32 -mtune=i586 will actually emit this, because (Pentium) P5's superscalar pipeline didn't decode complex instructions to multiple simple micro-operations the way P6 and later microarchitectures do. See the Agner Fog's instruction tables / microarchitecture guide for more info, and the x86 tag wiki for many useful links (including Intel's x86 ISA manuals, which are freely available as PDF)).
Don't confuse the target memory model (x86) with the C++ memory model
Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.
Classic example: Storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.
You might be expecting that this code will synchronize with other threads:
// int flag; is just a plain global, not std::atomic<int>.
flag--; // Pretend this is supposed to be some kind of locking attempt
modify_a_data_structure(&foo); // doesn't look at flag, and the compiler knows this. (Assume it can see the function def). Otherwise the usual don't-break-single-threaded-code rules come into play!
flag++;
But it won't. The compiler is free to move the flag++ across the function call (if it inlines the function or knows that it doesn't look at flag). Then it can optimize away the modification entirely, because flag isn't even volatile.
(And no, C++ volatile is not a useful substitute for std::atomic. std::atomic does make the compiler assume that values in memory can be modified asynchronously similar to volatile, but there's much more to it than that. (In practice there are similarities between volatile int to std::atomic with mo_relaxed for pure-load and pure-store operations, but not for RMWs). Also, volatile std::atomic<int> foo is not necessarily the same as std::atomic<int> foo, although current compilers don't optimize atomics (e.g. 2 back-to-back stores of the same value) so volatile atomic wouldn't change the code-gen.)
Defining data races on non-atomic variables as Undefined Behaviour is what lets the compiler still hoist loads and sink stores out of loops, and many other optimizations for memory that multiple threads might have a reference to. (See this LLVM blog for more about how UB enables compiler optimizations.)
As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but it can be much more efficient on other architectures (like ARM). Even on x86, relaxed allows more compile-time reordering.
This is what GCC actually does on x86, for a few functions that operate on a std::atomic global variable.
See the source + assembly language code formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of assembly language code you get from atomics for those targets.
#include <atomic>
std::atomic<int> num;
void inc_relaxed() {
num.fetch_add(1, std::memory_order_relaxed);
}
int load_num() { return num; } // Even seq_cst loads are free on x86
void store_num(int val){ num = val; }
void store_num_release(int val){
num.store(val, std::memory_order_release);
}
// Can the compiler collapse multiple atomic operations into one? No, it can't.
# g++ 6.2 -O3, targeting x86-64 System V calling convention. (First argument in edi/rdi)
inc_relaxed():
lock add DWORD PTR num[rip], 1 #### Even relaxed RMWs need a lock. There's no way to request just a single-instruction RMW with no lock, for synchronizing between a program and signal handler for example. :/ There is atomic_signal_fence for ordering, but nothing for RMW.
ret
inc_seq_cst():
lock add DWORD PTR num[rip], 1
ret
load_num():
mov eax, DWORD PTR num[rip]
ret
store_num(int):
mov DWORD PTR num[rip], edi
mfence ##### seq_cst stores need an mfence
ret
store_num_release(int):
mov DWORD PTR num[rip], edi
ret ##### Release and weaker doesn't.
store_num_relaxed(int):
mov DWORD PTR num[rip], edi
ret
Notice how MFENCE (a full barrier) is needed after a sequential-consistency stores. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.
Re: discussion in comments on #Richard Hodges' answer about compilers merging std::atomic num++; num-=2; operations into one num--; instruction:
A separate Q&A on this same subject: Why don't compilers merge redundant std::atomic writes?, where my answer restates a lot of what I wrote below.
Current compilers don't actually do this (yet), but not because they aren't allowed to. C++ WG21/P0062R1: When should compilers optimize atomics? discusses the expectation that many programmers have that compilers won't make "surprising" optimizations, and what the standard can do to give programmers control. N4455 discusses many examples of things that can be optimized, including this one. It points out that inlining and constant-propagation can introduce things like fetch_or(0) which may be able to turn into just a load() (but still has acquire and release semantics), even when the original source didn't have any obviously redundant atomic ops.
The real reasons compilers don't do it (yet) are: (1) nobody's written the complicated code that would allow the compiler to do that safely (without ever getting it wrong), and (2) it potentially violates the principle of least surprise. Lock-free code is hard enough to write correctly in the first place. So don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It's not always easy easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).
Getting back to num++; num-=2; compiling as if it were num--:
Compilers are allowed to do this, unless num is volatile std::atomic<int>. If a reordering is possible, the as-if rule allows the compiler to decide at compile time that it always happens that way. Nothing guarantees that an observer could see the intermediate values (the num++ result).
I.e. if the ordering where nothing becomes globally visible between these operations is compatible with the ordering requirements of the source
(according to the C++ rules for the abstract machine, not the target architecture), the compiler can emit a single lock dec dword [num] instead of lock inc dword [num] / lock sub dword [num], 2.
num++; num-- can't disappear, because it still has a Synchronizes With relationship with other threads that look at num, and it's both an acquire-load and a release-store which disallows reordering of other operations in this thread. For x86, this might be able to compile to an MFENCE, instead of a lock add dword [num], 0 (i.e. num += 0).
As discussed in PR0062, more aggressive merging of non-adjacent atomic ops at compile time can be bad (e.g. a progress counter only gets updated once at the end instead of every iteration), but it can also help performance without downsides (e.g. skipping the atomic inc / dec of ref counts when a copy of a shared_ptr is created and destroyed, if the compiler can prove that another shared_ptr object exists for entire lifespan of the temporary.)
Even num++; num-- merging could hurt fairness of a lock implementation when one thread unlocks and re-locks right away. If it's never actually released in the asm, even hardware arbitration mechanisms won't give another thread a chance to grab the lock at that point.
With current gcc6.2 and clang3.9, you still get separate locked operations even with memory_order_relaxed in the most obviously optimizable case. (Godbolt compiler explorer so you can see if the latest versions are different.)
void multiple_ops_relaxed(std::atomic<unsigned int>& num) {
num.fetch_add( 1, std::memory_order_relaxed);
num.fetch_add(-1, std::memory_order_relaxed);
num.fetch_add( 6, std::memory_order_relaxed);
num.fetch_add(-5, std::memory_order_relaxed);
//num.fetch_add(-1, std::memory_order_relaxed);
}
multiple_ops_relaxed(std::atomic<unsigned int>&):
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
ret
Without many complications an instruction like add DWORD PTR [rbp-4], 1 is very CISC-style.
It perform three operations: load the operand from memory, increment it, store the operand back to memory.
During these operations the CPU acquire and release the bus twice, in between any other agent can acquire it too and this violates the atomicity.
AGENT 1 AGENT 2
load X
inc C
load X
inc C
store X
store X
X is incremented only once.
...and now let's enable optimisations:
f():
rep ret
OK, let's give it a chance:
void f(int& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
result:
f(int&):
mov DWORD PTR [rdi], 0
ret
another observing thread (even ignoring cache synchronisation delays) has no opportunity to observe the individual changes.
compare to:
#include <atomic>
void f(std::atomic<int>& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
where the result is:
f(std::atomic<int>&):
mov DWORD PTR [rdi], 0
mfence
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
lock sub DWORD PTR [rdi], 1
ret
Now, each modification is:-
observable in another thread, and
respectful of similar modifications happening in other threads.
atomicity is not just at the instruction level, it involves the whole pipeline from processor, through the caches, to memory and back.
Further info
Regarding the effect of optimisations of updates of std::atomics.
The c++ standard has the 'as if' rule, by which it is permissible for the compiler to reorder code, and even rewrite code provided that the outcome has the exact same observable effects (including side-effects) as if it had simply executed your code.
The as-if rule is conservative, particularly involving atomics.
consider:
void incdec(int& num) {
++num;
--num;
}
Because there are no mutex locks, atomics or any other constructs that influence inter-thread sequencing, I would argue that the compiler is free to rewrite this function as a NOP, eg:
void incdec(int&) {
// nada
}
This is because in the c++ memory model, there is no possibility of another thread observing the result of the increment. It would of course be different if num was volatile (might influence hardware behaviour). But in this case, this function will be the only function modifying this memory (otherwise the program is ill-formed).
However, this is a different ball game:
void incdec(std::atomic<int>& num) {
++num;
--num;
}
num is an atomic. Changes to it must be observable to other threads that are watching. Changes those threads themselves make (such as setting the value to 100 in between the increment and decrement) will have very far-reaching effects on the eventual value of num.
Here is a demo:
#include <thread>
#include <atomic>
int main()
{
for (int iter = 0 ; iter < 20 ; ++iter)
{
std::atomic<int> num = { 0 };
std::thread t1([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
++num;
--num;
}
});
std::thread t2([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
num = 100;
}
});
t2.join();
t1.join();
std::cout << num << std::endl;
}
}
sample output:
99
99
99
99
99
100
99
99
100
100
100
100
99
99
100
99
99
100
100
99
The add instruction is not atomic. It references memory, and two processor cores may have different local cache of that memory.
IIRC the atomic variant of the add instruction is called lock xadd
Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?
It is dangerous to draw conclusions based on "reverse engineering" generated assembly. For example, you seem to have compiled your code with optimization disabled, otherwise the compiler would have thrown away that variable or loaded 1 directly to it without invoking operator++. Because the generated assembly may change significantly, based on optimization flags, target CPU, etc., your conclusion is based on sand.
Also, your idea that one assembly instruction means an operation is atomic is wrong as well. This add will not be atomic on multi-CPU systems, even on the x86 architecture.
Even if your compiler always emitted this as an atomic operation, accessing num from any other thread concurrently would constitute a data race according to the C++11 and C++14 standards and the program would have undefined behavior.
But it is worse than that. First, as has been mentioned, the instruction generated by the compiler when incrementing a variable may depend on the optimization level. Secondly, the compiler may reorder other memory accesses around ++num if num is not atomic, e.g.
int main()
{
std::unique_ptr<std::vector<int>> vec;
int ready = 0;
std::thread t{[&]
{
while (!ready);
// use "vec" here
});
vec.reset(new std::vector<int>());
++ready;
t.join();
}
Even if we assume optimistically that ++ready is "atomic", and that the compiler generates the checking loop as needed (as I said, it's UB and therefore the compiler is free to remove it, replace it with an infinite loop, etc.), the compiler might still move the pointer assignment, or even worse the initialization of the vector to a point after the increment operation, causing chaos in the new thread. In practice, I would not be surprised at all if an optimizing compiler removed the ready variable and the checking loop completely, as this does not affect observable behavior under language rules (as opposed to your private hopes).
In fact, at last year's Meeting C++ conference, I've heard from two compiler developers that they very gladly implement optimizations that make naively written multi-threaded programs misbehave, as long as language rules allow it, if even a minor performance improvement is seen in correctly written programs.
Lastly, even if you didn't care about portability, and your compiler was magically nice, the CPU you are using is very likely of a superscalar CISC type and will break down instructions into micro-ops, reorder and/or speculatively execute them, to an extent only limited by synchronizing primitives such as (on Intel) the LOCK prefix or memory fences, in order to maximize operations per second.
To make a long story short, the natural responsibilities of thread-safe programming are:
Your duty is to write code that has well-defined behavior under language rules (and in particular the language standard memory model).
Your compiler's duty is to generate machine code which has the same well-defined (observable) behavior under the target architecture's memory model.
Your CPU's duty is to execute this code so that the observed behavior is compatible with its own architecture's memory model.
If you want to do it your own way, it might just work in some cases, but understand that the warranty is void, and you will be solely responsible for any unwanted outcomes. :-)
PS: Correctly written example:
int main()
{
std::unique_ptr<std::vector<int>> vec;
std::atomic<int> ready{0}; // NOTE the use of the std::atomic template
std::thread t{[&]
{
while (!ready);
// use "vec" here
});
vec.reset(new std::vector<int>());
++ready;
t.join();
}
This is safe because:
The checks of ready cannot be optimized away according to language rules.
The ++ready happens-before the check that sees ready as not zero, and other operations cannot be reordered around these operations. This is because ++ready and the check are sequentially consistent, which is another term described in the C++ memory model and that forbids this specific reordering. Therefore the compiler must not reorder the instructions, and must also tell the CPU that it must not e.g. postpone the write to vec to after the increment of ready. Sequentially consistent is the strongest guarantee regarding atomics in the language standard. Lesser (and theoretically cheaper) guarantees are available e.g. via other methods of std::atomic<T>, but these are definitely for experts only, and may not be optimized much by the compiler developers, because they are rarely used.
On a single-core x86 machine, an add instruction will generally be atomic with respect to other code on the CPU1. An interrupt can't split a single instruction down the middle.
Out-of-order execution is required to preserve the illusion of instructions executing one at a time in order within a single core, so any instruction running on the same CPU will either happen completely before or completely after the add.
Modern x86 systems are multi-core, so the uniprocessor special case doesn't apply.
If one is targeting a small embedded PC and has no plans to move the code to anything else, the atomic nature of the "add" instruction could be exploited. On the other hand, platforms where operations are inherently atomic are becoming more and more scarce.
(This doesn't help you if you're writing in C++, though. Compilers don't have an option to require num++ to compile to a memory-destination add or xadd without a lock prefix. They could choose to load num into a register and store the increment result with a separate instruction, and will likely do that if you use the result.)
Footnote 1: The lock prefix existed even on original 8086 because I/O devices operate concurrently with the CPU; drivers on a single-core system need lock add to atomically increment a value in device memory if the device can also modify it, or with respect to DMA access.
Back in the day when x86 computers had one CPU, the use of a single instruction ensured that interrupts would not split the read/modify/write and if the memory would not be used as a DMA buffer too, it was atomic in fact (and C++ did not mention threads in the standard, so this wasn’t addressed).
When it was rare to have a dual processor (e.g. dual-socket Pentium Pro) on a customer desktop, I effectively used this to avoid the LOCK prefix on a single-core machine and improve performance.
Today, it would only help against multiple threads that were all set to the same CPU affinity, so the threads you are worried about would only come into play via time slice expiring and running the other thread on the same CPU (core). That is not realistic.
With modern x86/x64 processors, the single instruction is broken up into several micro ops and furthermore the memory reading and writing is buffered. So different threads running on different CPUs will not only see this as non-atomic but may see inconsistent results concerning what it reads from memory and what it assumes other threads have read to that point in time: you need to add memory fences to restore sane behavior.
No.
https://www.youtube.com/watch?v=31g0YE61PLQ
(That's just a link to the "No" scene from "The Office")
Do you agree that this would be a possible output for the program:
sample output:
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
If so, then the compiler is free to make that the only possible output for the program, in whichever way the compiler wants. ie a main() that just puts out 100s.
This is the "as-if" rule.
And regardless of output, you can think of thread synchronization the same way - if thread A does num++; num--; and thread B reads num repeatedly, then a possible valid interleaving is that thread B never reads between num++ and num--. Since that interleaving is valid, the compiler is free to make that the only possible interleaving. And just remove the incr/decr entirely.
There are some interesting implications here:
while (working())
progress++; // atomic, global
(ie imagine some other thread updates a progress bar UI based on progress)
Can the compiler turn this into:
int local = 0;
while (working())
local++;
progress += local;
probably that is valid. But probably not what the programmer was hoping for :-(
The committee is still working on this stuff. Currently it "works" because compilers don't optimize atomics much. But that is changing.
And even if progress was also volatile, this would still be valid:
int local = 0;
while (working())
local++;
while (local--)
progress++;
:-/
That a single compiler's output, on a specific CPU architecture, with optimizations disabled (since gcc doesn't even compile ++ to add when optimizing in a quick&dirty example), seems to imply incrementing this way is atomic doesn't mean this is standard-compliant (you would cause undefined behavior when trying to access num in a thread), and is wrong anyways, because add is not atomic in x86.
Note that atomics (using the lock instruction prefix) are relatively heavy on x86 (see this relevant answer), but still remarkably less than a mutex, which isn't very appropriate in this use-case.
Following results are taken from clang++ 3.8 when compiling with -Os.
Incrementing an int by reference, the "regular" way :
void inc(int& x)
{
++x;
}
This compiles into :
inc(int&):
incl (%rdi)
retq
Incrementing an int passed by reference, the atomic way :
#include <atomic>
void inc(std::atomic<int>& x)
{
++x;
}
This example, which is not much more complex than the regular way, just gets the lock prefix added to the incl instruction - but caution, as previously stated this is not cheap. Just because assembly looks short doesn't mean it's fast.
inc(std::atomic<int>&):
lock incl (%rdi)
retq
Yes, but...
Atomic is not what you meant to say. You're probably asking the wrong thing.
The increment is certainly atomic. Unless the storage is misaligned (and since you left alignment to the compiler, it is not), it is necessarily aligned within a single cache line. Short of special non-caching streaming instructions, each and every write goes through the cache. Complete cache lines are being atomically read and written, never anything different.
Smaller-than-cacheline data is, of course, also written atomically (since the surrounding cache line is).
Is it thread-safe?
This is a different question, and there are at least two good reasons to answer with a definite "No!".
First, there is the possibility that another core might have a copy of that cache line in L1 (L2 and upwards is usually shared, but L1 is normally per-core!), and concurrently modifies that value. Of course that happens atomically, too, but now you have two "correct" (correctly, atomically, modified) values -- which one is the truly correct one now?
The CPU will sort it out somehow, of course. But the result may not be what you expect.
Second, there is memory ordering, or worded differently happens-before guarantees. The most important thing about atomic instructions is not so much that they are atomic. It's ordering.
You have the possibility of enforcing a guarantee that everything that happens memory-wise is realized in some guaranteed, well-defined order where you have a "happened before" guarantee. This ordering may be as "relaxed" (read as: none at all) or as strict as you need.
For example, you can set a pointer to some block of data (say, the results of some calculation) and then atomically release the "data is ready" flag. Now, whoever acquires this flag will be led into thinking that the pointer is valid. And indeed, it will always be a valid pointer, never anything different. That's because the write to the pointer happened-before the atomic operation.
When your compiler uses only a single instruction for the increment and your machine is single-threaded, your code is safe. ^^
Try compiling the same code on a non-x86 machine, and you'll quickly see very different assembly results.
The reason num++ appears to be atomic is because on x86 machines, incrementing a 32-bit integer is, in fact, atomic (assuming no memory retrieval takes place). But this is neither guaranteed by the c++ standard, nor is it likely to be the case on a machine that doesn't use the x86 instruction set. So this code is not cross-platform safe from race conditions.
You also don't have a strong guarantee that this code is safe from Race Conditions even on an x86 architecture, because x86 doesn't set up loads and stores to memory unless specifically instructed to do so. So if multiple threads tried to update this variable simultaneously, they may end up incrementing cached (outdated) values
The reason, then, that we have std::atomic<int> and so on is so that when you're working with an architecture where the atomicity of basic computations is not guaranteed, you have a mechanism that will force the compiler to generate atomic code.