Is mfence for rdtsc necessary on x86_64 platform? - c++

unsigned int lo = 0;
unsigned int hi = 0;
__asm__ __volatile__ (
"mfence;rdtsc" : "=a"(lo), "=d"(hi) : : "memory"
);
Is the mfence in the above code necessary?
Based on my test, CPU reordering was not observed.
The test code fragment is included below.
inline uint64_t clock_cycles() {
unsigned int lo = 0;
unsigned int hi = 0;
__asm__ __volatile__ (
"rdtsc" : "=a"(lo), "=d"(hi)
);
return ((uint64_t)hi << 32) | lo;
}
unsigned t1 = clock_cycles();
unsigned t2 = clock_cycles();
assert(t2 > t1);

What you need to perform a sensible measurement with rdtsc is a serializing instruction.
As it is well known, a lot of people use cpuid before rdtsc.
rdtsc needs to be serialized from above and below (read: all instructions before it must be retired and it must be retired before the test code starts).
Unfortunately the second condition is often neglected because cpuid is a very bad choice for this task (it clobbers the output of rdtsc).
When looking for alternatives, people think that instructions that have "fence" in their names will do, but this is also untrue. Straight from Intel:
MFENCE does not serialize the instruction stream.
An instruction that is almost serializing and will do in any measurement where previous stores don't need to complete is lfence.
Simply put, lfence makes sure that no new instructions start before any prior instruction completes locally. See this answer of mine for a more detailed explanation on locality.
It also doesn't drain the Store Buffer like mfence does, and doesn't clobber the registers like cpuid does.
So lfence / rdtsc / lfence is a better crafted sequence of instructions than mfence / rdtsc, where mfence is pretty much useless unless you explicitly want the previous stores to be completed before the test begins/ends (but not before rdtsc is executed!).
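As a concrete illustration, here is a minimal sketch of that lfence / rdtsc / lfence sequence using compiler intrinsics instead of hand-written asm; the wrapper name is illustrative, and it assumes GCC or Clang with <x86intrin.h> (MSVC would use <intrin.h> instead):
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc, _mm_lfence (GCC/Clang; on MSVC include <intrin.h>)

static inline uint64_t rdtsc_fenced() {
    _mm_lfence();             // earlier instructions complete locally before rdtsc executes
    uint64_t t = __rdtsc();
    _mm_lfence();             // later instructions don't start until rdtsc has finished
    return t;
}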
If your test to detect reordering is assert(t2 > t1) then I believe you will test nothing.
Leaving out the return and the call that may or may not prevent the CPU from seeing the second rdtsc in time for a reorder, it is unlikely (though possible!) that the CPU will reorder two rdtsc even if one is right after the other.
Imagine we have a rdtsc2 that is exactly like rdtsc but writes ecx:ebx¹.
Executing
rdtsc
rdtsc2
it is highly likely that ecx:ebx > edx:eax, because the CPU has no reason to execute rdtsc2 before rdtsc.
Reordering doesn't mean random ordering; it means looking for other instructions to execute if the current one cannot be executed yet.
But rdtsc has no dependency on any previous instruction, so it's unlikely to be delayed when encountered by the OoO core.
However, peculiar internal micro-architectural details may invalidate my thesis, hence the word "unlikely" in my previous statement.
¹ We don't need this altered instruction: register renaming will do it, but in case you are not familiar with it, this will help.

mfence is there to force serialization in the CPU before rdtsc.
Usually you will find cpuid there (which is also a serializing instruction).
A quote from Intel's documentation about using rdtsc will make it clearer:
Starting with the Intel Pentium processor, most Intel CPUs support
out-of-order execution of the code. The purpose is to optimize the
penalties due to the different instruction latencies. Unfortunately
this feature does not guarantee that the temporal sequence of the
single compiled C instructions will respect the sequence of the
instruction themselves as written in the source C file. When we call
the RDTSC instruction, we pretend that that instruction will be
executed exactly at the beginning and at the end of code being
measured (i.e., we don’t want to measure compiled code executed
outside of the RDTSC calls or executed in between the calls
themselves).
The solution is to call a serializing instruction before
calling the RDTSC one. A serializing instruction is an instruction
that forces the CPU to complete every preceding instruction of the C
code before continuing the program execution. By doing so we guarantee
that only the code that is under measurement will be executed in
between the RDTSC calls and that no part of that code will be executed
outside the calls.
TL;DR version - without serializing instruction before rdtsc you have no idea when that instruction started to execute making measurements possibly incorrect.
HINT - use rdtscp when possible.
Based on my test, CPU reordering was not observed.
That is still no guarantee that it can't happen - which is why the original code had a "memory" clobber, to indicate a possible memory modification and prevent the compiler from reordering around the statement.
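Following the rdtscp hint above, a minimal sketch using the compiler intrinsic; the wrapper name is illustrative, and __rdtscp comes from <x86intrin.h> (GCC/Clang) or <intrin.h> (MSVC):
#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

inline uint64_t clock_cycles_serialized() {
    unsigned aux;              // receives IA32_TSC_AUX (identifies the core/socket)
    return __rdtscp(&aux);     // waits for all earlier instructions to retire first
}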

Related

Inline assembly array sum benchmark near-zero time for large arrays with optimization enabled, even though result is used

I have written two functions that get the sum of an array; the first one is written in C++ and the other is written with inline assembly (x86-64). I compared the performance of the two functions on my device.
If the -O flag is not enabled during compilation the function with inline assembly is almost 4-5x faster than the C++ version.
cpp time : 543070068 nanoseconds
cpp time : 547990578 nanoseconds
asm time : 185495494 nanoseconds
asm time : 188597476 nanoseconds
If the -O flag is set to -O1 they produce the same performance.
cpp time : 177510914 nanoseconds
cpp time : 178084988 nanoseconds
asm time : 179036546 nanoseconds
asm time : 181641378 nanoseconds
But if I try to set the -O flag to -O2 or -O3 I'm getting an unusual 2-3 digit nanoseconds reading for the function written with inline assembly, which is sketchy fast (at least for me; please bear with me, since I have no rock-solid experience with assembly programming, so I don't know how fast or slow it can be compared to a program written in C++).
cpp time : 177522894 nanoseconds
cpp time : 183816275 nanoseconds
asm time : 125 nanoseconds
asm time : 75 nanoseconds
My Questions
Why is this array sum function written with inline assembly so fast after enabling -O2 or -O3?
Is this a normal reading or there is something wrong with the timing/measurement of the performance?
Or maybe there is something wrong with my inline assembly function?
And if the inline assembly function for the array sum is correct and the performance reading is correct, why did the C++ compiler fail to optimize a simple array sum function for the C++ version and make it as fast as the inline assembly version?
I have also speculated that maybe the memory alignment and cache misses are improved during compilation to increase the performance but my knowledge on this one is still very very limited.
Apart from answering my questions, if you have something to add please feel free to do so, I hope somebody can explain, thanks!
[EDIT]
So I have removed the use of the macro and isolated running the two versions, and also tried adding the volatile keyword, a "memory" clobber and a "+&r" constraint for the output, and the performance is now the same as cpp_sum.
Though if I remove the volatile keyword and the "memory" clobber again, I'm still getting those 2-3 digit nanosecond readings.
code:
#include <iostream>
#include <random>
#include <chrono>
uint64_t sum_cpp(const uint64_t *numbers, size_t length) {
uint64_t sum = 0;
for(size_t i=0; i<length; ++i) {
sum += numbers[i];
}
return sum;
}
uint64_t sum_asm(const uint64_t *numbers, size_t length) {
uint64_t sum = 0;
asm volatile(
"xorq %%rax, %%rax\n\t"
"%=:\n\t"
"addq (%[numbers], %%rax, 8), %[sum]\n\t"
"incq %%rax\n\t"
"cmpq %%rax, %[length]\n\t"
"jne %=b"
: [sum]"+&r"(sum)
: [numbers]"r"(numbers), [length]"r"(length)
: "%rax", "memory", "cc"
);
return sum;
}
int main() {
std::mt19937_64 rand_engine(1);
std::uniform_int_distribution<uint64_t> random_number(0,5000);
size_t length = 99999999;
uint64_t *arr = new uint64_t[length];
for(size_t i=0; i<length; ++i) arr[i] = random_number(rand_engine);
uint64_t cpp_total = 0, asm_total = 0;
for(size_t i=0; i<5; ++i) {
auto start = std::chrono::high_resolution_clock::now();
#ifndef _INLINE_ASM
cpp_total += sum_cpp(arr, length);
#else
asm_total += sum_asm(arr,length);
#endif
auto end = std::chrono::high_resolution_clock::now();
auto dur = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start);
std::cout << "time : " << dur.count() << " nanoseconds\n";
}
#ifndef _INLINE_ASM
std::cout << "cpp sum = " << cpp_total << "\n";
#else
std::cout << "asm sum = " << asm_total << "\n";
#endif
delete [] arr;
return 0;
}
The compiler is hoisting the inline asm out of your repeat loop, and thus out of your timed region.
If your goal is performance, https://gcc.gnu.org/wiki/DontUseInlineAsm. The useful thing to spend your time learning first is SIMD intrinsics (and how they compile to asm) like _mm256_add_epi64 to add 4x uint64_t with a single AVX2 instruction. See https://stackoverflow.com/tags/sse/info (Compilers can auto-vectorize decently for a simple sum like this, which you could see the benefit from if you used a smaller array and put a repeat loop inside the timed region to get some cache hits.)
If you want to play around with asm to test what's actually fast on various CPUs, you can do that in a stand-alone static executable, or a function you call from C++. https://stackoverflow.com/tags/x86/info has some good performance links.
Re: benchmarking at -O0: yes, the compiler makes slow asm at the default -O0, aiming for consistent debugging and not trying at all to optimize. It's not much of a challenge to beat it when it has its hands tied behind its back.
Why your asm can get hoisted out of the timed regions
Without being asm volatile, your asm statement is a pure function of the inputs you've told the compiler about, which are a pointer, a length, and the initial value of sum=0. It does not include the pointed-to memory because you didn't use a dummy "m" input for that. (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)
Without a "memory" clobber, your asm statement isn't ordered wrt. function calls, so GCC is hoisting the asm statement out of the loop. See How does Google's `DoNotOptimize()` function enforce statement ordering for more details about that effect of the "memory" clobber.
Have a look at the compiler output on https://godbolt.org/z/KeEMfoMvo and see how it inlined into main. -O2 and higher enables -finline-functions, while -O1 only enables -finline-functions-called-once and this isn't static or inline so it has to emit a stand-alone definition in case of calls from other compilation units.
75ns is just the timing overhead of std::chrono functions around a nearly-empty timed region. It is actually running, just not inside the timed regions. You can see this if you single-step the asm of your whole program, or for example set a breakpoint on the asm statement. When doing asm-level debugging of the executable, you could help yourself find it by putting a funky instruction like mov $0xdeadbeef, %eax before xor %eax,%eax, something you can search for in the debugger's disassembly output (like GDB's layout asm or layout reg; see asm debugging tips at the bottom of https://stackoverflow.com/tags/x86/info). And yes, you do often want to look at what the compiler did when debugging inline asm, how it filled in your constraints, because stepping on its toes is a very real possibility.
Note that a "memory" clobber without asm volatile would still let GCC do Common Subexpression Elimination (CSE) between two invocations of the asm statement, if there was no function call in between. Like if you put a repeat loop inside a timed region to test performance on an array small enough to fit in some level of cache.
Sanity-checking your benchmark
Is this a normal reading
It's wild that you even have to ask that. 99999999 8-byte integers in 75ns would be a memory bandwidth of 99999999 * 8 B / 75 ns = 10666666 GB/s, while fast dual-channel DDR4 might hit 32 GB/s. (Or cache bandwidth if it was that large, but it's not, so your code bottlenecks on memory).
Or a 4GHz CPU would have had to run at 99999999 / (75*4) = 333333.33 add instructions per clock cycle, but the pipeline is only 4 to 6 uops wide on modern CPUs, with taken-branch throughputs of at best 1 for a loop branch. (https://uops.info/ and https://agner.org/optimize/)
Even with AVX-512, that's 2/clock 8x uint64_t additions per core, but compilers don't rewrite your inline asm; that would defeat its purpose compared to using plain C++ or intrinsics.
This is pretty obviously just std::chrono timing overhead from a near-empty timed region.
Asm code-review: correctness
As mentioned above, How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
You're also missing an & early-clobber declaration on the sum operand (i.e. "+r"(sum) instead of "+&r"(sum)), which would in theory let the compiler pick the same register for sum as for one of the inputs. But since sum is also an input, it could only do that if numbers or length were also 0.
It's kind of a toss-up whether it's better to xor-zero inside the asm for an "=&r" output, or better to use "+&r" and leave that zeroing to the compiler. For your loop counter, it makes sense because the compiler doesn't need to know about that at all. But by manually picking RAX for it (with a clobber), you're preventing the compiler from choosing to have your code produce sum in RAX, like it would want for a non-inline function. A dummy [idx] "=&r" (dummy) output operand will get the compiler to pick a register for you, of the appropriate width, e.g. intptr_t.
Asm code review: performance
As David Wohlferd said: xor %eax, %eax to zero RAX. Implicit zero-extension saves a REX prefix. (1 byte of code-size in the machine code. Smaller machine-code is generally better.)
It doesn't seem worth hand-writing asm if you're not going to do anything smarter than what GCC would on its own without -ftree-vectorize or with -mgeneral-regs-only or -mno-sse2 (even though it's baseline for x86-64, kernel code generally needs to avoid SIMD registers). But I guess it works as a learning exercise in how inline asm constraints work, and a starting point for measuring. And to get a benchmark working so you can then test better loops.
Typical x86-64 CPUs can do 2 loads per clock cycle (Intel since Sandybridge, AMD since K8), or 3/clock on Alder Lake. On modern CPUs with AVX/AVX2, each load can be 32 bytes wide (or 64 bytes with AVX-512) best case on L1d hits, or more like 1/clock with only L2 hits on recent Intel, which is a reasonable cache-blocking target.
But your loop can at best run 1x 8-byte load per clock cycle, because loop branches can run 1/clock, and add mem, %[sum] has a 1 cycle loop-carried dependency through sum.
That might max out DRAM bandwidth (with the help of HW prefetchers), e.g. 8 B / cycle * 4GHz = 32GB/s, which modern desktop/laptop Intel CPUs can manage for a single core (but not big Xeons). But with fast enough DRAM and/or a slower CPU relative to it, even DRAM can avoid being a bottleneck. But aiming for DRAM bandwidth is quite a low bar compared to L3 or L2 cache bandwidth.
So even if you want to keep using scalar code without movdqu / paddq (or better get to an alignment boundary for memory-source paddq, if you want to spend some code-size to optimize this loop), you could still unroll with two register accumulators for sum which you add at the end. This exposes some instruction-level parallelism, allowing two memory-source loads per clock cycle.
You can also avoid the cmp, which can reduce loop overhead. Fewer uops lets out-of-order exec see farther.
Get a pointer to the end of the array and index from -length up towards zero, like (arr+len)[idx] with for(idx=-len ; idx != 0 ; idx++). (This still walks the array forwards; actually looping backwards through an array is on some CPUs a little worse for some of the HW prefetchers, so it's generally not recommended for loops that are often memory bound.)
See also Micro fusion and addressing modes - an indexed addressing mode can only stay micro-fused in the back-end on Intel Haswell and later, and only for instructions like add that RMW their destination register.
So your best bet would be a loop with one pointer increment and 2 to 4 add instructions using it, and a cmp/jne at the bottom.
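To illustrate the unrolling idea in plain C++ rather than asm: this is a sketch only, a compiler may or may not emit exactly the pointer-increment / two-add / cmp-jne loop described above, and it assumes length is a multiple of 2.
#include <cstdint>
#include <cstddef>

uint64_t sum_two_accumulators(const uint64_t *numbers, size_t length) {
    uint64_t sum0 = 0, sum1 = 0;
    const uint64_t *p   = numbers;
    const uint64_t *end = numbers + length;    // assumes length is even
    for (; p != end; p += 2) {                 // one pointer increment per iteration
        sum0 += p[0];                          // two independent memory-source adds:
        sum1 += p[1];                          // each accumulator has its own 1-cycle dep chain
    }
    return sum0 + sum1;
}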

Can num++ be atomic for 'int num'?

In general, for int num, num++ (or ++num), as a read-modify-write operation, is not atomic. But I often see compilers, for example GCC, generate the following code for it (try here):
void f()
{
int num = 0;
num++;
}
f():
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
add DWORD PTR [rbp-4], 1
nop
pop rbp
ret
Since line 5, which corresponds to num++, is one instruction, can we conclude that num++ is atomic in this case?
And if so, does it mean that so-generated num++ can be used in concurrent (multi-threaded) scenarios without any danger of data races (i.e. we don't need to make it, for example, std::atomic<int> and impose the associated costs, since it's atomic anyway)?
UPDATE
Notice that this question is not whether increment is atomic (it's not and that was and is the opening line of the question). It's whether it can be in particular scenarios, i.e. whether one-instruction nature can in certain cases be exploited to avoid the overhead of the lock prefix. And, as the accepted answer mentions in the section about uniprocessor machines, as well as this answer, the conversation in its comments and others explain, it can (although not with C or C++).
This is absolutely what C++ defines as a Data Race that causes Undefined Behaviour, even if one compiler happened to produce code that did what you hoped on some target machine. You need to use std::atomic for reliable results, but you can use it with memory_order_relaxed if you don't care about reordering. See below for some example code and asm output using fetch_add.
But first, the assembly language part of the question:
Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?
Memory-destination instructions (other than pure stores) are read-modify-write operations that happen in multiple internal steps. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc., etc.
Memory operations from other CPUs can become globally visible between the load and store. I.e. two threads running add dword [num], 1 in a loop would step on each other's stores. (See @Margaret's answer for a nice diagram). After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.
"Atomic", from the Greek word meaning indivisible, means that no observer can see the operation as separate steps. Happening physically / electrically instantaneously for all bits simultaneously is just one way to achieve this for a load or store, but that's not even possible for an ALU operation. I went into a lot more detail about pure loads and pure stores in my answer to Atomicity on x86, while this answer focuses on read-modify-write.
The lock prefix can be applied to many read-modify-write (memory destination) instructions to make the entire operation atomic with respect to all possible observers in the system (other cores and DMA devices, not an oscilloscope hooked up to the CPU pins). That is why it exists. (See also this Q&A).
So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line pinned in Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.
Without the lock prefix, another core could take ownership of the cache line and modify it after our load but before our store, so that other store would become globally visible in between our load and store. Several other answers get this wrong, and claim that without lock you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches.
(If a locked instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don't misalign your atomic variables!)
Note that the lock prefix also turns an instruction into a full memory barrier (like MFENCE), stopping all run-time reordering and thus giving sequential consistency. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)
On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another question, and the comments under it, where I explain this in more detail.
Back to C++:
It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation:
;; Valid compiler output for num++
mov eax, [num]
inc eax
mov [num], eax
This is very likely if you use the value of num later: the compiler will keep it live in a register after the increment. So even if you check how num++ compiles on its own, changing the surrounding code can affect it.
(If the value isn't needed later, inc dword [num] is preferred; modern x86 CPUs will run a memory-destination RMW instruction at least as efficiently as using three separate instructions. Fun fact: gcc -O3 -m32 -mtune=i586 will actually emit this, because (Pentium) P5's superscalar pipeline didn't decode complex instructions to multiple simple micro-operations the way P6 and later microarchitectures do. See Agner Fog's instruction tables / microarchitecture guide for more info, and the x86 tag wiki for many useful links (including Intel's x86 ISA manuals, which are freely available as PDF)).
Don't confuse the target memory model (x86) with the C++ memory model
Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.
Classic example: Storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.
You might be expecting that this code will synchronize with other threads:
// int flag; is just a plain global, not std::atomic<int>.
flag--; // Pretend this is supposed to be some kind of locking attempt
modify_a_data_structure(&foo); // doesn't look at flag, and the compiler knows this. (Assume it can see the function def). Otherwise the usual don't-break-single-threaded-code rules come into play!
flag++;
But it won't. The compiler is free to move the flag++ across the function call (if it inlines the function or knows that it doesn't look at flag). Then it can optimize away the modification entirely, because flag isn't even volatile.
(And no, C++ volatile is not a useful substitute for std::atomic. std::atomic does make the compiler assume that values in memory can be modified asynchronously similar to volatile, but there's much more to it than that. (In practice there are similarities between volatile int to std::atomic with mo_relaxed for pure-load and pure-store operations, but not for RMWs). Also, volatile std::atomic<int> foo is not necessarily the same as std::atomic<int> foo, although current compilers don't optimize atomics (e.g. 2 back-to-back stores of the same value) so volatile atomic wouldn't change the code-gen.)
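For completeness, a minimal sketch of the buffer-then-flag pattern described above, done with std::atomic; the names are illustrative:
#include <atomic>

std::atomic<int> flag{0};
int payload;                                        // plain, non-atomic data

void producer() {
    payload = 42;                                   // fill the buffer first
    flag.store(1, std::memory_order_release);       // earlier writes can't sink below this store
}

int consumer() {
    while (flag.load(std::memory_order_acquire) == 0) { /* spin */ }
    return payload;                                 // guaranteed to observe 42
}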
Defining data races on non-atomic variables as Undefined Behaviour is what lets the compiler still hoist loads and sink stores out of loops, and many other optimizations for memory that multiple threads might have a reference to. (See this LLVM blog for more about how UB enables compiler optimizations.)
As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but it can be much more efficient on other architectures (like ARM). Even on x86, relaxed allows more compile-time reordering.
This is what GCC actually does on x86, for a few functions that operate on a std::atomic global variable.
See the source + assembly language code formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of assembly language code you get from atomics for those targets.
#include <atomic>
std::atomic<int> num;
void inc_relaxed() {
num.fetch_add(1, std::memory_order_relaxed);
}
void inc_seq_cst() {
num++;  // seq_cst is the default for atomic RMWs; on x86 it compiles the same as relaxed
}
int load_num() { return num; } // Even seq_cst loads are free on x86
void store_num(int val){ num = val; }
void store_num_release(int val){
num.store(val, std::memory_order_release);
}
void store_num_relaxed(int val){
num.store(val, std::memory_order_relaxed);
}
// Can the compiler collapse multiple atomic operations into one? No, it can't.
# g++ 6.2 -O3, targeting x86-64 System V calling convention. (First argument in edi/rdi)
inc_relaxed():
lock add DWORD PTR num[rip], 1 #### Even relaxed RMWs need a lock. There's no way to request just a single-instruction RMW with no lock, for synchronizing between a program and signal handler for example. :/ There is atomic_signal_fence for ordering, but nothing for RMW.
ret
inc_seq_cst():
lock add DWORD PTR num[rip], 1
ret
load_num():
mov eax, DWORD PTR num[rip]
ret
store_num(int):
mov DWORD PTR num[rip], edi
mfence ##### seq_cst stores need an mfence
ret
store_num_release(int):
mov DWORD PTR num[rip], edi
ret ##### Release and weaker doesn't.
store_num_relaxed(int):
mov DWORD PTR num[rip], edi
ret
Notice how MFENCE (a full barrier) is needed after a sequential-consistency store. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.
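A minimal litmus-test sketch of that StoreLoad reordering (run the two threads in a loop many times to actually catch the r1 == 0 && r2 == 0 outcome on real hardware):
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread a([]{ x.store(1, std::memory_order_relaxed);   // store may sit in the store buffer
                      r1 = y.load(std::memory_order_relaxed); });
    std::thread b([]{ y.store(1, std::memory_order_relaxed);
                      r2 = x.load(std::memory_order_relaxed); });
    a.join(); b.join();
    // r1 == 0 && r2 == 0 is a legal outcome here: each load can pass the other thread's store.
    // Making the stores seq_cst forces the compiler to emit mfence (or use xchg), which rules
    // that outcome out.
}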
Re: discussion in comments on @Richard Hodges' answer about compilers merging std::atomic num++; num-=2; operations into one num--; instruction:
A separate Q&A on this same subject: Why don't compilers merge redundant std::atomic writes?, where my answer restates a lot of what I wrote below.
Current compilers don't actually do this (yet), but not because they aren't allowed to. C++ WG21/P0062R1: When should compilers optimize atomics? discusses the expectation that many programmers have that compilers won't make "surprising" optimizations, and what the standard can do to give programmers control. N4455 discusses many examples of things that can be optimized, including this one. It points out that inlining and constant-propagation can introduce things like fetch_or(0) which may be able to turn into just a load() (but still has acquire and release semantics), even when the original source didn't have any obviously redundant atomic ops.
The real reasons compilers don't do it (yet) are: (1) nobody's written the complicated code that would allow the compiler to do that safely (without ever getting it wrong), and (2) it potentially violates the principle of least surprise. Lock-free code is hard enough to write correctly in the first place. So don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).
Getting back to num++; num-=2; compiling as if it were num--:
Compilers are allowed to do this, unless num is volatile std::atomic<int>. If a reordering is possible, the as-if rule allows the compiler to decide at compile time that it always happens that way. Nothing guarantees that an observer could see the intermediate values (the num++ result).
I.e. if the ordering where nothing becomes globally visible between these operations is compatible with the ordering requirements of the source
(according to the C++ rules for the abstract machine, not the target architecture), the compiler can emit a single lock dec dword [num] instead of lock inc dword [num] / lock sub dword [num], 2.
num++; num-- can't disappear, because it still has a Synchronizes With relationship with other threads that look at num, and it's both an acquire-load and a release-store which disallows reordering of other operations in this thread. For x86, this might be able to compile to an MFENCE, instead of a lock add dword [num], 0 (i.e. num += 0).
As discussed in P0062, more aggressive merging of non-adjacent atomic ops at compile time can be bad (e.g. a progress counter only gets updated once at the end instead of every iteration), but it can also help performance without downsides (e.g. skipping the atomic inc / dec of ref counts when a copy of a shared_ptr is created and destroyed, if the compiler can prove that another shared_ptr object exists for the entire lifespan of the temporary.)
Even num++; num-- merging could hurt fairness of a lock implementation when one thread unlocks and re-locks right away. If it's never actually released in the asm, even hardware arbitration mechanisms won't give another thread a chance to grab the lock at that point.
With current gcc6.2 and clang3.9, you still get separate locked operations even with memory_order_relaxed in the most obviously optimizable case. (Godbolt compiler explorer so you can see if the latest versions are different.)
void multiple_ops_relaxed(std::atomic<unsigned int>& num) {
num.fetch_add( 1, std::memory_order_relaxed);
num.fetch_add(-1, std::memory_order_relaxed);
num.fetch_add( 6, std::memory_order_relaxed);
num.fetch_add(-5, std::memory_order_relaxed);
//num.fetch_add(-1, std::memory_order_relaxed);
}
multiple_ops_relaxed(std::atomic<unsigned int>&):
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
ret
Without many complications, an instruction like add DWORD PTR [rbp-4], 1 is very CISC-style.
It performs three operations: load the operand from memory, increment it, store the operand back to memory.
During these operations the CPU acquires and releases the bus twice; in between, any other agent can acquire it too, and this violates atomicity.
AGENT 1          AGENT 2
load X
inc C
                 load X
                 inc C
store X
                 store X
X is incremented only once.
...and now let's enable optimisations:
f():
rep ret
OK, let's give it a chance:
void f(int& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
result:
f(int&):
mov DWORD PTR [rdi], 0
ret
Another observing thread (even ignoring cache synchronisation delays) has no opportunity to observe the individual changes.
Compare to:
#include <atomic>
void f(std::atomic<int>& num)
{
num = 0;
num++;
--num;
num += 6;
num -=5;
--num;
}
where the result is:
f(std::atomic<int>&):
mov DWORD PTR [rdi], 0
mfence
lock add DWORD PTR [rdi], 1
lock sub DWORD PTR [rdi], 1
lock add DWORD PTR [rdi], 6
lock sub DWORD PTR [rdi], 5
lock sub DWORD PTR [rdi], 1
ret
Now, each modification is:
observable in another thread, and
respectful of similar modifications happening in other threads.
Atomicity is not just at the instruction level; it involves the whole pipeline from processor, through the caches, to memory and back.
Further info
Regarding the effect of optimisations of updates of std::atomics.
The c++ standard has the 'as if' rule, by which it is permissible for the compiler to reorder code, and even rewrite code provided that the outcome has the exact same observable effects (including side-effects) as if it had simply executed your code.
The as-if rule is conservative, particularly involving atomics.
consider:
void incdec(int& num) {
++num;
--num;
}
Because there are no mutex locks, atomics or any other constructs that influence inter-thread sequencing, I would argue that the compiler is free to rewrite this function as a NOP, eg:
void incdec(int&) {
// nada
}
This is because in the c++ memory model, there is no possibility of another thread observing the result of the increment. It would of course be different if num was volatile (might influence hardware behaviour). But in this case, this function will be the only function modifying this memory (otherwise the program is ill-formed).
However, this is a different ball game:
void incdec(std::atomic<int>& num) {
++num;
--num;
}
num is an atomic. Changes to it must be observable to other threads that are watching. Changes those threads themselves make (such as setting the value to 100 in between the increment and decrement) will have very far-reaching effects on the eventual value of num.
Here is a demo:
#include <thread>
#include <atomic>
#include <iostream>
int main()
{
for (int iter = 0 ; iter < 20 ; ++iter)
{
std::atomic<int> num = { 0 };
std::thread t1([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
++num;
--num;
}
});
std::thread t2([&] {
for (int i = 0 ; i < 10000000 ; ++i)
{
num = 100;
}
});
t2.join();
t1.join();
std::cout << num << std::endl;
}
}
sample output:
99
99
99
99
99
100
99
99
100
100
100
100
99
99
100
99
99
100
100
99
The add instruction is not atomic. It references memory, and two processor cores may have different local caches of that memory.
The atomic variant of the add instruction is lock add (lock xadd additionally returns the original value in a register).
Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?
It is dangerous to draw conclusions based on "reverse engineering" generated assembly. For example, you seem to have compiled your code with optimization disabled, otherwise the compiler would have thrown away that variable or loaded 1 directly to it without invoking operator++. Because the generated assembly may change significantly, based on optimization flags, target CPU, etc., your conclusion is based on sand.
Also, your idea that one assembly instruction means an operation is atomic is wrong as well. This add will not be atomic on multi-CPU systems, even on the x86 architecture.
Even if your compiler always emitted this as an atomic operation, accessing num from any other thread concurrently would constitute a data race according to the C++11 and C++14 standards and the program would have undefined behavior.
But it is worse than that. First, as has been mentioned, the instruction generated by the compiler when incrementing a variable may depend on the optimization level. Secondly, the compiler may reorder other memory accesses around ++num if num is not atomic, e.g.
#include <memory>
#include <thread>
#include <vector>
int main()
{
std::unique_ptr<std::vector<int>> vec;
int ready = 0;
std::thread t{[&]
{
while (!ready);
// use "vec" here
}};
vec.reset(new std::vector<int>());
++ready;
t.join();
}
Even if we assume optimistically that ++ready is "atomic", and that the compiler generates the checking loop as needed (as I said, it's UB and therefore the compiler is free to remove it, replace it with an infinite loop, etc.), the compiler might still move the pointer assignment, or even worse the initialization of the vector to a point after the increment operation, causing chaos in the new thread. In practice, I would not be surprised at all if an optimizing compiler removed the ready variable and the checking loop completely, as this does not affect observable behavior under language rules (as opposed to your private hopes).
In fact, at last year's Meeting C++ conference, I heard from two compiler developers that they would very gladly implement optimizations that make naively written multi-threaded programs misbehave, as long as language rules allow it, if even a minor performance improvement is seen in correctly written programs.
Lastly, even if you didn't care about portability, and your compiler was magically nice, the CPU you are using is very likely of a superscalar CISC type and will break down instructions into micro-ops, reorder and/or speculatively execute them, to an extent only limited by synchronizing primitives such as (on Intel) the LOCK prefix or memory fences, in order to maximize operations per second.
To make a long story short, the natural responsibilities of thread-safe programming are:
Your duty is to write code that has well-defined behavior under language rules (and in particular the language standard memory model).
Your compiler's duty is to generate machine code which has the same well-defined (observable) behavior under the target architecture's memory model.
Your CPU's duty is to execute this code so that the observed behavior is compatible with its own architecture's memory model.
If you want to do it your own way, it might just work in some cases, but understand that the warranty is void, and you will be solely responsible for any unwanted outcomes. :-)
PS: Correctly written example:
#include <atomic>
#include <memory>
#include <thread>
#include <vector>
int main()
{
std::unique_ptr<std::vector<int>> vec;
std::atomic<int> ready{0}; // NOTE the use of the std::atomic template
std::thread t{[&]
{
while (!ready);
// use "vec" here
}};
vec.reset(new std::vector<int>());
++ready;
t.join();
}
This is safe because:
The checks of ready cannot be optimized away according to language rules.
The ++ready happens-before the check that sees ready as not zero, and other operations cannot be reordered around these operations. This is because ++ready and the check are sequentially consistent, another term described in the C++ memory model that forbids this specific reordering. Therefore the compiler must not reorder the instructions, and must also tell the CPU that it must not e.g. postpone the write to vec to after the increment of ready. Sequentially consistent is the strongest guarantee regarding atomics in the language standard. Lesser (and theoretically cheaper) guarantees are available e.g. via other methods of std::atomic<T>, but these are definitely for experts only, and may not be optimized much by the compiler developers, because they are rarely used.
On a single-core x86 machine, an add instruction will generally be atomic with respect to other code on the CPU¹. An interrupt can't split a single instruction down the middle.
Out-of-order execution is required to preserve the illusion of instructions executing one at a time in order within a single core, so any instruction running on the same CPU will either happen completely before or completely after the add.
Modern x86 systems are multi-core, so the uniprocessor special case doesn't apply.
If one is targeting a small embedded PC and has no plans to move the code to anything else, the atomic nature of the "add" instruction could be exploited. On the other hand, platforms where operations are inherently atomic are becoming more and more scarce.
(This doesn't help you if you're writing in C++, though. Compilers don't have an option to require num++ to compile to a memory-destination add or xadd without a lock prefix. They could choose to load num into a register and store the increment result with a separate instruction, and will likely do that if you use the result.)
Footnote 1: The lock prefix existed even on original 8086 because I/O devices operate concurrently with the CPU; drivers on a single-core system need lock add to atomically increment a value in device memory if the device can also modify it, or with respect to DMA access.
Back in the day when x86 computers had one CPU, the use of a single instruction ensured that interrupts would not split the read/modify/write and if the memory would not be used as a DMA buffer too, it was atomic in fact (and C++ did not mention threads in the standard, so this wasn’t addressed).
When it was rare to have a dual processor (e.g. dual-socket Pentium Pro) on a customer desktop, I effectively used this to avoid the LOCK prefix on a single-core machine and improve performance.
Today, it would only help against multiple threads that were all set to the same CPU affinity, so the threads you are worried about would only come into play via time slice expiring and running the other thread on the same CPU (core). That is not realistic.
With modern x86/x64 processors, the single instruction is broken up into several micro ops and furthermore the memory reading and writing is buffered. So different threads running on different CPUs will not only see this as non-atomic but may see inconsistent results concerning what it reads from memory and what it assumes other threads have read to that point in time: you need to add memory fences to restore sane behavior.
No.
https://www.youtube.com/watch?v=31g0YE61PLQ
(That's just a link to the "No" scene from "The Office")
Do you agree that this would be a possible output for the program:
sample output:
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
If so, then the compiler is free to make that the only possible output for the program, in whichever way the compiler wants. ie a main() that just puts out 100s.
This is the "as-if" rule.
And regardless of output, you can think of thread synchronization the same way - if thread A does num++; num--; and thread B reads num repeatedly, then a possible valid interleaving is that thread B never reads between num++ and num--. Since that interleaving is valid, the compiler is free to make that the only possible interleaving. And just remove the incr/decr entirely.
There are some interesting implications here:
while (working())
progress++; // atomic, global
(ie imagine some other thread updates a progress bar UI based on progress)
Can the compiler turn this into:
int local = 0;
while (working())
local++;
progress += local;
probably that is valid. But probably not what the programmer was hoping for :-(
The committee is still working on this stuff. Currently it "works" because compilers don't optimize atomics much. But that is changing.
And even if progress was also volatile, this would still be valid:
int local = 0;
while (working())
local++;
while (local--)
progress++;
:-/
That a single compiler's output, on a specific CPU architecture, with optimizations disabled (since gcc doesn't even compile ++ to add when optimizing in a quick&dirty example), seems to imply that incrementing this way is atomic doesn't mean this is standard-compliant (you would cause undefined behavior when accessing num from a thread), and it is wrong anyway, because add is not atomic in x86.
Note that atomics (using the lock instruction prefix) are relatively heavy on x86 (see this relevant answer), but still remarkably less than a mutex, which isn't very appropriate in this use-case.
The following results are taken from clang++ 3.8 when compiling with -Os.
Incrementing an int by reference, the "regular" way:
void inc(int& x)
{
++x;
}
This compiles into:
inc(int&):
incl (%rdi)
retq
Incrementing an int passed by reference, the atomic way:
#include <atomic>
void inc(std::atomic<int>& x)
{
++x;
}
This example, which is not much more complex than the regular way, just gets the lock prefix added to the incl instruction - but caution, as previously stated this is not cheap. Just because assembly looks short doesn't mean it's fast.
inc(std::atomic<int>&):
lock incl (%rdi)
retq
Yes, but...
Atomic is not what you meant to say. You're probably asking the wrong thing.
The increment is certainly atomic. Unless the storage is misaligned (and since you left alignment to the compiler, it is not), it is necessarily aligned within a single cache line. Short of special non-caching streaming instructions, each and every write goes through the cache. Complete cache lines are being atomically read and written, never anything different.
Smaller-than-cacheline data is, of course, also written atomically (since the surrounding cache line is).
Is it thread-safe?
This is a different question, and there are at least two good reasons to answer with a definite "No!".
First, there is the possibility that another core might have a copy of that cache line in L1 (L2 and upwards is usually shared, but L1 is normally per-core!), and concurrently modifies that value. Of course that happens atomically, too, but now you have two "correct" (correctly, atomically, modified) values -- which one is the truly correct one now?
The CPU will sort it out somehow, of course. But the result may not be what you expect.
Second, there is memory ordering, or worded differently happens-before guarantees. The most important thing about atomic instructions is not so much that they are atomic. It's ordering.
You have the possibility of enforcing a guarantee that everything that happens memory-wise is realized in some guaranteed, well-defined order where you have a "happened before" guarantee. This ordering may be as "relaxed" (read as: none at all) or as strict as you need.
For example, you can set a pointer to some block of data (say, the results of some calculation) and then atomically release the "data is ready" flag. Now, whoever acquires this flag will be led into thinking that the pointer is valid. And indeed, it will always be a valid pointer, never anything different. That's because the write to the pointer happened-before the atomic operation.
When your compiler uses only a single instruction for the increment and your machine is single-threaded, your code is safe. ^^
Try compiling the same code on a non-x86 machine, and you'll quickly see very different assembly results.
The reason num++ appears to be atomic is because on x86 machines, incrementing a 32-bit integer is, in fact, atomic (assuming no memory retrieval takes place). But this is neither guaranteed by the c++ standard, nor is it likely to be the case on a machine that doesn't use the x86 instruction set. So this code is not cross-platform safe from race conditions.
You also don't have a strong guarantee that this code is safe from race conditions even on an x86 architecture, because x86 doesn't synchronize loads and stores to memory unless specifically instructed to do so. So if multiple threads tried to update this variable simultaneously, they may end up incrementing cached (outdated) values.
The reason, then, that we have std::atomic<int> and so on is so that when you're working with an architecture where the atomicity of basic computations is not guaranteed, you have a mechanism that will force the compiler to generate atomic code.

Slow Instructions in Simple Loop on x86

I have a simple loop which I've written in C++ as I wanted to profile the performance of a multiply instruction on my CPU. I found some interesting nuances in the assembly code that was generated when I profiled it.
Here is the C++ program:
#include <cstdint>
#define TESTS 10000000
#define BUFSIZE 1000
int main() {
uint32_t buf_in1[BUFSIZE];
uint32_t buf_in2[BUFSIZE];
uint32_t volatile buf_out[BUFSIZE];
unsigned int i, j;
for (i = 0; i < BUFSIZE; i++) {
buf_in1[i] = i;
buf_in2[i] = i;
}
for (j = 0; j < TESTS; j++) {
for (i = 0; i < BUFSIZE; i++) {
buf_out[i] = buf_in1[i] * buf_in2[i];
}
}
return 0;
}
I compiled with the following flags (Optimization and Code Generation settings, shown as screenshots):
It's compiled in Visual Studio 2012 as a Win32 build, although I am running it on a 64-bit machine.
Note the volatile qualifier on buf_out. It's just to stop the compiler from optimising the loop away.
I ran this code through a profiler (AMD's CodeXL) and I see that the multiplication instruction doesn't take up the majority of the CPU time. About 30% is taken up by the imul instruction, but around 60% is also spent on two other instructions:
Note that the Timer column shows the number of timer ticks during which the profiler found the code on this instruction. The timer tick is 1ms so 2609 ticks is approximately 2609ms spent on that instruction.
The two instructions other than the multiply instruction which are taking up a lot of time are a mov instruction and the jb (jump if condition is met) instruction.
The mov instruction,
mov [esp+eax+00001f40h],ecx
is moving the result of the multiply (ecx) back into the buf_out buffer at index eax (the register holding i). This makes sense, but why does it take so much longer than the other mov instruction? I.e. this one:
mov ecx,[esp+eax+00000fa0h]
They both read from similar locations in memory, the arrays are 1000 uint32_t's long or 4000 bytes long. That's 4000*3 = 12kB. My L1 cache is 64kB so it should all easily fit in L1 as far as I can see...
Here are results showing my cache sizes etc. from Coreinfo:
As for the jump instruction:
jb $-1ah (0x903732)
I can't tell why it's taking up 33% of the program execution time either. My processor's line size is 64 bytes and the jump only jumps backwards 0x1A bytes, or 26 bytes. Could it be because this jump crosses a 64-byte boundary? (0x903740 is a 64-byte boundary)
So can anyone explain these behaviours?
Thanks.
As mentioned by Mystical, the timings you are looking at are not attributable one-to-one to the instructions they are shown against.
Modern processors run many instructions in parallel (the imul and the add of 4 to eax can both run in parallel; the address math involved in the mov also uses the ALU and can be computed before the imul completes).
The way most profilers compute their timing is by using timed interrupts and what you see are the instructions that happened to be the ones executed at the time of the interrupts.
To properly use a profiler, you want to run it against large programs and see where the program spends a lot of time. On a per-instruction basis, it does not have much value.
If you really want to do speed tests, you want to use the CPU timer before and after your loops and see how you can ameliorate it one way or another to get it to run faster.
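As a sketch of that suggestion (the function name is illustrative; __rdtsc comes from <intrin.h> on MSVC, <x86intrin.h> elsewhere), you could time a whole pass over the buffers instead of relying on sampled per-instruction ticks:
#include <cstdint>
#include <cstddef>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

uint64_t time_multiply_pass(const uint32_t *in1, const uint32_t *in2,
                            volatile uint32_t *out, size_t n) {
    uint64_t start = __rdtsc();
    for (size_t i = 0; i < n; i++)
        out[i] = in1[i] * in2[i];
    return __rdtsc() - start;     // elapsed reference cycles for one pass
}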
I wouldn't assume that it all fits in your L1, because the code you're profiling isn't the only thing using the CPU (unless you've booted your machine just to run that code; in practice the operating system is still running too).
Also note that there's a pattern there: the slowest operations all require main-memory access. Since this access time isn't controlled by the CPU, it's difficult to point out why it isn't faster. That would require hardware analysis.
Hope this helps.
Unfortunately, you have not given the amount of time needed for a single pass through your loop, but I assume that it's three CPU cycles. If that is true, the three instructions that happen to get time attributed to them are the three instructions which the processor is officially on when the clock ticks. The other three instructions are executed in parallel to the three officially time consuming instructions, hiding behind them.

How to get the CPU cycle count in x86_64 from C++?

I saw this post on SO which contains C code to get the latest CPU Cycle count:
CPU Cycle count based profiling in C/C++ Linux x86_64
Is there a way I can use this code in C++ (windows and linux solutions welcome)? Although written in C (and C being a subset of C++) I am not too certain if this code would work in a C++ project and if not, how to translate it?
I am using x86-64
EDIT2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for windows....?)
static inline uint64_t get_cycles()
{
uint64_t t;
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
EDIT3:
From above code I get the error:
"error C2400: inline assembler syntax error in 'opcode'; found 'data
type'"
Could someone please help?
As of GCC 4.5 and later, the __rdtsc() intrinsic is supported by both MSVC and GCC.
But the include that's needed is different:
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
Here's the original answer before GCC 4.5.
Pulled directly out of one of my projects:
#include <stdint.h>
// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc(){
return __rdtsc();
}
// Linux/GCC
#else
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
#endif
This GNU C Extended asm tells the compiler:
volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
"=a"(lo) and "=d"(hi) : the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
((uint64_t)hi << 32) | lo - zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get an actual shift + OR asm instructions, unless the high half optimizes away.
(editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)
https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info
Your inline asm is broken for x86-64. "=A" in 64-bit mode lets the compiler pick either RAX or RDX, not EDX:EAX. See this Q&A for more
You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like @Mysticial's.
(One minor advantage to asm is if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers could do that optimization for you with a uint32_t time_low = __rdtsc() intrinsic, but in practice they sometimes still waste instructions doing shift / OR.)
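A hedged sketch of that low-half trick with the intrinsic (the helper names are made up; unsigned 32-bit wraparound keeps the subtraction correct for intervals under 2^32 reference cycles):
#include <cstdint>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

inline uint32_t short_interval_start() { return (uint32_t)__rdtsc(); }   // keep only the low 32 bits
inline uint32_t short_interval_elapsed(uint32_t t0) {
    return (uint32_t)__rdtsc() - t0;   // correct even if the low half wrapped once
}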
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.
Intel's intrinsics guide says _rdtsc (with one underscore) is in <immintrin.h>, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>, so we're stuck with <intrin.h> (MSVC) vs. <x86intrin.h> (everything else, including recent ICC). For compat with MSVC, and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.
Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc() as returning (signed) __int64.
// valid C99 and C++
#include <stdint.h> // <cstdint> is preferred in C++, but stdint.h works.
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
uint64_t tsc = __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
return tsc;
}
// requires a Nehalem or newer CPU. Not Core2 or earlier. IDK when AMD added it.
inline
uint64_t readTSCp() {
unsigned dummy;
return __rdtscp(&dummy); // waits for earlier insns to retire, but allows later to start
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer, including a couple test callers.
These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc(). Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.
The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtsc in immintrin.h, but doesn't have an x86intrin.h at all. More recent ICC versions do have x86intrin.h, at least the way Godbolt installs them for Linux.
You might want to define them as signed long long, especially if you want to subtract them and convert to float. int64_t -> float/double is more efficient than uint64_t on x86 without AVX512. Also, small negative results could be possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.
BTW, clang also has a portable __builtin_readcyclecounter() which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs
For more about using lfence (or cpuid) to improve repeatability of rdtsc and control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see #HadiBrais' answer on clflush to invalidate cache line via C function and the comments for an example of the difference it makes.
See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuid to serialize.) It's always been defined as partially-serializing on Intel.
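As a concrete sketch of an lfence-fenced timed region, assuming lfence is sufficiently serializing on your CPU as discussed above; function_under_test is just a placeholder for the code you want to time:
#include <stdint.h>
#include <x86intrin.h>
static inline uint64_t time_region(void (*function_under_test)(void)) {
    _mm_lfence();                  // wait for earlier instructions to finish (locally)
    uint64_t start = __rdtsc();
    _mm_lfence();                  // keep the timed code from starting before rdtsc
    function_under_test();
    _mm_lfence();                  // wait for the timed code to complete (locally)
    uint64_t end = __rdtsc();
    _mm_lfence();
    return end - start;
}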
How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures, an Intel white-paper from 2010.
rdtsc counts reference cycles, not CPU core clock cycles
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock).
The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
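A rough sketch of such a warm-up (my own illustration, not part of the original answer); the 100M-reference-cycle budget is an arbitrary assumption, so tune it for your machine:
#include <stdint.h>
#include <x86intrin.h>
static inline void warm_up_cpu(void) {
    uint64_t start = __rdtsc();
    // Spin for roughly 100M reference cycles so the core can ramp up to max clock.
    while (__rdtsc() - start < 100000000ULL) {
        __asm__ volatile("" ::: "memory");   // keep the loop from being optimized away
    }
}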
Microbenchmarking is hard: see Idiomatic way of performance evaluation? for other pitfalls.
Instead of using the TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and use rdmsr in user-space; simpler options include tricks like perf stat for part of a program, if your timed region is long enough that you can attach perf stat -p PID.
You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)
Negative clock cycle measurements with back-to-back rdtsc? covers the history of RDTSC: originally CPUs didn't do power-saving, so the TSC was both real-time and core clocks. Then it evolved through various barely-useful steps into its current form of a useful low-overhead timesource decoupled from core clock cycles (constant_tsc), which doesn't stop when the clock halts (nonstop_tsc). Also some tips, e.g. don't take the mean time, take the median (there will be very high outliers).
std::chrono::clock, hardware clock and cycle count
Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
measuring code execution times in C using RDTSC instruction lists some gotchas, including SMI (system-management interrupts), which you can't avoid even in kernel mode with cli, and virtualization of rdtsc under a VM. And of course basic stuff like regular interrupts being possible, so repeat your timing many times and throw away outliers.
Determine TSC frequency on Linux. Programmatically querying the TSC frequency is hard and maybe not possible, especially in user-space, or may give a worse result than calibrating it. Calibrating it using another known time-source takes time. See that question for more about how hard it is to convert TSC to nanoseconds (and that it would be nice if you could ask the OS what the conversion ratio is, because the OS already did it at bootup).
If you're microbenchmarking with RDTSC for tuning purposes, your best bet is to just use ticks and skip even trying to convert to nanoseconds. Otherwise, use a high-resolution library time function like std::chrono or clock_gettime. See faster equivalent of gettimeofday for some discussion / comparison of timestamp functions, or reading a shared timestamp from memory to avoid rdtsc entirely if your precision requirement is low enough for a timer interrupt or thread to update it.
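For completeness, a minimal clock_gettime(CLOCK_MONOTONIC) sketch (Linux/POSIX; my own addition, not from the original answer) for when you want nanoseconds rather than raw ticks:
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>
static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);    // monotonic clock, unaffected by clock adjustments
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}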
See also Calculate system time using rdtsc about finding the crystal frequency and multiplier.
CPU TSC fetch operation especially in multicore-multi-processor environment says that Nehalem and newer have the TSC synced and locked together for all cores in a package (along with the invariant = constant and nonstop TSC feature). See #amdn's answer there for some good info about multi-socket sync.
(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see #amdn's answer on the linked question, and more details below.)
CPUID features relevant to the TSC
Using the names that Linux /proc/cpuinfo uses for the CPU features, and other aliases for the same feature that you'll also find.
tsc - the TSC exists and rdtsc is supported. Baseline for x86-64.
rdtscp - rdtscp is supported.
tsc_deadline_timer CPUID.01H:ECX.TSC_Deadline[bit 24] = 1 - local APIC can be programmed to fire an interrupt when the TSC reaches a value you put in IA32_TSC_DEADLINE. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.
constant_tsc: Support for the constant TSC feature is determined by checking the CPU family and model numbers. The TSC ticks at constant frequency regardless of changes in core clock speed. Without this, RDTSC does count core clock cycles.
nonstop_tsc: This feature is called the invariant TSC in the Intel SDM and is supported on processors with CPUID.80000007H:EDX[8] (a short detection sketch follows this list). The TSC keeps ticking even in deep sleep C-states. On all x86 processors, nonstop_tsc implies constant_tsc, but constant_tsc doesn't necessarily imply nonstop_tsc. There is no separate CPUID feature bit; on Intel and AMD the same invariant TSC CPUID bit implies both the constant_tsc and nonstop_tsc features. See Linux's x86/kernel/cpu/intel.c detection code; amd.c is similar.
Some (but not all) of the processors based on Saltwell/Silvermont/Airmont even keep the TSC ticking in ACPI S3 full-system sleep: nonstop_tsc_s3. This is called always-on TSC. (Although it seems the ones based on Airmont were never released.)
For more details on constant and invariant TSC, see: Can constant non-invariant tsc change frequency across cpu states?.
tsc_adjust: CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1) The IA32_TSC_ADJUST MSR is available, allowing OSes to set an offset that's added to the TSC when rdtsc or rdtscp reads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)
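Here is the detection sketch mentioned above (GNU C, using <cpuid.h>; an illustration of reading the CPUID bit, not the kernel's actual code). On Linux it's usually easier to just grep /proc/cpuinfo for constant_tsc / nonstop_tsc:
#include <cpuid.h>
#include <stdbool.h>
static inline bool has_invariant_tsc(void) {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return false;              // extended leaf not supported on this CPU
    return (edx >> 8) & 1;         // CPUID.80000007H:EDX[8] = invariant TSC
}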
constant_tsc and nonstop_tsc together make the TSC usable as a timesource for things like clock_gettime in user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, TSC as a timesource may still be usable
The comments in the Linux source code also indicate that constant_tsc / nonstop_tsc features (on Intel) implies "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"
The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets. Apparently platform vendors do in practice do that, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environment also agree that all sockets on a single motherboard should start out in sync.
On a multi-socket shared memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel by default performs boot-time and run-time checks to make sure that the TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource' would tell you whether the kernel is using the TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system. The kernel parameter tsc=reliable can be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.
There are cases where cross-socket TSCs may NOT be in sync: (1) hotplugging a CPU, (2) when the sockets are spread out across different boards connected by extended node controllers, (3) the TSC may not be resynced after waking up from a C-state in which the TSC is powered down on some processors, and (4) different sockets have different CPU models installed.
An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscp produces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)
If you're using rdtsc directly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).
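A hedged sketch of using rdtscp's extra output to detect migration (the names are mine; the aux value is whatever the OS put in IA32_TSC_AUX, typically a core/node ID on Linux):
#include <stdint.h>
#include <x86intrin.h>
static inline uint64_t time_with_migration_check(void (*work)(void), int *migrated) {
    unsigned aux_start, aux_end;
    uint64_t start = __rdtscp(&aux_start);
    work();
    uint64_t end = __rdtscp(&aux_end);
    *migrated = (aux_start != aux_end);   // if set, start/end may come from different TSCs; discard the sample
    return end - start;
}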
How efficient is the asm from using the intrinsic?
It's about as good as you'd get from #Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.
A non-inline version of the readTSC function itself compiles with MSVC for x86-64 like this:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.
In a test caller that uses it twice and subtracts to time an interval:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
However, we can maybe get the best of both worlds with a modified version of #Mysticial's code:
// More efficient than __rdtsc() in some case, but maybe worse in others
uint64_t rdtsc(){
// long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so #ifdef would be better if we care about this trick there.
unsigned long lo,hi; // let the compiler know that zero-extension to 64 bits isn't required
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) + lo;
// + allows LEA or ADD instead of OR
}
On Godbolt, this does sometimes give better asm than __rdtsc() for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo). Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)
But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc() function itself with an actual add/adc with zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with | instead of +, but definitely prefer the __rdtsc() intrinsic if you care about 32-bit code-gen from gcc).
VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.
In this case, that's probably just as well -- rdtsc has (at least) two major problems when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).
Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result is entirely possible.
Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.
If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:
xor eax, eax
cpuid
xor eax, eax
cpuid
xor eax, eax
cpuid
rdtsc
; save eax, edx
; code you're going to time goes here
xor eax, eax
cpuid
rdtsc
I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).
Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.
Along with that, you want to use whatever means your OS supplies to force this all to run on one processor/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution speed.
Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you need to do before you can get results from rdtsc that will actually mean anything.
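To illustrate the repeat-and-discard-outliers advice, here's a small sketch (my own, with placeholder names) that times a region many times and reports the median, which ignores runs inflated by interrupts or task switches:
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}
// samples must have room for 'runs' entries.
static uint64_t median_cycles(void (*work)(void), int runs, uint64_t *samples) {
    for (int i = 0; i < runs; i++) {
        uint64_t start = __rdtsc();
        work();
        samples[i] = __rdtsc() - start;
    }
    qsort(samples, (size_t)runs, sizeof samples[0], cmp_u64);
    return samples[runs / 2];    // the median is robust against a few huge outliers
}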
For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:
unsigned __int64 __rdtsc(void);
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross architecture wrapper for performance events.
This answer is similar to Quick way to count number of instructions executed in a C program, but with PERF_COUNT_HW_CPU_CYCLES instead of PERF_COUNT_HW_INSTRUCTIONS. This answer focuses on PERF_COUNT_HW_CPU_CYCLES specifics; see that other answer for more generic information.
Here is an example based on the one provided at the end of the man page.
perf_event_open.c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
}
The results seem reasonable. E.g., if I print cycles and then recompile to count instructions instead, we get about 1 cycle per iteration (2 instructions completed in a single cycle), possibly due to effects such as superscalar execution, with slightly different results for each run presumably due to random memory access latencies.
You might also be interested in PERF_COUNT_HW_REF_CPU_CYCLES, which as the manpage documents:
Total cycles; not affected by CPU frequency scaling.
so this will give something closer to the real wall time if your frequency scaling is on. These were 2-3x larger than PERF_COUNT_HW_INSTRUCTIONS in my quick experiments, presumably because my non-stressed machine is frequency scaled right now.
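For reference, switching the example above to that event is a one-field change (everything else in perf_event_open.c stays the same):
pe.type   = PERF_TYPE_HARDWARE;
pe.config = PERF_COUNT_HW_REF_CPU_CYCLES;   /* not affected by CPU frequency scaling */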

What C++ code compiles down to the x86 REP instruction?

I'm copying elements from one array to another in C++. I found the rep movs instruction in x86 that seems to copy an array at ESI to an array at EDI of size ECX. However, neither the for nor while loops I tried compiled to a rep movs instruction in VS 2008 (on an Intel Xeon x64 processor). How can I write code that will get compiled to this instruction?
Honestly, you shouldn't. REP is sort of an obsolete holdover in the instruction set, and actually pretty slow since it has to call a microcoded subroutine inside the CPU, which has a ROM lookup latency and is nonpipelined as well.
In almost every implementation, you will find that the memcpy() compiler intrinsic is both easier to use and faster.
Under MSVC there are the __movsxxx & __stosxxx intrinsics that will generate a REP prefixed instruction.
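A minimal sketch of those intrinsics (MSVC only, from <intrin.h>; the function names here are my own). Note that the counts are in elements, not bytes:
#include <intrin.h>
#include <stddef.h>
void copy_dwords(unsigned long *dst, const unsigned long *src, size_t count) {
    __movsd(dst, src, count);    // emits rep movsd
}
void fill_dwords(unsigned long *dst, unsigned long value, size_t count) {
    __stosd(dst, value, count);  // emits rep stosd
}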
There is also a 'hack' to force an intrinsic memset (aka REP STOS) under VC9+, since the intrinsic no longer exists due to the SSE2 branching in the CRT. This is better than __stosxxx because the compiler can optimize it for constants and order it correctly.
#define memset(mem,fill,size) memset((DWORD*)mem,((fill) << 24|(fill) << 16|(fill) << 8|(fill)),size)
__forceinline void memset(DWORD* pStart, unsigned long dwFill, size_t nSize)
{
//credits to Nepharius for finding this
DWORD* pLast = pStart + (nSize >> 2);
while(pStart < pLast)
*pStart++ = dwFill;
if((nSize &= 3) == 0)
return;
if(nSize == 3)
{
(((WORD*)pStart))[0] = WORD(dwFill);
(((BYTE*)pStart))[2] = BYTE(dwFill);
}
else if(nSize == 2)
(((WORD*)pStart))[0] = WORD(dwFill);
else
(((BYTE*)pStart))[0] = BYTE(dwFill);
}
Of course REP isn't always the best thing to use; IMO you're way better off using memcpy. It'll branch to either SSE2 or REP MOVS based on your system (under MSVC), unless you feel like writing custom assembly for 'hot' areas...
If you need exactly that instruction - use built-in assembler and write that instruction manually. You can't rely on the compiler to produce any specific machine code - even if it emits it in one compilation it can decide to emit some other equivalent during next compilation.
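If you do go that route, here's a hedged sketch in GNU C inline-asm syntax (MSVC's 64-bit compiler has no inline asm, as noted elsewhere in this thread) that forces a rep movsb copy:
#include <stddef.h>
static inline void copy_rep_movsb(void *dst, const void *src, size_t n) {
    // rep movsb copies RCX bytes from [RSI] to [RDI]; the constraints make the
    // compiler load those registers and treat all memory as read/written.
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
}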
REP and friends were nice once upon a time, when the x86 CPU was a single-pipeline industrial CISC processor.
But that has changed. Nowadays, when the processor encounters any instruction, the first thing it does is translate it into an easier format (VLIW-like micro-ops) and schedule it for future execution (this is part of out-of-order execution and of scheduling between different logical CPU cores; it can also be used to simplify write-after-write sequences into single writes, etc.). This machinery works well for instructions that translate into a few VLIW-like opcodes, but not for machine code that translates into loops. Loop-translated machine code will probably cause the execution pipeline to stall.
Rather than spending hundreds of thousands of transistors on CPU circuitry for handling the looping portions of the micro-ops in the execution pipeline, they just handle it in some sort of crappy legacy mode that haltingly stalls the pipeline, and ask modern programmers to write their own damn loops!
Therefore REP is seldom used when machines write code. If you encounter it in a binary executable, it was probably written by a human assembly muppet who didn't know better, or by a cracker who really needed the few bytes it saves over an actual loop.
(However, take everything I just wrote with a grain of salt. Maybe this is not true anymore. I am not 100% up to date with the internals of x86 CPUs these days; I got into other hobbies...)
I use the rep* prefix variants with the cmps*, movs*, scas* and stos* instruction variants to generate inline code which minimizes the code size, avoids unnecessary calls/jumps and thereby keeps down the work done by the caches. The alternative is to set up parameters and call a memset or memcpy somewhere else, which may be faster overall if I want to copy a hundred bytes or more, but if it's just a matter of 10-20 bytes, using rep is faster (or at least it was the last time I measured).
Since my compiler allows specification and use of inline assembly functions and includes their register usage/modification in the optimization activities it is possible for me to use them when the circumstances are right.
On a historic note - not having any insight into the manufacturer's strategies - there was a time when the "rep movs*" (etc) instructions were very slow. I think it was around the time of the Pentium/Pentium MMX. A colleague of mine (who had more insight than I) said that the manufacturers had decreased the chip area (<=> fewer transistors/more microcode) allocated to the rep handling and used it to make other, more used instructions faster.
In the fifteen years or so since, rep has become, relatively speaking, faster again, which would suggest more transistors/less microcode.