Slow Instructions in Simple Loop on x86

Slow Instructions in Simple Loop on x86 - c++

I have a simple loop which I've written in C++ as I wanted to profile the performance of a multiply instruction on my CPU. I found some interesting nuances in the assembly code that was generated when I profiled it.
Here is the C++ program:
#define TESTS 10000000
#define BUFSIZE 1000
uint32_t buf_in1[BUFSIZE];
uint32_t buf_in2[BUFSIZE];
uint32_t volatile buf_out[BUFSIZE];
unsigned int i, j;
for (i = 0; i < BUFSIZE; i++) {
buf_in1[i] = i;
buf_in2[i] = i;
}
for (j = 0; j < TESTS; j++) {
for (i = 0; i < BUFSIZE; i++) {
buf_out[i] = buf_in1[i] * buf_in2[i];
}
}
I compiled with the following flags:
Optimization:
Code Generation:
It's compiled in visual studio 2012 under Win32 although I am running it on a 64 bit machine.
Note the volatile qualifier on buf_out. It's just to stop the compiler from optimising the loop away.
I ran this code through a profiler (AMD's CodeXL) and I see that the multiplication instruction doesn't take up the majority of the CPU time. About 30% is taken up by the imul instruction, but around 60% is also spent on two other instructions:
Note that the Timer column shows the number of timer ticks during which the profiler found the code on this instruction. The timer tick is 1ms so 2609 ticks is approximately 2609ms spent on that instruction.
The two instructions other than the multiply instruction which are taking up a lot of time are a mov instruction and the jb (jump if condition is met) instruction.
The mov instruction,
mov [esp+eax+00001f40h],ecx
is moving the result of the multiply (ecx) back into the buffer buf_out buffer at eax (which is the register representing i). This makes sense, but why does it take so much longer to do this than the other mov instruction? Ie this one:
mov ecx,[esp+eax+00000fa0h]
They both read from similar locations in memory, the arrays are 1000 uint32_t's long or 4000 bytes long. That's 4000*3 = 12kB. My L1 cache is 64kB so it should all easily fit in L1 as far as I can see...
Here are results showing my cache sizes etc. from Coreinfo:
As for the jump instruction:
jb $-1ah (0x903732)
I can't tell why it's taking up 33% of the program execution time either. My processor line size is 64 bytes and the jump only jumps backwards 0x1A bytes or 26 bytes. Could it be because this jump crosses a 64-byte boundry? (0x903740 is a 64 byte boundary)
So can anyone explain these behaviours?
Thanks.

As mentioned by Mystical, the timings you are looking at are not one to one the responsibility of the instructions it is shown against.
Modern processors run many instructions in parallel (the imul and the add 4 to eax can both run in parallel, also the math involved in the mov addressing uses the ALU too and can be computed before the imul completes).
The way most profilers compute their timing is by using timed interrupts and what you see are the instructions that happened to be the ones executed at the time of the interrupts.
To properly use a profiler, you want to run against large programs and see whether the program spends a lot of time. On a per instruction basis, it does not have much value.
If you really want to do speed tests, you want to use the CPU timer before and after your loops and see how you can ameliorate it one way or another to get it to run faster.

I wouldn't assume that it all fits in your L1, because the code you're debugging isn't the only thing using the CPU (unless you've booted your machine to run that code, which in fact will be your operating system).
Also note that there's a pattern there: The slowest operations all are requiring main memory access. Since this access time isn't controlled by the CPU, it's difficult to point out why isn't it faster. That will require hardware analysis.
Hope this helps.

Unfortunately, you have not given the amount of time needed for a single pass through your loop, but I assume that it's three CPU cycles. If that is true, the three instructions that happen to get time attributed to them are the three instructions which the processor is officially on when the clock ticks. The other three instructions are executed in parallel to the three officially time consuming instructions, hiding behind them.

Related

Inline assembly array sum benchmark near-zero time for large arrays with optimization enabled, even though result is used

I have written two functions that gets the sum of an array, the first one is written in C++ and the other is written with inline assembly (x86-64), I compared the performance of the two functions on my device.
If the -O flag is not enabled during compilation the function with inline assembly is almost 4-5x faster than the C++ version.
cpp time : 543070068 nanoseconds
cpp time : 547990578 nanoseconds
asm time : 185495494 nanoseconds
asm time : 188597476 nanoseconds
If the -O flag is set to -O1 they produce the same performance.
cpp time : 177510914 nanoseconds
cpp time : 178084988 nanoseconds
asm time : 179036546 nanoseconds
asm time : 181641378 nanoseconds
But if I try to set the -O flag to -O2 or -O3 I'm getting an unusual 2-3 digit nanoseconds performance for the function written with inline assembly which is sketchy fast (at least for me, please bear with me since I have no rock solid experience with assembly programming so I don't know how fast or how slow it can be compared to a program written in C++.
)
cpp time : 177522894 nanoseconds
cpp time : 183816275 nanoseconds
asm time : 125 nanoseconds
asm time : 75 nanoseconds
My Questions
Why is this array sum function written with inline assembly so fast after enabling -O2 or -O3?
Is this a normal reading or there is something wrong with the timing/measurement of the performance?
Or maybe there is something wrong with my inline assembly function?
And if the inline assembly function for the array sum is correct and the performance reading is correct, why does the C++ compiler failed to optimize a simple array sum function for the C++ version and make it as fast as the inline assembly version?
I have also speculated that maybe the memory alignment and cache misses are improved during compilation to increase the performance but my knowledge on this one is still very very limited.
Apart from answering my questions, if you have something to add please feel free to do so, I hope somebody can explain, thanks!
[EDIT]
So I have removed the use of macro and isolated running the two version and also tried to add volatile keyword, a "memory" clobber and "+&r" constraint for the output and the performance was now the same with the cpp_sum.
Though if I remove back the volatile keyword and "memory" clobber it I'm still getting those 2-3 digit nanoseconds performance.
code:
#include <iostream>
#include <random>
#include <chrono>
uint64_t sum_cpp(const uint64_t *numbers, size_t length) {
uint64_t sum = 0;
for(size_t i=0; i<length; ++i) {
sum += numbers[i];
}
return sum;
}
uint64_t sum_asm(const uint64_t *numbers, size_t length) {
uint64_t sum = 0;
asm volatile(
"xorq %%rax, %%rax\n\t"
"%=:\n\t"
"addq (%[numbers], %%rax, 8), %[sum]\n\t"
"incq %%rax\n\t"
"cmpq %%rax, %[length]\n\t"
"jne %=b"
: [sum]"+&r"(sum)
: [numbers]"r"(numbers), [length]"r"(length)
: "%rax", "memory", "cc"
);
return sum;
}
int main() {
std::mt19937_64 rand_engine(1);
std::uniform_int_distribution<uint64_t> random_number(0,5000);
size_t length = 99999999;
uint64_t *arr = new uint64_t[length];
for(size_t i=1; i<length; ++i) arr[i] = random_number(rand_engine);
uint64_t cpp_total = 0, asm_total = 0;
for(size_t i=0; i<5; ++i) {
auto start = std::chrono::high_resolution_clock::now();
#ifndef _INLINE_ASM
cpp_total += sum_cpp(arr, length);
#else
asm_total += sum_asm(arr,length);
#endif
auto end = std::chrono::high_resolution_clock::now();
auto dur = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start);
std::cout << "time : " << dur.count() << " nanoseconds\n";
}
#ifndef _INLINE_ASM
std::cout << "cpp sum = " << cpp_total << "\n";
#else
std::cout << "asm sum = " << asm_total << "\n";
#endif
delete [] arr;
return 0;
}

The compiler is hoisting the inline asm out of your repeat loop, and thus out of your timed region.
If your goal is performance, https://gcc.gnu.org/wiki/DontUseInlineAsm. The useful thing to spend your time learning first is SIMD intrinsics (and how they compile to asm) like _mm256_add_epi64 to add 4x uint64_t with a single AVX2 instruction. See https://stackoverflow.com/tags/sse/info (Compilers can auto-vectorize decently for a simple sum like this, which you could see the benefit from if you used a smaller array and put a repeat loop inside the timed region to get some cache hits.)
If you want to play around with asm to test what's actually fast on various CPUs, you can do that in a stand-alone static executable, or a function you call from C++. https://stackoverflow.com/tags/x86/info has some good performance links.
Re: benchmarking at -O0, yes the compiler makes slow asm with the default -O0 of consistent debugging and not trying at all to optimize. It's not much of a challenge to beat it when it has its hands tied behind its back.
Why your asm can get hoisted out of the timed regions
Without being asm volatile, your asm statement is a pure function of the inputs you've told the compiler about, which are a pointer, a length, and the initial value of sum=0. It does not include the pointed-to memory because you didn't use a dummy "m" input for that. (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)
Without a "memory" clobber, your asm statement isn't ordered wrt. function calls, so GCC is hoisting the asm statement out of the loop. See How does Google's `DoNotOptimize()` function enforce statement ordering for more details about that effect of the "memory" clobber.
Have a look at the compiler output on https://godbolt.org/z/KeEMfoMvo and see how it inlined into main. -O2 and higher enables -finline-functions, while -O1 only enables -finline-functions-called-once and this isn't static or inline so it has to emit a stand-alone definition in case of calls from other compilation units.
75ns is just the timing overhead of std::chrono functions around a nearly-empty timed region. It is actually running, just not inside the timed regions. You can see this if you single-step the asm of your whole program, or for example set a breakpoint on the asm statement. When doing asm-level debugging of the executable, you could help yourself find it by putting a funky instruction like mov $0xdeadbeef, %eax before xor %eax,%eax, something you can search for in the debugger's disassembly output (like GDB's layout asm or layout reg; see asm debugging tips at the bottom of https://stackoverflow.com/tags/x86/info). And yes, you do often want to look at what the compiler did when debugging inline asm, how it filled in your constraints, because stepping on its toes is a very real possibility.
Note that a "memory" clobber without asm volatile would still let GCC do Common Subexpression Elimination (CSE) between two invocations of the asm statement, if there was no function call in between. Like if you put a repeat loop inside a timed region to test performance on an array small enough to fit in some level of cache.
Sanity-checking your benchmark
Is this a normal reading
It's wild that you even have to ask that. 99999999 8-byte integers in 75ns would be a memory bandwidth of 99999999 * 8 B / 75 ns = 10666666 GB/s, while fast dual-channel DDR4 might hit 32 GB/s. (Or cache bandwidth if it was that large, but it's not, so your code bottlenecks on memory).
Or a 4GHz CPU would have had to run at 99999999 / (75*4) = 333333.33 add instructions per clock cycle, but the pipeline is only 4 to 6 uops wide on modern CPUs, with taken-branch throughputs of at best 1 for a loop branch. (https://uops.info/ and https://agner.org/optimize/)
Even with AVX-512, that's 2/clock 8x uint64_t additions per core, but compilers don't rewrite your inline asm; that would defeat its purpose compared to using plain C++ or intrinsics.
This is pretty obviously just std::chrono timing overhead from a near-empty timed region.
Asm code-review: correctness
As mentioned above, How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
You're also missing an & early clobber declaration in "+&r"(sum) which would in theory let it pick the same register for sum as for one of the inputs. But since sum is also an input, it could only do that if numbers or length were also 0.
It's kind of a toss-up whether it's better to xor-zero inside the asm for an "=&r" output, or better to use "+&r" and leave that zeroing to the compiler. For your loop counter, it makes sense because the compiler doesn't need to know about that at all. But by manually picking RAX for it (with a clobber), you're preventing the compiler from choosing to have your code produce sum in RAX, like it would want for a non-inline function. A dummy [idx] "=&r" (dummy) output operand will get the compiler to pick a register for you, of the appropriate width, e.g. intptr_t.
Asm code review: performance
As David Wohlferd said: xor %eax, %eax to zero RAX. Implicit zero-extension saves a REX prefix. (1 byte of code-size in the machine code. Smaller machine-code is generally better.)
It doesn't seem worth hand-writing asm if you're not going to do anything smarter than what GCC would on its own without -ftree-vectorize or with -mgeneral-regs-only or -mno-sse2 (even though it's baseline for x86-64, kernel code generally needs to avoid SIMD registers). But I guess it works as a learning exercise in how inline asm constraints work, and a starting point for measuring. And to get a benchmark working so you can then test better loops.
Typical x86-64 CPUs can do 2 loads per clock cycle (Intel since Sandybridge, AMD since K8) Or 3/clock on Alder Lake. On modern CPUs with AVX/AVX2, each load can be 32 bytes wide (or 64 bytes with AVX-512) best case on L1d hits. Or more like 1/clock with only L2 hits on recent Intel, which is a reasonable cache-blocking target.
But your loop can at best run 1x 8-byte load per clock cycle, because loop branches can run 1/clock, and add mem, %[sum] has a 1 cycle loop-carried dependency through sum.
That might max out DRAM bandwidth (with the help of HW prefetchers), e.g. 8 B / cycle * 4GHz = 32GB/s, which modern desktop/laptop Intel CPUs can manage for a single core (but not big Xeons). But with fast enough DRAM and/or a slower CPU relative to it, even DRAM can avoid being a bottleneck. But aiming for DRAM bandwidth is quite a low bar compared to L3 or L2 cache bandwidth.
So even if you want to keep using scalar code without movdqu / paddq (or better get to an alignment boundary for memory-source paddq, if you want to spend some code-size to optimize this loop), you could still unroll with two register accumulators for sum which you add at the end. This exposes some instruction-level parallelism, allowing two memory-source loads per clock cycle.
You can also avoid the cmp, which can reduce loop overhead. Fewer uops lets out-of-order exec see farther.
Get a pointer to the end of the array and index from -length up towards zero. Like (arr+len)[idx] with for(idx=-len ; idx != 0 ; idx++). Looping backwards through the array is on some CPUs a little worse for some of the HW prefetchers, so generally not recommended for loops that are often memory bound.
See also Micro fusion and addressing modes - an indexed addressing mode can only stay micro-fused in the back-end on Intel Haswell and later, and only for instructions like add that RMW their destination register.
So your best bet would be a loop with one pointer increment and 2 to 4 add instructions using it, and a cmp/jne at the bottom.

What is faster in C++: mod (%) or another counter?

At the risk of this being a duplicate, maybe I just can't find a similar post right now:
I am writing in C++ (C++20 to be specific). I have a loop with a counter that counts up every turn. Let's call it counter. And if this counter reaches a page-limit (let's call it page_limit), the program should continue on the next page. So it looks something like this:
const size_t page_limit = 4942;
size_t counter = 0;
while (counter < foo) {
if (counter % page_limit == 0) {
// start new page
}
// some other code
counter += 1;
}
Now I am wondering since the counter goes pretty high: would the program run faster, if I wouldn't have the program calculate the modulo counter % page_limit every time, but instead make another counter? It could look something like this:
const size_t page_limit = 4942;
size_t counter = 0;
size_t page_counter = 4942;
while (counter < foo) {
if (page_counter == page_limit) {
// start new page
page_counter = 0;
}
// some other code
counter += 1;
page_counter += 1;
}

Most optimizing compilers will convert divide or modulo operations into multiply by pre-generated inverse constant and shift instructions if the divisor is a constant. Possibly also if the same divisor value is used repeatedly in a loop.
Modulo multiplies by inverse to get a quotient, then multiplies quotient by divisor to get a product, and then original number minus product will be the modulo.
Multiply and shift are fast instructions on reasonably recent X86 processors, but branch prediction can also reduce the time it takes for a conditional branch, so as suggested a benchmark may be needed to determine which is best.

(I assume you meant to write if(x%y==0) not if(x%y), to be equivalent to the counter.)
I don't think compilers will do this optimization for you, so it could be worth it. It's going to be smaller code-size, even if you can't measure a speed difference. The x % y == 0 way still branches (so is still subject to a branch misprediction those rare times when it's true). Its only advantage is that it doesn't need a separate counter variable, just some temporary registers at one point in the loop. But it does need the divisor every iteration.
Overall this should be better for code size, and isn't less readable if you're used to the idiom. (Especially if you use if(--page_count == 0) { page_count=page_limit; ... so all pieces of the logic are in two adjacent lines.)
If your page_limit were not a compile-time constant, this is even more likely to help. dec/jz that's only taken once per many decrements is a lot cheaper than div/test edx,edx/jz, including for front-end throughput. (div is micro-coded on Intel CPUs as about 10 uops, so even though it's one instruction it still takes up the front-end for multiple cycles, taking away throughput resources from getting surrounding code into the out-of-order back-end).
(With a constant divisor, it's still multiply, right shift, sub to get the quotient, then multiply and subtract to get the remainder from that. So still several single-uop instructions. Although there are some tricks for divisibility testing by small constants see #Cassio Neri's answer on Fast divisibility tests (by 2,3,4,5,.., 16)? which cites his journal articles; recent GCC may have started using these.)
But if your loop body doesn't bottleneck on front-end instruction/uop throughput (on x86), or the divider execution unit, then out-of-order exec can probably hide most of the cost of even a div instruction. It's not on the critical path so it could be mostly free if its latency happens in parallel with other computation, and there are spare throughput resources. (Branch prediction + speculative execution allow execution to continue without waiting for the branch condition to be known, and since this work is independent of other work it can "run ahead" as the compiler can see into future iterations.)
Still, making that work even cheaper can help the compiler see and handle a branch mispredict sooner. But modern CPUs with fast recovery can keep working on old instructions from before the branch while recovering. ( What exactly happens when a skylake CPU mispredicts a branch? / Avoid stalling pipeline by calculating conditional early )
And of course a few loops do fully keep the CPU's throughput resources busy, not bottlenecking on cache misses or a latency chain. And fewer uops executed per iteration is more friendly to the other hyperthread (or SMT in general).
Or if you care about your code running on in-order CPUs (common for ARM and other non-x86 ISAs that target low-power implementations), the real work has to wait for the branch-condition logic. (Only hardware prefetch or cache-miss loads and things like that can be doing useful work while running extra code to test the branch condition.)
Use a down-counter
Instead of counting up, you'd actually want to hand-hold the compiler into using a down-counter that can compile to dec reg / jz .new_page or similar; all normal ISAs can do that quite cheaply because it's the same kind of thing you'd find at the bottom of normal loops. (dec/jnz to keep looping while non-zero)
if(--page_counter == 0) {
/*new page*/;
page_counter = page_limit;
}
A down-counter is more efficient in asm and equally readable in C (compared to an up-counter), so if you're micro-optimizing you should write it that way. Related: using that technique in hand-written asm FizzBuzz. Maybe also a code review of asm sum of multiples of 3 and 5, but it does nothing for no-match so optimizing it is different.
Notice that page_limit is only accessed inside the if body, so if the compiler is low on registers it can easily spill that and only read it as needed, not tying up a register with it or with multiplier constants.
Or if it's a known constant, just a move-immediate instruction. (Most ISAs also have compare-immediate, but not all. e.g. MIPS and RISC-V only have compare-and-branch instructions that use the space in the instruction word for a target address, not for an immediate.) Many RISC ISAs have special support for efficiently setting a register to a wider constant than most instructions that take an immediate (like ARM movw with a 16-bit immediate, so 4092 can be done in one instruction more mov but not cmp: it doesn't fit in 12 bits).
Compared to dividing (or multiplicative inverse), most RISC ISAs don't have multiply-immediate, and a multiplicative inverse is usually wider than one immediate can hold. (x86 does have multiply-immediate, but not for the form that gives you a high-half.) Divide-immediate is even rarer, not even x86 has that at all, but no compiler would use that unless optimizing for space instead of speed if it did exist.
CISC ISAs like x86 can typically multiply or divide with a memory source operand, so if low on registers the compiler could keep the divisor in memory (especially if it's a runtime variable). Loading once per iteration (hitting in cache) is not expensive. But spilling and reloading an actual variable that changes inside the loop (like page_count) could introduce a store/reload latency bottleneck if the loop is short enough and there aren't enough registers. (Although that might not be plausible: if your loop body is big enough to need all the registers, it probably has enough latency to hide a store/reload.)

If somebody put it in front of me, I would rather it was:
const size_t page_limit = 4942;
size_t npages = 0, nitems = 0;
size_t pagelim = foo / page_limit;
size_t resid = foo % page_limit;
while (npages < pagelim || nitems < resid) {
if (++nitems == page_limit) {
/* start new page */
nitems = 0;
npages++;
}
}
Because the program is now expressing the intent of the processing -- a bunch of things in page_limit sized chunks; rather than an attempt to optimize away an operation.
That the compiler might generate nicer code is just a blessing.

Why is ONE basic arithmetic operation in for loop body executed SLOWER THAN TWO arithmetic operations?

While I experimented with measuring time of execution of arithmetic operations, I came across very strange behavior. A code block containing a for loop with one arithmetic operation in the loop body was always executed slower than an identical code block, but with two arithmetic operations in the for loop body. Here is the code I ended up testing:
#include <iostream>
#include <chrono>
#define NUM_ITERATIONS 100000000
int main()
{
// Block 1: one operation in loop body
{
int64_t x = 0, y = 0;
auto start = std::chrono::high_resolution_clock::now();
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=31;}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end-start;
std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
}
// Block 2: two operations in loop body
{
int64_t x = 0, y = 0;
auto start = std::chrono::high_resolution_clock::now();
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=17; y-=37;}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end-start;
std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
}
return 0;
}
I tested this with different levels of code optimization (-O0,-O1,-O2,-O3), with different online compilers (for example onlinegdb.com), on my work machine, on my hame PC and laptop, on RaspberryPi and on my colleague's computer. I rearranged these two code blocks, repeated them, changed constants, changed operations (+, -, <<, =, etc.), changed integer types. But I always got similar result: the block with one line in loop is SLOWER than block with two lines:
1.05681 seconds. x,y = 3100000000,0
0.90414 seconds. x,y = 1700000000,-3700000000
I checked the assembly output on https://godbolt.org/ but everything looked like I expected: second block just had one more operation in assembly output.
Three operations always behaved as expected: they are slower than one and faster than four. So why two operations produce such an anomaly?
Edit:
Let me repeat: I have such behaviour on all of my Windows and Unix machines with code not optimized. I looked at assembly I execute (Visual Studio, Windows) and I see the instructions I want to test there. Anyway if the loop is optimized away, there is nothing I ask about in the code which left. I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about. The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away. The difference in time of execution is 5-25% on my tests (quite noticeable).

This effect only happens at -O0 (or with volatile), and is a result of the compiler keeping your variables in memory (not registers). You'd expect that to just introduce a fixed amount of extra latency into a loop-carried dependency chains through i, x, and y, but modern CPUs are not that simple.
On Intel Sandybridge-family CPUs, store-forwarding latency is lower when the load uop runs some time after the store whose data it's reloading, not right away. So an empty loop with the loop counter in memory is the worst case. I don't understand what CPU design choices could lead to that micro-architectural quirk, but it's a real thing.
This is basically a duplicate of Adding a redundant assignment speeds up code when compiled without optimization, at least for Intel Sandybridge-family CPUs.
This is is one of the major reasons why you shouldn't benchmark at -O0: the bottlenecks are different than in realistically optimized code. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why compilers make such terrible asm on purpose.
Micro-benchmarking is hard; you can only measure something properly if you can get compilers to emit realistically optimized asm loops for the thing you're trying to measure. (And even then you're only measuring throughput or latency, not both; those are separate things for single operations on out-of-order pipelined CPUs: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)
See #rcgldr's answer for measurement + explanation of what would happen with loops that keep variables in registers.
With clang, benchmark::DoNotOptimize(x1 += 31) also de-optimizes into keeping x in memory, but with GCC it does just stay in a register. Unfortunately #SashaKnorre's answer used clang on QuickBench, not gcc, to get results similar to your -O0 asm. It does show the cost of lots of short-NOPs being hidden by the bottleneck through memory, and a slight speedup when those NOPs delay the reload next iteration just long enough for store-forwarding to hit the lower latency good case. (QuickBench I think runs on Intel Xeon server CPUs, with the same microarchitecture inside each CPU core as desktop version of the same generation.)
Presumably all the x86 machines you tested on had Intel CPUs from the last 10 years, or else there's a similar effect on AMD. It's plausible there's a similar effect on whichever ARM CPU your RPi uses, if your measurements really were meaningful there. Otherwise, maybe another case of seeing what you expected (confirmation bias), especially if you tested with optimization enabled there.
I tested this with different levels of code optimization (-O0,-O1,-O2,-O3) [...] But I always got similar result
I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about.
(later from comments) About optimizations: yes, I reproduced that with different optimization levels, but as the loops were optimized away, the execution time was too fast to say for sure.
So actually you didn't reproduce this effect for -O1 or higher, you just saw what you wanted to see (confirmation bias) and mostly made up the claim that the effect was the same. If you'd accurately reported your data (measurable effect at -O0, empty timed region at -O1 and higher), I could have answered right away.
See Idiomatic way of performance evaluation? - if your times don't increase linearly with increasing repeat count, you aren't measuring what you think you're measuring. Also, startup effects (like cold caches, soft page faults, lazy dynamic linking, and dynamic CPU frequency) can easily lead to the first empty timed region being slower than the second.
I assume you only swapped the loops around when testing at -O0, otherwise you would have ruled out there being any effect at -O1 or higher with that test code.
The loop with optimization enabled:
As you can see on Godbolt, gcc fully removes the loop with optimization enabled. Sometimes GCC leaves empty loops alone, like maybe it thinks the delay was intentional, but here it doesn't even loop at all. Time doesn't scale with anything, and both timed regions look the same like this:
orig_main:
...
call std::chrono::_V2::system_clock::now() # demangled C++ symbol name
mov rbp, rax # save the return value = start
call std::chrono::_V2::system_clock::now()
# end in RAX
So the only instruction in the timed region is saving start to a call-preserved register. You're measuring literally nothing about your source code.
With Google Benchmark, we can get asm that doesn't optimize the work away, but which doesn't store/reload to introduce new bottlenecks:
#include <benchmark/benchmark.h>
static void TargetFunc(benchmark::State& state) {
uint64_t x2 = 0, y2 = 0;
// Code inside this loop is measured repeatedly
for (auto _ : state) {
benchmark::DoNotOptimize(x2 += 31);
benchmark::DoNotOptimize(y2 += 31);
}
}
// Register the function as a benchmark
BENCHMARK(TargetFunc);
# just the main loop, from gcc10.1 -O3
.L7: # do{
add rax, 31 # x2 += 31
add rdx, 31 # y2 += 31
sub rbx, 1
jne .L7 # }while(--count != 0)
I assume benchmark::DoNotOptimize is something like asm volatile("" : "+rm"(x) ) (GNU C inline asm) to make the compiler materialize x in a register or memory, and to assume the lvalue has been modified by that empty asm statement. (i.e. forget anything it knew about the value, blocking constant-propagation, CSE, and whatever.) That would explain why clang stores/reloads to memory while GCC picks a register: this is a longstanding missed-optimization bug with clang's inline asm support. It likes to pick memory when given the choice, which you can sometimes work around with multi-alternative constraints like "+r,m". But not here; I had to just drop the memory alternative; we don't want the compiler to spill/reload to memory anyway.
For GNU C compatible compilers, we can use asm volatile manually with only "+r" register constraints to get clang to make good scalar asm (Godbolt), like GCC. We get an essentially identical inner loop, with 3 add instructions, the last one being an add rbx, -1 / jnz that can macro-fuse.
static void TargetFunc(benchmark::State& state) {
uint64_t x2 = 0, y2 = 0;
// Code inside this loop is measured repeatedly
for (auto _ : state) {
x2 += 16;
y2 += 17;
asm volatile("" : "+r"(x2), "+r"(y2));
}
}
All of these should run at 1 clock cycle per iteration on modern Intel and AMD CPUs, again see #rcgldr's answer.
Of course this also disables auto-vectorization with SIMD, which compilers would do in many real use cases. Or if you used the result at all outside the loop, it might optimize the repeated increment into a single multiply.
You can't measure the cost of the + operator in C++ - it can compile very differently depending on context / surrounding code. Even without considering loop-invariant stuff that hoists work. e.g. x + (y<<2) + 4 can compile to a single LEA instruction for x86.
The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away
TL:DR: it's not the operations, it's the loop-carried dependency chain through memory that stops the CPU from running the loop at 1 clock cycle per iteration, doing all 3 adds in parallel on separate execution ports.
Note that the loop counter increment is just as much of an operation as what you're doing with x (and sometimes y).

ETA: This was a guess, and Peter Cordes has made a very good argument about why it's incorrect. Go upvote Peter's answer.
I'm leaving my answer here because some found the information useful. Though this doesn't correctly explain the behavior seen in the OP, it highlights some of the issues that make it infeasible (and meaningless) to try to measure the speed of a particular instruction on a modern processor.
Educated guess:
It's the combined effect of pipelining, powering down portions of a core, and dynamic frequency scaling.
Modern processors pipeline so that multiple instructions can be executing at the same time. This is possible because the processor actually works on micro-ops rather than the assembly-level instructions we usually think of as machine language. Processors "schedule" micro-ops by dispatching them to different portions of the chip while keeping track of the dependencies between the instructions.
Suppose the core running your code has two arithmetic/logic units (ALUs). A single arithmetic instruction repeated over and over requires only one ALU. Using two ALUs doesn't help because the next operation depends on completion of the current one, so the second ALU would just be waiting around.
But in your two-expression test, the expressions are independent. To compute the next value of y, you do not have to wait for the current operation on x to complete. Now, because of power-saving features, that second ALU may be powered down at first. The core might run a few iterations before realizing that it could make use of the second ALU. At that point, it can power up the second ALU and most of the two-expression loop will run as fast as the one-expression loop. So you might expect the two examples to take approximately the same amount of time.
Finally, many modern processors use dynamic frequency scaling. When the processor detects that it's not running hard, it actually slows its clock a little bit to save power. But when it's used heavily (and the current temperature of the chip permits), it might increase the actual clock speed as high as its rated speed.
I assume this is done with heuristics. In the case where the second ALU stays powered down, the heuristic may decide it's not worth boosting the clock. In the case where two ALUs are powered up and running at top speed, it may decide to boost the clock. Thus the two-expression case, which should already be just about as fast as the one-expression case, actually runs at a higher average clock frequency, enabling it to complete twice as much work in slightly less time.
Given your numbers, the difference is about 14%. My Windows machine idles at about 3.75 GHz, and if I push it a little by building a solution in Visual Studio, the clock climbs to about 4.25GHz (eyeballing the Performance tab in Task Manager). That's a 13% difference in clock speed, so we're in the right ballpark.

I split up the code into C++ and assembly. I just wanted to test the loops, so I didn't return the sum(s). I'm running on Windows, the calling convention is rcx, rdx, r8, r9, the loop count is in rcx. The code is adding immediate values to 64 bit integers on the stack.
I'm getting similar times for both loops, less than 1% variation, same or either one up to 1% faster than the other.
There is an apparent dependency factor here: each add to memory has to wait for the prior add to memory to the same location to complete, so two add to memories can be performed essentially in parallel.
Changing test2 to do 3 add to memories, ends up about 6% slower, 4 add to memories, 7.5% slower.
My system is Intel 3770K 3.5 GHz CPU, Intel DP67BG motherboard, DDR3 1600 9-9-9-27 memory, Win 7 Pro 64 bit, Visual Studio 2015.
.code
public test1
align 16
test1 proc
sub rsp,16
mov qword ptr[rsp+0],0
mov qword ptr[rsp+8],0
tst10: add qword ptr[rsp+8],17
dec rcx
jnz tst10
add rsp,16
ret
test1 endp
public test2
align 16
test2 proc
sub rsp,16
mov qword ptr[rsp+0],0
mov qword ptr[rsp+8],0
tst20: add qword ptr[rsp+0],17
add qword ptr[rsp+8],-37
dec rcx
jnz tst20
add rsp,16
ret
test2 endp
end
I also tested with add immediate to register, 1 or 2 registers within 1% (either could be faster, but we'd expect them both to execute at 1 iteration / clock on Ivy Bridge, given its 3 integer ALU ports; What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?).
3 registers 1.5 times as long, somewhat worse than the ideal 1.333 cycles / iterations from 4 uops (including the loop counter macro-fused dec/jnz) for 3 back-end ALU ports with perfect scheduling.
4 registers, 2.0 times as long, bottlenecked on the front-end: Is performance reduced when executing loops whose uop count is not a multiple of processor width?. Haswell and later microarchitectures would handle this better.
.code
public test1
align 16
test1 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst10: add rdx,17
dec rcx
jnz tst10
ret
test1 endp
public test2
align 16
test2 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst20: add rdx,17
add r8,-37
dec rcx
jnz tst20
ret
test2 endp
public test3
align 16
test3 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst30: add rdx,17
add r8,-37
add r9,47
dec rcx
jnz tst30
ret
test3 endp
public test4
align 16
test4 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst40: add rdx,17
add r8,-37
add r9,47
add r10,-17
dec rcx
jnz tst40
ret
test4 endp
end

#PeterCordes proved this answer to be wrong in many assumptions, but it could still be useful as some blind research attempt of the problem.
I set up some quick benchmarks, thinking it may somehow be connected to code memory alignment, truly a crazy thought.
But it seems that #Adrian McCarthy got it right with the dynamic frequency scaling.
Anyway benchmarks tell that inserting some NOPs could help with the issue, with 15 NOPs after the x+=31 in Block 1 leading to nearly the same performance as the Block 2. Truly mind blowing how 15 NOPs in single instruction loop body increase performance.
http://quick-bench.com/Q_7HY838oK5LEPFt-tfie0wy4uA
I also tried -OFast thinking compilers might be smart enough to throw away some code memory inserting such NOPs, but it seems not to be the case.
http://quick-bench.com/so2CnM_kZj2QEWJmNO2mtDP9ZX0
Edit: Thanks to #PeterCordes it was made clear that optimizations were never working quite as expected in benchmarks above (as global variable required add instructions to access memory), new benchmark http://quick-bench.com/HmmwsLmotRiW9xkNWDjlOxOTShE clearly shows that Block 1 and Block 2 performance is equal for stack variables. But NOPs could still help with single-threaded application with loop accessing global variable, which you probably should not use in that case and just assign global variable to local variable after the loop.
Edit 2: Actually optimizations never worked due to quick-benchmark macros making variable access volatile, preventing important optimizations. It is only logical to load the variable once as we are only modifying it in the loop, so it is volatile or disabled optimizations being the bottleneck. So this answer is basically wrong, but at least it shows how NOPs could speed-up unoptimized code execution, if it makes any sense in the real world (there are better ways like bucketing counters).

Processors are so complex these days that we can only guess.
The assembly emitted by your compiler is not what is really executed. The microcode/firmware/whatever of your CPU will interpret it and turn it into instructions for its execution engine, much like JIT languages such as C# or java do.
One thing to consider here is that for each loop, there is not 1 or 2 instructions, but n + 2, as you also increment and compare i to your number of iteration. In the vast majority of case it wouldn't matter, but here it does, as the loop body is so simple.
Let's see the assembly :
Some defines:
#define NUM_ITERATIONS 1000000000ll
#define X_INC 17
#define Y_INC -31
C/C++ :
for (long i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }
ASM :
mov QWORD PTR [rbp-32], 0
.L13:
cmp QWORD PTR [rbp-32], 999999999
jg .L12
add QWORD PTR [rbp-24], 17
add QWORD PTR [rbp-32], 1
jmp .L13
.L12:
C/C++ :
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=X_INC; y+=Y_INC;}
ASM:
mov QWORD PTR [rbp-80], 0
.L21:
cmp QWORD PTR [rbp-80], 999999999
jg .L20
add QWORD PTR [rbp-64], 17
sub QWORD PTR [rbp-72], 31
add QWORD PTR [rbp-80], 1
jmp .L21
.L20:
So both Assemblies look pretty similar. But then let's think twice : modern CPUs have ALUs which operate on values which are wider than their register size. So there is a chance than in the first case, the operation on x and i are done on the same computing unit. But then you have to read i again, as you put a condition on the result of this operation. And reading means waiting.
So, in the first case, to iterate on x, the CPU might have to be in sync with the iteration on i.
In the second case, maybe x and y are treated on a different unit than the one dealing with i. So in fact, your loop body runs in parallel than the condition driving it. And there goes your CPU computing and computing until someone tells it to stop. It doesn't matter if it goes too far, going back a few loops is still fine compared to the amount of time it just gained.
So, to compare what we want to compare (one operation vs two operations), we should try to get i out of the way.
One solution is to completely get rid of it by using a while loop:
C/C++:
while (x < (X_INC * NUM_ITERATIONS)) { x+=X_INC; }
ASM:
.L15:
movabs rax, 16999999999
cmp QWORD PTR [rbp-40], rax
jg .L14
add QWORD PTR [rbp-40], 17
jmp .L15
.L14:
An other one is to use the antequated "register" C keyword:
C/C++:
register long i;
for (i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }
ASM:
mov ebx, 0
.L17:
cmp rbx, 999999999
jg .L16
add QWORD PTR [rbp-48], 17
add rbx, 1
jmp .L17
.L16:
Here are my results:
x1 for: 10.2985 seconds. x,y = 17000000000,0
x1 while: 8.00049 seconds. x,y = 17000000000,0
x1 register-for: 7.31426 seconds. x,y = 17000000000,0
x2 for: 9.30073 seconds. x,y = 17000000000,-31000000000
x2 while: 8.88801 seconds. x,y = 17000000000,-31000000000
x2 register-for :8.70302 seconds. x,y = 17000000000,-31000000000
Code is here: https://onlinegdb.com/S1lAANEhI

Is mfence for rdtsc necessary on x86_64 platform?

unsigned int lo = 0;
unsigned int hi = 0;
__asm__ __volatile__ (
"mfence;rdtsc" : "=a"(lo), "=d"(hi) : : "memory"
);
mfence in the above code, is it necessary?
Based on my test, cpu reorder is not found.
The fragment of test code is included below.
inline uint64_t clock_cycles() {
unsigned int lo = 0;
unsigned int hi = 0;
__asm__ __volatile__ (
"rdtsc" : "=a"(lo), "=d"(hi)
);
return ((uint64_t)hi << 32) | lo;
}
unsigned t1 = clock_cycles();
unsigned t2 = clock_cycles();
assert(t2 > t1);

What you need to perform a sensible measurement with rdtsc is a serializing instruction.
As it is well known, a lot of people use cpuid before rdtsc.
rdtsc needs to be serialized from above and below (read: all instructions before it must be retired and it must be retired before the test code starts).
Unfortunately the second condition is often neglected because cpuid is a very bad choice for this task (it clobbers the output of rdtsc).
When looking for alternatives people think that instructions that have a "fence" in their names will do, but this is also untrue. Straight from Intel:
MFENCE does not serialize the instruction stream.
An instruction that is almost serializing and will do in any measurement where previous stores don't need to complete is lfence.
Simply put, lfence makes sure that no new instructions start before any prior instruction completes locally. See this answer of mine for a more detailed explanation on locality.
It also doesn't drain the Store Buffer like mfence does and doesn't clobbers the registers like cpuid does.
So lfence / rdtsc / lfence is a better crafted sequence of instructions than mfence / rdtsc, where mfence is pretty much useless unless you explicitly want the previous stores to be completed before the test begins/ends (but not before rdstc is executed!).
If your test to detect reordering is assert(t2 > t1) then I believe you will test nothing.
Leaving out the return and the call that may or may not prevent the CPU from seeing the second rdtsc in time for a reorder, it is unlikely (though possible!) that the CPU will reorder two rdtsc even if one is right after the other.
Imagine we have a rdtsc2 that is exactly like rdtsc but writes ecx:ebx1.
Executing
rdtsc
rdtsc2
is highly likely that ecx:ebx > edx:eax because the CPU has no reason to execute rdtsc2 before rdtsc.
Reordering doesn't mean random ordering, it means look for other instruction if the current one cannot be executed.
But rdtsc has no dependency on any previous instruction, so it's unlikely to be delayed when encountered by the OoO core.
However peculiar internal micro-architectural details may invalidate my thesis, hence the likely word in my previous statement.
1 We don't need this altered instruction: register renaming will do it, but in case you are not familiar with it, this will help.

mfence is there to force serialization in CPU before rdtsc.
Usually you will find cpuid there (which is also serializing instruction).
Quote from Intel manuals about using rdtsc will make it clearer
Starting with the Intel Pentium processor, most Intel CPUs support
out-of-order execution of the code. The purpose is to optimize the
penalties due to the different instruction latencies. Unfortunately
this feature does not guarantee that the temporal sequence of the
single compiled C instructions will respect the sequence of the
instruction themselves as written in the source C file. When we call
the RDTSC instruction, we pretend that that instruction will be
executed exactly at the beginning and at the end of code being
measured (i.e., we don’t want to measure compiled code executed
outside of the RDTSC calls or executed in between the calls
themselves).
The solution is to call a serializing instruction before
calling the RDTSC one. A serializing instruction is an instruction
that forces the CPU to complete every preceding instruction of the C
code before continuing the program execution. By doing so we guarantee
that only the code that is under measurement will be executed in
between the RDTSC calls and that no part of that code will be executed
outside the calls.
TL;DR version - without serializing instruction before rdtsc you have no idea when that instruction started to execute making measurements possibly incorrect.
HINT - use rdtscp when possible.
Based on my test, cpu reorder is not found.
Still no guarantee that it may happen - that's why original code had "memory" to indicate possible memory clobber preventing compiler from reordering it.

How to get the CPU cycle count in x86_64 from C++?

I saw this post on SO which contains C code to get the latest CPU Cycle count:
CPU Cycle count based profiling in C/C++ Linux x86_64
Is there a way I can use this code in C++ (windows and linux solutions welcome)? Although written in C (and C being a subset of C++) I am not too certain if this code would work in a C++ project and if not, how to translate it?
I am using x86-64
EDIT2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for windows....?)
static inline uint64_t get_cycles()
{
uint64_t t;
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
EDIT3:
From above code I get the error:
"error C2400: inline assembler syntax error in 'opcode'; found 'data
type'"
Could someone please help?

Starting from GCC 4.5 and later, the __rdtsc() intrinsic is now supported by both MSVC and GCC.
But the include that's needed is different:
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
Here's the original answer before GCC 4.5.
Pulled directly out of one of my projects:
#include <stdint.h>
// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc(){
return __rdtsc();
}
// Linux/GCC
#else
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
#endif
This GNU C Extended asm tells the compiler:
volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
"=a"(lo) and "=d"(hi) : the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
((uint64_t)hi << 32) | lo - zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get an actual shift + OR asm instructions, unless the high half optimizes away.
(editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)
https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info

Your inline asm is broken for x86-64. "=A" in 64-bit mode lets the compiler pick either RAX or RDX, not EDX:EAX. See this Q&A for more
You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like #Mysticial's.
(One minor advantage to asm is if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers could do that optimization for you with a uint32_t time_low = __rdtsc() intrinsic, but in practice they sometimes still waste instructions doing shift / OR.)
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.
Intel's intriniscs guide says _rdtsc (with one underscore) is in <immintrin.h>, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>, so we're stuck with <intrin.h> (MSVC) vs. <x86intrin.h> (everything else, including recent ICC). For compat with MSVC, and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.
Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc() as returning (signed) __int64.
// valid C99 and C++
#include <stdint.h> // <cstdint> is preferred in C++, but stdint.h works.
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
uint64_t tsc = __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
return tsc;
}
// requires a Nehalem or newer CPU. Not Core2 or earlier. IDK when AMD added it.
inline
uint64_t readTSCp() {
unsigned dummy;
return __rdtscp(&dummy); // waits for earlier insns to retire, but allows later to start
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer, including a couple test callers.
These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc(). Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.
The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtsc in immintrin.h, but doesn't have an x86intrin.h at all. More recent ICC have x86intrin.h, at least the way Godbolt installs them for Linux they do.
You might want to define them as signed long long, especially if you want to subtract them and convert to float. int64_t -> float/double is more efficient than uint64_t on x86 without AVX512. Also, small negative results could be possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.
BTW, clang also has a portable __builtin_readcyclecounter() which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs
For more about using lfence (or cpuid) to improve repeatability of rdtsc and control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see #HadiBrais' answer on clflush to invalidate cache line via C function and the comments for an example of the difference it makes.
See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuid to serialize.) It's always been defined as partially-serializing on Intel.
How to Benchmark Code Execution Times on Intel® IA-32 and IA-64
Instruction Set Architectures, an Intel white-paper from 2010.
rdtsc counts reference cycles, not CPU core clock cycles
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock).
The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
Microbenchmarking is hard: see Idiomatic way of performance evaluation? for other pitfalls.
Instead of TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and use rdmsr in user-space, or simpler ways include tricks like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID.
You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)
Negative clock cycle measurements with back-to-back rdtsc? the history of RDTSC: originally CPUs didn't do power-saving, so the TSC was both real-time and core clocks. Then it evolved through various barely-useful steps into its current form of a useful low-overhead timesource decoupled from core clock cycles (constant_tsc), which doesn't stop when the clock halts (nonstop_tsc). Also some tips, e.g. don't take the mean time, take the median (there will be very high outliers).
std::chrono::clock, hardware clock and cycle count
Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
measuring code execution times in C using RDTSC instruction lists some gotchas, including SMI (system-management interrupts) which you can't avoid even in kernel mode with cli), and virtualization of rdtsc under a VM. And of course basic stuff like regular interrupts being possible, so repeat your timing many times and throw away outliers.
Determine TSC frequency on Linux. Programatically querying the TSC frequency is hard and maybe not possible, especially in user-space, or may give a worse result than calibrating it. Calibrating it using another known time-source takes time. See that question for more about how hard it is to convert TSC to nanoseconds (and that it would be nice if you could ask the OS what the conversion ratio is, because the OS already did it at bootup).
If you're microbenchmarking with RDTSC for tuning purposes, your best bet is to just use ticks and skip even trying to convert to nanoseconds. Otherwise, use a high-resolution library time function like std::chrono or clock_gettime. See faster equivalent of gettimeofday for some discussion / comparison of timestamp functions, or reading a shared timestamp from memory to avoid rdtsc entirely if your precision requirement is low enough for a timer interrupt or thread to update it.
See also Calculate system time using rdtsc about finding the crystal frequency and multiplier.
CPU TSC fetch operation especially in multicore-multi-processor environment says that Nehalem and newer have the TSC synced and locked together for all cores in a package (along with the invariant = constant and nonstop TSC feature). See #amdn's answer there for some good info about multi-socket sync.
(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see #amdn's answer on the linked question, and more details below.)
CPUID features relevant to the TSC
Using the names that Linux /proc/cpuinfo uses for the CPU features, and other aliases for the same feature that you'll also find.
tsc - the TSC exists and rdtsc is supported. Baseline for x86-64.
rdtscp - rdtscp is supported.
tsc_deadline_timer CPUID.01H:ECX.TSC_Deadline[bit 24] = 1 - local APIC can be programmed to fire an interrupt when the TSC reaches a value you put in IA32_TSC_DEADLINE. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.
constant_tsc: Support for the constant TSC feature is determined by checking the CPU family and model numbers. The TSC ticks at constant frequency regardless of changes in core clock speed. Without this, RDTSC does count core clock cycles.
nonstop_tsc: This feature is called the invariant TSC in the Intel SDM manual and is supported on processors with CPUID.80000007H:EDX[8]. The TSC keeps ticking even in deep sleep C-states. On all x86 processors, nonstop_tsc implies constant_tsc, but constant_tsc doesn't necessarily imply nonstop_tsc. No separate CPUID feature bit; on Intel and AMD the same invariant TSC CPUID bit implies both constant_tsc and nonstop_tsc features. See Linux's x86/kernel/cpu/intel.c detection code, and amd.c was similar.
Some of the processors (but not all) that are based on the Saltwell/Silvermont/Airmont even keep TSC ticking in ACPI S3 full-system sleep: nonstop_tsc_s3. This is called always-on TSC. (Although it seems the ones based on Airmont were never released.)
For more details on constant and invariant TSC, see: Can constant non-invariant tsc change frequency across cpu states?.
tsc_adjust: CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1) The IA32_TSC_ADJUST MSR is available, allowing OSes to set an offset that's added to the TSC when rdtsc or rdtscp reads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)
constant_tsc and nonstop_tsc together make the TSC usable as a timesource for things like clock_gettime in user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, TSC as a timesource may still be usable
The comments in the Linux source code also indicate that constant_tsc / nonstop_tsc features (on Intel) implies "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"
The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets. Apparently platform vendors do in practice do that, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environment also agree that all sockets on a single motherboard should start out in sync.
On a multi-socket shared memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel, by default performs boot-time and run-time checks to make sure that TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource' would tell you whether the kernel is using TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system. The kernel paramter tsc=reliable can be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.
There are cases where cross-socket TSCs may NOT be in sync: (1) hotplugging a CPU, (2) when the sockets are spread out across different boards connected by extended node controllers, (3) a TSC may not be resynced after waking up from a C-state in which the TSC is powered-downed in some processors, and (4) different sockets have different CPU models installed.
An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscp produces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)
If you're using rdtsc directly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).
How efficient is the asm from using the intrinsic?
It's about as good as you'd get from #Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.
A non-inline version of the readTSC function itself compiles with MSVC for x86-64 like this:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.
In a test caller that uses it twice and subtracts to time an interval:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
However, we can maybe get the best of both worlds with a modified version of #Mysticial's code:
// More efficient than __rdtsc() in some case, but maybe worse in others
uint64_t rdtsc(){
// long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so #ifdef would be better if we care about this trick there.
unsigned long lo,hi; // let the compiler know that zero-extension to 64 bits isn't required
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) + lo;
// + allows LEA or ADD instead of OR
}
On Godbolt, this does sometimes give better asm than __rdtsc() for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo). Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)
But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc() function itself with an actual add/adc with zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with | instead of +, but definitely prefer the __rdtsc() intrinsic if you care about 32-bit code-gen from gcc).

VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.
In this case, that's probably just as well -- rdtsc has (at least) two major problem when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).
Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result is entirely possible.
Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.
If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:
xor eax, eax
cpuid
xor eax, eax
cpuid
xor eax, eax
cpuid
rdtsc
; save eax, edx
; code you're going to time goes here
xor eax, eax
cpuid
rdtsc
I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).
Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.
Along with that, you want to use whatever means your OS supplies to force this all to run on one process/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution spee.
Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you need to do before you can get results from rdtsc that will actually mean anything.

For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:
unsigned __int64 __rdtsc(void);

Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross architecture wrapper for performance events.
This answer similar: Quick way to count number of instructions executed in a C program but with PERF_COUNT_HW_CPU_CYCLES instead of PERF_COUNT_HW_INSTRUCTIONS. This answer will focus on PERF_COUNT_HW_CPU_CYCLES specifics, see that other answer for more generic information.
Here is an example based on the one provided at the end of the man page.
perf_event_open.c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
}
The results seem reasonable, e.g. if I print cycles then recompile for instruction counts, we get about 1 cycle per iteration (2 instructions done in a single cycle) possibly due to effects such as superscalar execution, with slightly different results for each run presumably due to random memory access latencies.
You might also be interested in PERF_COUNT_HW_REF_CPU_CYCLES, which as the manpage documents:
Total cycles; not affected by CPU frequency scaling.
so this will give something closer to the real wall time if your frequency scaling is on. These were 2/3x larger than PERF_COUNT_HW_INSTRUCTIONS on my quick experiments, presumably because my non-stressed machine is frequency scaled now.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js