selected processor does not support `swp x1,x1,[x0]' - c++

I try to use swp instruction to implement atomic swap.
asm volatile ("swp %[newval], %[newval], [%[oldval]]"
: [newval] "+r" (newval), [oldval] "+p" (oldval)
:
: "memory");
when I compiling the code (using g++ main.cpp -o main -march=armv8-a). I got the following error message.
/tmp/cc0MHTHA.s: Assembler messages:
/tmp/cc0MHTHA.s:20: Error: selected processor does not support `swp x1,x1,[x0]'
The ARM machine I use is with armv8, /proc/cpuinfo is like this (It's a SMP machine with 16 cores, information of others processors is the same besides the first line.)
processor : 0
model name : phytium FT1500a
flags : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x70
CPU architecture: 8
CPU variant : 0x1
CPU part : 0x660
bogomips : 3590.55
CPU revision : 1
g++ --version outputs
g++ (Ubuntu/Linaro 4.9.1-16kord6) 4.9.1
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I'm getting the following errors when using ldrex/strex instructions
/tmp/ccXxJgQH.s: Assembler messages:
/tmp/ccXxJgQH.s:19: Error: unknown mnemonic `ldrex' -- `ldrex x0,[x0]'
Can anyone explain me why and where this error comes and how to deal with this error? The machine does not support SWP or I should add some parameters(maybe -march) on compile command to indicate the CPU architecture?

You don't need (and shouldn't use) inline assembly for this.
Use a gcc builtin: type __atomic_exchange_n (type *ptr, type val, int memorder) or C++11 std::atomic for this, so the compiler can use the best instruction sequence for the target CPU, based on your -mcpu= command line option and whether you're building for 64-bit or 32-bit ARM (or x86), etc. etc. Also, the compiler understands what you're doing and can optimize accordingly.
// static inline
int xchg_gcc(int *p, int newval) {
int oldval = __atomic_exchange_n(p, newval, __ATOMIC_SEQ_CST);
//__atomic_signal_fence ( __ATOMIC_SEQ_CST);
return oldval;
}
For ARM64 and ARM (32-bit with -mcpu=cortex-a72) with gcc5.4, this compiles to what you want (Godbolt compiler explorer):
.L2: ## This is the ARM64 version.
ldaxr w2, [x0]
stlxr w3, w1, [x0]
cbnz w3, .L2
mov w0, w2 # This insn will optimize away after inlining, leaving just the retry loop
ret
Or if you just want atomicity but don't need ordering wrt other operations, then use __ATOMIC_RELAXED instead of __ATOMIC_SEQ_CST. Then it compiles to ldxr / stxr, instead of the acquire/release version of the LL/SC instructions.
For the 32-bit version, if you don't specify a -mcpu or -march, it calls library functions because it doesn't know what to use for exchange.
I'm not sure if SEQ_CST for the __atomic_exchange builtin orders with respect to non-atomic things the way asm volatile("":::"memory") does; if not you might need fences as described below for C++11 atomic_signal_fence.
or use this portable C++11 version, which compiles to the same asm:
#include <atomic>
// static inline
int xchg_stdatomic(std::atomic<int> *p, int newval) {
atomic_signal_fence(std::memory_order_seq_cst);
int oldval = p->exchange(newval, std::memory_order_seq_cst);
atomic_signal_fence(std::memory_order_seq_cst); // order WRT. non-atomic variables (on gcc/clang at least).
return oldval;
}
atomic_signal_fence is used as an equivalent of asm("":::"memory"), to block compile-time reordering with non-atomic loads/stores (but without emitting any instructions). This is how gcc implements it, but IDK if that's required by the standard of just an implementation detail in gcc.
In gcc at least, atomic_signal_fence orders operations on "normal" variables, but atomic_thread_fence only orders operations on atomic variables. (Shared access to non-atomic variables from multiple threads would be an undefined-behaviour data race, so gcc assumes it doesn't happen. The question here is whether the standard requires signal_fence to order non-atomic operations along with atomic and volatile accesses, because the guarantees about what you can safely access in signal handlers are quite weak.)
Anyway, since signal_fence compiles to no instruction, and is only blocking reordering that we want exchange() to block anyway, there's no harm. (Unless you don't want exchange() to order your non-shared variables, in which case you shouldn't use signal_fence).
swp is supported but deprecated in ARMv6 and ARMv7. ARM's docs say that it increases interrupt latency (because swp itself is not interruptible). Also,
In a multi-core system, preventing access to main memory for all processors for the duration of a swap instruction can reduce overall system performance.

Related

Intrinsic load operation: Unsuited alignment, but the code still runs [duplicate]

I am looking at the generated assembly for my code (using Visual Studio 2017) and noticed that _mm_load_ps is often (always?) compiled to movups.
The data I'm using _mm_load_ps on is defined like this:
struct alignas(16) Vector {
float v[4];
}
// often embedded in other structs like this
struct AABB {
Vector min;
Vector max;
bool intersection(/* parameters */) const;
}
Now when I'm using this construct, the following will happen:
// this code
__mm128 bb_min = _mm_load_ps(min.v);
// generates this
movups xmm4, XMMWORD PTR [r8]
I'm expecting movaps because of alignas(16). Do I need something else to convince the compiler to use movaps in this case?
EDIT: My question is different from this question because I'm not getting any crashes. The struct is specifically aligned and I'm also using aligned allocation. Rather, I'm curious why the compiler is switching _mm_load_ps (the intrinsic for aligned memory) to movups. If I know struct was allocated at an aligned address and I'm calling it via this* it would be safe to use movaps, right?
On recent versions of Visual Studio and the Intel Compiler (recent as post-2013?), the compiler rarely ever generates aligned SIMD load/stores anymore.
When compiling for AVX or higher:
The Microsoft compiler (>VS2013?) doesn't generate aligned loads. But it still generates aligned stores.
The Intel compiler (> Parallel Studio 2012?) doesn't do it at all anymore. But you'll still see them in ICC-compiled binaries inside their hand-optimized libraries like memset().
As of GCC 6.1, it still generates aligned load/stores when you use the aligned intrinsics.
The compiler is allowed to do this because it's not a loss of functionality when the code is written correctly. All processors starting from Nehalem have no penalty for unaligned load/stores when the address is aligned.
Microsoft's stance on this issue is that it "helps the programmer by not crashing". Unfortunately, I can't find the original source for this statement from Microsoft anymore. In my opinion, this achieves the exact opposite of that because it hides misalignment penalties. From the correctness standpoint, it also hides incorrect code.
Whatever the case is, unconditionally using unaligned load/stores does simplify the compiler a bit.
New Relevations:
Starting Parallel Studio 2018, the Intel Compiler no longer generates aligned moves at all - even for pre-Nehalem targets.
Starting from Visual Studio 2017, the Microsoft Compiler also no longer generates aligned moves at all - even when targeting pre-AVX hardware.
Both cases result in inevitable performance degradation on older processors. But it seems that this is intentional as both Intel and Microsoft no longer care about old processors.
The only load/store intrinsics that are immune to this are the non-temporal load/stores. There is no unaligned equivalent of them, so the compiler has no choice.
So if you want to just test for correctness of your code, you can substitute in the load/store intrinsics for non-temporal ones. But be careful not to let something like this slip into production code since NT load/stores (NT-stores in particular) are a double-edged sword that can hurt you if you don't know what you're doing.

How to emulate _mm256_loadu_epi32 with gcc or clang?

Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32:
_m256i _mm256_loadu_epi32 (void const* mem_addr);
/*
Instruction: vmovdqu32 ymm, m256
CPUID Flags: AVX512VL + AVX512F
Description
Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst.
mem_addr does not need to be aligned on any particular boundary.
Operation
a[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
*/
But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin.h) only the masked versions
_mm256_mask_loadu_epi32 (__m256i, __mmask8, void const *);
_mm256_maskz_loadu_epi32 (__mmask8, void const *);
which boil down to the same instruction vmovdqu32. My question: how can I emulate _mm256_loadu_epi32:
inline _m256i _mm256_loadu_epi32(void const* mem_addr)
{
/* code using vmovdqu32 and compiles with gcc */
}
without writing assembly, i.e. using only intrinsics available?
Just use _mm256_loadu_si256 like a normal person. The only thing the AVX512 intrinsic gives you is a nicer prototype (const void* instead of const __m256i*) so you don't have to write ugly casts.
#chtz suggests out that you might still want to write a wrapper function yourself to get the void* prototype. But don't call it _mm256_loadu_epi32; some future GCC version will probably add that for compat with Intel's docs and break your code.
From another perspective, it's unfortunate that compilers don't treat it as an AVX1 intrinsic, but I guess compilers which don't optimize intrinsics, and which let you use intrinsics from ISA extensions you haven't enabled, need this kind of clue to know when they can use ymm16-31.
You don't even want the compiler to emit vmovdqu32 ymm when you're not masking; vmovdqu ymm is shorter and does exactly the same thing, with no penalty for mixing with EVEX-encoded instructions. The compiler can always use an vmovdqu32 or 64 if it wants to load into ymm16..31, otherwise you want it to use a shorter VEX-coded AVX1 vmovdqu.
I'm pretty sure that GCC treats _mm256_maskz_epi32(0xffu,ptr) exactly the same as _mm256_loadu_si256((const __m256i*)ptr) and makes the same asm regardless of which one you use. It can optimize away the 0xffu mask and simply use an unmasked load, but there's no need for that extra complication in your source.
But unfortunately GCC9 and earlier will pessimize to vmovdqu32 ymm0, [mem] when AVX512VL is enabled (e.g. -march=skylake-avx512) even when you write _mm256_loadu_si256. This was a missed-optimization, GCC Bug 89346.
It doesn't matter which 256-bit load intrinsic you use (except for aligned vs. unaligned) as long as there's no masking.
Related:
error: '_mm512_loadu_epi64' was not declared in this scope
What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Are Reads and Writes of an int in C++ Atomic on x86-64 multi-core machine

I've read this, my question is quite similar yet somewhat different.
Note, I know C++0x does not guarantee that but I'm asking particularly for a multi-core machine like x86-64.
Let's say we have 2 threads (pinned to 2 physical cores) running the following code:
// I know people may delcare volatile useless, but here I do NOT care memory reordering nor synchronization/
// I just want to suppress complier optimization of using register.
volatile int n;
void thread1() {
for (;;)
n = 0xABCD1234;
// NOTE, I know ++n is not atomic,
// but I do NOT care here.
// what I cares is whether n can be 0x00001234, i.e. in the middle of the update from core-1's cache lines to main memory,
// will core-2 see an incomplete value(like the first 2 bytes lost)?
++n;
}
}
void thread2() {
while (true) {
printf('%d', n);
}
}
Is it possible for thread 2 to see n to be something like 0x00001234, i.e. in the middle of the update from core-1's cache lines to main memory, will core-2 see an incomplete value?
I know a single 4-byte int definitely fits into a typically 128-byte-long cache line, and if that int does store inside one cache line then I believe no issues here... yet what if it acrosses the cache line boundary? i.e. will it be possbile that some char already sit inside that cache line which makes first part of the n in one cache line and the other part in the next line? If that is the case, then core-2 may have a chance seeing an incomplete value, right?
Also, I think unless making every char or short or other less-than-4-bytes types padded to be 4-byte-long, one can never guarantee a single int does not pass the cache line boundary, isn't it?
If so, would that suggest generally even setting a single int is not guaranteed to be atomic on a x86-64 multi-core machine?
I got this question because as I researched on this topic, various people in various posts seem agreed on that as long as the machine architecture is proper(e.g. x86-64) setting an int should be atomic. But as I argued above that does not hold, right?
UPDATE
I'd like to give some background of my question. I'm dealing with a real-time system, which is sampling some signal and putting the result into one global int, this is of course done in one thread. And in yet another thread I read this value and process it.
I do not care the ordering of set and get, all I need is just a complete (vs. a corrrupted integer value) value.
x86 guarantees this. C++ doesn't. If you write x86 assembly you will be fine. If you write C++ it is undefined behavior. Since you can't reason about undefined behavior (it is undefined after all) you have to go lower and look at the assembler instructions that were generated. If they do what you want then this is fine. Note, however, that compilers tend to change generated assembly when you change compilers, compiler versions, compiler flags or any code which might change the optimizer's behavior, so you will constantly have to check the assembler code to make sure it is still correct.
The easier way is to use std::atomic<int> which will guarantee that the correct assembler instructions are generated so you don't have to constantly check.
The other question talks about variables "properly aligned". If it crosses a cache-line, the variable is not properly aligned. An int will not do that unless you specifically ask the compiler to pack a struct, for example.
You also assume that using volatile int is better than atomic<int>. If volatile int is the perfect way to sync variables on your platform, surely the library implementer would also know that and store a volatile x inside atomic<x>.
There is no requirement that atomic<int> has to be extra slow just because it is standard. :-)
Why worry so much?
Rely on your implementation. std::atomic<int> will reduce to an int if int is atomic on your platform (and in x86-64 they are, if properly aligned).
I'd also be concerned about the possibility of int overflow with your code (which is undefined behaviour), if I were you.
In other words std::atomic<unsigned> is the appropriate type here.
If you're looking for atomicity guarantee, std::atomic<> is your friend. Don't rely on volatile qualifier.
The question is almost a duplicate of Why is integer assignment on a naturally aligned variable atomic on x86?. The answer there does answer everything you ask, but this question is more focused on the ABI / compiler question of whether an int (or other type?) will be sufficiently-aligned, rather than what happens when it is. There's other stuff in this question that's worth answering specifically, too.
Yes, they almost invariably will be on machines where an int fits in a single register (e.g. not AVR: an 8-bit RISC), because compilers typically choose not to use multiple store instructions when they could use 1.
Normal x86 ABIs will align an int to a 4B boundary, even inside structs (unless you use GNU C __attribute__((packed)) or the equivalent for other dialects). But beware that the i386 System V ABI only aligns double to 4 bytes; it's only outside structs that modern compilers can go beyond that and give it natural alignment, making load/store atomic.
But nothing you can legally do in C++ can ever depend on this fact (because by definition it will involve a data race on a non-atomic type so it's Undefined Behaviour). Fortunately, there are efficient ways to get the same result (i.e. about the same compiler-generated asm, without mfence instructions or other slow stuff) that don't cause undefined behaviour.
You should use atomic instead of volatile or hoping that the compiler doesn't optimize away stores or loads on a non-volatile int, because the assumption of async modification is one of the ways that volatile and atomic overlap.
I'm dealing with a real-time system, which is sampling some signal and putting the result into one global int, this is of course done in one thread. And in yet another thread I read this value and process it.
std::atomic with .store(val, std::memory_order_relaxed) and .load(std::memory_order_relaxed) will give you exactly what you want here. The HW-access thread runs free and does plain ordinary x86 store instructions into the shared variable, while the reader thread does plain ordinary x86 load instructions.
This is the C++11 way to express that this is what you want, and you should expect it to compile to the same asm as with volatile. (With maybe a couple instructions difference if you use clang, but nothing important.) If there was any case where volatile int wouldn't have sufficient alignment, or any other corner cases, atomic<int> will work (barring compiler bugs). Except maybe in a packed struct; IDK if compilers stop you from breaking atomicity by packing atomic types in structs.
In theory, you might want to use volatile std::atomic<int> to make sure the compiler doesn't optimize out multiple stores to the same variable. See Why don't compilers merge redundant std::atomic writes?. But for now, compilers don't do that kind of optimization. (volatile std::atomic<int> should still compile to the same light-weight asm.)
I know a single 4-byte int definitely fits into a typically 128-byte-long cache line, and if that int does store inside one cache line then I believe no issues here...
Cache lines are 64B on all mainstream x86 CPUs since PentiumIII; before that 32B lines were typical. (Well AMD Geode still uses 32B lines...) Pentium4 uses 64B lines, although it prefers to transfer them in pairs or something? Still, I think it's accurate to say that it really does use 64B lines, not 128B. This page lists it as 64B per line.
AFAIK, there are no x86 microarchitectures that used 128B lines in any level of cache.
Also, only Intel CPUs guarantee that cached unaligned stores / loads are atomic if they don't cross a cache-line boundary. The baseline atomicity guarantee for x86 in general (AMD/Intel/other) is don't cross an 8-byte boundary. See Why is integer assignment on a naturally aligned variable atomic on x86? for quotes from Intel/AMD manuals.
Natural alignment works on pretty much any ISA (not just x86) up to the maximum guaranteed-atomic width.
The code in your question wants a non-atomic read-modify write where the load and store are separately atomic, and impose no ordering on surrounding loads/stores.
As everyone has said, the right way to do this is with atomic<int>, but nobody has pointed out exactly how. If you just n++ on atomic_int n, you will get (for x86-64) lock add [n], 1, which will be much slower than what you get with volatile, because it makes the entire RMW operation atomic. (Perhaps this is why you were avoiding std::atomic<>?)
#include <atomic>
volatile int vcount;
std::atomic <int> acount;
static_assert(alignof(vcount) == sizeof(vcount), "under-aligned volatile counter");
void inc_volatile() {
while(1) vcount++;
}
void inc_separately_atomic() {
while(1) {
int t = acount.load(std::memory_order_relaxed);
t++;
acount.store(t, std::memory_order_relaxed);
}
}
asm output from the Godbolt compiler explorer with gcc7.2 and clang5.0
Unsurprisingly, they both compile to equivalent asm with gcc/clang for x86-32 and x86-64. gcc makes identical asm for both, except for the address to increment:
# x86-64 gcc -O3
inc_volatile:
.L2:
mov eax, DWORD PTR vcount[rip]
add eax, 1
mov DWORD PTR vcount[rip], eax
jmp .L2
inc_separately_atomic():
.L5:
mov eax, DWORD PTR acount[rip]
add eax, 1
mov DWORD PTR acount[rip], eax
jmp .L5
clang optimizes better, and uses
inc_separately_atomic():
.LBB1_1:
add dword ptr [rip + acount], 1
jmp .LBB1_1
Note the lack of a lock prefix, so inside the CPU this decodes to separate load, ALU add, and store uops. (See Can num++ be atomic for 'int num'?).
Besides smaller code-size, some of these uops can be micro-fused when they come from the same instruction, reducing front-end bottlenecks. (Totally irrelevant here; the loop bottlenecks on the 5 or 6 cycle latency of a store/reload. But if used as part of a larger loop, it would be relevant.) Unlike with a register operand, add [mem], 1 is better than inc [mem] on Intel CPUs because it micro-fuses even more: INC instruction vs ADD 1: Does it matter?.
It's interesting that clang uses the less efficient inc dword ptr [rip + vcount] for inc_volatile().
And how does an actual atomic RMW compile?
void inc_atomic_rmw() {
while(1) acount++;
}
# both gcc and clang do this:
.L7:
lock add DWORD PTR acount[rip], 1
jmp .L7
Alignment inside structs:
#include <stdint.h>
struct foo {
int a;
volatile double vdouble;
};
// will fail with -m32, in the SysV ABI.
static_assert(alignof(foo) == sizeof(double), "under-aligned volatile counter");
But atomic<double> or atomic<unsigned long long> will guarantee atomicity.
For 64-bit integer load/store on 32-bit machines, gcc uses SSE2 instructions. Some other compilers unfortunately use lock cmpxchg8b, which is far less efficient for separate stores or loads. volatile long long wouldn't give you that.
volatile double usually would normally be atomic to load/store when aligned correctly, because the normal way is already to use single 8B load/store instructions.

RDRAND and RDSEED intrinsics on various compilers?

Does Intel C++ compiler and/or GCC support the following Intel intrinsics, like MSVC does since 2012 / 2013?
#include <immintrin.h> // for the following intrinsics
int _rdrand16_step(uint16_t*);
int _rdrand32_step(uint32_t*);
int _rdrand64_step(uint64_t*);
int _rdseed16_step(uint16_t*);
int _rdseed32_step(uint32_t*);
int _rdseed64_step(uint64_t*);
And if these intrinsics are supported, since which version are they supported (with compile-time-constant please)?
Both GCC and Intel compiler support them. GCC support was introduced at the end of 2010. They require the header <immintrin.h>.
GCC support has been present since at least version 4.6, but there doesn't seem to be any specific compile-time constant - you can just check __GNUC_MAJOR__ > 4 || (__GNUC_MAJOR__ == 4 && __GNUC_MINOR__ >= 6).
All the major compilers support Intel's intrinsics for rdrand and rdseed via <immintrin.h>.
Somewhat recent versions of some compilers are needed for rdseed, e.g. GCC9 (2019) or clang7 (2018), although those have been stable for a good while by now. If you'd rather use an older compiler, or not enable ISA-extension options like -march=skylake, a library1 wrapper function instead of the intrinsic is a good choice. (Inline asm is not necessary, I wouldn't recommend it unless you want to play with it.)
#include <immintrin.h>
#include <stdint.h>
// gcc -march=native or haswell or znver1 or whatever, or manually enable -mrdrnd
uint64_t rdrand64(){
unsigned long long ret; // not uint64_t, GCC/clang wouldn't compile.
do{}while( !_rdrand64_step(&ret) ); // retry until success.
return ret;
}
// and equivalent for _rdseed64_step
// and 32 and 16-bit sizes with unsigned and unsigned short.
Some compilers define __RDRND__ when the instruction is enabled at compile-time. GCC/clang since they supported the intrinsic at all, but only much later ICC (19.0). And with ICC, -march=ivybridge doesn't imply -mrdrnd or define __RDRND__ until 2021.1.
ICX is LLVM-based and behaves like clang.
MSVC doesn't define any macros; its handling of intrinsics is designed around runtime feature detection only, unlike gcc/clang where the easy way is compile-time CPU feature options.
Why do{}while() instead of while(){}? Turns out ICC compiles to a less-dumb loop with do{}while(), not uselessly peeling a first iteration. Other compilers don't benefit from that hand-holding, and it's not a correctness problem for ICC.
Why unsigned long long instead of uint64_t? The type has to agree with the pointer type expected by the intrinsic, or C and especially C++ compilers will complain, regardless of the object-representations being identical (64-bit unsigned). On Linux for example, uint64_t is unsigned long, but GCC/clang's immintrin.h define int _rdrand64_step(unsigned long long*), same as on Windows. So you always need unsigned long long ret with GCC/clang. MSVC is a non-problem as it can (AFAIK) only target Windows, where unsigned long long is the only 64-bit unsigned type.
But ICC defines the intrinsic as taking unsigned long* when compiling for GNU/Linux, according to my testing on https://godbolt.org/. So to be portable to ICC, you actually need #ifdef __INTEL_COMPILER; even in C++ I don't know a way to use auto or other type-deduction to declare a variable that matches it.
Compiler versions to support intrinsics
Tested on Godbolt; its earliest version of MSVC is 2015, and ICC 2013, so I can't go back any further. Support for _rdrand16_step / 32 / 64 were all introduced at the same time in any given compiler. 64 requires 64-bit mode.
CPU
gcc
clang
MSVC
ICC
rdrand
Ivy Bridge / Excavator
4.6
3.2
before 2015 (19.10)
before 13.0.1, but 19.0 for -mrdrnd defining __RDRND__. 2021.1 for -march=ivybridge to enable -mrdrnd
rdseed
Broadwell / Zen 1
9.1
7.0
before 2015 (19.10)
before(?) 13.0.1, but 19.0 also added -mrdrnd and -mrdseed options)
The earliest GCC and clang versions don't recognize -march=ivybridge only -mrdrnd. (GCC 4.9 and clang 3.6 for Ivy Bridge, not that you specifically want to use IvyBridge if modern CPUs are more relevant. So use a non-ancient compiler and set a CPU option appropriate for CPUs you actually care about, or at least a -mtune= with a more recent CPU.)
Intel's new oneAPI / ICX compilers all support rdrand/rdseed, and are based on LLVM internals so they work similarly to clang for CPU options. (It doesn't define __INTEL_COMPILER, which is good because it's different from ICC.)
GCC and clang only let you use intrinsics for instructions you've told the compiler the target supports. Use -march=native if compiling for your own machine, or use -march=skylake or something to enable all the ISA extensions for the CPU you're targeting. But if you need your program to run on old CPUs and only use RDRAND or RDSEED after runtime detection, only those functions need __attribute__((target("rdrnd"))) or rdseed, and won't be able to inline into functions with different target options. Or using a separately-compiled library would be easier1.
-mrdrnd: enabled by -march=ivybridge or -march=znver1 (or bdver4 Exavator APUs) and later
-mrdseed: enabled by -march=broadwell or -march=znver1 or later
Normally if you're going to enable one CPU feature, it makes sense to enable others that CPUs of that generation will have, and to set tuning options. But rdrand isn't something the compiler will use on its own (unlike BMI2 shlx for more efficient variable-count shifts, or AVX/SSE for auto-vectorization and array/struct copying and init). So enabling -mrdrnd globally likely won't make your program crash on pre-Ivy Bridge CPUs, if you check CPU features and don't actually run code that uses _rdrand64_step on CPUs without the feature.
But if you are only going to run your code on some specific kind of CPU or later, gcc -O3 -march=haswell is a good choice. (-march also implies -mtune=haswell, and tuning for Ivy Bridge specifically is not what you want for modern CPUs. You could -march=ivybridge -mtune=skylake to set an older baseline of CPU features, but still tune for newer CPUs.)
Wrappers that compile everywhere
This is valid C++ and C. For C, you probably want static inline instead of inline so you don't need to manually instantiate an extern inline version in a .c in case a debug build decided not to inline. (Or use __attribute__((always_inline)) in GNU C.)
The 64-bit versions are only defined for x86-64 targets, because asm instructions can only use 64-bit operand-size in 64-bit mode. I didn't #ifdef __RDRND__ or #if defined(__i386__)||defined(__x86_64__), on the assumption that you'd only include this for x86(-64) builds at all, not cluttering the ifdefs more than necessary. It does only define the rdseed wrappers if that's enabled at compile time, or for MSVC where there's no way to enable them or to detect it.
There are some commented __attribute__((target("rdseed"))) examples you can uncomment if you want to do it that way instead of compiler options. rdrand16 / rdseed16 are intentionally omitted as not being normally useful. rdrand runs the same speed for different operand-sizes, and even pulls the same amount of data from the CPU's internal RNG buffer, optionally throwing away part of it for you.
#include <immintrin.h>
#include <stdint.h>
#if defined(__x86_64__) || defined (_M_X64)
// Figure out which 64-bit type the output arg uses
#ifdef __INTEL_COMPILER // Intel declares the output arg type differently from everyone(?) else
// ICC for Linux declares rdrand's output as unsigned long, but must be long long for a Windows ABI
typedef uint64_t intrin_u64;
#else
// GCC/clang headers declare it as unsigned long long even for Linux where long is 64-bit, but uint64_t is unsigned long and not compatible
typedef unsigned long long intrin_u64;
#endif
//#if defined(__RDRND__) || defined(_MSC_VER) // conditional definition if you want
inline
uint64_t rdrand64(){
intrin_u64 ret;
do{}while( !_rdrand64_step(&ret) ); // retry until success.
return ret;
}
//#endif
#if defined(__RDSEED__) || defined(_MSC_VER)
inline
uint64_t rdseed64(){
intrin_u64 ret;
do{}while( !_rdseed64_step(&ret) ); // retry until success.
return ret;
}
#endif // RDSEED
#endif // x86-64
//__attribute__((target("rdrnd")))
inline
uint32_t rdrand32(){
unsigned ret; // Intel documents this as unsigned int, not necessarily uint32_t
do{}while( !_rdrand32_step(&ret) ); // retry until success.
return ret;
}
#if defined(__RDSEED__) || defined(_MSC_VER)
//__attribute__((target("rdseed")))
inline
uint32_t rdseed32(){
unsigned ret; // Intel documents this as unsigned int, not necessarily uint32_t
do{}while( !_rdseed32_step(&ret) ); // retry until success.
return ret;
}
#endif
The fact that Intel's intrinsics API is supported at all implies that unsigned int is a 32-bit type, regardless of whether uint32_t is defined as unsigned int or unsigned long if any compilers do that.
On the Godbolt compiler explorer we can see how these compile. Clang and MSVC do what we'd expect, just a 2-instruction loop until rdrand leaves CF=1
# clang 7.0 -O3 -march=broadwell MSVC -O2 does the same.
rdrand64():
.LBB0_1: # =>This Inner Loop Header: Depth=1
rdrand rax
jae .LBB0_1 # synonym for jnc - jump if Not Carry
ret
# same for other functions.
Unfortunately GCC is not so good, even current GCC12.1 makes weird asm:
# gcc 12.1 -O3 -march=broadwell
rdrand64():
mov edx, 1
.L2:
rdrand rax
mov QWORD PTR [rsp-8], rax # store into the red-zone where retval is allocated
cmovc eax, edx # materialize a 0 or 1 from CF. (rdrand zeros EAX when it clears CF=0, otherwise copy the 1)
test eax, eax # then test+branch on it
je .L2 # could have just been jnc after rdrand
mov rax, QWORD PTR [rsp-8] # reload retval
ret
rdseed64():
.L7:
rdseed rax
mov QWORD PTR [rsp-8], rax # dead store into the red-zone
jnc .L7
ret
ICC makes the same asm as long as we use a do{}while() retry loop; with a while() {} it's even worse, doing an rdrand and checking before entering the loop for the first time.
Footnote 1: rdrand/rdseed library wrappers
librdrand or Intel's libdrng have wrapper functions with retry loops like I showed, and ones that fill a buffer of bytes or array of uint32_t* or uint64_t*. (Consistently taking uint64_t*, no unsigned long long* on some targets).
A library is also a good choice if you're doing runtime CPU feature detection, so you don't have to mess around with __attribute__((target)) stuff. However you do it, that limits inlining of a function using the intrinsics anyway, so a small static library is equivalent.
libdrng also provides RdRand_isSupported() and RdSeed_isSupported(), so you don't need to do your own CPUID check.
But if you're going to build with -march= something newer than Ivy Bridge / Broadwell or Excavator / Zen1 anyway, inlining a 2-instruction retry loop (like clang compiles it to) is about the same code-size as a function call-site, but doesn't clobber any registers. rdrand is quite slow so that's probably not a big deal, but it also means no extra library dependency.
Performance / internals of rdrand / rdseed
For more details about the HW internals on Intel (not AMD's version), see Intel's docs. For the actual TRNG logic, see Understanding Intel's Ivy Bridge Random Number Generator - it's a metastable latch that settles to 0 or 1 due to thermal noise. Or at least Intel says it is; it's basically impossible to truly verify where the rdrand bits actually come from in a CPU you bought. Worst case, still much better than nothing if you're mixing it with other entropy sources, like Linux does for /dev/random.
For more on the fact that there's a buffer that cores pull from, see some SO answers from the engineer who designed the hardware and wrote librdrand, such as this and this about its exhaustion / performance characteristics on Ivy Bridge, the first generation to feature it.
Infinite retry count?
The asm instructions set the carry flag (CF) = 1 in FLAGS on success, when it put a random number in the destination register. Otherwise CF=0 and the output register = 0. You're intended to call it in a retry loop, that's (I assume) why the intrinsic has the word step in the name; it's one step of generating a single random number.
In theory, a microcode update could change things so it always indicates failure, e.g. if a problem is discovered in some CPU model that makes the RNG untrustworthy (by the standards of the CPU vendor). The hardware RNG also has some self-diagnostics, so it's in theory possible for a CPU to decide that the RNG is broken and not produce any outputs. I haven't heard of any CPUs ever doing this, but I haven't gone looking. And a future microcode update is always possible.
Either of these could lead to an infinite retry loop. That's not great, but unless you want to write a bunch of code to report on that situation, it's at least an observable behaviour that users could potentially deal with in the unlikely event it ever happened.
But occasional temporary failure is normal and expected, and must be handled. Preferably by retrying without telling the user about it.
If there wasn't a random number ready in its buffer, the CPU can report failure instead of stalling this core for potentially even longer. That design choice might be related to interrupt latency, or just keeping it simpler without having to build retrying into the microcode.
Ivy Bridge can't pull data from the DRNG faster than it can keep up, according to the designer, even with all cores looping rdrand, but later CPUs can. Therefore it is important to actually retry.
#jww has had some experience with deploying rdrand in libcrypto++, and found that with a retry count set too low, there were reports of occasional spurious failure. He's had good results from infinite retries, which is why I chose that for this answer. (I suspect he would have heard reports from users with broken CPUs that always fail, if that was a thing.)
Intel's library functions that include a retry loop take a retry count. That's likely to handle the permanent-failure case which, as I said, I don't think happens in any real CPUs yet. Without a limited retry count, you'd loop forever.
An infinite retry count allows a simple API returning the number by value, without silly limitations like OpenSSL's functions that use 0 as an error return: they can't randomly generate a 0!
If you did want a finite retry count, I'd suggest very high. Like maybe 1 million, so it takes maybe have a second or a second of spinning to give up on a broken CPU, with negligible chance of having one thread starve that long if it's repeatedly unlucky in contending for access to the internal queue.
https://uops.info/ measured a throughput on Skylake of one per 3554 cycles on Skylake, one per 1352 on Alder Lake P-cores, 1230 on E-cores. One per 1809 cycles on Zen2. The Skylake version ran thousands of uops, the others were in the low double digits. Ivy Bridge had 110 cycle throughput, but in Haswell it was already up to 2436 cycles, but still a double-digit number of uops.
These abysmal performance numbers on recent Intel CPUs are probably due to microcode updates to work around problems that weren't anticipated when the HW was designed. Agner Fog measured one per 460 cycle throughput for rdrand and rdseed on Skylake when it was new, each costing 16 uops. The thousands of uops are probably extra buffer flushing hooked into the microcode for those instructions by recent updates. Agner measured Haswell at 17 uops, 320 cycles when it was new. See RdRand Performance As Bad As ~3% Original Speed With CrossTalk/SRBDS Mitigation on Phoronix:
As explained in the earlier article, mitigating CrossTalk involves locking the entire memory bus before updating the staging buffer and unlocking it after the contents have been cleared. This locking and serialization now involved for those instructions is very brutal on the performance, but thankfully most real-world workloads shouldn't be making too much use of these instructions.
Locking the memory bus sounds like it could hurt performance even of other cores, if it's like cache-line splits for locked instructions.
(Those cycle numbers are core clock cycle counts; if the DRNG doesn't run on the same clock as the core, those might vary by CPU model. I wonder if uops.info's testing is running rdrand on multiple cores of the same hardware, since Coffee Lake is twice the uops as Skylake, and 1.4x as many cycles per random number. Unless that's just higher clocks leading to more microcode retries?)
Microsoft compiler does not have intrinsics support for RDSEED and RDRAND instruction.
But, you may implement these instruction using NASM or MASM. Assembly code is available at:
https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
For Intel Compiler, you can use header to determine the version. You can use following macros to determine the version and sub-version:
__INTEL_COMPILER //Major Version
__INTEL_COMPILER_UPDATE // Minor Update.
For instance if you use ICC15.0 Update 3 compiler, it will show that you have
__INTEL_COMPILER = 1500
__INTEL_COMPILER_UPDATE = 3
For further details on pre-defined macros you can go to: https://software.intel.com/en-us/node/524490

How to get the CPU cycle count in x86_64 from C++?

I saw this post on SO which contains C code to get the latest CPU Cycle count:
CPU Cycle count based profiling in C/C++ Linux x86_64
Is there a way I can use this code in C++ (windows and linux solutions welcome)? Although written in C (and C being a subset of C++) I am not too certain if this code would work in a C++ project and if not, how to translate it?
I am using x86-64
EDIT2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for windows....?)
static inline uint64_t get_cycles()
{
uint64_t t;
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
EDIT3:
From above code I get the error:
"error C2400: inline assembler syntax error in 'opcode'; found 'data
type'"
Could someone please help?
Starting from GCC 4.5 and later, the __rdtsc() intrinsic is now supported by both MSVC and GCC.
But the include that's needed is different:
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
Here's the original answer before GCC 4.5.
Pulled directly out of one of my projects:
#include <stdint.h>
// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc(){
return __rdtsc();
}
// Linux/GCC
#else
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
#endif
This GNU C Extended asm tells the compiler:
volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
"=a"(lo) and "=d"(hi) : the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
((uint64_t)hi << 32) | lo - zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get an actual shift + OR asm instructions, unless the high half optimizes away.
(editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)
https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info
Your inline asm is broken for x86-64. "=A" in 64-bit mode lets the compiler pick either RAX or RDX, not EDX:EAX. See this Q&A for more
You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like #Mysticial's.
(One minor advantage to asm is if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers could do that optimization for you with a uint32_t time_low = __rdtsc() intrinsic, but in practice they sometimes still waste instructions doing shift / OR.)
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.
Intel's intriniscs guide says _rdtsc (with one underscore) is in <immintrin.h>, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>, so we're stuck with <intrin.h> (MSVC) vs. <x86intrin.h> (everything else, including recent ICC). For compat with MSVC, and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.
Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc() as returning (signed) __int64.
// valid C99 and C++
#include <stdint.h> // <cstdint> is preferred in C++, but stdint.h works.
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
uint64_t tsc = __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
return tsc;
}
// requires a Nehalem or newer CPU. Not Core2 or earlier. IDK when AMD added it.
inline
uint64_t readTSCp() {
unsigned dummy;
return __rdtscp(&dummy); // waits for earlier insns to retire, but allows later to start
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer, including a couple test callers.
These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc(). Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.
The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtsc in immintrin.h, but doesn't have an x86intrin.h at all. More recent ICC have x86intrin.h, at least the way Godbolt installs them for Linux they do.
You might want to define them as signed long long, especially if you want to subtract them and convert to float. int64_t -> float/double is more efficient than uint64_t on x86 without AVX512. Also, small negative results could be possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.
BTW, clang also has a portable __builtin_readcyclecounter() which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs
For more about using lfence (or cpuid) to improve repeatability of rdtsc and control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see #HadiBrais' answer on clflush to invalidate cache line via C function and the comments for an example of the difference it makes.
See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuid to serialize.) It's always been defined as partially-serializing on Intel.
How to Benchmark Code Execution Times on IntelĀ® IA-32 and IA-64
Instruction Set Architectures, an Intel white-paper from 2010.
rdtsc counts reference cycles, not CPU core clock cycles
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock).
The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
Microbenchmarking is hard: see Idiomatic way of performance evaluation? for other pitfalls.
Instead of TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and use rdmsr in user-space, or simpler ways include tricks like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID.
You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)
Negative clock cycle measurements with back-to-back rdtsc? the history of RDTSC: originally CPUs didn't do power-saving, so the TSC was both real-time and core clocks. Then it evolved through various barely-useful steps into its current form of a useful low-overhead timesource decoupled from core clock cycles (constant_tsc), which doesn't stop when the clock halts (nonstop_tsc). Also some tips, e.g. don't take the mean time, take the median (there will be very high outliers).
std::chrono::clock, hardware clock and cycle count
Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
measuring code execution times in C using RDTSC instruction lists some gotchas, including SMI (system-management interrupts) which you can't avoid even in kernel mode with cli), and virtualization of rdtsc under a VM. And of course basic stuff like regular interrupts being possible, so repeat your timing many times and throw away outliers.
Determine TSC frequency on Linux. Programatically querying the TSC frequency is hard and maybe not possible, especially in user-space, or may give a worse result than calibrating it. Calibrating it using another known time-source takes time. See that question for more about how hard it is to convert TSC to nanoseconds (and that it would be nice if you could ask the OS what the conversion ratio is, because the OS already did it at bootup).
If you're microbenchmarking with RDTSC for tuning purposes, your best bet is to just use ticks and skip even trying to convert to nanoseconds. Otherwise, use a high-resolution library time function like std::chrono or clock_gettime. See faster equivalent of gettimeofday for some discussion / comparison of timestamp functions, or reading a shared timestamp from memory to avoid rdtsc entirely if your precision requirement is low enough for a timer interrupt or thread to update it.
See also Calculate system time using rdtsc about finding the crystal frequency and multiplier.
CPU TSC fetch operation especially in multicore-multi-processor environment says that Nehalem and newer have the TSC synced and locked together for all cores in a package (along with the invariant = constant and nonstop TSC feature). See #amdn's answer there for some good info about multi-socket sync.
(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see #amdn's answer on the linked question, and more details below.)
CPUID features relevant to the TSC
Using the names that Linux /proc/cpuinfo uses for the CPU features, and other aliases for the same feature that you'll also find.
tsc - the TSC exists and rdtsc is supported. Baseline for x86-64.
rdtscp - rdtscp is supported.
tsc_deadline_timer CPUID.01H:ECX.TSC_Deadline[bit 24] = 1 - local APIC can be programmed to fire an interrupt when the TSC reaches a value you put in IA32_TSC_DEADLINE. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.
constant_tsc: Support for the constant TSC feature is determined by checking the CPU family and model numbers. The TSC ticks at constant frequency regardless of changes in core clock speed. Without this, RDTSC does count core clock cycles.
nonstop_tsc: This feature is called the invariant TSC in the Intel SDM manual and is supported on processors with CPUID.80000007H:EDX[8]. The TSC keeps ticking even in deep sleep C-states. On all x86 processors, nonstop_tsc implies constant_tsc, but constant_tsc doesn't necessarily imply nonstop_tsc. No separate CPUID feature bit; on Intel and AMD the same invariant TSC CPUID bit implies both constant_tsc and nonstop_tsc features. See Linux's x86/kernel/cpu/intel.c detection code, and amd.c was similar.
Some of the processors (but not all) that are based on the Saltwell/Silvermont/Airmont even keep TSC ticking in ACPI S3 full-system sleep: nonstop_tsc_s3. This is called always-on TSC. (Although it seems the ones based on Airmont were never released.)
For more details on constant and invariant TSC, see: Can constant non-invariant tsc change frequency across cpu states?.
tsc_adjust: CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1) The IA32_TSC_ADJUST MSR is available, allowing OSes to set an offset that's added to the TSC when rdtsc or rdtscp reads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)
constant_tsc and nonstop_tsc together make the TSC usable as a timesource for things like clock_gettime in user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, TSC as a timesource may still be usable
The comments in the Linux source code also indicate that constant_tsc / nonstop_tsc features (on Intel) implies "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"
The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets. Apparently platform vendors do in practice do that, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environment also agree that all sockets on a single motherboard should start out in sync.
On a multi-socket shared memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel, by default performs boot-time and run-time checks to make sure that TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource' would tell you whether the kernel is using TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system. The kernel paramter tsc=reliable can be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.
There are cases where cross-socket TSCs may NOT be in sync: (1) hotplugging a CPU, (2) when the sockets are spread out across different boards connected by extended node controllers, (3) a TSC may not be resynced after waking up from a C-state in which the TSC is powered-downed in some processors, and (4) different sockets have different CPU models installed.
An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscp produces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)
If you're using rdtsc directly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).
How efficient is the asm from using the intrinsic?
It's about as good as you'd get from #Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.
A non-inline version of the readTSC function itself compiles with MSVC for x86-64 like this:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.
In a test caller that uses it twice and subtracts to time an interval:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
However, we can maybe get the best of both worlds with a modified version of #Mysticial's code:
// More efficient than __rdtsc() in some case, but maybe worse in others
uint64_t rdtsc(){
// long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so #ifdef would be better if we care about this trick there.
unsigned long lo,hi; // let the compiler know that zero-extension to 64 bits isn't required
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) + lo;
// + allows LEA or ADD instead of OR
}
On Godbolt, this does sometimes give better asm than __rdtsc() for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo). Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)
But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc() function itself with an actual add/adc with zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with | instead of +, but definitely prefer the __rdtsc() intrinsic if you care about 32-bit code-gen from gcc).
VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.
In this case, that's probably just as well -- rdtsc has (at least) two major problem when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).
Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result is entirely possible.
Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.
If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:
xor eax, eax
cpuid
xor eax, eax
cpuid
xor eax, eax
cpuid
rdtsc
; save eax, edx
; code you're going to time goes here
xor eax, eax
cpuid
rdtsc
I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).
Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.
Along with that, you want to use whatever means your OS supplies to force this all to run on one process/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution spee.
Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you need to do before you can get results from rdtsc that will actually mean anything.
For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:
unsigned __int64 __rdtsc(void);
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross architecture wrapper for performance events.
This answer similar: Quick way to count number of instructions executed in a C program but with PERF_COUNT_HW_CPU_CYCLES instead of PERF_COUNT_HW_INSTRUCTIONS. This answer will focus on PERF_COUNT_HW_CPU_CYCLES specifics, see that other answer for more generic information.
Here is an example based on the one provided at the end of the man page.
perf_event_open.c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
}
The results seem reasonable, e.g. if I print cycles then recompile for instruction counts, we get about 1 cycle per iteration (2 instructions done in a single cycle) possibly due to effects such as superscalar execution, with slightly different results for each run presumably due to random memory access latencies.
You might also be interested in PERF_COUNT_HW_REF_CPU_CYCLES, which as the manpage documents:
Total cycles; not affected by CPU frequency scaling.
so this will give something closer to the real wall time if your frequency scaling is on. These were 2/3x larger than PERF_COUNT_HW_INSTRUCTIONS on my quick experiments, presumably because my non-stressed machine is frequency scaled now.