Does Intel have a separate instruction set for its GPU? - OpenGL

Assume I'm using my Intel x64 based laptop with no dedicated GPU.
I must have some GPU onboard otherwise my screen won't work, right?
Are onboard GPUs typically embedded into the CPU?
Does Intel have a separate instruction set for its GPU? If so, is there a doc?
Do GPU instructions greatly differ from CPU instructions? For example, do GPUs also have
shift, add, load, and store instructions? What other instructions do they have
that regular CPUs don't?
Is there a difference between the instruction set/pipeline of an onboard GPU vs. a dedicated one, or
is the difference just about the number of extra cores and the dedicated RAM?
On a machine with a dedicated GPU, how do the instructions generated from C++ OpenGL code get executed on the GPU rather than ending up on the regular CPU?

Full hardware reference
One can find full documentation of Intel's graphics controllers at 01.org:
Hardware Specification - PRMs
Answering question 2: yes, there is a separate set of assembly instructions, as detailed below (from "Introduction to GEN assembly").
General form of Intel GPU assembly
Typically, all instructions have the following form:
[(pred)] opcode (exec-size|exec-offset) dst src0 [src1] [src2]
(pred) is the optional predicate. We are going to skip it for now.
opcode is the symbol of the instruction, like add or mov (we have a full table of opcodes below).
exec-size is the SIMD width of the instruction, which for our architecture can be 1, 2, 4, 8, or 16. In SIMD32 compilation, typically two instructions of execution size 8 or 16 are grouped into one.
exec-offset is the part that tells the EU which part of the ARF registers to read from or write to, e.g. (8|M24) consults bits 24-31 of the execution mask. When emitting SIMD16 or SIMD32 code like the following:
mov (8|M0) r11.0<1>:q r5.0<8;8,1>:d // id:1
mov (8|M8) r13.0<1>:q r6.0<8;8,1>:d // id:1
mov (8|M16) r15.0<1>:q r9.0<8;8,1>:d // id:1
mov (8|M24) r17.0<1>:q r10.0<8;8,1>:d // id:1
(mov instructions of SIMD32 assembly)
the compiler has to emit four 8-wide operations due to a limitation of how many bytes can be accessed per operand in the GRF.
dst is a destination register
src0 is a source register
src1 is an optional source register. Note that it could also be an immediate value, like 0x3F000000:f (0.5) or 0x2A:ud (42).
src2 is an optional source register.
General Register File (GRF) Registers
Each thread has a dedicated space of 128 registers, r0 through r127. Each register is 256 bits or 32 bytes.
Architecture Register File (ARF) Registers
In the assembly code above, we only saw one of these special registers: the null register, which is typically used as a destination for send instructions that perform writes and signal end of thread. The source article lists a full table of the other architecture registers.
Available GEN (general) Assembly Instructions


What is "MAX" referring to in the Intel intrinsics documentation?

Within the Intel intrinsics guide, some operations are defined using the term "MAX". An example is __m256 _mm256_mask_permutexvar_ps (__m256 src, __mmask8 k, __m256i idx, __m256 a), which is defined as
FOR j := 0 to 7
    i := j*32
    id := idx[i+2:i]*32
    IF k[j]
        dst[i+31:i] := a[id+31:id]
    ELSE
        dst[i+31:i] := 0
    FI
ENDFOR
dst[MAX:256] := 0
Please take note of the last line of this definition: dst[MAX:256] := 0. What is MAX referring to, and does this line even add any valuable information? If I had to guess, MAX probably means the number of bits in the vector, which is 256 in the case of _mm256. That, however, does not seem to change anything in the definition of the operation, and it might as well have been omitted. But why is it there then?
This pseudo-code only makes sense for the assembly documentation it was copied from, not for intrinsics. (It's an HTML scrape of Intel's vol. 2 PDF documenting the corresponding vpermps asm instruction.)
...
ENDFOR
DEST[MAXVL-1:VL] ← 0
(The same asm doc entry covers VL = 128, 256, and 512-bit versions, the vector width of the instruction.)
In asm, a YMM register is the low half of a ZMM register, and writing a YMM zeroes the upper bits out to the CPU's max supported vector width (just like writing EAX zero-extends into RAX).
The intrinsic you picked is for the masked version, so it requires AVX-512 (EVEX encoding), thus VLMAX is at least 512 (see footnote 1). If the mask is a constant all-ones, it could get optimized to the AVX2 VEX encoding, but both still zero high bits of the full register out to VLMAX.
This is meaningless for intrinsics
The intrinsics API just has __m256 and __m512 types; an __m256 is not implicitly the low half of an __m512. You can use _mm512_castps256_ps512 to get a __m512 with your __m256 as the low half, but the API documentation says "the upper 256 bits of the result are undefined". So if you use it on a function arg, it doesn't force it to vmovaps ymm7, ymm0 or something to zero-extend into a ZMM register in case the caller left high garbage.
If you use _mm512_castps256_ps512 on a __m256 that came from an intrinsic in this function, it pretty much always will happen to compile with a zeroed high half whether it stayed in a reg or got stored/reloaded, but that's not guaranteed by the API. (If the compiler chose to combine a previous calculation with something else, using a 512-bit operation, you could plausibly end up with a non-zero high half.) If you want high zeros, there's no equivalent to _mm256_set_m128 (__m128 hi, __m128 lo), so you need some other explicit way.
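For example (my own sketch, not part of the original answer, and assuming AVX-512DQ is available), one explicit way to get guaranteed zeros in the upper half is to insert the __m256 into an all-zero __m512:
#include <immintrin.h>
// Sketch: unlike _mm512_castps256_ps512, inserting into an all-zero vector
// guarantees the upper 256 bits of the result are zero.
static inline __m512 zext_ps256_ps512(__m256 lo)
{
    return _mm512_insertf32x8(_mm512_setzero_ps(), lo, 0);  // low half = lo, high half = 0
}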
Footnote 1: Or with some hypothetical future extension, VLMAX aka MAXVL could be even wider. It's determined by the current value of XCR0. This documentation is telling you these instructions will still zero out to whatever that is.
(I haven't looked into whether changing VLMAX is possible on a machine supporting AVX-512, or if it's read-only. IDK how the CPU would handle it if you can change it, like maybe not running 512-bit instructions at all. Mainstream OSes certainly don't do this even if it's possible with privileged operations.)
SSE didn't have any defined mechanism for extension to wider vectors, and some existing code (notably Windows kernel drivers) manually saved/restored a few XMM registers for their own use. To support that, AVX decided that legacy SSE would leave the high part of YMM/ZMM registers unmodified. But to run existing machine code using non-VEX legacy SSE encodings efficiently, it needed expensive state transitions (Haswell and Ice Lake) and/or false dependencies (Skylake): Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
Intel wasn't going to make this mistake again, so they defined AVX as zeroing out to whatever vector width the CPU supports, and document it clearly in every AVX and AVX-512 instruction encoding. Thus VEX and EVEX can be mixed freely, even being useful to save machine-code size:
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?
What is the penalty of mixing EVEX and VEX encoded scheme? (none), with an answer discussing more details of why SSE/AVX penalties are a thing.
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853 Agner Fog's 2008 post on Intel's forums about AVX, when it was first announced, pointing out the problem created by the lack of foresight with SSE.
Does vzeroall zero registers ymm16 to ymm31? - interestingly no; since they're not accessible via legacy SSE instructions, they can't be part of a dirty-uppers problem.
Bits in the registers are numbered with high indices on the “left” and low indices on the “right”. This matches how we write and talk about binary numerals: 10010₂ is the binary numeral for 18, with bit number 4, representing 2⁴ = 16, on the left and bit number 0, representing 2⁰ = 1, on the right.
R[m:n] denotes the set of bits of register R from m down to n, with m being the “left” end of the set and n being the “right” end. If m is less than n, then it is the empty set. Therefore, for registers with 512 bits, dst[511:256] := 0 says to set bits 511 to 256 to zero, and, for registers with 256 bits, dst[255:256] := 0 says to do nothing.
dst[MAX:256] := 0 sets all bits from bit 256 up through bit MAX to zero. It is only relevant for registers wider than 256 bits: on a 256-bit register the assignment does nothing, while on a processor with 512-bit registers it clears the upper 256 bits (bits 511 through 256).

What's the meaning of "each CPU instruction can manipulate 32 bits of data"?

From: https://www.webopedia.com/TERM/R/register.html
The number of registers that a CPU has and the size of each (number of bits) help determine the power and speed of a CPU. For example a 32-bit CPU is one in which each register is 32 bits wide. Therefore, each CPU instruction can manipulate 32 bits of data.
What's the meaning of "each CPU instruction can manipulate 32 bits of data" with respect to the C/C++ programs we write, or the text we type in a text editor?
First, "each CPU instruction can manipulate 32 bits of data" is a (technically incorrect) generalisation. For example, on 32-bit 80x86 there are instructions (e.g. cmpxchg8b, pushad, shrd) and entire extensions (MMX, SSE, AVX) where an instruction can manipulate more than 32 bits of data.
For performance; it's best to think of it as either "amount of work that can be done in a fixed amount of time" or "amount of time to do a fixed amount of work". This can be broken into 2 values - how many instructions you need to do an amount of work and how many instructions can be executed in a fixed amount of time (instructions per second).
Now consider something like adding a pair of 128-bit integers. For a 32-bit CPU this has to be broken down into four 32-bit additions, and might look something like this:
;Do a = a + b
mov eax,[b]
mov ebx,[b+4]
mov ecx,[b+8]
mov edx,[b+12]
add [a],eax
adc [a+4],ebx
adc [a+8],ecx
adc [a+12],edx
In this case "how many instructions you need to do an amount of work" is 8 instructions.
With a 16-bit CPU you need more instructions. For example, it might be more like this:
mov ax,[b]
mov bx,[b+2]
mov cx,[b+4]
mov dx,[b+6]
add [a],ax
mov ax,[b+8]
adc [a+2],bx
mov bx,[b+10]
adc [a+4],cx
mov cx,[b+12]
adc [a+6],dx
mov dx,[b+14]
adc [a+8],ax
adc [a+10],bx
adc [a+12],cx
adc [a+14],dx
In this case "how many instructions you need to do an amount of work" is 16 instructions. With the same "instructions per second" a 16-bit CPU would be half as fast as a 32-bit CPU for this work.
With a 64-bit CPU this work would only need 4 instructions, maybe like this:
mov rax,[b]
mov rbx,[b+8]
add [a],rax
adc [a+8],rbx
In this case, with the same "instructions per second", a 64-bit CPU would be twice as fast as a 32-bit CPU (and 4 times as fast as a 16-bit CPU).
Of course the high level source code would be the same in all cases - the difference is what the compiler generates.
Note that what I've shown here (128-bit integer addition) is a "happy case" - I chose this specifically because it's easy to show how larger registers can reduce/improve "how many instructions you need to do an amount of work" and therefore improve performance (at the same "instructions per second"). For different work you might not get the same improvement. For example, for a function that works with 8-bit integers (e.g. char) "larger than 8-bit registers" might not help at all (and in some cases might make things worse).
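For reference, here's a hedged sketch of the kind of portable C source that could produce the asm above; the same source compiles to more or fewer native-width additions depending on the target:
#include <stdint.h>
// Sketch (not from the original answer): 128-bit addition written as two
// 64-bit halves with manual carry propagation; the compiler splits it
// further on narrower targets.
struct u128 { uint64_t lo, hi; };
void add128(struct u128 *a, const struct u128 *b)
{
    uint64_t old_lo = a->lo;
    a->lo += b->lo;
    a->hi += b->hi + (a->lo < old_lo);  // carry out of the low half
}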
"32-bit" refers to computers, operating systems, or software programs capable of transferring data 32 bits at a time. Computer processors such as the 80386, 80486, and Pentium were 32-bit processors, which means the processor was capable of working with 32-bit binary numbers (decimal numbers up to 4,294,967,295). Anything larger and the computer would need to break the number up into smaller pieces.
A "word" is the size of the basic unit of change.
This CPU, in addition to being a 32-bit CPU, has a 32-bit word. Unless extra CPU cycles are used, the largest "single" item it can change is one 32-bit value.
This doesn't mean that any 32 bits can be changed with one instruction. But if the 32 bits are all part of the same word, they may be changeable in one instruction.

How to set MMX registers in a Windows exception handler to emulate unsupported 3DNow! instructions

I'm trying to revive an old Win32 game that uses the 3DNow! instruction set for its 3D rendering.
On modern OSs like Win7 - Win10 instructions like FPADD or FPMUL are not allowed and the program throws an exception.
Since the number of 3DNow! instructions used by the game is very limited, in my VS2008 MFC program I tried to use vectored exception handling to get the values of the MMX registers, emulate the 3DNow! instructions in C code, and push the values back into the processor's MMX registers.
So far I have succeeded in the first two steps (I get the MMX register values from the ExceptionInfo->ExtendedRegisters byte array at offset 32 and use float-type C code for the calculations), but my problem is that, no matter how I try to update the MMX register values, they seem to stay unchanged.
Assuming that my _asm statements might be wrong, I did also some minimal test using simple statements like this:
_asm movq mm0, mm7
This statement executes without further exceptions, but when I retrieve the MMX register values afterwards, I still find the original values unchanged.
How can I make the assignment effective?
On modern OSs like Win7 - Win10 instructions like FPADD or FPMUL are not allowed
More likely your CPU doesn't support 3DNow! at all. AMD dropped it for the Bulldozer family, and Intel never supported it. So unless you're running modern Windows on an Athlon64 / Phenom (or a Via C3), your CPU doesn't support it.
(Fun fact: PREFETCHW was originally a 3DNow! instruction, and is still supported (with its own CPUID feature bit). For a long time Intel CPUs ran it as a NOP, but Broadwell and later (IIRC) do actually prefetch a cache line into Exclusive state with a Read-For-Ownership.)
Unless this game only ever ran on AMD hardware, it must have a code path that avoids 3DNow. Fix its CPU detection to stop detecting your CPU as having 3DNow. (Maybe you have a recent AMD, and it assumes any AMD has 3DNow?)
(update on that: OP's comments say that the other code paths don't work for some reason. That's a problem.)
Returning from an exception handler probably restores registers from saved state, so it's not surprising that changing register values in the exception handler has no effect on the main program.
Apparently updating ExtendedRegisters in memory doesn't do the trick, though, so that's only a copy of the saved state.
The answer to modifying MMX registers from an exception handler is probably the same as for integer or XMM registers, so look up MS's documentation for that.
Alternative suggestion:
Rewrite the 3DNow code to use SSE2. (You said there's only a tiny amount of it?) SSE2 is baseline for x86-64, and generally safe to assume for 32-bit x86.
Without source, you could still modify the asm for the few functions that use 3DNow. You can literally just change the instructions to use 64-bit loads/stores into XMM registers instead of 3DNow! 64-bit loads/stores, and replace PFMUL with mulps, etc. (This could get slightly hairy if you run out of registers and the 3DNow code used a memory source operand. addps xmm0, [mem] requires 16B-aligned memory, and does a 16 byte load. So you may have to add a spill/reload to borrow another register as a temporary).
If you don't have room to rewrite the functions in-place, put in a jmp to somewhere you do have room to add new code.
Most of the 3DNow instructions have equivalents in SSE, but you may need some extra movaps instructions to copy registers around to implement PFCMPGE. If you can ignore the possibility of NaN, you can use cmpps with a not-less-than predicate. (Without AVX, SSE only has compare predicates based on less-than or not-less-than).
PFSUBR is easy to emulate with a spare register, just copy and subps to reverse. (Or SUBPS and invert the sign with XORPS). PFRCPIT1 (reciprocal-sqrt first iteration of refinement) and so on don't have a single-instruction implementation, but you can probably just use sqrtps and divps if you don't want to implement Newton-Raphson iterations with mulps and addps (or with AVX vfmadd). Modern CPUs are much faster than what this game was designed for.
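As a rough illustration (my sketch, assuming the two floats live in the low 64 bits of an XMM register), a couple of the substitutions described above look like this with SSE intrinsics:
#include <xmmintrin.h>
// PFSUBR dst, src computes dst = src - dst: just reverse the operands.
static inline __m128 emulate_pfsubr(__m128 dst, __m128 src)
{
    return _mm_sub_ps(src, dst);
}
// PFCMPGE dst, src (>= compare producing an all-ones mask), ignoring NaN:
// not-less-than equals greater-or-equal for non-NaN inputs.
static inline __m128 emulate_pfcmpge(__m128 dst, __m128 src)
{
    return _mm_cmpnlt_ps(dst, src);
}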
You can load / store a pair of single-precision floats from/to memory into the bottom 64 bits of an XMM register using movsd (the SSE2 double-precision load/store instruction). You can also store a pair with movlps, but still use movsd for loading because it zeros the upper half instead of merging, so it doesn't have a dependency on the old value of the register.
Use movdq2q mm0, xmm0 and movq2dq xmm0, mm0 to move data between XMM and MMX.
Use movaps xmm1, xmm0 to copy registers, even if your data is only in the low half. (movsd xmm1, xmm0 merges the low half into the original high half. movq xmm1, xmm0 zeros the high half.)
addps and mulps work fine with zeros in the upper half. (They can slow down if any garbage in the upper half produces a denormal result, so prefer keeping the upper half zeroed.) See http://felixcloutier.com/x86/ for an instruction-set reference (and other links in the x86 tag wiki).
Any shuffling of FP data can be done in XMM registers with shufps or pshufd instead of copying back to MMX registers to use whatever MMX shuffles.

When does data move around between SSE registers and the stack?

I'm not exactly sure what happens when I call _mm_load_ps. I mean, I know I load an array of 4 floats into a __m128, which I can use to do SIMD-accelerated arithmetic and then store them back, but isn't this __m128 data type still on the stack? I mean, obviously there aren't enough registers for arbitrary amounts of vectors to be loaded in. So are these 128 bits of data moved back and forth each time you use some SIMD instruction to make computations? If so, then what is the point of _mm_load_ps?
Maybe I have it all wrong?
In just the same way that an int variable may reside in a register or in memory (or even both, at different times), the same is true of an SSE variable such as __m128. If there are sufficient free XMM registers then the compiler will normally try to keep the variable in a register (unless you do something unhelpful, like take the address of the variable), but if there is too much register pressure then some variables may spill to memory.
An Intel processor with SSE, AVX, or AVX-512 can have from 8 to 32 SIMD registers (see below). The number of registers also depends on whether it's 32-bit or 64-bit code. So when you call _mm_load_ps the values are loaded into a SIMD register. If all the registers are in use, then some values will have to be spilled onto the stack.
Exactly like if you have a lot of int or scalar float variables and the compiler can't keep all the currently "live" ones in registers: load/store intrinsics mostly just exist to tell the compiler about alignment, and as an alternative to pointer-casting to other C data types. Not because they have to compile to actual loads or stores, or because those are the only ways for compilers to emit vector load or store instructions.
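As a small illustration (a sketch with made-up function and variable names), the __m128 temporaries here normally live entirely in XMM registers, and the load/store intrinsics mainly tell the compiler that the pointers are 16-byte aligned:
#include <xmmintrin.h>
// add4: dst, a and b are assumed to be 16-byte-aligned arrays of 4 floats.
void add4(float *dst, const float *a, const float *b)
{
    __m128 va = _mm_load_ps(a);      // typically a movaps straight into a register
    __m128 vb = _mm_load_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  // stays in a register; no stack traffic required
    _mm_store_ps(dst, vc);
}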
Processor with SSE
8 128-bit registers labeled XMM0 - XMM7 //32-bit operating mode
16 128-bit registers labeled XMM0 - XMM15 //64-bit operating mode
Processor with AVX/AVX2
8 256-bit registers labeled YMM0 - YMM7 //32-bit operating mode
16 256-bit registers labeled YMM0 - YMM15 //64-bit operating mode
Processor with AVX-512 (2015/2016 servers, Ice Lake laptop, ?? desktop)
8 512-bit registers labeled ZMM0 - ZMM7 //32-bit operating mode
32 512-bit registers labeled ZMM0 - ZMM31 //64-bit operating mode
Wikipedia has a good summary of AVX-512.
(Of course, the compiler can only use x/y/zmm16..31 if you tell it it's allowed to use AVX-512 instructions. Having an AVX-512-capable CPU does you no good when running machine code compiled to work on CPUs with only AVX2.)

How to get the CPU cycle count in x86_64 from C++?

I saw this post on SO which contains C code to get the latest CPU Cycle count:
CPU Cycle count based profiling in C/C++ Linux x86_64
Is there a way I can use this code in C++ (Windows and Linux solutions welcome)? Although it is written in C (and C being a subset of C++), I am not too certain whether this code would work in a C++ project and, if not, how to translate it.
I am using x86-64
EDIT2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for windows....?)
static inline uint64_t get_cycles()
{
    uint64_t t;
    __asm volatile ("rdtsc" : "=A"(t));
    return t;
}
EDIT3:
From the above code I get the error:
"error C2400: inline assembler syntax error in 'opcode'; found 'data type'"
Could someone please help?
Starting from GCC 4.5, the __rdtsc() intrinsic is supported by both MSVC and GCC.
But the include that's needed is different:
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
Here's the original answer before GCC 4.5.
Pulled directly out of one of my projects:
#include <stdint.h>
// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc(){
    return __rdtsc();
}

// Linux/GCC
#else
uint64_t rdtsc(){
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
#endif
This GNU C Extended asm tells the compiler:
volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
"=a"(lo) and "=d"(hi) : the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
((uint64_t)hi << 32) | lo - zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get an actual shift + OR asm instructions, unless the high half optimizes away.
(editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)
https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info
Your inline asm is broken for x86-64. "=A" in 64-bit mode lets the compiler pick either RAX or RDX, not EDX:EAX. See this Q&A for more
You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like @Mysticial's.
(One minor advantage to asm is if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers could do that optimization for you with a uint32_t time_low = __rdtsc() intrinsic, but in practice they sometimes still waste instructions doing shift / OR.)
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.
Intel's intrinsics guide says _rdtsc (with one underscore) is in <immintrin.h>, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>, so we're stuck with <intrin.h> (MSVC) vs. <x86intrin.h> (everything else, including recent ICC). For compat with MSVC, and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.
Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc() as returning (signed) __int64.
// valid C99 and C++
#include <stdint.h> // <cstdint> is preferred in C++, but stdint.h works.
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    uint64_t tsc = __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

// requires a Nehalem or newer CPU. Not Core2 or earlier. IDK when AMD added it.
inline
uint64_t readTSCp() {
    unsigned dummy;
    return __rdtscp(&dummy);  // waits for earlier insns to retire, but allows later to start
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer, including a couple test callers.
These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc(). Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.
The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtsc in immintrin.h, but doesn't have an x86intrin.h at all. More recent ICC have x86intrin.h, at least the way Godbolt installs them for Linux they do.
You might want to define them as signed long long, especially if you want to subtract them and convert to float. int64_t -> float/double is more efficient than uint64_t on x86 without AVX512. Also, small negative results could be possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.
BTW, clang also has a portable __builtin_readcyclecounter() which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs
For more about using lfence (or cpuid) to improve repeatability of rdtsc and control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see @HadiBrais' answer on clflush to invalidate cache line via C function and the comments for an example of the difference it makes.
See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuid to serialize.) It's always been defined as partially-serializing on Intel.
How to Benchmark Code Execution Times on Intel® IA-32 and IA-64
Instruction Set Architectures, an Intel white-paper from 2010.
rdtsc counts reference cycles, not CPU core clock cycles
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock).
The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
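A rough sketch of that advice (the function pointer and iteration counts are arbitrary placeholders): run the code under test for some untimed iterations first so the CPU has ramped up to its steady clock before you sample the TSC.
#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
uint64_t benchmark(void (*fn)(void))
{
    for (int i = 0; i < 10000; ++i)  // untimed warm-up iterations
        fn();
    uint64_t start = __rdtsc();
    fn();                            // timed run
    return __rdtsc() - start;        // reference "cycles", not core clock cycles
}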
Microbenchmarking is hard: see Idiomatic way of performance evaluation? for other pitfalls.
Instead of using the TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and use rdmsr in user-space, or simpler ways include tricks like perf stat for part of a program if your timed region is long enough that you can attach perf stat -p PID.
You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)
Negative clock cycle measurements with back-to-back rdtsc? the history of RDTSC: originally CPUs didn't do power-saving, so the TSC was both real-time and core clocks. Then it evolved through various barely-useful steps into its current form of a useful low-overhead timesource decoupled from core clock cycles (constant_tsc), which doesn't stop when the clock halts (nonstop_tsc). Also some tips, e.g. don't take the mean time, take the median (there will be very high outliers).
std::chrono::clock, hardware clock and cycle count
Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
measuring code execution times in C using RDTSC instruction lists some gotchas, including SMIs (system-management interrupts, which you can't avoid even in kernel mode with cli) and virtualization of rdtsc under a VM. And of course basic stuff like regular interrupts being possible, so repeat your timing many times and throw away outliers.
Determine TSC frequency on Linux. Programmatically querying the TSC frequency is hard and maybe not possible, especially in user-space, or may give a worse result than calibrating it. Calibrating it using another known time-source takes time. See that question for more about how hard it is to convert TSC to nanoseconds (and that it would be nice if you could ask the OS what the conversion ratio is, because the OS already did it at bootup).
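A crude calibration sketch (Linux-only; it assumes glibc and CLOCK_MONOTONIC_RAW, and its accuracy is limited by the sleep and by the clocks themselves):
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>
// Estimate TSC ticks per second against CLOCK_MONOTONIC_RAW over ~100 ms.
double tsc_hz_estimate(void)
{
    struct timespec ts0, ts1, req = {0, 100 * 1000 * 1000};
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts0);
    uint64_t t0 = __rdtsc();
    nanosleep(&req, NULL);
    uint64_t t1 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
    double secs = (ts1.tv_sec - ts0.tv_sec) + (ts1.tv_nsec - ts0.tv_nsec) * 1e-9;
    return (t1 - t0) / secs;
}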
If you're microbenchmarking with RDTSC for tuning purposes, your best bet is to just use ticks and skip even trying to convert to nanoseconds. Otherwise, use a high-resolution library time function like std::chrono or clock_gettime. See faster equivalent of gettimeofday for some discussion / comparison of timestamp functions, or reading a shared timestamp from memory to avoid rdtsc entirely if your precision requirement is low enough for a timer interrupt or thread to update it.
See also Calculate system time using rdtsc about finding the crystal frequency and multiplier.
CPU TSC fetch operation especially in multicore-multi-processor environment says that Nehalem and newer have the TSC synced and locked together for all cores in a package (along with the invariant = constant and nonstop TSC feature). See @amdn's answer there for some good info about multi-socket sync.
(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see @amdn's answer on the linked question, and more details below.)
CPUID features relevant to the TSC
Using the names that Linux /proc/cpuinfo uses for the CPU features, and other aliases for the same feature that you'll also find.
tsc - the TSC exists and rdtsc is supported. Baseline for x86-64.
rdtscp - rdtscp is supported.
tsc_deadline_timer CPUID.01H:ECX.TSC_Deadline[bit 24] = 1 - local APIC can be programmed to fire an interrupt when the TSC reaches a value you put in IA32_TSC_DEADLINE. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.
constant_tsc: Support for the constant TSC feature is determined by checking the CPU family and model numbers. The TSC ticks at constant frequency regardless of changes in core clock speed. Without this, RDTSC does count core clock cycles.
nonstop_tsc: This feature is called the invariant TSC in the Intel SDM manual and is supported on processors with CPUID.80000007H:EDX[8]. The TSC keeps ticking even in deep sleep C-states. On all x86 processors, nonstop_tsc implies constant_tsc, but constant_tsc doesn't necessarily imply nonstop_tsc. No separate CPUID feature bit; on Intel and AMD the same invariant TSC CPUID bit implies both constant_tsc and nonstop_tsc features. See Linux's x86/kernel/cpu/intel.c detection code, and amd.c was similar.
Some (but not all) of the processors based on Saltwell/Silvermont/Airmont even keep the TSC ticking in ACPI S3 full-system sleep: nonstop_tsc_s3. This is called always-on TSC. (Although it seems the ones based on Airmont were never released.)
For more details on constant and invariant TSC, see: Can constant non-invariant tsc change frequency across cpu states?.
tsc_adjust: CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1) The IA32_TSC_ADJUST MSR is available, allowing OSes to set an offset that's added to the TSC when rdtsc or rdtscp reads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)
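A small sketch of checking the invariant-TSC bit mentioned above (CPUID.80000007H:EDX[8]) from C with GCC/Clang's <cpuid.h>; on MSVC you'd use __cpuid from <intrin.h> instead:
#include <cpuid.h>
#include <stdio.h>
// Sketch: leaf 0x80000007, EDX bit 8 = invariant TSC, which implies both
// the constant_tsc and nonstop_tsc behaviours described above.
int main(void)
{
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        puts("invariant TSC: constant frequency, keeps ticking in deep sleep");
    else
        puts("no invariant TSC: rdtsc may track core clock or stop in sleep");
    return 0;
}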
constant_tsc and nonstop_tsc together make the TSC usable as a timesource for things like clock_gettime in user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, the TSC as a timesource may still be usable.
The comments in the Linux source code also indicate that constant_tsc / nonstop_tsc features (on Intel) implies "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"
The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets. Apparently platform vendors do that in practice, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environment also agree that all sockets on a single motherboard should start out in sync.
On a multi-socket shared memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel by default performs boot-time and run-time checks to make sure that the TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource' would tell you whether the kernel is using the TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system. The kernel parameter tsc=reliable can be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.
There are cases where cross-socket TSCs may NOT be in sync: (1) hotplugging a CPU, (2) when the sockets are spread out across different boards connected by extended node controllers, (3) a TSC may not be resynced after waking up from a C-state in which the TSC is powered down in some processors, and (4) different sockets have different CPU models installed.
An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscp produces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)
If you're using rdtsc directly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).
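If you'd rather pin from inside the program, here's a Linux-specific sketch (the choice of core 0 is arbitrary), equivalent in effect to taskset -c 0:
#define _GNU_SOURCE
#include <sched.h>
// Pin the calling thread to core 0; returns 0 on success, -1 on error.
static int pin_to_core0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    return sched_setaffinity(0, sizeof(set), &set);  // pid 0 = calling thread
}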
How efficient is the asm from using the intrinsic?
It's about as good as you'd get from @Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.
A non-inline version of the readTSC function itself compiles with MSVC for x86-64 like this:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.
In a test caller that uses it twice and subtracts to time an interval:
uint64_t time_something() {
    uint64_t start = readTSC();
    // even when empty, back-to-back __rdtsc() don't optimize away
    return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
However, we can maybe get the best of both worlds with a modified version of @Mysticial's code:
// More efficient than __rdtsc() in some cases, but maybe worse in others
uint64_t rdtsc(){
    // long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so an #ifdef would be better if we care about this trick there.
    unsigned long lo, hi;  // let the compiler know that zero-extension to 64 bits isn't required
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) + lo;
    // + allows LEA or ADD instead of OR
}
On Godbolt, this does sometimes give better asm than __rdtsc() for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo). Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)
But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc() function itself with an actual add/adc with zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with | instead of +, but definitely prefer the __rdtsc() intrinsic if you care about 32-bit code-gen from gcc).
VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.
In this case, that's probably just as well -- rdtsc has (at least) two major problems when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).
Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result is entirely possible.
Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.
If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:
xor eax, eax
cpuid
xor eax, eax
cpuid
xor eax, eax
cpuid
rdtsc
; save eax, edx
; code you're going to time goes here
xor eax, eax
cpuid
rdtsc
I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).
Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.
Along with that, you want to use whatever means your OS supplies to force this all to run on one processor/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution speed.
Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you need to do before you can get results from rdtsc that will actually mean anything.
For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:
unsigned __int64 __rdtsc(void);
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross-architecture wrapper for performance events.
This answer is similar to: Quick way to count number of instructions executed in a C program, but with PERF_COUNT_HW_CPU_CYCLES instead of PERF_COUNT_HW_INSTRUCTIONS. This answer will focus on PERF_COUNT_HW_CPU_CYCLES specifics; see that other answer for more generic information.
Here is an example based on the one provided at the end of the man page.
perf_event_open.c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;
    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;
    uint64_t n;

    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));
    printf("%lld\n", count);

    close(fd);
}
The results seem reasonable: e.g., if I count cycles and then recompile to count instructions instead, we get about 1 cycle per iteration (2 instructions completed in a single cycle), possibly due to effects such as superscalar execution, with slightly different results on each run, presumably due to random memory access latencies.
You might also be interested in PERF_COUNT_HW_REF_CPU_CYCLES, which as the manpage documents:
Total cycles; not affected by CPU frequency scaling.
so this will give something closer to the real wall time if your frequency scaling is on. These were 2/3x larger than PERF_COUNT_HW_INSTRUCTIONS on my quick experiments, presumably because my non-stressed machine is frequency scaled now.
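In the listing above, only the event selection would need to change (a minimal tweak shown for illustration):
pe.config = PERF_COUNT_HW_REF_CPU_CYCLES;  // count fixed-frequency reference cycles instead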