Is there a possibility to determine the number of AVX-512 FMA units during runtime using C++?
I already have code to determine whether a CPU is capable of AVX-512, but I cannot determine the number of FMA units.
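For reference, the capability check itself can be as small as the sketch below (assuming GCC or Clang; __builtin_cpu_supports also accounts for the OS having enabled the ZMM state on recent compiler versions; other compilers would need __cpuidex plus an XGETBV check):

#include <iostream>

static bool has_avx512f()
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_cpu_supports("avx512f");   // AVX-512 Foundation
#else
    return false;   // placeholder for non-GCC/Clang toolchains
#endif
}

int main()
{
    std::cout << "AVX-512F supported: " << (has_avx512f() ? "yes" : "no") << '\n';
}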
The Intel® 64 and IA-32 Architectures Optimization Reference Manual (February 2022), Chapter 18.21, "Servers with a Single FMA Unit", contains assembly-language source code that identifies the number of AVX-512 FMA units per core in an AVX-512-capable processor; see Example 18-25. It works by comparing the timing of two functions: one with only FMA instructions, and another that mixes FMA and shuffle instructions.
Intel's optimization manual can be downloaded from: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html#inpage-nav-8.
The source code from this manual is available at: https://github.com/intel/optimization-manual
There isn't a CPUID feature bit for this. Your options include a microbenchmark at startup, or checking the CPU model (from CPUID) against a table of known CPUs. (If you build the table as a cache of microbenchmark results, make sure the microbenchmark is careful to avoid false negatives or false positives, more so than you'd need to be for one run at startup.)
If you have access to HW perf counters, perf stat --all-user -e uops_dispatched_port.port_0,uops_dispatched_port.port_5 on a loop that does mostly FMA instructions could work: existing CPUs with a second 512-bit FMA unit have it on port 5, so if you see counts on that port instead of everything going to port 0, you have two FMA units. You might use a static executable that just contains a vfma... / dec/jne loop for 1000 iterations, so only your instructions run in user-space, making the counts easy to interpret with perf stat.
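An untested sketch of such a workload, written with intrinsics instead of a hand-written static executable (names, flags like g++ -O2 -mavx512f, and the iteration count are all illustrative; the port_0 / port_5 event names are for Skylake-X / Ice Lake-SP-style cores):

#include <immintrin.h>
#include <cstdio>

int main(int argc, char**)
{
    // Non-constant inputs so the compiler can't fold the FMAs away.
    __m512 a = _mm512_set1_ps((float)argc);
    __m512 b = _mm512_set1_ps((float)argc * 0.5f);

    // 8 independent accumulators: enough to sustain 2 FMAs/clock if two
    // 512-bit FMA units exist, so port 5 has to be used when it's there.
    __m512 c0 = _mm512_setzero_ps(), c1 = c0, c2 = c0, c3 = c0;
    __m512 c4 = c0, c5 = c0, c6 = c0, c7 = c0;

    for (long i = 0; i < 100000000; ++i) {
        c0 = _mm512_fmadd_ps(a, b, c0);
        c1 = _mm512_fmadd_ps(a, b, c1);
        c2 = _mm512_fmadd_ps(a, b, c2);
        c3 = _mm512_fmadd_ps(a, b, c3);
        c4 = _mm512_fmadd_ps(a, b, c4);
        c5 = _mm512_fmadd_ps(a, b, c5);
        c6 = _mm512_fmadd_ps(a, b, c6);
        c7 = _mm512_fmadd_ps(a, b, c7);
    }

    // Keep the results live so the loop isn't optimized away.
    __m512 s = _mm512_add_ps(_mm512_add_ps(c0, c1), _mm512_add_ps(c2, c3));
    s = _mm512_add_ps(s, _mm512_add_ps(_mm512_add_ps(c4, c5), _mm512_add_ps(c6, c7)));
    std::printf("%f\n", (double)_mm512_cvtss_f32(s));
}

If roughly half of the FMA uops show up on port 5, there are two 512-bit FMA units; if essentially everything lands on port 0, there is one.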
Intel's version seems like overkill, with some clunky choices
I think you can microbenchmark it without wasting so many cycles waiting for warm-up, by alternating two benchmark loops (YMM and ZMM), if you're careful about it. Intel's version (the GitHub source from their optimization manual) seems like huge overkill, with so many registers and a bunch of useless constants, when they could just use FMA on 0.0 and a shuffle with no control vector, or vpand, or whatever.
It also runs a long warm-up loop, maybe taking multiple milliseconds when you hopefully only need microseconds. I don't have hardware to test on, so I haven't fleshed out the code examples in my suggestion.
Even if you want to use Intel's suggestion more or less unchanged, you can still make it waste less space in your binary by not using so much constant data.
Shuffles like vmovhlps xmm0, xmm0, xmm0 or vpunpckhpd x,x,x run only on port 5, even on Ice Lake and later. ICL/ICX can run some shuffles like pshufd or unpckhqdq on port 1 as well, but not their ZMM versions.
Picking a 1-cycle-latency shuffle is good (so something in-lane, not lane-crossing like vpermd), although you don't even want to create a loop-carried dependency with it, just throughput; i.e. shuffle the same source into multiple destination registers.
Picking something that definitely can't compete with the FMA unit on port 0 is good, so a shuffle is better than vpand. It's probably more future-proof to pick one that can't run on port 1 either. On current CPUs, the vector ALUs on port 1 are shut down while any 512-bit uops are in flight (at least that's the case on Skylake-X). But one could imagine some future CPU where vpshufd xmm or ymm runs on port 1 in the same cycle as vfma...ps zmm instructions run on ports 0 and 5. It's unlikely that the extra shuffle unit on port 1 will get widened to 512-bit soon, though, so perhaps vpunpckhpd zmm30, zmm0, zmm0 is a good choice.
With better design, you can hopefully avoid false results even without long warm-up
Confounding factors include the soft throttling of "heavy" instructions when the current clock speed or voltage is outside the requirements for running them at high throughput. (See also SIMD instructions lowering CPU frequency.)
But waiting for alternating benchmarks to settle to nearly 1:1 or 2:1 should work, as long as you're careful not to be thrown off by clock-speed changes in the middle of one run (e.g. compare each run against the previous run of the same test, as well as checking the ratio against the other test).
Ideally you could run this early enough in program startup that this core might still be at idle clock speed, although depending on what started the process, it might be at max turbo, above what it's willing to run 512-bit instructions with.
Intel's version runs all of one test, then all of the other, just assuming that the warm-up is sufficient and that scheduling competition from other loads didn't distort either run.
Test methods
You could do a quick throughput test on startup, timing with rdtsc. vmulps is easy to make independent since it only has 2 inputs, and is correlated with vfma... throughput on all CPUs so far. (Unlike vaddps zmm which is 0.5c throughput on Alder Lake P-cores (with AVX-512-enabled microcode) even though they only have 1c mul/fma. https://uops.info/. Presumably Sapphire Rapids will be the same for versions with 1x 512-bit FMA unit.)
It might be sufficient to do these steps in order, timing each step with lfence; rdtsc; lfence so you can use short benchmark intervals without having out-of-order exec read the TSC while earlier work is still in flight. (A rough sketch follows after the list.)
vaddps zmm1, zmm1, zmm1 to make sure ZMM1 was written with a uop of the appropriate type, to avoid weird latency effects.
times 3 vmulps zmm0, zmm1, zmm1 in a loop for maybe 100 iterations (thus a 4 uop loop since dec ecx/jnz will macro-fuse, no front-end bottleneck on Skylake-X). If you want, you could write 3 different ZMM registers, but writing ZMM0 3 times is fine.
times 3 vmulps ymm0, ymm1, ymm1 in a loop for maybe 100 iterations
times 3 vmulps zmm0, zmm1, zmm1 in a loop again.
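A rough, untested sketch of those steps (GNU inline asm, AT&T syntax; helper names are made up, the iteration count of 100 is the illustrative number from the list, and no AVX-512 compile flags should be needed since the 512-bit instructions live in the asm text):

#include <x86intrin.h>
#include <cstdint>
#include <cstdio>

static inline uint64_t fenced_rdtsc()
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

static uint64_t time_mul_zmm(uint64_t iters)   // 3 independent 512-bit multiplies per iteration
{
    uint64_t t0 = fenced_rdtsc();
    asm volatile(
        "vxorps  %%xmm1, %%xmm1, %%xmm1 \n\t"
        "vaddps  %%zmm1, %%zmm1, %%zmm1 \n\t"   // write ZMM1 with a 512-bit uop first
        "1: \n\t"
        "vmulps  %%zmm1, %%zmm1, %%zmm0 \n\t"
        "vmulps  %%zmm1, %%zmm1, %%zmm0 \n\t"
        "vmulps  %%zmm1, %%zmm1, %%zmm0 \n\t"
        "dec     %0 \n\t"
        "jnz     1b"
        : "+r"(iters) : : "xmm0", "xmm1", "cc");
    return fenced_rdtsc() - t0;
}

static uint64_t time_mul_ymm(uint64_t iters)   // same loop with 256-bit registers
{
    uint64_t t0 = fenced_rdtsc();
    asm volatile(
        "vxorps  %%xmm1, %%xmm1, %%xmm1 \n\t"
        "1: \n\t"
        "vmulps  %%ymm1, %%ymm1, %%ymm0 \n\t"
        "vmulps  %%ymm1, %%ymm1, %%ymm0 \n\t"
        "vmulps  %%ymm1, %%ymm1, %%ymm0 \n\t"
        "dec     %0 \n\t"
        "jnz     1b"
        : "+r"(iters) : : "xmm0", "xmm1", "cc");
    return fenced_rdtsc() - t0;
}

int main()
{
    uint64_t z1 = time_mul_zmm(100);
    uint64_t y  = time_mul_ymm(100);
    uint64_t z2 = time_mul_zmm(100);
    std::printf("zmm: %llu  ymm: %llu  zmm again: %llu (TSC counts)\n",
                (unsigned long long)z1, (unsigned long long)y, (unsigned long long)z2);
}

If the two ZMM runs agree and take about the same time as the YMM run, ZMM multiply throughput is 2 per clock, i.e. two 512-bit FMA units; about twice the YMM time means one. Repeat / alternate the runs until the ratios settle, as described above.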
If the two ZMM runs match within maybe 10%, you're done, and you can assume the CPU frequency was already warmed up before the first run, but only to the AVX-512 "heavy" turbo limit or lower.
But that likely won't be the case unless you were able to do some useful startup work before this using "heavy" AVX-512 instructions. That would be the ideal case, taking at worst a small penalty during work your program already needs to do, before the benchmark runs.
The reference frequency might be significantly different from the actual core clock frequency the CPU can sustain, so unfortunately you can't just repeat this until you see close to 1 or 2 MULs per RDTSC count. e.g. an i5-1035 Ice Lake client chip has TSC = 1.5 GHz and base = 1.1 GHz as reported by BeeOnRope (max turbo 3.7 GHz). His results are 0.1 GHz higher than what Intel says the base and max turbo are, but I assume the point still stands that AVX-512 heavy instructions don't tend to make it run anywhere near the TSC frequency. In a VM environment after migration from different hardware, it's also possible for RDTSC counts to be transparently scaled and offset (with hardware support).
No "client" CPUs have 2x 512-bit FMA units (yet)
In "client" CPUs, so far only some Skylake-X CPUs have 2 FMA units. (At least the "client" Ice Lake, Rocket Lake, and Alder Lake CPUs tested by https://uops.info/ only have 1c throughput FMA for 512-bit ZMM.)
But (some?) Ice Lake server CPUs have 0.5c FMA ZMM throughput, so Intel hasn't given up on it. Including for example the Xeon Gold 6330 (IceLake-SP) that instlatx64 tested with 0.5c VFMADD132PS zmm, zmm, zmm throughput, same as xmm/ymm.
Related
I heard there is an Intel book online which describes the CPU cycles needed for specific assembly instructions, but I cannot find it (after trying hard). Could anyone show me how to find the CPU cycles, please?
Here is an example: in the code below, mov/lock is 1 CPU cycle, and xchg is 3 CPU cycles.
// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress, int nValue)
{
    __asm
    {
        mov edx, dword ptr [pTargetAddress]
        mov eax, nValue
        lock xchg eax, dword ptr [edx]
    }
    // mov = 1 CPU cycle
    // lock = 1 CPU cycle
    // xchg = 3 CPU cycles
    // (the old value is returned in EAX; MSVC allows omitting the return
    //  statement when inline asm leaves the result there)
}
#endif // WIN32
BTW: here is the URL for the code I posted: http://www.codeproject.com/KB/threads/spinlocks.aspx
Modern CPUs are complex beasts, using pipelining, superscalar execution, and out-of-order execution among other techniques which make performance analysis difficult... but not impossible!
While you can no longer simply add together the latencies of a stream of instructions to get the total runtime, you can still get an (often) highly accurate analysis of the behavior of some piece of code (especially a loop), as described below and in other linked resources.
Instruction Timings
First, you need the actual timings. These vary by CPU architecture, but the best resource currently for x86 timings is Agner Fog's instruction tables. Covering no fewer than thirty different microarchitectures, these tables list the instruction latency, which is the minimum/typical time that an instruction takes from its inputs being ready to its output being available. In Agner's words:
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NAN's and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
So, for example, the add instruction has a latency of one cycle, so a series of dependent add instructions, as shown, will have a latency of 1 cycle per add:
add eax, eax
add eax, eax
add eax, eax
add eax, eax # total latency of 4 cycles for these 4 adds
Note that this doesn't mean that add instructions will only take 1 cycle each. For example, if the add instructions were not dependent, it is possible that on modern chips all 4 add instructions can execute independently in the same cycle:
add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx # these 4 instructions might all execute in parallel, in a single cycle
Agner provides a metric which captures some of this potential parallelism, called reciprocal throughput:
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
For add this is listed as 0.25 meaning that up to 4 add instructions can execute every cycle (giving a reciprocal throughput of 1 / 4 = 0.25).
The reciprocal throughput number also gives a hint at the pipelining capability of an instruction. For example, on most recent x86 chips, the common forms of the imul instruction have a latency of 3 cycles, and internally only one execution unit can handle them (unlike add which usually has four add-capable units). Yet the observed throughput for a long series of independent imul instructions is 1/cycle, not 1 every 3 cycles as you might expect given the latency of 3. The reason is that the imul unit is pipelined: it can start a new imul every cycle, even while the previous multiplication hasn't completed.
This means a series of independent imul instructions can run at up to 1 per cycle, but a series of dependent imul instructions will run at only 1 every 3 cycles (since the next imul can't start until the result from the prior one is ready).
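If you want to see that distinction on your own machine, something like the following quick-and-dirty sketch works (GCC/Clang inline asm; __rdtsc counts reference cycles, so treat only the ratio of the two printed numbers as meaningful, and ignore the first run if the CPU hasn't ramped up its clock yet):

#include <x86intrin.h>
#include <cstdint>
#include <cstdio>

// One long chain of dependent imuls: limited by the ~3-cycle latency.
static uint64_t dependent_imuls(uint64_t n)
{
    uint64_t x = 3;
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < n; ++i)
        asm volatile("imul %0, %0" : "+r"(x) : : "cc");   // each imul needs the previous result
    return __rdtsc() - t0;
}

// Three independent chains: limited by the 1-per-cycle throughput instead.
static uint64_t independent_imuls(uint64_t n)
{
    uint64_t a = 3, b = 5, c = 7;
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < n; ++i)
        asm volatile("imul %0, %0 \n\t imul %1, %1 \n\t imul %2, %2"
                     : "+r"(a), "+r"(b), "+r"(c) : : "cc");
    return __rdtsc() - t0;
}

int main()
{
    const uint64_t n = 100000000;
    std::printf("dependent:   %.2f TSC counts per imul\n", (double)dependent_imuls(n) / n);
    std::printf("independent: %.2f TSC counts per imul\n", (double)independent_imuls(n) / (3.0 * n));
}

On a typical modern x86 the first number should come out around 3x the second, matching the 3-cycle latency vs. 1-per-cycle throughput described above.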
So with this information, you can start to see how to analyze instruction timings on modern CPUs.
Detailed Analysis
Still, the above is only scratching the surface. You now have multiple ways of looking at a series of instructions (latency or throughput) and it may not be clear which to use.
Furthermore, there are other limits not captured by the above numbers, such as the fact that certain instructions compete for the same resources within the CPU, and restrictions in other parts of the CPU pipeline (such as instruction decoding) which may result in a lower overall throughput than you'd calculate just by looking at latency and throughput. Beyond that, you have factors "beyond the ALUs" such as memory access and branch prediction: entire topics unto themselves - you can mostly model these well, but it takes work. For example here's a recent post where the answer covers in some detail most of the relevant factors.
Covering all the details would increase the size of this already long answer by a factor of 10 or more, so I'll just point you to the best resources. Agner Fog has an Optimizing Assembly guide that covers in detail the precise analysis of a loop with a dozen or so instructions. See "12.7 An example of analysis for bottlenecks in vector loops", which starts on page 95 in the current version of the PDF.
The basic idea is that you create a table, with one row per instruction, and mark the execution resources each one uses. This lets you see any throughput bottlenecks. In addition, you need to examine the loop for carried dependencies, to see if any of those limit the throughput (see "12.16 Analyzing dependencies" for a complex case). For example, a loop body of one 256-bit load, one FMA accumulating into ymm1, and a fused dec/jnz puts no more than one uop per iteration on any port (a load port, an FMA port, and the taken-branch port), but the loop-carried dependency through ymm1, with 4-cycle FMA latency on Skylake, caps it at one iteration per 4 cycles unless you use several accumulators.
If you don't want to do it by hand, Intel has released the Intel Architecture Code Analyzer, which is a tool that automates this analysis. It currently hasn't been updated beyond Skylake, but the results are still largely reasonable for Kaby Lake since the microarchitecture hasn't changed much and therefore the timings remain comparable. This answer goes into a lot of detail and provides example output, and the user's guide isn't half bad (although it is out of date with respect to the newest versions).
Other sources
Agner usually provides timings for new architectures shortly after they are released, but you can also check out instlatx64 for similarly organized timings in the InstLatX86 and InstLatX64 results. The results cover a lot of interesting old chips, and new chips usually show up fairly quickly. The results are mostly consistent with Agner's, with a few exceptions here and there. You can also find memory latency and other values on this page.
You can even get the timing results directly from Intel in their IA32 and Intel 64 optimization manual in Appendix C: INSTRUCTION LATENCY AND THROUGHPUT. Personally I prefer Agner's version because they are more complete, often arrive before the Intel manual is updated, and are easier to use as they provide a spreadsheet and PDF version.
Finally, the x86 tag wiki has a wealth of resources on x86 optimization, including links to other examples of how to do a cycle accurate analysis of code sequences.
If you want a deeper look into the type of "dataflow analysis" described above, I would recommend A Whirlwind Introduction to Data Flow Graphs.
Given pipelining, out-of-order processing, microcode, multi-core processors, etc., there's no guarantee that a particular section of assembly code will take exactly x CPU cycles / clock cycles / whatever cycles.
If such a reference exists, it will only be able to provide broad generalizations for a particular architecture, and depending on how the microcode is implemented you may find that the Pentium M is different from the Core 2 Duo, which is different from the AMD dual core, etc.
Note that this article was updated in 2000, and written earlier. Even the Pentium 4 is hard to pin down regarding instruction timing; the PIII, PII, and the original Pentium were easier, and the texts referenced were probably based on those earlier processors, which had more well-defined instruction timing.
These days people generally use statistical analysis for code timing estimation.
What the other answers say about it being impossible to accurately predict the performance of code running on a modern CPU is true, but that doesn't mean the latencies are unknown, or that knowing them is useless.
The exact latencies for Intel's and AMD's processors are listed in Agner Fog's instruction tables. See also the Intel® 64 and IA-32 Architectures Optimization Reference Manual, and Instruction latencies and throughput for AMD and Intel x86 processors (from Can Berk Güder's now-deleted link-only answer). AMD also has PDF manuals on their own website with their official values.
For (micro-)optimizing tight loops, knowing the latencies for each instruction can help a lot in manually trying to schedule your code. The programmer can make a lot of optimizations that the compiler can't (because the compiler can't guarantee it won't change the meaning of the program).
Of course, this still requires you to know a lot of other details about the CPU, such as how deeply pipelined it is, how many instructions it can issue per cycle, number of execution units and so on. And of course, these numbers vary for different CPU's. But you can often come up with a reasonable average that more or less works for all CPU's.
It's worth noting, though, that it is a lot of work to optimize even a few lines of code at this level, and it is easy to make something that turns out to be a pessimization. Modern CPUs are hugely complicated, and they try extremely hard to get good performance out of bad code. But there are also cases they're unable to handle efficiently, or where you think you're being clever and making efficient code and it turns out to slow the CPU down.
Edit
Looking in Intel's optimization manual, table C-13:
The first column is instruction type, then there is a number of columns for latency for each CPUID. The CPUID indicates which processor family the numbers apply to, and are explained elsewhere in the document. The latency specifies how many cycles it takes before the result of the instruction is available, so this is the number you're looking for.
The throughput columns show how many of this type of instructions can be executed per cycle.
Looking up xchg in this table, we see that depending on the CPU family, it takes 1-3 cycles, and a mov takes 0.5-1. These are for the register-to-register forms of the instructions, not for a lock xchg with memory, which is a lot slower. More importantly, it has hugely variable latency and impact on surrounding code (much slower when there's contention with another core), so looking only at the best case is a mistake. (I haven't looked up what each CPUID means, but I assume the 0.5 values are for the Pentium 4, which ran some components of the chip at double speed, allowing it to do things in half cycles.)
I don't really see what you plan to use this information for, but if you know the exact CPU family the code is running on, then adding up the latencies tells you the minimum number of cycles required to execute this sequence of instructions.
Measuring and counting CPU-cycles does not make sense on the x86 anymore.
First off, ask yourself for which CPU you're counting cycles. A Core 2? An Athlon? A Pentium M? An Atom? All these CPUs execute x86 code, but they all have different execution times. Execution even varies between different steppings of the same CPU.
The last x86 where cycle-counting made sense was the Pentium-Pro.
Also consider that inside the CPU most instructions are decoded into micro-ops and executed out of order by an internal execution engine that does not even remotely look like x86. The performance of a single CPU instruction depends on how many resources in that internal execution engine are available.
So the time for an instruction depends not only on the instruction itself but also on the surrounding code.
Anyway: you can estimate the throughput, resource usage, and latency of instructions for different processors. The relevant information can be found on the Intel and AMD sites.
Agner Fog has a very nice summary on his web site. See the instruction tables for latency, throughput, and uop counts. See the microarchitecture PDF to learn how to interpret those.
http://www.agner.org/optimize
But note that xchg-with-memory does not have predictable performance, even if you look at only one CPU model. Even in the no-contention case with the cache line already hot in L1d cache, being a full memory barrier means its impact depends a lot on the loads and stores to other addresses in the surrounding code.
Btw - since your example code is a basic building block of a lock-free data structure: have you considered using the compiler built-in (intrinsic) functions? On Win32 you can include intrin.h and use functions such as _InterlockedExchange.
That'll give you better execution time because the compiler can inline the instructions. Inline-assembler always forces the compiler to disable optimizations around the asm-code.
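For example, the spinlock primitive from the question could be written with the intrinsic roughly like this (a sketch; note that _InterlockedExchange operates on a long, so the lock variable's type changes accordingly):

#include <intrin.h>

// Same test-and-set as the inline-asm version, but inlinable and usable on x64 too.
inline long TestAndSet(volatile long* pTargetAddress, long nValue)
{
    return _InterlockedExchange(pTargetAddress, nValue);   // an implicitly locked xchg
}

The compiler typically emits the same locked xchg, but it can inline and schedule it instead of treating it as an opaque __asm block.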
lock xchg eax, dword ptr [edx]
Note that the lock locks memory for the memory access across all cores; this can take 100 cycles on some multi-core chips, and a cache line will also need to be flushed. It will also stall the pipeline. So I wouldn't worry about the rest.
So optimal performance comes back to tuning your algorithm's critical regions.
Note that on a single core you can optimize this by removing the lock, but it is needed for multi-core.
I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use intrinsic instruction _mm256_loadu_pd; the code I've written is:
__m256d d1 = _mm256_loadu_pd(vInOut + i*4);
I've then compiled with options -O3 -mavx -g and subsequently used objdump to get the assembler code plus annotated source and line info (objdump -S -M intel -l avx.obj). When I look into the underlying assembler code, I find the following:
vmovupd xmm0,XMMWORD PTR [rsi+rax*1]
vinsertf128 ymm0,ymm0,XMMWORD PTR [rsi+rax*1+0x10],0x1
I was expecting to see this:
vmovupd ymm0,XMMWORD PTR [rsi+rax*1]
and fully use the 256-bit register (ymm0); instead it looks like gcc has decided to load one 128-bit half into xmm0 and then load the other half with vinsertf128.
Is someone able to explain this?
Equivalent code is getting compiled with a single vmovupd in MSVC VS 2012.
I'm running gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0 on Ubuntu 18.04 x86-64.
GCC's default tuning (-mtune=generic) includes -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, because that gives a minor speedup on some CPUs (e.g. first-gen Sandybridge, and some AMD CPUs) in some cases when memory is actually misaligned at runtime.
Use -O3 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store if you don't want this, or better, use -mtune=haswell. Or use -march=native to optimize for your own computer. There's no "generic-avx2" tuning. (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html).
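A minimal reproduction of the difference (assuming a GCC version whose -mtune=generic still splits, e.g. GCC 7/8):

#include <immintrin.h>

__m256d load(const double* p)
{
    return _mm256_loadu_pd(p);   // unaligned 256-bit load
}

With g++ -O3 -mavx this compiles to the vmovupd xmm / vinsertf128 pair shown in the question; adding -mno-avx256-split-unaligned-load, -mtune=haswell, or -march=native (on an AVX2 machine) turns it into a single 256-bit vmovupd.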
Intel Sandybridge runs 256-bit loads as a single uop that takes 2 cycles in a load port. (Unlike AMD which decodes all 256-bit vector instructions as 2 separate uops.) Sandybridge has a problem with unaligned 256-bit loads (if the address is actually misaligned at runtime). I don't know the details, and haven't found much specific info on exactly what the slowdown is. Perhaps because it uses a banked cache, with 16-byte banks? But IvyBridge handles 256-bit loads better and still has banked cache.
According to the GCC mailing list message about the code that implements the option (https://gcc.gnu.org/ml/gcc-patches/2011-03/msg01847.html), "It speeds up some SPEC CPU 2006 benchmarks by up to 6%." (I think that's for Sandybridge, the only Intel AVX CPU that existed at the time.)
But if memory is actually 32-byte aligned at runtime, this is pure downside even on Sandybridge and most AMD CPUs [1]. So with this tuning option, you potentially lose just from failing to tell your compiler about alignment guarantees. And if your loop runs on aligned memory most of the time, you'd better compile at least that compilation unit with -mno-avx256-split-unaligned-load or with tuning options that imply it.
Splitting in software imposes the cost all the time. Letting hardware handle it makes the aligned case perfectly efficient (except stores on Piledriver [1]), with the misaligned case possibly slower than with software splitting on some CPUs. So it's the pessimistic approach, and makes sense if it's really likely that the data really is misaligned at runtime, rather than just not guaranteed to always be aligned at compile time. e.g. maybe you have a function that's called most of the time with aligned buffers, but you still want it to work for rare / small cases where it's called with misaligned buffers. In that case, a split-load/store strategy is inappropriate even on Sandybridge.
It's common for buffers to be 16-byte aligned but not 32-byte aligned because malloc on x86-64 glibc (and operator new in libstdc++) returns 16-byte-aligned buffers (because alignof(max_align_t) == 16). For large buffers, the pointer is normally 16 bytes after the start of a page, so it's always misaligned for alignments larger than 16. Use aligned_alloc instead.
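For example, a hypothetical helper along these lines gives you storage that 256-bit loads/stores can treat as aligned (C++17; note that aligned_alloc wants the size to be a multiple of the alignment):

#include <cstdlib>
#include <cstddef>

double* alloc_doubles_avx(std::size_t n)
{
    std::size_t bytes = (n * sizeof(double) + 31) & ~static_cast<std::size_t>(31);  // round up to 32
    return static_cast<double*>(std::aligned_alloc(32, bytes));                     // free with std::free
}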
Note that -mavx and -mavx2 don't change tuning options at all: gcc -O3 -mavx2 still tunes for all CPUs, including ones that can't actually run AVX2 instructions. This is pretty dumb, because you should use a single unaligned 256-bit load if tuning for "the average AVX2 CPU". Unfortunately gcc has no option to do that, and -mavx2 doesn't imply -mno-avx256-split-unaligned-load or anything. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 for feature requests to have instruction-set selection influence tuning.
This is why you should use -march=native to make binaries for local use, or maybe -march=sandybridge -mtune=haswell to make binaries that can run on a wide range of machines, but will probably mostly run on newer hardware that has AVX. (Note that even Skylake Pentium/Celeron CPUs don't have AVX or BMI2; probably on CPUs with any defects in the upper half of 256-bit execution units or register files, they disable decoding of VEX prefixes and sell them as low-end Pentium.)
gcc8.2's tuning options are as follows. (-march=x implies -mtune=x). https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html.
I checked on the Godbolt compiler explorer by compiling with -O3 -fverbose-asm and looking at the comments which include a full dump of all implied options. I included _mm256_loadu/storeu_ps functions, and a simple float loop that can auto-vectorize, so we can also look at what the compiler does.
Use -mprefer-vector-width=256 (gcc8) or -mno-prefer-avx128 (gcc7 and earlier) to override tuning options like -mtune=bdver3 and get 256-bit auto-vectorization if you want, instead of only with manual vectorization.
default / -mtune=generic: both -mavx256-split-unaligned-load and -store. Arguably less and less appropriate as Intel Haswell and later become more common, and the downside on recent AMD CPUs is I think still small. Especially splitting unaligned loads, which AMD tuning options don't enable.
-march=sandybridge and -march=ivybridge: split both. (I think I've read that IvyBridge improved handling of unaligned 256-bit loads or stores, so it's less appropriate for cases where the data might be aligned at runtime.)
-march=haswell and later: neither splitting option enabled.
-march=knl: neither splitting option enabled. (Silvermont/Atom don't have AVX)
-mtune=intel: neither splitting option enabled. Even with gcc8, auto-vectorization with -mtune=intel -mavx chooses to reach an alignment boundary for the read/write destination array, unlike gcc8's normal strategy of just using unaligned. (Again, another case of software handling that always has a cost vs. letting the hardware deal with the exceptional case.)
-march=bdver1 (Bulldozer): -mavx256-split-unaligned-store, but not loads.
It also sets -mprefer-avx128 (gcc7 and earlier; the gcc8 equivalent is -mprefer-vector-width=128): auto-vectorization will only use 128-bit AVX, but of course intrinsics can still use 256-bit vectors.
-march=bdver2 (Piledriver), bdver3 (Steamroller), bdver4 (Excavator): same as Bulldozer. They auto-vectorize an FP a[i] += b[i] loop with software prefetch and enough unrolling to only prefetch once per cache line!
-march=znver1 (Zen): -mavx256-split-unaligned-store but not loads, still auto-vectorizing with only 128-bit, but this time without SW prefetch.
-march=btver2 (AMD Fam16h, aka Jaguar): neither splitting option enabled, auto-vectorizing like Bulldozer-family with only 128-bit vectors + SW prefetch.
-march=eden-x4 (Via Eden with AVX2): neither splitting option enabled, but the -march option doesn't even enable -mavx, and auto-vectorization uses movlps / movhps 8-byte loads, which is really dumb (at least use movsd instead of movlps to break the false dependency). But if you enable -mavx, it uses 128-bit unaligned loads. Really weird / inconsistent behaviour here, unless there's some strange front-end reason for it.
These split-unaligned options are enabled as part of -march=sandybridge, for example, and presumably also for Bulldozer-family (-march=bdver2 is Piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.
Footnote 1: AMD Piledriver has a performance bug that makes 256-bit store throughput terrible: even vmovaps [mem], ymm aligned stores run at one per 17 to 20 clocks, according to Agner Fog's microarch pdf (https://agner.org/optimize/). This effect isn't present in Bulldozer or Steamroller/Excavator.
Agner Fog says 256-bit AVX throughput in general (not loads/stores specifically) on Bulldozer/Piledriver is typically worse than 128-bit AVX, partly because it can't decode instructions in a 2-2 uop pattern. Steamroller makes 256-bit close to break-even (if it doesn't cost extra shuffles). But register-register vmovaps ymm instructions still only benefit from mov-elimination for the low 128 bits on Bulldozer-family.
But closed-source software or binary distributions typically don't have the luxury of building with -march=native on every target architecture, so there's a tradeoff when making a binary that can run on any AVX-supporting CPU. Gaining big speedup with 256-bit code on some CPUs is typically worth it as long as there aren't catastrophic downsides on other CPUs.
Splitting unaligned loads/stores is an attempt to avoid big problems on some CPUs. It costs extra uop throughput, and extra ALU uops, on recent CPUs. But at least vinsertf128 ymm, [mem], 1 doesn't need the shuffle unit on port 5 on Haswell/Skylake: it can run on any vector ALU port. (And it doesn't micro-fuse, so it costs 2 uops of front-end bandwidth.)
PS:
Most code isn't compiled by bleeding-edge compilers, so changing the "generic" tuning now will take a while before code compiled with the updated tuning gets into use. (Of course, most code is compiled with just -O2 or -O3, and this option only affects AVX code-gen anyway. But many people unfortunately use -O3 -mavx2 instead of -O3 -march=native, so they can miss out on FMA, BMI1/2, popcnt, and other things their CPU supports.)
GCC's generic tuning splits unaligned 256-bit loads to help older processors. (Subsequent changes avoid splitting loads in generic tuning, I believe.)
You can tune for more recent Intel CPUs using something like -mtune=intel or -mtune=skylake, and you will get a single instruction, as intended.
I'm writing a benchmark for a school project. It's very simple but I am wondering, in real life, what are the typical weights used for the various types of benchmarks? For instance, if I am combining an integer test, a cache test, a floating point test, should they be equally weighted in the final "score"? My hunch is that for many things, the cache test matters more than raw arithmetic, and that for many things, the RAM speed is a big factor. Is there a consensus?
There is no universal set of weights.
Different real-world workloads have different bottlenecks, or different weightings.
There is no single number that can tell you how fast a computer is. It's possible (and happens in real life) that program X runs faster on computer A than on computer B, while program Y runs faster on computer B.
Choosing a set of weights for microbenchmarks totally comes down to what you want your number to mean, and what kind of workload you want it to be a rough indicator for.
e.g. a dense matmul can usually saturate FMA execution unit throughput because it does O(N^3) work over N^2 data. With careful cache-blocking you can get mostly L1d cache hits, and avoid doing more than 1 SIMD vector load per FMA. DRAM / cache bandwidth has to be high enough to keep up, but most of the stores/reloads hit in L1d cache (which of course also has to be able to keep up).
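As a concrete illustration of that cache-blocking idea, a scalar sketch might look like this (tile size and row-major layout are assumptions; a real kernel would additionally vectorize, use multiple FMA accumulators, and pick the tile size to fit L1d/L2):

#include <algorithm>
#include <cstddef>

// C += A * B for row-major N x N matrices, processed in T x T tiles so each
// tile's data stays hot in cache while O(T^3) multiply-adds are done on it.
void matmul_blocked(const float* A, const float* B, float* C, std::size_t N)
{
    const std::size_t T = 64;   // tile edge (illustrative)
    for (std::size_t ii = 0; ii < N; ii += T)
        for (std::size_t kk = 0; kk < N; kk += T)
            for (std::size_t jj = 0; jj < N; jj += T)
                for (std::size_t i = ii; i < std::min(ii + T, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + T, N); ++k) {
                        float a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + T, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}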
But other workloads might bottleneck on memory bandwidth or latency and not care about FPU throughput at all. e.g. first-gen AMD Ryzen can do 2x 128-bit FMA per clock while Intel Haswell and later can do 2x 256-bit FMA per clock. But Ryzen is faster or nearly equal clock-for-clock for some other workloads.
And on multi-core systems some programs are single-threaded and care only about single-core throughput, while others scale well and get a big speedup on a machine with lots of slower cores. Or they might care about inter-core latency vs. aggregate memory bandwidth.
While testing the work of custom heap manager (to replace system one) I have encountered some slowdowns in comparison to system heap.
I used AMD CodeAnalyst for profiling an x64 application on Windows 7, on an Intel Xeon CPU E5-1620 v2 @ 3.70 GHz, and got the following results:
This block consumes about 90% of the time for the whole application run. We can see a lot of time spent on "cmp [rsp+18h], rax" and "test eax, eax" but no time spent on jumps right below the compares. Is it ok that jumps take no time? Is it because of branch prediction mechanism?
I changed the clause to the opposite, and here is what I got (the results are a bit different in absolute numbers because I manually stopped the profiling sessions - but a lot of time is still taken by the compares):
There are so many calls to these compares that they become a bottleneck... That is how I interpret these results. And probably the best optimization is reworking the algorithm, right?
Intel and AMD CPUs both macro-fuse cmp/jcc pairs into a single compare-and-branch uop (Intel) or macro-op (AMD). Intel SnB-family CPUs like yours can do this with some instructions that also write an output register, like and, sub/add, inc/dec.
To really understand profiling data, you have to understand something about how the out-of-order pipeline works in the microarch you're tuning on. See the links at the x86 tag wiki, especially Agner Fog's microarch pdf.
You should also beware that profiling cycle counts can get charged to the instruction that's waiting for results, not the instruction that is slow to produce them.
I often see code that converts ints to doubles, then doubles to ints, and back once again (sometimes for good reasons, sometimes not), and it just occurred to me that this seems like a "hidden" cost in my program. Let's assume the conversion method is truncation.
So, just how expensive is it? I'm sure it varies depending on hardware, so let's assume a newish Intel processor (Haswell, if you like, though I'll take anything). Some metrics I'd be interested in (though a good answer needn't have all of them):
# of generated instructions
# of cycles used
Relative cost compared to basic arithmetic operations
I would also assume that the way we would most acutely experience the impact of a slow conversion would be with respect to power usage rather than execution speed, given the difference in how many computations we can perform each second relative to how much data can actually arrive at the CPU each second.
Here's what I could dig up myself, for x86-64 doing FP math with SSE2 (not legacy x87 where changing the rounding mode for C++'s truncation semantics was expensive):
When I take a look at the generated assembly from clang and gcc, the cast from double to int boils down to one instruction: cvttsd2si.
From int to double it's cvtsi2sd. (cvtsi2sdl is the AT&T mnemonic for cvtsi2sd with 32-bit operand-size.)
With auto-vectorization, we get cvtdq2pd.
So I suppose the question becomes: what is the cost of those?
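To make that concrete, here are the conversions in question with the instructions GCC/Clang typically emit for them at -O2 on x86-64 shown as comments (just a sketch to connect the C++ to the instruction names):

double int_to_double(int i)
{
    return i;                      // cvtsi2sd
}

int double_to_int(double d)
{
    return static_cast<int>(d);    // cvttsd2si (truncating)
}

void widen_all(const int* in, double* out, int n)
{
    for (int k = 0; k < n; ++k)
        out[k] = in[k];            // auto-vectorizes to cvtdq2pd on packed elements
}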
These instructions each cost approximately the same as an FP addsd plus a movq xmm, r64 (fp <- integer) or movq r64, xmm (integer <- fp), because they decode to 2 uops that run on those same ports, on mainstream (Sandybridge/Haswell/Skylake) Intel CPUs.
The Intel® 64 and IA-32 Architectures Optimization Reference Manual says the latency of the cvttsd2si instruction is 5 cycles (see Appendix C-16). cvtsi2sd, depending on your architecture, has latency varying from 1 on Silvermont to more like 7-16 on several other architectures.
Agner Fog's instruction tables have more accurate/sensible numbers, like 5-cycle latency for cvtsi2sd on Silvermont (with 1 per 2 clock throughput), or 4c latency on Haswell, with one per clock throughput (if you avoid the dependency on the destination register from merging with the old upper half, like gcc usually does with pxor xmm0,xmm0).
SIMD packed-float to packed-int is great: a single uop. But converting packed int to/from double requires a shuffle to change the element size. SIMD float/double <-> int64_t doesn't exist until AVX-512, but can be done manually with limited range.
Intel's manual defines latency as: "The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction." But a more useful definition is the number of clocks from an input being ready until the output becomes ready. Throughput is more important than latency if there's enough parallelism for out-of-order execution to do its job: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?.
The same Intel manual says that an integer add instruction costs 1 latency and an integer imul costs 3 (Appendix C-27). FP addsd and mulsd run at 2 per clock throughput, with 4 cycle latency, on Skylake. Same for the SIMD versions, and for FMA, with 128 or 256-bit vectors.
On Haswell, addsd / addpd is only 1 per clock throughput, but 3 cycle latency thanks to a dedicated FP-add unit.
So, the answer boils down to:
1) It's hardware optimized, and the compiler leverages the hardware machinery.
2) It costs only a bit more than a multiply does in terms of the number of cycles in one direction, and a highly variable amount in the other (depending on your architecture). Its cost is neither free nor absurd, but probably warrants more attention given how easy it is to write code that incurs the cost in a non-obvious way.
Of course this kind of question depends on the exact hardware and even on the mode.
On x86 (my i7), when used in 32-bit mode with default options (gcc -m32 -O3), the conversion from int to double is quite fast; the opposite is much slower, because the C standard mandates an absurd rule (truncation of decimals).
This way of rounding is bad both for math and for hardware and requires the FPU to switch to this special rounding mode, perform the truncation, and switch back to a sane way of rounding.
If you need speed, doing the float->int conversion with the simple fistp instruction is faster and also much better for computation results, but it requires some inline assembly.
inline int my_int(double x)
{
    int r;
    asm ("fldl %1\n"       // load the double onto the x87 stack
         "fistpl %0\n"     // store it as a 32-bit int, using the current rounding mode
         : "=m"(r)
         : "m"(x));
    return r;
}
This is more than 6 times faster than the naive x = (int)y; conversion (and it doesn't have a bias toward 0).
The very same processor, when used in 64-bit mode however has no speed problems and using the fistp code actually makes the code run somewhat slower.
Apparently the hardware guys gave up and implemented the bad rounding algorithm directly in hardware: in 64-bit mode the SSE2 cvttsd2si instruction truncates without any rounding-mode switch, so badly-rounding code can now run fast.