I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, so I use the intrinsic _mm256_loadu_pd; the code I've written is:
__m256d d1 = _mm256_loadu_pd(vInOut + i*4);
I then compiled with the options -O3 -mavx -g and used objdump to get the assembly annotated with source and line numbers (objdump -S -M intel -l avx.obj). When I look at the generated assembly, I find the following:
vmovupd xmm0,XMMWORD PTR [rsi+rax*1]
vinsertf128 ymm0,ymm0,XMMWORD PTR [rsi+rax*1+0x10],0x1
I was expecting to see this:
vmovupd ymm0,YMMWORD PTR [rsi+rax*1]
and fully use the 256-bit register (ymm0). Instead, it looks like gcc has decided to load one 128-bit half into xmm0 and then load the other half with vinsertf128.
Can someone explain this?
Equivalent code is getting compiled with a single vmovupd in MSVC VS 2012.
I'm running gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0 on Ubuntu 18.04 x86-64.
GCC's default tuning (-mtune=generic) includes -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, because that gives a minor speedup on some CPUs (e.g. first-gen Sandybridge, and some AMD CPUs) in some cases when memory is actually misaligned at runtime.
Use -O3 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store if you don't want this, or better, use -mtune=haswell. Or use -march=native to optimize for your own computer. There's no "generic-avx2" tuning. (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html).
Intel Sandybridge runs 256-bit loads as a single uop that takes 2 cycles in a load port. (Unlike AMD which decodes all 256-bit vector instructions as 2 separate uops.) Sandybridge has a problem with unaligned 256-bit loads (if the address is actually misaligned at runtime). I don't know the details, and haven't found much specific info on exactly what the slowdown is. Perhaps because it uses a banked cache, with 16-byte banks? But IvyBridge handles 256-bit loads better and still has banked cache.
According to the GCC mailing list message about the code that implements the option (https://gcc.gnu.org/ml/gcc-patches/2011-03/msg01847.html), "It speeds up some SPEC CPU 2006 benchmarks by up to 6%." (I think that's for Sandybridge, the only Intel AVX CPU that existed at the time.)
But if memory is actually 32-byte aligned at runtime, this is pure downside even on Sandybridge and most AMD CPUs (see footnote 1). So with this tuning option, you potentially lose just from failing to tell your compiler about alignment guarantees. And if your loop runs on aligned memory most of the time, you'd better compile at least that compilation unit with -mno-avx256-split-unaligned-load or tuning options that imply that.
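For example, if you can guarantee 32-byte alignment, say so in the source. This is just a minimal sketch (the function names are mine, not from the question): use the aligned-load intrinsic for manual vectorization, or __builtin_assume_aligned (GCC/clang) for auto-vectorized loops, and GCC can emit plain aligned 256-bit accesses with no splitting.
#include <immintrin.h>

__m256d load_chunk(const double *vInOut, long i) {
    return _mm256_load_pd(vInOut + i*4);   // aligned load: GCC emits vmovapd and doesn't split it
}

void scale(double *p, double s, long n) {
    double *pa = (double *)__builtin_assume_aligned(p, 32);  // promise 32-byte alignment
    for (long j = 0; j < n; j++)
        pa[j] *= s;                        // auto-vectorizer can use aligned 256-bit loads/stores
}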
Splitting in software imposes the cost all the time. Letting hardware handle it makes the aligned case perfectly efficient (except stores on Piledriver, see footnote 1), with the misaligned case possibly slower than with software splitting on some CPUs. So it's the pessimistic approach, and makes sense if it's really likely that the data really is misaligned at runtime, rather than just not guaranteed to always be aligned at compile time. e.g. maybe you have a function that's called most of the time with aligned buffers, but you still want it to work for rare / small cases where it's called with misaligned buffers. In that case, a split-load/store strategy is inappropriate even on Sandybridge.
It's common for buffers to be 16-byte aligned but not 32-byte aligned because malloc on x86-64 glibc (and operator new in libstdc++) returns 16-byte-aligned buffers (because alignof(max_align_t) == 16). For large buffers, the pointer is normally 16 bytes after the start of a page, so it's always misaligned for alignments larger than 16. Use aligned_alloc instead.
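For instance (a sketch, not code from the answer), C11 aligned_alloc gives you 32-byte-aligned storage; note that it wants the size to be a multiple of the alignment. posix_memalign or C++17 aligned operator new work as well.
#include <stdlib.h>

double *alloc_doubles_avx(size_t n) {
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 31) & ~(size_t)31;         // round size up to a multiple of 32
    return (double *)aligned_alloc(32, bytes);  // 32-byte aligned; release with free()
}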
Note that -mavx and -mavx2 don't change tuning options at all: gcc -O3 -mavx2 still tunes for all CPUs, including ones that can't actually run AVX2 instructions. This is pretty dumb, because you should use a single unaligned 256-bit load if tuning for "the average AVX2 CPU". Unfortunately gcc has no option to do that, and -mavx2 doesn't imply -mno-avx256-split-unaligned-load or anything. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 for feature requests to have instruction-set selection influence tuning.
This is why you should use -march=native to make binaries for local use, or maybe -march=sandybridge -mtune=haswell to make binaries that can run on a wide range of machines, but will probably mostly run on newer hardware that has AVX. (Note that even Skylake Pentium/Celeron CPUs don't have AVX or BMI2; probably on CPUs with any defects in the upper half of 256-bit execution units or register files, they disable decoding of VEX prefixes and sell them as low-end Pentium.)
gcc8.2's tuning options are as follows. (-march=x implies -mtune=x). https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html.
I checked on the Godbolt compiler explorer by compiling with -O3 -fverbose-asm and looking at the comments which include a full dump of all implied options. I included _mm256_loadu/storeu_ps functions, and a simple float loop that can auto-vectorize, so we can also look at what the compiler does.
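Roughly this kind of thing (a sketch; the exact code on Godbolt may differ, and the function names here are mine):
#include <immintrin.h>

__m256 load256(const float *p)       { return _mm256_loadu_ps(p); }
void   store256(float *p, __m256 v)  { _mm256_storeu_ps(p, v); }

void add_arrays(float *a, const float *b, int n) {   // simple loop that can auto-vectorize
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}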
Use -mprefer-vector-width=256 (gcc8) or -mno-prefer-avx128 (gcc7 and earlier) to override tuning options like -mtune=bdver3 and get 256-bit auto-vectorization if you want, instead of only with manual vectorization.
default / -mtune=generic: both -mavx256-split-unaligned-load and -store. Arguably less and less appropriate as Intel Haswell and later become more common, and the downside on recent AMD CPUs is I think still small. Especially splitting unaligned loads, which AMD tuning options don't enable.
-march=sandybridge and -march=ivybridge: split both. (I think I've read that IvyBridge improved handling of unaligned 256-bit loads or stores, so it's less appropriate for cases where the data might be aligned at runtime.)
-march=haswell and later: neither splitting option enabled.
-march=knl: neither splitting option enabled. (Silvermont/Atom don't have AVX)
-mtune=intel: neither splitting option enabled. Even with gcc8, auto-vectorization with -mtune=intel -mavx chooses to reach an alignment boundary for the read/write destination array, unlike gcc8's normal strategy of just using unaligned. (Again, another case of software handling that always has a cost vs. letting the hardware deal with the exceptional case.)
-march=bdver1 (Bulldozer): -mavx256-split-unaligned-store, but not loads.
It also sets -mprefer-avx128 (gcc7 and earlier; the gcc8 equivalent is -mprefer-vector-width=128): auto-vectorization will only use 128-bit AVX, but of course intrinsics can still use 256-bit vectors.
-march=bdver2 (Piledriver), bdver3 (Steamroller), bdver4 (Excavator): same as Bulldozer. They auto-vectorize an FP a[i] += b[i] loop with software prefetch and enough unrolling to only prefetch once per cache line!
-march=znver1 (Zen): -mavx256-split-unaligned-store but not loads, still auto-vectorizing with only 128-bit, but this time without SW prefetch.
-march=btver2 (AMD Fam16h, aka Jaguar): neither splitting option enabled, auto-vectorizing like Bulldozer-family with only 128-bit vectors + SW prefetch.
-march=eden-x4 (Via Eden with AVX2): neither splitting option enabled, but the -march option doesn't even enable -mavx, and auto-vectorization uses movlps / movhps 8-byte loads, which is really dumb. At least use movsd instead of movlps to break the false dependency. But if you enable -mavx, it uses 128-bit unaligned loads. Really weird / inconsistent behaviour here, unless there's some strange front-end for this.
These split-unaligned options are enabled as part of -march=sandybridge for example, and presumably also for Bulldozer-family (-march=bdver2 is Piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.
Footnote 1: AMD Piledriver has a performance bug that makes 256-bit store throughput terrible: even vmovaps [mem], ymm aligned stores run at one per 17 to 20 clocks according to Agner Fog's microarch pdf (https://agner.org/optimize/). This effect isn't present in Bulldozer or Steamroller/Excavator.
Agner Fog says 256-bit AVX throughput in general (not loads/stores specifically) on Bulldozer/Piledriver is typically worse than 128-bit AVX, partly because it can't decode instructions in a 2-2 uop pattern. Steamroller makes 256-bit close to break-even (if it doesn't cost extra shuffles). But register-register vmovaps ymm instructions still only benefit from mov-elimination for the low 128 bits on Bulldozer-family.
But closed-source software or binary distributions typically don't have the luxury of building with -march=native on every target architecture, so there's a tradeoff when making a binary that can run on any AVX-supporting CPU. Gaining big speedup with 256-bit code on some CPUs is typically worth it as long as there aren't catastrophic downsides on other CPUs.
Splitting unaligned loads/stores is an attempt to avoid big problems on some CPUs. It costs extra uop throughput, and extra ALU uops, on recent CPUs. But at least vinsertf128 ymm, [mem], 1 doesn't need the shuffle unit on port 5 on Haswell/Skylake: it can run on any vector ALU port. (And it doesn't micro-fuse, so it costs 2 uops of front-end bandwidth.)
PS:
Most code isn't compiled by bleeding-edge compilers, so changing the "generic" tuning now will take a while before code compiled with an updated tuning gets into use. (Of course, most code is compiled with just -O2 or -O3, and this option only affects AVX code-gen anyway. But many people unfortunately use -O3 -mavx2 instead of -O3 -march=native, so they can miss out on FMA, BMI1/2, popcnt, and other things their CPU supports.)
GCC's generic tuning splits unaligned 256-bit loads to help older processors. (Subsequent changes avoid splitting loads in generic tuning, I believe.)
You can tune for more recent Intel CPUs using something like -mtune=intel or -mtune=skylake, and you will get a single instruction, as intended.
What is the /d2vzeroupper MSVC compiler optimization flag doing?
I was reading through this Compiler Options Quick Reference Guide
for Epyc CPUs from AMD: https://developer.amd.com/wordpress/media/2020/04/Compiler%20Options%20Quick%20Ref%20Guide%20for%20AMD%20EPYC%207xx2%20Series%20Processors.pdf
For MSVC, to "Optimize for 64-bit AMD processors", they recommend to enable /favor:AMD64 /d2vzeroupper.
What /favor:AMD64 does is clear; there is documentation about it in the MSVC docs. But I can't seem to find /d2vzeroupper mentioned anywhere on the internet at all, no documentation anywhere. What does it do?
TL;DR: When using /favor:AMD64 add /d2vzeroupper to avoid very poor performance of SSE code on both current AMD CPUs and Intel CPUs.
Generally, /d1... and /d2... are "secret" (undocumented) MSVC options to tune compiler behavior. The /d1... options apply to the compiler front-end, the /d2... options to the compiler back-end.
/d2vzeroupper enables compiler-generated vzeroupper instructions.
See Do I need to use _mm256_zeroupper in 2021? for more information.
Normally it is enabled by default. You can disable it with /d2vzeroupper-. See here: https://godbolt.org/z/P48crzTrb
The /favor:AMD64 switch suppresses vzeroupper, so /d2vzeroupper turns it back on.
Up-to-date Visual Studio 2022 has fixed that: /favor:AMD64 still emits vzeroupper, so /d2vzeroupper is not needed to enable it.
Reason: current AMD optimization guides (available from AMD site; direct pdf link) suggest:
2.11.6 Mixing AVX and SSE
There is a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data. Transitioning in either direction will cause a micro-fault to spill or fill the upper 128 bits of all 16 YMM registers. There will be an approximately 100 cycle penalty to signal and handle this fault. To avoid this penalty, a VZEROUPPER or VZEROALL instruction should be used to clear the upper 128 bits of all YMM registers when transitioning from AVX code to SSE or unknown code.
Older AMD processors did not need vzeroupper, so /favor:AMD64 implemented an optimization for them, even though it penalized Intel CPUs. From the MS docs:
/favor:AMD64
(x64 only) optimizes the generated code for the AMD Opteron, and Athlon processors that support 64-bit extensions. The optimized code can run on all x64 compatible platforms. Code that is generated by using /favor:AMD64 might cause worse performance on Intel processors that support Intel64.
I compiled my C++ program on a machine that supports AVX2 (Intel E5-2643 v3). It compiles and runs just fine. I confirmed that AVX2 instructions are used, because after disassembling the binary I saw AVX2 instructions such as vpbroadcastd.
Then I ran this binary on another machine that only has the AVX instruction set (Intel E5-2643 v2). It also runs fine. Does the binary fall back to a backward-compatible AVX instruction instead? If so, which instruction? Do you see any potential issues?
There are multiple compilers and multiple settings you can use but the general principle is that usually a compiler is not targeting a particular processor, it's targeting an architecture, and by default it will usually have a fairly inclusive approach meaning the generated code will be compatible with as many processors as reasonable. You would normally expect an x86_64 compiler to generate code that runs without AVX2, indeed, that it should run on some of the earliest CPUs supporting the x86_64 instruction set.
If you have code that benefits greatly from extensions to the instruction set that aren't universally supported like AVX2, your aim when producing software is generally to degrade gracefully. For instance you could use runtime feature detection to see if the current processor supports AVX2 and run a separate code path. Some compilers may support automated ways of doing this or helpers to assist you in achieving this yourself.
It's not rare to have AVX2 instructions in a binary that uses CPU detection to make sure it only runs them on CPUs that support them (e.g. via cpuid and setting function pointers).
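A minimal sketch of that pattern with GCC/clang (the function names here are hypothetical, not from any particular library): __builtin_cpu_supports does the cpuid feature check for you. GCC also has more automated options such as __attribute__((target_clones)) and ifunc resolvers.
#include <stddef.h>

void sum_avx2(float *dst, const float *a, const float *b, size_t n); // built with -mavx2
void sum_sse2(float *dst, const float *a, const float *b, size_t n); // baseline version

static void (*sum_ptr)(float *, const float *, const float *, size_t) = NULL;

void init_dispatch(void) {
    __builtin_cpu_init();    // only strictly required if this runs before constructors
    sum_ptr = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_sse2;
}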
If the AVX2 instruction actually executed on a CPU without AVX2 support, it raises #UD, so the OS delivers SIGILL (illegal instruction) to your process, or the Windows equivalent.
There are a few cases where an instruction like lzcnt decodes as rep bsr, which runs as bsr on CPUs without BMI1. (Giving a different answer). But VEX-coded AVX2 instructions just fault on older CPUs.
I am trying to use Intel SIMD intrinsics to accelerate a query-answer program. Suppose query_cnt is input-dependent but always smaller than the SIMD register count (i.e. there are enough SIMD registers to hold them). Since the queries are the hot data in my application, instead of loading them each time they're needed, may I load them once up front and keep them in registers?
Suppose queries are float type, and AVX256 is supported. Now I have to use something like:
std::vector<__m256> vec_queries(query_cnt / 8);
for (int i = 0; i < query_cnt / 8; ++i) {
    vec_queries[i] = _mm256_loadu_ps((float const *)(curr_query_ptr));
    curr_query_ptr += 8;
}
I know it's not good practice since there is potential load/store overhead, but at least there's a slight chance that vec_queries[i] gets optimized so the values are kept in registers. Still, I don't think it's a good approach.
Any better ideas?
From the code sample you posted, it looks like you're just doing a variable-length memcpy. Depending on what the compiler does, and the surrounding code, you might get better results from just actually calling memcpy. e.g. for aligned copies with a size that's a multiple of 16B, the break-even point between a vector loop and rep movsb is maybe as low as ~128 bytes on Intel Haswell. Check Intel's optimization manual for some implementation notes on memcpy, and a graph of size vs. cycles for a couple different strategies. (Links in the x86 tag wiki).
You didn't say what CPU, so I'm just assuming recent Intel.
I think you're too worried about registers. Loads that hit in L1 cache are extremely cheap. Haswell (and Skylake) can do two __m256 loads per clock (and a store in the same cycle). Previous to that, Sandybridge/IvyBridge can do two memory operations per clock, with a max of one of them being a store. Or under ideal conditions (256b loads/stores), they can manage 2x 16B loaded and 1x 16B stored per clock. So loading/storing 256b vectors is more expensive than on Haswell, but still very cheap if they're aligned and hot in L1 cache.
I mentioned in comments that GNU C global register variables might be a possibility, but mostly in a "this is technically possible in theory" sense. You probably don't want multiple vector registers dedicated to this purpose for the entire run-time of your program (including library function calls, so you'd have to recompile them).
In reality, just make sure the compiler can inline (or at least see while optimizing) the definitions for every function you use inside any important loops. That way it can avoid having to spill/reload vector regs across function calls (since both the Windows and System V x86-64 ABIs have no call-preserved YMM (__m256) registers).
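So a sketch of what I'd aim for instead (the kernel here is made up, just to show the shape): load the queries into local __m256 variables once, inside the hot function, and keep everything it calls inlinable so those locals stay in ymm registers across the whole scan.
#include <immintrin.h>
#include <cstddef>

// Hypothetical example with query_cnt == 16 (two __m256 vectors of queries).
float scan(const float *queries, const float *data, std::size_t n) {
    __m256 q0 = _mm256_loadu_ps(queries);       // loaded once, live in registers
    __m256 q1 = _mm256_loadu_ps(queries + 8);
    __m256 acc = _mm256_setzero_ps();
    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        __m256 d = _mm256_loadu_ps(data + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(d, q0));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(d, q1));
    }
    // horizontal sum of the accumulator
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}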
See Agner Fog's microarch pdf to learn even more about the microarchitectural details of modern CPUs, at least the details that are possible to measure by experiment and tune for.
I have a heavy number-crunching program that does image processing. It is mostly convolutions. It is written in C++ and compiled with Mingw GCC 4.8.1. I run it on a laptop with a Intel Core i7 4900MQ (with SSE up to SSE4.2 and AVX2).
When I tell GCC to use SSE optimisations (with -march=native -mfpmath=sse -msse2 ), I see no speedup compared to using the default x87 FPU.
When I use doubles instead of floats, there is no slowdown.
My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
Yes, you are.
The compiler is only as good as your code - remember that. If you didn't design your algorithm with vectorization in mind, the compiler is powerless. It's not as simple as "turn the switch on and enjoy a 100% performance boost".
First of all, compile your code with -ftree-vectorizer-verbose=N to see what was really vectorized by the compiler.
N is the verbosity level; make it 5 to see all available output (more info can be found here).
Also, you may want to read about GCC's vectorizer.
And keep in mind, that for performance-critical sections of code, using SSE/AVX intrinsics (brilliantly documented here) directly may be the best option.
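For example, a loop shaped like this (a generic sketch, not your convolution code) is what the vectorizer handles well: contiguous float arrays, no aliasing, no branches in the body.
void saxpy(float *__restrict y, const float *__restrict x, float a, int n) {
    // With -O3 -march=native this becomes packed mulps/addps (or FMA) on 4-8 floats per iteration.
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}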
There is no code, no description on test procedures, but it generally can be explained this way:
It's not only about being CPU-bound; it can also be bounded by memory speed.
Image processing usually has a large working set that exceeds the cache of your non-Xeon CPU. Eventually the CPU hits starvation, which means the overall throughput can be bounded by memory speed.
You may be using an algorithm that is not friendly for vectorization.
Not every algorithm benefits from being vectorized. There are many conditions that have to be met - flow dependencies, memory layout, etc.
My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX.
To use AVX, it is necessary to include this:
#include "immintrin.h"
and then you can use AVX intrinsic functions like _mm256_mul_ps, _mm256_add_ps etc.
The problem is that by default, VS2010 produces code that works very slowly and shows the warning:
warning C4752: found Intel(R) Advanced Vector Extensions; consider using /arch:AVX
It seems VS2010 actually does not use AVX instructions, but instead emulates them. I added /arch:AVX to the compiler options and got good results. But this option tells the compiler to use AVX instructions everywhere when possible. So my code may crash on a CPU that does not support AVX!
So the question is how to make the VS2010 compiler produce AVX code, but only where I specify AVX intrinsics directly. For SSE it works: I just use SSE intrinsic functions and it produces SSE code without any compiler options like /arch:SSE. But for AVX it does not work for some reason.
2021 update: Modern versions of MSVC don't need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did.
The behavior that you are seeing is the result of expensive state-switching.
See page 102 of Agner Fog's manual:
http://www.agner.org/optimize/microarchitecture.pdf
Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high (~70) cycle penalty.
When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)
Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.
It sounds like you're trying to make multiple code paths: one for SSE, and one for AVX.
For this, I suggest you separate your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and make a dispatcher that chooses based on what hardware it's running on.
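A sketch of what such a dispatcher could look like (names and signatures are mine, not from this answer). Note that a correct AVX check needs both the CPU feature bit and OS support for saving YMM state (OSXSAVE plus XGETBV); _xgetbv is available in VS2010 SP1's <intrin.h>.
#include <intrin.h>

void kernel_sse(float *dst, const float *src, int n);  // in a file compiled without /arch:AVX
void kernel_avx(float *dst, const float *src, int n);  // in a file compiled with /arch:AVX

static bool cpu_and_os_support_avx()
{
    int info[4];
    __cpuid(info, 1);
    bool osxsave = (info[2] & (1 << 27)) != 0;   // OS uses XSAVE/XRSTOR
    bool avx     = (info[2] & (1 << 28)) != 0;   // CPU supports AVX
    if (!osxsave || !avx)
        return false;
    unsigned __int64 xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;                  // XMM and YMM state enabled by the OS
}

void kernel(float *dst, const float *src, int n)
{
    static const bool use_avx = cpu_and_os_support_avx();
    (use_avx ? kernel_avx : kernel_sse)(dst, src, n);
}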
If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
tl;dr for old versions of MSVC only
Use _mm256_zeroupper(); or _mm256_zeroall(); around sections of code using AVX (before or after depending on function arguments). Only use option /arch:AVX for source files with AVX rather than for an entire project to avoid breaking support for legacy-encoded SSE-only code paths.
In modern MSVC (and the other mainstream compilers, GCC/clang/ICC), the compiler knows when to use a vzeroupper asm instruction. Forcing extra vzerouppers with intrinsics can hurt performance when inlining. See Do I need to use _mm256_zeroupper in 2021?
Cause
I think the best explanation is in the Intel article, "Avoiding AVX-SSE Transition Penalties" (PDF). The abstract states:
Transitioning between 256-bit Intel® AVX instructions and legacy Intel® SSE instructions within a program may cause performance penalties because the hardware must save and restore the upper 128 bits of the YMM registers.
Separating your AVX and SSE code into different compilation units may NOT help if you switch between calling code from both SSE-enabled and AVX-enabled object files, because the transition may occur when AVX instructions or assembly are mixed with any of (from the Intel paper):
128-bit intrinsic instructions
SSE inline assembly
C/C++ floating point code that is compiled to Intel® SSE
Calls to functions or libraries that include any of the above
This means there may even be penalties when linking with external code using SSE.
Details
There are 3 processor states defined by the AVX instructions, and one of the states is where all of the YMM registers are split, allowing the lower half to be used by SSE instructions. The Intel document "Intel® AVX State Transitions: Migrating SSE Code to AVX" provides a diagram of these states.
When in state B (AVX-256 mode), all bits of the YMM registers are in use. When an SSE instruction is called, a transition to state C must occur, and this is where there is a penalty. The upper half of all YMM registers must be saved into an internal buffer before SSE can start, even if they happen to be zeros. The cost of the transitions is on the "order of 50-80 clock cycles on Sandy Bridge hardware". There is also a penalty going from C -> A, as diagrammed in Figure 2.
You can also find details about the state-switching penalty causing this slowdown on page 130, Section 9.12, "Transitions between VEX and non-VEX modes" in Agner Fog's optimization guide (version updated 2014-08-07), referenced in Mysticial's answer. According to his guide, any transition to/from this state takes "about 70 clock cycles on Sandy Bridge". Just as the Intel document states, this is an avoidable transition penalty.
Skylake has a different dirty-upper mechanism that causes false dependencies for legacy-SSE with dirty uppers, rather than one-time penalties. Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
Resolution
To avoid the transition penalties you can either remove all legacy SSE code, instruct the compiler to convert all SSE instructions to their VEX encoded form of 128-bit instructions (if compiler is capable), or put the YMM registers in a known zero state before transitioning between AVX and SSE code. Essentially, to maintain the separate SSE code path, you must zero out the upper 128-bits of all 16 YMM registers (issuing a VZEROUPPER instruction) after any code that uses AVX instructions. Zeroing these bits manually forces a transition to state A, and avoids the expensive penalty since the YMM values do not need to be stored in an internal buffer by hardware. The intrinsic that performs this instruction is _mm256_zeroupper. The description for this intrinsic is very informative:
This intrinsic is useful to clear the upper bits of the YMM registers when transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions. There is no transition penalty if an application clears the upper bits of all YMM registers (sets to ‘0’) via VZEROUPPER, the corresponding instruction for this intrinsic, before transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions.
In Visual Studio 2010+ (maybe even older), you get this intrinsic with immintrin.h.
Note that zeroing out the bits with other methods does not eliminate the penalty - the VZEROUPPER or VZEROALL instructions must be used.
One automatic solution implemented by the Intel Compiler is to insert a VZEROUPPER at the beginning of each function containing Intel AVX code if none of the arguments are a YMM register or __m256/__m256d/__m256i datatype, and at the end of functions if the returned value is not a YMM register or __m256/__m256d/__m256i datatype.
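As a minimal illustration (my sketch, not from the Intel article): if an AVX function will be followed by legacy-SSE code, for example in a caller compiled without /arch:AVX, clear the uppers before returning.
#include <immintrin.h>

void scale_avx(float *p, float s, int n)
{
    __m256 vs = _mm256_set1_ps(s);
    for (int i = 0; i + 8 <= n; i += 8)          // tail elements omitted for brevity
        _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), vs));
    _mm256_zeroupper();   // VZEROUPPER: uppers now known-zero, so following SSE code pays no transition penalty
}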
In the wild
This VZEROUPPER solution is used by FFTW to generate a library with both SSE and AVX support. See simd-avx.h:
/* Use VZEROUPPER to avoid the penalty of switching from AVX to SSE.
See Intel Optimization Manual (April 2011, version 248966), Section
11.3 */
#define VLEAVE _mm256_zeroupper
Then VLEAVE(); is called at the end of every function using intrinsics for AVX instructions.