Variable references in Intel style inline assembly and AT&T style, C++

I need to compile some assembly code in both Visual Studio and an IDE using G++ 4.6.1. The -masm=intel flag works as long as I do not reference or address any variables, which I do need to do.
I considered using intrinsics, but the compiled assembly is not optimal at all (for instance, I cannot define the SSE register to be used, and thus no pipeline optimization is possible).
Consider these parts of the code (Intel-style assembly):
mov ecx, dword ptr [p_pXcoords]
mov edx, dword ptr [p_pYcoords]
movhpd xmm6, qword ptr [oAvgX]
movhpd xmm7, qword ptr [oAvgY]
movlpd xmm6, qword ptr [oAvgX]
movlpd xmm7, qword ptr [oAvgY]
where p_pXcoords and p_pYcoords are double* arrays and function parameters, and oAvgX and oAvgY are simple double values.
Another line of code is this, which is in the middle of an assembly block:
movhpd xmm6, qword ptr [oAvgY]
in other words, I need to access variables and use them in specific SSE registers in the middle of the code. How can I do this with AT&T syntax? Better yet: can I do this with a g++ compiler using the -masm flag?
Is there any way at all to use one piece of assembly code for both VS and a g++ 4.6.1 based compiler?

You can certainly tell GCC which SSE register to use for each variable:
register __m128i x asm("xmm6");
But I guess VS does not support that. (I am also a little surprised you need it for decent performance. Register assignment and instruction scheduling are two of the most basic things an optimizing compiler does. Are you sure you enabled optimization :-) ?)
I would probably just write two functions, one using intrinsics, and one using asm for whichever compiler does not know how to schedule instructions properly.
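On the GCC side, here is a minimal sketch of combining a register-pinned variable with intrinsics, so the compiler handles all the variable addressing itself (the names mirror the question; this is illustrative only and will not compile under VS):
#include <emmintrin.h>
#include <cstddef>

// GCC extension: keep the broadcast average pinned in xmm6 while
// intrinsics generate the loads from the coordinate array.
double sum_diffs(const double* p_pXcoords, double oAvgX, std::size_t n) {
    register __m128d avg asm("xmm6") = _mm_set1_pd(oAvgX); // xmm6 = {oAvgX, oAvgX}
    __m128d acc = _mm_setzero_pd();
    for (std::size_t i = 0; i + 2 <= n; i += 2) {
        __m128d x = _mm_loadu_pd(p_pXcoords + i);   // two doubles per iteration
        acc = _mm_add_pd(acc, _mm_sub_pd(x, avg));  // accumulate x - avg
    }
    double lo, hi;
    _mm_storel_pd(&lo, acc);  // extract low half
    _mm_storeh_pd(&hi, acc);  // extract high half
    return lo + hi;
}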

Related

Why modulo operation mov edx to eax and then do the reverse in X86-64? [duplicate]

I am disassembling this code on llvm clang Apple LLVM version 8.0.0 (clang-800.0.42.1):
#include <cstdio>

int main() {
    float a = 0.151234;
    float b = 0.2;
    float c = a + b;
    printf("%f", c);
}
I compiled with no -O specification, but I also tried -O0 (gives the same) and -O2 (actually computes the value and stores it precomputed).
The resulting disassembly is the following (I removed the parts that are not relevant)
-> 0x100000f30 <+0>: pushq %rbp
0x100000f31 <+1>: movq %rsp, %rbp
0x100000f34 <+4>: subq $0x10, %rsp
0x100000f38 <+8>: leaq 0x6d(%rip), %rdi
0x100000f3f <+15>: movss 0x5d(%rip), %xmm0
0x100000f47 <+23>: movss 0x59(%rip), %xmm1
0x100000f4f <+31>: movss %xmm1, -0x4(%rbp)
0x100000f54 <+36>: movss %xmm0, -0x8(%rbp)
0x100000f59 <+41>: movss -0x4(%rbp), %xmm0
0x100000f5e <+46>: addss -0x8(%rbp), %xmm0
0x100000f63 <+51>: movss %xmm0, -0xc(%rbp)
...
Apparently it's doing the following:
load the two floats into registers xmm0 and xmm1
store them on the stack
load one value (not the one xmm0 held earlier) from the stack into xmm0
perform the addition
store the result back to the stack
I find it inefficient because:
everything could be done in registers. I am not using a and b later, so it could skip any operations involving the stack.
even if it wanted to use the stack, it could avoid reloading xmm0 from the stack if it did the operations in a different order.
Given that the compiler is always right, why did it choose this strategy?
-O0 (unoptimized) is the default. It tells the compiler you want it to compile fast (short compile times), not to take extra time compiling to make efficient code.
(-O0 isn't literally no optimization; e.g. gcc will still eliminate code inside if(1 == 2){ } blocks. Especially gcc more than most other compilers still does things like use multiplicative inverses for division at -O0, because it still transforms your C source through multiple internal representations of the logic before eventually emitting asm.)
Plus, "the compiler is always right" is an exaggeration even at -O3. Compilers are very good at a large scale, but minor missed-optimizations are still common within single loops. Often with very low impact, but wasted instructions (or uops) in a loop can eat up space in the out-of-order execution reordering window, and be less hyper-threading friendly when sharing a core with another thread. See C++ code for testing the Collatz conjecture faster than hand-written assembly - why? for more about beating the compiler in a simple specific case.
More importantly, -O0 also implies treating all variables similarly to volatile for consistent debugging, i.e. so you can set a breakpoint or single-step and modify the value of a C variable, then continue execution and have the program work the way you'd expect from your C source running on the C abstract machine. So the compiler can't do any constant-propagation or value-range simplification. (e.g. an integer that's known to be non-negative can simplify things using it, or make some if conditions always true or always false.)
(It's not quite as bad as volatile: multiple references to the same variable within one statement don't always result in multiple loads; at -O0 compilers will still optimize somewhat within a single expression.)
Compilers have to specifically anti-optimize for -O0 by storing/reloading all variables to their memory address between statements. (In C and C++, every variable has an address unless it was declared with the (now obsolete) register keyword and has never had its address taken. Optimizing away the address is possible according to the as-if rule for other variables, but isn't done at -O0)
Unfortunately, debug-info formats can't track the location of a variable through registers, so fully consistent debugging isn't possible without this slow-and-stupid code-gen.
If you don't need this, you can compile with -Og for light optimization, and without the anti-optimizations required for consistent debugging. The GCC manual recommends it for the usual edit/compile/run cycle, but you will get "optimized out" for many local variables with automatic storage when debugging. Globals and function args still usually have their actual values, at least at function boundaries.
Even worse, -O0 makes code that still works even if you use GDB's jump command to continue execution at a different source line. So each C statement has to be compiled into a fully independent block of instructions. (Is it possible to "jump"/"skip" in GDB debugger?)
for() loops can't be transformed into idiomatic (for asm) do{}while() loops, among other restrictions; a sketch of that missed rewrite follows.
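A minimal sketch (hypothetical function) of the loop rewrite that -O0 must skip:
#include <cstddef>

// Optimized shape: the zero-trip check is hoisted, and the loop ends
// with a single test-and-branch, the idiomatic do{}while() form in asm.
// At -O0 the top-tested for() shape is kept, with the counter reloaded
// from memory every iteration.
void scale(float* a, std::size_t n, float k) {
    if (n != 0) {
        std::size_t i = 0;
        do {
            a[i] *= k;
        } while (++i != n);
    }
}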
For all the above reasons, (micro-)benchmarking un-optimized code is a huge waste of time; the results depend on silly details of how you wrote the source that don't matter when you compile with normal optimization. -O0 vs. -O3 performance is not linearly related; some code will speed up much more than others.
The bottlenecks in -O0 code will often be different from -O3: often it's a loop counter that's kept in memory, creating a ~6-cycle loop-carried dependency chain. This can create interesting effects in the compiler-generated asm, like Adding a redundant assignment speeds up code when compiled without optimization (which is interesting from an asm perspective, but not for C.)
"My benchmark optimized away otherwise" is not a valid justification for looking at the performance of -O0 code.
See C loop optimization help for final assignment for an example and more details about the rabbit hole that tuning for -O0 is.
Getting interesting compiler output
If you want to see how the compiler adds 2 variables, write a function that takes args and returns a value. Remember you only want to look at the asm, not run it, so you don't need a main or any numeric literal values for anything that should be a runtime variable.
See also How to remove "noise" from GCC/clang assembly output? for more about this.
float foo(float a, float b) {
    float c = a + b;
    return c;
}
compiles with clang -O3 (on the Godbolt compiler explorer) to the expected
addss xmm0, xmm1
ret
But with -O0 it spills the args to stack memory. (Godbolt uses debug info emitted by the compiler to colour-code asm instructions according to which C statement they came from. I've added line breaks to show blocks for each statement, but you can see this with colour highlighting on the Godbolt link above. Often very handy for finding the interesting part of an inner loop in optimized compiler output.)
gcc -fverbose-asm will put comments on every line showing the operand names as C vars. In optimized code that's often an internal tmp name, but in un-optimized code it's usually an actual variable from the C source. I've manually commented the clang output because it doesn't do that.
# clang7.0 -O0 also on Godbolt
foo:
    push rbp
    mov rbp, rsp                    # make a traditional stack frame
    movss DWORD PTR [rbp-20], xmm0  # spill the register args
    movss DWORD PTR [rbp-24], xmm1  # into the red zone (below RSP)
    movss xmm0, DWORD PTR [rbp-20]  # a
    addss xmm0, DWORD PTR [rbp-24]  # +b
    movss DWORD PTR [rbp-4], xmm0   # store c
    movss xmm0, DWORD PTR [rbp-4]   # return c
    pop rbp                         # epilogue
    ret
Fun fact: using register float c = a+b;, the return value can stay in XMM0 between statements, instead of being spilled/reloaded. The variable has no address. (I included that version of the function in the Godbolt link.)
The register keyword has no effect in optimized code (except making it an error to take a variable's address, like how const on a local stops you from accidentally modifying something). I don't recommend using it, but it's interesting to see that it does actually affect un-optimized code.
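For reference, a sketch of that register version (accepted up through C++14; the keyword was removed in C++17):
float foo_reg(float a, float b) {
    register float c = a + b;  // c has no address, so even -O0 can keep it in xmm0
    return c;
}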
Related:
Complex compiler output for simple constructor - every copy of a variable when passing args typically results in extra copies in the asm.
Why is this C++ wrapper class not being inlined away? __attribute__((always_inline)) can force inlining, but doesn't optimize away the copying to create the function args, let alone optimize the function into the caller.

Why are clang and GCC not using xchg to implement std::swap?

I have the following code:
#include <utility>  // std::swap

char swap(char reg, char* mem) {
    std::swap(reg, *mem);
    return reg;
}
I expected this to compile down to:
swap(char, char*):
xchg dil, byte ptr [rsi]
mov al, dil
ret
But what it actually compiles to is (at -O3 -march=haswell -std=c++20):
swap(char, char*):
mov al, byte ptr [rsi]
mov byte ptr [rsi], dil
ret
See here for a live demo.
From the documentation of xchg, the first form should be perfectly possible:
XCHG - Exchange Register/Memory with Register
Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general-purpose registers or a register and a memory location.
So is there any particular reason why it's not possible for the compiler to use xchg here? I have tried other examples too, such as swapping pointers, swapping three operands, swapping types other than char but I never get an xchg in the compile output. How come?
TL:DR: because compilers optimize for speed, not for names that sound similar. There are lots of other terrible ways they also could have implemented it, but chose not to.
xchg with mem has an implicit lock prefix (on 386 and later) so it's horribly slow. You always want to avoid it unless you need an atomic exchange, or are optimizing completely for code-size without caring at all for performance, in cases where you do want the result in the same register as the original value. Sometimes seen in naive (performance oblivious) or code-golfed hand-written Bubble Sort as part of swapping 2 memory locations.
Possibly clang -Oz could go that crazy, IDK, but hopefully wouldn't in this case because your xchg way is larger code size, needing a REX prefix on both instructions to access DIL, vs. the 2-mov way being a 2-byte and a 3-byte instruction. clang -Oz does do stuff like push 1 / pop rax instead of mov eax, 1 to save 2 bytes of code size.
GCC -Os won't use xchg for swaps that don't need to be atomic because -Os still cares some about speed.
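For contrast, a sketch of the case where compilers do emit xchg: an atomic exchange, where the implicit lock is exactly what's needed (std::atomic's exchange compiles to xchg [mem], reg on x86-64):
#include <atomic>

char take(std::atomic<char>& slot, char v) {
    return slot.exchange(v); // atomic RMW: compiles to xchg on x86-64
}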
Also, IDK why you would think xchg + dependent mov would be faster or a better choice than two independent mov instructions that can run in parallel. (The store buffer makes sure that the store is correctly ordered after the load, regardless of which uop finds its execution port free first.)
See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
Seriously, I just don't see any plausible reason why you'd think a compiler might want to use xchg, especially given that the calling convention doesn't pass an arg in RAX so you still need 2 instructions. Even for registers, xchg reg,reg on Intel CPUs is 3 uops, and they're microcode uops that can't benefit from mov-elimination. (Some AMD CPUs have 2-uop xchg reg,reg. Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?)
I also guess you're looking at clang output; GCC will avoid partial register shenanigans (like false dependencies) by using a movzx eax, byte ptr [rsi] load even though the return value is only the low byte. Zero-extending loads are cheaper than merging into the old value of RAX. So that's another downside to xchg.
So is there any particular reason why it's not possible for the compiler to use xchg here?
Because mov is faster than xchg and compilers optimize for speed.
See:
Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?
Why does GCC use mov/mfence instead of xchg to implement C11's atomic_store?
Use xchg for -Os
Bug 47949 - Missed optimization for -Os using xchg instead of mov

Clang and '-O2' - disable specific optimization

For performance reasons I have to use -O2 optimization level on my code. The problem is that compiler promotes short strings (8 bytes or less) to registers, like:
__text:00000000001348DA mov rcx, 3D3D3D3D3D3D3D3Dh
__text:00000000001348E4 mov [rax+10h], rcx
__text:00000000001348E8 mov [rax+8], rcx
__text:00000000001348EC mov rcx, 3D3D3D3D3D3D3D0Ah
Which corresponds to loading the string "\n========================".
I need to keep the strings as data constants and prevent promoting them to registers, and I have to keep -O2 optimization for performance. clang is based on LLVM 10.
I'm asking for help, as I cannot find a flag that turns off this optimization pass.
Declaring those specific strings as volatile should prevent this from happening, but the real question is: why is this bad for you?
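A minimal sketch of that (hypothetical names; volatile forces every read to come from memory, so the bytes must exist as a data constant):
#include <cstdio>
#include <cstddef>

// The compiler can't fold volatile-qualified data into immediates,
// so the string stays in the binary's data section.
static volatile const char banner[] = "\n========================";

void print_banner() {
    for (std::size_t i = 0; i < sizeof banner - 1; ++i)
        std::putchar(banner[i]);  // each byte is genuinely loaded
}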

Compiler explorer and GCC have different outputs

I have some C code that, when given to Compiler Explorer, outputs:
mov BYTE PTR [rbp-4], al
mov eax, ecx
mov BYTE PTR [rbp-8], al
mov eax, edx
mov BYTE PTR [rbp-12], al
However if I use GCC or G++ then it gives me this:
mov BYTE PTR 16[rbp], al
mov eax, edx
mov BYTE PTR 24[rbp], al
mov eax, ecx
mov BYTE PTR 32[rbp], al
I have no idea why the BYTE PTRs are different. They have a completely different address, and I don't get why the offsets come before the [rbp] part.
If you know how to reproduce the first output using gcc or g++ please help!
gcc.exe (GCC) 8.2.0
Looks like your GCC is targeting the Windows x64 calling convention and using the shadow space (32 bytes above the return address) reserved by its caller. Godbolt's GCC installs target GNU/Linux, i.e. the x86-64 System V ABI.
You can get the same code on Godbolt by marking your function with __attribute__((ms_abi)). Of course that means your caller has to see that attribute in the prototype so it knows to reserve that space, and which registers to pass function args in.
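For example, a minimal sketch (hypothetical function) that reproduces the shadow-space stores on Godbolt's Linux-targeting GCC:
// __attribute__((ms_abi)) switches just this function to the Windows
// x64 convention; at -O0 the register args get spilled into the
// caller-reserved shadow space at 16[rbp], 24[rbp], 32[rbp].
__attribute__((ms_abi))
int f(char a, char b, char c) {
    return a + b + c;
}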
The Windows x64 calling convention is mostly worse than x86-64 System V; fewer arg-passing registers for example. One of its only advantages is easier implementation of variadic functions (because of the shadow space), and having some call-preserved XMM regs. (Probably too many, but x86-64 SysV has zero.) So more likely you want to use a cross compiler (targeting GNU/Linux) on Windows, or use __attribute__((sysv_abi)) on all your functions. (https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html)
The XMM part of the calling convention is normally irrelevant for kernel code; most kernels avoid saving/restoring the SIMD/FPU state on kernel entry/exit by not letting the compiler use SIMD/FP instructions.

How to use processor instructions in C++ to implement fast arithmetic operations

I was working on a C++ implementation of Shamir's secret sharing scheme. I split the message into 8-bit chunks and perform the corresponding arithmetic on each. The underlying finite field is Rijndael's finite field, F_2[x] / (x^8 + x^4 + x^3 + x + 1).
I did a quick search for a well-known, widespread library for Rijndael's finite-field calculations (e.g. OpenSSL or similar) and didn't find any. So I implemented it from scratch, partly as a programming exercise.
A few days ago, however, a professor at our university mentioned the following: "Modern processors support carry-less integer operations, so the characteristic-2 finite field multiplications run fast nowadays."
Hence, since I know only a little about hardware, assembler, and similar topics, my question is: how do I actually use (in C++) all the modern processors' instructions when building crypto software, whether it is AES, SHA, the arithmetic from above, or whatever else? I can't find any satisfactory resources on that. My idea is to build a library containing both a "modern-approach fast implementation" and a fallback "pure C++, dependency-less" one, and let GNU Autoconf decide which to use on each respective host. Any book/article/tutorial recommendation on this topic would be appreciated.
The question is quite broad because there are several ways you might access the power of the underlying hardware, so instead of one specific way here's a list of ways you can try to use all the modern processors' instructions:
Idiom Recognition
Write out the operation not offered directly in C++ in "long form" and hope your compiler recognizes it as an idiom for the underlying instruction you want. For example, you could write a variable rotate left of x by amount as (x << amount) | (x >> (32 - amount)) and all of gcc, clang and icc will recognize this as a rotate and issue the underlying rol instruction supported by x86.
Sometimes this technique puts you in a bit of an uncomfortable spot: the above C++ rotate implementation exhibits undefined behavior for amount == 0 (and also amount >= 32), since the result of a shift by 32 on a uint32_t is undefined, but the code actually produced by these compilers is just fine in that case. Still, having this lurking undefined behavior in your program is dangerous, and it probably won't run clean against ubsan and friends. The alternative safe version amount ? (x << amount) | (x >> (32 - amount)) : x; is only recognized by icc, but not by gcc or clang. Both forms are sketched below.
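As a concrete sketch of the two forms discussed above:
#include <cstdint>

// Recognized by gcc, clang, and icc as a single rol, but UB for
// amount == 0 or amount >= 32.
uint32_t rotl_fast(uint32_t x, uint32_t amount) {
    return (x << amount) | (x >> (32 - amount));
}

// Well-defined for all amounts, but (per the above) only icc
// currently recognizes it as a rotate.
uint32_t rotl_safe(uint32_t x, uint32_t amount) {
    return amount ? (x << amount) | (x >> (32 - amount)) : x;
}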
This approach tends to work for common idioms that map directly to assembly-level instructions that have been around for a while: rotates, bit tests and sets, multiplications with a wider result than inputs (e.g., multiplying two 32-bit values for a 64-bit result), conditional moves and so on, but is less likely to pick up bleeding edge instructions that might also be of interest to cryptography. For example, I'm quite sure no compiler will currently recognize an application of the AES instruction set extensions. It also works best on platforms that have received a lot of effort on the part of the compiler developers since each recognized idiom has to be added by hand.
I don't think this technique will work with your carry-less multiplication (PCLMULQDQ), but maybe one day (if you file an issue against the compilers)? It does work for other "crypto-interesting" functions though, including rotate.
Intrinsic Functions
As an extension, compilers will often offer intrinsic functions which are not part of the language proper but map directly to an instruction offered by most hardware. Although it looks like a function call, the compiler generally just emits the single instruction needed at the place you call it.
GCC calls these built-in functions and you can find a list of generic ones here. For example, you can use the __builtin_popcount call to emit the popcnt instruction, if the current target supports it. Many of the gcc builtins are also supported by icc and clang, and in this case all of gcc, clang and icc support this call and emit popcnt as long as the architecture (-march=haswell) is set to Haswell. Otherwise, clang and icc inline a replacement version using some clever SWAR tricks, while gcc calls __popcountdi2 which is provided by the runtime1.
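For instance, a one-function sketch (with -march=haswell all three compilers emit a single popcnt for this):
// Counts the set bits in v; compiles to one popcnt instruction when
// the target supports it, and to a fallback otherwise.
int count_bits(unsigned long long v) {
    return __builtin_popcountll(v);
}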
The intrinsics listed above are generic and generally offered on any platform the compilers support. You can also find platform-specific intrinsics, for example this list from gcc.
For x86 SIMD instructions specifically, Intel makes available a set of intrinsic functions declared in headers covering their ISA extensions, e.g., usable by adding #include <x86intrin.h>. These have wider support than the gcc intrinsics, e.g., they are supported by Microsoft's Visual Studio compiler suite. New instruction sets are usually added before chips that support them become available, so you can use these to access new instructions immediately on release.
Programming with SIMD intrinsic functions is kind of a halfway house between C++ and full assembly. The compiler still takes care of things like calling conventions and register allocation, and some optimizations are made (especially for generating constants and other broadcasts), but generally what you write is more or less what you get at the assembly level.
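Directly relevant to the question, a sketch using the carry-less multiply intrinsic (compile with -mpclmul on gcc/clang; MSVC exposes the same intrinsic):
#include <immintrin.h>
#include <cstdint>

// PCLMULQDQ: multiplies two 64-bit polynomials over GF(2), producing a
// 128-bit result, the fast building block for characteristic-2 fields.
__m128i clmul64(std::uint64_t a, std::uint64_t b) {
    __m128i va = _mm_set_epi64x(0, (long long)a);
    __m128i vb = _mm_set_epi64x(0, (long long)b);
    return _mm_clmulepi64_si128(va, vb, 0x00); // low qword x low qword
}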
Inline Assembly
If your compiler offers it, you can use inline assembly to call whatever instructions you want2. This has a lot of similarities to using intrinsic functions, but with a somewhat higher level of difficulty and fewer opportunities for the optimizer to help you out. You should probably prefer intrinsic functions unless you have a specific reason for inline assembly. One example could be if the optimizer does a really bad job with intrinsics: you could use an inline assembly block to get exactly the code you want. A sketch follows.
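A minimal GNU extended asm sketch (gcc/clang only; MSVC has no inline asm for x64), again using rotate as the example:
#include <cstdint>

// Constraints do the register plumbing: "+r" means x is read and
// written in any register, "c" pins amount to ecx (so %cl is valid),
// and "cc" declares the flags clobbered.
std::uint32_t rotl_asm(std::uint32_t x, std::uint32_t amount) {
    __asm__("roll %%cl, %0"
            : "+r"(x)
            : "c"(amount)
            : "cc");
    return x;
}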
Out-of-line Assembly
You can also just write your entire kernel function in assembly, assemble it however you want, and then declare it extern "C" and call it from C++. This is similar to the inline assembly option, but works on compilers that don't support inline assembly (e.g., 64-bit Visual Studio). You can also use a different assembler if you want, which is especially convenient if you are targeting multiple C++ compilers, since you can then use a single assembler for all of them.
You need to take care of the calling conventions yourself, and other messy things like DWARF unwind info and Windows SEH handling.
For very short functions, this approach doesn't work well since the call overhead will likely be prohibitive3.
Auto-Vectorization4
If you want to write fast cryptography today for a CPU, you are pretty much going to be targeting mostly SIMD instructions. Most new algorithms designed for software implementation are also designed with vectorization in mind.
You can use intrinsic functions or assembly to write SIMD code, but you can also write normal scalar code and rely on the auto-vectorizer. These got a bad name back in the early days of SIMD, and while they are still far from perfect, they have come a long way.
Consider this simple function which takes payload and key byte arrays and xors the key into the payload:
#include <cstddef>
#include <cstdint>

void otp_scramble(uint8_t* payload, uint8_t* key, size_t n) {
    for (size_t i = 0; i < n; i++) {
        payload[i] ^= key[i];
    }
}
This is a softball example, of course, but gcc, clang and icc all vectorize this to something like this inner loop5:
movdqu xmm0, XMMWORD PTR [rdi+rax]
movdqu xmm1, XMMWORD PTR [rsi+rax]
pxor xmm0, xmm1
movups XMMWORD PTR [rdi+rax], xmm0
It's using SSE instructions to load and xor 16 bytes at a time. The developer only has to reason about the simple scalar code, however!
One advantage of this approach versus intrinsics or assembly is that you aren't baking in the SIMD length of the instruction set at the source level. The same C++ code as above compiled with -march=haswell results in a loop like:
vmovdqu ymm1, YMMWORD PTR [rdi+rax]
vpxor ymm0, ymm1, YMMWORD PTR [rsi+rax]
vmovdqu YMMWORD PTR [rdi+rax], ymm0
It's using the AVX2 instructions available on Haswell to do 32-bytes at a time. If you compile with -march=skylake-avx512 clang uses 64-byte vxorps instructions on zmm registers (but gcc and icc stick with 32-byte inner loops). So in principle you can take some advantage of new ISA simply with a recompile.
A downside of auto-vectorization is that it is fairly fragile. What auto-vectorizes on one compiler might not on another, or even on another version of the same compiler, so you need to check you are getting the results you want. The auto-vectorizer is often working with less information than you have: it might not know that the input length is a multiple of some power of two, or that the input pointers are aligned in a certain way. Sometimes you can communicate this information to the compiler, but sometimes you can't.
Sometimes the compiler makes "interesting" decisions when it vectorizes, such as a small not-unrolled body for the inner loop, but then a giant "intro" or "outro" handling odd iterations, like what gcc produces after the first loop shown above:
movzx ecx, BYTE PTR [rsi+rax]
xor BYTE PTR [rdi+rax], cl
lea rcx, [rax+1]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+1+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+2]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+2+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+3]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+3+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+4]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+4+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+5]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+5+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+6]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+6+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+7]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+7+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+8]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+8+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+9]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+9+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+10]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+10+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+11]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+11+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+12]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+12+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+13]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+13+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+14]
cmp rdx, rcx
jbe .L1
movzx eax, BYTE PTR [rsi+14+rax]
xor BYTE PTR [rdi+rcx], al
You probably have better things to spend your instruction cache on (and this is far from the worst I've seen: it's easy to get examples with several hundreds of instructions in the intro and outro parts).
Unfortunately, the vectorizer probably won't produce crypto-specific instructions like carry-less multiply. You could consider a mix of scalar code that gets vectorized and an intrinsic only for the instructions the compiler won't generate, but this is easier to suggest than to actually do successfully. At that point you are probably better off writing your entire loop with intrinsics.
1 The advantage of the gcc approach here is that at runtime if the platform supports popcnt this call can resolve to an implementation that just uses a popcnt instruction, using the GNU IFUNC mechanism.
2 Assuming the underlying assembler supports it, but even if it doesn't you could just encode the raw instruction bytes in the inline assembly block.
3 The call overhead includes more than just the explicit costs of the call and ret and argument passing: it also includes the effect on the optimizer which can't optimize code as well in the caller around the function call since it has unknown side-effects.
4 In some ways, auto-vectorization could be seen as a special case of idiom recognition, but it is important enough and has enough unique considerations that it gets its own section here.
5 With minor differences: gcc is as shown, clang unrolled a bit, and icc used a load-op pxor rather than a separate load.