My understanding is that vectorization of code works something like this:
For data in the array below the first address that is a multiple of 128 (or 256, or whatever the SIMD instructions require), do slow element-by-element processing. Let's call this the prologue.
For data in the array between the first address that is a multiple of 128 and the last address that is a multiple of 128, use SIMD instructions.
For the data between the last address that is a multiple of 128 and the end of the array, use slow element-by-element processing. Let's call this the epilogue.
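To make that concrete, here is a rough scalar sketch of the three-phase structure (hypothetical code for 16-byte vectors of int, not actual compiler output):

#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch of prologue / aligned main loop / epilogue
void set_all(int *p, size_t n, int v)
{
    size_t i = 0;
    // Prologue: scalar until p+i reaches a 16-byte boundary
    while (i < n && ((uintptr_t)(p + i) % 16) != 0)
        p[i++] = v;
    // Main loop: stands in for aligned SIMD stores, 4 ints at a time
    for (; i + 4 <= n; i += 4)
        p[i] = p[i+1] = p[i+2] = p[i+3] = v;
    // Epilogue: scalar for the 0..3 leftover elements
    for (; i < n; ++i)
        p[i] = v;
}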
Now I understand why std::assume_aligned helps with the prologue, but I do not get why it enables the compiler to remove the epilogue as well.
Quote from proposal:
If we could make this property visible to the compiler, it could skip the loop prologue and epilogue
You can see the effect on code-gen from using GNU C / C++ __builtin_assume_aligned.
gcc 7 and earlier targeting x86 (and ICC18) prefer to use a scalar prologue to reach an alignment boundary, then an aligned vector loop, then a scalar epilogue to clean up any leftover elements that weren't a multiple of a full vector.
Consider the case where the total number of elements is known at compile time to be a multiple of the vector width, but the alignment isn't known. If you know the alignment, you don't need either a prologue or an epilogue. But if you don't, you need both: the number of left-over elements after the last aligned vector is not known.
This Godbolt compiler explorer link shows these functions compiled for x86-64 with ICC18, gcc7.3, and clang6.0. clang unrolls very aggressively, but still uses unaligned stores. This seems like a weird way to spend that much code-size for a loop that just stores.
// aligned, and size a multiple of vector width
void set42_aligned(int *p) {
    p = (int*)__builtin_assume_aligned(p, 64);
    for (int i = 0; i < 1024; i++) {
        *p++ = 0x42;
    }
}
# gcc7.3 -O3 (arch=tune=generic for x86-64 System V: p in RDI)
lea rax, [rdi+4096] # end pointer
movdqa xmm0, XMMWORD PTR .LC0[rip] # set1_epi32(0x42)
.L2: # do {
add rdi, 16
movaps XMMWORD PTR [rdi-16], xmm0
cmp rax, rdi
jne .L2 # }while(p != endp);
rep ret
This is pretty much exactly what I'd do by hand, except maybe unrolling by 2 so OoO exec could discover the loop exit branch being not-taken while still chewing on the stores.
The unaligned version, by contrast, includes both a prologue and an epilogue:
// without any alignment guarantee
void set42(int *p) {
    for (int i = 0; i < 1024; i++) {
        *p++ = 0x42;
    }
}
~26 instructions of setup, vs. 2 from the aligned version
.L8: # then a bloated loop with 4 uops instead of 3
add eax, 1
add rdx, 16
movaps XMMWORD PTR [rdx-16], xmm0
cmp ecx, eax
ja .L8 # end of main vector loop
# epilogue:
mov eax, esi # then destroy the counter we spent an extra uop on inside the loop. /facepalm
and eax, -4
mov edx, eax
sub r8d, eax
cmp esi, eax
lea rdx, [r9+rdx*4] # recalc a pointer to the last element, maybe to avoid a data dependency on the pointer from the loop.
je .L5
cmp r8d, 1
mov DWORD PTR [rdx], 66 # fully-unrolled final up-to-3 stores
je .L5
cmp r8d, 2
mov DWORD PTR [rdx+4], 66
je .L5
mov DWORD PTR [rdx+8], 66
.L5:
rep ret
Even for a more complex loop which would benefit from a little bit of unrolling, gcc leaves the main vectorized loop not unrolled at all, but spends boatloads of code-size on fully-unrolled scalar prologue/epilogue. It's really bad for AVX2 256-bit vectorization with uint16_t elements or something. (up to 15 elements in the prologue/epilogue, rather than 3). This is not a smart tradeoff, so it helps gcc7 and earlier significantly to tell it when pointers are aligned. (The execution speed doesn't change much, but it makes a big difference for reducing code-bloat.)
BTW, gcc8 favours using unaligned loads/stores, on the assumption that data often is aligned. Modern hardware has cheap unaligned 16 and 32-byte loads/stores, so letting the hardware handle the cost of loads/stores that are split across a cache-line boundary is often good. (AVX512 64-byte stores are often worth aligning, because any misalignment means a cache-line split on every access, not every other or every 4th.)
Another factor is that earlier gcc's fully-unrolled scalar prologues/epilogues are crap compared to smart handling where you do one unaligned potentially-overlapping vector at the start/end. (See the epilogue in this hand-written version of set42). If gcc knew how to do that, it would be worth aligning more often.
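For reference, a sketch of that smarter strategy (hand-written with intrinsics, assuming n >= 4; this is an illustration, not what gcc emits):

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

void set42_smart(int *p, size_t n)
{
    const __m128i v = _mm_set1_epi32(0x42);
    _mm_storeu_si128((__m128i*)p, v);            // unaligned head vector
    _mm_storeu_si128((__m128i*)(p + n - 4), v);  // unaligned tail vector, may overlap
    // Aligned stores cover everything in between (re-storing a few head
    // bytes is harmless for an idempotent store like this):
    int *a   = (int*)(((uintptr_t)p + 16) & -(uintptr_t)16);
    int *end = p + n - 4;
    for (; a < end; a += 4)
        _mm_store_si128((__m128i*)a, v);
}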
This is discussed in the document itself in Section 5:
A function that returns a pointer T*, and guarantees that it will
point to over-aligned memory, could return like this:
T* get_overaligned_ptr()
{
    // code...
    return std::assume_aligned<N>(_data);
}
This technique can be used e.g. in the begin() and end()
implementations of a class wrapping an over-aligned range of data. As
long as such functions are inline, the over-alignment will be
transparent to the compiler at the call-site, enabling it to perform
the appropriate optimisations without any extra work by the caller.
The begin() and end() methods are data accessors for the over-aligned buffer _data. That is, begin() returns a pointer to the first byte of the buffer and end() returns a pointer to one byte past the last byte of the buffer.
Suppose they are defined as follows:
T* begin()
{
    // code...
    return std::assume_aligned<N>(_data);
}

T* end()
{
    // code...
    return _data + size; // No alignment hint!
}
In this case, the compiler may not be able to eliminate the epilogue. But if they were defined as follows:
T* begin()
{
    // code...
    return std::assume_aligned<N>(_data);
}

T* end()
{
    // code...
    return std::assume_aligned<N>(_data + size);
}
Then the compiler would be able to eliminate the epilogue. For example, if N is 128 bits, then every single 128-bit chunk of the buffer is guaranteed to be 128-bit aligned. Note that this is only possible when the size of the buffer is a multiple of the alignment.
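A minimal sketch of such a wrapper (a hypothetical class, using C++20 std::assume_aligned; here 1024 ints = 4096 bytes, a multiple of the 64-byte over-alignment):

#include <cstddef>
#include <memory>

template <typename T, std::size_t N = 64>       // over-aligned to N bytes
struct over_aligned_range {
    static constexpr std::size_t count = 1024;  // byte size is a multiple of N
    alignas(N) T _data[count];

    T* begin() { return std::assume_aligned<N>(_data); }
    T* end()   { return std::assume_aligned<N>(_data + count); }
};

void fill42(over_aligned_range<int>& r) {
    for (int* p = r.begin(); p != r.end(); ++p)
        *p = 0x42;   // eligible for aligned full-width vectors, no prologue/epilogue
}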
Related
Why does gcc fill the whole array with zeros instead of only the remaining 96 integers? The non-zero initializers are all at the start of the array.
void *sink;

void bar() {
    int a[100]{1,2,3,4};
    sink = a;           // a escapes the function
    asm("":::"memory"); // and a compiler memory barrier
    // forces the compiler to materialize a[] in memory instead of optimizing it away
}
MinGW8.1 and gcc9.2 both make asm like this (Godbolt compiler explorer).
# gcc9.2 -O3 -m32 -mno-sse
bar():
push edi # save call-preserved EDI which rep stos uses
xor eax, eax # eax=0
mov ecx, 100 # repeat-count = 100
sub esp, 400 # reserve 400 bytes on the stack
mov edi, esp # dst for rep stos
mov DWORD PTR sink, esp # sink = a
rep stosd # memset(a, 0, 400)
mov DWORD PTR [esp], 1 # then store the non-zero initializers
mov DWORD PTR [esp+4], 2 # over the zeroed part of the array
mov DWORD PTR [esp+8], 3
mov DWORD PTR [esp+12], 4
# memory barrier empty asm statement is here.
add esp, 400 # cleanup the stack
pop edi # and restore caller's EDI
ret
(with SSE enabled it would copy all 4 initializers with movdqa load/store)
Why doesn't GCC do lea edi, [esp+16] and memset (with rep stosd) only the last 96 elements, like Clang does? Is this a missed optimization, or is it somehow more efficient to do it this way? (Clang actually calls memset instead of inlining rep stos)
Editor's note: the question originally had un-optimized compiler output which worked the same way, but inefficient code at -O0 doesn't prove anything. But it turns out that this optimization is missed by GCC even at -O3.
Passing a pointer to a to a non-inline function would be another way to force the compiler to materialize a[], but in 32-bit code that leads to significant clutter of the asm. (Stack args result in pushes, which gets mixed in with stores to the stack to init the array.)
Using volatile int a[100]{1,2,3,4} gets GCC to create and then copy the array, which is insane. Normally volatile is good for looking at how compilers init local variables or lay them out on the stack.
In theory your initialization could look like this:
int a[100] = {
    [3] = 1,
    [5] = 42,
    [88] = 1,
};
so it may be more efficient, in terms of cache behaviour and optimizability, to zero out the whole memory block first and then set the individual values.
The behavior may change depending on:
target architecture
target OS
array length
initialization ratio (explicitly initialized values/length)
positions of the initialized values
Of course, in your case the initializers are compacted at the start of the array and the optimization would be trivial.
So it seems that gcc is taking the most generic approach here. Looks like a missed optimization.
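In other words, gcc's output behaves as if the source had been written like this (a hand-written equivalent for illustration, not the compiler's actual transformation; clang's approach would memset only the elements after the first four):

#include <string.h>

void *sink;

void bar_equivalent()
{
    int a[100];
    memset(a, 0, sizeof(a));                 // what the rep stosd is doing
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;  // then overwrite the start
    sink = a;                                // a escapes, as in bar()
    asm("":::"memory");                      // same compiler memory barrier
}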
My question is about what the compiler is doing in this case that optimizes the code way more than what I would think is possible.
Given this enum:
enum MyEnum {
    Entry1,
    Entry2,
    ... // Entry3..27 are the same, omitted for size.
    Entry28,
    Entry29
};
And this function:
bool MyFunction(MyEnum e)
{
    if (
        e == MyEnum::Entry1 ||
        e == MyEnum::Entry3 ||
        e == MyEnum::Entry8 ||
        e == MyEnum::Entry14 ||
        e == MyEnum::Entry15 ||
        e == MyEnum::Entry18 ||
        e == MyEnum::Entry21 ||
        e == MyEnum::Entry22 ||
        e == MyEnum::Entry25)
    {
        return true;
    }
    return false;
}
For the function, MSVC generates this assembly when compiled with -Ox optimization flag (Godbolt):
bool MyFunction(MyEnum) PROC ; MyFunction
cmp ecx, 24
ja SHORT $LN5#MyFunction
mov eax, 20078725 ; 01326085H
bt eax, ecx
jae SHORT $LN5#MyFunction
mov al, 1
ret 0
$LN5#MyFunction:
xor al, al
ret 0
Clang generates similar (slightly better, one less jump) assembly when compiled with -O3 flag:
MyFunction(MyEnum): # #MyFunction(MyEnum)
cmp edi, 24
ja .LBB0_2
mov eax, 20078725
mov ecx, edi
shr eax, cl
and al, 1
ret
.LBB0_2:
xor eax, eax
ret
What is happening here? I see that even if I add more enum comparisons to the function, the assembly that is generated does not actually become "more", it's only this magic number (20078725) that changes. That number depends on how many enum comparisons are happening in the function. I do not understand what is happening here.
The reason why I am looking at this is that I was wondering if it is good to write the function as above, or alternatively like this, with bitwise comparisons:
bool MyFunction2(MyEnum e)
{
if (
e == MyEnum::Entry1 |
e == MyEnum::Entry3 |
e == MyEnum::Entry8 |
e == MyEnum::Entry14 |
e == MyEnum::Entry15 |
e == MyEnum::Entry18 |
e == MyEnum::Entry21 |
e == MyEnum::Entry22 |
e == MyEnum::Entry25)
{
return true;
}
return false;
}
This results in this generated assembly with MSVC:
bool MyFunction2(MyEnum) PROC ; MyFunction2
xor edx, edx
mov r9d, 1
cmp ecx, 24
mov eax, edx
mov r8d, edx
sete r8b
cmp ecx, 21
sete al
or r8d, eax
mov eax, edx
cmp ecx, 20
cmove r8d, r9d
cmp ecx, 17
sete al
or r8d, eax
mov eax, edx
cmp ecx, 14
cmove r8d, r9d
cmp ecx, 13
sete al
or r8d, eax
cmp ecx, 7
cmove r8d, r9d
cmp ecx, 2
sete dl
or r8d, edx
test ecx, ecx
cmove r8d, r9d
test r8d, r8d
setne al
ret 0
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
Quite smart! The first comparison with 24 does a rough range check - if e is above 24 (or negative, since the comparison is unsigned) it bails out; this matters because the instructions that follow, which operate on the magic number, can only handle bit positions in the range [0, 31].
For the rest, the magic number is just a bitmask, with the bits corresponding to the "good" values set.
>>> bin(20078725)
'0b1001100100110000010000101'
It's easy to spot the 1st and 3rd bits set (counting from 1, right to left), then the 8th, the 14th and 15th, ...
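One way to see this is to build the constant yourself. A small constexpr sketch (a hypothetical helper, using the zero-based values of the enumerators named in the question):

#include <initializer_list>

// OR together 1<<v for each "true" enumerator value.
constexpr unsigned make_mask() {
    unsigned m = 0;
    for (int v : {0, 2, 7, 13, 14, 17, 20, 21, 24}) // Entry1, Entry3, Entry8, ...
        m |= 1u << v;
    return m;
}
static_assert(make_mask() == 20078725, "matches the compiler's magic number");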
MSVC checks it "directly" using the BT (bit test) instruction and branching; clang instead shifts the mask right by the appropriate amount (to move the relevant bit into the lowest position) and keeps just that bit by ANDing with 1 (avoiding a branch).
The C code corresponding to the clang version would be something like:
bool MyFunction(MyEnum e) {
    if(unsigned(e) > 24) return false;
    return (20078725 >> e) & 1;
}
as for the MSVC version, it's more like
inline bool bit_test(unsigned val, int bit) {
    return val & (1 << bit);
}

bool MyFunction(MyEnum e) {
    if(unsigned(e) > 24) return false;
    return bit_test(20078725, e);
}
(I kept the bit_test function separate to emphasize that it's actually a single instruction in assembly; that val & (1<<bit) expression has no direct correspondence in the original assembly.)
As for the if-based code, it's quite bad - it uses a lot of CMOV and ORs the results together, which is both longer code, and will probably serialize execution. I suspect the corresponding clang code will be better. OTOH, you wrote this code using bitwise OR (|) instead of the more semantically correct logical OR (||), and the compiler is strictly following your orders (typical of MSVC).
Another possibility to try instead could be a switch - but I don't think there's much to gain compared to the code already generated for the first snippet, which looks pretty good to me.
Ok, doing a quick test with all the versions against all compilers, we can see that:
the C translation of the CLang output above results in pretty much the same code (equal to the clang output) in all compilers; similarly for the MSVC translation;
the bitwise or version is the same as the logical or version (= good) in both CLang and gcc;
in general, gcc does essentially the same thing as CLang except for the switch case;
switch results are varied:
CLang does best, by generating the exact same code;
both gcc and MSVC generate jump-table based code, which in this case is less good; however:
gcc prefers to emit a table of QWORDs, trading size for simplicity of the setup code;
MSVC instead emits a table of BYTEs, paying it in setup code size; I couldn't get gcc to emit similar code even changing -O3 to -Os (optimize for size).
Ah, the old immediate bitmap trick.
GCC does this too, at least for a switch; see this x86 asm casetable implementation. Unfortunately GCC9 has a regression for some cases: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026#c3 ; GCC8 and earlier do a better job.
Another example of using it, this time for code-golf (fewest bytes of code, in this case x86 machine code) to detect certain letters: User Appreciation Challenge #1: Dennis ♦
The basic idea is to use the input as an index into a bitmap of true/false results.
First you have to range-check, because the bitmap is fixed-width and x86 shifts wrap the shift count; we don't want high inputs to alias into the range where some values should return true. That's what the cmp edi, 24 / ja is doing.
(If the range between the lowest and highest true values was from 120 to 140, for example, it might start with a sub edi,120 to range-shift everything before the cmp.)
Then you use bitmap & (1<<e) (the bt instruction), or (bitmap >> e) & 1 (shr / and) to check the bit in the bitmap that tells you whether that e value should return true or false.
There are many ways to implement that check, logically equivalent but with performance differences.
If the range was wider than 32, it would have to use 64-bit operand-size. If it was wider than 64, the compiler might not attempt this optimization at all. Or might still do it for some of the conditions that are in a narrow range.
Using an even larger bitmap (in .rodata memory) would be possible but probably not something most compilers will invent for you. Either with bt [mem],reg (inefficient) or manually indexing a dword and checking that the same way this code checks the immediate bitmap. If you had a lot of high-entropy ranges it might be worth checking 2x 64-bit immediate bitmap, branchy or branchless...
Clang/LLVM has other tricks up its sleeve for efficiently comparing against multiple values (when it doesn't matter which one is hit), e.g. broadcast a value into a SIMD register and use a packed compare. That isn't dependent on the values being in a dense range. (Clang generates worse code for 7 comparisons than for 8 comparisons)
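A sketch of that SIMD trick with SSE2 intrinsics (hypothetical candidate values, just to show the broadcast-and-compare shape):

#include <emmintrin.h>

bool any_of_four(int x)
{
    __m128i v  = _mm_set1_epi32(x);              // broadcast x to all 4 lanes
    __m128i k  = _mm_setr_epi32(3, 17, 42, 99);  // the four candidate values
    __m128i eq = _mm_cmpeq_epi32(v, k);          // per-lane equality: all-ones on match
    return _mm_movemask_epi8(eq) != 0;           // did any lane match?
}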
that optimizes the code way more than what I would think is possible.
These kinds of optimizations come from smart human compiler developers that notice common patterns in source code and think of clever ways to implement them. Then get compilers to recognize those patterns and transform their internal representation of the program logic to use the trick.
It turns out that switch and switch-like if() statements are common, so aggressive optimizations of them are common, too.
Compilers are far from perfect, but sometimes they do come close to living up to what people often claim; that compilers will optimize your code for you so you can write it in a human-readable way and still have it run near-optimally. This is sometimes true over the small scale.
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
The immediate bitmap is vastly more efficient. There's no data memory access in either one, so no cache-miss loads. The only "expensive" instruction is a variable-count shift (3 uops on mainstream Intel, because of x86's annoying FLAGS-setting semantics; BMI2 shrx is only 1 uop and avoids having to mov the count into ecx). https://agner.org/optimize/. And see other performance analysis links in https://stackoverflow.com/tags/x86/info.
Each instruction in the cmp/cmov chain is at least 1 uop, and there's a pretty long dependency chain through each cmov because MSVC didn't bother to break it into 2 or more parallel chains. But regardless it's just a lot of uops, far more than the bitmap version, so worse for throughput (ability for out-of-order exec to overlap the work with surrounding code) as well as latency.
bt is also cheap: 1 uop on modern AMD and Intel. (bts, btr, btc are 2 on AMD, still 1 on Intel).
The branch in the immediate-bitmap version could have been a setna / and to make it branchless, but especially for this enum definition the compiler expected that it would be in range. It could have increased branch predictability by only requiring e <= 31, not e <= 24.
Since the enum only goes up to 29, and IIRC it's UB to have out-of-range enum values, it could actually optimize it away entirely.
Even if the e > 24 branch doesn't predict very well, it's still probably better overall. Given current compilers, we only get a choice between the nasty chain of cmp/cmov or branch + bitmap. Unless we turn the asm logic back into C to hand-hold compilers into making the asm we want, in which case we can maybe get a branchless version with an AND or CMOV that makes the result always zero for out-of-range e.
But if we're lucky, profile-guided optimization might let some compilers make the bitmap range check branchless. (In asm the behaviour of shl reg, cl with cl > 31 or 63 is well-defined: on x86 it simply masks the count. In a C equivalent, you could use bitmap >> (e&31) which can still optimize to a shr; compilers know that x86 shr masks the count so they can optimize that away. But not for other ISAs that saturate the shift count...)
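For illustration, a branchless C version along those lines (a sketch; whether it beats the branch depends on predictability):

// Mask the shift count (free on x86) and fold the range check in with an
// AND instead of a branch.
bool MyFunction_branchless(unsigned e)
{
    unsigned in_range = (e <= 24);   // 0 or 1; compilers can use setcc here
    return ((20078725u >> (e & 31)) & 1u & in_range) != 0;
}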
There are lots of ways to implement the bitmap check that are pretty much equivalent. e.g. you could even use the CF output of shr, set according to the last bit shifted out. At least if you make sure CF has a known state ahead of time for the cl=0 case.
When you want an integer bool result, right-shifting seems to make more sense than bt / setcc, but with shr costing 3 uops on Intel it might actually be best to use bt reg,reg / setc al. Especially if you only need a bool, and can use EAX as your bitmap destination so the previous value of EAX is definitely ready before setcc. (Avoiding a false dependency on some unrelated earlier dep chain.)
BTW, MSVC has other silliness: as What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains, xor al,al is totally stupid compared to xor eax,eax when you want to zero AL. If you don't need to leave the upper bytes of RAX unmodified, zero the full register with a zeroing idiom.
And of course branching just to return 0 or return 1 makes little sense, unless you expect it to be very predictable and want to break the data dependency. I'd expect that setc al would make more sense, to read the CF result of bt.
I want to measure the speed in which my PC can increment a counter N times (e.g., for N = 10^9).
I tried the following code:
using namespace std;

auto start = chrono::steady_clock::now();
for (int i = 0; i < N; ++i)
{
}
auto end = chrono::steady_clock::now();
However, the compiler is smart enough to simply set i=N, and I get that start==end regardless of the value of N.
How can I change the code to measure the increment speed? (adding costly operations in the loop would dominate the runtime and would not allow the measurement to be correct).
I use Windows 10 and Visual Studio 15.9.7.
A bit of motivation: my code takes about 2 seconds for N=10^9. I'm wondering if there's any "meat" left in optimizing it further (e.g., could it possibly go down to 1 sec? or would the loop itself require more?)
This question doesn't really make sense in C or C++. The compiler aims to generate the fastest code that meets the constraints defined by your source code. In your question, you do not define a constraint that the compiler must do a loop at all. Because the loop has no effect, the optimizer will remove it.
Gabriel Staple's answer is probably the nearest thing you can get to a sensible answer to your question, but it is also not quite right, because it defines too many constraints that limit the compiler's freedom to implement optimal code. volatile forces the compiler to write the result back to memory each time the variable is modified.
e.g., this code:

void foo(int N) {
    for (volatile int i = 0; i < N; ++i)
    {
    }
}
Becomes this assembly (on an x64 compiler I tried):
mov DWORD PTR [rsp-4], 0
mov eax, DWORD PTR [rsp-4]
cmp edi, eax
jle .L1
.L3:
mov eax, DWORD PTR [rsp-4] # Read i from mem
add eax, 1 # i++
mov DWORD PTR [rsp-4], eax # Write i to mem
mov eax, DWORD PTR [rsp-4] # Read it back again before
# evaluating the loop condition.
cmp eax, edi # Is i < N?
jl .L3 # If so, jump back to .L3.
.L1:
It sounds like your real question is more like how fast is:
L1: add eax, 1
jmp L1
Even the answer to that is complex and requires an understanding of the internals of your CPU's pipelines.
I recommend playing with Godbolt to understand more about what the compiler is doing. eg https://godbolt.org/z/59XUSu
You can directly measure the speed of the "empty loop", but it is not easy to convince a C++ compiler to emit it. GCC and Clang can be tricked with asm volatile(""), but MSVC inline assembly has always been different, and it is disabled completely for 64-bit programs.
It is possible to use MASM to side-step that restriction:
.MODEL FLAT
.CODE
_testfun PROC
    sub ecx, 1
    jnz _testfun
    ret
_testfun ENDP
END
Import it into your code with extern "C" void testfun(unsigned N);.
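A sketch of the timing harness around it (assuming the MASM routine above is assembled and linked in under that name):

#include <chrono>
#include <cstdio>

extern "C" void testfun(unsigned N);   // the MASM loop above

int main()
{
    const unsigned N = 1000000000;     // 10^9 decrements
    auto start = std::chrono::steady_clock::now();
    testfun(N);
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> sec = end - start;
    std::printf("%f s total, %f ns per iteration\n",
                sec.count(), 1e9 * sec.count() / N);
}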
Try volatile int i = 0 in your for loop. The volatile keyword tells the compiler this variable could change at any time, due to outside events or threads, so it can't make the same assumptions about what the variable will hold in the future.
I'm trying to work with SSE and I ran into some strange behaviour.
I wrote simple code comparing two strings with SSE intrinsics, ran it, and it worked. But later I realized that in my code one of the pointers is still not aligned, even though I use the _mm_load_si128 instruction, which requires a pointer aligned on a 16-byte boundary.
//Compare two different, non-overlapping pieces of memory
__attribute((target("avx"))) int is_equal(const void* src_1, const void* src_2, size_t size)
{
    //Scalar head: advance until [head_1] is 16-byte aligned
    const char* head_1 = (const char*)src_1;
    const char* head_2 = (const char*)src_2;
    size_t tail_n = 0;
    while (((uintptr_t)head_1 % 16) != 0 && tail_n < size)
    {
        if (*head_1 != *head_2)
            return 0;
        head_1++, head_2++, tail_n++;
    }

    //Vectorized part: check equality of memory with SSE4.1 instructions
    //src1 - aligned, src2 - NOT aligned
    const __m128i* src1 = (const __m128i*)head_1;
    const __m128i* src2 = (const __m128i*)head_2;
    const size_t n = (size - tail_n) / 32;
    for (size_t i = 0; i < n; ++i, src1 += 2, src2 += 2)
    {
        printf("src1 align: %d, src2 align: %d\n",
               (int)((uintptr_t)src1 % 16), (int)((uintptr_t)src2 % 16));
        __m128i mm11 = _mm_load_si128(src1);
        __m128i mm12 = _mm_load_si128(src1 + 1);
        __m128i mm21 = _mm_load_si128(src2);
        __m128i mm22 = _mm_load_si128(src2 + 1);
        __m128i mm1 = _mm_xor_si128(mm11, mm21);
        __m128i mm2 = _mm_xor_si128(mm12, mm22);
        __m128i mm = _mm_or_si128(mm1, mm2);
        if (!_mm_testz_si128(mm, mm))
            return 0;
    }

    //Check the tail with scalar instructions
    const size_t rem = (size - tail_n) % 32;
    const char* tail_1 = (const char*)src1;
    const char* tail_2 = (const char*)src2;
    for (size_t i = 0; i < rem; i++, tail_1++, tail_2++)
    {
        if (*tail_1 != *tail_2)
            return 0;
    }
    return 1;
}
I printed the alignment of the two pointers: one of them was aligned, but the second wasn't. Yet the program still ran correctly and fast.
Then I created a synthetic test like this:
//printChars128(...) just prints the 16 byte values from a __m128i
const __m128i* A = (const __m128i*)buf;
const __m128i* B = (const __m128i*)(buf + rand() % 15 + 1);
for (int i = 0; i < 5; i++, A++, B++)
{
    __m128i A1 = _mm_load_si128(A);
    __m128i B1 = _mm_load_si128(B);
    printChars128(A1);
    printChars128(B1);
}
And it crashes, as expected, on the first iteration, when it tries to load through pointer B.
An interesting fact: if I switch the target to sse4.2, my implementation of is_equal crashes.
Another interesting fact: if I align the second pointer instead of the first (so the first pointer is unaligned and the second is aligned), is_equal also crashes.
So, my question is: why does the is_equal function work fine with only the first pointer aligned, if I enable AVX instruction generation?
UPD: This is C++ code. I compile it with MinGW64/g++, gcc version 4.9.2, under Windows, x86.
Compile string: g++.exe main.cpp -Wall -Wextra -std=c++11 -O2 -Wcast-align -Wcast-qual -o main.exe
TL:DR: Loads from _mm_load_* intrinsics can be folded (at compile time) into memory operands to other instructions. The AVX versions of vector instructions don't require alignment for memory operands, except for specifically-aligned load/store instructions like vmovdqa.
In the legacy SSE encoding of vector instructions (like pxor xmm0, [src1]) , unaligned 128 bit memory operands will fault except with the special unaligned load/store instructions (like movdqu / movups).
The VEX-encoding of vector instructions (like vpxor xmm1, xmm0, [src1]) doesn't fault with unaligned memory, except with the alignment-required load/store instructions (like vmovdqa, or vmovntdq).
The _mm_loadu_si128 vs. _mm_load_si128 (and store/storeu) intrinsics communicate alignment guarantees to the compiler, but don't force it to actually emit a stand-alone load instruction (or anything at all, if it already has the data in a register, just like when dereferencing a scalar pointer).
The as-if rule still applies when optimizing code that uses intrinsics. A load can be folded into a memory operand for the vector-ALU instruction that uses it, as long as that doesn't introduce the risk of a fault. This is advantageous for code-density reasons, and also fewer uops to track in parts of the CPU thanks to micro-fusion (see Agner Fog's microarch.pdf). The optimization pass that does this isn't enabled at -O0, so an unoptimized build of your code probably would have faulted with unaligned src1.
(Conversely, this means _mm_loadu_* can only fold into a memory operand with AVX, not with SSE. So even on CPUs where movdqu is as fast as movdqa when the pointer does happen to be aligned, _mm_loadu can hurt performance, because movdqu xmm1, [rsi] / pxor xmm0, xmm1 is 2 fused-domain uops for the front-end to issue while pxor xmm0, [rsi] is only 1, and the latter doesn't need a scratch register. See also Micro fusion and addressing modes.)
The interpretation of the as-if rule in this case is that it's ok for the program to not fault in some cases where the naive translation into asm would have faulted. (Or for the same code to fault in an un-optimized build but not fault in an optimized build).
This is opposite from the rules for floating-point exceptions, where the compiler-generated code must still raise any and all exceptions that would have occurred on the C abstract machine. That's because there are well-defined mechanisms for handling FP exceptions, but not for handling segfaults.
Note that since stores can't fold into memory operands for ALU instructions, store (not storeu) intrinsics will compile into code that faults with unaligned pointers even when compiling for an AVX target.
To be specific: consider this code fragment:
// aligned version:
y = ...; // assume it's in xmm1
x = _mm_load_si128(Aptr); // Aligned pointer
res = _mm_or_si128(y, x);
// unaligned version: the same thing with _mm_loadu_si128(Uptr)
When targeting SSE (code that can run on CPUs without AVX support), the aligned version can fold the load into por xmm1, [Aptr], but the unaligned version has to use movdqu xmm0, [Uptr] / por xmm0, xmm1. The aligned version might do that too, if the old value of y is still needed after the OR.
When targeting AVX (gcc -mavx, or gcc -march=sandybridge or later), all vector instructions emitted (including 128 bit) will use the VEX encoding. So you get different asm from the same _mm_... intrinsics. Both versions can compile into vpor xmm0, xmm1, [ptr]. (And the 3-operand non-destructive feature means that this actually happens except when the original value loaded is used multiple times).
Only one operand to ALU instructions can be a memory operand, so in your case one has to be loaded separately. Your code faults when the first pointer isn't aligned, but doesn't care about alignment for the second, so we can conclude that gcc chose to load the first operand with vmovdqa and fold the second, rather than vice-versa.
You can see this happen in practice in your code on the Godbolt compiler explorer. Unfortunately gcc 4.9 (and 5.3) compile it to somewhat sub-optimal code that generates the return value in al and then tests it, instead of just branching on the flags from vptest :( clang-3.8 does a significantly better job.
.L36:
add rdi, 32
add rsi, 32
cmp rdi, rcx
je .L9
.L10:
vmovdqa xmm0, XMMWORD PTR [rdi] # first arg: loads that will fault on unaligned
xor eax, eax
vpxor xmm1, xmm0, XMMWORD PTR [rsi] # second arg: loads that don't care about alignment
vmovdqa xmm0, XMMWORD PTR [rdi+16] # first arg
vpxor xmm0, xmm0, XMMWORD PTR [rsi+16] # second arg
vpor xmm0, xmm1, xmm0
vptest xmm0, xmm0
sete al # generate a boolean in a reg
test eax, eax
jne .L36 # then test&branch on it. /facepalm
Note that your is_equal is memcmp. I think glibc's memcmp will do better than your implementation in many cases, since it has hand-written asm versions for SSE4.1 and others which handle various cases of the buffers being misaligned relative to each other. (e.g. one aligned, one not.) Note that glibc code is LGPLed, so you might not be able to just copy it. If your use-case has smaller buffers that are typically aligned, your implementation is probably good. Not needing a VZEROUPPER before calling it from other AVX code is also nice.
The compiler-generated byte-loop to clean up at the end is definitely sub-optimal. If the size is bigger than 16 bytes, do an unaligned load that ends at the last byte of each src. It doesn't matter that you re-compare some bytes you've already checked.
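A sketch of that overlapping-final-vector idea, in the style of the question's code (assumes size >= 16):

#include <smmintrin.h>   // SSE4.1 for _mm_testz_si128
#include <stddef.h>

// Compare the last 16 bytes of each buffer, ending exactly at the last byte.
int tails_equal(const char *a, const char *b, size_t size)
{
    __m128i va = _mm_loadu_si128((const __m128i*)(a + size - 16));
    __m128i vb = _mm_loadu_si128((const __m128i*)(b + size - 16));
    __m128i x  = _mm_xor_si128(va, vb);
    return _mm_testz_si128(x, x);    // 1 if all 16 bytes matched
}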
Anyway, definitely benchmark your code with the system memcmp. Besides the library implementation, gcc knows what memcmp does and has its own builtin definition that it can inline code for.
The x86-64 ABI specifies two return registers: rax and rdx, both 64-bits (8 bytes) in size.
Assuming that x86-64 is the only targeted platform, which of these two functions:
uint64_t f(uint64_t * const secondReturnValue) {
    /* Calculate a and b. */
    *secondReturnValue = b;
    return a;
}

std::pair<uint64_t, uint64_t> g() {
    /* Calculate a and b, same as in f() above. */
    return { a, b };
}
would yield better performance, given the current state of C/C++ compilers targeting x86-64? Are there any pitfalls performance-wise using one or the other version? Are compilers (GCC, Clang) always able to optimize the std::pair to be returned in rax and rdx?
UPDATE: Generally, returning a pair is faster if the compiler optimizes out the std::pair methods (examples of binary output with GCC 5.3.0 and Clang 3.8.0). If f() is not inlined, the compiler must generate code to write a value to memory, e.g.:
movq b, (%rdi)
movq a, %rax
retq
But in case of g() it suffices for the compiler to do:
movq a, %rax
movq b, %rdx
retq
Because instructions for writing values to memory are generally slower than instructions for writing values to registers, the second version should be faster.
Since the ABI specifies that in some particular cases two registers have to be used for a 2-word result, any conforming compiler has to obey that rule.
However, for such tiny functions I guess that most of the performance will come from inlining.
You may want to compile and link with g++ -flto -O2 using link-time optimizations.
I guess that the second function (returning a pair through 2 registers) might be slightly faster, and that perhaps in some situations the GCC compiler could inline and optimize the first into the second.
But you really should benchmark if you care that much.
Note that the ABI specifies packing any small struct into registers for passing/returning (if it contains only integer types). This means that returning a std::pair<uint32_t, uint32_t> means the values have to be shift+ORed into rax.
This is probably still better than a round trip through memory, because setting up space for a pointer, and passing that pointer as an extra arg, has some overhead. (Other than that, though, a round-trip through L1 cache is pretty cheap, like ~5c latency. The store/load are almost certainly going to hit in L1 cache, because stack memory is used all the time. Even if it misses, store-forwarding can still happen, so execution doesn't stall until the ROB fills because the store can't retire. See Agner Fog's microarch guide and other stuff at the x86 tag wiki.)
Anyway, here's the kind of code you get from gcc 5.3 -O2, using functions that take args instead of returning compile-time constant values (which would lead to movabs rax, 0x...):
#include <cstdint>
#include <utility>

#define type_t uint32_t

type_t f(type_t * const secondReturnValue, type_t x) {
    *secondReturnValue = x+4;
    return x+2;
}
lea eax, [rsi+4] # LEA is an add-and-shift instruction that uses memory-operand syntax and encoding
mov DWORD PTR [rdi], eax
lea eax, [rsi+2]
ret
std::pair<type_t, type_t> g(type_t x) { return {x+2, x+4}; }
lea eax, [rdi+4]
lea edx, [rdi+2]
sal rax, 32
or rax, rdx
ret
type_t use_pair(std::pair<type_t, type_t> pair) {
    return pair.second + pair.first;
}
mov rax, rdi
shr rax, 32
add eax, edi
ret
So it's really not bad at all. Two or three insns in the caller and callee to pack and unpack a pair of uint32_t values. Nowhere near as good as returning a pair of uint64_t values, though.
If you're specifically optimizing for x86-64, and care what happens for non-inlined functions with multiple return values, then prefer returning std::pair<uint64_t, uint64_t> (or int64_t, obviously), even if you assign those pairs to narrower integers in the caller. Note that in the x32 ABI (-mx32), pointers are only 32 bits. Don't assume pointers are 64-bit when optimizing for x86-64, if you care about that ABI.
If either member of the pair is 64bit, they use separate registers. It doesn't do anything stupid like splitting one value between the high half of one reg and the low half of another.