how to force the use of cmov in gcc and VS - c++

I have this simple binary search member function, where lastIndex, nIter and xi are class members:
uint32 scalar(float z) const
{
uint32 lo = 0;
uint32 hi = lastIndex;
uint32 n = nIter;
while (n--) {
int mid = (hi + lo) >> 1;
// defining this if-else assignment as below cause VS2015
// to generate two cmov instructions instead of a branch
if( z < xi[mid] )
hi = mid;
if ( !(z < xi[mid]) )
lo = mid;
}
return lo;
}
Both gcc and VS 2015 translate the inner loop with a code flow branch:
000000013F0AA778 movss xmm0,dword ptr [r9+rax*4]
000000013F0AA77E comiss xmm0,xmm1
000000013F0AA781 jbe Tester::run+28h (013F0AA788h)
000000013F0AA783 mov r8d,ecx
000000013F0AA786 jmp Tester::run+2Ah (013F0AA78Ah)
000000013F0AA788 mov edx,ecx
000000013F0AA78A mov ecx,r8d
Is there a way, without writing assembler inline, to convince them to use exactly 1 comiss instruction and 2 cmov instructions?
If not, can anybody suggest how to write a gcc assembler template for this?
Please note that I am aware that there are variations of the binary search algorithm where it is easy for the compiler to generate branch free code, but this is beside the question.
Thanks

As Matteo Italia already noted, this avoidance of conditional-move instructions is a quirk of GCC version 6. What he didn't notice, though, is that it applies only when optimizing for Intel processors.
With GCC 6.3, when targeting AMD processors (i.e., -march= any of k8, k10, opteron, amdfam10, btver1, bdver1, btver2, btver2, bdver3, bdver4, znver1, and possibly others), you get exactly the code you want:
mov esi, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
jmp .L2
.L7:
lea edx, [rax+rsi]
mov r8, QWORD PTR [rdi+8]
shr edx
mov r9d, edx
movss xmm1, DWORD PTR [r8+r9*4]
ucomiss xmm1, xmm0
cmovbe eax, edx
cmova esi, edx
.L2:
dec ecx
cmp ecx, -1
jne .L7
rep ret
When optimizing for any generation of Intel processor, GCC 6.3 avoids conditional moves, preferring an explicit branch:
mov r9d, DWORD PTR [rdi]
mov ecx, DWORD PTR [rdi+4]
xor eax, eax
.L2:
sub ecx, 1
cmp ecx, -1
je .L6
.L8:
lea edx, [rax+r9]
mov rsi, QWORD PTR [rdi+8]
shr edx
mov r8d, edx
vmovss xmm1, DWORD PTR [rsi+r8*4]
vucomiss xmm1, xmm0
ja .L4
sub ecx, 1
mov eax, edx
cmp ecx, -1
jne .L8
.L6:
ret
.L4:
mov r9d, edx
jmp .L2
The likely justification for this optimization decision is that conditional moves are fairly inefficient on Intel processors. CMOV has a latency of 2 clock cycles on Intel processors compared to a 1-cycle latency on AMD. Additionally, while CMOV instructions are decoded into multiple µops (at least two, with no opportunity for µop fusion) on Intel processors because of the requirement that a single µop has no more than two input dependencies (a conditional move has at least three: the two operands and the condition flag), AMD processors can implement a CMOV with a single macro-operation since their design has no such limits on the input dependencies of a single macro-op. As such, the GCC optimizer is replacing branches with conditional moves only on AMD processors, where it might be a performance win—not on Intel processors and not when tuning for generic x86.
(Or, maybe the GCC devs just read Linus's infamous rant. :-)
Intriguingly, though, when you tell GCC to tune for the Pentium 4 processor (and you can't do this for 64-bit builds for some reason—GCC tells you that this architecture doesn't support 64-bit, even though there were definitely P4 processors that implemented EMT64), you do get conditional moves:
push edi
push esi
push ebx
mov esi, DWORD PTR [esp+16]
fld DWORD PTR [esp+20]
mov ebx, DWORD PTR [esi]
mov ecx, DWORD PTR [esi+4]
xor eax, eax
jmp .L2
.L8:
lea edx, [eax+ebx]
shr edx
mov edi, DWORD PTR [esi+8]
fld DWORD PTR [edi+edx*4]
fucomip st, st(1)
cmovbe eax, edx
cmova ebx, edx
.L2:
sub ecx, 1
cmp ecx, -1
jne .L8
fstp st(0)
pop ebx
pop esi
pop edi
ret
I suspect this is because branch misprediction is so expensive on Pentium 4, due to its extremely long pipeline, that the possibility of a single mispredicted branch outweighs any minor gains you might get from breaking loop-carried dependencies and the tiny amount of increased latency from CMOV. Put another way: mispredicted branches got a lot slower on P4, but the latency of CMOV didn't change, so this biases the equation in favor of conditional moves.
Tuning for later architectures, from Nocona to Haswell, GCC 6.3 goes back to its strategy of preferring branches over conditional moves.
So, although this looks like a major pessimization in the context of a tight inner loop (and it would look that way to me, too), don't be so quick to dismiss it out of hand without a benchmark to back up your assumptions. Sometimes, the optimizer is not as dumb as it looks. Remember, the advantage of a conditional move is that it avoids the penalty of branch mispredictions; the disadvantage of a conditional move is that it increases the length of a dependency chain and may require additional overhead because, on x86, only register→register or memory→register conditional moves are allowed (no constant→register). In this case, everything is already enregistered, but there is still the length of the dependency chain to consider. Agner Fog, in his Optimizing Subroutines in Assembly Language, gives us the following rule of thumb:
[W]e can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation ... when the other operand is chosen.
The second part of that doesn't apply here, but the first does. There is definitely a loop-carried dependency chain here, and unless you get into a really pathological case that disrupts branch prediction (which normally has a >90% accuracy), branching may actually be faster. In fact, Agner Fog continues:
Loop-carried dependency chains are particularly sensitive to the disadvantages of conditional moves. For example, [this code]
// Example 12.16a. Calculate pow(x,n) where n is a positive integer
double x, xp, power;
unsigned int n, i;
xp=x; power=1.0;
for (i = n; i != 0; i >>= 1) {
if (i & 1) power *= xp;
xp *= xp;
}
works more efficiently with a branch inside the loop than with a conditional move, even if the branch is poorly predicted. This is because the floating point conditional move adds to the loop-carried dependency chain and because the implementation with a conditional move has to calculate all the power*xp values, even when they are not used.
Another example of a loop-carried dependency chain is a binary search in a sorted list. If the items to search for are randomly distributed over the entire list then the branch prediction rate will be close to 50% and it will be faster to use conditional moves. But if the items are often close to each other so that the prediction rate will be better, then it is more efficient to use conditional jumps than conditional moves because the dependency chain is broken every time a correct branch prediction is made.
If the items in your list are actually random or close to random, then you'll be the victim of repeated branch-prediction failure, and conditional moves will be faster. Otherwise, in what is probably the more common case, branch prediction will succeed >75% of the time, such that you will experience a performance win from branching, as opposed to a conditional move that would extend the dependency chain.
It's hard to reason about this theoretically, and it's even harder to guess correctly, so you need to actually benchmark it with real-world numbers.
If your benchmarks confirm that conditional moves really would be faster, you have a couple of options:
Upgrade to a later version of GCC, like 7.1, that generate conditional moves in 64-bit builds even when targeting Intel processors.
Tell GCC 6.3 to optimize your code for AMD processors. (Maybe even just having it optimize one particular code module, so as to minimize the global effects.)
Get really creative (and ugly and potentially non-portable), writing some bit-twiddling code in C that does the comparison-and-set operation branchlessly. This might get the compiler to emit a conditional-move instruction, or it might get the compiler to emit a series of bit-twiddling instructions. You'd have to check the output to be sure, but if your goal is really just to avoid branch misprediction penalties, then either will work.
For example, something like this:
inline uint32 ConditionalSelect(bool condition, uint32 value1, uint32 value2)
{
const uint32 mask = condition ? static_cast<uint32>(-1) : 0;
uint32 result = (value1 ^ value2); // get bits that differ between the two values
result &= mask; // select based on condition
result ^= value2; // condition ? value1 : value2
return result;
}
which you would then call inside of your inner loop like so:
hi = ConditionalSelect(z < xi[mid], mid, hi);
lo = ConditionalSelect(z < xi[mid], lo, mid);
GCC 6.3 produces the following code for this when targeting x86-64:
mov rdx, QWORD PTR [rdi+8]
mov esi, DWORD PTR [rdi]
test edx, edx
mov eax, edx
lea r8d, [rdx-1]
je .L1
mov r9, QWORD PTR [rdi+16]
xor eax, eax
.L3:
lea edx, [rax+rsi]
shr edx
mov ecx, edx
mov edi, edx
movss xmm1, DWORD PTR [r9+rcx*4]
xor ecx, ecx
ucomiss xmm1, xmm0
seta cl // <-- begin our bit-twiddling code
xor edi, esi
xor eax, edx
neg ecx
sub r8d, 1 // this one's not part of our bit-twiddling code!
and edi, ecx
and eax, ecx
xor esi, edi
xor eax, edx // <-- end our bit-twiddling code
cmp r8d, -1
jne .L3
.L1:
rep ret
Notice that the inner loop is entirely branchless, which is exactly what you wanted. It may not be quite as efficient as two CMOV instructions, but it will be faster than chronically mispredicted branches. (It goes without saying that GCC and any other compiler will be smart enough to inline the ConditionalSelect function, which allows us to write it out-of-line for readability purposes.)
However, what I would definitely not recommend is that you rewrite any part of the loop using inline assembly. All of the standard reasons apply for avoiding inline assembly, but in this instance, even the desire for increased performance isn't a compelling reason to use it. You're more likely to confuse the compiler's optimizer if you try to throw inline assembly into the middle of that loop, resulting in sub-par code worse than what you would have gotten otherwise if you'd just left the compiler to its own devices. You'd probably have to write the entire function in inline assembly to get good results, and even then, there could be spill-over effects from this when GCC's optimizer tried to inline the function.
What about MSVC? Well, different compilers have different optimizers and therefore different code-generation strategies. Things can start to get really ugly really quickly if you have your heart set on cajoling all target compilers to emit a particular sequence of assembly code.
On MSVC 19 (VS 2015), when targeting 32-bit, you can write the code the way you did to get conditional-move instructions. But this doesn't work when building a 64-bit binary: you get branches instead, just like with GCC 6.3 targeting Intel.
There is a nice solution, though, that works well: use the conditional operator. In other words, if you write the code like this:
hi = (z < xi[mid]) ? mid : hi;
lo = (z < xi[mid]) ? lo : mid;
then VS 2013 and 2015 will always emit CMOV instructions, whether you're building a 32-bit or 64-bit binary, whether you're optimizing for size (/O1) or speed (/O2), and whether you're optimizing for Intel (/favor:Intel64) or AMD (/favor:AMD64).
This does fail to result in CMOV instructions back on VS 2010, but only when building 64-bit binaries. If you needed to ensure that this scenario also generated branchless code, then you could use the above ConditionalSelect function.

As said in the comments, there's no easy way to force what you are asking, although it seems that recent (>4.4) versions of gcc already optimize it like you said. Edit: interestingly, the gcc 6 series seems to use a branch, unlike both the gcc 5 and gcc 7 series, which use two cmov.
The usual __builtin_expect probably cannot do much into pushing gcc to use cmov, given that cmov is generally convenient when it's difficult to predict the result of a comparison, while __builtin_expect tells the compiler what is the likely outcome - so you would be just pushing it in the wrong direction.
Still, if you find that this optimization is extremely important, your compiler version typically gets it wrong and for some reason you cannot help it with PGO, the relevant gcc assembly template should be something like:
__asm__ (
"comiss %[xi_mid],%[z]\n"
"cmovb %[mid],%[hi]\n"
"cmovae %[mid],%[lo]\n"
: [hi] "+r"(hi), [lo] "+r"(lo)
: [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
: "cc"
);
The used constraints are:
hi and lo are into the "write" variables list, with +r constraint as cmov can only work with registers as target operands, and we are conditionally overwriting just one of them (we cannot use =, as it implies that the value is always overwritten, so the compiler would be free to give us a different target register than the current one, and use it to refer to that variable after our asm block);
mid is in the "read" list, rm as cmov can take either a register or a memory operand as input value;
xi[mid] and z are in the "read" list;
z has the special x constraint that means "any SSE register" (required for ucomiss first operand);
xi[mid] has xm, as the second ucomiss operand allows a memory operator; given the choice between z and xi[mid], I chose the last one as a better candidate for being taken directly from memory, given that z is already in a register (due to the System V calling convention - and is going to be cached between iterations anyway) and xi[mid] is used just in this comparison;
cc (the FLAGS register) is in the "clobber" list - we do clobber the flags and nothing else.

Related

Order of assignment produces different assembly

This experiment was done using GCC 6.3. There are two functions where the only difference is in the order we assign the i32 and i16 in the struct. We assumed that both functions should produce the same assembly. However this is not the case. The "bad" function produces more instructions. Can anyone explain why this happens?
#include <inttypes.h>
union pack {
struct {
int32_t i32;
int16_t i16;
};
void *ptr;
};
static_assert(sizeof(pack)==8, "what?");
void *bad(const int32_t i32, const int16_t i16) {
pack p;
p.i32 = i32;
p.i16 = i16;
return p.ptr;
}
void *good(const int32_t i32, const int16_t i16) {
pack p;
p.i16 = i16;
p.i32 = i32;
return p.ptr;
}
...
bad(int, short):
movzx eax, si
sal rax, 32
mov rsi, rax
mov eax, edi
or rax, rsi
ret
good(int, short):
movzx eax, si
mov edi, edi
sal rax, 32
or rax, rdi
ret
The compiler flags were -O3 -fno-rtti -std=c++14
This is/was a missed optimization in GCC10.2 and earlier. It seems to already be fixed in current nightly builds of GCC, so no need to report a missed-optimization bug on GCC's bugzilla. (https://gcc.gnu.org/bugzilla/). It looks like it first appeared as a regression from GCC4.8 to GCC4.9. (Godbolt)
# GCC11-dev nightly build
# actually *better* than "good", avoiding a mov-elimination missed opt.
bad(int, short):
movzx esi, si # mov-elimination never works for 16->32 movzx
mov eax, edi # mov-elimination works between different regs
sal rsi, 32
or rax, rsi
ret
Yes, you'd generally expect C++ that implements the same logic basically the same way to compile to the same asm, as long as optimization is enabled, or at least hope so1. And generally you can hope that there are no pointless missed optimizations that waste instructions for no apparent reason (rather than simply picking a different implementation strategy), but unfortunately that's not always true either.
Writing different parts of the same object and then reading the whole object is tricky for compilers in general so it's not a shock to see different asm when you write different parts of the full object in a different order.
Note that there's nothing "smart" about the bad asm, it's just doing a redundant mov instruction. Having to take input in fixed registers and produce output in another specific hard register to satisfy the calling convention is something GCC's register allocator isn't amazing at: wasted mov missed optimizations like this are more common in tiny functions than when part of a larger function.
If you're really curious, you could dig into the GIMPLE and RTL internal representations that GCC transformed through to get here. (Godbolt has a GCC tree-dump pane to help with this.)
Footnote 1: Or at least hope that, but missed-optimization bugs do happen in real life. Report them when you spot them, in case it's something that GCC or LLVM devs can easily teach the optimizer to avoid. Compilers are complex pieces of machinery with multiple passes; often a corner case for one part of the optimizer just didn't used to happen until some other optimization pass changed to doing something else, exposing a poor end result for a case the author of that code wasn't thinking about when writing / tweaking it to improve some other case.
Note that there's no Undefined Behaviour here despite the complaints in comments: The GNU dialect of C and C++ defines the behaviour of union type-punning in C89 and C++, not just in C99 and later like ISO C does. Implementations are free to define the behaviour of anything that ISO C++ leaves undefined.
Well technically there is a read-uninitialized because the upper 2 bytes of the void* object haven't been written yet in pack p. But fixing it with pack p = {.ptr=0}; doesn't help. (And doesn't change the asm; GCC happened to already zero the padding because that's convenient).
Also note, both versions in the question are less efficient than possible:
(The bad output from GCC4.8 or GCC11-trunk avoiding the wasted mov looks optimal for that choice of strategy.)
mov edi,edi defeats mov-elimination on both Intel and AMD, so that instruction has 1 cycle latency instead of 0, and costs a back-end µop. Picking a different register to zero-extend into would be cheaper. We could even pick RSI after reading SI, but any call-clobbered register would work.
hand_written:
movzx eax, si # 16->32 can't be eliminated, only 8->32 and 32->32 mov
shl rax, 32
mov ecx, edi # zero-extend into a different reg with 0 latency
or rax, rcx
ret
Or if optimizing for code-size or throughput on Intel (low µop count, not low latency), shld is an option: 1 µop / 3c latency on Intel, but 6 µops on Zen (also 3c latency, though). (https://uops.info/ and https://agner.org/optimize/)
minimal_uops_worse_latency: # also more uops on AMD.
movzx eax, si
shl rdi, 32 # int32 bits to the top of RDI
shld rax, rdi, 32 # shift the high 32 bits of RDI into RAX.
ret
If your struct was ordered the other way, with the padding in the middle, you could do something involving mov ax, si to merge into RAX. That could be efficient on non-Intel, and on Haswell and later which don't do partial-register renaming except for high-8 regs like AH.
Given the read-uninitialized UB, you could just compile it to literally anything, including ret or ud2. Or slightly less aggressive, you could compile it to just leave garbage for the padding part of the struct, the last 2 bytes.
high_garbage:
shl rsi, 32 # leaving high garbage = incoming high half of ESI
mov eax, edi # zero-extend into RAX
or rax, rsi
ret
Note that an unofficial extension to the x86-64 System V ABI (which clang actually depends on) is that narrow args are sign- or zero-extended to 32 bits. So instead of zeros, the high 2 bytes of the pointer would be copies of the sign bit. (Which would actually guarantee that it's a canonical 48-bit virtual address on x86-64!)

How to use processor instructions in C++ to implement fast arithmetic operations

I was working on the C++ implementation of Shamir's secret sharing scheme. I split the message into 8-bit chunks and on each performs corresponding arithmetic. The underlying finite field was Rijndael's finite field F_256 / (x^8 + x^4 + x^3 + x + 1).
I made a quick search if there is some well-known and spread library for Rijndael's finite field calculations (e. g. OpenSSL or similar), and didn't find any. So I implemented it from scratch, partly as a programming exercise.
A few days ago, however, a professor at our university mentioned following: "Modern processors support carry-less integer operations, so the characteristic-2 finite field multiplications run fast nowadays.".
Hence, since I know just little about hardware, assembler, and similar stuff, my question is: How do I actually use (in C++) all the modern processors' instructions when building crypto software - whether it is AES, SHA, arithmetic from above or whatever else? I can't find any satisfactory resources on that. My idea is to build a library containing both: "Modern-approach fast implementation" and fallback "pure C++ dependency-less code" and let the GNU Autoconf decide which one to use on each respective host. Any book/article/tutorial recommendation on this topic would be appreciated.
The question is quite broad because there are several ways you might access the power of the underlying hardware, so instead of one specific way here's a list of ways you can try to use all the modern processors' instructions:
Idiom Recognition
Write out the operation not offered directly in C++ in "long form" and hope your compiler recognizes it as an idiom for the underlying instruction you want. For example, you could write a variable rotate left of x by amount as (x << amount) | (x >> (32 - amount)) and all of gcc, clang and icc will recognize this as a rotate and issue the underlying rol instruction supported by x86.
Sometimes this technique puts you in a bit of an uncomfortable spot: the above C++ rotate implementation exhibits undefined behavior for amount == 0 (and also amount >= 32) since the result of a shift of 32 on a uint32_t is undefined, but the code actually produced by these compilers is just fine in that case. Still, having this lurking undefined behavior in your program is dangerous, and it probably won't run clear against ubsan and friends. The alternative safe version amount ? (x << amount) | (x >> (32 - amount)) : x; is only recognized by icc, but not by gcc or clang.
This approach tends to work for common idioms that map directly to assembly-level instructions that have been around for a while: rotates, bit tests and sets, multiplications with a wider result than inputs (e.g., multiplying two 32-bit values for a 64-bit result), conditional moves and so on, but is less likely to pick up bleeding edge instructions that might also be of interest to cryptography. For example, I'm quite sure no compiler will currently recognize an application of the AES instruction set extensions. It also works best on platforms that have received a lot of effort on the part of the compiler developers since each recognized idiom has to be added by hand.
I don't think this technique will work with your carry-less multiplication (PCLMULQDQ), but maybe one day (if you file an issue against the compilers)? It does work for other "crypt-interesting" functions though, including rotate.
Intrinsic Functions
As an extension compilers will often offer intrinsic functions which are not part of the language proper, but often map directly to an instruction offered by most hardware. Although it looks like a function call, the compiler generally just emits the single instruction needed at the place you call it.
GCC calls these built-in functions and you can find a list of generic ones here. For example, you can use the __builtin_popcnt call to emit the popcnt instruction, if the current target supports it. Man of the gcc builtins are also supported by icc and clang, and in this case all of gcc, clang and icc support this call and emit popcnt as long as the architecture (-march=Haswell)is set to Haswell. Otherwise, clang and icc inline a replacement version using some clever SWAR tricks, while gcc calls __popcountdi2 which is provided by the runtime1.
The list of intrinsics above are generic and generally offered on any platform the compilers support. You can also find platform specific instrinics, for example this list from gcc.
For x86 SIMD instructions specifically, Intel makes available a set of intrinsic functions declared headers covering their ISA extensions, e.g., by including #include <x86intrin.h>. These have wider support than the gcc instrinsics, e.g., they are supported by Microsoft's Visual Studio compiler suite. New instruction sets are usually added before chips that support them become available, so you can use these to access new instructions immediate on release.
Programming with SIMD intrinsic functions is kind of a halfway house between C++ and full assembly. The compiler still takes care of things like calling conventions and register allocation, and some optimization are made (especially for generating constants and other broadcasts) - but generally what you write is more or less what you get at the assembly level.
Inline Assembly
If your compiler offers it, you can use inline assembly to call whatever instructions you want2. This has a lot of similarities to using intrinsic functions, but with a somewhat higher level of difficulty and less opportunities for the optimizer to help you out. You should probably prefer intrinsic functions unless you have a specific reason for inline assembly. One example could be if the optimizer does a really bad job with intrinsics: you could use an inline assembly block to get exactly the code you want.
Out-of-line Assembly
You can also just write your entire kernel function in assembly, assembly it how you want, and then declare it extern "C" and call it from C++. This is similar to the inline assembly option, but works on compilers that don't support inline assembly (e.g., 64-bit Visual Studio). You can also use a different assembler if you want, which is especially convenient if you are targeting multiple C++ compilers since you can then use a single assembler for all of them.
You need to take care of the calling conventions youself, and other messy things like DWARF unwind info and Windows SEH handling.
For very short functions, this approach doesn't work well since the call overhead will likely be prohibitive3.
Auto-Vectorization4
If you want to write fast cryptography today for a CPU, you are pretty much going to be targeting mostly SIMD instructions. Most new algorithms designed with software implementation are also designed with vectorization in mind.
You can intrinsic functions or assembly to write SIMD code, but you can also write normal scalar code and rely on the auto-vectorizer. These got a bad name back in the early days of SIMD, and while they are still far from perfect they have come a long way.
Consider this simple function with takes payload and key byte array and xors key into payload:
void otp_scramble(uint8_t* payload, uint8_t* key, size_t n) {
for (size_t i = 0; i < n; i++) {
payload[i] ^= key[i];
}
}
This is a softball example, of course, but anyways gcc, clang and icc all vectorize this to something like this inner loop4:
movdqu xmm0, XMMWORD PTR [rdi+rax]
movdqu xmm1, XMMWORD PTR [rsi+rax]
pxor xmm0, xmm1
movups XMMWORD PTR [rdi+rax], xmm0
It's using SSE instructions to load and xor 16 bytes at a time. The developer only has to reason about the simple scalar code, however!
One advantage of this approach versus intrinsics or assembly is that you aren't baking in the SIMD length of the instruction set at the source level. The same C++ code as above compiled with -march=haswell results in a loop like:
vmovdqu ymm1, YMMWORD PTR [rdi+rax]
vpxor ymm0, ymm1, YMMWORD PTR [rsi+rax]
vmovdqu YMMWORD PTR [rdi+rax], ymm0
It's using the AVX2 instructions available on Haswell to do 32-bytes at a time. If you compile with -march=skylake-avx512 clang uses 64-byte vxorps instructions on zmm registers (but gcc and icc stick with 32-byte inner loops). So in principle you can take some advantage of new ISA simply with a recompile.
A downside of auto-vectorizatoin is that it is fairly fragile. What auto-vectorizes on one compiler might not on another or even on another version of the same compiler. So you need to check you are getting the results you want. The auto-vectorizer is often working with less information than you have: it might not know that the input length is a multiple of some power or two or that the input pointers are aligned in a certain way. Sometimes you can communicate this information to the compiler, but sometimes you can't.
Sometimes the compiler makes "interesting" decisions when it vectorizes, such as a small not-unrolled body for the inner loop, but then a giant "intro" or "outro" handling odd iterations, like what gcc produces after the first loop shown above:
movzx ecx, BYTE PTR [rsi+rax]
xor BYTE PTR [rdi+rax], cl
lea rcx, [rax+1]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+1+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+2]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+2+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+3]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+3+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+4]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+4+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+5]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+5+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+6]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+6+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+7]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+7+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+8]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+8+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+9]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+9+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+10]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+10+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+11]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+11+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+12]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+12+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+13]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+13+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+14]
cmp rdx, rcx
jbe .L1
movzx eax, BYTE PTR [rsi+14+rax]
xor BYTE PTR [rdi+rcx], al
You probably have better things to spend your instruction cache on (and this is far from the worst I've seen: it's easy to get examples with several hundreds of instructions in the intro and outro parts).
Unfortunately, the vectorizer probably won't produce crypto-specific instructions like carry-less multiply. You could consider a mix of scalar code that gets vectorized and an intrinsic only for the instructions the compiler won't generate, but this is easier to suggest than actually do successfully. At that point you are probably better off writing your entire loop with intrinsics.
1 The advantage of the gcc approach here is that at runtime if the platform supports popcnt this call can resolve to an implementation that just uses a popcnt instruction, using the GNU IFUNC mechanism.
2 Assuming the underlying assembler supports it, but even if it doesn't you could just encode the raw instruction bytes in the inline assembly block.
3 The call overhead includes more than just the explicit costs of the call and ret and argument passing: it also includes the effect on the optimizer which can't optimize code as well in the caller around the function call since it has unknown side-effects.
4 In some ways, auto-vectorization could be seen as a special case of idiom recognition, but it is important enough and has enough unique considerations that it gets its own section here.
5 With minor differences: gcc is as shown, clang unrolled a bit, and icc used a load-op pxor rather than a separate load.

Why do compilers duplicate some instructions?

Sometimes compilers generate code with weird instruction duplications that can safely be removed. Consider the following piece of code:
int gcd(unsigned x, unsigned y) {
return x == 0 ? y : gcd(y % x, x);
}
Here is the assembly code (generated by clang 5.0 with optimizations enabled):
gcd(unsigned int, unsigned int): # #gcd(unsigned int, unsigned int)
mov eax, esi
mov edx, edi
test edx, edx
je .LBB0_1
.LBB0_2: # =>This Inner Loop Header: Depth=1
mov ecx, edx
xor edx, edx
div ecx
test edx, edx
mov eax, ecx
jne .LBB0_2
mov eax, ecx
ret
.LBB0_1:
ret
In the following snippet:
mov eax, ecx
jne .LBB0_2
mov eax, ecx
If the jump doesn't happen, eax is reassigned for no obvious reason.
The other example is two ret's at the end of the function: one would perfectly work as well.
Is the compiler simply not intelligent enough or there's a reason to not remove the duplications?
Compilers can perform optimisations that are not obvious to people and removing instructions does not always make things faster.
A small amount of searching shows that various AMD processors have branch prediction problems when a RET is immediately after a conditional branch. By filling that slot with what is essentially a no-op, the performance problem is avoided.
Update:
Example reference, section 6.2 of the "Software Optimization Guide for AMD64 Processors" (see http://support.amd.com/TechDocs/25112.PDF) says:
Specifically, avoid the following two situations:
Any kind of branch (either conditional or unconditional) that has the single-byte near-return RET instruction as its target. See “Examples.”
A conditional branch that occurs in the code directly before the single-byte near-return RET instruction.
It also goes into detail on why jump targets should have alignment which is also likely to explain the duplicate RETs at the end of the function.
Any compiler will have a bunch of transformations for register renaming, unrolling, hoisting, and so on. Combining their outputs can lead to suboptimal cases such as what you have shown. Marc Glisse offers good advice: it's worth a bug report. You are describing an opportunity for a peephole optimizer to discard instructions that either
don't affect the state of registers & memory at all, or
don't affect state that matters for a function's post-conditions, won't matter for its public API.
Sounds like an opportunity for symbolic execution techniques. If the constraint solver finds no branch points for a given MOV, perhaps it is really a NOP.

Any advantage of XOR AL,AL + MOVZX EAX, AL over XOR EAX,EAX?

I have some unknown C++ code that was compiled in Release build, so it's optimized. The point I'm struggling with is:
xor al, al
add esp, 8
cmp byte ptr [ebp+userinput], 31h
movzx eax, al
This is my understanding:
xor al, al ; set eax to 0x??????00 (clear last byte)
add esp, 8 ; for some unclear reason, set the stack pointer higher
cmp byte ptr [ebp+userinput], 31h ; set zero flag if user input was "1"
movzx eax, al ; set eax to AL and extend with zeros, so eax = 0x000000??
I don't care about line 2 and 3. They might be there in this order for pipelining reasons and IMHO have nothing to do with EAX.
However, I don't understand why I would clear AL first, just to clear the rest of EAX later. The result will IMHO always be EAX = 0, so this could also be
xor eax, eax
instead. What is the advantage or "optimization" of that piece of code?
Some background info:
I will get the source code later. It's a short C++ console demo program, maybe 20 lines of code only, so nothing that I would call "complex" code. IDA shows a single loop in that program, but not around this piece. The Stud_PE signature scan didn't find anything, but likely it's Visual Studio 2013 or 2015 compiler.
xor al,al is already slower than xor eax,eax on most CPUs. e.g. on Haswell/Skylake it needs an ALU uop and doesn't break the dependency on the old value of eax/rax. It's equally bad on AMD CPUs, or Atom/Silvermont. (Well, maybe not equally because AMD doesn't eliminate xor eax,eax at issue/rename, but it still has a false dependency which could serialize the new dependency chain with whatever used eax last).
On CPUs that do rename al separately from the rest of the register (Intel pre-IvyBridge), the xor al,al may still be recognized as a zeroing idiom, but unless you actively want to preserve the upper bytes of the register, the best way to zero al is xor eax,eax.
Doing movzx on top of that just makes it even worse.
I'm guessing your compiler somehow got confused and decided it needed a 1-byte zero, but then realized it needed to promote it to 32 bits. xor sets flags, so it couldn't xor-zero after the cmp, and it failed to notice that it could have just xor-zeroed eax before the cmp.
Either that or it's something like Jester's suggestion, where the movzx is a branch target. Even if that's the case, xor eax,eax would still have been better because zero-extending into eax follows unconditionally on this code path.
I'm curious what compiler produced this from what source.

Why does ICC unroll this loop in this way and use lea for arithmetic?

Looking at the ICC 17 generated code for iterating over a std::unordered_map<> (using https://godbolt.org) left me very confused.
I distilled down the example to this:
long count(void** x)
{
long i = 0;
while (*x)
{
++i;
x = (void**)*x;
}
return i;
}
Compiling this with ICC 17, with the -O3 flag, leads to the following disassembly:
count(void**):
xor eax, eax #6.10
mov rcx, QWORD PTR [rdi] #7.11
test rcx, rcx #7.11
je ..B1.6 # Prob 1% #7.11
mov rdx, rax #7.3
..B1.3: # Preds ..B1.4 ..B1.2
inc rdx #7.3
mov rcx, QWORD PTR [rcx] #7.11
lea rsi, QWORD PTR [rdx+rdx] #9.7
lea rax, QWORD PTR [-1+rdx*2] #9.7
test rcx, rcx #7.11
je ..B1.6 # Prob 18% #7.11
mov rcx, QWORD PTR [rcx] #7.11
mov rax, rsi #9.7
test rcx, rcx #7.11
jne ..B1.3 # Prob 82% #7.11
..B1.6: # Preds ..B1.3 ..B1.4 ..B1.1
ret #12.10
Compared to the obvious implementation (which gcc and clang use, even for -O3), it seems to do a few things differently:
It unrolls the loop, with two decrements before looping back - however, there is a conditional jump in the middle of it all.
It uses lea for some of the arithmetic
It keeps a counter (inc rdx) for every two iterations of the while loop, and immediately computes the corresponding counters for every iteration (into rax and rsi)
What are the potential benefits to doing all this? I assume it may have something to do with scheduling?
Just for comparison, this is the code generated by gcc 6.2:
count(void**):
mov rdx, QWORD PTR [rdi]
xor eax, eax
test rdx, rdx
je .L4
.L3:
mov rdx, QWORD PTR [rdx]
add rax, 1
test rdx, rdx
jne .L3
rep ret
.L4:
rep ret
This isn't a great example because the loop trivially bottlenecks on pointer-chasing latency, not uop throughput or any other kind of loop-overhead. But there can be cases where having fewer uops helps an out-of-order CPU see farther ahead, maybe. Or we can just talk about the optimizations to the loop structure and pretend they matter, e.g. for a loop that did something else.
Unrolling is potentially useful in general, even when the loop trip-count is not computable ahead of time. (e.g. in a search loop like this one, which stops when it finds a sentinel). A not-taken conditional branch is different from a taken branch, since it doesn't have any negative impact on the front-end (when it predicts correctly).
Basically ICC just did a bad job unrolling this loop. The way it uses LEA and MOV to handle i is pretty braindead, since it used more uops than two inc rax instructions. (Although it does make the critical path shorter, on IvB and later which have zero-latency mov r64, r64, so out-of-order execution can get ahead on running those uops).
Of course, since this particular loop bottlenecks on the latency of pointer-chasing, you're getting at best a long-chain throughput of one per 4 clocks (L1 load-use latency on Skylake, for integer registers), or one per 5 clocks on most other Intel microarchitectures. (I didn't double-check these latencies; don't trust those specific numbers, but they're about right).
IDK if ICC analyses loop-carried dependency chains to decide how to optimize. If so, it should probably have just not unrolled at all, if it knew it was doing a poor job when it did try to unroll.
For a short chain, out-of-order execution might be able to get started on running something after the loop, if the loop-exit branch predicts correctly. In that case, it is useful to have the loop optimized.
Unrolling also throws more branch-predictor entries at the problem. Instead of one loop-exit branch with a long pattern (e.g. not-taken after 15 taken), you have two branches. For the same example, one that's never taken, and one that's take 7 times then not-taken the 8th time.
Here's what a hand-written unrolled-by-two implementation looks like:
Fix up i in the loop-exit path for one of the exit points, so you can handle it cheaply inside the loop.
count(void**):
xor eax, eax # counter
mov rcx, QWORD PTR [rdi] # *x
test rcx, rcx
je ..B1.6
.p2align 4 # mostly to make it more likely that the previous test/je doesn't decode in the same block at the following test/je, so it doesn't interfere with macro-fusion on pre-HSW
.loop:
mov rcx, QWORD PTR [rcx]
test rcx, rcx
jz .plus1
mov rcx, QWORD PTR [rcx]
add rax, 2
test rcx, rcx
jnz .loop
..B1.6:
ret
.plus1: # exit path for odd counts
inc rax
ret
This makes the loop body 5 fused-domain uops if both TEST/JCC pairs macro-fuse. Haswell can make two fusions in a single decode groups, but earlier CPUs can't.
gcc's implementation is only 3 uops, which is less than the issue width of the CPU. See this Q&A about small loops issuing from the loop buffer. No CPU can actually execute/retire more than one taken branch per clock, so it's not easily possible to test how CPUs issue loops with less than 4 uops, but apparently Haswell can issue a 5-uop loop at one per 1.25 cycles. Earlier CPUs might only issue it at one per 2 cycles.
There's no definite answer to why it does it, as it is a proprietary compiler. Only intel knows why. That said, Intel compiler is often more aggressive in loop optimization. It does not mean it is better. I have seen situations where intel's aggressive inlining lead to worse performance than clang/gcc. In that case, I had to explicitly forbid inlining at some call sites. Similarly, sometime it is necessary to forbid unrolling via pragmas in Intel C++ to get better performance.
lea is a particularly useful instruction. It allows one shift, two addition, and one move all in just one instruction. It is much faster than doing these four operations separated. However, it does not always make a difference. And if lea is used only for an addition or a move, it may or may not be better. So you see in 7.11 it uses a move, while in the next two lines lea is used to do an addition plus move, and addition, shift, plus a move
I don't see there's a optional benefit here