Why XOR before SETcc? - c++

This fragment of code
int foo(int a, int b)
{
return (a == b);
}
generates the following assembly (https://godbolt.org/z/fWsM1zo6q)
foo(int, int):
xorl %eax, %eax
cmpl %esi, %edi
sete %al
ret
According to https://www.felixcloutier.com/x86/setcc
[SETcc] Sets the destination operand to 0 or 1 depending on the settings of the status flags
So what is the point of initializing %eax with zero by doing xorl %eax, %eax first if it will be zero/one depending on result of a == b anyway? Isn't it a waste of CPU clocks that both gcc and clang can't avoid for some reason?

Because setcc sucks: only available in 8-bit operand-size. But you used 32-bit int for the return value, so you need that 8-bit result zero-extended to 32-bit.
Even if you did only want to return a bool or char, you might still do this to avoid a false dependency when writing AL. xor-zeroing doesn't cost "a cycle", it costs 1 uop (and is as cheap as a nop on Intel), but that's still not free. (https://agner.org/optimize/)
Unfortunately AMD64 didn't change setcc, nor did any later extensions, so producing a 32-bit 0/1 is still a pain on x86 even with -march=icelake-client or znver3. Having a 66 operand-size or rep prefix modify setcc to use 32-bit operand-size would have been helpful to avoid wasting an instruction (and a front-end uop) on this, but neither vendor has ever bothered to introduce an extension like that. (New extensions usually target major speedups in a few "hot" functions that you can do dynamic dispatch for, not things that would need to be used everywhere to add up to a small improvement.)
xor-zeroing before the setcc is the least-bad way, when you have a spare register, as discussed at the bottom of my answer on What is the best way to set a register to zero in x86 assembly: xor, mov or and?.
The other options, if you do want to overwrite a compare input, include:
1. mov-imm32=0 which you can do after a compare, not affecting FLAGS:
# for example if you want to replace a compare input with a boolean
cmp %ecx, %eax
mov $0, %eax
setcc %al
This wastes code-size (5 bytes for mov vs. 2 for xor), and (on Intel P6-family) has a partial register stall when reading EAX, because no xor-zeroing was used to set the internal RAX=AL upper-bytes-known-zero state.
The mov-immediate is off the critical path, so out-of-order exec can get it done early, before the compare inputs are ready, and have that zeroed register ready for setcc to write into.
(On Intel SnB-family CPUs, xor-zeroing is handled in the rename logic, so it doesn't have to execute early to get the zero ready; it's already done when it enters the back-end. e.g. after a front-end stall, xor-zeroing and setcc could enter the back-end in the same cycle, but the setcc could still execute in the first cycle after that, unlike if it was a mov-immediate that would have to actually run on a back-end execution unit to write a zero to a register.)
2. MOVZX on an 8-bit setcc result
cmp %ecx, %eax
setcc %cl
movzbl %cl, %eax
This is mostly even worse, except on P6-family where it avoids a partial-register stall.
But movzx is on the critical path from compare inputs being ready to 0/1 result being ready. (Although IvyBridge and later can run it with zero latency when it's between two separate registers, which is why I used %cl instead of %al. Compilers normally don't optimize for this, and would setcc %al / movzbl %al, %eax if they don't manage to xor-zero something first. This defeats mov-elimination even on Intel CPUs that have it.)
setcc %cl has a false dependency on RCX (except on Intel P6-family which renames low8 registers separately from the full register), but that's ok because RCX and RAX were both already part of the dependency chain leading to setcc.
If you're not overwriting one of the compare inputs, xor-zero the separate destination register. setcc %al / movzbl %al, %eax after cmp %esi, %edi would be the worst of all possible options, because RAX might have last been written by a cache-miss load of something independent, or a slow div or something like that before the function call, so you could be coupling this dependency chain into it.

Related

Performance of a string operation with strlen vs stop on zero

While I was writing a string class in C++, I found some strange behavior regarding execution speed.
I'll take as an example the following two implementations of the upper method:
class String {
char* str;
...
forceinline void upperStrlen();
forceinline void upperPtr();
};
void String::upperStrlen()
{
INDEX length = strlen(str);
for (INDEX i = 0; i < length; i++) {
str[i] = toupper(str[i]);
}
}
void String::upperPtr()
{
char* ptr_char = str;
for (; *ptr_char != '\0'; ptr_char++) {
*ptr_char = toupper(*ptr_char);
}
}
INDEX is simply a typedef of uint_fast32_t.
Now I can test the speed of those methods in my main.cpp:
#define TEST_RECURSIVE(_function) \
{ \
bool ok = true; \
clock_t before = clock(); \
for (int i = 0; i < TEST_RECURSIVE_TIMES; i++) { \
if (!(_function()) && ok) \
ok = false; \
} \
char output[TEST_RECURSIVE_OUTPUT_STR]; \
sprintf(output, "[%s] Test %s %s: %ld ms\n", \
ok ? "OK" : "Failed", \
TEST_RECURSIVE_BUILD_TYPE, \
#_function, \
(clock() - before) * 1000 / CLOCKS_PER_SEC); \
fprintf(stdout, output); \
fprintf(file_log, output); \
}
String a;
String b;
bool stringUpperStrlen()
{
a.upperStrlen();
return true;
}
bool stringUpperPtr()
{
b.upperPtr();
return true;
}
int main(int argc, char** argv) {
...
a = "Hello World!";
b = "Hello World!";
TEST_RECURSIVE(stringUpperPtr);
TEST_RECURSIVE(stringUpperStrlen);
...
return 0;
}
Then I can compile and test with cmake in Debug or Release with the following results.
[OK] Test RELEASE stringUpperPtr: 21 ms
[OK] Test RELEASE stringUpperStrlen: 12 ms
[OK] Test DEBUG stringUpperPtr: 27 ms
[OK] Test DEBUG stringUpperStrlen: 33 ms
So in Debug the behavior is what I expected, the pointer is faster than strlen, but in Release strlen is faster.
So I took the GCC assembly and the number of instructions is much less in the stringUpperPtr than in stringUpperStrlen.
The stringUpperStrlen assembly:
_Z17stringUpperStrlenv:
.LFB72:
.cfi_startproc
pushq %r13
.cfi_def_cfa_offset 16
.cfi_offset 13, -16
xorl %eax, %eax
pushq %r12
.cfi_def_cfa_offset 24
.cfi_offset 12, -24
pushq %rbp
.cfi_def_cfa_offset 32
.cfi_offset 6, -32
xorl %ebp, %ebp
pushq %rbx
.cfi_def_cfa_offset 40
.cfi_offset 3, -40
pushq %rcx
.cfi_def_cfa_offset 48
orq $-1, %rcx
movq a@GOTPCREL(%rip), %r13
movq 0(%r13), %rdi
repnz scasb
movq %rcx, %rdx
notq %rdx
leaq -1(%rdx), %rbx
.L4:
cmpq %rbp, %rbx
je .L3
movq 0(%r13), %r12
addq %rbp, %r12
movsbl (%r12), %edi
incq %rbp
call toupper@PLT
movb %al, (%r12)
jmp .L4
.L3:
popq %rdx
.cfi_def_cfa_offset 40
popq %rbx
.cfi_def_cfa_offset 32
popq %rbp
.cfi_def_cfa_offset 24
popq %r12
.cfi_def_cfa_offset 16
movb $1, %al
popq %r13
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE72:
.size _Z17stringUpperStrlenv, .-_Z17stringUpperStrlenv
.globl _Z14stringUpperPtrv
.type _Z14stringUpperPtrv, @function
The stringUpperPtr assembly:
_Z14stringUpperPtrv:
.LFB73:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movq b@GOTPCREL(%rip), %rax
movq (%rax), %rbx
.L9:
movsbl (%rbx), %edi
testb %dil, %dil
je .L8
call toupper@PLT
movb %al, (%rbx)
incq %rbx
jmp .L9
.L8:
movb $1, %al
popq %rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE73:
.size _Z14stringUpperPtrv, .-_Z14stringUpperPtrv
.section .rodata.str1.1,"aMS",@progbits,1
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc ...).
So how do you explain this difference in performance?
Thanks in advance.
EDIT:
CMake generate something like this command to compile:
/bin/g++-8 -Os -DNDEBUG -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o -o xpp-tests libxpp.so
/bin/g++-8 -O3 -DNDEBUG -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o -o Release/xpp-tests Release/libxpp.so
# CMAKE generated file: DO NOT EDIT!
# Generated by "Unix Makefiles" Generator, CMake Version 3.16
# compile CXX with /bin/g++-8
CXX_FLAGS = -O3 -DNDEBUG -Wall -pipe -fPIC -march=native -fno-strict-aliasing
CXX_DEFINES = -DPLATFORM_UNIX=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1
The define TEST_RECURSIVE will call _function 1000000 times in my examples.
You have several misconceptions about performance that need to be dispelled.
Now I can test the speed of those methods in my main.cpp: (…)
Your benchmarking code calls the benchmarked functions directly. So you're measuring the benchmarked functions as optimized for the specific case of how they're used by the benchmarking code: to call them repeatedly on the same input. This is unlikely to have any relevance to how they behave in a realistic environment.
I think the compiler didn't do anything earth-shattering because it doesn't know what toupper does. If the compiler had known that toupper doesn't transform a nonzero character into zero, it might well have hoisted the strlen call outside the benchmarked loop. And if it had known that toupper(toupper(x)) == toupper(x), it might well have decided to run the loop only once.
To make a somewhat realistic benchmark, put the benchmarked code and the benchmarking code in separate source files, compile them separately, and disable any kind of cross-module or link-time optimization.
Then I can compile and test with cmake in Debug or Release
Compiling in debug mode rarely has any relevance to microbenchmarks (benchmarking the speed of an implementation of a small fragment of code, as opposed to benchmarking the relative speed of algorithms in terms of how many elementary functions they call). Compiler optimizations have a significant effect on microbenchmarks.
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc ...).
No, absolutely not.
First of all, the total number of instructions is completely irrelevant to the speed of the program. Even on a platform where executing one instruction takes the same amount of time regardless of what the instruction is (which is unusual), what matters is how many instructions are executed, not how many instructions there are in the program. For example, a loop with 100 instructions that is executed 10 times is 10 times faster than a loop with 10 instructions that is executed 1000 times, even though it's 10 times larger. Inlining is a program transformation that usually makes the code larger, and it makes the code faster often enough that it's considered a standard optimization.
Second, on many platforms, such as any PC or server made in the 21st century, any smartphone, and even many lower-end devices, the time it takes to execute an instruction can vary so widely that it's a poor indication of performance. Cache is a major factor: a read from memory can be more than 1000 times slower than a read from cache on a PC. Other factors with less impact include pipelining, which causes the speed of an instruction to depend on the surrounding instructions, and branch prediction, which causes the speed of a conditional instruction to depend on the outcome of previous conditional instructions.
Third, that's just considering processor instructions — what you see in assembly code. Compilers for C, C++ and most other languages optimize programs in such a way that it can be hard to predict what the processor will be doing exactly.
For example, how long does the instruction ++x; take on a PC?
If the compiler has figured out that the addition is unnecessary, for example because nothing uses x afterwards, or because the value of x is known at compile time and therefore so is the value of x+1, it'll optimize it away. So the answer is 0.
If the value of x is already in a register at this point and the value is only needed in a register afterwards, the compiler just needs to generate an addition or increment instruction. So the simplistic, but not quite correct answer is 1 clock cycle. One reason this is not quite correct is that merely decoding the instruction takes many cycles on a high-end processor such as what you find in a 21st century PC or smartphone. However “one cycle” is kind of correct in that while it takes multiple clock cycles from starting the instruction to finishing it, the instruction only takes one cycle in each pipeline stage. Furthermore, even taking this into account, another reason this is not quite correct is that ++x; ++y; might not take 2 clock cycles: modern processors are sophisticated enough that they may be able to decode and execute multiple instructions in parallel (for example, a processor with 4 arithmetic units can perform 4 additions at the same time). Yet another reason this might not be correct is if the type of x is larger or smaller than a register, which might require more than one assembly instruction to perform the addition.
If the value of x needs to be loaded from memory, this takes a lot more than one clock cycle. Anything other than the innermost cache level dwarfs the time it takes to decode the instruction and perform the addition. The amount of time is very different depending on whether x is found in the L3 cache, in the L2 cache, in the L1 cache, or in the “real” RAM. And even that gets more complicated when you consider that x might be part of a cache prefetch (hardware- or software- triggered).
It's even possible that x is currently in swap, so that reading it requires reading from a disk.
And writing the result exhibits somewhat similar variations to reading the input. However the performance characteristics are different for reads and for writes because when you need a value, you need to wait for the read to be complete, whereas when you write a value, you don't need to wait for the write to be complete: a write to memory writes to a buffer in cache, and the time when the buffer is flushed to a higher-level cache or to RAM depends on what else is happening on the system (what else is competing for space in the cache).
Ok, now let's turn to your specific example and look at what happens in their inner loop. I'm not very familiar with x86 assembly but I think I get the gist.
For stringUpperStrlen, the inner loop starts at .L4. Just before entering the inner loop, %rbx is set to the length of the string. Here's what the inner loop contains:
cmpq %rbp, %rbx: Compare the current index to the length, both obtained from registers.
je .L3: conditional jump, to exit the loop if the index is equal to the length.
movq 0(%r13), %r12: Read from memory to get the address of the beginning of the string. (I'm surprised that the address isn't in a register at this point.)
addq %rbp, %r12: an arithmetic operation that depends on the value that was just read from memory.
movsbl (%r12), %edi: Read the current character from the string in memory.
incq %rbp: Increment the index. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
call toupper@PLT
movb %al, (%r12): Write the value returned by the function to the current character of the string in memory.
jmp .L4: Unconditional jump to the beginning of the loop.
For stringUpperPtr, the inner loop starts at .L9. Here's what the inner loop contains:
movsbl (%rbx), %edi: Read the current character from the string in memory.
testb %dil, %dil: test if %dil is zero. %dil is the least significant byte of %edi which was just read from memory.
je .L8: conditional jump, to exit the loop if the character is zero.
call toupper@PLT
movb %al, (%rbx): Write the value returned by the function to the current character of the string in memory.
incq %rbx: Increment the pointer. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
jmp .L9: Unconditional jump to the beginning of the loop.
The differences between the two loops are:
The loops have slightly different lengths, but both are small enough that they fit in a single cache line (or two, if the code happens to straddle a line boundary). So after the first iteration of the loop, the code will be in the innermost instruction cache. Not only that, but if I understand correctly, on modern Intel processors, there is a cache of decoded instructions, which the loop is small enough to fit in, and so no decoding needs to take place.
The stringUpperStrlen loop has one more read. The extra read is from a constant address which is likely to remain in the innermost cache after the first iteration.
The conditional instruction in the stringUpperStrlen loop depends only on values that are in registers. On the other hand, the conditional instruction in the stringUpperPtr loop depends on a value which was just read from memory.
So the difference boils down to an extra data read from the innermost cache, vs having a conditional instruction whose outcome depends on a memory read. An instruction whose outcome depends on the result of another instruction leads to a hazard: the second instruction is blocked until the first instruction is fully executed, which prevents taking advantage from pipelining, and can render speculative execution less effective. In the stringUpperStrlen loop, the processor essentially runs two things in parallel: the load-call-store cycle, which doesn't have any conditional instructions (apart from what happens inside toupper), and the increment-test cycle, which doesn't access memory. This lets the processor work on the conditional instruction while it's waiting for memory. In the stringUpperPtr loop, the conditional instruction depends on a memory read, so the processor can't start working on it until the read is complete. I'd typically expect this to be slower than the extra read from the innermost cache, although it might depend on the processor.
Of course, stringUpperStrlen does need a load-test hazard to determine the end of the string: no matter how it does it, it needs to fetch characters from memory. This is hidden inside repnz scasb. I don't know the internal architecture of an x86 processor, but I suspect that this case (which is extremely common since it's the meat of strlen) is heavily optimized inside the processor, probably to an extent that is impossible to reach with generic instructions.
You may see different results if the string were longer and the two memory accesses in stringUpperStrlen weren't in the same cache line, although possibly not, because this only costs one more cache line and there are several. The details would depend on how the caches work and how toupper uses them.

Remove needless assembler statements from g++ output

I am investigating some problem with a local binary. I've noticed that g++ creates a lot of ASM output that seems unnecessary to me. Example with -O0:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp <--- just need 8 bytes for the movq to -8(%rbp), why -16?
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi <--- now we have moved rdi onto itself.
call Base::Base()
leaq 16+vtable for Derived(%rip), %rdx
movq -8(%rbp), %rax <--- effectively %edi, does not point into this area of the stack
movq %rdx, (%rax) <--- thus this wont change -8(%rbp)
movq -8(%rbp), %rax <--- so this statement is unnecessary
movl $4712, 12(%rax)
nop
leave
ret
option -O1 -fno-inline -fno-elide-constructors -fno-omit-frame-pointer:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $8, %rsp <--- reserve some stack space and never use it.
movq %rdi, %rbx
call Base::Base()
leaq 16+vtable for Derived(%rip), %rax
movq %rax, (%rbx)
movl $4712, 12(%rbx)
addq $8, %rsp <--- release unused stack space.
popq %rbx
popq %rbp
ret
This code is for the constructor of Derived that calls the Base base constructor and then overrides the vtable pointer at position 0 and sets a constant value to an int member it holds in addition to what Base contains.
Question:
Can I translate my program with as few optimizations as possible and get rid of such stuff? Which options would I have to set? Or is there a reason the compiler cannot detect these cases with -O0 or -O1 and there is no way around them?
Why is the subq $8, %rsp statement generated at all? You cannot optimize in or out a statement that makes no sense to begin with. Why does the compiler generate it then? The register allocation algorithm should never, even at -O0, generate code for something that is not there. So why is it done?
is there a reason the compiler cannot detect these cases with -O0 or -O1
Exactly because you're telling the compiler not to. These are optimisation levels that need to be turned off or down for proper debugging. You're also trading run-time performance for faster compilation.
You're looking through the telescope the wrong way; check out the awesome optimisations that your compiler will do for you when you crank up optimisation.
I don't see any obvious missed optimizations in your -O1 output. Except of course setting up RBP as a frame pointer, but you used -fno-omit-frame-pointer so clearly you know why GCC didn't optimize that away.
The function has no local variables
Your function is a non-static class member function, so it has one implicit arg: this in rdi. Which g++ spills to the stack because of -O0. Function args count as local variables.
How does a cyclic move without an effect improve the debugging experience? Please elaborate.
To improve C/C++ debugging: debug-info formats can only describe a C variable's location relative to RSP or RBP, not which register it's currently in. Also, so you can modify any variable with a debugger and continue, getting the expected results as if you'd done that in the C++ abstract machine. Every statement is compiled to a separate block of asm with no values alive in registers (Fun fact: except register int foo: that keyword does affect debug-mode code gen).
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to G++ and other compilers as well.
Which options would I have to set?
If you're reading / debugging the asm, use at least -Og or higher to disable the debug-mode spill-everything-between-statements behaviour of -O0. Preferably -O2 or -O3 unless you like seeing even more missed optimizations than you'd get with full optimization. But -Og or -O1 will do register allocation and make sane loops (with the conditional branch at the bottom), and various simple optimizations. Although still not the standard peephole of xor-zeroing.
How to remove "noise" from GCC/clang assembly output? explains how to write functions that take args and return a value so you can write functions that don't optimize away.
Loading into RAX and then movq %rax, %rdi is just a side-effect of -O0. GCC spends so little time optimizing the GIMPLE and/or RTL internal representations of the program logic (before emitting x86 asm) that it doesn't even notice it could have loaded into RDI in the first place. Part of the point of -O0 is to compile quickly, as well as consistent debugging.
Why is the subq $8, %rsp statement generated at all?
Because the ABI requires 16-byte stack alignment before a call instruction, and this function did an even number of 8-byte pushes. (call itself pushes a return address). It will go away at -O1 without -fno-omit-frame-pointer because you aren't forcing g++ to push/pop RBP as well as the call-preserved register it actually needs.
Why does System V / AMD64 ABI mandate a 16 byte stack alignment?
Fun fact: clang will often use a dummy push %rcx/pop or something, depending on -mtune options, instead of an 8-byte sub.
If it were a leaf function, g++ would just use the red-zone below RSP for locals, even at -O0. Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?
In un-optimized code it's not rare for G++ to allocate an extra 16 bytes it never uses. Even with optimization enabled, g++ sometimes rounds up its stack allocation size too far when aiming for a 16-byte boundary. This is a missed-optimization bug. e.g. Memory allocation and addressing in Assembly

How to move double in %rax into particular qword position on %ymm or %zmm? (Kaby Lake or later)

The idea is that I'd like to collect returned values of double into a vector register for processing, a machine vector width at a time, without storing back into memory first.
The particular processing is a vfma with other two operands that are all constexpr, so that they can simply be summoned by _mm256_setr_pd or aligned/unaligned memory load from constexpr array.
Is there a way to store double in %ymm at particular position directly from value in %rax for collecting purpose?
The target machine is Kaby Lake. Suggestions using more efficient or future vector instructions are also welcome.
Inline-assembly is usually a bad idea: modern compilers do a good job with x86 intrinsics.
Putting the bit-pattern for a double into RAX is usually also not useful, and smells like you've probably already gone down the wrong path into sub-optimal territory. Vector shuffle instructions are usually better: element-wise insert/extract instructions already cost a shuffle uop on Intel hardware, except for vmovq %xmm0, %rax to get the low element.
Also, if you're going to insert it into another vector, you should shuffle/immediate-blend. (vpermpd / vblendpd).
L1d and store-forwarding cache is fast, and even store-forwarding stalls are not a disaster. Choose wisely between ALU and memory strategies for gathering or scattering data into / from SIMD vectors. Also remember that insert/extract instructions need an immediate index, so if you have a runtime index for a vector, you definitely want to store to memory and index that. (See https://agner.org/optimize/ and other performance links in https://stackoverflow.com/tags/x86/info)
Lots of insert/extract can quickly bottleneck on port 5 on Haswell and later. See Loading an xmm from GP regs for more details, and some links to gcc bug reports where I went into more detail about strategies for different element widths on different uarches and with SSE4.1 vs. without SSE4.1, etc.
There's no PD version of extractps r/m32, xmm, imm, and insertps is a shuffle between XMM vectors.
To read/write the low lane of a YMM, you'll have to use integer vpextrq $1, %xmm0, %rax / pinsrq $1, %rax, %xmm0. Those aren't available in YMM width, so you need multiple instructions to read/write the high lane.
The VEX version vpinsrq $1, %rax, %xmm0 will zero the high lane(s) of the destination vector's full YMM or ZMM width, which is why I suggested pinsrq. On Skylake and later, it won't cause an SSE/AVX transition stall. See Using ymm registers as a "memory-like" storage location for an example (NASM syntax), and also Loading an xmm from GP regs
For the low element, use vmovq %xmm0, %rax to extract, it's cheaper than vpextrq (1 uop instead of 2).
For ZMM, my answer on that linked XMM from GP regs question shows that you can use AVX512F to merge-mask an integer register into a vector, given a mask register with a single bit set.
vpbroadcastq %rax, %zmm0{%k1}
Similarly, vpcompressq can move an element selected by a single-bit mask to the bottom for vmovq.
But for extract, if you have an index instead of 1<<index to start with, you may be better off with vmovd %ecx, %xmm1 / vpermd %zmm0, %zmm1, %zmm2 / vmovq %zmm2, %rax. This trick even works with vpshufb for byte elements (within a lane, at least). For lane crossing, maybe shuffle + vmovd with the high bits of the byte index, then scalar right-shift using the low bits of the index as the byte-within-word offset. See also How to use _mm_extract_epi8 function? for intrinsics for a variable-index emulation of pextrb.
High lane of a YMM, with AVX2
I think your best bet to write an element in the high lane of a YMM with AVX2 needs a scratch register:
vmovq %rax, %xmm0 (copy to a scratch vector)
shuffle into position with vinsertf128 (AVX1) or vpbroadcastq/vbroadcastsd. It's faster than vpermq/vpermpd on AMD. (But the reg-reg version is still AVX2-only)
vblendpd (FP) or vpblendd (integer) into the target YMM reg. Immediate blends with dword or larger element width are very cheap (1 uop for any vector ALU port on Intel).
This is only 3 uops, but 2 of them need port 5 on Intel CPUs. (So it costs the same as a vpinsrq + a blend). Only the blend is on the critical path from vector input to vector output, setting up ymm0 from rax is independent.
To read the highest element, vpermpd or vpermq $3, %ymm1, %ymm0 (AVX2), then vmovq from xmm0.
To read the 2nd-highest element, vextractf128 $1, %ymm1, %xmm0 (AVX1) and vmovq. vextractf128 is faster than vpermq/pd on AMD CPUs.
A bad alternative to avoid a scratch reg for insert would be vpermq or vperm2i128 to shuffle the qword you want to replace into the low lane, pinsrq (not vpinsrq), then vpermq to put it back in the right order. That's all shuffle uops, and pinsrq is 2 uops. (And causes an SSE/AVX transition stall on Haswell, but not Skylake). Plus all those operations are part of a dependency chain for the register you're modifying, unlike setting up a value in another register and blending.

Visual C++ optimization options - how to improve the code output?

Are there any options (other than /O2) to improve the Visual C++ code output? The MSDN documentation is quite bad in this regard.
Note that I'm not asking about project-wide settings (link-time optimization, etc...). I'm only interested in this particular example.
The fairly simple C++11 code looks like this:
#include <vector>
int main() {
std::vector<int> v = {1, 2, 3, 4};
int sum = 0;
for(int i = 0; i < v.size(); i++) {
sum += v[i];
}
return sum;
}
Clang's output with libc++ is quite compact:
main: # #main
mov eax, 10
ret
Visual C++ output, on the other hand, is a multi-page mess.
Am I missing something here or is VS really this bad?
Compiler explorer link:
https://godbolt.org/g/GJYHjE
Unfortunately, it's difficult to greatly improve Visual C++ output in this case, even by using more aggressive optimization flags. There are several factors contributing to VS inefficiency, including lack of certain compiler optimizations, and the structure of Microsoft's implementation of <vector>.
Inspecting the generated assembly, Clang does an outstanding job optimizing this code. Specifically, when compared to VS, Clang is able to perform very effective constant propagation, function inlining (and consequently, dead code elimination), and new/delete optimization.
Constant Propagation
In the example, the vector is statically initialized:
std::vector<int> v = {1, 2, 3, 4};
Normally, the compiler will store the constants 1, 2, 3, 4 in data memory, and in the for loop, will load one value at a time, starting from the low address at which 1 is stored, and add each value to the sum.
Here's the abbreviated VS code for doing this:
movdqa xmm0, XMMWORD PTR __xmm@00000004000000030000000200000001
...
movdqu XMMWORD PTR $T1[rsp], xmm0 ; Store integers 1, 2, 3, 4 in memory
...
$LL4@main:
add ebx, DWORD PTR [rdx] ; loop and sum the values
lea rdx, QWORD PTR [rdx+4]
inc r8d
movsxd rax, r8d
cmp rax, r9
jb SHORT $LL4@main
Clang, however, is clever enough to realize that the sum can be calculated in advance. My best guess is that it replaces the loads of the constants from memory with constant mov operations into registers (propagates the constants), and then combines them into the result of 10. This has the useful side effect of breaking dependencies, and since the addresses are no longer loaded from, the compiler is free to remove everything else as dead code.
Clang seems to be unique in doing this - neither VS nor GCC was able to precalculate the vector accumulation result in advance.
New/Delete Optimization
Compilers conforming to C++14 are allowed to omit calls to new and delete on certain conditions, specifically when the number of allocation calls is not part of the observable behavior of the program (N3664 standard paper).
This has already generated much discussion on SO:
clang vs gcc - optimization including operator new
Is the compiler allowed to optimize out heap memory allocations?
Optimization of raw new[]/delete[] vs std::vector
Clang invoked with -std=c++14 -stdlib=libc++ indeed performs this optimization and eliminates the calls to new and delete, which do carry side effects, but supposedly do not affect the observable behaviour of the program. With -stdlib=libstdc++, Clang is stricter and keeps the calls to new and delete - although, by looking at the assembly, it's clear they are not really needed.
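As a minimal illustration of the kind of allocation N3664 lets a compiler elide (a hypothetical example, not taken from the question):

```cpp
// Hypothetical example: this new/delete pair has no observable effect
// beyond producing the value, so a conforming compiler may elide it.
int through_heap(int x) {
    int* p = new int(x);   // the allocation may be removed entirely
    int v = *p;
    delete p;              // the paired deallocation may be removed too
    return v;              // Clang can compile this down to "return x"
}
```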
Now, when inspecting the main code generated by VS, we can find there two function calls (with the rest of vector construction and iteration code inlined into main):
call std::vector<int,std::allocator<int> >::_Range_construct_or_tidy<int const * __ptr64>
and
call void __cdecl operator delete(void * __ptr64)
The first is used for allocating the vector and the second for deallocating it, and practically all other functions in the VS output are pulled in by these function calls. This hints that Visual C++ does not optimize away calls to allocation functions (for C++14 conformance we should add the /std:c++14 flag, but the results are the same).
This blog post (May 10, 2017) from the Visual C++ team confirms that this optimization is indeed not implemented. Searching the page for N3664 shows that "Avoiding/fusing allocations" is at status N/A, and the linked comment says:
[E] Avoiding/fusing allocations is permitted but not required. For the time being, we’ve chosen not to implement this.
Combining new/delete optimization and constant propagation, it's easy to see the impact of these two optimizations in this Compiler Explorer 3-way comparison of Clang with -stdlib=libc++, Clang with -stdlib=libstdc++, and GCC.
STL Implementation
VS has its own STL implementation, which is structured very differently from libc++ and libstdc++, and that seems to contribute heavily to the inferior VS code generation. While the VS STL has some very useful features, such as checked iterators and iterator debugging hooks (_ITERATOR_DEBUG_LEVEL), it gives the general impression of being heavier and performing less efficiently than libstdc++.
To isolate the impact of the <vector> STL implementation, an interesting experiment is to compile with Clang combined with the VS header files. Indeed, using Clang 5.0.0 with the Visual Studio 2015 headers results in the following code generation - clearly, the STL implementation has a huge impact!
main: # #main
.Lfunc_begin0:
.Lcfi0:
.seh_proc main
.seh_handler __CxxFrameHandler3, #unwind, #except
# BB#0: # %.lr.ph
pushq %rbp
.Lcfi1:
.seh_pushreg 5
pushq %rsi
.Lcfi2:
.seh_pushreg 6
pushq %rdi
.Lcfi3:
.seh_pushreg 7
pushq %rbx
.Lcfi4:
.seh_pushreg 3
subq $72, %rsp
.Lcfi5:
.seh_stackalloc 72
leaq 64(%rsp), %rbp
.Lcfi6:
.seh_setframe 5, 64
.Lcfi7:
.seh_endprologue
movq $-2, (%rbp)
movl $16, %ecx
callq "??2#YAPEAX_K#Z"
movq %rax, -24(%rbp)
leaq 16(%rax), %rcx
movq %rcx, -8(%rbp)
movups .L.ref.tmp(%rip), %xmm0
movups %xmm0, (%rax)
movq %rcx, -16(%rbp)
movl 4(%rax), %ebx
movl 8(%rax), %esi
movl 12(%rax), %edi
.Ltmp0:
leaq -24(%rbp), %rcx
callq "?_Tidy#?$vector#HV?$allocator#H#std###std##IEAAXXZ"
.Ltmp1:
# BB#1: # %"\01??1?$vector#HV?$allocator#H#std###std##QEAA#XZ.exit"
addl %ebx, %esi
leal 1(%rdi,%rsi), %eax
addq $72, %rsp
popq %rbx
popq %rdi
popq %rsi
popq %rbp
retq
.seh_handlerdata
.long ($cppxdata$main)#IMGREL
.text
Update - Visual Studio 2017
In Visual Studio 2017, <vector> has seen a major overhaul, as announced on this blog post from the Visual C++ team. Specifically, it mentions the following optimizations:
Eliminated unnecessary EH logic. For example, vector’s copy assignment operator had an unnecessary try-catch block. It just has to provide the basic guarantee, which we can achieve through proper action sequencing.
Improved performance by avoiding unnecessary rotate() calls. For example, emplace(where, val) was calling emplace_back() followed by rotate(). Now, vector calls rotate() in only one scenario (range insertion with input-only iterators, as previously described).
Improved performance with stateful allocators. For example, move construction with non-equal allocators now attempts to activate our memmove() optimization. (Previously, we used make_move_iterator(), which had the side effect of inhibiting the memmove() optimization.) Note that a further improvement is coming in VS 2017 Update 1, where move assignment will attempt to reuse the buffer in the non-POCMA non-equal case.
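For reference, the emplace(where, val) pattern mentioned in the rotate() item looks like this in user code (a small illustrative sketch, not taken from the original question):

```cpp
#include <vector>

// Illustrates the emplace(where, val) call the blog post's rotate()
// item refers to: construct an element in place before a position.
std::vector<int> make_seq() {
    std::vector<int> v = {1, 2, 4};
    v.emplace(v.begin() + 2, 3);   // construct 3 in place, before the 4
    return v;                      // {1, 2, 3, 4}
}
```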
Curious, I went back to test this. When building the example in Visual Studio 2017, the result is still a multi-page assembly listing with many function calls, so even if code generation improved, it is difficult to tell.
However, when building with clang 5.0.0 and Visual Studio 2017 headers, we get the following assembly:
main: # #main
.Lcfi0:
.seh_proc main
# BB#0:
subq $40, %rsp
.Lcfi1:
.seh_stackalloc 40
.Lcfi2:
.seh_endprologue
movl $16, %ecx
callq "??2#YAPEAX_K#Z" ; void * __ptr64 __cdecl operator new(unsigned __int64)
movq %rax, %rcx
callq "??3#YAXPEAX#Z" ; void __cdecl operator delete(void * __ptr64)
movl $10, %eax
addq $40, %rsp
retq
.seh_handlerdata
.text
Note the movl $10, %eax instruction - that is, with VS 2017's <vector>, clang was able to collapse everything, precalculate the result (10), and keep only the calls to new and delete.
I'd say that is pretty amazing!
Function Inlining
Function inlining is probably the single most vital optimization in this example. By collapsing the code of called functions into their call sites, the compiler is able to perform further optimizations on the merged code; in addition, removing function calls reduces call overhead and eliminates optimization barriers.
When inspecting the generated assembly for VS, and comparing the code before and after inlining (Compiler Explorer), we can see that most vector functions were indeed inlined, except for the allocation and deallocation functions. In particular, there are calls to memmove, which are the result of inlining some higher-level functions, such as _Uninitialized_copy_al_unchecked.
memmove is a library function, and therefore cannot be inlined. However, Clang has a clever way around this - it replaces the call to memmove with a call to __builtin_memmove. __builtin_memmove is a builtin/intrinsic function with the same functionality as memmove, but, as opposed to a plain function call, the compiler generates code for it and embedsds it into the calling function. Consequently, the code can be further optimized inside the calling function and eventually removed as dead code.
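To make the distinction concrete, here is a small sketch of a direct use of the builtin (not the actual call sites inside the vector headers); with a compile-time-constant size, Clang and GCC can expand it inline and then optimize through it:

```cpp
// __builtin_memmove behaves like memmove, but for known small sizes the
// compiler emits the copy inline instead of calling into libc, so the
// copy can be folded into surrounding code or eliminated entirely.
int copy_first(const int* src) {
    int dst[4];
    __builtin_memmove(dst, src, 4 * sizeof(int));
    return dst[0];   // after inlining, this reduces to a load of src[0]
}
```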
Summary
To conclude, Clang is clearly superior to VS in this example, thanks both to high-quality optimizations and to a more efficient <vector> STL implementation. When the same header files are used for Visual C++ and Clang (the Visual Studio 2017 headers), Clang beats Visual C++ hands down.
While writing this answer, I couldn't help thinking: what would we do without Compiler Explorer? Thanks, Matt Godbolt, for this amazing tool!

Why does this difference in asm matter for performance (in an un-optimized ptr++ vs. ++ptr loop)?

TL;DR: the first loop runs ~18% faster on a Haswell CPU. Why? The loops are from gcc -O0 (un-optimized) loops using ptr++ vs ++ptr, but the question is why the resulting asm performs differently, not anything about how to write better C.
Let's say we have those two loops:
movl $0, -48(%ebp) //Loop counter set to 0
movl $_data, -12(%ebp) //Pointer to the data array
movl %eax, -96(%ebp)
movl %edx, -92(%ebp)
jmp L21
L22:
// ptr++
movl -12(%ebp), %eax //Get the current address
leal 4(%eax), %edx //Calculate the next address
movl %edx, -12(%ebp) //Store the new (next) address
// rest of the loop is the same as the other
movl -48(%ebp), %edx //Get the loop counter to edx
movl %edx, (%eax) //Move the loop counter value to the CURRENT address, note -12(%ebp) contains already the next one
addl $1, -48(%ebp) //Increase the counter
L21:
cmpl $999999, -48(%ebp)
jle L22
and the second one:
movl %eax, -104(%ebp)
movl %edx, -100(%ebp)
movl $_data-4, -12(%ebp) //Address of the data - 1 element (4 byte)
movl $0, -48(%ebp) //Set the loop counter to 0
jmp L23
L24:
// ++ptr
addl $4, -12(%ebp) //Calculate the CURRENT address by adding one sizeof(int)==4 bytes
movl -12(%ebp), %eax //Store in eax the address
// rest of the loop is the same as the other
movl -48(%ebp), %edx //Store in edx the current loop counter
movl %edx, (%eax) //Move the loop counter value to the current stored address location
addl $1, -48(%ebp) //Increase the loop counter
L23:
cmpl $999999, -48(%ebp)
jle L24
Those loops do exactly the same thing, but in a slightly different manner; please refer to the comments for details.
This asm code is generated from the following two C++ loops:
//FIRST LOOP:
for(;index<size;index++){
*(ptr++) = index;
}
//SECOND LOOP:
ptr = data - 1;
for(index = 0;index<size;index++){
*(++ptr) = index;
}
Now, the first loop is ~18% faster than the second one: no matter in which order the loops are run, the one with ptr++ is faster than the one with ++ptr.
To run my benchmarks, I collected the running times of those loops for different sizes, executing both of them nested in outer loops to repeat the operation frequently.
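The benchmark harness isn't shown in the question; a sketch of what it might look like (assumed setup, with the two loops from above; note that forming data - 1, as in the question's second loop, is technically undefined behavior):

```cpp
#include <chrono>

static int data[1000000];

// First loop: *(ptr++) = index;
static void loop_post(int size) {
    int* ptr = data;
    for (int index = 0; index < size; index++)
        *(ptr++) = index;
}

// Second loop: *(++ptr) = index;  (ptr = data - 1 mirrors the question,
// although a pointer before the start of an array is technically UB)
static void loop_pre(int size) {
    int* ptr = data - 1;
    for (int index = 0; index < size; index++)
        *(++ptr) = index;
}

// Time `reps` repetitions of one loop, in microseconds.
template <typename F>
static long long time_loop(F f, int size, int reps) {
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++)
        f(size);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Compiled with gcc -O0 to reproduce the codegen above, comparing time_loop(loop_post, 1000000, 100) against time_loop(loop_pre, 1000000, 100) should show the difference under discussion.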
ASM analysis
Looking at the ASM code, the second loop contains fewer instructions: we have 3 movl and 2 addl, whereas in the first loop we have 4 movl, one addl, and one leal; so we have one extra movl, and a leal instead of an addl.
Is that correct that the LEA operation for computing the correct address is much faster than the ADD (+4) method? Is this the reason for the difference in performance?
As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced, so the second loop after the addl $4,-12(%ebp) needs to wait a little before proceeding, whereas in the first loop we can immediately refer to the memory and in the meanwhile LEAL will compute the next address (some kind of better pipeline performance here).
Is there some reordering going on here? I'm not sure about my explanation for the performance difference of those loops, can I have your opinion?
First of all, performance analysis on -O0 compiler output is usually not very interesting or useful.
Is that correct that the LEAL operation for computing the correct address is much faster than the ADDL (+4) method? Is this the reason for the difference in performance?
Nope, add can run on every ALU execution port on any x86 CPU. lea usually has equally low latency with simple addressing modes, but not as good throughput. On Atom, it runs in a different stage of the pipeline from normal ALU instructions, because it actually lives up to its name and uses the AGU on that in-order microarchitecture.
See the x86 tag wiki to learn what makes code slow or fast on different microarchitectures, esp. Agner Fog's microarchitecture pdf and instruction tables.
add is only worse because it lets gcc -O0 make even worse code by using it with a memory destination and then loading from that.
Compiling with -O0 doesn't even try to use the best instructions for the job. e.g. you'll get mov $0, %eax instead of the xor %eax,%eax you always get in optimized code. You shouldn't infer anything about what's good from looking at un-optimized compiler output.
-O0 code is always full of bottlenecks, usually on load/store or store-forwarding. Unfortunately IACA doesn't account for store-forwarding latency, so it doesn't realize that these loops actually bottleneck on store-forwarding:
As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced, so the second loop after the addl $4,-12(%ebp) needs to wait a little before proceeding,
Yes, the mov load of -12(%ebp) won't be ready for about 6 cycles after the load that was part of add's read-modify-write.
whereas in the first loop we can immediately refer to the memory
Yes
and in the meanwhile LEAL will compute the next address
No.
Your analysis is close, but you missed the fact that the next iteration still has to load the value we stored into -12(%ebp). So the loop-carried dependency chain is the same length, and the next iteration's lea can't actually start any sooner than in the loop using add.
The latency issues might not be the loop throughput bottleneck:
uop / execution port throughput needs to be considered. In this case, the OP's testing shows it's actually relevant. (Or latency from resource conflicts.)
When gcc -O0 implements ptr++, it keeps the old value in a register, like you said. So store addresses are known further ahead of time, and there's one fewer load uop that needs an AGU.
Assuming an Intel SnB-family CPU:
## ptr++: 1st loop
movl -12(%ebp), %eax //1 uop (load)
leal 4(%eax), %edx //1 uop (ALU only)
movl %edx, -12(%ebp) //1 store-address, 1 store-data
// no load from -12(%ebp) into %eax
... rest the same.
## ++ptr: 2nd loop
addl $4, -12(%ebp) // read-modify-write: 2 fused-domain uops. 4 unfused: 1 load + 1 store-address + 1 store-data
movl -12(%ebp), %eax // load: 1 uop. ~6 cycle latency for %eax to be ready
... rest the same
So the pointer-increment part of the 2nd loop has one more load uop. Probably the code bottlenecks on AGU throughput (address-generation units). IACA says that's the case for arch=SNB, but that HSW bottlenecks on store-data throughput (not AGUs).
However, without taking store-forwarding latency into account, IACA says the first loop can run at one iteration per 3.5 cycles, vs. one per 4 cycles for the second loop. That's faster than the 6 cycle loop-carried dependency of the addl $1, -48(%ebp) loop counter, which indicates that the loop is bottlenecked by latency to less than max AGU throughput. (Resource conflicts probably mean it actually runs slower than one iteration per 6c, see below).
We could test this theory:
Adding an extra load uop to the lea version, off the critical path, would take more throughput, but wouldn't be part of the loop's latency chains. e.g.
movl -12(%ebp), %eax //Get the current address
leal 4(%eax), %edx //Calculate the next address
movl %edx, -12(%ebp) //Store the new (next) address
mov -12(%ebp), %edx
%edx is about to be overwritten by a mov, so there are no dependencies on the result of this load. (The destination of mov is write-only, so it breaks dependency chains, thanks to register renaming.)
So this extra load would bring the lea loop up to the same number and flavour of uops as the add loop, but with different latency. If the extra load has no effect on speed, we know the first loop isn't bottlenecked on load / store throughput.
Update: OP's testing confirmed that an extra unused load slows the lea loop down to about the same speed as the add loop.
Why extra uops matter when we're not hitting execution port throughput bottlenecks
uops are scheduled in oldest-first order (out of uops that have their operands ready), not in critical-path-first order. Extra uops that could have been done in a spare cycle later on will actually delay uops that are on the critical path (e.g. part of the loop-carried dependency). This is called a resource conflict, and can increase the latency of the critical path.
i.e. instead of waiting for a cycle where critical-path latency left a load port with nothing to do, the unused load will run when it's the oldest load with its load-address ready. This will delay other loads.
Similarly, in the add loop, where the extra load is part of the critical path, it causes more resource conflicts, delaying operations on the critical path.
Other guesses:
So maybe having the store address ready sooner is what's doing it, so memory operations are pipelined better. (e.g. TLB-miss page walks can start sooner when approaching a page boundary. Even normal hardware prefetching doesn't cross page boundaries, even if they are hot in the TLB. The loop touches 4MiB of memory, which is enough for this kind of thing to matter. L3 latency is high enough to maybe create a pipeline bubble. Or if your L3 is small, then main memory certainly is.)
Or maybe the extra latency just makes it harder for out-of-order execution to do a good job.