I am teaching myself to debug assembly language; I am new to assembly. I have a very simple C++ program and I disassembled it 3 times using different disassemblers: GDB, otool, and godbolt.org. GDB and godbolt.org produced approximately the same amount of code (1 page in a word processor), though many lines differ. The otool -tv command produced about 14 pages of code so there are many differences with respect to the GDB and godbolt.org outputs. The assembly code is too long to post. I was expecting the assembly code outputs to be the same as each other. Why are they different and which disassembler is best?
Here is my C++ program:
#include <iostream>

int main() {
    int a = 1;
    int b = 2;
    int c = 3;
    a += b;
    a = a + c;
    std::cout << "Value of A is " << a << std::endl;
    return 0;
}
An example of assembly differences:
GDB:
0x0000000100000f44 <+4>: sub $0x30,%rsp
0x0000000100000f48 <+8>: mov 0x10c1(%rip),%rdi # 0x100002010
0x0000000100000f4f <+15>: lea 0xfb6(%rip),%rsi
Godbolt.org:
sub rsp, 16
mov DWORD PTR [rbp-4], 1
mov DWORD PTR [rbp-8], 2
Otool -tv gave 13 more pages of code than the others so there is an obvious difference there.
The differences you are experiencing are not in the disassembled program, but rather in the syntax used to represent machine instructions.
Assembly is a very low-level language in which there is a 1-to-1 mapping between machine instructions and mnemonics. The former are sequences of bits, possibly of variable length (as in the case of x86 architectures). This representation is directly interpreted by the CPU to carry out the work associated with the semantics of the instruction. Assembly language is a "human-readable" representation of such sequences.
Basically, the same machine instruction can be written down in more than one way; the set of conventions used to write it is the assembly syntax.
Notoriously, for x86 architectures there exist two different syntaxes: AT&T and Intel. The output which you obtained from GDB is generated according to the AT&T syntax, while the output you got from Godbolt.org uses the Intel syntax.
Intel and AT&T syntax look very different from each other, and this is possibly why you thought the outputs were not the same. In fact, they are just different ways to represent the very same instructions.
These two "dialects" for the same architecture's assembly were born with different goals in mind. AT&T syntax was developed at AT&T labs to support the generation of programs for many different CPUs (see the book: Jeff Duntermann, Assembly Language Step-by-Step). At the time, AT&T was playing a major role in the history of computers. AT&T (Bell Labs) has been the source of Unix---its paradigm is currently (although partially) committed to by Linux---the C programming language, and many other fundamental tools that we continue to use today.
On the other hand, Intel syntax was developed, well... by Intel for its own CPUs. Many adopters of the Intel syntax say that it is much neater when programming Intel CPUs. This might well be the case, as the syntax has been carefully crafted exactly for what the CPU supports.
While the AT&T syntax is, to the best of my knowledge, no longer used nowadays to write programs for CPUs other than x86, some of its quirks stem from its being designed to be more "general".
Then, which one to learn? My choice would be driven by the environment you work in. The whole Unix ecosystem (including Linux and macOS) has toolchains (such as gas) which use the AT&T syntax by default. In the Linux kernel (and other low-level pieces of software) you will definitely find inline assembly code in AT&T syntax to interact with the hardware. Windows systems, on the other hand, have toolchains (such as nasm) which speak the Intel syntax. While compile-time flags can ask these tools to switch to the other syntax (such as the -M flag for objdump), the habit is to adopt the "native" syntax.
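For example, the common GNU tools can all be asked to show Intel syntax instead (these options exist in mainline GDB, binutils, and gcc, though defaults may vary by build):

(gdb) set disassembly-flavor intel   # GDB: disassemble in Intel syntax
$ objdump -d -M intel ./a.out        # objdump: Intel-syntax disassembly
$ gcc -S -masm=intel prog.c          # gcc: emit Intel-syntax assembly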
With respect to the specific examples given in the question, they are "incompatible", in the sense that they refer to different portions of the disassembled code, so there is a higher degree of difference between the two.
Indeed, with respect to this GDB output:
sub $0x30, %rsp
mov 0x10c1(%rip), %rdi
lea 0xfb6(%rip), %rsi
the corresponding Intel disassembly would be:
sub rsp, 0x30
mov rdi, QWORD PTR [rip+0x10c1]
lea rsi, [rip+0xfb6]
On the other hand, with respect to the Godbolt.org output:
sub rsp, 16
mov DWORD PTR [rbp-4], 1
mov DWORD PTR [rbp-8], 2
the corresponding AT&T disassembly would be:
sub $0x10,%rsp
movl $0x1,-0x4(%rbp)
movl $0x2,-0x8(%rbp)
As you can see, the greatest difference, which might cause a lot of headaches, is related to the fact that the AT&T syntax places the source first and then the destination, while Intel syntax works the other way round.
The assembly sequences are not the same code in different syntaxes; they are simply different code, probably because different compilers (or different compiler options) were used.
First pair:
sub $0x30,%rsp ;rsp -= 0x30
sub rsp,16 ;rsp -= 0x10
Next pair:
mov 0x10c1(%rip),%rdi ;rdi = [rip+0x10c1] (loads a value)
mov DWORD PTR [rbp-4],1 ;[rbp-4] = 1 (stores an immediate value)
Next pair:
lea 0xfb6(%rip),%rsi ;rsi = rip+0xfb6 (loads an offset)
mov DWORD PTR [rbp-8],2 ;[rbp-8] = 2 (stores an immediate value)
Both sequences are incomplete, but I don't think that matters much, as they already show the differences.
Because there is not a 1-to-1 relationship between source code and assembly. The compiler would likely generate the same assembly for the following statements:
x = x + 1;
and
x++;
both of which would be compiled to something like
add dword ptr [rdi], 1
So, when we disassemble that, which one should it be disassembled to: x = x + 1; or x++;? This applies to virtually every statement of your program - if there is more than one way of expressing what happens in the source language, and the effects are the same, the compiler may choose to translate both of them to the same output. After which, you have no way of knowing which one was used.
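You can verify this on any compiler. As a sketch (the function names are made up, and the exact instructions vary by compiler and target), GCC at -O1 or higher emits identical code for both spellings:

int inc_add(int x) { return x + 1; }
int inc_pp (int x) { x++; return x; }
/* gcc -O2 on x86-64 typically emits the same body for both
   (Intel syntax):
       lea  eax, [rdi+1]
       ret
   Disassembling that, there is no way to tell which C source
   produced it. */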
Related
I am using the same code snippet in C and C++.
#include <stdio.h>

int main() {
    goto myLabel;
    printf("skipped\n");

myLabel:
    printf("after myLabel\n");
    return 0;
}
Using Visual Studio 2022 IDE and Compiler.
Assembly Code for C++
0000000140001000 sub rsp,28h
0000000140001004 jmp 0000000140001014
0000000140001006 jmp 0000000140001014
0000000140001008 lea rcx,[0000000140004230h]
000000014000100F call 0000000140001090
0000000140001014 lea rcx,[0000000140004240h]
000000014000101B call 0000000140001090
0000000140001020 xor eax,eax
0000000140001022 add rsp,28h
0000000140001026 ret
Assembly Code for C
0000000140001000 sub rsp,28h
0000000140001004 jmp 0000000140001012
0000000140001006 lea rcx,[0000000140006000h]
000000014000100D call 0000000140001090
0000000140001012 lea rcx,[0000000140006010h]
0000000140001019 call 0000000140001090
000000014000101E xor eax,eax
0000000140001020 add rsp,28h
0000000140001024 ret
The question is why the C++ assembly code uses 2 jmp instructions where the C code uses 1.
It is like this by design in debug builds; see this report in the MSVC bug database:
VS2019 (debug, x86) generates two identical JMP instructions for one goto statement
C and C++ are two completely different programming languages.
Different compilers are used to compile them. The actual compiler might be a single, monolithic program, but functionally there are two logically distinct compilers and sets of algorithms that have nothing to do with each other.
It is not entirely unexpected that different compilers will generate different compiled code from syntactically identical source. The differences in the resulting compiled code result from the different algorithms that are employed by the different compilers, when translating the source code. The differences carry no special, inherent meaning.
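One easy way to see this for yourself is to compile the identical file once as C and once as C++ and diff the generated assembly. A sketch using the GNU toolchain (the same idea applies to MSVC's assembly-listing options; the file names are made up):

$ gcc -S goto.c -o goto_c.s    # compile as C
$ g++ -S goto.c -o goto_cpp.s  # g++ compiles the same file as C++
$ diff goto_c.s goto_cpp.s     # the listings generally differ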
I am disassembling this code on llvm clang Apple LLVM version 8.0.0 (clang-800.0.42.1):
#include <stdio.h>

int main() {
    float a = 0.151234;
    float b = 0.2;
    float c = a + b;
    printf("%f", c);
}
I compiled with no -O option, but I also tried -O0 (which gives the same output) and -O2 (which actually computes the value at compile time and stores it precomputed).
The resulting disassembly is the following (I removed the parts that are not relevant)
-> 0x100000f30 <+0>: pushq %rbp
0x100000f31 <+1>: movq %rsp, %rbp
0x100000f34 <+4>: subq $0x10, %rsp
0x100000f38 <+8>: leaq 0x6d(%rip), %rdi
0x100000f3f <+15>: movss 0x5d(%rip), %xmm0
0x100000f47 <+23>: movss 0x59(%rip), %xmm1
0x100000f4f <+31>: movss %xmm1, -0x4(%rbp)
0x100000f54 <+36>: movss %xmm0, -0x8(%rbp)
0x100000f59 <+41>: movss -0x4(%rbp), %xmm0
0x100000f5e <+46>: addss -0x8(%rbp), %xmm0
0x100000f63 <+51>: movss %xmm0, -0xc(%rbp)
...
Apparently it's doing the following:
load the two floats into registers xmm0 and xmm1
store them on the stack
load one value (not the one xmm0 held earlier) from the stack into xmm0
perform the addition
store the result back to the stack
I find it inefficient because:
Everything can be done in registers. I am not using a and b later, so it could just skip any operation involving the stack.
Even if it wanted to use the stack, it could avoid reloading xmm0 from the stack by performing the addition in a different order.
Given that the compiler is always right, why did it choose this strategy?
-O0 (unoptimized) is the default. It tells the compiler you want it to compile fast (short compile times), not to take extra time compiling to make efficient code.
(-O0 isn't literally no optimization; e.g. gcc will still eliminate code inside if(1 == 2){ } blocks. Especially gcc more than most other compilers still does things like use multiplicative inverses for division at -O0, because it still transforms your C source through multiple internal representations of the logic before eventually emitting asm.)
Plus, "the compiler is always right" is an exaggeration even at -O3. Compilers are very good at a large scale, but minor missed-optimizations are still common within single loops. Often with very low impact, but wasted instructions (or uops) in a loop can eat up space in the out-of-order execution reordering window, and be less hyper-threading friendly when sharing a core with another thread. See C++ code for testing the Collatz conjecture faster than hand-written assembly - why? for more about beating the compiler in a simple specific case.
More importantly, -O0 also implies treating all variables similar to volatile for consistent debugging. i.e. so you can set a breakpoint or single step and modify the value of a C variable, and then continue execution and have the program work the way you'd expect from your C source running on the C abstract machine. So the compiler can't do any constant-propagation or value-range simplification. (e.g. an integer that's known to be non-negative can simplify things using it, or make some if conditions always true or always false.)
(It's not quite as bad as volatile: multiple references to the same variable within one statement don't always result in multiple loads; at -O0 compilers will still optimize somewhat within a single expression.)
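For example (a small sketch; exact output varies by compiler and version), gcc -O0 compiles return a + a; with a single load of a rather than one load per mention:

int twice(int a) { return a + a; }
/* gcc -O0 on x86-64 emits roughly (Intel syntax):
       mov  eax, DWORD PTR [rbp-4]   ; a is loaded once...
       add  eax, eax                 ; ...and reused for both mentions
   A volatile int would instead be loaded twice. */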
Compilers have to specifically anti-optimize for -O0 by storing/reloading all variables to their memory address between statements. (In C and C++, every variable has an address unless it was declared with the (now obsolete) register keyword and has never had its address taken. Optimizing away the address is possible according to the as-if rule for other variables, but isn't done at -O0.)
Unfortunately, debug-info formats can't track the location of a variable through registers, so fully consistent debugging isn't possible without this slow-and-stupid code-gen.
If you don't need this, you can compile with -Og for light optimization, and without the anti-optimizations required for consistent debugging. The GCC manual recommends it for the usual edit/compile/run cycle, but you will get "optimized out" for many local variables with automatic storage when debugging. Globals and function args still usually have their actual values, at least at function boundaries.
Even worse, -O0 makes code that still works even if you use GDB's jump command to continue execution at a different source line. So each C statement has to be compiled into a fully independent block of instructions. (Is it possible to "jump"/"skip" in GDB debugger?)
for() loops can't be transformed into idiomatic (for asm) do{}while() loops, and other restrictions.
For all the above reasons, (micro-)benchmarking un-optimized code is a huge waste of time; the results depend on silly details of how you wrote the source that don't matter when you compile with normal optimization. -O0 vs. -O3 performance is not linearly related; some code will speed up much more than others.
The bottlenecks in -O0 code will often be different from those at -O3: often a loop counter that's kept in memory, creating a ~6-cycle loop-carried dependency chain. This can create interesting effects in the compiler-generated asm, like Adding a redundant assignment speeds up code when compiled without optimization (which is interesting from an asm perspective, but not for C).
"My benchmark optimized away otherwise" is not a valid justification for looking at the performance of -O0 code.
See C loop optimization help for final assignment for an example and more details about the rabbit hole that tuning for -O0 is.
Getting interesting compiler output
If you want to see how the compiler adds 2 variables, write a function that takes args and returns a value. Remember you only want to look at the asm, not run it, so you don't need a main or any numeric literal values for anything that should be a runtime variable.
See also How to remove "noise" from GCC/clang assembly output? for more about this.
float foo(float a, float b) {
    float c = a + b;
    return c;
}
compiles with clang -O3 (on the Godbolt compiler explorer) to the expected
addss xmm0, xmm1
ret
But with -O0 it spills the args to stack memory. (Godbolt uses debug info emitted by the compiler to colour-code asm instructions according to which C statement they came from. I've added line breaks to show blocks for each statement, but you can see this with colour highlighting on the Godbolt link above. Often very handy for finding the interesting part of an inner loop in optimized compiler output.)
gcc -fverbose-asm will put comments on every line showing the operand names as C vars. In optimized code that's often an internal tmp name, but in un-optimized code it's usually an actual variable from the C source. I've manually commented the clang output because it doesn't do that.
# clang7.0 -O0 also on Godbolt
foo:
push rbp
mov rbp, rsp # make a traditional stack frame
movss DWORD PTR [rbp-20], xmm0 # spill the register args
movss DWORD PTR [rbp-24], xmm1 # into the red zone (below RSP)
movss xmm0, DWORD PTR [rbp-20] # a
addss xmm0, DWORD PTR [rbp-24] # +b
movss DWORD PTR [rbp-4], xmm0 # store c
movss xmm0, DWORD PTR [rbp-4] # return c
pop rbp # epilogue
ret
Fun fact: using register float c = a+b;, the return value can stay in XMM0 between statements, instead of being spilled/reloaded. The variable has no address. (I included that version of the function in the Godbolt link.)
The register keyword has no effect in optimized code (except making it an error to take a variable's address, like how const on a local stops you from accidentally modifying something). I don't recommend using it, but it's interesting to see that it does actually affect un-optimized code.
Related:
Complex compiler output for simple constructor - every copy of a variable when passing args typically results in extra copies in the asm.
Why is this C++ wrapper class not being inlined away? __attribute__((always_inline)) can force inlining, but doesn't optimize away the copying to create the function args, let alone optimize the function into the caller.
When debugging C code with GDB, the displayed assembly code is
0x000000000040116c main+0 push %rbp
0x000000000040116d main+1 mov %rsp,%rbp
!0x0000000000401170 main+4 movl $0x0,-0x4(%rbp)
0x0000000000401177 main+11 jmp 0x40118d <main+33>
0x0000000000401179 main+13 mov -0x4(%rbp),%eax
0x000000000040117c main+16 mov %eax,%edx
0x000000000040117e main+18 mov -0x4(%rbp),%eax
Is the 0x000000000040116d in the front of the first assembly instruction the virtual address of this function? Is main+1 the offset of this assembly from the main function? The next assembly is main+4. Does it mean that the first mov %rsp,%rbp is three bytes? If so, why is movl $0x0,-0x4(%rbp) 7 bytes?
I am using a server. The version is: Linux version 4.15.0-122-generic (buildd@lcy01-amd64-010) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)) #124~16.04.1-Ubuntu SMP.
Pretty much yes. It's quite apparent that, for example, adding 4 to the first address gives you the address shown for main+4, and adding another 7 on top of that gives you the corresponding address for main+11.
As far as the two move instructions go: they are very different, and do completely different things. They are two very different kinds of moves, and that's how many bytes each one requires in x86 machine language, so it's not surprising that one takes many more bytes than the other. As for the precise reason why, well, that opens a very long, broad, and winding discussion about the underlying reasons and the original design goals of the x86 machine instruction set. Much of it actually no longer applies (and you would probably find it quite boring), since the modern x86 CPU is something quite radically different from its original generation. But it has to remain binary compatible. Hence, little oddities like that.
Just to give you a basic understanding: the first move is between two CPU registers; it doesn't take a long novel to specify from where and to where. The second move has to encode a 32-bit immediate value (0, to be precise), a CPU register, and a memory offset. All of that has to be specified somewhere, and you need the bytes to spell out all those little details.
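If you want to see the encodings for yourself, GDB can print the raw bytes next to each instruction with disassemble /r. A sketch based on the addresses in the question (exact bytes depend on your build):

(gdb) disassemble /r main
   0x000000000040116c <+0>:  55                      push %rbp
   0x000000000040116d <+1>:  48 89 e5                mov  %rsp,%rbp
   0x0000000000401170 <+4>:  c7 45 fc 00 00 00 00    movl $0x0,-0x4(%rbp)

Counting bytes confirms the offsets: the register-to-register mov is 3 bytes (REX prefix, opcode, ModRM), while the movl needs an opcode, a ModRM byte, a 1-byte displacement, and a 4-byte immediate, 7 bytes in total.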
Consider the following snippet of code:
int* find_ptr(int* mem, int sz, int val) {
    for (int i = 0; i < sz; i++) {
        if (mem[i] == val) {
            return &mem[i];
        }
    }
    return nullptr;
}
GCC on -O3 compiles this to:
find_ptr(int*, int, int):
        mov     rax, rdi
        test    esi, esi
        jle     .L4        # why not .L8?
        lea     ecx, [rsi-1]
        lea     rcx, [rdi+4+rcx*4]
        jmp     .L3
.L9:
        add     rax, 4
        cmp     rax, rcx
        je      .L8
.L3:
        cmp     DWORD PTR [rax], edx
        jne     .L9
        ret
.L8:
        xor     eax, eax
        ret
.L4:
        xor     eax, eax
        ret
In this assembly, the blocks with labels .L4 and .L8 are identical. Would it not be better to rewrite jumps to .L4 to .L8 and drop .L4? I thought this might be a bug, but clang also duplicates the xor-ret sequence back to back. However, ICC and MSVC each take a pretty different approach.
Is this an optimization in this case and, if not, are there times when it would be? What is the rationale behind this behavior?
This is always a missed optimization. Having both return-0 paths use the same basic block would be pure win on all microarchitectures that current compilers care about.
But unfortunately this missed-optimization is not rare with gcc. Often it's a separate bare ret that gcc conditionally branches to, instead of branching to a ret in another existing path. (x86 doesn't have a conditional ret, so simple functions that don't need any stack cleanup often just need to branch to a ret.
Often functions this small would get inlined in a complete program, so maybe it doesn't hurt a lot in real life?)
CPUs (since Pentium Pro if not earlier) have a return-address predictor stack that easily predicts the branch target for ret instructions, so there's not going to be an effect from one ret instruction more often returning to one caller and another ret more often returning to another caller. It doesn't help branch prediction to separate them and let them use different entries.
IDK about Pentium 4 and whether the traces in its trace cache follow call/ret. But fortunately that's not relevant anymore. The decoded-uop cache in SnB-family and Ryzen is not a trace cache; a line/way of uop cache holds uops for a contiguous block of x86 machine code, and unconditional jumps end a uop cache line. (https://agner.org/optimize/) So if anything, this could be worse for SnB-family because each return path needs a separate line of the uop cache even though they're each only 2 uops total (xor-zero and ret are both single-uop instructions).
Report this MCVE to gcc's bugzilla with keyword missed-optimization: https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc
(update: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90178 was reported by the OP. A fix was attempted, but reverted; for now it's still open. In this case it seems to be caused by -mavx, perhaps some interaction with return paths that need vzeroupper or not.)
Cause:
You can kind of see how it might arrive at 2 exit blocks: compilers normally transform for loops into if(sz>0) { do{}while(); } if there's a possibility of it needing to run 0 times, like gcc did here. So there's one branch that leaves the function without entering the loop at all. But the other exit is from fall through from the loop. Perhaps before optimizing away some stuff, there was some extra cleanup. Or just those paths got split up when the first branch was created.
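In source terms, the rotated form looks roughly like this (a sketch for illustration; the function name is made up, and gcc of course does this on its internal representation, not on C source):

int* find_ptr_rotated(int* mem, int sz, int val) {
    if (sz <= 0)             // early-out branch: the .L4 return-nullptr path
        return nullptr;
    int* end = mem + sz;     // the end pointer kept in rcx in the asm above
    do {
        if (*mem == val)
            return mem;
        ++mem;
    } while (mem != end);
    return nullptr;          // loop fall-through: the .L8 path
}

The two return nullptr; statements sit on two different control-flow paths, which is how the two identical exit blocks arise; the missed optimization is failing to merge them afterwards.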
I don't know why gcc fails to notice and merge two identical basic blocks that end with ret.
Maybe it only looked for that in some GIMPLE or RTL pass where they weren't actually identical, and they only became identical during final x86 code-gen. Maybe after optimizing away the save/restore of a register used to hold some temporary that it ended up not needing?
You could dig deeper if you look at GCC's GIMPLE or RTL with -fdump-tree-... options after certain optimization passes: Godbolt has UI for that, in the + dropdown -> tree / RTL output. https://godbolt.org/z/l9mVlE. But unless you're a gcc-internals expert and planning to work on a patch or idea to help gcc find this optimization, it's probably not worth your time.
Interesting discovery that it only happens with -mavx (enabled by -march=skylake or directly). GCC and clang don't know how to auto-vectorize loops where the trip count is not known before the first iteration. e.g. search loops like this or memchr or strlen. So IDK why AVX even makes a difference at all.
(Note that the C abstract machine never reads mem[i] beyond the search point, and those elements might not actually exist. e.g. there's no UB if you passed this function a pointer to the last int before an unmapped page, and sz=1000, as long as *mem == val. So to auto-vectorize without int mem[static sz] guaranteed object size, the compiler would have to align the pointer... Not that C11 int mem[static sz] would even help; even a static array of compile-time-constant size larger than the max possible trip count wouldn't get gcc to auto-vectorize.)