how does the compiler optimize this piece of code - c++

consider the following loop:
unsigned long x = 0;
for(unsigned long i = 2314543142; i > 0; i-- )
x+=i;
std::cout << x << std::endl;
when I compile this normally it take roughly 6.5 seconds to execute this loop. But when I compile with -O3 optimization the loop gets executed in 10^-6 seconds. How is this possible? The compiler surely does not know how the closed form expression for x...

You don't really have to know all about assembly to see that the complier determines the value of x at compile time, if compiling with optimization on.
I modified your code slightly to be able to use the online tool Compiler Explorer, changing the std::cout << x << std::endl to extern unsigned long foo; and foo = x;. Not really necessary but it makes the output cleaner.
Compiled with -O2:
test():
movabs rax, 2678554979246887653
mov QWORD PTR foo[rip], rax
ret
Compiled with -O0:
test():
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], 0
mov DWORD PTR [rbp-16], -1980424154
mov DWORD PTR [rbp-12], 0
jmp .L2
.L3:
mov rax, QWORD PTR [rbp-16]
add QWORD PTR [rbp-8], rax
sub QWORD PTR [rbp-16], 1
.L2:
cmp QWORD PTR [rbp-16], 0
setne al
test al, al
jne .L3
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR foo[rip], rax
leave
ret
Also: the first revision of your code with undefined behavior due to i >= 0 just outputs:
test():
.L2:
jmp .L2
:-)

The compiler determines that the value of x after the loop and uses that in the output statement.

Related

Finding max number between two, which implementation to choose

I am trying to figure out, which implementation has edge over other while finding max number between two. As an example let's examine two implementation:
Implementation 1:
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
jmp .L4 .L2:
mov eax, DWORD PTR [rbp-8] .L4:
pop rbp
ret
Implementation 2:
int findMax(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov eax, DWORD PTR [rbp-20]
sub eax, DWORD PTR [rbp-24]
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
shr eax, 31
mov DWORD PTR [rbp-8], eax
mov eax, DWORD PTR [rbp-8]
imul eax, DWORD PTR [rbp-4]
mov edx, eax
mov eax, DWORD PTR [rbp-20]
sub eax, edx
mov DWORD PTR [rbp-12], eax
mov eax, DWORD PTR [rbp-12]
pop rbp
ret
The second one produced more assembly instructions but first one has conditional jump. Just trying to understand if both are equally good.
First you need to turn on compiler optimizations (I used -O2 for the following). And you should compare to std::max. Then this:
#include <algorithm>
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
int findMax2(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
int findMax3(int a,int b){
return std::max(a,b);
}
results in:
findMax(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
findMax2(int, int):
mov ecx, edi
mov eax, edi
sub ecx, esi
mov edx, ecx
shr edx, 31
imul edx, ecx
sub eax, edx
ret
findMax3(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
Your first version results in identical assembly as std::max, while your second variant is doing more. Actually when trying to optimize you need to specify what you optimize for. There are several options that typically require a trade-off to be made: Runtime, memory usage, size of executable, readability of code, etc. Typically you cannot get it all at once.
When in doubt, do not reinvent a wheel but use existing already optimzied std::max. And do not forget that code you write is not instructions for your CPU, rather it is a high level abstract description of what the program should do. Its the compilers job to figure out how that can be achieved best.
Last but not least, your second variant is actually broken. See example here compiled with -O2 -fsanitize=signed-integer-overflow, results in:
/app/example.cpp:13:10: runtime error: signed integer overflow: -2147483648 - 2147483647 cannot be represented in type 'int'
You should favor correctness over speed. The fastest code is not worth a thing when it is wrong. And because of that, readability is next on the list. Code that is difficult to read and understand is also difficult to proove correct. I was only able to spot the problem in your code with the help of the compiler, while std::max(a,b) is unlikely to cause undefined behavior (and even if it does, at least it isnt your fault ;).
For two ints, you can compute max(a, b) without branching using a technique you probably learnt at school:
a ^ ((a ^ b) & -(a < b));
But no sane person would write this in their code. Always use std::max and trust the compiler to pick the best way. You may well find it adopts the above for int arguments with optimisations set appropriately. Although I conject that a compare and jump is probably the best way on the whole, even at the expense of a pipeline dump.
Using std::max gives the compiler the best optimisation hint.
Implementation 1 performs well on a CISC CPU like a modern x64 AMD/Intel CPU.
Implementation 2 performs well on a RISC GPU like from nVIDIA or AMD Graphics.
The term "performs well" is only significant in a tight loop.

Why aren't clang++ and g++ de-duplicating these instructions?

Consider the following function:
std::string get_value(const bool b)
{
if (b) {
return "Hello";
}
else {
return "World";
}
}
g++ 11.0.1 20210312 compiles this (as C++17 and with maximum optimization) into:
get_value[abi:cxx11](bool):
lea rdx, [rdi+16]
mov rax, rdi
mov QWORD PTR [rdi], rdx
test sil, sil
je .L2
mov DWORD PTR [rdi+16], 1819043144
mov BYTE PTR [rdx+4], 111
mov QWORD PTR [rax+8], 5
mov BYTE PTR [rax+21], 0
ret
.L2:
mov DWORD PTR [rdi+16], 1819438935
mov BYTE PTR [rdx+4], 100
mov QWORD PTR [rax+8], 5
mov BYTE PTR [rax+21], 0
ret
Why does it not move the two replicated mov instructions up before the jump, or even before the test, reducing the code size by two instructions?
The same thing happens with clang++ and libc++, except it only has one relevant instruction to move up.
(See this also on GodBolt)

C++ inline asm move WCHAR in 32-bit register

I am trying to practice the inline ASM in C++ :) Maybe outdated, but it is interesting, to know how CPU is executing the code.
So, what I am trying to do here, is to loop through processes and get a handle of needed one :) I am using for that already created methods from tlhelp32
I have this code:
HANDLE RetHandle = nullptr, snap;
int SizeOfPE = sizeof(PROCESSENTRY32), pid; PROCESSENTRY32 pe;
int PA = PROCESS_ALL_ACCESS;
const char* Pname = "explorer.exe";
__asm
{
mov eax, pe
mov ebx, this
mov ecx, [ebx]pe.dwSize
mov ecx, SizeOfPE
mov[ebx]pe.dwSize, ecx
mov eax, PA
mov ebx,0
call CreateToolhelp32Snapshot
mov eax,snap
label1:
mov eax, snap
mov ebx, [pe]
call Process32First
cmp eax,1
jne exitLabel
Process32NextLoop:
mov eax, snap
mov ebx, [pe]
call Process32Next
cmp eax, 1
jne Process32NextLoop
mov edx, pe
mov ecx, [edx].szExeFile
cmp ecx, Pname
je ExitLoop
jne Process32NextLoop
ExitLoop:
mov eax, [ebx].th32ProcessID
mov pid, eax
ExitLabel:
ret
}
Apparently, it is throwing error in th32ProcessID as well, however, it is just regular int.
Have been searching, but haven't found the equivalent for movl in C++

Mystery: casting a GNU C label pointer to a function pointer, with inline asm to put a ret in that block. Block being optimized away?

Firstly: This code is considered to be of pure fun, please do not do anything like this in production. We will not be responsible of any harm caused to you, your company or your reindeer after compiling and executing this piece of code in any environment. The code below is not safe, not portable and is plainly dangerous. Be warned. Long post below. You were warned.
Now, after the disclaimer: Let's consider the following piece of code:
#include <stdio.h>
int fun()
{
return 5;
}
typedef int(*F)(void) ;
int main(int argc, char const *argv[])
{
void *ptr = &&hi;
F f = (F)ptr;
int c = f();
printf("TT: %d\n", c);
if(c == 5) goto bye;
//else goto bye; /* <---- This is the most important line. Pay attention to it */
hi:
c = 5;
asm volatile ("movl $5, %eax");
asm volatile ("retq");
bye:
return 66;
}
For the beginning we have the function fun which I have created purely for reference to get the generated assembly code.
Then we declare a function pointer F to functions taking no parameters and returning an int.
Then we use the not so well known GCC extension https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html to get the address of a label hi, and this works in clang too. Then we do something evil, we create a function pointer F called f and initialize it to be the label above.
Then the worst of all, we actually call this function, and assign its return value to a local variable, called C and the we print it out.
The following is an if to check if the value assigned to the c is actually the one we need, and if yes go to bye so that he application exits normally, with exit code 66. If that can be considered a normal exit code.
The next line is commented out, but I can say this is the most important line in the entire application.
The piece of code after the label hi is to assign 5 to the value of c, then two lines of assembly to initialize the value of eax to 5 and to actually return from the "function" call. As mentioned, there is a reference function, fun which generates the same code.
And now we compile this application, and run it on our online platform: https://gcc.godbolt.org/z/K6z5Yc
It generates the following assembly (with -O1 turned on, and O0 gives a similar result, albeit a bit more longer):
# else goto bye is COMMENTED OUT
fun:
mov eax, 5
ret
.LC0:
.string "TT: %d\n"
main:
push rbx
mov eax, OFFSET FLAT:.L3
call rax
mov ebx, eax
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
cmp ebx, 5
je .L4
.L3:
movl $5, %eax
retq
.L4:
mov eax, 66
pop rbx
ret
The important lines are mov eax, OFFSET FLAT:.L3 where the L3 corresponds to our hi label, and the line after that: call rax which actually calls it.
And runs like:
ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 66
TT: 5
Now, let's revisit the most important line in the application and uncomment it.
With -O0 we get the following assembly, generated by gcc:
# else goto bye is UNCOMMENTED
# even gcc -O0 "knows" hi: is unreachable.
fun:
push rbp
mov rbp, rsp
mov eax, 5
pop rbp
ret
.LC0:
.string "TT: %d\n"
main:
push rbp
mov rbp, rsp
sub rsp, 48
mov DWORD PTR [rbp-36], edi
mov QWORD PTR [rbp-48], rsi
mov QWORD PTR [rbp-8], OFFSET FLAT:.L4
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR [rbp-16], rax
mov rax, QWORD PTR [rbp-16]
call rax
mov DWORD PTR [rbp-20], eax
mov eax, DWORD PTR [rbp-20]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
cmp DWORD PTR [rbp-20], 5
nop
.L4:
mov eax, 66
leave
ret
and the following output:
ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 66
so, as you can see our printf was never called, the culprit is the line mov QWORD PTR [rbp-8], OFFSET FLAT:.L4 where L4 actually corresponds to our bye label.
And from what I can see from the generated assembly, not a piece of code from the part after hi was added into the generated code.
But at least the application runs and at least has some code for comparing c to 5.
On the other end, clang, with O0 generates the following nightmare, which by the way crashes:
# else goto bye is UNCOMMENTED
# clang -O0 also doesn't emit any instructions for the hi: block
fun: # #fun
push rbp
mov rbp, rsp
mov eax, 5
pop rbp
ret
main: # #main
push rbp
mov rbp, rsp
sub rsp, 48
mov dword ptr [rbp - 4], 0
mov dword ptr [rbp - 8], edi
mov qword ptr [rbp - 16], rsi
mov qword ptr [rbp - 24], 1
mov rax, qword ptr [rbp - 24]
mov qword ptr [rbp - 32], rax
call qword ptr [rbp - 32]
mov dword ptr [rbp - 36], eax
mov esi, dword ptr [rbp - 36]
movabs rdi, offset .L.str
mov al, 0
call printf
cmp dword ptr [rbp - 36], 5
jne .LBB1_2
jmp .LBB1_3
.LBB1_2:
jmp .LBB1_3
.LBB1_3:
mov eax, 66
add rsp, 48
pop rbp
ret
.L.str:
.asciz "TT: %d\n"
If we turn on some optimization, for example O1, we get from gcc:
# else goto bye is UNCOMMENTED
# gcc -O1
fun:
mov eax, 5
ret
.LC0:
.string "TT: %d\n"
main:
sub rsp, 8
mov eax, OFFSET FLAT:.L3
call rax
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
.L3:
mov eax, 66
add rsp, 8
ret
and the application crashes, which is sort of understandable. Again, the compiler had entirely removed our hi section (mov eax, OFFSET FLAT:.L3 goes tiptoe to L3 which corresponds to our bye section) and unfortunately decided that it's a good idea to increase rsp before a ret so to be sure we end up somewhere totally different where we need to be.
And clang delivers something even more dubious:
# else goto bye is UNCOMMENTED
# clang -O1
fun: # #fun
mov eax, 5
ret
main: # #main
push rax
mov eax, 1
call rax
mov edi, offset .L.str
mov esi, eax
xor eax, eax
call printf
mov eax, 66
pop rcx
ret
.L.str:
.asciz "TT: %d\n"
1 ? How on earth did clang end up with this?
To some level I understand that the compiler decided that dead code after an if where both if and else go to the same location is not needed, but here my knowledge and insight stops.
So now, dear C and C++ gurus, assembly aficionados and compiler crushers, here comes the question:
Why?
Why do you think did the compiler decide that the two labels should be considered equivalent if we have added the else branch, or why did clang put there 1, and last but not least: someone with a deep understanding of the C standard could maybe point out where this piece of code deviated so badly from normality that we ended up in this really really weird situation.
someone with a deep understanding of the C standard could maybe point out where this piece of code deviated so badly from normality that we ended up in this really really weird situation.
You think the ISO C standard has anything to say about this code? It's chock full of UB and GNU extensions, notably pointers to local labels.
Casting a label pointer to a function pointer and calling through it is obviously UB. The GCC manual doesn't say you can do that. It's also UB to goto a label in another function.
You were only able to make that work by tricking the compiler into thinking that block might be reached so it's not removed, then using GNU C Basic asm statements to emit a ret instruction there.
GCC and clang remove dead code even with optimization disabled; e.g. if(0) { ... } doesn't emit any instructions to implement the ...
Also note that the c=5 in hi: compiles with optimization fully disabled (and else goto bye commented) to asm like movl $5, -20(%rbp). i.e. using the caller's RBP to modify local variables in the stack frame of the caller. So you have a nested function.
GNU C allows you to define nested functions that can access the local vars of their parent scope. (If you liked the asm you got from your experiment, you'll love the executable trampoline of machine-code that GCC stores to the stack with mov-immediate if you take a pointer to a nested function!)
asm volatile ("movl $5, %eax"); is missing a clobber on EAX. You step on the compiler's toes which would be UB if this statement was ever reached normally, rather than as if it were a separate function.
The use-case for GNU C Basic asm (no constraints / clobbers) is instructions like cli (disable interrupts), not anything involving integer registers, and definitely not ret.
If you want to define a callable function using inline asm, you can use asm("") at global scope, or as the body of an __attribute__((naked)) function.

Understanding what clang is doing in assembly, decrementing for a loop that is incrementing

Consider the following code, in C++:
#include <cstdlib>
std::size_t count(std::size_t n)
{
std::size_t i = 0;
while (i < n) {
asm volatile("": : :"memory");
++i;
}
return i;
}
int main(int argc, char* argv[])
{
return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
It is just a loop that is incrementing its value, and returns it at the end. The asm volatile prevents the loop from being optimized away. We compile it under g++ 8.1 and clang++ 5.0 with the arguments -Wall -Wextra -std=c++11 -g -O3.
Now, if we look at what compiler explorer is producing, we have, for g++:
count(unsigned long):
mov rax, rdi
test rdi, rdi
je .L2
xor edx, edx
.L3:
add rdx, 1
cmp rax, rdx
jne .L3
.L2:
ret
main:
mov eax, 1
xor edx, edx
cmp edi, 1
jg .L25
.L21:
add rdx, 1
cmp rdx, rax
jb .L21
mov eax, edx
ret
.L25:
push rcx
mov rdi, QWORD PTR [rsi+8]
mov edx, 10
xor esi, esi
call strtoll
mov rdx, rax
test rax, rax
je .L11
xor edx, edx
.L12:
add rdx, 1
cmp rdx, rax
jb .L12
.L11:
mov eax, edx
pop rdx
ret
and for clang++:
count(unsigned long): # #count(unsigned long)
test rdi, rdi
je .LBB0_1
mov rax, rdi
.LBB0_3: # =>This Inner Loop Header: Depth=1
dec rax
jne .LBB0_3
mov rax, rdi
ret
.LBB0_1:
xor edi, edi
mov rax, rdi
ret
main: # #main
push rbx
cmp edi, 2
jl .LBB1_1
mov rdi, qword ptr [rsi + 8]
xor ebx, ebx
xor esi, esi
mov edx, 10
call strtoll
test rax, rax
jne .LBB1_3
mov eax, ebx
pop rbx
ret
.LBB1_1:
mov eax, 1
.LBB1_3:
mov rcx, rax
.LBB1_4: # =>This Inner Loop Header: Depth=1
dec rcx
jne .LBB1_4
mov rbx, rax
mov eax, ebx
pop rbx
ret
Understanding the code generated by g++, is not that complicated, the loop being:
.L3:
add rdx, 1
cmp rax, rdx
jne .L3
every iteration increments rdx, and compares it to rax that stores the size of the loop.
Now, I have no idea of what clang++ is doing. Apparently it uses dec, which is weird to me, and I don't even understand where the actual loop is. My question is the following: what is clang doing?
(I am looking for comments about the clang assembly code to describe what is done at each step and how it actually works).
The effect of the function is to return n, either by counting up to n and returning the result, or by simply returning the passed-in value of n. The clang code does the latter. The counting loop is here:
mov rax, rdi
.LBB0_3: # =>This Inner Loop Header: Depth=1
dec rax
jne .LBB0_3
mov rax, rdi
ret
It begins by copying the value of n into rax. It decrements the value in rax, and if the result is not 0, it jumps back to .LBB0_3. If the value is 0 it falls through to the next instruction, which copies the original value of n into rax and returns.
There is no i stored, but the code does the loop the prescribed number of times, and returns the value that i would have had, namely, n.