Assembly code for __rdtsc() in -O0 vs -O3 [duplicate]

This question already has answers here:
Why does main initialize stack frame when there are no variables
(3 answers)
Trying to understand gcc option -fomit-frame-pointer
(3 answers)
What is the purpose of the RBP register in x86_64 assembler?
(2 answers)
Closed 1 year ago.
I have the following code:
#include <x86intrin.h>
int main() {
return __rdtsc();
}
I compiled it on my machine (Intel i7-6700 CPU) and disassembled the result with objdump:
g++ -Wall test_tsc.cpp -o test_tsc -march=native -mtune=native -O0 -std=c++20
objdump -M intel -d test_tsc > test_tsc.O0
Then in test_tsc.O0:
0000000000401122 <main>:
401122: 55 push rbp
401123: 48 89 e5 mov rbp,rsp
401126: 0f 31 rdtsc
401128: 48 c1 e2 20 shl rdx,0x20
40112c: 48 09 d0 or rax,rdx
40112f: 90 nop
401130: 5d pop rbp
401131: c3 ret
401132: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
401139: 00 00 00
40113c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
What do push rbp and mov rbp,rsp do? They look like they save the stack pointer, but there isn't really a function call. If g++ considered __rdtsc() a function call, wouldn't there be a call instruction afterward?
Thanks.

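rbp is the base (frame) pointer, not the stack pointer. The push rbp / mov rbp,rsp pair is the standard prologue that sets up main's own stack frame; it has nothing to do with __rdtsc(), which is an intrinsic that compiles to the inline rdtsc instruction rather than to a call. The frame pointer is used for backtraces during debugging, but it is not necessary for actually running the code.
rbp is a callee-saved register, which is why the prologue saves it before overwriting it. With optimization enabled (here -O3), GCC omits the frame pointer, so only the expected assembly is generated: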
main:
rdtsc
salq $32, %rdx
orq %rdx, %rax
ret
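The shl/or pair (salq/orq in the AT&T listing) exists because rdtsc returns the 64-bit counter split across two registers: the low 32 bits in EAX and the high 32 bits in EDX. A minimal sketch of the same combination in portable C follows; combine_tsc is a hypothetical helper name used only for illustration and is not part of x86intrin.h.

```c
#include <stdint.h>

/* Hypothetical helper: mirrors what the compiler emits right after rdtsc.
 * lo is the value rdtsc leaves in EAX, hi the value it leaves in EDX. */
static uint64_t combine_tsc(uint32_t lo, uint32_t hi)
{
    /* Same effect as: shl rdx, 0x20 ; or rax, rdx */
    return ((uint64_t)hi << 32) | lo;
}
```

For example, combine_tsc(0x89abcdef, 0x01234567) yields 0x0123456789abcdef.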

Related

Clobber X86 register by modifying LLVM Backend

I am trying to modify the LLVM backend for the X86 target slightly, to produce some desired behaviour.
More specifically, I would like to emulate a flag like gcc's -fcall-used-reg option, which instructs the compiler to treat a callee-saved register as a clobbered register (meaning that it may be altered during a function call).
Let's focus on r14. I manually clobber the register, like in this answer:
#include <inttypes.h>
uint64_t inc(uint64_t i) {
__asm__ __volatile__(
""
: "+m" (i)
:
: "r14"
);
return i + 1;
}
int main(int argc, char **argv) {
(void)argv;
return inc(argc);
}
Compile and disassemble:
gcc -std=gnu99 -O3 -ggdb3 -Wall -Wextra -pedantic -o main.out main.c
objdump -d main.out
Disassembly contains:
0000000000001150 <inc>:
1150: 41 56 push %r14
1152: 48 89 7c 24 f8 mov %rdi,-0x8(%rsp)
1157: 48 8b 44 24 f8 mov -0x8(%rsp),%rax
115c: 41 5e pop %r14
115e: 48 83 c0 01 add $0x1,%rax
1162: c3 retq
1163: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
116a: 00 00 00
116d: 0f 1f 00 nopl (%rax)
where we can see that r14, because it is tampered with, is pushed to the stack, and then popped to regain its original value.
Now, repeat with the -fcall-used-r14 flag:
gcc -std=gnu99 -O3 -ggdb3 -fcall-used-r14 -Wall -Wextra -pedantic -o main.out main.c
objdump -d main.out
Disassembly contains:
0000000000001150 <inc>:
1150: 48 89 7c 24 f8 mov %rdi,-0x8(%rsp)
1155: 48 8b 44 24 f8 mov -0x8(%rsp),%rax
115a: 48 83 c0 01 add $0x1,%rax
115e: c3 retq
115f: 90 nop
where no push/pop happens.
Now, I have modified some LLVM Target files, compiled the source, and added(?) this functionality to the llc tool:
clang-11 -emit-llvm -S -c main.c -o main.ll
llc-11 main.ll -o main.s
Now, main.s contains:
# %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
pushq %r14
.cfi_offset %r14, -24
movq %rdi, -16(%rbp)
#APP
#NO_APP
movq -16(%rbp), %rax
addq $1, %rax
popq %r14
popq %rbp
.cfi_def_cfa %rsp, 8
retq
Apparently, r14 is still callee-saved.
Inside llvm/lib/Target/X86/X86CallingConv.td I have modified the following lines (removing R14), because they seemed to be the only ones relevant to the System V ABI for Linux and the C calling conventions that I was interested in:
def CSR_64 : CalleeSavedRegs<(add R12, R13, R15, RBP)>;
...
def CSR_64_MostRegs : CalleeSavedRegs<(add RBX, RCX, RDX, RSI, RDI, R8, R9, R10,
R11, R12, R13, R15, RBP,
...
def CSR_64_AllRegs_NoSSE : CalleeSavedRegs<(add RAX, RBX, RCX, RDX, RSI, RDI, R8, R9,
R10, R11, R12, R13, R15, RBP)>;
My questions are:
Is X86CallingConv.td the only file I should modify? I think yes, but maybe I'm wrong.
Am I focusing on the correct lines? Maybe this is more difficult to answer, but at least a direction could be helpful.
I am running LLVM 11 inside Debian 10.5.
EDIT:
Removing R14 from the following "hidden" definition, as Margaret correctly pointed out:
def CSR_SysV64_RegCall_NoSSE : CalleeSavedRegs<(add RBX, RBP, RSP,
(sequence "R%u", 12, 13), R15)>;
did not help either.

Turns out, the minimum modification was the line:
def CSR_64 : CalleeSavedRegs<(add RBX, R12, R13, R15, RBP)>;
The problem was with how I built the source.
Running cmake --build . again after the original installation rebuilt llc only inside the build directory; it did not update the globally installed tool (I thought it would, because I was building the default architecture, X86, but that was irrelevant). So I had been calling an unmodified llc-11. When I ran the rebuilt binary instead:
/path/to/llvm-project/build/bin/llc main.ll -o main.s
main.s contained:
# %bb.0:
movq %rdi, -8(%rsp)
#APP
#NO_APP
movq -8(%rsp), %rax
addq $1, %rax
retq
which is what I wanted in the first place.

selecting address to change value in memory

This question/answer on SO shows how to use GDB to change a value in memory, but in the example given, the address chosen for the new value does not appear anywhere in the disassembly.
For example, to change the return value to 22, the author does:
set {unsigned char}0x00000000004004b9 = 22
However, why would this address 0x00000000004004b9 be the address to change? If you look at the output of disas/r the address 0x00000000004004b9 isn't being used, so why use this one to set to 22? I'm trying to understand how to know which address needs to be changed to (in this example) change the return value, if the output of disas/r doesn't show it.
code
$ cat t.c
int main()
{
return 42;
}
$ gcc t.c && ./a.out; echo $?
42
$ gdb --write -q ./a.out
(gdb) disas/r main
Dump of assembler code for function main:
0x00000000004004b4 <+0>: 55 push %rbp
0x00000000004004b5 <+1>: 48 89 e5 mov %rsp,%rbp
0x00000000004004b8 <+4>: b8 2a 00 00 00 mov $0x2a,%eax
0x00000000004004bd <+9>: 5d pop %rbp
0x00000000004004be <+10>: c3 retq
End of assembler dump.
(gdb) set {unsigned char}0x00000000004004b9 = 22
(gdb) disas/r main
Dump of assembler code for function main:
0x00000000004004b4 <+0>: 55 push %rbp
0x00000000004004b5 <+1>: 48 89 e5 mov %rsp,%rbp
0x00000000004004b8 <+4>: b8 16 00 00 00 mov $0x16,%eax <<< ---changed
0x00000000004004bd <+9>: 5d pop %rbp
0x00000000004004be <+10>: c3 retq
End of assembler dump.
(gdb) q
$ ./a.out; echo $?
22 <<<--- Just as desired

I'm trying to understand how to know which address needs to be changed to (in this example) change the return value, if the output of disas/r doesn't show it.
To understand this, you need to understand instruction encoding. The instruction here is "move immediate 32-bit constant to register". The constant is part of the instruction (that's what "immediate" means). It may be helpful to compile this instead:
int foo() { return 0x41424344; }
int bar() { return 0x45464748; }
int main() { return foo() + bar(); }
When you do compile it, you should see something similar to:
(gdb) disas/r foo
Dump of assembler code for function foo:
0x00000000004004ed <+0>: 55 push %rbp
0x00000000004004ee <+1>: 48 89 e5 mov %rsp,%rbp
0x00000000004004f1 <+4>: b8 44 43 42 41 mov $0x41424344,%eax
0x00000000004004f6 <+9>: 5d pop %rbp
0x00000000004004f7 <+10>: c3 retq
End of assembler dump.
(gdb) disas/r bar
Dump of assembler code for function bar:
0x00000000004004f8 <+0>: 55 push %rbp
0x00000000004004f9 <+1>: 48 89 e5 mov %rsp,%rbp
0x00000000004004fc <+4>: b8 48 47 46 45 mov $0x45464748,%eax
0x0000000000400501 <+9>: 5d pop %rbp
0x0000000000400502 <+10>: c3 retq
End of assembler dump.
Now you can clearly see where in the instruction stream each byte of the immediate constant resides (and also that x86 uses little-endian encoding for them).
The standard reference on instruction encoding for x86 is the Intel instruction set reference. You can find the 0xB8 instruction on page 3-528.
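The address arithmetic behind the GDB patch can be sketched in C. In the question's listing the mov instruction starts at 0x4004b8; the opcode byte 0xb8 sits at that address, and the four immediate bytes follow at 0x4004b9 through 0x4004bc, least significant byte first. That is why writing a single byte at 0x4004b9 changes the returned value. The helper names below (imm_offset, read_imm32) are made up for illustration, and the code works on a copy of the bytes, not on live process memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset, from the start of the function, of the first immediate byte of a
 * "b8 imm32" instruction located at insn_addr. The +1 skips the opcode byte. */
static size_t imm_offset(uint64_t func_base, uint64_t insn_addr)
{
    return (size_t)(insn_addr - func_base) + 1;
}

/* Decode the little-endian 32-bit immediate that follows the 0xb8 opcode.
 * insn must point at the opcode byte itself. */
static uint32_t read_imm32(const uint8_t *insn)
{
    return  (uint32_t)insn[1]
         | ((uint32_t)insn[2] << 8)
         | ((uint32_t)insn[3] << 16)
         | ((uint32_t)insn[4] << 24);
}
```

With main starting at 0x4004b4, imm_offset(0x4004b4, 0x4004b8) is 5, and 0x4004b4 + 5 is the 0x4004b9 used in the set command. Applying read_imm32 to the bytes b8 44 43 42 41 gives 0x41424344, matching the foo example above.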

function call seems not working properly in disassembled code [duplicate]

This question already has an answer here:
What are these seemingly-useless callq instructions in my x86 object files for?
(1 answer)
Closed 1 year ago.
I wrote a simple program and then compiled and assembled it.
tfc.cpp
int i = 0;
void f(int a)
{
i += a;
};
int main()
{
f(9);
return 0;
};
I got the tfc.o by running
$ g++ -c -O1 tfc.cpp
Then I use gobjdump (objdump) to disassemble the binary file.
$ gobjdump -d tfc.o
Then I got
0000000000000000 <__Z1fi>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 01 3d 00 00 00 00 add %edi,0x0(%rip) # a <__Z1fi+0xa>
a: 5d pop %rbp
b: c3 retq
c: 0f 1f 40 00 nopl 0x0(%rax)
0000000000000010 <_main>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: bf 09 00 00 00 mov $0x9,%edi
19: e8 00 00 00 00 callq 1e <_main+0xe>
1e: 31 c0 xor %eax,%eax
20: 5d pop %rbp
21: c3 retq
The weird thing is that the callq instruction is followed by 1e <_main+0xe>. Shouldn't it be the address of <__Z1fi>? If not, how does the main function call the f function?
EDIT
FYI:
$ g++ -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.1.0
Thread model: posix

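The call does not really target 1e, and f is not located at 1e. You are looking at an unlinked object file.
e8 is the opcode of the near relative call instruction in x86 according to this:
http://www.cs.cmu.edu/~fp/courses/15213-s07/misc/asm64-handout.pdf
call takes a 32-bit displacement relative to the next instruction, which is at location 1e. In tfc.o those four displacement bytes are still 00 00 00 00, a placeholder, so the disassembler computes 1e + 0 and prints callq 1e. The object file also carries a relocation entry for those bytes, and the linker uses it to fill in the real displacement to f (at address 0 in this listing). Running objdump with -r in addition to -d should show that relocation.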
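The target computation a disassembler performs for an e8 call can be sketched as follows. call_target is a hypothetical helper written for this explanation, not an objdump API; it assumes the usual two's-complement conversion when reinterpreting the displacement as signed.

```c
#include <stdint.h>

/* Target of a 5-byte "e8 rel32" near call located at insn_addr.
 * The displacement is a signed little-endian 32-bit value that is added to
 * the address of the NEXT instruction (insn_addr + 5). */
static uint64_t call_target(uint64_t insn_addr, const uint8_t insn[5])
{
    uint32_t raw =  (uint32_t)insn[1]
                 | ((uint32_t)insn[2] << 8)
                 | ((uint32_t)insn[3] << 16)
                 | ((uint32_t)insn[4] << 24);
    int32_t disp = (int32_t)raw;  /* assumes two's-complement conversion */
    return insn_addr + 5 + (uint64_t)(int64_t)disp;
}
```

For the instruction at 0x19 with bytes e8 00 00 00 00, this gives 0x1e, which is exactly what the listing prints. As a hypothetical illustration, if the displacement bytes were e2 ff ff ff (that is, -0x1e), the same call at 0x19 would target address 0, where f sits in this object file.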

mingw-w64: slow sprintf in <cstdio>

Does the <cstdio> header in C++ contain just the same functions as <stdio.h>, but placed in the std namespace?
I ran into strange performance problems in my program compiled with mingw-w64: it is more than ten times slower than on Linux. After some testing I found that the problem is in sprintf.
Then I did the following test:
#include <stdio.h>
// #include <cstdio>
// using std::sprintf;
int main () {
int i;
for (i = 0; i < 500000; i++){
char x[100];
sprintf(x, "x%dx%dx", i, i<<2);
}
}
When compiled with <stdio.h> it is about 13 times faster than with <cstdio>. Here are the timings:
$ time ./stdio
real 0m0.557s
user 0m0.046s
sys 0m0.046s
$ time ./cstdio
real 0m7.465s
user 0m0.031s
sys 0m0.077s
$ g++ --version
g++.exe (rubenvb-4.8-stdthread) 4.8.1 20130324 (prerelease)
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
UPDATE 1:
I timed further with different mingw-w64 builds (rubenvb, drangon, and mingw-build) and found that all 32-bit versions using <cstdio> took 4.x seconds and the 64-bit versions 7.x to 8.x seconds, while all versions using <stdio.h> took around 0.4 to 0.6 seconds.
UPDATE 2:
I disassembled the main function in gdb and found that only one line differs: the <stdio.h> version calls callq 0x4077c0 <sprintf> but the <cstdio> version calls callq 0x407990 <_Z7sprintfPcPKcz>.
sprintf contains:
0x00000000004077c0 <+0>: jmpq *0x7c6e(%rip) # 0x40f434 <__imp_sprintf>
0x00000000004077c6 <+6>: nop
0x00000000004077c7 <+7>: nop
Following __imp_sprintf I reached the sprintf inside msvcrt.dll.
_Z7sprintfPcPKcz contains some mingw code:
0x0000000000407990 <+0>: push %rbp
0x0000000000407991 <+1>: push %rbx
0x0000000000407992 <+2>: sub $0x38,%rsp
0x0000000000407996 <+6>: lea 0x80(%rsp),%rbp
0x000000000040799e <+14>: mov %rcx,-0x30(%rbp)
0x00000000004079a2 <+18>: mov %r8,-0x20(%rbp)
0x00000000004079a6 <+22>: mov %r9,-0x18(%rbp)
0x00000000004079aa <+26>: mov %rdx,-0x28(%rbp)
0x00000000004079ae <+30>: lea -0x20(%rbp),%rax
0x00000000004079b2 <+34>: mov %rax,-0x58(%rbp)
0x00000000004079b6 <+38>: mov -0x58(%rbp),%rdx
0x00000000004079ba <+42>: mov -0x28(%rbp),%rax
0x00000000004079be <+46>: mov %rdx,%r8
0x00000000004079c1 <+49>: mov %rax,%rdx
0x00000000004079c4 <+52>: mov -0x30(%rbp),%rcx
0x00000000004079c8 <+56>: callq 0x402c40 <__mingw_vsprintf>
0x00000000004079cd <+61>: mov %eax,%ebx
0x00000000004079cf <+63>: mov %ebx,%eax
0x00000000004079d1 <+65>: add $0x38,%rsp
0x00000000004079d5 <+69>: pop %rbx
0x00000000004079d6 <+70>: pop %rbp
Why does cstdio use a different (and much slower) function?

libstdc++ does define __USE_MINGW_ANSI_STDIO during build (config/os/mingw32-w64/os_defines.h), which will turn on the mingw sprintf wrapper. As @Michael Burr pointed out, these wrappers exist for C99/GNU99 compatibility.
Your test does not define __USE_MINGW_ANSI_STDIO, hence you'll not get the wrapper with stdio.h. But since it was defined when building libstdc++, you'll get it with cstdio.
If, however, you define it yourself before including stdio.h, you will get the wrapper again.
So you do get in fact different implementations, and cstdio std::sprintf is not necessarily the same as stdio.h sprintf, at least not when it comes to mingw.
Here is a test. First the source:
#ifdef USE_STDIO
#include <stdio.h>
#else
#include <cstdio>
using std::sprintf;
#endif
int main () {
int i;
for (i = 0; i < 500000; i++){
char x[100];
sprintf(x, "x%dx%dx", i, i<<2);
}
}
Results:
$ g++ -o test_cstdio.exe test.cc
$ g++ -o test_stdio.exe -DUSE_STDIO test.cc
$ g++ -o test_stdio_wrap.exe -DUSE_STDIO -D__USE_MINGW_ANSI_STDIO test.cc
$ for x in test_*.exe; do ( echo $x; objdump -d $x | grep sprintf; echo ); done
test_cstdio.exe
40154a: e8 41 64 00 00 callq 407990 <_Z7sprintfPcPKcz>
0000000000402c40 <__mingw_vsprintf>:
0000000000407990 <_Z7sprintfPcPKcz>:
4079c8: e8 73 b2 ff ff callq 402c40 <__mingw_vsprintf>
test_stdio.exe
40154a: e8 71 62 00 00 callq 4077c0 <sprintf>
00000000004077c0 <sprintf>:
4077c0: ff 25 6e 6c 00 00 jmpq *0x6c6e(%rip) # 40e434 <__imp_sprintf>
test_stdio_wrap.exe
40154a: e8 41 64 00 00 callq 407990 <_Z7sprintfPcPKcz>
0000000000402c40 <__mingw_vsprintf>:
0000000000407990 <_Z7sprintfPcPKcz>:
4079c8: e8 73 b2 ff ff callq 402c40 <__mingw_vsprintf>

SIGSEGV When accessing array element using assembly

Background:
I am new to assembly. When I was learning programming, I made a program that implements multiplication tables up to 1000 * 1000. The tables are formatted so that each answer is on the line factor1 << 10 | factor2 (I know, I know, it isn't pretty). These tables are then loaded into an array: int* tables. Empty lines are filled with 0. Here is a link to the file for the tables (7.3 MB). I know using assembly won't speed this up by much, but I just wanted to do it for fun (and a bit of practice).
Question:
I'm trying to convert this code into inline assembly (tables is a global):
int answer;
// ...
answer = tables [factor1 << 10 | factor2];
This is what I came up with:
asm volatile ( "shll $10, %1;"
"orl %1, %2;"
"movl _tables(,%2,4), %0;" : "=r" (answer) : "r" (factor1), "r" (factor2) );
My C++ code works fine, but my assembly fails. What is wrong with my assembly (especially the movl _tables(,%2,4), %0; part), compared to my C++?
What I have done to solve it:
I used some random numbers: 89 and 786 as factor1 and factor2. I know that there is an element at 89 << 10 | 786 (which is 91922); I verified this with C++. When I run it with gdb, I get a SIGSEGV:
Program received signal SIGSEGV, Segmentation fault.
at this line:
"movl _tables(,%2,4), %0;" : "=r" (answer) : "r" (factor1), "r" (factor2) );
I added two methods around my asm, which is how I know where the asm block is in the disassembly.
Disassembly of my asm block:
The disassembly from objdump -M att -d looks fine (although I'm not sure, I'm new to assembly, as I said):
402096: 8b 45 08 mov 0x8(%ebp),%eax
402099: 8b 55 0c mov 0xc(%ebp),%edx
40209c: c1 e0 0a shl $0xa,%eax
40209f: 09 c2 or %eax,%edx
4020a1: 8b 04 95 18 e0 47 00 mov 0x47e018(,%edx,4),%eax
4020a8: 89 45 ec mov %eax,-0x14(%ebp)
The disassembly from objdump -M intel -d:
402096: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
402099: 8b 55 0c mov edx,DWORD PTR [ebp+0xc]
40209c: c1 e0 0a shl eax,0xa
40209f: 09 c2 or edx,eax
4020a1: 8b 04 95 18 e0 47 00 mov eax,DWORD PTR [edx*4+0x47e018]
4020a8: 89 45 ec mov DWORD PTR [ebp-0x14],eax
From what I understand, it's moving the first parameter of my void calc ( int factor1, int factor2 ) function into eax. Then it's moving the second parameter into edx. Then it shifts eax to the left by 10 and ors it with edx. A 32-bit integer is 4 bytes, so [edx*4+base_address]. Move result to eax and then put eax into int answer (which, I'm guessing is on -0x14 of the stack). I don't really see much of a problem.
Disassembly of the compiler's .exe:
When I replace the asm block with plain C++ (answer = tables [factor1 << 10 | factor2];) and disassemble it this is what I get in Intel syntax:
402096: a1 18 e0 47 00 mov eax,ds:0x47e018
40209b: 8b 55 08 mov edx,DWORD PTR [ebp+0x8]
40209e: c1 e2 0a shl edx,0xa
4020a1: 0b 55 0c or edx,DWORD PTR [ebp+0xc]
4020a4: c1 e2 02 shl edx,0x2
4020a7: 01 d0 add eax,edx
4020a9: 8b 00 mov eax,DWORD PTR [eax]
4020ab: 89 45 ec mov DWORD PTR [ebp-0x14],eax
AT&T syntax:
402096: a1 18 e0 47 00 mov 0x47e018,%eax
40209b: 8b 55 08 mov 0x8(%ebp),%edx
40209e: c1 e2 0a shl $0xa,%edx
4020a1: 0b 55 0c or 0xc(%ebp),%edx
4020a4: c1 e2 02 shl $0x2,%edx
4020a7: 01 d0 add %edx,%eax
4020a9: 8b 00 mov (%eax),%eax
4020ab: 89 45 ec mov %eax,-0x14(%ebp)
I am not really familiar with the Intel syntax, so I am just going to try and understand the AT&T syntax:
It first moves the base address of the tables array into %eax. Then, it moves the first parameter into %edx. It shifts %edx to the left by 10, then ors it with the second parameter. Then, by shifting %edx to the left by two, it actually multiplies %edx by 4. Then, it adds that to %eax (the base address of the array). So, basically it just did this: [edx*4+0x47e018] (Intel syntax) or 0x47e018(,%edx,4) (AT&T syntax). It moves the value of the element it got into %eax and puts it into int answer. This method is more "expanded", but it does the same thing as my hand-written assembly! So why is mine giving a SIGSEGV while the compiler's works fine?

I bet (from the disassembly) that tables is a pointer to an array, not the array itself.
So you need:
asm volatile ( "shll $10, %1;"
"movl _tables, %%eax;"
"orl %1, %2;"
"movl (%%eax,%2,4), %0;"
: "=r" (answer) : "r" (factor1), "r" (factor2) : "eax" );
(Don't forget the extra clobber in the last line).
There are of course variations; this one may be more efficient if the code is in a loop:
asm volatile ( "shll $10, %1;"
"orl %1, %2;"
"movl (%3,%2,4), %0;"
: "=r" (answer) : "r" (factor1), "r" (factor2), "r" (tables) );

This is intended to be an addition to Mats Petersson's answer - I wrote it simply because it wasn't immediately clear to me why OP's analysis of the disassembly (that his assembly and the compiler-generated one were equivalent) was incorrect.
As Mats Petersson explains, the problem is that tables is actually a pointer to an array, so to access an element, you have to dereference twice. Now to me, it wasn't immediately clear where this happens in the compiler-generated code. The culprit is this innocent-looking line:
a1 18 e0 47 00 mov 0x47e018,%eax
To the untrained eye (that includes mine), this might look like the value 0x47e018 is moved to eax, but it's actually not. The Intel-syntax representation of the same opcodes gives us a clue:
a1 18 e0 47 00 mov eax,ds:0x47e018
Ah - ds: - so it's not actually a value, but an address!
For anyone who is wondering now, the following would be the opcodes and ATT syntax assembly for moving the value 0x47e018 to eax:
b8 18 e0 47 00 mov $0x47e018,%eax
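To make the double dereference concrete, here is a minimal C sketch. It assumes, as both answers infer from the disassembly, that tables is declared as int *tables rather than as an array; the helper names are hypothetical. The asker's movl _tables(,%2,4), %0 treats the address of the pointer variable itself as the array base, so it reads memory near the variable instead of inside the table.

```c
#include <stddef.h>

/* Index used in the question: factor1 in the high bits, factor2 in the
 * low 10 bits. */
static size_t table_index(unsigned factor1, unsigned factor2)
{
    return ((size_t)factor1 << 10) | factor2;
}

/* What the compiler-generated code does when tables is an int* variable.
 * tables_var is the address of that variable (0x47e018 in the listings). */
static int load_through_pointer(int *const *tables_var, size_t index)
{
    int *base = *tables_var; /* load 1: like mov 0x47e018,%eax (read the pointer) */
    return base[index];      /* load 2: read the element at base + index*4 */
}
```

With int *tables pointing at valid storage, load_through_pointer(&tables, i) returns tables[i], and table_index(89, 786) evaluates to 91922, the index quoted in the question.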
