This question already has answers here:
How to invoke a system call via syscall or sysenter in inline assembly?
(2 answers)
Unexpected GCC inline ASM behaviour (clobbered variable overwritten)
(1 answer)
When to use earlyclobber constraint in extended GCC inline assembly?
(2 answers)
inline assembly constraint for value that might be overwritten
(1 answer)
Closed 1 year ago.
How does one specify multiple outputs with an inline asm statement using gcc? I don't follow how the garbage value for ret gets printed, but I suspect it's related to the syscall and the mov at the top of the inline assembly both writing to an output register.
Source:
#include <string.h>
#include <iostream>

int main() {
    const char* str = "Hello World\n";
    long len = strlen(str);
    long ret = 0;
    long test = 0;
    __asm__ __volatile__ (
        "mov $22, %0\n\t"
        "movq $1, %%rax \n\t"
        "movq $1, %%rdi \n\t"
        "movq %2, %%rsi \n\t"
        "movl %3, %%edx \n\t"
        "syscall"
        : "=r"(test), "=g"(ret)
        : "g"(str), "g"(len));
    std::cout << ret << "\n";
    return 0;
}
Output:
Hello World
4202512
Disassembly
Dump of assembler code for function main():
0x0000000000401080 <+0>: sub $0x8,%rsp
0x0000000000401084 <+4>: mov $0x16,%rax
0x000000000040108b <+11>: mov $0x1,%rax
0x0000000000401092 <+18>: mov $0x1,%rdi
0x0000000000401099 <+25>: mov $0x402010,%rsi
0x00000000004010a0 <+32>: mov $0xc,%edx
0x00000000004010a5 <+37>: syscall
0x00000000004010a7 <+39>: mov $0x404080,%edi
0x00000000004010ac <+44>: callq 0x401040 <_ZNSo9_M_insertIlEERSoT_@plt>
0x00000000004010b1 <+49>: mov $0x1,%edx
0x00000000004010b6 <+54>: mov $0x40201b,%esi
0x00000000004010bb <+59>: mov %rax,%rdi
0x00000000004010be <+62>: callq 0x401050 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
0x00000000004010c3 <+67>: xor %eax,%eax
0x00000000004010c5 <+69>: add $0x8,%rsp
0x00000000004010c9 <+73>: retq
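(Aside: the linked duplicates cover this properly; here is a minimal corrected sketch, assuming x86-64 Linux. The asm above never writes operand %1, which is why ret prints garbage, and it overwrites rax/rdi/rsi/rdx without telling the compiler. Letting constraints place the arguments and declaring the kernel's clobbers fixes both.)

#include <string.h>
#include <iostream>

int main() {
    const char* str = "Hello World\n";
    long len = strlen(str);
    long ret;
    long test;
    __asm__ __volatile__ (
        "mov $22, %1\n\t"          // second output, now actually written
        "syscall"                  // rax=1 is __NR_write on x86-64 Linux
        : "=a"(ret), "=&r"(test)   // ret comes back in rax; earlyclobber test
        : "0"(1L), "D"(1L), "S"(str), "d"(len)  // rax=1, rdi=fd, rsi=buf, rdx=len
        : "rcx", "r11", "memory"); // syscall itself clobbers rcx and r11
    std::cout << ret << "\n";      // prints 12, the byte count from write()
    return 0;
}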
I believe this worked perfectly previously, but maybe I just forgot the correct syntax.
(gdb) disas main
Dump of assembler code for function main:
0x0000000000001125 <+0>: push rbp
0x0000000000001126 <+1>: mov rbp,rsp
0x0000000000001129 <+4>: mov DWORD PTR [rbp-0x4],edi
0x000000000000112c <+7>: mov QWORD PTR [rbp-0x10],rsi
0x0000000000001130 <+11>: mov eax,0x0
0x0000000000001135 <+16>: pop rbp
0x0000000000001136 <+17>: ret
Now I want to disassemble at 0x0000000000001127, which is 1 byte into the first mov instruction:
(gdb) disas 0x0000000000001127
Dump of assembler code for function main:
0x0000000000001125 <+0>: push rbp
0x0000000000001126 <+1>: mov rbp,rsp
0x0000000000001129 <+4>: mov DWORD PTR [rbp-0x4],edi
0x000000000000112c <+7>: mov QWORD PTR [rbp-0x10],rsi
0x0000000000001130 <+11>: mov eax,0x0
0x0000000000001135 <+16>: pop rbp
0x0000000000001136 <+17>: ret
It still starts the disassembly at the top of main.
I've also tried things such as main+1, disas /r, etc. Did gdb's behavior change somehow? I thought perhaps it was related to this being a PIE binary, but when I recompile it with -no-pie I still have the same problem for something this simple.
What is the correct syntax?
It still starts the disassembly at the top of main.
When you give disas a single argument, it finds the enclosing function, and disassembles that entire function. This has been the behavior since forever.
If you give disas two arguments instead, then it will disassemble just the given range:
(gdb) disas &main
Dump of assembler code for function main:
0x00000000000005fa <+0>: push %rbp
0x00000000000005fb <+1>: mov %rsp,%rbp
0x00000000000005fe <+4>: mov $0x0,%eax
0x0000000000000603 <+9>: pop %rbp
0x0000000000000604 <+10>: retq
End of assembler dump.
(gdb) disas &main+3,&main+11
Dump of assembler code from 0x5fd to 0x605:
0x00000000000005fd <main+3>: in $0xb8,%eax
0x00000000000005ff <main+5>: add %al,(%rax)
0x0000000000000601 <main+7>: add %al,(%rax)
0x0000000000000603 <main+9>: pop %rbp
0x0000000000000604 <main+10>: retq
End of assembler dump.
You could also use x/i:
(gdb) x/4i &main+3
0x5fd <main+3>: in $0xb8,%eax
0x5ff <main+5>: add %al,(%rax)
0x601 <main+7>: add %al,(%rax)
0x603 <main+9>: pop %rbp
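gdb also accepts a start,+length form for the range, which saves computing the end address by hand and produces the same dump as the two-address form above:
(gdb) disas &main+3,+8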
I've set the disassembly flavor of gdb to Intel (both as root and as a normal user), but it is still showing the assembly code in AT&T notation:
patrick@localhost:~/Dokumente/Projekte$ gdb -q ./a.out
Reading symbols from ./a.out...done.
(gdb) break main
Breakpoint 1 at 0x40050e: file firstprog.c, line 5.
(gdb) run
Starting program: /home/patrick/Dokumente/Projekte/a.out
Breakpoint 1, main () at firstprog.c:5
5 for(i=0; i < 10; i++)
(gdb) show disassembly
The disassembly flavor is "intel".
(gdb) info registers
rax 0x400506 4195590
rbx 0x0 0
rcx 0x0 0
rdx 0x7fffffffe2d8 140737488347864
rsi 0x7fffffffe2c8 140737488347848
rdi 0x1 1
rbp 0x7fffffffe1e0 0x7fffffffe1e0
(gdb) info register eip
Invalid register `eip'
I did restart the computer. My OS is Kali Linux amd64.
I have the following questions:
Why is gdb still showing the AT&T notation?
Why is the register EIP (instruction pointer) reported as an invalid register?
You are misunderstanding what disassembly flavour means. It means exactly that: what the disassembly looks like when you view machine code in a human-readable(ish) form. It does not change how gdb names registers in its own output, such as info registers.
To print registers (or use registers in any other context), you need to use $reg, such as $rip or $pc, $eax, etc. Note also that on x86-64 the instruction pointer is rip; there is no eip register, which is why info register eip fails.
If I disassemble one of my programs with at&t syntax, gdb shows this:
0x00000000007378f0 <+0>: push %rbp
0x00000000007378f1 <+1>: mov %rsp,%rbp
0x00000000007378f4 <+4>: sub $0x20,%rsp
0x00000000007378f8 <+8>: movl $0x0,-0x4(%rbp)
0x00000000007378ff <+15>: mov %edi,-0x8(%rbp)
0x0000000000737902 <+18>: mov %rsi,-0x10(%rbp)
=> 0x0000000000737906 <+22>: mov -0x10(%rbp),%rsi
0x000000000073790a <+26>: mov (%rsi),%rdi
0x000000000073790d <+29>: callq 0x737950 <FindLibPath(char const*)>
0x0000000000737912 <+34>: xor %eax,%eax
Then do this:
(gdb) set disassembly-flavor intel
(gdb) disass main
Dump of assembler code for function main(int, char**):
0x00000000007378f0 <+0>: push rbp
0x00000000007378f1 <+1>: mov rbp,rsp
0x00000000007378f4 <+4>: sub rsp,0x20
0x00000000007378f8 <+8>: mov DWORD PTR [rbp-0x4],0x0
0x00000000007378ff <+15>: mov DWORD PTR [rbp-0x8],edi
0x0000000000737902 <+18>: mov QWORD PTR [rbp-0x10],rsi
=> 0x0000000000737906 <+22>: mov rsi,QWORD PTR [rbp-0x10]
0x000000000073790a <+26>: mov rdi,QWORD PTR [rsi]
0x000000000073790d <+29>: call 0x737950 <FindLibPath(char const*)>
0x0000000000737912 <+34>: xor eax,eax
and you can see the difference. But the names of registers, and how you use registers on the gdb command line, don't change: you need $reg in both cases.
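For example, with the session above stopped at main+22, the instruction pointer is read the same way under either flavor (on x86-64 it is rip, not eip):
(gdb) p/x $rip
$1 = 0x737906
(gdb) info registers rip
rip            0x737906            0x737906 <main(int, char**)+22>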
(For testing purposes) I have written a simple method to compute the transpose of an n×n matrix:
#include <sys/types.h> // for the POSIX uint typedef

void transpose(const size_t _n, double* _A) {
    for (uint i = 0; i < _n; ++i) {
        for (uint j = i + 1; j < _n; ++j) {
            double tmp = _A[i*_n+j];
            _A[i*_n+j] = _A[j*_n+i];
            _A[j*_n+i] = tmp;
        }
    }
}
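(A minimal harness along these lines would produce the measurements below; the actual harness isn't shown, so the size range and output format here are assumptions.)

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

void transpose(std::size_t n, double* A); // the function above

int main() {
    for (std::size_t n = 1020; n <= 1030; ++n) {
        std::vector<double> A(n * n, 1.0); // touch the memory up front
        auto t0 = std::chrono::steady_clock::now();
        transpose(n, A.data());
        auto t1 = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%zu %lld\n", n, (long long)us);
    }
    return 0;
}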
When using optimization levels O3 or Ofast, I expected the compiler to unroll some loops, which would lead to higher performance, especially when the matrix size is a multiple of 2 (so that, e.g., an unrolled loop body handling two elements could run every iteration) or similar. Instead, what I measured was the exact opposite: powers of 2 show a significant spike in execution time.
These spikes recur at regular intervals of 64, are more pronounced at intervals of 128, and so on. Each spike extends to the neighboring matrix sizes, as in the following table:
size n    time (µs)
1020      2649
1021      2815
1022      3100
1023      5428
1024      15791
1025      6778
1026      3106
1027      2847
1028      2660
1029      3038
1030      2613
I compiled with gcc 4.8.2, but the same thing happens with clang 3.5, so this might be some generic effect?
So my question basically is: why is there this periodic increase in execution time? Is it some generic thing that comes with any of the optimization options (since it happens with clang and gcc alike)? If so, which optimization option is causing this?
And how can this be so significant that even the O0 version of the program outperforms the O3 version at multiples of 512?
EDIT: Note the magnitude of the spikes in this (logarithmic) plot. Transposing a 1024×1024 matrix with optimization actually takes as much time as transposing a 1300×1300 matrix without optimization. If this is a cache-fault / page-fault problem, then someone needs to explain why the memory layout is so significantly different for the optimized version of the program that it fails for powers of two, only to recover high performance for slightly larger matrices. Shouldn't cache faults create more of a step-like pattern? Why do the execution times go down again at all? (And why should optimization create cache faults that weren't there before?)
EDIT: the following should be the assembler codes that gcc produced
no optimization (O0):
_Z9transposemRPd:
.LFB0:
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp
.cfi_def_cfa_register 6
mov QWORD PTR [rbp-24], rdi
mov QWORD PTR [rbp-32], rsi
mov DWORD PTR [rbp-4], 0
jmp .L2
.L5:
mov eax, DWORD PTR [rbp-4]
add eax, 1
mov DWORD PTR [rbp-8], eax
jmp .L3
.L4:
mov rax, QWORD PTR [rbp-32]
mov rdx, QWORD PTR [rax]
mov eax, DWORD PTR [rbp-4]
imul rax, QWORD PTR [rbp-24]
mov rcx, rax
mov eax, DWORD PTR [rbp-8]
add rax, rcx
sal rax, 3
add rax, rdx
mov rax, QWORD PTR [rax]
mov QWORD PTR [rbp-16], rax
mov rax, QWORD PTR [rbp-32]
mov rdx, QWORD PTR [rax]
mov eax, DWORD PTR [rbp-4]
imul rax, QWORD PTR [rbp-24]
mov rcx, rax
mov eax, DWORD PTR [rbp-8]
add rax, rcx
sal rax, 3
add rdx, rax
mov rax, QWORD PTR [rbp-32]
mov rcx, QWORD PTR [rax]
mov eax, DWORD PTR [rbp-8]
imul rax, QWORD PTR [rbp-24]
mov rsi, rax
mov eax, DWORD PTR [rbp-4]
add rax, rsi
sal rax, 3
add rax, rcx
mov rax, QWORD PTR [rax]
mov QWORD PTR [rdx], rax
mov rax, QWORD PTR [rbp-32]
mov rdx, QWORD PTR [rax]
mov eax, DWORD PTR [rbp-8]
imul rax, QWORD PTR [rbp-24]
mov rcx, rax
mov eax, DWORD PTR [rbp-4]
add rax, rcx
sal rax, 3
add rdx, rax
mov rax, QWORD PTR [rbp-16]
mov QWORD PTR [rdx], rax
add DWORD PTR [rbp-8], 1
.L3:
mov eax, DWORD PTR [rbp-8]
cmp rax, QWORD PTR [rbp-24]
jb .L4
add DWORD PTR [rbp-4], 1
.L2:
mov eax, DWORD PTR [rbp-4]
cmp rax, QWORD PTR [rbp-24]
jb .L5
pop rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size _Z9transposemRPd, .-_Z9transposemRPd
.ident "GCC: (Debian 4.8.2-15) 4.8.2"
.section .note.GNU-stack,"",@progbits
with optimization (O3)
_Z9transposemRPd:
.LFB0:
.cfi_startproc
push rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
xor r11d, r11d
xor ebx, ebx
.L2:
cmp r11, rdi
mov r9, r11
jae .L10
.p2align 4,,10
.p2align 3
.L7:
add ebx, 1
mov r11d, ebx
cmp rdi, r11
mov rax, r11
jbe .L2
mov r10, r9
mov r8, QWORD PTR [rsi]
mov edx, ebx
imul r10, rdi
.p2align 4,,10
.p2align 3
.L6:
lea rcx, [rax+r10]
add edx, 1
imul rax, rdi
lea rcx, [r8+rcx*8]
movsd xmm0, QWORD PTR [rcx]
add rax, r9
lea rax, [r8+rax*8]
movsd xmm1, QWORD PTR [rax]
movsd QWORD PTR [rcx], xmm1
movsd QWORD PTR [rax], xmm0
mov eax, edx
cmp rdi, rax
ja .L6
cmp r11, rdi
mov r9, r11
jb .L7
.L10:
pop rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE0:
.size _Z9transposemRPd, .-_Z9transposemRPd
.ident "GCC: (Debian 4.8.2-15) 4.8.2"
.section .note.GNU-stack,"",@progbits
The periodic increase in execution time must be due to the cache being only N-way associative instead of fully associative. You are witnessing a hash collision in the cache-line selection algorithm.
The fastest L1 cache has a smaller number of cache lines than the next level, L2. In each level, each cache line can be filled only from a limited set of sources.
Typical HW implementations of cache-line selection just use a few bits of the memory address to determine which cache slot the data should be written to -- in HW, bit shifts are free.
This causes competition between memory ranges, e.g. between addresses 0x300010 and 0x341010: they differ by a multiple of 4096, so on a typical L1 (64-byte lines, 64 sets) they map to the same cache set.
In a fully sequential algorithm this doesn't matter -- N is large enough for practically all algorithms of the form:
for (i=0;i<1000;i++) a[i] += b[i] * c[i] + d[i];
But when the number of inputs (or outputs) gets larger, which happens internally when the algorithm is optimized, having one input in the cache forces another input out of the cache.
// one possible method of optimization with 2 outputs and 6 inputs
// with two unrelated execution paths -- should be faster, but maybe it isn't
for (i = 0; i < 500; i++) {
    a[i]       += b[i]       * c[i]       + d[i];
    a[i + 500] += b[i + 500] * c[i + 500] + d[i + 500];
}
A graph in Example 5: Cache Associativity illustrates how a 512-byte offset between matrix lines acts as a global worst-case dimension for that particular system. When this is known, a working mitigation is to over-allocate the matrix horizontally to some other dimension, e.g. char matrix[512][512 + 64].
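A sketch of that mitigation applied to the transpose in question (the pad of 8 doubles, one 64-byte cache line, is an assumed value, not a tuned one): give each row a stride that is not a large power of two, so vertically adjacent elements stop colliding in the same cache set.

#include <cstddef>
#include <utility>
#include <vector>

// Same swap pattern as the original transpose, but with an explicit row
// stride so the allocation can be padded past the power-of-two width.
void transpose_padded(std::size_t n, std::size_t stride, double* a) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            std::swap(a[i * stride + j], a[j * stride + i]);
}

int main() {
    const std::size_t n = 1024;
    const std::size_t stride = n + 8; // 8 doubles = one 64-byte cache line
    std::vector<double> m(n * stride, 0.0);
    transpose_padded(n, stride, m.data());
    return 0;
}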
The difference in performance is likely related to CPU/RAM caching.
When the data is not a power of 2, a cache-line load (of, say, 16, 32, or 64 words) transfers more than the data that is required, tying up the bus uselessly, as it turns out. For a data set which is a power of 2, all of the pre-fetched data is used.
I bet if you were to disable L1 and L2 caching, the performance would be completely smooth and predictable. But it would be much slower. Caching really helps performance!
Comment with code: In the -O3 case, with
#include <cstdlib>
#include <utility> // std::swap

extern void transpose(const size_t n, double* a)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            std::swap(a[i * n + j], a[j * n + i]); // or your expanded version.
        }
    }
}
compiling with
$ g++ --version
g++ (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1
...
$ g++ -g1 -std=c++11 -Wall -o test.S -S test.cpp -O3
I get
_Z9transposemPd:
.LFB68:
.cfi_startproc
.LBB2:
testq %rdi, %rdi
je .L1
leaq 8(,%rdi,8), %r10
xorl %r8d, %r8d
.LBB3:
addq $1, %r8
leaq -8(%r10), %rcx
cmpq %rdi, %r8
leaq (%rsi,%rcx), %r9
je .L1
.p2align 4,,10
.p2align 3
.L10:
movq %r9, %rdx
movq %r8, %rax
.p2align 4,,10
.p2align 3
.L5:
.LBB4:
movsd (%rdx), %xmm1
movsd (%rsi,%rax,8), %xmm0
movsd %xmm1, (%rsi,%rax,8)
.LBE4:
addq $1, %rax
.LBB5:
movsd %xmm0, (%rdx)
addq %rcx, %rdx
.LBE5:
cmpq %rdi, %rax
jne .L5
addq $1, %r8
addq %r10, %r9
addq %rcx, %rsi
cmpq %rdi, %r8
jne .L10
.L1:
rep ret
.LBE3:
.LBE2:
.cfi_endproc
And something quite different if I add -m32.
(Note: it makes no difference to the assembly whether I use std::swap or your variant)
In order to understand what is causing the spikes, though, you probably want to visualize the memory operations going on.
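One crude way to do that, assuming a 32 KiB, 8-way, 64-byte-line L1 (i.e. 64 sets): map each address of a column walk, the a[j*n+i] accesses of the inner loop, to its cache set and count how the accesses distribute.

#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = 1024;               // matrix width in doubles
    const std::size_t kLine = 64, kSets = 64; // assumed L1 geometry
    int hits[64] = {0};
    for (std::size_t j = 0; j < n; ++j) {     // walk one column (i = 0)
        std::size_t byte = j * n * sizeof(double); // byte offset of a[j*n]
        hits[(byte / kLine) % kSets]++;
    }
    for (std::size_t s = 0; s < kSets; ++s)
        std::printf("set %2zu: %d\n", s, hits[s]);
    // With n = 1024 every access above lands in set 0; with n = 1030 the
    // accesses spread across all sets, so far fewer lines are evicted.
    return 0;
}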
To add to the other answers: g++ -std=c++11 -march=core2 -O3 -c -S, with gcc version 4.8.2 (MacPorts gcc48 4.8.2_0), x86_64-apple-darwin13.0.0:
__Z9transposemPd:
LFB0:
testq %rdi, %rdi
je L1
leaq 8(,%rdi,8), %r10
xorl %r8d, %r8d
leaq -8(%r10), %rcx
addq $1, %r8
leaq (%rsi,%rcx), %r9
cmpq %rdi, %r8
je L1
.align 4,0x90
L10:
movq %r9, %rdx
movq %r8, %rax
.align 4,0x90
L5:
movsd (%rdx), %xmm0
movsd (%rsi,%rax,8), %xmm1
movsd %xmm0, (%rsi,%rax,8)
addq $1, %rax
movsd %xmm1, (%rdx)
addq %rcx, %rdx
cmpq %rdi, %rax
jne L5
addq $1, %r8
addq %r10, %r9
addq %rcx, %rsi
cmpq %rdi, %r8
jne L10
L1:
rep; ret
Basically the same as @ksfone's code, for:
#include <cstddef>

void transpose(const size_t _n, double* _A) {
    for (size_t i = 0; i < _n; ++i) {
        for (size_t j = i + 1; j < _n; ++j) {
            double tmp = _A[i*_n+j];
            _A[i*_n+j] = _A[j*_n+i];
            _A[j*_n+i] = tmp;
        }
    }
}
Apart from the Mach-O 'as' differences (extra underscore, align, and DWARF locations), it's the same. But it is very different from the OP's assembly output, with a much 'tighter' inner loop.
I am implementing a profiler. I want to use the Constructor/Destructor idiom to keep track of when I enter/exit a function.
A rough outline of my code is as follows:
class Profile
{
public: // class members are private by default
    Profile(void);  // Start timing
    ~Profile(void); // Stop timer and log
};

//...

Game::Game(void) : m_Quit(false)
{
    Profile p();
    InitializeModules();
    //...
}
However, when I run it, the Constructor and destructor are not being called. Even when I disassemble, there are no references to Profile::Profile(). I understood that the standard specifies that an instance with a non-trivial constructor cannot be optimized out by the compiler.
There are no optimization flags on the command line of either the compiler or the linker.
I also tried specifying __attribute__((used)), but to no avail.
Here is the disassembly:
(gdb) disassemble Ztk::Game::Game
Dump of assembler code for function Ztk::Game::Game():
0x00000000004cd798 <+0>: push %rbp
0x00000000004cd799 <+1>: mov %rsp,%rbp
0x00000000004cd79c <+4>: push %r12
0x00000000004cd79e <+6>: push %rbx
0x00000000004cd79f <+7>: sub $0x30,%rsp
0x00000000004cd7a3 <+11>: mov %rdi,-0x38(%rbp)
0x00000000004cd7a7 <+15>: mov -0x38(%rbp),%rax
0x00000000004cd7ab <+19>: mov %rax,%rdi
0x00000000004cd7ae <+22>: callq 0x4cdc6a <Ztk::Highlander<Ztk::Game, int>::Highlander()>
/** CALL SHOULD BE HERE **/
0x00000000004cd7b3 <+27>: mov -0x38(%rbp),%rax
0x00000000004cd7b7 <+31>: movb $0x0,(%rax)
0x00000000004cd7ba <+34>: callq 0x4e59f0 <Ztk::InitializeModules()>
Indeed there is code generated and linked into the executable
(gdb) disassemble Ztk::Profile::Profile(void)
Dump of assembler code for function Ztk::Profile::Profile():
0x0000000000536668 <+0>: push %rbp
0x0000000000536669 <+1>: mov %rsp,%rbp
0x000000000053666c <+4>: sub $0x20,%rsp
0x0000000000536670 <+8>: mov %rdi,-0x18(%rbp)
0x0000000000536674 <+12>: mov 0x8(%rbp),%rax
0x0000000000536678 <+16>: mov %rax,-0x8(%rbp)
0x000000000053667c <+20>: mov -0x8(%rbp),%rax
0x0000000000536680 <+24>: mov %rax,%rsi
0x0000000000536683 <+27>: mov $0x802440,%edi
0x0000000000536688 <+32>: callq 0x5363ca <Ztk::Profiler::FindNode(void*)>
Profile p();
What you've done here is declare a function called p that takes no arguments and returns an object of type Profile (the "most vexing parse"). What you want is this:
Profile p;
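Since C++11 you can also use brace initialization, which can never be parsed as a function declaration:
Profile p{}; // unambiguously an object definition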