When I look at the symbols in my library with nm mylib.a, I see pairs of entries that look like this:
000000000002d130 S __ZN7quadmat11SpAddLeavesC1EPNS_14BlockContainerEPy
00000000000628a8 S __ZN7quadmat11SpAddLeavesC1EPNS_14BlockContainerEPy.eh
When piped through c++filt:
000000000002d130 S quadmat::SpAddLeaves::SpAddLeaves(quadmat::BlockContainer*, unsigned long long*)
00000000000628a8 S quadmat::SpAddLeaves::SpAddLeaves(quadmat::BlockContainer*, unsigned long long*) (.eh)
What does that .eh mean, and what is this extra symbol used for?
I see it has something to do with exception handling. But why does that use an extra symbol?
(I'm noticing this with clang)
Here's some simple code:
bool extenrnal_variable;

int f(...)
{
    if (extenrnal_variable)
        throw 0;
    return 42;
}

int g()
{
    return f(1, 2, 3);
}
I added extenrnal_variable to prevent the compiler from optimizing all the branches away. f has ... to prevent inlining.
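An alternative that should work just as well, in case the variadic signature looks too exotic: mark f with the noinline attribute (my variant, assuming GCC or Clang; the walkthrough below sticks with the original variadic version):
bool extenrnal_variable;

__attribute__((noinline)) int f(int)   // noinline keeps f a real, out-of-line call target
{
    if (extenrnal_variable)
        throw 0;
    return 42;
}

int g()
{
    return f(1);
}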
When compiled with:
$ clang++ -S -O3 -m32 -o - eh.cpp | c++filt
it emits the following code for g() (the rest is omitted):
g(): ## #_Z1gv
.cfi_startproc
## BB#0:
pushl %ebp
Ltmp9:
.cfi_def_cfa_offset 8
Ltmp10:
.cfi_offset %ebp, -8
movl %esp, %ebp
Ltmp11:
.cfi_def_cfa_register %ebp
subl $24, %esp
movl $3, 8(%esp)
movl $2, 4(%esp)
movl $1, (%esp)
calll f(...)
movl $42, %eax
addl $24, %esp
popl %ebp
ret
.cfi_endproc
All these .cfi_* directives are there for stack unwinding in case an exception is thrown. They compile into an FDE (Frame Description Entry) block, which is saved under the g().eh name (__Z1gv.eh mangled). The directives specify where on the stack the CPU registers are saved. When an exception is thrown and the stack is being unwound, the function's code is not executed (except for the destructors of locals), but the registers that were saved earlier must be restored. These tables store exactly that information.
These tables can be dumped with the dwarfdump tool:
$ dwarfdump --eh-frame --english eh.o | c++filt
The output:
0x00000018: FDE
length: 0x00000018
CIE_pointer: 0x00000000
start_addr: 0x00000000 f(...)
range_size: 0x0000004d (end_addr = 0x0000004d)
Instructions: 0x00000000: CFA=esp+4 eip=[esp]
0x00000001: CFA=esp+8 ebp=[esp] eip=[esp+4]
0x00000003: CFA=ebp+8 ebp=[ebp] eip=[ebp+4]
0x00000007: CFA=ebp+8 ebp=[ebp] esi=[ebp-4] eip=[ebp+4]
0x00000034: FDE
length: 0x00000018
CIE_pointer: 0x00000000
start_addr: 0x00000050 g()
range_size: 0x0000002c (end_addr = 0x0000007c)
Instructions: 0x00000050: CFA=esp+4 eip=[esp]
0x00000051: CFA=esp+8 ebp=[esp] eip=[esp+4]
0x00000053: CFA=ebp+8 ebp=[ebp] eip=[ebp+4]
The format of these blocks is documented in the DWARF standard's description of call frame information and in the Linux Standard Base's description of the .eh_frame section; there are also more compact encodings of the same information (such as Apple's compact unwind info). Basically, each block describes which registers to restore during stack unwinding, and where on the stack to find them. For example, CFA=ebp+8 ebp=[ebp] eip=[ebp+4] means: the canonical frame address is ebp+8, the caller's ebp was saved at [ebp], and the return address (eip) is at [ebp+4].
To see the raw content of these symbols you can list all the symbols with their offsets:
$ nm -n eh.o
00000000 T __Z1fz
U __ZTIi
U ___cxa_allocate_exception
U ___cxa_throw
00000050 T __Z1gv
000000a8 s EH_frame0
000000c0 S __Z1fz.eh
000000dc S __Z1gv.eh
000000f8 S _extenrnal_variable
And then dump the (__TEXT,__eh_frame) section:
$ otool -s __TEXT __eh_frame eh.o
eh.o:
Contents of (__TEXT,__eh_frame) section
000000a8 14 00 00 00 00 00 00 00 01 7a 52 00 01 7c 08 01
000000b8 10 0c 05 04 88 01 00 00 18 00 00 00 1c 00 00 00
000000c8 38 ff ff ff 4d 00 00 00 00 41 0e 08 84 02 42 0d
000000d8 04 44 86 03 18 00 00 00 38 00 00 00 6c ff ff ff
000000e8 2c 00 00 00 00 41 0e 08 84 02 42 0d 04 00 00 00
By matching the offsets against the nm output you can see how each symbol is encoded.
When local variables are present, they have to be destroyed during stack unwinding. For that, more code is usually embedded in the functions themselves, and additional, larger tables are created. You can explore this yourself by adding a local variable with a non-trivial destructor to g, compiling, and looking at the assembly output, as sketched below.
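For example, a minimal sketch (Guard and its out-of-line destructor are names I made up for illustration; f is the throwing function from the earlier example):
struct Guard
{
    ~Guard();          // defined in another translation unit so it cannot be optimized away
};

int f(...);            // the throwing function from the example above

int g()
{
    Guard guard;       // must be destroyed during unwinding if f throws
    return f(1, 2, 3);
}
Compiling this with the same clang++ -S -O3 -m32 invocation should additionally show a landing pad in g and an exception table (the GCC_except_table / LSDA data) next to the .eh frame information.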
Further reading
Explanations of some of the .cfi_* directives
Exception handling in LLVM
It stands for exception handler and is usually associated with the info below:
If you are using an exports list and building either a shared library, or an executable that will be used with ld's -bundle_loader flag, you need to include the symbols for exception frame information in the exports list for your exported C++ symbols. Otherwise, they may be stripped. These symbols end with .eh; you can view them with the nm tool.
from XcodeUserGuide20
Related
I have two files, a.h and a.cpp:
// a.h
extern "C" void a();

// a.cpp
#include "a.h"
#include <stdio.h>

void a()
{
    printf("a\n");
}
I compiled this both with and without -fPIC, and then objdumped both.
Weirdly, I got the same output for both files. For a(), I get this in both cases:
callq 15 <a+0x15>
I also tried to compile object files with -no-pie, still no luck.
Compile your code (or anything) in verbose mode (-v), inspect the output,
and you will find:
Configured with: ... --enable-default-pie ...
which, since GCC 6, means the toolchain is built to compile PIC code and link
PIE executables by default.
To insist on a non-PIC compilation, run e.g.
g++ -Wall -c -fno-PIC -o anopic.o a.cpp
And to insist on a PIC compilation, run e.g.
g++ -Wall -c -fPIC -o apic.o a.cpp
Then run:
$ objdump -d anopic.o
anopic.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <a>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: bf 00 00 00 00 mov $0x0,%edi
9: e8 00 00 00 00 callq e <a+0xe>
e: 90 nop
f: 5d pop %rbp
10: c3 retq
and:
$ objdump -d apic.o
apic.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <a>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # b <a+0xb>
b: e8 00 00 00 00 callq 10 <a+0x10>
10: 90 nop
11: 5d pop %rbp
12: c3 retq
and you will see the difference.
You can interleave the relocations with the assembly by:
$ objdump --reloc -d anopic.o
anopic.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <a>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: bf 00 00 00 00 mov $0x0,%edi
5: R_X86_64_32 .rodata
9: e8 00 00 00 00 callq e <a+0xe>
a: R_X86_64_PC32 puts-0x4
e: 90 nop
f: 5d pop %rbp
10: c3 retq
and:
$ objdump --reloc -d apic.o
apic.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <a>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # b <a+0xb>
7: R_X86_64_PC32 .rodata-0x4
b: e8 00 00 00 00 callq 10 <a+0x10>
c: R_X86_64_PLT32 puts-0x4
10: 90 nop
11: 5d pop %rbp
12: c3 retq
By default, objdump does not perform relocation processing. Try objdump --reloc instead.
In your case, the compiler and assembler produce an R_X86_64_PLT32 relocation. This is a position-independent relocation. It seems that your compiler defaults to generating PIE binaries. -no-pie is a linker flag; you need -fno-pie to change the compiler output. (In this particular case it does not matter, because the final result will be identical after the link editor has run.)
I'm currently trying to improve the performance of a custom "pseudo" stack, which is used like this (full code is provided at the end of this post):
void test() {
theStack.stackFrames[1] = StackFrame{ "someFunction", 30 }; // A
theStack.stackTop.store(1, std::memory_order_seq_cst); // B
someFunction(); // C
theStack.stackTop.store(0, std::memory_order_seq_cst); // D
theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
theStack.stackTop.store(1, std::memory_order_seq_cst); // F
someOtherFunction(); // G
theStack.stackTop.store(0, std::memory_order_seq_cst); // H
}
A sampler thread periodically suspends the target thread and reads stackTop and the stackFrames array.
My biggest performance problem is the sequentially-consistent stores to stackTop, so I'm trying to find out whether I can change them to release-stores.
The central requirement is: When the sampler thread suspends the target thread and reads stackTop == 1, then the information in stackFrames[1] needs to be fully present and consistent. This means:
When B is observed, A must also be observed. ("Don't increment stackTop before putting the stack frame in place.")
When E is observed, D must also be observed. ("When putting the next frame's information in place, the previous stack frame must have been exited.")
My understanding is that using release-acquire memory ordering for stackTop guarantees the first requirement, but not the second. More specifically:
No writes that are before the stackTop release-store in program order can be reordered to occur after it.
However, no statement is made about writes that occur after the release-store to stackTop in program order. Thus, my understanding is that E can be observed before D is observed. Is this correct?
But if that's the case, then wouldn't the compiler be able to reorder my program like this:
void test() {
theStack.stackFrames[1] = StackFrame{ "someFunction", 30 }; // A
theStack.stackTop.store(1, std::memory_order_release); // B
someFunction(); // C
// switched D and E:
theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
theStack.stackTop.store(0, std::memory_order_release); // D
theStack.stackTop.store(1, std::memory_order_release); // F
someOtherFunction(); // G
theStack.stackTop.store(0, std::memory_order_release); // H
}
... and then combine D and F, optimizing away the zero store?
Because that's not what I'm seeing if I compile the above program using system clang on macOS:
$ clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
main.o: file format Mach-O 64-bit x86-64
Disassembly of section __TEXT,__text:
__Z4testv:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: 48 8d 05 5d 00 00 00 leaq 93(%rip), %rax
b: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
12: c7 05 14 00 00 00 1e 00 00 00 movl $30, 20(%rip)
1c: c7 05 1c 00 00 00 01 00 00 00 movl $1, 28(%rip)
26: e8 00 00 00 00 callq 0 <__Z4testv+0x2B>
2b: c7 05 1c 00 00 00 00 00 00 00 movl $0, 28(%rip)
35: 48 8d 05 39 00 00 00 leaq 57(%rip), %rax
3c: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
43: c7 05 14 00 00 00 23 00 00 00 movl $35, 20(%rip)
4d: c7 05 1c 00 00 00 01 00 00 00 movl $1, 28(%rip)
57: e8 00 00 00 00 callq 0 <__Z4testv+0x5C>
5c: c7 05 1c 00 00 00 00 00 00 00 movl $0, 28(%rip)
66: 5d popq %rbp
67: c3 retq
Specifically, the movl $0, 28(%rip) instruction at 2b is still present.
Coincidentally, this output is exactly what I need in my case. But I don't know if I can rely on it, because to my understanding it's not guaranteed by my chosen memory ordering.
So my main question is this: Does the acquire-release memory order give me another (fortunate) guarantee that I'm not aware of? Or is the compiler only doing what I need by accident / because it's not optimizing this particular case as well as it could?
Full code below:
// clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
#include <atomic>
#include <cstdint>
struct StackFrame
{
const char* functionName;
uint32_t lineNumber;
};
struct Stack
{
Stack()
: stackFrames{ StackFrame{ nullptr, 0 }, StackFrame{ nullptr, 0 } }
, stackTop{0}
{
}
StackFrame stackFrames[2];
std::atomic<uint32_t> stackTop;
};
Stack theStack;
void someFunction();
void someOtherFunction();
void test() {
theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };
theStack.stackTop.store(1, std::memory_order_release);
someFunction();
theStack.stackTop.store(0, std::memory_order_release);
theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 };
theStack.stackTop.store(1, std::memory_order_release);
someOtherFunction();
theStack.stackTop.store(0, std::memory_order_release);
}
/**
* // Sampler thread:
*
* #include <chrono>
* #include <iostream>
* #include <thread>
*
* void suspendTargetThread();
* void unsuspendTargetThread();
*
* void samplerThread() {
* for (;;) {
* // Suspend the target thread. This uses a platform-specific
* // mechanism:
* // - SuspendThread on Windows
* // - thread_suspend on macOS
* // - send a signal + grab a lock in the signal handler on Linux
* suspendTargetThread();
*
* // Now that the thread is paused, read the leaf stack frame.
* uint32_t stackTop =
* theStack.stackTop.load(std::memory_order_acquire);
* StackFrame& f = theStack.stackFrames[stackTop];
* std::cout << f.functionName << " at line "
* << f.lineNumber << std::endl;
*
* unsuspendTargetThread();
*
* std::this_thread::sleep_for(std::chrono::milliseconds(1));
* }
* }
*/
And, to satisfy curiosity, this is the assembly if I use sequentially-consistent stores:
$ clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
main.o: file format Mach-O 64-bit x86-64
Disassembly of section __TEXT,__text:
__Z4testv:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: 41 56 pushq %r14
6: 53 pushq %rbx
7: 48 8d 05 60 00 00 00 leaq 96(%rip), %rax
e: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
15: c7 05 14 00 00 00 1e 00 00 00 movl $30, 20(%rip)
1f: 41 be 01 00 00 00 movl $1, %r14d
25: b8 01 00 00 00 movl $1, %eax
2a: 87 05 20 00 00 00 xchgl %eax, 32(%rip)
30: e8 00 00 00 00 callq 0 <__Z4testv+0x35>
35: 31 db xorl %ebx, %ebx
37: 31 c0 xorl %eax, %eax
39: 87 05 20 00 00 00 xchgl %eax, 32(%rip)
3f: 48 8d 05 35 00 00 00 leaq 53(%rip), %rax
46: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
4d: c7 05 14 00 00 00 23 00 00 00 movl $35, 20(%rip)
57: 44 87 35 20 00 00 00 xchgl %r14d, 32(%rip)
5e: e8 00 00 00 00 callq 0 <__Z4testv+0x63>
63: 87 1d 20 00 00 00 xchgl %ebx, 32(%rip)
69: 5b popq %rbx
6a: 41 5e popq %r14
6c: 5d popq %rbp
6d: c3 retq
Instruments identified the xchgl instructions as the most expensive part.
You could write it like this:
void test() {
theStack.stackFrames[1] = StackFrame{ "someFunction", 30 }; // A
theStack.stackTop.store(1, std::memory_order_release); // B
someFunction(); // C
theStack.stackTop.exchange(0, std::memory_order_acq_rel); // D
theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
theStack.stackTop.store(1, std::memory_order_release); // F
someOtherFunction(); // G
theStack.stackTop.exchange(0, std::memory_order_acq_rel); // H
}
This should provide the second guarantee you are looking for, namely that E may not be observed before D. Otherwise I think the compiler will have the right to reorder the instructions as you suggested.
Since the sampler thread "acquires" stackTop, and since it suspends the target thread before reading (which should provide additional synchronization), it should always see valid data when stackTop is 1.
If your sampler did not suspend the target thread, or if suspension does not wait until the thread has actually stopped (check this), I think a mutex or equivalent would be necessary to keep the sampler from reading stale data after it has read stackTop as 1 (for example, if the target was preempted by the scheduler at the wrong moment).
If you can rely on the suspend to provide synchronization and just need to constrain reordering by the compiler, you should have a look at std::atomic_signal_fence.
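For illustration, this is how I would read that suggestion, reusing the declarations from your full code. It assumes the suspend/resume machinery really is a full synchronization point, so only compiler reordering needs to be constrained; treat it as a sketch rather than something I have verified on your platform:
#include <atomic>

void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    std::atomic_signal_fence(std::memory_order_seq_cst);             // compiler barrier: keep A before B
    theStack.stackTop.store(1, std::memory_order_relaxed);           // B
    someFunction();                                                  // C
    theStack.stackTop.store(0, std::memory_order_relaxed);           // D
    std::atomic_signal_fence(std::memory_order_seq_cst);             // compiler barrier: keep D before E
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    std::atomic_signal_fence(std::memory_order_seq_cst);             // compiler barrier: keep E before F
    theStack.stackTop.store(1, std::memory_order_relaxed);           // F
    someOtherFunction();                                             // G
    theStack.stackTop.store(0, std::memory_order_relaxed);           // H
}
On x86 the relaxed stores should compile to plain movl instructions, so the expensive xchgl disappears; whether relaxed is genuinely strong enough here comes back to the caveat above about how the suspension synchronizes.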
Is it possible for an attacker to get integer arrays from your compiled code?
Similar to how an attacker can get strings from your code using the strings command.
Yes. A quick example with main.c:
int main(void) {
    int vars[8] = {0,1,2,3,4,5,6,7};
}
Compile with gcc -O0 main.c -o main so optimization doesn't remove the unused array, then simply disassemble it:
0000000000400474 <main>:
400474: 55 push %rbp
400475: 48 89 e5 mov %rsp,%rbp
400478: c7 45 e0 00 00 00 00 movl $0x0,-0x20(%rbp)
40047f: c7 45 e4 01 00 00 00 movl $0x1,-0x1c(%rbp)
400486: c7 45 e8 02 00 00 00 movl $0x2,-0x18(%rbp)
40048d: c7 45 ec 03 00 00 00 movl $0x3,-0x14(%rbp)
400494: c7 45 f0 04 00 00 00 movl $0x4,-0x10(%rbp)
40049b: c7 45 f4 05 00 00 00 movl $0x5,-0xc(%rbp)
4004a2: c7 45 f8 06 00 00 00 movl $0x6,-0x8(%rbp)
4004a9: c7 45 fc 07 00 00 00 movl $0x7,-0x4(%rbp)
It makes logical sense: if your program uses data, that data must exist somewhere in the binary.
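A hypothetical variant of the same point: if the array is a static const global, compilers typically place it in .rodata, where its raw little-endian bytes can be read straight out of the binary with a hex dump, no disassembly required.
// table usually ends up in .rodata; lookup only exists so the array isn't discarded.
static const int table[8] = {0, 1, 2, 3, 4, 5, 6, 7};

int lookup(unsigned i)
{
    return table[i & 7];
}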
It is possible. Try opening a compiled executable with Notepad++ or Notepad and you will see chunks of the file rendered as text.
But why is this so easy with strings and not with integer arrays? Strings are usually made of alphanumeric characters, so you can spot them just by looking at the binary. An array of int is also there as raw data, but it is hard to recognize by eye because its bytes are not (necessarily) printable characters. With a disassembler, though, finding an integer array is just as easy as finding a string.
This question already has an answer here:
What are these seemingly-useless callq instructions in my x86 object files for?
I wrote a simple program and then compiled and assembled it.
tfc.cpp
int i = 0;

void f(int a)
{
    i += a;
};

int main()
{
    f(9);
    return 0;
};
I got the tfc.o by running
$ g++ -c -O1 tfc.cpp
Then I used gobjdump (objdump) to disassemble the object file.
$ gobjdump -d tfc.o
Then I got
0000000000000000 <__Z1fi>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 01 3d 00 00 00 00 add %edi,0x0(%rip) # a <__Z1fi+0xa>
a: 5d pop %rbp
b: c3 retq
c: 0f 1f 40 00 nopl 0x0(%rax)
0000000000000010 <_main>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: bf 09 00 00 00 mov $0x9,%edi
19: e8 00 00 00 00 callq 1e <_main+0xe>
1e: 31 c0 xor %eax,%eax
20: 5d pop %rbp
21: c3 retq
Something weird happened: the callq instruction is followed by 1e <_main+0xe>. Shouldn't that be the address of <__Z1fi>? If not, how does the main function call the f function?
EDIT
FYI:
$ g++ -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.1.0
Thread model: posix
It calls address 0, which is the address of the f function.
e8 is the call instruction in x86; see:
http://www.cs.cmu.edu/~fp/courses/15213-s07/misc/asm64-handout.pdf
call takes a displacement relative to the next instruction, which sits at memory location 1e. In the unlinked object file the displacement bytes are still zero, so objdump displays the target as 1e; when the linker applies the relocation, the displacement is fixed up so the call really lands on address 0. So it reads as callq 1e when in reality it's calling address 0.
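To make the arithmetic concrete, here is a tiny stand-alone sketch (my own helper, not part of any tool) of how the disassembler arrives at 1e:
#include <cstdint>
#include <cstdio>

int main()
{
    std::uint64_t call_addr = 0x19;          // offset of "e8 00 00 00 00" inside _main
    std::uint64_t next_addr = call_addr + 5; // opcode byte + 4 displacement bytes = 0x1e
    std::int32_t  rel32     = 0;             // still zero; the linker patches it via a relocation
    std::uint64_t target    = next_addr + rel32;
    std::printf("callq 0x%llx\n", static_cast<unsigned long long>(target)); // prints callq 0x1e
}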
I've compiled some Qt code with Google's NaCl compiler, but the ncval validator does not grok it. One example among many:
src/corelib/animation/qabstractanimation.cpp:165
Here's the relevant code:
#define Q_GLOBAL_STATIC(TYPE, NAME) \
static TYPE *NAME() \
{ \
static TYPE thisVariable; \
static QGlobalStatic<TYPE > thisGlobalStatic(&thisVariable); \
return thisGlobalStatic.pointer; \
}
#ifndef QT_NO_THREAD
Q_GLOBAL_STATIC(QThreadStorage<QUnifiedTimer *>, unifiedTimer)
#endif
which compiles to:
00000480 <_ZL12unifiedTimerv>:
480: 55 push %ebp
481: 89 e5 mov %esp,%ebp
483: 57 push %edi
484: 56 push %esi
485: 53 push %ebx
486: 83 ec 2c sub $0x2c,%esp
489: c7 04 24 28 00 2e 10 movl $0x102e0028,(%esp)
490: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
494: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi
49b: e8 fc ff ff ff call 49c <_ZL12unifiedTimerv+0x1c>
4a0: 84 c0 test %al,%al
4a2: 74 1c je 4c0 <_ZL12unifiedTimerv+0x40>
4a4: 0f b6 05 2c 00 2e 10 movzbl 0x102e002c,%eax
4ab: 83 f0 01 xor $0x1,%eax
4ae: 84 c0 test %al,%al
4b0: 74 0e je 4c0 <_ZL12unifiedTimerv+0x40>
4b2: b8 01 00 00 00 mov $0x1,%eax
4b7: eb 27 jmp 4e0 <_ZL12unifiedTimerv+0x60>
4b9: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi
4c0: b8 00 00 00 00 mov $0x0,%eax
4c5: eb 19 jmp 4e0 <_ZL12unifiedTimerv+0x60>
4c7: 90 nop
4c8: 90 nop
4c9: 90 nop
4ca: 90 nop
4cb: 90 nop
Check the call instruction at 49b: it is what the validator cannot grok. What on earth could induce the compiler to issue an instruction that calls into the middle of itself? Is there a way around this? I've compiled with -g -O0 -fno-inline. Compiler bug?
Presumably it's really a call to an external symbol, which will get filled in at link time. Actually what will get called is externalSymbol-4, which is a bit strange -- perhaps this is what is throwing the ncval validator off the scent.
Is this a dynamic library, or a static object that has not been linked into an executable yet?
If it is a dynamic library, this most likely happened because the code was built as position-dependent and then linked into a dynamic library. Try "objdump -d -r -R" on it; if you see TEXTREL, that is the case. TEXTREL is not supported by NaCl's dynamic linking, and the fix is to compile the code with -fPIC.
If it is a static object, try to validate it again after it has been linked into a static executable.