odd compiled code - c++

I've compiled some Qt code with Google's NaCl compiler, but the ncval validator does not grok it. One example among many:
src/corelib/animation/qabstractanimation.cpp:165
Here's the relevant code:
#define Q_GLOBAL_STATIC(TYPE, NAME) \
static TYPE *NAME() \
{ \
static TYPE thisVariable; \
static QGlobalStatic<TYPE > thisGlobalStatic(&thisVariable); \
return thisGlobalStatic.pointer; \
}
#ifndef QT_NO_THREAD
Q_GLOBAL_STATIC(QThreadStorage<QUnifiedTimer *>, unifiedTimer)
#endif
which compiles to:
00000480 <_ZL12unifiedTimerv>:
480: 55 push %ebp
481: 89 e5 mov %esp,%ebp
483: 57 push %edi
484: 56 push %esi
485: 53 push %ebx
486: 83 ec 2c sub $0x2c,%esp
489: c7 04 24 28 00 2e 10 movl $0x102e0028,(%esp)
490: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
494: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi
49b: e8 fc ff ff ff call 49c <_ZL12unifiedTimerv+0x1c>
4a0: 84 c0 test %al,%al
4a2: 74 1c je 4c0 <_ZL12unifiedTimerv+0x40>
4a4: 0f b6 05 2c 00 2e 10 movzbl 0x102e002c,%eax
4ab: 83 f0 01 xor $0x1,%eax
4ae: 84 c0 test %al,%al
4b0: 74 0e je 4c0 <_ZL12unifiedTimerv+0x40>
4b2: b8 01 00 00 00 mov $0x1,%eax
4b7: eb 27 jmp 4e0 <_ZL12unifiedTimerv+0x60>
4b9: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi
4c0: b8 00 00 00 00 mov $0x0,%eax
4c5: eb 19 jmp 4e0 <_ZL12unifiedTimerv+0x60>
4c7: 90 nop
4c8: 90 nop
4c9: 90 nop
4ca: 90 nop
4cb: 90 nop
Check the call instruction at 49b: it is what the validator cannot grok. What on earth could induce the compiler to issue an instruction that calls into the middle of itself? Is there a way around this? I've compiled with -g -O0 -fno-inline. Compiler bug?

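Presumably it's really a call to an external symbol, which will get filled in at link time. The fc ff ff ff placeholder is the value -4, which looks strange but is just the relocation addend: a call displacement is measured from the end of the instruction, 4 bytes past the start of the displacement field, so the linker needs that -4 to make the call land exactly on the symbol. Disassembled before linking, it looks like a call into the middle of the instruction itself, and perhaps this is what is throwing the ncval validator off the scent.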
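To make the arithmetic concrete, here is a minimal sketch of how a disassembler works out the target of a call rel32 instruction. The function name call_target is made up for this illustration; it simply adds the little-endian displacement to the address of the next instruction.

```cpp
#include <cassert>
#include <cstdint>

// Decode the target of an x86 `call rel32` (opcode e8).
// `insn_addr` is the address of the e8 byte; `disp_bytes` are the four
// little-endian displacement bytes that follow it.
// (Hypothetical helper, written only for this illustration.)
std::uint32_t call_target(std::uint32_t insn_addr, const unsigned char disp_bytes[4]) {
    std::uint32_t disp = 0;
    for (int i = 3; i >= 0; --i) {
        disp = (disp << 8) | disp_bytes[i];  // assemble little-endian value
    }
    std::uint32_t next_insn = insn_addr + 5;  // 1 opcode byte + 4 displacement bytes
    return next_insn + disp;                  // wraps modulo 2^32, like signed addition
}
```

With the bytes from the listing (fc ff ff ff at 49b) this gives 49c, one byte into the call itself, which is exactly what objdump prints for the unlinked object.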

Is this a dynamic library or a static object that is not linked to an executable yet?
In a dynamic library, this likely happened because the code was compiled as position-dependent and then linked into the library. Try "objdump -d -r -R" on it; if you see TEXTREL, that is the case. TEXTREL is not supported by NaCl dynamic linking; the fix is to compile the code with the -fPIC flag.
With a static object, try validating after it has been linked into a static executable.

Related

How are non-static, non-virtual methods implemented in C++?

I wanted to know how methods are implemented in C++ "under the hood".
So, I have made a simple C++ program which has a class with one non-static field and one non-static, non-virtual method.
Then I instantiated the class in the main function and called the method. I used objdump -d in order to see the CPU instructions of this program. I have an x86-64 processor.
Here's the code:
#include<stdio.h>
class TestClass {
public:
int x;
int xPlus2(){
return x + 2;
}
};
int main(){
TestClass tc1 = {5};
int variable = tc1.xPlus2();
printf("%d \n", variable);
return 0;
}
Here are instructions for the method xPlus2:
0000000000402c30 <_ZN9TestClass6xPlus2Ev>:
402c30: 55 push %rbp
402c31: 48 89 e5 mov %rsp,%rbp
402c34: 48 89 4d 10 mov %rcx,0x10(%rbp)
402c38: 48 8b 45 10 mov 0x10(%rbp),%rax
402c3c: 8b 00 mov (%rax),%eax
402c3e: 83 c0 02 add $0x2,%eax
402c41: 5d pop %rbp
402c42: c3 retq
402c43: 90 nop
402c44: 90 nop
402c45: 90 nop
402c46: 90 nop
402c47: 90 nop
402c48: 90 nop
402c49: 90 nop
402c4a: 90 nop
402c4b: 90 nop
402c4c: 90 nop
402c4d: 90 nop
402c4e: 90 nop
402c4f: 90 nop
If I understand it correctly, these instructions can be replaced by just 3 instructions, because I believe that I don't need to use the stack, I think the compiler used it redundantly:
mov (%rcx),%eax
add $2,%eax
retq
and then maybe I still need lots of nop instructions for synchronization purposes or whatnot. If you look at the CPU instructions, it looks like the value of the x field is stored at the memory location that the rcx register holds. You will see the rest of the CPU instructions in a moment. It is a little hard for me to track what has happened here (especially what is going on with the call to the __main function); I don't even know which parts of the assembly are important to look at. The compiler produces a main function (as I expected), but it also produced a __main function which is called from main, and there are some weird functions in between those two as well.
Here are other parts of the assembly that I think may be interesting:
0000000000401550 <main>:
401550: 55 push %rbp
401551: 48 89 e5 mov %rsp,%rbp
401554: 48 83 ec 30 sub $0x30,%rsp
401558: e8 e3 00 00 00 callq 401640 <__main>
40155d: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%rbp)
401564: 48 8d 45 f8 lea -0x8(%rbp),%rax
401568: 48 89 c1 mov %rax,%rcx
40156b: e8 c0 16 00 00 callq 402c30 <_ZN9TestClass6xPlus2Ev>
401570: 89 45 fc mov %eax,-0x4(%rbp)
401573: 8b 45 fc mov -0x4(%rbp),%eax
401576: 89 c2 mov %eax,%edx
401578: 48 8d 0d 81 2a 00 00 lea 0x2a81(%rip),%rcx # 404000 <.rdata>
40157f: e8 ec 14 00 00 callq 402a70 <printf>
401584: b8 00 00 00 00 mov $0x0,%eax
401589: 48 83 c4 30 add $0x30,%rsp
40158d: 5d pop %rbp
40158e: c3 retq
40158f: 90 nop
0000000000401590 <__do_global_dtors>:
401590: 48 83 ec 28 sub $0x28,%rsp
401594: 48 8b 05 75 1a 00 00 mov 0x1a75(%rip),%rax # 403010 <p.93846>
40159b: 48 8b 00 mov (%rax),%rax
40159e: 48 85 c0 test %rax,%rax
4015a1: 74 1d je 4015c0 <__do_global_dtors+0x30>
4015a3: ff d0 callq *%rax
4015a5: 48 8b 05 64 1a 00 00 mov 0x1a64(%rip),%rax # 403010 <p.93846>
4015ac: 48 8d 50 08 lea 0x8(%rax),%rdx
4015b0: 48 8b 40 08 mov 0x8(%rax),%rax
4015b4: 48 89 15 55 1a 00 00 mov %rdx,0x1a55(%rip) # 403010 <p.93846>
4015bb: 48 85 c0 test %rax,%rax
4015be: 75 e3 jne 4015a3 <__do_global_dtors+0x13>
4015c0: 48 83 c4 28 add $0x28,%rsp
4015c4: c3 retq
4015c5: 90 nop
4015c6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4015cd: 00 00 00
00000000004015d0 <__do_global_ctors>:
4015d0: 56 push %rsi
4015d1: 53 push %rbx
4015d2: 48 83 ec 28 sub $0x28,%rsp
4015d6: 48 8b 0d 23 2d 00 00 mov 0x2d23(%rip),%rcx # 404300 <.refptr.__CTOR_LIST__>
4015dd: 48 8b 11 mov (%rcx),%rdx
4015e0: 83 fa ff cmp $0xffffffff,%edx
4015e3: 89 d0 mov %edx,%eax
4015e5: 74 39 je 401620 <__do_global_ctors+0x50>
4015e7: 85 c0 test %eax,%eax
4015e9: 74 20 je 40160b <__do_global_ctors+0x3b>
4015eb: 89 c2 mov %eax,%edx
4015ed: 83 e8 01 sub $0x1,%eax
4015f0: 48 8d 1c d1 lea (%rcx,%rdx,8),%rbx
4015f4: 48 29 c2 sub %rax,%rdx
4015f7: 48 8d 74 d1 f8 lea -0x8(%rcx,%rdx,8),%rsi
4015fc: 0f 1f 40 00 nopl 0x0(%rax)
401600: ff 13 callq *(%rbx)
401602: 48 83 eb 08 sub $0x8,%rbx
401606: 48 39 f3 cmp %rsi,%rbx
401609: 75 f5 jne 401600 <__do_global_ctors+0x30>
40160b: 48 8d 0d 7e ff ff ff lea -0x82(%rip),%rcx # 401590 <__do_global_dtors>
401612: 48 83 c4 28 add $0x28,%rsp
401616: 5b pop %rbx
401617: 5e pop %rsi
401618: e9 f3 fe ff ff jmpq 401510 <atexit>
40161d: 0f 1f 00 nopl (%rax)
401620: 31 c0 xor %eax,%eax
401622: eb 02 jmp 401626 <__do_global_ctors+0x56>
401624: 89 d0 mov %edx,%eax
401626: 44 8d 40 01 lea 0x1(%rax),%r8d
40162a: 4a 83 3c c1 00 cmpq $0x0,(%rcx,%r8,8)
40162f: 4c 89 c2 mov %r8,%rdx
401632: 75 f0 jne 401624 <__do_global_ctors+0x54>
401634: eb b1 jmp 4015e7 <__do_global_ctors+0x17>
401636: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40163d: 00 00 00
0000000000401640 <__main>:
401640: 8b 05 ea 59 00 00 mov 0x59ea(%rip),%eax # 407030 <initialized>
401646: 85 c0 test %eax,%eax
401648: 74 06 je 401650 <__main+0x10>
40164a: c3 retq
40164b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
401650: c7 05 d6 59 00 00 01 movl $0x1,0x59d6(%rip) # 407030 <initialized>
401657: 00 00 00
40165a: e9 71 ff ff ff jmpq 4015d0 <__do_global_ctors>
40165f: 90 nop
I think what you are looking for are these instructions:
40155d: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%rbp)
401564: 48 8d 45 f8 lea -0x8(%rbp),%rax
401568: 48 89 c1 mov %rax,%rcx
40156b: e8 c0 16 00 00 callq 402c30 <_ZN9TestClass6xPlus2Ev>
401570: 89 45 fc mov %eax,-0x4(%rbp)
These match with the code from main:
TestClass tc1 = {5};
int variable = tc1.xPlus2();
At address 40155d the field tc1.x is initialized with the value 5.
At address 401564 the pointer to tc1 is loaded into the register %rax
At address 401568 the pointer to tc1 is copied into the register %rcx
At address 40156b is the call of the method tc1.xPlus2()
At address 401570 the result is stored in variable
Your observations are mostly correct. rcx holds the this pointer to the object on which the method was called. x is the first member, so it sits at the very address that this points to; that is why rcx is dereferenced and 2 is added to the result. It is the responsibility of the caller to make sure that rcx holds the address of the object before invoking the function. We can see main prepare rcx by setting it to an address in its stack frame. You are correct that the compiler produced inefficient code here and did not need to use the stack. Compiling with a higher optimization level (-O1, -O2, or -O3) will likely fix that. The nops after retq are never executed: they are padding that aligns the start of the next function, so some may remain even in optimized builds. You can mostly ignore __main. As the listing shows, it runs __do_global_ctors once, i.e. it performs runtime initialization such as calling global constructors.
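One way to picture this is to write the method by hand as a free function that takes the object's address as an explicit first argument. The sketch below is only an illustration: TestClass_xPlus2 and callBoth are made-up names, and the compiler does not literally generate such a function, but the calling pattern it emits (load the address of tc1, pass it in a register, call) corresponds to it.

```cpp
#include <cassert>

class TestClass {
public:
    int x;
    int xPlus2() { return x + 2; }
};

// Hypothetical free-function equivalent: `self` plays the role of the
// hidden `this` argument that the caller passes in a register.
int TestClass_xPlus2(TestClass* self) { return self->x + 2; }

// Calls the method both ways and checks that the results agree.
int callBoth() {
    TestClass tc1 = {5};
    int viaMethod = tc1.xPlus2();              // compiler passes &tc1 implicitly
    int viaFunction = TestClass_xPlus2(&tc1);  // same thing, spelled out
    assert(viaMethod == viaFunction);
    return viaMethod;
}
```

Because the method is non-virtual, the call target is known at compile time, which is why main contains a plain callq to a fixed address rather than an indirect call through a table.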

How std::memory_order_XXX works

I don't understand how std::memory_order_XXX (like memory_order_release/memory_order_acquire ...) works.
Some documents say that these memory orders have different properties, but I'm confused because they produce the same assembly code. What determines the differences?
That code:
static std::atomic<long> gt;
void test1() {
gt.store(1, std::memory_order_release);
gt.store(2, std::memory_order_relaxed);
gt.load(std::memory_order_acquire);
gt.load(std::memory_order_relaxed);
}
Corresponds to:
00000000000007a0 <_Z5test1v>:
7a0: 55 push %rbp
7a1: 48 89 e5 mov %rsp,%rbp
7a4: 48 83 ec 30 sub $0x30,%rsp
memory_order_release:
7a8: 48 c7 45 f8 01 00 00 movq $0x1,-0x8(%rbp)
7af: 00
7b0: c7 45 e8 03 00 00 00 movl $0x3,-0x18(%rbp)
7b7: 8b 45 e8 mov -0x18(%rbp),%eax
7ba: be ff ff 00 00 mov $0xffff,%esi
7bf: 89 c7 mov %eax,%edi
7c1: e8 b1 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
7c6: 89 45 ec mov %eax,-0x14(%rbp)
7c9: 48 8b 55 f8 mov -0x8(%rbp),%rdx
7cd: 48 8d 05 44 08 20 00 lea 0x200844(%rip),%rax # 201018 <_ZL2gt>
7d4: 48 89 10 mov %rdx,(%rax)
7d7: 0f ae f0 mfence
memory_order_relaxed:
7da: 48 c7 45 f0 02 00 00 movq $0x2,-0x10(%rbp)
7e1: 00
7e2: c7 45 e0 00 00 00 00 movl $0x0,-0x20(%rbp)
7e9: 8b 45 e0 mov -0x20(%rbp),%eax
7ec: be ff ff 00 00 mov $0xffff,%esi
7f1: 89 c7 mov %eax,%edi
7f3: e8 7f 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
7f8: 89 45 e4 mov %eax,-0x1c(%rbp)
7fb: 48 8b 55 f0 mov -0x10(%rbp),%rdx
7ff: 48 8d 05 12 08 20 00 lea 0x200812(%rip),%rax # 201018 <_ZL2gt>
806: 48 89 10 mov %rdx,(%rax)
809: 0f ae f0 mfence
memory_order_acquire:
80c: c7 45 d8 02 00 00 00 movl $0x2,-0x28(%rbp)
813: 8b 45 d8 mov -0x28(%rbp),%eax
816: be ff ff 00 00 mov $0xffff,%esi
81b: 89 c7 mov %eax,%edi
81d: e8 55 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
822: 89 45 dc mov %eax,-0x24(%rbp)
825: 48 8d 05 ec 07 20 00 lea 0x2007ec(%rip),%rax # 201018 <_ZL2gt>
82c: 48 8b 00 mov (%rax),%rax
memory_order_relaxed:
82f: c7 45 d0 00 00 00 00 movl $0x0,-0x30(%rbp)
836: 8b 45 d0 mov -0x30(%rbp),%eax
839: be ff ff 00 00 mov $0xffff,%esi
83e: 89 c7 mov %eax,%edi
840: e8 32 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
845: 89 45 d4 mov %eax,-0x2c(%rbp)
848: 48 8d 05 c9 07 20 00 lea 0x2007c9(%rip),%rax # 201018 <_ZL2gt>
84f: 48 8b 00 mov (%rax),%rax
852: 90 nop
853: c9 leaveq
854: c3 retq
00000000000008cc <_ZStanSt12memory_orderSt23__memory_order_modifier>:
8cc: 55 push %rbp
8cd: 48 89 e5 mov %rsp,%rbp
8d0: 89 7d fc mov %edi,-0x4(%rbp)
8d3: 89 75 f8 mov %esi,-0x8(%rbp)
8d6: 8b 55 fc mov -0x4(%rbp),%edx
8d9: 8b 45 f8 mov -0x8(%rbp),%eax
8dc: 21 d0 and %edx,%eax
8de: 5d pop %rbp
8df: c3 retq
I expected different memory orders to produce different assembly code,
but changing the order has no effect on the assembly. Can anyone explain this?
Each memory order setting has its own semantics. The compiler is obliged to satisfy these semantics, which means two things:
It forbids the compiler from performing certain optimizations, such as reordering reads and writes.
It instructs the compiler to propagate the same constraint down to the hardware. How that is done depends on the platform. x86_64 itself provides a very strong memory model, so in almost all cases you will see no difference in the generated assembly for x86_64 no matter which memory order you choose. On more weakly ordered RISC architectures (e.g. ARM), however, you will see a difference because the compiler has to insert memory barriers. The type of memory barrier depends on the selected memory order.
EDIT: Have a look at the JSR-133. It is very old and is about Java, but it provides the nicest explanation about memory model from the compiler perspective that I know. In particular, look at the table of memory barrier instructions for different architectures.
Given the code:
#include <atomic>
static std::atomic<long> gt;
void test1() {
gt.store(41, std::memory_order_release);
gt.store(42, std::memory_order_relaxed);
gt.load(std::memory_order_acquire);
gt.load(std::memory_order_relaxed);
}
At a decent optimization level, the garbage assembly that shuffles values between registers and the stack is gone:
test1():
movq $41, gt(%rip)
movq $42, gt(%rip)
movq gt(%rip), %rax
movq gt(%rip), %rax
ret
We see that the exact same code is generated for the different memory orders. Note, though, that testing different operations in sequence in the same function is bad practice: C++ statements don't have to be compiled independently, and context might influence code generation. With the current code generation in GCC, each statement involving an atomic is compiled on its own, but it is better practice to put each statement in a separate function.
The same code is generated here because no special instruction happens to be needed for these memory orders.
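Even when the instructions are identical, the orders still differ in what they promise. A minimal sketch of the classic message-passing pattern (the names payload, ready and run_message_passing are made up for this illustration): the release store publishes the plain write to payload, and the acquire load that observes the flag is guaranteed to see it. With relaxed on both sides the program would contain a data race on payload, even if it happened to produce the same x86_64 instructions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int payload = 0;                 // plain, non-atomic data
static std::atomic<bool> ready{false};  // flag that publishes `payload`

// Hypothetical demo: returns the value the consumer thread observed.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;                                  // ordinary write
        ready.store(true, std::memory_order_release);  // publish it
    });
    std::thread consumer([&seen] {
        // Spin until the flag is observed; acquire pairs with the release.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // guaranteed to read 42
    });

    producer.join();
    consumer.join();
    return seen;
}
```

The ordering constraint applies to the compiler as well as the CPU: it may not move the write to payload below the release store, nor the read of payload above the acquire load.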

assembly code for C++ scoped static initialization

I read some old articles about the initialization problem with local scoped static variables:
C++ scoped static initialization is not thread-safe, back in 2004, and
Function Static Variables in Multi-Threaded Environments in 2006.
Then I wrote an example and checked what my compiler, gcc 4.4.7, does:
int calcSomething(){ return 0; }
void foo(){
static int x = calcSomething();
}
int main(){
foo();
return 0;
}
the result from objdump shows:
000000000040061a <_Z3foov>:
40061a: 55 push %rbp
40061b: 48 89 e5 mov %rsp,%rbp
40061e: b8 d0 0a 60 00 mov $0x600ad0,%eax
400623: 0f b6 00 movzbl (%rax),%eax
400626: 84 c0 test %al,%al
400628: 75 28 jne 400652 <_Z3foov+0x38>
40062a: bf d0 0a 60 00 mov $0x600ad0,%edi
40062f: e8 bc fe ff ff callq 4004f0 <__cxa_guard_acquire@plt>
400634: 85 c0 test %eax,%eax
400636: 0f 95 c0 setne %al
400639: 84 c0 test %al,%al
40063b: 74 15 je 400652 <_Z3foov+0x38>
40063d: e8 d2 ff ff ff callq 400614 <_Z13calcSomethingv>
400642: 89 05 90 04 20 00 mov %eax,0x200490(%rip) # 600ad8 <_ZZ3foovE1x>
400648: bf d0 0a 60 00 mov $0x600ad0,%edi
40064d: e8 be fe ff ff callq 400510 <__cxa_guard_release@plt>
400652: c9 leaveq
400653: c3 retq
Unfortunately, my knowledge of assembly code is so limited that I cannot tell what the compiler does here. Can anyone shed some light on what this assembly code does? And is it still not thread-safe? I would really appreciate some "pseudo code" showing what gcc is doing here.
EDIT-1:
As Jerry commented, I enabled optimization with -O2; the assembly code is:
0000000000400620 <_Z3foov>:
400620: 48 83 ec 08 sub $0x8,%rsp
400624: 80 3d 85 04 20 00 00 cmpb $0x0,0x200485(%rip) # 600ab0 <_ZGVZ3foovE1x>
40062b: 74 0b je 400638 <_Z3foov+0x18>
40062d: 48 83 c4 08 add $0x8,%rsp
400631: c3 retq
400632: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400638: bf b0 0a 60 00 mov $0x600ab0,%edi
40063d: e8 9e fe ff ff callq 4004e0 <__cxa_guard_acquire@plt>
400642: 85 c0 test %eax,%eax
400644: 74 e7 je 40062d <_Z3foov+0xd>
400646: c7 05 68 04 20 00 00 movl $0x0,0x200468(%rip) # 600ab8 <_ZZ3foovE1x>
40064d: 00 00 00
400650: bf b0 0a 60 00 mov $0x600ab0,%edi
400655: 48 83 c4 08 add $0x8,%rsp
400659: e9 a2 fe ff ff jmpq 400500 <__cxa_guard_release@plt>
40065e: 66 90 xchg %ax,%ax
Yes. In pseudocode (for the un-optimized case) it's something like:
if (flag_val() != 0) goto done;
if (guard_acquire() == 0) goto done;
x = calcSomething();
guard_release_and_set_flag();
// Note releasing the guard lock causes later
// calls to flag_val() to return non-zero.
done: return;
The flag_val() is really a non-blocking check, apparently for efficiency to avoid calling the acquire primitive unless necessary. The flag must be set by guard_release as shown. The acquire seems to be the synchronized call to grab the lock. Only one thread will get a true value back and perform the initialization. After it releases the lock, the non-zero flag prevents any further touches of the lock.
Another interesting tidbit is that the guard data structure is 8 bytes away from the value of x itself in static memory.
Those familiar with the singleton pattern in languages with built-in threads e.g. Java will recognize this!
Addition
A bit more time now, so in a bit more detail:
000000000040061a <_Z3foov>:
; Prepare to access stack variables (never used in un-optimized code).
40061a: 55 push %rbp
40061b: 48 89 e5 mov %rsp,%rbp
; Test a byte 8 away from the static int x. This is apparently an "initialized" flag.
40061e: b8 d0 0a 60 00 mov $0x600ad0,%eax
400623: 0f b6 00 movzbl (%rax),%eax
400626: 84 c0 test %al,%al
; Goto the end of the function if the byte was non-zero.
400628: 75 28 jne 400652 <_Z3foov+0x38>
; Load the same byte address in %edi: the argument for the call to
; acquire the guard lock.
40062a: bf d0 0a 60 00 mov $0x600ad0,%edi
40062f: e8 bc fe ff ff callq 4004f0 <__cxa_guard_acquire@plt>
; Test the return value (redundantly, as this is non-optimized code). Goto end of function if it is zero.
400634: 85 c0 test %eax,%eax
400636: 0f 95 c0 setne %al
400639: 84 c0 test %al,%al
40063b: 74 15 je 400652 <_Z3foov+0x38>
; Call the user's initialization function and move result into x.
40063d: e8 d2 ff ff ff callq 400614 <_Z13calcSomethingv>
400642: 89 05 90 04 20 00 mov %eax,0x200490(%rip) # 600ad8 <_ZZ3foovE1x>
; Load the guard byte's address again and call the release routine.
; This must set the flag to non-zero.
400648: bf d0 0a 60 00 mov $0x600ad0,%edi
40064d: e8 be fe ff ff callq 400510 <__cxa_guard_release@plt>
; Restore state and return.
400652: c9 leaveq
400653: c3 retq
This listing, although for the LLVM compiler rather than g++ (are you running OS X? OS X aliases g++ to LLVM), agrees with the guesswork above. The set_initialized routine is setting a flag value in guard_release.
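The same control flow can be written out by hand in C++. This is only a sketch of the pattern, not the runtime's actual __cxa_guard_acquire/__cxa_guard_release implementation: an atomic flag and a mutex stand in for the guard variable, and all names (x_initialized, x_guard_lock, x_storage, foo_value, calc_calls) are made up for the illustration.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

static int calc_calls = 0;  // counts initializer runs (for the demo only)
static int calcSomething() { ++calc_calls; return 7; }

static std::atomic<bool> x_initialized{false};  // plays the role of the guard byte
static std::mutex x_guard_lock;                 // plays the role of the guard lock
static int x_storage;                           // plays the role of `static int x`

// Hand-written analogue of a function containing
//     static int x = calcSomething();
int foo_value() {
    // Fast path: a cheap non-blocking check of the "initialized" flag.
    if (!x_initialized.load(std::memory_order_acquire)) {
        // Slow path: take the lock (the acquire step).
        std::lock_guard<std::mutex> lock(x_guard_lock);
        // Re-check: another thread may have finished while we waited.
        if (!x_initialized.load(std::memory_order_relaxed)) {
            x_storage = calcSomething();
            // Setting the flag corresponds to the release step.
            x_initialized.store(true, std::memory_order_release);
        }
    }
    return x_storage;
}
```

Calling foo_value() repeatedly runs calcSomething() exactly once; every later call takes only the fast-path flag check, just as the compiled code skips the guard calls once the guard byte is non-zero.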

Why symbols malloc, __malloc and __libc_malloc point to the same code address?

When I grep malloc from the symbol table, with the following command
readelf -s bin | grep malloc
I can see that the symbols malloc, __malloc and __libc_malloc share the same code address. I can get the PC address and want to know when a user program calls malloc, but __malloc and __libc_malloc give me noisy information. Is there a good way to single out malloc? I compiled the binary with -static, so dlsym doesn't work in this case.
You're not going to be able to tell them apart unless you use dynamic linking as they will be the same thing, and the act of static linking will replace the name references with the address of the routine.
Take an example:
#include <stdlib.h>
extern void *__malloc(size_t);
extern void *__libc_malloc(size_t);
int
main(int argc, char **argv)
{
void *v = malloc(200);
free(v);
v = __malloc(200);
free(v);
v = __libc_malloc(200);
free(v);
return 0;
}
When this is compiled using gcc -static -o example example.c and we then disassemble the main routine, we see:
40103e: 55 push %rbp
40103f: 48 89 e5 mov %rsp,%rbp
401042: 48 83 ec 20 sub $0x20,%rsp
401046: 89 7d ec mov %edi,-0x14(%rbp)
401049: 48 89 75 e0 mov %rsi,-0x20(%rbp)
40104d: bf c8 00 00 00 mov $0xc8,%edi
401052: e8 19 52 00 00 callq 406270 <__libc_malloc>
401057: 48 89 45 f8 mov %rax,-0x8(%rbp)
40105b: 48 8b 45 f8 mov -0x8(%rbp),%rax
40105f: 48 89 c7 mov %rax,%rdi
401062: e8 09 56 00 00 callq 406670 <__cfree>
401067: bf c8 00 00 00 mov $0xc8,%edi
40106c: e8 ff 51 00 00 callq 406270 <__libc_malloc>
401071: 48 89 45 f8 mov %rax,-0x8(%rbp)
401075: 48 8b 45 f8 mov -0x8(%rbp),%rax
401079: 48 89 c7 mov %rax,%rdi
40107c: e8 ef 55 00 00 callq 406670 <__cfree>
401081: bf c8 00 00 00 mov $0xc8,%edi
401086: e8 e5 51 00 00 callq 406270 <__libc_malloc>
40108b: 48 89 45 f8 mov %rax,-0x8(%rbp)
40108f: 48 8b 45 f8 mov -0x8(%rbp),%rax
401093: 48 89 c7 mov %rax,%rdi
401096: e8 d5 55 00 00 callq 406670 <__cfree>
40109b: b8 00 00 00 00 mov $0x0,%eax
4010a0: c9 leaveq
4010a1: c3 retq
4010a2: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4010a9: 00 00 00
4010ac: 0f 1f 40 00 nopl 0x0(%rax)
i.e. the code doesn't differentiate the entries.
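Several names ending up at one address is what symbol aliasing produces. As a rough sketch of the effect, assuming GCC or Clang on an ELF target (the alias attribute is a compiler extension and is not available everywhere), the following defines one routine and two extra names for it. All demo_* names are hypothetical and are not the real libc symbols.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// One real definition...
extern "C" void* demo_libc_malloc(std::size_t n) { return std::malloc(n); }

// ...and two additional symbol names bound to the same code.
extern "C" void* demo_malloc(std::size_t n)
    __attribute__((alias("demo_libc_malloc")));
extern "C" void* demo__malloc(std::size_t n)
    __attribute__((alias("demo_libc_malloc")));

// All three names resolve to a single address, so a program counter
// value alone cannot tell which name the caller used.
bool all_same_address() {
    return demo_malloc == demo_libc_malloc && demo__malloc == demo_libc_malloc;
}
```

Running readelf -s on an object built from this should list three symbols with the same value, which is the situation described in the question.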
Now, if you use dynamic linking; you get a different result. For one thing, __malloc is not available in the resulting binary - this is because the __malloc name is a side-effect of the static linking (there is a way to prevent it from being produced, but the mechanism escapes me at the moment). So when we compile the binary (removing the __malloc call), main looks like:
40058d: 55 push %rbp
40058e: 48 89 e5 mov %rsp,%rbp
400591: 48 83 ec 20 sub $0x20,%rsp
400595: 89 7d ec mov %edi,-0x14(%rbp)
400598: 48 89 75 e0 mov %rsi,-0x20(%rbp)
40059c: bf c8 00 00 00 mov $0xc8,%edi
4005a1: e8 ea fe ff ff callq 400490 <malloc@plt>
4005a6: 48 89 45 f8 mov %rax,-0x8(%rbp)
4005aa: 48 8b 45 f8 mov -0x8(%rbp),%rax
4005ae: 48 89 c7 mov %rax,%rdi
4005b1: e8 9a fe ff ff callq 400450 <free@plt>
4005b6: bf c8 00 00 00 mov $0xc8,%edi
4005bb: e8 c0 fe ff ff callq 400480 <__libc_malloc@plt>
4005c0: 48 89 45 f8 mov %rax,-0x8(%rbp)
4005c4: 48 8b 45 f8 mov -0x8(%rbp),%rax
4005c8: 48 89 c7 mov %rax,%rdi
4005cb: e8 80 fe ff ff callq 400450 <free@plt>
4005d0: b8 00 00 00 00 mov $0x0,%eax
4005d5: c9 leaveq
4005d6: c3 retq
4005d7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
4005de: 00 00
So to determine the use of __libc_malloc or malloc, you can check for calls to the plt entry for the routine.
This of course all assumes that you're actually performing some type of static analysis of the binary. If you're doing this at run-time, the usual method is library interception using LD_PRELOAD, which is a whole different question.

Is this an optimization bug in g++?

I'm not sure whether I've found a bug in g++ (4.4.1-4ubuntu9), or if I'm doing
something wrong. What I believe I'm seeing is a bug introduced by enabling
optimization with g++ -O2. I've tried to distill the code down to just the
relevant parts.
When optimization is enabled, I have an ASSERT which is failing. When
optimization is disabled, the same ASSERT does not fail. I think I've tracked
it down to the optimization of one function and its callers.
The System
Language: C++
Ubuntu 9.10
g++-4.4.real (Ubuntu 4.4.1-4ubuntu9) 4.4.1
Linux 2.6.31-22-server x86_64
Optimization Enabled
Object compiled with:
g++ -DHAVE_CONFIG_H -I. -fPIC -g -O2 -MT file.o -MD -MP -MF .deps/file.Tpo -c -o file.o file.cpp
And here is the relevant code from objdump -dg file.o.
00000000000018b0 <helper_function>:
;; This function takes two parameters:
;; pointer to int: %rdi
;; pointer to int[]: %rsi
18b0: 0f b6 07 movzbl (%rdi),%eax
18b3: 83 f8 12 cmp $0x12,%eax
18b6: 74 60 je 1918 <helper_function+0x68>
18b8: 83 f8 17 cmp $0x17,%eax
18bb: 74 5b je 1918 <helper_function+0x68>
...
1918: c7 06 32 00 00 00 movl $0x32,(%rsi)
191e: 66 90 xchg %ax,%ax
1920: c3 retq
0000000000005290 <buggy_invoker>:
... snip ...
52a0: 48 81 ec c8 01 00 00 sub $0x1c8,%rsp
52a7: 48 8d 84 24 a0 01 00 lea 0x1a0(%rsp),%rax
52ae: 00
52af: 48 c7 84 24 a0 01 00 movq $0x0,0x1a0(%rsp)
52b6: 00 00 00 00 00
52bb: 48 c7 84 24 a8 01 00 movq $0x0,0x1a8(%rsp)
52c2: 00 00 00 00 00
52c7: c7 84 24 b0 01 00 00 movl $0x0,0x1b0(%rsp)
52ce: 00 00 00 00
52d2: 4c 8d 7c 24 20 lea 0x20(%rsp),%r15
52d7: 48 89 c6 mov %rax,%rsi
52da: 48 89 44 24 08 mov %rax,0x8(%rsp)
;; ***** BUG HERE *****
;; Pointer to int[] loaded into %rsi
;; But where is %rdi populated?
52df: e8 cc c5 ff ff callq 18b0 <helper_function>
0000000000005494 <perfectly_fine_invoker>:
5494: 48 83 ec 20 sub $0x20,%rsp
5498: 0f ae f0 mfence
549b: 48 8d 7c 24 30 lea 0x30(%rsp),%rdi
54a0: 48 89 e6 mov %rsp,%rsi
54a3: 48 c7 04 24 00 00 00 movq $0x0,(%rsp)
54aa: 00
54ab: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
54b2: 00 00
54b4: c7 44 24 10 00 00 00 movl $0x0,0x10(%rsp)
54bb: 00
;; Non buggy invocation here: both %rdi and %rsi loaded correctly.
54bc: e8 ef c3 ff ff callq 18b0 <helper_function>
Optimization Disabled
Now compiled with:
g++ -DHAVE_CONFIG_H -I. -fPIC -g -O0 -MT file.o -MD -MP -MF .deps/file.Tpo -c -o file.o file.cpp
0000000000008d27 <helper_function>:
;; Still the same parameters here, but it looks a little different.
... snip ...
8d2b: 48 89 7d e8 mov %rdi,-0x18(%rbp)
8d2f: 48 89 75 e0 mov %rsi,-0x20(%rbp)
8d33: 48 8b 45 e8 mov -0x18(%rbp),%rax
8d37: 0f b6 00 movzbl (%rax),%eax
8d3a: 0f b6 c0 movzbl %al,%eax
8d3d: 89 45 fc mov %eax,-0x4(%rbp)
8d40: 8b 45 fc mov -0x4(%rbp),%eax
8d43: 83 f8 17 cmp $0x17,%eax
8d46: 74 40 je 8d88 <helper_function+0x61>
...
000000000000948a <buggy_invoker>:
948a: 55 push %rbp
948b: 48 89 e5 mov %rsp,%rbp
948e: 41 54 push %r12
9490: 53 push %rbx
9491: 48 81 ec c0 01 00 00 sub $0x1c0,%rsp
9498: 48 89 bd 38 fe ff ff mov %rdi,-0x1c8(%rbp)
949f: 48 89 b5 30 fe ff ff mov %rsi,-0x1d0(%rbp)
94a6: 48 c7 45 c0 00 00 00 movq $0x0,-0x40(%rbp)
94ad: 00
94ae: 48 c7 45 c8 00 00 00 movq $0x0,-0x38(%rbp)
94b5: 00
94b6: c7 45 d0 00 00 00 00 movl $0x0,-0x30(%rbp)
94bd: 48 8d 55 c0 lea -0x40(%rbp),%rdx
94c1: 48 8b 85 38 fe ff ff mov -0x1c8(%rbp),%rax
94c8: 48 89 d6 mov %rdx,%rsi
94cb: 48 89 c7 mov %rax,%rdi
;; ***** NOT BUGGY HERE *****
;; Now, without optimization, both %rdi and %rsi loaded correctly.
94ce: e8 54 f8 ff ff callq 8d27 <helper_function>
0000000000008eec <different_perfectly_fine_invoker>:
8eec: 55 push %rbp
8eed: 48 89 e5 mov %rsp,%rbp
8ef0: 48 83 ec 30 sub $0x30,%rsp
8ef4: 48 89 7d d8 mov %rdi,-0x28(%rbp)
8ef8: 48 c7 45 e0 00 00 00 movq $0x0,-0x20(%rbp)
8eff: 00
8f00: 48 c7 45 e8 00 00 00 movq $0x0,-0x18(%rbp)
8f07: 00
8f08: c7 45 f0 00 00 00 00 movl $0x0,-0x10(%rbp)
8f0f: 48 8d 55 e0 lea -0x20(%rbp),%rdx
8f13: 48 8b 45 d8 mov -0x28(%rbp),%rax
8f17: 48 89 d6 mov %rdx,%rsi
8f1a: 48 89 c7 mov %rax,%rdi
;; Another example of non-optimized call to that function.
8f1d: e8 05 fe ff ff callq 8d27 <helper_function>
The Original C++ Code
This is a sanitized version of the original C++. I've just changed some names
and removed irrelevant code. Forgive my paranoia, I just don't want to expose
too much code from unpublished and unreleased work :-).
static void helper_function(my_struct_t *e, int *outArr)
{
unsigned char event_type = e->header.type;
if (event_type == event_A || event_type == event_B) {
outArr[0] = action_one;
} else if (event_type == event_C) {
outArr[0] = action_one;
outArr[1] = action_two;
} else if (...) { ... }
}
static void buggy_invoker(my_struct_t *e, predicate_t pred)
{
// MAX_ACTIONS is #defined to 5
int action_array[MAX_ACTIONS] = {0};
helper_function(e, action_array);
...
}
static int has_any_actions(my_struct_t *e)
{
int actions[MAX_ACTIONS] = {0};
helper_function(e, actions);
return actions[0] != 0;
}
// *** ENTRY POINT to this code is this function (note not static).
void perfectly_fine_invoker(my_struct_t e, predicate_t pred)
{
memfence();
if (has_any_actions(&e)) {
buggy_invoker(&e, pred);
}
...
}
If you think I've obfuscated or eliminated too much, let me know. Users of
this code call 'perfectly_fine_invoker'. With optimization, g++ optimizes the
'has_any_actions' function away into a direct call to 'helper_function', which
you can see in the assembly.
The Question
So, my question is, does it look like a buggy optimization to anyone else?
If it would be helpful, I could post a sanitized version of the original C++ code.
This is my first posting to Stack Overflow, so please let me know if I can do
anything to make the question clearer, or provide any additional information.
The Answer
Edit (several days after the fact):
I accepted an answer below to my question -- it was not an optimization bug in g++, I was just looking at the assembly code wrong.
However, for whoever may be viewing this question in the future, I've found the answer. I did some reading on undefined behavior in C ( http://blog.regehr.org/archives/213 and http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html ) and some of the descriptions of the compiler optimizing away functions with undefined behavior seemed eerily familiar.
I added some NULL-pointer checks to the function 'helper_function' and lo and behold... bug goes away. I should have had the NULL-pointer checks to begin with, but apparently not having them allowed g++ to do whatever it wanted (in my case, optimize away the call).
Hope this information helps someone down the road.
I think you are looking at the wrong thing. I imagine the compiler noticed that your function is short and doesn't touch the %rdi register, so it just leaves it alone (you pass the same variable as the first parameter, which I guess is what is placed in %rdi; see page 21 here http://www.x86-64.org/documentation/abi.pdf).
If you look at the unoptimized version, it saves the %rdi register on this line:
9498: 48 89 bd 38 fe ff ff mov %rdi,-0x1c8(%rbp)
...and then later, just before calling helper_function, it moves the saved value into %rax, which is then moved into %rdi:
94c1: 48 8b 85 38 fe ff ff mov -0x1c8(%rbp),%rax
94c8: 48 89 d6 mov %rdx,%rsi
94cb: 48 89 c7 mov %rax,%rdi
When optimizing, the compiler just gets rid of all that moving back and forth.
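A reduced version of the situation, as a sketch: the struct layout and the constant values below are assumptions standing in for the question's unpublished code (0x12 and 0x32 are simply the immediates visible in the optimized listing). The point is that e is both the first parameter of invoker and the first argument of the call it makes, so under the System V x86-64 calling convention it can stay in %rdi from function entry to the call, and only the second argument has to be set up.

```cpp
#include <cassert>

// Hypothetical stand-ins for the question's types and constants.
struct my_struct_t { unsigned char type; };
const int MAX_ACTIONS = 5;
const unsigned char event_A = 0x12;
const int action_one = 0x32;

static void helper_function(my_struct_t* e, int* outArr) {
    if (e->type == event_A) {
        outArr[0] = action_one;
    }
}

// `e` arrives as the first argument and is forwarded unchanged as the
// first argument of helper_function, so an optimizing compiler has no
// reason to reload it before the call.
int invoker(my_struct_t* e) {
    int action_array[MAX_ACTIONS] = {0};
    helper_function(e, action_array);
    return action_array[0];
}
```

So the absence of an instruction that populates %rdi before the call is expected in the optimized build; the register already holds the right value.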