How std::memory_order_XXX works - c++

I don't understand how std::memory_order_XXX(like memory_order_release/memory_order_acquire ...) works.
From some documents, it shows that these memory mode have different feature, but I'm really confused that they have the same assemble code, what determined the differences?
That code:
static std::atomic<long> gt;
void test1() {
gt.store(1, std::memory_order_release);
gt.store(2, std::memory_order_relaxed);
gt.load(std::memory_order_acquire);
gt.load(std::memory_order_relaxed);
}
Corresponds to:
00000000000007a0 <_Z5test1v>:
7a0: 55 push %rbp
7a1: 48 89 e5 mov %rsp,%rbp
7a4: 48 83 ec 30 sub $0x30,%rsp
**memory_order_release:
7a8: 48 c7 45 f8 01 00 00 movq $0x1,-0x8(%rbp)
7af: 00
7b0: c7 45 e8 03 00 00 00 movl $0x3,-0x18(%rbp)
7b7: 8b 45 e8 mov -0x18(%rbp),%eax
7ba: be ff ff 00 00 mov $0xffff,%esi
7bf: 89 c7 mov %eax,%edi
7c1: e8 b1 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
7c6: 89 45 ec mov %eax,-0x14(%rbp)
7c9: 48 8b 55 f8 mov -0x8(%rbp),%rdx
7cd: 48 8d 05 44 08 20 00 lea 0x200844(%rip),%rax # 201018 <_ZL2gt>
7d4: 48 89 10 mov %rdx,(%rax)
7d7: 0f ae f0 mfence**
**memory_order_relaxed:
7da: 48 c7 45 f0 02 00 00 movq $0x2,-0x10(%rbp)
7e1: 00
7e2: c7 45 e0 00 00 00 00 movl $0x0,-0x20(%rbp)
7e9: 8b 45 e0 mov -0x20(%rbp),%eax
7ec: be ff ff 00 00 mov $0xffff,%esi
7f1: 89 c7 mov %eax,%edi
7f3: e8 7f 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
7f8: 89 45 e4 mov %eax,-0x1c(%rbp)
7fb: 48 8b 55 f0 mov -0x10(%rbp),%rdx
7ff: 48 8d 05 12 08 20 00 lea 0x200812(%rip),%rax # 201018 <_ZL2gt>
806: 48 89 10 mov %rdx,(%rax)
809: 0f ae f0 mfence**
**memory_order_acquire:
80c: c7 45 d8 02 00 00 00 movl $0x2,-0x28(%rbp)
813: 8b 45 d8 mov -0x28(%rbp),%eax
816: be ff ff 00 00 mov $0xffff,%esi
81b: 89 c7 mov %eax,%edi
81d: e8 55 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
822: 89 45 dc mov %eax,-0x24(%rbp)
825: 48 8d 05 ec 07 20 00 lea 0x2007ec(%rip),%rax # 201018 <_ZL2gt>
82c: 48 8b 00 mov (%rax),%rax**
**memory_order_relaxed:
82f: c7 45 d0 00 00 00 00 movl $0x0,-0x30(%rbp)
836: 8b 45 d0 mov -0x30(%rbp),%eax
839: be ff ff 00 00 mov $0xffff,%esi
83e: 89 c7 mov %eax,%edi
840: e8 32 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
845: 89 45 d4 mov %eax,-0x2c(%rbp)
848: 48 8d 05 c9 07 20 00 lea 0x2007c9(%rip),%rax # 201018 <_ZL2gt>
84f: 48 8b 00 mov (%rax),%rax**
852: 90 nop
853: c9 leaveq
854: c3 retq
00000000000008cc <_ZStanSt12memory_orderSt23__memory_order_modifier>:
8cc: 55 push %rbp
8cd: 48 89 e5 mov %rsp,%rbp
8d0: 89 7d fc mov %edi,-0x4(%rbp)
8d3: 89 75 f8 mov %esi,-0x8(%rbp)
8d6: 8b 55 fc mov -0x4(%rbp),%edx
8d9: 8b 45 f8 mov -0x8(%rbp),%eax
8dc: 21 d0 and %edx,%eax
8de: 5d pop %rbp
8df: c3 retq
I expect different memory mode has different implements on assemble code,
but setting different mode value is no effect on assemble, who can explain this?

Each memory model setting has its semantics. Compiler is obliged to satisfy this semantics, meaning that:
It disallows compiler to perform certain optimizations, such as reordering of reads and writes.
It instructs the compiler to propagate the very same message down to the hardware. How it is done, depends on the platform. x86_64 itself provides very strong memory model. Hence in almost all cases you will see no difference in generated assembler code for x86_64 no matter what memory model you choose. However, on RISC architectures (e.g. ARM), you will see the difference because compiler will have to insert memory barriers. Type of memory barrier depends on the selected memory model setting.
EDIT: Have a look at the JSR-133. It is very old and is about Java, but it provides the nicest explanation about memory model from the compiler perspective that I know. In particular, look at the table of memory barrier instructions for different architectures.

Given the code:
#include <atomic>
static std::atomic<long> gt;
void test1() {
gt.store(41, std::memory_order_release);
gt.store(42, std::memory_order_relaxed);
gt.load(std::memory_order_acquire);
gt.load(std::memory_order_relaxed);
}
At decent optimization level there is no garbage assembly moving values around on registers than the stack:
test1():
movq $41, gt(%rip)
movq $42, gt(%rip)
movq gt(%rip), %rax
movq gt(%rip), %rax
ret
We see that the exact same code is generated for the different memory orders; although testing different instructions in the same function in sequence is very bad practice as C++ instructions don't have to be compiled independently and context might influence code generation. But with the current code generation in GCC, it compiles each statement involving an atomic as its own. Good practice is to have a different function for each statement.
The same code is generated here because no special instruction happens to be needed for these memory orders.

Related

How are non-static, non-virtual methods implemented in C++?

I wanted to know how methods are implemented in C++. I wanted to know how methods are implemented "under the hood".
So, I have made a simple C++ program which has a class with 1 non static field and 1 non static, non virtual method.
Then I instantiated the class in the main function and called the method. I have used objdump -d option in order to see the CPU instructions of this program. I have a x86-64 processor.
Here's the code:
#include<stdio.h>
class TestClass {
public:
int x;
int xPlus2(){
return x + 2;
}
};
int main(){
TestClass tc1 = {5};
int variable = tc1.xPlus2();
printf("%d \n", variable);
return 0;
}
Here are instructions for the method xPlus2:
0000000000402c30 <_ZN9TestClass6xPlus2Ev>:
402c30: 55 push %rbp
402c31: 48 89 e5 mov %rsp,%rbp
402c34: 48 89 4d 10 mov %rcx,0x10(%rbp)
402c38: 48 8b 45 10 mov 0x10(%rbp),%rax
402c3c: 8b 00 mov (%rax),%eax
402c3e: 83 c0 02 add $0x2,%eax
402c41: 5d pop %rbp
402c42: c3 retq
402c43: 90 nop
402c44: 90 nop
402c45: 90 nop
402c46: 90 nop
402c47: 90 nop
402c48: 90 nop
402c49: 90 nop
402c4a: 90 nop
402c4b: 90 nop
402c4c: 90 nop
402c4d: 90 nop
402c4e: 90 nop
402c4f: 90 nop
If I understand it correctly, these instructions can be replaced by just 3 instructions, because I believe that I don't need to use the stack, I think the compiler used it redundantly:
mov (%rcx), eax
add $2, eax
retq
and then maybe I still need lots of nop instructions for synchronization purposes or whatnot. If you look at the CPU instructions, it looks like the value that x field has is stored at the location in memory which rcx register holds. You will see the rest of the CPU instructions in a moment. It is a little bit hard for me to track what has happened here (especially what is going on with the call of _main function), I don't even know what parts of assembly are important to look at. Compiler produces main function (as I expected), but then it also produced _main function which is called from the main, there are some weird functions in between those two as well.
Here are other parts of the assembly that I think may be interesting:
0000000000401550 <main>:
401550: 55 push %rbp
401551: 48 89 e5 mov %rsp,%rbp
401554: 48 83 ec 30 sub $0x30,%rsp
401558: e8 e3 00 00 00 callq 401640 <__main>
40155d: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%rbp)
401564: 48 8d 45 f8 lea -0x8(%rbp),%rax
401568: 48 89 c1 mov %rax,%rcx
40156b: e8 c0 16 00 00 callq 402c30 <_ZN9TestClass6xPlus2Ev>
401570: 89 45 fc mov %eax,-0x4(%rbp)
401573: 8b 45 fc mov -0x4(%rbp),%eax
401576: 89 c2 mov %eax,%edx
401578: 48 8d 0d 81 2a 00 00 lea 0x2a81(%rip),%rcx # 404000 <.rdata>
40157f: e8 ec 14 00 00 callq 402a70 <printf>
401584: b8 00 00 00 00 mov $0x0,%eax
401589: 48 83 c4 30 add $0x30,%rsp
40158d: 5d pop %rbp
40158e: c3 retq
40158f: 90 nop
0000000000401590 <__do_global_dtors>:
401590: 48 83 ec 28 sub $0x28,%rsp
401594: 48 8b 05 75 1a 00 00 mov 0x1a75(%rip),%rax # 403010 <p.93846>
40159b: 48 8b 00 mov (%rax),%rax
40159e: 48 85 c0 test %rax,%rax
4015a1: 74 1d je 4015c0 <__do_global_dtors+0x30>
4015a3: ff d0 callq *%rax
4015a5: 48 8b 05 64 1a 00 00 mov 0x1a64(%rip),%rax # 403010 <p.93846>
4015ac: 48 8d 50 08 lea 0x8(%rax),%rdx
4015b0: 48 8b 40 08 mov 0x8(%rax),%rax
4015b4: 48 89 15 55 1a 00 00 mov %rdx,0x1a55(%rip) # 403010 <p.93846>
4015bb: 48 85 c0 test %rax,%rax
4015be: 75 e3 jne 4015a3 <__do_global_dtors+0x13>
4015c0: 48 83 c4 28 add $0x28,%rsp
4015c4: c3 retq
4015c5: 90 nop
4015c6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4015cd: 00 00 00
00000000004015d0 <__do_global_ctors>:
4015d0: 56 push %rsi
4015d1: 53 push %rbx
4015d2: 48 83 ec 28 sub $0x28,%rsp
4015d6: 48 8b 0d 23 2d 00 00 mov 0x2d23(%rip),%rcx # 404300 <.refptr.__CTOR_LIST__>
4015dd: 48 8b 11 mov (%rcx),%rdx
4015e0: 83 fa ff cmp $0xffffffff,%edx
4015e3: 89 d0 mov %edx,%eax
4015e5: 74 39 je 401620 <__do_global_ctors+0x50>
4015e7: 85 c0 test %eax,%eax
4015e9: 74 20 je 40160b <__do_global_ctors+0x3b>
4015eb: 89 c2 mov %eax,%edx
4015ed: 83 e8 01 sub $0x1,%eax
4015f0: 48 8d 1c d1 lea (%rcx,%rdx,8),%rbx
4015f4: 48 29 c2 sub %rax,%rdx
4015f7: 48 8d 74 d1 f8 lea -0x8(%rcx,%rdx,8),%rsi
4015fc: 0f 1f 40 00 nopl 0x0(%rax)
401600: ff 13 callq *(%rbx)
401602: 48 83 eb 08 sub $0x8,%rbx
401606: 48 39 f3 cmp %rsi,%rbx
401609: 75 f5 jne 401600 <__do_global_ctors+0x30>
40160b: 48 8d 0d 7e ff ff ff lea -0x82(%rip),%rcx # 401590 <__do_global_dtors>
401612: 48 83 c4 28 add $0x28,%rsp
401616: 5b pop %rbx
401617: 5e pop %rsi
401618: e9 f3 fe ff ff jmpq 401510 <atexit>
40161d: 0f 1f 00 nopl (%rax)
401620: 31 c0 xor %eax,%eax
401622: eb 02 jmp 401626 <__do_global_ctors+0x56>
401624: 89 d0 mov %edx,%eax
401626: 44 8d 40 01 lea 0x1(%rax),%r8d
40162a: 4a 83 3c c1 00 cmpq $0x0,(%rcx,%r8,8)
40162f: 4c 89 c2 mov %r8,%rdx
401632: 75 f0 jne 401624 <__do_global_ctors+0x54>
401634: eb b1 jmp 4015e7 <__do_global_ctors+0x17>
401636: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40163d: 00 00 00
0000000000401640 <__main>:
401640: 8b 05 ea 59 00 00 mov 0x59ea(%rip),%eax # 407030 <initialized>
401646: 85 c0 test %eax,%eax
401648: 74 06 je 401650 <__main+0x10>
40164a: c3 retq
40164b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
401650: c7 05 d6 59 00 00 01 movl $0x1,0x59d6(%rip) # 407030 <initialized>
401657: 00 00 00
40165a: e9 71 ff ff ff jmpq 4015d0 <__do_global_ctors>
40165f: 90 nop
I think what you are looking for are these instructions:
40155d: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%rbp)
401564: 48 8d 45 f8 lea -0x8(%rbp),%rax
401568: 48 89 c1 mov %rax,%rcx
40156b: e8 c0 16 00 00 callq 402c30 <_ZN9TestClass6xPlus2Ev>
401570: 89 45 fc mov %eax,-0x4(%rbp)
These match with the code from main:
TestClass tc1 = {5};
int variable = tc1.xPlus2();
At address 40155d the field tc1.x is initialized with the value 5.
At address 401564 the pointer to tc1 is loaded into the register %rax
At address 401568 the pointer to tc1 is copied into the register %rcx
At address 40156b is the call of the method tc1.xPlus2()
At address 401570 the result is store in variable
Your observations are mostly correct. rcx holds the this pointer to the object on which the method was called. x is stored in the first area of memory that the this pointer points to, so that is why rcx was dereferenced and the result added to. It is the responsibility of the caller to make sure that rcx is the address of the object before invoking the function. We can see main prepare rcx by setting it to an address in its stack frame. You are correct that the compiler produced inefficient code here and did not need to use the stack. Compiling with higher optimization levels -O1, -O2, or -O3 will likely fix that. These higher optimizations will probably get rid of the nops too, since they are used for function alignment. You can mostly ignore __main. It's used for libc initialization.

ptrace(PTRACE_PEEKDATA) , problem reading from memory address

I'm implementing a simple debugger and I'm trying to read data from a child process memory address with ptrace (for create breakpoint), but I keep getting input/output error.
This is how my program works:
Child process calls to:
ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
execl(prog_name, prog_name, 0);
At this point the child process waits for a signal.
Now the parent process calls to:
errno=0
uint64_t addrVal = ptrace(PTRACE_PEEKDATA, m_pid, m_addr, nullptr);
perror("failed");
and there is where it fails.
perror returns input/output error and addrVal keep getting 0xffffffffffffffff value.
also, there is might be a chance that im trying to look at the wrong address,
i used objdump to find the address but it seems it just giving me the offset and not the actual address
0000000000001165 <main>:
1165: 55 push %rbp
1166: 48 89 e5 mov %rsp,%rbp
1169: 48 8d 35 95 0e 00 00 lea 0xe95(%rip),%rsi # 2005 <_ZStL19piecewise_construct+0x1>
1170: 48 8d 3d e9 2e 00 00 lea 0x2ee9(%rip),%rdi # 4060 <_ZSt4cout##GLIBCXX_3.4>
1177: e8 c4 fe ff ff callq 1040 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc#plt>
117c: 48 89 c2 mov %rax,%rdx
117f: 48 8b 05 4a 2e 00 00 mov 0x2e4a(%rip),%rax # 3fd0 <_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_#GLIBCXX_3.4>
1186: 48 89 c6 mov %rax,%rsi
1189: 48 89 d7 mov %rdx,%rdi
118c: e8 bf fe ff ff callq 1050 <_ZNSolsEPFRSoS_E#plt>
1191: b8 00 00 00 00 mov $0x0,%eax
1196: 5d pop %rbp
1197: c3 retq
these are offsets or the actual address?
thanks!

Member initializer list, pointer initialization without argument

In a large framework which used to use many smart pointers and now uses raw pointers, I come across situations like this quite often:
class A {
public:
int* m;
A() : m() {}
};
The reason is because int* m used to be a smart pointer and so the initializer list called a default constructor. Now that int* m is a raw pointer I am not certain if this is equivalent to:
class A {
public:
int* m;
A() : m(nullptr) {}
};
Without the explicit nullptr is A::m still initialized to zero? A look at no optimization objdump -d makes it appear to be yes but I am not certain. The reason I feel that the answer is yes is due to this line in the objdump -d (I posted more of the objdump -d below):
400644: 48 c7 00 00 00 00 00 movq $0x0,(%rax)
Little program that tries to find undefined behavior:
class A {
public:
int* m;
A() : m(nullptr) {}
};
int main() {
A buf[1000000];
unsigned int count = 0;
for (unsigned int i = 0; i < 1000000; ++i) {
count += buf[i].m ? 1 : 0;
}
return count;
}
Compilation, execution, and return value:
g++ -std=c++14 -O0 foo.cpp
./a.out; echo $?
0
Relevant assembly sections from objdump -d:
00000000004005b8 <main>:
4005b8: 55 push %rbp
4005b9: 48 89 e5 mov %rsp,%rbp
4005bc: 41 54 push %r12
4005be: 53 push %rbx
4005bf: 48 81 ec 10 12 7a 00 sub $0x7a1210,%rsp
4005c6: 48 8d 85 e0 ed 85 ff lea -0x7a1220(%rbp),%rax
4005cd: bb 3f 42 0f 00 mov $0xf423f,%ebx
4005d2: 49 89 c4 mov %rax,%r12
4005d5: eb 10 jmp 4005e7 <main+0x2f>
4005d7: 4c 89 e7 mov %r12,%rdi
4005da: e8 59 00 00 00 callq 400638 <_ZN1AC1Ev>
4005df: 49 83 c4 08 add $0x8,%r12
4005e3: 48 83 eb 01 sub $0x1,%rbx
4005e7: 48 83 fb ff cmp $0xffffffffffffffff,%rbx
4005eb: 75 ea jne 4005d7 <main+0x1f>
4005ed: c7 45 ec 00 00 00 00 movl $0x0,-0x14(%rbp)
4005f4: c7 45 e8 00 00 00 00 movl $0x0,-0x18(%rbp)
4005fb: eb 23 jmp 400620 <main+0x68>
4005fd: 8b 45 e8 mov -0x18(%rbp),%eax
400600: 48 8b 84 c5 e0 ed 85 mov -0x7a1220(%rbp,%rax,8),%rax
400607: ff
400608: 48 85 c0 test %rax,%rax
40060b: 74 07 je 400614 <main+0x5c>
40060d: b8 01 00 00 00 mov $0x1,%eax
400612: eb 05 jmp 400619 <main+0x61>
400614: b8 00 00 00 00 mov $0x0,%eax
400619: 01 45 ec add %eax,-0x14(%rbp)
40061c: 83 45 e8 01 addl $0x1,-0x18(%rbp)
400620: 81 7d e8 3f 42 0f 00 cmpl $0xf423f,-0x18(%rbp)
400627: 76 d4 jbe 4005fd <main+0x45>
400629: 8b 45 ec mov -0x14(%rbp),%eax
40062c: 48 81 c4 10 12 7a 00 add $0x7a1210,%rsp
400633: 5b pop %rbx
400634: 41 5c pop %r12
400636: 5d pop %rbp
400637: c3 retq
0000000000400638 <_ZN1AC1Ev>:
400638: 55 push %rbp
400639: 48 89 e5 mov %rsp,%rbp
40063c: 48 89 7d f8 mov %rdi,-0x8(%rbp)
400640: 48 8b 45 f8 mov -0x8(%rbp),%rax
400644: 48 c7 00 00 00 00 00 movq $0x0,(%rax)
40064b: 5d pop %rbp
40064c: c3 retq
40064d: 0f 1f 00 nopl (%rax)
Empty () initializer stands for default-initialization in C++98 and for value-initialization in C++03 and later. For scalar types (including pointers) value-initialization/default-initialization leads to zero-initialization.
Which means that in your case m() and m(nullptr) will have exactly the same effect: in both cases m is initialized as a null pointer. In C++ it was like that since the beginning of standardized times.

Why symbols malloc, __malloc and __libc_malloc point to the same code address?

When I grep malloc from the symbol table, with the following command
readelf -s bin | grep malloc
I can see symbols malloc, __malloc and __libc_malloc share the same code address. I can get the PC address, want to know when a user program calls malloc, but __malloc and __libc_malloc gave me noisy information, any good ways to differentiate malloc out? As I compiled the binary with -static, so dlsym doesn't work in this case.
You're not going to be able to tell them apart unless you use dynamic linking as they will be the same thing, and the act of static linking will replace the name references with the address of the routine.
Take an example:
#include <stdlib.h>
extern void *__malloc(size_t);
extern void *__libc_malloc(size_t);
int
main(int argc, char **argv)
{
void *v = malloc(200);
free(v);
v = __malloc(200);
free(v);
v = __libc_malloc(200);
free(v);
return 0;
}
When compiled using: gcc -static -o example example.c, and then we disassemble the main routine we see:
40103e: 55 push %rbp
40103f: 48 89 e5 mov %rsp,%rbp
401042: 48 83 ec 20 sub $0x20,%rsp
401046: 89 7d ec mov %edi,-0x14(%rbp)
401049: 48 89 75 e0 mov %rsi,-0x20(%rbp)
40104d: bf c8 00 00 00 mov $0xc8,%edi
401052: e8 19 52 00 00 callq 406270 <__libc_malloc>
401057: 48 89 45 f8 mov %rax,-0x8(%rbp)
40105b: 48 8b 45 f8 mov -0x8(%rbp),%rax
40105f: 48 89 c7 mov %rax,%rdi
401062: e8 09 56 00 00 callq 406670 <__cfree>
401067: bf c8 00 00 00 mov $0xc8,%edi
40106c: e8 ff 51 00 00 callq 406270 <__libc_malloc>
401071: 48 89 45 f8 mov %rax,-0x8(%rbp)
401075: 48 8b 45 f8 mov -0x8(%rbp),%rax
401079: 48 89 c7 mov %rax,%rdi
40107c: e8 ef 55 00 00 callq 406670 <__cfree>
401081: bf c8 00 00 00 mov $0xc8,%edi
401086: e8 e5 51 00 00 callq 406270 <__libc_malloc>
40108b: 48 89 45 f8 mov %rax,-0x8(%rbp)
40108f: 48 8b 45 f8 mov -0x8(%rbp),%rax
401093: 48 89 c7 mov %rax,%rdi
401096: e8 d5 55 00 00 callq 406670 <__cfree>
40109b: b8 00 00 00 00 mov $0x0,%eax
4010a0: c9 leaveq
4010a1: c3 retq
4010a2: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4010a9: 00 00 00
4010ac: 0f 1f 40 00 nopl 0x0(%rax)
i.e. the code doesn't differentiate the entries.
Now, if you use dynamic linking; you get a different result. For one thing, __malloc is not available in the resulting binary - this is because the __malloc name is a side-effect of the static linking (there is a way to prevent it from being produced, but the mechanism escapes me at the moment). So when we compile the binary (removing the __malloc call), main looks like:
40058d: 55 push %rbp
40058e: 48 89 e5 mov %rsp,%rbp
400591: 48 83 ec 20 sub $0x20,%rsp
400595: 89 7d ec mov %edi,-0x14(%rbp)
400598: 48 89 75 e0 mov %rsi,-0x20(%rbp)
40059c: bf c8 00 00 00 mov $0xc8,%edi
4005a1: e8 ea fe ff ff callq 400490 <malloc#plt>
4005a6: 48 89 45 f8 mov %rax,-0x8(%rbp)
4005aa: 48 8b 45 f8 mov -0x8(%rbp),%rax
4005ae: 48 89 c7 mov %rax,%rdi
4005b1: e8 9a fe ff ff callq 400450 <free#plt>
4005b6: bf c8 00 00 00 mov $0xc8,%edi
4005bb: e8 c0 fe ff ff callq 400480 <__libc_malloc#plt>
4005c0: 48 89 45 f8 mov %rax,-0x8(%rbp)
4005c4: 48 8b 45 f8 mov -0x8(%rbp),%rax
4005c8: 48 89 c7 mov %rax,%rdi
4005cb: e8 80 fe ff ff callq 400450 <free#plt>
4005d0: b8 00 00 00 00 mov $0x0,%eax
4005d5: c9 leaveq
4005d6: c3 retq
4005d7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
4005de: 00 00
So to determine the use of __libc_malloc or malloc, you can check for calls to the plt entry for the routine.
This of course all assumes that you're actually performing some type of static analysis of the binary. If you're doing this at run-time, the usual method is library interception using LD_PRELOAD, which is a whole different question.

Is this an optimization bug in g++?

I'm not sure whether I've found a bug in g++ (4.4.1-4ubuntu9), or if I'm doing
something wrong. What I believe I'm seeing is a bug introduced by enabling
optimization with g++ -O2. I've tried to distill the code down to just the
relevant parts.
When optimization is enabled, I have an ASSERT which is failing. When
optimization is disabled, the same ASSERT does not fail. I think I've tracked
it down to the optimization of one function and its callers.
The System
Language: C++
Ubuntu 9.10
g++-4.4.real (Ubuntu 4.4.1-4ubuntu9) 4.4.1
Linux 2.6.31-22-server x86_64
Optimization Enabled
Object compiled with:
g++ -DHAVE_CONFIG_H -I. -fPIC -g -O2 -MT file.o -MD -MP -MF .deps/file.Tpo -c -o file.o file.cpp
And here is the relevant code from objdump -dg file.o.
00000000000018b0 <helper_function>:
;; This function takes two parameters:
;; pointer to int: %rdi
;; pointer to int[]: %rsi
18b0: 0f b6 07 movzbl (%rdi),%eax
18b3: 83 f8 12 cmp $0x12,%eax
18b6: 74 60 je 1918 <helper_function+0x68>
18b8: 83 f8 17 cmp $0x17,%eax
18bb: 74 5b je 1918 <helper_function+0x68>
...
1918: c7 06 32 00 00 00 movl $0x32,(%rsi)
191e: 66 90 xchg %ax,%ax
1920: c3 retq
0000000000005290 <buggy_invoker>:
... snip ...
52a0: 48 81 ec c8 01 00 00 sub $0x1c8,%rsp
52a7: 48 8d 84 24 a0 01 00 lea 0x1a0(%rsp),%rax
52ae: 00
52af: 48 c7 84 24 a0 01 00 movq $0x0,0x1a0(%rsp)
52b6: 00 00 00 00 00
52bb: 48 c7 84 24 a8 01 00 movq $0x0,0x1a8(%rsp)
52c2: 00 00 00 00 00
52c7: c7 84 24 b0 01 00 00 movl $0x0,0x1b0(%rsp)
52ce: 00 00 00 00
52d2: 4c 8d 7c 24 20 lea 0x20(%rsp),%r15
52d7: 48 89 c6 mov %rax,%rsi
52da: 48 89 44 24 08 mov %rax,0x8(%rsp)
;; ***** BUG HERE *****
;; Pointer to int[] loaded into %rsi
;; But where is %rdi populated?
52df: e8 cc c5 ff ff callq 18b0 <helper_function>
0000000000005494 <perfectly_fine_invoker>:
5494: 48 83 ec 20 sub $0x20,%rsp
5498: 0f ae f0 mfence
549b: 48 8d 7c 24 30 lea 0x30(%rsp),%rdi
54a0: 48 89 e6 mov %rsp,%rsi
54a3: 48 c7 04 24 00 00 00 movq $0x0,(%rsp)
54aa: 00
54ab: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
54b2: 00 00
54b4: c7 44 24 10 00 00 00 movl $0x0,0x10(%rsp)
54bb: 00
;; Non buggy invocation here: both %rdi and %rsi loaded correctly.
54bc: e8 ef c3 ff ff callq 18b0 <helper_function>
Optimization Disabled
Now compiled with:
g++ -DHAVE_CONFIG_H -I. -fPIC -g -O0 -MT file.o -MD -MP -MF .deps/file.Tpo -c -o file.o file.cpp
0000000000008d27 <helper_function>:
;; Still the same parameters here, but it looks a little different.
... snip ...
8d2b: 48 89 7d e8 mov %rdi,-0x18(%rbp)
8d2f: 48 89 75 e0 mov %rsi,-0x20(%rbp)
8d33: 48 8b 45 e8 mov -0x18(%rbp),%rax
8d37: 0f b6 00 movzbl (%rax),%eax
8d3a: 0f b6 c0 movzbl %al,%eax
8d3d: 89 45 fc mov %eax,-0x4(%rbp)
8d40: 8b 45 fc mov -0x4(%rbp),%eax
8d43: 83 f8 17 cmp $0x17,%eax
8d46: 74 40 je 8d88 <helper_function+0x61>
...
000000000000948a <buggy_invoker>:
948a: 55 push %rbp
948b: 48 89 e5 mov %rsp,%rbp
948e: 41 54 push %r12
9490: 53 push %rbx
9491: 48 81 ec c0 01 00 00 sub $0x1c0,%rsp
9498: 48 89 bd 38 fe ff ff mov %rdi,-0x1c8(%rbp)
949f: 48 89 b5 30 fe ff ff mov %rsi,-0x1d0(%rbp)
94a6: 48 c7 45 c0 00 00 00 movq $0x0,-0x40(%rbp)
94ad: 00
94ae: 48 c7 45 c8 00 00 00 movq $0x0,-0x38(%rbp)
94b5: 00
94b6: c7 45 d0 00 00 00 00 movl $0x0,-0x30(%rbp)
94bd: 48 8d 55 c0 lea -0x40(%rbp),%rdx
94c1: 48 8b 85 38 fe ff ff mov -0x1c8(%rbp),%rax
94c8: 48 89 d6 mov %rdx,%rsi
94cb: 48 89 c7 mov %rax,%rdi
;; ***** NOT BUGGY HERE *****
;; Now, without optimization, both %rdi and %rsi loaded correctly.
94ce: e8 54 f8 ff ff callq 8d27 <helper_function>
0000000000008eec <different_perfectly_fine_invoker>:
8eec: 55 push %rbp
8eed: 48 89 e5 mov %rsp,%rbp
8ef0: 48 83 ec 30 sub $0x30,%rsp
8ef4: 48 89 7d d8 mov %rdi,-0x28(%rbp)
8ef8: 48 c7 45 e0 00 00 00 movq $0x0,-0x20(%rbp)
8eff: 00
8f00: 48 c7 45 e8 00 00 00 movq $0x0,-0x18(%rbp)
8f07: 00
8f08: c7 45 f0 00 00 00 00 movl $0x0,-0x10(%rbp)
8f0f: 48 8d 55 e0 lea -0x20(%rbp),%rdx
8f13: 48 8b 45 d8 mov -0x28(%rbp),%rax
8f17: 48 89 d6 mov %rdx,%rsi
8f1a: 48 89 c7 mov %rax,%rdi
;; Another example of non-optimized call to that function.
8f1d: e8 05 fe ff ff callq 8d27 <helper_function>
The Original C++ Code
This is a sanitized version of the original C++. I've just changed some names
and removed irrelevant code. Forgive my paranoia, I just don't want to expose
too much code from unpublished and unreleased work :-).
static void helper_function(my_struct_t *e, int *outArr)
{
unsigned char event_type = e->header.type;
if (event_type == event_A || event_type == event_B) {
outArr[0] = action_one;
} else if (event_type == event_C) {
outArr[0] = action_one;
outArr[1] = action_two;
} else if (...) { ... }
}
static void buggy_invoker(my_struct_t *e, predicate_t pred)
{
// MAX_ACTIONS is #defined to 5
int action_array[MAX_ACTIONS] = {0};
helper_function(e, action_array);
...
}
static int has_any_actions(my_struct_t *e)
{
int actions[MAX_ACTIONS] = {0};
helper_function(e, actions);
return actions[0] != 0;
}
// *** ENTRY POINT to this code is this function (note not static).
void perfectly_fine_invoker(my_struct_t e, predicate_t pred)
{
memfence();
if (has_any_actions(&e)) {
buggy_invoker(&e, pred);
}
...
}
If you think I've obfuscated or eliminiated too much, let me know. Users of
this code call 'perfectly_fine_invoker'. With optimization, g++ optimizes the
'has_any_actions' function away into a direct call to 'helper_function', which
you can see in the assembly.
The Question
So, my question is, does it look like a buggy optimization to anyone else?
If it would be helpful, I could post a sanitized version of the original C++ code.
This is my first posting to Stack Overflow, so please let me know if I can do
anything to make the question clearer, or provide any additional information.
The Answer
Edit (several days after the fact):
I accepted an answer below to my question -- it was not an optimization bug in g++, I was just looking at the assembly code wrong.
However, for whoever may be viewing this question in the future, I've found the answer. I did some reading on undefined behavior in C ( http://blog.regehr.org/archives/213 and http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html ) and some of the descriptions of the compiler optimizing away functions with undefined behavior seemed eerily familiar.
I added some NULL-pointer checks to the function 'helper_function' and lo and behold... bug goes away. I should have had the NULL-pointer checks to begin with, but apparently not having them allowed g++ to do whatever it wanted (in my case, optimize away the call).
Hope this information helps someone down the road.
I think you are looking at the wrong thing. I imagine the compiler notice that your function is short and doesn't touch the %rdi register so it just leaves it alone (you have the same variable as the first parameter, which I guess is what is placed in %rdi. See page 21 here http://www.x86-64.org/documentation/abi.pdf)
If you look at the unoptimized version it saves the %rdi register on this line
9498: 48 89 bd 38 fe ff ff mov %rdi,-0x1c8(%rbp)
...and then later just before calling helper_function it moves the saved value into %rax that is moved into %rdi.
94c1: 48 8b 85 38 fe ff ff mov -0x1c8(%rbp),%rax
94c8: 48 89 d6 mov %rdx,%rsi
94cb: 48 89 c7 mov %rax,%rdi
When optimizing it the compiler just get rid of all that moving back and forth.