C++ explicit template definition - code is still duplicated - c++

I have some template code that implements pretty heavy computations, but I only need it for floats and doubles. The goal is that the template instantiation is only done once in one compilation unit and not repeated for every file.
I tried to follow the ideas from the following Stackoverflow posts:
using extern template (C++11)
Can I put `extern template` into a header file?
separating constructor implementation with template from header file
and similiar duplicate questions. I came up with the following test to illustrate the issue:
A.h
#pragma once
#include <cmath>
template<typename T>
struct A
{
static T foo(T a, T b)
{
//do some heavy computations
T v1 = pow(a, b);
return pow(v1, b);
}
};
//explicit template instantiations, the declaration
extern template struct A<float>;
extern template struct A<double>;
A.cpp
#include "A.h"
//explicit template instantiations, the definition
template struct A<float>;
template struct A<double>;
Main.cpp
#include "A.h"
int main()
{
//use A
float result = A<float>::foo(0, 0);
return (int)result; //return it so that it doesn't get optimized away
}
When I now look at the generated .obj file (dumpbin /DISASM), I get the following output:
A.obj
Dump of file A.obj
File Type: COFF OBJECT
?foo#?$A#M##SAMMM#Z (public: static float __cdecl A<float>::foo(float,float)):
0000000000000000: F3 0F 11 4C 24 10 movss dword ptr [rsp+10h],xmm1
0000000000000006: F3 0F 11 44 24 08 movss dword ptr [rsp+8],xmm0
000000000000000C: 55 push rbp
000000000000000D: 57 push rdi
000000000000000E: 48 81 EC 18 01 00 sub rsp,118h
00
0000000000000015: 48 8D 6C 24 30 lea rbp,[rsp+30h]
000000000000001A: 48 8B FC mov rdi,rsp
000000000000001D: B9 46 00 00 00 mov ecx,46h
0000000000000022: B8 CC CC CC CC mov eax,0CCCCCCCCh
0000000000000027: F3 AB rep stos dword ptr [rdi]
0000000000000029: F3 0F 10 8D 08 01 movss xmm1,dword ptr [rbp+108h]
00 00
0000000000000031: F3 0F 10 85 00 01 movss xmm0,dword ptr [rbp+100h]
00 00
0000000000000039: E8 00 00 00 00 call ?pow##YAMMM#Z
000000000000003E: F3 0F 11 45 04 movss dword ptr [rbp+4],xmm0
0000000000000043: F3 0F 10 8D 08 01 movss xmm1,dword ptr [rbp+108h]
00 00
000000000000004B: F3 0F 10 45 04 movss xmm0,dword ptr [rbp+4]
0000000000000050: E8 00 00 00 00 call ?pow##YAMMM#Z
0000000000000055: 48 8D A5 E8 00 00 lea rsp,[rbp+0E8h]
00
000000000000005C: 5F pop rdi
000000000000005D: 5D pop rbp
000000000000005E: C3 ret
?foo#?$A#N##SANNN#Z (public: static double __cdecl A<double>::foo(double,double)):
0000000000000000: F2 0F 11 4C 24 10 movsd mmword ptr [rsp+10h],xmm1
0000000000000006: F2 0F 11 44 24 08 movsd mmword ptr [rsp+8],xmm0
000000000000000C: 55 push rbp
000000000000000D: 57 push rdi
000000000000000E: 48 81 EC 18 01 00 sub rsp,118h
00
0000000000000015: 48 8D 6C 24 30 lea rbp,[rsp+30h]
000000000000001A: 48 8B FC mov rdi,rsp
000000000000001D: B9 46 00 00 00 mov ecx,46h
0000000000000022: B8 CC CC CC CC mov eax,0CCCCCCCCh
0000000000000027: F3 AB rep stos dword ptr [rdi]
0000000000000029: F2 0F 10 8D 08 01 movsd xmm1,mmword ptr [rbp+108h]
00 00
0000000000000031: F2 0F 10 85 00 01 movsd xmm0,mmword ptr [rbp+100h]
00 00
0000000000000039: E8 00 00 00 00 call pow
000000000000003E: F2 0F 11 45 08 movsd mmword ptr [rbp+8],xmm0
0000000000000043: F2 0F 10 8D 08 01 movsd xmm1,mmword ptr [rbp+108h]
00 00
000000000000004B: F2 0F 10 45 08 movsd xmm0,mmword ptr [rbp+8]
0000000000000050: E8 00 00 00 00 call pow
0000000000000055: 48 8D A5 E8 00 00 lea rsp,[rbp+0E8h]
00
000000000000005C: 5F pop rdi
000000000000005D: 5D pop rbp
000000000000005E: C3 ret
....
Main.obj
Dump of file Main.obj
File Type: COFF OBJECT
?foo#?$A#M##SAMMM#Z (public: static float __cdecl A<float>::foo(float,float)):
0000000000000000: F3 0F 11 4C 24 10 movss dword ptr [rsp+10h],xmm1
0000000000000006: F3 0F 11 44 24 08 movss dword ptr [rsp+8],xmm0
000000000000000C: 55 push rbp
000000000000000D: 57 push rdi
000000000000000E: 48 81 EC 18 01 00 sub rsp,118h
00
0000000000000015: 48 8D 6C 24 30 lea rbp,[rsp+30h]
000000000000001A: 48 8B FC mov rdi,rsp
000000000000001D: B9 46 00 00 00 mov ecx,46h
0000000000000022: B8 CC CC CC CC mov eax,0CCCCCCCCh
0000000000000027: F3 AB rep stos dword ptr [rdi]
0000000000000029: F3 0F 10 8D 08 01 movss xmm1,dword ptr [rbp+108h]
00 00
0000000000000031: F3 0F 10 85 00 01 movss xmm0,dword ptr [rbp+100h]
00 00
0000000000000039: E8 00 00 00 00 call ?pow##YAMMM#Z
000000000000003E: F3 0F 11 45 04 movss dword ptr [rbp+4],xmm0
0000000000000043: F3 0F 10 8D 08 01 movss xmm1,dword ptr [rbp+108h]
00 00
000000000000004B: F3 0F 10 45 04 movss xmm0,dword ptr [rbp+4]
0000000000000050: E8 00 00 00 00 call ?pow##YAMMM#Z
0000000000000055: 48 8D A5 E8 00 00 lea rsp,[rbp+0E8h]
00
000000000000005C: 5F pop rdi
000000000000005D: 5D pop rbp
000000000000005E: C3 ret
....
A::foo is instantiated in A.obj as expected. But the code is again put into Main.obj as well, completely ignoring the extern keyword.
How can I tell the compiler (Visual Studio 2017, Release mode) to NOT inline the method, but to use the version from A.obj?

You can do that with __declspec(noinline).
But inlined version will likely be faster. If you worry about binary size, your .exe file will only have a single instance of that function. The code from A.obj is unused and will be discarded by linker during dead code elimination step.
Update: Put this in your A.h:
static __declspec( noinline ) T foo( T a, T b )
{
//do some heavy computations
T v1 = pow( a, b );
return pow( v1, b );
}
I’ve built with Visual C++ 2017 15.6.7, Release 32 and 64 bits, for both platforms Main.cpp compiles to this:
; Line 5
call ?foo#?$A#M##SAMMM#Z ; A<float>::foo
; Line 6
cvttss2si eax, xmm0
However, if you’re doing that trying to decrease compilation time, I’m not sure noinline gonna help. Instead, remove the function body from A.h (leave declaration), move it into A.cpp. Ideally, also remove eigen headers from A.h (or leave bare minimum that define data structures), and include eigen headers into A.cpp.

Related

How are non-static, non-virtual methods implemented in C++?

I wanted to know how methods are implemented in C++. I wanted to know how methods are implemented "under the hood".
So, I have made a simple C++ program which has a class with 1 non static field and 1 non static, non virtual method.
Then I instantiated the class in the main function and called the method. I have used objdump -d option in order to see the CPU instructions of this program. I have a x86-64 processor.
Here's the code:
#include<stdio.h>
class TestClass {
public:
int x;
int xPlus2(){
return x + 2;
}
};
int main(){
TestClass tc1 = {5};
int variable = tc1.xPlus2();
printf("%d \n", variable);
return 0;
}
Here are instructions for the method xPlus2:
0000000000402c30 <_ZN9TestClass6xPlus2Ev>:
402c30: 55 push %rbp
402c31: 48 89 e5 mov %rsp,%rbp
402c34: 48 89 4d 10 mov %rcx,0x10(%rbp)
402c38: 48 8b 45 10 mov 0x10(%rbp),%rax
402c3c: 8b 00 mov (%rax),%eax
402c3e: 83 c0 02 add $0x2,%eax
402c41: 5d pop %rbp
402c42: c3 retq
402c43: 90 nop
402c44: 90 nop
402c45: 90 nop
402c46: 90 nop
402c47: 90 nop
402c48: 90 nop
402c49: 90 nop
402c4a: 90 nop
402c4b: 90 nop
402c4c: 90 nop
402c4d: 90 nop
402c4e: 90 nop
402c4f: 90 nop
If I understand it correctly, these instructions can be replaced by just 3 instructions, because I believe that I don't need to use the stack, I think the compiler used it redundantly:
mov (%rcx), eax
add $2, eax
retq
and then maybe I still need lots of nop instructions for synchronization purposes or whatnot. If you look at the CPU instructions, it looks like the value that x field has is stored at the location in memory which rcx register holds. You will see the rest of the CPU instructions in a moment. It is a little bit hard for me to track what has happened here (especially what is going on with the call of _main function), I don't even know what parts of assembly are important to look at. Compiler produces main function (as I expected), but then it also produced _main function which is called from the main, there are some weird functions in between those two as well.
Here are other parts of the assembly that I think may be interesting:
0000000000401550 <main>:
401550: 55 push %rbp
401551: 48 89 e5 mov %rsp,%rbp
401554: 48 83 ec 30 sub $0x30,%rsp
401558: e8 e3 00 00 00 callq 401640 <__main>
40155d: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%rbp)
401564: 48 8d 45 f8 lea -0x8(%rbp),%rax
401568: 48 89 c1 mov %rax,%rcx
40156b: e8 c0 16 00 00 callq 402c30 <_ZN9TestClass6xPlus2Ev>
401570: 89 45 fc mov %eax,-0x4(%rbp)
401573: 8b 45 fc mov -0x4(%rbp),%eax
401576: 89 c2 mov %eax,%edx
401578: 48 8d 0d 81 2a 00 00 lea 0x2a81(%rip),%rcx # 404000 <.rdata>
40157f: e8 ec 14 00 00 callq 402a70 <printf>
401584: b8 00 00 00 00 mov $0x0,%eax
401589: 48 83 c4 30 add $0x30,%rsp
40158d: 5d pop %rbp
40158e: c3 retq
40158f: 90 nop
0000000000401590 <__do_global_dtors>:
401590: 48 83 ec 28 sub $0x28,%rsp
401594: 48 8b 05 75 1a 00 00 mov 0x1a75(%rip),%rax # 403010 <p.93846>
40159b: 48 8b 00 mov (%rax),%rax
40159e: 48 85 c0 test %rax,%rax
4015a1: 74 1d je 4015c0 <__do_global_dtors+0x30>
4015a3: ff d0 callq *%rax
4015a5: 48 8b 05 64 1a 00 00 mov 0x1a64(%rip),%rax # 403010 <p.93846>
4015ac: 48 8d 50 08 lea 0x8(%rax),%rdx
4015b0: 48 8b 40 08 mov 0x8(%rax),%rax
4015b4: 48 89 15 55 1a 00 00 mov %rdx,0x1a55(%rip) # 403010 <p.93846>
4015bb: 48 85 c0 test %rax,%rax
4015be: 75 e3 jne 4015a3 <__do_global_dtors+0x13>
4015c0: 48 83 c4 28 add $0x28,%rsp
4015c4: c3 retq
4015c5: 90 nop
4015c6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4015cd: 00 00 00
00000000004015d0 <__do_global_ctors>:
4015d0: 56 push %rsi
4015d1: 53 push %rbx
4015d2: 48 83 ec 28 sub $0x28,%rsp
4015d6: 48 8b 0d 23 2d 00 00 mov 0x2d23(%rip),%rcx # 404300 <.refptr.__CTOR_LIST__>
4015dd: 48 8b 11 mov (%rcx),%rdx
4015e0: 83 fa ff cmp $0xffffffff,%edx
4015e3: 89 d0 mov %edx,%eax
4015e5: 74 39 je 401620 <__do_global_ctors+0x50>
4015e7: 85 c0 test %eax,%eax
4015e9: 74 20 je 40160b <__do_global_ctors+0x3b>
4015eb: 89 c2 mov %eax,%edx
4015ed: 83 e8 01 sub $0x1,%eax
4015f0: 48 8d 1c d1 lea (%rcx,%rdx,8),%rbx
4015f4: 48 29 c2 sub %rax,%rdx
4015f7: 48 8d 74 d1 f8 lea -0x8(%rcx,%rdx,8),%rsi
4015fc: 0f 1f 40 00 nopl 0x0(%rax)
401600: ff 13 callq *(%rbx)
401602: 48 83 eb 08 sub $0x8,%rbx
401606: 48 39 f3 cmp %rsi,%rbx
401609: 75 f5 jne 401600 <__do_global_ctors+0x30>
40160b: 48 8d 0d 7e ff ff ff lea -0x82(%rip),%rcx # 401590 <__do_global_dtors>
401612: 48 83 c4 28 add $0x28,%rsp
401616: 5b pop %rbx
401617: 5e pop %rsi
401618: e9 f3 fe ff ff jmpq 401510 <atexit>
40161d: 0f 1f 00 nopl (%rax)
401620: 31 c0 xor %eax,%eax
401622: eb 02 jmp 401626 <__do_global_ctors+0x56>
401624: 89 d0 mov %edx,%eax
401626: 44 8d 40 01 lea 0x1(%rax),%r8d
40162a: 4a 83 3c c1 00 cmpq $0x0,(%rcx,%r8,8)
40162f: 4c 89 c2 mov %r8,%rdx
401632: 75 f0 jne 401624 <__do_global_ctors+0x54>
401634: eb b1 jmp 4015e7 <__do_global_ctors+0x17>
401636: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40163d: 00 00 00
0000000000401640 <__main>:
401640: 8b 05 ea 59 00 00 mov 0x59ea(%rip),%eax # 407030 <initialized>
401646: 85 c0 test %eax,%eax
401648: 74 06 je 401650 <__main+0x10>
40164a: c3 retq
40164b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
401650: c7 05 d6 59 00 00 01 movl $0x1,0x59d6(%rip) # 407030 <initialized>
401657: 00 00 00
40165a: e9 71 ff ff ff jmpq 4015d0 <__do_global_ctors>
40165f: 90 nop
I think what you are looking for are these instructions:
40155d: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%rbp)
401564: 48 8d 45 f8 lea -0x8(%rbp),%rax
401568: 48 89 c1 mov %rax,%rcx
40156b: e8 c0 16 00 00 callq 402c30 <_ZN9TestClass6xPlus2Ev>
401570: 89 45 fc mov %eax,-0x4(%rbp)
These match with the code from main:
TestClass tc1 = {5};
int variable = tc1.xPlus2();
At address 40155d the field tc1.x is initialized with the value 5.
At address 401564 the pointer to tc1 is loaded into the register %rax
At address 401568 the pointer to tc1 is copied into the register %rcx
At address 40156b is the call of the method tc1.xPlus2()
At address 401570 the result is store in variable
Your observations are mostly correct. rcx holds the this pointer to the object on which the method was called. x is stored in the first area of memory that the this pointer points to, so that is why rcx was dereferenced and the result added to. It is the responsibility of the caller to make sure that rcx is the address of the object before invoking the function. We can see main prepare rcx by setting it to an address in its stack frame. You are correct that the compiler produced inefficient code here and did not need to use the stack. Compiling with higher optimization levels -O1, -O2, or -O3 will likely fix that. These higher optimizations will probably get rid of the nops too, since they are used for function alignment. You can mostly ignore __main. It's used for libc initialization.

C++ corrupts registers [duplicate]

This question already has answers here:
What are callee and caller saved registers?
(6 answers)
Calling convention on x64 [duplicate]
(1 answer)
Closed 1 year ago.
When calling a C++ function, the RAX AND RCX registers are changed.
How can I force the compiler to store the value of the registers?
I could call push pop, but I plan to call more complex functions and it is important for me that the registers are not corrupted (even for floating point numbers).
Architecture: amd64
IDE: Visual Studio 19
.data
extern TestCall: proto
.code
HookFunc proc
mov rax, [rbp + 8h]
call TestCall
ret
HookFunc endp
end
extern "C" void TestCall()
{
cout << "TestCall" << endl;
}
Disassembler
extern "C" __declspec(dllexport) void TestCall()
{
00007FFAA9C76FB0 40 55 push rbp
00007FFAA9C76FB2 57 push rdi
00007FFAA9C76FB3 48 81 EC E8 00 00 00 sub rsp,0E8h
00007FFAA9C76FBA 48 8D 6C 24 20 lea rbp,[rsp+20h]
00007FFAA9C76FBF 48 8D 0D 70 B0 01 00 lea rcx,[__ED185583_dllmain#cpp (07FFAA9C92036h)]
00007FFAA9C76FC6 E8 DD A7 FF FF call __CheckForDebuggerJustMyCode (07FFAA9C717A8h)
cout << "TestCall" << endl;
00007FFAA9C76FCB 48 8D 15 EE EF 00 00 lea rdx,[string "TestCall" (07FFAA9C85FC0h)]
00007FFAA9C76FD2 48 8B 0D 7F 82 01 00 mov rcx,qword ptr [__imp_std::cout (07FFAA9C8F258h)]
00007FFAA9C76FD9 E8 FE A0 FF FF call std::operator<<<std::char_traits<char> > (07FFAA9C710DCh)
00007FFAA9C76FDE 48 8D 15 66 A0 FF FF lea rdx,[std::endl<char,std::char_traits<char> > (07FFAA9C7104Bh)]
00007FFAA9C76FE5 48 8B C8 mov rcx,rax
00007FFAA9C76FE8 FF 15 92 82 01 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FFAA9C8F280h)]
}
00007FFAA9C76FEE 48 8D A5 C8 00 00 00 lea rsp,[rbp+0C8h]
00007FFAA9C76FF5 5F pop rdi
00007FFAA9C76FF6 5D pop rbp
00007FFAA9C76FF7 C3 ret

How std::memory_order_XXX works

I don't understand how std::memory_order_XXX(like memory_order_release/memory_order_acquire ...) works.
From some documents, it shows that these memory mode have different feature, but I'm really confused that they have the same assemble code, what determined the differences?
That code:
static std::atomic<long> gt;
void test1() {
gt.store(1, std::memory_order_release);
gt.store(2, std::memory_order_relaxed);
gt.load(std::memory_order_acquire);
gt.load(std::memory_order_relaxed);
}
Corresponds to:
00000000000007a0 <_Z5test1v>:
7a0: 55 push %rbp
7a1: 48 89 e5 mov %rsp,%rbp
7a4: 48 83 ec 30 sub $0x30,%rsp
**memory_order_release:
7a8: 48 c7 45 f8 01 00 00 movq $0x1,-0x8(%rbp)
7af: 00
7b0: c7 45 e8 03 00 00 00 movl $0x3,-0x18(%rbp)
7b7: 8b 45 e8 mov -0x18(%rbp),%eax
7ba: be ff ff 00 00 mov $0xffff,%esi
7bf: 89 c7 mov %eax,%edi
7c1: e8 b1 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
7c6: 89 45 ec mov %eax,-0x14(%rbp)
7c9: 48 8b 55 f8 mov -0x8(%rbp),%rdx
7cd: 48 8d 05 44 08 20 00 lea 0x200844(%rip),%rax # 201018 <_ZL2gt>
7d4: 48 89 10 mov %rdx,(%rax)
7d7: 0f ae f0 mfence**
**memory_order_relaxed:
7da: 48 c7 45 f0 02 00 00 movq $0x2,-0x10(%rbp)
7e1: 00
7e2: c7 45 e0 00 00 00 00 movl $0x0,-0x20(%rbp)
7e9: 8b 45 e0 mov -0x20(%rbp),%eax
7ec: be ff ff 00 00 mov $0xffff,%esi
7f1: 89 c7 mov %eax,%edi
7f3: e8 7f 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
7f8: 89 45 e4 mov %eax,-0x1c(%rbp)
7fb: 48 8b 55 f0 mov -0x10(%rbp),%rdx
7ff: 48 8d 05 12 08 20 00 lea 0x200812(%rip),%rax # 201018 <_ZL2gt>
806: 48 89 10 mov %rdx,(%rax)
809: 0f ae f0 mfence**
**memory_order_acquire:
80c: c7 45 d8 02 00 00 00 movl $0x2,-0x28(%rbp)
813: 8b 45 d8 mov -0x28(%rbp),%eax
816: be ff ff 00 00 mov $0xffff,%esi
81b: 89 c7 mov %eax,%edi
81d: e8 55 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
822: 89 45 dc mov %eax,-0x24(%rbp)
825: 48 8d 05 ec 07 20 00 lea 0x2007ec(%rip),%rax # 201018 <_ZL2gt>
82c: 48 8b 00 mov (%rax),%rax**
**memory_order_relaxed:
82f: c7 45 d0 00 00 00 00 movl $0x0,-0x30(%rbp)
836: 8b 45 d0 mov -0x30(%rbp),%eax
839: be ff ff 00 00 mov $0xffff,%esi
83e: 89 c7 mov %eax,%edi
840: e8 32 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
845: 89 45 d4 mov %eax,-0x2c(%rbp)
848: 48 8d 05 c9 07 20 00 lea 0x2007c9(%rip),%rax # 201018 <_ZL2gt>
84f: 48 8b 00 mov (%rax),%rax**
852: 90 nop
853: c9 leaveq
854: c3 retq
00000000000008cc <_ZStanSt12memory_orderSt23__memory_order_modifier>:
8cc: 55 push %rbp
8cd: 48 89 e5 mov %rsp,%rbp
8d0: 89 7d fc mov %edi,-0x4(%rbp)
8d3: 89 75 f8 mov %esi,-0x8(%rbp)
8d6: 8b 55 fc mov -0x4(%rbp),%edx
8d9: 8b 45 f8 mov -0x8(%rbp),%eax
8dc: 21 d0 and %edx,%eax
8de: 5d pop %rbp
8df: c3 retq
I expect different memory mode has different implements on assemble code,
but setting different mode value is no effect on assemble, who can explain this?
Each memory model setting has its semantics. Compiler is obliged to satisfy this semantics, meaning that:
It disallows compiler to perform certain optimizations, such as reordering of reads and writes.
It instructs the compiler to propagate the very same message down to the hardware. How it is done, depends on the platform. x86_64 itself provides very strong memory model. Hence in almost all cases you will see no difference in generated assembler code for x86_64 no matter what memory model you choose. However, on RISC architectures (e.g. ARM), you will see the difference because compiler will have to insert memory barriers. Type of memory barrier depends on the selected memory model setting.
EDIT: Have a look at the JSR-133. It is very old and is about Java, but it provides the nicest explanation about memory model from the compiler perspective that I know. In particular, look at the table of memory barrier instructions for different architectures.
Given the code:
#include <atomic>
static std::atomic<long> gt;
void test1() {
gt.store(41, std::memory_order_release);
gt.store(42, std::memory_order_relaxed);
gt.load(std::memory_order_acquire);
gt.load(std::memory_order_relaxed);
}
At decent optimization level there is no garbage assembly moving values around on registers than the stack:
test1():
movq $41, gt(%rip)
movq $42, gt(%rip)
movq gt(%rip), %rax
movq gt(%rip), %rax
ret
We see that the exact same code is generated for the different memory orders; although testing different instructions in the same function in sequence is very bad practice as C++ instructions don't have to be compiled independently and context might influence code generation. But with the current code generation in GCC, it compiles each statement involving an atomic as its own. Good practice is to have a different function for each statement.
The same code is generated here because no special instruction happens to be needed for these memory orders.

C++ assembly code analysis (compiled with clang)

I am trying to figure out how the C++ binary code looks like, especially for virtual function calls. I have come up with few curious things. I have this following C++ code:
#include <iostream>
using namespace std;
class Base {
public:
virtual void print() { cout << "from base" << endl; }
};
class Derived : public Base {
public:
virtual void print() { cout << "from derived" << endl; }
};
int main() {
Base *b;
Derived d;
d.print();
b = &d;
b->print();
return 0;
}
I compiled it with clang++, and then use objdump:
00000000004008b0 <main>:
4008b0: 55 push rbp
4008b1: 48 89 e5 mov rbp,rsp
4008b4: 48 83 ec 20 sub rsp,0x20
4008b8: 48 8d 7d e8 lea rdi,[rbp-0x18]
4008bc: c7 45 fc 00 00 00 00 mov DWORD PTR [rbp-0x4],0x0
4008c3: e8 28 00 00 00 call 4008f0 <Derived::Derived()>
4008c8: 48 8d 7d e8 lea rdi,[rbp-0x18]
4008cc: e8 5f 00 00 00 call 400930 <Derived::print()>
4008d1: 48 8d 7d e8 lea rdi,[rbp-0x18]
4008d5: 48 89 7d f0 mov QWORD PTR [rbp-0x10],rdi
4008d9: 48 8b 7d f0 mov rdi,QWORD PTR [rbp-0x10]
4008dd: 48 8b 07 mov rax,QWORD PTR [rdi]
4008e0: ff 10 call QWORD PTR [rax]
4008e2: 31 c0 xor eax,eax
4008e4: 48 83 c4 20 add rsp,0x20
4008e8: 5d pop rbp
4008e9: c3 ret
4008ea: 66 0f 1f 44 00 00 nop WORD PTR [rax+rax*1+0x0]
My question is why in assembly code, we have the following code:
4008b8: 48 8d 7d e8 lea rdi,[rbp-0x18]
4008d1: 48 8d 7d e8 lea rdi,[rbp-0x18]
The local variable d in main() is stored at location [rbp-0x18]. This is in the automatic storage allocated on the stack for main().
lea rdi,[rbp-0x18]
This line loads the address of d into the rdi register. By convention, member functions of Derived treat rdi as the this pointer.

Compiler mov'es this pointer to wrong address

I have a simple polymorphic construction, with one pure virtual function Foo.
The only flaw in the big project it's used in is that the project uses a couple of global statics for centralized parameter loading and for event logging (can't easily get rid of that legacy code).
Project info:
Platform toolset: v110_xp
MFC in static library
MBCS charset
calling convention: __cdecl
All optimizations disabled
Warning level 4, no warnings on the whole project
Code:
class Base
{
public:
Base(){}
virtual ~Base(void){}
virtual void Foo(void) = 0;
};
class Derived
: public Base
{
public:
Derived(void) : Base(){}
virtual void Foo(void) override
{
double a = sqrt(4.9);
double b = -a;
}
Calling code (doesn't really matter, same behaviour everywhere)
BOOL MainMFCApp::InitInstance()
{
Derived* d = new Derived();
d->Foo();
delete d;
...
}
The problem is that when run in debug (not tested with release), and when we end up inside function Foo, the this pointer is 'corrupted':
this = 0xcccccccc
this.__vfptr = <unable to read memory>
When I dive into the assembly code entering the function I see the following:
13:
14:
15: void Derived::Foo(void)
16: {
015E3570 55 push ebp
015E3571 8B EC mov ebp,esp
015E3573 83 E4 F8 and esp,0FFFFFFF8h
015E3576 81 EC EC 00 00 00 sub esp,0ECh
015E357C 53 push ebx
015E357D 56 push esi
015E357E 57 push edi
015E357F 51 push ecx
015E3580 8D BD 14 FF FF FF lea edi,[ebp-0ECh]
015E3586 B9 3B 00 00 00 mov ecx,3Bh
015E358B B8 CC CC CC CC mov eax,0CCCCCCCCh
015E3590 F3 AB rep stos dword ptr es:[edi]
015E3592 59 pop ecx
015E3593 89 8C 24 F0 00 00 00 mov dword ptr [esp+0F0h],ecx
17: double a = sqrt(4.9);
015E359A F2 0F 10 05 00 55 09 02 movsd xmm0,mmword ptr ds:[2095500h]
015E35A2 E8 72 D4 FD FF call __libm_sse2_sqrt_precise (015C0A19h)
015E35A7 F2 0F 11 84 24 E0 00 00 00 movsd mmword ptr [esp+0E0h],xmm0
18: double b = -a;
015E35B0 F2 0F 10 84 24 E0 00 00 00 movsd xmm0,mmword ptr [esp+0E0h]
015E35B9 66 0F 57 05 10 55 09 02 xorpd xmm0,xmmword ptr ds:[2095510h]
015E35C1 F2 0F 11 84 24 D0 00 00 00 movsd mmword ptr [esp+0D0h],xmm0
19: return;
20: }
015E35CA 5F pop edi
015E35CB 5E pop esi
015E35CC 5B pop ebx
015E35CD 8B E5 mov esp,ebp
015E35CF 5D pop ebp
015E35D0 C3 ret
--- No source file -------------------------------------------------------------
015E35D1 CC int 3
...
015E35EF CC int 3
Breakpoint at line 17, right before entering the function body: Using the watch window to inspect the object behind register ecx (with a cast to Derived*) shows that ecx contains the pointer I need (to the object), but for some reason it is mov'ed to the seemingly random address [esp+0F0h].
And now the really interesting/flabbergasting part: When I change
double b = -a;
to
double b = -1.0 * a;
and compile again, everything magically works. The function assembly has now changed to:
13:
14:
15: void Derived::Foo(void)
16: {
00863570 55 push ebp
00863571 8B EC mov ebp,esp
00863573 81 EC EC 00 00 00 sub esp,0ECh
00863579 53 push ebx
0086357A 56 push esi
0086357B 57 push edi
0086357C 51 push ecx
0086357D 8D BD 14 FF FF FF lea edi,[ebp-0ECh]
00863583 B9 3B 00 00 00 mov ecx,3Bh
00863588 B8 CC CC CC CC mov eax,0CCCCCCCCh
0086358D F3 AB rep stos dword ptr es:[edi]
0086358F 59 pop ecx
00863590 89 4D F8 mov dword ptr [this],ecx
17: double a = sqrt(4.9);
00863593 F2 0F 10 05 00 55 31 01 movsd xmm0,mmword ptr ds:[1315500h]
0086359B E8 79 D4 FD FF call __libm_sse2_sqrt_precise (0840A19h)
008635A0 F2 0F 11 45 E8 movsd mmword ptr [a],xmm0
18: double b = -1.0 * a;
008635A5 F2 0F 10 05 10 55 31 01 movsd xmm0,mmword ptr ds:[1315510h]
008635AD F2 0F 59 45 E8 mulsd xmm0,mmword ptr [a]
008635B2 F2 0F 11 45 D8 movsd mmword ptr [b],xmm0
19: return;
20: }
008635B7 5F pop edi
008635B8 5E pop esi
008635B9 5B pop ebx
008635BA 81 C4 EC 00 00 00 add esp,0ECh
008635C0 3B EC cmp ebp,esp
008635C2 E8 32 CD FC FF call __RTC_CheckEsp (08302F9h)
008635C7 8B E5 mov esp,ebp
008635C9 5D pop ebp
008635CA C3 ret
--- No source file -------------------------------------------------------------
008635CB CC int 3
...
008635EF CC int 3
Now the generated code nicely moves the pointer in register ecx to this. Other difference:
different memory addresses/offsets
mulsd instead of xorpd to negate the variable
and esp,0FFFFFFF8h disappeared (?? used to align the stack pointer esp ??)
more cleanup (after the function body)?? (add cmp call)
The assembly part where the parameters get pushed to the stack is the same for both situations:
53: d->Foo();
011A500B 8B 45 E0 mov eax,dword ptr [d]
011A500E 8B 10 mov edx,dword ptr [eax]
011A5010 8B F4 mov esi,esp
011A5012 8B 4D E0 mov ecx,dword ptr [d]
011A5015 8B 42 04 mov eax,dword ptr [edx+4]
011A5018 FF D0 call eax
Of course when I try to replicate this with a Minimal, Complete, and Verifiable example, everything works as intened. But in my big project, it fails consistently.
I'm not sure which parameters can influence compilation, and don't know enough of assembly to even see what's going on there;
therefor I'm asking here in the hope that someone has seen this before or recognizes this behaviour.
Note: it also works again, when I remove the sqrt call.
Update:
No problems in release
VS2012 SP4 (v11.0.61030.00)
problem persists when referencing member variables (iso no member references)
TODO: try without global statics