C++ large deque - program takes very long time to exit? - c++

Consider the following C++ program:
#include <deque>
#include <iostream>
using namespace std;
int main()
{
deque<double> d(30000000);
cout << "Done\n";
}
The memory allocation in the first line only takes a second, but after it prints Done, it takes 33 seconds (!) to exit back to the terminal. Decreasing the number of elements to 20000000 reduces that time to 22 seconds, so clearly it's linear in the number of elements.
I am compiling on Windows 10, and the same thing happened with both GCC 10.2.0 and Visual Studio 2019.
What's going on here? Am I using deque in a way it's not supposed to be used?
EDIT:
#include <deque>
#include <iostream>
using namespace std;
void test_deque()
{
deque<double> d(30000000);
cout << "Function done\n";
}
int main()
{
test_deque();
cout << "Main done\n";
}
Now it prints Function done and then there is the 33 second delay. So I assume this has to do with the destructor that gets executed when the function exits. But why does it take so long to destruct 240 MB of memory?
EDIT 2: Tried it (the second version) with GCC on Ubuntu and it only takes a fraction of a second to run! Same with some online C++ compilers. Is this a problem specific to Windows?
EDIT 3: With vector it also takes a fraction of a second to run. However, with list (and forward_list) I get a similar extremely long delay.
EDIT 4: Compiling with MSVC in Release (rather than Debug) configuration also takes a fraction of a second. I'm not sure what the GCC equivalent is, but with -O3 (max optimizations) the execution time remains 33 seconds.

Fundamentally the answer isn't very interesting. Your program is a no-op, so a compiler may optimize out the deque construction. But it doesn't have to.
But first, a legal sane implementation may do any of the following:
Do an allocation for 30000000 double elements and nothing else. The allocator might:
Do the allocation in the most lazy way, doing essentially nothing but some bookkeeping.
Eagerly allocate and page in memory, causing 30000000/page size operations.
Zero-initialize or pattern-initialize the memory (e.g. with 0xdeadbeef) to help detect use of uninitialized memory, causing 30000000 writes.
Allocate (including any of the above) and then zero-initialize or pattern-initialize the memory itself.
Run some sort of destructor over all elements (e.g. zeroing out memory).
Not run a destructor on any elements since it's a built-in type.
Now all of the above are possible options. And since your program is a no-op, a legal compiler may optimize out any or none of these steps. Your system allocator might vary in capabilities, supporting lazy allocation, overcommit, automatic zeroing, etc. So the end result is that you could get any kind of behavior depending on your operating system, compiler version, compiler flags, standard library, etc.
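Which of these steps dominates on a given toolchain can be checked by timing construction and destruction separately. The following is a minimal sketch, not taken from the question: `Timings` and `time_deque` are hypothetical names, and the container is placed on the heap only so that its destructor runs at a known point.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <memory>

// Hypothetical helper type for this sketch: elapsed wall-clock time in milliseconds.
struct Timings {
    double construct_ms;
    double destroy_ms;
};

// Times the construction and the destruction of a deque<double> with n elements.
Timings time_deque(std::size_t n) {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    const auto t0 = clock::now();
    std::unique_ptr<std::deque<double>> d(new std::deque<double>(n));
    const auto t1 = clock::now();

    assert(d->size() == n);  // touch the container so it is really used

    d.reset();               // ~deque() runs here, not at scope exit
    const auto t2 = clock::now();

    return Timings{ms(t1 - t0).count(), ms(t2 - t1).count()};
}
```

Calling `time_deque(30000000)` in a Debug and in a Release build and printing both fields shows whether the delay sits in the constructor, the destructor, or both.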

MSVC has a built-in profiler. We can run it (press Alt-F2) to see that the majority of CPU time is spent inside the constructor and destructor, which invoke deque::resize() and deque::_Tidy() functions, respectively.
If we drill down further, we see that deque::emplace_back() results in quite a lot of code:
#define _PUSH_BACK_BEGIN \
if ((_Myoff() + _Mysize()) % _DEQUESIZ == 0 && _Mapsize() <= (_Mysize() + _DEQUESIZ) / _DEQUESIZ) { \
_Growmap(1); \
} \
_Myoff() &= _Mapsize() * _DEQUESIZ - 1; \
size_type _Newoff = _Myoff() + _Mysize(); \
size_type _Block = _Getblock(_Newoff); \
if (_Map()[_Block] == nullptr) { \
_Map()[_Block] = _Getal().allocate(_DEQUESIZ); \
}
#define _PUSH_BACK_END ++_Mysize()
template <class... _Valty>
decltype(auto) emplace_back(_Valty&&... _Val) {
_Orphan_all();
_PUSH_BACK_BEGIN;
_Alty_traits::construct(
_Getal(), _Unfancy(_Map()[_Block] + _Newoff % _DEQUESIZ), _STD forward<_Valty>(_Val)...);
_PUSH_BACK_END;
#if _HAS_CXX17
return back();
#endif // _HAS_CXX17
}
Disassembly view:
template <class... _Valty>
decltype(auto) emplace_back(_Valty&&... _Val) {
00007FF674A238E0 mov qword ptr [rsp+8],rcx
00007FF674A238E5 push rbp
00007FF674A238E6 push rdi
00007FF674A238E7 sub rsp,138h
00007FF674A238EE lea rbp,[rsp+20h]
00007FF674A238F3 mov rdi,rsp
00007FF674A238F6 mov ecx,4Eh
00007FF674A238FB mov eax,0CCCCCCCCh
00007FF674A23900 rep stos dword ptr [rdi]
00007FF674A23902 mov rcx,qword ptr [rsp+158h]
00007FF674A2390A lea rcx,[__0657B1E2_deque (07FF674A3E02Fh)]
00007FF674A23911 call __CheckForDebuggerJustMyCode (07FF674A21159h)
_Orphan_all();
00007FF674A23916 mov rcx,qword ptr [this]
00007FF674A2391D call std::deque<double,std::allocator<double> >::_Orphan_all (07FF674A217FDh)
_PUSH_BACK_BEGIN;
00007FF674A23922 mov rcx,qword ptr [this]
00007FF674A23929 call std::deque<double,std::allocator<double> >::_Myoff (07FF674A2139Dh)
00007FF674A2392E mov qword ptr [rbp+0F8h],rax
00007FF674A23935 mov rcx,qword ptr [this]
00007FF674A2393C call std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)
00007FF674A23941 mov rcx,qword ptr [rbp+0F8h]
00007FF674A23948 mov rcx,qword ptr [rcx]
00007FF674A2394B add rcx,qword ptr [rax]
00007FF674A2394E mov rax,rcx
00007FF674A23951 xor edx,edx
00007FF674A23953 mov ecx,2
00007FF674A23958 div rax,rcx
00007FF674A2395B mov rax,rdx
00007FF674A2395E test rax,rax
00007FF674A23961 jne std::deque<double,std::allocator<double> >::emplace_back<>+0D0h (07FF674A239B0h)
00007FF674A23963 mov rcx,qword ptr [this]
00007FF674A2396A call std::deque<double,std::allocator<double> >::_Mapsize (07FF674A214BFh)
00007FF674A2396F mov qword ptr [rbp+0F8h],rax
00007FF674A23976 mov rcx,qword ptr [this]
00007FF674A2397D call std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)
00007FF674A23982 mov rax,qword ptr [rax]
00007FF674A23985 add rax,2
00007FF674A23989 xor edx,edx
00007FF674A2398B mov ecx,2
00007FF674A23990 div rax,rcx
00007FF674A23993 mov rcx,qword ptr [rbp+0F8h]
00007FF674A2399A cmp qword ptr [rcx],rax
00007FF674A2399D ja std::deque<double,std::allocator<double> >::emplace_back<>+0D0h (07FF674A239B0h)
00007FF674A2399F mov edx,1
00007FF674A239A4 mov rcx,qword ptr [this]
00007FF674A239AB call std::deque<double,std::allocator<double> >::_Growmap (07FF674A21640h)
00007FF674A239B0 mov rcx,qword ptr [this]
00007FF674A239B7 call std::deque<double,std::allocator<double> >::_Mapsize (07FF674A214BFh)
00007FF674A239BC mov rax,qword ptr [rax]
00007FF674A239BF lea rax,[rax+rax-1]
00007FF674A239C4 mov qword ptr [rbp+0F8h],rax
00007FF674A239CB mov rcx,qword ptr [this]
00007FF674A239D2 call std::deque<double,std::allocator<double> >::_Myoff (07FF674A2139Dh)
00007FF674A239D7 mov qword ptr [rbp+100h],rax
00007FF674A239DE mov rax,qword ptr [rbp+100h]
00007FF674A239E5 mov rax,qword ptr [rax]
00007FF674A239E8 mov qword ptr [rbp+108h],rax
00007FF674A239EF mov rax,qword ptr [rbp+0F8h]
00007FF674A239F6 mov rcx,qword ptr [rbp+108h]
00007FF674A239FD and rcx,rax
00007FF674A23A00 mov rax,rcx
00007FF674A23A03 mov rcx,qword ptr [rbp+100h]
00007FF674A23A0A mov qword ptr [rcx],rax
00007FF674A23A0D mov rcx,qword ptr [this]
00007FF674A23A14 call std::deque<double,std::allocator<double> >::_Myoff (07FF674A2139Dh)
00007FF674A23A19 mov qword ptr [rbp+0F8h],rax
00007FF674A23A20 mov rcx,qword ptr [this]
00007FF674A23A27 call std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)
00007FF674A23A2C mov rcx,qword ptr [rbp+0F8h]
00007FF674A23A33 mov rcx,qword ptr [rcx]
00007FF674A23A36 add rcx,qword ptr [rax]
00007FF674A23A39 mov rax,rcx
00007FF674A23A3C mov qword ptr [_Newoff],rax
00007FF674A23A40 mov rdx,qword ptr [_Newoff]
00007FF674A23A44 mov rcx,qword ptr [this]
00007FF674A23A4B call std::deque<double,std::allocator<double> >::_Getblock (07FF674A21334h)
00007FF674A23A50 mov qword ptr [_Block],rax
00007FF674A23A54 mov rcx,qword ptr [this]
00007FF674A23A5B call std::deque<double,std::allocator<double> >::_Map (07FF674A21753h)
00007FF674A23A60 mov rax,qword ptr [rax]
00007FF674A23A63 mov rcx,qword ptr [_Block]
00007FF674A23A67 cmp qword ptr [rax+rcx*8],0
00007FF674A23A6C jne std::deque<double,std::allocator<double> >::emplace_back<>+1D7h (07FF674A23AB7h)
00007FF674A23A6E mov rcx,qword ptr [this]
00007FF674A23A75 call std::deque<double,std::allocator<double> >::_Getal (07FF674A216CCh)
00007FF674A23A7A mov qword ptr [rbp+0F8h],rax
00007FF674A23A81 mov edx,2
00007FF674A23A86 mov rcx,qword ptr [rbp+0F8h]
00007FF674A23A8D call std::allocator<double>::allocate (07FF674A216C7h)
00007FF674A23A92 mov qword ptr [rbp+100h],rax
00007FF674A23A99 mov rcx,qword ptr [this]
00007FF674A23AA0 call std::deque<double,std::allocator<double> >::_Map (07FF674A21753h)
00007FF674A23AA5 mov rax,qword ptr [rax]
00007FF674A23AA8 mov rcx,qword ptr [_Block]
00007FF674A23AAC mov rdx,qword ptr [rbp+100h]
00007FF674A23AB3 mov qword ptr [rax+rcx*8],rdx
_Alty_traits::construct(
00007FF674A23AB7 mov rcx,qword ptr [this]
00007FF674A23ABE call std::deque<double,std::allocator<double> >::_Map (07FF674A21753h)
00007FF674A23AC3 mov rax,qword ptr [rax]
00007FF674A23AC6 mov qword ptr [rbp+0F8h],rax
00007FF674A23ACD xor edx,edx
00007FF674A23ACF mov rax,qword ptr [_Newoff]
00007FF674A23AD3 mov ecx,2
00007FF674A23AD8 div rax,rcx
00007FF674A23ADB mov rax,rdx
00007FF674A23ADE mov rcx,qword ptr [_Block]
00007FF674A23AE2 mov rdx,qword ptr [rbp+0F8h]
00007FF674A23AE9 mov rcx,qword ptr [rdx+rcx*8]
00007FF674A23AED lea rax,[rcx+rax*8]
00007FF674A23AF1 mov rcx,rax
00007FF674A23AF4 call std::_Unfancy<double> (07FF674A214A6h)
00007FF674A23AF9 mov qword ptr [rbp+100h],rax
00007FF674A23B00 mov rcx,qword ptr [this]
00007FF674A23B07 call std::deque<double,std::allocator<double> >::_Getal (07FF674A216CCh)
00007FF674A23B0C mov qword ptr [rbp+108h],rax
00007FF674A23B13 mov rdx,qword ptr [rbp+100h]
00007FF674A23B1A mov rcx,qword ptr [rbp+108h]
00007FF674A23B21 call std::_Default_allocator_traits<std::allocator<double> >::construct<double> (07FF674A211E5h)
_Getal(), _Unfancy(_Map()[_Block] + _Newoff % _DEQUESIZ), _STD forward<_Valty>(_Val)...);
_PUSH_BACK_END;
00007FF674A23B26 mov rcx,qword ptr [this]
00007FF674A23B2D call std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)
00007FF674A23B32 mov qword ptr [rbp+0F8h],rax
00007FF674A23B39 mov rax,qword ptr [rbp+0F8h]
00007FF674A23B40 mov rax,qword ptr [rax]
00007FF674A23B43 inc rax
00007FF674A23B46 mov rcx,qword ptr [rbp+0F8h]
00007FF674A23B4D mov qword ptr [rcx],rax
#if _HAS_CXX17
return back();
00007FF674A23B50 mov rcx,qword ptr [this]
00007FF674A23B57 call std::deque<double,std::allocator<double> >::back (07FF674A2127Bh)
#endif // _HAS_CXX17
}
00007FF674A23B5C lea rsp,[rbp+118h]
00007FF674A23B63 pop rdi
00007FF674A23B64 pop rbp
00007FF674A23B65 ret
Apparently std::deque doesn't pre-allocate elements, and instead uses a loop to add them one-by-one. So no wonder it is slow.
You can speed up the Debug build by enabling some optimizations (e.g. /Ob1) and reducing runtime checks (e.g. remove /RTC1).
But really, std::deque is just a horrible structure from a performance point of view (a vector of tiny vectors - not cache-friendly at all).
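How many separate allocations a container performs can be observed directly with a counting allocator. This is a minimal sketch under stated assumptions: `CountingAllocator`, `g_allocations`, `deque_allocations` and `vector_allocations` are hypothetical names, and the counts depend on the standard library in use (debug builds of some libraries may allocate extra bookkeeping objects).

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <vector>

// Global counter of allocate() calls (hypothetical helper for this sketch).
static std::size_t g_allocations = 0;

// Minimal allocator that forwards to std::allocator and counts allocations.
template <class T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ++g_allocations;
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>().deallocate(p, n);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return false;
}

// Number of allocate() calls made while constructing a deque of n doubles.
std::size_t deque_allocations(std::size_t n) {
    g_allocations = 0;
    {
        std::deque<double, CountingAllocator<double>> d(n);
    }
    return g_allocations;
}

// Same measurement for a vector of n doubles.
std::size_t vector_allocations(std::size_t n) {
    g_allocations = 0;
    {
        std::vector<double, CountingAllocator<double>> v(n);
    }
    return g_allocations;
}
```

Comparing `deque_allocations(n)` with `vector_allocations(n)` for growing `n` shows the difference: the deque count grows with the number of blocks, while the vector count stays constant.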

It is really slow in debug mode.
MSDN:
Processes that the debugger creates (also known as spawned processes) behave a little differently than processes that the debugger does not create.
Instead of using the standard heap API, processes that the debugger creates use a special debug heap. You can force a spawned process to use the standard heap instead of the debug heap by using the _NO_DEBUG_HEAP environment variable.

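std::deque allocates data in chunks of a fixed size that varies with platform and element type. For double that could be 4KB. Allocating 30,000,000 doubles takes 240MB of memory (30,000,000 times 8 bytes), so with 4KB chunks that is roughly 58,600 allocations/deallocations. The chunks can be much smaller: the constant 2 used for the % _DEQUESIZ computation in the MSVC disassembly in this thread suggests chunks of only two doubles, which would mean about 15,000,000 allocations/deallocations. With std::list that would be 30,000,000 allocations/deallocations and several hundred MB more memory for the node pointers, making it all ridiculously slow.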
This might even cause memory fragmentation issues depending on your hardware. And if you run without optimizations it will be even slower.
There is also the privacy issue. Deallocation might be clearing data to ensure that your program didn't leak any information to outside programs.
As mentioned by @orlp, your program is a no-op, so the whole allocation/deallocation can be optimized out completely, which might explain why it speeds up significantly at times.
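The arithmetic behind such estimates is easy to check. The helpers below are a sketch with hypothetical names (`payload_bytes`, `blocks_needed`); the 4096-byte and 16-byte block sizes used afterwards are illustrative assumptions, not values guaranteed by any particular standard library.

```cpp
#include <cstddef>

// Total bytes occupied by `count` elements of `elem_size` bytes each.
constexpr std::size_t payload_bytes(std::size_t count, std::size_t elem_size) {
    return count * elem_size;
}

// Number of fixed-size blocks needed to hold `count` elements of `elem_size`
// bytes when every block is `block_bytes` bytes large (rounded up).
constexpr std::size_t blocks_needed(std::size_t count, std::size_t elem_size,
                                    std::size_t block_bytes) {
    return (count + (block_bytes / elem_size) - 1) / (block_bytes / elem_size);
}
```

For 30,000,000 doubles of 8 bytes each this gives 240,000,000 bytes of payload, 58,594 blocks of 4096 bytes, or 15,000,000 blocks of 16 bytes.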

Related

Error when using the delete array operator on a CDBVariant array

I create, use, and delete an array of CDBVariant :
CDBVariant *myVars = new CDBVariant[N];
// ...
delete[] myVars;
On the delete[] line the execution meets 3 breakpoints I can't explore, then Access Violations in reading over and over until it crashes. I have the exact same symptoms as this guy : http://computer-programming-forum.com/82-mfc/549a933737d9177d.htm ; namely I can use new[]/delete[] on other object types with no problem, and I can wrap the CDBVariant in a useless class and create/delete arrays of this class.
This happens specifically while using MFC. If I create a new console app and run this:
#include <cassert>
#include <afxdb.h>
int main(void)
{
CDBVariant *vars = new CDBVariant[10];
assert(vars[0].m_dwType == DBVT_NULL); // don't optimise my array away please.
delete[] vars;
}
it won't cause any trouble. If, however, I create a basic, default MFC app (dialog-based, because that's what I use) and use the OK button of the default dialog to do the same thing:
void CMFCApplication1Dlg::OnBnClickedOk()
{
CDBVariant *vars = new CDBVariant[10];
assert(vars[0].m_dwType == DBVT_NULL);
delete[] vars;
CDialogEx::OnOK();
}
once again, breaks, then access violations, then death.
(I use VS2017 with MSVC 19.10.25027.)
Please tell me what causes this, and how to properly avoid it.
Here are the requested disassembly codes. Sorry for the heaviness. Here is the console version:
delete[] vars;
00558E6D mov eax,dword ptr [vars]
00558E70 mov dword ptr [ebp-110h],eax
00558E76 mov ecx,dword ptr [ebp-110h]
00558E7C mov dword ptr [ebp-104h],ecx
00558E82 mov edx,dword ptr [ebp-104h]
00558E88 mov dword ptr [ebp-0F8h],edx
00558E8E cmp dword ptr [ebp-0F8h],0
00558E95 je main+172h (0558EF2h)
00558E97 mov eax,dword ptr [ebp-0F8h]
00558E9D cmp dword ptr [eax-4],0
00558EA1 je main+148h (0558EC8h)
00558EA3 mov esi,esp
00558EA5 push 3
00558EA7 mov ecx,dword ptr [ebp-104h]
00558EAD mov edx,dword ptr [ecx]
00558EAF mov ecx,dword ptr [ebp-104h]
delete[] vars;
00558EB5 mov eax,dword ptr [edx]
00558EB7 call eax
00558EB9 cmp esi,esp
00558EBB call __RTC_CheckEsp (0515C33h)
00558EC0 mov dword ptr [ebp-118h],eax
00558EC6 jmp main+164h (0558EE4h)
00558EC8 mov ecx,dword ptr [ebp-0F8h]
00558ECE sub ecx,4
00558ED1 push ecx
00558ED2 call operator delete[] (052763Bh)
00558ED7 add esp,4
00558EDA mov dword ptr [ebp-118h],0
00558EE4 mov edx,dword ptr [ebp-118h]
00558EEA mov dword ptr [ebp-11Ch],edx
00558EF0 jmp main+17Ch (0558EFCh)
00558EF2 mov dword ptr [ebp-11Ch],0
The call eax line leads to
CDBVariant::`vector deleting destructor':
0051EC84 jmp CDBVariant::`vector deleting destructor' (0556E30h)
which immediately jumps away to a big block of code, introduced as CDBVariant::`vector deleting destructor'.
The MFC version looks very similar at first:
delete[] vars;
00FB7D0A mov eax,dword ptr [vars]
00FB7D0D mov dword ptr [ebp-11Ch],eax
00FB7D13 mov ecx,dword ptr [ebp-11Ch]
00FB7D19 mov dword ptr [ebp-110h],ecx
00FB7D1F mov edx,dword ptr [ebp-110h]
00FB7D25 mov dword ptr [ebp-104h],edx
00FB7D2B cmp dword ptr [ebp-104h],0
00FB7D32 je CMFCApplication1Dlg::OnBnClickedOk+18Fh (0FB7D8Fh)
00FB7D34 mov eax,dword ptr [ebp-104h]
00FB7D3A cmp dword ptr [eax-4],0
00FB7D3E je CMFCApplication1Dlg::OnBnClickedOk+165h (0FB7D65h)
00FB7D40 mov esi,esp
00FB7D42 push 3
00FB7D44 mov ecx,dword ptr [ebp-110h]
00FB7D4A mov edx,dword ptr [ecx]
00FB7D4C mov ecx,dword ptr [ebp-110h]
00FB7D52 mov eax,dword ptr [edx]
00FB7D54 call eax
00FB7D56 cmp esi,esp
00FB7D58 call __RTC_CheckEsp (0FB1474h)
00FB7D5D mov dword ptr [ebp-124h],eax
00FB7D63 jmp CMFCApplication1Dlg::OnBnClickedOk+181h (0FB7D81h)
00FB7D65 mov ecx,dword ptr [ebp-104h]
00FB7D6B sub ecx,4
00FB7D6E push ecx
00FB7D6F call operator delete[] (0FB1B4Fh)
00FB7D74 add esp,4
00FB7D77 mov dword ptr [ebp-124h],0
00FB7D81 mov edx,dword ptr [ebp-124h]
00FB7D87 mov dword ptr [ebp-128h],edx
00FB7D8D jmp CMFCApplication1Dlg::OnBnClickedOk+199h (0FB7D99h)
00FB7D8F mov dword ptr [ebp-128h],0
but its call eax goes here:
0FCF7810 push ebp
0FCF7811 mov ebp,esp
0FCF7813 push ecx
0FCF7814 mov dword ptr [ebp-4],0CCCCCCCCh
0FCF781B mov dword ptr [ebp-4],ecx
0FCF781E mov ecx,dword ptr [ebp-4]
0FCF7821 call 0FCF77B0
0FCF7826 mov eax,dword ptr [ebp+8]
0FCF7829 and eax,1
0FCF782C je 0FCF783C
0FCF782E push 18h
0FCF7830 mov ecx,dword ptr [ebp-4]
0FCF7833 push ecx
0FCF7834 call 0FE87AB0
0FCF7839 add esp,8
0FCF783C mov eax,dword ptr [ebp-4]
0FCF783F add esp,4
0FCF7842 cmp ebp,esp
0FCF7844 call 0FE879E0
0FCF7849 mov esp,ebp
0FCF784B pop ebp
0FCF784C ret 4
The first break happens here:
ntdll.dll!770e48c2() Unknown
[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]
ntdll.dll!770843d7() Unknown
KernelBase.dll!73bf4814() Unknown
ucrtbased.dll!0fdf611b() Unknown
ucrtbased.dll!0fdf4a0e() Unknown
ucrtbased.dll!0fdf75bc() Unknown
mfc140ud.dll!0f641372() Unknown
mfc140ud.dll!0fa97abc() Unknown
mfc140ud.dll!0f907839() Unknown
MFCApplication1.exe!CMFCApplication1Dlg::OnBnClickedOk() Line 96 C++
the second and third a bit higher:
ucrtbased.dll!0fdf4a3a() Unknown
[Frames below may be incorrect and/or missing, no symbols loaded for ucrtbased.dll]
ucrtbased.dll!0fdf75bc() Unknown
mfc140ud.dll!0f641372() Unknown
mfc140ud.dll!0fa97abc() Unknown
mfc140ud.dll!0f907839() Unknown
MFCApplication1.exe!CMFCApplication1Dlg::OnBnClickedOk() Line 96 C++
and finally the call stack gets lost and the execution loops sending exceptions from there:
ucrtbased.dll!0fe55f0c() Unknown
[Frames below may be incorrect and/or missing, no symbols loaded for ucrtbased.dll]
ucrtbased.dll!0fe563d1() Unknown
ucrtbased.dll!0fe55dcb() Unknown
ucrtbased.dll!0fe56d02() Unknown
ucrtbased.dll!0fe468a9() Unknown
ucrtbased.dll!0fe464bf() Unknown
ucrtbased.dll!0fe3d8a2() Unknown
ucrtbased.dll!0fe37319() Unknown
ucrtbased.dll!0fe2f545() Unknown
ucrtbased.dll!0fdface6() Unknown
ucrtbased.dll!0fdfa17b() Unknown
ucrtbased.dll!0fdf8043() Unknown
"0xC0000005: Access violation reading location 0x00000047."

One writer and multiple readers - 256bit - AVX - atomic [duplicate]

This question already has answers here:
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
(1 answer)
SSE instructions: which CPUs can do atomic 16B memory operations?
(7 answers)
Largest data type which can be fetch-ANDed atomically?
(2 answers)
Closed 2 years ago.
I would like to write 256 bits of data on one core and read it on another one. There will be only one writer process, but there can be multiple readers.
I was thinking of implementing it using AVX. The reads and writes should be atomic since each is only one instruction (vmovdqa), and if the data is aligned to a cache line, cache coherency should move it atomically between cores.
I looked at the generated assembly but can see 2 writes and 2 reads. Why isn't there just one? Would this solution for atomic read/write work given the assumptions?
#include <immintrin.h>
#include <cstdint>
struct Data {
int64_t a[4];
};
struct DataHolder {
void set_data(Data* in) {
_mm256_store_si256(reinterpret_cast<__m256i *>(&data_), *reinterpret_cast<__m256i *>(in));
}
void get_data(Data* out) {
_mm256_store_si256(reinterpret_cast<__m256i *>(out), *reinterpret_cast<__m256i *>(&data_));
}
alignas(64) Data data_;
char padding [64 - sizeof(Data)];
};
int main() {
Data a, b;
DataHolder ab;
ab.set_data(&a);
ab.get_data(&b);
}
DataHolder::set_data(Data*):
push rbp
mov rbp, rsp
and rsp, -32
mov QWORD PTR [rsp-72], rdi
mov QWORD PTR [rsp-80], rsi
mov rax, QWORD PTR [rsp-80]
vmovdqa ymm0, YMMWORD PTR [rax]
mov rax, QWORD PTR [rsp-72]
mov QWORD PTR [rsp-8], rax
vmovdqa YMMWORD PTR [rsp-64], ymm0
mov rax, QWORD PTR [rsp-8]
vmovdqa ymm0, YMMWORD PTR [rsp-64]
vmovdqa YMMWORD PTR [rax], ymm0
nop
nop
leave
ret
DataHolder::get_data(Data*):
push rbp
mov rbp, rsp
and rsp, -32
mov QWORD PTR [rsp-72], rdi
mov QWORD PTR [rsp-80], rsi
mov rax, QWORD PTR [rsp-72]
vmovdqa ymm0, YMMWORD PTR [rax]
mov rax, QWORD PTR [rsp-80]
mov QWORD PTR [rsp-8], rax
vmovdqa YMMWORD PTR [rsp-64], ymm0
mov rax, QWORD PTR [rsp-8]
vmovdqa ymm0, YMMWORD PTR [rsp-64]
vmovdqa YMMWORD PTR [rax], ymm0
nop
nop
leave
ret
main:
push rbp
mov rbp, rsp
and rsp, -64
add rsp, -128
lea rdx, [rsp+96]
mov rax, rsp
mov rsi, rdx
mov rdi, rax
call DataHolder::set_data(Data*)
lea rdx, [rsp+64]
mov rax, rsp
mov rsi, rdx
mov rdi, rax
call DataHolder::get_data(Data*)
mov eax, 0
leave
ret

x86-64 assembler for polymorphic call

I have the C++ code:
int main(){
M* m;
O* o = new IO();
H* h = new H("A");
if(__rdtsc() % 5 == 0){
m = new Y(o, h);
}
else{
m = new Z(o, h);
}
m->my_virtual();
return 1;
}
where the virtual call is represented by this asm:
mov rax,qword ptr [x]
mov rax,qword ptr [rax]
mov rcx,qword ptr [x]
call qword ptr [rax]
It is one more line than I was expecting for the vtable method invocation. Are all four of the asm lines specific to the polymorphic call?
How do the above four lines read in pseudocode?
This is the complete ASM and C++ (the virtual call is made right at the end):
int main(){
add byte ptr [rax-33333334h],bh
rep stos dword ptr [rdi]
mov qword ptr [rsp+0A8h],0FFFFFFFFFFFFFFFEh
M* x;
o* o = new IO();
mov ecx,70h
call operator new (013F6B7A70h)
mov qword ptr [rsp+40h],rax
cmp qword ptr [rsp+40h],0
je main+4Fh (013F69687Fh)
mov rcx,qword ptr [rsp+40h]
call IO::IO (013F6814F6h)
mov qword ptr [rsp+0B0h],rax
jmp main+5Bh (013F69688Bh)
mov qword ptr [rsp+0B0h],0
mov rax,qword ptr [rsp+0B0h]
mov qword ptr [rsp+38h],rax
mov rax,qword ptr [rsp+38h]
mov qword ptr [o],rax
H* h = new H("A");
mov ecx,150h
call operator new (013F6B7A70h)
mov qword ptr [rsp+50h],rax
cmp qword ptr [rsp+50h],0
je main+0CEh (013F6968FEh)
lea rax,[rsp+58h]
mov qword ptr [rsp+80h],rax
lea rdx,[ec_table+11Ch (013F7C073Ch)]
mov rcx,qword ptr [rsp+80h]
call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::basic_string<char,std::char_traits<char>,std::allocator<char> > (013F681104h)
mov qword ptr [rsp+0B8h],rax
mov rdx,qword ptr [rsp+0B8h]
mov rcx,qword ptr [rsp+50h]
call H::H (013F6826A3h)
mov qword ptr [rsp+0C0h],rax
jmp main+0DAh (013F69690Ah)
mov qword ptr [rsp+0C0h],0
mov rax,qword ptr [rsp+0C0h]
mov qword ptr [rsp+48h],rax
mov rax,qword ptr [rsp+48h]
mov qword ptr [h],rax
if(__rdtsc() % 5 == 0){
rdtsc
shl rdx,20h
or rax,rdx
xor edx,edx
mov ecx,5
div rax,rcx
mov rax,rdx
test rax,rax
jne main+175h (013F6969A5h)
x = new Y(o, h);
mov ecx,18h
call operator new (013F6B7A70h)
mov qword ptr [rsp+90h],rax
cmp qword ptr [rsp+90h],0
je main+14Ah (013F69697Ah)
mov r8,qword ptr [h]
mov rdx,qword ptr [o]
mov rcx,qword ptr [rsp+90h]
call Y::Y (013F681B4Fh)
mov qword ptr [rsp+0C8h],rax
jmp main+156h (013F696986h)
mov qword ptr [rsp+0C8h],0
mov rax,qword ptr [rsp+0C8h]
mov qword ptr [rsp+88h],rax
mov rax,qword ptr [rsp+88h]
mov qword ptr [x],rax
}
else{
jmp main+1DCh (013F696A0Ch)
x = new Z(o, h);
mov ecx,18h
call operator new (013F6B7A70h)
mov qword ptr [rsp+0A0h],rax
cmp qword ptr [rsp+0A0h],0
je main+1B3h (013F6969E3h)
mov r8,qword ptr [h]
mov rdx,qword ptr [o]
mov rcx,qword ptr [rsp+0A0h]
call Z::Z (013F68160Eh)
mov qword ptr [rsp+0D0h],rax
jmp main+1BFh (013F6969EFh)
mov qword ptr [rsp+0D0h],0
mov rax,qword ptr [rsp+0D0h]
mov qword ptr [rsp+98h],rax
mov rax,qword ptr [rsp+98h]
mov qword ptr [x],rax
}
x->my_virtual();
mov rax,qword ptr [x]
mov rax,qword ptr [rax]
mov rcx,qword ptr [x]
call qword ptr [rax]
return 1;
mov eax,1
}
You're probably looking at unoptimized code:
mov rax,qword ptr [x] ; load rax with object pointer
mov rax,qword ptr [rax] ; load rax with the vtable pointer
mov rcx,qword ptr [x] ; load rcx with the object pointer (the 'this' pointer)
call qword ptr [rax] ; call through the vtable slot for the virtual function
mov rax,qword ptr [x]
get the address pointed to by x
mov rax,qword ptr [rax]
get the address of the vtable for x's class (using rax we just worked out). Put it in rax
mov rcx,qword ptr [x]
get the pointer x and put it in rcx, so it can be used as the "this" pointer in the called function.
call qword ptr [rax]
call the function using the address from the vtable we found earlier (no offset as it is the first virtual function).
There are definitely shorter ways to do it, which the compiler might use if you switch optimizations on (e.g. only get [x] once).
Updated with more info from Ben Voigt
In pseudo-code:
(*(*m->__vtbl)[0])(m)
Optimized version (can rcx be used for indexing?):
mov rcx,qword ptr [x] ; load rcx with object pointer
mov rax,qword ptr [rcx] ; load rax with the vtable pointer
call qword ptr [rax] ; call through the vtable slot for the virtual function
or
mov rax,qword ptr [x] ; load rax with object pointer
mov rcx,rax ; copy object pointer to rcx (the 'this' pointer)
mov rax,qword ptr [rax] ; load rax with the vtable pointer
call qword ptr [rax] ; call through the vtable slot for the virtual function
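The four instructions can also be mirrored in C++ by emulating the dispatch by hand. This is only an illustrative sketch: `Object`, `VirtualFn`, the two tables and `call_first_virtual` are hypothetical names, and a real compiler's vtable layout is an ABI detail that this code does not reproduce exactly.

```cpp
struct Object;

// Signature of the single "virtual" function in this sketch.
using VirtualFn = int (*)(Object* self);

struct Object {
    const VirtualFn* vtbl;  // plays the role of the hidden vtable pointer at offset 0
};

int y_my_virtual(Object*) { return 1; }
int z_my_virtual(Object*) { return 2; }

// One table per "class"; slot 0 holds the first virtual function.
const VirtualFn y_vtable[] = { y_my_virtual };
const VirtualFn z_vtable[] = { z_my_virtual };

int call_first_virtual(Object* x) {
    const VirtualFn* table = x->vtbl;  // mov rax,[x] ; mov rax,[rax]
    VirtualFn target = table[0];       // the slot read by call qword ptr [rax]
    return target(x);                  // mov rcx,[x] ; call, with x as 'this'
}
```

Which function runs depends only on the table the object points to, which is why the call site looks the same for Y and Z.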

Run-Time Check Failure #0 - The value of ESP was not properly saved across a function call

#include<stdio.h>
int a[100];
int main(){
char UserName[100];
char *n=UserName;
char *q=NULL;
char Serial[200];
q=Serial;
scanf("%s",UserName);
//this is about
__asm{
pushad
mov eax,q
push eax
mov eax,n
push eax
mov EAX,EAX
mov EAX,EAX
CALL G1
LEA EDX,DWORD PTR SS:[ESP+10H]
jmp End
G1:
SUB ESP,400H
XOR ECX,ECX
PUSH EBX
PUSH EBP
MOV EBP,DWORD PTR SS:[ESP+40CH]
PUSH ESI
PUSH EDI
MOV DL,BYTE PTR SS:[EBP]
TEST DL,DL
JE L048
LEA EDI,DWORD PTR SS:[ESP+10H]
MOV AL,DL
MOV ESI,EBP
SUB EDI,EBP
L014:
MOV BL,AL
ADD BL,CL
XOR BL,AL
SHL AL,1
OR BL,AL
MOV AL,BYTE PTR DS:[ESI+1]
MOV BYTE PTR DS:[EDI+ESI],BL
INC ECX
INC ESI
TEST AL,AL
JNZ L014
TEST DL,DL
JE L048
MOV EDI,DWORD PTR SS:[ESP+418H]
LEA EBX,DWORD PTR SS:[ESP+10H]
MOV ESI,EBP
SUB EBX,EBP
L031:
MOV AL,BYTE PTR DS:[ESI+EBX]
PUSH EDI
PUSH EAX
CALL G2
MOV AL,BYTE PTR DS:[ESI+1]
ADD ESP,8
ADD EDI,2
INC ESI
TEST AL,AL
JNZ L031
MOV BYTE PTR DS:[EDI],0
POP EDI
POP ESI
POP EBP
POP EBX
ADD ESP,400H
RETN
L048:
MOV ECX,DWORD PTR SS:[ESP+418H]
POP EDI
POP ESI
POP EBP
MOV BYTE PTR DS:[ECX],0
POP EBX
ADD ESP,400H
RETN
G2:
MOVSX ECX,BYTE PTR SS:[ESP+4]
MOV EAX,ECX
AND ECX,0FH
SAR EAX,4
AND EAX,0FH
CMP EAX,0AH
JGE L009
ADD AL,30H
JMP L010
L009:
ADD AL,42H
L010:
MOV EDX,DWORD PTR SS:[ESP+8]
CMP ECX,0AH
MOV BYTE PTR DS:[EDX],AL
JGE L017
ADD CL,61H
MOV BYTE PTR DS:[EDX+1],CL
RETN
L017:
ADD CL,45H
MOV BYTE PTR DS:[EDX+1],CL
RETN
End:
mov eax,eax
popad
}
printf("%s\n",Serial);
return 0;
}
Can you help me?
This problem is about asm; I don't know what causes this result.
The program is very simple, and it is about the internal code of a program.
Run-Time Check Failure #0 - The value of ESP was not properly saved across a function call. This is usually a result of calling a function declared with one calling convention with a function pointer declared with a different calling convention.
It seems the two parameters that are pushed onto the stack before the call to G1 are never popped from the stack.
Inside G1 the adjustments appear to balance out: the SUB ESP,400H at the beginning is matched by the ADD ESP,400H at the end, and the ADD ESP,8 after L031 removes the two arguments pushed for G2. G1 returns with a plain RETN, however, so its own two arguments stay on the stack and ESP after the call is 8 less than it was before the arguments were pushed. The POPAD at End then restores the registers from the wrong stack slots.
EDIT: Regarding the coding style of the assembly function, please see this. It briefly describes which responsibilities regarding ESP belong to the caller and which belong to the callee.

Default initialization with default constructor for primitive data types

Are there any drawbacks / disadvantages using the default constructor for default initialization for primitive data types?
For example
class MyClass
{
public:
MyClass();
private:
int miInt;
double mdDouble;
bool mbBool;
};
Using this constructor:
MyClass::MyClass()
: miInt(int())
, mdDouble(double())
, mbBool(bool())
{}
instead of this:
MyClass::MyClass()
: miInt(0)
, mdDouble(0.0)
, mbBool(false)
{}
No, and the compiler will most probably generate the same code for both.
With optimization off, the following code is generated:
MyClass::MyClass()
: miInt(0)
, mdDouble(0.0)
, mbBool(false)
{}
012313A0 push ebp
012313A1 mov ebp,esp
012313A3 sub esp,0CCh
012313A9 push ebx
012313AA push esi
012313AB push edi
012313AC push ecx
012313AD lea edi,[ebp-0CCh]
012313B3 mov ecx,33h
012313B8 mov eax,0CCCCCCCCh
012313BD rep stos dword ptr es:[edi]
012313BF pop ecx
012313C0 mov dword ptr [ebp-8],ecx
012313C3 mov eax,dword ptr [this]
012313C6 mov dword ptr [eax],0
012313CC mov eax,dword ptr [this]
012313CF fldz
012313D1 fstp qword ptr [eax+8]
012313D4 mov eax,dword ptr [this]
012313D7 mov byte ptr [eax+10h],0
012313DB mov eax,dword ptr [this]
012313DE pop edi
012313DF pop esi
012313E0 pop ebx
012313E1 mov esp,ebp
012313E3 pop ebp
012313E4 ret
and
MyClass::MyClass()
: miInt(int())
, mdDouble(double())
, mbBool(bool())
{}
001513A0 push ebp
001513A1 mov ebp,esp
001513A3 sub esp,0CCh
001513A9 push ebx
001513AA push esi
001513AB push edi
001513AC push ecx
001513AD lea edi,[ebp-0CCh]
001513B3 mov ecx,33h
001513B8 mov eax,0CCCCCCCCh
001513BD rep stos dword ptr es:[edi]
001513BF pop ecx
001513C0 mov dword ptr [ebp-8],ecx
001513C3 mov eax,dword ptr [this]
001513C6 mov dword ptr [eax],0
001513CC mov eax,dword ptr [this]
001513CF fldz
001513D1 fstp qword ptr [eax+8]
001513D4 mov eax,dword ptr [this]
001513D7 mov byte ptr [eax+10h],0
001513DB mov eax,dword ptr [this]
001513DE pop edi
001513DF pop esi
001513E0 pop ebx
001513E1 mov esp,ebp
001513E3 pop ebp
001513E4 ret
As you can see, it's identical.
There is a more consistent syntax for creating default objects:
MyClass::MyClass()
: miInt()
, mdDouble()
, mbBool()
{
}
That is, don't pass anything. Just write T() and the object will be created with its default value (zero for built-in types). It is also consistent with class types (think of POD types)!
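A short check of what the empty parentheses do. This is a sketch: `make_on_dirty_storage` is a hypothetical helper that constructs the object over memory pre-filled with 0xFF, so the zeros observed afterwards come from the constructor rather than from luck. The members are public here only so that they can be inspected.

```cpp
#include <cstring>
#include <new>

class MyClass {
public:
    // Empty parentheses value-initialize each member: zero for built-in types.
    MyClass() : miInt(), mdDouble(), mbBool() {}

    int miInt;
    double mdDouble;
    bool mbBool;
};

// Constructs a MyClass in storage that was filled with 0xFF beforehand
// and returns a copy of the resulting object.
MyClass make_on_dirty_storage() {
    alignas(MyClass) unsigned char storage[sizeof(MyClass)];
    std::memset(storage, 0xFF, sizeof(storage));

    MyClass* p = new (storage) MyClass();  // placement new runs the constructor
    MyClass copy = *p;
    p->~MyClass();
    return copy;
}
```

The same holds for the free-standing forms `int()`, `double()` and `bool()`, which yield 0, 0.0 and false.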