I have a question regarding initialization of fairly large sets of static data.
See my three examples below of initializing sets of static data. I'd like to understand the program load time and memory footprint implications of the methods shown below; I don't currently know how to evaluate that on my own. My build environment is still on the desktop using Visual Studio; however, the embedded targets will be compiled for VxWorks using GCC.
Traditionally, I've used basic C-structs for this sort of thing, although there's good reason to move this data into C++ classes moving forward. Dynamic memory allocation is frowned upon in the embedded application and is avoided wherever possible.
As far as I know, the only way to initialize a C++ class is through its constructor, shown below in Method 2. I am wondering how that compares to Method 1. Is there any appreciable additional overhead in terms of ROM (program footprint), RAM (memory footprint), or program load time? It seems to me that the compiler may be able to optimize away the rather trivial constructor, but I'm not sure if that's common behavior.
I listed Method 3 because I've considered it, although it just seems like a bad idea. Is there something else I'm missing here? Does anyone else out there initialize data in a similar manner?
/* C-Style Struct Storage */
typedef struct {
    int a;
    int b;
} DATA_C;
/* CPP Style Class Storage */
class DATA_CPP {
public:
    int a;
    int b;
    DATA_CPP(int, int);
};

DATA_CPP::DATA_CPP(int aIn, int bIn) {
    a = aIn;
    b = bIn;
}
/* METHOD 1: Direct C-Style Static Initialization */
DATA_C MyCData[5] = { {1,2},
{3,4},
{5,6},
{7,8},
{9,10}
};
/* METHOD 2: Direct CPP-Style Initialization */
DATA_CPP MyCppData[5] = { DATA_CPP(1,2),
DATA_CPP(3,4),
DATA_CPP(5,6),
DATA_CPP(7,8),
DATA_CPP(9,10),
};
/* METHOD 3: Cast C-Struct to CPP class */
DATA_CPP* pMyCppData2 = (DATA_CPP*) MyCData;
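As an aside on Method 3: if that cast were ever kept, it would at least want compile-time layout guards. Here's a sketch using C++11 type traits (the types are repeated from above so the snippet is self-contained); note that the checks catch obvious mismatches but do not by themselves make the cast well-defined C++.

```cpp
#include <type_traits>

struct DATA_C { int a; int b; };

class DATA_CPP {
public:
    int a;
    int b;
    DATA_CPP(int aIn, int bIn) : a(aIn), b(bIn) {}
};

// Method 3 only stands a chance when both types agree on size and layout.
// These checks fail the build on the obvious mismatches; treat them as a
// guard, not a blessing.
static_assert(sizeof(DATA_C) == sizeof(DATA_CPP), "size mismatch");
static_assert(std::is_standard_layout<DATA_CPP>::value,
              "DATA_CPP must be standard-layout for the cast to be plausible");
```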
In C++11, you can write this:
DATA_CPP obj = {1,2}; // or simply: DATA_CPP obj{1,2}; i.e. omit the '='
instead of
DATA_CPP obj(1,2);
Extending this, you can write:
DATA_CPP MyCppData[5] = { {1,2},
{3,4},
{5,6},
{7,8},
{9,10},
};
instead of this:
DATA_CPP MyCppData[5] = { DATA_CPP(1,2),
DATA_CPP(3,4),
DATA_CPP(5,6),
DATA_CPP(7,8),
DATA_CPP(9,10),
};
Read about uniform initialization.
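To connect this back to the load-time concern in the question: with a C++11 compiler, marking the constructor constexpr asks the compiler to build the whole array at compile time, so no dynamic initializer runs at load. A sketch, assuming C++11 is available on the target toolchain:

```cpp
class DATA_CPP {
public:
    int a;
    int b;
    constexpr DATA_CPP(int aIn, int bIn) : a(aIn), b(bIn) {}
};

// Constant-initialized: the array is baked into the image like the plain
// C struct array, with no constructor calls at program load.
constexpr DATA_CPP MyCppData[5] = { {1, 2}, {3, 4}, {5, 6}, {7, 8}, {9, 10} };
```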
I've done a bit of research into this and thought I'd post the results. I used Visual Studio 2008 in all cases here.
Here's the disassembly view of the code from Visual Studio in Debug Mode:
/* METHOD 1: Direct C-Style Static Initialization */
DATA_C MyCData[5] = { {1,2},
{3,4},
{5,6},
{7,8},
{9,10},
};
/* METHOD 2: Direct CPP-Style Initialization */
DATA_CPP MyCppData[5] = { DATA_CPP(1,2),
010345C0 push ebp
010345C1 mov ebp,esp
010345C3 sub esp,0C0h
010345C9 push ebx
010345CA push esi
010345CB push edi
010345CC lea edi,[ebp-0C0h]
010345D2 mov ecx,30h
010345D7 mov eax,0CCCCCCCCh
010345DC rep stos dword ptr es:[edi]
010345DE push 2
010345E0 push 1
010345E2 mov ecx,offset MyCppData (1038184h)
010345E7 call DATA_CPP::DATA_CPP (103119Ah)
DATA_CPP(3,4),
010345EC push 4
010345EE push 3
010345F0 mov ecx,offset MyCppData+8 (103818Ch)
010345F5 call DATA_CPP::DATA_CPP (103119Ah)
DATA_CPP(5,6),
010345FA push 6
010345FC push 5
010345FE mov ecx,offset MyCppData+10h (1038194h)
01034603 call DATA_CPP::DATA_CPP (103119Ah)
DATA_CPP(7,8),
01034608 push 8
0103460A push 7
0103460C mov ecx,offset MyCppData+18h (103819Ch)
01034611 call DATA_CPP::DATA_CPP (103119Ah)
DATA_CPP(9,10),
01034616 push 0Ah
01034618 push 9
0103461A mov ecx,offset MyCppData+20h (10381A4h)
0103461F call DATA_CPP::DATA_CPP (103119Ah)
};
01034624 pop edi
01034625 pop esi
01034626 pop ebx
01034627 add esp,0C0h
0103462D cmp ebp,esp
0103462F call #ILT+325(__RTC_CheckEsp) (103114Ah)
01034634 mov esp,ebp
01034636 pop ebp
The interesting thing to note here is that there is definitely some overhead in program memory usage and load time, at least in a non-optimized debug build: Method 1 generates zero assembly instructions, while Method 2 generates about 44.
I also compiled the program in Release mode with optimization enabled; here is the abridged assembly output:
?MyCData##3PAUDATA_C##A DD 01H ; MyCData
DD 02H
DD 03H
DD 04H
DD 05H
DD 06H
DD 07H
DD 08H
DD 09H
DD 0aH
?MyCppData##3PAVDATA_CPP##A DD 01H ; MyCppData
DD 02H
DD 03H
DD 04H
DD 05H
DD 06H
DD 07H
DD 08H
DD 09H
DD 0aH
END
It seems the compiler indeed optimized away the calls to the C++ constructor; I could find no evidence of the constructor being called anywhere in the assembly code.
I thought I'd try something a bit more. I changed the constructor to:
DATA_CPP::DATA_CPP(int aIn, int bIn){
a = aIn + bIn;
b = bIn;
}
Again, the compiler optimized this away, resulting in a static dataset:
?MyCppData##3PAVDATA_CPP##A DD 03H ; MyCppData
DD 02H
DD 07H
DD 04H
DD 0bH
DD 06H
DD 0fH
DD 08H
DD 013H
DD 0aH
END
Interestingly, the compiler was able to evaluate the constructor code on all of the static data at compile time and emit a static dataset, again without ever calling the constructor.
I thought I'd push a bit further still and operate on a global variable in the constructor:
int globalvar;
DATA_CPP::DATA_CPP(int aIn, int bIn){
a = aIn + globalvar;
globalvar += a;
b = bIn;
}
And in this case, the compiler now generated assembly code to call the constructor during initialization:
??__EMyCppData##YAXXZ PROC ; `dynamic initializer for 'MyCppData'', COMDAT
; 35 : DATA_CPP MyCppData[5] = { DATA_CPP(1,2),
00000 a1 00 00 00 00 mov eax, DWORD PTR ?globalvar##3HA ; globalvar
00005 8d 48 01 lea ecx, DWORD PTR [eax+1]
00008 03 c1 add eax, ecx
0000a 89 0d 00 00 00
00 mov DWORD PTR ?MyCppData##3PAVDATA_CPP##A, ecx
; 36 : DATA_CPP(3,4),
00010 8d 48 03 lea ecx, DWORD PTR [eax+3]
00013 03 c1 add eax, ecx
00015 89 0d 08 00 00
00 mov DWORD PTR ?MyCppData##3PAVDATA_CPP##A+8, ecx
; 37 : DATA_CPP(5,6),
0001b 8d 48 05 lea ecx, DWORD PTR [eax+5]
0001e 03 c1 add eax, ecx
00020 89 0d 10 00 00
00 mov DWORD PTR ?MyCppData##3PAVDATA_CPP##A+16, ecx
; 38 : DATA_CPP(7,8),
00026 8d 48 07 lea ecx, DWORD PTR [eax+7]
00029 03 c1 add eax, ecx
0002b 89 0d 18 00 00
00 mov DWORD PTR ?MyCppData##3PAVDATA_CPP##A+24, ecx
; 39 : DATA_CPP(9,10),
00031 8d 48 09 lea ecx, DWORD PTR [eax+9]
00034 03 c1 add eax, ecx
00036 89 0d 20 00 00
00 mov DWORD PTR ?MyCppData##3PAVDATA_CPP##A+32, ecx
0003c a3 00 00 00 00 mov DWORD PTR ?globalvar##3HA, eax ; globalvar
; 40 : };
00041 c3 ret 0
??__EMyCppData##YAXXZ ENDP ; `dynamic initializer for 'MyCppData''
FYI, I found this page helpful in setting up visual studio to output assembly:
How do I get the assembler output from a C file in VS2005
I need to convert an array of bytes to an array of floats. I get the bytes through a network connection and then need to parse them into floats. The size of the array is not predefined.
This is the code I have so far, using unions. Do you have any suggestions on how to make it run faster?
int offset = DATA_OFFSET - 1;
UStuff bb;
//Convert every 4 bytes to float using a union
for (int i = 0; i < NUM_OF_POINTS;i++){
//Going backwards - due to endianness
for (int j = offset + BYTE_FLOAT*i + BYTE_FLOAT ; j > offset + BYTE_FLOAT*i; --j)
{
bb.c[(offset + BYTE_FLOAT*i + BYTE_FLOAT)- j] = sample[j];
}
res.append(bb.f);
}
return res;
This is the union I use
union UStuff
{
float f;
unsigned char c[4];
};
Technically you are not allowed to type-pun through a union in C++, although you are allowed to do it in C. The behaviour of your code is undefined.
Setting aside this important point about behaviour: even then, you're assuming that a float is represented the same way on all machines, which it isn't. You might think a float is a 32-bit IEEE 754 little-endian block of data, but it doesn't have to be.
Sadly, the best solution will end up slower. Most serialisations of floating-point data are performed by converting to a string and back, which in your case pretty much solves the problem, since you can represent the strings as an array of unsigned char data. Your only remaining wrangle is figuring out the encoding of the data. Job done!
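If the wire format is known to be 32-bit IEEE 754 in a fixed byte order, there is also a well-defined binary route: assemble the bytes into a uint32_t and memcpy the bits into a float, which avoids the union type-punning problem entirely. A sketch, assuming little-endian data on the wire (DecodeFloats is an illustrative name, not from the question):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Decode count little-endian 32-bit IEEE 754 floats from a byte buffer.
// memcpy is well-defined in C++, unlike reading an inactive union member,
// and mainstream compilers lower it to a single load/store.
std::vector<float> DecodeFloats(const unsigned char* bytes, std::size_t count) {
    std::vector<float> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t bits =
            static_cast<std::uint32_t>(bytes[4 * i])           |
            static_cast<std::uint32_t>(bytes[4 * i + 1]) << 8  |
            static_cast<std::uint32_t>(bytes[4 * i + 2]) << 16 |
            static_cast<std::uint32_t>(bytes[4 * i + 3]) << 24;
        float f;
        std::memcpy(&f, &bits, sizeof f);  // well-defined bit copy
        out.push_back(f);
    }
    return out;
}
```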
Short Answer
#include <cstdint>
#define NOT_STUPID 1
#define ENDIAN NOT_STUPID
namespace _ {
inline uint32_t UI4Set(char byte_0, char byte_1, char byte_2, char byte_3) {
#if ENDIAN == NOT_STUPID
return byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
#else
return byte_3 | ((uint32_t)byte_2) << 8 | ((uint32_t)byte_1) << 16 |
((uint32_t)byte_0) << 24;
#endif
}
inline float FLTSet(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t flt = UI4Set(byte_0, byte_1, byte_2, byte_3);
return *reinterpret_cast<float*>(&flt);
}
/* Use this function to write directly to RAM and avoid the xmm
registers. */
inline uint32_t FLTSet(char byte_0, char byte_1, char byte_2, char byte_3,
float* destination) {
uint32_t value = UI4Set (byte_0, byte_1, byte_2, byte_3);
*reinterpret_cast<uint32_t*>(destination) = value;
return value;
}
} //< namespace _
using namespace _; //< #see Kabuki Toolkit
static float flt = FLTSet(0, 1, 2, 3);
int main () {
uint32_t flt_init = FLTSet (4, 5, 6, 7, &flt);
return 0;
}
//< This uses 4 extra bytes and doesn't use the xmm register
Long Answer
It is GENERALLY NOT recommended to use a union to convert between floating-point and integer, because to this day unions do not always generate optimal assembly code, and the other techniques are more explicit and may require less typing. Rather than repeating opinions from other StackOverflow posts on unions, we shall prove it with disassembly, using a modern compiler: Visual-C++ 2018.
The first thing we must know about optimizing floating-point algorithms is how the registers work. The core of the CPU is exclusively an integer processing unit, with coprocessors (i.e. extensions) for handling floating-point numbers. These load-store machines (LSM) can only work with integers, and they must use a separate set of registers for interacting with the floating-point coprocessors. On x86_64 these are the xmm registers, which are 128 bits wide and can process Single Instruction Multiple Data (SIMD). In C++, one way to load and store a floating-point register is:
int Foo(double foo) { return foo + *reinterpret_cast<double*>(&foo); }
int main() {
double foo = 1.0;
uint64_t bar = *reinterpret_cast<uint64_t*>(&foo);
return Foo(bar);
}
Now let's inspect the disassembly with Visual-C++ O2 optimizations on, because without them you get a pile of debug stack-frame variables. I had to add the function Foo to the example to keep the code from being optimized away.
double foo = 1.0;
uint64_t bar = *reinterpret_cast<uint64_t*>(&foo);
00007FF7482E16A0 mov rax,3FF0000000000000h
00007FF7482E16AA xorps xmm0,xmm0
00007FF7482E16AD cvtsi2sd xmm0,rax
return Foo(bar);
00007FF7482E16B2 addsd xmm0,xmm0
00007FF7482E16B6 cvttsd2si eax,xmm0
}
00007FF7482E16BA ret
As described, we can see that the LSM first moves the double value into an integer register; then it zeroes the xmm0 register with an xor, because the register is 128 bits wide and we're loading a 64-bit integer; then it loads the contents of the integer register into the floating-point register with the cvtsi2sd instruction; and finally the cvttsd2si instruction moves the value from xmm0 into the return register before returning.
So now let's address that concern about generating optimal assembly code using this test script and Visual-C++ 2018:
#include "stdafx.h"
#include <cstdint>
#include <cstdio>
static float foo = 0.0f;
void SetFooUnion(char byte_0, char byte_1, char byte_2, char byte_3) {
union {
float flt;
char bytes[4];
} u = {foo};
u.bytes[0] = byte_0;
u.bytes[1] = byte_1;
u.bytes[2] = byte_2;
u.bytes[3] = byte_3;
foo = u.flt;
}
void SetFooManually(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t faster_method = byte_0 | ((uint32_t)byte_1) << 8 |
((uint32_t)byte_2) << 16 | ((uint32_t)byte_3) << 24;
*reinterpret_cast<uint32_t*>(&foo) = faster_method;
}
namespace _ {
inline uint32_t UI4Set(char byte_0, char byte_1, char byte_2, char byte_3) {
return byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
}
inline float FLTSet(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t flt = UI4Set(byte_0, byte_1, byte_2, byte_3);
return *reinterpret_cast<float*>(&flt);
}
inline void FLTSet(char byte_0, char byte_1, char byte_2, char byte_3,
float* destination) {
uint32_t value = byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
*reinterpret_cast<uint32_t*>(destination) = value;
}
} // namespace _
int main() {
SetFooUnion(0, 1, 2, 3);
union {
float flt;
char bytes[4];
} u = {foo};
// Start union read tests
putchar(u.bytes[0]);
putchar(u.bytes[1]);
putchar(u.bytes[2]);
putchar(u.bytes[3]);
// Start union write tests
u.bytes[0] = 4;
u.bytes[2] = 5;
foo = u.flt;
// Start hand-coded tests
SetFooManually(6, 7, 8, 9);
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
putchar((char)(bar >> 8));
putchar((char)(bar >> 16));
putchar((char)(bar >> 24));
_::FLTSet (0, 1, 2, 3, &foo);
return 0;
}
Now after inspecting the O2-optimized disassembly, we have proven the compiler DOES NOT produce optimal code:
int main() {
00007FF6DB4A1000 sub rsp,28h
SetFooUnion(0, 1, 2, 3);
00007FF6DB4A1004 mov dword ptr [rsp+30h],3020100h
00007FF6DB4A100C movss xmm0,dword ptr [rsp+30h]
union {
float flt;
char bytes[4];
} u = {foo};
00007FF6DB4A1012 movss dword ptr [rsp+30h],xmm0
// Start union read tests
putchar(u.bytes[0]);
00007FF6DB4A1018 movsx ecx,byte ptr [u]
SetFooUnion(0, 1, 2, 3);
00007FF6DB4A101D movss dword ptr [foo (07FF6DB4A3628h)],xmm0
// Start union read tests
putchar(u.bytes[0]);
00007FF6DB4A1025 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[1]);
00007FF6DB4A102B movsx ecx,byte ptr [rsp+31h]
00007FF6DB4A1030 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[2]);
00007FF6DB4A1036 movsx ecx,byte ptr [rsp+32h]
00007FF6DB4A103B call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[3]);
00007FF6DB4A1041 movsx ecx,byte ptr [rsp+33h]
00007FF6DB4A1046 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
00007FF6DB4A104C mov ecx,6
// Start union write tests
u.bytes[0] = 4;
u.bytes[2] = 5;
foo = u.flt;
// Start hand-coded tests
SetFooManually(6, 7, 8, 9);
00007FF6DB4A1051 mov dword ptr [foo (07FF6DB4A3628h)],9080706h
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
00007FF6DB4A105B call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 8));
00007FF6DB4A1061 mov ecx,7
00007FF6DB4A1066 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 16));
00007FF6DB4A106C mov ecx,8
00007FF6DB4A1071 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 24));
00007FF6DB4A1077 mov ecx,9
00007FF6DB4A107C call qword ptr [__imp_putchar (07FF6DB4A2160h)]
return 0;
00007FF6DB4A1082 xor eax,eax
_::FLTSet(0, 1, 2, 3, &foo);
00007FF6DB4A1084 mov dword ptr [foo (07FF6DB4A3628h)],3020100h
}
00007FF6DB4A108E add rsp,28h
00007FF6DB4A1092 ret
Here is the raw main disassembly because the inlined functions are missing:
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.12.25831.0
include listing.inc
INCLUDELIB OLDNAMES
EXTRN __imp_putchar:PROC
EXTRN __security_check_cookie:PROC
?foo##3MA DD 01H DUP (?) ; foo
_BSS ENDS
PUBLIC main
PUBLIC ?SetFooManually##YAXDDDD#Z ; SetFooManually
PUBLIC ?SetFooUnion##YAXDDDD#Z ; SetFooUnion
EXTRN _fltused:DWORD
; COMDAT pdata
pdata SEGMENT
$pdata$main DD imagerel $LN8
DD imagerel $LN8+137
DD imagerel $unwind$main
pdata ENDS
; COMDAT xdata
xdata SEGMENT
$unwind$main DD 010401H
DD 04204H
xdata ENDS
; Function compile flags: /Ogtpy
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; COMDAT ?SetFooManually##YAXDDDD#Z
_TEXT SEGMENT
byte_0$dead$ = 8
byte_1$dead$ = 16
byte_2$dead$ = 24
byte_3$dead$ = 32
?SetFooManually##YAXDDDD#Z PROC ; SetFooManually, COMDAT
00000 c7 05 00 00 00
00 06 07 08 09 mov DWORD PTR ?foo##3MA, 151521030 ; 09080706H
0000a c3 ret 0
?SetFooManually##YAXDDDD#Z ENDP ; SetFooManually
_TEXT ENDS
; Function compile flags: /Ogtpy
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; COMDAT main
_TEXT SEGMENT
u$1 = 48
u$ = 48
main PROC ; COMDAT
$LN8:
00000 48 83 ec 28 sub rsp, 40 ; 00000028H
00004 c7 44 24 30 00
01 02 03 mov DWORD PTR u$1[rsp], 50462976 ; 03020100H
0000c f3 0f 10 44 24
30 movss xmm0, DWORD PTR u$1[rsp]
00012 f3 0f 11 44 24
30 movss DWORD PTR u$[rsp], xmm0
00018 0f be 4c 24 30 movsx ecx, BYTE PTR u$[rsp]
0001d f3 0f 11 05 00
00 00 00 movss DWORD PTR ?foo##3MA, xmm0
00025 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0002b 0f be 4c 24 31 movsx ecx, BYTE PTR u$[rsp+1]
00030 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00036 0f be 4c 24 32 movsx ecx, BYTE PTR u$[rsp+2]
0003b ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00041 0f be 4c 24 33 movsx ecx, BYTE PTR u$[rsp+3]
00046 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0004c b9 06 00 00 00 mov ecx, 6
00051 c7 05 00 00 00
00 06 07 08 09 mov DWORD PTR ?foo##3MA, 151521030 ; 09080706H
0005b ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00061 b9 07 00 00 00 mov ecx, 7
00066 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0006c b9 08 00 00 00 mov ecx, 8
00071 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00077 b9 09 00 00 00 mov ecx, 9
0007c ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00082 33 c0 xor eax, eax
00084 48 83 c4 28 add rsp, 40 ; 00000028H
00088 c3 ret 0
main ENDP
_TEXT ENDS
END
So what is the difference?
?SetFooUnion##YAXDDDD#Z PROC ; SetFooUnion, COMDAT
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; Line 7
mov BYTE PTR [rsp+32], r9b
; Line 14
mov DWORD PTR u$[rsp], 50462976 ; 03020100H
; Line 18
movss xmm0, DWORD PTR u$[rsp]
movss DWORD PTR ?foo##3MA, xmm0
; Line 19
ret 0
?SetFooUnion##YAXDDDD#Z ENDP ; SetFooUnion
versus:
?SetFooManually##YAXDDDD#Z PROC ; SetFooManually, COMDAT
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; Line 34
mov DWORD PTR ?foo##3MA, 151521030 ; 09080706H
; Line 35
ret 0
?SetFooManually##YAXDDDD#Z ENDP ; SetFooManually
The first thing to notice is the effect of the union on inline memory optimizations. Unions are designed to multiplex RAM for different purposes over different time periods in order to reduce RAM usage, which means the memory must remain coherent in RAM and is therefore less inlinable. The union code forced the compiler to write the union out to RAM, while the non-union method simply threw the code away and replaced it with a single mov DWORD PTR ?foo##3MA, 151521030 instruction, without touching the xmm0 register! The O2 optimizations automatically inlined both SetFooUnion and SetFooManually, but the non-union method inlined more code and used fewer RAM reads, as evidenced by the difference between the union method's line:
movsx ecx,byte ptr [rsp+31h]
versus the non-Union method's version:
mov ecx,7
The union version loads ecx from A POINTER TO RAM, while the other uses the faster single-cycle mov-immediate instruction. THAT IS A HUGE PERFORMANCE INCREASE! However, the union behavior may actually be desired when working with real-time systems and multi-threaded applications, where compiler optimizations can be unwanted and can upset timing; or you may want to use a mix of the two methods.
Aside from the potentially suboptimal RAM usage, I tried for several hours to get the compiler to generate genuinely bad assembly from a union, but I was not able to for most of my toy problems, so this seems to be a rather nifty feature of unions rather than a reason to shun them. My favorite C++ metaphor is that C++ is a kitchen full of sharp knives: you pick the correct knife for the job, and just because there are a lot of sharp knives in a kitchen doesn't mean you pull them all out at once, or leave them lying out. As long as you keep the kitchen tidy, you're not going to cut yourself. A union is a sharp knife that may help ensure greater RAM coherency, but it requires more typing and can slow the program down.
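For completeness, there is a third knife not shown above: std::memcpy. It is well-defined in both C and C++, and in my experience optimizing compilers lower it to the same single 32-bit mov as the hand-shifted reinterpret_cast version. A sketch (SetFooMemcpy and GetFoo are illustrative names, not part of the code above):

```cpp
#include <cstdint>
#include <cstring>

static float foo = 0.0f;

// memcpy-based store: no union, no reinterpret_cast, and modern compilers
// typically emit a single 32-bit mov for the whole function when inlined.
void SetFooMemcpy(char b0, char b1, char b2, char b3) {
    std::uint32_t bits =
        static_cast<std::uint32_t>(static_cast<unsigned char>(b0))       |
        static_cast<std::uint32_t>(static_cast<unsigned char>(b1)) << 8  |
        static_cast<std::uint32_t>(static_cast<unsigned char>(b2)) << 16 |
        static_cast<std::uint32_t>(static_cast<unsigned char>(b3)) << 24;
    std::memcpy(&foo, &bits, sizeof foo);  // well-defined bit copy
}

float GetFoo() { return foo; }
```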
I decided to benchmark an implementation of a swap function for simple types (like int, a struct, or a class that uses only simple types in its fields) with a static tmp variable in it, to avoid constructing a temporary in each swap call. So I wrote this simple test program:
#include <iostream>
#include <chrono>
#include <utility>
#include <vector>
template<typename T>
void mySwap(T& a, T& b) //Like std::swap - just for tests
{
T tmp = std::move(a);
a = std::move(b);
b = std::move(tmp);
}
template<typename T>
void mySwapStatic(T& a, T& b) //Here with static tmp
{
static T tmp;
tmp = std::move(a);
a = std::move(b);
b = std::move(tmp);
}
class Test1 { //Simple class with some simple types
int foo;
float bar;
char bazz;
};
class Test2 { //Class with std::vector in it
int foo;
float bar;
char bazz;
std::vector<int> bizz;
public:
Test2()
{
bizz = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
}
};
#define Test Test1 //choosing class
const static unsigned int NUM_TESTS = 100000000;
static Test a, b; //making it static to prevent throwing out from code by compiler optimizations
template<typename T, typename F>
auto test(unsigned int numTests, T& a, T& b, const F swapFunction ) //test function
{
std::chrono::system_clock::time_point t1, t2;
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
swapFunction(a, b);
}
t2 = std::chrono::system_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
}
int main()
{
std::chrono::system_clock::time_point t1, t2;
    std::cout << "Test 1. MySwap Result:\t\t" << test(NUM_TESTS, a, b, mySwap<Test>) << " nanoseconds\n"; //calling test function
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
mySwap<Test>(a, b);
}
t2 = std::chrono::system_clock::now();
    std::cout << "Test 2. MySwap2 Result:\t\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //This result is slightly better than 1. Why?!
std::cout << "Test 3. MySwapStatic Result:\t" << test(NUM_TESTS, a, b, mySwapStatic<Test>) << " nanoseconds\n"; //test function with mySwapStatic
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
mySwapStatic<Test>(a, b);
}
t2 = std::chrono::system_clock::now();
    std::cout << "Test 4. MySwapStatic2 Result:\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //And again - it's better than 3...
    std::cout << "Test 5. std::swap Result:\t" << test(NUM_TESTS, a, b, std::swap<Test>) << " nanoseconds\n"; //calling test function with std::swap for comparison. Mostly similar to 1...
return 0;
}
Some results with Test defined as Test1 (g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2 called as g++ main.cpp -O3 -std=c++11):
Test 1. MySwap Result: 625,105,480 nanoseconds
Test 2. MySwap2 Result: 528,701,547 nanoseconds
Test 3. MySwapStatic Result: 338,484,180 nanoseconds
Test 4. MySwapStatic2 Result: 228,228,156 nanoseconds
Test 5. std::swap Result: 564,863,184 nanoseconds
My main question: is it good to use this implementation for swapping simple types? I know that if you use it for swapping types containing vectors, for example, then std::swap is better, and you can see that just by changing the Test define to Test2.
Second question: why are the results in test 1, 2, 3, and 4 so different? What am I doing wrong with the test function implementation?
An answer to your second question first: in tests 2 and 4, the compiler is inlining the functions, which gives better performance (there is even more going on in test 4, which I will cover later).
Overall, it is probably a bad idea to use a static temp variable.
Why? First, it should be noted that in x86 assembly there is no instruction to copy from memory to memory. This means that when you swap, there are not one but two temporary variables, in CPU registers. These temporaries must be in CPU registers (you can't copy memory to memory), so a static variable adds a third memory location to transfer to and from.
One problem with your static temp is that it hinders inlining. Imagine the variables you swap are already in CPU registers. In that case, the compiler can inline the swap and never copy anything to memory, which is much faster. Now, if you force a static temp, the compiler either removes it (useless) or is forced to add a memory copy. That's what happens in test 4, in which GCC removed all the reads of the static variable; it just pointlessly writes updated values to it because you told it to do so. The removal of the reads explains the good performance gain, but it could be even faster.
Your test cases are flawed because they don't show this point.
Now you may ask: then why do my static functions perform better? I had no idea at first. (Answer at the end.)
I was curious, so I compiled your code with MSVC, and it turns out MSVC handles this well while GCC does something weird. At the O2 optimization level, MSVC detects that two consecutive swaps are a no-op and shortcuts them, and even at O1 its non-inlined code is faster than every GCC test case at O3. (EDIT: actually, MSVC is not doing it right either; see the explanation at the end.)
The assembly generated by MSVC does look better, but comparing the static and non-static assembly generated by GCC, I didn't know why the static version performs better.
Anyway, I think that even though GCC generates weird code, the inlining problem makes std::swap worth using: with bigger types the additional memory copy could be costly, and smaller types inline better.
Here is the assembly produced by all the test cases, in case someone has an idea of why the GCC static version performs better than the non-static one, despite being longer and using more memory moves. EDIT: answer at the end.
GCC non-static (perf 570ms):
00402F90 44 8B 01 mov r8d,dword ptr [rcx]
00402F93 F3 0F 10 41 04 movss xmm0,dword ptr [rcx+4]
00402F98 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F9C 4C 8B 0A mov r9,qword ptr [rdx]
00402F9F 4C 89 09 mov qword ptr [rcx],r9
00402FA2 44 0F B6 4A 08 movzx r9d,byte ptr [rdx+8]
00402FA7 44 88 49 08 mov byte ptr [rcx+8],r9b
00402FAB 44 89 02 mov dword ptr [rdx],r8d
00402FAE F3 0F 11 42 04 movss dword ptr [rdx+4],xmm0
00402FB3 88 42 08 mov byte ptr [rdx+8],al
GCC static and MSVC static (perf 275ms):
00402F10 48 8B 01 mov rax,qword ptr [rcx]
00402F13 48 89 05 66 11 00 00 mov qword ptr [404080h],rax
00402F1A 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F1E 88 05 64 11 00 00 mov byte ptr [404088h],al
00402F24 48 8B 02 mov rax,qword ptr [rdx]
00402F27 48 89 01 mov qword ptr [rcx],rax
00402F2A 0F B6 42 08 movzx eax,byte ptr [rdx+8]
00402F2E 88 41 08 mov byte ptr [rcx+8],al
00402F31 48 8B 05 48 11 00 00 mov rax,qword ptr [404080h]
00402F38 48 89 02 mov qword ptr [rdx],rax
00402F3B 0F B6 05 46 11 00 00 movzx eax,byte ptr [404088h]
00402F42 88 42 08 mov byte ptr [rdx+8],al
MSVC non-static (perf 215ms):
00000 f2 0f 10 02 movsdx xmm0, QWORD PTR [rdx]
00004 f2 0f 10 09 movsdx xmm1, QWORD PTR [rcx]
00008 44 8b 41 08 mov r8d, DWORD PTR [rcx+8]
0000c f2 0f 11 01 movsdx QWORD PTR [rcx], xmm0
00010 8b 42 08 mov eax, DWORD PTR [rdx+8]
00013 89 41 08 mov DWORD PTR [rcx+8], eax
00016 f2 0f 11 0a movsdx QWORD PTR [rdx], xmm1
0001a 44 89 42 08 mov DWORD PTR [rdx+8], r8d
std::swap versions are all identical to the non-static versions.
After having some fun investigating, I found the likely reason for the bad performance of the GCC non-static version. Modern processors have a feature called store-to-load forwarding: when a memory load matches a previous memory store, the memory operation is short-circuited to use the value already known. In this case GCC somehow uses an asymmetric load/store pattern for parameters A and B: A is copied using 4+4+1 bytes, while B is copied using 8+1 bytes. This means the first 8 bytes of the class won't be matched by store-to-load forwarding, losing a precious CPU optimization. To check this, I manually replaced the 8+1 copy with a 4+4+1 copy, and the performance went up as expected (code below). In the end, GCC is at fault for not considering this.
GCC patched code, longer but taking advantage of store forwarding (perf 220ms) :
00402F90 44 8B 01 mov r8d,dword ptr [rcx]
00402F93 F3 0F 10 41 04 movss xmm0,dword ptr [rcx+4]
00402F98 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F9C 4C 8B 0A mov r9,qword ptr [rdx]
00402F9F 4C 89 09 mov qword ptr [rcx],r9
00402F9C 44 8B 0A mov r9d,dword ptr [rdx]
00402F9F 44 89 09 mov dword ptr [rcx],r9d
00402FA2 44 8B 4A 04 mov r9d,dword ptr [rdx+4]
00402FA6 44 89 49 04 mov dword ptr [rcx+4],r9d
00402FAA 44 0F B6 4A 08 movzx r9d,byte ptr [rdx+8]
00402FAF 44 88 49 08 mov byte ptr [rcx+8],r9b
00402FB3 44 89 02 mov dword ptr [rdx],r8d
00402FB6 F3 0F 11 42 04 movss dword ptr [rdx+4],xmm0
00402FBB 88 42 08 mov byte ptr [rdx+8],al
Actually, this copy pattern (symmetric 4+4+1) is the right way to do it. In these tests we are only doing copies, in which case the MSVC version is without doubt the best. The problem is that in a real case the class members will be accessed individually, generating 4-byte reads and writes; MSVC's 8-byte batch copy (also generated by GCC for one argument) will then prevent store forwarding for the individual members. A new test I did with member operations alongside the copies shows that the patched 4+4+1 version indeed outperforms all the others, by a factor of nearly 2x. Sadly, no modern compiler generates this code.
Is there any difference in computational cost of
if(something){
return something;
}else{
return somethingElse;
}
and
if(something){
return something;
}
//else (put in comments for readibility purposes)
return somethingElse;
In theory we have an extra keyword (else), but it doesn't seem like it should make an actual difference.
Edit:
After running the code for different set sizes, I found that there actually is a difference: the code without else appears to be about 1.5% more efficient. But it most likely depends on the compiler, as stated by many people below. The code I tested it on:
int withoutElse(bool a){
if(a)
return 0;
return 1;
}
int withElse(bool a){
if(a)
return 0;
else
return 1;
}
int main(){
using namespace std;
bool a=true;
clock_t begin,end;
begin= clock();
for(__int64 i=0;i<1000000000;i++){
a=!a;
withElse(a);
}
end = clock();
cout<<end-begin<<endl;
begin= clock();
for(__int64 i=0;i<1000000000;i++){
a=!a;
withoutElse(a);
}
end = clock();
cout<<end-begin<<endl;
return 0;
}
Checked with loop counts from 1,000,000 to 1,000,000,000, and the results were consistently different.
Edit 2:
The assembly code (once again, generated using Visual Studio 2010) also shows a small difference (apparently I'm no good with assembly :()
?withElse##YAH_N#Z PROC ; withElse, COMDAT
; Line 12
push ebp
mov ebp, esp
sub esp, 192 ; 000000c0H
push ebx
push esi
push edi
lea edi, DWORD PTR [ebp-192]
mov ecx, 48 ; 00000030H
mov eax, -858993460 ; ccccccccH
rep stosd
; Line 13
movzx eax, BYTE PTR _a$[ebp]
test eax, eax
je SHORT $LN2#withElse
; Line 14
xor eax, eax
jmp SHORT $LN3#withElse
; Line 15
jmp SHORT $LN3#withElse
$LN2#withElse:
; Line 16
mov eax, 1
$LN3#withElse:
; Line 17
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
ret 0
?withElse##YAH_N#Z ENDP ; withElse
and
?withoutElse##YAH_N#Z PROC ; withoutElse, COMDAT
; Line 4
push ebp
mov ebp, esp
sub esp, 192 ; 000000c0H
push ebx
push esi
push edi
lea edi, DWORD PTR [ebp-192]
mov ecx, 48 ; 00000030H
mov eax, -858993460 ; ccccccccH
rep stosd
; Line 5
movzx eax, BYTE PTR _a$[ebp]
test eax, eax
je SHORT $LN1#withoutEls
; Line 6
xor eax, eax
jmp SHORT $LN2#withoutEls
$LN1#withoutEls:
; Line 7
mov eax, 1
$LN2#withoutEls:
; Line 9
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
ret 0
?withoutElse##YAH_N#Z ENDP ; withoutElse
In general the two forms are different, but the compiler may decide to emit the same jump in both cases (in practice it almost always will).
The best way to see what a compiler does is reading the assembler. Assuming that you are using gcc you can try with
gcc -g -c -fverbose-asm myfile.c; objdump -d -M intel -S myfile.o > myfile.s
which creates a mix of assembler/c code and makes the job easier at the beginning.
As for your example it is:
CASE1
if(something){
23: 83 7d fc 00 cmp DWORD PTR [ebp-0x4],0x0
27: 74 05 je 2e <main+0x19>
return something;
29: 8b 45 fc mov eax,DWORD PTR [ebp-0x4]
2c: eb 05 jmp 33 <main+0x1e>
}else{
return 0;
2e: b8 00 00 00 00 mov eax,0x0
}
CASE2
if(something){
23: 83 7d fc 00 cmp DWORD PTR [ebp-0x4],0x0
27: 74 05 je 2e <main+0x19>
return something;
29: 8b 45 fc mov eax,DWORD PTR [ebp-0x4]
2c: eb 05 jmp 33 <main+0x1e>
return 0;
2e: b8 00 00 00 00 mov eax,0x0
As you can see, there are no differences!
Keep in mind that once the code gets compiled, all ifs, elses, and loops are turned into gotos.
if (cond) { code A } code B
turns into
if cond is false, jump to code B
code A
code B
and
if (cond) { code A } else { code B } code C
turns into
if cond is false, jump to code B
code A
ALWAYS jump to code C
code B
code C
Most processors 'guess' whether they're going to take a jump before they actually evaluate the condition (branch prediction). Depending on the processor, a wrong guess can cost performance.
So the answer is YES! The else version needs an extra ALWAYS jump at the end of the first branch, which can take 2-3 cycles and isn't present in the version without else.
After some searching on our friend google, I could not get a clear view on the following point.
I'm used to calling class members with this->. Even when it's not needed, I find it more explicit, and it helps when maintaining some heavy piece of algorithm with loads of vars.
As I'm working on a supposed-to-be-optimised algorithm, I was wondering whether using this-> alters runtime performance or not.
Does it?
No, the call is exactly the same in both cases.
It doesn't make any difference. Here's a demonstration with GCC. The source is a simple class, but I've restricted this post to the difference for clarity.
% diff -s with-this.cpp without-this.cpp
7c7
< this->x = 5;
---
> x = 5;
% g++ -c with-this.cpp without-this.cpp
% diff -s with-this.o without-this.o
Files with-this.o and without-this.o are identical
The answer has been given by zennehoy; here's the assembly code (generated by the Microsoft C++ compiler) for a simple test class:
class C
{
int n;
public:
void boo(){n = 1;}
void goo(){this->n = 2;}
};
int main()
{
C c;
c.boo();
c.goo();
return 0;
}
The Disassembly window in Visual Studio shows that the assembly code is the same for both functions:
class C
{
int n;
public:
void boo(){n = 1;}
001B2F80 55 push ebp
001B2F81 8B EC mov ebp,esp
001B2F83 81 EC CC 00 00 00 sub esp,0CCh
001B2F89 53 push ebx
001B2F8A 56 push esi
001B2F8B 57 push edi
001B2F8C 51 push ecx
001B2F8D 8D BD 34 FF FF FF lea edi,[ebp-0CCh]
001B2F93 B9 33 00 00 00 mov ecx,33h
001B2F98 B8 CC CC CC CC mov eax,0CCCCCCCCh
001B2F9D F3 AB rep stos dword ptr es:[edi]
001B2F9F 59 pop ecx
001B2FA0 89 4D F8 mov dword ptr [ebp-8],ecx
001B2FA3 8B 45 F8 mov eax,dword ptr [this]
001B2FA6 C7 00 01 00 00 00 mov dword ptr [eax],1
001B2FAC 5F pop edi
001B2FAD 5E pop esi
001B2FAE 5B pop ebx
001B2FAF 8B E5 mov esp,ebp
001B2FB1 5D pop ebp
001B2FB2 C3 ret
...
--- ..\main.cpp -----------------------------
void goo(){this->n = 2;}
001B2FC0 55 push ebp
001B2FC1 8B EC mov ebp,esp
001B2FC3 81 EC CC 00 00 00 sub esp,0CCh
001B2FC9 53 push ebx
001B2FCA 56 push esi
001B2FCB 57 push edi
001B2FCC 51 push ecx
001B2FCD 8D BD 34 FF FF FF lea edi,[ebp-0CCh]
001B2FD3 B9 33 00 00 00 mov ecx,33h
001B2FD8 B8 CC CC CC CC mov eax,0CCCCCCCCh
001B2FDD F3 AB rep stos dword ptr es:[edi]
001B2FDF 59 pop ecx
001B2FE0 89 4D F8 mov dword ptr [ebp-8],ecx
001B2FE3 8B 45 F8 mov eax,dword ptr [this]
001B2FE6 C7 00 02 00 00 00 mov dword ptr [eax],2
001B2FEC 5F pop edi
001B2FED 5E pop esi
001B2FEE 5B pop ebx
001B2FEF 8B E5 mov esp,ebp
001B2FF1 5D pop ebp
001B2FF2 C3 ret
And the code in the main:
C c;
c.boo();
001B2F0E 8D 4D F8 lea ecx,[c]
001B2F11 E8 00 E4 FF FF call C::boo (1B1316h)
c.goo();
001B2F16 8D 4D F8 lea ecx,[c]
001B2F19 E8 29 E5 FF FF call C::goo (1B1447h)
The Microsoft compiler uses the __thiscall calling convention by default for member function calls, and the this pointer is passed in the ECX register.
There are several layers involved in the compilation of a language.
The difference between accessing a member as member, this->member, MyClass::member, etc. is a syntactic one.
More precisely, it's a matter of name lookup: how the front-end of the compiler "finds" the exact element you are referring to. Therefore, you might speed up compilation by being more precise... though the effect will be unnoticeable (there are much more time-consuming tasks in C++, like opening all those includes).
Since (in this case) you are referring to the same element, it should not matter.
Now, an interesting parallel can be drawn with interpreted languages. In an interpreted language, the name lookup is delayed until the moment the line (or function) is executed. Therefore, it could have an impact at runtime (though once again, probably not a really noticeable one).
For some reason, returning a string from a DLL function crashes my program at runtime with the error Unhandled exception at 0x775dfbae in Cranberry Library Tester.exe: Microsoft C++ exception: std::out_of_range at memory location 0x001ef604...
I have verified it's not a problem with the function itself by compiling the DLL code as an .exe and doing a few simple tests in the main function.
Functions with other return types (int, double, etc.) work perfectly.
Why does this happen?
Is there a way to work around this behavior?
Source code for DLL:
// Library.h
#include <string>
std::string GetGreeting();
// Library.cpp
#include "Library.h"
std::string GetGreeting()
{
return "Hello, world!";
}
Source code for tester:
// Tester.cpp
#include <iostream>
#include <Library.h>
int main()
{
std::cout << GetGreeting();
}
EDIT: I'm using VS2010.
Conclusion
A workaround is to make sure the library and source are compiled using the same compiler with the same options, etc.
This occurs because you're allocating memory in one DLL (using std::string's constructor) and freeing it in another. You can't do that, because each DLL typically sets up its own heap.
Since your error message indicates you're using Microsoft C++ I'll offer an MS specific answer.
As long as you compile both the EXE and the DLL with the SAME compiler, and both link to the SAME version of the runtime DYNAMICALLY, then you'll be just fine. For example, using "Multi-threaded DLL" for both.
If you link against the runtime statically, or link against different versions of the runtime, then you're SOL for the reasons @Billy ONeal points out (memory will be allocated in one heap and freed in another).
You may return a std::string from a DLL as long as you inline the function that creates the string. This is done, for example, by the Qt toolkit in their QString::toStdString method:
inline std::string QString::toStdString() const
{ const QByteArray asc = toAscii(); return std::string(asc.constData(), asc.length()); }
The inline function is compiled into the program that uses the DLL, not when the DLL itself is compiled. This avoids any problems caused by incompatible runtimes or compiler switches.
So in fact, you should only return plain types from your DLL (such as const char * and int) and add inline functions to convert them to std::string.
Passing in a std::string& parameter may also be unsafe, because strings are often implemented as copy-on-write.
Update:
I compiled your sample in both VS2008 and VS2010 and was able to compile and execute it without a problem. I compiled the library both as a static and as a dynamic library.
Original:
The following relates to my discussion with bdk and imaginaryboy. I did not delete it as it might be of some interest to someone.
OK, this question really bothers me, because it looks like the string is being passed by value, not by reference. There do not appear to be any objects created on the heap; it looks to be entirely stack-based.
I did a quick test to check how objects are passed in Visual Studio (compiled in release mode, without link-time optimization and with optimization disabled).
The Code:
class Foo {
int i, j;
public:
Foo() {}
Foo(int i, int j) : i(i), j(j) { }
};
Foo builder(int i, int j)
{
Foo built(i, j);
return built;
}
int main()
{
int i = sizeof(Foo);
int j = sizeof(int);
Foo buildMe;
buildMe = builder(i, j);
//std::string test = GetGreeting();
//std::cout << test;
return 0;
}
The Disassembly:
int main()
{
00AD1030 push ebp
00AD1031 mov ebp,esp
00AD1033 sub esp,18h
int i = sizeof(Foo);
00AD1036 mov dword ptr [i],8
int j = sizeof(int);
00AD103D mov dword ptr [j],4
Foo buildMe;
buildMe = builder(i, j);
00AD1044 mov eax,dword ptr [j] ;param j
00AD1047 push eax
00AD1048 mov ecx,dword ptr [i] ;param i
00AD104B push ecx
00AD104C lea edx,[ebp-18h] ;buildMe
00AD104F push edx
00AD1050 call builder (0AD1000h)
00AD1055 add esp,0Ch
00AD1058 mov ecx,dword ptr [eax] ;builder i
00AD105A mov edx,dword ptr [eax+4] ;builder j
00AD105D mov dword ptr [buildMe],ecx
00AD1060 mov dword ptr [ebp-8],edx
return 0;
00AD1063 xor eax,eax
}
00AD1065 mov esp,ebp
00AD1067 pop ebp
00AD1068 ret
Foo builder(int i, int j)
{
01041000 push ebp
01041001 mov ebp,esp
01041003 sub esp,8
Foo built(i, j);
01041006 mov eax,dword ptr [i]
01041009 mov dword ptr [built],eax ;ebp-8 built i
0104100C mov ecx,dword ptr [j]
0104100F mov dword ptr [ebp-4],ecx ;ebp-4 built j
return built;
01041012 mov edx,dword ptr [ebp+8] ;buildMe
01041015 mov eax,dword ptr [built]
01041018 mov dword ptr [edx],eax ;buildMe (i)
0104101A mov ecx,dword ptr [ebp-4]
0104101D mov dword ptr [edx+4],ecx ;buildMe (j)
01041020 mov eax,dword ptr [ebp+8]
}
The Stack:
0x003DF964 08 00 00 00 ....
0x003DF968 04 00 00 00 ....
0x003DF96C 98 f9 3d 00 ˜ù=.
0x003DF970 55 10 ad 00 U..
0x003DF974 80 f9 3d 00 €ù=.
0x003DF978 08 00 00 00 .... ;builder i param
0x003DF97C 04 00 00 00 .... ;builder j param
0x003DF980 08 00 00 00 .... ;builder return j
0x003DF984 04 00 00 00 .... ;builder return i
0x003DF988 04 00 00 00 .... ;j
0x003DF98C 08 00 00 00 .... ;buildMe i param
0x003DF990 04 00 00 00 .... ;buildMe j param
0x003DF994 08 00 00 00 .... ;i
0x003DF998 dc f9 3d 00 Üù=. ;esp
Why it applies:
Even if the code is in a separate DLL, the string that is returned is copied by value into the caller's stack. There is a hidden parameter that passes the object to GetGreeting(). I do not see any heap allocation happening, and I do not see the heap having anything to do with the problem.
int main()
{
01021020 push ebp
01021021 mov ebp,esp
01021023 push 0FFFFFFFFh
01021025 push offset __ehhandler$_main (10218A9h)
0102102A mov eax,dword ptr fs:[00000000h]
01021030 push eax
01021031 sub esp,24h
01021034 mov eax,dword ptr [___security_cookie (1023004h)]
01021039 xor eax,ebp
0102103B mov dword ptr [ebp-10h],eax
0102103E push eax
0102103F lea eax,[ebp-0Ch]
01021042 mov dword ptr fs:[00000000h],eax
std::string test = GetGreeting();
01021048 lea eax,[ebp-2Ch] ;mov test string to eax
0102104B push eax
0102104C call GetGreeting (1021000h)
01021051 add esp,4
01021054 mov dword ptr [ebp-4],0
std::cout << test;
0102105B lea ecx,[ebp-2Ch]
0102105E push ecx
0102105F mov edx,dword ptr [__imp_std::cout (1022038h)]
01021065 push edx
01021066 call dword ptr [__imp_std::operator<<<char,std::char_traits<char>,std::allocator<char> > (102203Ch)]
0102106C add esp,8
return 0;
0102106F mov dword ptr [ebp-30h],0
01021076 mov dword ptr [ebp-4],0FFFFFFFFh
0102107D lea ecx,[ebp-2Ch]
01021080 call dword ptr [__imp_std::basic_string<char,std::char_traits<char>,std::allocator<char> >::~basic_string<char,std::char_traits<char>,std::allocator<char> > (1022044h)]
01021086 mov eax,dword ptr [ebp-30h]
}
std::string GetGreeting()
{
01021000 push ecx ;ret + 4
01021001 push esi ;ret + 4 + 4 = 0C
01021002 mov esi,dword ptr [esp+0Ch] ; this is test string
std::string greet("Hello, world!");
01021006 push offset string "Hello, world!" (1022124h)
0102100B mov ecx,esi
0102100D mov dword ptr [esp+8],0
01021015 call dword ptr [__imp_std::basic_string<char,std::char_traits<char>,std::allocator<char> >::basic_string<char,std::char_traits<char>,std::allocator<char> > (1022040h)]
return greet;
0102101B mov eax,esi ;put back test
0102101D pop esi
//return "Hello, world!";
}
I had such a problem; it was solved by:
using the runtime library as a DLL for both the application and the DLL (so that allocation is handled in a single place)
making sure all project settings (compiler, linker) are the same in both projects
Just like snmacdonald, I was unable to repro your crash under normal circumstances. Others already mentioned:
Configuration Properties -> Code Generation -> Runtime Library must be exactly the same
If I change one of them, I get your crash. I have both the DLL and the EXE set to Multi-threaded DLL.
As user295190 and Adam said, it works fine with exactly the same compilation settings.
For example, in Qt, QString::toStdString() returns a std::string, and you can use it in your EXE from QtCore.dll.
It will crash if the DLL and the EXE have different compile and link settings — for example, the DLL linked against the /MD CRT and the EXE linked against the /MT CRT.