I stumbled over some code that appears to be correct.
It is supposed to provide a public, immutable pointer, while keeping the modifiable non-const pointer private.
Strangely enough, this code broke on SN C++ Compiler (for PlayStation 3), but worked fine on GCC. On SN C++, data would point to bogus values, while m_data would work as intended.
Code in question:
#include <cstdint>
class Foo
{
public:
Foo() : data((const std::uint8_t* const&)m_data)
{
m_data = nullptr; // Set later in some other member function.
}
const std::uint8_t* const &data;
private:
std::uint8_t* m_data;
};
Does this code invoke undefined behavior? As far as I know, casting to a reference like that will convert (const std::uint8_t* const&)m_data into *reinterpret_cast<const std::uint8_t* const*>(&m_data).
Test cases:
Foo* new_foo() { return new Foo; }
and looking at the generated disassembly. Note that it is PowerPC 64-bit with 32-bit longs and pointers.
SN C++: ps3ppusnc -o test-sn.o -O3 -c test.cpp
0000000000000000 <._Z7new_foov>:
0: f8 21 ff 81 stdu r1,-128(r1) # ffffff80
4: 7c 08 02 a6 mflr r0
8: f8 01 00 90 std r0,144(r1) # 90
c: fb e1 00 78 std r31,120(r1) # 78
10: 38 60 00 08 li r3,8
14: 3b e0 00 00 li r31,0
18: 48 00 00 01 bl 18 <._Z7new_foov+0x18>
1c: 60 00 00 00 nop
20: 2c 03 00 00 cmpwi r3,0
24: 41 82 00 38 beq 5c <._Z7new_foov+0x5c>
28: 30 81 00 70 addic r4,r1,112 # 70
2c: 93 e3 00 04 stw r31,4(r3) <-- Set m_data to r31 (0).
30: 60 7f 00 00 ori r31,r3,0
34: 90 83 00 00 stw r4,0(r3) <-- Set data to r4 (r1 + 112 (On stack)?!)
38: 63 e3 00 00 ori r3,r31,0
3c: e8 01 00 90 ld r0,144(r1) # 90
40: 7c 08 03 a6 mtlr r0
44: eb e1 00 78 ld r31,120(r1) # 78
48: 38 21 00 80 addi r1,r1,128 # 80
4c: 4e 80 00 20 blr
GCC 4.1.1: ppu-lv2-g++ -o test-gcc.o -O3 -c test.cpp
0000000000000000 <._Z7new_foov>:
0: 38 60 00 08 li r3,8
4: 7c 08 02 a6 mflr r0
8: f8 21 ff 91 stdu r1,-112(r1) # ffffff90
c: f8 01 00 80 std r0,128(r1) # 80
10: 48 00 00 01 bl 10 <._Z7new_foov+0x10>
14: 60 00 00 00 nop
18: 7c 69 1b 78 mr r9,r3
1c: 38 00 00 00 li r0,0
20: 39 63 00 04 addi r11,r3,4 <-- Compute address of m_data
24: 78 63 00 20 clrldi r3,r3,32 # 20
28: 90 09 00 04 stw r0,4(r9) <-- Set m_data to r0 (0).
2c: e8 01 00 80 ld r0,128(r1) # 80
30: 38 21 00 70 addi r1,r1,112 # 70
34: 91 69 00 00 stw r11,0(r9) <-- Set data reference to m_data.
38: 7c 08 03 a6 mtlr r0
3c: 4e 80 00 20 blr
You'll have to fight through the spec language in 4.4/4 (of C++11), but I believe that 3.10/10 permits this. It says that an object may be aliased through "a type similar to the dynamic type of the object".
In this case, the dynamic type of the object is std::uint8_t*, and the similar type is const std::uint8_t* const. I think. Check 4.4/4 for yourself.
[Update: C++03 doesn't mention "similar" types in 3.10/15, so it may be that you're in trouble on C++03, which presumably is what SNC works with.]
There's a second thing that needs to be checked, which is whether it's OK to initialize the reference data by binding it to an object that hasn't been initialized yet (m_data). Intuitively that seems OK, since the reference to the uninitialized m_data is never converted to an rvalue. It's easily fixed, anyway.
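One hedged sketch of such a fix (assuming the goal is simply read-only public access, and replacing the reference member with an accessor, which is not the original author's exact design) sidesteps both the aliasing question and the initialization-order question:

```cpp
#include <cstdint>

class Foo
{
public:
    Foo() : m_data(nullptr) {}
    // Read-only public access; no reference member, no cast,
    // and no dependence on member initialization order.
    const std::uint8_t* data() const { return m_data; }
    void set(std::uint8_t* p) { m_data = p; } // stands in for "set later"
private:
    std::uint8_t* m_data;
};
```

The accessor compiles to the same single load under optimization, so nothing is lost over the reference-member trick.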
Related
Will (or can?) a switch statement based on a template argument be removed by the compiler?
See for example the following function, in which the template argument funcStep1 is used to select the appropriate function. Will this switch statement be removed, since its argument is known at compile time? I tried to work this out from the assembly code (given below), but I have no experience reading assembly yet, so this is not really feasible for me.
This question has now become relevant for me since the newly introduced if constexpr provides an alternative that is guaranteed to be evaluated at compile time.
template<int funcStep1>
double Mdp::expectedCost(int sdx0, int x0, int x1, int adx0, int adx1) const
{
switch (funcStep1)
{
case 1:
case 3:
return expectedCost_exact(sdx0, x0, x1, adx0, adx1);
case 2:
case 4:
return expectedCost_approx(sdx0, x0, x1, adx0, adx1);
default:
throw string("Unknown value (Mdp::expectedCost.cc)");
}
}
// Define all the template types we need
template double Mdp::expectedCost<1>(int, int, int, int, int) const;
template double Mdp::expectedCost<2>(int, int, int, int, int) const;
template double Mdp::expectedCost<3>(int, int, int, int, int) const;
template double Mdp::expectedCost<4>(int, int, int, int, int) const;
Here you find the output of 'objdump -D' when the above function is compiled with gcc -O2 -ffunction-sections:
expectedCost.o: file format elf64-x86-64
Disassembly of section .group:
0000000000000000 <.group>:
0: 01 00 add %eax,(%rax)
2: 00 00 add %al,(%rax)
4: 08 00 or %al,(%rax)
6: 00 00 add %al,(%rax)
8: 09 00 or %eax,(%rax)
...
Disassembly of section .group:
0000000000000000 <.group>:
0: 01 00 add %eax,(%rax)
2: 00 00 add %al,(%rax)
4: 0a 00 or (%rax),%al
6: 00 00 add %al,(%rax)
8: 0b 00 or (%rax),%eax
...
Disassembly of section .group:
0000000000000000 <.group>:
0: 01 00 add %eax,(%rax)
2: 00 00 add %al,(%rax)
4: 0c 00 or $0x0,%al
6: 00 00 add %al,(%rax)
8: 0d .byte 0xd
9: 00 00 add %al,(%rax)
...
Disassembly of section .group:
0000000000000000 <.group>:
0: 01 00 add %eax,(%rax)
2: 00 00 add %al,(%rax)
4: 0e (bad)
5: 00 00 add %al,(%rax)
7: 00 0f add %cl,(%rdi)
9: 00 00 add %al,(%rax)
...
Disassembly of section .bss:
0000000000000000 <_ZStL8__ioinit>:
...
Disassembly of section .text._ZNK3Mdp12expectedCostILi1EEEdiiiii:
0000000000000000 <_ZNK3Mdp12expectedCostILi1EEEdiiiii>:
0: e9 00 00 00 00 jmpq 5 <_ZNK3Mdp12expectedCostILi1EEEdiiiii+0x5>
Disassembly of section .text._ZNK3Mdp12expectedCostILi2EEEdiiiii:
0000000000000000 <_ZNK3Mdp12expectedCostILi2EEEdiiiii>:
0: e9 00 00 00 00 jmpq 5 <_ZNK3Mdp12expectedCostILi2EEEdiiiii+0x5>
Disassembly of section .text._ZNK3Mdp12expectedCostILi3EEEdiiiii:
0000000000000000 <_ZNK3Mdp12expectedCostILi3EEEdiiiii>:
0: e9 00 00 00 00 jmpq 5 <_ZNK3Mdp12expectedCostILi3EEEdiiiii+0x5>
Disassembly of section .text._ZNK3Mdp12expectedCostILi4EEEdiiiii:
0000000000000000 <_ZNK3Mdp12expectedCostILi4EEEdiiiii>:
0: e9 00 00 00 00 jmpq 5 <_ZNK3Mdp12expectedCostILi4EEEdiiiii+0x5>
Disassembly of section .text.startup._GLOBAL__sub_I_expectedCost.cc:
0000000000000000 <_GLOBAL__sub_I_expectedCost.cc>:
0: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 7 <_GLOBAL__sub_I_expectedCost.cc+0x7>
7: 48 83 ec 08 sub $0x8,%rsp
b: e8 00 00 00 00 callq 10 <_GLOBAL__sub_I_expectedCost.cc+0x10>
10: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 17 <_GLOBAL__sub_I_expectedCost.cc+0x17>
17: 48 8d 15 00 00 00 00 lea 0x0(%rip),%rdx # 1e <_GLOBAL__sub_I_expectedCost.cc+0x1e>
1e: 48 8d 35 00 00 00 00 lea 0x0(%rip),%rsi # 25 <_GLOBAL__sub_I_expectedCost.cc+0x25>
25: 48 83 c4 08 add $0x8,%rsp
29: e9 00 00 00 00 jmpq 2e <_GLOBAL__sub_I_expectedCost.cc+0x2e>
Disassembly of section .init_array:
0000000000000000 <.init_array>:
...
Disassembly of section .comment:
0000000000000000 <.comment>:
0: 00 47 43 add %al,0x43(%rdi)
3: 43 3a 20 rex.XB cmp (%r8),%spl
6: 28 55 62 sub %dl,0x62(%rbp)
9: 75 6e jne 79 <_ZStL8__ioinit+0x79>
b: 74 75 je 82 <_ZStL8__ioinit+0x82>
d: 20 37 and %dh,(%rdi)
f: 2e 33 2e xor %cs:(%rsi),%ebp
12: 30 2d 32 37 75 62 xor %ch,0x62753732(%rip) # 6275374a <_ZStL8__ioinit+0x6275374a>
18: 75 6e jne 88 <_ZStL8__ioinit+0x88>
1a: 74 75 je 91 <_ZStL8__ioinit+0x91>
1c: 31 7e 31 xor %edi,0x31(%rsi)
1f: 38 2e cmp %ch,(%rsi)
21: 30 34 29 xor %dh,(%rcx,%rbp,1)
24: 20 37 and %dh,(%rdi)
26: 2e 33 2e xor %cs:(%rsi),%ebp
29: 30 00 xor %al,(%rax)
Disassembly of section .eh_frame:
0000000000000000 <.eh_frame>:
0: 14 00 adc $0x0,%al
2: 00 00 add %al,(%rax)
4: 00 00 add %al,(%rax)
6: 00 00 add %al,(%rax)
8: 01 7a 52 add %edi,0x52(%rdx)
b: 00 01 add %al,(%rcx)
d: 78 10 js 1f <.eh_frame+0x1f>
f: 01 1b add %ebx,(%rbx)
11: 0c 07 or $0x7,%al
13: 08 90 01 00 00 10 or %dl,0x10000001(%rax)
19: 00 00 add %al,(%rax)
1b: 00 1c 00 add %bl,(%rax,%rax,1)
1e: 00 00 add %al,(%rax)
20: 00 00 add %al,(%rax)
22: 00 00 add %al,(%rax)
24: 05 00 00 00 00 add $0x0,%eax
29: 00 00 add %al,(%rax)
2b: 00 10 add %dl,(%rax)
2d: 00 00 add %al,(%rax)
2f: 00 30 add %dh,(%rax)
31: 00 00 add %al,(%rax)
33: 00 00 add %al,(%rax)
35: 00 00 add %al,(%rax)
37: 00 05 00 00 00 00 add %al,0x0(%rip) # 3d <.eh_frame+0x3d>
3d: 00 00 add %al,(%rax)
3f: 00 10 add %dl,(%rax)
41: 00 00 add %al,(%rax)
43: 00 44 00 00 add %al,0x0(%rax,%rax,1)
47: 00 00 add %al,(%rax)
49: 00 00 add %al,(%rax)
4b: 00 05 00 00 00 00 add %al,0x0(%rip) # 51 <.eh_frame+0x51>
51: 00 00 add %al,(%rax)
53: 00 10 add %dl,(%rax)
55: 00 00 add %al,(%rax)
57: 00 58 00 add %bl,0x0(%rax)
5a: 00 00 add %al,(%rax)
5c: 00 00 add %al,(%rax)
5e: 00 00 add %al,(%rax)
60: 05 00 00 00 00 add $0x0,%eax
65: 00 00 add %al,(%rax)
67: 00 14 00 add %dl,(%rax,%rax,1)
6a: 00 00 add %al,(%rax)
6c: 6c insb (%dx),%es:(%rdi)
6d: 00 00 add %al,(%rax)
6f: 00 00 add %al,(%rax)
71: 00 00 add %al,(%rax)
73: 00 2e add %ch,(%rsi)
75: 00 00 add %al,(%rax)
77: 00 00 add %al,(%rax)
79: 4b 0e rex.WXB (bad)
7b: 10 5e 0e adc %bl,0xe(%rsi)
7e: 08 00 or %al,(%rax)
Yes, this is optimized. A few things make the assembly easier to read: demangling the names (for example, _ZNK3Mdp12expectedCostILi1EEEdiiiii is the mangled form of double Mdp::expectedCost<1>(int, int, int, int, int) const), stripping the comments and extraneous text, and using Intel syntax:
double expectedCost<1>(int, int, int, int, int): # #double expectedCost<1>(int, int, int, int, int)
jmp expectedCost_exact(int, int, int, int, int) # TAILCALL
double expectedCost<2>(int, int, int, int, int): # #double expectedCost<2>(int, int, int, int, int)
jmp expectedCost_approx(int, int, int, int, int) # TAILCALL
double expectedCost<3>(int, int, int, int, int): # #double expectedCost<3>(int, int, int, int, int)
jmp expectedCost_exact(int, int, int, int, int) # TAILCALL
double expectedCost<4>(int, int, int, int, int): # #double expectedCost<4>(int, int, int, int, int)
jmp expectedCost_approx(int, int, int, int, int) # TAILCALL
https://godbolt.org/z/ZtoKFH
The above site simplifies this whole process for you.
In this case I didn't provide definitions for expectedCost_approx so the compiler just leaves a jump. But in any case, compilers are definitely smart enough to realize that each template function has a constant value in the switch.
The answer to your question is: Yes, any moderately useful compiler will perform dead code elimination.
if constexpr is not so much about forcing compile time evaluation for reasons of performance. In terms of performance, there's not really going to be any difference between if constexpr and a normal if when the given expression is a compile-time constant because compilers will end up optimizing the unused branch away either way. What if constexpr enables is to have code in the inactive branch that must not be instantiated with the given template arguments (e.g., because it would be invalid in that particular case). For your switch above, the whole code will be instantiated for all cases. Only afterwards will the unused code be removed by the optimizer. if constexpr on the other hand, guarantees that the code in the unused branch will never be instantiated to begin with. See, e.g., here for more on that…
There is no switch constexpr, and there are no guarantees of branch elimination for a plain switch, even with a constexpr value (the same holds for a regular if, in fact), but I would expect the compiler to remove the dead branches with the proper optimization flags.
Notice also that the unused branches would still instantiate any template methods/objects they use, whereas with if constexpr they would not.
So if you want a guarantee that only the relevant code is there, or you want to avoid unneeded instantiations, use if constexpr. Otherwise, use whichever you find clearer.
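As a hedged sketch of that rewrite (expectedCost_exact and expectedCost_approx below are stand-in implementations, since the question does not show the real ones), the switch becomes an if constexpr chain in which only the selected branch is ever instantiated:

```cpp
#include <string>

// Hypothetical stand-ins; the question does not show the real implementations.
inline double expectedCost_exact(int a, int b)  { return a + b; }
inline double expectedCost_approx(int a, int b) { return a * b; }

template <int funcStep1>
double expectedCost(int a, int b)
{
    if constexpr (funcStep1 == 1 || funcStep1 == 3)
        return expectedCost_exact(a, b);   // instantiated only for 1 and 3
    else if constexpr (funcStep1 == 2 || funcStep1 == 4)
        return expectedCost_approx(a, b);  // instantiated only for 2 and 4
    else
        throw std::string("Unknown value (Mdp::expectedCost.cc)");
}
```

The generated code is the same as with the switch at -O2; the difference is the guarantee that discarded branches are never instantiated.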
I'm debugging a full memory dump (procdump -ma ...), and I'm investigating the call stack, corresponding with following piece of source code:
unsigned int __stdcall ExecutionThread(void* pArg)
{
__try
{
BOOL bRunning = TRUE;
CInternalManagerObject* pInternalManagerObject = (CInternalManagerObject*) pArg;
pInternalManagerObject->Init();
CInternaStartlManagerObject* pInternaStartlManagerObject = pInternalManagerObject->GetInternaStartlManagerObject();
while(bRunning)
{
bRunning = pInternalManagerObject->Poll(pInternaStartlManagerObject);
if (CSLGlobal::IsValidHandle(_Module.m_hNeverEvent))
WaitForSingleObject(_Module.m_hNeverEvent, 15);
} <<<<<<<<<<<<<<<<============== here is the call stack pointer
pInternalManagerObject->DeInit();
As you can see, pArg is being typecasted and then being used, so it's impossible for pArg to be NULL, yet this is exactly what the watch window is telling me. On top of this, the local variables seem not to be known (as also shown in the watch window).
Watch-window content :
pArg 0x0000000000000000 void *
bRunning identifier "bRunning" is undefined
pInternalManagerObject identifier "pInternalManagerObject" is undefined
I can understand bRunning being optimised away, as this variable is not used anymore, but this is not correct for pInternalManagerObject, which is still used in the following line.
The symbols seem to be loaded fine.
I'm viewing this using Visual Studio Professional 2017, version 15.8.8.
Does anybody have a clue what might be causing this weird behaviour and what I can do in order to get a dump with correct values for the internal variables?
Edit after question for generated assembly code
The generated assembly is:
27:
28: unsigned int __stdcall ExecutionThread(void* pArg)
29: {
00007FF69C7A1690 48 89 5C 24 08 mov qword ptr [rsp+8],rbx
00007FF69C7A1695 48 89 74 24 10 mov qword ptr [rsp+10h],rsi
00007FF69C7A169A 57 push rdi
00007FF69C7A169B 48 83 EC 20 sub rsp,20h
00007FF69C7A169F 48 8B F9 mov rdi,rcx
30: __try
31: {
32: BOOL bRunning = TRUE;
00007FF69C7A16A2 BB 01 00 00 00 mov ebx,1
33: CInternalManagerObject* pInternalManagerObject = (CInternalManagerObject*) pArg;
34:
35: pInternalManagerObject->Init();
00007FF69C7A16A7 E8 64 EA FD FF call CInternalManagerObject::Init (07FF69C780110h)
36:
37: CBaseManager* pBaseManager = pInternalManagerObject->GetBaseManager();
00007FF69C7A16AC 48 8B CF mov rcx,rdi
00007FF69C7A16AF E8 0C E9 FD FF call CInternalManagerObject::GetBaseManager (07FF69C77FFC0h)
00007FF69C7A16B4 48 8B F0 mov rsi,rax
40: {
41: bRunning = pInternalManagerObject->Poll(pBaseManager);
00007FF69C7A16B7 48 8B CF mov rcx,rdi
38:
39: while(bRunning)
00007FF69C7A16BA 85 DB test ebx,ebx
00007FF69C7A16BC 74 2E je ExecutionThread+5Ch (07FF69C7A16ECh)
40: {
41: bRunning = pInternalManagerObject->Poll(pBaseManager);
00007FF69C7A16BE 48 8B D6 mov rdx,rsi
40: {
41: bRunning = pInternalManagerObject->Poll(pBaseManager);
00007FF69C7A16C1 E8 7A ED FD FF call CInternalManagerObject::Poll (07FF69C780440h)
00007FF69C7A16C6 8B D8 mov ebx,eax
42:
43: if (CSLGlobal::IsValidHandle(_Module.m_hNeverEvent))
00007FF69C7A16C8 48 8D 0D C1 13 0E 00 lea rcx,[_Module+550h (07FF69C882A90h)]
00007FF69C7A16CF E8 3C F2 FB FF call __Skyline_Global::CSLGlobal::IsValidHandle (07FF69C760910h)
00007FF69C7A16D4 85 C0 test eax,eax
00007FF69C7A16D6 74 12 je ExecutionThread+5Ah (07FF69C7A16EAh)
44: WaitForSingleObject(_Module.m_hNeverEvent, 15);
00007FF69C7A16D8 BA 0F 00 00 00 mov edx,0Fh
00007FF69C7A16DD 48 8B 0D AC 13 0E 00 mov rcx,qword ptr [_Module+550h (07FF69C882A90h)]
00007FF69C7A16E4 FF 15 16 0B 08 00 call qword ptr [__imp_WaitForSingleObject (07FF69C822200h)]
45: }
00007FF69C7A16EA EB CB jmp ExecutionThread+27h (07FF69C7A16B7h)
46:
47: pInternalManagerObject->DeInit();
00007FF69C7A16EC E8 FF E7 FD FF call CInternalManagerObject::DeInit (07FF69C77FEF0h)
48: }
I suppose this means that the correct value of pArg can be found in register RDI.
The Register window gives me following information:
RAX = 0000000000000000
RBX = 0000000000000001
RCX = 0000000000000000
RDX = 0000000000000000
RSI = 00000072A1E83220
RDI = 00000072A14A9990
...
Having a look into the memory at the mentioned place, I see hexadecimal values like:
0x00000072A14A9990 98 59 82 9c f6 7f 00 00 01 00 00 00 00 00 08 00 28 d2 28 62 f9 7f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 50 0e 78 a2 72 00 ˜Y.œö...........(Ò(bù...................................P.x¢r.
0x00000072A14A99CE 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 5c 07 00 00 00 00 00 00 d0 07 00 02 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ..ÿÿÿÿ............\.......Ð.......ÿÿÿÿÿÿÿÿÿÿÿÿ................
0x00000072A14A9A0C 00 00 00 00 d0 07 00 02 00 00 00 00 38 59 82 9c f6 7f 00 00 f0 90 60 a2 72 00 00 00 00 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 68 59 82 9c f6 7f 00 00 00 00
Does this mean that pArg is not NULL indeed? (Sorry, but I'm not experienced in assembly debugging)
Does this mean that pArg is not NULL indeed?
No, it does not mean that; pArg is null. The watch window tells you so, and so do the registers.
As you can see, pArg is being typecasted and then being used, so it's
impossible for pArg to be NULL.
That's not correct; that's not what the cast does. If the variable is null then the result of the cast will be null.
https://en.cppreference.com/w/c/language/cast
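A two-line illustration of that point (the struct here is a hypothetical stand-in for the real class):

```cpp
struct CInternalManagerObject { int dummy; }; // hypothetical stand-in for the real class

// The cast does not conjure up an object: a null source pointer
// simply yields a null result pointer.
void* pArg = nullptr;
CInternalManagerObject* p = static_cast<CInternalManagerObject*>(pArg);
```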
I suppose this means that the correct value of pArg can be found in
register RDI.
No; pArg is passed in rcx, and mov works right to left (destination first).
mov rcx,rdi
RCX = 0000000000000000
https://c9x.me/x86/html/file_module_x86_id_176.html
I can understand bRunning being optimised away, as this variable is not used anymore, but this is not correct for pInternalManagerObject,
which is still used in the following line.
My guess is that you've observed the watch window when the program counter is on the first line of your function. bRunning and pInternalManagerObject are out of scope. (Although they could potentially be stripped due to optimisation). Note that if a variable is stripped you won't be able to see it even if it is used.
Thoughts
Program defensively: call assert (or whatever assertion macro the codebase uses) to check the value of pArg (or any other pointer) before dereferencing it. If it's an error you could reasonably see in production, go one step further: log the unexpected behaviour and early-out of the function. http://www.cplusplus.com/reference/cassert/assert/
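A minimal sketch of that defensive check, reusing the names from the question (CInternalManagerObject here is a hypothetical stand-in, since the real class is not shown):

```cpp
#include <cassert>
#include <cstdio>

struct CInternalManagerObject { void Init() {} }; // hypothetical stand-in

unsigned int ExecutionThread(void* pArg)
{
    assert(pArg != nullptr && "ExecutionThread requires a non-null argument");
    if (pArg == nullptr) // early-out: assert compiles away in release builds
    {
        std::fprintf(stderr, "ExecutionThread: pArg was null\n");
        return 1;
    }
    static_cast<CInternalManagerObject*>(pArg)->Init();
    return 0;
}
```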
KISS: Whilst I'd commend anyone who's willing to get their "hands dirty", in this case it's just not necessary to start cracking open the disassembly; the answer's right there. https://en.wikipedia.org/wiki/KISS_principle
Additionally, you get a better response on SO if the question is phrased in a manner that's easy to read. Remember to explain what you are doing and what the problem is before going into the code, and describe the fault you are facing (along with any error output) when asking the question. https://stackoverflow.com/help/how-to-ask
I created code with two functions, returnValues and returnValuesVoid. One returns a tuple of two values; the other takes its arguments by reference.
#include <iostream>
#include <tuple>
std::tuple<int, int> returnValues(const int a, const int b) {
return std::tuple(a,b);
}
void returnValuesVoid(int &a,int &b) {
a += 100;
b += 100;
}
int main() {
auto [x,y] = returnValues(10,20);
std::cout << x ;
std::cout << y ;
int a = 10, b = 20;
returnValuesVoid(a, b);
std::cout << a ;
std::cout << b ;
}
I read about http://en.cppreference.com/w/cpp/language/structured_binding
which can destruct tuple to auto [x,y] variables.
Is auto [x,y] = returnValues(10,20); better than passing by reference? As far as I know it's slower, because it has to return a tuple object, whereas the reference version works directly on the original variables passed to the function, so there seems to be no reason to use it except cleaner code.
Since auto [x,y] has only been available since C++17, do people use it in production? It looks cleaner than returnValuesVoid, which returns void, but does it have other advantages over passing by reference?
Look at the disassembly (compiled with GCC -O3):
It takes more instructions to implement the tuple call.
0000000000000000 <returnValues(int, int)>:
0: 83 c2 64 add $0x64,%edx
3: 83 c6 64 add $0x64,%esi
6: 48 89 f8 mov %rdi,%rax
9: 89 17 mov %edx,(%rdi)
b: 89 77 04 mov %esi,0x4(%rdi)
e: c3 retq
f: 90 nop
0000000000000010 <returnValuesVoid(int&, int&)>:
10: 83 07 64 addl $0x64,(%rdi)
13: 83 06 64 addl $0x64,(%rsi)
16: c3 retq
But fewer instructions for the tuple caller:
0000000000000000 <callTuple()>:
0: 48 83 ec 18 sub $0x18,%rsp
4: ba 14 00 00 00 mov $0x14,%edx
9: be 0a 00 00 00 mov $0xa,%esi
e: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
13: e8 00 00 00 00 callq 18 <callTuple()+0x18> // call returnValues
18: 8b 74 24 0c mov 0xc(%rsp),%esi
1c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
23: e8 00 00 00 00 callq 28 <callTuple()+0x28> // std::cout::operator<<
28: 8b 74 24 08 mov 0x8(%rsp),%esi
2c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
33: e8 00 00 00 00 callq 38 <callTuple()+0x38> // std::cout::operator<<
38: 48 83 c4 18 add $0x18,%rsp
3c: c3 retq
3d: 0f 1f 00 nopl (%rax)
0000000000000040 <callRef()>:
40: 48 83 ec 18 sub $0x18,%rsp
44: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
49: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
4e: c7 44 24 08 0a 00 00 movl $0xa,0x8(%rsp)
55: 00
56: c7 44 24 0c 14 00 00 movl $0x14,0xc(%rsp)
5d: 00
5e: e8 00 00 00 00 callq 63 <callRef()+0x23> // call returnValuesVoid
63: 8b 74 24 08 mov 0x8(%rsp),%esi
67: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
6e: e8 00 00 00 00 callq 73 <callRef()+0x33> // std::cout::operator<<
73: 8b 74 24 0c mov 0xc(%rsp),%esi
77: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7e: e8 00 00 00 00 callq 83 <callRef()+0x43> // std::cout::operator<<
83: 48 83 c4 18 add $0x18,%rsp
87: c3 retq
I don't think there is any considerable performance difference, but the tuple version is clearer and more readable.
I also tried inlining the calls: there is absolutely no difference at all; both generate exactly the same assembly code.
0000000000000000 <callTuple()>:
0: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7: 48 83 ec 08 sub $0x8,%rsp
b: be 6e 00 00 00 mov $0x6e,%esi
10: e8 00 00 00 00 callq 15 <callTuple()+0x15>
15: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
1c: be 78 00 00 00 mov $0x78,%esi
21: 48 83 c4 08 add $0x8,%rsp
25: e9 00 00 00 00 jmpq 2a <callTuple()+0x2a> // TCO, optimized way to call a function and also return
2a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000030 <callRef()>:
30: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
37: 48 83 ec 08 sub $0x8,%rsp
3b: be 6e 00 00 00 mov $0x6e,%esi
40: e8 00 00 00 00 callq 45 <callRef()+0x15>
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
4c: be 78 00 00 00 mov $0x78,%esi
51: 48 83 c4 08 add $0x8,%rsp
55: e9 00 00 00 00 jmpq 5a <callRef()+0x2a> // TCO, optimized way to call a function and also return
Focus on what's more readable and which approach gives the reader better intuition, and leave any performance issues you think might arise in the background.
A function that returns a tuple (or a pair, a struct, etc.) announces to the reader that the function returns something, which almost always has some meaning that the caller should take into account.
A function that hands its results back through reference parameters may escape the attention of a tired reader.
So, in general, prefer to return the results by a tuple.
Mike van Dyke pointed to this link:
F.21: To return multiple "out" values, prefer returning a tuple or struct
Reason
A return value is self-documenting as an "output-only"
value. Note that C++ does have multiple return values, by convention
of using a tuple (including pair), possibly with the extra convenience
of tie at the call site.
[...]
Exception
Sometimes, we need to pass an object to a function to manipulate its state. In such cases, passing the object by reference T& is usually the right technique.
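Following the guideline's note that a struct works as well as a tuple, here is a hedged sketch (Shifted and shiftBy100 are illustrative names, not from the question) where the named members document the outputs:

```cpp
// A named struct makes each output self-documenting at the call site,
// where a tuple would only offer positional access.
struct Shifted
{
    int a;
    int b;
};

Shifted shiftBy100(int a, int b)
{
    return {a + 100, b + 100};
}
```

At the call site, auto [x, y] = shiftBy100(10, 20); reads exactly like the tuple version, while the struct definition keeps the meaning of each field visible.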
Using another compiler (VS 2017) the resulting code shows no difference, as the function calls are just optimized away.
int main() {
00007FF6A9C51E50 sub rsp,28h
auto [x,y] = returnValues(10,20);
std::cout << x ;
00007FF6A9C51E54 mov edx,0Ah
00007FF6A9C51E59 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << y ;
00007FF6A9C51E5E mov edx,14h
00007FF6A9C51E63 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
int a = 10, b = 20;
returnValuesVoid(a, b);
std::cout << a ;
00007FF6A9C51E68 mov edx,6Eh
00007FF6A9C51E6D call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << b ;
00007FF6A9C51E72 mov edx,78h
00007FF6A9C51E77 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
}
00007FF6A9C51E7C xor eax,eax
00007FF6A9C51E7E add rsp,28h
00007FF6A9C51E82 ret
So using clearer code seems to be the obvious choice.
What Zang said is true but misses a point. I ran the code provided in the question with chrono to measure the time, and I think the answer needs revisiting after observing what happened.
For 1M iterations, a call through references took about 3 ms, while a call returning a std::tuple consumed via std::tie took about 94 ms.
Though the difference seems small in practice, the tuple version will still perform slightly slower. Hence, for performance-intensive systems, I suggest using call by reference.
My code:
#include <iostream>
#include <tuple>
#include <chrono>
std::tuple<int, int> returnValues(const int a, const int b)
{
return std::tuple<int, int>(a, b);
}
void returnValuesVoid(int &a, int &b)
{
a += 100;
b += 100;
}
int main()
{
int a = 10, b = 20;
auto begin = std::chrono::high_resolution_clock::now();
int x, y;
for (int i = 0; i < 1000000; i++)
{
std::tie(x, y) = returnValues(a, b);
}
auto end = std::chrono::high_resolution_clock::now();
std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count()) << '\n';
a = 10;
b = 20;
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; i++)
{
returnValuesVoid(a, b);
}
auto stop = std::chrono::high_resolution_clock::now();
std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()) << '\n';
}
I have some template-heavy C++ code that I want to ensure the compiler optimizes as much as possible due to the large amount of information it has at compile time. To evaluate its performance, I decided to take a look at the disassembly of the object file that it generates. Below is a snippet of what I got from objdump -dC:
0000000000000000 <bar<foo, 0u>::get(bool)>:
0: 41 57 push %r15
2: 49 89 f7 mov %rsi,%r15
5: 41 56 push %r14
7: 41 55 push %r13
9: 41 54 push %r12
b: 55 push %rbp
c: 53 push %rbx
d: 48 81 ec 68 02 00 00 sub $0x268,%rsp
14: 48 89 7c 24 10 mov %rdi,0x10(%rsp)
19: 48 89 f7 mov %rsi,%rdi
1c: 89 54 24 1c mov %edx,0x1c(%rsp)
20: e8 00 00 00 00 callq 25 <bar<foo, 0u>::get(bool)+0x25>
25: 84 c0 test %al,%al
27: 0f 85 eb 00 00 00 jne 118 <bar<foo, 0u>::get(bool)+0x118>
2d: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
34: 00 00
36: 4c 89 ff mov %r15,%rdi
39: 4d 8d b7 30 01 00 00 lea 0x130(%r15),%r14
40: e8 00 00 00 00 callq 45 <bar<foo, 0u>::get(bool)+0x45>
45: 84 c0 test %al,%al
47: 88 44 24 1b mov %al,0x1b(%rsp)
4b: 0f 85 ef 00 00 00 jne 140 <bar<foo, 0u>::get(bool)+0x140>
51: 80 7c 24 1c 00 cmpb $0x0,0x1c(%rsp)
56: 0f 85 24 03 00 00 jne 380 <bar<foo, 0u>::get(bool)+0x380>
5c: 48 8b 44 24 10 mov 0x10(%rsp),%rax
61: c6 00 00 movb $0x0,(%rax)
64: 80 7c 24 1b 00 cmpb $0x0,0x1b(%rsp)
69: 75 25 jne 90 <bar<foo, 0u>::get(bool)+0x90>
6b: 48 8b 74 24 10 mov 0x10(%rsp),%rsi
70: 4c 89 ff mov %r15,%rdi
73: e8 00 00 00 00 callq 78 <bar<foo, 0u>::get(bool)+0x78>
78: 48 8b 44 24 10 mov 0x10(%rsp),%rax
7d: 48 81 c4 68 02 00 00 add $0x268,%rsp
84: 5b pop %rbx
85: 5d pop %rbp
86: 41 5c pop %r12
88: 41 5d pop %r13
8a: 41 5e pop %r14
8c: 41 5f pop %r15
8e: c3 retq
8f: 90 nop
90: 4c 89 f7 mov %r14,%rdi
93: e8 00 00 00 00 callq 98 <bar<foo, 0u>::get(bool)+0x98>
98: 83 f8 04 cmp $0x4,%eax
9b: 74 f3 je 90 <bar<foo, 0u>::get(bool)+0x90>
9d: 85 c0 test %eax,%eax
9f: 0f 85 e4 08 00 00 jne 989 <bar<foo, 0u>::get(bool)+0x989>
a5: 49 83 87 b0 01 00 00 addq $0x1,0x1b0(%r15)
ac: 01
ad: 49 8d 9f 58 01 00 00 lea 0x158(%r15),%rbx
b4: 48 89 df mov %rbx,%rdi
b7: e8 00 00 00 00 callq bc <bar<foo, 0u>::get(bool)+0xbc>
bc: 49 8d bf 80 01 00 00 lea 0x180(%r15),%rdi
c3: e8 00 00 00 00 callq c8 <bar<foo, 0u>::get(bool)+0xc8>
c8: 48 89 df mov %rbx,%rdi
cb: e8 00 00 00 00 callq d0 <bar<foo, 0u>::get(bool)+0xd0>
d0: 4c 89 f7 mov %r14,%rdi
d3: e8 00 00 00 00 callq d8 <bar<foo, 0u>::get(bool)+0xd8>
d8: 83 f8 04 cmp $0x4,%eax
The disassembly of this particular function continues on, but one thing I noticed is the relatively large number of call instructions like this one:
20: e8 00 00 00 00 callq 25 <bar<foo, 0u>::get(bool)+0x25>
These instructions, always with the encoding e8 00 00 00 00, occur frequently throughout the generated code, and from what I can tell, are nothing more than no-ops; they all seem to just fall through to the next instruction. This raises the question: is there a good reason why all these instructions are generated?
I'm concerned about the instruction cache footprint of the generated code, so wasting 5 bytes many times throughout a function seems counterproductive. It seems a bit heavyweight for a nop, unless the compiler is trying to preserve some kind of memory alignment or something. I wouldn't be surprised if this were the case.
I compiled my code using g++ 4.8.5 using -O3 -fomit-frame-pointer. For what it's worth, I saw similar code generation using clang 3.7.
The 00 00 00 00 (relative) target address in e8 00 00 00 00 is intended to be filled in by the linker. It doesn't mean that the call falls through. It just means you are disassembling an object file that has not been linked yet.
Also, a call to the next instruction, if that was the end result after the link phase, would not be a no-op, because it changes the stack (a certain hint that this is not what is going on in your case).
I have the following entry in a Dr Watson log. What is the significance of the "ds:0023:003a3000=??" part of the entry to the right of the FAULT line?
*----> State Dump for Thread Id 0xdfc <----*
eax=00000000 ebx=00390320 ecx=0854ff48 edx=09e44bfc esi=00012ce1 edi=0854ff61
eip=00465c51 esp=0854ff30 ebp=00000000 iopl=0 nv up ei pl zr na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
function: sysman
00465c37 49 dec ecx
00465c38 eb02 jmp sysman+0x65c3c (00465c3c)
00465c3a 33c9 xor ecx,ecx
00465c3c 8d542428 lea edx,[esp+0x28]
00465c40 52 push edx
00465c41 51 push ecx
00465c42 8d4c2418 lea ecx,[esp+0x18]
00465c46 e8d5c0fcff call sysman+0x31d20 (00431d20)
00465c4b 33c0 xor eax,eax
00465c4d 8d4c2418 lea ecx,[esp+0x18]
FAULT ->00465c51 8a441eff mov al,[esi+ebx-0x1] ds:0023:003a3000=??
00465c55 50 push eax
00465c56 6864074900 push 0x490764
00465c5b 51 push ecx
00465c5c e8cfd0fcff call sysman+0x32d30 (00432d30)
00465c61 8d542424 lea edx,[esp+0x24]
00465c65 68689f4800 push 0x489f68
00465c6a 8d44242c lea eax,[esp+0x2c]
00465c6e 52 push edx
00465c6f 50 push eax
00465c70 e83bc3fcff call sysman+0x31fb0 (00431fb0)
*----> Stack Back Trace <----*
ChildEBP RetAddr Args to Child
00000000 00000000 00000000 00000000 00000000 sysman+0x65c51
*----> Raw Stack Dump <----*
000000000854ff30 58 01 55 08 75 07 c8 09 - 00 00 00 00 18 6d c7 01 X.U.u........m..
000000000854ff40 fc 4b e4 09 04 bd 47 00 - 04 bd 47 00 fc 0c c9 01 .K....G...G.....
000000000854ff50 ac ca ae 09 64 5f c4 01 - 20 37 37 30 32 34 3a 20 ....d_.. 77024:
000000000854ff60 00 b3 42 00 a8 ff 54 08 - 90 a6 47 00 02 00 00 00 ..B...T...G.....
000000000854ff70 8b c5 42 00 b8 ff 54 08 - 2e 03 39 00 28 99 cb 01 ..B...T...9.(...
000000000854ff80 ff ff ff ff 00 00 00 00 - 00 00 00 00 20 1e cb 01 ............ ...
000000000854ff90 a6 f7 ba 77 06 00 00 00 - c9 f7 ba 77 e1 6b d9 09 ...w.......w.k..
000000000854ffa0 06 00 00 00 1f 00 00 00 - 68 00 55 08 c1 a0 47 00 ........h.U...G.
000000000854ffb0 00 00 00 00 58 c4 42 00 - c9 a5 ca 09 d1 fb 38 0a ....X.B.......8.
000000000854ffc0 27 00 00 00 e1 6b d9 09 - ef f2 41 00 c9 a5 ca 09 '....k....A.....
000000000854ffd0 01 59 cc 01 38 00 55 08 - ec 00 55 08 00 00 00 00 .Y..8.U...U.....
000000000854ffe0 e0 00 55 08 ff ff ff ff - 89 00 00 00 01 00 01 01 ..U.............
000000000854fff0 c8 ff 54 08 b8 ff 54 08 - 77 00 55 08 29 a5 ca 09 ..T...T.w.U.)...
0000000008550000 51 00 00 00 5f 00 00 00 - 00 9f 82 7c 61 36 ca 01 Q..._......|a6..
0000000008550010 25 00 00 00 3f 00 00 00 - 00 ce bb 77 91 b7 c7 01 %...?......w....
0000000008550020 19 00 00 00 1f 00 00 00 - 00 ff ff ff d9 28 cc 01 .............(..
0000000008550030 0b 00 00 00 1f 00 00 00 - 00 00 55 08 d1 fb 38 0a ..........U...8.
0000000008550040 27 00 00 00 3f 00 00 00 - 00 20 ba 77 00 00 00 00 '...?.... .w....
0000000008550050 00 00 00 00 00 00 00 00 - 20 b7 c7 01 00 00 00 00 ........ .......
0000000008550060 00 00 00 00 00 00 00 00 - b4 00 55 08 1b 90 47 00 ..........U...G.`
To summarize:
You get a register dump here:
eax=00000000 ebx=00390320 ecx=0854ff48 edx=09e44bfc esi=00012ce1 edi=0854ff61
eip=00465c51 esp=0854ff30 ebp=00000000 iopl=0 nv up ei pl zr na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
The eip indicates the instruction that failed:
FAULT ->00465c51 8a441eff mov al,[esi+ebx-0x1] ds:0023:003a3000=??
The part at the end is the address that failed to read: the "usual" data segment selector of 23, and address 3A3000, which is composed of esi plus ebx minus 1: 390320+12ce1-1. To me, that looks like an index gone bad; 3a3000 would be the first address of a new "page" in memory, which is why it fails at exactly that point. 77025 bytes into an array is quite a long way, but it is of course possible that something else is wrong.
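The arithmetic can be double-checked in a few lines (the register values are copied straight from the dump above):

```cpp
#include <cstdint>

// Register values taken from the Dr Watson register dump.
constexpr std::uint32_t ebx = 0x00390320;
constexpr std::uint32_t esi = 0x00012ce1; // 77025 in decimal

// Effective address of the faulting  mov al,[esi+ebx-0x1]
constexpr std::uint32_t fault = esi + ebx - 1;
static_assert(fault == 0x003a3000, "matches the ds:0023:003a3000 in the FAULT line");
static_assert(esi == 77025, "the byte offset quoted in the analysis");
```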