How to calculate the virtual address for an ELF program header?

How to calculate the virtual address for an ELF program header? - c++

I've wrote to file some assembly instructions and I would like to make them executable. However, I'm messing up something with the program headers. I've read the whole man page about the ELF header, but I didn't understand much.
#include <elf.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
void MakeExecutable(char *codeBuffer, char *startAddr, unsigned int codeSize){
char *fileBuffer = (char*) malloc(2058);
memset(fileBuffer, 0, 2058);
//I've omitted the sections declaration
unsigned int codeOffset = sizeof(Elf64_Ehdr) + (unsigned int ) (startAddr - codeBuffer);
Elf64_Ehdr *h = (Elf64_Ehdr*) &fileBuffer[0];
h->e_ident[0] = 0x7f;
h->e_ident[1] = 0x45;
h->e_ident[2] = 0x4c;
h->e_ident[3] = 0x46;
h->e_ident[EI_CLASS] = ELFCLASS64;
h->e_ident[EI_DATA] = ELFDATA2LSB;
h->e_ident[EI_VERSION] = EV_CURRENT;
h->e_ident[EI_OSABI] = ELFOSABI_SYSV;
h->e_type = ET_DYN;
h->e_machine = EM_X86_64;
h->e_version = EV_CURRENT;
h->e_entry = 64;
h->e_phentsize = sizeof(Elf64_Phdr);
h->e_shentsize = sizeof(Elf64_Shdr);
h->e_shoff = 0;
h->e_phoff = 800;
h->e_shnum = 0;
h->e_phnum = 1;
h->e_shstrndx= 0;
h->e_ehsize = sizeof(Elf64_Ehdr);
Elf64_Phdr *ph = (Elf64_Phdr*) &fileBuffer[800];
ph->p_type = PT_LOAD;
ph->p_vaddr = 0x8050000;
ph->p_paddr = 0;
ph->p_offset = 0x4000;
ph->p_memsz = 2058;
ph->p_flags = PF_X | PF_R | PF_W;
ph->p_filesz = 2058;
ph->p_align = 0x100000;
int file = open("ex.out", O_TRUNC | O_RDWR, S_IRWXO | S_IRWXU | S_IRWXG );
int error = write(file, fileBuffer, 2058);
close(file);
}
When I execute the ex.out It gives a segmentation fault and a core dump. This is what the core dump looks like.
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
NOTE 0x0000000000000158 0x0000000000000000 0x0000000000000000
0x0000000000000aac 0x0000000000000000 0x0
LOAD 0x0000000000001000 0x00007f1a823b2000 0x0000000000000000
0x0000000000000000 0x0000000000001000 RWE 0x1000
LOAD 0x0000000000001000 0x00007fff4a1d0000 0x0000000000000000
0x0000000000021000 0x0000000000021000 RWE 0x1000
LOAD 0x0000000000022000 0x00007fff4a1f1000 0x0000000000000000
0x0000000000003000 0x0000000000003000 R 0x1000
LOAD 0x0000000000025000 0x00007fff4a1f4000 0x0000000000000000
0x0000000000002000 0x0000000000002000 R E 0x1000
Displaying notes found at file offset 0x00000158 with length 0x00000aac:
Owner Data size Description
CORE 0x00000150 NT_PRSTATUS (prstatus structure)
CORE 0x00000088 NT_PRPSINFO (prpsinfo structure)
CORE 0x00000080 NT_SIGINFO (siginfo_t data)
CORE 0x00000140 NT_AUXV (auxiliary vector)
CORE 0x00000048 NT_FILE (mapped files)
Page size: 4096
Start End Page Offset
0x00007f1a823b2000 0x00007f1a823b3000 0x0000000000000004
/home/sudo_user/Compiler/ex.out
CORE 0x00000200 NT_FPREGSET (floating point registers)
LINUX 0x00000440 NT_X86_XSTATE (x86 XSAVE extended state)
description data: ffffffdf 66 f ...
It really did allocate a 4096 bytes page for me, so what could be the problem here? This is the objdump
ex.out: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <.text>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 81 ec 01 00 00 00 sub $0x1,%rsp
b: 8a 85 10 00 00 00 mov 0x10(%rbp),%al
11: 88 85 ff ff ff ff mov %al,-0x1(%rbp)
17: 49 c7 c1 00 00 00 00 mov $0x0,%r9
1e: 49 c7 c0 00 00 00 00 mov $0x0,%r8
25: 49 c7 c2 00 00 00 00 mov $0x0,%r10
2c: 48 c7 c2 00 00 00 00 mov $0x0,%rdx
33: 48 c7 c6 00 00 00 00 mov $0x0,%rsi
3a: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
41: 48 c7 c0 3c 00 00 00 mov $0x3c,%rax
48: 0f 05 syscall
4a: 48 89 ec mov %rbp,%rsp
4d: 5d pop %rbp
4e: c3 retq
Also when I try to run gdb with it, It also gives a segmentation fault:
"Program received signal SIGSEGV, Segmentation fault.
0x00007fffeffae040 in ??"
EDIT
Moreover, if I set p_vaddr to 0 no segmentation fault occurs, however the code doesn't behave like expected. For example, if I put an print instruction on the code, nothing gets printed on the command line when the file is executed.

Related

Constructor call of function local static object skipped

I have written the following function:
inline void putc(int c)
{
static Cons_serial serial;
if (serial.enabled())
serial.putc(c);
}
Where Cons_serial is a class with a non-trivial default constructor. I believe the exact class definition is not important here but you can correct me on that. I'm compiling for x86 32 bit with g++ using the following flags: -m32 -fno-PIC -ffreestanding -fno-rtti -fno-exceptions -fno-threadsafe-statics -O0, the generated assembly code for putc looks like this:
00100221 <_Z4putci>:
100221: 55 push %ebp
100222: 89 e5 mov %esp,%ebp
100224: 83 ec 08 sub $0x8,%esp
100227: b8 f8 02 10 00 mov $0x1002f8,%eax
10022c: 0f b6 00 movzbl (%eax),%eax
10022f: 84 c0 test %al,%al
100231: 75 18 jne 10024b <_Z4putci+0x2a>
100233: 83 ec 0c sub $0xc,%esp
100236: 68 f0 02 10 00 push $0x1002f0
10023b: e8 36 fe ff ff call 100076 <_ZN11Cons_serialC1Ev>
100240: 83 c4 10 add $0x10,%esp
100243: b8 f8 02 10 00 mov $0x1002f8,%eax
100248: c6 00 01 movb $0x1,(%eax)
10024b: 83 ec 0c sub $0xc,%esp
10024e: 68 f0 02 10 00 push $0x1002f0
100253: e8 4c fe ff ff call 1000a4 <_ZNK11Cons_serial7enabledEv>
100258: 83 c4 10 add $0x10,%esp
10025b: 84 c0 test %al,%al
10025d: 74 13 je 100272 <_Z4putci+0x51>
10025f: 83 ec 08 sub $0x8,%esp
100262: ff 75 08 pushl 0x8(%ebp)
100265: 68 f0 02 10 00 push $0x1002f0
10026a: e8 41 fe ff ff call 1000b0 <_ZN11Cons_serial4putcEi>
10026f: 83 c4 10 add $0x10,%esp
100272: 90 nop
100273: c9 leave
100274: c3 ret
During execution, the jump at 100231 is taken the first time the function runs, thus Cons_serial is never called. Why knowledge of x86 assembly is questionable, what do the instructions leading up to that one actually do? I assume the code is meant to skip the constructor call on subsequent function calls. But then why is it skipped the first time the function runs as well?
EDIT: This code is part of a kernel I'm writing and I suspect the root cause might be an issue with my kernel's .bss section, here is the linker script I use:
OUTPUT_FORMAT("elf32-i386")
ENTRY(_start)
SECTIONS
{
. = 0x100000;
.text : AT(0x100000) {
*(.text)
}
.data : SUBALIGN(2) {
*(.data);
*(.rodata*);
}
.bss : SUBALIGN(4) {
__bss_start = .;
*(.COMMON);
*(.bss*)
. = ALIGN(4);
__bss_end = .;
}
/DISCARD/ : {
*(.eh_frame)
*(.comment)
}
}
And here's the code I use to zero the .bss section:
extern uint32_t __bss_start;
extern uint32_t __bss_end;
void zero_bss()
{
for (uint32_t bss_addr = __bss_start; bss_addr < __bss_end; ++bss_addr)
*reinterpret_cast<uint8_t *>(bss_addr) = 0x00;
}
But when zero_bss runs, __bss_start is 0x27 and __bss_end is 0x101 which is not at all what I'd except (the BSS should encompass address 0x1002f8 after all).

I've solved it now, the hint from #user3124812 was what got me there, thanks again.
My zero_bss code was faulty, I needed to take the addresses of the __bss* markers from the linker script, i.e.:
extern uint8_t __bss_start;
extern uint8_t __bss_end;
void zero_bss()
{
uint8_t *bss_start = reinterpret_cast<uint8_t *>(&__bss_start);
uint8_t *bss_end = reinterpret_cast<uint8_t *>(&__bss_end);
for (uint8_t *bss_addr = bss_start; bss_addr < bss_end; ++bss_addr)
*bss_addr = 0x00;
}
Now everything works.

ptrace(PTRACE_PEEKDATA) , problem reading from memory address

I'm implementing a simple debugger and I'm trying to read data from a child process memory address with ptrace (for create breakpoint), but I keep getting input/output error.
This is how my program works:
Child process calls to:
ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
execl(prog_name, prog_name, 0);
At this point the child process waits for a signal.
Now the parent process calls to:
errno=0
uint64_t addrVal = ptrace(PTRACE_PEEKDATA, m_pid, m_addr, nullptr);
perror("failed");
and there is where it fails.
perror returns input/output error and addrVal keep getting 0xffffffffffffffff value.
also, there is might be a chance that im trying to look at the wrong address,
i used objdump to find the address but it seems it just giving me the offset and not the actual address
0000000000001165 <main>:
1165: 55 push %rbp
1166: 48 89 e5 mov %rsp,%rbp
1169: 48 8d 35 95 0e 00 00 lea 0xe95(%rip),%rsi # 2005 <_ZStL19piecewise_construct+0x1>
1170: 48 8d 3d e9 2e 00 00 lea 0x2ee9(%rip),%rdi # 4060 <_ZSt4cout##GLIBCXX_3.4>
1177: e8 c4 fe ff ff callq 1040 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc#plt>
117c: 48 89 c2 mov %rax,%rdx
117f: 48 8b 05 4a 2e 00 00 mov 0x2e4a(%rip),%rax # 3fd0 <_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_#GLIBCXX_3.4>
1186: 48 89 c6 mov %rax,%rsi
1189: 48 89 d7 mov %rdx,%rdi
118c: e8 bf fe ff ff callq 1050 <_ZNSolsEPFRSoS_E#plt>
1191: b8 00 00 00 00 mov $0x0,%eax
1196: 5d pop %rbp
1197: c3 retq
these are offsets or the actual address?
thanks!

How can debugged variables be NULL when source code proves otherwise

I'm debugging a full memory dump (procdump -ma ...), and I'm investigating the call stack, corresponding with following piece of source code:
unsigned int __stdcall ExecutionThread(void* pArg)
{
__try
{
BOOL bRunning = TRUE;
CInternalManagerObject* pInternalManagerObject = (CInternalManagerObject*) pArg;
pInternalManagerObject->Init();
CInternaStartlManagerObject* pInternaStartlManagerObject = pInternalManagerObject->GetInternaStartlManagerObject();
while(bRunning)
{
bRunning = pInternalManagerObject->Poll(pInternaStartlManagerObject);
if (CSLGlobal::IsValidHandle(_Module.m_hNeverEvent))
WaitForSingleObject(_Module.m_hNeverEvent, 15);
} <<<<<<<<<<<<<<<<============== here is the call stack pointer
pInternalManagerObject->DeInit();
As you can see, pArg is being typecasted and then being used, so it's impossible for pArg to be NULL, but yet this is exactly what the watch-window is telling me. In top of this, the internal variables seem not to be known (also as mentioned in the watch-window).
Watch-window content :
pArg 0x0000000000000000 void *
bRunning identifier "bRunning" is undefined
pInternalManagerObject identifier "pInternalManagerObject" is undefined
I can understand bRunning being optimised away, as this variable is not used anymore, but this is not correct for pInternalManagerObject, which is still used in the following line.
The symbols seem to be loaded fine.
I'm viewing this using Visual Studio Professional 2017, version 15.8.8.
Does anybody have a clue what might be causing this weird behaviour and what I can do in order to get a dump with correct values for the internal variables?
Edit after question for generated assembly code
The generated assembly is:
27:
28: unsigned int __stdcall ExecutionThread(void* pArg)
29: {
00007FF69C7A1690 48 89 5C 24 08 mov qword ptr [rsp+8],rbx
00007FF69C7A1695 48 89 74 24 10 mov qword ptr [rsp+10h],rsi
00007FF69C7A169A 57 push rdi
00007FF69C7A169B 48 83 EC 20 sub rsp,20h
00007FF69C7A169F 48 8B F9 mov rdi,rcx
30: __try
31: {
32: BOOL bRunning = TRUE;
00007FF69C7A16A2 BB 01 00 00 00 mov ebx,1
33: CInternalManagerObject* pInternalManagerObject = (CInternalManagerObject*) pArg;
34:
35: pInternalManagerObject->Init();
00007FF69C7A16A7 E8 64 EA FD FF call CInternalManagerObject::Init (07FF69C780110h)
36:
37: CBaseManager* pBaseManager = pInternalManagerObject->GetBaseManager();
00007FF69C7A16AC 48 8B CF mov rcx,rdi
00007FF69C7A16AF E8 0C E9 FD FF call CInternalManagerObject::GetBaseManager (07FF69C77FFC0h)
00007FF69C7A16B4 48 8B F0 mov rsi,rax
40: {
41: bRunning = pInternalManagerObject->Poll(pBaseManager);
00007FF69C7A16B7 48 8B CF mov rcx,rdi
38:
39: while(bRunning)
00007FF69C7A16BA 85 DB test ebx,ebx
00007FF69C7A16BC 74 2E je ExecutionThread+5Ch (07FF69C7A16ECh)
40: {
41: bRunning = pInternalManagerObject->Poll(pBaseManager);
00007FF69C7A16BE 48 8B D6 mov rdx,rsi
40: {
41: bRunning = pInternalManagerObject->Poll(pBaseManager);
00007FF69C7A16C1 E8 7A ED FD FF call CInternalManagerObject::Poll (07FF69C780440h)
00007FF69C7A16C6 8B D8 mov ebx,eax
42:
43: if (CSLGlobal::IsValidHandle(_Module.m_hNeverEvent))
00007FF69C7A16C8 48 8D 0D C1 13 0E 00 lea rcx,[_Module+550h (07FF69C882A90h)]
00007FF69C7A16CF E8 3C F2 FB FF call __Skyline_Global::CSLGlobal::IsValidHandle (07FF69C760910h)
00007FF69C7A16D4 85 C0 test eax,eax
00007FF69C7A16D6 74 12 je ExecutionThread+5Ah (07FF69C7A16EAh)
44: WaitForSingleObject(_Module.m_hNeverEvent, 15);
00007FF69C7A16D8 BA 0F 00 00 00 mov edx,0Fh
00007FF69C7A16DD 48 8B 0D AC 13 0E 00 mov rcx,qword ptr [_Module+550h (07FF69C882A90h)]
00007FF69C7A16E4 FF 15 16 0B 08 00 call qword ptr [__imp_WaitForSingleObject (07FF69C822200h)]
45: }
00007FF69C7A16EA EB CB jmp ExecutionThread+27h (07FF69C7A16B7h)
46:
47: pInternalManagerObject->DeInit();
00007FF69C7A16EC E8 FF E7 FD FF call CInternalManagerObject::DeInit (07FF69C77FEF0h)
48: }
I suppose this means that the correct value of pArg can be found in register RDI.
The Register window gives me following information:
RAX = 0000000000000000
RBX = 0000000000000001
RCX = 0000000000000000
RDX = 0000000000000000
RSI = 00000072A1E83220
RDI = 00000072A14A9990
...
Having a look into the memory at the mentioned place, I see hexadecimal values like:
0x00000072A14A9990 98 59 82 9c f6 7f 00 00 01 00 00 00 00 00 08 00 28 d2 28 62 f9 7f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 50 0e 78 a2 72 00 ˜Y.œö...........(Ò(bù...................................P.x¢r.
0x00000072A14A99CE 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 5c 07 00 00 00 00 00 00 d0 07 00 02 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ..ÿÿÿÿ............\.......Ð.......ÿÿÿÿÿÿÿÿÿÿÿÿ................
0x00000072A14A9A0C 00 00 00 00 d0 07 00 02 00 00 00 00 38 59 82 9c f6 7f 00 00 f0 90 60 a2 72 00 00 00 00 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 68 59 82 9c f6 7f 00 00 00 00
Does this mean that pArg is not NULL indeed? (Sorry, but I'm not experienced in assembly debugging)

Does this mean that pArg is not NULL indeed?
No it does not mean that; pArg is null. The watch window tells you, the registers tell you.
As you can see, pArg is being typecasted and then being used, so it's
impossible for pArg to be NULL.
That's not correct; that's not what the cast does. If the variable is null then the result of the cast will be null.
https://en.cppreference.com/w/c/language/cast
I suppose this means that the correct value of pArg can be found in
register RDI.
No; pArg is mounted onto rcx; mov works right-to-left.
mov rcx,rdi
RCX = 0000000000000000
https://c9x.me/x86/html/file_module_x86_id_176.html
I can understand bRunning being optimised away, as this variable is not used anymore, but this is not correct for pInternalManagerObject,
which is still used in the following line.
My guess is that you've observed the watch window when the program counter is on the first line of your function. bRunning and pInternalManagerObject are out of scope. (Although they could potentially be stripped due to optimisation). Note that if a variable is stripped you won't be able to see it even if it is used.
Thoughts
Program defensively. make a call to assert (or whatever assertion macro the codebase uses) in order to check the value of pArg (or any other pointer) before dereferencing. If it's an error you could reasonably see in production, go one step further: log the unexpected behaviour and early-out the function. http://www.cplusplus.com/reference/cassert/assert/
KISS: Whilst I'd commend anyone who's willing to get their "hands dirty" in this case it's just not necessary to start cracking open the disassembly. In this case the answer's right there. https://en.wikipedia.org/wiki/KISS_principle
Additionally you get a better response on SO if question is phrased in a manner that's easier to read. Remember to explain what it is you are doing and what the problem is before going into code. Explain what fault you are facing (along with any error output), along with asking a question. https://stackoverflow.com/help/how-to-ask

Which is better: returning tuple or passing arguments to function as references?

I created code where I have two functions returnValues and returnValuesVoid. One returns tuple of 2 values and other accept argument's references to the function.
#include <iostream>
#include <tuple>
std::tuple<int, int> returnValues(const int a, const int b) {
return std::tuple(a,b);
}
void returnValuesVoid(int &a,int &b) {
a += 100;
b += 100;
}
int main() {
auto [x,y] = returnValues(10,20);
std::cout << x ;
std::cout << y ;
int a = 10, b = 20;
returnValuesVoid(a, b);
std::cout << a ;
std::cout << b ;
}
I read about http://en.cppreference.com/w/cpp/language/structured_binding
which can destruct tuple to auto [x,y] variables.
Is auto [x,y] = returnValues(10,20); better than passing by references? As I know it's slower because it does have to return tuple object and reference just works on orginal variables passed to function so there's no reason to use it except cleaner code.
As auto [x,y] is since C++17 do people use it on production? I see that it looks cleaner than returnValuesVoid which is void type and but does it have other advantages over passing by reference?

Look at disassemble (compiled with GCC -O3):
It takes more instruction to implement tuple call.
0000000000000000 <returnValues(int, int)>:
0: 83 c2 64 add $0x64,%edx
3: 83 c6 64 add $0x64,%esi
6: 48 89 f8 mov %rdi,%rax
9: 89 17 mov %edx,(%rdi)
b: 89 77 04 mov %esi,0x4(%rdi)
e: c3 retq
f: 90 nop
0000000000000010 <returnValuesVoid(int&, int&)>:
10: 83 07 64 addl $0x64,(%rdi)
13: 83 06 64 addl $0x64,(%rsi)
16: c3 retq
But less instructions for the tuple caller:
0000000000000000 <callTuple()>:
0: 48 83 ec 18 sub $0x18,%rsp
4: ba 14 00 00 00 mov $0x14,%edx
9: be 0a 00 00 00 mov $0xa,%esi
e: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
13: e8 00 00 00 00 callq 18 <callTuple()+0x18> // call returnValues
18: 8b 74 24 0c mov 0xc(%rsp),%esi
1c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
23: e8 00 00 00 00 callq 28 <callTuple()+0x28> // std::cout::operator<<
28: 8b 74 24 08 mov 0x8(%rsp),%esi
2c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
33: e8 00 00 00 00 callq 38 <callTuple()+0x38> // std::cout::operator<<
38: 48 83 c4 18 add $0x18,%rsp
3c: c3 retq
3d: 0f 1f 00 nopl (%rax)
0000000000000040 <callRef()>:
40: 48 83 ec 18 sub $0x18,%rsp
44: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
49: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
4e: c7 44 24 08 0a 00 00 movl $0xa,0x8(%rsp)
55: 00
56: c7 44 24 0c 14 00 00 movl $0x14,0xc(%rsp)
5d: 00
5e: e8 00 00 00 00 callq 63 <callRef()+0x23> // call returnValuesVoid
63: 8b 74 24 08 mov 0x8(%rsp),%esi
67: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
6e: e8 00 00 00 00 callq 73 <callRef()+0x33> // std::cout::operator<<
73: 8b 74 24 0c mov 0xc(%rsp),%esi
77: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7e: e8 00 00 00 00 callq 83 <callRef()+0x43> // std::cout::operator<<
83: 48 83 c4 18 add $0x18,%rsp
87: c3 retq
I don't think there is any considerable performance different, but the tuple one is more clear, more readable.
Also tried inlined call, there is absolutely no different at all. Both of them generate exactly the same assemble code.
0000000000000000 <callTuple()>:
0: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7: 48 83 ec 08 sub $0x8,%rsp
b: be 6e 00 00 00 mov $0x6e,%esi
10: e8 00 00 00 00 callq 15 <callTuple()+0x15>
15: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
1c: be 78 00 00 00 mov $0x78,%esi
21: 48 83 c4 08 add $0x8,%rsp
25: e9 00 00 00 00 jmpq 2a <callTuple()+0x2a> // TCO, optimized way to call a function and also return
2a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000030 <callRef()>:
30: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
37: 48 83 ec 08 sub $0x8,%rsp
3b: be 6e 00 00 00 mov $0x6e,%esi
40: e8 00 00 00 00 callq 45 <callRef()+0x15>
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
4c: be 78 00 00 00 mov $0x78,%esi
51: 48 83 c4 08 add $0x8,%rsp
55: e9 00 00 00 00 jmpq 5a <callRef()+0x2a> // TCO, optimized way to call a function and also return

Focus on what's more readable and which approach provides a better intuition to the reader, and please keep the performance issues you might think that arise in the background.
A function that returns a tuple (or a pair, a struct, etc.) is yelling to the author that the function returns something, that almost always has some meaning that the user can take into account.
A function that gives back the results in variables passed by reference, may slip the eye's attention of a tired reader.
So, in general, prefer to return the results by a tuple.
Mike van Dyke pointed to this link:
F.21: To return multiple "out" values, prefer returning a tuple or struct
Reason
A return value is self-documenting as an "output-only"
value. Note that C++ does have multiple return values, by convention
of using a tuple (including pair), possibly with the extra convenience
of tie at the call site.
[...]
Exception
Sometimes, we need to pass an object to a function to manipulate its state. In such cases, passing the object by reference T& is usually the right technique.

Using another compiler (VS 2017) the resulting code shows no difference, as the function calls are just optimized away.
int main() {
00007FF6A9C51E50 sub rsp,28h
auto [x,y] = returnValues(10,20);
std::cout << x ;
00007FF6A9C51E54 mov edx,0Ah
00007FF6A9C51E59 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << y ;
00007FF6A9C51E5E mov edx,14h
00007FF6A9C51E63 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
int a = 10, b = 20;
returnValuesVoid(a, b);
std::cout << a ;
00007FF6A9C51E68 mov edx,6Eh
00007FF6A9C51E6D call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << b ;
00007FF6A9C51E72 mov edx,78h
00007FF6A9C51E77 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
}
00007FF6A9C51E7C xor eax,eax
00007FF6A9C51E7E add rsp,28h
00007FF6A9C51E82 ret
So using clearer code seems to be the obvious choice.

What Zang said is true but not up-to the point. I ran the code provided in question with chrono to measure time. I think the answer needs to be edited after observing what happened.
For 1M iterations, time taken by function call via reference was 3ms while time taken by function call via std::tie combined with std::tuple was about 94ms.
Though the difference seems very less in practice, still tuple one will perform slightly slower. Hence, for performance intensive systems, I suggest using call by reference.
My code:
#include <iostream>
#include <tuple>
#include <chrono>
std::tuple<int, int> returnValues(const int a, const int b)
{
return std::tuple<int, int>(a, b);
}
void returnValuesVoid(int &a, int &b)
{
a += 100;
b += 100;
}
int main()
{
int a = 10, b = 20;
auto begin = std::chrono::high_resolution_clock::now();
int x, y;
for (int i = 0; i < 1000000; i++)
{
std::tie(x, y) = returnValues(a, b);
}
auto end = std::chrono::high_resolution_clock::now();
std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count()) << '\n';
a = 10;
b = 20;
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; i++)
{
returnValuesVoid(a, b);
}
auto stop = std::chrono::high_resolution_clock::now();
std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()) << '\n';
}

Why does C++ inline function has call instructions?

I read that with inline functions where ever the function call is made we replace the function call with the body of the function definition.
According to the above explanation there should not be any function call when inline is user.
If that is the case Why do I see three call instructions in the assembly code ?
#include <iostream>
inline int add(int x, int y)
{
return x+ y;
}
int main()
{
add(8,9);
add(20,10);
add(100,233);
}
meow#vikkyhacks ~/Arena/c/temp $ g++ -c a.cpp
meow#vikkyhacks ~/Arena/c/temp $ objdump -M intel -d a.o
0000000000000000 <main>:
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: be 09 00 00 00 mov esi,0x9
9: bf 08 00 00 00 mov edi,0x8
e: e8 00 00 00 00 call 13 <main+0x13>
13: be 0a 00 00 00 mov esi,0xa
18: bf 14 00 00 00 mov edi,0x14
1d: e8 00 00 00 00 call 22 <main+0x22>
22: be e9 00 00 00 mov esi,0xe9
27: bf 64 00 00 00 mov edi,0x64
2c: e8 00 00 00 00 call 31 <main+0x31>
31: b8 00 00 00 00 mov eax,0x0
36: 5d pop rbp
37: c3 ret
NOTE
Complete dump of the object file is here

You did not optimize so the calls are not inlined
You produced an object file (not a .exe) so the calls are not resolved. What you see is a dummy call whose address will be filled by the linker
If you compile a full executable you will see the correct addresses for the jumps
See page 28 of:
http://www.cs.princeton.edu/courses/archive/spr04/cos217/lectures/Assembler.pdf

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js