Relative vs Physical addresses in C++

I recently started learning about memory management and I read about relative addresses and physical addresses, and a question appeared in my mind:
When I print a variable's address, does it show the relative (virtual) address or the physical address where the variable is located in memory?
And another question regarding memory management:
Why does this code produce the same stack pointer value for each run (from Shellcoder's Handbook, page 28)?
Does any program that I run produce this address?
// find_start.c
unsigned long find_start(void)
{
    __asm__("movl %esp, %eax");
}
int main()
{
    printf("0x%x\n",find_start());
}
If we compile this and run this a few times, we get:
shellcoders@debian:~/chapter_2$ ./find_start
0xbffffad8
shellcoders@debian:~/chapter_2$ ./find_start
0xbffffad8
shellcoders@debian:~/chapter_2$ ./find_start
0xbffffad8
shellcoders@debian:~/chapter_2$ ./find_start
0xbffffad8
I would appreciate if someone could clarify this topic to me.

When I print a variable's address, does it show the relative (virtual) address or the physical address where the variable is located in memory?
The counterpart to a relative address is an absolute address. That has nothing to do with the distinction between virtual and physical addresses.
On most common modern operating systems, such as Windows, Linux and macOS, you will never encounter physical addresses unless you are writing a driver. These are handled internally by the operating system. You will only be working with virtual addresses.
Why does this code produce the same stack pointer value for each run (from Shellcoder's Handbook, page 28)?
On most modern operating systems, every process has its own virtual address space. The executable is loaded at its preferred base address in that address space if possible; otherwise it is loaded at another address (relocated). The preferred base address is normally stored in the executable's header. Depending on the operating system and CPU, the heap is probably created at a higher address, since the heap normally grows upward (towards higher addresses). The stack normally grows downward (towards lower addresses). Where it is created depends on the operating system: it may be placed below the load address of the executable, or, as the 0xbffffad8 value in your output suggests for 32-bit Linux, near the top of the user address space.
Since the preferred load address is the same every time you run the executable, it is likely that the virtual memory addresses are the same. However, this may change if address space layout randomization (ASLR) is used. Also, just because the virtual memory addresses are the same does not mean that the physical memory addresses are the same, too.
Does any program that I run produce this address?
Depending on your operating system, you can set the preferred base address in which your program is loaded into virtual memory in the linker settings. Many programs may still have the same base address as your program, probably because both programs were built using the same linker with default settings.
The virtual addresses are only per program? Let's say I have 2 programs: program1 and program2. Can program2 access program1's memory?
It is not possible for program2 to access program1's memory directly, because they have separate virtual address spaces. However, it is possible for one program to ask the operating system for permission to access another process's address space. The operating system will normally grant this permission, provided that the program has sufficient privileges. On Windows, this can be accomplished, for example, with the function WriteProcessMemory. Linux offers similar functionality through ptrace and by writing to /proc/[pid]/mem.

You get virtual addresses. Your program never gets to see physical addresses. Ever.
Can program2 access program1's memory?
No, because you don't have addresses that point to program1's memory. If you have virtual address 0xabcd1234 in the program1 process, and you try to read it from the program2 process, you get program2's 0xabcd1234 (or a crash if there is no such address in program2). It's not a permission check - it's not like the CPU goes to the memory and sees "oh, this is program1's memory, I shouldn't access it". It's program2's own memory space.
But yes, if you use "shared memory" to ask the OS to put the same physical memory in both processes.
And yes, if you use ptrace or /proc/<pid>/mem to ask the OS nicely to read from the other process's memory, and you have permission to do that, then it will do that.
Why does this code produce the same stack pointer value for each run (from Shellcoder's Handbook, page 28)? Does any program that I run produce this address?
Apparently, that program always has that stack pointer value. Different programs might have different stack pointers. And if you put more local variables in main, or call find_start from a different function, you will get a different stack pointer value because there will be more data pushed on the stack.
Note: even if you run the program twice at the same time, the address will be the same, because they are virtual addresses, and every process has its own virtual address space. They will be different physical addresses but you don't see the physical addresses.
In the stack overflow example in the book I mentioned, they overwrite the return address on the stack with the address of an exploit in the environment variables. How does it work?
It all works within one process.

Only focusing on a small part of your question.
#include <stdio.h>
// find_start.c
unsigned long find_start(void)
{
__asm__("movl %esp, %eax");
}
unsigned long nest ( void )
{
return(find_start());
}
int main()
{
printf("0x%lx\n",find_start());
printf("0x%lx\n",nest());
}
gcc so.c -o so
./so
0x50e381a0
0x50e38190
There is no magic here. The virtual space allows for programs to be built the same. I don't need to know where my program is going to live; each program can be compiled for the same address space. When loaded and run they can see the same virtual address space because they are all mapped to separate/different physical address spaces.
readelf -a so
(that works too, but I prefer objdump)
objdump -D so
Disassembly of section .text:
0000000000000540 <_start>:
540: 31 ed xor %ebp,%ebp
542: 49 89 d1 mov %rdx,%r9
545: 5e pop %rsi
....
000000000000064a <find_start>:
64a: 55 push %rbp
64b: 48 89 e5 mov %rsp,%rbp
64e: 89 e0 mov %esp,%eax
650: 90 nop
651: 5d pop %rbp
652: c3 retq
0000000000000653 <nest>:
653: 55 push %rbp
654: 48 89 e5 mov %rsp,%rbp
657: e8 ee ff ff ff callq 64a <find_start>
65c: 5d pop %rbp
65d: c3 retq
000000000000065e <main>:
65e: 55 push %rbp
65f: 48 89 e5 mov %rsp,%rbp
662: e8 e3 ff ff ff callq 64a <find_start>
667: 48 89 c6 mov %rax,%rsi
66a: 48 8d 3d b3 00 00 00 lea 0xb3(%rip),%rdi # 724 <_IO_stdin_used+0x4>
671: b8 00 00 00 00 mov $0x0,%eax
676: e8 a5 fe ff ff callq 520 <printf@plt>
67b: e8 d3 ff ff ff callq 653 <nest>
So, two things, or maybe more than two. Our entry point _start is at a low address, a low virtual address. On this system with this compiler I would expect all or most programs to start in a similar place, or the same place; in some cases it may depend on what is in my program, but it should be somewhere low.
The stack pointer though, if you check above and now as I type stuff:
0x355d38d0
0x355d38c0
it has changed.
0x4ebf1760
0x4ebf1750
0x31423240
0x31423230
0xa63188d0
0xa63188c0
a few times within a few seconds. The stack is a relative thing, not absolute, so there is no real need for a fixed address that is the same every time. It needs to be in a space that belongs to this user/thread, and it is virtual since it goes through the MMU for protection reasons. There is no reason for a virtual address to not equal the physical address. The kernel code/driver that manages the MMU for a platform is programmed to do it a certain way. You can have the address space for code start at 0x0000 for every program, and you might wish the address space for data to be the same, zero based, but for the stack it doesn't matter. And on my machine, with my OS, this particular version, on this particular day, it isn't consistent.
I originally thought your question was different. The result depends on factors that are specific to your build and settings. For a specific build, a single call to find_start will see the stack pointer at a fixed offset relative to where the stack started. Each function that uses the stack puts it back the way it was found, so as long as the compiled program doesn't change while running, the nesting is the same for a given call and the stack consumption by each function along the way is the same.
I added another layer, and looking at the disassembly, main, nest and find_start all mess with the stack pointer (unoptimized), so that is why the two values are 0x10 apart in these runs. If I added or removed code in one or more of the functions to change their stack usage, that delta could change.
But
gcc -O2 so.c -o so
objdump -D so > so.txt
./so
0x0
0x0
Disassembly of section .text:
0000000000000560 <main>:
560: 48 83 ec 08 sub $0x8,%rsp
564: 89 e0 mov %esp,%eax
566: 48 8d 35 e7 01 00 00 lea 0x1e7(%rip),%rsi # 754 <_IO_stdin_used+0x4>
56d: 31 d2 xor %edx,%edx
56f: bf 01 00 00 00 mov $0x1,%edi
574: 31 c0 xor %eax,%eax
576: e8 c5 ff ff ff callq 540 <__printf_chk@plt>
57b: 89 e0 mov %esp,%eax
57d: 48 8d 35 d0 01 00 00 lea 0x1d0(%rip),%rsi # 754 <_IO_stdin_used+0x4>
584: 31 d2 xor %edx,%edx
586: bf 01 00 00 00 mov $0x1,%edi
58b: 31 c0 xor %eax,%eax
58d: e8 ae ff ff ff callq 540 <__printf_chk@plt>
592: 31 c0 xor %eax,%eax
594: 48 83 c4 08 add $0x8,%rsp
598: c3 retq
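The optimizer didn't recognize the return value for some reason. (find_start has no return statement, so as far as the compiler is concerned nothing connects the value the asm leaves in eax to the argument passed to printf.)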
unsigned long fun ( void )
{
return(0x12345678);
}
00000000000006b0 <fun>:
6b0: b8 78 56 34 12 mov $0x12345678,%eax
6b5: c3 retq
calling convention looks fine.
Put find_start in a separate file so the optimizer can't remove it
gcc -O2 so.c sp.c -o so
./so
0xb1192fc8
0xb1192fc8
./so
0x7aa979d8
0x7aa979d8
./so
0x485134c8
0x485134c8
./so
0xa8317c98
0xa8317c98
./so
0x2ba70b8
0x2ba70b8
Disassembly of section .text:
0000000000000560 <main>:
560: 48 83 ec 08 sub $0x8,%rsp
564: e8 67 01 00 00 callq 6d0 <find_start>
569: 48 8d 35 f4 01 00 00 lea 0x1f4(%rip),%rsi # 764 <_IO_stdin_used+0x4>
570: 48 89 c2 mov %rax,%rdx
573: bf 01 00 00 00 mov $0x1,%edi
578: 31 c0 xor %eax,%eax
57a: e8 c1 ff ff ff callq 540 <__printf_chk@plt>
57f: e8 4c 01 00 00 callq 6d0 <find_start>
584: 48 8d 35 d9 01 00 00 lea 0x1d9(%rip),%rsi # 764 <_IO_stdin_used+0x4>
58b: 48 89 c2 mov %rax,%rdx
58e: bf 01 00 00 00 mov $0x1,%edi
593: 31 c0 xor %eax,%eax
595: e8 a6 ff ff ff callq 540 <__printf_chk@plt>
With find_start in a separate file the compiler can't inline it, but it can still see nest, so it inlined nest and removed the stack change that came with it. So now the value is the same whether the call is nested or not.

Related

C++ odd assembly output query

Using Windows 10 Pro with Visual Studio 2022, Debug mode, X64 platform, I have the following code...
int main()
{
int var = 1;
int* varPtr = &var;
*varPtr = 10;
return 0;
}
In the disassembly window we see this...
int var = 1;
00007FF75F1D1A0D C7 45 04 01 00 00 00 mov dword ptr [var],1
int* varPtr = &var;
00007FF75F1D1A14 48 8D 45 04 lea rax,[var]
00007FF75F1D1A18 48 89 45 28 mov qword ptr [varPtr],rax
*varPtr = 10;
00007FF75F1D1A1C 48 8B 45 28 mov rax,qword ptr [varPtr]
00007FF75F1D1A20 C7 00 0A 00 00 00 mov dword ptr [rax],0Ah
return 0;
Upon stepping through the above, the RAX register is loaded with the memory address for the stack variable, var, via...
00007FF75F1D1A14 48 8D 45 04 lea rax,[var]
Since RAX is not changed after this, why is that same var address being loaded into RAX again, 2 instructions later with...
00007FF75F1D1A1C 48 8B 45 28 mov rax,qword ptr [varPtr]
The memory view window shows that the &var address is constant throughout. Am I missing something daft?
[Updated] - switching to release mode with optimisation off returns the above in full. Turning on speed/size optimisation returns only the "return 0" code. It would be interesting to see if there's a way to force the compiler to compile everything (using the fast switch) and not remove what it thinks is redundant, for this example. This minimal example appears to be too minimal, lol.
Still concerned about that unneeded double load of RAX, primarily for such a small program, though yes, that's what 'optimisation' is all about. Silly.

When compiling in Debug mode (i.e. with all optimisations disabled), the compiler generates code like this for a reason.
Suppose you are stepping through the code and you stop on the line that reads *varPtr = 10;. At that point, you decide that you loaded the wrong address into varPtr and would like to change it and continue debugging without stopping, rebuilding and restarting your program.
Well, in Debug mode, you can. Just change the address stored in varPtr (in the Watch window, say) and carry on debugging. Without the 'redundant' second load, this wouldn't work. When the compiler emits said load, it does.
So, to summarise, Debug mode is designed to make debugging easier, while Release mode is designed to make your code run as fast (or be as small) as possible, hopefully with the same semantics.
And just be grateful that compiler writers understand the need for these two modes of operation. Without them, our lives as developers would be much, much harder.

Is it possible to write asm in C++ with opcode instead of shellcode

I'm curious if there's a way to use __asm in c++ then write that into memory instead of doing something like:
BYTE shell_code[] = { 0x48, 0x03 ,0x1c ,0x25, 0x0A, 0x00, 0x00, 0x00 };
write_to_memory(function, &shell_code, sizeof(shell_code));
So I would like to do:
asm_code = __asm("add rbx, &variable\n\t""jmp rbx") ;
write_to_memory(function, &asm_code , sizeof(asm_code ));
Worst case I can use GCC and objdump externally or something but hoping there's an internal way

You can put an asm(""); statement at global scope, with start/end labels inside it, and declare those labels as extern char start_code[], end_code[0]; so you can access them from C. C char arrays work most like asm labels, in terms of being able to use the C name and have it work as an address.
// compile with gcc -masm=intel
// AFAIK, no way to do that with clang
asm(
".pushsection .rodata \n" // we don't want to run this from here, it's just data
"start_code: \n"
" add rax, OFFSET variable \n" // *absolute* address as 32-bit sign-extended immediate
"end_code: \n"
".popsection"
);
__attribute__((used)) static int variable = 1;
extern char start_code[], end_code[0]; // C declarations for those asm labels
#include <string.h>
void copy_code(void *dst)
{
memcpy(dst, start_code, end_code - start_code);
}
It would be fine to have the payload code in the default .text section, but we can put it in .rodata since we don't want to run it.
Is that the kind of thing you're looking for? asm output on Godbolt (without assembling + disassembling):
start_code:
add rax, OFFSET variable
end_code:
copy_code(void*):
mov edx, OFFSET FLAT:end_code
mov esi, OFFSET FLAT:start_code
sub rdx, OFFSET FLAT:start_code
jmp [QWORD PTR memcpy@GOTPCREL[rip]]
To see if it actually assembles to what we want, I compiled with
gcc -O2 -fno-plt -masm=intel -fno-pie -no-pie -c foo.c to get a .o. objdump -drwC -Mintel shows:
0000000000000000 <copy_code>:
0: ba 00 00 00 00 mov edx,0x0 1: R_X86_64_32 .rodata+0x6
5: be 00 00 00 00 mov esi,0x0 6: R_X86_64_32 .rodata
a: 48 81 ea 00 00 00 00 sub rdx,0x0 d: R_X86_64_32S .rodata
11: ff 25 00 00 00 00 jmp QWORD PTR [rip+0x0] # 17 <end_code+0x11> 13: R_X86_64_GOTPCRELX memcpy-0x4
And with -D to see all sections, the actual payload is there in .rodata, still not linked yet:
Disassembly of section .rodata:
0000000000000000 <start_code>:
0: 48 05 00 00 00 00 add rax,0x0 2: R_X86_64_32S .data
-fno-pie -no-pie is only necessary for the 32-bit absolute address of variable to work. (Without it, we get two RIP-relative LEAs and a sub rdx, rsi. Unfortunately neither way of compiling gets GCC to subtract the symbols at build time with mov edx, OFFSET end_code - start_code, but that's just in the code doing the memcpy, not in the machine code being copied.)
In a linked executable
We can see how the linker filled in those relocations.
(I tested by using -nostartfiles instead of -c - I didn't want to run it, just look at the disassembly, so there was no point in actually writing a main.)
$ gcc -O2 -fno-plt -masm=intel -fno-pie -no-pie -nostartfiles foo.c
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
$ objdump -D -rwC -Mintel a.out
(manually edited to remove uninteresting sections)
Disassembly of section .text:
0000000000401000 <copy_code>:
401000: ba 06 20 40 00 mov edx,0x402006
401005: be 00 20 40 00 mov esi,0x402000
40100a: 48 81 ea 00 20 40 00 sub rdx,0x402000
401011: ff 25 e1 2f 00 00 jmp QWORD PTR [rip+0x2fe1] # 403ff8 <memcpy@GLIBC_2.14>
The linked payload:
0000000000402000 <start_code>:
402000: 48 05 18 40 40 00 add rax,0x404018 # from add rax, OFFSET variable
0000000000402006 <end_code>:
402006: 48 c7 c2 06 00 00 00 mov rdx,0x6
# this was from mov rdx, OFFSET end_code - start_code to see if that would assemble + link
Our non-zero-init dword variable that we're taking the address of:
Disassembly of section .data:
0000000000404018 <variable>:
404018: 01 00 add DWORD PTR [rax],eax
...
Your specific asm instruction is weird
&variable isn't valid asm syntax, but I'm guessing you wanted to add the address?
Since you're going to be copying the machine code somewhere, you must avoid RIP-relative addressing modes and any other relative references to things outside the block you're copying. Only mov can use 64-bit absolute addresses, like movabs rdi, OFFSET variable instead of the usual lea rdi, [rip + variable]. Also, you can even load / store into/from RAX/EAX/AX/AL with 64-bit absolute addresses movabs eax, [variable]. (mov-immediate can use any register, load/store are only the accumulator. https://www.felixcloutier.com/x86/mov)
(movabs is an AT&T mnemonic, but GAS allows it in .intel_syntax noprefix to force using 64-bit immediates, instead of the default 32-bit-sign-extended.)
This is kind of opposite of normal position-independent code, which works when the whole image is loaded at an arbitrary base. This will make code that works when the image is loaded to a fixed base (or even variable since runtime fixups should work for symbolic references), and then copied around relative to the rest of your code. So all your memory refs have to be absolute, except for within the asm.
So we couldn't have made PIE-compatible machine code by using lea rdx, [RIP+variable] / add rax, rdx - that would only get the right address for variable when run from the linked location in .rodata, not from any copy. (Unless you manually fixup the code when copying it, but it's still only a rel32 displacement.)
Terminology:
An opcode is part of a machine instruction, e.g. add ecx, 123 assembles to 3 bytes: 83 c1 7b. Those are the opcode, modrm, and imm8 respectively. (https://www.felixcloutier.com/x86/add).
"opcode" also gets misused (especially in shellcode usage) to describe the whole instruction represented as bytes.
Text names for instructions like add are mnemonics.

this is just a guess, i don't know if it will work. i'm sorry in advance for an ugly answer since i don't have much time due to work.
i think you can enclose your asm code inside labels.
get the address of that label and the size. treat it as a blob of data and you can write it anywhere.
void funcA(){
//some code here.
labelStart:
__asm("
;asm code here.
")
labelEnd:
//some code here.
//---make code as movable data.
char* pDynamicProgram = labelStart;
size_t sizeDP = labelEnd - labelStart;
//---writing to some memory.
char* someBuffer = malloc(sizeDP);
memcpy(someBuffer, pDynamicProgram, sizeDP);
//---execute: cast as a function pointer then execute call.
((func*)someBuffer)(/* parameters if any*/);
}
the sample code above of course is not compilable, but the logic is kind of like that. i see viruses do it that way, though i haven't seen the actual c++ code, only what disassemblers show. for the "return" logic after the call, there are many ad hoc ways to do that. just be creative.
also, i think you first have to enable some settings for your program to write to some forbidden memory in case you want to override an existing function.

Why do you need to recompile C/C++ for each OS? [duplicate]

This is more of a theoretical question than anything. I'm a Comp sci major with a huge interest in low level programming. I love finding out how things work under the hood. My specialization is compiler design.
Anyway, as I'm working on my first compiler, things are occurring to me that are kind of confusing.
When you write a program in C/C++, the traditional thing people know is, a compiler magically turns your C/C++ code into native code for that machine.
But something doesn't add up here. If I compile my C/C++ program targeting the x86 architecture, it would seem that the same program should run on any computer with the same architecture. But that doesn't happen. You need to recompile your code for OS X or Linux or Windows. (And yet again for 32-bit vs 64-bit.)
I'm just wondering why this is the case? Don't we target the CPU architecture/instruction set when compiling a C/C++ program? And macOS and Windows can very much be running on the exact same architecture.
(I know Java and similar target a VM or CLR so those don't count)
If I took a best-shot answer at this, I'd say C/C++ must then compile to OS-specific instructions. But every source I read says the compiler targets the machine. So I'm very confused.

Don't we target the CPU architecture/instruction set when compiling a C/C++ program?
No, you don't.
I mean yes, you are compiling for a CPU instruction set. But that's not all compilation is.
Consider the simplest "Hello, world!" program. All it does is call printf, right? But there's no "printf" instruction set opcode. So... what exactly happens?
Well, that's part of the C standard library. Its printf function does some processing on the string and parameters, then... displays it. How does that happen? Well, it sends the string to standard out. OK... who controls that?
The operating system. And there's no "standard out" opcode either, so sending a string to standard out involves some form of OS call.
And OS calls are not standardized across operating systems. Pretty much every standard library function that does something you couldn't build on your own in C or C++ is going to talk to the OS to do at least some of its work.
malloc? Memory doesn't belong to you; it belongs to the OS, and you maybe are allowed to have some. scanf? Standard input doesn't belong to you; it belongs to the OS, and you can maybe read from it. And so on.
Your standard library is built from calls to OS routines. And those OS routines are non-portable, so your standard library implementation is non-portable. So your executable has these non-portable calls in it.
And on top of all of that, different OSs have different ideas of what an "executable" even looks like. An executable isn't just a bunch of opcodes, after all; where do you think all of those constant and pre-initialized static variables get stored? Different OSs have different ways of starting up an executable, and the structure of the executable is a part of that.

How do you allocate memory? There's no CPU instruction for allocating dynamic memory, you have to ask the OS for the memory. But what are the parameters? How do you invoke the OS?
How do you print output? How do you open a file? How do you set a timer? How do you display a UI? All of these things require requesting services from the OS, and different OSes provide different services with different calls necessary to request them.

If I compile my C/C++ program targeting the x86 architecture, it would seem that the same program should run on any computer with the same architecture.
It is very true, but there're a few nuances.
Let's consider several cases of programs that are, from C-language point of view, OS-independent.
Suppose all that your program does, from the very beginning, is stress-testing the CPU by doing lots of computations without any I/O.
The machine code could be exactly the same for all the OSes (provided they all run in the same CPU mode, e.g. x86 32-bit Protected Mode). You could even write it in assembly language directly, it wouldn't need to be adapted for each OS.
But each OS wants different headers for the binaries containing this code. E.g. Windows wants PE format, Linux needs ELF, macOS uses Mach-O format. For your simple program you could prepare the machine code as a separate file, and a bunch of headers for each OS's executable format. Then all you need to "recompile" would actually be to concatenate the header and the machine code and, possibly, add alignment "footer".
So, suppose you compiled your C code into machine code, which looks as follows:
offset: instruction disassembly
00: f7 e0 mul eax
02: eb fc jmp short 00
This is the simple stress-testing code, repeatedly doing multiplications of eax register by itself.
Now you want to make it run on 32-bit Linux and 32-bit Windows. You'll need two headers, here're examples (hex dump):
For Linux:
000000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 >.ELF............<
000010 02 00 03 00 01 00 00 00 54 80 04 08 34 00 00 00 >........T...4...<
000020 00 00 00 00 00 00 00 00 34 00 20 00 01 00 28 00 >........4. ...(.<
000030 00 00 00 00 01 00 00 00 54 00 00 00 54 80 04 08 >........T...T...<
000040 54 80 04 08 04 00 00 00 04 00 00 00 05 00 00 00 >T...............<
000050 00 10 00 00 >....<
For Windows (* simply repeats previous line until the address below * is reached):
000000 4d 5a 80 00 01 00 00 00 04 00 10 00 ff ff 00 00 >MZ..............<
000010 40 01 00 00 00 00 00 00 40 00 00 00 00 00 00 00 >@.......@.......<
000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
000030 00 00 00 00 00 00 00 00 00 00 00 00 80 00 00 00 >................<
000040 0e 1f ba 0e 00 b4 09 cd 21 b8 01 4c cd 21 54 68 >........!..L.!Th<
000050 69 73 20 70 72 6f 67 72 61 6d 20 63 61 6e 6e 6f >is program canno<
000060 74 20 62 65 20 72 75 6e 20 69 6e 20 44 4f 53 20 >t be run in DOS <
000070 6d 6f 64 65 2e 0d 0a 24 00 00 00 00 00 00 00 00 >mode...$........<
000080 50 45 00 00 4c 01 01 00 ee 71 b4 5e 00 00 00 00 >PE..L....q.^....<
000090 00 00 00 00 e0 00 0f 01 0b 01 01 47 00 02 00 00 >...........G....<
0000a0 00 02 00 00 00 00 00 00 00 10 00 00 00 10 00 00 >................<
0000b0 00 10 00 00 00 00 40 00 00 10 00 00 00 02 00 00 >......@.........<
0000c0 01 00 00 00 00 00 00 00 03 00 0a 00 00 00 00 00 >................<
0000d0 00 20 00 00 00 02 00 00 40 fb 00 00 03 00 00 00 >. ......@.......<
0000e0 00 10 00 00 00 10 00 00 00 00 01 00 00 00 00 00 >................<
0000f0 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 >................<
000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
*
000170 00 00 00 00 00 00 00 00 2e 66 6c 61 74 00 00 00 >.........flat...<
000180 04 00 00 00 00 10 00 00 00 02 00 00 00 02 00 00 >................<
000190 00 00 00 00 00 00 00 00 00 00 00 00 60 00 00 e0 >............`...<
0001a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
*
000200
Now if you append your machine code to these headers and, for Windows, also append a bunch of null bytes to make file size 1024 bytes, you'll get valid executables that will run on the corresponding OS.
Suppose now that your program wants to terminate after doing some amount of calculations.
Now it has two options:
Crash—e.g. by execution of an invalid instruction (on x86 it could be UD2). This is easy, OS-independent, but not elegant.
Ask the OS to correctly terminate the process. At this point we need an OS-dependent mechanism to do this.
On x86 Linux it would be
xor ebx, ebx ; zero exit code
mov eax, 1 ; __NR_exit
int 0x80 ; do the system call (the easiest way)
On x86 Windows 7 it would be
; First call terminates all threads except caller thread, see for details:
; http://www.rohitab.com/discuss/topic/41523-windows-process-termination/
mov eax, 0x172 ; NtTerminateProcess_Wind7
mov edx, terminateParams
int 0x2e ; do the system call
; Second call terminates current process
mov eax, 0x172
mov edx, terminateParams
int 0x2e
terminateParams:
dd 0, 0 ; processHandle, exitStatus
Note that on other Windows versions you'd need another system call number. The proper way to call NtTerminateProcess is via yet another nuance of OS-dependence: shared libraries.
Now your program wants to load some shared library to avoid reinventing some wheels.
OK, we've seen that our executable file formats are different. Suppose that we've taken this into account and prepared the import sections for the file targeting each OS. There's still a problem: the way to call a function, the so-called calling convention, is different for each OS.
E.g. suppose the C-language function your program needs to call returns a structure containing two int values. On Linux the caller would have to allocate some space (e.g. on the stack) and pass the pointer to it as the first parameter to the function being called, like so:
sub esp, 12 ; 4*2+alignment: stack must be 16-byte aligned
push esp ; right before the call instruction
call myFunc
On Windows you'd get the first int value of the structure in EAX, and the second in EDX, without passing any additional parameters to the function.
There are other nuances like different name mangling schemes (though these can differ between compilers even on the same OS), different data types (e.g. long double is 64 bits wide on MSVC but an 80-bit x87 extended type on GCC for x86) etc., but the above mentioned ones are the most important differences between the OSes from the point of view of the compiler and linker.

No, you are not just targeting a CPU. You are also targeting the OS. Let's say you need to print something to the terminal screen using cout. cout will eventually wind up calling an API function for the OS the program is running on. That call can, and will, be different for different operating systems, so that means you need to compile the program for each OS so it makes the correct OS calls.

The standard library and the C-runtime must interact with OS API's.
The executable formats for the different target OS's are different.
Different OS kernels can configure the hardware differently. Things like byte order, stack direction, register use conventions, and probably many other things can be physically different.

Strictly speaking, you don't need to
Program Loaders
You have Wine, WSL1 or Darling, which are all loaders for the respective other OS's binary formats. These tools work so well because the machine is basically the same.
When you create an executable, the machine code for "5+3" is basically the same on all x86 based platforms, however there are differences, already mentioned by the other answers, like:
file format
API: eg. Functions exposed by the OS
ABI: Binary layout etc.
These differ. Now, eg. wine makes Linux understand the WinPE format, and then "simply" runs the machine code as a Linux process (no emulation!). It implements parts of the WinAPI and translates it for Linux. Actually, Windows does pretty much the same thing, as Windows programs do not talk to the Windows Kernel (NT) but the Win32 subsystem... which translates the WinAPI into the NT API. As such, wine is "basically" another WinAPI implementation based on the Linux API.
C in a VM
Also, you can actually compile C into something other than "bare" machine code, like LLVM bitcode or Wasm. Projects like GraalVM even make it possible to run C in the Java Virtual Machine: compile once, run everywhere. There you target another API/ABI/file format which was intended to be "portable" from the start.
So while the ISA makes up the whole language a CPU can understand, most programs don't "depend" only on the CPU ISA but also need the OS in order to work. The toolchain must see to that.
But you're right
Actually, you are rather close to being right, however. You actually could compile for Linux and Win32 with your compiler and perhaps even get the same result -- for a rather narrow definition of "compiler". But when you invoke the compiler like this:
c99 -o foo foo.c
You don't only compile (translate the C code to, eg., assembly) but you do this:
Run the C preprocessor
Run the "actual" C compiler frontend
Run the assembler
Run the linker
There might be more or fewer steps, but that's the usual pipeline. And step 2 is, again with a grain of salt, basically the same on every platform. However, the preprocessor copies different header files into your compilation unit (step 1) and the linker works completely differently. The actual translation from one language (C) to another (ASM), which is what a compiler does from a theoretical perspective, is platform independent.

For a binary to work properly (or in some cases at all) there are a whole lot of ugly details that need to be consistent/correct, including but probably not limited to:
How C source constructs like procedure calls, parameters, types etc. are mapped onto architecture-specific constructs like registers, memory locations, stack frames etc.
How the results of compilation are expressed in an executable file so that the binary loader can load them into the correct places in the virtual address space and/or perform "fixups" after they are loaded in an arbitrary place.
How exactly the standard library is implemented, sometimes standard library functions are actual functions in a library, but often they are instead macros, inline functions or even compiler builtins that may rely on non-standard functions in the library.
Where the boundary between the OS and the application is considered to be, on unix-like systems the C standard library is considered a core platform library. On the other hand on windows the C standard library is considered to be something that the compiler provides and is either compiled into the application or shipped alongside it.
How are other libraries implemented? what names do they use? how are they loaded?
Differences in one or more of these things are why you can't just take a binary intended for one OS and load it normally on another.
Having said that, it is possible to run code intended for one OS on another. That is essentially what Wine does. It has special translator libraries that translate Windows API calls into calls that are available on Linux, and a special binary loader that knows how to load both Windows and Linux binaries.

How does C++ linking work in practice? [duplicate]

This question already has answers here:
What do linkers do?
(5 answers)
Closed 7 years ago.
How does C++ linking work in practice? What I am looking for is a detailed explanation about how the linking happens, and not what commands do the linking.
There's already a similar question about compilation which doesn't go into too much detail: How does the compilation/linking process work?

EDIT: I have moved this answer to the duplicate: https://stackoverflow.com/a/33690144/895245
This answer focuses on address relocation, which is one of the crucial functions of linking.
A minimal example will be used to clarify the concept.
0) Introduction
Summary: relocation edits the .text section of object files to translate:
object file address
into the final address of the executable
This must be done by the linker because the compiler only sees one input file at a time, but we must know about all object files at once to decide how to:
resolve undefined symbols, like functions that are declared but not defined in the current file
avoid clashes between the .text and .data sections of multiple object files
Prerequisites: minimal understanding of:
x86-64 or IA-32 assembly
global structure of an ELF file. I have made a tutorial for that
Linking has nothing to do with C or C++ specifically: compilers just generate the object files. The linker then takes them as input without ever knowing what language compiled them. It might as well be Fortran.
So to reduce the crust, let's study a NASM x86-64 ELF Linux hello world:
section .data
hello_world db "Hello world!", 10
section .text
global _start
_start:
; sys_write
mov rax, 1
mov rdi, 1
mov rsi, hello_world
mov rdx, 13
syscall
; sys_exit
mov rax, 60
mov rdi, 0
syscall
assembled and linked with:
nasm -felf64 hello_world.asm # creates hello_world.o
ld -o hello_world.out hello_world.o # static ELF executable with no libraries
with NASM 2.10.09.
1) .text of .o
First we disassemble the .text section of the object file:
objdump -d hello_world.o
which gives:
0000000000000000 <_start>:
0: b8 01 00 00 00 mov $0x1,%eax
5: bf 01 00 00 00 mov $0x1,%edi
a: 48 be 00 00 00 00 00 movabs $0x0,%rsi
11: 00 00 00
14: ba 0d 00 00 00 mov $0xd,%edx
19: 0f 05 syscall
1b: b8 3c 00 00 00 mov $0x3c,%eax
20: bf 00 00 00 00 mov $0x0,%edi
25: 0f 05 syscall
the crucial lines are:
a: 48 be 00 00 00 00 00 movabs $0x0,%rsi
11: 00 00 00
which should move the address of the hello world string into the rsi register, which is passed to the write system call.
But wait! How can the compiler possibly know where "Hello world!" will end up in memory when the program is loaded?
Well, it can't, especially after we link a bunch of .o files together with multiple .data sections.
Only the linker can do that, since only it will have all those object files.
So the compiler just:
puts a placeholder value 0x0 on the compiled output
gives some extra information to the linker of how to modify the compiled code with the good addresses
This "extra information" is contained in the .rela.text section of the object file
2) .rela.text
.rela.text stands for "relocation of the .text section".
The word relocation is used because the linker will have to relocate the address from the object into the executable.
We can display the .rela.text section with:
readelf -r hello_world.o
which contains:
Relocation section '.rela.text' at offset 0x340 contains 1 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000000c 000200000001 R_X86_64_64 0000000000000000 .data + 0
The format of this section is fixed and documented at: http://www.sco.com/developers/gabi/2003-12-17/ch4.reloc.html
Each entry tells the linker about one address which needs to be relocated, here we have only one for the string.
Simplifying a bit, for this particular line we have the following information:
Offset = C: the first byte of the .text that this entry changes.
If we look back at the disassembled text, it is exactly inside the critical movabs $0x0,%rsi, and those who know x86-64 instruction encoding will notice that this is the 64-bit address part of the instruction.
Name = .data: the address points to the .data section
Type = R_X86_64_64, which specifies exactly what calculation has to be done to translate the address.
This field is actually processor dependent, and thus documented on the AMD64 System V ABI extension section 4.4 "Relocation".
That document says that R_X86_64_64 does:
Field = word64: 8 bytes, thus the 00 00 00 00 00 00 00 00 at address 0xC
Calculation = S + A
S is the value of the symbol that the relocation entry refers to, here the address at which the .data section ends up in the executable
A is the addend which is 0 here. This is a field of the relocation entry.
So S + A is the address of .data plus 0, and the placeholder gets replaced with the very first address of the .data section.
3) .text of .out
Now let's look at the text area of the executable ld generated for us:
objdump -d hello_world.out
gives:
00000000004000b0 <_start>:
4000b0: b8 01 00 00 00 mov $0x1,%eax
4000b5: bf 01 00 00 00 mov $0x1,%edi
4000ba: 48 be d8 00 60 00 00 movabs $0x6000d8,%rsi
4000c1: 00 00 00
4000c4: ba 0d 00 00 00 mov $0xd,%edx
4000c9: 0f 05 syscall
4000cb: b8 3c 00 00 00 mov $0x3c,%eax
4000d0: bf 00 00 00 00 mov $0x0,%edi
4000d5: 0f 05 syscall
So the only thing that changed from the object file are the critical lines:
4000ba: 48 be d8 00 60 00 00 movabs $0x6000d8,%rsi
4000c1: 00 00 00
which now point to the address 0x6000d8 (d8 00 60 00 00 00 00 00 in little-endian) instead of 0x0.
Is this the right location for the hello_world string?
To decide we have to check the program headers, which tell Linux where to load each segment.
We display them with:
readelf -l hello_world.out
which gives:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000000d7 0x00000000000000d7 R E 200000
LOAD 0x00000000000000d8 0x00000000006000d8 0x00000000006000d8
0x000000000000000d 0x000000000000000d RW 200000
Section to Segment mapping:
Segment Sections...
00 .text
01 .data
This tells us that the .data section, which is the second one, starts at VirtAddr = 0x06000d8.
And the only thing on the data section is our hello world string.

Actually, one could say linking is relatively simple.
In the simplest sense, it's just about bundling together object files [1], as those already contain the emitted assembly for each of the functions/globals/data contained in their respective source. The linker can be extremely dumb here and just treat everything as a symbol (name) and its definition (or content).
Obviously, the linker needs to produce a file that respects a certain format (generally the ELF format on Unix) and will separate the various categories of code/data into different sections of the file, but that is just dispatching.
The two complications I know of are:
the need to de-duplicate symbols: some symbols are present in several object files and only one should make it into the resulting library/executable being created; it is the linker's job to include only one of the definitions
link-time optimization: in this case the object files contain not the emitted assembly but an intermediate representation, and the linker merges all the object files together, applies optimization passes (inlining, for example), compiles the result down to assembly and finally emits it.
1: the result of the compilation of the different translation units (roughly, preprocessed source files)

Besides the already mentioned "Linkers and Loaders", if you wanted to know how a real and modern linker works, you could start here.

GCC function padding value

Whenever I compile C or C++ code with optimizations enabled, GCC aligns functions to a 16-byte boundary (on IA-32). If the function is shorter than 16 bytes, GCC pads it with some bytes, which don't seem to be random at all:
19: c3 ret
1a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
It always seems to be either 8d b6 00 00 00 00 ... or 8d 74 26 00.
Do function padding bytes have any significance?

The padding is created by the assembler, not by gcc. It merely sees a .align directive (or equivalent) and doesn't know whether the space to be padded is inside a function (e.g. loop alignment) or between functions, so it must insert NOPs of some sort. Modern x86 assemblers use the largest possible NOP opcodes with the intention of spending as few cycles as possible if the padding is for loop alignment.
Personally, I'm extremely skeptical of alignment as an optimization technique. I've never seen it help much, and it can definitely hurt by increasing the total code size (and cache utilization) tremendously. If you use the -Os optimization level, it's off by default, so there's nothing to worry about. Otherwise you can disable all the alignments with the proper -f options.

The assembler first sees an .align directive. Since it doesn't know if this address is within a function body or not, it cannot output NULL 0x00 bytes, and must generate NOPs (0x90).
However:
lea esi,[esi+0x0] ; does nothing, pseudocode: ESI = ESI + 0
executes in fewer clock cycles than
nop
nop
nop
nop
nop
nop
If this code happened to fall within a function body (for instance, loop alignment), the lea version would be much faster, while still "doing nothing."

The instruction lea 0x0(%esi),%esi just loads the value in %esi into %esi - it's a no-operation (or NOP), which means that if it's executed it will have no effect.
This just happens to be a single instruction, 6-byte NOP. 8d 74 26 00 is just a 4-byte encoding of the same instruction.
