Dwarf DW_AT_location objdump and dwarfdump inconsistent

I am playing around with CPython and trying to understand how a debugger works.
Specifically, I am trying to get the location of the last PyFrameObject so that I can traverse that and get the Python backtrace.
In the file ceval.c, line 689 has the definition of the function:
PyObject * PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)
What I am interested in getting is the location of f on the stack. When dumping the binary with dwarfdump I get that f is at $rbp-824, but if I dump the binary with objdump I get that the location is $rbp-808 - a discrepancy of 16. Also, when debugging with GDB, I get that the correct answer is $rbp-808 like objdump gives me. Why the discrepancy, and why is dwarfdump incorrect? What am I not understanding?
How to technically recreate the problem:
Download python-2.7.17.tgz from Python website. Extract.
I compiled python-2.7.17 from source with debug symbols (./configure --enable-pydebug && make). Run the following commands on the resulting python binary:
dwarfdump Python-2.7.17/python has the following output:
DW_AT_name f
DW_AT_decl_file 0x00000001 /home/meir/code/python/Python-2.7.17/Python/ceval.c
DW_AT_decl_line 0x000002b1
DW_AT_type <0x00002916>
DW_AT_location len 0x0003: 91c879: DW_OP_fbreg -824
I know this is the correct f because the line the variable is declared on is 689 (0x2b1). As you can see the location is:
DW_AT_location len 0x0003: 91c879: DW_OP_fbreg -824: Meaning $rbp-824.
Running the command objdump -S Python-2.7.17/python has the following output:
PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)
{
f7577: 55 push %rbp
f7578: 48 89 e5 mov %rsp,%rbp
f757b: 41 57 push %r15
f757d: 41 56 push %r14
f757f: 41 55 push %r13
f7581: 41 54 push %r12
f7583: 53 push %rbx
f7584: 48 81 ec 38 03 00 00 sub $0x338,%rsp
f758b: 48 89 bd d8 fc ff ff mov %rdi,-0x328(%rbp)
f7592: 89 b5 d4 fc ff ff mov %esi,-0x32c(%rbp)
f7598: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
f759f: 00 00
f75a1: 48 89 45 c8 mov %rax,-0x38(%rbp)
f75a5: 31 c0 xor %eax,%eax
Stepping through this code shows that the relevant line is:
f758b: 48 89 bd d8 fc ff ff mov %rdi,-0x328(%rbp) where you can clearly see that f (passed in %rdi) is being stored to -0x328(%rbp), which is $rbp-808. GDB confirms this finding.
So again, the question is: what am I missing, and why is there a 16-byte discrepancy between dwarfdump and reality?
Thanks
Edit:
The dwarfdump including the function above is:
< 1><0x00004519> DW_TAG_subprogram
DW_AT_external yes(1)
DW_AT_name PyEval_EvalFrameEx
DW_AT_decl_file 0x00000001 /home/meir/code/python/Python-2.7.17/Python/ceval.c
DW_AT_decl_line 0x000002b1
DW_AT_prototyped yes(1)
DW_AT_type <0x00000817>
DW_AT_low_pc 0x000f7577
DW_AT_high_pc <offset-from-lowpc>53969
DW_AT_frame_base len 0x0001: 9c: DW_OP_call_frame_cfa
DW_AT_GNU_all_tail_call_sites yes(1)
DW_AT_sibling <0x00005bbe>
< 2><0x0000453b> DW_TAG_formal_parameter
DW_AT_name f
DW_AT_decl_file 0x00000001 /home/meir/code/python/Python-2.7.17/Python/ceval.c
DW_AT_decl_line 0x000002b1
DW_AT_type <0x00002916>
DW_AT_location len 0x0003: 91c879: DW_OP_fbreg -824
According to the answer below, DW_OP_fbreg is an offset from the frame base, which in my case is DW_OP_call_frame_cfa. I am having trouble identifying the frame base. My registers are as follows:
(gdb) info registers
rax 0xfffffffffffffdfe -514
rbx 0x7f6a4887d040 140094460121152
rcx 0x7f6a48e83ff7 140094466441207
rdx 0x0 0
rsi 0x0 0
rdi 0x0 0
rbp 0x7ffd24bcef00 0x7ffd24bcef00
rsp 0x7ffd24bceba0 0x7ffd24bceba0
r8 0x7ffd24bcea50 140725219813968
r9 0x0 0
r10 0x0 0
r11 0x246 582
r12 0x7f6a48870df0 140094460071408
r13 0x7f6a48874b58 140094460087128
r14 0x1 1
r15 0x7f6a48873794 140094460082068
rip 0x5559834e99c0 0x5559834e99c0 <PyEval_EvalFrameEx+46153>
eflags 0x246 [ PF ZF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
As stated above, I already know that %rbp-808 works. What is the correct way to do it with the registers that I have?
Edit:
I finally understood the answer. I needed to unwind one more frame and find the place where my function was called. There, relative to the caller's $rsp, $rsp-824 really was the correct location of the variable I was looking for.

DW_OP_fbreg -824: Meaning $rbp-824
It does not mean that. It means offset -824 from the frame base, a virtual register that is not necessarily (nor usually) equal to $rbp.
You need to look at DW_AT_frame_base to know what the frame base of the current function is.
Most likely it is defined as DW_OP_call_frame_cfa, which is the value of $RSP just before the current function was called, and is equal to $RBP+16 (8 bytes for the return address pushed by the CALL instruction, plus 8 bytes for the previous $RBP pushed by the first instruction of your function). That gives $RBP+16-824 = $RBP-808, which matches objdump and GDB.
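The arithmetic can be checked with a short Python sketch. It decodes the signed LEB128 operand of DW_OP_fbreg from the bytes dwarfdump printed (91c879), decodes the 32-bit displacement from the mov instruction bytes objdump printed, and compares the two. The helper name decode_sleb128 is made up for this illustration, and the 16-byte distance between $rbp and the CFA assumes the standard push %rbp / mov %rsp,%rbp prologue shown in the question, where the CFA ends up 16 bytes above $rbp.

```python
def decode_sleb128(data: bytes) -> int:
    """Decode one signed LEB128 value from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:           # high bit clear: this is the last byte
            if byte & 0x40:           # sign bit of the final 7-bit group
                result -= 1 << shift
            return result
    raise ValueError("truncated LEB128 value")


# DW_AT_location block: 0x91 is DW_OP_fbreg, the remaining bytes are its operand.
location = bytes.fromhex("91c879")
fbreg_offset = decode_sleb128(location[1:])

# disp32 of `mov %rdi,-0x328(%rbp)`: bytes d8 fc ff ff, little-endian, signed.
rbp_disp = int.from_bytes(bytes.fromhex("d8fcffff"), "little", signed=True)

# Assumption: after `push %rbp; mov %rsp,%rbp` the CFA is 16 bytes above %rbp.
CFA_MINUS_RBP = 16

print(fbreg_offset)                  # offset from the frame base (CFA)
print(rbp_disp)                      # offset from %rbp used by the code
print(fbreg_offset + CFA_MINUS_RBP)  # the DWARF location expressed relative to %rbp
```

The first value is the -824 reported by dwarfdump, and the last two agree, which is the 16-byte difference the question asks about.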

Why is this binary vulnerable to buffer overflow?

This is an extract of a binary that is buffer overflowed. I decompiled it with Ghidra.
char local_7 [32];
long local_78;
printf("Give it a try");
gets(local_7);
if (local_78 != 0x4141414141414141) {
if (local_78 == 0x1122334455667788) {
puts("That's won");
}
puts("Let's continue");
}
I'd like to understand why a buffer overflow is possible here.
I looked up the hex value 0x4141414141414141 and saw that it corresponds to a string of "A" characters. But what exactly do the conditions involving 0x4141414141414141 and 0x1122334455667788 do? More precisely, what could the user enter to get the "That's won" message?
Any explanations would be greatly appreciated, thanks!
___EDIT___
I should add that I see these two hex values when using the "disas main" command:
0x00000000000011a7 <+8>: movabs $0x4141414141414141,%rax
0x00000000000011e6 <+71>: movabs $0x4141414141414141,%rax
0x00000000000011f6 <+87>: movabs $0x1122334455667788,%rax
I tried a buffer overflow using python3 -c "print ('A' * 32 +'\x88\x77\x66\x55\x44\x33\x22\x11')" | ./myBinary.
But I always get the "Let's continue" message. I'm not far from the solution, but I guess I'm missing something. Could you help me figure out what?
___EDIT 2___
Before the gets :
char local_7 [40];
long local_78;
local_78 = 0x4141414141414141;
printf("Give it a try");
fflush(stdout);
gets(local_7);
[... and so on]

Here is the full disassembly:
(gdb) disassemble main
Dump of assembler code for function main:
0x0000000000001189 <+0>: endbr64
0x000000000000118d <+4>: push %rbp
0x000000000000118e <+5>: mov %rsp,%rbp
0x0000000000001191 <+8>: sub $0x30,%rsp
0x0000000000001195 <+12>: lea 0xe68(%rip),%rdi # 0x2004
0x000000000000119c <+19>: mov $0x0,%eax
0x00000000000011a1 <+24>: callq 0x1080 <printf@plt>
0x00000000000011a6 <+29>: lea -0x30(%rbp),%rax
0x00000000000011aa <+33>: mov %rax,%rdi
0x00000000000011ad <+36>: mov $0x0,%eax
0x00000000000011b2 <+41>: callq 0x1090 <gets@plt>
0x00000000000011b7 <+46>: movabs $0x4141414141414141,%rax
0x00000000000011c1 <+56>: cmp %rax,-0x8(%rbp)
0x00000000000011c5 <+60>: je 0x11ef <main+102>
0x00000000000011c7 <+62>: movabs $0x1122334455667788,%rax
0x00000000000011d1 <+72>: cmp %rax,-0x8(%rbp)
0x00000000000011d5 <+76>: jne 0x11e3 <main+90>
0x00000000000011d7 <+78>: lea 0xe34(%rip),%rdi # 0x2012
0x00000000000011de <+85>: callq 0x1070 <puts@plt>
0x00000000000011e3 <+90>: lea 0xe33(%rip),%rdi # 0x201d
0x00000000000011ea <+97>: callq 0x1070 <puts@plt>
0x00000000000011ef <+102>: mov $0x0,%eax
0x00000000000011f4 <+107>: leaveq
0x00000000000011f5 <+108>: retq
The important offsets can be read from the instruction that sets up local_7 as the argument to gets:
0x00000000000011a6 <+29>: lea -0x30(%rbp),%rax
and the cmp instruction comparing the local_78 variable.
0x00000000000011c1 <+56>: cmp %rax,-0x8(%rbp)
As you can see, local_7 is at -0x30(%rbp) and local_78 is at -0x8(%rbp), exactly 40 bytes after the start of the buffer.
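As a quick check of that arithmetic, here is a minimal Python sketch that derives the filler length from the two %rbp displacements and builds the byte string to send. The variable names are made up for this illustration; the displacements and the target value come from the disassembly above.

```python
import struct

buf_offset = -0x30     # lea -0x30(%rbp),%rax  -> start of local_7
target_offset = -0x8   # cmp %rax,-0x8(%rbp)   -> local_78

# Distance from the start of the buffer to the variable we want to overwrite.
filler_len = target_offset - buf_offset

# "<Q" packs an unsigned 64-bit integer in little-endian byte order,
# which is how x86-64 stores the long in memory.
payload = b"A" * filler_len + struct.pack("<Q", 0x1122334455667788)

print(filler_len)
print(payload[filler_len:].hex(" "))
```

Using struct.pack avoids writing the eight bytes in reverse order by hand.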
Your Python command is not correct: print works on a str, which Python 3 encodes as UTF-8 on output, so every character above 0x7f (here '\x88') is written as two bytes:
$ python3 -c "print ('A' * 40 +'\x88\x77\x66\x55\x44\x33\x22\x11')"|hd -v
00000000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
00000010 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
00000020 41 41 41 41 41 41 41 41 c2 88 77 66 55 44 33 22 |AAAAAAAA..wfUD3"|
00000030 11 0a |..|
00000032
Notice the c2 byte before 88. See the following question for details:
Why is the output of print in python2 and python3 different with the same string?
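The extra byte can be reproduced without running the binary. This small sketch assumes the terminal or pipe encoding is UTF-8, as the hd output above suggests, and compares it with a one-byte-per-character encoding:

```python
text = "AA" + "\x88\x11"

# What print() effectively writes when stdout is UTF-8 encoded.
as_utf8 = text.encode("utf-8")

# Latin-1 maps each code point below 0x100 to exactly one byte.
as_latin1 = text.encode("latin-1")

print(as_utf8.hex(" "))
print(as_latin1.hex(" "))
```

Only the character at 0x88 grows to two bytes (c2 88); 0x11 is in the ASCII range and is left alone.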
If we instead use bytes types, we can get the correct output:
$ python3 -c "import sys; sys.stdout.buffer.write(b'A' * 40 + b'\x88\x77\x66\x55\x44\x33\x22\x11')"|hd -v
00000000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
00000010 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
00000020 41 41 41 41 41 41 41 41 88 77 66 55 44 33 22 11 |AAAAAAAA.wfUD3".|
00000030
Using this input, we get the "That's won" message:
$ python3 -c "import sys; sys.stdout.buffer.write(b'A' * 40 + b'\x88\x77\x66\x55\x44\x33\x22\x11')"|./a.out
Give it a tryThat's won
Let's continue

This binary is vulnerable to a buffer overflow because it uses gets(), which is inherently unsafe; it was deprecated for that reason and removed from the C standard in C11.
It copies user input into the buffer it is given without checking the buffer's size. So if the input is larger than the available space, it overflows and can overwrite other variables or structures located after the buffer.
That is the case for the long local_78 variable, which sits on the stack after the buffer, so we can overwrite its value.
To do so, we need to pass an input that is:
a minimum of 32 bytes, to fill the actual buffer (a char is 1 byte)
plus a variable number of additional bytes to fill the gap between the buffer and the long variable (compilers often add padding or place other variables between the two, even though the source does not, so the exact stack layout cannot be predicted from the source alone)
plus 8 bytes, which is the size of a long on x86-64 Linux (it can differ on other platforms). This is the value we overwrite the variable with.
We don't care what goes into the first 32+X bytes, as long as they contain no newline, since gets() stops reading there. The program then checks for a special value of local_78, and if that check passes, it executes puts("That's won");, telling you that you have "won", that is, successfully exploited the program and overwritten the memory.
The problem here is that the value is 0x1122334455667788 (again, a long, which is 8 bytes). Split into bytes it is 0x11 0x22 0x33 0x44 0x55 0x66 0x77 0x88; on little-endian x86-64 these are stored in memory in reverse order, 0x88 first. We could then try to see which ASCII character each byte corresponds to.
The issue is that bytes like 0x11 and 0x88 are not printable ASCII characters, so you cannot type them directly into the console: ordinary keyboards have no key for them. You will need an additional program to feed these bytes to the target, using whatever mechanism the operating system provides. On Linux, for example, this can be done with pipes or redirection.
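One way to do the piping from a script is sketched below. Since the vulnerable binary is not available here, the child process is a hypothetical stand-in that simply echoes the hex of whatever arrives on its stdin; in practice you would replace the child command with the path to the real program. The 40-byte filler is the distance measured in the disassembly in the other answer, and the trailing newline is what makes gets() return.

```python
import struct
import subprocess
import sys

# 40 filler bytes, then the target value as a little-endian 64-bit integer,
# then a newline to terminate the line read by gets().
payload = b"A" * 40 + struct.pack("<Q", 0x1122334455667788) + b"\n"

# Hypothetical stand-in for the vulnerable binary: it reports the raw bytes
# it received on stdin so that the sketch can run anywhere.
child = [sys.executable, "-c",
         "import sys; print(sys.stdin.buffer.read().hex())"]

# input= writes the bytes to the child's stdin through a pipe, unmodified.
result = subprocess.run(child, input=payload, capture_output=True, check=True)
received = bytes.fromhex(result.stdout.decode().strip())

print(received[40:48].hex())
```

Because the payload is a bytes object from start to finish, no text encoding is applied on the way to the child.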

gcc x86 Windows stack alignment

I'm in the process of writing a compiler purely as a learning experience. I'm currently learning about stack frames by compiling simple c++ code and then studying the output asm produced by gcc 4.9.2 for Windows x86.
my simple c++ code is
#include <iostream>
using namespace std;
int globalVar;
void testStackStuff(void);
void testPassingOneInt32(int v);
void forceStackFrameCreation(int v);
int main()
{
globalVar = 0;
testStackStuff();
std::cout << globalVar << std::endl;
}
void testStackStuff(void)
{
testPassingOneInt32(666);
}
void testPassingOneInt32(int v)
{
globalVar = globalVar + v;
forceStackFrameCreation(v);
}
void forceStackFrameCreation(int v)
{
globalVar = globalVar + v;
}
Ok, when this is compiled with -mpreferred-stack-boundary=4 I was expecting to see a stack aligned to 16 bytes (technically it is aligned to 16 bytes but with an extra 16 bytes of unused stack space). The prologue for main as produced by gcc is:
22 .loc 1 12 0
23 .cfi_startproc
24 0000 8D4C2404 lea ecx, [esp+4]
25 .cfi_def_cfa 1, 0
26 0004 83E4F0 and esp, -16
27 0007 FF71FC push DWORD PTR [ecx-4]
28 000a 55 push ebp
29 .cfi_escape 0x10,0x5,0x2,0x75,0
30 000b 89E5 mov ebp, esp
31 000d 51 push ecx
32 .cfi_escape 0xf,0x3,0x75,0x7c,0x6
33 000e 83EC14 sub esp, 20
34 .loc 1 12 0
35 0011 E8000000 call ___main
35 00
36 .loc 1 13 0
37 0016 C7050000 mov DWORD PTR _globalVar, 0
38 .loc 1 15 0
39 0020 E8330000 call __Z14testStackStuffv
line 26 rounds esp down to the nearest 16 byte boundary.
lines 27, 28 and 31 push a total of 12 bytes onto the stack, then
line 33 subtracts another 20 bytes from esp, giving a total of 32 bytes!
Why?
line 39 then calls testStackStuff.
NOTE - this call pushes the return address (4 bytes).
Now, let's look at the prologue for testStackStuff, keeping in mind that the stack is now 4 bytes closer to the next 16-byte boundary.
67 0058 55 push ebp
68 .cfi_def_cfa_offset 8
69 .cfi_offset 5, -8
70 0059 89E5 mov ebp, esp
71 .cfi_def_cfa_register 5
72 005b 83EC18 sub esp, 24
73 .loc 1 22 0
74 005e C704249A mov DWORD PTR [esp], 666
line 67 pushes another 4 bytes (now 8 bytes towards the boundary).
line 72 subtracts another 24 bytes (total 32 bytes).
At this point the stack is correctly aligned on a 16-byte boundary. But why is the frame twice the size of the boundary?
If I change the compiler flags to -mpreferred-stack-boundary=5 I would expect a stack aligned to 32 bytes, but again gcc seems to produce stack frames aligned to 64 bytes, twice the amount I was expecting.
Prologue for main
23 .cfi_startproc
24 0000 8D4C2404 lea ecx, [esp+4]
25 .cfi_def_cfa 1, 0
26 0004 83E4E0 and esp, -32
27 0007 FF71FC push DWORD PTR [ecx-4]
28 000a 55 push ebp
29 .cfi_escape 0x10,0x5,0x2,0x75,0
30 000b 89E5 mov ebp, esp
31 000d 51 push ecx
32 .cfi_escape 0xf,0x3,0x75,0x7c,0x6
33 000e 83EC34 sub esp, 52
34 .loc 1 12 0
35 0011 E8000000 call ___main
35 00
36 .loc 1 13 0
37 0016 C7050000 mov DWORD PTR _globalVar, 0
37 00000000
37 0000
38 .loc 1 15 0
39 0020 E8330000 call __Z14testStackStuffv
line 26 rounds esp down to the nearest 32 byte boundary
lines 27, 28 and 31 push a total of 12 bytes onto the stack, then
line 33 subtracts another 52 bytes from esp, giving a total of 64 bytes!
and the prologue for testStackStuff is
66 .cfi_startproc
67 0058 55 push ebp
68 .cfi_def_cfa_offset 8
69 .cfi_offset 5, -8
70 0059 89E5 mov ebp, esp
71 .cfi_def_cfa_register 5
72 005b 83EC38 sub esp, 56
73 .loc 1 22 0
(4 bytes on stack from) call __Z14testStackStuffv
(4 bytes on stack from) push ebp
(56 bytes on stack from) sub esp,56
total 64 bytes.
Does anybody know why gcc is creating this extra stack space or have I overlooked something obvious?
Thanks for any help you can offer.

In order to resolve this mystery, you will need to look at the documentation of gcc to find out exactly which flavor of Application Binary Interface (ABI) it uses, and then go find the specification of that ABI and read it. If you are "in the process of writing a compiler purely as a learning experience" you will definitely need it.
In short, and in broad terms, the ABI mandates that this extra space be reserved by the current function for passing parameters to the functions it calls. How much space is reserved depends primarily on how many parameters the function intends to pass, but it is a bit more nuanced than that, and the ABI is the document that explains it in detail.
In the old style of stack frames, we would PUSH parameters onto the stack and then invoke a function.
In the new style of stack frames, EBP is not used anymore (I am not sure why it is still saved and copied from ESP); parameters are placed on the stack at a specific offset from ESP, and then the function is invoked. This is evidenced by the fact that mov DWORD PTR [esp], 666 is used to pass the argument 666 in the call testPassingOneInt32(666);.

For why it's doing the push DWORD PTR [ecx-4] to copy the return address, see this partial duplicate. IIRC, it's constructing a complete copy of the return-address / saved-ebp pair.
but again gcc seems to produce stack frames aligned to 64 bytes
No, it used and esp, -32. The stack frame size looks like 64 bytes, but its alignment is only 32B.
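The difference between frame size and alignment can be seen by following esp through the two prologues from the question. The sketch below is only a model: the starting esp value is hypothetical, and the push and sub amounts are copied from the listings for -mpreferred-stack-boundary=4 and =5.

```python
def align_down(value: int, boundary: int) -> int:
    """Round value down to a multiple of boundary (a power of two)."""
    return value & -boundary


def trace_main(esp: int, boundary: int, sub_bytes: int) -> int:
    """Follow esp through main's prologue as shown in the listing."""
    esp = align_down(esp, boundary)   # and esp, -boundary
    esp -= 3 * 4                      # push [ecx-4]; push ebp; push ecx
    esp -= sub_bytes                  # sub esp, N
    return esp


def trace_callee(esp: int, sub_bytes: int) -> int:
    """Follow esp through the call and testStackStuff's prologue."""
    esp -= 4                          # call pushes the return address
    esp -= 4                          # push ebp
    esp -= sub_bytes                  # sub esp, N
    return esp


start = 0x0061FF2C  # hypothetical esp on entry to main
for boundary, main_sub, callee_sub in [(16, 20, 24), (32, 52, 56)]:
    in_main = trace_main(start, boundary, main_sub)
    in_callee = trace_callee(in_main, callee_sub)
    # remainder in main, remainder in callee, callee frame size, remainder mod 2*boundary
    print(boundary, in_main % boundary, in_callee % boundary,
          in_main - in_callee, in_main % (2 * boundary))
```

With this starting value, esp is a multiple of the requested boundary at both call sites, but in the 32-byte case it is not a multiple of 64, even though each frame consumes 64 bytes.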
I'm not sure why it leaves so much extra space in the stack frame. It's not very interesting to guess why gcc -O0 does what it does, because it's not even trying to be optimal.
You obviously compiled without optimization, which makes the whole thing less interesting. This tells you more about gcc internals and what was convenient for gcc, not that the code it emitted was necessary or does anything useful. Also, use http://gcc.godbolt.org/ to get nice asm output without the CFI directives and other noise. (Please tidy up the asm code blocks in your question with output from that. All the noise makes them harder to read.)

What is stored in this 26KB executable?

Compiling this code with -O3:
#include <iostream>
int main(){std::cout<<"Hello World"<<std::endl;}
results in a file with a length of 25,890 bytes. (Compiled with GCC 4.8.1)
Can't the compiler just store two calls to write(STDOUT_FILENO, ???, strlen(???));, store write's contents, store the string, and boom, write it to the disk? By my estimate, that should result in an EXE under 1,024 bytes.
Compiling a hello world program in assembly results in a 17-byte file: https://stackoverflow.com/questions/284797/hello-world-in-less-than-17-bytes, which means the actual code is 5 bytes long. (The string is Hello World\0.)
What does that EXE store besides the actual main and the functions it calls?
NOTE: This question applies to MSVC too.
Edit:
A lot of users pointed at iostream as being the culprit, so I tested this hypothesis and compiled this program with the same parameters:
int main( ) {
}
And got 23,815 bytes, so the hypothesis has been disproved.

By default, the compiler generates a complete PE-conformant executable. Assuming a release build, the simple code you posted probably includes:
all the PE headers and tables needed by the loader (e.g. IAT), which also means alignment requirements have to be met
CRT library initialization code
debugging info (you need to strip it manually, even for a release build)
If the compiler is MSVC, there are additional inclusions:
manifest XML and relocation data
results of default compiler options that favor speed over size
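To give a feel for how the alignment requirements alone inflate a tiny program, here is a rough Python sketch. The section names and raw sizes are hypothetical; 512 is the default FileAlignment given in the PE specification, and 4096 is assumed as a typical SectionAlignment (the x86 page size). Headers, the CRT and debug info come on top of this.

```python
def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & -alignment


FILE_ALIGNMENT = 512      # default FileAlignment per the PE specification
SECTION_ALIGNMENT = 4096  # assumed SectionAlignment (page size on x86)

# Hypothetical raw content sizes, in bytes, for a tiny program.
sections = {".text": 5, ".rdata": 12, ".idata": 40}

raw = sum(sections.values())
on_disk = sum(align_up(n, FILE_ALIGNMENT) for n in sections.values())
in_memory = sum(align_up(n, SECTION_ALIGNMENT) for n in sections.values())

print(raw, on_disk, in_memory)
```

Even with only 57 bytes of real content, three sections already occupy 1,536 bytes in the file under these assumptions.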
The link you posted does contain a very small assembly "hello world" program, but to run properly in a Windows environment, at least a complete and valid PE structure needs to be available to the loader (setting aside all the low-level issues that might prevent that code from running at all).
Only once the loader had correctly set up the process in which that code is to run could you map it into a PE section and do
jmp small_hello_world_entry_point
to actually execute the code.
References: The PE format
One last note: UPX and similar compression tools can also be used to reduce the file size of executables.

C++ isn't assembly; like C, it comes with a lot of infrastructure. In addition to the overheads of C (required for compatibility with the C ABI), C++ has its own variants of many things, and it also has to include all the set-up and tear-down code required to provide the many guarantees of the language.
Much of this is provided by libraries, but some of it has to be in the executable itself so that a failure to load shared libraries can be handled.
Under Linux/BSD we can reverse engineer an executable with objdump -dsl. I took the following code:
int main() {}
and compiled it with:
g++ -Wall -O3 -g0 test.cpp -o test.exe
The resulting executable?
6922 bytes
Then I compiled with less cruft:
g++ -Wall -O3 -g0 test.cpp -o test.exe -nostdlib
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000400150
Basically: main is a facade entry point for our C++ code, the program really starts at _start.
Executable size?
1454 bytes
Here's how objdump describes the two:
g++ -Wall -O3 -g0 test.cpp -o test.exe
objdump -dsl test.exe
test.exe: file format elf64-x86-64
Contents of section .interp:
400200 2f6c6962 36342f6c 642d6c69 6e75782d /lib64/ld-linux-
400210 7838362d 36342e73 6f2e3200 x86-64.so.2.
Contents of section .note.ABI-tag:
40021c 04000000 10000000 01000000 474e5500 ............GNU.
40022c 00000000 02000000 06000000 12000000 ................
Contents of section .note.gnu.build-id:
40023c 04000000 14000000 03000000 474e5500 ............GNU.
40024c a0f55c7d 671f9eb2 93078fd3 0f52581a ..\}g........RX.
40025c 544829b2 TH).
Contents of section .hash:
400260 03000000 06000000 02000000 05000000 ................
400270 00000000 00000000 00000000 01000000 ................
400280 00000000 03000000 04000000 ............
Contents of section .dynsym:
400290 00000000 00000000 00000000 00000000 ................
4002a0 00000000 00000000 10000000 20000000 ............ ...
4002b0 00000000 00000000 00000000 00000000 ................
4002c0 1f000000 20000000 00000000 00000000 .... ...........
4002d0 00000000 00000000 8b000000 12000000 ................
4002e0 00000000 00000000 00000000 00000000 ................
4002f0 33000000 20000000 00000000 00000000 3... ...........
400300 00000000 00000000 4f000000 20000000 ........O... ...
400310 00000000 00000000 00000000 00000000 ................
Contents of section .dynstr:
400320 006c6962 73746463 2b2b2e73 6f2e3600 .libstdc++.so.6.
400330 5f5f676d 6f6e5f73 74617274 5f5f005f __gmon_start__._
400340 4a765f52 65676973 74657243 6c617373 Jv_RegisterClass
400350 6573005f 49544d5f 64657265 67697374 es._ITM_deregist
400360 6572544d 436c6f6e 65546162 6c65005f erTMCloneTable._
400370 49544d5f 72656769 73746572 544d436c ITM_registerTMCl
400380 6f6e6554 61626c65 006c6962 6d2e736f oneTable.libm.so
400390 2e36006c 69626763 635f732e 736f2e31 .6.libgcc_s.so.1
4003a0 006c6962 632e736f 2e36005f 5f6c6962 .libc.so.6.__lib
4003b0 635f7374 6172745f 6d61696e 00474c49 c_start_main.GLI
4003c0 42435f32 2e322e35 00 BC_2.2.5.
Contents of section .gnu.version:
4003ca 00000000 00000200 00000000 ............
Contents of section .gnu.version_r:
4003d8 01000100 81000000 10000000 00000000 ................
4003e8 751a6909 00000200 9d000000 00000000 u.i.............
Contents of section .rela.dyn:
4003f8 50096000 00000000 06000000 01000000 P.`.............
400408 00000000 00000000 ........
Contents of section .rela.plt:
400410 70096000 00000000 07000000 03000000 p.`.............
400420 00000000 00000000 ........
Contents of section .init:
400428 4883ec08 e85b0000 00e86a01 0000e845 H....[....j....E
400438 02000048 83c408c3 ...H....
Contents of section .plt:
400440 ff351a05 2000ff25 1c052000 0f1f4000 .5.. ..%.. ...@.
400450 ff251a05 20006800 000000e9 e0ffffff .%.. .h.........
Contents of section .text:
400460 31ed4989 d15e4889 e24883e4 f0505449 1.I..^H..H...PTI
400470 c7c0e005 400048c7 c1f00540 0048c7c7 ....@.H....@.H..
400480 d0054000 e8c7ffff fff49090 4883ec08 ..@.........H...
400490 488b05b9 04200048 85c07402 ffd04883 H.... .H..t...H.
4004a0 c408c390 90909090 90909090 90909090 ................
4004b0 90909090 90909090 90909090 90909090 ................
4004c0 b88f0960 00482d88 09600048 83f80e76 ...`.H-..`.H...v
4004d0 17b80000 00004885 c0740dbf 88096000 ......H..t....`.
4004e0 ffe0660f 1f440000 f3c3660f 1f440000 ..f..D....f..D..
4004f0 be880960 004881ee 88096000 48c1fe03 ...`.H....`.H...
400500 4889f048 c1e83f48 01c648d1 fe7411b8 H..H..?H..H..t..
400510 00000000 4885c074 07bf8809 6000ffe0 ....H..t....`...
400520 f3c36666 6666662e 0f1f8400 00000000 ..fffff.........
400530 803d5104 20000075 5f5553bb 80076000 .=Q. ..u_US...`.
400540 4881eb78 07600048 83ec0848 8b053e04 H..x.`.H...H..>.
400550 200048c1 fb034883 eb01488d 6c241048 .H...H...H.l$.H
400560 39d87322 0f1f4000 4883c001 4889051d 9.s"..@.H...H...
400570 042000ff 14c57807 6000488b 050f0420 . ....x.`.H....
400580 004839d8 72e2e835 ffffffc6 05f60320 .H9.r..5.......
400590 00014883 c4085b5d f3c3660f 1f440000 ..H...[]..f..D..
4005a0 bf880760 0048833f 007505e9 40ffffff ...`.H.?.u..@...
4005b0 b8000000 004885c0 74f15548 89e5ffd0 .....H..t.UH....
4005c0 5de92aff ffff9090 90909090 90909090 ].*.............
4005d0 31c0c390 90909090 90909090 90909090 1...............
4005e0 f3c36666 6666662e 0f1f8400 00000000 ..fffff.........
4005f0 48896c24 d84c8964 24e0488d 2d630120 H.l$.L.d$.H.-c.
400600 004c8d25 5c012000 4c896c24 e84c8974 .L.%\. .L.l$.L.t
400610 24f04c89 7c24f848 895c24d0 4883ec38 $.L.|$.H.\$.H..8
400620 4c29e541 89fd4989 f648c1fd 034989d7 L).A..I..H...I..
400630 e8f3fdff ff4885ed 741c31db 0f1f4000 .....H..t.1...@.
400640 4c89fa4c 89f64489 ef41ff14 dc4883c3 L..L..D..A...H..
400650 014839eb 72ea488b 5c240848 8b6c2410 .H9.r.H.\$.H.l$.
400660 4c8b6424 184c8b6c 24204c8b 7424284c L.d$.L.l$ L.t$(L
400670 8b7c2430 4883c438 c3909090 90909090 .|$0H..8........
400680 554889e5 53bb6807 60004883 ec08488b UH..S.h.`.H...H.
400690 05d30020 004883f8 ff74140f 1f440000 ... .H...t...D..
4006a0 4883eb08 ffd0488b 034883f8 ff75f148 H.....H..H...u.H
4006b0 83c4085b 5dc39090 ...[]...
Contents of section .fini:
4006b8 4883ec08 e86ffeff ff4883c4 08c3 H....o...H....
Contents of section .rodata:
4006c8 01000200 ....
Contents of section .eh_frame_hdr:
4006cc 011b033b 20000000 03000000 04ffffff ...; ...........
4006dc 3c000000 14ffffff 54000000 24ffffff <.......T...$...
4006ec 6c000000 l...
Contents of section .eh_frame:
4006f0 14000000 00000000 017a5200 01781001 .........zR..x..
400700 1b0c0708 90010000 14000000 1c000000 ................
400710 c0feffff 03000000 00000000 00000000 ................
400720 14000000 34000000 b8feffff 02000000 ....4...........
400730 00000000 00000000 24000000 4c000000 ........$...L...
400740 b0feffff 89000000 00518c05 86065f0e .........Q...._.
400750 4083078f 028e038d 0402580e 08000000 @.........X.....
400760 00000000 ....
Contents of section .ctors:
600768 ffffffff ffffffff 00000000 00000000 ................
Contents of section .dtors:
600778 ffffffff ffffffff 00000000 00000000 ................
Contents of section .jcr:
600788 00000000 00000000 ........
Contents of section .dynamic:
600790 01000000 00000000 01000000 00000000 ................
6007a0 01000000 00000000 69000000 00000000 ........i.......
6007b0 01000000 00000000 73000000 00000000 ........s.......
6007c0 01000000 00000000 81000000 00000000 ................
6007d0 0c000000 00000000 28044000 00000000 ........(.@.....
6007e0 0d000000 00000000 b8064000 00000000 ..........@.....
6007f0 04000000 00000000 60024000 00000000 ........`.@.....
600800 05000000 00000000 20034000 00000000 ........ .@.....
600810 06000000 00000000 90024000 00000000 ..........@.....
600820 0a000000 00000000 a9000000 00000000 ................
600830 0b000000 00000000 18000000 00000000 ................
600840 15000000 00000000 00000000 00000000 ................
600850 03000000 00000000 58096000 00000000 ........X.`.....
600860 02000000 00000000 18000000 00000000 ................
600870 14000000 00000000 07000000 00000000 ................
600880 17000000 00000000 10044000 00000000 ..........@.....
600890 07000000 00000000 f8034000 00000000 ..........@.....
6008a0 08000000 00000000 18000000 00000000 ................
6008b0 09000000 00000000 18000000 00000000 ................
6008c0 feffff6f 00000000 d8034000 00000000 ...o......@.....
6008d0 ffffff6f 00000000 01000000 00000000 ...o............
6008e0 f0ffff6f 00000000 ca034000 00000000 ...o......@.....
6008f0 00000000 00000000 00000000 00000000 ................
600900 00000000 00000000 00000000 00000000 ................
600910 00000000 00000000 00000000 00000000 ................
600920 00000000 00000000 00000000 00000000 ................
600930 00000000 00000000 00000000 00000000 ................
600940 00000000 00000000 00000000 00000000 ................
Contents of section .got:
600950 00000000 00000000 ........
Contents of section .got.plt:
600958 90076000 00000000 00000000 00000000 ..`.............
600968 00000000 00000000 56044000 00000000 ........V.@.....
Contents of section .data:
600978 00000000 00000000 00000000 00000000 ................
Contents of section .comment:
0000 4743433a 2028474e 55292034 2e342e37 GCC: (GNU) 4.4.7
0010 20323031 32303331 33202852 65642048 20120313 (Red H
0020 61742034 2e342e37 2d313129 00474343 at 4.4.7-11).GCC
0030 3a202847 4e552920 342e392e 782d676f : (GNU) 4.9.x-go
0040 6f676c65 20323031 35303132 33202870 ogle 20150123 (p
0050 72657265 6c656173 652900 rerelease).
Disassembly of section .init:
0000000000400428 <_init>:
_init():
400428: 48 83 ec 08 sub $0x8,%rsp
40042c: e8 5b 00 00 00 callq 40048c <call_gmon_start>
400431: e8 6a 01 00 00 callq 4005a0 <frame_dummy>
400436: e8 45 02 00 00 callq 400680 <__do_global_ctors_aux>
40043b: 48 83 c4 08 add $0x8,%rsp
40043f: c3 retq
Disassembly of section .plt:
0000000000400440 <__libc_start_main@plt-0x10>:
400440: ff 35 1a 05 20 00 pushq 0x20051a(%rip) # 600960 <_GLOBAL_OFFSET_TABLE_+0x8>
400446: ff 25 1c 05 20 00 jmpq *0x20051c(%rip) # 600968 <_GLOBAL_OFFSET_TABLE_+0x10>
40044c: 0f 1f 40 00 nopl 0x0(%rax)
0000000000400450 <__libc_start_main@plt>:
400450: ff 25 1a 05 20 00 jmpq *0x20051a(%rip) # 600970 <_GLOBAL_OFFSET_TABLE_+0x18>
400456: 68 00 00 00 00 pushq $0x0
40045b: e9 e0 ff ff ff jmpq 400440 <_init+0x18>
Disassembly of section .text:
0000000000400460 <_start>:
_start():
400460: 31 ed xor %ebp,%ebp
400462: 49 89 d1 mov %rdx,%r9
400465: 5e pop %rsi
400466: 48 89 e2 mov %rsp,%rdx
400469: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
40046d: 50 push %rax
40046e: 54 push %rsp
40046f: 49 c7 c0 e0 05 40 00 mov $0x4005e0,%r8
400476: 48 c7 c1 f0 05 40 00 mov $0x4005f0,%rcx
40047d: 48 c7 c7 d0 05 40 00 mov $0x4005d0,%rdi
400484: e8 c7 ff ff ff callq 400450 <__libc_start_main@plt>
400489: f4 hlt
40048a: 90 nop
40048b: 90 nop
000000000040048c <call_gmon_start>:
call_gmon_start():
40048c: 48 83 ec 08 sub $0x8,%rsp
400490: 48 8b 05 b9 04 20 00 mov 0x2004b9(%rip),%rax # 600950 <_DYNAMIC+0x1c0>
400497: 48 85 c0 test %rax,%rax
40049a: 74 02 je 40049e <call_gmon_start+0x12>
40049c: ff d0 callq *%rax
40049e: 48 83 c4 08 add $0x8,%rsp
4004a2: c3 retq
4004a3: 90 nop
4004a4: 90 nop
4004a5: 90 nop
4004a6: 90 nop
4004a7: 90 nop
4004a8: 90 nop
4004a9: 90 nop
4004aa: 90 nop
4004ab: 90 nop
4004ac: 90 nop
4004ad: 90 nop
4004ae: 90 nop
4004af: 90 nop
4004b0: 90 nop
4004b1: 90 nop
4004b2: 90 nop
4004b3: 90 nop
4004b4: 90 nop
4004b5: 90 nop
4004b6: 90 nop
4004b7: 90 nop
4004b8: 90 nop
4004b9: 90 nop
4004ba: 90 nop
4004bb: 90 nop
4004bc: 90 nop
4004bd: 90 nop
4004be: 90 nop
4004bf: 90 nop
00000000004004c0 <deregister_tm_clones>:
deregister_tm_clones():
4004c0: b8 8f 09 60 00 mov $0x60098f,%eax
4004c5: 48 2d 88 09 60 00 sub $0x600988,%rax
4004cb: 48 83 f8 0e cmp $0xe,%rax
4004cf: 76 17 jbe 4004e8 <deregister_tm_clones+0x28>
4004d1: b8 00 00 00 00 mov $0x0,%eax
4004d6: 48 85 c0 test %rax,%rax
4004d9: 74 0d je 4004e8 <deregister_tm_clones+0x28>
4004db: bf 88 09 60 00 mov $0x600988,%edi
4004e0: ff e0 jmpq *%rax
4004e2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
4004e8: f3 c3 repz retq
4004ea: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
00000000004004f0 <register_tm_clones>:
register_tm_clones():
4004f0: be 88 09 60 00 mov $0x600988,%esi
4004f5: 48 81 ee 88 09 60 00 sub $0x600988,%rsi
4004fc: 48 c1 fe 03 sar $0x3,%rsi
400500: 48 89 f0 mov %rsi,%rax
400503: 48 c1 e8 3f shr $0x3f,%rax
400507: 48 01 c6 add %rax,%rsi
40050a: 48 d1 fe sar %rsi
40050d: 74 11 je 400520 <register_tm_clones+0x30>
40050f: b8 00 00 00 00 mov $0x0,%eax
400514: 48 85 c0 test %rax,%rax
400517: 74 07 je 400520 <register_tm_clones+0x30>
400519: bf 88 09 60 00 mov $0x600988,%edi
40051e: ff e0 jmpq *%rax
400520: f3 c3 repz retq
400522: 66 66 66 66 66 2e 0f data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
400529: 1f 84 00 00 00 00 00
0000000000400530 <__do_global_dtors_aux>:
__do_global_dtors_aux():
400530: 80 3d 51 04 20 00 00 cmpb $0x0,0x200451(%rip) # 600988 <__bss_start>
400537: 75 5f jne 400598 <__do_global_dtors_aux+0x68>
400539: 55 push %rbp
40053a: 53 push %rbx
40053b: bb 80 07 60 00 mov $0x600780,%ebx
400540: 48 81 eb 78 07 60 00 sub $0x600778,%rbx
400547: 48 83 ec 08 sub $0x8,%rsp
40054b: 48 8b 05 3e 04 20 00 mov 0x20043e(%rip),%rax # 600990 <dtor_idx.6648>
400552: 48 c1 fb 03 sar $0x3,%rbx
400556: 48 83 eb 01 sub $0x1,%rbx
40055a: 48 8d 6c 24 10 lea 0x10(%rsp),%rbp
40055f: 48 39 d8 cmp %rbx,%rax
400562: 73 22 jae 400586 <__do_global_dtors_aux+0x56>
400564: 0f 1f 40 00 nopl 0x0(%rax)
400568: 48 83 c0 01 add $0x1,%rax
40056c: 48 89 05 1d 04 20 00 mov %rax,0x20041d(%rip) # 600990 <dtor_idx.6648>
400573: ff 14 c5 78 07 60 00 callq *0x600778(,%rax,8)
40057a: 48 8b 05 0f 04 20 00 mov 0x20040f(%rip),%rax # 600990 <dtor_idx.6648>
400581: 48 39 d8 cmp %rbx,%rax
400584: 72 e2 jb 400568 <__do_global_dtors_aux+0x38>
400586: e8 35 ff ff ff callq 4004c0 <deregister_tm_clones>
40058b: c6 05 f6 03 20 00 01 movb $0x1,0x2003f6(%rip) # 600988 <__bss_start>
400592: 48 83 c4 08 add $0x8,%rsp
400596: 5b pop %rbx
400597: 5d pop %rbp
400598: f3 c3 repz retq
40059a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
00000000004005a0 <frame_dummy>:
frame_dummy():
4005a0: bf 88 07 60 00 mov $0x600788,%edi
4005a5: 48 83 3f 00 cmpq $0x0,(%rdi)
4005a9: 75 05 jne 4005b0 <frame_dummy+0x10>
4005ab: e9 40 ff ff ff jmpq 4004f0 <register_tm_clones>
4005b0: b8 00 00 00 00 mov $0x0,%eax
4005b5: 48 85 c0 test %rax,%rax
4005b8: 74 f1 je 4005ab <frame_dummy+0xb>
4005ba: 55 push %rbp
4005bb: 48 89 e5 mov %rsp,%rbp
4005be: ff d0 callq *%rax
4005c0: 5d pop %rbp
4005c1: e9 2a ff ff ff jmpq 4004f0 <register_tm_clones>
4005c6: 90 nop
4005c7: 90 nop
4005c8: 90 nop
4005c9: 90 nop
4005ca: 90 nop
4005cb: 90 nop
4005cc: 90 nop
4005cd: 90 nop
4005ce: 90 nop
4005cf: 90 nop
00000000004005d0 <main>:
main():
4005d0: 31 c0 xor %eax,%eax
4005d2: c3 retq
4005d3: 90 nop
4005d4: 90 nop
4005d5: 90 nop
4005d6: 90 nop
4005d7: 90 nop
4005d8: 90 nop
4005d9: 90 nop
4005da: 90 nop
4005db: 90 nop
4005dc: 90 nop
4005dd: 90 nop
4005de: 90 nop
4005df: 90 nop
00000000004005e0 <__libc_csu_fini>:
__libc_csu_fini():
4005e0: f3 c3 repz retq
4005e2: 66 66 66 66 66 2e 0f data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
4005e9: 1f 84 00 00 00 00 00
00000000004005f0 <__libc_csu_init>:
__libc_csu_init():
4005f0: 48 89 6c 24 d8 mov %rbp,-0x28(%rsp)
4005f5: 4c 89 64 24 e0 mov %r12,-0x20(%rsp)
4005fa: 48 8d 2d 63 01 20 00 lea 0x200163(%rip),%rbp # 600764 <__init_array_end>
400601: 4c 8d 25 5c 01 20 00 lea 0x20015c(%rip),%r12 # 600764 <__init_array_end>
400608: 4c 89 6c 24 e8 mov %r13,-0x18(%rsp)
40060d: 4c 89 74 24 f0 mov %r14,-0x10(%rsp)
400612: 4c 89 7c 24 f8 mov %r15,-0x8(%rsp)
400617: 48 89 5c 24 d0 mov %rbx,-0x30(%rsp)
40061c: 48 83 ec 38 sub $0x38,%rsp
400620: 4c 29 e5 sub %r12,%rbp
400623: 41 89 fd mov %edi,%r13d
400626: 49 89 f6 mov %rsi,%r14
400629: 48 c1 fd 03 sar $0x3,%rbp
40062d: 49 89 d7 mov %rdx,%r15
400630: e8 f3 fd ff ff callq 400428 <_init>
400635: 48 85 ed test %rbp,%rbp
400638: 74 1c je 400656 <__libc_csu_init+0x66>
40063a: 31 db xor %ebx,%ebx
40063c: 0f 1f 40 00 nopl 0x0(%rax)
400640: 4c 89 fa mov %r15,%rdx
400643: 4c 89 f6 mov %r14,%rsi
400646: 44 89 ef mov %r13d,%edi
400649: 41 ff 14 dc callq *(%r12,%rbx,8)
40064d: 48 83 c3 01 add $0x1,%rbx
400651: 48 39 eb cmp %rbp,%rbx
400654: 72 ea jb 400640 <__libc_csu_init+0x50>
400656: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
40065b: 48 8b 6c 24 10 mov 0x10(%rsp),%rbp
400660: 4c 8b 64 24 18 mov 0x18(%rsp),%r12
400665: 4c 8b 6c 24 20 mov 0x20(%rsp),%r13
40066a: 4c 8b 74 24 28 mov 0x28(%rsp),%r14
40066f: 4c 8b 7c 24 30 mov 0x30(%rsp),%r15
400674: 48 83 c4 38 add $0x38,%rsp
400678: c3 retq
400679: 90 nop
40067a: 90 nop
40067b: 90 nop
40067c: 90 nop
40067d: 90 nop
40067e: 90 nop
40067f: 90 nop
0000000000400680 <__do_global_ctors_aux>:
__do_global_ctors_aux():
400680: 55 push %rbp
400681: 48 89 e5 mov %rsp,%rbp
400684: 53 push %rbx
400685: bb 68 07 60 00 mov $0x600768,%ebx
40068a: 48 83 ec 08 sub $0x8,%rsp
40068e: 48 8b 05 d3 00 20 00 mov 0x2000d3(%rip),%rax # 600768 <__CTOR_LIST__>
400695: 48 83 f8 ff cmp $0xffffffffffffffff,%rax
400699: 74 14 je 4006af <__do_global_ctors_aux+0x2f>
40069b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
4006a0: 48 83 eb 08 sub $0x8,%rbx
4006a4: ff d0 callq *%rax
4006a6: 48 8b 03 mov (%rbx),%rax
4006a9: 48 83 f8 ff cmp $0xffffffffffffffff,%rax
4006ad: 75 f1 jne 4006a0 <__do_global_ctors_aux+0x20>
4006af: 48 83 c4 08 add $0x8,%rsp
4006b3: 5b pop %rbx
4006b4: 5d pop %rbp
4006b5: c3 retq
4006b6: 90 nop
4006b7: 90 nop
Disassembly of section .fini:
00000000004006b8 <_fini>:
_fini():
4006b8: 48 83 ec 08 sub $0x8,%rsp
4006bc: e8 6f fe ff ff callq 400530 <__do_global_dtors_aux>
4006c1: 48 83 c4 08 add $0x8,%rsp
4006c5: c3 retq
and the smaller file:
g++ -Wall -O3 -g0 test.cpp -o test.exe -nostdlib
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000400150
test.exe: file format elf64-x86-64
Contents of section .note.gnu.build-id:
400120 04000000 14000000 03000000 474e5500 ............GNU.
400130 d4b1e35c 21d1f541 b81d3ac9 d62bac7a ...\!..A..:..+.z
400140 606b1ad4 `k..
Contents of section .text:
400150 31c0c3 1..
Contents of section .eh_frame_hdr:
400154 011b033b 10000000 01000000 fcffffff ...;............
400164 2c000000 ,...
Contents of section .eh_frame:
400168 14000000 00000000 017a5200 01781001 .........zR..x..
400178 1b0c0708 90010000 14000000 1c000000 ................
400188 c8ffffff 03000000 00000000 00000000 ................
Contents of section .comment:
0000 4743433a 2028474e 55292034 2e392e78 GCC: (GNU) 4.9.x
0010 2d676f6f 676c6520 32303135 30313233 -google 20150123
0020 20287072 6572656c 65617365 2900 (prerelease).
Disassembly of section .text:
0000000000400150 <main>:
main():
400150: 31 c0 xor %eax,%eax
400152: c3 retq
It's worth noting that this executable doesn't work: it segfaults. To make it work, we'd actually have to implement _start instead of main.
We can see here that the bulk of the larger executable is glue code that deals with loading the dynamic library and preparing the broader environment required by the standard library.
--- EDIT ---
Even our smaller code still has to include exception handling, ctor/dtor support for globals, and so forth. If you dig deeply enough you can probably find ways to elide such things, but in general you don't need to. It is also easier for a toolchain to always include such basic support: better to have a handful of new embedded programmers asking "how can I prevent the compiler emitting basic language support?" than the majority of new programmers stumbling over "how do I force the compiler to emit basic language support?".
Note also that the compiler generates ELF-format binaries. The format itself is a small contribution (maybe ~60 bytes), and emitting the compiler's own identity adds some size too. But the bulk of the smaller binary is language support (EH and CTOR/DTOR).
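To make the "CTOR/DTOR" part concrete, here is a minimal sketch of why a C++ binary needs startup code even when main does nothing: an object with static storage duration has to be constructed before main runs, and something has to make that call. The names Tracer, startup_log, global_tracer and global_was_constructed are invented for this illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical log shared by everything that runs during startup.
// A function-local static avoids depending on initialization order.
std::vector<std::string>& startup_log() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical type whose constructor has a visible side effect.
struct Tracer {
    explicit Tracer(const char* what) { startup_log().push_back(what); }
};

// Static storage duration: on a hosted implementation the startup code
// runs this constructor before main() is entered.
Tracer global_tracer("constructed before main");

// Reports whether the constructor above has already run.
bool global_was_constructed() {
    return startup_log().size() == 1 &&
           startup_log()[0] == "constructed before main";
}
```

Calling global_was_constructed() as the first statement of main should return true, because routines of the kind seen in the listing above (__libc_csu_init, __do_global_ctors_aux) exist to make such calls before main starts.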
Compiling with #include <iostream> and -O3 -g0 produces a 7625-byte binary; compiling the same source with -O0 -g3 produces a 64 KB binary, most of which is text describing symbols from the STL.

Your executable includes the C runtime, which knows how to do things like get the environment, set up the argv vector, and close all open files after exit() is called but before _exit() is called.

There are many things which could affect the final file size during compilation, as other posters have pointed out.
Dissecting your specific example is more work than I'm willing to put in, but I know of a similar example from many years ago that should help you to understand the general problem, and guide you towards finding the specific answer you seek.
http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html
This is done in C (rather than C++) using GCC, looking at the size of the ELF executable (not a Windows EXE), but as I said many of the same problems apply. In this case, the author looks at just return 42;
After you've read that document, consider that printing to stdout is considerably more complex than just returning a number. Also, since you are using C++ and cout <<, there's a lot of code hiding in there that you didn't write, and you can't really know how it's implemented without looking at that source.

People keep forgetting that executables created in high-level languages need a runtime engine to run properly. For example, the C++ engine is responsible for things like:
- heap/stack management: when you call new or delete you are not calling OS functions directly; instead the engine uses its own allocated heap memory, so it has its own memory management, which takes code and space
- local variable memory management: each time you call a function, all its local variables must be allocated, and then released before the function exits
- classes/templates: handling these properly needs quite a lot of code
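As a rough illustration of the point that new and delete go through library code rather than straight to the OS: in standard C++ the global operator new and operator delete are ordinary, replaceable functions. The sketch below replaces them with versions that count calls and forward to std::malloc and std::free. The counter, the sink pointer and the helper function are invented for the example.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Illustrative counter (not part of any library).
static std::size_t g_new_calls = 0;

// Keeps the pointer observable so the compiler cannot optimize the
// allocation away.
static int* volatile g_sink = nullptr;

// Replacement for the global allocation function used by `new`.
void* operator new(std::size_t size) {
    ++g_new_calls;
    if (void* p = std::malloc(size != 0 ? size : 1)) {
        return p;
    }
    throw std::bad_alloc();
}

// Matching replacements for the global deallocation functions.
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Returns how many times operator new ran for a single `new int`.
std::size_t new_calls_for_one_int() {
    const std::size_t before = g_new_calls;
    int* p = new int(42);
    g_sink = p;  // make the allocation escape
    const std::size_t after = g_new_calls;
    delete p;
    return after - before;
}
```

The fact that this replacement links at all shows that the allocator is just more code living in the executable or in its runtime library.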
In addition to this you have to link all the stuff you use, like:
- RTL: most executables nowadays (MSVCPP and MSVB) do not link the runtime libraries in, so a huge number of RTLs must be installed in the system for the exe to even run. The linkage to the DLLs used must still be present in the executable (see DLL linking at runtime)
- debug info
- framework linkage (as with the RTL, you need the code that binds to the framework libs too)
- for high-level windows/forms IDEs, the window engine is also present
- included libs and linked objs (iostream classes and operators: even if you use just <<, you need much more of them to make it work...)
You can look at the C++ engine as a small operating system within the operating system; in standalone MCU apps it really is the OS itself.
More space is occupied by the executable format (like PE), and code alignment adds some space too.
When you put all these together, 26 KB is not so insane anymore.

Compilers are not omnipotent.
std::cout is a stream object, with a set of data members for managing a buffer (allocating it, copying data to it and, when the stream is destroyed, releasing it).
The operator<< is implemented as an overloaded function which interprets its arguments and - when supplied a string - copies data to the buffer, with some logic that potentially flushes the buffer when it is full.
std::endl is actually a function which - in cooperation with all versions of a stream's operator<<() - affects data owned by the stream. Specifically, it inserts a newline into the stream's buffer, and then flushes the buffer.
Flushing the stream's buffer calls other functions that copy data from the buffer to the standard output device (say, the screen).
All of the above is what the statement std::cout<<"Hello World"<<std::endl does.
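The difference between std::endl and a plain '\n' can be observed with a small sketch. It uses a string buffer that counts flush requests in place of the real standard output; CountingBuf, WriteResult and the two helper functions are names invented for this example.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// A string buffer that counts how often the stream asks it to flush.
class CountingBuf : public std::stringbuf {
public:
    int flushes = 0;

protected:
    // flush() on the stream ends up here via pubsync().
    int sync() override {
        ++flushes;
        return 0;  // report success
    }
};

struct WriteResult {
    std::string text;  // what ended up in the buffer
    int flushes;       // how many flush requests the buffer saw
};

// Writes the text followed by std::endl (newline plus flush).
WriteResult write_with_endl() {
    CountingBuf buf;
    std::ostream out(&buf);
    out << "Hello World" << std::endl;
    return {buf.str(), buf.flushes};
}

// Writes the text followed by a plain newline character (no flush).
WriteResult write_with_newline() {
    CountingBuf buf;
    std::ostream out(&buf);
    out << "Hello World" << '\n';
    return {buf.str(), buf.flushes};
}
```

Both variants put the same characters into the buffer; only the std::endl version triggers a flush request, which is the extra work described above.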
In addition, as a C++ program, there is a certain amount of code that must be executed before main() is even called. This includes checking if the program was run with command line arguments, creating streams like std::cout, std::cerr, std::cin (there are others) ensuring those streams are connected with relevant devices (like the terminal, or pipes, or whatever). When main() returns, it is then necessary to release all the streams created (and flush their buffers), and things like that.
All of the above involves invoking other functionality. Creating a buffer for the stream means that buffer must be allocated and - after main() returns - released.
The specification of C++ streams also involves error checking. The allocation of std::cout's buffer might fail (e.g. if the host system doesn't have much free memory). The standard output device might be redirected to a file, which has limited capacity - so writing data to it might fail. All of those things must be checked for and handled gracefully.
All of this stuff will be in this 26K executable (unless that code is in runtime libraries).
In principle, the compiler can recognise that the program is not using its command line arguments (so not include code to manage command line arguments), is only writing to std::cout (so no need to create all the other streams before main() and release them after main() returns), is only using two overloaded versions of operator<<() and one stream manipulator (so the linker need not include code for all other member functions of the stream). It might also recognise that the statement writes data to the stream and immediately flushes the buffer - and thereby eliminate std::cout's buffer and all code that manages it. If the compiler can read the programmer's mind (few compilers can, in practice) it might work out that none of the buffers are actually needed, that the user will never run the program with standard output redirected, etc - and eliminate the code and data structures associated with all those things.
So, how would a compiler recognise that all those things aren't needed? Compilers are software, so they have to conduct some level of analysis on their inputs (e.g. source files). The analysis to eliminate all the code that a human might deem unnecessary is significant - so would take time. If the compiler doesn't do the analysis, potentially the linker might. Whether that analysis to eliminate unnecessary code is done by the compiler or linker is irrelevant - it takes time. Potentially significant time.
Programmers tend to be impatient. Very few programmers would tolerate a build process for a simple "hello world" program that took more than a few seconds (maybe they will tolerate a minute, but not much more).
That leaves compiler vendors with a decision. They can get their programmers to design and implement all sorts of analysis to eliminate unwanted code. That will add weeks - or, if they are working to a tight deadline, months - to implement, validate, verify, and ship a working compiler to customers (other developers). That compiler will be painfully slow at compiling code. Instead, vendors (and their developers) choose to implement less of that analysis in their compiler, so they can actually ship a working compiler to developers who will use it within a reasonable time. This compiler will produce an executable in a time that is somewhat tolerable for most programmers (say, under a minute for a "hello world" program). So what if the executable is larger? It will work. Hardware (e.g. drives) is relatively inexpensive and developer effort is relatively expensive.

It's a very old question, and it has a clear answer. The main difficulty is that one has to write up many small pieces of information and make many small tests that demonstrate different aspects of PE structures. I'll try to skip the details and describe the main parts of the problem based on Microsoft Visual Studio, which I have known and used for many years. All other compilers do mostly the same, and I suppose that one just needs slightly different compiler and linker options.
First of all, I suggest you set a breakpoint on the first line of main, start debugging, and examine the Call Stack window of the debugger.
So the first thing, which is very important to understand: main is not the first function called in your program. The entry point of the program is mainCRTStartup, which calls __tmainCRTStartup, which calls main.
The CRT startup code does many small things. One of them is easy to understand: it uses the GetCommandLineW Windows API to get the command line, parses the parameters, and then calls main with them.
To reduce the size of the code there are two common approaches:
use CRT from DLL
remove CRT from the EXE if it's not really used in the code.
It's very helpful to start cmd.exe using the "VS2013 x64 Native Tools Command Prompt" (or a similar command prompt). Some additional paths are set inside that command prompt, so you can use, for example, the dumpbin.exe utility.
If you use the Multi-threaded DLL (/MD) compiler option, you will get a 7K exe file. "dumpbin /imports HelloWorld.exe" will show you that your program uses "MSVCR120.dll" together with "KERNEL32.dll".
How to remove the CRT depends on the version of the C/C++ compiler (the version of Visual Studio) that you use, and even on the extension of the file: .c or .cpp. I understand your question as a general one aimed at understanding the problem, so I suggest starting with the simplest case: rename the .cpp file to .c at the beginning and modify the code to the following
#include <Windows.h>
int mainCRTStartup()
{
return 0;
}
One can see now
C:\Oleg\StackOverflow\HelloWorld\Release>dir HelloWorld.exe
Volume in drive C has no label.
Volume Serial Number is 4CF9-FADF
Directory of C:\Oleg\StackOverflow\HelloWorld\Release
21.06.2015 12:56 3.584 HelloWorld.exe
1 File(s) 3.584 bytes
0 Dir(s) 16.171.196.416 bytes free
C:\Oleg\StackOverflow\HelloWorld\Release>dumpbin HelloWorld.exe
Microsoft (R) COFF/PE Dumper Version 12.00.31101.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file HelloWorld.exe
File Type: EXECUTABLE IMAGE
Summary
1000 .data
1000 .rdata
1000 .reloc
1000 .rsrc
1000 .text
One can add the linker option /MERGE:.rdata=.text to reduce the size and to remove one section
C:\Oleg\StackOverflow\HelloWorld\Release>dir HelloWorld.exe
Volume in drive C has no label.
Volume Serial Number is 4CF9-FADF
Directory of C:\Oleg\StackOverflow\HelloWorld\Release
21.06.2015 18:44 3.072 HelloWorld.exe
1 File(s) 3.072 bytes
0 Dir(s) 16.170.852.352 bytes free
C:\Oleg\StackOverflow\HelloWorld\Release>dumpbin HelloWorld.exe
Microsoft (R) COFF/PE Dumper Version 12.00.31101.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file HelloWorld.exe
File Type: EXECUTABLE IMAGE
Summary
1000 .data
1000 .reloc
1000 .rsrc
1000 .text
To get a "Hello World" program, I suggest modifying the code to
#include <Windows.h>
int mainCRTStartup()
{
LPCTSTR pszString = TEXT("Hello world");
DWORD cbWritten;
WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), pszString, lstrlen(pszString), &cbWritten, NULL);
return 0;
}
One can easily verify that the code works and is still small.
To remove the CRT from a .cpp file, I suggest the following steps. First of all, we use the following HelloWorld.cpp code
#include <Windows.h>
int mainCRTStartup()
{
LPCTSTR pszString = TEXT("Hello world");
DWORD cbWritten;
WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), pszString, lstrlen(pszString), &cbWritten, NULL);
return 0;
}
It's important to verify some compiler and linker options and to set or remove some of them. I included the settings in the pictures below:
The last screen shows that we remove the binding to default libraries which we don't need. The compiler uses directives like #pragma comment(lib, "some.lib") to inject the usage of some libraries. By using the option /NODEFAULTLIB we remove such libs, and the exe will be built the way we need.
One will see that the resulting HelloWorld.exe is only 3K (3.072 bytes) and depends on KERNEL32.dll only:
C:\Oleg\StackOverflow\HelloWorld\Release>dumpbin /imports HelloWorld.exe
Microsoft (R) COFF/PE Dumper Version 12.00.31101.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file HelloWorld.exe
File Type: EXECUTABLE IMAGE
Section contains the following imports:
KERNEL32.dll
402000 Import Address Table
402038 Import Name Table
0 time date stamp
0 Index of first forwarder reference
60B lstrlenW
5E0 WriteConsoleW
2C0 GetStdHandle
Summary
1000 .idata
1000 .reloc
1000 .rsrc
1000 .text
One can download the corresponding Visual Studio 2013 demo project from here. One needs to switch from the default "Debug" configuration to "Release" and rebuild the solution. The result is a working HelloWorld.exe that is 3K in size.

This does show how hard it can be to write a program with identical semantics.
<<std::endl will flush a stream if that stream is good(). That means the whole error handling code of ostream must be present.
Also, std::cout could have its streambuf swapped out from under it. The compiler cannot know it's actually going to STDOUT_FILENO. It has to use the whole streambuf intermediate layer.
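The streambuf swap mentioned above is easy to demonstrate. In this sketch (capture_cout is a name made up for the example), std::cout is temporarily pointed at an in-memory buffer, which is exactly the kind of runtime change that stops the compiler from assuming the output goes to STDOUT_FILENO.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Writes `text` through std::cout while std::cout is redirected into an
// in-memory buffer, then restores the original buffer.
std::string capture_cout(const std::string& text) {
    std::ostringstream captured;

    // rdbuf(new_buf) installs new_buf and returns the previous buffer.
    std::streambuf* original = std::cout.rdbuf(captured.rdbuf());

    std::cout << text << std::endl;  // goes into `captured`, not the terminal

    std::cout.rdbuf(original);       // put the real buffer back
    return captured.str();
}
```

Because any code can do this at any time, the statement std::cout << "Hello World" << std::endl has to go through the full stream and streambuf machinery rather than a direct write.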

Adding UNUSED elements to C/C++ structure speeds up and slows down code execution

I wrote the following structure for use in an Arduino software PWM library I'm making, to PWM up to 20 pins at once (on an Uno) or 70 pins at once (on a Mega).
As written, the ISR portion of the code (eRCaGuy_SoftwarePWMupdate()), processing an array of this structure, takes 133us to run. VERY strangely, however, if I uncomment the line "byte flags1;" (in the struct), though flags1 is NOT used anywhere yet, the ISR now takes 158us to run. Then, if I uncomment "byte flags2;" so that BOTH flags are now uncommented, the runtime drops back down to where it was before (133us).
Why is this happening!? And how do I fix it? (ie: I want to ensure consistently fast code, for this particular function, not code that is inexplicably fickle). Adding one byte dramatically slows down the code, yet adding two makes no change at all.
I am trying to optimize the code (and I needed to add another feature too, requiring a single byte for flags), but I don't understand why adding one unused byte slows the code down by 25us, yet adding two unused bytes doesn't change the run-time at all.
I need to understand this to ensure my optimizations consistently work.
In .h file (my original struct, using C-style typedef'ed struct):
typedef struct softPWMpin //global struct
{
//VOLATILE VARIABLES (WILL BE ACCESSED IN AND OUTSIDE OF ISRs)
//for pin write access:
volatile byte pinBitMask;
volatile byte* volatile p_PORT_out; //pointer to port output register; NB: the 1st "volatile" says the port itself (1 byte) is volatile, the 2nd "volatile" says the *pointer* itself (2 bytes, pointing to the port) is volatile.
//for PWM output:
volatile unsigned long resolution;
volatile unsigned long PWMvalue; //Note: duty cycle = PWMvalue/(resolution - 1) = PWMvalue/topValue;
//ex: if resolution is 256, topValue is 255
//if PWMvalue = 255, duty_cycle = PWMvalue/topValue = 255/255 = 1 = 100%
//if PWMvalue = 50, duty_cycle = PWMvalue/topValue = 50/255 = 0.196 = 19.6%
//byte flags1;
//byte flags2;
//NON-VOLATILE VARIABLES (WILL ONLY BE ACCESSED INSIDE AN ISR, OR OUTSIDE AN ISR, BUT NOT BOTH)
unsigned long counter; //incremented each time update() is called; goes back to zero after reaching topValue; does NOT need to be volatile, since only the update function updates this (it is read from or written to nowhere else)
} softPWMpin_t;
In .h file (new, using C++ style struct....to see if it makes any difference, per the comments. It appears to make no difference in any way, including run-time and compiled size)
struct softPWMpin //global struct
{
//VOLATILE VARIABLES (WILL BE ACCESSED IN AND OUTSIDE OF ISRs)
//for pin write access:
volatile byte pinBitMask;
volatile byte* volatile p_PORT_out; //pointer to port output register; NB: the 1st "volatile" says the port itself (1 byte) is volatile, the 2nd "volatile" says the *pointer* itself (2 bytes, pointing to the port) is volatile.
//for PWM output:
volatile unsigned long resolution;
volatile unsigned long PWMvalue; //Note: duty cycle = PWMvalue/(resolution - 1) = PWMvalue/topValue;
//ex: if resolution is 256, topValue is 255
//if PWMvalue = 255, duty_cycle = PWMvalue/topValue = 255/255 = 1 = 100%
//if PWMvalue = 50, duty_cycle = PWMvalue/topValue = 50/255 = 0.196 = 19.6%
//byte flags1;
//byte flags2;
//NON-VOLATILE VARIABLES (WILL ONLY BE ACCESSED INSIDE AN ISR, OR OUTSIDE AN ISR, BUT NOT BOTH)
unsigned long counter; //incremented each time update() is called; goes back to zero after reaching topValue; does NOT need to be volatile, since only the update function updates this (it is read from or written to nowhere else)
};
In .cpp file (here I am creating the array of structs, and here is the update function which is called at a fixed rate in an ISR, via timer interrupts):
//static softPWMpin_t PWMpins[MAX_NUMBER_SOFTWARE_PWM_PINS]; //C-style, old, MAX_NUMBER_SOFTWARE_PWM_PINS = 20; static to give it file scope only
static softPWMpin PWMpins[MAX_NUMBER_SOFTWARE_PWM_PINS]; //C++-style, old, MAX_NUMBER_SOFTWARE_PWM_PINS = 20; static to give it file scope only
//This function must be placed within an ISR, to be called at a fixed interval
void eRCaGuy_SoftwarePWMupdate()
{
//Forced nonatomic block (ie: interrupts *enabled*)
byte SREG_old = SREG; //[1 clock cycle]
interrupts(); //[1 clock cycle] turn interrupts ON to allow *nested interrupts* (ex: handling of time-sensitive timing, such as reading incoming PWM signals or counting Timer2 overflows)
{
//first, increment all counters of attached pins (ie: where the value != PIN_NOT_ATTACHED)
//pinMapArray
for (byte pin=0; pin<NUM_DIGITAL_PINS; pin++)
{
byte i = pinMapArray[pin]; //[2 clock cycles: 0.125us]; No need to turn off interrupts to read this volatile variable here since reading pinMapArray[pin] is an atomic operation (since it's a single byte)
if (i != PIN_NOT_ATTACHED) //if the pin IS attached, increment counter and decide what to do with pin...
{
//Read volatile variables ONE time, all at once, to optimize code (volatile variables take more time to read [I know] since their values can't be recalled from registers [I believe]).
noInterrupts(); //[1 clock cycle] turn off interrupts to read non-atomic volatile variables that could be updated simultaneously right now in another ISR, since nested interrupts are enabled here
unsigned long resolution = PWMpins[i].resolution;
unsigned long PWMvalue = PWMpins[i].PWMvalue;
volatile byte* p_PORT_out = PWMpins[i].p_PORT_out; //[0.44us raw: 5 clock cycles, 0.3125us]
interrupts(); //[1 clock cycle]
//handle edge cases FIRST (PWMvalue==0 and PWMvalue==topValue), since if an edge case exists we should NOT do the main case handling below
if (PWMvalue==0) //the PWM command is 0% duty cycle
{
fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,LOW); //write LOW [1.19us raw: 17 clock cycles, 1.0625us]
}
else if (PWMvalue==resolution-1) //the PWM command is 100% duty cycle
{
fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,HIGH); //write HIGH [0.88us raw; 12 clock cycles, 0.75us]
}
//THEN handle main cases (PWMvalue is > 0 and < topValue)
else //(0% < PWM command < 100%)
{
PWMpins[i].counter++; //not volatile
if (PWMpins[i].counter >= resolution)
{
PWMpins[i].counter = 0; //reset
fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,HIGH);
}
else if (PWMpins[i].counter>=PWMvalue)
{
fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,LOW); //write LOW [1.18us raw: 17 clock cycles, 1.0625us]
}
}
}
}
}
SREG = SREG_old; //restore interrupt enable status
}
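For anyone who wants to check the decision logic of the update function separately from the timing question, here is a host-side sketch that models one pin without any AVR registers or interrupts. PinModel, step and high_ticks are illustrative names, not part of the library, and the model only mirrors the branches shown above.

```cpp
// Host-side model of one pin's state.
struct PinModel {
    unsigned long counter;  // mirrors softPWMpin::counter
    bool high;              // last level written to the pin
};

// Applies the same decision logic as the loop body above to one pin.
PinModel step(PinModel pin, unsigned long resolution, unsigned long pwmValue) {
    if (pwmValue == 0) {
        pin.high = false;                 // 0% duty cycle
    } else if (pwmValue == resolution - 1) {
        pin.high = true;                  // 100% duty cycle
    } else {
        ++pin.counter;
        if (pin.counter >= resolution) {
            pin.counter = 0;              // wrap around
            pin.high = true;
        } else if (pin.counter >= pwmValue) {
            pin.high = false;
        }
    }
    return pin;
}

// Counts how many of `ticks` consecutive updates leave the pin high.
unsigned long high_ticks(PinModel pin, unsigned long resolution,
                         unsigned long pwmValue, unsigned long ticks) {
    unsigned long count = 0;
    for (unsigned long t = 0; t < ticks; ++t) {
        pin = step(pin, resolution, pwmValue);
        if (pin.high) {
            ++count;
        }
    }
    return count;
}
```

Stepping such a model on a PC gives a reference for what each pin should do on every tick, independent of how long the ISR takes on the real hardware.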
Update (5/4/2015, 8:58pm):
I've tried changing the alignment via the aligned attribute. My compiler is gcc.
Here's how I modified the struct in the .h file to add the attribute (it's on the very last line). Note that I also changed the order of the struct members to be largest to smallest:
struct softPWMpin //C++ style
{
volatile unsigned long resolution;
volatile unsigned long PWMvalue; //Note: duty cycle = PWMvalue/(resolution - 1) = PWMvalue/topValue;
//ex: if resolution is 256, topValue is 255
//if PWMvalue = 255, duty_cycle = PWMvalue/topValue = 255/255 = 1 = 100%
//if PWMvalue = 50, duty_cycle = PWMvalue/topValue = 50/255 = 0.196 = 19.6%
unsigned long counter; //incremented each time update() is called; goes back to zero after reaching topValue; does NOT need to be volatile, since only the update function updates this (it is read from or written to nowhere else)
volatile byte* volatile p_PORT_out; //pointer to port output register; NB: the 1st "volatile" says the port itself (1 byte) is volatile, the 2nd "volatile" says the *pointer* itself (2 bytes, pointing to the port) is volatile.
volatile byte pinBitMask;
// byte flags1;
// byte flags2;
} __attribute__ ((aligned));
Source: https://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Type-Attributes.html
Here's the results of what I've tried so far:
__attribute__ ((aligned));
__attribute__ ((aligned(1)));
__attribute__ ((aligned(2)));
__attribute__ ((aligned(4)));
__attribute__ ((aligned(8)));
None of them seem to fix the problem I see when I add one flag byte. With the flag bytes left commented out, the alignments of 2 through 8 make the run-time longer than 133us, and aligned(1) makes no difference (run-time stays 133us), implying that it is what already occurs when the attribute is not added at all. Additionally, even when I use the align options of 2, 4, and 8, sizeof(PWMvalue) still returns the exact number of bytes in the struct, with no additional padding.
...still don't know what's going on...
Update, 11:02pm:
(see comments below)
Optimization levels definitely have an effect. When I changed the compiler optimization level from -Os to -O2, for instance, the base case remained at 133us (as before), uncommenting flags1 gave me 120us (vs 158us), and uncommenting flags1 and flags2 simultaneously gave me 132us (vs 133us). This still doesn't answer my question, but I've at least learned that optimization levels exist, and how to change them.
Summary of above paragraph:
Processing time (of eRCaGuy_SoftwarePWMupdate() function)
Optimization   No flags   w/flags1   w/flags1+flags2
Os             133us      158us      133us
O2             132us      120us      132us
Memory Use (bytes: flash/global vars SRAM/sizeof(softPWMpin)/sizeof(PWMpins))
Optimization   No flags          w/flags1          w/flags1+flags2
Os             4020/591/15/300   3950/611/16/320   4020/631/17/340
O2             4154/591/15/300   4064/611/16/320   4154/631/17/340
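The sizeof values in the table (15, 16, 17) can be reproduced on a PC with packed stand-in structs, which may help when reasoning about how PWMpins[i] is addressed. The Os listing further down loads 15 into r17 and uses mul to scale the index, which is consistent with index * sizeof(softPWMpin) for the 15-byte struct. Whether the different scaling for a 16-byte element (a power of two) is what changes the timing is only a guess here, not something the measurements prove. The type and helper names below are made up for this sketch, and the uint16_t member assumes 2-byte AVR pointers.

```cpp
#include <cstddef>
#include <cstdint>

// Host-side stand-ins for the three struct variants. Packing to 1 byte
// mimics an 8-bit AVR, where members need no alignment padding.
#pragma pack(push, 1)
struct PinNoFlags {
    std::uint8_t  pinBitMask;
    std::uint16_t p_PORT_out;   // stand-in for a 2-byte AVR pointer
    std::uint32_t resolution;   // unsigned long is 4 bytes on AVR
    std::uint32_t PWMvalue;
    std::uint32_t counter;
};
struct PinOneFlag {
    std::uint8_t  pinBitMask;
    std::uint16_t p_PORT_out;
    std::uint32_t resolution;
    std::uint32_t PWMvalue;
    std::uint8_t  flags1;
    std::uint32_t counter;
};
struct PinTwoFlags {
    std::uint8_t  pinBitMask;
    std::uint16_t p_PORT_out;
    std::uint32_t resolution;
    std::uint32_t PWMvalue;
    std::uint8_t  flags1;
    std::uint8_t  flags2;
    std::uint32_t counter;
};
#pragma pack(pop)

static_assert(sizeof(PinNoFlags) == 15, "matches the 15 in the table");
static_assert(sizeof(PinOneFlag) == 16, "matches the 16 in the table");
static_assert(sizeof(PinTwoFlags) == 17, "matches the 17 in the table");

// Byte offset of element `index` in an array of T: index * sizeof(T).
template <typename T>
constexpr std::size_t element_offset(std::size_t index) {
    return index * sizeof(T);
}

// True when n is a power of two, i.e. multiplying by n is a plain shift.
constexpr bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}
```

Only the one-flag variant has a power-of-two size, so it is the only case where the compiler could replace the index multiplication with shifts. Comparing the disassembly of the three builds would show whether that is what actually happens.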
Update (5/5/2015, 4:05pm):
Just updated the tables above with more detailed information.
Added resources below.
Resources:
Sources for gcc compiler optimization levels:
- https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- https://gcc.gnu.org/onlinedocs/gnat_ugn/Optimization-Levels.html
- http://www.rapidtables.com/code/linux/gcc/gcc-o.htm
How to change compiler settings in Arduino IDE:
- http://www.instructables.com/id/Arduino-IDE-16x-compiler-optimisations-faster-code/
Info on structure packing:
- http://www.catb.org/esr/structure-packing/
Data Alignment:
- http://www.songho.ca/misc/alignment/dataalign.html
Writing efficient C code for an 8-bit Atmel AVR Microcontroller
- AVR035 Efficient C Coding for AVR - doc1497 - http://www.atmel.com/images/doc1497.pdf
- AVR4027 Tips and Tricks to Optimize Your C Code for 8-bit AVR Microcontrollers - doc8453 - http://www.atmel.com/images/doc8453.pdf
Additional info that may prove useful to help you help me with my problem:
FOR NO FLAGS (flags1 and flags2 commented out), and Os Optimization
Build Preferences (from buildprefs.txt file where Arduino spits out compiled code):
For me: "C:\Users\Gabriel\AppData\Local\Temp\build8427371380606368699.tmp"
build.arch = AVR
build.board = AVR_UNO
build.core = arduino
build.core.path = C:\Program Files (x86)\Arduino\hardware\arduino\avr\cores\arduino
build.extra_flags =
build.f_cpu = 16000000L
build.mcu = atmega328p
build.path = C:\Users\Gabriel\AppData\Local\Temp\build8427371380606368699.tmp
build.project_name = software_PWM_fade13_speed_test2.cpp
build.system.path = C:\Program Files (x86)\Arduino\hardware\arduino\avr\system
build.usb_flags = -DUSB_VID={build.vid} -DUSB_PID={build.pid} '-DUSB_MANUFACTURER={build.usb_manufacturer}' '-DUSB_PRODUCT={build.usb_product}'
build.usb_manufacturer =
build.variant = standard
build.variant.path = C:\Program Files (x86)\Arduino\hardware\arduino\avr\variants\standard
build.verbose = true
build.warn_data_percentage = 75
compiler.S.extra_flags =
compiler.S.flags = -c -g -x assembler-with-cpp
compiler.ar.cmd = avr-ar
compiler.ar.extra_flags =
compiler.ar.flags = rcs
compiler.c.cmd = avr-gcc
compiler.c.elf.cmd = avr-gcc
compiler.c.elf.extra_flags =
compiler.c.elf.flags = -w -Os -Wl,--gc-sections
compiler.c.extra_flags =
compiler.c.flags = -c -g -Os -w -ffunction-sections -fdata-sections -MMD
compiler.cpp.cmd = avr-g++
compiler.cpp.extra_flags =
compiler.cpp.flags = -c -g -Os -w -fno-exceptions -ffunction-sections -fdata-sections -fno-threadsafe-statics -MMD
compiler.elf2hex.cmd = avr-objcopy
compiler.elf2hex.extra_flags =
compiler.elf2hex.flags = -O ihex -R .eeprom
compiler.ldflags =
compiler.objcopy.cmd = avr-objcopy
compiler.objcopy.eep.extra_flags =
compiler.objcopy.eep.flags = -O ihex -j .eeprom --set-section-flags=.eeprom=alloc,load --no-change-warnings --change-section-lma .eeprom=0
compiler.path = {runtime.ide.path}/hardware/tools/avr/bin/
compiler.size.cmd = avr-size
Some of the Assembly:
(Os, no flags):
00000328 <_Z25eRCaGuy_SoftwarePWMupdatev>:
328: 8f 92 push r8
32a: 9f 92 push r9
32c: af 92 push r10
32e: bf 92 push r11
330: cf 92 push r12
332: df 92 push r13
334: ef 92 push r14
336: ff 92 push r15
338: 0f 93 push r16
33a: 1f 93 push r17
33c: cf 93 push r28
33e: df 93 push r29
340: 0f b7 in r16, 0x3f ; 63
342: 78 94 sei
344: 20 e0 ldi r18, 0x00 ; 0
346: 30 e0 ldi r19, 0x00 ; 0
348: 1f e0 ldi r17, 0x0F ; 15
34a: f9 01 movw r30, r18
34c: e8 5a subi r30, 0xA8 ; 168
34e: fe 4f sbci r31, 0xFE ; 254
350: 80 81 ld r24, Z
352: 8f 3f cpi r24, 0xFF ; 255
354: 09 f4 brne .+2 ; 0x358 <_Z25eRCaGuy_SoftwarePWMupdatev+0x30>
356: 67 c0 rjmp .+206 ; 0x426 <_Z25eRCaGuy_SoftwarePWMupdatev+0xfe>
358: f8 94 cli
35a: 90 e0 ldi r25, 0x00 ; 0
35c: 18 9f mul r17, r24
35e: f0 01 movw r30, r0
360: 19 9f mul r17, r25
362: f0 0d add r31, r0
364: 11 24 eor r1, r1
366: e4 59 subi r30, 0x94 ; 148
368: fe 4f sbci r31, 0xFE ; 254
36a: c0 80 ld r12, Z
36c: d1 80 ldd r13, Z+1 ; 0x01
36e: e2 80 ldd r14, Z+2 ; 0x02
370: f3 80 ldd r15, Z+3 ; 0x03
372: 44 81 ldd r20, Z+4 ; 0x04
374: 55 81 ldd r21, Z+5 ; 0x05
376: 66 81 ldd r22, Z+6 ; 0x06
378: 77 81 ldd r23, Z+7 ; 0x07
37a: 04 84 ldd r0, Z+12 ; 0x0c
37c: f5 85 ldd r31, Z+13 ; 0x0d
37e: e0 2d mov r30, r0
380: 78 94 sei
382: 41 15 cp r20, r1
384: 51 05 cpc r21, r1
386: 61 05 cpc r22, r1
388: 71 05 cpc r23, r1
38a: 51 f4 brne .+20 ; 0x3a0 <_Z25eRCaGuy_SoftwarePWMupdatev+0x78>
38c: 18 9f mul r17, r24
38e: d0 01 movw r26, r0
390: 19 9f mul r17, r25
392: b0 0d add r27, r0
394: 11 24 eor r1, r1
396: a4 59 subi r26, 0x94 ; 148
398: be 4f sbci r27, 0xFE ; 254
39a: 1e 96 adiw r26, 0x0e ; 14
39c: 4c 91 ld r20, X
39e: 3b c0 rjmp .+118 ; 0x416 <_Z25eRCaGuy_SoftwarePWMupdatev+0xee>
3a0: 46 01 movw r8, r12
3a2: 57 01 movw r10, r14
3a4: a1 e0 ldi r26, 0x01 ; 1
3a6: 8a 1a sub r8, r26
3a8: 91 08 sbc r9, r1
3aa: a1 08 sbc r10, r1
3ac: b1 08 sbc r11, r1
3ae: 48 15 cp r20, r8
3b0: 59 05 cpc r21, r9
3b2: 6a 05 cpc r22, r10
3b4: 7b 05 cpc r23, r11
3b6: 51 f4 brne .+20 ; 0x3cc <_Z25eRCaGuy_SoftwarePWMupdatev+0xa4>
3b8: 18 9f mul r17, r24
3ba: d0 01 movw r26, r0
3bc: 19 9f mul r17, r25
3be: b0 0d add r27, r0
3c0: 11 24 eor r1, r1
3c2: a4 59 subi r26, 0x94 ; 148
3c4: be 4f sbci r27, 0xFE ; 254
3c6: 1e 96 adiw r26, 0x0e ; 14
3c8: 9c 91 ld r25, X
3ca: 1c c0 rjmp .+56 ; 0x404 <_Z25eRCaGuy_SoftwarePWMupdatev+0xdc>
3cc: 18 9f mul r17, r24
3ce: e0 01 movw r28, r0
3d0: 19 9f mul r17, r25
3d2: d0 0d add r29, r0
3d4: 11 24 eor r1, r1
3d6: c4 59 subi r28, 0x94 ; 148
3d8: de 4f sbci r29, 0xFE ; 254
3da: 88 85 ldd r24, Y+8 ; 0x08
3dc: 99 85 ldd r25, Y+9 ; 0x09
3de: aa 85 ldd r26, Y+10 ; 0x0a
3e0: bb 85 ldd r27, Y+11 ; 0x0b
3e2: 01 96 adiw r24, 0x01 ; 1
3e4: a1 1d adc r26, r1
3e6: b1 1d adc r27, r1
3e8: 88 87 std Y+8, r24 ; 0x08
3ea: 99 87 std Y+9, r25 ; 0x09
3ec: aa 87 std Y+10, r26 ; 0x0a
3ee: bb 87 std Y+11, r27 ; 0x0b
3f0: 8c 15 cp r24, r12
3f2: 9d 05 cpc r25, r13
3f4: ae 05 cpc r26, r14
3f6: bf 05 cpc r27, r15
3f8: 40 f0 brcs .+16 ; 0x40a <_Z25eRCaGuy_SoftwarePWMupdatev+0xe2>
3fa: 18 86 std Y+8, r1 ; 0x08
3fc: 19 86 std Y+9, r1 ; 0x09
3fe: 1a 86 std Y+10, r1 ; 0x0a
400: 1b 86 std Y+11, r1 ; 0x0b
402: 9e 85 ldd r25, Y+14 ; 0x0e
404: 80 81 ld r24, Z
406: 89 2b or r24, r25
408: 0d c0 rjmp .+26 ; 0x424 <_Z25eRCaGuy_SoftwarePWMupdatev+0xfc>
40a: 84 17 cp r24, r20
40c: 95 07 cpc r25, r21
40e: a6 07 cpc r26, r22
410: b7 07 cpc r27, r23
412: 48 f0 brcs .+18 ; 0x426 <_Z25eRCaGuy_SoftwarePWMupdatev+0xfe>
414: 4e 85 ldd r20, Y+14 ; 0x0e
416: 80 81 ld r24, Z
418: 90 e0 ldi r25, 0x00 ; 0
41a: 50 e0 ldi r21, 0x00 ; 0
41c: 40 95 com r20
41e: 50 95 com r21
420: 84 23 and r24, r20
422: 95 23 and r25, r21
424: 80 83 st Z, r24
426: 2f 5f subi r18, 0xFF ; 255
428: 3f 4f sbci r19, 0xFF ; 255
42a: 24 31 cpi r18, 0x14 ; 20
42c: 31 05 cpc r19, r1
42e: 09 f0 breq .+2 ; 0x432 <_Z25eRCaGuy_SoftwarePWMupdatev+0x10a>
430: 8c cf rjmp .-232 ; 0x34a <_Z25eRCaGuy_SoftwarePWMupdatev+0x22>
432: 0f bf out 0x3f, r16 ; 63
434: df 91 pop r29
436: cf 91 pop r28
438: 1f 91 pop r17
43a: 0f 91 pop r16
43c: ff 90 pop r15
43e: ef 90 pop r14
440: df 90 pop r13
442: cf 90 pop r12
444: bf 90 pop r11
446: af 90 pop r10
448: 9f 90 pop r9
44a: 8f 90 pop r8
44c: 08 95 ret

This is almost certainly an alignment issue. Judging by the size of your struct, your compiler seems to be automatically packing it.
The LDR instruction loads a 4-byte value into a register, and operates on 4-byte boundaries. If it needs to load a memory address that isn't on a 4-byte boundary, it actually performs two loads and combines them to obtain the value at that address.
For example, if you want to load the 4-byte value at 0x02, the processor will do two loads, as 0x02 does not fall on a 4-byte boundary.
Let's say we have the following memory at address 0x00 and we want to load the 4-byte value at 0x02 into the register r0:
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|0x08|
Value | 12 | 34 | 56 | 78 | 90 | AB | CD | EF | 12 |
------------------------------------------------------
r0: 00 00 00 00
It will first load the 4 bytes at 0x00, because that's the 4-byte segment containing 0x02, and store the 2 bytes at 0x02 and 0x03 in the register:
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|
Value | 12 | 34 | 56 | 78 | 90 | AB | CD | EF |
Load 1 | ** ** |
------------------------------------------------------
r0: 56 78 00 00
It will then load the 4 bytes at 0x04, which is the next 4-byte segment, and store the 2 bytes at 0x04 and 0x05 in the register.
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|
Value | 12 | 34 | 56 | 78 | 90 | AB | CD | EF |
Load 2 | ** ** |
------------------------------------------------------
r0: 56 78 90 AB
As you can see, each time you want to access the value at 0x02, the processor actually has to split your instruction into two operations. However, if you instead wanted to access the value at 0x04, the processor can do it in a single operation:
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|
Value | 12 | 34 | 56 | 78 | 90 | AB | CD | EF |
Load 1 | ** ** ** ** |
------------------------------------------------------
r0: 90 AB CD EF
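The two-step load described above can be modelled in plain C. This is only a sketch of the behaviour this answer describes, not a model of any particular CPU; `fetch_aligned_word` and `emulated_load4` are hypothetical names, and the memory image is assumed to be large enough to cover both aligned words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bus model: fetch the aligned 4-byte word that contains addr. */
static void fetch_aligned_word(const uint8_t *mem, size_t addr, uint8_t out[4])
{
    size_t base = addr & ~(size_t)3;   /* round down to a 4-byte boundary */
    memcpy(out, mem + base, 4);
}

/* Emulate a 4-byte load from any address using only aligned fetches.
   Fills reg with the 4 bytes starting at addr (in memory order) and
   returns how many aligned fetches were needed (1 or 2). */
int emulated_load4(const uint8_t *mem, size_t addr, uint8_t reg[4])
{
    uint8_t word[4];
    size_t offset = addr & 3;          /* position inside the aligned word */

    assert(mem != NULL && reg != NULL);

    fetch_aligned_word(mem, addr, word);
    if (offset == 0) {
        memcpy(reg, word, 4);          /* aligned: one fetch is enough */
        return 1;
    }

    size_t first = 4 - offset;         /* bytes supplied by the first word */
    memcpy(reg, word + offset, first);

    fetch_aligned_word(mem, addr + 4, word);
    memcpy(reg + first, word, offset); /* remaining bytes from the next word */
    return 2;
}
```

With the memory contents from the diagrams, a load at 0x02 takes two fetches and yields 56 78 90 AB, while a load at 0x04 takes one fetch and yields 90 AB CD EF.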
In your example, with both flags1 and flags2 commented out, your struct's size is 15. This means that every second struct in your array is going to be at an odd address, so none of its pointer or long members are going to be aligned correctly.
By introducing one of the flags variables, your struct's size increases to 16, which is a multiple of 4. This ensures that all of your structs begin on a 4-byte boundary, so you likely won't run into alignment issues.
There's likely a compiler flag that can help you with this, but in general, it's good to be aware of the layout of your structures. Alignment is a tricky issue to deal with, and only compilers that conform to the current standards have well defined behavior.
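The size arithmetic can be checked with a small C sketch. The struct below is an assumption made for illustration: the field names are invented, a `uint16_t` stands in for a 2-byte pointer, and `__attribute__((packed))` (a GCC/Clang extension) is used so that a desktop compiler adds no padding. It only counts misaligned element offsets; whether misalignment actually costs anything depends on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 15-byte element: three 32-bit values, a 2-byte pointer
   stand-in and one byte. Names and layout are assumptions. */
struct __attribute__((packed)) channel_packed {
    uint32_t a;
    uint32_t b;
    uint32_t c;
    uint16_t port;   /* stand-in for a 16-bit data pointer */
    uint8_t  mask;
};

/* Same element with one extra flags byte: 16 bytes. */
struct __attribute__((packed)) channel_with_flags {
    uint32_t a;
    uint32_t b;
    uint32_t c;
    uint16_t port;
    uint8_t  mask;
    uint8_t  flags1;
};

/* Count array elements whose start offset is not a multiple of align,
   assuming the array itself starts on an aligned address. */
size_t count_misaligned(size_t elem_size, size_t count, size_t align)
{
    size_t misaligned = 0;

    assert(align != 0);
    for (size_t i = 0; i < count; i++) {
        if ((i * elem_size) % align != 0) {
            misaligned++;
        }
    }
    return misaligned;
}
```

For an arbitrary array of 20 elements, `count_misaligned(sizeof(struct channel_packed), 20, 4)` reports 15 elements off a 4-byte boundary, while the 16-byte variant reports none.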

SIGSEGV When accessing array element using assembly

Background:
I am new to assembly. When I was learning programming, I made a program that implements multiplication tables up to 1000 * 1000. The tables are formatted so that each answer is on the line factor1 << 10 | factor2 (I know, I know, it isn't pretty). These tables are then loaded into an array: int* tables. Empty lines are filled with 0. Here is a link to the file for the tables (7.3 MB). I know using assembly won't speed this up by much, but I just wanted to do it for fun (and a bit of practice).
Question:
I'm trying to convert this code into inline assembly (tables is a global):
int answer;
// ...
answer = tables [factor1 << 10 | factor2];
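The index expression packs both factors into one number, with factor1 in the bits above bit 9 and factor2 in the low 10 bits. A minimal sketch in C, where `table_index`, `FACTOR_BITS` and `FACTOR_LIMIT` are hypothetical names introduced only for this illustration:

```c
#include <assert.h>
#include <stddef.h>

enum { FACTOR_BITS = 10, FACTOR_LIMIT = 1 << FACTOR_BITS };

/* Combine two factors (each below 1024) into a single table index:
   factor1 occupies the upper bits, factor2 the low 10 bits. */
size_t table_index(unsigned factor1, unsigned factor2)
{
    assert(factor1 < FACTOR_LIMIT && factor2 < FACTOR_LIMIT);
    return ((size_t)factor1 << FACTOR_BITS) | factor2;
}
```

For instance, `table_index(89, 786)` gives 91922, and a table covering factors up to 1000 needs `table_index(1000, 1000) + 1` = 1025001 entries.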
This is what I came up with:
asm volatile ( "shll $10, %1;"
"orl %1, %2;"
"movl _tables(,%2,4), %0;" : "=r" (answer) : "r" (factor1), "r" (factor2) );
My C++ code works fine, but my assembly fails. What is wrong with my assembly (especially the movl _tables(,%2,4), %0; part), compared to my C++?
What I have done to solve it:
I used some random numbers: 89 and 786 as factor1 and factor2. I know that there is an element at 89 << 10 | 786 (which is 91922); I verified this with C++. When I run it with gdb, I get a SIGSEGV:
Program received signal SIGSEGV, Segmentation fault.
at this line:
"movl _tables(,%2,4), %0;" : "=r" (answer) : "r" (factor1), "r" (factor2) );
I added two methods around my asm, which is how I know where the asm block is in the disassembly.
Disassembly of my asm block:
The disassembly from objdump -M att -d looks fine (although I'm not sure, I'm new to assembly, as I said):
402096: 8b 45 08 mov 0x8(%ebp),%eax
402099: 8b 55 0c mov 0xc(%ebp),%edx
40209c: c1 e0 0a shl $0xa,%eax
40209f: 09 c2 or %eax,%edx
4020a1: 8b 04 95 18 e0 47 00 mov 0x47e018(,%edx,4),%eax
4020a8: 89 45 ec mov %eax,-0x14(%ebp)
The disassembly from objdump -M intel -d:
402096: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
402099: 8b 55 0c mov edx,DWORD PTR [ebp+0xc]
40209c: c1 e0 0a shl eax,0xa
40209f: 09 c2 or edx,eax
4020a1: 8b 04 95 18 e0 47 00 mov eax,DWORD PTR [edx*4+0x47e018]
4020a8: 89 45 ec mov DWORD PTR [ebp-0x14],eax
From what I understand, it's moving the first parameter of my void calc ( int factor1, int factor2 ) function into eax. Then it's moving the second parameter into edx. Then it shifts eax to the left by 10 and ors it with edx. A 32-bit integer is 4 bytes, so [edx*4+base_address]. Move result to eax and then put eax into int answer (which, I'm guessing is on -0x14 of the stack). I don't really see much of a problem.
Disassembly of the compiler's .exe:
When I replace the asm block with plain C++ (answer = tables [factor1 << 10 | factor2];) and disassemble it this is what I get in Intel syntax:
402096: a1 18 e0 47 00 mov eax,ds:0x47e018
40209b: 8b 55 08 mov edx,DWORD PTR [ebp+0x8]
40209e: c1 e2 0a shl edx,0xa
4020a1: 0b 55 0c or edx,DWORD PTR [ebp+0xc]
4020a4: c1 e2 02 shl edx,0x2
4020a7: 01 d0 add eax,edx
4020a9: 8b 00 mov eax,DWORD PTR [eax]
4020ab: 89 45 ec mov DWORD PTR [ebp-0x14],eax
AT&T syntax:
402096: a1 18 e0 47 00 mov 0x47e018,%eax
40209b: 8b 55 08 mov 0x8(%ebp),%edx
40209e: c1 e2 0a shl $0xa,%edx
4020a1: 0b 55 0c or 0xc(%ebp),%edx
4020a4: c1 e2 02 shl $0x2,%edx
4020a7: 01 d0 add %edx,%eax
4020a9: 8b 00 mov (%eax),%eax
4020ab: 89 45 ec mov %eax,-0x14(%ebp)
I am not really familiar with the Intel syntax, so I am just going to try and understand the AT&T syntax:
It first moves the base address of the tables array into %eax. Then, it moves the first parameter into %edx. It shifts %edx to the left by 10, then ors it with the second parameter. Then, by shifting %edx to the left by two, it actually multiplies %edx by 4. Then, it adds that to %eax (the base address of the array). So, basically it just did this: [edx*4+0x47e018] (Intel syntax) or 0x47e018(,%edx,4) (AT&T syntax). It moves the value of the element it got into %eax and puts it into int answer. This method is more "expanded", but it does the same thing as my hand-written assembly! So why is mine giving a SIGSEGV while the compiler's works fine?

I bet (from the disassembly) that tables is a pointer to an array, not the array itself.
So you need:
asm volatile ( "shll $10, %1;"
"movl _tables, %%eax;"
"orl %1, %2;"
"movl (%%eax,%2,4), %0;"
: "=r" (answer) : "r" (factor1), "r" (factor2) : "eax" );
(Don't forget the extra clobber in the last line).
There are of course variations; this one may be more efficient if the code is in a loop:
asm volatile ( "shll $10, %1;"
"orl %1, %2;"
"movl (%3,%2,4), %0;"
: "=r" (answer) : "r" (factor1), "r" (factor2), "r" (tables) );

This is intended to be an addition to Mats Petersson's answer - I wrote it simply because it wasn't immediately clear to me why OP's analysis of the disassembly (that his assembly and the compiler-generated one were equivalent) was incorrect.
As Mats Petersson explains, the problem is that tables is actually a pointer to an array, so to access an element, you have to dereference twice. Now to me, it wasn't immediately clear where this happens in the compiler-generated code. The culprit is this innocent-looking line:
a1 18 e0 47 00 mov 0x47e018,%eax
To the untrained eye (that includes mine), this might look like the value 0x47e018 is moved to eax, but it's actually not. The Intel-syntax representation of the same opcodes gives us a clue:
a1 18 e0 47 00 mov eax,ds:0x47e018
Ah - ds: - so it's not actually a value, but an address!
For anyone who is wondering now, the following would be the opcodes and AT&T syntax assembly for moving the value 0x47e018 to eax:
b8 18 e0 47 00 mov $0x47e018,%eax
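The difference between the two address calculations can also be shown in plain C, without any assembly. This is only a sketch: `address_from_symbol` and `address_from_pointer_value` are hypothetical helper names, and the scale factor is written as `sizeof(int)` (4 on the platform in the question).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What the hand-written asm computes: the address OF the pointer
   variable itself, plus idx * sizeof(int). It never reads the pointer. */
uintptr_t address_from_symbol(int *const *tables_var, size_t idx)
{
    assert(tables_var != NULL);
    return (uintptr_t)tables_var + idx * sizeof(int);
}

/* What the compiler-generated code computes: the value stored IN the
   pointer variable (first load), plus idx * sizeof(int). */
uintptr_t address_from_pointer_value(int *const *tables_var, size_t idx)
{
    assert(tables_var != NULL);
    return (uintptr_t)*tables_var + idx * sizeof(int);
}
```

Given `int *tables = data;`, only `address_from_pointer_value(&tables, i)` equals the address of `data[i]`; the other result points into whatever happens to lie next to the pointer variable, which is consistent with a SIGSEGV for a large index.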
