Can I display disassembly code along with symbols from the .bss section only? - avr-gcc

I'd like avr-objdump to show the disassembled code from an avr-elf binary file and also include symbols but only those from the .bss section, like this:
SYMBOL TABLE:
00800100 l d .bss 00000000 .bss
00800102 u O .bss 00000001 cqueue<drv::uart0>::ptr
00800101 u O .bss 00000001 cqueue<drv::uart1>::ptr
...
00000000 <__vectors>:
0: 78 c0 rjmp .+240 ; 0xf2 <__ctors_end>
4: b2 c0 rjmp .+356 ; 0x16a <__bad_interrupt>
8: b0 c0 rjmp .+352 ; 0x16a <__bad_interrupt>
c: ae c0 rjmp .+348 ; 0x16a <__bad_interrupt>
10: ac c0 rjmp .+344 ; 0x16a <__bad_interrupt>
14: aa c0 rjmp .+340 ; 0x16a <__bad_interrupt>
18: a8 c0 rjmp .+336 ; 0x16a <__bad_interrupt>
1c: a6 c0 rjmp .+332 ; 0x16a <__bad_interrupt>
20: a4 c0 rjmp .+328 ; 0x16a <__bad_interrupt>
24: a2 c0 rjmp .+324 ; 0x16a <__bad_interrupt>
...
Can it be done?
So far all my attempts have leaded to either showing the .bss section but no disassembly at all or show everything, code and all symbols from all sections...

Depending on your shell, you can just call avr-objdump twice to give a single integrated output:
(avr-objdump <symbols-from-bss>; avr-objdump <disassembly>)

Related

What is _IO_stdin_used

Can someone explain me what _IO_stdin_used is in the following line:
114a: 48 8d 3d b3 0e 00 00 lea rdi,[rip+0xeb3] # 2004 <_IO_stdin_used+0x4>
Sorry for the noob question.
Expanding on https://stackoverflow.com/users/224132/peter-cordes 's comment (What is _IO_stdin_used ), some tools can help make sense of this. Starting with burt.c:
#include <stdio.h>
void main (void) {
puts("hello!");
}
built like so: LDFLAGS=-Wl,-Map=burt.map make burt
We will look at the map file in a moment, first inspect the executable:
$ objdump -xsd -M intel burt
...
0000000000002000 g O .rodata 0000000000000004 _IO_stdin_used
...
Contents of section .rodata:
2000 01000200 68656c6c 6f2100 ....hello!.
...
1148: 48 8d 05 b5 0e 00 00 lea rax,[rip+0xeb5] # 2004 <_IO_stdin_used+0x4>
114f: 48 89 c7 mov rdi,rax
1152: e8 d9 fe ff ff call 1030 <puts#plt>
...
Note, the string's location (2004h) does not have an entry in the symbol table (where the 2000h entry _IO_stdin_used occurs). So, objdump's disassembly comments mention it relative to another symbol it knows about.
But why does it know about _IO_stdin_used, or, why does is that symbol included and why does it have a name? burt.map shows it came from Scrt1.o:
...
.rodata 0x0000000000002000 0xb
*(.rodata .rodata.* .gnu.linkonce.r.*)
.rodata.cst4 0x0000000000002000 0x4 /usr/lib/gcc/x86_64-pc-linux-gnu/12.2.1/../../../../lib/Scrt1.o
.rodata 0x0000000000002004 0x7 /tmp/ccYG1XvK.o
...
On my computer, Scrt1.o belongs to glibc, and the symbol can be traced back here: https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/init.c;h=c2f978f3da565590bcab355fefa3d81cf211cb36;hb=63fb8f9aa9d19f85599afe4b849b567aefd70a36.
To show that this is displayed in the objdump disassembly because there is no name for the address, we can give our string a name to add it to the symbol table, so that objdump will show it instead. To do this let's modify burt.c like so:
#include <stdio.h>
const char greeting[] = "hello!";
void main (void) {
puts(greeting);
}
Build, then inspect:
...
0000000000002004 g O .rodata 0000000000000007 greeting
...
0000000000002000 g O .rodata 0000000000000004 _IO_stdin_used
...
Contents of section .rodata:
2000 01000200 68656c6c 6f2100 ....hello!.
...
113d: 48 8d 05 c0 0e 00 00 lea rax,[rip+0xec0] # 2004 <greeting>
1144: 48 89 c7 mov rdi,rax
1147: e8 e4 fe ff ff call 1030 <puts#plt>
The offset relative to rip has changed a little bit, because the .text and .rodata sections are now a little bit further apart, although the .rodata contents remain the same, because the source modification was chosen to avoid modifying .rodata -- if greetings were not const, it would be located in .data instead. The map file below now shows burt.c's temporary object files's contribution to .rodata defining the greeting symbol, as compared to the previous map file where the same data had no name:
.rodata 0x0000000000002000 0xb
*(.rodata .rodata.* .gnu.linkonce.r.*)
.rodata.cst4 0x0000000000002000 0x4 /usr/lib/gcc/x86_64-pc-linux-gnu/12.2.1/../../../..
/lib/Scrt1.o
0x0000000000002000 _IO_stdin_used
.rodata 0x0000000000002004 0x7 /tmp/ccDnWXjG.o
0x0000000000002004 greeting

Reading from flash that's not part of the application

I'm programming bare-metal embedded, so no OS etc. on a STM32L4 (ARM Cortex M4). I have a separate page in flash, which is written by a bootloader (it is not and should not be part of my application binary, this is a must). In this page, I store configuration parameters that will be used in my application. This configuration page may change, but not during runtime, after a change I reset the processor.
How can I access this data in flash most nicely?
My definition of nice is (in this order of priority):
- support for (u)int32_t, (u)int8_t, bool, char[fixed-size]
- little overhead when compared to #define PARAM (1) or constexpr
- typesafe usage (i.e. uint8_t var = CONFIG_CHAR_ARRAY shall issue atleast a warning)
- no RAM copy
- readability of the configuration parameters while debugging (using STM32CubeIDE)
The solution shall scale for all possible 2048 bytes of the flashpage. Code generation is anyhow part of the process.
So far, I have tested two variants (I am coming from plain C but am using (potentially modern) C++ in this project). My current testcase is
if (param) function_call();
but it should also work for other cases such as
for(int i = 0; i < param2; i++)
define with pointer cast
#define CONF_PARAM1 (*(bool*)(CONFIG_ADDRESS + 0x0083))
Which leads to (using -Os):
8008872: 4b1b ldr r3, [pc, #108] ; (80088e0 <_Z16main_applicationv+0xac>)
8008874: 781b ldrb r3, [r3, #0]
8008876: b10b cbz r3, 800887c <_Z16main_applicationv+0x48>
8008878: f7ff ff56 bl 8008728 <_Z10function_callv>
80088e0: 0801f883 .word 0x0801f883
const variable
const bool CONF_PARAM1 = *(bool*)(CONFIG_ADDRESS + 0x0083);
leading to
800887c: 4b19 ldr r3, [pc, #100] ; (80088e4 <_Z16main_applicationv+0xb0>)
800887e: 781b ldrb r3, [r3, #0]
8008880: b10b cbz r3, 8008886 <_Z16main_applicationv+0x52>
8008882: f000 f899 bl 8008728 <_Z10function_callv>
80088e4: 200000c0 .word 0x200000c0
I dislike option 2, as it adds a RAM copy (would not scale well for 2048 bytes of config), option 1 looks like very old c style and does not help while debugging. I struggle to find another option using the linker script, as I do not find a way to not end up with the variable being in the application's binary.
Is there any better way of doing this?
If you make your constant a reference the compiler wont copy it into a variable, it will probably just load the address into a variable. You can then wrap the generation of the references into a templated function to make your code cleaner:
#include <cstdint>
#include <iostream>
template <typename T>
const T& configOption(uintptr_t offset)
{
const uintptr_t CONFIG_ADDRESS = 0x1000;
return *reinterpret_cast<T*>(CONFIG_ADDRESS + offset);
}
auto& CONF_PARAM1 = configOption< bool >(0x0083);
auto& CONF_PARAM2 = configOption< int >(0x0087);
int main()
{
std::cout << CONF_PARAM1 << ", " << CONF_PARAM2 << "\n";
}
GCC optimises this fairly well: https://godbolt.org/z/r27o5Q
As proposed by #old_timer in a comment above, I favour this solution:
In the linker file, I put
CONF_PARAM = _config_start + 0x0083;
In my config.hpp, I put
extern const bool CONF_PARAM;
which then can easily be accessed in any source file
if (CONF_PARAM)
This basically fulfills all "nice"-definitions of mine, as far as I can see.
No need to re-invent the wheel - placing data in flash is a fairly common use-case in embedded systems. When dealing with such data flash, there are some important considerations:
All data must sit at the very same address, with the same type, from case to case. This means that struct is problematic because of padding (and even more so class). If you align all data on 32 bit boundaries, this shouldn't be a problem, so I strongly recommend that you do so. Then the program becomes portable between compilers.
All these variables and pointers to them must be declared with volatile qualifier, otherwise the optimizer might go haywire. Things like (*(bool*)(CONFIG_ADDRESS + 0x0083)) are brittle and might break at any point, unless you add volatile.
You can place data at a fixed location in memory, but how to do so is compiler/linker-specific. And since it isn't standardized, it's always a pain to get right. With gcc-flavoured compilers it might be something like: __attribute__(section(".dataflash")) where .dataflash is your custom segment that you must reserve space for in the linker script. You'll need to take a closer look at how to do this with your specific toolchain (others use #pragmas etc instead), I'll use the __attribute__ here just to illustrate.
If this section gets downloaded together with the executable binary, or only through bootloader, is up to you to define. Linker scripts typically come with a "no init" option.
So you could do something like:
// flashdata.h
typedef struct
{
uint32_t stuff;
uint32_t more_stuff;
...
} flashdata_t;
extern volatile const flashdata_t flash_data __attribute__(section(".dataflash"));
And then declare it as:
// flashdata.c
volatile const flashdata_t flash_data __attribute__(section(".dataflash"));
And now you can use it as any struct, flash_data.stuff.
If you are using C, you can even split up each uint32_t chunk with union, such as typedef union { uint32_t u32; uint8_t u8 [4]; } and similar, but that isn't possible in C++, because it doesn't allow union type punning.
You can isolate the variables in question their own section. There is more than one way to do that. The tools build normally and do all the addressing work. Like using structs across compile domains you need to be extremely careful and probably put checks into the code, but you can build the binary and only load it or all but the other flash contents, then at that time or later you can change the VALUES of the variables in the other section and build and isolate those into their own load.
Testing the theory
vectors.s
.globl _start
_start:
.word 0x20001000
.word reset
.thumb_func
reset:
bl main
b .
.globl dummy
.thumb_func
dummy:
bx lr
so.c
extern volatile unsigned int x;
extern volatile unsigned short y;
extern volatile unsigned char z[7];
extern void dummy ( unsigned int );
int main ( void )
{
dummy(x);
dummy(y);
dummy(z[0]<<z[1]);
return(0);
}
flashvars.c
volatile unsigned int x=1;
volatile unsigned short y=3;
volatile unsigned char z[7]={1,2,3,4,5,6,7};
flash.ld
MEMORY
{
rom0 : ORIGIN = 0x08000000, LENGTH = 0x1000
rom1 : ORIGIN = 0x08002000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom0
.vars : { flashvars.o } > rom1
}
build
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m0 vectors.s -o vectors.o
arm-none-eabi-ld -nostdlib -nostartfiles -T flash.ld vectors.o so.o flashvars.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy -R .vars -O binary so.elf so.bin
examine
Disassembly of section .text:
08000000 <_start>:
8000000: 20001000 andcs r1, r0, r0
8000004: 08000009 stmdaeq r0, {r0, r3}
08000008 <reset>:
8000008: f000 f802 bl 8000010 <main>
800000c: e7fe b.n 800000c <reset+0x4>
0800000e <dummy>:
800000e: 4770 bx lr
08000010 <main>:
8000010: 4b08 ldr r3, [pc, #32] ; (8000034 <main+0x24>)
8000012: b510 push {r4, lr}
8000014: 6818 ldr r0, [r3, #0]
8000016: f7ff fffa bl 800000e <dummy>
800001a: 4b07 ldr r3, [pc, #28] ; (8000038 <main+0x28>)
800001c: 8818 ldrh r0, [r3, #0]
800001e: b280 uxth r0, r0
8000020: f7ff fff5 bl 800000e <dummy>
8000024: 4b05 ldr r3, [pc, #20] ; (800003c <main+0x2c>)
8000026: 7818 ldrb r0, [r3, #0]
8000028: 785b ldrb r3, [r3, #1]
800002a: 4098 lsls r0, r3
800002c: f7ff ffef bl 800000e <dummy>
8000030: 2000 movs r0, #0
8000032: bd10 pop {r4, pc}
8000034: 0800200c stmdaeq r0, {r2, r3, sp}
8000038: 08002008 stmdaeq r0, {r3, sp}
800003c: 08002000 stmdaeq r0, {sp}
Disassembly of section .vars:
08002000 <z>:
8002000: 04030201 streq r0, [r3], #-513 ; 0xfffffdff
8002004: 00070605 andeq r0, r7, r5, lsl #12
08002008 <y>:
8002008: 00000003 andeq r0, r0, r3
0800200c <x>:
800200c: 00000001 andeq r0, r0, r1
that looks good
hexdump -C so.bin
00000000 00 10 00 20 09 00 00 08 00 f0 02 f8 fe e7 70 47 |... ..........pG|
00000010 08 4b 10 b5 18 68 ff f7 fa ff 07 4b 18 88 80 b2 |.K...h.....K....|
00000020 ff f7 f5 ff 05 4b 18 78 5b 78 98 40 ff f7 ef ff |.....K.x[x.#....|
00000030 00 20 10 bd 0c 20 00 08 08 20 00 08 00 20 00 08 |. ... ... ... ..|
00000040
as does that.
arm-none-eabi-objcopy -j .vars -O binary so.elf sovars.bin
hexdump -C sovars.bin
00000000 01 02 03 04 05 06 07 00 03 00 00 00 01 00 00 00 |................|
00000010 47 43 43 3a 20 28 47 4e 55 29 20 39 2e 33 2e 30 |GCC: (GNU) 9.3.0|
00000020 00 41 30 00 00 00 61 65 61 62 69 00 01 26 00 00 |.A0...aeabi..&..|
00000030 00 05 43 6f 72 74 65 78 2d 4d 30 00 06 0c 07 4d |..Cortex-M0....M|
00000040 09 01 12 04 14 01 15 01 17 03 18 01 19 01 1a 01 |................|
00000050 1e 02 |..|
00000052
hah, okay a little more work.
MEMORY
{
rom0 : ORIGIN = 0x08000000, LENGTH = 0x1000
rom1 : ORIGIN = 0x08002000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom0
.vars : { flashvars.o(.data) } > rom1
}
hexdump -C sovars.bin
00000000 01 02 03 04 05 06 07 00 03 00 00 00 01 00 00 00 |................|
00000010
much better.
I strongly recommend against structs across compile domains and this falls into that category as the build for the real data is separate and between the code build and the data build you could get data that doesn't land the same, when I do things like this I put in protections to catch the problem during execution before it goes off the rails (or better at build time). It is not a case of if it is a case of when. Implementation defined means implementation defined.
But thinking about your question this became an easy solution. And yes technically this data is read only, const this or that, but 1) does volatile and const go together? and 2) do you really want/need to do that?
Does it even need to be volatile? Probably not, just banged that out to start with. Switching it to const the tool puts them in .rodata. Well my tool does depends on how you write your linker script and I think the version of binutils.
so.c
extern const unsigned int x;
extern const unsigned short y;
extern const unsigned char z[7];
extern void dummy ( unsigned int );
int main ( void )
{
dummy(x);
dummy(y);
dummy(z[0]<<z[1]);
return(0);
}
flashvars.c
const unsigned int x=1;
const unsigned short y=3;
const unsigned char z[7]={1,2,3,4,5,6,7};
flash.ld
MEMORY
{
rom0 : ORIGIN = 0x08000000, LENGTH = 0x1000
rom1 : ORIGIN = 0x08002000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom0
.vars : { flashvars.o(.rodata) } > rom1
}
output
Disassembly of section .text:
08000000 <_start>:
8000000: 20001000 andcs r1, r0, r0
8000004: 08000009 stmdaeq r0, {r0, r3}
08000008 <reset>:
8000008: f000 f802 bl 8000010 <main>
800000c: e7fe b.n 800000c <reset+0x4>
0800000e <dummy>:
800000e: 4770 bx lr
08000010 <main>:
8000010: 4b08 ldr r3, [pc, #32] ; (8000034 <main+0x24>)
8000012: b510 push {r4, lr}
8000014: 6818 ldr r0, [r3, #0]
8000016: f7ff fffa bl 800000e <dummy>
800001a: 4b07 ldr r3, [pc, #28] ; (8000038 <main+0x28>)
800001c: 8818 ldrh r0, [r3, #0]
800001e: f7ff fff6 bl 800000e <dummy>
8000022: 4b06 ldr r3, [pc, #24] ; (800003c <main+0x2c>)
8000024: 7818 ldrb r0, [r3, #0]
8000026: 785b ldrb r3, [r3, #1]
8000028: 4098 lsls r0, r3
800002a: f7ff fff0 bl 800000e <dummy>
800002e: 2000 movs r0, #0
8000030: bd10 pop {r4, pc}
8000032: 46c0 nop ; (mov r8, r8)
8000034: 0800200c stmdaeq r0, {r2, r3, sp}
8000038: 08002008 stmdaeq r0, {r3, sp}
800003c: 08002000 stmdaeq r0, {sp}
Disassembly of section .vars:
08002000 <z>:
8002000: 04030201 streq r0, [r3], #-513 ; 0xfffffdff
8002004: 00070605 andeq r0, r7, r5, lsl #12
08002008 <y>:
8002008: 00000003 andeq r0, r0, r3
0800200c <x>:
800200c: 00000001 andeq r0, r0, r1
hexdump -C so.bin
00000000 00 10 00 20 09 00 00 08 00 f0 02 f8 fe e7 70 47 |... ..........pG|
00000010 08 4b 10 b5 18 68 ff f7 fa ff 07 4b 18 88 ff f7 |.K...h.....K....|
00000020 f6 ff 06 4b 18 78 5b 78 98 40 ff f7 f0 ff 00 20 |...K.x[x.#..... |
00000030 10 bd c0 46 0c 20 00 08 08 20 00 08 00 20 00 08 |...F. ... ... ..|
00000040
hexdump -C sovars.bin
00000000 01 02 03 04 05 06 07 00 03 00 00 00 01 00 00 00 |................|
00000010

Iteration Vs. Recursion for Printing Integer Numbers to a Character LCD/OLED Display

Question
I am looking for some input on how to optimize printing the digits of an integer number, say uint32_t num = 1234567890;, to a character display with an Arduino UNO. The main metrics to consider are memory usage and complied size. The display is so slow that no improvement in speed would be meaningful and minimum code length, while nice, isn't a requirement.
Currently, I am extracting the least significant digit using num%10 and then removing this digit by num/10 and so on until all the digits of num are extracted. Using recursion I can reverse the order of the printing so very few operations are needed (as explicit lines of code) to print the digits in proper order. Using for loops I need to find the number of characters used to write the number, then store them before being able to print them in the correct order, requiring an array and 3 for loops.
According to the Arduino IDE, when printing an assortment of signed and unsigned integers, recursion uses 2010/33 bytes of storage/memory, while iteration uses 2200/33 bytes verses 2474/52 bytes when using the Adafruit_CharacterOLED library that extends class Print.
Is there a way to implement this better than the functions I've written using recursion and iteration below? If not, which would you prefer and why? I feel like there might be a better way to do this with less resources--but maybe I'm Don Quixote fighting windmills and the code is already good enough.
Background
I'm working with a NHD-0420DZW character OLED display and have used the Newhaven datasheet and LiquidCrystal library as a guide to write my own library and the display is working great. However, to minimize code bloat, I chose to not make my display library a subclass of Print, which is a part of the Arduino core libraries. In doing this, significant savings in storage space (~400 bytes) and memory (~19 bytes) have already been realized (the ATmega328P has 32k storage with 2k RAM, so resources are scarce).
Recursion
If I use recursion, the print method is rather elegant. The number is divided by 10 until the base case of zero is achieved. Then the least significant digit of the smallest number is printed (MSD of num), and the LSD of the next smallest number (second MSD of num) and so on, causing the final print order to be reversed. This corrects for the reversed order of digit extraction using %10 and /10 operations.
// print integer type literals to display (base-10 representation)
void NewhavenDZW::print(int8_t num) {print(static_cast<int32_t>(num));}
void NewhavenDZW::print(uint8_t num) {print(static_cast<uint32_t>(num));}
void NewhavenDZW::print(int16_t num) {print(static_cast<int32_t>(num));}
void NewhavenDZW::print(uint16_t num) {print(static_cast<uint32_t>(num));}
void NewhavenDZW::print(int32_t num) {
if(num < 0) { // print negative sign if present
send('-', HIGH); // and make num positive
print(static_cast<uint32_t>(-num));
} else
print(static_cast<uint32_t>(num));
}
void NewhavenDZW::print(uint32_t num) {
if(num < 10) { // print single digit numbers directly
send(num + '0', HIGH);
return;
} else // use recursion to print nums with more
recursivePrint(num); // than two digits in the correct order
}
// recursive method for printing a number "backwards"
// used to correct the reversed order of digit extraction
void NewhavenDZW::recursivePrint(uint32_t num) {
if(num) { // true if num>0, false if num==0
recursivePrint(num/10); // maximum of 11 recursive steps
send(num%10 + '0', HIGH); // for a 10 digit number
}
}
Iteration
Since the digit extraction method starts at the LSD, rather than the MSD, the extracted digits cannot be printed directly unless I move the cursor and tell the display to print right-to-left. So I have to store the digits as I extract them before I can write them to the display in the correct order.
void NewhavenDZW::print(uint32_t num) {
if(num < 10) {
send(num + '0', HIGH);
return;
}
uint8_t length = 0;
for(uint32_t i=num; i>0; i/=10) // determine number of characters
++length; // needed to represent number
char text[length];
for(uint8_t i=length; num>0; num/=10, --i)
text[i-1] = num%10 + '0'; // map each numerical digit to
for(uint8_t i=0; i<length; i++) // its char value and fix ordering
send(text[i], HIGH); // before printing result
}
Update
Ultimately, recursion takes the least storage space, but is likely to use the most memory.
After reviewing the code kindly provided by Igor G and darune, as well as looking at the number of instructions listed on godbolt (as discussed by darune and old_timer) I believe that Igor G's solution is the best overall. It compiles to 2076 bytes vs. 2096 bytes for darune's function (using an if statement to stop leading zeros and be able to print 0) during testing. It also requires less instructions (88) than darune's (273) when the necessary if statement is tacked on.
Using Pointer Variable
void NewhavenDZW::print(uint32_t num) {
char buffer[10];
char* p = buffer;
do {
*p++ = num%10 + '0';
num /= 10;
} while (num);
while (p != buffer)
send(*--p, HIGH);
}
Using Index Variable
This is what my original for loop was trying to do, but in a naive way. There is really no point in trying to minimize the size of the buffer array as Igor G has point out.
void NewhavenDZW::print(uint32_t num) {
char text[10]; // signed/unsigned 32-bit ints are <= 10 digits
uint8_t i = sizeof(text) - 1; // set index to end of char array
do {
text[i--] = num%10 + '0'; // store each numerical digit as
num /= 10; // its associated char value
} while (num);
while (i < sizeof(text))
send(text[i++], HIGH); // print num in the correct order
}
The Alternative
Here's darune's function with the added if statement, for those who don't want to sift through the comments. The condition pow10 == 100 is the same as pow10 == 1, but saves two iterations of the loop to print zero while having the same compile size.
void NewhavenDZW::print(uint32_t num) {
for (uint32_t pow10 = 1000000000; pow10 != 0; pow10 /= 10)
if (num >= pow10 || (num == 0 && pow10 == 100))
send((num/pow10)%10 + '0', HIGH);
}
For a smaller footprint you can use something like this:
void Send(unsigned char);
void SmallPrintf(unsigned long val)
{
static_assert(sizeof(decltype(val)) == 4, "expected '10 digit type'");
for (unsigned long digit_pow10{1000000000}; digit_pow10 != 0; digit_pow10 /= 10)
{
Send((val / digit_pow10 % 10) + '0');
}
}
This produces around 70 instructions - which is about ~14 instructions less then using a buffer and iterating the buffer after. (Also the code is a lot simpler)
Link to godbolt.
If leading zero's is unwant'ed then an if clause can avoid that fairly simpel - something like:
if (val >= digit_pow10) {
Send((val / digit_pow10 % 10) + '0');
}
But it will cost some extra instructions (~9) though - however the total is still below the buffered example.
Try this one. My avr-gcc-5.4.0 + readelf tells that the function body is only 138 bytes.
void Send(uint8_t);
void OptimizedPrintf(uint32_t val)
{
uint8_t buffer[sizeof(val) * CHAR_BIT / 3 + 1];
uint8_t* p = buffer;
do
{
*p++ = (val % 10) + '0';
val /= 10;
} while (val);
while (p != buffer)
Send(*--p);
}
interesting experiment.
unsigned long fun ( unsigned long x )
{
return(x/10);
}
unsigned long fun2 ( unsigned long x )
{
return(x%10);
}
int main ( void )
{
return(0);
}
does/can give with an apt-got toolchain:
00000000 <fun>:
0: 2a e0 ldi r18, 0x0A ; 10
2: 30 e0 ldi r19, 0x00 ; 0
4: 40 e0 ldi r20, 0x00 ; 0
6: 50 e0 ldi r21, 0x00 ; 0
8: 0e d0 rcall .+28 ; 0x26 <__udivmodsi4>
a: 95 2f mov r25, r21
c: 84 2f mov r24, r20
e: 73 2f mov r23, r19
10: 62 2f mov r22, r18
12: 08 95 ret
00000014 <fun2>:
14: 2a e0 ldi r18, 0x0A ; 10
16: 30 e0 ldi r19, 0x00 ; 0
18: 40 e0 ldi r20, 0x00 ; 0
1a: 50 e0 ldi r21, 0x00 ; 0
1c: 04 d0 rcall .+8 ; 0x26 <__udivmodsi4>
1e: 08 95 ret
00000020 <main>:
20: 80 e0 ldi r24, 0x00 ; 0
22: 90 e0 ldi r25, 0x00 ; 0
24: 08 95 ret
00000026 <__udivmodsi4>:
26: a1 e2 ldi r26, 0x21 ; 33
28: 1a 2e mov r1, r26
2a: aa 1b sub r26, r26
2c: bb 1b sub r27, r27
2e: ea 2f mov r30, r26
30: fb 2f mov r31, r27
32: 0d c0 rjmp .+26 ; 0x4e <__udivmodsi4_ep>
00000034 <__udivmodsi4_loop>:
34: aa 1f adc r26, r26
36: bb 1f adc r27, r27
38: ee 1f adc r30, r30
3a: ff 1f adc r31, r31
3c: a2 17 cp r26, r18
3e: b3 07 cpc r27, r19
40: e4 07 cpc r30, r20
42: f5 07 cpc r31, r21
44: 20 f0 brcs .+8 ; 0x4e <__udivmodsi4_ep>
46: a2 1b sub r26, r18
48: b3 0b sbc r27, r19
4a: e4 0b sbc r30, r20
4c: f5 0b sbc r31, r21
0000004e <__udivmodsi4_ep>:
4e: 66 1f adc r22, r22
50: 77 1f adc r23, r23
52: 88 1f adc r24, r24
54: 99 1f adc r25, r25
56: 1a 94 dec r1
58: 69 f7 brne .-38 ; 0x34 <__udivmodsi4_loop>
5a: 60 95 com r22
5c: 70 95 com r23
5e: 80 95 com r24
60: 90 95 com r25
62: 26 2f mov r18, r22
64: 37 2f mov r19, r23
66: 48 2f mov r20, r24
68: 59 2f mov r21, r25
6a: 6a 2f mov r22, r26
6c: 7b 2f mov r23, r27
6e: 8e 2f mov r24, r30
70: 9f 2f mov r25, r31
72: 08 95 ret
Answered one of my questions, 78 instructions for the division function. Also it returns both the numerator and denominator in one call something that could be taken advantage of if desperate.
unsigned int fun ( unsigned int x )
{
return(x/10);
}
unsigned int fun2 ( unsigned int x )
{
return(x%10);
}
int main ( void )
{
return(0);
}
gives
00000000 <fun>:
0: 6a e0 ldi r22, 0x0A ; 10
2: 70 e0 ldi r23, 0x00 ; 0
4: 0a d0 rcall .+20 ; 0x1a <__udivmodhi4>
6: 86 2f mov r24, r22
8: 97 2f mov r25, r23
a: 08 95 ret
0000000c <fun2>:
c: 6a e0 ldi r22, 0x0A ; 10
e: 70 e0 ldi r23, 0x00 ; 0
10: 04 d0 rcall .+8 ; 0x1a <__udivmodhi4>
12: 08 95 ret
00000014 <main>:
14: 80 e0 ldi r24, 0x00 ; 0
16: 90 e0 ldi r25, 0x00 ; 0
18: 08 95 ret
0000001a <__udivmodhi4>:
1a: aa 1b sub r26, r26
1c: bb 1b sub r27, r27
1e: 51 e1 ldi r21, 0x11 ; 17
20: 07 c0 rjmp .+14 ; 0x30 <__udivmodhi4_ep>
00000022 <__udivmodhi4_loop>:
22: aa 1f adc r26, r26
24: bb 1f adc r27, r27
26: a6 17 cp r26, r22
28: b7 07 cpc r27, r23
2a: 10 f0 brcs .+4 ; 0x30 <__udivmodhi4_ep>
2c: a6 1b sub r26, r22
2e: b7 0b sbc r27, r23
00000030 <__udivmodhi4_ep>:
30: 88 1f adc r24, r24
32: 99 1f adc r25, r25
34: 5a 95 dec r21
36: a9 f7 brne .-22 ; 0x22 <__udivmodhi4_loop>
38: 80 95 com r24
3a: 90 95 com r25
3c: 68 2f mov r22, r24
3e: 79 2f mov r23, r25
40: 8a 2f mov r24, r26
42: 9b 2f mov r25, r27
44: 08 95 ret
22 lines, 44 bytes for the divide. also interesting. Might be taken advantage of at the C++ level to save space (if this display loop is the only place you do a divide/modulo).
of course the optimizer knows that the library function does both result and remainder:
unsigned long fun ( unsigned long x )
{
unsigned long res;
unsigned long rem;
res=x/10;
rem=x&10;
res&=0xFFFF;
rem&=0xFFFF;
return((res<<16)|rem);
}
int main ( void )
{
return(0);
}
00000000 <fun>:
0: cf 92 push r12
2: df 92 push r13
4: ef 92 push r14
6: ff 92 push r15
8: c6 2e mov r12, r22
a: d7 2e mov r13, r23
c: e8 2e mov r14, r24
e: f9 2e mov r15, r25
10: 2a e0 ldi r18, 0x0A ; 10
12: 30 e0 ldi r19, 0x00 ; 0
14: 40 e0 ldi r20, 0x00 ; 0
16: 50 e0 ldi r21, 0x00 ; 0
18: 19 d0 rcall .+50 ; 0x4c <__udivmodsi4>
1a: a2 2f mov r26, r18
1c: b3 2f mov r27, r19
1e: 99 27 eor r25, r25
20: 88 27 eor r24, r24
22: 2a e0 ldi r18, 0x0A ; 10
24: c2 22 and r12, r18
26: dd 24 eor r13, r13
28: ee 24 eor r14, r14
2a: ff 24 eor r15, r15
2c: 68 2f mov r22, r24
2e: 79 2f mov r23, r25
30: 8a 2f mov r24, r26
32: 9b 2f mov r25, r27
34: 6c 29 or r22, r12
36: 7d 29 or r23, r13
38: 8e 29 or r24, r14
3a: 9f 29 or r25, r15
3c: ff 90 pop r15
3e: ef 90 pop r14
40: df 90 pop r13
42: cf 90 pop r12
44: 08 95 ret
00000046 <main>:
46: 80 e0 ldi r24, 0x00 ; 0
48: 90 e0 ldi r25, 0x00 ; 0
4a: 08 95 ret
0000004c <__udivmodsi4>:
4c: a1 e2 ldi r26, 0x21 ; 33
4e: 1a 2e mov r1, r26
50: aa 1b sub r26, r26
52: bb 1b sub r27, r27
54: ea 2f mov r30, r26
56: fb 2f mov r31, r27
58: 0d c0 rjmp .+26 ; 0x74 <__udivmodsi4_ep>
0000005a <__udivmodsi4_loop>:
5a: aa 1f adc r26, r26
5c: bb 1f adc r27, r27
5e: ee 1f adc r30, r30
60: ff 1f adc r31, r31
62: a2 17 cp r26, r18
64: b3 07 cpc r27, r19
66: e4 07 cpc r30, r20
68: f5 07 cpc r31, r21
6a: 20 f0 brcs .+8 ; 0x74 <__udivmodsi4_ep>
6c: a2 1b sub r26, r18
6e: b3 0b sbc r27, r19
70: e4 0b sbc r30, r20
72: f5 0b sbc r31, r21
00000074 <__udivmodsi4_ep>:
74: 66 1f adc r22, r22
76: 77 1f adc r23, r23
78: 88 1f adc r24, r24
7a: 99 1f adc r25, r25
7c: 1a 94 dec r1
7e: 69 f7 brne .-38 ; 0x5a <__udivmodsi4_loop>
80: 60 95 com r22
82: 70 95 com r23
84: 80 95 com r24
86: 90 95 com r25
88: 26 2f mov r18, r22
8a: 37 2f mov r19, r23
8c: 48 2f mov r20, r24
8e: 59 2f mov r21, r25
90: 6a 2f mov r22, r26
92: 7b 2f mov r23, r27
94: 8e 2f mov r24, r30
96: 9f 2f mov r25, r31
98: 08 95 ret
Sorry using this space to have fun with this exercise of seeing what a compiler does with this problem. Trying to use the 16 bit division starts to explode register usage burning through the 34 instructions saved.
Because of the fixed denominator and because this is an 8 bit processor you can play optimization games with the compiler, but you may take this down an unreadable path.
Still pretty sure that this can be done tighter without using the division library function and doing it all yourself knowing that this is an AVR. Shifts are brutal though, lots of registers but once you spill over then that explodes the size of the function too. Very delicate.
For the price of one uno you could have bought a handful of blue pills with a lot more of everything, including 32 bit registers and a multiply which turns a divide by 10 into a few instruction. Can still use the arduino sandbox, And runs way faster. (more flash, more ram, compiler friendly instruction set, likely no longer needing to count bytes, should try compiling your project for that target and see how much of the flash is used).

Where Are Relocation Entries Stored in Flat Binaries?

Please be attend to the following example. I have a C++ code that I want to compile under powerpc and generate the binary code.
#include <stdio.h>
int function(int x);
int myfunction(int x);
int main()
{
int x = function(2);
int y = myfunction(2);
return x + y;
}
int function(int x)
{
return x * myfunction(x);
}
int myfunction(int x)
{
return x;
}
I have two function calls: call function(2) and call myfunction(2). I compile this C++ code under powerpc‍‍‍. So, now I use the objdump to get the assembly behind the object file and that is as follows:
00000000 <main>:
0: 94 21 ff e0 stwu r1,-32(r1)
4: 7c 08 02 a6 mflr r0
8: 93 e1 00 1c stw r31,28(r1)
c: 90 01 00 24 stw r0,36(r1)
10: 7c 3f 0b 78 mr r31,r1
14: 38 60 00 02 li r3,2
18: 48 00 00 01 bl 18 <main+0x18>
1c: 7c 60 1b 78 mr r0,r3
20: 90 1f 00 08 stw r0,8(r31)
24: 38 60 00 02 li r3,2
28: 48 00 00 01 bl 28 <main+0x28>
2c: 7c 60 1b 78 mr r0,r3
30: 90 1f 00 0c stw r0,12(r31)
34: 80 1f 00 08 lwz r0,8(r31)
38: 81 3f 00 0c lwz r9,12(r31)
3c: 7c 00 4a 14 add r0,r0,r9
40: 7c 03 03 78 mr r3,r0
44: 48 00 00 0c b 50 <main+0x50>
48: 38 60 00 00 li r3,0
4c: 48 00 00 04 b 50 <main+0x50>
50: 81 61 00 00 lwz r11,0(r1)
54: 80 0b 00 04 lwz r0,4(r11)
58: 7c 08 03 a6 mtlr r0
5c: 83 eb ff fc lwz r31,-4(r11)
60: 7d 61 5b 78 mr r1,r11
64: 4e 80 00 20 blr
00000068 <function__Fi>:
68: 94 21 ff e0 stwu r1,-32(r1)
6c: 7c 08 02 a6 mflr r0
70: 93 e1 00 1c stw r31,28(r1)
74: 90 01 00 24 stw r0,36(r1)
78: 7c 3f 0b 78 mr r31,r1
7c: 90 7f 00 08 stw r3,8(r31)
80: 80 7f 00 08 lwz r3,8(r31)
84: 48 00 00 01 bl 84 <function__Fi+0x1c>
88: 7c 60 1b 78 mr r0,r3
8c: 81 3f 00 08 lwz r9,8(r31)
90: 7c 00 49 d6 mullw r0,r0,r9
94: 7c 03 03 78 mr r3,r0
98: 48 00 00 0c b a4 <function__Fi+0x3c>
9c: 48 00 00 08 b a4 <function__Fi+0x3c>
a0: 48 00 00 04 b a4 <function__Fi+0x3c>
a4: 81 61 00 00 lwz r11,0(r1)
a8: 80 0b 00 04 lwz r0,4(r11)
ac: 7c 08 03 a6 mtlr r0
b0: 83 eb ff fc lwz r31,-4(r11)
b4: 7d 61 5b 78 mr r1,r11
b8: 4e 80 00 20 blr
000000bc <myfunction__Fi>:
bc: 94 21 ff e0 stwu r1,-32(r1)
c0: 93 e1 00 1c stw r31,28(r1)
c4: 7c 3f 0b 78 mr r31,r1
c8: 90 7f 00 08 stw r3,8(r31)
cc: 80 1f 00 08 lwz r0,8(r31)
d0: 7c 03 03 78 mr r3,r0
d4: 48 00 00 04 b d8 <myfunction__Fi+0x1c>
d8: 81 61 00 00 lwz r11,0(r1)
dc: 83 eb ff fc lwz r31,-4(r11)
e0: 7d 61 5b 78 mr r1,r11
e4: 4e 80 00 20 blr
The interesting thing that wondered me is the line that do function calls:
18: 48 00 00 01 bl 18 <main+0x18>
...
28: 48 00 00 01 bl 28 <main+0x28>
As you see, both are the binary code "48 00 00 01", but one calls function and another calls myfunction. The problem is that how we can find the call target. As I found, the call targets are written on RELOCATION ENTRIES. Oh, everything is okay, I use the command below to generate the relocation entries and is as follows:
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
00000018 R_PPC_REL24 function__Fi
00000028 R_PPC_REL24 myfunction__Fi
00000084 R_PPC_REL24 myfunction__Fi
This entries are useful to find the call target. Now, I use the objcopy -O binary command to generate the raw binary file (flat binary).
objcopy -O binary object-file
My object-file is the elf32-powerpc. The output binary file shown in following block:
2564 0000 2564 0a00 93e1 001c 9001 0024
7c3f 0b78 3860 0002 4800 0001 7c60 1b78
901f 0008 3860 0002 4800 0001 7c60 1b78
901f 000c 801f 0008 2c00 0002 4182 003c
2c00 0002 4181 0010 2c00 0001 4182 0014
4800 0058 2c00 0003 4182 0038 4800 004c
3d20 0000 3869 0000 389f 0008 4cc6 3182
4800 0001 4800 004c 3d20 0000 3869 0004
3880 0014 4cc6 3182 4800 0001 4800 0034
3d20 0000 3869 0004 3880 001e 4cc6 3182
4800 0001 4800 001c 3d20 0000 3869 0004
3880 0028 4cc6 3182 4800 0001 4800 0004
3860 0000 4800 0004 8161 0000 800b 0004
7c08 03a6 83eb fffc 7d61 5b78 4e80 0020
9421 ffe0 7c08 02a6 93e1 001c 9001 0024
7c3f 0b78 907f 0008 807f 0008 4800 0001
7c60 1b78 813f 0008 7c00 49d6 7c03 0378
4800 000c 4800 0008 4800 0004 8161 0000
800b 0004 7c08 03a6 83eb fffc 7d61 5b78
4e80 0020 9421 ffe0 93e1 001c 7c3f 0b78
907f 0008 801f 0008 7c03 0378 4800 0004
8161 0000 83eb fffc 7d61 5b78 4e80 0020
We can find the 4800 0001 on it. But there is no relocation entries. Could any one please tell me how can I find the relocation entries?
Thanks in Advance.
Your relocation entries are discarded when you do the objcopy. From the manual:
objcopy can be used to generate a raw binary file by using an output
target of binary (e.g., use -O binary). When objcopy generates a raw
binary file, it will essentially produce a memory dump of the contents
of the input object file. All symbols and relocation information will
be discarded. The memory dump will start at the load address of the
lowest section copied into the output file.
For your raw binary to be useful, you have a few options here:
You can perform the relocations during the build process, so that your raw binary is ready to run. However, this means that the binary file needs to be run at a fixed address in memory.
Alternatively, you can produce an object file that does not require relocation - all address references need to be relative. Lookup "position-independent code" for more details.
Finally, you could also use some other way of generating the raw binary (instead of, or in addition to, the objcopy stage), which includes the relocation table in the output file, and then have your code manually process these relocations at runtime.
The choice here will depend on what you're trying to do, and what constraints your runtime environment has.

What is stored in this 26KB executable?

Compiling this code with -O3:
#include <iostream>
int main(){std::cout<<"Hello World"<<std::endl;}
results in a file with a length of 25,890 bytes. (Compiled with GCC 4.8.1)
Can't the compiler just store two calls to write(STDOUT_FILENO, ???, strlen(???));, store write's contents, store the string, and boom write it to the disk? It should result in a EXE with a length under 1,024 bytes to my estimate.
Compiling a hello world program in assembly results in 17 bytes file: https://stackoverflow.com/questions/284797/hello-world-in-less-than-17-bytes, means actual code is 5-bytes long. (The string is Hello World\0)
What that EXE stores except the actual main and the functions it calls?
NOTE: This question applies to MSVC too.
Edit:
A lot of users pointed at iostream as being the culprit, so I tested this hypothesis and compiled this program with the same parameters:
int main( ) {
}
And got 23,815 bytes, the hypothesis has been disproved.
The compiler generates by default a complete PE-conformant executable. Assuming a release build, the simple code you posted might probably include:
all the PE headers and tables needed by the loader (e.g. IAT), this also means alignment requirements have to be met
CRT library initialization code
Debugging info (you need to manually drop these off even for a release build)
In case the compiler were MSVC there would have been additional inclusions:
Manifest xml and relocation data
Results of default compiler options that favor speed over size
The link you posted does contain a very small assembly "hello world" program, but in order to properly run in a Windows environment at least the complete and valid PE structure needs to be available to the loader (setting aside all the low-level issues that might cause that code not to run at all).
Assuming the loader had already and correctly 'set up' the process where to run that code into, only at that point you could map it into a PE section and do
jmp small_hello_world_entry_point
to actually execute the code.
References: The PE format
One last notice: UPX and similar compression tools are also used to reduce filesize for executables.
C++ isn't assembly, like C it comes with a lot of infrastructure. In addition to the overheads of C - required to be compatible with the C abi - C++ also has its own variants of many things, and it also has to have all the tear-up and -down code required to provide the many guarantees of the language.
Much of these are provided by libraries, but some of it has to be in the executable itself so that a failure to load shared libraries could be handled.
Under Linux/BSD we can reverse engineer an executable with objdump -dsl. I took the following code:
int main() {}
and compiled it with:
g++ -Wall -O3 -g0 test.cpp -o test.exe
The resulting executable?
6922 bytes
Then I compiled with less cruft:
g++ -Wall -O3 -g0 test.cpp -o test.exe -nostdlib
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000400150
Basically: main is a facade entry point for our C++ code, the program really starts at _start.
Executable size?
1454 bytes
Here's how objdump describes the two:
g++ -Wall -O3 -g0 test.cpp -o test.exe
objdump -test.exe
test.exe: file format elf64-x86-64
Contents of section .interp:
400200 2f6c6962 36342f6c 642d6c69 6e75782d /lib64/ld-linux-
400210 7838362d 36342e73 6f2e3200 x86-64.so.2.
Contents of section .note.ABI-tag:
40021c 04000000 10000000 01000000 474e5500 ............GNU.
40022c 00000000 02000000 06000000 12000000 ................
Contents of section .note.gnu.build-id:
40023c 04000000 14000000 03000000 474e5500 ............GNU.
40024c a0f55c7d 671f9eb2 93078fd3 0f52581a ..\}g........RX.
40025c 544829b2 TH).
Contents of section .hash:
400260 03000000 06000000 02000000 05000000 ................
400270 00000000 00000000 00000000 01000000 ................
400280 00000000 03000000 04000000 ............
Contents of section .dynsym:
400290 00000000 00000000 00000000 00000000 ................
4002a0 00000000 00000000 10000000 20000000 ............ ...
4002b0 00000000 00000000 00000000 00000000 ................
4002c0 1f000000 20000000 00000000 00000000 .... ...........
4002d0 00000000 00000000 8b000000 12000000 ................
4002e0 00000000 00000000 00000000 00000000 ................
4002f0 33000000 20000000 00000000 00000000 3... ...........
400300 00000000 00000000 4f000000 20000000 ........O... ...
400310 00000000 00000000 00000000 00000000 ................
Contents of section .dynstr:
400320 006c6962 73746463 2b2b2e73 6f2e3600 .libstdc++.so.6.
400330 5f5f676d 6f6e5f73 74617274 5f5f005f __gmon_start__._
400340 4a765f52 65676973 74657243 6c617373 Jv_RegisterClass
400350 6573005f 49544d5f 64657265 67697374 es._ITM_deregist
400360 6572544d 436c6f6e 65546162 6c65005f erTMCloneTable._
400370 49544d5f 72656769 73746572 544d436c ITM_registerTMCl
400380 6f6e6554 61626c65 006c6962 6d2e736f oneTable.libm.so
400390 2e36006c 69626763 635f732e 736f2e31 .6.libgcc_s.so.1
4003a0 006c6962 632e736f 2e36005f 5f6c6962 .libc.so.6.__lib
4003b0 635f7374 6172745f 6d61696e 00474c49 c_start_main.GLI
4003c0 42435f32 2e322e35 00 BC_2.2.5.
Contents of section .gnu.version:
4003ca 00000000 00000200 00000000 ............
Contents of section .gnu.version_r:
4003d8 01000100 81000000 10000000 00000000 ................
4003e8 751a6909 00000200 9d000000 00000000 u.i.............
Contents of section .rela.dyn:
4003f8 50096000 00000000 06000000 01000000 P.`.............
400408 00000000 00000000 ........
Contents of section .rela.plt:
400410 70096000 00000000 07000000 03000000 p.`.............
400420 00000000 00000000 ........
Contents of section .init:
400428 4883ec08 e85b0000 00e86a01 0000e845 H....[....j....E
400438 02000048 83c408c3 ...H....
Contents of section .plt:
400440 ff351a05 2000ff25 1c052000 0f1f4000 .5.. ..%.. ...#.
400450 ff251a05 20006800 000000e9 e0ffffff .%.. .h.........
Contents of section .text:
400460 31ed4989 d15e4889 e24883e4 f0505449 1.I..^H..H...PTI
400470 c7c0e005 400048c7 c1f00540 0048c7c7 ....#.H....#.H..
400480 d0054000 e8c7ffff fff49090 4883ec08 ..#.........H...
400490 488b05b9 04200048 85c07402 ffd04883 H.... .H..t...H.
4004a0 c408c390 90909090 90909090 90909090 ................
4004b0 90909090 90909090 90909090 90909090 ................
4004c0 b88f0960 00482d88 09600048 83f80e76 ...`.H-..`.H...v
4004d0 17b80000 00004885 c0740dbf 88096000 ......H..t....`.
4004e0 ffe0660f 1f440000 f3c3660f 1f440000 ..f..D....f..D..
4004f0 be880960 004881ee 88096000 48c1fe03 ...`.H....`.H...
400500 4889f048 c1e83f48 01c648d1 fe7411b8 H..H..?H..H..t..
400510 00000000 4885c074 07bf8809 6000ffe0 ....H..t....`...
400520 f3c36666 6666662e 0f1f8400 00000000 ..fffff.........
400530 803d5104 20000075 5f5553bb 80076000 .=Q. ..u_US...`.
400540 4881eb78 07600048 83ec0848 8b053e04 H..x.`.H...H..>.
400550 200048c1 fb034883 eb01488d 6c241048 .H...H...H.l$.H
400560 39d87322 0f1f4000 4883c001 4889051d 9.s"..#.H...H...
400570 042000ff 14c57807 6000488b 050f0420 . ....x.`.H....
400580 004839d8 72e2e835 ffffffc6 05f60320 .H9.r..5.......
400590 00014883 c4085b5d f3c3660f 1f440000 ..H...[]..f..D..
4005a0 bf880760 0048833f 007505e9 40ffffff ...`.H.?.u..#...
4005b0 b8000000 004885c0 74f15548 89e5ffd0 .....H..t.UH....
4005c0 5de92aff ffff9090 90909090 90909090 ].*.............
4005d0 31c0c390 90909090 90909090 90909090 1...............
4005e0 f3c36666 6666662e 0f1f8400 00000000 ..fffff.........
4005f0 48896c24 d84c8964 24e0488d 2d630120 H.l$.L.d$.H.-c.
400600 004c8d25 5c012000 4c896c24 e84c8974 .L.%\. .L.l$.L.t
400610 24f04c89 7c24f848 895c24d0 4883ec38 $.L.|$.H.\$.H..8
400620 4c29e541 89fd4989 f648c1fd 034989d7 L).A..I..H...I..
400630 e8f3fdff ff4885ed 741c31db 0f1f4000 .....H..t.1...#.
400640 4c89fa4c 89f64489 ef41ff14 dc4883c3 L..L..D..A...H..
400650 014839eb 72ea488b 5c240848 8b6c2410 .H9.r.H.\$.H.l$.
400660 4c8b6424 184c8b6c 24204c8b 7424284c L.d$.L.l$ L.t$(L
400670 8b7c2430 4883c438 c3909090 90909090 .|$0H..8........
400680 554889e5 53bb6807 60004883 ec08488b UH..S.h.`.H...H.
400690 05d30020 004883f8 ff74140f 1f440000 ... .H...t...D..
4006a0 4883eb08 ffd0488b 034883f8 ff75f148 H.....H..H...u.H
4006b0 83c4085b 5dc39090 ...[]...
Contents of section .fini:
4006b8 4883ec08 e86ffeff ff4883c4 08c3 H....o...H....
Contents of section .rodata:
4006c8 01000200 ....
Contents of section .eh_frame_hdr:
4006cc 011b033b 20000000 03000000 04ffffff ...; ...........
4006dc 3c000000 14ffffff 54000000 24ffffff <.......T...$...
4006ec 6c000000 l...
Contents of section .eh_frame:
4006f0 14000000 00000000 017a5200 01781001 .........zR..x..
400700 1b0c0708 90010000 14000000 1c000000 ................
400710 c0feffff 03000000 00000000 00000000 ................
400720 14000000 34000000 b8feffff 02000000 ....4...........
400730 00000000 00000000 24000000 4c000000 ........$...L...
400740 b0feffff 89000000 00518c05 86065f0e .........Q...._.
400750 4083078f 028e038d 0402580e 08000000 #.........X.....
400760 00000000 ....
Contents of section .ctors:
600768 ffffffff ffffffff 00000000 00000000 ................
Contents of section .dtors:
600778 ffffffff ffffffff 00000000 00000000 ................
Contents of section .jcr:
600788 00000000 00000000 ........
Contents of section .dynamic:
600790 01000000 00000000 01000000 00000000 ................
6007a0 01000000 00000000 69000000 00000000 ........i.......
6007b0 01000000 00000000 73000000 00000000 ........s.......
6007c0 01000000 00000000 81000000 00000000 ................
6007d0 0c000000 00000000 28044000 00000000 ........(.#.....
6007e0 0d000000 00000000 b8064000 00000000 ..........#.....
6007f0 04000000 00000000 60024000 00000000 ........`.#.....
600800 05000000 00000000 20034000 00000000 ........ .#.....
600810 06000000 00000000 90024000 00000000 ..........#.....
600820 0a000000 00000000 a9000000 00000000 ................
600830 0b000000 00000000 18000000 00000000 ................
600840 15000000 00000000 00000000 00000000 ................
600850 03000000 00000000 58096000 00000000 ........X.`.....
600860 02000000 00000000 18000000 00000000 ................
600870 14000000 00000000 07000000 00000000 ................
600880 17000000 00000000 10044000 00000000 ..........#.....
600890 07000000 00000000 f8034000 00000000 ..........#.....
6008a0 08000000 00000000 18000000 00000000 ................
6008b0 09000000 00000000 18000000 00000000 ................
6008c0 feffff6f 00000000 d8034000 00000000 ...o......#.....
6008d0 ffffff6f 00000000 01000000 00000000 ...o............
6008e0 f0ffff6f 00000000 ca034000 00000000 ...o......#.....
6008f0 00000000 00000000 00000000 00000000 ................
600900 00000000 00000000 00000000 00000000 ................
600910 00000000 00000000 00000000 00000000 ................
600920 00000000 00000000 00000000 00000000 ................
600930 00000000 00000000 00000000 00000000 ................
600940 00000000 00000000 00000000 00000000 ................
Contents of section .got:
600950 00000000 00000000 ........
Contents of section .got.plt:
600958 90076000 00000000 00000000 00000000 ..`.............
600968 00000000 00000000 56044000 00000000 ........V.#.....
Contents of section .data:
600978 00000000 00000000 00000000 00000000 ................
Contents of section .comment:
0000 4743433a 2028474e 55292034 2e342e37 GCC: (GNU) 4.4.7
0010 20323031 32303331 33202852 65642048 20120313 (Red H
0020 61742034 2e342e37 2d313129 00474343 at 4.4.7-11).GCC
0030 3a202847 4e552920 342e392e 782d676f : (GNU) 4.9.x-go
0040 6f676c65 20323031 35303132 33202870 ogle 20150123 (p
0050 72657265 6c656173 652900 rerelease).
Disassembly of section .init:
0000000000400428 <_init>:
_init():
400428: 48 83 ec 08 sub $0x8,%rsp
40042c: e8 5b 00 00 00 callq 40048c <call_gmon_start>
400431: e8 6a 01 00 00 callq 4005a0 <frame_dummy>
400436: e8 45 02 00 00 callq 400680 <__do_global_ctors_aux>
40043b: 48 83 c4 08 add $0x8,%rsp
40043f: c3 retq
Disassembly of section .plt:
0000000000400440 <__libc_start_main#plt-0x10>:
400440: ff 35 1a 05 20 00 pushq 0x20051a(%rip) # 600960 <_GLOBAL_OFFSET_TABLE_+0x8>
400446: ff 25 1c 05 20 00 jmpq *0x20051c(%rip) # 600968 <_GLOBAL_OFFSET_TABLE_+0x10>
40044c: 0f 1f 40 00 nopl 0x0(%rax)
0000000000400450 <__libc_start_main#plt>:
400450: ff 25 1a 05 20 00 jmpq *0x20051a(%rip) # 600970 <_GLOBAL_OFFSET_TABLE_+0x18>
400456: 68 00 00 00 00 pushq $0x0
40045b: e9 e0 ff ff ff jmpq 400440 <_init+0x18>
Disassembly of section .text:
0000000000400460 <_start>:
_start():
400460: 31 ed xor %ebp,%ebp
400462: 49 89 d1 mov %rdx,%r9
400465: 5e pop %rsi
400466: 48 89 e2 mov %rsp,%rdx
400469: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
40046d: 50 push %rax
40046e: 54 push %rsp
40046f: 49 c7 c0 e0 05 40 00 mov $0x4005e0,%r8
400476: 48 c7 c1 f0 05 40 00 mov $0x4005f0,%rcx
40047d: 48 c7 c7 d0 05 40 00 mov $0x4005d0,%rdi
400484: e8 c7 ff ff ff callq 400450 <__libc_start_main#plt>
400489: f4 hlt
40048a: 90 nop
40048b: 90 nop
000000000040048c <call_gmon_start>:
call_gmon_start():
40048c: 48 83 ec 08 sub $0x8,%rsp
400490: 48 8b 05 b9 04 20 00 mov 0x2004b9(%rip),%rax # 600950 <_DYNAMIC+0x1c0>
400497: 48 85 c0 test %rax,%rax
40049a: 74 02 je 40049e <call_gmon_start+0x12>
40049c: ff d0 callq *%rax
40049e: 48 83 c4 08 add $0x8,%rsp
4004a2: c3 retq
4004a3: 90 nop
4004a4: 90 nop
4004a5: 90 nop
4004a6: 90 nop
4004a7: 90 nop
4004a8: 90 nop
4004a9: 90 nop
4004aa: 90 nop
4004ab: 90 nop
4004ac: 90 nop
4004ad: 90 nop
4004ae: 90 nop
4004af: 90 nop
4004b0: 90 nop
4004b1: 90 nop
4004b2: 90 nop
4004b3: 90 nop
4004b4: 90 nop
4004b5: 90 nop
4004b6: 90 nop
4004b7: 90 nop
4004b8: 90 nop
4004b9: 90 nop
4004ba: 90 nop
4004bb: 90 nop
4004bc: 90 nop
4004bd: 90 nop
4004be: 90 nop
4004bf: 90 nop
00000000004004c0 <deregister_tm_clones>:
deregister_tm_clones():
4004c0: b8 8f 09 60 00 mov $0x60098f,%eax
4004c5: 48 2d 88 09 60 00 sub $0x600988,%rax
4004cb: 48 83 f8 0e cmp $0xe,%rax
4004cf: 76 17 jbe 4004e8 <deregister_tm_clones+0x28>
4004d1: b8 00 00 00 00 mov $0x0,%eax
4004d6: 48 85 c0 test %rax,%rax
4004d9: 74 0d je 4004e8 <deregister_tm_clones+0x28>
4004db: bf 88 09 60 00 mov $0x600988,%edi
4004e0: ff e0 jmpq *%rax
4004e2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
4004e8: f3 c3 repz retq
4004ea: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
00000000004004f0 <register_tm_clones>:
register_tm_clones():
4004f0: be 88 09 60 00 mov $0x600988,%esi
4004f5: 48 81 ee 88 09 60 00 sub $0x600988,%rsi
4004fc: 48 c1 fe 03 sar $0x3,%rsi
400500: 48 89 f0 mov %rsi,%rax
400503: 48 c1 e8 3f shr $0x3f,%rax
400507: 48 01 c6 add %rax,%rsi
40050a: 48 d1 fe sar %rsi
40050d: 74 11 je 400520 <register_tm_clones+0x30>
40050f: b8 00 00 00 00 mov $0x0,%eax
400514: 48 85 c0 test %rax,%rax
400517: 74 07 je 400520 <register_tm_clones+0x30>
400519: bf 88 09 60 00 mov $0x600988,%edi
40051e: ff e0 jmpq *%rax
400520: f3 c3 repz retq
400522: 66 66 66 66 66 2e 0f data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
400529: 1f 84 00 00 00 00 00
0000000000400530 <__do_global_dtors_aux>:
__do_global_dtors_aux():
400530: 80 3d 51 04 20 00 00 cmpb $0x0,0x200451(%rip) # 600988 <__bss_start>
400537: 75 5f jne 400598 <__do_global_dtors_aux+0x68>
400539: 55 push %rbp
40053a: 53 push %rbx
40053b: bb 80 07 60 00 mov $0x600780,%ebx
400540: 48 81 eb 78 07 60 00 sub $0x600778,%rbx
400547: 48 83 ec 08 sub $0x8,%rsp
40054b: 48 8b 05 3e 04 20 00 mov 0x20043e(%rip),%rax # 600990 <dtor_idx.6648>
400552: 48 c1 fb 03 sar $0x3,%rbx
400556: 48 83 eb 01 sub $0x1,%rbx
40055a: 48 8d 6c 24 10 lea 0x10(%rsp),%rbp
40055f: 48 39 d8 cmp %rbx,%rax
400562: 73 22 jae 400586 <__do_global_dtors_aux+0x56>
400564: 0f 1f 40 00 nopl 0x0(%rax)
400568: 48 83 c0 01 add $0x1,%rax
40056c: 48 89 05 1d 04 20 00 mov %rax,0x20041d(%rip) # 600990 <dtor_idx.6648>
400573: ff 14 c5 78 07 60 00 callq *0x600778(,%rax,8)
40057a: 48 8b 05 0f 04 20 00 mov 0x20040f(%rip),%rax # 600990 <dtor_idx.6648>
400581: 48 39 d8 cmp %rbx,%rax
400584: 72 e2 jb 400568 <__do_global_dtors_aux+0x38>
400586: e8 35 ff ff ff callq 4004c0 <deregister_tm_clones>
40058b: c6 05 f6 03 20 00 01 movb $0x1,0x2003f6(%rip) # 600988 <__bss_start>
400592: 48 83 c4 08 add $0x8,%rsp
400596: 5b pop %rbx
400597: 5d pop %rbp
400598: f3 c3 repz retq
40059a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
00000000004005a0 <frame_dummy>:
frame_dummy():
4005a0: bf 88 07 60 00 mov $0x600788,%edi
4005a5: 48 83 3f 00 cmpq $0x0,(%rdi)
4005a9: 75 05 jne 4005b0 <frame_dummy+0x10>
4005ab: e9 40 ff ff ff jmpq 4004f0 <register_tm_clones>
4005b0: b8 00 00 00 00 mov $0x0,%eax
4005b5: 48 85 c0 test %rax,%rax
4005b8: 74 f1 je 4005ab <frame_dummy+0xb>
4005ba: 55 push %rbp
4005bb: 48 89 e5 mov %rsp,%rbp
4005be: ff d0 callq *%rax
4005c0: 5d pop %rbp
4005c1: e9 2a ff ff ff jmpq 4004f0 <register_tm_clones>
4005c6: 90 nop
4005c7: 90 nop
4005c8: 90 nop
4005c9: 90 nop
4005ca: 90 nop
4005cb: 90 nop
4005cc: 90 nop
4005cd: 90 nop
4005ce: 90 nop
4005cf: 90 nop
00000000004005d0 <main>:
main():
4005d0: 31 c0 xor %eax,%eax
4005d2: c3 retq
4005d3: 90 nop
4005d4: 90 nop
4005d5: 90 nop
4005d6: 90 nop
4005d7: 90 nop
4005d8: 90 nop
4005d9: 90 nop
4005da: 90 nop
4005db: 90 nop
4005dc: 90 nop
4005dd: 90 nop
4005de: 90 nop
4005df: 90 nop
00000000004005e0 <__libc_csu_fini>:
__libc_csu_fini():
4005e0: f3 c3 repz retq
4005e2: 66 66 66 66 66 2e 0f data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
4005e9: 1f 84 00 00 00 00 00
00000000004005f0 <__libc_csu_init>:
__libc_csu_init():
4005f0: 48 89 6c 24 d8 mov %rbp,-0x28(%rsp)
4005f5: 4c 89 64 24 e0 mov %r12,-0x20(%rsp)
4005fa: 48 8d 2d 63 01 20 00 lea 0x200163(%rip),%rbp # 600764 <__init_array_end>
400601: 4c 8d 25 5c 01 20 00 lea 0x20015c(%rip),%r12 # 600764 <__init_array_end>
400608: 4c 89 6c 24 e8 mov %r13,-0x18(%rsp)
40060d: 4c 89 74 24 f0 mov %r14,-0x10(%rsp)
400612: 4c 89 7c 24 f8 mov %r15,-0x8(%rsp)
400617: 48 89 5c 24 d0 mov %rbx,-0x30(%rsp)
40061c: 48 83 ec 38 sub $0x38,%rsp
400620: 4c 29 e5 sub %r12,%rbp
400623: 41 89 fd mov %edi,%r13d
400626: 49 89 f6 mov %rsi,%r14
400629: 48 c1 fd 03 sar $0x3,%rbp
40062d: 49 89 d7 mov %rdx,%r15
400630: e8 f3 fd ff ff callq 400428 <_init>
400635: 48 85 ed test %rbp,%rbp
400638: 74 1c je 400656 <__libc_csu_init+0x66>
40063a: 31 db xor %ebx,%ebx
40063c: 0f 1f 40 00 nopl 0x0(%rax)
400640: 4c 89 fa mov %r15,%rdx
400643: 4c 89 f6 mov %r14,%rsi
400646: 44 89 ef mov %r13d,%edi
400649: 41 ff 14 dc callq *(%r12,%rbx,8)
40064d: 48 83 c3 01 add $0x1,%rbx
400651: 48 39 eb cmp %rbp,%rbx
400654: 72 ea jb 400640 <__libc_csu_init+0x50>
400656: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
40065b: 48 8b 6c 24 10 mov 0x10(%rsp),%rbp
400660: 4c 8b 64 24 18 mov 0x18(%rsp),%r12
400665: 4c 8b 6c 24 20 mov 0x20(%rsp),%r13
40066a: 4c 8b 74 24 28 mov 0x28(%rsp),%r14
40066f: 4c 8b 7c 24 30 mov 0x30(%rsp),%r15
400674: 48 83 c4 38 add $0x38,%rsp
400678: c3 retq
400679: 90 nop
40067a: 90 nop
40067b: 90 nop
40067c: 90 nop
40067d: 90 nop
40067e: 90 nop
40067f: 90 nop
0000000000400680 <__do_global_ctors_aux>:
__do_global_ctors_aux():
400680: 55 push %rbp
400681: 48 89 e5 mov %rsp,%rbp
400684: 53 push %rbx
400685: bb 68 07 60 00 mov $0x600768,%ebx
40068a: 48 83 ec 08 sub $0x8,%rsp
40068e: 48 8b 05 d3 00 20 00 mov 0x2000d3(%rip),%rax # 600768 <__CTOR_LIST__>
400695: 48 83 f8 ff cmp $0xffffffffffffffff,%rax
400699: 74 14 je 4006af <__do_global_ctors_aux+0x2f>
40069b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
4006a0: 48 83 eb 08 sub $0x8,%rbx
4006a4: ff d0 callq *%rax
4006a6: 48 8b 03 mov (%rbx),%rax
4006a9: 48 83 f8 ff cmp $0xffffffffffffffff,%rax
4006ad: 75 f1 jne 4006a0 <__do_global_ctors_aux+0x20>
4006af: 48 83 c4 08 add $0x8,%rsp
4006b3: 5b pop %rbx
4006b4: 5d pop %rbp
4006b5: c3 retq
4006b6: 90 nop
4006b7: 90 nop
Disassembly of section .fini:
00000000004006b8 <_fini>:
_fini():
4006b8: 48 83 ec 08 sub $0x8,%rsp
4006bc: e8 6f fe ff ff callq 400530 <__do_global_dtors_aux>
4006c1: 48 83 c4 08 add $0x8,%rsp
4006c5: c3 retq
and the smaller file:
g++ -Wall -O3 -g0 test.cpp -o test.exe -nostdlib
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000400150
test.exe: file format elf64-x86-64
Contents of section .note.gnu.build-id:
400120 04000000 14000000 03000000 474e5500 ............GNU.
400130 d4b1e35c 21d1f541 b81d3ac9 d62bac7a ...\!..A..:..+.z
400140 606b1ad4 `k..
Contents of section .text:
400150 31c0c3 1..
Contents of section .eh_frame_hdr:
400154 011b033b 10000000 01000000 fcffffff ...;............
400164 2c000000 ,...
Contents of section .eh_frame:
400168 14000000 00000000 017a5200 01781001 .........zR..x..
400178 1b0c0708 90010000 14000000 1c000000 ................
400188 c8ffffff 03000000 00000000 00000000 ................
Contents of section .comment:
0000 4743433a 2028474e 55292034 2e392e78 GCC: (GNU) 4.9.x
0010 2d676f6f 676c6520 32303135 30313233 -google 20150123
0020 20287072 6572656c 65617365 2900 (prerelease).
Disassembly of section .text:
0000000000400150 <main>:
main():
400150: 31 c0 xor %eax,%eax
400152: c3 retq
It's worth noting that this executable doesn't work, it segfaults: to make it work, we'd actually have to implement _start instead of main.
We can see here that the bulk of the larger executable is glue code that deals with loading the dynamic library and preparing the broader environment required by the standard library.
--- EDIT ---
Even our smaller code still has to include exception handling, ctor/dtor support for globals, and so forth. It could probably elide such things and if you dig deeply enough you can probably find ways to elide them, but in general you probably don't need to, and it is probably easier to always include such basic support than to have the majority of new programmers stumbling over "how do I force the compiler to emit basic language support" than have a handful of new embedded programmers asking "how can I prevent the compiler emitting basic language support?".
Note also that the compiler generates ELF format binaries, this is a small contribution (maybe ~60bytes), plus emitting it's own identity added some size. But the bulk of the smaller binary is language support (EH and CTOR/DTOR).
Compiling with #include <iostream> and -O3 -g0 produces a 7625 byte binary, if I compile that with -O0 -g3 it produces a 64Kb binary most of which is text describing symbols from the STL.
Your executable is including the C runtime, which knows how to do things like get the environment, setup the argv vector, and close all open files after calling exit() but before calling _exit().
There are many things which could affect the final file size during compilation, as other posters have pointed out.
Dissecting your specific example is more work than I'm willing to put in, but I know of a similar example from many years ago that should help you to understand the general problem, and guide you towards finding the specific answer you seek.
http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html
This is done in C (rather than C++) using GCC, looking at the size of the ELF executable (not a Windows EXE), but as I said many of the same problems apply. In this case, the author looks at just return 42;
After you've read that document, consider that printing to stdout is considerably more complex than just returning a number. Also, since you are using C++ and cout <<, there's a lot of code hiding in there that you didn't write, and you can't really know how it's implemented without looking at that source.
people keep ignoring/forgetting that executables created in high level languages need engine to run properly. for example C++ engine is responsible for things like:
heap/stack management
when you call new,delete you are not actually accessing OS functions
instead the engine use its own allocated heap memory
so engine has it own memory management that takes code/space
local variables memory management
each time you call any function all the local variables must be allocated
and released before exiting it
classes/templates
to handle these properly you need quite a lot of code
In addition to this you have to link all the stuff you use like:
RTL most executables nowdays MSVCPP and MSVB does not link them so we need to install huge amount of RTLs in system to make exe to even run. but still the linking to used DLL's must be present in executable (see DLL linking on runtime)
debug info
frameworks linkage (similar to RTL you need the code to bind to frameworks libs too)
for high level winows/forms IDE's you also have the window engine present
included libs and linked objs (iostream classes and operators even if you use just << you need much more of them to make it work...)
You can look at the C++ engine as a small operating system within operating system
in standalone MCU apps they are really the OS itself
Another space is occupied by the executable format (like PE), and also code aligns add some space
When you put all these together then the 26KB is not so insane anymore
Compilers are not omnipotent.
std::cout is a stream object, with a set of data members for managing a buffer (allocating it, copying data to it and, when the stream is destroyed, releasing it).
The operator<< is implemented as an overloaded function which interprets its arguments and - when supplied a string - copies data to the buffer, with some logic that potentially flushes the buffer when it is full.
std::endl is actually an function which - in cooperation with all versions of a stream's operator<<() - affects data owned by the stream. Specifically, it inserts a newline into the streams buffer, and then flushes the buffer.
Flushing the stream's buffer calls other functions that copy data from the buffer to the standard output device (say, the screen).
All of the above is what the statement std::cout<<"Hello World"<<std::endl does.
In addition, as a C++ program, there is a certain amount of code that must be executed before main() is even called. This includes checking if the program was run with command line arguments, creating streams like std::cout, std::cerr, std::cin (there are others) ensuring those streams are connected with relevant devices (like the terminal, or pipes, or whatever). When main() returns, it is then necessary to release all the streams created (and flush their buffers), and things like that.
All of the above involves invoking other functionality. Creating a buffer for the stream means that buffer must be allocated and - after main() returns - released.
The specification of C++ streams also involves error checking. The allocation of std::cout's buffer might fail (e.g. if the host system doesn't have much free memory). The standard output device might be redirected to a file, which has limited capacity - so writing data to it might fail. All of those things must be checked for and handled gracefully.
All of this stuff will be in this 26K executable (unless that code is in runtime libraries).
In principle, the compiler can recognise that the program is not using its command line arguments (so not include code to manage command line arguments), is only writing to std::cout (so no need to create all the other streams before main() and release them after main() returns), is only using two overloaded versions of operator<<() and one stream manipulator (so the linker need not include code for all other member functions of the stream). It might also recognise that the statement writes data to the stream and immediately flushes the buffer - and thereby eliminate std::cout's buffer and all code that manages it. If the compiler can read the programmer's mind (few compilers can, in practice) it might work out that none of the buffers are actually needed, that the user will never run the program with standard output redirected, etc - and eliminate the code and data structures associated with all those things.
So, how would a compiler recognise that all those things aren't needed? Compilers are software, so they have to conduct some level of analysis on their inputs (e.g. source files). The analysis to eliminate all the code that a human might deem unnecessary is significant - so would take time. If the compiler doesn't do the analysis, potentially the linker might. Whether that analysis to eliminate unnecessary code is done by the compiler or linker is irrelevant - it takes time. Potentially significant time.
Programmers tend to be impatient. Very few programmers would tolerate a build process for a simple "hello world" program that took more than a few seconds (maybe they will tolerate a minute, but not much more).
That leaves compiler vendors with a decision. They can get their programmers to design and implement all sorts of analysis to eliminate unwanted code. That will add weeks - or, if they are working to a tight deadline, months - to implement, validate, verify, and ship a working compiler to customers (other developers). That compiler will be painfully slow at compiling code. Instead, vendors (and their developers) choose to implement less of that analysis in their compiler, so they can actually ship a working compiler to developers who will use it within a reasonable time. This compiler will produce an executable in a time that is somewhat tolerable for most programmers (say, under a minute for a "hello world" program). So what if the executable is larger? It will work. Hardware (e.g. drives) is relatively inexpensive and developer effort is relatively expensive.
It's very old question. It have clear answer. The most problem is that one have to write many small pieces of information and make many small test which demonstrates different aspects of PE structures. I try to skip details and to describe the main parts of the problem based on Microsoft Visual Studio, which I know and use since many years. All other compilers do mostly the same, and I suppose that one need use just a little other options of compiler and linker.
First of all I suggest you to set breakpoint on the first line of the main, start debugging and to examine the Call Stack windows of the debugger. You will see something like
So the first thing, which is very important to understand, the main is not the first function which will be called in your program. The entry point of the program is mainCRTStartup, which calls __tmainCRTStartup, which calls main.
The CRT Startup code make many small things. One thing is very easy to understand: it uses GetCommandLineW Windows API to get the command line and parse the parameters, then it calls main with the parameters.
To reduce the size of the code there are two common approach:
use CRT from DLL
remove CRT from the EXE if it's not really used in the code.
It's very helpful if you start cmd.exe using "VS2013 x64 Native Tools Command Prompt" (or some close command prompt). Some additional paths will be set inside of the command prompt and you can use for example dumpbin.exe utility.
If you would use Multi-threaded DLL (/MD) compiler option then you will get 7K large exe file. "dumpbin /imports HelloWorld.exe" will show you that your program uses "MSVCR120.dll" together with "KERNEL32.dll".
Removing of CRT depends on the version of c/cpp compiler (the version of Visual Studio) which you use and even from the extension of the file: .c or .cpp. I understand your question as the common question for understanding the problem. So I suggest to start with the most simple case, rename .cpp file .c and the beginning and to modify the code to the following
#include <Windows.h>
int mainCRTStartup()
{
return 0;
}
One can see now
C:\Oleg\StackOverflow\HelloWorld\Release>dir HelloWorld.exe
Volume in drive C has no label.
Volume Serial Number is 4CF9-FADF
Directory of C:\Oleg\StackOverflow\HelloWorld\Release
21.06.2015 12:56 3.584 HelloWorld.exe
1 File(s) 3.584 bytes
0 Dir(s) 16.171.196.416 bytes free
C:\Oleg\StackOverflow\HelloWorld\Release>dumpbin HelloWorld.exe
Microsoft (R) COFF/PE Dumper Version 12.00.31101.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file HelloWorld.exe
File Type: EXECUTABLE IMAGE
Summary
1000 .data
1000 .rdata
1000 .reloc
1000 .rsrc
1000 .text
One can add the linker option /MERGE:.rdata=.text to reduce the size and to remove one section
C:\Oleg\StackOverflow\HelloWorld\Release>dir HelloWorld.exe
Volume in drive C has no label.
Volume Serial Number is 4CF9-FADF
Directory of C:\Oleg\StackOverflow\HelloWorld\Release
21.06.2015 18:44 3.072 HelloWorld.exe
1 File(s) 3.072 bytes
0 Dir(s) 16.170.852.352 bytes free
C:\Oleg\StackOverflow\HelloWorld\Release>dumpbin HelloWorld.exe
Microsoft (R) COFF/PE Dumper Version 12.00.31101.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file HelloWorld.exe
File Type: EXECUTABLE IMAGE
Summary
1000 .data
1000 .reloc
1000 .rsrc
1000 .text
To have "Hello World" program I suggest to modify the code to
#include <Windows.h>
int mainCRTStartup()
{
LPCTSTR pszString = TEXT("Hello world");
DWORD cbWritten;
WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), pszString, lstrlen(pszString), &cbWritten, NULL);
return 0;
}
One can easy verify that the code work and it's still small.
To remove CRT from .cpp file I suggest to follow the following steps. First of all we would use the following HelloWorld.cpp code
#include <Windows.h>
int mainCRTStartup()
{
LPCTSTR pszString = TEXT("Hello world");
DWORD cbWritten;
WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), pszString, lstrlen(pszString), &cbWritten, NULL);
return 0;
}
It's important that one verify some compiler and linker options and set/remove someone. I included the settings on the pictures below:
The last screen shows that we remove binding to default libraries which we don't need. The compiler uses directive like #pragma comment(lib, "some.lib") to inject usage of some libraries. By usage the options /NODEFAULTLIB we remove such libs and the exe will be compiled like we need.
One will see that the resulting HelloWorld.exe have only 3K (3.072 bytes) and there are exist dependency to one KERNEL32.dll only:
C:\Oleg\StackOverflow\HelloWorld\Release>dumpbin /imports HelloWorld.exe
Microsoft (R) COFF/PE Dumper Version 12.00.31101.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file HelloWorld.exe
File Type: EXECUTABLE IMAGE
Section contains the following imports:
KERNEL32.dll
402000 Import Address Table
402038 Import Name Table
0 time date stamp
0 Index of first forwarder reference
60B lstrlenW
5E0 WriteConsoleW
2C0 GetStdHandle
Summary
1000 .idata
1000 .reloc
1000 .rsrc
1000 .text
One can download the corresponding Visual Studio 2013 demo project from here. One need switch from default "Debug" compiling to "Release" and rebuild solution. One will have working HelloWorld.exe which length is 3K.
This does show how hard it can be to write a program with identical semantics.
<<std::endl will flush a stream if that stream is good(). That means the whole error handling code of ostream must be present.
Also, std::cout could have its streambuf swapped out from under it. The compiler cannot know it's actually going to STDOUT_FILENO. It has to use the whole streambuf intermediate layer.