Why does data/BSS size change in these different cases?

Why does data/BSS size change in these different cases? - c++

I'm learning about virtual memory management and memory allocation of process. And do some experiment about it. There are some confusing points as below:
case1
#include <iostream>
int main() {
return 0;
}
After compile the code and run size with the binary and got the following output:
text data bss dec hex filename
1985 640 8 2633 a49 main
case2:
Change the code to:
#include <iostream>
int global;
int main() {
return 0;
}
and rebuild it and the size output is:
text data bss dec hex filename
1985 640 16 2641 a51 main
Note: data memory part is not changed, but bss changed from 8 to 16. This result make sense to me, since int global define a uninitialized global variable.
case3:
Then I change the code to:
#include <iostream>
int global = 5;
int main() {
return 0;
}
I initialized the global variable. And analyze the binary again:
text data bss dec hex filename
1985 644 4 2633 a49 main
This time, the change doen't make sense to me. Compared to case1, why data part increase by 4 bytes and bss part decrease by 4?

The discrepancy is caused by the linker, and how it decides to layout the binary. The G++ default linker on x86 (GNU ld).
To see the difference, let's compile your test cases in three different ways:
$ for x in *.cpp ; do g++ -g3 $x -o $x.ld.exe ; done # GNU ld
$ for x in *.cpp ; do g++ -g3 $x -o $x.gold.exe -fuse-ld=gold ; done # gold
$ for x in *.cpp ; do g++ -g3 $x -o $x.gold.exe -fuse-ld=lld ; done # lld
Now, the sizes are as follows:
text data bss dec hex filename
1873 656 8 2537 9e9 test1.cpp.ld.exe
1873 656 16 2545 9f1 test2.cpp.ld.exe
1873 660 4 2537 9e9 test3.cpp.ld.exe
text data bss dec hex filename
1877 656 2 2535 9e7 test1.cpp.gold.exe
1877 656 9 2542 9ee test2.cpp.gold.exe
1877 660 2 2539 9eb test3.cpp.gold.exe
text data bss dec hex filename
1817 576 2 2395 95b test1.cpp.lld.exe
1817 576 9 2402 962 test2.cpp.lld.exe
1817 580 2 2399 95f test3.cpp.lld.exe
The behavior of the bss section can be explained as follows:
There are two or three variables to store in here (completed.0, possibly global and std::__ioinit). The two that are always there, have a size and alignment of 1 byte each. The global that is only there in case 2, has size and alignment of 4 bytes. This is why both gold and lld require 2 bytes for the bss section in cases 1 and 3. For case 2, the layout is such that global is stored in between the two single-byte values, which requires an additional 3 bytes of padding between the first 1-byte value and global.
Generally speaking, GNU ld does something similar, except it uses the bss section to also provide inter-section padding. In case 1 and 3, only 2 bytes of the bss section are actually used, while case 2 uses 9 bytes of the bss section. Since data grows by 4 bytes in case 3 and bss is layouted after data, 4 of these padding bytes are removed to keep the offset of the following sections aligned. This is also why the size of the bss section grows by 8 bytes rather than 4 upon introducing global: GNU ld ensures that the end of the section remains 8-byte aligned.

The much shorter version is: .bss section is mapped to the zero page with copy-on-write — any variable that is not explicitly initialized, will be added to .bss, which does not actually exist on the ELF file, and is only going to be allocated when changing the content of one of its variables (I'm oversimplifying — the allocation happen per-page so there's more details with that).
When you give it a default value, it can't be mapped to the zero page: its value needs to be present in the ELF file itself, and mapped into memory (again, as copy-on-write, so if you then change the value, a new page gets allocated with the modified value of your variable). If you changed that to const int global = 5; it would have been added to .rodata and you would have seen the size of .text increase instead.
I've blogged about the size command in the past: https://flameeyes.blog/2008/12/01/for-elves-size-matters/

Related

Why is a C++ Hello World binary larger than the equivalent C binary?

In his FAQ, Bjarne Stroustrup says that when compiled with gcc -O2, the file size of a hello world using C and C++ are identical.
Reference: http://www.stroustrup.com/bs_faq.html#Hello-world
I decided to try this, here is the C version:
#include <stdio.h>
int main(int argc, char* argv[])
{
printf("Hello world!\n");
return 0;
}
And here is the C++ version
#include <iostream>
int main(int argc, char* argv[])
{
std::cout << "Hello world!\n";
return 0;
}
Here I compile, and the sizes are different:
r00t#wutdo:~/hello$ ls
hello.c hello.cpp
r00t#wutdo:~/hello$ gcc -O2 hello.c -o c.out
r00t#wutdo:~/hello$ g++ -O2 hello.cpp -o cpp.out
r00t#wutdo:~/hello$ ls -l
total 32
-rwxr-xr-x 1 r00t r00t 8559 Sep 1 18:00 c.out
-rwxr-xr-x 1 r00t r00t 8938 Sep 1 18:01 cpp.out
-rw-r--r-- 1 r00t r00t 95 Sep 1 17:59 hello.c
-rw-r--r-- 1 r00t r00t 117 Sep 1 17:59 hello.cpp
r00t#wutdo:~/hello$ size c.out cpp.out
text data bss dec hex filename
1191 560 8 1759 6df c.out
1865 608 280 2753 ac1 cpp.out
I replaced std::endl with \n and it made the binary smaller. I figured something this simple would be inlined, and am dissapointed it's not.
Also wow, the optimized assemblies have hundreds of lines of assembly output? I can write hello world with like 5 assembly instructions using sys_write, what's up with all the extra stuff? Why does C put some much extra on the stack to setup? I mean, like 50 bytes of assembly vs 8kb of C, why?

You're looking at a mix of information that's easily misinterpreted. The 8559 and 8938 byte file sizes are largely meaningless since they're mostly headers with symbol names and other misc information for at least minimal debugging purposes. The somewhat meaningful numbers are the size(1) output you added later:
r00t#wutdo:~/hello$ size c.out cpp.out
text data bss dec hex filename
1191 560 8 1759 6df c.out
1865 608 280 2753 ac1 cpp.out
You could get a more detailed breakdown by using the -A option to size, but in short, the differences here are fairly trivial.
What's more interesting is that Bjarne Stroustrup never mentioned whether he was talking about static or dynamic linking. In your case, both programs are dynamic-linked, so the size differences have nothing to do with the actual size cost of stdio or iostream; you're just measuring the cost of the calling code, or (more likely, based on the other comments/answer) the base overhead of exception-handling support for C++. Now, there is a common claim that a static-linked C++ iostream-based hello world can be even smaller than a printf-based one, since the compiler can see exactly which overloaded versions of operator<< are used and optimize out unneeded code (such as expensive floating point printing), whereas printf's use of format strings makes this difficult in the common case and impossible in general. However, I've never seen a C++ implementation where a static-linked iostream-based hello program could come anywhere near close to being as small as, much less smaller than, a printf-based one in C.

I think he's treating the half kilobyte as a rounding error. Both are "9 kilobytes" and that's what you'll see in a typical file browser. They aren't exactly the same because, under the hood, the C and C++ libraries are quite different. If you're already familiar with your disassembler, you can see the details of the difference for yourself.
The "extra stuff" is for the sake of importing symbols from the standard library shlib, and handling C++ exceptions. Strangely enough, much of the GCC-compiled C executable is taken up by C++ exception handling tables. I've not figured out how to strip them using GCC.
endl is inlined, but it contains calls to print the \n character and flush the stream, which are not inlined. The difference in size is due to importing those from the standard library.
In truth, individual kilobytes seldom matter on any system with dynamically-loaded libraries. Self-contained code such as on an embedded system would need to include the standard library functionality it uses, and the C++ standard library tends to be heavier than its C counterpart — <iostream> vs. <stdio.h> in particular.

This code seems to append characters outside allocated range

I'm playing with some basic stuff of cpp. I'm new in this language... so I'm warning that my question maybe was not correctly formulated. I appreciate any help.
The thing is that after saw the example in www.cplusplus.com/reference/cstdlib/malloc/ I found my self with this code:
#include <stdio.h>
int main (void) {
char *str;
str = (char*) malloc(2);
str[0] ='8';
str[1] ='8';
str[2] ='6';
str[3] ='\0';
printf ("%s\n",str);
}
And compiling with:
gcc -O0 -pedantic -Wall test2.cpp
(gcc version 4.7.2)
I get no errors and the output 886. Why I get no errors? Have I not passed the boundary of the allocated space?
I didn't get no errors and I got the output 886. Why no errors? Have I not passed the boundary of the allocated space?
In the case that code is ok... Why the example in the reference?
In the other (more probable) case... What are the risks?
Thanks!

You don't get any errors because C and C++ don't do bounds checking. You overwrote sections of memory that you weren't using, but you got lucky and it wasn't anything important. Compare it to putting a row of nails into a wall where you know there's a stud. If you miss the stud, most of the time, you just put a hole in the plaster, but it's dangerous to keep doing it because eventually, you're going to hit one of the live wires instead.

You have passed over the boundary of the allocated memory.
However, printf does not bother what size of a memory you have declared. All it cares is it will start from the start and continue till it finds a 0.
The case you created is an undefined behaviour. There can be some other data right after your allocated region (maybe another variable) in which case it will get corrupted. If the next part is unallocated memory you might escape without a visible problem. And if the memory right after your allocated memory belongs to another process, you will see the nice and tidy Segmentation Fault. The consequences can be even worse, so better not try this anywhere.

the following can be found in comments in malloc.c of glibc:
Minimum overhead per allocated chunk: 4 or 8 bytes Each malloced
chunk has a hidden word of overhead holding size and status
information.
Minimum allocated size: 4-byte ptrs: 16 bytes (including 4
overhead)
8-byte ptrs: 24/32 bytes (including, 4/8 overhead)
When a chunk is freed, 12 (for 4byte ptrs) or 20 (for 8 byte
ptrs but 4 byte size) or 24 (for 8/8) additional bytes are needed;
4 (8) for a trailing size field and 8 (16) bytes for free list
pointers. Thus, the minimum allocatable size is 16/24/32 bytes.
Since minimum allocated size would be 16/24/32, since it is greater than 3 bytes your program ran without errors. This is one of the possibility executing your program correctly.

How do I get MSVC to put uninitialized data in .bss?

I am building a DLL using a custom build system (outside Visual Studio), and I can't get uninitialized data to show up in the .bss section; the compiler lumps it into .data. This bloats the final binary size, since it's full of giant arrays of zeroes.
For example (small 1KB arrays in the example, but the actual buffers are much larger):
int uninitialized[1024];
int initialized[1024] = { 123 };
The compiler emits assembly like this:
PUBLIC _initialized
_DATA SEGMENT
COMM _uninitialized:DWORD:0400H
_initialized DD 07bH
ORG $+4092
_DATA ENDS
Which ends up in the object file like this:
SECTION HEADER #3
.data name
0 physical address
0 virtual address
1000 size of raw data
147 file pointer to raw data (00000147 to 00001146)
0 file pointer to relocation table
0 file pointer to line numbers
0 number of relocations
0 number of line numbers
C0400040 flags
Initialized Data
8 byte align
Read Write
(There is no .bss section.)
The current compilation flags:
cl -nologo -c -FAsc -Faobjs\ -W4 -WX -X -J -EHs-c- -GR- -Gy -GS- -O1 -Os -Foobjs\file.o file.cpp
I have looked through the list of options at http://msdn.microsoft.com/en-us/library/fwkeyyhe(v=vs.71).aspx but I haven't spotted anything obvious.
I'm using the compiler from Visual Studio 2008 SP1 (Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86).

You want to use __declspec(allocate()), which you can read up on here: http://msdn.microsoft.com/en-us/library/5bkb2w6t(v=vs.80).aspx

Notice that "size of raw data" is only 0x1000 or 4kB - exactly the size of your initialized array only. The VirtualSize of your .data section will be larger than the size of the actual data stored in the binary image and your uninitialized array will occupy the slack space. Using the bss_seg pragma will force the linker to place your uninitialized data into its own separate section.

Yo can try using bss_seg pragma if you aren't concerned about portability.

C/C++ maximum stack size of program on mainstream OSes

I want to do DFS on a 100 X 100 array. (Say elements of array represents graph nodes) So assuming worst case, depth of recursive function calls can go upto 10000 with each call taking upto say 20 bytes. So is it feasible means is there a possibility of stackoverflow?
What is the maximum size of stack in C/C++?
Please specify for gcc for both
1) cygwin on Windows
2) Unix
What are the general limits?

In Visual Studio the default stack size is 1 MB i think, so with a recursion depth of 10,000 each stack frame can be at most ~100 bytes which should be sufficient for a DFS algorithm.
Most compilers including Visual Studio let you specify the stack size. On some (all?) linux flavours the stack size isn't part of the executable but an environment variable in the OS. You can then check the stack size with ulimit -s and set it to a new value with for example ulimit -s 16384.
Here's a link with default stack sizes for gcc.
DFS without recursion:
std::stack<Node> dfs;
dfs.push(start);
do {
Node top = dfs.top();
if (top is what we are looking for) {
break;
}
dfs.pop();
for (outgoing nodes from top) {
dfs.push(outgoing node);
}
} while (!dfs.empty())

Stacks for threads are often smaller.
You can change the default at link time,
or change at run time also.
For reference, some defaults are:
glibc i386, x86_64: 7.4 MB
Tru64 5.1: 5.2 MB
Cygwin: 1.8 MB
Solaris 7..10: 1 MB
MacOS X 10.5: 460 KB
AIX 5: 98 KB
OpenBSD 4.0: 64 KB
HP-UX 11: 16 KB

Platform-dependent, toolchain-dependent, ulimit-dependent, parameter-dependent.... It is not at all specified, and there are many static and dynamic properties that can influence it.

Yes, there is a possibility of stack overflow. The C and C++ standard do not dictate things like stack depth, those are generally an environmental issue.
Most decent development environments and/or operating systems will let you tailor the stack size of a process, either at link or load time.
You should specify which OS and development environment you're using for more targeted assistance.
For example, under Ubuntu Karmic Koala, the default for gcc is 2M reserved and 4K committed but this can be changed when you link the program. Use the --stack option of ld to do that.

I just ran out of stack at work, it was a database and it was running some threads, basically the previous developer had thrown a big array on the stack, and the stack was low anyway. The software was compiled using Microsoft Visual Studio 2015.
Even though the thread had run out of stack, it silently failed and continued on, it only stack overflowed when it came to access the contents of the data on the stack.
The best advice i can give is to not declare arrays on the stack - especially in complex applications and particularly in threads, instead use heap. That's what it's there for ;)
Also just keep in mind it may not fail immediately when declaring the stack, but only on access. My guess is that the compiler declares stack under windows "optimistically", i.e. it will assume that the stack has been declared and is sufficiently sized until it comes to use it and then finds out that the stack isn't there.
Different operating systems may have different stack declaration policies. Please leave a comment if you know what these policies are.

I am not sure what you mean by doing a depth first search on a rectangular array, but I assume you know what you are doing.
If the stack limit is a problem you should be able to convert your recursive solution into an iterative solution that pushes intermediate values onto a stack which is allocated from the heap.

(Added 26 Sept. 2020)
On 24 Oct. 2009, as #pixelbeat first pointed out here, Bruno Haible empirically discovered the following default thread stack sizes for several systems. He said that in a multithreaded program, "the default thread stack size is" as follows. I added in the "Actual" size column because #Peter.Cordes indicates in his comments below my answer, however, that the odd tested numbers shown below do not include all of the thread stack, since some of it was used in initialization. If I run ulimit -s to see "the maximum stack size" that my Linux computer is configured for, it outputs 8192 kB, which is exactly 8 MB, not the odd 7.4 MB listed in the table below for my x86-64 computer with the gcc compiler and glibc. So, you can probably add a little to the numbers in the table below to get the actual full stack size for a given thread.
Note also that the below "Tested" column units are all in MB and KB (base 1000 numbers), NOT MiB and KiB (base 1024 numbers). I've proven this to myself by verifying the 7.4 MB case.
Thread stack sizes
System and std library Tested Actual
---------------------- ------ ------
- glibc i386, x86_64 7.4 MB 8 MiB (8192 KiB, as shown by `ulimit -s`)
- Tru64 5.1 5.2 MB ?
- Cygwin 1.8 MB ?
- Solaris 7..10 1 MB ?
- MacOS X 10.5 460 KB ?
- AIX 5 98 KB ?
- OpenBSD 4.0 64 KB ?
- HP-UX 11 16 KB ?
Bruno Haible also stated that:
32 KB is more than you can safely allocate on the stack in a multithreaded program
And he said:
And the default stack size for sigaltstack, SIGSTKSZ, is
only 16 KB on some platforms: IRIX, OSF/1, Haiku.
only 8 KB on some platforms: glibc, NetBSD, OpenBSD, HP-UX, Solaris.
only 4 KB on some platforms: AIX.
Bruno
He wrote the following simple Linux C program to empirically determine the above values. You can run it on your system today to quickly see what your maximum thread stack size is, or you can run it online on GDBOnline here: https://onlinegdb.com/rkO9JnaHD.
Explanation: It simply creates a single new thread, so as to check the thread stack size and NOT the program stack size, in case they differ, then it has that thread repeatedly allocate 128 bytes of memory on the stack (NOT the heap), using the Linux alloca() call, after which it writes a 0 to the first byte of this new memory block, and then it prints out how many total bytes it has allocated. It repeats this process, allocating 128 more bytes on the stack each time, until the program crashes with a Segmentation fault (core dumped) error. The last value printed is the estimated maximum thread stack size allowed for your system.
Important note: alloca() allocates on the stack: even though this looks like dynamic memory allocation onto the heap, similar to a malloc() call, alloca() does NOT dynamically allocate onto the heap. Rather, alloca() is a specialized Linux function to "pseudo-dynamically" (I'm not sure what I'd call this, so that's the term I chose) allocate directly onto the stack as though it was statically-allocated memory. Stack memory used and returned by alloca() is scoped at the function-level, and is therefore "automatically freed when the function that called alloca() returns to its caller." That's why its static scope isn't exited and memory allocated by alloca() is NOT freed each time a for loop iteration is completed and the end of the for loop scope is reached. See man 3 alloca for details. Here's the pertinent quote (emphasis added):
DESCRIPTION
The alloca() function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca() returns to its caller.
RETURN VALUE
The alloca() function returns a pointer to the beginning of the allocated space. If the allocation causes stack overflow, program behavior is undefined.
Here is Bruno Haible's program from 24 Oct. 2009, copied directly from the GNU mailing list here:
Again, you can run it live online here.
// By Bruno Haible
// 24 Oct. 2009
// Source: https://lists.gnu.org/archive/html/bug-coreutils/2009-10/msg00262.html
// =============== Program for determining the default thread stack size =========
#include <alloca.h>
#include <pthread.h>
#include <stdio.h>
void* threadfunc (void*p) {
int n = 0;
for (;;) {
printf("Allocated %d bytes\n", n);
fflush(stdout);
n += 128;
*((volatile char *) alloca(128)) = 0;
}
}
int main()
{
pthread_t thread;
pthread_create(&thread, NULL, threadfunc, NULL);
for (;;) {}
}
When I run it on GDBOnline using the link above, I get the exact same results each time I run it, as both a C and a C++17 program. It takes about 10 seconds or so to run. Here are the last several lines of the output:
Allocated 7449856 bytes
Allocated 7449984 bytes
Allocated 7450112 bytes
Allocated 7450240 bytes
Allocated 7450368 bytes
Allocated 7450496 bytes
Allocated 7450624 bytes
Allocated 7450752 bytes
Allocated 7450880 bytes
Segmentation fault (core dumped)
So, the thread stack size is ~7.45 MB for this system, as Bruno mentioned above (7.4 MB).
I've made a few changes to the program, mostly just for clarity, but also for efficiency, and a bit for learning.
Summary of my changes:
[learning] I passed in BYTES_TO_ALLOCATE_EACH_LOOP as an argument to the threadfunc() just for practice passing in and using generic void* arguments in C.
Note: This is also the required function prototype, as required by the pthread_create() function, for the callback function (threadfunc() in my case) passed to pthread_create(). See: https://www.man7.org/linux/man-pages/man3/pthread_create.3.html.
[efficiency] I made the main thread sleep instead of wastefully spinning.
[clarity] I added more-verbose variable names, such as BYTES_TO_ALLOCATE_EACH_LOOP and bytes_allocated.
[clarity] I changed this:
*((volatile char *) alloca(128)) = 0;
to this:
volatile uint8_t * byte_buff =
(volatile uint8_t *)alloca(BYTES_TO_ALLOCATE_EACH_LOOP);
byte_buff[0] = 0;
Here is my modified test program, which does exactly the same thing as Bruno's, and even has the same results:
You can run it online here, or download it from my repo here. If you choose to run it locally from my repo, here's the build and run commands I used for testing:
Build and run it as a C program:
mkdir -p bin && \
gcc -Wall -Werror -g3 -O3 -std=c11 -pthread -o bin/tmp \
onlinegdb--empirically_determine_max_thread_stack_size_GS_version.c && \
time bin/tmp
Build and run it as a C++ program:
mkdir -p bin && \
g++ -Wall -Werror -g3 -O3 -std=c++17 -pthread -o bin/tmp \
onlinegdb--empirically_determine_max_thread_stack_size_GS_version.c && \
time bin/tmp
It takes < 0.5 seconds to run locally on a fast computer with a thread stack size of ~7.4 MB.
Here's the program:
// =============== Program for determining the default thread stack size =========
// Modified by Gabriel Staples, 26 Sept. 2020
// Originally by Bruno Haible
// 24 Oct. 2009
// Source: https://lists.gnu.org/archive/html/bug-coreutils/2009-10/msg00262.html
#include <alloca.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h> // sleep
/// Thread function to repeatedly allocate memory within a thread, printing
/// the total memory allocated each time, until the program crashes. The last
/// value printed before the crash indicates how big a thread's stack size is.
///
/// Note: passing in a `uint32_t` as a `void *` type here is for practice,
/// to learn how to pass in ANY type to a func by using a `void *` parameter.
/// This is also the required function prototype, as required by the
/// `pthread_create()` function, for the callback function (this function)
/// passed to `pthread_create()`. See:
/// https://www.man7.org/linux/man-pages/man3/pthread_create.3.html
void* threadfunc(void* bytes_to_allocate_each_loop)
{
const uint32_t BYTES_TO_ALLOCATE_EACH_LOOP =
*(uint32_t*)bytes_to_allocate_each_loop;
uint32_t bytes_allocated = 0;
while (true)
{
printf("bytes_allocated = %u\n", bytes_allocated);
fflush(stdout);
// NB: it appears that you don't necessarily need `volatile` here,
// but you DO definitely need to actually use (ex: write to) the
// memory allocated by `alloca()`, as we do below, or else the
// `alloca()` call does seem to get optimized out on some systems,
// making this whole program just run infinitely forever without
// ever hitting the expected segmentation fault.
volatile uint8_t * byte_buff =
(volatile uint8_t *)alloca(BYTES_TO_ALLOCATE_EACH_LOOP);
byte_buff[0] = 0;
bytes_allocated += BYTES_TO_ALLOCATE_EACH_LOOP;
}
}
int main()
{
const uint32_t BYTES_TO_ALLOCATE_EACH_LOOP = 128;
pthread_t thread;
pthread_create(&thread, NULL, threadfunc,
(void*)(&BYTES_TO_ALLOCATE_EACH_LOOP));
while (true)
{
const unsigned int SLEEP_SEC = 10000;
sleep(SLEEP_SEC);
}
return 0;
}
Sample output (same results as Bruno Haible's original program):
bytes_allocated = 7450240
bytes_allocated = 7450368
bytes_allocated = 7450496
bytes_allocated = 7450624
bytes_allocated = 7450752
bytes_allocated = 7450880
Segmentation fault (core dumped)

Do .bss section zero initialized variables occupy space in elf file?

If I understand correctly, the .bss section in ELF files is used to allocate space for zero-initialized variables. Our tool chain produces ELF files, hence my question: does the .bss section actually have to contain all those zeroes? It seems such an awful waste of spaces that when, say, I allocate a global ten megabyte array, it results in ten megabytes of zeroes in the ELF file. What am I seeing wrong here?

Has been some time since i worked with ELF. But i think i still remember this stuff. No, it does not physically contain those zeros. If you look into an ELF file program header, then you will see each header has two numbers: One is the size in the file. And another is the size as the section has when allocated in virtual memory (readelf -l ./a.out):
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000034 0x08048034 0x08048034 0x000e0 0x000e0 R E 0x4
INTERP 0x000114 0x08048114 0x08048114 0x00013 0x00013 R 0x1
[Requesting program interpreter: /lib/ld-linux.so.2]
LOAD 0x000000 0x08048000 0x08048000 0x00454 0x00454 R E 0x1000
LOAD 0x000454 0x08049454 0x08049454 0x00104 0x61bac RW 0x1000
DYNAMIC 0x000468 0x08049468 0x08049468 0x000d0 0x000d0 RW 0x4
NOTE 0x000128 0x08048128 0x08048128 0x00020 0x00020 R 0x4
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x4
Headers of type LOAD are the one that are copied into virtual memory when the file is loaded for execution. Other headers contain other information, like the shared libraries that are needed. As you see, the FileSize and MemSiz significantly differ for the header that contains the bss section (the second LOAD one):
0x00104 (file-size) 0x61bac (mem-size)
For this example code:
int a[100000];
int main() { }
The ELF specification says that the part of a segment that the mem-size is greater than the file-size is just filled out with zeros in virtual memory. The segment to section mapping of the second LOAD header is like this:
03 .ctors .dtors .jcr .dynamic .got .got.plt .data .bss
So there are some other sections in there too. For C++ constructor/destructors. The same thing for Java. Then it contains a copy of the .dynamic section and other stuff useful for dynamic linking (i believe this is the place that contains the needed shared libraries among other stuff). After that the .data section that contains initialized globals and local static variables. At the end, the .bss section appears, which is filled by zeros at load time because file-size does not cover it.
By the way, you can see into which output-section a particular symbol is going to be placed by using the -M linker option. For gcc, you use -Wl,-M to put the option through to the linker. The above example shows that a is allocated within .bss. It may help you verify that your uninitialized objects really end up in .bss and not somewhere else:
.bss 0x08049560 0x61aa0
[many input .o files...]
*(COMMON)
*fill* 0x08049568 0x18 00
COMMON 0x08049580 0x61a80 /tmp/cc2GT6nS.o
0x08049580 a
0x080ab000 . = ALIGN ((. != 0x0)?0x4:0x1)
0x080ab000 . = ALIGN (0x4)
0x080ab000 . = ALIGN (0x4)
0x080ab000 _end = .
GCC keeps uninitialized globals in a COMMON section by default, for compatibility with old compilers, that allow to have globals defined twice in a program without multiple definition errors. Use -fno-common to make GCC use the .bss sections for object files (does not make a difference for the final linked executable, because as you see it's going to get into a .bss output section anyway. This is controlled by the linker script. Display it with ld -verbose). But that shouldn't scare you, it's just an internal detail. See the manpage of gcc.

The .bss section in an ELF file is used for static data which is not initialized programmatically but guaranteed to be set to zero at runtime. Here's a little example that will explain the difference.
int main() {
static int bss_test1[100];
static int bss_test2[100] = {0};
return 0;
}
In this case bss_test1 is placed into the .bss since it is uninitialized. bss_test2 however is placed into the .data segment along with a bunch of zeros. The runtime loader basically allocates the amount of space reserved for the .bss and zeroes it out before any userland code begins executing.
You can see the difference using objdump, nm, or similar utilities:
moozletoots$ objdump -t a.out | grep bss_test
08049780 l O .bss 00000190 bss_test1.3
080494c0 l O .data 00000190 bss_test2.4
This is usually one of the first surprises that embedded developers run into... never initialize statics to zero explicitly. The runtime loader (usually) takes care of that. As soon as you initialize anything explicitly, you are telling the compiler/linker to include the data in the executable image.

A .bss section is not stored in an executable file. Of the most common sections (.text, .data, .bss), only .text (actual code) and .data (initialized data) are present in an ELF file.

That is correct, .bss is not present physically in the file, rather just the information about its size is present for the dynamic loader to allocate the .bss section for the application program.
As thumb rule only LOAD, TLS Segment gets the memory for the application program, rest are used for dynamic loader.
About static executable file, bss sections is also given space in the execuatble
Embedded application where there is no loader this is common.
Suman

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js