I am building a DLL using a custom build system (outside Visual Studio), and I can't get uninitialized data to show up in the .bss section; the compiler lumps it into .data. This bloats the final binary size, since it's full of giant arrays of zeroes.
For example (small 1KB arrays in the example, but the actual buffers are much larger):
int uninitialized[1024];
int initialized[1024] = { 123 };
The compiler emits assembly like this:
PUBLIC _initialized
_DATA SEGMENT
COMM _uninitialized:DWORD:0400H
_initialized DD 07bH
ORG $+4092
_DATA ENDS
Which ends up in the object file like this:
SECTION HEADER #3
.data name
0 physical address
0 virtual address
1000 size of raw data
147 file pointer to raw data (00000147 to 00001146)
0 file pointer to relocation table
0 file pointer to line numbers
0 number of relocations
0 number of line numbers
C0400040 flags
Initialized Data
8 byte align
Read Write
(There is no .bss section.)
The current compilation flags:
cl -nologo -c -FAsc -Faobjs\ -W4 -WX -X -J -EHs-c- -GR- -Gy -GS- -O1 -Os -Foobjs\file.o file.cpp
I have looked through the list of options at http://msdn.microsoft.com/en-us/library/fwkeyyhe(v=vs.71).aspx but I haven't spotted anything obvious.
I'm using the compiler from Visual Studio 2008 SP1 (Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86).
You want to use __declspec(allocate()), which you can read up on here: http://msdn.microsoft.com/en-us/library/5bkb2w6t(v=vs.80).aspx
Notice that "size of raw data" is only 0x1000, or 4 kB: exactly the size of your initialized array. The VirtualSize of your .data section will be larger than the size of the actual data stored in the binary image, and your uninitialized array will occupy the slack space. Using the bss_seg pragma will force the linker to place your uninitialized data into its own separate section.
You can try using the bss_seg pragma if you aren't concerned about portability.
Related
I'm learning about virtual memory management and how memory is allocated for a process, and I've run some experiments. There are some confusing points, as below:
case1
#include <iostream>
int main() {
return 0;
}
After compiling the code and running size on the binary, I got the following output:
text data bss dec hex filename
1985 640 8 2633 a49 main
case2:
Change the code to:
#include <iostream>
int global;
int main() {
return 0;
}
After rebuilding it, the size output is:
text data bss dec hex filename
1985 640 16 2641 a51 main
Note: the data part is unchanged, but bss grew from 8 to 16. This result makes sense to me, since int global defines an uninitialized global variable.
case3:
Then I change the code to:
#include <iostream>
int global = 5;
int main() {
return 0;
}
This time I initialized the global variable. Analyzing the binary again:
text data bss dec hex filename
1985 644 4 2633 a49 main
This change doesn't make sense to me. Compared to case 1, why did the data part increase by 4 bytes and the bss part decrease by 4?
The discrepancy is caused by the linker and how it decides to lay out the binary; the default linker used by g++ on x86 is GNU ld.
To see the difference, let's compile your test cases in three different ways:
$ for x in *.cpp ; do g++ -g3 $x -o $x.ld.exe ; done # GNU ld
$ for x in *.cpp ; do g++ -g3 $x -o $x.gold.exe -fuse-ld=gold ; done # gold
$ for x in *.cpp ; do g++ -g3 $x -o $x.lld.exe -fuse-ld=lld ; done # lld
Now, the sizes are as follows:
text data bss dec hex filename
1873 656 8 2537 9e9 test1.cpp.ld.exe
1873 656 16 2545 9f1 test2.cpp.ld.exe
1873 660 4 2537 9e9 test3.cpp.ld.exe
text data bss dec hex filename
1877 656 2 2535 9e7 test1.cpp.gold.exe
1877 656 9 2542 9ee test2.cpp.gold.exe
1877 660 2 2539 9eb test3.cpp.gold.exe
text data bss dec hex filename
1817 576 2 2395 95b test1.cpp.lld.exe
1817 576 9 2402 962 test2.cpp.lld.exe
1817 580 2 2399 95f test3.cpp.lld.exe
The behavior of the bss section can be explained as follows:
There are two or three variables to store in here (completed.0, possibly global, and std::__ioinit). The two that are always there have a size and alignment of 1 byte each. The global that is only present in case 2 has a size and alignment of 4 bytes. This is why both gold and lld require 2 bytes for the bss section in cases 1 and 3. For case 2, the layout is such that global is stored in between the two single-byte values, which requires an additional 3 bytes of padding between the first 1-byte value and global.
Generally speaking, GNU ld does something similar, except it uses the bss section to also provide inter-section padding. In cases 1 and 3, only 2 bytes of the bss section are actually used, while case 2 uses 9 bytes of the bss section. Since data grows by 4 bytes in case 3, and bss is laid out after data, 4 of these padding bytes are removed to keep the offset of the following sections aligned. This is also why the size of the bss section grows by 8 bytes rather than 4 upon introducing global: GNU ld ensures that the end of the section remains 8-byte aligned.
The much shorter version is: the .bss section is mapped to the zero page with copy-on-write. Any variable that is not explicitly initialized will be added to .bss, which does not actually exist in the ELF file, and is only going to be allocated when the content of one of its variables changes (I'm oversimplifying; the allocation happens per page, so there are more details to it).
When you give it a default value, it can't be mapped to the zero page: its value needs to be present in the ELF file itself, and mapped into memory (again, as copy-on-write, so if you then change the value, a new page gets allocated with the modified value of your variable). If you changed that to const int global = 5; it would have been added to .rodata and you would have seen the size of .text increase instead.
I've blogged about the size command in the past: https://flameeyes.blog/2008/12/01/for-elves-size-matters/
I have a file which, when compiled to object file, has the following size:
On Windows, using MSVC, it's 8MB.
On macOS, using clang, it's 8MB.
On linux (Ubuntu 18.04 or Gentoo), using either gcc or clang, it's 20MB.
The file (detailed below) is a representation of (a part of) a unicode table along with character properties. The encoding is utf8.
It occurred to me that the problem might be that libstdc++ can't handle the file well, so I tried libc++ with clang on Gentoo, but it didn't do anything (the object file size remained the same).
Then I thought that it might be some optimization doing something odd, but once again I had no size improvements when I went from -O3 to -O0.
The file, on line 50 includes UnicodeTable.inc. The UnicodeTable.inc contains a std::array of the unicode codepoints.
I tried changing std::array to C style array, but again, the object file size did not change.
I have the preprocessed version of the CodePoint.cpp which can be compiled with $CC -xc++ CodePoint.i -c -o CodePoint.o. CodePoint.i contains about 40k lines of STL code and about 130k lines of unicode table.
I tried uploading the preprocessed CodePoint.i to gists.github.com and to paste.pound-python.org, but both refused the 170k lines long file.
At this point I'm out of ideas and would greatly appreciate any help regarding finding out the source of the "bloated" object file size.
From the size output you linked, you can see that there are 12 MB of relocations in the ELF object (section .rela.dyn). If a 64-bit relocation takes 24 bytes and you have 132624 table entries with 4 pointers to strings each, this pretty much explains the 12 MB difference (132624 * 4 * 24 = 12731904 ≈ 12 MB).
Apparently the other formats either use a more efficient relocation type or link the references directly and just relocate the whole block together with the strings as one piece of memory.
Since you are linking this to a shared library the dynamic relocations will not go away.
I am not sure if it is possible to avoid this with the code you currently use.
However, I think a Unicode code point must have a maximum size. Why don't you store the code points by value in char arrays in the RawCodePoint struct? The size of each code point string should be no larger than the pointer you currently store, and the locality of reference of the table lookup may actually improve.
constexpr size_t MAX_CP_SIZE = 4; // Check if that is correct
struct RawCodePointLocal {
const std::array<char, MAX_CP_SIZE> original;
const std::array<char, MAX_CP_SIZE> normal;
const std::array<char, MAX_CP_SIZE> folded_case;
const std::array<char, MAX_CP_SIZE> swapped_case;
bool is_letter;
bool is_punctuation;
bool is_uppercase;
uint8_t break_property;
uint8_t combining_class;
};
This way you should not need relocations for the entries.
I read that it depends on the compiler and operating system architecture. How do I find out the data segment and stack max size on a Linux system using GCC as compiler?
Let me experiment with you: create file ``test.c'' like this:
int main (void) { return 0; }
Now compile it, specifying a max stack size (just to make it easy to look up this number in the map file and determine the symbol name referring to it):
gcc test.c -o test.x -Wl,--stack=0x20000 -Wl,-Map=output.map
Determining data size is simple:
size -A -d test.x
You will get something like this:
section size addr
.text 1880 4299165696
.data 104 4299169792
...
Also ``objdump -h test.x'' will work fine but with less verbose results.
There is more sections here (not just code and data) but there is no stack information here. Why? Because stack size is not ELF section, it is reserved only after your program is loaded to be executed. You should read it from some (platform dependent) symbol in your file like this:
$ nm test.x | grep __size_of_stack_reserve__
0000000000020000 A __size_of_stack_reserve__
It is not surprising that the size is 0x20000, as that is what was specified when compiling.
I determined the symbol name by looking into the output.map file that was generated during compilation. I recommend you start by looking at it too.
Then, when you have some unknown file a.out, just repeat the sequence:
size -A -d a.out
nm a.out | grep __size_of_stack_reserve__
substituting the platform-dependent symbol you determined in the experiment described above.
Segments are a method for organizing stuff that your executable needs.
The data segment is usually for any data that your executable uses (without inputting from external sources). Some data segments may contain string literals or numeric constants.
Many executables use a stack for storing function local variables, statement block local variables, return addresses and function parameters. A stack is not required by the C or C++ languages; it's just a handy data structure.
"Stack size" can mean either the capacity allocated to the stack, the number of elements residing on it, or the amount of memory it currently occupies.
Many platforms have a default size for the stack. Since platforms vary, you will need to read the documentation for your tools to see how to set stack size and what the default capacity is.
How do I find out the data segment and stack max size on a Linux system using GCC as compiler?
These limits can be read as RLIMIT_DATA and RLIMIT_STACK resource limits of getrlimit.
On the command line you can use the ulimit command to find these limits on your system:
$ ulimit -s # stack
8515
$ ulimit -d # data
unlimited
You can change the system limits by modifying limits.conf.
And more in man pthread_create:
On Linux/x86-32, the default stack size for a new thread is 2 megabytes. Under the NPTL threading implementation, if the RLIMIT_STACK soft resource limit at the time the program started has any value other than "unlimited", then it determines the default stack size of new threads. Using pthread_attr_setstacksize(3), the stack size attribute can be explicitly set in the attr argument used to create a thread, in order to obtain a stack size other than the default.
And in man ld:
--stack reserve
--stack reserve,commit
Specify the number of bytes of memory to reserve (and optionally commit) to be used as stack for this program. The default is 2MB reserved, 4K committed. [This option is specific to the i386 PE targeted port of the linker]
I just used DUMPBIN for the first time and I see the term HIGHLOW repeatedly in the output file:
BASE RELOCATIONS #7
11000 RVA, E0 SizeOfBlock
...
3B5 HIGHLOW 2001753D ___onexitbegin
3C1 HIGHLOW 2001753D ___onexitbegin
...
I'm curious what this term stands for. I didn't find anything on Google or Stackoverflow about it.
To apply a fixup, a delta is calculated as the difference between the
preferred base address, and the base where the image is actually
loaded.
The basic idea is that when doing a fixup at some address, we must know
what memory must be changed ("offset" field)
what value is needed for its relocation ("delta" value)
which parts of relocated data and delta value to use ("type" field)
Here are some possible values of the "type" field
HIGH - add higher word (16 bits) of delta to the 16-bit value at "offset"
LOW - add lower word of delta to the value at "offset"
HIGHLOW - add full delta to the 32-bit value at "offset"
In other words, HIGHLOW type tells the program that it's doing a fix-up on offset "offset" from the page of this relocation block*, and that there is a doubleword that needs to be modified in order to have properly working executable.
* all of the relocation entries are grouped into blocks, and every block has a page on which its entries are applied
Let's say that you have this instruction in your code:
section .data
message db "Hello World!", 0
section .text
...
mov eax, message
...
You run assembler and immediately after it you run disassembler. Now your code looks like this:
mov eax, dword [0x702000]
You're now curious where 0x702000 came from, and when you look into the file dump, you see that
ImageBase: 0x00700000
Now you understand where this number came from (ImageBase plus the offset of message), and you're ready to run the executable.
The loader, which loads executable files into memory and creates an address space for them, finds out that the memory at 0x700000 is unavailable and that it needs to place the file somewhere else. It decides that 0xf00000 will be OK and copies the file contents there.
But your program was linked to work only with data at 0x700000, and there was no way for the linker to know that its output would be relocated. Because of this, the loader must do its magic. It:
1. calculates the delta value: the preferred base (image base) is 0x700000, but the image was actually loaded at 0xf00000. Subtracting one from the other gives 0x800000 as the result.
2. gets to the .reloc section of the file
3. checks if there is still another page (4 KB of data) to be relocated; if not, it continues on to calling the file's entry point
4. for every relocation for the current page, it:
   - gets the data at the relocation offset
   - adds the delta value (in the way the type field states)
   - places the new value at the relocation offset
5. continues at step 3
There are also more types of relocation entry, and some of them are architecture-specific. For a full list, read the Microsoft Portable Executable and Common Object File Format specification, section 6.6.2, "Fixup Types".
What you see here is the content of the "Base relocation table" in Microsoft Windows executable files.
Base relocation tables are necessary in Windows for DLL files and they are optional for executable files; they contain information about the location of address information in the EXE/DLL file that must be updated when the actual address of the DLL file in memory is known (when loading the DLL into memory). Windows uses the information stored in this table to update the address information.
The table supports different types of addresses while the naming is Microsoft-specific: ABSOLUTE (= dummy), HIGH, LOW, HIGHLOW, HIGHADJ and MIPS_JMPADDR.
The full name of the constant is "IMAGE_REL_BASED_HIGHLOW".
The "ABSOLUTE" type is typically a dummy entry inserted to ensure the parts of the table are a multiple of 4 (or 8) bytes long.
On x86 CPUs only the "HIGHLOW" type is used: It tells Windows about the location of an absolute (32-bit) address in the file.
Some background info:
In your example the "Image Base" could be 0x20000000 which means that the EXE/DLL file has been compiled to be loaded into address 0x20000000. At the addresses 0x200113B5 (0x20000000 + 0x11000 + 0x3B5) and 0x200113C1 there are absolute addresses.
Let's say the memory at location 0x200113B5 contains the value 0x20012345 which is the address of a function or variable in the program.
Maybe the memory at address 0x20000000 cannot be used and Windows decides to load the DLL into the memory at 0x50000000 instead. Then the 0x20012345 must be replaced by 0x50012345.
The information in the base relocation table is used by Windows to find all addresses that must be replaced.
If I understand correctly, the .bss section in ELF files is used to allocate space for zero-initialized variables. Our tool chain produces ELF files, hence my question: does the .bss section actually have to contain all those zeroes? It seems such an awful waste of spaces that when, say, I allocate a global ten megabyte array, it results in ten megabytes of zeroes in the ELF file. What am I seeing wrong here?
It has been some time since I worked with ELF, but I think I still remember this stuff. No, it does not physically contain those zeros. If you look into an ELF file's program headers, you will see that each entry has two sizes: one is the size in the file, and the other is the size the segment occupies when allocated in virtual memory (readelf -l ./a.out):
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000034 0x08048034 0x08048034 0x000e0 0x000e0 R E 0x4
INTERP 0x000114 0x08048114 0x08048114 0x00013 0x00013 R 0x1
[Requesting program interpreter: /lib/ld-linux.so.2]
LOAD 0x000000 0x08048000 0x08048000 0x00454 0x00454 R E 0x1000
LOAD 0x000454 0x08049454 0x08049454 0x00104 0x61bac RW 0x1000
DYNAMIC 0x000468 0x08049468 0x08049468 0x000d0 0x000d0 RW 0x4
NOTE 0x000128 0x08048128 0x08048128 0x00020 0x00020 R 0x4
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x4
Headers of type LOAD are the ones that are copied into virtual memory when the file is loaded for execution. Other headers contain other information, such as the shared libraries that are needed. As you can see, FileSiz and MemSiz differ significantly for the header that contains the .bss section (the second LOAD one):
0x00104 (file-size) 0x61bac (mem-size)
For this example code:
int a[100000];
int main() { }
The ELF specification says that the part of a segment where the mem-size is greater than the file-size is simply filled with zeros in virtual memory. The segment-to-section mapping of the second LOAD header is like this:
03 .ctors .dtors .jcr .dynamic .got .got.plt .data .bss
So there are some other sections in there too: .ctors and .dtors for C++ constructors and destructors, and .jcr for the same purpose in Java. Then comes the .dynamic section and other data useful for dynamic linking (I believe this is the place that lists the needed shared libraries, among other things). After that, the .data section contains initialized globals and local static variables. At the end, the .bss section appears, which is filled with zeros at load time because the file-size does not cover it.
By the way, you can see into which output section a particular symbol is going to be placed by using the -M linker option. For gcc, use -Wl,-M to pass the option through to the linker. The map below shows that a is allocated within .bss. It may help you verify that your uninitialized objects really end up in .bss and not somewhere else:
.bss 0x08049560 0x61aa0
[many input .o files...]
*(COMMON)
*fill* 0x08049568 0x18 00
COMMON 0x08049580 0x61a80 /tmp/cc2GT6nS.o
0x08049580 a
0x080ab000 . = ALIGN ((. != 0x0)?0x4:0x1)
0x080ab000 . = ALIGN (0x4)
0x080ab000 . = ALIGN (0x4)
0x080ab000 _end = .
GCC keeps uninitialized globals in a COMMON section by default, for compatibility with old compilers that allow globals to be defined twice in a program without multiple-definition errors. Use -fno-common to make GCC place them in the .bss sections of object files (this makes no difference for the final linked executable, because, as you can see, they end up in a .bss output section anyway; this is controlled by the linker script, which you can display with ld -verbose). But that shouldn't scare you; it's just an internal detail. See the gcc manpage.
The .bss section in an ELF file is used for static data which is not initialized programmatically but guaranteed to be set to zero at runtime. Here's a little example that will explain the difference.
int main() {
static int bss_test1[100];
static int bss_test2[100] = {0};
return 0;
}
In this case bss_test1 is placed into the .bss since it is uninitialized. bss_test2 however is placed into the .data segment along with a bunch of zeros. The runtime loader basically allocates the amount of space reserved for the .bss and zeroes it out before any userland code begins executing.
You can see the difference using objdump, nm, or similar utilities:
moozletoots$ objdump -t a.out | grep bss_test
08049780 l O .bss 00000190 bss_test1.3
080494c0 l O .data 00000190 bss_test2.4
This is usually one of the first surprises that embedded developers run into... never initialize statics to zero explicitly. The runtime loader (usually) takes care of that. As soon as you initialize anything explicitly, you are telling the compiler/linker to include the data in the executable image.
A .bss section is not stored in an executable file. Of the most common sections (.text, .data, .bss), only .text (actual code) and .data (initialized data) are present in an ELF file.
That is correct: .bss is not physically present in the file; only information about its size is present, which the dynamic loader uses to allocate the .bss section for the application program.
As a rule of thumb, only the LOAD and TLS segments get memory for the application program; the rest are used by the dynamic loader.
In a static executable, the .bss section can also be given space in the executable itself; this is common in embedded applications where there is no loader.