What exactly is in a .o / .a / .so file? - c++

I was wondering what exactly is stored in a .o or a .so file that results from compiling a C++ program. This post gives a quite good overview of the compilation process and the function of a .o file in it, and as far as I understand from this post, .a and .so files are just multiple .o files merged into a single file that is linked in a static (.a) or dynamic (.so) way.
But I wanted to check if I understand correctly what is stored in such a file. After compiling the following code
void f();
void f2(int);
const int X = 25;
void g() {
f();
f2(X);
}
void h() {
g();
}
I would expect to find the following items in the .o file:
Machine code for g(), containing some placeholder addresses where f() and f2(int) are called.
Machine code for h(), with no placeholders
Machine code for X, which would be just the number 25
Some kind of table that specifies at which addresses in the file the symbols g(), h() and X can be found
Another table that specifies which placeholders were used to refer to the undefined symbols f() and f2(int), which have to be resolved during linking.
Then a program like nm would list all the symbol names from both tables.
I suppose that the compiler could optimize the call f2(X) by calling f2(25) instead, but it would still need to keep the symbol X in the .o file since there is no way to know if it will be used from a different .o file.
Would that be about correct? Is it the same for .a and .so files?
Thanks for your help!

You're pretty much correct in the general idea for object files. In the "table that specifies at which addresses in the file" I would replace "addresses" with "offsets", but that's just wording.
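To make the "offsets plus placeholders" picture concrete, here is a minimal sketch of a toy object file and a toy linker. The types ToyObject and Relocation and the function toy_link are made up for illustration; real ELF objects use sections, typed relocations and addends, none of which is modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model of a relocatable object file. This is NOT the real ELF layout.
struct Relocation {
    std::size_t offset;  // slot in `code` that holds a placeholder
    std::string symbol;  // symbol whose final address belongs in that slot
};

struct ToyObject {
    std::vector<std::uint32_t> code;             // one "word" per slot
    std::map<std::string, std::size_t> defined;  // symbol -> offset in `code`
    std::vector<Relocation> relocs;              // placeholders to patch
};

// Toy linker: place the objects back to back, turn per-object offsets into
// final addresses, then patch every placeholder.
// Returns false if a referenced symbol is defined nowhere.
bool toy_link(const std::vector<ToyObject>& objs,
              std::vector<std::uint32_t>& image,
              std::map<std::string, std::size_t>& symtab) {
    std::vector<std::size_t> base;  // start of each object inside `image`
    for (const ToyObject& o : objs) {
        base.push_back(image.size());
        for (const auto& kv : o.defined)
            symtab[kv.first] = base.back() + kv.second;
        image.insert(image.end(), o.code.begin(), o.code.end());
    }
    for (std::size_t i = 0; i < objs.size(); ++i) {
        for (const Relocation& r : objs[i].relocs) {
            auto it = symtab.find(r.symbol);
            if (it == symtab.end()) return false;  // undefined reference
            assert(r.offset < objs[i].code.size());
            image[base[i] + r.offset] = static_cast<std::uint32_t>(it->second);
        }
    }
    return true;
}
```

The symbol table only makes sense as offsets while the object is standalone; it is the link step that turns them into addresses, which is why "offsets" is the better word.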
.a files are simply just archives (an old format that predates tar, but does the same thing). You could replace .a files with tar files as long as you taught the linker to unpack them and just link with all the .o files contained in them (more or less, there's a little bit more logic to not link with object files in the archive that aren't necessary, but that's just an optimization).
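The "little bit more logic" can be sketched roughly as follows. This is a toy model under my own assumptions (the Member type and select_members are hypothetical names), not how any particular linker is implemented: a member is pulled out of the archive only if it defines a symbol that is still undefined.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy archive member: only the symbols it defines and the ones it needs.
struct Member {
    std::string name;
    std::set<std::string> defines;
    std::set<std::string> needs;
};

// Keep scanning the archive and pull in only members that define a symbol
// that is currently undefined. Returns the selected member names in the
// order they were picked.
std::vector<std::string> select_members(const std::vector<Member>& archive,
                                        std::set<std::string> undefined) {
    std::vector<std::string> picked;
    std::vector<bool> taken(archive.size(), false);
    std::set<std::string> defined;
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t i = 0; i < archive.size(); ++i) {
            if (taken[i]) continue;
            const Member& m = archive[i];
            bool wanted = false;
            for (const std::string& d : m.defines)
                if (undefined.count(d)) wanted = true;
            if (!wanted) continue;  // nothing in here is needed: leave it out
            taken[i] = true;
            progress = true;
            picked.push_back(m.name);
            for (const std::string& d : m.defines) {
                defined.insert(d);
                undefined.erase(d);
            }
            // A pulled-in member may itself introduce new undefined symbols.
            for (const std::string& n : m.needs)
                if (!defined.count(n)) undefined.insert(n);
        }
    }
    return picked;
}
```

A member that nobody references never makes it into the output, which is the optimization mentioned above.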
.so files are different. They are closer to a final binary than to an object file. An .so file with all symbols resolved can, at least theoretically, be run as a program. In fact, with PIE (position-independent executables) the difference between a shared library and a program is (at least in theory) just a few bits in the header. Both contain instructions telling the dynamic linker how to load the file (more or less the same instructions as for a normal program) and a relocation table telling the dynamic linker how to resolve the external symbols (again, the same as in a program). All unresolved symbols in a dynamic library (and in a program) are accessed through indirection tables, which get populated at dynamic linking time (program start or dlopen).
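One of those "few bits in the header" is the ELF e_type field. As a rough sketch (Linux only, assuming the elf.h header is available and that the file uses the host's byte order; elf_type is a name I made up), you can read it like this:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <elf.h>
#include <fstream>
#include <string>

// Returns the e_type field of an ELF file (ET_REL, ET_EXEC, ET_DYN, ...),
// or ET_NONE if the file cannot be read or does not start with the ELF magic.
std::uint16_t elf_type(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char buf[EI_NIDENT + 2] = {0};
    if (!in.read(reinterpret_cast<char*>(buf), sizeof buf)) return ET_NONE;
    if (std::memcmp(buf, ELFMAG, SELFMAG) != 0) return ET_NONE;
    // e_type directly follows the 16-byte e_ident array in both the 32-bit
    // and the 64-bit header. Assumption: file and host share a byte order.
    std::uint16_t type = 0;
    std::memcpy(&type, buf + EI_NIDENT, sizeof type);
    return type;
}
```

Calling elf_type on a .o file should give ET_REL, while a shared library and a PIE executable should both give ET_DYN; a non-PIE executable gives ET_EXEC.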
If we simplify this a lot, the difference between objects and shared libraries is that much more work has been done in the shared library to avoid text relocations (this is not strictly necessary or enforced, but it's the general idea). In an object file the assembler has only generated placeholders for addresses, which the linker then fills in. In a shared library those addresses are filled in with addresses of jump tables, so that the text of the library doesn't need to change at load time, only a small jump table.
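The jump-table idea can be imitated in plain C++ with a table of function pointers. This is only an analogy, not the real GOT/PLT mechanism, and every name in it (jump_table, library_code, toy_dynamic_link, twice, negate_value) is hypothetical:

```cpp
#include <cassert>
#include <map>
#include <string>

using Fn = int (*)(int);

// Stand-ins for functions whose addresses are unknown until load time.
int twice(int x) { return 2 * x; }
int negate_value(int x) { return -x; }

// The indirection table: the only thing the toy dynamic linker writes to.
enum Slot { SLOT_TWICE, SLOT_NEGATE, SLOT_COUNT };
Fn jump_table[SLOT_COUNT] = {nullptr, nullptr};

// "Library text": compiled once and never patched. Every external call goes
// through the table instead of through an address baked into the code.
int library_code(int x) {
    assert(jump_table[SLOT_TWICE] && jump_table[SLOT_NEGATE]);
    return jump_table[SLOT_NEGATE](jump_table[SLOT_TWICE](x));
}

// Toy dynamic linker: look the names up and fill in the table.
// Returns false if a needed symbol is not exported by anyone.
bool toy_dynamic_link(const std::map<std::string, Fn>& exported) {
    auto t = exported.find("twice");
    auto n = exported.find("negate_value");
    if (t == exported.end() || n == exported.end()) return false;
    jump_table[SLOT_TWICE] = t->second;
    jump_table[SLOT_NEGATE] = n->second;
    return true;
}
```

Because library_code itself never changes, the same copy of it could be shared between processes that map the functions at different addresses; only the small table differs.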
Btw. I'm talking ELF. Older formats had more differences between programs and libraries.

What you described in your question (machine code for functions, initialization data and relocation tables) is pretty much exactly what is inside .o (object) and .so (shared object) files.
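One detail of the question's example is worth checking with nm: in C++ a namespace-scope const object has internal linkage by default, so other object files cannot refer to X by name and the compiler is free not to emit a global symbol for it. A small sketch of the difference (Y, f2_stub and sum_of_constants are names added here for illustration; whether X shows up at all depends on compiler and optimization level):

```cpp
#include <cassert>

// Internal linkage by default in C++: other translation units cannot refer
// to X, so no global symbol is required for it.
const int X = 25;

// `extern` gives the constant external linkage, so a symbol for Y has to be
// emitted and will be visible to other object files.
extern const int Y = 25;

// Hypothetical stand-in for the question's f2(int).
void f2_stub(int) {}

int sum_of_constants() {
    assert(X == Y);  // both constants hold 25
    f2_stub(X);      // the compiler may simply pass the literal 25 here
    return X + Y;
}
```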
.a (archives) are basically multiple .o (object) files bunched together for easier reference during linking. ("Link libraries")
.so (shared object) files include some additional metadata, like which other .so's would need to be linked in. (xyz.so might reference some functions that reside in abc.so, and the information that abc.so would need to be linked in, plus optionally the path where to find abc.so (the RPATH), need to be encoded in xyz.so.)
Windows .dll (dynamic link library) files are basically shared objects (.so) with a different name.
Disclaimer: This is simplifying things significantly, but is close enough to "The Truth (tm)" to serve for everyday developer needs.

Related

C++ linker order issue - Linking my own main with a subset of object files in a large C++ project

I have a large C++ project with around 250 cpp files.
I didn't write this code, I'm just trying to write a test for fuzzing(testing) purpose. Therefore:
I wrote my own main cpp file, called wrapper.cpp, containing the int main()
I included in this file some header files needed
I compiled after removing the initial main from the Makefile and adding my wrapper.cpp
It works, it produces a functional executable. However, the binary is quite large. I'm pretty sure I can reduce the size, as a lot of object files are linked but not used. Therefore, I built all the object files and now I'm thinking about how to link only the needed ones into the executable. But after many tries, it seems impossible:
The executable is linked against the object files, some static libraries and some dynamic lib
The order matters for the static libs (interdependencies between them and some *.o files)
There are several definitions for some symbols, and this is allowed by the -z muldefs linker option
Thus, I first tried to create a big static lib with all the object files and to link the executable against it, assuming only the right .o files would be picked by the linker. I didn't think about the order problem ... Some of these object files need symbols contained in other static libs and vice versa (interdependencies). No matter where I place the static lib I created, there will be issues. So I can't go this way, it is too complex.
Then, I tried to add the -Wl,--start-group/-Wl,--end-group linker option. It allows me to link, but the binary segfaults. I guess this is because of the -z muldefs option that allows multiple definitions, so the order is really important.
So I was wondering if there is a way to do this, maybe an obvious way that I'm missing? It seems to be a pretty common use case to me (imagine you want to test a single function), but I cannot find anything online.
Thank you in advance for your precious help

static and dynamic linking using gcc

I've been recently reading about static and dynamic linking and I understood the differences and how to create static and dynamic library and link it to my project
But, a question came to my mind that I couldn't answer or find answer for it as It's a specific question ... when I compile my code on linux using the line
#include <stdio.h>
int main()
{
printf("hello, world!\n");
}
compiling using this command
[root@host ~]# gcc helloworld.c -o helloworld
which type of linking is this??
so the stdio.h is statically or dynamically linked to my project???
Libraries are mostly used as shared resources, so that several different programs can reuse the same pre-compiled code in some manner. Some libraries come as standard libraries which are delivered with the operating system and/or the compiler package. Some libraries come with other third party projects.
When you run just gcc in the manner of your example, you really run a compiler driver which provides you with a few compilation-related functions, calling different parts of the compilation process and finally linking your application with a few standard libraries. The type of the libraries is chosen based on the options you provide. By default it will try to find dynamic (shared) libraries and, if they are missing, fall back to static ones, unless you tell it to use static libs only (-static).
When you link against project libraries you tell gcc/g++ which libraries to use with -lname. It will then do the same as with the standard libraries, looking for '.so' first and '.a' second, unless -static is used. You can also specify the full path to a library file directly, telling it exactly which library to use. There are several other options which control the linking process; see the man pages for 'g++' and 'ld'.
A library must contain real program code and data. The way it is linked to the main executable (and other libraries) is through symbol tables which are parts of the libraries. A symbol table contains entries for global functions and data.
There is a slight difference in the structure of the shared and static libs. The former one is actually a pre-linked object, similar to an executable image with some extra info related to the symbols and relocation (such a library can be loaded at any address in the memory and still should work correctly). The static library is actually an archive of '.o' files, ready for a full-blown linking.
The usual way to create a library is to compile multiple parts of your program into '.o' files, which in turn can be linked into a shared library by 'ld' (or g++) or archived into a .a with 'ar'. Afterwards you can use them for linking in the manner described above.
One object file (.o) is created per .cpp source file. The source file contains code and can include any number of header files, such as 'stdio.h' in your case (or cstdio) or whatever. These files become a part of the source, which is ensured by the cpp preprocessor. The latter takes care of macros and of flattening all the #include hierarchies, so that the compiler sees only a single text stream which it converts into a '.o'. In general header files should not contain executable code, only declarations and macros, though this is not always true. But it does not matter, since they become welded to the main source file.
Hope this would explain it.
which type of linking is this?? so the stdio.h is statically or
dynamically linked to my project???
stdio.h is not linked, it is a header file, and contains code / text, no compiled objects.
The normal link process prefers the '.so' library over the '.a' archive when both are found in the same directory. Your simple command is linking with the .so (if that is in the correct path) or the .a (if that is found in a path with no .so equivalent).
To achieve static linking, you have several choices, including
1) copy the '.a' archive to a directory you create, then specify that directory (-L)
2) specify the path to the '.a' in the build command. Boost example:
$(CC) $(CC_FLAGS) $< /usr/local/lib/libboost_chrono.a -o $@ $(LIB_DIRs) $(LIB_NMs)
I have used both techniques, I find the first easier.
Note that archive code might refer to symbols in another archive. You can command the linker to search a library multiple times.
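Why searching a library more than once can matter is easiest to see in a toy model. The sketch below is my own simplification (the Lib type and resolve are hypothetical, and each library is treated as a single unit): one left-to-right pass can leave a symbol unresolved that a repeated scan, in the spirit of --start-group/--end-group, does resolve.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy library: the symbols it defines and the ones it needs from elsewhere.
struct Lib {
    std::set<std::string> defines;
    std::set<std::string> needs;
};

// Walk the libraries in command-line order. With rescan == false each
// library is looked at exactly once; with rescan == true the walk repeats
// until a full pass changes nothing.
// Returns the symbols that are still undefined at the end.
std::set<std::string> resolve(const std::vector<Lib>& libs,
                              std::set<std::string> undefined,
                              bool rescan) {
    std::vector<bool> used(libs.size(), false);
    std::set<std::string> defined;
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < libs.size(); ++i) {
            if (used[i]) continue;
            bool wanted = false;
            for (const std::string& d : libs[i].defines)
                if (undefined.count(d)) wanted = true;
            if (!wanted) continue;  // not needed (yet): skip this library
            used[i] = true;
            changed = true;
            for (const std::string& d : libs[i].defines) {
                defined.insert(d);
                undefined.erase(d);
            }
            for (const std::string& n : libs[i].needs)
                if (!defined.count(n)) undefined.insert(n);
        }
        if (!rescan) break;  // a single pass only
    }
    return undefined;
}
```

With a library defining b listed before a library that defines a and needs b, a single pass starting from an undefined a ends with b unresolved, because the first library was already skipped when b became needed.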
If you let the build link with the .so, this does not pull a copy of the entire .so into the build. Instead, the .so is loaded (mapped) into memory, if not already there, at run-time, after the program starts. For most applications this is considered a 'small' start-up performance hit, as the program adjusts its memory map (auto-magically, behind the scenes). Note that the app itself can also control when to load a .so; this is called dynamic loading.
Unrelated:
// If your C++ 'Hello World' has no class ... why bother?
#include <iostream>
class Hello_t {
public:
Hello_t() { std::cout << "\n Hello" << std::flush; }
~Hello_t() { std::cout << "World!" << std::endl; }
void operator() () { std::cout << " C++ "; }
};
int main(int, char**) { Hello_t()(); }

How library classes are instantiated

I'm going to ask how it's done in c++, but this idea can apply to multiple languages. If you know how to do it in objective-c as well, please provide any similarities between the two
Let's say I want to create an instance of an ofstream like
ofstream myfile;
I'm assuming all I have on my computer is the *.o file (in a library archive) and the *.h file for iostream class. If this part isn't true let me know. I am assuming this when all I have installed is the runtime and the devel packages, not the source files.
How does it connect the header file to the object file, is there a naming scheme. And where does it look and in what order.?
Why this is confusing me: normally when I want to create a class I link my implementation of the class with the program, so where does it look now, and how does it know which files to link?
One more, does it matter if it loaded statically or dynamically?
Thank you in advance, and sry if this is a silly question.
Computer Science 101:
Broadly speaking (VERY broadly!), there are two kinds of "programs":
a) Interpreted: you read the program source line-by-line every time you execute it
<= *nix shell scripts and DOS .bat files are "interpreted"
b) Compiled: you read the source once (to convert it into a "binary machine code"). You link the machine code "object files" to build an "executable program".
You're talking about "compiled programs"
The "ofstream" part is irrelevant once the program is "compiled"
The binary implementation for "ofstream" can be compiled directly into the executable, or it can be dynamically loaded from a shared library (.dll) at runtime.
A "compiler" uses ".h" headers to process the source file.
A "linker" uses ".lib" libraries to match symbols and link static code at link time.
The "Operating System" recognizes dynamic links and loads the needed shared libraries (.dll's) at runtime.
Three different things, all independent of each other: Compiler/source code, Linker/machine object code, OS/executable programs
'Hope that helps .. a bit...
This is not standardized and it's up to the implementation. I don't know about *unix, but I assume it's fairly similar to Windows.
You can assume that .o files are similar to library files .lib.
The header provides the class definition, so that the compiler knows which functions exist and the linker knows what to look for in the library.
Say you have a header:
class A
{
public:
A();
void foo();
};
and a lib file A.lib.
You include that header and call:
A a;
a.foo();
The compiler finds the declarations for both A() and A::foo(). Now it knows that these functions have to be found in the library. Names in the library are decorated and contain modifiers; the scheme is specific to the compiler, and the linker finds the functions if they are exported from the library. It then binds the calls to the specific entry points in the DLL.
If by dynamic loading you mean using LoadLibrary() and GetProcAddress() instead of linking, then the concept is pretty similar.
If you do static linking, all symbols with linkage are available in the .obj file. The linker binds the calls of the functions to the entry points of the functions. There is name mangling involved in this process so that the symbols can be resolved correctly.
Dynamic linking is a platform dependent issue and not part of the C or C++ standard as far as I know.
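To see what the name mangling mentioned above looks like in practice, here is a small sketch that turns mangled names back into readable ones. It assumes GCC or Clang (the Itanium C++ ABI and its cxxabi.h header); MSVC decorates names differently, and demangle is just a helper name I picked.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium-ABI symbol name (GCC/Clang only).
// Returns the input unchanged if demangling fails.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) return mangled;
    std::string result(out);
    std::free(out);  // the returned buffer is allocated with malloc()
    return result;
}
```

For example, demangle("_ZN1A3fooEv") gives "A::foo()" and demangle("_Z2f2i") gives "f2(int)": the parameter types are part of the symbol name, which is how the linker can tell overloads apart.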

Relation between object file and shared object file

what is the relation between shared object(.so) file and object(.o) file?
can you please explain via example?
Let's say you have the following C source file, call it name.c
#include <stdio.h>
#include <stdlib.h>
void print_name(const char * name)
{
printf("My name is %s\n", name);
}
When you compile it with cc -c name.c, you generate name.o. The .o contains the compiled code and data for all functions and variables defined in name.c, as well as an index associating their names with the actual code. If you look at that index, say with the nm tool (available on Linux and many other Unixes), you'll notice two entries:
00000000 T print_name
U printf
What this means: there are two symbols (names of functions or variables, but not names of classes, structs, or any types) stored in the .o. The first, marked with T, has its definition in name.o. The other, marked with U, is merely a reference. The code for print_name can be found here, but the code for printf cannot. When your program is linked, the linker needs to find all the symbols that are only references and look up their definitions in other object files or libraries, so that everything can be joined into a complete program or complete library. An object file is therefore the definitions found in the source file, converted to binary form, and available for placing into a full program.
You can link together .o files one by one, but you don't: there are generally a lot of them, and they are an implementation detail. You'd really prefer to have them all collected into bundles of related objects, with well recognized names. These bundles are called libraries and they come in two forms: static and dynamic.
A static library (in Unix) is almost always suffixed with .a (examples include libc.a which is the C core library, libm.a which is the C math library) and so on. Continuing the example you'd build your static library with ar rc libname.a name.o. If you run nm on libname.a you'll see this:
name.o:
00000000 T print_name
U printf
As you can see it is primarily a big table of object files with an index finding all the names in it. Just like object files it contains both the symbols defined in every .o and the symbols referred to by them. If you were to link in another .o (e.g. date.o to print_date), you'd see another entry like the one above.
If you link a static library into an executable, the code it needs from the library is copied into the executable. This is just like linking in the individual .o files directly (the linker typically pulls in only the archive members that resolve a still-undefined symbol, not the whole archive). As you can imagine this can make your program very large, especially if you are using (as most modern applications are) a lot of libraries.
A dynamic or shared library is suffixed with .so. It, like its static analogue, bundles the code compiled from a set of object files. You'd build it with cc -shared -o libname.so name.o (the objects are normally compiled with -fPIC first). Looking at it with nm is quite a bit different than the static library though. On my system it contains about two dozen symbols, only two of which are print_name and printf:
00001498 a _DYNAMIC
00001574 a _GLOBAL_OFFSET_TABLE_
w _Jv_RegisterClasses
00001488 d __CTOR_END__
00001484 d __CTOR_LIST__
00001490 d __DTOR_END__
0000148c d __DTOR_LIST__
00000480 r __FRAME_END__
00001494 d __JCR_END__
00001494 d __JCR_LIST__
00001590 A __bss_start
w __cxa_finalize@@GLIBC_2.1.3
00000420 t __do_global_ctors_aux
00000360 t __do_global_dtors_aux
00001588 d __dso_handle
w __gmon_start__
000003f7 t __i686.get_pc_thunk.bx
00001590 A _edata
00001594 A _end
00000454 T _fini
000002f8 T _init
00001590 b completed.5843
000003c0 t frame_dummy
0000158c d p.5841
000003fc T print_name
U printf@@GLIBC_2.0
A shared library differs from a static library in one very important way: it does not embed itself in your final executable. Instead the executable contains a reference to that shared library that is resolved, not at link time, but at run-time. This has a number of advantages:
Your executable is much smaller. It only contains the code you explicitly linked via the object files. The external libraries are references and their code does not go into the binary.
You can share (hence the name) one library's bits among multiple executables.
You can, if you are careful about binary compatibility, update the code in the library between runs of the program, and the program will pick up the new library without you needing to change it.
There are some disadvantages:
It takes time to link a program together. With shared libraries some of this time is deferred to every time the executable runs.
The process is more complex. All the additional symbols in the shared library are part of the infrastructure needed to make the library link up at run-time.
You run the risk of subtle incompatibilities between differing versions of the library. On Windows this is called "DLL hell".
(If you think about it many of these are the reasons programs use or do not use references and pointers instead of directly embedding objects of a class into other objects. The analogy is pretty direct.)
Ok, that's a lot of detail, and I've skipped a lot, such as how the linking process actually works. I hope you can follow it. If not ask for clarification.
A .so is analogous to a .dll on windows. A .o is exactly the same as a .obj under Visual Studio.

Includes with the Linux GCC Linker

I don't understand how GCC works under Linux. In a source file, when I do a:
#include <math.h>
Does the compiler extract the appropriate binary code and insert it into the compiled executable OR does the compiler insert a reference to an external binary file (a-la Windows DLL?)
I guess a generic version of this question is: Is there an equivalent concept to Windows DLLs under *nix?
Well. When you include math.h the compiler will read the file that contains declarations of the functions and macros that can be used. If you call a function declared in that file (header), then the compiler inserts a call instruction into that place in your object file that will be made from the file you compile (let's call it test.c and the object file created test.o). It also adds an entry into the relocation table of that object-file:
Relocation section '.rel.text' at offset 0x308 contains 1 entries:
Offset Info Type Sym.Value Sym. Name
0000001c 00000902 R_386_PC32 00000000 bar
This would be a relocation entry for a function bar. An entry in the symbol table will be made noting the function is yet undefined:
9: 00000000 0 NOTYPE GLOBAL DEFAULT UND bar
When you link the test.o object file into a program, you need to link against the math library called libm.so. The .so extension is similar to the .dll extension on Windows; it means the file is a shared object. The linker will fix up all the places that appear in the relocation table of test.o, replacing the placeholders with the proper address of the bar function. Depending on whether you use the shared version of the library or the static one (called libm.a), that fix-up happens at link time, or later, at runtime when you actually start your program. For a shared library, the linker also adds an entry to the table of shared libraries needed by the program (this can be shown with readelf -d ./test):
Dynamic section at offset 0x498 contains 22 entries:
Tag Type Name/Value
0x00000001 (NEEDED) Shared library: [libm.so.6]
0x00000001 (NEEDED) Shared library: [libc.so.6]
... ... ...
Now, if you start your program, the dynamic linker will look up that library and link it to your executable image. On Linux, the program doing this is called ld.so. Static libraries don't have a place in the dynamic section, as they are just linked with the other object files and then forgotten about; they are part of the executable from then on.
In reality it is actually much more complex, and I also don't understand it in every detail. That's the rough plan, though.
There are several aspects involved here.
First, header files. The compiler simply includes the content of the file at the location where it was included, nothing more. As far as I know, GCC doesn't even treat standard header files differently (but I might be wrong there).
However, header files might actually not contain the implementation, only its declaration. If the implementation is located somewhere else, you've got to tell the compiler/linker that. By default, you do this by simply passing the appropriate library files to the compiler, or by passing a library name. For example, the following two are equivalent (provided that libcurl.a resides in a directory where it can be found by the linker):
gcc codefile.c -lcurl
gcc codefile.c /path/to/libcurl.a
This tells the link editor (“linker”) to link your code file against the implementation in the static library libcurl.a (the compiler gcc actually ignores these arguments because it doesn't know what to do with them, and simply passes them on to the linker). Note that if a shared libcurl.so is also on the search path, -lcurl picks that one instead unless -static is given. Copying the library's code into your executable like this is called static linking. There's also dynamic linking, which takes place at startup of your program, and which happens with .dlls under Windows (whereas static libraries correspond to .lib files on Windows). Dynamic library files under Linux usually have the file extension .so.
The best way to learn more about these files is to familiarize yourself with the GCC linker, ld, as well as the excellent toolset binutils, with which you can edit/view library files effortlessly (any binary code files, really).
Is there an equivalent concept to Windows DLLs under *nix?
Yes, they are called "Shared Objects" or .so files. They are dynamically linked into your binary at runtime. On Linux you can use the "ldd" command on your executable to see which shared objects your binary is linked to. You can use ListDLLs from Sysinternals to accomplish the same thing on Windows.
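A program can also ask the same question about itself at runtime. The following is a minimal sketch, assuming Linux with glibc (or another libc that provides dl_iterate_phdr in link.h); collect and loaded_objects are names I chose for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <link.h>
#include <string>
#include <vector>

// Callback invoked once per object mapped into the process; it records the
// object's path name.
static int collect(struct dl_phdr_info* info, std::size_t, void* data) {
    auto* names = static_cast<std::vector<std::string>*>(data);
    names->push_back(info->dlpi_name ? info->dlpi_name : "");
    return 0;  // returning 0 tells dl_iterate_phdr to keep going
}

// Returns the names of the objects loaded into the current process.
// The main program itself is included, usually with an empty name.
std::vector<std::string> loaded_objects() {
    std::vector<std::string> names;
    dl_iterate_phdr(collect, &names);
    return names;
}
```

For a dynamically linked program the result should contain roughly the libraries that ldd prints, plus the executable itself.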
The compiler is allowed to do whatever it pleases, as long as, in effect, it acts as if you'd included the file. (All the compilers I know of, including GCC, simply include a file called math.h.)
And no, it doesn't usually contain the function definitions itself. That's libm.so, a "shared object", similar to windows .DLLs. It should be on every system, as it is a companion of libc.so, the C runtime.
Edit: And that's why you have to pass -lm to the linker if you use math functions - it instructs it to link against libm.so.
There is. The include does a textual include of the header file (which is standard C/C++ behavior). What you're looking for is the linker. The -l argument to gcc/g++ tells the linker which libraries to add in. For math (libm.so), you'd use -lm. The common pattern is:
source file: #include <foo.h>
gcc/g++ command line: -lfoo
shared library: libfoo.so
math.h is a slight variation on this theme.