C Preprocessor directives and Linking - c++

The #include directive will result in placing the content of the header file in the source code before compilation : for example if I included stdio.h , the preprocessor will work on placing all the content of stdio.h in the source code then compile isn't it ?
So let's pretend that I'm just using the printf() function in my code. So there must be something that will happen after the compilation and between the linking that will delete all function implementations that were included from the header file and only insert the printf() function implementation in the Import table of the executable , knowing that all other functions were compiled with the source. Can you explain that to me please ?

The printf function is not actually in the header file, it's in a library that is automatically linked to the executable by the linker.
Header files should contain only function prototypes, i.e. the declaration of the functions and not the definitions. And as there's no function definitions in the header file, no code will actually be generated for them, and the compiler will make sure that only the functions you actually call will get an entry in the generated object file so the linker knows about it.

A function prototype declared in a C header file acts as a compile-time constraint. It identifies the correct number of parameters and correct types of those parameters. The verification of such constraints happens during the compilation phase (i.e. *.c -> *.o).
The (static) linking phase eliminates unused binary objects from programming libraries on a per-object-file basis. A programming library is an archive of binary object files (*.o), each containing the implementation of a collection of functions, constants, etc. As for your question, the binary object file containing the implementation of printf (and everything else in the same object file) will be linked into your program. Other unused object files will be eliminated as a link-time optimization.

The header file only contains a declaration, e.g. of the form
size_t printf(char const *format, ...);
This tells the compiler that if it encounters the word printf and it is used as a function, then it can generate a function call.
The call instruction contains a placeholder for the actual address of the function, which is then inserted by the linker when building the final executable.
The linker generally just keeps a running list of yet unresolved symbols, and adds to this list when it encounters one of these placeholders, and fills in the actual address when it finds a definition for the symbol. In the case of printf, that definition is found in the C standard library, so the linker ensures that the library is loaded at runtime and the call instructions point at the right address.

Related

In the compilation system, how does linker (ld) know who to link myprogram.o to?

I recently read the CSAPP and had some doubts about the compilation system part of it.
Now we have a sample using HelloWorld.c(just print hello world). The book said in Pre-processor phase, they replace the "#include " line with the content of this header file. But when I open the stdio.h, I find that there is only a declaration for printf() and there is no concrete implementation. So in the compilation system, when will the specific implementation of printf() be introduced?
And the book also said, in linking phase, the linker(ld) linked helloworld.o and printf.o . Why the linker knows to link my object file to printf.o? In a compilation system, why does it declare this function in the first step(Pre-processor phase) and link the concrete implementation in the last step(linking phase)?
Practically, over-simplified:
You can compile a function into a library (ex. .a or .so file on unix).
The library has a function body (assembly instructions) and a function name. Ex. the library libc.so has printf function that starts at character number 0xaabbccdd in the library file libc.so.
You want to compile your program.
You need to know what arguments printf takes. Does it take int ? Does it take char *? Does it take uint_least64_t? It's in the header file - int printf(const char *, ...);. The header tells the compiler how to call the function (what parameters does the function take and what type it returns). Note that each .c file is compiled separately.
The function declaration (what arguments the function takes and what does it return) is not stored in the library file. It is stored in the header (only). The library has function name (only printf) and compiled function body. The header has int printf(const char *, ...); without function body.
You compile your program. The compiler generates the code, so that arguments with proper size are pushed onto the stack. And from the stack your code takes variable returned from the function. Now your program is compiled into assembly that looks like push pointer to "%d\n" on the stack; push some int on the stack; call printf; pop from the stack the returned "int"; rest of the instructions;.
Linker searches through your compiled program and it sees call printf. It then says: "Och, there is no printf body in your code". So then it searches printf in the libraries, to see where it is. The linker goes through all the libraries you link your program with and it finds printf in the standard library - it's in libc.so at address 0xaabbccdd. So linker substitutes call printf for goto libs.so file to address 0xaabbccdd kind-of instruction.
After all "symbols" (ie. function names, variables names) are "resolved" (the linker has found them somewhere), then you can run your program. The call printf will jump into the file libc.so at specified location.
What I have written above is only for illustration purposes.
Why the linker knows to link my object file to printf.o
Because the complier notes this inside what it produces, typically called object files (.o).
why does it declare this function in the first step ...
To know about it.
... and link the concrete implementation in the last step
Because there is no need to do this earlier.
All the C and C++ standards tell you is that you need to #include a given header file in order to introduce some functionality (on some platforms that might not even be necessary although inclusion is a good idea since then you're writing portable code).
That affords compilers a lot of flexibility.
The linking, if any, will be done automatically. Note that some functions might even be hardcoded into the compiler itself.
By default the library ( containing the implementation of printf ) is linked everytime in your C program.
By including headers you just specify (for the time being) at compile time that the implementations of the declared functions (inside the header) are somewhere else. And later in the linking phase, those function implementations are 'added' in your code.
Why the linker knows to link my object file to printf.o?
LD knows how to search and find them. You can see the with man ld.so:
If a shared object dependency does not contain a slash, then it is
searched for in the following order:
Using the directories specified in the DT_RPATH dynamic section attribute of the binary if present and DT_RUNPATH attribute does not
exist. Use of DT_RPATH is deprecated.
Using the environment variable LD_LIBRARY_PATH, unless the executable is being run in secure-execution mode (see below), in which
case this variable is ignored.
Using the directories specified in the DT_RUNPATH dynamic section attribute of the binary if present. Such directories are searched only
to find those objects required by DT_NEEDED (direct dependencies)
entries and do not apply to those objects' children, which must
themselves have their own DT_RUNPATH entries. This is unlike DT_RPATH,
which is applied to searches for all children in the dependency tree.
From the cache file /etc/ld.so.cache, which contains a compiled list of candidate shared objects previously found in the augmented
library path. If, however, the binary was linked with the -z nodeflib
linker option, shared objects in the default paths are skipped. Shared
objects installed in hardware capability directories (see below) are
preferred to other shared objects.
In the default path /lib, and then /usr/lib. (On some 64-bit architectures, the default paths for 64-bit shared objects are /lib64,
and then /usr/lib64.) If the binary was linked with the -z nodeflib
linker option, this step is skipped.
In a compilation system, why does it declare this function in the first step(Pre-processor phase) and link the concrete implementation in the last step(linking phase)?
In the compilation stage, you need to know what you're going to link to and compile accordingly, so it needs to read the .h files with the definition. In the linking stage, only .o files are needed.

C++ build process (Including)

I'm currently learning C++ from a book called Alex Allain - Jumping into c++, and i got stuck at chapter 21. It details the C++ build process and i got it, except 2 parts:
First:
"The header file should not contain any function definitions. If we had added a function definition to the header file and then included that header file into more than one source file, the function definition would have shown up twice at link time. That will confuse the linker."
Second:
"Never ever include a .cpp file directly. Including a .cpp file will just lead to problems because the compiler will compile a copy of each function definition in the .cpp file into each object file, and the linker will see multiple definitions of the same function. Even if you were incredibly careful about how you did this, you would also lose the time-saving benefits of separate compilation."
Can somebody explain them?
A C++ program is created from one or more translation units. Each translation unit (TU for short) is basically a single source file with all included header files. When you create an object file you are actually creating a TU. When you are linking, you are taking the object files (TUs) created by the compiler and link them with libraries to create an executable program.
A program can only have a single definition of anything. If you have multiple definitions you will get an error when linking. A definition can be a variable definition like
int a;
or
double b = 6.0;
It can also be a function definition, which is the actual implementation of the function.
The reason you can only have a single definition is because these definitions are mapped to memory addresses when the program is loaded to be executed. A variable or a function can not exist in two places at once.
This is one of the reasons you should not include source files into other source files. It is also the reason you should not have definitions in header files, because header files could be included into multiple source files as that would lead to the definition being in multiple TUs.
There are of course exceptions to this, like having a function being marked as inline or static. But that is solved because these definitions are not exported from the TU, the linker doesn't see them.

How does a function caller use a header file to determine what to do with a compiled binary?

My understanding is that C++ (and C, I guess) header files are never compiled, and simply act as an explanation of the interface of the C++ file they describe.
So if my header file describes a hello() function, some program that includes the header will know about hello() and how to call it and what arguments to give it, etc.
However, after compilation (and before linking, I guess? I'm not sure), when the hello.c file is binary machine code, and hello.h is still C++, how does the compiler/linker know how to call a function in the binary blob based on the presence of its declaration in the header file?
I understand concepts such as symbol tables, abstract syntax trees, etc (i.e., I have taken a compiler class in the past), but this is a gap in my knowledge).
The implementation of hello() assumes a certain calling convention (where are the parameters on the stack, who cleans up the stack the caller or the callee, etc).
The compiler generates code with the correct calling convention. It may use information from the header file to do this (e.g. the function is marked __stdcall in Windows program) or it may use it's default calling convention. The compiler will also use the header file to make sure your are calling the routine with the right number and types of parameters. Once the code is generated by the compiler the header file is not used again.
The linker is not concerned with calling convention it's primary responsibility is to patch together the binaries you've compiled by fixing up references among your modules and any libraries it calls.
A C/C++ compilation unit (cpp file / c file) includes all the header files (as text) and the code.
The header file helps explain how to produce the call instruction
push arg1
push arg2
call _some_function
If the compilation unit includes _some_function then this will be resolved at compile time.
Otherwise it becomes an undefined symbol. If so, when the linker comes along, it looks through all the object files and libraries to resolve all the undefined symbols.
So the header file helps code the assembly correctly.
Object and library files provide implementations.
The library files are optional. When a linker looks in a library file, it only gets added if it satisfies some symbol, otherwise it is not added to the binary.
Object files (ignoring optimization) will get added to the binary completely.
Building a C++ program is a two-step process: compile and link.
The header is for compilation of the module you are writing. The binary is for linking: it contains the compiled code for the method defined in the correspnding header. The header has to match what's already been compiled. At link time you will learn if your header has a method signature that matches what was compiled in the binary.

redefinition happend in which procedure? compile or link time?

If I defined a function twice, I'll get a redefinition error message, but
I'm confused that redefinition happened in compile or link time?
and why you can override malloc in libc without a redefinition error?
You get a function redefinition error when you have two function with the same prototype or signature (function signature is made of the function name number of parameters and parameter types, does NOT include the return type).
This is a compile time error if the compiler see two functions with the same signature:
int foo(int a);
double foo(int b);
Why you can override function calls in libraries? Let's look at how the code is build into an executable:
the compiler is called for each source file and outputs an object file: any function call which cannot be resolved (i.e. calling a function in a different file) is an external symbol which the linker will have to resolve.
the linker take all the object files and tries to resolve all the symbols; but it does this on a first come first served manner. For a external symbol it will consider the first definition it finds and not worry about the fact that there may be more definitions of the same symbol available.
So, the linker actually allows you to override a function's behavior. And it all depends on the order the files are linked - the first function definition it finds is the one used to resolve the symbol.
Hope this sheds some light on the matter.
Either or both. It can also result from the programmer editing source code or modifying build scripts.
"Redefinition" errors are emitted when the linker find two things (symbols) with the same name.
There are many reasons the linker might find two symbols with the same name. Some possibilities (there are many permutations) include;
An object file is specified twice in the link command. This usually results from an error in a build script.
Two object files that contain the same function definition. This results from code duplication - for example, a function definition being copied into different source files, which are then compiled and linked. It can also result from monkey business with the preprocessor (e.g. #includeing a file that contains the definition of a global variable by two source files).
The causes of things like the above are generally programmer error (e.g. supplying a bad linker command in a build script, misuse of the preprocessor, copying and pasting code between projects.
The reason functions in libraries like libc can often be "overridden" is that the linker typically only looks for symbols in libraries if it can't find them in object files. So, if an object file defines malloc() the linker will resolve all calls to that, and not attempt to resolve using the version in a library. This sort of thing is quite dangerous, because some other functions within libraries (e.g. even within libc) may resolve directly to the original malloc() (e.g. some calls may be inlined) which can unpredictable behaviours. This sort of behaviour is also linker specific: although this sort of thing is common with unix/linux variants, there are systems where the linkers do things differently.

How do C++ header files work?

When I include some function from a header file in a C++ program, does the entire header file code get copied to the final executable or only the machine code for the specific function is generated. For example, if I call std::sort from the <algorithm> header in C++, is the machine code generated only for the sort() function or for the entire <algorithm> header file.
I think that a similar question exists somewhere on Stack Overflow, but I have tried my best to find it (I glanced over it once, but lost the link). If you can point me to that, it would be wonderful.
You're mixing two distinct issues here:
Header files, handled by the preprocessor
Selective linking of code by the C++ linker
Header files
These are simply copied verbatim by the preprocessor into the place that includes them. All the code of algorithm is copied into the .cpp file when you #include <algorithm>.
Selective linking
Most modern linkers won't link in functions that aren't getting called in your application. I.e. write a function foo and never call it - its code won't get into the executable. So if you #include <algorithm> and only use sort here's what happens:
The preprocessor shoves the whole algorithm file into your source file
You call only sort
The linked analyzes this and only adds the source of sort (and functions it calls, if any) to the executable. The other algorithms' code isn't getting added
That said, C++ templates complicate the matter a bit further. It's a complex issue to explain here, but in a nutshell - templates get expanded by the compiler for all the types that you're actually using. So if have a vector of int and a vector of string, the compiler will generate two copies of the whole code for the vector class in your code. Since you are using it (otherwise the compiler wouldn't generate it), the linker also places it into the executable.
In fact, the entire file is copied into .cpp file, and it depends on compiler/linker, if it picks up only 'needed' functions, or all of them.
In general, simplified summary:
debug configuration means compiling in all of non-template functions,
release configuration strips all unneeded functions.
Plus it depends on attributes -> function declared for export will be never stripped.
On the other side, template function variants are 'generated' when used, so only the ones you explicitly use are compiled in.
EDIT: header file code isn't generated, but in most cases hand-written.
If you #include a header file in your source code, it acts as if the text in that header was written in place of the #include preprocessor directive.
Generally headers contain declarations, i.e. information about what's inside a library. This way the compiler allows you to call things for which the code exists outside the current compilation unit (e.g. the .cpp file you are including the header from). When the program is linked into an executable that you can run, the linker decides what to include, usually based on what your program actually uses. Libraries may also be linked dynamically, meaning that the executable file does not actually include the library code but the library is linked at runtime.
It depends on the compiler. Most compilers today do flow analysis to prune out uncalled functions. http://en.wikipedia.org/wiki/Data-flow_analysis