Does the linker link an object file with itself? - c++

From what I read in other SO answers, like this and this, the compiler converts the source code into an object file. And the object files might contain references to functions like printf that need to be resolved by the linker.
What I don't understand is: when both the declaration and the definition exist in the same file, as in the following case, does the compiler or the linker resolve the reference to return1?
Or is this just part of compiler optimization?
int return1();
int return2() {
int b = return1();
return b + 1;
}
int return1() {
return 1;
}
int main() {
int b = return2();
}
I have ensured that preprocessing has nothing to do with this by running g++ -E main.cpp.
Update 2020-7-22
The answers are all helpful! Thanks!
From the answers below, it seems to me that the compiler may or may not resolve the reference to return1. However, I'm still unclear: if there's only one translation unit, as in the example I gave, and the compiler did not resolve the reference, does that mean the linker must resolve it?
Since it seems to me that the linker links several (more than one) object files together, if there's only one translation unit (object file), would the linker need to link the object file with itself? Am I right?
And is there any way to know for sure which one is the case on my computer?

It depends. Both options are possible, and so are options that you didn't mention, like either the compiler or the linker rearranging the code so that none of the functions exists any more. It's fine to think of compilers emitting references to functions and linkers resolving those references as a way of understanding C++, but bear in mind that all the compiler and linker have to do is produce a working program, and there are many different ways to do that.
One thing the compiler and linker must do, however, is make sure that any calls to standard library functions happen (like printf, as you mentioned), and happen in the order that the C++ source specifies. Apart from that (and some other similar concerns) they can more or less do as they wish.
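If you want to see which case applies on your own machine (this is toolchain-specific; the following assumes a typical GCC/Clang setup on Linux), compile without linking using g++ -c main.cpp and inspect the object file with nm main.o and objdump -dr main.o. If the call inside return2 still carries a relocation against the mangled name of return1 (usually something like _Z7return1v under the Itanium ABI), the compiler has left the reference for the linker to patch; if you compile with -O2 you will often find that the call has been inlined and no reference to return1 remains at all.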

[lex.phases]/1.9, covering the final phase of translation, states [emphasis mine]:
All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.
It is, however, up to the compiler to decide whether a library component is a single translation unit or a combination of them, as governed by [lex.separate]/2 [emphasis mine]:
[ Note: Previously translated translation units and instantiation units can be preserved individually or in libraries. The separate translation units of a program communicate ([basic.link]) by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units can be separately translated and then later linked to produce an executable program.  — end note ]
OP: [...] does the compiler or the linker resolve the reference to return1?
Thus, even if return1 has external linkage, since it is defined in the translation unit where it is referred to (in return2), the linker should not need to resolve the reference to it, as its definition exists in the current translation. The standard passage is, however, (likely intentionally) a bit vague about when linking to satisfy external references needs to occur, and I do not consider it non-compliant for an implementation to defer resolving the reference to return1 in return2 until the linking phase.

Practically speaking, the problem is with the following code:
static int return1();
int return2() {
int b = return1();
return b + 1;
}
int return1() {
return 1;
}
The problem for the linker is that each translation unit can now contain its own return1, so the linker would have a problem choosing the right return1. There are tricks around this, e.g. adding the translation unit name to the function name. Most ABIs do not do that, but the C++ standard would allow it. For anonymous namespaces, however, i.e. namespace { int function1(); }, ABIs do use such tricks.
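As a minimal sketch of why internal linkage sidesteps this (the file names and helper functions below are invented for illustration), two translation units can each carry their own return1, and the references are resolved before the linker ever has to choose:
// a.cpp (hypothetical)
static int return1() { return 1; }      // internal linkage: visible only inside a.cpp
int from_a() { return return1(); }      // this call never becomes an external reference

// b.cpp (hypothetical)
namespace {                             // unnamed namespace: effectively the same idea
int return1() { return 100; }
}
int from_b() { return return1(); }

// main.cpp
int from_a();
int from_b();
int main() { return from_a() + from_b(); }   // links fine: no clash on return1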

Related

Why does the one definition rule exist in C/C++

In C and C++, you can't have a function with two definitions. For example, say we have the following two files:
1.c:
int main(){ return 0; }
2.c:
int main(){ return 0; }
Issuing the command gcc 1.c 2.c will give you a duplicate symbol linker error.
Why doesn't the same happen with structs and classes? Why are we allowed to have multiple definitions of the same struct as long as they have the same tokens?
To answer this question, one has to delve into the compilation process and what is needed in each part (the question of why these steps are performed is more historical, going back to the beginnings of C before its standardization).
C and C++ programs are compiled in multiple steps:
Preprocessing
Compilation
Linkage
Preprocessing is everything that starts with #; it's not really important here.
Compilation is performed on each and every translation unit (typically a single .c or .cpp file plus the headers it includes). The compiler takes one translation unit at a time, reads it and produces an internal list of classes and their members, and then assembly code for each function in the given unit (based on that list of structures). If a function call is not inlined (e.g. it is defined in a different TU), the compiler produces a "link" ("please insert function X here") for the linker to read.
Then the linker takes all of the compiled translation units and merges them into one binary, substituting all the links specified by the compiler.
Now, what is needed at each phase?
For the compilation phase, you need:
the definition of every class used in this file - the compiler needs to know the size and offset of each class member to produce assembly;
the declaration of every function used in this file - to produce those "links".
Since function definitions are not needed to produce assembly (as long as they are compiled somewhere), they are not needed in the compilation phase, only in the linking phase.
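A hedged sketch of that split (the header and file names here are invented): a declaration alone lets the compiler emit a call, but the full class definition is needed to lay out objects, and the function definition is only needed once the program is linked.
// widget.h (hypothetical)
struct Widget { int x; int y; };        // full definition: needed for sizes and offsets
int area(const Widget& w);              // declaration: enough to emit a call

// user.cpp
#include "widget.h"
int twice_area(const Widget& w) {
return 2 * area(w);                     // compiles fine; the call is left as a "link"
}

// widget.cpp
#include "widget.h"
int area(const Widget& w) { return w.x * w.y; }   // the definition the linker resolves to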
To sum up:
The One Definition Rule is there to protect programmers from themselves. If they accidentally defined a function twice, the linker would notice that and no executable would be produced.
However, class definitions are required in every translation unit, and therefore such a rule cannot be set up for them. Since it cannot be enforced by the language, programmers have to be responsible beings and not define the same class in different ways.
The ODR also has other consequences, e.g. you have to define template functions (or template class methods) in header files. You can also take responsibility and tell the compiler "every definition of this function will be the same, trust me dude" by making the function inline.
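For example (a small sketch; the header and function names are invented), an inline function or a template can live in a header precisely because every translation unit that includes it gets an identical definition and the toolchain is allowed to fold them into one:
// util.h (hypothetical)
#ifndef UTIL_H
#define UTIL_H
inline int square(int x) { return x * x; }     // identical in every TU that includes this
template <typename T>
T cube(T x) { return x * x * x; }              // templates follow the same rule
#endif

// a.cpp and b.cpp can both #include "util.h" and both call square/cube;
// linking the two object files together produces no duplicate-definition error.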
There is no use case for a function with 2 definitions. Either the two definitions would have to be the same, making it useless, or the compiler wouldn't be able to tell which one you meant.
This is not the case with classes or structures. There is also a large advantage to allowing multiple definitions of them, i.e. if we want to use a class or struct in multiple files. (This leads indirectly to multiple definitions because of includes.)
Structures, classes, unions and enumerations define types that can be used in several compilation units to define objects of those types. So each compilation unit needs to know how the types are defined, for example to correctly allocate memory for an object or to be sure that a given member of a class does indeed exist.
For functions (as long as they are not inline functions) it is enough to have their declaration without their definition in order to, for example, generate a function call.
But the function definition must be single. Otherwise the compiler will not know which function to call, or the object code will be too big due to duplication and will be error-prone.
It's quite simple: It's a question of scope. Non-static functions are seen (callable) by every compilation unit linked together, while structures are only seen in the compilation unit where they are defined.
For example, it's valid to link the following together because it's clear which definition of struct Foo and which definition of f is being used:
1.c:
struct Foo { int x; };
static void f(void) { struct Foo foo; ... }
2.c:
struct Foo { double d; };
static void f(void) { struct Foo foo; ... }
int main(void) { ... }
But it isn't valid to link the following together because the linker wouldn't know which f to call.
1.c:
void f(void) { ... }
2.c:
void f(void) { ... }
int main(void) { f(); }
Actually, every programming element is associated with a scope of applicability, and within that scope you cannot have the same name associated with multiple definitions of an element. In the compiled world:
You cannot have more than one class definition with the same name within a single file. But you can have it in different compilation units.
You cannot have the same function or global variable name within a single link unit (library or executable), but you can potentially have functions named the same within different libraries.
You cannot have shared libraries with the same name situated in the same directory, but you can have them in different directories.
C/C++ compilation puts a high priority on compilation speed. Checking two entities such as functions or classes for identity is a time-consuming task, so it is not done; only names are considered for comparison. It is better to assume that two definitions are different and error out than to check them for identity. The only exception to this rule is text macros.
Macros are a preprocessor concept, and historically it has been allowed to have multiple identical macro definitions. If a definition changes, a warning is generated. Comparing macro content is easy, just a simple string comparison, but some macro definitions can be huge.
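A tiny illustration of that rule:
#define BUFFER_SIZE 128
#define BUFFER_SIZE 128      // identical redefinition: accepted
// #define BUFFER_SIZE 256   // a different redefinition would draw a diagnostic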
Types are a compiler concept and are resolved by the compiler. Types do not exist in object libraries; they are represented only by the sizes of the corresponding variables. So there is no reason to check for type name collisions at this level.
Functions and variables, on the other hand, are named pointers to executable code or data. They are the building blocks of applications. Applications are in some cases assembled from code and libraries coming from all around the world. In order to use someone else's function you had better know its name, and you do not want the same name to be used by someone else. Within a shared library, the names of functions and variables are usually stored in a hash table, and there is no place for duplicates there.
And, as I already mentioned, checking functions for identical contents is seldom done; there are some cases where it happens, but not in C or C++.
The reason for forbidding two different definitions of the same thing in a program is to avoid the ambiguity of deciding which definition to use at run time.
If you had two different implementations of the same thing coexisting in a program, then there would be the possibility of aliasing them (giving each a different name) behind a common reference and deciding at runtime which of the two to use.
Anyway, in order to distinguish them, you have to be able to tell the compiler which one you want to use. In C++ you can overload a function, giving it the same name and a different list of parameters, so you can indicate which of the two you want. But in C, the compiler only preserves the name of the function in order to resolve, at link time, which definition matches the name you use in a different compilation unit. If the linker ends up with two different definitions with the same name, it is incapable of deciding which one to use, so it emits an error and gives up the build process.
What would be the point of using this ambiguity productively? That is the question you actually have to ask yourself.

How does the compiler know how other .cpp files use a static const member?

Could someone please explain this example from the most authoritative ISO C++ FAQ? The code goes like this:
// Fred.h
class Fred {
public:
static const int maximum = 42;
// ...
};
// Fred.cpp
#include "Fred.h"
const int Fred::maximum;
// ...
And the statement I can't get is:
If you ever take the address of Fred::maximum, such as passing it by reference or explicitly saying &Fred::maximum, the compiler will make sure it has a unique address. If not, Fred::maximum won’t even take up space in your process’s static data area.
The compiler processes .cpp files separately and does not know what other files do with data defined in the file currently being processed. So how can the compiler decide whether or not it should allocate a unique address?
The original item is here: https://isocpp.org/wiki/faq/ctors#static-const-with-initializers
The compiler does not decide anything. For translation units where the static class member is not defined, the object module produced by the compiler contains an unresolved reference to the symbol.
When all the object modules get linked together, the linker is responsible for finishing the job, and resolving all unresolved references from translation units referencing the static symbol to the lone translation unit that has the symbol defined.
The FAQ entry says that const int Fred::maximum; must be defined in exactly one compilation unit. However, this is only true if the variable is odr-used by the program (for example, if a reference is bound to it).
If the variable is not odr-used then the definition can be omitted.
However, if the variable is actually odr-used but has no definition, then the behaviour is undefined, with no diagnostic required. Typically, if the variable's address is required but the definition was omitted, a good-quality linker will emit an "undefined reference" error.
But you don't always want to be relying on particular manifestations of undefined behaviour. So it is good practice to always include the definition const int Fred::maximum;.
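A small sketch of the distinction (the helper function below is invented):
#include "Fred.h"

int read_it() {
return Fred::maximum;            // only the value is used: not an odr-use,
}                                // so no out-of-class definition is needed

void takes_ref(const int& r);    // hypothetical function taking a reference

void bind_it() {
takes_ref(Fred::maximum);        // binds a reference to it: this is an odr-use, so
}                                // const int Fred::maximum; must be defined in some .cpp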
The quoted paragraph in your question is meant to address a potential programmer concern: "Well, can't I save 4 bytes in my static data area by omitting the definition in some cases?"
It is saying that the compiler/linker could perform whole program analysis, and make its own optimization decision to omit the definition once it has determined that the definition was not used.
Even though the line const int Fred::maximum; is defined as allocating memory for an int, this is a permitted optimization because there is no way that a conforming program could measure whether or not memory had actually been allocated for the int which is not odr-used.
The author of that FAQ entry clearly expects that a compiler/linker would in fact do this.
The wording in the Standard about odr-use is designed to support the following model of compilation/linking:
If some code requires the variable to have an address, the object file will contain a reference to the variable.
Optimization may remove some code paths that are never called, etc.
At link-time, when these references are resolved, that reference will be bound to the definition of the variable.
It does not require the compiler to produce an "undefined reference" error message because this would make it harder for compilers to optimize. Another stage of optimization might happen to entirely remove the part of the object file containing the reference. For example, if it turned out the odr-use only ever occurred in a function that was never called.

How does this header-only library guard against linker problems?

After reading this question I thought I understood everything, but then I saw this file from a popular header-only library.
The library uses #ifndef include guards, but the SO question points out that these are NOT adequate protection against multiple-definition errors across multiple TUs.
So one of the following must be true:
It is possible to avoid multiple-definition linker errors in ways other than those described in the SO question. Perhaps the library is using techniques not mentioned in the other SO question that are worthy of additional explanation.
The library assumes you won't include its header files in more than one translation unit -- this seems fragile, since a robust library shouldn't impose such an assumption on its users.
I'd appreciate having some light shed on this seemingly simple curiosity.
A header that causes linking problems when included in multiple translation units is one that will (attempt to) define some object (not just, for an obvious example, a type) in each source file where it's included.
For example, if you had something like: int f = 0; in a header, then each source file into which it was included would attempt to define f, and when you tried to link the object files together you'd get a complaint about multiple definitions of f.
The "technique" used in this header is simple: it doesn't attempt to define any actual objects. Rather, it includes some typedefs, and the definition of one fairly large class--but not any instances of that class or any instance of anything else either. That class includes a number of member functions, but they're all defined inside the function definition, which implicitly defines them as inline functions (so defining separately in each translation unit in which they're used is not only allowed, but required).
In short, the header only defines types, not objects, so there's nothing there to cause linker collisions when it's included in multiple source files that are linked together.
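A hedged sketch of the difference (the header and names are invented):
// safe_header.h -- only types and implicitly-inline members: fine in many TUs
struct Counter {
int value = 0;
void bump() { ++value; }         // defined in-class, therefore implicitly inline
};

// unsafe_header.h
// int global_count = 0;         // defines an object: two TUs including this will clash
extern int global_count;         // the safe form: declare here, define in exactly one .cpp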
If the header defines items, as opposed to just declaring them, then it's possible to include it in more than one translation unit (i.e. cpp file) and have multiple definitions and hence linker errors.
I've used boost's unit test framework which is header only. I include a specified header in only one of my own cpp files to get my project to compile. But I include other unit test headers in other cpp files which presumably use the items that are defined in the specified header.
Header-only libraries like the Boost C++ Libraries consist (mostly) of stand-alone templates, and as such are compiled at compile time and don't require any linkage to binary libraries (which would need separate compilation). One designed to never need linkage is the great Catch.
Templates are a special case in C++ regarding multiple definitions, as long as they're the same. See the "One Definition Rule" section of the C++ standard:
There can be more than one definition of a class type (Clause 9), enumeration type (7.2), inline function with external linkage (7.1.2), class template (Clause 14), non-static function template (14.5.6), static data member of a class template (14.5.1.3), member function of a class template (14.5.1.1), or template specialization for which some template parameters are not specified (14.7, 14.5.5) in a program provided that each definition appears in a different translation unit, and provided the definitions satisfy the following requirements. [...]
This is then followed by a list of conditions that make sure the template definitions are identical across translation units.
This specific quote is from the 2014 working draft, section 3.2 ("One Definition Rule"), subsection 6.
This header file can indeed be included in different source files without causing "multiple symbol definition" errors.
This happens because it is fine to have multiple identically named symbols in different object files as long as these symbols are either weak or local.
Let's take a closer look at the header file. It (potentially) defines several objects like this helper:
static int const helper[] = {0,7,8,13};
Each translation unit that includes this header file will have its own copy of helper. However, there will be no "multiple symbol definition" errors, since helper is static and thus has internal linkage. The symbols created for these helpers will be local, and the linker will just happily put them all into the resulting executable.
The header file also defines a class template connection. But it is also okay. Class templates can be defined multiple times in different translation units.
In fact, even regular class types can be defined multiple times (I've noticed that you've asked about this in the comments). Symbols created for member functions are usually weak symbols. Once again, weak symbols don't cause "multiple symbol definition" errors, because they can be redefined. The linker will just keep redefining weak symbols with names it has already seen until only one symbol per member function is left.
There are also other cases where certain things (like inline functions and enumerations) can be defined several times in different translation units (see §3.2), and the mechanisms for achieving this differ (compare class templates and inline functions). But the general rule is not to place anything with external linkage in header files unless it falls into one of these special categories. As long as you follow this rule, you're really unlikely to stumble upon multiple-symbol-definition problems.
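As a rough summary of which header constructs end up as which kind of symbol on a typical ELF toolchain (this reflects how GCC/Clang commonly behave, not something the standard mandates):
// all of the following are safe in a header included by many TUs
static const int helper[] = {0, 7, 8, 13};          // internal linkage: a local symbol per TU
inline int twice(int x) { return 2 * x; }            // inline function: usually a weak symbol
struct Conn { int id() const { return 42; } };       // in-class member: implicitly inline, weak
template <typename T> T identity(T x) { return x; }  // instantiations are emitted as weak symbols

// int global = 0;    // external linkage, strong symbol: a duplicate-definition error
//                    // if the header is included by more than one TU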
And yes, include guards have nothing to do with this.

Does use of unnamed namespaces reduce link time?

Suppose I have a large system with many object files such that link time is a problem. Suppose also that I know that many of the classes and functions in my system are not used outside their translation unit.
Is it reasonable to assume that if I reduce the number of symbols with external linkage, my link-time will be reduced?
If so, will putting the entities (e.g., classes and functions) that are used in only a single TU into unnamed namespaces do me any good? Technically, the entities with external linkage will retain their external linkage in an unnamed namespace, but, as the C++11 standard notes,
Although entities in an unnamed namespace might have external linkage, they are effectively qualified by a name unique to their translation unit and therefore can never be seen from any other translation unit.
Do linker algorithms perform optimizations based on the knowledge that entities with external linkage in unnamed namespaces aren't visible outside their namespaces?
Yes, I think it does reduce link time. I found this on the Google Chromium site:
"Unnamed namespaces restrict these symbols to the compilation unit, improving function call cost and reducing the size of entry point tables." Here is the link.
I know this is about the Chromium project, but it should apply to other C++ projects.
I don't see how a linker could do such optimizations, because by the time the linker gets a hold of the symbol(s) in question they look like ordinary decorated external-linkage symbols. Unless the linker has specific information about how the compiler decorates names in an anonymous namespace I can't see any way that it could optimize its work.
Have you confirmed that your linker is in fact CPU bound and not I/O bound? If it's not CPU bound already it's probably not going to help to reorganize your code.

How does linkage and name mangling work?

Let's take this code sample:
//header
struct A { };
struct B { };
struct C { };
extern C c;
//code
A myfunc(B&b){ A a; return a; }
void myfunc(B&b, C&c){}
C c;
Let's go through this line by line, starting from the code section.
When the compiler sees the first myfunc function it does not care about A or B, because their use is internal. Each C++ file will know what the function takes in and what it returns. However, there needs to be a name for each of the two overloads, so how is that chosen, and how does the linker know which one means what?
Next is C c;. I once had a bug where the linker wouldn't recognize it and thus wouldn't allow me access to c in other C++ files. It was because that cpp file didn't know c was extern, and I had to mark it as extern in the header before I could link successfully. Now I am not sure whether the class type has any involvement with the linker and the variable c. I don't know how RTTI will be involved, but I do know c needs to be visible to other files.
How do the linker, name mangling, and so on actually work?
We first need to understand where compilation ends and linking begins. Compilation involves taking a compilation unit (a C or C++ source file) and turning it into an object file. Simplistically, this involves generating snippets of machine code for each function as well as a symbol table for all functions and static (global) variables. Placeholders are used for any symbols needed by the compilation unit that are external to the said compilation unit.
The linker is then responsible for loading all the object files and resolving all the placeholder symbols to real addresses (or offsets, for position-independent code). This is placed into various sections that can be read by the operating system's dynamic loader when loading an executable.
So for the specifics. In order to avoid errors during linking, the compiler requires you to declare all external symbols that will be used by the current compilation unit. For global variables one must use the extern keyword, for functions this is optional.
All functions and global variables defined in a compilation unit have external linkage (i.e., they can be referenced by other compilation units) unless one declares them with the static keyword (or, in C++, places them in an unnamed namespace). In C++, the vtable will also have a symbol needed for linkage.
Now, in C++, since functions can be overloaded, the parameters also form part of the function's identity. Since machine code is just addresses and registers, extra information needs to be added to the function name in the symbol table. This extra parameter information comes in the form of a mangled name and ensures that the linker links to the correct version of an overloaded function.
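As a sketch of what that looks like for the overloads above (the mangled names shown follow the Itanium C++ ABI used by g++ and clang; other compilers use different schemes):
A myfunc(B& b);            // typically becomes the symbol _Z6myfuncR1B
void myfunc(B& b, C& c);   // typically becomes _Z6myfuncR1BR1C (parameters encoded, return type not)
extern C c;                // a plain global variable typically keeps the undecorated name c
On a typical Unix toolchain you can list these symbols with nm on the object file and turn them back into readable signatures with nm -C or c++filt.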
If you really are interested in the gory details take a look at the ELF file format (PDF) used extensively on Linux. Windows has a different format but the principles can be expected to be the same.
Name mangling on the Itanium (and ARM) platforms can be found here.
http://en.wikipedia.org/wiki/Name_mangling