How do linkage and name mangling work? - C++

Let's take this code sample:
// header
struct A { };
struct B { };
struct C { };
extern C c;

// code
A myfunc(B& b) { A a; return a; }
void myfunc(B& b, C& c) { }
C c;
Let's go through this line by line, starting from the code section.
When the compiler sees the first myfunc overload it does not care about A or B, because their use is internal to the function: each C++ file that includes the header knows what the function takes and what it returns. But there needs to be a distinct name for each of the two overloads, so how is that name chosen, and how does the linker know which one means what?
Next is C c;. I once had a bug where the linker wouldn't recognize, and thus allow me access to, c in other C++ files. It was because that .cpp file didn't know c was extern, and I had to mark it as extern in the header before I could link successfully. Now I am not sure whether the class type has any involvement with the linker beyond the variable c. I don't know how RTTI is involved, but I do know c needs to be visible to other files.
How do the linker and name mangling actually work?

We first need to understand where compilation ends and linking begins. Compilation involves taking a compilation unit (a C or C++ source file) and turning it into an object file. Simplistically, this involves generating snippets of machine code for each function, as well as a symbol table for all functions and static (global) variables. Placeholders are used for any symbols needed by the compilation unit that are external to it.
The linker is then responsible for loading all the object files and resolving all the placeholder symbols with real addresses (or offsets, for position-independent code). The result is placed into various sections that can be read by the operating system's dynamic loader when loading an executable.
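As a concrete sketch (hypothetical file name, assuming a GNU/Linux toolchain), you can watch both halves of this happen by compiling a translation unit that calls a function defined elsewhere and inspecting its symbol table with nm:

// caller.cpp
int helper(int);                         // declared, but defined in some other unit
int use_it() { return helper(41) + 1; }  // defined right here

$ g++ -c caller.cpp
$ nm -C caller.o
                 U helper(int)
0000000000000000 T use_it()

The U entry is the placeholder the linker must later resolve; the T entry is a symbol this object file provides to others. Exact addresses and formatting vary by platform.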
So, to the specifics. In order to avoid errors during linking, the compiler requires you to declare all external symbols that will be used by the current compilation unit. For global variables one must use the extern keyword; for functions the keyword is optional.
All functions and global variables defined in a compilation unit have external linkage (i.e., they can be referenced from other compilation units) unless one gives them internal linkage with the static keyword (or, in C++, an unnamed namespace). In C++, a class's vtable also gets a symbol that must be resolved at link time.
Now, in C++, since functions can be overloaded, the parameters also form part of the function's name. Since machine code only deals with addresses and registers, this extra information has to be encoded into the function's name in the symbol table. The extra parameter information comes in the form of a mangled name, and it ensures that the linker links to the correct version of an overloaded function.
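For instance (a sketch of the question's own overloads, assuming the Itanium C++ ABI used by GCC and Clang; other compilers use different schemes):

A myfunc(B& b);           // mangled to something like _Z6myfuncR1B
void myfunc(B& b, C& c);  // mangled to something like _Z6myfuncR1BR1C

The parameter types are encoded (R1B is "reference to B"), while the return type of an ordinary function is not. The global variable c, by contrast, keeps its plain symbol name; tools such as nm -C or c++filt translate mangled names back into readable signatures.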
If you really are interested in the gory details, take a look at the ELF file format (PDF) used extensively on Linux. Windows uses a different format (PE/COFF), but the principles are the same.
Name mangling on the Itanium (and ARM) platforms is documented here.

http://en.wikipedia.org/wiki/Name_mangling

Related

Why does the one definition rule exist in C/C++

In C and C++, you can't have a function with two definitions. For example, say we have the following two files:
1.c:
int main() { return 0; }
2.c:
int main() { return 0; }
Issuing the command gcc 1.c 2.c will give you a duplicate symbol linker error.
Why doesn't the same happen with structs and classes? Why are we allowed to have multiple definitions of the same struct, as long as they consist of the same tokens?
To answer this question, one has to delve into the compilation process and what is needed in each step (the question of why these steps are performed is more historical, going back to the beginnings of C before its standardization).
C and C++ programs are compiled in multiple steps:
Preprocessing
Compilation
Linkage
Preprocessing handles everything that starts with #; it's not really important here.
Compilation is performed on each and every translation unit (typically a single .c or .cpp file plus the headers it includes). The compiler takes one translation unit at a time, reads it, produces an internal list of classes and their members, and then emits assembly code for each function in the unit (based on that list). If a function call is not inlined (e.g. it is defined in a different TU), the compiler produces a "link" - "please insert function X here" - for the linker to resolve.
Then the linker takes all of the compiled translation units and merges them into one binary, substituting all the links left by the compiler.
Now, what is needed at each phase?
For the compilation phase, you need:
the definition of every class used in this file - the compiler needs to know the size and offset of each class member to produce assembly
the declaration of every function used in this file - to produce those "links".
Since function definitions are not needed to produce assembly (as long as they are defined in some translation unit), they are not needed in the compilation phase, only in the linking phase.
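A quick sketch of that split (hypothetical names, GCC-style command lines):

// user.cpp
int compute(int x);                 // declaration alone is enough to compile
int main() { return compute(2); }

$ g++ -c user.cpp   # succeeds: the call is emitted as a "link" to compute
$ g++ user.o        # fails with an undefined-reference error, since no
                    # translation unit supplied the definition of compute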
To sum up:
The One Definition Rule is there to protect programmers from themselves. If they accidentally define a function twice, the linker will notice and no executable is produced.
However, class definitions are required in every translation unit, and therefore such a rule cannot be applied to them. Since it cannot be enforced by the language, programmers have to be responsible and not define the same class in different ways.
The ODR also has other consequences: e.g. you have to define template functions (and template class methods) in header files. You can also take the responsibility yourself and tell the compiler "every definition of this function will be the same, trust me" by marking the function inline.
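A minimal sketch of that "trust me" case (hypothetical names): an inline function may be defined in a header and pulled into many translation units, provided every copy is identical.

// helpers.h
inline int twice(int x) { return 2 * x; }  // legal in any number of TUs

// a.cpp and b.cpp each contain:
#include "helpers.h"
// ...the linker folds the identical copies into one; no duplicate-symbol error.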
There is no use case for a function with 2 definitions. Either the two definitions would have to be the same, making it useless, or the compiler wouldn't be able to tell which one you meant.
This is not the case with classes or structures. There is also a large advantage to allowing multiple definitions of them, i.e. if we want to use a class or struct in multiple files. (This leads indirectly to multiple definitions because of includes.)
Structures, classes, unions and enumerations define types that can be used in several compilation units to define objects of those types. So each compilation unit needs to know how the types are defined, for example to correctly allocate memory for an object or to be sure that a specified member of a class does indeed exist.
For functions (if they are not inline functions) it is enough to have their declaration, without the definition, to generate, for example, a function call.
But the function definition must be unique. Otherwise the compiler would not know which function to call, or the object code would grow too large due to duplication and become error-prone.
It's quite simple: it's a question of scope. Non-static functions are visible (callable) from every compilation unit linked together, while structures are only seen in the compilation unit where they are defined.
For example, it's valid to link the following together because it's clear which definition of struct Foo and which definition of f is being used:
1.c:
struct Foo { int x; };
static void f(void) { struct Foo foo; ... }
2.c:
struct Foo { double d; };
static void f(void) { struct Foo foo; ... }
int main(void) { ... }
But it isn't valid to link the following together because the linker wouldn't know which f to call.
1.c:
void f(void) { ... }
2.c:
void f(void) { ... }
int main(void) { f(); }
Actually, every programming element is associated with a scope of applicability, and within that scope you cannot have the same name associated with multiple definitions of an element. In the compiled world:
You cannot have more than one class definition with the same name within a single compilation unit, but you can have one in each of several compilation units.
You cannot have the same function or global variable name within a single link unit (library or executable), but you can potentially have functions with the same name in different libraries.
You cannot have shared libraries with the same name in the same directory, but you can have them in different directories.
C/C++ compilation cares a great deal about compilation speed. Checking two entities such as functions or classes for identity is a time-consuming task, so it is not done; only names are compared. It is better to treat two types as different and report an error than to check them for identity. The only exception to this rule is text macros.
Macros are a preprocessor concept, and historically it has been allowed to have multiple identical macro definitions. If a definition changes, a warning is generated. Comparing macro contents is easy - just a simple string comparison - although some macro definitions can be huge.
Types are a compiler concept and are resolved by the compiler. Types do not exist in object files or libraries; they are represented only by the sizes of the corresponding variables. So there is no reason to check for type-name collisions at that level.
Functions and variables, on the other hand, are named pointers to executable code or data. They are the building blocks of applications. Applications are assembled from code and libraries coming, in some cases, from all around the world. In order to use someone else's function you'd better know its name, and you do not want the same name to be used by someone else. Within a shared library, the names of functions and variables are usually stored in a hash table; there is no place for duplicates there.
And as I already mentioned, checking functions for identical contents is seldom done; there are some cases, but not in C or C++.
The reason for forbidding two different definitions of the same thing is to avoid the ambiguity of deciding which definition to use at run time.
If you have two different implementations of the same thing coexisting in a program, then there's the possibility of aliasing them (giving each a different name) behind a common reference and deciding at runtime which of the two to use.
Anyway, in order to distinguish between them, you have to be able to tell the compiler which one you want to use. In C++ you can overload a function, giving the overloads the same name and different parameter lists, so you can indicate which one you want. But in C, the compiler only preserves the name of the function so that the linker can match, at link time, a definition to the name you use in a different compilation unit. If the linker ends up with two different definitions with the same name, it is incapable of deciding which one to use, so it emits an error and gives up on the build.
What productive use could this ambiguity possibly have? That is the question you actually have to ask yourself.

symbol name in shared object differs from function in .cpp file

In a project environment, I wanted to change a source file for a shared object from C to C++. I made sure to change its entry in the CMakeLists.txt too:
add_library(*name* SHARED *mysource*.cpp)
target_link_libraries(*name as target* *item*)
The build process runs fine. Unfortunately, when I try to use it, I get an error that the functions inside the .so cannot be found.
After checking the dynamic symbol table inside the shared object with objdump -T, I found out that the names of the symbols differ from the ones in the source file.
e.g.
int sr_plugin_init_cb(sr_session_ctx_t *session, void **private_ctx);
becomes
_Z17sr_plugin_init_cbP16sr_session_ctx_sPPv
Visual Studio Code says it can build the object and link the shared library correctly; it also changed from C to CXX in its output and gives me no errors, even though some of the code is C++-only.
Why do the symbol names change?
C++ has a feature called function overloading. Basically, you can declare two functions that are named the same but differ slightly:
int sr_plugin_init_cb(sr_session_ctx_t *session, void **private_ctx);
int sr_plugin_init_cb(sr_session_ctx_t *session, void **private_ctx, int some_arg);
or a slightly worse case:
struct A {
    // each of these functions can be different, depending on the
    // cv-qualification of the object it is called on
    void func();
    void func() const;
    void func() volatile;
    void func() volatile const;
};
The functions are named the same. The linker doesn't see the C++ source, but it still has to differentiate between the two functions in order to link to them. So the C++ compiler "mangles" the function names so that the linker can tell them apart. Greatly simplified, the result could look like:
sr_plugin_init_cb_that_doesnt_take_int_arg
sr_plugin_init_cb_that_takes_int_arg
A_func
A_func_but_object_is_const
A_func_but_object_is_volatile
A_func_but_object_is_volatile_and_const
The rules of name mangling are complicated so as to keep the names as short as possible. They have to account for any number of templates, arguments, objects, names, qualifiers, lambdas, overloads, operators, etc., generate a unique name, and use only characters that are acceptable to the linker on a specific platform. For example, here is a reference for the name mangling used by the GNU g++ compiler.
The symbol name _Z17sr_plugin_init_cbP16sr_session_ctx_sPPv is your function's name as mangled by your compiler.
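If you want to confirm this yourself, the binutils c++filt tool will demangle such a name (output shown approximately):

$ c++filt _Z17sr_plugin_init_cbP16sr_session_ctx_sPPv
sr_plugin_init_cb(sr_session_ctx_s*, void**)

objdump -TC and nm -C perform the same demangling while dumping symbol tables.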
Thank you very much for the detailed answer. I now understand the issue.
After a quick search I found a solution for my problem: wrapping the function prototypes like this avoids the name mangling.
extern "C" {
// Function prototypes
};
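A common variant of that fix (a sketch reusing the question's function, with a hypothetical header name) is to put the guard directly in the header, so the same prototypes work when included from both C and C++ code:

// plugin.h
#ifdef __cplusplus
extern "C" {
#endif

int sr_plugin_init_cb(sr_session_ctx_t *session, void **private_ctx);

#ifdef __cplusplus
}
#endif

The extern "C" block tells the C++ compiler to emit the plain, unmangled C symbol name, so a loader looking for sr_plugin_init_cb will find it.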

Linker removing static initialiser [duplicate]

I am working on a factory that will have types added to it. However, if a class is not explicitly instantiated (at compile time) in the .exe that is executed, then the type is not added to the factory. This is because the static registration call is somehow not being made. Does anyone have any suggestions on how to fix this? Below are five very small files that I am putting into a lib; an .exe then calls into this lib. If there are any suggestions on how I can get this to work, or maybe a better design pattern, please let me know. Here is basically what I am looking for:
1) A factory that can take in types
2) Auto-registration that lives in the class's .cpp file; any and all registration code should go in the class's .cpp (for the example below, RandomClass.cpp) and in no other files.
BaseClass.h : http://codepad.org/zGRZvIZf
RandomClass.h : http://codepad.org/rqIZ1atp
RandomClass.cpp : http://codepad.org/WqnQDWQd
TemplateFactory.h : http://codepad.org/94YfusgC
TemplateFactory.cpp : http://codepad.org/Hc2tSfzZ
When you link against a static library, you are in fact extracting from it only those object files that provide symbols which are currently used but not yet defined. In the pattern you are using, there are probably no undefined symbols provided by the object file which contains the static variable that triggers registration.
Solutions:
use explicit registration
have the compilation unit somehow provide an undefined symbol:
    something useful, but this is often not natural
    a dummy one; this is not natural either if it has to be provided by the main program, but as a linker argument it may be easier than using the mangled name of the static variable
use linker arguments to add your static variables as undefined symbols (see the sketch just after this list)
use a linker argument stating that all the objects of a library have to be included
use dynamic libraries, which are fully imported and thus don't have this problem
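A sketch of the "undefined symbol as a linker argument" option (assuming the GNU toolchain; the symbol, library and file names are hypothetical): if RandomClass.cpp defines a global variable named random_class_registered whose initializer performs the registration, then forcing that symbol to be undefined at the start of the link pulls the corresponding object file out of the archive.

$ g++ main.o -Wl,--undefined=random_class_registered -L. -lfactory -o app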
As a general rule of thumb, an application does not include static or global variables from a static library unless they are implicitly or explicitly used by the application.
There are a hundred different ways this can be refactored. One method could be to place the static variable inside a function and make sure the function is called.
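A minimal sketch of that approach (hypothetical names): let an ordinary, exported function own the static, and call it once from the executable; the call both drags the object file out of the library and triggers the registration.

// RandomClass.cpp (in the library)
static bool do_register() { /* add RandomClass to the factory here */ return true; }
bool ensure_random_class_registered()
{
    static bool done = do_register();  // runs exactly once, on first call
    return done;
}

// somewhere in the executable that is guaranteed to run:
// ensure_random_class_registered();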
To expand on one of @AProgrammer's excellent suggestions, here is a portable way to guarantee the calling program will reference at least one symbol from the library.
In the library code, declare a global function that returns an int:
int make_sure_compilation_unit_referenced() { return 0; }
Then, in the header for the library, declare a static variable that is initialized by calling the global function:
extern int make_sure_compilation_unit_referenced();
static int never_actually_used = make_sure_compilation_unit_referenced();
Every compilation unit that includes the header will have a static variable that needs to be initialized by calling a (useless) function in the library.
This is made a little cleaner if your library has its own namespace encapsulating both of these definitions; then there's less chance of name collisions between the bogus function in your library and functions in other libraries, or of the static variable colliding with other variables in the compilation unit(s) that include the header.
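For instance (a sketch, with a hypothetical namespace name):

// in the library source file
namespace mylib {
    int make_sure_compilation_unit_referenced() { return 0; }
}

// in the library header
namespace mylib {
    int make_sure_compilation_unit_referenced();
    static int never_actually_used = make_sure_compilation_unit_referenced();
}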

Is there a standard way to ensure that a piece of code is executed at global scope?

I have some code I want to execute at global scope. So, I can use a global variable in a compilation unit like this:
int execute_global_code();
namespace {
int dummy = execute_global_code();
}
The thing is that if this compilation unit ends up in a static library (or a shared one with -fvisibility=hidden), the linker may decide to eliminate dummy, as it isn't used, and with it my global code execution.
So, I know that I can use concrete solutions based on the specific context: a specific compiler (a pragma that forces the symbol to be kept), the compilation unit's location (attribute visibility default), or the surrounding code (say, making a dummy use of dummy elsewhere in my code).
The question is: is there a standard way to ensure execute_global_code will be executed, one that fits in a single macro and works regardless of where the compilation unit ends up (executable or lib)? I.e., only standard C++ and no user code outside of that macro (like a dummy use of dummy in main()).
The issue is that the linker will use all object files given to it directly when linking a binary, but for static libraries it will only pull in those object files that define a symbol which is currently undefined.
That means that if all the object files in a static library contain only such self-registering code (or other code that is not referenced from the binary being linked), nothing from the entire static library will be used!
This is true for all modern toolchains. There is no platform-independent solution.
A way to circumvent this with CMake, without touching the source code, can be found here - read more about it here - it will work as long as no precompiled headers are used. Example usage:
doctest_force_link_static_lib_in_target(exe_name lib_name)
There are also some compiler-specific ways to do this, as grek40 has already pointed out in a comment.
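For example, on a GNU toolchain the usual escape hatch is the linker's --whole-archive flag, which forces every object file of an archive into the link (hypothetical file names shown):

$ g++ main.o -Wl,--whole-archive libplugins.a -Wl,--no-whole-archive -o app

MSVC's linker and Apple's linker have comparable options (/WHOLEARCHIVE and -force_load, respectively), but none of this is portable.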

Which object file contains the following static templatized "member variable"?

Say I have the following template class with a static member function that itself contains a static local variable (which behaves like a static member variable that is initialized the first time its containing function is called):
template <typename T>
struct foo
{
    static int& mystatic()
    {
        static int value;
        return value;
    }
};
If I use foo<T> in multiple translation units for some T, into which object file does the compiler put foo<T>::mystatic::value? How is this apparent duplication/conflict resolved at link time?
Note that your function mystatic is a function with external linkage, which means that the very same conflict exists between the multiple definitions of mystatic made in different translation units. Also, exactly the same issue can arise without templates: ordinary inline functions with external linkage defined in header files can produce the same apparent multiple-definition conflict (and the same issue with a local static variable can be reproduced there as well).
In order to resolve such conflicts, all such symbols are labeled by the compiler in some implementation-dependent way. By doing this, the compiler tells the linker that these symbols may legally end up being defined multiple times. For example, one well-known implementation puts such symbols into a separate section of the object file (a so-called "COMDAT" section). Other implementations might label such symbols in some other way. When the linker discovers such symbols in multiple object files, instead of reporting a multiple-definition error it chooses one and only one copy of each identical symbol and uses it throughout the entire program. The other copies are discarded.
One typical consequence of this approach is that your local static variable value has to be exposed as an external symbol in each such object file, despite the fact that it has no linkage from the language's point of view. The name of the symbol will usually be composed of the function name mystatic, the variable name value, and some other mangling.
In other words, the compiler proper puts the definition of mystatic and the variable value into every object file that uses the member function. The linker later makes sure that only one mystatic and only one value exist in the linked program. There's probably no way to determine which original object file supplied the surviving copy (if such a distinction even makes sense).
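You can see this labeling with a GNU toolchain (a sketch; the hypothetical file use_foo.cpp simply calls foo<int>::mystatic(), addresses are elided, and the exact symbol letters vary by platform and compiler):

$ g++ -c use_foo.cpp
$ nm -C use_foo.o | grep mystatic
... W foo<int>::mystatic()
... u foo<int>::mystatic()::value
... u guard variable for foo<int>::mystatic()::value

The W/u markers denote weak or unique symbols - exactly the "may be defined many times, keep one" label described above; the guard variable is what makes the first-call initialization of value happen only once.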