Real Applications of internal linkage in C++

Real Applications of internal linkage in C++ - c++

This sounds like a duplicate version of What is the point of internal linkage in C++ and probably is. There was only one post with some code that didn't look like a practical example. C++ISO draft says:
When a name has internal linkage, the entity it denotes can be
referred to by names from other scopes in the same translation unit.
It looks a good punctual definition for me, but I couldn't find any reasonable application of that, something like: "look this code, the internal linkage here makes a great difference to implement it". Furthermore based on the definition provided above,it looks that global variables fulfils the internal linkage duty. Could you provide some examples?

Internal linkage is a practical way to comply with the One Definition Rule.
One might find the need to define functions or objects with plain names, like sum, or total, or collection, or any one of other common terms, more than once. In different translation units they might serve different purposes, specific purposes that are particular to that, particular, translation unit.
If only external linkage existed you'd have to make sure that the same name will not be repeated in different translation units, i.e. little_sum, big_sum, red_sum, etc... At some point this will get real old, real fast.
Internal linkage solves this problem. And unnamed namespaces effectively results in internal linkage for entire classes and templates. With an unnamed namespace: if a translation unit has a need for its own private little template, with a practical name of trampoline it can go ahead and use it, safely, without worrying about violating the ODR.

Consider helper functions, that don't exactly need to be exposed to the outside like:
// Foo.hh
void do_something();
// Foo.cc
static void log(std::string) { /* Log into foo.log */ }
void do_something() { /* Do stuff while using log() to log info */ }
// Bar.hh
void bar();
// Bar.cc
static void log(std::string) { /* Log into bar.log */ }
void bar() { /* Do stuff while using log */ }
You can use the proper log function within two parts of your project, while avoiding multiple definition errors.
This latter part becomes very important for header only libraries, where the library might be included in multiple translation units within the same project.
Now as to a reason of using internal linkage with variables: Again you can avoid multiple definitions errors, which would be the result of code like this:
// Foo.cc
int a = 5;
// Bar.cc
int a = 5;
When compiling this the compiler will happily produce object code but the linker will not link it together.

Related

Does linker link an object file with itself?

From what I read in other SO answers, like this and this, compiler converts the source code into object file. And the object files might contain references to functions like printf that needs to be resolved by the linker.
What I don't understand is, when both declaration and definition exist in the same file, like the following case, does compiler or linker resolve the reference to return1?
Or is this just a part of the compiler optimization?
int return1();
int return2() {
int b = return1();
return b + 1;
}
int return1() {
return 1;
}
int main() {
int b = return2();
}
I have ensured that preprocessing has nothing to do with this by running g++ -E main.cpp.
Update 2020-7-22
The answers are all helpful! Thanks!
From the answers below, it seems to me that the compiler may or may not resolve the reference to return1. However, I'm still unclear if there's only one translation unit, like the example I gave, and if compiler did not resolve it, does this mean that linker must resolve it then?
Since it seems to me that linker will link several (greater than one) object files together, and if there's only one translation unit (object file), linker need to link the object file with itself, am I right?
And is there anyway to know for sure which one is the case on my computer?

It depends. Both options are possible, so are options that you didn't mention, like either the compiler or the linker rearranging the code so that none of the functions exist any more. It's fine thinking about compilers emitting references to functions and linkers resolving those references as a way of understanding C++, but bear in mind is that all the compiler and linker have to do is produce a working program and there are many different ways to do that.
One thing the compiler and linker must do however, is make sure that any calls to standard library functions happen (like printf as you mentioned), and happen in the order that the C++ source specifies. Apart from that (and some other similar concerns) they can more or less do as they wish.

[lex.phases]/1.9, covering the final phase of translation, states [emphasis mine]:
All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.
It is, however, up to the compiler to decide whether a library component is a single translation unit or a combination of them; as governed by [lex.separate]/2 [emphasis mine]:
[ Note: Previously translated translation units and instantiation units can be preserved individually or in libraries. The separate translation units of a program communicate ([basic.link]) by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units can be separately translated and then later linked to produce an executable program.  — end note ]
OP: [...] does compiler or linker resolve the reference to return1?
Thus, even if return1 has external linkage, as it is defined in the translation unit where it is referred to (in return2), the linker should not need to resolve the reference to it, as its is definition exists in the current translation. The standard passage is, however, (likely intentionally) a bit vague regarding requirements for when linking to satisfy external references need to occur, and I do not see it to be a non-compliant implementation to defer resolving the reference to return1 in return2 until the linking phase.

Practically speaking, the problem is with the following code:
static int return1();
int return2() {
int b = return1();
return b + 1;
}
int return1() {
return 1;
}
The problem for the linker is that each Translation Unit can now contain its own return1, so the linker would have a problem in choosing the right return1. There are tricks around this, e.g. adding the Translation Unit name to the function name. Most ABI's do not do that, but the C++ standard would allow it. For anonymous namespaces however, i.e. namespace { int function1(); }, ABI's will use such tricks.

Why does the one definition rule exist in C/C++

In C and C++, you can't have a function with two definitions. For example, say we have the following two files:
1.c:
int main(){ return 0}
2.c:
int main(){ return 0}
Issuing the command gcc 1.c 2.c will give you a duplicate symbol linker error.
Why doesn't the same happen with structs and classes? Why are we allowed to have multiple
definitions of the same struct as long as they have the same tokens?

To answer this question, one has to delve into compilation process and what is needed in each part (question why these steps are perfomed is more historical, going back to beginning of C before it's standardization)
C and C++ programs are compiled in multiple steps:
Preprocessing
Compilation
Linkage
Preprocessing is everything that starts with #, it's not really important here.
Compilation is performed on each and every translation unit (typically a single .c or .cpp file plus the headers it includes). Compiler takes one translation unit at a time, reads it and produces an internal list of classes and their members, and then assembly code of each function in given unit (basing on the structures list). If a function call is not inlined (e.g. it is defined in different TU), compiler produces a "link" - "please insert function X here" for the linker to read.
Then linker takes all of the compiled translation units and merges them into one binary, substituting all the links specified by compiler.
Now, what is needed at each phase?
For compilation phase, you need the
definition of every class used in this file - compiler needs to know the size and offset of each class member to produce assembly
declaration of every function used in this file - to produce those "links".
Since function definitions are not needed for producing assembly (as long as they are compiled somewhere), they are not needed in compilation phase, only in linking phase.
To sum up:
One Definition Rule is there to protect programmers from theselves. If they'd accidentally define a function twice, linker will notice that and executable is not produced.
However, class definitions are required in every translation unit, and therefore such a rule cannot be set up for them. Since it cannot be forced by language, programmers have to be responsible beings and not define the same class in different ways.
ODR has also other limitations, e.g. you have to define template functions (or template class methods) in header files. You can also take the responsibility and say to the compiler "Every definition of this function will be the same, trust me dude" and make the function inline.

There is no use case for a function with 2 definitions. Either the two definitions would have to be the same, making it useless, or the compiler wouldn't be able to tell which one you meant.
This is not the case with classes or structures. There is also a large advantage to allowing multiple definitions of them, i.e. if we want to use a class or struct in multiple files. (This leads indirectly to multiple definitions because of includes.)

Structures, classes, unions and enumerations define types that can be used in several compilation units to define objects of these types. So each compilation unit need to know how the types are defined, for example to allocate correctly memory for an object or to be sure that specified member of a class does indeed exist.
For functions (if they are not inline functions) it is enough to have their declaration without their definition to generate for example a function call.
But the function definition shall be single. Otherwise the compiler will not know what function to call or the object code will be too big due to duplication and will be error prone..

It's quite simple: It's a question of scope. Non-static functions are seen (callable) by every compilation unit linked together, while structures are only seen in the compilation unit where they are defined.
For example, it's valid to link the following together because it's clear which definition of struct Foo and which definition of f is being used:
1.c:
struct Foo { int x; };
static void f(void) { struct Foo foo; ... }
2.c:
struct Foo { double d; };
static void f(void) { struct Foo foo; ... }
int main(void) { ... }
But it isn't valid to link the following together because the linker wouldn't know which f to call.
1.c:
void f(void) { ... }
2.c:
void f(void) { ... }
int main(void) { f(); }

Actually every programming element is associated with a scope of its applicability. And within this scope you cannot have the same name associated with multiple definitions of an element. In compiled world:
You cannot have more than one class definition with the same name within a single file. But you can have it in different compilation units.
You cannot have the same function or global variable name within a single link unit (library or executable), but you can potentially have functions named the same within different libraries.
you cannot have shared libraries with the same name situated in the same directory, but you can have them in different directories.
C/C++ compilation is very much after the compilation performance. Checking 2 objects like function or classes for identity is a time-consuming task. So, it is not done. Only names are considered for comparison. It is better to consider that 2 types are different and error out then checking them for identity. The only exception from this rule are text macros.
Macros are a pre-processor concept and historically it is allowed to have multiple identical macro definitions. If a definition changes, a warning gets generated. Comparing macro context is easy, just a simple string comparison, but some macro definitions could be huge.
Types are the compiler concept and they are resolved by the compiler. Types do not exist in object libraries and are represented by the sizes of corresponding variables. So, there is no reason for checking type name collisions at this scope.
Functions and variables on the other hand are named pointers to executable codes or data. They are the building blocks of applications. Applications are assembled from the codes and libraries coming from all around the world in some cases. In order to use someone else's function you'd better now its name and you do not want the same name to be used by some one else. Within a shared library names of functions and variables are usually stored in a hash table. There is no place for duplicates there.
And as I already mention checking functions for identical contents is seldom done, however there are some cases, but not in c or c++.

The reason of impeding two different definitions for the same thing to be used in programming is to avoid the ambiguity of deciding which definition to use at run time.
If you have two different implementations to the same thing to coexist in a program, then there's the possibility of aliasing them (with a different name each) into a common reference to decide at runtime which one of the two to use.
Anyway, in order to distinguish both, you have to be able to indicate the compiler which one you want to use. In C++ you can overload a function, giving it the same name and different lists of parameters, so you can distinguish which one of both you want to use. But in C, the compilers only preserve the name of the function to be able to solve at link time which definition matches the name you use in a different compilation unit. In case the linker ends with two different definitions with the same name, it is uncapable of deciding for you which one to use, so it emits an error and gives up the building process.
What should be the intention of using this ambiguity in a productive way? this is the question you have actually to ask to yourself.

How does linker handle variables with different linkages?

In C and C++ we can manipulate a variable's linkage. There are three kinds of linkage: no linkage, internal linkage, and external linkage. My question is probably related to why these are called "linkage" (How is that related to the linker).
I understand a linker is able to handle variables with external linkage, because references to this variable is not confined within a single translation unit, therefore not confined within a single object file. How that actually works under the hood is typically discussed in courses on operating systems.
But how does the linker handle variables (1) with no linkage and (2) with internal linkage? What are the differences in these two cases?

As far as C++ itself goes, this does not matter: the only thing that matters is the behavior of the system as a whole. Variables with no linkage should not be linked; variables with internal linkage should not be linked across translation units; and variables with external linkage should be linked across translation units. (Of course, as the person writing the C++ code, you must obey all of your constraints as well.)
Inside a compiler and linker suite of programs, however, we certainly do have to care about this. The method by which we achieve the desired result is up to us. One traditional method is pretty simple:
Identifiers with no linkage are never even passed through to the linker.
Identifiers with internal linkage are not passed through to the linker either, or are passed through to the linker but marked "for use within this one translation unit only". That is, there is no .global declaration for them, or there is a .local declaration for them, or similar.
Identifiers with external linkage are passed through to the linker, and if internal linkage identifiers are seen by the linker, these external linkage symbols are marked differently, e.g., have a .global declaration or no .local declaration.
If you have a Linux or Unix like system, run nm on object (.o) files produced by the compiler. Note that some symbols are annotated with uppercase letters like T and D for text and data: these are global. Other symbols are annotated with lowercase letters like t and d: these are local. So these systems are using the "pass internal linkage to the linker, but mark them differently from external linkage" method.

The linker isn't normally involved in either internal linkage or no linkage--they're resolved entirely by the compiler, before the linker gets into the act at all.
Internal linkage means two declarations at different scopes in the same translation unit can refer to the same thing.
No Linkage
No linkage means two declarations at different scopes in the same translation unit can't refer to the same thing.
So, if I have something like:
int f() {
static int x; // no linkage
}
...no other declaration of x in any other scope can refer to this x. The linker is involved only to the degree that it typically has to produce a field in the executable telling it the size of static space needed by the executable, and that will include space for this variable. Since it can never be referred to by any other declaration, there's no need for the linker to get involved beyond that though (in particular, the linker has nothing to do with resolving the name).
Internal linkage
Internal linkage means declarations at different scopes in the same translation unit can refer to the same object. For example:
static int x; // a namespace scope, so `x` has internal linkage
int f() {
extern int x; // declaration in one scope
}
int g() {
extern int x; // declaration in another scope
}
Assuming we put these all in one file (i.e., they end up as a single translation unit), the declarations in both f() and g() refer to the same thing--the x that's defined as static at namespace scope.
For example, consider code like this:
#include <iostream>
static int x; // a namespace scope, so `x` has internal linkage
int f()
{
extern int x;
++x;
}
int g()
{
extern int x;
std::cout << x << '\n';
}
int main() {
g();
f();
g();
}
This will print:
0
1
...because the x being incremented in f() is the same x that's being printed in g().
The linker's involvement here can be (and usually is) pretty much the same as in the no linkage case--the variable x needs some space, and the linker specifies that space when it creates the executable. It does not, however, need to get involved in determining that when f() and g() both declare x, they're referring to the same x--the compiler can determine that.
We can see this in the generated code. For example, if we compile the code above with gcc, the relevant bits for f() and g() are these.
f:
movl _ZL1x(%rip), %eax
addl $1, %eax
movl %eax, _ZL1x(%rip)
That's the increment of x (it uses the name _ZL1x for it).
g:
movl _ZL1x(%rip), %eax
[...]
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_c#PLT
So that's basically loading up x, then sending it to std::cout (I've left out code for other parameters we don't care about here).
The important part is that the code refers to _ZL1x--the same name as f used, so both of them refer to the same object.
The linker isn't really involved, because all it sees is that this file has requested space for one statically allocated variable. It makes space for that, but doesn't have to do anything to make f and g refer to the same thing--that's already handled by the compiler.

My question is probably related to why these are called "linkage" (How is that related to the linker).
According to the C standard,
An identifier declared in different scopes or in the same scope more
than once can be made to refer to the same object or function by a
process called linkage.
The term "linkage" seems reasonably well fitting -- different declarations of the same identifier are linked together so that they refer to the same object or function. That being the chosen terminology, it's pretty natural that a program that actually makes linkage happen is conventionally called a "linker".
But how does the linker handle variables (1) with no linkage and (2) with internal linkage? What are the differences in these two cases?
The linker does not have to do anything with identifiers that have no linkage. Every such declaration of an object identifier declares a distinct object (and function declarations always have internal or external linkage).
The linker does not necessarily do anything with identifiers having internal linkage, either, as the compiler can generally do everything that needs to be done with these. Nevertheless, identifiers with internal linkage can be declared multiple times in the same translation unit, with those identifiers all referring to the same object or function. The most common case is a static function with a forward declaration:
static void internal(void);
// ...
static void internal(void) {
// do something
}
File-scope variables can also have internal linkage and multiple declarations that are all linked to refer to the same object, but the multiple declaration part is not as useful for variables.

What is the point of internal linkage in C++

I understand that there are three possible linkage values for a variable in C++ - no linkage, internal linkage and external linkage.
So external linkage means that the variable identifier is accessible in multiple files, and internal linkage means that it is accessible within the same file. But what is the point of internal linkage? Why not just have two possible linkages for an identifier - no linkage and external linkage? To me it seems like global (or file) scope and internal linkage serve the same purpose.
Is there any use case where internal linkage is actually useful that is not covered by global scope?
In the below example, I have two pieces of code - the first one links to the static int i11 (which has internal linkage), and the second one does not. Both pretty much do the same thing, since main already has access to the variable i11 due to its file scope. So why have a separate linkage called internal linkage.
static int i11 = 10;
int main()
{
extern int i11;
cout << ::i11;
return 0;
}
gives the same result as
static int i11 = 10;
int main()
{
cout << ::i11;
return 0;
}
EDIT: Just to add more clarity, as per HolyBlackCat's definition below, internal linkage really means you can forward-declare a variable within the same translation unit. But why would you even need to do that for a variable that is already globally accessible within the file .. Is there any use case for this feature?

Examples of each:
External linkage:
foo.h
extern int foo; // Declaration
foo.cpp
extern int foo = 42; // Definition
bar.cpp
#include "foo.h"
int bar() { return foo; } // Use
Internal linkage:
foo.cpp
static int foo = 42; // No relation to foo in bar.cpp
bar.cpp
static int foo = -43; // No relation to foo in foo.cpp
No linkage:
foo.cpp
int foo1() { static int foo = 42; foo++; return foo; }
int foo2() { static int foo = -43; foo++; return foo; }
Surely you will agree that the foo variables in functions foo1 and foo2 have to have storage. This means they probably have to have names because of how assemblers and linkers work. Those names cannot conflict and should not be accessible by any other code. The way the C++ standard encodes this is as "no linkage." There are a few other cases where it is used as well, but for things where it is a little less obvious what the storage is used for. (E.g. for class you can imagine the vtable has storage, but for a typedef it is mostly a matter of language specification minutiae about access scope of the name.)
C++ specifies somewhat of a least common denominator linkage model that can be mapped onto the richer models of actual linkers on actual platforms. In practice this is highly imperfect and lots of real systems end up using attributes, pragmas, or compiler flags to gain greater control of linkage type. In order to do this and still provide a reasonably useful language, one gets into name mangling and other compiler techniques. If C++ were ever to try and provide a greater degree of compiled code interop, such as Java or .NET virtual machines do, it is very likely the language would gain clearer and more elaborate control over linkage.
EDIT: To more clearly answer the question... The standard has to define how this works for both access to identifiers in the source language and linkage of compiled code. The definition must be strong enough so that correctly written code never produces errors for things being undefined or multiply defined. There are certainly better ways to do this than C++ uses, but it is largely an evolved language and the specification is somewhat influenced by the substrate it is compiled onto. In effect the three different types of linkage are:
External linkage: The entire program agrees on this name and it can be access anywhere there is a declaration visible.
Internal linkage: A single file agrees on this name and it can be accessed in any scope the declaration is visible.
No linkage: The name is for one scope only and can only be accessed within this scope.
In the assembly, these tend to map into a global declaration, a file local declaration, and a file local declaration with a synthesized unique name.
It is also relevant for cases where the same name is declared with different linkage in different parts of the program and in determining what extern int foo refers to from a given place.

External linkage is for when you have files being compiled independently of each other (#included .h, .c, and .cpp files are not compiled independently of each other). An extern variable is special in that it can be used between files being compiled separately.

How does main.cpp see this?

Just started out in C++, so bit of a noob, and not sure why this works. How can main.cpp see to use the print() function contained in the separate print.cpp file? I thought you had to use #include/header files or something like that? I'm using Visual Studio if that helps.
main.cpp
#include "stdafx.h"
#include <iostream>
#include <string>
void print(std::string message);
int main()
{
std::cout << "Enter message: ";
std::string message = "";
std::getline(std::cin, message);
print(message);
return 0;
}
print.cpp
#include "stdafx.h"
#include <iostream>
#include <string>
void print(std::string message)
{
std::cout << "Your message is - " << message << std::endl;
}

Actually the code in main.cpp does not "see" the print function in print.cpp at all!
The call to print is only checked by the compiler against the incomplete function specification you wrote earlier in the file, not against anything from any other file. C++ allows this incomplete specification as a way to say, "well I'm not telling you how this function is implemented now, but it should be available to you after this file and all other files are compiled and ready to link together, perhaps with some existing libraries."
You mentioned include files. All an include directive does is (among other things) place a bunch of partial function specifications directly inside your program. After including (which runs as a pre-processing phase before the compiler runs), you will have some code that looks just like your main.cpp above. In fact, to the C++ compiler, your code looks no different than one in which your incomplete function specification of print was replaced with an #include directive of a file containing that specification.
An interesting thing about writing incomplete function specifications is that functions implementing those specifications can often be written in different languages, as long as their data types map directly to C++ types. In your case, std::string binds you directly to C++, but had you used int or even char* an external program in assembly language or C could have been used!

The reason that you can compile code in separate translation units is linkage: Linkage is the property of a name, and names come in three kinds of linkage, which determine what the name means when it is seen in different scopes:
None: the meaning of a name with no linkage is unique to the scope in which the name appears. For example, "normal" variables declared inside a function have no linkage, so the name i in foo() has a distinct meaning from the name i in bar().
Internal: the meaning of a name with internal linkage is the same inside each translation unit, but distinct across translation units. A typical example are the names of variables declared at namespace scope that are constants, or that appear in an unnamed namespace, or that use the static specifier. For a concrete example, static int n = 10; declared in one .cpp file refers to the same entity in every use of that name inside that file, but a different static int n in a different file refers to a distinct entity.
External: the meaning of a name with external linkage is the same across the entire program. That is, wherever you declare a specific name with external linkage, that name refers to the same thing. This is the default linkage for functions and non-constants at namespace scope, but you can also explicitly request external linkage with the extern specifier. For example, extern int a; would refer to the same int object anywhere in the program.
Now we see how your program fits together (or: "links"): The name print has external linkage (because it's the name of a function), and so every declaration in the program refers to the same function. There's a declaration in main.cpp that you use to call the function, and there's another declaration in print.cpp that defines the function, and the two mean the same thing, which means that the thing you call in main is the exact thing you define in print.cpp.
The use of header files doesn't do any magic: header files are just textually substituted, and now we see precisely what header files are useful for: They are useful to hold declarations of names with external linkage, so that anyone wanting to refer to the entities thus names has an easy and maintainable way of including those declarations into their code.
You could do entirely without headers, but that would require you to know precisely how to declare the names you need, and that is generally not desirable, because the specifics of the declarations are owned by the library owner, not the user, and it is the library owner's responsibility to maintain and ship the declarations.
Now you also see what the purpose of the "linker" part of the translation toolchain is: The linker matches up references to names with external linkage. The linker fills in the reference to the print name in your first translation unit with the ultimate address of the defined entity with that name (coming from the second translation unit) in the final link.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js