I have read the existing questions on external/internal linkage over here on SO. My question is different - what happens if I have multiple definitions of the same variable with external linkage in different translation units under C and C++?
For example:
/*file1.c*/
typedef struct foo {
    int a;
    int b;
    int c;
} foo;
foo xyz;
/*file2.c*/
typedef struct abc {
    double x;
} foo;
foo xyz;
Using Dev-C++, the above compiles and links perfectly as a C program, whereas it gives a multiple-definition error if compiled as a C++ program. Why does it work under C, and what's the difference with C++? Is this behavior undefined and compiler-dependent? How "bad" is this code, and what should I do if I want to refactor it (I've come across a lot of old code written like this)?
Both C and C++ have a "one definition rule": each object may only be defined once in any program. Violations of this rule cause undefined behaviour, which means that you may or may not see a diagnostic message when compiling.
There is a language difference between C and C++ concerning the following declaration at file scope, but it does not directly concern the problem with your example.
int a;
In C this is a tentative definition. It may be amalgamated with other tentative definitions in the same translation unit to form a single definition. In C++ it is always a definition (you have to use extern to declare an object without defining it), and any subsequent definitions of the same object in the same translation unit are an error.
In your example both translation units have a (conflicting) definition of xyz from their tentative definitions.
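To make the C rule concrete, here is a minimal sketch, all in one translation unit (as C++, the second and third lines would already be redefinition errors):
int a;       /* tentative definition */
int a;       /* another tentative definition: fine in C, an error in C++ */
int a = 3;   /* the external definition; the tentative ones act as declarations */

int main(void) { return a; }   /* exit status 3 */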
This is caused by C++'s name mangling. From Wikipedia:
The first C++ compilers were implemented as translators to C source code, which would then be compiled by a C compiler to object code; because of this, symbol names had to conform to C identifier rules. Even later, with the emergence of compilers which produced machine code or assembly directly, the system's linker generally did not support C++ symbols, and mangling was still required.
With regards to compatibility:
In order to give compiler vendors greater freedom, the C++ standards committee decided not to dictate the implementation of name mangling, exception handling, and other implementation-specific features. The downside of this decision is that object code produced by different compilers is expected to be incompatible. There are, however, third party standards for particular machines or operating systems which attempt to standardize compilers on those platforms (for example the C++ ABI); some compilers adopt a secondary standard for these items.
From http://www.cs.indiana.edu/~welu/notes/node36.html the following example is given.
For the C code below:
int foo(double*);
double bar(int, double*);

int foo(double* d)
{
    return 1;
}

double bar(int i, double* d)
{
    return 0.9;
}
Its symbol table would be (as shown by dump -t):
[4] 0x18 44 2 1 0 0x2 bar
[5] 0x0 24 2 1 0 0x2 foo
For the same file compiled with g++, the symbol table would be:
[4] 0x0 24 2 1 0 0x2 _Z3fooPd
[5] 0x18 44 2 1 0 0x2 _Z3bariPd
_Z3bariPd denotes a function named bar whose first argument is an int and whose second argument is a pointer to double.
C++ does not allow a symbol to be defined more than once. I'm not sure what the C linker is doing; a good guess might be that it simply maps both definitions onto the same symbol, which would of course cause severe errors.
For porting I would try to put the contents of the individual C files into anonymous namespaces, which essentially makes the symbols different and local to the file, so they don't clash with the same names elsewhere.
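For instance, file2.c from the question could be ported along these lines (a sketch, assuming the file is renamed so it compiles as C++):
namespace {                  // unnamed namespace: internal linkage
    typedef struct abc {
        double x;
    } foo;

    foo xyz;                 // no longer clashes with the xyz in file1
}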
The C program permits this and treats the memory a little like a union. It will run, but may not give you what you expect.
The C++ program (which is more strongly typed) correctly detects the problem and asks you to fix it. If you really want a union, declare it as one. If you want two distinct objects, limit their scope.
You have found the One Definition Rule. Clearly your program has a bug, since
There can only be one object named xyz once the program is linked.
If some source file includes both definitions (e.g. via header files), it will see two conflicting definitions of the type foo.
C++ compilers can get around #1 because of "name mangling": the name of your variable in the linked program may be different from the one you chose. In this case, it isn't required, but it's probably how your compiler detected the problem. #2, though, remains, so you can't do that.
If you really want to defeat the safety mechanism, you can disable mangling like this:
extern "C" struct abc foo;
… other file …
extern "C" struct foo foo;
extern "C" instructs the linker to use C ABI conventions.
Related
My code mysteriously stopped working. I figured out I had accidentally written int listen; in my main.cpp while using listen in my network.cpp, which then seems to try to call the int as a function instead of the C function. Changing the name of the variable (or making it static) fixed the problem.
Are there any warnings I can turn on so I don't get caught by something like this again? The closest I found was something that suggests making variables static if they don't need to be extern.
Here's the code:
//a.cpp
#include <cstdio>
int main() { puts("Hello"); }

//b.cpp
int puts;
What you have is an ODR violation. You have two different definitions of puts in two different TUs. The standard (N4713 - C++17 draft) says
§6.2 One-definition rule [basic.def.odr]
Every program shall contain exactly one definition of every non-inline function or variable that is odr-used in that program outside of a discarded statement (9.4.1); no diagnostic required.
This is of course for C++. C has similar rules.
Because of the "no diagnostic required" the compiler chain is not required to issue errors or warnings for your code.
As far as I know there are no flags on popular compilers that issue errors/warnings for ODR violations; the compiler only sees one translation unit at a time, and by link time the type information needed to compare definitions is gone. See this great answer for more info on that.
There are some good practices that you can follow to minimize this kind of mistake:
In C++, don't declare symbols at global scope; use namespaces.
In C, you can prefix global variables with g_. If you write a library, also add a library-specific prefix to everything global.
(as you've suggested) make symbols that are used only in their TU have internal linkage (declare them static or in an unnamed namespace).
use more meaningful, relevant names; e.g. instead of listen you could name it server_is_listening (or something similar that makes sense for its use). See the sketch below.
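A minimal C++ sketch combining these practices (all names here are illustrative):
namespace net {                  // practice 1: no symbols at global scope
    int server_is_listening;     // practice 4: descriptive name instead of `listen`
}

static int g_retry_count;        // practices 2 and 3: g_ prefix plus internal linkage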
-Werror=missing-variable-declarations may help. It forces you either to make the variable static (which avoids this problem) or to declare it elsewhere (i.e. in a header). Including that header in any file that also declares the C function will then cause an error, giving you another chance to catch the mistake.
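For illustration, here is a file that would trip clang's -Wmissing-variable-declarations (which -Werror= promotes to an error):
int listen;            // warning: no previous extern declaration for non-static variable
static int count;      // fine: internal linkage, cannot clash across TUs
extern int port;       // declare first (normally in a header) ...
int port = 80;         // ... then define: no warning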
int a;
int a = 3; // error as C++ (compiled with clang++-7) but not as C (compiled with clang-7)

int main() {
}

For C, the compiler seems to merge these symbols into one global symbol, but for C++ it is an error.
Demo
file1:
int a = 2;
file2:
#include <stdio.h>
int a;
int main() {
    printf("%d", a); // prints 2
}
When compiled as C with clang-7, the linker does not produce an error; I assume it converts the uninitialised global symbol a to an extern symbol (treating it as if it had been compiled from an extern declaration). When compiled as C++ with clang++-7, the linker produces a multiple definition error.
Update: the linked question does answer the first example in my question, specifically 'In C, If an actual external definition is found earlier or later in the same translation unit, then the tentative definition just acts as a declaration.' and 'C++ does not have “tentative definitions”'.
As for the second scenario, if I printf a, it does print 2, so the linker has obviously linked it correctly (I had previously assumed that a tentative definition would be initialised to 0 by the compiler, becoming a global definition, and would cause a link error).
It turns out that an int i[]; tentative definition in both files also gets linked to one definition. int i[5]; is also a tentative definition placed in .common, just with a different size expressed to the assembler. The former is known as a tentative definition with an incomplete type, whereas the latter is a tentative definition with a complete type.
What happens with the C compiler is that int a is made a strong-bound weak global in .common and left uninitialised in the symbol table (where .common implies a weak global), whereas extern int a would be an extern symbol. The linker then makes the necessary decision: it ignores all weak-bound globals (those defined using #pragma weak) if there is a strong-bound global with the same identifier in some translation unit, and two strong-bound globals would be a multiple definition error. If it finds no strong-bound global and one weak-bound one, the output is that single weak-bound global. If it finds no strong-bound global but two weak-bound ones, it chooses the definition in the first file on the command line and outputs that single weak-bound global; the two weak-bound globals are two definitions to the linker (because the compiler initialises them to 0), yet this is not a multiple definition error, because both are weak-bound. Finally the linker resolves all .common symbols to point to the strong- or weak-bound strong global. https://godbolt.org/z/Xu_8tY https://docs.oracle.com/cd/E19120-01/open.solaris/819-0690/chapter2-93321/index.html
As baz is declared with #pragma weak, it is weak-bound, gets zeroed by the compiler and is placed in .bss (even though it is a weak global it doesn't go in .common, because it is weak-bound; all weak-bound variables go in .bss if uninitialised, where the compiler initialises them, or in .data if they are initialised). If it were not declared with #pragma weak, baz would go in .common and the linker would zero it if no weak- or strong-bound strong global symbol were found.
The C++ compiler makes int a a strong-bound strong global in .bss and initialises it to 0: https://godbolt.org/z/aGT2-o. The linker therefore treats the two files as containing a multiple definition.
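To make the terminology above concrete, a small C sketch (assuming GCC on an ELF target with -fcommon, which places the symbols as described):
#pragma weak baz
int baz;        /* weak-bound: zeroed by the compiler and placed in .bss */
int a;          /* tentative definition: a common symbol in .common */
int b = 1;      /* strong-bound strong global, placed in .data */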
Update 2:
GCC 10.1 defaults to -fno-common. As a result, global variable accesses are more efficient on various targets. In C, global variables with multiple tentative definitions now result in linker errors (as in C++). With -fcommon such definitions are silently merged during linking.
I'll address the C end of the question, since I'm more familiar with that language and you seem to already be pretty clear on why the C++ side works as it does. Someone else is welcome to add a detailed C++ answer.
As you noted, in your first example, C treats the line int a; as a tentative definition (see 6.9.2 in N2176). The later int a = 3; is a declaration with an initializer, so it is an external definition. As such, the earlier tentative definition int a; is treated as merely a declaration. So, retroactively, you have first declared a variable at file scope and later defined it (with an initializer). No problem.
In your second example, file2 also has a tentative definition of a. There is no external definition in this translation unit, so
the behavior is exactly as if the translation unit contains a file scope declaration of that identifier, with the composite type as of the end of the translation unit, with an initializer equal to 0. [6.9.2 (1)]
That is, it is as if you had written int a = 0; in file2. Now you have two external definitions of a in your program, one in file1 and another in file2. This violates 6.9 (5):
If an identifier declared with external linkage is used in an expression (other than as part of the operand of a sizeof or _Alignof operator whose result is an integer constant), somewhere in the entire program there shall be exactly one external definition for the identifier; otherwise, there shall be no more than one.
So under the C standard, the behavior of your program is undefined and the compiler is free to do as it likes. (But note that no diagnostic is required.) With your particular implementation, instead of summoning nasal demons, what your compiler chooses to do is what you described: use the common feature of your object file format, and have the linker merge the definitions into one. Although not required by the standard, this behavior is traditional at least on Unix, and is mentioned by the standard as a "common extension" (no pun intended) in J.5.11.
This feature is quite convenient, in my opinion, but since it's only possible if your object file format supports it, we couldn't really expect the C standard authors to mandate it.
clang doesn't document this behavior very clearly, as far as I can see, but gcc, which has the same behavior, describes it under the -fcommon option. On either compiler, you can disable it with -fno-common, and then your program should fail to link with a multiple definition error.
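A two-file sketch of the flag's effect (behaviour as described above; both gcc and clang accept these flags):
/* file1.c */
int a = 2;      /* external definition */

/* file2.c */
int a;          /* tentative definition */

/* cc -fcommon file1.c file2.c    -> links; the common symbol resolves to the
                                      definition in file1, so a == 2           */
/* cc -fno-common file1.c file2.c -> link error: multiple definition of `a'    */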
In C and C++, you can't have a function with two definitions. For example, say we have the following two files:
1.c:
int main(){ return 0; }
2.c:
int main(){ return 0; }
Issuing the command gcc 1.c 2.c will give you a duplicate symbol linker error.
Why doesn't the same happen with structs and classes? Why are we allowed to have multiple definitions of the same struct as long as they have the same tokens?
To answer this question, one has to delve into the compilation process and what is needed in each of its parts (the question of why these steps are performed is more historical, going back to the beginnings of C before its standardization).
C and C++ programs are compiled in multiple steps:
Preprocessing
Compilation
Linkage
Preprocessing is everything that starts with #; it's not really important here.
Compilation is performed on each and every translation unit (typically a single .c or .cpp file plus the headers it includes). The compiler takes one translation unit at a time, reads it, produces an internal list of classes and their members, and then emits the assembly code of each function in the given unit (based on that list of structures). If a function call is not inlined (e.g. it is defined in a different TU), the compiler produces a "link" ("please insert function X here") for the linker to read.
Then the linker takes all of the compiled translation units and merges them into one binary, substituting all the links specified by the compiler.
Now, what is needed at each phase?
For the compilation phase, you need:
the definition of every class used in this file (the compiler needs to know the size and offset of each class member to produce assembly);
the declaration of every function used in this file (to produce those "links").
Since function definitions are not needed to produce assembly (as long as they are compiled somewhere), they are not needed in the compilation phase, only in the linking phase; see the sketch below.
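A C++ sketch of that distinction (illustrative names; area() is assumed to be defined in some other TU):
struct Widget { int x; int y; };      // definition needed: compiler must know size and offsets

int area(const Widget&);              // declaration alone suffices to compile a call

int twiceArea(const Widget& w) {
    return 2 * area(w);               // compiles now; the linker resolves area() later
}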
To sum up:
The One Definition Rule is there to protect programmers from themselves. If they accidentally define a function twice, the linker will notice and no executable is produced.
However, class definitions are required in every translation unit, so such a rule cannot be set up for them. Since it cannot be enforced by the language, programmers have to be responsible beings and not define the same class in different ways.
The ODR also has other consequences, e.g. you have to define template functions (and template class methods) in header files. You can also take the responsibility and say to the compiler "every definition of this function will be the same, trust me dude" by making the function inline.
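A sketch of that promise, safe to place in a header included by many TUs:
// inline: every TU may repeat this definition, provided all copies are identical
inline int twice(int x) { return 2 * x; }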
There is no use case for a function with two definitions. Either the two definitions would have to be the same, making one of them useless, or the compiler wouldn't be able to tell which one you meant.
This is not the case with classes or structures. There is also a large advantage to allowing multiple definitions of them, i.e. if we want to use a class or struct in multiple files. (This leads indirectly to multiple definitions because of includes.)
Structures, classes, unions and enumerations define types that can be used in several compilation units to define objects of those types. So each compilation unit needs to know how the types are defined, for example to allocate memory for an object correctly, or to be sure that a specified member of a class does indeed exist.
For functions (if they are not inline functions) it is enough to have their declaration without their definition to generate for example a function call.
But the function definition shall be single. Otherwise the linker would not know which function to call, or the object code would be bloated by duplication and error-prone.
It's quite simple: It's a question of scope. Non-static functions are seen (callable) by every compilation unit linked together, while structures are only seen in the compilation unit where they are defined.
For example, it's valid to link the following together because it's clear which definition of struct Foo and which definition of f is being used:
1.c:
struct Foo { int x; };
static void f(void) { struct Foo foo; ... }
2.c:
struct Foo { double d; };
static void f(void) { struct Foo foo; ... }
int main(void) { ... }
But it isn't valid to link the following together because the linker wouldn't know which f to call.
1.c:
void f(void) { ... }
2.c:
void f(void) { ... }
int main(void) { f(); }
Actually, every programming element is associated with a scope of applicability, and within that scope you cannot have the same name associated with multiple definitions of an element. In the compiled world:
You cannot have more than one class definition with the same name within a single file. But you can have it in different compilation units.
You cannot have the same function or global variable name within a single link unit (library or executable), but you can potentially have functions named the same within different libraries.
You cannot have shared libraries with the same name situated in the same directory, but you can have them in different directories.
C/C++ compilation cares very much about compilation performance. Checking two objects such as functions or classes for identity is a time-consuming task, so it is not done; only names are compared. It is better to assume that two definitions are different and error out than to check them for identity. The only exception to this rule is text macros.
Macros are a preprocessor concept, and historically it has been allowed to have multiple identical macro definitions. If a definition changes, a warning is generated. Comparing macro contents is easy, just a simple string comparison, but some macro definitions can be huge.
Types are a compiler concept and are resolved by the compiler. Types do not exist in object libraries; there they are represented only by the sizes of the corresponding variables. So there is no reason to check for type-name collisions at that level.
Functions and variables, on the other hand, are named pointers to executable code or data. They are the building blocks of applications, which are in some cases assembled from code and libraries coming from all around the world. In order to use someone else's function you'd better know its name, and you don't want the same name to be used by someone else. Within a shared library the names of functions and variables are usually stored in a hash table, and there is no place for duplicates there.
And as I already mentioned, checking functions for identical contents is seldom done; there are some cases where it happens, but not in C or C++.
The reason for forbidding two different definitions of the same thing in a program is to avoid the ambiguity of deciding which definition to use at run time.
If you want two different implementations of the same thing to coexist in a program, there's the possibility of aliasing them (with a different name each) behind a common reference that decides at runtime which of the two to use.
Anyway, in order to distinguish the two, you have to be able to indicate to the compiler which one you want to use. In C++ you can overload a function, giving it the same name and a different list of parameters, so you can distinguish which of the two you want. But in C, the compiler only preserves the name of the function, to be able to resolve at link time which definition matches the name used in a different compilation unit. If the linker ends up with two different definitions of the same name, it is incapable of deciding which one to use, so it emits an error and gives up the build process.
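A small C++ sketch of that difference (the mangled names shown assume the Itanium ABI used by g++ and clang):
void print(int i);       // becomes the symbol _Z5printi
void print(double d);    // becomes the distinct symbol _Z5printd, so both can coexist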
What productive use could this ambiguity possibly have? That is the question you actually have to ask yourself.
Here is an example I found:
in main.c
int main() {
cppsayhello("Hello from C to C++");
return 0;
}
in cppsayhello.cpp
#include <iostream>
#include "cppsayhello.h"
extern "C" void cppsayhello(char* str);
void cppsayhello(char *str) {
std::cout << str << std::endl;
}
It works! main.c includes nothing, so how can main.c know of the existence of the function cppsayhello? Could somebody tell me how it works behind the scenes?
(I'm currently working on an embedded system. The bottom level is written in C, and I want to use C++ to build the top-level application, but it's hard to work with two languages.)
main.c is compiled by a C compiler that allows implicit function declarations. When the C compiler finds a call to a function that has not been declared yet, it assumes int function(), which in C means a function returning int and taking an unspecified number of parameters of unspecified types.
C99 and newer do not allow implicit declarations, and the code would not compile with a C99-compliant compiler.
It is a sign that your C compiler predates the 1999 standard, since later C compilers will reject that main() function.
Generally speaking, with older C compilers, your C code will link even when the use of an implicitly declared function does not match the actual definition. The result is undefined behaviour (although in practice the code often still works; one possible manifestation of undefined behaviour is that the code works as expected with at least some compilers).
It works behind the scenes because C, unlike C++, does not support function overloading. So any function named cppsayhello() will be given the same name in the object file by the compiler, and the linker can match things up. You could therefore define your cppsayhello() with any return type and any set of arguments you desire, and your code would still compile and link. However, the linker must only see one definition (it will complain about a multiply-defined symbol if it encounters more than one definition of anything, for example when linking two object files that each contain a definition of some function).
Your code would avoid the undefined behaviour if the main() function had visibility of a proper declaration of the function.
void cppsayhello(const char *);
int main()
{
cppsayhello("Hello from C to C++");
return 0;
}
That will prevent the main() function compiling if it uses the function in any manner inconsistent with the declaration.
The whole thing is undefined behavior and works by luck.
A few things happen behind the scenes but your compiler should complain a lot.
In your main.c: in C, a function the compiler does not know about is implicitly assumed to return int, with no checking of the parameters. (This applies to C older than C99; a C99 or later compiler will give you an error for this code.)
In your cppsayhello.cpp: thanks to the extern "C", your C++ compiler creates an object file containing a function with C calling conventions and no name mangling.
The linker takes both objects, finds an exported symbol cppsayhello in cppsayhello's object file and a required symbol cppsayhello in main's object file, and links the two.
Obviously the two have different signatures, but in your case the returned value is ignored and the parameters are compatible. The actual runtime behaviour may still depend on the architecture and compiler.
In short: your example is very bad code.
Let's take this code sample:
//header
struct A { };
struct B { };
struct C { };
extern C c;
//code
A myfunc(B& b) { A a; return a; }
void myfunc(B& b, C& c) {}
C c;
Let's go through this line by line, starting from the code section.
When the compiler sees the first myfunc method, it does not care about A or B, because their use is internal. Each C++ file will know what it takes in and what it returns. But there needs to be a name for each of the two overloads, so how is that chosen, and how does the linker know which means what?
Next is C c;. I once had a bug where the linker wouldn't recognize, and thus allow me access to, c from other C++ files. It was because that .cpp didn't know c was extern, and I had to mark it as extern in the header before I could link successfully. Now I am not sure if the class type has any involvement with the linker and the variable c. I don't know how RTTI is involved, but I do know c needs to be visible to other files.
How does the linker work, with name mangling and all that?
We first need to understand where compilation ends and linking begins. Compilation involves taking a compilation unit (a C or C++ source file) and turning it into an object file. Simplistically, this involves generating snippets of machine code for each function as well as a symbol table for all functions and static (global) variables. Placeholders are used for any symbols needed by the compilation unit that are external to the said compilation unit.
The linker is then responsible for loading all the object files and resolving all the placeholder symbols with real addresses (or offsets, for position-independent code). This is placed into various sections that can be read by the operating system's dynamic loader when loading an executable.
So for the specifics. In order to avoid errors during linking, the compiler requires you to declare all external symbols that will be used by the current compilation unit. For global variables one must use the extern keyword, for functions this is optional.
All functions and global variables defined in a compilation unit have external linkage (i.e., they can be referenced by other compilation units) unless declared static (or placed in an unnamed namespace in C++). In C++, a class's vtable will also have a symbol that needs linkage.
Now, since functions in C++ can be overloaded, the parameters also form part of the function's name. Since machine code is just addresses and registers, extra information needs to be added to the function's name in the symbol table. This extra parameter information comes in the form of a mangled name and ensures that the linker links to the correct version of an overloaded function.
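For instance, the two overloads from the question's code would get distinct symbols (the mangled names shown assume the Itanium C++ ABI used by g++ and clang; c++filt decodes them):
struct A { }; struct B { }; struct C { };

A myfunc(B&);            // symbol: _Z6myfuncR1B
void myfunc(B&, C&);     // symbol: _Z6myfuncR1BR1C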
If you really are interested in the gory details, take a look at the ELF file format, used extensively on Linux. Windows has a different format, but the principles can be expected to be the same.
Name mangling on the Itanium (and ARM) platforms can be found here:
http://en.wikipedia.org/wiki/Name_mangling