I am trying to understand ODR.
I created one file pr1.cpp like this:
struct S{
int a;
};
a second file pr2.cpp like this:
struct S {
char a;
};
and a main file like this:
#include <iostream>
int main() {
return 0;
}
I am compiling from the terminal with the command:
g++ -Wall -Wextra pr1.cpp pr2.cpp main.cpp -o mypr
The compiler does not report any error, BUT there are two definitions of the type "S"... I do not understand what is really happening. I thought I would get an error after the linking phase because of the ODR violation.
I can get the error only editing the main.cpp file adding :
#include "pr1.cpp"
#include "pr2.cpp"
Can anyone explain to me what is happening?
Unless you include both definitions in the same file, there is no problem. This is because the compiler operates on a single translation unit which is usually a .cpp file. Everything you #include in this file is also part of the translation unit because the preprocessor basically copies and pastes the contents of all included files.
What happens is that the compiler will create an object file (usually .o or .obj) for each translation unit, and then the linker will create a single executable (or .dll etc.) by linking all the object files and the libraries the project depends on. In your case the compiler encountered each struct in a different translation unit, so it doesn't see a problem. When you include both files, the two definitions now find themselves in the same translation unit, and the compiler reports an error because it cannot resolve the ambiguity if an S is used in this translation unit (even though you don't have to use one for the program to be ill-formed).
As a side-note, do not include .cpp files in other .cpp files. I'm sure you can find a lot on how to organize your code in header and source files and it doesn't directly answer the question so I won't expand on it.
EDIT: I neglected to say why you didn't get a linker error. Some comments have pointed out that this is undefined behavior, which means that even though your linker should probably complain, it doesn't actually have to. In your case you have one .obj file for each struct and a main.obj. None of these references the others, so the linker does not see any references that it needs to resolve, and it probably doesn't bother checking for ambiguous symbols.
I assume most linkers would throw an error if you declared struct S; and tried to use an S* or S& (an actual S would require a definition inside the same translation unit). That is because the linker would need to resolve that symbol and it would find two matching definitions. Given that this is undefined behavior, though, a standard-compliant linker could just pick one and silently link your program into something nonsensical because you meant to use the other. This can be especially dangerous for structs that get passed around from one .cpp to the other, as the definition needs to be consistent. It can also be a problem when identically named structs/classes are passed across library boundaries. Always avoid duplicating names for these reasons.
Related
I'm currently learning C++ from a book called Alex Allain - Jumping into C++, and I got stuck at chapter 21. It details the C++ build process, and I understood it except for two parts:
First:
"The header file should not contain any function definitions. If we had added a function definition to the header file and then included that header file into more than one source file, the function definition would have shown up twice at link time. That will confuse the linker."
Second:
"Never ever include a .cpp file directly. Including a .cpp file will just lead to problems because the compiler will compile a copy of each function definition in the .cpp file into each object file, and the linker will see multiple definitions of the same function. Even if you were incredibly careful about how you did this, you would also lose the time-saving benefits of separate compilation."
Can somebody explain them?
A C++ program is created from one or more translation units. Each translation unit (TU for short) is basically a single source file together with all the header files it includes. When the compiler creates an object file, it is compiling one TU. When you link, you take the object files produced by the compiler and combine them with libraries to create an executable program.
A program can only have a single definition of anything. If you have multiple definitions you will get an error when linking. A definition can be a variable definition like
int a;
or
double b = 6.0;
It can also be a function definition, which is the actual implementation of the function.
The reason you can only have a single definition is because these definitions are mapped to memory addresses when the program is loaded to be executed. A variable or a function can not exist in two places at once.
This is one of the reasons you should not include source files in other source files. It is also the reason you should not have definitions in header files: header files can be included in multiple source files, which would lead to the definition appearing in multiple TUs.
There are of course exceptions to this, like functions marked inline or static. That works because such definitions are not exported from the TU, so the linker never sees them.
For Example
#include <iostream>
int add(int x, int y);
int main()
{
std::cout << add(5, 5) << std::endl;
}
This would compile but not link. I understand the problem, I just don't understand why it compiles fine but doesn't link.
Because the compiler doesn't know whether that function is provided by a library (or other translation unit). There's nothing in your prototype that tells the compiler the function is defined locally (you could use static for that).
The input to a C or C++ compiler is one translation unit - more or less one source code file. Once the compiler is finished with that one source code, it has done its job.
If you call/use a symbol, such as a function, which is not part of that translation unit, the compiler assumes it's defined somewhere else.
Later on, you link together all the object files and possibly the libraries you want to use, all references are tied together - it's only at this point, when pulling together everything that's supposed to create an executable, one can know that something is missing.
When a compiler compiles, it generates output (an object file) with a table of defined symbols (T) and undefined symbols (U) (see the man page of nm). Hence there is no requirement that all references be defined in every translation unit. When all the object files are linked (along with any libraries etc.), the final binary should have all symbols defined (unless the target is itself a library). This is the job of the linker. Hence, based on the requested target type (library or not), the linker might or might not give an error for undefined functions.

Even if the target is not a library, if it is not statically linked it may still refer to shared libraries (.so or .dll), so if a shared library or any of its symbols is missing when the binary is run on the target machine, you can get a loader error. Hence between compiler, linker and loader, each one tries its best to provide you with the definition of every symbol needed. By declaring add, you are pacifying the compiler, which hopes that the linker or loader will do the required job. Since you didn't pacify the linker (by, say, providing it with a shared library reference), it stops and complains. If you had pacified the linker too, you would have gotten the error from the loader.
Is there any reason to prefer linker commands over include directives if you don't plan on recompiling the included files separately?
P.S. If it matters, I'm actually concerned with C++ and g++, but I thought gcc would be more recognizable as a generic compiler.
Is there any reason to prefer linker commands over include directives
Yes. You'll get into serious trouble if you include implementation (.c) files here and there. Meet the infamous "Multiple definitions of symbol _MyFunc" linker error...
(By the way, it's also considered bad style/practice, in general, only header files are meant to be included.)
If you really want to just have one long C file, use your editor to insert file2.c into file1.c and then delete file2.c. If they ALWAYS go together, then that's (possibly) the right solution. Using #include for this is not the right solution.
The reason we split files into separate .c and .cpp files is that they logically do something separate from the rest of the code. Compiling each unit separately is a good idea when programs are large, but the main reason for splitting things into separate files is to show the independence of each unit of code. This way, you can see what other parts affect this particular file (looking at the headers that are included). If a class is local to a .cpp file, you know that class isn't used somewhere else in the system, so you can safely change the internals of that class without having to worry about other components being affected, for example. On the other hand, if everything is in one large file, then it's very hard to follow what's affecting what, and what is safe to change.
Here's the difference. file1.c:
#include <stdio.h>
static int foo = 37;
int main() { printf("%d\n", foo); }
file2.c:
static int foo = 42;
These two trivial modules compile fine with gcc file1.c file2.c, even though file2.c's definition of foo is then never used. static identifiers are visible only within a translation unit (C's version of what is more commonly called a module).
When you #include "file2.c" in file1.c, you effectively insert file2.c into file1.c, causing an identifier clash because the two files now become one translation unit.
As a rule, never #include a C or C++ source file. Only #include headers.
I have a c++ program that compiled previously, but after mucking with the Jamfiles, the program no longer compiled and ld emitted a duplicate symbol error. This persisted after successively reverting to the original Jamfiles, running bjam clean, removing the objects by hand, and switching from clang with the gcc front end to gcc 4.2.1 on MacOs 10.6.7.
A simplified description of the program is that there is a main.cpp and four files, a.h/.cpp and b.h/.cpp, which are compiled into a static library that is linked with main.o. Both main.cpp and b.cpp depend on the file containing the offending symbol, off.h, through two different intermediate files, but neither a.h nor a.cpp depends in any way on off.h.
Before you ask, I made sure that all files were wrapped in include guards (#ifndef, #define, #endif), and while I did find a file that was missing them, it did not reference off.h. More importantly, b.h does not include anything that references off.h; only the implementation, b.cpp, makes any reference to off.h. This alone had me puzzled.
To add to my confusion, I was able to remove the reference to off.h from b.cpp and, as expected, it recompiled successfully. However, when I added the reference back in, it also compiled successfully, and continued to do so after cleaning out the object files. I am still at a loss for why it was failing to compile, especially considering that the symbols should not have conflicted, I had prevented symbol duplication, and I had gotten rid of any prior/incomplete builds.
Since I was able to successfully compile my program, I doubt I'll be able to reproduce it to test out any suggestions. However, I am curious as to how this can happen, and if I run across this behavior in the future, what, if anything beyond what I've done, might I do to fix it?
This is often the result of defining an object in a header file, rather than merely declaring it. Consider:
h.h :
#ifndef H_H_
#define H_H_
int i;
#endif
a.cpp :
#include "h.h"
b.cpp :
#include "h.h"
int main() {}
This will produce a duplicate symbol i. The solution is to declare the object in the header file: extern int i; and to define it in exactly one of the source-code files: int i;.
I guess I'm not linking something right?
I want to call ABC.cpp which needs XYZ.h and XYZ.cpp. All are in my current directory and I've tried #include <XYZ.h> as well as #include "XYZ.h".
Running $ g++ -I. -l. ABC.cpp at the Ubuntu 10 Terminal gives me:
/tmp/ccCneYzI.o: In function `ABC(double, double, unsigned long)':
ABC.cpp:(.text+0x93): undefined reference to `GetOneGaussianByBoxMuller()'
collect2: ld returned 1 exit status
Here's a summary of ABC.cpp:
#include "XYZ.h"
#include <iostream>
#include <cmath>
using namespace std;
double ABC(double X, double Y, unsigned long Z)
{
...stuff...
}
int main()
{
...cin, ABC(cin), return, cout...
}
Here's XYZ.h:
#ifndef XYZ_H
#define XYZ_H
double GetOneGaussianByBoxMuller();
#endif
Here's XYZ.cpp:
#include "XYZ.h"
#include <cstdlib>
#include <cmath>
// basic math functions are in std namespace but not in Visual C++ 6
//(comment's in code but I'm using GNU, not Visual C++)
#if !defined(_MSC_VER)
using namespace std;
#endif
double GetOneGaussianByBoxMuller()
{
...stuff...
}
I'm using GNU Compiler version g++ (Ubuntu 4.4.3-4ubuntu5) 4.4.3.
This is my first post; I hope I included everything that someone would need to know to help me. I have actually read the "Related Questions" and the Gough article listed in one of the responses, as well as searched around for the error message. However, I still can't figure out how it applies to my problem.
Thanks in advance!
When you run g++ -I. -l. ABC.cpp you are asking the compiler to create an executable out of ABC.cpp. But the code in this file relies on a function defined in XYZ.cpp, so the executable cannot be created because of that missing function.
You have two options (depending on what it is that you want to do). Either you give the compiler all of the source files at once so that it has all the definitions, e.g.
g++ -I. -l. ABC.cpp XYZ.cpp
or, you use the -c option to compile ABC.cpp to object code (.obj on Windows, .o on Linux) which can be linked later, e.g.
g++ -I. -l. -c ABC.cpp
This will produce ABC.o, which can be linked later with XYZ.o to produce an executable.
Edit: What is the difference between #including and linking?
Understanding this fully requires understanding exactly what happens when you compile a C++ program, which unfortunately even many people who consider themselves to be C++ programmers do not. At a high level, the compilation of a C++ program goes through three stages: preprocessing, compilation, and linking.
Preprocessing
Every line that starts with # is a preprocessor directive which is evaluated at the preprocessing stage. The #include directive is literally a copy-and-paste. If you write #include "XYZ.h", the preprocessor replaces that line with the entire contents of XYZ.h (including recursive evaluations of #include within XYZ.h).
The purpose of including is to make declarations visible. In order to use the function GetOneGaussianByBoxMuller, the compiler needs to know that GetOneGaussianByBoxMuller is a function, what (if any) arguments it takes, and what value it returns; to know all of that, the compiler needs to see a declaration for it. Declarations go in header files, and header files are included to make declarations visible to the compiler before the point of use.
Compiling
This is the part where the compiler runs and turns your source code into machine code. Note that machine code is not the same thing as executable code. An executable requires additional information about how to load the machine code and the data into memory, and how to bring in external dynamic libraries if necessary. That's not done here. This is just the part where your code goes from C++ to raw machine instructions.
Unlike Java, Python, and some other languages, C++ has no concept of a "module". Instead, C++ works in terms of translation units. In nearly all cases, a translation unit corresponds to a single (non-header) source code file, e.g. ABC.cpp or XYZ.cpp. Each translation unit is compiled independently (whether you run separate -c commands for them, or you give them to the compiler all at once).
When a source file is compiled, the preprocessor runs first, and does the #include copy-pasting as well as macros and other things that the preprocessor does. The result is one long stream of C++ code consisting of the contents of the source file and everything included by it (and everything included by what it included, etc...) This long stream of code is the translation unit.
When the translation unit is compiled, every function and every variable used must be declared. The compiler will not allow you to call a function for which there is no declaration or to use a global variable for which there is no declaration, because then it wouldn't know the types, parameters, return values, etc, involved and could not generate sensible code. That's why you need headers -- keep in mind that at this point the compiler is not even remotely aware of the existence of any other source files; it is only considering this stream of code produced by the processing of the #include directives.
In the machine code produced by the compiler, there are no such things as variable names or function names. Everything must become a memory address. Every global variable must be translated to a memory address where it is stored, and every function must have a memory address that the flow of execution jumps to when it is called. For things that are defined (i.e. for functions, implemented) in the translation unit, the compiler can assign an address. For things that are only declared (usually as a result of included headers) and not defined, the compiler does not at this point know what the memory address should be. These functions and global variables for which the compiler has only a declaration but not a definition/implementation, are called external symbols, and they are presumed to exist in a different translation unit. For now, their memory addresses are represented with placeholders.
For example, when compiling the translation unit corresponding to ABC.cpp, it has a definition (implementation) of ABC, so it can assign an address to the function ABC and wherever in that translation unit ABC is called, it can create a jump instruction to that address. On the other hand, although its declaration is visible, GetOneGaussianByBoxMuller is not implemented in that translation unit, so its address must be represented with a placeholder.
The result of compiling a translation unit is an object file (with the .o suffix on Linux).
Linking
One of the main jobs of the linker is to resolve external symbols. That is, the linker looks through a set of object files, sees what their external symbols are, and then tries to find out what memory address should be assigned to them, replacing the placeholder.
In your case the function GetOneGaussianByBoxMuller is defined in the translation unit corresponding to XYZ.cpp, so inside XYZ.o it has been assigned a specific memory address. In the translation unit corresponding to ABC.cpp, it was only declared, so inside ABC.o, it is only a placeholder (external symbol). The linker, if given both ABC.o and XYZ.o will see that ABC.o needs an address filled in for GetOneGaussianByBoxMuller, find that address in XYZ.o, and replace the placeholder in ABC.o with it. Addresses for external symbols can also be found in libraries.
If the linker fails to find an address for GetOneGaussianByBoxMuller (as it does in your example where it is only working on ABC.o, as a result of not having passed XYZ.cpp to the compiler), it will report an unresolved external symbol error, also described as an undefined reference.
Finally, once the linker has resolved all external symbols, it combines all of the now-placeholder-free object code, adds in all the loading information that the operating system needs, and produces an executable. Tada!
Note that through all of this, the names of the files don't matter one bit. It's a convention that XYZ.h should contain declarations for things that are defined in XYZ.cpp, and it's good for maintainable code to organize things that way, but the compiler and linker don't care one bit whether that's true or not. The linker will look through all the object files it's given and only the object files it's given to try to resolve a symbol. It neither knows nor cares which header the declaration of the symbol was in, and it will not try to automatically pull in other object files or compile other source files in order to resolve a missing symbol.
... wow, that was long.
Try
g++ ABC.cpp XYZ.cpp
If you want to compile them separately, you need to build object files:
g++ -c ABC.cpp
g++ -c XYZ.cpp
g++ ABC.o XYZ.o
Wish I had read these when I was having these problems:
http://c.learncodethehardway.org/book/learn-c-the-hard-waych3.html
http://www.thegeekstuff.com/2010/08/make-utility/