Compiling & linking multiple files in C++

One of my "non-programmer" friends recently decided to make a C++ program to solve a complicated mechanical problem.
He wrote each function in a separate .cpp file, then included them all in the main source file, something like this:
main.cpp:
#include "function1.cpp"
#include "function2.cpp"
...
int main()
{
...
}
He then compiled the code, with a single gcc line:
g++ main.cpp // took about 2 seconds
Now, I know that this should work, but I'm not sure whether including .cpp files directly into the main program is a good idea. I have seen the following scheme several times, where all the function prototypes go into a header file with the extern keyword, like this:
funcs.h:
extern void function1(..);
extern void function2(..);
...
main.cpp:
...
#include "funcs.h"
...
& compiling with:
g++ -c function1.cpp
g++ -c function2.cpp
...
g++ -c main.cpp
g++ -o final main.o function1.o function2.o ...
I think that this scheme is better (with a makefile, of course). What reasons can I give my friend to convince him of this?

The main reason people compile object by object is to save time. High-level localised code changes often require recompiling only one object followed by a relink, which can be much faster. (Conversely, splitting code across too many objects that each drag in heaps of headers, or that redundantly instantiate the same templates, may actually end up slower when a change in common code triggers a fuller recompilation.)
If the project is so small that it can be compiled in 2 seconds, then there's not much actual benefit to the traditional approach, though doing what's expected can save developer time - like yours and ours on here :-). Balancing that, maintaining a makefile takes time too, though you may well end up doing that anyway in order to conveniently capture include directories, libraries, compiler switches etc.
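As a sketch of what such a makefile might look like for the scheme above (the flags are illustrative, and recipe lines must start with a tab):

CXX = g++
CXXFLAGS = -Wall -O2

final: main.o function1.o function2.o
	$(CXX) -o final main.o function1.o function2.o

%.o: %.cpp funcs.h
	$(CXX) $(CXXFLAGS) -c $<

Running make then recompiles only the .cpp files that changed (or everything, if funcs.h changed) and relinks.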
Actual implications to written/generated code:
cpp files normally include their own headers first, which provides a sanity check that the header's content can be used independently by other client code; put everything together in one file and the namespace is already "contaminated" with includes from earlier headers/implementation files
the compiler may optimise better when everything is in one translation unit (+1 for leppie's comment saying the same...)
static non-member variables and anonymous namespaces are private to the translation unit, so including multiple cpps means sharing these around, for better or worse (+1 for Alexander :-)); a sketch follows this list
say a .cpp file defines a function or variable that is not mentioned in its header, and might even be in an anonymous namespace or static: code later in the same translation unit could call it freely without needing to hack up its own forward declaration (this is bad: if the function was intended to be called from outside its own .cpp, it should have been declared in the header and exposed as an external symbol in its translation unit's object)
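As a quick sketch of the translation-unit privacy point (hypothetical files):

function1.cpp:
namespace    // anonymous namespace: contents are private to this translation unit
{
    int scratch = 0;
}
void function1() { ++scratch; }

function2.cpp:
namespace    // a different .cpp may reuse the same name without conflict
{
    int scratch = 0;
}
void function2() { --scratch; }

Compiled object by object, the two scratch variables are independent. #include both .cpps into main.cpp, however, and both land in the same translation unit's anonymous namespace, so the second definition of scratch is a redefinition error.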
BTW - in C++ your headers can declare functions without explicitly using the extern keyword, and it's normal to do so.
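For example, this header means exactly the same as the extern version above (parameter lists invented for illustration):

funcs.h:
void function1(double x);   // implicitly extern: identical to 'extern void function1(double x);'
void function2(int n);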

The reason for the second style is that each .cpp file can be treated separately, with its own classes, global variables, etc., without risk of conflict.
It is also easier in IDEs that automatically link all the .cpp files (like MSVC).

Related

What's the REAL difference between .h and .cpp files?

This question was posted several times on StackOverflow, but most of the answers stated something similar to ".h files are supposed to contain declarations whereas .cpp files are supposed to contain their definitions/implementation". I've noticed that simply defining functions in .h files works just fine. What's the purpose of declaring functions in .h files but defining and implementing them in .cpp files? Does it really reduce compile time? What else?
Practically: the conventions around .h files are in place so that you can safely include that file in multiple other files in your project. Header files are designed to be shared, while code files are not.
Let's take your example of defining functions or variables. Suppose your header file contains the following line:
header.h:
int x = 10;
code.cpp:
#include "header.h"
Now, if you only have one code file and one header file this probably works just fine:
g++ code.cpp -o outputFile
However, if you have two code files this breaks:
header.h:
int x = 10;
code1.cpp:
#include "header.h"
code2.cpp:
#include "header.h"
And then:
g++ code1.cpp -c (produces code1.o)
g++ code2.cpp -c (produces code2.o)
g++ code1.o code2.o -o outputFile
This breaks, specifically at the linker step, because now you have two symbols in the same executable with the same name, and the linker doesn't know what it's supposed to do with that. When you include your header in code1 you get a symbol "x", and when you include your header in code2 you get another symbol "x". The linker doesn't know your intention here, so it throws an error:
code2.o:(.data+0x0): multiple definition of `x'
code1.o:(.data+0x0): first defined here
collect2: error: ld returned 1 exit status
Which again is just the linker saying that it can't resolve the fact that you now have two symbols with the same name in the same executable.
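For what it's worth, the conventional fix is to put only a declaration in the header and the single definition in one code file:

header.h:
extern int x;    // declaration only: no storage allocated here

code1.cpp:
#include "header.h"
int x = 10;      // the one and only definition

code2.cpp:
#include "header.h"
// can read and write x; the linker resolves it to code1.cpp's definition

(Since C++17 you can alternatively write inline int x = 10; directly in the header.)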
What's the REAL difference between .h and .cpp files?
They are both fundamentally just text files. From a certain perspective, their only difference is the filename.
However, many programming related tools treat the files differently depending on their name. For example, some tools will detect programming language: .c is compiled as C language, .cpp is compiled as C++ and .h is not compiled at all.
For header files, the name does not matter at all to the compiler. The name could be .h or .header or anything else; it doesn't affect how the preprocessor includes it. It is, however, good practice to conform to a common convention in order to avoid confusion.
I've noticed that simply defining functions in .h files works just fine.
Are the functions declared non-inline? Have you ever included the header file into more than one translation unit? If you answered yes to both, then your program has been ill-formed. If you didn't, then that would explain why you didn't encounter any problems.
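A minimal illustration of that rule (hypothetical header):

header.h:
int bad() { return 1; }           // non-inline definition: ill-formed once this header
                                  // is included into two or more translation units
inline int good() { return 2; }   // inline: identical definitions in several
                                  // translation units are explicitly allowed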
Does it really reduce compile time?
Yes. Dividing function definitions into smaller translation units can indeed reduce the time to compile said translation units compared to compiling larger translation units.
This is because doing less work takes less time. What is important to realise is that other translation units do not need to be recompiled when only one is modified. If you only have one translation unit, then you have to compile it i.e. the program in its entirety.
Multiple translation units are also better because they can be compiled in parallel, which allows taking advantage of modern multi core hardware.
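For example, with a make-based build:

make -j8    # build up to eight translation units concurrently

A single giant translation unit gets no such speedup; it occupies one core no matter how many you have.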
What else?
Does there need to be anything else? Having to wait a few minutes to compile your program instead of a day improves development speed drastically.
There are some other advantages too regarding organisation of files. In particular, it is quite convenient to be able to define different implementations of the same function for different target systems, in order to support multiple platforms. With header files you must do tricks with macros, while with source files you simply choose which files to compile.
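A sketch of that idea, with invented file names:

timer.h:
void sleep_ms(int ms);   // one declaration shared by all platforms

timer_posix.cpp:
#include "timer.h"
#include <unistd.h>
void sleep_ms(int ms) { usleep(ms * 1000); }

timer_win32.cpp:
#include "timer.h"
#include <windows.h>
void sleep_ms(int ms) { Sleep(ms); }

The Linux build compiles timer_posix.cpp, the Windows build compiles timer_win32.cpp, and no macro trickery appears in the shared code.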
Another use case where implementing functions in headers is not an option is distributing a library without source, as some middleware providers do. You must provide the headers, or else your functions cannot be called; but if all your source is in the headers, then you've given away your trade secrets. Compiled sources at least have to be reverse engineered first.
Keep in mind that the C++ compiler is a fairly simple beast as far as file-handling goes. All it's allowed to do is read in a single source-code file (and, via the pre-processor, logically insert into that incoming text-stream the contents of any files that the file #includes, recursively), parse the contents, and spit out the resulting .o file.
For small programs, keeping the entire codebase in a single .cpp file (or even a single .h file) works fine, because the number of lines of code that the compiler needs to load into memory is small (relative to the computer's RAM).
But imagine you are working on a monster program, with tens of millions of lines of code -- yes, such things do exist. Loading that much code into RAM at once would likely stress the capabilities of all but the most powerful computers, leading to exceedingly long compile times or even outright failure.
And even worse than that, touching any of the code in a .h file requires the recompilation of any other files that #include that .h file, either directly or indirectly -- so if all your code is in .h files, then your compiler is likely to spend a lot of time unnecessarily recompiling a lot of code that didn't actually change.
To avoid those problems, C++ lets you place your code into multiple .cpp files. Since .cpp files are (at least traditionally) never #include'd by anything, the only time your Makefile or IDE will need to recompile any given .cpp file is after you've actually modified that exact file, or a .h file it #include's.
So when you've modified a function in the 375th .cpp file out of 700 .cpp files in your program, and now you want to test your modification, the compiler only has to recompile that one .cpp file and then re-link the .o files into an executable. If OTOH you've modified a .h file, compilation might be much longer, because now the build system will have to recompile every other file that includes that .h file, directly or indirectly, just in case you changed the meaning of something those files depend on.
.cpp files also make link-time issues much easier to deal with. For example, if you want to have a global variable, defining that global variable in a .cpp file (and maybe declaring an extern for it in a .h file) is straightforward; if OTOH you want to do that in a .h file, you'll have to be very careful or you'll end up with duplicate-symbol errors from your linker, and/or subtle violations of the One Definition Rule that will come back to bite you later on.
The REAL difference is that your programming environment lists .h and .cpp files separately. And/or populates file-browser-dialogs appropriately. And/or tries to compile .cpp files into object form (but doesn't do that to .h files). And whatever, depending on which IDE / environment you use.
The second difference is that people assume that your .h files are header files, and that your .cpp files are code source files.
If you don't care about people or development environments, you can put any damn thing you want in a .h or .cpp file, and call them any thing you want. You can put your declarations in a .cpp file and call it an "include file", and your definitions in a .pas file and call it a "source file".
I have to do this kind of thing when working in a constrained environment.
Header files weren't part of the original definition of C. The world got on perfectly well without them. Opening and closing lots of header files did slow down the compilation of C, which is why we got pre-compiled header files. Pre-compiled headers do speed up compilation and linking, but no more than just writing assembler, or machine code, or anything else that doesn't take advantage of the co-operation of other people or a development environment.
It is useful to put declarations in a header file, and definitions in a code source file. That's why you should do that. There isn't a requirement.
Whenever you see an #include <header.h> directive, pretend that the contents of header.h are being copied and pasted right where the #include directive appears.
.cpp files get compiled to become .obj files. They have no knowledge of the existence of any other .cpp file, and are compiled individually. That's why we need to declare things before we use them - otherwise the compiler won't know whether the function we're trying to invoke exists within a different .cpp file.
We use header files to share declarations amongst multiple .cpp files to avoid having to write the same code over and over for every single .cpp file.

How do I partially-expose object contents in an object library?

I'm compiling some C++ code into a library. Suppose my source files are mylib.cpp and util.cpp. The code in util.cpp is used in the library implementation, but is not part of the library in the sense that code using the library cannot call it (it's not in the public headers) and should not be aware of its existence; but mylib.cpp includes util.hpp and does rely on the compiled util.cpp object code.
Now, if I compile mylib.o and util.o, then perform:
ar qc libmylib.a mylib.o util.o
my library works just fine; but the utility code is exposed as symbols. Thus, if I link this library with some other code, there might be double-definition clashes. Or that other code might inappropriately rely on the symbols being available (e.g. with its own header).
How can I ensure that only the object code in mylib.o (and in util.o) "sees" the symbols from util.o, while outside code does not?
Note: I believe this question stands also for C and perhaps other compiled languages.
Transferring comments into an answer.
If your C++ library has its own namespace, then using that or a sub-namespace is nominally the correct way to control access to the internal utilities. It sounds as if your code is not providing template classes — the constraints for those have to be thought through separately.
If privacy is a major concern, I'd probably consider including util.cpp (as well as util.hpp) into the source for mylib.cpp (meaning #include "util.cpp") with appropriate namespace controls so that the code from util.cpp is available inside mylib.cpp but not outside (using an anonymous namespace, or namespace mylib::Private or some such scheme). This is not very conventional, but it is probably effective (once you've worked out the necessary tweaks). The chances are that the combination TU (translation unit) is not so big as to cause your compiler major problems. This doesn't rely on compiler extensions.
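A sketch of that suggestion, assuming util.cpp is written so that it includes no headers of its own (any headers it needs must be included before the wrap, or their contents would land in the anonymous namespace too):

mylib.cpp:
#include "mylib.hpp"
#include <cmath>      // everything util.cpp depends on goes here, outside the wrap

namespace             // anonymous: the utilities become private to this translation unit
{
#include "util.cpp"
}

// mylib's code can call the utilities as before, but libmylib.a no longer
// exports their symbols for other code to clash with or depend on.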
Here is my fallback "solution", which is actually a workaround:
I keep the public visibility, but I burden all of the code in util.cpp with some element of naming which will make it effectively unique. For example, I may enclose those functions in a namespace mylib. Now the (demangled) symbols are all mylib::foo() (or mylib::util::foo()). They will still be searched, but it is reasonable to assume they won't match anything outside of the mylib code.
In addition to the hassle, this has the drawback of still allowing external code to depend on this internal utility code, if it chooses to do so intentionally.

Is '#include "file2.c"' different from 'gcc file1.c file2.c'?

Is there any reason to prefer linker commands over include directives if you don't plan on recompiling the included files separately?
P.S. If it matters, I'm actually concerned with C++ and g++, but I thought gcc would be more recognizable as a generic compiler.
Is there any reason to prefer linker commands over include directives
Yes. You'll get into serious trouble if you include implementation (.c) files here and there. Meet the infamous "Multiple definitions of symbol _MyFunc" linker error...
(By the way, it's also considered bad style/practice; in general, only header files are meant to be included.)
If you really want to just have one long C file, use your editor to insert file2.c into file1.c and then delete file2.c. If they ALWAYS go together, then that's (possibly) the right solution. Using #include for this is not the right solution.
The reason we split files into separate .c and .cpp files is that they logically do something separate from the rest of the code. Compiling each unit separately is a good idea when programs are large, but the main reason for splitting things into separate files is to show the independence of each unit of code. This way, you can see what other parts affect this particular file (looking at the headers that are included). If a class is local to a .cpp file, you know that class isn't used somewhere else in the system, so you can safely change the internals of that class without having to worry about other components being affected, for example. On the other hand, if everything is in one large file, then it's very hard to follow what's affecting what, and what is safe to change.
Here's the difference. file1.c:
#include <stdio.h>
static int foo = 37;
int main() { printf("%d\n", foo); }
file2.c:
static int foo = 42;
These two trivial modules compile fine with gcc file1.c file2.c, even though file2.c's definition of foo is then never used. static identifiers are visible only within a translation unit (C's version of what is more commonly called a module).
When you #include "file2.c" in file1.c, you effectively insert file2.c into file1.c, causing an identifier clash because the two files now become one translation unit.
As a rule, never #include a C or C++ source file. Only #include headers.

The way of the include in c++ using Eclipse

I learned that if I compile main.cpp the compiler simply replaces all includes with the actual content of the file i.e. #include "LongClassName.h" with the text in that file. This is done recursively in LongClassName.h. In the end the compiler sees a huge "virtual" file with the complete code of all .cpp and .h files.
But it seems to be much more complicated in real projects. I had a look at the Makefile Eclipse created for my Qt project, and it seems that there is an entry for every file, named file.o, whose dependencies are file.cpp and file.h. So that means that Eclipse compiles each .cpp separately(?)
Does that mean that class.cpp will know nothing about global stuff in main.cpp or a class higher in the include hierarchy?
I stumbled upon this problem while trying to create an alias for a long class name. It is my main class and I wanted to call static functions with a shorter name: Ln::globalFunction() instead of LongClassName::globalFunction()
I have a class LongClassName whose header I include in main.cpp. This is the main class. All other classes are included in it.
LongClassName.h
#define PI 3.14159265
#include <QDebug>
class LongClassName
{
...
public:
...
private:
...
};
typedef LongClassName Ln;
LongClassName.cpp
#include "Class1.h"
#include "Class2.h"
#include "Class3.h"
/*implementations of LongClassName's functions*/
So I assumed that when the code is included in one single "virtual" file by the compiler every class will be inserted after this source code and because of that every class should know that Ln is an alias for LongClassName
This didn't work
So what is the best way to propagate this alias to all classes?
I want to avoid including LongClassname.h in all classes because of reverse dependencies. LongClassName includes all other classes in its implementation. And almost all the other classes use some static functions of LongClassName.
(At the moment I have a separate class Ln but am trying to merge it with LongClassName because it seems more logical.)
The compiler knows how to compile a .cpp file (if it's a cpp compiler) into a .o file called an 'object file', which is your code translated (and probably manipulated, optimized, etc.) into machine code. Actually the compiler creates assembly code, which is translated into machine code by the assembler.
So each cpp file is compiled to a different object file, and knows nothing about variables declared in other cpp files, unless you include declarations you want the object file to know about, either in the cpp file or in an h file it includes.
Although the compilation is done separately for each .cpp, the linker links all object files into a single executable (or a library), so a variable declared in the global namespace is indeed global, and every declaration not explicitly placed in a named namespace is placed in the global namespace.
You will probably benefit from reading about all stages of "compiling", for example here: http://www.network-theory.co.uk/docs/gccintro/gccintro_83.html
In the end the compiler sees a huge "virtual" file with the complete code of all .cpp and .h files.
This is wrong. In .cpps you should include just the .hs (or .hpps if you like), almost never the .cpps; the .hs in general contain just the declarations of the classes and of the methods, and not their actual bodies1 (i.e. their definitions). So when you compile each .cpp, the compiler still knows nothing about the definitions of the functions defined in other .cpps; it just knows their declarations, with which it can perform syntactical checks, generate code for function calls, ... but it will still generate an "incomplete" object file (.o), containing several "placeholders" ("here goes the address of this function defined somewhere else", "here goes the address of this extern variable", and so on).
After all the object files have been generated, it's the linker that has to take care of these placeholders, by plumbing all the object files together and linking their references to the actual code (which now can be found, since we have all the object files).
For some more info about the classical compile+link model, see here.
Does that mean that class.cpp will know nothing about global stuff in main.cpp or a class higher in the include hierarchy?
Yes, it's exactly like that.
But why doesn't the Makefile created by Eclipse simply compile main.cpp? Why isn't this enough? main.cpp contains all the dependencies. Why compile every .cpp separately?
main.cpp doesn't contain all the code, but just the declarations. You don't include all the code in the same .cpp (e.g. by including the other .cpps) mainly to decrease compilation time.
I want to avoid including LongClassname.h in all classes because of reverse dependencies. LongClassName includes all other classes in its implementation. And almost all the other classes use some static functions of LongClassName.
If you use header guards, you shouldn't have problems.
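A sketch of how that plays out: the alias lives in LongClassName.h, which includes none of the other class headers (only LongClassName.cpp does), so every class can include it without creating a cycle. Class1's method below is invented for illustration:

LongClassName.h:
#ifndef LONGCLASSNAME_H
#define LONGCLASSNAME_H

class LongClassName
{
public:
    static void globalFunction();
};

typedef LongClassName Ln;

#endif

Class1.cpp:
#include "Class1.h"
#include "LongClassName.h"   // the guard makes repeated inclusion harmless

void Class1::doWork()
{
    Ln::globalFunction();    // the alias is visible wherever the header is included
}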
1. Ok, they also contain inline and template functions, but they are the exception, not the rule.

Can't compile C++ in Ubuntu using GCC -- Include/Library Problems (collect2: ld returned 1 exit status)

I guess I'm not linking something right?
I want to call ABC.cpp which needs XYZ.h and XYZ.cpp. All are in my current directory, and I've tried #include <XYZ.h> as well as #include "XYZ.h".
Running $ g++ -I. -l. ABC.cpp at the Ubuntu 10 Terminal gives me:
/tmp/ccCneYzI.o: In function `ABC(double, double, unsigned long)':
ABC.cpp:(.text+0x93): undefined reference to `GetOneGaussianByBoxMuller()'
collect2: ld returned 1 exit status
Here's a summary of ABC.cpp:
#include "XYZ.h"
#include <iostream>
#include <cmath>
using namespace std;
double ABC(double X, double Y, unsigned long Z)
{
...stuff...
}
int main()
{
...cin, ABC(cin), return, cout...
}
Here's XYZ.h:
#ifndef XYZ_H
#define XYZ_H
double GetOneGaussianByBoxMuller();
#endif
Here's XYZ.cpp:
#include "XYZ.h"
#include <cstdlib>
#include <cmath>
// basic math functions are in std namespace but not in Visual C++ 6
//(comment's in code but I'm using GNU, not Visual C++)
#if !defined(_MSC_VER)
using namespace std;
#endif
double GetOneGaussianByBoxMuller()
{
...stuff...
}
I'm using GNU Compiler version g++ (Ubuntu 4.4.3-4ubuntu5) 4.4.3.
This is my first post; I hope I included everything that someone would need to know to help me. I have actually read the "Related Questions" and the Gough article listed in one of the responses, as well as searched around for the error message. However, I still can't figure out how it applies to my problem.
Thanks in advance!
When you run g++ -I. -l. ABC.cpp you are asking the compiler to create an executable out of ABC.cpp. But the code in this file relies on a function defined in XYZ.cpp, so the executable cannot be created due to that missing function.
You have two options (depending on what it is that you want to do). Either you give the compiler all of the source files at once so that it has all the definitions, e.g.
g++ -I. ABC.cpp XYZ.cpp
(the -l. from your command line can be dropped: -l expects a library name, not a directory), or you use the -c option to compile ABC.cpp to object code (.obj on Windows, .o on Linux), which can be linked later, e.g.
g++ -I. -c ABC.cpp
which will produce ABC.o; that can later be linked with XYZ.o to produce an executable.
Edit: What is the difference between #including and linking?
Understanding this fully requires understanding exactly what happens when you compile a C++ program, which unfortunately even many people who consider themselves to be C++ programmers do not. At a high level, the compilation of a C++ program goes through three stages: preprocessing, compilation, and linking.
Preprocessing
Every line that starts with # is a preprocessor directive which is evaluated at the preprocessing stage. The #include directive is literally a copy-and-paste. If you write #include "XYZ.h", the preprocessor replaces that line with the entire contents of XYZ.h (including recursive evaluations of #include within XYZ.h).
The purpose of including is to make declarations visible. In order to use the function GetOneGaussianByBoxMuller, the compiler needs to know that GetOneGaussianByBoxMuller is a function, and to know what (if any) arguments it takes and what value it returns, the compiler will need to see a declaration for it. Declarations go in header files, and header files are included to make declarations visible to the compiler before the point of use.
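You can see the result of this stage for yourself by stopping the compiler after preprocessing:

g++ -E ABC.cpp > ABC.preprocessed.cpp

The output is ABC.cpp with every #include (recursively) pasted in place; that is exactly the text the compiler proper will consume.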
Compiling
This is the part where the compiler runs and turns your source code into machine code. Note that machine code is not the same thing as executable code. An executable requires additional information about how to load the machine code and the data into memory, and how to bring in external dynamic libraries if necessary. That's not done here. This is just the part where your code goes from C++ to raw machine instructions.
Unlike Java, Python, and some other languages, C++ has no concept of a "module". Instead, C++ works in terms of translation units. In nearly all cases, a translation unit corresponds to a single (non-header) source code file, e.g. ABC.cpp or XYZ.cpp. Each translation unit is compiled independently (whether you run separate -c commands for them, or you give them to the compiler all at once).
When a source file is compiled, the preprocessor runs first, doing the #include copy-pasting as well as macro expansion and the other things the preprocessor does. The result is one long stream of C++ code consisting of the contents of the source file and everything included by it (and everything included by what that included, etc...). This long stream of code is the translation unit.
When the translation unit is compiled, every function and every variable used must be declared. The compiler will not allow you to call a function for which there is no declaration or to use a global variable for which there is no declaration, because then it wouldn't know the types, parameters, return values, etc, involved and could not generate sensible code. That's why you need headers -- keep in mind that at this point the compiler is not even remotely aware of the existence of any other source files; it is only considering this stream of code produced by the processing of the #include directives.
In the machine code produced by the compiler, there are no such things as variable names or function names. Everything must become a memory address. Every global variable must be translated to a memory address where it is stored, and every function must have a memory address that the flow of execution jumps to when it is called. For things that are defined (i.e. for functions, implemented) in the translation unit, the compiler can assign an address. For things that are only declared (usually as a result of included headers) and not defined, the compiler does not at this point know what the memory address should be. These functions and global variables for which the compiler has only a declaration but not a definition/implementation, are called external symbols, and they are presumed to exist in a different translation unit. For now, their memory addresses are represented with placeholders.
For example, when compiling the translation unit corresponding to ABC.cpp, the compiler has a definition (implementation) of ABC, so it can assign an address to the function ABC, and wherever ABC is called in that translation unit it can create a jump instruction to that address. On the other hand, although its declaration is visible, GetOneGaussianByBoxMuller is not implemented in that translation unit, so its address must be represented with a placeholder.
The result of compiling a translation unit is an object file (with the .o suffix on Linux).
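You can inspect these placeholders with the nm tool (output abbreviated; -C demangles the C++ names):

g++ -c ABC.cpp
nm -C ABC.o
# 0000000000000000 T ABC(double, double, unsigned long)   <- defined in this object
#                  U GetOneGaussianByBoxMuller()          <- undefined: awaiting the linker

The U entry is precisely the external symbol the linker must resolve.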
Linking
One of the main jobs of the linker is to resolve external symbols. That is, the linker looks through a set of object files, sees what their external symbols are, and then tries to find out what memory address should be assigned to them, replacing the placeholder.
In your case the function GetOneGaussianByBoxMuller is defined in the translation unit corresponding to XYZ.cpp, so inside XYZ.o it has been assigned a specific memory address. In the translation unit corresponding to ABC.cpp, it was only declared, so inside ABC.o, it is only a placeholder (external symbol). The linker, if given both ABC.o and XYZ.o will see that ABC.o needs an address filled in for GetOneGaussianByBoxMuller, find that address in XYZ.o, and replace the placeholder in ABC.o with it. Addresses for external symbols can also be found in libraries.
If the linker fails to find an address for GetOneGaussianByBoxMuller (as it does in your example where it is only working on ABC.o, as a result of not having passed XYZ.cpp to the compiler), it will report an unresolved external symbol error, also described as an undefined reference.
Finally, once the linker has resolved all external symbols, it combines all of the now-placeholder-free object code, adds in all the loading information that the operating system needs, and produces an executable. Tada!
Note that through all of this, the names of the files don't matter one bit. It's a convention that XYZ.h should contain declarations for things that are defined in XYZ.cpp, and it's good for maintainable code to organize things that way, but the compiler and linker don't care one bit whether that's true or not. The linker will look through all the object files it's given and only the object files it's given to try to resolve a symbol. It neither knows nor cares which header the declaration of the symbol was in, and it will not try to automatically pull in other object files or compile other source files in order to resolve a missing symbol.
... wow, that was long.
Try
g++ ABC.cpp XYZ.cpp
If you want to compile them separately, you need to build object files:
g++ -c ABC.cpp
g++ -c XYZ.cpp
g++ ABC.o XYZ.o
Wish I had read these when I was having these problems:
http://c.learncodethehardway.org/book/learn-c-the-hard-waych3.html
http://www.thegeekstuff.com/2010/08/make-utility/