Compile Time & Memory Usage of a large C++ Project?

Suppose one has about 50,000 different .cpp files.
Each .cpp file contains just one class with about ~1000 lines of code in it (the code itself is not complicated -- it involves in-memory operations on matrices & vectors -- i.e., no special libraries are used).
I need to build a project (in a Linux environment) that will have to import & use all of these 50,000 different .cpp files.
A couple of questions come to mind:
How long will it roughly take to compile this? What will be the approx. size of the compiled file?
What would be a better approach -- keep 50,000 different .so files (compiled extensions) and have the main program import them one by one, or alternatively, unite these 50,000 different .cpp files into one large .cpp file and just deal with that? Which method will be faster / more efficient?
Any insights are greatly appreciated.

There is no answer, just advice.
Right back at you: What are you really trying to do? Are you trying to make a code library from different source files? Or is that an executable? Did you actually code that many .cpp files?
50,000 source files is, well... a massively sized project. Are you trying to do something common across all files (e.g. every source file represents a resource, record, image, or something unique)? Or is it just 50K disparate code files?
Most of your compile time will not be based on the size of each source file. It will be based on the number of header files (and the headers they include) brought in with each .cpp file. Headers, while usually containing only declarations rather than implementations, still have to go through the compile process, and redundant headers across the code base can slow your build time down.
Large projects at that kind of scale use precompiled headers. You can include all the commonly used header files in one header file (common.h) and precompile common.h. Then all the other source files just include "common.h". The compiler can be configured to automatically use the precompiled header when it sees #include "common.h" in each source file.
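A minimal sketch of this with GCC (assuming a header named common.h, as above, that pulls in the frequently used headers; file1.cpp is just a placeholder name):
g++ -x c++-header common.h -o common.h.gch   # precompile common.h once
g++ -c file1.cpp -o file1.o                  # #include "common.h" now picks up common.h.gch automatically
The precompiled header is only used when the compile options match the ones it was built with.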

(i) There are way too many factors involved in determining this; even an approximation is impossible. Compilation can be memory-, CPU- or hard-drive-bound. The complexity of the files matters (from your description, your complexity is low).
(ii) The typical way of doing this is to make a library and let the system figure out linking or loading. You can choose static or dynamic linking.
static linking
Assuming you are using gcc, this would look like this:
g++ -c file1.cpp -o file1.o
g++ -c file2.cpp -o file2.o
...
g++ -c filen.cpp -o filen.o
ar -rc libvector.a file1.o file2.o ... filen.o
Then, when you build your own code, your final link looks like this:
g++ myfile.cpp libvector.a -o mytask
dynamic linking
Again, assuming you are using gcc, this would look like this:
g++ -c file1.cpp -fPIC -o file1.o
g++ -c file2.cpp -fPIC -o file2.o
...
g++ -c filen.cpp -fPIC -o filen.o
g++ -shared file1.o file2.o ... filen.o -o libvector.so
Then, when you build your own code, your final link looks like this:
g++ myfile.cpp libvector.so -o mytask
You will need libvector.so to be in the loader's path for your executable to work.
In any case, as long as the 50,000 files don't change, you will only need to do the last command (which will be much faster).

You can reduce rebuilds by giving each '.h' file lots (and I MEAN LOTS) of forward declarations -- then when you change a .h file, the rest of the program does not need to be recompiled. Usually a function/method only needs the name of a type in its parameters or return value; if it needs other details, then yes, the full header must be included.
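For example (a hypothetical header, solver.h, purely for illustration):
// solver.h -- hypothetical example
class Matrix;       // forward declaration: the name alone is enough here
class Solver {
public:
    double largestEigenvalue(const Matrix& m);  // takes Matrix by reference
};
// solver.cpp #includes "matrix.h" because it uses Matrix's members;
// files that only include solver.h never pull matrix.h in at all.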
Please get a book by Scott Meyers -- it will help you a lot.
Oh - when trying to eat a big cake, divide it up. The slices are more manageable.

We can't really say the time it will take to compile, but what you should do is compile each .cpp/.h pair into a .o file:
$ g++ -c -o test.o test.cpp ...
Once you have all of these, you compile the main program as so:
$ g++ -c -o main.o main.cpp
$ g++ -o main main.o test.o blah.o otherThings.o foo.o bar.o baz.o etc...
Your idea of using .so files is pretty much asking "how quickly can I crash the program and possibly the OS?". Shared libraries are meant for large libraries in small numbers, not 50,000 .so files linked to a binary (especially if you load them dynamically... that would be BAD).

Related

How to avoid recompiling all the files every time?

I have about 20 .cpp files and as many .h files. When I compile I do
g++ -std=c++11 main.cpp -o main
In main.cpp I start with some forward declarations and then I include all the .h and .cpp files.
As a result, at every compilation, all the files must be recompiled, which is uncomfortably slow. I think I would need to use .so and/or .dll files so that at future compilations, only the modified code needs to be recompiled. I don't really know how to do that. Could you give me some advice?
You generally want each translation unit to be compiled separately and you might even compile them in parallel (e.g. with make -j, see below).
(I am guessing and hoping that you are on Linux and using GCC as g++; adapt this answer to your compiler and operating system if not.)
If you have src1.cpp src2.cpp src3.cpp (each containing appropriate #include directives, probably with a common header file) you would compile src1.cpp into an object file src1.o using GCC as:
g++ -Wall -Wextra -g src1.cpp -c -o src1.o
The -Wall -Wextra options ask for all warnings and some extra ones, and you really want them (to improve your code until you get no warnings). The -g option asks for DWARF debugging information (to be able to use the gdb debugger later, and also valgrind). The -c option requests only the compilation step, without linking. The -o src1.o specifies the output object file.
Likewise, you'll compile src2.cpp into src2.o
g++ -Wall -Wextra -g src2.cpp -c -o src2.o
and src3.cpp
g++ -Wall -Wextra -g src3.cpp -c -o src3.o
BTW I prefer the shorter .cc suffix to .cpp. For benchmarking purposes, enable optimizations in your compiler, e.g. by adding -O2 -march=native after -g.
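For example, the first command above would then become:
g++ -Wall -Wextra -g -O2 -march=native src1.cpp -c -o src1.o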
Finally you want to link all your three objects files src1.o, src2.o and src3.o into a myprog executable:
g++ -g src1.o src2.o src3.o -o myprog
you might add additional options, e.g. to link external libraries.
Beware that C++14 (and also C++11 and even C++17) has no real modules (in contrast to OCaml or Go). So the preprocessor is used a lot, and you practically need to include a lot of code; for example, #include <vector> pulls more than ten thousand lines of C++ from standard and internal header files on my Linux/Debian desktop. This explains why C++ compilers are slow, so I recommend avoiding too-small C++ files (e.g. a source file of only a hundred C++ lines that includes several header files would in practice still pull tens of thousands of lines from various internal headers). My preference is to define several related functions (and perhaps classes) in each .cc (or .cpp) source file of one to a few thousand lines of C++.
(future C++ standards, perhaps C++20, might add modules to the language; but this could be postponed...)
My recommendation is to learn to use GNU make or ninja. Indeed you need a build automation tool.
You certainly should learn how to invoke your compiler on the command line. Read about invoking GCC if you use g++. The order of arguments to g++ matters a great deal.
You could use tools like cmake or meson, which generate configuration files (for make or ninja). But I recommend simpler things than cmake or meson (e.g. code your Makefile by hand and just use make); a small example is sketched below. You need to understand your build process. In some cases, you might generate some (simple) C++ file(s) during your build (e.g. with GNU bison or Qt moc or your own script or program emitting some C++ code).
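A minimal hand-written Makefile for the three source files above might look like this (just a sketch; recipe lines must start with a tab, and header dependencies are not tracked here):
CXX = g++
CXXFLAGS = -Wall -Wextra -g

myprog: src1.o src2.o src3.o
	$(CXX) src1.o src2.o src3.o -o myprog

%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@

clean:
	rm -f *.o myprog
Running make then rebuilds only the object files whose sources changed, and make -j3 compiles the three translation units in parallel.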
I think I would need to use .so
Not necessarily. .so files are shared objects, used in shared libraries. You could (and probably, at first, you want to) have several object files, as explained above. You might later consider making your own software libraries, but that is worthwhile only for reusable source code.
To link external libraries, you might even want to use pkg-config (for those packages that know about it), which expands to the appropriate build options for g++.
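For example (using sqlite3 here purely as an illustration of a package that ships a pkg-config file):
g++ -Wall -Wextra -g $(pkg-config --cflags sqlite3) -c src1.cpp -o src1.o
g++ -g src1.o src2.o src3.o $(pkg-config --libs sqlite3) -o myprog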
Look also into existing free software projects for inspiration, and study their source code (including their build process). You'll find many of them in Linux distributions, and on github, sourceforge and elsewhere.
... and then I include all .h and .cpp files.
.cpp files (aka translation units) aren't meant to be included. You should use a script or any other kind of build system, that compiles each of the .cpp files separately, and links all the produced .o object files together into the executable program.
This makes it possible to avoid recompiling all the .cpp files when they aren't affected by changes in the header files.

Understanding makefiles

I was looking at this flow diagram to understand how makefiles really operate but I'm still struggling to 100% understand what's going on.
I have a main.cpp file that calls upon some function that is defined in function.h and function.cpp. Then, I'm given the makefile:
main: main.cpp function.o
	g++ main.cpp function.o -o main
mainAssembly: main.cpp
	g++ -S main.cpp
function.o: function.cpp
	g++ -c function.cpp
clean:
	rm -f *.o *.S main
linkerError: main.cpp function.o
	g++ main.cpp function.o -o main
What's going on? From what I understand so far, we are compiling function.cpp, which turns into an object file? Why is this necessary?
I don't know what the mainAssembly part is really doing. I tried reading about the g++ flags but I still have trouble understanding what this does. Is it just compiling main.cpp with the headers? Shouldn't we also convert main.cpp into an object file as well?
I guess main is simply linking everything together into an exe called main? And I'm completely lost on what clean and linkerError are trying to do. Can someone help me understand what is going on?
That flowchart confuses more than it explains as it seems needlessly complicated. Each step is actually quite simple in isolation, and there's no point in jamming them all into one chart.
Remember a Makefile simply establishes a dependency chain, an order of operations which it tries to follow, where the file on the left is dependent on the files on the right.
Here's your first part where function.o is the product of function.cpp:
function.o: function.cpp
	g++ -c function.cpp
If function.cpp changes, then the .o file must be rebuilt. This is perhaps incomplete if function.h exists, as function.cpp might #include it, so the correct definition is probably:
function.o: function.cpp function.h
	g++ -c function.cpp
Now if you're wondering why you'd build a single .cpp into a single .o file, consider programs at a much larger scale. You don't want to recompile every source file every time you change anything, you only want to compile the things that are directly impacted by your changes. Editing function.cpp should only impact function.o, and not main.o as that's unrelated. However, changing function.h might impact main.o because of a reference in main.cpp. It depends on how things are referenced with #include.
This part is a little odd:
mainAssembly: main.cpp
	g++ -S main.cpp
That just dumps out the compiled assembly code for main.cpp. This is an optional step and isn't necessary for building the final executable.
This part ham-fistedly assembles the two parts:
main: main.cpp function.o
	g++ main.cpp function.o -o main
I say that because normally you'd compile all .cpp files to .o files and then link the .o files together, along with libstdc++ and any other shared libraries you're using, with a tool like ld, the linker. The final step in any typical compilation is linking to produce a binary executable or library, though g++ will silently do this for you when directed to, as it is here.
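A more conventional sketch of that last rule (assuming main.cpp includes function.h) would compile main.cpp to main.o first and then link the two object files:
main: main.o function.o
	g++ main.o function.o -o main

main.o: main.cpp function.h
	g++ -c main.cpp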
I think there are much better examples to work from than what you have here. This file is just full of confusion.

how to run c++ file if header, class, and main are not in the same folder?

The Code::Blocks IDE generates the following files:
./main.cpp
./include/class.h
./src/class.cpp (it includes class.h with #include "class.h")
How can I run this set of files, with the three files in three different folders?
First, this program can be run by clicking the IDE's "build and run" button.
This program needs to take some arguments, like ./a.out arg[1] arg[2], so I cannot supply arguments by clicking the "build and run" button, and thus I have to use g++ to compile an executable first.
But g++ is not as smart as the IDE at finding the three files (I tried g++ -I./include main.cpp; it seems to have no problem with the class.h file, but it cannot find class.cpp).
So how can I compile the three files in three different locations?
BTW, how does the class.h file find the class.cpp file in the IDE/g++ (does it scan all the files in the directory to see which contains the definitions of the class's functions?)?
It's a bad idea to #include source files. But this will do it:
g++ -I./include -Isrc main.cpp
Normally one would expect the IDE to have some function to just build the application, especially when there's a function to build-and-run. In addition, there are IDEs that let you supply command-line arguments for the program, so build-and-run will run it with the supplied arguments.
You have to supply the source files and the search path for includes, normally one would write:
g++ -o exec-file-name -I./include main.cpp src/class.cpp
but that may depend a bit on how you include the header file. Another note is that you normally don't compile the header file separately - it's included when you compile the .cpp files that includes it.
If on the other hand you actually want to do what you write (compile the .h file that includes the .cpp file - which is highly unorthodox) you would do:
g++ -c -I./src include/class.h
g++ -c main.cpp
g++ -o exec-file-name main.o class.o
where you need to replace the .o extension if your platform uses another extension. Note that in this case you should probably not include class.h from main.cpp since that could lead to duplicate symbols.

Difference between compiling with object and source files

I have a file main.cpp containing an implementation of int main() and a library foo split up between foo.h and foo.cpp.
What is the difference (if any) between
g++ main.cpp foo.cpp -o main
and
g++ -c foo.cpp -o foo.o && g++ main.cpp foo.o
?
Edit: of course there is a third version:
g++ -c foo.cpp -o foo.o && g++ -c main.cpp -o main.o && g++ main.o foo.o -o main
The total work that the compiler & linker (and other tools used by the compiler) have to do is exactly the same, give or take a few minor things: in the first example the compiler creates and then deletes temporary object files for foo.o and main.o, in the second example foo.o remains on disk, and in the third example both object files remain.
The main difference comes when you have a larger project and you use a Makefile to build the code. The advantage is that, since the Makefile only recompiles things that need to be recompiled, you don't have to wait for the compiler to compile code that doesn't need recompiling. That assumes, of course, that the makefile uses the g++ -c file.cpp -o file.o variant (which is the typical way to do it), and not g++ file.cpp main.cpp ... -o main.
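As a sketch for the files in this question (assuming main.cpp includes foo.h; recipe lines start with a tab):
main: main.o foo.o
	g++ main.o foo.o -o main

main.o: main.cpp foo.h
	g++ -c main.cpp -o main.o

foo.o: foo.cpp foo.h
	g++ -c foo.cpp -o foo.o
Editing only foo.cpp then re-runs just the foo.o rule and the final link.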
Of course, there are other possible scenarios -- for example, in unit testing you may want to build a test around the same object file you use to build the main application. Again, this makes more of a difference when the project is large and has half a dozen or more source files.
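For example (test_foo.cpp here is a hypothetical test driver):
g++ -c foo.cpp -o foo.o
g++ main.cpp foo.o -o main            # the application
g++ test_foo.cpp foo.o -o test_foo    # the tests reuse the very same foo.o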
On a modern machine, compiling doesn't take that long - my compiler project (~5500 lines of C++ code) that links with LLVM takes about 10 seconds to compile the source files, and another 10 seconds to link with all the LLVM files. That's a debug version of the llvm libraries, so it produces a 120+ MB executable.
Once you get to the level of commercial software (or corresponding open source projects), the number of source files and other things involved in a project can be in the hundreds, and the number of lines of source code can often be in the 100k-to-several-million range. Now it starts to matter whether you just recompile foo.cpp or "everything", because compiling everything takes an hour of CPU time. Sure, with multicore machines it is still only a few minutes, but it's not ideal to spend minutes when you could spend a few seconds.
If you type something like this:
g++ -o main main.cpp foo.cpp
You are compiling and linking two cpp files at once and generating an executable file called main (you get it with -o)
If you type this:
g++ main.cpp foo.cpp
You are compiling and linking two cpp files at once, generating an executable file with the default name a.out.
Finally, if you type this:
g++ -c foo.cpp
You will generate an object file called foo.o which can later be linked with g++ -o executable_name file1.o ... fileN.o
The -c option lets you perform the compilation step separately and keep the resulting object files, which you can link later; -o simply names the output file. I have found a link which may provide you helpful information about this. It talks about gcc (the C compiler), but g++ and gcc work similarly after all:
http://www3.ntu.edu.sg/home/ehchua/programming/cpp/gcc_make.html
Be careful with the syntax of the commands you are using. If you work on Linux and have problems with a command, just open a terminal and type man name_of_the_command to read about its syntax, options, return values and other relevant information about commands, system calls, library functions and many other topics.
Hope it helps!

Reason for having to compile files to different extension types

Recently I had to use this command in a makefile I had for an sqlite program I'm working on:
gcc -g -c sqlite3.c -o sqlite3.o
g++ -g -c main.cpp -o main.o
g++ sqlite3.o main.o -o sqliteex
I had to directly compile the sqlite3.c file into my program in order to use the sqlite3.h interface (included in the main.cpp file with #include <SQL/sqlite3.h>). But why did I need to use gcc to do this and create sqlite3.o, then compile both files as .o files into my executable?
Edit: My guess would be that .o files are usable by both gcc and g++; if this is the case, is it good practice to just always compile things to .o files?
But why did I need to use gcc to do this and create sqlite3.o, then compile both files as .o files into my executable?
You did not need to do that. The reason you did do that was to specify that sqlite.c was C code and not C++ code. You could have done this instead:
g++ main.cpp -x c sqlite3.c -o sqliteex
Additionally, it is possible (but not at all certain) that the sqlite code could have compiled as C++, like this:
g++ main.cpp sqlite3.c -o sqliteex
Quote from Wikipedia:
Single Compilation Unit is a technique of computer programming for the C/C++ languages, which reduces compilation time and aids the compiler to perform program optimization even when the compiler itself is lacking support for whole program optimization or precompiled headers.
http://en.wikipedia.org/wiki/Single_Compilation_Unit
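As a sketch, a single compilation unit is just a wrapper file that #includes the other translation units so that everything is compiled in one go (part1.cpp and part2.cpp are hypothetical names):
// unity.cpp -- hypothetical single-compilation-unit wrapper
#include "part1.cpp"
#include "part2.cpp"
and then only this one file is compiled:
g++ -c unity.cpp -o unity.o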
Development is mostly an edit -> compile-until-success cycle. When you have separately compiled files you can recompile only the file that was modified, which makes rebuilds much faster. The last line is not compilation but the linking of the compiled object files into the target executable.
Also, as Mysticial noted, you have a mixture of C and C++.