Disadvantages of condensing .cpp files? - c++

When compiling a 'static library' project in MSVC++, I often get .lib files that are several MB in size. If I use conditional macros and include directives to "condense" all my .cpp files in one .cpp file at compile time, the .lib file size decreases considerably.
Are there any disadvantages with this practice?

The main problem of Unity Builds as they are called is that they break the way C++ works.
In C++, a source file, with its includes preprocessed, is called a Translation Unit. Some symbols are "private" to this translation unit:
symbols declared static at namespace level
anything declared in anonymous namespace
If you merge several C++ files, then the compiler will share those private symbols among all the files that are merged together since from its point of view this has become a single Translation Unit.
You will get an error if two local classes suddenly have the same name, and idem for constants. Annoying as hell, but at least you are notified.
For functions, however, it may break silently because of overloading. Where before the compiler would pick static void launch(short u); for your call to launch(1), it may suddenly shift to static void launch(int i, Target t = "Irak"); instead. Oops?
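A minimal sketch of that silent shift, using the invented launch/Target names from above (the exact outcome depends on declaration order, since overload resolution only considers what has been declared before the call):

// Contents that originally lived in b.cpp:
struct Target { const char* name; };
static void launch(int /*i*/, Target = Target{"Irak"}) { /* b.cpp's private helper */ }

// Contents that originally lived in a.cpp:
static void launch(short /*u*/) { /* the overload a.cpp intended to call */ }

void fire()
{
    launch(1);  // 1 is an int. Compiled alone, a.cpp converted it to short and
                // called launch(short). In the merged TU, launch(int, Target) is
                // visible and is an exact match, so it wins -- silently.
}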
Unity Builds are dangerous. What you are looking for is called WPO (Whole Program Optimization) or LTO (Link Time Optimization), look into the innards of your compiler manual to know how to activate it.

A disadvantage would be that if you change a single line in the .cpp, you have to recompile all of the code.

Your file might get more complex, and you'll have to recompile everything even if you just change one single source file. Other than that, there's no real disadvantage, unless the merged files define local functions or variables with clashing names, which would trip you up with multiple definitions within the single translation unit.
The size decrease you notice is due to the additional optimizations that become available this way (e.g. more code reuse). Depending on your code, you might get similar results by enabling all size optimizations together with link-time optimization, which could be an acceptable middle ground between the two approaches.

It's usually a confusing practice to #include one .cpp in another .cpp (at the very least, leave an explanatory comment about why you did it).

Motivating real world examples of the 'inline' specifier?

Background: The C++ inline keyword does not determine if a function should be inlined.
Instead, inline permits you to provide multiple definitions of a single function or variable, so long as each definition occurs in a different translation unit.
Basically, this allows definitions of global variables and functions in header files.
Are there some examples of why I might want to write a definition in a header file?
I've heard that there might be templating examples where it's impossible to write the definition in a separate cpp file.
I've heard other claims about performance. But is that really true? Since, to my knowledge, the use of the inline keyword doesn't guarantee that the function call is inlined (and vice versa).
I have a sense that this feature is probably primarily used by library writers trying to write wacky and highly optimized implementations. But are there some examples?
It's actually simple: you need inline when you want to write a definition (of a function, or of a variable since C++17) in a header. Otherwise you would violate the ODR as soon as your header is included in more than one TU. That's it. That's all there is to it.
Of note is that some entities are implicitly declared inline (see the sketch after this list), such as:
methods defined inside the body of the class
template functions and variables
constexpr functions and variables
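A minimal sketch of both cases; the file and entity names are invented for illustration:

// utils.h
#pragma once

// Explicitly inline: this definition may now appear in every TU that
// includes the header without violating the ODR.
inline int add(int a, int b) { return a + b; }

// Implicitly inline, no keyword needed:
struct Counter {
    int value = 0;
    void bump() { ++value; }              // member defined inside the class body
    static constexpr int max = 100;       // constexpr static data member (C++17)
};

template <typename T>
T twice(T x) { return x + x; }            // function template

constexpr int square(int x) { return x * x; }  // constexpr function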
Now the question becomes why and when would someone want to write definitions in the header instead of separating declarations in headers and definitions in source code files. There are advantages and disadvantages to this approach. Here are some to consider:
optimization
Having the definition in a source file means that the code of the function is baked into the tu binary. It cannot be inlined at the calling site outside of the tu that defines it. Having it in a header means that the compiler can inline it everywhere it sees fit. Or it can generate different code for the function depending on the context where it is called. The same can be achieved with lto within an executable or library, but for libraries the only option for enabling this optimization is having the definitions in the header.
library distribution
Besides enabling more optimizations in a library, having a header-only library (when it's possible) means an easier way to distribute that library. All the user has to do is download the headers folder and add it to the include path of their project. In the case of a non-header-only library, things become more complicated, because you can't mix and match binaries compiled by different compilers, or even by the same compiler with different flags. So you either have to distribute your library as full source code along with a build tool, or ship the library compiled in many variants (CPU architecture/OS/compiler/compiler-flag combinations).
human preference
Having to write the code only once is considered by some (me included) an advantage, both from a documentation perspective and from a maintenance perspective. Others consider separating declarations from definitions to be better. One argument is that it achieves separation of interface and implementation, but that is just not the case: in a header you need to have private member declarations even if those aren't part of the interface.
compile time performance
Having all the code in headers means duplicating it in every TU. This is a real problem when it comes to compilation time: header-heavy C++ projects are notorious for slow compilation. It also means that a modification of a function definition triggers the recompilation of every TU that includes it, as opposed to just one TU when the definition is in a source file. Precompiled headers try to solve this problem, but the solutions are not portable and have problems of their own.
If the same function definition appears in multiple compilation units then it needs to be inline, otherwise you get a linker error.
You need this e.g. for function templates if you want to make them available via a header, because then their definition also has to be in the header (function templates don't need an explicit inline, though).
The below statement might be a bit oversimplified because compilers and linkers are really complex nowadays, but to get a basic idea it is still valid.
A cpp file and the headers included by that cpp file form a compilation unit and each compilation unit is compiled individually. Within that compilation unit, the compiler can do many optimizations like potentially inlining any function call (no matter if it is a member or a free function) as long as the code still behaves according to the specification.
So if you place the function definition in the header you allow the compiler to know the code of that function and potentially do more optimizations.
If the definition is in another compilation unit the compiler can't do much and optimizations then can only be done at linking time. Link time optimizations are also possible and are indeed also done. And while link-time optimizations became better they potentially can't do as much as the compiler can do.
Header-only libraries have the big advantage that you do not need to provide project files with them; whoever wants to use the library just copies the headers into their project and includes them.
In short:
You're writing a library and you want it to be header-only, to make its use more convenient.
Even if it's not a library, in some cases you may want to keep some of the definitions in a header to make it easier to maintain (whether or not this makes things easier is subjective).
to my knowledge, the use of the inline keyword doesn't guarantee that the function call is inlined
Yes, defining it in a header (as inline) doesn't guarantee inlining. But if you don't define it in a header, it will never be inlined (unless you're using link-time optimizations). So:
You want the compiler to be able to inline the functions, if it decides to.
Also, it may give the compiler more knowledge about a function (sketched below):
maybe it never throws, but is not marked noexcept;
maybe several consecutive calls can be merged into one (there's no side effects, etc), but __attribute__((const)) is missing;
maybe it never returns, but [[noreturn]] is missing;
...
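For instance, in a hypothetical header-defined helper like the one below, none of those attributes are written, yet the compiler can deduce the facts because it can see the bodies:

// helpers.h
#pragma once
#include <cstdlib>

inline int clamp01(int x) { return x < 0 ? 0 : (x > 1 ? 1 : x); }  // visibly cannot throw
inline int triple(int x)  { return 3 * x; }                         // visibly has no side effects
inline void fail_fast()   { std::abort(); }                         // visibly never returns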
there might be templating examples where it's impossible to write the definition in a separate cpp file.
That's true for most templates. They automatically behave as if they were inline, so you don't need to specify it explicitly. See Why can templates only be implemented in the header file? for details.

Are functions in C/C++ headers a no-go?

I'm working on a very tiny piece of C/C++ source code. The program reads input values from stdin, processes them with an algorithm and writes the results to stdout.
I would just implement all that in a single file, but I also want test cases for the algorithm (not the input/output reading), so I have the following files in my project:
main.cpp
sort.hpp
sort_test.cpp
I implement the algorithm in sort.hpp right away, no sort.cpp. It's rather short and doesn't have any dependencies.
Would you say that, in some cases, functions defined in headers are okay, even if they are sophisticated algorithms and not just simple accessors/mutators? Or is there a reason I should avoid this? When should I move code from header to source file?
There is nothing wrong with having functions in header files, as long as you understand the tradeoff. Putting them in a header file means they'll have to be compiled (and recompiled) in any translation unit that includes the header. (and they have to be declared inline, or you will get linker errors.)
In projects with many translation units, that may add up to a noticeable slowdown in compile times, if you do it a lot.
On the other hand, it ensures that the function definition is visible everywhere the function is called -- and that means that it can be trivially inlined, so the resulting program may run faster.
And finally, with function templates, you typically have no realistic alternative. The definition must be visible at the call site, and the only practical way to achieve that is to put it in a header.
A final consideration is that header-only libraries are easier to deploy and use. You don't need to link against anything, you don't have to worry about ABI's or anything else. You just add the headers to your project, include them and off you go.
Quite a few popular libraries use a header-only strategy.
When you put functions in headers you have to make sure to declare them inline. This is required to avoid duplicate-definition errors when more than one .cpp file includes that header file. Generally you should only put small functions inside header files, because they will be compiled for each .cpp file that includes the header, which slows down compilation and can also result in code bloat: a larger executable file.
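A sketch of the failure mode and the fix, with invented file and function names:

// util.h, included from both a.cpp and b.cpp
#pragma once

// Without 'inline', each TU emits its own external definition of area(),
// and the link step fails with a "multiple definition of `area(int, int)`"
// style error. Marking it inline (or static, at the cost of one copy per TU)
// makes the header safe to include from several .cpp files.
inline int area(int w, int h) { return w * h; }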
It's OK to put any function in the header as long as it's inline. Things such as functions defined inside class { } and templates are implicitly inline.
If the resulting application becomes too large, then optimize the code size. Optimizing before there is a problem is an anti-pattern, especially when there is a benefit to doing it "your way," and the fix is as simple as moving from one file to another and erasing inline.
Of course, if you want to distribute the code as a library, then deciding between a header, static library, or dynamic library binary is an important decision affecting the users.
The vast majority of the boost libraries are header-only, so I'd say: Yes, this is an established and accepted practice. Just don't forget to inline.
That really is a style choice. But putting the definition in the header does mean it needs to be marked inline, and the compiler is then free to inline the calls rather than emit an ordinary out-of-line call:
inline int max(int a, int b)
{
    return (a > b) ? a : b;
}
http://en.wikipedia.org/wiki/Inline_function
The reason you should avoid this in general (for non-inline functions) is that multiple source files will be including your header, creating linker errors.
It doesn't matter if you have a #pragma once or a similar trick - the duplication will show up if you have more than one compilation unit (e.g. .cpp files) including the same header.
If you wish to inline the function, it MUST be in the header else it can't get inlined.
If you publish a header with your libraries and the header has some sort of implementation in it, you can be sure that after a few years, if you change the implementation and it doesn't work exactly the same way as it did before, some people's code will break, since they will have come to rely on the implementation they saw in the header. Yeah, I know one should not do it, but many people do look in the header for the implementation and other behaviour they can exploit/use in an unintended way to overcome some problem they are having.
If you are planning to use templates then you have no choice but to put it all in the header. (This might not be necessary if your compiler supports export templates, but there is only one I know of.)
It's OK to have the implementation in the header; it depends on what you need. If you separate the definition into a different file, the compiler will create a symbol with external linkage; if you don't want that, you can define the function inside the header itself. But then you will spend some extra memory in the code segment: if you include the header in two different files, both files' code segments will contain the function's definition.
And if the header ends up included in more than one compilation unit, the duplicate definitions become a problem; then you have to use inline.

Why is including a header file such an evil thing?

I have seen many explanations on when to use forward declarations over including header files, but few of them go into why it is important to do so. Some of the reasons I have seen include the following:
compilation speed
reducing complexity of header file management
removing cyclic dependencies
Coming from a .NET background I find header management frustrating. I have this feeling I need to master forward declarations, but I have been scraping by on includes so far.
Why cannot the compiler work for me and figure out my dependencies using one mechanism (includes)?
How do forward declarations speed up compilations since at some point the object referenced will need to be compiled?
I can buy the argument for reduced complexity, but what would a practical example of this be?
"to master forward declarations" is not a requirement, it's a useful guideline where possible.
When a header is included, and it pulls in more headers, and yet more, the compiler has to do a lot of work processing a single translation module.
You can see how much, for example, with gcc -E:
A single #include <iostream> gives my g++ 4.5.2 additional 18,560 lines of code to process.
A #include <boost/asio.hpp> adds another 74,906 lines.
A #include <boost/spirit/include/qi.hpp> adds 154,024 lines, that's over 5 MB of code.
This adds up, especially if carelessly included in some file that's included in every file of your project.
Sometimes going over old code and pruning unnecessary includes improves the compilation dramatically just because of that. Replacing includes with forward declarations in the translation modules where only references or pointers to some class are used, improves this even further.
Why cannot the compiler work for me and figure out my dependencies using one mechanism (includes)?
It cannot because, unlike some other languages, C++ has an ambiguous grammar:
int f(X);
Is it a function declaration or a variable definition? To answer this question the compiler must know what does X mean, so X must be declared before that line.
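A small illustration of the two readings (the names are invented; the second reading is shown in comments because the two cannot coexist in one scope):

struct X {};
int f(X);        // X names a type: this declares a function taking an X

// But if the earlier declaration had been:
//     int X = 3;
// then the very same line
//     int f(X);
// would define an int variable f, initialized with the value of X.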
Because when you're doing something like this:
bar.h:
class Foo;   // a forward declaration of Foo is enough here

class Bar {
    int foo(Foo &);
};
Then the compiler does not need to know how the Foo struct/class is defined, so including the header that defines Foo is useless. Moreover, including the header that defines Foo might also mean including the header that defines some other class that Foo uses; and this might mean including the header that defines yet another class, etc. - turtles all the way down.
In the end, the file that the compiler is working on is almost like the result of copy-pasting all the headers; so it gets big for no good reason, and when someone makes a typo in a header file that you don't actually need, compiling your class starts to take way too much time (or fails for no obvious reason).
So it's a good thing to give as little info as needed to the compiler.
How do forward declarations speed up compilations since at some point the object referenced will need to be compiled?
1) reduced disk I/O (fewer files to open, fewer times)
2) reduced memory/CPU usage
Most translation units need only a name; if you actually use or allocate the object, you'll need its full definition.
This is probably where it will click for you: each file you compile compiles everything that is visible in its translation unit.
A poorly maintained system will end up including a ton of stuff it does not need - and then that gets compiled for every file that sees it. By using forward declarations where possible, you can bypass that and significantly reduce the number of times a public interface (and all of its included dependencies) must be compiled.
That is to say: the content of the header won't be compiled once. It will be compiled over and over. Everything in that translation unit must be parsed, checked to be a valid program, checked for warnings, optimized, etc. Many, many times.
Including things carelessly adds significant disk/CPU/memory overhead, which turns into intolerable build times for you, while introducing significant dependencies (in non-trivial projects).
I can buy the argument for reduced complexity, but what would a practical example of this be?
Unnecessary includes introduce dependencies as side effects. When you edit an include (necessary or not), every file which includes it must be recompiled (not trivial when hundreds of thousands of files must be unnecessarily opened and compiled).
Lakos wrote a good book which covers this in detail:
http://www.amazon.com/Large-Scale-Software-Design-John-Lakos/dp/0201633620/ref=sr_1_1?ie=UTF8&s=books&qid=1304529571&sr=8-1
Header file inclusion rules specified in this article will help reduce the effort in managing header files.
I used forward declarations simply to reduce the amount of navigation between source files. E.g., if module X calls some glue or interface function F in module Y, then using a forward declaration means writing the function and the call can be done by visiting only two places, X.c and Y.c. Not so much of an issue when a good IDE helps you navigate, but I tend to prefer coding bottom-up, creating working code and then figuring out how to wrap it, rather than going through top-down interface specification. As the interfaces themselves evolve, it's handy to not have to write them out in full.
In C (or C++ minus classes) it's possible to truly keep structure details private by only defining them in the source files that use them, and only exposing forward declarations to the outside world - a level of black-boxing that requires performance-destroying virtuals in the C++/classes way of doing things. It's also possible to avoid needing to prototype things (visiting the header) by ordering definitions 'bottom-up' within the source files (good old static keyword).
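A sketch of that C-style black-boxing with invented names - callers see only a forward declaration, and the structure's layout stays private to the one source file that defines it:

/* widget.h */
struct Widget;                              /* forward declaration only */
struct Widget* widget_create(void);
void widget_set_size(struct Widget*, int);
void widget_destroy(struct Widget*);

/* widget.c (or .cpp): the only place that knows the layout */
struct Widget { int width; int height; };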
The pain of managing headers can sometimes expose how modular your program is or isn't - if it's truly modular, the number of headers you have to visit and the amount of code & data structures declared within them should be minimized.
Working on a big project with 'everything included everywhere' through precompiled headers won't encourage this real modularity.
Module dependencies can correlate with data flow relating to performance issues, i.e. both i-cache & d-cache issues. If a program involves many modules that call each other & modify data at many random places, it's likely to have poor cache coherency - the process of optimizing such a program will often involve breaking up passes and adding intermediate data, often playing havoc with many 'class diagrams'/'frameworks' (or at least requiring the creation of many intermediate data structures). Heavy template use often means complex pointer-chasing, cache-destroying data structures. In its optimized state, dependencies & pointer chasing will be reduced.
I believe forward declarations speed up compilation because the header file is included only where it is actually used. This reduces how often the file has to be opened and closed. You are correct that at some point the object referenced will need to be compiled, but if I am only using a pointer to that object in my other .h file, why actually include it? If I tell the compiler I am using a pointer to a class, that's all it needs (as long as I am not calling any methods on that class).
This is not the end of it. Those .h files include other .h files... So, for a large project, opening, reading, and closing all the .h files which are included repeatedly can become a significant overhead. Even with #if include guards, you still have to open and close them a lot.
We practice this at my source of employment. My boss explained this in a similar way, but I'm sure his explanation was more clear.
How do forward declarations speed up compilations since at some point the object referenced will need to be compiled?
Because include is a preprocessor thing, which means it is done via brute force when parsing the file. Your object will be compiled once (compiler) then linked (linker) as appropriate later.
In C/C++, when you compile, you've got to remember there is a whole chain of tools involved (preprocessor, compiler, linker plus build management tools like make or Visual Studio, etc...)
Good and evil. The battle continues, but now on the battlefield of header files. Header files are a necessity and a feature of the language, but they can create a lot of unnecessary overhead if used in a non-optimal way, e.g. by not using forward declarations.
How do forward declarations speed up compilations since at some point the object referenced will need to be compiled?
I can buy the argument for reduced complexity, but what would a practical example of this be?
Forward declarations are badass. My experience is that a lot of C++ programmers are not aware of the fact that you don't have to include a header file unless you actually want to use some type, i.e. you need to have the type defined so the compiler understands what you want to do. It's important to try and refrain from including header files in other header files.
Just passing around a pointer from one function to another, only requires a forward declaration:
// someFile.h
class CSomeClass;
void SomeFunctionUsingSomeClass(CSomeClass* foo);
Including someFile.h does not require you to include the header file of CSomeClass, since you are merely passing a pointer to it, not using the class. This means that the compiler only needs to parse one line (class CSomeClass;) instead of an entire header file (that might be chained to other header files etc etc).
This reduces compile times (and, through fewer dependencies, rebuild times), and we are talking big savings here if you have many headers and many classes.
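The flip side, as a hypothetical implementation file: only the .cpp that actually uses the class needs the full definition, so it is the one place that includes the real header (DoSomething is an invented member, just for illustration).

// someFile.cpp
#include "someFile.h"
#include "CSomeClass.h"     // full definition needed only here

void SomeFunctionUsingSomeClass(CSomeClass* foo)
{
    foo->DoSomething();     // calling a member requires the complete type
}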

Alternatives for gmake?

I have a c++ program file with two functions in it. If I change the first function alone, why should both of them have to be recompiled?
Is there any build system which recompiles the first one alone and put it back in the same object file?
Is this possible? The instructions of one function shouldn't depend on other right?
Since gmake recompiles the whole file, it takes a lot of time; can't this be avoided? Putting the second function in a separate file is not a good idea, as it involves the creation of extra files which shouldn't be necessary.
If the second function is quite long or requires more time to compile, place it in a separate file. That is why people separate source files. From what I know, the compiler has to recompile the whole file, as a small change in the source can result in a major change in the output file; the object file is produced as a whole, not function by function.
I doubt that compiling only part of a source file is ever possible, using any programming language. Compilations are done on a per-file basis.
The analysis to decide which semantic parts of a given source file have changed and thus need recompiling would likely outweigh the cost of the compilation itself in most cases.
Build systems get big wins by analyzing the dependencies between source files because the cost of file I/O (particularly for include files) is a large part of the overall compilation cost. Once you've decided to recompile a given source file, you would likely only achieve a tiny speedup by ignoring unchanged parts of the file, even if there were zero cost to computing which parts those were.
All build systems for C++ that I know of work on the translation unit (file) level, not on the function level. Although in theory it should be possible, it is complicated when you consider the preprocessor, e.g.:
#define ANSWER 42

void foo()
{
#undef ANSWER
#define ANSWER 41
}

int bar()
{
    return ANSWER;
}
Although this is terrible code, any standard-compliant compiler/build system should support it. And as you can see, changing foo (redefining ANSWER) can affect bar.
Putting the second function in a separate file is a good idea, and is necessary if you want to avoid this "problem". If your functions are so large that the time spent recompiling one file is noticeable, then the file is probably too big and should be broken up anyway.
The issue isn't gmake, it's the compiler. If you change one function, you may have no choice but to recompile others. For instance:
if function a calls function b, and you change function b, you need to ensure that a still calls b correctly, in case b's signature changed.
if function b sits between a and c in memory, and b grows so that it no longer fits, either a or c may have to move, which also involves regenerating code with correct offsets.
If b is no longer in the same place, you need to recompile its caller, a, so that it points to the right function.
There are probably more and better cases where this is necessary.

Does #include affect program size?

When my cpp file uses #include to add some header, does my final program's size get bigger? Headers aren't considered compilation units, but the content of the header file is added to the actual source file by the preprocessor, so will the size of the output file (either exe or dll) be affected by this?
Edit: I forgot to mention that the question is not about templates/inline functions. I meant what will happen if I place an #include to a header that doesn't have any implementation detail of functions. Thanks.
It depends on the contents, and how your compiler is implemented. It is quite possible that if you don't use anything in the header, your compiler will be smart enough to not add any of it to your executable.
However, I wouldn't count on that. I know that back in the VC++ 6 days we discovered that merely #including Windows.h added 64K to the executable for each source file that did it.
You clarified that:
[The header has no] templates/inline functions... doesn't have any implementation detail of functions.
Generally speaking, no, adding a header file won't affect program size.
You could test this. Take a program that already builds, and check the executable size. Then go into each .cpp file and include a standard C or C++ header file that isn't actually needed in that file. Build the program and check the executable size again - it should be the same size as before.
By and large, the only things that affect executable size are those that cause the compiler to either generate different amounts of code, global/static variable initializations, or DLLs/shared library usages. And even then, if any such items aren't needed for the program to operate, most modern linkers will toss those things out.
So including header files that only contain things like function prototypes, class/struct definitions without inlines, and definitions of enums shouldn't change anything.
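For example, a purely declarative header like this hypothetical one should leave the executable size untouched in every file that includes it:

// shapes.h
#pragma once

enum class Color { Red, Green, Blue };

struct Point { int x; int y; };               // type definition, no code generated

int distance_squared(Point a, Point b);       // prototype only; the body lives in a .cpp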
However, there are certainly exceptions. Here are a few.
One is if you have an unsophisticated linker. Then, if you add a header file that generates things the program doesn't actually need, and the linker doesn't toss them out, the executable size will bloat. (Some people deliberately build linkers this way because the link time can become insanely fast.)
Many times, adding a header file that adds or changes a preprocessor symbol definition will change what the compiler generates. For instance, assert.h (or cassert) defines the assert() macro. If you include a header file in a .c/.cpp file that changes the definition of the NDEBUG preprocessor symbol, it will change whether assert() usages generate any code, and thus change the executable size.
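A minimal sketch of that effect (checked_div is an invented function):

// Whether this TU defines NDEBUG before <cassert> is seen decides whether
// assert() expands to a runtime check or to ((void)0), i.e. to no code at all.
#include <cassert>

int checked_div(int a, int b)
{
    assert(b != 0);   // code with NDEBUG undefined, nothing with NDEBUG defined
    return a / b;
}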
Also, adding a header file that changes compiler options will change the executable size. For instance, many compilers let you change the default "packing" of structs via something like a #pragma pack line. So if you add a header file that changes structure packing in a .c/.cpp file, the compiler will generate different code for dealing with structs, and hence change the executable size.
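A sketch of the packing case; here the pragma is imagined to have arrived via some included header:

#pragma pack(push, 1)   // pretend this came in from an #included header
struct Record {
    char tag;           // offset 0
    int  value;         // offset 1 instead of 4: sizeof(Record) is 5, not 8
};
#pragma pack(pop)
// Code that reads Record::value now has to cope with an unaligned int, so on
// some targets the compiler emits different (larger/slower) access sequences.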
And as someone else pointed out, when you're dealing with Visual C++/Visual Studio, all bets are off. Microsoft has, shall we say, a unique perspective around their development tools which is not shared by people writing compiler systems on other platforms.
With modern compilers, included files only affect the binary's size if they contain static data or if you use normal or inline functions that are defined in them.
Yes, because for example inline functions can be defined in header files, and the code of those inline functions will of course be added to your program's code when you call those functions. Template instantiations will also be added to the code of your program.
And you can also define global variables in header files (although I wouldn't recommend it). If you do, you need to make sure each variable is still only defined in one compilation unit - for example by declaring it extern in the header and defining it in a single .cpp, or (since C++17) by marking it inline - otherwise the linker will complain about multiple definitions. In any case, this would increase the program's size.
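A sketch of the two safe patterns, with invented names:

// config.h
#pragma once

extern int g_verbosity;          // declaration only; exactly one .cpp defines it
inline int g_retry_count = 3;    // C++17 inline variable: definable right here

// config.cpp
// int g_verbosity = 0;          // the single out-of-line definition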
Remember, #define macros can bloat the code at compile time into a huge, yet functional, compiled file.
For headers that are entirely declarative (as they generally should be), no. However, the preprocessor and compiler do have to take time to parse the headers, so headers can increase compile time, especially large and deeply nested ones - which is why some compilers use 'precompiled headers'.
Even inline and template code does not increase code size directly. If the template is never instantiated or an inline function never called, the code will not be generated. However, if it is called/instantiated, code size can grow rapidly. If the compiler actually inlines the code (a compiler is not obliged to do so, and most won't unless forced to), the code duplication may be significant. Even if it is not truly inlined by the compiler, it is still instantiated statically in every object module that references it - it takes a smart linker to remove the duplicates, and that is not a given; if separate object files were compiled with different options, inline code from the same source may not generate identical code in each object file, so the copies would not even be exact duplicates that the linker could fold together. In the case of templates, a separate instantiation will be created for each type it is invoked for.
It's good practice to limit the #includes in a file to those that are necessary. Besides affecting the executable size, having extra #includes will cause a larger list of compile-time dependencies which will increase your build-time if you change a commonly #included header file.