Benchmark template compilation [duplicate]

I'm working on a C++ project with extensive compile-time computations. Long compilation time is slowing us down. How might I find out the slowest parts of our template meta-programs so I can optimize them? (When we have slow runtime computations, I have many profilers to choose from, e.g. valgrind's callgrind tool. So I tried building a debug GCC and profiling it compiling our code, but I didn't learn much from that.)
I use GCC and Clang, but any suggestions are welcome.
I found profile_templates on Boost's site, but it seems to be thinly documented and to require the jam/bjam build system. If you show how to use it on a non-jam project [1], I will upvote you. https://svn.boost.org/svn/boost/sandbox/tools/profile_templates/ appears to count number-of-instantiations, whereas counting time taken would be ideal.
[1] Our project uses CMake and is small enough that hacking together a Jamfile just for template profiling could be acceptable.

I know this is an old question, but there is a newer answer that I would like to give.
There is a clang-based set of projects that target this particular problem. The first component is an instrumentation of the clang compiler which produces a complete trace of all the template instantiations that occurred during compilation, with timing values and optionally memory usage counts as well. That tool is called Templight, and is accessible here (currently needs to compile against a patched clang source tree):
https://github.com/mikael-s-persson/templight
A second component is a conversion tool that converts the templight traces into other formats: easily parsable text-based formats (yaml, xml, text, etc.), formats that can more easily be visualized (graphviz / graphML), and, more importantly, a callgrind output that can be loaded into KCacheGrind to visualize and inspect the meta-call-graph of template instantiations and their compile-time costs. One example is the template instantiation profile of a piece of code that creates a boost::container::vector and sorts it with std::sort.
Check it out here:
https://github.com/mikael-s-persson/templight-tools
Finally, there is also another related project that creates an interactive shell and debugger to be able to interactively walk up and down the template instantiation graph:
https://github.com/sabel83/metashell

I've been working since 2008 on a library that uses template metaprogramming heavily. There is a real need for better tools or approaches for understanding what consumes the most compile time.
The only technique I know of is a divide and conquer approach, either by separating code into different files, commenting out bodies of template definitions, or by wrapping your template instantiations in #define macros and temporarily redefining those macros to do nothing. Then you can recompile the project with and without various instantiations and narrow down.
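The macro variant of this divide-and-conquer approach could look roughly like the sketch below. The names fib, PROFILE_SKIP_FIB and FIB_VALUE are made up for illustration; fib merely stands in for whatever instantiation you suspect. Building once normally and once with -DPROFILE_SKIP_FIB, and timing both builds, attributes the difference to that one instantiation chain.

```cpp
#include <cstddef>

// A deliberately recursive metafunction standing in for an expensive
// template (hypothetical; substitute the instantiation you suspect).
template <std::size_t N>
struct fib {
    static constexpr unsigned long long value =
        fib<N - 1>::value + fib<N - 2>::value;
};
template <> struct fib<0> { static constexpr unsigned long long value = 0; };
template <> struct fib<1> { static constexpr unsigned long long value = 1; };

// Hypothetical switch: compile with -DPROFILE_SKIP_FIB to stub the
// instantiation out, then compare build times with and without it.
#ifdef PROFILE_SKIP_FIB
#define FIB_VALUE(n) 0ULL
#else
#define FIB_VALUE(n) (fib<(n)>::value)
#endif

// Client code goes through the macro, so it compiles in both modes.
constexpr unsigned long long table_seed = FIB_VALUE(40);
```

The stub changes the program's results, so this is only meant for timing experiments, not for builds you intend to run.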
Incidentally, just separating the same code into more, smaller files may make it compile faster. I'm not just talking about the opportunity for parallel compilation - even serially, I observed it to still be faster. I've observed this effect in gcc both when compiling my library, and when compiling Boost Spirit parsers. My theory is that some of the symbol resolution, overload resolution, SFINAE, or type inference code in gcc has an O(n log n) or even O(n^2) complexity with respect to the number of type definitions or symbols in play in the translation unit.
Ultimately what you need to do is carefully examine your templates and separate what really depends on the type information from what does not, and use type erasure and virtual functions wherever possible on the portion of the code that does not actually require the template types. You need to get stuff out of the headers and into cpp files if that part of the code can be moved. In a perfect world the compiler should be able to figure this out for itself - you shouldn't have to manually move this code to babysit it - but this is the state of the art with the compilers we have today.

The classic book C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond comes with a 20 page appendix on profiling compile-time costs. It has a companion CD with the code that generates the graphs in that appendix.
Another resource is the paper at http://aszt.inf.elte.hu/~gsd/s/cikkek/profiling/profile.pdf; perhaps it is of use to you.
Yet another, but more labor-intensive, way is to count the number of instantiations of each class from your compiler output.

http://www.cs.uoregon.edu/research/tau/about.php might be of interest to you: for templated entities, it shows the breakdown of time spent for each instantiation. Other data includes how many times each function was called, how many profiled functions each function invoked, and what the mean inclusive time per call was.

Related

Why is there no accurate C++ decompiler?

Why is it not possible to create a C++ decompiler that will function as accurately as those made for Java and C#?

There are several reasons:
1) Inlining. A lot of C++ code gets inlined in optimized builds. That plays havoc with any form of decompiler. To figure out that a function was inlined, the decompiler would have to analyze the specifics of the inlined code and match them up. And post-inlining optimization steps can make code very different, depending on where it was inlined.
2) Templates. Templates use #1 exclusively, but they create additional problems. It is at least theoretically possible that a function that gets inlined in two places would compile to the same sequence of assembly instructions. But for template code, which was instantiated with different template arguments? Different instantiations will usually have to compile down to different sequences of instructions. And this becomes even more difficult, since template code can call different sets of functions based on the template parameters. And those functions themselves could be inlined.
3) Compile-time execution. Template metaprogramming allows the compiler to actually execute code, and C++11's constexpr provides a more natural way to do some computations at compile time. Obviously, compile-time function calls or metafunction instantiations cannot be part of the compiled executable. Only the results of them will be (since that's kinda the point).
4) Lack of comprehensive runtime reflection. C# and Java both lace their bytecode with a lot of information about the nature of the original source code. Object definitions are easily detectable, as are object names, member variable types and names, etc. C++ compiles down to machine language, which is not required to have any such information. And since it isn't required, compilers don't generate it. Even the reflection study group of the ISO C++ committee is focused on compile-time reflection, which is information that won't be available at runtime.
Even std::type_info doesn't offer anything. The reason being that, if the compiler does not detect that a particular type will have typeid called on it, then the compiler doesn't need to generate a std::type_info object for it. And even if it did, all that gives you is an object's name (and an identifier). Nothing more.
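To make the compile-time execution point concrete, here is a small sketch (the names factorial_tmp and factorial are invented for this example). Both forms are evaluated entirely by the compiler, so a decompiler would typically see only the resulting constant, with no trace of the metafunction or the recursion that produced it.

```cpp
// Classic template metaprogram: each factorial_tmp<N> is a distinct type
// that the compiler has to instantiate.
template <unsigned N>
struct factorial_tmp {
    static constexpr unsigned long long value =
        N * factorial_tmp<N - 1>::value;
};
template <>
struct factorial_tmp<0> {
    static constexpr unsigned long long value = 1;
};

// The C++11 constexpr spelling of the same computation.
constexpr unsigned long long factorial(unsigned n) {
    return n == 0 ? 1ULL : n * factorial(n - 1);
}

// Both initializers are computed during compilation; only the number
// 3628800 has to exist in the object code.
constexpr unsigned long long via_template = factorial_tmp<10>::value;
constexpr unsigned long long via_constexpr = factorial(10);
```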

Because C++ compilers generally do not put any more information into the executable than they absolutely have to (especially not if they are compiling in release mode rather than a debug build), so the information you'd need to accurately decompile the program simply is not present in the executable.
Of course a C++ compiler could be made that does include all of the necessary information in the executable (e.g. in the most naive implementation, it could simply include a copy of the source code itself in the executable), but doing so would make the executables significantly larger, and most non-open-source C++ developers would prefer that other people not be able to decompile the executable, so there isn't a whole lot of demand for that functionality.

How to hide C++ source code from a customer

I wish to send some components to my customers. The reasons I want to deliver source code are:
1) My class is templatized. Customer might use any template argument, so I can't pre-compile and send .o file.
2) The customer might use different compiler versions for gcc than mine. So I want him to do compilation at his end.
Now, I can't reveal my source code for obvious reasons. The max I can do is to reveal the .h file. Any ideas how I may achieve this. I am thinking about some hooks in gcc that supports decryption before compilation, etc. Is this possible?
In short, I want him to be able to compile this code without being able to peek inside.

Contract = good, obfuscation = ungood.
That said, you can always do a kind of PIMPL idiom to serve your customer with binaries and just templated wrappers in the header(s). The idea is then to use an "untyped" separately compiled implementation, where the templated wrapper just provides type safety for client code. That's how one often did things before compilers started to understand how to optimize templates, that is, to avoid machine-code level code bloat, but it only provides some measure of protection against trivial copy-and-paste theft, not any protection against someone willing to delve into the machine code.
But perhaps the effort is then greater than just reinventing your functionality?

Just adding some terminology to Alf's answer: the Thin Template idiom is what you might look at. It basically simulates the functionality of a generic. Don't get confused by the Wikipedia article which pops up in Google; you don't have to use void*...
This, of course, does not guarantee binary compatibility. As usual with 'native' C++, you either compile the component for the customer's platform yourself and deploy the binary, or give them your code... The difference to the pure generic component code is that you can do the former at all.
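A minimal sketch of the arrangement these two answers describe might look as follows. The class names untyped_stack and ptr_stack are hypothetical, and void* is used only because it keeps the example short. In a real delivery the member functions of untyped_stack would be defined in a separately compiled file and shipped as a binary; they are inline here only so the sketch stands on its own.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// --- would be shipped precompiled: knows nothing about T ---
class untyped_stack {
public:
    void push(void* p) { items_.push_back(p); }
    void* top() const { return items_.back(); }
    void pop() { items_.pop_back(); }
    std::size_t size() const { return items_.size(); }
private:
    std::vector<void*> items_;
};

// --- shipped as a header: a thin wrapper that only adds type safety ---
template <typename T>
class ptr_stack {
public:
    void push(T* p) { impl_.push(p); }
    T* top() const { return static_cast<T*>(impl_.top()); }
    void pop() { impl_.pop(); }
    std::size_t size() const { return impl_.size(); }
private:
    untyped_stack impl_;
};

// The wrapper hands back T*, not void*: all the typing lives in the header.
static_assert(
    std::is_same<decltype(std::declval<const ptr_stack<int>&>().top()),
                 int*>::value,
    "top() is typed");
```

Every instantiation of ptr_stack forwards to the same compiled code, so the customer can use any T while the header reveals only trivial forwarding functions.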

Maybe a C++ obfuscator would help? See http://www.semdesigns.com/products/obfuscators/CppObfuscationExample.html or Mangle-It.

First, if you're going to provide the source code, then you have to provide the source code. Sure, you could encrypt it, but even if GCC had a "decrypt before compile" option, it would need to decrypt the code, and if GCC can decrypt the code, so can your customer.
What you're asking is impossible. (If you find a way to do it, I believe the movie industry might have a multi-million contract for you. They currently have to resort to expensive custom hardware to prevent people from ripping content, and that only works to a limited degree)
As for your "obvious reasons" why you don't want to provide the source code, I don't see why they're obvious. What would happen if you provided the source code?
You have two options:
provide the source code in its entirety, or
compile everything that can be precompiled into a (static or dynamic) library, and provide your customer with that, plus the header files.

what about pimpls?

1) My class is templatized. Customer might use any template argument, so I can't pre-compile and send .o file.
2) The customer might use different compiler versions for gcc than mine. So I want him to do compilation at his end.
Now, I can't reveal my source code for obvious reasons. The max I can do is to reveal the .h file. Any ideas how I may achieve this. I am thinking about some hooks in gcc that supports decryption before compilation, etc. Is this possible?
In short, I want him to be able to compile this code without being able to peek inside.
Consideration 2) above encompasses A) ABI differences such that the same code compiled with different compiler versions/vendors on the same platform is incompatible, as well as B) the differences in system libraries, kernel versions etc. that the code might be dependent on. The only general solution is to compile on the specific platforms. Either you do it for all platforms, or you give them all the source code and they do it. That's not just the headers and template implementation, that's your out-of-line functions too. You might mitigate A) a little by building a wall of more interoperable extern "C" functions, but you're basically stuck when it comes to B).
So, can you decrypt during compilation? Only if you ship your own hacked GCC binaries to them, built for their specific system, which is probably more hassle than providing different builds of your own libraries (though it may address the template/header exposure issue).
Alternatively, you could employ source code obfuscation techniques. This is probably - practically - as good as it gets. I don't know what tools are out there, but it's an approach that people have pursued for decades (though I'm yet to hear anyone recommend it), so there are sure to be some mature tools.
Re templated code - other people have suggested a templated front end to a C-style generic implementation shipped as a precompiled object. That may or may not be practical (clearly risks performance degradation, and you have to capture the set of type-specific operations you want - e.g. by instantiating a type-specific class derived from an abstract operations base class) but anyway the precompiled object still runs afoul of B).
One other thought... clients might take your source code, but are unlikely to understand it as well as you. Even if they build more systems dependent on their version of it, in a way they're getting more locked in, and may have more need for your services in future. And, if you see they've not played fair, you charge them for it appropriately when the time comes.

It seems that support for plugins arrives with GCC 4.5. So you can provide your own .so which would be, for instance, called before the compilation stage starts. You can then put all kinds of tricks (such as decryption of the source file) in there, neatly hidden. This would also be a portable solution, as no change is made to g++ per se.
This is exactly what I was looking for. You can read more here:
http://www.codesynthesis.com/~boris/blog/2010/05/03/parsing-cxx-with-gcc-plugin-part-1/

When does template instantiation bloat matter in practice?

It seems that in C++ and D, languages which are statically compiled and in which template metaprogramming is a popular technique, there is a decent amount of concern about template instantiation bloat. It seems to me like mostly a theoretical concern, except on very resource-constrained embedded systems. Outside of the embedded space, I have yet to hear of an example of someone being able to demonstrate that it was a problem in practice.
Can anyone provide an example outside of severely resource-limited embedded systems of where template instantiation bloat mattered in practice and had measurable, practically significant negative effects?

There's little problem in C++, because the amount of template stuff you can do in C++ is limited by their complexity.
In D however ... before CTFE (compile-time function evaluation) existed, we had to use templates for string processing. This is also the reason big mangled symbols are compressed in DMD - the strings used as template arguments become part of the mangled symbol names, and when instantiating templates with longer segments of code (for instance) the resulting symbol sizes would spectacularly explode the object format.
It's better nowadays. But on the whole, templates still cause a lot of bloat for a simple reason - they parse faster and are more powerful than in C++, so people naturally use them a lot more (even in cases that technically wouldn't need templates). I must admit I'm one of the prime offenders here (take a look at tools.base if you like, but be sure to keep a barf bag handy - the file is effectively 90% template code).

Template bloat is NOT an issue (It is a mental problem not a code problem).
Yes it can get big. But what's the alternative?
You could write all the code yourself manually (once for each type). Do you think writing it manually will make it smaller? The compiler only instantiates the versions it actually needs, and the linker will remove multiple copies spread over compilation units.
So there is no actual bloat.
It is just building what you use. If you use a lot of different types you need to write more code.
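The "only instantiates what it needs" claim can be illustrated with a small sketch (boxed and its members are invented names). The member length() would not even compile for int, yet the program is fine, because member functions of a class template are only instantiated when something uses them.

```cpp
#include <cstddef>

template <typename T>
struct boxed {
    T value;

    constexpr T get() const { return value; }

    // Would not compile for T = int (int has no size()), but a member
    // function of a class template is only instantiated when it is used.
    std::size_t length() const { return value.size(); }
};

// Uses boxed<int> and boxed<int>::get only; length() is never
// instantiated for int, so no code is generated for it.
constexpr boxed<int> answer{42};
static_assert(answer.get() == 42, "get() is instantiated on demand");
```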

I think you'll need to find an older compiler to see the template code bloat in practice. Modern C++ compilers (and linkers) have been able to optimize it away for a while.

I think it's mainly mental bloat. The next programmer to work on your code will first need to figure out what subset of it matters.

Template instantiation bloat IS an issue in practice, because it can increase compile and link time (a lot!).
I personally think that C++'s #1 problem is compile time, and that it's mainly due to templates.
I worked on a project with about 50 libs. We had our own RTTI system using templates. I had to rewrite it because of the template bloat.
Here are some numbers:
libs went from 640 MB to 420 MB
temps went from 4.3 GB to 2.9 GB
full rebuild went from 19:30 to 13:10

What techniques can be used to speed up C++ compilation times?

What techniques can be used to speed up C++ compilation times?
This question came up in some comments to Stack Overflow question C++ programming style, and I'm interested to hear what ideas there are.
I've seen a related question, Why does C++ compilation take so long?, but that doesn't provide many solutions.

Language techniques
Pimpl Idiom
Take a look at the Pimpl idiom, also known as an opaque pointer or handle class. Not only does it speed up compilation, it also increases exception safety when combined with a non-throwing swap function. The Pimpl idiom lets you reduce the dependencies between headers and reduces the amount of recompilation that needs to be done.
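As a rough sketch of the idiom (Counter and its members are hypothetical names, and the two files are shown in one block only to keep the example self-contained): the header exposes nothing but a pointer to an incomplete type, so changes to Impl do not force clients to recompile.

```cpp
#include <memory>

// --- counter.h (what clients include) ---
class Counter {
public:
    Counter();
    ~Counter();                          // defined where Impl is complete
    Counter(Counter&&) noexcept;
    Counter& operator=(Counter&&) noexcept;

    void increment();
    int value() const;

private:
    struct Impl;                         // forward declaration only
    std::unique_ptr<Impl> impl_;
};

// --- counter.cpp (can change without touching the header) ---
struct Counter::Impl {
    int count = 0;
};

Counter::Counter() : impl_(new Impl) {}
Counter::~Counter() = default;
Counter::Counter(Counter&&) noexcept = default;
Counter& Counter::operator=(Counter&&) noexcept = default;

void Counter::increment() { ++impl_->count; }
int Counter::value() const { return impl_->count; }
```

The special member functions are declared in the header but defined in the implementation file because std::unique_ptr needs the complete Impl type at the point where it may have to delete it.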
Forward Declarations
Wherever possible, use forward declarations. If the compiler only needs to know that SomeIdentifier is a struct or a pointer or whatever, don't include the entire definition, forcing the compiler to do more work than it needs to. This can have a cascading effect, making builds way slower than they need to be.
The I/O streams are particularly known for slowing down builds. If you need them in a header file, try #including <iosfwd> instead of <iostream> and #include the <iostream> header in the implementation file only. The <iosfwd> header holds forward declarations only. Unfortunately the other standard headers don't have a respective declarations header.
Prefer pass-by-reference to pass-by-value in function signatures. This will eliminate the need to #include the respective type definitions in the header file and you will only need to forward-declare the type. Of course, prefer const references to non-const references to avoid obscure bugs, but this is an issue for another question.
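The forward-declaration advice above could be sketched like this. Widget, Report and print are hypothetical names, and the header and implementation file are shown in one block for brevity.

```cpp
#include <iosfwd>   // forward declarations of std::ostream and friends

// --- report.h ---
class Widget;       // forward declaration instead of #include "widget.h"

// References and pointers to an incomplete type are fine in declarations.
std::ostream& print(std::ostream& os, const Widget& w);

struct Report {
    const Widget* subject;   // pointer member: its size is known
};

// --- report.cpp: only here are the full definitions needed ---
#include <ostream>

class Widget {
public:
    explicit Widget(int id) : id_(id) {}
    int id() const { return id_; }
private:
    int id_;
};

std::ostream& print(std::ostream& os, const Widget& w) {
    return os << "Widget#" << w.id();
}
```

Files that include only report.h never see the definition of Widget or the full stream headers, so a change to Widget does not trigger their recompilation.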
Guard Conditions
Use guard conditions to keep header files from being included more than once in a single translation unit.
#pragma once
#ifndef filename_h
#define filename_h
// Header declarations / definitions
#endif
By using both the pragma and the ifndef, you get the portability of the plain macro solution, as well as the compilation speed optimization that some compilers can do in the presence of the pragma once directive.
Reduce interdependency
The more modular and less interdependent your code design is in general, the less often you will have to recompile everything. You can also end up reducing the amount of work the compiler has to do on any individual block at the same time, by virtue of the fact that it has less to keep track of.
Compiler options
Precompiled Headers
These are used to compile a common section of included headers once for many translation units. The compiler compiles it once, and saves its internal state. That state can then be loaded quickly to get a head start in compiling another file with that same set of headers.
Be careful that you only include rarely changed stuff in the precompiled headers, or you could end up doing full rebuilds more often than necessary. This is a good place for STL headers and other library include files.
ccache is another utility that takes advantage of caching techniques to speed things up.
Use Parallelism
Many compilers / IDEs support using multiple cores/CPUs to do compilation simultaneously. In GNU Make (usually used with GCC), use the -j [N] option. In Visual Studio, there's an option under preferences to allow it to build multiple projects in parallel. You can also use the /MP option for file-level parallelism, instead of just project-level parallelism.
Other parallel utilities:
Incredibuild
Unity Build
distcc
Use a Lower Optimization Level
The more the compiler tries to optimize, the harder it has to work.
Shared Libraries
Moving your less frequently modified code into libraries can reduce compile time. By using shared libraries (.so or .dll), you can reduce linking time as well.
Get a Faster Computer
More RAM, faster hard drives (including SSDs), and more CPUs/cores will all make a difference in compilation speed.

I work on the STAPL project which is a heavily-templated C++ library. Once in a while, we have to revisit all the techniques to reduce compilation time. In here, I have summarized the techniques we use. Some of these techniques are already listed above:
Finding the most time-consuming sections
Although there is no proven correlation between the symbol lengths and compilation time, we have observed that smaller average symbol sizes can improve compilation time on all compilers. So your first goal is to find the largest symbols in your code.
Method 1 - Sort symbols based on size
You can use the nm command to list the symbols based on their sizes:
nm --print-size --size-sort --radix=d YOUR_BINARY
In this command the --radix=d lets you see the sizes in decimal numbers (default is hex). Now by looking at the largest symbol, identify if you can break the corresponding class and try to redesign it by factoring the non-templated parts in a base class, or by splitting the class into multiple classes.
Method 2 - Sort symbols based on length
You can run the regular nm command and pipe it to your favorite script (AWK, Python, etc.) to sort the symbols based on their length. Based on our experience, this method identifies the largest trouble making candidates better than method 1.
Method 3 - Use Templight
"Templight is a Clang-based tool to profile the time and memory consumption of template instantiations and to perform interactive debugging sessions to gain introspection into the template instantiation process".
You can install Templight by checking out LLVM and Clang and applying the Templight patch to them. By default, LLVM and Clang are built with debug information and assertions enabled, and these can impact your compilation time significantly. It does seem like Templight needs both, so you have to use the default settings. The process of installing LLVM and Clang should take about an hour or so.
After applying the patch you can use templight++ located in the build folder you specified upon installation to compile your code.
Make sure that templight++ is in your PATH. Now to compile add the following switches to your CXXFLAGS in your Makefile or to your command line options:
CXXFLAGS+=-Xtemplight -profiler -Xtemplight -memory -Xtemplight -ignore-system
Or
templight++ -Xtemplight -profiler -Xtemplight -memory -Xtemplight -ignore-system
After compilation is done, you will have a .trace.memory.pbf and .trace.pbf generated in the same folder. To visualize these traces, you can use the Templight Tools that can convert these to other formats. Follow these instructions to install templight-convert. We usually use the callgrind output. You can also use the GraphViz output if your project is small:
$ templight-convert --format callgrind YOUR_BINARY --output YOUR_BINARY.trace
$ templight-convert --format graphviz YOUR_BINARY --output YOUR_BINARY.dot
The callgrind file generated can be opened using kcachegrind in which you can trace the most time/memory consuming instantiation.
Reducing the number of template instantiations
Although there is no exact solution for reducing the number of template instantiations, there are a few guidelines that can help:
Refactor classes with more than one template argument
For example, if you have a class,
template <typename T, typename U>
struct foo { };
and both of T and U can have 10 different options, you have increased the possible template instantiations of this class to 100. One way to resolve this is to abstract the common part of the code to a different class. The other method is to use inheritance inversion (reversing the class hierarchy), but make sure that your design goals are not compromised before using this technique.
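One way to read "abstract the common part of the code to a different class" is sketched below. foo_common, element_size and same_size are invented names. With 10 choices for each of T and U there are still up to 100 instantiations of foo, but everything that depends on T alone is instantiated only 10 times.

```cpp
#include <cstddef>
#include <type_traits>

// Everything that depends on T alone lives here and is instantiated once
// per T, no matter how many U's it is later combined with.
template <typename T>
struct foo_common {
    static constexpr std::size_t element_size() { return sizeof(T); }
};

// The two-parameter class keeps only what genuinely needs both T and U.
template <typename T, typename U>
struct foo : foo_common<T> {
    static constexpr bool same_size() { return sizeof(T) == sizeof(U); }
};

// foo<int, char> and foo<int, double> share a single foo_common<int>.
static_assert(std::is_base_of<foo_common<int>, foo<int, char>>::value &&
                  std::is_base_of<foo_common<int>, foo<int, double>>::value,
              "common part is shared across U");
```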
Refactor non-templated code to individual translation units
Using this technique, you can compile the common section once and link it with your other TUs (translation units) later on.
Use extern template instantiations (since C++11)
If you know all the possible instantiations of a class you can use this technique to compile all cases in a different translation unit.
For example, in:
enum class PossibleChoices {Option1, Option2, Option3};
template <PossibleChoices pc>
struct foo { };
We know that this class can have three possible instantiations:
template class foo<PossibleChoices::Option1>;
template class foo<PossibleChoices::Option2>;
template class foo<PossibleChoices::Option3>;
Put the above in a translation unit and use the extern keyword in your header file, below the class definition:
extern template class foo<PossibleChoices::Option1>;
extern template class foo<PossibleChoices::Option2>;
extern template class foo<PossibleChoices::Option3>;
This technique can save you time if you are compiling different tests with a common set of instantiations.
NOTE : MPICH2 ignores the explicit instantiation at this point and always compiles the instantiated classes in all compilation units.
Use unity builds
The whole idea behind unity builds is to include all the .cc files that you use in one file and compile that file only once. Using this method, you can avoid reinstantiating common sections of different files and if your project includes a lot of common files, you probably would save on disk accesses as well.
As an example, let's assume you have three files foo1.cc, foo2.cc, foo3.cc and they all include tuple from STL. You can create a foo-all.cc that looks like:
#include "foo1.cc"
#include "foo2.cc"
#include "foo3.cc"
You compile this file only once and potentially reduce the common instantiations among the three files. It is hard to generally predict if the improvement can be significant or not. But one evident fact is that you would lose parallelism in your builds (you can no longer compile the three files at the same time).
Further, if any of these files happen to take a lot of memory, you might actually run out of memory before the compilation is over. On some compilers, such as GCC, this might ICE (Internal Compiler Error) your compiler for lack of memory. So don't use this technique unless you know all the pros and cons.
Precompiled headers
Precompiled headers (PCHs) can save you a lot of time in compilation by compiling your header files to an intermediate representation recognizable by a compiler. To generate precompiled header files, you only need to compile your header file with your regular compilation command. For example, on GCC:
$ g++ YOUR_HEADER.hpp
This will generate a YOUR_HEADER.hpp.gch file (.gch is the extension for PCH files in GCC) in the same folder. This means that if you include YOUR_HEADER.hpp in some other file, the compiler will pick up YOUR_HEADER.hpp.gch in place of the YOUR_HEADER.hpp in the same folder.
There are two issues with this technique:
You have to make sure that the header files being precompiled are stable and are not going to change (you can always change your makefile)
You can only include one PCH per compilation unit (on most compilers). This means that if you have more than one header file to be precompiled, you have to include them in one file (e.g., all-my-headers.hpp). But that means that you have to include the new file in all places. Fortunately, GCC has a solution for this problem. Use -include and give it the new header file. To force-include several files, repeat the -include option once per file.
For example:
g++ foo.cc -include all-my-headers.hpp
Use unnamed or anonymous namespaces
Unnamed namespaces (a.k.a. anonymous namespaces) can reduce the generated binary sizes significantly. Unnamed namespaces use internal linkage, meaning that the symbols generated in those namespaces will not be visible to other TU (translation or compilation units). Compilers usually generate unique names for unnamed namespaces. This means that if you have a file foo.hpp:
namespace {
template <typename T>
struct foo { };
} // Anonymous namespace
using A = foo<int>;
And you happen to include this file in two TUs (two .cc files that you compile separately), the two foo template instances will not be the same type. Anything in the header that depends on them, such as an inline function, can then violate the One Definition Rule (ODR). For this reason, using unnamed namespaces in header files is discouraged. Feel free to use them in your .cc files to keep symbols from showing up in your binary files. In some cases, moving all the internal details of a .cc file into an unnamed namespace showed a 10% reduction in the generated binary size.
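A minimal sketch of the .cc-file usage, with made-up function names: the helper lives in an unnamed namespace and gets internal linkage, while the one function meant for other TUs stays outside it.

```cpp
// foo.cc (hypothetical): helpers that no other translation unit needs.
namespace {

// Internal linkage: this symbol is not visible to other TUs.
int clamp_to_byte(int value) {
    if (value < 0) return 0;
    if (value > 255) return 255;
    return value;
}

}  // unnamed namespace

// External linkage: the only function other TUs can call.
int brightness(int raw) {
    return clamp_to_byte(raw * 2);
}
```

Another .cc file could define its own clamp_to_byte in its own unnamed namespace without the two clashing at link time.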
Changing visibility options
In newer compilers you can select your symbols to be either visible or invisible in the Dynamic Shared Objects (DSOs). Ideally, changing the visibility can improve compiler performance, link time optimizations (LTOs), and generated binary sizes. If you look at the STL header files in GCC you can see that it is widely used. To enable visibility choices, you need to change your code per function, per class, per variable and more importantly per compiler.
With the help of visibility you can hide the symbols that you consider private from the generated shared objects. On GCC you can control the visibility of symbols by passing default or hidden to the -fvisibility option of your compiler. This is in some sense similar to the unnamed namespace but in a more elaborate and intrusive way.
If you would like to specify the visibilities per case, you have to add the following attributes to your functions, variables, and classes:
__attribute__((visibility("default"))) void foo1() { }
__attribute__((visibility("hidden"))) void foo2() { }
class __attribute__((visibility("hidden"))) foo3 { };
void foo4() { }
The default visibility in GCC is default (public), meaning that if you compile the above as a shared library (-shared), foo2 and class foo3 will not be visible to code outside the library (foo1 and foo4 will be visible). If you compile with -fvisibility=hidden then only foo1 will be visible. Even foo4 would be hidden.
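Since the attribute is compiler-specific, it is commonly wrapped in macros. Here is a small sketch of that idea; the macro and function names (MYLIB_API, MYLIB_LOCAL, twice, quadruple) are hypothetical, and compilers other than GCC and Clang simply get empty macros here.

```cpp
// Hypothetical macro names; the attribute itself is the GCC/Clang one used above.
#if defined(__GNUC__)
#  define MYLIB_API   __attribute__((visibility("default")))
#  define MYLIB_LOCAL __attribute__((visibility("hidden")))
#else
#  define MYLIB_API
#  define MYLIB_LOCAL
#endif

// Implementation detail: hidden from users of the shared library.
MYLIB_LOCAL int twice(int x) { return 2 * x; }

// Public entry point: exported even when building with -fvisibility=hidden.
MYLIB_API int quadruple(int x) { return twice(twice(x)); }
```

With this in place, the library can be built with -fvisibility=hidden and only the functions marked MYLIB_API remain exported.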
You can read more about visibility on GCC wiki.

I'd recommend these articles from "Games from Within, Indie Game Design And Programming":
Physical Structure and C++ – Part 1: A First Look
Physical Structure and C++ – Part 2: Build Times
Even More Experiments with Includes
How Incredible Is Incredibuild?
The Care and Feeding of Pre-Compiled Headers
The Quest for the Perfect Build System
The Quest for the Perfect Build System (Part 2)
Granted, they are pretty old - you'll have to re-test everything with the latest versions (or versions available to you), to get realistic results. Either way, it is a good source for ideas.

One technique which worked quite well for me in the past: don't compile multiple C++ source files independently, but rather generate one C++ file which includes all the other files, like this:
// myproject_all.cpp
// Automatically generated file - don't edit this by hand!
#include "main.cpp"
#include "mainwindow.cpp"
#include "filterdialog.cpp"
#include "database.cpp"
Of course this means you have to recompile all of the included source code in case any of the sources changes, so the dependency tree gets worse. However, compiling multiple source files as one translation unit is faster (at least in my experiments with MSVC and GCC) and generates smaller binaries. I also suspect that the compiler is given more potential for optimizations (since it can see more code at once).
This technique breaks in various cases; for instance, the compiler will bail out in case two or more source files declare a global function with the same name. I couldn't find this technique described in any of the other answers though, that's why I'm mentioning it here.
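One way around the name clashes, sketched under the assumption that each source file can wrap its private helpers in its own named namespace (the namespace and function names below are invented). The listing shows what the compiler sees once the generated file has pulled in two of the sources:

```cpp
// What the compiler sees after myproject_all.cpp includes both files.

// ---- from mainwindow.cpp ----
namespace mainwindow_detail {
int default_width() { return 800; }
}  // namespace mainwindow_detail

int mainwindow_width() { return mainwindow_detail::default_width(); }

// ---- from filterdialog.cpp ----
namespace filterdialog_detail {
// Same helper name as above, but no clash thanks to the distinct namespace.
int default_width() { return 320; }
}  // namespace filterdialog_detail

int filterdialog_width() { return filterdialog_detail::default_width(); }
```

Note that unnamed namespaces or static functions do not help here: inside a single translation unit, two helpers with the same name and signature still collide.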
For what it's worth, the KDE Project used this exact same technique since 1999 to build optimized binaries (possibly for a release). The switch to the build configure script was called --enable-final. Out of archaeological interest I dug up the posting which announced this feature: http://lists.kde.org/?l=kde-devel&m=92722836009368&w=2

I will just link to my other answer: How do YOU reduce compile time, and linking time for Visual C++ projects (native C++)?. Another point I want to add, although it often causes problems, is to use precompiled headers. But please, only use them for parts which hardly ever change (like GUI toolkit headers). Otherwise, they will cost you more time than they save you in the end.
Another option is, when you work with GNU make, to turn on -j<N> option:
-j [N], --jobs[=N] Allow N jobs at once; infinite jobs with no arg.
I usually have it at 3 since I've got a dual core here. It will then run compilers in parallel for different translation units, provided there are no dependencies between them. Linking cannot be done in parallel, since there is only one linker process linking together all object files.
But the linker itself can be threaded, and this is what the GNU gold ELF linker does. It's optimized threaded C++ code which is said to link ELF object files an order of magnitude faster than the old ld (and it was actually included in binutils).
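If you want to derive N for -j from the machine instead of hard-coding it, something like the following could do. It generalizes the "3 jobs on a dual core" habit above to "cores + 1", which is an assumption on my part and not a tuned value; the function names are made up.

```cpp
#include <thread>

// Rule of thumb taken from the answer above: one job more than the number
// of cores (3 jobs on a dual core). A starting point, not a tuned value.
unsigned suggested_jobs(unsigned cores) {
    // hardware_concurrency() may report 0 when the count is unknown.
    if (cores == 0) return 1;
    return cores + 1;
}

// Suggestion for the machine this code runs on.
unsigned suggested_jobs_here() {
    return suggested_jobs(std::thread::hardware_concurrency());
}
```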

There's an entire book on this topic, which is titled Large-Scale C++ Software Design (written by John Lakos).
The book pre-dates templates, so to the contents of that book add "using templates, too, can make the compiler slower".

Once you have applied all the code tricks above (forward declarations, reducing header inclusion to the minimum in public headers, pushing most details inside the implementation file with Pimpl...) and nothing else can be gained language-wise, consider your build system. If you use Linux, consider using distcc (distributed compiler) and ccache (cache compiler).
The first one, distcc, executes the preprocessor step locally and then sends the output to the first available compiler in the network. It requires the same compiler and library versions in all the configured nodes in the network.
The latter, ccache, is a compiler cache. It again executes the preprocessor and then checks an internal database (held in a local directory) to see whether that preprocessed file has already been compiled with the same compiler parameters. If it has, it just returns the binary and the output from the first run of the compiler.
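To make the idea concrete, here is a toy model of such a cache. It is not how ccache is implemented; the class name and interface are invented, the "database" is an in-memory map, and the key is simply the flags plus the whole preprocessed text, where a real tool would more likely store a digest.

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Toy compiler cache: keyed on compiler flags plus preprocessed source.
class ToyCompileCache {
public:
    using Compiler = std::function<std::string(const std::string&)>;

    // Returns the cached result if the same preprocessed text was already
    // compiled with the same flags; otherwise runs the compiler and stores it.
    std::string get_or_compile(const std::string& preprocessed,
                               const std::string& flags,
                               const Compiler& compile) {
        const std::string key = flags + '\n' + preprocessed;
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            ++hits_;
            return it->second;
        }
        ++misses_;
        std::string object = compile(preprocessed);
        cache_.emplace(key, object);
        return object;
    }

    int hits() const { return hits_; }
    int misses() const { return misses_; }

private:
    std::unordered_map<std::string, std::string> cache_;
    int hits_ = 0;
    int misses_ = 0;
};
```

Because the key is built from the preprocessed text, a change in any included header changes the key and forces a real compile, while touching a file without changing its contents does not.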
Both can be used at the same time, so that if ccache does not have a local copy it can send the file through the network to another node with distcc, or else it can just inject the cached result without further processing.
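For reference, the Pimpl trick mentioned at the start of this answer can be sketched roughly as follows. The Greeter class is a made-up example, and header and implementation file are shown in one listing. The point is that the public header only forward declares Impl, so changing the private members does not force users of the header to recompile.

```cpp
#include <memory>
#include <string>
#include <utility>

// ---- greeter.h (hypothetical public header): no private details leak out ----
class Greeter {
public:
    explicit Greeter(std::string name);
    ~Greeter();                       // defined where Impl is a complete type
    std::string greet() const;

private:
    struct Impl;                      // forward declaration only
    std::unique_ptr<Impl> impl_;
};

// ---- greeter.cpp (hypothetical): the only file that sees Impl's members ----
struct Greeter::Impl {
    std::string name;
};

Greeter::Greeter(std::string name) : impl_(new Impl{std::move(name)}) {}
Greeter::~Greeter() = default;

std::string Greeter::greet() const {
    return "Hello, " + impl_->name;
}
```

The price is one heap allocation and one pointer indirection per object, so this is a sketch of the trade-off rather than something to apply to every class.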

Here are some:
Use all processor cores by starting a multiple-compile job (make -j2 is a good example).
Turn off or lower optimizations (for example, GCC is much faster with -O1 than -O2 or -O3).
Use precompiled headers.

When I came out of college, the first real production-worthy C++ code I saw had these arcane #ifndef ... #endif directives wrapped around the contents of the headers. I asked the guy who was writing the code about these overarching things in a very naive fashion and was introduced to the world of large-scale programming.
Coming back to the point, using directives to prevent duplicate header definitions was the first thing I learned when it came to reducing compiling times.
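For anyone who has not seen them, here is a small illustration of what such a guard does. The header name point.h and its contents are invented, and the header text is pasted twice to mimic what the preprocessor sees when the same header is included twice in one translation unit.

```cpp
// ---- first #include "point.h" (hypothetical header) ----
#ifndef POINT_H
#define POINT_H
struct Point { int x; int y; };
inline int sum_of_coordinates(const Point& p) { return p.x + p.y; }
#endif  // POINT_H

// ---- second #include "point.h", e.g. pulled in through another header ----
#ifndef POINT_H   // POINT_H is already defined, so everything up to #endif is skipped
#define POINT_H
struct Point { int x; int y; };   // would be a redefinition error without the guard
inline int sum_of_coordinates(const Point& p) { return p.x + p.y; }
#endif  // POINT_H
```

Without the #ifndef lines the second copy would not compile at all; with them the compiler only has to skip over it.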

More RAM.
Someone talked about RAM drives in another answer. I did this with an 80286 and Turbo C++ (shows age) and the results were phenomenal. As was the loss of data when the machine crashed.

You could use Unity Builds.

Use
#pragma once
at the top of header files, so if they're included more than once in a translation unit, the text of the header will only get included and parsed once.

Use forward declarations where you can. If a class declaration only uses a pointer or reference to a type, you can just forward declare it and include the header for the type in the implementation file.
For example:
// T.h
class Class2; // Forward declaration
class T {
public:
void doSomething(Class2 &c2);
private:
Class2 *m_Class2Ptr;
};
// T.cpp
#include "T.h"
#include "Class2.h"
void T::doSomething(Class2 &c2) {
// Whatever you want here
}
Fewer includes means far less work for the preprocessor if you do it enough.

Just for completeness: a build might be slow because the build system is being stupid as well as because the compiler is taking a long time to do its work.
Read Recursive Make Considered Harmful (PDF) for a discussion of this topic in Unix environments.

Not about the compilation time, but about the build time:
Use ccache if you have to rebuild the same files when you are working on your buildfiles
Use ninja-build instead of make. I am currently compiling a project with ~100 source files and everything is cached by ccache. make needs 5 minutes, ninja less than 1.
You can generate your ninja files from cmake with -GNinja.

Upgrade your computer
Get a quad core (or a dual-quad system)
Get LOTS of RAM.
Use a RAM drive to drastically reduce file I/O delays. (There are companies that make IDE and SATA RAM drives that act like hard drives).
Then you have all your other typical suggestions
Use precompiled headers if available.
Reduce the amount of coupling between parts of your project. Changing one header file usually shouldn't require recompiling your entire project.

I had an idea about using a RAM drive. It turned out that for my projects it doesn't make that much of a difference after all. But then they are pretty small still. Try it! I'd be interested in hearing how much it helped.

Dynamic linking (.so) can be much much faster than static linking (.a). Especially when you have a slow network drive. This is since you have all of the code in the .a file which needs to be processed and written out. In addition, a much larger executable file needs to be written out to the disk.

Where are you spending your time? Are you CPU bound? Memory bound? Disk bound? Can you use more cores? More RAM? Do you need RAID? Do you simply want to improve the efficiency of your current system?
Under gcc/g++, have you looked at ccache? It can be helpful if you are doing make clean; make a lot.

Starting with Visual Studio 2017, you can get some compiler metrics about what takes time.
Add these parameters to C/C++ -> Command line (Additional Options) in the project properties window:
/Bt+ /d2cgsummary /d1reportTime
You can find more information in this post.

Faster hard disks.
Compilers write many (and possibly huge) files to disk. Work with an SSD instead of a typical hard disk and compilation times are much lower.

On Linux (and maybe some other *NIXes), you can really speed the compilation by NOT STARING at the output and changing to another TTY.
Here is the experiment: printf slows down my program

Network shares will drastically slow down your build, as the seek latency is high. For something like Boost, it made a huge difference for me, even though our network share drive is pretty fast. Time to compile a toy Boost program went from about 1 minute to 1 second when I switched from a network share to a local SSD.

If you have a multicore processor, both Visual Studio (2005 and later) as well as GCC support multi-processor compiles. It is something to enable if you have the hardware, for sure.

First of all, we have to understand what is so different about C++ that sets it apart from other languages.
Some people say it's that C++ has too many features. But hey, there are languages that have a lot more features and they are nowhere near that slow.
Some people say it's the size of a file that matters. Nope, source lines of code don't correlate with compile times.
But wait, how can it be? More lines of code should mean longer compile times, what's the sorcery?
The trick is that a lot of lines of code is hidden in preprocessor directives. Yes. Just one #include can ruin your module's compilation performance.
You see, C++ doesn't have a module system. All *.cpp files are compiled from scratch. So having 1000 *.cpp files means compiling your project a thousand times. You have more than that? Too bad.
That's why C++ developers hesitate to split classes into multiple files. All those headers are tedious to maintain.
So what can we do other than using precompiled headers, merging all the cpp files into one, and keeping the number of headers minimal?
C++20 brings us preliminary support of modules! Eventually, you'll be able to forget about #include and the horrible compile performance that header files bring with them. Touched one file? Recompile only that file! Need to compile a fresh checkout? Compile in seconds rather than minutes and hours.
The C++ community should move to C++20 as soon as possible. C++ compiler developers should put more focus on this, C++ developers should start testing preliminary support in various compilers and use those compilers that support modules. This is the most important moment in C++ history!

Although not a "technique", I couldn't figure out how Win32 projects with many source files compiled faster than my "Hello World" empty project. Thus, I hope this helps someone like it did me.
In Visual Studio, one option to decrease build times is Incremental Linking (/INCREMENTAL). It's incompatible with Link-time Code Generation (/LTCG), so remember to disable incremental linking when doing release builds.

Using dynamic linking instead of static linking makes your builds noticeably faster.
If you use CMake, enable the property:
set(BUILD_SHARED_LIBS ON)
For release builds, static linking can allow more optimization.

From Microsoft: https://devblogs.microsoft.com/cppblog/recommendations-to-speed-c-builds-in-visual-studio/
Specific recommendations include:
DO USE PCH for projects
DO include commonly used system, runtime and third party headers in PCH
DO include rarely changing project specific headers in PCH
DO NOT include headers that change frequently
DO audit PCH regularly to keep it up to date with product churn
DO USE /MP
DO Remove /Gm in favor of /MP
DO resolve conflict with #import and use /MP
DO USE linker switch /incremental
DO USE linker switch /debug:fastlink
DO consider using a third party build accelerator
