Profiling template metaprogram compilation time - C++

I'm working on a C++ project with extensive compile-time computations. Long compilation times are slowing us down. How might I find the slowest parts of our template metaprograms so I can optimize them? (When we have slow runtime computations, I have many profilers to choose from, e.g. valgrind's callgrind tool. So I tried building a debug GCC and profiling it while it compiled our code, but I didn't learn much from that.)
I use GCC and Clang, but any suggestions are welcome.
I found profile_templates on Boost's site, but it seems to be thinly documented and to require the jam/bjam build system. If you show how to use it on a non-jam project [1], I will upvote you. https://svn.boost.org/svn/boost/sandbox/tools/profile_templates/ appears to count the number of instantiations, whereas counting the time taken would be ideal.
[1] Our project uses CMake and is small enough that hacking together a Jamfile just for template profiling could be acceptable.

I know this is an old question, but there is a newer answer that I would like to give.
There is a set of Clang-based projects that target this particular problem. The first component is an instrumented version of the Clang compiler that produces a complete trace of all the template instantiations that occurred during compilation, with timing values and, optionally, memory usage counts. That tool is called Templight and is accessible here (currently it needs to be compiled against a patched Clang source tree):
https://github.com/mikael-s-persson/templight
A second component is a conversion tool that turns the Templight traces into other formats: easily parsable text-based formats (YAML, XML, plain text), formats that are easier to visualize (GraphViz / GraphML), and, most importantly, callgrind output that can be loaded into KCachegrind to visualize and inspect the meta-call-graph of template instantiations and their compile-time costs - for example, the template instantiation profile of a piece of code that creates a boost::container::vector and sorts it with std::sort. Check it out here:
https://github.com/mikael-s-persson/templight-tools
Finally, there is another related project that provides an interactive shell and debugger for walking up and down the template instantiation graph:
https://github.com/sabel83/metashell

I've been working since 2008 on a library that uses template metaprogramming heavily. There is a real need for better tools or approaches for understanding what consumes the most compile time.
The only technique I know of is a divide-and-conquer approach: separate the code into different files, comment out the bodies of template definitions, or wrap your template instantiations in #define macros and temporarily redefine those macros to do nothing. Then you can recompile the project with and without various instantiations and narrow things down.
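For illustration, the macro trick might look like this (a minimal sketch; the names Parser, INSTANTIATE_PARSER, and SKIP_HEAVY_TMP are all invented):

    #include <string>

    template <class T>
    struct Parser {
        T parse() { return T(); }  // stand-in for an expensive metaprogram
    };

    #ifdef SKIP_HEAVY_TMP
        #define INSTANTIATE_PARSER(T)  // expands to nothing: instantiation skipped
    #else
        #define INSTANTIATE_PARSER(T) template struct Parser<T>;
    #endif

    INSTANTIATE_PARSER(int)
    INSTANTIATE_PARSER(std::string)

    // Compile twice and compare wall-clock times:
    //   g++ -c heavy.cpp                    # with the instantiations
    //   g++ -c heavy.cpp -DSKIP_HEAVY_TMP   # without them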
Incidentally, just separating the same code into more, smaller files can make it compile faster. I'm not just talking about the opportunity for parallel compilation - even serially, I observed it to be faster. I've observed this effect in gcc both when compiling my library and when compiling Boost Spirit parsers. My theory is that some of the symbol resolution, overload resolution, SFINAE, or type inference code in gcc has O(n log n) or even O(n^2) complexity with respect to the number of type definitions or symbols in play in the translation unit.
Ultimately, what you need to do is carefully examine your templates and separate what really depends on the type information from what does not, and use type erasure and virtual functions wherever possible on the portion of the code that does not actually require the template types. You need to get stuff out of the headers and into cpp files if that part of the code can be moved. In a perfect world the compiler would figure this out for itself - you shouldn't have to babysit it by moving this code around manually - but that's the state of the art with the compilers we have today.
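A minimal sketch of that split (all names invented, with the shared work reduced to a placeholder): only the thin type-dependent shim stays templated in the header, while the bulk of the work goes behind a type-erased interface whose definition lives in a cpp file.

    #include <cstddef>

    // Header: a type-erased interface that the non-template code works against.
    struct Sink {
        virtual void write(const void* bytes, std::size_t n) = 0;
        virtual ~Sink() {}
    };

    // Declared in the header, defined once in a cpp file - compiled a single
    // time, no matter how many types serialize() is instantiated with.
    void serialize_common(Sink& out);

    // Header: only this thin, type-dependent shim remains a template.
    template <class T>
    void serialize(const T& value, Sink& out) {
        out.write(&value, sizeof(T));  // the only line that truly depends on T
        serialize_common(out);         // shared, non-template bulk of the work
    }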

The classic book C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond comes with a 20-page appendix on profiling compile-time costs. It has a companion CD with the code that generates the graphs in that appendix.
Another paper is http://aszt.inf.elte.hu/~gsd/s/cikkek/profiling/profile.pdf, perhaps that is of use to you.
Yet another, but more labor-intensive, way is to count the number of instantiations of each class from your compiler output.

http://www.cs.uoregon.edu/research/tau/about.php might also be of interest to you: for templated entities, it shows the breakdown of time spent on each instantiation. Other data includes how many times each function was called, how many profiled functions each function invoked, and the mean inclusive time per call.

Related

How to deliberately slow down compilation?

There are a lot of questions asking how to speed up compilation of C++ code. I need to do the opposite.
I'm working with software that monitors compiler invocations in order to do static code analysis. But if the compiler process exits too quickly, the monitoring software can miss it. So I need to slow compilation down. I understand that it's a terrible solution and I hope it will be temporary.
I came up with two solutions:
Disable parallel build and enable preprocessor and compiler listing generation. It works, but it requires a lot of mouse clicking.
Use a compiler option to force inclusion of a special header file that somehow slows down compilation.
Unfortunately, I couldn't come up with something that is simple to write but hard to compile. Using a lot of #warning directives seems to work, but it obviously clutters the output significantly.
I'm using Keil with the armcc compiler, so I can use most of C++11, but the maximum template recursion depth is just 63.
Preferably, this should not add any overhead to binary size or running time.
UPD: I'll try to clarify this a bit. I know it's a horrible idea and that this problem should be solved differently. I will try to solve it differently, but I also want to explore this possibility.
Maybe this solution will be slow enough =), something like what @NathanOliver proposed.
It's a compile-time sine table that I use. It requires extra space, but you can tune it a little (the table size and the sine accuracy are template parameters of the "staticSinus" function; hopefully you'll find the values that suit you best).
https://godbolt.org/z/DYZDF5
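The linked snippet isn't reproduced here, but the general shape of such a compile-time sine table is roughly this (a sketch of the idea, not the linked code; static_sin, the term count, and the table size are my own choices). The table size and the number of Taylor terms are the knobs to crank up, and the sequence builder stays under the 63-level recursion cap as long as the table size does:

    #include <cstddef>

    // C++11 has no std::index_sequence, so hand-roll one. make_seq recurses
    // once per table entry, so keep N below the recursion-depth cap of 63.
    template <std::size_t... Is> struct seq {};
    template <std::size_t N, std::size_t... Is>
    struct make_seq : make_seq<N - 1, N - 1, Is...> {};
    template <std::size_t... Is>
    struct make_seq<0, Is...> { typedef seq<Is...> type; };

    // Taylor series for sin(x); the term count is one tuning knob.
    constexpr double sin_impl(double x, int n, int max_n, double term) {
        return n == max_n ? term
                          : term + sin_impl(x, n + 1, max_n,
                                            term * (-x * x) / ((2 * n + 2) * (2 * n + 3)));
    }
    constexpr double static_sin(double x, int terms = 10) {
        return sin_impl(x, 0, terms, x);
    }

    template <std::size_t N> struct SinTable { double values[N]; };

    template <std::size_t N, std::size_t... Is>
    constexpr SinTable<N> make_table(seq<Is...>) {
        return { { static_sin(6.283185307179586 * Is / N)... } };
    }

    // Forces N constexpr evaluations at compile time; raise the table size
    // (and the term count) to make the compiler sweat harder.
    constexpr SinTable<32> table = make_table<32>(make_seq<32>::type());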
You don't want to do anything of the sort. Here are some solutions, of varying degrees of kludginess:
Ideal solution: invoke the code analysis from the Makefile.
Replace the compiler with a wrapper - e.g. a Python script - that forwards the command line to the compiler and then triggers the analysis tool (a sketch of this idea follows below).
Monitor make instead of the compiler - it tends to live longer.
Have a tiny wrapper script maintain a reference count in shared memory, and when the reference count is initially incremented, the wrapper should go to sleep for "long enough" after the compiler has finished. Monitor that script.
In a nutshell: the monitoring tool shouldn't be monitoring anything. The code analysis should be invoked from the build tool, i.e. specified in the Makefile. If generating the Makefile by hand is too cumbersome, use cmake with ninja, or xmake with no dependencies. You can also generate whatever "project" file the IDE needs to make working on the project easier. But make something other than Keil-specific artifacts the source of truth for the project: it'll make everything easier from then on.
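For what it's worth, a minimal sketch of the wrapper idea from the second point above (written in C++ for concreteness; the real-compiler path and the analysis-tool command are invented placeholders):

    #include <cstdlib>
    #include <string>

    // Build this as "armcc" and put it first on the PATH. It forwards the
    // whole command line to the real (renamed) compiler, then pokes the
    // analysis tool - so nothing needs to be slowed down at all.
    int main(int argc, char** argv) {
        std::string cmd = "\"C:/Keil/ARM/BIN/armcc_real.exe\"";  // invented path
        for (int i = 1; i < argc; ++i) {
            cmd += " \"";
            cmd += argv[i];
            cmd += "\"";
        }
        int rc = std::system(cmd.c_str());  // the actual compilation
        std::system("run_analysis_tool");   // invented placeholder command
        return rc;  // exit-status plumbing simplified; good enough for a sketch
    }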

Why is there no accurate C++ decompiler?

Why is it not possible to create a C++ decompiler that will function as accurately as those made for Java and C#?
There are several reasons:
Inlining. A lot of C++ code gets inlined in optimized builds, and that plays havoc with any decompiler. To figure out that a function was inlined, the decompiler would have to analyze the specifics of the inlined code and match them up. And post-inlining optimization passes can make the code look very different depending on where it was inlined.
Templates. Template code is subject to all of #1, but it creates additional problems. It is at least theoretically possible that a function inlined in two places compiles to the same sequence of assembly instructions. But template code that gets instantiated with different template arguments? Different instantiations will usually have to compile down to different sequences of instructions. And this becomes even more difficult, since template code can call different sets of functions based on the template parameters, and those functions themselves could be inlined.
Compile-time execution. Template metaprogramming allows the compiler to actually execute code, and C++11's constexpr provides a more natural way to do some computations at compile time. Obviously, compile-time function calls and metafunction instantiations cannot be part of the compiled executable. Only their results will be (since that's kinda the point).
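A minimal illustration: in the snippet below, the factorial computation happens entirely inside the compiler, so a decompiler only ever sees the constant 120.

    constexpr unsigned factorial(unsigned n) {
        return n <= 1 ? 1 : n * factorial(n - 1);
    }

    // Guaranteed compile-time evaluation: the factorial function itself need
    // not exist in the executable at all - only the constant 120 does.
    constexpr unsigned fact5 = factorial(5);
    static_assert(fact5 == 120, "evaluated by the compiler");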
Lack of comprehensive runtime reflection. C# and Java both lace their bytecode with a lot of information about the nature of the original source code. Object definitions are easily detectable, as are object names, member variable types and names, etc. C++ compiles down to machine language, which is not required to carry any such information. And since it isn't required, compilers don't generate it. Even the reflection study group of the ISO C++ committee is focused on compile-time reflection, i.e. information that won't be available at runtime.
Even std::type_info doesn't offer anything. The reason being that, if the compiler does not detect that a particular type will have typeid called on it, then the compiler doesn't need to generate a std::type_info object for it. And even if it did, all that gives you is an object's name (and an identifier). Nothing more.
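A small example of how little that is (the Widget type is just for illustration; the exact output is implementation-defined):

    #include <iostream>
    #include <typeinfo>

    struct Widget { int x; double y; };

    int main() {
        Widget w;
        // GCC/Clang print a mangled name such as "6Widget"; MSVC prints
        // "struct Widget". Members, methods, layout: none of it is there.
        std::cout << typeid(w).name() << '\n';
    }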
C++ compilers generally do not put any more information into the executable than they absolutely have to (especially not when compiling in release mode rather than a debug build), so the information you'd need to accurately decompile the program simply is not present in the executable.
Of course a C++ compiler could be made that does include all of the necessary information in the executable (e.g. in the most naive implementation, it could simply include a copy of the source code itself in the executable), but doing so would make the executables significantly larger, and most non-open-source C++ developers would prefer that other people not be able to decompile the executable, so there isn't a whole lot of demand for that functionality.

How to hide C++ source code from a customer

I wish to send some components to my customers. The reasons I want to deliver source code are:
1) My class is templatized. A customer might use any template argument, so I can't pre-compile and send a .o file.
2) The customer might use a different gcc version than mine, so I want him to do the compilation at his end.
Now, I can't reveal my source code, for obvious reasons. The most I can do is reveal the .h file. Any ideas on how I might achieve this? I am thinking about some hook in gcc that supports decryption before compilation, etc. Is this possible?
In short, I want him to be able to compile this code without being able to peek inside.
Contract = good, obfuscation = ungood.
That said, you can always use a kind of PIMPL idiom to serve your customer with binaries and just templated wrappers in the header(s). The idea is to use an "untyped" separately compiled implementation, where the templated wrapper just provides type safety for client code. That's how one often did things before compilers learned to optimize templates, in order to avoid machine-code-level bloat, but it only provides some measure of protection against trivial copy-and-paste theft, not any protection against someone willing to delve into the machine code.
But perhaps the effort is then greater than just reinventing your functionality?
Just to add some terminology to Alf's answer: the thin template idiom is what you might look at. It basically simulates the functionality of a generic. Don't get confused by the Wikipedia article that pops up in Google; you don't have to use void*...
This, of course, does not guarantee binary compatibility. As usual with 'native' C++, you either compile the component for the customer's platform yourself and deploy the binary, or give them your code... The difference from purely generic component code is that you can do the former at all.
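A minimal sketch of the arrangement both answers describe (all names invented; as written, the untyped core only handles trivially copyable, default-constructible types): the shipped header contains only declarations plus the thin wrapper, while the core's definitions are compiled into the binary you deliver.

    #include <cstddef>

    // stack.h - what the customer gets in source form.
    class StackImpl {                  // untyped core; its definitions live in
    public:                            // stack.cpp, shipped only as a binary
        explicit StackImpl(std::size_t elem_size);
        ~StackImpl();
        void push(const void* elem);   // copies elem_size bytes
        void pop(void* out);
    private:
        struct Data;
        Data* d;                       // PIMPL: the layout stays hidden too
    };

    // Thin, type-safe template wrapper: trivial enough that exposing it
    // reveals nothing about the real implementation.
    template <class T>
    class Stack {
    public:
        Stack() : impl(sizeof(T)) {}
        void push(const T& v) { impl.push(&v); }
        T pop() { T v; impl.pop(&v); return v; }
    private:
        StackImpl impl;
    };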
Maybe using a C++ obfuscator would help: http://www.semdesigns.com/products/obfuscators/CppObfuscationExample.html or Magle It.
First, if you're going to provide the source code, then you have to provide the source code. Sure, you could encrypt it, but even if GCC had a "decrypt before compile" option, it would need to decrypt the code, and if GCC can decrypt the code, so can your customer.
What you're asking is impossible. (If you find a way to do it, I believe the movie industry might have a multi-million contract for you. They currently have to resort to expensive custom hardware to prevent people from ripping content, and that only works to a limited degree)
As for your "obvious reasons" why you don't want to provide the source code, I don't see why they're obvious. What would happen if you provided the source code?
You have two options:
provide the source code in its entirety, or
compile everything that can be precompiled into a (static or dynamic) library, and provide your customer with that, plus the header files.
what about pimpls?
1) My class is templatized. A customer might use any template argument, so I can't pre-compile and send a .o file.
2) The customer might use a different gcc version than mine, so I want him to do the compilation at his end.
Now, I can't reveal my source code, for obvious reasons. The most I can do is reveal the .h file. Any ideas on how I might achieve this? I am thinking about some hook in gcc that supports decryption before compilation, etc. Is this possible?
In short, I want him to be able to compile this code without being able to peek inside.
Consideration 2) above encompasses A) ABI differences such that the same code compiled with different compiler versions/vendors on the same platform is incompatible, as well as B) the differences in system libraries, kernel versions etc. that the code might be dependent on. The only general solution is to compile on the specific platforms. Either you do it for all platforms, or you give them all the source code and they do it. That's not just the headers and template implementation, that's your out-of-line functions too. You might mitigate A) a little by building a wall of more interoperable extern "C" functions, but you're basically stuck when it comes to B).
So, can you decrypt during compilation? Only if you ship your own hacked GCC binaries to them, built for their specific system, which is probably more hassle than providing different builds of your own libraries (though it may address the template/header exposure issue).
Alternatively, you could employ source-code obfuscation techniques. This is probably - practically speaking - as good as it gets. I don't know what tools are out there, but it's an approach people have pursued for decades (though I've yet to hear anyone recommend it), so there are sure to be some mature tools.
Re templated code: other people have suggested a templated front end to a C-style generic implementation shipped as a precompiled object. That may or may not be practical (it clearly risks performance degradation, and you have to capture the set of type-specific operations you need - e.g. by instantiating a type-specific class derived from an abstract operations base class), but either way the precompiled object still runs afoul of B).
One other thought: clients might take your source code, but they are unlikely to understand it as well as you do. Even if they build more systems that depend on their version of it, in a way they're getting more locked in, and they may have more need for your services in the future. And if you see they've not played fair, you can charge them for it appropriately when the time comes.
It seems that GCC 4.5 comes with support for plugins. So you can provide your own .so that would, for instance, be called before the compilation stage starts. You can hide all kinds of tricks in there (decryption of the source file, for example), neatly out of sight. This would also be a portable solution, as no change is made to g++ itself.
This is exactly what I was looking for. You can read more here:
http://www.codesynthesis.com/~boris/blog/2010/05/03/parsing-cxx-with-gcc-plugin-part-1/

When does template instantiation bloat matter in practice?

It seems that in C++ and D, languages which are statically compiled and in which template metaprogramming is a popular technique, there is a decent amount of concern about template instantiation bloat. It seems to me like mostly a theoretical concern, except on very resource-constrained embedded systems. Outside of the embedded space, I have yet to hear of an example of someone being able to demonstrate that it was a problem in practice.
Can anyone provide an example outside of severely resource-limited embedded systems of where template instantiation bloat mattered in practice and had measurable, practically significant negative effects?
There's little problem in C++, because the amount of template stuff you can do in C++ is limited by their complexity.
In D, however... before CTFE (compile-time function evaluation) existed, we had to use templates for string processing. This is also the reason big mangled symbols are compressed in DMD: the strings used as template arguments become part of the mangled symbol names, and when instantiating templates with longer segments of code (for instance), the resulting symbol sizes would spectacularly explode the object format.
It's better nowadays. But on the whole, templates still cause a lot of bloat for a simple reason - they parse faster and are more powerful than in C++, so people naturally use them a lot more (even in cases that technically wouldn't need templates). I must admit I'm one of the prime offenders here (take a look at tools.base if you like, but be sure to keep a barf bag handy - the file is effectively 90% template code).
Template bloat is NOT an issue (it is a mental problem, not a code problem).
Yes, it can get big. But what's the alternative?
You could write all the code yourself manually (once for each type). Do you think writing it manually would make it smaller? The compiler only instantiates the versions it actually needs, and the linker removes the multiple copies spread over compilation units.
So there is no actual bloat.
It is just building what you use. If you use a lot of different types, you have to write more code.
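To spell out the mechanism (the Matrix example is mine; extern template is C++11): every translation unit that uses Matrix<float> would otherwise instantiate its members itself, and the linker folds the duplicates away; an explicit instantiation declaration avoids even generating them.

    // matrix.h
    template <class T>
    struct Matrix {
        T at(int r, int c) const { return data[r * 4 + c]; }
        T data[16];
    };
    extern template struct Matrix<float>;  // don't instantiate here: one TU does it

    // matrix.cpp - the single translation unit that pays the instantiation cost
    template struct Matrix<float>;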
I think you'll need to find an older compiler to see the template code bloat in practice. Modern C++ compilers (and linkers) have been able to optimize it away for a while.
I think it's mainly mental bloat. The next programmer to work on your code will first need to figure out what subset of it matters.
Template instantiation bloat IS an issue in practice, because it can increase compile and link times a lot (!!!).
I personally think that compile time is C++'s #1 problem, and it's mainly due to templates.
I worked on a project with about 50 libs. We had our own RTTI system using templates, and I had to rewrite it because of the template bloat.
Here are some numbers:
libs went from 640 MB to 420 MB
temp files went from 4.3 GB to 2.9 GB
a full rebuild went from 19:30 to 13:10