Performance Gains with Visual Studio Whole Program Optimization

Performance Gains with Visual Studio Whole Program Optimization - c++

Our product is a library which we deliver as a dll or static library. I've noticed that using Whole Program Optimization in Visual Studio improves the performance around 30%. This is good but referring to
http://blogs.msdn.com/b/vcblog/archive/2009/02/24/quick-tips-on-using-whole-program-optimization.aspx
I see that it is not suggested to use whole program optimization for libraries that are delivered to customers.
The same article mentions around 3-4% improvement in performance. Now that we see 10 times of the expected performance gain, I am thinking whether we are doing something wrong.
Not sure how to formulate this but I'll give it a try: Apparently our code base has a "problem" that WPO can solve very well. Whatever this "problem" (or problems?) is , it is less important in other software hence WPO has relatively small impact. Now my question is what might be this problem? We would like to optimize our code manually since turning on WPO is not an option.

Probably, you have some functions called many times, which can't be inlined without WPO due to being defined in source files. You can use a profiler to identify these, then move them into headers and mark them inline.

Related

Should I group all my code and move into DLLs?

Will I get better performance if I break my code into pieces put them in DLLs?
Is there something wrong with having multiple DLLs in terms of performance or is it better? Or does not have any affect?
My project is quite large and I heard that DLLs are not suitable for cross-platform apps. Is it true?

DLLs can have positive and negative impact on performance. As with all performance questions you should get data before committing to a strategy.
Huge headers / huge code / slow compilation does not mean slow performance. It's often the other way around: that slow compilation is because you've given the optimizer a lot to work with. And simply rearranging your project won't reduce the amount of code. You'll just cordon off bits so the optimizer has less to work with. You may benefit in project structure, but at the cost of optimization opportunities.
A lot of modern elegant C++ is templated and purely in header files. That allows inlining as the optimizer sees fit. Breaking code into separate images will prevent optimization across image boundaries.
Consider the development cost of using DLLs as well. Interfaces across DLLs present a lot of complications. For example templated / header-inlined code should be avoided in the ABI. It's the entire reason COM was invented: to deal with the difficulties of passing basic objects between DLL boundaries.
To try and answer the question, my gut feeling is that "No", breaking your project into multiple modules will do absolutely nothing to improve performance and is much more likely to make it perform worse. I say that with about 93% confidence. But as I said: The only way to properly answer your question is to measure.

No. DLL(s) cause more delay in loading than a statically linked libraries.
Because the operating system loader needs to open the DLL and link it individually which affects time. Calling DLL functions is not slow because they are pointed in the import address table of your executable image, the loader then modify these pointers to match the address of symbols imported in the DLL.

Faster build times in C++ [duplicate]

I once worked on a C++ project that took about an hour and a half for a full rebuild. Small edit, build, test cycles took about 5 to 10 minutes. It was an unproductive nightmare.
What is the worst build times you ever had to handle?
What strategies have you used to improve build times on large projects?
Update:
How much do you think the language used is to blame for the problem? I think C++ is prone to massive dependencies on large projects, which often means even simple changes to the source code can result in a massive rebuild. Which language do you think copes with large project dependency issues best?

Forward declaration
pimpl idiom
Precompiled headers
Parallel compilation (e.g. MPCL add-in for Visual Studio).
Distributed compilation (e.g. Incredibuild for Visual Studio).
Incremental build
Split build in several "projects" so not compile all the code if not needed.
[Later Edit]
8. Buy faster machines.

My strategy is pretty simple - I don't do large projects. The whole thrust of modern computing is away from the giant and monolithic and towards the small and componentised. So when I work on projects, I break things up into libraries and other components that can be built and tested independantly, and which have minimal dependancies on each other. A "full build" in this kind of environment never actually takes place, so there is no problem.

One trick that sometimes helps is to include everything into one .cpp file. Since includes are processed once per file, this can save you a lot of time. (The downside to this is that it makes it impossible for the compiler to parallelize compilation)
You should be able to specify that multiple .cpp files should be compiled in parallel (-j with make on linux, /MP on MSVC - MSVC also has an option to compile multiple projects in parallel. These are separate options, and there's no reason why you shouldn't use both)
In the same vein, distributed builds (Incredibuild, for example), may help take the load off a single system.
SSD disks are supposed to be a big win, although I haven't tested this myself (but a C++ build touches a huge number of files, which can quickly become a bottleneck).
Precompiled headers can help too, when used with care. (They can also hurt you, if they have to be recompiled too often).
And finally, trying to minimize dependencies in the code itself is important. Use the pImpl idiom, use forward declarations, keep the code as modular as possible. In some cases, use of templates may help you decouple classes and minimize dependencies. (In other cases, templates can slow down compilation significantly, of course)
But yes, you're right, this is very much a language thing. I don't know of another language which suffers from the problem to this extent. Most languages have a module system that allows them to eliminate header files, which area huge factor. C has header files, but is such a simple language that compile times are still manageable. C++ gets the worst of both worlds. A big complex language, and a terrible primitive build mechanism that requires a huge amount of code to be parsed again and again.

Multi core compilation. Very fast with 8 cores compiling on the I7.
Incremental linking
External constants
Removed inline methods on C++ classes.
The last two gave us a reduced linking time from around 12 minutes to 1-2 minutes. Note that this is only needed if things have a huge visibility, i.e. seen "everywhere" and if there are many different constants and classes.
Cheers

IncrediBuild

Unity Builds
Incredibuild
Pointer to implementation
forward declarations
compiling "finished" sections of the proejct into dll's

ccache & distcc (for C/C++ projects) -
ccache caches compiled output, using the pre-processed file as the 'key' for finding the output. This is great because pre-processing is pretty quick, and quite often changes that force recompile don't actually change the source for many files. Also, it really speeds up a full re-compile. Also nice is the instance where you can have a shared cache among team members. This means that only the first guy to grab the latest code actually compiles anything.
distcc does distributed compilation across a network of machines. This is only good if you HAVE a network of machines to use for compilation. It goes well with ccache, and only moves the pre-processed source around, so the only thing you have to worry about on the compiler engine systems is that they have the right compiler (no need for headers or your entire source tree to be visible).

The best suggestion is to build makefiles that actually understand dependencies and do not automatically rebuild the world for a small change. But, if a full rebuild takes 90 minutes, and a small rebuild takes 5-10 minutes, odds are good that your build system already does that.
Can the build be done in parallel? Either with multiple cores, or with multiple servers?
Checkin pre-compiled bits for pieces that really are static and do not need to be rebuilt every time. 3rd party tools/libraries that are used, but not altered are a good candidate for this treatment.
Limit the build to a single 'stream' if applicable. The 'full product' might include things like a debug version, or both 32 and 64 bit versions, or may include help files or man pages that are derived/built every time. Removing components that are not necessary for development can dramatically reduce the build time.
Does the build also package the product? Is that really required for development and testing? Does the build incorporate some basic sanity tests that can be skipped?
Finally, you can re-factor the code base to be more modular and to have fewer dependencies. Large Scale C++ Software Design is an excellent reference for learning to decouple large software products into something that is easier to maintain and faster to build.
EDIT: Building on a local filesystem as opposed to a NFS mounted filesystem can also dramatically speed up build times.

Fiddle with the compiler optimisation flags,
use option -j4 for gmake for parallel compilation (multicore or single core)
if you are using clearmake , use winking
we can take out the debug flags..in extreme cases.
Use some powerful servers.

This book Large-Scale C++ Software Design has very good advice I've used in past projects.

Minimize your public API
Minimize inline functions in your API. (Unfortunately this also increases linker requirements).
Maximize forward declarations.
Reduce coupling between code. For instance pass in two integers to a function, for coordinates, instead of your custom Point class that has it's own header file.
Use Incredibuild. But it has some issues sometimes.
Do NOT put code that get exported from two different modules in the SAME header file.
Use the PImple idiom. Mentioned before, but bears repeating.
Use Pre-compiled headers.
Avoid C++/CLI (i.e. managed c++). Linker times are impacted too.
Avoid using a global header file that includes 'everything else' in your API.
Don't put a dependency on a lib file if your code doesn't really need it.
Know the difference between including files with quotes and angle brackets.

Powerful compilation machines and parallel compilers. We also make sure the full build is needed as little as possible. We don't alter the code to make it compile faster.
Efficiency and correctness is more important than compilation speed.

In Visual Studio, you can set number of project to compile at a time. Its default value is 2, increasing that would reduce some time.
This will help if you don't want to mess with the code.

This is the list of things we did for a development under Linux :
As Warrior noted, use parallel builds (make -jN)
We use distributed builds (currently icecream which is very easy to setup), with this we can have tens or processors at a given time. This also has the advantage of giving the builds to the most powerful and less loaded machines.
We use ccache so that when you do a make clean, you don't have to really recompile your sources that didn't change, it's copied from a cache.
Note also that debug builds are usually faster to compile since the compiler doesn't have to make optimisations.

We tried creating proxy classes once.
These are really a simplified version of a class that only includes the public interface, reducing the number of internal dependencies that need to be exposed in the header file. However, they came with a heavy price of spreading each class over several files that all needed to be updated when changes to the class interface were made.

In general large C++ projects that I've worked on that had slow build times were pretty messy, with lots of interdependencies scattered through the code (the same include files used in most cpps, fat interfaces instead of slim ones). In those cases, the slow build time was just a symptom of the larger problem, and a minor symptom at that. Refactoring to make clearer interfaces and break code out into libraries improved the architecture, as well as the build time. When you make a library, it forces you to think about what is an interface and what isn't, which will actually (in my experience) end up improving the code base. If there's no technical reason to have to divide the code, some programmers through the course of maintenance will just throw anything into any header file.

Cătălin Pitiș covered a lot of good things. Other ones we do:
Have a tool that generates reduced Visual Studio .sln files for people working in a specific sub-area of a very large overall project
Cache DLLs and pdbs from when they are built on CI for distribution on developer machines
For CI, make sure that the link machine in particular has lots of memory and high-end drives
Store some expensive-to-regenerate files in source control, even though they could be created as part of the build
Replace Visual Studio's checking of what needs to be relinked by our own script tailored to our circumstances

It's a pet peeve of mine, so even though you already accepted an excellent answer, I'll chime in:
In C++, it's less the language as such, but the language-mandated build model that was great back in the seventies, and the header-heavy libraries.
The only thing that is wrong about Cătălin Pitiș' reply: "buy faster machines" should go first. It is the easyest way with the least impact.
My worst was about 80 minutes on an aging build machine running VC6 on W2K Professional. The same project (with tons of new code) now takes under 6 minutes on a machine with 4 hyperthreaded cores, 8G RAM Win 7 x64 and decent disks. (A similar machine, about 10..20% less processor power, with 4G RAM and Vista x86 takes twice as long)
Strangely, incremental builds are most of the time slower than full rebuuilds now.

Full build is about 2 hours. I try to avoid making modification to the base classes and since my work is mainly on the implementation of these base classes I only need to build small components (couple of minutes).

Create some unit test projects to test individual libraries, so that if you need to edit low level classes that would cause a huge rebuild, you can use TDD to know your new code works before you rebuild the entire app. The John Lakos book as mentioned by Themis has some very practical advice for restructuring your libraries to make this possible.

Profiling line-by-line in C++

I have a C++ program I am trying to optimize.
Since I want it to run fast, I am not using a lot of function calls. Most profiling tool I have seen can give you profiling info in a function-call resolution. However, I would like it in a line-by-line resolution. Is there some option like this?
I am using Visual Studio 2010 on Windows.
Thanks.

Intel Parallel Amplifier should be capable of what you want. If that is what you want:

If you're running with on an AMD processor, CodeAnalyst is free and can do that (at least, in time-based profiling); you can actually "zoom" in and out seeing what is taking the most CPU time from processes to functions down to single assembly instructions.
However, keep in mind that to get meaningful results to that resolution with time-based profiling you should run the critical part of the code several times, otherwise the statistics you get doesn't have much sense.
By the way, in my opinion you should forget about the less function calls=>faster idea. If the cost of a function call is bigger than its "payload", the compiler should be able to figure out by itself if it's convenient to inline the call, and in some cases even inlining too much can slow down the code.

AQTime is a commercial profiler for Windows and I have found it to work pretty well for both function and line timings. One thing I like about it is that you do not have to fiddle with compiler options or Visual Studio settings -- i.e. you do not need any additional compiler options to enable profiling: All you need to do the profiling is the pdb (symbol) file and the executable. (And yes, you can create a pdb file for your release-compile.)

IMHO, this method is best, for these reasons, and here's an example of a 43x speedup. It's not a well-known technique, except to a small number of people, for one example, and another, and another. You may be surprised that it's very low-tech and manual, but you can't beat the results.
Oh, and by the way, for Visual Studio, LTProf may well be the next best thing. It gives you line-level percents, derived from stack samples taken at random wall-clock times. Don't get sucked in by a lot of fancy UI options or promises of accuracy of timing. Those things don't matter. What matters is that it pinpoints the spots worth optimizing.

Runtime optimization of static languages: JIT for C++?

Is anyone using JIT tricks to improve the runtime performance of statically compiled languages such as C++? It seems like hotspot analysis and branch prediction based on observations made during runtime could improve the performance of any code, but maybe there's some fundamental strategic reason why making such observations and implementing changes during runtime are only possible in virtual machines. I distinctly recall overhearing C++ compiler writers mutter "you can do that for programs written in C++ too" while listening to dynamic language enthusiasts talk about collecting statistics and rearranging code, but my web searches for evidence to support this memory have come up dry.

Profile guided optimization is different than runtime optimization. The optimization is still done offline, based on profiling information, but once the binary is shipped there is no ongoing optimization, so if the usage patterns of the profile-guided optimization phase don't accurately reflect real-world usage then the results will be imperfect, and the program also won't adapt to different usage patterns.
You may be interesting in looking for information on HP's Dynamo, although that system focused on native binary -> native binary translation, although since C++ is almost exclusively compiled to native code I suppose that's exactly what you are looking for.
You may also want to take a look at LLVM, which is a compiler framework and intermediate representation that supports JIT compilation and runtime optimization, although I'm not sure if there are actually any LLVM-based runtimes that can compile C++ and execute + runtime optimize it yet.

I did that kind of optimization quite a lot in the last years. It was for a graphic rendering API that I've implemented. Since the API defined several thousand different drawing modes as general purpose function was way to slow.
I ended up writing my own little Jit-compiler for a domain specific language (very close to asm, but with some high level control structures and local variables thrown in).
The performance improvement I got was between factor 10 and 60 (depended on the complexity of the compiled code), so the extra work paid off big time.
On the PC I would not start to write my own jit-compiler but use either LIBJIT or LLVM for the jit-compilation. It wasn't possible in my case due to the fact that I was working on a non mainstream embedded processor that is not supported by LIBJIT/LLVM, so I had to invent my own.

The answer is more likely: no one did more than PGO for C++ because the benefits are likely unnoticeable.
Let me elaborate: JIT engines/runtimes have both blesses and drawbacks from their developer's view: they have more information at runtime but much little time to analyze.
Some optimizations are really expensive and you will unlikely see without a huge impact on start time are those one like: loop unrolling, auto-vectorization (which in most cases is also based on loop unrolling), instruction selection (to use SSE4.1 for CPU that use SSE4.1) combined with instruction scheduling and reordering (to use better super-scalar CPUs). This kind of optimizations combine great with C like code (that is accessible from C++).
The single full-blown compiler architecture to do advanced compilation (as far as I know) is the Java Hotspot compilation and architectures with similar principles using tiered compilation (Java Azul's systems, the popular to the day JaegerMonkey JS engine).
But one of the biggest optimization on runtime is the following:
Polymorphic inline caching (meaning that if you run the first loop with some types, the second time, the code of the loop will be specialized types that were from previous loop, and the JIT will put a guard and will put as default branch the inlined types, and based on it, from this specialized form using a SSA-form engine based will apply constant folding/propagation, inlining, dead-code-elimination optimizations, and depends of how "advanced" the JIT is, will do an improved or less improved CPU register assignment.)
As you may notice, the JIT (hotspots) will improve mostly the branchy code, and with runtime information will get better than a C++ code, but a static compiler, having at it's side the time to do analysis, instruction reordering, for simple loops, will likely get a little better performance. Also, typically, the C++ code, areas that need to be fast tends to not be OOP, so the information of the JIT optimizations will not bring such an amazing improvement.
Another advantage of JITs is that JIT works cross assemblies, so it has more information if it wants to do inlining.
Let me elaborate: let's say that you have a base class A and you have just one implementation of it namely B in another package/assembly/gem/etc. and is loaded dynamically.
The JIT as it see that B is the only implementation of A, it can replace everywhere in it's internal representation the A calls with B codes, and the method calls will not do a dispatch (look on vtable) but will be direct calls. Those direct calls may be inlined also. For example this B have a method: getLength() which returns 2, all calls of getLength() may be reduced to constant 2 all over. At the end a C++ code will not be able to skip the virtual call of B from another dll.
Some implementations of C++ do not support to optimize over more .cpp files (even today there is the -lto flag in recent versions of GCC that makes this possible). But if you are a C++ developer, concerned about speed, you will likely put the all sensitive classes in the same static library or even in the same file, so the compiler can inline it nicely, making the extra information that JIT have it by design, to be provided by developer itself, so no performance loss.

visual studio has an option for doing runtime profiling that then can be used for optimization of code.
"Profile Guided Optimization"

Microsoft Visual Studio calls this "profile guided optimization"; you can learn more about it at MSDN. Basically, you run the program a bunch of times with a profiler attached to record its hotspots and other performance characteristics, and then you can feed the profiler's output into the compiler to get appropriate optimizations.

I believe LLVM attempts to do some of this. It attempts to optimize across the whole lifetime of the program (compile-time, link-time, and run-time).

Reasonable question - but with a doubtful premise.
As in Nils' answer, sometimes "optimization" means "low-level optimization", which is a nice subject in its own right.
However, it is based on the concept of a "hot-spot", which has nowhere near the relevance it is commonly given.
Definition: a hot-spot is a small region of code where a process's program counter spends a large percentage of its time.
If there is a hot-spot, such as a tight inner loop occupying a lot of time, it is worth trying to optimize at the low level, if it is in code that you control (i.e. not in a third-party library).
Now suppose that inner loop contains a call to a function, any function. Now the program counter is not likely to be found there, because it is more likely to be in the function. So while the code may be wasteful, it is no longer a hot-spot.
There are many common ways to make software slow, of which hot-spots are one. However, in my experience, that is the only one of which most programmers are aware, and the only one to which low-level optimization applies.
See this.

What are the negative consequences of including and/or linking things that aren't used by your binary?

Let's say that I have a binary that I am building, and I include a bunch of files that are never actually used, and do the subsequent linking to the libraries described by those include files? (again, these libraries are never used)
What are the negative consequences of this, beyond increased compile time?

A few I can think of are namespace pollution and binary size

In addition to compile time; Increased complexity, needless distraction while debugging, a maintenance overhead.
Apart from that, nothing.

In addition to what Sasha lists is maintenance cost. Will you be able to detect easily what is used and what is not used in the future when and if you chose to remove unused stuff?

If the libraries are never used, there should be no increase in size for the executable.

Depending on the exact linker, you might also notice that the global objects of your unused libraries still get constructed. This implies a memory overhead and increases startup costs.

If the libraries you're including but not using aren't on the target system, it won't be able to compile even though they aren't necessary.

Here is my answer for a similar question concerning C and static libraries. Perhaps it is useful to you in the context of C++ as well.

You mention an increase in the compilation time. From that I understand the libraries are statically linked, and not dynamically. In this case, it depends how the linker handles unused functions. If it ignores them, you will have mostly maintenance problems. If they’ll be included, the executable size will increase. Now, this is more significant than the place it takes on the hard drive. Large executables could run slower due to caching issues. If active code and non-active code are adjacent in the exe, they will be cached together, making the cache effectively smaller and less efficient.
VC2005 and above have an optimization called PGO, which orders the code within the executable in a way that ensures effective caching of code that is often used. I don’t know if g++ has a similar optimization, but it’s worth looking into that.

A little compilation here of the issues, wiki-edit it as necessary:
The main problem appears to be: Namespace Pollution
This can cause problems in future debugging, version control, and increased future maintenance cost.
There will also be, at the minimum, minor Binary Bloat, as the function/class/namespace references will be maintained (In the symbol table?). Dynamic libraries should not greatly increase binary size(but they become a dependency for the binary to run?). Judging from the GNU C compiler, statically linked libraries should not be included in final binary if they are never referenced in the source. (Assumption based on the C compiler, may need to clarify/correct)
Also, depending on the nature of your libraries, global and static objects/variables may be instantiated, causing increased startup time and memory overhead.
Oh, and increased compile/linking time.

I find it frustrating when I edit a file in the source tree because some symbol that I'm working on appears in the source file (e.g. a function name, where I've just changed the prototype - or, sadly but more typically, just added the prototype to a header) so I need to check that the use is correct, or the compiler now tells me the use in that file is incorrect. So, I edit the file. Then I see a problem - what is this file doing? And it turns out that although the code is 'used' in the product, it really isn't actively used at all.
I found an occurrence of this problem on Monday. A file with 10,000+ lines of code invoked a function 'extern void add_remainder(void);' with an argument of 0. So, I went to fix it. Then I looked at the rest of the code...it turned out it was a development stub from about 15 years ago that had never been removed. Cleanly excising the code turned out to involve minor edits to more than half-a-dozen files - and I've not yet worked out whether it is safe to remove the enumeration constant from the middle of an enumeration in case. Temporarily, that is marked 'Unused/Obsolete - can it be removed safely?'.
That chunk of code has had zero cove coverage for the last 15 years - production, test, ... True, it's only a tiny part of a vast system - percentage-wise, it's less than a 1% blip on the chart. Still, it is extra wasted code.
Puzzling. Annoying. Depressingly common (I've logged, and fixed, at least half a dozen similar bugs this year so far).
And a waste of my time - and other developers' time. The file had been edited periodically over the years by other people doing what I was doing - a thorough job.

I have never experienced any problems with linking a .lib file of which only a very small part is used. Only the code that is really used will be linked into the executable, and the linking time did not increase noticeably (with Visual Studio).

If you link to binaries and they get loaded at runtime, they may perform non-trivial initialization which can do anything from allocate a small amount of memory to consume scarce resources to alter the state of your module in ways you don't expect, and beyond.
You're better off getting rid of stuff you don't need, simply to eliminate a bunch of unknowns.

It could perhaps even fail to compile if the build tree isn't well maintained. if your'e compiling on embedded systems without swap space. The compiler can run out of memory while trying to compile a massive object file.
It happened at work to us recently.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js