Profile-guided optimization (C) - C++

Does anyone know this compiler feature? It seems GCC supports it. How does it work? What is the potential gain? In which cases is it useful? Inner loops?
(this question is specific, not about optimization in general, thanks)

It works by inserting extra code that counts the number of times each codepath is taken. When you compile a second time, the compiler uses the knowledge it gained about how your program actually executes, which it could previously only guess at. There are a couple of things PGO can work toward:
Deciding which functions should (or should not) be inlined, based on how often they are called.
Deciding how to place hints about which branch of an "if" statement should be predicted, based on the percentage of calls going one way or the other.
Deciding how to optimize loops based on how many iterations are typically taken each time the loop runs.
You never really know how much these things can help until you test it.
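If you want to try it with GCC, the two-pass workflow looks roughly like the sketch below; the file and program names are hypothetical and the exact flags vary a bit between GCC versions.

// Hypothetical hot function with a branch that PGO can learn about.
// Two-pass GCC build:
//   g++ -O2 -fprofile-generate pgo_demo.cpp -o pgo_demo   # pass 1: instrumented build
//   ./pgo_demo                                            # run a representative workload; writes *.gcda data
//   g++ -O2 -fprofile-use pgo_demo.cpp -o pgo_demo        # pass 2: recompile using the collected profile
#include <cstdio>
#include <cstdlib>

int classify(int x) {
    if (x % 100 == 0)   // rare path: the profile tells the compiler this is unlikely
        return 1;
    return 0;           // common path: good candidate for straight-line code layout
}

int main() {
    long hits = 0;
    for (int i = 0; i < 10000000; ++i)
        hits += classify(std::rand());
    std::printf("%ld\n", hits);
    return 0;
}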

PGO gives about a 5% speed boost when compiling x264, the project I work on, and we have a built-in system for it (make fprofiled). It's a nice free speed boost in some cases, and probably helps more in applications that, unlike x264, are less made up of handwritten assembly.

Jason's advice is right on. The best speedups you are going to get come from "discovering" that you let an O(n²) algorithm slip into an inner loop somewhere, or that you can cache certain computations outside of expensive functions.
Compared to the micro-optimizations that PGO can trigger, these are the big winners. Once you've done that level of optimization, PGO might be able to help. We never had much luck with it though - the cost of the instrumentation was such that our application became unusably slow (by several orders of magnitude).
I like using Intel VTune as a profiler primarily because it is non-invasive compared to instrumenting profilers which change behaviour too much.
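As a contrived illustration of the kind of big winner mentioned above, hoisting a loop-invariant, expensive computation out of an inner loop dwarfs anything PGO's micro-optimizations can buy. The functions below are hypothetical, and a given optimizer may or may not manage this hoist on its own - which is exactly why you measure.

#include <cmath>
#include <vector>

// Before: an expensive, loop-invariant computation runs on every iteration.
double scale_all_slow(std::vector<double>& v, double base) {
    double sum = 0.0;
    for (double& x : v) {
        x *= std::exp(std::log(base) * 2.5);   // same value every time, computed n times
        sum += x;
    }
    return sum;
}

// After: the invariant is computed once, outside the loop.
double scale_all_fast(std::vector<double>& v, double base) {
    const double factor = std::exp(std::log(base) * 2.5);  // hoisted out of the loop
    double sum = 0.0;
    for (double& x : v) {
        x *= factor;
        sum += x;
    }
    return sum;
}

int main() {
    std::vector<double> v(1000000, 1.0);
    return scale_all_fast(v, 2.0) > scale_all_slow(v, 2.0);  // trivial use so nothing is optimised out entirely
}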

The fun thing about optimization is that speed gains are found in the unlikeliest of places.
It's also the reason you need a profiler, rather than guessing where the speed problems are.
I recommend starting with a profiler (gprof if you're using GCC) and just poking around the results of running your application through some normal operations.
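A minimal gprof session under a GCC toolchain looks roughly like this; the file and program names are hypothetical.

// Typical gprof workflow:
//   g++ -O2 -pg profile_me.cpp -o profile_me   # -pg adds gprof instrumentation
//   ./profile_me                               # normal run; writes gmon.out in the working directory
//   gprof ./profile_me gmon.out | less         # flat profile plus call graph
#include <cstdio>

volatile double sink = 0.0;   // volatile so the busy-work below isn't optimised away

void cheap()  { for (int i = 0; i < 1000;    ++i) sink = sink + i; }
void costly() { for (int i = 0; i < 1000000; ++i) sink = sink + i * 0.5; }

int main() {
    for (int i = 0; i < 1000; ++i) { cheap(); costly(); }   // costly() should dominate the flat profile
    std::printf("%f\n", static_cast<double>(sink));
    return 0;
}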

Related

How much faster is C++ code "supposed" to be with optimizations turned on?

I have a program that runs in around 1 minute when compiled with g++ without any options.
Compiling with -O3, however, makes it run in around 1-2 seconds.
My question is whether it is normal to have this much of a speed up? Or is my code perhaps so bad that optimization can take away that much time? Obviously I know my code isn't perfect, but because of this huge speedup I'm beginning to think it's worse than I thought. Please tell me what the "normal" amount of speed up is (if that's a thing), and whether too much speed up can mean bad code that could (and should) be easily optimized by hand instead of relying on the compiler.
How much faster is C++ code “supposed” to be with optimizations turned on?
In theory: There doesn't necessarily need to be any speed difference. Nor does there exist any upper limit to the speed difference. The C++ language simply doesn't specify a difference between optimisation and lack thereof.
In practice: It depends. Some programs have more to gain from optimisation than others. Some behaviours are easier to prove than others. Some optimisations can even make the program slower, because the compiler cannot know about everything that may happen at runtime.
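As a contrived illustration of why there is no upper limit: the compiler is free to do at compile time work that an unoptimised build does at run time, as in the sketch below (nothing here is from the original question).

// With g++ -O0 this loop does a billion additions at run time; with -O2/-O3
// GCC typically replaces the whole loop with its closed-form result, so the
// measured "speedup" from optimisation is essentially unbounded.
#include <cstdio>

int main() {
    long long sum = 0;
    for (long long i = 0; i < 1000000000LL; ++i)
        sum += i;
    std::printf("%lld\n", sum);   // the optimiser can compute this without looping
    return 0;
}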
... 1 minute ... [optimisation] makes it run in around 1-2 seconds.
My question is whether it is normal to have this much of a speed up?
It is entirely normal. You cannot assume that you'll always get as much improvement, but this is not out of the ordinary.
Or is my code perhaps so bad that optimization can take away that much time?
If the program is fast with optimisation, then it is a fast program. If the program is slow without optimisation, we don't care because we can enable optimisation. Usually, only the optimised speed is relevant.
Faster is better than slower, although that is not the only important metric of a program. Readability, maintainability and especially correctness are more important.
Please tell me ... whether ... code ... could ... be ... optimized by hand instead of relying on the compiler.
Everything could be optimized by hand, at least if you write the program in assembly.
... or should ...
No. There is no reason to waste time doing what the compiler has already done for you.
There are sometimes reasons to optimise by hand something that is already well optimised by the compiler. Relative speedup is not one of those reasons. An example of a valid reason is that the non-optimised build may be too slow to be executed for debugging purposes when there are real time requirements (whether hard or soft) involved.

Profile optimised C++/C code

I have some heavily templated C++ code that I am working with. I can compile and profile it with AMD tools and Sleepy in debug mode. However, without optimisation most of the time is spent in the templated code and the STL. With optimised compilation, all the profiling tools that I know of produce garbage information. Does anybody know a good way to profile optimised native code?
PS1:
The code that I am writing is also heavily templated. Most of the time spent in the unoptimised build will be optimised away - I am talking about 96-97% of the run time being spent in templated code without optimisation. This is going to corrupt the accuracy of the profiling. And yes, I can change a lot of the templated code, or at least identify which part of it is introducing the most trouble, and I can do better in those places.
You should focus on the code you wrote, because that is what you can change. Time spent in the STL is irrelevant; just ignore it and focus on the callers of that code. If too much time is spent in the STL, you can probably call some other STL primitive instead of the current one.
Profiling unoptimized code is less interesting, but you can still get some information. If the algorithms used in some parts of the code are totally flawed, it will show up even there. But you should be able to get useful information from any good profiling tool on optimized code. What tools do you use exactly, and why do you call their output garbage?
Also it's usually easy enough to instrument your code by hand and find out exactly which parts are efficient and which are not. It's just a matter of calling timer functions (or reading the processor's cycle counter, if possible) at well-chosen points. I usually do that from unit tests to have reproducible results, but it all depends on the specifics of your program.
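A minimal sketch of that kind of hand instrumentation using standard C++ timers - the phase names here are made up:

#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-ins for the real phases being measured.
void parse_input()    { /* ... */ }
void run_simulation() { /* ... */ }

int main() {
    const auto t0 = Clock::now();
    parse_input();
    const auto t1 = Clock::now();
    run_simulation();
    const auto t2 = Clock::now();

    // Report how long each section took.
    const auto ms = [](Clock::time_point a, Clock::time_point b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("parse: %lld ms, simulate: %lld ms\n",
                static_cast<long long>(ms(t0, t1)),
                static_cast<long long>(ms(t1, t2)));
    return 0;
}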
Tools or instrumenting code are the easy part of optimization. The hard part is finding ways to get faster code where it's needed.
What do you mean by "garbage information"?
Profiling is only really meaningful on optimized builds, so tools are designed to work with them -- thus if you're getting meaningless results, it's probably due to the profiler not finding the right symbols, or needing to instrument the build.
In the case of Intel VTune, for example, I found I got impossible results from the sampler unless I explicitly told it where to find the PDBs for the executable I was tuning. In the instrumented version, I had to fiddle with the settings until it was reliably putting probes into the function calls.
When kriss says
You should focus on the code you wrote, because that is what you can change
that's exactly what I was going to say.
I would add that in my opinion it is easier to do performance tuning first on code compiled without optimization, and then later turn on the optimizer, for the same reason. If something you can fix is costing excess time, it will cost proportionally excess time regardless of what the compiler does, and it's easier to find it in code that hasn't been scrambled.
I don't look for such code by measuring time. If the excess time is, say, 20%, then what I do is randomly pause it several times. As soon as I see something that can obviously be improved on 2 or more samples, then I've found it. It's an oddball method, but it doesn't really miss anything. I do measure the overall time before and after to see how much I saved. This can be done multiple times until you can't find anything to fix. (BTW, if you're on Linux, Zoom is a more automated way to do this.)
Then you can turn on the optimizer and see how much it gives you, but when you see what changes you made, you can see there's really no way the compiler could have done it for you.

Compiler optimization for fastest possible code

I would like to select the compiler optimizations to generate the fastest possible application.
Which of the following settings should I set to true?
Dead store elimination
Eliminate duplicate expressions within basic blocks and functions
Enable loop induction variable and strength reduction
Enable Pentium instruction scheduling
Expand common intrinsic functions
Optimize jumps
Use register variables
There is also the option 'Generate the fastest possible code', which I have obviously set to true. However, when I set this to true, all the above options are still set to false.
So I would like to know if any of the above options will speed up the application if I set them to true?
So I would like to know if any of the above options will speed up the application if I set them to true?
I know some will hate me for this, but nobody here can answer you truthfully. You have to try your program with and without them, and profile each build and see what the results are. Guess-work won't get anybody anywhere.
Compilers already do tons(!) of great optimization, with or without your permission. Your best bet is to write your code in a clean and organized manner, and worry about maintainability and extensibility. As I like to say: Code now, optimize later.
Don't micromanage down to the individual optimization. Compiler writers are very smart people - just turn them all on unless you see a specific need not to. Your time is better spent by optimizing your code (improve algorithmic complexity of your functions, etc) rather than fiddling with compiler options.
My other piece of advice: use a different compiler. Intel has a great reputation as an optimizing compiler; VC and GCC are, of course, also great choices.
You could look at the generated code with different compile options to see which is fastest, but I understand nowadays many people don't have experience doing this.
Therefore, it would be useful to profile the application. If there is an obvious portion requiring speed, add some code to execute it a thousand or ten million times and time it with a CPU-time function such as clock(), if available. The loop should run long enough that other processes running intermittently don't affect the result - ten to twenty seconds is a popular benchmark range. Or run multiple timing trials. Compile different test cases and run them to see what works best.
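A sketch of that repeat-and-time idea, using std::chrono rather than a platform-specific call; the routine under test is hypothetical.

#include <chrono>
#include <cstdio>

// Hypothetical routine whose speed we care about.
double hot_routine(double x) { return x * 1.0000001 + 0.5; }

int main() {
    const long iterations = 10000000;   // long enough to swamp scheduling noise
    volatile double sink = 1.0;         // volatile keeps the loop from being optimised away

    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        sink = hot_routine(sink);
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = stop - start;
    std::printf("%ld calls in %.3f s (%.1f ns/call)\n",
                iterations, elapsed.count(), elapsed.count() * 1e9 / iterations);
    return 0;
}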
Spending an hour or two playing with optimization options will quickly reveal that most have minor effect. However, that same time spent thinking about the essence of the algorithm and making small changes (code removal is especially effective) can often vastly improve execution time.

C/C++ compiler feedback optimization

Has anyone seen real-world numbers for different programs using the feedback optimization that C/C++ compilers offer to support branch prediction, cache preloading functions, etc.?
I searched for it and, amazingly, not even the popular interpreter development groups seem to have checked the effect. Increasing Ruby, Python, PHP, etc. performance by 10% or so ought to be considered useful.
Is there really no benefit, or is the whole developer community just too lazy to use it?
10% is a good ballpark figure. That said, ...
You have to REALLY care about the performance to go this route. The product I work on (DB2) uses PGO and other invasive and aggressive optimizations. Among the costs are significantly longer build times (triple on some platforms) and development and support nightmares.
When something goes wrong it can be non-trivial to map the fault location in the optimized code back to the source. Developers don't usually expect that functions in different modules can end up merged and inlined, and this can have "interesting" effects.
Problems with pointer aliasing, which are nasty to track down, also usually show up with these sorts of optimizations. You have the additional fun of non-deterministic builds (an aliasing problem can show up in Monday's build, vanish again till Thursday's, ...).
The line between what is correct or incorrect compiler behaviour under these sorts of aggressive optimizations also becomes fairly blurred. Even with the luxury of having our compiler guys in house (literally) the optimization issues (either in our source or the compiler) are still not easy to understand and resolve.
From unladen-swallow (a project optimizing the CPython VM):
For us, the final nail in PyBench's coffin was when experimenting with gcc's feedback-directed optimization tools, we were able to produce a universal 15% performance increase across our macrobenchmarks; using the same training workload, PyBench got 10% slower.
So some people are at least looking at it. That said, PGO sets some pretty tricky requirements on the build environment that are hard to satisfy for open-source projects meant to be built by a distributed, heterogeneous group of people. Heavy optimization also creates difficult-to-debug heisenbugs. It's less work to give the compiler explicit hints for the performance-critical parts.
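For example, such explicit hints might look like the sketch below, using the GCC/Clang __builtin_expect extension and the C++20 [[unlikely]] attribute; the function itself is made up.

// Manual branch hints: a lighter-weight alternative to full PGO for a few hot spots.
// __builtin_expect is a GCC/Clang extension; [[unlikely]] requires C++20.
#include <cstddef>
#include <stdexcept>

int parse_byte(const unsigned char* buf, std::size_t len, std::size_t i) {
    if (__builtin_expect(i >= len, 0))     // tell the compiler this branch is rarely taken
        throw std::out_of_range("parse_byte: index past end");

    if (buf[i] == 0xFF) [[unlikely]]       // same idea, expressed with a standard attribute
        return -1;

    return buf[i];
}

int main() {
    const unsigned char data[] = { 0x01, 0x02, 0xFF };
    return parse_byte(data, sizeof data, 0);   // returns 1
}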
That said, I expect significant performance increases from runtime profile guided optimization. JIT'ing allows the optimizer to cope with the profile of data changing across the execution of a program and do many extremely runtime data specific optimizations that would explode the code size for static compilation. Especially dynamic languages need good runtime data based optimization to perform well. With dynamic language performance getting significant attention lately (JavaScript VM's, MS DLR, JSR-292, PyPy and so on) there's a lot of work being done in this area.
Traditionally, improving compiler efficiency via profiling is done with performance analysis tools. However, how the data from the tools can be used in optimization still depends on the compiler you use. For example, GCC is a framework being worked on to produce compilers for different domains, and providing a profiling mechanism in such a compiler framework is extremely difficult.
We can rely on statistical data to do certain optimizations. For instance, GCC unrolls a loop if the loop count is less than a constant (say 7). How it fixes that constant is based on statistical results about the code size generated for different target architectures.
Profile-guided optimizations track specific areas of the source. Details regarding previous run results need to be stored, which is an overhead. The input, on the other hand, needs to be statistically representative of how the target application is actually used. So the complexity level rises with the number of different inputs and outputs. In short, profile-guided optimization needs extensive data collection. Automating such profiling, or embedding it into the source, needs careful monitoring; if not, the entire result will be awry, and in our effort to swim we will actually drown.
However, experimentation on this regard is ongoing. Just have a look at POGO.

c++ profiling/optimization: How to get better profiling granularity in an optimized function

I am using google's perftools (http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html) for CPU profiling---it's a wonderful tool that has helped me perform a great deal of CPU-time improvements on my application.
Unfortunately, I have gotten to the point that the code is still a bit slow, and when compiled using g++'s -O3 optimization level, all I know is that a specific function is slow, but not which aspects of it are slow.
If I remove the -O3 flag, then the unoptimized portions of the program overtake this function, and I don't get a lot of clarity into the actual parts of the function that are slow. If I leave the -O3 flag in, then the slow parts of the function are inlined, and I can't determine which parts of the function are slow.
Any suggestions? Thanks for your help!
For something like this, I've always used the "old school" way of doing it:
Insert statements at various points in the routine you want to measure that record the current time (or CPU time). Then simply print out or log the differences between them and you will know how long each section of code took. From there you can find out what is eating most of the time, and go in and get fine-grained timing within that section until you know what the problem is, and how to fix it.
If the overhead of the function calls is not the problem, you can also force inlining to be off with -fno-inline-small-functions -fno-inline-functions -fno-inline-functions-called-once -fno-inline (I'm not exactly sure how these switches interact with each other, but I think they are independent). Then you can use your ordinary profiler to look at the call graph profile and see what function calls are taking what amount of time.
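A narrower alternative (not mentioned in the original answer) is to keep optimization on everywhere and mark only the functions you want to see in the profile as non-inlinable, for example with GCC's noinline attribute; the functions below are hypothetical.

#include <cstdio>

// Keep -O3 on for everything, but stop the compiler from inlining these specific
// helpers so they show up as separate entries in the profiler's call graph.
// __attribute__((noinline)) is a GCC/Clang extension.
__attribute__((noinline)) double phase_one(double x) { return x * x + 1.0; }
__attribute__((noinline)) double phase_two(double x) { return x / 3.0 - 2.0; }

double process(double x) {
    // At -O3 these calls would normally be inlined and vanish from the profile;
    // noinline preserves the call boundaries at a small run-time cost.
    return phase_two(phase_one(x));
}

int main() {
    double acc = 0.0;
    for (int i = 0; i < 1000000; ++i)
        acc = process(acc * 0.5 + 1e-3);
    std::printf("%f\n", acc);
    return 0;
}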
If you're on linux, use oprofile.
If you're on Windows, use AMD's CodeAnalyst.
Both will give sample-based profiles down to the level of individual source lines or assembly instructions and you should have no problem identifying "hot spots" within functions.
I've spent decades doing performance tuning.
People love their tools, but I swear by this method.