How to use profile guided optimizations in g++?

Also, can anyone point me to a good tutorial on the subject? I can't find any.

-fprofile-generate will instrument the application with profiling code. While it is actually running, the application will log certain events that could improve performance if that usage pattern were known at compile time. Branches taken, opportunities for inlining, and so on can all be logged, but I'm not sure in detail how GCC implements this.
After the program exits, it will dump all this data into *.gcda files, which are essentially log data for a test run. After rebuilding the application with the -fprofile-use flag, GCC will take the *.gcda log data into account when doing its optimizations, usually increasing the performance significantly. Of course, this depends on many factors.

From this example:
g++ -O3 -fprofile-generate [more params here, like -march=native ...] -o executable_name
// run my program's benchmarks, or something to stress its most common path
g++ -O3 -fprofile-use [more params here, like -march=native...] -o executable_name
Basically, you first compile and link with one extra flag, -fprofile-generate, passed to both the compile and the link step.
Then, when you run the program, it will by default create the .gcda files "next to" your .o files (the location is hard coded to the full path where they were built).
You can optionally change where it creates these .gcda files with the -fprofile-dir=XXX setting.
Then you recompile and relink with the -fprofile-use parameter, and the compiler uses the recorded profile data to guide its optimizations.
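Putting those steps together, a minimal sketch of the full cycle might look like this (my_program.cpp and the workload step are placeholders for your own sources and a representative run):
g++ -O3 -march=native -fprofile-generate my_program.cpp -o my_program
// run the program on a representative workload so it writes its *.gcda profile data
./my_program
g++ -O3 -march=native -fprofile-use my_program.cpp -o my_program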

The tricky bit is setting up the makefiles.
You definitely need separate output directories for object files. I would recommend naming them "profile" and "release". You might have to copy the *.gcda files that result from the profile run so that GCC finds them in the release build step.
The result will almost certainly be faster. It will probably be larger as well. The -fprofile-use option enables many other optimization steps that are otherwise only enabled by -O3.
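As a rough sketch of that two-directory setup (the profile/ and release/ directory names follow the suggestion above, and my_program.cpp is a placeholder):
mkdir -p profile release
g++ -O3 -fprofile-generate -fprofile-dir=profile my_program.cpp -o profile/my_program
// run a representative workload; the *.gcda files end up under profile/
./profile/my_program
g++ -O3 -fprofile-use -fprofile-dir=profile my_program.cpp -o release/my_program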

Related

Required compile flags in order to use perf

I am trying to use Linux perf and hotspot to understand the call stacks/traces of my C++ application.
Should the program be compiled in debug mode or in release mode? Assume I have only one file, inline.cpp. I have seen the following used in one of the examples:
g++ -O2 -g inline.cpp -o inline
perf record --call-graph dwarf ./inline
I am wondering: is it necessary to compile the program with debug info (-g) and optimization (-O2)? With which flags should the executable be compiled to make it useful to run with perf record?
Does it make any difference if we compile the program without any compiler flags?
g++ inline.cpp -o inline
First of all, -g and -O2 aren't opposites. -g specifies that debugging symbols will be generated, so that you can associate hotspots with actual lines of code. -O2 specifies that code optimization should be performed; this is ordinarily not done with code you intend to run in a debugger because it makes it more difficult to follow the execution.
A profiler measures the performance of an executable, not of source code. If you profile an unoptimized executable you'll see where the unoptimized executable's performance problems are, but those may be different problems than in the optimized executable. Presuming you care about the performance of the optimized executable (because it's what users will ordinarily be running) you should be profiling that. The use of -O2 will definitely make it harder to understand where the performance problems are, but that's just the nature of profiling code.
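A reasonable starting point, then, is the optimized build with debug info exactly as shown in the question. As a hedged alternative sketch, if you prefer frame-pointer based call graphs over DWARF unwinding (inline.cpp as in the question):
g++ -O2 -g -fno-omit-frame-pointer inline.cpp -o inline
perf record --call-graph fp ./inline
perf report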

Are there any downsides to compiling with -g flag?

GDB documentation tells me that in order to compile for debugging, I need to ask my compiler to generate debugging symbols. This is done by specifying a '-g' flag.
Furthermore, the GDB documentation recommends I always compile with the '-g' flag. This sounds good, and I'd like to do that.
But first, I'd like to find out about downsides. Are there any penalties involved with compiling-for-debugging in production code?
I am mostly interested in:
GCC as the compiler of choice
Red Hat Linux as the target OS
C and C++ languages
(Although information about other environments is welcome as well)
Many thanks!
If you use -g (which on recent GCC or Clang can be used with optimization flags like -O2):
compilation time is slower (and linking will use a lot more memory)
the executable is a bigger file (see elf(5) and use readelf(1)...)
the executable carries a lot of information about your source code.
you can use GDB easily
some interesting libraries, like Ian Taylor's libbacktrace, require DWARF information (e.g. -g)
If you don't use -g it would be harder to use the GDB debugger (but possible).
So if you transmit the binary executable to a partner who should not be able to understand how your source code was written, you need to avoid -g.
See also the strip(1) and strace(1) commands.
Notice that using the -g flag for debugging information is also valid for OCaml and Rust.
PS. Recent GCC (e.g. GCC 10 or GCC 11 in 2021) accepts many debugger flags. With -g3 your executable carries more debug information (e.g. descriptions of C++ macros and their expansion) than with -g or -g1. Of course, compilation time increases, and executable size also. In principle, your GCC plugin (perhaps Bismon in 2021, or those inside the source code of the Linux kernel) could add even more debug information. In practice, you won't do that unless you can improve your debugger. However, a GCC plugin (or some #pragmas) can remove some debug information (e.g. remove debug information for a selected set of functions).
Generally, adding debug information increases the size of the binary files (or creates extra files for the debug information). That's nowadays usually not a problem, unless you're distributing it over slow networks. And of course this debug information may help others in analyzing your code, if they want to do that. Typically, the -g flag is used together with -O0 (the default), which disables compiler optimization and generates code that is as close as possible to the source, so that debugging is easier. While you can use debug information together with optimizations enabled, this is really tricky, because variables may not exist, or the sequence of instructions may be different than in the source. This is generally only done if an error needs to be analyzed that only happens after the optimizations are enabled. Of course, the downside of -O0 is poorer performance.
So as a conclusion: Typically one uses -g -O0 during development, and for distribution or production code just -O3.
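If the concern is shipping a lean binary while still being able to analyze core dumps later, one common pattern (sketched here with prog.cpp and prog as placeholder names) is to build with -g, split the debug information into a separate file, and strip the binary you distribute:
g++ -O2 -g prog.cpp -o prog
// copy the debug info into its own file
objcopy --only-keep-debug prog prog.debug
// remove the debug info from the binary you ship
strip --strip-debug prog
// record where GDB can find the separate debug file later
objcopy --add-gnu-debuglink=prog.debug prog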

Nonsense results from gprof

I've been trying to profile some C++ with gprof 2.25.2 (under Cygwin) and it is reporting that 10% of the time is being spent in a function which I know is not being called. (I put a print statement into the relevant function to verify this.) It also seems to think that this function is calling itself recursively (number of calls is 500+16636500), which it definitely isn't.
It's a large enough program that I don't have an easy way of producing a minimal working example I can post here, but if anyone has any ideas about what might be causing this, I would be grateful to know.
Edit: building with CMake + g++. CMAKE_BUILD_TYPE=RELWITHDEBINFO.
I'll assume you're using gcc/g++...
This sounds like a case of the debug symbols being out-of-date with respect to your source code or executable. Try cleaning your build space, recompiling (with -g or -ggdb3, of course). If you're compiling with optimizations and you can afford to turn them off (i.e. -O0 instead of -O1, -O2 or -O3), do so for this run. If that works, try -O1 or -O2 and see what happens.
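It is also worth double-checking the build flags themselves: gprof relies on the executable being compiled and linked with -pg, and each run then writes a gmon.out file that gprof reads. A minimal sketch (main.cpp and prog are placeholder names):
g++ -O2 -g -pg main.cpp -o prog
// running the instrumented binary writes gmon.out in the current directory
./prog
gprof ./prog gmon.out > analysis.txt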

Is a program compiled with -g gcc flag slower than the same program compiled without -g?

I'm compiling a program with -O3 for performance and -g for debug symbols (in case of crash I can use the core dump). One thing bothers me a lot: does the -g option result in a performance penalty? When I look at the output of the compilation with and without -g, I see that the output without -g is 80% smaller than the output of the compilation with -g. If the extra space goes to the debug symbols, I don't care about it (I guess), since this part is not used during runtime. But if for each instruction in the compilation output without -g I need to do 4 more instructions in the compilation output with -g, then I certainly prefer to stop using the -g option, even at the cost of not being able to process core dumps.
How can I know the size of the debug symbols section inside the program, and, in general, does compiling with -g create a program which runs slower than the same code compiled without -g?
Citing from the gcc documentation:
GCC allows you to use -g with -O. The shortcuts taken by optimized
code may occasionally produce surprising results: some variables you
declared may not exist at all; flow of control may briefly move where
you did not expect it; some statements may not be executed because
they compute constant results or their values are already at hand;
some statements may execute in different places because they have been
moved out of loops.
that means:
I will insert debugging symbols for you, but I won't try to retain them if an optimization pass throws them out; you'll have to deal with that.
Debugging symbols aren't written into the code but into another section called "debug section" which isn't even loaded at runtime (only by a debugger). That means: no code changes. You shouldn't notice any performance difference in code execution speed but you might experience some slowness if the loader needs to deal with the larger binary or if it takes into account the increased binary size somehow. You will probably have to benchmark the app yourself to be 100% sure in your specific case.
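You can check this yourself: the .debug_* sections in the ELF file are not marked allocatable, so they are never mapped into memory at runtime (prog is a placeholder binary name):
// the .debug_* sections are listed without the 'A' (alloc) flag
readelf --wide -S ./prog | grep debug
// the text/data/bss totals reported by size are unaffected by -g
size ./prog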
Notice that there's also another option from gcc 4.8:
-Og
Optimize debugging experience. -Og enables optimizations that do not interfere with debugging. It should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience.
This flag will impact performance because it disables any optimization pass that would interfere with debugging information.
Finally, it might even happen that some optimizations are better suited to a specific architecture rather than another one, and unless instructed to do so for your specific processor (see the -march/-mtune options for your architecture), with -O3 gcc will do its best for a generic architecture. That means you might even experience -O3 being slower than -O2 in some contrived scenarios. "Best-effort" doesn't always mean "the best available".
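As a rough illustration of how these flags are commonly combined (file and output names are placeholders):
// edit-compile-debug cycle
g++ -Og -g main.cpp -o prog_debug
// release build targeting a generic CPU
g++ -O2 main.cpp -o prog_release
// tuned for the CPU of the build machine; the result may not run on older CPUs
g++ -O3 -march=native main.cpp -o prog_native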

Optimization and flags for making a static library with g++

I am just starting with the g++ compiler on Linux and have some questions about the compiler flags. Here they are.
Optimizations
I read about the optimization flags -O1, -O2 and -O3 in the g++ manual page. I didn't understand when to use these flags. Usually what optimization level do you use? The g++ manual says the following for -O2.
Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify -O2. As compared to -O, this option increases both compilation time and the performance of the generated code.
If it is not doing inlining and loop unrolling, how are the said performance benefits achieved, and is this option recommended?
Static Library
How do I create a static library using g++? In Visual Studio, I can choose a class library project and it will be compiled into "lib" file. What is the equivalent in g++?
The rule of thumb:
When you need to debug, use -O0 (and -g to generate debugging symbols.)
When you are preparing to ship it, use -O2.
When you use gentoo, use -O3...!
When you need to put it on an embedded system, use -Os (optimize for size, not for efficiency.)
The gcc manual lists all the options implied by every optimization level. At -O2, you get things like constant folding, branch prediction and so on, which can significantly change the speed of your application, depending on your code. The exact options are version dependent, but they are documented in great detail.
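If you want to see exactly which optimization passes your GCC version enables at a given level, you can also ask the compiler itself; for example:
g++ -Q --help=optimizers -O2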
To build a static library, you use ar as follows:
ar rc libfoo.a foo.o foo2.o ....
ranlib libfoo.a
Ranlib is not always necessary, but there is no reason for not using it.
Regarding when to use what optimization option - there is no single correct answer.
Certain optimization levels may, at times, decrease performance. It depends on the kind of code you are writing and the execution pattern it has, and depends on the specific CPU you are running on.
(To give a simple canonical example - the compiler may decide to use an optimization that makes your code slightly larger than before. This may cause a certain part of the code to no longer fit into the instruction cache, at which point many more accesses to memory would be required - in a loop, for example).
It is best to measure and optimize for whatever you need. Try, measure and decide.
One important rule of thumb - the more optimizations are performed on your code, the harder it is to debug it using a debugger (or read its disassembly), because the C/C++ source view gets further away from the generated binary. It is a good rule of thumb to work with fewer optimizations when developing / debugging for this reason.
There are many optimizations that a compiler can perform, other than loop unrolling and inlining. Loop unrolling and inlining are specifically mentioned there since, although they make the code faster, they also make it larger.
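If you do want those size-increasing passes on top of -O2, GCC exposes flags to enable them individually; whether they actually help is workload dependent, so measure (main.cpp is a placeholder):
g++ -O2 -finline-functions -funroll-loops main.cpp -o prog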
To make a static library, use 'g++ -c' to generate the .o files and 'ar' to archive them into a library.
Regarding the static library question, the answer given by David Cournapeau is correct, but you can alternatively use the 's' flag with 'ar' rather than running ranlib on your static library file. The 'ar' manual page states that
Running ar s on an archive is equivalent to running ranlib on it.
Whichever method you use is just a matter of personal preference.
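Putting the pieces together, a minimal end-to-end sketch (foo.cpp, foo2.cpp and main.cpp are placeholder file names) would be:
// compile the translation units to object files
g++ -O2 -c foo.cpp foo2.cpp
// archive them; the 's' modifier adds the index that ranlib would otherwise create
ar rcs libfoo.a foo.o foo2.o
// link an application against the static library
g++ main.cpp -L. -lfoo -o main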