I have a Fortran program I would like to profile with GNU coverage. So I compiled the program with GCC 11.2 with these coverage options:
-fprofile-arcs -ftest-coverage
Also, I add flags to disallow compiler to inline the code:
-fno-inline -fno-inline-small-functions -fno-default-inline
I turned off lto and add -lcgov to linker. This setup worked well for a sample program I proved. But when I tried to use it for the program I am interested in it did not generate any *.gcno files, just nothing. Execution, however, exited well (0 exit code) producing correct results.
My question is, how can I find where the problem is. Without an error message, I don't know where to start. It is a rather bigger program ~10 MB of source code, can that be a problem? Also, it heavily depends on MKL, can the external library be the problem? Once I accidentally mixed compile time and runtime environments and it complained about the version of libgcov.so, so something is working after all. Or do you have any other suggestions for coverage profiling?
Related
I am trying to understand linux perf and hotspot to understand call stacks/trace of my c++ application.
Should the program compiled in debug mode or in release mode ? Assuming I have only one file inline.cpp. I have seen in one of the example using
g++ -O2 -g inline.cpp -o inline
perf record --call-graph dwarf ./inline
I am wondering is it necessary to compile the program in debug (-g) and optimization -O2 ? What are flag that the executable is compiled in order make it useful to run with perf record ?
Does it make any difference if we compile the program with out compiler flags ?
g++ inline.cpp -o inline
First of all, -g and -O2 aren't opposites. -g specifies that debugging symbols will be generated, so that you can associate hotspots with actual lines of code. -O2 specifies that code optimization should be performed; this is ordinarily not done with code you intend to run in a debugger because it makes it more difficult to follow the execution.
A profiler measures the performance of an executable, not of source code. If you profile an unoptimized executable you'll see where the unoptimized executable's performance problems are, but those may be different problems than in the optimized executable. Presuming you care about the performance of the optimized executable (because it's what users will ordinarily be running) you should be profiling that. The use of -O2 will definitely make it harder to understand where the performance problems are, but that's just the nature of profiling code.
GDB documentation tells me that in order to compile for debugging, I need to ask my compiler to generate debugging symbols. This is done by specifying a '-g' flag.
Furthermore, GDB doc recommends I'd always compile with a '-g' flag. This sounds good, and I'd like to do that.
But first, I'd like to find out about downsides. Are there any penalties involved with compiling-for-debugging in production code?
I am mostly interested in:
GCC as the compiler of choice
Red hat Linux as target OS
C and C++ languages
(Although information about other environments is welcome as well)
Many thanks!
If you use -g (which on recent GCC or Clang can be used with optimization flags like -O2):
compilation time is slower (and linking will use a lot more memory)
the executable is a bigger file (see elf(5) and use readelf(1)...)
the executable carries a lot of information about your source code.
you can use GDB easily
some interesting libraries, like Ian Taylor's libbacktrace, requires DWARF information (e.g. -g)
If you don't use -g it would be harder to use the GDB debugger (but possible).
So if you transmit the binary executable to a partner that should not understand how your source code was written, you need to avoid -g
See also the strip(1) and strace(1) commands.
Notice that using the -g flag for debugging information is also valid for Ocaml, Rust
PS. Recent GCC (e.g. GCC 10 or GCC 11 in 2021) accept many debugger flags. With -g3 your executable carries more debug information (e.g. description of C++ macros and their expansion) that with -g or -g1. Of course, compilation time increases, and executable size also. In principle, your GCC plugin (perhaps Bismon in 2021, or those inside the source code of the Linux kernel) could add even more debug information. In practice, you won't do that unless you can improve your debugger. However, a GCC plugin (or some #pragmas) can remove some debug information (e.g. remove debug information for a selected set of functions).
Generally, adding debug information increases the size of the binary files (or creates extra files for the debug information). That's nowadays usually not a problem, unless you're distributing it over slow networks. And of course this debug information may help others in analyzing your code, if they want to do that. Typically, the -g flag is used together with -O0 (the default), which disables compiler optimization and generates code that is as close as possible to the source, so that debugging is easier. While you can use debug information together with optimizations enabled, this is really tricky, because variables may not exist, or the sequence of instructions may be different than in the source. This is generally only done if an error needs to be analyzed that only happens after the optimizations are enabled. Of course, the downside of -O0 is poorer performance.
So as a conclusion: Typically one uses -g -O0 during development, and for distribution or production code just -O3.
I am doing some tests and I realized that using the -G parameter when compiling is giving me a bad performance than without it.
I have checked the documentation in Nvidia:
--device-debug (-G)
Generate debug information for device code.
But it is not helping me to know the reason why is giving me such bad performance.
Where is it generating this debug information and when? and what could be the cause of this bad performance?
Using the -G switch disables most compiler optimizations that nvcc might do in device code. The resulting code will often run slower than code that is not compiled with -G, for this reason.
This is pretty easy to see by running your executable in each case through cuobjdump -sass myexecutable and looking at the generated device code. You'll see generally less device code in the non -G case, and you can see the differences in specific optimizations as well.
One of the reasons for this is that highly optimized device code may eliminate actual lines of source code and actual source code variables. This can make it very difficult to debug code. Therefore to enable debugging, most optimizations are disabled with -G.
Also note that with Thrust, using the -G switch may result in unpredictable behavior. Newer versions of thrust should behave better, but there may still be unexpected issues when compiling thrust code with -G.
I've been trying to profile some C++ with gprof 2.25.2 (under Cygwin) and it is reporting that 10% of the time is being spent in a function which I know is not being called. (I put a print statement into the relevant function to verify this.) It also seems to think that this function is calling itself recursively (number of calls is 500+16636500), which it definitely isn't.
It's a large enough program that I don't have an easy way of producing a minimal working example I can post here, but if anyone has any ideas about what might be causing this, I would be grateful to know.
Edit: building with CMake + g++. CMAKE_BUILD_TYPE=RELWITHDEBINFO.
I'll assume you're using gcc/g++...
This sounds like a case of the debug symbols being out-of-date with respect to your source code or executable. Try cleaning your build space, recompiling (with -g or -ggdb3, of course). If you're compiling with optimizations and you can afford to turn them off (i.e. -O0 instead of -O1, -O2 or -O3), do so for this run. If that works, try -O1 or -O2 and see what happens.
I am using -O3 when compiling the code, and now I need to profile it. For profiling, there are two main choices I came accross: valgrind --tool=callgrind and gprof.
Valgrind (callgrind) docs state:
As with Cachegrind, you probably want to compile with debugging info (the -g option) and with optimization turned on.
However, in the C++ optimization book by Agner Fog, I have read the following:
Many optimization options are incompatible with debugging. A debugger can execute a
code one line at a time and show the values of all variables. Obviously, this is not possible
when parts of the code have been reordered, inlined, or optimized away. It is common to
make two versions of a program executable: a debug version with full debugging support
which is used during program development, and a release version with all relevant
optimization options turned on. Most IDE's (Integrated Development Environments) have
facilities for making a debug version and a release version of object files and executables.
Make sure to distinguish these two versions and turn off debugging and profiling support in
the optimized version of the executable.
This seems to conflict the callgrind instructions to compile the code with the debugging info flag -g. If I enable debugging in the following way:
-ggdb -DFULLDEBUG
am I not causing this option to conflict with the -O3 optimization flag? Using those two options together makes no sense to me after what I have read so far.
If I use say -O3 optimization flag, can I compile the code with additional profiling info by using:
-pg
and still profile it with valgrind?
Does it ever make sense to profile a code compiled with
-ggdb -DFULLDEBUG -O0
flags? It seems silly - not inlining functions and unrolling loops may shift the bottlenecks in the code, so this should be used for development only, to get the code to actually do stuff properly.
Does it ever make sense to compile the code with one optimization flag, and profile the code compiled with another optimization flag?
Why are you profiling? Just to get measurements or to find speedups?
The common wisdom that you should only profile optimized code is based on assuming the code is nearly optimal to begin with, which if there are significant speedups, it is not.
You should treat the finding of speedups as if they were bugs. Many people use this method of doing so.
After you've removed needless computations, if you still have tight CPU loops, i.e. you're not spending all your time in system or library or I/O routines the optimizer doesn't see, then turn on -O3, and let it do its magic.