Nonsense results from gprof - c++

I've been trying to profile some C++ with gprof 2.25.2 (under Cygwin) and it is reporting that 10% of the time is being spent in a function which I know is not being called. (I put a print statement into the relevant function to verify this.) It also seems to think that this function is calling itself recursively (number of calls is 500+16636500), which it definitely isn't.
It's a large enough program that I don't have an easy way of producing a minimal working example I can post here, but if anyone has any ideas about what might be causing this, I would be grateful to know.
Edit: building with CMake + g++. CMAKE_BUILD_TYPE=RELWITHDEBINFO.

I'll assume you're using gcc/g++...
This sounds like a case of the debug symbols being out-of-date with respect to your source code or executable. Try cleaning your build space, recompiling (with -g or -ggdb3, of course). If you're compiling with optimizations and you can afford to turn them off (i.e. -O0 instead of -O1, -O2 or -O3), do so for this run. If that works, try -O1 or -O2 and see what happens.

Related

Coverage run with GCC does not produce data

I have a Fortran program I would like to profile with GNU coverage. So I compiled the program with GCC 11.2 with these coverage options:
-fprofile-arcs -ftest-coverage
Also, I add flags to disallow compiler to inline the code:
-fno-inline -fno-inline-small-functions -fno-default-inline
I turned off lto and add -lcgov to linker. This setup worked well for a sample program I proved. But when I tried to use it for the program I am interested in it did not generate any *.gcno files, just nothing. Execution, however, exited well (0 exit code) producing correct results.
My question is, how can I find where the problem is. Without an error message, I don't know where to start. It is a rather bigger program ~10 MB of source code, can that be a problem? Also, it heavily depends on MKL, can the external library be the problem? Once I accidentally mixed compile time and runtime environments and it complained about the version of libgcov.so, so something is working after all. Or do you have any other suggestions for coverage profiling?

How to profile Rcpp code (on linux)

I made an R package with Rcpp where whole simulations are run in c++ and results are analyzed in R. Now I need to profile my functions so I can optimize them, but R profilers can't distinguish what happens inside the C++ functions, and I don't know how to run C++ profilers when the functions can only be ran from inside R.
So far, I have found some suggestions to use gperftools (questions and tutorials) but the guides are incomplete (maybe they assume a level of knowledge that I lack?), have missing links, and I keep running into walls. Hence this question. Here's where I'm at:
Install gperftools (I installed from extra/gperftools with pacman)
include gperftools/profiler.h on the C++ header
Add ProfilerStart("myprof.log") and ProfilerStop() in the C++ code around what I want to profile
Compile with -lprofiler
Run "$ CPUPROFILE="myprof.log" R -f myscript.R"
The current wall is gcc tells me "Undefined Symbol: ProfilerStart", so I think there's something wrong with the linking?
I'm not really very impressed with gperftools. Also, it appears to be an instrumenting profiler, sampling-based profilers are easier to use and are likely to run faster. Intels VTune is an excellent sampling-based profiler, available for free if you're an educational user. Even if you're not, your organisation may already have licenses.
Turning to your gperftools issue, yes, that's a linker issue. As you have decided not to share any of the relevant information (link command? compile command? Actual error messages?) we can't help you further.
It was a linking error after all, caused by my lack of experience as this is the first time I use Makevars.
In step #4, I added "-lprofiler" to PKG_CXXFLAGS, that is used in compiling, when I should have added it to PKG_LIBS. I made the change and now the profiler works just fine. This is my Makevars now:
PKG_CXXFLAGS += -Wall -pedantic -g -ggdb #-fno-inline-small-functions
PKG_LIBS += -lprofiler
CXX_STD = CXX11

Why is thrust sort so slow? [duplicate]

I am doing some tests and I realized that using the -G parameter when compiling is giving me a bad performance than without it.
I have checked the documentation in Nvidia:
--device-debug (-G)
Generate debug information for device code.
But it is not helping me to know the reason why is giving me such bad performance.
Where is it generating this debug information and when? and what could be the cause of this bad performance?
Using the -G switch disables most compiler optimizations that nvcc might do in device code. The resulting code will often run slower than code that is not compiled with -G, for this reason.
This is pretty easy to see by running your executable in each case through cuobjdump -sass myexecutable and looking at the generated device code. You'll see generally less device code in the non -G case, and you can see the differences in specific optimizations as well.
One of the reasons for this is that highly optimized device code may eliminate actual lines of source code and actual source code variables. This can make it very difficult to debug code. Therefore to enable debugging, most optimizations are disabled with -G.
Also note that with Thrust, using the -G switch may result in unpredictable behavior. Newer versions of thrust should behave better, but there may still be unexpected issues when compiling thrust code with -G.

Is a program compiled with -g gcc flag slower than the same program compiled without -g?

I'm compiling a program with -O3 for performance and -g for debug symbols (in case of crash I can use the core dump). One thing bothers me a lot, does the -g option results in a performance penalty? When I look on the output of the compilation with and without -g, I see that the output without -g is 80% smaller than the output of the compilation with -g. If the extra space goes for the debug symbols, I don't care about it (I guess) since this part is not used during runtime. But if for each instruction in the compilation output without -g I need to do 4 more instructions in the compilation output with -g than I certainly prefer to stop using -g option even at the cost of not being able to process core dumps.
How to know the size of the debug symbols section inside the program and in general does compilation with -g creates a program which runs slower than the same code compiled without -g?
Citing from the gcc documentation
GCC allows you to use -g with -O. The shortcuts taken by optimized
code may occasionally produce surprising results: some variables you
declared may not exist at all; flow of control may briefly move where
you did not expect it; some statements may not be executed because
they compute constant results or their values are already at hand;
some statements may execute in different places because they have been
moved out of loops.
that means:
I will insert debugging symbols for you but I won't try to retain them if an optimization pass screws them out, you'll have to deal with that
Debugging symbols aren't written into the code but into another section called "debug section" which isn't even loaded at runtime (only by a debugger). That means: no code changes. You shouldn't notice any performance difference in code execution speed but you might experience some slowness if the loader needs to deal with the larger binary or if it takes into account the increased binary size somehow. You will probably have to benchmark the app yourself to be 100% sure in your specific case.
Notice that there's also another option from gcc 4.8:
-Og
Optimize debugging experience. -Og enables optimizations that do not interfere with debugging. It should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience.
This flag will impact performance because it will disable any optimization pass that would interfere with debugging infos.
Finally, it might even happen that some optimizations are better suited to a specific architecture rather than another one and unless instructed to do so for your specific processor (see march/mtune options for your architecture), in O3 gcc will do its best for a generic architecture. That means you might even experience O3 being slower than O2 in some contrived scenarios. "Best-effort" doesn't always mean "the best available".

Should I use matching (gcc) compiler optimization flags when profiling the code?

I am using -O3 when compiling the code, and now I need to profile it. For profiling, there are two main choices I came accross: valgrind --tool=callgrind and gprof.
Valgrind (callgrind) docs state:
As with Cachegrind, you probably want to compile with debugging info (the -g option) and with optimization turned on.
However, in the C++ optimization book by Agner Fog, I have read the following:
Many optimization options are incompatible with debugging. A debugger can execute a
code one line at a time and show the values of all variables. Obviously, this is not possible
when parts of the code have been reordered, inlined, or optimized away. It is common to
make two versions of a program executable: a debug version with full debugging support
which is used during program development, and a release version with all relevant
optimization options turned on. Most IDE's (Integrated Development Environments) have
facilities for making a debug version and a release version of object files and executables.
Make sure to distinguish these two versions and turn off debugging and profiling support in
the optimized version of the executable.
This seems to conflict the callgrind instructions to compile the code with the debugging info flag -g. If I enable debugging in the following way:
-ggdb -DFULLDEBUG
am I not causing this option to conflict with the -O3 optimization flag? Using those two options together makes no sense to me after what I have read so far.
If I use say -O3 optimization flag, can I compile the code with additional profiling info by using:
-pg
and still profile it with valgrind?
Does it ever make sense to profile a code compiled with
-ggdb -DFULLDEBUG -O0
flags? It seems silly - not inlining functions and unrolling loops may shift the bottlenecks in the code, so this should be used for development only, to get the code to actually do stuff properly.
Does it ever make sense to compile the code with one optimization flag, and profile the code compiled with another optimization flag?
Why are you profiling? Just to get measurements or to find speedups?
The common wisdom that you should only profile optimized code is based on assuming the code is nearly optimal to begin with, which if there are significant speedups, it is not.
You should treat the finding of speedups as if they were bugs. Many people use this method of doing so.
After you've removed needless computations, if you still have tight CPU loops, i.e. you're not spending all your time in system or library or I/O routines the optimizer doesn't see, then turn on -O3, and let it do its magic.