Newbie: Performance analysis through the command line - C++

I am looking for a performance analysis tool with the following properties:
1. Free.
2. Runs on Windows.
3. Does not require using the GUI (i.e. can be run from the command line or through some library in any programming language).
4. Runs on some x86-based architecture (preferably Intel).
5. Can measure the running time of my C++, MinGW-compiled program, except for the time spent in a few functions I specify (and all calls emanating from them).
6. Can measure the amount of memory used by my program, except for the memory allocated by the functions specified in (5) and all calls emanating from them.
A tool that has properties (1) to (5) but not (6) can still be very valuable to me.
My goal is to be able to compare the running time and memory usage of different programs in a consistent way (i.e. the main requirement is that timing the same program twice would return roughly the same results).

MinGW should already come with a gprof tool. To use it, you just need to compile and link with the right flags set; I believe they are -g -pg.
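A rough sketch of the workflow (the file and program names are placeholders):

g++ -g -pg -O2 -o myprog.exe myprog.cpp     compile and link with profiling enabled
myprog.exe                                  run normally; this writes gmon.out in the current directory
gprof myprog.exe gmon.out > profile.txt     produce the flat profile and call graph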

For heap analysis, a free option is umdh.exe, which is a full heap dumper; you can also compare consecutive memory snapshots to check for leakage over time. You'd have to filter the output yourself to remove the functions that are not of interest, however.
I know that's not exactly what you asked for in (6), but it might be useful. I suspect that kind of filtering is not going to be common in free tools.
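For what it's worth, a typical umdh session looks roughly like this (MyApp.exe, the pid 1234 and the file names are placeholders; umdh needs user-mode stack traces enabled via gflags first):

gflags /i MyApp.exe +ust                    enable user-mode stack trace collection
umdh -p:1234 -f:before.log                  snapshot the heap of process 1234
umdh -p:1234 -f:after.log                   second snapshot, taken later
umdh before.log after.log > diff.log        diff the snapshots; growth shows up with call stacks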

Related

Troubleshoot C++ program memory usage issue

I am writing a C++ program and find that it consumes too much memory. I would like to know which parts of the program consume the most memory; ideally, I would like to know what percentage of memory is consumed by which kinds of C++ objects at a particular moment.
In Java, I know of tools like Eclipse Memory Analyzer (https://www.eclipse.org/mat/) which can take a heap dump and show/visualize such memory usage, and I wonder whether this can be done for a C++ program. For example, I would like a tool/approach that tells me a particular vector<shared_ptr<MyObject>> is holding 30% of the memory.
Note:
I develop the program mainly on macOS (compiling with Apple Clang), so it would be better if the approach works on macOS. But I deploy to Linux as well (compiling with gcc), so approaches/tools on Linux are okay too.
I tried using Apple's Instruments for this purpose, but so far I have only been able to use it to find memory allocation issues. I have no idea how to figure out the program's memory consumption at a particular moment (the consumption should be tied to the C++ objects in the program so that I can take action to reduce it accordingly).
I haven't found an easy way to visualize/summarize each part of my program's memory usage yet. So far, the best tool/approach I have found is Apple's Instruments (if you are on macOS).
In Instruments, use the Allocations profiling template. With this template, choose File ==> Recording Options ==> check the "Discard events for freed memory" option.
You will then see only the un-freed memory (i.e. the data that is still live) during the allocation recording. If your program's debug symbols are loaded, you can see which functions are responsible for those allocations.
Although this doesn't address all the issues, it does help to identify part of the problem.

Looking for a way to detect valgrind/memcheck at runtime without including valgrind headers

Valgrind/Memcheck can be intensive and can cause runtime performance to drop significantly. I need a way to detect it at runtime so that I can disable all auxiliary services and features and get the checks done in under 24 hours. I would prefer not to pass any explicit flags to the program, but that would be one way.
I explored searching the symbol table (via abi calls) for valgrind or memcheck symbols, but there were none.
I explored checking the stack (via boost::stacktrace), but nothing was there either.
I'm not sure it's a good idea to behave differently when running under Valgrind, since the whole point of Valgrind is to exercise your software in its expected usage.
Anyway, Valgrind does not change the stack or the symbols, since it (more or less) emulates a CPU running your program. The only way to detect that you're running under Valgrind is to observe its effects: everything is slow, and threads are effectively serialized.
So, for example, run a test that spawns 3 threads consuming a common FIFO (protected by a mutex) and observe the number of items each thread receives. On a real CPU you'd expect the 3 threads to process close to the same number of items in time T; under Valgrind, one thread will have consumed almost all the items, and the test will take far longer than T. A sketch of such a test follows.
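A minimal sketch of that test; the item count, spin delay and the "one thread got more than 90% of the items" threshold are arbitrary assumptions that you would calibrate on real hardware before trusting the heuristic:

// Three consumers pull from a shared FIFO; a very skewed split suggests
// the serialized scheduling typical of running under Valgrind.
#include <algorithm>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

static bool looks_serialized()
{
    std::queue<int> fifo;
    std::mutex m;
    for (int i = 0; i < 100000; ++i) fifo.push(i);

    int consumed[3] = {0, 0, 0};            // each thread touches only its own slot
    auto worker = [&](int id) {
        for (;;) {
            {
                std::lock_guard<std::mutex> lock(m);
                if (fifo.empty()) return;   // nothing left to consume
                fifo.pop();
                ++consumed[id];
            }
            // Simulate a little per-item work outside the lock so the
            // threads genuinely interleave on real hardware.
            for (volatile int spin = 0; spin < 500; spin = spin + 1) {}
        }
    };

    std::vector<std::thread> threads;
    for (int id = 0; id < 3; ++id) threads.emplace_back(worker, id);
    for (auto& t : threads) t.join();

    // On a real CPU the three counts end up roughly equal; under Valgrind
    // one thread tends to grab almost everything before it is descheduled.
    int total = consumed[0] + consumed[1] + consumed[2];
    int biggest = std::max({consumed[0], consumed[1], consumed[2]});
    return biggest > total * 9 / 10;        // arbitrary 90% threshold
}

int main()
{
    std::printf("running under something Valgrind-like: %s\n",
                looks_serialized() ? "probably" : "probably not");
}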
Another possibility is to rely on how Valgrind treats certain operations. For example, when you allocate memory, Valgrind intercepts the block and fills that memory area with its own data. Well-written software should write to that memory before reading it (overwriting whatever Valgrind put there). If your code reads it first and observes a non-zero value, you'll get a Valgrind "invalid read of size XXX" style message, but your code will know it is being instrumented.
Finally (and I think this is much simpler), you could move the code you need to instrument into a library and have two frontends: the "official" frontend, and a test frontend with all the bells and whistles disabled, which is the one meant to be run under Valgrind.

Visual Studio Profile Guided Optimization

I have a native C++ application which performs heavy calculations and consumes a lot of memory. My goal is to optimize it, mainly to reduce its run time.
After several cycles of profiling-optimizing, I tried the Profile Guided Optimization which I never tried before.
I followed the steps described on MSDN Profile-Guided Optimizations, changing the compilation (/GL) and linking (/LTCG) flags. After adding /GENPROFILE, I ran the application to create the .pgc and .pgd files, then changed the linker options to /USEPROFILE and watched the additional linker messages reporting that the profiling data was used (a command-line sketch of these steps follows the log):
3> 0 of 0 ( 0.0%) original invalid call sites were matched.
3> 0 new call sites were added.
3> 116 of 27096 ( 0.43%) profiled functions will be compiled for speed, and the rest of the functions will be compiled for size
3> 63583 of 345025 inline instances were from dead/cold paths
3> 27096 of 27096 functions (100.0%) were optimized using profile data
3> 608324578581 of 608324578581 instructions (100.0%) were optimized using profile data
3> Finished generating code
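For reference, a rough command-line sketch of those steps (file names are placeholders; in the IDE the same switches are set on the project's compiler and linker property pages):

cl /c /O2 /GL myapp.cpp                  compile with whole-program optimization
link /LTCG /GENPROFILE myapp.obj         produces an instrumented myapp.exe and myapp.pgd
myapp.exe                                run the training scenario; this writes the .pgc file(s)
link /LTCG /USEPROFILE myapp.obj         re-link using the collected profile data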
Everything looked promising, until I measured the program's performance.
The results were absolutely counterintuitive to me: performance went down instead of up! The program was 4% to 5% slower than without Profile Guided Optimization (comparing runs with and without the /USEPROFILE linker option).
Even when running the exact same scenario that was used with /GENPROFILE to create the Profile Guided Optimization data files, it ran 4% slower.
What is going on?
Looking at the sparse documentation here, the profiler doesn't seem to include any memory optimizations.
If your program uses 2 GiB of memory, I'd speculate that execution speed is limited by memory access and not by the CPU itself. (You also mentioned that maps are used; those are memory-bound as well.)
Memory access is difficult for a profile-guided optimizer to improve, because it can't change your malloc calls to (for example) put frequently used data into the same pages or make sure it lands in the same CPU cache line.
In addition, the optimizer may introduce extra memory accesses while trying to optimize the raw CPU performance of your program.
The documentation mentions "virtual call speculation"; I would speculate that this (and maybe other features like inlining) could introduce additional memory traffic, degrading overall performance because memory bandwidth is already the limiting factor.
Don't look at it as a black box. If the program can be sped up, it's because it is doing things it doesn't need to do.
Those things will hide from the profile-guided optimizer, or any other optimizer, and they will certainly hide from your guesser.
They won't hide from this. Many people use it.
I'm trying to resist the temptation to guess, but I'm failing.
Here's what I see in big C++ apps, no matter how well-written they are.
When people could use a simple data structure like an array, instead they use an abstract container class, with iterators and whatnot. Where does the time go?
That's where it goes.
Another thing they do is write "powerful functions and methods". The writer of the function is so proud of it, and of how much it does, that he/she expects it to be called reverently and sparingly.
The user of the function (which could be the same person) thinks "Look how useful this function is! See how much I can get done in a single line of code? The more I use it the more productive I will be."
See how this can easily do needless work?
There's another thing that happens in software - layers of abstraction.
If the pattern above is repeated over several layers of abstraction, the slowdown factors multiply.
The good news is, if you can find those things and fix them, you can get an enormous speedup. The bad news is that you could be seen as "not a team player".

How do I get call parents for libc6 symbols (e.g. _int_malloc) with linux perf?

I'm profiling a C++ application using linux perf, and I'm getting a nice control flow graph using GProf2dot. However, some symbols from the C library (libc6-2.13.so) take a substantial portion of the total time, and yet have no in-edges.
For example:
_int_malloc takes 8% of the time but has no call parents.
__strcmp_sse42 and __cxxabiv1::__si_class_type_info::__do_dyncast together take about 10% of the time, and have a caller whose name is 0, which has callers 2d6935c, 2cc748c, and 6, which have no callers.
As a result, I can't find out which routines are responsible for all this mallocing and dynamic casting using just perf. However, it seems that other symbols (e.g. malloc but not _int_malloc) do have call parents.
Why doesn't perf show call parents for _int_malloc? Why can't I find the ultimate callers of __do_dyncast? And is there some way for me to modify my setup so that I can get this information? I'm on x86-64, so I'm wondering if I need a (non-standard) libc6 with frame pointers.
Update: As of the 3.7.0 kernel, one can determine call parents of symbols in system libraries using perf record -gdwarf <command>.
Using -gdwarf, there is no need to compile with -fno-omit-frame-pointer.
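On newer perf versions, the long-form spelling of that option is --call-graph dwarf; a typical session (the program name is a placeholder) looks roughly like:

perf record --call-graph dwarf ./myprog      sample with DWARF-based stack unwinding
perf report                                  call parents of _int_malloc etc. should now appear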
Original answer:
Yes, one probably would need a libc6 compiled with frame pointers (-fno-omit-frame-pointer) on x86_64, at the moment (May 24, 2012).
However, developers are currently working on allowing the perf tools to use DWARF unwind info. This means that frame pointers are no longer needed to get backtrace information on x86_64. Linus, however, does not want a DWARF unwinder in the kernel. Thus, the perf tools will save registers as the system is running, and perform the DWARF unwinding in the userspace perf tool using the libunwind library.
This technique has been tested to successfully determine callers of (for example) malloc and dynamic_cast. However, the patch set is not yet integrated into the Linux kernel, and needs to undergo further revision before it is ready.
_int_malloc and __do_dyncast are being called from routines that the profiler can't identify because it doesn't have symbol table information for them.
What's more, it looks like you are showing self (exclusive) time.
That is only useful for finding hotspots in routines that a) have a lot of self time, and b) you can fix.
There's a reason profilers subsequent to the original Unix profil were created. Real software consists of functions that spend nearly all their time calling other functions, and you need to be able to find code that is on the stack much of the time, not code that holds the program counter much of the time.
So you need to configure perf to take stack samples and tell you the percent of time each of your routines is on the stack.
It is even better if it reports not just routines, but lines of code, as in Zoom.
It is best to take the samples on wall-clock time, so you're not blind to IO.
There's more to say on all this.

How to profile memory usage?

I am aware of Valgrind, but it just detects memory-management issues. What I am searching for is a tool that gives me an overview of which parts of my program consume how much memory. A graphical representation, e.g. a tree map (as KCachegrind does for Callgrind), would be nice.
I am working on a Linux machine, so Windows tools will not help me very much.
Use massif, which is part of the Valgrind tools. massif-visualizer can help you graph the data or you can just use the ms_print command.
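A rough example (the program name is a placeholder; massif writes its output to massif.out.<pid>):

valgrind --tool=massif ./myprog
ms_print massif.out.12345        where 12345 stands for the actual pid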
Try out the heap profiler delivered with gperftools, from Google. I've always built it from source, but it's available as a precompiled package under several Linux distros.
It's as simple to use as linking a dynamic library into your executables and running the program. It collects information about every dynamic memory allocation (as far as I've seen) and saves a memory dump to disk every time one of the following happens:
HEAP_PROFILE_ALLOCATION_INTERVAL bytes have been allocated by the program (default: 1 GB)
the high-water memory usage mark increases by HEAP_PROFILE_INUSE_INTERVAL bytes (default: 100 MB)
HEAP_PROFILE_TIME_INTERVAL seconds have elapsed (default: inactive)
You explicitly call HeapProfilerDump() from your code
The last one, in my experience, is the most useful because you can control exactly when to have a snapshot of the heap usage and then compare two different snapshots and see what's wrong.
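A minimal sketch of that explicit-dump approach, assuming gperftools is installed and the executable is linked with -ltcmalloc (the dump prefix and build_big_table are placeholders):

// Build with something like: g++ -O2 -g myprog.cpp -o myprog -ltcmalloc
// Dumps can then be inspected with gperftools' pprof, e.g.
//   pprof --text ./myprog /tmp/myprog.0001.heap
#include <gperftools/heap-profiler.h>
#include <vector>

static std::vector<int>* build_big_table()
{
    // Placeholder for whatever allocation-heavy work you want to inspect.
    return new std::vector<int>(10 * 1024 * 1024, 42);
}

int main()
{
    HeapProfilerStart("/tmp/myprog");      // prefix for the .heap dump files
    auto* table = build_big_table();
    HeapProfilerDump("after build");       // snapshot 1: table is live
    delete table;
    HeapProfilerDump("after delete");      // snapshot 2: compare against snapshot 1
    HeapProfilerStop();
}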
Finally, there are several possible output formats, textual or graphical (in the form of a directed graph).
Using this tool I've been able to spot incorrect memory usage that I couldn't find using Massif.
A "newer" option is HeapTrack. Contrary to massif, it is an instrumented version of malloc/free that stores all the calls and dumps a log.
The GUI is nice (but requires Qt5 IIRC) and the results timings (because you may want to track time as well) are less biased than valgrind (as they are not emulated).
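Usage is roughly (the program name is a placeholder):

heaptrack ./myprog                          writes heaptrack.myprog.<pid>.gz (or .zst on newer versions)
heaptrack_gui heaptrack.myprog.12345.gz     or use heaptrack_print for a plain-text summary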
Use the callgrind tool that comes with Valgrind.