Getting consistent callgrind output over multiple runs - c++

I've been using valgrind to do some profiling on a C++ codebase, and based on my understanding it's performing sampling of individual calls to build its profile data. As a result, I can profile the same application twice using something like this:
valgrind --tool=callgrind myprogram.exe
and get different results each time. For instance, the last two runs I did gave me the following output at the bottom:
==70741==
==70741== Events : Ir
==70741== Collected : 196823780
==70741==
==70741== I refs: 196,823,780
and
==70758==
==70758== Events : Ir
==70758== Collected : 195728098
==70758==
==70758== I refs: 195,728,098
This all makes complete sense to me. However, I'm currently in the process of optimizing my codebase and I'd like to be able to make a change and determine whether it improves performance. Due to sampling, it seems that running callgrind alone will not be sufficient, since I can get different numbers on each run. As a result, it would be difficult to determine whether my latest run was faster just due to random sampling, or whether my change actually made a significant difference (in the statistical sense, why not?).
My question is: Is there a way I can force callgrind to be consistent in its sampling? Or is there some more appropriate higher-level tool that would allow me to understand how a refactor affects performance? I'm currently developing on macOS Sierra.

Callgrind does not use a sampling technique. Instead, it exactly counts the instructions that are executed. So, if a program is run twice under callgrind and does exactly the same thing, it will give the same result. However, as soon as your program does non-trivial things (e.g. uses a lot of libraries), those libraries might do slightly different things depending, for example, on the clock, the system load, the contents of the environment, and so on.
So, the differences you see are not due to callgrind; they appear because your program and/or the libraries it uses really are doing slightly different things on each run.
You can use a callgrind output file visualisation or reporting tool, such as kcachegrind, to analyse the difference between two runs. You might then see where these differences originate and eliminate them. Otherwise, you should be able to determine the effect of your changes by looking only at the cost of the functions you are interested in.
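If it helps, collection can also be narrowed to exactly the region being optimized using the client-request macros from <valgrind/callgrind.h>. The following is only a minimal sketch; hot_path() is a made-up stand-in for whatever code you are measuring:

#include <valgrind/callgrind.h>

// Illustrative stand-in for the code being optimized.
static long hot_path() {
    long sum = 0;
    for (long i = 0; i < 1000000; ++i)
        sum += i;
    return sum;
}

int main() {
    CALLGRIND_TOGGLE_COLLECT;   // start collecting event counts here
    long result = hot_path();
    CALLGRIND_TOGGLE_COLLECT;   // stop collecting
    CALLGRIND_DUMP_STATS;       // dump a profile covering only the toggled region
    return result == 0 ? 1 : 0; // keep the result live so the loop is not optimized away
}

Run it with collection disabled at start so only the toggled region contributes to the counts:
valgrind --tool=callgrind --collect-atstart=no myprogram.exe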

Related

Performance counter for Halide?

Is there a performance counter available for code written in the Halide language? I would like to know how many loads, stores, and ALU operations are performed by my code.
The Halide tutorial for scheduling multi-stage pipelines compares different schedules by comparing the amount of allocated memory, loads, stores, and calls to halide Funcs, but I don't see how this information was collected. I suppose it might be possible to use trace_stores, trace_loads, and trace_realizations to print to the console every time one of these operations occurs. This isn't a great option though because it would greatly slow down the program's execution and would require some kind of counting script to compile the long list of console outputs into the desired counts for loads, stores, and ALU operations.
I'm pretty sure they just used the trace_xxx output and ran some scripts/programs on it.
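For what it's worth, here is a minimal sketch of what enabling those trace_xxx calls looks like; the two-stage pipeline is made up purely for illustration, and Halide API details may differ slightly between versions:

#include "Halide.h"
using namespace Halide;

int main() {
    // Toy two-stage pipeline, for illustration only.
    Func producer("producer"), consumer("consumer");
    Var x("x");
    producer(x) = x * 2;
    consumer(x) = producer(x) + 1;
    producer.compute_root();

    // Emit one trace event per store (and per load of producer),
    // which an external script can then count.
    producer.trace_stores();
    producer.trace_loads();
    consumer.trace_stores();

    Buffer<int> out = consumer.realize({64});
    return 0;
}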
If you're looking for real performance numbers on an x86 platform, I would go with Intel VTune Amplifier. It's pretty expensive, but may be free if you're in academia (student, teacher, researcher) or if it's for an open source project.
Other than that, look at the lowered statement code by setting HL_DEBUG_CODEGEN=1 in the environment and you can get a better idea of the loop structure and data use. Note that this output goes to stderr, not stdout.
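For example, assuming the pipeline binary is called ./my_pipeline (just a placeholder), the lowered statement can be captured with something like:
HL_DEBUG_CODEGEN=1 ./my_pipeline 2> lowered.txt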
EDIT: For Linux, there's perf.
We do not have any perf counter based support at present. It is fairly difficult to make it portable. (And on mobile devices, often the OS simply doesn't allow access to the hardware.) The support in Profiling.cpp and src/profiling.cpp could likely be used to drive perf counter operation. The profiling lowering pass adds code to call routines in the runtime which update information about Func and Pipeline execution. This information is collected and aggregated by another thread.
If tracing is run to a file (e.g. using HL_TRACE_FILE) a binary format is used and it is a bit more efficient. See utils/HalideTraceViz for a tool to work with the binary format. This is generally how analyses are done within the team.
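For example (binary name assumed, and with the trace_xxx calls enabled as in the sketch above), the trace can be redirected to a binary file for HalideTraceViz to consume:
HL_TRACE_FILE=trace.bin ./my_pipeline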
There was a small amount of investigation of OProfile, which looked promising but I don't think we ever got code working.

Is callgrind profiling influenced by other processes?

I'd like to profile my application using callgrind. Now, since it takes a very long time, in the meanwhile I go on with web-browsing, compiling and other intensive tasks on the same machine.
Am I biasing the profiling results? I'm expecting that, since valgrind uses a simulated CPU, other external processes should not interfere with valgrind execution. Am I right?
By default, Callgrind does not record anything related to time, so you can expect all collected metrics to (mostly) be independent of other processes on the machine. As the Callgrind manual states,
By default, the collected data consists of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the numbers of such calls.
As such, the metrics Callgrind reports should only depend on what instructions the program executes on the (simulated) CPU - not on how much time those instructions take. That said, the output of Callgrind can often be somewhat misleading, as the simulated CPU might operate differently from the real one (particularly when it comes to branch prediction).
The Callgrind paper presented at ICCS 2004 is very clear about this as well:
We note that the simulation is not able to predict consumed wall clock time, as this would need a detailed simulation of the microarchitecture.
In any case, however, the simulated CPU is unaffected by what the real CPU is doing.
The reason is straightforward.
Like you said, your program is not executed on your machine at all.
Instead, at runtime, Valgrind dynamically translates your program: it disassembles the binary into "UCode" for a simulated machine, adds analysis code (called instrumentation), then generates binary code that executes the simulation.
The addition of analysis code is what makes instruction counting (in Callgrind), memory checking (in Memcheck), and all other plugins possible.
Therein lies the twist, however.
Naturally there are limits to how isolated the program can run in such a dynamic simulation.
First, your program might interact with other programs.
While the time spent doing so is irrelevant (as it is not accounted for), the return codes of inter-process communication can certainly change, depending on what else is going on in the system.
Second, most system calls need to be run untranslated, and their return codes can change as well -- leading to different execution paths in your program and, thus, slightly different metrics being collected. (As an aside, Callgrind offers an option to record the wall clock time spent during syscalls, which will always be affected by what else is going on in the system.)
More details about these restrictions can be found in the PhD Dissertation of Nicholas Nethercote ("Dynamic Binary Analysis and Instrumentation").

What is the effect of Callgrind on code performance?

I am using Callgrind in order to see how many times specific functions are called. However, I am also interested in the execution time.
I know programs take much longer when running under Callgrind, since it has to collect information. However, what surprises me is how the timings change. In my case, I am running two different versions of the Fast Marching Method (FMM and Simplified FMM) on 2D and 3D grids. The results are as follows:
In 2D the ratio FMM/SFMM is not preserved at all, but at least it is always >1 (FMM always takes longer than SFMM). However, in 3D the effect of Callgrind is completely the opposite: the times change completely, and SFMM takes less time under Callgrind but longer in regular execution.
The compilation flags I am using (-Ofast, -fno-finite-math-only) are the same in all cases, and the same binaries are being run under callgrind and with regular execution (./bin-name).
The time measuring functions are those from std::chrono.
Therefore, the question is: as I am using the same binary in all cases, how is it possible that the same binary behaves so differently? Is the other data I am getting (function calls, % time cost, etc.) reliable in this case? I expected Callgrind-like results when running the binaries with the regular execution command.
EDIT: in the implementation, the main change is that in FMM I am using the Boost Fibonacci heap and in SFMM I am using a small modification with a Boost Priority Queue.
Thank you!

callgrind slow with instrumentation turned off

I am using callgrind to profile a Linux multi-threaded app and mostly it's working great. I start it with instrumentation off (--instr-atstart=no) and then, once setup is done, I turn it on with callgrind_control -i on. However, when I change certain configurations to try to profile a different part of the app, it starts running extremely slowly even before I turn instrumentation on. Basically, a part of the code that takes a few seconds in normal operation takes over an hour with callgrind (instrumentation turned off). Any ideas as to why that might be and how to go about debugging/resolving the slowness?
Callgrind is a tool built on valgrind. Valgrind is basically a dynamic binary translator (libVEX, part of valgrind). It decodes every instruction and JIT-compiles it into a stream of instructions for the same CPU.
As far as I know, there is no way to enable this translation (in the valgrind implementation) for an already running process, so dynamic translation is enabled the whole time, from the start of the program. It cannot be turned off either.
Tools are built on valgrind by adding some instrumentation code. The "null" tool (Nulgrind) is the tool which adds no instrumentation. But every tool uses valgrind, and dynamic translation is active the whole time. Turning instrumentation on and off in callgrind only toggles the additional instrumentation.
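For reference, the workflow described in the question looks roughly like this (the application name is a placeholder): start the program with instrumentation off,
valgrind --tool=callgrind --instr-atstart=no ./myapp
and then, from another shell once setup has finished, enable it:
callgrind_control -i on
Even with instrumentation off, the dynamic translation described above is still happening, which is why the program never runs at native speed.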
The virtual CPU implemented by Valgrind is limited; there is an (incomplete) list of limitations at http://valgrind.org/docs/manual/manual-core.html#manual-core.limits Most of the limitations concern floating point operations, which can be emulated incorrectly.
Is the change connected with floating-point operations? Or with the other listed limitations?
Also, you should know that "Valgrind serialises execution so that only one thread is running at a time" (from the same manual-core.html page).

Does GDB support "run time sampling", or is there a user "extension" that does it?

Motivation: I can't get the Google CPU profiler to work on the machine where the code runs (with my last breath I curse libunwind :)), so I was wondering if GDB supports high-frequency random pausing of the program execution, storing the name of the function where the break occurred and counting how many times it paused in function x.
That is what I would call "run time sampling"; there is probably a more precise/smarter name for it.
I looked at OProfile, but it is too complicated to a) figure out whether it can do this and b) figure out how to do it.
EDIT: apparently the correct name is:
"statistical sampling method"
EDIT2: the reason why I'm offering a bounty for this is that I see some people on SO recommending doing a manual break 10-20 times and examining the stack with bt...
That seems very wasteful in terms of time, so I guesstimate some smart people have automated it. :)
EDIT3: gprof won't cut it... I tried running it recently on an ARM system and the output was trash... :( I guess its troubles with multithreading are the reason for that...
You can manually sample in GDB by pausing it at run time.
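For example, a single manual sample of all thread stacks can be taken from outside the process with something like this (the program name is a placeholder):
gdb -p $(pidof myprogram) -batch -ex "thread apply all bt"
Repeating that in a loop gives a crude statistical sample of where the program spends its time.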
What you seem to think you want is gprof, but if your goal is to make the program as fast as possible, then I would suggest the following:
A high frequency of sampling is not helpful.
Counting the number of samples where the program counter is in function X is not helpful, except in artificially small programs.
If you follow that link, you will see the reasons why, and directions for how to do it successfully.
GDB would not do this well, although you could certainly hack something up that gave wildly inaccurate results.
I'd suggest Valgrind's "Callgrind" plugin. As a bonus, it requires absolutely no recompilation or other special setup. All you need is valgrind installed on your system, and debug information in your program (or at least full symbol information; I'm not sure).
You then invoke your program like this:
valgrind --tool=callgrind <your program command line>
When it's done, there will be a file named callgrind.out.<pid> in the current directory. You can read and visualise this file with a very nice GUI tool called kcachegrind (usually you have to install it separately).
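If a GUI is not convenient, the callgrind_annotate script that ships with valgrind prints a plain-text summary of the same file, for example:
callgrind_annotate callgrind.out.<pid>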
The only problem is that, because callgrind slows the execution of your program, the time spent in system calls can appear smaller (in percentage terms) than it really would be. By default, callgrind does not include system time in its counters, so the values it gives you are a real comparison of the code in your program, if not the actual time spent 'under' each function. This can be confusing at first, so if that happens, try adding --collect-systime=yes.
I'm not sure what the state of callgrind on ARM might be. ARMv7 is listed as a supported platform, but the documentation only says "fairly complete", whatever that means.