Which is the more reliable profiling tool, gprof or kcachegrind? - c++

Profiling some C++ number-crunching code with both gprof and kcachegrind gives similar results for the functions that contribute most to the execution time (50-80% depending on input), but for functions contributing 10-30% the two tools give different results. Does that mean one of them is not reliable? What would you do here?

gprof is actually quite primitive. Here's what it does.
1) It samples the program counter at a constant rate and records how many samples land in each function (exclusive time).
2) It counts how many times any function A calls any function B.
From that it can work out how many times each function was called in total, and what its average exclusive time was.
To get average inclusive time of each function it propagates exclusive time upward in the call graph.
If you're expecting this to have some kind of accuracy, you should be aware of some issues.
First, it only counts CPU-time-in-process, meaning it is blind to I/O or other system calls.
Second, recursion confuses it.
Third, the premise that every call to a function takes about the same (average) time, no matter when it is called or who calls it, is very suspect.
Fourth, the notion that functions (and their call graph) are what you need to know about, rather than lines of code, is simply a popular assumption, nothing more.
Fifth, the notion that accuracy of measurement is even relevant to finding "bottlenecks" is also just a popular assumption, nothing more.
Callgrind can work at the level of lines - that's good. Unfortunately it shares the other problems.
If your goal is to find "bottlenecks" (as opposed to getting general measurements), you should take a look at wall-clock time stack samplers that report percent-by-line, such as Zoom.
The reason is simple but possibly unfamiliar.
Suppose you have a program with a bunch of functions calling each other that takes a total of 10 seconds. Also, there is a sampler that samples, not just the program counter, but the entire call stack, and it does it all the time at a constant rate, like 100 times per second. (Ignore other processes for now.)
So at the end you have 1000 samples of the call stack.
Pick any line of code L that appears on more than one of them.
Suppose you could somehow optimize that line, by avoiding it, removing it, or passing it off to a really really fast processor.
What would happen to those samples?
Since that line of code L now takes (essentially) no time at all, no sample can hit it, so those samples would just disappear, reducing the total number of samples, and therefore the total time!
In fact the overall time would be reduced by the fraction of time L had been on the stack, which is roughly the fraction of samples that contained it.
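If it helps to see that concretely, here is a minimal sketch with made-up numbers: it models a 10-second run as one-millisecond slices, marks the instants when a hypothetical line L is on the stack, samples 100 times per second, and checks that the fraction of samples containing L matches the fraction of time you would expect to save by removing it.

```cpp
// Illustrative only: a fake 10-second run where line L is on the stack 40% of
// the time. Random samples of the "stack" estimate that same 40%, which is the
// fraction of total time that would disappear if L were optimized away.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int slices = 10000;                 // 10 s at 1 ms per slice
    std::vector<bool> l_on_stack(slices, false);
    for (int i = 0; i < slices; ++i)          // pretend L is on the stack 40% of the time
        l_on_stack[i] = (i % 10) < 4;

    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, slices - 1);
    const int samples = 1000;                 // 100 samples/s for 10 s
    int hits = 0;
    for (int s = 0; s < samples; ++s)
        if (l_on_stack[pick(rng)]) ++hits;

    std::printf("fraction of samples containing L: %.2f\n", hits / double(samples));
    std::printf("expected time saved by removing L: ~%.0f%%\n", 100.0 * hits / samples);
}
```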
I don't want to get too statistical, but many people think you need a lot of samples, because they think accuracy of measurement is important.
It isn't, if the reason you're doing this is to find out what to fix to get speedup.
The emphasis is on finding what to fix, not on measuring it.
Line L is on the stack some fraction F of the time, right?
So each sample has a probability F of hitting it, right? Just like flipping a coin.
There is a theory of this, called the Rule of Succession.
It says that (under simplifying but general assumptions), if you flip a coin N times, and see "heads" S times, you can estimate the fairness of the coin F as (on average) (S+1)/(N+2).
So, if you take as few as three samples, and see L on two of them, do you know what F is? Of course not.
But you do know on average it is (2+1)/(3+2) or 60%.
So that's how much time you could save (on average) by "optimizing away" line L.
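As a tiny worked example of that estimate (the numbers are just illustrative):

```cpp
// Rule of Succession: seeing line L on S of N stack samples gives an expected
// time fraction of (S+1)/(N+2).
#include <cstdio>

double rule_of_succession(int S, int N) { return (S + 1.0) / (N + 2.0); }

int main() {
    std::printf("2 of 3 samples  -> F ~ %.0f%%\n", 100 * rule_of_succession(2, 3));   // 60%
    std::printf("2 of 10 samples -> F ~ %.0f%%\n", 100 * rule_of_succession(2, 10));  // 25%
}
```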
And, of course, the stack samples showed you exactly where line L (the "bottleneck"**) is.
Did it really matter that you didn't measure it to two or three decimal places?
BTW, it is immune to all the other problems mentioned above.
**I keep putting quotes around "bottleneck" because what makes most software slow has nothing in common with the neck of a bottle.
A better metaphor is a "drain" - something that just needlessly wastes time.

gprof's timing data is statistical (read about it in the details-of-profiling section of the gprof documentation).
On the other hand, KCacheGrind uses valgrind (callgrind), which actually interprets all the code.
So KCacheGrind can be "more accurate" (at the expense of more overhead) if the CPU modeled by valgrind is close to your real CPU.
Which one to choose also depends on what type of overhead you can handle. In my experience, gprof adds less runtime overhead (execution time, that is), but it is more intrusive (i.e. -pg adds code to each and every one of your functions). So depending on the situation, one or the other is more appropriate.
For "better" gprof data, run your code longer (and on as wide a range of test data you can). The more you have, the better the measurements will be statistically.

Related

Recommended samples for performance benchmarks?

I'm writing performance benchmarks for some of my code. This is both to compare my own implementations as I develop/experiment, and to compare against "competing" implementations. I have no problem writing these, and getting usable results.
It's very well established that more samples are a good thing, as they reduce the impact of erroneous data and give a truer result.
So, if I'm profiling a given function/procedure/whatever, how many samples does it seem reasonable to get?
I'm currently doing about 1 million samples for each test. These are individual operations, the results rarely take longer than 10s per item, even on an old laptop. Most are under a hundredth of a second.
Actually, it is not well established that more samples are a good thing.
It is nothing more than common wisdom.
I think you are sharing in a general confusion about the reason for profiling, whether the purpose is to measure performance or to find speedups.
For measuring performance, you don't need samples at all.
What you need is a stopwatch, whether in software or not.
If your process runs too quickly for the resolution of the stopwatch, just run your process 10^3 or 10^6 times, measure it, and divide by that number.
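For example, a minimal stopwatch sketch along those lines, where do_work() is just a hypothetical stand-in for whatever you want to time:

```cpp
// Time a cheap operation by repeating it enough times to exceed the clock's
// resolution, then divide by the repeat count.
#include <chrono>
#include <cstdio>

volatile double sink = 0.0;                 // keeps the compiler from optimizing the loop away
void do_work() { sink = sink * 1.000001 + 1.0; }

int main() {
    const long reps = 1000000;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < reps; ++i) do_work();
    auto t1 = std::chrono::steady_clock::now();
    double total_s = std::chrono::duration<double>(t1 - t0).count();
    std::printf("total: %.6f s, per call: %.9f s\n", total_s, total_s / reps);
}
```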
For finding speedups, sampling the call stack is very effective, provided the samples contain line-level or instruction-level call site information.
How many samples do you need?
Well, if you see it doing something that could be removed on one sample, that probably doesn't mean much.
But if you see it on two samples, that estimates it's costing time fraction F of about 2/N where N is the number of samples.
Example: if you see it twice in 10 samples, that means it costs roughly 20% of time.
In general, if the speedup is going to save you fraction F of time, it takes on average 2/F samples to see it twice.
Example: if it is going to save 30% of time (F = 0.3) you need on average 2/0.3 = 6.67 samples to see it twice.
Of course, if you see it more than twice, all the better.
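If you want to sanity-check the 2/F rule of thumb, here is a small simulation with illustrative values of F:

```cpp
// If an activity is on the stack with probability F per sample, the average
// number of samples needed to see it twice is 2/F (the negative binomial mean).
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(1);
    const double Fs[] = {0.3, 0.2, 0.1};     // illustrative time fractions
    for (double F : Fs) {
        std::bernoulli_distribution hit(F);
        long long total = 0;
        const int trials = 100000;
        for (int t = 0; t < trials; ++t) {
            int seen = 0, n = 0;
            while (seen < 2) { ++n; if (hit(rng)) ++seen; }
            total += n;
        }
        std::printf("F = %.1f: average samples to see it twice = %.2f (2/F = %.2f)\n",
                    F, total / double(trials), 2.0 / F);
    }
}
```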
Bottom line, for finding speedups, you don't need a lot of samples.
What you do need is to examine each one for activity that could be removed.
What you don't need is to mush them together into "statistics" (like most profilers do).
Many people understand this.
If you want a bit more rigorous explanation, look here.

Capturing function exit time with __gnu_mcount_nc

I'm trying to do some performance profiling on a poorly supported prototype embedded platform.
I note that GCC's -pg flag causes thunks to __gnu_mcount_nc to be inserted on entry to every function. No implementation of __gnu_mcount_nc is available (and the vendor is not interested in assisting); however, as it is trivial to write one that simply records the stack frame and current cycle count, I have done so. This works fine and is yielding useful results in terms of caller/callee graphs and most frequently called functions.
I would really like to obtain information about the time spent in function bodies as well, but I am having difficulty understanding how to approach this when only the entry, and not the exit, of each function is hooked: you can tell exactly when each function is entered, but without hooking the exit points you cannot know how much of the time until the next event to attribute to the callee and how much to its callers.
Nevertheless, the GNU profiling tools are in fact demonstrably able to gather runtime information for functions on many platforms, so presumably the developers have some scheme in mind for achieving this.
I have seen some existing implementations that do things like maintain a shadow callstack and twiddle the return address on entry to __gnu_mcount_nc so that __gnu_mcount_nc will get invoked again when the callee returns; it can then match the caller/callee/sp triplet against the top of the shadow callstack and so distinguish this case from the call on entry, record the exit time and correctly return to the caller.
This approach leaves much to be desired:
it seems like it may be brittle in the presence of recursion and libraries compiled without the -pg flag
it seems like it would be difficult to implement with low overhead or at all in embedded multithreaded/multicore environments where toolchain TLS support is absent and current thread ID may be expensive/complex to obtain
Is there some obvious better way to implement a __gnu_mcount_nc so that a -pg build is able to capture function exit as well as entry time that I am missing?
gprof does not use that function for timing entry or exit; it uses it for call-counting, i.e. counting how many times any function A calls any function B.
Rather, it uses the self-time gathered by counting PC samples in each routine, and then uses the function-to-function call counts to estimate how much of that self-time should be charged back to callers.
For example, if A calls C 10 times, and B calls C 20 times, and C has 1000 ms of self time (i.e. 100 PC samples), then gprof knows C was called 30 times, so about 33 of the samples can be charged to A, while the other 67 can be charged to B.
Similarly, sample counts propagate up the call hierarchy.
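As a toy illustration of that charging step, using the numbers from the example above (purely illustrative):

```cpp
// C's self-time samples are split among its callers in proportion to how many
// times each caller called it: 10 of 30 calls came from A, 20 of 30 from B.
#include <cstdio>

int main() {
    const int calls_from_A = 10, calls_from_B = 20;
    const int c_self_samples = 100;                  // ~1000 ms of self time
    const int total_calls = calls_from_A + calls_from_B;

    double charged_to_A = c_self_samples * double(calls_from_A) / total_calls;
    double charged_to_B = c_self_samples * double(calls_from_B) / total_calls;
    std::printf("charged to A: ~%.0f samples, charged to B: ~%.0f samples\n",
                charged_to_A, charged_to_B);         // ~33 and ~67
}
```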
So you see, it doesn't time function entry and exit.
The measurements it does get are very coarse, because it makes no distinction between short calls and long calls.
Also, if a PC sample happens during I/O or in a library routine that is not compiled with -pg, it is not counted at all.
And, as you noted, it is very brittle in the presence of recursion, and can introduce notable overhead on short functions.
Another approach is stack-sampling, rather than PC-sampling.
Granted, it is more expensive to capture a stack sample than a PC-sample, but fewer samples are needed.
If, for example, a function, a line of code, or any description you want to make appears on a fraction F of the N total samples, then you know the fraction of time it costs is about F, and the standard deviation of the observed sample count is sqrt(N*F*(1-F)).
So, for example, if you take 100 samples, and a line of code appears on 50 of them, then you can estimate the line costs 50% of the time, with an uncertainty of sqrt(100*.5*.5) = +/- 5 samples or between 45% and 55%.
If you take 100 times as many samples, you can reduce the uncertainty by a factor of 10.
(Recursion doesn't matter. If a function or line of code appears 3 times in a single sample, that counts as 1 sample, not 3.
Nor does it matter if function calls are short - if they are called enough times to cost a significant fraction, they will be caught.)
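If you want that uncertainty arithmetic spelled out (the exact percent doesn't matter much, as noted next, but here it is):

```cpp
// Binomial uncertainty of stack sampling: if a line appears on a fraction F of
// N samples, the standard deviation of the observed count is sqrt(N*F*(1-F)).
#include <cmath>
#include <cstdio>

int main() {
    const int Ns[] = {100, 10000};
    const double F = 0.5;                              // line is on half the samples
    for (int N : Ns) {
        double sd_count = std::sqrt(N * F * (1 - F));  // in samples
        double sd_fraction = sd_count / N;             // as a fraction of time
        std::printf("N = %5d: estimate %.0f%% +/- %.1f%% (sd of count = %.1f samples)\n",
                    N, 100 * F, 100 * sd_fraction, sd_count);
    }
}
```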
Notice, when you're looking for things you can fix to get speedup, the exact percent doesn't matter.
The important thing is to find it.
(In fact, you only need see a problem twice to know it is big enough to fix.)
That's the essence of this technique.
P.S. Don't get suckered into call-graphs, hot-paths, or hot-spots.
Here's a typical call-graph rat's nest. Yellow is the hot-path, and red is the hot-spot.
And this shows how easy it is for a juicy speedup opportunity to be in none of those places:
The most valuable thing to look at is a dozen or so random raw stack samples, and relating them to the source code.
(That means bypassing the back-end of the profiler.)
ADDED: Just to show what I mean, I simulated ten stack samples from the call graph above, and here's what I found
3/10 samples are calling class_exists, one for the purpose of getting the class name, and two for the purpose of setting up a local configuration. class_exists calls autoload which calls requireFile, and two of those call adminpanel. If this can be done more directly, it could save about 30%.
2/10 samples are calling determineId, which calls fetch_the_id which calls getPageAndRootlineWithDomain, which calls three more levels, terminating in sql_fetch_assoc. That seems like a lot of trouble to go through to get an ID, and it's costing about 20% of time, and that's not counting I/O.
So the stack samples don't just tell you how much inclusive time a function or line of code costs, they tell you why it's being done, and what possible silliness it takes to accomplish it.
I often see this - galloping generality - swatting flies with hammers, not intentionally, but just following good modular design.
ADDED: Another thing not to get sucked into is flame graphs.
For example, here is a flame graph (rotated right 90 degrees) of the ten simulated stack samples from the call graph above. The routines are all numbered, rather than named, but each routine has its own color.
Notice the problem we identified above, with class_exists (routine 219) being on 30% of the samples, is not at all obvious by looking at the flame graph.
More samples and different colors would make the graph look more "flame-like", but does not expose routines which take a lot of time by being called many times from different places.
Here's the same data sorted by function rather than by time.
That helps a little, but doesn't aggregate similarities called from different places:
Once again, the goal is to find the problems that are hiding from you.
Anyone can find the easy stuff, but the problems that are hiding are the ones that make all the difference.
ADDED: Another kind of eye-candy is this one:
where the black-outlined routines could all be the same, just called from different places.
The diagram doesn't aggregate them for you.
If a routine has high inclusive percent by being called a large number of times from different places, it will not be exposed.

profiler for c++ code, very sleepy

I'm a newbie at profiling. I'd like to optimize my code to satisfy timing constraints. I use Visual C++ 08 Express and thus had to download a profiler; for me that's Very Sleepy. I did some searching but found no decent tutorial on Sleepy, so here is my question:
How do I use it properly? I grasped the general idea of profiling, so I sorted according to %exclusive to find my bottlenecks. At the top of this list I have ZwWaitForSingleObject, RtlEnterCriticalSection, operator new, RtlLeaveCriticalSection, printf, some iterators... and only after they account for something like 60% does my first function appear, the first entry with child calls. Can someone explain why the above-mentioned functions show up, what they mean, and how I can optimize my code if I have no access to this critical 60% (their "source file" is listed as unknown)?
Also, for my own functions I would expect to get a time for each line, but that's not the case; e.g. some arithmetic and some function calls have no timing at all (and they are not nested in unused "if" clauses).
And one last thing: how do I find out that some line executes very fast but is called thousands of times, making it the actual bottleneck?
Finally, is Sleepy any good, or is there a free alternative for my platform?
Help very appreciated!
cheers!
UPDATE - - - - -
I have found another version of the profiler, called plain Sleepy. It shows how many times some snippet was called, plus the line number (I guess it points to the critical one). So in my case KiFastSystemCallRet takes 50%! That means it is waiting for some data, right? How can I improve that? Is there a decent approach to trace what causes these calls and eventually remove or change them?
I'd like to optimize my code to satisfy timing constraints
You're running smack into a persistent issue in this business.
You want to find ways to make your code take less time, and you (and many people) assume (and have been taught) the only way to do that is by taking various sorts of measurements.
There's a minority view, and the only thing it has to recommend it is actual significant results (plus an ironclad theory behind it).
If you've got a "bottleneck" (and you do, probably several), it's taking some fraction of time, like 30%.
Just treat it as a bug to be found.
Randomly halt the program with the pause button, and look carefully to see what the program is doing and why it's doing it.
Ask if it's something that could be gotten rid of.
Do this 10 times. On average you will see the problem on 3 of the pauses.
Any activity you see more than once, if it's not truly necessary, is a speed bug.
This does not tell you precisely how much the problem costs, but it does tell you precisely what the problem is, and that it's worth fixing.
You'll see things this way that no profiler can find, because profilers are only programs, and cannot be broad-minded about what constitutes an opportunity.
Some folks are risk-averse, thinking it might not give enough speedup to be worth it.
Granted, there is a small chance of a low payoff, but it's like investing.
The theory says on average it will be worthwhile, and there's also a small chance of a high payoff.
In any case, if you're worried about the risks, a few more samples will settle your fears.
After you fix the problem, the remaining bottlenecks each take a larger percent, because they didn't get smaller but the overall program did.
So they will be easier to find when you repeat the whole process.
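As a rough illustration of how those repeated fixes compound (the starting time and fractions below are made up):

```cpp
// Each fix removes a fraction of the *current* run time, so successive fixes
// multiply: what remains after each round is (1 - f) of the previous total.
#include <cstdio>

int main() {
    double time = 10.0;                                   // assumed starting time, seconds
    const double fractions[] = {0.30, 0.25, 0.20, 0.15};  // assumed costs of successive fixes
    for (double f : fractions) {
        time *= (1.0 - f);
        std::printf("after removing a %2.0f%% problem: %.2f s (total speedup %.2fx)\n",
                    100 * f, time, 10.0 / time);
    }
}
```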
There's lots of literature about profiling, but very little that actually says how much speedup it achieves in practice.
Here's a concrete example with almost 3 orders of magnitude speedup.
I've used GlowCode (a commercial product, similar to Sleepy) for profiling native C++ code. You run the instrumenting process, then execute your program, then look at the data produced by the tool. The instrumenting step injects a little trace function at every method's entry and exit points, and simply measures how long each function takes to run to completion.
Using the call graph profiling tool, I listed the methods sorted from "most time used" to "least time used", and the tool also displays a call count. Simply drilling into the highest percentage routine showed me which methods were using the most time. I could see that some methods were very slow, but drilling into them I discovered they were waiting for user input, or for a service to respond. And some took a long time because they were calling some internal routines thousands of times each invocation. We found someone made a coding error and was walking a large linked list repeatedly for each item in the list, when they really only needed to walk it once.
If you sort by "most frequently called" to "least called", you can see some of the tiny functions that get called from everywhere (iterator methods like next(), etc.) Something to check for is to make sure the functions that are called the most often are really clean. Saving a millisecond in a routine called 500 times to paint a screen will speed that screen up by half a second. This helps you decide which are the most important places to spend your efforts.
I've seen two common approaches to using profiling. One is to do some "general" profiling, running through a suite of "normal" operations, and discovering which methods are slowing the app down the most. The other is to do specific profiling, focusing on specific user complaints about performance, and running through those functions to reveal their issues.
One thing I would caution you about is to limit your changes to those that will measurably impact the users' experience or system throughput. Shaving one millisecond off a mouse click won't make a difference to the average user, because human reaction time simply isn't that fast. Race car drivers have reaction times in the 8 millisecond range, some elite twitch gamers are even faster, but normal users like bank tellers will have reaction times in the 20-30 millisecond range. The benefits would be negligible.
Making twenty 1-millisecond improvements or one 20-millisecond change will make the system a lot more responsive. It's cheaper and better if you can do the single big improvement over the many small improvements.
Similarly, shaving one millisecond off a service that handles 100 users per second will make a 10% improvement, meaning you could improve the service to handle 110 users per second.
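Spelled out (illustrative only; the figures round to the ~110 users per second mentioned above):

```cpp
// At 100 requests/second each request has a ~10 ms budget; shaving 1 ms raises
// the ceiling to roughly 111 requests/second, about a 10% improvement.
#include <cstdio>

int main() {
    double per_request_ms = 1000.0 / 100.0;   // 100 requests/s -> 10 ms each
    double improved_ms = per_request_ms - 1.0;
    std::printf("new ceiling: ~%.0f requests/s (~%.0f%% more)\n",
                1000.0 / improved_ms, 100.0 * (per_request_ms / improved_ms - 1.0));
}
```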
The reason for concern is that coding changes strictly to improve performance often negatively impact your code's structure by adding complexity. Let's say you decided to improve a call to a database by caching results. How do you know when the cache goes invalid? Do you add a cache cleaning mechanism? Consider a financial transaction where looping through all the line items to produce a running total is slow, so you decide to keep a runningTotal accumulator to answer faster. You now have to modify the runningTotal for all kinds of situations like line voids, reversals, deletions, modifications, quantity changes, etc. It makes the code more complex and more error-prone.

Inaccuracy in gprof output

I am trying to profile a C++ function using gprof; I am interested in the %time taken. I did more than one run, and for some reason I got a large difference in the results. I don't know what is causing this; I assume it is the sampling rate, or perhaps, as I read in other posts, I/O has something to do with it. So is there a way to make it more accurate and generate nearly constant results?
I was thinking of the following:
increase the sampling rate
flush the caches before executing anything
use another profiler, but I want it to generate results in a format similar to gprof's (function, time%, function name). I tried Valgrind, but it gave me a massively large file, so maybe I am generating it with the wrong command.
Waiting for your input
Regards
I recommend printing a copy of the gprof paper and reading it carefully.
According to the paper, here's how gprof measures time. It samples the PC, and it counts how many samples land in each routine. Multiplied by the time between samples, that is each routine's total self time.
It also records in a table, by call site, how many times routine A calls routine B, assuming routine B is instrumented by the -pg option. By summing those up, it can tell how many times routine B was called.
Starting from the bottom of the call tree (where total time = self time), it assumes the average time per call of each routine is its total time divided by the number of calls.
Then it works back up to each caller of those routines. The time of each routine is its average self time plus the average number of calls to each subordinate routine times the average time of the subordinate routine.
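Here is a rough sketch of that bottom-up propagation with made-up numbers (this is not gprof's actual code, just the arithmetic it describes):

```cpp
// A toy three-level call chain: main -> work -> leaf. Average time per call is
// estimated total time divided by call count; a caller's total time is its self
// time plus (calls to each callee) * (that callee's average time per call).
#include <cstdio>

int main() {
    // leaf: 400 ms of sampled self time over 100 calls -> 4 ms per call on average
    double leaf_total = 400.0;
    double leaf_avg   = leaf_total / 100;

    // work: 300 ms self time, makes all 100 calls to leaf, is itself called 5 times
    double work_total = 300.0 + 100 * leaf_avg;      // 700 ms
    double work_avg   = work_total / 5;              // 140 ms per call

    // main: 50 ms self time, makes the 5 calls to work
    double main_total = 50.0 + 5 * work_avg;         // 750 ms

    std::printf("inclusive estimates: leaf %.0f ms, work %.0f ms, main %.0f ms\n",
                leaf_total, work_total, main_total);
}
```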
You can see, even if recursions (cycles in the call graph) are not present, how this is fraught with possibilities for errors, such as assumptions about average times and average numbers of calls, and assumptions about subroutines being instrumented, which the authors point out. If there are recursions, they basically say "forget it".
All of this technology, even if it weren't problematic, raises the question: what is its purpose? Usually, the purpose is "finding bottlenecks". According to the paper, it can help people evaluate alternative implementations. That's not finding bottlenecks. They do recommend looking at routines that seem to be called a lot of times, or that have high average times. Certainly routines with low average cumulative time should be ignored, but that doesn't localize the problem very much. And it completely ignores I/O, as if all the I/O that is done is unquestionably necessary.
So, to try to answer your question, try Zoom, for one, and don't expect to eliminate statistical noise in measurements.
gprof is a venerable tool, simple and rugged, but the problems it had in the beginning are still there, and far better tools have come along in the intervening decades.
Here's a list of the issues.
gprof is not very accurate, particularly for small functions, see http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC11
If this is Linux then I recommend a profiler that doesn't require the code to be instrumented, e.g. Zoom - you can get a free 30 day evaluation license, after that it costs money.
All sampling profilers suffer from statistical inaccuracies - if the error is too large then you need to sample for longer and/or with a smaller sampling interval.

Tools to evaluate callgrind's call profiles?

Somehow related to this question, which tool would you recommend to evaluate the profiling data created with callgrind?
It does not have to have a graphical interface, but it should prepare the results in a concise, clear and easy-to-interpret way. I know about e.g. kcachegrind, but this program is missing some features such as data export of the tables shown or simply copying lines from the display.
Years ago I wrote a profiler to run under DOS.
If you are using KCacheGrind here's what I would have it do. It might not be too difficult to write it, or you can just do it by hand.
KCacheGrind has a toolbar button "Force Dump", with which you can trigger a dump manually at a random time. The capture of stack traces at random or pseudo-random times, in the interval when you are waiting for the program, is the heart of the technique.
Not many samples are needed - 20 is usually more than enough. If a bottleneck costs a large amount, like more than 50%, 5 samples may be quite enough.
The processing of the samples is very simple. Each stack trace consists of a series of lines of code (actually addresses), where all but the last are function/method calls.
Collect a list of all the lines of code that appear on the samples, and eliminate duplicates.
For each line of code, count what fraction of samples it appears on. For example, if you take 20 samples, and the line of code appears on 3 of them, even if it appears more than once in some sample (due to recursion) the count is 3/20 or 15%. That is a direct measure of the cost of each statement.
Display the most costly 100 or so lines of code. Your bottlenecks are in that list.
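If you wanted to script those steps, here is a minimal sketch, assuming each captured sample has already been reduced to a list of "file:line" strings (the sample data is hypothetical):

```cpp
// For each line of code, count the fraction of samples on which it appears at
// least once (duplicates within a sample, e.g. from recursion, count only once),
// then list lines from most to least costly.
#include <algorithm>
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    // Hypothetical captured samples; in practice these come from your dumps.
    std::vector<std::vector<std::string>> samples = {
        {"main.cpp:10", "parse.cpp:42", "alloc.cpp:7"},
        {"main.cpp:10", "parse.cpp:42", "parse.cpp:42"},   // recursion: still one hit
        {"main.cpp:10", "render.cpp:88"},
    };

    std::map<std::string, int> hits;
    for (const auto& sample : samples) {
        std::set<std::string> unique(sample.begin(), sample.end());  // de-duplicate per sample
        for (const auto& line : unique) ++hits[line];
    }

    std::vector<std::pair<std::string, int>> sorted(hits.begin(), hits.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {
                  return a.second > b.second;
              });

    for (const auto& entry : sorted)
        std::printf("%-14s %5.1f%% of samples\n", entry.first.c_str(),
                    100.0 * entry.second / samples.size());
}
```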
What I typically do with this information is choose a line with high cost, and then manually take stack samples until it appears (or look at the ones I've already got), and ask myself "Why is it doing that line of code, not just in a local sense, but in a global sense." Another way to put it is "What in a global sense was the program trying to accomplish at the time slice when that sample was taken". The reason I ask this is because that tells me if it was really necessary to be spending what that line is costing.
I don't want to be critical of all the great work people do developing profilers, but sadly there is a lot of firmly entrenched myth on the subject, including:
that precise measuring, with lots of samples, is important. Rather the emphasis should be on finding the bottlenecks. Precise measurement is not a prerequisite for that. For typical bottlenecks, costing between 10% and 90%, the measurement can be quite coarse.
that functions matter more than lines of code. If you find a costly function, you still have to search within it for the lines that are the bottleneck. That information is right there, in the stack traces - no need to hunt for it.
that you need to distinguish CPU from wall-clock time. If you're waiting for it, it's wall clock time (wrist-watch time?). If you have a bottleneck consisting of extraneous I/O, for example, do you want to ignore that because it's not CPU time?
that the distinction between exclusive time and inclusive time is useful. That only makes sense if you're timing functions and you want some clue as to whether the time is spent in the function itself rather than in its callees. If you look at lines of code, the only thing that matters is inclusive time. Another way to put it is, every instruction is a call instruction, even if it only calls microcode.
that recursion matters. It is irrelevant, because it doesn't affect the fraction of samples a line is on and is therefore responsible for.
that the invocation count of a line or function matters. Whether it's fast and is called too many times, or slow and called once, the cost is the percent of time it uses, and that's what the stack samples estimate.
that performance of sampling matters. I don't mind taking a stack sample and looking at it for several minutes before continuing, assuming that doesn't make the bottlenecks move.
Here's a more complete explanation.
There are some CLI tools for working with callgrind data:
callgrind_annotate
cg_annotate (a cachegrind tool, which can also show some information from a callgrind.out file)