How to organize data (writing your own profiler) - profiling

I was thinking about using reflection to generate a profiler. Lets say I am generating code without a problem; how do I properly measure or organize the results? I'm mostly concerned about CPU time but memory suggestions are welcome

There are lots of bad ways to write profilers.
I wrote what I thought was a pretty good one over 20 years ago.
That is, it made a decent demo, but when it came down to serious performance tuning I concluded there really is nothing that works better, and gives better results, than the dumb old manual method, and here's why.
Anyway, if you're writing a profiler, here's what I think it should do:
It should sample the stack at unpredictable times, and each stack sample should contain line number information, not just functions, in the code being tuned. It's not so important to have that in system functions you can't edit.
It should be able to sample during blocked time like I/O, sleeps, and locking, because those are just as likely to result in slowness as CPU operations.
It should have a hot-key that the user can use, to enable the sampling during the times they actually care about (like not when waiting for the user to do something).
Do not assume it is necessary to get measurement precision, necessitating a large number of frequent samples. This is incredibly basic, and it is a major reversal of common wisdom. The reason is simple - it doesn't do any good to measure problems if the price you pay is failure to find them.
That's what happens with profilers - speedups hide from them, so the user is content with finding maybe one or two small speedups while giant ones get away.
Giant speedups are the ones that take a large percentage of time, and the number of stack samples it takes to find them is inversely proportional to the time they take. If the program spends 30% of its time doing something avoidable, it takes (on average) 2/0.3 = 6.67 samples before it is seen twice, and that's enough to pinpoint it.
To answer your question, if the number of samples is small, it really doesn't matter how you store them. Print them to a file if you like - whatever.
It doesn't have to be fast, because you don't sample while you're saving a sample.
What does allow those speedups to be found is when the user actually looks at and understands individual samples. Profilers have all kinds of UI - hot spots, call counts, hot paths, call graphs, call trees, flame graphs, phony 3-digit "statistics", blah, blah.
Even if it's well done, that's only timing information.
It doesn't tell you why the time is spent, and that's what you need to know.
Make eye candy if you want, but let the user see the actual samples.
... and good luck.
ADDED: A sample looks something like this:
main:27, myFunc:16, otherFunc:9, ..., someFunc;132
That means main is at line 27, calling myFunc. myFunc is at line 16, calling otherFunc, and so on. At the end, it's in someFunc at line 132, not calling anything (or calling something you can't identify).
No need for line ranges.
(If you're tempted to worry about recursion - don't. If the same function shows up more than once in a sample, that's recursion. It doesn't affect anything.)
You don't need a lot of samples.
When I did it, sampling was not automatic at all.
I would just have the user press both shift keys simultaneously, and that would trigger a sample.
So the user would grab like 10 or 20 samples, but it is crucial that the user take the samples during the phase of the program's execution that annoys the user with its slowness,
like between the time some button is clicked and the time the UI responds.
Another way is to have a hot-key that runs sampling on a timer while it is pressed.
If the program is just a command-line app with no user input, it can just sample all the time while it executes.
The frequency of sampling does not have to be fast.
The goal is to get a moderate number of samples during the program phase that is subjectively slow.
If you take too many samples to look at, then when you look at them you need to select some at random.
The thing to do when examining a sample is to look at each line of code in the sample so you can fully understand why the program was spending that instant of time.
If it is doing something that might be avoided,
and if you see a similar thing on another sample, you've found a speedup.
How much of a speedup? This much (the math is here):
For example, if you look at three samples, and on two of them you see avoidable code, fixing it will give you a speedup - maybe less, maybe more, but on average 4x.
(That's what I mean by giant speedup. The way you get it is by studying individual samples, not by measuring anything.)
There's a video here.

Related

Cycles consumed in each function through Oprofile

Oprofile works on Sampling Based theory.
Opreport -l option provides us the profiling report in the following way:
samples % image name symbol name
78149 15.0776 cvqa comp_corr.clone.2
With this information I can know the %age of time consumed in consumption. If I do some optimizaion in my code I will again get the report as:
samples % image name symbol name
73179 15.0732 cvqa comp_corr.clone.2
In this report I am not getting how much optimization of cycles has been done so that I can benchmark. How much optimization has been done till now?
Is there any way we can know how much cycles optimization has been done or any other way through which I can bench mark?
I am working on AMD64 bit machine.
Since your real goal is to optimize the program, let me suggest another way to think about it.
The main thing to measure is overall time, not cycles or times of the various routines.
Now, here's how to do optimization. Don't base it on any measurements. Rather, get a number of samples of the program's state and (this is the key point) study each sample closely enough, with your own eyes and brain, and understand what the program is doing in that state, and the full reason why it is doing it.
(You will see anything worth fixing that statistics could reveal, plus things they could not reveal, and that makes all the difference.)
As soon as you catch it in the act of doing, on two or more samples, something that could be removed, fixing it will give you a substantial speedup.
Here is an explanation of why it works and how much speedup you can expect.
After you do that, you can do the overall time measurement again and see how much time you saved.
Then don't stop. Do it again. You'll find something else to fix, which is now a bigger percent because of the first problem you removed.
In my experience, with real software, this can be done as many as 5 or 6 times, after which the program can be orders of magnitude faster than it was originally. The reason is because each optimization removes a fraction of the original execution time, and those fractions can accumulate up to nearly 100%. I'm not aware of any such result achieved with Oprofile or any other profiler tool.

profiler for c++ code, very sleepy

I'm a newbie with profiling. I'd like to optimize my code to satisfy timing constraints. I use Visual C++ 08 Express and thus had to download a profiler, for me it's Very Sleepy. I did some search but found no decent tutorial on Sleepy, and here my question:
How to use it properly? I grasped the general idea of profiling, so I sorted according to %exclusive to find my bottlenecks. Firstly, on the top of this list I have ZwWaitForSingleObject, RtlEnterCriticalSection, operator new, RtlLeaveCriticalSection, printf, some iterators ... and after they take some like 60% there comes my first function, first position with Child Calls. Can someone explain me why above mentioned come out, what do they mean and how can I optimize my code if I have no access to this critical 60%? (for "source file": unknown...).
Also, for my function I'd think I get time for each line, but it's not the case, e.g. arithmetics or some functions have no timing (not nested in unused "if" clauses).
AND last thing: how to find out that some line can execute superfast, but is called thousands times, being the actual bottleneck?
Finally, is Sleepy good? Or some free alternative for my platform?
Help very appreciated!
cheers!
UPDATE - - - - -
I have found another version of profiler, called plain Sleepy. It shows how many times some snippet was called plus the number of line (I guess it points to the critical one). So in my case.. KiFastSystemCallRet takes 50%! It means that it waits for some data right? How to improve that matter, is there maybe a decent approach to trace what causes these multiple calls and eventually remove/change it?
I'd like to optimize my code to satisfy timing constraints
You're running smack into a persistent issue in this business.
You want to find ways to make your code take less time, and you (and many people) assume (and have been taught) the only way to do that is by taking various sorts of measurements.
There's a minority view, and the only thing it has to recommend it is actual significant results (plus an ironclad theory behind it).
If you've got a "bottleneck" (and you do, probably several), it's taking some fraction of time, like 30%.
Just treat it as a bug to be found.
Randomly halt the program with the pause button, and look carefully to see what the program is doing and why it's doing it.
Ask if it's something that could be gotten rid of.
Do this 10 times. On average you will see the problem on 3 of the pauses.
Any activity you see more than once, if it's not truly necessary, is a speed bug.
This does not tell you precisely how much the problem costs, but it does tell you precisely what the problem is, and that it's worth fixing.
You'll see things this way that no profiler can find, because profilers
are only programs, and cannot be broad-minded about what constitutes an opportunity.
Some folks are risk-averse, thinking it might not give enough speedup to be worth it.
Granted, there is a small chance of a low payoff, but it's like investing.
The theory says on average it will be worthwhile, and there's also a small chance of a high payoff.
In any case, if you're worried about the risks, a few more samples will settle your fears.
After you fix the problem, the remaining bottlenecks each take a larger percent, because they didn't get smaller but the overall program did.
So they will be easier to find when you repeat the whole process.
There's lots of literature about profiling, but very little that actually says how much speedup it achieves in practice.
Here's a concrete example with almost 3 orders of magnitude speedup.
I've used GlowCode (commercial product, similar to Sleepy) for profiling native C++ code. You run the instrumenting process, then execute your program, then look at the data produced by the tool. The instrumenting step injects a little trace function at every methods' entrypoints and exitpoints, and simply measures how much time it takes for each function to run through to completion.
Using the call graph profiling tool, I listed the methods sorted from "most time used" to "least time used", and the tool also displays a call count. Simply drilling into the highest percentage routine showed me which methods were using the most time. I could see that some methods were very slow, but drilling into them I discovered they were waiting for user input, or for a service to respond. And some took a long time because they were calling some internal routines thousands of times each invocation. We found someone made a coding error and was walking a large linked list repeatedly for each item in the list, when they really only needed to walk it once.
If you sort by "most frequently called" to "least called", you can see some of the tiny functions that get called from everywhere (iterator methods like next(), etc.) Something to check for is to make sure the functions that are called the most often are really clean. Saving a millisecond in a routine called 500 times to paint a screen will speed that screen up by half a second. This helps you decide which are the most important places to spend your efforts.
I've seen two common approaches to using profiling. One is to do some "general" profiling, running through a suite of "normal" operations, and discovering which methods are slowing the app down the most. The other is to do specific profiling, focusing on specific user complaints about performance, and running through those functions to reveal their issues.
One thing I would caution you about is to limit your changes to those that will measurably impact the users' experience or system throughput. Shaving one millisecond off a mouse click won't make a difference to the average user, because human reaction time simply isn't that fast. Race car drivers have reaction times in the 8 millisecond range, some elite twitch gamers are even faster, but normal users like bank tellers will have reaction times in the 20-30 millisecond range. The benefits would be negligible.
Making twenty 1-millisecond improvements or one 20-millisecond change will make the system a lot more responsive. It's cheaper and better if you can do the single big improvement over the many small improvements.
Similarly, shaving one millisecond off a service that handles 100 users per second will make a 10% improvement, meaning you could improve the service to handle 110 users per second.
The reason for concern is that coding changes strictly to improve performance often negatively impact your code's structure by adding complexity. Let's say you decided to improve a call to a database by caching results. How do you know when the cache goes invalid? Do you add a cache cleaning mechanism? Consider a financial transaction where looping through all the line items to produce a running total is slow, so you decide to keep a runningTotal accumulator to answer faster. You now have to modify the runningTotal for all kinds of situations like line voids, reversals, deletions, modifications, quantity changes, etc. It makes the code more complex and more error-prone.

Inaccuracy in gprof output

I am trying to profile a c++ function using gprof, I am intrested in the %time taken. I did more than one run and for some reason I got a large difference in the results. I don't know what is causing this, I am assuming the sampling rate or I read in other posts that I/O has something to do with it. So is there a way to make it more accurate and generate somehow almost constant results?
I was thinking of the following:
increase the sampling rate
flush the caches before executing anything
use another profiler but I want it to generate results in a similar format to grof as function time% function name, I tried Valgrind but it gave me a massive file in size. So maybe I am generating the file with the wrong command.
Waiting for your input
Regards
I recommend printing a copy of the gprof paper and reading it carefully.
According to the paper, here's how gprof measures time. It samples the PC, and it counts how many samples land in each routine. Multiplied by the time between samples, that is each routine's total self time.
It also records in a table, by call site, how many times routine A calls routine B, assuming routine B is instrumented by the -pg option. By summing those up, it can tell how many times routine B was called.
Starting from the bottom of the call tree (where total time = self time), it assumes the average time per call of each routine is its total time divided by the number of calls.
Then it works back up to each caller of those routines. The time of each routine is its average self time plus the average number of calls to each subordinate routine times the average time of the subordinate routine.
You can see, even if recursions (cycles in the call graph) are not present, how this is fraught with possibilities for errors, such as assumptions about average times and average numbers of calls, and assumptions about subroutines being instrumented, which the authors point out. If there are recursions, they basically say "forget it".
All of this technology, even if it weren't problematic, begs the question - What is it's purpose? Usually, the purpose is "find bottlenecks". According to the paper, it can help people evaluate alternative implementations. That's not finding bottlenecks. They do recommend looking at routines that seem to be called a lot of times, or that have high average times. Certainly routines with low average cumulative time should be ignored, but that doesn't localize the problem very much. And, it completely ignores I/O, as if all I/O that is done is unquestionably necessary.
So, to try to answer your question, try Zoom, for one, and don't expect to eliminate statistical noise in measurements.
gprof is a venerable tool, simple and rugged, but the problems it had in the beginning are still there, and far better tools have come along in the intervening decades.
Here's a list of the issues.
gprof is not very accurate, particularly for small functions, see http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC11
If this is Linux then I recommend a profiler that doesn't require the code to be instrumented, e.g. Zoom - you can get a free 30 day evaluation license, after that it costs money.
All sampling profilers suffer form statistical inaccuracies - if the error is too large then you need to sample for longer and/or with a smaller sampling interval.

How to get the call graph of a program with a bit of profiling information

I want to understand how a C++ program that was given to me works, and where it spends the most time.
For that I tried to use first gprof and then gprof2dot to get the pictures, but the results are sometimes kind of ugly.
How do you usually do this? Can you recommend any better alternatives?
P.D. Which are the open source solutions (preferably for Linux or Mac OS )X?
OProfile on Linux works fairly well, actually i like it better than GProf. There are a couple graphical tools that help visualize OProfile output.
You can try KCachegrind. This is a program that visualizes samples acquired by Valgrind tool called Callgrind. KCachegrind may seem to be not actively maintained, but the graphs he produces are very useful.
In my opinion there are two alternatives (on Windows):
Profilers that change the assembly instructions of your applications (that's called instrumenting) and record every detail. These profilers tend to be slow (applications running about 10 times slower), sometimes hard to set up, and often not-free, but they give you the best performance related information. Look for "Ration Quantity", "AQTime" and "Performance Validator" if you want a profiler of this type.
Profilers that don't instrument the application, but just look at a running application and collect 'samples' of it. These profilers are fast (no performance loss), often easy to set up, and you can find quite some free alternatives. Look for "Very Sleepy" and "Luke Stackwalker" if you want a profiler of this type.
Although I used commercial profilers like Rational Quantity and AQTime in the past, and was very satisfied with the results, I found that the disadvantages (hard to setup, unexplainable crashes, slow performance) outweighed the advantages.
Therefore I switched to the free alternatives and I am mainly using "Very Sleepy" at this moment.
If you want to look at the structure of your application (who calls what, references, call trees, ...) look at "Understand for C/C++". This application investigates your source code and allows you to query almost everything from the application's structure.
See the SD C++ Profiler.
Other answers here suggest that probe-oriented profilers have high overhead (10x). This one does not.
Same answer as ---
EDIT: #Steve suggested I give a less pithy answer.
I hear this all the time - "I want to find out where my program spends its time".
Let me suggest an alternate phrasing - "I want to find out why my program spends its time".
Maybe the difference isn't obvious.
When a program executes an instruction, the reason why it is doing so is encoded in the entire state of the program, including the call stack.
Looking only at the program counter is like trying to see if a taxi ride is necessary by profiling the rotation angle of its wheels.
You need to look at the whole state of the program.
There's another myth I hear all the time - that you need to measure the execution time of methods, to find the "slow" ones.
There are many ways for programs to take more time than they need to than by, say, doing a linear search instead of a binary search in some method, which might be the kind of thing people have in mind.
The way to think about it is this:
There isn't just one thing taking more time than necessary. There probably are several.
Each thing taking time is taking some fraction, like 10%, 50%, 90% or some such number. That means if the wall clock could be stopped during that time, that is how much less time the overall app could take.
You want to find out what those things are, whatever they are. Profilers (samplers) work by taking a lot of shallow samples (PC or call stack) and summarizing them to get measurements. But measurements are not what you need. What you need is finding out what it's doing, from a time perspective. It's better to get a small number of samples, like 10 or 20, and examine (not summarize) them. If some activity takes 20%, 50%, or 90% of the time, then that is the probability you will catch it in the act on each sample, so that is roughly the percent of samples on which you will see it. The important thing is finding out what it is, not getting an accurate measurement of something irrelevant.
So as a way to see what the program is doing, from a time perspective, here's how many people do it.

Tools to evaluate callgrind's call profiles?

Somehow related to this question, which tool would you recommend to evaluate the profiling data created with callgrind?
It does not have to have a graphical interface, but it should prepare the results in a concise, clear and easy-to-interpret way. I know about e.g. kcachegrind, but this program is missing some features such as data export of the tables shown or simply copying lines from the display.
Years ago I wrote a profiler to run under DOS.
If you are using KCacheGrind here's what I would have it do. It might not be too difficult to write it, or you can just do it by hand.
KCacheGrind has a toolbar button "Force Dump", with which you can trigger a dump manually at a random time. The capture of stack traces at random or pseudo-random times, in the interval when you are waiting for the program, is the heart of the technique.
Not many samples are needed - 20 is usually more than enough. If a bottleneck costs a large amount, like more than 50%, 5 samples may be quite enough.
The processing of the samples is very simple. Each stack trace consists of a series of lines of code (actually addresses), where all but the last are function/method calls.
Collect a list of all the lines of code that appear on the samples, and eliminate duplicates.
For each line of code, count what fraction of samples it appears on. For example, if you take 20 samples, and the line of code appears on 3 of them, even if it appears more than once in some sample (due to recursion) the count is 3/20 or 15%. That is a direct measure of the cost of each statement.
Display the most costly 100 or so lines of code. Your bottlenecks are in that list.
What I typically do with this information is choose a line with high cost, and then manually take stack samples until it appears (or look at the ones I've already got), and ask myself "Why is it doing that line of code, not just in a local sense, but in a global sense." Another way to put it is "What in a global sense was the program trying to accomplish at the time slice when that sample was taken". The reason I ask this is because that tells me if it was really necessary to be spending what that line is costing.
I don't want to be critical of all the great work people do developing profilers, but sadly there is a lot of firmly entrenched myth on the subject, including:
that precise measuring, with lots of samples, is important. Rather the emphasis should be on finding the bottlenecks. Precise measurement is not a prerequisite for that. For typical bottlenecks, costing between 10% and 90%, the measurement can be quite coarse.
that functions matter more than lines of code. If you find a costly function, you still have to search within it for the lines that are the bottleneck. That information is right there, in the stack traces - no need to hunt for it.
that you need to distinguish CPU from wall-clock time. If you're waiting for it, it's wall clock time (wrist-watch time?). If you have a bottleneck consisting of extraneous I/O, for example, do you want to ignore that because it's not CPU time?
that the distinction between exclusive time and inclusive time is useful. That only makes sense if you're timing functions and you want some clue whether the time is spent not in callees. If you look at lines of code, the only thing that matters is inclusive time. Another way to put it is, every instruction is a call instruction, even if it only calls microcode.
that recursion matters. It is irrelevant, because it doesn't affect the fraction of samples a line is on and is therefore responsible for.
that the invocation count of a line or function matters. Whether it's fast and is called too many times, or slow and called once, the cost is the percent of time it uses, and that's what the stack samples estimate.
that performance of sampling matters. I don't mind taking a stack sample and looking at it for several minutes before continuing, assuming that doesn't make the bottlenecks move.
Here's a more complete explanation.
There are some CLI tools for working with callgrind data:
callgrind_annotate
and cachegrind tool which can show some information from callgrind.out
cg_annotate