Recommended number of samples for performance benchmarks? - profiling

I'm writing performance benchmarks for some of my code. This is both to compare my own implementations as I develop/experiment, and to compare against "competing" implementations. I have no problem writing these, and getting usable results.
It's very well established that more samples are a good thing, as they reduce the impact of erroneous data and give a more accurate result.
So, if I'm profiling a given function/procedure/whatever, how many samples does it seem reasonable to get?
I'm currently doing about 1 million samples for each test. These are individual operations; the results rarely take longer than 10s per item, even on an old laptop, and most are under a hundredth of a second.

Actually, it is not well established that more samples are a good thing.
It is nothing more than common wisdom.
I think you are sharing in a general confusion about the reason for profiling: whether the purpose is to measure performance or to find speedups.
For measuring performance, you don't need samples at all.
What you need is a stopwatch, whether in software or not.
If your process runs too quickly for the resolution of the stopwatch, just run your process 10^3 or 10^6 times, measure it, and divide by that number.
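For example, a minimal sketch of that stopwatch-and-divide approach in C++, where operation_under_test and the repetition count are placeholders for whatever you are actually measuring:
#include <chrono>
#include <cmath>
#include <cstdio>

volatile double sink;                               // keeps the call from being optimized away
void operation_under_test() { sink = std::sqrt(12345.6789); }   // placeholder for your real code

int main() {
    const long reps = 1000000;                      // enough repetitions to swamp the timer's resolution
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < reps; ++i)
        operation_under_test();
    auto stop = std::chrono::steady_clock::now();
    double total = std::chrono::duration<double>(stop - start).count();
    std::printf("total %.4f s, per call %.9f s\n", total, total / reps);
    return 0;
}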
For finding speedups, sampling the call stack is very effective, provided the samples contain line-level or instruction-level call site information.
How many samples do you need?
Well, if you see it doing something that could be removed on one sample, that probably doesn't mean much.
But if you see it on two samples, that suggests it is costing a time fraction F of roughly 2/N, where N is the number of samples.
Example: if you see it twice in 10 samples, that means it costs roughly 20% of the time.
In general, if the speedup is going to save you fraction F of time, it takes on average 2/F samples to see it twice.
Example: if it is going to save 30% of time (F = 0.3) you need on average 2/0.3 = 6.67 samples to see it twice.
Of course, if you see it more than twice, all the better.
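For what it's worth, that 2/F figure is just the mean of a negative-binomial wait: each sample independently lands on the removable code with probability F, so the expected number of samples until the first hit is 1/F, and likewise for the gap to the second hit:
E[N] = E[N_1] + E[N_2] = \frac{1}{F} + \frac{1}{F} = \frac{2}{F}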
Bottom line, for finding speedups, you don't need a lot of samples.
What you do need is to examine each one for activity that could be removed.
What you don't need is to mush them together into "statistics" (like most profilers do).
Not many people understand this.
If you want a bit more rigorous explanation, look here.

Related

How to organize data (writing your own profiler)

I was thinking about using reflection to generate a profiler. Let's say I am generating code without a problem; how do I properly measure or organize the results? I'm mostly concerned about CPU time, but suggestions about memory are welcome.
There are lots of bad ways to write profilers.
I wrote what I thought was a pretty good one over 20 years ago.
That is, it made a decent demo, but when it came down to serious performance tuning I concluded there really is nothing that works better, and gives better results, than the dumb old manual method, and here's why.
Anyway, if you're writing a profiler, here's what I think it should do:
It should sample the stack at unpredictable times, and each stack sample should contain line number information, not just functions, in the code being tuned. It's not so important to have that in system functions you can't edit.
It should be able to sample during blocked time like I/O, sleeps, and locking, because those are just as likely to result in slowness as CPU operations.
It should have a hot-key that the user can use, to enable the sampling during the times they actually care about (like not when waiting for the user to do something).
Do not assume it is necessary to get measurement precision, and therefore a large number of frequent samples. This is incredibly basic, and it is a major reversal of common wisdom. The reason is simple - it doesn't do any good to measure problems if the price you pay is failure to find them.
That's what happens with profilers - speedups hide from them, so the user is content with finding maybe one or two small speedups while giant ones get away.
Giant speedups are the ones that take a large percentage of time, and the number of stack samples it takes to find them is inversely proportional to the time they take. If the program spends 30% of its time doing something avoidable, it takes (on average) 2/0.3 = 6.67 samples before it is seen twice, and that's enough to pinpoint it.
To answer your question, if the number of samples is small, it really doesn't matter how you store them. Print them to a file if you like - whatever.
It doesn't have to be fast, because you don't sample while you're saving a sample.
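To show how little machinery this takes, here is a rough sketch of such a sampler on Linux - assuming glibc's backtrace facility, a wall-clock timer signal (so blocked time gets sampled, per the point above), and a scratch file name of my choosing. It dumps raw symbolized frames; you would resolve addresses to file:line afterwards with something like addr2line. A sketch, not a finished profiler:
#include <execinfo.h>   // backtrace, backtrace_symbols_fd (glibc)
#include <signal.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#include <cmath>

static int sample_fd = -1;

// Signal handler: capture the current call stack and append it to the file.
// backtrace_symbols_fd writes straight to the fd, so no malloc in the handler.
static void take_sample(int) {
    void *frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, sample_fd);
    write(sample_fd, "----\n", 5);                 // separator between samples
}

static void start_sampling() {
    sample_fd = open("stack_samples.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    void *warm[1];
    backtrace(warm, 1);                            // force libgcc to load before the first signal
    struct sigaction sa = {};
    sa.sa_handler = take_sample;
    sigaction(SIGALRM, &sa, nullptr);              // SIGALRM runs on wall-clock time,
    struct itimerval tv = {};                      // so blocked/I-O time is sampled too
    tv.it_interval.tv_usec = 100000;               // every 0.1 s - no need to be fast
    tv.it_value.tv_usec = 100000;
    setitimer(ITIMER_REAL, &tv, nullptr);
}

int main() {
    start_sampling();
    double x = 0;                                  // stand-in for the real workload
    for (long i = 0; i < 100000000L; ++i) x += std::sqrt((double)i);
    volatile double result = x; (void)result;
    return 0;
}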
What does allow those speedups to be found is when the user actually looks at and understands individual samples. Profilers have all kinds of UI - hot spots, call counts, hot paths, call graphs, call trees, flame graphs, phony 3-digit "statistics", blah, blah.
Even if it's well done, that's only timing information.
It doesn't tell you why the time is spent, and that's what you need to know.
Make eye candy if you want, but let the user see the actual samples.
... and good luck.
ADDED: A sample looks something like this:
main:27, myFunc:16, otherFunc:9, ..., someFunc:132
That means main is at line 27, calling myFunc. myFunc is at line 16, calling otherFunc, and so on. At the end, it's in someFunc at line 132, not calling anything (or calling something you can't identify).
No need for line ranges.
(If you're tempted to worry about recursion - don't. If the same function shows up more than once in a sample, that's recursion. It doesn't affect anything.)
You don't need a lot of samples.
When I did it, sampling was not automatic at all.
I would just have the user press both shift keys simultaneously, and that would trigger a sample.
So the user would grab 10 or 20 samples, but it is crucial that they be taken during the phase of the program's execution that annoys the user with its slowness,
like between the time some button is clicked and the time the UI responds.
Another way is to have a hot-key that runs sampling on a timer while it is pressed.
If the program is just a command-line app with no user input, it can just sample all the time while it executes.
The frequency of sampling does not have to be fast.
The goal is to get a moderate number of samples during the program phase that is subjectively slow.
If you take too many samples to look at, then when you look at them you need to select some at random.
The thing to do when examining a sample is to look at each line of code in the sample so you can fully understand why the program was spending that instant of time.
If it is doing something that might be avoided,
and if you see a similar thing on another sample, you've found a speedup.
How much of a speedup? This much (the math is here):
For example, if you look at three samples, and on two of them you see avoidable code, fixing it will give you a speedup - maybe less, maybe more, but on average 4x.
(That's what I mean by giant speedup. The way you get it is by studying individual samples, not by measuring anything.)
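In case you want the arithmetic behind that "on average 4x": assuming a uniform prior on the unknown fraction of time F that the avoidable code is responsible for, seeing it on 2 of 3 samples gives a Beta(3, 2) posterior, and the expected speedup ratio is
E\!\left[\frac{1}{1-F}\right] = \int_0^1 \frac{12\,f^2(1-f)}{1-f}\,df = \int_0^1 12\,f^2\,df = 4 .
The posterior mean of F itself is only 3/5, but large values of F weigh heavily in the speedup ratio, which is why the average comes out higher than the naive 1/(1 - 0.6) = 2.5x.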
There's a video here.

Cycles consumed in each function through Oprofile

Oprofile works on a sampling-based approach.
The opreport -l option provides the profiling report in the following way:
samples % image name symbol name
78149 15.0776 cvqa comp_corr.clone.2
With this information I can see the percentage of time consumed in each function. If I do some optimization in my code I will again get the report as:
samples % image name symbol name
73179 15.0732 cvqa comp_corr.clone.2
From this report I cannot tell how many cycles have been saved, so I cannot benchmark the improvement. How much optimization has been done so far?
Is there any way to know how many cycles have been saved, or any other way I can benchmark this?
I am working on an AMD64 machine.
Since your real goal is to optimize the program, let me suggest another way to think about it.
The main thing to measure is overall time, not cycles or times of the various routines.
Now, here's how to do optimization. Don't base it on any measurements. Rather, get a number of samples of the program's state and (this is the key point) study each sample closely enough, with your own eyes and brain, and understand what the program is doing in that state, and the full reason why it is doing it.
(You will see anything worth fixing that statistics could reveal, plus things they could not reveal, and that makes all the difference.)
As soon as you catch it in the act of doing, on two or more samples, something that could be removed, fixing it will give you a substantial speedup.
Here is an explanation of why it works and how much speedup you can expect.
After you do that, you can do the overall time measurement again and see how much time you saved.
Then don't stop. Do it again. You'll find something else to fix, which is now a bigger percent because of the first problem you removed.
In my experience, with real software, this can be done as many as 5 or 6 times, after which the program can be orders of magnitude faster than it was originally. The reason is because each optimization removes a fraction of the original execution time, and those fractions can accumulate up to nearly 100%. I'm not aware of any such result achieved with Oprofile or any other profiler tool.
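To make that concrete with invented numbers: suppose four successive fixes remove 40%, 25%, 15%, and 10% of the original run time. Then
\text{time left} = 1 - (0.40 + 0.25 + 0.15 + 0.10) = 0.10, \qquad \text{overall speedup} = \frac{1}{0.10} = 10\times .
Note how the later fixes look bigger as you go: the fix worth 10% of the original time, taken last, is 0.10 / 0.20 = 50% of what remains by then, so it shows up on about half of a fresh round of samples.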

Which is the most reliable profiling tool gprof or kcachegrind?

Profiling some C++ number-crunching code with both gprof and kcachegrind gives similar results for the functions that contribute most to the execution time (50-80% depending on input), but for functions between 10-30% these two tools give different results. Does that mean one of them is not reliable? What would you do here?
gprof is actually quite primitive. Here's what it does.
1) It samples the program counter at a constant rate and records how many samples land in each function (exclusive time).
2) It counts how many times any function A calls any function B.
From that it can find out how many times each function was called in total, and what its average exclusive time was.
To get average inclusive time of each function it propagates exclusive time upward in the call graph.
If you're expecting this to have some kind of accuracy, you should be aware of some issues.
First, it only counts CPU-time-in-process, meaning it is blind to I/O or other system calls.
Second, recursion confuses it.
Third, the premise that functions always adhere to an average run time, no matter when they are called or who calls them, is very suspect.
Fourth, the notion that functions (and their call graph) are what you need to know about, rather than lines of code, is simply a popular assumption, nothing more.
Fifth, the notion that accuracy of measurement is even relevant to finding "bottlenecks" is also just a popular assumption, nothing more.
Callgrind can work at the level of lines - that's good. Unfortunately it shares the other problems.
If your goal is to find "bottlenecks" (as opposed to getting general measurements), you should take a look at wall-clock time stack samplers that report percent-by-line, such as Zoom.
The reason is simple but possibly unfamiliar.
Suppose you have a program with a bunch of functions calling each other that takes a total of 10 seconds. Also, there is a sampler that samples, not just the program counter, but the entire call stack, and it does it all the time at a constant rate, like 100 times per second. (Ignore other processes for now.)
So at the end you have 1000 samples of the call stack.
Pick any line of code L that appears on more than one of them.
Suppose you could somehow optimize that line, by avoiding it, removing it, or passing it off to a really really fast processor.
What would happen to those samples?
Since that line of code L now takes (essentially) no time at all, no sample can hit it, so those samples would just disappear, reducing the total number of samples, and therefore the total time!
In fact the overall time would be reduced by the fraction of time L had been on the stack, which is roughly the fraction of samples that contained it.
I don't want to get too statistical, but many people think you need a lot of samples, because they think accuracy of measurement is important.
It isn't, if the reason you're doing this is to find out what to fix to get speedup.
The emphasis is on finding what to fix, not on measuring it.
Line L is on the stack some fraction F of the time, right?
So each sample has a probability F of hitting it, right? Just like flipping a coin.
There is a theory of this, called the Rule of Succession.
It says that (under simplifying but general assumptions), if you flip a coin N times, and see "heads" S times, you can estimate the fairness of the coin F as (on average) (S+1)/(N+2).
So, if you take as few as three samples, and see L on two of them, do you know what F is? Of course not.
But you do know on average it is (2+1)/(3+2) or 60%.
So that's how much time you could save (on average) by "optimizing away" line L.
And, of course, the stack samples showed you exactly where line L (the "bottleneck"**) is.
Did it really matter that you didn't measure it to two or three decimal places?
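For reference, the rule is just the mean of the Bayesian posterior: with a uniform prior on F, after seeing L on S out of N samples the posterior for F is Beta(S+1, N-S+1), so
E[F \mid S, N] = \frac{S+1}{(S+1) + (N-S+1)} = \frac{S+1}{N+2} .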
BTW, it is immune to all the other problems mentioned above.
**I keep putting quotes around "bottleneck" because what makes most software slow has nothing in common with the neck of a bottle.
A better metaphor is a "drain" - something that just needlessly wastes time.
gprof's timing data is statistical (read about it in the details-of-profiling section of the docs).
On the other hand, KCacheGrind uses valgrind which actually interprets all the code.
So KCacheGrind can be "more accurate" (at the expense of more overhead) if the CPU modeled by valgrind is close to your real CPU.
Which one to choose also depends on what type of overhead you can handle. In my experience, gprof adds less runtime overhead (execution time, that is), but it is more intrusive (i.e. -pg adds code to each and every one of your functions). So depending on the situation, one or the other is more appropriate.
For "better" gprof data, run your code longer (and on as wide a range of test data you can). The more you have, the better the measurements will be statistically.

Inaccuracy in gprof output

I am trying to profile a C++ function using gprof; I am interested in the % time taken. I did more than one run, and for some reason I got a large difference in the results. I don't know what is causing this; I am assuming it is the sampling rate, or I read in other posts that I/O has something to do with it. So is there a way to make it more accurate and somehow generate almost constant results?
I was thinking of the following:
increase the sampling rate
flush the caches before executing anything
use another profiler, but I want it to generate results in a similar format to gprof (function, time %, function name). I tried Valgrind, but it gave me a massive output file, so maybe I am generating the file with the wrong command.
Waiting for your input
Regards
I recommend printing a copy of the gprof paper and reading it carefully.
According to the paper, here's how gprof measures time. It samples the PC, and it counts how many samples land in each routine. Multiplied by the time between samples, that is each routine's total self time.
It also records in a table, by call site, how many times routine A calls routine B, assuming routine B is instrumented by the -pg option. By summing those up, it can tell how many times routine B was called.
Starting from the bottom of the call tree (where total time = self time), it assumes the average time per call of each routine is its total time divided by the number of calls.
Then it works back up to each caller of those routines. The time of each routine is its average self time plus the average number of calls to each subordinate routine times the average time of the subordinate routine.
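A made-up example of that propagation step: suppose routine B collects 200 PC samples at 10 ms apiece, and the call-count table says it was called 100 times in total, 30 of them from A. Then gprof computes
t_B = 200 \times 0.01\,\mathrm{s} = 2\,\mathrm{s}, \qquad \bar{t}_B = \frac{2\,\mathrm{s}}{100} = 0.02\,\mathrm{s/call}, \qquad t_{A \leftarrow B} = 30 \times 0.02\,\mathrm{s} = 0.6\,\mathrm{s},
and charges A with 0.6 s of inclusive time from B, whether A's 30 calls were actually the cheap ones or the expensive ones.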
You can see, even if recursions (cycles in the call graph) are not present, how this is fraught with possibilities for errors, such as assumptions about average times and average numbers of calls, and assumptions about subroutines being instrumented, which the authors point out. If there are recursions, they basically say "forget it".
All of this technology, even if it weren't problematic, raises the question - what is its purpose? Usually, the purpose is "find bottlenecks". According to the paper, it can help people evaluate alternative implementations. That's not finding bottlenecks. They do recommend looking at routines that seem to be called a lot of times, or that have high average times. Certainly routines with low average cumulative time should be ignored, but that doesn't localize the problem very much. And, it completely ignores I/O, as if all the I/O that is done is unquestionably necessary.
So, to try to answer your question, try Zoom, for one, and don't expect to eliminate statistical noise in measurements.
gprof is a venerable tool, simple and rugged, but the problems it had in the beginning are still there, and far better tools have come along in the intervening decades.
Here's a list of the issues.
gprof is not very accurate, particularly for small functions, see http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC11
If this is Linux then I recommend a profiler that doesn't require the code to be instrumented, e.g. Zoom - you can get a free 30 day evaluation license, after that it costs money.
All sampling profilers suffer from statistical inaccuracies - if the error is too large then you need to sample for longer and/or with a smaller sampling interval.

Predict C++ program running time

How can I predict a C++ program's running time, if the program executes different functions (working with a database, reading files, parsing XML, and others)? How do installers do it?
They do not predict the time. They calculate the number of operations completed out of the total number of operations.
You can predict the time by using measurement and estimation. Of course the quality of the predictions will differ. And BTW: The word "predict" is correct.
You split the workload into small tasks, and create an estimation rule for each task, e.g.: if copying files one to ten took 10s, then the remaining 90 files may take another 90s. Measure the time that these tasks take at runtime, and update your estimations.
Each new measurement will make the prediction a bit more precise.
There really is no way to do this in any sort of reliable way, since it depends on thousands of factors.
Progress bars typically measure this in one of two ways:
Overall progress - I have n number of bytes/files/whatever to transfer, and so far I have done m.
Overall work divided by current speed - I have n bytes to transfer, and so far I have done m and it took t seconds, so if things continue at this rate it will take u seconds to complete.
Short answer:
No, you can't. For progress bars and such, most applications simply increase the bar length by a percentage based on the fraction of the overall tasks done. Some pseudo-code:
for (int i = 0; i < num_files_to_load; ++i) {
    files.push_back(File(filepath[i]));
    // report the fraction completed so far, in the range 0..1
    SetProgressBarLength((float)(i + 1) / (float)num_files_to_load);
}
This is a very simplified example. Making a for-loop like this would surely block the window system's event/message queue. You would probably add a timed event or something similar instead.
Longer answer:
Given N known parameters, the problem of determining whether a program completes at all is undecidable. This is called the Halting problem. You can, however, find the time it takes to execute a single instruction. Some very old games actually depended on exact cycle timings, and failed to execute correctly on newer computers due to race conditions that occur because of subtle differences in runtime. Also, on architectures with data and instruction caches, the cycles the instructions consume are no longer constant, so caching makes cycle-counting unpredictable.
Raymond Chen discussed this issue in his blog.
"Why does the copy dialog give such horrible estimates? Because the copy dialog is just guessing. It can't predict the future, but it is forced to try. And at the very beginning of the copy, when there is very little history to go by, the prediction can be really bad."
In general it is impossible to predict the running time of a program. It is even impossible to predict whether a program will even halt at all. This is undecidable.
http://en.wikipedia.org/wiki/Halting_problem
As others have said, you can't predict the time. Approaches suggested by Partial and rmn are valid solutions.
What you can do more is assign weights to certain operations (for instance, if you know a db call takes roughly twice as long as some processing step, you can adjust accordingly).
A cool installer compiler would execute a faux install, time each op, then save this to disk for the future.
I used such a technique for a 3D application once, which had a pretty dead-on progress bar for loading and mashing data, after you've run it a few times. It wasn't that hard, and it made development much nicer. (Since we had to see that bar 10-15 times/day, startup was 10-20 secs)
You can't predict it entirely.
What you can do is wait until a fraction of the work is done, say 1%, and estimate the remaining time from that - just time how long the first 1% takes and multiply by 100, for example. That is easily done if you can enumerate everything you have to do in advance, or have some kind of loop going on.
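A minimal sketch of that estimate in C++, with do_one_item and the item count standing in for whatever the real work is:
#include <chrono>
#include <cstdio>
#include <thread>

static void do_one_item() {                        // placeholder for the real work
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}

// elapsed / fraction_done estimates the total, so remaining = total - elapsed.
static double estimate_remaining_seconds(std::chrono::steady_clock::time_point start,
                                          double fraction_done) {
    double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    if (fraction_done <= 0.0) return -1.0;         // nothing done yet, no estimate
    return elapsed * (1.0 - fraction_done) / fraction_done;
}

int main() {
    const int total_items = 100;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < total_items; ++i) {
        do_one_item();
        double done = double(i + 1) / total_items;
        std::printf("%3.0f%% done, about %.1f s left\n",
                    100.0 * done, estimate_remaining_seconds(start, done));
    }
    return 0;
}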
As I mentioned in a previous answer, it is impossible in general to predict the running time.
However, empirically it may be possible to predict with good accuracy.
Typically all of these programs are approximately linear in some input.
But if you wanted a more sophisticated approach, you could define a large number of features (database size, file size, OS, etc. etc.) and input those feature values + running time into a neural network. If you had millions of examples (obviously you would have an automated method for gathering data, e.g. some discovery programs) you might come up with a very flexible and intelligent prediction algorithm.
Of course this would only be worth doing for fun, as I'm sure the value to your company over some crude guessing algorithm will probably be nil :)
You should make an estimate of the time needed for the different phases of the program. For example: reading files - 50, working with the database - 30, working with the network - 20. Ideally you would make a progress callback during each of those phases, but that requires coding the progress calculation into the iterations of the algorithm. A small sketch of that weighting idea follows below.
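Here is the sketch, using the 50/30/20 figures from the example above; the phase names and the hard-coded call in main are just for illustration:
#include <cstdio>

// Each phase gets a weight reflecting its rough share of the total time.
struct Phase { const char *name; double weight; };
static const Phase phases[] = {
    {"reading files", 0.50},
    {"database",      0.30},
    {"network",       0.20},
};

// Overall progress = sum of the weights of the finished phases, plus the
// weighted fraction of the current phase (reported by that phase's own loop).
static double overall_progress(int current_phase, double fraction_in_phase) {
    double done = 0.0;
    for (int i = 0; i < current_phase; ++i) done += phases[i].weight;
    return done + phases[current_phase].weight * fraction_in_phase;
}

int main() {
    // e.g. halfway through the database phase: 0.50 + 0.30 * 0.5 = 65%
    std::printf("%.0f%%\n", 100.0 * overall_progress(1, 0.5));
    return 0;
}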