Profiling algorithm - C++

I need to implement execution-time measurement functionality, and I have thought about two possibilities.
First: regular time() calls - remember the time when each execution step starts and the time when it completes. The Unix time shell command works this way.
The second method is sampling. Each execution step sets some sort of flag before execution begins (for example, it creates some object on the stack frame) and destroys it when it completes. A timer periodically scans all flags and generates an execution-time profile. If some execution step takes more time than the others, it will be scanned more times. Many profilers work like this.
I need to add some profiling functionality to my server application. Which method is better, and why? I think the second method is less accurate, and the first method adds a dependency on the profiling library to my code.

The second method is essentially stack sampling.
You can try to do it yourself, by means of some kind of entry-exit event capture, or, better, use a utility that actually reads the stack.
The latter has an advantage in that you get line-of-code resolution, rather than just method-level.
There's something that a lot of people don't get about this, which is that precision of timing measurement is far less important than precision of problem identification.
It is important to take samples even during I/O or other blocking, so you are not blind to needless I/O. If you are worried that competition with other processes could inflate the time, don't be, because what really matters is not absolute time measurements, but percentages.
For example, if one line of code is on the stack 50% of the wall-clock time, and thus responsible for that much, getting rid of it would double the speed of the app, regardless of whatever else is going on.
Profiling is about more than just getting samples.
Often people are pretty casual about what they do with them, but that's where the money is.
First, inclusive time is the fraction of time a method or line of code is on the stack. Forget "self" time - it's included in inclusive time.
Forget invocation counting - its relation to inclusive percent is, at best, very indirect.
If you are summarizing, the best way to do it is to have a "butterfly view" whose focus is on a single line of code.
To its left and right are the lines of code appearing immediately above it and below it on the stack samples.
Next to each line of code is a percent - the percent of stack samples containing that line of code.
(And don't worry about recursion. It's simply not an issue.)
Even better than any kind of summary is to just let the user see a small random selection of the stack samples themselves.
That way, the user can get the whole picture of why each sampled slice of time was being spent.
Any avoidable activity appearing on more than one sample is a chance for some serious speedup, guaranteed.
People often think "Well, that could just be a fluke, not a real bottleneck".
Not so. Fixing it will pay off, maybe a little, maybe a lot, but on average - significant.
People should not be ruled by risk-aversion.
More on all that.

When boost is an option, you can use the timer library.
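For example, here is a minimal sketch using Boost.Timer's auto_cpu_timer, which prints elapsed wall, user and system time when it goes out of scope (it needs to be linked against the boost_timer library; the loop is just a stand-in workload):

#include <boost/timer/timer.hpp>
#include <cmath>

int main() {
    boost::timer::auto_cpu_timer t;     // reports elapsed times when it is destroyed
    double s = 0;
    for (int i = 1; i < 10000000; ++i)  // stand-in workload to have something to time
        s += std::sqrt(static_cast<double>(i));
    return s > 0 ? 0 : 1;
}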

Make sure that you really know what you're looking for in the profiler you're writing. Whenever you collect the total execution time of a certain piece of code, it includes the time spent in all of its children, so it may be hard to find the real bottleneck in your system, because the most top-level function will always bubble up as the most expensive one - main(), for instance.
What I would suggest is to hook into every function's prologue and epilogue (if your application is a CLR application, you can use ICorProfilerInfo::SetEnterLeaveFunctionHooks to do that; you can also use macros at the beginning of every method, or any other mechanism that would inject your code at the beginning and end of every function) and collect your times in the form of a tree for each thread that you're profiling.
The algorithm would look something like this:
For each thread that you're monitoring create a stack-like data structure.
Whenever you're notified about a function that began execution, push something that would identify the function into that stack.
If that function is not the only function on the stack, then you know that the previous function, the one that has not yet returned, is the one that called your function.
Keep track of those caller-callee relationships in your favorite data structure.
Whenever a method returns, its identifier will always be on top of its thread's stack. Its total execution time is equal to (current time - the time when its identifier was last pushed onto the stack). Pop that identifier off the stack.
This way you'll have a tree-like breakdown of what eats up your execution time, where you can see which child calls account for the total execution time of a function.
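Here is a minimal per-thread sketch of that algorithm in plain C++, using RAII objects to generate the enter/leave events (names like CallNode, ThreadProfile and ScopedCall are my placeholders, not an existing API, and std::chrono stands in for whatever clock you prefer):

#include <chrono>
#include <cstdio>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

struct CallNode {
    std::string name;
    Clock::duration total{};                            // inclusive time accumulated in this node
    long long calls = 0;
    std::map<std::string, std::unique_ptr<CallNode>> children;

    CallNode* child(const std::string& n) {             // find or create the callee subtree
        auto& p = children[n];
        if (!p) { p = std::make_unique<CallNode>(); p->name = n; }
        return p.get();
    }
};

struct ThreadProfile {
    CallNode root{"<thread root>"};
    std::vector<std::pair<CallNode*, Clock::time_point>> stack{{&root, Clock::now()}};

    void enter(const std::string& name) {               // the current top of the stack is the caller
        stack.emplace_back(stack.back().first->child(name), Clock::now());
    }
    void leave() {                                      // charge (now - push time) to the callee
        auto [node, start] = stack.back();
        node->total += Clock::now() - start;
        node->calls += 1;
        stack.pop_back();
    }
    void dump(const CallNode& n, int depth = 0) const { // print the tree-like breakdown
        std::printf("%*s%s  %.3f ms  (%lld calls)\n", depth * 2, "", n.name.c_str(),
                    std::chrono::duration<double, std::milli>(n.total).count(), n.calls);
        for (const auto& kv : n.children) dump(*kv.second, depth + 1);
    }
};

thread_local ThreadProfile g_profile;                   // one call tree per monitored thread

struct ScopedCall {                                     // prologue/epilogue hook as an RAII object
    explicit ScopedCall(const char* name) { g_profile.enter(name); }
    ~ScopedCall()                         { g_profile.leave(); }
};

void inner() { ScopedCall c("inner"); /* ...work... */ }
void outer() { ScopedCall c("outer"); inner(); inner(); }

int main() {
    outer();
    g_profile.dump(g_profile.root);
}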
Have fun!

In my profiler I used an extended version of the first approach you mentioned.
I have a class which provides context objects. You can define them in your working code as automatic objects, to be freed up as soon as the execution flow leaves the context where they were defined (for example, a function or a loop). The constructor calls GetTickCount (it was a Windows project; you can choose an analogous function appropriate to your target platform) and stores this value, while the destructor calls GetTickCount again and calculates the difference between that moment and the start. Each object has a unique context ID (it can be autogenerated as a static object inside the same context), so the profiler can sum up all timings with the same ID, which means the same context has been passed through several times. The number of executions is counted as well.
Here is a macro for preprocessor, which helps to profile a function:
#define _PROFILEFUNC_ static ProfilerLocator locator(__FUNC__); ProfilerObject obj(locator);
When I want to profile a function I just insert _PROFILEFUNC_ at the beginning of the function. This generates a static locator object which identifies the context and stores its name, here the function name (you may decide to choose another naming). Then an automatic ProfilerObject is created on the stack and "traces" its own creation and deletion, reporting them to the profiler.
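A portable sketch of what those two classes might look like - this is my reconstruction based on the description above, not the original code: std::chrono::steady_clock stands in for GetTickCount, and the standard __func__ replaces __FUNC__:

#include <chrono>
#include <cstdio>
#include <mutex>

using Clock = std::chrono::steady_clock;

// One static instance per profiled context; it identifies the context by name
// and accumulates the timings reported for it.
class ProfilerLocator {
public:
    explicit ProfilerLocator(const char* name) : name_(name) {}
    void report(Clock::duration d) {               // called by ProfilerObject on destruction
        std::lock_guard<std::mutex> lock(m_);
        total_ += d;
        ++count_;
    }
    ~ProfilerLocator() {                           // print the accumulated result at shutdown
        std::printf("%s: %.3f ms over %lld executions\n", name_,
                    std::chrono::duration<double, std::milli>(total_).count(), count_);
    }
private:
    const char* name_;
    Clock::duration total_{};
    long long count_ = 0;
    std::mutex m_;
};

// Automatic object: measures the time between its construction and destruction
// and hands the result to its locator.
class ProfilerObject {
public:
    explicit ProfilerObject(ProfilerLocator& loc) : loc_(loc), start_(Clock::now()) {}
    ~ProfilerObject() { loc_.report(Clock::now() - start_); }
private:
    ProfilerLocator& loc_;
    Clock::time_point start_;
};

// Same shape as the original macro, with __func__ providing the context name.
#define _PROFILEFUNC_ static ProfilerLocator locator(__func__); ProfilerObject obj(locator);

void some_function() {
    _PROFILEFUNC_                                  // first statement of the profiled function
    // ... body ...
}

int main() { for (int i = 0; i < 3; ++i) some_function(); }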

Related

How do I profile Hiccups in performance?

Usually profile data is gathered by randomly sampling the stack of the running program to see which function is executing. Over a running period it is possible to be statistically sure which methods/function calls eat the most time and need intervention in case of bottlenecks.
However, this has to do with overall application/game performance. Sometimes there are singular, isolated performance hiccups that cause usability trouble anyway (the user notices them, or they introduce lag in some internal mechanism, etc.). With regular profiling over a few seconds of execution it is not possible to know which method is responsible. Even if the hiccup lasts long enough (say 30 ms, which is still not much) to catch some method that is called too often, we will still miss the execution of many other methods that are simply "skipped" by the random sampling.
So are there any techniques to profile hiccups, in order to keep the framerate more stable after fixing those kinds of "rare bottlenecks"? I'm assuming languages like C# or C++.
This has been answered before, but I can't find it, so here goes...
The problem is that the DrawFrame routine sometimes takes too long.
Suppose it normally takes less than 1000/30 = 33ms, but once in a while it takes longer than 33ms.
At the beginning of DrawFrame, set a timer interrupt that will expire after, say, 40ms.
Then at the end of DrawFrame, disable the interrupt.
So if it triggers, you know DrawFrame is taking an unusually long time.
Put a breakpoint in the interrupt handler, and when it gets there, examine the stack.
Chances are pretty good that you have caught it in the process of doing the costly thing.
That's a variation on random pausing.
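Here is a minimal POSIX sketch of that idea, assuming a Unix-like target where SIGALRM and setitimer are available (DrawFrame and OnFrameOverrun are placeholder names, and the real frame work is elided):

#include <csignal>
#include <sys/time.h>

static void OnFrameOverrun(int) {
    // Put a breakpoint here (or call raise(SIGTRAP)): DrawFrame has been
    // running for more than 40 ms, so the current stack shows the culprit.
}

static void ArmFrameTimer(long ms) {
    struct itimerval t = {};                     // one-shot: it_interval stays zero
    t.it_value.tv_sec  = ms / 1000;
    t.it_value.tv_usec = (ms % 1000) * 1000;
    setitimer(ITIMER_REAL, &t, nullptr);
}

static void DisarmFrameTimer() {
    struct itimerval t = {};                     // a zero value cancels the timer
    setitimer(ITIMER_REAL, &t, nullptr);
}

static void DrawFrame() {
    ArmFrameTimer(40);                           // expires only if this frame takes > 40 ms
    // ... the normal frame work goes here ...
    DisarmFrameTimer();
}

int main() {
    std::signal(SIGALRM, OnFrameOverrun);        // install the overrun handler
    for (int frame = 0; frame < 1000; ++frame)
        DrawFrame();
}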

Are there some common techniques to profile coroutine based code?

It's pretty obvious how to visualize a regular call stack and count internal and external execution times. However, if one has to deal with coroutines, the call stack can look pretty messy. I mean, a coroutine may yield execution not to its parent but to another coroutine (e.g. a greenlet). Are there some common ways to make consistent profiling output for such scenarios?
Think about a single sample, of the stack for all threads at the same time.
What you need to know is - who's waiting for whom, and why.
Normally if function A is above B on a stack, it means A is waiting for B to return, and the reason is that A wanted B to do something.
If you look at a whole stack, for one thread, you get a chain of reasons why that particular nanosecond is being spent, by that thread.
If you're looking for speed, you're looking for chains of reasons that, altogether, you don't really need (because there is a weak link).
This works even if the chain ends in I/O.
If it is user input it's simply waiting for the user.
But if it's output, or disk I/O, or plain old CPU cranking, you might be able to do something to reduce it, and get a performance gain (if you see the same problem on 2 or more samples).
What if thread A is waiting for thread B?
Then what you see at the bottom of A's stack is a function that waits for the other thread.
You need to figure out which is thread B, and look at its stack, because the longer it takes, the longer A takes.
So this is more difficult, but surely you're not afraid of that.
I'm talking about manual profiling here, where you take samples yourself, in a debugger, and apply your full attention to each sample.
Profiling tools tend to assume you're lazy and only want numbers, and if nothing jumps out of those numbers you will be happy because you found nothing.
In fact, if some silly needless activity is taking 30% of time, then on average the number of samples you require to see it twice is 2/0.3 = 6.67 samples (not a big number), and it is quite likely that you will see it and the profiler will not.
That's random pausing.

Capturing function exit time with __gnu_mcount_nc

I'm trying to do some performance profiling on a poorly supported prototype embedded platform.
I note that GCC's -pg flag causes thunks to __gnu_mcount_nc to be inserted on entry to every function. No implementation of __gnu_mcount_nc is available (and the vendor is not interested in assisting); however, as it is trivial to write one that simply records the stack frame and the current cycle count, I have done so. This works fine and is yielding useful results in terms of caller/callee graphs and most frequently called functions.
I would really like to obtain information about the time spent in function bodies as well. However, I am having difficulty understanding how to approach this with only the entry, but not the exit, of each function being hooked: you can tell exactly when each function is entered, but without hooking the exit points you cannot know how much of the time until the next event to attribute to the callee and how much to the callers.
Nevertheless, the GNU profiling tools are in fact demonstrably able to gather runtime information for functions on many platforms, so presumably the developers have some scheme in mind for achieving this.
I have seen some existing implementations that do things like maintain a shadow callstack and twiddle the return address on entry to __gnu_mcount_nc so that __gnu_mcount_nc will get invoked again when the callee returns; it can then match the caller/callee/sp triplet against the top of the shadow callstack and so distinguish this case from the call on entry, record the exit time and correctly return to the caller.
This approach leaves much to be desired:
it seems like it may be brittle in the presence of recursion and libraries compiled without the -pg flag
it seems like it would be difficult to implement with low overhead or at all in embedded multithreaded/multicore environments where toolchain TLS support is absent and current thread ID may be expensive/complex to obtain
Is there some obvious better way to implement a __gnu_mcount_nc so that a -pg build is able to capture function exit as well as entry time that I am missing?
gprof does not use that function for timing of entry or exit, but for counting calls from function A to any function B.
Rather, it uses the self-time gathered by counting PC samples in each routine, and then uses the function-to-function call counts to estimate how much of that self-time should be charged back to callers.
For example, if A calls C 10 times, and B calls C 20 times, and C has 1000ms of self time (i.e 100 PC samples), then gprof knows C has been called 30 times, and 33 of the samples can be charged to A, while the other 67 can be charged to B.
Similarly, sample counts propagate up the call hierarchy.
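Just to make that charge-back arithmetic explicit, here is a toy calculation with the numbers from the example above (an illustration of the proportional split only, not gprof's actual code):

#include <cstdio>

int main() {
    const double self_samples  = 100;    // C's self time: 100 PC samples (1000 ms)
    const double calls_from_A  = 10;
    const double calls_from_B  = 20;
    const double total_calls   = calls_from_A + calls_from_B;
    // Split C's self time in proportion to the call counts from each caller.
    std::printf("charged to A: %.1f samples\n", self_samples * calls_from_A / total_calls);
    std::printf("charged to B: %.1f samples\n", self_samples * calls_from_B / total_calls);
}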
So you see, it doesn't time function entry and exit.
The measurements it does get are very coarse, because it makes no distinction between short calls and long calls.
Also, if a PC sample happens during I/O or in a library routine that is not compiled with -pg, it is not counted at all.
And, as you noted, it is very brittle in the presence of recursion, and can introduce notable overhead on short functions.
Another approach is stack-sampling, rather than PC-sampling.
Granted, it is more expensive to capture a stack sample than a PC-sample, but fewer samples are needed.
If, for example, a function, line of code, or any description you want to make, is evident on a fraction F of a total of N samples, then you can estimate that the fraction of time it costs is F, with a standard deviation (in samples) of sqrt(N*F*(1-F)).
So, for example, if you take 100 samples, and a line of code appears on 50 of them, then you can estimate the line costs 50% of the time, with an uncertainty of sqrt(100*.5*.5) = +/- 5 samples or between 45% and 55%.
If you take 100 times as many samples, you can reduce the uncertainty by a factor of 10.
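As a quick sanity check of those numbers, assuming a simple binomial model of the sampling:

#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
    const double f = 0.5;                                     // the line appears on half the samples
    for (int n : {100, 10000}) {
        double sigma_samples  = std::sqrt(n * f * (1 - f));   // standard deviation in samples
        double sigma_fraction = sigma_samples / n;            // the same, as a fraction of time
        std::printf("N=%5d: %4.0f hits +/- %.1f samples => %2.0f%% +/- %.1f%%\n",
                    n, n * f, sigma_samples, 100 * f, 100 * sigma_fraction);
    }
}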
(Recursion doesn't matter. If a function or line of code appears 3 times in a single sample, that counts as 1 sample, not 3.
Nor does it matter if function calls are short - if they are called enough times to cost a significant fraction, they will be caught.)
Notice, when you're looking for things you can fix to get speedup, the exact percent doesn't matter.
The important thing is to find it.
(In fact, you only need see a problem twice to know it is big enough to fix.)
That's this technique.
P.S. Don't get suckered into call-graphs, hot-paths, or hot-spots.
Here's a typical call-graph rat's nest. Yellow is the hot-path, and red is the hot-spot.
And this shows how easy it is for a juicy speedup opportunity to be in none of those places:
The most valuable thing to look at is a dozen or so random raw stack samples, and relating them to the source code.
(That means bypassing the back-end of the profiler.)
ADDED: Just to show what I mean, I simulated ten stack samples from the call graph above, and here's what I found
3/10 samples are calling class_exists, one for the purpose of getting the class name, and two for the purpose of setting up a local configuration. class_exists calls autoload which calls requireFile, and two of those call adminpanel. If this can be done more directly, it could save about 30%.
2/10 samples are calling determineId, which calls fetch_the_id which calls getPageAndRootlineWithDomain, which calls three more levels, terminating in sql_fetch_assoc. That seems like a lot of trouble to go through to get an ID, and it's costing about 20% of time, and that's not counting I/O.
So the stack samples don't just tell you how much inclusive time a function or line of code costs, they tell you why it's being done, and what possible silliness it takes to accomplish it.
I often see this - galloping generality - swatting flies with hammers, not intentionally, but just following good modular design.
ADDED: Another thing not to get sucked into is flame graphs.
For example, here is a flame graph (rotated right 90 degrees) of the ten simulated stack samples from the call graph above. The routines are all numbered, rather than named, but each routine has its own color.
Notice the problem we identified above, with class_exists (routine 219) being on 30% of the samples, is not at all obvious by looking at the flame graph.
More samples and different colors would make the graph look more "flame-like", but they would not expose routines which take a lot of time by being called many times from different places.
Here's the same data sorted by function rather than by time.
That helps a little, but it does not aggregate the same routine when it is called from different places:
Once again, the goal is to find the problems that are hiding from you.
Anyone can find the easy stuff, but the problems that are hiding are the ones that make all the difference.
ADDED: Another kind of eye-candy is this one:
where the black-outlined routines could all be the same, just called from different places.
The diagram doesn't aggregate them for you.
If a routine has high inclusive percent by being called a large number of times from different places, it will not be exposed.

C++ Asymptotic Profiling

I have a performance issue where I suspect one standard C library function is taking too long and causing my entire system (a suite of processes) to basically "hiccup". Sure enough, if I comment out the library function call, the hiccup goes away. This prompted me to investigate: what standard methods are there to prove this type of thing? What would be the best practice for testing a function to see if it causes an entire system to hang for a second (causing other processes to be momentarily starved)?
I would at least like to definitively correlate the function being called and the visible freeze.
Thanks
The best way to determine this stuff is to use a profiling tool to get information on how much time is spent in each function call.
Failing that, set up a function that reserves a block of memory. Then, at various points in your code, write a string to that memory, including the current time. (This avoids the delays associated with writing to the display.)
After you have run your code, pull out the memory and parse it to determine how long parts of your code are taking.
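Here is a minimal sketch of that in-memory timestamp log (EventLog and mark are placeholder names, not an established API, and std::chrono stands in for whatever clock you prefer):

#include <chrono>
#include <cstdio>
#include <vector>

struct EventLog {
    struct Entry { std::chrono::steady_clock::time_point t; const char* tag; };
    std::vector<Entry> entries;

    explicit EventLog(std::size_t capacity) { entries.reserve(capacity); }  // pre-reserve so logging does not allocate

    void mark(const char* tag) {                                            // cheap: no I/O, no formatting
        entries.push_back({std::chrono::steady_clock::now(), tag});
    }

    void dump() const {                                                     // parse/print after the run
        if (entries.empty()) return;
        for (const auto& e : entries) {
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                          e.t - entries.front().t).count();
            std::printf("%10lld us  %s\n", static_cast<long long>(us), e.tag);
        }
    }
};

int main() {
    EventLog log(1 << 20);
    log.mark("start");
    // ... code under investigation ...
    log.mark("after suspect library call");
    log.dump();
}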
I'm trying to figure out what you mean by "hiccup". I'm imagining your program does something like this:
while (...){
// 1. do some computing and/or file I/O
// 2. print something to the console or move something on the screen
}
and normally the printed or graphical output hums along in a subjectively continuous way, but sometimes it appears to freeze, while the computing part takes longer.
Is that what you meant?
If so, I suspect that in the running state it is almost always in step 2, but in the hiccup state it is spending its time in step 1.
I would comment out step 2, so that it would spend nearly all its time in the hiccup state, and then just pause it under the debugger to see what it's doing.
That technique tells you exactly what the problem is with very little effort.

Tools to evaluate callgrind's call profiles?

Somewhat related to this question: which tool would you recommend for evaluating the profiling data created with callgrind?
It does not have to have a graphical interface, but it should prepare the results in a concise, clear and easy-to-interpret way. I know about e.g. kcachegrind, but this program is missing some features such as data export of the tables shown or simply copying lines from the display.
Years ago I wrote a profiler to run under DOS.
If you are using KCacheGrind here's what I would have it do. It might not be too difficult to write it, or you can just do it by hand.
KCacheGrind has a toolbar button "Force Dump", with which you can trigger a dump manually at a random time. The capture of stack traces at random or pseudo-random times, in the interval when you are waiting for the program, is the heart of the technique.
Not many samples are needed - 20 is usually more than enough. If a bottleneck costs a large amount, like more than 50%, 5 samples may be quite enough.
The processing of the samples is very simple. Each stack trace consists of a series of lines of code (actually addresses), where all but the last are function/method calls.
Collect a list of all the lines of code that appear on the samples, and eliminate duplicates.
For each line of code, count what fraction of samples it appears on. For example, if you take 20 samples, and the line of code appears on 3 of them, even if it appears more than once in some sample (due to recursion) the count is 3/20 or 15%. That is a direct measure of the cost of each statement.
Display the most costly 100 or so lines of code. Your bottlenecks are in that list.
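Here is a small sketch of that counting step, assuming the stack samples arrive on standard input as blocks of "file:line" strings separated by blank lines (both the input format and the code are my assumptions, not anything KCacheGrind provides):

#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::map<std::string, int> hits;   // line of code -> number of samples containing it
    std::set<std::string> current;     // lines seen in the current sample (dedupes recursion)
    int samples = 0;
    std::string line;

    auto flush = [&] {                 // close out one stack sample
        if (current.empty()) return;
        ++samples;
        for (const auto& l : current) ++hits[l];
        current.clear();
    };

    while (std::getline(std::cin, line)) {
        if (line.empty()) flush();     // a blank line terminates one stack sample
        else current.insert(line);
    }
    flush();
    if (samples == 0) return 0;

    std::vector<std::pair<int, std::string>> ranked;
    for (const auto& [l, n] : hits) ranked.push_back({n, l});
    std::sort(ranked.rbegin(), ranked.rend());

    for (const auto& [n, l] : ranked)  // most costly lines first, as a percent of samples
        std::cout << 100.0 * n / samples << "%  " << l << "\n";
}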
What I typically do with this information is choose a line with high cost, and then manually take stack samples until it appears (or look at the ones I've already got), and ask myself "Why is it doing that line of code, not just in a local sense, but in a global sense." Another way to put it is "What in a global sense was the program trying to accomplish at the time slice when that sample was taken". The reason I ask this is because that tells me if it was really necessary to be spending what that line is costing.
I don't want to be critical of all the great work people do developing profilers, but sadly there is a lot of firmly entrenched myth on the subject, including:
that precise measuring, with lots of samples, is important. Rather the emphasis should be on finding the bottlenecks. Precise measurement is not a prerequisite for that. For typical bottlenecks, costing between 10% and 90%, the measurement can be quite coarse.
that functions matter more than lines of code. If you find a costly function, you still have to search within it for the lines that are the bottleneck. That information is right there, in the stack traces - no need to hunt for it.
that you need to distinguish CPU from wall-clock time. If you're waiting for it, it's wall clock time (wrist-watch time?). If you have a bottleneck consisting of extraneous I/O, for example, do you want to ignore that because it's not CPU time?
that the distinction between exclusive time and inclusive time is useful. That only makes sense if you're timing functions and you want some clue whether the time is spent not in callees. If you look at lines of code, the only thing that matters is inclusive time. Another way to put it is, every instruction is a call instruction, even if it only calls microcode.
that recursion matters. It is irrelevant, because it doesn't affect the fraction of samples a line is on and is therefore responsible for.
that the invocation count of a line or function matters. Whether it's fast and is called too many times, or slow and called once, the cost is the percent of time it uses, and that's what the stack samples estimate.
that performance of sampling matters. I don't mind taking a stack sample and looking at it for several minutes before continuing, assuming that doesn't make the bottlenecks move.
Here's a more complete explanation.
There are some CLI tools for working with callgrind data:
callgrind_annotate
and the Cachegrind tool cg_annotate, which can show some information from callgrind.out.