Tuning the resolution in callgrind - c++

Sorry, I can't create a minimal complete example, as the problem only occurs for relatively large programs, and I am not sure this is even a 'bug' per se, as opposed to a misunderstanding of what callgrind profiling is supposed to accomplish.
I have a large program whose run time is split about 50/50 between two sequential parts. The first part is mostly file reading, and the second mostly computation.
The order of function calls that I would expect is the following:
Calling Scope, Callee
main Part1_main
Part1_main Part1_main_subfunction_1
Part1_main Part1_main_subfunction_2
Part1_main Part1_main_subfunction_3
main Part2_main
Part2_main Part2_main_subfunction_1
Part2_main Part2_main_subfunction_2
..
..
When I run callgrind on the code (and then view the results in kcachegrind on OS X), the results regarding the function calls are approximately what you would expect, except for one thing: there is no resolution of function calls within the second part. The profile output is qualitatively the same as:
Function, Pct_time, Self_time
Part1_main 50 4
Part2_main 50 50
Part1_main_subfunction_1 20 4
Part1_main_subfunction_2 15 5
..
..
..
How should I interpret the second function having such a high self time? It seems that the profiler thinks it is not calling any other functions. I suppose it is possible, though unlikely, that everything in Part2_main is inlined, in which case maybe there shouldn't be any further resolution; but if so, this doesn't yield very interesting profiling results.
If you ever come across this type of thing, how do you force the profiler to show further resolution? Or, if my intuition is wrong, what else could be causing this behaviour?
As per the instructions on the callgrind website, I am compiling with the -g flag and with optimisation turned on.

By default kcachegrind hides functions with small weight, but you can customize it. See the answer here:
Make callgrind show all function calls in the kcachegrind callgraph
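For reference, a minimal command sequence for this workflow (the file names are illustrative); callgrind_annotate's --threshold option raises the cutoff so that even low-weight functions are listed:

g++ -g -O2 main.cpp -o prog                              # -g keeps symbols, optimisation stays on
valgrind --tool=callgrind ./prog                         # writes callgrind.out.<pid>
callgrind_annotate --threshold=100 callgrind.out.<pid>   # list everything, not just the top 99%
kcachegrind callgrind.out.<pid>

If inlining is the suspected cause of the missing detail, recompiling with -fno-inline on top of the optimisation flags is one way to test that hypothesis, at the cost of somewhat less representative timings.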

Related

Removing code portions does not match the profiler's data

I am working through a little proof-of-concept profile-and-optimize example. However, I ran into something that I can't quite explain, and I'm hoping that someone here can clear it up.
I wrote a very short snippet of code:
#include <stdio.h>

int a, b, c;   /* globals shared by main and the callbacks */

void callbackWasterOne(void);
void callbackWasterTwo(void);

int main (void)
{
    for (int j = 0; j < 1000; j++)
    {
        a = 1;
        b = 2;
        c = 3;
        for (int i = 0; i < 100000; i++)
        {
            callbackWasterOne();
            callbackWasterTwo();
        }
        printf("Two a: %d, b: %d, c: %d.", a, b, c);
    }
    return 0;
}
void callbackWasterOne(void)
{
    a = b * c;
}
void callbackWasterTwo(void)
{
    b = a * c;
}
All it does is call two very basic functions that just multiply numbers together. Since the code in the two functions is identical, I expect the profiler (OProfile) to report roughly the same numbers for both.
I ran this code 10 times per profile, and I got the following values for how much time is spent in each function:
main: average = 5.60%, stdev = 0.10%
callbackWasterOne: average = 43.78%, stdev = 1.04%
callbackWasterTwo: average = 50.24%, stdev = 0.98%
(the rest is in miscellaneous things like printf and no-vmlinux)
The difference between the times for callbackWasterOne and callbackWasterTwo is significant enough (to me at least), given that they contain the same code, that I switched their order in my code and reran the profiler, with the following results:
main: average = 5.45%, stdev = 0.40%
callbackWasterOne: average = 50.69%, stdev = 0.49%
callbackWasterTwo: average = 43.54%, stdev = 0.18%
(the rest is in miscellaneous things like printf and no-vmlinux)
So evidently the profiler samples one function more than the other based on execution order. Not good. Disregarding this, I decided to see the effect of removing some code, and I got these execution times (averages):
Nothing removed: 0.5295s
call to callbackWasterOne() removed from for loop: 0.2075s
call to callbackWasterTwo() removed from for loop: 0.2042s
remove both calls from for loop: 0.1903s
remove both calls and the for loop: 0.0025s
remove contents of callbackWasterOne: 0.379s
remove contents of callbackWasterTwo: 0.378s
remove contents of both: 0.382s
So here is what I'm having trouble understanding:
When I remove just one of the calls from the for loop, the execution time drops by ~60%, which is more than the time attributed to that one function plus main in the first place! How is this possible?
Why is the effect of removing both calls from the loop so small compared to removing just one? I can't figure out this non-linearity. I understand that the for loop is expensive, but in that case (if most of the remaining time can be attributed to the for loop that performs the function calls), why would removing just one of the calls cause such a large improvement in the first place?
I looked at the disassembly, and the two functions are identical in code. The calls to them are the same, and removing a call simply deletes that one call instruction.
Other info that might be relevant
I'm using Ubuntu 14.04 LTS.
The code is compiled by Eclipse with no optimization (-O0).
I time the code by running it in a terminal using "time".
I use OProfile with count = 10000 and 10 repetitions.
Here are the results from when I do this with -O1 optimization:
main: avg = 5.89%, stdev = 0.14%
callbackWasterOne: avg = 44.28%, stdev = 2.64%
callbackWasterTwo: avg = 49.66%, stdev = 2.54% (greater than before)
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.522s
Remove callbackWasterOne call: 0.149s (71.47% decrease)
Remove callbackWasterTwo call: 0.123s (76.45% decrease)
Remove both calls: 0.0365s (93.01% decrease) (what I would expect given the profile data just above)
So removing one call now is much better than before, and removing both still carries a benefit (probably because the optimizer understands that nothing happens in the loop). Still, removing one is much more beneficial than I would have anticipated.
Results of the two functions using different variables:
I defined 3 more variables for callbackWasterTwo() to use instead of reusing the same ones. Now the results are what I would have expected.
main: avg = 10.87%, stdev = 0.19% (average is greater, but maybe due to those new variables)
callbackWasterOne: avg = 46.08%, stdev = 0.53%
callbackWasterTwo: avg = 42.82%, stdev = 0.62%
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.520s
Remove callbackWasterOne call: 0.292s (43.83% decrease)
Remove callbackWasterTwo call: 0.291s (44.07% decrease)
Remove both calls: 0.065s (87.55% decrease)
So now removing both calls is pretty much equivalent (within stdev) to removing one call + the other.
Since the result of removing either function is pretty much the same (43.83% vs 44.07%), I am going to go out on a limb and say that perhaps the profiler data (46% vs 42%) is still skewed. Perhaps it is the way it samples (going to vary the counter value next and see what happens).
It appears that the success of optimization relates pretty strongly to the code-reuse fraction. The only way to achieve "exactly" (you know what I mean) the speedup indicated by the profiler is to optimize completely independent code. Anyway, this is all interesting.
I am still looking for some explanation ideas for the 70% decrease in the -O1 case though...
I did this with 10 functions (different formulas in each, but using some combination of 6 different variables, 3 at a time, all multiplication):
These results are disappointing to say the least. I know the functions are identical, and yet the profiler indicates that some take significantly longer. No matter which one I remove (a "fast" or a "slow" one), the results are the same ;) So this leaves me wondering: how many people are relying on the profiler and being pointed at the wrong areas of code to fix? If I unknowingly saw these results, what could possibly tell me to go fix the 5% function rather than the 20% one (even though they are exactly the same)? What if the 5% one was much easier to fix, with a large potential benefit? And of course, this profiler might just not be very good, but it is popular! People use it!
Here is a screenshot. I don't feel like typing it in again:
My conclusion: I am overall rather disappointed with OProfile. I decided to try out callgrind (valgrind) from the command line on the same program, and it gave me far more reasonable results: all the functions spent roughly the same amount of time executing. I think Callgrind samples far more than Oprofile ever did.
Callgrind will still not explain the difference in improvement when a function is removed, but at least it gives the correct baseline information...
Ah, I see you did look at the assembly. This question is indeed interesting in its own right, but in general there's no point profiling unoptimized code, since there's so much boilerplate that would easily be eliminated even at -O1.
If it's really only the call that's missing, then that could explain the timing differences. There's a lot of boilerplate in the -O0 stack-manipulation code (any caller-saved registers have to be pushed onto the stack, and any arguments too; afterwards any return value has to be handled, and the opposite stack manipulation has to be done). This contributes to the time it takes to call the functions, but is not necessarily attributed to the functions themselves by oprofile, since that code is executed before/after the function is actually called.
I suspect the reason the second function always seems to take less time is that there's less (or no) stack juggling to be done: the parameter values are already on the stack thanks to the previous function call, and so, as you've seen, only the call to the function has to be executed, without any other extra work.
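One way to see this boilerplate directly is to dump the assembly at both optimisation levels and compare them (the file names here are illustrative):

gcc -O0 -S -o waster_O0.s waster.c   # full per-call frame setup is visible
gcc -O1 -S -o waster_O1.s waster.c   # most of that boilerplate is gone
diff waster_O0.s waster_O1.s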

Implementing an Interrupt driven Sampling Profiler

I am trying to create a sampling profiler that works on Linux. I am unsure how to send an interrupt, or how to get the program counter (PC), so I can find out where the program is when I interrupt it.
I have tried using signal(SIGUSR1, Foo*) and calling backtrace, but I get the stack for the thread I am in when I call raise(SIGUSR1), rather than for the thread the program is running on.
I am not really sure if this is even the right way of going about it...
Any advice?
If you must write a profiler, let me suggest you use a good one (Zoom) as your model, not a bad one (gprof).
These are its characteristics.
There are two phases. First is the data-gathering phase:
When it takes a sample, it reads the whole call stack, not just the program counter.
It can take samples even when the process is blocked due to I/O, sleep, or anything else.
You can turn sampling on/off, so as to only take samples during times you care about. For example, while waiting for the user to type something, it is pointless to be sampling.
Second is the data-presentation phase.
What you have is a collection of stack samples, where a stack sample is a vector of memory addresses, which are almost all return addresses.
Each return address indicates a line of code in a function, unless it's in some system routine you don't have symbolic information for.
The key piece of useful information is residency fraction (usually expressed as a percent).
If there are a total of m stack samples, and line of code L is present anywhere on n of the samples, then its residency fraction is n/m.
This is true even if L appears more than once on a sample; that is still just one sample it appears on.
The importance of residency fraction is that it directly indicates what fraction of time statement L is responsible for.
If you have taken m=1000 samples, and L appears on n=300 of them, then L's residency fraction is 300/1000 or 30%.
This means that if L could be removed, total time would decrease by 30%.
It is typically known as inclusive percent.
You can determine residency fraction not just for lines of code, but for anything else you can describe. For example, line of code L is inside some function F.
So you can determine the residency fraction for functions, as opposed to lines of code.
That would give you inclusive percent by function.
You could look at function pairs, like on what fraction of samples do you see function F calling function G.
That would give you the information that makes up call graphs.
There are all kinds of information you can get out of the stack samples.
One that is often seen is a "butterfly view", where you have a "focus" on one line L or function F, and on one side you show all the lines or functions immediately above it in the stack samples, and on the other side all the lines or functions immediately below it.
On each of these, you can show the residency fraction.
You can click around in this to try to find lines of code with high residency fraction that you can find a way to eliminate or reduce.
That's how you speed up the code.
Whatever you do for output, I think it is very important to allow the user to actually examine a small number of the samples themselves, randomly selected.
They convey far more insight than can be gotten from any method that condenses the information.
As important as it is to know what the profiler should do, it is also important to know what not to do, even if lots of other profilers do them:
self time. A useless number. Look at some reasonable-size programs and you will see why.
invocation counts. Of no help in finding code with high residency fraction, and you can't get it with samples alone anyway.
high-frequency sampling. It's amazing how many people, certainly profiler builders, think it is important to get lots of samples. Suppose line L is on 30% of 1000 samples. Then its true inclusive percent is 30 +/- 1.4 percent. On the other hand, if it is on 30% of 10 samples, its inclusive percent is 30 +/- 14 percent. It's still pretty big - big enough to fix. What happens in most profilers is people think they need "numerical precision", so they take lots of samples and accumulate what they call "statistics", and then throw away the samples. That's like digging up diamonds, weighing them, and throwing them away. The real value is in the samples themselves, because they tell you what the problem is.
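For reference, the ±1.4 and ±14 figures above are just the binomial standard error of a sampled proportion:

sigma = sqrt(p * (1 - p) / m)
sqrt(0.3 * 0.7 / 1000) ≈ 0.014  ->  30 ± 1.4 percent for m = 1000
sqrt(0.3 * 0.7 / 10)   ≈ 0.14   ->  30 ± 14 percent for m = 10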
You can send a signal to a specific thread using pthread_kill and the tid (gettid()) of the target thread.
The right way to create a simple profiler is to use setitimer, which can send a periodic signal (SIGALRM or SIGPROF), for example every 10 ms; or POSIX timers (timer_create, timer_settime, or timerfd), with no need for a separate thread to send profiling signals. Check the sources of google-perftools (gperftools): they use setitimer or POSIX timers and collect profiles with backtraces.
gprof also uses setitimer to implement CPU-time profiling (9.1 Implementation of Profiling: "Linux 2.0 ..arrangements are made for the kernel to periodically deliver a signal to the process (typically via setitimer())").
For example, here is the result of a code search for setitimer in gperftools's sources: https://code.google.com/p/gperftools/codesearch#search/&q=setitimer&sq=package:gperftools&type=cs
void ProfileHandler::StartTimer() {
  if (!allowed_) {
    return;
  }
  struct itimerval timer;
  timer.it_interval.tv_sec = 0;
  timer.it_interval.tv_usec = 1000000 / frequency_;
  timer.it_value = timer.it_interval;
  setitimer(timer_type_, &timer, 0);
}
You should know that setitimer has problems with fork and clone, and it doesn't work well with multithreaded applications. There is an attempt at a helper wrapper, http://sam.zoy.org/writings/programming/gprof.html (a flawed one), but I don't remember whether it works correctly (setitimer usually sends a process-wide signal, not a thread-wide one). UPD: it seems that since Linux kernel 2.6.12, setitimer's signal is directed to the process as a whole (any thread may get it).
To direct the signal from timer_create to a specific thread, you need gettid() (#include <sys/syscall.h>, syscall(__NR_gettid)) and the SIGEV_THREAD_ID flag. I haven't checked how to create a periodic POSIX timer with timer_create (probably with timer_settime and a non-zero it_interval).
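Here is a minimal sketch of the setitimer + SIGPROF approach; this is my illustration, not gperftools code, and it assumes Linux on x86-64 with glibc (where g++ defines _GNU_SOURCE, so REG_RIP is visible):

// Minimal SIGPROF-based sampler sketch. A real profiler would record the
// sampled PCs (or whole backtraces) into a preallocated buffer and process
// them after the run; fprintf is not async-signal-safe and is used here
// only to keep the sketch short.
#include <csignal>
#include <cstdio>
#include <sys/time.h>
#include <ucontext.h>

static void on_prof(int, siginfo_t*, void* ucontext_raw) {
    ucontext_t* uc = static_cast<ucontext_t*>(ucontext_raw);
    // The interrupted program counter: where the thread was when sampled.
    void* pc = reinterpret_cast<void*>(uc->uc_mcontext.gregs[REG_RIP]);
    std::fprintf(stderr, "sample pc=%p\n", pc);
}

int main() {
    struct sigaction sa {};
    sa.sa_sigaction = on_prof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, nullptr);

    itimerval tv {};                     // fire every 10 ms of consumed CPU time
    tv.it_interval.tv_sec = 0;
    tv.it_interval.tv_usec = 10000;
    tv.it_value = tv.it_interval;
    setitimer(ITIMER_PROF, &tv, nullptr);

    volatile double x = 0;               // busy work to be sampled
    for (long i = 0; i < 200000000L; ++i) x += i * 0.5;
    return 0;
}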
PS: there is some overview of profiling in wikibooks: http://en.wikibooks.org/wiki/Introduction_to_Software_Engineering/Tools/Profiling

C++ Time measurement of functions

I need to measure the running time of a C++ program, especially the overall running time of some recursive functions. There are a lot of function calls inside other functions. My first thought was to implement some time-measurement functions in the actual code.
The problem with gprof is that it prints out the time of class operators of a datatype, but I only need the information about the functions, and "-f func_name prog_name" won't work.
So, what is the most common way in science to measure time of a numerical program?
It's something like this:
void function2()
{
}
void function1()
{
    function2();
    function1();
}
int main()
{
    function1();
}
If you're using the GNU toolchain, i.e. gcc, you can try gprof. Just compile your program with the -g and -pg flags and then run
gprof <your_program_name>
gprof: http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html
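Concretely, a typical session looks like this (the program name is illustrative):

g++ -g -pg prog.cpp -o prog
./prog                # running the instrumented binary writes gmon.out
gprof prog gmon.out   # prints the flat profile and the call graph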
EDIT:
In order to increase the level of detail you can run gprof with other flags:
-l (--line) Enables line-by-line profiling, giving you a histogram of hits charged to individual lines of code, instead of functions.
-a Don’t include private functions in the output.
-e <function> Exclude output for a function <function>. Use this when there are functions that won’t be changed. For example, some sites have source code that’s been approved by a regulatory agency, and no matter how inefficient, the code will remain unchanged.
-E <function> Also exclude the time spent in the function from the percentage tables.
-f <function> The opposite of -e: only track time in <function>.
-F <function> Only use the time in <function> when calculating percentages.
-b Don’t print the explanatory text. If you’re more experienced, you can appreciate this option.
-s Accumulate samples. By running the program several times, it's possible to get a better picture of where time is spent. For example, a slow routine may not be called for all input values, so you may be misled about where to look for performance problems.
If you need higher precision (for functions which take only a few milliseconds or less), you can use std::chrono::high_resolution_clock:
#include <chrono>
#include <iostream>

auto beginT = std::chrono::high_resolution_clock::now();
// Your computation here
auto endT = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(endT - beginT).count()
          << " ms\n";
std::chrono::high_resolution_clock lives in the <chrono> header and is part of the C++11 standard.
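Applied to the recursive skeleton from the question, timing the top-level call gives the overall running time of the whole recursion. This is a self-contained sketch; the depth parameter is my addition so that the recursion terminates:

#include <chrono>
#include <iostream>

void function2() {}

void function1(int depth)
{
    function2();
    if (depth > 0)
        function1(depth - 1);   // recursive call, as in the question
}

int main()
{
    auto beginT = std::chrono::high_resolution_clock::now();
    function1(100000);          // one timed call covers the entire recursion
    auto endT = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(endT - beginT).count()
              << " us\n";
}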

Interpreting GPerfTools sample count

I'm struggling a little with reading the textual output GPerfTools generates. I think part of the problem is that I don't fully understand how the sampling method operates.
From Wikipedia I gather that sampling profilers usually work by sending an interrupt to the OS and querying the program's current instruction pointer. Now my knowledge of assembly is a little rusty, so I'm wondering what it means if the instruction pointer points to method m at any given time. Does it mean that the function is about to be called, or that it is currently executing, or both?
There's a difference, if I'm not mistaken: in the first case the sample count (i.e. the number of times m is seen while taking a sample) translates to the absolute call count of m, while in the latter case it simply translates to times seen, i.e. a mere indication of the relative time spent in this method.
Can someone clarify?

Is there a way to figure out the top callers of a C function?

Say I have a function that is called a LOT from many different places. I would like to find out who calls this function the most: for example, the top 5 callers, or whoever calls this function more than N times.
I am using AS3 Linux, gcc 3.4.
For now I just put a breakpoint on it and stop after every 300 hits, thus brute-forcing it...
Does anyone know of tools that can help me?
Thanks
Compile with the -pg option, run the program for a while, and then use gprof. Running a program compiled with the -pg option will generate a gmon.out file with the execution profile. gprof can read this file and present it in a readable form.
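A minimal session might look like this (the program name is illustrative); the call-graph section of gprof's output lists, for each function, which functions called it and how many times:

gcc -pg prog.c -o prog
./prog                # writes gmon.out in the current directory
gprof prog gmon.out   # the call graph shows each function's callers and call counts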
I wrote a call-logging example just for fun. A macro replaces the function call with an instrumented one.
#include <stdio.h>

int funcA( int a, int b ){ return a+b; }

// instrumentation
void call_log(const char *file, const char *function, const int line, const char *args){
    printf("file:%s line: %i function: %s args: %s\n", file, line, function, args);
}

#define funcA(...) \
    (call_log(__FILE__, __FUNCTION__, __LINE__, "" #__VA_ARGS__), funcA(__VA_ARGS__))

// testing
void funcB(void){
    funcA(7,8);
}

int main(void){
    int x = funcA(1,2)+
            funcA(3,4);
    printf( "x: %i (==10)\n", x );
    funcA(5,6);
    funcB();
}
Output:
file:main.c line: 22 function: main args: 1,2
file:main.c line: 24 function: main args: 3,4
x: 10 (==10)
file:main.c line: 28 function: main args: 5,6
file:main.c line: 17 function: funcB args: 7,8
Profiling helps.
Since you mentioned oprofile in another comment, I'll say that oprofile supports generating callgraphs on profiled programs.
See http://oprofile.sourceforge.net/doc/opreport.html#opreport-callgraph for more details.
It's worth noting this is definitely not as clear as the callers profile you may get from gprof or another profiler, as the numbers it reports are the number of times oprofile collected a sample in which X is the caller of a given function, not the number of times X called the given function. But this should be sufficient to figure out the top callers of a given function.
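Assuming call-graph collection was enabled when the profile was taken, the report might be produced like this (the exact invocation depends on your oprofile version):

opcontrol --callgraph=16   # enable call-graph sample collection before profiling
opreport --callgraph ./prog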
A somewhat cumbersome method, but one that does not require additional tools:
#define COUNTED_CALL( fn, ... ) \
    ( fprintf( call_log_fp, "%s->%s\n", __FUNCTION__, #fn ), \
      (fn)(__VA_ARGS__) )
Then all calls written like:
int input_available = COUNTED_CALL( scanf, "%s", &instring ) ;
will be logged to the file associated with call_log_fp (a global FILE* which you must have initialised). The log for the above would look like:
main->scanf
You can then process that log file to extract the data you need. You could even write your own code to do the instrumentation which would make it perhaps less cumbersome.
Might be a bit ambiguous for C++ class member functions though. I am not sure if there is a __CLASS__ macro.
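Processing the log can be a shell one-liner. For instance, assuming the log was written to call.log (an illustrative name), the top 5 caller->callee pairs are:

sort call.log | uniq -c | sort -rn | head -5

and the top 5 callers of one particular function, say scanf, are:

grep -e '->scanf$' call.log | sort | uniq -c | sort -rn | head -5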
In addition to the aforementioned gprof profiler, you may also try the gcov code-coverage tool. Information on compiling for and using both should be included in the gcc manual.
Once again, stack sampling to the rescue! Just take a bunch of "stackshots", as many as you like. Discard any samples where your function (call it F) is not somewhere on the stack. (If you're discarding most of them, then F is not a performance problem.)
On each remaining sample, locate the call to F, and see what function (call it G) that call is in. If F is recursive (it appears more than once on the sample) only use the topmost call.
Rank your Gs by how many stacks each one appears in.
If you don't want to do this by hand, you could make a simple tool or script. You don't need a zillion samples. 20 or so will give you reasonably good information.
By the way, if what you're really trying to do is find performance problems, you don't actually need to do all that discarding and ranking. In fact - don't discard the exact locations of the call instruction inside each G. Those can actually tell you a good bit more than just the fact that they were somewhere inside G.
P.S. This is all based on the assumption that when you say "calls it the most" you mean "spends the most wall clock time in calling it", not "calls it the greatest number of times". If you are interested in performance, fraction of wall clock time is more useful than invocation count.
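If you don't want to pause in a debugger by hand, one way to take a single "stackshot" of a running process from a script (gdb and pidof assumed available, with prog as an illustrative name) is:

gdb -batch -p $(pidof prog) -ex bt

Run it 20 or so times at random moments and collect the backtraces.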