Cycles consumed in each function through Oprofile

Cycles consumed in each function through Oprofile - profiling

Oprofile works on Sampling Based theory.
Opreport -l option provides us the profiling report in the following way:
samples % image name symbol name
78149 15.0776 cvqa comp_corr.clone.2
With this information I can know the %age of time consumed in consumption. If I do some optimizaion in my code I will again get the report as:
samples % image name symbol name
73179 15.0732 cvqa comp_corr.clone.2
In this report I am not getting how much optimization of cycles has been done so that I can benchmark. How much optimization has been done till now?
Is there any way we can know how much cycles optimization has been done or any other way through which I can bench mark?
I am working on AMD64 bit machine.

Since your real goal is to optimize the program, let me suggest another way to think about it.
The main thing to measure is overall time, not cycles or times of the various routines.
Now, here's how to do optimization. Don't base it on any measurements. Rather, get a number of samples of the program's state and (this is the key point) study each sample closely enough, with your own eyes and brain, and understand what the program is doing in that state, and the full reason why it is doing it.
(You will see anything worth fixing that statistics could reveal, plus things they could not reveal, and that makes all the difference.)
As soon as you catch it in the act of doing, on two or more samples, something that could be removed, fixing it will give you a substantial speedup.
Here is an explanation of why it works and how much speedup you can expect.
After you do that, you can do the overall time measurement again and see how much time you saved.
Then don't stop. Do it again. You'll find something else to fix, which is now a bigger percent because of the first problem you removed.
In my experience, with real software, this can be done as many as 5 or 6 times, after which the program can be orders of magnitude faster than it was originally. The reason is because each optimization removes a fraction of the original execution time, and those fractions can accumulate up to nearly 100%. I'm not aware of any such result achieved with Oprofile or any other profiler tool.

Related

profiler for c++ code, very sleepy

I'm a newbie with profiling. I'd like to optimize my code to satisfy timing constraints. I use Visual C++ 08 Express and thus had to download a profiler, for me it's Very Sleepy. I did some search but found no decent tutorial on Sleepy, and here my question:
How to use it properly? I grasped the general idea of profiling, so I sorted according to %exclusive to find my bottlenecks. Firstly, on the top of this list I have ZwWaitForSingleObject, RtlEnterCriticalSection, operator new, RtlLeaveCriticalSection, printf, some iterators ... and after they take some like 60% there comes my first function, first position with Child Calls. Can someone explain me why above mentioned come out, what do they mean and how can I optimize my code if I have no access to this critical 60%? (for "source file": unknown...).
Also, for my function I'd think I get time for each line, but it's not the case, e.g. arithmetics or some functions have no timing (not nested in unused "if" clauses).
AND last thing: how to find out that some line can execute superfast, but is called thousands times, being the actual bottleneck?
Finally, is Sleepy good? Or some free alternative for my platform?
Help very appreciated!
cheers!
UPDATE - - - - -
I have found another version of profiler, called plain Sleepy. It shows how many times some snippet was called plus the number of line (I guess it points to the critical one). So in my case.. KiFastSystemCallRet takes 50%! It means that it waits for some data right? How to improve that matter, is there maybe a decent approach to trace what causes these multiple calls and eventually remove/change it?

I'd like to optimize my code to satisfy timing constraints
You're running smack into a persistent issue in this business.
You want to find ways to make your code take less time, and you (and many people) assume (and have been taught) the only way to do that is by taking various sorts of measurements.
There's a minority view, and the only thing it has to recommend it is actual significant results (plus an ironclad theory behind it).
If you've got a "bottleneck" (and you do, probably several), it's taking some fraction of time, like 30%.
Just treat it as a bug to be found.
Randomly halt the program with the pause button, and look carefully to see what the program is doing and why it's doing it.
Ask if it's something that could be gotten rid of.
Do this 10 times. On average you will see the problem on 3 of the pauses.
Any activity you see more than once, if it's not truly necessary, is a speed bug.
This does not tell you precisely how much the problem costs, but it does tell you precisely what the problem is, and that it's worth fixing.
You'll see things this way that no profiler can find, because profilers
are only programs, and cannot be broad-minded about what constitutes an opportunity.
Some folks are risk-averse, thinking it might not give enough speedup to be worth it.
Granted, there is a small chance of a low payoff, but it's like investing.
The theory says on average it will be worthwhile, and there's also a small chance of a high payoff.
In any case, if you're worried about the risks, a few more samples will settle your fears.
After you fix the problem, the remaining bottlenecks each take a larger percent, because they didn't get smaller but the overall program did.
So they will be easier to find when you repeat the whole process.
There's lots of literature about profiling, but very little that actually says how much speedup it achieves in practice.
Here's a concrete example with almost 3 orders of magnitude speedup.

I've used GlowCode (commercial product, similar to Sleepy) for profiling native C++ code. You run the instrumenting process, then execute your program, then look at the data produced by the tool. The instrumenting step injects a little trace function at every methods' entrypoints and exitpoints, and simply measures how much time it takes for each function to run through to completion.
Using the call graph profiling tool, I listed the methods sorted from "most time used" to "least time used", and the tool also displays a call count. Simply drilling into the highest percentage routine showed me which methods were using the most time. I could see that some methods were very slow, but drilling into them I discovered they were waiting for user input, or for a service to respond. And some took a long time because they were calling some internal routines thousands of times each invocation. We found someone made a coding error and was walking a large linked list repeatedly for each item in the list, when they really only needed to walk it once.
If you sort by "most frequently called" to "least called", you can see some of the tiny functions that get called from everywhere (iterator methods like next(), etc.) Something to check for is to make sure the functions that are called the most often are really clean. Saving a millisecond in a routine called 500 times to paint a screen will speed that screen up by half a second. This helps you decide which are the most important places to spend your efforts.
I've seen two common approaches to using profiling. One is to do some "general" profiling, running through a suite of "normal" operations, and discovering which methods are slowing the app down the most. The other is to do specific profiling, focusing on specific user complaints about performance, and running through those functions to reveal their issues.
One thing I would caution you about is to limit your changes to those that will measurably impact the users' experience or system throughput. Shaving one millisecond off a mouse click won't make a difference to the average user, because human reaction time simply isn't that fast. Race car drivers have reaction times in the 8 millisecond range, some elite twitch gamers are even faster, but normal users like bank tellers will have reaction times in the 20-30 millisecond range. The benefits would be negligible.
Making twenty 1-millisecond improvements or one 20-millisecond change will make the system a lot more responsive. It's cheaper and better if you can do the single big improvement over the many small improvements.
Similarly, shaving one millisecond off a service that handles 100 users per second will make a 10% improvement, meaning you could improve the service to handle 110 users per second.
The reason for concern is that coding changes strictly to improve performance often negatively impact your code's structure by adding complexity. Let's say you decided to improve a call to a database by caching results. How do you know when the cache goes invalid? Do you add a cache cleaning mechanism? Consider a financial transaction where looping through all the line items to produce a running total is slow, so you decide to keep a runningTotal accumulator to answer faster. You now have to modify the runningTotal for all kinds of situations like line voids, reversals, deletions, modifications, quantity changes, etc. It makes the code more complex and more error-prone.

Profiling code built from ifort 11.1 yields __powr8i4 routine, what is it?

I built a Fortran code with Intel 11.1. I built it with the -p option in order to produce profiling data. When I check these results, there are some routines present that aren't a part of my code. I assume they were put there by Intel. The include:
__powr8i4
__intel_new_memset
__intel_fast_memset
__intel_fast_memset.J
__intel_fast_memcpy
__intel_new_memcpy
__intel_fast_memcpy.J
There are others, too. When I build the code without optimization, the code doesn't spend much time in them. Except that results show __powr8i4 being used 3.3% of the time. However, when I build the code with optimization, this number goes way up to about 35%. I can't seem to find out what these routines are, but they are confusing my results because I want to know where to look to optimize my code.

Most programs spend a lot of their cycles in the calling of subroutines, often library subroutines, so if you look only at exclusive (self) time, you will see what you are seeing.
So point 1 is look at inclusive (self plus callees) time.
Now, if the profiler is a "CPU profiler", it will probably be blind to I/O time. That means your program might be spending most of its time reading or writing, but the profiler will give you no clue about that.
So point 2 is use a profiler that works on "wall clock" time, not "CPU" time, unless you are sure you are not doing much I/O. (Sometimes you think you're not doing I/O, but deep inside some subroutine layers deep, guess what - it's doing I/O.)
Many profilers try to produce a call-graph, and if your program does not contain recursion, and if the profiler has access to all the routines in your code, that can be helpful in identifying the subroutine calls in your code that account for a lot of time.
However, if routine A is large and calls B in several places, the profiler won't tell you which lines of code to look at.
Point 3 is use a profiler that gives you line-level inclusive time percentage, if possible.
(Percentage is the most useful number, because that tells you how much overall time you would save if you could somehow remove that line of code. Also, it is not much affected by competing processes in the system.)
One example of such a profiler is Zoom.
It may be that after you do all this, you don't see much you could do to speed up the code.
However, if you could see how certain properties of the data might affect performance, you might find there were further speedups you could get. Profilers are unable to look at data.
What I do is randomly sample the state of the program under the debugger, and see if I can really understand what it is doing at each sample.
You can find things that way that you can't find any other way.
(Some people say this is not accurate, but it is accurate - about what matters. What matters is what the problem is, not precisely how much it costs.)
And that is point 4.

Inaccuracy in gprof output

I am trying to profile a c++ function using gprof, I am intrested in the %time taken. I did more than one run and for some reason I got a large difference in the results. I don't know what is causing this, I am assuming the sampling rate or I read in other posts that I/O has something to do with it. So is there a way to make it more accurate and generate somehow almost constant results?
I was thinking of the following:
increase the sampling rate
flush the caches before executing anything
use another profiler but I want it to generate results in a similar format to grof as function time% function name, I tried Valgrind but it gave me a massive file in size. So maybe I am generating the file with the wrong command.
Waiting for your input
Regards

I recommend printing a copy of the gprof paper and reading it carefully.
According to the paper, here's how gprof measures time. It samples the PC, and it counts how many samples land in each routine. Multiplied by the time between samples, that is each routine's total self time.
It also records in a table, by call site, how many times routine A calls routine B, assuming routine B is instrumented by the -pg option. By summing those up, it can tell how many times routine B was called.
Starting from the bottom of the call tree (where total time = self time), it assumes the average time per call of each routine is its total time divided by the number of calls.
Then it works back up to each caller of those routines. The time of each routine is its average self time plus the average number of calls to each subordinate routine times the average time of the subordinate routine.
You can see, even if recursions (cycles in the call graph) are not present, how this is fraught with possibilities for errors, such as assumptions about average times and average numbers of calls, and assumptions about subroutines being instrumented, which the authors point out. If there are recursions, they basically say "forget it".
All of this technology, even if it weren't problematic, begs the question - What is it's purpose? Usually, the purpose is "find bottlenecks". According to the paper, it can help people evaluate alternative implementations. That's not finding bottlenecks. They do recommend looking at routines that seem to be called a lot of times, or that have high average times. Certainly routines with low average cumulative time should be ignored, but that doesn't localize the problem very much. And, it completely ignores I/O, as if all I/O that is done is unquestionably necessary.
So, to try to answer your question, try Zoom, for one, and don't expect to eliminate statistical noise in measurements.
gprof is a venerable tool, simple and rugged, but the problems it had in the beginning are still there, and far better tools have come along in the intervening decades.
Here's a list of the issues.

gprof is not very accurate, particularly for small functions, see http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC11
If this is Linux then I recommend a profiler that doesn't require the code to be instrumented, e.g. Zoom - you can get a free 30 day evaluation license, after that it costs money.
All sampling profilers suffer form statistical inaccuracies - if the error is too large then you need to sample for longer and/or with a smaller sampling interval.

Profilers Instrumenting Vs Sampling

I am doing a study to between profilers mainly instrumenting and sampling.
I have came up with the following info:
sampling: stop the execution of program, take PC and thus deduce were the program is
instrumenting: add some overhead code
to the program so it would increment
some pointers to know the program
If the above info is wrong correct me.
After this I was looking at the time of execution and some said that instrumenting takes more time than sampling! is this correct?
if yes why is that? in sampling you have to pay the price of context switching between processes while in the latter your in the same program no cost
Am i missing something?
cheers! =)

The interrupts generated by a sampling profiler generally add an insignficant amount of time to the total execution time, unless you have a very short sampling interval (e.g. < 1 ms).
With instrumented profiling there can be a large overhead, e.g. on small leaf functions that get called many times, as the calls to the instrumentation library can be significant compared to the execution time of the function.

It depends how conventional you want to be.
gprof does both those things you've mentioned. Here are some comments on that.
There is a school of thought that says profiling is about measuring. Measuring what? Well, anything - just measuring. Along with this goes the idea that what you want to get is a "big picture" of what's happening.
This school looks mostly at trying to find "slow functions", without clearly defining what that even means, and telling you to look there to optimize.
Another school says that you are really debugging. You want to precisely locate bugs of a certain kind - ones that don't make the program incorrect, rather they take too long. These are not big-picture things. They are very precise points in the code where something is happening that costs a lot more time than necessary.
Exactly how much more is not important. What's important is that it is located so it can be fixed.
In this viewpoint, profiling overhead is irrelevant, and so is accuracy of measurement.
What measuring is for is seeing how much time was saved.
One profiler that, I think, successfully spans both camps, is Zoom, because it samples the call stack, on wall-clock time, and presents, at the line/instruction level, percent of time on the stack. Some other profilers do this also, but most don't.
I'm in the second school, and here's an example of what you can accomplish with it.
Here's a more brief discussion of the issues.

How to get the call graph of a program with a bit of profiling information

I want to understand how a C++ program that was given to me works, and where it spends the most time.
For that I tried to use first gprof and then gprof2dot to get the pictures, but the results are sometimes kind of ugly.
How do you usually do this? Can you recommend any better alternatives?
P.D. Which are the open source solutions (preferably for Linux or Mac OS )X?

OProfile on Linux works fairly well, actually i like it better than GProf. There are a couple graphical tools that help visualize OProfile output.

You can try KCachegrind. This is a program that visualizes samples acquired by Valgrind tool called Callgrind. KCachegrind may seem to be not actively maintained, but the graphs he produces are very useful.

In my opinion there are two alternatives (on Windows):
Profilers that change the assembly instructions of your applications (that's called instrumenting) and record every detail. These profilers tend to be slow (applications running about 10 times slower), sometimes hard to set up, and often not-free, but they give you the best performance related information. Look for "Ration Quantity", "AQTime" and "Performance Validator" if you want a profiler of this type.
Profilers that don't instrument the application, but just look at a running application and collect 'samples' of it. These profilers are fast (no performance loss), often easy to set up, and you can find quite some free alternatives. Look for "Very Sleepy" and "Luke Stackwalker" if you want a profiler of this type.
Although I used commercial profilers like Rational Quantity and AQTime in the past, and was very satisfied with the results, I found that the disadvantages (hard to setup, unexplainable crashes, slow performance) outweighed the advantages.
Therefore I switched to the free alternatives and I am mainly using "Very Sleepy" at this moment.

If you want to look at the structure of your application (who calls what, references, call trees, ...) look at "Understand for C/C++". This application investigates your source code and allows you to query almost everything from the application's structure.

See the SD C++ Profiler.
Other answers here suggest that probe-oriented profilers have high overhead (10x). This one does not.

Same answer as ---
EDIT: #Steve suggested I give a less pithy answer.
I hear this all the time - "I want to find out where my program spends its time".
Let me suggest an alternate phrasing - "I want to find out why my program spends its time".
Maybe the difference isn't obvious.
When a program executes an instruction, the reason why it is doing so is encoded in the entire state of the program, including the call stack.
Looking only at the program counter is like trying to see if a taxi ride is necessary by profiling the rotation angle of its wheels.
You need to look at the whole state of the program.
There's another myth I hear all the time - that you need to measure the execution time of methods, to find the "slow" ones.
There are many ways for programs to take more time than they need to than by, say, doing a linear search instead of a binary search in some method, which might be the kind of thing people have in mind.
The way to think about it is this:
There isn't just one thing taking more time than necessary. There probably are several.
Each thing taking time is taking some fraction, like 10%, 50%, 90% or some such number. That means if the wall clock could be stopped during that time, that is how much less time the overall app could take.
You want to find out what those things are, whatever they are. Profilers (samplers) work by taking a lot of shallow samples (PC or call stack) and summarizing them to get measurements. But measurements are not what you need. What you need is finding out what it's doing, from a time perspective. It's better to get a small number of samples, like 10 or 20, and examine (not summarize) them. If some activity takes 20%, 50%, or 90% of the time, then that is the probability you will catch it in the act on each sample, so that is roughly the percent of samples on which you will see it. The important thing is finding out what it is, not getting an accurate measurement of something irrelevant.
So as a way to see what the program is doing, from a time perspective, here's how many people do it.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js