C++ profiling and optimization - c++

I have some performance issues with my application. I found this answer on Stack Overflow:
https://stackoverflow.com/a/378024/5363
which I like. One thing I don't really understand is the relationship between code optimization and profiling. Obviously one wants to profile optimized code, but a lot of information is lost during optimization. So is it practical to run optimized code in a debugger and break into it, as suggested in the linked answer?
I am using CMake with gcc under Linux, if this makes any difference.

The general law is the Pareto principle, the 80/20 law:
20% of the causes produce 80% of the consequences.
By profiling, you identify the 20% of causes that matter most in making your application slow, memory-hungry, and so on. If you fix those 20% of causes, you tackle 80% of the slowness, memory consumption, etc.
Of course the figures are just figures, meant to give you the spirit of it:
focus only on the real main causes and keep optimizing until you're satisfied.
Technically, with gcc under Linux, an answer to the question you are referring to, "How can I profile C++ code running in Linux?", suggests using, in a nutshell:
gprof
google-perftools
Valgrind
Intel VTune
Sun DTrace

If you need to collect stack samples, why do it through a debugger? Run pstack at regular intervals. You can redirect the output to a different file for each run and analyze those files later. By looking at the call stacks in these files, you may figure out the hot function. You do not need a debug binary and can do the above on a fully optimized binary.
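One way to analyze the saved stacks is sketched below. It assumes each sample has already been parsed into a list of function names (the parsing of the pstack output is not shown, since its format varies between systems), and the helper names `count_samples_containing` and `fraction_on_stack` are made up for this illustration. A function that is on the stack in a large share of the samples is a candidate for a closer look.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// One stack sample: function names from innermost to outermost frame.
// (Assumed to be parsed already from the saved pstack output.)
using StackSample = std::vector<std::string>;

// For each function, count the samples in which it appears at least once.
// A recursive function showing up several times in one stack counts once.
std::map<std::string, int> count_samples_containing(
    const std::vector<StackSample>& samples) {
    std::map<std::string, int> counts;
    for (const StackSample& sample : samples) {
        const std::set<std::string> unique(sample.begin(), sample.end());
        for (const std::string& fn : unique) {
            ++counts[fn];
        }
    }
    return counts;
}

// Fraction of samples (0..1) in which `fn` was somewhere on the stack.
double fraction_on_stack(const std::vector<StackSample>& samples,
                         const std::string& fn) {
    assert(!samples.empty());
    const auto counts = count_samples_containing(samples);
    const auto it = counts.find(fn);
    const int hits = (it == counts.end()) ? 0 : it->second;
    return static_cast<double>(hits) / static_cast<double>(samples.size());
}
```

Counting each function once per sample, rather than once per frame, keeps recursion from inflating the numbers.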
I would prefer using a profiler tool to doing the above or what is listed in the thread you refer to. Profilers quickly pinpoint the top hot functions, and you can understand the call stack by looking at the caller/callee graph. I would rather spend time understanding the caller/callee graph than analyzing random stacks using the above method.

As Schumi said, you can use something like pstack to get stack samples.
However, what you really need to know is why the program was spending time at the instant the sample was taken.
Maybe you can figure that out from only a stack of function names.
It's better if you can also see the lines of code where the calls occurred.
It's better still if you can see the argument values and data context.
The reason is that, contrary to the popular conception that you are looking for "hot spots", "slow methods", or "bottlenecks" (i.e. a measurement-based perspective), the most valuable thing to look for is things being done that could be eliminated.
In other words, when you halt the program in the debugger, consider whatever it is doing as if it were a bug.
Try to find a way not to do that thing.
However, resist doing this until you take another sample and see it doing the same thing - however you describe that thing.
Now you know it's taking significant time.
How much time? It doesn't matter - you'll find out after you fix it.
You do know that it's a lot. The fewer samples you had to take before seeing it twice, the bigger it is.
Then there's a "magnification effect". After you fix that "speed bug" the program will take a lot less time - but - that wasn't the only one.
There are others, and now they take a larger fraction of the time.
So do it all again.
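To put numbers on the magnification effect, here is a small sketch with made-up figures: a hypothetical 100 s run containing three separate speed bugs costing 50 s, 25 s and 12.5 s. The function names are invented for the example.

```cpp
#include <cassert>
#include <vector>

// Share of total run time (0..1) taken by a problem that costs
// `problem_seconds` when the whole run takes `total_seconds`.
double share(double problem_seconds, double total_seconds) {
    assert(total_seconds > 0.0 && problem_seconds <= total_seconds);
    return problem_seconds / total_seconds;
}

// Fix the bugs one at a time, in order, and record the share of run time
// each one has at the moment it is fixed.
std::vector<double> shares_when_fixed(double total_seconds,
                                      const std::vector<double>& bug_seconds) {
    std::vector<double> shares;
    for (double cost : bug_seconds) {
        shares.push_back(share(cost, total_seconds));
        total_seconds -= cost;  // the bug is gone, so the run gets shorter
    }
    return shares;
}
```

With `shares_when_fixed(100.0, {50.0, 25.0, 12.5})` every bug takes half of the run time when its turn comes, although the second and third started out at only 25% and 12.5% of the original run. Each fix makes the next one easier to catch in a sample.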
By the time you finish this, if the program is any bigger than a toy, you could be amazed at how much faster it is.
Here's a 43x speedup.
Here's a 730x speedup.
Here's the dreary math behind it.
You see, the problem with tools is you're paying a price for that ease of sampling.
Since you're thinking of it as measurement, you're not concentrating on the reasons why the code is doing what it's doing - dubious reasons.
That causes you to miss opportunities to make the code faster,
causing you to miss the magnification effect,
causing you to stop far short of your ultimate possible speedup.
EDIT: Apologies for the flame. Now to answer your question - I do not turn on compiler optimization until the very end, because it can mask bigger problems.
Then I try to do a build that has optimization turned on, but still has symbolic information so the debugger can get a reasonable stack trace and examine variables.
When I've hit diminishing speedup returns, I can see how much difference the optimizer made just by measuring overall time - don't need a profiler for that.
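Measuring overall time needs nothing more than a clock around the workload. A minimal sketch, where `time_seconds` is a name chosen for this example and the workload is whatever you pass in:

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Run `work` once and return the elapsed wall-clock time in seconds.
// steady_clock is used because it never jumps backwards.
double time_seconds(const std::function<void()>& work) {
    assert(work);  // an empty std::function would throw when called
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Call it as `time_seconds([] { run_workload(); })`, with `run_workload` standing in for your own code, once for the unoptimized build and once for the optimized one, and compare the two numbers.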

Related

Cycles consumed in each function through Oprofile

OProfile works on a sampling basis.
The opreport -l option gives a profiling report like this:
samples  %        image name  symbol name
78149    15.0776  cvqa        comp_corr.clone.2
From this I can tell the percentage of time consumed by each function. If I optimize my code and run it again, I get this report:
samples  %        image name  symbol name
73179    15.0732  cvqa        comp_corr.clone.2
This report does not tell me how many cycles the optimization saved, so I cannot benchmark it. How much has been optimized so far?
Is there any way to find out how many cycles were saved, or some other way to benchmark?
I am working on an AMD 64-bit machine.
Since your real goal is to optimize the program, let me suggest another way to think about it.
The main thing to measure is overall time, not cycles or times of the various routines.
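To see why the percentage column in the two reports hides the change, here is a small sketch. It assumes the percentage is the symbol's share of all samples in the report, and that both runs used the same workload and sampling settings, so that sample counts are comparable. The function names are invented for the example.

```cpp
#include <cassert>

// Total number of samples implied by one opreport line:
// samples_in_symbol / (percent / 100).
double implied_total_samples(double samples_in_symbol, double percent) {
    assert(percent > 0.0 && percent <= 100.0);
    return samples_in_symbol / (percent / 100.0);
}

// Relative reduction (0..1) from `before` to `after`.
double relative_reduction(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before;
}
```

With the figures from the question, `implied_total_samples(78149, 15.0776)` is about 518,000 and `implied_total_samples(73179, 15.0732)` is about 485,000, a reduction of roughly 6% in total samples even though the percentage barely moved. The routine and the whole program shrank together, which is exactly what a percentage cannot show.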
Now, here's how to do optimization. Don't base it on any measurements. Rather, get a number of samples of the program's state and (this is the key point) study each sample closely enough, with your own eyes and brain, and understand what the program is doing in that state, and the full reason why it is doing it.
(You will see anything worth fixing that statistics could reveal, plus things they could not reveal, and that makes all the difference.)
As soon as you catch it in the act of doing, on two or more samples, something that could be removed, fixing it will give you a substantial speedup.
Here is an explanation of why it works and how much speedup you can expect.
After you do that, you can do the overall time measurement again and see how much time you saved.
Then don't stop. Do it again. You'll find something else to fix, which is now a bigger percent because of the first problem you removed.
In my experience, with real software, this can be done as many as 5 or 6 times, after which the program can be orders of magnitude faster than it was originally. The reason is that each optimization removes a fraction of the original execution time, and those fractions can accumulate up to nearly 100%. I'm not aware of any such result achieved with Oprofile or any other profiler tool.
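The arithmetic behind "orders of magnitude" can be sketched as follows. The function name is made up, and the fractions in the usage note are illustrative, not measurements.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Overall speedup after removing activities that took the given fractions
// of the ORIGINAL run time: 1 / (1 - sum of fractions).
double overall_speedup(const std::vector<double>& removed_fractions) {
    const double removed = std::accumulate(removed_fractions.begin(),
                                           removed_fractions.end(), 0.0);
    assert(removed >= 0.0 && removed < 1.0);
    return 1.0 / (1.0 - removed);
}
```

For example, five fixes that remove 50%, 25%, 12.5%, 6.25% and 3.125% of the original time add up to 96.875%, so `overall_speedup({0.5, 0.25, 0.125, 0.0625, 0.03125})` gives a factor of 32. The last few percent matter a great deal: stopping after the first two fixes gives only a factor of 4.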

profiler for c++ code, very sleepy

I'm a newbie with profiling. I'd like to optimize my code to satisfy timing constraints. I use Visual C++ 08 Express and thus had to download a profiler; for me it's Very Sleepy. I did some searching but found no decent tutorial on Sleepy, so here is my question:
How do I use it properly? I grasped the general idea of profiling, so I sorted by %exclusive to find my bottlenecks. At the top of this list I have ZwWaitForSingleObject, RtlEnterCriticalSection, operator new, RtlLeaveCriticalSection, printf, some iterators ... and only after they take about 60% comes my first function, the first entry with Child Calls. Can someone explain to me why the above-mentioned functions come up, what they mean, and how I can optimize my code if I have no access to this critical 60%? (For "source file": unknown...)
Also, I'd expect to get a time for each line of my function, but that's not the case: e.g. arithmetic or some function calls have no timing (and they are not nested in unused "if" clauses).
And one last thing: how do I find out that some line executes superfast but is called thousands of times, making it the actual bottleneck?
Finally, is Sleepy good? Or is there some free alternative for my platform?
Help much appreciated!
cheers!
UPDATE - - - - -
I have found another version of the profiler, called plain Sleepy. It shows how many times some snippet was called plus the line number (I guess it points to the critical one). So in my case, KiFastSystemCallRet takes 50%! That means it is waiting for some data, right? How can I improve that? Is there a decent approach to trace what causes these multiple calls and eventually remove or change it?
"I'd like to optimize my code to satisfy timing constraints"
You're running smack into a persistent issue in this business.
You want to find ways to make your code take less time, and you (and many people) assume (and have been taught) the only way to do that is by taking various sorts of measurements.
There's a minority view, and the only thing it has to recommend it is actual significant results (plus an ironclad theory behind it).
If you've got a "bottleneck" (and you do, probably several), it's taking some fraction of time, like 30%.
Just treat it as a bug to be found.
Randomly halt the program with the pause button, and look carefully to see what the program is doing and why it's doing it.
Ask if it's something that could be gotten rid of.
Do this 10 times. On average you will see the problem on 3 of the pauses.
Any activity you see more than once, if it's not truly necessary, is a speed bug.
This does not tell you precisely how much the problem costs, but it does tell you precisely what the problem is, and that it's worth fixing.
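How likely is it that 10 pauses catch a 30% problem at least twice? A small sketch of the calculation, assuming the pauses are independent and land at random times (the function name is made up):

```cpp
#include <cassert>
#include <cmath>

// Probability of seeing an activity on at least two of `pauses` random
// pauses, if it is on the stack for `fraction` of the run time.
// Computed as 1 - P(zero hits) - P(exactly one hit).
double prob_seen_at_least_twice(double fraction, int pauses) {
    assert(fraction >= 0.0 && fraction <= 1.0 && pauses >= 2);
    const double miss = 1.0 - fraction;
    const double none = std::pow(miss, pauses);
    const double once = pauses * fraction * std::pow(miss, pauses - 1);
    return 1.0 - none - once;
}
```

`prob_seen_at_least_twice(0.3, 10)` comes out at about 0.85, so ten pauses are usually enough for a problem of that size. Bigger problems are caught even more reliably, and smaller ones just need a few more pauses.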
You'll see things this way that no profiler can find, because profilers are only programs, and cannot be broad-minded about what constitutes an opportunity.
Some folks are risk-averse, thinking it might not give enough speedup to be worth it.
Granted, there is a small chance of a low payoff, but it's like investing.
The theory says on average it will be worthwhile, and there's also a small chance of a high payoff.
In any case, if you're worried about the risks, a few more samples will settle your fears.
After you fix the problem, the remaining bottlenecks each take a larger percent, because they didn't get smaller but the overall program did.
So they will be easier to find when you repeat the whole process.
There's lots of literature about profiling, but very little that actually says how much speedup it achieves in practice.
Here's a concrete example with almost 3 orders of magnitude speedup.
I've used GlowCode (commercial product, similar to Sleepy) for profiling native C++ code. You run the instrumenting process, then execute your program, then look at the data produced by the tool. The instrumenting step injects a little trace function at every method's entry and exit points, and simply measures how much time each function takes to run through to completion.
Using the call graph profiling tool, I listed the methods sorted from "most time used" to "least time used", and the tool also displays a call count. Simply drilling into the highest percentage routine showed me which methods were using the most time. I could see that some methods were very slow, but drilling into them I discovered they were waiting for user input, or for a service to respond. And some took a long time because they were calling some internal routines thousands of times each invocation. We found someone made a coding error and was walking a large linked list repeatedly for each item in the list, when they really only needed to walk it once.
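The original code with the linked-list mistake isn't shown here, so the following is only an illustrative reconstruction of that kind of error, using running totals as a stand-in task. Both function names are made up.

```cpp
#include <cassert>
#include <list>
#include <vector>

// Slow shape: for every item, walk the list again from the head to
// compute the running total up to that item. Quadratic in list length.
std::vector<long> running_totals_rewalking(const std::list<int>& items) {
    std::vector<long> totals;
    for (auto end = items.begin(); end != items.end(); ++end) {
        long sum = 0;
        for (auto it = items.begin(); it != end; ++it) sum += *it;
        sum += *end;
        totals.push_back(sum);
    }
    return totals;
}

// Fixed shape: walk the list once, carrying the sum along. Linear.
std::vector<long> running_totals_single_pass(const std::list<int>& items) {
    std::vector<long> totals;
    totals.reserve(items.size());
    long sum = 0;
    for (int value : items) {
        sum += value;
        totals.push_back(sum);
    }
    assert(totals.size() == items.size());
    return totals;
}
```

Both functions return the same result; only the number of list nodes visited differs. In a profile the slow version shows up as a huge call count on the iterator functions, which is the pattern described above.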
If you sort by "most frequently called" to "least called", you can see some of the tiny functions that get called from everywhere (iterator methods like next(), etc.) Something to check for is to make sure the functions that are called the most often are really clean. Saving a millisecond in a routine called 500 times to paint a screen will speed that screen up by half a second. This helps you decide which are the most important places to spend your efforts.
I've seen two common approaches to using profiling. One is to do some "general" profiling, running through a suite of "normal" operations, and discovering which methods are slowing the app down the most. The other is to do specific profiling, focusing on specific user complaints about performance, and running through those functions to reveal their issues.
One thing I would caution you about is to limit your changes to those that will measurably impact the users' experience or system throughput. Shaving one millisecond off a mouse click won't make a difference to the average user, because human reaction time simply isn't that fast. Race car drivers have reaction times in the 8 millisecond range, some elite twitch gamers are even faster, but normal users like bank tellers will have reaction times in the 20-30 millisecond range. The benefits would be negligible.
Making twenty 1-millisecond improvements or one 20-millisecond change will make the system a lot more responsive. It's cheaper and better if you can do the single big improvement over the many small improvements.
Similarly, shaving one millisecond off a service that handles 100 users per second (that is, 10 ms per request if requests are handled one at a time) is an improvement of about 10%, meaning the service could handle roughly 110 users per second.
The reason for concern is that coding changes strictly to improve performance often negatively impact your code's structure by adding complexity. Let's say you decided to improve a call to a database by caching results. How do you know when the cache goes invalid? Do you add a cache cleaning mechanism? Consider a financial transaction where looping through all the line items to produce a running total is slow, so you decide to keep a runningTotal accumulator to answer faster. You now have to modify the runningTotal for all kinds of situations like line voids, reversals, deletions, modifications, quantity changes, etc. It makes the code more complex and more error-prone.
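A sketch of what that runningTotal bookkeeping tends to look like. The class and member names are hypothetical, amounts are kept in cents to avoid floating-point money, and only two of the many possible mutations are shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct LineItem {
    long unit_price_cents;
    int quantity;
    bool voided;
};

class Transaction {
public:
    // Every mutation below must remember to adjust running_total_cents_.
    std::size_t add_item(long unit_price_cents, int quantity) {
        items_.push_back({unit_price_cents, quantity, false});
        running_total_cents_ += unit_price_cents * quantity;
        return items_.size() - 1;
    }

    void void_item(std::size_t index) {
        assert(index < items_.size());
        LineItem& item = items_[index];
        if (item.voided) return;  // voiding twice must not subtract twice
        running_total_cents_ -= item.unit_price_cents * item.quantity;
        item.voided = true;
    }

    void change_quantity(std::size_t index, int new_quantity) {
        assert(index < items_.size());
        LineItem& item = items_[index];
        if (!item.voided) {
            running_total_cents_ +=
                item.unit_price_cents * (new_quantity - item.quantity);
        }
        item.quantity = new_quantity;
    }

    // Fast path: the cached accumulator.
    long total_cents() const { return running_total_cents_; }

    // Slow path: recompute from the line items; handy as a cross-check.
    long recomputed_total_cents() const {
        long sum = 0;
        for (const LineItem& item : items_) {
            if (!item.voided) sum += item.unit_price_cents * item.quantity;
        }
        return sum;
    }

private:
    std::vector<LineItem> items_;
    long running_total_cents_ = 0;
};
```

Note the special cases already needed for just two operations (double voids, quantity changes on voided lines). Keeping the slow recomputation around as a check in tests is one way to catch a forgotten update.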

Profiling code built from ifort 11.1 yields __powr8i4 routine, what is it?

I built a Fortran code with Intel 11.1. I built it with the -p option in order to produce profiling data. When I check these results, there are some routines present that aren't a part of my code. I assume they were put there by Intel. They include:
__powr8i4
__intel_new_memset
__intel_fast_memset
__intel_fast_memset.J
__intel_fast_memcpy
__intel_new_memcpy
__intel_fast_memcpy.J
There are others, too. When I build the code without optimization, the code doesn't spend much time in them. Except that results show __powr8i4 being used 3.3% of the time. However, when I build the code with optimization, this number goes way up to about 35%. I can't seem to find out what these routines are, but they are confusing my results because I want to know where to look to optimize my code.
Most programs spend a lot of their cycles in the calling of subroutines, often library subroutines, so if you look only at exclusive (self) time, you will see what you are seeing.
So point 1 is look at inclusive (self plus callees) time.
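The difference between the two numbers can be sketched on a toy call tree. The struct and function names are made up for the example, and a real profiler derives this from samples rather than from a stored tree.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One node of a call tree with its exclusive (self) time in seconds.
struct CallNode {
    std::string name;
    double self_seconds;
    std::vector<CallNode> callees;
};

// Inclusive time = self time plus the inclusive time of every callee.
double inclusive_seconds(const CallNode& node) {
    assert(node.self_seconds >= 0.0);
    double total = node.self_seconds;
    for (const CallNode& callee : node.callees) {
        total += inclusive_seconds(callee);
    }
    return total;
}
```

Suppose a routine of yours has 0.1 s of self time and calls a library routine such as __powr8i4 that has 3.5 s of self time (numbers invented). Sorted by self time, your routine looks innocent; its inclusive time of 3.6 s tells you that the call made from your routine is where to look.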
Now, if the profiler is a "CPU profiler", it will probably be blind to I/O time. That means your program might be spending most of its time reading or writing, but the profiler will give you no clue about that.
So point 2 is use a profiler that works on "wall clock" time, not "CPU" time, unless you are sure you are not doing much I/O. (Sometimes you think you're not doing I/O, but deep inside some subroutine, several layers down, guess what - it's doing I/O.)
Many profilers try to produce a call-graph, and if your program does not contain recursion, and if the profiler has access to all the routines in your code, that can be helpful in identifying the subroutine calls in your code that account for a lot of time.
However, if routine A is large and calls B in several places, the profiler won't tell you which lines of code to look at.
Point 3 is use a profiler that gives you line-level inclusive time percentage, if possible.
(Percentage is the most useful number, because that tells you how much overall time you would save if you could somehow remove that line of code. Also, it is not much affected by competing processes in the system.)
One example of such a profiler is Zoom.
It may be that after you do all this, you don't see much you could do to speed up the code.
However, if you could see how certain properties of the data might affect performance, you might find there were further speedups you could get. Profilers are unable to look at data.
What I do is randomly sample the state of the program under the debugger, and see if I can really understand what it is doing at each sample.
You can find things that way that you can't find any other way.
(Some people say this is not accurate, but it is accurate - about what matters. What matters is what the problem is, not precisely how much it costs.)
And that is point 4.

Profilers Instrumenting Vs Sampling

I am doing a study comparing profilers, mainly instrumenting versus sampling.
I have come up with the following info:
sampling: stop the execution of the program, take the PC, and thus deduce where the program is
instrumenting: add some overhead code to the program so it would increment some pointers to know the program
If the above info is wrong, correct me.
After this I was looking at execution time, and some said that instrumenting takes more time than sampling! Is this correct?
If yes, why is that? In sampling you have to pay the price of context switching between processes, while in the latter you're in the same program, at no cost.
Am I missing something?
cheers! =)
The interrupts generated by a sampling profiler generally add an insignificant amount of time to the total execution time, unless you have a very short sampling interval (e.g. < 1 ms).
With instrumented profiling there can be a large overhead, e.g. on small leaf functions that get called many times, as the calls to the instrumentation library can be significant compared to the execution time of the function.
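To make the overhead concrete, here is a hand-written stand-in for what an instrumenting profiler injects. It is only a sketch: `ScopedProbe`, `probe_table` and `add_one` are invented names, real tools insert their probes automatically and far more efficiently, and this version is not thread-safe.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Per-function totals collected by the probes.
struct ProbeStats {
    long calls = 0;
    double seconds = 0.0;
};

// Global table for this sketch only; not thread-safe.
inline std::map<std::string, ProbeStats>& probe_table() {
    static std::map<std::string, ProbeStats> table;
    return table;
}

// RAII probe: constructed on function entry, destroyed on exit.
class ScopedProbe {
public:
    explicit ScopedProbe(const char* function_name)
        : name_(function_name), start_(std::chrono::steady_clock::now()) {
        assert(function_name != nullptr);
    }
    ~ScopedProbe() {
        const auto stop = std::chrono::steady_clock::now();
        ProbeStats& stats = probe_table()[name_];
        ++stats.calls;
        stats.seconds += std::chrono::duration<double>(stop - start_).count();
    }
    ScopedProbe(const ScopedProbe&) = delete;
    ScopedProbe& operator=(const ScopedProbe&) = delete;

private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

// A tiny leaf function: the probe's two clock reads and the table lookup
// can easily cost more than the addition being measured.
int add_one(int x) {
    ScopedProbe probe("add_one");
    return x + 1;
}
```

Every call to `add_one` now pays for the probe, so a function called millions of times is slowed down and its measured time is inflated. A sampling profiler, by contrast, pays per sample regardless of how many calls happen in between.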
It depends how conventional you want to be.
gprof does both those things you've mentioned. Here are some comments on that.
There is a school of thought that says profiling is about measuring. Measuring what? Well, anything - just measuring. Along with this goes the idea that what you want to get is a "big picture" of what's happening.
This school looks mostly at trying to find "slow functions", without clearly defining what that even means, and telling you to look there to optimize.
Another school says that you are really debugging. You want to precisely locate bugs of a certain kind - ones that don't make the program incorrect, rather they take too long. These are not big-picture things. They are very precise points in the code where something is happening that costs a lot more time than necessary.
Exactly how much more is not important. What's important is that it is located so it can be fixed.
In this viewpoint, profiling overhead is irrelevant, and so is accuracy of measurement.
What measuring is for is seeing how much time was saved.
One profiler that, I think, successfully spans both camps, is Zoom, because it samples the call stack, on wall-clock time, and presents, at the line/instruction level, percent of time on the stack. Some other profilers do this also, but most don't.
I'm in the second school, and here's an example of what you can accomplish with it.
Here's a more brief discussion of the issues.

How to get the call graph of a program with a bit of profiling information

I want to understand how a C++ program that was given to me works, and where it spends the most time.
For that I tried to use first gprof and then gprof2dot to get the pictures, but the results are sometimes kind of ugly.
How do you usually do this? Can you recommend any better alternatives?
P.S. Which are the open source solutions (preferably for Linux or Mac OS X)?
OProfile on Linux works fairly well; actually I like it better than gprof. There are a couple of graphical tools that help visualize OProfile output.
You can try KCachegrind. This is a program that visualizes samples acquired by the Valgrind tool called Callgrind. KCachegrind may not seem to be actively maintained, but the graphs it produces are very useful.
In my opinion there are two alternatives (on Windows):
Profilers that change the assembly instructions of your application (that's called instrumenting) and record every detail. These profilers tend to be slow (applications running about 10 times slower), sometimes hard to set up, and often not free, but they give you the best performance-related information. Look for "Rational Quantify", "AQTime" and "Performance Validator" if you want a profiler of this type.
Profilers that don't instrument the application, but just look at a running application and collect 'samples' of it. These profilers are fast (no performance loss), often easy to set up, and you can find quite some free alternatives. Look for "Very Sleepy" and "Luke Stackwalker" if you want a profiler of this type.
Although I used commercial profilers like Rational Quantify and AQTime in the past, and was very satisfied with the results, I found that the disadvantages (hard to set up, unexplainable crashes, slow performance) outweighed the advantages.
Therefore I switched to the free alternatives and I am mainly using "Very Sleepy" at this moment.
If you want to look at the structure of your application (who calls what, references, call trees, ...) look at "Understand for C/C++". This application investigates your source code and allows you to query almost everything from the application's structure.
See the SD C++ Profiler.
Other answers here suggest that probe-oriented profilers have high overhead (10x). This one does not.
Same answer as ---
EDIT: @Steve suggested I give a less pithy answer.
I hear this all the time - "I want to find out where my program spends its time".
Let me suggest an alternate phrasing - "I want to find out why my program spends its time".
Maybe the difference isn't obvious.
When a program executes an instruction, the reason why it is doing so is encoded in the entire state of the program, including the call stack.
Looking only at the program counter is like trying to see if a taxi ride is necessary by profiling the rotation angle of its wheels.
You need to look at the whole state of the program.
There's another myth I hear all the time - that you need to measure the execution time of methods, to find the "slow" ones.
There are many more ways for programs to take more time than they need to than, say, doing a linear search instead of a binary search in some method, which might be the kind of thing people have in mind.
The way to think about it is this:
There isn't just one thing taking more time than necessary. There probably are several.
Each thing taking time is taking some fraction, like 10%, 50%, 90% or some such number. That means if the wall clock could be stopped during that time, that is how much less time the overall app could take.
You want to find out what those things are, whatever they are. Profilers (samplers) work by taking a lot of shallow samples (PC or call stack) and summarizing them to get measurements. But measurements are not what you need. What you need is finding out what it's doing, from a time perspective. It's better to get a small number of samples, like 10 or 20, and examine (not summarize) them. If some activity takes 20%, 50%, or 90% of the time, then that is the probability you will catch it in the act on each sample, so that is roughly the percent of samples on which you will see it. The important thing is finding out what it is, not getting an accurate measurement of something irrelevant.
So as a way to see what the program is doing, from a time perspective, here's how many people do it.