Profiling serialization code - c++

I ran my app twice (in the VS IDE). The first time it took 33 seconds. Then I uncommented obj.save, which calls a lot of code, and it took 87 seconds. That's some slow serialization code! I suspect two problems. The first is that I do the below:
template<class T> void Save_IntX(ostream& o, T v){ o.write((char*)&v,sizeof(T)); }
I call this template hundreds of thousands of times (well, maybe not that many). Does each .write() take a lock that may be slowing it down? Maybe I can use a memory stream, which doesn't require a lock, and dump that instead? Which ostream can I use that doesn't lock and perhaps assumes it's only used from a single thread?
The other suspected problem is that I use dynamic_cast a lot, but I am unsure whether I can work around this.
Here is a quick profiling session after converting it to use fopen instead of ostream. I wonder why I don't see the majority of my functions in this list, but as you can see, write is still taking the longest. Note: I just realized my output file is half a gig. Oops. Maybe that is why.
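For what it's worth, the memory-stream idea from the question could be tried roughly like this. It is only a sketch, assuming the serialized data fits in memory; SaveBuffered and its loop are made-up stand-ins for obj.save, and Save_IntX is the template from above (with the cast spelled out). Whether it helps depends on where the time really goes; a half-gigabyte file will take a while to write either way.

```cpp
#include <cstdint>
#include <fstream>
#include <ostream>
#include <sstream>
#include <string>

// The helper from the question: writes the raw bytes of v to the stream.
template <class T>
void Save_IntX(std::ostream& o, T v) {
    o.write(reinterpret_cast<const char*>(&v), sizeof(T));
}

// Hypothetical stand-in for obj.save: serialize into memory first, then
// hand the whole buffer to the file in a single write call.
inline void SaveBuffered(const std::string& path) {
    std::ostringstream mem(std::ios::binary);
    for (std::int32_t i = 0; i < 100000; ++i) {
        Save_IntX(mem, i);  // many small writes, but only into memory
    }
    const std::string bytes = mem.str();
    std::ofstream file(path, std::ios::binary);
    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}
```

Timing the two variants against each other (direct file stream vs. memory stream plus one dump) would show whether the per-call overhead of write was ever the problem.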

I'm glad you got it figured out, but the next time you do profiling, you might want to consider a few points:
The VS profiler in sampling mode does not sample during I/O or any other time your program is blocked, so it's only really useful for CPU-bound profiling. For example, if it says a routine has 80% inclusive time, but the app is actually computing only 10% of the time, that 80% is really only 8%. Because of this, for any non-CPU-bound work, you need to use the profiler's instrumentation mode.
Assuming you did that, of all those columns of data, the one that matters is "Inclusive %", because that is the routine's true cost, in the sense that if it could be avoided, that is how much the overall time would be reduced.
Of all those rows of data, the ones likely to matter are the ones containing your routines, because your routines are the only ones you can do anything about. It looks like "Unknown Frames" are maybe your code, if your code is compiled without debugging info. In general, it's a good idea to profile with debugging info, make it fast, and then remove the debugging info.
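To see why a CPU-only view can mislead on I/O-heavy code, here is a small sketch that times the same work by wall clock and by CPU time. MeasureBoth and PretendToWaitForDisk are made-up names, and the sleep is only a stand-in for a blocking write. It assumes a POSIX-like system where std::clock reports processor time; as far as I know Microsoft's runtime reports elapsed time from clock(), so the comparison would not show the gap there.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // processor time charged to this process
};

// Hypothetical helper: run `work` once and time it both ways.
template <class F>
Timing MeasureBoth(F work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}

// Stand-in for blocking I/O: the thread sleeps and uses almost no CPU.
inline void PretendToWaitForDisk() {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
}
```

Calling MeasureBoth(PretendToWaitForDisk) should give a wall time of about 0.2 seconds and a CPU time close to zero, which is the part a CPU sampler never sees.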

Related

C++ profiling and optimization

I have some issues with performance of my application. I found this answer on Stackoverflow:
https://stackoverflow.com/a/378024/5363
which I like. One bit I don't really understand is what is the relation between code optimization and profiling. Because obviously one wants to profile optimized code, but at the same time a lot of information is lost during optimizations. So is it practical to run optimized code in a debugger and break into it as suggested in the quoted answer?
I am using CMake with gcc under Linux, if this makes any difference.
The general law is called the Pareto principle, the 80/20 rule:
20% of the causes produce 80% of the consequences.
By profiling, you are going to identify the 20% of causes that matter most in making your application slow, memory-hungry, and so on. And if you fix those 20% of causes, you'll tackle 80% of the slowness, memory consumption, etc.
Of course the figures are just figures, meant to give you the spirit of it:
You have to focus only on the real main causes and keep optimizing until you're satisfied.
Technically, with gcc under Linux, an answer to the question you are referring to, "How can I profile C++ code running in Linux?", suggests using, in a nutshell:
gprof
google-perftools
Valgrind
Intel VTune
Sun DTrace
If you need to collect stack samples, why do it through a debugger? Run pstack at regular time intervals. You can redirect the output to a different file for each run and analyze those files later. By looking at the call stacks in these files, you may figure out the hot function. You do not need a debug binary and can do the above on a fully optimized binary.
I would prefer using a profiler tool to doing the above or doing what is listed in the thread that you refer to. Profilers quickly pinpoint the top hot functions, and you can understand the call stack by looking at the caller-callee graph. I would spend time understanding the caller-callee stack rather than analyzing random stacks using the above method.
As Schumi said, you can use something like pstack to get stack samples.
However, what you really need to know is why the program was doing what it was doing at the instant the sample was taken.
Maybe you can figure that out from only a stack of function names.
It's better if you can also see the lines of code where the calls occurred.
It's better still if you can see the argument values and data context.
The reason is that, contrary to the popular, measurement-based conception that you are looking for "hot spots", "slow methods", or "bottlenecks", the most valuable thing to look for is things being done that could be eliminated.
In other words, when you halt the program in the debugger, consider whatever it is doing as if it were a bug.
Try to find a way not to do that thing.
However, resist doing this until you take another sample and see it doing the same thing - however you describe that thing.
Now you know it's taking significant time.
How much time? It doesn't matter - you'll find out after you fix it.
You do know that it's a lot. The fewer samples you had to take before seeing it twice, the bigger it is.
Then there's a "magnification effect". After you fix that "speed bug" the program will take a lot less time - but - that wasn't the only one.
There are others, and now they take a larger fraction of the time.
So do it all again.
By the time you finish this, if the program is any bigger than a toy, you could be amazed at how much faster it is.
Here's a 43x speedup.
Here's a 730x speedup.
Here's the dreary math behind it.
You see, the problem with tools is you're paying a price for that ease of sampling.
Since you're thinking of it as measurement, you're not concentrating on the reasons why the code is doing what it's doing - dubious reasons.
That causes you to miss opportunities to make the code faster,
causing you to miss the magnification effect,
causing you to stop far short of your ultimate possible speedup.
EDIT: Apologies for the flame. Now to answer your question - I do not turn on compiler optimization until the very end, because it can mask bigger problems.
Then I try to do a build that has optimization turned on, but still has symbolic information so the debugger can get a reasonable stack trace and examine variables.
When I've hit diminishing speedup returns, I can see how much difference the optimizer made just by measuring overall time - don't need a profiler for that.
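The "magnification effect" described above is plain arithmetic, and a small sketch may make it concrete. The function names and the example fractions are made up, and the sketch assumes the speed bugs don't overlap in time.

```cpp
#include <cstddef>
#include <vector>

// Each entry is the fraction of the ORIGINAL run time taken by one
// avoidable activity (illustrative numbers, assumed not to overlap).
// Returns the overall speedup factor after fixing the first k of them.
inline double SpeedupAfterFixing(const std::vector<double>& fractions, std::size_t k) {
    double remaining = 1.0;
    for (std::size_t i = 0; i < k && i < fractions.size(); ++i) {
        remaining -= fractions[i];
    }
    return 1.0 / remaining;
}

// Fraction of the CURRENT (already shortened) run time that problem
// `index` takes once the first k problems have been fixed.
inline double ShareAfterFixing(const std::vector<double>& fractions,
                               std::size_t k, std::size_t index) {
    return fractions[index] * SpeedupAfterFixing(fractions, k);
}
```

With fractions of 0.5, 0.25 and 0.125, fixing the first one gives a 2x speedup and makes the second one 50% of what is left, so it is just as easy to spot as the first was. Fixing all three gives 8x.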

profiler for c++ code, very sleepy

I'm a newbie with profiling. I'd like to optimize my code to satisfy timing constraints. I use Visual C++ 08 Express and thus had to download a profiler; for me it's Very Sleepy. I did some searching but found no decent tutorial on Sleepy, so here is my question:
How do I use it properly? I grasped the general idea of profiling, so I sorted by %exclusive to find my bottlenecks. Firstly, at the top of this list I have ZwWaitForSingleObject, RtlEnterCriticalSection, operator new, RtlLeaveCriticalSection, printf, some iterators ... and only after they take something like 60% comes my first function, the first position with Child Calls. Can someone explain to me why the above-mentioned come up, what they mean, and how I can optimize my code if I have no access to this critical 60% (for "source file": unknown...)?
Also, for my function I'd expect to get a time for each line, but that's not the case: e.g. arithmetic or some function calls have no timing (and they are not nested in unused "if" clauses).
And one last thing: how do I find a line that executes superfast but is called thousands of times, making it the actual bottleneck?
Finally, is Sleepy good? Or is there some free alternative for my platform?
Help much appreciated!
cheers!
UPDATE - - - - -
I have found another version of the profiler, called plain Sleepy. It shows how many times some snippet was called plus the line number (I guess it points to the critical one). So in my case, KiFastSystemCallRet takes 50%! It means that it waits for some data, right? How can I improve that? Is there a decent approach to trace what causes these multiple calls and eventually remove or change it?
I'd like to optimize my code to satisfy timing constraints
You're running smack into a persistent issue in this business.
You want to find ways to make your code take less time, and you (and many people) assume (and have been taught) the only way to do that is by taking various sorts of measurements.
There's a minority view, and the only thing it has to recommend it is actual significant results (plus an ironclad theory behind it).
If you've got a "bottleneck" (and you do, probably several), it's taking some fraction of time, like 30%.
Just treat it as a bug to be found.
Randomly halt the program with the pause button, and look carefully to see what the program is doing and why it's doing it.
Ask if it's something that could be gotten rid of.
Do this 10 times. On average you will see the problem on 3 of the pauses.
Any activity you see more than once, if it's not truly necessary, is a speed bug.
This does not tell you precisely how much the problem costs, but it does tell you precisely what the problem is, and that it's worth fixing.
You'll see things this way that no profiler can find, because profilers are only programs, and cannot be broad-minded about what constitutes an opportunity.
Some folks are risk-averse, thinking it might not give enough speedup to be worth it.
Granted, there is a small chance of a low payoff, but it's like investing.
The theory says on average it will be worthwhile, and there's also a small chance of a high payoff.
In any case, if you're worried about the risks, a few more samples will settle your fears.
After you fix the problem, the remaining bottlenecks each take a larger percent, because they didn't get smaller but the overall program did.
So they will be easier to find when you repeat the whole process.
There's lots of literature about profiling, but very little that actually says how much speedup it achieves in practice.
Here's a concrete example with almost 3 orders of magnitude speedup.
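The claim about 10 pauses can be checked with a little binomial arithmetic. This is a sketch under the assumption that the pauses land at independent, uniformly random moments; ChanceSeenTwice is a made-up name.

```cpp
#include <cmath>

// Probability that an activity occupying fraction f of the run time
// shows up on at least two of n independent random pauses.
inline double ChanceSeenTwice(double f, int n) {
    const double none = std::pow(1.0 - f, n);               // never seen
    const double once = n * f * std::pow(1.0 - f, n - 1);   // seen exactly once
    return 1.0 - none - once;
}
```

For the 30% bottleneck above and 10 pauses, ChanceSeenTwice(0.3, 10) comes out at about 0.85, and the expected number of pauses that show it is 10 * 0.3 = 3. Bigger problems get caught with fewer pauses, which is why the approach finds the costly ones first.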
I've used GlowCode (commercial product, similar to Sleepy) for profiling native C++ code. You run the instrumenting process, then execute your program, then look at the data produced by the tool. The instrumenting step injects a little trace function at every method's entry and exit points, and simply measures how much time it takes for each function to run through to completion.
Using the call graph profiling tool, I listed the methods sorted from "most time used" to "least time used", and the tool also displays a call count. Simply drilling into the highest percentage routine showed me which methods were using the most time. I could see that some methods were very slow, but drilling into them I discovered they were waiting for user input, or for a service to respond. And some took a long time because they were calling some internal routines thousands of times each invocation. We found someone made a coding error and was walking a large linked list repeatedly for each item in the list, when they really only needed to walk it once.
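I can't show the actual code, but the linked-list mistake had roughly the shape sketched below. The function names and the "total" being computed are made up for illustration. Counting node visits makes the cost difference visible without any timing.

```cpp
#include <cstddef>
#include <list>

// Buggy shape: for every item, walk the whole list again to recompute a
// value that does not depend on the item. Returns the node visits made.
inline std::size_t VisitsWhenRewalking(const std::list<int>& items) {
    std::size_t visits = 0;
    for (auto outer = items.begin(); outer != items.end(); ++outer) {
        long total = 0;
        for (int value : items) { total += value; ++visits; }
        (void)total;  // the real code would use the total for this item
    }
    return visits;
}

// Fixed shape: walk once, then reuse the result for every item.
inline std::size_t VisitsWhenWalkingOnce(const std::list<int>& items) {
    std::size_t visits = 0;
    long total = 0;
    for (int value : items) { total += value; ++visits; }
    (void)total;
    return visits;
}
```

For a list of 1000 items that is 1,000,000 visits against 1000, which is exactly the kind of thing a high call count in the profiler points at.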
If you sort by "most frequently called" to "least called", you can see some of the tiny functions that get called from everywhere (iterator methods like next(), etc.) Something to check for is to make sure the functions that are called the most often are really clean. Saving a millisecond in a routine called 500 times to paint a screen will speed that screen up by half a second. This helps you decide which are the most important places to spend your efforts.
I've seen two common approaches to using profiling. One is to do some "general" profiling, running through a suite of "normal" operations, and discovering which methods are slowing the app down the most. The other is to do specific profiling, focusing on specific user complaints about performance, and running through those functions to reveal their issues.
One thing I would caution you about is to limit your changes to those that will measurably impact the users' experience or system throughput. Shaving one millisecond off a mouse click won't make a difference to the average user, because human reaction time simply isn't that fast: even practiced people such as race car drivers and elite twitch gamers need on the order of a hundred milliseconds or more to react, and normal users like bank tellers are slower still. The benefits would be negligible.
Making twenty 1-millisecond improvements or one 20-millisecond change will make the system a lot more responsive. It's cheaper and better if you can do the single big improvement over the many small improvements.
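Similarly, shaving one millisecond off a service that handles 100 users per second will make roughly a 10% improvement (assuming requests are handled one at a time, each takes 10 milliseconds, and 9 milliseconds per request works out to about 111 requests per second), meaning you could improve the service to handle about 110 users per second.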
The reason for concern is that coding changes strictly to improve performance often negatively impact your code's structure by adding complexity. Let's say you decided to improve a call to a database by caching results. How do you know when the cache goes invalid? Do you add a cache cleaning mechanism? Consider a financial transaction where looping through all the line items to produce a running total is slow, so you decide to keep a runningTotal accumulator to answer faster. You now have to modify the runningTotal for all kinds of situations like line voids, reversals, deletions, modifications, quantity changes, etc. It makes the code more complex and more error-prone.
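To make the runningTotal example concrete, here is a minimal sketch of what the accumulator version ends up looking like. The Transaction class, its members, and the cents-based prices are all made up for illustration. Note how every mutating operation now has to remember to keep the accumulator in step.

```cpp
#include <cstddef>
#include <vector>

struct LineItem {
    long unit_price_cents;
    int quantity;
    bool voided;
};

class Transaction {
public:
    void Add(long unit_price_cents, int quantity) {
        items_.push_back(LineItem{unit_price_cents, quantity, false});
        running_total_ += unit_price_cents * quantity;
    }
    void Void(std::size_t i) {
        if (items_[i].voided) return;  // voiding twice must not subtract twice
        items_[i].voided = true;
        running_total_ -= items_[i].unit_price_cents * items_[i].quantity;
    }
    void ChangeQuantity(std::size_t i, int quantity) {
        if (!items_[i].voided) {       // voided lines no longer count
            running_total_ += items_[i].unit_price_cents * (quantity - items_[i].quantity);
        }
        items_[i].quantity = quantity;
    }
    // Fast answer from the accumulator.
    long FastTotal() const { return running_total_; }
    // Slow but obviously correct answer: loop over all line items.
    long SlowTotal() const {
        long total = 0;
        for (const LineItem& item : items_) {
            if (!item.voided) total += item.unit_price_cents * item.quantity;
        }
        return total;
    }
private:
    std::vector<LineItem> items_;
    long running_total_ = 0;
};
```

SlowTotal is four lines and hard to get wrong. FastTotal is one line, but its correctness is spread across every other method, and each new kind of change (reversals, price edits, deletions) is another place to introduce a bug.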

Profiling code built from ifort 11.1 yields __powr8i4 routine, what is it?

I built a Fortran code with Intel 11.1. I built it with the -p option in order to produce profiling data. When I check these results, there are some routines present that aren't a part of my code. I assume they were put there by Intel. They include:
__powr8i4
__intel_new_memset
__intel_fast_memset
__intel_fast_memset.J
__intel_fast_memcpy
__intel_new_memcpy
__intel_fast_memcpy.J
There are others, too. When I build the code without optimization, the code doesn't spend much time in them. Except that results show __powr8i4 being used 3.3% of the time. However, when I build the code with optimization, this number goes way up to about 35%. I can't seem to find out what these routines are, but they are confusing my results because I want to know where to look to optimize my code.
Most programs spend a lot of their cycles in the calling of subroutines, often library subroutines, so if you look only at exclusive (self) time, you will see what you are seeing.
So point 1 is look at inclusive (self plus callees) time.
Now, if the profiler is a "CPU profiler", it will probably be blind to I/O time. That means your program might be spending most of its time reading or writing, but the profiler will give you no clue about that.
So point 2 is use a profiler that works on "wall clock" time, not "CPU" time, unless you are sure you are not doing much I/O. (Sometimes you think you're not doing I/O, but in some subroutine several layers deep, guess what - it's doing I/O.)
Many profilers try to produce a call-graph, and if your program does not contain recursion, and if the profiler has access to all the routines in your code, that can be helpful in identifying the subroutine calls in your code that account for a lot of time.
However, if routine A is large and calls B in several places, the profiler won't tell you which lines of code to look at.
Point 3 is use a profiler that gives you line-level inclusive time percentage, if possible.
(Percentage is the most useful number, because that tells you how much overall time you would save if you could somehow remove that line of code. Also, it is not much affected by competing processes in the system.)
One example of such a profiler is Zoom.
It may be that after you do all this, you don't see much you could do to speed up the code.
However, if you could see how certain properties of the data might affect performance, you might find there were further speedups you could get. Profilers are unable to look at data.
What I do is randomly sample the state of the program under the debugger, and see if I can really understand what it is doing at each sample.
You can find things that way that you can't find any other way.
(Some people say this is not accurate, but it is accurate - about what matters. What matters is what the problem is, not precisely how much it costs.)
And that is point 4.
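To illustrate point 1, here is a small sketch of how inclusive time relates to self time in a call tree. The Routine struct and the numbers are made up; a real profiler builds this from its samples.

```cpp
#include <string>
#include <vector>

struct Routine {
    std::string name;
    double self_seconds;           // time in the routine's own instructions
    std::vector<Routine> callees;  // routines it calls
};

// Inclusive time = self time plus the inclusive time of everything called.
inline double InclusiveSeconds(const Routine& r) {
    double total = r.self_seconds;
    for (const Routine& callee : r.callees) {
        total += InclusiveSeconds(callee);
    }
    return total;
}
```

If a hypothetical my_solver has 0.5 seconds of self time and calls a library routine with 3.5 seconds of self time, a self-time listing puts the library routine on top, while my_solver's inclusive time of 4.0 seconds is what tells you which of your own routines to look at.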

Using gprof with sockets

I have a program I want to profile with gprof. The problem (seemingly) is that it uses sockets. So I get things like this:
::select(): Interrupted system call
I hit this problem a while back, gave up, and moved on. But I would really like to be able to profile my code, using gprof if possible. What can I do? Is there a gprof option I'm missing? A socket option? Is gprof totally useless in the presence of these types of system calls? If so, is there a viable alternative?
EDIT: Platform:
Linux 2.6 (x64)
GCC 4.4.1
gprof 2.19
The socket code needs to handle interrupted system calls regardless of the profiler, but under a profiler it's unavoidable. This means having code like:
if ( errno == EINTR ) { ...
after each system call.
Take a look, for example, here for the background.
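A common way to package that EINTR check is a small retry wrapper. The sketch below assumes a POSIX system; retry_on_eintr and wait_readable are made-up names, and only ::select, errno and EINTR come from the platform.

```cpp
#include <cerrno>
#include <sys/select.h>

// Hypothetical helper: repeat a call for as long as it fails with EINTR.
template <class Call>
auto retry_on_eintr(Call call) -> decltype(call()) {
    decltype(call()) result;
    do {
        result = call();
    } while (result == -1 && errno == EINTR);
    return result;
}

// Example use: wait until fd is readable or the timeout expires.
// The fd_set and timeout are rebuilt on every attempt because select
// may modify both of them.
inline int wait_readable(int fd, int timeout_sec) {
    return retry_on_eintr([&] {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        timeval tv{timeout_sec, 0};
        return ::select(fd + 1, &readfds, nullptr, nullptr, &tv);
    });
}
```

One design caveat: in this sketch the timeout starts over after every interruption, so under a profiler that delivers frequent signals the total wait can exceed timeout_sec. Tracking a deadline would fix that at the cost of a bit more code.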
gprof (here's the paper) is reliable, but it was only ever intended to measure changes, and even for that, it only measures CPU-bound issues. It was never advertised to be useful for locating problems. That is an idea that other people layered on top of it.
Consider this method.
Another good option, if you don't mind spending some money, is Zoom.
Added: If I can just give you an example. Suppose you have a call-hierarchy where Main calls A some number of times, A calls B some number of times, B calls C some number of times, and C waits for some I/O with a socket or file, and that's basically all the program does. Now, further suppose that each routine calls the next one down 25% more times than it really needs to. Since 1.25^3 is about 2, that means the entire program takes twice as long to run as it really needs to.
In the first place, since all the time is spent waiting for I/O gprof will tell you nothing about how that time is spent, because it only looks at "running" time.
Second, suppose (just for argument) it did count the I/O time. It could give you a call graph, basically saying that each routine takes 100% of the time. What does that tell you? Nothing more than you already know.
However, if you take a small number of stack samples, you will see on every one of them the lines of code where each routine calls the next.
In other words, it's not just giving you a rough percentage time estimate, it is pointing you at specific lines of code that are costly.
You can look at each line of code and ask if there is a way to do it fewer times. Assuming you do this, you will get the factor of 2 speedup.
People get big factors this way. In my experience, the number of call levels can easily be 30 or more. Every call seems necessary, until you ask if it can be avoided. Even small numbers of avoidable calls can have a huge effect over that many layers.
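The compounding in the example above is easy to check. In this sketch CompoundedSlowdown is a made-up name, and it assumes every layer has the same excess, which real programs won't.

```cpp
#include <cmath>

// If each of `levels` call layers makes (1 + excess) times as many calls
// as it needs to, the overall run time is multiplied by this factor.
inline double CompoundedSlowdown(double excess, int levels) {
    return std::pow(1.0 + excess, levels);
}
```

CompoundedSlowdown(0.25, 3) is 1.953125, the "about 2" from the example. With 30 layers, even 5% of avoidable calls per layer gives CompoundedSlowdown(0.05, 30), which is a bit over 4.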

How to get the call graph of a program with a bit of profiling information

I want to understand how a C++ program that was given to me works, and where it spends the most time.
For that I tried to use first gprof and then gprof2dot to get the pictures, but the results are sometimes kind of ugly.
How do you usually do this? Can you recommend any better alternatives?
P.S. Which are the open source solutions (preferably for Linux or Mac OS X)?
OProfile on Linux works fairly well; actually, I like it better than gprof. There are a couple of graphical tools that help visualize OProfile output.
You can try KCachegrind. This is a program that visualizes samples acquired by the Valgrind tool called Callgrind. KCachegrind may seem to be not actively maintained, but the graphs it produces are very useful.
In my opinion there are two alternatives (on Windows):
Profilers that change the assembly instructions of your applications (that's called instrumenting) and record every detail. These profilers tend to be slow (applications running about 10 times slower), sometimes hard to set up, and often not free, but they give you the best performance-related information. Look for "Rational Quantify", "AQTime" and "Performance Validator" if you want a profiler of this type.
Profilers that don't instrument the application, but just look at a running application and collect 'samples' of it. These profilers are fast (no performance loss), often easy to set up, and you can find quite some free alternatives. Look for "Very Sleepy" and "Luke Stackwalker" if you want a profiler of this type.
Although I used commercial profilers like Rational Quantify and AQTime in the past, and was very satisfied with the results, I found that the disadvantages (hard to set up, unexplainable crashes, slow performance) outweighed the advantages.
Therefore I switched to the free alternatives and I am mainly using "Very Sleepy" at this moment.
If you want to look at the structure of your application (who calls what, references, call trees, ...) look at "Understand for C/C++". This application investigates your source code and allows you to query almost everything from the application's structure.
See the SD C++ Profiler.
Other answers here suggest that probe-oriented profilers have high overhead (10x). This one does not.
Same answer as ---
EDIT: @Steve suggested I give a less pithy answer.
I hear this all the time - "I want to find out where my program spends its time".
Let me suggest an alternate phrasing - "I want to find out why my program spends its time".
Maybe the difference isn't obvious.
When a program executes an instruction, the reason why it is doing so is encoded in the entire state of the program, including the call stack.
Looking only at the program counter is like trying to see if a taxi ride is necessary by profiling the rotation angle of its wheels.
You need to look at the whole state of the program.
There's another myth I hear all the time - that you need to measure the execution time of methods, to find the "slow" ones.
There are many more ways for programs to take more time than they need to than, say, doing a linear search instead of a binary search in some method, which might be the kind of thing people have in mind.
The way to think about it is this:
There isn't just one thing taking more time than necessary. There probably are several.
Each thing taking time is taking some fraction, like 10%, 50%, 90% or some such number. That means if the wall clock could be stopped during that time, that is how much less time the overall app could take.
You want to find out what those things are, whatever they are. Profilers (samplers) work by taking a lot of shallow samples (PC or call stack) and summarizing them to get measurements. But measurements are not what you need. What you need is finding out what it's doing, from a time perspective. It's better to get a small number of samples, like 10 or 20, and examine (not summarize) them. If some activity takes 20%, 50%, or 90% of the time, then that is the probability you will catch it in the act on each sample, so that is roughly the percent of samples on which you will see it. The important thing is finding out what it is, not getting an accurate measurement of something irrelevant.
So as a way to see what the program is doing, from a time perspective, here's how many people do it.