Sampling vs timing profilers: Massive difference in code that uses external APIs - c++

I have a program that contains a function fast_write() that is entirely a wrapper around WriteConsole() (windows console output). The program generates a lot of console output and writes it. When profiling with the Visual Studio profiler (a sampling profiler) and Tracy (a timing profiler), the results are massively different.
Tracy shows that function taking the majority of time (~82%) while VS clocks it at 6% or so.
How can that be? I am aware that (if tracy is right) most samples won't hit "my code". And that made me realize I don't really understand how sampling profilers map such samples to code. But I would have assumed that if VS shows a percentage - that that is right.
When digging deeper with tracy, I can see that during fast_print(), conhost.exe takes takes up most of the time. That further confirms the theory that tracy is right.
If that is correct - does that mean that sampling profilers just try to guess which code a sample relates to? That would mean they are massively inaccurate for code that calls external code. I can hardly believe that.
Note that I also tried loading symbols from the symbol server, but it didn't change sampling counts in VS.
edit: I have manually confirmed that fast_print() actually takes 80% percent by just calling it twice and doing some math. That still leaves me with a massive: They can't do that, right? They can't just mis-attribute those samples without any indication?!
edit: VTune also reports same results as Tracy. So it's only VS messing up.

Related

Cycles consumed in each function through Oprofile

Oprofile works on Sampling Based theory.
Opreport -l option provides us the profiling report in the following way:
samples % image name symbol name
78149 15.0776 cvqa comp_corr.clone.2
With this information I can know the %age of time consumed in consumption. If I do some optimizaion in my code I will again get the report as:
samples % image name symbol name
73179 15.0732 cvqa comp_corr.clone.2
In this report I am not getting how much optimization of cycles has been done so that I can benchmark. How much optimization has been done till now?
Is there any way we can know how much cycles optimization has been done or any other way through which I can bench mark?
I am working on AMD64 bit machine.
Since your real goal is to optimize the program, let me suggest another way to think about it.
The main thing to measure is overall time, not cycles or times of the various routines.
Now, here's how to do optimization. Don't base it on any measurements. Rather, get a number of samples of the program's state and (this is the key point) study each sample closely enough, with your own eyes and brain, and understand what the program is doing in that state, and the full reason why it is doing it.
(You will see anything worth fixing that statistics could reveal, plus things they could not reveal, and that makes all the difference.)
As soon as you catch it in the act of doing, on two or more samples, something that could be removed, fixing it will give you a substantial speedup.
Here is an explanation of why it works and how much speedup you can expect.
After you do that, you can do the overall time measurement again and see how much time you saved.
Then don't stop. Do it again. You'll find something else to fix, which is now a bigger percent because of the first problem you removed.
In my experience, with real software, this can be done as many as 5 or 6 times, after which the program can be orders of magnitude faster than it was originally. The reason is because each optimization removes a fraction of the original execution time, and those fractions can accumulate up to nearly 100%. I'm not aware of any such result achieved with Oprofile or any other profiler tool.

Profiling code built from ifort 11.1 yields __powr8i4 routine, what is it?

I built a Fortran code with Intel 11.1. I built it with the -p option in order to produce profiling data. When I check these results, there are some routines present that aren't a part of my code. I assume they were put there by Intel. The include:
__powr8i4
__intel_new_memset
__intel_fast_memset
__intel_fast_memset.J
__intel_fast_memcpy
__intel_new_memcpy
__intel_fast_memcpy.J
There are others, too. When I build the code without optimization, the code doesn't spend much time in them. Except that results show __powr8i4 being used 3.3% of the time. However, when I build the code with optimization, this number goes way up to about 35%. I can't seem to find out what these routines are, but they are confusing my results because I want to know where to look to optimize my code.
Most programs spend a lot of their cycles in the calling of subroutines, often library subroutines, so if you look only at exclusive (self) time, you will see what you are seeing.
So point 1 is look at inclusive (self plus callees) time.
Now, if the profiler is a "CPU profiler", it will probably be blind to I/O time. That means your program might be spending most of its time reading or writing, but the profiler will give you no clue about that.
So point 2 is use a profiler that works on "wall clock" time, not "CPU" time, unless you are sure you are not doing much I/O. (Sometimes you think you're not doing I/O, but deep inside some subroutine layers deep, guess what - it's doing I/O.)
Many profilers try to produce a call-graph, and if your program does not contain recursion, and if the profiler has access to all the routines in your code, that can be helpful in identifying the subroutine calls in your code that account for a lot of time.
However, if routine A is large and calls B in several places, the profiler won't tell you which lines of code to look at.
Point 3 is use a profiler that gives you line-level inclusive time percentage, if possible.
(Percentage is the most useful number, because that tells you how much overall time you would save if you could somehow remove that line of code. Also, it is not much affected by competing processes in the system.)
One example of such a profiler is Zoom.
It may be that after you do all this, you don't see much you could do to speed up the code.
However, if you could see how certain properties of the data might affect performance, you might find there were further speedups you could get. Profilers are unable to look at data.
What I do is randomly sample the state of the program under the debugger, and see if I can really understand what it is doing at each sample.
You can find things that way that you can't find any other way.
(Some people say this is not accurate, but it is accurate - about what matters. What matters is what the problem is, not precisely how much it costs.)
And that is point 4.

Profilers Instrumenting Vs Sampling

I am doing a study to between profilers mainly instrumenting and sampling.
I have came up with the following info:
sampling: stop the execution of program, take PC and thus deduce were the program is
instrumenting: add some overhead code
to the program so it would increment
some pointers to know the program
If the above info is wrong correct me.
After this I was looking at the time of execution and some said that instrumenting takes more time than sampling! is this correct?
if yes why is that? in sampling you have to pay the price of context switching between processes while in the latter your in the same program no cost
Am i missing something?
cheers! =)
The interrupts generated by a sampling profiler generally add an insignficant amount of time to the total execution time, unless you have a very short sampling interval (e.g. < 1 ms).
With instrumented profiling there can be a large overhead, e.g. on small leaf functions that get called many times, as the calls to the instrumentation library can be significant compared to the execution time of the function.
It depends how conventional you want to be.
gprof does both those things you've mentioned. Here are some comments on that.
There is a school of thought that says profiling is about measuring. Measuring what? Well, anything - just measuring. Along with this goes the idea that what you want to get is a "big picture" of what's happening.
This school looks mostly at trying to find "slow functions", without clearly defining what that even means, and telling you to look there to optimize.
Another school says that you are really debugging. You want to precisely locate bugs of a certain kind - ones that don't make the program incorrect, rather they take too long. These are not big-picture things. They are very precise points in the code where something is happening that costs a lot more time than necessary.
Exactly how much more is not important. What's important is that it is located so it can be fixed.
In this viewpoint, profiling overhead is irrelevant, and so is accuracy of measurement.
What measuring is for is seeing how much time was saved.
One profiler that, I think, successfully spans both camps, is Zoom, because it samples the call stack, on wall-clock time, and presents, at the line/instruction level, percent of time on the stack. Some other profilers do this also, but most don't.
I'm in the second school, and here's an example of what you can accomplish with it.
Here's a more brief discussion of the issues.

How to get the call graph of a program with a bit of profiling information

I want to understand how a C++ program that was given to me works, and where it spends the most time.
For that I tried to use first gprof and then gprof2dot to get the pictures, but the results are sometimes kind of ugly.
How do you usually do this? Can you recommend any better alternatives?
P.D. Which are the open source solutions (preferably for Linux or Mac OS )X?
OProfile on Linux works fairly well, actually i like it better than GProf. There are a couple graphical tools that help visualize OProfile output.
You can try KCachegrind. This is a program that visualizes samples acquired by Valgrind tool called Callgrind. KCachegrind may seem to be not actively maintained, but the graphs he produces are very useful.
In my opinion there are two alternatives (on Windows):
Profilers that change the assembly instructions of your applications (that's called instrumenting) and record every detail. These profilers tend to be slow (applications running about 10 times slower), sometimes hard to set up, and often not-free, but they give you the best performance related information. Look for "Ration Quantity", "AQTime" and "Performance Validator" if you want a profiler of this type.
Profilers that don't instrument the application, but just look at a running application and collect 'samples' of it. These profilers are fast (no performance loss), often easy to set up, and you can find quite some free alternatives. Look for "Very Sleepy" and "Luke Stackwalker" if you want a profiler of this type.
Although I used commercial profilers like Rational Quantity and AQTime in the past, and was very satisfied with the results, I found that the disadvantages (hard to setup, unexplainable crashes, slow performance) outweighed the advantages.
Therefore I switched to the free alternatives and I am mainly using "Very Sleepy" at this moment.
If you want to look at the structure of your application (who calls what, references, call trees, ...) look at "Understand for C/C++". This application investigates your source code and allows you to query almost everything from the application's structure.
See the SD C++ Profiler.
Other answers here suggest that probe-oriented profilers have high overhead (10x). This one does not.
Same answer as ---
EDIT: #Steve suggested I give a less pithy answer.
I hear this all the time - "I want to find out where my program spends its time".
Let me suggest an alternate phrasing - "I want to find out why my program spends its time".
Maybe the difference isn't obvious.
When a program executes an instruction, the reason why it is doing so is encoded in the entire state of the program, including the call stack.
Looking only at the program counter is like trying to see if a taxi ride is necessary by profiling the rotation angle of its wheels.
You need to look at the whole state of the program.
There's another myth I hear all the time - that you need to measure the execution time of methods, to find the "slow" ones.
There are many ways for programs to take more time than they need to than by, say, doing a linear search instead of a binary search in some method, which might be the kind of thing people have in mind.
The way to think about it is this:
There isn't just one thing taking more time than necessary. There probably are several.
Each thing taking time is taking some fraction, like 10%, 50%, 90% or some such number. That means if the wall clock could be stopped during that time, that is how much less time the overall app could take.
You want to find out what those things are, whatever they are. Profilers (samplers) work by taking a lot of shallow samples (PC or call stack) and summarizing them to get measurements. But measurements are not what you need. What you need is finding out what it's doing, from a time perspective. It's better to get a small number of samples, like 10 or 20, and examine (not summarize) them. If some activity takes 20%, 50%, or 90% of the time, then that is the probability you will catch it in the act on each sample, so that is roughly the percent of samples on which you will see it. The important thing is finding out what it is, not getting an accurate measurement of something irrelevant.
So as a way to see what the program is doing, from a time perspective, here's how many people do it.

Visual Studio 2008 Profiler - Instrumented produces strange results

I run the Visual Studio 2008 profiler on a "RelDebug" build of my app. Optimizations are on, but inlining is only moderate, stack frames are present, and symbols are emitted. In other words, RelDebug is a somewhat optimized build that can be debugged (although the usual Release caveats about inspecting variables applies).
I run both the Sampling, and the Instrumented profiler on separate runs.
Result? The Sampling profiler produces a result that looks reasonable. However when I look at the Instrumented profiler results, I see functions that should not even be near the top of the list, coming out up to.
For example, a function like "SetFont" that consists of only 1 line assigning the height to a class member. Or "SetClipRect" that merely assigns a rectangle.
Of course I am looking at "Exclusive" stats (i.e. minus children).
This happen to anyone else? It always seems to happen once my application has grown to a certain size. It makes the instrumented profiler useless at that point.
I figured out the problem. Both the Visual Studio 2008 and the Visual Studio 2010 profilers are mediocre (to put it politely). I bought Intel C++ Studio which comes with vTune Amplifier (a profiler). Using the Intel profiler on the exact same code I was able to get profiler results that actually made sense.
You say "of course you are looking at Exclusive". Look at inclusive stats. In all but the simplest programs or algorithms, nearly all the time is spent in subroutines and functions, so if you've got a performance problem, it most likely consists of calls you didn't know were time-hogs.
The method I rely on is this. Assuming you are trying to find out what you could fix to make the code faster, it will find it, while not wasting your time with high-precision statistics about things that are not problems.
There's no bug. Sampling cannot tell you how much time you spent per call. Profiler is just counting how many times timer ended up in that specific function. Since SetFont is not frequently called, you don't get many hits in that function and you get impression that that function is not time consuming.
On the other hand, when you run instrumentation, profiler counts every call and measures execution time of every function. That is why you get accurate information about functions CPU consumption.
When examining instrumentation results you must always look at number of calls as well. Since SetFont is more-less API it doesn't matter if it's exclusive or inclusive. The only thing that matters is its overall time and how frequently it's called.