Visual Studio 2008 Profiler - Instrumented produces strange results - c++

I run the Visual Studio 2008 profiler on a "RelDebug" build of my app. Optimizations are on, but inlining is only moderate, stack frames are present, and symbols are emitted. In other words, RelDebug is a somewhat optimized build that can still be debugged (although the usual Release caveats about inspecting variables apply).
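For reference, a configuration like that can be approximated with MSVC switches along these lines (a sketch only; the file name is a placeholder and the exact project settings will differ):

    rem Sketch only: one way to approximate a "RelDebug" configuration with MSVC
    rem   /O2  optimize for speed          /Ob1 expand only explicit inlines
    rem   /Oy- keep stack frame pointers   /Zi  emit debug symbols
    cl /O2 /Ob1 /Oy- /Zi myapp.cpp /link /DEBUG /OPT:REF,ICF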
I run both the Sampling and the Instrumented profilers, on separate runs.
Result? The Sampling profiler produces results that look reasonable. However, when I look at the Instrumented profiler results, I see functions that should not even be near the top of the list coming out at or near the top.
For example, a function like "SetFont" that consists of only one line assigning the height to a class member, or "SetClipRect" that merely assigns a rectangle.
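For illustration, the functions in question are of roughly this shape (hypothetical class and member names, reconstructed from the description):

    #include <windows.h>

    // Hypothetical one-line setters of the kind that showed up near the
    // top of the instrumented (exclusive) results; each body is a single
    // assignment to a class member.
    class Canvas
    {
    public:
        void SetFont(int height)        { m_fontHeight = height; }
        void SetClipRect(const RECT& r) { m_clipRect = r; }
    private:
        int  m_fontHeight;
        RECT m_clipRect;
    };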
Of course I am looking at "Exclusive" stats (i.e. minus children).
Has this happened to anyone else? It always seems to start once my application has grown to a certain size, and it makes the instrumented profiler useless at that point.
I figured out the problem. Both the Visual Studio 2008 and the Visual Studio 2010 profilers are mediocre (to put it politely). I bought Intel C++ Studio which comes with vTune Amplifier (a profiler). Using the Intel profiler on the exact same code I was able to get profiler results that actually made sense.

You say "of course you are looking at Exclusive". Look at inclusive stats. In all but the simplest programs or algorithms, nearly all the time is spent in subroutines and functions, so if you've got a performance problem, it most likely consists of calls you didn't know were time-hogs.
The method I rely on is random pausing under the debugger (described further below). Assuming you are trying to find out what you could fix to make the code faster, it will find it, while not wasting your time with high-precision statistics about things that are not problems.

There's no bug. Sampling cannot tell you how much time you spend per call; the profiler is just counting how many times the sampling timer landed in that specific function. Since SetFont is not called frequently, you don't get many hits in it, and you get the impression that the function is not time consuming.
On the other hand, when you run instrumentation, the profiler counts every call and measures the execution time of every function. That is why you get accurate information about each function's CPU consumption.
When examining instrumentation results you must always look at the number of calls as well. Since SetFont is more or less an API call, it doesn't matter much whether you look at exclusive or inclusive time; the only thing that matters is its overall time and how frequently it is called.
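To see why call frequency dominates for such functions, consider what instrumentation conceptually turns a one-line setter into (the probe names below are made up; the real probes are injected by the profiler, not written by hand):

    // Conceptual sketch only. An instrumenting profiler brackets every
    // instrumented function with enter/exit probes; for a one-line setter
    // the probes can easily cost more than the body, so its measured
    // exclusive time per call is inflated.
    static int g_fontHeight = 0;

    static void profile_enter(int /*funcId*/) { /* timestamp + bump call count */ }
    static void profile_exit(int /*funcId*/)  { /* timestamp + accumulate elapsed */ }

    void SetFont(int height)
    {
        profile_enter(42);      // injected probe (hypothetical)
        g_fontHeight = height;  // the actual work: a single store
        profile_exit(42);       // injected probe (hypothetical)
    }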

Related

Sampling vs timing profilers: Massive difference in code that uses external APIs

I have a program that contains a function fast_write() that is entirely a wrapper around WriteConsole() (Windows console output). The program generates a lot of console output and writes it out. When profiling with the Visual Studio profiler (a sampling profiler) and Tracy (a timing profiler), the results are massively different.
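For context, a wrapper of that shape might look roughly like this (a reconstruction from the description, not the actual code):

    #include <windows.h>
    #include <string>

    // Hypothetical reconstruction of fast_write(): a thin wrapper around
    // WriteConsole, so nearly all of its wall-clock time is spent inside
    // the console host rather than in code the program itself owns.
    void fast_write(const std::string& text)
    {
        HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD written = 0;
        WriteConsoleA(out, text.data(), static_cast<DWORD>(text.size()),
                      &written, NULL);
    }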
Tracy shows that function taking the majority of time (~82%) while VS clocks it at 6% or so.
How can that be? I am aware that (if Tracy is right) most samples won't hit "my code", and that made me realize I don't really understand how sampling profilers map such samples to code. But I would have assumed that if VS shows a percentage, that percentage is right.
When digging deeper with Tracy, I can see that during fast_print(), conhost.exe takes up most of the time. That further supports the theory that Tracy is right.
If that is correct - does that mean that sampling profilers just try to guess which code a sample relates to? That would mean they are massively inaccurate for code that calls external code. I can hardly believe that.
Note that I also tried loading symbols from the symbol server, but it didn't change sampling counts in VS.
edit: I have manually confirmed that fast_print() actually takes about 80% of the time, just by calling it twice and doing some math. That still leaves me with a massive question: they can't do that, right? They can't just mis-attribute those samples without any indication?!
edit: VTune also reports the same results as Tracy. So it's only VS messing up.

To what extent do DLLs speed up calculations in code such as loops?

I have code with a loop that counts to 10000000000, and within that loop I do some calculations with conditional operators (if etc). It takes about 5 minutes to reach that number. So, my question is: can I reduce the time it takes by creating a DLL and calling that DLL for functions that do the calculation and return the values to the main program? Will it make a difference in the time it takes to do the calculations? Further, will it improve the overall efficiency of the program?
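Purely to make the shape of that workload concrete, a native C++ equivalent of such a loop might look roughly like this (the arithmetic inside is made up; the asker's actual code is .NET):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        // Sketch of the described workload: count to 10,000,000,000 and do
        // some conditional arithmetic on every iteration (details invented).
        std::int64_t hits = 0;
        for (std::int64_t i = 0; i < 10000000000LL; ++i)
        {
            if ((i & 1) == 0)
                hits += i % 7;
        }
        std::printf("%lld\n", static_cast<long long>(hits));
        return 0;
    }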
By a "dll" I assume you mean going from managed .NET code to un-managed "native" compiled code. Yes, this approach can help.
It very much depends. Remember, the base loop likely costs only about 25 seconds on a typical i3 (that is the cost and overhead of looping to 10 billion while doing not much else).
And I assume you went to the project, then Compile. On that screen select Advanced Compile Options. There you want to check "Remove integer overflow checks". Make sure your loop variables are integers for speed.
At that point the “base” loop that does nothing will drop from about 20 seconds down to about 6 seconds.
So that is the base loop speed – now it comes down to what we are doing inside of that loop.
At this point, .NET DOES HAVE a JIT (a just-in-time native compiler). This means your source code goes to "CLR" code, and in turn that code gets compiled down to native x86 assembly code. So this "does" get the source code down to REAL machine-code level. However, a JIT is certainly NOT as efficient, nor can it spend "time" optimizing the code, since the JIT has to work on the "fly" without you noticing it. So C++ (or VB6, which runs as fast as C++ when natively compiled) can certainly run faster, but the question then is by how much?
The optimizing compiler might get you another doubling in speed for the actual LOOPING code, etc.
However, in BOTH cases (using .NET managed code, or code compiled down to native Intel code), they are BOTH likely calling the SAME routines to do the math!
In other words, if 80% of the time is spent in "library" code that does the math etc., then calling such code from C++ or calling such code from .NET will make VERY LITTLE difference, since the BULK of the work is spent in the same system code!
The above concept is really “supervisor” mode vs. your application mode.
In other words, when the bulk of the time is spent in system "library" code rather than your own code, the heavy lifting is occurring in supervisor code. That means jumping from .NET to native C++/VB6 DLLs will NOT yield much in the way of performance.
So I would first ensure that loops and array index references in your code are integer types. The above tip of removing the overflow checks will likely get you "close" to the speed of using a .dll. Worse, the time to "shuffle" the data to and from that external .dll routine will often cost you MORE than the time saved on the processing side.
And if your routines are doing database or file I/O, then all bets are off, as that is a VERY different problem.
So I would first test/try your application with [x] Remove integer overflow checks enabled (i.e. the overflow checks turned off). And make sure during testing that you use Ctrl+F5 in place of F5 to run your code without DEBUGGING. The above overflow-check and optimization options will NOT show increased speed when run in debug mode.
So it is hard to tell – it really depends on how much math (especially floating-point calls) you are doing (supervisor code) vs. just moving values around in arrays. If more of the code is moving things around, then I suggest the integer optimizing above, and going to a .dll likely will not help much.
Couldn't you utilize "Parallel.ForEach" and split this huge loop into some equal pieces?
Or try to work with some BackgroundWorkers, or even Threads (more than one!), to achieve optimal CPU utilization and reduce the time spent.
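In native code the same idea - splitting the range across several threads and combining the partial results - might be sketched like this (Parallel.ForEach and BackgroundWorker are the managed equivalents mentioned above; the per-iteration arithmetic is again made up):

    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main()
    {
        const std::int64_t total = 10000000000LL;
        unsigned workers = std::thread::hardware_concurrency();
        if (workers == 0) workers = 4;                 // fallback if unknown

        std::vector<std::int64_t> partial(workers, 0); // one accumulator per thread
        std::vector<std::thread> pool;

        for (unsigned w = 0; w < workers; ++w)
        {
            pool.emplace_back([=, &partial]
            {
                // Each thread gets an equal slice of the range.
                const std::int64_t begin = total / workers * w;
                const std::int64_t end   = (w + 1 == workers) ? total
                                         : total / workers * (w + 1);
                std::int64_t local = 0;
                for (std::int64_t i = begin; i < end; ++i)
                    if ((i & 1) == 0)
                        local += i % 7;                // made-up per-iteration work
                partial[w] = local;
            });
        }
        for (std::thread& t : pool)
            t.join();

        std::int64_t hits = 0;
        for (std::int64_t p : partial)
            hits += p;
        std::printf("%lld\n", static_cast<long long>(hits));
        return 0;
    }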

Profiling code built from ifort 11.1 yields __powr8i4 routine, what is it?

I built a Fortran code with Intel 11.1. I built it with the -p option in order to produce profiling data. When I check these results, there are some routines present that aren't part of my code. I assume they were put there by Intel. They include:
__powr8i4
__intel_new_memset
__intel_fast_memset
__intel_fast_memset.J
__intel_fast_memcpy
__intel_new_memcpy
__intel_fast_memcpy.J
There are others, too. When I build the code without optimization, it doesn't spend much time in them, except that the results show __powr8i4 being used 3.3% of the time. However, when I build the code with optimization, this number goes way up to about 35%. I can't seem to find out what these routines are, but they are confusing my results because I want to know where to look to optimize my code.
Most programs spend a lot of their cycles in the calling of subroutines, often library subroutines, so if you look only at exclusive (self) time, you will see what you are seeing.
So point 1 is look at inclusive (self plus callees) time.
Now, if the profiler is a "CPU profiler", it will probably be blind to I/O time. That means your program might be spending most of its time reading or writing, but the profiler will give you no clue about that.
So point 2 is use a profiler that works on "wall clock" time, not "CPU" time, unless you are sure you are not doing much I/O. (Sometimes you think you're not doing I/O, but deep inside some layer of subroutines, guess what - it's doing I/O.)
Many profilers try to produce a call-graph, and if your program does not contain recursion, and if the profiler has access to all the routines in your code, that can be helpful in identifying the subroutine calls in your code that account for a lot of time.
However, if routine A is large and calls B in several places, the profiler won't tell you which lines of code to look at.
Point 3 is use a profiler that gives you line-level inclusive time percentage, if possible.
(Percentage is the most useful number, because that tells you how much overall time you would save if you could somehow remove that line of code. Also, it is not much affected by competing processes in the system.)
One example of such a profiler is Zoom.
It may be that after you do all this, you don't see much you could do to speed up the code.
However, if you could see how certain properties of the data might affect performance, you might find there were further speedups you could get. Profilers are unable to look at data.
What I do is randomly sample the state of the program under the debugger, and see if I can really understand what it is doing at each sample.
You can find things that way that you can't find any other way.
(Some people say this is not accurate, but it is accurate - about what matters. What matters is what the problem is, not precisely how much it costs.)
And that is point 4.

Profilers Instrumenting Vs Sampling

I am doing a study comparing profilers, mainly instrumenting versus sampling ones.
I have come up with the following info:
sampling: stop the execution of the program, take the PC, and thus deduce where the program is
instrumenting: add some overhead code to the program so it increments some counters, from which you can work out where the program spends its time
If the above info is wrong, correct me.
After this I was looking at the time of execution, and some said that instrumenting takes more time than sampling! Is this correct?
If yes, why is that? In sampling you have to pay the price of context switching between processes, while in the latter you stay in the same program, at no extra cost.
Am I missing something?
cheers! =)
The interrupts generated by a sampling profiler generally add an insignificant amount of time to the total execution time, unless you have a very short sampling interval (e.g. < 1 ms).
With instrumented profiling there can be a large overhead, e.g. on small leaf functions that get called many times, as the calls to the instrumentation library can be significant compared to the execution time of the function.
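As a rough picture of why that overhead shows up, here is a hand-rolled equivalent of what instrumentation does per call (a sketch only; a real profiler injects this automatically and far more efficiently):

    #include <chrono>
    #include <cstdio>

    // Sketch of per-call instrumentation: count the call and charge the
    // elapsed time of the scope to an accumulator. For a tiny leaf function,
    // the clock reads and bookkeeping can rival the function body itself.
    static long long g_calls = 0;
    static long long g_nanos = 0;

    struct ScopeProbe
    {
        std::chrono::steady_clock::time_point start;
        ScopeProbe() : start(std::chrono::steady_clock::now()) { ++g_calls; }
        ~ScopeProbe()
        {
            g_nanos += std::chrono::duration_cast<std::chrono::nanoseconds>(
                           std::chrono::steady_clock::now() - start).count();
        }
    };

    static int tiny_leaf(int x)
    {
        ScopeProbe probe;   // the "instrumentation" for this function
        return x + 1;       // the real work
    }

    int main()
    {
        int sum = 0;
        for (int i = 0; i < 1000000; ++i)
            sum += tiny_leaf(i);
        std::printf("%d calls, %lld ns total, result %d\n",
                    (int)g_calls, g_nanos, sum);
        return 0;
    }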
It depends how conventional you want to be.
gprof does both those things you've mentioned. Here are some comments on that.
There is a school of thought that says profiling is about measuring. Measuring what? Well, anything - just measuring. Along with this goes the idea that what you want to get is a "big picture" of what's happening.
This school looks mostly at trying to find "slow functions", without clearly defining what that even means, and telling you to look there to optimize.
Another school says that you are really debugging. You want to precisely locate bugs of a certain kind - ones that don't make the program incorrect, rather they take too long. These are not big-picture things. They are very precise points in the code where something is happening that costs a lot more time than necessary.
Exactly how much more is not important. What's important is that it is located so it can be fixed.
In this viewpoint, profiling overhead is irrelevant, and so is accuracy of measurement.
What measuring is for is seeing how much time was saved.
One profiler that, I think, successfully spans both camps, is Zoom, because it samples the call stack, on wall-clock time, and presents, at the line/instruction level, percent of time on the stack. Some other profilers do this also, but most don't.
I'm in the second school, and here's an example of what you can accomplish with it.
Here's a more brief discussion of the issues.

Why does my program run way faster when I enable profiling?

I have a program that's running pretty slowly (takes like 20 seconds even on release) so, wanting to fix it, I tried to use Visual Studio's built in profiler. However, when I run the program with profiling enabled, it finishes in less than a second. This makes it very difficult to find a bottleneck. I would post the code but it is long. Are there any obvious or not so obvious reasons why this would be happening?
EDIT:
Ok so I narrowed the problem down to a bunch of free() calls. When I comment them out, the program runs in the same amount of time that it does with profiling enabled. But now I have a memory leak :-/
The reason is because when you run your application within Visual Studio, the debugger is attached to it. When you run it using the profiler, the debugger is not attached.
If you press F5 to run your program, even with the Release build, the debugger is still attached.
If you try running the .exe by itself, or running the program through the IDE with "Debug > Start Without Debugging" (or just press Ctrl+F5) the application should run as fast as it does with the profiler.
That sounds a lot like a Heisenbug.
They really happen, and they can be painful to uncover.
Your best solution in my experience is to change how you are profiling -- possibly several ways -- until the bug disappears.
Use different profilers. Try adding timing code instead of using a profiler.
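For the timing-code route, something as small as this around the suspect region is often enough, and it perturbs the program far less than full instrumentation (a sketch using the Windows high-resolution counter; the region being timed is a made-up stand-in):

    #include <windows.h>
    #include <cstdio>

    // Made-up stand-in for the suspected slow part of the program.
    static void run_suspect_region()
    {
        volatile long long sink = 0;
        for (int i = 0; i < 100000000; ++i)
            sink += i;
    }

    int main()
    {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&t0);
        run_suspect_region();
        QueryPerformanceCounter(&t1);

        const double ms = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
        std::printf("suspect region: %.2f ms\n", ms);
        return 0;
    }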
Turning on the profiler will end up moving your code around (a bit), which is probably masking the problem.
The most common cause of Heisenbugs is uninitialized variables; the second most common cause is using memory after it has been free()d. Since removing your free() calls seems to fix it, you might be tempted to look for late references, but I would still look for uninitialized variables first if I were you.
In my case it was due to the Windows Timer Resolution.
If your program uses threading, the system-wide timer resolution may be the reason for longer times when running through Visual Studio.
The default Windows timer resolution is 15.6 ms.
When running through the profiler, the profiler sets this value to 1 ms, causing faster execution. Check out this answer.
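You can reproduce that effect without the profiler by raising the timer resolution yourself at startup, which is roughly what the profiler is doing (timeBeginPeriod/timeEndPeriod from winmm; note that it affects the whole system, so use it deliberately):

    #include <windows.h>
    #include <mmsystem.h>
    #pragma comment(lib, "winmm.lib")

    int main()
    {
        // Raise the system timer resolution to 1 ms, mimicking what the
        // profiler does; sleeps and thread wake-ups become much finer grained.
        timeBeginPeriod(1);

        // ... run the threaded workload here ...

        timeEndPeriod(1);   // always restore the previous resolution
        return 0;
    }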
The general way would be divide-and-conquer, i.e. running only parts of the program and seeing when the problem goes away. But it sounds as if you already did that. AFAIK free() usually doesn't take much time, but malloc() can take a lot of time if memory is fragmented. If you don't call free(), the heap never gets fragmented in the first place. (Intrusive profiling code might prevent memory fragmentation by allocating small data blocks and filling the free gaps - but I admit that's a bit of a weak explanation.)
Maybe you can add manual time measurement calls before/after the calls to malloc and new and print out the times to verify that? Maybe you can also analyze your memory allocation patterns to find out if you have a heap fragmentation problem (probably by looking at the code and doing some symbolic debugging in your head ;-)
Use a non-intrusive sample profiler instead of an intrusive instrumented profiler.
It could be due to a few optimizations not being performed by the compiler when you run it in profiling mode. So, I suggest you check the parameters being passed and check the compiler documentation.