Simple profiling of single C++ function in Windows - c++

There are times, particularly when I'm writing a new function, where I would like to profile a single piece of code but a full profile run is not really necessary, and possibly too slow.
I'm using VS 2008 and have used the AMD profiler on C++ with good results, but I'm looking for something a little more lightweight.
What tools do you use to profile single functions? Perhaps something that is a macro which gets excluded when your not in DEBUG mode. I could write my own, but I wanted to know if there were any built in that I'm missing. I was thinking something like:
void FunctionToTest()
{
PROFILE_ENTER("FunctionToTest")
// Do some stuff
PROFILE_EXIT()
}
Which would simply print in the debug output window how long the function took.

If I want to get maximum speed from a particular function, I wrap it in a nice long-running loop and use this technique.
I really don't care too much about the time it takes. That's just the result.
What I really need to know is what must I do to make it take less time.
See the difference?
After finding and fixing the speed bugs, when the outer loop is removed, it flies.
Also I don't follow the orthodoxy of only tuning optimized code, because that assumes the code is already nearly as tight as possible.
In fact, in programs of any significant size, there is usually stupid stuff going on, like calling a subfunction over and over with the same arguments, or new-ing objects repeatedly, when prior copies could be re-used.
The compiler's optimizer might be able to clean up some such problems,
but I need to clean up every one, because the ones left behind will dominate.
What it can do is make them harder to find, by scrambling the code.
When I've gotten all the stupid stuff out (making it much faster), then I turn on the optimizer.
You might think "Well I would never put stupid stuff in my code."
Right.
And you'd never put in bugs either.
None of us try to make mistakes, but we all do, if we're working.

This code by Jeff Preshing should do the trick:-
http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis

Measure the time - which the code in the link does too - using either clock() or one of the OS provided high resolution timers. With C++11 you can use the timers from the <chrono> header.
Please note, that you should always measure in Release build not Debug build, to get proper timings.

Related

Best way to make loop profiling

I must change a C/C++ program with a lot of loop inside one function. I must add cuda functions.
Before i start making changes, I wanted to take the time to all loops found. But i did't find any profiling programs who make exactly that. What is the best way to do that's. I on linux. if you have any solutions let me know.
here you will find an example of tool who makes exactly what i want but i haven't find it or something like that : http://carbon.ucdenver.edu/~dconnors/papers/wbia06-loopprof.pdf
I would use gperftools, and figure out where the code is spending most of it's time. Once you have identified a function or part of a function, you're probably done. Understanding exactly which instructions are the "heaviest" in a function will require a long running testcase for that particular loop, so that the profiler can get sufficient data for each instruction (or at least most instructions) in the loop. But actually profiling down to instructions is probably not relevant if you are looking to replace the code with another technology - it is unlikely that replacing one loop of a few lines of code will help much, since there'd be too much overhead. Instead, you want to take a larger block and move that across to CUDA.

C++ Code profiling/analysis for Mac and MPI

I am looking for a code analysis/profiling tool for C++ on MacOS. I know that there have been posts about this thread, but my application in which I need is very specific, so maybe one can give me a little more specific advice.
So here is my problem: I am writing a scientific code (master's project) in C++, so it's a pure console application, no interactivity given. The code is supposed to run on massively parallel computers, thus I use MPI. However, right now I am not yet optimizing for scalability, but only for singlecore performance. Since I do not want to rewrite the whole programm as a serial one, I just use MPI with 1 thread. It works fine, but the optimizer obviously needs to be able to deal with this.
What do I want to analyze? Well, the code is not very complex in a sense that it has a very simple structure and thus all I need would be a list of how long the program spends in certain functions, so that I know where it loses most time and I can measure the speedup of my optimizations.
Thanks for all ideas
You should use Instruments.app which includes a CPU sampler and thread activity viewer... among other things. (Choose "Product > Profile..." in Xcode)
If you want something more fine-grained, you could instrument your code. Coincidentally, I wrote a set of profiling macros just for such an occasion :)
https://github.com/nielsbot/Profiler
This will show a nice nested print out of time spent in instrumented routines. (instructions on that page)
Did you try kcachgrind: http://kcachegrind.sourceforge.net/html/Home.html with valgrind ?
I can recommed http://www.scalasca.org/ . You can use it also for the parallel performance afterwards.
Don't look for "slow functions" and don't look to measure the time used by different pieces. Those concepts are so indirect as to be almost useless for telling you what to optimize.
Instead, take some stroboscopic X-rays, on wall-clock time, of what the entire program is doing, and study each one to see why the program is spending that instant of time.
The reason this works better is it's not looking with function-colored glasses. It's looking with purpose-colored glasses, and you can tell if the program needs to be doing what it's doing.
It's very accurate about locating big problems.
It is not accurate about measuring them, nor does it need to be.
What happens when you just do measuring is this: You get a bunch of numbers on a bunch of routines. You look at them and say "what does it mean?".
If it doesn't tell you what you should fix, you pat yourself on the back and say the program must be optimal.
In fact, there probably is something you could fix, but you couldn't figure it out from the profiler.
What's more, if you do find it and fix it, it can expose other things you can fix for even greater speedup.
That's what random pausing is about.

Profile optimised C++/C code

I have some heavily templated c++ code that I am working with. I can compile and profile with AMD tools and sleepy in debug mode. However without optimisation most of time spent concentrated in the templated code and STL. With optimised compilation, all the profile tools that I know produce garbage information. Does anybody know a good way to profile optimised native code
PS1:
The code that I am writing is also heavily templated. Most of the time spent in the unoptimised code will be optimized away. I am talking about 96-97% of the run time are spent in templated code without optimisation. This is going to corrupt the accuracy of the profiling. And yes I can change many templated code or at least what part of the templated code is introducing the most trouble and I can do better in those places.
You should focus on the code you wrote because that is what you can change, time spent in STL is irrelevant, just ignore it and focus on the callers of that code. If too much time is spent in STL you probably can call some other STL primitive instead of the current one.
Profiling unoptimized code is less interesting, but you can still get some informations. If used algorithms from some parts of code are totally flawed it will show up even there. But you should be able to get useful informations from any good profiling tool in optimized code. What tools do you use exactly and why do you call their output garbage ?
Also it's usually easy enough to instrument your code by hand and find out exactly which parts are efficient and which are not. It's just a matter of calling timer functions (or reading cycle count of processor if possible) at well chosen points. I usually do that from unit tests to have reproducible results, but all depends of the specifics of your program.
Tools or instrumenting code are the easy part of optimization. The hard part is finding ways to get faster code where it's needed.
What do you mean by "garbage information"?
Profiling is only really meaningful on optimized builds, so tools are designed to work with them -- thus if you're getting meaningless results, it's probably due to the profiler not finding the right symbols, or needing to instrument the build.
In the case of Intel VTune, for example, I found I got impossible results from the sampler unless I explicitly told it where to find the PDBs for the executable I was tuning. In the instrumented version, I had to fiddle with the settings until it was reliably putting probes into the function calls.
When #kriss says
You should focus on the code you wrote
because that is what you can change
that's exactly what I was going to say.
I would add that in my opinion it is easier to do performance tuning first on code compiled without optimization, and then later turn on the optimizer, for the same reason. If something you can fix is costing excess time, it will cost proportionally excess time regardless of what the compiler does, and it's easier to find it in code that hasn't been scrambled.
I don't look for such code by measuring time. If the excess time is, say, 20%, then what I do is randomly pause it several times. As soon as I see something that can obviously be improved on 2 or more samples, then I've found it. It's an oddball method, but it doesn't really miss anything. I do measure the overall time before and after to see how much I saved. This can be done multiple times until you can't find anything to fix. (BTW, if you're on Linux, Zoom is a more automated way to do this.)
Then you can turn on the optimizer and see how much it gives you, but when you see what changes you made, you can see there's really no way the compiler could have done it for you.

How do you find the least optimized parts of a program?

Are there any tools to give some sort of histogram of where most of the execution time of the program is spent at?
This is for a project using c++ in visual studio 2008.
The name you're after is a profiler. Try Find Application Bottlenecks with Visual Studio Profiler
You need a profiler.
Visual Studio Team edition includes a profiler (which is what you are looking for) but you may only have access to the Professional or Express editions. Take a look at these threads for alternatives:
What's your favorite profiling tool (for C++)
What are some good profilers for native C++ on Windows?
You really shouldn't optimize ANY parts of your application until you've measured how long they take to run. Otherwise you may be directing effort in the wrong place, and you may be making things worse, not better.
I have used a profiler called "AQ Time" which gives every detail you want to know about about the performance of your code. It's not free though..
You could get a histogram of the program counter, but it is practically useless unless you are doing something dumb like spending time in a bubble sort of a big array of ints or doubles.
If you do something as simple as a bubble sort of an array of strings, the PC histogram will only tell you that you have a hotspot in the string compare routine.
That's not much help, is it?
I know you wouldn't do such a bubble sort, but just for fun, let's assume you did, and it was taking 90% of your time. (i.e. if you fixed it, it could go up to 10 times faster.)
It's actually a very easy thing to find, because if you just hit the pause button in the debugger, you will almost certainly see that it stops in the string compare routine. Then if you look up the stack one level, you will be looking directly at the bubble sort loop which is your bug. If you're not sure you've really spotted the problem, just pause it several times. The number of times you see the problem tells you how costly it is.
Any line of code that appears on the call stack on multiple pauses, is something that is begging you to fix it. Some you can't, like "call _main", but if you can you will get a good speedup, guaranteed.
Then do it again, and again.
When you run out of things you can fix, then you've really tuned the program within an inch of its life.
It's that simple.
You could also use the profiler in Visual Studio. It is a nice tool but be aware of these shortcomings:
Confusing you with "exclusive time", which if you concentrate on line-level information, is almost meaningless.
If your program is wasting time doing I/O, it won't see that, because when it stops to do I/O, the samples stop, unless you use instrumentation.
But if you use instrumentation, you won't get line-level information, only function-level. That's OK if your functions are all small.
Confusing you with the "call tree". What matters for a line of code is how many stack samples it is on. If it is in many branches of the call tree, the call tree won't show you what it really costs.
If it tells you a line is costly, it cannot tell you why. For that you want to see as much state information on each sample as you need, rather than just summaries.
It's hard to tell it when you want to do samples, and when not. You want it to sample when you're waiting for the app, not when it's waiting for you.
So now that you know you need a profiler, you might not have the Visual Studio one, so Very Sleepy might be of help.

profile-guided optimization (C)

Anyone know this compiler feature? It seems GCC support that. How does it work? What is the potential gain? In which case it's good? Inner loops?
(this question is specific, not about optimization in general, thanks)
It works by placing extra code to count the number of times each codepath is taken. When you compile a second time the compiler uses the knowledge gained about execution of your program that it could only guess at before. There are a couple things PGO can work toward:
Deciding which functions should be inlined or not depending on how often they are called.
Deciding how to place hints about which branch of an "if" statement should be predicted on based on the percentage of calls going one way or the other.
Deciding how to optimize loops based on how many iterations get taken each time that loop is called.
You never really know how much these things can help until you test it.
PGO gives about a 5% speed boost when compiling x264, the project I work on, and we have a built-in system for it (make fprofiled). Its a nice free speed boost in some cases, and probably helps more in applications that, unlike x264, are less made up of handwritten assembly.
Jason's advise is right on. The best speedups you are going to get come from "discovering" that you let an O(n2) algorithm slip into an inner loop somewhere, or that you can cache certain computations outside of expensive functions.
Compared to the micro-optimizations that PGO can trigger, these are the big winners. Once you've done that level of optimization PGO might be able to help. We never had much luck with it though - the cost of the instrumentation was such that our application become unusably slow (by several orders of magnitude).
I like using Intel VTune as a profiler primarily because it is non-invasive compared to instrumenting profilers which change behaviour too much.
The fun thing about optimization is that speed gains are found in the unlikeliest of places.
It's also the reason you need a profiler, rather than guessing where the speed problems are.
I recommend starting with a profiler (gperf if you're using GCC) and just start poking around the results of running your application through some normal operations.