Best way to do loop profiling - C++

I have to modify a C/C++ program that has a lot of loops inside one function; I need to add CUDA functions to it.
Before I start making changes, I wanted to measure the time spent in each of those loops, but I couldn't find any profiling tool that does exactly that. What is the best way to do this? I am on Linux. If you have any solutions, let me know.
Here is an example of a tool that does exactly what I want, but I haven't been able to find it or anything like it: http://carbon.ucdenver.edu/~dconnors/papers/wbia06-loopprof.pdf

I would use gperftools and figure out where the code is spending most of its time. Once you have identified a function or part of a function, you're probably done. Understanding exactly which instructions are the "heaviest" in a function requires a long-running test case for that particular loop, so that the profiler can gather sufficient data for each instruction (or at least most instructions) in it. But profiling down to individual instructions is probably not relevant if you are looking to replace the code with another technology: it is unlikely that replacing one loop of a few lines of code will help much, since there would be too much overhead. Instead, you want to take a larger block and move it across to CUDA.
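For example, here is a minimal sketch of wrapping the loop-heavy function with gperftools' CPU-profiler API, assuming libprofiler is installed and you link with -lprofiler (the function and file names below are just placeholders):

#include <gperftools/profiler.h>

// Hypothetical stand-in for the function full of loops.
void loop_heavy_function()
{
    volatile double sum = 0.0;
    for (int i = 0; i < 10000; ++i)
        for (int j = 0; j < 10000; ++j)
            sum += i * 0.5 + j;
}

int main()
{
    ProfilerStart("loops.prof");   // begin CPU sampling, writing to loops.prof
    loop_heavy_function();
    ProfilerStop();                // stop sampling and flush the profile
    // Inspect the result with: pprof --text ./a.out loops.prof
    return 0;
}

Alternatively, you can skip the API calls entirely: link with -lprofiler and set the CPUPROFILE environment variable when you run the program to profile the whole run.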

Related

Simple profiling of single C++ function in Windows

There are times, particularly when I'm writing a new function, where I would like to profile a single piece of code but a full profile run is not really necessary, and possibly too slow.
I'm using VS 2008 and have used the AMD profiler on C++ with good results, but I'm looking for something a little more lightweight.
What tools do you use to profile single functions? Perhaps something like a macro that gets compiled out when you're not in DEBUG mode. I could write my own, but I wanted to know if there is anything built in that I'm missing. I was thinking of something like:
void FunctionToTest()
{
PROFILE_ENTER("FunctionToTest")
// Do some stuff
PROFILE_EXIT()
}
Which would simply print in the debug output window how long the function took.
If I want to get maximum speed from a particular function, I wrap it in a nice long-running loop and use this technique.
I really don't care too much about the time it takes. That's just the result.
What I really need to know is what must I do to make it take less time.
See the difference?
After finding and fixing the speed bugs, when the outer loop is removed, it flies.
Also I don't follow the orthodoxy of only tuning optimized code, because that assumes the code is already nearly as tight as possible.
In fact, in programs of any significant size, there is usually stupid stuff going on, like calling a subfunction over and over with the same arguments, or new-ing objects repeatedly, when prior copies could be re-used.
The compiler's optimizer might be able to clean up some of those problems, but I need to clean up every one, because the ones left behind will dominate.
What the optimizer can do is make them harder to find, by scrambling the code.
When I've gotten all the stupid stuff out (making it much faster), then I turn on the optimizer.
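To make that "stupid stuff" concrete, here is a hypothetical before/after sketch (not from the answer itself) of the two patterns mentioned above: a subfunction called over and over with the same arguments, and an object new-ed on every pass when one copy could be reused.

#include <cmath>
#include <string>
#include <vector>

// Hypothetical stand-in for a costly subroutine.
double expensive_lookup(const std::string& key)
{
    double h = 0.0;
    for (char c : key)
        h += std::sqrt(static_cast<double>(c) + 128.0);
    return h;
}

// Slow: a fresh allocation and a call with identical arguments, every iteration.
double slow_version(const std::vector<double>& xs, const std::string& key)
{
    double sum = 0.0;
    for (double x : xs) {
        std::string* label = new std::string(key);   // new-ed every pass
        sum += x * expensive_lookup(*label);         // same arguments every pass
        delete label;
    }
    return sum;
}

// Fast: the lookup is hoisted out and no per-iteration allocation remains.
double fast_version(const std::vector<double>& xs, const std::string& key)
{
    const double factor = expensive_lookup(key);     // computed once
    double sum = 0.0;
    for (double x : xs)
        sum += x * factor;
    return sum;
}

A stack sample landing inside expensive_lookup with slow_version one level up is exactly the kind of thing the pausing technique exposes.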
You might think "Well I would never put stupid stuff in my code."
Right.
And you'd never put in bugs either.
None of us try to make mistakes, but we all do, if we're working.
This code by Jeff Preshing should do the trick:
http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis
Measure the time - which the code in the link does too - using either clock() or one of the OS-provided high-resolution timers. With C++11 you can use the timers from the <chrono> header.
Please note that you should always measure a Release build, not a Debug build, to get meaningful timings.
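For reference, here is a minimal sketch of the PROFILE_ENTER/PROFILE_EXIT idea from the question, built on the <chrono> timers mentioned above; the macro bodies and the ENABLE_PROFILING switch are illustrative, not a built-in facility.

#include <chrono>
#include <cstdio>

// The macros vanish completely unless ENABLE_PROFILING is defined.
#ifdef ENABLE_PROFILING
#define PROFILE_ENTER(name) \
    { const char* profile_name_ = (name); \
      const auto profile_start_ = std::chrono::steady_clock::now();
#define PROFILE_EXIT() \
      const auto profile_end_ = std::chrono::steady_clock::now(); \
      std::fprintf(stderr, "%s took %lld us\n", profile_name_, \
          static_cast<long long>(std::chrono::duration_cast< \
              std::chrono::microseconds>(profile_end_ - profile_start_).count())); }
#else
#define PROFILE_ENTER(name)
#define PROFILE_EXIT()
#endif

void FunctionToTest()
{
    PROFILE_ENTER("FunctionToTest")
    // Do some stuff
    PROFILE_EXIT()
}

This prints to stderr; on Windows you could route the message through OutputDebugString instead to get it into the debug output window.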

C++ Code profiling/analysis for Mac and MPI

I am looking for a code analysis/profiling tool for C++ on macOS. I know there have been posts on this topic before, but my use case is quite specific, so maybe someone can give me slightly more specific advice.
Here is my problem: I am writing scientific code (a master's project) in C++, so it's a pure console application with no interactivity. The code is supposed to run on massively parallel computers, so I use MPI. However, right now I am not optimizing for scalability yet, only for single-core performance. Since I do not want to rewrite the whole program as a serial one, I simply run MPI with a single process. It works fine, but the profiler obviously needs to be able to cope with this.
What do I want to analyze? The code is not very complex, in the sense that it has a very simple structure, so all I need is a list of how long the program spends in each function, so that I know where it loses the most time and can measure the speedup from my optimizations.
Thanks for any ideas.
You should use Instruments.app which includes a CPU sampler and thread activity viewer... among other things. (Choose "Product > Profile..." in Xcode)
If you want something more fine-grained, you could instrument your code. Coincidentally, I wrote a set of profiling macros just for such an occasion :)
https://github.com/nielsbot/Profiler
This will show a nice nested print out of time spent in instrumented routines. (instructions on that page)
Did you try KCachegrind (http://kcachegrind.sourceforge.net/html/Home.html) together with Valgrind?
I can recommend http://www.scalasca.org/ . You can also use it for the parallel performance analysis afterwards.
Don't look for "slow functions" and don't look to measure the time used by different pieces. Those concepts are so indirect as to be almost useless for telling you what to optimize.
Instead, take some stroboscopic X-rays, on wall-clock time, of what the entire program is doing, and study each one to see why the program is spending that instant of time.
The reason this works better is it's not looking with function-colored glasses. It's looking with purpose-colored glasses, and you can tell if the program needs to be doing what it's doing.
It's very accurate about locating big problems.
It is not accurate about measuring them, nor does it need to be.
What happens when you just do measuring is this: You get a bunch of numbers on a bunch of routines. You look at them and say "what does it mean?".
If it doesn't tell you what you should fix, you pat yourself on the back and say the program must be optimal.
In fact, there probably is something you could fix, but you couldn't figure it out from the profiler.
What's more, if you do find it and fix it, it can expose other things you can fix for even greater speedup.
That's what random pausing is about.

Any way to handle "predictable branches" faster?

I have some code with two or three branches whose direction you cannot know in advance, but once they have been hit the first time, it is either 100% certain, or close to it, that the same path will be taken again. I have noticed that using __builtin_expect doesn't do much in terms of avoiding branch misses. And even though branch prediction does a good job when my function is called repeatedly in a short time span, as soon as there is other stuff going on between calls to my function, performance degrades substantially. What are some ways around this, or what techniques could I look into? Is there any way to somehow "tag" these branches so they are predicted correctly when they are reached again after the code has wandered elsewhere?
You could use templates to generate a different version of the function for each code path, then use a function pointer to select one at runtime when you find out which way the condition goes.
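A hedged sketch of that suggestion (the function names and the arithmetic are purely illustrative): each specialization bakes the branch outcome in as a compile-time constant, so the unpredictable branch disappears from the loop, and one function-pointer assignment selects the right version once the outcome is known.

#include <cstddef>

// One instantiation per branch outcome; the 'if' below tests a compile-time
// constant, so each specialization contains only one of the two paths.
template <bool BranchTaken>
void process_impl(const float* in, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        if (BranchTaken)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }
}

using ProcessFn = void (*)(const float*, float*, std::size_t);

// Call this once, the first time the real condition is evaluated; afterwards
// every call goes straight through the pointer.
inline ProcessFn pick_process(bool branch_taken)
{
    return branch_taken ? &process_impl<true> : &process_impl<false>;
}

The indirect call through the pointer still has to be predicted, but it sits outside the hot loop, and the loop body itself no longer contains a data-dependent branch.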
The branch predictor and compiler intrinsics are all you've got. At best, you can look at the assembly and try to hand-roll some optimization yourself, but you won't find much.

Profile optimised C++/C code

I have some heavily templated C++ code that I am working with. I can compile and profile it with the AMD tools and Sleepy in a debug build. However, without optimisation most of the time is concentrated in the templated code and the STL, and with an optimised build all the profiling tools I know of produce garbage information. Does anybody know a good way to profile optimised native code?
PS1:
The code that I am writing is also heavily templated. Most of the time spent in the unoptimised code would be optimised away, so having 96-97% of the run time spent in templated code without optimisation corrupts the accuracy of the profile. And yes, I can change a lot of the templated code, or at least work out which part of it is introducing the most trouble and do better in those places.
You should focus on the code you wrote, because that is what you can change. Time spent in the STL is irrelevant; just ignore it and focus on the callers of that code. If too much time is spent in the STL, you can probably call some other STL primitive instead of the current one.
Profiling unoptimised code is less interesting, but you can still get some information from it. If the algorithms used in some parts of the code are totally flawed, it will show up even there. But you should be able to get useful information from any good profiling tool on optimised code. Which tools do you use exactly, and why do you call their output garbage?
Also, it is usually easy enough to instrument your code by hand and find out exactly which parts are efficient and which are not. It's just a matter of calling timer functions (or reading the processor's cycle counter, if possible) at well-chosen points. I usually do that from unit tests to get reproducible results, but it all depends on the specifics of your program.
Tools or instrumenting code are the easy part of optimization. The hard part is finding ways to get faster code where it's needed.
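As a concrete (hypothetical) example of that hand instrumentation: read a timer around a well-chosen block and repeat the block enough times that the timer overhead becomes negligible.

#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the region being measured.
void hot_path(double* acc)
{
    for (int i = 0; i < 100000; ++i)
        *acc += i * 0.001;
}

int main()
{
    double acc = 0.0;
    const int repeats = 1000;     // repeat so the region dominates timer overhead
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        hot_path(&acc);
    const auto t1 = std::chrono::steady_clock::now();
    const double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("hot_path: %.4f ms per call (checksum %.1f)\n", total_ms / repeats, acc);
    return 0;
}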
What do you mean by "garbage information"?
Profiling is only really meaningful on optimized builds, so tools are designed to work with them -- thus if you're getting meaningless results, it's probably due to the profiler not finding the right symbols, or needing to instrument the build.
In the case of Intel VTune, for example, I found I got impossible results from the sampler unless I explicitly told it where to find the PDBs for the executable I was tuning. In the instrumented version, I had to fiddle with the settings until it was reliably putting probes into the function calls.
When @kriss says
You should focus on the code you wrote
because that is what you can change
that's exactly what I was going to say.
I would add that in my opinion it is easier to do performance tuning first on code compiled without optimization, and then later turn on the optimizer, for the same reason. If something you can fix is costing excess time, it will cost proportionally excess time regardless of what the compiler does, and it's easier to find it in code that hasn't been scrambled.
I don't look for such code by measuring time. If the excess time is, say, 20%, then what I do is randomly pause it several times. As soon as I see something that can obviously be improved on 2 or more samples, then I've found it. It's an oddball method, but it doesn't really miss anything. I do measure the overall time before and after to see how much I saved. This can be done multiple times until you can't find anything to fix. (BTW, if you're on Linux, Zoom is a more automated way to do this.)
Then you can turn on the optimizer and see how much it gives you, but when you see what changes you made, you can see there's really no way the compiler could have done it for you.

How do you find the least optimized parts of a program?

Are there any tools that give some sort of histogram of where most of the execution time of the program is spent?
This is for a project using c++ in visual studio 2008.
The name you're after is a profiler. Try "Find Application Bottlenecks with Visual Studio Profiler".
You need a profiler.
Visual Studio Team edition includes a profiler (which is what you are looking for) but you may only have access to the Professional or Express editions. Take a look at these threads for alternatives:
What's your favorite profiling tool (for C++)
What are some good profilers for native C++ on Windows?
You really shouldn't optimize ANY parts of your application until you've measured how long they take to run. Otherwise you may be directing effort in the wrong place, and you may be making things worse, not better.
I have used a profiler called "AQ Time", which gives every detail you want to know about the performance of your code. It's not free, though.
You could get a histogram of the program counter, but it is practically useless unless you are doing something dumb like spending time in a bubble sort of a big array of ints or doubles.
If you do something as simple as a bubble sort of an array of strings, the PC histogram will only tell you that you have a hotspot in the string compare routine.
That's not much help, is it?
I know you wouldn't do such a bubble sort, but just for fun, let's assume you did, and it was taking 90% of your time. (i.e. if you fixed it, it could go up to 10 times faster.)
It's actually a very easy thing to find, because if you just hit the pause button in the debugger, you will almost certainly see that it stops in the string compare routine. Then if you look up the stack one level, you will be looking directly at the bubble sort loop which is your bug. If you're not sure you've really spotted the problem, just pause it several times. The number of times you see the problem tells you how costly it is.
Any line of code that appears on the call stack on multiple pauses is begging you to fix it. Some you can't, like "call _main", but if you can, you will get a good speedup, guaranteed.
Then do it again, and again.
When you run out of things you can fix, then you've really tuned the program within an inch of its life.
It's that simple.
You could also use the profiler in Visual Studio. It is a nice tool but be aware of these shortcomings:
Confusing you with "exclusive time", which if you concentrate on line-level information, is almost meaningless.
If your program is wasting time doing I/O, it won't see that, because when it stops to do I/O, the samples stop, unless you use instrumentation.
But if you use instrumentation, you won't get line-level information, only function-level. That's OK if your functions are all small.
Confusing you with the "call tree". What matters for a line of code is how many stack samples it is on. If it is in many branches of the call tree, the call tree won't show you what it really costs.
If it tells you a line is costly, it cannot tell you why. For that you want to see as much state information on each sample as you need, rather than just summaries.
It's hard to tell it when you want to do samples, and when not. You want it to sample when you're waiting for the app, not when it's waiting for you.
So now that you know you need a profiler, you might not have the Visual Studio one, so Very Sleepy might be of help.