Time-based profiling using Clang instrumentation - c++

Clang's -fprofile-instr-generate option can record the number of times each line of code (and even parts of a line of code) is executed. There is some overhead but it is pretty minimal.
Is there a way to get Clang to do something similar, but recording the total execution time for each line of code rather than the number of times it was run?
I know there are sample-based profilers (perf, etc.) but these seem to suck - e.g. as far as I can tell they sample the call stack so you don't get line-level information.
I am ok with a significant overhead (e.g. 100%) as long as it doesn't distort the relative timings too much (+/-30% is fine).

There does seem to be something like this - it's called XRay and was developed by Google.
It doesn't go as far as line-by-line profiling, or even to the basic block level as far as I can tell. The granularity is limited to functions - but you can control exactly which functions are instrumented (by default those with more than 100 instructions) and even turn on and off the instrumentation at runtime.
It seems to be at a fairly early stage of development and only works on Linux. Looks useful nonetheless.
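A minimal sketch of choosing which functions XRay instruments, assuming Clang's `[[clang::xray_always_instrument]]` and `[[clang::xray_never_instrument]]` attributes. The function names and the macro wrappers are made up for illustration, and the build/run lines in the comments are as I recall them from the LLVM XRay documentation, so check them against your Clang version.

```cpp
#include <cstdint>

// Assumed workflow (Clang on Linux, verify against your LLVM version):
//   clang++ -fxray-instrument -fxray-instruction-threshold=1 demo.cpp
//   XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic" ./a.out
//   llvm-xray account <log file> -instr_map=./a.out

// Only Clang understands the XRay attributes; other compilers get empty
// macros so the file still builds, just without instrumentation.
#if defined(__clang__)
#define XRAY_ALWAYS [[clang::xray_always_instrument]]
#define XRAY_NEVER [[clang::xray_never_instrument]]
#else
#define XRAY_ALWAYS
#define XRAY_NEVER
#endif

// Hypothetical hot function: instrumented regardless of its size.
XRAY_ALWAYS std::uint64_t sum_of_squares(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i) total += i * i;
    return total;
}

// Hypothetical tiny helper: explicitly left uninstrumented.
XRAY_NEVER std::uint64_t twice(std::uint64_t x) { return 2 * x; }
```

Call these from your own main(); the attributes only take effect when the file is built with XRay instrumentation enabled.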
Edit: gperftools actually works very well for this (I guess I dismissed it earlier because pprof didn't use to work at all on Mac, but I fixed that). I strongly recommend the -http option of pprof: it gives you an interactive interface with source code, a call graph, a flame graph and so on.

Related

Locate the most costly methods and evaluate/profile them

I would like to know if there is a technique or a tool available that can tell you how much time is required to execute a particular method.
Big O notation in math/computer science gives you an idea of the complexity of an algorithm; I wonder if there is something similar for code analysis.
Profiling is a means of analysing a program to determine the relative amount of time spent in particular functions or methods. It is useful for discovering performance issues in a program empirically. Using GCC, for example, you can:
Compile the program with the -pg option to enable profiling.
Run the executable to produce a file called gmon.out, which contains information about the runtime characteristics of your program as it actually ran.
Run gprof to display the information generated by the instrumented executable.
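To make the three steps above concrete, here is a small sketch of a workload you could try them on. The function names are hypothetical; the commands in the comments are just the -pg / gmon.out / gprof sequence described above, with file names I picked for illustration.

```cpp
#include <cstdint>

// Assumed commands, following the steps above (file names are illustrative):
//   g++ -pg demo.cpp -o demo     (step 1: build with profiling enabled)
//   ./demo                       (step 2: writes gmon.out on exit)
//   gprof ./demo gmon.out        (step 3: print the profile)

// Deliberately quadratic: counts pairs (i, j) with i < j.
std::uint64_t slow_part(std::uint64_t n) {
    std::uint64_t pairs = 0;
    for (std::uint64_t i = 0; i < n; ++i)
        for (std::uint64_t j = i + 1; j < n; ++j)
            ++pairs;
    return pairs;
}

// Linear: sums 0 .. n-1.
std::uint64_t fast_part(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) sum += i;
    return sum;
}

// Entry point to call from main(); slow_part should dominate the profile.
std::uint64_t run_workload(std::uint64_t n) {
    return slow_part(n) + fast_part(n);
}
```

With optimisation turned on the compiler may simplify loops this trivial, so for a first experiment it is probably easiest to build without -O flags and call run_workload with a large n.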
Generally speaking, human analysis is the only way to discover the asymptotic (i.e., big-O) complexity of a particular algorithm—there is, so far as I know, no mechanical way to do it.
If you want to know how much time is spent in a function, use a so-called "profiler".
Complexity analysis is outside the scope of a profiler, though, since a profiler tells you what happens when you run a program once, whereas complexity tells you the limit behavior of what happens when you run an infinite sequence of programs with bigger and bigger input.
So: do you want to know which functions are most costly in your program (in which case find a profiler for your C++ implementation and follow its documentation), or do you want to know about time complexity (in which case you pretty much need a human to analyze your code)?
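To illustrate the difference between one run and limit behaviour, here is a rough sketch that counts the comparisons made by an insertion sort on a worst-case input at size n and at size 2n. Insertion sort and the function names are my own choice for the example, not something from the question. A ratio near 4 hints at quadratic growth, but it is only empirical evidence from a handful of runs, not a complexity proof.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Counts element comparisons made by insertion sort on a descending
// (worst-case) input of size n.
std::size_t insertion_sort_comparisons(std::size_t n) {
    std::vector<std::size_t> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = n - i;

    std::size_t comparisons = 0;
    for (std::size_t i = 1; i < n; ++i) {
        std::size_t j = i;
        while (j > 0) {
            ++comparisons;
            if (v[j - 1] <= v[j]) break;  // already in place
            std::swap(v[j - 1], v[j]);
            --j;
        }
    }
    return comparisons;
}

// Ratio of work at size 2n to work at size n.
double growth_ratio(std::size_t n) {
    assert(n >= 2);  // smaller inputs need no comparisons at all
    return static_cast<double>(insertion_sort_comparisons(2 * n)) /
           static_cast<double>(insertion_sort_comparisons(n));
}
```

Counting operations instead of reading a clock keeps the numbers repeatable; a profiler would give you the time for one particular n, which is a different question.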
You should try running your code under callgrind. It records which functions are called and how many times, but the code will run roughly 20x slower. Once you have the callgrind output, open it with kcachegrind to view the tree structure of the calls. There you can browse around and see where your bottlenecks are.
Useful links:
kcachegrind http://kcachegrind.sourceforge.net/html/Home.html
callgrind documentation http://valgrind.org/docs/manual/cl-manual.html
(valgrind is the framework, callgrind is a component)
how to start callgrind if the program spawns processes and you want to profile those too (replace sage bench.py with your program) https://github.com/titusnicolae/pynac-callgrind/blob/master/run.sh
edit: "--instr-atstart=no" should be removed from the parameter list if you don't plan on enabling the instrumentation later
How is this different from the halting problem?
Please note that I can trivially solve the halting problem using your automatic complexity analyzer -- your problem is HARDER. And the halting problem is already undecidable.

Profile optimised C++/C code

I have some heavily templated C++ code that I am working with. I can compile and profile with AMD tools and Sleepy in debug mode. However, without optimisation most of the time is spent in the templated code and the STL. With optimised compilation, all the profiling tools that I know of produce garbage information. Does anybody know a good way to profile optimised native code?
PS1:
The code that I am writing is also heavily templated. Most of the time spent in the unoptimised code will be optimised away: about 96-97% of the run time is spent in templated code without optimisation, which corrupts the accuracy of the profiling. And yes, I can change much of the templated code, or at least find which parts of it cause the most trouble and do better in those places.
You should focus on the code you wrote because that is what you can change, time spent in STL is irrelevant, just ignore it and focus on the callers of that code. If too much time is spent in STL you probably can call some other STL primitive instead of the current one.
Profiling unoptimized code is less interesting, but you can still get some information. If the algorithms used in some parts of the code are totally flawed, it will show up even there. But you should be able to get useful information from any good profiling tool on optimized code. What tools do you use exactly, and why do you call their output garbage?
Also it's usually easy enough to instrument your code by hand and find out exactly which parts are efficient and which are not. It's just a matter of calling timer functions (or reading the processor's cycle count if possible) at well-chosen points. I usually do that from unit tests to have reproducible results, but it all depends on the specifics of your program.
Tools or instrumenting code are the easy part of optimization. The hard part is finding ways to get faster code where it's needed.
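A minimal sketch of the hand instrumentation described above, assuming std::chrono::steady_clock as the timer function. ScopedTimer, timing_table and parse_and_sum are names invented for the example, and the table is not thread-safe.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

using Clock = std::chrono::steady_clock;

// Accumulated wall-clock time per label (hypothetical global registry).
inline std::map<std::string, Clock::duration>& timing_table() {
    static std::map<std::string, Clock::duration> table;
    return table;
}

// Adds the lifetime of the object to the entry for its label.
class ScopedTimer {
public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)), start_(Clock::now()) {}
    ~ScopedTimer() { timing_table()[label_] += Clock::now() - start_; }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    std::string label_;
    Clock::time_point start_;
};

// Hypothetical function under test: timed on every call.
long long parse_and_sum(int n) {
    ScopedTimer timer("parse_and_sum");
    long long total = 0;
    for (int i = 0; i < n; ++i) total += i % 7;
    return total;
}
```

Driving this from a unit test, as suggested above, means you can dump timing_table() at the end of the test and compare runs on the same input.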
What do you mean by "garbage information"?
Profiling is only really meaningful on optimized builds, so tools are designed to work with them -- thus if you're getting meaningless results, it's probably due to the profiler not finding the right symbols, or needing to instrument the build.
In the case of Intel VTune, for example, I found I got impossible results from the sampler unless I explicitly told it where to find the PDBs for the executable I was tuning. In the instrumented version, I had to fiddle with the settings until it was reliably putting probes into the function calls.
When @kriss says "You should focus on the code you wrote because that is what you can change", that's exactly what I was going to say.
I would add that in my opinion it is easier to do performance tuning first on code compiled without optimization, and then later turn on the optimizer, for the same reason. If something you can fix is costing excess time, it will cost proportionally excess time regardless of what the compiler does, and it's easier to find it in code that hasn't been scrambled.
I don't look for such code by measuring time. If the excess time is, say, 20%, then what I do is randomly pause it several times. As soon as I see something that can obviously be improved on 2 or more samples, then I've found it. It's an oddball method, but it doesn't really miss anything. I do measure the overall time before and after to see how much I saved. This can be done multiple times until you can't find anything to fix. (BTW, if you're on Linux, Zoom is a more automated way to do this.)
Then you can turn on the optimizer and see how much it gives you, but when you see what changes you made, you can see there's really no way the compiler could have done it for you.

C and C++ source code profiling tools [duplicate]

Possible Duplicate:
What's your favorite profiling tool (for C++)
Are there any good tools to profile source code which is a mix of C and C++? What are the pros and cons of each, and which ones have you used and would recommend? Please do not get me a list of tools from Google. I can do that too; what I want is to leverage the personal experience of someone who has used these tools and knows their pros and cons.
Thanks in advance.
I've found gprof to be the best CPU hotspot profiler, and Google Performance Tools to be the best sampling profiler. Both work for C and C++.
In my opinion there are no good profiling tools on Windows.
GNU gprof pros and cons
GCC only
Works with C and C++
Only measures CPU time, and only for code inside the binary: everything you wish to profile must be statically linked in
Very accurate
Adds a small overhead to execution
Google Performance Tools pros and cons
I think it requires the GNU tool chain
Occasionally fails to identify symbols
Very customizable
Outputs to a huge variety of formats, including the Callgrind format, and automatically loads KCacheGrind for you
Has various memory profiling tools also
Is a sampling profiler, with minimal overhead
Related useful questions and answers
Alternative to -pg with Clang?
What's your favorite profiling tool (for C++)
Alternatives to gprof
C++ Code Profiler
Confusing gprof output
I would respectfully disagree with Matt.
The tool I use all the time on Windows is the random-pausing technique, and it works with all languages that the IDE supports.
As an example of using it to do performance tuning, this case shows how a speedup of 43 times was achieved through a series of steps.
Gprof has a lot of problems, listed here, and according to the google-perftools manual, some of the same issues are repeated there, such as reporting procedures, not lines, emphasizing self (local) time, emphasizing the graph, etc. (I can't tell from the doc if it samples while blocked.)
As software systems become ever larger, self time becomes less and less relevant. The program counter spends most of its time in library routines or blocked in the system.
Graphs become gigantic nests.
People ask "I know function X is costly, but where in function X is the problem?"
What's more, the "bottlenecks" get bigger and bigger, because the stack gets deeper on average, and every layer of the stack is a fresh opportunity to do more function calls than necessary.
An example of a stack-sampler that reports percent by line, and samples while blocked, and allows user control of sampling so as not to dilute the sample set during user input, is Zoom.
EDIT: Sorry, can't leave well enough alone. Here's a new explanation:
The way programs work, they trace out a call tree, which is a lot like the oak tree outside my window. It has a trunk (main) which sprouts branches (call sites) which sprout further branches for several levels out to leaves (instructions) and acorns (blocking calls).
When the tree surgeon comes to prune (optimize) it, does he look only where the leaves are (hotspots)? Does he ignore acorns (no samples during blocking)?
No, he looks for branches (call sites) that are both heavy (on the stack a lot) and unhealthy (unnecessary). Those are what he prunes.
What random-pausing and Zoom do is help find those call sites.
You can use Callgrind to create profiling output. It is part of Valgrind.
Callgrind-output could be used with KCacheGrind, which is probably worth a look as long as you're using Linux.
AMD CodeAnalyst is pretty nice. It's also cross platform which is nice when one finds a platform specific bottleneck.

c++ profiling/optimization: How to get better profiling granularity in an optimized function

I am using google's perftools (http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html) for CPU profiling---it's a wonderful tool that has helped me perform a great deal of CPU-time improvements on my application.
Unfortunately, I have gotten to the point that the code is still a bit slow, and when compiled using g++'s -O3 optimization level, all I know is that a specific function is slow, but not which aspects of it are slow.
If I remove the -O3 flag, then the unoptimized portions of the program overtake this function, and I don't get a lot of clarity into the actual parts of the function that are slow. If I leave the -O3 flag in, then the slow parts of the function are inlined, and I can't determine which parts of the function are slow.
Any suggestions? Thanks for your help!
For something like this, I've always used the "old school" way of doing it:
At various points in the routine you want to measure, insert statements that record the current time (or CPU time). Then simply print out or log the differences between them and you will know how long each section of code took. From there you can find out what is eating most of the time, and go in and get fine-grained timing within that section until you know what the problem is, and how to fix it.
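A rough sketch of that "old school" approach, assuming std::chrono::steady_clock for the current time. The Checkpoints class and slow_routine are hypothetical names for illustration; each mark() stores the time elapsed since the previous one, so the differences are already computed when you log them.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Records the time between successive mark() calls, in microseconds.
class Checkpoints {
public:
    using Clock = std::chrono::steady_clock;

    Checkpoints() : last_(Clock::now()) {}

    // Stores the time since the previous mark (or construction) under `name`.
    void mark(std::string name) {
        const Clock::time_point now = Clock::now();
        const double micros =
            std::chrono::duration<double, std::micro>(now - last_).count();
        laps_.emplace_back(std::move(name), micros);
        last_ = now;
    }

    const std::vector<std::pair<std::string, double>>& laps() const {
        return laps_;
    }

private:
    Clock::time_point last_;
    std::vector<std::pair<std::string, double>> laps_;
};

// Hypothetical routine with two sections to compare.
double slow_routine(int n, Checkpoints& cp) {
    double acc = 0.0;
    for (int i = 1; i <= n; ++i) acc += 1.0 / i;  // section 1
    cp.mark("harmonic sum");
    for (int i = 1; i <= n; ++i) acc += i * 0.5;  // section 2
    cp.mark("linear sum");
    return acc;
}
```

Once one section stands out, move the mark() calls inside it and repeat until the slow statement is obvious.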
If the overhead of the function calls is not the problem, you can also force inlining to be off with -fno-inline-small-functions -fno-inline-functions -fno-inline-functions-called-once -fno-inline (I'm not exactly sure how these switches interact with each other, but I think they are independent). Then you can use your ordinary profiler to look at the call graph profile and see what function calls are taking what amount of time.
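As a narrower alternative to the global switches above (a different technique, not what the answer itself proposes), GCC and Clang also accept a per-function `__attribute__((noinline))`, and MSVC has `__declspec(noinline)`. A sketch with made-up stage functions, so that each stage keeps its own entry in the profiler's call graph even at -O3:

```cpp
#include <vector>

// Per-function opt-out from inlining; the NOINLINE macro name is my own.
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

// Hypothetical stage 1: plain sum.
NOINLINE double stage_a(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum;
}

// Hypothetical stage 2: sum of squares.
NOINLINE double stage_b(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x * x;
    return sum;
}

// The slow function whose inside you want to see in the profile.
double pipeline(const std::vector<double>& v) {
    return stage_a(v) + stage_b(v);
}
```

The trade-off is the same as with the flags: the call overhead is now real, so treat the per-stage numbers as relative rather than absolute.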
If you're on linux, use oprofile.
If you're on Windows, use AMD's CodeAnalyst.
Both will give sample-based profiles down to the level of individual source lines or assembly instructions and you should have no problem identifying "hot spots" within functions.
I've spent decades doing performance tuning.
People love their tools, but I swear by this method.

C++ code performance

When writing C++ code in VS2005, how can you measure the performance of your code?
Is there a built-in tool in VS for that? Can I find out which function or class slows down my application?
Are there external tools that can be integrated into VS to measure the bottlenecks in my code?
If you have the Team System edition of Visual Studio 2005, you can use the built-in profiler.
http://msdn.microsoft.com/en-gb/library/z9z62c29(VS.80).aspx
AMD CodeAnalyst is available for free for both Windows and Linux and works on most x86 or x64 CPUs (including Intel's).
It has extra features available when you have an AMD processor, of course. It also integrates into Visual Studio.
I've had pretty good luck with it.
Note that there are generally at least two common forms of profiler:
instrumenting: alters your build to record information at the beginning and end of certain areas (usually per function)
sampling: periodically looks at what code is running to record information
The types of information recorded can include (but are not limited to): elapsed time, # of CPU cycles, cache hits/misses, etc.
Instrumenting can be specific to certain areas of the code (just certain files or just code you compile, not libraries you link to). The overhead is much higher (you're adding code to the project, which takes time to execute, so you're altering timing; you may change program behavior for e.g. interrupt handlers or other timing-dependent code). You're guaranteed that you will get information about the functions/areas you instrument, though.
Sampling can miss very small or very sporadic functions, but modern machines have hardware help to allow you to sample much more thoroughly. Note that some sampling systems may still inject timing differences, although they generally will be much much smaller.
Some profiling tools support a mixture of the above, depending on how you use them.
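To show what the instrumenting flavour looks like in practice, here is a sketch assuming the GCC/Clang option -finstrument-functions, which makes the compiler call `__cyg_profile_func_enter` and `__cyg_profile_func_exit` around every instrumented function. The counters and fib are hypothetical; a real tool would record timestamps per function address instead of just counting. This is GCC/Clang only and not thread-safe.

```cpp
// Assumed build line: g++ -finstrument-functions demo.cpp
// Without that flag the hooks are never called and both counters stay 0.

extern "C" {
// The hooks themselves must not be instrumented, or they would recurse.
void __cyg_profile_func_enter(void* fn, void* call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void* fn, void* call_site)
    __attribute__((no_instrument_function));
}

// Plain integers on purpose: calling library code (for example std::atomic
// member functions) from a hook could itself be instrumented and recurse.
static unsigned long g_enter_count = 0;
static unsigned long g_exit_count = 0;

extern "C" void __cyg_profile_func_enter(void*, void*) { ++g_enter_count; }
extern "C" void __cyg_profile_func_exit(void*, void*) { ++g_exit_count; }

__attribute__((no_instrument_function)) unsigned long enter_count() {
    return g_enter_count;
}
__attribute__((no_instrument_function)) unsigned long exit_count() {
    return g_exit_count;
}

// Hypothetical workload: every recursive call triggers both hooks.
unsigned long fib(unsigned n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }
```

The extra calls on every entry and exit are exactly the overhead mentioned above, which is why small, frequently called functions get distorted the most.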
You could also use Intel VTune.
You want a tool called a profiler. For a free one that covers most simple cases, I recommend Very Sleepy. It works by sampling the application's current call stack at regular intervals.
You can always measure the time and performance of your code yourself. Consult MSDN about the following functions: QueryPerformanceCounter() and QueryPerformanceFrequency().
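A small sketch of how those two functions fit together: the counter gives ticks, the frequency gives ticks per second, and the quotient is seconds. now_seconds and time_call are hypothetical helper names, and the non-Windows branch using std::chrono is my addition so the snippet builds elsewhere too.

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <chrono>
#endif

// Seconds since an arbitrary fixed point; only differences are meaningful.
double now_seconds() {
#ifdef _WIN32
    LARGE_INTEGER frequency;
    LARGE_INTEGER counter;
    QueryPerformanceFrequency(&frequency);  // ticks per second
    QueryPerformanceCounter(&counter);      // current tick count
    return static_cast<double>(counter.QuadPart) /
           static_cast<double>(frequency.QuadPart);
#else
    // Portable stand-in (assumption, not part of the Windows API).
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
#endif
}

// Runs f once and returns the elapsed time in seconds.
template <typename F>
double time_call(F&& f) {
    const double start = now_seconds();
    f();
    return now_seconds() - start;
}
```

For very short pieces of code, call f in a loop inside the lambda and divide by the iteration count, otherwise the timer resolution dominates the result.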
For more in depth analysis of memory allocation and execution times we use Memory Validator and Performance Validator from Software Verify. They have support for several languages other than C++.
I think measuring performance, and locating code to optimize, are different problems, and require different methods.
To locate code to optimize, I swear by this simple method, which is orthogonal to accepted wisdom about profiling, and does not require you to buy or install any tools.
To measure performance, I'm content with the simple process of running the subject code in a loop and timing it.
EDIT: BTW, I just looked at Very Sleepy, and it appears to be on the right track. It samples the entire call stack, and retains each stack. What I can't tell is if it gives you, for each call instruction or regular instruction, the fraction of stack samples containing that instruction. In my opinion, that is the most valuable statistic, and it does not need to be very precise.
dotTrace, on the other hand, also looks like maybe it retains stack samples, but its UI presentation of call-stack info seems to be a call-tree. What I would look for is something that shows the stack-residence percentage of individual instructions (or statements), because they could be in different branches of the call-tree, and thus the call-tree could miss their importance.
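To pin down the statistic I mean, here is a sketch that takes a set of stack samples and computes, for each frame, the fraction of samples in which it appears anywhere on the stack. The Stack alias, stack_residence and the frame labels are all hypothetical; a real tool would work on call-site or instruction addresses rather than strings.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// One sample: the frames on the stack at the moment of the pause,
// outermost first. Strings stand in for call-site addresses here.
using Stack = std::vector<std::string>;

// For each frame, the fraction of samples whose stack contains it.
std::map<std::string, double> stack_residence(const std::vector<Stack>& samples) {
    std::map<std::string, std::size_t> hits;
    for (const Stack& sample : samples) {
        // A frame repeated by recursion still counts once per sample.
        const std::set<std::string> unique(sample.begin(), sample.end());
        for (const std::string& frame : unique) ++hits[frame];
    }

    std::map<std::string, double> fraction;
    if (samples.empty()) return fraction;
    for (const auto& entry : hits) {
        fraction[entry.first] = static_cast<double>(entry.second) /
                                static_cast<double>(samples.size());
    }
    return fraction;
}
```

With samples such as main/load/parse (twice), main/render and main/report/parse, "parse" comes out at 0.75 even though it sits under two different branches, which is the case a call-tree view can hide.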
For intrusive measurement, use the performance counters. Since you're using C++, you should use a facade over this slightly painful API. STLSoft has a family of such things, with different pros and cons. I suggest winstl::performance_counter for highest resolution, or winstl::threadtimes_counter if you want to monitor the performance of a particular thread regardless of other activity in your process(es). There was an article about this in Dr Dobb's several years ago, in which the design rationale behind the facades was described in detail.
For non-intrusive measurement, you can't go past VTune.
We use Rational quantify which comes as a part of Rational PurifyPlus set of tools.
It's an excellent tool for profiling application performance.
I've recently tried JetBrains dotTrace profiler and it looks very good. It helped me locate a number of "black holes" in existing C++ code quite easily.
It works fine in Visual Studio 2005 Professional in a solution which mixes C# and C++ - it uses the right function names for both pieces of code and does an integrated analysis. You can trace for time or memory.
It will be a pity when the evaluation period expires :)
We've had good results from AQTime. It's not free but is cheaper than Visual Studio ;-)