What is the fastest instrumentation profiler out there - c++

What is the fastest profiler available for dynamic profiling (like what gprof does)? It has to be an instrumentation profiler. If it also offers sampling, I'm still interested in the overhead of its instrumentation mode, because sampling profiling can be done with almost 0% overhead anyway.

Any profiler that uses hardware-based sampling (via the CPU's performance-monitoring MSRs) will have the smallest overhead, as it's reading profiling data that the CPU keeps track of at the hardware level. For more info, see AMD's and Intel's architecture manuals; this should be explained in depth in one of the appendices.
The only profilers I know of using these are VTune for Intel (not free) and CodeAnalyst for AMD (free).
Next in line would be timer-based and event-based profilers. Of these, the ones with the least overhead would probably be those compiled directly into your code (CodeAnalyst has an API for event-based profiling, and so does VTune). gprof also falls into this category (Clang also has something, but I don't know if it's still maintained). If you have VS Pro or Ultimate, its PG compile mode will do similar things, though I have never found it to compare with a dedicated profiler suite.
Last would be the ones that need to insert probes into your code to collect profiling data. All the aforementioned ones can do this, as can other freeware profilers like Very Sleepy.
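To make the "probes inserted into your code" category concrete, here is a minimal sketch of a hand-rolled probe. The names `ProbeStats`, `ScopedProbe`, `mean_ns` and `instrumented_square` are made up for illustration and are not part of any profiler mentioned above; real tools inject equivalent enter/exit hooks for you. The point is only that every probed call pays for two clock reads, which is where instrumentation overhead comes from.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical per-function record; a real profiler keeps one per call site.
struct ProbeStats {
    std::uint64_t calls = 0;
    std::chrono::steady_clock::duration total{};
};

// RAII probe: reads the clock on entry and again on exit.
class ScopedProbe {
public:
    explicit ScopedProbe(ProbeStats& stats)
        : stats_(stats), start_(std::chrono::steady_clock::now()) {}
    ~ScopedProbe() {
        stats_.total += std::chrono::steady_clock::now() - start_;
        ++stats_.calls;
    }
    ScopedProbe(const ScopedProbe&) = delete;
    ScopedProbe& operator=(const ScopedProbe&) = delete;

private:
    ProbeStats& stats_;
    std::chrono::steady_clock::time_point start_;
};

// Mean time per call in nanoseconds, with the probe's own cost included.
inline double mean_ns(const ProbeStats& s) {
    assert(s.calls > 0 && "no calls recorded");
    return std::chrono::duration<double, std::nano>(s.total).count() /
           static_cast<double>(s.calls);
}

// Illustrative instrumented function: the probe wraps the whole body.
inline long long instrumented_square(long long x, ProbeStats& stats) {
    ScopedProbe probe(stats);
    return x * x;
}
```

Calling `mean_ns` on a probe that wraps an empty scope gives a rough idea of the per-call overhead on a given machine, which is the number the question is really after.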

Intel's VTune Amplifier is probably the most complete.

Related

How can I find out how much time is spent on each line in C/C++ code?

I am trying to find a profiling tool that shows how much time is spent on each line of code in a C/C++ program. I am working on Linux platforms (Ubuntu, Gentoo, SL), mainly with gcc. I use gprof, but sometimes I need the "per line" information.
Any suggestions? Thank you!
On Linux you can use OProfile. This is a sample-based profiler which runs on almost any platform and supports the performance monitoring registers if they are available. On x86 it works with both AMD and Intel.
You can use it as a standalone program, which will give you annotated source, but there is also a plugin (linuxtools) for Eclipse which integrates nicely into the IDE.
AMD CodeAnalyst is your best bet. It is totally free and works on Windows and Linux, though it's primarily for AMD CPUs, so non-AMD CPUs won't get the MSR-based profiling options. Under Windows it also has great integration with Visual Studio 2008 and 2010.
For non-vendor-specific, free profilers, you can try Very Sleepy, which also happens to be open source.
What Zoom does is take stack samples on wall-clock time.
Then the percent of time any function or line of code is responsible for is the fraction of samples on which it appears.
For example, if a line of code is on 30% of stack samples, and you could avoid executing it, the total execution time would decrease by 30%.
This is true regardless of I/O, recursion, competing processes, swapping, all the things that confuse many profilers.
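The arithmetic behind that claim is simple enough to sketch. The code below is not how Zoom is implemented; it only assumes you already have a set of stack samples (the `StackSample` type and `inclusive_fraction` name are hypothetical) and counts the fraction on which a given frame appears.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One stack sample: the functions (or lines) on the call stack at that instant.
using StackSample = std::vector<std::string>;

// Fraction of samples on which `frame` appears anywhere on the stack.
// A frame counts once per sample, so recursion does not inflate the result.
inline double inclusive_fraction(const std::vector<StackSample>& samples,
                                 const std::string& frame) {
    assert(!samples.empty() && "need at least one sample");
    std::size_t hits = 0;
    for (const StackSample& s : samples) {
        if (std::find(s.begin(), s.end(), frame) != s.end()) {
            ++hits;
        }
    }
    return static_cast<double>(hits) / static_cast<double>(samples.size());
}
```

With 10 samples of which 3 contain a frame, the function reports 0.3 for it, which matches the "30% of stack samples" reading above, and a frame that recurses within one sample is still counted once.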

Measure how often a branch is mispredicted

Assuming I have an if-else branch in C++, how can I (in-code) measure how often the branch is mispredicted? I would like to add some calls or macros around the branch (similar to how you do bottom-up profiling) that would report branch mispredictions.
It would be nice to have a generic method, but let's do Intel i5 2500k for starters.
If you are using an AMD CPU, AMD's CodeAnalyst is just what you need (works on Windows and Linux)*.
If you're not, then you may need to fork out for a VTune licence or build something using the on-CPU performance registers and counters detailed in the instruction manuals.
You can also check out gperf and OProfile (Linux only) and see how well they perform (I've never used these, but I see them referred to quite a bit).
*CodeAnalyst should work on an Intel CPU; you just don't get all the nice CPU-level analysis.
I wonder if it would be possible to extract this information from g++ -fprofile-arcs? It has to measure exactly this in order to feed back into the optimizer in order to optimize branching.
OProfile
OProfile is pretty complex, but it can profile anything your CPU tracks.
Look through the Event Type Reference and look for your particular CPU.
For instance, here are the Core 2 events. After a quick search I don't see any event counters for missed branch prediction on the Core 2 architecture.
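For the "build something using the on-CPU counters" route, one option on Linux is the kernel's `perf_event_open` interface, which exposes a generic branch-miss event. The sketch below assumes Linux with kernel headers installed and a system that permits unprivileged counter access; the wrapper class `BranchMissCounter` is a made-up name. The count covers everything the thread runs between `start()` and `stop()`, including a little user-space overhead from the calls themselves, so treat small numbers as noise.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstdint>
#include <cstring>

// Hypothetical wrapper around one hardware branch-miss counter for this thread.
class BranchMissCounter {
public:
    BranchMissCounter() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_BRANCH_MISSES;
        attr.disabled = 1;        // start stopped; enabled in start()
        attr.exclude_kernel = 1;  // count user-space branches only
        attr.exclude_hv = 1;
        // pid = 0 (calling thread), cpu = -1 (any), no group, no flags.
        fd_ = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    }
    ~BranchMissCounter() {
        if (fd_ >= 0) close(fd_);
    }
    BranchMissCounter(const BranchMissCounter&) = delete;
    BranchMissCounter& operator=(const BranchMissCounter&) = delete;

    // False when the kernel refused the counter (permissions, VM, no PMU).
    bool ok() const { return fd_ >= 0; }

    void start() {
        if (!ok()) return;
        ioctl(fd_, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_, PERF_EVENT_IOC_ENABLE, 0);
    }

    // Stops counting; returns mispredicted branches since start(), or -1 if unavailable.
    long long stop() {
        if (!ok()) return -1;
        ioctl(fd_, PERF_EVENT_IOC_DISABLE, 0);
        std::uint64_t count = 0;
        if (read(fd_, &count, sizeof(count)) != static_cast<ssize_t>(sizeof(count))) {
            return -1;
        }
        return static_cast<long long>(count);
    }

private:
    int fd_ = -1;
};
```

Wrapping the loop that contains the if-else in `start()` / `stop()` gives a total for that region. Attributing misses to one specific branch still needs a sampling profiler.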

How to profile a C++ function at assembly level?

I have a function that is the bottleneck of my program. It requires no access to memory and requires only calculation. It is the inner loop and called many times, so any small gains in this function are big wins for my program.
I come from a background in optimizing SPU code on the PS3, where you take an SPU program and run it through a pipeline analyzer that puts each assembly statement in its own column, and you minimize the number of cycles the function takes. Then you overlay loops so you can minimize pipeline dependencies even more. With that program and a list of the cycles each assembly instruction takes, I could optimize much better than the compiler ever could.
On a different platform it had events I could register (cache misses, cycles, etc.) and I could run the function and track CPU events. That was pretty nice as well.
Now I'm doing a hobby project on Windows using Visual Studio C++ 2010 w/ a Core i7 Intel processor. I don't have the money to justify paying the large cost of VTune.
My question:
How do I profile a function at the assembly level for an Intel processor on Windows?
I want to compile, view disassembly, get performance metrics, adjust my code and repeat.
There are some great free tools available, mainly AMD's CodeAnalyst (in my experience on my i7 vs. my Phenom II, it's a bit handicapped on the Intel processor because it doesn't have access to the hardware-specific counters, though that might have been bad config).
However, a lesser-known tool is the Intel Architecture Code Analyzer (which is free, like CodeAnalyst). It is similar to the SPU tool you described, as it details latency, throughput and port pressure (basically the request dispatches to the ALUs, MMU and the like) line by line for your program's assembly. Stan Melax gave a nice talk on it and x86 optimization at this year's GDC, under the title "hotspots, flops and uops: to-the-metal cpu optimization".
Intel also has a few more tools in the same vein as IACA, available under the performance tuning section of their experimental/what-if code site, such as PTU, which is (or was) an experimental evolution of VTune; from what I can see, it's free.
It's also a good idea to have read the Intel optimization manual before diving into this.
EDIT: as Ben pointed out, the timings might not be correct for older processors, but that can be easily made up for using Agner Fog's Optimization manuals, which also contain many other gems.
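Static analysis only predicts timings, so for the "get performance metrics, adjust my code and repeat" part of the loop a small timing harness is a cheap complement. This is a rough sketch, assuming a C++11 toolchain (Visual Studio 2010 predates `<chrono>`, so there you would substitute `QueryPerformanceCounter`). `best_ns_per_call` and `kernel` are hypothetical names, and `kernel` is just a stand-in for the real compute-only function.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Best-of-N timing: the minimum is the run least disturbed by interrupts
// and scheduling noise. Returns nanoseconds per call of fn.
template <typename F>
double best_ns_per_call(F&& fn, int repeats = 20, int calls_per_repeat = 100000) {
    assert(repeats > 0 && calls_per_repeat > 0);
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = clock::now();
        for (int i = 0; i < calls_per_repeat; ++i) {
            fn(i);
        }
        const auto t1 = clock::now();
        const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        best = std::min(best, ns / calls_per_repeat);
    }
    return best;
}

// Hypothetical compute-only kernel (one xorshift32 step) standing in
// for the real inner-loop function.
inline std::uint32_t kernel(std::uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}
```

The callable passed in should write its result somewhere observable (for example a `volatile` variable), otherwise the optimizer is free to delete the work being measured. The figure includes loop and call overhead, so it is best used to compare two versions of the same function rather than as an absolute cycle count.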
You might want to try some of the utilities included in valgrind like callgrind or cachegrind.
Callgrind can do profiling and dump assembly.
And kcachegrind is a nice GUI, and will show the dumps including assembly and number of hits per instruction etc.
From your description it sounds like your problem may be embarrassingly parallel. Have you considered using PPL's parallel_for?

Efficient cache and BLOB's - profiling cache hits/misses

For a program to be cache efficient, the data used should be stored linearly, right?
So instead of dynamic allocation I put my data in a blob using a linear allocator. Is this enough to improve performance? What should I do to improve cache efficiency even more?
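For readers unfamiliar with the term, a linear (bump) allocator in its simplest form might look like the sketch below. This is not the asker's code; `LinearAllocator` and its members are illustrative names, and the sketch assumes trivially destructible objects since nothing is ever destroyed individually.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hands out consecutive slices of one contiguous blob.
class LinearAllocator {
public:
    explicit LinearAllocator(std::size_t capacity) : buffer_(capacity), offset_(0) {}
    LinearAllocator(const LinearAllocator&) = delete;
    LinearAllocator& operator=(const LinearAllocator&) = delete;

    // Returns `size` bytes aligned to `alignment` (a power of two),
    // or nullptr when the blob is exhausted.
    void* allocate(std::size_t size,
                   std::size_t alignment = alignof(std::max_align_t)) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buffer_.data());
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        const std::uintptr_t aligned = (base + offset_ + mask) & ~mask;
        const std::size_t start = static_cast<std::size_t>(aligned - base);
        if (start > buffer_.size() || size > buffer_.size() - start) {
            return nullptr;
        }
        offset_ = start + size;
        return buffer_.data() + start;
    }

    // Releases everything at once; individual frees are not supported.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t offset_;
};
```

Objects allocated one after another end up adjacent in memory, which helps only if they are also accessed in roughly that order.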
I know these questions aren't specific, but I don't know how to explain it...
Which programs can help me profile cache hits/misses?
If you're looking for a profiler for Windows, you can try AMD's CodeAnalyst or Very Sleepy; both are free. AMD's is the more powerful of the two (it also works on Intel hardware, but IIRC you can't use the hardware-based profiling features there), and it includes monitoring of things like branch prediction misses and cache utilization. Profiling is great, as it tells you what to optimize, but it doesn't always tell you how. For that, have a look at Agner Fog's optimization manuals combined with Intel's optimization manual (which contains a lot on locality and cacheability optimizations).
If you're on Linux you could use Valgrind (specifically the cachegrind tool).
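If you want something small to try a cache profiler on, the classic test case is walking the same contiguous array in two different orders. The sketch below uses hypothetical names (`Matrix`, `sum_row_major`, `sum_col_major`). Both functions compute the same sum, and on a matrix much larger than the cache the column-order walk should show noticeably more data-cache misses under a tool like cachegrind, though the exact numbers depend on the machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix in one contiguous buffer: element (r, c) is at r * cols + c.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 1) {
        assert(r > 0 && c > 0);
    }
};

// Walks memory sequentially, so each cache line is fully used before the next.
inline long long sum_row_major(const Matrix& m) {
    long long total = 0;
    for (std::size_t r = 0; r < m.rows; ++r) {
        for (std::size_t c = 0; c < m.cols; ++c) {
            total += m.data[r * m.cols + c];
        }
    }
    return total;
}

// Strides by `cols` elements per step; on large matrices most accesses
// land on a different cache line than the previous one.
inline long long sum_col_major(const Matrix& m) {
    long long total = 0;
    for (std::size_t c = 0; c < m.cols; ++c) {
        for (std::size_t r = 0; r < m.rows; ++r) {
            total += m.data[r * m.cols + c];
        }
    }
    return total;
}
```

This also illustrates why linear storage alone is not enough: the data here is contiguous in both cases, and only the access order differs.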
If you're on Windows, then VS2010 (2008) Professional edition has a built-in profiler, but I don't know any details about its cache profiling facilities. There is also the Intel VTune Analyzer (Amplifier).
Both of them are commercial products, although I think you can get 30-day evaluation copies.
Some other questions on SO that might be of help:
What's your favorite profiling tool (for C++)
C and C++ source code profiling tools
On Linux, you can use perf mem to sample memory accesses, including misses in a very fine-grained manner (including the miss address), as described here.

Profiling C++ multi-threaded applications

Have you used any profiling tool like Intel VTune Analyzer?
What are your recommendations for a C++ multi-threaded application on Linux and Windows? I am primarily interested in cache misses, memory usage, memory leaks and CPU usage.
I use valgrind (only on UNIX), but mainly for finding memory errors and leaks.
The following are good tools for multithreaded applications. You can try evaluation copies.
Runtime sanity check tool
- Thread Checker: Intel Thread Checker / VTune
Memory consistency-check tools (memory usage, memory leaks)
- Memory Validator
Performance analysis (CPU usage)
- AQTime
EDIT: Intel Thread Checker can be used to diagnose data races, deadlocks, stalled threads, abandoned locks, etc. Please have lots of patience in analyzing the results, as it is easy to get confused.
A few tips:
Disable the features that are not required. (When identifying deadlocks, data race detection can be disabled, and vice versa.)
Use an instrumentation level based on your need. (Levels like "All Function" and "Full Image" are used for data races, whereas "API Imports" can be used for deadlock detection.)
Use the context-sensitive menu "Diagnostic Help" often.
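As a reminder of what a data-race diagnosis is actually pointing at, here is a minimal sketch using standard C++11 threads. `locked_count` is a hypothetical function written for this example and is not taken from any of the tools above. The version shown is correct; the comment marks the one line whose removal would turn it into the kind of unsynchronized shared write a race detector reports.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Increments one shared counter from several threads and returns the total.
inline long long locked_count(int threads, int increments_per_thread) {
    assert(threads > 0 && increments_per_thread >= 0);
    long long counter = 0;
    std::mutex guard;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, &guard, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Removing this lock makes ++counter a data race:
                // several threads would write the same variable unsynchronized.
                std::lock_guard<std::mutex> lock(guard);
                ++counter;
            }
        });
    }
    for (std::thread& worker : pool) {
        worker.join();
    }
    return counter;
}
```

With the lock in place the result is always `threads * increments_per_thread`. Without it the program has undefined behaviour and typically loses updates, but not on every run, which is why a checker is more reliable than eyeballing the output.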
On Linux, try oprofile.
It supports various performance counters.
On Windows, AMD's CodeAnalyst (free, unlike VTune) is worth a look.
It only supports event profiling on AMD hardware though
(on Intel CPUs it's just a handy timer-based profiler).
A colleague recently tried Intel Parallel Studio (beta) and rated it favourably
(it found some interesting parallelism-related issues in some code).
VTune gives you a lot of detail on what the processor is doing, and sometimes I find it hard to see the wood for the trees. VTune will not report on memory leaks. You'll need PurifyPlus for that, or, if you can run on a Linux box, valgrind is good for memory leaks at a great price.
VTune shows two views: the tabular one is useful; the other, I think, is mostly there for salesmen to impress people with and is not that useful.
For a quick and cheap option I'd go with valgrind. Valgrind also has a cachegrind part to it; I've not used it, but I suspect it's also very good.
cheers,
Martin.
You can try out AMD CodeXL's CPU profiler. It is free and available for both Windows and Linux.
AMD CodeXL's CPU profiler replaces the no longer supported CodeAnalyst tool (which was mentioned in an answer above given by timday).
For more information and download links, visit: AMD CodeXL web page.
I'll put in another answer for valgrind, especially the callgrind portion with the UI. It can handle multiple threads by profiling each thread for cache misses, etc. They also have a multi-thread error checker called helgrind, but I've never used it and don't know how good it is.
The Rational PurifyPlus suite includes both a well-proven leak detector and pretty good profiler. I'm not sure if it does go down to the level of cache misses, though - you might need VTune for that.
PurifyPlus is available both on various Unices and Windows so it should cover your requirements, but unfortunately in contrast to Valgrind, it isn't free.
For simple profiling, gprof is pretty good.