Profiling a multiprocess system - c++

I have a system that i need to profile.
It is comprised of tens of processes, mostly c++, some comprised of several threads, that communicate to the network and to one another though various system calls.
I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in userspace code, inefficient use of syscalls, or something else.
What would be the best way to approach profiling a system like this?
I have thought of the following strategy:
Manually logging the roundtrip times of various code sequences (for example processing an incoming packet or a cli command) and seeing which process takes the largest time. After that, profiling that process, fixing the problem and repeating.
This method seems sorta hacky and guess-worky. I dont like it.
How would you suggest to approach this problem?
Are there tools that would help me out (multi-process profiler?)?
What im looking for is more of a strategy than just specific tools.
Should i profile every process separately and look for problems? if so how do i approach this?
Do i try and isolate the problematic processes and go from there? if so, how do i isolate them?
Are there other options?

I don't think there is a single answer to this sort of question. And every type of issue has it's own problems and solutions.
Generally, the first step is to figure out WHERE in the big system is the time spent. Is it CPU-bound or I/O-bound?
If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent - the next question is of course whether that time is actually necessary or not, and no automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps, and one that does a matrix multiplication with a million elements very efficiently - it takes the same amount of CPU-time to do both, but one isn't actually achieving anything. However, knowing which program takes most of the time in a multiprogram system can be a good starting point for figuring out IF that code is well written, or can be improved.
If the system is I/O bound, such as network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, expecting the tool to point out what packet response or disk access time you should expect is a different matter - if you contact google to search for "kerflerp", or if you contact your local webserver that is a meter away, will have a dramatic impact on the time for a reasonable response.
There are lots of other issues - running two pieces of code in parallel that uses LOTS of memory can cause both to run slower than if they are run in sequence - because the high memory usage causes swapping, or because the OS isn't able to use spare memory for caching file-I/O, for example.
On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.
Adding logging to your applications such that you can see WHERE it is spending time is another method that works reasonably well. Particularly if you KNOW what the use-case is where it takes time.
If you have a use-case where you know "this should take no more than X seconds", running regular pre- or post-commit test to check that the code is behaving as expected, and no-one added a lot of code to slow it down would also be a useful thing.

Related

Fastest way to communicate with subprocess

I have a parent process which spawns several subprocesses to do some CPU intensive work. For each batch of work, the parent needs to send several 100MB of data (as one single chunk) to the subprocess and when it is done, it must receive about the same amount of data (again as one single chunk).
The parent process and the subprocess are different applications and even different languages (Python and C++ mostly) but if I have any solution in C/C++, I could write a Python wrapper if needed.
I thought the simplest way would be to use pipes. That has many advantages, such as being mostly cross-platform, being simple, and flexible, and I can maybe even later extend my code without too much work to communicate over network.
However, now I'm profiling the whole application and I see some noticeable overhead in the communication and I wonder whether there are faster ways. Cross-platform is not really needed for my case (scientific research), it's enough if it works on Ubuntu >=12 or so (although MacOSX would also be nice). In principle, I thought that copying a big chunk of data into a pipe and reading it at the other end should not take much more time than setting up some shared memory and doing a memcpy. Am I wrong? Or how much performance would you expect is it worse?
The profiling itself is complicated and I don't really have reliable and exact data, only clues (because it's all a quite complicated system). I wonder where I should spent my time now. Trying to get more exact profiling data? Trying to implement some shared memory solution and see how much it improves?. Or something else? I also thought about wrapping and compiling the subprocess application in a library and linking it into the main process and thus avoiding the communication with another process - in that case I need just a memcpy.
There are quite a few related questions here on StackOverflow but I haven't really seen a performance comparison for different methods of communication.
Ok, so I wrote a small benchmarking tool here which copies some data (~200MB) either via shared memory or via pipe, 10 times.
Results on my MacBook with MacOSX:
Shared memory:
24.34 real 18.49 user 5.96 sys
Pipe:
36.16 real 20.45 user 17.79 sys
So, first we see that the shared memory is noticeably faster. Note that if I copy smaller chunks of data (~10MB), I almost don't see a difference in total time.
The second noticeable difference is the time spent in kernel. It is expected that the pipe needs more kernel time because the kernel has to handle all those reads and writes. But I would not have expected it to be that much.

How to get an accurate performance measure?

In our project we're trying to automatically monitor the performance of test runs, to make sure that we don't have any significant changes in the performance of the program over time.
The problem is that there seems to be a consistent 5% variability in the measures we get. That is, on the same machine with the same program (no recompilation) running the same test we get values that differ by around 5% from run to run. This is way too much for what we want to use the numbers for.
We're already excluding setup costs from the timing considerations - that is, from within C++ code itself we're grabbing the time immediately before and after running the time-critical portions, rather than doing the timing of the whole program on the OS level. We are also doing averaging and outlier exclusion. The problem is that the variability looks to also have long-term trends, so we get tight clustering of times for replicates right after each other, but an hour or two later the times are substantially different. (Unfortunately, spreading the test out over several hours is not feasible.) The tests are also being run on a dedicated machine while "nothing else" is being run on it.
We're not quite sure where the timing variation is coming from, but it may have to do with the processor and the system - there's indications that the size of the variability depends on what machine the program is running on.
Does anyone have an idea where this variation is likely to be coming from, and how to remove it? The tests are running on a dedicated machine, so changing the operating system settings would be possible.
(As indicated by the tags, this is a C++ program running on a x86 Linux system, if that helps clarify things.)
Edit: Response to comments
Our current timing scheme is to use the clock() function from the C standard library, looking at the difference in the return value from before/after the functions we want to test.
The code we're testing should be deterministic, and shouldn't involve heavy IO.
I realize that the situation is a little hazy for a "silver bullet" answer. I guess I'm more looking for a "these are the factors that are important to consider, this is the order you probably should check them in, and here's how you go about checking each of them" type answer.
I'm amazed you got down to 5% variation.
Unless you can get rid of all the unnecessary things running on your system, you will be getting high variation. This is at the top level.
You OS needs to be deterministic. You need to know what other tasks and threads are running and their durations. For example, there is the clock interrupt. Now, how many other functions are chained to this interrupt? Do these other functions vary?
Is your system isolated? For example, your measurements may vary if your system is connected to a network.
Does your program use external resources? For example a hard drive. If the program writes to the hard drive, the drive will not be deterministic. Files and parts of files may move on the drive. The drive may become fragmented. This fragmentation may cause variance in your measurements.
The operating system memory may get fragmented. Also, the executable's memory may become fragmented. Fragmentation may add to the variance.

Optimizing console programs

I have a console application that is doing intense calculations, and it takes several hours to complete. I can run as many as I want at a given moment, and they all run at exactly the same speed. That means my computer has the ability to go faster, so why isn't it? I want my computer to take 100% of its processing power and dedicate it to this one program.
I can easily cut time in half by putting half of the work in one program and half of the work in another, and have them communicate through a txt file or something. Is there any way I can make it go faster without doing that? Task Manager's priority doesn't change anything.
Well... There's always multithreading:
Stack Overflow, for example.
But it's always a complicated process, especially for C++. I'm afraid you'll have to google some tutorials for your particular case.
Here are some: tutorialspoint, codebase, plenty more around. There are a lot of methods, achieved by many distinct libraries, and none of them could be considered simple.

Measuring device drivers CPU/IO utilization caused by my program

Sometimes code can utilize device drivers up to the point where the system is unresponsive.
Lately I've optimized a WIN32/VC++ code which made the system almost unresponsive. The CPU usage, however, was very low. The reason was 1000's of creations and destruction of GDI objects (pens, brushes, etc.). Once I refactored the code to create all objects only once - the system became responsive again.
This leads me to the question: Is there a way to measure CPU/IO usage of device drivers (GPU/disk/etc) for a given program / function / line of code?
You can use various tools from SysInternals Utilities (now a Microsoft product, see http://technet.microsoft.com/en-us/sysinternals/bb545027) to give a basic idea before jumping in. In your case process explorer (procexp) and process monitor (procmon) performs a decent job. They can be used to get you a basic idea about what type of slowness it is before doing profiling drill down.
Then you can use xperf http://msdn.microsoft.com/en-us/performance/default to drill down. With correct setup, this tool can bring you to the exact function that causes slowness without injecting profiling code into your existing program. There's already a PDC video talking about how to use it http://www.microsoftpdc.com/2009/CL16 and I highly recommend this tool. Per my own experience, it's always better to observe using procexp/procmon first, then targeting your suspects with xperf, because xperf can generate overwhelming load of information if not filtered in a smart way.
In certain hard cases that involving locking contentions, Debugging Tools for Windows (windbg) will be very handy, and there are dedicated books talking about its usage. These books typically talk about hang detection and there are quite a few techniques here can be used to detect slowness, too. (e.g. !runaway)
Maybe you could use ETW for this? Not so sure it will help you see what line causes what, but it should give you a good overall picture of how your app is doing.
To find the CPU/memory/disk usage of the program in real time, you can use the resource monitor and task manager programs that come with windows. You can find the amount of time that a block of code takes relative to the other blocks of code by printing out the systime. Remember not to do too much monitoring at once, because that can throw off your calculations.
If you know how much CPU time that the program takes and what percentage of time the block of code takes, then you can estimate approximately how much CPU time that a block of code takes.

Beyond Stack Sampling: C++ Profilers

A Hacker's Tale
The date is 12/02/10. The days before Christmas are dripping away and I've pretty much hit a major road block as a windows programmer. I've been using AQTime, I've tried sleepy, shiny, and very sleepy, and as we speak, VTune is installing. I've tried to use the VS2008 profiler, and it's been positively punishing as well as often insensible. I've used the random pause technique. I've examined call-trees. I've fired off function traces. But the sad painful fact of the matter is that the app I'm working with is over a million lines of code, with probably another million lines worth of third-party apps.
I need better tools. I've read the other topics. I've tried out each profiler listed in each topic. There simply has to be something better than these junky and expensive options, or ludicrous amounts of work for almost no gain. To further complicate matters, our code is heavily threaded, and runs a number of Qt Event loops, some of which are so fragile that they crash under heavy instrumentation due to timing delays. Don't ask me why we're running multiple event loops. No one can tell me.
Are there any options more along the lines of Valgrind in a windows environment?
Is there anything better than the long swath of broken tools I've already tried?
Is there anything designed to integrate with Qt, perhaps with a useful display of events in queue?
A full list of the tools I tried, with the ones that were really useful in italics:
AQTime: Rather good! Has some trouble with deep recursion, but the call graph is correct in these cases, and can be used to clear up any confusion you might have. Not a perfect tool, but worth trying out. It might suit your needs, and it certainly was good enough for me most of the time.
Random Pause attack in debug mode: Not enough information enough of the time.
A good tool but not a complete solution.
Parallel Studios: The nuclear option. Obtrusive, weird, and crazily powerful. I think you should hit up the 30 day evaluation, and figure out if it's a good fit. It's just darn cool, too.
AMD Codeanalyst: Wonderful, easy to use, very crash-prone, but I think that's an environment thing. I'd recommend trying it, as it is free.
Luke Stackwalker: Works fine on small projects, it's a bit trying to get it working on ours. Some good results though, and it definitely replaces Sleepy for my personal tasks.
PurifyPlus: No support for Win-x64 environments, most prominently Windows 7. Otherwise excellent. A number of my colleagues in other departments swear by it.
VS2008 Profiler: Produces output in the 100+gigs range in function trace mode at the required resolution. On the plus side, produces solid results.
GProf: Requires GCC to be even moderately effective.
VTune: VTune's W7 support borders on criminal. Otherwise excellent
PIN: I'd need to hack up my own tool, so this is sort of a last resort.
Sleepy\VerySleepy: Useful for smaller apps, but failing me here.
EasyProfiler: Not bad if you don't mind a bit of manually injected code to indicate where to instrument.
Valgrind: *nix only, but very good when you're in that environment.
OProfile: Linux only.
Proffy: They shoot wild horses.
Suggested tools that I haven't tried:
XPerf:
Glowcode:
Devpartner:
Notes:
Intel environment at the moment. VS2008, boost libraries. Qt 4+. And the wretched humdinger of them all: Qt/MFC integration via trolltech.
Now: Almost two weeks later, it looks like my issue is resolved. Thanks to a variety of tools, including almost everything on the list and a couple of my personal tricks, we found the primary bottlenecks. However, I'm going to keep testing, exploring, and trying out new profilers as well as new tech. Why? Because I owe it to you guys, because you guys rock. It does slow the timeline down a little, but I'm still very excited to keep trying out new tools.
Synopsis
Among many other problems, a number of components had recently been switched to the incorrect threading model, causing serious hang-ups due to the fact that the code underneath us was suddenly no longer multithreaded. I can't say more because it violates my NDA, but I can tell you that this would never have been found by casual inspection or even by normal code review. Without profilers, callgraphs, and random pausing in conjunction, we'd still be screaming our fury at the beautiful blue arc of the sky. Thankfully, I work with some of the best hackers I've ever met, and I have access to an amazing 'verse full of great tools and great people.
Gentlefolk, I appreciate this tremendously, and only regret that I don't have enough rep to reward each of you with a bounty. I still think this is an important question to get a better answer to than the ones we've got so far on SO.
As a result, each week for the next three weeks, I'll be putting up the biggest bounty I can afford, and awarding it to the answer with the nicest tool that I think isn't common knowledge. After three weeks, we'll hopefully have accumulated a definitive profile of the profilers, if you'll pardon my punning.
Take-away
Use a profiler. They're good enough for Ritchie, Kernighan, Bentley, and Knuth. I don't care who you think you are. Use a profiler. If the one you've got doesn't work, find another. If you can't find one, code one. If you can't code one, or it's a small hang up, or you're just stuck, use random pausing. If all else fails, hire some grad students to bang out a profiler.
A Longer View
So, I thought it might be nice to write up a bit of a retrospective. I opted to work extensively with Parallel Studios, in part because it is actually built on top of the PIN Tool. Having had academic dealings with some of the researchers involved, I felt that this was probably a mark of some quality. Thankfully, I was right. While the GUI is a bit dreadful, I found IPS to be incredibly useful, though I can't comfortably recommend it for everyone. Critically, there's no obvious way to get line-level hit counts, something that AQT and a number of other profilers provide, and I've found very useful for examining rate of branch-selection among other things. In net, I've enjoyed using AQTime as well, and I've found their support to be really responsive. Again, I have to qualify my recommendation: A lot of their features don't work that well, and some of them are downright crash-prone on Win7x64. XPerf also performed admirably, but is agonizingly slow for the sampling detail required to get good reads on certain kinds of applications.
Right now, I'd have to say that I don't think there's a definitive option for profiling C++ code in a W7x64 environment, but there are certainly options that simply fail to perform any useful service.
First:
Time sampling profilers are more robust than CPU sampling profilers. I'm not extremely familiar with Windows development tools so I can't say which ones are which. Most profilers are CPU sampling.
A CPU sampling profiler grabs a stack trace every N instructions.
This technique will reveal portions of your code that are CPU bound. Which is awesome if that is the bottle neck in your application. Not so great if your application threads spend most of their time fighting over a mutex.
A time sampling profiler grabs a stack trace every N microseconds.
This technique will zero in on "slow" code. Whether the cause is CPU bound, blocking IO bound, mutex bound, or cache thrashing sections of code. In short what ever piece of code is slowing your application will standout.
So use a time sampling profiler if at all possible especially when profiling threaded code.
Second:
Sampling profilers generate gobs of data. The data is extremely useful, but there is often too much to be easily useful. A profile data visualizer helps tremendously here. The best tool I've found for profile data visualization is gprof2dot. Don't let the name fool you, it handles all kinds of sampling profiler output (AQtime, Sleepy, XPerf, etc). Once the visualization has pointed out the offending function(s), jump back to the raw profile data to get better hints on what the real cause is.
The gprof2dot tool generates a dot graph description that you then feed into a graphviz tool. The output is basically a callgraph with functions color coded by their impact on the application.
A few hints to get gprof2dot to generate nice output.
I use a --skew of 0.001 on my graphs so I can easily see the hot code paths. Otherwise the int main() dominates the graph.
If you're doing anything crazy with C++ templates you'll probably want to add --strip. This is especially true with Boost.
I use OProfile to generate my sampling data. To get good output I need configure it to load the debug symbols from my 3rd party and system libraries. Be sure to do the same, otherwise you'll see that CRT is taking 20% of your application's time when what's really going on is malloc is trashing the heap and eating up 15%.
What happened when you tried random pausing? I use it all the time on a monster app. You said it did not give enough information, and you've suggested you need high resolution. Sometimes people need a little help in understanding how to use it.
What I do, under VS, is configure the stack display so it doesn't show me the function arguments, because that makes the stack display totally unreadable, IMO.
Then I take about 10 samples by hitting "pause" during the time it's making me wait. I use ^A, ^C, and ^V to copy them into notepad, for reference. Then I study each one, to try to figure out what it was in the process of trying to accomplish at that time.
If it was trying to accomplish something on 2 or more samples, and that thing is not strictly necessary, then I've found a live problem, and I know roughly how much fixing it will save.
There are things you don't really need to know, like precise percents are not important, and what goes on inside 3rd-party code is not important, because you can't do anything about those. What you can do something about is the rich set of call-points in code you can modify displayed on each stack sample. That's your happy hunting ground.
Examples of the kinds of things I find:
During startup, it can be about 30 layers deep, in the process of trying to extract internationalized character strings from DLL resources. If the actual strings are examined, it can easily turn out that the strings don't really need to be internationalized, like they are strings the user never actually sees.
During normal usage, some code innocently sets a Modified property in some object. That object comes from a super-class that captures the change and triggers notifications that ripple throughout the entire data structure, manipulating the UI, creating and desroying obects in ways hard to foresee. This can happen a lot - the unexpected consequences of notifications.
Filling in a worksheet row-by-row, cell-by-cell. It turns out if you build the row all at once, from an array of values, it's a lot faster.
P.S. If you're multi-threaded, when you pause it, all threads pause. Take a look at the call stack of each thread. Chances are, only one of them is the real culprit, and the others are idling.
I've had some success with AMD CodeAnalyst.
Do you have an MFC OnIdle function? In the past I had a near real-time app I had to fix that was dropping serial packets when set at 19.2K speed which a PentiumD should have been able to keep up with. The OnIdle function was what was killing things. I'm not sure if QT has that concept, but I'd check for that too.
Re the VS Profiler -- if it's generating such large files, perhaps your sampling interval is too frequent? Try lowering it, as you probably have enough samples anyway.
And ideally, make sure you're not collecting samples until you're actually exercising the problem area. So start with collection paused, get your program to do its "slow activity", then start collection. You only need at most 20 seconds of collection. Stop collection after this.
This should help reduce your sample file sizes, and only capture what is necessary for your analysis.
I have successfully used PurifyPlus for Windows. Although it is not cheap, IBM provides a trial version that is slightly crippled. All you need for profiling with quantify are pdb files and linking with /FIXED:NO. Only drawback: No support for Win7/64.
Easyprofiler - I haven't seen it mentioned here yet so not sure if you've looked at it already. It takes a slightly different approach in how it gathers metric data. A drawback to using its compile-time profile approach is you have to make changes to the code-base. Thus you'll need to have some idea of where the slow might be and insert profiling code there.
Going by your latest comments though, it sounds like you're at least making some headway. Perhaps this tool might provide some useful metrics for you. If nothing else it has some really purdy charts and pictures :P
Two more tool suggestions.
Luke Stackwalker has a cute name (even if it's trying a bit hard for my taste), it won't cost you anything, and you get the source code. It claims to support multi threaded programs, too. So it is surely worth a spin.
http://lukestackwalker.sourceforge.net/
Also Glowcode, which I've had pointed out to me as worth using:
http://www.glowcode.com/
Unfortunately I haven't done any PC work for a while, so I haven't tried either of these. I hope the suggestions are of help anyway.
Checkout XPerf
This is free, non-invasive and extensible profiler offered by MS. It was developed by Microsoft to profile Windows.
If you're suspicious of the event loop, could overriding QCoreApplication::notify() and dosome manual profiling (one or two maps of senders/events to counts/time)?
I'm thinking that you first log the frequency of event types, then examine those events more carefully (which object sends it, what does it contain, etc). Signals across threads are queued implicitly, so they end up in the event loop (as well explicit queued connections too, obviously).
We've done it to trap and report exceptions in our event handlers, so really, every event goes through there.
Just an idea.
Edit: I see now you mentioned this in your first post. Dammit, I never thought I'd be that guy.
You can use Pin to instrument your code with finer granularity. I think Pin would let you create a tool to count how many times you enter a function or how many clockticks you spend there, roughly emulating something like VTune or CodeAnalyst. Then you could strip down which functions get instrumented until your timing issues go away.
I can tell you what I use everyday.
a) AMD Code Analyst
It is easy, and it will give you a quick overview of what is happening. It will be ok for most of the time.
With AMD CPUs, it will tell you info about the cpu pipeline, but you only need this only if you have heavy loops, like in graphic engines, video codecs, etc.
b) VTune.
It is very well integrated in vs2008
after you know the hotspots, you need to sample not only time, but other things like cache misses, and memory usage. This is very important. Setup a sampling session, and edit the properties. I always sample for time, memory read/write, and cache misses (three different runs)
But more than the tool, you need to get experience with profiling. And that means understanding how the CPU/Memory/PCI works... so, this is my 3rd option
c) Unit testing
This is very important if you are developing a big application that needs huge performance. If you cannot split the app in some pieces, it will be difficult to track cpu usage. I dont test all the cases and classes, but I have hardcoded executions and input files with important features.
My advice is using random sampling in several small tests, and try to standardise a profile strategy.
I use xperf/ETW for all of my profiling needs. It has a steep learning curve but is incredibly powerful. If you are profiling on Windows then you must know xperf. I frequently use this profiler to find performance problems in my code and in other people's code.
In the configuration that I use it:
xperf grabs CPU samples from every core that is executing code every
ms. The sampling rate can be increased to 8 KHz and the samples
include user-mode and kernel code. This allows finding out what a
thread is doing while it is running
xperf records every context
switch (allowing for perfect reconstruction of how much time each
thread uses), plus call stacks for when threads are switched in, plus
call stacks for what thread readied another thread, allowing tracing
of wait chains and finding out why a thread is not running
xperf
records all file I/O from all processes
xperf records all disk I/O
from all processes
xperf records what window is active, the CPU
frequency, CPU power state, UI delays, etc.
xperf can also record all
heap allocations from one process, all virtual allocations from all
processes, and much more.
That's a lot of data, all on one timeline, for all processes. No other profiler on Windows can do that.
I have blogged extensively about how to use xperf/ETW. These blog posts, and some professionally quality training videos, can be found here:
http://randomascii.wordpress.com/2014/08/19/etw-training-videos-available-now/
If you want to find out what might happen if you don't use xperf read these blog posts:
http://randomascii.wordpress.com/category/investigative-reporting/
These are tales of performance problems I have found in other people's code, that should have been found by the developers. This includes mshtml.dll being loaded into the VC++ compiler, a denial of service in VC++'s find-in-files, thermal throttling in a surprising number of customer machines, slow single-stepping in Visual Studio, a 4 GB allocation in a hard-disk driver, a powerpoint performance bug, and more.
I just finished the first usable version of CxxProf, a portable manual instrumented profiling library for C++.
It fulfills the following goals:
Easy integration
Easily remove the lib during compile time
Easily remove the lib during runtime
Support for multithreaded applications
Support for distributed systems
Keep impact on a minimum
These points were ripped from the project wiki, have a look there for more details.
Disclaimer: Im the main developer of CxxProf
Just to throw it out, even though it's not a full-blown profiler: if all you're after is hung event loops that take long processing an event, an ad-hoc tool is simple matter in Qt. That approach could be easily expanded to keep track of how long did each event take to process, and what those events were, and so on. It's not a universal profiler, but an event-loop-centric one.
In Qt, all cross-thread signal-slot calls are delivered via the event loop, as are timers, network and serial port notifications, and all user interaction,. Thus, observing the event loops is a big step towards understanding where the application is spending its time.
DevPartner, originally developed by NuMega and now distributed by MicroFocus, was once the solution of choice for profiling and code analysis (memory and resource leaks for example).
I haven't tried it recently, so I cannot assure you it will help you; but I once had excellent results with it, so that this is an alternative I do consider to re-install in our code quality process (they provide a 14 days trial)
though your os is win7,the programm cann't run under xp?
how about profile it under xp and the result should be a hint for win7.
There are lots of profilers listed here and I've tried a few of them myself - however I ended up writing my own based on this:
http://code.google.com/p/high-performance-cplusplus-profiler/
It does of course require that you modify the code base, but it's perfect for narrowing down bottlenecks, should work on all x86s (could be a problem with multi-core boxes, i.e. it uses rdtsc, however - this is purely for indicative timing anyway - so I find it's sufficient for my needs..)
I use Orbit profiler, easy, open source and powerfull ! https://orbitprofiler.com/