Profiling a dll plugin

I want to profile a DLL plugin written in C++. I have access to the source (being the author/maintainer) and can modify it (if needed for instrumentation).
What I don't have is the source/symbols/etc of the host program which is calling the dll. I only have the headers needed to build the plugin.
The dll is invoked upon action from the client.
What is the best way to proceed for profiling the code? It is not realistic to "wrap" an executable around the dll, and it would not be useful anyway: the plugin calls some functions from the host AND I need to profile those paths, so a wrapper would skew the performance.
EDIT after Kieren Johnston's comment: Ideally I would like to hook into the loaded dll just like the debugger is able to (attaching to the running host process and placing a breakpoint somewhere in the dll as needed). Is it possible? If not, I will need to ask another question to ask why :-)
I am using the TFS edition of Visual Studio 2010.
Bonus points for providing suggestions/answers for the same task under AIX (ah, the joys of multiple environments!).

This is possible, albeit a little annoying.
1. Deploy your plug-in DLL to where the host application needs it to be
2. Launch your host application and verify that it is using your plug-in
3. Create a new Performance Session
4. Add the host EXE as a target in the Session from step 3
5. Select Sampling or Instrumentation for your Session
6. Launch the profiling session
During all this, keep your plug-in solution loaded and VS should find the symbols for your plug-in automatically.
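If attaching the profiler to the host turns out to be impractical, the hand instrumentation the question mentions can be sketched roughly as follows. This is only a minimal sketch: it assumes a C++11 compiler (VS2010 does not ship `<chrono>`, so there you would substitute a timer such as QueryPerformanceCounter), and `plugin_on_action`, `ScopedTimer`, `timing_table` and `dump_timings` are hypothetical names, not part of any host API.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

// Accumulated wall-clock time and call count per label.
// Not thread-safe: add a mutex if the host calls the plugin from several threads.
struct TimingStats {
    long long calls = 0;
    std::chrono::nanoseconds total{0};
};

inline std::map<std::string, TimingStats>& timing_table() {
    static std::map<std::string, TimingStats> table;
    return table;
}

// RAII timer: starts on construction, adds the elapsed time to its label on destruction.
// The label must outlive the timer (a string literal is the intended use).
class ScopedTimer {
public:
    explicit ScopedTimer(const char* label)
        : label_(label), start_(std::chrono::steady_clock::now()) {
        assert(label != nullptr);
    }
    ~ScopedTimer() {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        TimingStats& s = timing_table()[label_];
        ++s.calls;
        s.total += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    const char* label_;
    std::chrono::steady_clock::time_point start_;
};

// Hypothetical plugin entry point; the real name and signature come from the host's headers.
int plugin_on_action(int input) {
    ScopedTimer timer("plugin_on_action");
    int result = 0;
    for (int i = 0; i < input; ++i) result += i;  // stand-in for real work and host callbacks
    return result;
}

// Print the table, for example when the plugin is unloaded.
void dump_timings() {
    for (const auto& entry : timing_table()) {
        std::printf("%s: %lld calls, %.3f ms total\n", entry.first.c_str(),
                    entry.second.calls, entry.second.total.count() / 1e6);
    }
}
```

Because the timer wraps the whole entry point, time spent inside host functions called from the plugin is included, which is the path the question wants measured.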

Not sure about VS10, but in older ones, you debug the dll by specifying the exe for running it.

Let's split the problem into two parts: 1) locating what you might call "bottlenecks", and 2) measuring the overall speedup you get by fixing each one.
(2) is easy, right? All you need is an outer timer.
That leaves (1). If you're like most people, you think that finding the "bottlenecks" cannot be done without some kind of precision timing of the parts of the program.
Not so, because most of the time the things you need to fix to get the most speedup are not things you can detect that way.
They are not necessarily bad algorithms, or slow functions, or hotspots.
They are distributed things being done by perfectly innocent-looking well-designed code, that just happen to present a huge speedup opportunity if coded in a different way.
Here's an example where a reasonably well written program had its execution time reduced from 48 seconds to 20, 17, 13, 10, 7, 4, 2.1, and finally 1.1, over 8 iterations.**
That's a compound speedup factor of over 40x.
The speedup factor you can get differs from program to program - some can get less, some can get more, depending on how close they are to optimal in the first place.
There's no mystery about how to do this.
The method was random pausing.
(It's an alternative to using a profiler. Profilers measure various things, and give you various clues that may or may not be helpful, but they don't reliably tell you what the problem is.)
** The speedup factors achieved, per iteration, were 2.38, 1.18, 1.31, 1.30, 1.43, 1.75, 1.90, 1.91. Another way to put it is the percent time reduced in each iteration: 58%, 15%, 24%, 23%, 30%, 43%, 48%, 48%. I get a hard time from profiler fans because the method is so manual, but they never talk about the speedup results.
(Maybe that will change.)
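The arithmetic behind those numbers can be written out as a short sketch. The function names are made up for illustration. Note that the rounded times quoted above give 2.40 for the first factor rather than 2.38, presumably because the original factors were computed from unrounded timings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Speedup factor of each iteration: time before the fix divided by time after it.
std::vector<double> speedup_factors(const std::vector<double>& times) {
    assert(times.size() >= 2);
    std::vector<double> factors;
    for (std::size_t i = 1; i < times.size(); ++i)
        factors.push_back(times[i - 1] / times[i]);
    return factors;
}

// Compound speedup: the product of the per-iteration factors,
// which telescopes to first time / last time.
double compound_speedup(const std::vector<double>& times) {
    assert(!times.empty());
    return times.front() / times.back();
}

// Fraction of the running time removed by one iteration with speedup factor f.
double fraction_removed(double factor) {
    assert(factor > 0.0);
    return 1.0 - 1.0 / factor;
}
```

Fed the sequence 48, 20, 17, 13, 10, 7, 4, 2.1, 1.1, `compound_speedup` returns 48 / 1.1, which is a little under 44 and matches the "over 40x" claim.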

Related

Interlocked API between Windows CE 5 and 6

I'm currently porting a VS2005 C++ application from CE5 to CE6 and I'm experiencing severe performance problems: a single HTTP request retrieving dynamic content takes 40ms on CE5 and 350ms on CE6. These values used to be worse due to a bunch of inefficiencies that I already cleaned up, improving performance on both systems, but at the moment I'm stuck at that latency. For the record, both tests are made on the same machine and the webserver is not the one supplied with CE but a custom one implemented in C++. Note also that the problem is not the network IO: CE6 even outperforms CE5 on the same machine when serving static files. The problem is the dynamic content handling.
While trying to figure out why the program performs so badly, I stumbled across something that puzzled me: Under CE5, the Interlocked* API for x86 uses neither the compiler intrinsics nor real function calls but inline assembly code. This code has a comment saying that the intrinsic includes lock prefixes that are only required for multi-processor systems and that slow down code running on a single-core system such as CE5. On CE6, these functions are implemented using the compiler intrinsics including the lock prefix. Since these functions are used by e.g. Boost and STLport, both of which are used inside the webserver, I was wondering if those could be the culprit.
Another thing I noticed was that some string parsing functions take extremely long. Worse, calling the same function a second time takes less time than the first call, so it seems as if some kind of caching is going on. Since this is a short (<1kB) string received via TCP that is parsed in memory, I can't imagine which cache could be responsible for that. The only candidate would be the instruction cache, but the program is not larger than the CE5 version, and if the code was running from uncached memory it would not show these caching effects.
TLDR - Questions:
1. Is CE6 capable of handling multiple processors at all?
2. Is there an easy way to tell the compiler that it should omit the lock prefix? My current approach to achieve that is to simply copy the inline assembly from the CE5 SDK, but that's beyond ugly.
3. I'd also appreciate any other suggestions what to look at or what to try. Many thanks in advance!
Summary: There is no problem that depends on the executable, let alone on the Interlocked API. Running the same executable proved that. However, running on a different machine with a different platform setup made a difference. We're now back to Platform Builder, trying to figure out the differences between the two platforms.

1. No. WEC7 is required for SMP support. Most likely in CE6 the OEM has disabled the other cores.
2. None that I am aware of.
3. Either use the performance profiling tools or instrument your code with timing calls to narrow down where things are taking too long.

I have finally found the reason for the performance behaviour: it's simply paging. CE6 has a pool manager (see http://blogs.msdn.com/b/ce_base/archive/2008/01/19/paging-and-the-windows-ce-paging-pool.aspx) which handles paging out unused mapped DLLs and EXEs. When the amount of mapped binaries exceeds a certain size, it starts (with low priority) to page out memory. The limit at which it starts paging out is just 3MiB by default, which is rather low for current applications. Also, the cache is not an LRU cache; it simply discards pages in the order they were loaded.
It turns out that our system exceeded this limit, which causes the paging to begin. Due to the algorithm used, it will always throw out pages that are still in use, which then have to be paged in again. The code that serves static files is small, so this wasn't affected as much by this limit. The code that serves dynamic pages is much larger though, so it wreaks havoc on the overall system with IO. This also explains why the problem couldn't be attributed to a specific piece of code: it wasn't the code itself but loading it.
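To illustrate why discarding pages in load order hurts more than LRU, here is a toy simulation. It is not the CE6 pool manager's actual algorithm or data structure, just a sketch of the two policies under an assumed workload; `count_faults` and `hot_cold_workload` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

enum class Policy { Fifo, Lru };

// Count page faults for a sequence of page accesses with a fixed number of frames.
// The deque holds the resident pages; the front is the next eviction victim.
std::size_t count_faults(const std::vector<int>& accesses, std::size_t frames, Policy policy) {
    assert(frames > 0);
    std::deque<int> resident;
    std::size_t faults = 0;
    for (int page : accesses) {
        auto it = std::find(resident.begin(), resident.end(), page);
        if (it != resident.end()) {
            if (policy == Policy::Lru) {  // LRU: a hit moves the page to the back
                resident.erase(it);
                resident.push_back(page);
            }                             // FIFO: a hit changes nothing
            continue;
        }
        ++faults;
        if (resident.size() == frames) resident.pop_front();
        resident.push_back(page);
    }
    return faults;
}

// Assumed workload: one hot page (0) touched before every access to a stream of cold pages.
std::vector<int> hot_cold_workload(int cold_pages) {
    std::vector<int> accesses;
    for (int p = 1; p <= cold_pages; ++p) {
        accesses.push_back(0);
        accesses.push_back(p);
    }
    return accesses;
}
```

With three frames, LRU faults on the hot page only once, while FIFO keeps evicting it even though it was used a moment ago. That mirrors the behaviour described above, where pages still in use are thrown out and paged in again.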
I have detected this via IOCTL_HAL_GET_POOL_PARAMETERS, which gave me the relevant configuration parameters, current state, how often the pageout-thread ran and for how long (although the latter is only the time it took to swap out pages). I should be able to find the resulting page faults in the kernel tracker, too, now that I know what I'm looking for. I could also observe that the activity LED on the CF card adapter now lights up when first loading a file, but not on subsequent requests, where it is taken from cache. This used to always cause the LED to flash on dynamic pages.
The simple solution is to increase the limit for the pool manager, so it doesn't start throwing out things. This can be done easily in config.bib by patching kernel.dll with the appropriate values. Alternatively, reducing the executable size would help, but that's not so easy.

check the performance of an exe through code

I want to check the performance of an application (for which I have the exe but no source code) by running it multiple times and possibly comparing the results. I didn't find much on the internet regarding this topic.
Since I have to do it with multiple inputs, I thought doing it through code (no bar on the language used) could make things easier, as I may have to repeat the runs many times.
Can anyone help me get started?
Note: by performance I mean the memory usage, CPU usage and possibly the time taken to run!
(I'm currently using perfmon on Windows, with the necessary counters to check these parameters, and manually noting the values down.)
Thanks

It strongly depends upon your operating system. On Linux, you could use the time utility. And strace might help you understand the system calls that are used.
I have no idea of the equivalent on Windows systems.

I think that you could create a bash/batch script to call your program as many times as you need and with different inputs.
You could then have your script create a CSV file that contains the time it took to execute your program (start date and end date for example). CSV files are usually compatible with most spreadsheet programs like Excel, so I think that can make it easier for you to process your data, like creating means and standard deviations.
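Since the question allows any language, the same idea can also be sketched in C++ instead of a shell script. This is a minimal sketch that only measures wall-clock time per run (not memory or CPU usage); the function names are illustrative, and the command string is whatever you would type on the command line.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Run a shell command `runs` times and return the wall-clock duration of each run in seconds.
std::vector<double> time_command(const std::string& command, int runs) {
    assert(runs > 0);
    std::vector<double> seconds;
    for (int i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        int status = std::system(command.c_str());
        auto stop = std::chrono::steady_clock::now();
        (void)status;  // the exit status is ignored in this sketch
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}

double mean(const std::vector<double>& v) {
    assert(!v.empty());
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / v.size();
}

// Sample standard deviation (divides by n - 1).
double sample_stddev(const std::vector<double>& v) {
    assert(v.size() >= 2);
    double m = mean(v);
    double sum_sq = 0.0;
    for (double x : v) sum_sq += (x - m) * (x - m);
    return std::sqrt(sum_sq / (v.size() - 1));
}

// One CSV row per run, with a "run,seconds" header, ready for a spreadsheet.
std::string to_csv(const std::vector<double>& seconds) {
    std::ostringstream out;
    out << "run,seconds\n";
    for (std::size_t i = 0; i < seconds.size(); ++i)
        out << (i + 1) << ',' << seconds[i] << '\n';
    return out.str();
}
```

The timings include the cost of starting a shell for each run, so very short-running programs will look slower than they are.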
I don't have much to say regarding the memory and CPU usage, but if you are in Windows it wouldn't hurt to take a look at the Process Explorer and the Process Monitor (you can find them in this page). I think that they might help you in your task.
Finally if you are in Linux I think that you might be able to use grep with the top command to gather some statistics.
Regards,
Felipe

If you want exact results, Rational PurifyPlus (on Windows) or Valgrind (on Linux) are the best tools. Valgrind runs your application on a synthetic CPU, and its Callgrind tool can be instructed to count executed instructions exactly.

In another post a utility named timethis.exe was mentioned for measuring time under Windows. Maybe it is useful for your purposes.

I automated what I had been noting down manually from perfmon:
that is, I used the PerformanceCounter class available in .NET, obtained samples for the particular application at regular intervals, and generated a graph from those values.
Thanks :)

Beyond Stack Sampling: C++ Profilers

A Hacker's Tale
The date is 12/02/10. The days before Christmas are dripping away and I've pretty much hit a major roadblock as a Windows programmer. I've been using AQTime, I've tried Sleepy, Shiny, and Very Sleepy, and as we speak, VTune is installing. I've tried to use the VS2008 profiler, and it's been positively punishing as well as often insensible. I've used the random pause technique. I've examined call-trees. I've fired off function traces. But the sad painful fact of the matter is that the app I'm working with is over a million lines of code, with probably another million lines worth of third-party apps.
I need better tools. I've read the other topics. I've tried out each profiler listed in each topic. There simply has to be something better than these junky and expensive options, or ludicrous amounts of work for almost no gain. To further complicate matters, our code is heavily threaded, and runs a number of Qt Event loops, some of which are so fragile that they crash under heavy instrumentation due to timing delays. Don't ask me why we're running multiple event loops. No one can tell me.
Are there any options more along the lines of Valgrind in a Windows environment?
Is there anything better than the long swath of broken tools I've already tried?
Is there anything designed to integrate with Qt, perhaps with a useful display of events in queue?
A full list of the tools I tried, with the ones that were really useful in italics:
AQTime: Rather good! Has some trouble with deep recursion, but the call graph is correct in these cases, and can be used to clear up any confusion you might have. Not a perfect tool, but worth trying out. It might suit your needs, and it certainly was good enough for me most of the time.
Random Pause attack in debug mode: Not enough information enough of the time.
A good tool but not a complete solution.
Parallel Studios: The nuclear option. Obtrusive, weird, and crazily powerful. I think you should hit up the 30 day evaluation, and figure out if it's a good fit. It's just darn cool, too.
AMD Codeanalyst: Wonderful, easy to use, very crash-prone, but I think that's an environment thing. I'd recommend trying it, as it is free.
Luke Stackwalker: Works fine on small projects, it's a bit trying to get it working on ours. Some good results though, and it definitely replaces Sleepy for my personal tasks.
PurifyPlus: No support for Win-x64 environments, most prominently Windows 7. Otherwise excellent. A number of my colleagues in other departments swear by it.
VS2008 Profiler: Produces output in the 100+gigs range in function trace mode at the required resolution. On the plus side, produces solid results.
GProf: Requires GCC to be even moderately effective.
VTune: VTune's W7 support borders on criminal. Otherwise excellent.
PIN: I'd need to hack up my own tool, so this is sort of a last resort.
Sleepy\VerySleepy: Useful for smaller apps, but failing me here.
EasyProfiler: Not bad if you don't mind a bit of manually injected code to indicate where to instrument.
Valgrind: *nix only, but very good when you're in that environment.
OProfile: Linux only.
Proffy: They shoot wild horses.
Suggested tools that I haven't tried:
XPerf:
Glowcode:
Devpartner:
Notes:
Intel environment at the moment. VS2008, boost libraries. Qt 4+. And the wretched humdinger of them all: Qt/MFC integration via trolltech.
Now: Almost two weeks later, it looks like my issue is resolved. Thanks to a variety of tools, including almost everything on the list and a couple of my personal tricks, we found the primary bottlenecks. However, I'm going to keep testing, exploring, and trying out new profilers as well as new tech. Why? Because I owe it to you guys, because you guys rock. It does slow the timeline down a little, but I'm still very excited to keep trying out new tools.
Synopsis
Among many other problems, a number of components had recently been switched to the incorrect threading model, causing serious hang-ups due to the fact that the code underneath us was suddenly no longer multithreaded. I can't say more because it violates my NDA, but I can tell you that this would never have been found by casual inspection or even by normal code review. Without profilers, callgraphs, and random pausing in conjunction, we'd still be screaming our fury at the beautiful blue arc of the sky. Thankfully, I work with some of the best hackers I've ever met, and I have access to an amazing 'verse full of great tools and great people.
Gentlefolk, I appreciate this tremendously, and only regret that I don't have enough rep to reward each of you with a bounty. I still think this is an important question to get a better answer to than the ones we've got so far on SO.
As a result, each week for the next three weeks, I'll be putting up the biggest bounty I can afford, and awarding it to the answer with the nicest tool that I think isn't common knowledge. After three weeks, we'll hopefully have accumulated a definitive profile of the profilers, if you'll pardon my punning.
Take-away
Use a profiler. They're good enough for Ritchie, Kernighan, Bentley, and Knuth. I don't care who you think you are. Use a profiler. If the one you've got doesn't work, find another. If you can't find one, code one. If you can't code one, or it's a small hang up, or you're just stuck, use random pausing. If all else fails, hire some grad students to bang out a profiler.
A Longer View
So, I thought it might be nice to write up a bit of a retrospective. I opted to work extensively with Parallel Studios, in part because it is actually built on top of the PIN Tool. Having had academic dealings with some of the researchers involved, I felt that this was probably a mark of some quality. Thankfully, I was right. While the GUI is a bit dreadful, I found IPS to be incredibly useful, though I can't comfortably recommend it for everyone. Critically, there's no obvious way to get line-level hit counts, something that AQT and a number of other profilers provide, and I've found very useful for examining rate of branch-selection among other things. In net, I've enjoyed using AQTime as well, and I've found their support to be really responsive. Again, I have to qualify my recommendation: A lot of their features don't work that well, and some of them are downright crash-prone on Win7x64. XPerf also performed admirably, but is agonizingly slow for the sampling detail required to get good reads on certain kinds of applications.
Right now, I'd have to say that I don't think there's a definitive option for profiling C++ code in a W7x64 environment, but there are certainly options that simply fail to perform any useful service.

First:
Time sampling profilers are more robust than CPU sampling profilers. I'm not extremely familiar with Windows development tools so I can't say which ones are which. Most profilers are CPU sampling.
A CPU sampling profiler grabs a stack trace every N instructions.
This technique will reveal portions of your code that are CPU bound, which is awesome if that is the bottleneck in your application. It is not so great if your application threads spend most of their time fighting over a mutex.
A time sampling profiler grabs a stack trace every N microseconds.
This technique will zero in on "slow" code, whether the cause is CPU-bound code, blocking IO, mutex contention, or cache thrashing. In short, whatever piece of code is slowing your application will stand out.
So use a time sampling profiler if at all possible especially when profiling threaded code.
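The difference between the two kinds of sampling comes down to which clock advances while a thread is blocked. The sketch below measures the same piece of work with a wall clock and with the process CPU clock; it is an illustration of the idea, not of how any particular profiler is implemented. It assumes a platform where `std::clock()` reports CPU time (as on Linux); on Microsoft's C runtime `clock()` reports elapsed wall time, so there the two numbers would not diverge. `measure` and `blocked_work` are made-up names.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

struct Elapsed {
    double wall_seconds;
    double cpu_seconds;
};

// Measure a callable with both a wall clock and the process CPU clock.
template <typename F>
Elapsed measure(F&& work) {
    std::clock_t cpu_start = std::clock();
    assert(cpu_start != static_cast<std::clock_t>(-1));  // -1 means CPU time is unavailable
    auto wall_start = std::chrono::steady_clock::now();
    work();
    auto wall_stop = std::chrono::steady_clock::now();
    std::clock_t cpu_stop = std::clock();
    Elapsed e;
    e.wall_seconds = std::chrono::duration<double>(wall_stop - wall_start).count();
    e.cpu_seconds = static_cast<double>(cpu_stop - cpu_start) / CLOCKS_PER_SEC;
    return e;
}

// Stand-in for a thread blocked on a mutex or on IO: it waits without using the CPU.
void blocked_work() {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
}
```

Calling `measure(blocked_work)` should report roughly 0.2 s of wall time and almost no CPU time, which is the kind of cost a CPU sampling profiler would not attribute to that code.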
Second:
Sampling profilers generate gobs of data. The data is extremely useful, but there is often too much to be easily useful. A profile data visualizer helps tremendously here. The best tool I've found for profile data visualization is gprof2dot. Don't let the name fool you, it handles all kinds of sampling profiler output (AQtime, Sleepy, XPerf, etc). Once the visualization has pointed out the offending function(s), jump back to the raw profile data to get better hints on what the real cause is.
The gprof2dot tool generates a dot graph description that you then feed into a graphviz tool. The output is basically a callgraph with functions color coded by their impact on the application.
A few hints to get gprof2dot to generate nice output.
I use a --skew of 0.001 on my graphs so I can easily see the hot code paths. Otherwise the int main() dominates the graph.
If you're doing anything crazy with C++ templates you'll probably want to add --strip. This is especially true with Boost.
I use OProfile to generate my sampling data. To get good output I need to configure it to load the debug symbols from my 3rd party and system libraries. Be sure to do the same, otherwise you'll see that CRT is taking 20% of your application's time when what's really going on is malloc is thrashing the heap and eating up 15%.

What happened when you tried random pausing? I use it all the time on a monster app. You said it did not give enough information, and you've suggested you need high resolution. Sometimes people need a little help in understanding how to use it.
What I do, under VS, is configure the stack display so it doesn't show me the function arguments, because that makes the stack display totally unreadable, IMO.
Then I take about 10 samples by hitting "pause" during the time it's making me wait. I use ^A, ^C, and ^V to copy them into notepad, for reference. Then I study each one, to try to figure out what it was in the process of trying to accomplish at that time.
If it was trying to accomplish something on 2 or more samples, and that thing is not strictly necessary, then I've found a live problem, and I know roughly how much fixing it will save.
There are things you don't really need to know, like precise percents are not important, and what goes on inside 3rd-party code is not important, because you can't do anything about those. What you can do something about is the rich set of call-points in code you can modify displayed on each stack sample. That's your happy hunting ground.
Examples of the kinds of things I find:
During startup, it can be about 30 layers deep, in the process of trying to extract internationalized character strings from DLL resources. If the actual strings are examined, it can easily turn out that the strings don't really need to be internationalized, like they are strings the user never actually sees.
During normal usage, some code innocently sets a Modified property in some object. That object comes from a super-class that captures the change and triggers notifications that ripple throughout the entire data structure, manipulating the UI, creating and destroying objects in ways hard to foresee. This can happen a lot - the unexpected consequences of notifications.
Filling in a worksheet row-by-row, cell-by-cell. It turns out if you build the row all at once, from an array of values, it's a lot faster.
P.S. If you're multi-threaded, when you pause it, all threads pause. Take a look at the call stack of each thread. Chances are, only one of them is the real culprit, and the others are idling.
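The "roughly how much fixing it will save" part is simple arithmetic, sketched below with made-up function names. It assumes the samples are taken at effectively random times, so the share of samples showing an activity approximates the share of wall time it costs. With only about 10 samples the estimate is coarse, which fits the point above that precise percentages are not important.

```cpp
#include <cassert>

// Rough share of wall time spent in an activity seen on `hits` of `samples` stack samples.
double estimated_fraction(int hits, int samples) {
    assert(samples > 0 && hits >= 0 && hits <= samples);
    return static_cast<double>(hits) / samples;
}

// Overall speedup if that share of the time were removed entirely.
double speedup_if_removed(double fraction) {
    assert(fraction >= 0.0 && fraction < 1.0);
    return 1.0 / (1.0 - fraction);
}
```

For example, something unnecessary that shows up on 5 of 10 samples costs roughly half the time, and removing it would make the program about twice as fast.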

I've had some success with AMD CodeAnalyst.

Do you have an MFC OnIdle function? In the past I had to fix a near real-time app that was dropping serial packets when set at 19.2K speed, which a PentiumD should have been able to keep up with. The OnIdle function was what was killing things. I'm not sure if Qt has that concept, but I'd check for that too.

Re the VS Profiler -- if it's generating such large files, perhaps you are sampling too frequently? Try lowering the sampling rate, as you probably have enough samples anyway.
And ideally, make sure you're not collecting samples until you're actually exercising the problem area. So start with collection paused, get your program to do its "slow activity", then start collection. You only need at most 20 seconds of collection. Stop collection after this.
This should help reduce your sample file sizes, and only capture what is necessary for your analysis.

I have successfully used PurifyPlus for Windows. Although it is not cheap, IBM provides a trial version that is slightly crippled. All you need for profiling with quantify are pdb files and linking with /FIXED:NO. Only drawback: No support for Win7/64.

Easyprofiler - I haven't seen it mentioned here yet so not sure if you've looked at it already. It takes a slightly different approach in how it gathers metric data. A drawback to using its compile-time profile approach is you have to make changes to the code-base. Thus you'll need to have some idea of where the slow code might be and insert profiling code there.
Going by your latest comments though, it sounds like you're at least making some headway. Perhaps this tool might provide some useful metrics for you. If nothing else it has some really purdy charts and pictures :P

Two more tool suggestions.
Luke Stackwalker has a cute name (even if it's trying a bit hard for my taste), it won't cost you anything, and you get the source code. It claims to support multi threaded programs, too. So it is surely worth a spin.
http://lukestackwalker.sourceforge.net/
Also Glowcode, which I've had pointed out to me as worth using:
http://www.glowcode.com/
Unfortunately I haven't done any PC work for a while, so I haven't tried either of these. I hope the suggestions are of help anyway.

Check out XPerf
This is a free, non-invasive and extensible profiler offered by MS. It was developed by Microsoft to profile Windows.

If you're suspicious of the event loop, could you override QCoreApplication::notify() and do some manual profiling (one or two maps of senders/events to counts/time)?
I'm thinking that you first log the frequency of event types, then examine those events more carefully (which object sends it, what does it contain, etc). Signals across threads are queued implicitly, so they end up in the event loop (as well as explicit queued connections, obviously).
We've done it to trap and report exceptions in our event handlers, so really, every event goes through there.
Just an idea.
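The bookkeeping side of that idea might look like the sketch below. It is deliberately Qt-free so it can be compiled on its own: `EventStats`, `EventEntry` and `timed_delivery` are hypothetical names, and the `deliver` callable stands in for the call to the base-class `QCoreApplication::notify(receiver, event)` that an overriding `notify()` would make, with `event->type()` as the key.

```cpp
#include <cassert>
#include <chrono>
#include <map>

// Count and total handling time for one event type.
struct EventEntry {
    long long count = 0;
    std::chrono::nanoseconds total{0};
};

// Map from an integer event type id (with Qt, the value of QEvent::type()) to its statistics.
// Not thread-safe: with several event loops, use one instance per thread or add locking.
class EventStats {
public:
    void record(int event_type, std::chrono::nanoseconds elapsed) {
        assert(elapsed.count() >= 0);
        EventEntry& e = entries_[event_type];
        ++e.count;
        e.total += elapsed;
    }

    // Returns a zeroed entry for event types that were never recorded.
    EventEntry entry(int event_type) const {
        auto it = entries_.find(event_type);
        return it == entries_.end() ? EventEntry() : it->second;
    }

private:
    std::map<int, EventEntry> entries_;
};

// Time one event delivery and record it under its type.
template <typename Deliver>
bool timed_delivery(EventStats& stats, int event_type, Deliver&& deliver) {
    auto start = std::chrono::steady_clock::now();
    bool handled = deliver();
    auto elapsed = std::chrono::steady_clock::now() - start;
    stats.record(event_type, std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed));
    return handled;
}
```

Sorting the entries by total time afterwards gives the "frequency of event types" view first, and tells you which event types deserve the closer look at senders and payloads.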

Edit: I see now you mentioned this in your first post. Dammit, I never thought I'd be that guy.
You can use Pin to instrument your code with finer granularity. I think Pin would let you create a tool to count how many times you enter a function or how many clockticks you spend there, roughly emulating something like VTune or CodeAnalyst. Then you could strip down which functions get instrumented until your timing issues go away.

I can tell you what I use everyday.
a) AMD Code Analyst
It is easy, and it will give you a quick overview of what is happening. It will be OK most of the time.
With AMD CPUs, it will tell you info about the CPU pipeline, but you only need this if you have heavy loops, like in graphics engines, video codecs, etc.
b) VTune.
It is very well integrated in VS2008.
After you know the hotspots, you need to sample not only time, but other things like cache misses and memory usage. This is very important. Set up a sampling session, and edit the properties. I always sample for time, memory read/write, and cache misses (three different runs).
But more than the tool, you need to get experience with profiling. And that means understanding how the CPU/Memory/PCI works... so, this is my 3rd option
c) Unit testing
This is very important if you are developing a big application that needs huge performance. If you cannot split the app into pieces, it will be difficult to track CPU usage. I don't test all the cases and classes, but I have hardcoded executions and input files with important features.
My advice is to use random sampling in several small tests, and to try to standardise a profiling strategy.

I use xperf/ETW for all of my profiling needs. It has a steep learning curve but is incredibly powerful. If you are profiling on Windows then you must know xperf. I frequently use this profiler to find performance problems in my code and in other people's code.
In the configuration that I use:
xperf grabs CPU samples from every core that is executing code every ms. The sampling rate can be increased to 8 KHz and the samples include user-mode and kernel code. This allows finding out what a thread is doing while it is running.
xperf records every context switch (allowing for perfect reconstruction of how much time each thread uses), plus call stacks for when threads are switched in, plus call stacks for what thread readied another thread, allowing tracing of wait chains and finding out why a thread is not running.
xperf records all file I/O from all processes.
xperf records all disk I/O from all processes.
xperf records what window is active, the CPU frequency, CPU power state, UI delays, etc.
xperf can also record all heap allocations from one process, all virtual allocations from all processes, and much more.
That's a lot of data, all on one timeline, for all processes. No other profiler on Windows can do that.
I have blogged extensively about how to use xperf/ETW. These blog posts, and some professional-quality training videos, can be found here:
http://randomascii.wordpress.com/2014/08/19/etw-training-videos-available-now/
If you want to find out what might happen if you don't use xperf read these blog posts:
http://randomascii.wordpress.com/category/investigative-reporting/
These are tales of performance problems I have found in other people's code, that should have been found by the developers. This includes mshtml.dll being loaded into the VC++ compiler, a denial of service in VC++'s find-in-files, thermal throttling in a surprising number of customer machines, slow single-stepping in Visual Studio, a 4 GB allocation in a hard-disk driver, a PowerPoint performance bug, and more.

I just finished the first usable version of CxxProf, a portable manual instrumented profiling library for C++.
It fulfills the following goals:
Easy integration
Easily remove the lib during compile time
Easily remove the lib during runtime
Support for multithreaded applications
Support for distributed systems
Keep impact on a minimum
These points were ripped from the project wiki, have a look there for more details.
Disclaimer: I'm the main developer of CxxProf

Just to throw it out, even though it's not a full-blown profiler: if all you're after is hung event loops that take long processing an event, an ad-hoc tool is a simple matter in Qt. That approach could be easily expanded to keep track of how long each event took to process, what those events were, and so on. It's not a universal profiler, but an event-loop-centric one.
In Qt, all cross-thread signal-slot calls are delivered via the event loop, as are timers, network and serial port notifications, and all user interaction. Thus, observing the event loops is a big step towards understanding where the application is spending its time.

DevPartner, originally developed by NuMega and now distributed by MicroFocus, was once the solution of choice for profiling and code analysis (memory and resource leaks for example).
I haven't tried it recently, so I cannot assure you it will help you; but I once had excellent results with it, so it is an alternative I am considering re-installing in our code quality process (they provide a 14-day trial).

Although your OS is Win7, can't the program run under XP?
How about profiling it under XP? The result should give a hint for Win7.

There are lots of profilers listed here and I've tried a few of them myself - however I ended up writing my own based on this:
http://code.google.com/p/high-performance-cplusplus-profiler/
It does of course require that you modify the code base, but it's perfect for narrowing down bottlenecks, should work on all x86s (could be a problem with multi-core boxes, i.e. it uses rdtsc, however - this is purely for indicative timing anyway - so I find it's sufficient for my needs..)

I use Orbit profiler: easy, open source and powerful! https://orbitprofiler.com/

How could running code in the debugger make it faster?

This has never happened to me before. In Visual Studio, I have a piece of code that is executed 300 times; I time every iteration with the performance counter and then average the results.
If I run the code in the debugger I get an average of 1.01 ms; if I run it without the debugger I get 1.8 ms.
I closed all other apps, I rebooted, I tried it many times: always the same timing.
I'm trying to optimize my code, but before throwing myself into changing the code, I want to be sure of my timings, to have something to compare with.
What can cause that strange behaviour?
Edit:
Some clarification:
I'm running the same compiled piece of code: the release build. The only difference is how it is started (F5 vs CTRL-F5).
So, compiler optimization should not be involved.
Since each measured time was very small, I changed the way I benchmark: I'm now timing the 300 iterations as a whole and then dividing by 300. I get the same result.
About caching: the code does some image cross-correlation, with different images at each iteration. The steps of the processing are not modified by the data in the images. So, I think caching is not the problem.

I think I figured it out.
If I add a Sleep(3000) before running the tests, they give the same result.
I think it has something to do with the loading of misc. dlls. In the debugger, the dlls were loaded before any code was executed. Outside the debugger, the dlls were loaded on demand, and one or more were loaded after the timer was started.
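A way to keep such one-time costs out of the measurement, without relying on a Sleep, is to run the code once or twice untimed before starting the timer. The sketch below does that and times the whole loop rather than each iteration. It uses `std::chrono::steady_clock` as a portable stand-in for the performance counter mentioned in the question, and `average_ms` is a made-up helper name.

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock milliseconds per iteration of `work`, after untimed warm-up calls.
// The warm-up absorbs one-time costs such as DLLs being loaded on demand.
template <typename F>
double average_ms(F&& work, int iterations, int warmup_iterations = 1) {
    assert(iterations > 0 && warmup_iterations >= 0);
    for (int i = 0; i < warmup_iterations; ++i) work();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) work();
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / iterations;
}
```

If the numbers inside and outside the debugger agree once a warm-up run is added, that supports the DLL-loading explanation.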
Thanks all.

I don't think anyone has mentioned this yet, but the debug build may not only affect the way your code executes, but also the way the timer itself executes. This can lead to the timer being inaccurate / slower / definitely not reliable. I would recommend using a profiler as others have mentioned, and compare only similar configurations.

You are likely to get very erroneous results by doing it this way ... you should be using a profiler. You should read this article entitled The Perils of MicroBenchmarking:
http://blogs.msdn.com/shawnhar/archive/2009/07/14/the-perils-of-microbenchmarking.aspx

It's probably a compiler optimization that's actually making your code worse. This is extremely rare these days but if you're doing odd, odd stuff, this can happen.
Some debuggers / IDEs like Visual Studio will automatically zero out memory for you in Debug mode; this may be a contributing factor.

Are you running the exact same code in the debugger and outside the debugger, or are you running a debug build in the debugger and a release build outside? If it's the latter, the code isn't the same. If you're running debug and release and seeing the difference, you could turn off optimization in release and see what that does, or run your code in a profiler in both debug and release and see what changes.

The debug version initializes variables to 0 (usually), while a release binary does not initialize variables (unless the code explicitly does).
This may affect what the code is doing: the size of a loop, or a whole host of other possibilities.
Set the warning level to the highest level (level 4, default 3).
Set the flag that says treat warnings as errors.
Recompile and re-test.

Before you dive into an optimization session, get some facts:
Does it make a difference? Does this application run twice as slow when measured over a reasonable length of time?
How are the debug and release builds configured?
What is the state of this project? Is it complete software, or are you profiling a single function?
How are you running the debug and release builds? Are you sure you are testing under the same conditions (e.g. process priority settings)?
Suppose you do optimize the code: what do you have in mind?

Having read your additional data, a distant bell started to ring ...
When a program runs in the debugger, the debugger catches both C++ exceptions and structured exceptions (Windows exceptions).
One event that will trigger a structured exception is a divide by zero. It is possible that the debugger quickly catches and dismisses this event (as first-chance exception handling), while the code running outside the debugger takes a bit longer before doing something about it.
So if your code might be generating such exceptions, or similar ones, it is worth your while to look into it.

Visual Studio 2008 Profiler - Instrumented produces strange results

I run the Visual Studio 2008 profiler on a "RelDebug" build of my app. Optimizations are on, but inlining is only moderate, stack frames are present, and symbols are emitted. In other words, RelDebug is a somewhat optimized build that can be debugged (although the usual Release caveats about inspecting variables apply).
I run both the Sampling, and the Instrumented profiler on separate runs.
Result? The Sampling profiler produces a result that looks reasonable. However, when I look at the Instrumented profiler results, I see functions that should not even be near the top of the list coming out on top.
For example, a function like "SetFont" that consists of only 1 line assigning the height to a class member. Or "SetClipRect" that merely assigns a rectangle.
Of course I am looking at "Exclusive" stats (i.e. minus children).
Has this happened to anyone else? It always seems to happen once my application has grown to a certain size. It makes the instrumented profiler useless at that point.
I figured out the problem. Both the Visual Studio 2008 and the Visual Studio 2010 profilers are mediocre (to put it politely). I bought Intel C++ Studio, which comes with VTune Amplifier (a profiler). Using the Intel profiler on the exact same code, I was able to get profiler results that actually made sense.

You say "of course you are looking at Exclusive". Look at inclusive stats. In all but the simplest programs or algorithms, nearly all the time is spent in subroutines and functions, so if you've got a performance problem, it most likely consists of calls you didn't know were time-hogs.
The method I rely on is this. Assuming you are trying to find out what you could fix to make the code faster, it will find it, while not wasting your time with high-precision statistics about things that are not problems.

There's no bug. Sampling cannot tell you how much time you spent per call. The profiler just counts how many times the sampling timer landed in that specific function. Since SetFont is not frequently called, you don't get many hits in that function, and you get the impression that it is not time consuming.
On the other hand, when you run instrumentation, the profiler counts every call and measures the execution time of every function. That is why you get accurate information about each function's CPU consumption.
When examining instrumentation results, you must always look at the number of calls as well. Since SetFont is more or less an API function, it doesn't matter whether you look at exclusive or inclusive time. The only things that matter are its overall time and how frequently it's called.
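To make the "look at the number of calls" advice concrete, here is a small sketch that turns a report row into a per-call figure. ProfileRow and per_call_us are hypothetical names, not part of any Visual Studio profiler API, and the numbers in the note below are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// One row of an instrumented-profiler report (hypothetical layout).
struct ProfileRow {
    std::uint64_t calls;   // number of times the function was entered
    double exclusive_ms;   // total time in the function itself, children excluded
};

// Average exclusive time per call, in microseconds.
inline double per_call_us(const ProfileRow& row) {
    assert(row.calls > 0);
    return row.exclusive_ms * 1000.0 / static_cast<double>(row.calls);
}
```

With made-up numbers, a row with 2,000,000 calls and 500 ms of exclusive time averages 0.25 microseconds per call, while a row with 10 calls and the same 500 ms averages 50,000 microseconds per call. Both rows rank equally by total time, but they point to very different problems.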
