When writing C++ code in VS2005, how can you measure the performance of your code?
Is there a default tool in VS for that? Can I find out which function or class slows down my application?
Are there other external tools which can be integrated into VS in order to measure the bottlenecks in my code?

If you have the Team System edition of Visual Studio 2005, you can use the built-in profiler.

AMD CodeAnalyst is available for free for both Windows and Linux and works on most x86 or x64 CPUs (including Intel's).
It has extra features available when you have an AMD processor, of course. It also integrates into Visual Studio.
I've had pretty good luck with it.
Note that there are generally at least two common forms of profiler:
instrumenting: alters your build to record information at the beginning and end of certain areas (usually per function)
sampling: periodically looks at what code is running to record information
The types of information recorded can include (but are not limited to): elapsed time, # of CPU cycles, cache hits/misses, etc.
Instrumenting can be specific to certain areas of the code (just certain files or just code you compile, not libraries you link to). The overhead is much higher (you're adding code to the project, which takes time to execute, so you're altering timing; you may change program behavior for e.g. interrupt handlers or other timing-dependent code). You're guaranteed that you will get information about the functions/areas you instrument, though.
Sampling can miss very small or very sporadic functions, but modern machines have hardware help to allow you to sample much more thoroughly. Note that some sampling systems may still inject timing differences, although they generally will be much much smaller.
Some profiling tools support a mixture of the above, depending on how you use them.
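To make the instrumenting approach concrete, here is a minimal sketch of what "recording information at the beginning and end of a function" can look like when done by hand. The names ProfileEntry, profile_table and ScopeTimer are made up for illustration, it uses std::chrono (so it needs a C++11 compiler, newer than VS2005), and it is not how any particular profiler is implemented.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical names throughout: ProfileEntry, profile_table(), ScopeTimer.
struct ProfileEntry {
    long long calls = 0;     // number of times the region was entered
    long long total_ns = 0;  // accumulated wall-clock time in nanoseconds
};

// Global table keyed by region name (single-threaded sketch, not thread-safe).
inline std::map<std::string, ProfileEntry>& profile_table() {
    static std::map<std::string, ProfileEntry> table;
    return table;
}

// RAII probe: notes the entry time in the constructor and adds the elapsed
// time to the table in the destructor. This is the extra code an
// instrumenting build places at the start and end of a function.
class ScopeTimer {
public:
    explicit ScopeTimer(const char* name)
        : name_(name), start_(std::chrono::steady_clock::now()) {
        assert(name != nullptr);
    }
    ~ScopeTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        ProfileEntry& e = profile_table()[name_];
        e.calls += 1;
        e.total_ns +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    }
    ScopeTimer(const ScopeTimer&) = delete;
    ScopeTimer& operator=(const ScopeTimer&) = delete;

private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

// Example of an instrumented function.
inline long long sum_to(int n) {
    ScopeTimer probe("sum_to");
    long long s = 0;
    for (int i = 1; i <= n; ++i) s += i;
    return s;
}
```

After a run, profile_table() holds a call count and a total time per instrumented region. Note that the probe itself costs time on every call, which is exactly the overhead described above.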

You could also use Intel VTune.

You want a tool called a profiler. For a free one that covers most simple cases, I recommend Very Sleepy. It works by sampling the application's current call stack at regular intervals.

You can always measure the time and performance of your code yourself. Consult MSDN about the following functions: QueryPerformanceCounter() and QueryPerformanceFrequency().
For more in depth analysis of memory allocation and execution times we use Memory Validator and Performance Validator from Software Verify. They have support for several languages other than C++.
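As a rough sketch of the QueryPerformanceCounter() route mentioned above: the two Win32 calls are real, but the helper names now_seconds and time_call are invented here, and the non-Windows branch is only a stand-in so the snippet builds elsewhere. The lambda in the usage note needs a C++11 compiler.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#else
#include <chrono>
#endif

// Hypothetical helper: returns a monotonically increasing time in seconds.
inline double now_seconds() {
#ifdef _WIN32
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);  // ticks per second
    QueryPerformanceCounter(&count);   // current tick count
    return static_cast<double>(count.QuadPart) /
           static_cast<double>(freq.QuadPart);
#else
    // Fallback (an assumption, not part of the Win32 API) for other platforms.
    typedef std::chrono::steady_clock steady;
    return std::chrono::duration<double>(steady::now().time_since_epoch()).count();
#endif
}

// Hypothetical helper: calls f once and returns the elapsed seconds.
template <typename F>
double time_call(F f) {
    const double start = now_seconds();
    f();
    const double stop = now_seconds();
    assert(stop >= start);
    return stop - start;
}
```

Usage would be something like time_call([&] { do_work(); }), where do_work is whatever you want to measure. A single measurement is noisy, so treat one number with suspicion.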

I think measuring performance, and locating code to optimize, are different problems, and require different methods.
To locate code to optimize, I swear by this simple method, which is orthogonal to accepted wisdom about profiling, and does not require you to buy or install any tools.
To measure performance, I'm content with the simple process of running the subject code in a loop and timing it.
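For what it's worth, the "run it in a loop and time it" process could be sketched like this. The function name ns_per_iteration and its parameters are made up, and taking the best of several trials is my own choice for cutting down noise, not something the method requires.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `body` `iterations` times per trial, repeats that
// for `trials` trials, and returns the smallest average time per iteration in
// nanoseconds. The minimum reduces noise from other processes but does not
// remove it.
template <typename F>
double ns_per_iteration(F body, int iterations, int trials) {
    assert(iterations > 0 && trials > 0);
    typedef std::chrono::steady_clock steady;
    double best = -1.0;
    for (int t = 0; t < trials; ++t) {
        const steady::time_point start = steady::now();
        for (int i = 0; i < iterations; ++i) body();
        const std::chrono::duration<double, std::nano> elapsed =
            steady::now() - start;
        const double per_iter = elapsed.count() / iterations;
        best = (best < 0.0) ? per_iter : std::min(best, per_iter);
    }
    return best;
}
```

Pick the iteration count so that one trial runs long enough to dwarf the timer's own resolution, and make sure the compiler cannot optimize the body away.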
EDIT: BTW, I just looked at Very Sleepy, and it appears to be on the right track. It samples the entire call stack, and retains each stack. What I can't tell is if it gives you, for each call instruction or regular instruction, the fraction of stack samples containing that instruction. In my opinion, that is the most valuable statistic, and it does not need to be very precise.
dotTrace, on the other hand, also looks like maybe it retains stack samples, but its UI presentation of call-stack info seems to be a call-tree. What I would look for is something that shows the stack-residence percentage of individual instructions (or statements), because they could be in different branches of the call-tree, and thus the call-tree could miss their importance.
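To illustrate the statistic I mean, here is a small sketch that takes a set of stack samples and computes, for each call site, the fraction of samples in which it appears. The type StackSample and the function stack_residence are invented for the example, and strings stand in for instruction addresses; no existing tool is claimed to work this way.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// One stack sample: the call sites on the stack at the moment of the pause,
// outermost first. Plain strings stand in for instruction addresses.
typedef std::vector<std::string> StackSample;

// For each call site, returns the fraction of samples (0..1) in which it
// appears anywhere on the stack. A site counts once per sample, even if
// recursion puts it on the stack several times.
inline std::map<std::string, double>
stack_residence(const std::vector<StackSample>& samples) {
    assert(!samples.empty());
    std::map<std::string, int> hits;
    for (const StackSample& sample : samples) {
        const std::set<std::string> unique(sample.begin(), sample.end());
        for (const std::string& site : unique) hits[site] += 1;
    }
    std::map<std::string, double> fraction;
    for (const auto& kv : hits)
        fraction[kv.first] = static_cast<double>(kv.second) / samples.size();
    return fraction;
}
```

Because a site is counted wherever it sits on the stack, the same call instruction reached through different branches of the call tree still adds up to one number, which is the point made above.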

For intrusive measurement, use the performance counters. Since you're using C++, you should use a facade over this slightly painful API. STLSoft has a family of such things, with different pros and cons. I suggest winstl::performance_counter for highest resolution, or winstl::threadtimes_counter if you want to monitor the performance of a particular thread regardless of other activity in your process(es). There was an article about this in Dr Dobb's several years ago, in which the design rationale behind the facades was described in detail.
For non-intrusive measurement, you can't go past VTune.

We use Rational quantify which comes as a part of Rational PurifyPlus set of tools.
It's an excellent tool for profiling application performance.

I've recently tried JetBrains dotTrace profiler and it looks very good. It helped me locate a number of "black holes" in existing C++ code quite easily.
It works fine in Visual Studio 2005 Professional in a solution which mixes C# and C++ - it uses the right function names for both pieces of code and does an integrated analysis. You can trace for time or memory.
It will be a pity when the evaluation period expires :)

We've had good results from AQTime. It's not free but is cheaper than Visual Studio ;-)


GlowCode vs. AQTime C++ profiling performance in the real world?

I am a user of AQTime Pro and while the tool is pretty nice, it does have a horrible performance impact on the application under test if you're not careful. (Even if you are careful, the performance impact is often high for the app I'm mostly profiling.)
I've recently stumbled over GlowCode (found it in a few answers on SO) and while it'll be easy to just download the trial and see how it works on my App, I was wondering if other users could confirm their boasting wrt. profiling performance.
So, I'm looking for real world assessments of the performance impact of GlowCode (vs. AQTime) for native C++ of people who regularly use these products. (I only fire up the profiler every odd month, therefore any assessment on my part will be very limited.)
I have a GlowCode license and in my experience it has very minimal performance impact compared to the other profilers I've used (SciTech .NET Memory Profiler and Visual Studio Ultimate profiler). Though like you, I only fire it up when needed.
I will say that GlowCode's UI is abysmal IMO. Once you understand enough of it to discover the bottlenecks it's okay, but getting there is a hurdle. I did exchange email with the GC devs and they were grateful for the feedback and even changed one thing for me. They did mention that they are working on a UI revamp and maybe the latest version has that, I'm not sure (I have GC 7).
I have never used AQTime Pro so can't offer a comparison there.
You may try out MicroProfiler (there is a performance comparison): its impact is 5-6 times lower than AQTime's and it is open source (free; source code here).
Like GlowCode, it works in real time, and it integrates easily with Visual Studio (2005-2014). Unlike GlowCode, though, it is less fragile (for instance, I couldn't get GlowCode to profile STL classes and algorithms; I always got a bad hook (instrumentation) status for them).
To enable profiling of a particular DLL/EXE, just click 'Enable Profiling' in the project's context menu. Or you can narrow down the area you need to profile by manually setting the '/Gh /GH' command line options on specific files.

I've found gprof to be the best CPU hotspot profiler, and Google Performance Tools to be the best sampling profiler. Both work for C and C++.
In my opinion there are no good profiling tools on Windows.
GNU gprof pros and cons
GCC only
Works with C and C++
Only measures CPU time and code inside the binary; you need everything you wish to profile statically linked in
Very accurate
Adds a small overhead to execution
Google Performance Tools pros and cons
I think it requires the GNU tool chain
Occasionally fails to identify symbols
Very customizable
Outputs to a huge variety of formats, including the Callgrind format, and automatically loads KCacheGrind for you
Has various memory profiling tools also
Is a sampling profiler, with minimal overhead
I would respectfully disagree with Matt.
The tool I use all the time on Windows is the random-pausing technique, and it works with all languages that the IDE supports.
As an example of using it to do performance tuning, this case shows how a speedup of 43 times was achieved through a series of steps.
Gprof has a lot of problems, listed here, and according to the google-perftools manual, some of the same issues are repeated there, such as reporting procedures, not lines, emphasizing self (local) time, emphasizing the graph, etc. (I can't tell from the doc if it samples while blocked.)
As software systems become ever larger, self time becomes less and less relevant. The program counter spends most of its time in library routines or blocked in the system.
Graphs become gigantic nests.
People ask "I know function X is costly, but where in function X is the problem?"
What's more, the "bottlenecks" get bigger and bigger, because the stack gets deeper on average, and every layer of the stack is a fresh opportunity to do more function calls than necessary.
An example of a stack-sampler that reports percent by line, and samples while blocked, and allows user control of sampling so as not to dilute the sample set during user input, is Zoom.
EDIT: Sorry, can't leave well enough alone. Here's a new explanation:
The way programs work, they trace out a call tree, which is a lot like the oak tree outside my window. It has a trunk (main) which sprouts branches (call sites) which sprout further branches for several levels out to leaves (instructions) and acorns (blocking calls).
When the tree surgeon comes to prune (optimize) it, does he look only where the leaves are (hotspots)? Does he ignore acorns (no samples during blocking)?
No, he looks for branches (call sites) that are both heavy (on the stack a lot) and unhealthy (unnecessary). Those are what he prunes.
What random-pausing and Zoom do is help find those call sites.
You can use Callgrind to create profiling output. It is part of Valgrind.
Callgrind output can be viewed with KCacheGrind, which is probably worth a look as long as you're using Linux.
AMD CodeAnalyst is pretty nice. It's also cross platform which is nice when one finds a platform specific bottleneck.

Programming in gamedev (performance related)

I am just wondering how some things work in gamedev:
I know, that the performance is actually crucial so there is still (and I think never will be) no place to use managed languages/platforms as Java/.NET just because of their performance. But... recently I have read somewhere here on SO, that even though people creating games use C++ as a primary language, they actually do not use STL or Boost (or a lot of them). I think it has something to do with performance, right? If I am wrong, could you please tell me what the reasons are to avoid those libraries (that I think make a developer's life much easier)? Is it because of licensing (Boost)? And what about EA's version of STL? Do other studios make their own versions too?
How "close to the metal" is game programming really? Do you go deeper and closer to the machine? Do you sometimes use assembly for critical inner loops, or is C++ actually the lowest abstraction layer that you use at the moment? I assume that in such products, where performance is the most important thing, profiling is a very, very common task. But are you sometimes forced to use assembly to speed some parts up, or is good C++ "good enough"?
Sorry, it may not have been clear, but I am interested in answers from people with game industry experience. I am not interested in assumptions from people who do not have commercial experience in game development. I am also not interested in examples of niche games created in C#/Java or whatever. However, if you know a product that looks better than FarCry2 (just an example; put your favourite modern great-looking game name here), is written entirely in Java/.NET, and has similar performance to FarCry2... do not hesitate to mention it! Thanks.
Contrary to some beliefs, the STL is quite optimized and not at all bad code. The reason most game studios don't use it is memory. You don't have as much control over memory allocation and deallocation as you would if you wrote your own version of the STL containers. This is also the reason why managed languages are not preferred.
Writing your own containers will let you write cross-platform code and do memory tracking more easily. This is especially true on consoles where, for instance, the PS3 requires detailed knowledge of the hardware in order to get the best performance out of it. In the end that usually means that you need full control over memory flow between the PPU, SPUs and RSX.
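To give an idea of the kind of memory tracking meant here, below is a minimal sketch of a counting allocator plugged into a standard container. AllocStats and TrackingAllocator are hypothetical names, the sketch relies on C++11 allocator support, it ignores over-aligned types, and it is not what any particular studio ships; real engines usually go much further (pools, per-subsystem budgets, platform-specific heaps).

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical global counters; a real engine would track per-subsystem budgets.
struct AllocStats {
    static std::size_t& bytes_live() { static std::size_t v = 0; return v; }
    static std::size_t& allocations() { static std::size_t v = 0; return v; }
};

// Minimal C++11 allocator that forwards to operator new/delete and records
// every allocation. Over-aligned types are not handled in this sketch.
template <typename T>
struct TrackingAllocator {
    typedef T value_type;

    TrackingAllocator() {}
    template <typename U>
    TrackingAllocator(const TrackingAllocator<U>&) {}

    T* allocate(std::size_t n) {
        T* p = static_cast<T*>(::operator new(n * sizeof(T)));
        AllocStats::bytes_live() += n * sizeof(T);
        AllocStats::allocations() += 1;
        return p;
    }
    void deallocate(T* p, std::size_t n) {
        assert(AllocStats::bytes_live() >= n * sizeof(T));
        AllocStats::bytes_live() -= n * sizeof(T);
        ::operator delete(p);
    }
};

// All instances share the same global heap, so they always compare equal.
template <typename T, typename U>
bool operator==(const TrackingAllocator<T>&, const TrackingAllocator<U>&) {
    return true;
}
template <typename T, typename U>
bool operator!=(const TrackingAllocator<T>&, const TrackingAllocator<U>&) {
    return false;
}
```

A container such as std::vector<int, TrackingAllocator<int>> then reports every byte it takes from the heap, which is the visibility you give up with the default allocator.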
Assembler is only "required" (in quotes since it's not actually required but helps) for very specialized operations, e.g. math library functions. More common are SIMD intrinsics, which vectorize the code. However, most studios have legacy code which is quite optimized, and since these optimizations are quite low level it's not code that needs to change greatly between hardware generations. I'd say on consoles it's much more common that you use lower level code.
I know, that the performance is actually crucial so there is still (and I think never will be) no place to use managed languages/platforms as Java/.NET just because of their performance.
No, you don't know this. You think it, you want to believe it because you romanticize game development, and because you think high-level languages can't be fast.
.NET performance is perfectly good enough for 90% of the games out there. And it's only going to get better. There is no inherent reason why managed platforms must be slower. They have the potential to be faster because they're JIT'ed. In practice, their performance tends to be about the same as reasonably good C++ code, much better than typical C++ code, and slightly worse than really good C++ code. And most big games use more than one language anyway. They use scripting languages, like Lua or Python, or some home-brewed stuff, all of which are orders of magnitude slower than .NET.
Similarly, there is absolutely no reason why most of your game couldn't be written in .NET. And then the three really performance-critical functions can, if necessary, be ported to native C++ later.
But... recently I have read somewhere here on SO, that even though people creating games use C++ as a primary language, they actually do not use STL or Boost (or a lot of them). I think it has something to do with performance, right? If I am wrong, could you please tell me what the reasons are to avoid those libraries
Same as you're guilty of above... Superstition about game development. "Oh no, we can't afford to use other people's code! It's far too inefficient". Game development is stuck in the 80's in terms of programming practices and methodologies. In other words, don't worry too much about what other game developers do. If the STL or Boost make your code easier to write, then use it! And then, if you experience performance problems, profile it, and if necessary, replace that particular library component with your own.
But most of the STL is literally zero overhead. And 95% of the code in any game is not performance critical. Treat game development like you would any other programming. Don't treat it as some magical land where every line of code must be perfectly optimized and where normal rules don't apply.
And what about EA's version of STL? Do other studios make their own versions too?
As far as I know, no. EA made it partly for internal use, but also as input to the C++ community as a whole, as an example of what they (and a lot of game developers) would like to see from future revisions to the standard (it was submitted to the standards committee as well).
I can recommend the book C++ for game programmers. It has an in-depth discussion of the performance cost of c++ features such as the STL, exceptions and RTTI. It also touches on having your own memory manager, and various common performance optimizations.
Apparently there is a new edition out, but it has a different author. What's up with that?
About 1:
I haven't tried EA's STL version, but I can confirm from my own game development experience that the STL can sometimes be a bottleneck. So far I was always able to find workarounds though.
Boost can be very helpful, but it really depends on the particular part of Boost whether it's helpful or not for performance-critical code. For example, boost::filesystem was very useful for me, whereas boost::signals was barely usable due to very poor performance. So I implemented my own signaling library using FastDelegates instead.
About 2:
Most of the time you will get away with regular C++ code. Once the game is running and you can identify bottlenecks with your profiler, you can start optimizing those pieces of code. And even then, you might not have to write any assembler code if you do it right.
For example, my custom-built 2D game engine runs without hardware acceleration. I developed it when 3D drivers were still quite buggy and most casual gamers had outdated graphics card drivers, so compatibility was more important than pure performance at that time.
Still, in our latest game, Gemsweeper, we are using a lot of alpha blending with 8-bit alpha masks and the game still has to run on 500 MHz CPUs. So alpha blending turned out to be a performance-critical area.
To optimize this, I've used the VectorC compiler so that I could take advantage of MMX, SSE and the like without having to write assembler code. But the same code can be fast on one CPU and slow on the other (e.g. Intel vs. AMD), so I also compiled the alpha blending code several times with different optimization settings. The game runs a benchmark at runtime to find the fastest blitting module for each blitting method and uses that module from then on.
I've compared the result with some other 2D blending libraries, one of which claimed to be pure hand-optimized assembler, and my engine had about the same performance on average, as measured on different CPUs.
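The "benchmark at runtime and keep the fastest module" idea could look roughly like the sketch below. It is not the engine's actual code: BlendFn, blend_div, blend_shift and pick_fastest are invented names, the two candidates are plain C++ variants rather than separately compiled MMX/SSE modules, and it assumes a C++11 compiler.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Signature shared by all candidate routines: blend `src` over `dst` using an
// 8-bit alpha mask, writing the result back into `dst`.
typedef void (*BlendFn)(std::uint8_t* dst, const std::uint8_t* src,
                        const std::uint8_t* alpha, std::size_t n);

// Candidate 1: straightforward integer division by 255.
inline void blend_div(std::uint8_t* dst, const std::uint8_t* src,
                      const std::uint8_t* alpha, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned x = src[i] * alpha[i] + dst[i] * (255u - alpha[i]);
        dst[i] = static_cast<std::uint8_t>(x / 255u);
    }
}

// Candidate 2: same result using shifts instead of a division
// (x / 255 == (x + 1 + (x >> 8)) >> 8 for 0 <= x <= 65025).
inline void blend_shift(std::uint8_t* dst, const std::uint8_t* src,
                        const std::uint8_t* alpha, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned x = src[i] * alpha[i] + dst[i] * (255u - alpha[i]);
        dst[i] = static_cast<std::uint8_t>((x + 1u + (x >> 8)) >> 8);
    }
}

// Times each candidate on a scratch buffer and returns the fastest one.
inline BlendFn pick_fastest(const std::vector<BlendFn>& candidates) {
    assert(!candidates.empty());
    const std::size_t n = 1 << 16;
    std::vector<std::uint8_t> dst(n, 40), src(n, 200), alpha(n, 128);
    BlendFn best = candidates[0];
    double best_time = -1.0;
    for (BlendFn fn : candidates) {
        const auto start = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 20; ++rep)
            fn(dst.data(), src.data(), alpha.data(), n);
        const std::chrono::duration<double> t =
            std::chrono::steady_clock::now() - start;
        if (best_time < 0.0 || t.count() < best_time) {
            best_time = t.count();
            best = fn;
        }
    }
    return best;
}
```

At startup the game would call pick_fastest once per blitting method and store the returned pointer. Which candidate wins depends on the CPU and compiler, so the sketch makes no claim about the outcome.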
Bottom line:
Do not optimize without measuring first, and think about alternatives before starting to write assembler. This usually yields good enough results and will save you a lot of time.
We use 'STL' (i.e. the standard C++ library) and a small amount of Boost. However, some of it is avoided or frowned upon (std::map, std::list, boost::shared_ptr), typically for the over-exuberant memory allocation policies or poor cache coherency. These can typically be worked around, e.g. with custom allocators, but instead we have other approaches to the same problems with their own benefits.
As for how close to the metal it really is, it depends. In our project C++ is the lowest level we go. In another project in this studio, there is a little assembly, especially on the non-PC platforms. These days in certain projects you're just as likely to be limited by the GPU as by the CPU so the days of low level code optimisation are getting fewer and the days of optimising shaders and art assets are growing.
Be wary of claiming that Java/.NET etc is never used however. Not everybody needs the performance of FarCry2 (which is a pretty excessive spec) which is why you're seeing more and more games written in managed languages with C++ just for optimisation.
It is true that in game development, STL is not used. In spite of what certain people always rush to claim, they also never use Java or C# or other managed languages.
I'm not talking about small X-Box Live Arcade downloadable games or web browser games, or such things. I'm talking about high-end development in AAA games.
They don't use STL. However, they do use their own custom implementations that look a lot like STL. There will be smart-arrays, there will be hash tables, there will be smart pointers, they just won't be STL.
Consoles have some performance characteristics that are very different from PCs. Even game projects that exclusively target PC are usually using codebases that have been used for console projects in the past. A lot of tweaking goes into making the basic template structures work as desired.
Most game studios also want code that they can adapt to other platforms. Locking into an implementation from MS/Sony/Nintendo makes for a lot more pain when it comes time to port the game to a new platform. The provided template libraries (which aren't necessarily STL to start with) are often less than stellar. At least they are that way early in the hardware cycle when a studio is ironing out the engine they plan to keep using for the next five years.
At the studios I've worked at, I've certainly seen a fair degree of "not-built-here" attitude to dismiss third party code. Sometimes it's justified, sometimes it's not. In the case of basic data structure templates, it typically is.
As for your second question, assembler is occasionally used. But only in isolated situations where a large volume of math needs to happen very frequently. An entire engine might contain two or three smallish files of asm blocks.
You can find out for yourself (to a degree) by looking at game SDKs.
Almost all the id Tech 4 games (DooM III, Prey, Quake IV, ET:QW) have SDKs out, complete with physics, script, AI, math, etc. systems included. The only asm used is for specialized math code, everything else is pure C++.
Crytek has a Crysis SDK out (you'll need the game installed to install the SDK though) and Far Cry SDK.
Valve has the Source SDK available to anyone who has purchased a Source game through Steam.
There are a lot more if you look. A lot of the code isn't particularly clean or flexible (sometimes not even fast), but I suppose it's easier to adjust things in code you've written as opposed to monolithic libraries full of hard to understand template-fu.
No, you are largely wrong. Both .NET and Java have been used in commercial games, certainly on Windows (and probably on consoles too).
STL is also widely used; I know that quite a large proportion of amateur game developers use it.
Probably the main reason for not using STL is inertia, and using third party libraries/engines which do not.
I imagine that historically on some platforms, good STL implementations were not available, especially on RAM-limited stuff like PS2.

C/C++ compiler feedback optimization

Has anyone seen any real-world numbers for different programs which use the feedback optimization that C/C++ compilers offer to support branch prediction, cache preloading functions, etc.?
I searched for it and, amazingly, not even the popular interpreter development groups seem to have checked the effect. And increasing Ruby, Python, PHP etc. performance by 10% or so should be considered useful.
Is there really no benefit, or is the whole developer community just too lazy to use it?
10% is a good ballpark figure. That said, ...
You have to REALLY care about the performance to go this route. The product I work on (DB2) uses PGO and other invasive and aggressive optimizations. Among the costs are significant build time (triple on some platforms) and development and support nightmares.
When something goes wrong it can be non-trivial to map the fault location in the optimized code back to the source. Developers don't usually expect that functions in different modules can end up merged and inlined and this can have "interesting" effects.
Problems with pointer aliasing, which are nasty to track down, also usually show up with these sorts of optimizations. You have the additional fun of having non-deterministic builds (an aliasing problem can show up in Monday's build, vanish again till Thursday's, ...).
The line between what is correct or incorrect compiler behaviour under these sorts of aggressive optimizations also becomes fairly blurred. Even with the luxury of having our compiler guys in house (literally) the optimization issues (either in our source or the compiler) are still not easy to understand and resolve.
From unladen-swallow (a project optimizing the CPython VM):
For us, the final nail in PyBench's coffin was when experimenting with gcc's feedback-directed optimization tools, we were able to produce a universal 15% performance increase across our macrobenchmarks; using the same training workload, PyBench got 10% slower.
So some people are at least looking at it. That said, PGO sets some pretty tricky requirements on the build environment that are hard to satisfy for open-source projects meant to be built by a distributed heterogeneous group of people. Heavy optimization also creates difficult-to-debug heisenbugs. It's less work to give the compiler explicit hints for the performance critical parts.
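As a sketch of what "explicit hints" can mean with GCC (and Clang): __builtin_expect is a real builtin and -fprofile-generate / -fprofile-use are real GCC flags, but the LIKELY/UNLIKELY macros, the count_valid function and the file names in the comments are placeholders I made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// A PGO build with GCC is typically a two-pass affair (file names are
// placeholders):
//   g++ -O2 -fprofile-generate app.cpp -o app   (instrumented build)
//   ./app < training_input                      (run a representative workload)
//   g++ -O2 -fprofile-use app.cpp -o app        (rebuild using the profile)
//
// The manual alternative: tell the compiler which way a branch usually goes.
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)  // no-op fallback for other compilers
#define UNLIKELY(x) (x)
#endif

// Hypothetical hot loop: counts non-negative values. Negative values are
// assumed to be rare, so that branch is marked as unlikely.
inline std::size_t count_valid(const int* values, std::size_t n) {
    assert(values != nullptr || n == 0);
    std::size_t valid = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (UNLIKELY(values[i] < 0)) continue;  // rare error path
        ++valid;
    }
    return valid;
}
```

The hint only affects code layout and never the result, and a wrong hint can make things slower, so it is worth measuring before and after.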
That said, I expect significant performance increases from runtime profile guided optimization. JIT'ing allows the optimizer to cope with the profile of data changing across the execution of a program and do many extremely runtime data specific optimizations that would explode the code size for static compilation. Especially dynamic languages need good runtime data based optimization to perform well. With dynamic language performance getting significant attention lately (JavaScript VM's, MS DLR, JSR-292, PyPy and so on) there's a lot of work being done in this area.
Traditionally, compiler efficiency is improved via profiling with performance analysis tools. However, how useful the data from those tools is for optimization still depends on the compiler you use. For example, GCC is a framework for producing compilers for different domains. Providing a profiling mechanism in such a compiler framework is extremely difficult.
We can rely on statistical data to do certain optimizations. For instance, GCC unrolls a loop if the loop count is less than a constant (say 7). How it settles on that constant is based on statistical results for the code size generated on different target architectures.
Profile-guided optimizations track the special areas of the source. Details of previous run results need to be stored, which is an overhead. The input, on the other hand, requires a statistical representation of the target applications that may use the compiler. So the complexity rises with the number of different inputs and outputs. In short, deciding on profile-guided optimization needs extensive data collection. Automating or embedding such profiling into the source needs careful monitoring. If not, the entire result will go awry, and in our effort to swim we will actually drown.
However, experimentation in this regard is ongoing. Just have a look at POGO.

What is profiling?

I am new to this and am trying to learn.
What is profiling?
What are various free tools for profiling .NET, Java EE?
Can JavaScript be profiled?
If so, by which tool?
And lastly, how do these profilers work?
Profiling measures how long various parts of the code take to run. JavaScript can be profiled with Firebug: http://getfirebug.com/js.html
Profiling is measuring the execution times and correlating them with various classes/methods/functions. (See the link I gave to the Wikipedia page for some commentary on how profilers can work.)
Think of profilers as debuggers for execution duration bugs.
Profilers are implemented a lot like debuggers too, except that rather than allowing you to stop the program and poke around, they simply let it run and keep track of how much time gets spent in every part of the program. This is particularly useful if you have some code that is running slower than you need it to run, as you can figure out exactly where all the time is going, and concentrate your efforts on fixing just that bottleneck.
Many developers believe you should never hand-optimize code without using a profiler.
The way you would usually use your profiler is as follows:
Start the profiler, fire up your application using the profiler.
Use your application for some time or just the features in your application that you have identified as bottlenecks and would like to optimize.
Once your application is closed (or sometimes even before that), the profiler can present you a breakdown of execution times per function. Some will also allow you to get a breakdown of execution times per line or function within one of these functions, so you can see where most CPU time was used up using a top-down approach.
Usually some functions in your application will take an unusually long time to execute. After looking at your profiling results, you should be able to identify them and eliminate performance problems.
Here are some .NET profilers for you to try (free):
CLR Profiler
I am not a big fan of these. I would recommend one of the commercial products to get the best results:
Other than that take a look at Brad Adams blog posts Profilers for the CLR and .NET Application Profiler.
I personally like dotTrace.
Profiling is a technique for measuring execution times and numbers of invocations of procedures.
It is not however the only or even necessarily the best way to locate things that cause time to be wasted in your code. Look here.
For a different Wikipedia article, try http://en.wikipedia.org/wiki/Performance_tuning#Bottlenecks
For a simple how-to, try http://www.wikihow.com/Optimize-Your-Program%27s-Performance
Wikipedia says:
In software engineering, performance analysis, more commonly today known as profiling, is the investigation of a program's behavior using information gathered as the program executes
Continue reading here http://en.wikipedia.org/wiki/Performance_analysis.
So, as for JavaScript tools, Firebug (http://getfirebug.com/index.html#install) is an excellent option.
Profiling is a measure of execution time at method level (functional statistics) as well as run-time level information collection such as consumption of memory, processor, threads and number of classes (non-functional statistics) loaded over a period of time the application is running. It falls under performance analysis (functional and non-functional statistics collection) of the application in question as run by one user. JConsole is one of the built-in tools to profile Java applications.
Profiling, or program profiling, is a form of dynamic program analysis that measures, for example, the memory space or time complexity of a program, the use of particular instructions, or the frequency and duration of function calls, to mention a few cases. Typically, profiling information is used to aid program optimization and, more specifically, performance engineering. Profiling is accomplished by instrumenting either the program's source code or its binary executable form. Profilers employ different methods, such as event-based, statistical, instrumented, and simulation methods.