Speed comparison - Template specialization vs. Virtual Function vs. If-Statement

Speed comparison - Template specialization vs. Virtual Function vs. If-Statement - c++

Just to get it out of the way...
Premature optimization is the root of all evil
Make use of OOP
etc.
I understand. Just looking for some advice regarding the speed of certain operations that I can store in my grey matter for future reference.
Say you have an Animation class. An animation can be looped (plays over and over) or not looped (plays once), it may have unique frame times or not, etc. Let's say there are 3 of these "either or" attributes. Note that any method of the Animation class will at most check for one of these (i.e. this isn't a case of a giant branch of if-elseif).
Here are some options.
1) Give it boolean members for the attributes given above, and use an if statement to check against them when playing the animation to perform the appropriate action.
Problem: Conditional checked every single time the animation is played.
2) Make a base animation class, and derive other animations classes such as LoopedAnimation and AnimationUniqueFrames, etc.
Problem: Vtable check upon every call to play the animation given that you have something like a vector<Animation>. Also, making a separate class for all of the possible combinations seems code bloaty.
3) Use template specialization, and specialize those functions that depend on those attributes. Like template<bool looped, bool uniqueFrameTimes> class Animation.
Problem: The problem with this is that you couldn't just have a vector<Animation> for something's animations. Could also be bloaty.
I'm wondering what kind of speed each of these options offer? I'm particularly interested in the 1st and 2nd option because the 3rd doesn't allow one to iterate through a general container of Animations.
In short, what is faster - a vtable fetch or a conditional?

(1) Not that the size of the generated assembly matters anymore these days, but this is what it generates (approximately, assuming MSVC on x86):
mov eax, [ecx+12] ; 'this' pointer stored in ecx, eax is scratch
cmp eax, 0 ; test for 0
jz .somewhereElse ; jump if the bool isn't set
The optimizing compiler will intersperse other instructions there, making it more pipeline-friendly. The contents of your class will most likely be in your cache anyway, and if it's not, it will be needed a few cycles later anyway. So, in retrospect, that's maybe a few cycles, and for something that will be called at most a few times per frame, that's nothing.
(2) This is approximately the assembly that will be generated every time your play() method is called:
mov eax, [ebp+4] ; pointer to your Animation* somewhere on the stack, eax is scratch
mov eax, [eax+12] ; dereference the vtable
call eax ; call it
Then, you'll have some duplicate code or another function call inside your specialized play() function, since there'll definetely be some common stuff, so that incurs some overhead (in code size and/or execution speed). So, this is definetely slower.
Also, this makes it alot harder to load generic animations. Your graphics department won't be happy.
(3) To use this effectively, you'll end up making a base class for your templated version anyway, with virtual functions (in that case, see (2)), OR you'll do it manually by checking types in places where you call your animation thing, in which case also see (2).
This also makes it MUCH harder to load generic animations. Your graphics department will be even less happy.
(4) What you need to worry about is not some microoptimization for tiny things done at most a few times a frame. From reading your post, i actually identified another problem that's commonly overlooked. You're mentioning std::vector<Animation>. Nothing against the STL, but that's bad voodoo. A single memory allocation will cost you more cycles than all the boolean checks in your play() or update() methods for probably the entire time your application is running. Putting Animations in and out of std::vectors (especially if you're putting in instances and not pointers (smart or dumb) to instances) will cost you way more.
You need to look at different places to optimize. This is such a ridiculous microoptimization that will bring you no benefit except make it harder to generalize and make your graphics department happy. What will matter, however, is worrying about memory allocation, and THEN, when you're done programming that part, starting a profiler and looking where the hot spots are.
If keeping your animations is actually becoming a bottleneck, the std::vector (nice as it is) is where you might want to look. Have you looked at, say, an intrusive linked list? That will actually be more benefit than worrying about this.

(Edited for brevity.)
The compiler, CPU, and OS all can change the answer, here:
CPU: instruction/data cache size, architecture, and behavior, especially any intelligent prefetch
CPU: branch prediction and speculative execution behavior
CPU: the penalty for a mispredicted branch
compiler and CPU: the availability and relative cost of conditionally-executed instructions (helps with branch cases that only cover a few instructions)
compiler or linker: optimizations that may transform your code and remove branches
In short, as Blindy said in the comments: test it. =)
If you're writing for a modern desktop OS or OSes, enlist the help of a profiling tool (valgrind, shark, codeanalyst, vtune, etc) -- it may give you details you never even knew you could look for, such as cache misses, branch mispredicts, etc.
Even if you don't find a great answer, you'll learn something from applying the tool. I often find looking at the disassembly quite instructive, too (see some of the other answers in this thread).
Some slightly more speculative notes:
vtable tends to result in a load (this+0), offset, second load, and then branch on the contents of the register. You can see this in some of the other answers. Most CPUs that I'm familiar with are miserable at predicting branches from registers.
the bool may be near other data you're using and as such may already be cached. The branch target is also likely to be fixed and therefore a lot more friendly for prediction and/or speculative execution.
on some processors (rarer these days), it costs more to load a bool than an int.
on an ARM processor I work with, we occasionally tuck the vtables in "tightly coupled memory" on the processor core. Decreases the indirect load time considerably -- it's as if the vtable is always in-cache or better.
As you mentioned, the usual rule applies: do what fits requirements and is flexible/maintainable/readable first, then optimize.
Further reading / other patterns to pursue:
FastDelegate, which makes component based systems much easier to deal with
Pitfalls of Object-Oriented Programming slides, which discuss how to get more out of CPU and CPU caches in general
Both the "Data Oriented Design" and the "Component-Based Entity" paradigms are useful to keep in your brain for games, multimedia engines, and other things where you have a greater-than-average demand for performance and still want to keep your code somewhat organized. YMMV, of course. =)

Vtable is very very fast. So are simple conditionals. They translate to single digits of CPU instructions. Worrying about this kind of performance gets you in the murky waters of compiler optimisations, where you don't at all understand what the compiler is doing. Chances are, very subtle changes in your program can trump the minute differences between an if statement and a vtable.
I did a little test a while ago testing differences between RTTI multiple dispatch and vtable. In release mode a dispatch between three objects (two vtable calls) done over two million iterations take 62 milliseconds. That is way way not even worth worrying about.

Who says #3 makes it impossible to have a generic container of animations? There are several approaches one can use. They do all boil down to eventually making a polymorphic call but the options are there. Consider this:
std::vector<boost::any> generic_container;
function(generic_container[0]);
void function(boost::any & a)
{
my_operation::execute(a.type().name(), a);
}
my_operation just needs to have a way of registering and filtering operations by type name. It searches for a functor that operates on whatever a represents, and uses it. The functor then any_casts to the appropriate time and does the type specific operation.
Or use a visitor framework. The above is sort of a variation of that but at too generic a level to really qualify.
And there are more possible methods. Instead of storing animations you could store a type that hides the specifics and executes the correct view options when activated. One virtual is called but it is specific to switching out concrete types that do more complex operations on each other.
There is no general answer to your question in other words. Depending on what you need you could reach all kinds of levels of complexity to make almost your entire program compile time polymorphic as opposed to run-time.

Related

Branch-aware programming

I'm reading around that branch misprediction can be a hot bottleneck for the performance of an application. As I can see, people often show assembly code that unveil the problem and state that programmers usually can predict where a branch could go the most of the times and avoid branch mispredictons.
My questions are:
Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?
What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?
Code examples and benchmarks are welcome.

people often ... and state that programmers usually can predict where a branch could go
(*) Experienced programmers often remind that human programmers are very bad at predicting that.
1- Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?
Not in standard c++ or c. At least not for a single branch. What you can do is minimize the depth of your dependency chains so that branch mis-prediction would not have any effect. Modern cpus will execute both code paths of a branch and drop the one that wasn't chosen. There's a limit to this however, which is why branch prediction only matters in deep dependency chains.
Some compilers provide extension for suggesting the prediction manually such as __builtin_expect in gcc. Here is a stackoverflow question about it. Even better, some compilers (such as gcc) support profiling the code and automatically detect the optimal predictions. It's smart to use profiling rather than manual work because of (*).
2- What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?
Primarily, you should keep in mind that branch mis-prediction is only going to affect you in the most performance critical part of your program and not to worry about it until you've measured and found a problem.
But what can I do when some profiler (valgrind, VTune, ...) tells that on line n of foo.cpp I got a branch prediction penalty?
Lundin gave very sensible advice
Measure fo find out whether it matters.
If it matters, then
Minimize the depth of dependency chains of your calculations. How to do that can be quite complicated and beyond my expertise and there's not much you can do without diving into assembly. What you can do in a high level language is to minimize the number of conditional checks (**). Otherwise you're at the mercy of compiler optimization. Avoiding deep dependency chains also allows more efficient use of out-of-order superscalar processors.
Make your branches consistently predictable. The effect of that can be seen in this stackoverflow question. In the question, there is a loop over an array. The loop contains a branch. The branch depends on size of the current element. When the data was sorted, the loop could be demonstrated to be much faster when compiled with a particular compiler and run on a particular cpu. Of course, keeping all your data sorted will also cost cpu time, possibly more than the branch mis-predictions do, so, measure.
If it's still a problem, use profile guided optimization (if available).
Order of 2. and 3. may be switched. Optimizing your code by hand is a lot of work. On the other hand, gathering the profiling data can be difficult for some programs as well.
(**) One way to do that is transform your loops by for example unrolling them. You can also let the optimizer do it automatically. You must measure though, because unrolling will affect the way you interact with the cache and may well end up being a pessimization.

As a caveat, I'm not a micro-optimization wizard. I don't know exactly how the hardware branch predictor works. To me it's a magical beast against which I play scissors-paper-stone and it seems to be able to read my mind and beat me all the time. I'm a design & architecture type.
Nevertheless, since this question was about a high-level mindset, I might be able to contribute some tips.
Profiling
As said, I'm not a computer architecture wizard, but I do know how to profile code with VTune and measure things like branch mispredictions and cache misses and do it all the time being in a performance-critical field. That's the very first thing you should be looking into if you don't know how to do this (profiling). Most of these micro-level hotspots are best discovered in hindsight with a profiler in hand.
Branch Elimination
A lot of people are giving some excellent low-level advice on how to improve the predictability of your branches. You can even manually try to aid the branch predictor in some cases and also optimize for static branch prediction (writing if statements to check for the common cases first, e.g.). There's a comprehensive article on the nitty-gritty details here from Intel: https://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts.
However, doing this beyond a basic common case/rare case anticipation is very hard to do and it is almost always best saved for later after you measure. It's just too difficult for humans to be able to accurately predict the nature of the branch predictor. It's far more difficult to predict than things like page faults and cache misses, and even those are almost impossible to perfectly humanly-predict in a complex codebase.
However, there is an easier, high-level way to mitigate branch misprediction, and that's to avoid branching completely.
Skipping Small/Rare Work
One of the mistakes I commonly made earlier in my career and see a lot of peers trying to do when they're starting out, before they've learned to profile and are still going by hunches, is to try to skip small or rare work.
An example of this is memoizing to a large look-up table to avoid repeatedly doing some relatively-cheap computations, like using a look-up table that spans megabytes to avoid repeatedly calling cos and sin. To a human brain, this seems like it's saving work to compute it once and store it, except often loading the memory from this giant LUT down through the memory hierarchy and into a register often ends up being even more expensive than the computations they were intended to save.
Another case is adding a bunch of little branches to avoid small computations which are harmless to do unnecessarily (won't impact correctness) throughout the code as a naive attempt at optimization, only to find the branching costs more than just doing unnecessary computations.
This naive attempt at branching as an optimization can also apply even for slightly-expensive but rare work. Take this C++ example:
struct Foo
{
...
Foo& operator=(const Foo& other)
{
// Avoid unnecessary self-assignment.
if (this != &other)
{
...
}
return *this;
}
...
};
Note that this is somewhat of a simplistic/illustrative example as most people implement copy assignment using copy-and-swap against a parameter passed by value and avoid branching anyway no matter what.
In this case, we're branching to avoid self-assignment. Yet if self-assignment is only doing redundant work and doesn't hinder the correctness of the result, it can often give you a boost in real-world performance to simply allow the self-copying:
struct Foo
{
...
Foo& operator=(const Foo& other)
{
// Don't check for self-assignment.
...
return *this;
}
...
};
... this can help because self-assignment tends to be quite rare. We're slowing down the rare case by redundantly self-assigning, but we're speeding up the common case by avoiding the need to check in all other cases. Of course that's unlikely to reduce branch mispredictions significantly since there is a common/rare case skew in terms of the branching, but hey, a branch that doesn't exist can't be mispredicted.
A Naive Attempt at a Small Vector
As a personal story, I formerly worked in a large-scale C codebase which often had a lot of code like this:
char str[256];
// do stuff with 'str'
... and naturally since we had a pretty extensive user base, some rare user out there would eventually type in a name for a material in our software that was over 255 characters in length and overflow the buffer, leading to segfaults. Our team was getting into C++ and started porting a lot of these source files to C++ and replacing such code with this:
std::string str = ...;
// do stuff with 'str'
... which eliminated those buffer overruns without much effort. However, at least back then, containers like std::string and std::vector were heap(free store)-allocated structures, and we found ourselves trading correctness/safety for efficiency. Some of these replaced areas were performance-critical (called in tight loops), and while we eliminated a lot of bug reports with these mass replacements, the users started noticing the slowdowns.
So then we wanted something which was like a hybrid between these two techniques. We wanted to be able to slap something in there to achieve safety over the C-style fixed-buffer variants (which were perfectly fine and very efficient for common-case scenarios), but still work for the rare-case scenarios where the buffer wasn't big enough for user inputs. I was one of the performance geeks on the team and one of the few using a profiler (I unfortunately worked with a lot of people who thought they were too smart to use one), so I got called into the task.
My first naive attempt was something like this (vastly simplified: the actual one used placement new and so forth and was a fully standard-compliant sequence). It involves using a fixed-size buffer (size specified at compile-time) for the common case and a dynamically-allocated one if the size exceeded that capacity.
template <class T, int N>
class SmallVector
{
public:
...
T& operator[](int n)
{
return num < N ? buf[n]: ptr[n];
}
...
private:
T buf[N];
T* ptr;
};
This attempt was an utter fail. While it didn't pay the price of the heap/free store to construct, the branching in operator[] made it even worse than std::string and std::vector<char> and was showing up as a profiling hotspot instead of malloc (our vendor implementation of std::allocator and operator new used malloc under the hood). So then I quickly got the idea to simply assign ptr to buf in the constructor. Now ptr points to buf even in the common case scenario, and now operator[] can be implemented like this:
T& operator[](int n)
{
return ptr[n];
}
... and with that simple branch elimination, our hotspots went away. We now had a general-purpose, standard-compliant container we could use that was just about as fast as the former C-style, fixed-buffer solution (only difference being one additional pointer and a few more instructions in the constructor), but could handle those rare-case scenarios where the size needed to be larger than N. Now we use this even more than std::vector (but only because our use cases favor a bunch of teeny, temporary, contiguous, random-access containers). And making it fast came down to just eliminating a branch in operator[].
Common Case/Rare Case Skewing
One of the things learned after profiling and optimizing for years is that there's no such thing as "absolutely-fast-everywhere" code. A lot of the act of optimization is trading an inefficiency there for greater efficiency here. Users might perceive your code as absolutely-fast-everywhere, but that comes from smart tradeoffs where the optimizations are aligning with the common case (common case being both aligned with realistic user-end scenarios and coming from hotspots pointed out from a profiler measuring those common scenarios).
Good things tend to happen when you skew the performance towards the common case and away from the rare case. For the common case to get faster, often the rare case must get slower, yet that's a good thing.
Zero-Cost Exception-Handling
An example of common case/rare case skewing is the exception-handling technique used in a lot of modern compilers. They apply zero-cost EH, which isn't really "zero-cost" all across the board. In the case that an exception is thrown, they're now slower than ever before. Yet in the case where an exception isn't thrown, they're now faster than ever before and often faster in successful scenarios than code like this:
if (!try_something())
return error;
if (!try_something_else())
return error;
...
When we use zero-cost EH here instead and avoid checking for and propagating errors manually, things tend to go even faster in the non-exceptional cases than this style of code above. Crudely speaking, it's due to the reduced branching. Yet in exchange, something far more expensive has to happen when an exception is thrown. Nevertheless, that skew between common case and rare case tends to aid real-world scenarios. We don't care quite as much about the speed of failing to load a file (rare case) as loading it successfully (common case), and that's why a lot of modern C++ compilers implement "zero-cost" EH. It is again in the interest of skewing the common case and rare case, pushing them further away from each in terms of performance.
Virtual Dispatch and Homogeneity
A lot of branching in object-oriented code where the dependencies flow towards abstractions (stable abstractions principle, e.g.), can have a large bulk of its branching (besides loops of course, which play well to the branch predictor) in the form of dynamic dispatch (virtual function calls or function pointer calls).
In these cases, a common temptation is to aggregate all kinds of sub-types into a polymorphic container storing a base pointer, looping through it and calling virtual methods on each element in that container. This can lead to a lot of branch mispredictions, especially if this container is being updated all the time. The pseudocode might look like this:
for each entity in world:
entity.do_something() // virtual call
A strategy to avoid this scenario is to start sorting this polymorphic container based on its sub-types. This is a fairly old-style optimization popular in the gaming industry. I don't know how helpful it is today, but it is a high-level kind of optimization.
Another way I've found to be definitely still be useful even in recent cases which achieves a similar effect is to break the polymorphic container apart into multiple containers for each sub-type, leading to code like this:
for each human in world.humans():
human.do_something()
for each orc in world.orcs():
orc.do_something()
for each creature in world.creatures():
creature.do_something()
... naturally this hinders the maintainability of the code and reduces the extensibility. However, you don't have to do this for every single sub-type in this world. We only need to do it for the most common. For example, this imaginary video game might consist, by far, of humans and orcs. It might also have fairies, goblins, trolls, elves, gnomes, etc., but they might not be nearly as common as humans and orcs. So we only need to split the humans and orcs away from the rest. If you can afford it, you can also still have a polymorphic container that stores all of these subtypes which we can use for less performance-critical loops. This is somewhat akin to hot/cold splitting for optimizing locality of reference.
Data-Oriented Optimization
Optimizing for branch prediction and optimizing memory layouts tends to kind of blur together. I've only rarely attempted optimizations specifically for the branch predictor, and that was only after I exhausted everything else. Yet I've found that focusing a lot on memory and locality of reference did make my measurements result in fewer branch mispredictions (often without knowing exactly why).
Here it can help to study data-oriented design. I've found some of the most useful knowledge relating to optimization comes from studying memory optimization in the context of data-oriented design. Data-oriented design tends to emphasize fewer abstractions (if any), and bulkier, high-level interfaces that process big chunks of data. By nature such designs tend to reduce the amount of disparate branching and jumping around in code with more loopy code processing big chunks of homogeneous data.
It often helps, even if your goal is to reduce branch misprediction, to focus more on consuming data more quickly. I've found some great gains before from branchless SIMD, for example, but the mindset was still in the vein of consuming data more quickly (which it did, and thanks to some help from here on SO like Harold).
TL;DR
So anyway, these are some strategies to potentially reduce branch mispredictions throughout your code from a high-level standpoint. They're devoid of the highest level of expertise in computer architecture, but I'm hoping this is an appropriate kind of helpful response given the level of the question being asked. A lot of this advice is kind of blurred with optimization in general, but I've found that optimizing for branch prediction often needs to be blurred with optimizing beyond it (memory, parallelization, vectorization, algorithmic). In any case, the safest bet is to make sure you have a profiler in your hand before you venture deep.

Linux kernel defines likely and unlikely macros based on __builtin_expect gcc builtins:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
(See here for the macros definitions in include/linux/compiler.h)
You can use them like:
if (likely(a > 42)) {
/* ... */
}
or
if (unlikely(ret_value < 0)) {
/* ... */
}

In general it's a good idea to keep hot inner loops well proportioned to the cache sizes most commonly encountered. That is, if your program handles data in lumps of, say, less than 32kbytes at a time and does a decent amount of work on it then you're making good use of the L1 cache.
In contrast if your hot inner loop chews through 100MByte of data and performs only one operation on each data item, then the CPU will spend most of the time fetching data from DRAM.
This is important because part of the reason CPUs have branch prediction in the first place is to be able to pre-fetch operands for the next instruction. The performance consequences of a branch mis-prediction can be reduced by arranging your code so that there's a good chance that the next data comes from L1 cache no matter what branch is taken. Whilst not a perfect strategy, L1 cache sizes seem to be universally stuck on 32 or 64K; it's almost a constant thing across the industry. Admittedly coding in this way is not often straightforward, and relying on profile driven optimisation, etc. as recommended by others is probably the most straightforward way ahead.
Regardless of anything else, whether or not a problem with branch mis-prediction will occur varies according to the CPU's cache sizes, what else is running on the machine, what the main memory bandwidth / latency is, etc.

Perhaps the most common techniques is to use separate methods for normal and error returns. C has no choice, but C++ has exceptions. Compilers are aware that the exception branches are exceptional and therefore unexpected.
This means that exception branches are indeed slow, as they're unpredicted, but the non-error branch is made faster. On average, this is a net win.

1- Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?
Avoid? Perhaps not. Reduce? Certainly...
2- What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?
It is worth noting that optimisation for one machine isn't necessarily optimisation for another. With that in mind, profile-guided optimisation is reasonably good at rearranging branches, based on whichever test input you give it. This means you don't need to do any programming to perform this optimisation, and it should be relatively tailored to whichever machine you're profiling on. Obviously, the best results will be achieved when your test input and the machine you profile on roughly matches what common expectations... but those are also considerations for any other optimisations, branch-prediction related or otherwise.

To answer your questions let me explain how does branch prediction works.
First of all, there is a branch penalty when the processor correctly predicts the taken branches. If the processor predicts a branch as taken, then it has to know the target of the predicted branch since execution flow will continue from that address. Assuming that the branch target address is already stored in Branch Target Buffer(BTB), it has to fetch new instructions from the address found in BTB. So you are still wasting a few clock cycles even if the branch is correctly predicted.
Since BTB has an associative cache structure the target address might not be present, and hence more clock cycles might be wasted.
On the other, hand if the CPU predicts a branch as not taken and if it's correct then there is no penalty since the CPU already knows where the consecutive instructions are.
As I explained above, predicted not taken branches have higher throughput than predicted taken branches.
Is it possible to avoid branch misprediction using some high level programming technique (i.e. no assembly)?
Yes, it is possible. You can avoid by organizing your code in way that all branches have repetitive branch pattern such that always taken or not taken.
But if you want to get higher throughput you should organize branches in a way that they are most likely to be not taken as I explained above.
What should I keep in mind to produce branch-friendly code in a high
level programming language (I'm mostly interested in C and C++)?
If it's possible eliminate branches as possible. If this is not the case when writing if-else or switch statements, check the most common cases first to make sure the branches most likely to be not taken. Try to use __builtin_expect(condition, 1) function to force compiler to produce condition to be treated as not taken.

Branchless isn't always better, even if both sides of the branch are trivial. When branch prediction works, it's faster than a loop-carried data dependency.
See gcc optimization flag -O3 makes code slower than -O2 for a case where gcc -O3 transforms an if() to branchless code in a case where it's very predictable, making it slower.
Sometimes you are confident that a condition is unpredictable (e.g. in a sort algorithm or binary search). Or you care more about the worst-case not being 10x slower than about the fast-case being 1.5x faster.
Some idioms are more likely to compile to a branchless form (like a cmov x86 conditional move instruction).
x = x>limit ? limit : x; // likely to compile branchless
if (x>limit) x=limit; // less likely to compile branchless, but still can
The first way always writes to x, while the second way doesn't modify x in one of the branches. This seems to be the reason that some compilers tend to emit a branch instead of a cmov for the if version. This applies even when x is a local int variable that's already live in a register, so "writing" it doesn't involve a store to memory, just changing the value in a register.
Compilers can still do whatever they want, but I've found this difference in idiom can make a difference. Depending on what you're testing, it's occasionally better to help the compiler mask and AND rather than doing a plain old cmov. I did it in that answer because I knew that the compiler would have what it needed to generate the mask with a single instruction (and from seeing how clang did it).
TODO: examples on http://gcc.godbolt.org/

High-performance code in c++ (inheritance, pointers to functions, if)

Suppose you have a very large graph with lots of processing upon its nodes (like tens of milions of operations per node). The core routine is the same for each node, but there are some additional operations based on internal conditions. There can be 2 such conditions which produces 4 cases (0,0), (1,0), (0,1), (1,1). E.g. (1,1) means that both conditions hold. Conditions are established once (one set for each node independently) in a program and, from then on, never change. Unfortunately, they are determined in runtime and in a fully unpredictable way (based on data received via HTTP from and external server).
What is the fastest in such scenario? (taken into account modern compiler optimizations which I have no idea of)
simply using "IFs": if (condition X) perform additional operation X.
using inheritance to derive four classes from base class (exposing method OPERATION) to have a proper operation and save tens milions of "ifs". [but I am not sure if this is a real saving, inheritance must have its overhead too)
use pointer to function to assign the function based on conditions once.
I would take me long to come to a point where I can test it by myself (I don't have such a big data yet and this will be integrated into bigger project so would not be easy to test all versions).
Reading answers: I know that I probably have to experiment with it. But apart from everything, this is sort of a question what is faster:
tens of milions of IF statements and normal statically known function calls VS function pointer calls VS inheritance which I think is not the best idea in this case and I am thinking of eliminating it from further inspection Thanks for any constructive answers (not saying that I shouldn't care about such minor things ;)

There is no real answer except to measure the actual code on the
real data. At times, in the past, I've had to deal with such
problems, and in the cases I've actually measured, virtual
functions were faster than if's. But that doesn't mean much,
since the cases I measured were in a different program (and thus
a different context) than yours. For example, a virtual
function call will generally prevent inlining, whereas an if is
inline by nature, and inlining may open up additional
optimization possibilities.
Also the machines I measured on handled virtual functions pretty
well; I've heard that some other machines (HP's PA, for example)
are very ineffective in their implementation of indirect jumps
(including not only virtual function calls, but also the return
from the function---again, the lost opportunity to inline
costs).

If you absolutely have to have the fastest way, and the process order of the nodes is not relevant, make four different types, one for each case, and define a process function for each. Then in a container class, have four vectors, one for each node type. Upon creation of a new node get all the data you need to create the node, including the conditions, and create a node of the correct type and push it into the correct vector. Then, when you need to process all your nodes, process them type by type, first processing all nodes in the first vector, then the second, etc.
Why would you want to do this:
No ifs for the state switching
No vtables
No function indirection
But much more importantly:
No instruction cache thrashing (you're not jumping to a different part of your code for every next node)
No branch prediction misses for state switching ifs (since there are none)
Even if you'd have inheritance with virtual functions and thus function indirection through vtables, simply sorting the nodes by their type in your vector may already make a world of difference in performance as any possible instruction cache thrashing would essentially be gone and depending on the methods of branch prediction the branch prediction misses could also be reduced.
Also, don't make a vector of pointers, but make a vector of objects. If they are pointers you have an extra adressing indirection, which in itself is not that worrisome, but the problem is that it may lead to data cache thrashing if the objects are pretty much randomly spread throughout your memory. If on the other hand your objects are directly put into the vector the processing will basically go through memory linearly and the cache will basically hit every time and cache prefetching might actually be able to do a good job.
Note though that you would pay heavily in data structure creation if you don't do it correctly, if at all possible, when making the vector reserve enough capacity in it immediately for all your nodes, reallocating and moving every time your vector runs out of space can become expensive.
Oh, and yes, as James mentioned, always, always measure! What you think may be the fastest way may not be, sometimes things are very counter intuitive, depending on all kinds of factors like optimizations, pipelining, branch prediction, cache hits/misses, data structure layout, etc. What I wrote above is a pretty general approach, but it is not guaranteed to be the fastest and there are definitely ways to do it wrong. Measure, Measure, Measure.
P.S. inheritance with virtual functions is roughly equivalent to using function pointers. Virtual functions are usually implemented by a vtable at the head of the class, which is basically just a table of function pointers to the implementation of the given virtual for the actual type of the object. Whether ifs is faster than virtuals or the other way around is a very very difficult question to answer and depends completely on the implementation, compiler and platform used.

I'm actually quite impressed with how effective branch prediction can be, and only the if solution allows inlining which also can be dramatic. Virtual functions and pointer to function also involve loading from memory and could possibly cause cache misses
But, you have four conditions so branch misses can be expensive.
Without the ability to test and verify the answer really can't be answered. Especially since its not even clear that this would be a performance bottleneck sufficient enough to warrant optimization efforts.
In cases like this. I would err on the side of readability and ease of debugging and go with if

Many programmers have taken classes and read books that go on about certain favorite subjects: pipelining, cacheing, branch prediction, virtual functions, compiler optimizations, big-O algorithms, etc. etc. and the performance of those.
If I could make an analogy to boating, these are things like trimming weight, tuning power, adjusting balance and streamlining, assuming you are starting from some speedboat that's already close to optimal.
Never mind you may actually be starting from the Queen Mary, and you're assuming it's a speedboat.
It may very well be that there are ways to speed up the code by large factors, just by cutting away fat (masquerading as good design), if only you knew where it was.
Well, you don't know where it is, and guessing where is a waste of time, unless you value being wrong.
When people say "measure and use a profiler" they are pointing in the right direction, but not far enough.
Here's an example of how I do it, and I made a crude video of it, FWIW.

Unless there's a clear pattern to these attributes, no branch predictor exists that can effectively predict this data-dependent condition for you. Under such circumstances, you may be better off avoiding a control speculation (and paying the penalty of a branch misprediction), and just wait for the actual data to arrive and resolve the control flow (more likely to occur using virtual functions). You'll have to benchmark of course to verify that, as it depends on the actual pattern (if for e.g. you have even small groups of similarly "tagged" elements).
The sorting suggested above is nice and all, but note that it converts a problem that's just plain O(n) into an O(logn) one, so for large sizes you'll lose unless you can sort once - traverse many times, or otherwise cheaply maintain the sort state.
Note that some predictors may also attempt to predict the address of the function call, so you might be facing the same problem there.
However, I must agree about the comments regarding early-optimizations - do you know for sure that the control flow is your bottleneck? What if fetching the actual data from memory takes longer? In general, it would seem that your elements can be process in parallel, so even if you run this on a single thread (and much more if you use multiple cores) - you should be bandwidth-bound and not latency bound.

Pipeline optimzation, is there any point to do this?

Some very expencied programmer from another company told me about some low-level code-optimzation tips that targetting specific CPU, including pipeline-optimzation, which means, arrange the code (inlined assembly, obviously) in special orders such that it fit the pipeline better for the targetting hardware.
With the presence of out-of-order and speculative execuation, I just wonder is there any points to do this kind of low-level stuff? We are mostly invovled in high performance computing, so we can really focus on one very specific CPU type to do our optimzation, but I just dont know if there is any point to do this specific optimzation, anyone has any experience here, where to begin? are there any code examples for this kind of optimzation? many thanks!

I'll start by saying that the compiler will usually optimize code sufficiently (i.e. well enough) that you do not need to worry about this provided your high-level code and algorithms are optimized. In general, manual optimizing should only happen if you have hard evidence that there is an actual performance issue that you can quantify and have tracked down.
Now, with that said, it's always possible to improve things - sometimes a little, sometimes a lot.
If you are in the high-performance computing game, then this sort of optimization might make sense. There are all sorts of "tricks" that can be done, but they are best left to real experts and not for the faint of heart.
If you really want to know more about this topic, a good place to start is by reading Agner Fog's website.

Pipeline optimization will improve your programs performance:
Branches and jumps may force your processor to reload the instruction pipeline, which takes some time. This time could be devoted to data processing instructions.
Some platform independent methods for pipeline optimizations:
Reduce number of branches.
Use Boolean Arithmetic
Set up code to allow for conditional execution of instructions.
Unroll loops.
Make loops have short content (that can fit in a processor's cache
without loading).
Edit 1: Other optimizations
Reduce code by eliminating features and requirements.
Review and optimize the design.
Review implementation for more efficient implementations.
Revert to assembly language only when all other optimizations have
provided little performance improvement; optimize only the code that
is executed 80% of the time; find out by profiling.
Edit 2: Data Optimizations
You can also gain performance improvements by organizing your data. Search the web for "Data Driven Design" or "Optimize performance data".
One idea is that the most frequently used data should be close together and ultimately fit into the processor's data cache. This will reduce the frequency that the processor has to reload its data cache.
Another optimization is to: Load data (into registers), operate on data, then write all data back to memory. The idea here is to trigger the processor's data cache loading circuitry before it processes the data (or registers).
If you can, organize the data to fit in one "line" of your processor's cache. Sequential locations require less time than random access locations.

There are always things that "help" vs. "hinder" the execution in the pipeline, but for most general purpose code that isn't highly specialized, I would expect that performance from compiled code is about as good as the best you can get without highly specialized code for each model of processor. If you have a controlled system, where all of your machines are using the same (or a small number of similar) processor model, and you know that 99% of the time is spent in this particular function, then there may be a benefit to optimizing that particular function to become more efficient.
In your case, it being HPC, it may well be beneficial to handwrite some of the low-level code (e.g. matrix multiplication) to be optimized for the processor you are running on. This does take some reasonable amount of understanding of the processor however, so you need to study the optimization guides for that processor model, and if you can, talk to people who've worked on that processor before.
Some of the things you'd look at is "register to register dependencies" - where you need the result of c = a + b to calculate x = c + d - so you try to separate these with some other useful work, such that the calculation of x doesn't get held up by the c = a + b calculation.
Cache-prefetching and generally caring for how the caches are used is also a useful thing to look at - not kicking useful cached data out that you need 100 instructions later, when you are storing the resulting 1MB array that won't be used again for several seconds can be worth a lot of processor time.
It's hard(er) to control these things when compilers decide to shuffle it around in it's own optimisation, so handwritten assembler is pretty much the only way to go.

Branch prediction between objects of same class

I'm optimizing a program, and trying to avoid branch misprediction. I have two objects of a class. In the class's primary function there are several if branches. Each object takes a different direction on each of those branches, and they each run the function one after another. My questions:
Since they're members of the same class, and are therefore sharing that function, are they also sharing the same branch prediction? Essentially, am I making the system go TFTFTFTF...
Or, since they're their own objects do they have their own branch predictions and therefore maintaining consistent predictions (TTTTTTT... and FFFFFFFF...)

Yes, the method is shared between instances of a class.
It means, as well, that the predictions are shared.
However, there is more to branch prediction than the "last" time. The processor will remember some of the last results and identify "easy" (cyclic) patterns. Therefore, if you constantly swap between your two objects and the pattern ends up TFTFTFTFTF then the processor will correctly guess that the next result will be a T.
From a semantic point of view, however, did you thought about using a base class and two different derived classes (+ the usual virtual mechanism) ?

Don't bother about such low-level details like branch prediction (it will vary from one model of a processor to the next). Leave that optimization to the compiler (and it is probably good enough).
If you want to improve your application, work more on the algorithms themselves. And use profiling & measurements. Don't forget that premature optimization is evil.

Since a branch misprediction will typically cost of the order of 10 to 20 cycles it's really only of importance when it's inside a loop that is being executed millions of times a second. Modern CPUs do a pretty good job of branch prediction anyway, so it's pretty rare to have to worry about this kind of thing (compared to say 5 - 10 years ago).

Effective optimization strategies on modern C++ compilers

I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to start shaving cycles from the hot spots.
It's well-known that some optimizations, e.g. loop unrolling, are handled these days much more effectively by the compiler than by a programmer meddling by hand. Which techniques are still worthwhile? Obviously, I'll run everything I try through a profiler, but if there's conventional wisdom as to what tends to work and what doesn't, it would save me significant time.
I know that optimization is very compiler- and architecture- dependent. I'm using Intel's C++ compiler targeting the Core 2 Duo, but I'm also interested in what works well for gcc, or for "any modern compiler."
Here are some concrete ideas I'm considering:
Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interesting in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
Lastly, to nip certain kinds of answers in the bud:
I understand that optimization has a cost in terms of complexity, reliability, and maintainability. For this particular application, increased performance is worth these costs.
I understand that the best optimizations are often to improve the high-level algorithms, and this has already been done.

Take a look at the excellent Pitfalls of Object-Oriented Programming slides for some info about restructuring code for locality. In my experience getting better locality is almost always the biggest win.
General process:
Learn to love the Disassembly View in your debugger, or have your build system generate the intermediate assembly files (.s) if at all possible. Keep an eye on changes or for things that look egregious -- even without familiarity with a given instruction set architecture, you should be able to see some things fairly clearly! (I sometimes check in a series of .s files with corresponding .cpp/.c changes, just to leverage the lovely tools from my SCM to watch the code and corresponding asm change over time.)
Get a profiler that can watch your CPU's performance counters, or can at least guess at cache misses. (AMD CodeAnalyst, cachegrind, vTune, etc.)
Some other specific things:
Understand strict aliasing. Once you do, make use of restrict if your compiler has it. (Examine the disasm here too!)
Check out different floating point modes on your processor and compiler. If you don't need the denormalized range, choosing a mode without this can result in better performance. (It sounds like you've already done some things in this area, based on your discussion of rounding modes.)
Definitely avoid allocs: call reserve on std::vector when you can, or use std::array when you know the size at compile-time.
Use memory pools to increase locality and decrease alloc/free overhead; also to ensure cacheline alignment and prevent ping-ponging.
Use frame allocators if you're allocating things in predictable patterns, and can afford to deallocate everything in one go.
Do be aware of invariants. Something you know is invariant may not be to the compiler, for example a use of a struct or class member in a loop. I find the single easiest way to fall into the correct habit here is to give a name to everything, and prefer to name things outside of loops. E.g. const int threshold = m_currentThreshold; or perhaps Thing * const pThing = pStructHoldingThing->pThing; Fortunately you can usually see things that need this treatment in the disassembly view. This also helps with debugging later (makes the watch/locals window behave much more nicely in debug builds)!
Avoid writes in loops if possible -- accumulate first, then write, or batch a few writes together. YMMV, of course.
WRT your std::priority_queue question: inserting things into a vector (the default backend for a priority_queue) tends to move a lot of elements around. If you can break up into phases, where you insert data, then sort it, then read it once it's sorted, you'll probably be a lot better off. Although you'll definitely lose locality, you may find a more self-ordering structure like a std::map or std::set worth the overhead -- but this is really dependent on your usage patterns.

Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
I would only consider this as a last option. The STL containers and algorithms have been thoroughly tested. Creating new ones are expensive in terms of development time.
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
First, try reserving space for the vectors. Check out the std::vector::reserve method. A vector that keeps growing or changing to larger sizes is going to waste dynamic memory and execution time. Add some code to determine a good value for an upper bound.
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interesting in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
As a matter of principle, always pass large structures by reference or pointer. Prefer passing by constant reference. If you are using pointers, consider using smart pointers.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Modern compilers are very aware of instruction caches (pipelines) and try to keep them from being reloaded. You can always assist your compiler by writing code that uses less branches (from if, switch, loop constructs and function calls).
You may see more significant performance gain by adjusting your program to optimize the data cache. Search the web for Data Driven Design. There are many excellent articles on this topic.
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
For accuracy, keep everything as a double. Adjust for rounding only when necessary and perhaps before displaying. This falls under the optimization rule, Use less code, eliminate extraneous or deadwood code.
Also see the section above about reserving space in containers before using them.
Some processors can load and store floating point numbers either faster or as fast as integers. This would require gathering profile data before optimizing. However, if you know there is minimal resolution, you could use integers and change your base to that minimal resolution . For example, when dealing with U.S. money, integers can be used to represent 1/100 or 1/1000 of a dollar.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
This an incorrect assumption. Compilers can optimize based on the function's signature, especially if the parameters correctly use const. I always like to assist the compiler by moving constant stuff outside of the loop. For an upper limit value, such as a string length, assign it to a const variable before the loop. The const modifier will assist the Optimizer.
There is always the count-down optimization in loops. For many processors, a jump on register equals zero is more efficient than compare and jump if less than.
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
I would avoid "micro optimizations". If you have any doubts, print out the assembly code generated by the compiler (for the area you are questioning) under the highest optimization setting. Try rewriting the code to express the compiler's assembly code. Optimize this code, if you can. Anything more requires platform specific instructions.
Optimization Ideas & Concepts
1. Computers prefer to execute sequential instructions.
Branching upsets them. Some modern processors have enough instruction cache to contain code for small loops. When in doubt, don't cause branches.
2. Eliminate Requirements
Less code, more performance.
3. Optimize designs before code
Often times, more performance can be gained by changing the design versus changing the implementation of the design. Less design promotes less code, generates more performance.
4. Consider data organization
Optimize the data.
Organize frequently used fields into substructures.
Set data sizes to fit into a data cache line.
Remove constant data out of data structures.
Use const specifier as much as possible.
5. Consider page swapping
Operating systems will swap out your program or task for another one. Often times into a 'swap file' on the hard drive. Breaking up the code into chunks that contain heavily executed code and less executed code will assist the OS. Also, coagulate heavily used code into tighter units. The idea is to reduce the swapping of code from the hard drive (such as fetching "far" functions). If code must be swapped out, it should be as one unit.
6. Consider I/O optimizations
(Includes file I/O too).
Most I/O prefers fewer large chunks of data to many small chunks of data. Hard drives like to keep spinning. Larger data packets have less overhead than smaller packets.
Format data into a buffer then write the buffer.
7. Eliminate the competition
Get rid of any programs and tasks that are competing against your application for the processor(s). Such tasks as virus scanning and playing music. Even I/O drivers want a piece of the action (which is why you want to reduce the number or I/O transactions).
These should keep you busy for a while. :-)

Use of memory buffer pools can be of great performance benefit vs. dynamic allocation. More so if they reduce or prevent heap fragmentation over long execution runs.
Be aware of data location. If you have a significant mix of local vs. global data you may be overworking the cache mechanism. Try to keep data sets in close proximity to make maximum use of cache line validity.
Even though compilers do a wonderful job with loops, I still scrutinize them when performance tuning. You can spot architectural flaws that yield orders of magnitude where the compiler may only trim percentages.
If a single priority queue is using a lot of time in its operation, there may be benefit to creating a battery of queues representing buckets of priority. It would be complexity being traded for speed in this case.
I notice you didn't mention the use of SSE type instructions. Could they be applicable to your type of number crunching?
Best of luck.

Here is a nice paper on the subject.

About STL containers.
Most people here claim STL offers one of the fastest possible implementations of the container algorithms. And I say the opposite: for the most real-world scenarios the STL containers taken as-is yield a really catastrophic performance.
People argue about the complexity of the algorithms used in STL. Here STL is good: O(1) for list/queue, vector (amortized), and O(log(N)) for map. But this is not the real bottleneck of the performance for a typical application! For many applications the real bottleneck is the heap operations (malloc/free, new/delete, etc.).
A typical operation on the list costs just a few CPU cycles. On a map - some tens, may be more (this depends on the cache state and log(N) of course). And typical heap operations cost from hunders to thousands (!!!) of CPU cycles. For multithreaded applications for instance they also require synchronization (interlocked operations). Plus on some OSs (such as Windows XP) the heap functions are implemented entirely in the kernel mode.
So that the actual performance of the STL containers in a typical scenario is dominated by the amount of heap operations they perform. And here they're disastrous. Not because they're implemented poorly, but because of their design. That is, this is the question of the design.
On the other hand there're other containers which are designed differently.
Once I've designed and written such containers for my own needs:
http://www.codeproject.com/KB/recipes/Containers.aspx
And it proved for me to be superior from the performance point of view, and not only.
But recently I've discovered I'm not the only one who thought about this.
boost::intrusive is the container library that is implemented in the manner similar to what I did then.
I suggest you try it (if you didn't already)

Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
Generally, not unless you're working with a poor implementation. I wouldn't replace an STL container or algorithm just because you think you can write tighter code. I'd do it only if the STL version is more general than it needs to be for your problem. If you can write a simpler version that does just what you need, then there might be some speed to gain there.
One exception I've seen is to replace a copy-on-write std::string with one that doesn't require thread synchronization.
for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
Unlikely. But if you're using a lot of time allocating up to a certain size, it might be profitable to add a reserve() call.
performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference.
When working with containers, I pass iterators for the inputs and an output iterator, which is still pretty general.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Not very. Yes. I find that missed branch predictions and cache-hostile memory access patterns are the two biggest killers of performance (once you've gotten to reasonable algorithms). A lot of older code uses "early out" tests to reduce calculations. But on modern processors, that's often more expensive than doing the math and ignoring the result.
A significant bottleneck in my code used to be conversions from floating point to integers
Yup. I recently discovered the same issue.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop.
Some compilers can deal with this. Visual C++ has a "link-time code generation" option that effective re-invokes the compiler to do further optimization. And, in the case of functions like strlen, many compilers will recognize that as an intrinsic function.
Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand? On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
When you're optimizing at this low level, there are few reliable rules of thumb. Compilers will vary. Measure your current solution, and decide if it's too slow. If it is, come up with a hypothesis (e.g., "What if I replace the inner if-statements with a look-up table?"). It might help ("eliminates stalls due to failed branch predictions") or it might hurt ("look-up access pattern hurts cache coherence"). Experiment and measure incrementally.
I'll often clone the straightforward implementation and use an #ifdef HAND_OPTIMIZED/#else/#endif to switch between the reference version and the tweaked version. It's useful for later code maintenance and validation. I commit each successful experiment to change control, and keep a log (spreadsheet) with the changelist number, run times, and explanation for each step in optimization. As I learn more about how the code behaves, the log makes it easy to back up and branch off in another direction.
You need a framework for running reproducible timing tests and to compare results to the reference version to make sure you don't inadvertently introduce bugs.

If I were working on this, I would expect an end-stage where things like cache locality and vector operations would come into play.
However, before getting to the end stage, I would expect to find a series of problems of different sizes having less to do with compiler-level optimization, and more to do with odd stuff going on that could never be guessed, but once found, are simple to fix. Usually they revolve around class overdesign and data structure issues.
Here's an example of this kind of process.
I have found that generalized container classes with iterators, which in principle the compiler can optimize down to minimal cycles, often are not so optimized for some obscure reason. I've also heard other cases on SO where this happens.
Others have said, before you do anything else, profile. I agree with that approach except I think there's a better way, and it's indicated in that link. Whenever I find myself asking if some specific thing, like STL, could be a problem, I just might be right - BUT - I'm guessing. The fundamental winning idea in performance tuning is find out, don't guess. It is easy to find out for sure what is taking the time, so don't guess.

here is some stuff I had used:
templates to specialize innermost loops bounds (makes them really fast)
use __restrict__ keywords for alias problems
reserve vectors beforehand to sane defaults.
avoid using map (it can be really slow)
vector append/ insert can be significantly slow. If that is the case, raw operations may make it faster
N-byte memory alignment (Intel has pragma aligned, http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm)
trying to keep memory within L1/L2 caches.
compiled with NDEBUG
profile using oprofile, use opannotate to look for specific lines (stl overhead is clearly visible then)
here are sample parts of profile data (so you know where to look for problems)
* Output annotated source file with samples
* Output all files
*
* CPU: Core 2, speed 1995 MHz (estimated)
--
* Total samples for file : "/home/andrey/gamess/source/blas.f"
*
* 1020586 14.0896
--
* Total samples for file : "/home/andrey/libqc/rysq/src/fock.cpp"
*
* 962558 13.2885
--
* Total samples for file : "/usr/include/boost/numeric/ublas/detail/matrix_assign.hpp"
*
* 748150 10.3285
--
* Total samples for file : "/usr/include/boost/numeric/ublas/functional.hpp"
*
* 639714 8.8315
--
* Total samples for file : "/home/andrey/gamess/source/eigen.f"
*
* 429129 5.9243
--
* Total samples for file : "/usr/include/c++/4.3/bits/stl_algobase.h"
*
* 411725 5.6840
--
example of code from my project
template<int ni, int nj, int nk, int nl>
inline void eval(const Data::density_type &D, const Data::fock_type &F,
const double *__restrict Q, double scale) {
const double * __restrict Dij = D[0];
...
double * __restrict Fij = F[0];
...
for (int l = 0, kl = 0, ijkl = 0; l < nl; ++l) {
for (int k = 0; k < nk; ++k, ++kl) {
for (int j = 0, ij = 0; j < nj; ++j, ++jk, ++jl) {
for (int i = 0; i < ni; ++i, ++ij, ++ik, ++il, ++ijkl) {

And I think the main hint anyone could give you is: measure, measure, measure. That and improving your algorithms.
The way you use certain language features, the compiler version, std lib implementation, platform, machine - all ply their role in performance and you haven't mentioned many of those and no one of us ever had your exact setup.
Regarding replacing std::vector: use a drop-in replacement (e.g., this one) and just try it out.

How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
I can't speak for all compilers, but my experience with GCC shows that it will not heavily optimize code with respect to the cache. I would expect this to be true for most modern compilers. Optimization such as reordering nested loops can definitely affect performance. If you believe that you have memory access patterns that could lead to many cache misses, it will be in your interest to investigate this.

Is there any benefit to replacing STL
containers/algorithms with hand-rolled
ones? In particular, my program
includes a very large priority queue
(currently a std::priority_queue)
whose manipulation is taking a lot of
total time. Is this something worth
looking into, or is the STL
implementation already likely the
fastest possible?
The STL is generally the fastest, general case. If you have a very specific case, you might see a speed-up with a hand-rolled one. For example, std::sort (normally quicksort) is the fastest general sort, but if you know in advance that your elements are virtually already ordered, then insertion sort might be a better choice.
Along similar lines, for std::vectors
whose needed sizes are unknown but
have a reasonably small upper bound,
is it profitable to replace them with
statically-allocated arrays?
This depends on where you are going to do the static allocation. One thing I tried along this line was to static allocate a large amount of memory on the stack, then re-use later. Results? Heap memory was substantially faster. Just because an item is on the stack doesn't make it faster to access- the speed of stack memory also depends on things like cache. A statically allocated global array may not be any faster than the heap. I assume that you have already tried techniques like just reserving the upper bound. If you have a lot of vectors that have the same upper bound, consider improving cache by having a vector of structs, which contain the data members.
I've found that dynamic memory
allocation is often a severe
bottleneck, and that eliminating it
can lead to significant speedups. As a
consequence I'm interesting in the
performance tradeoffs of returning
large temporary data structures by
value vs. returning by pointer vs.
passing the result in by reference. Is
there a way to reliably determine
whether or not the compiler will use
RVO for a given method (assuming the
caller doesn't need to modify the
result, of course)?
I personally normally pass the result in by reference in this scenario. It allows for a lot more re-use. Passing large data structures by value and hoping that the compiler uses RVO is not a good idea when you can just manually use RVO yourself.
How cache-aware do compilers tend to
be? For example, is it worth looking
into reordering nested loops?
I found that they weren't particularly cache-aware. The issue is that the compiler doesn't understand your program and can't predict the vast majority of it's state, especially if you depend heavily on heap. If you have a profiler that ships with your compiler, for example Visual Studio's Profile Guided Optimization, then this can produce excellent speedups.
Given the scientific nature of the
program, floating-point numbers are
used everywhere. A significant
bottleneck in my code used to be
conversions from floating point to
integers: the compiler would emit code
to save the current rounding mode,
change it, perform the conversion,
then restore the old rounding mode ---
even though nothing in the program
ever changed the rounding mode!
Disabling this behavior significantly
sped up my code. Are there any similar
floating-point-related gotchas I
should be aware of?
There are different floating-point models - Visual Studio gives an fp:fast compiler setting. As for the exact effects of doing such, I can't be certain. However, you could try altering the floating point precision or other settings in your compiler and checking the result.
One consequence of C++ being compiled
and linked separately is that the
compiler is unable to do what would
seem to be very simple optimizations,
such as move method calls like
strlen() out of the termination
conditions of loop. Are there any
optimization like this one that I
should look out for because they can't
be done by the compiler and must be
done by hand?
I've never come across such a scenario. However, if you're genuinely concerned about such, then the option remains to do it manually. One of the things that you could try is calling a function on a const reference, suggesting to the compiler that the value won't change.
One of the other things that I want to point out is the use of non-standard extensions to the compiler, for example provided by Visual Studio is __assume. http://msdn.microsoft.com/en-us/library/1b3fsfxw(VS.80).aspx
There's also multithread, which I would expect you've gone down that road. You could try some specific opts, like another answer suggested SSE.
Edit: I realized that a lot of the suggestions I posted referenced Visual Studio directly. That's true, but, GCC almost certainly provides alternatives to the majority of them. I just have personal experience with VS most.

The STL priority queue implementation is fairly well-optimized for what it does, but certain kinds of heaps have special properties that can improve your performance on certain algorithms. Fibonacci heaps are one example. Also, if you're storing objects with a small key and a large amount of satellite data, you'll get a major improvement in cache performance if you store that data separately, even if it means storing one extra pointer per object.
As for arrays, I've found std::vector to even slightly out-perform compile-time-constant arrays. That said, its optimizations are general, and specific knowledge of your algorithm's access patterns may allow you to optimize further for cache locality, alignment, coloring, etc. If you find that your performance drops significantly past a certain threshold due to cache effects, hand-optimized arrays may move that problem size threshold by as much as a factor of two in some cases, but it's unlikely to make a huge difference for small inner loops that fit easily within the cache, or large working sets that exceed the size of any CPU cache. Work on the priority queue first.
Most of the overhead of dynamic memory allocation is constant with respect to the size of the object being allocated. Allocating one large object and returning it by a pointer isn't going to hurt much as much as copying it. The threshold for copying vs. dynamic allocation varies greatly between systems, but it should be fairly consistent within a chip generation.
Compilers are quite cache-aware when cpu-specific tuning is turned on, but they don't know the size of the cache. If you're optimizing for cache size, you may want to detect that or have the user specify it at run-time, since that will vary even between processors of the same generation.
As for floating point, you absolutely should be using SSE. This doesn't necessarily require learning SSE yourself, as there are many libraries of highly-optimized SSE code that do all sorts of important scientific computing operations. If you're compiling 64-bit code, the compiler might emit some SSE code automatically, as SSE2 is part of the x86_64 instruction set. SSE will also save you some of the overhead of x87 floating point, since it's not converting back and forth to 80-bit values internally. Those conversions can also be a source of accuracy problems, since you can get different results from the same set of operations depending on how they get compiled, so it's good to be rid of them.

If you work on big matrices for instance, consider tiling your loops to improve the locality. This often leads to dramatic improvements. You can use VTune/PTU to monitor the L2 cache misses.

One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
On some compilers this is incorrect. The compiler has perfect knowledge of all code across all translation units (including static libraries) and can optimize the code the same way it would do if it were in a single translation unit. A few ones that support this feature come to my mind:
Microsoft Visual C++ compilers
Intel C++ Compiler
LLVC-GCC
GCC (I think, not sure)

i'm surprised no one has mentioned these two:
Link time optimization clang and g++ from 4.5 on support link time optimizations. I've heard that on g++ case, the heuristics is still pretty inmature but it should improve quickly since the main architecture is laid out.
Benefits range from inter procedural optimizations at object file level, including highly sought stuff like inling of virtual calls (devirtualization)
Project inlining this might seem to some like very crude approach, but it is that very crudeness which makes it so powerful: this amounts at dumping all your headers and .cpp files into a single, really big .cpp file and compile that; basically it will give you the same benefits of link-time optimization in your trip back to 1999. Of course, if your project is really big, you'll still need a 2010 machine; this thing will eat your RAM like there is no tomorrow. However, even in that case, you can split it in more than one no-so-damn-huge .cpp file

If you are doing heavy floating point math you should consider using SSE to vectorize your computations if that maps well to your problem.
Google SSE intrinsics for more information about this.

Here is something that worked for me once. I can't say that it will work for you. I had code on the lines of
switch(num) {
case 1: result = f1(param); break;
case 2: result = f2(param); break;
//...
}
Then I got a serious performance boost when I changed it to
// init:
funcs[N] = {f1, f2 /*...*/};
// later in the code:
result = (funcs[num])(param);
Perhaps someone here can explain the reason the latter version is better. I suppose it has something to do with the fact that there are no conditional branches there.

My current project is a media server, with multi thread processing (C++ language). It's a time critical application, once low performance functions could cause bad results on media streaming like lost of sync, high latency, huge delays and so.
The strategy i usually use to grantee the best performance possible is to minimize the amount of heavy operational system calls that allocate or manage resources like memory, files, sockets and so.
At first i wrote my own STL, network and file manage classes.
All my containers classes ("MySTL") manage their own memory blocks to avoid multiple alloc (new) / free (delete) calls. The objects released are enqueued on a memory block pool to be reused when needed. On that way i improve performance and protect my code against memory fragmentation.
The parts of the code that need to access lower performance system resources (like files, databases, script, network write) i use separate threads for them. But not one thread for each unit (like not 1 thread for each socket), if so the operational system would lose performance while managing a high number of threads. So you can group objects of same classes to be processed on a separate thread if possible.
For example, if you have to write data to a network socket, but the socket write buffer is full, i save the data on a sendqueue buffer (which shares memory with all sockets together) to be sent on a separate thread as soon as the sockets become writeable again. At this way your main threads should never stop processing on a blocked state waiting for the operational system frees a specific resource. All the buffers released are saved and reused when needed.
After all a profile tool would be welcome to look for program bottles and shows which algorithms should be improved.
i got succeeded using that strategy once i have servers running like 500+ days on a linux machine without rebooting, with thousands users logging everyday.
[02:01] -alpha.ip.tv- Uptime: 525days 12hrs 43mins 7secs

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js