Loop unrolling vs. loop tiling - C++

Can someone please tell me whether these two optimization techniques are the same or different?
Also, is it the responsibility of the programmer or of the compiler to apply them?

The two techniques are different. See descriptions for Loop unrolling and Loop tiling.
Loop unrolling is done to eliminate the overhead of looping. It is (usually) only useful for fairly small loops whose iteration count is known at compile time. It is mostly done by the compiler.
In older times when computers were slower and compilers were more primitive, programmers would do manual loop unrolling but now it would be unusual for a programmer to do it -- except possibly for a very restrictive embedded system.
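For illustration, here is a minimal sketch of what manual unrolling looks like (the function name and the factor of 4 are arbitrary, and it assumes n is a multiple of 4; a real version would add a cleanup loop for the remainder, and a compiler at a high optimization level will typically do this for you):

float sum_unrolled(const float* a, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];        // four independent accumulators reduce loop
        s1 += a[i + 1];    // overhead and hide addition latency
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}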
Loop tiling is commonly done with very large data sets. The goal is to load some data into cache memory and perform all operations on it before paging in new data.
Depending on the operations being performed and the internal organisation of the data, a simple loop might jump about into different data pages causing a lot of cache misses (and page loads). Careful planning of the order of execution can significantly improve run-times for certain problems.
While it is likely that a compiler might perform loop tiling, there are times when the programmer might do so manually and possibly do a better job than the compiler.
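For illustration, a minimal sketch of tiling applied to an N x N row-major matrix transpose (the tile size of 32 is purely illustrative; it would be chosen so that a pair of tiles fits in the cache level being targeted):

#include <cstddef>

void transpose_tiled(const double* in, double* out, std::size_t n) {
    const std::size_t tile = 32;
    for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t jj = 0; jj < n; jj += tile)
            for (std::size_t i = ii; i < ii + tile && i < n; ++i)
                for (std::size_t j = jj; j < jj + tile && j < n; ++j)
                    out[j * n + i] = in[i * n + j];   // accesses stay within one pair of tiles
}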
In general, don't try to do these types of optimisation as they add a lot of complexity (and bugs) to the code and usually provide only modest performance gains. However if your code is slow and profiling indicates particular types of bottlenecks, then something like loop tiling should be considered and may lead to large performance gains.

These are two totally different performance optimisations.
Loop unrolling is a code optimisation where code is replicated within a loop and the total number of loop iterations is reduced. The benefit is reduced loop overhead (normally only relevant for very small loops), and better instruction scheduling with reduced dependency stalls in superscalar CPUs. This can be done both manually and/or as a compiler optimisation.
Tiling is a memory optimisation which aims to make better use of cache by processing tiles (small blocks within a larger data structure), typically in the context of an image or other 2D data structure. This is normally implemented at the source code level, as part of the overall design of an algorithm implementation.

Related

A simple and reliable C++ benchmarking solution? [duplicate]

I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
    doSomething();
    drawSomething();
    doSomething2();
    sendSomething();
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown, how much time each procedure takes.
My concern is that if I print the time interval at every entrance and exit of each procedure, it would incur a huge performance overhead.
I am curious what the idiomatic way of measuring the performance is.
Is printing or logging good enough?
Generally: for repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort the results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous one, or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program; e.g. loop unrolling and lookup tables often look good because there's no pressure on the I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
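As a rough sketch of that record-then-print approach (the stage functions below are placeholder stubs standing in for the questioner's doSomething()/drawSomething(), and steady_clock is just one reasonable clock choice):

#include <chrono>
#include <cstdio>
#include <vector>

static void doSomething()   { /* stand-in for the real stage */ }
static void drawSomething() { /* stand-in for the real stage */ }

int main() {
    using clock = std::chrono::steady_clock;
    const int frames = 1000;
    std::vector<long long> doNs, drawNs;   // per-frame samples, in nanoseconds
    doNs.reserve(frames);
    drawNs.reserve(frames);

    for (int f = 0; f < frames; ++f) {
        auto t0 = clock::now();
        doSomething();
        auto t1 = clock::now();
        drawSomething();
        auto t2 = clock::now();
        doNs.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
        drawNs.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
    }

    // Heavy printing happens only after the loop, outside the measured region.
    for (int f = 0; f < frames; ++f)
        std::printf("frame %d: do=%lld ns draw=%lld ns\n", f, doNs[f], drawNs[f]);
}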
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, but its throughput when run repeatedly with independent inputs is higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (where the address for the next load is easy to calculate) is often much faster than walking a linked list (where the address for the next load isn't available until the previous load completes).
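A minimal sketch of that latency/throughput distinction, with foo standing in for whatever tiny operation is under test:

inline double foo(double x) { return x * 1.000001 + 1.0; }

// Latency-bound: every call needs the previous call's result.
double chain(double x, long iters) {
    for (long i = 0; i < iters; ++i)
        x = foo(x);
    return x;
}

// Closer to throughput-bound: the calls take independent inputs, so the CPU
// can overlap them (the accumulator add is the only loop-carried dependency).
double independent(double x, long iters) {
    double acc = 0.0;
    for (long i = 0; i < iters; ++i)
        acc += foo(x + i);
    return acc;
}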
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
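A minimal sketch of that trick (the array size and the dummy work are illustrative only):

#include <cstddef>

int main(int argc, char**) {
    const std::size_t n = 1u << 20;
    static double output[1u << 20];
    for (std::size_t i = 0; i < n; ++i)
        output[i] = static_cast<double>(i) * 1.5;   // the work being benchmarked

    volatile double sink = output[argc];            // argc isn't known at compile time,
    (void)sink;                                     // so the work can't be optimized away
}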
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.

How do I segregate C++ code without impacting performance?

I'm having trouble refactoring my C++ code. The code itself is barely 200 lines, if even; however, being an image processing affair, it loops a lot, and the roadblocks I'm encountering (I assume) deal with very gritty details (e.g. memory access).
The program produces a correct output, but is supposed to ultimately run in realtime. Initially, it took ~3 minutes per 320x240px frame, but it's at around 2 seconds now (running approximately as fast on mid-range workstation and low-end laptop hardware; red flag?). Still a far cry from 24 times per second, however. Basically, any change I make propagates through the millions of repetitions, and tracking my beginner mistakes has become exponentially more cumbersome as I approach the realtime mark.
At 2 points, the program calculates a less computationally expensive variant of Euclidean distance, called taxicab distance (the sum of absolute differences).
Now, the abridged version:
std::vector<int> positiveRows, positiveCols;
/* looping through pixels, reading values */
distance = (abs(pValues[0] - qValues[0]) + abs(pValues[1] - qValues[1]) + abs(pValues[2] - qValues[2]));
if (distance < threshold)
{
    positiveRows.push_back(row);
    positiveCols.push_back(col);
}
If I wrap the functionality, as follows:
int taxicab_dist(int Lp, int ap, int bp,
                 int Lq, int aq, int bq)
{
    return (abs(Lp - Lq) + abs(ap - aq) + abs(bp - bq));
}
and call it from within the same .cpp file, there is no performance degradation. If I instead declare and define it in separate .hpp / .cpp files, I get a significant slowdown. This directly opposes what I've been told in my undergraduate courses ("including a file is the same as copy-pasting it"). The closest I've gotten to the original code's performance was by declaring the arguments const, but it still takes ~100ms longer, which my judgement says is not affordable for such a meager task. Then again, I don't see why it slows down (significantly) if I also make them const int&. Then, when I do the most sensible thing, and choose to take arrays as arguments, again I take a performance hit. I don't even dare attempt any templating shenanigans, or try making the function modify its behavior so that it accepts an arbitrary number of pairs, at least not until I understand what I've gotten myself into.
So my question is: how can I take the calculation definition to a separate file and have it perform the same as the original solution? Furthermore, should the fact that compilers are optimizing my program to run in 2 seconds instead of 15 be a huge red flag (bad algorithm design, not using more exotic C++ keywords / features)?
I'm guessing the main reason why I've failed to find an answer is that I don't know the name of this stuff. I've heard the term "vectorization" tossed around quite a bit in the HPC community. Would this be related to that?
If it helps in any way at all, the code in its entirety can be found here.
As Joachim Pileborg says, you should profile first. Find out where in your program most of the execution time occurs. This is the place where you should optimize.
Reserving space in vector
Vectors start out small and then reallocate as necessary. This involves allocating a larger block of memory, copying the old elements over, and finally deallocating the old memory. std::vector has the capability of reserving space up front. For large vectors, this can be a time saver, eliminating many reallocations.
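A sketch using the vectors from the question (one entry per pixel is the obvious upper bound here, so the reserve is a hypothetical but natural choice):

#include <cstddef>
#include <vector>

void collectPositives(std::size_t rows, std::size_t cols) {
    std::vector<int> positiveRows, positiveCols;
    positiveRows.reserve(rows * cols);   // no reallocations until this bound is exceeded
    positiveCols.reserve(rows * cols);
    // ... loop over pixels and push_back(row) / push_back(col) as before
}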
Compiling with speed optimizations
With modern compilers, you should set the optimizations for high speed and see what they can do. The compiler writers have many tricks up their sleeve and can often spot locations to optimize that you or I miss.
Truth is assembly language
You will need to view the assembly language listing. If the assembly language shows only two instructions in the area you think is the bottleneck, you really can't get faster.
Loop unrolling
You may be able to get more performance by copying the content in a for loop many times. This is called loop unrolling. In some processors, branch or jump instructions cost more execution time than data processing instructions. Unrolling a loop reduces the number of executed branch instructions. Again, the compiler may automatically perform this when you raise the optimization level.
Data cache optimization
Search the web for "Data cache optimization". Loading and unloading the data cache wastes time. If your data can fit into the processor's data cache, it doesn't have to keep loading an unloading (called cache misses). Also remember to perform all your operations on the data in one place before performing other operations. This reduces the likelihood of the processor reloading the cache.
Multi-processor computing
If your platform has more than one processor, such as a Graphics Processing Unit (GPU), you may be able to delegate some tasks to it. Be aware that you have also added time by communicating with the other processor. So for small tasks, the communications overhead may waste the time you gained by delegating.
Parallel computing
Similar to multi-processors, you can have the operating system delegate the tasks. The OS could delegate to different cores in your processor (if you have a multi-core processor) or run them in another thread. Again there is a cost: the overhead of managing the task or thread, and communications.
Summary
The three rules of Optimization:
Don't
Don't
Profile
After you profile, review the area where the most execution takes place. This will gain you more time than optimizing a section that never gets called. Design optimizations will generally get you more time than code optimizations. Likewise, requirement changes (such as elimination) may gain you more time than design optimizations.
After your program is working correctly and is robust, you can optimize, only if warranted. If your UI is so slow that the User can go get a cup of coffee, it is a good place to optimize. If you gain 100 milliseconds by optimizing data transfer, but your program waits 1 second for the human response, you have not gained anything. Consider this as driving really fast to a stop sign. Regardless of your speed, you still have to stop.
If you still need performance gain, search the web for "Optimizations c++", or "data optimizations" or "performance optimization".

Pipeline optimization: is there any point in doing this?

A very experienced programmer from another company told me about some low-level code-optimization tips targeting specific CPUs, including pipeline optimization, which means arranging the code (inlined assembly, obviously) in special orders so that it fits the pipeline of the target hardware better.
With the presence of out-of-order and speculative execution, I just wonder whether there is any point in doing this kind of low-level stuff. We are mostly involved in high performance computing, so we can really focus on one very specific CPU type for our optimization, but I just don't know if there is any point in doing this specific optimization. Does anyone have experience here? Where should I begin? Are there any code examples for this kind of optimization? Many thanks!
I'll start by saying that the compiler will usually optimize code sufficiently (i.e. well enough) that you do not need to worry about this provided your high-level code and algorithms are optimized. In general, manual optimizing should only happen if you have hard evidence that there is an actual performance issue that you can quantify and have tracked down.
Now, with that said, it's always possible to improve things - sometimes a little, sometimes a lot.
If you are in the high-performance computing game, then this sort of optimization might make sense. There are all sorts of "tricks" that can be done, but they are best left to real experts and not for the faint of heart.
If you really want to know more about this topic, a good place to start is by reading Agner Fog's website.
Pipeline optimization can improve your program's performance:
Branches and jumps may force your processor to reload the instruction pipeline, which takes time. That time could otherwise be devoted to data processing instructions.
Some platform independent methods for pipeline optimizations:
Reduce the number of branches (a small branchless sketch follows this list).
Use Boolean arithmetic.
Set up code to allow for conditional execution of instructions.
Unroll loops.
Make loops have short content (that can fit in the processor's instruction cache without reloading).
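A small branchless sketch of the first two points (the clamp function is only an example):

int clamp_branchy(int x, int limit) {
    if (x > limit)          // relies on a correctly predicted branch
        return limit;
    return x;
}

int clamp_branchless(int x, int limit) {
    int over = (x > limit); // 0 or 1
    return over * limit + (1 - over) * x;   // usually compiles to a conditional move
}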
Edit 1: Other optimizations
Reduce code by eliminating features and requirements.
Review and optimize the design.
Review implementation for more efficient implementations.
Revert to assembly language only when all other optimizations have provided little performance improvement; optimize only the code that is executed 80% of the time; find out by profiling.
Edit 2: Data Optimizations
You can also gain performance improvements by organizing your data. Search the web for "Data Driven Design" or "Optimize performance data".
One idea is that the most frequently used data should be close together and ultimately fit into the processor's data cache. This will reduce the frequency that the processor has to reload its data cache.
Another optimization is to load data (into registers), operate on it, then write it all back to memory. The idea here is to trigger the processor's data-cache loading circuitry before it processes the data (or registers).
If you can, organize the data to fit in one "line" of your processor's cache. Sequential locations require less time than random access locations.
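A sketch of that idea with made-up field names: frequently used ("hot") members packed together so many objects fit per cache line, rarely touched ("cold") members moved out of the way:

struct ParticleCold {
    char   debugName[64];
    double creationTime;
};

struct Particle {
    float x, y, z;          // hot: read/written every frame
    float vx, vy, vz;
    ParticleCold* cold;     // cold: followed only occasionally
};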
There are always things that "help" vs. "hinder" the execution in the pipeline, but for most general purpose code that isn't highly specialized, I would expect that performance from compiled code is about as good as the best you can get without highly specialized code for each model of processor. If you have a controlled system, where all of your machines are using the same (or a small number of similar) processor model, and you know that 99% of the time is spent in this particular function, then there may be a benefit to optimizing that particular function to become more efficient.
In your case, it being HPC, it may well be beneficial to handwrite some of the low-level code (e.g. matrix multiplication) to be optimized for the processor you are running on. This does take some reasonable amount of understanding of the processor however, so you need to study the optimization guides for that processor model, and if you can, talk to people who've worked on that processor before.
Some of the things you'd look at is "register to register dependencies" - where you need the result of c = a + b to calculate x = c + d - so you try to separate these with some other useful work, such that the calculation of x doesn't get held up by the c = a + b calculation.
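A sketch of the principle (out-of-order CPUs and optimizing compilers already do much of this scheduling, so treat it as an illustration rather than a recipe):

void interleave(double a, double b, double d, double p, double q,
                double* out1, double* out2) {
    double c = a + b;   // x below needs this result
    double y = p * q;   // independent work placed between the dependent pair
    double x = c + d;   // less likely to stall waiting on c = a + b
    *out1 = x;
    *out2 = y;
}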
Cache-prefetching and generally caring about how the caches are used is also a useful thing to look at: not evicting useful cached data that you'll need 100 instructions later, while storing a resulting 1 MB array that won't be used again for several seconds, can be worth a lot of processor time.
It's hard(er) to control these things when compilers decide to shuffle the code around in their own optimisation passes, so handwritten assembler is pretty much the only way to go.

Performance of breaking apart one loop into two loops

Good Day,
Suppose that you have a simple for loop like below...
for (int i = 0; i < 10; i++)
{
    //statement 1
    //statement 2
}
Assume that statement 1 and statement 2 were O(1). Besides the small overhead of "starting" another loop, would breaking down that for loop into two (not nested, but sequential) loops be as equally fast? For example...
for (int i = 0; i < 10; i++)
{
    //statement 1
}
for (int i = 0; i < 10; i++)
{
    //statement 2
}
Why I ask such a silly question is that I have a Collision Detection System (CDS) that has to loop through all the objects. I want to "compartmentalize" the functionality of my CDS system so I can simply call
cds.update(objectlist);
instead of having to break my CDS system up. (Don't worry too much about my CDS implementation... I think I know what I am doing, I just don't know how to explain it. What I really need to know is whether I take a huge performance hit by looping through all my objects again.)
It depends on your application.
Possible Drawbacks (of splitting):
your data does not fit into the L1 data cache, therefore you load it once for the first loop and then reload it for the second loop
Possible Gains (of splitting):
your loop contains many variables, splitting helps reducing register/stack pressure and the optimizer turns it into better machine code
the functions you use trash the L1 instruction cache so the cache is loaded on each iteration, while by splitting you manage to load it once (only) at the first iteration of each loop
These lists are certainly not comprehensive, but already you can sense that there is a tension between code and data. So it is difficult for us to make even an educated guess when we know neither.
In doubt: profile. Use callgrind, check the cache misses in each case, check the number of instructions executed. Measure the time spent.
In terms of algorithmic complexity splitting the loops makes no difference.
In terms of real world performance splitting the loops could improve performance, worsen performance or make no difference - it depends on the OS, hardware and - of course - what statement 1 and statement 2 are.
As noted, the complexity remains.
But in the real world, it is impossible for us to predict which version runs faster. The following are factors that play roles, huge ones:
Data caching
Instruction caching
Speculative execution
Branch prediction
Branch target buffers
Number of available registers on the CPU
Cache sizes
(note: over all of them hangs the Damocles sword of misprediction; all are easy to look up)
Especially the last factor sometimes makes it impossible to compile a single binary that is optimal for code whose performance relies on specific cache sizes. Some applications will run faster on a CPU with huge caches while running slower on small caches, and for other applications it will be the opposite.
Solutions:
Let your compiler do the job of loop transformation. Modern g++'s are quite good in that discipline. Another discipline that g++ is good at is automatic vectorization. Be aware that compilers know more about computer architecture than almost all people.
Ship different binaries and a dispatcher.
Use cache-oblivious data structures/layouts and algorithms that adapt to the target cache.
It is always a good idea to endeavor for software that adapts to the target, ideally without sacrificing code quality. And before doing manual optimization, either microscopic or macroscopic, measure real world runs, then and only then optimize.
Literature:
* Agner Fog's Guides
* Intel's Guides
With two loops you will be paying for:
increased generated code size
twice as many branch predictions
depending on the data layout of statements 1 and 2, you could be reloading data into the cache.
The last point could have a huge impact in either direction. You should measure as with any perf optimization.
As far as the big-o complexity is concerned, this doesn't make a difference if 1 loop is O(n), then so is the 2 loop solution.
As far as micro-optimisation goes, it is hard to say. The cost of a loop is rather small, and we don't know what the cost of accessing your objects is (if they are in a vector, then it should be rather small too), but there is a lot to consider to give a useful answer.
You're correct in noting that there will be some performance overhead from creating a second loop. Therefore, it cannot be "equally fast": this overhead, while small, is still overhead.
I won't try to speak intelligently about how collision systems should be built, but if you're trying to optimize performance it's better to avoid building unnecessary control structures if you can manage it without pulling your hair out.
Remember that premature optimization is one of the worst things you can do. Worry about optimization when you have a performance problem, in my opinion.

Effective optimization strategies on modern C++ compilers

I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to start shaving cycles from the hot spots.
It's well-known that some optimizations, e.g. loop unrolling, are handled these days much more effectively by the compiler than by a programmer meddling by hand. Which techniques are still worthwhile? Obviously, I'll run everything I try through a profiler, but if there's conventional wisdom as to what tends to work and what doesn't, it would save me significant time.
I know that optimization is very compiler- and architecture- dependent. I'm using Intel's C++ compiler targeting the Core 2 Duo, but I'm also interested in what works well for gcc, or for "any modern compiler."
Here are some concrete ideas I'm considering:
Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving method calls like strlen() out of the termination conditions of loops. Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand?
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
Lastly, to nip certain kinds of answers in the bud:
I understand that optimization has a cost in terms of complexity, reliability, and maintainability. For this particular application, increased performance is worth these costs.
I understand that the best optimizations are often to improve the high-level algorithms, and this has already been done.
Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
I assume you're aware that the STL containers rely on copying the elements. In certain cases, this can be a significant loss. Store pointers and you may see an increase in performance if you do a lot of container manipulation. On the other hand, it may reduce cache locality and hurt you. Another option is to use specialized allocators.
Certain containers (e.g. map, set, list) rely on lots of pointer manipulation. Although counterintuitive, it can often lead to faster code to replace them with vector. The resulting algorithm might go from O(1) or O(log n) to O(n), but due to cache locality it can be much faster in practice. Profile to be sure.
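For example, a sorted std::vector searched with std::lower_bound keeps the O(log n) lookups of std::map while staying contiguous (a minimal sketch):

#include <algorithm>
#include <vector>

bool contains(const std::vector<int>& sorted, int key) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), key);
    return it != sorted.end() && *it == key;   // cache-friendly: one contiguous block
}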
You mentioned you're using priority_queue, which I would imagine pays a lot for rearranging the elements, especially if they're large. You can try switching the underlying container (maybe deque or specialized). I'd almost certainly store pointers - again, profile to be sure.
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
Again, this may help a small amount, depending on the use case. You can avoid the heap allocation, but only if you don't need your array to outlive the stack... or you could reserve() the size in the vector so there is less copying on reallocation.
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
You could look at the generated assembly to see if RVO is applied, but if you return a pointer or reference, you can be sure there's no copy. Whether this will help depends on what you're doing - e.g. you can't return references to temporaries. You can use arenas to allocate and reuse objects, so as not to pay a large heap penalty.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
I've seen dramatic (seriously dramatic) speedups in this realm. I saw more improvements from this than I later saw from multithreading my code. Things may have changed in the five years since - only one way to be sure - profile.
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
Use explicit on your single argument constructors. Temporary object construction and destruction may be hidden in your code.
Be aware of hidden copy constructor calls on large objects. In some cases, consider replacing with pointers.
Profile, profile, profile. Tune areas that are bottlenecks.
Take a look at the excellent Pitfalls of Object-Oriented Programming slides for some info about restructuring code for locality. In my experience getting better locality is almost always the biggest win.
General process:
Learn to love the Disassembly View in your debugger, or have your build system generate the intermediate assembly files (.s) if at all possible. Keep an eye on changes or for things that look egregious -- even without familiarity with a given instruction set architecture, you should be able to see some things fairly clearly! (I sometimes check in a series of .s files with corresponding .cpp/.c changes, just to leverage the lovely tools from my SCM to watch the code and corresponding asm change over time.)
Get a profiler that can watch your CPU's performance counters, or can at least guess at cache misses. (AMD CodeAnalyst, cachegrind, vTune, etc.)
Some other specific things:
Understand strict aliasing. Once you do, make use of restrict if your compiler has it. (Examine the disasm here too!)
Check out different floating point modes on your processor and compiler. If you don't need the denormalized range, choosing a mode without this can result in better performance. (It sounds like you've already done some things in this area, based on your discussion of rounding modes.)
Definitely avoid allocs: call reserve on std::vector when you can, or use std::array when you know the size at compile-time.
Use memory pools to increase locality and decrease alloc/free overhead; also to ensure cacheline alignment and prevent ping-ponging.
Use frame allocators if you're allocating things in predictable patterns, and can afford to deallocate everything in one go.
Do be aware of invariants. Something you know is invariant may not be to the compiler, for example a use of a struct or class member in a loop. I find the single easiest way to fall into the correct habit here is to give a name to everything, and prefer to name things outside of loops. E.g. const int threshold = m_currentThreshold; or perhaps Thing * const pThing = pStructHoldingThing->pThing; Fortunately you can usually see things that need this treatment in the disassembly view. This also helps with debugging later (makes the watch/locals window behave much more nicely in debug builds)!
Avoid writes in loops if possible -- accumulate first, then write, or batch a few writes together. YMMV, of course.
WRT your std::priority_queue question: inserting things into a vector (the default backend for a priority_queue) tends to move a lot of elements around. If you can break up into phases, where you insert data, then sort it, then read it once it's sorted, you'll probably be a lot better off. Although you'll definitely lose locality, you may find a more self-ordering structure like a std::map or std::set worth the overhead -- but this is really dependent on your usage patterns.
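A minimal sketch of that phase split, with a made-up Item type: if all insertions finish before any extraction, one sort replaces many heap adjustments and reads the data in a cache-friendly sweep.

#include <algorithm>
#include <vector>

struct Item { int priority; /* ... payload ... */ };

void processByPriority(std::vector<Item>& items) {
    std::sort(items.begin(), items.end(),
              [](const Item& a, const Item& b) { return a.priority > b.priority; });
    for (const Item& item : items) {
        (void)item;   // consume items in descending priority order
    }
}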
Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
I would only consider this as a last option. The STL containers and algorithms have been thoroughly tested. Creating new ones are expensive in terms of development time.
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
First, try reserving space for the vectors. Check out the std::vector::reserve method. A vector that keeps growing or changing to larger sizes is going to waste dynamic memory and execution time. Add some code to determine a good value for an upper bound.
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
As a matter of principle, always pass large structures by reference or pointer. Prefer passing by constant reference. If you are using pointers, consider using smart pointers.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Modern compilers are very aware of instruction caches and pipelines and try to keep them from being reloaded. You can always assist your compiler by writing code that uses fewer branches (from if, switch, loop constructs and function calls).
You may see more significant performance gain by adjusting your program to optimize the data cache. Search the web for Data Driven Design. There are many excellent articles on this topic.
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
For accuracy, keep everything as a double. Adjust for rounding only when necessary and perhaps before displaying. This falls under the optimization rule, Use less code, eliminate extraneous or deadwood code.
Also see the section above about reserving space in containers before using them.
Some processors can load and store floating point numbers either faster than or as fast as integers. This would require gathering profile data before optimizing. However, if you know there is a minimal resolution, you could use integers and change your base to that minimal resolution. For example, when dealing with U.S. money, integers can be used to represent 1/100 or 1/1000 of a dollar.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving method calls like strlen() out of the termination conditions of loops. Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand?
This is an incorrect assumption. Compilers can optimize based on a function's signature, especially if the parameters correctly use const. I always like to assist the compiler by moving constant stuff outside of the loop. For an upper limit value, such as a string length, assign it to a const variable before the loop. The const modifier will assist the optimizer.
There is always the count-down optimization in loops. For many processors, a jump on register equals zero is more efficient than compare and jump if less than.
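A minimal sketch of the count-down idiom: the loop condition becomes a test against zero, which many processors get almost for free from the decrement itself.

double sum_countdown(const double* a, int n) {
    double s = 0.0;
    for (int i = n; i-- > 0; )   // visits n-1, n-2, ..., 0
        s += a[i];
    return s;
}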
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
I would avoid "micro optimizations". If you have any doubts, print out the assembly code generated by the compiler (for the area you are questioning) under the highest optimization setting. Try rewriting the code to express the compiler's assembly code. Optimize this code, if you can. Anything more requires platform specific instructions.
Optimization Ideas & Concepts
1. Computers prefer to execute sequential instructions.
Branching upsets them. Some modern processors have enough instruction cache to contain code for small loops. When in doubt, don't cause branches.
2. Eliminate Requirements
Less code, more performance.
3. Optimize designs before code
Oftentimes, more performance can be gained by changing the design than by changing the implementation of the design. A simpler design means less code, which yields more performance.
4. Consider data organization
Optimize the data.
Organize frequently used fields into substructures.
Set data sizes to fit into a data cache line.
Remove constant data out of data structures.
Use const specifier as much as possible.
5. Consider page swapping
Operating systems will swap out your program or task for another one, often into a 'swap file' on the hard drive. Breaking up the code into chunks that contain heavily executed code and less executed code will assist the OS. Also, consolidate heavily used code into tighter units. The idea is to reduce the swapping of code from the hard drive (such as fetching "far" functions). If code must be swapped out, it should be swapped as one unit.
6. Consider I/O optimizations
(Includes file I/O too).
Most I/O prefers fewer large chunks of data to many small chunks of data. Hard drives like to keep spinning. Larger data packets have less overhead than smaller packets.
Format data into a buffer then write the buffer.
7. Eliminate the competition
Get rid of any programs and tasks that are competing against your application for the processor(s), such as virus scanning and playing music. Even I/O drivers want a piece of the action (which is why you want to reduce the number of I/O transactions).
These should keep you busy for a while. :-)
Use of memory buffer pools can be of great performance benefit vs. dynamic allocation. More so if they reduce or prevent heap fragmentation over long execution runs.
Be aware of data location. If you have a significant mix of local vs. global data you may be overworking the cache mechanism. Try to keep data sets in close proximity to make maximum use of cache line validity.
Even though compilers do a wonderful job with loops, I still scrutinize them when performance tuning. You can spot architectural flaws that yield orders of magnitude where the compiler may only trim percentages.
If a single priority queue is using a lot of time in its operation, there may be benefit to creating a battery of queues representing buckets of priority. It would be complexity being traded for speed in this case.
I notice you didn't mention the use of SSE type instructions. Could they be applicable to your type of number crunching?
Best of luck.
Here is a nice paper on the subject.
About STL containers.
Most people here claim STL offers one of the fastest possible implementations of the container algorithms. And I say the opposite: for most real-world scenarios the STL containers, taken as-is, yield really catastrophic performance.
People argue about the complexity of the algorithms used in STL. Here STL is good: O(1) for list/queue and vector (amortized), and O(log N) for map. But this is not the real bottleneck of performance for a typical application! For many applications the real bottleneck is the heap operations (malloc/free, new/delete, etc.).
A typical operation on a list costs just a few CPU cycles. On a map, some tens, maybe more (this depends on the cache state and log(N), of course). And typical heap operations cost from hundreds to thousands (!!!) of CPU cycles. For multithreaded applications, for instance, they also require synchronization (interlocked operations). Plus on some OSs (such as Windows XP) the heap functions are implemented entirely in kernel mode.
So the actual performance of the STL containers in a typical scenario is dominated by the number of heap operations they perform. And here they're disastrous. Not because they're implemented poorly, but because of their design. That is, this is a question of design.
On the other hand, there are other containers which are designed differently.
Once I designed and wrote such containers for my own needs:
http://www.codeproject.com/KB/recipes/Containers.aspx
And it proved to be superior from the performance point of view, and not only that.
But recently I discovered I'm not the only one who has thought about this.
boost::intrusive is a container library implemented in a manner similar to what I did back then.
I suggest you try it (if you haven't already).
Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
Generally, not unless you're working with a poor implementation. I wouldn't replace an STL container or algorithm just because you think you can write tighter code. I'd do it only if the STL version is more general than it needs to be for your problem. If you can write a simpler version that does just what you need, then there might be some speed to gain there.
One exception I've seen is to replace a copy-on-write std::string with one that doesn't require thread synchronization.
for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
Unlikely. But if you're using a lot of time allocating up to a certain size, it might be profitable to add a reserve() call.
performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference.
When working with containers, I pass iterators for the inputs and an output iterator, which is still pretty general.
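A minimal sketch of that style (the function name is made up): the caller chooses the input and output containers, and no temporary is returned by value. A caller might write square_roots(v.begin(), v.end(), std::back_inserter(results));.

#include <cmath>

template <class InIt, class OutIt>
OutIt square_roots(InIt first, InIt last, OutIt out) {
    for (; first != last; ++first)
        *out++ = std::sqrt(*first);   // write each result through the output iterator
    return out;
}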
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Not very. Yes. I find that missed branch predictions and cache-hostile memory access patterns are the two biggest killers of performance (once you've gotten to reasonable algorithms). A lot of older code uses "early out" tests to reduce calculations. But on modern processors, that's often more expensive than doing the math and ignoring the result.
A significant bottleneck in my code used to be conversions from floating point to integers
Yup. I recently discovered the same issue.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving method calls like strlen() out of the termination conditions of loops.
Some compilers can deal with this. Visual C++ has a "link-time code generation" option that effectively re-invokes the compiler to do further optimization. And, in the case of functions like strlen, many compilers will recognize it as an intrinsic function.
Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand? On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
When you're optimizing at this low level, there are few reliable rules of thumb. Compilers will vary. Measure your current solution, and decide if it's too slow. If it is, come up with a hypothesis (e.g., "What if I replace the inner if-statements with a look-up table?"). It might help ("eliminates stalls due to failed branch predictions") or it might hurt ("look-up access pattern hurts cache coherence"). Experiment and measure incrementally.
I'll often clone the straightforward implementation and use an #ifdef HAND_OPTIMIZED/#else/#endif to switch between the reference version and the tweaked version. It's useful for later code maintenance and validation. I commit each successful experiment to change control, and keep a log (spreadsheet) with the changelist number, run times, and explanation for each step in optimization. As I learn more about how the code behaves, the log makes it easy to back up and branch off in another direction.
You need a framework for running reproducible timing tests and to compare results to the reference version to make sure you don't inadvertently introduce bugs.
If I were working on this, I would expect an end-stage where things like cache locality and vector operations would come into play.
However, before getting to the end stage, I would expect to find a series of problems of different sizes having less to do with compiler-level optimization, and more to do with odd stuff going on that could never be guessed, but once found, are simple to fix. Usually they revolve around class overdesign and data structure issues.
Here's an example of this kind of process.
I have found that generalized container classes with iterators, which in principle the compiler can optimize down to minimal cycles, often are not so optimized for some obscure reason. I've also heard other cases on SO where this happens.
Others have said, before you do anything else, profile. I agree with that approach except I think there's a better way, and it's indicated in that link. Whenever I find myself asking if some specific thing, like STL, could be a problem, I just might be right - BUT - I'm guessing. The fundamental winning idea in performance tuning is find out, don't guess. It is easy to find out for sure what is taking the time, so don't guess.
Here is some stuff I have used:
templates to specialize innermost loop bounds (makes them really fast)
use __restrict__ keywords for aliasing problems
reserve vectors beforehand to sane defaults
avoid using map (it can be really slow)
vector append/insert can be significantly slow; if that is the case, raw operations may make it faster
N-byte memory alignment (Intel has pragma aligned, http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm)
trying to keep memory within L1/L2 caches
compile with NDEBUG
profile using oprofile, use opannotate to look for specific lines (STL overhead is clearly visible then)
here are sample parts of profile data (so you know where to look for problems)
* Output annotated source file with samples
* Output all files
*
* CPU: Core 2, speed 1995 MHz (estimated)
--
* Total samples for file : "/home/andrey/gamess/source/blas.f"
*
* 1020586 14.0896
--
* Total samples for file : "/home/andrey/libqc/rysq/src/fock.cpp"
*
* 962558 13.2885
--
* Total samples for file : "/usr/include/boost/numeric/ublas/detail/matrix_assign.hpp"
*
* 748150 10.3285
--
* Total samples for file : "/usr/include/boost/numeric/ublas/functional.hpp"
*
* 639714 8.8315
--
* Total samples for file : "/home/andrey/gamess/source/eigen.f"
*
* 429129 5.9243
--
* Total samples for file : "/usr/include/c++/4.3/bits/stl_algobase.h"
*
* 411725 5.6840
--
example of code from my project
template<int ni, int nj, int nk, int nl>
inline void eval(const Data::density_type &D, const Data::fock_type &F,
                 const double *__restrict Q, double scale) {
    const double * __restrict Dij = D[0];
    ...
    double * __restrict Fij = F[0];
    ...
    for (int l = 0, kl = 0, ijkl = 0; l < nl; ++l) {
        for (int k = 0; k < nk; ++k, ++kl) {
            for (int j = 0, ij = 0; j < nj; ++j, ++jk, ++jl) {
                for (int i = 0; i < ni; ++i, ++ij, ++ik, ++il, ++ijkl) {
And I think the main hint anyone could give you is: measure, measure, measure. That and improving your algorithms.
The way you use certain language features, the compiler version, the standard library implementation, the platform, the machine - all play their role in performance, and you haven't mentioned many of those; and none of us has ever had your exact setup.
Regarding replacing std::vector: use a drop-in replacement (e.g., this one) and just try it out.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
I can't speak for all compilers, but my experience with GCC shows that it will not heavily optimize code with respect to the cache. I would expect this to be true for most modern compilers. Optimization such as reordering nested loops can definitely affect performance. If you believe that you have memory access patterns that could lead to many cache misses, it will be in your interest to investigate this.
Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
The STL is generally the fastest, general case. If you have a very specific case, you might see a speed-up with a hand-rolled one. For example, std::sort (normally quicksort) is the fastest general sort, but if you know in advance that your elements are virtually already ordered, then insertion sort might be a better choice.
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
This depends on where you are going to do the static allocation. One thing I tried along this line was to statically allocate a large amount of memory on the stack, then re-use it later. Results? Heap memory was substantially faster. Just because an item is on the stack doesn't make it faster to access - the speed of stack memory also depends on things like caching. A statically allocated global array may not be any faster than the heap. I assume that you have already tried techniques like just reserving the upper bound. If you have a lot of vectors that have the same upper bound, consider improving cache behaviour by having a vector of structs which contain the data members.
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
I personally normally pass the result in by reference in this scenario. It allows for a lot more re-use. Passing large data structures by value and hoping that the compiler uses RVO is not a good idea when you can just manually use RVO yourself.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
I found that they weren't particularly cache-aware. The issue is that the compiler doesn't understand your program and can't predict the vast majority of its state, especially if you depend heavily on the heap. If you have a profiler that ships with your compiler, for example Visual Studio's Profile Guided Optimization, then this can produce excellent speedups.
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
There are different floating-point models - Visual Studio gives an fp:fast compiler setting. As for the exact effects of doing such, I can't be certain. However, you could try altering the floating point precision or other settings in your compiler and checking the result.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving method calls like strlen() out of the termination conditions of loops. Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand?
I've never come across such a scenario. However, if you're genuinely concerned about such, then the option remains to do it manually. One of the things that you could try is calling a function on a const reference, suggesting to the compiler that the value won't change.
One of the other things that I want to point out is the use of non-standard extensions to the compiler, for example provided by Visual Studio is __assume. http://msdn.microsoft.com/en-us/library/1b3fsfxw(VS.80).aspx
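A small, hedged example of that extension (MSVC-specific; GCC and Clang have __builtin_unreachable() as a rough equivalent): telling the compiler a path is unreachable lets it drop the range check and the dead branch.

int handle(int kind) {
    switch (kind) {
    case 0: return 10;
    case 1: return 20;
    case 2: return 30;
    default:
#if defined(_MSC_VER)
        __assume(0);               // tell MSVC this path is unreachable
#else
        __builtin_unreachable();   // GCC/Clang equivalent
#endif
    }
}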
There's also multithreading, though I expect you've already gone down that road. Beyond that you could try some specific optimisations, like the SSE another answer suggested.
Edit: I realized that a lot of the suggestions I posted reference Visual Studio directly. That's true, but GCC almost certainly provides alternatives to the majority of them; I just have the most personal experience with VS.
The STL priority queue implementation is fairly well-optimized for what it does, but certain kinds of heaps have special properties that can improve your performance on certain algorithms. Fibonacci heaps are one example. Also, if you're storing objects with a small key and a large amount of satellite data, you'll get a major improvement in cache performance if you store that data separately, even if it means storing one extra pointer per object.
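A sketch of that key/satellite split (the type names are invented): keep the heap entries small, and look the heavy payload up by index only when an entry is popped, so sift-up/sift-down moves a handful of bytes per swap instead of the whole object.

#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

struct BigPayload {                 // large satellite data, kept out of the heap
    char blob[256];
};

// The heap holds only (key, index) pairs.
using Entry = std::pair<double, std::size_t>;

void example() {
    std::vector<BigPayload> payloads(1000);
    std::priority_queue<Entry> pq;
    pq.push({3.14, 42});                        // key plus index into `payloads`
    const Entry top = pq.top();
    BigPayload& data = payloads[top.second];    // fetch satellite data lazily
    (void)data;
    pq.pop();
}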
As for arrays, I've found std::vector to even slightly out-perform compile-time-constant arrays. That said, its optimizations are general, and specific knowledge of your algorithm's access patterns may allow you to optimize further for cache locality, alignment, coloring, etc. If you find that your performance drops significantly past a certain threshold due to cache effects, hand-optimized arrays may move that problem size threshold by as much as a factor of two in some cases, but it's unlikely to make a huge difference for small inner loops that fit easily within the cache, or large working sets that exceed the size of any CPU cache. Work on the priority queue first.
Most of the overhead of dynamic memory allocation is constant with respect to the size of the object being allocated. Allocating one large object and returning it by a pointer isn't going to hurt nearly as much as copying it. The threshold for copying vs. dynamic allocation varies greatly between systems, but it should be fairly consistent within a chip generation.
Compilers are quite cache-aware when cpu-specific tuning is turned on, but they don't know the size of the cache. If you're optimizing for cache size, you may want to detect that or have the user specify it at run-time, since that will vary even between processors of the same generation.
As for floating point, you absolutely should be using SSE. This doesn't necessarily require learning SSE yourself, as there are many libraries of highly-optimized SSE code that do all sorts of important scientific computing operations. If you're compiling 64-bit code, the compiler might emit some SSE code automatically, as SSE2 is part of the x86_64 instruction set. SSE will also save you some of the overhead of x87 floating point, since it's not converting back and forth to 80-bit values internally. Those conversions can also be a source of accuracy problems, since you can get different results from the same set of operations depending on how they get compiled, so it's good to be rid of them.
If you work on big matrices for instance, consider tiling your loops to improve the locality. This often leads to dramatic improvements. You can use VTune/PTU to monitor the L2 cache misses.
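A minimal sketch of tiling (also called blocking); the block size here is a guess and would need tuning against the actual L2 size with a profiler such as VTune:

#include <cstddef>
#include <vector>

// Transpose an n-by-n row-major matrix in B-by-B tiles so each tile stays
// resident in cache while it is being worked on.
void tiled_transpose(const std::vector<float>& src, std::vector<float>& dst,
                     std::size_t n) {
    constexpr std::size_t B = 64;               // block size: tune for your cache
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < ii + B && i < n; ++i)
                for (std::size_t j = jj; j < jj + B && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}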
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
On some compilers this is incorrect. The compiler can have perfect knowledge of all code across all translation units (including static libraries) and can optimize the code the same way it would if it were in a single translation unit. A few that support this feature come to mind:
Microsoft Visual C++ compilers
Intel C++ Compiler
LLVM-GCC
GCC (I think, not sure)
I'm surprised no one has mentioned these two:
Link-time optimization: clang and g++ from 4.5 on support link-time optimization. I've heard that in g++'s case the heuristics are still pretty immature, but they should improve quickly now that the main architecture is laid out.
Benefits range over interprocedural optimizations at the object-file level, including highly sought-after things like inlining of virtual calls (devirtualization).
Project inlining: this might seem to some like a very crude approach, but it is that very crudeness which makes it so powerful: it amounts to dumping all your headers and .cpp files into a single, really big .cpp file and compiling that; basically it gives you the same benefits as link-time optimization in your trip back to 1999. Of course, if your project is really big, you'll still need a 2010 machine; this thing will eat your RAM like there is no tomorrow. However, even in that case, you can split it into more than one not-so-damn-huge .cpp file.
If you are doing heavy floating point math you should consider using SSE to vectorize your computations if that maps well to your problem.
Google SSE intrinsics for more information about this.
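A tiny sketch of what the intrinsics look like (x86 only; it assumes the array length is a multiple of four floats, and a real version would handle the tail and alignment):

#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

// c[i] = a[i] + b[i], four floats per iteration.
void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // unaligned loads for simplicity
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}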
Here is something that worked for me once. I can't say that it will work for you. I had code along the lines of
switch(num) {
case 1: result = f1(param); break;
case 2: result = f2(param); break;
//...
}
Then I got a serious performance boost when I changed it to
// init: a table of function pointers; ResultType/ParamType stand in for
// whatever f1, f2, ... actually return and take.
// Note: case 1 in the switch maps to index 0 here.
ResultType (*funcs[])(ParamType) = { f1, f2 /*...*/ };
// later in the code:
result = funcs[num - 1](param);
Perhaps someone here can explain the reason the latter version is better. I suppose it has something to do with the fact that there are no conditional branches there.
My current project is a media server with multithreaded processing (C++). It's a time-critical application; low-performance functions can cause bad results in media streaming, like loss of sync, high latency, and huge delays.
The strategy I usually use to guarantee the best possible performance is to minimize the number of heavy operating-system calls that allocate or manage resources like memory, files, and sockets.
At first I wrote my own STL-style container, network, and file-management classes.
All my container classes ("MySTL") manage their own memory blocks to avoid multiple alloc (new) / free (delete) calls. Released objects are enqueued on a memory-block pool to be reused when needed. That way I improve performance and protect my code against memory fragmentation.
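A very reduced sketch of that pooling idea (not the author's actual "MySTL" code): released objects go onto a free list and are handed back out before the pool ever touches new/delete again.

#include <vector>

// Minimal fixed-type object pool: objects are recycled through a free list
// instead of being returned to the heap, reducing allocation calls and
// fragmentation.
template <typename T>
class Pool {
public:
    T* acquire() {
        if (free_.empty())
            return new T();              // only allocate when the pool is empty
        T* obj = free_.back();
        free_.pop_back();
        return obj;
    }
    void release(T* obj) { free_.push_back(obj); }   // recycle, don't delete
    ~Pool() { for (T* p : free_) delete p; }
private:
    std::vector<T*> free_;
};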
For the parts of the code that need to access lower-performance system resources (like files, databases, scripts, or network writes), I use separate threads - but not one thread per unit (for example, not one thread per socket), since the operating system would lose performance managing such a high number of threads. Instead, you can group objects of the same class to be processed on one separate thread where possible.
For example, if you have to write data to a network socket but the socket's write buffer is full, I save the data in a send-queue buffer (which shares memory across all sockets) to be sent from a separate thread as soon as the socket becomes writeable again. That way your main threads never stop processing in a blocked state, waiting for the operating system to free a specific resource. All released buffers are saved and reused when needed.
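A compressed sketch of that non-blocking send path (names invented, not the author's code): the main threads only enqueue, and a dedicated writer thread drains the queue when the socket can accept data again.

#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

// Shared outgoing buffer: producers never block on the socket itself.
struct SendQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::vector<char>> pending;

    void enqueue(std::vector<char> data) {       // called from main threads
        { std::lock_guard<std::mutex> lk(m); pending.push_back(std::move(data)); }
        cv.notify_one();
    }

    std::vector<char> wait_and_pop() {           // called from the writer thread
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !pending.empty(); });
        std::vector<char> data = std::move(pending.front());
        pending.pop_front();
        return data;                             // writer sends it when the socket is ready
    }
};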
Finally, a profiling tool is welcome for finding program bottlenecks and showing which algorithms should be improved.
I have succeeded with that strategy: I have servers running for 500+ days on a Linux machine without rebooting, with thousands of users logging in every day.
[02:01] -alpha.ip.tv- Uptime: 525days 12hrs 43mins 7secs