Performance of C++ Operators

Performance of C++ Operators - c++

Is there any sort of performance difference between the arithmetic operators in c++, or do they all run equally fast? E.g. is "++" faster than "+=1"? What about "+=10000"? Does it make a significant difference if the numbers are floats instead of integers? Does "*" take appreciably longer than "+"?
I tried performing 1 billion each of "++", "+=1", and "+=10000". The strange thing is that the number of clock cycles (according to time.h) is actually counterintuitive. One might expect that if any of them are the fastest, it is "++", followed by "+=1", then "+=10000", but the data shows a slight trend in the opposite direction. The difference is more pronounced on 10 billion operations. This is all for integers.
I am dabbling in scientific computing, so I wanted to test the performance of operators. If any of the operators operated in time that was linear in terms of the inputs, for example.

About your edit, the language says nothing about the architecture it's running on. Your question is platform dependent.
That said, typically all fundamental data-type operations have a one-to-one correspondence to assembly.
x86 for example has an instruction which increments a value by 1, which i++ or i += 1 would translate into. Addition and multiplication also have single instructions.
Hardware-wise, it's fairly obvious that adding or multiplying numbers is at least linear in the number of bits in the numbers. Because the hardware has a constant number of bits, it's O(1).
Floats have their own processing unit, usually, which also has single instructions for operations.
Does it matter?
Why not write the code that does what you need it to do. If you want to add one, use ++. If you want to add a large number, add a large number. If you need floats, use floats. If you need to multiply two numbers, then multiply them.
The compiler will figure out the best way to do what you want, so instead of trying to be tricky, do what you need and let it do the hard work.
After you've written your working code, and you decide it's too slow, profile it and find out why. You'll find it's not silly things like multiplying versus adding, but rather going about the entire (sub-)problem in the wrong way.
Practically, all of the operators you listed will be done in a single CPU instruction anyway, on desktop platforms.

No, no, yes*, yes*, respectively.
* but do you really care?
EDIT: to give some kind of idea with a modern processor, you may be able to do 200 integer additions in the time it takes to make one memory access, and only 50 integer multiplications. If you think about it, you're still going to be bound by the memory accesses most of the time.

What you are asking is: What basic operations get transformed into which assembly instructions and what is the performance of those instructions on my specific architecture. And this is also your answer: The code they get translated to is dependant on your compiler and it's knowledge of your architecture, their performance depends on your architecture.
Mind you: in C++ operators can be overloaded for user defined types. They can behave differently from built-in types and the implementation of the overload can be non-trivial (no just one instruction).
Edit: A hint for testing. Most compilers support outputting the generated assembly code. The option for gcc is -S. If you use some other compiler have a look at their documentation.

The best answer is to time it with your compiler.

Look up the optimization manuals for your CPU. That's the only place you're going to find answers.
Get your compiler to output the generated assembly. Download the manuals for your CPU. Look up the instructions used by the compiler in the manual, and you know how they perform.
Of course, this presumes that you already know the basics of how a pipelined, superscalar out-of-order CPU operates, what branch prediction, instruction and data cache and everything else means. Do your homework.
Performance is a ridiculously complicated subject. Depending on context, floating-point code may be as fast as (or faster than) integer code, or it may be four times slower. Usually branches carry almost no penalty, but in special cases, they can be crippling. Sometimes, recomputing data is more efficient than caching it, and sometimes not.
Understand your programming language. Understand your compiler. Understand your CPU. And then examine exactly what the compiler is doing in your case, by profiling/timing, and on when necessary by examining the individual instructions. (and when timing your code, be aware of all the caveats and gotchas that can invalidate your benchmarks: Make sure optimizations are enabled, but also that the code you're trying to measure isn't optimized away. Take the cache into account (if the data is already in the CPU cache, it'll run much faster. If it has to read from physical memory to begin with, it'll take extra time. Both can invalidate your measurements if you're not careful. Keep in mind what you want to measure exactly)
For your specific examples, why should ++i be faster than i += 1? They do the exact same thing? Sometimes, it may make a difference whether you're adding a constant or a variable, but in this case, you're adding the constant one in both cases.
And in general, instructions take a fixed constant time regardless of their operands. adding one to something takes just as long as adding -2000 or 1772051912. The same goes for multiplication or division.
But if you care about performance, you need to understand how the entire technology stack works, not just rely on a few simple rules of thumb like "integer is faster than floating point, and ++ is faster than +=" (Apart from anything else, such simple rules of thumb are almost never true, at least not in every case)

Here is a twist on your evaluations: try Loop Unrolling. Loop unrolling is where you repeat the same statements in a loop to reduce the number of iterations in the loop.
Most modern processors hate branch instructions. The processors have a queue of pre-fetched instructions, which speeds up processing. They really hate branch instructions, because the processor has to clear out the queue and reload it after a branch. This takes more time than just processing sequential instructions.
When coding for processing time, try to minimize the number of branches, which can occur in loop constructs and decision constructs.

Depends on architecture, the built in operators for integer arithmetic translate directly to assembly (as I understand it) ++, +=1, and += 10000 are probably equally fast, multiplication would depend on the platform, overloaded operators would depend on you

Donald Knuth : "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
unless you are writing extremely time critical software, you should probably worry about other things

Short answer: you should turn optimizations on before measuring.
The long answer: If you turned optimizations on, you're performing the operations on integers, and still you get different times for ++i; and i += 1;, then it's probably time to get a better compiler -- the two statements have exactly the same semantics and a competent compiler should translate them into the same instruction sequence.

"Does it make a significant difference if the numbers are floats instead of integers?"
-It depends on what kind of processor you are running on. Integer operations are faster on current x86 compatible CPUs.
About i++ and i+=1: there shouldn't be a difference with any good compiler, while you may expect i+=10000 to be slightly slower on x86 CPUs.
"Does "*" take appreciably longer than "+"?"
-Typically yes.
Note that you may run into all sorts of bottlenecks, in which case the speed difference between the operations doesn't show up. Eg. memory bandwidth, CPU pipeline stall due to data dependencies, etc...

The performance problems caused by C++ operators do not come from the operators and not from the operators implementation. It comes from the syntax, from hidden code being run without you knowing.
The best example, is implementing quick sort, on an object which has the operator[] implemented, but internally it's using a linked list. Now instead of O(nlogn) [1] you will get O(n^2logn).
The problem with performance is that you cannot know exactly what your code will eventually be.
[1] I know that quick sort is actually O(n^2), but it rarely gets to it, the average distribution will give you O(nlogn).

Related

Convert all doubles to integers for better performance, is it just a rumor?

I have a very complicated and sophisticated data fitting program which uses the Levenverg-Marquardt algorithm to do fitting in double precision (basically the fitting class is templatized, but I use instantiate it to doubles). The fitting process involves:
Calculating an error function (chi-square)
Solving a system of linear equations (I use lapack for that)
calculating the derivatives of a function with respect to the parameters, which I want to fit to the data (usually 20+ parameters)
calculating the function value continuously: the function is a complicated combination of a sinusoidal and exponential functions with a few harmonics.
A colleague of mine has suggested that I use integers for at least 10 times faster at least. My questions are:
Is that true that I will get that kind of improvement?
Is it safe to convert everything to integers? And what are the drawbacks to this?
What advice would you have for this whole issue? What would you do?
The program is developed to calculate some parameters from the signal online, which means that the program must be as fast as possible, but I'm wondering whether it's worth it to start the project of converting everything to integers.

The amount of improvement depends on your platform. For example, if your platform has a fast floating point coprocessor, performing arithmetic in floating point may be faster than integral arithmetic.
You may be able to get more performance gain by optimizing your algorithms rather than switching to integer arithmetic.
Another method for boosting performance is to reduce data cache hits and also reducing branches and loops.
I would measure performance of the program to find out where the bottlenecks are and then review the sections that where most of the performance takes place. For example, in my embedded system, micro-optimizations like what you are suggesting, saved 3 microseconds. This gain is not worth the effort to retest the entire system. If it works, don't fix it. Concentrate on correctness and robustness first.

The bottom line here is that you have to test it and decide for yourself. Profile a release build using real data.
1- Is that true that I will get that kind of improvement?
Maybe yes, maybe no. It depends on a number of factors, such as
How long it takes to convert from double to int
How big a word is on your machine
What platform/toolset you're using and what optimizations you have enabled
(Maybe) how big a cache line is on your platform
How fast your memory is
How fast your platform computes floating-point versus integer.
And who knows what else. In short, too many complex variables for anyone to be able to say for sure if you will or will not improve performance.
But I would be highly skeptical about your friend's claim, "at least 10 times faster at least."
2- Is it safe to convert everything to integers? And what are the
drawbacks to this?
It depends on what you're converting and how. Obviously converting a value like 123.456 to an integer is decidedly unsafe.
Drawbacks include loss of precision, loss of accuracy, and the expense in terms of space and time to actually do the conversions. Another significant drawback is the fact that you have to write a substantial amount of code, and every line of code you write is a probable source of new bugs.
3- What advice would you have for this whole issue? What would you do?
I would step back & take a deep breath. Profile your code under real-world conditions. Identify the sources of the bottlenecks. Find out what the real problems are, and if there even are any.
Identify inefficiencies in your algorithms, and fix them.
Throw hardware at the problem.
Then you can endeavor to start micro-optimizing. This would be my last resort, especially if the optimization technique you are considering would require writing a lot of code.

First, this reeks of attempting to optimize unnecessarily.
Second, doubles are a minimum of 64-bits. ints on most systems are 32-bits. So you have a couple of choices: truncate the double (which reduces your precision to a single), or store it in the space of 2 integers, or store it as an unsigned long long (which is at least 64-bits as well). For the first 2 options, you are facing a performance hit as you must convert the numbers back and forth between the doubles you are operating on and the integers you are storing it as. For the third option, you are not gaining any performance increase (in terms of memory usage) as they are basically the same size - so you'd just be converting them to integers for no reason.
So, to get to your questions:
1) Doubtful, but you can try it to see for yourself.
2) The problem isn't storage as the bits are just bits when they get into memory. The problem is the arithmetic. Since you stated you need double precision, attempting to do those operations on an integer type will not give you the results you are looking for.
3) Don't optimize until it has been proven something needs to have a performance improvement. And always remember Amdahl's Law: Make the common case fast and the rare case correct.

What I would do is:
First tune it in single-thread mode (by the random-pausing method) until you can't find any way to reduce cycles. The kinds of things I've found are:
a large fraction of time spent in library functions like sin, cos, exp, and log where the arguments were often unchanged, so the answers would be the same. The solution for that is called "memoizing", where you figure out a place to store old values of arguments and results, and check there first before calling the function.
In calling library functions like DGEMM (lapack matrix-multiply) that one would assume are optimized to the teeth, they are actually spending a large fraction of time calling a function to determine if the matrices are upper or lower triangle, square, symmetric, or whatever, rather than actually doing the multiplication. If so, the answer is obvious - write a special routine just for your situation.
Don't say "but I don't have those problems". Of course - you probably have different problems - but the process of finding them is the same.
Once you've made it as fast as possible in single-thread, then figure out how to parallelize it. Multi-threading can have high overhead, so it's best not to tightly-couple the threads.
Regarding your question about converting from doubles to integers, the other answers are right on the money. It only makes sense in very particular situations.

Speed of C++ operators/ simple math

I'm working on a physics engine and feel it would help having a better understanding of the speed and performance effects of performing many simple or complex math operations.
A large part of a physics engine is weeding out the unnecessary computations, but at what point are the computations small enough that a comparative checks aren't necessary?
eg: Testing if two line segments intersect. Should there be check on if they're near each other before just going straight into the simple math, or would the extra operation slow down the process in the long run?
How much time do different mathematical calculations take
eg: (3+8) vs (5x4) vs (log(8)) etc.
How much time do inequality checks take?
eg: >, <, =

You'll have to do profiling.
Basic operations, like additions or multiplications should only take one asm instructions.
EDIT: As per the comments, although taking one asm instruction, multiplications can expand to microinstructions.
Logarithms take longer.
Also one asm instruction.
Unless you profile your code, there's no way to tell where your bottlenecks are.
Unless you call math operations millions of times (and probably even if you do), a good choice of algorithms or some other high-level optimization will results in a bigger speed gain than optimizing the small stuff.
You should write code that is easy to read and easy to modify, and only if you're not satisfied with the performance then, start optimizing - first high-level, and only afterwards low-level.
You might also want to try dynamic programming or caching.

As regards 2 and 3, I could refer you to the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Appendix C presents the latencies and the throughput of various instructions.
However, unless you hand-code assembly code, your compiler will apply its own optimizations, so using this information directly would be rather difficult.
More importantly, you could use SIMD to vectorize your code and run computations in parallel. Also, memory performance can be a bottleneck if your memory layout is not ideal. The document I linked to has chapters on both issues.
However, as #Ph0en1x said, the first step would be choosing (or writing) an efficient algorithm, making it work for your problem. Only then should you start wondering about low-level optimizations.
As for 1, in a general case I'd say that if your algorithm works in such a way that it has some adjustable thresholds for when to execute certain tests, you could do some profiling and print out a performance graph of some kind, and determine the optimal values for those thresholds.

Well, this depends on your hardware. Very nice tables with instruction latency are http://www.agner.org/optimize/instruction_tables.pdf
1. it depends on the code a lot. Also don't forget it doesn't depend only on computations, but how well the comparison results can be predicted.
2. Generally addition/subtraction is very fast, multiplication of floats is a bit slower. Float division is rather slow (if you need to divide by a constant c, it's often better to precompute 1/c and multiply by it). The library functions are usually (I'd dare to say always) slower than simple operators, unless the compiler decides to use SSE. For example sqrt() and 1/sqrt() can be computed using one SSE instruction.
3. From about one cycle to several dozens of cycles. The current processors does the prediction on conditions. If the prediction is right right, it will be fast. However, if the prediction is wrong, the processor has to throw away all the preloaded instructions (IIRC Sandy Bridge preloads up to 30 instructions) and start processing new instructions.
That means if you have a code, where a condition is met most of the time, it will be fast. Similarly if you have code where the condition is not met most the time, it will be fast. Simple alternating conditions (TFTFTF…) are usually fast too.

This depends on the scenario you are trying to simulate. How many objects do you have and how close are they? Are they clustered or distributed evenly? Do your objects move around alot, or are they static? You will have to run tests. Possible data-structures for fast checking of proximity are kd-trees or locality-sensitive hashes (there may be others). I am not sure if these are appropriate for your application, you'd have to check if the maintenance of the data-structure and the lookup-cost are OK for you.
You will have to run tests. Consider checking if you can use vectorization, or if you can even run some of the computations in a GPU using CUDA or something like that.
Same as above - you have to test.

You can generally consider inequality checks, increment, decrement, bit shifts, addition and subtraction to be really cheap. Multiplication and division are generally a little more expensive. Complex math operations like logarithms are much more expensive.
Benchmark on your platform to be sure. Be careful about benchmarking using artificial tests with tight loops -- that tends to give you misleading results. Try to benchmark in code that's as realistic as possible. Ideally, profile the actual code under realistic conditions.
As for the optimizations for things like line intersection, it depends on the data set. If you do a lot of checks and most of your lines are short, it may be worth a quick check to rule out cases where the X or Y ranges don't overlap.

as much as I know all "inequality checks" take the same time.
regarding the rest calculations, I would advice you to run some tests like
take time stamp A
make 1,000,000 "+" calculation (or any other).
take time stamp B
calculate the diff between A and B.
then you can compare the calculations.
take in mind:
using different mathematical lib may change it (some math lib are more performance oriented and some more precision oriented)
the compiler optimization may change it.
each processor is doing it differently.

Typical time of execution for elementary functions

It is well-known that the processor instruction for multiplication takes several times more time than addition, division is even worse (UPD: which is not true any more, see below). What about more complex operations like exponent? How difficult are they?
Motivation. I am interested because it would help in algorithm design to estimate performance-critical parts of algorithms on early stage. Suppose I want to apply a set of filters to an image. One of them operates on 3×3 neighborhood of each pixel, sums them and takes atan. Another one sums more neighbouring pixels, but does not use complicated functions. Which one would execute longer?
So, ideally I want to have approximate relative times of elementary operations execution, like multiplication typically takes 5 times more time than addition, exponent is about 100 multiplications. Of course, it is a deal of orders of magnitude, not the exact values. I understand that it depends on the hardware and on the arguments, so let's say we measure average time (in some sense) for floating-point operations on modern x86/x64. For operations that are not implemented in hardware, I am interested in typical running time for C++ standard libraries.
Have you seen any sources when such thing was analyzed? Does this question makes sense at all? Or no rules of thumb like this could be applied in practice?

First off, let's be clear. This:
It is well-known that processor instruction for multiplication takes
several times more time than addition
is no longer true in general. It hasn't been true for many, many years, and needs to stop being repeated. On most common architectures, integer multiplies are a couple cycles and integer adds are single-cycle; floating-point adds and multiplies tend to have nearly equal timing characteristics (typically around 4-6 cycles latency, with single-cycle throughput).
Now, to your actual question: it varies with both the architecture and the implementation. On a recent architecture, with a well written math library, simple elementary functions like exp and log usually require a few tens of cycles (20-50 cycles is a reasonable back-of-the-envelope figure). With a lower-quality library, you will sometimes see these operations require a few hundred cycles.
For more complicated functions, like pow, typical timings range from high tens into the hundreds of cycles.

You shouldn't be concerned about this. If I tell you that a typical C library implementation of transcendental functions tend to take around 10 times a single floating point addition/multiplication (or 50 floating point additions/multiplications), and around 5 times a floating point division, this wouldn't be useful to you.
Indeed, the way your processor schedules memory accesses will interfere badly with any premature optimization you'd do.
If after profiling you find that a particular implementation using transcendental functions is too slow, you can contemplate setting up a polynomial interpolation scheme. This will include a table and therefore will incur extra cache issues, so make sure to measure and not guess.
This will likely involve Chebyshev approximation. Document yourself about it, this is a particularly useful technique in this kind of domains.
I have been told that compilers are quite bad in optimizing floating point code. You may want to write custom assembly code.
Also, Intel Performance Primitives (if you are on Intel CPU) is something good to own if you are ready to trade off some accuracy for speed.

You could always start a second thread and time the operations. Most elementary operations don't have that much difference in execution time. The big difference is how many times the are executed. The O(n) is generally what you should be thinking about.

benchmark a piece of code independent of CPU performance?

My Objective is : I want to test a piece of code (or function) performance, just like how I test the correctness of that function in a unit-test, let say that the output of this benchmarking process is a "function performance index" which is "portable"
My Problem is : we usually benchmarking a code by using a timer to count elapsed time during execution of that code. and that method is depend on the hardware or O/S or other thing.
My Question is : is there a method to get a "function performance index" that is independent to the performance of the host (CPU/OS/etc..), or if not "independent" lets say it is "relative" to some fixed value. so that somehow the value of "function performance index" is still valid on any platform or hardware performance.
for example: that FPI value is could be measured in
number of arithmetic instruction needed to execute a single call
float value compared to benchmark function, for example function B has rating index of 1.345 (which is the performance is slower 1.345 times than the benchmark function)
or other value.
note that the FPI value doesn't need to be scientifically correct, exact or accurate, I just need a value to give a rough overview of that function performance compared to other function which was tested by the same method.

I think you are in search of the impossible here, because the performance of a modern computer is a complex blend of CPU, cache, memory controller, memory, etc.
So one (hypothetical) computer system might reward the use of enormous look-up tables to simplify an algorithm, so that there were very few cpu instructions processed. Whereas another system might have memory much slower relative to the CPU core, so an algorithm which did a lot of processing but touched very little memory would be favoured.
So a single 'figure of merit' for these two algorithms could not even convey which was the better one on all systems, let alone by how much it was better.

Probably what you really need is a tcov-like tool.
man tcov says:
Each basic block of code (or each
line if the -a option to tcov is specified) is prefixed with
the number of times it has been executed; lines that have
not been executed are prefixed with "#####". A basic block
is a contiguous section of code that has no branches: each
statement in a basic block is executed the same number of
times.

No, there is no such thing. Different hardware performs differently. You can have two different pieces of code X and Y such that hardware A runs X faster than Y but hardware B runs Y faster than X. There is no absolute scale of performance, it depends entirely on the hardware (not to mention other things like the operating system and other environmental considerations).

It sounds like what you want is a program that calculates the Big-O Notation of a piece of code. I don't know if it's possible to do that in an automated fashion (Halting problem, etc).

Like others have mentioned this is not a trivial task and may be impossible to get any sort of accurate results from. Considering a few methods:
Benchmark Functions -- While this seems promising I think you'll find that it won't work well as you try to compare different types of functions. For example, if your benchmark function is 100% CPU bound (as in some complex math computation) then it will compare/scale well with other CPU bound functions but fail when compared with, say, I/O or memory bound functions. Carefully matching a benchmark function to a small set of similar functions may work but is tedious/time consuming.
Number of Instructions -- For a very simple processor it may be possible to count the cycles of each instruction and get a reasonable value for the total number of cycles a block of code will take but with today's modern processors are anything but "simple". With branch prediction and parallel pipelines you can can't just add up instruction cycles and expect to get an accurate result.
Manual Counting -- This might be your best bet and while it is not automatic it may give better results faster than the other methods. Just look at things like the O() order of the code, how much memory the function reads/writes, how many file bytes are input/output etc.... By having a few stats like this for each function/module you should be able to get a rough comparison of their complexity.

Effective optimization strategies on modern C++ compilers

I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to start shaving cycles from the hot spots.
It's well-known that some optimizations, e.g. loop unrolling, are handled these days much more effectively by the compiler than by a programmer meddling by hand. Which techniques are still worthwhile? Obviously, I'll run everything I try through a profiler, but if there's conventional wisdom as to what tends to work and what doesn't, it would save me significant time.
I know that optimization is very compiler- and architecture- dependent. I'm using Intel's C++ compiler targeting the Core 2 Duo, but I'm also interested in what works well for gcc, or for "any modern compiler."
Here are some concrete ideas I'm considering:
Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interesting in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
Lastly, to nip certain kinds of answers in the bud:
I understand that optimization has a cost in terms of complexity, reliability, and maintainability. For this particular application, increased performance is worth these costs.
I understand that the best optimizations are often to improve the high-level algorithms, and this has already been done.

Take a look at the excellent Pitfalls of Object-Oriented Programming slides for some info about restructuring code for locality. In my experience getting better locality is almost always the biggest win.
General process:
Learn to love the Disassembly View in your debugger, or have your build system generate the intermediate assembly files (.s) if at all possible. Keep an eye on changes or for things that look egregious -- even without familiarity with a given instruction set architecture, you should be able to see some things fairly clearly! (I sometimes check in a series of .s files with corresponding .cpp/.c changes, just to leverage the lovely tools from my SCM to watch the code and corresponding asm change over time.)
Get a profiler that can watch your CPU's performance counters, or can at least guess at cache misses. (AMD CodeAnalyst, cachegrind, vTune, etc.)
Some other specific things:
Understand strict aliasing. Once you do, make use of restrict if your compiler has it. (Examine the disasm here too!)
Check out different floating point modes on your processor and compiler. If you don't need the denormalized range, choosing a mode without this can result in better performance. (It sounds like you've already done some things in this area, based on your discussion of rounding modes.)
Definitely avoid allocs: call reserve on std::vector when you can, or use std::array when you know the size at compile-time.
Use memory pools to increase locality and decrease alloc/free overhead; also to ensure cacheline alignment and prevent ping-ponging.
Use frame allocators if you're allocating things in predictable patterns, and can afford to deallocate everything in one go.
Do be aware of invariants. Something you know is invariant may not be to the compiler, for example a use of a struct or class member in a loop. I find the single easiest way to fall into the correct habit here is to give a name to everything, and prefer to name things outside of loops. E.g. const int threshold = m_currentThreshold; or perhaps Thing * const pThing = pStructHoldingThing->pThing; Fortunately you can usually see things that need this treatment in the disassembly view. This also helps with debugging later (makes the watch/locals window behave much more nicely in debug builds)!
Avoid writes in loops if possible -- accumulate first, then write, or batch a few writes together. YMMV, of course.
WRT your std::priority_queue question: inserting things into a vector (the default backend for a priority_queue) tends to move a lot of elements around. If you can break up into phases, where you insert data, then sort it, then read it once it's sorted, you'll probably be a lot better off. Although you'll definitely lose locality, you may find a more self-ordering structure like a std::map or std::set worth the overhead -- but this is really dependent on your usage patterns.

Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
I would only consider this as a last option. The STL containers and algorithms have been thoroughly tested. Creating new ones are expensive in terms of development time.
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
First, try reserving space for the vectors. Check out the std::vector::reserve method. A vector that keeps growing or changing to larger sizes is going to waste dynamic memory and execution time. Add some code to determine a good value for an upper bound.
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interesting in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
As a matter of principle, always pass large structures by reference or pointer. Prefer passing by constant reference. If you are using pointers, consider using smart pointers.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Modern compilers are very aware of instruction caches (pipelines) and try to keep them from being reloaded. You can always assist your compiler by writing code that uses less branches (from if, switch, loop constructs and function calls).
You may see more significant performance gain by adjusting your program to optimize the data cache. Search the web for Data Driven Design. There are many excellent articles on this topic.
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
For accuracy, keep everything as a double. Adjust for rounding only when necessary and perhaps before displaying. This falls under the optimization rule, Use less code, eliminate extraneous or deadwood code.
Also see the section above about reserving space in containers before using them.
Some processors can load and store floating point numbers either faster or as fast as integers. This would require gathering profile data before optimizing. However, if you know there is minimal resolution, you could use integers and change your base to that minimal resolution . For example, when dealing with U.S. money, integers can be used to represent 1/100 or 1/1000 of a dollar.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
This an incorrect assumption. Compilers can optimize based on the function's signature, especially if the parameters correctly use const. I always like to assist the compiler by moving constant stuff outside of the loop. For an upper limit value, such as a string length, assign it to a const variable before the loop. The const modifier will assist the Optimizer.
There is always the count-down optimization in loops. For many processors, a jump on register equals zero is more efficient than compare and jump if less than.
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
I would avoid "micro optimizations". If you have any doubts, print out the assembly code generated by the compiler (for the area you are questioning) under the highest optimization setting. Try rewriting the code to express the compiler's assembly code. Optimize this code, if you can. Anything more requires platform specific instructions.
Optimization Ideas & Concepts
1. Computers prefer to execute sequential instructions.
Branching upsets them. Some modern processors have enough instruction cache to contain code for small loops. When in doubt, don't cause branches.
2. Eliminate Requirements
Less code, more performance.
3. Optimize designs before code
Often times, more performance can be gained by changing the design versus changing the implementation of the design. Less design promotes less code, generates more performance.
4. Consider data organization
Optimize the data.
Organize frequently used fields into substructures.
Set data sizes to fit into a data cache line.
Remove constant data out of data structures.
Use const specifier as much as possible.
5. Consider page swapping
Operating systems will swap out your program or task for another one. Often times into a 'swap file' on the hard drive. Breaking up the code into chunks that contain heavily executed code and less executed code will assist the OS. Also, coagulate heavily used code into tighter units. The idea is to reduce the swapping of code from the hard drive (such as fetching "far" functions). If code must be swapped out, it should be as one unit.
6. Consider I/O optimizations
(Includes file I/O too).
Most I/O prefers fewer large chunks of data to many small chunks of data. Hard drives like to keep spinning. Larger data packets have less overhead than smaller packets.
Format data into a buffer then write the buffer.
7. Eliminate the competition
Get rid of any programs and tasks that are competing against your application for the processor(s). Such tasks as virus scanning and playing music. Even I/O drivers want a piece of the action (which is why you want to reduce the number or I/O transactions).
These should keep you busy for a while. :-)

Use of memory buffer pools can be of great performance benefit vs. dynamic allocation. More so if they reduce or prevent heap fragmentation over long execution runs.
Be aware of data location. If you have a significant mix of local vs. global data you may be overworking the cache mechanism. Try to keep data sets in close proximity to make maximum use of cache line validity.
Even though compilers do a wonderful job with loops, I still scrutinize them when performance tuning. You can spot architectural flaws that yield orders of magnitude where the compiler may only trim percentages.
If a single priority queue is using a lot of time in its operation, there may be benefit to creating a battery of queues representing buckets of priority. It would be complexity being traded for speed in this case.
I notice you didn't mention the use of SSE type instructions. Could they be applicable to your type of number crunching?
Best of luck.

Here is a nice paper on the subject.

About STL containers.
Most people here claim STL offers one of the fastest possible implementations of the container algorithms. And I say the opposite: for the most real-world scenarios the STL containers taken as-is yield a really catastrophic performance.
People argue about the complexity of the algorithms used in STL. Here STL is good: O(1) for list/queue, vector (amortized), and O(log(N)) for map. But this is not the real bottleneck of the performance for a typical application! For many applications the real bottleneck is the heap operations (malloc/free, new/delete, etc.).
A typical operation on the list costs just a few CPU cycles. On a map - some tens, may be more (this depends on the cache state and log(N) of course). And typical heap operations cost from hunders to thousands (!!!) of CPU cycles. For multithreaded applications for instance they also require synchronization (interlocked operations). Plus on some OSs (such as Windows XP) the heap functions are implemented entirely in the kernel mode.
So that the actual performance of the STL containers in a typical scenario is dominated by the amount of heap operations they perform. And here they're disastrous. Not because they're implemented poorly, but because of their design. That is, this is the question of the design.
On the other hand there're other containers which are designed differently.
Once I've designed and written such containers for my own needs:
http://www.codeproject.com/KB/recipes/Containers.aspx
And it proved for me to be superior from the performance point of view, and not only.
But recently I've discovered I'm not the only one who thought about this.
boost::intrusive is the container library that is implemented in the manner similar to what I did then.
I suggest you try it (if you didn't already)

Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
Generally, not unless you're working with a poor implementation. I wouldn't replace an STL container or algorithm just because you think you can write tighter code. I'd do it only if the STL version is more general than it needs to be for your problem. If you can write a simpler version that does just what you need, then there might be some speed to gain there.
One exception I've seen is to replace a copy-on-write std::string with one that doesn't require thread synchronization.
for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
Unlikely. But if you're using a lot of time allocating up to a certain size, it might be profitable to add a reserve() call.
performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference.
When working with containers, I pass iterators for the inputs and an output iterator, which is still pretty general.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Not very. Yes. I find that missed branch predictions and cache-hostile memory access patterns are the two biggest killers of performance (once you've gotten to reasonable algorithms). A lot of older code uses "early out" tests to reduce calculations. But on modern processors, that's often more expensive than doing the math and ignoring the result.
A significant bottleneck in my code used to be conversions from floating point to integers
Yup. I recently discovered the same issue.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop.
Some compilers can deal with this. Visual C++ has a "link-time code generation" option that effective re-invokes the compiler to do further optimization. And, in the case of functions like strlen, many compilers will recognize that as an intrinsic function.
Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand? On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
When you're optimizing at this low level, there are few reliable rules of thumb. Compilers will vary. Measure your current solution, and decide if it's too slow. If it is, come up with a hypothesis (e.g., "What if I replace the inner if-statements with a look-up table?"). It might help ("eliminates stalls due to failed branch predictions") or it might hurt ("look-up access pattern hurts cache coherence"). Experiment and measure incrementally.
I'll often clone the straightforward implementation and use an #ifdef HAND_OPTIMIZED/#else/#endif to switch between the reference version and the tweaked version. It's useful for later code maintenance and validation. I commit each successful experiment to change control, and keep a log (spreadsheet) with the changelist number, run times, and explanation for each step in optimization. As I learn more about how the code behaves, the log makes it easy to back up and branch off in another direction.
You need a framework for running reproducible timing tests and to compare results to the reference version to make sure you don't inadvertently introduce bugs.

If I were working on this, I would expect an end-stage where things like cache locality and vector operations would come into play.
However, before getting to the end stage, I would expect to find a series of problems of different sizes having less to do with compiler-level optimization, and more to do with odd stuff going on that could never be guessed, but once found, are simple to fix. Usually they revolve around class overdesign and data structure issues.
Here's an example of this kind of process.
I have found that generalized container classes with iterators, which in principle the compiler can optimize down to minimal cycles, often are not so optimized for some obscure reason. I've also heard other cases on SO where this happens.
Others have said, before you do anything else, profile. I agree with that approach except I think there's a better way, and it's indicated in that link. Whenever I find myself asking if some specific thing, like STL, could be a problem, I just might be right - BUT - I'm guessing. The fundamental winning idea in performance tuning is find out, don't guess. It is easy to find out for sure what is taking the time, so don't guess.

here is some stuff I had used:
templates to specialize innermost loops bounds (makes them really fast)
use __restrict__ keywords for alias problems
reserve vectors beforehand to sane defaults.
avoid using map (it can be really slow)
vector append/ insert can be significantly slow. If that is the case, raw operations may make it faster
N-byte memory alignment (Intel has pragma aligned, http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm)
trying to keep memory within L1/L2 caches.
compiled with NDEBUG
profile using oprofile, use opannotate to look for specific lines (stl overhead is clearly visible then)
here are sample parts of profile data (so you know where to look for problems)
* Output annotated source file with samples
* Output all files
*
* CPU: Core 2, speed 1995 MHz (estimated)
--
* Total samples for file : "/home/andrey/gamess/source/blas.f"
*
* 1020586 14.0896
--
* Total samples for file : "/home/andrey/libqc/rysq/src/fock.cpp"
*
* 962558 13.2885
--
* Total samples for file : "/usr/include/boost/numeric/ublas/detail/matrix_assign.hpp"
*
* 748150 10.3285
--
* Total samples for file : "/usr/include/boost/numeric/ublas/functional.hpp"
*
* 639714 8.8315
--
* Total samples for file : "/home/andrey/gamess/source/eigen.f"
*
* 429129 5.9243
--
* Total samples for file : "/usr/include/c++/4.3/bits/stl_algobase.h"
*
* 411725 5.6840
--
example of code from my project
template<int ni, int nj, int nk, int nl>
inline void eval(const Data::density_type &D, const Data::fock_type &F,
const double *__restrict Q, double scale) {
const double * __restrict Dij = D[0];
...
double * __restrict Fij = F[0];
...
for (int l = 0, kl = 0, ijkl = 0; l < nl; ++l) {
for (int k = 0; k < nk; ++k, ++kl) {
for (int j = 0, ij = 0; j < nj; ++j, ++jk, ++jl) {
for (int i = 0; i < ni; ++i, ++ij, ++ik, ++il, ++ijkl) {

And I think the main hint anyone could give you is: measure, measure, measure. That and improving your algorithms.
The way you use certain language features, the compiler version, std lib implementation, platform, machine - all ply their role in performance and you haven't mentioned many of those and no one of us ever had your exact setup.
Regarding replacing std::vector: use a drop-in replacement (e.g., this one) and just try it out.

How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
I can't speak for all compilers, but my experience with GCC shows that it will not heavily optimize code with respect to the cache. I would expect this to be true for most modern compilers. Optimization such as reordering nested loops can definitely affect performance. If you believe that you have memory access patterns that could lead to many cache misses, it will be in your interest to investigate this.

Is there any benefit to replacing STL
containers/algorithms with hand-rolled
ones? In particular, my program
includes a very large priority queue
(currently a std::priority_queue)
whose manipulation is taking a lot of
total time. Is this something worth
looking into, or is the STL
implementation already likely the
fastest possible?
The STL is generally the fastest, general case. If you have a very specific case, you might see a speed-up with a hand-rolled one. For example, std::sort (normally quicksort) is the fastest general sort, but if you know in advance that your elements are virtually already ordered, then insertion sort might be a better choice.
Along similar lines, for std::vectors
whose needed sizes are unknown but
have a reasonably small upper bound,
is it profitable to replace them with
statically-allocated arrays?
This depends on where you are going to do the static allocation. One thing I tried along this line was to static allocate a large amount of memory on the stack, then re-use later. Results? Heap memory was substantially faster. Just because an item is on the stack doesn't make it faster to access- the speed of stack memory also depends on things like cache. A statically allocated global array may not be any faster than the heap. I assume that you have already tried techniques like just reserving the upper bound. If you have a lot of vectors that have the same upper bound, consider improving cache by having a vector of structs, which contain the data members.
I've found that dynamic memory
allocation is often a severe
bottleneck, and that eliminating it
can lead to significant speedups. As a
consequence I'm interesting in the
performance tradeoffs of returning
large temporary data structures by
value vs. returning by pointer vs.
passing the result in by reference. Is
there a way to reliably determine
whether or not the compiler will use
RVO for a given method (assuming the
caller doesn't need to modify the
result, of course)?
I personally normally pass the result in by reference in this scenario. It allows for a lot more re-use. Passing large data structures by value and hoping that the compiler uses RVO is not a good idea when you can just manually use RVO yourself.
How cache-aware do compilers tend to
be? For example, is it worth looking
into reordering nested loops?
I found that they weren't particularly cache-aware. The issue is that the compiler doesn't understand your program and can't predict the vast majority of it's state, especially if you depend heavily on heap. If you have a profiler that ships with your compiler, for example Visual Studio's Profile Guided Optimization, then this can produce excellent speedups.
Given the scientific nature of the
program, floating-point numbers are
used everywhere. A significant
bottleneck in my code used to be
conversions from floating point to
integers: the compiler would emit code
to save the current rounding mode,
change it, perform the conversion,
then restore the old rounding mode ---
even though nothing in the program
ever changed the rounding mode!
Disabling this behavior significantly
sped up my code. Are there any similar
floating-point-related gotchas I
should be aware of?
There are different floating-point models - Visual Studio gives an fp:fast compiler setting. As for the exact effects of doing such, I can't be certain. However, you could try altering the floating point precision or other settings in your compiler and checking the result.
One consequence of C++ being compiled
and linked separately is that the
compiler is unable to do what would
seem to be very simple optimizations,
such as move method calls like
strlen() out of the termination
conditions of loop. Are there any
optimization like this one that I
should look out for because they can't
be done by the compiler and must be
done by hand?
I've never come across such a scenario. However, if you're genuinely concerned about such, then the option remains to do it manually. One of the things that you could try is calling a function on a const reference, suggesting to the compiler that the value won't change.
One of the other things that I want to point out is the use of non-standard extensions to the compiler, for example provided by Visual Studio is __assume. http://msdn.microsoft.com/en-us/library/1b3fsfxw(VS.80).aspx
There's also multithread, which I would expect you've gone down that road. You could try some specific opts, like another answer suggested SSE.
Edit: I realized that a lot of the suggestions I posted referenced Visual Studio directly. That's true, but, GCC almost certainly provides alternatives to the majority of them. I just have personal experience with VS most.

The STL priority queue implementation is fairly well-optimized for what it does, but certain kinds of heaps have special properties that can improve your performance on certain algorithms. Fibonacci heaps are one example. Also, if you're storing objects with a small key and a large amount of satellite data, you'll get a major improvement in cache performance if you store that data separately, even if it means storing one extra pointer per object.
As for arrays, I've found std::vector to even slightly out-perform compile-time-constant arrays. That said, its optimizations are general, and specific knowledge of your algorithm's access patterns may allow you to optimize further for cache locality, alignment, coloring, etc. If you find that your performance drops significantly past a certain threshold due to cache effects, hand-optimized arrays may move that problem size threshold by as much as a factor of two in some cases, but it's unlikely to make a huge difference for small inner loops that fit easily within the cache, or large working sets that exceed the size of any CPU cache. Work on the priority queue first.
Most of the overhead of dynamic memory allocation is constant with respect to the size of the object being allocated. Allocating one large object and returning it by a pointer isn't going to hurt much as much as copying it. The threshold for copying vs. dynamic allocation varies greatly between systems, but it should be fairly consistent within a chip generation.
Compilers are quite cache-aware when cpu-specific tuning is turned on, but they don't know the size of the cache. If you're optimizing for cache size, you may want to detect that or have the user specify it at run-time, since that will vary even between processors of the same generation.
As for floating point, you absolutely should be using SSE. This doesn't necessarily require learning SSE yourself, as there are many libraries of highly-optimized SSE code that do all sorts of important scientific computing operations. If you're compiling 64-bit code, the compiler might emit some SSE code automatically, as SSE2 is part of the x86_64 instruction set. SSE will also save you some of the overhead of x87 floating point, since it's not converting back and forth to 80-bit values internally. Those conversions can also be a source of accuracy problems, since you can get different results from the same set of operations depending on how they get compiled, so it's good to be rid of them.

If you work on big matrices for instance, consider tiling your loops to improve the locality. This often leads to dramatic improvements. You can use VTune/PTU to monitor the L2 cache misses.

One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
On some compilers this is incorrect. The compiler has perfect knowledge of all code across all translation units (including static libraries) and can optimize the code the same way it would do if it were in a single translation unit. A few ones that support this feature come to my mind:
Microsoft Visual C++ compilers
Intel C++ Compiler
LLVC-GCC
GCC (I think, not sure)

i'm surprised no one has mentioned these two:
Link time optimization clang and g++ from 4.5 on support link time optimizations. I've heard that on g++ case, the heuristics is still pretty inmature but it should improve quickly since the main architecture is laid out.
Benefits range from inter procedural optimizations at object file level, including highly sought stuff like inling of virtual calls (devirtualization)
Project inlining this might seem to some like very crude approach, but it is that very crudeness which makes it so powerful: this amounts at dumping all your headers and .cpp files into a single, really big .cpp file and compile that; basically it will give you the same benefits of link-time optimization in your trip back to 1999. Of course, if your project is really big, you'll still need a 2010 machine; this thing will eat your RAM like there is no tomorrow. However, even in that case, you can split it in more than one no-so-damn-huge .cpp file

If you are doing heavy floating point math you should consider using SSE to vectorize your computations if that maps well to your problem.
Google SSE intrinsics for more information about this.

Here is something that worked for me once. I can't say that it will work for you. I had code on the lines of
switch(num) {
case 1: result = f1(param); break;
case 2: result = f2(param); break;
//...
}
Then I got a serious performance boost when I changed it to
// init:
funcs[N] = {f1, f2 /*...*/};
// later in the code:
result = (funcs[num])(param);
Perhaps someone here can explain the reason the latter version is better. I suppose it has something to do with the fact that there are no conditional branches there.

My current project is a media server, with multi thread processing (C++ language). It's a time critical application, once low performance functions could cause bad results on media streaming like lost of sync, high latency, huge delays and so.
The strategy i usually use to grantee the best performance possible is to minimize the amount of heavy operational system calls that allocate or manage resources like memory, files, sockets and so.
At first i wrote my own STL, network and file manage classes.
All my containers classes ("MySTL") manage their own memory blocks to avoid multiple alloc (new) / free (delete) calls. The objects released are enqueued on a memory block pool to be reused when needed. On that way i improve performance and protect my code against memory fragmentation.
The parts of the code that need to access lower performance system resources (like files, databases, script, network write) i use separate threads for them. But not one thread for each unit (like not 1 thread for each socket), if so the operational system would lose performance while managing a high number of threads. So you can group objects of same classes to be processed on a separate thread if possible.
For example, if you have to write data to a network socket, but the socket write buffer is full, i save the data on a sendqueue buffer (which shares memory with all sockets together) to be sent on a separate thread as soon as the sockets become writeable again. At this way your main threads should never stop processing on a blocked state waiting for the operational system frees a specific resource. All the buffers released are saved and reused when needed.
After all a profile tool would be welcome to look for program bottles and shows which algorithms should be improved.
i got succeeded using that strategy once i have servers running like 500+ days on a linux machine without rebooting, with thousands users logging everyday.
[02:01] -alpha.ip.tv- Uptime: 525days 12hrs 43mins 7secs

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js