C++, ways to benchmark improvements in cache locality?

C++, ways to benchmark improvements in cache locality? - c++

I have an implementation of a class X, that has two pointers to two pieces of information. I have written a new implementation, class Y, that has only one pointer to a struct that contains the two pieces of information together as adjacent members. X's and Y's methods usually only need to manipulate one of the pieces of information, but provide a get() method that returns a pointer to the second piece (in this case class X just returns its pointer to that piece and class Y returns the address of the struct's second member). In normal usage, calls to X's and Y's methods will happen interspersed by calls to get() and doing work on that returned second piece.
I expect that in real life situations there should be a performance improvement, now that the two pieces of information are next to one another in memory in the class Y implementation (because they are adjacent members of a struct), but I'm not seeing any difference in the benchmarks I've written (interspersing calls to X's and Y's methods with doing work on their second pieces in big loops). I suspect this is because everything fits in cache in either case in my tests. I don't want to try this in my real app yet because the semantics of X and Y differ in other subtle ways not related to this optimization and porting the using application will be some work, and these benchmarks are supposed to help justify doing that work in the first place.
What's the best way to observe the difference in performance due to better cache locality? If I do a bunch of dummy work on an array equal to the size of the cache in between calls is that sufficient? Or do I want to do work on an array slightly less than the cache size, so that work on my instances of my class will cause things to fall in and out of cache? I'm not sure how to code something that is robust against compiler optimizations and different cache sizes.

If you are on Linux, then using Cachegrind in conjunction with KCacheGrind might provide more insight as to what how your cache is behaving.

You could design a benchmark specifically to bust the cache. For instance, allocate the pointed-to data blocks such that they're all guaranteed to be on different cache lines (say, by using a custom memory allocator that pads allocations out to at least a few hundred bytes). Then repeatedly iterate over a number of objects too big to fit everything in even the L2 cache (very platform-dependent, since it depends on the number of lines in cache, but 1 million would cover most architectures and only require a few hundred meg RAM total).
This will give you an upper limit on the performance gain made by the change from X to Y. But it does it by degrading the performance of X down to below any likely real-world usage. And to prove your case you need a lower-limit estimate, not an upper-limit estimate. So I'm not sure you'd achieve much, unless you discover that even this worst case still makes no significant difference and you needn't bother with the optimization.
Even if you don't aim for theoretical worst-case performance of X, any benchmark designed to exceed the cache is just picking an arbitrary point of bad performance of X, and looking to see if Y is better. It's not far off rigging the benchmark to make Y look good. It really doesn't matter how your code performs in dodgy benchmarks, except maybe for the purposes of marketing lies literature.
The best way to observe the real-world difference in performance, is to measure a real-world client of your class. You say that "the semantics of X and Y differ in other subtle ways not related to this optimization", in which case I can only recommend that you write a class Z which differs from X only in respect of this optimization, and use that in your application as the comparison.
Once your tests attempt to represent the worst realistic use, then if you aren't seeing any difference in performance there's probably no performance gain to be had.
All that said, if it makes logical sense (that is, it doesn't make the code any more astonishing), then I would advocate minimising the number of heap allocations in C++ simply as a rule of thumb. It doesn't tend to make speed or total memory usage worse, and it does tend to simplify your resource handling. A rule of thumb doesn't justify a re-write of working code, of course.

If I'm understanding your situation correctly (and please correct me if not), then it's six of one, or half a dozen of the other.
In class X, you need one pointer lookup for either piece of information. In class Y, you need one lookup for the first, and two (get the first and then offset) for the second. That's sacrificing "locality" for another memory access. Compilers are still, unfortunately, very good at wasting bus time looking up words in RAM.
If it's possible, you'll get the best results by holding the two pieces of target information directly within the class in question (i.e. each it's own class member), rather than using those pointers for unnecessary indirection. Not seeing any code, that's pretty much all I can say.
At any rate, you'll get a lot more performance out of studying the algorithmic complexity of your application than you ever will with micro-optimizing two variables in a class definition. Also a great idea is to use a profiling tool to see (objectively) where your bottlenecks are (gprof is common on *nix systems). Is there a distinct reason you're looking to increase locality caching specifically?

Related

Runtime performance (speed) optimization -- Cache size consideration

What are the basic tips and tricks that a C++ programmer should know when trying to optimize his code in the context of Caching?
Here's something to think about:
For instance, I know that reducing a function's footprint would make the code run a bit faster since you would have less overall instructions on the processor's instruction register I.
When trying to allocate an std::array<char, <size>>, what would be the ideal size that could make your read and writes faster to the array?
How big can an object be to decide to put it on the heap instead of the stack?

In most cases, knowing the correct answer to your question will gain you less than 1% overall performance.
Some (data-)cache optimizations that come to my mind are:
For arrays: use less RAM. Try shorter data types or a simple compression algorithm like RLE. This can also save CPU at the same time, or in the opposite waste CPU cycles with data type conversions. Especially floating point to integer conversions can be quite expensive.
Avoid access to the same cacheline (usually around 64 bytes) from different threads, unless all access is read-only.
Group members that are often used together next to each other. Prefer sequential access to random access.
If you really want to know all about caches, read What Every Programmer Should Know About Memory. While I disagree with the title, it's a great in-depth document.
Because your question suggests that you actually expect gains from just following the tips above (in which case you will be disappointed), here are some general optimization tips:
Tip #1: About 90% of your code you should be optimized for readability, not performance. If you decide to attempt an optimization for performance, make sure you actually measure the gain. When it is below 5% I usually go back to the more readable version.
Tip #2: If you have an existing codebase, profile it first. If you don't profile it, you will miss some very effective optimizations. Usually there are some calls to time-consuming functions that can be completely eliminated, or the result cached.
If you don't want to use a profiler, at least print the current time in a couple of places, or interrupt the program with a debugger a couple of times to check where it is most often spending its time.

High-performance code in c++ (inheritance, pointers to functions, if)

Suppose you have a very large graph with lots of processing upon its nodes (like tens of milions of operations per node). The core routine is the same for each node, but there are some additional operations based on internal conditions. There can be 2 such conditions which produces 4 cases (0,0), (1,0), (0,1), (1,1). E.g. (1,1) means that both conditions hold. Conditions are established once (one set for each node independently) in a program and, from then on, never change. Unfortunately, they are determined in runtime and in a fully unpredictable way (based on data received via HTTP from and external server).
What is the fastest in such scenario? (taken into account modern compiler optimizations which I have no idea of)
simply using "IFs": if (condition X) perform additional operation X.
using inheritance to derive four classes from base class (exposing method OPERATION) to have a proper operation and save tens milions of "ifs". [but I am not sure if this is a real saving, inheritance must have its overhead too)
use pointer to function to assign the function based on conditions once.
I would take me long to come to a point where I can test it by myself (I don't have such a big data yet and this will be integrated into bigger project so would not be easy to test all versions).
Reading answers: I know that I probably have to experiment with it. But apart from everything, this is sort of a question what is faster:
tens of milions of IF statements and normal statically known function calls VS function pointer calls VS inheritance which I think is not the best idea in this case and I am thinking of eliminating it from further inspection Thanks for any constructive answers (not saying that I shouldn't care about such minor things ;)

There is no real answer except to measure the actual code on the
real data. At times, in the past, I've had to deal with such
problems, and in the cases I've actually measured, virtual
functions were faster than if's. But that doesn't mean much,
since the cases I measured were in a different program (and thus
a different context) than yours. For example, a virtual
function call will generally prevent inlining, whereas an if is
inline by nature, and inlining may open up additional
optimization possibilities.
Also the machines I measured on handled virtual functions pretty
well; I've heard that some other machines (HP's PA, for example)
are very ineffective in their implementation of indirect jumps
(including not only virtual function calls, but also the return
from the function---again, the lost opportunity to inline
costs).

If you absolutely have to have the fastest way, and the process order of the nodes is not relevant, make four different types, one for each case, and define a process function for each. Then in a container class, have four vectors, one for each node type. Upon creation of a new node get all the data you need to create the node, including the conditions, and create a node of the correct type and push it into the correct vector. Then, when you need to process all your nodes, process them type by type, first processing all nodes in the first vector, then the second, etc.
Why would you want to do this:
No ifs for the state switching
No vtables
No function indirection
But much more importantly:
No instruction cache thrashing (you're not jumping to a different part of your code for every next node)
No branch prediction misses for state switching ifs (since there are none)
Even if you'd have inheritance with virtual functions and thus function indirection through vtables, simply sorting the nodes by their type in your vector may already make a world of difference in performance as any possible instruction cache thrashing would essentially be gone and depending on the methods of branch prediction the branch prediction misses could also be reduced.
Also, don't make a vector of pointers, but make a vector of objects. If they are pointers you have an extra adressing indirection, which in itself is not that worrisome, but the problem is that it may lead to data cache thrashing if the objects are pretty much randomly spread throughout your memory. If on the other hand your objects are directly put into the vector the processing will basically go through memory linearly and the cache will basically hit every time and cache prefetching might actually be able to do a good job.
Note though that you would pay heavily in data structure creation if you don't do it correctly, if at all possible, when making the vector reserve enough capacity in it immediately for all your nodes, reallocating and moving every time your vector runs out of space can become expensive.
Oh, and yes, as James mentioned, always, always measure! What you think may be the fastest way may not be, sometimes things are very counter intuitive, depending on all kinds of factors like optimizations, pipelining, branch prediction, cache hits/misses, data structure layout, etc. What I wrote above is a pretty general approach, but it is not guaranteed to be the fastest and there are definitely ways to do it wrong. Measure, Measure, Measure.
P.S. inheritance with virtual functions is roughly equivalent to using function pointers. Virtual functions are usually implemented by a vtable at the head of the class, which is basically just a table of function pointers to the implementation of the given virtual for the actual type of the object. Whether ifs is faster than virtuals or the other way around is a very very difficult question to answer and depends completely on the implementation, compiler and platform used.

I'm actually quite impressed with how effective branch prediction can be, and only the if solution allows inlining which also can be dramatic. Virtual functions and pointer to function also involve loading from memory and could possibly cause cache misses
But, you have four conditions so branch misses can be expensive.
Without the ability to test and verify the answer really can't be answered. Especially since its not even clear that this would be a performance bottleneck sufficient enough to warrant optimization efforts.
In cases like this. I would err on the side of readability and ease of debugging and go with if

Many programmers have taken classes and read books that go on about certain favorite subjects: pipelining, cacheing, branch prediction, virtual functions, compiler optimizations, big-O algorithms, etc. etc. and the performance of those.
If I could make an analogy to boating, these are things like trimming weight, tuning power, adjusting balance and streamlining, assuming you are starting from some speedboat that's already close to optimal.
Never mind you may actually be starting from the Queen Mary, and you're assuming it's a speedboat.
It may very well be that there are ways to speed up the code by large factors, just by cutting away fat (masquerading as good design), if only you knew where it was.
Well, you don't know where it is, and guessing where is a waste of time, unless you value being wrong.
When people say "measure and use a profiler" they are pointing in the right direction, but not far enough.
Here's an example of how I do it, and I made a crude video of it, FWIW.

Unless there's a clear pattern to these attributes, no branch predictor exists that can effectively predict this data-dependent condition for you. Under such circumstances, you may be better off avoiding a control speculation (and paying the penalty of a branch misprediction), and just wait for the actual data to arrive and resolve the control flow (more likely to occur using virtual functions). You'll have to benchmark of course to verify that, as it depends on the actual pattern (if for e.g. you have even small groups of similarly "tagged" elements).
The sorting suggested above is nice and all, but note that it converts a problem that's just plain O(n) into an O(logn) one, so for large sizes you'll lose unless you can sort once - traverse many times, or otherwise cheaply maintain the sort state.
Note that some predictors may also attempt to predict the address of the function call, so you might be facing the same problem there.
However, I must agree about the comments regarding early-optimizations - do you know for sure that the control flow is your bottleneck? What if fetching the actual data from memory takes longer? In general, it would seem that your elements can be process in parallel, so even if you run this on a single thread (and much more if you use multiple cores) - you should be bandwidth-bound and not latency bound.

Cache locality performance

If I had a C or C++ program where I was using say 20 integers throughout the program, would it improve performance to create an array of size 20 to store the integers and then create aliases for each number?
Would this improve the cache locality (rather than just creating 20 normal ints) because the ints would be loaded into the cache together as part of the int array(or at least , improve the chances of this)?

The question is how do you allocate space for them? I doubt that you just randomly do new int 20 times here and there in the code. If they are local variables then they will get on stack and get cached.
The main question is is that worth bothering? Try to write your program in readable and elegant way first, and then try to remove major bottlenecks and only after start messing with microoptimizations. If you are processing 20 ints, should not they be array essentially?
Also is it theoretical question? If it is, then yes, array will likely be cached better then 20 random areas in memory. If it is practical question, then I doubt that this is really important unless you are writing supercritical performance code, and even then microoptimizations are last thing to deal with.

It might improve performance a bit, yes. It might also completely ruin your performance. Or it might have no impact whatsoever because the compiler already did something similar for you. Or it might have no impact because you're just not using those integers often enough for this to make a difference.
It also depends on whether one or multiple threads access these integers, and whether they just read, or also modify the numbers. (if you have multiple threads and you write to those integers, then putting them in an array will cause false sharing which will hurt your performance far more than anything you'd hoped to gain)
So why don't you just try it?
There is no simple, single answer. The only serious answer you're going to get is "it depends". If you want to know how it would behave in your case, then you have two options:
try it and see what happens, or
gain a thorough understanding of how your CPU works, gather data on exactly how often these values are accessed and in which patterns, so you can make an educated guess at how the change would affect your performance.
If you choose #2, you'll likely need to follow it up with #1 anyway, to verify that your guess was correct.
Performance isn't simple. There are few universal rules, and everything depends on context. A change which is an optimization in one case might slow everything down in another.
If you're serious about optimizing your code, then there's no substitute for the two steps above. And if you're not serious about it, don't do it. :)

Yes, the theoretical chance of the 20 integers ending up on the same cache line would be higher, although I think a good compiler would almost always be able to replicate the same performance for you even when not using an array.

So, you currently have int positionX, positionY, positionZ;then somewhere else int fuzzy; and int foo;, etc to make about 20 integers?
And you want to do something like this:
int arr[20];
#define positionX arr[0]
#define positionY arr[1]
#define positionZ arr[2]
#define fuzzy arr[3]
#define foo arr[4]
I would expect that if there is ANY performance difference, it may make it slower, because the compiler will notice that you are using arr in some other place, and thus can't use registers to store the value of foo, since it sees that you call update_position which touches arr[0]..arr[2]. It depends on how finegrained the compilers detection of "we're touching the same data" is. And I suspect it may quite often be based on "object" rather than individual fields of an object - particularly for arrays.
However, if you do have data that is used close together, e.g. position variables, it would probably help to have them next to each other.
But I seriously think that you are wasting your time trying to put variables next to each other, and using an array is almost certainly a BAD idea.

It would likely decrease performance. Modern compilers will move variables around in memory when you're not looking, and may store two variables at the same address when they're not used concurrently. With your array idea, those variables cannot overlap, and must use distinct cache lines.

Yes this may improve your performance, however it may not, as its really Variables that get used together that should be stored together.
So if they are used together then yes. Variables and objects should really be declared in the function in which they are used as they will be stored on the stack (level-1 cache in most cases).
So yes if you are going to use them together i.e they are relevant to each other, then this would probably be a little more efficient, provding you also take into consideration how you allocate them memory.

Effective optimization strategies on modern C++ compilers

I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to start shaving cycles from the hot spots.
It's well-known that some optimizations, e.g. loop unrolling, are handled these days much more effectively by the compiler than by a programmer meddling by hand. Which techniques are still worthwhile? Obviously, I'll run everything I try through a profiler, but if there's conventional wisdom as to what tends to work and what doesn't, it would save me significant time.
I know that optimization is very compiler- and architecture- dependent. I'm using Intel's C++ compiler targeting the Core 2 Duo, but I'm also interested in what works well for gcc, or for "any modern compiler."
Here are some concrete ideas I'm considering:
Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interesting in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
Lastly, to nip certain kinds of answers in the bud:
I understand that optimization has a cost in terms of complexity, reliability, and maintainability. For this particular application, increased performance is worth these costs.
I understand that the best optimizations are often to improve the high-level algorithms, and this has already been done.

Take a look at the excellent Pitfalls of Object-Oriented Programming slides for some info about restructuring code for locality. In my experience getting better locality is almost always the biggest win.
General process:
Learn to love the Disassembly View in your debugger, or have your build system generate the intermediate assembly files (.s) if at all possible. Keep an eye on changes or for things that look egregious -- even without familiarity with a given instruction set architecture, you should be able to see some things fairly clearly! (I sometimes check in a series of .s files with corresponding .cpp/.c changes, just to leverage the lovely tools from my SCM to watch the code and corresponding asm change over time.)
Get a profiler that can watch your CPU's performance counters, or can at least guess at cache misses. (AMD CodeAnalyst, cachegrind, vTune, etc.)
Some other specific things:
Understand strict aliasing. Once you do, make use of restrict if your compiler has it. (Examine the disasm here too!)
Check out different floating point modes on your processor and compiler. If you don't need the denormalized range, choosing a mode without this can result in better performance. (It sounds like you've already done some things in this area, based on your discussion of rounding modes.)
Definitely avoid allocs: call reserve on std::vector when you can, or use std::array when you know the size at compile-time.
Use memory pools to increase locality and decrease alloc/free overhead; also to ensure cacheline alignment and prevent ping-ponging.
Use frame allocators if you're allocating things in predictable patterns, and can afford to deallocate everything in one go.
Do be aware of invariants. Something you know is invariant may not be to the compiler, for example a use of a struct or class member in a loop. I find the single easiest way to fall into the correct habit here is to give a name to everything, and prefer to name things outside of loops. E.g. const int threshold = m_currentThreshold; or perhaps Thing * const pThing = pStructHoldingThing->pThing; Fortunately you can usually see things that need this treatment in the disassembly view. This also helps with debugging later (makes the watch/locals window behave much more nicely in debug builds)!
Avoid writes in loops if possible -- accumulate first, then write, or batch a few writes together. YMMV, of course.
WRT your std::priority_queue question: inserting things into a vector (the default backend for a priority_queue) tends to move a lot of elements around. If you can break up into phases, where you insert data, then sort it, then read it once it's sorted, you'll probably be a lot better off. Although you'll definitely lose locality, you may find a more self-ordering structure like a std::map or std::set worth the overhead -- but this is really dependent on your usage patterns.

Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
I would only consider this as a last option. The STL containers and algorithms have been thoroughly tested. Creating new ones are expensive in terms of development time.
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
First, try reserving space for the vectors. Check out the std::vector::reserve method. A vector that keeps growing or changing to larger sizes is going to waste dynamic memory and execution time. Add some code to determine a good value for an upper bound.
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interesting in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
As a matter of principle, always pass large structures by reference or pointer. Prefer passing by constant reference. If you are using pointers, consider using smart pointers.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Modern compilers are very aware of instruction caches (pipelines) and try to keep them from being reloaded. You can always assist your compiler by writing code that uses less branches (from if, switch, loop constructs and function calls).
You may see more significant performance gain by adjusting your program to optimize the data cache. Search the web for Data Driven Design. There are many excellent articles on this topic.
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
For accuracy, keep everything as a double. Adjust for rounding only when necessary and perhaps before displaying. This falls under the optimization rule, Use less code, eliminate extraneous or deadwood code.
Also see the section above about reserving space in containers before using them.
Some processors can load and store floating point numbers either faster or as fast as integers. This would require gathering profile data before optimizing. However, if you know there is minimal resolution, you could use integers and change your base to that minimal resolution . For example, when dealing with U.S. money, integers can be used to represent 1/100 or 1/1000 of a dollar.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
This an incorrect assumption. Compilers can optimize based on the function's signature, especially if the parameters correctly use const. I always like to assist the compiler by moving constant stuff outside of the loop. For an upper limit value, such as a string length, assign it to a const variable before the loop. The const modifier will assist the Optimizer.
There is always the count-down optimization in loops. For many processors, a jump on register equals zero is more efficient than compare and jump if less than.
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
I would avoid "micro optimizations". If you have any doubts, print out the assembly code generated by the compiler (for the area you are questioning) under the highest optimization setting. Try rewriting the code to express the compiler's assembly code. Optimize this code, if you can. Anything more requires platform specific instructions.
Optimization Ideas & Concepts
1. Computers prefer to execute sequential instructions.
Branching upsets them. Some modern processors have enough instruction cache to contain code for small loops. When in doubt, don't cause branches.
2. Eliminate Requirements
Less code, more performance.
3. Optimize designs before code
Often times, more performance can be gained by changing the design versus changing the implementation of the design. Less design promotes less code, generates more performance.
4. Consider data organization
Optimize the data.
Organize frequently used fields into substructures.
Set data sizes to fit into a data cache line.
Remove constant data out of data structures.
Use const specifier as much as possible.
5. Consider page swapping
Operating systems will swap out your program or task for another one. Often times into a 'swap file' on the hard drive. Breaking up the code into chunks that contain heavily executed code and less executed code will assist the OS. Also, coagulate heavily used code into tighter units. The idea is to reduce the swapping of code from the hard drive (such as fetching "far" functions). If code must be swapped out, it should be as one unit.
6. Consider I/O optimizations
(Includes file I/O too).
Most I/O prefers fewer large chunks of data to many small chunks of data. Hard drives like to keep spinning. Larger data packets have less overhead than smaller packets.
Format data into a buffer then write the buffer.
7. Eliminate the competition
Get rid of any programs and tasks that are competing against your application for the processor(s). Such tasks as virus scanning and playing music. Even I/O drivers want a piece of the action (which is why you want to reduce the number or I/O transactions).
These should keep you busy for a while. :-)

Use of memory buffer pools can be of great performance benefit vs. dynamic allocation. More so if they reduce or prevent heap fragmentation over long execution runs.
Be aware of data location. If you have a significant mix of local vs. global data you may be overworking the cache mechanism. Try to keep data sets in close proximity to make maximum use of cache line validity.
Even though compilers do a wonderful job with loops, I still scrutinize them when performance tuning. You can spot architectural flaws that yield orders of magnitude where the compiler may only trim percentages.
If a single priority queue is using a lot of time in its operation, there may be benefit to creating a battery of queues representing buckets of priority. It would be complexity being traded for speed in this case.
I notice you didn't mention the use of SSE type instructions. Could they be applicable to your type of number crunching?
Best of luck.

Here is a nice paper on the subject.

About STL containers.
Most people here claim STL offers one of the fastest possible implementations of the container algorithms. And I say the opposite: for the most real-world scenarios the STL containers taken as-is yield a really catastrophic performance.
People argue about the complexity of the algorithms used in STL. Here STL is good: O(1) for list/queue, vector (amortized), and O(log(N)) for map. But this is not the real bottleneck of the performance for a typical application! For many applications the real bottleneck is the heap operations (malloc/free, new/delete, etc.).
A typical operation on the list costs just a few CPU cycles. On a map - some tens, may be more (this depends on the cache state and log(N) of course). And typical heap operations cost from hunders to thousands (!!!) of CPU cycles. For multithreaded applications for instance they also require synchronization (interlocked operations). Plus on some OSs (such as Windows XP) the heap functions are implemented entirely in the kernel mode.
So that the actual performance of the STL containers in a typical scenario is dominated by the amount of heap operations they perform. And here they're disastrous. Not because they're implemented poorly, but because of their design. That is, this is the question of the design.
On the other hand there're other containers which are designed differently.
Once I've designed and written such containers for my own needs:
http://www.codeproject.com/KB/recipes/Containers.aspx
And it proved for me to be superior from the performance point of view, and not only.
But recently I've discovered I'm not the only one who thought about this.
boost::intrusive is the container library that is implemented in the manner similar to what I did then.
I suggest you try it (if you didn't already)

Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
Generally, not unless you're working with a poor implementation. I wouldn't replace an STL container or algorithm just because you think you can write tighter code. I'd do it only if the STL version is more general than it needs to be for your problem. If you can write a simpler version that does just what you need, then there might be some speed to gain there.
One exception I've seen is to replace a copy-on-write std::string with one that doesn't require thread synchronization.
for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
Unlikely. But if you're using a lot of time allocating up to a certain size, it might be profitable to add a reserve() call.
performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference.
When working with containers, I pass iterators for the inputs and an output iterator, which is still pretty general.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Not very. Yes. I find that missed branch predictions and cache-hostile memory access patterns are the two biggest killers of performance (once you've gotten to reasonable algorithms). A lot of older code uses "early out" tests to reduce calculations. But on modern processors, that's often more expensive than doing the math and ignoring the result.
A significant bottleneck in my code used to be conversions from floating point to integers
Yup. I recently discovered the same issue.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop.
Some compilers can deal with this. Visual C++ has a "link-time code generation" option that effective re-invokes the compiler to do further optimization. And, in the case of functions like strlen, many compilers will recognize that as an intrinsic function.
Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand? On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
When you're optimizing at this low level, there are few reliable rules of thumb. Compilers will vary. Measure your current solution, and decide if it's too slow. If it is, come up with a hypothesis (e.g., "What if I replace the inner if-statements with a look-up table?"). It might help ("eliminates stalls due to failed branch predictions") or it might hurt ("look-up access pattern hurts cache coherence"). Experiment and measure incrementally.
I'll often clone the straightforward implementation and use an #ifdef HAND_OPTIMIZED/#else/#endif to switch between the reference version and the tweaked version. It's useful for later code maintenance and validation. I commit each successful experiment to change control, and keep a log (spreadsheet) with the changelist number, run times, and explanation for each step in optimization. As I learn more about how the code behaves, the log makes it easy to back up and branch off in another direction.
You need a framework for running reproducible timing tests and to compare results to the reference version to make sure you don't inadvertently introduce bugs.

If I were working on this, I would expect an end-stage where things like cache locality and vector operations would come into play.
However, before getting to the end stage, I would expect to find a series of problems of different sizes having less to do with compiler-level optimization, and more to do with odd stuff going on that could never be guessed, but once found, are simple to fix. Usually they revolve around class overdesign and data structure issues.
Here's an example of this kind of process.
I have found that generalized container classes with iterators, which in principle the compiler can optimize down to minimal cycles, often are not so optimized for some obscure reason. I've also heard other cases on SO where this happens.
Others have said, before you do anything else, profile. I agree with that approach except I think there's a better way, and it's indicated in that link. Whenever I find myself asking if some specific thing, like STL, could be a problem, I just might be right - BUT - I'm guessing. The fundamental winning idea in performance tuning is find out, don't guess. It is easy to find out for sure what is taking the time, so don't guess.

here is some stuff I had used:
templates to specialize innermost loops bounds (makes them really fast)
use __restrict__ keywords for alias problems
reserve vectors beforehand to sane defaults.
avoid using map (it can be really slow)
vector append/ insert can be significantly slow. If that is the case, raw operations may make it faster
N-byte memory alignment (Intel has pragma aligned, http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm)
trying to keep memory within L1/L2 caches.
compiled with NDEBUG
profile using oprofile, use opannotate to look for specific lines (stl overhead is clearly visible then)
here are sample parts of profile data (so you know where to look for problems)
* Output annotated source file with samples
* Output all files
*
* CPU: Core 2, speed 1995 MHz (estimated)
--
* Total samples for file : "/home/andrey/gamess/source/blas.f"
*
* 1020586 14.0896
--
* Total samples for file : "/home/andrey/libqc/rysq/src/fock.cpp"
*
* 962558 13.2885
--
* Total samples for file : "/usr/include/boost/numeric/ublas/detail/matrix_assign.hpp"
*
* 748150 10.3285
--
* Total samples for file : "/usr/include/boost/numeric/ublas/functional.hpp"
*
* 639714 8.8315
--
* Total samples for file : "/home/andrey/gamess/source/eigen.f"
*
* 429129 5.9243
--
* Total samples for file : "/usr/include/c++/4.3/bits/stl_algobase.h"
*
* 411725 5.6840
--
example of code from my project
template<int ni, int nj, int nk, int nl>
inline void eval(const Data::density_type &D, const Data::fock_type &F,
const double *__restrict Q, double scale) {
const double * __restrict Dij = D[0];
...
double * __restrict Fij = F[0];
...
for (int l = 0, kl = 0, ijkl = 0; l < nl; ++l) {
for (int k = 0; k < nk; ++k, ++kl) {
for (int j = 0, ij = 0; j < nj; ++j, ++jk, ++jl) {
for (int i = 0; i < ni; ++i, ++ij, ++ik, ++il, ++ijkl) {

And I think the main hint anyone could give you is: measure, measure, measure. That and improving your algorithms.
The way you use certain language features, the compiler version, std lib implementation, platform, machine - all ply their role in performance and you haven't mentioned many of those and no one of us ever had your exact setup.
Regarding replacing std::vector: use a drop-in replacement (e.g., this one) and just try it out.

How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
I can't speak for all compilers, but my experience with GCC shows that it will not heavily optimize code with respect to the cache. I would expect this to be true for most modern compilers. Optimization such as reordering nested loops can definitely affect performance. If you believe that you have memory access patterns that could lead to many cache misses, it will be in your interest to investigate this.

Is there any benefit to replacing STL
containers/algorithms with hand-rolled
ones? In particular, my program
includes a very large priority queue
(currently a std::priority_queue)
whose manipulation is taking a lot of
total time. Is this something worth
looking into, or is the STL
implementation already likely the
fastest possible?
The STL is generally the fastest, general case. If you have a very specific case, you might see a speed-up with a hand-rolled one. For example, std::sort (normally quicksort) is the fastest general sort, but if you know in advance that your elements are virtually already ordered, then insertion sort might be a better choice.
Along similar lines, for std::vectors
whose needed sizes are unknown but
have a reasonably small upper bound,
is it profitable to replace them with
statically-allocated arrays?
This depends on where you are going to do the static allocation. One thing I tried along this line was to static allocate a large amount of memory on the stack, then re-use later. Results? Heap memory was substantially faster. Just because an item is on the stack doesn't make it faster to access- the speed of stack memory also depends on things like cache. A statically allocated global array may not be any faster than the heap. I assume that you have already tried techniques like just reserving the upper bound. If you have a lot of vectors that have the same upper bound, consider improving cache by having a vector of structs, which contain the data members.
I've found that dynamic memory
allocation is often a severe
bottleneck, and that eliminating it
can lead to significant speedups. As a
consequence I'm interesting in the
performance tradeoffs of returning
large temporary data structures by
value vs. returning by pointer vs.
passing the result in by reference. Is
there a way to reliably determine
whether or not the compiler will use
RVO for a given method (assuming the
caller doesn't need to modify the
result, of course)?
I personally normally pass the result in by reference in this scenario. It allows for a lot more re-use. Passing large data structures by value and hoping that the compiler uses RVO is not a good idea when you can just manually use RVO yourself.
How cache-aware do compilers tend to
be? For example, is it worth looking
into reordering nested loops?
I found that they weren't particularly cache-aware. The issue is that the compiler doesn't understand your program and can't predict the vast majority of it's state, especially if you depend heavily on heap. If you have a profiler that ships with your compiler, for example Visual Studio's Profile Guided Optimization, then this can produce excellent speedups.
Given the scientific nature of the
program, floating-point numbers are
used everywhere. A significant
bottleneck in my code used to be
conversions from floating point to
integers: the compiler would emit code
to save the current rounding mode,
change it, perform the conversion,
then restore the old rounding mode ---
even though nothing in the program
ever changed the rounding mode!
Disabling this behavior significantly
sped up my code. Are there any similar
floating-point-related gotchas I
should be aware of?
There are different floating-point models - Visual Studio gives an fp:fast compiler setting. As for the exact effects of doing such, I can't be certain. However, you could try altering the floating point precision or other settings in your compiler and checking the result.
One consequence of C++ being compiled
and linked separately is that the
compiler is unable to do what would
seem to be very simple optimizations,
such as move method calls like
strlen() out of the termination
conditions of loop. Are there any
optimization like this one that I
should look out for because they can't
be done by the compiler and must be
done by hand?
I've never come across such a scenario. However, if you're genuinely concerned about such, then the option remains to do it manually. One of the things that you could try is calling a function on a const reference, suggesting to the compiler that the value won't change.
One of the other things that I want to point out is the use of non-standard extensions to the compiler, for example provided by Visual Studio is __assume. http://msdn.microsoft.com/en-us/library/1b3fsfxw(VS.80).aspx
There's also multithread, which I would expect you've gone down that road. You could try some specific opts, like another answer suggested SSE.
Edit: I realized that a lot of the suggestions I posted referenced Visual Studio directly. That's true, but, GCC almost certainly provides alternatives to the majority of them. I just have personal experience with VS most.

The STL priority queue implementation is fairly well-optimized for what it does, but certain kinds of heaps have special properties that can improve your performance on certain algorithms. Fibonacci heaps are one example. Also, if you're storing objects with a small key and a large amount of satellite data, you'll get a major improvement in cache performance if you store that data separately, even if it means storing one extra pointer per object.
As for arrays, I've found std::vector to even slightly out-perform compile-time-constant arrays. That said, its optimizations are general, and specific knowledge of your algorithm's access patterns may allow you to optimize further for cache locality, alignment, coloring, etc. If you find that your performance drops significantly past a certain threshold due to cache effects, hand-optimized arrays may move that problem size threshold by as much as a factor of two in some cases, but it's unlikely to make a huge difference for small inner loops that fit easily within the cache, or large working sets that exceed the size of any CPU cache. Work on the priority queue first.
Most of the overhead of dynamic memory allocation is constant with respect to the size of the object being allocated. Allocating one large object and returning it by a pointer isn't going to hurt much as much as copying it. The threshold for copying vs. dynamic allocation varies greatly between systems, but it should be fairly consistent within a chip generation.
Compilers are quite cache-aware when cpu-specific tuning is turned on, but they don't know the size of the cache. If you're optimizing for cache size, you may want to detect that or have the user specify it at run-time, since that will vary even between processors of the same generation.
As for floating point, you absolutely should be using SSE. This doesn't necessarily require learning SSE yourself, as there are many libraries of highly-optimized SSE code that do all sorts of important scientific computing operations. If you're compiling 64-bit code, the compiler might emit some SSE code automatically, as SSE2 is part of the x86_64 instruction set. SSE will also save you some of the overhead of x87 floating point, since it's not converting back and forth to 80-bit values internally. Those conversions can also be a source of accuracy problems, since you can get different results from the same set of operations depending on how they get compiled, so it's good to be rid of them.

If you work on big matrices for instance, consider tiling your loops to improve the locality. This often leads to dramatic improvements. You can use VTune/PTU to monitor the L2 cache misses.

One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
On some compilers this is incorrect. The compiler has perfect knowledge of all code across all translation units (including static libraries) and can optimize the code the same way it would do if it were in a single translation unit. A few ones that support this feature come to my mind:
Microsoft Visual C++ compilers
Intel C++ Compiler
LLVC-GCC
GCC (I think, not sure)

i'm surprised no one has mentioned these two:
Link time optimization clang and g++ from 4.5 on support link time optimizations. I've heard that on g++ case, the heuristics is still pretty inmature but it should improve quickly since the main architecture is laid out.
Benefits range from inter procedural optimizations at object file level, including highly sought stuff like inling of virtual calls (devirtualization)
Project inlining this might seem to some like very crude approach, but it is that very crudeness which makes it so powerful: this amounts at dumping all your headers and .cpp files into a single, really big .cpp file and compile that; basically it will give you the same benefits of link-time optimization in your trip back to 1999. Of course, if your project is really big, you'll still need a 2010 machine; this thing will eat your RAM like there is no tomorrow. However, even in that case, you can split it in more than one no-so-damn-huge .cpp file

If you are doing heavy floating point math you should consider using SSE to vectorize your computations if that maps well to your problem.
Google SSE intrinsics for more information about this.

Here is something that worked for me once. I can't say that it will work for you. I had code on the lines of
switch(num) {
case 1: result = f1(param); break;
case 2: result = f2(param); break;
//...
}
Then I got a serious performance boost when I changed it to
// init:
funcs[N] = {f1, f2 /*...*/};
// later in the code:
result = (funcs[num])(param);
Perhaps someone here can explain the reason the latter version is better. I suppose it has something to do with the fact that there are no conditional branches there.

My current project is a media server, with multi thread processing (C++ language). It's a time critical application, once low performance functions could cause bad results on media streaming like lost of sync, high latency, huge delays and so.
The strategy i usually use to grantee the best performance possible is to minimize the amount of heavy operational system calls that allocate or manage resources like memory, files, sockets and so.
At first i wrote my own STL, network and file manage classes.
All my containers classes ("MySTL") manage their own memory blocks to avoid multiple alloc (new) / free (delete) calls. The objects released are enqueued on a memory block pool to be reused when needed. On that way i improve performance and protect my code against memory fragmentation.
The parts of the code that need to access lower performance system resources (like files, databases, script, network write) i use separate threads for them. But not one thread for each unit (like not 1 thread for each socket), if so the operational system would lose performance while managing a high number of threads. So you can group objects of same classes to be processed on a separate thread if possible.
For example, if you have to write data to a network socket, but the socket write buffer is full, i save the data on a sendqueue buffer (which shares memory with all sockets together) to be sent on a separate thread as soon as the sockets become writeable again. At this way your main threads should never stop processing on a blocked state waiting for the operational system frees a specific resource. All the buffers released are saved and reused when needed.
After all a profile tool would be welcome to look for program bottles and shows which algorithms should be improved.
i got succeeded using that strategy once i have servers running like 500+ days on a linux machine without rebooting, with thousands users logging everyday.
[02:01] -alpha.ip.tv- Uptime: 525days 12hrs 43mins 7secs

Is it better/faster to have class variables or local function variables?

Ok I know the title doesn't fully explain this question. So I'm writing a program that performs a large number of calculations and I'm trying to optimize it so that it won't run quite so slow. I have a function that is a member of a class that gets called around 5 million times. This is the function:
void PointCamera::GetRay(float x, float y, Ray& out)
{
//Find difference between location on view plane and origin and normalize
float vpPointx = pixelSizex * (x - 0.5f * (float)width);
float vpPointy = pixelSizey * (((float)height - y) - 0.5f * height);
//Transform ray to camera's direction
out.d = u * vpPointx + v * vpPointy - w * lens_distance;
out.d.Normalize();
//Set origin to camera location
out.o = loc;
}
I'm wondering if it is better/faster to declare the variables vpPointx and vpPointy in the class than to declare them each time I call the function. Would this be a good optimization or would it have little effect?
And in general, if there is anything here that could be optimized please let me know.

By limiting the scope of your variables, you are giving more opportunity to the compiler optimiser to rearrange your code and make it run faster. For example, it might keep the values of those variables entirely within CPU registers, which may be an order of magnitude faster than memory access. Also, if those variables were class instance variables, then the compiler would have to generate code to dereference this every time you accessed them, which would very likely be slower than local variable access.
As always, you should measure the performance yourself and try the code both ways (or better, as many ways as you can think of). All optimisation advice is subject to whatever your compiler actually does, which requires experimentation.

Always prefer locals
Anything that is a temporary value should be a local. It's possible that such a value can exist entirely within a register without kicking something else out of cache or requiring a pointless memory store that will use a resource in far shorter supply than CPU cycles.
A dual 3 GHz CPU can execute 6 billion CPU cycles per second. In order to approach that 6 billion figure, typically most ops should involve no memory or cache operations and the results of most cycles must not be needed by the next instruction unless the CPU can find a later instruction that is immediately dispatchable. This all gets quite complicated but 6 billions somethings, including some wait states, will certainly happen each second.
However, that same CPU system is capable of only 10-40 million memory operations per second. The disparity is partly compensated for by the cache systems, although they are still slower than the CPU is, they are limited in size, and they do not cope with writes as well as they do with reads.
The good news is that good software abstractions and software speed optimization both agree in this case. Do not store transient state in an object unless you have a reason to reference it later.

How about precomputing some of those multiplications that never change. For example, w*lens_distance and 0.5*height. Compute them once whenever the variables change, then just use the stored value in this function call.

There is a performance penalty for declaring them in the class. They are accessed, in effect, by using this->field. There will be, at minimum, one memory write to store the result. The function local variables could live in registers for their entire lifetime.

I'm not sure, although my guess is it's better inside the function (since it's just a push on the stack to "declare" the variable, whereas making it part of the class means accessing it from memory using indirection every time you access it). Of course, in reality the compiler probably optimizes all of this into registers anyway.
Which brings me to my point:
You're going about this the wrong way
I don't think that anyone can really tell you what will be faster. It shouldn't matter even if someone does. The only real way to optimize is by measuring.
This usually means one of two things:
One option is to try each way, measure the time it takes, and compare. Note that this isn't always trivial to do (since each run will sometimes depend on external factors, difficult memory issues, etc). But running the code a few million times will probably iron that out for you.
Ideally, you should be using a profiler. That's a piece of software designed to measure the code for you, and tell you what parts take the longest amount of time. As most people who have dealt with optimization will tell you, you'll usually be surprised at what takes up a lot of time.
That's why you should always go with the "scientific" method of measuring, instead of relying on anyone's guesswork.

Appears to be a raytracer. The little things do add up, but also consider the big hits: You'll get a huge speedup with decent spatial partitioning. Get yourself an octtree or KD-Tree for a few orders magnitude speedup on complex scenes.
As for your direct question: profile it.

The others have already covered the benefits of using locals over class variables, so I won't go into that.. but since you asked for optimization tips in general:
Your int-to-float cast jumps out at me. There's a cost to it, especially if you are using the x387 FPU. Using SSE registers will make it better, but it looks thoroughly unnecessary for your function: you could simply store a copy of them as floats in your class.
You mentioned in a comment that you were still working on your kd-tree. It's probably a better idea to finish that first before doing the low-level optimization; what appears important now may not take up a fraction of the time later.
Use an instruction-level profiler, like VTune. gprof doesn't give you anywhere near enough information.
Have you heard of ompf.org? It's a wonderful raytracing forum, and you can learn a lot about the relevant optimizations there.
See my answer to this post for more tips.
Read Agner Fog.
As an aside: I've heard that the Bounding Interval Hierarchy is much easier to implement. I've not implemented a kd-tree, but I have implemented a BIH, and I'd say it's reasonably straightforward.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js