Optimization Techniques for C++ - c++

In his talk a few days ago at Facebook - slides, video, Andrei Alexandrescu talks about common intuitions that might prove us wrong. For me one very interesting point came up on Slide 7 where he states that the assumption "Fewer instructions = faster code" is not true and more instructions will not necessarily mean slower code.
Here comes my problem: the audio quality of his talk (around 6:20 min) is not that good and I don't understand the explanation very well, but from what I gather he is comparing retired instructions with the optimality of an algorithm on a performance level.
However, from my understanding this cannot be done because these are two independent structural levels. Instructions (especially actually retired instructions) are one very important measure and basically give you an idea of the performance needed to achieve a goal. If we leave out the latency of an instruction, we can generalize that fewer retired instructions = faster code. Now, of course there are cases where an algorithm that performs complex calculations inside a loop will yield better performance even though they are performed inside the loop, because it will break out of the loop earlier (think graph traversal). But wouldn't it be more useful to compare two algorithms on a complexity level rather than saying this loop has more instructions and is better than the other? From my point of view, the better algorithm will have fewer retired instructions in the end.
Can someone please help me to understand where he was going with his example, and how can there be a case where (significantly) more retired instructions lead to better performance?

The quality is indeed bad, but I think he is leading to the fact that CPUs are good at calculations but suffer from poor performance on memory accesses (RAM is much slower than the CPU) and on branches (because the CPU works as a pipeline, and branches might cause the pipeline to break).
Here are some cases where more instructions are faster:
Branch prediction - even if we execute more instructions, if that leads to better branch prediction, the CPU pipeline stays full more of the time and fewer ops are "thrown out" of it, which ultimately leads to better performance. This thread, for example, shows how doing the same thing, but sorting first, improves performance.
CPU cache - if your code is more cache optimized and follows the principle of locality, it is more likely to be faster than code that doesn't, even if the code that doesn't executes half the number of instructions. This thread gives an example of a small cache optimization: the same number of instructions can result in much slower code if it is not cache optimized.
It also matters which instructions are executed. Some instructions are slower to perform than others; for example, a divide might be slower than an integer addition.
Note: All of the above are machine dependent and how/if they actually change the performance might vary from one architecture to the other.

The number of instructions is not a good measure in itself.
1. Fewer retired instructions (because there is nothing more to do) = faster code.
2. Fewer retired instructions (because they have to wait for dependencies) = slower code.
It can sometimes be that more instructions in the code also means more retired instructions, because they can use up execution slots that would otherwise be wasted in case 2.
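To make the dependency point concrete, here is a minimal sketch (the function names are made up for illustration). The second version has more instructions in its loop body plus a cleanup loop, but its four accumulators do not depend on each other, so an out-of-order core can keep more execution slots busy. Whether it is actually faster depends on the compiler and the CPU; an optimizer may already do this transformation for integers by itself, so measure.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every addition depends on the previous one,
// so the loop is a single long dependency chain.
std::uint64_t sum_single(const std::vector<std::uint32_t>& v)
{
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    return sum;
}

// Four accumulators: more instructions, but four independent chains.
std::uint64_t sum_four(const std::vector<std::uint32_t>& v)
{
    std::uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i)  // leftover elements when n is not a multiple of 4
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum; only the shape of the dependency chain differs.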

Related

Branch-aware programming

I'm reading around that branch misprediction can be a hot bottleneck for the performance of an application. As I can see, people often show assembly code that unveils the problem and state that programmers usually can predict where a branch could go most of the time and avoid branch mispredictions.
My questions are:
Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?
What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?
Code examples and benchmarks are welcome.
people often ... and state that programmers usually can predict where a branch could go
(*) Experienced programmers often remind that human programmers are very bad at predicting that.
1- Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?
Not in standard C++ or C. At least not for a single branch. What you can do is minimize the depth of your dependency chains so that a branch misprediction has less effect. Modern CPUs execute speculatively past a branch along the predicted path and discard that work if the prediction turns out to be wrong. There's a limit to how far ahead they can speculate, however, which is why branch misprediction matters most in deep dependency chains.
Some compilers provide extensions for suggesting the prediction manually, such as __builtin_expect in GCC. Here is a stackoverflow question about it. Even better, some compilers (such as GCC) support profiling the code and automatically detecting the optimal predictions. It's smart to use profiling rather than manual work because of (*).
2- What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?
Primarily, you should keep in mind that branch misprediction is only going to affect you in the most performance critical part of your program, and that you should not worry about it until you've measured and found a problem.
But what can I do when some profiler (valgrind, VTune, ...) tells that on line n of foo.cpp I got a branch prediction penalty?
Lundin gave very sensible advice
Measure to find out whether it matters.
If it matters, then
1. Minimize the depth of dependency chains of your calculations. How to do that can be quite complicated and beyond my expertise, and there's not much you can do without diving into assembly. What you can do in a high level language is to minimize the number of conditional checks (**). Otherwise you're at the mercy of compiler optimization. Avoiding deep dependency chains also allows more efficient use of out-of-order superscalar processors.
2. Make your branches consistently predictable. The effect of that can be seen in this stackoverflow question. In the question, there is a loop over an array. The loop contains a branch. The branch depends on the size of the current element. When the data was sorted, the loop could be demonstrated to be much faster when compiled with a particular compiler and run on a particular CPU. Of course, keeping all your data sorted will also cost CPU time, possibly more than the branch mispredictions do, so, measure.
3. If it's still a problem, use profile guided optimization (if available).
Order of 2. and 3. may be switched. Optimizing your code by hand is a lot of work. On the other hand, gathering the profiling data can be difficult for some programs as well.
(**) One way to do that is transform your loops by for example unrolling them. You can also let the optimizer do it automatically. You must measure though, because unrolling will affect the way you interact with the cache and may well end up being a pessimization.
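For reference, the kind of loop discussed in the "consistently predictable" point looks roughly like the sketch below. The function names and the threshold parameter are mine, not taken from the linked question. Both functions compute the same value; the second just visits the data in sorted order, so the outcome of the `if` changes only once over the whole run. Note that an optimizer may turn the `if` into a conditional move or vectorize the loop, in which case sorting makes no difference, so again: measure.

```cpp
#include <algorithm>
#include <vector>

// Sums the elements that are >= threshold.
// The `if` is a data-dependent branch inside a hot loop.
long long sum_at_least(const std::vector<int>& data, int threshold)
{
    long long sum = 0;
    for (int x : data) {
        if (x >= threshold)
            sum += x;
    }
    return sum;
}

// Same result, but the (copied) data is sorted first, so the branch is
// "not taken" for a prefix of the data and "taken" for the rest.
long long sum_at_least_sorted(std::vector<int> data, int threshold)
{
    std::sort(data.begin(), data.end());
    return sum_at_least(data, threshold);
}
```

The sort itself costs O(n log n), so this only pays off if the sorted data is traversed many times or is needed sorted anyway.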
As a caveat, I'm not a micro-optimization wizard. I don't know exactly how the hardware branch predictor works. To me it's a magical beast against which I play scissors-paper-stone and it seems to be able to read my mind and beat me all the time. I'm a design & architecture type.
Nevertheless, since this question was about a high-level mindset, I might be able to contribute some tips.
Profiling
As said, I'm not a computer architecture wizard, but I do know how to profile code with VTune and measure things like branch mispredictions and cache misses and do it all the time being in a performance-critical field. That's the very first thing you should be looking into if you don't know how to do this (profiling). Most of these micro-level hotspots are best discovered in hindsight with a profiler in hand.
Branch Elimination
A lot of people are giving some excellent low-level advice on how to improve the predictability of your branches. You can even manually try to aid the branch predictor in some cases and also optimize for static branch prediction (writing if statements to check for the common cases first, e.g.). There's a comprehensive article on the nitty-gritty details here from Intel: https://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts.
However, doing this beyond a basic common case/rare case anticipation is very hard to do and it is almost always best saved for later after you measure. It's just too difficult for humans to be able to accurately predict the nature of the branch predictor. It's far more difficult to predict than things like page faults and cache misses, and even those are almost impossible to perfectly humanly-predict in a complex codebase.
However, there is an easier, high-level way to mitigate branch misprediction, and that's to avoid branching completely.
Skipping Small/Rare Work
One of the mistakes I commonly made earlier in my career and see a lot of peers trying to do when they're starting out, before they've learned to profile and are still going by hunches, is to try to skip small or rare work.
An example of this is memoizing to a large look-up table to avoid repeatedly doing some relatively cheap computations, like using a look-up table that spans megabytes to avoid repeatedly calling cos and sin. To a human brain, this seems like it's saving work to compute it once and store it, except that loading the memory from this giant LUT down through the memory hierarchy and into a register often ends up being even more expensive than the computations it was intended to save.
Another case is adding a bunch of little branches to avoid small computations which are harmless to do unnecessarily (won't impact correctness) throughout the code as a naive attempt at optimization, only to find the branching costs more than just doing unnecessary computations.
This naive attempt at branching as an optimization can also apply even for slightly-expensive but rare work. Take this C++ example:
struct Foo
{
    ...
    Foo& operator=(const Foo& other)
    {
        // Avoid unnecessary self-assignment.
        if (this != &other)
        {
            ...
        }
        return *this;
    }
    ...
};
Note that this is somewhat of a simplistic/illustrative example as most people implement copy assignment using copy-and-swap against a parameter passed by value and avoid branching anyway no matter what.
In this case, we're branching to avoid self-assignment. Yet if self-assignment is only doing redundant work and doesn't hinder the correctness of the result, it can often give you a boost in real-world performance to simply allow the self-copying:
struct Foo
{
    ...
    Foo& operator=(const Foo& other)
    {
        // Don't check for self-assignment.
        ...
        return *this;
    }
    ...
};
... this can help because self-assignment tends to be quite rare. We're slowing down the rare case by redundantly self-assigning, but we're speeding up the common case by avoiding the need to check in all other cases. Of course that's unlikely to reduce branch mispredictions significantly since there is a common/rare case skew in terms of the branching, but hey, a branch that doesn't exist can't be mispredicted.
A Naive Attempt at a Small Vector
As a personal story, I formerly worked in a large-scale C codebase which often had a lot of code like this:
char str[256];
// do stuff with 'str'
... and naturally since we had a pretty extensive user base, some rare user out there would eventually type in a name for a material in our software that was over 255 characters in length and overflow the buffer, leading to segfaults. Our team was getting into C++ and started porting a lot of these source files to C++ and replacing such code with this:
std::string str = ...;
// do stuff with 'str'
... which eliminated those buffer overruns without much effort. However, at least back then, containers like std::string and std::vector were heap(free store)-allocated structures, and we found ourselves trading correctness/safety for efficiency. Some of these replaced areas were performance-critical (called in tight loops), and while we eliminated a lot of bug reports with these mass replacements, the users started noticing the slowdowns.
So then we wanted something which was like a hybrid between these two techniques. We wanted to be able to slap something in there to achieve safety over the C-style fixed-buffer variants (which were perfectly fine and very efficient for common-case scenarios), but still work for the rare-case scenarios where the buffer wasn't big enough for user inputs. I was one of the performance geeks on the team and one of the few using a profiler (I unfortunately worked with a lot of people who thought they were too smart to use one), so I got called into the task.
My first naive attempt was something like this (vastly simplified: the actual one used placement new and so forth and was a fully standard-compliant sequence). It involves using a fixed-size buffer (size specified at compile-time) for the common case and a dynamically-allocated one if the size exceeded that capacity.
template <class T, int N>
class SmallVector
{
public:
    ...
    T& operator[](int n)
    {
        return num < N ? buf[n]: ptr[n];
    }
    ...
private:
    T buf[N];
    T* ptr;
};
This attempt was an utter fail. While it didn't pay the price of the heap/free store to construct, the branching in operator[] made it even worse than std::string and std::vector<char> and was showing up as a profiling hotspot instead of malloc (our vendor implementation of std::allocator and operator new used malloc under the hood). So then I quickly got the idea to simply assign ptr to buf in the constructor. Now ptr points to buf even in the common case scenario, and now operator[] can be implemented like this:
T& operator[](int n)
{
    return ptr[n];
}
... and with that simple branch elimination, our hotspots went away. We now had a general-purpose, standard-compliant container we could use that was just about as fast as the former C-style, fixed-buffer solution (only difference being one additional pointer and a few more instructions in the constructor), but could handle those rare-case scenarios where the size needed to be larger than N. Now we use this even more than std::vector (but only because our use cases favor a bunch of teeny, temporary, contiguous, random-access containers). And making it fast came down to just eliminating a branch in operator[].
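A compilable sketch of the idea follows. This is not the container described above (which used placement new and was a full standard-style sequence). It is a stripped-down illustration under loud assumptions: T is trivially copyable and default-constructible, the container is never copied or moved, and the member names num and cap, the doubling growth policy and the on_heap() helper are all invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

template <class T, std::size_t N>
class SmallVector
{
    static_assert(N > 0, "sketch needs a non-empty inline buffer");
    static_assert(std::is_trivially_copyable<T>::value,
                  "sketch only handles trivially copyable T");
public:
    // ptr points at the inline buffer from the start, so element access
    // never has to ask "small or large?".
    SmallVector() : ptr(buf), num(0), cap(N) {}
    ~SmallVector() { if (ptr != buf) delete[] ptr; }

    // Copying would need to re-point ptr; left out of this sketch.
    SmallVector(const SmallVector&) = delete;
    SmallVector& operator=(const SmallVector&) = delete;

    void push_back(const T& value)
    {
        if (num == cap)  // rare case: spill to (or grow on) the heap
            grow();
        ptr[num++] = value;
    }

    // Branch-free access: ptr is valid in both cases.
    T& operator[](std::size_t n)
    {
        assert(n < num);
        return ptr[n];
    }

    std::size_t size() const { return num; }
    bool on_heap() const { return ptr != buf; }

private:
    void grow()
    {
        const std::size_t new_cap = cap * 2;
        T* bigger = new T[new_cap];
        std::memcpy(bigger, ptr, num * sizeof(T));
        if (ptr != buf)
            delete[] ptr;
        ptr = bigger;
        cap = new_cap;
    }

    T buf[N];
    T* ptr;
    std::size_t num;
    std::size_t cap;
};
```

The only branch left is in push_back, on the capacity check, which is taken rarely and is therefore easy to predict.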
Common Case/Rare Case Skewing
One of the things learned after profiling and optimizing for years is that there's no such thing as "absolutely-fast-everywhere" code. A lot of the act of optimization is trading an inefficiency there for greater efficiency here. Users might perceive your code as absolutely-fast-everywhere, but that comes from smart tradeoffs where the optimizations are aligning with the common case (common case being both aligned with realistic user-end scenarios and coming from hotspots pointed out from a profiler measuring those common scenarios).
Good things tend to happen when you skew the performance towards the common case and away from the rare case. For the common case to get faster, often the rare case must get slower, yet that's a good thing.
Zero-Cost Exception-Handling
An example of common case/rare case skewing is the exception-handling technique used in a lot of modern compilers. They apply zero-cost EH, which isn't really "zero-cost" all across the board. In the case that an exception is thrown, they're now slower than ever before. Yet in the case where an exception isn't thrown, they're now faster than ever before and often faster in successful scenarios than code like this:
if (!try_something())
    return error;
if (!try_something_else())
    return error;
...
When we use zero-cost EH here instead and avoid checking for and propagating errors manually, things tend to go even faster in the non-exceptional cases than this style of code above. Crudely speaking, it's due to the reduced branching. Yet in exchange, something far more expensive has to happen when an exception is thrown. Nevertheless, that skew between common case and rare case tends to aid real-world scenarios. We don't care quite as much about the speed of failing to load a file (rare case) as loading it successfully (common case), and that's why a lot of modern C++ compilers implement "zero-cost" EH. It is again in the interest of skewing the common case and rare case, pushing them further away from each in terms of performance.
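To show the two styles side by side, here is a small sketch. All function names are hypothetical. In the error-code version every call site tests a result and branches. In the exception version the call sites contain no explicit checks, and the unwinding machinery is only involved when something is actually thrown. The step function itself still has to test its input in both versions; what differs is the checking and propagation at each level of the call chain. How the two compare in speed depends on the compiler and ABI.

```cpp
#include <stdexcept>

// Error-code style step: reports failure through the return value.
bool step_checked(int input, int& out)
{
    if (input < 0)
        return false;
    out = input * 2;
    return true;
}

// Exception style step: reports failure by throwing.
int step_throwing(int input)
{
    if (input < 0)
        throw std::invalid_argument("negative input");
    return input * 2;
}

// Every call is followed by a check and an early return.
bool run_checked(int a, int b, int& result)
{
    int x = 0, y = 0;
    if (!step_checked(a, x))
        return false;
    if (!step_checked(b, y))
        return false;
    result = x + y;
    return true;
}

// No explicit error checks on the success path.
int run_throwing(int a, int b)
{
    return step_throwing(a) + step_throwing(b);
}
```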
Virtual Dispatch and Homogeneity
A lot of branching in object-oriented code where the dependencies flow towards abstractions (stable abstractions principle, e.g.), can have a large bulk of its branching (besides loops of course, which play well to the branch predictor) in the form of dynamic dispatch (virtual function calls or function pointer calls).
In these cases, a common temptation is to aggregate all kinds of sub-types into a polymorphic container storing a base pointer, looping through it and calling virtual methods on each element in that container. This can lead to a lot of branch mispredictions, especially if this container is being updated all the time. The pseudocode might look like this:
for each entity in world:
    entity.do_something() // virtual call
A strategy to avoid this scenario is to start sorting this polymorphic container based on its sub-types. This is a fairly old-style optimization popular in the gaming industry. I don't know how helpful it is today, but it is a high-level kind of optimization.
Another way, which I've found to definitely still be useful even in recent cases and which achieves a similar effect, is to break the polymorphic container apart into multiple containers for each sub-type, leading to code like this:
for each human in world.humans():
    human.do_something()
for each orc in world.orcs():
    orc.do_something()
for each creature in world.creatures():
    creature.do_something()
... naturally this hinders the maintainability of the code and reduces the extensibility. However, you don't have to do this for every single sub-type in this world. We only need to do it for the most common. For example, this imaginary video game might consist, by far, of humans and orcs. It might also have fairies, goblins, trolls, elves, gnomes, etc., but they might not be nearly as common as humans and orcs. So we only need to split the humans and orcs away from the rest. If you can afford it, you can also still have a polymorphic container that stores all of these subtypes which we can use for less performance-critical loops. This is somewhat akin to hot/cold splitting for optimizing locality of reference.
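In C++ the split described above might look like the sketch below. The types (Entity, Human, Orc, Fairy, World) and the int return value are invented for the example. The common sub-types live in their own by-value vectors, where the compiler knows the exact type and can call (or inline) do_something directly, and only the rare ones go through a base pointer and a virtual call.

```cpp
#include <memory>
#include <vector>

struct Entity {
    virtual ~Entity() = default;
    virtual int do_something() const = 0;
};

// `final` tells the compiler no further overrides exist.
struct Human final : Entity { int do_something() const override { return 1; } };
struct Orc   final : Entity { int do_something() const override { return 2; } };
struct Fairy final : Entity { int do_something() const override { return 3; } };

struct World {
    std::vector<Human> humans;                    // common: homogeneous, contiguous
    std::vector<Orc> orcs;                        // common: homogeneous, contiguous
    std::vector<std::unique_ptr<Entity>> others;  // rare: still polymorphic
};

int update(const World& world)
{
    int total = 0;
    for (const Human& h : world.humans)
        total += h.do_something();   // static type known
    for (const Orc& o : world.orcs)
        total += o.do_something();   // static type known
    for (const auto& e : world.others)
        total += e->do_something();  // virtual call, but only for the rare types
    return total;
}
```

Storing the common types by value also keeps them contiguous in memory, which ties into the locality point in the next section.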
Data-Oriented Optimization
Optimizing for branch prediction and optimizing memory layouts tends to kind of blur together. I've only rarely attempted optimizations specifically for the branch predictor, and that was only after I exhausted everything else. Yet I've found that focusing a lot on memory and locality of reference did make my measurements result in fewer branch mispredictions (often without knowing exactly why).
Here it can help to study data-oriented design. I've found some of the most useful knowledge relating to optimization comes from studying memory optimization in the context of data-oriented design. Data-oriented design tends to emphasize fewer abstractions (if any), and bulkier, high-level interfaces that process big chunks of data. By nature such designs tend to reduce the amount of disparate branching and jumping around in code with more loopy code processing big chunks of homogeneous data.
It often helps, even if your goal is to reduce branch misprediction, to focus more on consuming data more quickly. I've found some great gains before from branchless SIMD, for example, but the mindset was still in the vein of consuming data more quickly (which it did, and thanks to some help from here on SO like Harold).
TL;DR
So anyway, these are some strategies to potentially reduce branch mispredictions throughout your code from a high-level standpoint. They're devoid of the highest level of expertise in computer architecture, but I'm hoping this is an appropriate kind of helpful response given the level of the question being asked. A lot of this advice is kind of blurred with optimization in general, but I've found that optimizing for branch prediction often needs to be blurred with optimizing beyond it (memory, parallelization, vectorization, algorithmic). In any case, the safest bet is to make sure you have a profiler in your hand before you venture deep.
The Linux kernel defines likely and unlikely macros based on the __builtin_expect GCC builtin:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
(See here for the macros definitions in include/linux/compiler.h)
You can use them like:
if (likely(a > 42)) {
    /* ... */
}
or
if (unlikely(ret_value < 0)) {
    /* ... */
}
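Outside the kernel you can define similar macros yourself. Below is a minimal sketch with a fallback for compilers that lack the builtin. The upper-case macro names and the example function are my own choice, not something from the kernel headers. The hint only affects how the compiler lays out the code; it does not change the result.

```cpp
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (!!(x))   // no hint available: plain condition
#  define UNLIKELY(x) (!!(x))
#endif

// Sums the non-negative values and counts the negative ones.
// Negative values are assumed to be rare, hence the UNLIKELY hint.
long long sum_non_negative(const int* values, std::size_t count,
                           std::size_t& rejected)
{
    long long sum = 0;
    rejected = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (UNLIKELY(values[i] < 0)) {
            ++rejected;
            continue;
        }
        sum += values[i];
    }
    return sum;
}
```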
In general it's a good idea to keep hot inner loops well proportioned to the cache sizes most commonly encountered. That is, if your program handles data in lumps of, say, less than 32kbytes at a time and does a decent amount of work on it then you're making good use of the L1 cache.
In contrast if your hot inner loop chews through 100MByte of data and performs only one operation on each data item, then the CPU will spend most of the time fetching data from DRAM.
This is important because part of the reason CPUs have branch prediction in the first place is to be able to pre-fetch operands for the next instruction. The performance consequences of a branch mis-prediction can be reduced by arranging your code so that there's a good chance that the next data comes from L1 cache no matter what branch is taken. Whilst not a perfect strategy, L1 cache sizes seem to be universally stuck on 32 or 64K; it's almost a constant thing across the industry. Admittedly coding in this way is not often straightforward, and relying on profile driven optimisation, etc. as recommended by others is probably the most straightforward way ahead.
Regardless of anything else, whether or not a problem with branch mis-prediction will occur varies according to the CPU's cache sizes, what else is running on the machine, what the main memory bandwidth / latency is, etc.
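As a rough illustration of "lumps that fit in L1", the sketch below applies two passes to a large array, first over the whole array and then block by block. The function names and the block size of 4096 floats (16 KB, assumed to be comfortably below a 32 KB L1 data cache) are assumptions for the example. In a case this trivial you would simply fuse the two passes into one loop; blocking earns its keep when the passes are separate steps that cannot easily be merged.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Two full passes: by the time the second pass starts, the beginning of a
// large array has most likely been evicted from L1 again.
void scale_then_offset_two_passes(std::vector<float>& data,
                                  float scale, float offset)
{
    for (float& x : data) x *= scale;
    for (float& x : data) x += offset;
}

// Same work, done one block at a time, so the second pass over a block
// finds the data still in cache.
void scale_then_offset_blocked(std::vector<float>& data,
                               float scale, float offset)
{
    const std::size_t block = 4096;  // assumed block size: 4096 floats = 16 KB
    for (std::size_t begin = 0; begin < data.size(); begin += block) {
        const std::size_t end = std::min(begin + block, data.size());
        for (std::size_t i = begin; i < end; ++i) data[i] *= scale;
        for (std::size_t i = begin; i < end; ++i) data[i] += offset;
    }
}
```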
Perhaps the most common technique is to use separate methods for normal and error returns. C has no choice, but C++ has exceptions. Compilers are aware that the exception branches are exceptional and therefore unexpected.
This means that exception branches are indeed slow, as they're unpredicted, but the non-error branch is made faster. On average, this is a net win.
1- Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?
Avoid? Perhaps not. Reduce? Certainly...
2- What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?
It is worth noting that optimisation for one machine isn't necessarily optimisation for another. With that in mind, profile-guided optimisation is reasonably good at rearranging branches, based on whichever test input you give it. This means you don't need to do any programming to perform this optimisation, and it should be relatively tailored to whichever machine you're profiling on. Obviously, the best results will be achieved when your test input and the machine you profile on roughly match common expectations... but those are also considerations for any other optimisations, branch-prediction related or otherwise.
To answer your questions, let me explain how branch prediction works.
First of all, there is a branch penalty even when the processor correctly predicts a taken branch. If the processor predicts a branch as taken, then it has to know the target of the predicted branch, since execution flow will continue from that address. Assuming that the branch target address is already stored in the Branch Target Buffer (BTB), it has to fetch new instructions from the address found in the BTB. So you are still wasting a few clock cycles even if the branch is correctly predicted.
Since the BTB has an associative cache structure, the target address might not be present, and hence more clock cycles might be wasted.
On the other hand, if the CPU predicts a branch as not taken and it is correct, then there is no penalty, since the CPU already knows where the consecutive instructions are.
As I explained above, predicted not taken branches have higher throughput than predicted taken branches.
Is it possible to avoid branch misprediction using some high level programming technique (i.e. no assembly)?
Yes, it is possible. You can avoid them by organizing your code so that every branch follows a repetitive pattern, such as always taken or always not taken.
But if you want to get higher throughput, you should organize branches so that they are most likely to be not taken, as I explained above.
What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?
If possible, eliminate branches. Where that is not possible, when writing if-else or switch statements, check the most common cases first so that the branches are most likely to be not taken. Try using __builtin_expect(condition, 1) to tell the compiler that the condition is expected to be true, so that it can lay out the likely path as the fall-through (not-taken) path.
Branchless isn't always better, even if both sides of the branch are trivial. When branch prediction works, it's faster than a loop-carried data dependency.
See gcc optimization flag -O3 makes code slower than -O2 for a case where gcc -O3 transforms an if() to branchless code in a case where it's very predictable, making it slower.
Sometimes you are confident that a condition is unpredictable (e.g. in a sort algorithm or binary search). Or you care more about the worst-case not being 10x slower than about the fast-case being 1.5x faster.
Some idioms are more likely to compile to a branchless form (like a cmov x86 conditional move instruction).
x = x>limit ? limit : x; // likely to compile branchless
if (x>limit) x=limit; // less likely to compile branchless, but still can
The first way always writes to x, while the second way doesn't modify x in one of the branches. This seems to be the reason that some compilers tend to emit a branch instead of a cmov for the if version. This applies even when x is a local int variable that's already live in a register, so "writing" it doesn't involve a store to memory, just changing the value in a register.
Compilers can still do whatever they want, but I've found this difference in idiom can make a difference. Depending on what you're testing, it's occasionally better to help the compiler mask and AND rather than doing a plain old cmov. I did it in that answer because I knew that the compiler would have what it needed to generate the mask with a single instruction (and from seeing how clang did it).
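If you want to compare the two idioms yourself, a minimal pair to paste into a compiler explorer could look like this (the function names are made up). Both return the same value for every input; whether either one compiles to a cmov or to a branch depends on the compiler, version and flags, so look at the generated assembly rather than assuming.

```cpp
// Ternary form: always produces a value for x.
int clamp_ternary(int x, int limit)
{
    x = x > limit ? limit : x;
    return x;
}

// `if` form: leaves x untouched on one path.
int clamp_if(int x, int limit)
{
    if (x > limit)
        x = limit;
    return x;
}
```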
TODO: examples on http://gcc.godbolt.org/

Measuring performance/throughput of fast code ignoring processor speed?

Is there a way I could write a "tool" which could analyse the x86 assembly language produced from a C/C++ program and measure the performance in such a way that it wouldn't matter if I ran it on a 1GHz or 3GHz processor?
I am thinking more along the lines of instruction throughput? How could I write such a tool? Would it be possible?
I'm pretty sure this has to be equivalent to the halting problem, in which case it can't be done. Things such as branch prediction, memory accesses, and memory caching will all change performance irrespective of the speed of the CPU upon which the program is run.
Well, you could, but it would have very limited relevance. You can't tell the running time by just looking at the instructions.
What about cache usage? A "longer" code can be more cache-friendly, and thus faster.
Certain CPU instructions can be executed in parallel and out-of-order, but the final behaviour depends a lot on the hardware.
If you really want to try it, I would recommend writing a tool for valgrind. You would essentially run the program under a simulated environment, making sure you can replicate the behaviour of real-world CPUs (that's the challenging part).
EDIT: just to be clear, I'm assuming you want dynamic analysis, extracted from real inputs. If you want static analysis you'll be in "undecidable land" as the other answer pointed out (you can't even detect if a given piece of code loops forever).
EDIT 2: forgot to include the out-of-order case in the second point.
It's possible, but only if the tool knows all the internals of the processor for which it is projecting performance. Since knowing 'all' the internals is tantamount to building your own processor, you would correctly guess that this is not an easy task. So instead, you'll need to make a lot of assumptions, and hope that they don't affect your answer too much. Unfortunately, for anything longer than a few hundred instructions, these assumptions (for example, all memory reads are found in L1 data cache and have 4 cycle latency; all instructions are in L1 instruction cache but in trace cache thereafter) affect your answer a lot. Clock speed is probably the easiest variable to handle, but the details of all the rest differ greatly from processor to processor.
Current processors are "speculative", "superscalar", and "out-of-order". Speculative means that they choose their code path before the correct choice is computed, and then go back and start over from the branch if their guess is wrong. Superscalar means that multiple instructions that don't depend on each other can sometimes be executed simultaneously -- but only in certain combinations. Out-of-order means that there is a pool of instructions waiting to be executed, and the processor chooses when to execute them based on when their inputs are ready.
Making things even worse, instructions don't execute instantaneously, and the number of cycles they take (and the resources they occupy during this time) also varies. Accuracy of branch prediction is hard to predict, and it takes different numbers of cycles for processors to recover. Caches are different sizes, take different times to access, and have different algorithms for deciding what to cache. There simply is no meaningful concept of 'how fast assembly executes' without reference to the processor it is executing on.
This doesn't mean you can't reason about it, though. And the more you can narrow down the processor you are targeting, and the more you constrain the code you are evaluating, the better you can predict how code will execute. Agner Fog has a good mid-level introduction to the differences and similarities of the current generation of x86 processors:
http://www.agner.org/optimize/microarchitecture.pdf
Additionally, Intel offers for free a very useful (and surprisingly unknown) tool that answers a lot of these questions for recent generations of their processors. If you are trying to measure the performance and interaction of a few dozen instructions in a tight loop, IACA may already do what you want. There are all sorts of improvements that could be made to the interface and presentation of data, but it's definitely worth checking out before trying to write your own:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
To my knowledge, there isn't an AMD equivalent, but if there is I'd love to hear about it.

Pipeline optimization, is there any point in doing this?

A very experienced programmer from another company told me about some low-level code optimization tips that target a specific CPU, including pipeline optimization, which means arranging the code (inline assembly, obviously) in a special order so that it fits the pipeline of the target hardware better.
With the presence of out-of-order and speculative execution, I just wonder whether there is any point in doing this kind of low-level stuff. We are mostly involved in high performance computing, so we can really focus on one very specific CPU type for our optimization, but I just don't know if there is any point in doing this specific optimization. Does anyone have any experience here? Where to begin? Are there any code examples for this kind of optimization? Many thanks!
I'll start by saying that the compiler will usually optimize code sufficiently (i.e. well enough) that you do not need to worry about this provided your high-level code and algorithms are optimized. In general, manual optimizing should only happen if you have hard evidence that there is an actual performance issue that you can quantify and have tracked down.
Now, with that said, it's always possible to improve things - sometimes a little, sometimes a lot.
If you are in the high-performance computing game, then this sort of optimization might make sense. There are all sorts of "tricks" that can be done, but they are best left to real experts and not for the faint of heart.
If you really want to know more about this topic, a good place to start is by reading Agner Fog's website.
Pipeline optimization will improve your program's performance:
Branches and jumps may force your processor to reload the instruction pipeline, which takes some time. This time could be devoted to data processing instructions.
Some platform independent methods for pipeline optimizations:
Reduce number of branches.
Use Boolean Arithmetic
Set up code to allow for conditional execution of instructions.
Unroll loops.
Keep loop bodies short (so they fit in the processor's cache without reloading).
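To illustrate the "Use Boolean Arithmetic" item above, here is a minimal sketch. The function names are made up for this example, and whether the second form is actually faster depends on the compiler and the data: an optimizer may already turn the first version into branch-free code on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branchy version: one data-dependent conditional per element,
// which the CPU has to predict.
std::size_t count_below_branchy(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        if (v < threshold) {
            ++count;
        }
    }
    return count;
}

// Boolean-arithmetic version: the comparison result (0 or 1) is added
// directly, so the source contains no data-dependent branch in the loop body.
std::size_t count_below_branchless(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        count += static_cast<std::size_t>(v < threshold);
    }
    return count;
}

// Both versions must always agree; only the generated machine code may differ.
void check_same_result(const std::vector<int>& values, int threshold) {
    assert(count_below_branchy(values, threshold) ==
           count_below_branchless(values, threshold));
}
```

The benefit, if any, shows up mostly when the condition is hard to predict (for example on random data), so measure both forms before committing to one.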
Edit 1: Other optimizations
Reduce code by eliminating features and requirements.
Review and optimize the design.
Review implementation for more efficient implementations.
Revert to assembly language only when all other optimizations have
provided little performance improvement; optimize only the code that
is executed 80% of the time; find out by profiling.
Edit 2: Data Optimizations
You can also gain performance improvements by organizing your data. Search the web for "Data Driven Design" or "Optimize performance data".
One idea is that the most frequently used data should be close together and ultimately fit into the processor's data cache. This will reduce the frequency that the processor has to reload its data cache.
Another optimization is to: Load data (into registers), operate on data, then write all data back to memory. The idea here is to trigger the processor's data cache loading circuitry before it processes the data (or registers).
If you can, organize the data to fit in one "line" of your processor's cache. Sequential locations require less time than random access locations.
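As a rough sketch of the "keep frequently used data close together" idea, compare an array-of-structs layout with a struct-of-arrays layout. The `ParticleAoS` and `ParticlesSoA` types and the size of the rarely used `name` field are hypothetical, chosen only to make the difference visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structs: reading only x still pulls y, z and the rarely used
// name bytes through the data cache.
struct ParticleAoS {
    float x, y, z;
    char name[52];  // hypothetical "cold" data that the hot loop never reads
};

float sum_x_aos(const std::vector<ParticleAoS>& ps) {
    float sum = 0.0f;
    for (const ParticleAoS& p : ps) sum += p.x;
    return sum;
}

// Struct-of-arrays: all x values are contiguous, so every cache line
// loaded by the loop below contains only data the loop uses.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

float sum_x_soa(const ParticlesSoA& ps) {
    float sum = 0.0f;
    for (float v : ps.x) sum += v;
    return sum;
}

// Converts one layout to the other (cold data is dropped in this sketch).
ParticlesSoA to_soa(const std::vector<ParticleAoS>& ps) {
    ParticlesSoA out;
    out.x.reserve(ps.size());
    out.y.reserve(ps.size());
    out.z.reserve(ps.size());
    for (const ParticleAoS& p : ps) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    assert(out.x.size() == ps.size());
    return out;
}
```

Which layout wins depends on the access pattern: if most loops touch every field of one particle at a time, the array-of-structs form may be the better fit.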
There are always things that "help" vs. "hinder" the execution in the pipeline, but for most general purpose code that isn't highly specialized, I would expect that performance from compiled code is about as good as the best you can get without highly specialized code for each model of processor. If you have a controlled system, where all of your machines are using the same (or a small number of similar) processor model, and you know that 99% of the time is spent in this particular function, then there may be a benefit to optimizing that particular function to become more efficient.
In your case, it being HPC, it may well be beneficial to handwrite some of the low-level code (e.g. matrix multiplication) to be optimized for the processor you are running on. This does take some reasonable amount of understanding of the processor however, so you need to study the optimization guides for that processor model, and if you can, talk to people who've worked on that processor before.
One of the things you'd look at is "register to register dependencies" - where you need the result of c = a + b to calculate x = c + d - so you try to separate these with some other useful work, such that the calculation of x doesn't get held up by the c = a + b calculation.
Cache prefetching, and generally caring about how the caches are used, is also a useful thing to look at. Not kicking out cached data that you need 100 instructions later, while you are storing a resulting 1MB array that won't be used again for several seconds, can be worth a lot of processor time.
It's hard(er) to control these things when compilers decide to shuffle code around in their own optimisation passes, so handwritten assembler is pretty much the only way to go.
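The dependency idea can be sketched in plain C++ without any assembly. The function names below are invented for the example, and integers are used so that both versions give exactly the same result (with floating point, reordering the additions can change rounding). Whether this helps in practice is something to measure, since a compiler may unroll or vectorize the simple loop by itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single accumulator: every addition needs the result of the previous one,
// so the additions form one long dependency chain.
long long sum_single_chain(const std::vector<int>& data) {
    long long sum = 0;
    for (int v : data) sum += v;
    return sum;
}

// Four independent accumulators: the four additions in one iteration do not
// depend on each other, so an out-of-order core can overlap them.
long long sum_four_chains(const std::vector<int>& data) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    const std::size_t n = data.size();
    const std::size_t blocked = n - (n % 4);  // largest multiple of 4 <= n
    assert(blocked % 4 == 0);
    std::size_t i = 0;
    for (; i < blocked; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i) s0 += data[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```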

Performance of breaking apart one loop into two loops

Good Day,
Suppose that you have a simple for loop like below...
for(int i=0;i<10;i++)
{
//statement 1
//statement 2
}
Assume that statement 1 and statement 2 are O(1). Besides the small overhead of "starting" another loop, would breaking that for loop into two (not nested, but sequential) loops be equally fast? For example...
for(int i=0;i<10;i++)
{
//statement 1
}
for(int i=0;i<10;i++)
{
//statement 2
}
The reason I ask such a silly question is that I have a Collision Detection System (CDS) that has to loop through all the objects. I want to "compartmentalize" the functionality of my CDS so I can simply call
cds.update(objectlist);
instead of having to break my CDS up. (Don't worry too much about my CDS implementation... I think I know what I am doing, I just don't know how to explain it. What I really need to know is whether I take a huge performance hit for looping through all my objects again.)
It depends on your application.
Possible Drawbacks (of splitting):
your data does not fit into the L1 data cache, therefore you load it once for the first loop and then reload it for the second loop
Possible Gains (of splitting):
your loop contains many variables, splitting helps reduce register/stack pressure and the optimizer turns it into better machine code
the functions you use thrash the L1 instruction cache so the cache is reloaded on each iteration, while by splitting you manage to load it once (only) at the first iteration of each loop
These lists are certainly not comprehensive, but already you can sense that there is a tension between code and data. So it is difficult for us to take an educated/a wild guess when we know neither.
When in doubt: profile. Use callgrind, check the cache misses in each case, check the number of instructions executed. Measure the time spent.
In terms of algorithmic complexity splitting the loops makes no difference.
In terms of real world performance splitting the loops could improve performance, worsen performance or make no difference - it depends on the OS, hardware and - of course - what statement 1 and statement 2 are.
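To make the comparison concrete, here is a small sketch of the two forms. The `Object` type and both update functions are hypothetical stand-ins for the question's statement 1 and statement 2. Splitting is only valid when statement 2 for one object does not need statement 1 to be interleaved per object, which is the case here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an object in the collision system.
struct Object {
    float pos;
    float vel;
    bool out_of_bounds;
};

// One loop: each object is touched once while it is still in cache.
void update_fused(std::vector<Object>& objs, float dt, float limit) {
    assert(dt >= 0.0f);
    for (Object& o : objs) {
        o.pos += o.vel * dt;              // statement 1
        o.out_of_bounds = o.pos > limit;  // statement 2
    }
}

// Two sequential loops: same result, but the whole list is walked twice.
void update_split(std::vector<Object>& objs, float dt, float limit) {
    assert(dt >= 0.0f);
    for (Object& o : objs) o.pos += o.vel * dt;              // statement 1
    for (Object& o : objs) o.out_of_bounds = o.pos > limit;  // statement 2
}
```

Both functions produce identical objects, so the choice can be made purely on measured run time for the real object list.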
As noted, the complexity remains.
But in the real world, it is impossible for us to predict which version runs faster. The following are factors that play roles, huge ones:
Data caching
Instruction caching
Speculative execution
Branch prediction
Branch target buffers
Number of available registers on the CPU
Cache sizes
(note: over all of them, there's the Damocles sword of misprediction; all are wikipedizable and googlable)
Especially the last factor makes it sometimes impossible to compile the one true code for code whose performance relies on specific cache sizes. Some applications will run faster on CPU with huge caches, while running slower on small caches, and for some other applications it will be the opposite.
Solutions:
Let your compiler do the job of loop transformation. Modern versions of g++ are quite good at that discipline. Another discipline that g++ is good at is automatic vectorization. Be aware that compilers know more about computer architecture than almost all people.
Ship different binaries and a dispatcher.
Use cache-oblivious data structures/layouts and algorithms that adapt to the target cache.
It is always a good idea to endeavor for software that adapts to the target, ideally without sacrificing code quality. And before doing manual optimization, either microscopic or macroscopic, measure real world runs, then and only then optimize.
Literature:
* Agner Fog's Guides
* Intel's Guides
With two loops you will be paying for:
increased generated code size
twice as many loop branches for the predictor to handle
depending on the data layout of statements 1 and 2, you could be reloading data into cache.
The last point could have a huge impact in either direction. You should measure as with any perf optimization.
As far as big-O complexity is concerned, this doesn't make a difference: if one loop is O(n), then so is the two-loop solution.
As far as micro-optimisation goes, it is hard to say. The cost of a loop is rather small, we don't know what the cost of accessing your objects is (if they are in a vector, then it should be rather small too), but there is a lot to consider to give a useful answer.
You're correct in noting that there will be some performance overhead from creating a second loop. Therefore, it cannot be "equally fast", as this overhead, while small, is still overhead.
I won't try to speak intelligently about how collision systems should be built, but if you're trying to optimize performance it's better to avoid building unnecessary control structures if you can manage it without pulling your hair out.
Remember that premature optimization is one of the worst things you can do. Worry about optimization when you have a performance problem, in my opinion.

Speed of C++ operators/ simple math

I'm working on a physics engine and feel it would help having a better understanding of the speed and performance effects of performing many simple or complex math operations.
A large part of a physics engine is weeding out the unnecessary computations, but at what point are the computations small enough that comparative checks aren't necessary?
eg: Testing if two line segments intersect. Should there be a check on whether they're near each other before going straight into the simple math, or would the extra operation slow down the process in the long run?
How much time do different mathematical calculations take
eg: (3+8) vs (5x4) vs (log(8)) etc.
How much time do inequality checks take?
eg: >, <, =
You'll have to do profiling.
Basic operations, like additions or multiplications, should only take one asm instruction.
EDIT: As per the comments, although taking one asm instruction, multiplications can expand to microinstructions.
Logarithms take longer.
Also one asm instruction.
Unless you profile your code, there's no way to tell where your bottlenecks are.
Unless you call math operations millions of times (and probably even if you do), a good choice of algorithms or some other high-level optimization will result in a bigger speed gain than optimizing the small stuff.
You should write code that is easy to read and easy to modify, and only if you're not satisfied with the performance then, start optimizing - first high-level, and only afterwards low-level.
You might also want to try dynamic programming or caching.
As regards 2 and 3, I could refer you to the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Appendix C presents the latencies and the throughput of various instructions.
However, unless you hand-code assembly code, your compiler will apply its own optimizations, so using this information directly would be rather difficult.
More importantly, you could use SIMD to vectorize your code and run computations in parallel. Also, memory performance can be a bottleneck if your memory layout is not ideal. The document I linked to has chapters on both issues.
However, as @Ph0en1x said, the first step would be choosing (or writing) an efficient algorithm, making it work for your problem. Only then should you start wondering about low-level optimizations.
As for 1, in a general case I'd say that if your algorithm works in such a way that it has some adjustable thresholds for when to execute certain tests, you could do some profiling and print out a performance graph of some kind, and determine the optimal values for those thresholds.
Well, this depends on your hardware. Very nice tables with instruction latency are http://www.agner.org/optimize/instruction_tables.pdf
1. it depends on the code a lot. Also don't forget it doesn't depend only on computations, but how well the comparison results can be predicted.
2. Generally addition/subtraction is very fast, multiplication of floats is a bit slower. Float division is rather slow (if you need to divide by a constant c, it's often better to precompute 1/c and multiply by it). The library functions are usually (I'd dare to say always) slower than simple operators, unless the compiler decides to use SSE. For example sqrt() and 1/sqrt() can be computed using one SSE instruction.
3. From about one cycle to several dozens of cycles. Current processors predict the outcome of conditions. If the prediction is right, it will be fast. However, if the prediction is wrong, the processor has to throw away all the preloaded instructions (IIRC Sandy Bridge preloads up to 30 instructions) and start processing new instructions.
That means if you have code where a condition is met most of the time, it will be fast. Similarly, if you have code where the condition is not met most of the time, it will be fast. Simple alternating conditions (TFTFTF…) are usually fast too.
This depends on the scenario you are trying to simulate. How many objects do you have and how close are they? Are they clustered or distributed evenly? Do your objects move around a lot, or are they static? You will have to run tests. Possible data-structures for fast checking of proximity are kd-trees or locality-sensitive hashes (there may be others). I am not sure if these are appropriate for your application, you'd have to check if the maintenance of the data-structure and the lookup-cost are OK for you.
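The "precompute 1/c and multiply" advice from point 2 can be sketched as follows. The function names are made up for this example. Note that the two versions are not guaranteed to give bit-identical results for every c, because 1/c is rounded once before the multiplication. That is also why compilers normally do not apply this transformation themselves unless relaxed floating-point options are enabled.

```cpp
#include <cassert>
#include <vector>

// Straightforward version: one floating-point division per element.
void scale_by_division(std::vector<double>& values, double c) {
    assert(c != 0.0);
    for (double& v : values) v /= c;
}

// Reciprocal version: one division up front, then only multiplications.
// May differ from the division version in the last bit for some inputs.
void scale_by_reciprocal(std::vector<double>& values, double c) {
    assert(c != 0.0);
    const double inv = 1.0 / c;
    for (double& v : values) v *= inv;
}
```

If exact agreement with division matters (for example for reproducible simulations), keep the division and measure whether it is a bottleneck at all.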
You will have to run tests. Consider checking if you can use vectorization, or if you can even run some of the computations in a GPU using CUDA or something like that.
Same as above - you have to test.
You can generally consider inequality checks, increment, decrement, bit shifts, addition and subtraction to be really cheap. Multiplication and division are generally a little more expensive. Complex math operations like logarithms are much more expensive.
Benchmark on your platform to be sure. Be careful about benchmarking using artificial tests with tight loops -- that tends to give you misleading results. Try to benchmark in code that's as realistic as possible. Ideally, profile the actual code under realistic conditions.
As for the optimizations for things like line intersection, it depends on the data set. If you do a lot of checks and most of your lines are short, it may be worth a quick check to rule out cases where the X or Y ranges don't overlap.
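A minimal sketch of such a quick range check placed in front of the real intersection test might look like this. The `Segment` type and function names are assumptions for the example, and the crossing test deliberately ignores touching and collinear cases to stay short.

```cpp
#include <algorithm>
#include <cassert>

struct Segment {
    double x1, y1, x2, y2;
};

// Cheap rejection test: do the X ranges and the Y ranges of the segments overlap?
bool ranges_overlap(const Segment& a, const Segment& b) {
    const bool x_overlap = std::max(std::min(a.x1, a.x2), std::min(b.x1, b.x2)) <=
                           std::min(std::max(a.x1, a.x2), std::max(b.x1, b.x2));
    const bool y_overlap = std::max(std::min(a.y1, a.y2), std::min(b.y1, b.y2)) <=
                           std::min(std::max(a.y1, a.y2), std::max(b.y1, b.y2));
    return x_overlap && y_overlap;
}

// 2D cross product of vectors (ax, ay) and (bx, by).
double cross(double ax, double ay, double bx, double by) {
    return ax * by - ay * bx;
}

// True if the segments properly cross each other.
// Simplification: touching endpoints and collinear overlap count as "no".
bool segments_cross(const Segment& a, const Segment& b) {
    // Zero-length segments are not handled by this sketch.
    assert(a.x1 != a.x2 || a.y1 != a.y2);
    assert(b.x1 != b.x2 || b.y1 != b.y2);
    if (!ranges_overlap(a, b)) return false;  // early out, no cross products needed
    const double d1 = cross(a.x2 - a.x1, a.y2 - a.y1, b.x1 - a.x1, b.y1 - a.y1);
    const double d2 = cross(a.x2 - a.x1, a.y2 - a.y1, b.x2 - a.x1, b.y2 - a.y1);
    const double d3 = cross(b.x2 - b.x1, b.y2 - b.y1, a.x1 - b.x1, a.y1 - b.y1);
    const double d4 = cross(b.x2 - b.x1, b.y2 - b.y1, a.x2 - b.x1, a.y2 - b.y1);
    return (d1 * d2 < 0.0) && (d3 * d4 < 0.0);
}
```

Whether the early out pays off depends on how often it rejects a pair: the full test is only a handful of multiplications, so the check is mainly useful when most segment pairs are far apart.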
As far as I know, all "inequality checks" take the same time.
Regarding the rest of the calculations, I would advise you to run some tests like this:
take time stamp A
perform 1,000,000 "+" calculations (or any other operation).
take time stamp B
calculate the difference between A and B.
Then you can compare the calculations.
Keep in mind:
using a different mathematical library may change the result (some math libraries are more performance-oriented and some more precision-oriented)
compiler optimization may change it.
each processor does it differently.
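The time-stamp recipe above could be sketched with std::chrono as follows. The function name is invented for the example. The `volatile` accumulator is one simple way to stop the optimizer from deleting the loop, but it also forces a load and a store on every iteration, so the number you get is an upper bound for the addition itself rather than its true cost.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs `iterations` additions between two time stamps and returns the
// elapsed time in nanoseconds. The final sum is written to `result` so the
// caller can check that the work really happened.
std::int64_t time_additions(std::uint64_t iterations, std::uint64_t& result) {
    assert(iterations > 0);
    volatile std::uint64_t x = 0;  // volatile: the loop cannot be optimized away
    const auto start = std::chrono::steady_clock::now();  // time stamp A
    for (std::uint64_t i = 0; i < iterations; ++i) {
        x = x + i;
    }
    const auto stop = std::chrono::steady_clock::now();   // time stamp B
    result = x;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling it as `time_additions(1000000, sum)` and repeating the run several times gives a rough feel for run-to-run variation. As other answers point out, a tight artificial loop like this can be misleading compared to profiling the real code.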