What can I assume about C/C++ compiler optimisations?

I would like to know how to avoid wasting my time and risking typos by re-hashing source code when I'm integrating legacy code, library code or sample code into my own codebase.
If I give a simple example, based on an image processing scenario, you might see what I mean.
It's actually not unusual to find I'm integrating a code snippet like this:
for (unsigned int y = 0; y < uHeight; y++)
{
    for (unsigned int x = 0; x < uWidth; x++)
    {
        // do something with this pixel ....
        uPixel = pPixels[y * uStride + x];
    }
}
Over time, I've become accustomed to doing things like moving unnecessary calculations out of the inner loop and maybe changing the postfix increments to prefix ...
for (unsigned int y = 0; y < uHeight; ++y)
{
    unsigned int uRowOffset = y * uStride;
    for (unsigned int x = 0; x < uWidth; ++x)
    {
        // do something with this pixel ....
        uPixel = pPixels[uRowOffset + x];
    }
}
Or, I might use pointer arithmetic, either by row ...
for (unsigned int y = 0; y < uHeight; ++y)
{
    unsigned char *pRow = pPixels + (y * uStride);
    for (unsigned int x = 0; x < uWidth; ++x)
    {
        // do something with this pixel ....
        uPixel = pRow[x];
    }
}
... or by row and column ... so I end up with something like this:
unsigned char *pRow = pPixels;
for (unsigned int y = 0; y < uHeight; ++y)
{
    unsigned char *pPixel = pRow;
    for (unsigned int x = 0; x < uWidth; ++x)
    {
        // do something with this pixel ....
        uPixel = *pPixel++;
    }
    // next row
    pRow += uStride;
}
Now, when I write from scratch, I'll habitually apply my own "optimisations" but I'm aware that the compiler will also be doing things like:
Moving code from inside loops to outside loops
Changing postfix increments to prefix
Lots of other stuff that I have no idea about
Bear in mind that every time I mess with a piece of working, tested code in this way, I not only cost myself time but also run the risk of introducing bugs through finger trouble or whatever (the above examples are simplified). I'm aware of "premature optimisation" and of other ways to improve performance, such as designing better algorithms, but the code above consists of building blocks that will be used in larger pipelined apps. Since I can't predict what the non-functional requirements might be, I just want the code as fast and tight as is reasonable within time limits (I mean the time I spend tweaking the code).
So, my question is: where can I find out what compiler optimisations are commonly supported by "modern" compilers? I'm using a mixture of Visual Studio 2008 and 2012, but would be interested to know if there are differences with alternatives, e.g. Intel's C/C++ Compiler. Can anyone offer some insight and/or point me at a useful web link, book or other reference?
EDIT
Just to clarify my question
The optimisations I showed above were simple examples, not a complete list. I know that it's pointless (from a performance point of view) to make those specific changes because the compiler will do it anyway.
I'm specifically looking for information about what optimisations are provided by the compilers I'm using.

I would expect most of the optimizations that you include as examples to be a waste of time. A good optimizing compiler should be able to do all of this for you.
I can offer three suggestions by way of practical advice:
Profile your code in the context of a real application processing real data. If you can't, come up with some synthetic tests that you think would closely mimic the final system.
Only optimize code that you have demonstrated through profiling to be a bottleneck.
If you are convinced that a piece of code needs optimization, don't just assume that factoring an invariant expression out of a loop will improve performance. Always benchmark, optionally looking at the generated assembly to gain further insight.
The above advice applies to any optimizations. However, the last point is particularly relevant to low-level optimizations. They are a bit of a black art since there are a lot of relevant architectural details involved: memory hierarchy and bandwidth, instruction pipelining, branch prediction, the use of SIMD instructions etc.
I think it's better to rely on the compiler writers having a good knowledge of the target architecture than to try to outsmart them.
From time to time you will find through profiling that you need to optimize things by hand. However, these instances will be fairly rare, which will allow you to spend a good deal of energy on things that will actually make a difference.
In the meantime, focus on writing correct and maintainable code.
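To make the benchmarking advice concrete, here is a minimal sketch using std::chrono (C++11, so VS2012 and later, not VS2008); processImage is a hypothetical stand-in for the building block under test:

#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the building block under test.
void processImage(std::vector<unsigned char> &pixels)
{
    for (auto &p : pixels)
        p = static_cast<unsigned char>(p + 1);
}

int main()
{
    std::vector<unsigned char> pixels(1920 * 1080); // one synthetic frame
    const int runs = 100;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
        processImage(pixels);
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> total = stop - start;
    std::printf("%.3f ms per run\n", total.count() / runs);
}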

I think it would probably be more useful for you to reconsider the premise of your question, rather than to get a direct answer.
Why do you want to perform these optimizations? Judging by your question, I assume it is to make a concrete program faster. If that is the case, you need to start with the question: How do I make this program faster?
That question has a very different answer. First, you need to consider Amdahl's law. That usually means that it only makes sense to optimize one or two important parts of the program. Everything else pretty much doesn't matter. You should use a profiler to locate these parts of the program. At this point you might argue that you already know that you should use a profiler. However, almost all the programmers I know don't profile their code, even if they know that they should. Knowing about vegetables isn't the same as eating them. ;-)
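To make Amdahl's law concrete: if a fraction p of the run time is spent in the part you speed up by a factor s, the overall speedup is S = 1 / ((1 - p) + p/s). Making a loop that accounts for 80% of the run time 4 times faster thus gives 1 / (0.2 + 0.8/4) = 2.5x overall, while the same effort on a 5% hot-spot can never gain more than about 5%, no matter how large s gets.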
Once you have located the hot-spots, the solution will probably involve:
Improving the algorithm, so the code does less work.
Improving the memory access patterns, to improve cache performance (see the sketch just below).
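As a sketch of the memory-access point, reusing the variable names from the question: the first version walks memory sequentially, while the second does the same work but jumps uStride bytes between accesses and typically thrashes the cache.

// Cache-friendly: consecutive pixels, sequential memory accesses.
unsigned long long sumRowMajor(const unsigned char *pPixels,
                               unsigned int uWidth,
                               unsigned int uHeight,
                               unsigned int uStride)
{
    unsigned long long sum = 0;
    for (unsigned int y = 0; y < uHeight; ++y)
        for (unsigned int x = 0; x < uWidth; ++x)
            sum += pPixels[y * uStride + x];
    return sum;
}

// Cache-hostile: same work, but each access strides uStride bytes.
unsigned long long sumColumnMajor(const unsigned char *pPixels,
                                  unsigned int uWidth,
                                  unsigned int uHeight,
                                  unsigned int uStride)
{
    unsigned long long sum = 0;
    for (unsigned int x = 0; x < uWidth; ++x)
        for (unsigned int y = 0; y < uHeight; ++y)
            sum += pPixels[y * uStride + x];
    return sum;
}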
Again, you should use the profiler to see if your changes have improved the run-time.
For more details, you can Google code optimization and similar terms.
If you want to get really serious, you should also take a look at Agner Fog's optimization manuals and Computer Architecture: A Quantitative Approach. Make sure to get the newest edition.
You might also want to read The Sad Tragedy of Micro-Optimization Theater.

What can I assume about C/C++ compiler optimisations?
As much as you can imagine, except for cases where you hit functional or performance issues with the optimized code; then turn off optimization and debug.
Modern compilers have various strategies to optimize your code, especially when you are doing concurrent programming or using libraries such as OpenMP, Boost or TBB.
If you DO care exactly what machine code your code is turned into, there is nothing better than disassembling it and inspecting the assembly.
The most worthwhile manual optimization might be reducing unpredictable branches, which is harder for the compiler to do (a sketch follows).
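For illustration, a sketch of mine (not from the linked material), assuming 32-bit int and arithmetic right shift of negative values, which is implementation-defined in C++ but holds on mainstream compilers:

// Branchy version: an unpredictable branch if x's sign is random.
int clampNegativeToZeroBranchy(int x)
{
    return (x < 0) ? 0 : x;
}

// Branchless version: a sign mask replaces the branch entirely.
int clampNegativeToZeroBranchless(int x)
{
    int mask = x >> 31;  // all ones if x < 0, all zeros otherwise
    return x & ~mask;    // negative inputs become 0, others pass through
}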
If you are looking for information about optimization, there's already a question on SO:
What are the c++ compiler optimization techniques in Visual studio
In the optimization options, there are explanations of what each option optimizes for:
/O Options (Optimize Code)
And there's something about optimization strategies and techniques
C++ Optimization Strategies and Techniques, by Pete Isensee

Related

Will an operation done several times in sequence be simplified by compiler?

I've had this question for a long time but never knew where to look. If a certain operation is written many times, will the compiler simplify it, or will it run the exact same operation and get the exact same answer?
For example, in the following C-like pseudo-code, (i%3)*10 is repeated many times.
for (int i = 0; i < 100; i++) {
    array[(i%3)*10] = someFunction((i%3)*10);
    int otherVar = (i%3)*10 + array[(i%3)*10];
    int lastVar = (i%3)*10 - otherVar;
    anotherFunction(lastVar);
}
I understand a variable would be better for visual purposes, but is it also faster? Is (i%3)*10 calculated 5 times per loop?
There are certain cases where I don't know if it's faster to use a variable or just leave the original operation.
Edit: using gcc (MinGW.org GCC-8.2.0-3) 8.2.0 on win 10
Which optimizations are done depends on the compiler, the compiler optimization flag(s) you specify, and the architecture.
Here are a few possible optimizations for your example:
Loop Unrolling: This makes the binary larger and thus is a trade-off; for example, you may not want this on a tiny microprocessor with very little memory.
Common Subexpression Elimination (CSE): You can be pretty sure that your (i % 3) * 10 will only be executed once per loop iteration.
About your concern about visual clarity vs. optimization: When dealing with a 'local situation' like yours, you should focus on code clarity.
Optimization gains are often to be made at a higher level; for example in the algorithm you use.
There's a lot to be said about optimization; the above are just a few opening remarks. It's great that you're interested in how things work, because this is important for a good (C/C++) programmer.
As a matter of course, you should remove the obfuscation present in your code:
for (int i = 0; i < 100; ++i) {
    int i30 = i % 3 * 10;
    int r = someFunction(i30);
    array[i30] = r;
    anotherFunction(-r);
}
Suddenly, it looks quite a lot simpler.
Leave it to the compiler (with appropriate options) to optimize your code unless you find you actually have to take a hand after measuring.
In this case, unrolling three times looks like a good idea for the compiler to pursue, though inlining might reveal even better options; a sketch of such unrolling follows.
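For illustration, a hedged sketch of what that three-way unrolling could look like at source level (a real compiler unrolls its intermediate representation, not your source). Since i % 3 * 10 cycles through 0, 10, 20 and 100 is not a multiple of 3, one remainder iteration is left over:

// Conceptual three-way unrolling of the cleaned-up loop above.
for (int i = 0; i < 99; i += 3) {
    int r0 = someFunction(0);   array[0]  = r0;  anotherFunction(-r0);
    int r1 = someFunction(10);  array[10] = r1;  anotherFunction(-r1);
    int r2 = someFunction(20);  array[20] = r2;  anotherFunction(-r2);
}
// remainder iteration: i == 99, and 99 % 3 == 0
int r = someFunction(0);
array[0] = r;
anotherFunction(-r);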
Yes, operations done several times in sequence will be optimized by a compiler.
To go into more detail, all major compilers (GCC, Clang, and MSVC) store the value of (i%3)*10 into a temporary (scratch, junk) register, and then use that whenever an equivalent expression is used again.
In GCC this optimization shows up as GCSE (global common subexpression elimination); other compilers just call it CSE.
This takes a decent chunk out of the time that it takes to compute the loop.

Performance impact of using 'break' inside 'for-loop'

I have done my best and read a lot of Q&As on SO.SE, but I haven't found an answer to my particular question. Most for-loop and break related questions refer to nested loops, whereas I am concerned with performance.
I want to know if using a break inside a for-loop has an impact on the performance of my C++ code (assuming the break gets almost never called). And if it has, I would also like to know tentatively how big the penalization is.
I quite suspect that it does indeed impact performance (although I do not know how much), so I wanted to ask. My reasoning goes as follows:
Independently of the extra code for the conditional statement that triggers the break (like an if), it necessarily adds additional instructions to my loop.
Further, it probably also interferes when my compiler tries to unroll the for-loop, as it no longer knows the number of iterations that will run at compile time, effectively rendering it into a while-loop.
Therefore, I suspect it does have a performance impact, which could be considerable for very fast and tight loops.
So this takes me to a follow-up question: is a for-loop with a break performance-wise equal to a while-loop? Like in the following snippet, where we assume that checkCondition() evaluates to true 99.9% of the time. Do I lose the performance advantage of the for-loop?
// USING WHILE
int i = 100;
while (i-- && checkCondition())
{
    // do stuff
}

// USING FOR
for (int i = 100; i; --i)
{
    if (checkCondition()) {
        // do stuff
    } else {
        break;
    }
}
I have tried it on my computer, but I get the same execution time. And being wary of the compiler and its optimization voodoo, I wanted to know the conceptual answer.
EDIT:
Note that I have measured the execution time of both versions in my complete code, without any real difference. Also, I do not trust compiling with -S (which I usually do) for this matter, as I am not interested in the particular result of my compiler. I am rather interested in the concept itself (in an academic sense), as I am not sure I got this completely right :)
The principal answer is: avoid spending time on such micro-optimizations until you have verified that the condition evaluation is a bottleneck.
The real answer is that CPUs have powerful branch prediction circuits which empirically work really well.
What will happen is that your CPU will predict whether the branch is going to be taken and execute the code as if the if condition were not even present. Of course this relies on multiple assumptions, such as the condition calculation having no side effects that the loop body depends on, and the condition evaluating to false on every iteration up to a certain point, at which it becomes true and stops the loop.
Some compilers also allow you to specify the likeliness of a condition as a hint to the branch predictor (see the sketch below).
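For example, GCC and Clang provide __builtin_expect, and C++20 added the [[likely]] and [[unlikely]] attributes for the same purpose. Applied to the question's loop:

bool checkCondition(); // as in the question's snippets

void loop()
{
    for (int i = 100; i; --i) {
        // Hint that the break path is cold; the compiler can lay out the
        // code so the common path falls straight through.
        if (__builtin_expect(!checkCondition(), 0))
            break;
        // do stuff
    }
}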
If you want to see the semantic difference between the two code versions, just compile them with -S and examine the generated asm code; there's no other magic way to do it.
The only sensible answer to "what is the performance impact of ...", is "measure it". There are very few generic answers.
In the particular case you show, it would be rather surprising if an optimising compiler generated significantly different code for the two examples. On the other hand, I can believe that a loop like:
unsigned sum = 0;
unsigned stop = -1;
for (int i = 0; i < 32; i++)
{
    stop &= checkcondition(); // returns 0 or all-bits-set
    sum += (stop & x[i]);
}
might be faster than:
unsigned sum = 0;
for (int i = 0; i < 32; i++)
{
    if (!checkcondition())
        break;
    sum += x[i];
}
for a particular compiler, for a particular platform, with the right optimization levels set, and for a particular pattern of "checkcondition" results.
... but the only way to tell would be to measure.

How to generate computation intensive code in C++ that will not be removed by compiler? [duplicate]

This question already has an answer here:
How to prevent optimization of busy-wait
I am doing some experiments on CPU performance. I wonder if anyone knows a formal way or a tool to generate simple code that can run for a period of time (several seconds) and consume significant computation resources of a CPU.
I know there are a lot of CPU benchmarks, but their code is pretty complicated. What I want is a program that is more straightforward.
As the compiler is very smart, writing some redundant code such as the following will not work.
for (int i = 0; i < 100; i++) {
    int a = i * 200 + 100;
}
Put the benchmark code in a function in a separate translation unit from the code that calls it. This prevents the code from being inlined, which can lead to aggressive optimizations.
Use parameters for the fixed values (e.g., the number of iterations to run) and return the resulting value. This prevents the optimizer from doing too much constant folding and it keeps it from eliminating calculations for a variable that it determines you never use.
Building on the example from the question:
int TheTest(int iterations) {
    int a;
    for (int i = 0; i < iterations; i++) {
        a = i * 200 + 100;
    }
    return a;
}
Even in this example, there's still a chance that the compiler might realize that only the last iteration matters and completely omit the loop and just return 200*(iterations - 1) + 100, but I wouldn't expect that to happen in many real-life cases. Examine the generated code to be certain.
Other ideas, like using volatile on certain variables, can inhibit some reasonable optimizations, which might make your benchmark perform worse than actual code (a sketch follows).
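As a sketch of that caveat: every store to a volatile object must actually be performed, so the loop below cannot be eliminated, but the forced per-iteration store is overhead that real code would not have.

volatile int sink; // observable side effect the optimizer must preserve

void TheTestVolatile(int iterations)
{
    for (int i = 0; i < iterations; i++)
        sink = i * 200 + 100; // must execute on every iteration
}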
There are also frameworks, like this one, for writing benchmarks like these.
It's not necessarily your optimiser that removes the code. CPUs these days are very powerful, and you need to increase the challenge level. However, note that your original code is not a good general benchmark: you only use a very small subset of a CPU's instruction set. A good benchmark will try to challenge the CPU on different kinds of operations, to predict the performance in real world scenarios. Very good benchmarks will even put load on various components of your computer, to test their interplay.
Therefore, just stick to a well-known published benchmark for your problem. There is a very good reason why they are more involved. However, if you really just want to benchmark your setup and code, then just go for higher counter values:
double j = 10000;
for (double i = 0; i < j*j*j*j*j; i++)
{
}
This should work better for now. Note that there are just more iterations. Change j according to your needs.

Intel C++ Compiler understanding what optimization is performed

I have a code segment which is as simple as :
for (int i = 0; i < n; ++i)
{
    if (data[i] > c && data[i] < r)
    {
        --data[i];
    }
}
It's a part of a large function and project. This is actually a rewrite of a different loop, which proved to be time consuming (long loops), but I was surprised by two things:
When data[i] was temporarily stored like this:
for (int i = 0; i < n; ++i)
{
    const int tmp = data[i];
    if (tmp > c && tmp < r)
    {
        --data[i];
    }
}
It became much slower. I don't claim this should be faster, but I cannot understand why it should be so much slower; the compiler should be able to figure out whether tmp should be used or not.
But more importantly, when I moved the code segment into a separate function it became around four times slower. I wanted to understand what was going on, so I looked in the opt-report; in both cases the loop is vectorized and seems to get the same optimization.
So my question is: what can make such a difference in a function which is not called a million times, but is time consuming in itself? And what should I look for in the opt-report?
I could avoid it by just keeping it inlined, but the why is bugging me.
UPDATE:
I should underline that my main concern is to understand why it became slower when moved to a separate function. The code example with the tmp variable was just a strange case I encountered during the process.
You're probably register starved, and the compiler is having to load and store. I'm pretty sure that native x86 assembly instructions can take memory addresses to operate on, i.e., the compiler can keep those registers free. But by making the value local, you may be changing the behaviour with respect to aliasing, and the compiler may not be able to prove that the faster version has the same semantics, especially if there is some form of multiple threads in here allowing it to change the code.
The function was slower when in a new segment likely because function calls not only can break the pipeline, but also create poor instruction cache performance (there's extra code for parameter push/pop/etc).
Lesson: Let the compiler do the optimizing; it's smarter than you. I don't mean that as an insult, it's smarter than me too. But really, especially the Intel compiler, those guys know what they're doing when targeting their own platform.
Edit: More importantly, you need to recognize that compilers are targeted at optimizing unoptimized code. They're not targeted at recognizing half-optimized code. Specifically, the compiler will have a set of triggers for each optimization, and if you happen to write your code in such a way that they're not hit, you can avoid optimizations being performed even if the code is semantically identical.
And you also need to consider implementation cost. Not every function ideal for inlining can be inlined, simply because inlining that logic is too complex for the compiler to handle. I know that VC++ will rarely inline functions with loops, even if the inlining yields a benefit. You may be seeing this in the Intel compiler: the compiler writers simply decided that it wasn't worth the time to implement.
I encountered this when dealing with loops in VC++: the compiler would produce different assembly for two loops in slightly different formats, even though they both achieved the same result. Of course, their Standard library used the ideal format. You may observe a speedup by using std::for_each and a function object (a sketch follows).
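A sketch of that suggestion (my example, reusing the names from the question; in modern C++ a lambda plays the role of the function object):

#include <algorithm>

void decrementInRange(int *data, int n, int c, int r)
{
    std::for_each(data, data + n, [c, r](int &v) {
        if (v > c && v < r)
            --v;
    });
}

With a modern optimizer both forms typically compile to the same loop; the historical point was that the library's loop shape matched the idioms the compiler was tuned to recognize.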
You're right, the compiler should be able to identify that as unused code and remove it/not compile it. That doesn't mean it actually does identify it and remove it.
Your best bet is to look at the generated assembly and check to see exactly what is going on. Remember, just because a clever compiler could figure out how to do an optimization doesn't mean it actually will.
If you do check and see that the code is not removed, you might want to report that to the Intel compiler team. It sounds like they might have a bug.

Is there any performance difference between for() and while()?

Or is it all about semantics?
Short answer: no, they are exactly the same.
Guess it could in theory depend on the compiler; a really broken one might do something slightly different but I'd be surprised.
Just for fun, here are two variants that compile down to exactly the same assembly code for me, using x86 gcc version 4.3.3 as shipped with Ubuntu. You can check the assembly produced in the final binary with objdump on Linux.
#include <cstdio>

int main()
{
#if 1
    int i = 10;
    do { printf("%d\n", i); } while (--i);
#else
    int i = 10;
    for (; i; --i) printf("%d\n", i);
#endif
}
EDIT: Here is an "oranges with oranges" while loop example that also compiles down to the same thing:
while(i) { printf("%d\n", i); --i; }
If your for and while loops do the same things, the machine code generated by the compiler should be (nearly) the same.
For instance in some testing I did a few years ago,
for (int i = 0; i < 10; i++)
{
    ...
}

and

int i = 0;
do
{
    ...
    i++;
}
while (i < 10);
would generate exactly the same code, or (as Neil pointed out in the comments) with one extra jmp, which won't make a big enough difference in performance to worry about.
There is no semantic difference, and there need not be any compiled difference. But it depends on the compiler. So I tried with g++ 4.3.2, CC 5.5, and xlc6.
g++, CC were identical, xlc WAS NOT
The difference in xlc was in the initial loop entry.
extern int doit( int );

void loop1( ) {
    for ( int ii = 0; ii < 10; ii++ ) {
        doit( ii );
    }
}

void loop2() {
    int ii = 0;
    while ( ii < 10 ) {
        doit( ii );
        ii++;
    }
}
XLC OUTPUT
.loop2: # 0x00000000 (H.10.NO_SYMBOL)
mfspr r0,LR
stu SP,-80(SP)
st r0,88(SP)
cal r3,0(r0)
st r3,64(SP)
l r3,64(SP) ### DIFFERENCE ###
cmpi 0,r3,10
bc BO_IF_NOT,CR0_LT,__L40
...
.loop1: # 0x0000006c (H.10.NO_SYMBOL+0x6c)
mfspr r0,LR
stu SP,-80(SP)
st r0,88(SP)
cal r3,0(r0)
cmpi 0,r3,10 ### DIFFERENCE ###
st r3,64(SP)
bc BO_IF_NOT,CR0_LT,__La8
...
The scope of the variable in the test of the while loop is wider than the scope of variables declared in the header of the for loop.
Therefore, if there are performance implications as a side-effect of keeping a variable alive longer, then there will be performance implications in choosing between a while and a for loop ( and not wrapping the while up in {} to reduce the scope of its variables ).
An example might be a concurrent collection which counts the number of iterators referring to it, and if more than one iterator exists, it applies locking to prevent concurrent modification, but as an optimisation elides the locking if only one iterator refers to it. If you then had two for loops in a function using differently named iterators on the same container, the fast path would be taken, but with two while loops the slow path would be taken. Similarly there may be performance implications if the objects are large (more cache traffic), or use system resources. But I can't think of a real example that I've ever seen where it would make a difference.
Compilers that optimize using loop unrolling will probably only do so in the for-loop case.
Both are equivalent. It's a matter of semantics.
The only difference may lie in the do... while construct, where you postpone the evaluation of the condition until after the body, and thus may save 1 evaluation.
i = 1; do { ... i--; } while( i > 0 );
as opposed to
for( i = 1; i > 0; --i )
{
    ....
}
I write compilers. We compile all "structured" control flow (if, while, for, switch, do...while) into conditional and unconditional branches. Then we analyze the control-flow graph. Since a C compiler has to deal with general goto anyway, it is easiest to reduce everything to branch and conditional-branch instructions, then be sure to handle that case well. (A C compiler has to do a good job not just on handwritten code but also on automatically generated code, which may have many, many goto statements.)
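As a rough sketch (not real compiler output), the while loop from the earlier answer might be lowered to branch form like this before the control-flow graph is analyzed:

extern int doit(int);

void loop2_lowered()
{
    int ii = 0;
loop_top:
    if (!(ii < 10)) goto loop_done;   // conditional branch on the test
    doit(ii);
    ii++;
    goto loop_top;                    // unconditional back-branch
loop_done:
    return;
}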
No. If they're doing equivalent things, they'll compile to the same code - as you say, it's about semantics. Choose the one that best represents what you're trying to express.
Ideally it should be the same, but ultimately it depends on your compiler/interpreter. To be sure, you must measure or examine the generated assembly code.
Proof that there may be a difference: These lines produce different assembly code using cc65.
for (; i < 1000; ++i);
while (i < 1000) ++i;
On the Atmel ATmega, while() is faster than for(). Why this is so is explained in AVR035: Efficient C Coding for AVR.
P.S. The original platform was not mentioned in the question.
continue behaves differently in for and while: in a for loop it still executes the increment expression, while in a while loop it jumps straight to the condition, skipping any increment at the end of the body (see the snippet below).
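To illustrate with a minimal snippet of mine:

// In a for loop, continue still executes the ++i step:
for (int i = 0; i < 10; ++i) {
    if (i % 2) continue; // fine: i keeps advancing
    // ...
}

// In a while loop, continue jumps straight back to the condition,
// skipping the increment at the end of the body:
int i = 0;
while (i < 10) {
    if (i % 2) continue; // bug: ++i below is skipped, so this never ends
    // ...
    ++i;
}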
To add another answer: In my experience, optimizing software is like a big, bushy beard being shaved off a man.
First you lop it off in big chunks with scissors (prune whole limbs off the call tree).
Then you make it short with an electric clipper (tweak algorithms).
Finally you shave it with a razor to get rid of the last little bit (low-level optimization).
The last is where the difference between for() and while() might, but probably won't, make a difference.
P.S. The programmers I know (who are all very good, and I suspect are a representative sample) basically go at it from the other direction.
They are the same as far as performance goes. I tend to use while when waiting for a state change (such as waiting for a buffer to be filled) and for when processing a number of discrete objects (such as going through each item in a collection).
There is a difference in some cases.
If you are at the point where that difference matters, you either need to pick a better algorithm or begin coding in assembly language. Trust me, coding in assembly is preferable to fixing your compiler version.
Is while() faster/slower than for()? Let's review a few things about optimization:
Compiler-writers work very hard to shave cycles by having fewer calls to jump, compare, increment, and the other kinds of instructions that they generate.
Call instructions, on the other hand, consume many magnitudes more cycles, but the compiler is nearly powerless to do anything to remove those.
As programmers, we write lots of function calls, some because we mean to, some because we're lazy, and some because the compiler slips them in without being obvious.
Most of the time, it doesn't matter, because the hardware is so fast, and our jobs are so small, that the computer is like a beagle dog who wolfs down her food and begs for more.
Sometimes, however, the job is big enough that performance is an issue.
What do we do then? Where's the bigger payoff?
Getting the compiler to shave a few cycles off loops & such?
Finding function calls that don't -really- need to be done so much?
The compiler can't do the latter. Only we the programmers can.
We need to learn or be taught how to do this. It doesn't come naturally.
We are congenitally inclined to make wrong guesses and then bet on them.
Getting better algorithms is a start, but only a start. Our teachers need to teach this, if indeed they know how.
Profilers are a start. I do this.
The apocryphal quote from Willie Sutton, when asked "Why do you rob banks?":
Because that's where the money is.
If you want to save cycles, find out where they are.
Probably only coding style.
for if you know the number of iterations.
while if you do not know the number of iterations.