How to generate computation intensive code in C++ that will not be removed by compiler? [duplicate] - c++

This question already has an answer here:
How to prevent optimization of busy-wait
(1 answer)
Closed 7 years ago.
I am doing some experiments on CPU's performance. I wonder if anyone know a formal way or a tool to generate simple code that can run for a period of time (several seconds) and consumes significant computation resource of a CPU.
I know there are a lot of CPU benchmarks but the code of them is pretty complicated. What I want is a program more straight forward.
As the compiler is very smart, writing some redundant code as following will not work.
for (int i = 0; i < 100; i++) {
int a = i * 200 + 100;
}

Put the benchmark code in a function in a separate translation unit from the code that calls it. This prevents the code from being inlined, which can lead to aggressive optimizations.
Use parameters for the fixed values (e.g., the number of iterations to run) and return the resulting value. This prevents the optimizer from doing too much constant folding and it keeps it from eliminating calculations for a variable that it determines you never use.
Building on the example from the question:
int TheTest(int iterations) {
int a;
for (int i = 0; i < iterations; i++) {
a = i * 200 + 100;
}
return a;
}
Even in this example, there's still a chance that the compiler might realize that only the last iteration matters and completely omit the loop and just return 200*(iterations - 1) + 100, but I wouldn't expect that to happen in many real-life cases. Examine the generated code to be certain.
Other ideas, like using volatile on certain variables can inhibit some reasonable optimizations, which might make your benchmark perform worse that actual code.
There are also frameworks, like this one, for writing benchmarks like these.

It's not necessarily your optimiser that removes the code. CPU's these days are very powerful, and you need to increase the challenge level. However, note that your original code is not a good general benchmark: you only use a very subset of a CPU's instruction set. A good benchmark will try to challenge the CPU on different kinds of operations, to predict the performance in real world scenarios. Very good benchmarks will even put load on various components of your computer, to test their interplay.
Therefore, just stick to a well known published benchmark for your problem. There is a very good reason why they are more involved. However, if you really just want to benchmark your setup and code, then this time, just go for higher counter values:
double j=10000;
for (double i = 0; i < j*j*j*j*j; i++)
{
}
This should work better for now. Note that there a just more iterations. Change j according to your needs.

Related

Will an operation done several times in sequence be simplified by compiler?

I've had this question for a long time but never knew where to look. If a certain operation is written many times will the compiler simplify it or will it run the exact same operation and get the exact same answer?
For example, in the following c-like pseudo-code (i%3)*10 is repeated many times.
for(int i=0; i<100; i++) {
array[(i%3)*10] = someFunction((i%3)*10);
int otherVar = (i%3)*10 + array[(i%3)*10];
int lastVar = (i%3)*10 - otherVar;
anotherFunction(lastVar);
}
I understand a variable would be better for visual purposes, but is it also faster? Is (i%3)*10 calculated 5 times per loop?
There are certain cases where I don't know if its faster to use a variable or just leave the original operation.
Edit: using gcc (MinGW.org GCC-8.2.0-3) 8.2.0 on win 10
Which optimizations are done depends on the compiler, the compiler optimization flag(s) you specify, and the architecture.
Here are a few possible optimizations for your example:
Loop Unrolling This makes the binary larger and thus is a trade-off; for example you may not want this on a tiny microprocessor with very little memory.
Common Subexpression Elimination (CSE) you can be pretty sure that your (i % 3) * 10 will only be executed once per loop iteration.
About your concern about visual clarity vs. optimization: When dealing with a 'local situation' like yours, you should focus on code clarity.
Optimization gains are often to be made at a higher level; for example in the algorithm you use.
There's a lot to be said about optimization; the above are just a few opening remarks. It's great that you're interested in how things work, because this is important for a good (C/C++) programmer.
As a matter of course, you should remove the obfuscation present in your code:
for (int i = 0; i < 100; ++i) {
int i30 = i % 3 * 10;
int r = someFunction(i30);
array[i30] = r;
anotherFunction(-r);
}
Suddenly, it looks quite a lot simpler.
Leave it to the compiler (with appropriate options) to optimize your code unless you find you actually have to take a hand after measuring.
In this case, unrolling three times looks like a good idea for the compiler to pursue. Though inlining might always reveal even better options.
Yes, operations done several times in sequence will be optimized by a compiler.
To go into more detail, all major compilers (GCC, Clang, and MSVC) store the value of (i%3)*10 into a temporary (scratch, junk) register, and then use that whenever an equivalent expression is used again.
This optimization is called GCSE (GNU Common Subexpression Elimination) for GCC, and just CSE otherwise.
This takes a decent chunk out of the time that it takes to compute the loop.

Performance impact of using 'break' inside 'for-loop'

I have done my best and read a lot of Q&As on SO.SE, but I haven't found an answer to my particular question. Most for-loop and break related question refer to nested loops, while I am concerned with performance.
I want to know if using a break inside a for-loop has an impact on the performance of my C++ code (assuming the break gets almost never called). And if it has, I would also like to know tentatively how big the penalization is.
I am quite suspicions that it does indeed impact performance (although I do not know how much). So I wanted to ask you. My reasoning goes as follows:
Independently of the extra code for the conditional statements that
trigger the break (like an if), it necessarily ads additional
instructions to my loop.
Further, it probably also messes around when my compiler tries to
unfold the for-loop, as it no longer knows the number of iterations
that will run at compile time, effectively rendering it into a
while-loop.
Therefore, I suspect it does have a performance impact, which could be
considerable for very fast and tight loops.
So this takes me to a follow-up question. Is a for-loop & break performance-wise equal to a while-loop? Like in the following snippet, where we assume that checkCondition() evaluates 99.9% of the time as true. Do I loose the performance advantage of the for-loop?
// USING WHILE
int i = 100;
while( i-- && checkCondition())
{
// do stuff
}
// USING FOR
for(int i=100; i; --i)
{
if(checkCondition()) {
// do stuff
} else {
break;
}
}
I have tried it on my computer, but I get the same execution time. And being wary of the compiler and its optimization voodoo, I wanted to know the conceptual answer.
EDIT:
Note that I have measured the execution time of both versions in my complete code, without any real difference. Also, I do not trust compiling with -s (which I usually do) for this matter, as I am not interested in the particular result of my compiler. I am rather interested in the concept itself (in an academic sense) as I am not sure if I got this completely right :)
The principal answer is to avoid spending time on similar micro optimizations until you have verified that such condition evaluation is a bottleneck.
The real answer is that CPU have powerful branch prediction circuits which empirically work really well.
What will happen is that your CPU will choose if the branch is going to be taken or not and execute the code as if the if condition is not even present. Of course this relies on multiple assumptions, like not having side effects on the condition calculation (so that part of the body loop depends on it) and that that condition will always evaluate to false up to a certain point in which it will become true and stop the loop.
Some compilers also allow you to specify the likeliness of an evaluation as a hint the branch predictor.
If you want to see the semantic difference between the two code versions just compile them with -S and examinate the generated asm code, there's no other magic way to do it.
The only sensible answer to "what is the performance impact of ...", is "measure it". There are very few generic answers.
In the particular case you show, it would be rather surprising if an optimising compiler generated significantly different code for the two examples. On the other hand, I can believe that a loop like:
unsigned sum = 0;
unsigned stop = -1;
for (int i = 0; i<32; i++)
{
stop &= checkcondition(); // returns 0 or all-bits-set;
sum += (stop & x[i]);
}
might be faster than:
unsigned sum = 0;
for (int i = 0; i<32; i++)
{
if (!checkcondition())
break;
sum += x[i];
}
for a particular compiler, for a particular platform, with the right optimization levels set, and for a particular pattern of "checkcondition" results.
... but the only way to tell would be to measure.

What can I assume about C/C++ compiler optimisations?

I would like to know how to avoid wasting my time and risking typos by re-hashing source code when I'm integrating legacy code, library code or sample code into my own codebase.
If I give a simple example, based on an image processing scenario, you might see what I mean.
It's actually not unusual to find I'm integrating a code snippet like this:
for (unsigned int y = 0; y < uHeight; y++)
{
for (unsigned int x = 0; x < uWidth; x++)
{
// do something with this pixel ....
uPixel = pPixels[y * uStride + x];
}
}
Over time, I've become accustomed to doing things like moving unnecessary calculations out of the inner loop and maybe changing the postfix increments to prefix ...
for (unsigned int y = 0; y < uHeight; ++y)
{
unsigned int uRowOffset = y * uStride;
for (unsigned int x = 0; x < uWidth; ++x)
{
// do something with this pixel ....
uPixel = pPixels[uRowOffset + x];
}
}
Or, I might use pointer arithmetic, either by row ...
for (unsigned int y = 0; y < uHeight; ++y)
{
unsigned char *pRow = pPixels + (y * uStride);
for (unsigned int x = 0; x < uWidth; ++x)
{
// do something with this pixel ....
uPixel = pRow[x];
}
}
... or by row and column ... so I end up with something like this
unsigned char *pRow = pPixels;
for (unsigned int y = 0; y < uHeight; ++y)
{
unsigned char *pPixel = pRow;
for (unsigned int x = 0; x < uWidth; ++x)
{
// do something with this pixel ....
uPixel = *pPixel++;
}
// next row
pRow += uStride;
}
Now, when I write from scratch, I'll habitually apply my own "optimisations" but I'm aware that the compiler will also be doing things like:
Moving code from inside loops to outside loops
Changing postfix increments to prefix
Lots of other stuff that I have no idea about
Bearing in mind that every time I mess with a piece of working, tested code in this way, I not only cost myself some time but I also run the risk that I'll introduce bugs with finger trouble or whatever (the above examples are simplified). I'm aware of "premature optimisation" and also other ways of improving performance by designing better algorithms, etc. but for the situations above I'm creating building-blocks that will be used in larger pipelined type of apps, where I can't predict what the non-functional requirements might be so I just want the code as fast and tight as is reasonable within time limits (I mean the time I spend tweaking the code).
So, my question is: Where can I find out what compiler optimisations are commonly supported by "modern" compilers. I'm using a mixture of Visual Studio 2008 and 2012, but would be interested to know if there are differences with alternatives e.g. Intel's C/C++ Compiler. Can anyone shed some insight and/or point me at a useful web link, book or other reference?
EDIT
Just to clarify my question
The optimisations I showed above were simple examples, not a complete list. I know that it's pointless (from a performance point of view) to make those specific changes because the compiler will do it anyway.
I'm specifically looking for information about what optimisations are provided by the compilers I'm using.
I would expect most of the optimizations that you include as examples to be a waste of time. A good optimizing compiler should be able to do all of this for you.
I can offer three suggestions by way of practical advice:
Profile your code in the context of a real application processing real data. If you can't, come up with some synthetic tests that you think would closely mimic the final system.
Only optimize code that you have demonstrated through profiling to be a bottleneck.
If you are convinced that a piece of code needs optimization, don't just assume that factoring invariant expression out of a loop would improve performance. Always benchmark, optionally looking at the generated assembly to gain further insight.
The above advice applies to any optimizations. However, the last point is particularly relevant to low-level optimizations. They are a bit of a black art since there are a lot of relevant architectural details involved: memory hierarchy and bandwidth, instruction pipelining, branch prediction, the use of SIMD instructions etc.
I think it's better to rely on the compiler writer having a good knowledge of the target architecture than to try and outsmart them.
From time to time you will find through profiling that you need to optimize things by hand. However, these instances will be fairly rare, which will allow you to spend a good deal of energy on things that will actually make a difference.
In the meantime, focus on writing correct and maintainable code.
I think it would probably be more useful for you to reconsider the premise of your question, rather than to get a direct answer.
Why do you want to perform these optimizations? Judging by your question, I assume it is to make a concrete program faster. If that is the case, you need to start with the question: How do I make this program faster?
That question has a very different answer. First, you need to consider Amdahl's law. That usually means that it only makes sense to optimize one or two important parts of the program. Everything else pretty much doesn't matter. You should use a profiler to locate these parts of the program. At this point you might argue that you already know that you should use a profiler. However, almost all the programmers I know don't profile their code, even if they know that they should. Knowing about vegetables isn't the same as eating them. ;-)
Once you have located the hot-spots, the solution will probably involve:
Improving the algorithm, so the code does less work.
Improving the memory access patterns, to improve cache performance.
Again, you should use the profiler to see if your changes have improved the run-time.
For more details, you can Google code optimization and similar terms.
If you want to get really serious, you should also take a look at Agner Fog's optimization manuals and Computer Architecture: A Quantitative Approach. Make sure to get the newest edition.
You might also might want to read The Sad Tragedy of Micro-Optimization Theater.
What can I assume about C/C++ compiler optimisations?
As possible as you can imagine, except the cases that you get issues either functionality or performance with the optimized code, then turn off optimization and debug.
Mordern compilers have various strategies to optimize your code, especially when you are doing concurrent programming, and using libraries like OMP, Boost or TBB.
If you DO care about what your code exactly made into machine code, it would be no more better to decompile it and observe the assemblies.
The most thing for you to do the manual optimization, might be reduce the unpredictable branches, which is some harder be done by the compiler.
If you want to look for the informations about optimization, there's already a question on SO
What are the c++ compiler optimization techniques in Visual studio
In the optimization options, there are explanations about what each optimize for:
/O Options (Optimize Code)
And there's something about optimization strategies and techniques
C++ Optimization Strategies and Techniques, by Pete Isensee

How do I force the compiler not to skip my function calls?

Let's say I want to benchmark two competing implementations of some function double a(double b, double c). I already have a large array <double, 1000000> vals from which I can take input values, so my benchmarking would look roughly like this:
//start timer here
double r;
for (int i = 0; i < 1000000; i+=2) {
r = a(vals[i], vals[i+1]);
}
//stop timer here
Now, a clever compiler could realize that I can only ever use the result of the last iteration and simply kill the rest, leaving me with double r = a(vals[999998], vals[999999]). This of course defeats the purpose of benchmarking.
Is there a good way (bonus points if it works on multiple compilers) to prevent this kind of optimization while keeping all other optimizations in place?
(I have seen other threads about inserting empty asm blocks but I'm worried that might prevent inlining or reordering. I'm also not particularly fond of the idea of adding the results sum += r; during each iteration because that's extra work that should not be included in the resulting timings. For the purposes of this question, it would be great if we could focus on other alternative solutions, although for anyone interested in this there is a lively discussion in the comments where the consensus is that += is the most appropriate method in many cases. )
Put a in a separate compilation unit and do not use LTO (link-time optimizations). That way:
The loop is always identical (no difference due to optimizations based on a)
The overhead of the function call is always the same
To measure the pure overhead and to have a baseline to compare implementations, just benchmark an empty version of a
Note that the compiler can not assume that the call to a has no side-effect, so it can not optimize the loop away and replace it with just the last call.
A totally different approach could use RDTSC, which is a hardware register in the CPU core that measures the clock cycles. It's sometimes useful for micro-benchmarks, but it's not exactly trivial to understand the results correctly. For example, check out this and goggle/search SO for more information on RDTSCs.

Is Loop Hoisting still a valid manual optimization for C code?

Using the latest gcc compiler, do I still have to think about these types of manual loop optimizations, or will the compiler take care of them for me well enough?
If your profiler tells you there is a problem with a loop, and only then, a thing to watch out for is a memory reference in the loop which you know is invariant across the loop but the compiler does not. Here's a contrived example, bubbling an element out to the end of an array:
for ( ; i < a->length - 1; i++)
swap_elements(a, i, i+1);
You may know that the call to swap_elements does not change the value of a->length, but if the definition of swap_elements is in another source file, it is quite likely that the compiler does not. Hence it can be worthwhile hoisting the computation of a->length out of the loop:
int n = a->length;
for ( ; i < n - 1; i++)
swap_elements(a, i, i+1);
On performance-critical inner loops, my students get measurable speedups with transformations like this one.
Note that there's no need to hoist the computation of n-1; any optimizing compiler is perfectly capable of discovering loop-invariant computations among local variables. It's memory references and function calls that may be more difficult. And the code with n-1 is more manifestly correct.
As others have noted, you have no business doing any of this until you've profiled and have discovered that the loop is a performance bottleneck that actually matters.
Write the code, profile it, and only think about optimising it when you have found something that is not fast enough, and you can't think of an alternative algorithm that will reduce/avoid the bottleneck in the first place.
With modern compilers, this advice is even more important - if you write simple clean code, the compiler's optimiser can often do a better job of optimising the code than it can if you try to give it snazzy "pre-optimised" code.
Check the generated assembly and see for yourself. See if the computation for the loop-invariant code is being done inside the loop or outside the loop in the assembly code that your compiler generates. If it's failing to do the loop hoisting, do the hoisting yourself.
But as others have said, you should always profile first to find your bottlenecks. Once you've determined that this is in fact a bottleneck, only then should you check to see if the compiler's performing loop hoisting (aka loop-invariant code motion) in the hot spots. If it's not, help it out.
Compilers generally do an excellent job with this type of optimization, but they do miss some cases. Generally, my advice is: write your code to be as readable as possible (which may mean that you hoist loop invariants -- I prefer to read code written that way), and if the compiler misses optimizations, file bugs to help fix the compiler. Only put the optimization into your source if you have a hard performance requirement that can't wait on a compiler fix, or the compiler writers tell you that they're not going to be able to address the issue.
Where they are likely to be important to performance, you still have to think about them.
Loop hoisting is most beneficial when the value being hoisted takes a lot of work to calculate. If it takes a lot of work to calculate, it's probably a call out of line. If it's a call out of line, the latest version of gcc is much less likely than you are to figure out that it will return the same value every time.
Sometimes people tell you to profile first. They don't really mean it, they just think that if you're smart enough to figure out when it's worth worrying about performance, then you're smart enough to ignore their rule of thumb. Obviously, the following code might as well be "prematurely optimized", whether you have profiled or not:
#include <iostream>
bool isPrime(int p) {
for (int i = 2; i*i <= p; ++i) {
if ((p % i) == 0) return false;
}
return true;
}
int countPrimesLessThan(int max) {
int count = 0;
for (int i = 2; i < max; ++i) {
if (isPrime(i)) ++count;
}
return count;
}
int main() {
for (int i = 0; i < 10; ++i) {
std::cout << "The number of primes less than 1 million is: ";
std::cout << countPrimesLessThan(1000*1000);
std::cout << std::endl;
}
}
It takes a "special" approach to software development not to manually hoist that call to countPrimesLessThan out of the loop, whether you've profiled or not.
Early optimizations are bad only if other aspects - like readability, clarity of intent, or structure - are negatively affected.
If you have to declare it anyway, loop hoisting can even improve clarity, and it explicitely documents your assumption "this value doesn't change".
As a rule of thumb I wouldn't hoist the count/end iterator for a std::vector, because it's a common scenario easily optimized. I wouldn't hoist anything that I can trust my optimizer to hoist, and I wouldn't hoist anything known to be not critical - e.g. when running through a list of dozen windows to respond to a button click. Even if it takes 50ms, it will still appear "instanteneous" to the user. (But even that is a dangerous assumption: if a new feature requires looping 20 times over this same code, it suddenly is slow). You should still hoist operations such as opening a file handle to append, etc.
In many cases - very well in loop hoisting - it helps a lot to consider relative cost: what is the cost of the hoisted calculation compared to the cost of running through the body?
As for optimizations in general, there are quite some cases where the profiler doesn't help. Code may have very different behavior depending on the call path. Library writers often don't know their call path otr frequency. Isolating a piece of code to make things comparable can already alter the behavior significantly. The profiler may tell you "Loop X is slow", but it won't tell you "Loop X is slow because call Y is thrashing the cache for everyone else". A profiler couldn't tell you "this code is fast because of your snarky CPU, but it will be slow on Steve's computer".
A good rule of thumb is usually that the compiler performs the optimizations it is able to.
Does the optimization require any knowledge about your code that isn't immediately obvious to the compiler? Then it is hard for the compiler to apply the optimization automatically, and you may want to do it yourself
In most cases, lop hoisting is a fully automatic process requiring no high-level knowledge of the code -- just a lot of lifetime and dependency analysis, which is what the compiler excels at in the first place.
It is possible to write code where the compiler is unable to determine whether something can be hoisted out safely though -- and in those cases, you may want to do it yourself, as it is a very efficient optimization.
As an example, take the snippet posted by Steve Jessop:
for (int i = 0; i < 10; ++i) {
std::cout << "The number of primes less than 1 billion is: ";
std::cout << countPrimesLessThan(1000*1000*1000);
std::cout << std::endl;
}
Is it safe to hoist out the call to countPrimesLessThan? That depends on how and where the function is defined. What if it has side effects? It may make an important difference whether it is called once or ten times, as well as when it is called. If we don't know how the function is defined, we can't move it outside the loop. And the same is true if the compiler is to perform the optimization.
Is the function definition visible to the compiler? And is the function short enough that we can trust the compiler to inline it, or at least analyze the function for side effects? If so, then yes, it will hoist it outside the loop.
If the definition is not visible, or if the function is very big and complicated, then the compiler will probably assume that the function call can not be moved safely, and then it won't automatically hoist it out.
Remember 80-20 Rule.(80% of execution time is spent on 20% critical code in the program)
There is no meaning in optimizing the code which have no significant effect on program's overall efficiency.
One should not bother about such kind of local optimization in the code.So the best approach is to profile the code to figure out the critical parts in the program which consumes heavy CPU cycles and try to optimize it.This kind of optimization will really makes some sense and will result in improved program efficiency.