I was hand-optimizing some of my code and got bitten by gcc somehow.
The original code when run through a test takes about 3.5 seconds to finish execution.
I was confused: why does my optimized version now need about 4.3 seconds to finish the same test?
I applied __attribute__((always_inline)) to one of the local static functions that stuck out in the profiler, and now it proudly runs in 2.9 seconds. Nice.
I've always trusted gcc to make the right call on function inlining, but apparently it isn't quite that reliable. I don't understand why gcc made such a bad decision about whether or not to inline a file-scope static function, even with -O3 -flto -fwhole-program. Is the compiler really just making a guesstimate of the cost-benefit of inlining a function?
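For reference, a minimal sketch of the kind of change involved - the function names and the arithmetic below are made-up stand-ins, not the original code:

    #include <cstddef>

    // always_inline asks GCC to inline regardless of its cost heuristic;
    // it is usually combined with the inline keyword.
    static inline int step(int acc, int value) __attribute__((always_inline));

    static inline int step(int acc, int value)
    {
        return acc * 31 + value;        // stand-in for the real hot computation
    }

    int checksum(const int *data, std::size_t n)
    {
        int acc = 0;
        for (std::size_t i = 0; i != n; ++i)
            acc = step(acc, data[i]);   // hypothetical hot call site
        return acc;
    }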
Edit: To answer the ACTUAL question: yes, the compiler does indeed "guesstimate" - or, to use the technical term, it applies "heuristics" - to weigh the speed gained against the space spent by inlining a particular function. A heuristic is "a practical but not theoretically perfect solution". End Edit.
Without seeing the code it's hard to say what is going on in the compiler. You are doing the right thing to profile your code, try your hand-optimisations and profile again - if it's better, keep it!
It's not that unusual for compilers to get it wrong from time to time. Humans are more clever at times - but I would generally trust the compiler to get it right. It could be that the function is called many times and is rather large, so the compiler decides it's above the threshold for code bloat vs. speed gain. Or it could be that it simply doesn't get the "how much better/worse is it to inline" computation right.
Remember that the compiler is meant to be generic, and what works for one case may make another case worse - so the compiler has to compromise and come up with reasonable heuristics that don't give too-bad results too often.
If you can run profile-guided optimisation, it may help the compiler make the right decision (as it will know how many iterations loops actually run and how often a particular branch is taken)...
If you can share the code with the GCC compiler team, report it as a bug - they may ignore/reject it as "too special" or some such, but it's quite possible that this particular case is something "that got missed out".
I think it's fair to say that the compiler "gets it right more often than not", but that doesn't mean it ALWAYS gets it right. I recently looked at some generated code from Clang, and it had emitted a whole bunch of extra instructions to unroll a loop - but in the most typical case, the loop would run one iteration, and never more than 16. So the additional instructions to unroll the loop by a factor of 4 were completely wasted for the one-iteration case, and fairly useless even for the longest possible loop. The natural "rolled" loop would be only about 3-4 instructions, so the saving was quite small even when the loop was a lot bigger - but of course, had it been a million iterations, unrolling would probably have tripled the speed of that function.
Related
I have a program that runs in around 1 minute when compiling with g++ without any options.
Compiling with -O3 however makes it run in around 1-2 seconds.
My question is whether it is normal to have this much of a speed-up, or is my code perhaps so bad that optimization can take away that much time? Obviously I know my code isn't perfect, but because of this huge speed-up I'm beginning to think it's worse than I thought. Please tell me what the "normal" amount of speed-up is (if that's a thing), and whether too much speed-up can mean bad code that could (and should) be easily optimized by hand instead of relying on the compiler.
How much faster is C++ code “supposed” to be with optimizations turned on?
In theory: There doesn't necessarily need to be any speed difference. Nor does there exist any upper limit to the speed difference. The C++ language simply doesn't specify a difference between optimisation and lack thereof.
In practice: It depends. Some programs have more to gain from optimisation than others. Some behaviours are easier to prove than others. Some optimisations can even make the program slower, because the compiler cannot know about everything that may happen at runtime.
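A toy illustration (not the asker's program) of why the gap can be that large:

    #include <cstddef>

    // At -O0 every access to `total` goes through the stack and the loop
    // handles one element per iteration; at -O3 the compiler keeps `total`
    // in a register, unrolls and vectorises the loop, and can inline the
    // whole function into its callers.
    long sum_values(const int *values, std::size_t n)
    {
        long total = 0;
        for (std::size_t i = 0; i < n; ++i)
            total += values[i];
        return total;
    }

Heavily templated code and standard-library containers tend to widen the gap further, because the many small helper functions they introduce only disappear once the optimiser inlines them.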
... 1 minute ... [optimisation] makes it run in around 1-2 seconds.
My question is whether it is normal to have this much of a speed up?
It is entirely normal. You cannot assume that you'll always get as much improvement, but this is not out of the ordinary.
Or is my code perhaps so bad, that optimization can take away that much time.
If the program is fast with optimisation, then it is a fast program. If the program is slow without optimisation, we don't care because we can enable optimisation. Usually, only the optimised speed is relevant.
Faster is better than slower, although that is not the only important metric of a program. Readability, maintainability and especially correctness are more important.
Please tell me ... whether ... code ... could ... be ... optimized by hand instead of relying on the compiler.
Everything could be optimized by hand, at least if you write the program in assembly.
... or should ...
No. There is no reason to waste time doing what the compiler has already done for you.
There are sometimes reasons to optimise by hand something that is already well optimised by the compiler. Relative speedup is not one of those reasons. An example of a valid reason is that the non-optimised build may be too slow to be executed for debugging purposes when there are real time requirements (whether hard or soft) involved.
If I have some code with simple arithmetic that repeats several times, will the compiler automatically optimize it?
Here is the example:
someArray[index + 1] = 5;
otherArray[index + 1] = 7;
Does it make sense to introduce a variable nextIndex = index + 1 from the performance point of view (not from the point of view of readable and maintainable code), or will the compiler do such an optimization automatically?
You should not worry about trivial optimizations like this, because almost all compilers have been doing them for the last 10-15 years or longer.
But if you have a really critical place in your code and want to get the maximum speed out of it, then you can check the generated assembler code for these lines to be sure that the compiler did this trivial optimization.
In some cases one more arithmetic addition can be faster than saving the value to a register or memory, and compilers know about this. You can make your code slower if you try to optimize trivial cases manually.
And you can use online services like https://gcc.godbolt.org to check the generated code (it supports gcc, clang, and icc in several versions).
The old adage "suck it and see" seems to be appropriate here. We often forget that by far the most common processors are 4/8/16 bit micros with weird and wonderful application specific architectures and suitably odd vendor specific compilers to go with them. They frequently have compiler extensions to "aid" (or confuse) the compiler into producing "better" code.
One DSP from the early 2000s carried out 8 instructions per clock cycle in parallel in a pipeline (complex ones - "load+increment+multiply+add+round"). The precondition for this to work was that everything had to be preloaded into the registers beforehand, which meant that registers were obviously at a premium (as always). With this architecture it was frequently better to throw a result away to free a register and use free slots that couldn't be parallelised (some instructions precluded the use of others in the same cycle) to recalculate it later. Did the compiler get this "right"? It would often keep the result around to reuse later, stalling the pipeline due to the lack of registers and slowing execution down.
So you compiled it, examined it, profiled it, etc., so that when the compiler got it "right" by its own rules, you could go in and fix it by hand. Without additional semantic information, which is not supported by the language, it is really hard to know what "right" is.
Conclusion: Suck it and see
Yes. It's a common optimization. https://en.wikipedia.org/wiki/Common_subexpression_elimination
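A quick way to convince yourself is to put both forms on https://gcc.godbolt.org at -O2 and compare the output; something like the sketch below typically compiles to identical code either way:

    // The compiler computes index + 1 once and reuses it (common
    // subexpression elimination), so the manual temporary buys nothing.
    void store_both(int *someArray, int *otherArray, int index)
    {
        someArray[index + 1] = 5;
        otherArray[index + 1] = 7;
    }

    void store_both_manual(int *someArray, int *otherArray, int index)
    {
        int nextIndex = index + 1;
        someArray[nextIndex] = 5;
        otherArray[nextIndex] = 7;
    }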
I have a section of my program that contains a large amount of math with some rather long equations. It's long and unsightly and I wish to replace it with a function. However, this chunk of code is used an extreme number of times in my code and also requires a lot of variables to be initialized.
If I'm worried about speed, is the cost of calling the function and initializing the variables negligible here, or should I stick to directly coding it in each time?
Thanks,
-Faken
Most compilers are smart about inlining reasonably small functions to avoid the overhead of a function call. For functions big enough that the compiler won't inline them, the overhead for the call is probably a very small fraction of the total execution time.
Check your compiler documentation to understand its specific approach. Some older compilers required, or could benefit from, hints that a function is a candidate for inlining.
Either way, stick with functions and keep your code clean.
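As a hedged sketch of what that looks like in practice (the equation and the names below are invented, not the asker's code) - a function this small is a prime candidate for automatic inlining, so in an optimized build the call and the parameter copies typically disappear:

    // Hypothetical example: the long, repeated equation wrapped in a function.
    inline double blend(double a, double b, double t)
    {
        return a * (1.0 - t) + b * t + 0.5 * (a - b) * t * t;  // stand-in math
    }

    double evaluate(double x, double y)
    {
        // Each of the many uses elsewhere in the program stays readable.
        return blend(x, y, 0.25) + blend(y, x, 0.75);
    }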
Are you asking if you should optimize prematurely?
Code it in a maintainable manner first; if you then find that this section is a bottleneck in the overall program, worry about tuning it at that point.
You don't know where your bottlenecks are until you profile your code. Anything you can assume about your code hot spots is likely to be wrong. I remember once I wanted to optimize some computational code. I ran a profiler and it turned out that 70 % of the running time was spent zeroing arrays. Nobody would have guessed it by looking at the code.
So, first code clean, then run a profiler, then optimize the rough spots. Not earlier. If it's still slow, change algorithm.
Modern C++ compilers generally inline small functions to avoid function call overhead. As far as the cost of variable initialization, one of the benefits of inlining is that it allows the compiler to perform additional optimizations at the call site. After performing inlining, if the compiler can prove that you don't need those extra variables, the copying will likely be eliminated. (I assume we're talking about primitives, not things with copy constructors.)
The only way to answer that is to test it. Without knowing more about the proposed function, nobody can really say whether the compiler can/will inline that code or not. This may/will also depend on the compiler and compiler flags you use. Depending on the compiler, if you find that it's really a problem, you may be able to use different flags, a pragma, etc., to force it to be generated inline even if it wouldn't be otherwise.
Without knowing how big the function would be, and/or how long it'll take to execute, it's impossible to guess how much effect on speed it'll have if it isn't generated inline.
With both of those being unknown, none of us can really guess at how much effect moving the code into a function will have. There might be none, or little or huge.
I would like to select the compiler optimizations to generate the fastest possible application.
Which of the following settings should I set to true?
Dead store elimination
Eliminate duplicate expressions within basic blocks and functions
Enable loop induction variable and strength reduction
Enable Pentium instruction scheduling
Expand common intrinsic functions
Optimize jumps
Use register variables
There is also the option 'Generate the fastest possible code.', which I have obviously set to true. However, when I set this to true, all the above options are still set at false.
So I would like to know if any of the above options will speed up the application if I set them to true?
So I would like to know if any of the above options will speed up the application if I set them to true?
I know some will hate me for this, but nobody here can answer you truthfully. You have to try your program with and without them, and profile each build and see what the results are. Guess-work won't get anybody anywhere.
Compilers already do tons(!) of great optimization, with or without your permission. Your best bet is to write your code in a clean and organized manner, and worry about maintainability and extensibility. As I like to say: code now, optimize later.
Don't micromanage down to the individual optimization. Compiler writers are very smart people - just turn them all on unless you see a specific need not to. Your time is better spent by optimizing your code (improve algorithmic complexity of your functions, etc) rather than fiddling with compiler options.
My other advice, use a different compiler. Intel has a great reputation as an optimizing compiler. VC and GCC of course are also great choices.
You could look at the generated code with different compile options to see which is fastest, but I understand that nowadays many people don't have experience doing this.
Therefore, it would be useful to profile the application. If there is an obvious portion requiring speed, add some code to execute it a thousand or ten million times and time it using utime() if it's available. The loop should run long enough that other processes running intermittently don't affect the result - ten to twenty seconds is a popular benchmark range. Or run multiple timing trials. Compile different test cases and run them to see what works best.
Spending an hour or two playing with optimization options will quickly reveal that most have minor effect. However, that same time spent thinking about the essence of the algorithm and making small changes (code removal is especially effective) can often vastly improve execution time.
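A minimal timing-harness sketch along those lines, using std::chrono here rather than a platform-specific timer (hot_routine is just a placeholder for whatever you actually want to measure):

    #include <chrono>
    #include <cstdio>

    // Placeholder for the code under test.
    int hot_routine(int x) { return x * x + 1; }

    int main()
    {
        using clock = std::chrono::steady_clock;

        volatile int sink = 0;              // stops the work being optimized away
        const long iterations = 10000000L;

        const auto start = clock::now();
        for (long i = 0; i < iterations; ++i)
            sink = hot_routine(static_cast<int>(i));
        const auto stop = clock::now();

        const std::chrono::duration<double> elapsed = stop - start;
        std::printf("%ld iterations in %.3f s (%.1f ns/iter)\n",
                    iterations, elapsed.count(),
                    1e9 * elapsed.count() / iterations);
        return 0;
    }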
Anyone know this compiler feature? It seems GCC supports it. How does it work? What is the potential gain? In which cases is it good? Inner loops?
(this question is specific, not about optimization in general, thanks)
It works by placing extra code to count the number of times each code path is taken. When you compile a second time, the compiler uses the knowledge gained about the execution of your program that it could only guess at before. There are a couple of things PGO can work toward:
Deciding which functions should be inlined or not depending on how often they are called.
Deciding how to place hints about which branch of an "if" statement should be predicted, based on the percentage of executions going one way or the other.
Deciding how to optimize loops based on how many iterations get taken each time that loop is called.
You never really know how much these things can help until you test it.
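With GCC the workflow is roughly: build with -fprofile-generate, run the instrumented binary on a representative workload, then rebuild with -fprofile-use. A minimal sketch (the program and file names are just stand-ins):

    // Build and run, roughly:
    //   g++ -O3 -fprofile-generate pgo_demo.cpp -o pgo_demo
    //   ./pgo_demo < representative_input.txt    (writes *.gcda profile data)
    //   g++ -O3 -fprofile-use pgo_demo.cpp -o pgo_demo
    #include <iostream>
    #include <string>

    // With profile data the compiler can see which branch dominates and
    // which call sites are hot enough to be worth inlining.
    static long classify(const std::string &line)
    {
        return line.empty() ? 0 : 1;
    }

    int main()
    {
        long nonEmpty = 0;
        std::string line;
        while (std::getline(std::cin, line))
            nonEmpty += classify(line);
        std::cout << nonEmpty << '\n';
    }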
PGO gives about a 5% speed boost when compiling x264, the project I work on, and we have a built-in system for it (make fprofiled). It's a nice free speed boost in some cases, and probably helps more in applications that, unlike x264, are less made up of handwritten assembly.
Jason's advice is right on. The best speed-ups you are going to get come from "discovering" that you let an O(n²) algorithm slip into an inner loop somewhere, or that you can cache certain computations outside of expensive functions.
Compared to the micro-optimizations that PGO can trigger, these are the big winners. Once you've done that level of optimization, PGO might be able to help. We never had much luck with it, though - the cost of the instrumentation was such that our application became unusably slow (by several orders of magnitude).
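A common shape of the second kind of win mentioned above (caching computations), with made-up names - compute a loop-invariant, expensive value once outside the loop instead of on every iteration:

    #include <cstddef>

    // Stand-in for something genuinely costly.
    double expensive_scale_factor(double calibration)
    {
        double f = 1.0;
        for (int i = 0; i < 1000; ++i)
            f += calibration / (i + 1.0);
        return f;
    }

    // Before: the expensive call sits inside the loop (and is repeated n times
    // unless the optimizer can prove it is loop-invariant and hoist it itself).
    void scale_slow(double *data, std::size_t n, double calibration)
    {
        for (std::size_t i = 0; i < n; ++i)
            data[i] *= expensive_scale_factor(calibration);
    }

    // After: the loop-invariant value is computed once and cached.
    void scale_fast(double *data, std::size_t n, double calibration)
    {
        const double factor = expensive_scale_factor(calibration);
        for (std::size_t i = 0; i < n; ++i)
            data[i] *= factor;
    }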
I like using Intel VTune as a profiler primarily because it is non-invasive compared to instrumenting profilers which change behaviour too much.
The fun thing about optimization is that speed gains are found in the unlikeliest of places.
It's also the reason you need a profiler, rather than guessing where the speed problems are.
I recommend starting with a profiler (gprof if you're using GCC) and just start poking around the results of running your application through some normal operations.