Will compiler automatically optimize repeating code? - c++

If I have some code with simple arithmetics that is repeating several times. Will the compiler automatically optimize it?
Here the example:
someArray[index + 1] = 5;
otherArray[index + 1] = 7;
Does it make sense to introduce variable nextIndex = index + 1 from the perfomance point of view, (not from the point of view of good readable and maintanable code) or the compiler will do such optimization automatically?

You should not worry about trivial optimization like this because almost all compilers do it last 10-15 years or longer.
But if you have a really critical place in your code and want to get maximal speed of running, than you can check generated assembler code for this lines to be sure that compiler did this trivial optimization.
In some cases one more arithmetic addition could be more faster version of code than saving in register or memory, and compilers knows about this. You can make your code slower if you try optimize trivial cases manually.
And you can use online services like https://gcc.godbolt.org for check generated code (support gcc, clang, icc in several version).

The old adage "suck it and see" seems to be appropriate here. We often forget that by far the most common processors are 4/8/16 bit micros with weird and wonderful application specific architectures and suitably odd vendor specific compilers to go with them. They frequently have compiler extensions to "aid" (or confuse) the compiler into producing "better" code.
One DSP from early 2000s carried out 8 instructions per clock-cycle in parallel in a pipeline (complex - "load+increment+multiply+add+round"). The precondition for this to work was that everything had to be preloaded into the registers beforehand. This meant that registers were obviously at a premium (as always). With this architecture it was frequently better to bin results to free registers and use free slots that couldn't be paralleled (some instructions precluded the use of others in the same cycle) to recalculate it later. Did the compiler get this "right"?. Yes, it often kept the result to reuse later with the result that it stalled the pipeline due to lack of registers which resulted in slower execution speed.
So, you compiled it, examined it, profiled it etc. to make sure that the when when the compiler got it "right" we could go in and fix it. Without additional semantic information which is not supported by the language it is really hard to know what "right" is.
Conclusion: Suck it and see

Yes. It's a common optimization. https://en.wikipedia.org/wiki/Common_subexpression_elimination

Related

Require compiler to emit branchless/constant-time code

In cryptography, any piece of code that depends on secret data (such as a private key) must execute in constant time in order to avoid side-channel timing attacks.
The most popular architectures currently (x86-64 and ARM AArch64) both support certain kinds of conditional execution instructions, such as:
CMOVcc, SETcc for x86-64
CSINCcc, CSINVcc, CSNEGcc for AArch64
Even when such instructions are not available, there are techniques to convert a piece of code into a branchless version. Performance may suffer, but in this scenario it's not the primary goal -- running in constant time is.
Therefore, it should in principle be possible to write branchless code in e.g. C/C++, and indeed it is seen that gcc/clang will often emit branchless code with optimizations turned on (there is even a specific flag for this in gcc: -fif-conversion2). However, this appears to be an optimization decision, and if the compiler thinks branchless will perform worse (say, if the "then" and "else" clauses perform a lot of computation, more than the cost of flushing the pipeline in case of a wrongly predicted branch), then I assume the compiler will emit regular code.
If constant-time is a non-negotiable goal, one may be forced to use some of the aforementioned tricks to generate branchless code, making the code less clear. Also, performance is often a secondary and quite important goal, so the developer has to hope that the compiler will infer the intended operation behind the branchless code and emit an efficient instruction sequence, often using the instructions mentioned above. This may require rewriting the code over and over while looking at the assembly output, until a magic incantation satisfies the compilers -- and this may change from compiler to compiler, or when a new version comes out.
Overall, this is an awful situation on both sides: compiler writers must infer intent from obfuscated code, transforming it into a much simpler instruction sequence; while developers must write such obfuscated code, since there are no guarantees that simple, clear code would actually run in constant time.
Making this into a question: if a certain piece of code must be emitted in constant-time (or not at all), is there a compiler flag or pragma that will force the code to be emitted as such, even if the compiler predicts worse performance than the branched version, or abort the compilation if it is not possible? Developers would be able to write clear code with the peace of mind that it will be constant-time, while supplying the compiler with clear and easy to analyze code. I understand this is probably a language- and compiler-dependent question, so I would be satisfied with either C or C++ answers, for either gcc or clang.
I found this question by going down a similar rabbit hole. For security purposes I require my code to not branch on secret data and to not leak information trough timing attacks.
While not an answer per se I can recommend this paper from the S&P 2018: https://ieeexplore.ieee.org/document/8406587.
The authors also wrote and extension for CLang/LLVM. I am not sure how well this extension works but it's a first step and gives a good overview on where we currently stand in the research context.

Better not trust the gcc default inliner?

I was doing some hand optimizing some of my code and got bitten by gcc somehow.
The original code when run through a test takes about 3.5 seconds to finish execution.
I was confused why my optimized version now needs about 4.3 seconds to finish the test?
I applied __attribute__((always_inline)) to one of the local static functions that sticked out in the profiler and now it proudly runs within 2.9 seconds. Nice.
I've always trusted gcc to make the decision in function inlining, but apparently it doesn't seem really that perfect. I don't understand why gcc ended up with a very wrong decision whether or not to inline a file-scope static function with -O3 -flto -fwhole-program. Is the compiler really just doing a guestimate to the cost-benefits of inlining a function?
Edit: To answer the ACTUAL question, yes, the compiler does indeed "guesstimate" - or as the technical term is, it uses "heuristics" - to determine the gain in speed vs. space that inlining a particular function will result in. Heuristics is defined as "a practical but not theoretically perfect solution". End Edit.
Without seeing the code it's hard to say what is going on in the compiler. You are doing the right thing to profile your code, try your hand-optimisations and profile again - if it's better, keep it!
It's not that unusual that compilers get it wrong from time to time. Humans are more clever at times - but I would generally trust the compiler to get it right. It could be that the function is called many times and rather large, and thus the compiler decides that "it's above the threshold for code-bloat vs. speed gain"? Or it could be that it just doesn't get the "how much better/worse is it to inline" computation right.
Remember that the compiler is meant to be generic, and what works for one case may make another case worse - so the compiler has to compromise and come up with some reasonable heuristics that doesn't give too bad results too often.
If you can run profile-guided optimisation, it may help the compiler make the right decision (as it will know how many iterations and how often a particular branch is taken)...
If you can share the code with the GCC compiler team, report it as a bug - they may ignore/reject it as "too special" or some such, but it's quite possible that this particular case is something "that got missed out".
I think it's fair to say that the compiler "gets it right more often than not", but it doesn't mean it ALWAYS gets it right. I recently looked at some generated code from Clang, and it a whole bunch of extra instructions to unroll the loop - but in the most typical case, the loop would be one iteration, and never more than 16. So the additional instructions to unroll the loop by a factor of 4 was completely wasted for the case of one, and fairly useless even for the longest possible loop. The natural loop "rolled" would just be about 3-4 instructions, so the saving was quite small even if the loop was a lot bigger - but of course, had it been a million iterations, it would probably have tripled the speed of that function.

How do I find how C++ compiler implements something except inspecting emitted machine code?

Suppose I crafted a set of classes to abstract something and now I worry whether my C++ compiler will be able to peel off those wrappings and emit really clean, concise and fast code. How do I find out what the compiler decided to do?
The only way I know is to inspect the disassembly. This works well for simple code, but there're two drawbacks - the compiler might do it different when it compiles the same code again and also machine code analysis is not trivial, so it takes effort.
How else can I find how the compiler decided to implement what I coded in C++?
I'm afraid you're out of luck on this one. You're trying to find out "what the compiler did". What the compiler did is to produce machine code. The disassembly is simply a more readable form of the machine code, but it can't add information that isn't there. You can't figure out how a meat grinder works by looking at a hamburger.
I was actually wondering about that.
I have been quite interested, for the last few months, in the Clang project.
One of Clang particular interests, wrt optimization, is that you can emit the optimized LLVM IR code instead of machine code. The IR is a high-level assembly language, with the notion of structure and type.
Most of the optimizations passes in the Clang compiler suite are indeed performed on the IR (the last round is of course architecture specific and performed by the backend depending on the available operations), this means that you could actually see, right in the IR, if the object creation (as in your linked question) was optimized out or not.
I know it is still assembly (though of higher level), but it does seem more readable to me:
far less opcodes
typed objects / pointers
no "register" things or "magic" knowledge required
Would that suit you :) ?
Timing the code will directly measure its speed and can avoid looking at the disassembly entirely. This will detect when compiler, code modifications or subtle configuration changes have affected the performance (either for better or worse). In that way it's better than the disassembly which is only an indirect measure.
Things like code size can also serve as possible indicators of problems. At the very least they suggest that something has changed. It can also point out unexpected code bloat when the compiler should have boiled down a bunch of templates (or whatever) into a concise series of instructions.
Of course, looking at the disassembly is an excellent technique for developing the code and helping decide if the compiler is doing a sufficiently good translation. You can see if you're getting your money's worth, as it were.
In other words, measure what you expect and then dive in if you think the compiler is "cheating" you.
You want to know if the compiler produced "clean, concise and fast code".
"Clean" has little meaning here. Clean code is code which promotes readability and maintainability -- by human beings. Thus, this property relates to what the programmer sees, i.e. the source code. There is no notion of cleanliness for binary code produced by a compiler that will be looked at by the CPU only. If you wrote a nice set of classes to abstract your problem, then your code is as clean as it can get.
"Concise code" has two meanings. For source code, this is about saving the scarce programmer eye and brain resources, but, as I pointed out above, this does not apply to compiler output, since there is no human involved at that point. The other meaning is about code which is compact, thus having lower storage cost. This can have an impact on execution speed, because RAM is slow, and thus you really want the innermost loops of your code to fit in the CPU level 1 cache. The size of the functions produced by the compiler can be obtained with some developer tools; on systems which use GNU binutils, you can use the size command to get the total code and data sizes in an object file (a compiled .o), and objdump to get more information. In particular, objdump -x will give the size of each individual function.
"Fast" is something to be measured. If you want to know whether your code is fast or not, then benchmark it. If the code turns out to be too slow for your problem at hand (this does not happen often) and you have some compelling theoretical reason to believe that the hardware could do much better (e.g. because you estimated the number of involved operations, delved into the CPU manuals, and mastered all the memory bandwidth and cache issues), then (and only then) is it time to have a look at what the compiler did with your code. Barring these conditions, cleanliness of source code is a much more important issue.
All that being said, it can help quite a lot if you have a priori notions of what a compiler can do. This requires some training. I suggest that you have a look at the classic dragon book; but otherwise you will have to spend some time compiling some example code and looking at the assembly output. C++ is not the easiest language for that, you may want to begin with plain C. Ideally, once you know enough to be able to write your own compiler, then you know what a compiler can do, and you can guess what it will do on a given code.
You might find a compiler that had an option to dump a post-optimisation AST/representation - how readable it would be is another matter. If you're using GCC, there's a chance it wouldn't be too hard, and that someone might have already done it - GCCXML does something vaguely similar. Of little use if the compiler you want to build your production code on can't do it.
After that, some compiler (e.g. gcc with -S) can output assembly language, which might be usefully clearer than reading a disassembly: for example, some compilers alternate high-level source as comments then corresponding assembly.
As for the drawbacks you mentioned:
the compiler might do it different when it compiles the same code again
absolutely, only the compiler docs and/or source code can tell you the chance of that, though you can put some performance checks into nightly test runs so you'll get alerted if performance suddenly changes
and also machine code analysis is not trivial, so it takes effort.
Which raises the question: what would be better. I can image some process where you run the compiler over your code and it records when variables are cached in registers at points of use, which function calls are inlined, even the maximum number of CPU cycles an instruction might take (where knowable at compile time) etc. and produces some record thereof, then a source viewer/editor that colour codes and annotates the source correspondingly. Is that the kind of thing you have in mind? Would it be useful? Perhaps some more than others - e.g. black-and-white info on register usage ignores the utility of the various levels of CPU cache (and utilisation at run-time); the compiler probably doesn't even try to model that anyway. Knowing where inlining was really being done would give me a warm fuzzy feeling. But, profiling seems more promising and useful generally. I fear the benefits are more intuitively real than actually, and compiler writers are better off pursuing C++0x features, run-time instrumentation, introspection, or writing D "on the side" ;-).
The answer to your question was pretty much nailed by Karl. If you want to see what the compiler did, you have to start going through the assembly code it produced--elbow grease is required. As to discovering the "why" behind the "how" of how it implemented your code...every compiler (and every build, potentially), as you mentioned, is different. There are different approaches, different optimizations, etc. However, I wouldn't worry about whether it's emitting clean, concise machine code--cleanliness and concision should be left to the source code. Speed, on the other hand, is pretty much the programmer's responsibility (profiling ftw). More interesting concerns are correctness, maintainability, readability, etc. If you want to see if it made a specific optimization, the compiler docs might help (if they're available for your compiler). You can also just trying searching to see if the compiler implements a known technique for optimizing whatever. If those approaches fail, though, you're right back to reading assembly code. Keep in mind that the code that you're checking out might have little to no impact on performance or executable size--grab some hard data before diving into any of this stuff.
Actually, there is a way to get what you want, if you can get your compiler to
produce DWARF debugging information. There will be a DWARF description for each
out-of-line function and within that description there will (hopefully) be entries
for each inlined function. It's not trivial to read DWARF, and sometimes compilers
don't produce complete or accurate DWARF, but it can be a useful source of information
about what the compiler actually did, that's not tied to any one compiler or CPU.
Once you have a DWARF reading library there are all sorts of useful tools you can
build around it.
Don't expect to use it with Visual C++ as that uses a different debugging format.
(But you might be able to do similar queries through the debug helper library
that comes with it.)
If your compiler manages to translate your "wrappings and emit really clean, concise and fast code" the effort to follow-up the emitted code should be reasonable.
Contrary to another answer I feel that emitted assembly code may well be "clean" if it is (relatively) easily mappable to the original source code, if it doesn't consist of calls all over the place and that the system of jumps is not too complex. With code scheduling and re-ordering an optimized machine code which is also readable is, alas, a thing of the past.

How to know what optimizations are done automatically by my compiler

I was going through this link Will it optimize and wondered how can we know what optimizations are done by a particular compiler.
Like does VC8.0 convert if-else statements to switch-case?
Is such information available on msdn?
As everyone seems to be bent on telling the OP that he shouldn't worry about it, there is some useful although not as specific as the OP requested) information about compiler optimization (options).
You'll have to figure out what flags you're using, especially for MSVC and Intel (GCC release build should default to -O2), but here are the links:
GCC
MSVC
Intel
This is about as close as you'll get before disassembling your binary after compilation.
It depends on the level of of optimization you choose for compiler.
you can find a very nice article about it here
First of all, if optimization took place then your program should work faster usually. After that you could inspect disassembly code to find out what kind of optimizations were performed.
I don't know anything about VC8.0, so I'm not sure how you would access that information. However, if you are generally interested in the kinds of optimisations that go on and want to experiment, I recommend you use LLVM. You can look at the unoptimised, disassembled byte code generated from the default C front end, and then run various optimiser passes over it to see what the effect is each time. Because it's a nicer, abstract assembly code, it tends to be a little easier to figure out what is an optimisation derivable from the code and what is a machine-specific optimisation.
Like does VC8.0 convert if-else statements to switch-case?
Compilers do not do magically rewrite your source code. And even if they did, what would that tell you? What you really would want to know is if the compiler compiled it into a jump table or into multiple compare operations. Any dis-assembler will tell you that.
To clarify my point: Writing a switch-case statement does not necesseraly imply that there will be a jump table in the binary. Not needing to worry about this is the whole point of having compilers.
Instead of figuring out which optimizations are done by the compiler in general, it's probably better to NOT have any dependencies on such compiler-specific knowledge.
Instead start out with a good design and algorithm, writing (as much as possible) portable code that's easy to follow. Then profile the code if it's too slow and fix the actual hotspots. Compiler optimizations are useful no doubt, but better is to apply some investigation to what's actually happening in the code. Algorithmic/design improvements at the source level will typically help performance more than the presence or absence of optimizations like transforming if/else into switch-case.
I'm not sure what "convert if/else to switch/case" means. My processor doesn't have a hardware switch/case instruction.
Typical compilers have several different ways to implement switch/case. A well-known one is using a jump table, but this is only done if appropriate.
For if/else, certainly it is normal for compilers to analyse a digraph of execution flow. I would expect a compiler to notice if each condition references the same variable, and I would expect the compiler to treat equivalent forms of conditionals the same way in general. But this isn't something I'd worry about.
IIRC, the general policy in GCC is that regressions in optimisation are tolerable so long as preferred improvements result. Optimisation is complex and what is "generally" a good optimisation isn't always that great. Plus for perfect optimisation, the compiler would have to know things it can't know (e.g. what inputs it will encounter in real life).
The point is that it really isn't worthwhile knowing that much about specific optimisations unless you happen to be a compiler developer. If you depend on something being optimised by V8, that particular optimisation might not happen in V9 or V10.

Compiler optimization for fastest possible code

I would like to select the compiler optimizations to generate the fastest possible application.
Which of the following settings should I set to true?
Dead store elimination
Eliminate duplicate expressions within basic blocks and functions
Enable loop induction variable and strength reduction
Enable Pentium instruction scheduling
Expand common intrinsic functions
Optimize jumps
Use register variables
There is also the option 'Generate the fastest possible code.', which I have obviously set to true. However, when I set this to true, all the above options are still set at false.
So I would like to know if any of the above options will speed up the application if I set them to true?
So I would like to know if any of the above options will speed up the application if I set them to true?
I know some will hate me for this, but nobody here can answer you truthfully. You have to try your program with and without them, and profile each build and see what the results are. Guess-work won't get anybody anywhere.
Compilers already do tons(!) of great optimization, with or without your permission. Your best bet is to write your code in a clean and organized matter, and worry about maintainability and extensibility. As I like to say: Code now, optimize later.
Don't micromanage down to the individual optimization. Compiler writers are very smart people - just turn them all on unless you see a specific need not to. Your time is better spent by optimizing your code (improve algorithmic complexity of your functions, etc) rather than fiddling with compiler options.
My other advice, use a different compiler. Intel has a great reputation as an optimizing compiler. VC and GCC of course are also great choices.
You could look at the generated code with different compiled options to see which is fastest, but I understand nowadays many people don't have experience doing this.
Therefore, it would be useful to profile the application. If there is an obvious portion requiring speed, add some code to execute it a thousand or ten million times and time it using utime() if it's available. The loop should run long enough that other processes running intermittently don't affect the result—ten to twenty seconds is a popular benchmark range. Or run multiple timing trials. Compile different test cases and run it to see what works best.
Spending an hour or two playing with optimization options will quickly reveal that most have minor effect. However, that same time spent thinking about the essence of the algorithm and making small changes (code removal is especially effective) can often vastly improve execution time.