C/C++ compiler feedback optimization

Has anyone seen real-world numbers for different programs using the feedback-directed optimization that C/C++ compilers offer to support branch prediction, cache preloading, and so on?
I searched for it and, surprisingly, not even the popular interpreter development groups seem to have checked the effect. And increasing Ruby, Python, PHP, etc. performance by 10% or so should be considered useful.
Is there really no benefit, or is the whole developer community just too lazy to use it?

10% is a good ballpark figure. That said, ...
You have to REALLY care about performance to go this route. The product I work on (DB2) uses PGO and other invasive and aggressive optimizations. Among the costs are significantly longer build times (triple on some platforms) and development and support nightmares.
When something goes wrong, it can be non-trivial to map the fault location in the optimized code back to the source. Developers don't usually expect that functions in different modules can end up merged and inlined, and this can have "interesting" effects.
Problems with pointer aliasing, which are nasty to track down, also usually show up with these sorts of optimizations. You have the additional fun of non-deterministic builds (an aliasing problem can show up in Monday's build, vanish again until Thursday's, ...).
The line between correct and incorrect compiler behaviour under these sorts of aggressive optimizations also becomes fairly blurred. Even with the luxury of having our compiler guys in-house (literally), the optimization issues (whether in our source or the compiler) are still not easy to understand and resolve.
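For concreteness, a minimal sketch of what such a PGO pipeline looks like, using GCC's -fprofile-generate / -fprofile-use flags (MSVC and ICC have equivalents); the file names and the training input here are placeholders:

    // hot.cpp -- placeholder program to be trained and re-optimized.
    //
    // Typical GCC PGO workflow:
    //   g++ -O2 -fprofile-generate hot.cpp -o hot   # instrumented build
    //   ./hot training-input.txt                    # run a representative workload
    //   g++ -O2 -fprofile-use hot.cpp -o hot        # rebuild using the recorded profile
    //
    // The recorded branch frequencies and call counts let the compiler
    // lay out hot paths, inline more accurately, and so on.
    #include <cstdio>

    int main(int argc, char **argv)
    {
        // Placeholder workload; a real training run should exercise the
        // same paths the production workload will.
        std::printf("ran training workload on %s\n", argc > 1 ? argv[1] : "(none)");
        return 0;
    }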

From unladen-swallow (a project optimizing the CPython VM):
For us, the final nail in PyBench's coffin was when, experimenting with gcc's feedback-directed optimization tools, we were able to produce a universal 15% performance increase across our macrobenchmarks; using the same training workload, PyBench got 10% slower.
So some people are at least looking at it. That said, PGO sets some pretty tricky requirements on the build environment that are hard to satisfy for open-source projects meant to be built by a distributed, heterogeneous group of people. Heavy optimization also creates difficult-to-debug heisenbugs. It's less work to give the compiler explicit hints for the performance-critical parts (see the sketch below).
That said, I expect significant performance increases from runtime profile-guided optimization. JITing allows the optimizer to cope with the profile of data changing across the execution of a program, and to apply many optimizations that are extremely specific to the runtime data and would explode the code size under static compilation. Dynamic languages especially need good runtime-data-based optimization to perform well. With dynamic-language performance getting significant attention lately (JavaScript VMs, MS DLR, JSR-292, PyPy, and so on), there's a lot of work being done in this area.
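As a sketch of the "explicit hints" route mentioned above, GCC and Clang expose __builtin_expect and __builtin_prefetch; the likely/unlikely macros below are a common convention, not standard names:

    #include <cstddef>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long sum_nonnegative(const long *a, std::size_t n)
    {
        long total = 0;
        for (std::size_t i = 0; i < n; i++) {
            if (unlikely(a[i] < 0))              // hint: the error path is cold
                return -1;
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16]);  // hint: preload a future cache line
            total += a[i];
        }
        return total;
    }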

Traditionally, improving compiler efficiency via profiling is done with performance analysis tools. However, how the data from the tools can be used in optimization still depends on the compiler you use. For example, GCC is a framework being worked on to produce compilers for different domains; providing a profiling mechanism in such a compiler framework is extremely difficult.
We can rely on statistical data to do certain optimizations. For instance, GCC unrolls a loop if the loop count is less than a constant (say 7). How it fixes the constant is based on statistical results about the code size generated for different target architectures.
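A hedged illustration of that unrolling heuristic; the exact threshold is a tuned compiler constant, not something the source controls:

    // With optimization enabled (e.g. g++ -O2 or -O3), a short loop with a
    // compile-time trip count like this one is typically unrolled completely,
    // trading a little code size for the removal of all branch overhead.
    void scale4(float *v, float k)
    {
        for (int i = 0; i < 4; i++)   // trip count (4) is below the unroll threshold
            v[i] *= k;
    }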
Profile-guided optimizations track special areas of the source. Details of previous run results need to be stored, which is an overhead. The input, on the other hand, requires a statistical representation of the target application's workload, so the complexity rises with the number of different inputs and outputs. In short, deciding on profile-guided optimization requires extensive data collection. Automating or embedding such profiling into the source needs careful monitoring; if not, the entire result will be awry, and in our effort to swim we will actually drown.
However, experimentation in this regard is ongoing. Just have a look at POGO.

Related

Why C++ compilers have many optimization levels

I was just wondering why C++ compilers have many optimization levels like O1, O2, etc. Why can't everything be part of just one optimization level, O?
I tried searching online a lot but didn't get a convincing answer.
Off the top of my head: optimizing takes time (more optimization means slower compilation), debugging optimized code can be more difficult, more aggressive optimization can reveal bugs, you can optimize for different things (program size, speed, etc.)…
While compilers are smart enough to apply common optimizations (O1), they may also try not-so-frequent strategies (O2) and even corner cases (O3).
Whether the resulting code is more optimized depends a lot on the original code, and sometimes on the CPU hardware. The only way to tell which "O" is best is to try each and measure running times (see the sketch below).
Remember that the one who really knows about the code is you. Write it with part of your brain thinking about how fast it will run.
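A hedged sketch of that try-and-measure approach; the hot loop is a placeholder workload, and the build commands in the comment assume g++:

    // bench.cpp -- build the same source at each level and time it, e.g.:
    //   g++ -O1 bench.cpp -o bench_O1
    //   g++ -O2 bench.cpp -o bench_O2
    //   g++ -O3 bench.cpp -o bench_O3
    // then run each binary on identical input and compare.
    #include <chrono>
    #include <cstdio>

    int main()
    {
        auto t0 = std::chrono::steady_clock::now();
        volatile double acc = 0;   // volatile keeps the loop from being elided
        for (long i = 1; i <= 50000000L; i++)
            acc = acc + 1.0 / static_cast<double>(i);   // placeholder hot loop
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%.3f s\n", std::chrono::duration<double>(t1 - t0).count());
        return 0;
    }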
For precisely two reasons:
1. There is no single sequence of compiler optimizations that can simultaneously optimize all program characteristics of interest, such as execution time, compilation time, code size, energy consumption, binary portability, etc. In compiler-optimization research, this is known as the phase-ordering problem.
2. Most developers do not want to bother with figuring out which compiler optimizations to use and in what order; they just want to use whatever is generally recommended in a small number of common scenarios.
That's why compiler developers have decided to offer a small collection of optimization levels that developers can easily choose from, while still offering hundreds of fine-grained optimization options for advanced scenarios.
The term "optimization levels" is really a misnomer, since they are not exactly "levels" with respect to each other. A better term would be something like "optimization groups".
Designing optimization levels is a complicated matter for compilers that target a broad range of programs and architectures, such as GCC, Clang, icc, and VC++. Many research papers published in the past decade show that the optimization levels offered by compilers are far from the best for a particular program, target architecture, or specific combination thereof. This has motivated a line of research on compiler auto-tuning, which can be considered an approach that falls somewhere between offering a few optimization levels and offering fine-grained control over compiler optimizations.
In summary, optimization levels provide an important convenience for developers, one that will be needed for many decades to come.

Is there a good test for C++ optimizing compilers?

I'm evaluating the Visual C++ 10 optimizing compiler on trivial code samples to see how good the emitted machine code is, and I'm out of creative use cases so far.
Is there some sample codebase that is typically used to evaluate how good an optimizing C++ compiler is?
The only valid benchmark is one that simulates the type of code you're developing. Optimizers react differently to different applications and different coding styles, and the only one that really counts is the code that you are going to be compiling with the compiler.
Try benchmarking such libraries as Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page).
Quite a few benchmarks use SciMark: http://math.nist.gov/scimark2/download_c.html. However, you should be selective in what you test (test in isolation): one benchmark might fail because of poor loop unrolling even though the rest of its generated code was excellent, while another does better only because of loop unrolling while the rest of its generated code was sub-par.
As has already been said, you really need to measure optimisation within the context of typical use cases for your own applications, in typical target environments. I include timers in my own automated regression suite for this reason, and have found some quite unusual results, as documented in a previous question. FWIW, I'm finding VS2010 SP1 creates code about 8% faster on average than VS2008 on my own application, and about 13% faster with whole-program optimization. This is not spread evenly across use cases. I also tend to see significant variations between long test runs which are not visible when profiling much smaller test cases. I haven't carried out platform comparisons yet, e.g. whether many gains are platform- or hardware-specific.
I would imagine that many optimisers will be fine tuned to give best results against well known benchmark suites, which could imply in turn that these are not the best pieces of code against which to test the benefits of optimisation. (Speculation of course)

Mixing assembler code with c/c++

Why is assembly language code often needed along with C/C++?
What can't be done in C/C++, which is possible when assembly language code is mixed?
I have some source code of some 3D computer games. There is a lot of assembler code in use.
Things that pop to mind, in no particular order:
Special instructions. In an embedded application, I need to invalidate the cache after a DMA transfer has filled the memory buffer. The only way to do that on an SH-4 CPU is to execute a special instruction, so inline assembly (or a free-standing assembly function) is the only way to go (see the sketch after this list).
Optimizations. Once upon a time, it was common for compilers to not know every trick that was possible to do. In some of those cases, it was worth the effort to replace an inner loop with a hand-crafted version. On the kinds of CPUs you find in small embedded systems (think 8051, PIC, and so forth) it can be valuable to push inner loops into assembly. I will emphasize that for modern processors with pipelines, multi-issue execution, extensive caching and more, it is often exceptionally difficult for hand coding to even approach the capabilities of the optimizer.
Interrupt handling. In an embedded application it is often necessary to catch system events such as interrupts and exceptions. It is often the case that the first few instructions executed by an interrupt handler have special responsibilities, and the only way to guarantee that the right things happen is to write the outer layer of the handler in assembly. For example, on a ColdFire (or any descendant of the 68000), only the very first instruction is guaranteed to execute. To prevent nested interrupts, that instruction must modify the interrupt priority level to mask out the priority of the current interrupt.
Certain portions of an OS kernel. For example, task switching requires that the execution state (at least most registers including PC and stack pointer) be saved for the current task and the state loaded for the new task. Fiddling with execution state of the CPU is well outside of the feature set of the language, but can be wrapped in a small amount of assembly code in a way that allows the rest of the kernel to be written in C or C++.
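For the SH-4 case in the first item, a hedged sketch of what the inline assembly looks like with GCC (ocbi is the SH-4 "operand cache block invalidate" instruction; the 32-byte step matches the SH-4 cache-line size):

    #include <cstddef>

    // Invalidate the data-cache lines covering a DMA buffer so the CPU
    // re-reads fresh data from memory. SH-4 specific; build with an
    // sh4 cross-toolchain.
    void cache_invalidate(void *buf, std::size_t len)
    {
        char *p   = static_cast<char *>(buf);
        char *end = p + len;
        for (; p < end; p += 32)   // SH-4 cache lines are 32 bytes
            __asm__ volatile ("ocbi @%0" :: "r"(p) : "memory");
    }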
Edit: I've touched up the wording about optimization. Let me emphasize that for targets with large user populations and well supported compilers with decent optimization, it is highly unlikely that an assembly coder can beat the performance of the optimizer.
Before attempting, start by careful profiling to determine where the bottlenecks really lie. With that information in hand, examine assumptions and algorithms carefully, because the best optimization of all is usually to find a better way to handle the larger picture. Then, if all else fails, isolate the bottleneck in a test case, benchmark it carefully, and begin tweaking in assembly.
Why is assembly language code often needed along with C/C++?
Competitive advantage. Like, if you are writing software for the (soon-to-be) #1 gaming company in the world.
What can't be done in C/C++, which is possible when assembly language code is mixed?
Nothing, unless some absolute performance level is needed, say, X frames per second or Y billions of polygons per second.
Edit: based on other replies, it seems the consensus is that embedded systems (iPhone, Android, etc.) have hardware accelerators that certainly require the use of assembly.
I have some source code of some 3D computer games. There is a lot of assembler code in use.
They were either written in the 80s-90s, or assembly is used sparingly (maybe 1%-5% of total source code) inside a game engine.
Edit: to this date, compiler auto-vectorization quality is still poor. So you may see programs that contain vectorization intrinsics, and since that isn't really much different from writing actual assembly (most intrinsics have a one-to-one mapping to assembly instructions), some folks might just decide to write in assembly.
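For example, a hedged sketch of intrinsics-based code: each call maps roughly one-to-one to an SSE instruction (_mm_add_ps becomes addps), which is why it feels close to assembly. It assumes x86, a length that is a multiple of 4, and 16-byte-aligned input:

    #include <xmmintrin.h>   // SSE intrinsics
    #include <cstddef>

    float sum4(const float *data, std::size_t n)
    {
        __m128 acc = _mm_setzero_ps();                     // xorps: zero the accumulator
        for (std::size_t i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_load_ps(data + i));  // addps on 4 floats at once
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }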
Update:
According to anecdotal evidence, RollerCoaster Tycoon is written 99% in assembly.
http://www.chrissawyergames.com/faq3.htm
In the past, compilers used to be pretty poor at optimizing for a particular architecture, and architectures used to be simpler. Now the reverse is true. These days, it's pretty hard for a human to write better assembly than an optimizing compiler, for deeply-pipelined, branch-predicting processors. And so you won't see it much. What there is will be short, and highly targeted.
In short, you probably won't need to do this. If you think you do, profile your code to make sure you've identified a hotspot - don't optimize something just because it's slow, if you're only spending 0.1% of your execution time there. See if you can improve your design or algorithm. If you don't find any improvement there, or if you need functionality not exposed by your higher-level language, look into hand-coding assembly.
There are certain things that can only be done in assembler and cannot be done in C/C++.
These include:
generating software interrupts (SWI or INT instructions)
using instructions like SWP to implement mutexes
specialist coprocessor instructions (such as those needed to program the MMU and manage RAM caches)
access to the carry and overflow flags (see the sketch below)
You may also be able to optimize code better in assembler than in C/C++ (e.g. memcpy on Android is written in assembler).
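On the carry/overflow point, newer GCC and Clang versions expose this without assembly via __builtin_add_overflow (a hedged sketch; older compilers really did require asm here):

    #include <cstdio>

    int main()
    {
        unsigned a = 4000000000u, b = 1000000000u, sum;
        // Compiles to an add followed by a check of the carry flag.
        if (__builtin_add_overflow(a, b, &sum))
            std::puts("overflow detected via the carry flag");
        else
            std::printf("sum = %u\n", sum);
        return 0;
    }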
There may be new instructions that your compiler cannot yet generate, or the compiler does a bad job, or you may need to control the CPU directly.
Why is assembly language code often needed along with C/C++?
It isn't
What can't be done in C/C++, which is possible when assembly language code is mixed?
Accessing system registers or IO ports on the CPU (see the sketch below).
Accessing BIOS functions.
Using specialized instructions that don't map directly to the programming language, e.g. SIMD instructions.
Providing optimized code that's better than what the compiler produces.
You usually don't need the first two unless you're writing an operating system, or code running without an operating system.
Modern CPUs are quite complex, and you'll be hard-pressed to find people who can actually write assembly better than what the compiler produces. Many compilers come with libraries giving you access to the more advanced features, like SIMD instructions, so nowadays you often don't need to fall back to assembly for that.
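As a hedged sketch of the system-register point above: the x86 time-stamp counter has no C/C++ expression, so reading it takes inline assembly (or a compiler intrinsic such as __rdtsc where available):

    #include <cstdint>

    // Read the x86 time-stamp counter. x86/x86-64 only; GCC/Clang syntax.
    static inline std::uint64_t rdtsc()
    {
        std::uint32_t lo, hi;
        __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
        return (static_cast<std::uint64_t>(hi) << 32) | lo;
    }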
One more thing worth mentioning:
C and C++ do not provide any convenient way to set up stack frames when one needs to implement binary-level interop with a scripting language, or to implement some kind of support for closures.
In certain situations, hand-written assembly can be more optimal than anything a compiler can generate.

Compiler optimization for fastest possible code

I would like to select the compiler optimizations to generate the fastest possible application.
Which of the following settings should I set to true?
Dead store elimination
Eliminate duplicate expressions within basic blocks and functions
Enable loop induction variable and strength reduction (see the sketch below)
Enable Pentium instruction scheduling
Expand common intrinsic functions
Optimize jumps
Use register variables
There is also the option 'Generate the fastest possible code', which I have obviously set to true. However, when I set it to true, all the above options remain set to false.
So I would like to know if any of the above options will speed up the application if I set them to true?
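To illustrate two of the options above (a hedged sketch, not specific to any compiler): strength reduction replaces an expensive operation with a cheaper one, and the induction variable is what makes that possible for the loop index:

    // Source as written: a multiply on every iteration.
    //   for (int i = 0; i < n; i++) out[i] = i * 8;
    //
    // What the optimizer effectively produces:
    void fill(int *out, int n)
    {
        int v = 0;                    // induction variable replaces i * 8
        for (int i = 0; i < n; i++) {
            out[i] = v;
            v += 8;                   // cheap add instead of multiply
        }
    }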
So I would like to know if any of the above options will speed up the application if I set them to true?
I know some will hate me for this, but nobody here can answer you truthfully. You have to try your program with and without them, profile each build, and see what the results are. Guesswork won't get anybody anywhere.
Compilers already do tons(!) of great optimization, with or without your permission. Your best bet is to write your code in a clean and organized manner, and worry about maintainability and extensibility. As I like to say: code now, optimize later.
Don't micromanage down to the individual optimization. Compiler writers are very smart people - just turn them all on unless you see a specific need not to. Your time is better spent by optimizing your code (improve algorithmic complexity of your functions, etc) rather than fiddling with compiler options.
My other advice: use a different compiler. Intel's compiler has a great reputation for optimization. VC and GCC are of course also great choices.
You could look at the code generated with different compile options to see which is fastest, but I understand that nowadays many people don't have experience doing this.
Therefore, it would be useful to profile the application. If there is an obvious portion requiring speed, add some code to execute it a thousand or ten million times and time it with a high-resolution timer such as clock_gettime(). The loop should run long enough that other processes running intermittently don't affect the result; ten to twenty seconds is a popular benchmark range. Or run multiple timing trials. Compile your different test cases and run them to see what works best (see the sketch below).
Spending an hour or two playing with optimization options will quickly reveal that most have a minor effect. However, that same time spent thinking about the essence of the algorithm and making small changes (code removal is especially effective) can often vastly improve execution time.
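A hedged sketch of that timing loop, using C++'s steady_clock for portable high-resolution timing; hot_function() is a placeholder for the code under test:

    #include <chrono>
    #include <cstdio>

    volatile double sink;                        // prevents the call from being optimized away

    void hot_function() { sink = sink + 1.0; }   // placeholder: the code under test

    int main()
    {
        const long reps = 10000000L;             // enough repetitions to swamp timer noise
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < reps; i++)
            hot_function();
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%ld reps: %.2f s\n", reps,
                    std::chrono::duration<double>(t1 - t0).count());
        return 0;
    }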

How do you normally set up your compiler's optimization settings?

Do you normally set your compiler to optimize for maximum speed or smallest code size? or do you manually configure individual optimization settings? Why?
I notice that most of the time people tend to just leave the compiler optimization settings in their default state, which with Visual C++ means max speed.
I've always felt that the default settings have more to do with looking good on benchmarks, which tend to be small programs that fit entirely within the L2 cache, than with what's best for overall performance, so I normally set it to optimize for smallest size.
As a Gentoo user I have tried quite a few optimizations on the complete OS and there have been endless discussions on the Gentoo forums about it. Some good flags for GCC can be found in the wiki.
In short, optimizing for size worked best on an old Pentium3 laptop with limited ram, but on my main desktop machine with a Core2Duo, -O2 gave better results over all.
There's also a small script if you are interested in the most optimized x86 (32-bit) specific flags.
If you use gcc and really want to optimize a specific application, try ACOVEA. It runs a set of benchmarks, recompiling them with many combinations of compiler flags and using a genetic algorithm to search the space. There's an example using Huffman encoding on the site (lower is better):
A relative graph of fitnesses:
Acovea Best-of-the-Best: ************************************** (2.55366)
Acovea Common Options: ******************************************* (2.86788)
-O1: ********************************************** (3.0752)
-O2: *********************************************** (3.12343)
-O3: *********************************************** (3.1277)
-O3 -ffast-math: ************************************************** (3.31539)
-Os: ************************************************* (3.30573)
(Note that it found -Os to be among the slowest on this Opteron system.)
I prefer to use minimal size. Memory may be cheap, cache is not.
Besides the fact that cache locality matters (as On Freund said), one other thing Microsoft does is profile their application and find out which code paths are executed during the first few seconds of startup. After that, they feed this data back to the compiler and ask it to put the parts which are executed during startup close together. This results in a faster startup time.
I do believe that this technique is available publicly in VS, but I'm not 100% sure.
For me it depends on what platform I'm using. For some embedded platforms, or when I worked on the Cell processor, you have constraints such as a very small cache or minimal space for code.
I use GCC and tend to leave it on "-O2" which is the "safest" level of optimisation and favours speed over a minimal size.
I'd say it probably doesn't make a huge difference unless you are developing a very high-performance application, in which case you should probably benchmark the various options for your particular use case.
Microsoft ships all its C/C++ software optimized for size. After benchmarking they discovered that it actually gives better speed (due to cache locality).
There are many types of optimization, maximum speed versus small code is just one. In this case, I'd choose maximum speed, as the executable will be just a bit bigger.
On the other hand, you could optimize your application for a specific type of processor. In some cases this is a good idea (if you intend to run the program only on your own machine), but the resulting program may then not work on other architectures (e.g. if you compile your program to work on a Pentium 4 machine, it will probably not run on a Pentium 3).
Build both, profile, choose which works better on specific project and hardware.
For performance critical code, that is - otherwise choose any and don't bother.
We always use maximize for optimal speed but then, all the code I write in C++ is somehow related to bioinformatics algorithms and speed is crucial while the code size is relatively small.
Memory is cheap nowadays :) So it can be meaningful to set compiler settings to max speed unless you work with embedded systems. Of course, the answer depends on the concrete situation.
This depends on the application of your program. When programming an application to control a fast industrial process, optimize for speed would make sense. When programming an application that only needs to react to a user's input, optimization for size could make sense. That is, if you are concerned about the size of your executable.
Tweaking compiler settings like that is an optimization. On the principle that "premature optimization is the root of all evil," I don't bother with it until the program is near its final shipping state and I've discovered that it's not fast enough -- i.e. almost never.