I'm evaluating the Visual C++ 10 optimizing compiler on trivial code samples to see how good the emitted machine code is, and I've run out of creative use cases.
Is there some sample codebase that is typically used to evaluate how good an optimizing C++ compiler is?
The only valid benchmark is one that simulates the type of code you're developing. Optimizers react differently to different applications and different coding styles, and the only one that really counts is the code that you are going to be compiling with the compiler.
Try benchmarking such libraries as Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page).
Quite a few benchmarks use SciMark: http://math.nist.gov/scimark2/download_c.html. However, you should be selective in what you test (test in isolation). One compiler might do badly on a benchmark purely because of poor loop unrolling even though the rest of its code is excellent, while another does better only because of the loop unrolling (i.e. the rest of its generated code is sub-par).
As has already been said, you really need to measure optimisation within the context of typical use cases for your own applications, in typical target environments. I include timers in my own automated regression suite for this reason, and have found some quite unusual results, as documented in a previous question. FWIW, I'm finding VS2010 SP1 is creating code about 8% faster on average than VS2008 on my own application, and about 13% faster with whole program optimization. This is not spread evenly across use cases. I also tend to see significant variations between long test runs, which are not visible when profiling much smaller test cases. I haven't carried out platform comparisons yet, e.g. whether many of the gains are platform or hardware specific.
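A minimal sketch of what timers in a regression suite could look like, assuming C++11 and `std::chrono`. The `TimingLog` and `ScopedTimer` names are made up for illustration and are not from any particular test framework:

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: accumulates wall-clock seconds per test-case name.
class TimingLog {
public:
    void record(const std::string& name, double seconds) {
        assert(seconds >= 0.0);  // steady_clock never runs backwards
        totals_[name] += seconds;
    }
    // Returns 0.0 for a name that was never recorded.
    double total(const std::string& name) const {
        auto it = totals_.find(name);
        return it == totals_.end() ? 0.0 : it->second;
    }
private:
    std::map<std::string, double> totals_;
};

// Hypothetical RAII timer: measures from construction to destruction
// and reports the elapsed time to the log.
class ScopedTimer {
public:
    ScopedTimer(TimingLog& log, std::string name)
        : log_(log), name_(std::move(name)),
          start_(std::chrono::steady_clock::now()) {}
    ScopedTimer(const ScopedTimer&) = delete;  // a copy would record twice
    ~ScopedTimer() {
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        log_.record(name_, elapsed.count());
    }
private:
    TimingLog& log_;
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping each regression case in a `ScopedTimer` and keeping the totals per compiler build makes it possible to compare builds case by case rather than only on the average.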
I would imagine that many optimisers will be fine tuned to give best results against well known benchmark suites, which could imply in turn that these are not the best pieces of code against which to test the benefits of optimisation. (Speculation of course)
Due to performance issues, I'd like to attempt to convert a Freepascal function (SHA1Update, from the SHA1 unit) to assembly. I use Freepascal 2.6.4 and Lazarus 1.2.4.
The reason is that I have a loop structure (repeat...until) that reads 64 KB blocks of raw data from disk into a buffer, and then each block is hashed. Without the hashing, I can read the disk at 4 GB per minute. With the hashing, it slows to just over 1 GB per minute. So someone suggested converting the hashing routine to assembly.
I am a below-average programmer when using high-level languages, let alone assembly, but the potential for performance improvement is driving me to at least enquire.
So my question is : is there a program or script that can take a procedure or function and magically convert it to assembly that I can then compile using the Freepascal compiler? I know it can be done for C\C++ using even web based system like this one
Assembly is indeed what you would use to optimise selected sequences of code. But native code compilers already generate machine code, usually via an intermediate assembly source that is then run through an assembler. So the advantage of having a compiler "magically convert" the section you want to optimise to assembly and then linking it with the rest of the program, compared to simply compiling the whole program, is about zero: you're using the same compiler for the conversion, after all. From that angle, a compiler is nothing other than a program that "magically converts it to assembly". For optimisation purposes, you would want to hand-code those sections, and you need to be good at it. Many compilers nowadays generate code that performs better than code crafted by a non-expert, for various reasons. One is that target CPUs differ greatly in what the best-performing code for them looks like, and the rules that determine what efficient code for a specific CPU looks like are often extremely complex. As a hand coder, you need to know those differences to write code that performs well. Many compilers have this knowledge built in and can therefore tailor code generation to a particular CPU architecture or model.
Often much better performance gains can be achieved by choosing more efficient algorithms. A better algorithm, coded in high level, usually outperforms a less adequate algorithm, hand coded in assembly. Therefore I'd look into possibilities to make the hashing process as such faster, by looking at alternative and faster algorithms, rather than trying to improve speed using assembly at this stage - consider assembly optimisation as a last, final step optimisation, when other means to speed up your code have been exhausted.
As @Bushmills already explained, your code is converted to assembly automatically by the FreePascal compiler before it produces the machine code in the Portable Executable (*.exe) format.
What you need is not assembly language as such, but hand-optimized code written in assembly language. This is a task for an experienced assembly programmer. You can 1) become an assembly language expert yourself; this Stack Overflow question can give you some starting points: A good NASM/FASM tutorial?
My guess is that any programmer can become an assembly language expert (on either CISC or RISC architectures) in about a year, depending on previous experience, the courses you take, and your eagerness. For theoretical (processor-neutral) background I'd recommend Donald Knuth's MMIX lectures.
You should be able to 2) see the intermediate assembly files produced by the FreePascal compiler by following the instructions in this discussion: http://free-pascal-general.1045716.n5.nabble.com/Assembler-file-generate-by-compiler-td5710837.html
If you really want to move on in a reasonable time frame, then I'd suggest you create a Minimal, Complete, and Verifiable example and 3) ask for a code review at one of the code review sites, where more experienced programmers will take a look at your code and propose some changes. These sites should be good candidates:
https://codereview.stackexchange.com/
https://www.codementor.io/
Those sites are designed especially for helping beginner and intermediate programmers with problems like yours.
I would like to select the compiler optimizations to generate the fastest possible application.
Which of the following settings should I set to true?
Dead store elimination
Eliminate duplicate expressions within basic blocks and functions
Enable loop induction variable and strength reduction
Enable Pentium instruction scheduling
Expand common intrinsic functions
Optimize jumps
Use register variables
There is also the option 'Generate the fastest possible code.', which I have obviously set to true. However, when I set this to true, all the above options are still set at false.
So I would like to know if any of the above options will speed up the application if I set them to true?
I know some will hate me for this, but nobody here can answer you truthfully. You have to try your program with and without them, and profile each build and see what the results are. Guess-work won't get anybody anywhere.
Compilers already do tons(!) of great optimization, with or without your permission. Your best bet is to write your code in a clean and organized manner, and worry about maintainability and extensibility. As I like to say: Code now, optimize later.
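To give a feel for what one of the options in the question refers to, here is a rough hand-written sketch of strength reduction on a loop induction variable. The function names are invented for illustration, and a real compiler applies this to its internal representation rather than to source code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// As a programmer would naturally write it: one multiplication per iteration.
std::vector<int> scaled_naive(std::size_t n, int stride) {
    std::vector<int> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<int>(i) * stride;
    return out;
}

// Roughly what strength reduction turns the loop into: the multiplication
// by the loop index is replaced with a running addition.
// (Assumes n * stride stays within the range of int.)
std::vector<int> scaled_reduced(std::size_t n, int stride) {
    std::vector<int> out(n);
    int value = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = value;
        value += stride;
    }
    return out;
}

// The transformation is only valid because both forms give the same result.
void check_equivalence() {
    assert(scaled_naive(8, 5) == scaled_reduced(8, 5));
}
```

Optimizers typically do this sort of rewrite on their own at the usual "optimize for speed" levels, which is one reason hand-applying it rarely pays off.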
Don't micromanage down to the individual optimization. Compiler writers are very smart people - just turn them all on unless you see a specific need not to. Your time is better spent by optimizing your code (improve algorithmic complexity of your functions, etc) rather than fiddling with compiler options.
My other advice, use a different compiler. Intel has a great reputation as an optimizing compiler. VC and GCC of course are also great choices.
You could look at the generated code with different compiled options to see which is fastest, but I understand nowadays many people don't have experience doing this.
Therefore, it would be useful to profile the application. If there is an obvious portion requiring speed, add some code to execute it a thousand or ten million times and time it with whatever timer is available (clock() or times(), for example). The loop should run long enough that other processes running intermittently don't affect the result; ten to twenty seconds is a popular benchmark range. Or run multiple timing trials. Compile different test cases and run them to see what works best.
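The repeat-and-time approach could look something like the sketch below. It uses `std::chrono::steady_clock` rather than any particular OS timer, and `best_trial_seconds`, `sum_of_squares` and `sink` are hypothetical names standing in for your own harness and hot spot:

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Written to by the benchmarked code so the optimizer cannot discard the work.
volatile double sink = 0.0;

// Runs `work` `reps` times per trial for `trials` trials and returns the
// fastest trial in seconds. Taking the minimum reduces noise from other
// processes that happen to run during a trial.
template <typename F>
double best_trial_seconds(F work, int reps, int trials) {
    assert(reps > 0 && trials > 0);
    double best = -1.0;
    for (int t = 0; t < trials; ++t) {
        auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r) work();
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        if (best < 0.0 || elapsed.count() < best) best = elapsed.count();
    }
    return best;
}

// Hypothetical hot spot standing in for the portion that needs speed.
double sum_of_squares(const std::vector<double>& v) {
    double total = 0.0;
    for (double x : v) total += x * x;
    return total;
}
```

A call such as `best_trial_seconds([&] { sink = sum_of_squares(data); }, 1000000, 5)` can then be repeated for each set of compiler options being compared.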
Spending an hour or two playing with optimization options will quickly reveal that most have minor effect. However, that same time spent thinking about the essence of the algorithm and making small changes (code removal is especially effective) can often vastly improve execution time.
Has anyone seen any real-world numbers for different programs that use the feedback optimization C/C++ compilers offer to support branch prediction, cache preloading, etc.?
I searched for it and, amazingly, not even the popular interpreter development groups seem to have checked the effect. And increasing Ruby, Python, PHP, etc. performance by 10% or so should be considered useful.
Is there really no benefit, or is the whole developer community just too lazy to use it?
10% is a good ballpark figure. That said, ...
You have to REALLY care about the performance to go this route. The product I work on (DB2) uses PGO and other invasive and aggressive optimizations. Among the costs are significant build time (triple on some platforms) and development and support nightmares.
When something goes wrong it can be non-trivial to map the fault location in the optimized code back to the source. Developers don't usually expect that functions in different modules can end up merged and inlined and this can have "interesting" effects.
Problems with pointer aliasing, which are nasty to track down, also usually show up with these sorts of optimizations. You have the additional fun of having non-deterministic builds (an aliasing problem can show up in Monday's build, vanish again till Thursday's, ...).
The line between what is correct or incorrect compiler behaviour under these sorts of aggressive optimizations also becomes fairly blurred. Even with the luxury of having our compiler guys in house (literally) the optimization issues (either in our source or the compiler) are still not easy to understand and resolve.
From unladen-swallow (a project optimizing the CPython VM):
For us, the final nail in PyBench's coffin was when experimenting with gcc's feedback-directed optimization tools, we were able to produce a universal 15% performance increase across our macrobenchmarks; using the same training workload, PyBench got 10% slower.
So some people are at least looking at it. That said, PGO sets some pretty tricky requirements on the build environment that are hard to satisfy for open-source projects meant to be built by a distributed heterogeneous group of people. Heavy optimization also creates difficult to debug heisenbugs. It's less work to give the compiler explicit hints for the performance critical parts.
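As a sketch of what an explicit hint can look like, GCC and Clang provide `__builtin_expect`. The `UNLIKELY` macro, `SumResult` and `sum_valid` below are illustrative names, and whether the hint actually helps has to be measured:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// __builtin_expect is a GCC/Clang builtin; on other compilers the macro
// falls back to the plain condition.
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

struct SumResult {
    long total;
    std::size_t errors;
};

// Sums the non-negative values. Negative values are assumed to be rare
// error markers, so that branch is hinted as unlikely.
SumResult sum_valid(const std::vector<int>& values) {
    SumResult result{0, 0};
    for (int v : values) {
        if (UNLIKELY(v < 0)) {
            ++result.errors;
            continue;
        }
        result.total += v;
    }
    return result;
}

// Quick check: the hint must never change the result, only the code layout.
void self_check() {
    SumResult r = sum_valid({1, 2, -1, 3});
    assert(r.total == 6 && r.errors == 1);
}
```

Unlike PGO, a hint like this is a guess baked into the source, so it can go stale if the data changes.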
That said, I expect significant performance increases from runtime profile guided optimization. JIT'ing allows the optimizer to cope with the profile of data changing across the execution of a program and do many extremely runtime data specific optimizations that would explode the code size for static compilation. Especially dynamic languages need good runtime data based optimization to perform well. With dynamic language performance getting significant attention lately (JavaScript VM's, MS DLR, JSR-292, PyPy and so on) there's a lot of work being done in this area.
Traditionally, profiling to improve the efficiency of compiled code is done with performance analysis tools. How useful the data from those tools is for optimization, however, depends on the compiler you use. GCC, for example, is a framework for building compilers for different domains, and providing a profiling mechanism in such a framework is extremely difficult.
We can rely on statistical data to do certain optimizations. For instance, GCC unrolls a loop if the loop count is less than a constant (say 7). How that constant is chosen is based on statistical results for the code size generated on different target architectures.
Profile-guided optimizations track specific areas of the source. Details of previous runs need to be stored, which is an overhead. The input, on the other hand, has to be statistically representative of the target applications that may use the compiler, so the complexity rises with the number of different inputs and outputs. In short, profile-guided optimization needs extensive data collection. Automating such profiling or embedding it into the source needs careful monitoring; otherwise the results will be skewed, and in our effort to swim we will actually drown.
However, experimentation on this regard is ongoing. Just have a look at POGO.
Do you normally set your compiler to optimize for maximum speed or smallest code size? or do you manually configure individual optimization settings? Why?
I notice most of the time people tend to just leave compiler optimization settings in their default state, which with Visual C++ means max speed.
I've always felt that the default settings had more to do with looking good on benchmarks, which tend to be small programs that fit entirely within the L2 cache, than with what's best for overall performance, so I normally set it to optimize for smallest size.
As a Gentoo user I have tried quite a few optimizations on the complete OS and there have been endless discussions on the Gentoo forums about it. Some good flags for GCC can be found in the wiki.
In short, optimizing for size worked best on an old Pentium 3 laptop with limited RAM, but on my main desktop machine with a Core 2 Duo, -O2 gave better results overall.
There's also a small script if you are interested in the x86 (32 bit) specific flags that are the most optimized.
If you use gcc and really want to optimize a specific application, try ACOVEA. It runs a set of benchmarks, then recompiles them with many different combinations of compile flags, searching the option space with an evolutionary algorithm. There's an example using Huffman encoding on the site (lower is better):
A relative graph of fitnesses:
Acovea Best-of-the-Best: ************************************** (2.55366)
Acovea Common Options: ******************************************* (2.86788)
-O1: ********************************************** (3.0752)
-O2: *********************************************** (3.12343)
-O3: *********************************************** (3.1277)
-O3 -ffast-math: ************************************************** (3.31539)
-Os: ************************************************* (3.30573)
(Note that -Os came out among the slowest on this Opteron system; only -O3 -ffast-math scored slightly worse.)
I prefer to use minimal size. Memory may be cheap, cache is not.
Besides the fact that cache locality matters (as On Freund said), one other thing Microsoft does is profile their applications to find out which code paths are executed during the first few seconds of startup. They then feed this data back to the compiler and ask it to put the parts that are executed during startup close together. This results in faster startup time.
I do believe that this technique is available publicly in VS, but I'm not 100% sure.
For me it depends on what platform I'm using. On some embedded platforms, or when I worked on the Cell processor, you have constraints such as a very small cache or minimal space provided for code.
I use GCC and tend to leave it on "-O2" which is the "safest" level of optimisation and favours speed over a minimal size.
I'd say it probably doesn't make a huge difference unless you are developing for a very high-performance application in which case you should probably be benchmarking the various options for your particular use-case.
Microsoft ships all its C/C++ software optimized for size. After benchmarking they discovered that it actually gives better speed (due to cache locality).
There are many types of optimization, maximum speed versus small code is just one. In this case, I'd choose maximum speed, as the executable will be just a bit bigger.
On the other hand, you could optimize your application for a specific type of processor. In some cases this is a good idea (if you intend to run the program only on your own workstation), but the program will then probably not work on other processors (e.g. if you compile your program for a Pentium 4, it will probably not work on a Pentium 3).
Build both, profile, choose which works better on specific project and hardware.
For performance critical code, that is - otherwise choose any and don't bother.
We always optimize for maximum speed, but then, all the code I write in C++ is somehow related to bioinformatics algorithms, where speed is crucial and the code size is relatively small.
Memory is cheap nowadays :) So it can make sense to set the compiler to max speed unless you work with embedded systems. Of course, the answer depends on the concrete situation.
This depends on the application of your program. When programming an application to control a fast industrial process, optimize for speed would make sense. When programming an application that only needs to react to a user's input, optimization for size could make sense. That is, if you are concerned about the size of your executable.
Tweaking compiler settings like that is an optimization. On the principle that "premature optimization is the root of all evil," I don't bother with it until the program is near its final shipping state and I've discovered that it's not fast enough -- i.e. almost never.
Does anyone know this compiler feature? It seems GCC supports it. How does it work? What is the potential gain? In which cases is it useful? Inner loops?
(this question is specific, not about optimization in general, thanks)
It works by placing extra code to count the number of times each codepath is taken. When you compile a second time the compiler uses the knowledge gained about execution of your program that it could only guess at before. There are a couple things PGO can work toward:
Deciding which functions should be inlined or not depending on how often they are called.
Deciding how to place hints about which branch of an "if" statement should be predicted on based on the percentage of calls going one way or the other.
Deciding how to optimize loops based on how many iterations get taken each time that loop is called.
You never really know how much these things can help until you test it.
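With GCC, the two-pass workflow described above can be sketched roughly as follows. The `-fprofile-generate` and `-fprofile-use` flags are GCC's. The file name `app.cpp` and the functions are made up for illustration, and the gain (if any) depends entirely on the program and the training run:

```cpp
// Assumed build steps, with this code as part of a program built from app.cpp:
//   g++ -O2 -fprofile-generate app.cpp -o app   (instrumented build)
//   ./app                                       (training run, writes .gcda profile data)
//   g++ -O2 -fprofile-use app.cpp -o app        (rebuild using the recorded counts)
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for a rarely needed, more expensive code path.
std::uint64_t rare_path(std::uint64_t x) {
    return x / 1000;
}

// With a training workload where multiples of 1000 are rare, the profile
// tells the compiler which side of the branch is hot, which it could
// otherwise only guess.
std::uint64_t process(const std::vector<std::uint64_t>& input) {
    std::uint64_t acc = 0;
    for (std::uint64_t x : input) {
        if (x % 1000 == 0)
            acc += rare_path(x);  // cold with the assumed workload
        else
            acc += x;             // hot with the assumed workload
    }
    return acc;
}

// Quick check: 1 + 2 + rare_path(1000) = 1 + 2 + 1.
void self_check() {
    assert(process({1, 2, 1000}) == 4);
}
```

The training run should resemble real usage; a profile gathered from an unrepresentative workload can steer the optimizer the wrong way.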
PGO gives about a 5% speed boost when compiling x264, the project I work on, and we have a built-in system for it (make fprofiled). It's a nice free speed boost in some cases, and probably helps more in applications that, unlike x264, are less made up of handwritten assembly.
Jason's advice is right on. The best speedups you are going to get come from "discovering" that you let an O(n^2) algorithm slip into an inner loop somewhere, or that you can cache certain computations outside of expensive functions.
Compared to the micro-optimizations that PGO can trigger, these are the big winners. Once you've done that level of optimization, PGO might be able to help. We never had much luck with it though: the cost of the instrumentation was such that our application became unusably slow (by several orders of magnitude).
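As a small illustration of the kind of algorithmic win meant here (the functions are invented for the example), compare a quadratic duplicate check with one that remembers what it has already seen:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// O(n^2): compares every pair of elements.
bool has_duplicate_quadratic(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[i] == v[j]) return true;
    return false;
}

// O(n) expected time: each value is looked up in a hash set of values
// seen so far, at the cost of extra memory.
bool has_duplicate_linear(const std::vector<int>& v) {
    std::unordered_set<int> seen;
    for (int x : v)
        if (!seen.insert(x).second) return true;  // insert failed: already seen
    return false;
}

// Both versions must agree on the answer.
void self_check() {
    assert(has_duplicate_quadratic({1, 2, 3, 2}) == has_duplicate_linear({1, 2, 3, 2}));
    assert(has_duplicate_quadratic({1, 2, 3}) == has_duplicate_linear({1, 2, 3}));
}
```

No compiler flag will turn the first version into the second, which is why this kind of change tends to dwarf what PGO can offer.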
I like using Intel VTune as a profiler primarily because it is non-invasive compared to instrumenting profilers which change behaviour too much.
The fun thing about optimization is that speed gains are found in the unlikeliest of places.
It's also the reason you need a profiler, rather than guessing where the speed problems are.
I recommend starting with a profiler (gprof if you're using GCC) and just poking around the results of running your application through some normal operations.