For a school assignment I have to convert some C++ code to assembly (ARMv8). Then I have to compile the C++ code with GCC's -O0, -O1, -O2, -O3 and -Os optimization levels, record the run times, and compare them with the execution time of my assembly code. As far as I know, -O3 should be faster than -O1 and -O2. However, I find that -O2 is the fastest, followed by -O1, -O3, -Os and -O0. Is that usual? (Measured times are about 30 seconds.)
Notice that GCC has many other optimization flags.
There is no guarantee that -O3 gives faster code than -O2; a compiler can apply more optimization passes, but they are all heuristics and might be unsuccessful (or even slightly slow down your particular code). Hence it does happen that -O3 gives slightly slower code than -O2 on some particular input source code.
You could try a more recent version of GCC (as of November 2017 the latest is GCC 7; GCC 8 is due in a few months). You could also try a better -march= or -mtune= option.
Lastly, with your own GCC plugin, you might add your own optimization pass, or change the order (and the set) of applied optimization passes (there are several hundred different optimization passes in GCC). But you'll need a lot of work (perhaps a year or two) to be able to extend GCC.
You could also tune optimization parameters; one project (MILEPOST) has even used machine learning techniques to improve them.
See also slides and references on my (old) GCC MELT documentation.
Yes, it is usual. Take the -Ox optimization levels as guidelines. On average they deliver the optimization that is advertised, but a lot depends on the style in which the code is written, the memory layout, and the compiler itself.
Sometimes you need to try and fail many times before getting the optimal code.
-O2 does indeed give the best optimization in most cases.
Related
I'm working on a big project; most files are longer than 7000 lines. If I use the -fno-inline option, compilation time drops by a factor of 3.
Actual numbers:
w/o -fno-inline - 340 sec
w/ -fno-inline ~ 115 sec
I didn't find anything about the impact of -fno-inline on compilation performance. Is there any explanation for this?
Some background:
I use macros pretty extensively (for logging purposes)
There is one global exception try/catch block inherited from old code (I need to rework this piece)
There are a few try/catch blocks inside, mostly to catch exceptions from stof/stoi
I tested compilation time with and without -pipe, -O0 to -O3, -g and -ggdb. Nothing brings compilation time down as much as -fno-inline.
I'm working on a big project; most files are longer than 7000 lines.
That is quite big. You might (I am not sure) win a bit of compilation time by avoiding files bigger than 5KLOC (by splitting large C++ files, bigger than 8KLOC, into several smaller ones), and by compiling several translation units in parallel (using make -j or ninja). This requires some refactoring work. On the other hand, with genuine C++, don't make files too small (because standard container headers like <vector> can pull in many thousands of lines; you might also consider using a precompiled header). Pragmatically, 3KLOC to 7KLOC per C++ source file is a nice trade-off.
Use the -ftime-report option to g++ to get a detailed timing of each compilation phase (or passes). You may need to understand the internals of GCC to decipher the obtained table.
I didn't find anything about the impact of -fno-inline on compilation performance. Is there any explanation for this?
Inline expansion happens several times inside GCC. It generally works on some GIMPLE or SSA internal representation. Disabling it removes that work, and later passes then see smaller function bodies, which plausibly explains the shorter compile time. Of course, inlining generally improves the runtime performance of your program. By disabling it, you could lose 50% of the speed of your executable (and perhaps even more, since inline member functions such as getters and setters are used extensively in C++, notably in standard container templates).
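To make the getter point concrete, here is a minimal sketch. The class Sample and both helper functions are made up for illustration, and __attribute__((noinline)) is a GCC/Clang extension used here only to approximate, for one function, what -fno-inline does globally. Whether the first accessor is really inlined depends on the optimization level.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example class, not taken from the question's code base.
class Sample {
public:
    explicit Sample(int value) : value_(value) {}

    // Defined inside the class, so implicitly inline. An optimizing build
    // usually replaces a call to this with a direct load of value_.
    int value() const { return value_; }

    // Same body, but the GCC/Clang attribute forbids inlining, so every
    // use stays a real function call.
    __attribute__((noinline)) int value_noinline() const { return value_; }

private:
    int value_;
};

// Sums the samples through the inlinable getter.
long sum_inlined(const std::vector<Sample>& samples) {
    long total = 0;
    for (std::size_t i = 0; i < samples.size(); ++i) total += samples[i].value();
    return total;
}

// Sums the samples through the getter that must remain a call.
long sum_not_inlined(const std::vector<Sample>& samples) {
    long total = 0;
    for (std::size_t i = 0; i < samples.size(); ++i) total += samples[i].value_noinline();
    return total;
}
```

Both functions return the same sum; only the generated code differs. Comparing their assembly (g++ -O2 -S) is one way to see what the inliner did.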
FWIW, my old GCC MELT web pages (GCC MELT is now a dead project) have several slides and references explaining GCC internals, and I am right now (October 2018) writing the draft of a technical report on bismon (currently funded by the CHARIOT H2020 project); that draft happens to have a section §1.3.2 explaining some interesting GCC optimizations.
See also CppCon 2017 talk: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”
I see this thread, and I had the same question, but this one isn't really answered: GCC standard optimizations behavior
I'm trying to figure out exactly which flag is causing an incredible boost in performance at -O1. I first found out which flags are set, using g++ -O1 -Q --help=optimizers, and then passed each of the enabled ones to g++ individually. But the results were different (the binary itself was a different size).
How do I handpick optimizations for g++ or is this not possible?
Not all optimizations have individual flags, so no combination of them will generate the same code as using -O1 or any of the other general optimization options (-Os, -O2, etc.). I also imagine that a lot of the specific optimization options are ignored when you use -O0 (the default), because they require passes that are skipped if optimization isn't enabled at all.
To try to narrow down your performance increase you can try using -O1 and then selectively disabling optimizations. For example:
g++ -O1 -fno-peephole -fno-tree-cselim -fno-var-tracking ...
You still might not have better luck this way, though. It might be that multiple optimizations in combination are producing your performance increase. It could also be the result of optimizations not covered by any specific flag.
I also doubt that better cache locality resulted in your "incredible boost in performance". If so, it was likely a coincidence, especially at -O1. Big performance increases usually come about because GCC was able to eliminate a chunk of your code, either because it didn't actually have any net effect, because it always resulted in the same value being computed, or because it invoked undefined behaviour.
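A small sketch of the "no net effect" case, with made-up function names. In the first benchmark the result is thrown away, so an optimizing compiler is allowed to delete the whole loop; the second routes both the input and the result through volatile objects, which the compiler must actually read and write. This is only one common way to keep work alive in a micro-benchmark, and what a given GCC version really removes has to be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstdint>

// Sum of i*i for i in [0, n). No side effects, so a call whose result is
// unused can legally be removed by the optimizer.
std::uint64_t sum_of_squares(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) total += i * i;
    return total;
}

// The result is discarded: at -O1 and above the compiler may reduce this
// function to nothing, which makes it look incredibly fast.
void benchmark_discarding_result(int repetitions) {
    for (int r = 0; r < repetitions; ++r) {
        sum_of_squares(1000);
    }
}

// The input is read from a volatile and the result is stored to a volatile,
// so each repetition has an observable effect that cannot be dropped.
std::uint64_t benchmark_keeping_result(int repetitions) {
    volatile std::uint64_t input = 1000;
    volatile std::uint64_t sink = 0;
    for (int r = 0; r < repetitions; ++r) {
        sink = sum_of_squares(input);
    }
    return sink;
}
```

If timing the first variant gives a number close to zero at -O1 but not at -O0, eliminated code is a more likely explanation than any single clever optimization.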
When I compile an application with Intel's compiler it is slower than when I compile it with GCC. The Intel compiler's output is more than 2x slower. The application contains several nested loops. Are there any differences between GCC and the Intel compiler that I am missing? Do I need to turn on some other flags to improve the Intel compiler's performance? I expected the Intel compiler to be at least as fast as GCC.
Compiler Versions:
Intel version 12.0.0 20101006
GCC version 4.4.4 20100630
The compiler flags are the same with both compilers:
-O3 -openmp -parallel -mSSE4.2 -Wall -pthread
I have no experience with the Intel compiler, so I can't say whether you are missing some flags or not.
However, from what I recall, recent versions of gcc are generally as good at optimizing code as icc (sometimes better, sometimes worse, although most sources seem to indicate generally better), so you might have run into a situation where icc is particularly bad. Examples of which optimizations each compiler can do can be found here and here. Even if gcc is not generally better, you could simply have a case that gcc recognizes for optimization and icc doesn't. Compilers can be very picky about what they optimize and what not, especially regarding things like autovectorization.
If your loop is small enough, it might be worth comparing the generated assembly code between gcc and icc. Also, if you show some code, or at least tell us what you are doing in your loop, we might be able to give you better speculations about what leads to this behaviour. If it's a relatively small loop, it is likely a case of icc missing one (or some, but probably not many) optimization which either has inherently good potential (prefetching, autovectorization, unrolling, loop-invariant code motion, ...) or which enables other optimizations (primarily inlining).
Note that I'm only talking about optimization potential when I compare gcc to icc. In the end icc might typically generate faster code than gcc, not so much because it does more optimizations but because it has a faster standard library implementation and because it is smarter about where to optimize. At high optimization levels gcc gets (or at least used to get) a little overeager about trading code size for theoretical runtime improvements. This can actually hurt performance, e.g. when the carefully unrolled and vectorized loop is only ever executed with 3 iterations.
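The 3-iteration case can be illustrated with a hand-written analogue (the function names are invented, and real compiler output looks different, typically using vector instructions). The unrolled version has a main loop that handles four elements at a time plus a remainder loop; with n = 3 the main loop never runs, so all the extra code is pure overhead.

```cpp
#include <cassert>
#include <cstddef>

// Straightforward reference loop.
long sum_simple(const int* data, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

// Hand-unrolled by four, roughly the shape an aggressive optimizer produces.
long sum_unrolled4(const int* data, std::size_t n) {
    long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    std::size_t i = 0;
    // Main loop: only entered when at least four elements remain.
    for (; i + 4 <= n; i += 4) {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    long total = a0 + a1 + a2 + a3;
    // Remainder loop: for n = 3 this does all the work.
    for (; i < n; ++i) total += data[i];
    return total;
}
```

Both versions compute the same sum; the unrolled one is larger and only pays off when n is big enough for the main loop to dominate.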
I normally use -inline-level=1 -inline-forceinline to make sure that functions which I have explicitly declared inline actually do get inlined. Other than that, I would expect ICC performance to be at least as good as with gcc. You will need to profile your code to see where the performance difference is coming from. If this is Linux, then I recommend using Zoom, which you can get on a free 30-day evaluation.
I compared gcc and llvm-gcc with the -O3 option on hmmer and mcf from the SPEC CPU2006 benchmark. Surprisingly, I found gcc beat llvm-gcc in both cases. Is it because -O3 has different meanings? How should I establish the experiments to get a fair comparison?
BTW, I did the experiment by ONLY changing cc in the makefile.
Thanks,
Bo
You seem surprised that gcc beat llvm on your benchmark. Phoronix hosts a bunch of interesting benchmarks in this area. For instance, have a look at:
Benchmarking LLVM & Clang Against GCC 4.5.
Compiler Benchmarks Of GCC, LLVM-GCC, DragonEgg, Clang
...
(Lots of luvverly colours.)
As far as "How should I establish the experiments to get a fair comparison?" goes, presumably you should compare the fastest runtime, fastest compile time, lowest memory footprint, most operations per Watt and scalability over number of CPUs (you pay your money and take your choice), for the fastest configuration of each compiler against the fastest configuration of the other(s).
First off, you need to at least establish the variability of each program, that is, how repeatable the measured variables are for each run of a single program on your platform. (Yes, believable benchmarking requires thoroughness on your part.)
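One way to get a feel for that variability is to time the same work several times and look at the spread. The sketch below is only an illustration: TimingSummary and measure_repeatedly are names I made up, it measures wall-clock time inside one process, and for a SPEC-style benchmark you would instead repeat whole program runs.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Summary of repeated wall-clock measurements, in seconds.
struct TimingSummary {
    double min;
    double median;  // upper median when the number of runs is even
    double max;
};

// Calls work() `runs` times and summarises how long each call took.
template <typename Work>
TimingSummary measure_repeatedly(Work work, std::size_t runs) {
    assert(runs > 0);
    std::vector<double> seconds;
    seconds.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(seconds.begin(), seconds.end());
    return TimingSummary{seconds.front(), seconds[seconds.size() / 2], seconds.back()};
}
```

If the gap between min and max for one compiler is as large as the gap between the two compilers, the comparison does not tell you much yet.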
If I use "-O2" flag, the performance improves, but the compilation time gets longer.
How can I decide, whether to use it or not?
Maybe -O2 makes the most difference in certain types of code (e.g. math calculations?), and I should use it only for those parts of the project?
EDIT: I want to emphasize the fact that setting -O2 for all components of my project changes the total compilation time from 10 minutes to 30 minutes.
I would recommend using -O2 most of the time, benefits include:
Usually reduces size of generated code (unlike -O3).
More warnings (some warnings require analysis that is only done during optimization)
Often measurably improved performance (which may not matter).
If release-level code will have optimization enabled, it's best to have optimization enabled throughout the development/test cycle.
Source-level debugging is more difficult with optimizations enabled; occasionally it is helpful to disable optimization when debugging a problem.
I'm in bioinformatics so my advice may be biased. That said, I always use the -O3 switch (for release and test builds, that is; not usually for debugging). True, it has certain disadvantages, namely increasing compile-time and often the size of the executable.
However, the first factor can be partially mitigated by a good build strategy and other tricks reducing the overall build time. Also, since most of the compilation is really I/O bound, the increase of compile time is often not that pronounced.
The second disadvantage, the executable's size, often simply doesn't matter at all.
Never.
Use -O3 -Wall -Werror -std=[whatever your code base should follow]
Always, except when you're programming and just want to test something you just wrote.
We usually have our build environment set up so that we can build debug builds that use -O0 and release builds that use -O3 (the build environment preserves the objects and libraries of all configurations, so that one can switch easily between them). During development one mostly builds and runs the debug configuration for faster build speed (and more accurate debug information), and less frequently also builds and tests the release configuration.
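With several configurations lying around, it can help if a binary can report how it was built. A small sketch, assuming GCC (or a compatible compiler such as Clang): __OPTIMIZE__, __OPTIMIZE_SIZE__ and __NO_INLINE__ are predefined macros that GCC documents, NDEBUG is the standard assert switch, and the function name build_configuration is my own.

```cpp
#include <cassert>
#include <string>

// Describes the optimization settings this translation unit was compiled with.
std::string build_configuration() {
    std::string description;
#if defined(__OPTIMIZE_SIZE__)
    description = "optimized for size";      // -Os
#elif defined(__OPTIMIZE__)
    description = "optimized";               // -O1, -O2, -O3, ...
#else
    description = "unoptimized";             // -O0 or no -O option
#endif
#if defined(__NO_INLINE__)
    description += ", no inlining";          // -O0 or -fno-inline
#endif
#if defined(NDEBUG)
    description += ", assertions disabled";  // -DNDEBUG
#endif
    return description;
}
```

Printing this at startup is a cheap way to make sure the timings you compare really come from the configuration you think you built. The macros do not distinguish -O1 from -O2 or -O3.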
Is the increased compilation time really noticeable? I use -O2 all the time as the default; anything less just leaves a lot of "friction" in your code. Also note that the -O1 and -O2 optimization levels tend to be the best tested, as they are the most interesting. -O0 tends to be more buggy, and in my experience you can debug pretty well at -O2, provided you have some idea about what a compiler can do in terms of code reordering, inlining, etc.
-Werror -Wall is necessary.