When I compile an application with Intel's compiler it is slower than when I compile it with GCC. The Intel compiler's output is more than 2x slower. The application contains several nested loops. Are there any differences between GCC and the Intel compiler that I am missing? Do I need to turn on some other flags to improve the Intel compiler's performance? I expected the Intel compiler to be at least as fast as GCC.
Compiler Versions:
Intel version 12.0.0 20101006
GCC version 4.4.4 20100630
The compiler flags are the same with both compilers:
-O3 -openmp -parallel -mSSE4.2 -Wall -pthread
I have no experience with the Intel compiler, so I can't answer whether you are missing some flags or not.
However, from what I recall, recent versions of gcc are generally as good at optimizing code as icc (sometimes better, sometimes worse, though most sources suggest generally better), so you might have run into a situation where icc is particularly bad. Examples of what optimizations each compiler can do can be found here and here. Even if gcc is not generally better, you could simply have a case which gcc recognizes for optimization and icc doesn't. Compilers can be very picky about what they optimize and what they don't, especially regarding things like autovectorization.
If your loop is small enough it might be worth comparing the generated assembly code between gcc and icc. Also, if you show some code, or at least tell us what you are doing in your loop, we might be able to give you better speculation about what leads to this behaviour. If it's a relatively small loop, it is likely a case of icc missing one (or some, but probably not many) optimization which either has inherently good potential (prefetching, autovectorization, unrolling, loop-invariant motion, ...) or which enables other optimizations (primarily inlining).
Note that I'm only talking about optimization potential when I compare gcc to icc. In the end icc might typically generate faster code than gcc, but not so much because it does more optimizations; rather because it has a faster standard library implementation and because it is smarter about where to optimize. At high optimization levels gcc gets a little overeager (or at least it used to) about trading code size for (theoretical) runtime improvements. This can actually hurt performance, e.g. when the carefully unrolled and vectorized loop is only ever executed with 3 iterations.
I normally use -inline-level=1 -inline-forceinline to make sure that functions which I have explicitly declared inline actually do get inlined. Other than that, I would expect ICC performance to be at least as good as with gcc. You will need to profile your code to see where the performance difference is coming from. If this is Linux then I recommend using Zoom, which you can get on a free 30-day evaluation.
Related
Here is my very simple question. With ICC I know it is possible to use #pragma SIMD to force vectorization of loops that the compiler chooses not to vectorize. Is there something analogous in GCC? Or, is there any plan to add this feature in a future release?
Quite related, what about forcing vectorization with Graphite?
As long as gcc is allowed to use SSE/SSE2/etc instructions, the compiler will in general produce vector instructions when it realizes that it's "worthwhile". Like most things in compilers, this requires some luck/planning/care from the programmer to avoid the compiler thinking "maybe this isn't safe" or "this is too complicated, I can't figure out what's going on". But quite often, it's successful if you are using a reasonably modern version of gcc (4.x versions should all do this).
You can make the compiler use SSE or SSE2 instructions by adding -msse or -msse2 (etc. for later SSE extensions). -msse2 is default in x86-64.
I'm not aware of any way that you can FORCE this, however. The compiler will either do this because it's happy that it's a good solution, or it won't.
Sorry, can't answer about Graphite.
Because of a school assignment I have to convert some C++ code to assembly (ARMv8). Then I have to compile the C++ code using GCC's -O0, -O1, -O2, -O3 and -Os optimizations, note the times, and compare them with the execution time of my assembly code. As far as I know, -O3 should be faster than -O1 and -O2. However, I find that -O2 is the fastest, followed by -O1, -O3, -Os and -O0. Is that usual? (Measured times are about 30 seconds.)
Notice that GCC has many other optimization flags.
There is no guarantee that -O3 gives faster code than -O2; a compiler can apply more optimization passes, but they are all heuristics and might be unsuccessful (or even slightly slow down your particular code). Hence it does happen that -O3 gives slightly slower code than -O2 (on some particular input source code).
You could try a more recent version of GCC (the latest, in November 2017, is GCC 7; GCC 8 will be out in a few months). You could also try some better -march= or -mtune= option.
At last, with your GCC plugin, you might add your own optimization pass, or change the order (and the set) of applied optimization passes (there are several hundreds different optimization passes in GCC). But you'll need a lot of work (perhaps a year or two) to be able to extend GCC.
You could tune optimization parameters, and some project (MILEPOST) has even used machine learning techniques to improve them.
See also slides and references on my (old) GCC MELT documentation.
Yes, it is usual. Take the -Ox optimization levels as guidelines. On average they produce the optimizations they advertise, but a lot depends on the style in which the code is written, the memory layout, as well as the compiler itself.
Sometimes, you need to try and fail many times before getting the optimal code.
-O2 indeed gives the best optimization in most cases.
I see this thread, and I had the same question, but this one isn't really answered: GCC standard optimizations behavior
I'm trying to figure out exactly what flag is causing an incredible boost in performance at O1. I first found out which flags are set, using g++ -O1 -Q --help=optimizers, then took each of the enabled ones and used them to compile with g++. But the output results were different (the binaries themselves were of different sizes).
How do I handpick optimizations for g++ or is this not possible?
Not all optimizations have individual flags, so no combination of them will generate the same code as using -O1 or any of the other general optimization-enabling options (-Os, -O2, etc...). Also, I imagine that a lot of the specific optimization options are ignored when you use -O0 (the default) because they require passes that are skipped if optimization hasn't been enabled generally.
To try to narrow down your performance increase you can try using -O1 and then selectively disabling optimizations. For example:
g++ -O1 -fno-peephole -fno-tree-cselim -fno-var-tracking ...
You still might not have much luck this way, though. It might be that multiple optimizations in combination are producing your performance increase. It could also be the result of optimizations not covered by any specific flag.
I also doubt that better cache locality produced your "incredible boost in performance". If so, it was likely a coincidence, especially at -O1. Big performance increases usually come about because GCC was able to eliminate a chunk of your code, either because it didn't actually have any net effect, it always resulted in the same value being computed, or it invoked undefined behaviour.
Consider this simple case scenario:
I download the pre-built binaries of a C++ compiler (say CLang or GCC or anything else) for my generic OS (that is not windows). I compile my code which consists of some computationally expensive mathematical calculation with optimization flag -O3 and I have an execution time of T1.
On a different attempt, this time instead of using the pre-built binaries I download the source code and build the compiler myself on my generic machine. I compile the same code with the same optimization flag and get an execution time of T2.
Will T2 < T1 or they will be more or less the same?
In other words, is the execution time independent from the way that compiler is built?
The compiler's optimization of your code is the result of the behavior of the compiler, not the performance of the compiler.
As long as the compiler has the same behavioral design, it will produce exactly the same output.
Generally the same compiler version should generate the same assembly code given the same C or C++ input. However, there are certain things that might further affect the code the compiler generates:
A distro might have backported (or even created its own) patches from other versions.
Modern compilers often have library dependencies (e.g. CLooG) that may behave differently in different versions, causing the compiler to make code-generation decisions based on essentially different data.
These libraries may (in some compiler versions) be optional at compile time (might need to give --enable switches to configure, or configure tries to autodetect them).
Compiler switches like -march=native will look on what hardware you compile and try to optimize accordingly.
A time limit in the compiler's optimizer may trigger, essentially yielding better optimizations on better machines; the same goes for memory limits (I don't think that's found in modern compilers anymore, though).
That said, even the same assembly might perform differently on your machine and theirs, e.g. because one is optimized for AMD and the other for Intel.
In my opinion, and in theory, compilation speed can be faster, since you can tell the compiler that compiles the compiler: "please target my computer, and you may use my processor's own machine code to optimize."
But I think the compiler's optimization of your code cannot get better this way. To make the compiler's optimizations better, something new would have to be put into the compiler itself, not just a recompile.
That depends on how that compiler is implemented and on your platform, but the answer will be most likely "no".
If your platform provides specific functionality that can improve the performance of your program, the optimizer in your compiler might use that functionality to produce a faster program. The optimizer can do so only if the compiler writer was aware of the functionality and has implemented special treatment for your platform in the optimizer. If that is the case, the detection might be done dynamically in the optimizer, meaning any build of the optimizer can detect the platform and optimize your code. Only if the detection has to occur at compile time of the optimizer for some reason would recompiling it on your platform give that advantage. But if such a better build for your platform exists, the compiler vendor has most likely provided binaries for it.
So, with all these ifs, it's unlikely that your program will be any faster when you recompile the compiler on your platform. There is a chance, however, that the compiler itself will be a bit faster if it is optimized for your platform rather than being a generic binary, resulting in shorter compile times.
I compared gcc and llvm-gcc with -O3 option on hmmer and mcf in spec cpu2006 benchmark. Surprisingly, I found gcc beat llvm-gcc for both cases. Is it because the -O3 has different meanings? How should I establish the experiments to get a fair comparison?
BTW, I did the experiment by ONLY changing cc in the makefile.
Thanks,
Bo
You seem surprised that gcc beat llvm on your benchmark. Phoronix hosts a bunch of interesting benchmarks in this area. For instance, have a look at:
Benchmarking LLVM & Clang Against GCC 4.5.
Compiler Benchmarks Of GCC, LLVM-GCC, DragonEgg, Clang
(Lots of luvverly colours.)
As far as How should I establish the experiments to get a fair comparison? goes, presumably you should compare the fastest runtime, fastest compile time, lowest memory footprint, most operations per Watt and scalability over number of CPUs (you pay your money and take your choice), for the fastest configuration of each compiler against the fastest configuration of the other(s).
First off, you need to at least establish the variability of each program: how repeatable the measurements are across runs of a single program on your platform. (Yes, believable benchmarking requires thoroughness on your part.)