performance comparison - gcc and llvm-gcc - llvm

I compared gcc and llvm-gcc with the -O3 option on hmmer and mcf from the SPEC CPU2006 benchmark suite. Surprisingly, I found that gcc beat llvm-gcc in both cases. Is it because -O3 has different meanings for the two compilers? How should I set up the experiments to get a fair comparison?
BTW, I did the experiment by ONLY changing cc in the makefile.
Thanks,
Bo

You seem surprised that gcc beat llvm on your benchmark. Phoronix hosts a bunch of interesting benchmarks in this area. For instance, have a look at:
Benchmarking LLVM & Clang Against GCC 4.5.
Compiler Benchmarks Of GCC, LLVM-GCC, DragonEgg, Clang
(Lots of luvverly colours.)
As far as "How should I set up the experiments to get a fair comparison?" goes: presumably you should compare fastest runtime, fastest compile time, lowest memory footprint, most operations per watt, and scalability over the number of CPUs (you pay your money and take your choice), pitting the fastest configuration of each compiler against the fastest configuration of the other(s).
First off, you need to at least establish the variability of each program: how repeatable the measurements are across runs of a single program on your platform. (Yes, believable benchmarking requires thoroughness on your part.)
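Here is a minimal sketch of such a harness in C (the workload() function is a hypothetical stand-in for the benchmark under test; in practice you would time the real program the same way over many runs):

    /* variability.c - measure run-to-run variability of a workload.
     * Build with e.g.: gcc -O2 variability.c -lm -o variability */
    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in for the program under test. */
    static void workload(void) {
        volatile double x = 0.0;
        for (long i = 1; i < 50000000L; ++i)
            x += 1.0 / (double)i;
    }

    int main(void) {
        enum { RUNS = 10 };
        double t[RUNS], mean = 0.0, var = 0.0;
        for (int i = 0; i < RUNS; ++i) {
            struct timespec a, b;
            clock_gettime(CLOCK_MONOTONIC, &a);
            workload();
            clock_gettime(CLOCK_MONOTONIC, &b);
            t[i] = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
            mean += t[i];
        }
        mean /= RUNS;
        for (int i = 0; i < RUNS; ++i)
            var += (t[i] - mean) * (t[i] - mean);
        /* If the standard deviation is a sizeable fraction of the mean,
         * single-run comparisons between compilers are meaningless. */
        printf("mean %.3fs  stddev %.3fs over %d runs\n",
               mean, sqrt(var / (RUNS - 1)), RUNS);
        return 0;
    }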

Related

what are the compilation flags that are activated by using O3

We are in the process of changing the Intel compiler version from v14 to v18 on our systems. By running our tests, we have noticed that O3 in some cases produces incorrect results with v18, whereas the same code runs correctly with O3 and v14. I was wondering what the differences in optimizations between these two versions are, and how I can get a full list of the flags that are activated by O3 in each version. Thank you all in advance for your help and suggestions.
The behaviour of -O3 is documented on Intel's website: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/compiler-options/compiler-option-details/optimization-options/o.html
O3
Performs O2 optimizations and enables more aggressive loop transformations such as Fusion, Block-Unroll-and-Jam, and collapsing IF statements.
This option may set other options. This is determined by the compiler, depending on which operating system and architecture you are using. The options that are set may change from release to release.
When O3 is used with options -ax or -x (Linux) or with options /Qax or /Qx (Windows), the compiler performs more aggressive data dependency analysis than for O2, which may result in longer compilation times.
The O3 optimizations may not cause higher performance unless loop and memory access transformations take place. The optimizations may slow down code in some cases compared to O2 optimizations.
The O3 option is recommended for applications that have loops that heavily use floating-point calculations and process large data sets.
Many routines in the shared libraries are more highly optimized for Intel® microprocessors than for non-Intel microprocessors.
The bottom of the page has an "Alternate options" section, which lists only -Od (which disables all optimizations: probably not what you want).
So it looks like -O3 activates optimizations that cannot be represented by using other flags (so -O3 does not have a long-form equivalent version).
Looking at Intel's page about the techniques used for high-level optimization (HLO), it looks like they cannot be enabled à la carte: HLO is all-or-nothing, enabled by either O2 or O3 (with O2 using a subset of O3's techniques).
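For intuition, here is a hand-written sketch of loop fusion, one of the HLO loop transformations named above (the arrays and function names are purely illustrative; the compiler performs this rewrite itself when it can prove the arrays don't overlap):

    /* Before: two passes over the same index range. */
    void unfused(double *a, double *b, const double *c, int n) {
        for (int i = 0; i < n; ++i)
            a[i] = 2.0 * c[i];
        for (int i = 0; i < n; ++i)
            b[i] = a[i] + c[i];
    }

    /* After fusion: one pass, so a[i] and c[i] are still hot in
     * registers/cache when b[i] is computed. */
    void fused(double *a, double *b, const double *c, int n) {
        for (int i = 0; i < n; ++i) {
            a[i] = 2.0 * c[i];
            b[i] = a[i] + c[i];
        }
    }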
Compare that to -Ofast which does have a long-form equivalent:
Ofast
It sets compiler options -O3, -no-prec-div, and -fp-model fast=2.
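As an aside on the "incorrect results" part of the question: one common source of result changes at aggressive optimization levels (and especially under value-unsafe modes like -fp-model fast) is floating-point reassociation, because FP addition is not associative. A small self-contained illustration:

    /* Reordering an FP sum can legitimately change the result. */
    #include <stdio.h>

    int main(void) {
        double big = 1e16, tiny = 1.0;
        double left  = (big + tiny) + tiny;  /* tiny rounds away twice */
        double right = big + (tiny + tiny);  /* tiny + tiny survives   */
        printf("%.1f vs %.1f\n", left, right);  /* differ by 2.0 */
        return 0;
    }

If a newer compiler version reassociates a reduction that an older one left alone, the "same" O3 build can produce different (not necessarily wrong) numeric results.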

How does "__builtin_popcount" of gcc work?

I want to know the inner workings of "__builtin_popcount".
As far as I understand, it works differently on different CPUs.
Similar to many other built-ins, it translates into a specific CPU instruction if one is available on the target CPU, which can considerably speed up the application.
For example, on x86_64 it translates to the popcntl ASM instruction.
Additional information can be found on GCC page: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
It is also worth noting that the actual speedup is only seen if gcc is run with a -march flag targeting an architecture that supports the instruction, or with the argument that specifically enables it, -mpopcnt. Without either of those, gcc falls back to generic bit counting via bit operations.
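A minimal sketch showing both the builtin and the kind of generic fallback (this particular fallback is Kernighan's bit-clearing loop, shown for illustration; GCC's actual generic sequence may differ):

    #include <stdio.h>

    /* Portable fallback: clears the lowest set bit each iteration. */
    static int popcount_generic(unsigned x) {
        int n = 0;
        while (x) {
            x &= x - 1;   /* clear lowest set bit */
            ++n;
        }
        return n;
    }

    int main(void) {
        unsigned v = 0xF00Du;
        /* With gcc -mpopcnt (or a suitable -march), the builtin compiles
         * to a single popcnt instruction on x86_64. */
        printf("builtin: %d\n", __builtin_popcount(v));
        printf("generic: %d\n", popcount_generic(v));
        return 0;
    }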

Optimized Execution Time

For a school assignment I have to convert C++ code to assembly (ARMv8). Then I have to compile the C++ code using GCC's -O0, -O1, -O2, -O3 and -Os optimization levels, record the times, and compare them with the execution time of my assembly code. As far as I know, -O3 should be faster than -O1 and -O2. However, I find that -O2 is the fastest, followed by -O1, -O3, -Os and -O0. Is that usual? (Measured times are around 30 seconds.)
Notice that GCC has many other optimization flags.
There is no guarantee that -O3 gives faster code than -O2; a compiler can apply more optimization passes, but they are all heuristics and might be unsuccessful (or even slightly slow down your particular code). Hence it does happen that -O3 produces slightly slower code than -O2 on some particular input source code.
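As a contrived illustration (actual results depend on compiler version and target): with -O3, GCC may vectorize a reduction like the one below, emitting SIMD setup code plus a scalar remainder loop; if callers almost always pass a tiny n, that extra code can end up slower than the plain scalar loop -O2 produces.

    /* Hypothetical hot function. With -O3 gcc may emit SIMD code plus
     * a scalar epilogue; for n == 3 the setup cost can dominate. */
    long dot(const int *a, const int *b, int n) {
        long s = 0;
        for (int i = 0; i < n; ++i)
            s += (long)a[i] * b[i];
        return s;
    }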
You could try a more recent version of GCC (the latest, as of November 2017, is GCC 7; GCC 8 will come out in a few months). You could also try a better -march= or -mtune= option.
Finally, with a GCC plugin you might add your own optimization pass, or change the order (and the set) of applied optimization passes (there are several hundred different optimization passes in GCC). But you'll need a lot of work (perhaps a year or two) to be able to extend GCC.
You could also tune the optimization parameters, and one project (MILEPOST) has even used machine-learning techniques to improve them.
See also slides and references on my (old) GCC MELT documentation.
Yes, it is usual. Take the -Ox levels as guidelines: on average they deliver the optimization they advertise, but a lot depends on the style in which the code is written, the memory layout, and the compiler itself.
Sometimes you need to try and fail many times before getting the optimal code.
-O2 indeed gives the best optimization in most cases.

Do `-g -rdynamic` gcc flags slow down application execution (grow performance consumption) notably?

So I want to distribute my gcc application with backtrace logging for critical errors. Yet it is quite a performance-critical application, so I wonder whether the -g -rdynamic gcc flags slow down execution (especially if they do so a lot)? I would also like to give my users maximum performance, so I compile with optimization flags like "-flto" and "-mtune", which makes me wonder whether the flags would conflict and the resulting backtrace would be madness?
Although introducing debug symbols does not affect performance by itself, your application can still end up far behind in terms of possible performance. What I mean by that is that it would, in general, be a bad idea to use -g and -O3 simultaneously. Therefore, if your application is performance-critical but at the same time severely needs a good level of debuggability, it would be reasonable to find some balance between the two. In recent versions of GCC, we are provided with the -Og flag:
Optimize debugging experience. -Og enables optimizations that do not
interfere with debugging. It should be the optimization level of
choice for the standard edit-compile-debug cycle, offering a
reasonable level of optimization while maintaining fast compilation
and a good debugging experience.
I think it would be a good idea to test your application with this flag, to see whether the performance is indeed better than with bare -g while the debugging experience stays intact.
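A concrete way to compare the two (hypothetical example; exact behaviour varies by GCC version): build the snippet below twice and step through find() in gdb. With -g -O3 the temporary mid is often reported as <optimized out>, while -g -Og usually keeps it inspectable.

    /* Build with: gcc -g -O3 demo.c -o fast
     *         and: gcc -g -Og demo.c -o debuggable */
    #include <stdio.h>

    int find(const int *a, int n, int key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;   /* try `print mid` in gdb */
            if (a[mid] == key) return mid;
            if (a[mid] < key)  lo = mid + 1;
            else               hi = mid - 1;
        }
        return -1;
    }

    int main(void) {
        int a[] = {1, 3, 5, 7, 9, 11};
        printf("index of 7: %d\n", find(a, 6, 7));
        return 0;
    }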
Once again, do not neglect reading the official GCC documentation. LTO is a relatively new feature in GCC and, as a result, some parts of it are still experimental and not meant for production. For example, here is a direct extract:
Link-time optimization does not work well with generation of debugging
information. Combining -flto with -g is currently experimental and
expected to produce wrong results.
Not so long ago I had mixed experiences with LTO. Sometimes it works well, sometimes the project doesn't even compile, not to mention that there can also be subtle runtime issues. Summarizing all of this, I would not recommend using LTO, especially in your situation.
NOTE: Performance gain from LTO usually varies from 0% to 3%, and it heavily depends on the underlying application. Without profiling, you cannot tell whether it is even reasonable to employ LTO for your situation as it might deliver more troubles than benefits.
Flags like -march and -mtune usually perform optimizations at a very low level: the instruction level for the target processor architecture. Thus, I wouldn't expect them to interfere with debugging. Nevertheless, you are welcome to test this yourself with your application.
-g has no impact whatsoever on performance. -rdynamic will increase the size of the dynamic symbol table in the main executable, which might slow down dynamic linking. My best guess is that the slow-down will be very small but possibly measurable (nonzero) with precise measurement/profiling tools.
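Since the whole point of -rdynamic here is backtrace logging, it is worth noting why it matters: glibc's backtrace_symbols resolves addresses through the dynamic symbol table, so without -rdynamic you typically get bare addresses instead of function names. A minimal glibc-specific sketch:

    /* Build with: gcc -g -rdynamic bt.c -o bt
     * Without -rdynamic the output usually shows raw addresses
     * instead of names like crash_here(). */
    #include <execinfo.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void log_backtrace(void) {
        void *frames[32];
        int n = backtrace(frames, 32);
        char **names = backtrace_symbols(frames, n);
        if (!names) return;
        for (int i = 0; i < n; ++i)
            fprintf(stderr, "  %s\n", names[i]);
        free(names);
    }

    static void crash_here(void) {
        fprintf(stderr, "critical error, backtrace:\n");
        log_backtrace();
    }

    int main(void) {
        crash_here();
        return 0;
    }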

Intel Compiler versus GCC

When I compile an application with Intel's compiler it is slower than when I compile it with GCC. The Intel compiler's output is more than 2x slower. The application contains several nested loops. Are there any differences between GCC and the Intel compiler that I am missing? Do I need to turn on some other flags to improve the Intel compiler's performance? I expected the Intel compiler to be at least as fast as GCC.
Compiler Versions:
Intel version 12.0.0 20101006
GCC version 4.4.4 20100630
The compiler flags are the same with both compilers:
-O3 -openmp -parallel -mSSE4.2 -Wall -pthread
I have no experience with the Intel compiler, so I can't answer whether you are missing some flags or not.
However, from what I recall, recent versions of gcc are generally as good at optimizing code as icc (sometimes better, sometimes worse, though most sources seem to indicate generally better), so you might have run into a situation where icc is particularly bad. Examples of what optimizations each compiler can do can be found here and here. Even if gcc is not generally better, you could simply have a case which gcc recognizes for optimization and icc doesn't. Compilers can be very picky about what they optimize and what they don't, especially regarding things like autovectorization.
If your loop is small enough, it might be worth comparing the generated assembly code between gcc and icc. Also, if you show some code, or at least tell us what you are doing in your loop, we might be able to speculate better about what leads to this behaviour. If it's a relatively small loop, it is likely a case of icc missing an optimization (or a few, but probably not many) which either has inherently good potential (prefetching, autovectorization, unrolling, loop-invariant motion, ...) or which enables other optimizations (primarily inlining).
Note that I'm only talking about optimization potential when I compare gcc to icc. In the end, icc might typically generate faster code than gcc, not so much because it does more optimizations, but because it has a faster standard library implementation and because it is smarter about where to optimize (at high optimization levels, gcc gets a little overeager, or at least it used to, about trading code size for theoretical runtime improvements; this can actually hurt performance, e.g. when a carefully unrolled and vectorized loop is only ever executed with 3 iterations).
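One concrete example of the autovectorization pickiness mentioned above: whether a compiler vectorizes even a trivial loop can hinge on aliasing information, and gcc and icc can draw different conclusions in different cases. A sketch (restrict is the standard C way to promise the arrays don't overlap; compare the output of gcc -O3 -S for the two variants):

    /* Without restrict the compiler must assume dst may overlap src,
     * so it may add runtime overlap checks or skip vectorization. */
    void scale(float *dst, const float *src, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = 2.0f * src[i];
    }

    /* With restrict it is free to use SIMD unconditionally. */
    void scale_restrict(float * restrict dst,
                        const float * restrict src, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = 2.0f * src[i];
    }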
I normally use -inline-level=1 -inline-forceinline to make sure that functions which I have explicitly declared inline actually do get inlined. Other than that I would expect ICC performance to be at least as good as with gcc. You will need to profile your code to see where the performance difference is coming from. If this is Linux then I recommend using Zoom, which you can get on a free 30 day evaluation.