GNU Fortran compiler optimisation flags for Ivy Bridge architecture - fortran

May I please ask for your suggestions on GNU Fortran compiler (v6.3.0) flags to optimise the code for the Ivy Bridge architecture (Intel Xeon CPU E5-2697 v2 Ivy Bridge @ 2.7 GHz)?
At the moment I’m compiling the code with the following flags:
-O3 -march=ivybridge -mtune=ivybridge -ffast-math -mavx -m64 -w

Unless you use intrinsics specific to Ivy Bridge, the Sandy Bridge flag is sufficient. I expect you should find some additional advantage by also setting -funroll-loops --param max-unroll-times=2.
Sometimes -O2 -ftree-vectorize will work out better than -O3.
If you have complex data types, you will want to check performance and results against -fno-cx-limited-range, as the default set by -ffast-math may be too aggressive.
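For reference, a compile line combining these suggestions might look like the sketch below (mycode.f90 is a hypothetical file name; which combination wins should be confirmed by benchmarking, including the -O2 -ftree-vectorize variant):
gfortran -O3 -march=ivybridge -mtune=ivybridge -ffast-math -fno-cx-limited-range -funroll-loops --param max-unroll-times=2 -m64 -w -o mycode mycode.f90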

Related

Cross compile C++ for ARM64/x86_64, using clang, with core2-duo enabled

OK, so I am new to cross compilation.
I am writing some shell-scripts to compile some C++ files, on my Mac.
I want to build a "Fat universal binary", so I want this to work for Arm64 and x86_64.
After a lot of searching, I found using: --arch arm64 --arch x86_64 will solve my problem of cross compilation.
However, my old "optimisation flags" conflicted with this. I used them to make my code run faster, for a computer game I was making. Well... these are the flags:
-march=core2 -mfpmath=sse -mmmx -msse -msse2 -msse4.1 -msse4.2
Unfortunately... clang can't figure out that I mean to use these for the Intel build only. I get an error message of:
clang: error: the clang compiler does not support '-march=core2'
If I remove --arch arm64 --arch x86_64 the code compiles again.
I tried various things like --arch x86_64+sse4, which seem to be allowed by the gcc documentation, but clang does not recognise them.
As far as I know, gcc/clang do not emit SSE4.2 instructions by default, despite the CPUs that introduced it having been released about 17 years ago. This seems like quite a bad default to me.
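One hedged workaround (a sketch, not necessarily the canonical answer; game.cpp is a hypothetical file name, and Apple's command line tools are assumed): build each slice with its own flags and merge the results with lipo:
clang++ -arch x86_64 -march=core2 -mfpmath=sse -msse4.1 -msse4.2 -O2 game.cpp -o game_x86_64
clang++ -arch arm64 -O2 game.cpp -o game_arm64
lipo -create -output game game_x86_64 game_arm64
This keeps the x86-specific tuning flags out of the arm64 compile entirely. Apple's clang driver also offers per-arch option forwarding via -Xarch_x86_64, which may work for single-argument flags, but building the slices separately is the more predictable route.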

What are these functions given by Intel Advisor?

I'm trying to use Intel Advisor to understand hotspots in my application.
These are the compile and linker flags that I'm using:
INTEL_OPT=-O3 -simd -xCORE-AVX2 -parallel -ipo -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl
This is a sample image taken from this tutorial:
This is a screenshot from my application:
I don't understand what all these functions before _clone, [stack], _start and _libc_start_main are.
James is correct: things like _clone, [stack], _start and _libc_start_main correspond to the CRT, Cray system libraries (if you use the Cray environment), OpenMP runtime internals, or general system calls.
Also, your profile does not seem to have any vectorization info enabled (the "why no vectorization" column is empty, and there is no peel/remainder breakdown, no SIMD efficiency metrics, and so on). Since your compilation flags seem reasonable, my next guess is that you are either stripping the debug info into a separate file or using a fairly old ICL version. Removing -ipo may also help to recover the missing information.
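As a rough sketch of that suggestion (icpc and main.cpp are assumptions here; the question does not say which Intel compiler or source language is in use), dropping -ipo and keeping debug info in the binary would look something like:
icpc -O3 -xCORE-AVX2 -qopenmp -qopt-report=5 -g main.cpp -o app
The survey can then be re-collected against the rebuilt binary, for example with advixe-cl -collect survey -- ./app (the command-line tool name varies between Advisor versions).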

Is sse2 enabled by default in g++?

When I run g++ -Q --help=target, I get
-msse2 [disabled].
However, if I create the assembly code with default options as
g++ -g mycode.cpp -o mycode.o; objdump -S mycode.o > default,
and a sse2 version with
g++ -g -msse2 mycode.cpp -o mycode.sse2.o; objdump -S mycode.sse2.o > sse2,
and finally a non-sse2 version with
g++ -g -mno-sse2 mycode.cpp -o mycode.nosse2.o; objdump -S mycode.nosse2.o > nosse2
I see basically no difference between default and sse2, but a big difference between default and nosse2, so this tells me that, by default, g++ is using sse2 instructions, even though I am being told it is disabled ... what is going on here?
I am compiling on a Xeon E5-2680 under Linux with gcc-4.4.7 if it matters.
If you are compiling for 64-bit, then this is entirely expected and documented behavior.
As stated in the gcc docs, the SSE instruction set is enabled by default when using an x86-64 compiler:
-mfpmath=unit
Generate floating point arithmetics for selected unit unit. The choices for unit are:
`387'
Use the standard 387 floating point coprocessor, present in the majority of chips and emulated otherwise. Code compiled with this option will run almost everywhere. The temporary results are computed in 80-bit precision instead of the precision specified by the type, resulting in slightly different results compared to most other chips. See -ffloat-store for a more detailed description.
This is the default choice for the i386 compiler.
`sse'
Use scalar floating point instructions present in the SSE instruction set. This instruction set is supported by Pentium 3 and newer chips, and in the AMD line by Athlon-4, Athlon-xp and Athlon-mp chips. The earlier version of the SSE instruction set supports only single-precision arithmetic, so double- and extended-precision arithmetic is still done using 387. A later version, present only in Pentium 4 and the future AMD x86-64 chips, supports double-precision arithmetic too.
For the i386 compiler, you need to use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For the x86-64 compiler, these extensions are enabled by default.
The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80-bit.
This is the default choice for the x86-64 compiler.
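A quick way to confirm this on your own machine (a hedged sketch; it only needs a shell and g++) is to look at the compiler's predefined macros:
echo | g++ -dM -E -x c++ - | grep -i sse
On a typical x86-64 system this prints lines such as #define __SSE2__ 1 and #define __SSE2_MATH__ 1, even though --help=target still reports -msse2 as [disabled].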

g++ enables wrong flags at -Os

At the moment I am doing some experiments with the GNU C++ compiler and the -Os optimization option for minimal code size. I checked the enabled compiler flags at -Os with the following command:
g++ -c -Q -Os --help=optimizers | grep "enabled"
I got this list of enabled options:
-faggressive-loop-optimizations [enabled]
-falign-functions [enabled]
-falign-jumps [enabled]
-falign-labels [enabled]
-falign-loops [enabled]
-fasynchronous-unwind-tables [enabled]
...
This seems a bit strange, because I also looked up which flags should be enabled at -Os, here, and under the -Os section it is written that all the -falign- options should be disabled for code minimization.
Q: So is this a bug, or am I doing something wrong here? Because after reading what the -falign- flags do, I really think they should be disabled at -Os!
My gcc version is 4.9.2 and I am working on Arch Linux.
Thanks in advance for helping :)
Q: So is this a bug, or am I doing something wrong here? Because after reading what the -falign- flags do, I really think they should be disabled at -Os.
I think Hans did a good job of finding part of the problem. It's definitely a documentation bug. But no one from GCC commented on why -Os enabled them, so you might not have all of the information.
Older ARM devices, including ARMv4 and I think ARMv5, were very intolerant of unaligned accesses. If you performed an unaligned access, you would get a SIGBUS (been there, done that, got the t-shirt).
Modern ARM devices fix up unaligned accesses like x86 processors do, so you no longer get a SIGBUS. Instead, you just take the performance penalty.
You should try to specify an architecture in case those options are an artifact from older ARM device support, for example -march=armv7. If you still find them enabled on ARMv6 and ARMv7, then that could still be a bug. It depends on whether the GCC team decided the trade-off was worthwhile for ARM (code size vs. performance penalty).
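Independent of the ARM background, a hedged way to see which optimizer settings really differ between levels on your own machine is to diff the help output (bash process substitution; assumes g++ is on the PATH):
diff <(g++ -Q -O2 --help=optimizers) <(g++ -Q -Os --help=optimizers)
Lines that differ show what actually changes between -O2 and -Os; keep in mind that the help output itself can be misleading for the -falign- family, which is exactly the documentation issue discussed above.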

What is the -O5 flag for compiling gfortran .f90 files?

I see the flag in the documentation of how to compile some f90 code I have acquired (specifically, mpfi90 -O5 file.f90), but researching the -O5 flag turned up nothing in the gfortran docs, mpfi docs, or anywhere else. I assume it is an optimization flag like -O1, etc., but I'm not sure.
Thanks!
Source: http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp?topic=%2Fcom.ibm.xlf91a.doc%2Fxlfug%2Fhu00509.htm
The flag -O5 is an optimization level like -O3 and -O2, documented for IBM's XL Fortran compilers. The linked source says:
-qnoopt / -O0: Fast compilation, debuggable code, conserved program semantics.
-O2 (same as -O): Comprehensive low-level optimization; partial debugging support.
-O3: More extensive optimization; some precision trade-offs.
-O4 and -O5: Interprocedural optimization; loop optimization; automatic machine tuning.
Each higher level includes all the optimizations of the lower levels.
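If mpfi90 ultimately invokes gfortran rather than IBM's XL Fortran, a hedged way to check what an -O5 request actually enables is to compare the driver's help output at the two levels (gfortran accepts higher -O values, but as far as I know they add nothing beyond -O3):
gfortran -O5 -Q --help=optimizers | grep enabled
gfortran -O3 -Q --help=optimizers | grep enabled
If the two listings match, -O5 is effectively being treated as -O3 on your system.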