Is sse2 enabled by default in g++? - c++

When I run g++ -Q --help=target, I get
-msse2 [disabled].
However, if I create the assembly code of with default options as
g++ -g mycode.cpp -o mycode.o; objdump -S mycode.o > default,
and a sse2 version with
g++ -g -msse2 mycode.cpp -o mycode.sse2.o; objdump -S mycode.sse2.o > sse2,
and finally a non-sse2 version with
g++ -g -mno-sse2 mycode.cpp -o mycode.nosse2.o; objdump -S mycode.nosse2.o > nosse2
I see basically no difference between default and sse2, but a big difference between default and nosse2, so this tells me that, by default, g++ is using sse2 instructions, even though I am being told it is disabled ... what is going on here?
I am compiling on a Xeon E5-2680 under Linux with gcc-4.4.7 if it matters.

If you are compiling for 64bit, then this is totally fine and documented behavior.
As stated in the gcc docs the SSE instruction set is enabled by default when using an x86-64 compiler:
-mfpmath=unit
Generate floating point arithmetics for selected unit unit. The choices for unit are:
`387'
Use the standard 387 floating point coprocessor present majority of chips and emulated otherwise. Code compiled with this option will run almost everywhere. The temporary results are computed in 80bit precision instead of precision specified by the type resulting in slightly different results compared to most of other chips. See -ffloat-store for more detailed description.
This is the default choice for i386 compiler.
`sse'
Use scalar floating point instructions present in the SSE instruction set. This instruction set is supported by Pentium3 and newer chips, in the AMD line by Athlon-4, Athlon-xp and Athlon-mp chips. The earlier version of SSE instruction set supports only single precision arithmetics, thus the double and extended precision arithmetics is still done using 387. Later version, present only in Pentium4 and the future AMD x86-64 chips supports double precision arithmetics too.
For the i386 compiler, you need to use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For the x86-64 compiler, these extensions are enabled by default.
The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80bit.
This is the default choice for the x86-64 compiler.

Related

GCC 8 Cross Compiler outputs ARMv7 executable instead of ARMv6

I'm trying to compile a C++ application for a Raspberry Pi Zero using GCC 8.2.1.
I'm using this for a relatively large C++17 project that is being built using CMake, and I'm trying to cross-compile it on my x86-64 laptop.
Even with the simplest code possible, I'm not able to compile it for ARMv6:
int main() {}
$ arm-linux-gnueabihf-g++ test.cpp -static -march=armv6 -mfpu=vfp -mfloat-abi=hard
When running the file on the Pi, I get an Illegal instruction error, and readelf returns the following:
$ arm-linux-gnueabihf-readelf -A a.out
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7-A"
Tag_CPU_arch: v7
Tag_CPU_arch_profile: Application
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-2
Tag_FP_arch: VFPv3
Tag_Advanced_SIMD_arch: NEONv1
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_rounding: Needed
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_align_preserved: 8-byte, except leaf SP
Tag_ABI_enum_size: int
Tag_ABI_VFP_args: VFP registers
Tag_CPU_unaligned_access: v6
GCC seems to ignore my architecture flags.
When simply compiling it into an object file, it seems to work just fine, but the linking stage always uses ARMv7:
$ arm-linux-gnueabihf-g++ test.cpp -static -march=armv6 -mfpu=vfp -mfloat-abi=hard -c
$ arm-linux-gnueabihf-readelf -A test.o
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "6"
Tag_CPU_arch: v6
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-1
Tag_FP_arch: VFPv2
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_align_preserved: 8-byte, except leaf SP
Tag_ABI_enum_size: int
Tag_ABI_VFP_args: VFP registers
Tag_ABI_optimization_goals: Aggressive Debug
Tag_CPU_unaligned_access: v6
What am I doing wrong?
I ended up compiling GCC from source, following this post. I didn't need all of the steps (I compiled everything using GCC 8 instead of compiling GCC 6.3 first, and I didn't edit any source files.)
I posted a Dockerfile with all build steps on GitHub.
The architecture of the executables produced is correct now, but I can't test it on target yet to check if it actually runs.
By default, newer GCC versions do not create correct binaries for ARMv6. Even though you pass the correct -mcpu= flag to gcc, it will create startup code for the newer ARMv7 architecture. Running them on your RasPI Zero will cause an "Illegal Instruction" exception.
this info is mentioned here https://github.com/Pro/raspi-toolchain

g++ enables wrong flags at -Os

at the moment I am doing some experiments with the GNU C++-Compiler and the -Os optimization option for minimal code size. I checked the enabled compiler flags at -Os with the following command:
g++ -c -Q -Os --help=optimizers | grep "enabled"
I got this list of enabled options:
-faggressive-loop-optimizations [enabled]
-falign-functions [enabled]
-falign-jumps [enabled]
-falign-labels [enabled]
-falign-loops [enabled]
-fasynchronous-unwind-tables [enabled]
...
This seems a bit strange, because I also looked up, which flags should be enabled at -Os, here and under the -Os section it is written that all the falign- options should be disabled for code minimization.
Q: So is this a bug or am I doing something wrong here ? Cause after reading what the falign- flags do I really think they should be disabled in -Os !
My gcc-version is 4.9.2 and I am working on Arch-Linux.
Already thanks for helping :)
Q: So is this a bug or am I doing something wrong here ? Cause after reading what the falign- flags do I really think they should be disabled in -Os
I think Hans did a good job of finding part of the problem. Its definitely a documentation bug. But no one from GCC commented on why -Os enabled them, so you might not have all of the information.
Older ARM devices were very intolerant of unaligned accesses. Older arm devices included ARMv4 and I think ARMv5. If you performed an unaligned access, you would get a SIGBUS (been there, done that, got the tee shirt).
Modern ARM devices fix up unaligned accesses like x86 processors do, so you no longer get a SIGBUS. Instead, you just take the performance penalty.
You should try to specify an architecture in case those options are an artifact from older ARM device support. For example, -march=armv7. If you find it on ARMv6 and ARMv7, then that could still be a bug. It depends if the GCC team decided the tradeoff was sufficient for ARM (code size vs performance penalty).

Successful compilation of SSE instruction with qmake (but SSE2 is not recognized)

I'm trying to compile and run my code migrated from Unix to windows. My code is pure C++ and not using Qt classes. it is fine in Unix.
I'm also using Qt creator as an IDE and qmake.exe with -spec win32-g++ for compiling. As I have sse instructions within my code, I have to include emmintrin.h header.
I added:
QMAKE_FLAGS_RELEASE += -O3 -msse4.1 -mssse3 -msse3 -msse2 -msse
QMAKE_CXXFLAGS_RELEASE += -O3 -msse4.1 -mssse3 -msse3 -msse2 -msse
In the .pro file. I have been able to compile my code without errors. but after running it gives run-time error while going through some functions containing __m128 or like that.
When I open emmintrin.h, I see:
#ifndef __SSE2__
# error "SSE2 instruction set not enabled"
#else
and It is undefined after #else.
I don't know how to enable SSE in my computer.
Platform: Windows Vista
System type: 64-bit
Processor: intel(R) Core(TM) i5-2430M CPU # 2.40Hz
Does anyone know the solution?
Thanks in advance.
It sounds like your data is not 16 byte aligned, which is a requirement for SSE loads such as mm_load_ps. You can either:
use _mm_loadu_ps as a temporary workaround. On newer CPUs the performance hit for misaligned loads such as this is fairly small (on older CPUs it's much more significant), but it should still be avoided if possible
or
fix your memory alignment. On Windows/Visual Studio you can use the declspec(align(16)) attribute for static allocations or _aligned_malloc for dynamic allocations. For gcc and most other civilised platforms/compilers use __attribute__ ((align(16))) for the former and posix_memalign for the latter.

What is the -O5 flag for compiling gfortran .f90 files?

I see the flag in the documentation of how to compile some f90 code I have acquired (specifically, mpfi90 -O5 file.f90), but researching the -O5 flag turned up nothing in the gfortran docs, mpfi docs, or anywhere else. I assume it is an optimization flag like -O1, etc., but I'm not sure.
Thanks!
Source: http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp?topic=%2Fcom.ibm.xlf91a.doc%2Fxlfug%2Fhu00509.htm
The flag -O5 is an optimizer like -O3 and -O2. The linked source says,
qnoopt/-O0 Fast compilation, debuggable code, conserved program
semantics.
-O2 (same as -O) Comprehensive low-level optimization; partial debugging support.
-O3 More extensive optimization; some precision trade-offs.
-O4 and -O5 Interprocedural optimization; loop optimization; automatic machine tuning.
With each higher number containing all the optimizations of the lower levels.

What extra optimisation does g++ do with -Ofast?

In g++ 4.6 (or later), what extra optimisations does -Ofast enable other than -ffast-math?
The man page says this option "also enables optimizations that are not valid for all standard compliant programs". Where can I find more information about whether this might affect my program or not?
Here's a command for checking what options are enabled with -Ofast:
$ g++ -c -Q -Ofast --help=optimizers | grep enabled
Since I only have g++ 4.4 that doesn't support -Ofast, I can't show you the output.
The -Ofast options might silently enable the gcc C++ extensions. You should check your sources to see if you make any use of them. In addition, the compiler might turn off some obscure and rarely encountered syntax checking for digraphs and trigraphs (this only improves compiler performance, not the speed of the compiled code).