Successful compilation of SSE instruction with qmake (but SSE2 is not recognized) - c++

I'm trying to compile and run code that I migrated from Unix to Windows. The code is pure C++ and does not use any Qt classes. It works fine on Unix.
I'm also using Qt Creator as an IDE and qmake.exe with -spec win32-g++ for compiling. As I have SSE instructions in my code, I have to include the emmintrin.h header.
I added:
QMAKE_FLAGS_RELEASE += -O3 -msse4.1 -mssse3 -msse3 -msse2 -msse
QMAKE_CXXFLAGS_RELEASE += -O3 -msse4.1 -mssse3 -msse3 -msse2 -msse
to the .pro file. The code compiles without errors, but at run time it fails when it reaches functions that use __m128 or similar types.
When I open emmintrin.h, I see:
#ifndef __SSE2__
# error "SSE2 instruction set not enabled"
#else
and it is undefined after the #else.
I don't know how to enable SSE on my computer.
Platform: Windows Vista
System type: 64-bit
Processor: Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz
Does anyone know the solution?
Thanks in advance.

It sounds like your data is not 16-byte aligned, which is a requirement for aligned SSE loads such as _mm_load_ps. You can either:
use _mm_loadu_ps as a temporary workaround. On newer CPUs the performance hit for misaligned loads such as this is fairly small (on older CPUs it's much more significant), but it should still be avoided if possible
or
fix your memory alignment. On Windows/Visual Studio you can use the __declspec(align(16)) attribute for static allocations or _aligned_malloc for dynamic allocations. For gcc and most other civilised platforms/compilers use __attribute__ ((aligned(16))) for the former and posix_memalign for the latter.
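The two options can be put side by side in a minimal sketch. The helper names (is_aligned16, sum4_aligned, sum4_unaligned, Vec4) are hypothetical, and C++11 alignas is used here as a portable stand-in for the compiler-specific attributes mentioned above, which is an assumption about what the toolchain supports. The intrinsics are guarded so the file still builds on targets without SSE.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_load_ps, _mm_loadu_ps, _mm_storeu_ps
#define SKETCH_HAS_SSE 1
#endif

// True when p lies on a 16-byte boundary, as _mm_load_ps requires.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Static storage with a guaranteed 16-byte alignment (C++11 alignas).
struct alignas(16) Vec4 {
    float v[4];
};

#ifdef SKETCH_HAS_SSE
// Adds the four lanes of a register; scalar code keeps the sketch short.
inline float lanes_sum(__m128 r) {
    float out[4];
    _mm_storeu_ps(out, r);
    return out[0] + out[1] + out[2] + out[3];
}
#endif

// Aligned load: the address must be 16-byte aligned.
// The assert catches violations in debug builds before the load runs.
float sum4_aligned(const float* p) {
    assert(is_aligned16(p));
#ifdef SKETCH_HAS_SSE
    return lanes_sum(_mm_load_ps(p));
#else
    return p[0] + p[1] + p[2] + p[3];
#endif
}

// Unaligned load: accepts any address, possibly at some cost on older CPUs.
float sum4_unaligned(const float* p) {
#ifdef SKETCH_HAS_SSE
    return lanes_sum(_mm_loadu_ps(p));
#else
    return p[0] + p[1] + p[2] + p[3];
#endif
}
```

Calling sum4_unaligned on buf + 1 of an aligned float array is safe, while passing the same pointer to sum4_aligned trips the assert, which is the situation the answer describes.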

Related

Cross compile C++ for ARM64/x86_64, using clang, with core2-duo enabled

OK, so I am new to cross compilation.
I am writing some shell-scripts to compile some C++ files, on my Mac.
I want to build a "Fat universal binary", so I want this to work for Arm64 and x86_64.
After a lot of searching, I found using: --arch arm64 --arch x86_64 will solve my problem of cross compilation.
However, my old "optimisation flags" conflict with this. I used them to make my code run faster, for a computer game I was making. Well... these are the flags:
-march=core2 -mfpmath=sse -mmmx -msse -msse2 -msse4.1 -msse4.2
Unfortunately... clang can't figure out that I mean these flags to apply to the Intel build only. I get this error message:
clang: error: the clang compiler does not support '-march=core2'
If I remove --arch arm64 --arch x86_64 the code compiles again.
I tried various things like --arch x86_64+sse4 which seem allowed by the gcc documentation, but clang does not recognise them.
As far as I know, gcc/clang do not compile sse4.2 instructions by default. Despite the CPUs being released about 17 years ago. This is quite a bad assumption I think.
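Independently of how the flags end up being passed, the source itself can be written so that each slice of a universal binary only sees code valid for its architecture. The sketch below is one way to do that, assuming the compiler's predefined macros __x86_64__, __SSE4_1__ and __aarch64__; the names mul4 and mul4_path are hypothetical. The clang command-line reference also lists an -Xarch_<arch> option for passing an argument to a single architecture; whether it accepts -march=core2 on a given toolchain is something to verify rather than assume.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) && defined(__SSE4_1__)
#include <smmintrin.h>  // SSE4.1: _mm_mullo_epi32
#define MUL4_PATH "x86_64 SSE4.1"
#elif defined(__aarch64__)
#include <arm_neon.h>   // NEON: vmulq_s32
#define MUL4_PATH "arm64 NEON"
#else
#define MUL4_PATH "scalar"
#endif

// Multiplies four 32-bit integers element-wise.
// Each architecture slice compiles only the branch that is valid for it.
void mul4(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
#if defined(__x86_64__) && defined(__SSE4_1__)
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_mullo_epi32(va, vb));
#elif defined(__aarch64__)
    vst1q_s32(out, vmulq_s32(vld1q_s32(a), vld1q_s32(b)));
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] * b[i];
#endif
}

// Reports which branch was compiled in, useful when checking each slice.
const char* mul4_path() { return MUL4_PATH; }
```

On an x86_64 build without SSE4.1 enabled, the scalar branch is selected, so the same file compiles for both slices whatever the flags turn out to be.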

Performance comparison issue between OpenMPI and Intel MPI

I am working with a C++ MPI code that takes 1 min 12 s when compiled with OpenMPI and 16 s with Intel MPI (I have tested it on other inputs too and the difference is similar; both builds give the correct answer). I want to understand why there is such a big difference in run time, and what can be done to decrease the run time with OpenMPI (GCC).
I am using CentOS 6 on an Intel Haswell processor.
I am using the following flags for compiling.
OpenMPI (GCC): mpiCC -Wall -O3
I have also tried -march=native -funroll-loops. It does not make a great difference. I have also tried -lm option. I cannot compile for 32 bit.
Intel MPI: mpiicpc -Wall -O3 -xhost
-xhost saves 3 seconds in run time.

Code size is doubled when compiling with GCC ARM Embedded?

I've just ported a STM32 microcontroller project from Keil uVision (using Keil ARM Compiler) to CooCox CoIDE (using GCC ARM Embedded compiler).
The problem is that the code size doubles when compiled in CoIDE with GCC, compared to Keil uVision.
How can this be? What can I do?
Code size in Keil: 54632b (.text)
Code size in CoIDE: 100844b (.text)
GCC compiler flags:
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -g2 -Wl,-Map=project.map -Os
-Wl,--gc-sections -Wl,-TC:\arm-gcc-link.ld -g -o project.elf -L -lm
I suspect that CoIDE and GCC compile a lot of functions and files that are present in the project but aren't used (yet). Is it possible that whole files are compiled in even if I only use 1 function out of 20 in them (even though I have -Os)?
It is hard to say which files are really compiled and linked into your final binary from the information you give. I suppose it takes all the C files it finds in your project if you did not explicitly specify which ones to compile, or if you don't use your own Makefile.
But judging from the compiler options you give, the linker flag --gc-sections won't collect much garbage unless you also pass the compiler flags -ffunction-sections -fdata-sections. The linker can only discard whole sections, and without these flags all the functions of a translation unit share a single .text section. Try adding those options so that unused functions and data are stripped at link time.
Since the question was tagged with C++, I wonder if you would like to disable exceptions and RTTI. Those take quite a bit of code. Add -fno-exceptions -fno-rtti to the compiler flags.
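As a small illustration of what those flags act on, here is a sketch of a translation unit with one used and two unused symbols. The file name sections_demo.cpp and all function names are hypothetical; the flags in the comment are the ones named above combined with those from the question.

```cpp
/*
 * Build sketch (file name is an assumption):
 *   arm-none-eabi-g++ -mcpu=cortex-m3 -mthumb -Os
 *       -ffunction-sections -fdata-sections -fno-exceptions -fno-rtti
 *       -c sections_demo.cpp -o sections_demo.o
 * and at link time:
 *   -Wl,--gc-sections -Wl,-Map=project.map
 * The map file then lists which input sections were discarded.
 */
#include <cassert>
#include <cstdint>

// Referenced by the program, so its own .text.* section is kept.
std::uint32_t checksum(const std::uint8_t* data, unsigned len) {
    assert(data != nullptr || len == 0);
    std::uint32_t sum = 0;
    for (unsigned i = 0; i < len; ++i) sum += data[i];
    return sum;
}

// Never referenced. With -ffunction-sections it gets a section of its own,
// which --gc-sections is then free to drop.
std::uint32_t unused_scale(std::uint32_t x) { return x * 3u; }

// Unreferenced data is handled the same way by -fdata-sections.
std::uint8_t unused_table[256];
```

Without -ffunction-sections, checksum and unused_scale would share one .text section, and keeping the first would keep the second as well.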

Is sse2 enabled by default in g++?

When I run g++ -Q --help=target, I get
-msse2 [disabled].
However, if I create the assembly code with default options as
g++ -g mycode.cpp -o mycode.o; objdump -S mycode.o > default,
and a sse2 version with
g++ -g -msse2 mycode.cpp -o mycode.sse2.o; objdump -S mycode.sse2.o > sse2,
and finally a non-sse2 version with
g++ -g -mno-sse2 mycode.cpp -o mycode.nosse2.o; objdump -S mycode.nosse2.o > nosse2
I see basically no difference between default and sse2, but a big difference between default and nosse2, so this tells me that, by default, g++ is using sse2 instructions, even though I am being told it is disabled ... what is going on here?
I am compiling on a Xeon E5-2680 under Linux with gcc-4.4.7 if it matters.
If you are compiling for 64-bit, then this is totally fine and documented behavior.
As stated in the gcc docs, the SSE instruction set is enabled by default when using an x86-64 compiler:
-mfpmath=unit
Generate floating point arithmetics for selected unit unit. The choices for unit are:
`387'
Use the standard 387 floating point coprocessor present majority of chips and emulated otherwise. Code compiled with this option will run almost everywhere. The temporary results are computed in 80bit precision instead of precision specified by the type resulting in slightly different results compared to most of other chips. See -ffloat-store for more detailed description.
This is the default choice for i386 compiler.
`sse'
Use scalar floating point instructions present in the SSE instruction set. This instruction set is supported by Pentium3 and newer chips, in the AMD line by Athlon-4, Athlon-xp and Athlon-mp chips. The earlier version of SSE instruction set supports only single precision arithmetics, thus the double and extended precision arithmetics is still done using 387. Later version, present only in Pentium4 and the future AMD x86-64 chips supports double precision arithmetics too.
For the i386 compiler, you need to use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For the x86-64 compiler, these extensions are enabled by default.
The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80bit.
This is the default choice for the x86-64 compiler.
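One way to see what the compiler has actually enabled, rather than what an option listing reports, is to test the predefined macros from inside a program. The sketch below assumes the GCC/clang macro names __x86_64__, __SSE__, __SSE2__ and so on; the identifiers kTargetX86_64, kSse2Enabled, sse_report and check_x86_64_default are hypothetical.

```cpp
#include <cassert>
#include <string>

#if defined(__x86_64__)
constexpr bool kTargetX86_64 = true;
#else
constexpr bool kTargetX86_64 = false;
#endif

#if defined(__SSE2__)
constexpr bool kSse2Enabled = true;
#else
constexpr bool kSse2Enabled = false;
#endif

// Lists the SSE-family macros the compiler predefined for this build.
std::string sse_report() {
    std::string r = kTargetX86_64 ? "x86-64:" : "other:";
#ifdef __SSE__
    r += " SSE";
#endif
#ifdef __SSE2__
    r += " SSE2";
#endif
#ifdef __SSE3__
    r += " SSE3";
#endif
#ifdef __SSSE3__
    r += " SSSE3";
#endif
#ifdef __SSE4_1__
    r += " SSE4.1";
#endif
#ifdef __SSE4_2__
    r += " SSE4.2";
#endif
    return r;
}

// Restates the documented default: an x86-64 target implies SSE2.
// This fires if SSE2 was switched off explicitly, e.g. with -mno-sse2.
void check_x86_64_default() {
    assert(!kTargetX86_64 || kSse2Enabled);
}
```

The same information is available without writing a program by dumping the predefined macros, for example with g++ -dM -E -x c++ /dev/null and filtering for SSE.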

g++ enables wrong flags at -Os

At the moment I am doing some experiments with the GNU C++ compiler and the -Os optimization option for minimal code size. I checked the enabled compiler flags at -Os with the following command:
g++ -c -Q -Os --help=optimizers | grep "enabled"
I got this list of enabled options:
-faggressive-loop-optimizations [enabled]
-falign-functions [enabled]
-falign-jumps [enabled]
-falign-labels [enabled]
-falign-loops [enabled]
-fasynchronous-unwind-tables [enabled]
...
This seems a bit strange, because I also looked up which flags should be enabled at -Os here, and the -Os section says that all the -falign- options should be disabled for code minimization.
Q: So is this a bug or am I doing something wrong here? Because after reading what the -falign- flags do, I really think they should be disabled at -Os!
My gcc version is 4.9.2 and I am working on Arch Linux.
Thanks in advance for helping :)
Q: So is this a bug or am I doing something wrong here ? Cause after reading what the falign- flags do I really think they should be disabled in -Os
I think Hans did a good job of finding part of the problem. It's definitely a documentation bug. But no one from GCC commented on why -Os enabled them, so you might not have all of the information.
Older ARM devices were very intolerant of unaligned accesses. Those include ARMv4 and, I think, ARMv5. If you performed an unaligned access, you would get a SIGBUS (been there, done that, got the tee shirt).
Modern ARM devices fix up unaligned accesses like x86 processors do, so you no longer get a SIGBUS. Instead, you just take the performance penalty.
You should try to specify an architecture in case those options are an artifact from older ARM device support. For example, -march=armv7. If you find it on ARMv6 and ARMv7, then that could still be a bug. It depends if the GCC team decided the tradeoff was sufficient for ARM (code size vs performance penalty).
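For the data side of the unaligned-access problem described above, source code can avoid depending on the hardware's behaviour at all. A minimal sketch, with hypothetical helper names load_u32 and store_u32: std::memcpy has no alignment requirement, so the compiler is left to emit whatever access sequence is safe for the target.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads a 32-bit value from any address, aligned or not.
// Casting p to std::uint32_t* and dereferencing it instead would be
// an unaligned access whenever p is not a multiple of 4.
std::uint32_t load_u32(const unsigned char* p) {
    assert(p != nullptr);
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Writes a 32-bit value to any address, aligned or not.
void store_u32(unsigned char* p, std::uint32_t v) {
    assert(p != nullptr);
    std::memcpy(p, &v, sizeof v);
}
```

This concerns data accesses only; the -falign- options in the question control where functions, jumps, labels and loops are placed in the generated code.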