How to enable SSSE3 intrinsics but disable their use in compiler optimization - c++

I have code that uses SSSE3 intrinsics (note the triple S) and a runtime check for whether to use them, so I assumed the application would also run on CPUs without SSSE3 support.
However, when compiling with -mssse3 and -O1 optimization, the compiler also inserts SSSE3 instructions that I didn't explicitly call, so the program crashes on those CPUs.
Is there a way to enable SSSE3 code where I EXPLICITLY call the relevant intrinsic functions, but stop the compiler from emitting its own SSSE3 code?
Note that I cannot disable the -O1 optimization.

The solution is to NOT compile ALL the program code with the -mssse3 option, and to compile only the portion that actually uses these features with that option. In other words:
// main.cpp
...
if (use_ssse3())
    do_something_ssse3();
else
    do_something_traditional();

// traditional.cpp
void do_something_traditional()
{
    // ... code goes here ...
}

// ssse3.cpp
void do_something_ssse3()
{
    // ... code goes here ...
}
Only "ssse3.cpp" should be compiled with the -mssse3 flag.

If you use gcc, you can just compile the code without using the -mssse3 switch, and pull in the SSSE3 intrinsics with
#define __SSSE3__ 1
#include <tmmintrin.h>
where you need them.
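Note that this macro trick relies on how older GCC headers were laid out. On modern GCC the intrinsic headers can be included unconditionally, but calling an intrinsic from a function that isn't compiled with SSSE3 enabled still fails, so the more robust per-function spelling is a target attribute (a minimal sketch, assuming GCC or Clang):
#include <tmmintrin.h>

// SSSE3 is enabled only inside this function; the rest of the
// translation unit is still compiled for the baseline target.
__attribute__((target("ssse3")))
__m128i shuffle_bytes(__m128i v, __m128i mask)
{
    return _mm_shuffle_epi8(v, mask); // SSSE3 intrinsic
}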

Related

C++ error: intrinsic function was not declared in scope

I want to compile code that uses the intrinsic function _mm256_undefined_si256() (returns a vector of 8 packed doubleword integers). Here is a reduced snippet of the affected function from the header file:
// test.hpp
#include <immintrin.h>

namespace {
    inline __m256i foo(__m256i a, __m256i b) {
        __m256i res = _mm256_undefined_si256();
        // some inline asm stuff
        // __asm__(...);
        return res;
    }
}
Compiling via gcc -march=native -mavx2 -O3 -std=c++11 test.cpp -o app throws the following error: '_mm256_undefined_si256' was not declared in this scope.
I cannot explain why this intrinsic is not declared, since other intrinsics used in the header file work properly.
Your code works in GCC 4.9 and newer (https://godbolt.org/z/bajMsKvK9). GCC 4.9 was released in April 2014, close to a decade ago, and the most recent GCC 4.8 release (4.8.5) was in June 2015. So it's about time to upgrade your compiler!
GCC 4.8 was missing that intrinsic, and didn't even know about -march=sandybridge (let alone tuning options for Haswell, which has AVX2), although it did know about the less meaningful -march=corei7-avx.
It does happen that GCC misses some of the more obscure intrinsics that Intel adds along with support for a new instruction set, so support for _mm256_add_epi32 won't always imply _mm256_undefined_si256().
For example, it took until GCC 11 for them to add _mm_load_si32(void*), an unaligned, aliasing-safe movd (which I think Intel introduced around the same time as the AVX-512 stuff), so that was multiple years late. (And it took until GCC 12 / 11.3 for GCC to implement it correctly, Bug 99754; _mm_load_ss(float*) is still not aliasing-safe, Bug 84508.)
But fortunately, _mm256_undefined_si256 is supported by non-ancient versions of all the mainstream compilers.
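If you really are stuck on an older compiler, a hedged workaround is a small wrapper (the helper name here is hypothetical) that substitutes a defined value, trading one zeroing instruction for portability:
#include <immintrin.h>

#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ == 4 && __GNUC_MINOR__ < 9
// Ancient GCC lacks _mm256_undefined_si256; fall back to a zeroed vector.
static inline __m256i undefined_si256() { return _mm256_setzero_si256(); }
#else
static inline __m256i undefined_si256() { return _mm256_undefined_si256(); }
#endif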

Is there a way to enable the AVX2 instruction set without auto-vectorization by LLVM

Recently I ran into a problem where my AVX2-optimized program may crash on old machines, such as a 2010 Mac, that do not support the AVX2 instruction set. At the same time, I can ensure that all my AVX2 code is surrounded by dynamic instruction detection, so it will not run on an AVX2-free machine.
So I dug into this problem and found that the crash is caused by auto-vectorization performed by LLVM itself. I tried -fno-vectorize and -fno-slp-vectorize, but found that once -mavx2 is set, the program still gets auto-vectorized.
Is there a way to disable auto-vectorization in LLVM with -mavx2 set? Without -mavx2, my handwritten AVX2 code may not compile successfully.
An alternative to passing the -mavx2 flag globally is to use a function attribute that enables AVX2 just on the relevant functions:
void __attribute__((__target__("avx2"))) function_with_avx2(...) {
    // ... AVX2 code here ...
}

void function_without_avx2(...) {
    // ... baseline code here ...
}
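Fleshed out a little (the function and loop below are illustrative, not from the question), this lets a file built without -mavx2 contain AVX2 intrinsics while the rest of the translation unit stays baseline x86-64:
#include <immintrin.h>
#include <cstddef>

__attribute__((__target__("avx2")))
void add_avx2(int* dst, const int* a, const int* b, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i),
                            _mm256_add_epi32(va, vb));
    }
    for (; i < n; ++i)  // scalar tail
        dst[i] = a[i] + b[i];
}
The caller still needs the runtime CPUID check before taking the AVX2 path, exactly as the question describes.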

How to disable fma3 instructions in gcc

I need to disable FMA3 instructions (for a backward-compatibility issue) on a 64-bit system.
I've used _set_FMA3_enable(0) in my Windows environment. What option (or macro) do I need to use to disable FMA3 in gcc?
For example:
#include <iostream>
#include <iomanip>
#include <math.h>
#include <limits>

union value
{
    double real;
    long long unsigned integer;
};

int main()
{
#if defined (_WIN64)
    _set_FMA3_enable(0);
#endif
    value x;
    x.integer = 4602754097807755994;
    value y;
    y.real = sin(x.real);
    std::cout << std::setprecision(17) << y.real << std::endl;
    std::cout << y.integer << std::endl;
}
On Visual C++ it yields 0.48674319998526994 4602440005894221058 with _set_FMA3_enable(0),
and 0.48674319998526999 4602440005894221059 without it (or with _set_FMA3_enable(1)).
I run this code in a gcc environment with g++ -g0 -march=x86-64 -O2 -mtune=generic -msse3 -mno-fma -DNDEBUG main.cpp and always get 0.48674319998526999 4602440005894221059.
How can I reproduce _set_FMA3_enable(0) results with gcc?
Visual Studio 16.7.4; gcc 9.3.0 (under WSL).
How to disable fma3 instructions in gcc
Generation of FMA3 instructions can be disabled with the -mno-fma option. -ffp-contract=off is another option, which should also disable the contraction of separate multiply and add operations into FMA.
How can I reproduce [visual studio] results with gcc?
Apparently, not by disabling FMA instructions. FMA instructions are not the only thing that influences the accuracy of floating-point operations.
You could try not relying on the standard library for sin and instead use the same implementation on both systems. You could also set the rounding mode explicitly so that you don't rely on system defaults. If you do, you should use -frounding-math as well.
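A minimal sketch of pinning the rounding mode (illustrative, not from the question):
#include <cfenv>

int main()
{
    std::fesetround(FE_TONEAREST); // explicit rounding mode, not the system default
    // ... floating-point work here; build with -frounding-math on gcc ...
}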
It would be simpler if you can instead not rely on getting exactly the same results.
P.S. The behaviour of the shown program is undefined in C++ because it reads from an inactive union member.
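A well-defined alternative, as a sketch: copy the bytes with std::memcpy instead of reading the inactive member.
#include <cstring>

double bits_to_double(unsigned long long bits)
{
    double d;
    std::memcpy(&d, &bits, sizeof d); // defined behavior; optimizes to a plain move
    return d;
}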

g++: optimization -march=haswell and newer changes numerical result

I have been working on performance optimization and, of course, running regression tests, when I noticed that g++ seems to alter results depending on the chosen architecture. So far I thought that -O2 -march=[whatever] should yield exactly the same results for numerical computations regardless of which architecture is chosen. However, this seems not to be the case for g++. While old architectures up to Ivy Bridge yield the same results as clang does for any architecture, I get different results from gcc for Haswell and newer. Is this a bug in gcc, or did I misunderstand something about optimizations? I am really startled because clang does not show this behavior.
Note that I am well aware that the differences are within machine precision, but they still disturb my simple regression checks.
Here is some example code:
#include <iostream>
#include <armadillo>
int main(){
    arma::arma_rng::set_seed(3);
    arma::sp_cx_mat A = arma::sprandn<arma::sp_cx_mat>(20, 20, 0.1);
    arma::sp_cx_mat B = A + A.t();
    arma::cx_vec eig;
    arma::eigs_gen(eig, B, 1, "lm", 0.001);
    std::cout << "eigenvalue: " << eig << std::endl;
}
Compiled using:
g++ -march=[architecture] -std=c++14 -O2 -o test example.cpp -larmadillo
gcc version: 6.2.1
clang version: 3.8.0
Compiled for 64 bit, executed on an Intel Skylake processor.
It is because GCC uses fused multiply-add (FMA) instructions by default if they are available. Clang, on the contrary, doesn't use them by default, even if they are available.
The result of a*b + c can differ depending on whether FMA is used, because an FMA rounds once while a separate multiply and add round twice; that's why you get different results when you use -march=haswell (Haswell is the first Intel CPU which supports FMA).
You can decide whether to use this feature with -ffp-contract=XXX:
-ffp-contract=off: you won't get FMA instructions.
-ffp-contract=on: you get FMA instructions, but only where contraction is allowed by the language standard. In the current version of GCC this means off, because standard-conforming contraction is not implemented yet.
-ffp-contract=fast (the GCC default): you'll get FMA instructions.
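A tiny demonstration of the single versus double rounding (illustrative, not from the question); on a typical x86-64 build the two lines print different values:
#include <cfloat>
#include <cmath>
#include <cstdio>

int main()
{
    volatile double va = 1.0 + DBL_EPSILON; // 1 + 2^-52 (volatile blocks constant folding)
    volatile double vb = 1.0 - DBL_EPSILON; // 1 - 2^-52
    double a = va, b = vb, c = -1.0;

    // Exact product a*b = 1 - 2^-104, which rounds to 1.0 in double.
    std::printf("%g\n", std::fma(a, b, c)); // single rounding: -2^-104 (~ -4.9e-32)
    std::printf("%g\n", a * b + c);         // two roundings: 0 -- unless
                                            // -ffp-contract=fast contracts this
                                            // line into an FMA as well.
}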

Apple LLVM 5.0 pragma optimize

What is the equivalent of GCC's #pragma GCC optimize("O0") or VS's #pragma optimize("", off) in Apple LLVM 5.0 compiler?
I need it to disable optimizations for just a section of code.
From a brief search, it doesn't look like clang/llvm supports such a pragma at this time. If you don't want to turn off optimizations for an entire file, I suggest factoring what you don't want optimized into a separate file and setting -O0 on it separately.
Actually, there is now a way to do this: add __attribute__((optnone)) to the function that wraps the code you don't want optimized.
For instance, I'm using it to get a clean benchmark of an inline function:
static void __attribute__((optnone)) BM_notoptimizedfunction(benchmark::State& state) {
    // your code here won't be optimized by clang
}
And that's it!
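Recent Clang also accepts the C++11 attribute spelling, which sidesteps GNU attribute placement quirks (a small sketch; the function is illustrative):
[[clang::optnone]] void not_optimized()
{
    // clang compiles this function as if at -O0, regardless of the global -O level
}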