How to disable fma3 instructions in gcc - c++

I need to disable FMA3 instructions (for a backward-compatibility issue) on a 64-bit system.
I've used _set_FMA3_enable(0) in my Windows environment. What option (or macro) do I need to use to disable FMA3 in gcc?
For example:
#include <iostream>
#include <iomanip>
#include <math.h>
#include <limits>
union value
{
    double real;
    long long unsigned integer;
};

int main()
{
#if defined (_WIN64)
    _set_FMA3_enable(0);
#endif
    value x;
    x.integer = 4602754097807755994;
    value y;
    y.real = sin(x.real);
    std::cout << std::setprecision(17) << y.real << std::endl;
    std::cout << y.integer << std::endl;
}
On Visual C++ it outputs 0.48674319998526994 4602440005894221058 with _set_FMA3_enable(0),
and 0.48674319998526999 4602440005894221059 without it (or with _set_FMA3_enable(1)).
I run this code in gcc environment with g++ -g0 -march=x86-64 -O2 -mtune=generic -msse3 -mno-fma -DNDEBUG main.cpp and always get 0.48674319998526999 4602440005894221059.
How can I reproduce _set_FMA3_enable(0) results with gcc?
Visual Studio 16.7.4, gcc version 9.3.0 (under WSL).

How to disable fma3 instructions in gcc
Generation of FMA3 instructions can be disabled with the -mno-fma option. -ffp-contract=off is another option, which should also disable contraction of separate multiply and add operations into FMA.
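For the command line shown in the question, that would mean something like (same flags, plus -ffp-contract=off):

g++ -g0 -march=x86-64 -O2 -mtune=generic -msse3 -mno-fma -ffp-contract=off -DNDEBUG main.cpp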
How can I reproduce [visual studio] results with gcc?
Apparently, not by disabling FMA instructions. FMA instructions are not the only thing that influences the accuracy of floating-point operations.
You could try not relying on the standard library for sin and instead use the same implementation on both systems. You could also set the rounding mode explicitly so that you don't rely on system defaults. If you do, then you should use -frounding-math as well.
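A minimal sketch of making the rounding mode explicit (assuming <cfenv> is available; build with -frounding-math so the compiler does not assume the default mode when optimizing; the sin call here is only a placeholder):

#include <cfenv>
#include <cmath>
#include <iostream>

int main()
{
    std::fesetround(FE_TONEAREST);            // state the rounding mode instead of relying on the system default
    std::cout.precision(17);
    std::cout << std::sin(0.5) << std::endl;  // placeholder computation
}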
It would be simpler if you could instead avoid relying on getting exactly the same results.
P.S. The behaviour of the shown program is undefined in C++ because of the read from an inactive union member.
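For reference, a version of the bit-pattern round trip with defined behaviour (a sketch using std::memcpy; the helper names bits_to_double and double_to_bits are mine):

#include <cstring>
#include <cstdint>

// Reinterpret a 64-bit pattern as a double without reading an inactive union member.
double bits_to_double(std::uint64_t bits)
{
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

std::uint64_t double_to_bits(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}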

Related

g++: optimization -march=haswell and newer changes numerical result

I have been working on optimizing performance and, of course, doing regression tests when I noticed that g++ seems to alter results depending on the chosen optimization. So far I thought that -O2 -march=[whatever] should yield exactly the same results for numerical computations regardless of which architecture is chosen. However, this seems not to be the case for g++. While using old architectures up to ivybridge yields the same results as clang does for any architecture, I get different results for gcc with haswell and newer. Is this a bug in gcc, or did I misunderstand something about optimizations? I am really startled because clang does not seem to show this behavior.
Note that I am well aware that the differences are within machine precision, but they still disturb my simple regression checks.
Here is some example code:
#include <iostream>
#include <armadillo>
int main(){
    arma::arma_rng::set_seed(3);
    arma::sp_cx_mat A = arma::sprandn<arma::sp_cx_mat>(20, 20, 0.1);
    arma::sp_cx_mat B = A + A.t();
    arma::cx_vec eig;
    arma::eigs_gen(eig, B, 1, "lm", 0.001);
    std::cout << "eigenvalue: " << eig << std::endl;
}
Compiled using:
g++ -march=[architecture] -std=c++14 -O2 -o test example.cpp -larmadillo
gcc version: 6.2.1
clang version: 3.8.0
Compiled for 64 bit, executed on an Intel Skylake processor.
It is because GCC uses fused multiply-add (FMA) instructions by default, if they are available. Clang, on the contrary, doesn't use them by default, even if they are available.
The result of a*b+c can differ depending on whether FMA is used or not; that's why you get different results when you use -march=haswell (Haswell is the first Intel CPU which supports FMA).
You can decide whether you want to use this feature with -ffp-contract=XXX:
With -ffp-contract=off, you won't get FMA instructions.
With -ffp-contract=on, you get FMA instructions only where contraction is allowed by the language standard. In the current version of GCC this means off (because that mode is not implemented yet).
With -ffp-contract=fast (the GCC default), you'll get FMA instructions.
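To see what contraction changes, here is a small sketch (the value is chosen only for illustration): std::fma rounds the exact a*b+c once, whereas a separate multiply and add round twice. Built with -ffp-contract=off the two printed values differ; with the default -ffp-contract=fast GCC may contract the first expression into an FMA as well, and both lines then match.

#include <cmath>
#include <cstdio>

int main()
{
    double x = 1.0 + std::ldexp(1.0, -27);  // 1 + 2^-27, exactly representable
    double separate = x * x - 1.0;          // candidate for contraction, depending on -ffp-contract
    double fused    = std::fma(x, x, -1.0); // always a single rounding of the exact product
    std::printf("%.17g\n%.17g\n", separate, fused);
}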

is boost::random::uniform_real_distribution supposed to be the same across processors?

The following code produces different output on x86 32bit vs 64bit processors.
Is it supposed to be this way? If I replace it with std::uniform_real_distribution and compile with -std=c++11 it produces the same output on both processors.
#include <iostream>
#include <limits>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real_distribution.hpp>

int main()
{
    boost::mt19937 gen;
    gen.seed(4294653137UL);

    std::cout.precision(1000);
    double lo = -std::numeric_limits<double>::max() / 2;
    double hi = +std::numeric_limits<double>::max() / 2;
    boost::random::uniform_real_distribution<double> boost_distrib(lo, hi);

    std::cout << "lo " << lo << '\n';
    std::cout << "hi " << hi << "\n\n";
    std::cout << "boost distrib gen " << boost_distrib(gen) << '\n';
}
BTW, you could have written boost::mt19937 gen(4294653137UL); to avoid seeding with the default seed (5489) in the default constructor. Your code has to loop over all 624 uint32_t elements of the generator's internal state twice.
The generator is always fine, and works the same on any machine. The difference only comes from using floating-point to map it to a uniform_real_distribution.
g++ -m32 -msse2 -mfpmath=sse produces identical output to all the other compilers. 32-bit vs 64-bit is different because 64-bit uses SSE for float math, so double temporaries are always 64-bit. 32-bit x86 defaults to the legacy x87 FPU, where everything is 80-bit internally and only rounded down to 64-bit double when storing to memory.
Note that bit-identical FP results in general are NOT guaranteed with different compilers, even on the same platform.
32-bit clang still uses SSE math by default, so it gets identical results to 64-bit clang or 64-bit g++. Telling g++ to do the same solves the problem. -mfpmath=sse tells it to do calculations with SSE (although it doesn't change the ABI, so floating-point return values are still in x87 st(0)). -msse2 tells g++ to assume the target machine supports SSE and SSE2. (SSE2 added double precision to SSE's single precision. SSE2 is baseline in the x86-64 architecture and is used to pass/return FP args in the 64-bit ABI.)
Without SSE, you could (but don't) use -ffloat-store to precisely follow the C standard and round intermediate results to 32 or 64 bits by storing and re-loading them. This adds about 6 cycles of latency to every FP math instruction (compared to 3-cycle FP add, 5-cycle FP mul on Intel Haswell). So don't do this; you'll get horrible code.
debugging steps:
I tried it out on Ubuntu 15.10, with g++ 5.2, clang-3.5, and clang-3.8 (from http://llvm.org/apt/).
for i in ./boost-random-seedint*; do echo -ne "$i:\t" ; $i|md5sum ;done
./boost-random-seedint-g++32: 53d99523ca2afeac428eae2c89e69974 -
./boost-random-seedint-g++64: a59f08c0bc22b8753c474db077b809bd -
./boost-random-seedint-clang3.5-32: a59f08c0bc22b8753c474db077b809bd -
./boost-random-seedint-clang3.5-64: a59f08c0bc22b8753c474db077b809bd -
./boost-random-seedint-clang3.8-32: a59f08c0bc22b8753c474db077b809bd -
./boost-random-seedint-clang3.8-64: a59f08c0bc22b8753c474db077b809bd -
So the only outlier is 32-bit g++; all the other outputs have the same hash.
Compiler options:
clang++-3.8 -m32 -O1 -g boost-random-seedint.cpp -o boost-random-seedint-clang3.8-32 # and similar
g++ -m32 -Og -g boost-random-seedint.cpp -o boost-random-seedint32
clang doesn't have -Og. 32-bit g++ with -O0 and -O3 makes binaries that give the same output as the one from -Og.
Debugging the 32 and 64bit binaries: their state arrays are identical after the default seed and after the call to gen.seed(4294653137UL).
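If you want to convince yourself that the generator itself is bit-identical across builds, a small sketch is to print a few raw 32-bit outputs, which involve no floating point at all:

#include <iostream>
#include <boost/random/mersenne_twister.hpp>

int main()
{
    boost::mt19937 gen;
    gen.seed(4294653137UL);
    // Raw integer output: any difference here would implicate the generator,
    // not the floating-point mapping done by uniform_real_distribution.
    for (int i = 0; i < 4; ++i)
        std::cout << gen() << '\n';
}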

Why is Eigen's Cholesky decomposition much faster on Linux than on Windows?

I've noticed a significant performance difference regarding Cholesky decomposition using the Eigen library.
I'm using the latest version of Eigen (3.2.1) with the following benchmark code:
#include <iostream>
#include <chrono>
#include <Eigen/Core>
#include <Eigen/Cholesky>
using namespace std;
using namespace std::chrono;
using namespace Eigen;
int main()
{
    const MatrixXd::Index size = 4200;
    MatrixXd m = MatrixXd::Random(size, size);
    m = (m + m.transpose()) / 2.0 + 10000 * MatrixXd::Identity(size, size);

    LLT<MatrixXd> llt;
    auto start = high_resolution_clock::now();
    llt.compute(m);
    if (llt.info() != Success)
        cout << "Cholesky decomposition failed!" << endl;
    auto stop = high_resolution_clock::now();

    cout << "Cholesky decomposition in "
         << duration_cast<milliseconds>(stop - start).count()
         << " ms." << endl;

    return 0;
}
I compile this benchmark with g++ -std=c++11 -Wall -O3 -o bench bench.cc and run it on Windows the first time (using MinGW, [edit: GCC 4.8.1]) and on Linux (edit: GCC 4.8.1) the second time, but both times on the same machine.
On Windows, it gives me:
Cholesky decomposition in 10114 ms.
But on Linux I get:
Cholesky decomposition in 3258 ms.
That is less than a third of the time needed on Windows.
Is there something available on Linux systems that Eigen uses to achieve this speed-up?
And if so, how may I accomplish the same on Windows?
Make sure you are using a 64-bit system. If that's not the case, then don't forget to enable SSE2 instructions (-msse2), but performance still won't be as good as on a 64-bit system because fewer SSE registers are available.
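To check what Eigen actually vectorizes with in a given build, a quick sketch (assuming Eigen 3.x, which provides Eigen::SimdInstructionSetsInUse()) is:

#include <iostream>
#include <Eigen/Core>

int main()
{
    // Reports the SIMD instruction sets Eigen was compiled to use,
    // e.g. "SSE, SSE2"; "None" means vectorization is disabled.
    std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
}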
See Eigen's main page here.
Quote
Eigen is being successfully used with the following compilers:
GCC, version 4.1 and newer. Very good performance with GCC 4.2 and newer.
MSVC (Visual Studio), 2008 and newer (the old 2.x versions of Eigen support MSVC 2005, but without vectorization).
Intel C++ compiler. Very good performance.
LLVM/CLang++ (2.8 and newer).
MinGW, recent versions. Very good performance when using GCC 4.
QNX's QCC compiler.
Argument
You may have a more recent version of gcc (>=4.2) than your version of MinGW uses...
Note
Just as a side-note, you may even have a MinGW version that is not "recent", as the link also says:
Eigen is standard C++98 and so should theoretically be compatible with
any compliant compiler.
Whenever we use some non-standard feature, that is optional and can be
disabled.
So maybe your version of gcc uses a new optimizing feature that MinGW doesn't possess, and MinGW falls back to another, slower alternative.
Of course, in the end, it could be a completely different thing; this is an educated guess based on theory...

How to enable SSSE3 intrinsics but disable their use in compiler optimization

I have code that uses SSSE3 intrinsics (note the triple S) and a runtime check of whether to use them, so I assumed that the application should also execute on CPUs without SSSE3 support.
However, when using -mssse3 with -O1 optimization, the compiler also inserts SSSE3 instructions which I didn't explicitly call, hence the program crashes.
Is there a way to enable SSSE3 code where I EXPLICITLY call the relevant intrinsic functions, but stop the compiler from adding its own SSSE3 code?
Note that I cannot disable the -O1 optimization.
The solution to this problem is NOT to compile ALL the program code with the -mssse3 option, but to compile only the portion that actually uses these features with that option. In other words:
// main.cpp
...
if (use_ssse3())
    do_something_ssse3();
else
    do_something_traditional();

// traditional.cpp:
void do_something_traditional()
{
    ... code goes here ...
}

// ssse3.cpp:
void do_something_ssse3()
{
    ... code goes here ...
}
Only "ssse3.cpp" should be compiled with the -mssse3 flag.
If you use gcc, you can just compile the code without using the -mssse3 switch, and pull in the SSSE3 intrinsics with
#define __SSSE3__ 1
#include <tmmintrin.h>
where you need them.

g++ optimization options affect the value of sin function

I have a problem with the "sin" function of libc.
#include <cmath>
#include <stdio.h>
int main(int argc, char **argv)
{
    double tt = 6.28318530717958620000; // 2 * M_PI
    double yy = ::sin(tt);
    printf("%.32f\n", yy);
    return 0;
}
When I compile the above code using g++ without any optimization option, it outputs "-0.00000000000000024492127076447545".
But with the "-O3" option, it outputs "-0.00000000000000024492935982947064".
Why doesn't it return "-0.00000000000000024492935982947064" without "-O3"?
Thanks in advance.
Because with "-O3" the compiler precomputes sin(2*pi) at compile time, with one algorithm. Without "-O3" this is computed at runtime, with other algorithm.
This may be because compiler itself was built with some math library, which differ from your math library.
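One way to make the optimized and unoptimized builds agree (a sketch: marking the argument volatile so the compiler cannot fold the sin call at compile time, forcing the run-time libm in both cases):

#include <cmath>
#include <stdio.h>

int main(int argc, char **argv)
{
    volatile double tt = 6.28318530717958620000; // 2 * M_PI; volatile blocks compile-time constant folding
    double yy = ::sin(tt);                       // evaluated at run time even with -O3
    printf("%.32f\n", yy);
    return 0;
}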
Update
The only entity giving the result "-0.00000000000000024492127076447545" is the 32-bit version of libstdc++. The 64-bit version of the same library, as well as gcc itself, produces "-0.00000000000000024492935982947064".
So upgrading to a newer version will not help. I also tried various options proposed here: neither -ffloat-store nor -fno-builtin makes any difference, and neither does using long double and sinl.
32-bit libstdc++ uses 387 floating-point instructions, while gcc apparently uses SSE instructions. That is the difference. Probably, the only way to make them consistent is to rebuild gcc from source, directing it to use only 387 instructions internally.