g++: optimization -march=haswell and newer changes numerical result - c++

I have been working on performance optimization and, of course, running regression tests, when I noticed that g++ seems to alter numerical results depending on the chosen architecture. So far I had assumed that -O2 -march=[whatever] should yield exactly the same results for numerical computations regardless of which architecture is chosen. However, this seems not to be the case for g++. While older architectures up to ivybridge yield the same results as clang does for any architecture, I get different results with gcc for haswell and newer. Is this a bug in gcc, or did I misunderstand something about optimizations? I am really startled because clang does not show this behavior.
Note that I am well aware that the differences are within machine precision, but they still disturb my simple regression checks.
Here is some example code:
#include <iostream>
#include <armadillo>
int main() {
    arma::arma_rng::set_seed(3);
    arma::sp_cx_mat A = arma::sprandn<arma::sp_cx_mat>(20, 20, 0.1);
    arma::sp_cx_mat B = A + A.t();
    arma::cx_vec eig;
    arma::eigs_gen(eig, B, 1, "lm", 0.001);
    std::cout << "eigenvalue: " << eig << std::endl;
}
Compiled using:
g++ -march=[architecture] -std=c++14 -O2 -o test example.cpp -larmadillo
gcc version: 6.2.1
clang version: 3.8.0
Compiled for 64 bit, executed on an Intel Skylake processor.

It is because GCC uses fused multiply-add (FMA) instructions by default if they are available. Clang, on the contrary, doesn't use them by default, even if they are available.
The result of a*b+c can differ depending on whether an FMA is used or not; that's why you get different results when you use -march=haswell (Haswell is the first Intel CPU which supports FMA).
You can decide whether you want to use this feature with -ffp-contract=XXX:
-ffp-contract=off: you won't get FMA instructions.
-ffp-contract=on: you get FMA instructions, but only where contraction is allowed by the language standard. In the current version of GCC, this effectively means off (because it is not implemented yet).
-ffp-contract=fast (the GCC default): you'll get FMA instructions.
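For illustration, here is a minimal sketch of my own (not from the original question) that makes the difference visible: a*b+c computed as a separate multiply and add rounds twice, while std::fma rounds once. The volatile qualifiers are only there to stop constant folding.

#include <cmath>
#include <cstdio>

int main() {
    // volatile stops the compiler from constant-folding, so the generated
    // mul/add (or fma) instructions actually execute at run time.
    volatile double a = 1.0 + 1.0 / 134217728.0;  // 1 + 2^-27
    volatile double b = 1.0 + 1.0 / 134217728.0;  // 1 + 2^-27
    volatile double c = -1.0;

    double maybe_contracted = a * b + c;          // fused or not, depending on -ffp-contract
    double always_fused     = std::fma(a, b, c);  // a single rounding by definition

    std::printf("a*b + c      = %.17g\n", maybe_contracted);
    std::printf("fma(a, b, c) = %.17g\n", always_fused);
}

Built with g++ -O2 -march=haswell, the first line should change between -ffp-contract=fast and -ffp-contract=off, while the second stays the same.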

Related

C++ error: intrinsic function was not declared in scope

I want to compile code that uses the intrinsic function _mm256_undefined_si256() (returns a vector of 8 packed double word integers). Here is a reduced snippet of the affected function from the header file:
// test.hpp
#include "immintrin.h"
namespace {
    inline __m256i foo(__m256i a, __m256i b) {
        __m256i res = _mm256_undefined_si256();
        // some inline asm stuff
        // __asm__(...);
        return res;
    }
}
Compiling via gcc -march=native -mavx2 -O3 -std=c++11 test.cpp -o app throws the following error: '_mm256_undefined_si256' was not declared in this scope.
I cannot explain why this intrinsic function is not declared, since other intrinsics used in the header file work properly.
Your code works in GCC 4.9 and newer (https://godbolt.org/z/bajMsKvK9). GCC 4.9 was released in April 2014, close to a decade ago, and the final release of the GCC 4.8 series, 4.8.5, was in June 2015. So it's about time to upgrade your compiler!
GCC 4.8 was missing that intrinsic, and didn't even know about -march=sandybridge (let alone tuning options for Haswell, which has AVX2), although it did know about the less meaningful -march=corei7-avx.
It does happen that GCC misses some of the more obscure intrinsics that Intel adds along with support for a new instruction set, so support for _mm256_add_epi32 won't always imply _mm256_undefined_si256().
e.g. it took until GCC 11 for them to add _mm_load_si32(void*), the unaligned, aliasing-safe movd load (which I think Intel introduced around the same time as the AVX-512 stuff), so that's multiple years late. (And it took until GCC 12 / 11.3 for GCC to implement it correctly, Bug 99754, and it's still not aliasing-safe for _mm_load_ss(float*), Bug 84508.)
But fortunately for _mm256_undefined_si256, it's supported by non-ancient versions of all the mainstream compilers.
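If you are nevertheless stuck on a pre-4.9 GCC, a hedged workaround (my own shim, not anything the header provides) is to fall back to _mm256_setzero_si256(), which exists wherever AVX does and only costs a cheap zeroing idiom. Compile with -mavx2 (or -march=native on an AVX2 machine):

#include <immintrin.h>

// Drop-in replacement for "give me a __m256i whose contents I don't care about".
static inline __m256i undef_si256() {
#if defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 9))
    return _mm256_setzero_si256();    // GCC < 4.9: intrinsic missing
#else
    return _mm256_undefined_si256();  // modern GCC/Clang/MSVC/ICC
#endif
}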

How to disable fma3 instructions in gcc

I need to disable FMA3 instructions (for backward compatibility issue) for the 64bit system.
I've used _set_FMA3_enable(0) in my Windows environment. What option (or macro) do I need to use to disable FMA3 in gcc?
For example.
#include <iostream>
#include <iomanip>
#include <math.h>
#include <limits>
union value
{
    double real;
    long long unsigned integer;
};

int main()
{
#if defined (_WIN64)
    _set_FMA3_enable(0);
#endif
    value x;
    x.integer = 4602754097807755994;
    value y;
    y.real = sin(x.real);
    std::cout << std::setprecision(17) << y.real << std::endl;
    std::cout << y.integer << std::endl;
}
On Visual C++ it prints 0.48674319998526994 4602440005894221058 with _set_FMA3_enable(0),
and 0.48674319998526999 4602440005894221059 without it (or with _set_FMA3_enable(1)).
I run this code in gcc environment with g++ -g0 -march=x86-64 -O2 -mtune=generic -msse3 -mno-fma -DNDEBUG main.cpp and always get 0.48674319998526999 4602440005894221059.
How can I reproduce _set_FMA3_enable(0) results with gcc?
Visual Studio 16.7.4. gcc version 9.3.0 (with wsl)
How to disable fma3 instructions in gcc
Generation of FMA3 instructions can be disabled with -mno-fma option. -ffp-contract=off is another option which should also disable contraction of separate multiply and add operations into FMA.
How can I reproduce [visual studio] results with gcc?
Apparently, not by disabling FMA instructions. FMA is not the only thing that influences the accuracy of floating point operations.
You could try not relying on the standard library for sin and instead use the same implementation on both systems. You could also set the rounding mode explicitly so that you don't rely on system defaults; if you do, you should use -frounding-math as well.
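A minimal sketch of the rounding-mode idea (my illustration, still using the standard library's sin): pin the mode with std::fesetround and build with -frounding-math so GCC doesn't assume round-to-nearest at compile time. Note that this alone won't make sin agree across different libm implementations.

#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
    // Set the rounding mode explicitly instead of trusting the system default.
    std::fesetround(FE_TONEAREST);  // or FE_DOWNWARD / FE_UPWARD / FE_TOWARDZERO

    double y = std::sin(0.5);       // still depends on the libm implementation
    std::printf("%.17g\n", y);
}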
It would be simpler if you could instead avoid relying on getting exactly the same results.
P.S. The behaviour of the shown program is undefined in C++ because of the read from an inactive union member.
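For reference, a sketch of a well-defined alternative to the union (memcpy-based type punning is legal in C++; std::bit_cast does the same job in C++20):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    std::uint64_t xbits = 4602754097807755994ULL;

    double x;
    std::memcpy(&x, &xbits, sizeof x);      // reinterpret the bits as a double

    double y = std::sin(x);

    std::uint64_t ybits;
    std::memcpy(&ybits, &y, sizeof ybits);  // and back again

    std::printf("%.17g\n%llu\n", y, static_cast<unsigned long long>(ybits));
}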

is boost::random::uniform_real_distribution supposed to be the same across processors?

The following code produces different output on x86 32bit vs 64bit processors.
Is it supposed to be this way? If I replace it with std::uniform_real_distribution and compile with -std=c++11 it produces the same output on both processors.
#include <iostream>
#include <limits>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real_distribution.hpp>

int main()
{
    boost::mt19937 gen;
    gen.seed(4294653137UL);
    std::cout.precision(1000);

    double lo = -std::numeric_limits<double>::max() / 2;
    double hi = +std::numeric_limits<double>::max() / 2;
    boost::random::uniform_real_distribution<double> boost_distrib(lo, hi);

    std::cout << "lo " << lo << '\n';
    std::cout << "hi " << hi << "\n\n";
    std::cout << "boost distrib gen " << boost_distrib(gen) << '\n';
}
BTW, you could have written boost::mt19937 gen(4294653137UL); to avoid seeding with the default seed (5489) in the default constructor. Your code has to loop over all 624 uint32_t elements of the generator's internal state twice.
The generator is always fine, and works the same on any machine. The difference only comes from using floating-point to map it to a uniform_real_distribution.
g++ -m32 -msse2 -mfpmath=sse produces identical output to all the other compilers. 32 vs 64bit is different because 64bit uses SSE for float math, so double temporaries are always 64bit. 32bit x86 defaults to using the legacy x87 FPU, where everything is 80bit internally, and only rounded down to 64bit double when storing to memory.
Note that bit-identical FP results are in general NOT guaranteed with different compilers, even on the same platform.
32bit clang still uses SSE math by default, so it gets identical results to 64bit clang or 64bit g++. Telling g++ to do the same solves the problem. -mfpmath=sse tells it to do calculations with SSE (although it doesn't change the ABI, so floating point return values are still in x87 st(0).) -msse2 tells g++ to assume the target machine supports SSE and SSE2. (sse2 added double-precision to sse's single-precision. SSE2 is baseline in the x86-64 architecture, and used to pass/return FP args in the 64bit ABI.)
Without SSE, you could (but don't) use -ffloat-store to precisely follow the C standard and round intermediate results to 32 or 64bits by storing and re-loading them. This adds about 6 cycles of latency to every FP math instruction. (Compared to 3 cycle FP add, 5 cycle FP mul on Intel Haswell.) So don't do this, you'll get horrible code.
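If you want to double-check which FP path a given build uses, GCC and Clang define __SSE2_MATH__ when doubles are computed with SSE2 (as far as I know; treat the macro check below as a sketch rather than a guarantee):

#include <cstdio>

int main() {
#ifdef __SSE2_MATH__
    std::puts("double math: SSE2 (each operation rounded to 64 bits)");
#else
    std::puts("double math: x87 (80-bit internal precision, rounded on store)");
#endif
}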
debugging steps:
I tried it out on Ubuntu 15.10, with g++ 5.2, clang-3.5, and clang-3.8 (from http://llvm.org/apt/).
for i in ./boost-random-seedint*; do echo -ne "$i:\t" ; $i|md5sum ;done
./boost-random-seedint-g++32: 53d99523ca2afeac428eae2c89e69974 -
./boost-random-seedint-g++64: a59f08c0bc22b8753c474db077b809bd -
./boost-random-seedint-clang3.5-32: a59f08c0bc22b8753c474db077b809bd -
./boost-random-seedint-clang3.5-64: a59f08c0bc22b8753c474db077b809bd -
./boost-random-seedint-clang3.8-32: a59f08c0bc22b8753c474db077b809bd -
./boost-random-seedint-clang3.8-64: a59f08c0bc22b8753c474db077b809bd -
So the only outlier is 32bit g++; all the other outputs have the same hash.
Compiler options:
clang++-3.8 -m32 -O1 -g boost-random-seedint.cpp -o boost-random-seedint-clang3.8-32 # and similar
g++ -m32 -Og -g boost-random-seedint.cpp -o boost-random-seedint32
clang doesn't have a -Og. 32bit g++ with -O0 and -O3 produces binaries that give the same output as the one from -Og.
Debugging the 32 and 64bit binaries: their state arrays are identical after the default seed and after the call to gen.seed(4294653137UL).

__m256 unknown type (clang 5.1/i5 CPU)?

I just started to experiment with intrinsics. I managed to successfully compile a program using __m128 on a Mac using Clang 5.1. The CPU on this Mac is an Intel core i5 M540.
When I tried to compile the same code with __m256, I get the following message:
simple.cpp:4:2: error: unknown type name '__m256'
__m256 A;
The code looks like this:
#include <immintrin.h>

int main()
{
    __m256 A;
    return 0;
}
And here is the command used to compile it:
c++ -o simple simple.cpp -march=native -O3
Is it just that my CPU is too old to support the AVX instruction set? Are all the options I use (on the command line) correct? I checked the immintrin.h include file, and it does include another header which seems to define the AVX intrinsics. Apologies if the question is naive or if the terminology is misused; as I said, I am new to this topic.
The Intel Core i5-540M is a Westmere-microarchitecture CPU (sorry for the mistake in the comment), which came before Sandy Bridge, where AVX was introduced, so it doesn't support AVX. The term "Core i5" covers a wide range of architectures from Nehalem to Haswell (current), so using a Core i5 CPU doesn't mean that you'll have support for all instruction sets, including the latest ones.
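To check quickly whether AVX is available, both at compile time and on the CPU you run on, a small sketch like the following can help; __AVX__ and __builtin_cpu_supports() are GCC/Clang features (the builtin needs a non-ancient compiler):

#include <cstdio>

int main() {
#ifdef __AVX__
    std::puts("compiled with AVX enabled: __m256 is available");
#else
    std::puts("compiled WITHOUT AVX: __m256 and friends are not defined");
#endif

    // Runtime check: does this CPU actually implement AVX?
    if (__builtin_cpu_supports("avx"))
        std::puts("this CPU supports AVX");
    else
        std::puts("this CPU does not support AVX (e.g. Westmere and older)");
}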

Why is Eigen's Cholesky decomposition much faster on Linux than on Windows?

I've noticed a significant performance difference regarding Cholesky decomposition using the Eigen library.
I'm using the latest version of Eigen (3.2.1) with the following benchmark code:
#include <iostream>
#include <chrono>
#include <Eigen/Core>
#include <Eigen/Cholesky>

using namespace std;
using namespace std::chrono;
using namespace Eigen;

int main()
{
    const MatrixXd::Index size = 4200;
    MatrixXd m = MatrixXd::Random(size, size);
    m = (m + m.transpose()) / 2.0 + 10000 * MatrixXd::Identity(size, size);

    LLT<MatrixXd> llt;
    auto start = high_resolution_clock::now();
    llt.compute(m);
    if (llt.info() != Success)
        cout << "Cholesky decomposition failed!" << endl;
    auto stop = high_resolution_clock::now();

    cout << "Cholesky decomposition in "
         << duration_cast<milliseconds>(stop - start).count()
         << " ms." << endl;
    return 0;
}
I compile this benchmark with g++ -std=c++11 -Wall -O3 -o bench bench.cc and run it on Windows the first time (using MinGW, [edit: GCC 4.8.1]) and on Linux (edit: GCC 4.8.1) the second time, but both times on the same machine.
On Windows, it gives me:
Cholesky decomposition in 10114 ms.
But on Linux I get:
Cholesky decomposition in 3258 ms.
That is less than a third of the time needed on Windows.
Is there something available on Linux systems that Eigen uses to achieve this speed-up?
And if so, how may I accomplish the same on Windows?
Make sure you are using a 64-bit system. If that's not the case, then don't forget to enable SSE2 instructions (-msse2), but performance still won't match a 64-bit system because fewer SSE registers are available.
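One quick way to see whether Eigen's vectorization actually kicked in for a given build (assuming the EIGEN_VECTORIZE macro and Eigen::SimdInstructionSetsInUse() that recent 3.x versions provide) is a sketch like this:

#include <iostream>
#include <Eigen/Core>

int main() {
#ifdef EIGEN_VECTORIZE
    std::cout << "Eigen vectorization: ON\n";
#else
    std::cout << "Eigen vectorization: OFF (expect a large slowdown)\n";
#endif
    std::cout << "SIMD sets in use: " << Eigen::SimdInstructionSetsInUse() << '\n';
}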
See Eigen's main page here.
Quote
Eigen is being successfully used with the following compilers:
GCC, version 4.1 and newer. Very good performance with GCC 4.2 and newer.
MSVC (Visual Studio), 2008 and newer (the old 2.x versions of Eigen support MSVC 2005, but without vectorization).
Intel C++ compiler. Very good performance.
LLVM/Clang++ (2.8 and newer).
MinGW, recent versions. Very good performance when using GCC 4.
QNX's QCC compiler.
Argument
You may have a more recent version of gcc (>=4.2) than your version of MinGW uses...
Note
Just as a side-note, you may even have a MinGW version that is not "recent", as the link also says:
Eigen is standard C++98 and so should theoretically be compatible with any compliant compiler.
Whenever we use some non-standard feature, that is optional and can be disabled.
So maybe your version of gcc uses a new optimization feature that MinGW's GCC doesn't possess, and it falls back to another, slower alternative.
Of course, in the end, it could be a completely different thing; this is an educated guess based on theory...