g++ optimization options affect the value of sin function

I have a problem with the "sin" function of libc.
#include <cmath>
#include <stdio.h>

int main(int argc, char **argv)
{
    double tt = 6.28318530717958620000; // 2 * M_PI
    double yy = ::sin(tt);
    printf("%.32f\n", yy);
    return 0;
}
When I compile the above code with g++ without any optimization option, it outputs "-0.00000000000000024492127076447545".
But with the -O3 option, it outputs "-0.00000000000000024492935982947064".
Why doesn't it return "-0.00000000000000024492935982947064" without -O3?
Thanks in advance.

Because with -O3 the compiler precomputes sin(2*pi) at compile time, using one algorithm. Without -O3 it is computed at run time, using another algorithm.
This may be because the compiler itself was built with a math library that differs from the math library on your system.
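One way to confirm that compile-time constant folding is what makes the difference is to hide the argument behind a volatile variable, so the call cannot be evaluated at compile time. A minimal sketch (the volatile qualifier is the only change to the original program):

#include <cmath>
#include <stdio.h>

int main()
{
    // volatile keeps the compiler from folding sin() at compile time,
    // so the run-time libm implementation is used even with -O3.
    volatile double tt = 6.28318530717958620000; // 2 * M_PI
    double yy = ::sin(tt);
    printf("%.32f\n", yy);
    return 0;
}

If the -O3 output now matches the unoptimized build, the difference is between the compiler's compile-time evaluation and your libm, not between optimization levels as such.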
Update
The only configuration that gives the result "-0.00000000000000024492127076447545" is the 32-bit version of libstdc++. The 64-bit version of the same library, as well as gcc itself, produces "-0.00000000000000024492935982947064".
So upgrading to a newer version will not help. I also tried the various options proposed here: neither -ffloat-store nor -fno-builtin makes any difference, and neither does long double with sinl.
The 32-bit libstdc++ uses 387 (x87) floating-point instructions, while gcc apparently uses SSE instructions. That is where the difference comes from. Probably the only way to make them consistent is to rebuild gcc from source, directing it to use only 387 instructions internally.

Related

How to disable fma3 instructions in gcc

I need to disable FMA3 instructions (for a backward compatibility issue) on a 64-bit system.
I have used _set_FMA3_enable(0) in my Windows environment. What option (or macro) do I need to use to disable FMA3 in gcc?
For example:
#include <iostream>
#include <iomanip>
#include <math.h>
#include <limits>

union value
{
    double real;
    long long unsigned integer;
};

int main()
{
#if defined (_WIN64)
    _set_FMA3_enable(0);
#endif
    value x;
    x.integer = 4602754097807755994;
    value y;
    y.real = sin(x.real);
    std::cout << std::setprecision(17) << y.real << std::endl;
    std::cout << y.integer << std::endl;
}
On Visual C++ it prints 0.48674319998526994 4602440005894221058 with _set_FMA3_enable(0),
and 0.48674319998526999 4602440005894221059 without it (or with _set_FMA3_enable(1)).
I run this code in a gcc environment with g++ -g0 -march=x86-64 -O2 -mtune=generic -msse3 -mno-fma -DNDEBUG main.cpp and always get 0.48674319998526999 4602440005894221059.
How can I reproduce the _set_FMA3_enable(0) results with gcc?
Visual Studio 16.7.4, gcc version 9.3.0 (with WSL).
How to disable fma3 instructions in gcc
Generation of FMA3 instructions can be disabled with the -mno-fma option. -ffp-contract=off is another option which should also disable contraction of separate multiply and add operations into FMA.
How can I reproduce [visual studio] results with gcc?
Apparently not by disabling FMA instructions. FMA instructions are not the only thing that influences the accuracy of floating-point operations.
You could try not relying on the standard library for sin and instead use the same implementation on both systems. You could also set the rounding mode explicitly so that you don't rely on system defaults; if you do, you should compile with -frounding-math as well (a sketch follows below).
It would be simpler if you could instead avoid relying on getting exactly the same results.
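A minimal sketch of setting the rounding mode explicitly via <cfenv>; whether this changes what your particular libm sin returns is platform-dependent, so treat it as an illustration rather than a guaranteed fix:

#include <cfenv>
#include <cmath>
#include <cstdio>

// Build with -frounding-math; some compilers also want
// "#pragma STDC FENV_ACCESS ON" before touching the rounding mode.
int main()
{
    std::fesetround(FE_TONEAREST);          // state the mode explicitly instead of
                                            // relying on the system default
    std::printf("%.17g\n", std::sin(0.5));  // evaluated under the chosen mode
    return 0;
}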
P.S. The behaviour of the shown program is undefined in C++ because of the read from an inactive union member.
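A sketch of the same bit reinterpretation without the inactive-union read, using memcpy (std::bit_cast is the C++20 equivalent); the integer constant is the one from the question:

#include <cmath>
#include <cstdio>
#include <cstring>
#include <cstdint>

// Reinterpret bits via memcpy, which is well-defined, unlike reading
// an inactive union member.
double bits_to_double(std::uint64_t bits)
{
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

std::uint64_t double_to_bits(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

int main()
{
    double x = bits_to_double(4602754097807755994ULL);
    double y = std::sin(x);
    std::printf("%.17g %llu\n", y, (unsigned long long)double_to_bits(y));
    return 0;
}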

g++: optimization -march=haswell and newer changes numerical result

I have been working on optimizing performance, and of course doing regression tests, when I noticed that g++ seems to alter results depending on the chosen optimization. So far I thought that -O2 -march=[whatever] should yield exactly the same results for numerical computations regardless of which architecture is chosen. However, this seems not to be the case for g++. While using old architectures up to ivybridge yields the same results as clang does for any architecture, I get different results with gcc for haswell and newer. Is this a bug in gcc, or did I misunderstand something about optimizations? I am really startled because clang does not show this behavior.
Note that I am well aware that the differences are within machine precision, but they still disturb my simple regression checks.
Here is some example code:
#include <iostream>
#include <armadillo>

int main(){
    arma::arma_rng::set_seed(3);
    arma::sp_cx_mat A = arma::sprandn<arma::sp_cx_mat>(20, 20, 0.1);
    arma::sp_cx_mat B = A + A.t();
    arma::cx_vec eig;
    arma::eigs_gen(eig, B, 1, "lm", 0.001);
    std::cout << "eigenvalue: " << eig << std::endl;
}
Compiled using:
g++ -march=[architecture] -std=c++14 -O2 -o test example.cpp -larmadillo
gcc version: 6.2.1
clang version: 3.8.0
Compiled for 64 bit, executed on an Intel Skylake processor.
It is because GCC uses fused multiply-add (FMA) instructions by default if they are available. Clang, on the contrary, doesn't use them by default even when they are available.
The result of a*b+c can differ depending on whether FMA is used, because the fused operation rounds only once instead of twice; that's why you get different results when you use -march=haswell (Haswell is the first Intel CPU which supports FMA). A small demonstration follows the option list below.
You can decide whether you want to use this feature with -ffp-contract=XXX:
With -ffp-contract=off, you won't get FMA instructions.
With -ffp-contract=on, you get FMA instructions only where contraction is allowed by the language standard. In the current version of GCC this effectively means off, because that mode is not implemented yet.
With -ffp-contract=fast (the GCC default), you'll get FMA instructions.
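Here is the promised demonstration of the single-rounding effect, using std::fma to force a fused operation in source; the value of a is chosen only to make the difference visible and is not taken from the question. Compile with -ffp-contract=off so the first expression is not itself contracted into an FMA:

#include <cmath>
#include <cstdio>

int main()
{
    // a*a equals 1 + 2^-26 + 2^-54 exactly. The separate multiply rounds the
    // 2^-54 term away before the subtraction, while fma() rounds only once.
    double a = 1.0 + std::ldexp(1.0, -27);   // 1 + 2^-27, exactly representable
    double separate = a * a - 1.0;           // two roundings
    double fused    = std::fma(a, a, -1.0);  // one rounding
    std::printf("separate: %.17g\nfused:    %.17g\n", separate, fused);
    return 0;
}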

Why is Eigen's Cholesky decomposition much faster on Linux than on Windows?

I've noticed a significant performance difference regarding Cholesky decomposition using the Eigen library.
I'm using the latest version of Eigen (3.2.1) with the following benchmark code:
#include <iostream>
#include <chrono>
#include <Eigen/Core>
#include <Eigen/Cholesky>

using namespace std;
using namespace std::chrono;
using namespace Eigen;

int main()
{
    const MatrixXd::Index size = 4200;
    MatrixXd m = MatrixXd::Random(size, size);
    m = (m + m.transpose()) / 2.0 + 10000 * MatrixXd::Identity(size, size);

    LLT<MatrixXd> llt;
    auto start = high_resolution_clock::now();
    llt.compute(m);
    if (llt.info() != Success)
        cout << "Cholesky decomposition failed!" << endl;
    auto stop = high_resolution_clock::now();

    cout << "Cholesky decomposition in "
         << duration_cast<milliseconds>(stop - start).count()
         << " ms." << endl;
    return 0;
}
I compile this benchmark with g++ -std=c++11 -Wall -O3 -o bench bench.cc and run it on Windows the first time (using MinGW, [edit: GCC 4.8.1]) and on Linux (edit: GCC 4.8.1) the second time, but both times on the same machine.
On Windows, it gives me:
Cholesky decomposition in 10114 ms.
But on Linux I get:
Cholesky decomposition in 3258 ms.
That is less than a third of the time needed on Windows.
Is there something available on Linux systems that Eigen uses to achieve this speed-up?
And if so, how may I accomplish the same on Windows?
Make sure you are using a 64-bit system. If that's not the case, then don't forget to enable SSE2 instructions (-msse2), but performance still won't be as good as on a 64-bit system because fewer SSE registers are available.
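A quick way to check whether Eigen's vectorization is actually enabled in a given build is to test the EIGEN_VECTORIZE macro; a minimal sketch (the exact set of EIGEN_VECTORIZE_* macros varies with the Eigen version):

#include <iostream>
#include <Eigen/Core>

int main()
{
    // Eigen defines EIGEN_VECTORIZE when it compiles with SIMD support,
    // which depends on the -m... flags (on 64-bit x86, SSE2 is implied).
#ifdef EIGEN_VECTORIZE
    std::cout << "Eigen vectorization: enabled" << std::endl;
#else
    std::cout << "Eigen vectorization: disabled" << std::endl;
#endif
    return 0;
}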
See Eigen's main page here.
Quote
Eigen is being successfully used with the following compilers:
GCC, version 4.1 and newer. Very good performance with GCC 4.2 and newer.
MSVC (Visual Studio), 2008 and newer (the old 2.x versions of Eigen support MSVC 2005, but without vectorization).
Intel C++ compiler. Very good performance.
LLVM/CLang++ (2.8 and newer).
MinGW, recent versions. Very good performance when using GCC 4.
QNX's QCC compiler.
Argument
You may have a more recent version of gcc (>=4.2) than your version of MinGW uses...
Note
Just as a side-note, you may even have a MinGW version that is not "recent", as the link also says:
Eigen is standard C++98 and so should theoretically be compatible with
any compliant compiler.
Whenever we use some non-standard feature, that is optional and can be
disabled.
So maybe your version of gcc uses a new optimizing feature that MinGW doesn't possess, and it falls back to another, slower alternative.
Of course, in the end it could be a completely different thing; this is an educated guess based on theory...

How to detect the libstdc++ version in Clang?

I would like to write a "portable" C++ library in Clang. "Portable" means that I detect (in the C preprocessor) what C++ features are available in the compilation environment and either use those features or provide my own workarounds. This is similar to what the Boost libraries do.
However, the presence of some features depends not on the language, but on the Standard Library implementation. In particular I am interested in:
type traits (which of them are available and with what spelling)
whether initializer_list is constexpr.
I find this problematic because Clang by default does not use its own Standard Library implementation: it uses libstdc++. While Clang has predefined preprocessor macros __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__, they are hardcoded to values 4, 2, 1 respectively, and they tell me little about the available libstdc++ features.
How can I check in Clang preprocessor what version of libstdc++ it is using?
Clang does come with its own standard library implementation; it's called libc++. You can use it by adding -stdlib=libc++ to your compile command.
That being said, there are various ways to check Clang/libstdc++ C++ support:
Clang has the __has_feature macro (and friends) that can be used to detect language features and language extensions.
Libstdc++ has its own version macros; see the documentation. You'll need to include a libstdc++ header to get these defined, though (a sketch follows below).
GCC has its version macros which you already discovered, but those would need to be manually compared to the documentation.
And also, this took me 2 minutes of googling.
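As a sketch of the second option: once any libstdc++ header has been included, the library defines the date-style macro __GLIBCXX__ (of the form YYYYMMDD), which you can branch on instead of __GNUC__. Note that the date does not increase strictly monotonically across GCC release branches, which is a known annoyance:

#include <cstddef>   // any libstdc++ header pulls in the library's config macros
#include <cstdio>

int main()
{
#ifdef __GLIBCXX__
    // Identifies the libstdc++ release that shipped these headers.
    std::printf("libstdc++ in use, __GLIBCXX__ = %ld\n", (long)__GLIBCXX__);
#elif defined(_LIBCPP_VERSION)
    std::printf("libc++ in use, _LIBCPP_VERSION = %d\n", (int)_LIBCPP_VERSION);
#else
    std::printf("unknown standard library\n");
#endif
    return 0;
}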
This is what I think would help. It prints the value of the _LIBCPP_VERSION macro (which is defined by libc++, not libstdc++):
#include <iostream>
#include <string>

using namespace std;

int main(int argc, const char * argv[])
{
    cout << "Value = " << _LIBCPP_VERSION << endl;
    return 0;
}
Compile it against the version of clang you want the information for.

vector<bool>::push_back bug in GCC 3.4.3?

The following code crashes for me using GCC to build for ARM:
#include <vector>

using namespace std;

void foo(vector<bool>& bools) {
    bools.push_back(true);
}

int main(int argc, char** argv) {
    vector<bool> bools;
    bool b = false;
    bools.push_back(b);
}
My compiler is: arm_v5t_le-gcc (GCC) 3.4.3 (MontaVista 3.4.3-25.0.30.0501131 2005-07-23). The crash doesn't occur when building for debug, but occurs with optimizations set to -O2.
Yes, the foo function is necessary to reproduce the issue. This was very confusing at first, but I've discovered that the crash only happens when the push_back call isn't inlined. If GCC notices that the push_back method is called more than once, it won't inline it in each location. For example, I can also reproduce the crash by calling push_back twice inside main. If you make foo static, then gcc can tell it is never called and will optimize it out, so push_back gets inlined into main and the crash does not occur.
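For reference, a sketch of that alternate reproduction, with two push_back call sites in main and no helper function (same compiler and -O2 assumed):

#include <vector>

// Two separate push_back call sites keep GCC 3.4.3 from inlining push_back
// at -O2, which is the condition under which the crash was observed.
int main() {
    std::vector<bool> bools;
    bools.push_back(true);
    bools.push_back(false);
    return 0;
}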
I've tried this on x86 with gcc 4.3.3, and it appears the issue is fixed for that version.
So, my questions are:
Has anyone else run into this? Perhaps there are some compiler flags I can pass in to prevent it.
Is this a bug with gcc's code generation, or is it a bug in the stl implementation (bits/stl_bvector.h)? (I plan on testing this out myself when I get the time)
If it is a problem with the compiler, is upgrading to 4.3.3 what fixes it, or is it switching to x86 from arm?
Incidentally, most other vector<bool> methods seem to work. And yes, I know that using vector<bool> isn't the best option in the world.
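If the bit-packed vector<bool> specialization itself turns out to be at fault, one sketch of a workaround is to store whole bytes instead; std::vector<char> here is my stand-in, and std::deque<bool> would avoid the specialization as well:

#include <vector>
#include <cstdio>

// vector<char> sidesteps vector<bool>'s bit-packed proxy machinery,
// at the cost of one byte per flag.
int main() {
    std::vector<char> flags;
    flags.push_back(true);
    flags.push_back(false);
    std::printf("%u flags, first = %d\n", (unsigned)flags.size(), (int)flags[0]);
    return 0;
}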
Can you build your own toolchain with gcc 3.4.6 and Montavista's patches? 3.4.6 is the last release of the 3.x line.
I can append some instructions for how to build an ARM cross-compiler from GCC sources if you want. I have to do it all the time, since nobody does prebuilt toolchains for Mac OS X.
I'd be really surprised if this is broken for ARM in gcc 4.x. But the only way to test is if you or someone else can try this out on an ARM-targeting gcc 4.x.
Upgrading to GCC 4 is a safe bet. Its middle end added the Tree SSA (Static Single Assignment) framework alongside the old RTL (Register Transfer Language) representation, which allowed a significant rewrite of the optimizer.