Maximum Precision for C++ with Eigen3

I'm using the awesome Eigen3 library to write a MATLAB MEX file. But I am experiencing some accuracy issues (compared to MATLAB), even when using long double.
The most critical computation seems to be the one where I compute a probability according to the normal distribution.
Here is the code snippet:
p.fill( 1/( M_PIl * sigma * sigma ) );
p.array() *= ( - 0.5/pow( sigma, 2.0 ) * ( mu.array() - x.array() ).array().square() ).array().exp();
where x, p and mu are Eigen::Matrix< long double, Dynamic, 1 >. Usually these vectors have a length of 3000.
What are possible steps I can take to get the maximum possible precision?
What are the correct GCC compiler flags I can use to force 80 bit precision wherever possible?
P.S.: I compile the C++ code (in MATLAB with MEX) with gcc 4.9, and my Linux system reports the following available instruction sets: Intel MMX, Intel SSE, Intel SSE2, Intel SSE3, Intel SSE4
Edit:
I tried what @Avi Ginsburg suggested below and compiled it using the following command:
mex -g -largeArrayDims '-I/usr/include/eigen3' CXXFLAGS='-DEIGEN_DONT_VECTORIZE -std=c++11 -fPIC' test.cpp
with both double and long double; each of these options gives me the same discrepancy relative to the MATLAB solution.

I'm hazarding a guess here. You are using SSE instructions with your array calculations, most notably ...array().exp(). I'm pretty sure there is no extended (80-bit) precision with SSE, hence the differences between MATLAB and Eigen.

By default Eigen uses a faster but slightly less accurate implementation of several mathematical functions, including the exponential. You can explicitly disable these optimizations by compiling with the -DEIGEN_FAST_MATH=0 option.
If you use gcc as your compiler, also make sure that you don't use the -Ofast or -ffast-math options, as these can result in reduced precision.
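For example, adapting the mex invocation from the question (a sketch; the exact quoting depends on your MATLAB/mex setup):
mex -largeArrayDims '-I/usr/include/eigen3' CXXFLAGS='-DEIGEN_FAST_MATH=0 -std=c++11 -fPIC' test.cpp
and double-check that neither -Ofast nor -ffast-math ends up in the effective compiler flags.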

If you want to compute the probability density of a (1-dimensional) normal distribution, the factor at the beginning should be 1/std::sqrt( 2* M_PIl * sigma * sigma ).
Also, the p.fill() at the beginning of your snippet is inefficient. Just write this in one line:
p = ( (1/std::sqrt(2*M_PIl*sigma*sigma)) *
      ( -0.5/(sigma*sigma) * (mu-x).array().square() ).exp() ).matrix();
N.B.: If you are only performing element-wise operations on your arrays, consider declaring them as Eigen::Array<...> instead of Eigen::Matrix<...>. The template parameters and the binary layout are the same, but you don't need to write .array() every time you want element-wise operations (and the .matrix() above becomes unnecessary).
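As a sketch of that suggestion (the helper name and signature are made up for illustration; M_PIl is the GNU long-double pi constant already used in the question):
#include <Eigen/Core>
#include <cmath>

// Element-wise semantics by default: no .array()/.matrix() conversions needed.
using Vec = Eigen::Array<long double, Eigen::Dynamic, 1>;

Vec normal_pdf(const Vec& x, const Vec& mu, long double sigma)
{
    return (1.0L / std::sqrt(2.0L * M_PIl * sigma * sigma))
         * (-0.5L / (sigma * sigma) * (mu - x).square()).exp();
}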

Related

speeding up complex-number multiplication in C++

I have some code which multiplies complex numbers, and I have noticed that __mulxc3 (the long double version of __muldc3) is being called frequently: i.e., the complex-number multiplications are not being inlined.
I am compiling with g++ version 7.5, with -O3 and -ffast-math.
It is similar to this question, except that here the problem persists when I compile with -ffast-math. Since I do not require checking for whether the arguments are Inf or NaN, I was considering making my own very simple complex class without such checks to allow the multiplication to be inlined, but given my lack of C++ proficiency, and having read this article, I suspect that would be counterproductive.
So, is there a way to change either my code or compilation process so that I can keep using std::complex, but inline the multiplication?
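For reference, the "very simple complex class" idea from the question would look roughly like this minimal sketch (type and operator names are hypothetical; it deliberately skips the Inf/NaN recovery that __mulxc3 performs, so it does not follow C Annex G semantics for such operands):
// Bare-bones complex type: the textbook product below has no special-case
// handling, so the compiler is free to inline and vectorize it.
struct SimpleComplex {
    long double re, im;
};

inline SimpleComplex operator*(SimpleComplex a, SimpleComplex b)
{
    return { a.re * b.re - a.im * b.im,
             a.re * b.im + a.im * b.re };
}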

round much slower than floor/ceil/int in LLVM

I was benchmarking some essential routines by executing cycles such as:
float *src, *dst;
for (int i=0; i<cnt; i++) dst[i] = round(src[i]);
All with AVX2 target, newest Clang. Interestingly, floor(x), ceil(x), int(x)... all seem fast. But round(x) seems extremely slow, and looking into the disassembly there's some weird spaghetti code instead of the newer SSE or AVX versions. Even when blocking the ability to vectorize the loops by introducing some dependency, round is like 10x slower. For floor etc. the generated code uses vroundss; for round there's the spaghetti code... Any ideas?
Edit: I'm using -ffast-math, -mfpmath=sse, -fno-math-errno, -O3, -std=c++17, -march=core-avx2 -mavx2 -mfma
The problem is that none of the SSE rounding modes specify the correct rounding for round:
These functions round x to the nearest integer, but round halfway cases away from zero
(regardless of the current rounding direction, see fenv(3)), instead of to the nearest
even integer like rint(3).
If you want faster code, you could try testing rint instead of round, as that specifies a rounding mode that SSE does support.
One thing to note is that an expression like floor(x + 0.5), while not having exactly the same semantics as round(x), is a valid substitute in almost all use cases, and I doubt it is anywhere near 10x slower than floor(x).
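A sketch of both alternatives, reusing the loop from the question (function names are just for illustration):
#include <cmath>

// With the flags from the question (-O3 -march=core-avx2 -ffast-math) this
// typically vectorizes to vroundps, since rint's to-nearest-even mode (the
// default) is directly supported by the SSE4.1/AVX rounding instructions.
void round_via_rint(const float* src, float* dst, int cnt)
{
    for (int i = 0; i < cnt; ++i)
        dst[i] = std::rint(src[i]);
}

// floor(x + 0.5): close to round() but not identical in halfway and
// borderline cases; it also maps to the hardware rounding instructions.
void round_half_up(const float* src, float* dst, int cnt)
{
    for (int i = 0; i < cnt; ++i)
        dst[i] = std::floor(src[i] + 0.5f);
}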

sine result depends on C++ compiler used

I use the two following C++ compilers:
cl.exe : Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24210 for x86
g++ : g++ (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010
When using the built-in sine function, I get different results. This is not critical, but sometimes the differences are too significant for my use. Here is an example with a 'hard-coded' value:
printf("%f\n", sin(5451939907183506432.0));
Result with cl.exe:
0.528463
Result with g++:
0.522491
I know that g++'s result is more accurate and that I could use an additional library to get this same result, but that's not my point here. I would really like to understand what happens here: why is cl.exe that wrong?
Funny thing: if I apply a modulo of (2 * pi) to the parameter, then I get the same result as g++...
[EDIT] Just because my example looks crazy to some of you: this is part of a pseudorandom number generator. It is not important whether the result of the sine is accurate or not: we just need it to give some result.
You have a 19-digit literal, but a double typically has 15-17 significant decimal digits of precision. As a result, the conversion to double introduces only a small relative error, but in the context of a sine calculation the resulting absolute error is large.
Actually, different implementations of the standard library handle such large numbers differently. For example, in my environment, if we execute
std::cout << std::fixed << 5451939907183506432.0;
g++ result would be 5451939907183506432.000000
cl result would be 5451939907183506400.000000
The difference is because versions of cl earlier than 19 have a formatting algorithm that uses only a limited number of digits and fills the remaining decimal places with zeros.
Furthermore, let's look at this code:
double a[1000];
for (int i = 0; i < 1000; ++i) {
    a[i] = sin(5451939907183506432.0);
}
double d = sin(5451939907183506432.0);
cout << a[500] << endl;
cout << d << endl;
When executed with my x86 VC++ compiler the output is:
0.522491
0.528463
It appears that when filling the array, sin is compiled to a call to __vdecl_sin2, and when there is a single operation, it is compiled to a call to __libm_sse2_sin_precise (with /fp:precise).
In my opinion, your number is too large for a sine calculation to expect the same behavior from different compilers, or correct behavior in general.
I think Sam's comment is closest to the mark. Whereas you're using a recentish version of GCC/glibc, which implements sin() in software (calculated at compile time for the literal in question), cl.exe for x86 likely uses the fsin instruction. The latter can be very imprecise, as described in the Random ASCII blog post, "Intel Underestimates Error Bounds by 1.3 quintillion".
Part of the problem with your example in particular is that Intel uses an imprecise approximation of pi when doing range reduction:
When doing range reduction from double-precision (53-bit mantissa) pi the results will have about 13 bits of precision (66 minus 53), for an error of up to 2^40 ULPs (53 minus 13).
According to cppreference:
The result may have little or no significance if the magnitude of arg is large
(until C++11)
It's possible that this is the cause of the problem, in which case you will want to manually do the modulo so that arg is not large.
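A sketch of that manual reduction (note: it mainly makes the two compilers agree, as the questioner observed; the reduced argument still carries the representation error of the huge literal, so the result is not more meaningful):
#include <cmath>
#include <cstdio>

int main()
{
    const double x = 5451939907183506432.0;
    // 2*pi rounded to double; the fmod-based reduction inherits the literal's
    // representation error, so this only stabilizes the result across compilers.
    const double two_pi = 6.283185307179586;
    std::printf("%f\n", std::sin(std::fmod(x, two_pi)));
}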

Why is it faster to perform float by float matrix multiplication compared to int by int?

Given two int matrices A and B with more than 1000 rows and 10K columns, I often need to convert them to float matrices to gain a speedup (4x or more).
I'm wondering why this is the case. I realize that there is a lot of optimization and vectorization, such as AVX, going on with float matrix multiplication. Yet there are instructions, such as AVX2, for integers (if I'm not mistaken). Can't one make use of SSE and AVX for integers?
Why isn't there a heuristic underneath matrix algebra libraries such as NumPy or Eigen to detect this and perform integer matrix multiplication faster, just like float?
About the accepted answer: while @sascha's answer is very informative and relevant, @chatz's answer is the actual reason why the int-by-int multiplication is slow, irrespective of whether BLAS integer matrix operations exist.
If you compile these two simple functions which essentially just calculate a product (using the Eigen library)
#include <Eigen/Core>

int mult_int(const Eigen::MatrixXi& A, const Eigen::MatrixXi& B)
{
    Eigen::MatrixXi C = A * B;
    return C(0,0);
}

int mult_float(const Eigen::MatrixXf& A, const Eigen::MatrixXf& B)
{
    Eigen::MatrixXf C = A * B;
    return C(0,0);
}
using the flags -mavx2 -S -O3, you will see very similar assembly code for the integer and the float versions.
The main difference, however, is that vpmulld has 2-3 times the latency and just 1/2 or 1/4 the throughput of vmulps (on recent Intel architectures).
Reference: Intel Intrinsics Guide. "Throughput" here means the reciprocal throughput, i.e., how many clock cycles are used per instruction when there are no dependency stalls (somewhat simplified).
All those vector-vector and matrix-vector operations are using BLAS internally. BLAS, optimized over decades for different architectures, CPUs, instruction sets, and cache sizes, has no integer type!
Here is a branch of OpenBLAS working on it (and some brief discussion on Google Groups linking to it).
And I think I heard that Intel's MKL (Intel's BLAS implementation) might be working on integer types too. This talk looks interesting (it is mentioned in that forum), although it's short and probably more concerned with the small integral types useful in embedded deep learning.

A faster but less accurate fsin for Intel asm?

Since the fsin instruction for computing sin(x) on x86 dates back to the Pentium era, and apparently it doesn't even use SSE registers, I was wondering whether there is a newer and better set of instructions for computing trigonometric functions.
I'm used to coding in C++ and doing some asm optimization, so anything that fits in a pipeline going from C++, to C, to asm will work for me.
Thanks.
I'm on 64-bit Linux for now, with gcc and clang (even though clang doesn't really offer any FPU-related optimizations, AFAIK).
EDIT
I have already implemented a sin function; it's usually 2 times faster than std::sin, even with SSE on.
My function is never slower than fsin, even though fsin is usually more accurate. But considering that fsin never outperforms my sin implementation, I'll keep my own for now; also, my sin is totally portable, whereas fsin is x86-only.
I need this for real-time computation, so I'll trade precision for speed; I think I'll be fine with 4-5 decimals of precision.
No to a table-based approach: I'm not using one, it pollutes the cache and makes everything slower. No algorithms based on memory access or lookup tables, please.
If you need an approximation of sine optimized for absolute accuracy over -π … π, use:
x * (1 + x * x * (-0.1661251158026961831813227851437597220432 + x * x * (8.03943560729777481878247432892823524338e-3 + x * x * -1.4941402004593877749503989396238510717e-4)))
It can be implemented with:
float xx = x * x;
float s = x + (x * xx) * (-0.16612511580269618f + xx * (8.0394356072977748e-3f + xx * -1.49414020045938777495e-4f));
And perhaps optimized depending on the characteristics of your target architecture. Also, not noted in the linked blog post: if you are implementing this in assembly, do use the FMADD instruction. If implementing in C or C++ and you use, say, the fmaf() C99 standard function, make sure that an FMADD instruction is generated. The emulated version is much more expensive than a multiplication and an addition, because what fmaf() does is not exactly equivalent to a multiplication followed by an addition (so it would be incorrect to implement it that way).
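For instance, the two-line version above can be evaluated in Horner form with fmaf() like this (a sketch; compile with FMA enabled, e.g. -mfma, so each fmaf() maps to a single FMADD rather than the expensive emulation):
#include <cmath>

float sin_poly(float x)
{
    const float xx = x * x;
    // Horner evaluation of the odd polynomial above.
    float p = std::fmaf(xx, -1.49414020045938777495e-4f, 8.0394356072977748e-3f);
    p = std::fmaf(xx, p, -0.16612511580269618f);
    return std::fmaf(x * xx, p, x); // x + (x*xx)*p
}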
The difference between sin(x) and the above polynomial between -π and π is shown in an error graph (not reproduced here).
The polynomial is optimized to reduce the difference between it and sin(x) between -π and π, not just something that someone thought was a good idea.
If you only need the [-1 … 1] definition interval, then the polynomial can be made more accurate on that interval by ignoring the rest. Running the optimization algorithm again for this definition interval produces:
x * (1 + x * x * (-1.666659904470566774477504230733785739156e-1 + x * x * (8.329797530524482484880881032235130379746e-3 + x * x * (-1.928379009208489415662312713847811393721e-4))))
The absolute error graph is likewise not reproduced here.
If that is too accurate for you, it is possible to optimize a polynomial of lower degree for the same objective. Then the absolute error will be larger but you will save a multiplication or two.
If you're okay with an approximation (I'm assuming you are, if you're trying to beat hardware), you should take a look at Nick's sin implementation at DevMaster:
http://devmaster.net/posts/9648/fast-and-accurate-sine-cosine
He has two versions: a "fast & sloppy" method and a "slow & accurate" method. A couple of replies down, someone estimates the relative errors as 12% and 0.2%, respectively. I've done an implementation myself and find runtimes of 1/14 and 1/8 of the hardware times on my machine.
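From memory, the combined version from that post looks roughly like this (a sketch; the constants are the ones given in the post, and the input must already be reduced to [-π … π]):
#include <cmath>

float nick_sin(float x) // x in [-pi, pi]
{
    const float B =  4.0f / 3.14159265f;                 //  4/pi
    const float C = -4.0f / (3.14159265f * 3.14159265f); // -4/pi^2
    float y = B * x + C * x * std::fabs(x);              // "fast & sloppy" parabola
    const float P = 0.225f;
    return P * (y * std::fabs(y) - y) + y;               // "slow & accurate" refinement
}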
Hope that helps!
PS: If you do this yourself, you can refactor the slow/accurate method to avoid a multiplication and improve slightly over Nick's version, but I don't remember exactly how...