round much slower than floor/ceil/int in LLVM - c++

I was benchmarking some essential routines by executing loops such as:
float *src, *dst;
for (int i=0; i<cnt; i++) dst[i] = round(src[i]);
All with an AVX2 target and the newest Clang. Interestingly, floor(x), ceil(x), int(x)... all seem fast. But round(x) seems extremely slow, and looking at the disassembly there's some weird spaghetti code instead of the newer SSE or AVX versions. Even when I block vectorization of the loops by introducing a dependency, round is about 10x slower. For floor etc. the generated code uses vroundss; for round there's the spaghetti code... Any ideas?
Edit: I'm using -ffast-math, -mfpmath=sse, -fno-math-errno, -O3, -std=c++17, -march=core-avx2 -mavx2 -mfma

The problem is that none of the SSE rounding modes specify the correct rounding for round:
These functions round x to the nearest integer, but round halfway cases away from zero
(regardless of the current rounding direction, see fenv(3)), instead of to the nearest
even integer like rint(3).
If you want faster code, you could try testing rint instead of round, as that specifies a rounding mode that SSE does support.

One thing to note is that an expression like floor(x + 0.5), while not having exactly the same semantics as round(x), is a valid substitute in almost all use cases, and I doubt it is anywhere near 10x slower than floor(x).
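As a quick sketch (function names here are mine, not from the post): with the flags listed in the question, the rint and floor variants below should vectorize to vroundps, while the round variant falls back to a library call.

#include <cmath>

// Ties away from zero: no direct SSE/AVX encoding, so the compiler calls a library routine.
void round_away(const float* src, float* dst, int cnt) {
    for (int i = 0; i < cnt; i++) dst[i] = std::round(src[i]);
}

// Rounds using the current rounding mode (ties to even by default): vectorizes to vroundps.
void round_rint(const float* src, float* dst, int cnt) {
    for (int i = 0; i < cnt; i++) dst[i] = std::rint(src[i]);
}

// floor(x + 0.5): close to round() for most inputs, also vectorizes.
void round_half_up(const float* src, float* dst, int cnt) {
    for (int i = 0; i < cnt; i++) dst[i] = std::floor(src[i] + 0.5f);
}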

Related

sine result depends on C++ compiler used

I use the two following C++ compilers:
cl.exe : Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24210 for x86
g++ : g++ (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010
When using the built-in sine function, I get different results. This is not critical, but sometimes the differences are too significant for my use. Here is an example with a 'hard-coded' value:
printf("%f\n", sin(5451939907183506432.0));
Result with cl.exe:
0.528463
Result with g++:
0.522491
I know that g++'s result is more accurate and that I could use an additional library to get this same result, but that's not my point here. I would really like to understand what happens here: why is cl.exe that far off?
Funny thing: if I apply a modulo of (2 * pi) to the parameter, then I get the same result as g++...
[EDIT] Just because my example looks crazy to some of you: this is part of a pseudorandom number generator. It is not important whether the result of the sine is accurate: we just need it to give some result.
You have a 19-digit literal, but double usually has 15-17 significant digits of precision. As a result, converting it to double introduces only a small relative error, but an absolute error that is large in the context of a sine calculation.
Actually, different implementations of the standard library differ in how they treat such large numbers. For example, in my environment, if we execute
std::cout << std::fixed << 5451939907183506432.0;
g++ result would be 5451939907183506432.000000
cl result would be 5451939907183506400.000000
The difference is because versions of cl earlier than 19 have a formatting algorithm that uses only a limited number of digits and fills the remaining decimal places with zero.
Furthermore, let's look at this code:
double a[1000];
for (int i = 0; i < 1000; ++i) {
    a[i] = sin(5451939907183506432.0);
}
double d = sin(5451939907183506432.0);
cout << a[500] << endl;
cout << d << endl;
When executed with my x86 VC++ compiler the output is:
0.522491
0.528463
It appears that when filling the array, sin is compiled to a call to __vdecl_sin2, and when there is a single operation, it is compiled to a call to __libm_sse2_sin_precise (with /fp:precise).
In my opinion, your number is too large for a sine calculation to expect the same behavior from different compilers, or correct behavior in general.
I think Sam's comment is closest to the mark. While you're using a recent-ish version of GCC/glibc, which implements sin() in software (computed at compile time for the literal in question), cl.exe for x86 likely uses the fsin instruction. The latter can be very imprecise, as described in the Random ASCII blog post, "Intel Underestimates Error Bounds by 1.3 quintillion".
Part of the problem with your example in particular is that Intel uses an imprecise approximation of pi when doing range reduction:
When doing range reduction from double-precision (53-bit mantissa) pi the results will have about 13 bits of precision (66 minus 53), for an error of up to 2^40 ULPs (53 minus 13).
According to cppreference:
The result may have little or no significance if the magnitude of arg is large
(until C++11)
It's possible that this is the cause of the problem, in which case you will want to manually do the modulo so that arg is not large.
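For illustration, a small self-contained sketch (mine, not from the answers) of doing the modulo manually before calling sin, as the questioner described. Note that fmod here still uses the double approximation of 2*pi, so this mostly makes the compilers agree with each other rather than making the result mathematically exact:

#include <cmath>
#include <cstdio>

int main() {
    const double x = 5451939907183506432.0;
    const double two_pi = 2.0 * 3.141592653589793;
    double reduced = std::fmod(x, two_pi);     // manual range reduction
    std::printf("%f\n", std::sin(reduced));
}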

Results (slightly) different after vectorization is enabled

One of our software products uses Eigen (3.2.5) to perform some matrix/vector-related computations. The software was developed carefully in this regard, starting by disabling all options and optimizations (including using -DEIGEN_DONT_VECTORIZE), and putting accuracy tests in place.
Since we are now interested in faster numerical throughput, we have started enabling vectorization inside Eigen. However, we have noticed that one of our tests now gives a slightly different output: the difference with the reference implementation is around 1e-4, while it was 1e-5 before.
We are going to loosen the precision requirement a bit in this test (because we don't really know the accuracy of the reference data, and we have another test case with synthetic data for which we have an exact solution and that still passes), but out of curiosity: what is a plausible cause for this variation?
In case it's relevant, this computation involves Euclidean norms.
This is to be expected because, when you enable vectorization, floating-point operations are not carried out in exactly the same order. This typically occurs for expressions involving reductions such as sums, norms, matrix products, etc. For instance, consider the following simple sum:
float s = 0;
for(int i=0;i<n;i++)
    s += v[i];
A vectorized version might look something like this (pseudocode):
Packet ps = {0,0,0,0};
for(int i=0;i<n;i+=4)
    ps += load_packet(&v[i]);
float s = ps[0]+ps[1]+ps[2]+ps[3];
Owing to roundoff errors, each version can return a different value. In Eigen, this aspect is even trickier because reductions are implemented in a way that maximizes instruction pipelining.
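A small self-contained demonstration (my own, not part of the answer) that re-associating a float sum in the way sketched above changes the result slightly:

#include <cstdio>
#include <vector>

int main() {
    std::vector<float> v(1000);
    for (int i = 0; i < 1000; ++i) v[i] = 0.1f * (i % 7);

    // Sequential sum, as the scalar loop does it.
    float s_seq = 0.f;
    for (float x : v) s_seq += x;

    // Four partial accumulators, as a 4-wide SIMD reduction does it.
    float acc[4] = {0.f, 0.f, 0.f, 0.f};
    for (int i = 0; i < 1000; i += 4)
        for (int k = 0; k < 4; ++k) acc[k] += v[i + k];
    float s_vec = acc[0] + acc[1] + acc[2] + acc[3];

    std::printf("%.7f\n%.7f\n", s_seq, s_vec);  // the two sums typically differ in the last digits
}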

Extent of G++ compiler optimization on non-commutative operations

I am concerned about the G++ optimizer's effect on arithmetic operations, specifically integer operations that are not necessarily commutative, e.g. * and /. This concern arose when I looked at a simple function in gdb that had been compiled with the -O3 flag set; it was all in all a better function, but its form was completely different than what it had been with no optimization: operations had been removed, and some had been relocated. Here is a simple function with which I will demonstrate the crux of my concern:
int ClipLower(int num, int dig){
    int Mult10 = 1;
    while (dig != 0){
        Mult10 *= 10, dig--;
    }
    return ((num / Mult10) * Mult10);
}
This function simply clips off the base-10 digits below digit 'dig'. My concern is: does the compiler consider things like the fact that math on integers is non-commutative? So, will the compiler try to reduce (num / Mult10) * Mult10 into num * 1, and of course discard the one?
I am aware that volatile will avoid this situation, but I would still like my code optimized as much as possible. So in essence I am asking whether the GNU optimizer understands that integer math is non-commutative, and furthermore how much of a concern optimization-gone-awry really is.
Also, here is the disassembly for the function at -O4; as you can see, the order of operations is fine:
13 return ((num / Mult10) * Mult10);
cltd
idiv %ecx
imul %ecx,%eax
ret
Amusingly, the compiler generated a load of no-operation instructions following the function, presumably as padding because it ended up so small.
Here is the list of flags that -O3 in g++ is equivalent to: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Now if you look carefully, there is also -Ofast, which is defined as -O3 plus some extra flags, notably -ffast-math. In the description of -ffast-math you can read:
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
This is done precisely to ensure that the default optimization levels do not violate rounding behavior and other floating-point standard guarantees.
There is also a related question on SO about why compilers don't optimize a*a*a*a*a*a to (a*a*a)^2; the answer is the same. (I cannot find the link atm =/)
Btw, Mult10 *= 10, dig--; are you trying to lose people following your code? =D
EDIT: Another aside: going above -O3 has no effect, except that some people say you might overflow some internal variable. I didn't test the overflow, but I'm fairly sure -O4 and -O100 are equivalent to -O3 at the time of writing.
Try it and look at the assembly
Optimization should not affect output, only speed. Rounding should be maintained. But bugs can occur, although much more rarely nowadays.
Generally issues are more likely with floating point. 2/7 with floats might vary slightly.
With ints it should always be 0, no matter what optimization, even if it is multiplied by 7.
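To make the point concrete, here is a tiny self-written check (not from the answers) that integer division truncates, which is exactly why the compiler must not "cancel" the division in ClipLower at any optimization level:

#include <cstdio>

int ClipLower(int num, int dig) {
    int Mult10 = 1;
    while (dig != 0) { Mult10 *= 10; dig--; }
    return (num / Mult10) * Mult10;
}

int main() {
    std::printf("%d\n", ClipLower(12345, 2));  // prints 12300, not 12345
    std::printf("%d\n", (2 / 7) * 7);          // prints 0 with ints
}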

What is floating point speculation and how does it differ from the compiler's floating point model

The Intel C++ compiler provides two options for controlling floating point:
-fp-speculation (fast/safe/strict/off)
-fp-model (precise/fast/strict and source/double/extended)
I think I understand what fp-model does. But what is fp-speculation, and how does it relate to fp-model? I have yet to find any Intel doc which explains this!
-fp-model influences how floating-point computations are carried out, and can change the numeric result (by licensing unsafe optimizations or by changing the precision at which intermediate results are evaluated).
-fp-speculation does not change the numerical results, but can affect what floating-point flags are raised by an operation (or what traps are taken if floating-point traps are enabled). 99.99% of programmers don't need to care about these things, so you can probably run with the default and not worry about it.
Here's a concrete example; suppose you have the following function:
double foo(double x) {
    // lots of computation
    if (x >= 0) return sqrt(x);
    else return x;
}
sqrt is, relatively speaking, slow. It would be nice to hoist the computation of sqrt(x) like this:
double foo(double x) {
    const double sqrtx = sqrt(x);
    // lots of computation
    if (x >= 0) return sqrtx;
    else return x;
}
By doing this, we allow the computation of sqrt to proceed simultaneously with other computations, reducing the latency of our function. However, there's a problem; if x is negative, then sqrt(x) raises the invalid flag. In the original program, this could never happen, because sqrt(x) was only computed if x was non-negative. In the modified program, sqrt(x) is computed unconditionally. Thus, if x is negative, the modified program raises the invalid flag, whereas the original program did not.
The -fp-speculation flag gives you a way to tell the compiler whether or not you care about these cases, so it knows whether or not it is licensed to make such transformations.
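As a minimal sketch of my own (not from the answer), the invalid flag described above can be observed from C++ with <cfenv>; the volatile qualifiers are only there to keep the compiler from folding the computation away:

#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double x = -1.0;
    volatile double r = std::sqrt(x);           // raises FE_INVALID, result is NaN
    if (std::fetestexcept(FE_INVALID))
        std::printf("invalid flag raised, r = %f\n", (double)r);
    return 0;
}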
Out of order execution and speculative execution can result in extraneous exceptions or raise exceptions at the wrong time.
If that matters to you, you can use the fp-speculation option to control speculation of floating-point instructions.
For (a little bit) more information: http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/copts/common_options/option_fp_speculation.htm
On Windows:
1. With the Intel compiler, floating-point calculations in a 32-bit application and a 64-bit application built from the same code can give different results, no matter what flags you choose.
2. With the Visual Studio compiler, floating-point calculations in 32-bit and 64-bit applications built from the same code give the same results.

Efficient way to project vectors onto unit box

I am reimplementing a Matlab function in C for performance reasons. Now, I am looking for the most efficient way to compute the projection of a vector onto the unit-box.
In C terms, I want to compute
double i = somevalue;
i = (i > 1.) ? 1. : i;
i = (i < -1.) ? -1. : i;
and since I have to do this operation several million times, I wonder what the most efficient way to achieve this would be.
If you're on 686, your compiler will likely transform the conditional into a CMOV instruction, which is probably fast enough.
See the question Fastest way to clamp a real (fixed/floating point) value? for experiments. @Spat also suggests the MINSS/MINSD and MAXSS/MAXSD instructions, which may be available as intrinsics for your compiler. They are SSE instructions and may be your best choice, again provided you're on 686.
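For what it's worth, here is a minimal sketch (mine, not from the answer) of the MINSD/MAXSD approach via intrinsics; on SSE2-capable hardware the plain ternary version usually compiles to the same minsd/maxsd pair anyway:

#include <emmintrin.h>

double clamp_unit(double v) {
    __m128d x = _mm_set_sd(v);
    x = _mm_min_sd(x, _mm_set_sd(1.0));     // min(v, 1.0)
    x = _mm_max_sd(x, _mm_set_sd(-1.0));    // max(..., -1.0)
    return _mm_cvtsd_f64(x);
}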
If you (or the compiler) use the IEEE 754 double format, I'd think reading the first bit (the sign bit) of the double's memory is probably the most direct way. Then you'd need no additional rounding or division operations.
Did you consider using SSE instructions to speed up your code?
Also, you could use OpenMP to parallelize your code, thus making it faster.