Are FPU control functions relevant for x86_64 processors? - c++

I'm asking this question because I'm trying to achieve bitwise (hash) equality between Visual Studio 2017 (cl.exe) and gcc 5.4.0. The problematic function makes use of sin() and cos(). All variables are double, and FMA (fused multiply-add) is also relevant.
I've been reading extensively on SO and the web about floating point determinism, reproducibility, and lockstep MP game design. I'm aware that single-compiler, single-build determinism is not hard, but I am attempting 2-compiler, single-build determinism.
Efficiency is not a concern here. I just want the results to match.
I ask because I hope to narrow my concerns for what to test/try.
Are these things relevant for x86_64 processors and builds?
functions that control the x87 fpu
XPFPA_{DECLARE,RESTORE,etc}
"<"fpu_control.h>, _FPU_SINGLE, _FPU_DOUBLE, etc.
_controlfp_s(), _PC24, _PC53, _PC_64
I ask because I have read that platforms with SSE (x86_64) default to using SSE for floating point, so fpu control functions should be irrelevant?
I have found this and this to be most informative. This MSDN article says setting the floating point precision mask is not supported on x64 arch. And this SO post says SSE has fixed precision.
My testing has shown that /fp:{strict,precise,fast} are not changing the hashes. Neither is optimization level. So I'm hoping to narrow my focus to sin, cos.
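For context, this is roughly the kind of call in question (a minimal sketch of the MSVC-specific _controlfp_s API; per the MSDN article above, the precision-control mask is not supported on x64, so the second call should have no effect there):
#include <float.h>   // MSVC: _controlfp_s, _MCW_PC, _PC_53

int main() {
    unsigned int control = 0;
    _controlfp_s(&control, 0, 0);            // query the current control word
    _controlfp_s(&control, _PC_53, _MCW_PC); // request 53-bit x87 precision
    // On x64 the _MCW_PC mask is not supported, so this does not affect SSE2 arithmetic.
    return 0;
}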

Most floating point functions have to perform rounding one way or another. The C/C++ standard is rather vague on the subject, and IEEE conformance is not strict enough for trigonometric functions. In practice this means it is useless to try to squeeze correct rounding out of your compiler's default math implementation in a portable way.
For instance, the libm implementation of sin/cos used by gcc is written in assembly, the algorithm differs between architectures, and it most probably also depends on the version of the library.
You therefore have two possibilities:
implement your own sin/cos using only floating point operations with exact rounding (fused multiply-accumulate + Taylor series)
use a 3rd party library with strong rounding considerations
I personally use the MPFR library as a gold standard when dealing with rounding errors. There will be a runtime cost, although I never tried to benchmark it against libm performance.
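For illustration, a minimal sketch of getting a correctly rounded sin at double precision out of MPFR (assuming the library is installed and linked with -lmpfr -lgmp):
#include <mpfr.h>
#include <cstdio>

int main() {
    mpfr_t x, s;
    mpfr_init2(x, 53);                 // 53-bit precision = IEEE double significand
    mpfr_init2(s, 53);
    mpfr_set_d(x, 0.5, MPFR_RNDN);
    mpfr_sin(s, x, MPFR_RNDN);         // correctly rounded to nearest
    std::printf("%.17g\n", mpfr_get_d(s, MPFR_RNDN));
    mpfr_clear(x);
    mpfr_clear(s);
    return 0;
}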
Custom Implementation
Note that if you decide to implement it yourself, you need to choose the rounding mode and inform the compiler that it matters to you.
In C++ it is done this way:
#include <cfenv>
#include <stdexcept>   // std::runtime_error
#pragma STDC FENV_ACCESS ON
#pragma STDC FP_CONTRACT OFF
int main(int, char**) {
...
if(std::fesetround(FE_TONEAREST) != 0)   // fesetround returns 0 on success
    throw std::runtime_error("fesetround failed!");
...
}
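As a deliberately naive sketch of the first option above, here is a truncated Taylor series for sin evaluated with std::fma (illustrative only: it is plausible for small |x|, and a real implementation would need argument reduction and an error analysis):
#include <cmath>

double sin_taylor(double x) {
    const double x2 = x * x;
    // Horner evaluation of x - x^3/3! + x^5/5! - x^7/7! + x^9/9!
    double p =               1.0 / 362880.0;   //  1/9!
    p = std::fma(p, x2,     -1.0 / 5040.0);    // -1/7!
    p = std::fma(p, x2,      1.0 / 120.0);     //  1/5!
    p = std::fma(p, x2,     -1.0 / 6.0);       // -1/3!
    p = std::fma(p, x2,      1.0);
    return x * p;
}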

Related

Using Half Precision Floating Point on x86 CPUs

I intend to use half-precision floating-point in my code but I am not able to figure out how to declare them. For example, I want to do something like the following:
fp16 a_fp16;
bfloat a_bfloat;
However, the compiler does not seem to know these types (fp16 and bfloat are just dummy types, for demonstration purposes).
I remember reading that bfloat support was added in GCC 10, but I am not able to find it in the manual. I am especially interested in bfloat16 floating-point numbers.
Additional Questions -
Does FP16 have hardware support on Intel/AMD as of today? I think native hardware support has been there since Ivy Bridge itself. (https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture)
I wanted to confirm whether using FP16 will indeed increase FLOPs. I remember reading somewhere that all arithmetic operations on fp16 are internally converted to fp32 first, so it only affects cache footprint and bandwidth.
SIMD intrinsic support for half-precision floats, especially bfloat (I am aware of intrinsics like _mm256_mul_ph, but not sure how to pass the 16-bit FP datatype; would really appreciate it if someone could highlight this too).
Are these types added to the Intel compilers as well?
PS - Related Post - Half-precision floating-point arithmetic on Intel chips, but it does not cover declaring half-precision floating-point numbers.
TIA
Neither the C++ nor the C language has arithmetic types for half floats.
The GCC compiler supports half floats as a language extension. Quote from the documentation:
On x86 targets with SSE2 enabled, GCC supports half-precision (16-bit) floating point via the _Float16 type. For C++, x86 provides a builtin type named _Float16 which contains same data format as C.
...
On x86 targets with SSE2 enabled, without -mavx512fp16, all operations will be emulated by software emulation and the float instructions. The default behavior for FLT_EVAL_METHOD is to keep the intermediate result of the operation as 32-bit precision. This may lead to inconsistent behavior between software emulation and AVX512-FP16 instructions. Using -fexcess-precision=16 will force round back after each operation.
Using -mavx512fp16 will generate AVX512-FP16 instructions instead of software emulation. The default behavior of FLT_EVAL_METHOD is to round after each operation. The same is true with -fexcess-precision=standard and -mfpmath=sse. If there is no -mfpmath=sse, -fexcess-precision=standard alone does the same thing as before, It is useful for code that does not have _Float16 and runs on the x87 FPU.
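A minimal sketch of what a declaration looks like with that extension (assuming a recent GCC targeting x86 with SSE2; I believe GCC also exposes a __bf16 type for bfloat16, but its arithmetic support depends on the GCC version, so only _Float16 is shown here):
#include <cstdio>

int main() {
    _Float16 a = 1.0;        // GCC extension, not standard C++
    _Float16 b = 0.625;
    _Float16 c = a + b;      // emulated via float unless compiled with -mavx512fp16
    std::printf("%f\n", static_cast<double>(c));
    return 0;
}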

How to control CPU instructions used by Microsoft C Runtime Libraries?

Is it possible to control which CPU instruction sets are used by the MS C Runtime Library (Visual Studio 2013, 2015)? If I step into the disassembly for, say, cos(), the code compares against a precalculated set of CPU capabilities and then executes the function using the 'best' capabilities available on the CPU. The problem is that different instruction sets yield different results, so the results differ depending on the CPU architecture.
As an example, building a 64-bit executable of:
std::cout << std::setprecision(20) << cos(-0.61385470201194381) << std::endl;
On Haswell/Broadwell and later returns 0.81743370050726594 (same as x86). On older CPUs returns 0.81743370050726583.
The Runtime Library uses the FMA instruction set if available, executes a different implementation and yields the different results. Note that this is not affected by the compiler options selected in the application because the Runtime Libraries are provided pre-compiled. Also note that the floating point precision control function _controlfp() cannot control the precision of the 64-bit runtime.
Is it possible to control which instruction sets the Runtime Library uses so that the results can be more deterministic?
Is it possible to control which instruction sets the Runtime Library uses so that the results can be more deterministic?
No.
If you only use basic arithmetic (+,-,*,/,sqrt), and force your compiler to use strict IEEE754 arithmetic, then it should be perfectly reproducible. For other functions, such as cos, you're at the mercy of the libm implementation, which is not required to provide any accuracy guarantees. You will also see similar problems with BLAS libraries.
If you need perfect reproducibility, you have 2 paths:
Use a correctly-rounded math library, such as CRlibm (though I don't think the 2-argument functions such as pow have been proven correct).
Roll your own math functions, limiting yourself to arithmetic operations above (in that case, fdlibm might be a good start).

What happens to floating point numbers in the absence of an FPU?

If you are programming with the C language for a microprocessor that does not have an FPU, does the compiler signal errors when floating point literals and keywords are encountered (0.75, float, double, etc)?
Also, what happens if the result of an expression is fractional?
I understand that there are software libraries that are used so you can do floating-point math, but I am specifically wondering what the results will be if you did not use one.
Thanks.
A C implementation is required to implement the types float and double, and arithmetic expressions involving them. So if the compiler knows that the target architecture doesn't have floating-point ops then it must bring in a software library to do it. The compiler is allowed to link against an external library; it's also allowed to implement floating-point ops in software by itself as intrinsics, but it must somehow generate code to get it done.
If it doesn't do so [*] then it is not a conforming C implementation, so strictly speaking you're not "programming with the C language". You're programming with whatever your compiler docs tell you is available instead.
You'd hope that code involving float or double types will either fail to compile (because the compiler knows you're in a non-conforming mode and tells you) or else fail to link (because the compiler emits calls to emulation routines in the library, but the library is missing). But you're on your own as far as C is concerned, if you use something that isn't C.
I don't know the exact details (how old do I look?), but I imagine that back in the day if you took some code compiled for x87 then you might be able to link and load it on a system using an x86 with no FPU. Then the CPU would complain about an illegal instruction when you tried to execute it -- quite possibly the system would hang depending on what OS you were running. So the worst possible case is pretty bad.
what happens if the result of an expression is fractional?
The actual result of an expression won't matter, because the expression itself was either performed with integer operations (in which case the result is not fractional) or else with floating-point operations (in which case the problem arises before you even find out the result).
[*] or if you fail to specify the options to make it do so ;-)
Floating-point is a required part of the C language, according to the C standard. If the target hardware does not have floating-point instructions, then a C implementation must provide floating-point operations in some other way, such as by emulating them in software. (All calculations are just functions of bits. If you have elementary operations for manipulating bits and performing tests and branches, then you can compute any function that a general computer can.)
A compiler could provide a subset of C without floating-point, but then it would not be a standard-compliant C compiler.
Software floating point can take two forms:
a compiler may generate calls to built-in floating point functions directly - for example the operation 1.2 * 2.5 may invoke fmul( 1.2, 2.5 ),
alternatively for architectures that support an FPU, but for which some device variants may omit it, it is common to use FPU emulation. When an FP instruction is encountered an invalid instruction exception will occur and the exception handler will vector to code that emulates the instruction.
FPU emulation has the advantage that when the same code is executed on a device with a real FPU, it will be used automatically and accelerate execution. However, without an FPU there is usually a small overhead compared with a direct software implementation, so if the application is never expected to run on a device with an FPU, emulation might best be avoided if the compiler provides the option.
Software floating point is very much slower than hardware-supported floating point. Use of fixed-point techniques can improve performance with acceptable precision in many cases.
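For example, a minimal Q16.16 fixed-point sketch (illustrative only; a real fixed-point library would add rounding and saturation):
#include <cstdint>

// Q16.16: 16 integer bits, 16 fractional bits.
using fix16 = int32_t;

constexpr fix16  to_fix16(double v)   { return static_cast<fix16>(v * 65536.0); }
constexpr double from_fix16(fix16 v)  { return v / 65536.0; }

fix16 fix16_mul(fix16 a, fix16 b) {
    // Widen to 64 bits so the intermediate product does not overflow.
    return static_cast<fix16>((static_cast<int64_t>(a) * b) >> 16);
}

// Usage: from_fix16(fix16_mul(to_fix16(1.5), to_fix16(2.25))) == 3.375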
Typically, such a microprocessor comes with either a driver package or even a complete BSP (board support package, consisting of drivers and an OS linked together), both of which contain FP library routines.
The compiler replaces every floating-point operation with an equivalent function call. This should be taken into consideration, especially when invoking such operations iteratively (inside a for / while loop), since the compiler cannot apply loop-unrolling optimization as a result.
The result of not including the required libraries within the project would be linkage errors.

C++ compilers/platforms that don't use IEEE754 floating point

I'm working on updating a serialization library to add support for serializing floating point in a portable manner. Ideally I'd like to be able to test the code in an environment where IEEE754 isn't supported. Would it be sufficient to test using a soft-float library? Or any other suggestions about how I can properly test the code?
Free toolchains that you can find for ARM (embedded Linux) development mostly do not support hard-float operations, only soft-float. You could try with one of these (e.g. CodeSourcery) but you would need some kind of a platform to run the compiled code (real HW or QEMU).
Or if you would want to do the same but on x86 machine, take a look at: Using software floating point on x86 linux
Should your library work on a system where both hardware floating point and soft-float are not available ? If so, if you test using a compiler with soft-float, your code may not compile/work on such a system.
Personally, I would test the library on an ARM9 system with a gcc compiler without soft-float.
Not an answer to your actual question, but describing what you must do to solve the problem.
If you want to support "different" floating point formats, your code would have to understand the internal format of floats [unless you only support "same architecture both ends"], pick the floating point number apart into your own format [which of course may be IEEE-754, but beware of denormals, 128-bit long doubles, NaN, INFINITY, and other "exception values", and of course out of range numbers], and then put it back together to the format required by the other end. If you are not doing this, there is no point in hunting down a non-IEEE-754 system, because it won't work.
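A minimal sketch of that approach using only frexp/ldexp, so it does not assume the host format is IEEE-754 (finite values only; NaN, infinities and denormals need the separate handling mentioned above):
#include <cmath>
#include <cstdint>

// Encode a finite double as sign, 53-bit integer mantissa, and exponent.
struct PortableDouble { int sign; uint64_t mantissa; int exponent; };

PortableDouble encode(double v) {
    PortableDouble p{};
    p.sign = std::signbit(v) ? 1 : 0;
    double m = std::frexp(std::fabs(v), &p.exponent);        // m in [0.5, 1)
    p.mantissa = static_cast<uint64_t>(std::ldexp(m, 53));   // integer significand
    return p;
}

double decode(const PortableDouble& p) {
    double m = std::ldexp(static_cast<double>(p.mantissa), p.exponent - 53);
    return p.sign ? -m : m;
}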

How to control whether C math uses SSE2?

I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern, here's what happens for sin.
First there is a dispatch routine from a file called "disp_pentium4.inc". It checks if the variable ___use_sse2_mathfcns has been set; if so, calls __sin_pentium4, otherwise calls __sin_default.
__sin_pentium4 (in "sin_pentium4.asm") starts by transferring the argument from the x87 fpu to the xmm0 register, performs the calculation using SSE2 instructions, and loads the result back in the fpu.
__sin_default (in "sin.asm") keeps the variable on the x87 stack and simply calls fsin.
So in both cases, the operand is pushed on the x87 stack and returned on it as well, making it transparent to the caller, but if ___use_sse2_mathfcns is defined, the operation is actually performed in SSE2 rather than x87.
This behavior is very interesting to me because the x87 transcendental functions are notorious for having slightly different behaviors depending on the implementation, whereas a given piece of SSE2 code should always give reproducible results.
Is there a way to determine for certain, either at compile or run-time, that the SSE2 code path will be used? I am not proficient writing assembly, so if this involves writing any assembly, a code example would be appreciated.
I found the answer through careful investigation of math.h. This is controlled by a function called _set_SSE2_enable. This is a public symbol documented here:
Enables or disables the use of Streaming SIMD Extensions 2 (SSE2)
instructions in CRT math routines. (This function is not available on
x64 architectures because SSE2 is enabled by default.)
This causes the aforementioned ___use_sse2_mathfcns flag to be set to the provided value, effectively enabling or disabling use of the _pentium4 SSE2 routines.
The documentation mentions this affects only certain transcendental functions, but looking at the disassembly, this seems to affect every one of them.
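For reference, a minimal sketch of forcing the flag one way or the other (32-bit builds only, since per the documentation the function does not exist on x64):
#include <math.h>

int main() {
    // Returns nonzero if the SSE2 code paths are enabled after the call;
    // pass 0 to force the x87 paths instead.
    int enabled = _set_SSE2_enable(1);
    return enabled ? 0 : 1;
}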
Edit: stepping into every function reveals that they're all available in SSE2 except for the following:
fmod
sinh
cosh
tanh
sqrt
Sqrt is the biggest offender, but it's trivial to implement in SSE2 using intrinsics. For the others, there's no simple solution except perhaps using a third-party library, but I can probably do without.
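For instance, a sketch of forcing sqrt through the SSE2 sqrtsd instruction via intrinsics, bypassing the CRT dispatch entirely:
#include <emmintrin.h>   // SSE2 intrinsics

double sqrt_sse2(double x) {
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));   // emits sqrtsd
}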
Why not use your own library instead of the C runtime? This would provide an even stronger guarantee of consistency across computers (presumably the C runtime is provided as a DLL and might change slightly in time).
I would recommend CRlibm. If you are already targeting SSE2, and as long as you did not intend to change the FPU's rounding mode, you are in the ideal conditions to use it, and you won't find a more accurate implementation.
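A minimal sketch of what using it looks like (assuming the library is built and linked as -lcrlibm; crlibm_init() sets up the FPU state the library expects):
#include <crlibm.h>
#include <cstdio>

int main() {
    crlibm_init();                                 // set required FPU state
    double s = sin_rn(-0.61385470201194381);       // correctly rounded to nearest
    std::printf("%.17g\n", s);
    return 0;
}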
The short answer is that you can't tell IN YOUR CODE for certain what the library will do, unless you are also involving library-implementation specific details. These would make the code completely unportable - even two different builds of the same compiler may change the internals of the library.
Of course, if portability isn't an issue, then using extern <type> ___use_sse2_mathfcns; and checking whether it's true would clearly work.
I expect that if the processor has SSE2 and you are using a modern enough library, it would use SSE2 wherever possible. But to say that for certain is a different matter.
If this is critical for your code, then implement your own transcendental functions and use those - that's the only way to guarantee the same result. Or, use some suitable inline assembler (or transcendental) code to calculate selected sin, cos, etc. values, and compare those with the sin() and cos() functions provided by the library.