Is it possible to control which CPU instruction sets are used by the MS C Runtime Library (Visual Studio 2013, 2015)? If I step into the disassembly for, say, cos(), the code compares against a precalculated set of CPU capabilities and then executes the function using the 'best' capabilities available on the CPU. The problem is that different instruction sets yield different results, so the results differ depending on the CPU architecture.
As an example, building a 64-bit executable of:
std::cout << std::setprecision(20) << cos(-0.61385470201194381) << std::endl;
On Haswell/Broadwell and later returns 0.81743370050726594 (same as x86). On older CPUs returns 0.81743370050726583.
The Runtime Library uses the FMA instruction set if it is available, executing a different implementation and yielding different results. Note that this is not affected by the compiler options selected in the application, because the Runtime Libraries are provided pre-compiled. Also note that the floating point precision control function _controlfp() cannot control the precision of the 64-bit runtime.
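For reference, here is a minimal sketch (MSVC-specific, using __cpuid from <intrin.h>) of checking at run time whether the CPU reports FMA support, i.e. whether the CRT can take the differing code path described above:
#include <intrin.h>
#include <cstdio>

int main() {
    int info[4] = {0};
    __cpuid(info, 1);                            // CPUID leaf 1
    bool has_fma = (info[2] & (1 << 12)) != 0;   // ECX bit 12 reports FMA3
    std::printf("FMA supported: %s\n", has_fma ? "yes" : "no");
    return 0;
}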
Is it possible to control which instruction sets the Runtime Library uses so that the results can be more deterministic?
Is it possible to control which instruction sets the Runtime Library uses so that the results can be more deterministic?
No.
If you only use basic arithmetic (+, -, *, /, sqrt) and force your compiler to use strict IEEE 754 arithmetic, then it should be perfectly reproducible. For other functions, such as cos, you're at the mercy of the libm implementation, which is not required to provide any accuracy guarantees. You will also see similar problems with BLAS libraries.
If you need perfect reproducibility, you have 2 paths:
Use a correctly-rounded math library, such as CRlibm (though I don't think the 2-argument functions such as pow have been proven correct).
Roll your own math functions, limiting yourself to arithmetic operations above (in that case, fdlibm might be a good start).
Related
I'm asking this question because I'm trying to achieve bitwise (hash) equality between Visual Studio 2017 (cl.exe) and gcc 5.4.0. The problematic function makes use of sin() and cos(). All variables are double, and FMAD is also relevant.
I've been reading extensively on SO and the web about floating point determinism, reproducibility, and lockstep MP game design. I'm aware that single-compiler, single-build determinism is not hard, but I am attempting 2-compiler, single-build determinism.
Efficiency is not a concern here. I just want the results to match.
I ask because I hope to narrow my concerns for what to test/try.
Are these things relevant for x86_64 processors and builds?
functions that control the x87 fpu
XPFPA_{DECLARE,RESTORE,etc}
<fpu_control.h>, _FPU_SINGLE, _FPU_DOUBLE, etc.
_controlfp_s(), _PC_24, _PC_53, _PC_64
I ask because I have read that platforms with SSE (x86_64) default to using SSE for floating point, so fpu control functions should be irrelevant?
I have found this and this to be most informative. This MSDN article says setting the floating point precision mask is not supported on x64 arch. And this SO post says SSE has fixed precision.
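For what it's worth, here is a minimal sketch (MSVC-specific) of inspecting the control word with _controlfp_s without changing it; on x64 the precision-control bits (_MCW_PC) cannot be modified anyway:
#include <float.h>
#include <cstdio>

int main() {
    unsigned int current = 0;
    // mask = 0 means "change nothing"; we only read the current control word.
    errno_t err = _controlfp_s(&current, 0, 0);
    if (err == 0)
        std::printf("control word: 0x%08x\n", current);
    // Trying to set _PC_24 / _PC_53 through the _MCW_PC mask is not supported
    // on x64; SSE2 arithmetic has fixed per-type precision.
    return 0;
}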
My testing has shown that /fp:{strict,precise,fast} are not changing the hashes. Neither is optimization level. So I'm hoping to narrow my focus to sin, cos.
Most floating point functions have to perform rounding one way or another. The C/C++ standard is rather vague on the subject, and IEEE conformance is not strict enough on trigonometric functions. Which means that in practice it is useless to try to squeeze correct rounding out of your compiler's default math implementation in a portable way.
For instance, the libm implementation (used by gcc) of sin/cos is written in assembly and the algorithm is different for different architectures and most probably depends on the version of the library.
You therefore have two possibilities:
implement your own sin/cos using only floating point operations with exact rounding (fused multiply-accumulate + Taylor series); a sketch is given under Custom Implementation below
use a 3rd party library with strong rounding considerations
I personally use the MPFR library as a gold standard when dealing with rounding errors. There will be a runtime cost, although I never tried to benchmark it against libm performance.
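As an illustration of the second route, here is a minimal MPFR sketch (assuming MPFR/GMP are installed and linked with -lmpfr -lgmp): computing cos at 53-bit precision with round-to-nearest yields the correctly rounded double result, which you can compare against what each platform's libm returns. The argument value is arbitrary.
#include <mpfr.h>
#include <cmath>
#include <cstdio>

int main() {
    mpfr_t x, r;
    mpfr_init2(x, 53);                    // same precision as IEEE double
    mpfr_init2(r, 53);
    mpfr_set_d(x, -0.61385470201194381, MPFR_RNDN);
    mpfr_cos(r, x, MPFR_RNDN);            // correctly rounded at 53 bits
    mpfr_printf("reference: %.17Rg\n", r);
    std::printf("libm:      %.17g\n", std::cos(-0.61385470201194381));
    mpfr_clear(x);
    mpfr_clear(r);
    return 0;
}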
Custom Implementation
Note that if you decide to implement it yourself, you need to choose the rounding mode and inform the compiler that it matters to you.
In C++ it is done this way:
#include <cfenv>
#include <stdexcept>

#pragma STDC FENV_ACCESS ON
#pragma STDC FP_CONTRACT OFF

int main(int, char**) {
    // ...
    // fesetround returns 0 on success and nonzero on failure
    if (std::fesetround(FE_TONEAREST) != 0)
        throw std::runtime_error("fesetround failed!");
    // ...
}
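Building on that setup, here is a minimal sketch of the first option: a truncated Taylor polynomial for sin evaluated with std::fma, which is a single correctly rounded operation. The term count and the assumption that the argument has already been range-reduced to |x| <= pi/4 are illustrative choices, not a production-quality polynomial:
#include <cmath>

// Truncated Taylor series for sin(x), assuming |x| <= pi/4 after range reduction.
// With strict IEEE 754 double arithmetic every operation here is correctly
// rounded, so the result is reproducible across conforming platforms.
double sin_taylor(double x) {
    const double c3 = -1.0 / 6.0;
    const double c5 =  1.0 / 120.0;
    const double c7 = -1.0 / 5040.0;
    const double c9 =  1.0 / 362880.0;
    const double x2 = x * x;
    double p = std::fma(x2, c9, c7);   // c7 + x2*c9
    p = std::fma(x2, p, c5);           // c5 + x2*(...)
    p = std::fma(x2, p, c3);           // c3 + x2*(...)
    p = std::fma(x2, p, 1.0);          // 1  + x2*(...)
    return x * p;                      // x * (1 + x2*(c3 + x2*(c5 + ...)))
}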
If you are programming with the C language for a microprocessor that does not have an FPU, does the compiler signal errors when floating point literals and keywords are encountered (0.75, float, double, etc)?
Also, what happens if the result of an expression is fractional?
I understand that there are software libraries that are used so you can do floating-point math, but I am specifically wondering what the results will be if you did not use one.
Thanks.
A C implementation is required to implement the types float and double, and arithmetic expressions involving them. So if the compiler knows that the target architecture doesn't have floating-point ops, then it must bring in a software library to do it. The compiler is allowed to link against an external library, and it's also allowed to implement floating-point ops in software by itself as intrinsics, but it must somehow generate code to get it done.
If it doesn't do so [*] then it is not a conforming C implementation, so strictly speaking you're not "programming with the C language". You're programming with whatever your compiler docs tell you is available instead.
You'd hope that code involving float or double types will either fail to compile (because the compiler knows you're in a non-conforming mode and tells you) or else fail to link (because the compiler emits calls to emulation routines in the library, but the library is missing). But you're on your own as far as C is concerned, if you use something that isn't C.
I don't know the exact details (how old do I look?), but I imagine that back in the day if you took some code compiled for x87 then you might be able to link and load it on a system using an x86 with no FPU. Then the CPU would complain about an illegal instruction when you tried to execute it -- quite possibly the system would hang, depending on what OS you were running. So the worst possible case is pretty bad.
what happens if the result of an expression is fractional?
The actual result of an expression won't matter, because the expression itself was either performed with integer operations (in which case the result is not fractional) or else with floating-point operations (in which case the problem arises before you even find out the result).
[*] or if you fail to specify the options to make it do so ;-)
Floating-point is a required part of the C language, according to the C standard. If the target hardware does not have floating-point instructions, then a C implementation must provide floating-point operations in some other way, such as by emulating them in software. (All calculations are just functions of bits. If you have elementary operations for manipulating bits and performing tests and branches, then you can compute any function that a general computer can.)
A compiler could provide a subset of C without floating-point, but then it would not be a standard-compliant C compiler.
Software floating point can take two forms:
a compiler may generate calls to built-in floating-point functions directly - for example, the operation 1.2 * 2.5 may invoke a library routine such as fmul( 1.2, 2.5 ),
alternatively, for architectures that support an FPU but whose device variants may omit it, it is common to use FPU emulation. When an FP instruction is encountered, an invalid instruction exception occurs and the exception handler vectors to code that emulates the instruction.
FPU emulation has the advantage that when the same code is executed on a device with a real FPU, it will be used automatically and accelerate execution. However, without an FPU there is usually a small overhead compared with a direct software implementation, so if the application is never expected to run on an FPU, emulation might best be avoided if the compiler provides the option.
Software floating point is very much slower than hardware-supported floating point. Use of fixed-point techniques can improve performance with acceptable precision in many cases.
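As an illustration of the fixed-point alternative, here is a minimal Q16.16 sketch that uses only integer instructions (the format and helper names are illustrative):
#include <cstdint>
#include <cstdio>

// Q16.16 fixed point: the integer v represents the real value v / 65536.
using fix16 = int32_t;

fix16 fix_mul(fix16 a, fix16 b) {
    // Widen to 64 bits so the intermediate product cannot overflow,
    // then shift the extra 16 fractional bits back out.
    return static_cast<fix16>((static_cast<int64_t>(a) * b) >> 16);
}

int main() {
    fix16 x = 49152;            // 0.75  * 65536
    fix16 y = 163840;           // 2.5   * 65536
    fix16 p = fix_mul(x, y);    // expect 1.875 * 65536 = 122880
    std::printf("raw: %d = %d + %d/65536\n", p, p >> 16, p & 0xFFFF);
    return 0;
}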
Typically, such a microprocessor comes with either a driver package or even a complete BSP (board support package, consisting of drivers and an OS linked together), both of which contain FP library routines.
The compiler replaces every floating-point operation with an equivalent function call. This should be taken into consideration, especially when invoking such operations iteratively (inside a for / while loop), since the compiler cannot apply loop-unrolling optimization as a result.
The result of not including the required libraries within the project would be linkage errors.
I'm developing a cross-platform game which plays over a network using a lockstep model. As a brief overview, this means that only inputs are communicated, and all game logic is simulated on each client's computer. Therefore, consistency and determinism is very important.
I'm compiling the Windows version on MinGW32, which uses GCC 4.8.1, and on Linux I'm compiling using GCC 4.8.2.
What struck me recently was that, when my Linux version connected to my Windows version, the program would diverge, or de-sync, instantly, even though the same code was compiled on both machines! Turns out the problem was that the Linux build was being compiled as 64-bit, whereas the Windows version was 32-bit.
After compiling a Linux 32 bit version, I was thankfully relieved that the problem was resolved. However, it got me thinking and researching on floating point determinism.
This is what I've gathered:
A program will be generally consistent if it's:
ran on the same architecture
compiled using the same compiler
So if I assume, targeting a PC market, that everyone has an x86 processor, then that solves requirement one. However, the second requirement seems a little silly.
MinGW, GCC, and Clang (Windows, Linux, Mac, respectively) are all different compilers that are based on or compatible with GCC. Does this mean it's impossible to achieve cross-platform determinism? Or is it only applicable to Visual C++ vs GCC?
As well, do the optimization flags -O1 or -O2 affect this determinism? Would it be safer to leave them off?
In the end, I have three questions to ask:
1) Is cross-platform determinism possible when using MinGW, GCC, and Clang for compilers?
2) What flags should be set across these compilers to ensure the most consistency between operating systems / CPUs?
3) Floating point accuracy isn't that important for me -- what's important is that the results are consistent. Is there any method of reducing floating point numbers to a lower precision (like 3-4 decimal places) to ensure that the little rounding errors across systems become non-existent? (Every implementation I've tried to write so far has failed.)
Edit: I've done some cross-platform experiments.
Using floating-point values for velocity and position, I kept a Linux Intel laptop and a Windows AMD desktop computer in sync for up to 15 decimal places of the float values. Both systems are, however, x86_64. The test was simple, though -- it was just moving entities around over a network, trying to determine any visible error.
Would it make sense to assume that the same results would hold if a x86 computer were to connect to a x86_64 computer? (32 bit vs 64 bit Operating System)
Cross-platform and cross-compiler consistency is of course possible. Anything is possible given enough knowledge and time! But it might be very hard, or very time-consuming, or indeed impractical.
Here are the problems I can foresee, in no particular order:
Remember that even an extremely small error of plus-or-minus 1/10^15 can blow up to become significant (multiply a number carrying that error margin by one billion, and now you have a plus-or-minus 0.000001 error, which might be significant). These errors can accumulate over time, over many frames, until you have a desynchronized simulation. Or they can manifest when you compare values (even naively using "epsilons" in floating-point comparisons might not help; it may only displace or delay the manifestation).
The above problem is not unique to distributed deterministic simulations (like yours). It touches on the issue of "numerical stability", which is a difficult and often neglected subject.
Different compiler optimization switches, and different floating-point behavior switches, might lead to the compiler generating slightly different sequences of CPU instructions for the same statements. Obviously these must be the same across compilations, using exactly the same compilers, or the generated code must be rigorously compared and verified.
32-bit and 64-bit programs (note: I'm saying programs and not CPUs) will probably exhibit slightly different floating-point behaviors. By default, 32-bit programs cannot rely on anything more advanced than the x87 instruction set from the CPU (no SSE, SSE2, AVX, etc.) unless you specify this on the compiler command line (or use intrinsics/inline assembly instructions in your code). On the other hand, a 64-bit program is guaranteed to run on a CPU with SSE2 support, so the compiler will use those instructions by default (again, unless overridden by the user). While x87 and SSE2 float datatypes and operations on them are similar, they are - AFAIK - not identical, which will lead to inconsistencies in the simulation if one program uses one instruction set and another program uses another.
The x87 instruction set includes a "control word" register, which contains flags that control some aspects of floating-point operations (e.g. exact rounding behavior, etc.). This is a runtime thing: your program can do one set of calculations, then change this register, and after that do the exact same calculations and get a different result. Obviously, this register must be checked and handled and kept identical on the different machines. It is possible for the compiler (or the libraries you use in your program) to generate code that changes these flags at runtime, inconsistently across the programs.
Again, in the case of the x87 instruction set, Intel and AMD have historically implemented things a little differently. For example, one vendor's CPU might internally do some calculations using more bits (and therefore arrive at a more accurate result) than the other, which means that if you happen to run on two different CPUs (both x86) from two different vendors, the results of simple calculations might not be the same. I don't know how and under what circumstances these higher-accuracy calculations are enabled, and whether they happen under normal operating conditions or you have to ask for them specifically, but I do know these discrepancies exist.
Random number generation, done consistently and deterministically across programs, has nothing to do with floating-point consistency. It's important and a source of many bugs, but in the end it's just a few more bits of state that you have to keep synched.
And here are a couple of techniques that might help:
Some projects use "fixed-point" numbers and fixed-point arithmetic to avoid rounding errors and general unpredictability of floating-point numbers. Read the Wikipedia article for more information and external links.
In one of my own projects, during development, I used to hash all the relevant state (including a lot of floating-point numbers) in all the instances of the game and send the hash across the network each frame to make sure even one bit of that state wasn't different on different machines. This also helped with debugging, where instead of trusting my eyes to see when and where inconsistencies existed (which wouldn't tell me where they originated, anyways) I would know the instant some part of the state of the game on one machine started diverging from the others, and know exactly what it was (if the hash check failed, I would stop the simulation and start comparing the whole state.)
This feature was implemented in that codebase from the beginning, and was used only during the development process to help with debugging (because it had performance and memory costs.)
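Here is a minimal sketch of that idea: hash the exact bit patterns of the doubles (not any rounded text form), so even a one-bit divergence between peers shows up immediately. FNV-1a is used purely for brevity; the state layout is illustrative.
#include <cstdint>
#include <cstring>
#include <vector>

// Hash the bit patterns of the relevant simulation state, in a fixed order.
uint64_t hash_state(const std::vector<double>& state) {
    uint64_t h = 0xcbf29ce484222325ull;           // FNV-1a offset basis
    for (double d : state) {
        uint64_t bits;
        std::memcpy(&bits, &d, sizeof bits);      // exact bit pattern, no rounding
        for (int i = 0; i < 8; ++i) {
            h ^= (bits >> (8 * i)) & 0xFF;
            h *= 0x100000001b3ull;                // FNV-1a prime
        }
    }
    return h;
}
// Each frame, every peer computes hash_state(...) over the same ordered state and
// exchanges the 64-bit value; a mismatch pinpoints the first divergent frame.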
Update (in answer to first comment below): As I said in point 1, and others have said in other answers, that doesn't guarantee anything. If you do that, you might decrease the probability and frequency of an inconsistency occurring, but the likelihood doesn't become zero. If you don't analyze what's happening in your code and the possible sources of problems carefully and systematically, it is still possible to run into errors no matter how much you "round off" your numbers.
For example, if you have two numbers (e.g. as results of two calculations that were supposed to produce identical results) that are 1.111499999 and 1.111500001 and you round them to three decimal places, they become 1.111 and 1.112 respectively. The original numbers' difference was only 2E-9, but it has now become 1E-3. In fact, you have increased your error 500'000 times. And still they are not equal even with the rounding. You've exacerbated the problem.
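That effect is easy to demonstrate; a small sketch of exactly this case:
#include <cmath>
#include <cstdio>

int main() {
    double a = 1.111499999;
    double b = 1.111500001;
    // "Reduce to three decimal places", as proposed in question 3:
    double ra = std::round(a * 1000.0) / 1000.0;   // 1.111
    double rb = std::round(b * 1000.0) / 1000.0;   // 1.112
    std::printf("before: %.9f vs %.9f (difference 2e-9)\n", a, b);
    std::printf("after:  %.3f vs %.3f (difference 1e-3)\n", ra, rb);
    return 0;
}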
True, this doesn't happen much, and the examples I gave are two unlucky numbers to get in this situation, but it is still possible to find yourself with these kinds of numbers. And when you do, you're in trouble. The only sure-fire solution, even if you use fixed-point arithmetic or whatever, is to do rigorous and systematic mathematical analysis of all your possible problem areas and prove that they will remain consistent across programs.
Short of that, for us mere mortals, you need to have a water-tight way to monitor the situation and find exactly when and how the slightest discrepancies occur, to be able to solve the problem after the fact (instead of relying on your eyes to see problems in game animation or object movement or physical behavior.)
No, not in practice. For example, sin() might come from a library or from a compiler intrinsic, and differ in rounding. Sure, that's only one bit, but that's already out of sync. And that one bit error may add up over time, so even an imprecise comparison may not be sufficient.
N/A
You can't reduce FP precision for a given type, and I don't even see how it would help you. You'd turn the occasional 1E-6 difference into an occasional 1E-4 difference.
Beyond your concerns about determinism, I have another remark: if you are worried about calculation consistency on a distributed system, you may have a design issue.
You could think of your application as a bunch of nodes, each responsible for its own calculations. If information about another node is needed, it should be sent to you by that node.
1.)
In principle, cross-platform, cross-OS, cross-hardware compatibility is possible, but in practice it's a pain.
In general your results will depend on which OS you use, which compiler, and which hardware you use. Change any one of those and your results might change. You have to test all changes.
I use Qt Creator and qmake (cmake is probably better but qmake works for me) and test my code in MSVC on Windows, GCC on Linux, and MinGW-w64 on Windows. I test both 32-bit and 64-bit. This has to be done whenever code changes.
2.) and 3.)
In terms of floating point, some compilers will use x87 instead of SSE in 32-bit mode. For an example of the consequences when that happens, see "Why a number crunching program starts running much slower when diverges into NaNs?". All 64-bit systems have SSE, so I think most compilers use SSE/AVX in 64-bit mode; otherwise, e.g. in 32-bit mode, you might need to force SSE with something like -mfpmath=sse and -msse2.
But if you want a more compatible version of GCC on Windows, then I would use MinGW-w64 for 32-bit (aka MinGW-w32) or MinGW-w64 for 64-bit. This is not the same thing as MinGW (aka mingw32); the projects have diverged. MinGW depends on MSVCRT (the MSVC C runtime library) and MinGW-w64 does not. The Qt project has a pretty good description of MinGW-w64 and its installation: http://qt-project.org/wiki/MinGW-64-bit
You might also want to consider writing a CPU dispatcher for AVX and SSE (see "cpu dispatcher for visual studio").
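Here is a minimal sketch of such run-time dispatch using GCC's __builtin_cpu_supports; the kernel functions are hypothetical placeholders, and note that for bit-identical results across machines you would pin one code path rather than always picking the fastest one:
#include <cstdio>

// Hypothetical per-ISA variants of the same routine; in a real project these
// would live in separate translation units compiled with -mavx / -msse2.
static double kernel_avx(double x)  { return x * x; }
static double kernel_sse2(double x) { return x * x; }

int main() {
    __builtin_cpu_init();   // only strictly needed before constructors run, but harmless
    double (*kernel)(double) =
        __builtin_cpu_supports("avx") ? kernel_avx : kernel_sse2;
    std::printf("%f\n", kernel(3.0));
    return 0;
}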
I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern, here's what happens for sin.
First there is a dispatch routine from a file called "disp_pentium4.inc". It checks if the variable ___use_sse2_mathfcns has been set; if so, calls __sin_pentium4, otherwise calls __sin_default.
__sin_pentium4 (in "sin_pentium4.asm") starts by transferring the argument from the x87 fpu to the xmm0 register, performs the calculation using SSE2 instructions, and loads the result back in the fpu.
__sin_default (in "sin.asm") keeps the variable on the x87 stack and simply calls fsin.
So in both cases, the operand is pushed on the x87 stack and returned on it as well, making it transparent to the caller, but if ___use_sse2_mathfcns is defined, the operation is actually performed in SSE2 rather than x87.
This behavior is very interesting to me because the x87 transcendental functions are notorious for having slightly different behaviors depending on the implementation, whereas a given piece of SSE2 code should always give reproducible results.
Is there a way to determine for certain, either at compile or run-time, that the SSE2 code path will be used? I am not proficient writing assembly, so if this involves writing any assembly, a code example would be appreciated.
I found the answer through careful investigation of math.h. This is controlled by a method called _set_SSE2_enable. This is a public symbol documented here:
Enables or disables the use of Streaming SIMD Extensions 2 (SSE2) instructions in CRT math routines. (This function is not available on x64 architectures because SSE2 is enabled by default.)
This causes the aforementioned ___use_sse2_mathfcns flag to be set to the provided value, effectively enabling or disabling use of the _pentium4 SSE2 routines.
The documentation mentions this affects only certain transcendental functions, but looking at the disassembly, this seems to affect every one of them.
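For example, a minimal sketch (32-bit x86 MSVC only) of pinning the CRT onto the SSE2 path so the choice no longer depends on the host CPU at run time:
#include <math.h>
#include <cstdio>

int main() {
    // Returns nonzero if the SSE2 implementation was enabled, zero otherwise.
    // The function does not exist on x64, where SSE2 is always used.
    int enabled = _set_SSE2_enable(1);
    std::printf("SSE2 CRT math enabled: %d\n", enabled);
    std::printf("sin(0.5) = %.17g\n", sin(0.5));
    return 0;
}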
Edit: stepping into every function reveals that they're all available in SSE2 except for the following:
fmod
sinh
cosh
tanh
sqrt
Sqrt is the biggest offender, but it's trivial to implement in SSE2 using intrinsics. For the others, there's no simple solution except perhaps using a third-party library, but I can probably do without.
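For instance, a minimal sketch of a square root forced onto SSE2 via intrinsics:
#include <emmintrin.h>   // SSE2 intrinsics

// Computes sqrt(x) with the SSE2 sqrtsd instruction rather than x87 fsqrt.
// (IEEE 754 requires sqrt to be correctly rounded either way, so the two
// should agree; this just keeps everything on the SSE2 path.)
double sse2_sqrt(double x) {
    __m128d v = _mm_set_sd(x);       // put x in the low lane
    __m128d r = _mm_sqrt_sd(v, v);   // sqrt of the low lane of the second operand
    return _mm_cvtsd_f64(r);         // extract the low lane
}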
Why not use your own library instead of the C runtime? This would provide an even stronger guarantee of consistency across computers (presumably the C runtime is provided as a DLL and might change slightly in time).
I would recommend CRlibm. If you are already targeting SSE2, and as long as you did not intend to change the FPU's rounding mode, you are in the ideal conditions to use it, and you won't find a more accurate implementation.
The short answer is that you can't tell IN YOUR CODE for certain what the library will do, unless you are also involving library-implementation specific details. These would make the code completely unportable - even two different builds of the same compiler may change the internals of the library.
Of course, if portability isn't an issue, then using extern <type> ___use_sse2_mathfcns; and checking if it's true would clearly work.
I expect that if the processor has SSE2 and you are using a modern enough library, it would use SSE2 wherever possible. But to say that for certain is a different matter.
If this is critical for your code, then implement your own transcendental functions and use those - that's the only way to guarantee the same result. Or, use some suitable inline assembler (or transcendental) code to calculate selected sin, cos, etc values, and compare those with the sin() and cos() functions provided by the library.
I am writing a small application in C++ that runs on my host machine (Linux x86) and on a target machine (ARM).
The problem I have is that on the host machine my binary is about 700 KB in size, but on the target machine it is about 7 MB.
I am using the same compile switches for both platforms. My first thought was that a library on the target machine got linked statically, but I checked both binaries with objdump and both use the same dynamically linked libraries.
So can anyone give me hint how I can figure out why there is such a huge difference in size?
While different computer architectures can theoretically require completely different amounts of executable code for the same program, a factor of 10 is not really expected among modern architectures. ARM and x86 may be different, but they are still designed in the same universe, where memory and bandwidth are not something to waste, leading CPU designers to try to keep the executable code as tight as possible.
I would, therefore, look at the following possibilities, in order of probability:
Symbol stripping: if one of the two binaries has been stripped from its symbols, then it would be significantly smaller, especially if compiled with debugging information. You might want to try to strip both binaries and see what happens.
Static linking: I have occasionally encountered build systems for embedded targets that would prefer static linking over using shared libraries. Examining the library dependencies of each binary would probably detect this.
Additional enabled code: The larger binary may have additional code enabled because e.g. the build system found an additional optional library or because the target platform requires specific handling.
Still, a factor of 10 is probably too much for this, unless the smaller binary is missing a lot of functionality or the larger one has linked in some optional library statically.
Different compiler configuration: You should not only look at the compiler options that you supply, but also at the defaults the compiler uses for each target. For example, if the compiler has significantly higher inlining or loop unrolling limits on one architecture, the resulting executable could balloon noticeably.
First, there is no reason to expect the same code compiled for different architectures to have any size relationship. You can easily have A larger than B, then change one line of code, and suddenly B is larger than A.
Second, the "binaries" you are talking about are, I am guessing, ELF files, which contain the actual binary code plus anywhere from a little to a lot of overhead. The overhead can vary between architectures and other such things.
Bottom line: if you are compiling the same code for two architectures/platforms, or with different compilers or compile options for the same architecture, there is no reason to expect the file sizes to have any relationship to each other.
Different architectures can have completely different ways to handle the same thing. For example, loading an immediate value on a CISC (e.g. x86) architecture is usually one instruction, while on RISC (e.g. PPC, ARM) it usually takes more than one instruction, the actual number needed depending on the value. For example, if the instruction set only allows 16-bit immediate values, you may need up to 7 instructions to load a 64-bit value (loading 16 bits at a time and shifting in between the loads). Hence the code is inherently different.
One reason not mentioned so far, but relevant to ARM/x86 comparisons, is floating-point emulation. All x86 chips today come with native FP support (and x86-64 even with SIMD FP support via SSE), but ARM CPUs often lack an FP unit. That in turn means even a simple FP addition has to be turned into a long sequence of integer operations on exponents and mantissas.