I detected some differences in my program results between Release and Debug versions. After some research I realized that some floating point optimizations are causing those differences. I have solved the problem by using the fenv_access pragma for disabling some optimizations for some critical methods.
Thinking about it, I realized that it is probably better to use the fp:strict model instead of fp:precise in my program because of its characteristics, but I am worried about performance. I have tried to find information about the performance cost of fp:strict, or about the performance differences between the precise and strict models, but I have found very little.
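For reference, this is roughly the shape of what I did (a simplified sketch, not my actual code; the function is made up):

// #pragma fenv_access (on) tells the MSVC compiler that the code may access the
// floating-point environment, which suppresses some FP optimizations for the
// functions that follow it.
#pragma fenv_access (on)

double critical_sum(const double* values, int count)   // hypothetical critical method
{
    double acc = 0.0;
    for (int i = 0; i < count; ++i)
        acc += values[i];
    return acc;
}

#pragma fenv_access (off)   // restore the default for the rest of the file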
Does anyone know anything about this??
Thanks in advance.
This happens because you are compiling in 32-bit mode, which uses the x87 floating point unit. The code optimizer removes redundant moves from the FPU registers to memory and back, leaving intermediate results on the FPU stack. A pretty important optimization.
Problem is, the FPU stores values with 80 bits of precision instead of the 64 bits of a double. Intel originally considered this a feature, producing more accurate intermediate calculations, but it is really a problem. They didn't make the same mistake when they designed the SSE2 instruction set, which 64-bit compilers use for floating point math: there, double arithmetic is carried out at 64-bit precision.
So in the Release build you get subtly different results, since the calculations are performed with more bits. This should never matter in a program that uses floating point values for its calculations; a double can only store about 15 significant digits anyway. What's different are the noise digits, the ones beyond the first 15, though you can be left with fewer correct digits if your calculation loses significant digits badly, like computing 1 - 3 * (1/3.0).
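A tiny self-contained illustration of those noise digits (a minimal sketch, assuming plain IEEE doubles):

#include <cstdio>

int main()
{
    // Mathematically this is 0, but the subtraction cancels almost every
    // significant bit, so what is left is pure rounding noise.
    double third = 1.0 / 3.0;
    double noise = 1.0 - 3.0 * third;
    std::printf("%.17g\n", noise);      // 0 or a tiny nonzero value (~1e-16 or smaller),
                                        // depending on whether intermediates used 64 or 80 bits
    std::printf("%.17g\n", 0.1 + 0.2);  // 0.30000000000000004: digits past the 15th are noise
}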
But yeah, you can use fp:precise to get consistent noise digits. It forces the intermediate values to be flushed to memory so they cannot remain in the FPU with 80 bits of precision. It makes your code slow of course.
I am not sure if this is a solution, but it is what I have :)
As I posted previously, I wrote a test program that performs floating point operations which are said to be optimized under fp:precise but not under fp:strict, and then measured performance. I ran it 10000 times and, on average, fp:strict is 2.85% slower than fp:precise.
Just offering my two cents:
I have an image processing program that auto-vectorizes; the aim was to compare performance and accuracy, taking MATLAB as a gold standard.
Using VS2012 and an Intel i950.
Critical region error and runtime:
strict:  2.3328196e-02, 465 ms
precise: 7.1277611e-02, 182 ms
fast:    7.1277611e-02, 188 ms
The strict build did not vectorize.
Using strict slowed the code down by 2x, which was not acceptable.
It's absolutely normal to see a performance difference between a Debug and a Release version.
The compiler and runtimes do a lot of additional sanity checks in the debug version; don't compare one to the other, especially with regard to performance; compare Release vs. Release with different compiler switches.
On the other hand, if the results are different between the two versions, then you will have to go in and check for programming errors (most probably).
Max.
I keep getting warnings from compute shader compilation recommending that I use uints instead of ints when dividing.
From the data type alone I would assume uints are faster; however, various tests online seem to point to the contrary. Perhaps this contradiction holds only on the CPU side, and GPU parallelisation has some advantage I am not aware of?
(Or is it just bad advice?)
I know that this is an extremely late answer, but this is a question that has come up for me as well, and I wanted to provide some information for anyone who sees this in the future.
I recently found this resource - https://arxiv.org/pdf/1905.08778.pdf
The table at the bottom lists the latency of basic operations on several graphics cards. There is a small but consistent savings to be found by using uints on all measured hardware. However, what the warning doesn't state is that the greater optimization is to be found by replacing division with multiplication if at all possible.
https://www.slideshare.net/DevCentralAMD/lowlevel-shader-optimization-for-nextgen-and-dx11-by-emil-persson states that type conversion is a full-rate operation like int/float subtraction, addition, and multiplication, whereas division is very slow.
I've seen it suggested that to improve performance, one should convert to float, divide, then convert back to int, but as shown in the first source, this will at best give you small gains and at worst actually decrease performance.
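As a rough sketch of the division-to-multiplication idea (my own example; the 255 divisor is arbitrary), when the divisor is a compile-time constant you can multiply by its precomputed reciprocal instead, at the cost of a possible difference in the last bit:

// Sketch only: dividing by a constant vs. multiplying by its reciprocal.
// The reciprocal is folded at compile time, so the per-element cost is one multiply.
float scale_div(float x) { return x / 255.0f; }
float scale_mul(float x) { return x * (1.0f / 255.0f); }   // may differ in the last ulp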
You are correct that this differs from the performance of the same operations on the CPU, although I'm not entirely certain why.
Looking at https://www.agner.org/optimize/instruction_tables.pdf it appears that which operation is faster (MUL vs IMUL) varies from CPU to CPU - in a few at the top of the list IMUL is actually faster, despite a higher instruction count. Other CPUs don't provide a distinction between MUL and IMUL at all.
TL;DR uint division is faster on the GPU, but on the CPU YMMV
We have a measurement data processing application and currently all data is held as C++ float which means 32bit/4byte on our x86/Windows platform. (32bit Windows Application).
Since precision is becoming an issue, there have been discussions to move to another datatype. The options currently discussed are switching to double (8byte) or implementing a fixed decimal type on top of __int64 (8byte).
The reason the fixed-decimal solution using __int64 as underlying type is even discussed is that someone claimed that double performance is (still) significantly worse than processing floats and that we might see significant performance benefits using a native integer type to store our numbers. (Note that we really would be fine with fixed decimal precision, although the code would obviously become more complex.)
Obviously we need to benchmark in the end, but I would like to ask whether the statement that doubles are worse holds any truth looking at modern processors. I guess for large arrays doubles may mess up cache hits more than floats, but otherwise I really fail to see how they could differ in performance.
It depends on what you do. Additions, subtractions and multiplies on double are just as fast as on float on current x86 and POWER architecture processors. Divisions, square roots and transcendental functions (exp, log, sin, cos, etc.) are usually notably slower with double arguments, since their runtime is dependent on the desired accuracy.
If you go fixed point, multiplies and divisions need to be implemented with long integer multiply / divide instructions which are usually slower than arithmetic on doubles (since processors aren't optimized as much for it). Even more so if you're running in 32 bit mode where a long 64 bit multiply with 128 bit results needs to be synthesized from several 32-bit long multiplies!
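To make that concrete, here is a minimal sketch of a 64-bit fixed-point multiply (assuming a Q32.32 format; __int128 is a GCC/Clang extension, on MSVC you would reach for _mul128 instead). The 128-bit intermediate is exactly what makes it more expensive than a double multiply:

#include <cstdint>

typedef int64_t fix64;   // assumed Q32.32 fixed point: 32 integer bits, 32 fraction bits

fix64 fix_mul(fix64 a, fix64 b)
{
    __int128 wide = (__int128)a * b;   // full 64x64 -> 128-bit product
    return (fix64)(wide >> 32);        // drop the extra fraction bits
                                       // (ignores overflow and rounding for brevity)
}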
Cache utilization is a red herring here. 64-bit integers and doubles are the same size - if you need more than 32 bits, you're gonna eat that penalty no matter what.
Look it up. Both AMD and Intel publish the instruction latencies for their CPUs in freely available PDF documents on their websites.
However, for the most part, performance won't be significantly different, for a couple of reasons:
when using the x87 FPU instead of SSE, all floating point operations are calculated at 80 bits precision internally, and then rounded off, which means that the actual computation is equally expensive for all floating-point types. The only cost is really memory-related then (in terms of CPU cache and memory bandwidth usage, and that's only an issue in float vs double, but irrelevant if you're comparing to int64)
with or without SSE, nearly all floating-point operations are pipelined. When using SSE, the double instructions may (I haven't looked this up) have a higher latency than their float equivalents, but the throughput is the same, so it should be possible to achieve similar performance with doubles.
It's also not a given that a fixed-point datatype would actually be faster either. It might, but the overhead of keeping this datatype consistent after some operations might outweigh the savings. Floating-point operations are fairly cheap on a modern CPU. They have a bit of latency, but as mentioned before, they're generally pipelined, potentially hiding this cost.
So my advice:
Write some quick tests. It shouldn't be that hard to write a program that performs a number of floating-point ops and then measures how much slower the double version is relative to the float one; a rough sketch follows after this list.
Look it up in the manuals, and see for yourself if there's any significant performance difference between float and double computations
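Something like the following would do as a starting point (a rough sketch, not a rigorous benchmark; the array size and dummy constant are arbitrary):

#include <chrono>
#include <cstdio>
#include <vector>

// Time the same multiply-add loop for float and for double.
template <typename T>
double time_sum(const std::vector<T>& data)
{
    auto start = std::chrono::steady_clock::now();
    T acc = 0;
    for (T v : data)
        acc += v * T(1.0001);                     // one multiply and one add per element
    auto stop = std::chrono::steady_clock::now();
    std::printf("checksum %g\n", (double)acc);    // keep the result live so the loop isn't removed
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    std::vector<float>  f(10000000, 1.0f);
    std::vector<double> d(10000000, 1.0);
    std::printf("float:  %.2f ms\n", time_sum(f));
    std::printf("double: %.2f ms\n", time_sum(d));
}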
I have trouble understanding the rationale "since double is slower than float, we'll use 64-bit int". Guessing at performance has always been a black art requiring a lot of experience, and on today's hardware it is even worse considering the number of factors to take into account. Even measuring is difficult. I know of several cases where micro-benchmarks pointed to one solution but in-context measurement showed that another was better.
First note that two of the factors which have been given to explain the claimed slower double performance relative to float are not pertinent here: the bandwidth needed will be the same for double as for 64-bit int, and SSE2 vectorization would give an advantage to double...
Then consider that using integer computation will increase the pressure on the integer registers and execution units, while the floating-point ones apparently stay idle. (I've already seen cases where doing integer computation in double was a win, attributed to the additional execution units available.)
So I doubt that rolling your own fixed-point arithmetic would be advantageous over using double (but I could be proven wrong by measurements).
Implementing 64-bit fixed point isn't really fun, especially for more complex functions like sqrt or logarithm. Integers will probably still be a bit faster for simple operations like additions. And you'll need to deal with integer overflows. And you need to be careful when implementing rounding, or errors can easily accumulate.
We're implementing fixed point in a C# project because we need determinism, which floating point on .NET doesn't guarantee. And it's relatively painful. One formula contained x^3 and, bang, integer overflow. Unless you have really compelling reasons not to, use float or double instead of fixed point.
SIMD instructions from SSE2 complicate the comparison further, since they allow operating on several floating point numbers (4 floats or 2 doubles) at the same time. I'd use double and try to take advantage of these instructions. So double will probably be significantly slower than float, but comparing with ints is difficult, and I'd prefer float/double over fixed point in most scenarios.
It's always best to measure instead of guess. Yes, on many architectures calculations on doubles process twice the data of calculations on floats (and long doubles are slower still). However, as other answers, and comments on this answer, have pointed out, the x86 architecture doesn't follow the same rules as, say, ARM processors, SPARC processors, etc. On x86, floats, doubles and long doubles are all converted to long doubles for computation. I should have known this, because the conversion causes x86 results to be more accurate than SPARC's, and Sun went through a lot of trouble to get the less accurate results for Java, sparking some debate (note, that page is from 1998; things have since changed).
Additionally, calculations on doubles are built in to the CPU where calculations on a fixed decimal datatype would be written in software and potentially slower.
You should be able to find a decent fixed sized decimal library and compare.
With various SIMD instruction sets you can perform 4 single precision floating point operations at the same cost as one, essentially you pack 4 floats into a single 128 bit register. When switching to doubles you can only pack 2 doubles into these registers and hence you can only do two operations at the same time.
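A minimal sketch with SSE/SSE2 intrinsics, just to illustrate the packing (not taken from the original answer):

#include <emmintrin.h>   // SSE2 (also pulls in the SSE float types)

__m128  mul4(__m128 a, __m128 b)   { return _mm_mul_ps(a, b); }   // 4 floats per instruction
__m128d mul2(__m128d a, __m128d b) { return _mm_mul_pd(a, b); }   // 2 doubles per instruction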
As many people have said, a 64bit int is probably not worth it if double is an option. At least when SSE is available. This might be different on micro controllers of various kinds but I guess that is not your application. If you need additional precision in long sums of floats, you should keep in mind that this operation is sometimes problematic with floats and doubles and would be more exact on integers.
When I run the exact same code performing the exact same floating point calculation (using doubles) compiled on Windows and Solaris, I get slightly different results.
I am aware that the results are not exact due to rounding errors. However I would have expected the rounding errors to be platform-independent, thereby giving me the same (slightly incorrect) result on both platforms, which is not the case.
Is this normal, or do I have another problem in my code?
On x86, usually most calculations happen with 80-bit quantities, unless otherwise forced to be double-precision. Most other architectures I know of do all calculations in double-precision (again, unless otherwise overridden).
I don't know if you're running Solaris on SPARC or x86, but if the former, then I highly suspect that to be the cause of the difference.
The subject of your question suggests that it might depend on the compiler. It might, but the fact that you are running on different hardware (assuming your Solaris is not x86) suggests a much more likely reason for the difference - the difference in hardware itself.
Different hardware platforms might use completely different hardware devices (FPUs, CPUs) to perform floating-point calculations, arriving at different results.
Moreover, often the FPU units are configurable through some persistent settings, like the infinity model, rounding mode etc. Different hardware might have different default setups. The compiler will normally generate code that initializes the FPU at program startup, and that initial setup can be different as well.
Finally, different implementations of C++ language might implement floating-point semantics differently, so you might even get different results from different C++ compilers of the same hardware.
I believe that under Windows/x86, your code will run with the x87 precision already set to 53 bits (double precision), though I'm not sure exactly when this gets set. On Solaris/x86 the x87 FPU is likely to be using its default precision of 64 bits (extended precision), hence the difference.
There's a simple check you can do to detect which precision (53 bits or 64 bits) is being used: try computing something like 1e16 + 2.9999, while being careful to avoid compiler constant-folding optimizations (e.g., define a separate add function to do the addition, and turn off any optimizations that might inline functions). When using 53-bit precision (SSE2, or x87 in double-precision mode) this gives 1e16 + 2; when using 64-bit precision (x87 in extended precision mode) this gives 1e16 + 4. The latter result comes from an effect called 'double rounding', where the result of the addition is rounded first to 64 bits, and then to 53 bits. (Doing this calculation directly in Python, I get 1e16 + 4 on 32-bit Linux, and 1e16+2 on Windows, for exactly the same reason.)
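A rough version of that check (a sketch; the volatile qualifiers are only there to defeat constant folding):

#include <cstdio>

int main()
{
    volatile double a = 1e16, b = 2.9999;   // volatile keeps the compiler from folding the sum
    double result = a + b;
    std::printf("%.1f\n", result);
    // 53-bit precision (SSE2, or x87 set to double precision): 10000000000000002.0
    // 64-bit x87 extended precision (double rounding):         10000000000000004.0
}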
Here's a very nice article (that goes significantly beyond the oft-quoted Goldberg's "What every computer scientist should know...") that explains some of the problems arising from the use of the x87 FPU:
http://hal.archives-ouvertes.fr/docs/00/28/14/29/PDF/floating-point-article.pdf
We can use left shift operators in C/C++ as a faster way to multiply integers by powers of 2.
But we cannot use left shift operators on floats or doubles, because they are represented differently, with an exponent component and a mantissa component.
My question is:
Is there any way, like the left shift operator for integers, to multiply float numbers faster, even if only by powers of 2?
No, you can't. But depending on your problem, you might be able to use SIMD instructions to perform one operation on several packed variables. Read about the SSE2 instruction set.
http://en.wikipedia.org/wiki/SSE2
http://softpixel.com/~cwright/programming/simd/sse2.php
In any event, if you are optimizing floating-point multiplications, you are in 99% of the cases looking in the wrong place. Without going on a major rant regarding premature optimization, at least justify it by performing proper profiling.
You could do this:
float f = 5.0f;
int* i = (int*)&f;       // reinterpret the float's bits as an int (note: this breaks strict aliasing rules)
*i += 0x00800000;        // add 1 to the 8-bit exponent field, doubling the value: f is now 10.0f
But then you have the overhead of moving the float out of the register, into memory, then back into a different register, only to be flushed back to memory ... about 15 or so cycles more than if you'd just done fmul. Of course, that's even assuming your system has IEEE floats at all.
Don't try to optimize this. You should look at the rest of your program to find algorithmic optimizations instead of trying to discover ways to microoptimize things like floats. It will only end in blood and tears.
Truly, any decent compiler will recognize compile-time power-of-two constants and use the smartest operation.
In Microsoft Visual C++, don't forget the "floating point model" switch. The default is /fp:precise but you can change it to /fp:fast. The fast model trades some floating point accuracy for more speed. In some cases, the speedups can be drastic (the blog post referenced below notes speedups as high as x5 in some cases). Note that Xbox games are compiled with the /fp:fast switch by default.
I just switched from /fp:precise to /fp:fast on a math-heavy application of mine (with many float multiplications) and got an immediate 27% speedup with almost no loss in accuracy across my test suite.
Read the Microsoft blog post regarding the details of this switch here. It seems that the main reasons to not enable this would be if you need all the accuracy available (eg, games with large worlds, long-running simulations where errors may accumulate) or you need robust double or float NaN processing.
Lastly, also consider enabling the SSE2 instruction extensions. This gave an extra 3% boost in my application. The effects of this will vary depending on the number of operands in your arithmetic—for example, these extensions can provide speedup in cases where you are adding or multiplying more than 2 numbers together at a time.
My application is generating different floating point values when I compile it in release mode and in debug mode. The only reason I found out is that I save a binary trace log, and the one from the release build is ever so slightly off from the debug build; it looks like the bottom two bits of the 32-bit float values differ in about half of the cases.
Would you consider this "difference" to be a bug, or would this type of difference be expected? Would this be a compiler bug or an internal library bug?
For example:
LEFTPOS and SPACING are defined floating point values.
float def_x;
int xpos;
def_x = LEFTPOS + (xpos * (SPACING / 2));
The issue is in regards to the X360 compiler.
Release mode may have a different FP strategy set. There are different floating point arithmetic modes depending on the level of optimization you'd like. MSVC, for example, has strict, fast, and precise modes.
I know that on PC, floating point registers are 80 bits wide. So if a calculation is done entirely within the FPU, you get the benefit of 80 bits of precision. On the other hand, if an intermediate result is moved out into a normal register and back, it gets truncated to 32 bits, which gives different results.
Now consider that a release build will have optimisations which keep intermediate results in FPU registers, whereas a debug build will probably naively copy intermediate results back and forward between memory and registers - and there you have your difference in behaviour.
I don't know whether this happens on X360 too or not.
It's not a bug. Any floating point operation has a certain imprecision. In Release mode, optimization will change the order of the operations and you'll get a slightly different result. The difference should be small, though. If it's big, you might have other problems.
I helped a co-worker find a compiler switch that was different in release vs. debug builds that was causing his differences.
Take a look at /fp (Specify Floating-Point Behavior).
In addition to the different floating-point modes others have pointed out, SSE or similar vector optimizations may be turned on for release. Converting floating-point arithmetic from standard registers to vector registers can affect the lower bits of your results, as the vector registers will generally be narrower (fewer bits) than the standard floating-point registers.
Not a bug. This type of difference is to be expected.
For example, some platforms have float registers that use more bits than are stored in memory, so keeping a value in the register can yield a slightly different result compared to storing to memory and re-loading from memory.
This discrepancy may very well be caused by the compiler optimization, which is typically done in the release mode, but not in debug mode. For example, the compiler may reorder some of the operations to speed up execution, which can conceivably cause a slight difference in the floating point result.
So, I would say most likely it is not a bug. If you are really worried about this, try turning on optimization in the Debug mode.
Like others mentioned, floating point registers have higher precision than floats, so the accuracy of the final result depends on the register allocation.
If you need consistent results, you can make the variables volatile, which will result in slower, less precise, but consistent results.
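For example (an illustrative sketch only; the constants just mirror the names in the question):

#define LEFTPOS 10.0f   // stand-in values for illustration only
#define SPACING  2.5f

float compute_x(int xpos)
{
    // Forcing the intermediate through a volatile float makes the compiler store it
    // to memory (rounding it to 32 bits) instead of keeping it in an 80-bit register,
    // so Debug and Release see the same value for this intermediate.
    volatile float half = SPACING / 2;
    return LEFTPOS + xpos * half;
}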
If you set a compiler switch that allowed the compiler to reorder floating-point operations, -- e.g. /fp:fast -- then obviously it's not a bug.
If you didn't set any such switch, then it's a bug -- the C and C++ standards don't allow the compilers to reorder operations without your permission.