When a float variable goes out of the float limits, what happens? - c++

I noticed two things:
1. std::numeric_limits<float>::max() + (a small number) gives: std::numeric_limits<float>::max().
2. std::numeric_limits<float>::max() + (a large number, like std::numeric_limits<float>::max()/3) gives inf.
Why the difference? Does case 1 or 2 result in an overflow and thus in undefined behavior?
Edit: Code for testing this:
1.
float d = std::numeric_limits<float>::max();
float q = d + 100;
cout << "q: " << q << endl;
2.
float d = std::numeric_limits<float>::max();
float q = d + (d/3);
cout << "q: " << q << endl;

Formally, the behavior is undefined. On a machine with IEEE
floating point, however, overflow after rounding will result
in Inf. The precision is limited, though, so the result of
FLT_MAX + 1 after rounding is FLT_MAX.
You can see the same effect with values well under FLT_MAX.
Try something like:
float f1 = 1e20; // less than FLT_MAX
float f2 = f1 + 1.0;
if ( f1 == f2 ) ...
The if will evaluate to true, at least with IEEE arithmetic.
(There do exist, or at least have existed, machines where
float has enough precision for the if to evaluate to
false, but they aren't very common today.)
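To make the two cases from the question directly comparable, here is a minimal sketch, assuming IEEE-754 binary32 float: the small addend is absorbed by rounding, while the large one pushes the result past the largest finite value and rounds to infinity.
#include <iostream>
#include <limits>

int main() {
    float m = std::numeric_limits<float>::max();

    float q1 = m + 100.0f;   // 100 is far below half an ULP of FLT_MAX, so the sum rounds back to FLT_MAX
    float q2 = m + m / 3.0f; // exceeds the largest finite float, so the sum rounds to +inf

    std::cout << (q1 == m) << "\n"; // 1 on an IEEE-754 system
    std::cout << q2 << "\n";        // inf
}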

It depends on what you are doing. If the float "overflow" occurs in an expression that is directly returned, e.g.
return std::numeric_limits<float>::max() + std::numeric_limits<float>::max();
the operation might not result in an overflow. Quoting the C standard [ISO/IEC 9899:2011]:
The return statement is not an assignment. The overlap restriction of
subclause 6.5.16.1 does not apply to the case of function return. The
representation of floating-point values may have wider range or
precision than implied by the type; a cast may be used to remove this
extra range and precision.
See here for more details.

Related

Floating points in C++ (float and double)

I know that we shouldn't use floating-point values in loops. But could someone explain to me what happens when we have a loop and we add a small number to a large number until we reach a certain value that allows the loop to terminate?
I guess it might cause potential errors. But apart from that?
What would it look like with single-precision (float) and double-precision (double) floating-point numbers? I guess more rounding errors would appear in the double type. Could someone give me an example (ideally in C++), because I have no idea how to start with it...
I would be very grateful if you could provide me with a hint. Thanks!
In a C++ implementation using IEEE-754 arithmetic and the “single” (binary32) format for float, this code prints “count = 3”:
int count = 0;
for (float f = 0; f < .3f; f += .1f)
++count;
std::cout << "count = " << count << ".\n";
but this code prints “count = 4”:
int count = 0;
for (float f = 0; f < .33f; f += .11f)
++count;
std::cout << "count = " << count << ".\n";
In the first example, the source text .1f is converted to 0.100000001490116119384765625, which is the value representable in float that is closest to .1. The source text .3f is converted to 0.300000011920928955078125, the float value closest to .3. Adding this converted value for .1f to f produces 0.100000001490116119384765625, then 0.20000000298023223876953125, and then 0.300000011920928955078125, at which point f < .3f is false, and the loop stops.
In the second example, .11f is converted to 0.10999999940395355224609375, and .33f is converted to 0.3300000131130218505859375. In this case, adding the converted value of .11f to f produces 0.10999999940395355224609375, then 0.2199999988079071044921875, and then 0.329999983310699462890625. Note that, due to rounding, this result of adding .11f three times is 0.329999983310699462890625, which is less than .33f (0.3300000131130218505859375), so f < .33f is true, and the loop continues for another iteration.
This is similar to adding ⅓ in a two-digit decimal format with a loop bound of three-thirds (which is 1). If we had for (f = 0; f < 1; f += ⅓), the ⅓ in the source text would have to be converted to .33 (two-digit decimal). Then f would be stepped through .33, .66, and .99. The loop would not stop until it reached 1.32. The same rounding issues occur in binary floating-point arithmetic.
When the amount added in the loop is a small number relative to the large number, these rounding issues are greater. First, there will be more additions, so there will be more rounding errors, and they may accumulate. Second, since larger numbers require a larger exponent to scale them in the floating-point format, they have less absolute precision than smaller numbers. This means the roundings become larger relative to the small number that is being added. So the rounding errors are larger in magnitude.
Then, even if the loop eventually terminates, the values of f in each iteration may be far from the desired values, due to the accumulated errors. If f is used for calculations inside the loop, the calculations might not be using the desired values and may produce incorrect results.
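As a rough illustration, assuming IEEE-754 float: repeatedly adding a small step accumulates rounding error, while recomputing each value from an integer counter incurs only one rounding per value.
#include <iostream>
#include <iomanip>

int main() {
    float accumulated = 0.0f;
    for (int i = 0; i < 1000; ++i)
        accumulated += 0.001f;        // 1000 additions, each possibly rounded

    float recomputed = 1000 * 0.001f; // a single rounded multiplication

    std::cout << std::setprecision(9)
              << accumulated << "\n"  // close to, but generally not exactly, 1
              << recomputed << "\n";  // 1 (here 1000 * 0.001f happens to round to exactly 1.0f)
}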
With increasing values, the difference between two adjacent floating-point values increases too. There is a point where i + 1 results in the same value as i.
Consider this code:
#include <iostream>
int main()
{
float i = 0;
while (i != i + 1) i++;
std::cout << i << std::endl;
return 0;
}
while (i != i + 1) should be an endless loop, but for floating point variables, it is not.
The code above prints 1.67772e+07 on https://godbolt.org/z/7xf8n8
So, for (float f = 0; f < 2e7; f++) is an endless loop.
You can try it with double yourself; the value is larger.
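For double, the corresponding point is 2^53. A minimal check, assuming IEEE-754 binary64, rather than waiting for the loop:
#include <iostream>

int main() {
    double d = 9007199254740992.0;              // 2^53
    std::cout << (d + 1.0 == d) << "\n";        // 1: adding 1 no longer changes the value
    std::cout << (d - 1.0 + 1.0 == d) << "\n";  // 1: below 2^53, integers are still exact
}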

Numerical accuracy of pow(a/b,x) vs pow(b/a,-x)

Is there a difference in accuracy between pow(a/b,x) and pow(b/a,-x)?
If there is, does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Edit: Let's assume x86_64 processor and gcc compiler.
Edit: I tried comparing using some random numbers. For example:
printf("%.20f",pow(8.72138221/1.761329479,-1.51231)) // 0.08898783049228660424
printf("%.20f",pow(1.761329479/8.72138221, 1.51231)) // 0.08898783049228659037
So, it looks like there is a difference (albeit minuscule in this case), but maybe someone who knows about the algorithm implementation could comment on what the maximum difference is, and under what conditions.
Here's one way to answer such questions: see how the floating-point arithmetic actually behaves. This is not a 100% rigorous way to analyze such questions, but it gives a general idea.
Let's generate random numbers. Calculate v0=pow(a/b, n) and v1=pow(b/a, -n) in float precision. And calculate ref=pow(a/b, n) in double precision, and round it to float. We use ref as a reference value (we suppose that double has much more precision than float, so we can trust that ref is the best possible value; this is true for IEEE-754 most of the time). Then sum the differences v0-ref and v1-ref. The difference should be measured as "the number of floating-point values between v and ref".
Note that the results may depend on the range of a, b and n (and on the quality of the random generator; if it's really bad, it may give a biased result). Here, I've used a=[0..1], b=[0..1] and n=[-2..2]. Furthermore, this answer supposes that the float and double algorithms for division/pow are of the same kind and have the same characteristics.
For my computer, the summed differences are 2604828 and 2603684, which means that there is no significant precision difference between the two.
Here's the code (note, this code supposes IEEE-754 arithmetic):
#include <cmath>
#include <stdio.h>
#include <stdlib.h>   // rand, RAND_MAX
#include <string.h>

// distance between two floats, measured in number of representable values
long long int diff(float a, float b) {
    unsigned int ai, bi;
    memcpy(&ai, &a, 4);
    memcpy(&bi, &b, 4);
    long long int diff = (long long int)ai - bi;
    if (diff < 0) diff = -diff;
    return diff;
}

int main() {
    long long int e0 = 0;
    long long int e1 = 0;
    for (int i = 0; i < 10000000; i++) {
        float a = 1.0f*rand()/RAND_MAX;
        float b = 1.0f*rand()/RAND_MAX;
        float n = 4.0f*rand()/RAND_MAX - 2.0f;
        if (a == 0 || b == 0) continue;
        float v0 = std::pow(a/b, n);
        float v1 = std::pow(b/a, -n);
        float ref = std::pow((double)a/b, n);   // double-precision reference, rounded to float
        e0 += diff(ref, v0);
        e1 += diff(ref, v1);
    }
    printf("%lld %lld\n", e0, e1);
}
... between pow(a/b,x) and pow(b/a,-x) ... does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Whichever division is more accurate.
Consider z = x^y = 2^(y * log2(x)).
Roughly: the error in y * log2(x) is magnified by the value of z to form the error in z. x^y is very sensitive to the error in x. The larger |log2(x)| is, the greater the concern.
In OP's case, both pow(a/b,p) and pow(b/a,-p), in general, have the same y * log2(x) and the same z, and similar errors in z. It is a question of how x and y are formed:
a/b and b/a, in general, both have the same error of +/- 0.5 unit in the last place, and so both approaches are of similar error.
Yet with select values of a/b vs. b/a, one quotient will be more exact, and it is that approach that has the lower pow() error.
pow(7777777/4, -p) can be expected to be more accurate than pow(4/7777777, p).
Lacking assurance about the error in the division, the general rule applies: no major difference.
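A minimal sketch of that last point, assuming IEEE-754 and a platform (such as x86_64/gcc) where long double has more precision than double and can serve as a rough reference:
#include <cmath>
#include <cstdio>

int main() {
    double p = 1.51231;
    double v0 = std::pow(7777777.0 / 4.0, -p);  // 7777777/4 is exact in double
    double v1 = std::pow(4.0 / 7777777.0, p);   // 4/7777777 must be rounded
    long double ref = std::pow(7777777.0L / 4.0L, -(long double)p);  // higher-precision reference

    // compare the magnitudes of the two deviations from the reference
    std::printf("v0 - ref = %Lg\n", (long double)v0 - ref);
    std::printf("v1 - ref = %Lg\n", (long double)v1 - ref);
}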
In general, the form with the positive power is slightly better, although by so little it will likely have no practical effect. Specific cases could be distinguished. For example, if either a or b is a power of two, it ought to be used as the denominator, as the division then has no rounding error.
In this answer, I assume IEEE-754 binary floating-point with round-to-nearest-ties-to-even and that the values involved are in the normal range of the floating-point format.
Given a, b, and x with values a, b, and x, and an implementation of pow that computes the representable value nearest the ideal mathematical value (actual implementations are generally not this good), pow(a/b, x) computes (a/b * (1+e0))^x * (1+e1), where e0 is the rounding error that occurs in the division and e1 is the rounding error that occurs in the pow, and pow(b/a, -x) computes (b/a * (1+e2))^(-x) * (1+e3), where e2 and e3 are the rounding errors in this division and this pow, respectively.
Each of the errors, e0…e3 lies in the interval [−u/2, u/2], where u is the unit of least precision (ULP) of 1 in the floating-point format. (The notation [p, q] is the interval containing all values from p to q, including p and q.) In case a result is near the edge of a binade (where the floating-point exponent changes and the significand is near 1), the lower bound may be −u/4. At this time, I will not analyze this case.
Rewriting, these are (a/b)^x * (1+e0)^x * (1+e1) and (a/b)^x * (1+e2)^(-x) * (1+e3). This reveals the primary difference is in (1+e0)^x versus (1+e2)^(-x). The 1+e1 versus 1+e3 is also a difference, but this is just the final rounding. [I may consider further analysis of this later but omit it for now.]
Consider (1+e0)^x and (1+e2)^(-x). The potential values of the first expression span [(1−u/2)^x, (1+u/2)^x], while the second spans [(1+u/2)^(-x), (1−u/2)^(-x)]. When x > 0, the second interval is longer than the first:
The length of the first is (1+u/2)^x − (1−u/2)^x.
The length of the second is (1/(1−u/2))^x − (1/(1+u/2))^x.
Multiplying the latter by (1−u^2/2^2)^x produces ((1−u^2/2^2)/(1−u/2))^x − ((1−u^2/2^2)/(1+u/2))^x = (1+u/2)^x − (1−u/2)^x, which is the length of the first interval.
1−u^2/2^2 < 1, so (1−u^2/2^2)^x < 1 for positive x.
Since the first length equals the second length times a number less than one, the first interval is shorter.
Thus, the form in which the exponent is positive is better in the sense that it has a shorter interval of potential results.
Nonetheless, this difference is very slight. I would not be surprised if it were unobservable in practice. Also, one might be concerned with the probability distribution of errors rather than the range of potential errors. I suspect this would also favor positive exponents.
For evaluation of rounding errors like in your case, it might be useful to use a multi-precision library, such as Boost.Multiprecision. Then, you can compare results for various precisions, e.g., with the following program:
#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>
namespace mp = boost::multiprecision;
template <typename FLOAT>
void comp() {
FLOAT a = 8.72138221;
FLOAT b = 1.761329479;
FLOAT c = 1.51231;
FLOAT e = mp::pow(a / b, -c);
FLOAT f = mp::pow(b / a, c);
std::cout << std::fixed << std::setw(40) << std::setprecision(40) << e << std::endl;
std::cout << std::fixed << std::setw(40) << std::setprecision(40) << f << std::endl;
}
int main() {
std::cout << "Double: " << std::endl;
comp<mp::cpp_bin_float_double>();
std::cout << std::endl;
std::cout << "Double extended: " << std::endl;
comp<mp::cpp_bin_float_double_extended>();
std::cout << std::endl;
std::cout << "Quad: " << std::endl;
comp<mp::cpp_bin_float_quad>();
std::cout << std::endl;
std::cout << "Dec-100: " << std::endl;
comp<mp::cpp_dec_float_100>();
std::cout << std::endl;
}
Its output reads, on my platform:
Double:
0.0889878304922865903670015086390776559711
0.0889878304922866181225771242679911665618
Double extended:
0.0889878304922865999079806265115166752366
0.0889878304922865999012043629334822725241
Quad:
0.0889878304922865999004910375213273866639
0.0889878304922865999004910375213273505527
Dec-100:
0.0889878304922865999004910375213273881004
0.0889878304922865999004910375213273881004
Live demo: https://wandbox.org/permlink/tAm4sBIoIuUy2lO6
For double, the first calculation was more accurate; however, I guess one cannot draw any general conclusions here.
Also, note that your input numbers are not exactly representable with the IEEE 754 double-precision floating-point type (none of them are). The question is whether you care about the accuracy of calculations with those exact numbers or with their closest representations.

Do there exist two numbers that, when multiplied (or divided), introduce error?

Here's the set of tests I'm doing to learn how the basic FP ops (+, -, *, /) can introduce errors:
#include <iostream>
#include <math.h>
int main() {
std::cout.precision(100);
double a = 0.499999999999999944488848768742172978818416595458984375;
double original = 47.9;
double target = original * a;
double back = target / a;
std::cout << original << std::endl;
std::cout << back << std::endl;
std::cout << fabs(original - back) << std::endl; // it's always 0.0 for the tests I did
}
Can you show me two values (original and a) that, once * (or /) is applied, introduce an error due to FP math?
And if such values exist, is it possible to establish whether that error is introduced by * or by /? And how? (since you need both to come back to the original value; 80 bit?)
With + it's easy (just add 0.499999999999999944488848768742172978818416595458984375 to 0.5, and you get 1.0, the same as for 0.5 + 0.5).
But I'm not able to do the same with * or /.
The output of:
#include <cstdio>
int main(void)
{
double a = 1000000000000.;
double b = 1000000000000.;
std::printf("a = %.99g.\n", a);
std::printf("a = %.99g.\n", b);
std::printf("a*b = %.99g.\n", a*b);
}
is:
a = 1000000000000.
a = 1000000000000.
a*b = 999999999999999983222784.
assuming IEEE-754 basic 64-bit binary floating-point with correct rounding to nearest, ties to even.
Obviously, 999999999999999983222784 differs from the exact mathematical result of 1000000000000•1000000000000, 1000000000000000000000000.
Multiply any two large† numbers, and there will likely be error, because representable values are widely spaced in the high range of values.
While this error can be great in absolute terms, it is still small in relation to the size of the number itself, so if you perform the reverse division, the error of the first operation is scaled down in the same ratio and disappears completely. As such, this sequence of operations is stable.
If the result of the multiplication would be greater than the maximum representable value, it overflows to infinity (this may depend on configuration), in which case the reverse division won't give back the original value but remains infinity.
Similarly, if you divide by a very large number, you will potentially underflow below the smallest representable value, resulting in either zero or a subnormal value.
† Numbers do not necessarily have to be huge. It's just easier to perceive the issue when considering huge values. The problem applies to quite small values as well. For example:
2.100000000000000088817841970012523233890533447265625 ×
2.100000000000000088817841970012523233890533447265625
Correct result:
4.410000000000000373034936274052605470949292688633679117285...
Example floating point result:
4.410000000000000142108547152020037174224853515625
Error:
2.30926389122032568296724439173008679117285652827862296732064351090230047702789306640625 × 10^-16
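A small sketch of both behaviours, assuming IEEE-754 binary64: the rounding error of a large multiplication cancels when you divide back, but once the product overflows to infinity nothing can be recovered.
#include <iostream>

int main() {
    double a = 1e12, b = 1e12;
    std::cout << (a * b / b == a) << "\n"; // 1: the rounded product divides back to exactly a here

    double big = 1e200;
    double prod = big * big;               // exceeds DBL_MAX, becomes +inf
    std::cout << prod << "\n";             // inf
    std::cout << prod / big << "\n";       // still inf; the original value is lost
}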
Do there exist two numbers that, when multiplied (or divided), introduce error?
This is much easier to see with "%a".
When the precision of the result is insufficient, rounding occurs. Typically double has 53 bits of binary precision. Multiplying the two 27-bit numbers below results in an exact 53-bit answer, but two 28-bit ones would need a 55-bit significand, which does not fit.
Division is easy to demo: just try 1.0/n*n.
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1 + 1.0/pow(2,26);
    printf("%.15a, %.17e\n", a, a);
    printf("%.15a, %.17e\n", a*a, a*a);
    double b = 1 + 1.0/pow(2,27);
    printf("%.15a, %.17e\n", b, b);
    printf("%.15a, %.17e\n", b*b, b*b);
    for (int n = 47; n < 52; n += 2) {
        volatile double frac = 1.0/n;
        printf("%.15a, %.17e %d\n", frac, frac, n);
        printf("%.15a, %.17e\n", frac*n, frac*n);
    }
    return 0;
}
Output
//v-------v 27 significant bits.
0x1.000000400000000p+0, 1.00000001490116119e+00
//v-------------v 53 significant bits.
0x1.000000800000100p+0, 1.00000002980232261e+00
//v-------v 28 significant bits.
0x1.000000200000000p+0, 1.00000000745058060e+00
//v--------------v not 55 significant bits.
0x1.000000400000000p+0, 1.00000001490116119e+00
// ^^^ all zeros here, not the expected mathematical answer.
0x1.5c9882b93105700p-6, 2.12765957446808505e-02 47
0x1.000000000000000p+0, 1.00000000000000000e+00
0x1.4e5e0a72f053900p-6, 2.04081632653061208e-02 49
0x1.fffffffffffff00p-1, 9.99999999999999889e-01 <==== Not 1.0
0x1.414141414141400p-6, 1.96078431372549017e-02 51
0x1.000000000000000p+0, 1.00000000000000000e+00

c++ float subtraction rounding error

I have a float value between 0 and 1. I need to convert it to the range -120 to 80.
To do this, first I multiply by 200, then I subtract 120.
When the subtraction is done, I get a rounding error.
Let's look at my example.
float val = 0.6050f;
val *= 200.f;
Now val is 121.0 as I expected.
val -= 120.0f;
Now val is 0.99999992
I thought maybe I could avoid this problem with multiplication and division.
float val = 0.6050f;
val *= 200.f;
val *= 100.f;
val -= 12000.0f;
val /= 100.f;
But it didn't help; I still have 0.99 on my hands.
Is there a solution for it?
Edit: After more detailed logging, I see that there is no problem with this part of the code. Before, my log showed me "0.605"; with detailed logging I saw "0.60499995946884155273437500000000000000000000000000".
The problem is in a different place.
Edit2: I think I found the culprit. The initialized value is 0.5750.
std::string floatToStr(double d)
{
std::stringstream ss;
ss << std::fixed << std::setprecision(15) << d;
return ss.str();
}
int main()
{
float val88 = 0.57500000000f;
std::cout << floatToStr(val88) << std::endl;
}
The result is 0.574999988079071
Actually I need to add and subtract 0.0025 to/from this value every time.
Normally I expected 0.575, 0.5775, 0.5800, 0.5825 ....
Edit3: Actually I tried all of this with double, and it works for my example.
std::string doubleToStr(double d)
{
std::stringstream ss;
ss << std::fixed << std::setprecision(15) << d;
return ss.str();
}
int main()
{
double val88 = 0.575;
std::cout << doubleToStr(val88) << std::endl;
val88 += 0.0025;
std::cout << doubleToStr(val88) << std::endl;
val88 += 0.0025;
std::cout << doubleToStr(val88) << std::endl;
val88 += 0.0025;
std::cout << doubleToStr(val88) << std::endl;
return 0;
}
The results are:
0.575000000000000
0.577500000000000
0.580000000000000
0.582500000000000
But I'm bound to float, unfortunately; I would need to change lots of things.
Thank you all for the help.
Edit4: I have found my solution with strings. I use the stringstream's rounding and convert back to double after that. I can get 4 correct digits of precision.
std::string doubleToStr(double d, int precision)
{
std::stringstream ss;
ss << std::fixed << std::setprecision(precision) << d;
return ss.str();
}
double val945 = (double)0.575f;
std::cout << doubleToStr(val945, 4) << std::endl;
std::cout << doubleToStr(val945, 15) << std::endl;
std::cout << atof(doubleToStr(val945, 4).c_str()) << std::endl;
and results are:
0.5750
0.574999988079071
0.575
Let us assume that your compiler implements IEEE 754 binary32 and binary64 exactly for float and double values and operations.
First, you must understand that 0.6050f does not represent the mathematical quantity 6050 / 10000. It is exactly 0.605000019073486328125, the nearest float to that. Even if you write perfect computations from there, you have to remember that these computations start from 0.605000019073486328125 and not from 0.6050.
Second, you can solve nearly all your accumulated roundoff problems by computing with double and converting to float only in the end:
$ cat t.c
#include <stdio.h>
int main(){
printf("0.6050f is %.53f\n", 0.6050f);
printf("%.53f\n", (float)((double)0.605f * 200. - 120.));
}
$ gcc t.c && ./a.out
0.6050f is 0.60500001907348632812500000000000000000000000000000000
1.00000381469726562500000000000000000000000000000000000
In the above code, the computation is carried out in double, and only the final result is converted back to float.
This 1.0000038… is a very good answer if you remember that you started with 0.605000019073486328125 and not 0.6050 (which doesn't exist as a float).
If you really care about the difference between 0.99999992 and 1.0, float is not precise enough for your application. You need to at least change to double.
If you need an answer in a specific range, and you are getting answers slightly outside that range but within rounding error of one of the ends, replace the answer with the appropriate range end.
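A minimal sketch of that idea, using C++17 std::clamp and the question's -120..80 range (toRange is just a hypothetical helper name):
#include <algorithm>

// Hypothetical helper: map a 0..1 value to -120..80, then snap results that
// rounding pushed slightly past an end back into the intended range.
float toRange(float v) {
    float mapped = v * 200.0f - 120.0f;
    return std::clamp(mapped, -120.0f, 80.0f);
}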
The point everybody is making can be summarised: in general, floating point is precise but not exact.
How precise is governed by the number of bits in the mantissa -- which is 24 for float, and 53 for double (assuming IEEE 754 binary formats, which is pretty safe these days ! [1]).
If you are looking for an exact result, you have to be ready to deal with values that differ (ever so slightly) from that exact result, but...
(1) The Exact Binary Fraction Problem
...the first issue is whether the exact value you are looking for can be represented exactly in binary floating point form...
...and that is rare -- which is often a disappointing surprise.
The binary floating point representation of a given value can be exact, but only under the following, restricted circumstances:
the value is an integer, < 2^24 (float) or < 2^53 (double).
this is the simplest case, and perhaps obvious. Since you are looking for a result >= -120 and <= 80, this is sufficient.
or:
the value is an integer which is exactly divisible by some 2^n, and the quotient is then (as above) < 2^24 or < 2^53.
this includes the first rule, but is more general.
or:
the value has a fractional part, but when the value is multiplied by the smallest 2^n necessary to produce an integer, that integer is < 2^24 (float) or 2^53 (double).
This is the part which may come as a surprise.
Consider 27.01, which is a simple enough decimal value, and clearly well within the ~7 decimal digit precision of a float. Unfortunately, it does not have an exact binary floating point form -- you can multiply 27.01 by any 2^n you like, for example:
27.01 * (2^ 6) = 1728.64 (multiply by 64)
27.01 * (2^ 7) = 3457.28 (multiply by 128)
...
27.01 * (2^10) = 27658.24
...
27.01 * (2^20) = 28322037.76
...
27.01 * (2^25) = 906305208.32 (> 2^24 !)
and you never get an integer, let alone one < 2^24 or < 2^53.
Actually, all these rules boil down to one rule... if you can find an 'n' (positive or negative, integer) such that y = value * (2^n), and where y is an exact, odd integer, then value has an exact representation if y < 2^24 (float) or if y < 2^53 (double) -- assuming no under- or over-flow, which is another story.
This looks complicated, but the rule of thumb is simply: "very few decimal fractions can be represented exactly as binary fractions".
To illustrate how few, let us consider all the 4 digit decimal fractions, of which there are 10000, that is 0.0000 up to 0.9999 -- including the trivial, integer case 0.0000. We can enumerate how many of those have exact binary equivalents:
1: 0.0000 = 0/16 or 0/1
2: 0.0625 = 1/16
3: 0.1250 = 2/16 or 1/8
4: 0.1875 = 3/16
5: 0.2500 = 4/16 or 1/4
6: 0.3125 = 5/16
7: 0.3750 = 6/16 or 3/8
8: 0.4375 = 7/16
9: 0.5000 = 8/16 or 1/2
10: 0.5625 = 9/16
11: 0.6250 = 10/16 or 5/8
12: 0.6875 = 11/16
13: 0.7500 = 12/16 or 3/4
14: 0.8125 = 13/16
15: 0.8750 = 14/16 or 7/8
16: 0.9375 = 15/16
That's it ! Just 16/10000 possible 4 digit decimal fractions (including the trivial 0 case) have exact binary fraction equivalents, at any precision. All the other 9984/10000 possible decimal fractions give rise to recurring binary fractions. So, for 'n' digit decimal fractions only (2^n) / (10^n) can be represented exactly -- that's 1/(5^n) !!
This is, of course, because your decimal fraction is actually the rational x / (10^n)[2] and your binary fraction is y / (2^m) (for integer x, y, n and m), and for a given binary fraction to be exactly equal to a decimal fraction we must have:
y = (x / (10^n)) * (2^m)
= (x / ( 5^n)) * (2^(m-n))
which is only the case when x is an exact multiple of (5^n) -- for otherwise y is not an integer. (Noting that n <= m, assuming that x has no (spurious) trailing zeros, and hence n is as small as possible.)
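This 16/10000 count can be checked directly from the last condition: since 10^4 = 2^4 * 5^4, the fraction x/10000 is an exact binary fraction exactly when x is a multiple of 5^4 = 625. A quick sketch:
#include <iostream>

int main() {
    int exact = 0;
    for (int x = 0; x < 10000; ++x)
        if (x % 625 == 0)   // x/10000 reduces to y/2^m only when 5^4 divides x
            ++exact;
    std::cout << exact << "\n";  // 16
}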
(2) The Rounding Problem
The result of a floating point operation may need to be rounded to the precision of the destination variable. IEEE 754 requires that the operation is done as if there were no limit to the precision, and the ("true") result is then rounded to the nearest value at the precision of the destination. So, the final result is as precise as it can be... given the limitations on how precise the arguments are, and how precise the destination is... but not exact !
(With floats and doubles, 'C' may promote float arguments to double (or long double) before performing an operation, and the result of that will be rounded to double. The final result of an expression may then be a double (or long double), which is then rounded (again) if it is to be stored in a float variable. All of this adds to the fun ! See FLT_EVAL_METHOD for what your system does -- noting the default for a floating point constant is double.)
So, the other rules to remember are:
floating point values are not reals (they are, in fact, rationals with a limited denominator).
The precision of a floating point value may be large, but there are lots of real numbers that cannot be represented exactly !
floating point expressions are not algebra.
For example, converting from degrees to radians requires division by π. Any arithmetic with π has a problem ('cos it's irrational), and with floating point the value for π is rounded to whatever floating precision we are using. So, the conversion of (say) 27 (which is exact) degrees to radians involves division by 180 (which is exact) and multiplication by our "π". However exact the arguments, the division and the multiplication may round, so the result is may only approximate. Taking:
float pi = 3.14159265358979 ; /* plenty for float */
float x = 27.0 ;
float y = (x / 180.0) * pi ;
float z = (y / pi) * 180.0 ;
printf("z-x = %+6.3e\n", z-x) ;
my (pretty ordinary) machine gave: "z-x = +1.907e-06"... so, for our floating point:
x != (((x / 180.0) * pi) / pi) * 180 ;
at least, not for all x. In the case shown, the relative difference is small -- ~ 1.2 / (2^24) -- but not zero, which simple algebra might lead us to expect.
hence: floating point equality is a slippery notion.
For all the reasons above, the test x == y for two floating values is problematic. Depending on how x and y have been calculated, if you expect the two to be exactly the same, you may very well be sadly disappointed.
[1] There exists a standard for decimal floating point, but generally binary floating point is what people use.
[2] For any decimal fraction you can write down with a finite number of digits !
Even with double precision, you'll run into issues such as:
200. * .60499999999999992 = 120.99999999999997
It appears that you want some type of rounding so that 0.99999992 is rounded to 1.00000000 .
If the goal is to produce values to the nearest multiple of 1/1000, try:
#include <math.h>
val = (float) floor((200000.0f*val)-119999.5f)/1000.0f;
If the goal is to produce values to the nearest multiple of 1/200, try:
val = (float) floor((40000.0f*val)-23999.5f)/200.0f;
If the goal is to produce values to the nearest integer, try:
val = (float) floor((200.0f*val)-119.5f);
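For example, applying the last (nearest-integer) variant to the question's value should give 1 rather than 0.99999992, assuming IEEE-754 float:
#include <cmath>
#include <cstdio>

int main() {
    float val = 0.6050f;
    val = (float) std::floor((200.0f * val) - 119.5f);
    std::printf("%.8f\n", val);  // 1.00000000 on a typical IEEE-754 system
}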

c++ additive identity unsafe example ( a+0.0 != a )

In MSDN article, it mentions when fp:fast mode is enabled, operations like additive identity (a±0.0 = a, 0.0-a = -a) are unsafe. Is there any example that a+0 != a under such mode?
EDIT: As someone mentioned below, this sort of issue normally comes up when doing comparisons. My issue is from a comparison; the pseudocode looks like this:
for(i=0;i<v.len;i++)
{
sum+=v[i];
if( sum >= threshold) break;
}
It breaks after adding a value of 0 (v[i]). The v[i] is not from a calculation; it is assigned. I understand that if my v[i] came from a calculation then rounding might come into play, but why, even though I give v[i] a zero value, do I still get sum < threshold but sum + v[i] >= threshold?
The reason that it's "unsafe" is that what the compiler assumes to be zero may not really end up being zero, due to rounding errors.
Take this example, which adds two floats at the edge of the precision that 32-bit floats allow:
float a = 33554430, b = 16777215;
float x = a + b;
float y = x - a - b;
float z = 1;
z = z + y;
With fp:fast, the compiler says "since x = a + b, y = x - a - b = 0, so 'z + y' is just z". However, due to rounding errors, y actually ends up being -1, not 0. So you would get a different result without fp:fast.
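Here is a self-contained version of that example; the behaviour under /fp:fast depends on the compiler and optimization level, so the comments describe what strict IEEE evaluation gives.
#include <iostream>

int main() {
    float a = 33554430.0f, b = 16777215.0f;
    float x = a + b;      // rounds: the exact sum 50331645 is not representable in float
    float y = x - a - b;  // under strict IEEE evaluation this is -1, not 0
    float z = 1.0f;
    z = z + y;
    std::cout << y << " " << z << "\n"; // "-1 0" with strict evaluation; /fp:fast may fold y to 0 and print "0 1"
}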
It's not saying something 'fixed' like, "if you set /fp:fast, and variable a happens to be 3.12345, then a+0 might not be a". It's saying that when you set /fp:fast, the compiler will take shortcuts that mean that if you compute a+0, and then compare that to what you stored for a, there is no guarantee that they'll be the same.
There is a great write up on this class of problems (which are endemic to floating point calculations on computers) here: http://www.parashift.com/c++-faq-lite/floating-point-arith2.html
If a is -0.0, then a + 0.0 is +0.0.
it mentions when fp:fast mode is enabled, operations like additive identity (a±0.0 = a, 0.0-a = -a) is unsafe.
What that article says is
Any of the following (unsafe) algebraic rules may be employed by the optimizer when the fp:fast mode is enabled:
And then it lists a±0.0 = a and 0.0-a = -a
It is not saying that these identities are unsafe when fp:fast is enabled. It is saying that these identities are not true for IEEE 754 floating point arithmetic but that /fp:fast will optimize as though they are true.
I'm not certain of an example that shows that a + 0.0 == a to be false (except for NaN, obviously), but IEEE 754 has a lot of subtleties, such as when intermediate values should be truncated. One possibility is that if you have some expression that includes + 0.0, that might result in a requirement under IEEE 754 to do truncation on an intermediate value, but that /fp:fast will generate code that doesn't do the truncation, and consequently later results may differ from what is strictly required by IEEE 754.
Using Pascal Cuoq's info, here's a program that produces different output depending on /fp:fast:
#include <cmath>
#include <iostream>
int main() {
volatile double a = -0.0;
if (_copysign(1.0, a + 0.0) == _copysign(1.0, 0.0)) {
std::cout << "correct IEEE 754 results\n";
} else {
std::cout << "result not IEEE 754 conformant\n";
}
}
When built with /fp:fast, the program outputs "result not IEEE 754 conformant", while building with /fp:strict causes the program to output "correct IEEE 754 results".