Code:
double v = (180*9.8)/(42*42); // v should be 1.000000
printf("%f ",v);
cout<<asin(v);
Output:
1.000000
nan
I am using 64-bit mingw (win 7).
This is because v is slightly greater than 1 when (180*9.8)/(42*42) is evaluated in double-precision floating point.
double v = (180*9.8)/(42*42);
std::cout.precision(20);
cout << fixed << v << endl;
Output:
1.00000000000000022204
nan
To work around the finite precision, you can clamp v to the valid range:
if (v > 1)
v = 1;
if (v < -1)
v = -1;
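The clamp above can be wrapped in a small helper. This is only a minimal sketch: the name asin_clamped is an assumption made for illustration and does not come from the original answer.

```cpp
#include <cmath>

// Hypothetical helper (the name is an assumption): clamp the argument into
// [-1, 1] before calling std::asin, so that rounding error which pushes the
// value slightly past 1 does not produce NaN.
double asin_clamped(double v) {
    if (v > 1.0)  v = 1.0;
    if (v < -1.0) v = -1.0;
    return std::asin(v);
}
```

Note that this also hides inputs that are genuinely out of range (for example 1.5), so it is probably only appropriate when the argument is known to be within [-1, 1] up to rounding error.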
9.8 is a value that can't be represented exactly in floating point. That means, the actual value stored is equal to 9.8 + delta, where delta is a small value which may be positive or negative.
If delta is positive for your floating point representation (presumably IEEE), then 180*9.8 will be greater than 1764, so the value of v will exceed 1. The only valid inputs for asin() are in the range -1 to 1. Although the return value from asin() is not specified for values outside that range, a NaN is one way of reporting that.
My apologies if this has been asked before, but I cannot find it.
I was wondering if there is a way to calculate the point at which a single precision floating point number that is used as a counter will reach a 'maximum' (the point at which it is no longer able to add another value due to loss of precision).
For example, if I continuously add 0.1f to a float I will eventually reach a point where the value does not change:
const float INCREMENT = 0.1f;
float value = INCREMENT;
float prevVal = 0.0f;
do {
prevVal = value;
value += INCREMENT;
} while (value != prevVal);
cout << value << endl;
On GCC this outputs 2.09715e+06
Is there a way to compute this mathematically for different values of INCREMENT? I believe it should in theory be when the exponent portion of the float requires a shift beyond 23 bits, resulting in losing the mantissa and simply adding 0.
Given some positive y used as an increment, the smallest X for which adding y does not produce a result greater than X is the least power of 2 not less than y divided by half the “epsilon” of the floating-point format. It can be calculated by:
Float Y = y*2/std::numeric_limits<Float>::epsilon();
int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);
if (X < Y) X *= 2;
A proof follows. I assume IEEE-754 binary floating-point arithmetic using round-to-nearest-ties-to-even.
When two numbers are added in IEEE-754 floating-point arithmetic, the result is the exact mathematical result rounded to the nearest representable value in a selected direction.
A note about notation: Text in source code format represents floating-point values and operations. Other text is mathematical. So x+y is the exact mathematical sum of x and y, `x` is x in floating-point format, and `x+y` is the result of adding x and y in a floating-point operation. Also, I will use `Float` for the floating-point type in C++.
Given a floating-point number x, consider adding a positive value y using floating-point arithmetic, `x+y`. Under what conditions will the result exceed x?
Let x1 be the next value greater than x representable in the floating-point format, and let xm be the midpoint between x and x1. If the mathematical value of x+y is less than xm, then the floating-point calculation `x+y` rounds down, so it produces x. If x+y is greater than xm, either it rounds up and produces x1, or it produces some greater number because y is large enough to move the sum beyond x1. If x+y equals xm, the result is whichever of x or x1 has an even low digit. For reasons we will see, this is always x in the situations relevant to this question, so the calculation rounds down.
Therefore, `x+y` produces a result greater than x if and only if x+y exceeds xm, meaning that y exceeds half the distance from x to x1. Note that the distance from x to x1 is the value of 1 in the low digit of the significand of x.
In a binary floating-point format with p digits in its significand, the position value of the low digit is 2^(1−p) times the position value of the high digit. For example, if x is 2^e, the highest bit in its significand represents 2^e, and the lowest bit represents 2^(e+1−p).
The question asks, given a y, what is the least x for which `x+y` does not produce a result greater than x? It is the least x for which y does not exceed half the value of the low digit of the significand of x.
Let 2^e be the position value of the high bit of the significand of x. Then y ≤ ½•2^(e+1−p) = 2^(e−p), so y•2^p ≤ 2^e.
Therefore, given some positive y, the least x for which `x+y` does not produce a result greater than x has its leading bit, 2^e, equal to or exceeding y•2^p. And in fact it must be exactly 2^e because all other floating-point numbers whose leading bit has position value 2^e have other bits set in their significands, so they are greater. 2^e is the least number for which the leading bit represents 2^e.
Therefore, x is the least power of two that equals or exceeds y•2^p.
In C++, `std::numeric_limits<Float>::epsilon()` (from the <limits> header) is the step from 1 to the next representable value, meaning it is 2^(1−p). So y•2^p equals `y*2/std::numeric_limits<Float>::epsilon()`. (This operation is exact unless it overflows to ∞.)
Let’s assign this to a variable:
Float Y = y*2/std::numeric_limits<Float>::epsilon();
We can find the position value represented by the highest bit of Y’s significand by using frexp (from the <cmath> header) to extract the exponent from the floating-point representation of Y and ldexp (also <cmath>) to apply that exponent to a new significand (.5 because of the scale that frexp and ldexp use):
int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);
Then X is a power of two, and it is less than or equal to Y. It is in fact the greatest power of two not greater than Y, since the next greater power of 2, 2X, is greater than Y. However, we want the least power of two not less than Y. We can find this with:
if (X < Y) X *= 2;
The resulting X is the number sought by the question.
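For convenience, the steps above can be collected into one function template. This is only a sketch: the name least_stuck_value is an assumption made for illustration, and it presumes IEEE-754 binary arithmetic with round-to-nearest-ties-to-even, as the proof does.

```cpp
#include <cmath>
#include <limits>

// Hypothetical wrapper (the name is an assumption) around the steps above:
// returns the least X for which the floating-point sum X + y is not greater
// than X, for a positive increment y.
template <class Float>
Float least_stuck_value(Float y) {
    // y * 2^p, computed via epsilon = 2^(1-p)
    Float Y = y * 2 / std::numeric_limits<Float>::epsilon();
    int e;
    std::frexp(Y, &e);                    // Y = m * 2^e with m in [0.5, 1)
    Float X = std::ldexp(Float(0.5), e);  // greatest power of two <= Y
    if (X < Y) X *= 2;                    // least power of two >= Y
    return X;
}
```

For example, least_stuck_value(0.1f) gives 2097152, which agrees with the 2.09715e+06 that the loop in the question prints.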
Marek's Answer is pretty close, and a decent way to find it using a program (that is more efficient than the one I originally posted). However, I don't necessarily need the answer in a program form, just a mathematical one.
From what I can tell, the answer comes down to the exponent of the delta used and the number of mantissa bits. We need to round the delta up to a power of 2, which is kind of complicated. Basically, if the mantissa is 0, we do nothing; otherwise we add 1 to the exponent. So, assuming we now have the delta as a power of 2, represented as 1.0 x 2^exp, and a mantissa of N bits, the maximum value is 1.0 x 2^(N + 1 + exp). Note that FLT_EPSILON in C is equal to 1.0 x 2^-N, so we can also find this by dividing our power of 2 by FLT_EPSILON and doubling the result.
For a delta of 0.1, the rounded-up power of 2 is 0.125, or 1.0 x 2^-3. Therefore we want 1.0 x 2^(23 + 1 + (-3)), or 1.0 x 2^21, which is equal to 2097152.
Yes it is possible.
There is std::numeric_limits<T>::epsilon(), which is the gap between 1.0 and the next larger representable value.
Using this you can calculate the limit for any number.
In C there are FLT_EPSILON and DBL_EPSILON.
So in your case it goes like this:
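#include <cmath>
#include <limits>
#include <type_traits>

template<class T>
auto maximumWhenAdding(T delta) -> T
{
    static_assert(std::is_floating_point_v<T>, "Works only for floating points.");
    // Exponent of delta, i.e. floor(log2(delta))
    int power2 = std::ilogb(delta);
    // Use T rather than float so double and long double keep their precision
    T roundedDelta = std::ldexp(T { 1.0 }, power2);
    // Round delta up to the next power of two if it is not one already
    if (roundedDelta != delta) {
        roundedDelta *= 2;
    }
    return 2 * roundedDelta / std::numeric_limits<T>::epsilon();
}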
Note that in the live test examples, adding delta fails to increase maxForDelta, but subtracting it succeeds, so this is exactly the limit you need.
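That observation can be checked with two tiny predicates. This is a sketch only: the function names are assumptions, and the volatile temporaries are there to force the result to be rounded to float on platforms that might otherwise keep extra precision.

```cpp
// Hypothetical check (names are assumptions): true when value + delta
// rounds back to value in single precision.
bool addition_stalls(float value, float delta) {
    volatile float sum = value + delta;  // force rounding to float
    return sum == value;
}

// True when value - delta yields a different float than value.
bool subtraction_changes(float value, float delta) {
    volatile float diff = value - delta;  // force rounding to float
    return diff != value;
}
```

With value = 2097152.0f and delta = 0.1f, the addition stalls while the subtraction still changes the value, because the spacing between floats just below 2^21 is half the spacing just above it.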
I'm wondering if x/y when x and y are integers (but floating point type), is guaranteed to yield the same floating point value as kx/ky, where k is an integer.
So, for example, does 1.0/3, 2.0/6, 3.0/9, ... all yield the same exact floating point number (one that would compare equally with == operator)?
In case this differs per language or platform, I am specifically interested in C++ on Linux.
As long as the k*x and k*y operations are exact (each product is exactly representable in the floating-point type), the IEEE 754 standard guarantees that you'll get the floating-point number nearest to the exact division result.
Since (k*x)/(k*y) = x/y in exact math, the nearest floating-point number will be the same for both.
If k*x or k*y is not exactly representable (the floating-point multiplication is inexact), then you don't get any guarantee.
Concerning bare minimum guaranteed by C++, I don't know, but you can consider that most platforms do comply with these basic IEEE754 properties.
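A quick way to try this out is a loop over k. This is a minimal sketch that assumes IEEE 754 doubles with correctly rounded division; the function name scaled_quotients_match is an assumption made for illustration.

```cpp
// Hypothetical check (the name is an assumption): compares (k*x)/(k*y) with
// x/y for k = 1..max_k. The comparison is only meaningful while k*x and k*y
// are exactly representable, e.g. small integers stored in doubles.
bool scaled_quotients_match(double x, double y, int max_k) {
    const double reference = x / y;
    for (int k = 1; k <= max_k; ++k) {
        if ((k * x) / (k * y) != reference) {
            return false;
        }
    }
    return true;
}
```

For instance, scaled_quotients_match(1.0, 3.0, 1000) covers 1.0/3, 2.0/6, 3.0/9 and so on, all of which involve exact products.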
If the calculations are done in the same precision, I think they'll end up the same. However, if that's not the case, both float->double and double->float conversions will create discrepancies. And that's not an impossible scenario (at least without fp:strict), since the compiler can mix FPU and SSE code (for instance, if it needs to call a function that's not implemented in SSE, or use it as an argument/return result for a cdecl function).
That said, you can also create a quotient (x/y) class and use it as the key. You can define all arithmetic for it, for instance
q0+q1 = (q0.x*q1.y+q1.x*q0.y)/(q0.y*q1.y)
q0<q1 = q0.x*q1.y*(q0.y*q1.y) < q1.x*q0.y*(q0.y*q1.y)
(in the latter case * (q0.y * q1.y) is added to account for the fact that we've multiplied the original expression, q0.x/q0.y < q1.x/q1.y by q0.y*q1.y, and if it's negative, < will change to >). You can also get rid of some divisions that way.
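A minimal sketch of such a quotient type follows. The struct name, the choice of 64-bit integer members, and the operator set are all assumptions made for illustration; the sketch does not reduce fractions and does not guard against integer overflow in the products.

```cpp
#include <cstdint>

// Hypothetical quotient type (names and member types are assumptions).
struct Quotient {
    std::int64_t x;  // numerator
    std::int64_t y;  // denominator, assumed non-zero
};

// Equality by cross-multiplication: x0/y0 == x1/y1  <=>  x0*y1 == x1*y0
inline bool operator==(const Quotient& a, const Quotient& b) {
    return a.x * b.y == b.x * a.y;
}

// Ordering as described above: the extra factor (a.y * b.y) restores the
// direction of the comparison when the product of denominators is negative.
inline bool operator<(const Quotient& a, const Quotient& b) {
    const std::int64_t sign = a.y * b.y;
    return a.x * b.y * sign < b.x * a.y * sign;
}

// Sum: (x0*y1 + x1*y0) / (y0*y1)
inline Quotient operator+(const Quotient& a, const Quotient& b) {
    return Quotient{a.x * b.y + b.x * a.y, a.y * b.y};
}
```

With this, Quotient{1, 3} and Quotient{2, 6} compare equal without any floating-point division, which is what makes such a type usable as a key.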
I don't know about guarantees, but compiling this
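#include <stdio.h>

int main() {
    int i = 0;
    /* Start at 1/3 and scale both operands by 1000 per step,
       which is what the output below shows. */
    float
        x = 1,
        y = 3,
        f = 1000;
    while ( ++i <= 20 ) {
        printf(" %d) %f = %f / %f\n", i, x / y, x, y );
        x *= f;
        y *= f;
    }
}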
with gcc -O0 (on Debian GNU/Linux on an Intel(R) Xeon(R) CPU E3-1246) produces
1) 0.333333 = 1.000000 / 3.000000
2) 0.333333 = 1000.000000 / 3000.000000
3) 0.333333 = 1000000.000000 / 3000000.000000
4) 0.333333 = 1000000000.000000 / 3000000000.000000
5) 0.333333 = 999999995904.000000 / 3000000053248.000000
6) 0.333333 = 999999986991104.000000 / 3000000028082176.000000
7) 0.333333 = 999999984306749440.000000 / 3000000159078678528.000000
8) 0.333333 = 999999949672133165056.000000 / 3000000271228864561152.000000
9) 0.333333 = 999999941790833817157632.000000 / 3000000329775659716968448.000000
10) 0.333333 = 999999914697178458896728064.000000 / 3000000186813393145719422976.000000
11) 0.333333 = 999999939489602493962365435904.000000 / 3000000196258126111458713403392.000000
12) 0.333333 = 999999917124474831091725703839744.000000 / 3000000060858434314620245836300288.000000
13) 0.333333 = 999999882462153731101078006664265728.000000 / 3000000043527273764624921987712548864.000000
14) -nan = inf / inf
Say
int64_t x = (1UL << 53);
cout << x << endl;
x += 1.0;
cout << x << endl;
The value of x stays the same, '9007199254740992'.
However, x += 1; increments x correctly.
Moreover, adding 1.0 to 1UL << 52 gives the correct result.
I think this is due to floating-point imprecision. Could someone give me more details?
The line x+= 1.0 is evaluated as
x = (int64_t)((double)x + (double)1.0);
The number 2^53 + 1 = 9007199254740993 can't be represented exactly as IEEE double, so it's rounded to 2^53 = 9007199254740992 (this depends on the current rounding mode, actually) which is then (losslessly) converted to an int64_t.
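The conversion chain can be made explicit in a small function. This is a sketch assuming IEEE-754 doubles and the default round-to-nearest mode; the function name is an assumption made for illustration.

```cpp
#include <cstdint>

// Hypothetical helper (the name is an assumption) spelling out what
// `x += 1.0` does for an int64_t x: convert to double, add in double,
// convert back.
std::int64_t add_one_via_double(std::int64_t x) {
    return static_cast<std::int64_t>(static_cast<double>(x) + 1.0);
}
```

Called with 2^53 it returns 2^53 unchanged, while with 2^52 it returns 2^52 + 1, because every integer up to 2^53 is exactly representable as a double.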
x+= 1.0;
The expression x + 1.0 is done with floating point arithmetic.
Assuming IEEE-754 is used, the double precision floating point type can represent every integer exactly only up to 2^53.
In the below example app I calculate the floating point remainder from dividing 953 by 0.1, using std::fmod
What I was expecting is that since 953.0 / 0.1 == 9530, that std::fmod(953, 0.1) == 0
I'm getting 0.1 - why is this the case?
Note that with std::remainder I get the correct result.
That is:
std::fmod (953, 0.1) == 0.1 // unexpected
std::remainder(953, 0.1) == 0 // expected
Difference between the two functions:
According to cppreference.com
std::fmod calculates the following:
exactly the value x - n*y, where n is x/y with its fractional part truncated
std::remainder calculates the following:
exactly the value x - n*y, where n is the integral value nearest the exact value x/y
Given my inputs I would expect both functions to have the same output. Why is this not the case?
Exemplar app:
#include <iostream>
#include <cmath>
bool is_zero(double in)
{
return std::fabs(in) < 0.0000001;
}
int main()
{
double numerator = 953;
double denominator = 0.1;
double quotient = numerator / denominator;
double fmod = std::fmod (numerator, denominator);
double rem = std::remainder(numerator, denominator);
if (is_zero(fmod))
fmod = 0;
if (is_zero(rem))
rem = 0;
std::cout << "quotient: " << quotient << ", fmod: " << fmod << ", rem: " << rem << std::endl;
return 0;
}
Output:
quotient: 9530, fmod: 0.1, rem: 0
Because they are different functions.
std::remainder(x, y) calculates IEEE remainder which is x - (round(x/y)*y) where round is rounding half to even (so in particular round(1.0/2.0) == 0)
std::fmod(x, y) calculates x - trunc(x/y)*y, using the exact quotient. The double nearest to 0.1 is slightly larger than 0.1, so the exact quotient of 953 by that value is slightly smaller than 9530, and truncation gives 9529. As a result you get approximately 953.0 - 952.9 = 0.1.
Welcome to floating point math. Here's what happens: One tenth cannot be represented exactly in binary, just as one third cannot be represented exactly in decimal. As a result, the division produces a result slightly below 9530. The floor operation produces the integer 9529 instead of 9530. And then this leaves 0.1 left over.
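If the goal is to test whether one value is (nearly) a multiple of another, one option is to build the test on std::remainder with a tolerance. This is a sketch only: the function name and the tolerance parameter are assumptions, and a suitable tolerance depends on the magnitudes involved.

```cpp
#include <cmath>

// Hypothetical helper (name and tolerance parameter are assumptions):
// std::remainder returns a value in [-|y|/2, |y|/2], so a result close to
// zero means x is close to a multiple of y from either side.
bool is_near_multiple(double x, double y, double tol) {
    return std::fabs(std::remainder(x, y)) <= tol;
}
```

With std::fmod the same test would need to treat results close to y as well as results close to 0 as "no remainder", which is the case the question ran into.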
Is division by zero possible in the following case due to the floating point error in the subtraction?
float x, y, z;
...
if (y != 1.0)
z = x / (y - 1.0);
In other words, is the following any safer?
float divisor = y - 1.0;
if (divisor != 0.0)
z = x / divisor;
Assuming IEEE-754 floating-point, they are equivalent.
It is a basic theorem of FP arithmetic that for finite x and y, x - y == 0 if and only if x == y, assuming gradual underflow.
If subnormal results are flushed to zero (instead of gradual underflow), this theorem holds only if the result x - y is normal. Because 1.0 is well scaled, y - 1.0 is never subnormal, and so y - 1.0 is zero if and only if y is exactly 1.0, regardless of how underflow is handled.
C++ doesn't guarantee IEEE-754, of course, but the theorem is true for most "reasonable" floating-point systems.
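The equivalence of the two guards from the question can be spot-checked for individual values. This is a sketch assuming IEEE-754 arithmetic; the function name guards_agree is an assumption made for illustration.

```cpp
#include <cmath>

// Hypothetical check (the name is an assumption): evaluates both guards
// from the question for a given y and reports whether they agree.
bool guards_agree(float y) {
    const bool first_guard = (y != 1.0);       // guard from the first snippet
    const float divisor = y - 1.0;             // as in the second snippet
    const bool second_guard = (divisor != 0.0);
    return first_guard == second_guard;
}
```

The interesting inputs are the floats adjacent to 1.0, for example std::nextafter(1.0f, 2.0f) and std::nextafter(1.0f, 0.0f), where the difference is tiny but still non-zero.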
This will prevent you from dividing by exactly zero; however, that does not mean you won't still end up with +/-inf as a result. The denominator could still be small enough that the answer is not representable as a double, and you will end up with an inf. For example:
#include <iostream>
#include <limits>
int main(int argc, char const *argv[])
{
double small = std::numeric_limits<double>::epsilon();
double large = std::numeric_limits<double>::max() / small;
std::cout << "small: " << small << std::endl;
std::cout << "large: " << large << std::endl;
return 0;
}
In this program small is non-zero, but it is so small that large exceeds the range of double and is inf.
There is no difference between the two code snippets; in fact, the optimizer could even compile both fragments to the same binary code, assuming that there are no further uses of the divisor variable.
Note, however, that under IEEE-754 division by a floating point zero 0.0 does not result in a run-time error, but produces an inf or -inf instead (or NaN for 0.0/0.0).
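If both a zero divisor and an overflowing quotient should be rejected, one possible approach is to check the result as well as the divisor. This is a sketch under the assumption of IEEE-754 doubles; the function name and the bool-plus-output-parameter interface are choices made for illustration.

```cpp
#include <cmath>

// Hypothetical helper (name and interface are assumptions): performs the
// division only when the divisor is non-zero, and reports failure when the
// quotient is not finite (overflow to inf, or NaN).
bool checked_divide(double x, double divisor, double& result) {
    if (divisor == 0.0) {
        return false;
    }
    const double quotient = x / divisor;
    if (!std::isfinite(quotient)) {
        return false;
    }
    result = quotient;
    return true;
}
```

The output parameter is left untouched on failure, so the caller can decide how to handle each case.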