What is this feature of floating point? - d

Real Close to the Machine: Floating Point in D
https://dlang.org/articles/d-floating-point.html
says
Useful relations for a floating point type F, where x and y are of type F
...
x>0 if and only if 1/(1/x) > 0; x<0 if and only if 1/(1/x) < 0.
what is the meaning of this sentence?

In the text you're quoting, we're looking at how the representation is symmetric around 1, and that the rounding doesn't break this. That is, for any number 0 < x < 1, there's a corresponding number 1 < y < ∞, such that y = 1/x and 1/y = x. That's the first half - the second is simply the same for negative numbers: 0 > x > -1 and -1 > y > -∞.
It may not be immediately obvious how this can be a problem, but consider x = 3.
y must then be 1/3 = 0.333.... With a limited precision of 3 decimal digits, 1/y would then be 3.003003003.... IEEE 754 defines how this should work: each division is correctly rounded, and for x = 3 the two rounding errors cancel each other out, so 1/(1/3) comes back as exactly 3. The round trip 1/(1/x) is not exact for every x, but the errors are far too small to change the sign, which is what the quoted relation is about.
Older floating-point systems weren't as well-behaved as IEEE 754. I'm not sure if any of them weren't symmetric around 1, but that's certainly within the realm of possibility.
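To make this concrete, here is a small C++ sketch (mine, not from the article), assuming strict IEEE-754 binary64 arithmetic. The x = 3 round trip is exact; x = 49 is a case where the round trip is off by one ulp, yet the sign relation still holds.

#include <cstdio>

int main() {
    double x = 3.0;
    double y = 1.0 / x;                                    // nearest double to 0.333...
    std::printf("1/(1/3)  = %.17g\n", 1.0 / y);            // 3: the two rounding errors cancel
    std::printf("1/(1/49) = %.17g\n", 1.0 / (1.0 / 49.0)); // 49.000000000000007: one ulp off, still positive
}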

Related

Is there a value of type `double`, `K`, such that `K * K == 3.0`?

Is there a value of type double (IEEE 64-bit float / binary64), K, such that K * K == 3.0? (The irrational number is of course "square root of 3")
I tried:
static constexpr double Sqrt3 = 1.732050807568877293527446341505872366942805253810380628055806;
static_assert(Sqrt3 * Sqrt3 == 3.0);
but the static assert fails.
(I'm guessing neither the next higher nor next lower floating-point representable number square to 3.0 after rounding? Or is the parser of the floating point literal being stupid? Or is it doable in IEEE standard but fast math optimizations are messing it up?)
I think the digits are right:
$ python
>>> N = 1732050807568877293527446341505872366942805253810380628055806
>>> N * N
2999999999999999999999999999999999999999999999999999999999996\
607078976886330406910974461358291614910225958586655450309636
Update
I've discovered that:
static_assert(Sqrt3 * Sqrt3 < 3.0); // pass
static_assert(Sqrt3 * Sqrt3 > 2.999999999999999); // pass
static_assert(Sqrt3 * Sqrt3 > 2.9999999999999999); // fail
So the literal must produce the next lower value.
I guess I need to check the next higher value. Could bit-dump the representation maybe and then increment the last bit of the mantissa.
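One way to check the next-higher value without bit-dumping is std::nextafter from <cmath>. A minimal sketch (mine, not part of the original post), assuming IEEE binary64 and round-to-nearest:

#include <cmath>
#include <cstdio>

int main() {
    double lo = 1.7320508075688772;          // nearest double to sqrt(3); it lies just below the true value
    double hi = std::nextafter(lo, 2.0);     // the next representable double above it
    std::printf("lo*lo = %.17g\n", lo * lo); // 2.9999999999999996
    std::printf("hi*hi = %.17g\n", hi * hi); // 3.0000000000000004
}

Neither neighbour squares to exactly 3.0, which matches the answers below.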
Update 2
For posterity: I wound up going with this for the Sqrt3 constant and the test:
static constexpr double Sqrt3 = 1.7320508075688772;
static_assert(0x1.BB67AE8584CAAP+0 == 1.7320508075688772);
static_assert(Sqrt3 * Sqrt3 == 2.9999999999999996);
The answer is no; there is no such K.
The closest binary64 value to the actual square root of 3 is equal to 7800463371553962 × 2^-52. Its square is:
60847228810955004221158677897444 × 2^-104
This value is not exactly representable. It falls between (3 - 2^-51) and 3, which are respectively equal to
60847228810955002264642499117056 × 2^-104
and
60847228810955011271841753858048 × 2^-104
As you can see, K * K is much closer to 3 - 2^-51 than it is to 3. So IEEE 754 requires the result of the operation K * K to yield 3 - 2^-51, not 3. (The compiler might convert K to an extended-precision format for the calculation, but the result will still be 3 - 2^-51 after conversion back to binary64.)
Furthermore, if we go to the next representable value after K in the binary64 format, we will find that its square is closest to 3 + 2^-51, which is the next representable value after 3.
This result should not be too surprising; in general, incrementing a number by 1 ulp will increment its square by roughly 2 ulps, so you have about a 50% chance, given some value x, that there is a K with the same precision as x such that K * K == x.
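The arithmetic above can be checked directly with hex float literals. A short sketch (mine, assuming the compiler evaluates these constant expressions in strict IEEE binary64 with round-to-nearest):

static constexpr double K = 0x1.BB67AE8584CAAp+0;   // closest double to sqrt(3)
static_assert(K * K == 3.0 - 0x1p-51, "K*K is the representable value just below 3");
static_assert(K * K != 3.0, "so K*K never hits 3 exactly");
int main() {}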
The C standard does not dictate the default rounding mode. While it is typically round-to-nearest, ties-to-even, it could be round-upward, and some implementations support changing the mode. In such case, squaring 1.732050807568877193176604123436845839023590087890625 while rounding upward produces exactly 3.
#include <fenv.h>
#include <math.h>
#include <stdio.h>
#pragma STDC FENV_ACCESS ON
int main(void)
{
    volatile double x = 1.732050807568877193176604123436845839023590087890625;
    fesetround(FE_UPWARD);
    printf("%.99g\n", x*x); // Prints “3”.
}
x is declared volatile to prevent the compiler from computing x*x at compile-time with a different rounding mode. Some compilers do not support #pragma STDC FENV_ACCESS but may support fesetround once the #pragma line is removed.
Testing with Python is valid I think, since both use the IEEE-754 representation for doubles along with the rules for operations on same.
The closest possible double to the square root of 3 is slightly low.
>>> Sqrt3 = 3**0.5
>>> Sqrt3*Sqrt3
2.9999999999999996
The next available value is too high.
>>> import numpy as np
>>> Sqrt3p = np.nextafter(Sqrt3,999)
>>> Sqrt3p*Sqrt3p
3.0000000000000004
If you could split the difference, you'd have it.
>>> Sqrt3*Sqrt3p
3.0
In the Ruby language, the Float class uses "the native architecture's double-precision floating point representation" and it has methods named prev_float and next_float that let you iterate through different possible floats using the smallest possible steps. Using this, I was able to do a simple test and see that there is no double (at least on x86_64 Linux) that meets your criterion. The Ruby interpreter is written in C, so I think my results should be applicable to the C double type.
Here is the Ruby code:
x = Math.sqrt(3)
4.times { x = x.prev_float }
9.times do
puts "%.20f squared is %.20f" % [x, x * x]
puts "Success!" if x * x == 3
x = x.next_float
end
And the output:
1.73205080756887630500 squared is 2.99999999999999644729
1.73205080756887652704 squared is 2.99999999999999733546
1.73205080756887674909 squared is 2.99999999999999822364
1.73205080756887697113 squared is 2.99999999999999866773
1.73205080756887719318 squared is 2.99999999999999955591
1.73205080756887741522 squared is 3.00000000000000044409
1.73205080756887763727 squared is 3.00000000000000133227
1.73205080756887785931 squared is 3.00000000000000177636
1.73205080756887808136 squared is 3.00000000000000266454
Is there a value of type double, K, such that K * K == 3.0?
Yes.
K = sqrt(n) may satisfy K * K == n, even when √n is irrational.
Note that K, the result of sqrt(n), as a double, is a rational number.
Various rounding modes: see Eric's answer above.
K * K rounds to n:
Example: the square roots of n = 11, 14 and 17, when squared, give back exactly n.
for (int i = 10; i < 20; i++) {
    double x = sqrt(i);
    double y = x * x;
    printf("%2d %.25g\n", i, y);
}
10 10.00000000000000177635684
11 11
12 11.99999999999999822364316
13 12.99999999999999822364316
14 14
15 15.00000000000000177635684
16 16
17 17
18 17.99999999999999644728632
19 19.00000000000000355271368
Different precision
Rather than the 53 bits of a common double, say the FP math was done with 24 bits (float). The square roots of n = 3, 5 and 10, when squared, give back exactly n.
for (int i = 2; i < 11; i++) {
    float x = sqrtf(i);
    printf("%2d %.25g\n", i, x*x);
}
2 1.99999988079071044921875
3 3
4 4
5 5
6 6.000000476837158203125
7 6.999999523162841796875
8 7.999999523162841796875
9 9
10 10
Or say the FP math was done with a 64-bit significand (long double here). The square roots of n = 5, 6 and 10, when squared, give back exactly n.
for (int i = 2; i < 11; i++) {
    long double x = sqrtl(i);
    printf("%2d %.35Lg\n", i, x*x);
}
2 1.9999999999999999998915797827514496
3 3.0000000000000000002168404344971009
4 4
5 5
6 6
7 6.9999999999999999995663191310057982
8 7.9999999999999999995663191310057982
9 9
10 10
With various precisions (note that C does not specify a fixed precision), K * K == 3.0 is possible.
FLT_EVAL_METHOD == 2
When FLT_EVAL_METHOD == 2, intermediate calculations may be done at higher precision, thus affecting the product of K * K.
(Have yet to come up with a good simple example.)
sqrt(3) is irrational, which means that there is no rational number k such that k*k == 3. A double can only represent rational numbers; therefore, there is no double k such that k*k == 3.
If you can accept a number that is close to satisfying k*k == 3, then you can use std::numeric_limits (in <limits>) to see if you’re within some minimal interval around 3. It may look like:
assert( std::abs(k*k - 3.) <= std::abs(k*k + 3.) * std::numeric_limits<double>::epsilon() * X );
Epsilon is the gap between 1 and the next value that double can represent. We scale it by the sum of the two values to compare in order to bring its magnitude in line with the numbers we’re checking. X is a scaling factor that lets you adjust the precision you accept.
If this is a theoretical question: no. If it’s a practical question: yes, up some level of precision.

Find float a to closest multiple of float b

C++ Scenario: I have two variables of type double a and b.
Goal: a should be set to the closest multiple of b that is smaller than a.
First approach: Use fmod() or remainder() to get r. Then do a = a - r.
I know that due to the representation of decimal numbers in memory fmod() or remainder() can never guarantee 100% accuracy. In my tests I found that I cannot use fmod() at all, as the variance of its results is too unpredictable (at least as far as I understand). There are many questions and discussions out there talking about this phenomenon.
So is there something I could do to still use fmod()?
With “something” I mean some trick similar to checking if a equals b by employing an epsilon value:
double EPSILON = 0.005;
if (std::abs(a-b) < EPSILON)
    std::cout << "equal" << '\n';
My second approach works but seems not to be very elegant. I am just subtracting b from a until there is nothing left to subtract:
#include <iostream>
#include <limits>

double findRemainder(double x, double y) {
    double rest;
    if (y > x)
    {
        double temp = x;
        x = y;
        y = temp;
    }
    while (x > y)
    {
        rest = x - y;
        x = x - y;
    }
    return rest;
}

int main()
{
    typedef std::numeric_limits<double> dbl;
    std::cout.precision(dbl::max_digits10);
    double a = 13.78, b = 2.2, r = 0;
    r = findRemainder(a, b);
    return 0;
}
Any suggestions for me?
Preamble
The problem is impossible, both as stated and as intended.
Remainders are exact
This statement is incorrect: “fmod() or remainder() can never guarantee 100% accuracy.” If the floating-point format supports subnormal numbers (as IEEE-754 does), then fmod(x, y) and remainder are both exact; they produce a result with no rounding error (barring bugs in their implementation). The remainder, as defined for either of them, is always less than y and not more than x in magnitude. Therefore, it is always in a portion of the floating-point format that is at least as fine as y and as x, so all the bits needed for the real-arithmetic remainder can be represented in the floating-point remainder. So a correct implementation will return the exact remainder.
Multiples may not be representable
For simplicity of illustration, I will use IEEE-754 binary32, the format commonly used for float. The issues are the same for other formats. In this format, all integers with magnitude up to 2^24, 16,777,216, are representable. After that, due to the scaling by the floating-point exponent, the representable values increase by two: 16,777,218, 16,777,220, and so on. At 2^25, 33,554,432, they increase by four: 33,554,436, 33,554,440. At 2^26, 67,108,864, they increase by eight.
100,000,000 is representable, and so are 99,999,992 and 100,000,008. Now consider asking what multiple of 3 is the closest to 100,000,000. It is 99,999,999. But 99,999,999 is not representable in the binary32 format.
Thus, it is not always possible for a function to take two representable values, a and b, and return the greatest multiple of b that is less than a, using the same floating-point format. This is not because of any difficulty computing the multiple but simply because it is impossible to represent the true multiple in the floating-point format.
In fact, given the standard library, it is easy to compute the remainder; std::fmod(100000000.f, 3.f) is 1. But it is impossible to compute 100000000.f − 1 in the binary32 format.
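A small sketch (mine, not from the answer) showing both points in binary32: the remainder from fmod is exact, but subtracting it cannot reach the true multiple 99,999,999, because that integer is not representable.

#include <cmath>
#include <cstdio>

int main() {
    float a = 100000000.0f, b = 3.0f;
    float r = std::fmod(a, b);                // exact remainder: 1
    float m = a - r;                          // 99,999,999 is not representable in binary32
    std::printf("r = %g, m = %.1f\n", r, m);  // r = 1, m = 100000000.0
    std::printf("m == a: %d\n", m == a);      // 1: subtracting the remainder changed nothing
}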
The intended question is impossible
The examples shown, 13.78 for a and 2.2 for b, suggest the desire is to produce a multiple for some floating-point numbers a and b that are the results of converting decimal numerals a and b to the floating-point format. However, once such conversions are performed, the original numbers cannot be known from the results a and b.
To see this, consider values for a of either 99,999,997 or 100,000,002 while b is 10. The greatest multiple of 10 less than 99,999,997 is 99,999,990, and the greatest multiple of 10 less than 100,000,002 is 100,000,000.
When either 99,999,997 or 100,000,002 is converted to the binary32 format (using the common method, round-to-nearest-ties-to-even), the result for a is 100,000,000. Converting b of course yields 10 for b.
Then a function that computes the greatest multiple of b that is less than a can return only one result. Even if this function uses extended precision (say binary64) so that it can return either 99,999,990 or 100,000,000 even though those are not representable in binary32, it has no way to distinguish them. Whether the original a is 99,999,997 or 100,000,002, the a given to the function is 100,000,000, so there is no way for it to know the original a and no way for it to decide which result to return.
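The ambiguity is easy to demonstrate; a tiny sketch (mine), assuming binary32 float and round-to-nearest conversion of the literals:

#include <cstdio>

int main() {
    // Both decimal values round to the same binary32 number, so a function
    // receiving the float can no longer tell which one the user meant.
    float a1 = 99999997.0f;
    float a2 = 100000002.0f;
    std::printf("%.1f %.1f equal=%d\n", a1, a2, a1 == a2);  // 100000000.0 100000000.0 equal=1
}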
Hmm,
there really is a problem of definition, because most multiples of a floating point won't be representable exactly, except maybe if the multiplier is a power of two.
Taking your example and Smalltalk notation (which does not really matter; I use it just because I can evaluate and verify the expressions I propose), the exact fractional representations of the double-precision values 0.1 and 0.9 can be written:
(1+(1<<54)reciprocal) / 10 = 0.1.
(9+(1<<52)reciprocal) / 10 = 0.9.
<< is a bitshift, 1<<54 is 2 raised to the power of 54, and reciprocal is its inverse 2^-54.
As you can easily see:
(1+(1<<54)reciprocal) * 9 > (9+(1<<52)reciprocal)
That is, the exact multiple of 0.1 is greater than 0.9.
Thus, technically, the answer is 8*0.1 (which is exact in this lucky case)
(8+(1<<51)reciprocal) / 10 = 0.8.
What remainder does is to give the EXACT remainder of the division, so it is related to the above computations.
You can try it; you will find something like -2.77555...e-17, or exactly (1<<55) reciprocal, negated. The negative sign indicates that 0.9 falls slightly below the nearest multiple of 0.1 (that is, below 9 × 0.1 taken exactly).
However, if your problem is to find the greatest <= 0.9, among the rounded to nearest multiple of 0.1, then your answer will be 0.9, because the rounded product is 0.1*9 = 0.9.
You have to first resolve that ambiguity. If, instead, you are not interested in multiples of 0.1 (the double), but in multiples of the real number 1/10, then it's again a different matter...
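For reference, the same computation in C++ with std::remainder (a sketch of mine, assuming IEEE binary64): the remainder comes out as exactly -(2^-55), and the correctly rounded product 9 * 0.1 lands back on the double 0.9.

#include <cmath>
#include <cstdio>

int main() {
    double r = std::remainder(0.9, 0.1);
    std::printf("%.17g\n", r);            // -2.7755575615628914e-17, exactly -(2^-55)
    std::printf("%d\n", 0.1 * 9 == 0.9);  // 1: the rounded product equals the double 0.9
}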

Interchangeability of IEEE 754 floating-point addition and multiplication

Is the addition x + x interchangeable by the multiplication 2 * x in IEEE 754 (IEC 559) floating-point standard, or more generally speaking is there any guarantee that case_add and case_mul always give exactly the same result?
#include <cstddef>
#include <limits>

template <typename T>
T case_add(T x, size_t n)
{
    static_assert(std::numeric_limits<T>::is_iec559, "invalid type");

    T result(x);
    for (size_t i = 1; i < n; ++i)
    {
        result += x;
    }
    return result;
}

template <typename T>
T case_mul(T x, size_t n)
{
    static_assert(std::numeric_limits<T>::is_iec559, "invalid type");

    return x * static_cast<T>(n);
}
Is the addition x + x interchangeable by the multiplication 2 * x in IEEE 754 (IEC 559) floating-point standard
Yes, since they are both mathematically identical, they will give the same result (since the result is exact in floating point).
or more generally speaking is there any guarantee that case_add and case_mul always give exactly the same result?
Not generally, no. From what I can tell, it seems to hold for n <= 5:
n=3: x+x is exact (i.e. involves no rounding), so (x+x)+x only involves one rounding at the final step.
n=4 (assuming the default rounding mode):
if the last bit of x is 0, then x+x+x is exact, and so the results are equal by the same argument as n=3.
if the last 2 bits are 01, then the exact value of x+x+x will have last 2 bits of 1|1 (where | indicates the final bit in the format), which will be rounded up to 0|0. The next addition will give an exact result |01, so the result will be rounded down, cancelling out the previous error.
if the last 2 bits are 11, then the exact value of x+x+x will have last 2 bits of 0|1, which will be rounded down to 0|0. The next addition will give an exact result |11, so the result will be rounded up, again cancelling out the previous error.
n=5 (again, assuming default rounding): since x+x+x+x is exact, it holds for the same reason as n=3.
For n=6 it fails, e.g. take x to be 1.0000000000000002 (the next double after 1.0), in which case 6x is 6.000000000000002 and x+x+x+x+x+x is 6.000000000000001
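That counterexample is easy to reproduce (a sketch of mine, assuming IEEE binary64 and round-to-nearest):

#include <cmath>
#include <cstdio>

int main() {
    double x = std::nextafter(1.0, 2.0);         // 1.0000000000000002, the next double after 1.0
    double sum = ((((x + x) + x) + x) + x) + x;  // five left-to-right additions of six terms
    double prod = 6.0 * x;
    std::printf("sum  = %.16g\n", sum);          // 6.000000000000001
    std::printf("prod = %.16g\n", prod);         // 6.000000000000002
    std::printf("equal: %d\n", sum == prod);     // 0
}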
If n is for example pow(2, 54) then the multiplication will work just fine, but in the addition path once the result value is sufficiently larger than the input x, result += x will yield result.
Yes, but it doesn't hold generally. Multiplication by a number higher than 2 might not give the same results, as you have changed the exponent and can drop a bit if you replace with adds. Multiplication by two can't drop a bit if replaced by add operations, however.
If the accumulator result in case_add becomes too large, adding x will introduce rounding errors. At a certain point, adding x won't have an effect at all. So the functions won't give the same result.
For example if double x = 0x1.0000000000001p0 (hexadecimal float notation):
n case_add case_mul
1 0x1.0000000000001p+0 0x1.0000000000001p+0
2 0x1.0000000000001p+1 0x1.0000000000001p+1
3 0x1.8000000000002p+1 0x1.8000000000002p+1
4 0x1.0000000000001p+2 0x1.0000000000001p+2
5 0x1.4000000000001p+2 0x1.4000000000001p+2
6 0x1.8000000000001p+2 0x1.8000000000002p+2

Division z / (x/n) when n is 0

I have an arithmetic expression, for example:
float z = 8.0;
float x = 3.0;
float n = 0;
cout << z / (x/n) + 1 << endl;
Why do I get a normal answer equal to 1, when it should be "nan", "1.#inf", etc.?
I assume you're using floating point arithmetic (though one can't be sure, because you're not telling us).
IEEE754 floating point semantics work on the extended real line and include infinities on both ends. This makes divisions with non-zero numerator well-defined for any (non-NaN) denominator, "consistent with" (i.e. extending continuously) the usual arithmetic rules: x / n is infinity, and z divided by infinity is zero — just as if you had simplified the expression as n * z / x.
The only genuinely undefined quantities are 0/0 and inf/inf, which are represented by the special value NaN.
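A minimal sketch (mine, assuming an IEEE-754 float implementation) that prints the intermediate values from the question:

#include <iostream>

int main() {
    float z = 8.0f, x = 3.0f, n = 0.0f;
    std::cout << x / n << '\n';            // inf
    std::cout << z / (x / n) << '\n';      // 0
    std::cout << z / (x / n) + 1 << '\n';  // 1
    std::cout << 0.0f / 0.0f << '\n';      // nan: one of the genuinely undefined cases
}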
IEEE 754 specifies that 3/0 = Inf (or any positive numerator instead of 3). 8/Inf gives 0. If you add 1 you'll get 1. This is because 0 denotes "0 or something very close to it" and Inf denotes "infinity or a very big number". It also allows performing some operations on limits, as it effectively extends the real numbers with infinities. NaNs are reserved for when the limit is not achievable (or not easily computable by a simple implementation).
As a side effect you get some strange results, like 0 == -0 but 1/0 == Inf and 1/-0 == -Inf. It is important to remember that FP arithmetic does not behave like ordinary real arithmetic - for example cos(x) * cos(x) + sin(x) * sin(x) - 1 != 0 even when x is not NaN, Inf or -Inf. For floats and x == 1 the result is -5.9604645e-8. Therefore not all expectations transfer easily to it - division by 0 in this case.
While C/C++ does not mandate that the IEEE 754 specification be used for floating point numbers, it is the de facto specification, it is implemented on virtually all hardware, and for that reason it is used by most C/C++ implementations.

Is floating-point addition and multiplication associative?

I had a problem when I was adding three floating point values and comparing them to 1.
cout << ((0.7 + 0.2 + 0.1)==1)<<endl; //output is 0
cout << ((0.7 + 0.1 + 0.2)==1)<<endl; //output is 1
Why would these values come out different?
Floating point addition is not necessarily associative. If you change the order in which you add things up, this can change the result.
The standard paper on the subject is What Every Computer Scientist Should Know about Floating Point Arithmetic. It gives the following example:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the latter).
What is likely, with currently popular machines and software, is:
The compiler encoded .7 as 0x1.6666666666666p-1 (this is the hexadecimal numeral 1.6666666666666 multiplied by 2 to the power of -1), .2 as 0x1.999999999999ap-3, and .1 as 0x1.999999999999ap-4. Each of these is the number representable in floating-point that is closest to the decimal numeral you wrote.
Observe that each of these hexadecimal floating-point constants has exactly 53 bits in its significand (the "fraction" part, often inaccurately called the mantissa). The hexadecimal numeral for the significand has a "1" and thirteen more hexadecimal digits (four bits each, 52 total, 53 including the "1"), which is what the IEEE-754 standard provides for, for 64-bit binary floating-point numbers.
Let's add the numbers for .7 and .2: 0x1.6666666666666p-1 and 0x1.999999999999ap-3. First, scale the exponent of the second number to match the first. To do this, we will multiply the exponent by 4 (changing "p-3" to "p-1") and multiply the significand by 1/4, giving 0x0.66666666666668p-1. Then add 0x1.6666666666666p-1 and 0x0.66666666666668p-1, giving 0x1.ccccccccccccc8p-1. Note that this number has more than 53 bits in the significand: The "8" is the 14th digit after the period. Floating-point cannot return a result with this many bits, so it has to be rounded to the nearest representable number. In this case, there are two numbers that are equally near, 0x1.cccccccccccccp-1 and 0x1.ccccccccccccdp-1. When there is a tie, the number with a zero in the lowest bit of the significand is used. "c" is even and "d" is odd, so "c" is used. The final result of the addition is 0x1.cccccccccccccp-1.
Next, add the number for .1 (0x1.999999999999ap-4) to that. Again, we scale to make the exponents match, so 0x1.999999999999ap-4 becomes 0x.33333333333334p-1. Then add that to 0x1.cccccccccccccp-1, giving 0x1.fffffffffffff4p-1. Rounding that to 53 bits gives 0x1.fffffffffffffp-1, and that is the final result of .7+.2+.1.
Now consider .7+.1+.2. For .7+.1, add 0x1.6666666666666p-1 and 0x1.999999999999ap-4. Recall the latter is scaled to 0x.33333333333334p-1. Then the exact sum is 0x1.99999999999994p-1. Rounding that to 53 bits gives 0x1.9999999999999p-1.
Then add the number for .2 (0x1.999999999999ap-3), which is scaled to 0x0.66666666666668p-1. The exact sum is 0x1.fffffffffffff8p-1. This falls exactly halfway between the two nearest representable numbers, 0x1.fffffffffffffp-1 and 0x1.0000000000000p0, so the tie is again broken by choosing the candidate whose lowest significand bit is zero, giving 0x1.0000000000000p0.
Thus, because of errors that occur when rounding, .7+.2+.1 returns 0x1.fffffffffffffp-1 (very slightly less than 1), and .7+.1+.2 returns 0x1.0000000000000p0 (exactly 1).
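You can confirm those two results by printing the sums in hexadecimal floating point (a sketch of mine; the exact spelling of the %a output varies slightly between C libraries):

#include <cstdio>

int main() {
    std::printf("%a\n", 0.7 + 0.2 + 0.1);  // 0x1.fffffffffffffp-1, just below 1
    std::printf("%a\n", 0.7 + 0.1 + 0.2);  // 0x1p+0, exactly 1
}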
Floating point multiplication is not associative in C or C++.
Proof:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
using namespace std;

int main() {
    int counter = 0;
    srand(time(NULL));
    while (counter++ < 10) {
        float a = rand() / 100000;
        float b = rand() / 100000;
        float c = rand() / 100000;
        if (a*(b*c) != (a*b)*c) {
            printf("Not equal\n");
        }
    }
    printf("DONE");
    return 0;
}
In this program, about 30% of the time, (a*b)*c is not equal to a*(b*c).
Neither addition nor multiplication is associative with IEEE 754 double precision (64-bit) numbers. Here are examples for each (evaluated with Python 3.9.7):
>>> (.1 + .2) + .3
0.6000000000000001
>>> .1 + (.2 + .3)
0.6
>>> (.1 * .2) * .3
0.006000000000000001
>>> .1 * (.2 * .3)
0.006
Similar answer to Eric's, but for addition, and with Python.
import random
random.seed(0)
n = 1000
a = [random.random() for i in range(n)]
b = [random.random() for i in range(n)]
c = [random.random() for i in range(n)]
print(sum(1 if (a[i] + b[i]) + c[i] != a[i] + (b[i] + c[i]) else 0 for i in range(n)))