Interchangeability of IEEE 754 floating-point addition and multiplication - c++

Is the addition x + x interchangeable with the multiplication 2 * x in the IEEE 754 (IEC 559) floating-point standard, or, more generally speaking, is there any guarantee that case_add and case_mul always give exactly the same result?
#include <cstddef> // for size_t
#include <limits>

template <typename T>
T case_add(T x, size_t n)
{
    static_assert(std::numeric_limits<T>::is_iec559, "invalid type");
    T result(x);
    for (size_t i = 1; i < n; ++i)
    {
        result += x;
    }
    return result;
}

template <typename T>
T case_mul(T x, size_t n)
{
    static_assert(std::numeric_limits<T>::is_iec559, "invalid type");
    return x * static_cast<T>(n);
}

Is the addition x + x interchangeable with the multiplication 2 * x in the IEEE 754 (IEC 559) floating-point standard
Yes: x + x and 2 * x are mathematically identical, and both are computed exactly in IEEE 754 (doubling a value only increments the exponent), so they always give exactly the same result.
or more generally speaking is there any guarantee that case_add and case_mul always give exactly the same result?
Not generally, no. From what I can tell, it seems to hold for n <= 5:
n=3: x+x is exact (i.e. it involves no rounding), so (x+x)+x involves only one rounding, at the final step.
n=4 (assuming the default round-to-nearest mode):
if the last bit of x is 0, then x+x+x is exact, and so the results are equal by the same argument as n=3.
if the last 2 bits are 01, then the exact value of x+x+x will have last bits 1|1 (where | marks the rounding point of the format), which will be rounded up to 0|0. The next addition gives an exact result ending in |01, so the result will be rounded down, cancelling out the previous error.
if the last 2 bits are 11, then the exact value of x+x+x will have last bits 0|1, which will be rounded down to 0|0. The next addition gives an exact result ending in |11, so the result will be rounded up, again cancelling out the previous error.
n=5 (again assuming default rounding): x+x+x+x is exact, so it holds for the same reason as n=3.
For n=6 it fails: e.g. take x = 1.0000000000000002 (the next double after 1.0), in which case 6*x is 6.000000000000002 but x+x+x+x+x+x is 6.000000000000001.
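A minimal sketch to reproduce the n=6 counterexample (std::nextafter produces the 1.0000000000000002 mentioned above):

#include <cmath>
#include <cstdio>

int main()
{
    double x = std::nextafter(1.0, 2.0);        // 1.0000000000000002
    double sum = ((((x + x) + x) + x) + x) + x; // five additions, each rounded
    double mul = x * 6.0;                       // one multiplication, rounded once
    std::printf("%.17g\n%.17g\n", sum, mul);    // 6.0000000000000009 vs 6.0000000000000018
    std::printf("%d\n", sum == mul);            // prints 0
}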

If n is, for example, pow(2, 54), then the multiplication will work just fine (scaling by a power of two is exact, barring overflow), but on the addition path, once the accumulated result is sufficiently larger than the input x, result += x will yield result unchanged.
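A minimal sketch of that absorption effect:

#include <cstdio>

int main()
{
    double result = 9007199254740992.0; // 2^53
    double x = 1.0;
    // The exact sum 2^53 + 1 lies halfway between 2^53 and 2^53 + 2;
    // round-to-nearest-even picks 2^53, so the accumulator never advances.
    std::printf("%d\n", result + x == result); // prints 1
}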

Yes for doubling, but it doesn't hold generally. Multiplication by a number other than a power of two changes the significand, not just the exponent, so the product may need rounding and can differ from a chain of additions, each of which rounds separately. Multiplication by two, however, cannot drop a bit if replaced by an addition.

If the accumulator result in case_add becomes too large relative to the input x, adding x will introduce rounding errors, and at a certain point adding x won't have any effect at all. So the functions won't always give the same result.
For example if double x = 0x1.0000000000001p0 (hexadecimal float notation):
n  case_add              case_mul
1  0x1.0000000000001p+0  0x1.0000000000001p+0
2  0x1.0000000000001p+1  0x1.0000000000001p+1
3  0x1.8000000000002p+1  0x1.8000000000002p+1
4  0x1.0000000000001p+2  0x1.0000000000001p+2
5  0x1.4000000000001p+2  0x1.4000000000001p+2
6  0x1.8000000000001p+2  0x1.8000000000002p+2
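A sketch to reproduce the table, assuming the case_add and case_mul templates from the question are in scope (hex-float literals require C++17):

#include <cstddef>
#include <cstdio>

int main()
{
    double x = 0x1.0000000000001p+0;
    std::printf("n  case_add              case_mul\n");
    for (std::size_t n = 1; n <= 6; ++n)
        std::printf("%zu  %-21a %a\n", n, case_add(x, n), case_mul(x, n));
}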

Related

Issue related to double precision floating point division in C++

In C++, we can find the smallest positive normalized double-precision value using std::numeric_limits<double>::min(). It prints as 2.22507e-308.
Now, if this minimum value (minval) is subtracted from a given double value (say val) and the difference is divided by the same value, (val - minval) / val, I was expecting floor((val - minval) / val) to come out as 0, since the quotient should be just below 1.
To my surprise, the answer is delivered as 1. Can someone please explain this anomalous behavior?
Consider the following code:
#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    double minval = std::numeric_limits<double>::min(), wg = 8038,
           ans = std::floor((wg - minval) / wg); // expecting the answer to round to 0
    std::cout << ans; // but the answer actually comes out as 1!
}
A double typically has around 16 digits of precision.
You're starting with 8038, which is roughly 8.038e3. Since we have around 16 significant digits, the smallest number we can subtract from it and still get a result different from 8038 is about 8.038e(3-16) = 8.038e-13.
8038 - 2.2e-308 is like reducing the mass of the universe by one electron, and expecting that to affect the mass of the universe by a significant amount.
Actually, relatively speaking, 8038-2.2e-308 is a much smaller change than removing a whole electron from the universe--more like removing a minuscule fraction of a single electron from the universe, if that were possible. Even if we were to assume that string theory were correct, even removing one string from the universe would still be a huge change compared to subtracting 2.2e-308 from 8038.
The comments and the previous answer correctly attribute the cause to floating-point precision, but some additional details are needed to explain the observed behavior. Even in cases where the exact result of a subtraction cannot be represented in the finite precision of floating-point numbers, the subtraction is not simply discarded: the implementation still computes the infinitely precise result and rounds it.
As an example, consider the code below.
#include <cmath>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    double b, c, d;
    vector<double> a{0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7};
    cout << "Subtraction Possible?" << "\t" << "Floor Result" << "\n";
    for (int i = 0; i < 9; i++) {
        b = std::nextafter(a[i], 0); // largest double below a[i]
        c = a[i] - b;                // the gap between a[i] and its lower neighbour
        d = 1e-17;                   // smaller than that gap for every a[i] listed
        if (d > c)
            cout << "True" << "\t";
        else
            cout << "False" << "\t";
        cout << setprecision(52) << floor((a[i] - d) / a[i]) << "\n";
    }
    return 0;
}
The code takes the double-precision values in vector a and subtracts 1e-17 from each. Note that, using std::nextafter, the smallest value that can be subtracted from 0.07 (the gap to its lower neighbour) is shown to be 1.387778780781445675529539585113525390625e-17. This means that 1e-17 is smaller than the smallest value that can be subtracted from any of these numbers. Hence, naively, the subtraction should have no effect on any of the numbers listed in vector a. If the subtraction results were simply discarded, the answer would always stay 1, but it turns out that sometimes the answer is 0 and other times 1.
This can be observed from the output of the C++ program as shown below:
Subtraction Possible? Floor Result
False 0
False 0
False 0
False 0
False 1
False 1
False 1
False 1
False 1
The reason lies buried in the IEEE 754 floating-point specification. The standard specifically states that even in cases where the result of an operation cannot be represented exactly, rounding must be carried out. I quote Page 27, Section 4.3 of the IEEE 754-2019 document:
Except where stated otherwise, every operation shall be performed as if it first produced an
intermediate result correct to infinite precision and with unbounded range, and then rounded that result
according to one of the attributes in this clause
The statement is repeated in Section 5.1 on Page 29:
Unless otherwise specified, each of the computational
operations specified by this standard that returns a numeric result shall be performed as if it first produced
an intermediate result correct to infinite precision and with unbounded range, and then rounded that
intermediate result, if necessary, to fit in the destination’s format (see Clause 4 and Clause 7).
g++ (which I have been testing) correctly and precisely follows the standard, implementing the round-to-nearest behavior described in Section 4.3.1 of the IEEE 754 document. The implication is that even when a[i] - d is not representable, a numeric result is delivered as if the subtraction first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result. Hence it may or may not be the case that a[i] - d == a[i], which means the answer may be 1 or 0 depending on whether the exact difference is closer to a[i] or to the next representable value below a[i].
It turns out that 8038 - 2.22507e-308 is closer to 8038, so the difference is rounded (using round-to-nearest) back to 8038 and the final answer is 1. This behavior follows directly from the implementation's adherence to the standard and is not something arbitrary.
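A minimal sketch confirming this for the original example (DBL_MIN is the same value as std::numeric_limits<double>::min()):

#include <cfloat>
#include <cmath>
#include <cstdio>

int main()
{
    double wg = 8038, minval = DBL_MIN; // 2.2250738585072014e-308
    // The infinitely precise difference is rounded back to the nearest double: 8038 itself.
    std::printf("%d\n", (wg - minval) == wg);            // prints 1
    std::printf("%g\n", std::floor((wg - minval) / wg)); // prints 1
}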
I found the references below on floating-point numbers very useful. I recommend reading Cleve Moler's (the founder of MATLAB) note on floating-point numbers before going through the IEEE specification, for a quick and easy understanding of their behavior.
"IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
Moler, Cleve. “Floating Points.” MATLAB News and Notes. Fall, 1996.

Find float a to closest multiple of float b

C++ Scenario: I have two variables of type double a and b.
Goal: a should be set to the closest multiple of b that is smaller than a.
First approach: Use fmod() or remainder() to get r. Then do a = a - r.
I know that due to the representation of decimal numbers in memory fmod() or remainder() can never guarantee 100% accuracy. In my tests I found that I cannot use fmod() at all, as the variance of its results is too unpredictable (at least as far as I understand). There are many questions and discussions out there talking about this phenomenon.
So is there something I could do to still use fmod()?
With “something” I mean some trick similar to checking whether a equals b by employing an epsilon:
double EPSILON = 0.005;
if (std::abs(a - b) < EPSILON)
    std::cout << "equal" << '\n';
My second approach works but seems not to be very elegant. I am just subtracting b from a until there is nothing left to subtract:
#include <iostream>
#include <limits>

double findRemainder(double x, double y) {
    double rest = 0; // initialized so x == y doesn't return an indeterminate value
    if (y > x)
    {
        double temp = x;
        x = y;
        y = temp;
    }
    while (x > y)
    {
        rest = x - y;
        x = x - y;
    }
    return rest;
}

int main()
{
    typedef std::numeric_limits<double> dbl;
    std::cout.precision(dbl::max_digits10);
    double a = 13.78, b = 2.2, r = 0;
    r = findRemainder(a, b);
    std::cout << r << '\n';
    return 0;
}
Any suggestions for me?
Preamble
The problem is impossible, both as stated and as intended.
Remainders are exact
This statement is incorrect: “fmod() or remainder() can never guarantee 100% accuracy.” If the floating-point format supports subnormal numbers (as IEEE-754 does), then fmod(x, y) and remainder are both exact; they produce a result with no rounding error (barring bugs in their implementation). The remainder, as defined for either of them, is always less than y and not more than x in magnitude. Therefore, it is always in a portion of the floating-point format that is at least as fine as y and as x, so all the bits needed for the real-arithmetic remainder can be represented in the floating-point remainder. So a correct implementation will return the exact remainder.
Multiples may not be representable
For simplicity of illustration, I will use IEEE-754 binary32, the format commonly used for float. The issues are the same for other formats. In this format, all integers with magnitude up to 2^24 = 16,777,216 are representable. After that, due to the scaling by the floating-point exponent, the representable values increase by two: 16,777,218, 16,777,220, and so on. At 2^25 = 33,554,432, they increase by four: 33,554,436, 33,554,440. At 2^26 = 67,108,864, they increase by eight.
100,000,000 is representable, and so are 99,999,992 and 100,000,008. Now consider asking what multiple of 3 is the closest to 100,000,000. It is 99,999,999. But 99,999,999 is not representable in the binary32 format.
Thus, it is not always possible for a function to take two representable values, a and b, and return the greatest multiple of b that is less than a, using the same floating-point format. This is not because of any difficulty computing the multiple but simply because it is impossible to represent the true multiple in the floating-point format.
In fact, given the standard library, it is easy to compute the remainder; std::fmod(100000000.f, 3.f) is 1. But it is impossible to compute 100000000.f − 1 in the binary32 format.
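A minimal sketch of both facts (the remainder is exact, and the subtraction cannot land on 99,999,999):

#include <cmath>
#include <cstdio>

int main()
{
    float r = std::fmod(100000000.f, 3.f);
    std::printf("%g\n", r);                 // 1: the remainder is exact
    std::printf("%.1f\n", 100000000.f - r); // 100000000.0: 99999999 is not representable,
                                            // so the subtraction rounds back up
}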
The intended question is impossible
The examples shown, 13.78 for a and 2.2 for b, suggest the desire is to produce a multiple for some floating-point numbers a and b that are the results of converting decimal numerals a and b to the floating-point format. However, once such conversions are performed, the original numbers cannot be known from the results a and b.
To see this, consider values for a of either 99,999,997 or 100,000,002 while b is 10. The greatest multiple of 10 less than 99,999,997 is 99,999,990, and the greatest multiple of 10 less than 100,000,002 is 100,000,000.
When either 99,999,997 or 100,000,002 is converted to the binary32 format (using the common method, round-to-nearest-ties-to-even), the result for a is 100,000,000. Converting b of course yields 10 for b.
Then a function that computes the greatest multiple of b that is less than a can return only one result. Even if this function uses extended precision (say binary64), so that it could return either 99,999,990 or 100,000,000 even though those are not representable in binary32, it has no way to distinguish the two cases. Whether the original a is 99,999,997 or 100,000,002, the a given to the function is 100,000,000, so there is no way for it to know the original a and no way for it to decide which result to return.
Hmm,
there really is a problem of definition, because most multiples of a floating point won't be representable exactly, except maybe if the multiplier is a power of two.
Taking your example and Smalltalk notation (which does not really matter; I use it just because I can evaluate and verify the expressions I propose), the exact fractional representations of the double-precision values 0.1 and 0.9 can be written:
(1+(1<<54)reciprocal) / 10 = 0.1.
(9+(1<<52)reciprocal) / 10 = 0.9.
<< is a bitshift, so 1<<54 is 2 raised to the power 54, and reciprocal gives its inverse, 2^-54.
As you can easily see:
(1+(1<<54)reciprocal) * 9 > (9+(1<<52)reciprocal)
That is, the exact multiple of 0.1 is greater than 0.9.
Thus, technically, the answer is 8*0.1 (which is exact in this lucky case)
(8+(1<<51)reciprocal) / 10 = 0.8.
What remainder does is give the EXACT remainder of the division, so it is related to the computations above.
You can try it: you will find something like -2.77555...e-17, or exactly the negation of (1<<55) reciprocal. The negative sign indicates that the nearest multiple is close to 0.9, but a bit above it.
However, if your problem is to find the greatest value <= 0.9 among the rounded-to-nearest multiples of 0.1, then your answer will be 0.9, because the rounded product 0.1*9 is 0.9.
You have to resolve that ambiguity first. And if you are really interested not in multiples of the double 0.1, but in multiples of the fraction 1/10, then it's again a different matter...
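The same computation in C++ (a sketch; std::remainder returns the exact signed remainder discussed above):

#include <cmath>
#include <cstdio>

int main()
{
    // Prints -2.7755575615628914e-17, i.e. exactly -(2^-55): the nearest
    // multiple, 9 * 0.1, lies slightly above 0.9.
    std::printf("%.17g\n", std::remainder(0.9, 0.1));
}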

Should I use multiplication or division for recurring floats?

It is common knowledge that division takes many more clock cycles to compute than multiplication. (Refer to the discussion here: Floating point division vs floating point multiplication.)
I already use x * 0.5 instead of x / 2 and x * 0.125 instead of x / 8 in my C++ code, but I was wondering how far I should take this.
For decimals that recur when inverted (i.e. 1 / num is a recurring decimal), I use division instead of multiplication (for example, x / 2.2 instead of x * 0.45454545454).
My question is: in loops that iterate a considerably large number of times, should I replace divisors with their recurring multiplicative counterparts (i.e. x * 0.45454545454 instead of x / 2.2), or will this bring an even greater loss of precision?
Edit: I did some profiling. I turned on full optimization in Visual Studio and used the Windows QueryPerformanceCounter() function to get the timings.

// init(), start(), restart(), and getElapsedMilliseconds() are my own
// QueryPerformanceCounter()-based timing helpers.
int main() {
    init();
    int x;
    float value = 100002030.0;
    start();
    for (x = 0; x < 100000000; x++)
        value /= 2.2;
    printf("Div: %fms, value: %f", getElapsedMilliseconds(), value);
    value = 100002030.0;
    restart();
    for (x = 0; x < 100000000; x++)
        value *= 0.45454545454;
    printf("\nMult: %fms, value: %f", getElapsedMilliseconds(), value);
    scanf_s("");
}
The results are: Div: 426.907185ms, value: 0.000000
Mult: 289.616415ms, value: 0.000000
Division took almost twice as long as multiplication, even with optimizations. So the performance benefit is real, but will multiplying by the reciprocal reduce precision?
For decimals that recur when inverted (ie. 1 / num is a recurring decimal), I use division instead of multiplication (example x / 2.2 instead of x * 0.45454545454).
It is also common knowledge that 22/10 is not representable exactly in binary floating-point, so all you are achieving, instead of multiplying by a slightly inaccurate value, is dividing by a slightly inaccurate value.
In fact, if the intent is to divide by the real value 22/10, or some other value that isn't exactly representable in binary floating-point, then about half the time the multiplication is more accurate than the division, because it happens, by coincidence, that the relative error of the representable approximation of 1/X is smaller than the relative error of the approximation of X.
Another remark: your micro-benchmark runs into subnormal numbers, for which the timings are not representative of the usual operations on normal floating-point numbers, and after a short while value is zero, which again makes the timings unrepresentative of real multiplication and division. As Mark Ransom says, you should at least make the operands the same for both measurements: as currently written, all the multiplications take a zero operand and produce zero. Also, since 2.2 and 0.45454545454 both have type double, your benchmark measures double-precision multiplication and division; if you are willing to implement a single-precision division as a double-precision multiplication, this need not involve any loss of accuracy (but you would have to provide more digits for 1/2.2).
But don't let yourself be fooled into trying to fix the micro-benchmark. You don't need it, because there is no trade-off when X is no more exactly representable than 1/X. There is no reason not to use multiplication.
Note: you have to write the multiplication by 1 / X explicitly, because / X and * (1 / X) give very slightly different results, so the compiler is not allowed to make the replacement itself. On the other hand, you don't need to replace / 2 by * 0.5, because any compiler worth its salt will do that for you.
You will get different answers when multiplying by a reciprocal versus dividing, but in practice it typically does not matter, and the performance gain is worthwhile. At most, the error will be 1 ULP for reciprocal multiplication versus ½ ULP for division. But do
a = b * (1.f / 7.f);
rather than
a = b * 0.142857f;
because the former will generate the most accurate (½ ULP) representation for 1/7.
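One way to see how often the two approaches actually differ (a sketch over an arbitrary range of consecutive floats; the exact count depends on the range chosen):

#include <cmath>
#include <cstdio>

int main()
{
    const float inv7 = 1.f / 7.f; // correctly rounded at compile time
    int diffs = 0;
    float x = 1.f;
    for (int i = 0; i < 1000000; ++i) {
        if (x * inv7 != x / 7.f)
            ++diffs; // when they differ, it is by one ulp
        x = std::nextafterf(x, 2.f); // step to the next representable float
    }
    std::printf("%d of 1000000 differ\n", diffs);
}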

Does casting `std::floor()` and `std::ceil()` to integer type always give the correct result?

I am being paranoid that one of these functions may give an incorrect result like this:
std::floor(2000.0 / 1000.0) --> std::floor(1.999999999999) --> 1
or
std::ceil(18 / 3) --> std::ceil(6.000000000001) --> 7
Can something like this happen? If there is indeed a risk like this, I'm planning to use the functions below in order to work safely. But, is this really necessary?
#include <cmath>
#include <cstdint>

constexpr long double EPSILON = 1e-10;

intmax_t GuaranteedFloor(const long double & Number)
{
    if (Number > 0)
    {
        return static_cast<intmax_t>(std::floor(Number) + EPSILON);
    }
    else
    {
        return static_cast<intmax_t>(std::floor(Number) - EPSILON);
    }
}

intmax_t GuaranteedCeil(const long double & Number)
{
    if (Number > 0)
    {
        return static_cast<intmax_t>(std::ceil(Number) + EPSILON);
    }
    else
    {
        return static_cast<intmax_t>(std::ceil(Number) - EPSILON);
    }
}
(Note: I'm assuming that the given 'long double' argument will fit in the 'intmax_t' return type.)
People often get the impression that floating point operations produce results with small, unpredictable, quasi-random errors. This impression is incorrect.
Floating point arithmetic computations are as exact as possible. 18/3 will always produce exactly 6. The result of 1/3 won't be exactly one third, but it will be the closest number to one third that is representable as a floating point number.
So the examples you showed are guaranteed to always work. As for your suggested "guaranteed floor/ceil", it's not a good idea. Certain sequences of operations can easily blow the error far above 1e-10, and certain other use cases will require 1e-10 to be correctly recognized (and ceil'ed) as nonzero.
As a rule of thumb, hardcoded epsilon values are bugs in your code.
In the specific examples you're listing, I don't think those errors would ever occur.
std::floor(2000.0 /*Exactly Representable in 32-bit or 64-bit Floating Point Numbers*/ / 1000.0 /*Also exactly representable*/) --> std::floor(2.0 /*Exactly Representable*/) --> 2
std::ceil(18 / 3 /*both treated as ints, might not even compile if ceil isn't properly overloaded....?*/) --> 6
std::ceil(18.0 /*Exactly Representable*/ / 3.0 /*Exactly Representable*/) --> 6
Having said that, if you have math that depends on these functions behaving exactly correctly for floating point numbers, that may illuminate a design flaw you need to reconsider/reexamine.
As long as the floating-point values x and y exactly represent integers within the limits of the type you're using, there's no problem--x / y will always yield a floating-point value that exactly represents the integer result. Casting to int as you're doing will always work.
However, once the floating-point values go outside the integer-representable range for the type (Representing integers in doubles), epsilons don't help.
Consider this example. 16777217 is the smallest integer not exactly representable as a 32-bit float:
#include <cstdio>

int main()
{
    int ix = 16777217, iy = 97;
    printf("%d / %d = %d\n", ix, iy, ix / iy);
    // yields "16777217 / 97 = 172961", which is accurate

    float x = ix, y = iy;
    printf("%f / %f = %f\n", x, y, x / y);
    // yields "16777216.000000 / 97.000000 = 172960.989691"
}
In this case, the error is negative; in other cases (try 16777219 / 1549), the error is positive.
While it's tempting to add an epsilon to make floor work, it won't extend the accuracy much. When the values differ by many orders of magnitude, the error becomes greater than 1 and integer accuracy can't be guaranteed. Specifically, once x/y exceeds the range in which all integers are exactly representable (2^24 for float), the error can exceed 1.0, so an epsilon is no help.
If this is coming into play, you will have to consider changing your mathematical approach--order of operations, work with logarithms, etc.
Such results are likely to appear when working with doubles. You can use round, or you can subtract 0.5 and then use the std::ceil function.

Can I trust a real-to-int conversion of the result of ceil()?

Suppose I have some code such as:
float a, b = ...; // both positive
int s1 = ceil(sqrt(a/b));
int s2 = ceil(sqrt(a/b)) + 0.1;
Is it ever possible that s1 != s2? My concern is when a/b is a perfect square. For example, perhaps a=100.0 and b=4.0, then the output of ceil should be 5.00000 but what if instead it is 4.99999?
Similar question: is there a chance that 100.0/4.0 evaluates to say 5.00001 and then ceil will round it up to 6.00000?
I'd prefer to do this in integer math but the sqrt kinda screws that plan.
EDIT: suggestions on how to better implement this would be appreciated too! The a and b values are integer values, so actual code is more like: ceil(sqrt(float(a)/b))
EDIT: Based on levis501's answer, I think I will do this:
float a, b = ...; // both positive
int s = sqrt(a/b);
while (s*s*b < a) ++s;
Thank you all!
I don't think it's possible. Regardless of the value of sqrt(a/b), what it produces is some value N that we use as:
int s1 = ceil(N);
int s2 = ceil(N) + 0.1;
Since ceil always produces an integer value (albeit represented as a double), we will always have some value X, for which the first produces X.0 and the second X.1. Conversion to int will always truncate that .1, so both will result in X.
It might seem like there would be an exception if X was so large that X.1 overflowed the range of double. I don't see where this could be possible though. Except close to 0 (where overflow isn't a concern) the square root of a number will always be smaller than the input number. Therefore, before ceil(N)+0.1 could overflow, the a/b being used as an input in sqrt(a/b) would have to have overflowed already.
You may want to write an explicit function for your case. e.g.:
#include <cmath>

/* return the smallest positive integer whose square is at least x */
int isqrt(double x) {
    int y1 = std::ceil(std::sqrt(x));
    int y2 = y1 - 1;
    if ((double)y2 * y2 >= x) return y2; // compare in double to avoid int overflow
    return y1;
}
This will handle the odd case where the computed square root of your ratio a/b comes out a hair above an integer, within the precision of double.
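For instance, assuming the isqrt above is in scope:

#include <cstdio>

int main()
{
    std::printf("%d\n", isqrt(25.0)); // 5: the exact-square case
    std::printf("%d\n", isqrt(26.0)); // 6: smallest integer whose square >= 26
}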
Equality of floating point numbers is indeed an issue, but IMHO not if we deal with integer numbers.
If you have the case of 100.0/4.0, it evaluates to exactly 25.0, as 25.0 is exactly representable as a float, as opposed to e.g. 25.1.
Yes, it's entirely possible that s1 != s2. Why is that a problem, though?
It seems natural enough that s1 != (s1 + 0.1).
BTW, if you would prefer to have 5.00001 rounded to 5.00000 instead of 6.00000, use rint instead of ceil.
And to answer the actual question (in your comment) - you can use sqrt to get a starting point and then just find the correct square using integer arithmetic.
#include <cmath>

int min_dimension_greater_than(int items, int buckets)
{
    double target = double(items) / buckets;
    int min_square = ceil(target);
    int dim = floor(sqrt(target));
    int square = dim * dim;
    while (square < min_square) {
        dim += 1;
        square = dim * dim;
    }
    return dim;
}
And yes, this can be improved a lot, it's just a quick sketch.
s1 will always equal s2.
The C and C++ standards do not say much about the accuracy of math routines. Taken literally, it is impossible for the standard to be implemented, since the C standard says sqrt(x) returns the square root of x, but the square root of two cannot be exactly represented in floating point.
Implementing routines with good performance that always return a correctly rounded result (in round-to-nearest mode, this means the result is the representable floating-point number that is nearest to the exact result, with ties resolved in favor of a low zero bit) is a difficult research problem. Good math libraries target accuracy less than 1 ULP (so one of the two nearest representable numbers is returned), perhaps something slightly more than .5 ULP. (An ULP is the Unit of Least Precision, the value of the low bit given a particular value in the exponent field.) Some math libraries may be significantly worse than this. You would have to ask your vendor or check the documentation for more information.
So sqrt may be slightly off. If the exact square root is an integer (within the range in which integers are exactly representable in floating-point) and the library guarantees errors are less than 1 ULP, then the result of sqrt must be exactly correct, because any result other than the exact result is at least 1 ULP away.
Similarly, if the library guarantees errors are less than 1 ULP, then ceil must return the exact result, again because the exact result is representable and any other result would be at least 1 ULP away. Additionally, the nature of ceil is such that I would expect any reasonable math library to always return an integer, even if the rest of the library were not high quality.
As for overflow cases, if ceil(x) were beyond the range where all integers are exactly representable, then ceil(x)+.1 is closer to ceil(x) than it is to any other representable number, so the rounded result of adding .1 to ceil(x) should be ceil(x) in any system implementing the floating-point standard (IEEE 754). That is provided you are in the default rounding mode, which is round-to-nearest. It is possible to change the rounding mode to something like round-toward-infinity, which could cause ceil(x)+.1 to be an integer higher than ceil(x).
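A sketch of that last point. Note that this is sensitive to the environment: compilers may constant-fold floating-point expressions unless FENV_ACCESS is honored, so the result can vary with optimization settings.

#include <cfenv>
#include <cmath>
#include <cstdio>

#pragma STDC FENV_ACCESS ON // implementation-defined; some compilers ignore it

int main()
{
    double c = std::ceil(1e300); // far beyond the range where all integers are representable
    std::fesetround(FE_UPWARD);
    std::printf("%d\n", (c + 0.1) > c); // 1: round-toward-+infinity bumps to the next double
    std::fesetround(FE_TONEAREST);
    std::printf("%d\n", (c + 0.1) > c); // 0: the 0.1 is absorbed under round-to-nearest
}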