C++ floating points comparison why epsilon() * std::fabs(x+y)? - c++

This code sample is from https://en.cppreference.com/w/cpp/types/numeric_limits/epsilon:
template<class T>
typename std::enable_if<!std::numeric_limits<T>::is_integer, bool>::type
    almost_equal(T x, T y, int ulp)
{
    // the machine epsilon has to be scaled to the magnitude of the values used
    // and multiplied by the desired precision in ULPs (units in the last place)
    return std::fabs(x-y) <= std::numeric_limits<T>::epsilon() * std::fabs(x+y) * ulp
        // unless the result is subnormal
        || std::fabs(x-y) < std::numeric_limits<T>::min();
}
Can someone explain why epsilon is scaled to fabs(x+y) instead of std::fmax(std::fabs(x), std::fabs(y))?

We can distinguish two main cases: either x and y are almost_equal, or they aren't.
When x and y are almost equal (down to a few bits), x+y is almost equal to 2*x. It's likely that x and y have the same exponent, but it's also possible that their exponent differs by exactly one (e.g. 0.49999 and 0.500001).
When x and y are not even close, e.g. when they differ by more than a factor of 2, then almost_equal(x,y) will return false anyway - the difference between the current and proposed implementations doesn't really matter.
So, what we're really worried about is the edge case - where x and y differ in about ulp bits. So getting back to the definition of epsilon: it differs from 1.0 in exactly 1 Unit in the Last Place (1 ulp). Differing in N ULP's means differing by ldexp(epsilon, N-1).
But here we need to look carefully at the direction of epsilon. It's defined by the next bigger number: nextafter(1.0, 2.0)-1.0. But that's twice the value of 1.0-nextafter(1.0, 0.0) - the previous smaller number.
This is because the difference between adjacent floating-point numbers around x scales approximately with x itself, but as a step function: the step size stays constant between two consecutive powers of 2 and changes by a factor of 2 at every power of 2.
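The asymmetry around a power of 2 can be checked directly. This is only a minimal sketch, assuming IEEE-754 binary64 doubles; the helper names gap_above and gap_below are made up for illustration and are not part of the cppreference sample.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Distance from x to the next representable double above it.
double gap_above(double x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// Distance from x to the next representable double below it.
double gap_below(double x) {
    assert(std::isfinite(x));
    return x - std::nextafter(x, -std::numeric_limits<double>::infinity());
}
```

Under that assumption, gap_above(1.0) equals epsilon while gap_below(1.0) is half of it, and both gaps are equal for a value such as 1.5 that is not a power of 2.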
So, your proposed lower bound of approximately x indeed works for the best case (near powers of 2) but can be off by a factor of 2. And x+y, which is almost equal to 2*x when it matters, also works for the worst-case inputs.
So you could just multiply your bound by 2, but it's already a bit slower due to the hidden conditional.

Related

How to write this floating point code in a portable way?

I am working on a cryptocurrency and there is a calculation that nodes must make:
average /= total;
double ratio = average/DESIRED_BLOCK_TIME_SEC;
int delta = -round(log2(ratio));
It is required that every node has the exact same result no matter what architecture or stdlib is being used by the system. My understanding is that log2 might have different implementations that yield very slightly different results, or that flags like -ffast-math could impact the results.
Is there a simple way to convert the above calculation to something that is verifiably portable across different architectures (fixed point?) or am I overthinking the precision that is needed (given that I round the answer at the end).
EDIT: Average is a long and total is an int... so average ends up rounded to the closest second.
DESIRED_BLOCK_TIME_SEC = 30.0 (it's a float) that is #defined
For this kind of calculation to be exact, one must either calculate all the divisions and logarithms exactly -- or one can work backwards.
-round(log2(x)) == round(log2(1/x)), meaning that one of the divisions can be turned around to get (1/x) >= 1.
round(log2(x)) == floor(log2(x * sqrt(2))) == binary_log((int)(x*sqrt(2))).
One minor detail here is whether (double)sqrt(2) rounds down or up. If it rounds up, then there might exist one or more values where x * sqrt2 == 2^n + epsilon (after rounding), whereas if it rounded down, we would get 2^n - epsilon. One would give the integer value n, the other would give n-1. Which is correct?
Naturally, the correct one is the one whose ratio to the theoretical midpoint x * sqrt(2) is smaller.
x * sqrt(2) / 2^(n-1) < 2^n / (x * sqrt(2)) -- multiply by x*sqrt(2)
x^2 * 2 / 2^(n-1) < 2^n -- multiply by 2^(n-1)
x^2 * 2 < 2^(2*n-1)
In order for this comparison to be exact, x^2 or pow(x,2) must be exact as well on the boundary - and it matters what range the original values are in. A similar analysis can and should be done while expanding x = a/b, so that the inexactness of the division can be mitigated at the cost of possible overflow in the multiplication...
Then again, I wonder how all the other similar applications handle the corner cases, which may not even exist -- and those could be brute force searched assuming that average and total are small enough integers.
EDIT
Because average is an integer, it makes sense to tabulate those exact integer values, which are on the boundaries of -round(log2(average)).
From octave: d=-round(log2((1:1000000)/30.0)); find(d(2:end) ~= d(1:end-1))
1 2 3 6 11 22 43 85 170 340 679 1358 2716
5431 10862 21723 43445 86890 173779 347558 695115
All the averages in [1, 2) -> 5
All the averages in [2, 3) -> 4
All the averages in [3, 6) -> 3
..
All the averages in [43445, 86890) -> -11
int a = find_lower_bound(average, table); // linear or binary search
return 5 - a;
No floating point arithmetic needed
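The lookup above can be sketched in C++ as follows. This is a minimal sketch, not the asker's actual code: the names kBoundaries and delta_from_table are hypothetical, the boundary values are copied from the tabulated list above, and the result is only meant for averages from 1 up to the 1000000 that the Octave search covered.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// First average (in seconds) of each bucket, copied from the list above.
constexpr std::array<std::int64_t, 21> kBoundaries = {
    1, 2, 3, 6, 11, 22, 43, 85, 170, 340, 679, 1358, 2716,
    5431, 10862, 21723, 43445, 86890, 173779, 347558, 695115};

// Integer-only stand-in for -round(log2(average / 30.0)),
// assuming 1 <= average <= 1000000.
int delta_from_table(std::int64_t average) {
    assert(average >= 1);
    // upper_bound finds the first boundary greater than average,
    // so the bucket index is one position before it.
    auto it = std::upper_bound(kBoundaries.begin(), kBoundaries.end(), average);
    int bucket = static_cast<int>(it - kBoundaries.begin()) - 1;
    return 5 - bucket;
}
```

Because only integer comparisons are involved, every node gets the same answer regardless of its log2 implementation or compiler flags.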

Find float a to closest multiple of float b

C++ Scenario: I have two variables of type double a and b.
Goal: a should be set to the closest multiple of b that is smaller than a.
First approach: Use fmod() or remainder() to get r. Then do a = a - r.
I know that due to the representation of decimal numbers in memory fmod() or remainder() can never guarantee 100% accuracy. In my tests I found that I cannot use fmod() at all, as the variance of its results is too unpredictable (at least as far as I understand). There are many questions and discussions out there talking about this phenomenon.
So is there something I could do to still use fmod()?
With “something” I mean some trick similar to checking if a equals b by employing a value EPSILON:
double EPSILON = 0.005;
if (std::abs(a-b) < EPSILON)
    std::cout << "equal" << '\n';
My second approach works but seems not to be very elegant. I am just subtracting b from a until there is nothing left to subtract:
#include <iostream>
#include <limits>

double findRemainder(double x, double y) {
    if (y > x)
    {
        double temp = x;
        x = y;
        y = temp;
    }
    double rest = x; // initialized so the result is defined even if the loop never runs
    while (x > y)
    {
        rest = x - y;
        x = x - y;
    }
    return rest;
}
int main()
{
    typedef std::numeric_limits<double> dbl;
    std::cout.precision(dbl::max_digits10);
    double a = 13.78, b = 2.2, r = 0;
    r = findRemainder(a, b);
    return 0;
}
Any suggestions for me?
Preamble
The problem is impossible, both as stated and as intended.
Remainders are exact
This statement is incorrect: “fmod() or remainder() can never guarantee 100% accuracy.” If the floating-point format supports subnormal numbers (as IEEE-754 does), then fmod(x, y) and remainder are both exact; they produce a result with no rounding error (barring bugs in their implementation). The remainder, as defined for either of them, is always less than y and not more than x in magnitude. Therefore, it is always in a portion of the floating-point format that is at least as fine as y and as x, so all the bits needed for the real-arithmetic remainder can be represented in the floating-point remainder. So a correct implementation will return the exact remainder.
Multiples may not be representable
For simplicity of illustration, I will use IEEE-754 binary32, the format commonly used for float. The issues are the same for other formats. In this format, all integers with magnitude up to 2^24, 16,777,216, are representable. After that, due to the scaling by the floating-point exponent, the representable values increase by two: 16,777,218, 16,777,220, and so on. At 2^25, 33,554,432, they increase by four: 33,554,436, 33,554,440. At 2^26, 67,108,864, they increase by eight.
100,000,000 is representable, and so are 99,999,992 and 100,000,008. Now consider asking what multiple of 3 is the closest to 100,000,000. It is 99,999,999. But 99,999,999 is not representable in the binary32 format.
Thus, it is not always possible for a function to take two representable values, a and b, and return the greatest multiple of b that is less than a, using the same floating-point format. This is not because of any difficulty computing the multiple but simply because it is impossible to represent the true multiple in the floating-point format.
In fact, given the standard library, it is easy to compute the remainder; std::fmod(100000000.f, 3.f) is 1. But it is impossible to compute 100000000.f − 1 in the binary32 format.
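Both halves of that statement can be tried out. The following is a minimal sketch, assuming float is IEEE-754 binary32 with round-to-nearest; the helper names exact_remainder and subtract_rounded are invented for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Remainder of a divided by b; no rounding error is involved.
float exact_remainder(float a, float b) {
    assert(b != 0.0f);
    return std::fmod(a, b);
}

// Computes a - r and forces the result to be stored as a float,
// so that any extended intermediate precision is discarded.
float subtract_rounded(float a, float r) {
    volatile float result = a - r;
    return result;
}
```

With these, exact_remainder(100000000.f, 3.f) gives 1, but subtract_rounded(100000000.f, 1.f) gives 100000000 again, because 99,999,999 has to be rounded to a representable neighbor.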
The intended question is impossible
The examples shown, 13.78 for a and 2.2 for b, suggest the desire is to produce a multiple for some floating-point numbers a and b that are the results of converting decimal numerals a and b to the floating-point format. However, once such conversions are performed, the original numbers cannot be known from the results a and b.
To see this, consider values for a of either 99,999,997 or 100,000,002 while b is 10. The greatest multiple of 10 less than 99,999,997 is 99,999,990, and the greatest multiple of 10 less than 100,000,002 is 100,000,000.
When either 99,999,997 or 100,000,002 is converted to the binary32 format (using the common method, round-to-nearest-ties-to-even), the result for a is 100,000,000. Converting b of course yields 10 for b.
Then a function that computes the greatest multiple of b that is less than a can return only one result. Even if this function uses extended precision (say binary64) so that it can return either 99,999,990 or 100,000,000 even though those are not representable in binary32, it has no way to distinguish them. Whether the original a is 99,999,997 or 100,000,002, the a given to the function is 100,000,000, so there is no way for it to know the original a and no way for it to decide which result to return.
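The loss of information can be observed with a small sketch. It assumes float is IEEE-754 binary32 and that the implementation converts integers with round-to-nearest (the common behavior, though C++ leaves the choice between the two neighbors implementation-defined); to_binary32 and collide_in_binary32 are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Converts an integer to float and forces the result to be stored as a float.
float to_binary32(std::int32_t n) {
    volatile float f = static_cast<float>(n);
    return f;
}

// Returns true when two distinct integers become the same float.
bool collide_in_binary32(std::int32_t p, std::int32_t q) {
    assert(p != q);  // only meaningful for distinct inputs
    return to_binary32(p) == to_binary32(q);
}
```

Under those assumptions, collide_in_binary32(99999997, 100000002) is true: both inputs arrive as 100,000,000, so no later computation can tell them apart.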
Hmm, there really is a problem of definition, because most multiples of a floating-point number won't be exactly representable, except maybe if the multiplier is a power of two.
Taking your example and Smalltalk notation (which does not really matter; I use it only because I can evaluate and verify the expressions I propose), the exact fractional representations of the double-precision values 0.1 and 0.9 can be written:
(1+(1<<54)reciprocal) / 10 = 0.1.
(9+(1<<52)reciprocal) / 10 = 0.9.
<< is a bitshift, so 1<<54 is 2 raised to the power of 54, and reciprocal is its inverse, 2^-54.
As you can easily see:
(1+(1<<54)reciprocal) * 9 > (9+(1<<52)reciprocal)
That is, the exact multiple of 0.1 is greater than 0.9.
Thus, technically, the answer is 8*0.1 (which is exact in this lucky case)
(8+(1<<51)reciprocal) / 10 = 0.8.
What remainder does is give the EXACT remainder of the division, so it is related to the computations above.
You can try it: you will find something like -2.77555...e-17, or exactly the negative of (1<<55) reciprocal. The negative sign indicates that the nearest multiple is close to 0.9, but a bit above 0.9.
However, if your problem is to find the greatest value <= 0.9 among the multiples of 0.1 after each product has been rounded to nearest, then your answer will be 0.9, because the rounded product is 0.1*9 = 0.9.
You have to first resolve that ambiguity. And if you are not interested in multiples of the double 0.1 but in multiples of (1/10), then it's again a different matter...
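The same numbers can be reproduced in C++ instead of Smalltalk. This is a minimal sketch, assuming IEEE-754 binary64 doubles evaluated without extended precision; floor_to_exact_multiple and distance_to_nearest_multiple are hypothetical names, and the first helper only returns the true multiple when that multiple happens to be representable.

```cpp
#include <cassert>
#include <cmath>

// Greatest exact multiple of the double b that does not exceed a,
// assuming a > 0, b > 0 and that the multiple is representable.
double floor_to_exact_multiple(double a, double b) {
    assert(a > 0.0 && b > 0.0);
    return a - std::fmod(a, b);
}

// Signed distance from the nearest exact multiple of b to a.
double distance_to_nearest_multiple(double a, double b) {
    assert(b != 0.0);
    return std::remainder(a, b);
}
```

For a = 0.9 and b = 0.1, the first helper yields 0.8 and the second yields -2^-55, while the rounded product 0.1 * 9 still compares equal to 0.9, which is exactly the ambiguity described above.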

Interchangeability of IEEE 754 floating-point addition and multiplication

Is the addition x + x interchangeable by the multiplication 2 * x in IEEE 754 (IEC 559) floating-point standard, or more generally speaking is there any guarantee that case_add and case_mul always give exactly the same result?
#include <cstddef>
#include <limits>
template <typename T>
T case_add(T x, size_t n)
{
    static_assert(std::numeric_limits<T>::is_iec559, "invalid type");
    T result(x);
    for (size_t i = 1; i < n; ++i)
    {
        result += x;
    }
    return result;
}
template <typename T>
T case_mul(T x, size_t n)
{
    static_assert(std::numeric_limits<T>::is_iec559, "invalid type");
    return x * static_cast<T>(n);
}
Is the addition x + x interchangeable by the multiplication 2 * x in IEEE 754 (IEC 559) floating-point standard
Yes, since they are both mathematically identical, they will give the same result (since the result is exact in floating point).
or more generally speaking is there any guarantee that case_add and case_mul always give exactly the same result?
Not generally, no. From what I can tell, it seems to hold for n <= 5:
n=3: as x+x is exact (i.e. involves no rounding), so (x+x)+x only involves one rounding at the final step.
n=4 (and you're using the default rounding mode) then
if the last bit of x is 0, then x+x+x is exact, and so the results are equal by the same argument as n=3.
if the last 2 bits are 01, then the exact value of x+x+x will have last 2 bits of 1|1 (where | indicates the final bit in the format), which will be rounded up to 0|0. The next addition will give an exact result |01, so the result will be rounded down, cancelling out the previous error.
if the last 2 bits are 11, then the exact value of x+x+x will have last 2 bits of 0|1, which will be rounded down to 0|0. The next addition will give an exact result |11, so the result will be rounded up, again cancelling out the previous error.
n=5 (again, assuming default rounding): since x+x+x+x is exact, it holds for the same reason as n=3.
For n=6 it fails, e.g. take x to be 1.0000000000000002 (the next double after 1.0), in which case 6x is 6.000000000000002 and x+x+x+x+x+x is 6.000000000000001
If n is for example pow(2, 54) then the multiplication will work just fine, but in the addition path once the result value is sufficiently larger than the input x, result += x will yield result.
Yes, but it doesn't hold generally. Multiplication by a number higher than 2 might not give the same results, as you have changed the exponent and can drop a bit if you replace with adds. Multiplication by two can't drop a bit if replaced by add operations, however.
If the accumulator result in case_add becomes too large, adding x will introduce rounding errors. At a certain point, adding x won't have an effect at all. So the functions won't give the same result.
For example if double x = 0x1.0000000000001p0 (hexadecimal float notation):
n case_add case_mul
1 0x1.0000000000001p+0 0x1.0000000000001p+0
2 0x1.0000000000001p+1 0x1.0000000000001p+1
3 0x1.8000000000002p+1 0x1.8000000000002p+1
4 0x1.0000000000001p+2 0x1.0000000000001p+2
5 0x1.4000000000001p+2 0x1.4000000000001p+2
6 0x1.8000000000001p+2 0x1.8000000000002p+2
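The table can be reproduced with a short sketch. It assumes IEEE-754 binary64 doubles with the default round-to-nearest-ties-to-even mode; sum_repeated and multiply_once are non-template stand-ins for case_add and case_mul, and kNextAfterOne is a name introduced here for the value 0x1.0000000000001p0.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// The next double after 1.0, i.e. 0x1.0000000000001p0.
const double kNextAfterOne = 1.0 + std::numeric_limits<double>::epsilon();

// Adds x to itself so that n copies are summed, rounding after every addition.
double sum_repeated(double x, std::size_t n) {
    assert(n >= 1);
    double result = x;
    for (std::size_t i = 1; i < n; ++i) {
        result += x;
    }
    return result;
}

// Multiplies once, so only a single rounding happens.
double multiply_once(double x, std::size_t n) {
    return x * static_cast<double>(n);
}
```

For this x the two functions agree for n from 1 to 5 and differ by one unit in the last place at n = 6.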

Should I use multiplication or division for recurring floats?

It is common knowledge that division takes many more clock cycles to compute than multiplication. (Refer to the discussion here: Floating point division vs floating point multiplication.)
I already use x * 0.5 instead of x / 2 and x * 0.125 instead of x / 8 in my C++ code, but I was wondering how far I should take this.
For decimals that recur when inverted (i.e. 1 / num is a recurring decimal), I use division instead of multiplication (example x / 2.2 instead of x * 0.45454545454).
My question is: In loops that iterate a considerably large number of times, should I replace divisors with their recurring multiplicative counterparts (i.e. x * 0.45454545454 instead of x / 2.2), or will this bring an even greater loss of precision?
Edit: I did some profiling, I turned on full optimization in Visual Studio, used the Windows QueryPerformanceCounter() function to get profiling results.
int main() {
init();
int x;
float value = 100002030.0;
start();
for (x = 0; x < 100000000; x++)
value /= 2.2;
printf("Div: %fms, value: %f", getElapsedMilliseconds(), value);
value = 100002030.0;
restart();
for (x = 0; x < 100000000; x++)
value *= 0.45454545454;
printf("\nMult: %fms, value: %f", getElapsedMilliseconds(), value);
scanf_s("");
}
The results are: Div: 426.907185ms, value: 0.000000
Mult: 289.616415ms, value: 0.000000
Division took almost twice as long as multiplication, even with optimizations. Performance benefits are guaranteed, but will they reduce precision?
For decimals that recur when inverted (i.e. 1 / num is a recurring decimal), I use division instead of multiplication (example x / 2.2 instead of x * 0.45454545454).
It is also common knowledge that 22/10 is not representable exactly in binary floating-point, so all you are achieving, instead of multiplying by a slightly inaccurate value, is dividing by a slightly inaccurate value.
In fact, if the intent is to divide by 22/10 or some other real value that isn't necessarily exactly representable in binary floating-point, then half the time the multiplication is more accurate than the division, because it happens by coincidence that the relative error for 1/X is less than the relative error for X.
Another remark is that your micro-benchmark runs into subnormal numbers, where the timings are not representative of the usual operations on normal floating-point numbers, and after a short while value is zero, which again means that the timings do not reflect the reality of multiplying and dividing normal numbers. And as Mark Ransom says, you should at least make the operands the same for both measurements: as currently written, all the multiplications take a zero operand and result in zero. Also, since 2.2 and 0.45454545454 both have type double, your benchmark is measuring double-precision multiplication and division. If you are willing to implement a single-precision division by a double-precision multiplication, this need not involve any loss of accuracy (but you would have to provide more digits for 1/2.2).
But don't let yourself be fooled into trying to fix the micro-benchmark. You don't need it, because there is no trade-off when X is no more exactly representable than 1/X. There is no reason not to use multiplication.
Note: you should explicitly multiply by 1 / X because since the two operations / X and * (1 / X) are very slightly different, the compiler is not able to do the replacement itself. On the other hand you don't need to replace / 2 by * 0.5 because any compiler worth its salt should do that for you.
You will get different answers when multiplying by a reciprocal versus dividing, but in practice it typically does not matter, and the performance gain is worthwhile. At most, the error will be 1 ULP for reciprocal multiplication versus ½ ULP for division. But do
a = b * (1.f / 7.f);
rather than
a = b * 0.142857f;
because the former will generate the most accurate (½ ULP) representation for 1/7.
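The difference between the two constants can be measured with a small sketch. It assumes float is IEEE-754 binary32 with correctly rounded division; reciprocal_error, scale_by_division and scale_by_reciprocal are hypothetical names used only here.

```cpp
#include <cassert>
#include <cmath>

// Absolute error of a float approximation of 1/d, measured in double precision.
double reciprocal_error(float approx, int d) {
    assert(d != 0);
    return std::fabs(static_cast<double>(approx) - 1.0 / d);
}

// Reference: a true division by 7.
float scale_by_division(float b) { return b / 7.0f; }

// Multiplication by the reciprocal computed from the exact divisor.
float scale_by_reciprocal(float b) { return b * (1.0f / 7.0f); }
```

Here reciprocal_error(1.0f / 7.0f, 7) comes out smaller than reciprocal_error(0.142857f, 7), since the six-digit literal is already several float steps away from 1/7 before any multiplication happens.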

How to efficiently find the largest integer closest to the mean of two integers in increments of 100,000?

Let's say I am given integers x and y (satisfying x <= y with ones digit of 0 so they are, in particular, divisible by two). Then I know that their average avg = ((x+y) / 2) is an integer as well. I would like to find this midpoint rounded up to a resolution of 100. In other words if my two inputs are 75200 and 75300 then the avg is 75250 and rounded up to the nearest 100 (but without exceeding or equaling the bigger number) forces the answer to be 75200.
How can I implement this logic without first dividing everything by 100 and using the following floating point arithmetic:
x + std::floor((y - x) * .5 * 100 + .5)*0.01
In other words, how can I do the above without floating point values but obtain the same behavior at the resolution of 100 instead of 0.01?
To compute the average you can do
avg = (x + y) / 2
(BTW, integer addition and division by 2 are very cheap operations even on small microcontrollers.)
To round this to the nearest multiple of 100 (corresponding to your floating-point example) you can do
result = ((avg + 50) / 100) * 100
as integer division rounds down to the nearest integer. By changing the 50 to 0 you can always round down, while changing it to 99 always rounds up.
Edit: Note that this method for rounding doesn't work for negative numbers. Since integer division rounds towards zero, in that case you'll need to subtract the 50, subtract 99 to always round down and subtract 0 to always round up.
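Put together, the integer-only rounding might look like the sketch below. It assumes non-negative inputs whose sum does not overflow int; the function names are invented for this example.

```cpp
#include <cassert>

// Average of two non-negative ints, assuming x + y does not overflow.
int average(int x, int y) {
    assert(x >= 0 && y >= 0);
    return (x + y) / 2;
}

// Rounds a non-negative value to the nearest multiple of 100 (halves go up).
int round_nearest_100(int v) {
    assert(v >= 0);
    return ((v + 50) / 100) * 100;
}

// Rounds a non-negative value down to a multiple of 100.
int round_down_100(int v) {
    assert(v >= 0);
    return (v / 100) * 100;
}

// Rounds a non-negative value up to a multiple of 100.
int round_up_100(int v) {
    assert(v >= 0);
    return ((v + 99) / 100) * 100;
}
```

For the inputs 75200 and 75300 the average is 75250; rounding to nearest or up gives 75300, and rounding down gives 75200.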
Your problematic example requires strong conditions:
the difference between x and y needs to be not greater than 100
y % 100 must be 0
So for most cases, a simple rounded average is perfect for you:
avg100 = avg - (avg % 100) + 100
The tricky part is fixing the remaining error without a condition - if you want to avoid conditions, or slow operations.
For this, the best way is to use a multiplication, and split the expression into two:
avg100 = avg - (avg % 100)
avg100 += 100 * !!(y - avg100)
In most cases, y is greater than avg100, and the !! operator will return 1. In the rare case when they are equal, it will return 0, and the value won't change.
(I don't know if the compiler will really generate code without conditions for the '!!' operator, but I don't have a better idea, and if it is possible, I think it will. If not, this code is still short and easy to understand.)
Also, you can calculate the average using the following expression:
avg = y - (y-x)/2
Or even change the division into bit shift for optimization.
This won't require for both of the numbers to be even, just to be the same parity.
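The alternative average can be sketched like this, assuming 0 <= x <= y so that y - x cannot overflow; midpoint_no_overflow is a hypothetical name.

```cpp
#include <cassert>

// Midpoint of x and y computed without forming x + y,
// assuming 0 <= x <= y so that y - x fits in an int.
int midpoint_no_overflow(int x, int y) {
    assert(0 <= x && x <= y);
    return y - (y - x) / 2;
}
```

This form also works near the top of the int range, where (x + y) / 2 would overflow. When x and y have different parity, the result is the exact midpoint rounded up.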