Is x/a the same as x*(1/a) for floats? - c++

With float a = ...; and float inva = 1/a; is x / a the same as x * inva?
And what about this case:
unsigned i = ...;
float v1 = static_cast<float>(i) / 4294967295.0f;
float scl = 1.0f / 4294967295.0f;
float v2 = static_cast<float>(i) * scl;
Is v1 equal to v2 for all unsigned integers?

is v1 equal to v2 for all unsigned integers?
Yes, because 4294967295.0f is a power of two: the value 4294967295 is not representable as a float, so the constant rounds to 4294967296.0f, which is 2^32. Division and multiplication by the reciprocal are equivalent when the divisor is a power of two (assuming the computation of the reciprocal does not overflow or underflow to zero).
Division and multiplication by the reciprocal are not equivalent in general, only in the particular case of powers of two. The reason is that for (almost all) powers of two y, the computation of 1 / y is exact, so that x * (1 / y) only rounds once, just like x / y only rounds once.
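As a minimal demonstration of both points (assuming strict IEEE-754 single precision with round-to-nearest; the divisor 3.0f below is just an arbitrary non-power-of-two):

#include <cstdio>

int main()
{
    // The constant 4294967295.0 is not representable in float and
    // rounds to 4294967296.0f == 2^32, a power of two.
    std::printf("%d\n", 4294967295.0f == 4294967296.0f);   // prints 1

    // For a non-power-of-two divisor, the two forms can differ,
    // because 1.0f / a is rounded before the multiplication.
    float a = 3.0f;
    float inva = 1.0f / a;
    float x = 7.0f;
    std::printf("%d\n", x / a == x * inva);   // prints 0 under strict float evaluation
}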

No, the result will not always be the same. The way you group the operands in floating-point multiplication (or division, in this case) affects the numerical accuracy of the answer, so the product a*(1/b) might differ from a/b. See the Wikipedia article http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems.

Related

Fast inverse square root using fixed point instead of floating point

I am trying to implement Fast Inverse Square Root for a fixed point number, but I'm not getting anywhere.
I am trying to follow exactly the same principle as in the article, except that instead of writing the number in the floating point format x = (-1)^s * (1 + M) * 2^(E-127), I am using the format x = M * 2^-16, which is a 32-bit fixed point number with 16 integer bits and 16 fractional bits.
The problem is that I cannot find the value of the "magic constant". According to my calculations, it doesn’t exist, but I’m not a mathematician and I think I’m doing everything wrong.
To solve Y = 1 / sqrt (x), I used the following reasoning (I don't know if it is correct).
In the original code, the Y0 for the Newton approximation step is given by:
i = 0x5f3759df - (i >> 1);
Which means that we will have as a result a floating point number given by:
y0 = (1 + R2 - M / 2) * 2 ^ (R1 - E / 2);
This is because the operation >> divides exponent and mantissa by 2, and then we perform a subtraction of the numbers as integers.
Following the steps shown in the article, I set the format of x to:
x = M * 2 ^ -16
In an attempt to perform the same logic, I try to define Y0 for:
Y0 = (R2 - M / 2) * 2 ^ (R1 - (-16/2));
I'm trying to find a number, which can minimize the error given by:
error = (Y - Y0) / Y
Regardless of the value of R1, I can use shift operations to correct the exponent of my final result, obtaining the correct result in fixed point.
Where am I wrong?
It can't be done.
The fast inverse sqrt relies on the floating-point representation, which has already split the number into a power of two (the exponent) and the significand.
It can be done.
With the same tricks as used for floating point, it's possible to convert your fixed-point number into the form 2^exp * x: given uint32_t a, compute int exp = bias - __builtin_clz(a); and a normalized significand uint32_t b = a << __builtin_clz(a);, with the constants (and the domain of a) chosen so carefully that there are no underflows or overflows.
Thus, you will actually have a custom floating-point representation, tailored for this specific purpose: it omits the sign bit, at least, and uses the best possible number of bits for the exponent, which might as well be 8.
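A minimal sketch of that normalization step for the Q16.16 format from the question (the names CustomFloat and normalize_q16_16 are illustrative, not from the original post; __builtin_clz is GCC/Clang-specific):

#include <cstdint>

struct CustomFloat {
    uint32_t m;   // normalized significand, top bit set: m in [2^31, 2^32)
    int e;        // value represented = (m / 2^31) * 2^e
};

// Split a positive Q16.16 value a (representing a * 2^-16) into 2^e * x form.
CustomFloat normalize_q16_16(uint32_t a)   // requires a != 0
{
    int lz = __builtin_clz(a);             // leading zeros, 0..31
    CustomFloat r;
    r.m = a << lz;                         // normalize: shift the top bit to bit 31
    r.e = 15 - lz;                         // a * 2^-16 == (r.m / 2^31) * 2^(15 - lz)
    return r;
}

From here, a fast-inverse-sqrt recipe can halve and negate e, and apply the magic-constant and Newton steps to m, exactly as with a real float's exponent and significand.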

Finding the maximum of a floating point counter

My apologies if this has been asked before, but I cannot find it.
I was wondering if there is a way to calculate the point at which a single precision floating point number that is used as a counter will reach a 'maximum' (the point at which it is no longer able to add another value due to loss of precision).
For example, if I continuously add 0.1f to a float I will eventually reach a point where the value does not change:
const float INCREMENT = 0.1f;
float value = INCREMENT;
float prevVal = 0.0f;
do {
prevVal = value;
value += INCREMENT;
} while (value != prevVal);
cout << value << endl;
On GCC this outputs 2.09715e+06
Is there a way to compute this mathematically for different values of INCREMENT? I believe it should in theory happen when the exponents of the counter and the increment differ by more than 23 bits, so that the increment's mantissa is shifted out entirely and effectively 0 is added.
Given some positive y used as an increment, the smallest X for which adding y does not produce a result greater than X is the least power of 2 not less than y divided by half the “epsilon” of the floating-point format. It can be calculated by:
Float Y = y*2/std::numeric_limits<Float>::epsilon();
int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);
if (X < Y) X *= 2;
A proof follows. I assume IEEE-754 binary floating-point arithmetic using round-to-nearest-ties-to-even.
When two numbers are added in IEEE-754 floating-point arithmetic, the result is the exact mathematical result rounded to the nearest representable value in a selected direction.
A note about notation: text in source code format (like x+y) represents floating-point values and operations; other text is mathematical. So the mathematical expression x+y is the exact sum of x and y, while the code expression x+y is the result of adding x and y in a floating-point operation. Also, I will use Float for the floating-point type in C++.
Given a floating-point number x, consider adding a positive value y using floating-point arithmetic, x+y. Under what conditions will the result exceed x?
Let x1 be the next value greater than x representable in the floating-point format, and let xm be the midpoint between x and x1. If the mathematical value of x+y is less than xm, then the floating-point calculation x+y rounds down, so it produces x. If x+y is greater than xm, either it rounds up and produces x1, or it produces some greater number because y is large enough to move the sum beyond x1. If x+y equals xm, the result is whichever of x or x1 has an even low digit. For reasons we will see, this is always x in the situations relevant to this question, so the calculation rounds down.
Therefore, x+y produces a result greater than x if and only if x+y exceeds xm, meaning that y exceeds half the distance from x to x1. Note that the distance from x to x1 is the value of 1 in the low digit of the significand of x.
In a binary floating-point format with p digits in its significand, the position value of the low digit is 2^(1−p) times the position value of the high digit. For example, if x is 2^e, the highest bit in its significand represents 2^e, and the lowest bit represents 2^(e+1−p).
The question asks, given a y, what is the least x for which x+y does not produce a result greater than x? It is the least x for which y does not exceed half the value of the low digit of the significand of x.
Let 2^e be the position value of the high bit of the significand of x. Then y ≤ ½·2^(e+1−p) = 2^(e−p), so y·2^p ≤ 2^e.
Therefore, given some positive y, the least x for which x+y does not produce a result greater than x has its leading bit, 2^e, equal to or exceeding y·2^p. And in fact it must be exactly 2^e, because all other floating-point numbers whose leading bit has position value 2^e have other bits set in their significands, so they are greater; 2^e is the least number whose leading bit represents 2^e.
Therefore, x is the least power of two that equals or exceeds y·2^p.
In C++, std::numeric_limits<Float>::epsilon() (from the <limits> header) is the step from 1 to the next representable value, meaning it is 2^(1−p). So y·2^p equals y*2/std::numeric_limits<Float>::epsilon(). (This operation is exact unless it overflows to ∞.)
Let’s assign this to a variable:
Float Y = y*2/std::numeric_limits<Float>::epsilon();
We can find the position value represented by the highest bit of Y’s significand by using frexp (from the <cmath> header) to extract the exponent from the floating-point representation of Y and ldexp (also <cmath>) to apply that exponent to a new significand (.5 because of the scale that frexp and ldexp use):
int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);
Then X is a power of two, and it is less than or equal to Y. It is in fact the greatest power of two not greater than Y, since the next greater power of 2, 2X, is greater than Y. However, we want the least power of two not less than Y. We can find this with:
if (X < Y) X *= 2;
The resulting X is the number sought by the question.
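As a quick sanity check, here is the recipe instantiated for float and y = 0.1f (a hypothetical wrapper, not part of the answer itself); it prints 2.09715e+06, matching the loop in the question:

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    using Float = float;
    Float y = 0.1f;
    Float Y = y*2/std::numeric_limits<Float>::epsilon();
    int e;
    std::frexp(Y, &e);
    Float X = std::ldexp(Float(.5), e);
    if (X < Y) X *= 2;
    std::cout << X << std::endl;   // prints 2.09715e+06
}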
Marek's Answer is pretty close, and a decent way to find it using a program (that is more efficient than the one I originally posted). However, I don't necessarily need the answer in a program form, just a mathematical one.
From what I can tell, the answer comes down to the exponent of the delta used and the number of mantissa bits. First we round the delta up to a power of 2, which is kind of complicated: basically, if the mantissa is 0 we do nothing, otherwise we add 1 to the exponent. So, assuming we now have the delta as a power of 2, represented as 1.0 x 2^exp, and a mantissa of N bits, the maximum value is 1.0 x 2^(N + exp + 1). Note that FLT_EPSILON in C is equal to 1.0 x 2^-N. So we can also find this by dividing twice our rounded power of 2 by FLT_EPSILON.
For a delta of 0.1, the rounded-up power of 2 is 0.125, or 1.0 x 2^-3. Therefore we want 1.0 x 2^(23 + (-3) + 1), or 1.0 x 2^21, which is equal to 2097152.
Yes it is possible.
There is std::numeric_limits<T>::epsilon(), which is defined as the difference between 1.0 and the next representable value.
Using this you can calculate this limit for any number.
In C there is DBL_EPSILON.
So in your case this goes like this:
#include <cmath>
#include <limits>
#include <type_traits>

template<class T>
auto maximumWhenAdding(T delta) -> T
{
    static_assert(std::is_floating_point_v<T>, "Works only for floating points.");

    int power2 = std::ilogb(delta);                // exponent of delta
    T roundedDelta = std::ldexp(T{1.0}, power2);   // largest power of two <= delta
    if (roundedDelta != delta) {
        roundedDelta *= 2;                         // round up to the next power of two
    }
    return 2 * roundedDelta / std::numeric_limits<T>::epsilon();
}
live example C++
Note that in the live example, adding delta fails to increase maxForDelta, but subtracting succeeds, so this is exactly what you need.
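A hypothetical usage (not part of the original answer), which agrees with the loop output from the question:

#include <iostream>

int main()
{
    std::cout << maximumWhenAdding(0.1f) << std::endl;   // prints 2.09715e+06
}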

OpenCL kernel float division gives different result

I have an OpenCL kernel for some computation. I found that only one thread gives a different result from the CPU code. I am using vs2010 x64 release mode.
By checking the OpenCL code with some examples, I found some interesting results. Here are the test examples in kernel code.
I tested 3 cases in the OpenCL kernel; the precision is checked with printf("%.10f", fval);
case 1:
float fval = (10296184.0) / (float)(x*y*z); // which gives result fval = 3351.6225585938
float fval = (10296184.0f) / (float)(x*y*z); // which gives result fval = 3351.6225585938
The variables are int x, y, z;
their values are computed by some earlier operations, and in this case x=12, y=16, z=16.
case 2:
float fval = (10296184.0) / (float)(12*16*16); // which gives result fval = 3351.6223144531
float fval = (10296184.0f) / (float)(12*16*16); // which gives result fval = 3351.6223144531
case 3:
However, when I compute the difference of fval between the above two expressions, the result is 0 when using 10296184.0, but not when using 10296184.0f:
float fval = (10296184.0) / (float)(x*y*z) - (10296184.0) / (float)(12*16*16); // which gives result fval = 0.0000000000
float fval = (10296184.0f) / (float)(x*y*z) - (10296184.0f) / (float)(12*16*16); // which gives result fval = 0.0001812663
Could anyone explain the reason or give me some hints?
Some observations:
The two float values differ by 1 ULP, so the results differ by the minimum possible amount.
// Float ULP in the 2's place here
// v
0x1.a2f3ea0000000p+11 3351.622314... // OP's lower float value
0x1.a2f3eaaaaaaabp+11 3351.622395... // higher precision quotient
0x1.a2f3ec0000000p+11 3351.622558... // OP's higher float value
(10296184.0) / (float)(12*16*16) is calculated at compile time, and it is the result closer to the exact mathematical answer.
float fval = (10296184.0) / (float)(x*y*z) is calculated at run time.
Considering that float variables are being used, it is surprising that the code does this division with double math: a double constant is divided by a double (the promotion of the float product), giving a double quotient that is converted to a float and then saved. I'd expect 10296184.0f - note the f - to have been used; then the math could all have been done as floats.
C allows different rounding modes, denoted by FLT_ROUNDS. These may differ between compile time and run time, and that may explain the difference. Knowing the result of fegetround() (the function that gets the current rounding direction) would help.
OP may have employed various compiler optimizations that sacrifice precision for speed.
C does not specify the precision of math operations, yet good-to-the-last-ULP results should be expected for * / + - sqrt() modf() on quality platforms. I suspect the code suffers from a weak math implementation.
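To make such 1-ULP gaps visible directly, printing the values in hexadecimal floating point is handy. A minimal host-side sketch (the two constants are simply the values reported in the question):

#include <cstdio>

int main()
{
    float lower  = 3351.6223144531f;         // OP's lower value
    float higher = 3351.6225585938f;         // OP's higher value
    std::printf("%a\n", lower);              // 0x1.a2f3eap+11
    std::printf("%a\n", higher);             // 0x1.a2f3ecp+11
    std::printf("%.10f\n", higher - lower);  // 0.0002441406, i.e. 1 ULP at this magnitude
}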

What is the optimum epsilon/dx value to use within the finite difference method?

double MyClass::dx = ?????;
double MyClass::f(double x)
{
return 3.0*x*x*x - 2.0*x*x + x - 5.0;
}
double MyClass::fp(double x) // derivative of f(x), that is f'(x)
{
return (f(x + dx) - f(x)) / dx;
}
When using the finite difference method for differentiation, it is critical to choose an optimum dx value. Mathematically, dx should be as small as possible. However, I'm not sure whether it is correct to choose the smallest positive normalized double precision number (i.e., 2.2250738585072014 x 10^-308).
Is there an optimal numeric interval or exact value to choose a dx in to make the calculation error as small as possible?
(I'm using 64-bit compiler. I will run my program on a Intel i5 processor.)
Choosing the smallest possible value is almost certainly wrong: if dx were that smallest number, then f(x + dx) would be exactly equal to f(x) due to rounding.
So you have a tradeoff: Choose dx too small, and you lose precision to rounding errors. Choose it too large, and your result will be imprecise due to changes in the derivative as x changes.
To judge the numeric errors, consider (f(x + dx) - f(x)) / f(x) [1] mathematically. The numerator is the difference you want to compute, while the denominator indicates the magnitude of the numbers you're dealing with. If that fraction is about 2^-k, then you can expect to lose about k bits of precision in your result.
If you know your function, you can compute what error you'd get from choosing dx too large. You can then balance things so that the error incurred from this is about the same as the error incurred from rounding. But if you know the function, you might be better off providing a function that directly computes the derivative, as in your example with the polynomial f.
The Wikipedia section that pogorskiy pointed out suggests a value of sqrt(ε)x, or approximately 1.5e-8 * x. Without any more detailed knowledge about the function, such a rule of thumb will provide a reasonable default. Also note that that same section suggests not dividing by dx, but instead by (x + dx) - x, as this takes rounding errors incurred by computing x + dx into account. But I guess that whole article is full of suggestions you might use.
[1] This formula really should divide by f(x), not by dx, even though a past editor thought differently. I'm comparing the number of significant bits remaining after the division, not the slope of the tangent.
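A short sketch of this rule of thumb applied to the question's function (the free functions stand in for the MyClass members, and it assumes x != 0, since dx would otherwise collapse to zero):

#include <cmath>
#include <limits>

double f(double x)
{
    return 3.0*x*x*x - 2.0*x*x + x - 5.0;
}

double fp(double x)   // forward-difference approximation of f'(x)
{
    double dx = std::sqrt(std::numeric_limits<double>::epsilon()) * x;  // ~1.5e-8 * x
    double xph = x + dx;
    return (f(xph) - f(x)) / (xph - x);   // divide by the actual step, per Wikipedia
}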
Why not just use the power rule to compute the derivative? You'll get an exact answer:
f(x) = 3x^3 - 2x^2 + x - 5
f'(x) = 9x^2 - 4x + 1
Therefore:
f(x) = 3.0 * x * x * x - 2.0 * x * x + x - 5.0
fp(x) = 9.0 * x * x - 4.0 * x + 1.0

Unsigned long long arithmetic into double

I'm making a function that takes in 3 unsigned long longs, and applies the law of cosines to find out if the triangle is obtuse, acute or a right triangle. Should I just cast the variables to doubles before I use them?
void triar( unsigned long long& r,
unsigned long long x,
unsigned long long y,
unsigned long long z )
{
if(x==0 || y==0 || z==0) die("invalid triangle sides");
double t=(x*x + y*y -z*z)/(2*x*y);
t=acos (t) * (180.0 / 3.14159265);
if(t > 90) {
cout<<"Obtuse Triangle"<<endl;
r=t;
} else if(t < 90){
cout<<"Acute Triangle"<<endl;
r=t;
} else if(t == 90){
cout<<"Right Traingle"<<endl;
r=t;
}
}
There is generally no reason why you could not cast if you need floating-point arithmetic. However, there is also an implicit conversion from unsigned long long to double, so you can often do completely without casting.
In many cases, including yours, you can cast just one of the arguments to force double arithmetic on a particular operation only. For example,
double t = (double)(x*x + y*y - z*z) / (2*x*y);
This way, all operations except the division are computed in integer arithmetic and are therefore slightly faster. The cast is still necessary to avoid truncation during the division.
Your code contains a comparison of floating-point results. Floating-point arithmetic, however, almost inevitably reduces accuracy, so either avoid limited accuracy, or analyze and control the accuracy.
Prefer an integer-only solution, as described in an excellent sister answer, if you have a wide enough integral type at your disposal.
Always avoid conversion from radians to degrees except for presentation to humans.
Take the value of π from your mathematical library header files (unfortunately, this is platform dependent - try _USE_MATH_DEFINES + M_PI or, if already using boost libraries, boost::math::constants::pi<double>()), or express it analytically; for example, std::atan(1)*2 is the right angle (see the snippet after this list).
If you choose double precision, and the ultimate difference value is less than, say, std::numeric_limits<double>::min() * 8, you can probably not tell anything about the triangle and the classification you return is basically bogus. (I made up the value of 8, you will possibly lose way more bits than three.)
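A small illustration of the point about π above (a hypothetical fragment, not from the answer itself):

#define _USE_MATH_DEFINES   // needed before <cmath> for M_PI on MSVC
#include <cmath>

const double pi = std::atan(1.0) * 4;            // pi, with no hard-coded digits
const double right_angle = std::atan(1.0) * 2;   // pi/2, as suggested above
// Alternatively, where provided: M_PI, or boost::math::constants::pi<double>().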
You have a problem with obtuse triangles: x*x + y*y - z*z would mathematically give a negative result, which is then reduced modulo 2^WIDTH (where WIDTH is the number of value bits in unsigned long long, at least 64 and probably exactly that), yielding a - probably large - positive value (or in rare cases 0). Then the computed result of t = (x*x + y*y - z*z)/(2*x*y) can be larger than 1, and acos(t) would return a NaN.
The correct way to find out whether the triangle is obtuse/acute/right-angled with the given argument type is to check whether x*x + y*y < /* > / == */ z*z - if you can be sure the mathematical results don't exceed the unsigned long long range.
If you can't be sure of that, you can either convert the variables to double before the computation,
double xd = x, yd = y, zd = z;
double t = (xd*xd + yd*yd - zd*zd)/(2*xd*yd);
with possible loss of precision and incorrect results for nearly right-angled triangles (e.g. for the slightly obtuse triangle x = 2^29, y = 2^56-1, z = 2^56+2, both y and z would be converted to 2^56 with standard 64-bit doubles, xd*xd + yd*yd = 2^58 + 2^112 would be evaluated to 2^112, subtracting zd*zd then results in 0).
Or you can compare x*x + y*y to z*z - or x*x to z*z - y*y - using only integer arithmetic. If x*x is representable as an unsigned long long (I assume that 0 < x <= y <= z), it's relatively easy: first check whether (z - y)*(z + y) would exceed ULLONG_MAX; if yes, the triangle is obtuse; otherwise calculate and compare (see the sketch below). If x*x is not representable, it becomes complicated; I think the easiest way (except for using a big integer library, of course) would be to compute the high and, if necessary, low 64 (or whatever width unsigned long long has) bits separately by splitting the numbers at half the width, and compare those.
Further note: your value for π, 3.14159265, is too inaccurate; right-angled triangles will be reported as obtuse.
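A minimal sketch of the easy case described above (the function name is mine; it assumes 0 < x <= y <= z, that x*x does not overflow, and additionally that y + z does not overflow):

#include <climits>

// Returns -1 for obtuse, 0 for right-angled, +1 for acute,
// by comparing x*x against z*z - y*y in pure integer arithmetic.
int classify(unsigned long long x, unsigned long long y, unsigned long long z)
{
    unsigned long long diff = z - y;             // z >= y, so no wraparound
    unsigned long long sum  = z + y;             // assumed not to overflow
    if (diff != 0 && diff > ULLONG_MAX / sum)    // (z-y)*(z+y) would exceed ULLONG_MAX
        return -1;                               // then z*z - y*y > x*x: obtuse
    unsigned long long zzyy = diff * sum;        // z*z - y*y, computed exactly
    unsigned long long xx = x * x;               // assumed representable
    if (xx < zzyy) return -1;                    // x*x + y*y < z*z: obtuse
    if (xx > zzyy) return 1;                     // x*x + y*y > z*z: acute
    return 0;                                    // right triangle
}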