Fast inverse square root using fixed point instead of floating point - c++

I am trying to implement Fast Inverse Square Root for a fixed point number, but I'm not getting anywhere.
I am trying to follow exactly the same principle as the article, except instead of writing the number in the floating point format x = (-1) ^ s * (1 + M) * 2 ^ (E-127), I am using the format x = M * 2 ^ -16, which is a 32-bit fixed point number with 16 integer bits and 16 fractional bits.
The problem is that I cannot find the value of the "magic constant". According to my calculations, it doesn’t exist, but I’m not a mathematician and I think I’m doing everything wrong.
To solve Y = 1 / sqrt (x), I used the following reasoning (I don't know if it is correct).
In the original code, the initial guess Y0 for Newton's method is given by:
i = 0x5f3759df - (i >> 1);
Which means that we will have as a result a floating point number given by:
y0 = (1 + R2 - M / 2) * 2 ^ (R1 - E / 2);
This is because the operation >> divides exponent and mantissa by 2, and then we perform a subtraction of the numbers as integers.
Following the steps shown in the article, I set the format of x to:
x = M * 2 ^ -16
In an attempt to perform the same logic, I try to define Y0 for:
Y0 = (R2 - M / 2) * 2 ^ (R1 - (-16/2));
I'm trying to find a number, which can minimize the error given by:
error = (Y - Y0) / Y
Regardless of the value of R1, I can do shift operations to correct the exponent value of my final result, having the correct result at a fixed point.
Where am I wrong?
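For reference, the floating-point routine the question starts from can be sketched as below. This is a minimal version assuming float is a 32-bit IEEE-754 value; the function name fastInvSqrt is chosen here for illustration, and memcpy replaces the pointer cast of the well-known original to avoid undefined behaviour.

```cpp
#include <cstdint>
#include <cstring>

// Classic floating-point fast inverse square root, shown for reference.
// Assumption: float is a 32-bit IEEE-754 binary32 value.
inline float fastInvSqrt(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);          // read the bit pattern of x
    i = 0x5f3759dfu - (i >> 1);             // integer trick that produces the initial guess
    float y;
    std::memcpy(&y, &i, sizeof y);          // reinterpret the bits as a float again
    return y * (1.5f - 0.5f * x * y * y);   // one Newton-Raphson refinement step
}
```

The subtraction only works because the exponent and the mantissa sit in separate bit fields of i, which is exactly what a plain fixed-point word lacks.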

It can't be done.
The fast inverse square root relies on the floating-point representation, which has already split the number into a power of two (the exponent) and the significand.
It can be done.
With the same tricks as used for floating point, it's possible to convert your fixed-point number into 2^exp * x. Given uint32_t a, compute uint8_t exp = bias - builtin_count_leading_zeros(a); uint32_t b = a << exp, with the constants (and the domain of a) chosen carefully enough that there will be no underflows or overflows.
Thus, you will actually have a custom floating point representation, which is tailored for this specific purpose, omitting the sign bit at least and having the best possible number of bits for the exponent, which might as well be 8.
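One way to act on this idea is sketched below. It is not a magic-constant method: it uses a leading-zero count to find the exponent of the Q16.16 word, seeds the result with a power of two from half that exponent, and refines it with Newton-Raphson in 64-bit integer arithmetic. The names countLeadingZeros32 and invSqrtQ16, the seed choice, and the iteration count of 7 are all assumptions made for this sketch.

```cpp
#include <cstdint>

// Portable leading-zero count for a nonzero 32-bit value
// (a compiler builtin could replace this loop).
inline int countLeadingZeros32(std::uint32_t v) {
    int n = 0;
    while ((v & 0x80000000u) == 0) { v <<= 1; ++n; }
    return n;
}

// Hypothetical helper: 1/sqrt(x) for an unsigned Q16.16 input, Q16.16 output.
// Precondition: a != 0.
inline std::uint32_t invSqrtQ16(std::uint32_t a) {
    // On the raw integers the target is y = 2^24 / sqrt(a), i.e. a * y^2 = 2^48.
    int e = 31 - countLeadingZeros32(a);              // a lies in [2^e, 2^(e+1))
    int k = (e + 1) / 2;                              // about half the exponent
    std::uint64_t y = std::uint64_t(1) << (24 - k);   // seed within a factor of 2 of the result
    const std::uint64_t three = std::uint64_t(3) << 48;
    for (int i = 0; i < 7; ++i) {
        // Newton-Raphson step y <- y * (3 - a*y^2) / 2, rescaled by 2^48.
        std::uint64_t t = (three - a * y * y) >> 24;
        y = (y * t) >> 25;
    }
    return static_cast<std::uint32_t>(y);
}
```

For example, invSqrtQ16(0x00040000) (the value 4.0) returns 0x00008000 (the value 0.5). Results are truncated, so they can be a unit or two below the exact value.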

Related

Finding the maximum of a floating point counter

My apologies if this has been asked before, but I cannot find it.
I was wondering if there is a way to calculate the point at which a single precision floating point number that is used as a counter will reach a 'maximum' (the point at which it is no longer able to add another value due to loss of precision).
For example, if I continuously add 0.1f to a float I will eventually reach a point where the value does not change:
const float INCREMENT = 0.1f;
float value = INCREMENT;
float prevVal = 0.0f;
do {
prevVal = value;
value += INCREMENT;
} while (value != prevVal);
cout << value << endl;
On GCC this outputs 2.09715e+06
Is there a way to compute this mathematically for different values of INCREMENT? I believe it should in theory be when the exponent portion of the float requires a shift beyond 23 bits, resulting in losing the mantissa and simply adding 0.
Given some positive y used as an increment, the smallest X for which adding y does not produce a result greater than X is the least power of 2 not less than y divided by half the “epsilon” of the floating-point format. It can be calculated by:
Float Y = y*2/std::numeric_limits<Float>::epsilon();
int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);
if (X < Y) X *= 2;
A proof follows. I assume IEEE-754 binary floating-point arithmetic using round-to-nearest-ties-to-even.
When two numbers are added in IEEE-754 floating-point arithmetic, the result is the exact mathematical result rounded to the nearest representable value in a selected direction.
A note about notation: Text in source code format represents floating-point values and operations. Other text is mathematical. So x+y is the exact mathematical sum of x and y, x is x in floating-point format, and x+y is the result of adding x and y in a floating-point operation. Also, I will use Float for the floating-point type in C++.
Given a floating-point number x, consider adding a positive value y using floating-point arithmetic, x+y. Under what conditions will the result exceed x?
Let x1 be the next value greater than x representable in the floating-point format, and let xm be the midpoint between x and x1. If the mathematical value of x+y is less than xm, then the floating-point calculation x+y rounds down, so it produces x. If x+y is greater than xm, either it rounds up and produces x1, or it produces some greater number because y is large enough to move the sum beyond x1. If x+y equals xm, the result is whichever of x or x1 has an even low digit. For reasons we will see, this is always x in the situations relevant to this question, so the calculation rounds down.
Therefore, x+y produces a result greater than x if and only if x+y exceeds xm, meaning that y exceeds half the distance from x to x1. Note that the distance from x to x1 is the value of 1 in the low digit of the significand of x.
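The half-step rule can be checked directly. The sketch below assumes IEEE-754 single precision with round-to-nearest-ties-to-even; halfUlpAbove and additionMoves are names made up for this illustration.

```cpp
#include <cmath>
#include <limits>

// Half the distance from x to the next representable float above it.
inline float halfUlpAbove(float x) {
    float next = std::nextafter(x, std::numeric_limits<float>::infinity());
    return (next - x) / 2;
}

// True when the floating-point sum x + y is strictly greater than x.
inline bool additionMoves(float x, float y) {
    volatile float sum = x + y;   // volatile discourages extra intermediate precision
    return sum > x;
}
```

At x = 2097152 the half step is 0.125, so adding 0.1f leaves x unchanged, adding 0.125f lands on the tie and rounds back to x, and adding 0.13f moves it.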
In a binary floating-point format with p digits in its significand, the position value of the low digit is 2^(1−p) times the position value of the high digit. For example, if x is 2^e, the highest bit in its significand represents 2^e, and the lowest bit represents 2^(e+1−p).
The question asks, given a y, what is the least x for which x+y does not produce a result greater than x? It is the least x for which y does not exceed half the value of the low digit of the significand of x.
Let 2^e be the position value of the high bit of the significand of x. Then y ≤ ½•2^(e+1−p) = 2^(e−p), so y•2^p ≤ 2^e.
Therefore, given some positive y, the least x for which x+y does not produce a result greater than x has its leading bit, 2^e, equal to or exceeding y•2^p. And in fact it must be exactly 2^e because all other floating-point numbers whose leading bit has position value 2^e have other bits set in their significands, so they are greater. 2^e is the least number for which the leading bit represents 2^e.
Therefore, x is the least power of two that equals or exceeds y•2^p.
In C++, std::numeric_limits<Float>::epsilon() (from the <limits> header) is the step from 1 to the next representable value, meaning it is 2^(1−p). So y•2^p equals y*2/std::numeric_limits<Float>::epsilon(). (This operation is exact unless it overflows to ∞.)
Let’s assign this to a variable:
Float Y = y*2/std::numeric_limits<Float>::epsilon();
We can find the position value represented by the highest bit of Y’s significand by using frexp (from the <cmath> header) to extract the exponent from the floating-point representation of Y and ldexp (also <cmath>) to apply that exponent to a new significand (.5 because of the scale that frexp and ldexp use):
int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);
Then X is a power of two, and it is less than or equal to Y. It is in fact the greatest power of two not greater than Y, since the next greater power of 2, 2X, is greater than Y. However, we want the least power of two not less than Y. We can find this with:
if (X < Y) X *= 2;
The resulting X is the number sought by the question.
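Collected into one function, the steps above might look like the following sketch. The name leastStallingValue is made up here; the body is the answer's own code with Float turned into a template parameter.

```cpp
#include <cmath>
#include <limits>

// Least X for which the floating-point sum X + y is not greater than X.
// Assumes IEEE-754 binary arithmetic with round-to-nearest-ties-to-even
// and a positive, finite y.
template <typename Float>
Float leastStallingValue(Float y) {
    Float Y = y * 2 / std::numeric_limits<Float>::epsilon();  // y * 2^p
    int e;
    std::frexp(Y, &e);                      // Y = f * 2^e with f in [0.5, 1)
    Float X = std::ldexp(Float(.5), e);     // greatest power of two not above Y
    if (X < Y) X *= 2;                      // least power of two not below Y
    return X;
}
```

For the increment in the question, leastStallingValue(0.1f) gives 2097152, which matches the 2.09715e+06 printed by the loop.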
Marek's Answer is pretty close, and a decent way to find it using a program (that is more efficient than the one I originally posted). However, I don't necessarily need the answer in a program form, just a mathematical one.
From what I can tell, the answer comes down to the exponent of the delta used, and the number of mantissa bits. We need to round the delta up to a power of 2, which is kind of complicated. Basically if the mantissa is 0, we do nothing, otherwise we add 1 to the exponent. So, assuming we now have the delta as a power of 2, represented as 1.0 x 2^exp, and a mantissa of N bits, the maximum value is 1.0 x 2^(N + exp). Note that FLT_EPSILON in C is equal to 1.0 x 2^-N. So we can also find this by dividing our rounded-up power of 2 by FLT_EPSILON.
For a delta of 0.1, the next power of 2 at or above it is 0.125, or 1.0 x 2^-3. Therefore we want 1.0 x 2^(23 + (-3)) or 1.0 x 2^21, which is equal to 2097152.
Yes it is possible.
There is std::numeric_limits<T>::epsilon(), which is the difference between 1.0 and the next representable value above it.
Using this you can calculate this limit for any number.
In C there is DBL_EPSILON
So in your case this goes like this:
#include <cmath>
#include <limits>
#include <type_traits>

template<class T>
auto maximumWhenAdding(T delta) -> T
{
    static_assert(std::is_floating_point_v<T>, "Works only for floating points.");
    int power2 = std::ilogb(delta);
    // Largest power of two not exceeding delta.
    T roundedDelta = std::ldexp(T { 1.0 }, power2);
    if (roundedDelta != delta) {
        roundedDelta *= 2;  // round up to the next power of two
    }
    return 2 * roundedDelta / std::numeric_limits<T>::epsilon();
}
live example C++
Note that in the live example's tests, adding delta fails to increase maxForDelta, but subtracting it succeeds, so this is exactly the value you need.

Order of operations to maximize precision

I'm using floats for these operations:
Which of these two is more precise?
(a * b) / c
or
(a / c) * b
Does it matter at all or does it depend on the situation? If so, which should I choose in what cases?
Really, if you don't use double then you are misguided, and you don't care about precision.
Otherwise, you get the best error bounds if the first result is slightly lower than the next higher power of two. For example, calculating (pi * e) / sqrt (2), you get the best error bounds by calculating (e / sqrt (2)) * pi, because e / sqrt (2) ≈ 1.922 is just below 2. Results just below the next higher power of two have a lower relative error.
For addition and subtraction of a large number of items, it's best to first subtract items of equal magnitude and opposite sign (x - y is calculated exactly if y/2 ≤ x ≤ 2y), and otherwise to combine the numbers that give the smallest possible results.
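To see how the two orderings behave for particular inputs, one option is to measure both against a double-precision reference. The helper names below are made up for this sketch, and it assumes float arithmetic is really carried out in single precision.

```cpp
#include <cmath>

// Relative error of a float result against a double-precision reference.
inline double relativeError(float computed, double reference) {
    return std::fabs((static_cast<double>(computed) - reference) / reference);
}

// (a * b) / c evaluated entirely in float.
inline float multiplyThenDivide(float a, float b, float c) {
    float product = a * b;
    return product / c;
}

// (a / c) * b evaluated entirely in float.
inline float divideThenMultiply(float a, float b, float c) {
    float quotient = a / c;
    return quotient * b;
}

// Reference value computed in double from the same float inputs.
inline double referenceValue(float a, float b, float c) {
    return static_cast<double>(a) * b / c;
}
```

Each float operation contributes a relative error of at most 2^-24, so away from overflow and underflow both orderings stay within roughly 1.2e-7 of the reference. Which one is smaller depends on the inputs.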

How to approximate Euclidean distance on the integer plane, without overflow?

I'm working on a platform that has only integer arithmetic. The application uses geographic information, and I'm representing points by (x, y) coordinates where x and y are distances measured in meters.
As an approximation, I want to compute the Euclidean distance between two points. But to do this I have to square distances, and with 32-bit integers, the largest distance I can represent is 32 kilometers. Not good.
My needs are more on the order of 1000 kilometers. But I'd like to be able to resolve distances on a scale smaller than 30 meters.
Hence my question: how can I compute Euclidean distance, using only integer arithmetic, without overflow, on distances whose squares don't fit in a single word?
ETA: I would like to be able to compute distances, but I might settle for being able to compare them.
Perhaps comparing the octagonal distance approximation would be sufficient?
Slightly more up to date is this article on fast approximate distance functions.
I would recommend using fixed point calculation with integers; the distance approximation is then not too complicated.
fixed point calculation
distance approximation
Fast Approximate Distance Functions by Rafael Baptista
First step is to choose some fixed point representation for our needs:
For example, if we need a range of 1000 km with 1 m resolution, we can use 20 bits for the integer part, since 2^20 = 1,048,576. That leaves around 10 bits for the fraction.
Then we need to implement the approximation we choose:
For example in case we select the following approximation:
h ≈ b (1 + 0.337 (a/b)) = b + 0.337 a AND assuming 0 ≤ a ≤ b
We will implement as follows:
int32_t dx = (x1 > x2 ? x1 - x2 : x2 - x1);
int32_t dy = (y1 > y2 ? y1 - y2 : y2 - y1);
int32_t a = dx > dy ? dy : dx;
int32_t b = dx > dy ? dx : dy;
int32_t h = b + (345 * a >> 10); /* 345.088 = 0.337 * 2^10 */
About overflow:
Adding two positive <+20.0> numbers gives at most a <+21.0> number. That is OK.
The multiplication is also safe as long as the multiplier is in the range -1..1, because the result then stays in the range of the other operand. In our case <+20.0> * <+0.10> gives a <+20.10> number, which we convert back to <+20.0>.
There is one step here that needs attention: during the multiplication we temporarily hold a <+20.10> number, which is already close to our 32-bit limit.
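Wrapped into a function, the approximation could look like this sketch. The name approxDistance is made up here, and the comment states the range assumption that keeps every intermediate inside 32 bits.

```cpp
#include <cstdint>

// Approximate distance h = b + 0.337 * a, with a the smaller and b the larger
// of |dx| and |dy|.
// Assumption: all coordinates lie within +/- 2^20 metres, so 345 * a stays
// below 2^31 and nothing overflows a 32-bit signed integer.
inline std::int32_t approxDistance(std::int32_t x1, std::int32_t y1,
                                   std::int32_t x2, std::int32_t y2) {
    std::int32_t dx = (x1 > x2 ? x1 - x2 : x2 - x1);
    std::int32_t dy = (y1 > y2 ? y1 - y2 : y2 - y1);
    std::int32_t a = dx > dy ? dy : dx;
    std::int32_t b = dx > dy ? dx : dy;
    return b + ((345 * a) >> 10);   // 345 / 2^10 is about 0.337
}
```

For a 3-4-5 triangle this returns 5. On the diagonal it is least accurate: approxDistance(0, 0, 1000000, 1000000) gives 1336914 against a true distance of about 1414214.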
Exact calculation
We can also calculate the exact distance using the following consideration:
h = b sqrt(1 + (a/b)^2) AND assuming 0 < b ≤ a
In this case we also need to calculate the square root:
square root
If a/b is still significantly larger than one, or too large for its square to be calculated, we can simplify the calculation to:
h = a
See the implementation here
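Since the linked implementation is not reproduced here, the following is an alternative sketch rather than that code. It does not use the ratio form above: it swaps in a 64-bit sum of squares and a bitwise integer square root, which assumes the compiler offers a 64-bit integer type (possibly emulated) even on a 32-bit target. The names isqrt64 and exactDistance are made up for illustration.

```cpp
#include <cstdint>

// Integer square root (floor) computed bit by bit, using only shifts,
// additions and comparisons.
inline std::uint32_t isqrt64(std::uint64_t n) {
    std::uint64_t result = 0;
    std::uint64_t bit = std::uint64_t(1) << 62;   // highest power of four in 64 bits
    while (bit > n) bit >>= 2;
    while (bit != 0) {
        if (n >= result + bit) {
            n -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return static_cast<std::uint32_t>(result);
}

// Euclidean distance rounded down to a whole unit.
inline std::uint32_t exactDistance(std::int32_t x1, std::int32_t y1,
                                   std::int32_t x2, std::int32_t y2) {
    std::int64_t dx = static_cast<std::int64_t>(x1) - x2;
    std::int64_t dy = static_cast<std::int64_t>(y1) - y2;
    return isqrt64(static_cast<std::uint64_t>(dx * dx + dy * dy));
}
```

With coordinates in metres, exactDistance(0, 0, 600000, 800000) returns 1000000, a case whose squares do not fit in 32 bits.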
I would leave the square root out of play, so that I can approximate the Euclidean distance. However, when comparing distances, this approach gives you 100% accuracy, since the comparison would be the same if you squared the distances.
I am pretty sure about that, since I have used that approach when searching for nearest neighbours in high dimensional spaces. You can check my code and the theory in kd-GeRaF.
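A comparison on squared distances can be sketched as follows. It assumes a 64-bit integer type is available for the squares, and the function names are made up for this example.

```cpp
#include <cstdint>

// Squared Euclidean distance, held in 64 bits so the squares cannot overflow
// for 32-bit coordinates in the kilometre ranges discussed here.
inline std::int64_t squaredDistance(std::int32_t x1, std::int32_t y1,
                                    std::int32_t x2, std::int32_t y2) {
    std::int64_t dx = static_cast<std::int64_t>(x1) - x2;
    std::int64_t dy = static_cast<std::int64_t>(y1) - y2;
    return dx * dx + dy * dy;
}

// True when point A is strictly closer to P than point B is.
// No square root is needed because squaring preserves the ordering of
// non-negative distances.
inline bool isCloser(std::int32_t px, std::int32_t py,
                     std::int32_t ax, std::int32_t ay,
                     std::int32_t bx, std::int32_t by) {
    return squaredDistance(px, py, ax, ay) < squaredDistance(px, py, bx, by);
}
```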

Oh where has my precision gone with OpenMesh vector arithmetic?

Using doubles I would expect to have about 15 decimal digits of precision. I know that many decimal numbers are not exactly representable in floating point notation, so I would get an approximation for 1/3 for example. However, using a double I would expect an approximation that was correct to about 15 decimal digits. I would also expect to retain that level of accuracy when doing arithmetic.
However, in the following example, I try to calculate the area of a triangle using Heron's formula and OpenMesh::Vec3d, which is backed by OpenMesh::VectorDataT<double,3>, and end up with a result that is only accurate to 5 decimal digits.
The correct result is area = 8.19922e-8, but I'm getting area=8.1992238711962083e-8. Any ideas where this is coming from?
The suggestion that this might result from the instability in Heron's Formula is a good one, but unfortunately is not the case in this example. I have added code which calculates the stable variation on Heron for those who might be interested. In this example, u.norm()>v.norm()>w.norm().
#include <cmath>
#include <OpenMesh/Core/Mesh/PolyMesh_ArrayKernelT.hh>

int main()
{
    // triangle vertices
    OpenMesh::Vec3d x(0.051051, 0.057411, 0.001355);
    OpenMesh::Vec3d y(0.050981, 0.057337, -0.000678);
    OpenMesh::Vec3d z(0.050949, 0.057303, 0.0);
    // edge vectors
    OpenMesh::Vec3d u = x-y;
    OpenMesh::Vec3d v = x-z;
    OpenMesh::Vec3d w = y-z;
    // Heron's Formula
    double semiP = (u.norm() + v.norm() + w.norm())/2.0;
    double area = std::sqrt(semiP * (semiP - u.norm()) * (semiP - v.norm()) * (semiP - w.norm()));
    // Heron's Formula for small angles (relies on u.norm() > v.norm() > w.norm())
    double areaSmall = std::sqrt((u.norm() + (v.norm() + w.norm()))
                               * (w.norm() - (u.norm() - v.norm()))
                               * (w.norm() + (u.norm() - v.norm()))
                               * (u.norm() + (v.norm() - w.norm()))) / 4.0;
}
Heron's formula is numerically unstable. If you have a very "flat" triangle with small angles, the sum of the two small sides is almost the long side, so one of the terms gets very small. If, for example, a and b are the small sides,
(s - c)
will be very small, because
s = (a + b + c)/2
is nearly equal to c.
The Wikipedia article about Heron's formula mentions a stable alternative:
Arrange the sides such that a ≥ b ≥ c and use
A = 1/4*sqrt((a + (b + c))*(c - (a - b))*(c + (a - b))*(a + (b - c)))
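Independent of OpenMesh, both variants can be sketched on plain side lengths. The function names heronArea and stableArea are made up here; stableArea sorts the sides itself so the caller need not.

```cpp
#include <cmath>
#include <utility>

// Textbook Heron's formula; loses accuracy for very flat triangles.
inline double heronArea(double a, double b, double c) {
    double s = (a + b + c) / 2;
    return std::sqrt(s * (s - a) * (s - b) * (s - c));
}

// Stable variant: the parentheses must be kept exactly as written.
inline double stableArea(double a, double b, double c) {
    // Sort so that a >= b >= c.
    if (a < b) std::swap(a, b);
    if (b < c) std::swap(b, c);
    if (a < b) std::swap(a, b);
    return std::sqrt((a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c))) / 4;
}
```

For a well-shaped triangle such as 3-4-5 both return 6. They are expected to diverge only when one side is nearly the sum of the other two.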
To 75 decimal places, the correct area of your triangle is
0.000000081992238711963087279421583920293974467992093148008322378721298327364.
If I replace the nine double constants you have with their decimal equivalents, I get
0.000000081992238711965902754749500279615357792172906541206211853522524016959
It would appear that you are not getting what you're expecting because you're expecting something unreasonable.
Any calculation involving subtraction will result in a loss of precision, if the values are at all close to each other. How many significant digits do you expect from this subtraction?
  1.23456789012345
- 1.23456789000000
------------------
  0.00000000012345
Both operands have 15 digits of precision, but the result only has 5.
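A rough way to quantify this is to count how many leading decimal digits the two operands share. The helper below is a sketch with a made-up name, and it assumes a != b.

```cpp
#include <cmath>

// Approximate number of leading decimal digits that cancel in a - b.
// Precondition: a != b.
inline double digitsCancelled(double a, double b) {
    double larger = std::fmax(std::fabs(a), std::fabs(b));
    return std::log10(larger / std::fabs(a - b));
}
```

For the operands above it returns about 10, so of the roughly 15 digits carried by each input only about 5 remain meaningful in the difference.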

Why does division yield a vastly different result than multiplication by a fraction in floating points

I understand why floating point numbers can't be compared, and know about the mantissa and exponent binary representation, but I'm no expert and today I came across something I don't get:
Namely, let's say you have something like:
float denominator, numerator, resultone, resulttwo;
resultone = numerator / denominator;
float buff = 1 / denominator;
resulttwo = numerator * buff;
To my knowledge different flops can yield different results and this is not unusual. But in some edge cases these two results seem to be vastly different. To be more specific, in my GLSL code calculating the Beckmann facet slope distribution for the Cook-Torrance lighting model:
float a = 1 / (facetSlopeRMS * facetSlopeRMS * pow(clampedCosHalfNormal, 4));
float b = clampedCosHalfNormal * clampedCosHalfNormal - 1.0;
float c = facetSlopeRMS * facetSlopeRMS * clampedCosHalfNormal * clampedCosHalfNormal;
facetSlopeDistribution = a * exp(b/c);
yields very very different results to
float a = (facetSlopeRMS * facetSlopeRMS * pow(clampedCosHalfNormal, 4));
facetSlopeDistribution = exp(b/c) / a;
Why does it? The second form of the expression is problematic.
If I try, say, to add the second form of the expression to a color I get blacks, even though the expression should always evaluate to a positive number. Am I getting an infinity? A NaN? If so, why?
I didn't go through your mathematics in detail, but you must be aware that small errors get pumped up easily by all these powers and exponentials. You should try and substitute all variables var with var + e(var) (on paper, yes) and derive an expression for the total error - without simplifying in between steps, because that's where the error comes from!
This is also a very common problem in computational fluid dynamics, where you can observe things like 'numeric diffusion' if your grid isn't properly aligned with the simulated flow.
So get a clear grip on where the biggest errors come from, and rewrite equations where possible to minimize the numeric error.
edit: to clarify, an example
Say you have some variable x and an expression y=exp(x). The error in x is denoted e(x) and is small compared to x (say e(x)/x < 0.0001, but note that this depends on the type you are using). Then you could say that
e(y) = y(x+e(x)) - y(x)
e(y) ~ dy/dx * e(x) (for small e(x))
e(y) = exp(x) * e(x)
So the absolute error is magnified by a factor of exp(x), meaning that around x=0 there's really no issue (not a surprise, since at that point the slope of exp(x) equals that of x), but for big x you will notice this.
The relative error would then be
e(y)/y = e(y)/exp(x) = e(x)
whilst the relative error in x was
e(x)/x
so you added a factor of x to the relative error.
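The same estimate can be checked numerically. In this sketch (function names made up, double precision assumed) the input is perturbed by a small absolute error ex and the resulting relative change in exp(x) is compared with the relative error of the input.

```cpp
#include <cmath>

// Relative change in exp(x) caused by an absolute perturbation ex of x.
inline double relativeErrorOfExp(double x, double ex) {
    return (std::exp(x + ex) - std::exp(x)) / std::exp(x);
}

// Ratio of the output's relative error to the input's relative error ex / x.
// Precondition: x != 0 and ex != 0.
inline double amplification(double x, double ex) {
    return relativeErrorOfExp(x, ex) / (ex / x);
}
```

For x = 20 and ex = 1e-6 the relative error of exp(x) comes out very close to 1e-6, and the amplification is very close to 20, i.e. the factor of x derived above.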