Linear Interpolation Optimization For Sign-ness - c++

In a tight loop, I am doing a linear interpolation between two floating-point values, but the only part of the result I need is the sign (whether it is negative or positive). Right now I'm doing a typical lerp, a fraction f of the way between a and b:
a + f * (b - a);
Is there something more efficient, considering I just need to know the resulting sign and not the actual lerped value?
Edit: 'f' is a set of fixed distances along the interpolation, which are known beforehand.

You can first check whether the interpolated value changes sign at all over the range:
if (sign(a) != sign(b))  // don't forget about the sign of zero
    // a sign change occurs somewhere in [0, 1]
In that case, find the parameter f0 where the lerp equals zero:
a + f0 * (b - a) = 0
f0 = a / (a - b)
For f < f0 the lerp has the same sign as a, and for f > f0 the same sign as b, so you don't need to calculate the lerp value at all: just compare f with f0.
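Since the f values are known beforehand, f0 can be precomputed once per (a, b) pair, and each query then reduces to a single comparison. A minimal C++ sketch, assuming f stays in [0, 1], a != b, and ignoring exact zeros (the helper name is illustrative):
#include <cmath>

// True if a + f * (b - a) is negative, without computing the lerp.
// Assumes f is in [0, 1]; exact zeros need extra care, as noted above.
inline bool lerpIsNegative(float a, float b, float f)
{
    if (std::signbit(a) == std::signbit(b))
        return a < 0.0f;                // no sign change anywhere on [0, 1]
    float f0 = a / (a - b);             // parameter where the lerp crosses zero
    return (f < f0) ? (a < 0.0f) : (b < 0.0f);
}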

Related

Fast inverse square root using fixed point instead of floating point

I am trying to implement Fast Inverse Square Root for a fixed point number, but I'm not getting anywhere.
I am trying to follow exactly the same principle as the article, except that instead of writing the number in the floating-point format x = (-1) ^ s * (1 + M) * 2 ^ (E-127), I am using the format x = M * 2 ^ -16, i.e. a 32-bit fixed-point number with 16 integer bits and 16 fractional bits.
The problem is that I cannot find the value of the "magic constant". According to my calculations, it doesn’t exist, but I’m not a mathematician and I think I’m doing everything wrong.
To solve Y = 1 / sqrt (x), I used the following reasoning (I don't know if it is correct).
In the original code, the starting value Y0 for the Newton approximation is given by:
i = 0x5f3759df - (i >> 1);
which means that the result is a floating-point number given by:
y0 = (1 + R2 - M / 2) * 2 ^ (R1 - E / 2);
This is because the operation >> divides exponent and mantissa by 2, and then we perform a subtraction of the numbers as integers.
Following the steps shown in the article, I set the format of x to:
x = M * 2 ^ -16
Attempting to apply the same logic, I define Y0 as:
Y0 = (R2 - M / 2) * 2 ^ (R1 - (-16/2));
I'm trying to find a constant that minimizes the error given by:
error = (Y - Y0) / Y
Regardless of the value of R1, I can apply shift operations afterwards to correct the exponent, so that the final result is correct in fixed point.
Where am I wrong?
It can't be done.
The fast inverse sqrt trick relies on the floating-point representation, which has already split the number into a power of two (the exponent) and the significand.
It can be done.
With the same tricks as are used for floating point, it's possible to convert your fixed-point value into the form 2^exp * m. Given uint32_t a, compute int exp = bias - __builtin_clz(a); uint32_t m = a << __builtin_clz(a);, with the constants (and the domain of a) chosen so carefully that there will be no underflows or overflows.
Thus, you will effectively have a custom floating-point representation tailored for this specific purpose: it omits the sign bit at least, and it can use whatever number of bits works best for the exponent, which might well be 8.
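A minimal sketch of that normalization step in C++, assuming a non-zero Q16.16 input and GCC/Clang's __builtin_clz (the struct and function names are illustrative):
#include <cstdint>
#include <cstdio>

// Split a Q16.16 value into 2^exp * (man / 2^31), i.e. a custom
// floating-point representation with the significand's top bit set.
struct CustomFloat {
    int32_t  exp;  // power-of-two exponent
    uint32_t man;  // normalized significand, MSB set
};

CustomFloat normalizeQ16(uint32_t a)  // a must be non-zero
{
    int lz = __builtin_clz(a);
    CustomFloat f;
    f.man = a << lz;   // shift the leading 1 into bit 31
    f.exp = 15 - lz;   // bit 31 of man corresponds to 2^(15 - lz) in Q16.16
    return f;
}

int main()
{
    CustomFloat f = normalizeQ16(4u << 16);            // 4.0 in Q16.16
    std::printf("exp=%d man=0x%08x\n", f.exp, f.man);  // exp=2 man=0x80000000
}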

Perform Addition on a Product using the S.E.A.L. Library

I am trying to perform an operation that is of the form: (A * B) + C. The multiplication works fine, as all the numbers have the same scale at that point, but the product of A * B has a different scale than C. It makes sense that multiplication would change the scale, but I was wondering if there was a way to perform an operation like this using the SEAL library.
Environment information:
Language: C++
Encryption Scheme: CKKS
Encoded values: small doubles (e.g. 0.4531)
Scale used for encoding: pow(2.0, 60), as in the examples
Thank you in advance and let me know if further information is needed.
There are multiple ways of getting this to work. For example, suppose ciphertexts A, B, C all have the same scale Z. Then A * B will have scale Z^2. At this point you should also relinearize A * B unless you have a strong reason not to.
To compute A * B + C, you could for instance:
re-encode C (if you have the plaintext) with scale Z^2 and use that instead;
use multiply_plain to multiply C with a scalar 1.0 plaintext with scale Z to increase the scale to Z^2 but keep the value the same (there is an overload for CKKSEncoder::encode for this);
rescale A * B first so it has scale Z^2/q_k where q_k is the last prime in the coeff_modulus. Now, you could re-encode C to have scale exactly Z^2/q_k (if you have the plaintext), or multiply C with a scalar 1.0 plaintext as explained above to change the scale to exactly Z^2/q_k;
if Z is close to q_k so that Z^2/q_k ~ Z, then after rescaling you might just be able to use double &Ciphertext::scale() to set the scale of A * B to exactly C.scale(), at the cost of a small multiplicative error ~ Z/q_k. For example, instead of scale 2^60 for A, B, C you could use static_cast<double>(parms.coeff_modulus().back()). Then Z^2/q_k = Z (exactly) and the addition works immediately without any scale switching. Of course, this doesn't work so well after a second multiply-and-rescale, as the second-to-last prime can no longer be equal to Z (all primes in coeff_modulus must be distinct). This last option is sketched below.
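A minimal sketch of that last option, assuming Microsoft SEAL's C++ API, an existing evaluator and relin_keys, and ciphertexts A, B, C all encrypted at a scale Z equal to the last coeff_modulus prime:
seal::Ciphertext AB;
evaluator.multiply(A, B, AB);                       // scale is now Z^2
evaluator.relinearize_inplace(AB, relin_keys);      // back to a size-2 ciphertext
evaluator.rescale_to_next_inplace(AB);              // scale is now Z^2 / q_k = Z
AB.scale() = C.scale();                             // force the scales to match exactly
evaluator.mod_switch_to_inplace(C, AB.parms_id());  // bring C down to AB's level
seal::Ciphertext result;
evaluator.add(AB, C, result);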

Order of operations to maximize precision

I'm using floats for these operations. Which of these two is more precise:
(a * b) / c
or
(a / c) * b
Does it matter at all or does it depend on the situation? If so, which should I choose in what cases?
Really, if you don't use double then you are misguided, and you don't care about precision.
Otherwise, you get the best error bounds when the intermediate result lies just below a power of two. For example, when calculating (pi * e) / sqrt(2), you get the best error bounds by calculating (e / sqrt(2)) * pi, because e / sqrt(2) ≈ 1.922 is just below 2. Results just below a power of two have a lower relative error.
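A small single-precision illustration of the point, assuming nothing beyond the standard library (the claim concerns worst-case error bounds, so any individual case may show little difference):
#include <cmath>
#include <cstdio>

int main()
{
    const float pi = 3.14159265358979323846f;
    const float e  = 2.71828182845904523536f;
    const float r2 = std::sqrt(2.0f);
    // Double-precision reference value.
    const double ref = 3.14159265358979323846 * 2.71828182845904523536 / std::sqrt(2.0);
    float v1 = (pi * e) / r2;  // intermediate pi * e ~ 8.54, just above 8
    float v2 = (e / r2) * pi;  // intermediate e / sqrt(2) ~ 1.922, just below 2
    std::printf("relative error v1 = %g\n", std::fabs(v1 - ref) / ref);
    std::printf("relative error v2 = %g\n", std::fabs(v2 - ref) / ref);
}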
For addition and subtraction of a large number of items, it's best to first cancel items of similar magnitude and opposite sign (x - y is calculated exactly if y/2 ≤ x ≤ 2y), and otherwise to combine numbers so that the intermediate results stay as small as possible.

Program to evaluate the polynomial ax³ + bx² + cx + d with a minimum number of operations for given values of a, b, c and d

Please suggest a subroutine to evaluate the polynomial ax³ + bx² + cx + d with a minimum number of operations for given values of a, b, c and d.
If I use the bisection method, is there any way to guess the bracketing limits dynamically?
The fastest method to evaluate f(x) = ax³ + bx² + cx + d is the Horner scheme, which uses parentheses to transform the expression into
f(x) = ((a*x+b)*x+c)*x+d
For finding roots, note that at x = -R and x = +R, where
R = 1 + max(abs(b), abs(c), abs(d)) / abs(a),
the polynomial has non-zero values of opposite sign, since all real roots lie strictly inside [-R, R]. Use bisection, or better, regula falsi with the Illinois anti-stalling modification.
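A short C++ sketch of both points, assuming a cubic with known roots 1, 2, 3 and plain bisection on [-R, R] (regula falsi with the Illinois modification would converge faster):
#include <cmath>
#include <cstdio>

// Horner evaluation: 3 multiplications and 3 additions.
double horner(double a, double b, double c, double d, double x)
{
    return ((a * x + b) * x + c) * x + d;
}

int main()
{
    double a = 1, b = -6, c = 11, d = -6;  // (x - 1)(x - 2)(x - 3)
    double R = 1 + std::fmax(std::fabs(b), std::fmax(std::fabs(c), std::fabs(d))) / std::fabs(a);
    double lo = -R, hi = R;                // f(-R) and f(+R) have opposite signs
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5 * (lo + hi);
        if (horner(a, b, c, d, lo) * horner(a, b, c, d, mid) <= 0.0)
            hi = mid;   // a root lies in [lo, mid]
        else
            lo = mid;   // a root lies in [mid, hi]
    }
    std::printf("root ~ %.12f\n", 0.5 * (lo + hi));  // converges to 1
}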

Oh where has my precision gone with OpenMesh vector arithmetic?

Using doubles I would expect to have about 15 decimal points of precision. I know that many decimal numbers are not exactly representable in floating point notation, so I would get an approximation for 1/3 for example. However, using a double I would expect an approximation that was correct to about 15 decimal points. I would also expect to retain that level of accuracy when doing arithmetic.
However, in the following example, I try to calculate the area of a triangle using Heron's formula and OpenMesh::Vec3d which are backed by OpenMesh::VectorDataT<double,3> and end up with a result that is only accurate to 5 decimal points.
The correct result is area = 8.19922e-8, but I'm getting area = 8.1992238711962083e-8. Any ideas where this is coming from?
The suggestion that this might result from the instability in Heron's Formula is a good one, but unfortunately is not the case in this example. I have added code which calculates the stable variation on Heron for those who might be interested. In this example, u.norm()>v.norm()>w.norm().
#include <OpenMesh/Core/Mesh/PolyMesh_ArrayKernelT.hh>
#include <cmath>
#include <cstdio>

int main()
{
    // triangle vertices
    OpenMesh::Vec3d x(0.051051, 0.057411, 0.001355);
    OpenMesh::Vec3d y(0.050981, 0.057337, -0.000678);
    OpenMesh::Vec3d z(0.050949, 0.057303, 0.0);
    // edge vectors
    OpenMesh::Vec3d u = x - y;
    OpenMesh::Vec3d v = x - z;
    OpenMesh::Vec3d w = y - z;
    // Heron's formula
    double semiP = (u.norm() + v.norm() + w.norm()) / 2.0;
    double area = std::sqrt(semiP * (semiP - u.norm()) * (semiP - v.norm()) * (semiP - w.norm()));
    // Heron's formula rearranged for small angles (requires u.norm() >= v.norm() >= w.norm())
    double areaSmall = std::sqrt((u.norm() + (v.norm() + w.norm()))
                               * (w.norm() - (u.norm() - v.norm()))
                               * (w.norm() + (u.norm() - v.norm()))
                               * (u.norm() + (v.norm() - w.norm()))) / 4.0;
    std::printf("area      = %.17g\n", area);
    std::printf("areaSmall = %.17g\n", areaSmall);
}
Heron's formula is numerically unstable. If you have a very "flat" triangle with small angles, the sum of the two small sides is almost the long side, so one of the terms gets very small. If, for example, a and b are the small sides,
(s - c)
will be very small, because
s = (a + b + c)/2
is nearly equal to c.
The Wikipedia article about Heron's formula mentions a numerically stable alternative:
arrange the sides such that a ≥ b ≥ c and use
A = 1/4*sqrt((a + (b + c))*(c - (a - b))*(c + (a - b))*(a + (b - c)))
To 75 decimal places, the correct area of your triangle is
0.000000081992238711963087279421583920293974467992093148008322378721298327364.
If I replace the nine double constants you have with their decimal equivalents, I get
0.000000081992238711965902754749500279615357792172906541206211853522524016959
It would appear that you are not getting what you're expecting because you're expecting something unreasonable.
Any calculation involving subtraction will lose precision if the values are at all close to each other. How many significant digits do you expect from this subtraction?
1.23456789012345
- 1.23456789000000
----------------
0.00000000012345
Both operands have 15 digits of precision, but the result only has 5.
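A tiny C++ demonstration of this cancellation, assuming nothing beyond the standard library (the digits printed past the fifth significant one are rounding noise from the inputs):
#include <cstdio>

int main()
{
    double x = 1.23456789012345;
    double y = 1.23456789000000;
    // Both operands carry about 15 significant digits, but the
    // difference retains only about 5 of them.
    std::printf("%.15e\n", x - y);
}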