Is there a C/C++ function to safely handle division by zero? - c++

We have a situation where we want to compute a sort of weighted average of two values w1 & w2, based on how far two other values v1 & v2 are from zero. For example:
If v1 is zero, it doesn't get weighted at all so we return w2
If v2 is zero, it doesn't get weighted at all so we return w1
If both values are equally far from zero, we do a mean average and return (w1 + w2 )/2
I've inherited code like:
float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
v1=fabs(v1);
v2=fabs(v2);
return (v1/(v1+v2))*w1 + (v2/(v1+v2)*w2);
}
For a bit of background, v1 & v2 represent how far two different knobs are turned, the weighting of their individual resultant effects only depends how much they are turned, not in which direction.
Clearly, this has a problem when v1==v2==0, since we end up with return (0/0)*w1 + (0/0)*w2 and you can't do 0/0. Putting a special test in for v1==v2==0 sounds horrible mathematically, even if it wasn't bad practice with floating-point numbers.
So I wondered if
there was a standard library function to handle this
there's a neater mathematical representation

You're trying to implement this mathematical function:
F(x, y) = (W1 * |x| + W2 * |y|) / (|x| + |y|)
This function is discontinuous at the point x = 0, y = 0. Unfortunately, as R. stated in a comment, the discontinuity is not removable - there is no sensible value to use at this point.
This is because the "sensible value" changes depending on the path you take to get to x = 0, y = 0. For example, consider following the path F(0, r) from r = R1 to r = 0 (this is equivalent to having the X knob at zero, and smoothly adjusting the Y knob down from R1 to 0). The value of F(x, y) will be constant at W2 until you get to the discontinuity.
Now consider following F(r, 0) (keeping the Y knob at zero and adjusting the X knob smoothly down to zero) - the output will be constant at W1 until you get to the discontinuity.
Now consider following F(r, r) (keeping both knobs at the same value, and adjusting them down simultaneously to zero). The output here will be constant at (W1 + W2) / 2 until you get to the discontinuity.
This implies that any value between W1 and W2 is equally valid as the output at x = 0, y = 0. There's no sensible way to choose between them. (And further, always choosing 0 as the output is completely wrong - the output is otherwise bounded to be on the interval W1..W2 (ie, for any path you approach the discontinuity along, the limit of F() is always within that interval), and 0 might not even lie in this interval!)
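The path dependence is easy to check numerically. The following is a minimal sketch, not part of the original code: the helper names and the explicit weight parameters are made up for illustration. It evaluates F a small distance r from the origin along the three paths described above.

```cpp
#include <cmath>

// F(x, y) = (W1*|x| + W2*|y|) / (|x| + |y|), undefined at x = y = 0.
// The weights are passed in explicitly; all names here are illustrative.
float F(float x, float y, float W1, float W2)
{
    const float ax = std::fabs(x);
    const float ay = std::fabs(y);
    return (W1 * ax + W2 * ay) / (ax + ay);
}

// Value of F a distance r from the origin along each of the three paths.
float alongYAxis(float r, float W1, float W2)    { return F(0.0f, r, W1, W2); } // stays at W2
float alongXAxis(float r, float W1, float W2)    { return F(r, 0.0f, W1, W2); } // stays at W1
float alongDiagonal(float r, float W1, float W2) { return F(r, r, W1, W2); }    // stays at (W1 + W2) / 2
```

However small r gets, the three helpers keep returning three different values, which is why no single value at the origin can make the function continuous.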
You can "fix" the problem by adjusting the function slightly - add a constant (eg 1.0) to both v1 and v2 after the fabs(). This will make it so that the minimum contribution of each knob can't be zero - just "close to zero" (the constant defines how close).
It may be tempting to define this constant as "a very small number", but that will just cause the output to change wildly as the knobs are manipulated close to their zero points, which is probably undesirable.
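A minimal sketch of the biased version, assuming a positive constant is acceptable for your knobs. The function name and the `bias` parameter are illustrative, not taken from the inherited code.

```cpp
#include <cmath>

// Weighted average with a constant bias added to both knob magnitudes.
// `bias` is an illustrative parameter name; it must be positive.
float calcWeightedAverageBiased(float v1, float v2, float w1, float w2, float bias = 1.0f)
{
    v1 = std::fabs(v1) + bias;
    v2 = std::fabs(v2) + bias;
    return (v1 * w1 + v2 * w2) / (v1 + v2); // denominator >= 2 * bias > 0
}
```

Note the trade-off: with both knobs at zero this returns the plain mean of w1 and w2, but a knob at zero now always keeps some weight. With bias 1.0, knob values 3 and 0, and weights 2 and 4, the result is 2.4 rather than 2.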

This is the best I could come up with quickly
float calcWeightedAverage(float v1,float v2,float w1,float w2)
{
v1 = fabs(v1); // as in the original: only the magnitudes matter
v2 = fabs(v2);
float a1 = 0.0;
float a2 = 0.0;
if (v1 != 0)
{
a1 = v1/(v1+v2) * w1;
}
if (v2 != 0)
{
a2 = v2/(v1+v2) * w2;
}
return a1 + a2;
}

I don't see what would be wrong with just doing this:
float calcWeightedAverage( float v1, float v2, float w1, float w2 ) {
static const float eps = FLT_MIN; //Or some other suitably small value.
v1 = fabs( v1 );
v2 = fabs( v2 );
if( v1 + v2 < eps )
return (w1+w2)/2.0f;
else
return (v1/(v1+v2))*w1 + (v2/(v1+v2)*w2);
}
Sure, no "fancy" stuff to figure out your division, but why make it harder than it has to be?

Personally I don't see anything wrong with an explicit check for divide by zero. We all do them, so it could be argued that not having it is uglier.
However, it is possible to turn off the IEEE divide by zero exceptions. How you do this depends on your platform. I know on Windows it has to be done process-wide, so you can inadvertently mess with other threads (and they with you) by doing it if you aren't careful.
However, if you do that your result value will be NaN, not 0. I highly doubt that's what you want. If you are going to have to put a special check in there anyway with different logic when you get NaN, you might as well just check for 0 in the denominator up front.
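To illustrate the NaN route, here is a hedged sketch that assumes IEEE 754 floats with floating-point traps left disabled (the usual default). Both function names and the `fallback` parameter are made up for the example.

```cpp
#include <cmath>
#include <limits>

// The inherited formula, unchanged: yields NaN when v1 == v2 == 0 on IEEE 754 platforms.
float naiveWeightedAverage(float v1, float v2, float w1, float w2)
{
    v1 = std::fabs(v1);
    v2 = std::fabs(v2);
    return (v1 / (v1 + v2)) * w1 + (v2 / (v1 + v2)) * w2;
}

// Detect the NaN after the fact and substitute a caller-chosen fallback.
float weightedAverageOr(float v1, float v2, float w1, float w2, float fallback)
{
    static_assert(std::numeric_limits<float>::is_iec559, "sketch assumes IEEE 754 floats");
    const float result = naiveWeightedAverage(v1, v2, w1, w2);
    return std::isnan(result) ? fallback : result;
}
```

This is no shorter than testing the denominator first, which is the point made above: the special case does not go away, it just moves.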

So with a weighted average, you need to look at the special case where both are zero. In that case you want to treat it as 0.5 * w1 + 0.5 * w2, right? How about this?
float calcWeightedAverage(float v1,float v2,float w1,float w2)
{
v1=fabs(v1);
v2=fabs(v2);
if (v1 == v2) {
v1 = 0.5;
} else {
v1 = v1 / (v1 + v2); // v1 is between 0 and 1
}
v2 = 1 - v1; // avoid addition and division because they should add to 1
return v1 * w1 + v2 * w2;
}

You could test for fabs(v1)+fabs(v2)==0 (this seems to be the fastest given that you've already computed them), and return whatever value makes sense in this case ((w1+w2)/2?). Otherwise, keep the code as-is.
However, I suspect the algorithm itself is broken if v1==v2==0 is possible. This kind of numerical instability when the knobs are "near 0" hardly seems desirable.
If the behavior actually is right and you want to avoid special-cases, you could add the minimum positive floating point value of the given type to v1 and v2 after taking their absolute values. (Note that DBL_MIN and friends are not the correct value because they're the minimum normalized values; you need the minimum of all positive values, including subnormals.) This will have no effect unless they're already extremely small; the additions will just yield v1 and v2 in the usual case.
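A minimal sketch of that last suggestion in C++, assuming std::numeric_limits<float>::denorm_min() is available and that subnormals are not flushed to zero (some fast-math compiler modes do flush them). The function name is made up for the example.

```cpp
#include <cmath>
#include <limits>

float calcWeightedAverageTiny(float v1, float v2, float w1, float w2)
{
    // Smallest positive float, subnormals included (unlike FLT_MIN).
    const float tiny = std::numeric_limits<float>::denorm_min();
    v1 = std::fabs(v1) + tiny; // no effect unless v1 is already extremely small
    v2 = std::fabs(v2) + tiny;
    return (v1 / (v1 + v2)) * w1 + (v2 / (v1 + v2)) * w2;
}
```

With both inputs at zero, each ratio becomes tiny / (2 * tiny) = 0.5, so the function returns the plain mean of w1 and w2.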

The problem with using an explicit check for zero is that you can end up with discontinuities in behaviour unless you are careful, as outlined in caf's response (and if it's in the core of your algorithm the if can be expensive, but don't worry about that until you measure...)
I tend to use something that just smooths out the weighting near zero instead.
float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
const float eps = 1e-7f; // Or whatever you like...
v1=fabs(v1)+eps;
v2=fabs(v2)+eps;
return (v1/(v1+v2))*w1 + (v2/(v1+v2)*w2);
}
Your function is now smooth, with no asymptotes or division by zero, and so long as one of v1 or v2 is above 1e-7 by a significant amount it will be indistinguishable from a "real" weighted average.

If the denominator is zero, how do you want it to default? You can do something like this:
static inline float divide_default(float numerator, float denominator, float fallback) {
return (denominator == 0) ? fallback : (numerator / denominator); // named fallback because default is a reserved word
}
float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
v1 = fabs(v1);
v2 = fabs(v2);
return w1 * divide_default(v1, v1 + v2, 0.0) + w2 * divide_default(v2, v1 + v2, 0.0);
}
Note that the function definition and use of static inline should really let the compiler know that it can inline.

This should work
#include <float.h>
float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
v1=fabs(v1);
v2=fabs(v2);
return (v1/(v1+v2+FLT_EPSILON))*w1 + (v2/(v1+v2+FLT_EPSILON)*w2);
}
edit:
I saw there may be problems with some precision so instead of using FLT_EPSILON use DBL_EPSILON for accurate results (I guess you will return a float value).

I'd do like this:
float calcWeightedAverage(double v1, double v2, double w1, double w2)
{
v1 = fabs(v1);
v2 = fabs(v2);
/* if both values are equally far from 0 */
if (fabs(v1 - v2) < 0.000000001) return (w1 + w2) / 2;
return (v1*w1 + v2*w2) / (v1 + v2);
}

Related

Quadratic Algebra advice for Array like return function

I have a problem. I want to write a method, which uses the PQ-Formula to calculate Zeros on quadratic algebra.
As I see C++ doesn't support Arrays, unlike C#, which I use normally.
How do I get either, ZERO, 1 or 2 results returned?
Is there any other way without Array, which doesn't exists?
Actually I am not into pointers so my actual code is corrupted.
I'd glad if someone can help me.
float* calculateZeros(float p, float q)
{
float *x1, *x2;
if (((p) / 2)*((p) / 2) - (q) < 0)
throw std::exception("No Zeros!");
x1 *= -((p) / 2) + sqrt(static_cast<double>(((p) / 2)*((p) / 2) - (q)));
x2 *= -((p) / 2) - sqrt(static_cast<double>(((p) / 2)*((p) / 2) - (q)));
float returnValue[1];
returnValue[0] = x1;
returnValue[1] = x2;
return x1 != x2 ? returnValue[0] : x1;
}
Actually this code does not compile, but this is the code I've written so far.
There are quite a few issues with it; first of all, I'll drop all those totally needless parentheses, as they just make the code (much) harder to read:
float* calculateZeros(float p, float q)
{
float *x1, *x2; // pointers are never initialized!!!
if ((p / 2)*(p / 2) - q < 0)
throw std::exception("No Zeros!"); // zeros? q just needs to be large enough!
x1 *= -(p / 2) + sqrt(static_cast<double>((p / 2)*(p / 2) - q));
x2 *= -(p / 2) - sqrt(static_cast<double>((p / 2)*(p / 2) - q));
// ^ this would multiply the pointer values! but these are not initialized -> UB!!!
float returnValue[1];
returnValue[0] = x1; // you are assigning pointer to value here
returnValue[1] = x2;
return x1 != x2 ? returnValue[0] : x1;
// ^ value! ^ pointer!
// apart from, if you returned a pointer to returnValue array, then you would
// return a pointer to data with scope local to the function – i. e. the array
// is destroyed upon leaving the function, thus the pointer returned will get
// INVALID as soon as the function is exited; using it would again result in UB!
}
As is, your code wouldn't even compile...
As I see C++ doesn't support arrays
Well... I assume you meant: 'arrays as return values or function parameters'. That's true for raw arrays, these can only be passed as pointers. But you can accept structs and classes as parameters or use them as return values. You want to return both calculated values? So you could use e. g. std::array<float, 2>; std::array is a wrapper around raw arrays avoiding all the hassle you have with the latter... As there are exactly two values, you could use std::pair<float, float>, too, or std::tuple<float, float>.
Want to be able to return either 2, 1 or 0 values? std::vector<float> might be your choice...
std::vector<float> calculateZeros(float p, float q)
{
std::vector<float> results;
// don't repeat the code all the time...
double h = static_cast<double>(p) / 2; // "half"
double s = h * h; // "square" (of half)
if(s >= q)
{
// only enter, if we CAN have a result otherwise, the vector remains empty
// this is far better behaviour than the exception
double r = sqrt(s - q); // "root"
h = -h;
if(r == 0.0)
{
results.push_back(h);
}
else
{
results.reserve(2); // prevents re-allocations;
// admitted, for just two values, we could live with...
results.push_back(h + r);
results.push_back(h - r);
}
}
return results;
}
Now there's one final issue left: as the precision even of double is limited, rounding errors can occur (and the matter is even worse when using float; I would recommend changing all floats to doubles, parameters and return values as well!). You shouldn't ever compare for exact equality (someValue == 0.0), but consider some epsilon to cover badly rounded values:
-epsilon < someValue && someValue < +epsilon
OK, in the given case there are two originally exact comparisons involved, and we want to do as few epsilon comparisons as possible. So:
double d = s - q; // discriminant: square of half, minus q
if(d > -epsilon)
{
// considered 0 or greater than
h = -h;
if(d < +epsilon)
{
// considered 0 (and then no need to calculate the root at all...)
results.push_back(h);
}
else
{
// considered greater 0
double r = sqrt(d);
results.push_back(h - r);
results.push_back(h + r);
}
}
Value of epsilon? Well, either use a fixed, small enough value or calculate it dynamically based on the smaller of the two values (multiplied by some small factor), and be sure to keep it positive... You might be interested in a bit more information on the matter. It doesn't matter that it isn't about C++: the issue is the same for all languages using IEEE754 representation for doubles.
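Putting the pieces together, a minimal sketch of the whole function with the single epsilon test might look as follows. It switches everything to double as recommended above; the default value of `epsilon` is only an illustration, not a universally suitable tolerance.

```cpp
#include <cmath>
#include <vector>

// Real zeros of x^2 + p*x + q via the pq-formula, using one epsilon test.
// `epsilon` is an illustrative default, not a universally suitable tolerance.
std::vector<double> calculateZeros(double p, double q, double epsilon = 1e-12)
{
    std::vector<double> results;
    double h = p / 2;           // "half"
    const double d = h * h - q; // discriminant: square of half, minus q
    if (d > -epsilon)
    {
        h = -h;
        if (d < epsilon)
        {
            results.push_back(h); // treated as a double zero
        }
        else
        {
            const double r = std::sqrt(d);
            results.push_back(h - r);
            results.push_back(h + r);
        }
    }
    return results; // empty when there is no real zero
}
```

The caller can then check results.size() to see whether zero, one or two values came back, for example calculateZeros(-3, 2) for x^2 - 3x + 2.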

How do I efficiently determine the scale factor of two parallel vectors?

I have two three-dimensional non-zero vectors which I know to be parallel, and thus I can multiply each component of one vector by a constant to obtain the other. In order to determine this constant, I can take any of the fields from both vectors and divide them by one another to obtain the scale factor.
For example:
vec3 vector1(1.0, 1.5, 2.0);
vec3 vector2(2.0, 3.0, 4.0);
float scaleFactor = vector2.x / vector1.x; // = 2.0
Unfortunately, picking the same field (say the x-axis) every time risks the divisor being zero.
Dividing the lengths of the vectors is not possible either because it does not take a negative scale factor into account.
Is there an efficient means of going about this which avoids zero divisions?
So we want something that:
1- has no branching
2- avoids division by zero
3- ensures the largest possible divisor
These requirements are achieved by the ratio of two dot-products:
(v1 * v2) / (v2 * v2)
=
(v1.x*v2.x + v1.y*v2.y + v1.z*v2.z) / (v2.x*v2.x + v2.y*v2.y + v2.z*v2.z)
In the general case where the dimension is not a (compile time) constant, both numerator and denominator can be computed in a single loop.
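A minimal sketch of the dot-product ratio, using a stand-in Vec3 struct since the question's vec3 type is not shown. Here the vector being scaled goes in the denominator, so the function returns k with to == k * from.

```cpp
struct Vec3 { float x, y, z; }; // stand-in for the question's vec3 type

float dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns k such that to == k * from, assuming the vectors are parallel
// and `from` is non-zero (so dot(from, from) > 0).
float scaleFactor(const Vec3& from, const Vec3& to)
{
    return dot(to, from) / dot(from, from);
}
```

For the vectors in the question this gives 14.5 / 7.25 = 2.0, and a negative factor comes out with the right sign. One caveat: the squared components can overflow or underflow for extreme magnitudes.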
Pretty much, this.
inline float scale_factor(const vec3& v1, const vec3& v2, bool* fail)
{
*fail = false;
float eps = 0.000001f;
if (std::fabs(v1.x) > eps)
return v2.x / v1.x;
if (std::fabs(v1.y) > eps)
return v2.y / v1.y;
if (std::fabs(v1.z) > eps)
return v2.z / v1.z;
*fail = true;
return -1;
}
Also, one can think of getting 2 sums of elements, and then getting a scale factor with a single division. You can get sum effectively by using IPP's ippsSum_32f, for example, as it is calculated using SIMD instructions.
But, to be honest, I doubt that you can really improve these methods. Either sum all -> divide or branch -> divide will provide you with the solution pretty close to the best.
To minimize the relative error, use the largest element:
if (abs(v1.x) >= abs(v1.y) && abs(v1.x) >= abs(v1.z))
return v2.x / v1.x;
else if (abs(v1.y) >= abs(v1.z)) // x is not the largest, so it is y or z
return v2.y / v1.y;
else
return v2.z / v1.z;
This code assumes that v1 is not a zero vector.

Points on the same line

I was doing a practice question that went something like this: we are given N pairs of coordinates (x,y) and a central point (x0,y0). We are asked to find the maximum number of points lying on a single line passing through (x0,y0).
My approach: I tried to maintain a hash map with the slope as the key, and take the maximum mapped value to get the maximum number of points on the same line. Something like this:
mp[(yi-y0)/(xi-x0)]++; //i from 0 to n
Then I iterate over the map and do something like this:
if(it->second >max) //it is the iterator
max=it->second;
and print max at the end.
Problem with my approach: whenever (xi-x0) is 0 I get a runtime error. I also tried atan(slope) so that I would get an angle instead of an undefined value, but it still does not work.
What I expect: how do I remove this runtime error, and is my approach correct for finding the maximum number of points on a line passing through the point (x0,y0)?
P.S. My native language is not English, so please excuse any mistakes. I tried my best to make everything clear; if I am not clear enough, please tell me.
I'm assuming no other points have the same coordinates as your "origin".
If all your coordinates happen to be integers, you can keep a rational number (i.e. a pair of integers, i.e. a numerator and a denominator) as the slope, instead of a single real number.
The slope is DeltaY / DeltaX, so all you have to do is keep the pair of numbers separate. You just need to take care to divide the pair by their greatest common divisor, and handle the case where DeltaX is zero. For example:
pair<int, int> CalcSlope (int x0, int y0, int x1, int y1)
{
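int dx = x1 - x0, dy = y1 - y0; // keep the signs: slopes 1 and -1 must differ
int g = GCD(abs(dx), abs(dy)); // nonzero as long as the point is not the origin
dx /= g; dy /= g;
if (dx < 0 || (dx == 0 && dy < 0)) { dx = -dx; dy = -dy; } // one key per line, not per direction
return {dy, dx};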
}
Now just use the return value of CalcSlope() as your map key.
In case you need it, here's one way to calculate the GCD:
int GCD (int a, int b)
{
if (0 == b) return a;
else return GCD(b, a % b);
}
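For completeness, here is a hedged sketch of the whole count built on the reduced-slope key. The function names are made up, and it assumes (unlike the note above) that input points equal to (x0, y0) may occur and should count towards every line.

```cpp
#include <cstdlib>
#include <map>
#include <utility>
#include <vector>

int gcdInt(int a, int b) { return b == 0 ? a : gcdInt(b, a % b); }

// Counts the largest number of points lying on one line through (x0, y0).
// Points equal to (x0, y0) lie on every line, so they are added to the best count.
int maxPointsOnLineThrough(int x0, int y0, const std::vector<std::pair<int, int>>& points)
{
    std::map<std::pair<int, int>, int> counts; // key: reduced (dy, dx)
    int atCentre = 0;
    int best = 0;
    for (const auto& p : points)
    {
        int dx = p.first - x0, dy = p.second - y0;
        if (dx == 0 && dy == 0) { ++atCentre; continue; }
        const int g = gcdInt(std::abs(dx), std::abs(dy));
        dx /= g; dy /= g;
        if (dx < 0 || (dx == 0 && dy < 0)) { dx = -dx; dy = -dy; } // same key for both directions
        const int n = ++counts[{dy, dx}];
        if (n > best) best = n;
    }
    return best + atCentre;
}
```

Vertical lines need no special case here: they simply get the key (1, 0), and no division by dx ever happens.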
You have already accepted an answer, but I would like to share my approach anyway. This method uses the fact that three points a, b, and c are collinear if and only if
(a.first-c.first)*(b.second-c.second) - (a.second-c.second)*(b.first-c.first) == 0
You can use this property to create a custom comparison object like this
struct comparePoints {
comparePoints(int x0 = 0, int y0 = 0) : _x0(x0), _y0(y0) {}
bool operator()(const point& a, const point& b) {
return (a.first-_x0)*(b.second-_y0) - (b.first-_x0)*(a.second-_y0) < 0;
}
private:
int _x0, _y0;
};
which you can then use as a comparison object of a map according to
comparePoints comparator(x0, y0);
map<pair<int, int>, int, comparePoints> counter(comparator);
You can then add points to this map similar to what you did before:
if (!(x == x0 && y == y0))
counter[{x,y}]++;
By using comparator as a comparison object, two keys a, b in the map are considered equal if !comparator(a, b) && !comparator(b,a), which is true if and only if a, b and {x0,y0} are collinear.
The advantage of this method is that you don't need to divide the coordinates which avoids rounding errors and problems with dividing by zero, or calculate the atan which is a costly operation.
Move everything so that the zero point is at the origin:
(xi, yi) -= (x0, y0)
Then for each point (xi, yi), find the greatest common divisor of xi and yi and divide both numbers by it:
k = GCD(xi, yi)
(xi', yi') = (xi/k, yi/k)
Now points that are on the same ray will map to equal points. If (xi, yi) is on the same ray as (xj, yj) then (xi', yi') = (xj', yj').
Now find the largest set of equal points (don't forget any (xi, yi) = (0, 0)) and you have your answer.
You've a very original approach here !
Nevertheless, a vertical line has an infinite slope and this is the problem here: dividing by 0 is not allowed.
Alternative built on your solution (slope):
...
int mpvertical=0; // a separate counter for verticals
if (xi-x0)
mp[(yi-y0)/(xi-x0)]++;
else if (yi-y0)
mpvertical++;
// else the point (xi,yi) is the point (x0,y0): it shall not be counted)
This is cumbersome, because you have everything in the map plus this extra counter. But it will be exact. A workaround could be to count such points in mp[std::numeric_limits<double>::max()], but this would be an approximation.
Remark: the case where xi==x0 AND yi==y0 corresponds to your origin point. These points have to be discarded as they belong to every line.
Trigonometric alternative (angle):
This uses the general atan2 formula used for converting cartesian coordinates into polar coordinates, to get the angle:
if (yi!=y0 || xi>x0) // skips the origin point and the negative x half-axis, where this formula is undefined
mp[ 2*atan((yi-y0)/((xi-x0)+sqrt(pow(xi-x0,2)+pow(yi-y0,2)))) ]++;
so your key for mp will be an angle between -pi and +pi. No more extra case, but slightly more calculations.
You can hide these extra details and use the slightly more optimized built-in function:
if (xi!=x0 || yi!=y0) // only the origin point itself is ignored
mp[ atan2(yi-y0, xi-x0) ]++;
you can give this approach a try
struct vec2
{
vec2(float a,float b):x(a),y(b){}
float x,y;
};
bool isColinear(vec2 a, vec2 b, vec2 c)
{
return fabs((a.y-b.y)*(a.x-c.x) - (a.y-c.y)*(a.x-b.x)) <= 0.000001 ;
}

Boids colliding with each other

I was looking at some pseudocode for boids and wrote it in C++. However, I am finding that boids will occasionally collide with each other. I thought that I had programmed it correctly, given how simple the pseudocode is. Yet, when I display the locations of all the boids, some of them have the same coordinates.
The pseudocode from the link:
PROCEDURE rule2(boid bJ)
Vector c = 0;
FOR EACH BOID b
IF b != bJ THEN
IF |b.position - bJ.position| < 100 THEN
c = c - (b.position - bJ.position)
END IF
END IF
END
RETURN c
END PROCEDURE
my code is:
std::pair <signed int, signed int> keep_distance(std::vector <Boid> & boids, Boid & boid){
signed int dx = 0;
signed int dy = 0;
for(Boid & b : boids){
if (boid != b){ // this checks an "id" number, not location
if (b.dist(boid) < MIN_DIST){
dx -= b.get_x() - boid.get_x();
dy -= b.get_y() - boid.get_y();
}
}
}
return std::pair <signed int, signed int> (dx, dy);
}
with
MIN_DIST = 100;
unsigned int Boid::dist(const Boid & b){
return (unsigned int) sqrt((b.x - x) * (b.x - x) + (b.y - y) * (b.y - y));
}
The only major difference between these two pieces of code should be that instead of the vector c, I'm using its components.
The order of functions I am using to move each boid around is:
center_of_mass(boids, new_boids[i]); // rule 1
match_velocity(boids, new_boids[i]); // rule 3
keep_within_bound(new_boids[i]);
tendency_towards_place(new_boids[i], mouse_x, mouse_y);
keep_distance(boids, new_boids[i]); // rule 2
is there something obvious im not seeing? maybe some silly vector arithmetic i did wrong?
The rule doesn't say that boids cannot collide. They just don't want to. :)
As you can see in this snippet:
FOR EACH BOID b
v1 = rule1(b)
v2 = rule2(b)
v3 = rule3(b)
b.velocity = b.velocity + v1 + v2 + v3
b.position = b.position + b.velocity
END
There is no check to make sure they don't collide. If the numbers come out unfavorably they will still collide.
That being said, getting the exact same position for multiple boids is still very unlikely; it would point to a programming error.
Later in the article he has this code:
PROCEDURE move_all_boids_to_new_positions()
Vector v1, v2, v3, ...
Integer m1, m2, m3, ...
Boid b
FOR EACH BOID b
v1 = m1 * rule1(b)
v2 = m2 * rule2(b)
v3 = m3 * rule3(b)
b.velocity = b.velocity + v1 + v2 + v3 + ...
b.position = b.position + b.velocity
END
END PROCEDURE
(Though realistically I would make m1 a double rather than an Integer.) If rule1 is the poorly named rule that makes boids attempt to avoid each other, simply increase the value of m1 and they will turn away from each other faster. Also, increasing MIN_DIST will cause them to notice sooner that they're about to run into each other, and decreasing their maximum velocity (vlim in the function limit_velocity) will allow them to react more sanely to near collisions.
As others mentioned, there's nothing that 100% guarantees collisions don't happen, but these tweaks will make collisions less likely.
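The article's limit_velocity is not reproduced here, so the following is only a sketch of what such a speed clamp could look like in C++. The Vec2 type and the function name are assumptions; vlim is the maximum speed mentioned above.

```cpp
#include <cmath>

struct Vec2 { double x, y; }; // hypothetical velocity type

// Scales v down so its length does not exceed vlim; shorter vectors pass through.
Vec2 limitVelocity(Vec2 v, double vlim)
{
    const double speed = std::hypot(v.x, v.y);
    if (speed > vlim)
    {
        const double scale = vlim / speed;
        v.x *= scale; // direction is preserved, only the magnitude shrinks
        v.y *= scale;
    }
    return v;
}
```

Note that the question's code stores positions and offsets as integers; if the velocities are integers too, clamping like this will round, which can itself push two boids onto the same coordinates.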

Most accurate line intersection ordinate computation with floats?

I'm computing the ordinate y of a point on a line at a given abscissa x. The line is defined by its two end points coordinates (x0,y0)(x1,y1). End points coordinates are floats and the computation must be done in float precision for use in GPU.
The maths, and thus the naive implementation, are trivial.
Let t = (x - x0)/(x1 - x0), then y = (1 - t) * y0 + t * y1 = y0 + t * (y1 - y0).
The problem is when x1 - x0 is small. The result will introduce cancellation error. When combined with the one of x - x0, in the division I expect a significant error in t.
The question is if there exist another way to determine y with a better accuracy ?
i.e. should I compute (x - x0)*(y1 - y0) first, and divide by (x1 - x0) after ?
The difference y1 - y0 will always be big.
To a large degree, your underlying problem is fundamental. When (x1-x0) is small, it means there are only a few bits in the mantissa of x1 and x0 which differ. And by extension, there are only a limited number of floats between x0 and x1. E.g. if only the lower 4 bits of the mantissa differ, there are at most 14 values between them.
In your best algorithm, the t term represents these lower bits. And to continue our example, if x0 and x1 differ by 4 bits, then t can take on only 16 values as well. The calculation of these possible values is fairly robust. Whether you're calculating 3E0/14E0 or 3E-12/14E-12, the result is going to be close to the mathematical value of 3/14.
Your formula has the additional advantage of having y0 <= y <= y1, since 0 <= t <= 1
(I'm assuming that you know enough about float representations, and therefore "(x1-x0) is small" really means "small, relative to the values of x1 and x0 themselves". A difference of 1E-1 is small when x0=1E3 but large if x0=1E-6 )
You may have a look at Qt's "QLine" (if I remember it right) sources; they have implemented an intersection determination algorithm taken from one the "Graphics Gems" books (the reference must be in the code comments, the book was on EDonkey a couple of years ago), which, in turn, has some guarantees on applicability for a given screen resolution when calculations are performed with given bit-width (they use fixed-point arithmetics if I'm not wrong).
If you have the possibility to do it, you can introduce two cases in your computation, depending on abs(x1-x0) < abs(y1-y0). In the vertical case abs(x1-x0) < abs(y1-y0), compute x from y instead of y from x.
EDIT. Another possibility would be to obtain the result bit by bit using a variant of dichotomic search. This will be slower, but may improve the result in extreme cases.
// Input is x; order the end points by abscissa so each x stays paired with its own y
float xmin = x0, ymin = y0, xmax = x1, ymax = y1;
if (x1 < x0) { xmin = x1; ymin = y1; xmax = x0; ymax = y0; }
for (int i=0;i<20;i++) // get 20 bits in result
{
float xmid = (xmin+xmax)*0.5f;
float ymid = (ymin+ymax)*0.5f;
if ( x < xmid ) { xmax = xmid; ymax = ymid; } // first half
else { xmin = xmid; ymin = ymid; } // second half
}
// Output is some value between ymin and ymax
float Y = ymin;
I have implemented a benchmark program to compare the effect of the different expression.
I computed y using double precision and then compute y using single precision with different expressions.
Here are the expression tested:
inline double getYDbl( double x, double x0, double y0, double x1, double y1 )
{
double const t = (x - x0)/(x1 - x0);
return y0 + t*(y1 - y0);
}
inline float getYFlt1( float x, float x0, float y0, float x1, float y1 )
{
double const t = (x - x0)/(x1 - x0);
return y0 + t*(y1 - y0);
}
inline float getYFlt2( float x, float x0, float y0, float x1, float y1 )
{
double const t = (x - x0)*(y1 - y0);
return y0 + t/(x1 - x0);
}
inline float getYFlt3( float x, float x0, float y0, float x1, float y1 )
{
double const t = (y1 - y0)/(x1 - x0);
return y0 + t*(x - x0);
}
inline float getYFlt4( float x, float x0, float y0, float x1, float y1 )
{
double const t = (x1 - x0)/(y1 - y0);
return y0 + (x - x0)/t;
}
I computed the average and stdDev of the difference between the double precision result and single precision result.
The result is that there is no difference on average over 1000 and 10K random value sets. I used the icc compiler with and without optimization, as well as g++.
Note that I had to use the isnan() function to filter out bogus values. I suspect these result from underflow in the difference or division.
I don't know if the compilers rearrange the expression.
Anyway, the conclusion from this test is that the above rearrangements of the expression have no effect on the computation precision. The error remains the same (on average).
Check if the distance between x0 and x1 is small, i.e. fabs(x1 - x0) < eps. Then the line is parallel to the y axis of the coordinate system, i.e. you can't calculate the y values of that line depending on x. You have infinitely many y values and therefore you have to treat this case differently.
If your source data is already a float then you already have fundamental inaccuracy.
To explain further, imagine if you were doing this graphically. You have a 2D sheet of graph paper, and two points marked.
Case 1: Those points are very accurate, and have been marked with a very sharp pencil. It's easy to draw the line joining them, and easy to then get y given x (or vice versa).
Case 2: These points have been marked with a big fat felt tip pen, like a bingo marker. Clearly the line you draw will be less accurate. Do you go through the centre of the spots? The top edge? The bottom edge? Top of one, bottom of the other? Clearly there are many different options. If the two dots are close to each other then the variation will be even greater.
Floats have a certain level of inaccuracy inherent in them, due to the way they represent numbers, ergo they correspond more to case 2 than case 1 (which one could suggest is the equivalent of using an arbitrary precision library). No algorithm in the world can compensate for that. Imprecise data in, imprecise data out.
How about computing something like:
t = sign * power2 ( sqrt (abs(x - x0))/ sqrt (abs(x1 - x0)))
The idea is to use a mathematical equivalent formula in which low (x1-x0) has less effect.
(not sure if the one I wrote matches this criteria)
As MSalters said, the problem is already in the original data.
Interpolation / extrapolation requires the slope, which already has low accuracy in the given conditions (worst for very short line segments far away from the origin).
The choice of algorithm cannot regain this accuracy loss. My gut feeling is that a different evaluation order will not change things, as the error is introduced by the subtractions, not the division.
Idea:
If you have more accurate data when the lines are generated, you can change the representation from ((x0, y0), (x1, y1)) to (x0,y0, angle, length). You could store angle or slope, slope has a pole, but angle requires trig functions... ugly.
Of course that won't work if you need the end point frequently, and you have so many lines that you can't store additional data, I have no idea. But maybe there is another representation that works well for your needs.
doubles have enough resolution in most situations, but that would double the working set too.