Different results from similar floating-point functions


I have two functions that should do the same thing:
float ver1(float a0, float a1) {
    float r0 = a0 - a1;
    if (abs(r0) > PI) {
        if (r0 > 0) {
            r0 -= PI2;
        } else {
            r0 += PI2;
        }
    }
    return r0;
}
float ver2(float a0, float a1) {
    float a2 = a1 - PI2;
    float r0 = a0 - a1;
    float r1 = a0 - a2;
    if (abs(r0) < abs(r1)) {
        return r0;
    }
    if (abs(r0) > abs(r1)) {
        return r1;
    }
    return 0;
}
Note: PI and PI2 are float constants holding pi and 2*pi.
The odd thing is that they sometimes produce different results. For example, if you feed them 0.28605145 and 5.9433694, the first returns 0.62586737 and the second returns 0.62586755, and I can't figure out what is causing this.
If you calculate the result by hand, you'll find that the second answer is correct. I use this function in a 2D physics simulation, and the really odd thing is that the first answer (the wrong one) works there, while the second one (the right one) makes it act all kinds of crazy. Such a tiny difference from an unknown source, and such a profound effect :|
At this point I'm switching to matrices anyway, but this odd situation got me curious. Does anybody know what's going on?

float typically has a precision of about 24 bits, or about 7 decimal places.
You are subtracting two numbers of similar magnitude (r0+PI2 in the first, a1-PI2 in the second), and so are experiencing loss of significance - several of the most significant bits of the result are zero, so there are fewer bits left to represent the difference. That is why the answers match to only about 6 decimal places.
If you need more precision, then a double or a 32-bit or larger fixed-point representation might be more suitable than a float. There are also arbitrary-precision libraries available, such as GMP, which can represent numbers with all the precision you need, although arithmetic will be significantly slower than with built-in types.

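You should use fabs() (or std::abs from <cmath>) instead of abs(). In C, abs() takes an int, so a float argument is truncated first and you get weird and wrong results. In C++, std::abs has floating-point overloads, but an unqualified abs can still resolve to the integer version depending on which headers are included.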

Floating-point numbers don't behave like mathematical real numbers. Every addition of two values may introduce a rounding error, so I wouldn't call the first result correct and the second incorrect on the strength of one example. You need to be careful with every operation on floats if you want to keep the error small.
The error is generally smaller when the magnitudes of the operands are in the same range, and tends to be bigger when they differ widely.
For example, 10000000.0 + 0.1 - 10000000.0 is hardly ever exactly 0.1.
If you know the ranges of the inputs, you can adjust the code to reduce errors.

Related

Find float a to closest multiple of float b

C++ Scenario: I have two variables of type double a and b.
Goal: a should be set to the closest multiple of b that is smaller than a.
First approach: Use fmod() or remainder() to get r. Then do a = a - r.
I know that due to the representation of decimal numbers in memory fmod() or remainder() can never guarantee 100% accuracy. In my tests I found that I cannot use fmod() at all, as the variance of its results is too unpredictable (at least as far as I understand). There are many questions and discussions out there talking about this phenomenon.
So is there something I could do to still use fmod()?
By “something” I mean a trick similar to checking whether a equals b by using a tolerance value:
double EPSILON = 0.005;
if (std::abs(a-b) < EPSILON)
    std::cout << "equal" << '\n';
My second approach works but seems not to be very elegant. I am just subtracting b from a until there is nothing left to subtract:
double findRemainder(double x, double y) {
    double rest;
    if (y > x)
    {
        double temp = x;
        x = y;
        y = temp;
    }
    while (x > y)
    {
        rest = x - y;
        x = x - y;
    }
    return rest;
}
int main()
{
    typedef std::numeric_limits<double> dbl;
    std::cout.precision(dbl::max_digits10);
    double a = 13.78, b = 2.2, r = 0;
    r = findRemainder(a, b);
    return 0;
}
Any suggestions for me?

Preamble
The problem is impossible, both as stated and as intended.
Remainders are exact
This statement is incorrect: “fmod() or remainder() can never guarantee 100% accuracy.” If the floating-point format supports subnormal numbers (as IEEE-754 does), then fmod(x, y) and remainder are both exact; they produce a result with no rounding error (barring bugs in their implementation). The remainder, as defined for either of them, is always less than y and not more than x in magnitude. Therefore, it is always in a portion of the floating-point format that is at least as fine as y and as x, so all the bits needed for the real-arithmetic remainder can be represented in the floating-point remainder. So a correct implementation will return the exact remainder.
Multiples may not be representable
For simplicity of illustration, I will use IEEE-754 binary32, the format commonly used for float. The issues are the same for other formats. In this format, all integers with magnitude up to 2^24, 16,777,216, are representable. After that, due to the scaling by the floating-point exponent, the representable values increase by two: 16,777,218, 16,777,220, and so on. At 2^25, 33,554,432, they increase by four: 33,554,436, 33,554,440. At 2^26, 67,108,864, they increase by eight.
100,000,000 is representable, and so are 99,999,992 and 100,000,008. Now consider asking what multiple of 3 is the closest to 100,000,000. It is 99,999,999. But 99,999,999 is not representable in the binary32 format.
Thus, it is not always possible for a function to take two representable values, a and b, and return the greatest multiple of b that is less than a, using the same floating-point format. This is not because of any difficulty computing the multiple but simply because it is impossible to represent the true multiple in the floating-point format.
In fact, given the standard library, it is easy to compute the remainder; std::fmod(100000000.f, 3.f) is 1. But it is impossible to compute 100000000.f − 1 in the binary32 format.
The intended question is impossible
The examples shown, 13.78 for a and 2.2 for b, suggest the desire is to produce a multiple for floating-point numbers a and b that are the results of converting decimal numerals written in the source to the floating-point format. However, once such conversions are performed, the original numerals cannot be recovered from the converted values a and b.
To see this, consider values for a of either 99,999,997 or 100,000,002 while b is 10. The greatest multiple of 10 less than 99,999,997 is 99,999,990, and the greatest multiple of 10 less than 100,000,002 is 100,000,000.
When either 99,999,997 or 100,000,002 is converted to the binary32 format (using the common method, round-to-nearest-ties-to-even), the result for a is 100,000,000. Converting b of course yields 10 for b.
Then a function that computes the greatest multiple of b that is less than a can return only one result. Even if this function uses extended precision (say binary64) so that it can return either 99,999,990 or 100,000,000 even though those are not representable in binary32, it has no way to distinguish them. Whether the original a is 99,999,997 or 100,000,002, the a given to the function is 100,000,000, so there is no way for it to know the original a and no way for it to decide which result to return.

Hmm,
there really is a problem of definition, because most multiples of a floating-point number won't be exactly representable, except perhaps when the multiplier is a power of two.
Taking your example and using Smalltalk notation (the language does not really matter; I use it only because I can evaluate and verify the expressions I propose), the exact fractional values of the double-precision numbers 0.1 and 0.9 can be written:
(1+(1<<54)reciprocal) / 10 = 0.1.
(9+(1<<52)reciprocal) / 10 = 0.9.
<< is a bit shift, 1<<54 is 2 raised to the power of 54, and reciprocal is its inverse, 2^-54.
As you can easily see:
(1+(1<<54)reciprocal) * 9 > (9+(1<<52)reciprocal)
That is, the exact multiple of 0.1 is greater than 0.9.
Thus, technically, the answer is 8*0.1 (which is exact in this lucky case)
(8+(1<<51)reciprocal) / 10 = 0.8.
What remainder does is give the EXACT remainder of the division, so it is related to the computations above.
You can try it: you will find something like -2.77555...e-17, or exactly (1<<55) reciprocal negated. The negative sign indicates that the nearest multiple is close to 0.9, with 0.9 lying a bit below that exact multiple.
However, if your problem is to find the greatest value <= 0.9 among the multiples of 0.1 rounded to nearest, then your answer will be 0.9, because the rounded product is 0.1*9 = 0.9.
You have to resolve that ambiguity first. And if you are not actually interested in multiples of 0.1 but in multiples of 1/10, then it's again a different matter...

Does casting `std::floor()` and `std::ceil()` to integer type always give the correct result?

I am being paranoid that one of these functions may give an incorrect result like this:
std::floor(2000.0 / 1000.0) --> std::floor(1.999999999999) --> 1
or
std::ceil(18 / 3) --> std::ceil(6.000000000001) --> 7
Can something like this happen? If there is indeed a risk like this, I'm planning to use the functions below in order to work safely. But, is this really necessary?
constexpr long double EPSILON = 1e-10;
intmax_t GuaranteedFloor(const long double & Number)
{
    if (Number > 0)
    {
        return static_cast<intmax_t>(std::floor(Number) + EPSILON);
    }
    else
    {
        return static_cast<intmax_t>(std::floor(Number) - EPSILON);
    }
}
intmax_t GuaranteedCeil(const long double & Number)
{
    if (Number > 0)
    {
        return static_cast<intmax_t>(std::ceil(Number) + EPSILON);
    }
    else
    {
        return static_cast<intmax_t>(std::ceil(Number) - EPSILON);
    }
}
(Note: I'm assuming that the given 'long double' argument will fit in the 'intmax_t' return type.)

People often get the impression that floating point operations produce results with small, unpredictable, quasi-random errors. This impression is incorrect.
Floating point arithmetic computations are as exact as possible. 18/3 will always produce exactly 6. The result of 1/3 won't be exactly one third, but it will be the closest number to one third that is representable as a floating point number.
So the examples you showed are guaranteed to always work. As for your suggested "guaranteed floor/ceil", it's not a good idea. Certain sequences of operations can easily blow the error far above 1e-10, and certain other use cases will require 1e-10 to be correctly recognized (and ceil'ed) as nonzero.
As a rule of thumb, hardcoded epsilon values are bugs in your code.

In the specific examples you're listing, I don't think those errors would ever occur.
std::floor(2000.0 /*Exactly Representable in 32-bit or 64-bit Floating Point Numbers*/ / 1000.0 /*Also exactly representable*/) --> std::floor(2.0 /*Exactly Representable*/) --> 2
std::ceil(18 / 3 /*both treated as ints, might not even compile if ceil isn't properly overloaded....?*/) --> 6
std::ceil(18.0 /*Exactly Representable*/ / 3.0 /*Exactly Representable*/) --> 6
Having said that, if you have math that depends on these functions behaving exactly correctly for floating point numbers, that may illuminate a design flaw you need to reconsider/reexamine.

As long as the floating-point values x and y exactly represent integers within the limits of the type you're using, there's no problem--x / y will always yield a floating-point value that exactly represents the integer result. Casting to int as you're doing will always work.
However, once the floating-point values go outside the range in which the type can represent every integer exactly (see "Representing integers in doubles"), epsilons don't help.
Consider this example. 16777217 is the smallest integer not exactly representable as a 32-bit float:
int ix=16777217, iy=97;
printf("%d / %d = %d", ix, iy, ix/iy);
// yields "16777217 / 97 = 172961" which is accurate
float x=ix, y=iy;
printf("%f / %f = %f", x, y, x/y);
// yields "16777216.000000 / 97.000000 = 172960.989691"
In this case, the error is negative; in other cases (try 16777219 / 1549), the error is positive.
While it's tempting to add an epsilon to make floor work, it won't extend the accuracy much. When the values differ by more orders of magnitude, the error becomes greater than 1 and integer accuracy can't be guaranteed. Specifically, when x/y exceeds the largest integer the type can represent exactly, the error can exceed 1.0, so the epsilon is no help.
If this is coming into play, you will have to consider changing your mathematical approach--order of operations, work with logarithms, etc.

Such results are likely to appear when working with doubles. You can use round or you can subtract 0.5 then use std::ceil function.

Can I trust a real-to-int conversion of the result of ceil()?

Suppose I have some code such as:
float a, b = ...; // both positive
int s1 = ceil(sqrt(a/b));
int s2 = ceil(sqrt(a/b)) + 0.1;
Is it ever possible that s1 != s2? My concern is when a/b is a perfect square. For example, perhaps a=100.0 and b=4.0, then the output of ceil should be 5.00000 but what if instead it is 4.99999?
Similar question: is there a chance that sqrt(100.0/4.0) evaluates to, say, 5.00001, and then ceil rounds it up to 6.00000?
I'd prefer to do this in integer math but the sqrt kinda screws that plan.
EDIT: suggestions on how to better implement this would be appreciated too! The a and b values are integer values, so actual code is more like: ceil(sqrt(float(a)/b))
EDIT: Based on levis501's answer, I think I will do this:
float a, b = ...; // both positive
int s = sqrt(a/b);
while (s*s*b < a) ++s;
Thank you all!

I don't think it's possible. Regardless of the value of sqrt(a/b), what it produces is some value N that we use as:
int s1 = ceil(N);
int s2 = ceil(N) + 0.1;
Since ceil always produces an integer value (albeit represented as a double), we will always have some value X, for which the first produces X.0 and the second X.1. Conversion to int will always truncate that .1, so both will result in X.
It might seem like there would be an exception if X was so large that X.1 overflowed the range of double. I don't see where this could be possible though. Except close to 0 (where overflow isn't a concern) the square root of a number will always be smaller than the input number. Therefore, before ceil(N)+0.1 could overflow, the a/b being used as an input in sqrt(a/b) would have to have overflowed already.

You may want to write an explicit function for your case. e.g.:
/* return the smallest positive integer whose square is at least x */
int isqrt(double x) {
    int y1 = ceil(sqrt(x));
    int y2 = y1 - 1;
    if ((y2 * y2) >= x) return y2;
    return y1;
}
This handles the odd case where sqrt(x) comes out slightly above an integer root because of the limited precision of double, so that ceil would overshoot by one.

Equality of floating point numbers is indeed an issue, but IMHO not if we deal with integer numbers.
If you have the case of 100.0/4.0, it should perfectly evaluate to 25.0, as 25.0 is exactly representable as a float, as opposite to e.g. 25.1.

Yes, it's entirely possible that s1 != s2. Why is that a problem, though?
It seems natural enough that s1 != (s1 + 0.1).
BTW, if you would prefer to have 5.00001 rounded to 5.00000 instead of 6.00000, use rint instead of ceil.
And to answer the actual question (in your comment) - you can use sqrt to get a starting point and then just find the correct square using integer arithmetic.
int min_dimension_greater_than(int items, int buckets)
{
    double target = double(items) / buckets;
    int min_square = ceil(target);
    int dim = floor(sqrt(target));
    int square = dim * dim;
    while (square < min_square) {
        dim += 1; // try the next candidate dimension
        square = dim * dim;
    }
    return dim;
}
And yes, this can be improved a lot, it's just a quick sketch.

s1 will always equal s2.
The C and C++ standards do not say much about the accuracy of math routines. Taken literally, it is impossible for the standard to be implemented, since the C standard says sqrt(x) returns the square root of x, but the square root of two cannot be exactly represented in floating point.
Implementing routines with good performance that always return a correctly rounded result (in round-to-nearest mode, this means the result is the representable floating-point number that is nearest to the exact result, with ties resolved in favor of a low zero bit) is a difficult research problem. Good math libraries target accuracy less than 1 ULP (so one of the two nearest representable numbers is returned), perhaps something slightly more than .5 ULP. (An ULP is the Unit of Least Precision, the value of the low bit given a particular value in the exponent field.) Some math libraries may be significantly worse than this. You would have to ask your vendor or check the documentation for more information.
So sqrt may be slightly off. If the exact square root is an integer (within the range in which integers are exactly representable in floating-point) and the library guarantees errors are less than 1 ULP, then the result of sqrt must be exactly correct, because any result other than the exact result is at least 1 ULP away.
Similarly, if the library guarantees errors are less than 1 ULP, then ceil must return the exact result, again because the exact result is representable and any other result would be at least 1 ULP away. Additionally, the nature of ceil is such that I would expect any reasonable math library to always return an integer, even if the rest of the library were not high quality.
As for overflow cases, if ceil(x) were beyond the range where all integers are exactly representable, then ceil(x)+.1 is closer to ceil(x) than it is to any other representable number, so the rounded result of adding .1 to ceil(x) should be ceil(x) in any system implementing the floating-point standard (IEEE 754). That is provided you are in the default rounding mode, which is round-to-nearest. It is possible to change the rounding mode to something like round-toward-infinity, which could cause ceil(x)+.1 to be an integer higher than ceil(x).

Can float values add to a sum of zero? [duplicate]

I have two values(floats) I am attempting to add together and average. The issue I have is that occasionally these values would add up to zero, thus not requiring them to be averaged.
The situation I am in specifically contains the values "-1" and "1", yet when added together I am given the value "-1.19209e-007" which is clearly not 0. Any information on this?

I'm sorry but this doesn't make sense to me.
Two floating-point values that are exactly equal in magnitude but opposite in sign will always add up to exactly 0. That is how floating-point arithmetic works.
float a = 0.2f;
float b = -0.2f;
float f = (a + b) / 2;
printf("%f %d\n", f, f != 0); // prints "0.000000 0"
The result is 0 whether or not the compiler optimizes the code.
There is no rounding error to take into account if a and b have the same magnitude but opposite sign. That is, if the sign bit (the highest bit) of a is 0, the sign bit of b is 1, and all other bits are the same, the sum cannot be anything other than 0.
But if a and b are slightly different, of course, the result can be non-zero.
One possible way to handle this is to use a tolerance:
float f = (a + b) / 2;
if (fabs(f) < 0.000001f)
    f = 0;
Here a simple tolerance decides whether the value is close enough to zero.
A nice example code to show this is...
int main(int argc, char **argv)
{
    for (int i = -10000000; i <= 10000000 * argc; ++i)
    {
        if (i != 0)
        {
            float a = 3.14159265f / i;
            float b = -a + (argc - 1);
            float f = (a + b) / 2;
            if (f != 0)
                printf("%f %f\n", a, f);
        }
    }
    printf("completed\n");
    return 0;
}
I'm using "argc" here as a trick to force the compiler to not optimize out our code.

At least right off, this sounds like typical floating point imprecision.
The usual way to deal with it is to round your numbers to an appropriate number of decimal places. In this case, your sum is -1.19209e-07 (i.e., about -0.00000012), so the average is about -5.96e-08. Rounded to (say) six or seven decimal places, that is zero.

Take the sum of all your numbers and divide by your count. Round off your answer to something reasonable before you do prints, reports, comparisons, or whatever you're doing.

Again, do some searching on this, but here is the basic explanation.
The computer represents floating-point numbers in base 2 instead of base 10. This means that, for example, 0.2 converted to binary is actually 0.001100110011... continuing forever. Since the computer cannot store infinitely many digits, it must approximate the value.
Because of these approximations, calculations lose precision; hence "single" and "double" precision floating-point numbers. This is why you should not test whether a computed float is exactly 0. Instead, test whether its magnitude is below some threshold that you are willing to treat as zero.

Floating point comparison [duplicate]

int main()
{
    float a = 0.7;
    float b = 0.5;
    if (a < 0.7)
    {
        if (b < 0.5) printf("2 are right");
        else printf("1 is right");
    }
    else printf("0 are right");
}
I would have expected the output of this code to be 0 are right.
But to my dismay the output is 1 is right. Why?

int main()
{
    float a = 0.7, b = 0.5; // These are FLOATS
    if(a < .7)              // This is a DOUBLE
    {
        if(b < .5)          // This is a DOUBLE
            printf("2 are right");
        else
            printf("1 is right");
    }
    else
        printf("0 are right");
}
Floats get promoted to doubles during comparison, and since floats are less precise than doubles, 0.7 as a float is not the same as 0.7 as a double. In this case, 0.7 as a float is less than 0.7 as a double once it is promoted. And as Christian said, 0.5 is a power of 2 and is always represented exactly, so that test works as expected: 0.5 < 0.5 is false.
So either:
Change float to double, or:
Change .7 and .5 to .7f and .5f,
and you will get the expected behavior.

The issue is that the constants you are comparing to are double, not float. Also, changing your constants to something that is exactly representable, such as a power of 2, will make it print 0 are right. For example,
int main()
{
    float a=0.25,b=0.5;
    if(a<.25)
    {
        if(b<.5)
            printf("2 are right");
        else
            printf("1 is right");
    }
    else
        printf("0 are right");
}
Output:
0 are right
This SO question on Most Effective Way for float and double comparison covers this topic.
Also, this article at cygnus on floating point number comparison gives us some tips:
The IEEE float and double formats were designed so that the numbers
are “lexicographically ordered”, which – in the words of IEEE
architect William Kahan means “if two floating-point numbers in the
same format are ordered ( say x < y ), then they are ordered the same
way when their bits are reinterpreted as Sign-Magnitude integers.”
This means that if we take two floats in memory, interpret their bit
pattern as integers, and compare them, we can tell which is larger,
without doing a floating point comparison. In the C/C++ language this
comparison looks like this:
if (*(int*)&f1 < *(int*)&f2)
This charming syntax means take the address of f1, treat it as an
integer pointer, and dereference it. All those pointer operations look
expensive, but they basically all cancel out and just mean ‘treat f1
as an integer’. Since we apply the same syntax to f2 the whole line
means ‘compare f1 and f2, using their in-memory representations
interpreted as integers instead of floats’.

It's due to rounding when the double constant 0.7 is converted to float.

Generally, comparing floats for equality is a dangerous business (which is effectively what you're doing, since you're comparing right on the boundary of <). Remember that in decimal certain fractions (like 1/3) cannot be expressed exactly; the same is true in binary.
0.5 in binary is 0.1, which is the same in float or double.
0.7 in binary is 0.10110011001100... and so on forever. 0.7 cannot be represented exactly in binary, so you get rounding errors, and the result may be (very, very slightly) different between float and double.
Note that float and double cut the expansion off after a different number of binary digits, hence your inconsistent results.

Also, by the way, there is an error in the logic behind 0 are right: b is never checked when you print 0 are right. It is also a little mysterious what you are really trying to accomplish. Comparisons between floats and doubles will show minute variations, so you should compare against a delta that is an acceptable variation for your situation. I've always done this via inline functions that just perform the work (I did it once with a macro, but that's too messy). Anyhow, rounding issues abound with this type of example. Read up on floating point, and know that .7 is different from .7f, and that assigning .7 to a float converts a double to a float, thus changing the exact value. But the assumption about b being wrong just because a was checked jumped out at me, and I had to note that :)
