c++ == operator double - c++

I know it is incorrect to compare double (equality) and the best is to use an epsilon factor as described into the Knuth book (Art of programming). Nevertheless, I am working on a legacy code (C++), where there are a lot of devision like:
// b,c double from previous computation
if( b == 50.0)
b += 0.001;
double a = c/(b - 50.0);
Do we perform the conditional statement (b == 50) on the "bit representation" (mantissa-exponenent) or the decimal one ? I do not find this information on my C++ book. If it is the decimal, I think I can trough away the conditional statement.

The == operator is applied to the run-time representation of the floating-point value, ideally with exactly the exponent and significand numbers of bits implied by the type, but unfortunately, sometimes in a wider format, as allowed by the standard.
In b == 50.0, the decimal representation 50.0 is converted to such a floating-point representation at compile-time once and for all. That value is then used (or the program behaves as if it was used) each time this expression 50.0 is involved. In the case of 50.0, it does not make a difference because the number 50 can be represented exactly as a binary floating-point value.
As an example, b == 50.0000000000000000000001 is likely to behave exactly as b == 50.0 because 50.0000000000000000000001 represents the same floating-point value as 50.0.

For the specific piece of code, the use of exact comparison is correct:
// b,c double from previous computation
if( b == 50.0)
b += 0.001;
double a = c/(b - 50.0);
The purpose seems to be to ensure that the division will not be a division by zero. The code may have been written to be compatible with systems in which division by 0 causes failure, rather than infinity. Subtracting 50 from any double that is not exactly 50 will have a non-zero result, so the 0.001 fudge factor only needs to be added in the case of exact equality.

Related

Find float a to closest multiple of float b

C++ Scenario: I have two variables of type double a and b.
Goal: a should be set to the closest multiple of b that is smaller than a.
First approach: Use fmod() or remainder() to get r. Then do a = a - r.
I know that due to the representation of decimal numbers in memory fmod() or remainder() can never guarantee 100% accuracy. In my tests I found that I cannot use fmod() at all, as the variance of its results is too unpredictable (at least as far as I understand). There are many questions and discussions out there talking about this phenomenon.
So is there something I could do to still use fmod()?
With “something” I mean some trick similar to checking if a equals b by employing a value double
EPSILON = 0.005;
if (std::abs(a-b) < EPSILON)
std::cout << "equal" << '\n';
My second approach works but seems not to be very elegant. I am just subtracting b from a until there is nothing left to subtract:
double findRemainder(double x, double y) {
double rest;
if (y > x)
{
double temp = x;
x = y;
y = temp;
}
while (x > y)
{
rest = x - y;
x = x - y;
}
return rest;
}
int main()
{
typedef std::numeric_limits<double> dbl;
std::cout.precision(dbl::max_digits10);
double a = 13.78, b = 2.2, r = 0;
r = findRemainder(a, b);
return 0;
}
Any suggestions for me?
Preamble
The problem is impossible, both as stated and as intended.
Remainders are exact
This statement is incorrect: “fmod() or remainder() can never guarantee 100% accuracy.” If the floating-point format supports subnormal numbers (as IEEE-754 does), then fmod(x, y) and remainder are both exact; they produce a result with no rounding error (barring bugs in their implementation). The remainder, as defined for either of them, is always less than y and not more than x in magnitude. Therefore, it is always in a portion of the floating-point format that is at least as fine as y and as x, so all the bits needed for the real-arithmetic remainder can be represented in the floating-point remainder. So a correct implementation will return the exact remainder.
Multiples may not be representable
For simplicity of illustration, I will use IEEE-754 binary32, the format commonly used for float. The issues are the same for other formats. In this format, all integers with magnitude up to 224, 16,777,216, are representable. After that, due to the scaling by the floating-point exponent, the representable values increase by two: 16,777,218, 16,777,220, and so on. At 225, 33,554,432, they increase by four: 33,554,436, 33,554,440. At 226, 67,108,864, they increase by eight.
100,000,000 is representable, and so are 99,999,992 and 100,000,008. Now consider asking what multiple of 3 is the closest to 100,000,000. It is 99,999,999. But 99,999,999 is not representable in the binary32 format.
Thus, it is not always possible for a function to take two representable values, a and b, and return the greatest multiple of b that is less than a, using the same floating-point format. This is not because of any difficulty computing the multiple but simply because it is impossible to represent the true multiple in the floating-point format.
In fact, given the standard library, it is easy to compute the remainder; std::fmod(100000000.f, 3.f) is 1. But it is impossible to compute 100000000.f − 1 in the binary32 format.
The intended question is impossible
The examples shown, 13.78 for a and 2.2 for b, suggest the desire is to produce a multiple for some floating-point numbers a and b that are the results of converting decimal numerals a and b to the floating-point format. However, once such conversions are performed, the original numbers cannot be known from the results a and b.
To see this, consider values for a of either 99,999,997 or 100,000,002 while b is 10. The greatest multiple of 10 less than 99,999,997 is 99,999,990, and the greatest multiple of 10 less than 100,000,002 is 100,000,000.
When either 99,999,997 or 100,000,002 is converted to the binary32 format (using the common method, round-to-nearest-ties-to-even), the result for a is 100,000,000. Converting b of course yields 10 for b.
Then a function that converts the greatest multiple of a that is less than b can return only one result. Even if this function uses extended precision (say binary64) so that it can return either 99,999,990 or 100,000,000 even though those are not representable in binary32, it has no way to distinguish them. Whether the original a is 99,999,997 or 100,000,002, the a given to the function is 100,000,000, so there is no way for it to know the original a and no way for it to decide which result to return.
Hmm,
there really is a problem of definition, because most multiples of a floating point won't be representable exactly, except maybe if the multiplier is a power of two.
Taking your example and Smalltalk notations (which does not really matter, I do it just because i can evaluate and verify the expressions I propose), the exact fractional representation of double precision 0.1 and 0.9 can be written:
(1+(1<<54)reciprocal) / 10 = 0.1.
(9+(1<<52)reciprocal) / 10 = 0.9.
<< is a bistshift, 1<<54 is 2 raised to the power of 54, and reciprocal is its inverse 2^-54.
As you can easily see:
(1+(1<<54)reciprocal) * 9 > (9+(1<<52)reciprocal)
That is, the exact multiple of 0.1 is greater than 0.9.
Thus, technically, the answer is 8*0.1 (which is exact in this lucky case)
(8+(1<<51)reciprocal) / 10 = 0.8.
What remainder does is to give the EXACT remainder of the division, so it is related to above computations somehow.
You can try it, you will find something like-2.77555...e-17, or exactly (1<<55) reciprocal. The negative part is indicating that nearest multiple is close to 0.9, but a bit below 0.9.
However, if your problem is to find the greatest <= 0.9, among the rounded to nearest multiple of 0.1, then your answer will be 0.9, because the rounded product is 0.1*9 = 0.9.
You have to first resolve that ambiguity. If ever, you are not interested in multiples of 0.1, but in multiples of (1/10), then it's again a different matter...

Does casting `std::floor()` and `std::ceil()` to integer type always give the correct result?

I am being paranoid that one of these functions may give an incorrect result like this:
std::floor(2000.0 / 1000.0) --> std::floor(1.999999999999) --> 1
or
std::ceil(18 / 3) --> std::ceil(6.000000000001) --> 7
Can something like this happen? If there is indeed a risk like this, I'm planning to use the functions below in order to work safely. But, is this really necessary?
constexpr long double EPSILON = 1e-10;
intmax_t GuaranteedFloor(const long double & Number)
{
if (Number > 0)
{
return static_cast<intmax_t>(std::floor(Number) + EPSILON);
}
else
{
return static_cast<intmax_t>(std::floor(Number) - EPSILON);
}
}
intmax_t GuaranteedCeil(const long double & Number)
{
if (Number > 0)
{
return static_cast<intmax_t>(std::ceil(Number) + EPSILON);
}
else
{
return static_cast<intmax_t>(std::ceil(Number) - EPSILON);
}
}
(Note: I'm assuming that the the given 'long double' argument will fit in the 'intmax_t' return type.)
People often get the impression that floating point operations produce results with small, unpredictable, quasi-random errors. This impression is incorrect.
Floating point arithmetic computations are as exact as possible. 18/3 will always produce exactly 6. The result of 1/3 won't be exactly one third, but it will be the closest number to one third that is representable as a floating point number.
So the examples you showed are guaranteed to always work. As for your suggested "guaranteed floor/ceil", it's not a good idea. Certain sequences of operations can easily blow the error far above 1e-10, and certain other use cases will require 1e-10 to be correctly recognized (and ceil'ed) as nonzero.
As a rule of thumb, hardcoded epsilon values are bugs in your code.
In the specific examples you're listing, I don't think those errors would ever occur.
std::floor(2000.0 /*Exactly Representable in 32-bit or 64-bit Floating Point Numbers*/ / 1000.0 /*Also exactly representable*/) --> std::floor(2.0 /*Exactly Representable*/) --> 2
std::ceil(18 / 3 /*both treated as ints, might not even compile if ceil isn't properly overloaded....?*/) --> 6
std::ceil(18.0 /*Exactly Representable*/ / 3.0 /*Exactly Representable*/) --> 6
Having said that, if you have math that depends on these functions behaving exactly correctly for floating point numbers, that may illuminate a design flaw you need to reconsider/reexamine.
As long as the floating-point values x and y exactly represent integers within the limits of the type you're using, there's no problem--x / y will always yield a floating-point value that exactly represents the integer result. Casting to int as you're doing will always work.
However, once the floating-point values go outside the integer-representable range for the type (Representing integers in doubles), epsilons don't help.
Consider this example. 16777217 is the smallest integer not exactly representable as a 32-bit float:
int ix=16777217, iy=97;
printf("%d / %d = %d", ix, iy, ix/iy);
// yields "16777217 / 97 = 172961" which is accurate
float x=ix, y=iy;
printf("%f / %f = %f", x, y, x/y);
// yields "16777216.000000 / 97.000000 = 172960.989691"
In this case, the error is negative; in other cases (try 16777219 / 1549), the error is positive.
While it's tempting to add an epsilon to make floor work, it won't extend the accuracy much. When the values differ by more orders of magnitude, the error becomes greater than 1 and integer-accuracy can't be guaranteed. Specifically, when x/y exceeds the max. representable, the error can exceed 1.0, so the epsilon is no help.
If this is coming into play, you will have to consider changing your mathematical approach--order of operations, work with logarithms, etc.
Such results are likely to appear when working with doubles. You can use round or you can subtract 0.5 then use std::ceil function.

Can I trust a real-to-int conversion of the result of ceil()?

Suppose I have some code such as:
float a, b = ...; // both positive
int s1 = ceil(sqrt(a/b));
int s2 = ceil(sqrt(a/b)) + 0.1;
Is it ever possible that s1 != s2? My concern is when a/b is a perfect square. For example, perhaps a=100.0 and b=4.0, then the output of ceil should be 5.00000 but what if instead it is 4.99999?
Similar question: is there a chance that 100.0/4.0 evaluates to say 5.00001 and then ceil will round it up to 6.00000?
I'd prefer to do this in integer math but the sqrt kinda screws that plan.
EDIT: suggestions on how to better implement this would be appreciated too! The a and b values are integer values, so actual code is more like: ceil(sqrt(float(a)/b))
EDIT: Based on levis501's answer, I think I will do this:
float a, b = ...; // both positive
int s = sqrt(a/b);
while (s*s*b < a) ++s;
Thank you all!
I don't think it's possible. Regardless of the value of sqrt(a/b), what it produces is some value N that we use as:
int s1 = ceil(N);
int s2 = ceil(N) + 0.1;
Since ceil always produces an integer value (albeit represented as a double), we will always have some value X, for which the first produces X.0 and the second X.1. Conversion to int will always truncate that .1, so both will result in X.
It might seem like there would be an exception if X was so large that X.1 overflowed the range of double. I don't see where this could be possible though. Except close to 0 (where overflow isn't a concern) the square root of a number will always be smaller than the input number. Therefore, before ceil(N)+0.1 could overflow, the a/b being used as an input in sqrt(a/b) would have to have overflowed already.
You may want to write an explicit function for your case. e.g.:
/* return the smallest positive integer whose square is at least x */
int isqrt(double x) {
int y1 = ceil(sqrt(x));
int y2 = y1 - 1;
if ((y2 * y2) >= x) return y2;
return y1;
}
This will handle the odd case where the square root of your ratio a/b is within the precision of double.
Equality of floating point numbers is indeed an issue, but IMHO not if we deal with integer numbers.
If you have the case of 100.0/4.0, it should perfectly evaluate to 25.0, as 25.0 is exactly representable as a float, as opposite to e.g. 25.1.
Yes, it's entirely possible that s1 != s2. Why is that a problem, though?
It seems natural enough that s1 != (s1 + 0.1).
BTW, if you would prefer to have 5.00001 rounded to 5.00000 instead of 6.00000, use rint instead of ceil.
And to answer the actual question (in your comment) - you can use sqrt to get a starting point and then just find the correct square using integer arithmetic.
int min_dimension_greater_than(int items, int buckets)
{
double target = double(items) / buckets;
int min_square = ceil(target);
int dim = floor(sqrt(target));
int square = dim * dim;
while (square < min_square) {
seed += 1;
square = dim * dim;
}
return dim;
}
And yes, this can be improved a lot, it's just a quick sketch.
s1 will always equal s2.
The C and C++ standards do not say much about the accuracy of math routines. Taken literally, it is impossible for the standard to be implemented, since the C standard says sqrt(x) returns the square root of x, but the square root of two cannot be exactly represented in floating point.
Implementing routines with good performance that always return a correctly rounded result (in round-to-nearest mode, this means the result is the representable floating-point number that is nearest to the exact result, with ties resolved in favor of a low zero bit) is a difficult research problem. Good math libraries target accuracy less than 1 ULP (so one of the two nearest representable numbers is returned), perhaps something slightly more than .5 ULP. (An ULP is the Unit of Least Precision, the value of the low bit given a particular value in the exponent field.) Some math libraries may be significantly worse than this. You would have to ask your vendor or check the documentation for more information.
So sqrt may be slightly off. If the exact square root is an integer (within the range in which integers are exactly representable in floating-point) and the library guarantees errors are less than 1 ULP, then the result of sqrt must be exactly correct, because any result other than the exact result is at least 1 ULP away.
Similarly, if the library guarantees errors are less than 1 ULP, then ceil must return the exact result, again because the exact result is representable and any other result would be at least 1 ULP away. Additionally, the nature of ceil is such that I would expect any reasonable math library to always return an integer, even if the rest of the library were not high quality.
As for overflow cases, if ceil(x) were beyond the range where all integers are exactly representable, then ceil(x)+.1 is closer to ceil(x) than it is to any other representable number, so the rounded result of adding .1 to ceil(x) should be ceil(x) in any system implementing the floating-point standard (IEEE 754). That is provided you are in the default rounding mode, which is round-to-nearest. It is possible to change the rounding mode to something like round-toward-infinity, which could cause ceil(x)+.1 to be an integer higher than ceil(x).

Floating point comparison [duplicate]

This question already has answers here:
Floating point inaccuracy examples
(7 answers)
Closed 8 years ago.
int main()
{
float a = 0.7;
float b = 0.5;
if (a < 0.7)
{
if (b < 0.5) printf("2 are right");
else printf("1 is right");
}
else printf("0 are right");
}
I would have expected the output of this code to be 0 are right.
But to my dismay the output is 1 is right why?
int main()
{
float a = 0.7, b = 0.5; // These are FLOATS
if(a < .7) // This is a DOUBLE
{
if(b < .5) // This is a DOUBLE
printf("2 are right");
else
printf("1 is right");
}
else
printf("0 are right");
}
Floats get promoted to doubles during comparison, and since floats are less precise than doubles, 0.7 as float is not the same as 0.7 as double. In this case, 0.7 as float becomes inferior to 0.7 as double when it gets promoted. And as Christian said, 0.5 being a power of 2 is always represented exactly, so the test works as expected: 0.5 < 0.5 is false.
So either:
Change float to double, or:
Change .7 and .5 to .7f and .5f,
and you will get the expected behavior.
The issue is that the constants you are comparing to are double not float. Also, changing your constants to something that is representable easily such as a factor of 5 will make it say 0 is right. For example,
main()
{
float a=0.25,b=0.5;
if(a<.25)
{
if(b<.5)
printf("2 are right");
else
printf("1 is right");
}
else
printf("0 are right");
}
Output:
0 are right
This SO question on Most Effective Way for float and double comparison covers this topic.
Also, this article at cygnus on floating point number comparison gives us some tips:
The IEEE float and double formats were designed so that the numbers
are “lexicographically ordered”, which – in the words of IEEE
architect William Kahan means “if two floating-point numbers in the
same format are ordered ( say x < y ), then they are ordered the same
way when their bits are reinterpreted as Sign-Magnitude integers.”
This means that if we take two floats in memory, interpret their bit
pattern as integers, and compare them, we can tell which is larger,
without doing a floating point comparison. In the C/C++ language this
comparison looks like this:
if (*(int*)&f1 < *(int*)&f2)
This charming syntax means take the address of f1, treat it as an
integer pointer, and dereference it. All those pointer operations look
expensive, but they basically all cancel out and just mean ‘treat f1
as an integer’. Since we apply the same syntax to f2 the whole line
means ‘compare f1 and f2, using their in-memory representations
interpreted as integers instead of floats’.
It's due to rounding issues while converting from float to double
Generally comparing equality with floats is a dangerous business (which is effectively what you're doing as you're comparing right on the boundary of > ), remember that in decimal certain fractions (like 1/3) cannot be expressed exactly, the same can be said of binary,
0.5= 0.1, will be the same in float or double.
0.7=0.10110011001100 etc forever, 0.7 cannot be exactly represented in binary, you get rounding errors and may be (very very slightly) different between float and double
Note that going between floats and doubles you cut off a different number of decimal places, hence your inconsistant results.
Also, btw, you have an error in your logic of 0 are right. You don't check b when you output 0 are right. But the whole thing is a little mysterious in what you are really trying to accomplish. Floating point comparisons between floats and doubles will have variations, minute, so you should compare with a delta 'acceptable' variation for your situation. I've always done this via inline functions that just perform the work (did it once with a macro, but thats too messy). Anyhow, yah, rounding issues abound with this type of example. Read the floating point stuff, and know that .7 is different than .7f and assigning .7 to a float will cast a double into a float, thus changing the exact nature of the value. But, the programming assumption about b being wrong since you checked a blared out to me, and I had to note that :)

Why does division(?) yield this number?

Rephrasing question :
The following code (Not C++ - written in an in-house scripting language)
if(A*B != 0.0)
{
D = (C/(A*B))*100.0;
}
else
{
D = 0.0;
}
yields a value of
90989373681853939930449659398190196007605312719045829137102976436641398782862768335320454041881784565022989668056715169480294533394160442876108458546952155914634268552157701346144299391656459840294022732906509880379702822420494744472135997630178480287638496793549447363202959411986592330337536848282003701760.000000
for D. We are 100% sure that A != 0.0. And we are almost 100% sure that B == 0.0. We never use such infinitesimally small values (close to 0.0 but not 0.0) such as the value of B that this value of C suggests. It is impossible that it acquired that value from our data. Can A*B yield anything that is not equal to 0.0 when B is 0?
The number you divided by was not in fact 0, just very, very close.
Assuming you are using IEEE floating point numbers it is not a good idea to use equal or not equal in this case with floating point numbers. Even if the same value like -0.0 and +0.0 they are not equal from a bitwise perspective which is what the equate does. Even if using other float formats, equal and not equal are discouraged.
Instead put some sort of range on it e=a*b; if ((e<0.0002)||(e>0.0002) then...
This looks like you are accruing error from previous calculations, so you divison is by a really small decimal, but not zero. You should add a margin of error if you want to catch something like this, psuedocode: if(num < margin_of_error) ret inf;, or use the epsilon method to be even safer