Can float values add to a sum of zero? [duplicate] - c++

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Most effective way for float and double comparison
I have two values (floats) that I am attempting to add together and average. The issue I have is that occasionally these values should add up to zero, in which case they do not need to be averaged.
The situation I am in specifically contains the values "-1" and "1", yet when added together I am given the value "-1.19209e-007" which is clearly not 0. Any information on this?

I'm sorry but this doesn't make sense to me.
Two floating point values that are exactly the same but with opposite sign will always produce 0 when added. This is how floating point operations work.
float a = 0.2f;
float b = -0.2f;
float f = (a + b) / 2;
printf("%f %d\n", f, f != 0); // will print out 0.000000 0
This will always be 0, even if the compiler doesn't optimize the code.
There is no rounding error to take into account if a and b have the same value but opposite sign! That is, if the highest bit (the sign bit) of a is 0, the highest bit of b is 1, and all other bits are the same, the result cannot be other than 0.
But if a and b are slightly different, of course, the result can be non-zero.
One possible solution to avoid this can be using a tolerance...
float f = (a + b) / 2;
if (fabsf(f) < 0.000001f) // fabsf, not abs: the C abs() takes an int
f = 0;
We are using a simple tolerance to see if our value is near to zero.
A nice example code to show this is...
int main(int argc, char *argv[])
{
for (int i = -10000000; i <= 10000000 * argc; ++i)
{
if (i != 0)
{
float a = 3.14159265f / i;
float b = -a + (argc - 1);
float f = (a + b) / 2;
if (f != 0)
printf("%f %f\n", a, f); // f is a float, so it needs %f rather than %d
}
}
printf("completed\n");
return 0;
}
I'm using "argc" here as a trick to force the compiler to not optimize out our code.

At least right off, this sounds like typical floating point imprecision.
The usual way to deal with it is to round your numbers to a sensible number of decimal places. In this case, your sum is -1.19209e-07 (i.e., -0.000000119209), so the average is about -5.96e-08. Rounded to (say) six or seven decimal places, that is zero.

Take the sum of all your numbers and divide by your count. Round off your answer to something reasonable before you do prints, reports, comparisons, or whatever you're doing.

Again, do some searching on this, but here is the basic explanation...
The computer represents floating point numbers in base 2 instead of base 10. This means that, for example, 0.2 (when converted to binary) is actually 0.001100110011... repeating forever. Since the computer cannot store these digits forever, it must approximate the value.
Because of these approximations, we lose "precision" in calculations, hence "single" and "double" precision floating point numbers. This is why you never test for a float to be exactly 0. Instead, you test whether its absolute value is below some threshold which you want to treat as zero.
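As a minimal sketch of the threshold idea described in these answers: the helper names `nearly_zero` and `average_snapped` and the default tolerance of 1e-6 are assumptions chosen for illustration, not something any answer above prescribes. Note that -1.19209e-07 is -2^-23, the float machine epsilon, which suggests that one of the two inputs was not exactly 1 in magnitude but a tiny step away from it.

```cpp
#include <cmath>

// Hypothetical helper: true when |value| is within `tolerance` of zero.
inline bool nearly_zero(float value, float tolerance = 1e-6f) {
    return std::fabs(value) <= tolerance;
}

// Average two floats, snapping a result that is within `tolerance`
// of zero to exactly 0.0f.
inline float average_snapped(float a, float b, float tolerance = 1e-6f) {
    const float mean = (a + b) / 2.0f;
    return nearly_zero(mean, tolerance) ? 0.0f : mean;
}
```

The right tolerance depends on the magnitude of the data; a fixed 1e-6 is only reasonable for values of roughly unit size.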

Related

Iterate through all possible floating-point values, starting from lowest

I am writing a unit test for a math function and I would like to be able to "walk" all possible floats/doubles.
Due to IEEE shenanigans, floating types cannot be incremented (++) at their extremities. See this question for more details. That answer states :
one can only add multiples of 2^(n-N)
But never mentions what little n is.
A solution to iterate all possible values from +0.0 to +infinity is given in this great blog post. The technique involves using a union with an int to walk the different values of a float. This works due to the following properties explained in the post, though they are only valid for positive numbers.
Adjacent floats have adjacent integer representations
Incrementing the integer representation of a float moves to the next representable float, moving away from zero
His solution for +0.0 to +infinity (0.f to std::numeric_limits<float>::max()) :
union Float_t {
int32_t RawExponent() const { return (i >> 23) & 0xFF; }
int32_t i;
float f;
};
Float_t allFloats;
allFloats.f = 0.0f;
while (allFloats.RawExponent() < 255) {
allFloats.i += 1;
}
Is there a solution for -infinity to +0.0 (std::numeric_limits<float>::lowest() to 0.f)?
I've tested std::nextafter and std::nexttoward and couldn't get them to work. Maybe this is an MSVC issue?
I would be ok with any sort of hack since this is a unit test. Thanks!
You can walk all 32-bit representations by iterating over all values of a 32-bit unsigned int. That way you really walk every representation, positive and negative, including both zeros (there are two) and all the not-a-number (NaN) representations. You may or may not want to filter out the NaN representations, or filter out only the signaling ones and leave the quiet ones in. This depends on your use case.
Example:
for (uint32_t i = 0;;)
{
float f;
// Type punning: Force the bit representation of i into f.
// Type punning is hard because mostly undefined in C/C++.
// Using memcpy() usually avoids any type punning warning.
memcpy(&f, &i, sizeof(f));
// Use f here.
// Warning: Using signaling NaNs may throw exceptions or raise signals.
i++;
if (i == 0)
break;
}
Instead you can also walk a 32-bit int from -2**31 to +(2**31-1). This makes no difference.
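For the negative range asked about in the question, the same bit-pattern idea can be sketched as follows. This assumes `float` is a 32-bit IEEE 754 binary32 value; the names `float_from_bits`, `bits_from_float`, and `walk_negative_floats` are made up for this example. For negative floats, decrementing the bit pattern moves toward zero.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: float is IEEE 754 binary32.
static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");

inline float float_from_bits(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);   // well-defined way to reinterpret bits
    return f;
}

inline std::uint32_t bits_from_float(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Visit every float from `most_negative` (a negative, non-NaN value)
// up to and including -0.0f, in ascending order.
template <typename Visit>
void walk_negative_floats(float most_negative, Visit visit) {
    const std::uint32_t negative_zero = 0x80000000u;
    for (std::uint32_t bits = bits_from_float(most_negative);; --bits) {
        visit(float_from_bits(bits));
        if (bits == negative_zero) break;
    }
}
```

Passing `std::numeric_limits<float>::lowest()` as `most_negative` covers the whole finite negative range, which is roughly 2.1 billion values.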
Pascal Cuoq correctly points out std::nextafter is the right solution. I had a problem elsewhere in my code. Sorry for the unnecessary question.
#include <cassert>
#include <cmath>
#include <limits>
float i = std::numeric_limits<float>::lowest();
float hi = std::numeric_limits<float>::max();
float new_i = std::nextafterf(i, hi);
assert(i != new_i);
double d = std::numeric_limits<double>::lowest();
double hi_d = std::numeric_limits<double>::max();
double new_d = std::nextafter(d, hi_d);
assert(d != new_d);
long double ld = std::numeric_limits<long double>::lowest();
long double hi_ld = std::numeric_limits<long double>::max();
long double new_ld = std::nextafterl(ld, hi_ld);
assert(ld != new_ld);
for (float d = std::numeric_limits<float>::lowest();
d < std::numeric_limits<float>::max();
d = std::nextafterf(
d, std::numeric_limits<float>::max())) {
// Wait a lifetime?
}
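The `std::nextafter` loop above can be wrapped in a small reusable helper. This is only a sketch: `walk_floats` and `walk_negative_range` are hypothetical names, and the code assumes neither bound is NaN and that subnormals are not flushed to zero.

```cpp
#include <cmath>
#include <limits>

// Visit every float from `from` to `to` inclusive, one representable
// step at a time. Works in either direction.
template <typename Visit>
void walk_floats(float from, float to, Visit visit) {
    for (float x = from;; x = std::nextafter(x, to)) {
        visit(x);
        if (x == to) break;   // -0.0f == 0.0f, so either zero ends the walk
    }
}

// Convenience wrapper for the range asked about in the question:
// lowest() up to 0.0f, about 2.1 billion values.
template <typename Visit>
void walk_negative_range(Visit visit) {
    walk_floats(std::numeric_limits<float>::lowest(), 0.0f, visit);
}
```

Calling a library function on every step is slower than a raw increment, but for a one-off unit test that is usually acceptable.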
Iterating through all the float values can be done with simple understanding of the floating-point representation:
The distance between consecutive subnormal values is the minimum normal times the “epsilon”. Simply iterate through all the subnormals using this distance as an increment.
The distance between the normal values at the lowest exponent is the same. Step through them with the same increment.
For each exponent, the distance increases according to the floating-point radix. Simply multiply the increment by the radix and step through all the values for the next exponent.
Repeat until infinity is reached.
Observe that the inner loop in the code below is simply:
for (; x < Limit; x += Increment)
Test(x);
This has the advantage that only normal floating-point arithmetic is used. The inner loop contains only one addition and one comparison (plus any tests you want to perform with each number). No library functions are called in the loop, no representations are dissected or copied to general registers or otherwise manipulated. There is nothing to impede performance.
This code steps through only the non-negative numbers. The negative numbers can be tested separately in the same way or can share this code by inserting a call Test(-x).
#include <limits>
static void Test(float x)
{
// Insert unit test for value x here.
}
int main(void)
{
typedef float T;
static const int Radix = std::numeric_limits<T>::radix;
static const T Infinity = std::numeric_limits<T>::infinity();
/* Increment is the current distance between floating-point numbers. We
start it at distance between subnormal numbers.
*/
T Increment =
std::numeric_limits<T>::min() * std::numeric_limits<T>::epsilon();
/* Limit is the next boundary where the distance between floating-point
numbers changes. We will increment up to that limit and then adjust
the limit and increment. We start it at the top of the first set of
normals, which allows the first loop to increment first through the
subnormals and then through the normals with the lowest exponent.
(These two sets have the same step size between adjacent values.)
*/
T Limit = std::numeric_limits<T>::min() * Radix;
/* Start with zero and continue until we reach infinity.
We execute an inner loop that iterates through all the significands of
one floating-point exponent. Each time it completes, we step up the
limit and increment.
*/
for (T x = 0; x < Infinity; Limit *= Radix, Increment *= Radix)
// Increment x through all the significands with the current exponent.
for (; x < Limit; x += Increment)
// Test with the current value of x.
Test(x);
// Also test infinity.
Test(Infinity);
}
(This code assumes the floating-point type has subnormals, and that they are not flushed to zero. The code can be readily adjusted to support these alternatives as well.)

In which segment a given number lies in? [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 6 years ago.
Suppose we have n (integer) contiguous segments of length l (floating point). That is:
Segment 0 = [0, l)
Segment 1 = [l, 2*l)
Segment 2 = [2*l, 3*l)
...
Segment (n-1) = [(n-1)*l, n*l)
Given a number x (floating point) I want to determine the id of the segment it lies inside.
My first idea is the following:
int segmentId = (int) floor(x/l);
Anyway, this sometimes does not work. For example, consider
double l = 1.1;
double x = 5.5;
int segmentId = (int) floor(x/l); //returns 5
double l = 1.1;
double x = 6.6;
int segmentId = (int) floor(x/l); //returns 5!!!
Of course, due to finite arithmetic, this does not work well.
Maybe some extra checks are required in order to have a robust implementation, but I really don't know how to proceed further.
The question is: how would you solve the problem "In which segment a given number lies in?"
Your problem is that neither 1.1 nor 6.6 is exactly representable in binary floating point. So when you type
double l = 1.1;
double x = 6.6;
you get 2 numbers stored in l and in x, which are slightly different than 1.1 and 6.6. After that, int segmentId = (int) floor(x/l); determines the correct segment for those slightly different numbers, but not for the original numbers.
You can solve this problem by using a decimal floating point data type instead of binary. You can check C++ decimal data types and Exact decimal datatype for C++? for the libraries, or implement the decimal data type yourself.
But the problem will still remain for numbers that are not representable in finite decimal floating point, such as 1/3 (repeating fraction), sqrt(2) (irrational), pi (transcendental), etc.
Just in case you don't specifically want an O(1) answer, you can go for the O(log n) answer by doing a binary search on the segments.
What precision does your solution require? There can always be a problem with values at the boundary of a given segment, because they are most likely unrepresentable.
I think adding a very small epsilon could help in this case. However, it may fail in other cases.
Check the segments again after the division.
bool inSegment(double x, double l, double segment)
{
    // Segment k covers [k*l, (k+1)*l), matching the numbering in the question.
    return (x >= l*segment) && (x < l*(segment+1));
}
int segmentId;
double segment = floor(x/l);
if (inSegment(x, l, segment-1))
    segmentId = (int)(segment - 1);
else if (inSegment(x, l, segment))
    segmentId = (int)segment;
else if (inSegment(x, l, segment+1))
    segmentId = (int)(segment + 1);
else
    printf("Something wrong happened\n");
Or use an epsilon and round the value up if the value is close enough to an integer above.
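A minimal sketch of that epsilon idea, assuming a relative tolerance of 1e-9 is acceptable for the data at hand. The function name `segment_id_snapped` and the tolerance are illustrative assumptions, not part of the answer.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: segment index of x for segments of length l,
// snapping quotients that lie within a relative tolerance of an integer.
inline int segment_id_snapped(double x, double l, double rel_tol = 1e-9) {
    const double q = x / l;
    const double nearest = std::round(q);
    const double scale = std::max(1.0, std::fabs(nearest));
    if (std::fabs(q - nearest) <= rel_tol * scale) {
        return static_cast<int>(nearest);   // treat q as exactly that integer
    }
    return static_cast<int>(std::floor(q));
}
```

The trade-off is that a value that is genuinely just below a boundary is also assigned to the upper segment, so the tolerance has to be smaller than any difference that matters to the application.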
how would you solve the problem "In which segment a given number lie in?"
You should divide the number by the segment length, then truncate the fractional part away. Like this:
int segmentId = (int) floor(x/l);
It seems that you have already figured this out.
Of course, due to finite arithmetic, this does not work well.
If the result of 6.6 / 1.1 happens to be 5.9999999999999991118215802998747676610946655273438, then 5 is in fact the correct segment for the result.
If you would like 6.6 / 1.1 to be exactly 6, then your problem is with finite precision division, which doesn't do what you want and with finite precision representation of floating point numbers that has no exact representation for all numbers. The segmentation itself worked perfectly.
I really don't know how to proceed further
Either don't use finite precision floating point (use fixed or arbitrary precision), or don't require the results of calculations to be exact.
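One way to read the fixed-precision suggestion is to convert the inputs to integer units and divide there. The sketch below assumes every input is meant to be a multiple of 0.1, which is an assumption about the asker's data rather than something stated in the question; `to_tenths` and `segment_id_fixed` are hypothetical names.

```cpp
#include <cmath>

// Assumption: all inputs are intended to be multiples of 0.1, so each
// one maps to an exact integer count of tenths.
inline long long to_tenths(double value) {
    return std::llround(value * 10.0);
}

// Segment index computed with integer division on the scaled values.
// Assumes x >= 0 and l > 0.
inline long long segment_id_fixed(double x, double l) {
    return to_tenths(x) / to_tenths(l);
}
```

With this scaling, 6.6 and 1.1 become 66 and 11, and the integer division yields 6 without any rounding question.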

Determining the number of decimal digits in a double - C++

I am trying to get the number of digits after a decimal point in a double. Currently, my code looks like this:
int num_of_decimal_digits = 0;
while (someDouble - someInt != 0)
{
someDouble = someDouble*10;
someInt = someDouble;
num_of_decimal_digits++;
}
Whenever I enter a decimal in for someDouble that is less than one, the loop gets stuck and repeats infinitely. Should I use static_cast? Any advice?
Due to floating-point rounding error, multiplying by 10 is not necessarily an exact decimal shift. You can test the absolute error of the difference rather than comparing it for exact equality with 0.
while (fabs(someDouble - someInt) > epsilon)
Or you can acknowledge that a double with a 53-bit mantissa can only represent log10(2^53) ≈ 15.95 decimal digits, and limit the loop to 16 iterations.
while (someDouble - someInt != 0 && num_of_decimal_digits < 16)
Or both.
while (fabs(someDouble - someInt) > epsilon && num_of_decimal_digits < 16)
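Putting the tolerance and the iteration cap together, a self-contained sketch might look like this. The name `count_decimal_digits`, the cap of 15, and the relative tolerance of 1e-9 are all assumptions for illustration; the tolerance also limits how many digits can be distinguished, so very long fractions are reported as shorter than they are.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: smallest n such that value * 10^n is an integer
// to within a relative tolerance. The loop is capped at max_digits, so
// it always terminates.
inline int count_decimal_digits(double value, int max_digits = 15,
                                double rel_tol = 1e-9) {
    const double magnitude = std::fabs(value);
    double scale = 1.0;   // 10^n, exactly representable for these small n
    for (int n = 0; n < max_digits; ++n, scale *= 10.0) {
        const double scaled = magnitude * scale;
        const double error = std::fabs(scaled - std::round(scaled));
        if (error <= rel_tol * std::max(1.0, scaled)) {
            return n;
        }
    }
    return max_digits;
}
```

Scaling the original value by a fresh power of ten each time, instead of repeatedly multiplying the running value by 10, keeps rounding errors from accumulating across iterations.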
The naive answer would be:
int num_of_decimal_digits = 0;
double absDouble = someDouble > 0 ? someDouble : someDouble * -1;
while (absDouble - someInt != 0)
{
absDouble = absDouble*10;
someInt = absDouble;
num_of_decimal_digits++;
}
This solves your problem of negative numbers.
However, this solution is likely not going to give you the output you desire in a lot of cases because of the way that floating point numbers are represented. For example 0.35 might really be represented as 0.3499999999998 the way floating point numbers are stored in binary. I would suggest that you share more background information about what you are hoping to accomplish with this code (your input and your desired output). There is likely a much better solution for what you are attempting to accomplish.

Can I trust a real-to-int conversion of the result of ceil()?

Suppose I have some code such as:
float a, b = ...; // both positive
int s1 = ceil(sqrt(a/b));
int s2 = ceil(sqrt(a/b)) + 0.1;
Is it ever possible that s1 != s2? My concern is when a/b is a perfect square. For example, perhaps a=100.0 and b=4.0, then the output of ceil should be 5.00000 but what if instead it is 4.99999?
Similar question: is there a chance that sqrt(100.0/4.0) evaluates to, say, 5.00001 and then ceil will round it up to 6.00000?
I'd prefer to do this in integer math but the sqrt kinda screws that plan.
EDIT: suggestions on how to better implement this would be appreciated too! The a and b values are integer values, so actual code is more like: ceil(sqrt(float(a)/b))
EDIT: Based on levis501's answer, I think I will do this:
float a, b = ...; // both positive
int s = sqrt(a/b);
while (s*s*b < a) ++s;
Thank you all!
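The integer-correction loop from the edit above can be sketched as a standalone function. The name `ceil_sqrt_ratio` is hypothetical, and the code assumes a >= 0, b > 0, and that s*s*b fits in a long long. It also corrects downward, in case the floating-point estimate comes out one too high.

```cpp
#include <cmath>

// Smallest non-negative integer s with s*s*b >= a.
inline long long ceil_sqrt_ratio(long long a, long long b) {
    // Floating-point estimate, possibly off by one in either direction.
    long long s = static_cast<long long>(
        std::sqrt(static_cast<double>(a) / static_cast<double>(b)));
    // Correct the estimate using exact integer arithmetic.
    while (s > 0 && (s - 1) * (s - 1) * b >= a) --s;
    while (s * s * b < a) ++s;
    return s;
}
```

Because the final answer is decided by integer comparisons only, the accuracy of sqrt affects how many correction steps run, not the result.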
I don't think it's possible. Regardless of the value of sqrt(a/b), what it produces is some value N that we use as:
int s1 = ceil(N);
int s2 = ceil(N) + 0.1;
Since ceil always produces an integer value (albeit represented as a double), we will always have some value X, for which the first produces X.0 and the second X.1. Conversion to int will always truncate that .1, so both will result in X.
It might seem like there would be an exception if X was so large that X.1 overflowed the range of double. I don't see where this could be possible though. Except close to 0 (where overflow isn't a concern) the square root of a number will always be smaller than the input number. Therefore, before ceil(N)+0.1 could overflow, the a/b being used as an input in sqrt(a/b) would have to have overflowed already.
You may want to write an explicit function for your case. e.g.:
/* return the smallest positive integer whose square is at least x */
int isqrt(double x) {
int y1 = ceil(sqrt(x));
int y2 = y1 - 1;
if ((y2 * y2) >= x) return y2;
return y1;
}
This will handle the odd case where the square root of your ratio a/b is within the precision of double.
Equality of floating point numbers is indeed an issue, but IMHO not if we deal with integer numbers.
If you have the case of 100.0/4.0, it should evaluate to exactly 25.0, as 25.0 is exactly representable as a float, as opposed to e.g. 25.1.
Yes, it's entirely possible that s1 != s2. Why is that a problem, though?
It seems natural enough that s1 != (s1 + 0.1).
BTW, if you would prefer to have 5.00001 rounded to 5.00000 instead of 6.00000, use rint instead of ceil.
And to answer the actual question (in your comment) - you can use sqrt to get a starting point and then just find the correct square using integer arithmetic.
int min_dimension_greater_than(int items, int buckets)
{
double target = double(items) / buckets;
int min_square = ceil(target);
int dim = floor(sqrt(target));
int square = dim * dim;
while (square < min_square) {
dim += 1;
square = dim * dim;
}
return dim;
}
And yes, this can be improved a lot, it's just a quick sketch.
s1 will always equal s2.
The C and C++ standards do not say much about the accuracy of math routines. Taken literally, it is impossible for the standard to be implemented, since the C standard says sqrt(x) returns the square root of x, but the square root of two cannot be exactly represented in floating point.
Implementing routines with good performance that always return a correctly rounded result (in round-to-nearest mode, this means the result is the representable floating-point number that is nearest to the exact result, with ties resolved in favor of a low zero bit) is a difficult research problem. Good math libraries target accuracy less than 1 ULP (so one of the two nearest representable numbers is returned), perhaps something slightly more than .5 ULP. (An ULP is the Unit of Least Precision, the value of the low bit given a particular value in the exponent field.) Some math libraries may be significantly worse than this. You would have to ask your vendor or check the documentation for more information.
So sqrt may be slightly off. If the exact square root is an integer (within the range in which integers are exactly representable in floating-point) and the library guarantees errors are less than 1 ULP, then the result of sqrt must be exactly correct, because any result other than the exact result is at least 1 ULP away.
Similarly, if the library guarantees errors are less than 1 ULP, then ceil must return the exact result, again because the exact result is representable and any other result would be at least 1 ULP away. Additionally, the nature of ceil is such that I would expect any reasonable math library to always return an integer, even if the rest of the library were not high quality.
As for overflow cases, if ceil(x) were beyond the range where all integers are exactly representable, then ceil(x)+.1 is closer to ceil(x) than it is to any other representable number, so the rounded result of adding .1 to ceil(x) should be ceil(x) in any system implementing the floating-point standard (IEEE 754). That is provided you are in the default rounding mode, which is round-to-nearest. It is possible to change the rounding mode to something like round-toward-infinity, which could cause ceil(x)+.1 to be an integer higher than ceil(x).

C++ Should this be easier?

long-time listener, first-time caller. I am relatively new to programming and was looking back at some of the code I wrote for an old lab. Is there an easier way to tell if a double is evenly divisible by an integer?
double num (//whatever);
int divisor (//an integer);
bool bananas = true;
if(floor(num)!= num || static_cast<int>(num)%divisor != 0) {
bananas=false;
}
if(bananas==true) {
//do stuff;
}
The question is strange, and the checks are as well. The problem is that it makes little sense to speak about divisibility of a floating point number, because floating point numbers are represented imprecisely in binary, and divisibility is about exactness.
I encourage you to read this article, by David Goldberg: What Every Computer Scientist Should Know About Floating Point Arithmetic. It is a bit long-winded, so you may appreciate this website, instead: The Floating-Point Guide.
The truth is that floor(num) == num is a strange piece of code.
num is a double
floor(num) returns a double, close to an int
The trouble is that this does not check what you really wanted. For example, suppose (for the sake of example) that 5 cannot be represented exactly as a double, therefore, instead of storing 5, the computer will store 4.999999999999.
double num = 5; // 4.999999999999999
double floored = floor(num); // 4.0
assert(num != floored);
In general exact comparisons are meaningless for floating point numbers, because of rounding errors.
If you insist on using floor, I suggest using floor(num + 0.5), which is better, though slightly biased. A better rounding method is Banker's rounding because it is unbiased, and the article references others if you wish. Note that Banker's rounding is what rint and nearbyint perform in the default rounding mode, whereas round rounds halfway cases away from zero.
As for your question, first you need a double-aware modulo, fmod; then you need to remember the "avoid exact comparisons" bit.
A first (naive) attempt:
// divisor is deemed non-zero
// epsilon is a constant
double mod = fmod(num, divisor); // divisor will be converted to a double
if (mod <= epsilon) { }
Unfortunately it fails one important test: the magnitude of mod depends on the magnitude of divisor, thus if divisor is smaller than epsilon to begin with, it will always be true.
A second attempt:
// divisor is deemed non-zero
double const epsilon = divisor / 1000.0;
double mod = fmod(num, divisor);
if (mod <= epsilon) { }
Better, but not quite there: mod and epsilon are signed! Yes, it's a bizarre modulo: the sign of mod is the sign of num.
A third attempt:
// divisor is deemed non-zero
double const eps = fabs(divisor / 1000.0);
double mod = fabs(fmod(num, divisor));
if (mod <= eps) { }
Much better.
Should work fairly well too if divisor comes from an integer, as there won't be precision issues... or at least not too much.
EDIT: fourth attempt, by @ybungalobill
The previous attempt does not deal well with situations where num/divisor errs on the wrong side. Take 1.999/1.000 --> 0.999: the remainder is nearly the divisor, so we should indicate divisibility, yet the check fails.
// divisor is deemed non-zero
double mod = fabs(fmod(num/divisor, 1));
if (mod <= 0.001 || fabs(1 - mod) <= 0.001) { }
Looks like a never ending task eh ?
There is still cause for troubles though.
double has a limited precision, that is a limited number of digits that is representable (16 I think ?). This precision might be insufficient to represent an integer:
Integer n = 12345678901234567890;
double d = n; // 1.234567890123457 * 10^20
This truncation means it is impossible to map it back to its original value. This should not cause any issue with double and int, for example on my platform double is 8 bytes and int is 4 bytes, so it would work, but changing double to float or int to long could violate this assumption, oh hell!
Are you sure you really need floating point, by the way ?
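The two-sided check from the fourth attempt can also be written with `std::remainder`, which is a different function from the `fmod` used above: it returns a value in [-|divisor|/2, |divisor|/2], so a near-miss on either side of a multiple gives a small magnitude. This is a sketch only; the name `is_near_multiple` and the default relative tolerance are assumptions.

```cpp
#include <cmath>

// Hypothetical helper: true when num is within rel_tol * |divisor| of an
// integer multiple of divisor.
inline bool is_near_multiple(double num, int divisor, double rel_tol = 1e-9) {
    if (divisor == 0 || !std::isfinite(num)) return false;
    const double d = static_cast<double>(divisor);
    return std::fabs(std::remainder(num, d)) <= rel_tol * std::fabs(d);
}
```

For example, `is_near_multiple(1.9999, 1, 0.001)` accepts a value just below a multiple, which is the case the third attempt missed.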
Based on the above comments, I believe you can do this...
double num (//whatever);
int divisor (//an integer);
if(fmod(num, divisor) == 0) {
//do stuff;
}
I haven't checked it but why not do this?
if (floor(num) == num && !(static_cast<int>(num) % divisor)) {
// do stuff...
}
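If an exact test is what is wanted, a sketch that avoids the cast to int may be safer, since `static_cast<int>(num)` is undefined when num is outside the range of int. The function name `is_exact_multiple` is hypothetical. The result of `std::fmod` is exactly representable, and a zero remainder against an integer divisor implies that num is itself an integer, so the separate floor check is not needed here.

```cpp
#include <cmath>

// True when num is exactly an integer multiple of divisor.
inline bool is_exact_multiple(double num, int divisor) {
    if (divisor == 0 || !std::isfinite(num)) return false;
    return std::fmod(num, static_cast<double>(divisor)) == 0.0;
}
```

This only answers the question for the value actually stored in the double; if num came out of earlier floating-point arithmetic, a tolerance-based check is usually the better fit.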