Strange result using long double in C++

Using long double arithmetic in C++, the number 50,000,056,019,485.52605438232421875 squared yields 2,500,005,601,951,690,788,240,883,712. Meanwhile, the number 50,000,056,019,485.526050567626953125 (which differs from the first number by less than 0.001) squared yields 2,500,005,601,951,690,787,704,012,800, which differs from the first square by over half a billion. Side by side:
50,000,056,019,485.526054382324218750 ^ 2 = 2,500,005,601,951,690,788,240,883,712
50,000,056,019,485.526050567626953125 ^ 2 = 2,500,005,601,951,690,787,704,012,800
The long double range on my machine goes from ~1e-300 to ~1e+300. I fully understand the inability to represent all the numbers in that range, but I didn't expect such a big difference. Could anybody shed some light on this?

Math
Given d small compared to a, the expected difference (a+d)*(a+d) - a*a is about 2*a*d (the d*d term is negligible).
Here a is about 50,000,056,000,000.0 and d is about 0.00000381, so 2*a*d is about 381,000,000.0. That is within a factor of two of the difference seen using long double: 536,870,912.0.
So "I didn't expect such a big difference" is off by less than a factor of 2. Given that the printed squares are rounded products, the difference between them is reasonable.
Some details:
long double
Given the following, with long double as the 10-byte (80-bit x87) format, a and b are consecutive long double values. They differ by 1 ULP, or about 1 part in 2^64. Both decimal constants are exactly representable and are stored exactly in a and b.
long double b = 50000056019485.526054382324218750L; // 0x1.6bcc5c9f50ec355cp+45
long double a = 50000056019485.526050567626953125L; // 0x1.6bcc5c9f50ec355ap+45
// difference
long double d = 0.000003814697265625L; // 0x0.0000000000000002p+45
The squares of a and b are not exactly representable as long double. Instead, the reported squares are the rounded values. They differ by 2 ULP.
long double bb = 2500005601951690788240883712.0L; // 0x1.027e98e7c774ede8cp+91
long double aa = 2500005601951690787704012800.0L; // 0x1.027e98e7c774ede88p+91
// difference 536870912.0 .... 0x0.00000000000000004p+91
When multiplying two floating-point numbers, the error in the rounded product is expected to be within ±0.5 ULP. Here 1 ULP of the squares is 268,435,456 (2^28), and each square is correct to within 0.5 ULP, which is consistent with the 2 ULP (536,870,912) difference seen above.
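For reference, here is a small program that reproduces the numbers above. It is a minimal sketch assuming the x87 80-bit long double; on platforms where long double is the 64-bit or 128-bit format, the printed values will differ.

#include <cstdio>

int main(void)
{
    long double b = 50000056019485.526054382324218750L;
    long double a = 50000056019485.526050567626953125L;
    // With 80-bit long double these print the rounded squares quoted above.
    std::printf("%.0Lf\n", b * b);          // 2500005601951690788240883712
    std::printf("%.0Lf\n", a * a);          // 2500005601951690787704012800
    std::printf("%.1Lf\n", b * b - a * a);  // 536870912.0
}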

Related

Find float a to closest multiple of float b

C++ Scenario: I have two variables of type double a and b.
Goal: a should be set to the closest multiple of b that is smaller than a.
First approach: Use fmod() or remainder() to get r. Then do a = a - r.
I know that, due to the representation of decimal numbers in memory, fmod() or remainder() can never guarantee 100% accuracy. In my tests I found that I cannot use fmod() at all, as the variance of its results is too unpredictable (at least as far as I understand). There are many questions and discussions out there about this phenomenon.
So is there something I could do to still use fmod()?
With “something” I mean some trick similar to checking whether a equals b by employing an epsilon value:
double EPSILON = 0.005;
if (std::abs(a - b) < EPSILON)
    std::cout << "equal" << '\n';
My second approach works but seems not to be very elegant. I am just subtracting b from a until there is nothing left to subtract:
#include <iostream>
#include <limits>

double findRemainder(double x, double y)
{
    double rest = 0;  // initialized: the loop below may not execute
    if (y > x)        // ensure x >= y
    {
        double temp = x;
        x = y;
        y = temp;
    }
    while (x > y)     // subtract y until nothing is left to subtract
    {
        rest = x - y;
        x = x - y;
    }
    return rest;
}

int main()
{
    typedef std::numeric_limits<double> dbl;
    std::cout.precision(dbl::max_digits10);
    double a = 13.78, b = 2.2, r = 0;
    r = findRemainder(a, b);
    std::cout << r << '\n';
    return 0;
}
Any suggestions for me?
Preamble
The problem is impossible, both as stated and as intended.
Remainders are exact
This statement is incorrect: “fmod() or remainder() can never guarantee 100% accuracy.” If the floating-point format supports subnormal numbers (as IEEE-754 does), then fmod(x, y) and remainder are both exact; they produce a result with no rounding error (barring bugs in their implementation). The remainder, as defined for either of them, is always less than y and not more than x in magnitude. Therefore, it is always in a portion of the floating-point format that is at least as fine as y and as x, so all the bits needed for the real-arithmetic remainder can be represented in the floating-point remainder. So a correct implementation will return the exact remainder.
Multiples may not be representable
For simplicity of illustration, I will use IEEE-754 binary32, the format commonly used for float. The issues are the same for other formats. In this format, all integers with magnitude up to 2^24 (16,777,216) are representable. After that, due to the scaling by the floating-point exponent, the representable values increase in steps of two: 16,777,218, 16,777,220, and so on. At 2^25 (33,554,432), they increase in steps of four: 33,554,436, 33,554,440. At 2^26 (67,108,864), they increase in steps of eight.
100,000,000 is representable, and so are 99,999,992 and 100,000,008. Now consider asking what multiple of 3 is the closest to 100,000,000. It is 99,999,999. But 99,999,999 is not representable in the binary32 format.
Thus, it is not always possible for a function to take two representable values, a and b, and return the greatest multiple of b that is less than a, using the same floating-point format. This is not because of any difficulty computing the multiple but simply because it is impossible to represent the true multiple in the floating-point format.
In fact, given the standard library, it is easy to compute the remainder; std::fmod(100000000.f, 3.f) is 1. But it is impossible to compute 100000000.f − 1 in the binary32 format.
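A quick check of those binary32 claims; a minimal sketch assuming IEEE-754 float:

#include <cmath>
#include <iostream>

int main()
{
    std::cout << std::fmod(100000000.f, 3.f) << '\n'; // 1: the remainder is exact
    std::cout.precision(9);
    // The exact result 99999999 is not representable in binary32,
    // so the subtraction rounds to the nearest float, 100000000.
    std::cout << 100000000.f - 1.f << '\n';           // 100000000
}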
The intended question is impossible
The examples shown, 13.78 for a and 2.2 for b, suggest the desire is to produce a multiple for some floating-point numbers a and b that are the results of converting decimal numerals a and b to the floating-point format. However, once such conversions are performed, the original numbers cannot be known from the results a and b.
To see this, consider values for a of either 99,999,997 or 100,000,002 while b is 10. The greatest multiple of 10 less than 99,999,997 is 99,999,990, and the greatest multiple of 10 less than 100,000,002 is 100,000,000.
When either 99,999,997 or 100,000,002 is converted to the binary32 format (using the common method, round-to-nearest-ties-to-even), the result for a is 100,000,000. Converting b of course yields 10 for b.
Then a function that computes the greatest multiple of b that is less than a can return only one result. Even if this function uses extended precision (say binary64), so that it can return either 99,999,990 or 100,000,000 even though 99,999,990 is not representable in binary32, it has no way to distinguish the cases. Whether the original a is 99,999,997 or 100,000,002, the a given to the function is 100,000,000, so there is no way for it to know the original a and no way for it to decide which result to return.
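That said, because the remainder itself is exact, subtracting it is the best one can do within the format. A minimal sketch (the helper name is mine; for positive a and b it returns the rounded value of the nearest lower multiple, which, per the above, may not be exactly representable):

#include <cmath>
#include <iostream>

// Largest multiple of b not exceeding a, as close as double can represent it.
// std::fmod is exact for positive inputs; only the final subtraction rounds.
double lower_multiple(double a, double b)
{
    return a - std::fmod(a, b);
}

int main()
{
    std::cout.precision(17);
    std::cout << lower_multiple(13.78, 2.2) << '\n'; // approximately 13.2
}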
Hmm, there really is a problem of definition, because most multiples of a floating-point number won't be representable exactly, except perhaps when the multiplier is a power of two.
Taking your example, in Smalltalk notation (which does not really matter; I use it just because I can evaluate and verify the expressions I propose), the exact fractional representations of the double-precision values 0.1 and 0.9 can be written:
(1+(1<<54)reciprocal) / 10 = 0.1.
(9+(1<<52)reciprocal) / 10 = 0.9.
<< is a bit shift, so 1<<54 is 2 raised to the power 54, and reciprocal gives its inverse, 2^-54.
As you can easily see:
(1+(1<<54)reciprocal) * 9 > (9+(1<<52)reciprocal)
That is, the exact multiple of 0.1 is greater than 0.9.
Thus, technically, the answer is 8*0.1 (which is exact in this lucky case)
(8+(1<<51)reciprocal) / 10 = 0.8.
What remainder does is give the EXACT remainder of the division, so it is related to the above computations.
You can try it: you will find something like -2.77555...e-17, or exactly (1<<55) reciprocal, negated. The negative sign indicates that 0.9 lies a bit below the nearest exact multiple.
However, if your problem is to find the greatest value <= 0.9 among the rounded-to-nearest multiples of 0.1, then your answer will be 0.9, because the rounded product 0.1*9 is exactly the double 0.9.
You have to first resolve that ambiguity. And if you are interested not in multiples of the double 0.1 but in multiples of the real number 1/10, then it's again a different matter...
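The same experiment translated to C++; a minimal sketch assuming IEEE-754 double:

#include <cmath>
#include <iostream>

int main()
{
    // The rounded product of the double 0.1 by 9 is exactly the double 0.9...
    std::cout << std::boolalpha << (0.1 * 9 == 0.9) << '\n'; // true
    // ...while the exact remainder shows the true multiple lies just above 0.9.
    std::cout.precision(17);
    std::cout << std::remainder(0.9, 0.1) << '\n'; // -2.7755575615628914e-17, i.e. -(2^-55)
}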

Whole and fraction to float/double?

If I have a signed long variable that holds the whole number part of a decimal and another long variable that holds the fraction part, how would I convert that to a float or double type?
The fraction part is scaled to the 9th place.
Example:
signed long h = -5;
long f = 200073490;
Result should be -5.20007349
Example 2:
signed long h = 3;
long f = 500100;
Result should be 3.0005001
Edit
Also: looking for a mathematical solution. Converting to string and scanning it back into float/double will not work in my project.
Since the long int f representing the fractional part is scaled by 1,000,000,000, you just need to divide it by 1000000000.0 and correct for the sign when the integer portion of the pair is negative. That is, you add the scaled fractional part when the base number is non-negative and subtract it when the base number is negative. Given that h is the integer portion and f is the fractional portion, an expression that combines them to produce a double is:
double result = h + (1 - 2 * (h < 0)) * f / 1000000000.0;
The expression (1 - 2 * (h < 0)) yields 1 when h is non-negative and -1 otherwise.
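A runnable sketch combining the two examples from the question (the function name combine is mine):

#include <iostream>
#include <limits>

// Combine a whole part h and a fractional part f scaled to the 9th place.
double combine(long h, long f)
{
    return h + (1 - 2 * (h < 0)) * f / 1000000000.0;
}

int main()
{
    std::cout.precision(std::numeric_limits<double>::max_digits10);
    std::cout << combine(-5, 200073490) << '\n'; // approximately -5.20007349
    std::cout << combine(3, 500100) << '\n';     // approximately 3.0005001
}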

C - Printing a float - loss of precision when casting to int [duplicate]

I'm trying to make a function that enables me to print floats.
Right now, I'm encountering two strange behaviors :
Sometimes, values like 1.3 come out as 1.2999999 instead of 1.3000000, and sometimes values like 1.234567 come out as 1.2345672 instead of 1.2345670.
Here's the source code :
int ft_putflt(float f)
{
    int ret;
    int intpart;
    int i;

    ret = 0;
    i = 0;
    intpart = (int)f;   /* integer part, truncated toward zero */
    ft_putnbr(intpart);
    ret = ft_nbrlen(intpart) + 8;
    write(1, ".", 1);
    while (i++ < 7)     /* emit 7 fractional digits */
    {
        f *= 10;
        ft_putchar(48 + ((int)f % 10));
    }
    return (ret);
}
ft_putnbr is OK AFAIK.
ft_putchar is a simple call to "write(1, &c, 1)".
test values (value : output)
1.234567 : 1.2345672 (!)
1.2345670 : 1.2345672 (!)
1.0000001 : 1.0000001 OK
0.1234567 : 0.1234567 OK
0.67 : 0.6700000 OK
1.3 : 1.3000000 OK (fixed it)
1.321012 : 1.3210119 (!)
1.3210121 : 1.3210122 (!)
This all seems a bit mysterious to me... Loss of precision when casting to int, maybe?
Yes, you lose precision when messing with floats and ints.
If both floats have differing magnitude and both are using the complete precision range (of about 7 decimal digits), then yes, you will see some loss in the last places, because floats are stored in the form (sign) (mantissa) × 2^(exponent). If two values have differing exponents and you add them, then the smaller value will get reduced to fewer digits in the mantissa (because it has to adapt to the larger exponent):
PS> [float]([float]0.0000001 + [float]1)
1
In relation to integers, a normal 32-bit integer can represent values exactly that do not fit exactly into a float. A float can still store approximately the same number, but no longer exactly. Of course, this only applies to numbers that are large enough, i.e. longer than 24 bits. Because a float has 24 bits of precision and (32-bit) integers have 32, a float will still be able to retain the magnitude and most of the significant digits, but the last places may likely differ:
PS> [float]2100000050 + [float]100
2100000100
This is inherent in the use of finite-precision numerical representation schemes. Given any number that can be represented, A, there is some number that is the smallest number greater than A that can be represented, call that B. Numbers between A and B cannot be represented exactly and must be approximated.
For example, let's consider using six decimal digits, because that's an easier system to understand as a starting point. If A is .333333, then B is .333334. Numbers between A and B, such as 1/3, cannot be exactly represented. So if you take 1/3 and add it to itself twice (or multiply it by 3), you will get .999999, not 1. You should expect to see imprecision at the limits of the representation.
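One way to see the stored value directly, independent of any digit-extraction loop, is to print the float with more digits than it can guarantee. A minimal sketch; the exact digits assume IEEE-754 binary32:

#include <iomanip>
#include <iostream>

int main()
{
    float f = 1.234567f;
    // The nearest binary32 value is approximately 1.234567046, not 1.234567000.
    std::cout << std::setprecision(10) << f << '\n';
}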

Why do I get two different outputs here?

The following two pieces of code produce two different outputs.
//this one gives incorrect output
cpp_dec_float_50 x = log(2);
std::cout << std::setprecision(std::numeric_limits<cpp_dec_float_50>::digits) << x << std::endl;
The output it gives is
0.69314718055994528622676398299518041312694549560547
which is only correct up to the 15th decimal place. Had x been a double, even then we'd have got the first 15 digits correct. It seems the result is losing precision, though I don't see why it should. cpp_dec_float_50 is supposed to have 50 digits of precision.
//this one gives correct output
cpp_dec_float_50 x = 2;
std::cout << std::setprecision(std::numeric_limits<cpp_dec_float_50>::digits) << log(x) << std::endl;
The output it gives is
0.69314718055994530941723212145817656807550013436026
which is correct according to WolframAlpha.
When you do log(2), you're using the implementation of log in the standard library, which takes a double and returns a double, so the computation is carried out to double precision.
Only after that's computed (to, as you noted, a mere 15 digits of precision) is the result converted to your 50-digit extended precision number.
When you do:
cpp_dec_float_50 x=2;
/* ... */ log(x);
You're passing an extended precision number to start with, so (apparently) an extended precision overload of log is being selected, so it computes the result to the 50 digit precision you (apparently) want.
This is really just a complex version of:
float a = 1 / 2;
Here, 1 / 2 is integer division because the parameters are integers. It's only converted to a float to be stored in a after the result is computed.
C++ rules for how to compute a result do not depend on what you do with that result. So the actual calculation of log(2) is the same whether you store it in an int, a float, or a cpp_dec_float_50.
Your second bit of code is the equivalent of:
float b = 1;
float c = 2;
float a = b / c;
Now, you're calling / on a float, so you get floating-point division. C++'s rules do take into account the types of arguments and parameters. That's complex enough, and trying to also take into account what you do with the result would make C++'s already overly-complex rules incomprehensible to mere mortals.
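Putting the answer into practice, the fix for the first snippet is to hand log an extended-precision argument to begin with. A minimal self-contained sketch assuming Boost.Multiprecision:

#include <boost/multiprecision/cpp_dec_float.hpp>
#include <iomanip>
#include <iostream>
#include <limits>

using boost::multiprecision::cpp_dec_float_50;

int main()
{
    cpp_dec_float_50 x = 2; // 2 converts exactly; log(x) selects the 50-digit overload
    std::cout << std::setprecision(std::numeric_limits<cpp_dec_float_50>::digits)
              << log(x) << std::endl;
}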

Can I trust a real-to-int conversion of the result of ceil()?

Suppose I have some code such as:
float a, b = ...; // both positive
int s1 = ceil(sqrt(a/b));
int s2 = ceil(sqrt(a/b)) + 0.1;
Is it ever possible that s1 != s2? My concern is when a/b is a perfect square. For example, perhaps a=100.0 and b=4.0, then the output of ceil should be 5.00000 but what if instead it is 4.99999?
Similar question: is there a chance that 100.0/4.0 evaluates to say 5.00001 and then ceil will round it up to 6.00000?
I'd prefer to do this in integer math but the sqrt kinda screws that plan.
EDIT: suggestions on how to better implement this would be appreciated too! The a and b values are integer values, so actual code is more like: ceil(sqrt(float(a)/b))
EDIT: Based on levis501's answer, I think I will do this:
float a, b = ...; // both positive
int s = sqrt(a/b);
while (s*s*b < a) ++s;
Thank you all!
I don't think it's possible. Regardless of the value of sqrt(a/b), what it produces is some value N that we use as:
int s1 = ceil(N);
int s2 = ceil(N) + 0.1;
Since ceil always produces an integer value (albeit represented as a double), we will always have some value X, for which the first produces X.0 and the second X.1. Conversion to int will always truncate that .1, so both will result in X.
It might seem like there would be an exception if X was so large that X.1 overflowed the range of double. I don't see where this could be possible though. Except close to 0 (where overflow isn't a concern) the square root of a number will always be smaller than the input number. Therefore, before ceil(N)+0.1 could overflow, the a/b being used as an input in sqrt(a/b) would have to have overflowed already.
You may want to write an explicit function for your case. e.g.:
/* return the smallest positive integer whose square is at least x */
int isqrt(double x)
{
    int y1 = ceil(sqrt(x));
    int y2 = y1 - 1;
    if ((y2 * y2) >= x) return y2;
    return y1;
}
This will handle the odd case where sqrt returns a value slightly above the exact integer square root of a/b, so that ceil rounds one too high.
Equality of floating point numbers is indeed an issue, but IMHO not if we deal with integer numbers.
If you have the case of 100.0/4.0, it should evaluate to exactly 25.0, as 25.0 is exactly representable as a float, as opposed to e.g. 25.1.
Yes, it's entirely possible that s1 != s2. Why is that a problem, though?
It seems natural enough that s1 != (s1 + 0.1).
BTW, if you would prefer to have 5.00001 rounded to 5.00000 instead of 6.00000, use rint instead of ceil.
And to answer the actual question (in your comment) - you can use sqrt to get a starting point and then just find the correct square using integer arithmetic.
int min_dimension_greater_than(int items, int buckets)
{
    double target = double(items) / buckets;
    int min_square = ceil(target);
    int dim = floor(sqrt(target));
    int square = dim * dim;
    while (square < min_square) {
        dim += 1;   // the original said "seed += 1", an apparent typo
        square = dim * dim;
    }
    return dim;
}
And yes, this can be improved a lot, it's just a quick sketch.
s1 will always equal s2.
The C and C++ standards do not say much about the accuracy of math routines. Taken literally, it is impossible for the standard to be implemented, since the C standard says sqrt(x) returns the square root of x, but the square root of two cannot be exactly represented in floating point.
Implementing routines with good performance that always return a correctly rounded result (in round-to-nearest mode, this means the result is the representable floating-point number that is nearest to the exact result, with ties resolved in favor of a low zero bit) is a difficult research problem. Good math libraries target accuracy less than 1 ULP (so one of the two nearest representable numbers is returned), perhaps something slightly more than .5 ULP. (An ULP is the Unit of Least Precision, the value of the low bit given a particular value in the exponent field.) Some math libraries may be significantly worse than this. You would have to ask your vendor or check the documentation for more information.
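As an aside, you can observe the ULP at a given value directly. A minimal sketch assuming IEEE-754 double:

#include <cmath>
#include <iostream>

int main()
{
    double x = 5.0;
    // Distance from 5.0 to the next representable double: 1 ULP, here 2^-50.
    std::cout << std::nextafter(x, INFINITY) - x << '\n'; // about 8.88178e-16
}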
So sqrt may be slightly off. If the exact square root is an integer (within the range in which integers are exactly representable in floating-point) and the library guarantees errors are less than 1 ULP, then the result of sqrt must be exactly correct, because any result other than the exact result is at least 1 ULP away.
Similarly, if the library guarantees errors are less than 1 ULP, then ceil must return the exact result, again because the exact result is representable and any other result would be at least 1 ULP away. Additionally, the nature of ceil is such that I would expect any reasonable math library to always return an integer, even if the rest of the library were not high quality.
As for overflow cases, if ceil(x) were beyond the range where all integers are exactly representable, then ceil(x)+.1 is closer to ceil(x) than it is to any other representable number, so the rounded result of adding .1 to ceil(x) should be ceil(x) in any system implementing the floating-point standard (IEEE 754). That is provided you are in the default rounding mode, which is round-to-nearest. It is possible to change the rounding mode to something like round-toward-infinity, which could cause ceil(x)+.1 to be an integer higher than ceil(x).
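Finally, a quick check of the perfect-square case from the question. A minimal sketch; it assumes IEEE-754 floats with a correctly rounded sqrt (which IEEE 754 requires) and the default round-to-nearest mode:

#include <cmath>
#include <iostream>

int main()
{
    float a = 100.0f, b = 4.0f;                  // a/b is exactly 25.0f
    int s1 = std::ceil(std::sqrt(a / b));        // sqrt(25.0f) is exactly 5.0f
    int s2 = std::ceil(std::sqrt(a / b)) + 0.1;  // 5.1 truncates back to 5
    std::cout << s1 << ' ' << s2 << '\n';        // prints "5 5"
}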