Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such that 0 <= v < 1/x.
It seems fairly obvious that I could write double v = DBL_EPSILON;, but, if x is big enough, could it happen that v ends up being bigger than the real value 1/x because 1/x is so small that it cannot be represented on my machine?
In other words, and more precisely, if I have a positive real number x and an object x1 of type double whose stored value represents x exactly, is it guaranteed that the value represented by DBL_EPSILON is less than the real number 1/x?
In case it is not guaranteed, how can I calculate the biggest value of type double that ensures that DBL_EPSILON is less than the real number 1/x?
I will assume double is IEEE 754 binary64.
If a floating-point number is representable in my machine, will its inverse be representable in my machine?
Not necessarily, for two reasons:
The inverse might not be a floating-point number.
For example, although 3 is a floating-point number, 1/3 is not.
The inverse might overflow.
For example, the inverse of 2^-1074 is 2^1074, which is not only larger than all finite floating-point numbers but more than halfway from the largest finite floating-point number, 0x1.fffffffffffffp+1023 = 2^1024 - 2^971, to what would be the next one after that, 2^1024, if the range of exponents were larger.
So the inverse of 2^-1074 is rounded to infinity.
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such that 0 <= v < 1/x.
The smallest such v is always zero.
If you restrict it to be nonzero, it will always be the smallest subnormal floating-point number, 0x1p-1074, or roughly 4.9406564584124654 × 10^-324, irrespective of x (unless x is infinite).
But perhaps you want the largest such v rather than the smallest such v.
The largest such v is always either fl(1/x) (that is, the floating-point number nearest to 1/x, which is what you get by writing 1/x in C), or the next floating-point number closer to zero (which you can get by writing nextafter(1/x, 0) in C): in the default rounding mode, the division operator always returns the nearest floating-point number to the true quotient, or one of the two nearest ones if there is a tie.
You can also get the largest such v by setting the rounding mode with fesetround(FE_DOWNWARD) or fesetround(FE_TOWARDZERO) and then just computing 1/x, although toolchain support for non-default rounding modes is spotty and mostly they serve to shake out bugs in ill-conditioned code rather than to give reliable rounding semantics.
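For instance, here is a minimal sketch (assuming IEEE 754 binary64 and the default round-to-nearest mode) that prints the two candidates just described:
#include <cmath>
#include <cstdio>
int main(void) {
    double x = 3.0;
    double q = 1.0 / x;                    // fl(1/x): nearest double to the true 1/3
    double below = std::nextafter(q, 0.0); // next double closer to zero
    // The largest double v with v < the real 1/x is one of these two.
    std::printf("%.17g\n%.17g\n", q, below);
}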
It seems fairly obvious that I could write double v = DBL_EPSILON;, but, if x is big enough, could it happen that v ends up being bigger than the real value 1/x because 1/x is so small that it cannot be represented on my machine?
1/x is never rounded to zero unless x is infinite or you have nonstandard flush-to-zero semantics enabled (so results which would ordinarily be subnormal are instead rounded to zero, such as when x is the largest finite floating-point number 0x1.fffffffffffffp+1023).
But flush-to-zero aside, there are many values of x for which 1/x and fl(1/x) = 1/x are smaller than DBL_EPSILON.
For example, if x = 0x1p+1000 (that is, 2^1000 ≈ 1.0715086071862673 × 10^301), then 1/x = fl(1/x) = 0x1p-1000 (that is, 2^-1000 ≈ 9.332636185032189 × 10^-302), which is far below DBL_EPSILON = 0x1p-52 (that is, 2^-52 ≈ 2.220446049250313 × 10^-16).
1/x in this case is a floating-point number, so the reciprocal is computed exactly in floating-point arithmetic; there is no rounding at all.
The largest floating-point number below 1/x in this case is 0x1.fffffffffffffp-1001, or 2^-1000 - 2^-1053.
DBL_EPSILON (2^-52) is not the smallest floating-point number (2^-1074), or even the smallest normal floating-point number (2^-1022).
Rather, DBL_EPSILON is the distance from 1 to the next larger floating-point number, 1 + 2^-52, sometimes written ulp(1) to indicate that it is the magnitude of the least significant digit, or unit in the last place, in the floating-point representation of 1.
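A quick sketch illustrating that identity (assuming IEEE 754 binary64):
#include <cassert>
#include <cfloat>
#include <cmath>
int main(void) {
    // DBL_EPSILON is ulp(1): the gap between 1.0 and the next larger double.
    assert(DBL_EPSILON == std::nextafter(1.0, 2.0) - 1.0);
    assert(1.0 + DBL_EPSILON > 1.0);  // so adding it to 1.0 changes the value
}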
In case it is not guaranteed, how can I calculate the biggest value of type double that ensures that DBL_EPSILON is less than the real number 1/x?
That would be 1/DBL_EPSILON - 1, or 2^52 - 1.
But what do you want this number for?
Why are you trying to use DBL_EPSILON here?
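If you do need that boundary, here is a quick check of it (a sketch, again assuming binary64):
#include <cfloat>
#include <cstdio>
int main(void) {
    double x = 1.0 / DBL_EPSILON - 1.0;  // 2^52 - 1, exactly representable
    std::printf("%d\n", 1.0 / x > DBL_EPSILON);          // 1: real 1/x still exceeds DBL_EPSILON
    std::printf("%d\n", 1.0 / (x + 1.0) > DBL_EPSILON);  // 0: 1/2^52 equals DBL_EPSILON exactly
}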
The inverse of positive infinity is, of course, smaller than any positive rational number. Beyond that, even the largest finite floating point number has a multiplicative inverse well above the smallest representable floating point number of equivalent width, thanks to denormal numbers.
If a floating-point number is representable in my machine, will its inverse be representable in my machine?
No. There is no specification that 1.0/DBL_MIN <= DBL_MAX and 1.0/DBL_MAX <= DBL_MIN must both be true. One of them is usually true. With subnormals, 1.0/subnormal is often > DBL_MAX.
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such that 0 <= v < 1/x.
This is satisfiable, as v can be zero, unless 1.0/x is itself zero for some large x like DBL_MAX. That is a possibility. With subnormals, that is rarely the case, as 1.0/DBL_MAX is representable as a value greater than 0.
DBL_EPSILON has little to do with the above. OP's issues depend more on DBL_MAX, DBL_MIN, and whether the double supports subnormals. Many FP encodings are roughly balanced, with 1/DBL_MAX somewhere near DBL_MIN, yet C does not require that symmetry.
No. Floating point numbers are balanced around 1.0 to minimize the effect of calculating inverses, but this balance is not exact. The midpoint value of the exponent field gives the same number of powers of two above and below 1.0, but the all-ones exponent value is reserved for infinities and then NaNs, while the all-zeros value is reserved for denormals (also called subnormals). Denormal values are not normalized (and some architectures don't even implement them), but where they are implemented they add as many extra powers of two as there are bits in the mantissa, in addition (though with lower precision) to the normalized ones, at the negative end of the exponent range. This means there is a set of numbers, quite close to zero, for which computing the inverse always gives infinity.
For doubles you have 52 more powers of two, or around 15 more powers of ten. For floats, this is around 7 more powers of ten.
But this also means that if you calculate the inverse of a large number you'll always get a number different from zero.
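A small demonstration of that last point (a sketch, assuming binary64 subnormals are not flushed to zero):
#include <cfloat>
#include <cstdio>
int main(void) {
    double inv = 1.0 / DBL_MAX;      // falls into the subnormal range
    std::printf("%a\n", inv);        // a tiny nonzero subnormal
    std::printf("%d\n", inv > 0.0);  // 1: the inverse of a finite number is not zero
}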
static_casting from a floating point type to an integer type simply truncates the fractional part of the number. For example static_cast<int>(13.9999999) yields 13.
Not all integers are representable as floating point numbers. For example, internally the closest float to 13,000,000 may be 12,999,999.999999.
In this hypothetical case, I'd expect to get an unexpected result from:
const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));
My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know the range over which I can trust static_cast<long long>(ceil(foo)).
For example, internally the closest float to 13,000,000 may be 12,999,999.999999.
That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M × b^e, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000 - x to be represented, where x is some positive value less than 1, e must be negative (because M × b^e for a non-negative e is an integer). If so, then M × b^0 is an integer larger than M × b^e, so it is larger than 13,000,000, and so 13,000,000 can be represented as M' × b^0, where M' is a positive integer less than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
Regarding your code:
auto test = 0LL;
const auto floater = 0.5F;
for(auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;
cout << test << endl;
When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.
8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule βround to nearest, ties to even.β The two nearest representable values are 8,388,609, and 8,388,610. Since they are equally far apart, the result was 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
On Visual Studio 2015 I got 8,388,609, which is a horrifyingly small safe range.
In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.
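A sketch of that boundary (assuming IEEE 754 binary32 float):
#include <cstdio>
int main(void) {
    float f = 16777216.0f;  // 2^24: up to here, consecutive integers are all floats
    std::printf("%d\n", f + 1.0f == f);            // 1: 16,777,217 is not representable
    std::printf("%d\n", f - 1.0f == 16777215.0f);  // 1: below 2^24 every integer is exact
}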
Floating point numbers are represented by 3 integers, c, b, and q, in the form c × b^q, where:
c is the mantissa (so for the number 12,999,999.999999, c would be 12,999,999,999,999)
q is the exponent (so for the number 12,999,999.999999, q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)
From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000 using a c of 1,300,000,000,000 and a q of -5.
This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2 the q functions as a shift left or right of the mantissa.
Next let's talk about a range here. Obviously a 32-bit floating point cannot represent all the integers represented by a 32-bit integer, as the floating point must also represent many much larger and smaller numbers. Since the exponent simply shifts the mantissa, a floating point number can always exactly represent every integer that can be represented by its mantissa. Given the traditional IEEE-754 binary base floating point numbers:
A 32-bit (float) has a 24-bit mantissa so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double depending upon implementation) has a 113-bit mantissa so it can represent all integers in the range [-10,384,593,717,069,655,257,060,992,658,440,191, 10,384,593,717,069,655,257,060,992,658,440,191]
C++ provides numeric_limits<T>::digits as a way of finding this number for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example a float's maximum mantissa could be found by:
(1LL << numeric_limits<float>::digits) - 1LL
Having thoroughly explained the mantissa, let's revisit the exponent section to talk about how a floating point is actually stored. Take 13,000,000.0 that could be represented as:
c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10
And so on. For the traditional binary format IEEE-754 requires:
The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers
To explain this in the more familiar base-10 if our mantissa has 14 decimal places, the implementation would look like this:
c = 13,000,000,000,000 so the MSB will be used in the represented number
q = 6. This is a little confusing; it's because of the bias introduced here. Logically q = -6, but the bias is set so that when q = 0 only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 will represent 1.3
b = 10. Again, the above rules are really only required for base-2, but I've shown them as they would apply to base-10 for the purpose of explanation
Translated back to base-2 this means that a q of numeric_limits<T>::digits - 1 has only zeros after the decimal place. ceil only has an effect if there is a fractional part of the number.
A final point of explanation here is the range over which ceil will have an effect. Once the exponent of a floating point number is larger than numeric_limits<T>::digits, continuing to increase it only introduces trailing zeros into the resulting number, so ceil stops having an effect when q is greater than or equal to numeric_limits<T>::digits - 2LL. And since we know the MSB of c will be used in the number, this means that c must be smaller than (1LL << (numeric_limits<T>::digits - 1LL)) - 1LL. Thus, for ceil to have an effect on the traditional binary IEEE-754 floating point (see the sketch after this list):
A 32-bit (float) must be smaller than 8,388,607
A 64-bit (double) must be smaller than 4,503,599,627,370,495
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,095
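A quick sketch of the float boundary, using values consistent with this answer and the rounding behavior discussed above (assuming IEEE 754 binary32):
#include <cmath>
#include <cstdio>
int main(void) {
    float a = 8388606.5f;  // representable: ulp is 0.5 just below 2^23
    float b = 8388609.5f;  // not representable: rounds to 8,388,610 before ceil runs
    std::printf("%.1f\n", std::ceil(a));  // 8388607.0: ceil still has an effect here
    std::printf("%.1f\n", std::ceil(b));  // 8388610.0: the rounding already happened
}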
I am writing a function in c++ that is supposed to find the largest single digit in the number passed (inputValue). For example, the answer for .345 is 5. However, after a while, the program is changing the inputValue to something along the lines of .3449 (and the largest digit is then set to 9). I have no idea why this is happening. Any help to resolve this problem would be greatly appreciated.
This is the function in my .hpp file
void LargeInput(const double inputValue)
//Function to find the largest value of the input
{
    int tempMax = 0,    //Value that the temporary max number is in loop
        digit = 0,      //Value of numbers after the decimal place
        test = 0,
        powerOten = 10; //Number multiplied by so that the next digit can be checked
    double number = inputValue; //A variable that can be changed in the function
    cout << "The number is still " << number << endl;
    for (int k = 1; k <= 6; k++)
    {
        test = (number*powerOten);
        cout << "test: " << test << endl;
        digit = test % 10;
        cout << (static_cast<int>(number*powerOten)) << endl;
        if (tempMax < digit)
            tempMax = digit;
        powerOten *= 10;
    }
    return;
}
You cannot represent real numbers (doubles) precisely in a computer - they need to be approximated. If you change your function to work on longs or ints there won't be any inaccuracies. That seems natural enough for the context of your question, you're just looking at the digits and not the number, so .345 can be 345 and get the same result.
Try this:
int get_largest_digit(int n) {
int largest = 0;
while (n > 0) {
int x = n % 10;
if (x > largest) largest = x;
n /= 10;
}
return largest;
}
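For the question's example you would scale the value into an integer yourself first, e.g. get_largest_digit(345) returns 5 for .345.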
This is because the fractional component of a binary floating point number is stored as a sum of terms of the form 1/2^n. As a result you can get values very close to what you want, but you can never achieve exact values like 1/3.
It's common to instead use integers and have a conversion (like 1000 = 1) so if you had the number 1333 you would do printf("%d.%d", 1333/1000, 1333 % 1000) to print out 1.333.
By the way, the first sentence is a simplification of how floating point numbers are actually represented. For more information check out: http://en.wikipedia.org/wiki/Floating_point#Representable_numbers.2C_conversion_and_rounding
This is how floating point numbers work, unfortunately. The core of the problem is that there are infinitely many real numbers. More specifically, there are infinitely many values between 0.1 and 0.2, and infinitely many values between 0.01 and 0.02. Computers, however, have a finite number of bits to represent a floating point number (64 bits for a double precision number). Therefore, most floating point numbers have to be approximated. After any floating point operation, the processor has to round the result to a value it can represent in 64 bits.
Another property of floating point numbers is that as numbers get bigger they get less and less precise. This is because the same 64 bits have to be able to represent very big numbers (1,000,000,000) and very small numbers (0.000,000,000,001). Therefore, the rounding error gets larger when working with bigger numbers.
The other issue here is that you are converting from floating point to integer. This introduces even more rounding error. It appears that when (0.345 * 10000) is converted to an integer, the result is closer to 3449 than 3450.
I suggest you don't convert your numbers to integers. Write your program in terms of floating point numbers. You can't use the modulus (%) operator on floating point numbers to get a value for digit. Instead use the fmod function from the C math library (<cmath>).
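A sketch of that fmod-based idea (the function name and the six-digit limit are my own choices, and note it still inherits the representation error discussed above, so a digit can still surface as its neighbor):
#include <cmath>
// Walk the first few fractional digits without one big integer conversion.
int largest_fraction_digit(double number, int digits) {
    int largest = 0;
    for (int k = 0; k < digits; ++k) {
        number *= 10.0;
        int digit = static_cast<int>(std::fmod(number, 10.0));
        if (digit > largest) largest = digit;
    }
    return largest;
}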
As other answers have indicated, binary floating-point is incapable of representing most decimal numbers exactly. Therefore, you must reconsider your problem statement. Some alternatives are:
The number is passed as a double (specifically, a 64-bit IEEE-754 binary floating-point value), and you wish to find the largest digit in the decimal representation of the exact value passed. In this case, the solution suggested by user millimoose will work (provided the asprintf or snprintf function used is of good quality, so that it does not incur rounding errors that prevent it from producing correctly rounded output).
The number is passed as a double but is intended to represent a number that is exactly representable as a decimal numeral with a known number of digits. In this case, the solution suggested by user millimoose again works, with the format specification altered to convert the double to decimal with the desired number of digits (e.g., instead of β%.64fβ, you could use β%.6fβ).
The function is changed to pass the number in another way, such as with decimal floating-point, as a scaled integer, or as a string containing a decimal numeral.
Once you have clarified the problem statement, it may be interesting to consider how to solve it with floating-point arithmetic, rather than calling library functions for formatted output. This is likely to have pedagogical value (and incidentally might produce a solution that is computationally more efficient than calling a library function).
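For reference, millimoose's formatted-output approach from the first alternative might look like this in C++ (a sketch; the function name, buffer size, and six-digit format are my own assumptions):
#include <cstdio>
int largest_digit_via_snprintf(double x) {
    char buf[512];
    std::snprintf(buf, sizeof buf, "%.6f", x);  // decimal rendering, 6 places
    int largest = 0;
    for (const char *p = buf; *p != '\0'; ++p)
        if (*p >= '0' && *p <= '9' && *p - '0' > largest)
            largest = *p - '0';
    return largest;
}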
For example, this blog says 0.005 is not exactly 0.005, but rounding that number yields the right result.
I have tried all kinds of rounding in C++ and it fails when rounding numbers to certain decimal places. For example, Round(x,y) rounds x to a multiple of y. So Round(37.785,0.01) should give you 37.79 and not 37.78.
I am reopening this question to ask the community for help. The problem is the imprecision of floating point numbers (37.785 is represented as 37.78499999999).
The question is how does Excel get around this problem?
The solution in this round() for float in C++ is incorrect for the above problem.
"Round(37.785,0.01) should give you 37.79 and not 37.78."
First off, there is no consensus that 37.79 rather than 37.78 is the "right" answer here. Tie-breakers are always a bit tough. While always rounding up in the case of a tie is a widely-used approach, it certainly is not the only approach.
Secondly, this isn't a tie-breaking situation. The numerical value in the IEEE binary64 floating point format is 37.784999999999997 (approximately). There are lots of ways to get a value of 37.784999999999997 besides a human typing in a value of 37.785 and happening to have it converted to that floating point representation. In most of these cases, the correct answer is 37.78 rather than 37.79.
Addendum
Consider the following Excel formulae:
=ROUND(37785/1000,2)
=ROUND(19810222/2^19+21474836/2^47,2)
Both cells will display the same value, 37.79. There is a legitimate argument over whether 37785/1000 should round to 37.78 or 37.79 with two place accuracy. How to deal with these corner cases is a bit arbitrary, and there is no consensus answer. There isn't even a consensus answer inside Microsoft: "the Round() function is not implemented in a consistent fashion among different Microsoft products for historical reasons." ( http://support.microsoft.com/kb/196652 ) Given an infinite precision machine, Microsoft's VBA would round 37.785 to 37.78 (banker's round) while Excel would yield 37.79 (symmetric arithmetic round).
There is no argument over the rounding of the latter formula. It is strictly less than 37.785, so it should round to 37.78, not 37.79. Yet Excel rounds it up. Why?
The reason has to do with how real numbers are represented in a computer. Microsoft, like many others, uses the IEEE 64 bit floating point format. The number 37785/1000 suffers from precision loss when expressed in this format. This precision loss does not occur with 19810222/2^19+21474836/2^47; it is an "exact number".
I intentionally constructed that exact number to have the same floating point representation as does the inexact 37785/1000. That Excel rounds this exact value up rather than down is the key to determining how Excel's ROUND() function works: It is a variant of symmetric arithmetic rounding. It rounds based on a comparison to the floating point representation of the corner case.
The algorithm in C++:
#include <cmath> // std::floor
// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
double result = 1.0;
double base = 10.0;
while (exponent > 0) {
if ((exponent & 1) != 0) result *= base;
exponent >>= 1;
base *= base;
}
return result;
}
// Round the same way Excel does.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double excel_round (double x, int nplaces) {
bool is_neg = false;
// Excel uses symmetric arithmetic round: Round away from zero.
// The algorithm will be easier if we only deal with positive numbers.
if (x < 0.0) {
is_neg = true;
x = -x;
}
// Construct the nearest rounded values and the nasty corner case.
// Note: We really do not want an optimizing compiler to put the corner
// case in an extended double precision register. Hence the volatile.
double round_down, round_up;
volatile double corner_case;
if (nplaces < 0) {
double scale = pow10 (-nplaces);
round_down = std::floor (x * scale);
corner_case = (round_down + 0.5) / scale;
round_up = (round_down + 1.0) / scale;
round_down /= scale;
}
else {
double scale = pow10 (nplaces);
round_down = std::floor (x / scale);
corner_case = (round_down + 0.5) * scale;
round_up = (round_down + 1.0) * scale;
round_down *= scale;
}
// Round by comparing to the corner case.
x = (x < corner_case) ? round_down : round_up;
// Correct the sign if needed.
if (is_neg) x = -x;
return x;
}
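A note on the sign convention in this code as printed: the nplaces < 0 branch is the one that multiplies by the scale, so it is the branch that rounds to digits after the decimal point. Rounding 37.785 to two decimal places is therefore excel_round(37.785, -2), which returns 37.79, matching the addendum's =ROUND(37.785,2) example.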
For very accurate arbitrary precision and rounding of floating point numbers to a fixed set of decimal places, you should take a look at a math library like GNU MPFR. While it's a C-library, the web-page I posted also links to a couple different C++ bindings if you want to avoid using C.
You may also want to read a paper entitled "What every computer scientist should know about floating point arithmetic" by David Goldberg at the Xerox Palo Alto Research Center. It's an excellent article demonstrating the underlying process that allows floating point numbers to be approximated in a computer that represents everything in binary data, and how rounding errors and other problems can creep up in FPU-based floating point math.
I don't know how Excel does it, but printing floating point numbers nicely is a hard problem: http://www.serpentine.com/blog/2011/06/29/here-be-dragons-advances-in-problems-you-didnt-even-know-you-had/
So your actual question seems to be, how to get correctly rounded floating point -> string conversions. By googling for those terms you'll get a bunch of articles, but if you're interested in something to use, most platforms provide reasonably competent implementations of sprintf()/snprintf(). So just use those, and if you find bugs, file a report to the vendor.
A function that takes a floating point number as argument and returns another floating point number, rounded exactly to a given number of decimal digits, cannot be written, because there are many numbers with a finite decimal representation that have an infinite binary representation; one of the simplest examples is 0.1.
To achieve what you want you must accept to use a different type as a result of your rounding function. If your immediate need is printing the number you can use a string and a formatting function: the problem becomes how to obtain exactly the formatting you expect. Otherwise if you need to store this number in order to perform exact calculations on it, for instance if you are doing accounting, you need a library that's capable of representing decimal numbers exactly. In this case the most common approach is to use a scaled representation: an integer for the value together with the number of decimal digits. Dividing the value by ten raised to the scale gives you the original number.
If any of these approaches is suitable, I'll try and expand my answer with practical suggestions.
Excel rounds numbers like this "correctly" by doing WORK. They started in 1985, with a fairly "normal" set of floating-point routines, and added some scaled-integer fake floating point, and they've been tuning those things and adding special cases ever since. The app DID used to have most of the same "obvious" bugs that everybody else did, it's just that it mostly had them a long time ago. I filed a couple myself, back when I was doing tech support for them in the early 90s.
I believe the following C# code rounds numbers as they are rounded in Excel. To exactly replicate the behavior in C++ you might need to use a special decimal type.
In plain English, the double-precision number is converted to a decimal and then rounded to fifteen significant digits (not to be confused with fifteen decimal places). The result is rounded a second time to the specified number of decimal places.
That might seem weird, but what you have to understand is that Excel always displays numbers that are rounded to 15 significant figures. If the ROUND() function weren't using that display value as a starting point, and used the internal double representation instead, then there would be cases where ROUND(A1,N) did not seem to correspond to the actual value in A1. That would be very confusing to a non-technical user.
The double which is closest to 37.785 has an exact decimal value of 37.784999999999996589394868351519107818603515625. (Any double can be represented precisely by a finite base ten decimal because one quarter, one eighth, one sixteenth, and so forth all have finite decimal expansions.) If that number were rounded directly to two decimal places, there would be no tie to break and the result would be 37.78. If you round to 15 significant figures first you get 37.7850000000000. If this is further rounded to two decimal places, then you get 37.79, so there is no real mystery after all.
// Convert to a floating decimal point number, round to fifteen
// significant digits, and then round to the number of places
// indicated.
static decimal SmartRoundDouble(double input, int places)
{
int numLeadingDigits = (int)Math.Log10(Math.Abs(input)) + 1;
decimal inputDec = GetAccurateDecimal(input);
inputDec = MoveDecimalPointRight(inputDec, -numLeadingDigits);
decimal round1 = Math.Round(inputDec, 15);
round1 = MoveDecimalPointRight(round1, numLeadingDigits);
decimal round2 = Math.Round(round1, places, MidpointRounding.AwayFromZero);
return round2;
}
static decimal MoveDecimalPointRight(decimal d, int n)
{
if (n > 0)
for (int i = 0; i < n; i++)
d *= 10.0m;
else
for (int i = 0; i > n; i--)
d /= 10.0m;
return d;
}
// The constructor for decimal that accepts a double does
// some rounding by default. This gets a more exact number.
static decimal GetAccurateDecimal(double r)
{
string accurateStr = r.ToString("G17", CultureInfo.InvariantCulture);
return Decimal.Parse(accurateStr, CultureInfo.InvariantCulture);
}
What you NEED is this :
double f = 22.0/7.0;
cout.setf(ios::fixed, ios::floatfield);
cout.precision(6);
cout<<f<<endl;
How it can be implemented (just an overview for rounding the last digit):
long getRoundedPrec(double d, int precision = 9)
{
    // Scale the fractional part by 10^(precision+1), keeping one extra
    // guard digit to decide the rounding. Requires <cmath> for pow.
    long l = (d - (double)(long)d) * pow(10.0, precision + 1);
    int lastDigit = l % 10;  // the guard digit
    l /= 10;                 // drop the guard digit
    if (lastDigit >= 5)
        l += 1;              // round the last kept digit up
    return l;
}
Just as base-10 numbers must be rounded as they are converted to base-2, it is possible to round a number as it is converted from base-2 to base-10. Once the number has a base-10 representation it can be rounded again in a straightforward manner by looking at the digit to the right of the one you wish to round.
While there's nothing wrong with the above assertion, there's a much more pragmatic solution. The problem is that the binary representation tries to get as close as possible to the decimal number, even if that binary is less than the decimal. The amount of error is within [-0.5,0.5] least significant bits (LSB) of the true value. For rounding purposes you'd rather it be within [0,1] LSB so that the error is always positive, but that's not possible without changing all the rules of floating point math.
The one thing you can do is add 1 LSB to the value, so the error is within [0.5,1.5] LSB of the true value. This is less accurate overall, but only by a very tiny amount; when the value is rounded for representation as a decimal number it is much more likely to be rounded to a proper decimal number because the error is always positive.
To add 1 LSB to the value before rounding it, see the answers to this question. For example in Visual Studio C++ 2010 the procedure would be:
Round(_nextafter(37.785,37.785*1.1),0.01);
There are many ways to optimize the result of a floating-point value using statistical, numerical... algorithms
The easiest one is probably searching for repetitive 9s or 0s in the range of precision. If there are any, maybe those 9s are redundant, just round them up. But this may not work in many cases. Here's an example for a float with 6 digits of precision:
2.67899999 → 2.679
12.3499999 → 12.35
1.20000001 → 1.2
Excel always limits the input range to 15 digits and rounds the output to a maximum of 15 digits, so this might be one of the ways Excel does it.
Or you can include the precision along with the number. After each step, adjust the accuracy depending on the precision of the operands. For example
1.113 → 3 decimal digits
6.15634 → 5 decimal digits
Since both numbers are inside the double's 16-17 digit precision range, their sum will be accurate to the larger of them, which is 5 digits. Similarly, 3+5 < 16, so their product will be precise to 8 decimal digits
1.113 + 6.15634 = 7.26934 → 5 decimal digits
1.113 * 6.15634 = 6.85200642 → 8 decimal digits
But 4.1341677841 * 2.251457145 will only get double's accuracy because the real result exceeds double's precision
Another efficient algorithm is Grisu but I haven't had an opportunity to try.
In 2010, Florian Loitsch published a wonderful paper in PLDI, "Printing floating-point numbers quickly and accurately with integers", which represents the biggest step in this field in 20 years: he mostly figured out how to use machine integers to perform accurate rendering! Why do I say "mostly"? Because although Loitsch's "Grisu3" algorithm is very fast, it gives up on about 0.5% of numbers, in which case you have to fall back to Dragon4 or a derivative
Here be dragons: advances in problems you didn't even know you had
In fact I think Excel must combine many different methods to achieve the best result of all
Example When a Value Reaches Zero
In Excel 95 or earlier, enter the following into a new workbook:
A1: =1.333+1.225-1.333-1.225
Right-click cell A1, and then click Format Cells. On the Number tab, click Scientific under Category. Set the Decimal places to 15.
Rather than displaying 0, Excel 95 displays -2.22044604925031E-16.
Excel 97, however, introduced an optimization that attempts to correct for this problem. Should an addition or subtraction operation result in a value at or very close to zero, Excel 97 and later will compensate for any error introduced as a result of converting an operand to and from binary. The example above when performed in Excel 97 and later correctly displays 0 or 0.000000000000000E+00 in scientific notation.
Floating-point arithmetic may give inaccurate results in Excel
As mjfgates says, Excel does hard work to get this "right". The first thing to do when you try to reimplement this, is define what you mean by "right". Obvious solutions:
implement rational arithmetic
Slow but reliable.
implement a bunch of heuristics
Fast but tricky to get right (think "years of bug reports").
It really depends on your application.
Most decimal fractions can't be accurately represented in binary.
#include <cstdio>
int main(void)
{
    double x = 0.0;
    for (int i = 1; i <= 10; i++)
    {
        x += 0.1;
    }
    // x should now be 1.0, right?
    //
    // it isn't. Test it and see.
    std::printf("%.17g\n", x);  // typically prints 0.99999999999999989
}
One solution is to use BCD. It's old. But, it's also tried and true. We have a lot of other old ideas that we use every day (like using a 0 to represent nothing...).
Another technique uses scaling upon input/output. This has the advantage of nearly all math being integer math.
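A tiny sketch of that scaling technique (the two-decimal scale of 100 is my own choice here):
#include <cstdio>
int main(void) {
    // Scaled integers: 100 represents 1.00, so all arithmetic is exact.
    long long a = 133;      // 1.33
    long long b = 67;       // 0.67
    long long sum = a + b;  // 200, i.e. exactly 2.00
    std::printf("%lld.%02lld\n", sum / 100, sum % 100);
}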
As part of a numerical library test I need to choose base 10 decimal numbers that can be represented exactly in base 2. How do you detect in C++ if a base 10 decimal number can be represented exactly in base 2?
My first guess is as follows:
bool canBeRepresentedInBase2(const double &pNumberInBase10)
{
//check if a number in base 10 can be represented exactly in base 2
//reference: http://en.wikipedia.org/wiki/Binary_numeral_system
bool funcResult = false;
int nbOfDoublings = 16*3;
double doubledNumber = pNumberInBase10;
for (int i = 0; i < nbOfDoublings ; i++)
{
doubledNumber = 2*doubledNumber;
double intPart;
double fracPart = modf(doubledNumber/2, &intPart);
if (fracPart == 0) //number can be represented exactly in base 2
{
funcResult = true;
break;
}
}
return funcResult;
}
I tested this function with the following values: -1.0/4.0, 0.0, 0.1, 0.2, 0.205, 1.0/3.0, 7.0/8.0, 1.0, 256.0/255.0, 1.02, 99.005. It returns true for -1.0/4.0, 0.0, 7.0/8.0, 1.0, 99.005 which is correct.
Any better ideas?
I think what you are looking for is a number which has a fractional portion which is the sum of a sequence of negative powers of 2 (aka: 1 over a power of 2). I believe this should always be able to be represented exactly in IEEE floats/doubles.
For example:
0.375 = (1/4 + 1/8) which should have an exact representation.
If you want to generate these, you could try something like this:
#include <iostream>
#include <cstdlib>
#include <ctime>
int main() {
    srand(time(0));
    double value = 0.0;
    for(int i = 1; i < 256; i *= 2) {
        // doesn't matter, some random probability of including this
        // fraction in our sequence..
        if((rand() % 3) == 0) {
            value += (1.0 / static_cast<double>(i));
        }
    }
    std::cout << value << std::endl;
}
EDIT: I believe your function has a broken interface. It would be better if you had this:
bool canBeRepresentedExactly(int numerator, int denominator);
because not all fractions have exact representations, but the moment you shove it into a double, you've chosen a representation in binary... defeating the purpose of the test.
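A sketch of that suggested interface (assuming a positive denominator; std::gcd is C++17): a fraction has a finite binary expansion exactly when its lowest-terms denominator is a power of two.
#include <numeric>  // std::gcd
bool canBeRepresentedExactly(int numerator, int denominator) {
    int d = denominator / std::gcd(numerator, denominator);  // reduce to lowest terms
    return (d & (d - 1)) == 0;  // true iff d is a power of two
}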
If you're checking to see if it's binary, it will always return true. If your method takes a double as the parameter, the number is already represented in binary (double is a binary type, usually 64 bits). Looking at your code, I think you're actually trying to see if it can be represented exactly as an integer, in which case why can't you just cast to int, then back to double, and compare to the original? Any integer stored in a double that's within the range representable by an int should be exact, IIRC, because a 64 bit double has 53 bits of mantissa (and I'm assuming a 32 bit int). That means if they're equal, it's an integer.
If you're passing in a double, then by definition it has already been represented in binary, and if the original value wasn't exactly representable, you've already lost accuracy.
Maybe try passing in numerator and denominator of the fraction to the function. Then you have not lost accuracy and can check to see if you can come up with a binary representation of the answer that is the same as the fraction you've passed in.
As rmeador has pointed out, it might not be a good idea to accept the double, because the number has been converted to a double, a possible approximation of the number that you're trying to check.
So, in a very abstract way, you should split your check into integers and decimals. Integers must not be so large that the mantissa cannot express them all (e.g., 9007199254740993 is not represented properly by a 64-bit fp).
Decimal points may be a bit easier, mentally: if the fractional part (e.g. yyy in xxx.yyy), written as a fraction, has a denominator containing any factor other than 2, the floating point expansion repeats in trying to represent it. It's the reason why 1/3 cannot be represented with finite digits in base 10 = base (2*5)... See Recurring Decimal.
EDIT: As the comments pointed out, saying that the fraction must be built only from factors of 1/2 would be the mathematically correct way to put it...
As others have mentioned, your method doesn't do what you mean, since you pass a number represented as a (binary) double. The method actually detects whether the number you passed is of the form integer/2^48. This should fail for numbers like (1+2^-50), which is binary, and 259/255, which isn't.
If you really want to test a number for being exactly representable by finite binary string, you have to pass a number in an exact form.
You can't pass IN a Double because it's already lost precision. You should be able to use the toString() method of Double to check for this. (example in Java)
public static Boolean canBeRepresentedInBase2(String thenumber)
{
    // Returns true if the parsed Double did not lose precision.
    // Only works for numbers that are not converted into scientific notation by toString.
    return thenumber.equals(Double.valueOf(thenumber).toString());
}
You asked for C++ but maybe this algorithm will help. I use "EE" to mean "exactly expressible as a float."
Start with a decimal representation of the number you want to test. Remove any trailing zeroes (that is, 0.123450000 becomes 0.12345).
1) If the number is not an integer, check to see if the rightmost digit is 5. If it's not, then stop -- the number is not EE.
2) Multiply the number by 2. If the result is an integer, then stop -- the number is EE. Otherwise, go back to step 1.
I don't have rigorous proof for this but a "warm fuzzy." Fire up Calculator and enter your favorite fractional power of 2, like 0.0000152587890625. Add it to itself a few dozen times (I just hit "+" once then "=" a bunch of times). If there are any non-zero digits to the right of the decimal point, the last digit is always 5.
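The same test can be restated numerically (a sketch, assuming the decimal is given exactly as c/10^k with trailing zeroes already removed): each multiply-by-2 step clears one factor of 5 from the denominator, so the number is EE exactly when 5^k divides c.
#include <cstdint>
bool exactly_expressible(std::int64_t c, int k) {
    for (; k > 0; --k) {
        if (c % 5 != 0) return false;  // a factor of 5 survives: not EE
        c /= 5;
    }
    return true;
}
For example, 0.375 is c = 375, k = 3, and 375 = 3 × 5^3, so it is EE; 0.1 is c = 1, k = 1, and is not.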
Here is the code in C# and it works. Because it works with decimal data, there are no inherent rounding errors of the kind that show up in the original code, which uses double. (decimal in C# stores values in base 10 instead of base 2, which is what double uses.)
static bool canBeRepresentedInBase2(decimal pNumberInBase10)
{
//check if a number in base 10 can be represented exactly in base 2
//reference: http://en.wikipedia.org/wiki/Binary_numeral_system
bool funcResult = false;
int nbOfDoublings = 16*3;
decimal doubledNumber = pNumberInBase10;
for (int i = 0; i < nbOfDoublings ; i++)
{
doubledNumber = 2*doubledNumber;
decimal intPart;
decimal fracPart = ModF(doubledNumber/2, out intPart);
if (fracPart == 0) //number can be represented exactly in base 2
{
funcResult = true;
break;
}
}
return funcResult;
}
static decimal ModF(decimal number, out decimal intPart)
{
intPart = Math.Floor(number);
decimal fractional = number - (intPart);
return fractional;
}
Tested with the following code (where WL does a Console.WritelLine - SnippetCompiler)
WL(canBeRepresentedInBase2(-1.0M/4.0M)); //true
WL(canBeRepresentedInBase2(0.0M)); //true
WL(canBeRepresentedInBase2(0.1M)); //false
WL(canBeRepresentedInBase2(0.2M)); //false
WL(canBeRepresentedInBase2(0.205M)); //false
WL(canBeRepresentedInBase2(1.0M/3.0M)); //false
WL(canBeRepresentedInBase2(7.0M/8.0M)); //true
WL(canBeRepresentedInBase2(1.0M)); //true
WL(canBeRepresentedInBase2(256.0M/255.0M)); //false
WL(canBeRepresentedInBase2(1.02M)); //false
WL(canBeRepresentedInBase2(99.005M)); //false
WL(canBeRepresentedInBase2(2.53M)); //false
Or even easier:
return pNumber == floor(pNumber);
On the other hand, if you have some weird fractional representation (numerator denominator pair, or string with a decimal in it, or something), and you really do want to know if the value can be exactly represented as a double, it's a bit harder.
But you would need a different parameter(s) for that...
Given a number r it can be represented exactly with finite precision in base 2 iff r can be written as r = m/2^n, where m, n are integers, and n >= 0.
For example 1/7 doesn't have a finite binary expression, also 1/6 and 1/10 can't be written with a finite expression in base 2.
But 1/4 + 1/32 + 1/1024 has a finite expression in base 2.
PS: In general you can express a number r with finitely many digits in a base b iff r = m/b^n, where m, n are integers and n >= 0.
PPS: As almost everybody has stated previously, using a double as input is a bad idea, because you are losing precision, and you will end up with a different number.
I don't think this is what he's asking... I think he's looking for a solution that will tell him if a number can be represented EXACTLY in binary form. For example, 33.3 is a number that cannot be represented exactly in binary, because it goes on forever, so depending on your FPU settings it will be represented as something like "33.333333333333336". So it looks like his method will do the job. I don't know of a better way off the top of my head.
Ignoring the general criticism of using a double...
For a general finite decimal, you can determine if it has a finite representation in binary with the following algorithm:
Extract the fraction part of the decimal f.
Determine f × 10^b = c, where b and c are integers.
Determine 2^d >= 10^b, where d is an integer.
If c × 2^d / 10^b is an integer, then the decimal has a finite representation in binary. Otherwise, it doesn't.
You can generalize this to any two bases.
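For example, take 0.375: f = 0.375, so c = 375 and b = 3; 2^10 = 1024 >= 10^3 = 1000, so d = 10; and 375 × 2^10 / 10^3 = 384000 / 1000 = 384, an integer, so 0.375 has a finite representation in binary (namely 0.011).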