Float number to string conversion implementation in std - C++

I ran into a curious issue. Look at this simple code:
#include <iostream>

int main(int argc, char **argv) {
    char buf[1000];
    snprintf_l(buf, sizeof(buf), _LIBCPP_GET_C_LOCALE, "%.17f", 0.123e30f);
    std::cout << "WTF?: " << buf << std::endl;
}
The output looks quite weird:
123000004117574256822262431744.00000000000000000
My question is: how is this implemented? Can someone show me the original source code? I could not find it, or maybe it's too complicated for me.
I've tried to reimplement the same double-to-string transformation in Java but failed. Even when I extracted the exponent and fraction parts separately and summed the fraction bits in a loop, I always got zeros instead of the digits "...822262431744". When I tried to keep summing fraction bits beyond the 23 bits of a float, I ran into another issue - how many fraction bits do I need to collect? And why does the original code stop at the integer part and not continue until the end of the scale?
So I really do not understand the basic logic of how it is implemented. I've also tried defining really big numbers (e.g. 0.123e127f), and it generates a huge number in decimal format. The number appears to have much higher precision than a float can hold. That looks like an issue, because the string representation contains something a float number cannot.

Please read the documentation:
printf, fprintf, sprintf, snprintf, printf_s, fprintf_s, sprintf_s, snprintf_s - cppreference.com
The format string consists of ordinary multibyte characters (except %), which are copied unchanged into the output stream, and conversion specifications. Each conversion specification has the following format:
introductory % character
...
(optional) . followed by integer number or *, or neither that specifies precision of the conversion. In the case when * is used, the precision is specified by an additional argument of type int, which appears before the argument to be converted, but after the argument supplying minimum field width if one is supplied. If the value of this argument is negative, it is ignored. If neither a number nor * is used, the precision is taken as zero. See the table below for exact effects of precision.
....
Conversion specifier: f, F
Explanation: converts floating-point number to the decimal notation in the style [-]ddd.ddd. Precision specifies the exact number of digits to appear after the decimal point character. The default precision is 6. In the alternative implementation the decimal point character is written even if no digits follow it. For infinity and not-a-number conversion style see notes.
Expected argument type: double
So with f you force the form ddd.ddd (no exponent), and with .17 you force 17 digits after the decimal separator. With such a big value the printed outcome looks that odd.
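To see the effect of the conversion specifier in isolation, here is a minimal, self-contained sketch; the exact digits of the %.17f line depend on how accurately your C library converts binary to decimal:
#include <cstdio>

int main() {
    float x = 0.123e30f;
    std::printf("%.17f\n", x);  // fixed notation, 17 digits after the decimal point
    std::printf("%g\n", x);     // default 6 significant digits: 1.23e+29
}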

Finally I've found out what the difference is between the Java float -> decimal -> string conversion and the C++ float -> string (decimal) conversion. I did not find the original source code, but I replicated the same behaviour in Java to make it clear. I think the code explains everything:
// the context size might be calculated properly by getting the maximum
// float number (including the exponent value) - it's 40 + scale, 17 for me
MathContext context = new MathContext(57, RoundingMode.HALF_UP);
BigDecimal divisor = BigDecimal.valueOf(2);
int tmp = Float.floatToRawIntBits(1.23e30f);
boolean sign = tmp < 0;
tmp <<= 1;
// there might be a NaN value, this code does not support it
int exponent = (tmp >>> 24) - 127;
tmp <<= 8;
int mask = 1 << 23;
int fraction = mask | (tmp >>> 9);
// at this point we have all parts of the float: sign, exponent and fraction bits.
// Let's build the mantissa.
BigDecimal mantissa = BigDecimal.ZERO;
for (int i = 0; i < 24; i++) {
    if ((fraction & mask) == mask) {
        // i'm not sure about speed, maybe division at each iteration might be faster than pow
        mantissa = mantissa.add(divisor.pow(-i, context));
    }
    mask >>>= 1;
}
// this is the line where I was losing accuracy, because of the context
BigDecimal decimal = mantissa.multiply(divisor.pow(exponent, context), context);
String str = decimal.setScale(17, RoundingMode.HALF_UP).toPlainString();
// add the minus sign manually, because Java drops it if the value becomes 0
// after setScale; the C++ version of the code keeps it
if (sign) {
    str = "-" + str;
}
return str;
Maybe this topic is useless - who really needs the exact same implementation C++ has? But at least this code keeps all the precision of the float number, compared to the most popular way of converting a float to a decimal string:
return BigDecimal.valueOf(1.23e30f).setScale(17, RoundingMode.HALF_UP).toPlainString();

The C++ implementation you are using uses the IEEE-754 binary32 format for float. In this format, the closest representable value to 0.123•10^30 is 123,000,004,117,574,256,822,262,431,744, which is represented in the binary32 format as +13,023,132•2^73. So 0.123e30f in the source code yields the number 123,000,004,117,574,256,822,262,431,744. (Because the number is represented as +13,023,132•2^73, we know its value is exactly that, which is 123,000,004,117,574,256,822,262,431,744, even though the digits “123000004117574256822262431744” are not stored directly.)
Then, when you format it with %.17f, your C++ implementation prints the exact value faithfully, yielding “123000004117574256822262431744.00000000000000000”. This accuracy is not required by the C++ standard, and some C++ implementations will not do the conversion exactly.
The Java specification also does not require formatting of floating-point values to be exact, at least in some formatting operations. (I am going from memory and some supposition here; I do not have a citation at hand.) It allows, perhaps even requires, that only a certain number of correct digits be produced, after which zeros are used if needed for positioning relative to the decimal point or for the requested format.
The number has much higher precision than float can be.
For any value represented in the float format, that value has infinite precision. The number +13,023,132•2^73 is exactly +13,023,132•2^73, which is exactly 123,000,004,117,574,256,822,262,431,744, to infinite precision. The precision the format has for representing numbers affects only which numbers it can represent, not how precisely it represents the numbers that it does represent.
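For what it's worth, the integer significand and power of two quoted above can be extracted with standard library calls. A minimal sketch, assuming float is IEEE-754 binary32:
#include <cmath>
#include <cstdio>

int main() {
    float x = 0.123e30f;
    int exp;
    float frac = std::frexp(x, &exp);                 // x == frac * 2^exp, 0.5 <= frac < 1
    long long sig = (long long)std::ldexp(frac, 24);  // 24-bit integer significand
    std::printf("%lld * 2^%d\n", sig, exp - 24);      // prints 13023132 * 2^73
    std::printf("%.17f\n", x);                        // the exact digits, if the C library rounds correctly
}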

Related

Weird Rounding Occurs in C++ Function

I am writing a function in C++ that is supposed to find the largest single digit in the number passed (inputValue). For example, the answer for .345 is 5. However, after a while the program changes inputValue to something along the lines of .3449 (and the largest digit is then set to 9). I have no idea why this is happening. Any help resolving this problem would be greatly appreciated.
This is the function in my .hpp file
void LargeInput(const double inputValue)
//Function to find the largest value of the input
{
    int tempMax = 0,    //Value that the temporary max number is in loop
        digit = 0,      //Value of numbers after the decimal place
        test = 0,
        powerOten = 10; //Number multiplied by so that the next digit can be checked
    double number = inputValue; //A variable that can be changed in the function
    cout << "The number is still " << number << endl;
    for (int k = 1; k <= 6; k++)
    {
        test = (number*powerOten);
        cout << "test: " << test << endl;
        digit = test % 10;
        cout << (static_cast<int>(number*powerOten)) << endl;
        if (tempMax < digit)
            tempMax = digit;
        powerOten *= 10;
    }
    return;
}
You cannot represent real numbers (doubles) precisely in a computer - they need to be approximated. If you change your function to work on longs or ints, there won't be any inaccuracies. That seems natural enough for the context of your question: you're just looking at the digits, not the number, so .345 can be 345 and give the same result.
Try this:
int get_largest_digit(int n) {
    int largest = 0;
    while (n > 0) {
        int x = n % 10;
        if (x > largest) largest = x;
        n /= 10;
    }
    return largest;
}
This is because the fractional component of a binary floating-point number is built from terms of the form 1/2^n. As a result you can get values very close to what you want, but you can never achieve exact values like 1/3.
It's common to instead use integers and have a conversion (like 1000 = 1) so if you had the number 1333 you would do printf("%d.%d", 1333/1000, 1333 % 1000) to print out 1.333.
By the way, the first sentence is a simplification of how floating point numbers are actually represented. For more information check out: http://en.wikipedia.org/wiki/Floating_point#Representable_numbers.2C_conversion_and_rounding
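One caveat with the scaled-integer printf above: the fractional part needs zero padding, otherwise 1050 (meaning 1.050) prints as "1.50", which reads as 1.5. A minimal sketch, assuming a fixed scale of 1000; negative values are left out for brevity:
#include <cstdio>

// print a value stored as thousandths, e.g. 1333 -> "1.333", 1050 -> "1.050"
void print_scaled(int thousandths) {
    std::printf("%d.%03d\n", thousandths / 1000, thousandths % 1000);
}

int main() {
    print_scaled(1333);
    print_scaled(1050);
}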
This is how floating point numbers work, unfortunately. The core of the problem is that there are infinitely many real numbers. More specifically, there are infinitely many values between 0.1 and 0.2, and infinitely many values between 0.01 and 0.02. Computers, however, have a finite number of bits to represent a floating point number (64 bits for a double precision number). Therefore, most floating point numbers have to be approximated. After any floating point operation, the processor has to round the result to a value it can represent in 64 bits.
Another property of floating point numbers is that as numbers get bigger they get less and less precise. This is because the same 64 bits have to be able to represent very big numbers (1,000,000,000) and very small numbers (0.000,000,000,001). Therefore, the rounding error gets larger when working with bigger numbers.
The other issue here is that you are converting from floating point to integer. This introduces even more rounding error. It appears that when (0.345 * 10000) is converted to an integer, the result is closer to 3449 than 3450.
I suggest you don't convert your numbers to integers. Write your program in terms of floating point numbers. You can't use the modulus (%) operator on floating point numbers to get a value for digit. Instead use the fmod function from the C math library (<cmath>).
As other answers have indicated, binary floating-point is incapable of representing most decimal numbers exactly. Therefore, you must reconsider your problem statement. Some alternatives are:
The number is passed as a double (specifically, a 64-bit IEEE-754 binary floating-point value), and you wish to find the largest digit in the decimal representation of the exact value passed. In this case, the solution suggested by user millimoose will work (provided the asprintf or snprintf function used is of good quality, so that it does not incur rounding errors that prevent it from producing correctly rounded output).
The number is passed as a double but is intended to represent a number that is exactly representable as a decimal numeral with a known number of digits. In this case, the solution suggested by user millimoose again works, with the format specification altered to convert the double to decimal with the desired number of digits (e.g., instead of “%.64f”, you could use “%.6f”).
The function is changed to pass the number in another way, such as with decimal floating-point, as a scaled integer, or as a string containing a decimal numeral.
Once you have clarified the problem statement, it may be interesting to consider how to solve it with floating-point arithmetic, rather than calling library functions for formatted output. This is likely to have pedagogical value (and incidentally might produce a solution that is computationally more efficient than calling a library function).
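For the formatted-output route mentioned in the second alternative above, a minimal sketch; the function name and the choice of 6 decimal places are illustrative only:
#include <cstdio>

// Format the value with a known number of decimal places and scan the digits.
int largest_digit_via_snprintf(double value, int places) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*f", places, value);
    int largest = 0;
    for (const char *p = buf; *p; ++p)
        if (*p >= '0' && *p <= '9' && *p - '0' > largest)
            largest = *p - '0';
    return largest;
}

int main() {
    std::printf("%d\n", largest_digit_via_snprintf(0.345, 6)); // prints 5
}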

printing float, preserving precision

I am writing a program that prints floating point literals to be used inside another program.
How many digits do I need to print in order to preserve the precision of the original float?
Since a float has 24 * (log(2) / log(10)) = 7.2247199 decimal digits of precision, my initial thought was that printing 8 digits should be enough. But if I'm unlucky, those 0.2247199 get distributed to the left and to the right of the 7 significant digits, so I should probably print 9 decimal digits.
Is my analysis correct? Is 9 decimal digits enough for all cases? Like printf("%.9g", x);?
Is there a standard function that converts a float to a string with the minimum number of decimal digits required for that value, in the cases where 7 or 8 are enough, so I don't print unnecessary digits?
Note: I cannot use hexadecimal floating point literals, because standard C++ does not support them.
In order to guarantee that a binary -> decimal -> binary round trip recovers the original binary value, IEEE 754 requires that the original binary value be preserved by converting to decimal and back again using:
5 decimal digits for binary16
9 decimal digits for binary32
17 decimal digits for binary64
36 decimal digits for binary128
For other binary formats the required number of decimal digits is
1 + ceiling(p*log10(2))
where p is the number of significant bits in the binary format, e.g. 24 bits for binary32.
In C, the functions you can use for these conversions are snprintf() and strtof/strtod/strtold().
Of course, in some cases even more digits can be useful (no, they are not always "noise", depending on the implementation of the decimal conversion routines such as snprintf() ). Consider e.g. printing dyadic fractions.
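A small round-trip check along these lines, using max_digits10 (9 for binary32) rather than hard-coding the digit count; just a sketch:
#include <cstdio>
#include <cstdlib>
#include <limits>

int main() {
    float x = 0.1f * 3.0f;
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g",
                  std::numeric_limits<float>::max_digits10, x);  // 9 significant digits
    float y = std::strtof(buf, nullptr);
    std::printf("%s round-trips: %s\n", buf, x == y ? "yes" : "no");
}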
24 * (log(2) / log(10)) = 7.2247199
That's pretty representative of the problem. It makes no sense whatsoever to express the number of significant digits with an accuracy of 0.0000001 digits. You are converting numbers to text for the benefit of a human, not a machine. A human couldn't care less and would much prefer that you wrote
24 * (log(2) / log(10)) = 7
Trying to display 8 significant digits just generates random noise digits, and with non-zero odds 7 is already too much, because floating point error accumulates in calculations. Above all, print numbers using a reasonable unit of measure. People are interested in millimeters, grams, pounds, inches, etcetera. No architect will care about the size of a window expressed more accurately than 1 mm. No window manufacturing plant will promise a window sized as accurately as that.
Last but not least, you cannot ignore the accuracy of the numbers you feed into your program. Measuring the speed of an unladen European swallow to 7 digits is not possible. It is roughly 11 meters per second, 2 digits at best. So performing calculations on that speed and printing a result that has more significant digits produces nonsensical results that promise accuracy that isn't there.
If you have a C library conforming to C99 (and if your float types have a base that is a power of 2 :) the printf format character %a can print floating point values without loss of precision in hexadecimal form, and functions such as scanf and strtod will be able to read them.
If the program is meant to be read by a computer, I would do the simple trick of using char* aliasing.
alias float* to char*
copy into an unsigned (or whatever unsigned type is sufficiently large) via char* aliasing
print the unsigned value
Decoding is just reversing the process (and on most platforms a direct reinterpret_cast can be used).
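A sketch of that idea; it uses memcpy rather than pointer casts, which sidesteps strict-aliasing concerns but is otherwise the same bit-copy trick:
#include <cstdio>
#include <cstring>

int main() {
    static_assert(sizeof(unsigned) == sizeof(float), "assumes a 32-bit unsigned");
    float f = 0.1f;
    unsigned bits;
    std::memcpy(&bits, &f, sizeof bits);     // grab the raw 32-bit pattern
    std::printf("%08x\n", bits);             // e.g. 3dcccccd for 0.1f
    float g;
    std::memcpy(&g, &bits, sizeof g);        // decoding is the reverse copy
    std::printf("exact round-trip: %s\n", f == g ? "yes" : "no");
}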
The floating-point-to-decimal conversion used in Java is guaranteed to produce the least number of decimal digits beyond the decimal point needed to distinguish the number from its neighbors (more or less).
You can copy the algorithm from here: http://www.docjar.com/html/api/sun/misc/FloatingDecimal.java.html
Pay attention to the FloatingDecimal(float) constructor and the toJavaFormatString() method.
If you read these papers (see below), you'll find that there are algorithms that print the minimum number of decimal digits such that the number can be re-interpreted unchanged (i.e. by scanf).
Since there might be several such numbers, the algorithm also picks the nearest decimal fraction to the original binary fraction (what I called the float value).
A pity that there's no such standard library in C.
http://www.cs.indiana.edu/~burger/FP-Printing-PLDI96.pdf
http://grouper.ieee.org/groups/754/email/pdfq3pavhBfih.pdf
You can use sprintf. I am not sure whether this answers your question exactly though, but anyways, here is the sample code
#include <stdio.h>

int main( void )
{
    float d_n = 123.45;
    char s_cp[13] = { '\0' };
    char s_cnp[4] = { '\0' };
    /*
     * with sprintf you need to make sure there's enough space
     * declared in the array
     */
    sprintf( s_cp, "%.2f", d_n );
    printf( "%s\n", s_cp );
    /*
     * snprintf allows you to control how much is written into the array.
     * it might have portability issues if you are not using C99
     */
    snprintf( s_cnp, sizeof s_cnp - 1, "%f", d_n );
    printf( "%s\n", s_cnp );
    getchar();
    return 0;
}
/* output:
 * 123.45
 * 123
 */
With something like
def f(a):
b=0
while a != int(a): a*=2; b+=1
return a, b
(which is Python) you should be able to get mantissa and exponent in a loss-free way.
In C, this would probably be
#include <math.h>

struct float_decomp {
    float mantissa;
    int exponent;
};

struct float_decomp decomp(float x)
{
    struct float_decomp ret = { .mantissa = x, .exponent = 0 };
    while (ret.mantissa != floorf(ret.mantissa)) {
        ret.mantissa *= 2;
        ret.exponent += 1;
    }
    return ret;
}
But be aware that still not all values can be represented that way; it is just a quick shot which should give the idea, but probably needs improvement.

How does Excel successfully round floating point numbers even though they are imprecise?

For example, this blog says 0.005 is not exactly 0.005, but rounding that number yields the right result.
I have tried all kinds of rounding in C++ and it fails when rounding numbers to certain decimal places. For example, Round(x,y) rounds x to a multiple of y. So Round(37.785,0.01) should give you 37.79 and not 37.78.
I am reopening this question to ask the community for help. The problem is with the imprecision of floating point numbers (37.785 is represented as 37.78499999999).
The question is how does Excel get around this problem?
The solution in this round() for float in C++ is incorrect for the above problem.
"Round(37.785,0.01) should give you 37.79 and not 37.78."
First off, there is no consensus that 37.79 rather than 37.78 is the "right" answer here. Tie-breakers are always a bit tough. While always rounding up in the case of a tie is a widely-used approach, it certainly is not the only approach.
Secondly, this isn't a tie-breaking situation. The numerical value in the IEEE binary64 floating point format is 37.784999999999997 (approximately). There are lots of ways to get a value of 37.784999999999997 besides a human typing in a value of 37.785 and happen to have that converted to that floating point representation. In most of these cases, the correct answer is 37.78 rather than 37.79.
Addendum
Consider the following Excel formulae:
=ROUND(37785/1000,2)
=ROUND(19810222/2^19+21474836/2^47,2)
Both cells will display the same value, 37.79. There is a legitimate argument over whether 37785/1000 should round to 37.78 or 37.79 with two place accuracy. How to deal with these corner cases is a bit arbitrary, and there is no consensus answer. There isn't even a consensus answer inside Microsoft: "the Round() function is not implemented in a consistent fashion among different Microsoft products for historical reasons." ( http://support.microsoft.com/kb/196652 ) Given an infinite precision machine, Microsoft's VBA would round 37.785 to 37.78 (banker's round) while Excel would yield 37.79 (symmetric arithmetic round).
There is no argument over the rounding of the latter formula. It is strictly less than 37.785, so it should round to 37.78, not 37.79. Yet Excel rounds it up. Why?
The reason has to do with how real numbers are represented in a computer. Microsoft, like many others, uses the IEEE 64 bit floating point format. The number 37785/1000 suffers from precision loss when expressed in this format. This precision loss does not occur with 19810222/2^19+21474836/2^47; it is an "exact number".
I intentionally constructed that exact number to have the same floating point representation as does the inexact 37785/1000. That Excel rounds this exact value up rather than down is the key to determining how Excel's ROUND() function works: It is a variant of symmetric arithmetic rounding. It rounds based on a comparison to the floating point representation of the corner case.
The algorithm in C++:
#include <cmath>   // std::floor

// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
    double result = 1.0;
    double base = 10.0;
    while (exponent > 0) {
        if ((exponent & 1) != 0) result *= base;
        exponent >>= 1;
        base *= base;
    }
    return result;
}

// Round the same way Excel does.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double excel_round (double x, int nplaces) {
    bool is_neg = false;

    // Excel uses symmetric arithmetic round: Round away from zero.
    // The algorithm will be easier if we only deal with positive numbers.
    if (x < 0.0) {
        is_neg = true;
        x = -x;
    }

    // Construct the nearest rounded values and the nasty corner case.
    // Note: We really do not want an optimizing compiler to put the corner
    // case in an extended double precision register. Hence the volatile.
    double round_down, round_up;
    volatile double corner_case;
    if (nplaces < 0) {
        double scale = pow10 (-nplaces);
        round_down  = std::floor (x / scale);
        corner_case = (round_down + 0.5) * scale;
        round_up    = (round_down + 1.0) * scale;
        round_down *= scale;
    }
    else {
        double scale = pow10 (nplaces);
        round_down  = std::floor (x * scale);
        corner_case = (round_down + 0.5) / scale;
        round_up    = (round_down + 1.0) / scale;
        round_down /= scale;
    }

    // Round by comparing to the corner case.
    x = (x < corner_case) ? round_down : round_up;

    // Correct the sign if needed.
    if (is_neg) x = -x;

    return x;
}
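A quick usage check of the function above; the printed results assume an IEEE-754 double:
#include <cstdio>

// assumes pow10() and excel_round() from the listing above are in scope
int main() {
    std::printf("%.2f\n", excel_round(37.785, 2));  // 37.79, matching Excel
    std::printf("%.2f\n", excel_round(2.5, 0));     // 3.00, ties go away from zero
    std::printf("%.2f\n", excel_round(-2.5, 0));    // -3.00
}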
For very accurate arbitrary precision and rounding of floating point numbers to a fixed set of decimal places, you should take a look at a math library like GNU MPFR. While it's a C-library, the web-page I posted also links to a couple different C++ bindings if you want to avoid using C.
You may also want to read a paper entitled "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg at the Xerox Palo Alto Research Center. It's an excellent article demonstrating the underlying process that allows floating point numbers to be approximated in a computer that represents everything in binary data, and how rounding errors and other problems can creep up in FPU-based floating point math.
I don't know how Excel does it, but printing floating point numbers nicely is a hard problem: http://www.serpentine.com/blog/2011/06/29/here-be-dragons-advances-in-problems-you-didnt-even-know-you-had/
So your actual question seems to be, how to get correctly rounded floating point -> string conversions. By googling for those terms you'll get a bunch of articles, but if you're interested in something to use, most platforms provide reasonably competent implementations of sprintf()/snprintf(). So just use those, and if you find bugs, file a report to the vendor.
A function that takes a floating point number as argument and returns another floating point number, rounded exactly to a given number of decimal digits, cannot be written, because there are many numbers with a finite decimal representation that have an infinite binary representation; one of the simplest examples is 0.1.
To achieve what you want you must accept using a different type as the result of your rounding function. If your immediate need is printing the number, you can use a string and a formatting function: the problem becomes how to obtain exactly the formatting you expect. Otherwise, if you need to store this number in order to perform exact calculations on it, for instance if you are doing accounting, you need a library that's capable of representing decimal numbers exactly. In this case the most common approach is to use a scaled representation: an integer for the value together with the number of decimal digits. Dividing the value by ten raised to the scale gives you the original number.
If any of these approaches is suitable, I'll try and expand my answer with practical suggestions.
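For illustration, a minimal sketch of the scaled representation described above; the ScaledDecimal name and fixed scale are mine, not a real library, and negative values and overflow are ignored:
#include <cstdint>
#include <cstdio>

struct ScaledDecimal {
    std::int64_t value;  // e.g. 3779
    int scale;           // number of decimal digits, e.g. 2 -> 37.79
};

void print(const ScaledDecimal &d) {
    std::int64_t p = 1;
    for (int i = 0; i < d.scale; ++i) p *= 10;       // 10^scale
    std::printf("%lld.%0*lld\n",
                (long long)(d.value / p), d.scale, (long long)(d.value % p));
}

int main() {
    print({3779, 2});  // prints 37.79
}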
Excel rounds numbers like this "correctly" by doing WORK. They started in 1985 with a fairly "normal" set of floating-point routines, added some scaled-integer fake floating point, and they've been tuning those things and adding special cases ever since. The app DID have most of the same "obvious" bugs that everybody else did, it's just that it mostly had them a long time ago. I filed a couple myself, back when I was doing tech support for them in the early 90s.
I believe the following C# code rounds numbers as they are rounded in Excel. To exactly replicate the behavior in C++ you might need to use a special decimal type.
In plain English, the double-precision number is converted to a decimal and then rounded to fifteen significant digits (not to be confused with fifteen decimal places). The result is rounded a second time to the specified number of decimal places.
That might seem weird, but what you have to understand is that Excel always displays numbers that are rounded to 15 significant figures. If the ROUND() function weren't using that display value as a starting point, and used the internal double representation instead, then there would be cases where ROUND(A1,N) did not seem to correspond to the actual value in A1. That would be very confusing to a non-technical user.
The double which is closest to 37.785 has an exact decimal value of 37.784999999999996589394868351519107818603515625. (Any double can be represented precisely by a finite base ten decimal because one quarter, one eighth, one sixteenth, and so forth all have finite decimal expansions.) If that number were rounded directly to two decimal places, there would be no tie to break and the result would be 37.78. If you round to 15 significant figures first you get 37.7850000000000. If this is further rounded to two decimal places, then you get 37.79, so there is no real mystery after all.
// Convert to a floating decimal point number, round to fifteen
// significant digits, and then round to the number of places
// indicated.
static decimal SmartRoundDouble(double input, int places)
{
    int numLeadingDigits = (int)Math.Log10(Math.Abs(input)) + 1;
    decimal inputDec = GetAccurateDecimal(input);
    inputDec = MoveDecimalPointRight(inputDec, -numLeadingDigits);

    decimal round1 = Math.Round(inputDec, 15);
    round1 = MoveDecimalPointRight(round1, numLeadingDigits);

    decimal round2 = Math.Round(round1, places, MidpointRounding.AwayFromZero);
    return round2;
}

static decimal MoveDecimalPointRight(decimal d, int n)
{
    if (n > 0)
        for (int i = 0; i < n; i++)
            d *= 10.0m;
    else
        for (int i = 0; i > n; i--)
            d /= 10.0m;
    return d;
}

// The constructor for decimal that accepts a double does
// some rounding by default. This gets a more exact number.
static decimal GetAccurateDecimal(double r)
{
    string accurateStr = r.ToString("G17", CultureInfo.InvariantCulture);
    return Decimal.Parse(accurateStr, CultureInfo.InvariantCulture);
}
What you NEED is this:
double f = 22.0/7.0;
cout.setf(ios::fixed, ios::floatfield);
cout.precision(6);
cout << f << endl;
How it can be implemented (just an overview for rounding the last digit):
long getRoundedPrec(double d, double precision = 9)
{
    precision = (int)precision;
    long l = (d - ((double)((int)d))) * pow(10.0, precision + 1);
    int lastDigit = (l - ((l / 10) * 10));
    if (lastDigit >= 5) {
        l = l / 10 + 1;
    }
    return l;
}
Just as base-10 numbers must be rounded as they are converted to base-2, it is possible to round a number as it is converted from base-2 to base-10. Once the number has a base-10 representation it can be rounded again in a straightforward manner by looking at the digit to the right of the one you wish to round.
While there's nothing wrong with the above assertion, there's a much more pragmatic solution. The problem is that the binary representation tries to get as close as possible to the decimal number, even if that binary is less than the decimal. The amount of error is within [-0.5,0.5] least significant bits (LSB) of the true value. For rounding purposes you'd rather it be within [0,1] LSB so that the error is always positive, but that's not possible without changing all the rules of floating point math.
The one thing you can do is add 1 LSB to the value, so the error is within [0.5,1.5] LSB of the true value. This is less accurate overall, but only by a very tiny amount; when the value is rounded for representation as a decimal number it is much more likely to be rounded to a proper decimal number because the error is always positive.
To add 1 LSB to the value before rounding it, see the answers to this question. For example in Visual Studio C++ 2010 the procedure would be:
Round(_nextafter(37.785,37.785*1.1),0.01);
There are many ways to clean up the result of a floating-point computation using statistical, numerical... algorithms.
The easiest one is probably searching for repetitive 9s or 0s in the range of precision. If there are any, maybe those 9s are redundant, just round them up. But this may not work in many cases. Here's an example for a float with 6 digits of precision:
2.67899999 → 2.679
12.3499999 → 12.35
1.20000001 → 1.2
Excel always limits the input to 15 digits and rounds the output to a maximum of 15 digits, so this might be one of the ways Excel uses.
Or you can carry the precision along with the number. After each step, adjust the accuracy depending on the precision of the operands. For example
1.113 → 3 decimal digits
6.15634 → 5 decimal digits
Since both numbers are inside the double's 16-17 digit precision range, their sum will be accurate to the larger of them, which is 5 digits. Similarly, 3 + 5 < 16, so their product will be precise to 8 decimal digits
1.113 + 6.15634 = 7.26934 → 5 decimal digits
1.113 * 6.15634 = 6.85200642 → 8 decimal digits
But 4.1341677841 * 2.251457145 will only get double's accuracy, because the real result exceeds double's precision
Another efficient algorithm is Grisu, but I haven't had an opportunity to try it.
In 2010, Florian Loitsch published a wonderful paper in PLDI, "Printing floating-point numbers quickly and accurately with integers", which represents the biggest step in this field in 20 years: he mostly figured out how to use machine integers to perform accurate rendering! Why do I say "mostly"? Because although Loitsch's "Grisu3" algorithm is very fast, it gives up on about 0.5% of numbers, in which case you have to fall back to Dragon4 or a derivative
Here be dragons: advances in problems you didn’t even know you had
In fact I think Excel must combine many different methods to achieve the best result of all
Example When a Value Reaches Zero
In Excel 95 or earlier, enter the following into a new workbook:
A1: =1.333+1.225-1.333-1.225
Right-click cell A1, and then click Format Cells. On the Number tab, click Scientific under Category. Set the Decimal places to 15.
Rather than displaying 0, Excel 95 displays -2.22044604925031E-16.
Excel 97, however, introduced an optimization that attempts to correct for this problem. Should an addition or subtraction operation result in a value at or very close to zero, Excel 97 and later will compensate for any error introduced as a result of converting an operand to and from binary. The example above when performed in Excel 97 and later correctly displays 0 or 0.000000000000000E+00 in scientific notation.
Floating-point arithmetic may give inaccurate results in Excel
As mjfgates says, Excel does hard work to get this "right". The first thing to do when you try to reimplement this is to define what you mean by "right". Obvious solutions:
implement rational arithmetic
Slow but reliable.
implement a bunch of heuristics
Fast but tricky to get right (think "years of bug reports").
It really depends on your application.
Most decimal fractions can't be accurately represented in binary.
double x = 0.0;
for (int i = 1; i <= 10; i++)
{
    x += 0.1;
}
// x should now be 1.0, right?
//
// it isn't. Test it and see.
One solution is to use BCD. It's old. But, it's also tried and true. We have a lot of other old ideas that we use every day (like using a 0 to represent nothing...).
Another technique uses scaling upon input/output. This has the advantage of nearly all math being integer math.

Printing double without losing precision

How do you print a double to a stream so that when it is read in you don't lose precision?
I tried:
std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::setprecision(std::numeric_limits<T>::digits10) << v << " ";
double u;
ss >> u;
std::cout << "precision " << ((u == v) ? "retained" : "lost") << std::endl;
This did not work as I expected.
But I can increase precision (which surprised me as I thought that digits10 was the maximum required).
ss << std::setprecision(std::numeric_limits<T>::digits10 + 2) << v << " ";
// ^^^^^^ +2
It has to do with the number of significant digits; the leading zeros (in 0.01) don't count.
So has anybody looked at representing floating point numbers exactly?
What is the exact magical incantation on the stream I need to do?
After some experimentation:
The trouble was with my original version. There were non-significant digits in the string after the decimal point that affected the accuracy.
So to compensate for this we can use scientific notation:
ss << std::scientific
<< std::setprecision(std::numeric_limits<double>::digits10 + 1)
<< v;
This still does not explain the need for the +1 though.
Also if I print out the number with more precision I get more precision printed out!
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10 + 1) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits) << v << "\n";
It results in:
1.000000000000000e-02
1.0000000000000002e-02
1.00000000000000019428902930940239457413554200000000000e-02
Based on #Stephen Canon answer below:
We can print out exactly by using the printf() formatter, "%a" or "%A". To achieve this in C++ we need to use the fixed and scientific manipulators (see n3225: 22.4.2.2.2p5 Table 88)
std::cout.flags(std::ios_base::fixed | std::ios_base::scientific);
std::cout << v;
For now I have defined:
template<typename T>
std::ostream& precise(std::ostream& stream)
{
    stream.flags(std::ios_base::fixed | std::ios_base::scientific);
    return stream;
}
std::ostream& preciselngd(std::ostream& stream) { return precise<long double>(stream); }
std::ostream& precisedbl(std::ostream& stream)  { return precise<double>(stream); }
std::ostream& preciseflt(std::ostream& stream)  { return precise<float>(stream); }
Next: How do we handle NaN/Inf?
It's not correct to say "floating point is inaccurate", although I admit that's a useful simplification. If we used base 8 or 16 in real life then people around here would be saying "base 10 decimal fraction packages are inaccurate, why did anyone ever cook those up?".
The problem is that integral values translate exactly from one base into another, but fractional values do not, because they represent fractions of the integral step and only a few of them are used.
Floating point arithmetic is technically perfectly accurate. Every calculation has one and only one possible result. There is a problem, and it is that most decimal fractions have base-2 representations that repeat. In fact, in the sequence 0.01, 0.02, ... 0.99, only a mere 3 values have exact binary representations. (0.25, 0.50, and 0.75.) There are 96 values that repeat and therefore are obviously not represented exactly.
Now, there are a number of ways to write and read back floating point numbers without losing a single bit. The idea is to avoid trying to express the binary number with a base 10 fraction.
Write them as binary. These days, everyone implements the IEEE-754 format so as long as you choose a byte order and write or read only that byte order, then the numbers will be portable.
Write them as 64-bit integer values. Here you can use the usual base 10. (Because you are representing the 64-bit aliased integer, not the 52-bit fraction.)
You can also just write more decimal fraction digits. Whether this is bit-for-bit accurate will depend on the quality of the conversion libraries and I'm not sure I would count on perfect accuracy (from the software) here. But any errors will be exceedingly small and your original data certainly has no information in the low bits. (None of the constants of physics and chemistry are known to 52 bits, nor has any distance on earth ever been measured to 52 bits of precision.) But for a backup or restore where bit-for-bit accuracy might be compared automatically, this obviously isn't ideal.
Don't print floating-point values in decimal if you don't want to lose precision. Even if you print enough digits to represent the number exactly, not all implementations have correctly-rounded conversions to/from decimal strings over the entire floating-point range, so you may still lose precision.
Use hexadecimal floating point instead. In C:
printf("%a\n", yourNumber);
C++0x provides the hexfloat manipulator for iostreams that does the same thing (on some platforms, using the std::hex modifier has the same result, but this is not a portable assumption).
Using hex floating point is preferred for several reasons.
First, the printed value is always exact. No rounding occurs in writing or reading a value formatted in this way. Beyond the accuracy benefits, this means that reading and writing such values can be faster with a well tuned I/O library. They also require fewer digits to represent values exactly.
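A minimal C++ sketch of that round trip, using the std::hexfloat manipulator for output and strtod for input (reading hex floats back through operator>> is not reliable on every standard library, so strtod is used here):
#include <cstdlib>
#include <iostream>
#include <sstream>

int main() {
    double v = 0.1 * 0.1;
    std::ostringstream out;
    out << std::hexfloat << v;                         // e.g. 0x1.47ae147ae147cp-7
    double u = std::strtod(out.str().c_str(), nullptr);
    std::cout << out.str() << " restores exactly: " << (u == v ? "yes" : "no") << '\n';
}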
I got interested in this question because I'm trying to (de)serialize my data to & from JSON.
I think I have a clearer explanation (with less hand waving) for why 17 decimal digits are sufficient to reconstruct the original number losslessly:
Imagine 3 number lines:
1. for the original base 2 number
2. for the rounded base 10 representation
3. for the reconstructed number (same as #1 because both in base 2)
When you convert to base 10, graphically, you choose the tic on the 2nd number line closest to the tic on the 1st. Likewise when you reconstruct the original from the rounded base 10 value.
The critical observation I had was that in order to allow exact reconstruction, the base 10 step size (quantum) has to be smaller than the base 2 quantum. Otherwise, two adjacent base 2 tics can map to the same base 10 tic, and you inevitably get a bad reconstruction.
Take the specific case of when the exponent is 0 for the base2 representation. Then the base2 quantum will be 2^-52 ~= 2.22 * 10^-16. The closest base 10 quantum that's less than this is 10^-16. Now that we know the required base 10 quantum, how many digits will be needed to encode all possible values? Given that we're only considering the case of exponent = 0, the dynamic range of values we need to represent is [1.0, 2.0). Therefore, 17 digits would be required (16 digits for fraction and 1 digit for integer part).
For exponents other than 0, we can use the same logic:
exponent base2 quant. base10 quant. dynamic range digits needed
---------------------------------------------------------------------
1 2^-51 10^-16 [2, 4) 17
2 2^-50 10^-16 [4, 8) 17
3 2^-49 10^-15 [8, 16) 17
...
32 2^-20 10^-7 [2^32, 2^33) 17
1022 9.98e291 1.0e291 [4.49e307,8.99e307) 17
While not exhaustive, the table shows the trend that 17 digits are sufficient.
Hope you like my explanation.
In C++20 you'll be able to use std::format to do this:
std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::format("{}", v);
double u;
ss >> u;
assert(v == u);
The default floating-point format is the shortest decimal representation with a round-trip guarantee. The advantage of this method compared to using the precision of max_digits10 (not digits10 which is not suitable for round trip through decimal) from std::numeric_limits is that it doesn't print unnecessary digits.
In the meantime you can use the {fmt} library, which std::format is based on. For example (godbolt):
fmt::print("{}", 0.1 * 0.1);
Output (assuming IEEE754 double):
0.010000000000000002
{fmt} uses the Dragonbox algorithm for fast binary floating point to decimal conversion. In addition to giving the shortest representation it is 20-30x faster than common standard library implementations of printf and iostreams.
Disclaimer: I'm the author of {fmt} and C++20 std::format.
A double has the precision of 52 binary digits or 15.95 decimal digits. See http://en.wikipedia.org/wiki/IEEE_754-2008. You need at least 16 decimal digits to record the full precision of a double in all cases. [But see fourth edit, below].
By the way, this means significant digits.
Answer to OP edits:
Your floating point to decimal string runtime is outputting way more digits than are significant. A double can only hold 52 bits of significand (actually 53, if you count a "hidden" 1 that is not stored). That means the resolution is not more than 2^-53 = 1.11e-16.
For example: 1 + 2 ^ -52 = 1.0000000000000002220446049250313 . . . .
Those decimal digits, .0000000000000002220446049250313 . . . . are the smallest binary "step" in a double when converted to decimal.
The "step" inside the double is:
.0000000000000000000000000000000000000000000000000001 in binary.
Note that the binary step is exact, while the decimal step is inexact.
Hence the decimal representation above,
1.0000000000000002220446049250313 . . .
is an inexact representation of the exact binary number:
1.0000000000000000000000000000000000000000000000000001.
Third Edit:
The next possible value for a double, which in exact binary is:
1.0000000000000000000000000000000000000000000000000010
converts inexactly in decimal to
1.0000000000000004440892098500626 . . . .
So all of those extra digits in the decimal are not really significant, they are just base conversion artifacts.
Fourth Edit:
Though a double stores at most 16 significant decimal digits, sometimes 17 decimal digits are necessary to represent the number. The reason has to do with digit slicing.
As I mentioned above, there are 52 + 1 binary digits in the double. The "+ 1" is an assumed leading 1, and is neither stored nor significant. In the case of an integer, those 52 binary digits form a number between 0 and 2^53 - 1. How many decimal digits are necessary to store such a number? Well, log_10 (2^53 - 1) is about 15.95. So at most 16 decimal digits are necessary. Let's label these d_0 to d_15.
Now consider that IEEE floating point numbers also have a binary exponent. What happens when we increment the exponent by, say, 2? We have multiplied our 52-bit number, whatever it was, by 4. Now, instead of our 52 binary digits aligning perfectly with our decimal digits d_0 to d_15, we have some significant binary digits represented in d_16. However, since we multiplied by something less than 10, we still have significant binary digits represented in d_0. So our 15.95 decimal digits now occupy d_1 to d_15, plus some upper bits of d_0 and some lower bits of d_16. This is why 17 decimal digits are sometimes needed to represent an IEEE double.
Fifth Edit
Fixed numerical errors
The easiest way (for IEEE 754 double) to guarantee a round-trip conversion is to always use 17 significant digits. But that has the disadvantage of sometimes including unnecessary noise digits (0.1 → "0.10000000000000001").
An approach that's worked for me is to sprintf the number with 15 digits of precision, then check if atof gives you back the original value. If it doesn't, try 16 digits. If that doesn't work, use 17.
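A sketch of that trial-and-error approach; shortest_roundtrip is an illustrative name, and NaN and infinities are not handled:
#include <cstdio>
#include <cstdlib>
#include <string>

std::string shortest_roundtrip(double v) {
    char buf[64];
    for (int digits = 15; digits <= 17; ++digits) {
        std::snprintf(buf, sizeof buf, "%.*g", digits, v);
        if (std::strtod(buf, nullptr) == v)  // parses back to the same double?
            break;                           // then this many digits is enough
    }
    return buf;                              // 17 digits always round-trips
}

int main() {
    std::printf("%s\n", shortest_roundtrip(0.1).c_str());        // 0.1
    std::printf("%s\n", shortest_roundtrip(0.1 * 0.1).c_str());  // 0.010000000000000002
}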
You might want to try David Gay's algorithm (used in Python 3.1 to implement float.__repr__).
Thanks to ThomasMcLeod for pointing out the error in my table computation
To guarantee round-trip conversion using 15 or 16 or 17 digits is only possible for a comparatively few cases. The number 15.95 comes from taking 2^53 (1 implicit bit + 52 bits in the significand/"mantissa") which comes out to an integer in the range 10^15 to 10^16 (closer to 10^16).
Consider a double precision value x with an exponent of 0, i.e. it falls into the floating point range 1.0 <= x < 2.0. The implicit bit will mark the 2^0 component (part) of x. The highest explicit bit of the significand denotes the next lower exponent (from 0), i.e. -1 => 2^-1, the 0.5 component.
The next bit denotes 0.25, the ones after that 0.125, 0.0625, 0.03125, 0.015625 and so on (see table below). The value 1.5 will thus be represented by two components added together: the implicit bit denoting 1.0 and the highest explicit significand bit denoting 0.5.
This illustrates that from the implicit bit downward you have 52 additional, explicit bits to represent possible components, where the smallest is 0 (exponent) - 52 (explicit bits in significand) = -52 => 2^-52, which according to the table below is ... well, you can see for yourselves that it comes out to quite a bit more than 15.95 significant digits (37 to be exact). To put it another way, the smallest number in the 2^0 range that is != 1.0 itself is 2^0 + 2^-52, which is 1.0 plus the number next to 2^-52 (below) = (exactly) 1.0000000000000002220446049250313080847263336181640625, a value which I count as being 53 significant digits long. With 17 digit formatting "precision" the number will display as 1.0000000000000002, and this would depend on the library converting correctly.
So maybe "round-trip conversion in 17 digits" is not really a concept that is valid (enough).
2^ -1 = 0.5000000000000000000000000000000000000000000000000000
2^ -2 = 0.2500000000000000000000000000000000000000000000000000
2^ -3 = 0.1250000000000000000000000000000000000000000000000000
2^ -4 = 0.0625000000000000000000000000000000000000000000000000
2^ -5 = 0.0312500000000000000000000000000000000000000000000000
2^ -6 = 0.0156250000000000000000000000000000000000000000000000
2^ -7 = 0.0078125000000000000000000000000000000000000000000000
2^ -8 = 0.0039062500000000000000000000000000000000000000000000
2^ -9 = 0.0019531250000000000000000000000000000000000000000000
2^-10 = 0.0009765625000000000000000000000000000000000000000000
2^-11 = 0.0004882812500000000000000000000000000000000000000000
2^-12 = 0.0002441406250000000000000000000000000000000000000000
2^-13 = 0.0001220703125000000000000000000000000000000000000000
2^-14 = 0.0000610351562500000000000000000000000000000000000000
2^-15 = 0.0000305175781250000000000000000000000000000000000000
2^-16 = 0.0000152587890625000000000000000000000000000000000000
2^-17 = 0.0000076293945312500000000000000000000000000000000000
2^-18 = 0.0000038146972656250000000000000000000000000000000000
2^-19 = 0.0000019073486328125000000000000000000000000000000000
2^-20 = 0.0000009536743164062500000000000000000000000000000000
2^-21 = 0.0000004768371582031250000000000000000000000000000000
2^-22 = 0.0000002384185791015625000000000000000000000000000000
2^-23 = 0.0000001192092895507812500000000000000000000000000000
2^-24 = 0.0000000596046447753906250000000000000000000000000000
2^-25 = 0.0000000298023223876953125000000000000000000000000000
2^-26 = 0.0000000149011611938476562500000000000000000000000000
2^-27 = 0.0000000074505805969238281250000000000000000000000000
2^-28 = 0.0000000037252902984619140625000000000000000000000000
2^-29 = 0.0000000018626451492309570312500000000000000000000000
2^-30 = 0.0000000009313225746154785156250000000000000000000000
2^-31 = 0.0000000004656612873077392578125000000000000000000000
2^-32 = 0.0000000002328306436538696289062500000000000000000000
2^-33 = 0.0000000001164153218269348144531250000000000000000000
2^-34 = 0.0000000000582076609134674072265625000000000000000000
2^-35 = 0.0000000000291038304567337036132812500000000000000000
2^-36 = 0.0000000000145519152283668518066406250000000000000000
2^-37 = 0.0000000000072759576141834259033203125000000000000000
2^-38 = 0.0000000000036379788070917129516601562500000000000000
2^-39 = 0.0000000000018189894035458564758300781250000000000000
2^-40 = 0.0000000000009094947017729282379150390625000000000000
2^-41 = 0.0000000000004547473508864641189575195312500000000000
2^-42 = 0.0000000000002273736754432320594787597656250000000000
2^-43 = 0.0000000000001136868377216160297393798828125000000000
2^-44 = 0.0000000000000568434188608080148696899414062500000000
2^-45 = 0.0000000000000284217094304040074348449707031250000000
2^-46 = 0.0000000000000142108547152020037174224853515625000000
2^-47 = 0.0000000000000071054273576010018587112426757812500000
2^-48 = 0.0000000000000035527136788005009293556213378906250000
2^-49 = 0.0000000000000017763568394002504646778106689453125000
2^-50 = 0.0000000000000008881784197001252323389053344726562500
2^-51 = 0.0000000000000004440892098500626161694526672363281250
2^-52 = 0.0000000000000002220446049250313080847263336181640625
@ThomasMcLeod: I think the significant digit rule comes from my field, physics, and means something more subtle:
If you have a measurement that gets you the value 1.52 and you cannot read any more detail off the scale, and say you are supposed to add another number (for example of another measurement because this one's scale was too small) to it, say 2, then the result (obviously) has only 2 decimal places, i.e. 3.52.
But likewise, if you add 1.1111111111 to the value 1.52, you get the value 2.63 (and nothing more!).
The reason for the rule is to prevent you from kidding yourself into thinking you got more information out of a calculation than you put in by the measurement (which is impossible, but would seem that way by filling it with garbage, see above).
That said, this specific rule is for addition only (for addition: the error of the result is the sum of the two errors - so if you measure just one badly, though luck, there goes your precision...).
How to get the other rules:
Let's say a is the measured number and δa the error. Let's say your original formula was:
f := m a
Let's say you also measure m with error δm (let that be the positive side).
Then the actual limit is:
f_up = (m + δm) (a + δa)
and
f_down = (m - δm) (a - δa)
So,
f_up   = m a + δm δa + (δm a + m δa)
f_down = m a + δm δa - (δm a + m δa)
Hence, now the significant digits are even less:
f_up   ≈ m a + (δm a + m δa)
f_down ≈ m a - (δm a + m δa)
and so
δf = δm a + m δa
If you look at the relative error, you get:
δf/f = δm/m + δa/a
For division it is
δf/f = δm/m - δa/a
Hope that gets the gist across and hope I didn't make too many mistakes, it's late here :-)
tl;dr: Significant digits mean how many of the digits in the output actually come from the digits in your input (in the real world, not the distorted picture that floating point numbers have).
If your measurements were 1 with "no" error and 3 with "no" error and the function is supposed to be 1/3, then yes, all infinite digits are actual significant digits. Otherwise, the inverse operation would not work, so obviously they have to be.
If significant digit rule means something completely different in another field, carry on :-)

How to detect if a base 10 decimal can be represented exactly in base 2

As part of a numerical library test I need to choose base 10 decimal numbers that can be represented exactly in base 2. How do you detect in C++ if a base 10 decimal number can be represented exactly in base 2?
My first guess is as follows:
bool canBeRepresentedInBase2(const double &pNumberInBase10)
{
    //check if a number in base 10 can be represented exactly in base 2
    //reference: http://en.wikipedia.org/wiki/Binary_numeral_system
    bool funcResult = false;
    int nbOfDoublings = 16*3;
    double doubledNumber = pNumberInBase10;
    for (int i = 0; i < nbOfDoublings; i++)
    {
        doubledNumber = 2*doubledNumber;
        double intPart;
        double fracPart = modf(doubledNumber/2, &intPart);
        if (fracPart == 0) //number can be represented exactly in base 2
        {
            funcResult = true;
            break;
        }
    }
    return funcResult;
}
I tested this function with the following values: -1.0/4.0, 0.0, 0.1, 0.2, 0.205, 1.0/3.0, 7.0/8.0, 1.0, 256.0/255.0, 1.02, 99.005. It returns true for -1.0/4.0, 0.0, 7.0/8.0, 1.0, 99.005 which is correct.
Any better ideas?
I think what you are looking for is a number which has a fractional portion which is the sum of a sequence of negative powers of 2 (aka: 1 over a power of 2). I believe this should always be able to be represented exactly in IEEE floats/doubles.
For example:
0.375 = (1/4 + 1/8) which should have an exact representation.
If you want to generate these. You could try do something like this:
#include <iostream>
#include <cstdlib>
#include <ctime>

int main() {
    srand(time(0));
    double value = 0.0;
    for (int i = 1; i < 256; i *= 2) {
        // doesn't matter, some random probability of including this
        // fraction in our sequence..
        if ((rand() % 3) == 0) {
            value += (1.0 / static_cast<double>(i));
        }
    }
    std::cout << value << std::endl;
}
EDIT: I believe your function has a broken interface. It would be better if you had this:
bool canBeRepresentedExactly(int numerator, int denominator);
because not all fractions have exact representations, but the moment you shove it into a double, you've chosen a representation in binary... defeating the purpose of the test.
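Along the lines of that interface, a minimal sketch (C++17 for std::gcd): a fraction in lowest terms is exactly representable in binary, given enough significand bits, exactly when its denominator is a power of two.
#include <cstdio>
#include <numeric>  // std::gcd

bool canBeRepresentedExactly(long long numerator, long long denominator) {
    long long g = std::gcd(numerator, denominator);
    denominator /= g;                         // reduce to lowest terms
    while (denominator % 2 == 0) denominator /= 2;
    return denominator == 1;                  // only factors of 2 left?
}

int main() {
    std::printf("%d %d %d\n",
                canBeRepresentedExactly(3, 8),   // 1: 0.375 is exact
                canBeRepresentedExactly(1, 10),  // 0: 0.1 is not
                canBeRepresentedExactly(1, 3));  // 0: 1/3 is not
}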
If you're checking to see if it's binary, it will always return true. If your method takes a double as the parameter, the number is already represented in binary (double is a binary type, usually 64 bits). Looking at your code, I think you're actually trying to see if it can be represented exactly as an integer, in which case why can't you just cast to int, then back to double and compare to the original? Any integer stored in a double that's within the range representable by an int should be exact, IIRC, because a 64 bit double has 53 bits of mantissa (and I'm assuming a 32 bit int). That means if they're equal, it's an integer.
If you're passing in a double, then by definition, it has already been represented in binary and if not, then you've already lost accuracy.
Maybe try passing in numerator and denominator of the fraction to the function. Then you have not lost accuracy and can check to see if you can come up with a binary representation of the answer that is the same as the fraction you've passed in.
As rmeador has pointed out, it might not be a good idea to accept the double, because the number has already been converted to a double, a possible approximation of the number that you're trying to check.
So, in a very abstract way, you should split your check into integers and decimals. Integers should not be so large that the mantissa cannot express them exactly (e.g. 9007199254740993 cannot be represented properly by a 64-bit fp).
Decimal parts may be a bit easier, mentally, because if the fractional part (e.g. yyy in xxx.yyy), written as a fraction in lowest terms, has a denominator with any prime factor other than 2, the binary expansion repeats when trying to represent it. It's the reason why 1/3 cannot be represented with finite digits in base 10 = base (2*5)... See Recurring Decimal.
EDIT: As the comments pointed out, saying that the denominator of the fractional part must contain no prime factor other than 2 is the mathematically correct way to put it...
As others have mentioned, your method doesn't do what you mean, since you pass a number represented as a (binary) double. The method actually detects whether the number you passed is of the form integer/2^48. It should fail for numbers like (1 + 2^-50), which is binary, and 259/255, which isn't.
If you really want to test a number for being exactly representable by a finite binary string, you have to pass the number in an exact form.
You can't pass IN a Double because it's already lost precision. You should be able to use the toString() method of Double to check for this. (example in Java)
public static Boolean canBeRepresentedInBase2(String thenumber)
{
    // Returns true if the parsed Double did not lose precision.
    // Only works for numbers that are not converted into scientific notation by toString.
    return thenumber.equals(Double.toString(Double.parseDouble(thenumber)));
}
You asked for C++ but maybe this algorithm will help. I use "EE" to mean "exactly expressible as a float."
Start with a decimal representation of the number you want to test. Remove any trailing zeroes (that is, 0.123450000 becomes 0.12345).
1) If the number is not an integer, check to see if the rightmost digit is 5. If it's not, then stop -- the number is not EE.
2) Multiply the number by 2. If the result is an integer, then stop -- the number is EE. Otherwise, go back to step 1.
I don't have rigorous proof for this but a "warm fuzzy." Fire up Calculator and enter your favorite fractional power of 2, like 0.0000152587890625. Add it to itself a few dozen times (I just hit "+" once then "=" a bunch of times). If there are any non-zero digits to the right of the decimal point, the last digit is always 5.
Here is the code in C#, and it works. Because it works with the decimal type, there are no inherent rounding errors of the kind that show up in the original code, which uses double. (decimal in C# is stored in base 10 instead of base 2, which is what double uses.)
static bool canBeRepresentedInBase2(decimal pNumberInBase10)
{
    //check if a number in base 10 can be represented exactly in base 2
    //reference: http://en.wikipedia.org/wiki/Binary_numeral_system
    bool funcResult = false;
    int nbOfDoublings = 16*3;
    decimal doubledNumber = pNumberInBase10;
    for (int i = 0; i < nbOfDoublings; i++)
    {
        doubledNumber = 2*doubledNumber;
        decimal intPart;
        decimal fracPart = ModF(doubledNumber/2, out intPart);
        if (fracPart == 0) //number can be represented exactly in base 2
        {
            funcResult = true;
            break;
        }
    }
    return funcResult;
}

static decimal ModF(decimal number, out decimal intPart)
{
    intPart = Math.Floor(number);
    decimal fractional = number - (intPart);
    return fractional;
}
Tested with the following code (where WL does a Console.WriteLine - SnippetCompiler)
WL(canBeRepresentedInBase2(-1.0M/4.0M)); //true
WL(canBeRepresentedInBase2(0.0M)); //true
WL(canBeRepresentedInBase2(0.1M)); //false
WL(canBeRepresentedInBase2(0.2M)); //false
WL(canBeRepresentedInBase2(0.205M)); //false
WL(canBeRepresentedInBase2(1.0M/3.0M)); //false
WL(canBeRepresentedInBase2(7.0M/8.0M)); //true
WL(canBeRepresentedInBase2(1.0M)); //true
WL(canBeRepresentedInBase2(256.0M/255.0M)); //false
WL(canBeRepresentedInBase2(1.02M)); //false
WL(canBeRepresentedInBase2(99.005M)); //false
WL(canBeRepresentedInBase2(2.53M)); //false
Or even easier:
return pNumber == floor(pNumber);
On the other hand, if you have some weird fractional representation (numerator denominator pair, or string with a decimal in it, or something), and you really do want to know if the value can be exactly represented as a double, it's a bit harder.
But you would need a different parameter(s) for that...
Given a number r, it can be represented exactly with finite precision in base 2 iff r can be written as r = m/2^n, where m, n are integers and n >= 0.
For example 1/7 doesn't have a finite binary expression, and 1/6 and 1/10 can't be written with a finite expression in base 2 either.
But 1/4 + 1/32 + 1/1024 has a finite expression in base 2.
PS: In general you can express a number r with finitely many digits in a base b iff r = m/b^n where m, n are integers and n >= 0.
PPS: As almost everybody has stated previously, using a double as input is a bad idea, because you are losing precision, and you will end up with a different number.
I don't think this is what he's asking... I think he's looking for a solution that will tell him if a number can be represented EXACTLY in binary form. For example, 33.3... that's a number that cannot be represented exactly in binary, because it will go on forever, so depending on your FPU settings it will be represented as something like "33.333333333333336". So it looks like his method will do the job. I don't know of a better way off the top of my head.
Ignoring the general criticism of using a double...
For a general finite decimal, you can determine if it has a finite representation in binary with the following algorithm:
Extract the fraction part of the decimal, f.
Determine f × 10^b = c, where b and c are integers.
Determine 2^d >= 10^b, where d is an integer.
If c × 2^d / 10^b is an integer, then the decimal has a finite representation in binary. Otherwise, it doesn't.
You can generalize this to any two bases.
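A sketch of that algorithm in C++, working on the fraction digits of the decimal string. Since f = c / 10^b and 10^b = 2^b · 5^b, multiplying by 2^d (with 2^d >= 10^b) leaves an integer exactly when 5^b divides c, so the test reduces to stripping factors of 5. The unsigned 64-bit arithmetic limits this sketch to about 19 fraction digits:
#include <cstdint>
#include <cstdio>
#include <string>

bool fractionIsFiniteInBase2(const std::string &fractionDigits) {
    std::uint64_t c = 0;                                       // c = fraction digits as an integer
    for (char ch : fractionDigits) c = c * 10 + (ch - '0');
    if (c == 0) return true;                                   // .000... is exact
    for (std::size_t i = 0; i < fractionDigits.size(); ++i) {  // b divisions by 5
        if (c % 5 != 0) return false;                          // a 5 survives: repeats in base 2
        c /= 5;
    }
    return true;
}

int main() {
    std::printf("%d %d %d\n",
                fractionIsFiniteInBase2("375"),   // 1: 0.375 is exact in binary
                fractionIsFiniteInBase2("1"),     // 0: 0.1 is not
                fractionIsFiniteInBase2("005"));  // 0: 0.005 is not
}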