printing float, preserving precision - c++

I am writing a program that prints floating point literals to be used inside another program.
How many digits do I need to print in order to preserve the precision of the original float?
Since a float has 24 * (log(2) / log(10)) = 7.2247199 decimal digits of precision, my initial thought was that printing 8 digits should be enough. But if I'm unlucky, those 0.2247199 get distributed to the left and to the right of the 7 significant digits, so I should probably print 9 decimal digits.
Is my analysis correct? Is 9 decimal digits enough for all cases? Like printf("%.9g", x);?
Is there a standard function that converts a float to a string with the minimum number of decimal digits required for that value, in the cases where 7 or 8 are enough, so I don't print unnecessary digits?
Note: I cannot use hexadecimal floating point literals, because standard C++ does not support them.

In order to guarantee that a binary->decimal->binary round trip recovers the original binary value, IEEE 754 requires:
5 decimal digits for binary16
9 decimal digits for binary32
17 decimal digits for binary64
36 decimal digits for binary128
For other binary formats the required number of decimal digits is
1 + ceiling(p*log10(2))
where p is the number of significant bits in the binary format, e.g. 24 bits for binary32.
In C, the functions you can use for these conversions are snprintf() and strtof/strtod/strtold().
Of course, in some cases even more digits can be useful (no, they are not always "noise", depending on the implementation of the decimal conversion routines such as snprintf() ). Consider e.g. printing dyadic fractions.
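As a concrete illustration, here is a minimal sketch of the round trip with snprintf() and strtof() (it assumes an IEEE 754 binary32 float; FLT_DECIMAL_DIG needs a C11/C++17 library, otherwise just write 9):
#include <cfloat>   // FLT_DECIMAL_DIG (C11 / C++17)
#include <cstdio>
#include <cstdlib>

int main()
{
    float original = 0.1f * 3.0f;   // not exactly 0.3
    char buf[32];

    // 9 significant digits (FLT_DECIMAL_DIG) are enough for any binary32 value.
    std::snprintf(buf, sizeof buf, "%.*g", FLT_DECIMAL_DIG, original);

    float restored = std::strtof(buf, nullptr);
    std::printf("%s -> %s\n", buf, original == restored ? "round trip ok" : "precision lost");
}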

24 * (log(2) / log(10)) = 7.2247199
That's pretty representative of the problem. It makes no sense whatsoever to express the number of significant digits with an accuracy of 0.0000001 digits. You are converting numbers to text for the benefit of a human, not a machine. A human couldn't care less and would much prefer that you wrote
24 * (log(2) / log(10)) = 7
Trying to display 8 significant digits just generates random noise digits, and there are non-zero odds that even 7 is already too much, because floating point error accumulates in calculations. Above all, print numbers using a reasonable unit of measure. People are interested in millimeters, grams, pounds, inches, etcetera. No architect will care about the size of a window expressed more accurately than 1 mm. No window manufacturing plant will promise a window sized as accurately as that.
Last but not least, you cannot ignore the accuracy of the numbers you feed into your program. Measuring the speed of an unladen European swallow down to 7 digits is not possible. It is roughly 11 meters per second, 2 digits at best. So performing calculations on that speed and printing a result that has more significant digits produces nonsensical results that promise accuracy that isn't there.

If you have a C library that conforms to C99 (and if your float types use a base that is a power of 2 :), the printf format character %a can print floating point values in hexadecimal form without loss of precision, and functions such as scanf and strtod will be able to read them.
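For illustration, a small sketch of that approach (the hexadecimal output is exact, so the round trip cannot lose bits; the exact text printed may differ between implementations):
#include <cstdio>

int main()
{
    float x = 0.1f;
    char buf[64];

    std::snprintf(buf, sizeof buf, "%a", x);   // exact hexadecimal form, e.g. 0x1.99999ap-4
    std::printf("%s\n", buf);

    float y;
    std::sscanf(buf, "%a", &y);                // scanf/strtof accept the same notation
    std::printf("%s\n", x == y ? "exact" : "lost");
}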

If the program is meant to be read by a computer, I would do the simple trick of using char* aliasing.
alias float* to char*
copy into an unsigned (or whatever unsigned type is sufficiently large) via char* aliasing
print the unsigned value
Decoding is just reversing the process (and on most platforms a direct reinterpret_cast can be used).
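A hedged sketch of that idea (using memcpy rather than a raw cast, which sidesteps strict-aliasing questions; it assumes float is 32 bits wide):
#include <cinttypes>
#include <cstdio>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "unexpected float size");

// Encode: copy the float's bit pattern into an unsigned integer.
std::uint32_t encode(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Decode: reverse the process.
float decode(std::uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

int main()
{
    float original = 3.14159f;
    std::uint32_t wire = encode(original);
    std::printf("%" PRIu32 "\n", wire);   // the integer that would go into the generated program
    std::printf("%s\n", decode(wire) == original ? "exact" : "lost");
}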

The floating-point-to-decimal conversion used in Java is guaranteed to produce the smallest number of decimal digits beyond the decimal point needed to distinguish the number from its neighbors (more or less).
You can copy the algorithm from here: http://www.docjar.com/html/api/sun/misc/FloatingDecimal.java.html
Pay attention to the FloatingDecimal(float) constructor and the toJavaFormatString() method.

If you read these papers (see below), you'll find that there are algorithms that print the minimum number of decimal digits such that the number can be re-interpreted unchanged (i.e. by scanf).
Since there might be several such numbers, the algorithm also picks the decimal fraction nearest to the original binary fraction (the float value).
It's a pity that there's no such function in the C standard library.
http://www.cs.indiana.edu/~burger/FP-Printing-PLDI96.pdf
http://grouper.ieee.org/groups/754/email/pdfq3pavhBfih.pdf

You can use sprintf. I am not sure whether this answers your question exactly, but anyway, here is some sample code:
#include <stdio.h>

int main( void )
{
    float d_n = 123.45;
    char s_cp[13] = { '\0' };
    char s_cnp[4] = { '\0' };
    /*
     * with sprintf you need to make sure there's enough space
     * declared in the array
     */
    sprintf( s_cp, "%.2f", d_n );
    printf( "%s\n", s_cp );
    /*
     * snprintf lets you control how much is written into the array.
     * It might have portability issues if you are not using C99.
     */
    snprintf( s_cnp, sizeof s_cnp, "%f", d_n );
    printf( "%s\n", s_cnp );
    getchar();
    return 0;
}
/* output :
 * 123.45
 * 123
 */

With something like
def f(a):
    b = 0
    while a != int(a): a *= 2; b += 1
    return a, b
(which is Python) you should be able to get mantissa and exponent in a loss-free way.
In C, this would probably be
#include <math.h>

struct float_decomp {
    float mantissa;
    int exponent;
};

struct float_decomp decomp(float x)
{
    struct float_decomp ret = { .mantissa = x, .exponent = 0 };
    while (ret.mantissa != floorf(ret.mantissa)) {
        ret.mantissa *= 2;
        ret.exponent += 1;
    }
    return ret;
}
But be aware that not all values can necessarily be handled this way; it is just a quick shot which should give the idea, but probably needs improvement.
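For example, with the fixed-up decomp() above (and assuming the convention that x == mantissa * 2^(-exponent)), a quick check might look like:
#include <math.h>
#include <stdio.h>

int main(void)
{
    float x = 0.1f;
    struct float_decomp d = decomp(x);
    printf("%.0f / 2^%d\n", d.mantissa, d.exponent);   /* an integral mantissa and a binary exponent */
    printf("%s\n", ldexpf(d.mantissa, -d.exponent) == x ? "exact" : "lost");
    return 0;
}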


float number to string converting implementation in STD

I've run into a curious issue. Look at this simple code:
int main(int argc, char **argv) {
    char buf[1000];
    snprintf_l(buf, sizeof(buf), _LIBCPP_GET_C_LOCALE, "%.17f", 0.123e30f);
    std::cout << "WTF?: " << buf << std::endl;
}
The output looks quite weird:
123000004117574256822262431744.00000000000000000
My question is: how is it implemented? Can someone show me the original code? I did not find it. Or maybe it's too complicated for me.
I've tried to reimplement the same double-to-string transformation in Java but failed. Even when I extracted the exponent and fraction parts separately and summed the fractions in a loop, I always got zeros instead of the digits "...822262431744". And when I tried to keep summing fractions beyond the 23 bits (of a float), I ran into another issue: how many fraction bits do I need to collect? Why does the original code stop at the left part and not continue until the full scale is reached?
So I really do not understand the basic logic of how it is implemented. I've tried defining really big numbers (e.g. 0.123e127f), and it generates a huge number in decimal format. The number appears to have much higher precision than a float can hold. That looks like a problem, because the string representation contains something that a float number cannot.
Please read the documentation:
printf, fprintf, sprintf, snprintf, printf_s, fprintf_s, sprintf_s, snprintf_s - cppreference.com
The format string consists of ordinary multibyte characters (except %), which are copied unchanged into the output stream, and conversion specifications. Each conversion specification has the following format:
introductory % character
...
(optional) . followed by integer number or *, or neither that specifies precision of the conversion. In the case when * is used, the precision is specified by an additional argument of type int, which appears before the argument to be converted, but after the argument supplying minimum field width if one is supplied. If the value of this argument is negative, it is ignored. If neither a number nor * is used, the precision is taken as zero. See the table below for exact effects of precision.
....
Conversion specifier: f, F
Explanation: converts a floating-point number to decimal notation in the style [-]ddd.ddd. Precision specifies the exact number of digits to appear after the decimal point character. The default precision is 6. In the alternative implementation the decimal point character is written even if no digits follow it. For infinity and not-a-number conversion style see notes.
Expected argument type: double
So with f you forced the ddd.ddd form (no exponent), and with .17 you forced 17 digits to be shown after the decimal separator. With such a big value the printed outcome looks that odd.
Finally I've found out the difference between the Java float -> decimal -> string conversion and the C++ float -> string (decimal) conversion. I did not find the original source code, but I replicated the same logic in Java to make it clear. I think the code explains everything:
// the context size might be calculated properly by getting maximum
// float number (including exponent value) - its 40 + scale, 17 for me
MathContext context = new MathContext(57, RoundingMode.HALF_UP);
BigDecimal divisor = BigDecimal.valueOf(2);
int tmp = Float.floatToRawIntBits(1.23e30f);
boolean sign = tmp < 0;
tmp <<= 1;
// there might be NaN value, this code does not support it
int exponent = (tmp >>> 24) - 127;
tmp <<= 8;
int mask = 1 << 23;
int fraction = mask | (tmp >>> 9);
// at this line we have all parts of the float: sign, exponent and fractions. Let's build mantissa
BigDecimal mantissa = BigDecimal.ZERO;
for (int i = 0; i < 24; i++) {
    if ((fraction & mask) == mask) {
        // I'm not sure about speed; division at each iteration might be faster than pow
        mantissa = mantissa.add(divisor.pow(-i, context));
    }
    mask >>>= 1;
}
// it was the core line where I was losing accuracy, because of context
BigDecimal decimal = mantissa.multiply(divisor.pow(exponent, context), context);
String str = decimal.setScale(17, RoundingMode.HALF_UP).toPlainString();
// add the minus sign manually, because Java drops it if the value becomes 0 after scaling; the C++ version of the code doesn't
if (sign) {
    str = "-" + str;
}
return str;
Maybe this topic is useless. Who really needs the same implementation that C++ has? But at least this code keeps all the precision of the float number, compared with the most popular way of converting a float to a decimal string:
return BigDecimal.valueOf(1.23e30f).setScale(17, RoundingMode.HALF_UP).toPlainString();
The C++ implementation you are using uses the IEEE-754 binary32 format for float. In this format, the closest representable value to 0.123•10^30 is 123,000,004,117,574,256,822,262,431,744, which is represented in the binary32 format as +13,023,132•2^73. So 0.123e30f in the source code yields the number 123,000,004,117,574,256,822,262,431,744. (Because the number is represented as +13,023,132•2^73, we know its value is exactly that, which is 123,000,004,117,574,256,822,262,431,744, even though the digits "123000004117574256822262431744" are not stored directly.)
Then, when you format it with %.17f, your C++ implementation prints the exact value faithfully, yielding “123000004117574256822262431744.00000000000000000”. This accuracy is not required by the C++ standard, and some C++ implementations will not do the conversion exactly.
The Java specification also does not require formatting of floating-point values to be exact, at least in some formatting operations. (I am going from memory and some supposition here; I do not have a citation at hand.) It allows, perhaps even requires, that only a certain number of correct digits be produced, after which zeros are used if needed for positioning relative to the decimal point or for the requested format.
The number has much higher precision than float can be.
For any value represented in the float format, that value has infinite precision. The number +13,023,132•2^73 is exactly +13,023,132•2^73, which is exactly 123,000,004,117,574,256,822,262,431,744, to infinite precision. The precision the format has for representing numbers affects only which numbers it can represent, not how precisely it represents the numbers that it does represent.
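You can observe both facts directly; here is a small sketch (assuming IEEE 754 binary32 and a C library with exact decimal conversions, such as glibc):
#include <cstdio>

int main()
{
    float f = 0.123e30f;
    // %.0f prints the integral value the float actually holds,
    // %a shows the significand and binary exponent directly.
    std::printf("%.0f\n", f);   // 123000004117574256822262431744
    std::printf("%a\n", f);     // e.g. 0x1.8d6f38p+96, i.e. 13,023,132 * 2^73
}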

Can you help me to understand what "significant digits" means in floating point math?

From what I'm learning, once I convert a floating point value to a decimal one, the "significant digits" I need are a fixed number (17 for double, for example): 17 in total, counting digits before and after the decimal separator.
So for example this code:
#include <iostream>
#include <limits>

typedef std::numeric_limits<double> dbl;

int main()
{
    std::cout.precision(dbl::max_digits10);
    //std::cout << std::fixed;

    double value1 = 1.2345678912345678912345;
    double value2 = 123.45678912345678912345;
    double value3 = 123456789123.45678912345;

    std::cout << value1 << std::endl;
    std::cout << value2 << std::endl;
    std::cout << value3 << std::endl;
}
will correctly "show me" 17 values:
1.2345678912345679
123.45678912345679
123456789123.45679
But if I increase the precision for cout (i.e. std::cout.precision(100)), I can see there are more digits beyond those 17:
1.2345678912345678934769921397673897445201873779296875
123.456789123456786683163954876363277435302734375
123456789123.456787109375
Why should I ignore them? They are stored within the variable/double as well, so they will affect the math later (division, multiplication, sums, and so on).
What does "significant digits" mean? There is other...
Can you help me to understand what “significant digits” means in floating point math?
With FP numbers, as with mathematical real numbers, significant digits are the digits of a value starting with the first digit that is not 0 and continuing, depending on context, to 1) the decimal point, 2) the last non-zero digit, or 3) the last printed digit.
123. // 3 significant decimal digits
123.125 // 6 significant decimal digits
0.0078125 // 5 significant decimal digits
0x0.00123p45 // 3 significant hexadecimal digits
123000.0 // 3, 6, or 7 significant decimal digits depending on context
When concerned about decimal significant digits and FP types like double, the issue is often "How many decimal significant digits are needed or of concern?"
Nearly all C FP implementations use a binary encoding such that all finite FP values are exact sums of powers of 2. Each finite FP value is exact. The common encoding gives most double values 53 binary digits in the significand - so 53 significant binary digits. How this appears as a decimal is often the source of confusion.
// Example 0.1 is not an exact sum of powers of 2 so a nearby value is used.
double x = 0.1;
// x takes on the exact value of
// 0.1000000000000000055511151231257827021181583404541015625
// aka 0x1.999999999999ap-4
// aka base2: 0.000110011001100110011001100110011001100110011001100110011010
// The preceding and subsequent doubles
// 0.09999999999999999167332731531132594682276248931884765625
// 0.10000000000000001942890293094023945741355419158935546875
// 123456789012345678901234567890123456789012345678901234567890
Looking at the above, one could say x has over 50 decimal significant digits. Yet the value matches the intended 0.1 to 16 decimal significant digits. Or, since the preceding and subsequent possible double values differ in the 17th place, one could say x has 17 decimal significant digits.
What does "significant digits" mean?
Various meanings of significant digits exist, but for C, 2 common ones are:
The number of decimal significant digits such that any decimal text with that many digits, converted to double and back to text, matches the original. This is typically 15. C specifies this as DBL_DIG, which must be at least 10.
The number of decimal significant digits with which a double needs to be printed to distinguish it from any other double. This is typically 17. C specifies this as DBL_DECIMAL_DIG, which must be at least 10.
Why should I ignore them?
It depends on your coding goals. Rarely are all digits of the exact value needed. (DBL_TRUE_MIN might have 752 of them.) For most applications, DBL_DECIMAL_DIG is enough. In select apps, DBL_DIG will do. So usually, ignoring digits past 17 does not cause problems.
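A small sketch contrasting the two constants (DBL_DECIMAL_DIG needs a C11/C++17 library; the exact text shown assumes IEEE 754 binary64 and correctly rounded conversions):
#include <cfloat>
#include <cstdio>

int main()
{
    double x = 0.1;
    // DBL_DIG (typically 15): decimal text -> double -> decimal text survives.
    std::printf("%.*g\n", DBL_DIG, x);           // 0.1
    // DBL_DECIMAL_DIG (typically 17): double -> decimal text -> double survives.
    std::printf("%.*g\n", DBL_DECIMAL_DIG, x);   // 0.10000000000000001
}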
Keep in mind that floating-point values are not real numbers. There are gaps between the values, and all those extra digits, while meaningful for real numbers, don’t reflect any difference in the floating-point value. When you convert a floating-point value to text, having std::numeric_limits<...>::max_digits10 digits ensures that you can convert the text string back to floating-point and get the original value. The extra digits don’t affect the result.
The extra digits that you see when you ask for more digits are the result of the conversion algorithm trying to do what you asked. The algorithm typically just keeps extracting digits until it reaches the desired precision; it could be written to start outputting zeros after it’s written max_digits10 digits, but that’s an additional complication that nobody bothers with. It wouldn’t really be helpful.
just to add to Pete Becker's answer, I think you're confusing the problem of finding the exact decimal representation of a binary mantissa, with the problem of finding some decimal representation uniquely representing that binary mantissa ( given some fixed rounding scheme ).
Now, regarding the first problem, you always need a finite number of decimal digits to exactly represent a binary mantissa ( because 2 divides 10 ).
For example, you need 18 decimal digits to exactly represent the binary 1.0000000000000001, being 1.00000762939453125 in decimal.
but you need just 17 digits to represent it uniquely as 1.0000076293945312 because no other number having exact value 1.0000076293945312xyz... where 0<=x<5 can exist as a double ( more precisely, the next and prior exactly representable values being 1.0000076293945314720446049250313080847263336181640625 and 1.0000076293945310279553950749686919152736663818359375 ).
Of course, this does not mean that given some decimal number you can ignore all digits past the 17th; it just means that if you apply the same rounding scheme used to produce the decimal at the 17th position and assign it back to a double you'll get the same original double.

c++ how to get "one digit exponent" with printf

Is there a way to print, in scientific notation, fewer than 3 digits for the exponent part of a number?
The 6.1 formatting doesn't affect the exponent, only the number part:
var=1.23e-9;
printf ("%e\n", var);
printf ("%6.1e\n", var);
gives
1.230000e-009
1.2e-009
I've also tried this in wxWidgets with string formatting, but the behavior is the same.
m_var->SetLabel(wxString::Format(wxT("%6.1e"),var));
What I'd like to have is 1.2e-9.
According to Wikipedia:
The exponent always contains at least two digits; if the value is
zero, the exponent is 00. In Windows, the exponent contains three
digits by default, e.g. 1.5e002, but this can be altered by
Microsoft-specific _set_output_format function.
_set_output_format
I've had to do this a lot (I write file parsers and some file formats like NITF require you to store numeric values as strings).
What you do is an exploit based on what base-10 math (scientific notation) really means: for every real number y, y = x * 10^N for some integer N and some x with 1 <= |x| < 10 (or x = 0).
So, you do the following
#include <math.h>
#include <stdio.h>

void PrintScientific(double d)
{
    int exponent = (int)floor(log10(fabs(d)));   // This will round down the exponent
    double base = d * pow(10, -1.0 * exponent);
    printf("%lfE%+01d", base, exponent);
}
You can add all the format specifiers you need to control the # of chars before, after the "." decimal place.
Do NOT forget the rounding step! This is how it works, using the properties of base10 and logarithms (base 10 here):
Let y = x * 10^N
=> log(y) = log(x * 10^N)
=> log(y) = log(x) + log(10^N)   (from the log "product" rule)
=> log(y) = log(x) + N
Since |x| is in the range [1, 10), log(|x|) is in the range [0, 1). So when we round down for the integer conversion, we're dropping the log(x) contribution.
You can then get the "x" portion from the original number, which lets you output the original in any scientific notation you want to use.
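A quick usage check of the sketch above (assuming <math.h> and <stdio.h> are included); note that it misbehaves for 0.0, infinities and NaN, as the following answers point out:
int main(void)
{
    PrintScientific(1.23e-9);   /* prints 1.230000E-9 */
    return 0;
}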
With standard C printf() this can't be done (and the use of three digits by default seems wrong as well), at least in C99 (I don't have a newer version at hand). The relevant quote from the C99 standard is at 7.19.6.1 paragraph 8, formats e,f:
.... The exponent always contains at least two digits, and only as many more digits as necessary to represent the exponent. If the value is zero, the exponent is zero. ...
The best bet to fit this [portably] into code using lots of these outputs is to use C++ IOStreams: although the default formatting is the same as in C, it is possible to install a custom facet into the stream's std::locale which does the formatting the way you need. That said, writing the formatting code might not be entirely trivial, although I would probably just build on the standard conversion and then remove the excess zeros after the e character.
I found Zach's answer to be the fastest and simplest method, and it is also applicable to any OS. I did find that two modifications were needed on the "base =" line for it to work for all numbers (otherwise I got NaNs when the exponent is negative in Cygwin). The extra print statement is just for Patran neutral file compatibility. I would have upvoted his answer, but I just started on Stack Exchange so I don't have sufficient "reputation".
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void PrintScientific(double d)
{
    int exponent = (int)floor(log10(fabs(d)));   // This will round down the exponent
    double base = (d * pow(10.0, -1 * exponent));
    if (abs(exponent) < 10)
        printf("%13.9lfE%+01d", base, exponent);
    else
        printf("%12.9lfE%+01d", base, exponent);
}
C/C++ specifies at least two exponent digits with printf("%e",...). To print only 1, and to deal with Visual Studio which, by default, prints at least 3, additional code is needed.
Consider IOStreams #Dietmar Kühl
If C++ code still wants to use printf() style formats:
Adjusting the value of a double before calling printf() too often results in rounding issues, range shortcomings and general corner-case failures like dealing with log10(0.0). Also consider a large double just near a power of 10 where log10() may come up short, as well as -0.0, INF and NAN.
In this case, better to post-process the string.
double var = 1.23e-9;
// - 1 . x e - EEEEE \0
#define ExpectedSize (1+1+1+1+1+1+ 5 + 1)
char buf[ExpectedSize + 10];
snprintf(buf, sizeof buf, "%.1e", var);
char *e = strchr(buf, 'e');  // lucky: 'e' is not in "Infinity" nor "NaN"
if (e) {
    e++;
    int expo = atoi(e);
    snprintf(e, sizeof buf - (e - buf), "%1d", expo);
}
printf("'%6s'\n", buf);  // '1.2e-9'
Note: %e is amenable to post-processing, as its width is not as unwieldy as that of "%f". sprintf(buf, "%f", DBL_MAX) could produce hundreds of characters.

How does Excel successfully round floating point numbers even though they are imprecise?

For example, this blog says 0.005 is not exactly 0.005, but rounding that number yields the right result.
I have tried all kinds of rounding in C++ and it fails when rounding numbers to certain decimal places. For example, Round(x,y) rounds x to a multiple of y. So Round(37.785,0.01) should give you 37.79 and not 37.78.
I am reopening this question to ask the community for help. The problem is the imprecision of floating point numbers (37.785 is represented as 37.78499999999...).
The question is how does Excel get around this problem?
The solution in this round() for float in C++ is incorrect for the above problem.
"Round(37.785,0.01) should give you 37.79 and not 37.78."
First off, there is no consensus that 37.79 rather than 37.78 is the "right" answer here. Tie-breakers are always a bit tough. While always rounding up in the case of a tie is a widely-used approach, it certainly is not the only approach.
Secondly, this isn't a tie-breaking situation. The numerical value in the IEEE binary64 floating point format is 37.784999999999997 (approximately). There are lots of ways to get a value of 37.784999999999997 besides a human typing in a value of 37.785 and happen to have that converted to that floating point representation. In most of these cases, the correct answer is 37.78 rather than 37.79.
Addendum
Consider the following Excel formulae:
=ROUND(37785/1000,2)
=ROUND(19810222/2^19+21474836/2^47,2)
Both cells will display the same value, 37.79. There is a legitimate argument over whether 37785/1000 should round to 37.78 or 37.79 with two place accuracy. How to deal with these corner cases is a bit arbitrary, and there is no consensus answer. There isn't even a consensus answer inside Microsoft: "the Round() function is not implemented in a consistent fashion among different Microsoft products for historical reasons." ( http://support.microsoft.com/kb/196652 ) Given an infinite precision machine, Microsoft's VBA would round 37.785 to 37.78 (banker's round) while Excel would yield 37.79 (symmetric arithmetic round).
There is no argument over the rounding of the latter formula. It is strictly less than 37.785, so it should round to 37.78, not 37.79. Yet Excel rounds it up. Why?
The reason has to do with how real numbers are represented in a computer. Microsoft, like many others, uses the IEEE 64 bit floating point format. The number 37785/1000 suffers from precision loss when expressed in this format. This precision loss does not occur with 19810222/2^19+21474836/2^47; it is an "exact number".
I intentionally constructed that exact number to have the same floating point representation as does the inexact 37785/1000. That Excel rounds this exact value up rather than down is the key to determining how Excel's ROUND() function works: It is a variant of symmetric arithmetic rounding. It rounds based on a comparison to the floating point representation of the corner case.
The algorithm in C++:
#include <cmath>   // std::floor

// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
    double result = 1.0;
    double base = 10.0;
    while (exponent > 0) {
        if ((exponent & 1) != 0) result *= base;
        exponent >>= 1;
        base *= base;
    }
    return result;
}

// Round the same way Excel does.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double excel_round (double x, int nplaces) {
    bool is_neg = false;

    // Excel uses symmetric arithmetic round: Round away from zero.
    // The algorithm will be easier if we only deal with positive numbers.
    if (x < 0.0) {
        is_neg = true;
        x = -x;
    }

    // Construct the nearest rounded values and the nasty corner case.
    // Note: We really do not want an optimizing compiler to put the corner
    // case in an extended double precision register. Hence the volatile.
    double round_down, round_up;
    volatile double corner_case;
    if (nplaces < 0) {
        double scale = pow10 (-nplaces);
        round_down  = std::floor (x / scale);
        corner_case = (round_down + 0.5) * scale;
        round_up    = (round_down + 1.0) * scale;
        round_down *= scale;
    }
    else {
        double scale = pow10 (nplaces);
        round_down  = std::floor (x * scale);
        corner_case = (round_down + 0.5) / scale;
        round_up    = (round_down + 1.0) / scale;
        round_down /= scale;
    }

    // Round by comparing to the corner case.
    x = (x < corner_case) ? round_down : round_up;

    // Correct the sign if needed.
    if (is_neg) x = -x;

    return x;
}
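A brief usage check of the routine above (expected values assume IEEE 754 binary64 and follow the analysis in this answer):
#include <cstdio>

int main()
{
    std::printf("%.2f\n", excel_round(37.785, 2));   // 37.79: the corner case rounds up, like Excel
    std::printf("%.2f\n", excel_round(2.5, 0));      // 3.00: symmetric rounding, away from zero
    std::printf("%.2f\n", excel_round(-2.5, 0));     // -3.00
}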
For very accurate arbitrary precision and rounding of floating point numbers to a fixed set of decimal places, you should take a look at a math library like GNU MPFR. While it's a C-library, the web-page I posted also links to a couple different C++ bindings if you want to avoid using C.
You may also want to read a paper entitled "What every computer scientist should know about floating point arithmetic" by David Goldberg at the Xerox Palo Alto Research Center. It's an excellent article demonstrating the underlying process that allows floating point numbers to be approximated in a computer that represents everything in binary data, and how rounding errors and other problems can creep up in FPU-based floating point math.
I don't know how Excel does it, but printing floating point numbers nicely is a hard problem: http://www.serpentine.com/blog/2011/06/29/here-be-dragons-advances-in-problems-you-didnt-even-know-you-had/
So your actual question seems to be, how to get correctly rounded floating point -> string conversions. By googling for those terms you'll get a bunch of articles, but if you're interested in something to use, most platforms provide reasonably competent implementations of sprintf()/snprintf(). So just use those, and if you find bugs, file a report to the vendor.
A function that takes a floating point number as argument and returns another floating point number, rounded exactly to a given number of decimal digits, cannot be written, because there are many numbers with a finite decimal representation that have an infinite binary representation; one of the simplest examples is 0.1.
To achieve what you want you must accept to use a different type as a result of your rounding function. If your immediate need is printing the number you can use a string and a formatting function: the problem becomes how to obtain exactly the formatting you expect. Otherwise if you need to store this number in order to perform exact calculations on it, for instance if you are doing accounting, you need a library that's capable of representing decimal numbers exactly. In this case the most common approach is to use a scaled representation: an integer for the value together with the number of decimal digits. Dividing the value by ten raised to the scale gives you the original number.
If any of these approaches is suitable, I'll try and expand my answer with practical suggestions.
Excel rounds numbers like this "correctly" by doing WORK. They started in 1985, with a fairly "normal" set of floating-point routines, and added some scaled-integer fake floating point, and they've been tuning those things and adding special cases ever since. The app DID used to have most of the same "obvious" bugs that everybody else did, it's just that it mostly had them a long time ago. I filed a couple myself, back when I was doing tech support for them in the early 90s.
I believe the following C# code rounds numbers as they are rounded in Excel. To exactly replicate the behavior in C++ you might need to use a special decimal type.
In plain English, the double-precision number is converted to a decimal and then rounded to fifteen significant digits (not to be confused with fifteen decimal places). The result is rounded a second time to the specified number of decimal places.
That might seem weird, but what you have to understand is that Excel always displays numbers that are rounded to 15 significant figures. If the ROUND() function weren't using that display value as a starting point, and used the internal double representation instead, then there would be cases where ROUND(A1,N) did not seem to correspond to the actual value in A1. That would be very confusing to a non-technical user.
The double which is closest to 37.785 has an exact decimal value of 37.784999999999996589394868351519107818603515625. (Any double can be represented precisely by a finite base ten decimal because one quarter, one eighth, one sixteenth, and so forth all have finite decimal expansions.) If that number were rounded directly to two decimal places, there would be no tie to break and the result would be 37.78. If you round to 15 significant figures first you get 37.7850000000000. If this is further rounded to two decimal places, then you get 37.79, so there is no real mystery after all.
// Convert to a floating decimal point number, round to fifteen
// significant digits, and then round to the number of places
// indicated.
static decimal SmartRoundDouble(double input, int places)
{
    int numLeadingDigits = (int)Math.Log10(Math.Abs(input)) + 1;
    decimal inputDec = GetAccurateDecimal(input);
    inputDec = MoveDecimalPointRight(inputDec, -numLeadingDigits);
    decimal round1 = Math.Round(inputDec, 15);
    round1 = MoveDecimalPointRight(round1, numLeadingDigits);
    decimal round2 = Math.Round(round1, places, MidpointRounding.AwayFromZero);
    return round2;
}

static decimal MoveDecimalPointRight(decimal d, int n)
{
    if (n > 0)
        for (int i = 0; i < n; i++)
            d *= 10.0m;
    else
        for (int i = 0; i > n; i--)
            d /= 10.0m;
    return d;
}

// The constructor for decimal that accepts a double does
// some rounding by default. This gets a more exact number.
static decimal GetAccurateDecimal(double r)
{
    string accurateStr = r.ToString("G17", CultureInfo.InvariantCulture);
    return Decimal.Parse(accurateStr, CultureInfo.InvariantCulture);
}
What you NEED is this :
double f = 22.0/7.0;
cout.setf(ios::fixed, ios::floatfield);
cout.precision(6);
cout<<f<<endl;
How it can be implemented (just an overview of rounding the last digit):
long getRoundedPrec(double d, int precision = 9)
{
    // scale the fractional part so that the digit to round on becomes the last digit
    long l = (long)((d - (double)(long)d) * pow(10.0, precision + 1));
    int lastDigit = (int)(l - (l / 10) * 10);
    l = l / 10;
    if (lastDigit >= 5) {
        l = l + 1;
    }
    return l;
}
Just as base-10 numbers must be rounded as they are converted to base-2, it is possible to round a number as it is converted from base-2 to base-10. Once the number has a base-10 representation it can be rounded again in a straightforward manner by looking at the digit to the right of the one you wish to round.
While there's nothing wrong with the above assertion, there's a much more pragmatic solution. The problem is that the binary representation tries to get as close as possible to the decimal number, even if that binary is less than the decimal. The amount of error is within [-0.5,0.5] least significant bits (LSB) of the true value. For rounding purposes you'd rather it be within [0,1] LSB so that the error is always positive, but that's not possible without changing all the rules of floating point math.
The one thing you can do is add 1 LSB to the value, so the error is within [0.5,1.5] LSB of the true value. This is less accurate overall, but only by a very tiny amount; when the value is rounded for representation as a decimal number it is much more likely to be rounded to a proper decimal number because the error is always positive.
To add 1 LSB to the value before rounding it, see the answers to this question. For example in Visual Studio C++ 2010 the procedure would be:
Round(_nextafter(37.785,37.785*1.1),0.01);
There are many ways to optimize the result of a floating-point value using statistical, numerical, and other algorithms.
The easiest one is probably searching for repetitive 9s or 0s in the range of precision. If there are any, maybe those 9s are redundant, just round them up. But this may not work in many cases. Here's an example for a float with 6 digits of precision:
2.67899999 → 2.679
12.3499999 → 12.35
1.20000001 → 1.2
Excel always limits the input range to 15 digits and rounds the output to a maximum of 15 digits, so this might be one of the methods Excel uses.
Or you can include the precision along with the number. After each step, adjust the accuracy depending on the precision of the operands. For example:
1.113 → 3 decimal digits
6.15634 → 5 decimal digits
Since both numbers are inside the double's 16-17 digit precision range, their sum will be accurate to the larger of them, which is 5 digits. Similarly, 3 + 5 < 16, so their product will be precise to 8 decimal digits:
1.113 + 6.15634 = 7.26934 → 5 decimal digits
1.113 * 6.15634 = 6.85200642 → 8 decimal digits
But 4.1341677841 * 2.251457145 will only get double's accuracy, because the exact result exceeds double's precision.
Another efficient algorithm is Grisu, but I haven't had an opportunity to try it.
In 2010, Florian Loitsch published a wonderful paper in PLDI, "Printing floating-point numbers quickly and accurately with integers", which represents the biggest step in this field in 20 years: he mostly figured out how to use machine integers to perform accurate rendering! Why do I say "mostly"? Because although Loitsch's "Grisu3" algorithm is very fast, it gives up on about 0.5% of numbers, in which case you have to fall back to Dragon4 or a derivative
Here be dragons: advances in problems you didn’t even know you had
In fact, I think Excel must combine many different methods to achieve the best result of all.
Example When a Value Reaches Zero
In Excel 95 or earlier, enter the following into a new workbook:
A1: =1.333+1.225-1.333-1.225
Right-click cell A1, and then click Format Cells. On the Number tab, click Scientific under Category. Set the Decimal places to 15.
Rather than displaying 0, Excel 95 displays -2.22044604925031E-16.
Excel 97, however, introduced an optimization that attempts to correct for this problem. Should an addition or subtraction operation result in a value at or very close to zero, Excel 97 and later will compensate for any error introduced as a result of converting an operand to and from binary. The example above when performed in Excel 97 and later correctly displays 0 or 0.000000000000000E+00 in scientific notation.
Floating-point arithmetic may give inaccurate results in Excel
As mjfgates says, Excel does hard work to get this "right". The first thing to do when you try to reimplement this is to define what you mean by "right". Obvious solutions:
implement rational arithmetic
Slow but reliable.
implement a bunch of heuristics
Fast but tricky to get right (think "years of bug reports").
It really depends on your application.
Most decimal fractions can't be accurately represented in binary.
double x = 0.0;
for (int i = 1; i <= 10; i++)
{
    x += 0.1;
}
// x should now be 1.0, right?
//
// it isn't. Test it and see.
One solution is to use BCD. It's old. But, it's also tried and true. We have a lot of other old ideas that we use every day (like using a 0 to represent nothing...).
Another technique uses scaling upon input/output. This has the advantage of nearly all math being integer math.

Printing double without losing precision

How do you print a double to a stream so that when it is read in you don't lose precision?
I tried:
std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::setprecision(std::numeric_limits<T>::digits10) << v << " ";
double u;
ss >> u;
std::cout << "precision " << ((u == v) ? "retained" : "lost") << std::endl;
This did not work as I expected.
But I can increase precision (which surprised me as I thought that digits10 was the maximum required).
ss << std::setprecision(std::numeric_limits<T>::digits10 + 2) << v << " ";
// ^^^^^^ +2
It has to do with the number of significant digits and the first two don't count in (0.01).
So has anybody looked at representing floating point numbers exactly?
What is the exact magical incantation on the stream I need to do?
After some experimentation:
The trouble was with my original version. There were non-significant digits in the string after the decimal point that affected the accuracy.
So to compensate for this we can use scientific notation:
ss << std::scientific
<< std::setprecision(std::numeric_limits<double>::digits10 + 1)
<< v;
This still does not explain the need for the +1 though.
Also if I print out the number with more precision I get more precision printed out!
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10 + 1) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits) << v << "\n";
It results in:
1.000000000000000e-02
1.0000000000000002e-02
1.00000000000000019428902930940239457413554200000000000e-02
Based on Stephen Canon's answer below:
We can print out exactly by using the printf() formatter, "%a" or "%A". To achieve this in C++ we need to use the fixed and scientific manipulators (see n3225: 22.4.2.2.2p5 Table 88)
std::cout.flags(std::ios_base::fixed | std::ios_base::scientific);
std::cout << v;
For now I have defined:
template<typename T>
std::ostream& precise(std::ostream& stream)
{
    stream.flags(std::ios_base::fixed | std::ios_base::scientific);
    return stream;
}
std::ostream& preciselngd(std::ostream& stream){ return precise<long double>(stream);}
std::ostream& precisedbl(std::ostream& stream) { return precise<double>(stream);}
std::ostream& preciseflt(std::ostream& stream) { return precise<float>(stream);}
Next: How do we handle NaN/Inf?
It's not correct to say "floating point is inaccurate", although I admit that's a useful simplification. If we used base 8 or 16 in real life then people around here would be saying "base 10 decimal fraction packages are inaccurate, why did anyone ever cook those up?".
The problem is that integral values translate exactly from one base into another, but fractional values do not, because they represent fractions of the integral step and only a few of them are used.
Floating point arithmetic is technically perfectly accurate. Every calculation has one and only one possible result. There is a problem, and it is that most decimal fractions have base-2 representations that repeat. In fact, in the sequence 0.01, 0.02, ... 0.99, only a mere 3 values have exact binary representations. (0.25, 0.50, and 0.75.) There are 96 values that repeat and therefore are obviously not represented exactly.
Now, there are a number of ways to write and read back floating point numbers without losing a single bit. The idea is to avoid trying to express the binary number with a base 10 fraction.
Write them as binary. These days, everyone implements the IEEE-754 format so as long as you choose a byte order and write or read only that byte order, then the numbers will be portable.
Write them as 64-bit integer values. Here you can use the usual base 10. (Because you are representing the 64-bit aliased integer, not the 52-bit fraction.)
You can also just write more decimal fraction digits. Whether this is bit-for-bit accurate will depend on the quality of the conversion libraries and I'm not sure I would count on perfect accuracy (from the software) here. But any errors will be exceedingly small and your original data certainly has no information in the low bits. (None of the constants of physics and chemistry are known to 52 bits, nor has any distance on earth ever been measured to 52 bits of precision.) But for a backup or restore where bit-for-bit accuracy might be compared automatically, this obviously isn't ideal.
Don't print floating-point values in decimal if you don't want to lose precision. Even if you print enough digits to represent the number exactly, not all implementations have correctly-rounded conversions to/from decimal strings over the entire floating-point range, so you may still lose precision.
Use hexadecimal floating point instead. In C:
printf("%a\n", yourNumber);
C++0x provides the hexfloat manipulator for iostreams that does the same thing (on some platforms, using the std::hex modifier has the same result, but this is not a portable assumption).
Using hex floating point is preferred for several reasons.
First, the printed value is always exact. No rounding occurs in writing or reading a value formatted in this way. Beyond the accuracy benefits, this means that reading and writing such values can be faster with a well tuned I/O library. They also require fewer digits to represent values exactly.
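In C++ that might look like the following sketch (the printed form is only an example; reading the hex form back through operator>> is not reliable on every standard library, so the sketch parses with strtod instead):
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    double v = 0.1 * 0.1;

    std::ostringstream out;
    out << std::hexfloat << v;            // C++11 manipulator; prints e.g. 0x1.47ae147ae147cp-7
    std::string s = out.str();
    std::cout << s << "\n";

    double u = std::strtod(s.c_str(), nullptr);   // strtod accepts hexadecimal floating point
    std::cout << (u == v ? "exact round trip" : "precision lost") << "\n";
}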
I got interested in this question because I'm trying to (de)serialize my data to & from JSON.
I think I have a clearer explanation (with less hand waving) of why 17 decimal digits are sufficient to reconstruct the original number losslessly:
Imagine 3 number lines:
1. for the original base 2 number
2. for the rounded base 10 representation
3. for the reconstructed number (same as #1 because both in base 2)
When you convert to base 10, graphically, you choose the tick on the 2nd number line closest to the tick on the 1st. Likewise when you reconstruct the original from the rounded base 10 value.
The critical observation I had was that in order to allow exact reconstruction, the base 10 step size (quantum) has to be < the base 2 quantum. Otherwise, you inevitably get a bad reconstruction, because two adjacent base 2 ticks can land nearest to the same base 10 tick and can no longer be told apart.
Take the specific case of when the exponent is 0 for the base2 representation. Then the base2 quantum will be 2^-52 ~= 2.22 * 10^-16. The closest base 10 quantum that's less than this is 10^-16. Now that we know the required base 10 quantum, how many digits will be needed to encode all possible values? Given that we're only considering the case of exponent = 0, the dynamic range of values we need to represent is [1.0, 2.0). Therefore, 17 digits would be required (16 digits for fraction and 1 digit for integer part).
For exponents other than 0, we can use the same logic:
exponent base2 quant. base10 quant. dynamic range digits needed
---------------------------------------------------------------------
1 2^-51 10^-16 [2, 4) 17
2 2^-50 10^-16 [4, 8) 17
3 2^-49 10^-15 [8, 16) 17
...
32 2^-20 10^-7 [2^32, 2^33) 17
1022 9.98e291 1.0e291 [4.49e307,8.99e307) 17
While not exhaustive, the table shows the trend that 17 digits are sufficient.
Hope you like my explanation.
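If you want to check the claim empirically, here is a minimal sketch (assuming IEEE 754 binary64 and correctly rounded library conversions) that tries 16 and 17 significant digits on a few values; values such as the one just above 1.0 need the full 17:
#include <cmath>
#include <cstdio>
#include <cstdlib>

// True if printing v with the given number of significant digits and
// parsing the text back recovers exactly the same double.
bool round_trips(double v, int digits)
{
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*e", digits - 1, v);
    return std::strtod(buf, nullptr) == v;
}

int main()
{
    const double samples[] = { 0.1, std::nextafter(1.0, 2.0), 123456.789, 1e-300 };
    for (double v : samples)
        std::printf("%-25.17e 16 digits: %s, 17 digits: %s\n", v,
                    round_trips(v, 16) ? "ok" : "FAIL",
                    round_trips(v, 17) ? "ok" : "FAIL");
}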
In C++20 you'll be able to use std::format to do this:
std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::format("{}", v);
double u;
ss >> u;
assert(v == u);
The default floating-point format is the shortest decimal representation with a round-trip guarantee. The advantage of this method compared to using the precision of max_digits10 (not digits10 which is not suitable for round trip through decimal) from std::numeric_limits is that it doesn't print unnecessary digits.
In the meantime you can use the {fmt} library, std::format is based on. For example (godbolt):
fmt::print("{}", 0.1 * 0.1);
Output (assuming IEEE754 double):
0.010000000000000002
{fmt} uses the Dragonbox algorithm for fast binary floating point to decimal conversion. In addition to giving the shortest representation it is 20-30x faster than common standard library implementations of printf and iostreams.
Disclaimer: I'm the author of {fmt} and C++20 std::format.
A double has a precision of 53 binary digits (52 explicitly stored plus one implied bit), or about 15.95 decimal digits. See http://en.wikipedia.org/wiki/IEEE_754-2008. You need at least 16 decimal digits to record the full precision of a double in all cases. [But see fourth edit, below].
By the way, this means significant digits.
Answer to OP edits:
Your floating point to decimal string runtime is outputting way more digits than are significant. A double can only hold 52 bits of significand (actually 53, if you count a "hidden" 1 that is not stored). That means the resolution is not more than 2^-53 = 1.11e-16.
For example: 1 + 2 ^ -52 = 1.0000000000000002220446049250313 . . . .
Those decimal digits, .0000000000000002220446049250313 . . . . are the smallest binary "step" in a double when converted to decimal.
The "step" inside the double is:
.0000000000000000000000000000000000000000000000000001 in binary.
Note that the binary step is exact, while the decimal step is inexact.
Hence the decimal representation above,
1.0000000000000002220446049250313 . . .
is an inexact representation of the exact binary number:
1.0000000000000000000000000000000000000000000000000001.
Third Edit:
The next possible value for a double, which in exact binary is:
1.0000000000000000000000000000000000000000000000000010
converts inexactly in decimal to
1.0000000000000004440892098500626 . . . .
So all of those extra digits in the decimal are not really significant, they are just base conversion artifacts.
Fourth Edit:
Though a double stores at most 16 significant decimal digits, sometimes 17 decimal digits are necessary to represent the number. The reason has to do with digit slicing.
As I mentioned above, there are 52 + 1 binary digits in the double. The "+ 1" is an assumed leading 1, and is neither stored nor significant. In the case of an integer, those 52 binary digits form a number between 0 and 2^53 - 1. How many decimal digits are necessary to store such a number? Well, log_10 (2^53 - 1) is about 15.95. So at most 16 decimal digits are necessary. Let's label these d_0 to d_15.
Now consider that IEEE floating point numbers also have a binary exponent. What happens when we increment the exponent by, say, 2? We have multiplied our 52-bit number, whatever it was, by 4. Now, instead of our 52 binary digits aligning perfectly with our decimal digits d_0 to d_15, we have some significant binary digits represented in d_16. However, since we multiplied by something less than 10, we still have significant binary digits represented in d_0. So our 15.95 decimal digits now occupy d_1 to d_15, plus some upper bits of d_0 and some lower bits of d_16. This is why 17 decimal digits are sometimes needed to represent an IEEE double.
Fifth Edit
Fixed numerical errors
The easiest way (for IEEE 754 double) to guarantee a round-trip conversion is to always use 17 significant digits. But that has the disadvantage of sometimes including unnecessary noise digits (0.1 → "0.10000000000000001").
An approach that's worked for me is to sprintf the number with 15 digits of precision, then check if atof gives you back the original value. If it doesn't, try 16 digits. If that doesn't work, use 17.
You might want to try David Gay's algorithm (used in Python 3.1 to implement float.__repr__).
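A minimal sketch of that search (the helper name is illustrative, not from any particular library), assuming C99-style snprintf and strtod:
#include <cstdio>
#include <cstdlib>
#include <string>

// Print a double with the fewest significant digits (15, 16 or 17)
// that still parse back to exactly the same value.
std::string shortest_repr(double v)
{
    char buf[32];
    for (int digits = 15; digits <= 17; ++digits) {
        std::snprintf(buf, sizeof buf, "%.*g", digits, v);
        if (std::strtod(buf, nullptr) == v)
            break;   // 17 digits always round trip for IEEE 754 doubles
    }
    return buf;
}

int main()
{
    std::printf("%s\n", shortest_repr(0.1).c_str());         // 0.1
    std::printf("%s\n", shortest_repr(0.1 * 0.1).c_str());   // 0.010000000000000002
}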
Thanks to ThomasMcLeod for pointing out the error in my table computation
Guaranteeing round-trip conversion using 15 or 16 or 17 digits is only possible in comparatively few cases. The number 15.95 comes from taking 2^53 (1 implicit bit + 52 bits in the significand/"mantissa"), which comes out to an integer in the range 10^15 to 10^16 (closer to 10^16).
Consider a double precision value x with an exponent of 0, i.e. it falls into the floating point range 1.0 <= x < 2.0. The implicit bit will mark the 2^0 component (part) of x. The highest explicit bit of the significand will denote the next lower exponent (from 0), i.e. -1 => 2^-1, or the 0.5 component.
The next bit denotes 0.25, the ones after it 0.125, 0.0625, 0.03125, 0.015625 and so on (see table below). The value 1.5 will thus be represented by two components added together: the implicit bit denoting 1.0 and the highest explicit significand bit denoting 0.5.
This illustrates that from the implicit bit downward you have 52 additional, explicit bits to represent possible components where the smallest is 0 (exponent) - 52 (explicit bits in significand) = -52 => 2^-52 which according to the table below is ... well you can see for yourselves that it comes out to quite a bit more than 15.95 significant digits (37 to be exact). To put it another way the smallest number in the 2^0 range that is != 1.0 itself is 2^0+2^-52 which is 1.0 + the number next to 2^-52 (below) = (exactly) 1.0000000000000002220446049250313080847263336181640625, a value which I count as being 53 significant digits long. With 17 digit formatting "precision" the number will display as 1.0000000000000002 and this would depend on the library converting correctly.
So maybe "round-trip conversion in 17 digits" is not really a concept that is valid (enough).
2^ -1 = 0.5000000000000000000000000000000000000000000000000000
2^ -2 = 0.2500000000000000000000000000000000000000000000000000
2^ -3 = 0.1250000000000000000000000000000000000000000000000000
2^ -4 = 0.0625000000000000000000000000000000000000000000000000
2^ -5 = 0.0312500000000000000000000000000000000000000000000000
2^ -6 = 0.0156250000000000000000000000000000000000000000000000
2^ -7 = 0.0078125000000000000000000000000000000000000000000000
2^ -8 = 0.0039062500000000000000000000000000000000000000000000
2^ -9 = 0.0019531250000000000000000000000000000000000000000000
2^-10 = 0.0009765625000000000000000000000000000000000000000000
2^-11 = 0.0004882812500000000000000000000000000000000000000000
2^-12 = 0.0002441406250000000000000000000000000000000000000000
2^-13 = 0.0001220703125000000000000000000000000000000000000000
2^-14 = 0.0000610351562500000000000000000000000000000000000000
2^-15 = 0.0000305175781250000000000000000000000000000000000000
2^-16 = 0.0000152587890625000000000000000000000000000000000000
2^-17 = 0.0000076293945312500000000000000000000000000000000000
2^-18 = 0.0000038146972656250000000000000000000000000000000000
2^-19 = 0.0000019073486328125000000000000000000000000000000000
2^-20 = 0.0000009536743164062500000000000000000000000000000000
2^-21 = 0.0000004768371582031250000000000000000000000000000000
2^-22 = 0.0000002384185791015625000000000000000000000000000000
2^-23 = 0.0000001192092895507812500000000000000000000000000000
2^-24 = 0.0000000596046447753906250000000000000000000000000000
2^-25 = 0.0000000298023223876953125000000000000000000000000000
2^-26 = 0.0000000149011611938476562500000000000000000000000000
2^-27 = 0.0000000074505805969238281250000000000000000000000000
2^-28 = 0.0000000037252902984619140625000000000000000000000000
2^-29 = 0.0000000018626451492309570312500000000000000000000000
2^-30 = 0.0000000009313225746154785156250000000000000000000000
2^-31 = 0.0000000004656612873077392578125000000000000000000000
2^-32 = 0.0000000002328306436538696289062500000000000000000000
2^-33 = 0.0000000001164153218269348144531250000000000000000000
2^-34 = 0.0000000000582076609134674072265625000000000000000000
2^-35 = 0.0000000000291038304567337036132812500000000000000000
2^-36 = 0.0000000000145519152283668518066406250000000000000000
2^-37 = 0.0000000000072759576141834259033203125000000000000000
2^-38 = 0.0000000000036379788070917129516601562500000000000000
2^-39 = 0.0000000000018189894035458564758300781250000000000000
2^-40 = 0.0000000000009094947017729282379150390625000000000000
2^-41 = 0.0000000000004547473508864641189575195312500000000000
2^-42 = 0.0000000000002273736754432320594787597656250000000000
2^-43 = 0.0000000000001136868377216160297393798828125000000000
2^-44 = 0.0000000000000568434188608080148696899414062500000000
2^-45 = 0.0000000000000284217094304040074348449707031250000000
2^-46 = 0.0000000000000142108547152020037174224853515625000000
2^-47 = 0.0000000000000071054273576010018587112426757812500000
2^-48 = 0.0000000000000035527136788005009293556213378906250000
2^-49 = 0.0000000000000017763568394002504646778106689453125000
2^-50 = 0.0000000000000008881784197001252323389053344726562500
2^-51 = 0.0000000000000004440892098500626161694526672363281250
2^-52 = 0.0000000000000002220446049250313080847263336181640625
@ThomasMcLeod: I think the significant digit rule comes from my field, physics, and means something more subtle:
If you have a measurement that gets you the value 1.52 and you cannot read any more detail off the scale, and say you are supposed to add another number (for example of another measurement because this one's scale was too small) to it, say 2, then the result (obviously) has only 2 decimal places, i.e. 3.52.
But likewise, if you add 1.1111111111 to the value 1.52, you get the value 2.63 (and nothing more!).
The reason for the rule is to prevent you from kidding yourself into thinking you got more information out of a calculation than you put in by the measurement (which is impossible, but would seem that way by filling it with garbage, see above).
That said, this specific rule is for addition only (for addition the error of the result is the sum of the two errors - so if you measure just one of them badly, tough luck, there goes your precision...).
How to get the other rules:
Let's say a is the measured number and δa the error. Let's say your original formula was:
f:=m a
Let's say you also measure m with error δm (let that be the positive side).
Then the actual limit is:
f_up=(m+δm) (a+δa)
and
f_down=(m-δm) (a-δa)
So,
f_up =m a+δm δa+(δm a+m δa)
f_down=m a+δm δa-(δm a+m δa)
Hence, now the significant digits are even less:
f_up ~m a+(δm a+m δa)
f_down~m a-(δm a+m δa)
and so
δf=δm a+m δa
If you look at the relative error, you get:
δf/f=δm/m+δa/a
For division it is
δf/f=δm/m-δa/a
Hope that gets the gist across and hope I didn't make too many mistakes, it's late here :-)
tl;dr: Significant digits mean how many of the digits in the output actually come from the digits in your input (in the real world, not the distorted picture that floating point numbers have).
If your measurements were 1 with "no" error and 3 with "no" error and the function is supposed to be 1/3, then yes, all infinite digits are actual significant digits. Otherwise, the inverse operation would not work, so obviously they have to be.
If significant digit rule means something completely different in another field, carry on :-)