How to connect the theory of fixed-point numbers and its practical implementation? - c++

The theory of fixed-point number is that we divide certain number of bits between integer part and fractional part. This amount is fixed.
For example, 26.5 is stored in that order:
To convert from floating-point to fixed-point, we follow this algorithm:
Calculate x = floating_input * 2^(fractional_bits)
27.3 * 2^10 = 27955.2
Round x to the nearest whole number (e.g. round(x))
27955
Store the rounded x in an integer container
Now if we look on the bit representation of our numbers and on what multiplying on 2^(fractional_bits) makes, we will see:
27 is 11011
27*2^10 is 110 1100 0000 0000 which is shifting on 10 bits to the left.
So we can say, that multiplying on 2^10 indeed gives us "space" in the right part of bits for save forth altering of this number. We can make two such numbers converted in this way, interacting each other and eventually re-converted to familiar view with point by opposite dividing on 2^10.
If we recall that bits are stored in some integer variable, which in turn has its own amount of bits it gets clear that as more bits in that variable are devoted for fraction part as less bits remain for integer part of number.
27.3 * 2^10 = 27955.2 should be rounded for storing in integer type to
27955 which is 110 1101 0011 0011
after that number can be altered somehow, certain value isn't important now, and let's say, we want to retrieve back human-readable value:
27955/2^10 = 27,2998046875
What about amount of bits after point?
Let's say we have two numbers with purpose to multiply them and we chose 10 bits after point
27 * 3.3 = 89.1 expected
27*2^10 = 27 648 is 110 1100 0000 0000
3.3*2^10 = 3 379 is 1101 0011 0011
27 648 * 3 379 = 93 422 592
consequently
27*3.3 = 93 422 592/(2^10*2^10) = 89.09 pretty accurate
Let's take 1 bit after point
27 and 3.3
27*2^1 = 54 is 110110
3.3*2^1 = 6.6 after round 6 is 110
54 * 6 = 324
consequently
27*3.3 = 324/(2^1*2^1) = 81 which is unsatisfying
On practice we can use next code to create and operate with fixed-point number:
#include <iostream>
using namespace std;
const int scale = 10;
#define DoubleToFixed(x) (x*(double)(1<<scale))
#define FixedToDouble(x) ((double)x / (double)(1<<scale))
#define IntToFixed(x) (x << scale)
#define FixedToInt(x) (x >> scale)
#define MUL(x,y) (((x)*(y)) >> scale)
#define DIV(x,y) ((x) << scale)
int main()
{
double a = 7.27;
double b = 3.0;
int f = DoubleToFixed(a);
cout << f<<endl; //7444
cout << FixedToDouble(f)<<endl; //7.26953125
int g = DoubleToFixed(b);
cout << g<<endl; //3072
int c = MUL(f,g);
cout << FixedToDouble(c)<<endl; //21.80859375
}
So, where is connection between the theory of fixed emplacement of point between bits (powers of 2) and practice implementation? If we store fixed-number in int, it is obvious, that there is no place for storing the point in it.
It seems that fixed-point numbers are just conversion for increase performance. And to retrieve human-readable number after calculations, the opposite conversion must present.
I hope, I understand the algorithm. But is the idea of placement of point between digits is just an abstract idea?

Fixed-point formats are used as a way to represent fractional numbers. Quite commonly, processors perform fixed-point or integer arithmetic faster or more efficiently than floating-point arithmetic. Whether fixed-point arithmetic is suitable for an application depends on what numbers the application needs to work with.
Using fixed-point formats does require converting input to the fixed-point format and converting numbers in the fixed-point format to output. But this is also true of integers and floating-point: All input must be converted to whatever internal format is used to represent it, and all output must be produced by converting from internal formats.
And how does multiplying on 2^(fractional_bits) affect the quantity of digits after the point?
Suppose we have some number x that is represented as an integer X = x•2f, where f is the number of fraction bits. Conceptually X is in a fixed-point format. Similarly, we have y represented as Y = y•2f.
If we execute an integer multiplication instruction to produce result Z = XY, then Z = XY = (x•2f)•(y•2f) = xy•22f. Then, if we divide Z by 2f (or, nearly equivalently, shift it right by f bits), we have xy•2f except for any rounding errors that may have occurred in the division. And xy•2f is the fixed-point representation of the product of x and y.
Thus, we can effect a fixed-point multiplication by perform an integer multiplication followed by a shift.
Often, to get rounding instead of truncation, a value of half of 2f is added before the shift, so we compute floor((XY + 2f−1) / 2f):
Multiply X by Y.
Add 2f−1.
Shift right f bits.

It seems that fixed-point numbers are just convertion for encreese performance.
You might as well say that floating-point numbers are a conversion to increase the representable range.
Whatever format your numbers are originally coming in as (strings, voltage levels, integers, etc.), you often convert them to floating point numbers in order to store or operate on them, but neither floating point nor fixed point is a human-readable representation.
Floating point numbers have lower precision and a wider magnitude range; fixed point numbers have higher precision and a narrower magnitude range. (Performance differences depend on the architecture and the important operations.) You shouldn't think of the fixed-point representation as a conversion from floating point, but as an alternative to floating point.

I think you want a class that wraps an int along with the fixed radix point information. Indeed, the use is implicit, but you then define your own multiplication (for example) that works on the fixed point meaning as a whole rather than just multiplying the underlying ints.
You don't want to leave the implicit meaning ... make it known to the compiler in a strong way. You should not have to explicitly call your handling functions; make it part of the class semantics.

Related

Why is the result of a bitwise shift unrecoverable if there is a mathematical equivalent of the same operation?

Take for example the number 91. That number in binary is 1011011. If you shift that number to the right by 5 bits, you would get 2 (10 in binary). According to a google search, bit shifting to the left or right by a certain amount of bits is the same as multiplying or dividing the number by 2 to the power of the number of bits to be shifted, respectively. so to get from 91 to 2 by bit shifting, the equation would look like this: 91 / 2^5, which is also 91 / 32. Now, of course if you did that in your calculator, there would be some decimal values, which aren't included when bit shifting. The resulting 2 is actually 2.84357. I'm sure you know that if you do a certain operation on a number and then you do the inverse, the result would be what you had in the first place. So does decimal precision have something to do with this?
There is a mathematical equivalent of shifting to the right... and the mathematical operation is UNRECOVERABLE.
You seem to think that shifting to the right is:
bit shifting to the left or right by a certain amount of bits is the same as multiplying or dividing the number by 2
This is what you will hear people casually say, but it is only half right. As it it is not the same but only similar.
The correct statement is:
shifting a base-2 number one digit to the right is THE SAME as dividing by two in the integer domain
If you have an integer calculator, if you did 91/32 you will get 2. You will not get ANY decimal point because we are operating in the integer domain.
For real numbers, the equivalent operation is:
FLOOR(91/32)
Which is also unrecoverable because it also results in 2.
The lesson here is be careful when listening to what people CASUALLY say. Casual speech is often imprecise and assumes the listener is familiar with the subject. You need to dig deeper what the statement is actually trying to say.
As for why it is unrecoverable? Division of integers give two results: the quotient (which is the main result) and the remainder. When we divide 91 by 32 we are doing this:
2
_____
32 ) 91
64
__
27
So we get the result of 2 and a remainder of 27. The reason you can't get 91 by multiplying 2*32 is because we threw away the remainder.
You can get the result back if you saved the remainder. However, calculating the remainder is not a matter of simple shifts. Here's an example of how to make it reversable in C:
int test () {
int a = 91;
int b = 32;
int result;
int remainder;
result = a / b; // result will be 2
remainder = a % b; // remainder will be 27
return (result * b) + remainder; // returns 91
}
You can only recover the result of an operation if it has a 1-1 mapping between the inputs and outputs, i.e. it has an inverse function. But not all mathematical functions have an inverse function
For example if f(x) = x >> n with >> is the shift operator then it'll be equivalent to
f(x) = ⌊x/2n⌋
with ⌊ ⌋ being the floor function. Since there are many inputs that lead to the same output, the relationship isn't 1-1 and there can't be an inverse function for it. This function works the same for both signed and unsigned right shift:
91 >> 5 == floor(91.0/32.0) == 2
-91 >> 5 == floor(-91.0/32.0) == -3
Similarly for an unsigned left shift function g(x) = x << n then the equivalent is
g(x) = (x * 2n) mod 2N
with N being the size in bits of x, because integer math in hardware, C and many other languages always reduce modulo 2N due to the limit of register size and the use of two's complement. And it's clear that the modulo function also isn't invertible/recoverable. The signed left shift is almost the same with some small modifications

When Will static_casting the Result of ceil Compromise the Result?

static_casting from a floating point to an integer simply strips the fractional point of the number. For example static_cast<int>(13.9999999) yields 13.
Not all integers are representable as floating point numbers. For example internally the closest float to 13,000,000 may be: 12999999.999999.
In this hypothetical case, I'd expect to get an unexpected result from:
const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));
My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know the range over which I can trust static_cast<long long>(ceif(foo))?
For example internally the closest float to 13,000,000 may be: 12999999.999999.
That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M•be, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000-x to be represented, where x is some positive value less than 1, e must be negative (because M•be for a non-negative e is an integer). If so, then M•b0 is an integer larger than M•be, so it is larger than 13,000,000, and so 13,000,000 can be represented as M'•b0, where M' is a positive integer less than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
Regarding your code:
auto test = 0LL;
const auto floater = 0.5F;
for(auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;
cout << test << endl;
When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.
8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule “round to nearest, ties to even.” The two nearest representable values are 8,388,609, and 8,388,610. Since they are equally far apart, the result was 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
On Visual Studio 2015 I got 8,388,609 which is a horrifying small safe range.
In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.
Floating point numbers are represented by 3 integers, cbq where:
c is the mantissa (so for the number: 12,999,999.999999 c would be 12,999,999,999,999)
q is the exponent (so for the number: 12,999,999.999999 q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)
From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000 using a c of 1,300,000,000,000 and a q of -5.
This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2 the q functions as a shift left or right of the mantissa.
Next let's talk about a range here. Obviously a 32-bit floating point cannot represent all the integers represented by a 32-bit integer, as the floating point must also represent so many much larger or smaller numbers. Since the exponent is simply shifting the mantissa, a floating point number can always exactly represent every integer that can be represented by it's mantissa. Given the traditional IEEE-754 binary base floating point numbers:
A 32-bit (float) has a 24-bit mantissa so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double depending upon implementation) has a 113-bit mantissa so it can represent all integers in the range [-103,845,937,170,696,552,570,609,926,584,40,191, 103,845,937,170,696,552,570,609,926,584,40,191]
[source]
c++ provides digits as a method of finding this number for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example a float's maximum mantissa could be found by:
(1LL << numeric_limits<float>::digits) - 1LL
Having thoroughly explained the mantissa, let's revisit the exponent section to talk about how a floating point is actually stored. Take 13,000,000.0 that could be represented as:
c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10
And so on. For the traditional binary format IEEE-754 requires:
The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers
To explain this in the more familiar base-10 if our mantissa has 14 decimal places, the implementation would look like this:
c = 13,000,000,000,000 so the MSB will be used in the represented number
q = 6 This is a little confusing, it's cause of the bias introduced here; logically q = -6 but the bias is set so that when q = 0 only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 will represent 1.3
b = 10 again the above rules are really only required for base-2 but I've shown them as they would apply to base-10 for the purpose of explaination
Translated back to base-2 this means that a q of numeric_limits<T>::digits - 1 has only zeros after the decimal place. ceil only has an effect if there is a fractional part of the number.
A final point of explanation here, is the range over which ceil will have an effect. After the exponent of a floating point is larger than numeric_limits<T>::digits continuing to increase it only introduces trailing zeros to the resulting number, thus calling ceil when q is greater than or equal to numeric_limits<T>::digits - 2LL. And since we know the MSB of c will be used in the number this means that c must be smaller than (1LL << numeric_limits<T>::digits - 1LL) - 1LL Thus for ceil to have an effect on the traditional binary IEEE-754 floating point:
A 32-bit (float) must be smaller than 8,388,607
A 64-bit (double) must be smaller than 4,503,599,627,370,495
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,095

How can I avoid this float number rounding issue in C++?

With below code, I get result "4.31 43099".
double f = atof("4.31");
long ff = f * 10000L;
std::cout << f << ' ' << ff << '\n';
If I change "double f" to "float f". I get expected result "4.31 43100". I am not sure if changing "double" to "float" is a good solution. Is there any good solution to assure I get "43100"?
You're not going to be able to eliminate the errors in floating point arithmatic (though with proper analysis you can calculate the error). For casual usage one thing you can do to get more intuitive results is to replace the built-in float to integral conversion (which does truncation), with normal rounding:
double f = atof("4.31");
long ff = std::round(f * 10000L);
std::cout << f << ' ' << ff << '\n';
This should output what you expect: 4.31 43100
Also there's no point in using 10000L, because no matter what kind of integral type you use it still gets converted to f's floating point type for the multiplication. just use std::round(f * 10000.0);
The problem is that floating point is inexact by nature when talking about decimal numbers. A decimal number can be rounded either up or down when converted to binary, depending on which value is closest.
In this case you just want to make sure that if the number was rounded down, it's rounded up instead. You do this by adding the smallest amount possible to the value, which is done with the nextafter function if you have C++11:
long ff = std::nextafter(f, 1.1*f) * 10000L;
If you don't have nextafter you can approximate it with numeric_limits.
long ff = (f * (1.0 + std::numeric_limits<double>::epsilon())) * 10000L;
I just saw your comment that you only use 4 decimal places, so this would be simpler but less robust:
long ff = (f * 1.0000001) * 10000L;
With standard C types - i doubt.
There are many values that cannot be represented in those bits - they actually demand more space to be stored. So floating-point processor just uses the closest possible.
Floating pointing numbers cannot store all the values you think it could - there is only limited amount of bits - you can't put more than 4 billion different values in 32 bits. And that's just the first restriction.
Floating point values(in C) are represented as: sign - one sign bit, power - bits which defines the power of two for the number, significand - the bits that actually make the number.
Your actual number is sign * significand * 2 inpowerof(power - normalization).
Double is 1bit of sign, 15 bits of power(normalized to be positive but that is not the point) and 48 bits to represent the value;
It is a lot but not enough to represent all the values, especially when they cannot be easily represented as finite sum of powers of two: like binary 1010.101101(101). For example it cannot represent precisely such values like 1/3 = 0.333333(3). That's the second restriction.
Try to read - decent understanding of advantages and disadvantages of floating point arithmetic may be very handy:
http://en.wikipedia.org/wiki/Floating_point and http://homepage.cs.uiowa.edu/~atkinson/m170.dir/overton.pdf
There have been some confused answers here! What is happening is this: 4.31 can't be exactly represented as either a single- or double-precision number. It turns out that the nearest representable single-precision number is a little more than 4.31, while the nearest representable double-precision number is a little less than 4.31. When a floating-point value is assigned to an integer variable, it is rounded towards zero (not towards the nearest integer!).
So if f is single-precision, f * 10000L is greater than 43100, so it is rounded down to 43100. And if f is double-precision, f * 10000L is less than 43100, so it is rounded down to 43099.
The comment by n.m. suggests f * 10000L + 0.5, which is I think the best solution.

Decimal to IEEE Single Precision Floating Point

I'm interested in learning how to convert an integer value into IEEE single precision floating point format using bitwise operators only. However, I'm confused as to what can be done to know how many logical shifts left are needed when calculating for the exponent.
Given an int, say 15, we have:
Binary: 1111
-> 1.111 x 2^3 => After placing a decimal point after the first bit, we find that the 'e' value will be three.
E = Exp - Bias
Therefore, Exp = 130 = 10000010
And the significand will be: 111000000000000000000000
However, I knew that the 'e' value would be three because I was able to see that there are three bits after placing the decimal after the first bit. Is there a more generic way to code for this as a general case?
Again, this is for an int to float conversion, assuming that the integer is non-negative, non-zero, and is not larger than the max space allowed for the mantissa.
Also, could someone explain why rounding is needed for values greater than 23 bits?
Thanks in advance!
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 224. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The following diagram from Wikipedia illustrates:
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (aka. 21)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 21 = 2.0.
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
// Only support 0 < significand < 1 << 24.
if (significand == 0 || significand >= 1 << 24)
return -1.0; // or abort(); or whatever you'd like here.
int shifts = 0;
// Align the leading 1 of the significand to the hidden-1
// position. Count the number of shifts required.
while ((significand & (1 << 23)) == 0)
{
significand <<= 1;
shifts++;
}
// The number 1.0 has an exponent of 0, and would need to be
// shifted left 23 times. The number 2.0, however, has an
// exponent of 1 and needs to be shifted left only 22 times.
// Therefore, the exponent should be (23 - shifts). IEEE-754
// format requires a bias of 127, though, so the exponent field
// is given by the following expression:
unsigned int exponent = 127 + 23 - shifts;
// Now merge significand and exponent. Be sure to strip away
// the hidden 1 in the significand.
unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);
// Reinterpret as a float and return. This is an evil hack.
return *reinterpret_cast< float* >( &merged );
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 224, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 224, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.

Printing double without losing precision

How do you print a double to a stream so that when it is read in you don't lose precision?
I tried:
std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::setprecision(std::numeric_limits<T>::digits10) << v << " ";
double u;
ss >> u;
std::cout << "precision " << ((u == v) ? "retained" : "lost") << std::endl;
This did not work as I expected.
But I can increase precision (which surprised me as I thought that digits10 was the maximum required).
ss << std::setprecision(std::numeric_limits<T>::digits10 + 2) << v << " ";
// ^^^^^^ +2
It has to do with the number of significant digits and the first two don't count in (0.01).
So has anybody looked at representing floating point numbers exactly?
What is the exact magical incantation on the stream I need to do?
After some experimentation:
The trouble was with my original version. There were non-significant digits in the string after the decimal point that affected the accuracy.
So to compensate for this we can use scientific notation to compensate:
ss << std::scientific
<< std::setprecision(std::numeric_limits<double>::digits10 + 1)
<< v;
This still does not explain the need for the +1 though.
Also if I print out the number with more precision I get more precision printed out!
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10 + 1) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits) << v << "\n";
It results in:
1.000000000000000e-02
1.0000000000000002e-02
1.00000000000000019428902930940239457413554200000000000e-02
Based on #Stephen Canon answer below:
We can print out exactly by using the printf() formatter, "%a" or "%A". To achieve this in C++ we need to use the fixed and scientific manipulators (see n3225: 22.4.2.2.2p5 Table 88)
std::cout.flags(std::ios_base::fixed | std::ios_base::scientific);
std::cout << v;
For now I have defined:
template<typename T>
std::ostream& precise(std::ostream& stream)
{
std::cout.flags(std::ios_base::fixed | std::ios_base::scientific);
return stream;
}
std::ostream& preciselngd(std::ostream& stream){ return precise<long double>(stream);}
std::ostream& precisedbl(std::ostream& stream) { return precise<double>(stream);}
std::ostream& preciseflt(std::ostream& stream) { return precise<float>(stream);}
Next: How do we handle NaN/Inf?
It's not correct to say "floating point is inaccurate", although I admit that's a useful simplification. If we used base 8 or 16 in real life then people around here would be saying "base 10 decimal fraction packages are inaccurate, why did anyone ever cook those up?".
The problem is that integral values translate exactly from one base into another, but fractional values do not, because they represent fractions of the integral step and only a few of them are used.
Floating point arithmetic is technically perfectly accurate. Every calculation has one and only one possible result. There is a problem, and it is that most decimal fractions have base-2 representations that repeat. In fact, in the sequence 0.01, 0.02, ... 0.99, only a mere 3 values have exact binary representations. (0.25, 0.50, and 0.75.) There are 96 values that repeat and therefore are obviously not represented exactly.
Now, there are a number of ways to write and read back floating point numbers without losing a single bit. The idea is to avoid trying to express the binary number with a base 10 fraction.
Write them as binary. These days, everyone implements the IEEE-754 format so as long as you choose a byte order and write or read only that byte order, then the numbers will be portable.
Write them as 64-bit integer values. Here you can use the usual base 10. (Because you are representing the 64-bit aliased integer, not the 52-bit fraction.)
You can also just write more decimal fraction digits. Whether this is bit-for-bit accurate will depend on the quality of the conversion libraries and I'm not sure I would count on perfect accuracy (from the software) here. But any errors will be exceedingly small and your original data certainly has no information in the low bits. (None of the constants of physics and chemistry are known to 52 bits, nor has any distance on earth ever been measured to 52 bits of precision.) But for a backup or restore where bit-for-bit accuracy might be compared automatically, this obviously isn't ideal.
Don't print floating-point values in decimal if you don't want to lose precision. Even if you print enough digits to represent the number exactly, not all implementations have correctly-rounded conversions to/from decimal strings over the entire floating-point range, so you may still lose precision.
Use hexadecimal floating point instead. In C:
printf("%a\n", yourNumber);
C++0x provides the hexfloat manipulator for iostreams that does the same thing (on some platforms, using the std::hex modifier has the same result, but this is not a portable assumption).
Using hex floating point is preferred for several reasons.
First, the printed value is always exact. No rounding occurs in writing or reading a value formatted in this way. Beyond the accuracy benefits, this means that reading and writing such values can be faster with a well tuned I/O library. They also require fewer digits to represent values exactly.
I got interested in this question because I'm trying to (de)serialize my data to & from JSON.
I think I have a clearer explanation (with less hand waiving) for why 17 decimal digits are sufficient to reconstruct the original number losslessly:
Imagine 3 number lines:
1. for the original base 2 number
2. for the rounded base 10 representation
3. for the reconstructed number (same as #1 because both in base 2)
When you convert to base 10, graphically, you choose the tic on the 2nd number line closest to the tic on the 1st. Likewise when you reconstruct the original from the rounded base 10 value.
The critical observation I had was that in order to allow exact reconstruction, the base 10 step size (quantum) has to be < the base 2 quantum. Otherwise, you inevitably get the bad reconstruction shown in red.
Take the specific case of when the exponent is 0 for the base2 representation. Then the base2 quantum will be 2^-52 ~= 2.22 * 10^-16. The closest base 10 quantum that's less than this is 10^-16. Now that we know the required base 10 quantum, how many digits will be needed to encode all possible values? Given that we're only considering the case of exponent = 0, the dynamic range of values we need to represent is [1.0, 2.0). Therefore, 17 digits would be required (16 digits for fraction and 1 digit for integer part).
For exponents other than 0, we can use the same logic:
exponent base2 quant. base10 quant. dynamic range digits needed
---------------------------------------------------------------------
1 2^-51 10^-16 [2, 4) 17
2 2^-50 10^-16 [4, 8) 17
3 2^-49 10^-15 [8, 16) 17
...
32 2^-20 10^-7 [2^32, 2^33) 17
1022 9.98e291 1.0e291 [4.49e307,8.99e307) 17
While not exhaustive, the table shows the trend that 17 digits are sufficient.
Hope you like my explanation.
In C++20 you'll be able to use std::format to do this:
std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::format("{}", v);
double u;
ss >> u;
assert(v == u);
The default floating-point format is the shortest decimal representation with a round-trip guarantee. The advantage of this method compared to using the precision of max_digits10 (not digits10 which is not suitable for round trip through decimal) from std::numeric_limits is that it doesn't print unnecessary digits.
In the meantime you can use the {fmt} library, std::format is based on. For example (godbolt):
fmt::print("{}", 0.1 * 0.1);
Output (assuming IEEE754 double):
0.010000000000000002
{fmt} uses the Dragonbox algorithm for fast binary floating point to decimal conversion. In addition to giving the shortest representation it is 20-30x faster than common standard library implementations of printf and iostreams.
Disclaimer: I'm the author of {fmt} and C++20 std::format.
A double has the precision of 52 binary digits or 15.95 decimal digits. See http://en.wikipedia.org/wiki/IEEE_754-2008. You need at least 16 decimal digits to record the full precision of a double in all cases. [But see fourth edit, below].
By the way, this means significant digits.
Answer to OP edits:
Your floating point to decimal string runtime is outputing way more digits than are significant. A double can only hold 52 bits of significand (actually, 53, if you count a "hidden" 1 that is not stored). That means the the resolution is not more than 2 ^ -53 = 1.11e-16.
For example: 1 + 2 ^ -52 = 1.0000000000000002220446049250313 . . . .
Those decimal digits, .0000000000000002220446049250313 . . . . are the smallest binary "step" in a double when converted to decimal.
The "step" inside the double is:
.0000000000000000000000000000000000000000000000000001 in binary.
Note that the binary step is exact, while the decimal step is inexact.
Hence the decimal representation above,
1.0000000000000002220446049250313 . . .
is an inexact representation of the exact binary number:
1.0000000000000000000000000000000000000000000000000001.
Third Edit:
The next possible value for a double, which in exact binary is:
1.0000000000000000000000000000000000000000000000000010
converts inexactly in decimal to
1.0000000000000004440892098500626 . . . .
So all of those extra digits in the decimal are not really significant, they are just base conversion artifacts.
Fourth Edit:
Though a double stores at most 16 significant decimal digits, sometimes 17 decimal digits are necessary to represent the number. The reason has to do with digit slicing.
As I mentioned above, there are 52 + 1 binary digits in the double. The "+ 1" is an assumed leading 1, and is neither stored nor significant. In the case of an integer, those 52 binary digits form a number between 0 and 2^53 - 1. How many decimal digits are necessary to store such a number? Well, log_10 (2^53 - 1) is about 15.95. So at most 16 decimal digits are necessary. Let's label these d_0 to d_15.
Now consider that IEEE floating point numbers also have an binary exponent. What happens when we increment the exponet by, say, 2? We have multiplied our 52-bit number, whatever it was, by 4. Now, instead of our 52 binary digits aligning perfectly with our decimal digits d_0 to d_15, we have some significant binary digits represented in d_16. However, since we multiplied by something less than 10, we still have significant binary digits represented in d_0. So our 15.95 decimal digits now occuply d_1 to d_15, plus some upper bits of d_0 and some lower bits of d_16. This is why 17 decimal digits are sometimes needed to represent a IEEE double.
Fifth Edit
Fixed numerical errors
The easiest way (for IEEE 754 double) to guarantee a round-trip conversion is to always use 17 significant digits. But that has the disadvantage of sometimes including unnecessary noise digits (0.1 → "0.10000000000000001").
An approach that's worked for me is to sprintf the number with 15 digits of precision, then check if atof gives you back the original value. If it doesn't, try 16 digits. If that doesn't work, use 17.
You might want to try David Gay's algorithm (used in Python 3.1 to implement float.__repr__).
Thanks to ThomasMcLeod for pointing out the error in my table computation
To guarantee round-trip conversion using 15 or 16 or 17 digits is only possible for a comparatively few cases. The number 15.95 comes from taking 2^53 (1 implicit bit + 52 bits in the significand/"mantissa") which comes out to an integer in the range 10^15 to 10^16 (closer to 10^16).
Consider a double precision value x with an exponent of 0, i.e. it falls into the floating point range range 1.0 <= x < 2.0. The implicit bit will mark the 2^0 component (part) of x. The highest explicit bit of the significand will denote the next lower exponent (from 0) <=> -1 => 2^-1 or the 0.5 component.
The next bit 0.25, the ones after 0.125, 0.0625, 0.03125, 0.015625 and so on (see table below). The value 1.5 will thus be represented by two components added together: the implicit bit denoting 1.0 and the highest explicit significand bit denoting 0.5.
This illustrates that from the implicit bit downward you have 52 additional, explicit bits to represent possible components where the smallest is 0 (exponent) - 52 (explicit bits in significand) = -52 => 2^-52 which according to the table below is ... well you can see for yourselves that it comes out to quite a bit more than 15.95 significant digits (37 to be exact). To put it another way the smallest number in the 2^0 range that is != 1.0 itself is 2^0+2^-52 which is 1.0 + the number next to 2^-52 (below) = (exactly) 1.0000000000000002220446049250313080847263336181640625, a value which I count as being 53 significant digits long. With 17 digit formatting "precision" the number will display as 1.0000000000000002 and this would depend on the library converting correctly.
So maybe "round-trip conversion in 17 digits" is not really a concept that is valid (enough).
2^ -1 = 0.5000000000000000000000000000000000000000000000000000
2^ -2 = 0.2500000000000000000000000000000000000000000000000000
2^ -3 = 0.1250000000000000000000000000000000000000000000000000
2^ -4 = 0.0625000000000000000000000000000000000000000000000000
2^ -5 = 0.0312500000000000000000000000000000000000000000000000
2^ -6 = 0.0156250000000000000000000000000000000000000000000000
2^ -7 = 0.0078125000000000000000000000000000000000000000000000
2^ -8 = 0.0039062500000000000000000000000000000000000000000000
2^ -9 = 0.0019531250000000000000000000000000000000000000000000
2^-10 = 0.0009765625000000000000000000000000000000000000000000
2^-11 = 0.0004882812500000000000000000000000000000000000000000
2^-12 = 0.0002441406250000000000000000000000000000000000000000
2^-13 = 0.0001220703125000000000000000000000000000000000000000
2^-14 = 0.0000610351562500000000000000000000000000000000000000
2^-15 = 0.0000305175781250000000000000000000000000000000000000
2^-16 = 0.0000152587890625000000000000000000000000000000000000
2^-17 = 0.0000076293945312500000000000000000000000000000000000
2^-18 = 0.0000038146972656250000000000000000000000000000000000
2^-19 = 0.0000019073486328125000000000000000000000000000000000
2^-20 = 0.0000009536743164062500000000000000000000000000000000
2^-21 = 0.0000004768371582031250000000000000000000000000000000
2^-22 = 0.0000002384185791015625000000000000000000000000000000
2^-23 = 0.0000001192092895507812500000000000000000000000000000
2^-24 = 0.0000000596046447753906250000000000000000000000000000
2^-25 = 0.0000000298023223876953125000000000000000000000000000
2^-26 = 0.0000000149011611938476562500000000000000000000000000
2^-27 = 0.0000000074505805969238281250000000000000000000000000
2^-28 = 0.0000000037252902984619140625000000000000000000000000
2^-29 = 0.0000000018626451492309570312500000000000000000000000
2^-30 = 0.0000000009313225746154785156250000000000000000000000
2^-31 = 0.0000000004656612873077392578125000000000000000000000
2^-32 = 0.0000000002328306436538696289062500000000000000000000
2^-33 = 0.0000000001164153218269348144531250000000000000000000
2^-34 = 0.0000000000582076609134674072265625000000000000000000
2^-35 = 0.0000000000291038304567337036132812500000000000000000
2^-36 = 0.0000000000145519152283668518066406250000000000000000
2^-37 = 0.0000000000072759576141834259033203125000000000000000
2^-38 = 0.0000000000036379788070917129516601562500000000000000
2^-39 = 0.0000000000018189894035458564758300781250000000000000
2^-40 = 0.0000000000009094947017729282379150390625000000000000
2^-41 = 0.0000000000004547473508864641189575195312500000000000
2^-42 = 0.0000000000002273736754432320594787597656250000000000
2^-43 = 0.0000000000001136868377216160297393798828125000000000
2^-44 = 0.0000000000000568434188608080148696899414062500000000
2^-45 = 0.0000000000000284217094304040074348449707031250000000
2^-46 = 0.0000000000000142108547152020037174224853515625000000
2^-47 = 0.0000000000000071054273576010018587112426757812500000
2^-48 = 0.0000000000000035527136788005009293556213378906250000
2^-49 = 0.0000000000000017763568394002504646778106689453125000
2^-50 = 0.0000000000000008881784197001252323389053344726562500
2^-51 = 0.0000000000000004440892098500626161694526672363281250
2^-52 = 0.0000000000000002220446049250313080847263336181640625
#ThomasMcLeod: I think the significant digit rule comes from my field, physics, and means something more subtle:
If you have a measurement that gets you the value 1.52 and you cannot read any more detail off the scale, and say you are supposed to add another number (for example of another measurement because this one's scale was too small) to it, say 2, then the result (obviously) has only 2 decimal places, i.e. 3.52.
But likewise, if you add 1.1111111111 to the value 1.52, you get the value 2.63 (and nothing more!).
The reason for the rule is to prevent you from kidding yourself into thinking you got more information out of a calculation than you put in by the measurement (which is impossible, but would seem that way by filling it with garbage, see above).
That said, this specific rule is for addition only (for addition: the error of the result is the sum of the two errors - so if you measure just one badly, though luck, there goes your precision...).
How to get the other rules:
Let's say a is the measured number and δa the error. Let's say your original formula was:
f:=m a
Let's say you also measure m with error δm (let that be the positive side).
Then the actual limit is:
f_up=(m+δm) (a+δa)
and
f_down=(m-δm) (a-δa)
So,
f_up =m a+δm δa+(δm a+m δa)
f_down=m a+δm δa-(δm a+m δa)
Hence, now the significant digits are even less:
f_up ~m a+(δm a+m δa)
f_down~m a-(δm a+m δa)
and so
δf=δm a+m δa
If you look at the relative error, you get:
δf/f=δm/m+δa/a
For division it is
δf/f=δm/m-δa/a
Hope that gets the gist across and hope I didn't make too many mistakes, it's late here :-)
tl,dr: Significant digits mean how many of the digits in the output actually come from the digits in your input (in the real world, not the distorted picture that floating point numbers have).
If your measurements were 1 with "no" error and 3 with "no" error and the function is supposed to be 1/3, then yes, all infinite digits are actual significant digits. Otherwise, the inverse operation would not work, so obviously they have to be.
If significant digit rule means something completely different in another field, carry on :-)