Why are there random trash digits in floating point numbers? [duplicate] - c++

There have been several questions posted to SO about floating-point representation. For example, the decimal number 0.1 doesn't have an exact binary representation, so it's dangerous to use the == operator to compare it to another floating-point number. I understand the principles behind floating-point representation.
What I don't understand is why, from a mathematical perspective, the numbers to the right of the decimal point are any more "special" than the ones to the left.
For example, the number 61.0 has an exact binary representation because the integral portion of any number is always exact. But the number 6.10 is not exact. All I did was move the decimal one place and suddenly I've gone from Exactopia to Inexactville. Mathematically, there should be no intrinsic difference between the two numbers -- they're just numbers.
By contrast, if I move the decimal one place in the other direction to produce the number 610, I'm still in Exactopia. I can keep going in that direction (6100, 610000000, 610000000000000) and they're still exact, exact, exact. But as soon as the decimal crosses some threshold, the numbers are no longer exact.
What's going on?
Edit: to clarify, I want to stay away from discussion about industry-standard representations, such as IEEE, and stick with what I believe is the mathematically "pure" way. In base 10, the positional values are:
... 1000 100 10 1 1/10 1/100 ...
In binary, they would be:
... 8 4 2 1 1/2 1/4 1/8 ...
There are also no arbitrary limits placed on these numbers. The positions increase indefinitely to the left and to the right.

Decimal numbers can be represented exactly, if you have enough space - just not by floating binary point numbers. If you use a floating decimal point type (e.g. System.Decimal in .NET) then plenty of values which can't be represented exactly in binary floating point can be exactly represented.
Let's look at it another way - in base 10 which you're likely to be comfortable with, you can't express 1/3 exactly. It's 0.3333333... (recurring). The reason you can't represent 0.1 as a binary floating point number is for exactly the same reason. You can represent 3, and 9, and 27 exactly - but not 1/3, 1/9 or 1/27.
The problem is that 3 is a prime number which isn't a factor of 10. That's not an issue when you want to multiply a number by 3: you can always multiply by an integer without running into problems. But when you divide by a number which is prime and isn't a factor of your base, you can run into trouble (and will do so if you try to divide 1 by that number).
Although 0.1 is usually used as the simplest example of an exact decimal number which can't be represented exactly in binary floating point, arguably 0.2 is a simpler example as it's 1/5 - and 5 is the prime that causes problems between decimal and binary.
Side note to deal with the problem of finite representations:
Some floating decimal point types have a fixed size, like System.Decimal; others, like java.math.BigDecimal, are "arbitrarily large" - but they'll hit a limit at some point, whether it's system memory or the theoretical maximum size of an array. This is an entirely separate point to the main one of this answer, however. Even if you had a genuinely arbitrarily large number of bits to play with, you still couldn't represent decimal 0.1 exactly in a floating binary point representation. Compare that with the other way round: given an arbitrary number of decimal digits, you can exactly represent any number which is exactly representable as a floating binary point.
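To see both directions of that last point, here is a minimal Python sketch. It assumes the standard library's decimal.Decimal and fractions.Fraction as stand-ins for a floating decimal point type and exact rational arithmetic; the variable names are only for illustration.

```python
from decimal import Decimal
from fractions import Fraction

# Fraction(float) and Decimal(float) recover the exact value the double stores.
stored = Fraction(0.1)

# The stored value is a ratio whose denominator is a power of two, so it
# cannot be equal to 1/10.
d = stored.denominator
print(stored == Fraction(1, 10))   # False
print(d & (d - 1) == 0)            # True: the denominator is a power of two

# The other direction works: the double has a finite, exact decimal expansion.
print(Decimal(0.1))

# A floating decimal point type represents 0.1, 0.2 and 0.3 exactly.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True
print(0.1 + 0.2 == 0.3)                                    # False
```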

For example, the number 61.0 has an exact binary representation because the integral portion of any number is always exact. But the number 6.10 is not exact. All I did was move the decimal one place and suddenly I've gone from Exactopia to Inexactville. Mathematically, there should be no intrinsic difference between the two numbers -- they're just numbers.
Let's step away for a moment from the particulars of bases 10 and 2. Let's ask - in base b, what numbers have terminating representations, and what numbers don't? A moment's thought tells us that a number x has a terminating b-representation if and only if there exists an integer n such that x b^n is an integer.
So, for example, x = 11/500 has a terminating 10-representation, because we can pick n = 3 and then x b^n = 22, an integer. However x = 1/3 does not, because whatever n we pick we will not be able to get rid of the 3.
This second example prompts us to think about factors, and we can see that for any rational x = p/q (assumed to be in lowest terms), we can answer the question by comparing the prime factorisations of b and q. If q has any prime factors not in the prime factorisation of b, we will never be able to find a suitable n to get rid of these factors.
Thus for base 10, any p/q where q has prime factors other than 2 or 5 will not have a terminating representation.
So now going back to bases 10 and 2, we see that a rational p/q (in lowest terms) has a terminating 10-representation exactly when q has only 2s and 5s in its prime factorisation; and that same number will have a terminating 2-representation exactly when q has only 2s in its prime factorisation.
But one of these cases is a subset of the other! Whenever
q has only 2s in its prime factorisation
it obviously is also true that
q has only 2s and 5s in its prime factorisation
or, put another way, whenever p/q has a terminating 2-representation, p/q has a terminating 10-representation. The converse however does not hold - whenever q has a 5 in its prime factorisation (and no primes other than 2 and 5), p/q will have a terminating 10-representation, but not a terminating 2-representation. This is the 0.1 example mentioned by other answers.
So there we have the answer to your question - because the prime factors of 2 are a subset of the prime factors of 10, all 2-terminating numbers are 10-terminating numbers, but not vice versa. It's not about 61 versus 6.1 - it's about 10 versus 2.
As a closing note, if by some quirk people used (say) base 17 but our computers used base 5, your intuition would never have been led astray by this - there would be no (non-zero, non-integer) numbers which terminated in both cases!
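The criterion above can be checked mechanically. This is a small sketch, not a library routine: the function name terminates is made up for illustration, and it assumes q is a positive integer and base is at least 2.

```python
from math import gcd

def terminates(p, q, base):
    """True if p/q has a finite (terminating) expansion in the given base."""
    q //= gcd(p, q)              # reduce p/q to lowest terms
    g = gcd(q, base)
    while g > 1:                 # strip every prime factor q shares with the base
        while q % g == 0:
            q //= g
        g = gcd(q, base)
    return q == 1                # anything left over is a prime the base lacks

print(terminates(11, 500, 10))   # True
print(terminates(1, 3, 10))      # False
print(terminates(61, 10, 10))    # True:  6.1 is fine in base 10
print(terminates(61, 10, 2))     # False: 6.1 is not fine in base 2
print(terminates(61, 1, 2))      # True:  61 is fine in base 2
```

With bases 17 and 5, terminates(1, 17, 5) and terminates(1, 5, 17) are both False, which is the closing remark in action.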

The root (mathematical) reason is that when you are dealing with integers, they are countably infinite.
Which means, even though there are an infinite amount of them, we could "count out" all of the items in the sequence, without skipping any. That means if we want to get the item in the 610000000000000th position in the list, we can figure it out via a formula.
However, real numbers are uncountably infinite. You can't say "give me the real number at position 610000000000000" and get back an answer. The reason is because, even between 0 and 1, there are an infinite number of values, when you are considering floating-point values. The same holds true for any two floating point numbers.
More info:
http://en.wikipedia.org/wiki/Countable_set
http://en.wikipedia.org/wiki/Uncountable_set
Update:
My apologies, I appear to have misinterpreted the question. My response is about why we cannot represent every real value, I hadn't realized that floating point was automatically classified as rational.

To repeat what I said in my comment to Mr. Skeet: we can represent 1/3, 1/9, 1/27, or any rational in decimal notation. We do it by adding an extra symbol. For example, a line over the digits that repeat in the decimal expansion of the number. What we need to represent decimal numbers as a sequence of binary numbers are 1) a sequence of binary numbers, 2) a radix point, and 3) some other symbol to indicate the repeating part of the sequence.
Hehner's quote notation is a way of doing this. He uses a quote symbol to represent the repeating part of the sequence. The article: http://www.cs.toronto.edu/~hehner/ratno.pdf and the Wikipedia entry: http://en.wikipedia.org/wiki/Quote_notation.
There's nothing that says we can't add a symbol to our representation system, so we can represent decimal rationals exactly using binary quote notation, and vice versa.
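As a rough illustration of "digits plus a marker for the repeating part", here is a sketch that does long division and detects where the remainders start to cycle. Note that this is the overline-style idea, not Hehner's quote notation itself; the function name expand is made up, and it assumes p >= 0, q > 0 and a base of at most 10.

```python
def expand(p, q, base=2):
    """Return (integer part, non-repeating digits, repeating digits) of p/q."""
    integer, r = divmod(p, q)
    digits, seen = [], {}
    while r and r not in seen:
        seen[r] = len(digits)          # remember where this remainder first appeared
        d, r = divmod(r * base, q)     # one step of long division
        digits.append(str(d))
    if r == 0:                         # the expansion terminates
        return integer, "".join(digits), ""
    start = seen[r]                    # the digits from here on repeat forever
    return integer, "".join(digits[:start]), "".join(digits[start:])

print(expand(1, 10, 2))    # (0, '0', '0011')  i.e. 0.0(0011) in binary
print(expand(1, 3, 10))    # (0, '', '3')      i.e. 0.(3) in decimal
print(expand(19, 2, 2))    # (9, '1', '')      i.e. 9.5 terminates in binary
```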

BCD - Binary-coded Decimal - representations are exact. They are not very space-efficient, but that's a trade-off you have to make for accuracy in this case.

This is a good question.
All your question is based on "how do we represent a number?"
ALL the numbers can be represented with decimal representation or with binary (2's complement) representation. All of them !!
BUT some (most of them) require infinite number of elements ("0" or "1" for the binary position, or "0", "1" to "9" for the decimal representation).
Like 1/3 in decimal representation (1/3 = 0.3333333... <- with an infinite number of "3")
Like 0.1 in binary ( 0.1 = 0.00011001100110011.... <- with an infinite number of "0011")
Everything is in that concept. Since your computer can only consider finite set of digits (decimal or binary), only some numbers can be exactly represented in your computer...
And as Jon said, 3 is a prime number which isn't a factor of 10, so 1/3 cannot be represented with a finite number of elements in base 10.
Even with arithmetic with arbitrary precision, the numbering position system in base 2 is not able to fully describe 6.1, although it can represent 61.
For 6.1, we must use another representation (like decimal representation, or IEEE 854 that allows base 2 or base 10 for the representation of floating-point values)

If you make a big enough number with floating point (as it can do exponents), then you'll end up with inexactness in front of the decimal point, too. So I don't think your question is entirely valid because the premise is wrong; it's not the case that shifting by 10 will always create more precision, because at some point the floating point number will have to use exponents to represent the largeness of the number and will lose some precision that way as well.
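To make this concrete for the numbers in the question, here is a quick Python sketch. It assumes Python's float is an IEEE 754 double (true on mainstream platforms); the helper name first_inexact_shift is made up for illustration.

```python
from fractions import Fraction

def first_inexact_shift(n=61):
    """Smallest k for which n * 10**k is no longer exactly representable as a double."""
    k = 0
    # float(int) rounds to the nearest double; Fraction(float) is the exact stored value.
    while Fraction(float(n * 10**k)) == n * 10**k:
        k += 1
    return k

print(first_inexact_shift())   # 21
```

So 61, 610, 6100, ... stay exact for a while, but 61 followed by 21 zeroes already needs more significant bits than a double has.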

It's the same reason you cannot represent 1/3 exactly in base 10, you need to say 0.33333(3). In binary it is the same type of problem but just occurs for different set of numbers.

(Note: I'll append 'b' to indicate binary numbers here. All other numbers are given in decimal)
One way to think about things is in terms of something like scientific notation. We're used to seeing numbers expressed in scientific notation like, 6.022141 * 10^23. Floating point numbers are stored internally using a similar format - mantissa and exponent, but using powers of two instead of ten.
Your 61.0 could be rewritten as 1.90625 * 2^5, or 1.11101b * 2^101b with the mantissa and exponent in binary. To multiply that by ten (and move the decimal point), we can do:
(1.90625 * 2^5) * (1.25 * 2^3) = (2.3828125 * 2^8) = (1.19140625 * 2^9)
or, with the mantissas and exponents in binary:
(1.11101b * 2^101b) * (1.01b * 2^11b) = (10.0110001b * 2^1000b) = (1.00110001b * 2^1001b)
Note what we did there to multiply the numbers. We multiplied the mantissas and added the exponents. Then, since the mantissa ended greater than two, we normalized the result by bumping the exponent. It's just like when we adjust the exponent after doing an operation on numbers in decimal scientific notation. In each case, the values that we worked with had a finite representation in binary, and so the values output by the basic multiplication and addition operations also produced values with a finite representation.
Now, consider how we'd divide 61 by 10. We'd start by dividing the mantissas, 1.90625 and 1.25. In decimal, this gives 1.525, a nice short number. But what is this if we convert it to binary? We'll do it the usual way -- subtracting out the largest power of two whenever possible, just like converting integer decimals to binary, but we'll use negative powers of two:
1.525 - 1*2^0 --> 1
0.525 - 1*2^-1 --> 1
0.025 - 0*2^-2 --> 0
0.025 - 0*2^-3 --> 0
0.025 - 0*2^-4 --> 0
0.025 - 0*2^-5 --> 0
0.025 - 1*2^-6 --> 1
0.009375 - 1*2^-7 --> 1
0.0015625 - 0*2^-8 --> 0
0.0015625 - 0*2^-9 --> 0
0.0015625 - 1*2^-10 --> 1
0.0005859375 - 1*2^-11 --> 1
0.00009765625...
Uh oh. Now we're in trouble. It turns out that 1.90625 / 1.25 = 1.525 is a repeating fraction when expressed in binary: 1.11101b / 1.01b = 1.10000110011...b. Our machines only have so many bits to hold that mantissa and so they'll just round the fraction and assume zeroes beyond a certain point. The error you see when you divide 61 by 10 is the difference between:
1.100001100110011001100110011001100110011...b * 2^10b
and, say:
1.100001100110011001100110b * 2^10b
It's this rounding of the mantissa that leads to the loss of precision that we associate with floating point values. Even when the mantissa can be expressed exactly (e.g., when just adding two numbers), we can still get numeric loss if the mantissa needs too many digits to fit after normalizing the exponent.
We actually do this sort of thing all the time when we round decimal numbers to a manageable size and just give the first few digits of it. Because we express the result in decimal it feels natural. But if we rounded a decimal and then converted it to a different base, it'd look just as ugly as the decimals we get due to floating point rounding.
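The subtract-the-largest-power-of-two procedure used above is easy to reproduce. This is only a sketch: binary_digits is a made-up name, it assumes 0 <= x < 2, and it uses exact Fraction arithmetic so the remainders are not themselves rounded.

```python
from fractions import Fraction

def binary_digits(x, fraction_bits):
    """Greedily subtract 2^0, 2^-1, ... from x; return (bit string, leftover remainder)."""
    x = Fraction(x)
    bits = []
    for k in range(fraction_bits + 1):
        weight = Fraction(1, 2**k)
        if x >= weight:
            bits.append("1")
            x -= weight
        else:
            bits.append("0")
    return bits[0] + "." + "".join(bits[1:]), x

# The mantissa of 61/10: the remainder never reaches zero.
print(binary_digits(Fraction(61, 40), 11))        # ('1.10000110011', Fraction(1, 10240))

# The mantissa of 61*10: the remainder hits zero, so the value is exact.
print(binary_digits(Fraction("1.19140625"), 8))   # ('1.00110001', Fraction(0, 1))
```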

I'm surprised no one has stated this yet: use continued fractions. Any rational number can be represented finitely in binary this way.
Some examples:
1/3 (0.3333...)
0; 3
5/9 (0.5555...)
0; 1, 1, 4
10/43 (0.232558139534883720930...)
0; 4, 3, 3
9093/18478 (0.49209871198181621387596060179673...)
0; 2, 31, 7, 8, 5
From here, there are a variety of known ways to store a sequence of integers in memory.
In addition to storing your number with perfect accuracy, continued fractions also have some other benefits, such as best rational approximation. If you decide to terminate the sequence of numbers in a continued fraction early, the terms you kept (when recombined to a fraction) will give you the best possible fraction for a denominator of that size. This is how approximations to pi are found:
Pi's continued fraction:
3; 7, 15, 1, 292 ...
Terminating the sequence at 1, this gives the fraction:
355/113
which is an excellent rational approximation.
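For anyone who wants to check the examples, here is a short sketch of both directions. The function names to_cf and from_cf are made up, and the code assumes the input is an exact rational (a Fraction or an int), not a float.

```python
from fractions import Fraction

def to_cf(x):
    """Continued fraction terms [a0; a1, a2, ...] of a rational x."""
    x = Fraction(x)
    terms = []
    while True:
        a = x.numerator // x.denominator   # integer part
        terms.append(a)
        x -= a
        if x == 0:
            return terms
        x = 1 / x                          # continue with the reciprocal of the rest

def from_cf(terms):
    """Recombine continued fraction terms into a single fraction."""
    value = Fraction(terms[-1])
    for a in reversed(terms[:-1]):
        value = a + 1 / value
    return value

print(to_cf(Fraction(9093, 18478)))   # [0, 2, 31, 7, 8, 5]
print(from_cf([3, 7, 15, 1]))         # 355/113
```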

In the equation
2^x = y ;
x = log(y) / log(2)
Hence, I was just wondering if we could have a logarithmic base system for binary like,
2^1, 2^0, 2^(log(1/2) / log(2)), 2^(log(1/4) / log(2)), 2^(log(1/8) / log(2)),2^(log(1/16) / log(2)) ........
That might be able to solve the problem, so if you wanted to write something like 32.41 in binary, that would be
2^5 + 2^(log(0.4) / log(2)) + 2^(log(0.01) / log(2))
Or
2^5 + 2^(log(0.41) / log(2))

The problem is that you do not really know whether the number actually is exactly 61.0 . Consider this:
float a = 60;
float b = 0.1;
float c = a + b * 10;
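What is the value of c? You cannot assume it is exactly 61, because b is not really .1: .1 does not have an exact binary representation, so every computation that uses b starts from a slightly wrong value and only lands on the expected result if the rounding errors happen to cancel.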

The number 61.0 does indeed have an exact floating-point representation, but that's not true for all integers. If you wrote a loop that added one to both a double-precision floating point number and a 64-bit integer, eventually you'd reach a point where the 64-bit integer perfectly represents a number, but the floating point doesn't, because there aren't enough significant bits.
It's just much easier to reach the point of approximation on the right side of the decimal point. If you started writing out all the numbers in binary floating point, it'd make more sense.
Another way of thinking about it is that when you note that 61.0 is perfectly representable in base 10, and shifting the decimal point around doesn't change that, you're performing multiplication by powers of ten (10^1, 10^-1). In floating point, multiplying by powers of two does not affect the precision of the number. Try taking 61.0 and dividing it by three repeatedly for an illustration of how a perfectly precise number can lose its precise representation.
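These claims are easy to try out. The sketch below assumes Python's float is an IEEE 754 double and uses fractions.Fraction to compare the stored values exactly.

```python
from fractions import Fraction

x = 61.0

# Scaling by a power of two only changes the exponent, so nothing is lost.
print(Fraction(x / 2**20) == Fraction(61, 2**20))   # True

# Dividing by three introduces a factor the base does not have.
print(Fraction(x / 3) == Fraction(61, 3))           # False

# Large integers run out of significant bits: a double has 53 of them.
big = float(2**53)
print(big + 1 == big)                               # True: 2**53 + 1 is not representable
```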

There's a threshold because the meaning of the digit has gone from integer to non-integer. To represent 61, you have 6*10^1 + 1*10^0; 10^1 and 10^0 are both integers. 6.1 is 6*10^0 + 1*10^-1, but 10^-1 is 1/10, which is definitely not an integer. That's how you end up in Inexactville.

A parallel can be made of fractions and whole numbers. Some fractions eg 1/7 cannot be represented in decimal form without lots and lots of decimals. Because floating point is binary based the special cases change but the same sort of accuracy problems present themselves.

There are an infinite number of rational numbers, and a finite number of bits with which to represent them. See http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems.

You know integer numbers, right? Each bit represents 2^n:
2^4=16
2^3=8
2^2=4
2^1=2
2^0=1
Well, it's the same for floating point (with some distinctions), but the bits of the fraction represent 2^-n:
2^-1=1/2=0.5
2^-2=1/(2*2)=0.25
2^-3=0.125
2^-4=0.0625
Floating point binary representation:
Sign, Exponent, Fraction (an implicit leading 1 is prepended to the fraction)
B11 B10 B9 B8 B7 B6 B5 B4 B3 B2 B1 B0

The high scoring answer above nailed it.
First you were mixing base 2 and base 10 in your question; then, when you put a number on the right side that is not divisible into the base, you get problems. Like 1/3 in decimal, because 3 doesn't go into a power of 10, or 1/5 in binary, because 5 doesn't go into a power of 2.
Another comment though NEVER use equal with floating point numbers, period. Even if it is an exact representation there are some numbers in some floating point systems that can be accurately represented in more than one way (IEEE is bad about this, it is a horrible floating point spec to start with, so expect headaches). No different here 1/3 is not EQUAL to the number on your calculator 0.3333333, no matter how many 3's there are to the right of the decimal point. It is or can be close enough but is not equal. so you would expect something like 2*1/3 to not equal 2/3 depending on the rounding. Never use equal with floating point.

As we have been discussing, in floating point arithmetic, the decimal 0.1 cannot be perfectly represented in binary.
Floating point and integer representations provide grids or lattices for the numbers represented. As arithmetic is done, the results fall off the grid and have to be put back onto the grid by rounding. Example is 1/10 on a binary grid.
If we use binary coded decimal representation as one gentleman suggested, would we be able to keep numbers on the grid?

For a simple answer: the computer doesn't have infinite memory to store the fraction (after writing the number in a binary form of scientific notation). In the IEEE 754 double-precision format, the significand has only 53 bits of precision (52 stored fraction bits plus one implicit leading bit).
For more info: http://mathcenter.oxford.emory.edu/site/cs170/ieee754/

I will not bother to repeat what the other 20 answers have already summarized, so I will just answer briefly:
The answer in your content:
Why can't base two numbers represent certain ratios exactly?
For the same reason that decimals are insufficient to represent certain ratios, namely, irreducible fractions with denominators containing prime factors other than two or five which will always have an indefinite string in at least the mantissa of its decimal expansion.
Why can't decimal numbers be represented exactly in binary?
This question at face value is based on a misconception regarding values themselves. No number system is sufficient to represent any quantity or ratio in a manner that the thing itself tells you that it is both a quantity, and at the same time also gives the interpretation in and of itself about the intrinsic value of the representation. As such, all quantitative representations, and models in general, are symbolic and can only be understood a posteriori, namely, after one has been taught how to read and interpret these numbers.
Since models are subjective things that are true insofar as they reflect reality, we do not strictly need to interpret a binary string as sums of negative and positive powers of two. Instead, one may observe that we can create an arbitrary set of symbols that use base two or any other base to represent any number or ratio exactly. Just consider that we can refer to all of infinity using a single word and even a single symbol without "showing infinity" itself.
As an example, I am designing a binary encoding for mixed numbers so that I can have more precision and accuracy than an IEEE 754 float. At the time of writing this, the idea is to have a sign bit, a reciprocal bit, a certain number of bits for a scalar to determine how much to "magnify" the fractional portion, and then the remaining bits are divided evenly between the integer portion of a mixed number, and the latter a fixed-point number which, if the reciprocal bit is set, should be interpreted as one divided by that number. This has the benefit of allowing me to represent numbers with infinite decimal expansions by using their reciprocals which do have terminating decimal expansions, or alternatively, as a fraction directly, potentially as an approximation, depending on my needs.

You can't represent 0.1 exactly in binary for the same reason you can't measure 0.1 inch using a conventional English ruler.
English rulers, like binary fractions, are all about halves. You can measure half an inch, or a quarter of an inch (which is of course half of a half), or an eighth, or a sixteenth, etc.
If you want to measure a tenth of an inch, though, you're out of luck. It's less than an eighth of an inch, but more than a sixteenth. If you try to get more exact, you find that it's a little more than 3/32, but a little less than 7/64. I've never seen an actual ruler that had gradations finer than 64ths, but if you do the math, you'll find that 1/10 is less than 13/128, and it's more than 25/256, and it's more than 51/512. You can keep going finer and finer, to 1024ths and 2048ths and 4096ths and 8192nds, but you will never find an exact marking, even on an infinitely-fine base-2 ruler, that exactly corresponds to 1/10, or 0.1.
You will find something interesting, though. Let's look at all the approximations I've listed, and for each one, record explicitly whether 0.1 is less or greater:
fraction    decimal           0.1 is...   as 0/1
1/2         0.5               less        0
1/4         0.25              less        0
1/8         0.125             less        0
1/16        0.0625            greater     1
3/32        0.09375           greater     1
7/64        0.109375          less        0
13/128      0.1015625         less        0
25/256      0.09765625        greater     1
51/512      0.099609375       greater     1
103/1024    0.1005859375      less        0
205/2048    0.10009765625     less        0
409/4096    0.099853515625    greater     1
819/8192    0.0999755859375   greater     1
Now, if you read down the last column, you get 0001100110011. It's no coincidence that the infinitely-repeating binary fraction for 1/10 is 0.0001100110011...
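The ruler procedure can be written down directly. This is a sketch with a made-up function name (ruler_bits); it uses exact fractions so the marks themselves are not rounded, and it assumes 0 <= target < 1.

```python
from fractions import Fraction

def ruler_bits(target, steps):
    """For each finer ruler, test the next mark and record 1 if target is at or past it."""
    reached = Fraction(0)              # the largest mark known to be <= target
    rows = []
    for k in range(1, steps + 1):
        mark = reached + Fraction(1, 2**k)
        bit = 1 if target >= mark else 0
        if bit:
            reached = mark
        rows.append((mark, bit))
    return rows

rows = ruler_bits(Fraction(1, 10), 13)
for mark, bit in rows:
    print(mark, "greater" if bit else "less", bit)

bits = "".join(str(bit) for _, bit in rows)
print(bits)   # 0001100110011
```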

Related

IEEE754 float point substraction precision lost

Here is the subtraction
First number
Decimal 3.0000002
Hexadecimal 0x40400001
Binary: Sign[0], Exponent[1000_0000], Mantissa[100_0000_0000_0000_0000_0001]
subtract second number:
Decimal 3.000000
Hexadecimal 0x40400000
Binary: Sign[0], Exponent[1000_0000], Mantissa[100_0000_0000_0000_0000_0000]
==========================================
In this situation, the exponents are already the same, so we just need to subtract the mantissas. We know that in IEEE754 there is a hidden bit 1 in front of the mantissa. Therefore, the result mantissa should be:
Mantissa_1[1100_0000_0000_0000_0000_0001] - Mantissa_2[1100_0000_0000_0000_0000_0000]
which equal to
Mantissa_Rst = [0000_0000_0000_0000_0000_0001]
But this number is not normalized, because the leading (hidden) bit is not 1. Thus we shift Mantissa_Rst left 23 times and subtract 23 from the exponent at the same time.
Then we have the result value
Hexadecimal 0x34800000
Binary: Sign[0], Exponent[0110_1001], Mantissa[000_0000_0000_0000_0000_0000].
32 bits total, no rounding needed.
Notice that in the mantissa region, there still is a hidden 1.
If my calculations are correct, then converting the result to a decimal number gives 0.00000023841858. Compared with the real result, 0.0000002, I still think that is not very precise.
So the question is, are my calculations wrong? or actually this is a real situation and happens all the time in computer?
The inaccuracy already starts with your input. 3.0000002 is a fraction with a prime factor of five in the denominator, so its "decimal" expansion in base 2 is periodic. No amount of mantissa bits will suffice to represent it exactly. The float you give actually has the value 3.0000002384185791015625 (this is exact). Yes, this happens all the time.
Don't despair, though! Base ten has the same problem (for example 1/3). It isn't a problem. Well, it is for some people, but luckily there are other number types available for their needs. Floating point numbers have many advantages, and slight rounding error is irrelevant for many applications, for example when not even your inputs are perfectly accurate measurements of what you're interested in (a lot of scientific computing and simulation). Also remember that 64-bit floats exist. Additionally, the error is bounded: with the best possible rounding, your result will be within 0.5 units in the last place of the infinite-precision result. For a 32-bit float of the same magnitude as your example, this is 2^-23, or approximately 1.2 * 10^-7. This gets worse and worse as you do additional operations that have to round, but with careful numeric analysis and the right algorithms, you can get a lot of mileage out of them.
Whenever x/2 ≤ y ≤ 2x, the calculation x - y is exact which means there is no rounding error whatsoever. That is also the case in your example.
You just made the wrong assumption that you could have a floating point number that is equal to 3.0000002. You can't. The type "float" can only ever represent integers less than 2^24, multiplied by a power of two. 3.0000002 is not such a number, therefore it is rounded to the nearest floating point number, which is close to 3.00000023841858. Subtracting 3 calculates the difference exactly and gives a result close to 0.00000023841858.
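You can confirm all of this without doing the bit arithmetic by hand. The sketch below uses Python's struct module to round a value to a 32-bit float and read back its bit pattern; the helper names to_float32 and float32_bits are made up for illustration.

```python
import struct
from decimal import Decimal

def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def float32_bits(x):
    """Bit pattern of x as a 32-bit float, returned as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

a = to_float32(3.0000002)
diff = a - 3.0                    # exact: both operands are close to each other

print(hex(float32_bits(a)))       # 0x40400001
print(Decimal(a))                 # 3.0000002384185791015625
print(diff == 2.0**-22)           # True: the subtraction itself lost nothing
print(hex(float32_bits(diff)))    # 0x34800000
```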

Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2^-49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2^n if n is limited to integer values.
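If you want to check that fraction yourself, Python can report the exact value a float stores. A minimal sketch, assuming Python's float is a 64-bit double:

```python
from decimal import Decimal
from fractions import Fraction

stored = Fraction(9.2)   # the exact rational value of the double nearest to 9.2

# Same value as the integer-times-power-of-two form given above.
print(stored == Fraction(5179139571476070, 2**49))   # True

# It is close to, but not equal to, 92/10.
print(stored == Fraction(92, 10))                    # False

# The exact decimal expansion of the stored double.
print(Decimal(9.2))
```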
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
    import struct
    from itertools import islice

    def float_to_bin_parts(number, bits=64):
        if bits == 32:          # single precision
            int_pack = 'I'
            float_pack = 'f'
            exponent_bits = 8
            mantissa_bits = 23
            exponent_bias = 127
        elif bits == 64:        # double precision. all python floats are this
            int_pack = 'Q'
            float_pack = 'd'
            exponent_bits = 11
            mantissa_bits = 52
            exponent_bias = 1023
        else:
            raise ValueError('bits argument must be 32 or 64')
        # reinterpret the float's bytes as an unsigned integer, then zero-pad its binary string
        bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
        # slice the bit string into sign, exponent and mantissa
        return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2^((# of bits) - 1) - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413 x 10^23
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 2^52 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2^((# of bits) - 1) - 1 to get the true exponent
Mantissa (last component): Divide by 2^(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 10^11
Which we can then convert from binary to decimal:
1.1499999999999999 x 2^3 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
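The recap can be turned into a few lines of Python that rebuild the value from the three bit strings. This is a sketch only: parts_to_fraction is a made-up name, and it handles normal numbers only (no zeroes, subnormals, infinities or NaNs).

```python
from fractions import Fraction

def parts_to_fraction(sign, exponent, mantissa):
    """Rebuild the exact value from the sign, exponent and mantissa bit strings."""
    bias = 2**(len(exponent) - 1) - 1                  # 1023 for 11 exponent bits
    true_exponent = int(exponent, 2) - bias
    # divide by 2^(# of bits) and add the implicit leading 1
    true_mantissa = 1 + Fraction(int(mantissa, 2), 2**len(mantissa))
    value = true_mantissa * Fraction(2)**true_exponent
    return -value if sign == "1" else value

# The three components shown for 9.2 (the mantissa is 0010 followed by twelve 0110 groups).
nine_point_two = parts_to_fraction("0", "10000000010", "0010" + "0110" * 12)
print(nine_point_two)                    # an exact fraction, slightly below 9.2
print(float(nine_point_two))             # 9.2
print(nine_point_two == Fraction(9.2))   # True: same value the float stores
```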
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 10^11
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 10^(11-110100)
Convert to decimal:
5179139571476070 x 2^(3-52)
Subtract the exponent:
5179139571476070 x 2^-49
Turn negative exponent into division:
5179139571476070 / 2^49
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 10^11
Shift the decimal point:
10011 x 10^(11-100)
Subtract the exponent:
10011 x 10^-1
Binary to decimal:
19 x 2^-1
Negative exponent to division:
19 / 2^1
Multiply exponent:
19 / 2
Equals:
9.5
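Both reconstructions can be cross-checked with Python's standard `fractions` module, which converts a float to the exact ratio it stores. This is a quick sanity check under the assumption that `float` is an IEEE 754 double; the variable names are illustrative only.

```python
from fractions import Fraction

# Fraction(float) converts the stored binary value exactly, with no rounding.
stored_9_2 = Fraction(9.2)
stored_9_5 = Fraction(9.5)

# Fraction reduces to lowest terms, so compare against the derived ratio.
print(stored_9_2 == Fraction(5179139571476070, 2 ** 49))
print(stored_9_2 == Fraction(92, 10))   # is the stored value really 9.2?
print(stored_9_5 == Fraction(19, 2))    # is the stored value really 9.5?
```

The first and third comparisons hold; the second does not, which is the whole point of the walkthrough.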
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)_10 = 0.2_3
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)_10 = 0.5_10 = 0.1_2 = 0.1111..._3
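To experiment with other fractions and bases, the long-division procedure can be sketched as follows. `expand` is a hypothetical helper written for this illustration; it handles bases up to 10 and fractions between 0 and 1, and marks a repeating block with parentheses.

```python
from fractions import Fraction

def expand(frac, base, max_digits=60):
    """Return the fractional digits of frac (0 <= frac < 1) in the given base.

    Repeating digits are wrapped in parentheses, e.g. '0.(1)'.
    """
    digits = []
    seen = {}                      # remainder -> position where it first appeared
    remainder = frac
    while remainder != 0 and len(digits) < max_digits:
        if remainder in seen:      # same remainder again: the digits cycle from here
            start = seen[remainder]
            return "0." + "".join(digits[:start]) + "(" + "".join(digits[start:]) + ")"
        seen[remainder] = len(digits)
        remainder *= base
        digit = int(remainder)     # the integer part is the next digit
        digits.append(str(digit))
        remainder -= digit
    return "0." + "".join(digits)

print(expand(Fraction(2, 3), 3))    # finite in base 3
print(expand(Fraction(1, 2), 3))    # repeating in base 3
print(expand(Fraction(1, 10), 2))   # repeating in base 2
```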
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
Irrational means they cannot be written as a ratio of two integers, so their digits never terminate or repeat in any integer base. No amount of bit storage in the world would be enough to hold even one of them digit by digit. Only symbolic arithmetic is able to preserve their precision.
If you limit your math needs to rational numbers only, though, the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions, just like in high-school math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
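As a small illustration of the pair-of-integers idea, Python's standard `fractions.Fraction` stores exactly such a numerator/denominator pair with arbitrary-size integers. The values below are arbitrary examples.

```python
from fractions import Fraction

a = Fraction(1, 3)
b = Fraction(2, 7)

product = a * b            # (1*2) / (3*7), kept in lowest terms
total = a + a + a          # three thirds, with no rounding anywhere

print(product)
print(total == 1)
print(0.1 + 0.2 == 0.3)                                      # binary doubles
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # exact rationals
```

The double comparison fails while the rational one succeeds, but `Fraction` offers nothing for sqrt, sin, or pi, as noted above.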
TL;DR
For hardware-accelerated arithmetic, only a limited set of rational numbers can be represented. Every number that is not representable is approximated. Some numbers (the irrationals) can never be represented exactly, no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer), so unavoidably a great many numbers are impossible to represent. In particular, 64 bits allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can be represented are of the form m*2^e for some integers m and e.
You might come up with a different numeration system, base 10 for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-point numbers are extremely accurate. They can represent any number in a very wide range with as many as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in its lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2^b * 5^c).
On the other hand, the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2^b).
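The prime-factor rule can be turned into a short test. The sketch below is an illustration only; `terminates` is a hypothetical helper that strips from the reduced denominator every factor it shares with the base and checks whether anything is left.

```python
from fractions import Fraction
from math import gcd

def terminates(frac, base):
    """True if frac has a finite expansion in the given base."""
    d = frac.denominator           # Fraction keeps itself in lowest terms
    g = gcd(d, base)
    while g > 1:                   # strip every prime factor shared with the base
        d //= g
        g = gcd(d, base)
    return d == 1                  # nothing left means all factors divide the base

print(terminates(Fraction(92, 10), 10))  # 9.2 in base 10
print(terminates(Fraction(92, 10), 2))   # 9.2 in base 2
print(terminates(Fraction(19, 2), 2))    # 9.5 in base 2
```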
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number, then after each calculation you need to reduce the fraction to its lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator is too large, you need to find the closest representable result. This is non-trivial.
Some languages do offer fraction types, but usually they do so in combination with arbitrary precision. This avoids needing to worry about approximating fractions, but it creates its own problem: when a number passes through a large number of calculation steps, the size of the denominator, and hence the storage needed for the fraction, can explode.
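The denominator growth is easy to observe with an arbitrary-precision fraction type. The sketch below uses Python's `fractions.Fraction` and a harmonic sum chosen purely as an example workload.

```python
from fractions import Fraction

# Sum 1/1 + 1/2 + ... + 1/30 exactly and watch the denominator grow.
total = Fraction(0)
for k in range(1, 31):
    total += Fraction(1, k)
    if k % 10 == 0:
        print(k, "terms -> denominator has", len(str(total.denominator)), "digits")

print(float(total))  # the value itself stays small
```

The sum itself stays below 4, yet the exact denominator already needs more digits than a 32-bit integer can hold after only thirty additions.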
Some languages also offer decimal floating point types. These are mainly used in scenarios where it is important that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.

is it possible to categorize the different forms of the approximation of floating point numbers

I am just wondering if we can make rules for the form of the approximation of real numbers using floating point numbers.
For instance, can a floating-point number end like 1.xxx777777 (that is, end in a long run of 7s, possibly followed by a random digit)?
I believe that floating-point numbers only come in these forms:
1. an exact value.
2. a value like 1.23900008721..., where 1.239 is approximated with digits that appear as "noise", with zeros between the exact value and this noise
3. a value like 3.2599995, where 3.26 is approximated by a run of 9s and a final digit (like 5), that is, by a floating-point number just below the real number
4. a value like 2.000001, where 2.0 is approximated by a floating-point number just above the real number
You are thinking in terms of decimal numbers, that is, numbers that can be represented as n*(10^e), with e either positive or negative. These numbers occur naturally in your thought processes for historical reasons having to do with having ten fingers.
Computer numbers are represented in binary, for technical reasons that have to do with an electrical signal being either present or absent.
When you are dealing with smallish integer numbers, it does not matter much that the computer representation does not match your own, because you are thinking of an accurate approximation of the mathematical number, and so is the computer, so by transitivity, you and the computer are thinking about the same thing.
With either very large or very small numbers, you will tend to think in terms of powers of ten, and the computer will definitely think in terms of powers of two. In these cases you can observe a difference between your intuition and what the computer does, and also, your classification is nonsense. Binary floating-point numbers are neither more dense nor less dense near numbers that happen to have a compact representation as decimal numbers. They are simply represented in binary, n*(2^p), with p either positive or negative. Many real numbers have only an approximate representation in decimal, and many real numbers have only an approximate representation in binary. These numbers are not the same (binary numbers can be represented in decimal, but not always compactly. Some decimal numbers cannot be represented exactly in binary at all, for instance 0.1).
If you want to understand the computer's floating-point numbers, you must stop thinking in decimal. 1.23900008721.... is not special, and neither is 1.239. 3.2599995 is not special, and neither is 3.26. You think they are special because they are either exactly or close to compact decimal numbers. But that does not make any difference in binary floating-point.
Here are a few pieces of information that may amuse you, since you tagged your question C++:
If you print a double-precision number with the format %.16e, you get a decimal number that converts back to the original double. But it does not always represent the exact value of the original double. To see the exact value of the double in decimal, you must use %.53e.
If you write 0.1 in a program, the compiler interprets this as meaning 1.000000000000000055511151231257827021181583404541015625e-01, which is a relatively compact number in binary.
Your question speaks of 3.2599995 and 2.000001 as if these were floating-point numbers, but they aren't. If you write these numbers in a program, the compiler will interpret them as 3.25999950000000016103740563266910612583160400390625 and 2.00000100000000013977796697872690856456756591796875.
So the pattern you are looking for is simple: the decimal representation of a floating-point number is always 17 significant digits followed by 53-17=36 “noise” digits as you call them. The noise digits are sometimes all zeroes, and the significant digits can end in a bunch of zeroes too.
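The exact stored value and the round-trip property can both be checked without a C compiler. The sketch below uses Python's `decimal` module as a stand-in for `%.53e`, assuming Python's `float` is the same IEEE 754 double a C++ compiler would produce.

```python
from decimal import Decimal

x = 0.1
# Decimal(float) converts the stored binary value exactly, digit for digit.
print(Decimal(x))

# 17 significant digits ("%.16e") are enough to round-trip a double.
short = "%.16e" % x
print(short, float(short) == x)
```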
Floating point is represented by bits. What this means is:
.1 (the first bit after the binary point set) is 0.5 or 1/2
.01 (the second bit set) is 0.25 or 1/4
etc.
This means a floating-point value is close to, but not exactly, the number you wrote unless that number is a finite sum of powers of 2 that fits in what the machine can handle.
Rational numbers can be represented very accurately by the machine (though not exactly unless the fractional part is such a sum of powers of two), but irrational numbers will always carry an error. In this respect your question is not so much related to C++ as to computer architecture.

Representing probability in C++

I'm trying to represent a simple set of 3 probabilities in C++. For example:
a = 0.1
b = 0.2
c = 0.7
(As far as I know probabilities must add up to 1)
My problem is that when I try to represent 0.7 in C++ as a float I end up with 0.69999999, which won't help when I am doing my calculations later. The same for 0.8, 0.80000001.
Is there a better way of representing numbers between 0.0 and 1.0 in C++?
Bear in mind that this relates to how the numbers are stored in memory, so that when it comes to doing tests on the values they are correct. I'm not concerned with how they are displayed or printed out.
This has nothing to do with C++ and everything to do with how floating point numbers are represented in memory. You should never use the equality operator to compare floating point values, see here for better methods: http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
My problem is that when I try to
represent 0.7 in C++ as a float I end
up with 0.69999999, which won't help
when I am doing my calculations later.
The same for 0.8, 0.80000001.
Is it really a problem? If you just need more precision, use a double instead of a float. That should get you about 15 digits precision, more than enough for most work.
Consider your source data. Is 0.7 really significantly more correct than 0.69999999?
If so, you could use a rational number library such as:
http://www.boost.org/doc/libs/1_40_0/libs/rational/index.html
If the problem is that probabilities add up to 1 by definition, then store them as a collection of numbers, omitting the last one. Infer the last value by subtracting the sum of the others from 1.
How much precision do you need? You might consider scaling the values and quantizing them in a fixed-point representation.
The tests you want to do with your numbers will be incorrect.
There is no exact floating point representation in a base-2 number system for a number like 0.1, because in base 2 it is an infinite periodic number. Consider one third, which is exactly representable as 0.1 in a base-3 system, but is 0.333... in the base-10 system.
So any test you do with a number 0.1 in floating point will be prone to be flawed.
A solution would be using rational numbers (boost has a rational lib), which will be always exact for, ermm, rationals, or use a selfmade base-10 system by multiplying the numbers with a power of ten.
If you really need the precision, and are sticking with rational numbers, I suppose you could go with fixed-point arithmetic. I've not done this before so I can't recommend any libraries.
Alternatively, you can set a threshold when comparing fp numbers, but you'd have to err on one side or another -- say
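const float epsilon = 1e-6f;  // tolerance; choose a value that suits your data

bool fp_cmp(float a, float b) {
    // true when a is below b, or above it by less than epsilon
    return (a < b + epsilon);
}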
Note that excess precision is automatically truncated in each calculation, so you should take care when operating at many different orders of magnitude in your algorithm. A contrived example to illustrate:
a = 15434355e10 + 22543634e10
b = a / 1e20 + 1.1534634
c = b * 1e20
versus
c = a + 1.1534634e20
The two results will be very different. Using the first method a lot of the precision of the first two numbers will be lost in the divide by 1e20. Assuming that the final value you want is on the order of 1e20, the second method will give you more precision.
If you only need a few digits of precision then just use an integer. If you need better precision then you'll have to look to different libraries that provide guarantees on precision.
The issue here is that floating point numbers are stored in base 2. You cannot exactly represent every base-10 decimal fraction with a floating point number in base 2.
Let's step back a second. What does .1 mean? Or .7? They mean 1x10^-1 and 7x10^-1. If you're using binary for your number, instead of base 10 as we normally do, .1 means 1x2^-1, or 1/2. .11 means 1x2^-1 + 1x2^-2, or 1/2+1/4, or 3/4.
Note how in this system, the denominator is always a power of 2. A number whose denominator (in lowest terms) is not a power of 2 cannot be represented in a finite number of digits. For instance, .1 (in decimal) means 1/10, but in binary that is an infinite repeating fraction, 0.000110011... (with the 0011 pattern repeating forever). This is similar to how in base 10, 1/3 is an infinite fraction, 0.3333....; base 10 can only represent numbers exactly when the denominator is a product of powers of 2 and 5. (As an aside, base 12 and base 60 are actually really convenient bases, since 12 is divisible by 2, 3, and 4, and 60 is divisible by 2, 3, 4, and 5; but for some reason we use decimal anyhow, and we use binary in computers).
Since floating point numbers (or fixed point numbers) always have a finite number of digits, they cannot represent these infinite repeating fractions exactly. So, they either truncate or round the values to be as close as possible to the real value, but are not equal to the real value exactly. Once you start adding up these rounded values, you start getting more error. In decimal, if your representation of 1/3 is .333, then three copies of that will add up to .999, not 1.
There are four possible solutions. If all you care about is exactly representing decimal fractions like .1 and .7 (as in, you don't care that 1/3 will have the same problem you mention), then you can represent your numbers as decimal, for instance using binary coded decimal, and manipulate those. This is a common solution in finance, where many operations are defined in terms of decimal. This has the downside that you will need to implement all of your own arithmetic operations yourself, without the benefits of the computer's FPU, or find a decimal arithmetic library. This also, as mentioned, does not help with fractions that can't be represented exactly in decimal.
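To make the decimal option concrete, here is a minimal sketch using Python's standard `decimal` module as one example of a decimal arithmetic library. It is an illustration rather than a C++ recommendation, and the 28-digit precision is simply that module's default context.

```python
from decimal import Decimal

# Build the values from strings so no binary rounding happens on the way in.
probs = [Decimal("0.1"), Decimal("0.2"), Decimal("0.7")]
print(sum(probs) == Decimal("1"))     # decimal fractions add up exactly

# Fractions that do not terminate in base 10 are still rounded.
third = Decimal(1) / Decimal(3)
print(third * 3)
```

The first check succeeds, while the last line prints a run of 9s rather than 1, showing the limitation mentioned above.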
Another solution is to use fractions to represent your numbers. If you use fractions, with bignums (arbitrarily large numbers) for your numerators and denominators, you can represent any rational number that will fit in the memory of your computer. Again, the downside is that arithmetic will be slower, and you'll need to implement arithmetic yourself or use an existing library. This will solve your problem for all rational numbers, but if you wind up with a probability that is computed based on π or √2, you will still have the same issues with not being able to represent them exactly, and need to also use one of the later solutions.
A third solution, if all you care about is getting your numbers to add up to 1 exactly, is for events where you have n possibilities, to only store the values of n-1 of those probabilities, and compute the probability of the last as 1 minus the sum of the rest of the probabilities.
And a fourth solution is to do what you always need to remember when working with floating point numbers (or any inexact numbers, such as fractions being used to represent irrational numbers), and never compare two numbers for equality. Again in base 10, if you add up 3 copies of 1/3, you will wind up with .999. When you want to compare that number to 1, you have to instead compare to see if it is close enough to 1; check that the absolute value of the difference, 1-.999, is less than a threshold, such as .01.
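A minimal sketch of the threshold comparison, in Python for brevity. The tolerance `1e-9` is an arbitrary choice for this example; `math.isclose` is the standard-library helper that performs a relative-tolerance comparison.

```python
import math

total = sum([0.1] * 10)             # ten copies of the double nearest to 0.1
print(total == 1.0)                 # exact comparison
print(abs(total - 1.0) < 1e-9)      # absolute-difference test described above
print(math.isclose(total, 1.0))     # relative tolerance, 1e-09 by default
```

The exact comparison fails because the rounding errors accumulate; both tolerance-based checks pass.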
Binary machines always round decimal fractions that don't have an exact representation in floating point (that is, everything except .0, .5, .25, .75, etc.) to the nearest value that does. This has nothing to do with the language C++. There is no real way around it except to deal with it from a numerical perspective within your code.
As for actually producing the probabilities you seek:
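float pr[3] = {0.1, 0.2, 0.7};
float accPr[3];  /* cumulative probabilities */
float prev = 0.0;
int i = 0;
for (i = 0; i < 3; i++) {
    accPr[i] = prev + pr[i];
    prev = accPr[i];
}
/* Divide in floating point: rand() / (1 + RAND_MAX) is integer division,
   which always yields 0 and may overflow int. Needs <stdlib.h>. */
double frand = rand() / (1.0 + RAND_MAX);
for (i = 0; i < 2; i++) {
    if (frand < accPr[i]) break;
}
return i;  /* i is 2 if neither of the first two buckets matched */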
I'm sorry to say there's not really an easy answer to your problem.
It falls into a field of study called "Numerical Analysis" that deals with these types of problems (which goes far beyond just making sure you don't check for equality between 2 floating point values). And by field of study, I mean there are a slew of books, journal articles, courses etc. dealing with it. There are people who do their PhD thesis on it.
All I can say is that I'm thankful I don't have to deal with these issues very much, because the problems and the solutions are often very non-intuitive.
What you might need to do to deal with representing the numbers and calculations you're working on is very dependent on exactly what operations you're doing, the order of those operations and the range of values that you expect to deal with in those operations.
Depending on the requirements of your applications any one of several solutions could be best:
1. You live with the inherent lack of precision and use floats or doubles. You cannot test either for equality, and this implies that you cannot test the sum of your probabilities for equality with 1.0.
2. As proposed before, you can use integers if you require a fixed precision. You represent 0.7 as 7, 0.1 as 1, 0.2 as 2 and they will add up perfectly to 10, i.e., 1.0. If you have to calculate with your probabilities, especially if you do division and multiplication, you need to round the results correctly. This will introduce an imprecision again.
3. Represent your numbers as fractions with a pair of integers (1,2) = 1/2 = 0.5. Precise, more flexible than 2), but you don't want to calculate with those.
4. You can go all the way and use a library that implements rational numbers (e.g. gmp). Precise, with arbitrary precision, you can calculate with it, but slow.
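The scaled-integer option can be sketched as follows. The scale of 1000 and the names are assumptions made for this illustration; the point is that addition is exact while multiplication needs an explicit rounding step.

```python
SCALE = 1000                       # assumed resolution: thousandths

probs = {"a": 100, "b": 200, "c": 700}   # 0.1, 0.2, 0.7 stored as integers
assert sum(probs.values()) == SCALE      # exact: integer addition never rounds

# Multiplying two scaled values needs an explicit rounding step,
# which is where imprecision comes back in.
p_a_and_b = (probs["a"] * probs["b"] + SCALE // 2) // SCALE   # round to nearest
print(p_a_and_b, "/", SCALE)
```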
yeah, I'd scale the numbers (0-100)(0-1000) or whatever fixed size you need if you're worried about such things. It also makes for faster math computation in most cases. Back in the bad-old-days, we'd define entire cos/sine tables and other such bleh in integer form to reduce floating fuzz and increase computation speed.
I do find it a bit interesting that a "0.7" fuzzes like that on storage.

Determining output (printing) of float with %f in C/C++

I have gone through earlier discussions on floating-point numbers on SO, but they didn't clarify my problem. I know floating-point issues are common in every forum, but my question is not concerned with floating-point arithmetic or comparison. I am rather inquisitive about its representation and output with %f.
The question is straightforward: how do I determine the exact output of:
float <Float_Variable> = <Some_Value>f;
printf("%f \n",<Float_Variable>);
Let us consider this code snippet:
float f = 43.2f,
f1 = 23.7f,
f2 = 58.89f,
f3 = 0.7f;
printf("f1 = %f\n",f);
printf("f2 = %f\n",f1);
printf("f3 = %f\n",f2);
printf("f4 = %f\n",f3);
Output:
f1 = 43.200001
f2 = 23.700001
f3 = 58.889999
f4 = 0.700000
I am aware that %f (which is meant for double) has a default precision of 6. I am also aware that the problem (in this case) can be fixed by using double, but I am inquisitive about the outputs f2 = 23.700001 and f3 = 58.889999 in float.
EDIT: I am aware that floating-point numbers cannot be represented precisely, but what is the rule for obtaining the closest representable value?
Thanks,
Assuming that you're talking about IEEE 754 float, which has a precision of 24 binary digits: represent the number in binary (exactly) and round the number to the 24th most significant digit. The result will be the closest floating point.
For example, 23.7 represented in binary is
10111.1011001100110011001100110011...
After rounding you'll get
10111.1011001100110011010
Which in decimal is
23.700000762939453125
After rounding to the sixth decimal place, you'll have
23.700001
which is exactly the output of your printf.
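The same result can be reproduced outside C by forcing a value through single precision. The sketch below uses Python's `struct` module for that; `to_float32` is a hypothetical helper, and the value is rounded to a double first and then to a single, which matches direct rounding for these inputs but is an assumption in general.

```python
import struct
from decimal import Decimal

def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single and widen it back."""
    return struct.unpack("f", struct.pack("f", x))[0]

for text in ("43.2", "23.7", "58.89", "0.7"):
    stored = to_float32(float(text))
    # Decimal(stored) is the exact stored value; "%f" rounds it to 6 places.
    print(text, Decimal(stored), "%f" % stored)
```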
What Every Computer Scientist Should Know About Floating-Point Arithmetic
You may be interested in other people's questions about this on SO too; please take a look:
https://stackoverflow.com/search?q=floating+point
A 32-bit float (as in this case) is represented as 1 bit of sign, 8 bits of exponent and 23 bits of the fractional part of the mantissa.
First, forget the sign of what you put in. Then the rest of what you put in will be stored as a fraction of the format
(1 + x/8,388,608) * 2^(y-127) (note that 8,388,608 is 2^23), where x is the fractional mantissa and y is the exponent. Believe it or not, there is only one representation in this form for every value you put in. The value stored will be the closest value to the number you want; if your value cannot be represented exactly, it means you'll pick up an extra .0001 or whatever.
So, if you want to figure out the value that will actually be stored, just figure out what it will turn into.
So second thing to do (after throwing out the sign) is to find the largest power of 2 that is smaller in magnitude than the number you are representing. So let's take 43.2.
The largest power of two smaller than that is 32. So that's the "1" on the left; since it's a 32, not a 1, the 2^ value on the right must be 2^5 (32), which means y is 132. So now subtract off the 32, it's done for. What's left is 11.2. Now we need to represent 11.2 as a fraction over 8,388,608 times 2^5.
So
11.2 approximately equals x*32/8,388,608, or x/262,144. The value you get for x is 2,936,013, so the stored remainder is 2,936,013/262,144. The real numerator was 0.2 lower (2,936,012.8), so there will be an error of 0.2 in 262,144, or 1 in 1,310,720. In decimal, this value is 0.000000762939453. So if you print enough digits, you'll see this error value show up in your output.
When you see the output be too low, it's because the rounding went the other way, the value produced was nearer to the wanted value by being too low, and so you get an output that is too low. When the value can be represented exactly (like for example any power of 2), you never get an error.
It's not simple, but there you go. I'm sure you can code this up.
*note: for very small in magnitude values (roughly less than 2^-127) you get into weirdness called denormals. I'm not going to explain them, but they won't fit the pattern. Luckily they don't show up much. And once you get into that range, your accuracy goes to pot anyway.
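If you would rather read x and y off the stored bits than derive them by hand, the sketch below does that in Python. `float32_fields` is a hypothetical helper; it assumes IEEE 754 single precision and only handles normal numbers, so the denormals mentioned above are out of scope.

```python
import struct

def float32_fields(value):
    """Return (sign, y, x) for value rounded to single precision.

    Normal numbers only: the value is (1 + x / 2**23) * 2**(y - 127).
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    y = (bits >> 23) & 0xFF        # biased exponent
    x = bits & ((1 << 23) - 1)     # fractional mantissa
    return sign, y, x

sign, y, x = float32_fields(43.2)
stored = (1 + x / 2 ** 23) * 2 ** (y - 127)   # rebuild the stored value
print(y, x, stored)
print("%f" % stored)
```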
You can control the number of decimal places that are printed by including this in the format specifier.
So instead of having
float f = 43.2f;
printf("f1 = %f\n",f);
Have this
float f = 43.2f;
printf("f1 = %.2f\n",f);
to print two digits after the decimal point.
Do note that floating point numbers are not precisely represented in memory.
The compiler and CPU use IEEE 754 to represent floating point values in memory. Most rational numbers cannot be expressed exactly in this format, so the compiler chooses the closest approximate representation.
To avoid unpredictable output, you should round to the appropriate precision.
// outputs "0.70"
printf("%.2f\n", 0.7f);
A floating point number or a double precision floating point number is stored as an integer numerator, and a power of 2 as denominator. The math behind it is pretty simple. It involves shifting and bit testing.
So when you declare a constant in base 10, the compiler converts it to a binary integer in 23 bits and an exponent in 8 (or 52 bit integer and 11 bit exponent).
To print it back out, it converts this fraction back into base 10.
Gross simplification: the rule is that "floats are good for 2 or 3 decimal places, doubles for 4 or 5". That is to say, the first 2 or 3 decimal places printed will be exactly what you put in. After that, you have to work out the encoding to see what you're going to get.
This is only a rule of thumb, and as it happens your test case shows one instance where the float representation is only good to 1 d.p.
The way to figure out what will be printed is to simulate exactly what the compiler / libraries / hardware will do:
Convert the number to binary, and round to 24 significant (binary) digits.
Convert that number to decimal, and round to 6 (decimal) digits after the decimal point.
Of course, this is exactly what your program does already, so what are you asking for?
Edit to illustrate, I'll work through one of your examples:
Begin by converting 23.7 to binary:
10111.1011001100110011001100110011001100110011001100110011...
Round that number to 24 significant binary digits:
10111.1011001100110011010
Note that it rounded up. Converting back to decimal gives:
23.700000762939453125
Now, round this value to 6 digits after the decimal point:
23.700001
Which is exactly what you observed.
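The two rounding steps can also be carried out with exact rational arithmetic, which makes the procedure explicit instead of delegating it to the hardware. This is a sketch under stated assumptions: `round_to_bits` is a hypothetical helper, it handles positive values only, and it ignores exponent limits and denormals.

```python
from fractions import Fraction

def round_to_bits(value, bits=24):
    """Round a positive Fraction to `bits` significant binary digits (ties to even)."""
    # Find e such that 2**e <= value < 2**(e + 1).
    e = 0
    while Fraction(2) ** (e + 1) <= value:
        e += 1
    while Fraction(2) ** e > value:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)   # weight of the last kept bit
    # round() on a Fraction rounds half to even, like the IEEE default mode.
    return round(value / ulp) * ulp

stored = round_to_bits(Fraction("23.7"))      # step 1: 24 significant bits
scaled = round(stored * 10 ** 6)              # step 2: six digits after the point
print(stored)
print(f"{scaled // 10**6}.{scaled % 10**6:06d}")
```

Starting from `Fraction("23.7")` rather than the float `23.7` keeps the input exact, so the only rounding is the one the helper performs.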