Given double d = 9.2, why is int a = d*10 equal to 92 and not 91? - ieee-754

The binary representation of d=9.2 should be something like 9.199999999999999289457264239899814128875732421875.
So 10d should be 91.999999999xxxxxxx.
So if I write int a = 10*d, shouldn't it be truncated to 91 instead of 92?

When the multiplication is performed, the real-number result is rounded to fit in the IEEE-754 binary64 format, and the result after rounding is exactly 92.
When 9.2 is converted to the nearest value representable in binary64, the result is indeed 9.199999999999999289457264239899814128875732421875. Using hexadecimal for the significand, this is 1.2666666666666₁₆•2³.
When we multiply this by 10 using real-number arithmetic, the result is B.7FFFFFFFFFFFC₁₆•2³. (This is easy to see. Starting from the right, 6 times 10 is 60₁₀, which is 3C₁₆. So we write the C digit and carry the 3. In the next column, 6 times 10 is again 3C₁₆, and adding the carried 3 gives 3F₁₆. We write the F digit and carry the 3. Continuing produces more F digits and propagates a carry of 3 up to the front, where 2 times 10 plus a carry of 3 is 23₁₀ = 17₁₆, so we write the 7 and carry the 1. Finally 1 times 10 plus 1 is 11₁₀ = B₁₆.)
Let’s adjust that number to normalize it to have a single bit left of the radix point. Shifting the significand right three bits and compensating by adding three to the exponent gives 1.6FFFFFFFFFFFF8₁₆•2⁶. Now we can see this number has too many bits to fit in the binary64 format. The initial 1 has one bit, the next 13 digits have four bits, and the final 8 has one significant bit (1000₂, and the trailing zeros are mere placeholders, not significant bits). That is 54 bits, but the binary64 format has only 53 for the significand (52 are stored explicitly in the significand field, one is encoded via the exponent field). So we must round it to fit in the format.
The two nearest representable values are 1.6FFFFFFFFFFFF₁₆•2⁶ and 1.7000000000000₁₆•2⁶. They are equally distant from 1.6FFFFFFFFFFFF8₁₆•2⁶, and the rule for breaking ties is to choose the number with the even low bit. So the result is 1.7000000000000₁₆•2⁶.
Thus, multiplying 9.199999999999999289457264239899814128875732421875 by 10 in binary64 arithmetic yields 1.7₁₆•2⁶, which is 92.
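For anyone who wants to verify this on their own machine, here is a small check (assuming an IEEE-754 binary64 double, round-to-nearest, and no extended intermediate precision); %a prints the hexadecimal significand discussed above, and the last line reproduces the exact significand product computed by hand:

#include <cstdio>
int main() {
    double d = 9.2;
    printf("%a\n", d);        // 0x1.2666666666666p+3, the rounded value of 9.2
    printf("%a\n", d * 10);   // 0x1.7p+6, which is exactly 92
    int a = d * 10;
    printf("%d\n", a);        // 92
    // The exact product of the integer significands, as worked out above:
    printf("%llX\n", 0x12666666666666ULL * 10);   // B7FFFFFFFFFFFC
}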

Related

converting a number to a binary code using the IEEE 754 standard 24bits

I have a question about converting a number with the IEEE 754 standard.
But I have 24 bits (1 bit is allocated to the sign, 7 bits to the exponent, and the remaining 16 bits to the mantissa).
Unfortunately I don't know how I can do that.
Maybe someone could help?
If you are following the ordinary IEEE-754 scheme for binary formats:
Set the sign: Starting with a number x, if x is non-negative, set the sign bit to 0. If x is negative, or you wish to encode a negative zero, set the sign bit to 1. Set t to |x| (the absolute value of x).
Determine the exponent: Set e, an exponent counter, to 0. While t is less than 1 and e is greater than −62, multiply t by 2 and add −1 to e. While t is 2 or greater and e is less than 63, multiply t by ½ and add +1 to e. (At most one of those loops will execute, depending on the starting value of t.) (The limits on the represented exponent are −62 and +63, inclusive, so the loops stop if e reaches either one.)
Round the significand: Multiply t by 2¹⁶. Round it to an integer using the desired method (most often round-to-nearest, ties-to-even-low-bit). If t is now 2¹⁷ or greater, multiply it by ½ and add +1 to e.
If t is less than 2¹⁶, x is in the subnormal (or zero) range. Set the exponent field to all zeros. Write t in binary, with leading 0 bits to make 16 bits total. Set the primary significand field to those bits. Stop.
If e is greater than 63, x does not fit in the finite range. Set the exponent field to all ones and the primary significand field to all zeros. This represents an infinity, with the sign bit indicating +∞ or −∞. Stop.
The exponent is encoded by putting e+63 in the exponent field, so write e+63 in binary, using leading 0 bits as necessary to make 7 bits, and put them into the exponent field. Write t in binary. It will be 17 bits. Put the low 16 bits in the primary significand field. Stop.
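Here is a minimal sketch of those steps in C++, under stated assumptions: the sign goes in bit 23, the exponent in bits 22–16, the significand in bits 15–0, rounding is round-to-nearest, ties-to-even, and NaNs are not handled; the name encode24 is just for illustration.

#include <cstdio>
#include <cstdint>
#include <cmath>

// Pack x into the hypothetical 1/7/16-bit format described above (exponent bias 63).
uint32_t encode24(double x) {
    uint32_t sign = std::signbit(x) ? 1u : 0u;       // step 1: the sign bit
    double t = std::fabs(x);
    int e = 0;                                       // step 2: determine the exponent
    while (t < 1.0 && e > -62) { t *= 2.0; --e; }
    while (t >= 2.0 && e < 63) { t *= 0.5; ++e; }
    t = std::nearbyint(t * 65536.0);                 // step 3: round the significand (t * 2^16)
    if (t >= 131072.0) { t *= 0.5; ++e; }            // rounding carried t up to 2^17
    if (t < 65536.0)                                 // subnormal or zero: exponent field is all zeros
        return (sign << 23) | (uint32_t)t;
    if (e > 63)                                      // overflow: infinity
        return (sign << 23) | (0x7Fu << 16);
    return (sign << 23) | ((uint32_t)(e + 63) << 16) | ((uint32_t)t & 0xFFFFu);
}

int main() {
    printf("%06X\n", (unsigned)encode24(1.0));   // 3F0000: exponent field 63, significand 0
    printf("%06X\n", (unsigned)encode24(-2.5));  // C04000: -1.01 (binary) * 2^1
}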

Why does the bit-width of the mantissa of a floating point represent twice as many numbers compared to an int?

I am told that a double in C++ has a mantissa that is capable of safely and accurately representing [-(2^53 − 1), 2^53 − 1] integers.
How is this possible when the mantissa is only 52 bits? Why is it that an int16 is only capable of having a range of [-32,768, +32,767], or [-2^15, 2^15 − 1], when the same thing could be used for int16 to allow twice as many representable numbers?
The format of the double (64 bits) is as follows:
1 bit: sign
11 bits: exponent
52 bits: mantissa
We only need to look at the positive integers we can represent with the mantissa, since the sign bit takes care of the negative integers for us.
Naively, with 52 bits, we can store unsigned integers from 0 to 2^52 - 1. With a sign bit, this lets us store from -(2^52 - 1) to 2^52 - 1.
However, we have a little trick we can use. We say that the first digit of our integer is always a 1, which gives us an extra bit to work with.
To see why this works, let's dig a little deeper.
Every positive integer will have at least one 1 in its binary representation. So, we shift the mantissa left or right until we get a 1 at the start, using the exponent. An example might help here:
9, represented as an unsigned int: 000...0001001 (dots representing more 0s).
Written another way: 1.001 * 2^3. (1.001 being in binary, not decimal.)
And, we'll agree to never use a 0 as the first bit. So even though we could write 9 as 0.1001 * 2^4 or 0.01001 * 2^5, we won't. We'll agree that when we write the numbers out in this format, we'll always make sure we use the exponent to "shift" bits over until we start with a 1.
So, the information we need to store to get 9 is as follows:
e: 3
i: 1.001
But if i always starts with 1, why bother writing it out every time? Let's just keep the following instead:
e: 3
i: 001
Using precisely this information, we can reconstruct the number as: 1.i * 2^e == 9.
When we get up to larger numbers, our "i" will be bigger, maybe up to 52 bits used, but we're actually storing 53 bits of information because of the leading 1 we always have.
Final Note: This is not quite what is stored in the exponent and mantissa of a double, I've simplified things to help explain, but hopefully this helps people understand where the missing bit is from. Also, this does not cover 0, which gets a special representation (the trick used above will not work for 0, since the usual representation of 0 doesn't have any 1s in it).
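As a quick check of that 53-bit limit (assuming the usual 64-bit IEEE-754 double), 2^53 is the first integer whose neighbour can no longer be told apart:

#include <cstdio>
#include <cstdint>
int main() {
    int64_t n = INT64_C(1) << 53;         // 9007199254740992 == 2^53
    printf("%.0f\n", (double)(n - 1));    // 9007199254740991, exactly representable
    printf("%.0f\n", (double)n);          // 9007199254740992, exactly representable
    printf("%.0f\n", (double)(n + 1));    // 9007199254740992 again: 2^53 + 1 rounds back down
}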

std::numeric_limits::digits10<float> and precision after the dot

When std::numeric_limits<float>::digits10 returns 7, does it mean that I have 7 significant figures after the dot, or 7 counting the digits to the left of the dot as well?
For instance is it like:
1.123456
12.12345
or is it like
12.1234567
From cppreference
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow. For base-radix types, it is the value of digits (digits - 1 for floating-point types) multiplied by log10(radix) and rounded down.
And later
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
That means, e.g.
cout << std::numeric_limits<float>::digits10; // 6
cout << std::numeric_limits<float>::digits; // 24
The second one is the number of binary digits in the mantissa, while the first one is the number of decimal digits that can safely be represented across the aforementioned conversions.
TL;DR: it's your first case.
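As a small sanity check (assuming a 32-bit IEEE-754 float), the cppreference example can be reproduced directly; the value 8.589973e9 does not survive a round trip at 7 significant digits:

#include <iostream>
#include <iomanip>
#include <limits>
int main() {
    std::cout << std::numeric_limits<float>::digits10 << "\n";  // 6
    std::cout << std::numeric_limits<float>::digits << "\n";    // 24
    float f = 8.589973e9f;                           // nearest float is 8589973504
    std::cout << std::setprecision(7) << f << "\n";  // prints 8.589974e+09
}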

Double precision - decimal places

From what I have read, a value of data type double has an approximate precision of 15 decimal places. However, when I use a number whose decimal representation repeats, such as 1.0/7.0, I find that the variable holds the value of 0.14285714285714285 - which is 17 places (via the debugger).
I would like to know why it is represented internally with 17 places, and why the precision is always described as ~15.
An IEEE double has 53 significant bits (that's the value of DBL_MANT_DIG in <cfloat>). That's approximately 15.95 decimal digits (log10(2^53)); the implementation sets DBL_DIG to 15, not 16, because it has to round down. So you have nearly an extra decimal digit of precision (beyond what's implied by DBL_DIG==15) because of that.
The nextafter() function computes the nearest representable number to a given number; it can be used to show just how precise a given number is.
This program:
#include <cstdio>
#include <cfloat>
#include <cmath>
int main() {
    double x = 1.0/7.0;
    printf("FLT_RADIX = %d\n", FLT_RADIX);
    printf("DBL_DIG = %d\n", DBL_DIG);
    printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);
    printf("%.17g\n%.17g\n%.17g\n", nextafter(x, 0.0), x, nextafter(x, 1.0));
}
gives me this output on my system:
FLT_RADIX = 2
DBL_DIG = 15
DBL_MANT_DIG = 53
0.14285714285714282
0.14285714285714285
0.14285714285714288
(You can replace %.17g by, say, %.64g to see more digits, none of which are significant.)
As you can see, the last displayed decimal digit changes by 3 with each consecutive value. The fact that the last displayed digit of 1.0/7.0 (5) happens to match the mathematical value is largely coincidental; it was a lucky guess. And the correct rounded digit is 6, not 5. Replacing 1.0/7.0 by 1.0/3.0 gives this output:
FLT_RADIX = 2
DBL_DIG = 15
DBL_MANT_DIG = 53
0.33333333333333326
0.33333333333333331
0.33333333333333337
which shows about 16 decimal digits of precision, as you'd expect.
It is actually 53 binary places, which translates to 15 stable decimal places, meaning that if you start out with a number with 15 decimal places, convert it to a double, and then round the double back to 15 decimal places, you'll get the same number. To uniquely represent a double you need 17 decimal places (meaning that every double has a 17-decimal-digit representation that no other double shares), which is why 17 places are showing up, but not all 17-decimal numbers map to different double values (as in the examples in the other answers).
Decimal representation of floating point numbers is kind of strange. If you have a number with 15 decimal places and convert it to a double, then print it out with exactly 15 decimal places, you should get the same number. On the other hand, if you print out an arbitrary double with 15 decimal places and then convert it back to a double, you won't necessarily get the same value back; you need 17 decimal places for that. And neither 15 nor 17 decimal places are enough to accurately display the exact decimal equivalent of an arbitrary double. In general, you need over 100 decimal places to do that precisely.
See the Wikipedia page for double-precision and this article on floating-point precision.
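Here is a short sketch of the round-trip claims above, assuming an IEEE-754 binary64 double: 17 significant digits always survive a trip through text and back, while 15 printed digits do not always get you back to the same double.

#include <cstdio>
#include <cstdlib>
int main() {
    double x = 1.0 / 3.0;
    char buf[64];

    snprintf(buf, sizeof buf, "%.17g", x);    // 17 significant digits
    printf("%s -> same double: %d\n", buf, x == strtod(buf, nullptr));   // 1

    snprintf(buf, sizeof buf, "%.15g", x);    // 15 significant digits
    printf("%s -> same double: %d\n", buf, x == strtod(buf, nullptr));   // 0
}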
A double holds 53 binary digits accurately, which is ~15.9545898 decimal digits. The debugger can show as many digits as it pleases in order to be more faithful to the binary value. Some values also take fewer digits in decimal than in binary: 0.1, for example, takes 1 digit in base 10 but infinitely many in base 2.
This is odd, so I'll show an extreme example. If we make a super simple floating point value that holds only 3 binary digits of accuracy, and no exponent or sign (so the range is 0 to 0.875), our options are:
binary - decimal
000 - 0.000
001 - 0.125
010 - 0.250
011 - 0.375
100 - 0.500
101 - 0.625
110 - 0.750
111 - 0.875
But if you do the math, this format is only accurate to 0.903089987 decimal digits. Not even 1 digit is accurate: there is no value that begins with 0.4 or 0.9, for example. Yet to display the values exactly, we need 3 decimal digits.
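Those digit counts fall straight out of log10 of the bit count; a quick check, assuming nothing beyond the standard library:

#include <cstdio>
#include <cmath>
int main() {
    printf("%.9f\n", 3 * std::log10(2.0));    // 0.903089987, the toy 3-bit format above
    printf("%.7f\n", 53 * std::log10(2.0));   // 15.9545898, a double's 53-bit significand
}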
tl;dr: The debugger shows you the value of the floating point variable to some arbitrary precision (19 digits in your case), which doesn't necessarily correlate with the accuracy of the floating point format (17 digits in your case).
IEEE 754 floating point is done in binary. There's no exact conversion from a given number of bits to a given number of decimal digits. 3 bits can hold values from 0 to 7, and 4 bits can hold values from 0 to 15. A value from 0 to 9 takes roughly 3.5 bits, but that's not exact either.
An IEEE 754 double precision number occupies 64 bits. Of this, 52 bits are dedicated to the significand (the rest is a sign bit and exponent). Since the significand is (usually) normalized, there's an implied 53rd bit.
Now, given 53 bits and roughly 3.5 bits per digit, simple division gives us 15.1429 digits of precision. But remember, that 3.5 bits per decimal digit is only an approximation, not a perfectly accurate answer.
Many (most?) debuggers actually look at the contents of the entire register. On an x86, that's actually an 80-bit number. The x86 floating point unit will normally be adjusted to carry out calculations to 64-bit precision -- but internally, it actually uses a couple of "guard bits", which basically means internally it does the calculation with a few extra bits of precision so it can round the last one correctly. When the debugger looks at the whole register, it'll usually find at least one extra digit that's reasonably accurate -- though since that digit won't have any guard bits, it may not be rounded correctly.
It is because it's being converted from a binary representation. Just because it has printed all those decimal digits doesn't mean it can represent all decimal values to that precision. Take, for example, this in Python:
>>> 0.14285714285714285
0.14285714285714285
>>> 0.14285714285714286
0.14285714285714285
Notice how I changed the last digit, but it printed out the same number anyway.
In most contexts where double values are used, calculations will have a certain amount of uncertainty. The difference between 1.33333333333333300 and 1.33333333333333399 may be less than the amount of uncertainty that exists in the calculations. Displaying the value of "2/3 + 2/3" as "1.33333333333333" is apt to be more meaningful than displaying it as "1.33333333333333319", since the latter display implies a level of precision that doesn't really exist.
In the debugger, however, it is important to uniquely indicate the value held by a variable, including essentially-meaningless bits of precision. It would be very confusing if a debugger displayed two variables as holding the value "1.333333333333333" when one of them actually held 1.33333333333333319 and the other held 1.33333333333333294 (meaning that, while they looked the same, they weren't equal). The extra precision shown by the debugger isn't apt to represent a numerically-correct calculation result, but indicates how the code will interpret the values held by the variables.