Computing a real number X from its 32-bit IEEE-754 single-precision representation

I'm not sure if what I've done is the best way of going about the problem:
0010 0010 0001 1110 1100 1110 0000 0000
I split it up:
Sign : 0 (positive)
Exponent: 0100 0100 (in base 2) -> 2^2 + 2^6 = 68 -> excess 127: 68 - 127 = -59 (base 10)
Mantissa: (1).001 1110 1100 1110 0000 0000 -> decimal digits needed: d10 = d2 × log 2 / log 10 = 24 × log 2 / log 10 = 7.22 ≈ 8 (teacher told us to always round up)
So the mantissa in base 10 is: 2^0 + 2^-3 + 2^-4 + 2^-5 + 2^-6 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + 2^-14 = 1.2406616 (base 10)
Therefore the real number is:
+1.2406616 * 2^(-59) = 2.1522048 * 10^-18
But is the 10^x representation good? How do I find the right number of sig figs? Would it be the same as the rule used above?

The representation is almost good. I'd say you need a total of 9 (you have 8) significant digits.
See Printf width specifier to maintain precision of floating-point value
The right number of significant digits depends on what "right" means.
If you want to print out to x significant decimal places, read the text back, and be sure you have the same float again, then for all IEEE-754 singles a total of 9 significant decimal digits is needed: 1 before and 8 after the '.' in scientific notation. You may get by with fewer digits for some numbers, but some numbers need as many as 9.
In C this is defined as FLT_DECIMAL_DIG.
Printing more than 9 does not hurt; the text just converts back to the same IEEE-754 single-precision number as if only 9 had been used.
OTOH, if you start with a textual decimal number of y significant digits, convert it to IEEE-754 single and then back to text, the largest y you can count on always surviving the round trip is 6.
In C this is defined as FLT_DIG.
So in the end, I'd say d10 = d2 × log 2 / log 10 is almost right. But since powers of 2 (IEEE-754 single) and powers of 10 (x.xxxxxxxx × 10^expo) do not line up (except at 1.0), the precision to use with text is FLT_DECIMAL_DIG:
"number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,
p × log10(b) if b is a power of 10
ceiling(1 + p × log10(b)) otherwise"
9 in the case of IEEE-754 single
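As a quick check of the arithmetic above, here is a minimal C sketch (the constant 0x221ECE00 is just the question's bit pattern written in hex, an assumption of binary32 float throughout) that decodes the fields and lets the library print the value to 9 significant digits:
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    /* The question's pattern 0010 0010 0001 1110 1100 1110 0000 0000, in hex */
    uint32_t bits = 0x221ECE00u;

    unsigned sign = bits >> 31;                   /* 0 */
    unsigned exponent = (bits >> 23) & 0xFFu;     /* 68, unbiased 68 - 127 = -59 */
    unsigned long fraction = bits & 0x7FFFFFu;    /* the 23 mantissa bits */
    printf("sign=%u exponent=%u (unbiased %d) fraction=%lu\n",
           sign, exponent, (int)exponent - 127, fraction);

    /* Reinterpret the same bits as a float; %.8e prints 9 significant
       digits (FLT_DECIMAL_DIG): 1 before the '.' and 8 after. */
    float f;
    memcpy(&f, &bits, sizeof f);
    printf("%.8e\n", f);                          /* 2.15220484e-18 */
    return 0;
}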

Related

C++ Floating Point Addition (from scratch): Negative results cannot be computed

I am implementing a floating point addition program from scratch, following the methodology listed out in this PDF: https://www.cs.colostate.edu/~cs270/.Fall20/resources/FloatingPointExample.pdf
The main issue I am having is that addition works when the result is positive (e.g. -10 + 12, 3 + 5.125), but not when the result is negative. This is because I do not understand how to implement the following step:
Step 5: Convert result from 2’s complement to signed magnitude
If the result is negative, convert the mantissa back to signed magnitude by inverting the bits and adding 1. The result is positive in this example, so nothing needs to be done.
How do I determine if the result is negative without using floating-point addition (I am not allowed to use any float or double adds)? Of course I could check whether the operands are negative and compare their magnitudes, but that would defeat the purpose of the assignment.
If given only the following:
Sign bit, exponent, and mantissa of X
Sign bit, exponent, and mantissa of Y
Mantissa and exponent of Z
How do I determine whether Z = X + Y is negative just with the above data and not using any floating point addition?
The key insight is that many floating-point formats keep the sign and mantissa separate, so the mantissa is an unsigned integer. The sign and mantissa can be trivially combined to create a signed integer. You can then use signed integer arithmetic to add or subtract the two mantissas of your floating-point numbers.
If you are following the PDF you posted, you should have converted the numbers to 2's complement at Step 3. After the addition in Step 4, you have the result in 2's complement. (Result of adding the shifted numbers)
To check if the result is negative, you need to check the leftmost bit (the sign bit) in the resulting bit pattern. In 2's complement, this bit is 1 for negative numbers, and 0 for nonnegative numbers.
sign = signBit;
if (signBit) {
    result = ~result + 1;
}
If you are using unsigned integers to hold the bit pattern, you could make them of a fixed size, so that you are able to find the sign bit using shifts later.
uint64_t result;
...
signBit = (result >> 63) & 1;
At step 5, you’ve already added the mantissas. To determine whether the result is positive or negative, just check the sign bit of that sum.
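A minimal C sketch of that idea, with made-up magnitudes standing in for the already-aligned mantissas (the helper name is hypothetical, not from the assignment):
#include <stdint.h>
#include <stdio.h>

/* Combine a sign bit and an unsigned magnitude into two's complement. */
static int64_t to_twos_complement(int sign_bit, uint64_t magnitude) {
    return sign_bit ? -(int64_t)magnitude : (int64_t)magnitude;
}

int main(void) {
    int64_t x = to_twos_complement(0, 5);   /* +5 */
    int64_t y = to_twos_complement(1, 12);  /* -12 */

    int64_t z = x + y;                      /* plain integer add */
    int z_negative = (uint64_t)z >> 63;     /* top bit = sign of the sum */
    uint64_t z_magnitude = z_negative ? (uint64_t)-z : (uint64_t)z;

    printf("sign=%d magnitude=%llu\n", z_negative,
           (unsigned long long)z_magnitude); /* sign=1 magnitude=7 */
    return 0;
}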
The only difference between grade-school math and what we do with floating point is that we have two's complement (base 2 vs. base 10 is not really relevant, it just makes life easier). So if you made it through grade school you know how all of this works.
In decimal in grade school you align the decimal points and then do the math. With floating point we shift the smaller number, discarding its mantissa (sorry, fraction) bits, to line it up with the larger number.
In grade school, if doing subtraction, you subtract the smaller number from the larger number once you resolve the identities
a - (-b) = a + b
-a + b = b - a
and so on so that you either have
n - m
or
n + m
And then you do the math. Apply the sign based on what you had to do to get a-b or a+b.
The beauty of two's complement is that a negation is invert-and-add-one, which feeds nicely into logic.
a - b = a + (-b) = a + (~b) + 1
so you do not re-arrange the operands, but you might have to negate the second one.
Also, you do not have to remember the sign of the result: the result tells you its sign.
So align the points, and put it in the form
a + b
a + (-b)
where a can be positive or negative, but b's sign and the operation determine whether you negate b.
Do the addition.
If the result is negative, negate the result into a positive.
Normalize.
IEEE is only involved in the desire to have the 1.fraction be positive; other floating-point formats allow a negative whole.fraction and do not negate, they simply normalize. The rest of it is just grade-school math (plus two's complement).
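Here is a minimal C sketch of that recipe, using plain ints to stand in for mantissas already aligned to a common exponent (the 2 - 4 example below does the same steps by hand):
#include <stdio.h>

int main(void) {
    int a = 2;                /* 000010 */
    int b = 4;                /* 000100 */

    int sum = a + (~b + 1);   /* a - b = a + (~b) + 1 */
    int negative = sum < 0;   /* the result tells you its sign */
    if (negative)
        sum = ~sum + 1;       /* negate into a positive magnitude */

    printf("%c%d, then normalize\n", negative ? '-' : '+', sum); /* -2 */
    return 0;
}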
Some examples
2 + 4
in binary the numbers are
+10
+100
which converted to a normalized form are
+1.0 * 2^1
+1.00 * 2^2
need same exponent (align the point)
+0.10 * 2^2
+1.00 * 2^2
both are positive so no change just do the addition
this is the base form. I put more sign extension out front than needed to make the sign of the result much easier to see.
0
000010
+000100
=======
fill it in
000000
000010
+000100
========
000110
result is positive (msbit of result is zero) so normalize
+1.10 * 2^2
4+5
100
101
+1.00 2^2
+1.01 2^2
same exponent
both positive
0
000100
+000101
=======
001000
000100
+000101
=======
001001
result is positive so normalize
+1.001 * 2^3
4 - 2
100
10
+1.00 * 2^2
+1.0 * 2^1
need the same exponent
+1.00 * 2^2
+0.10 * 2^2
subtract a - b = a + (-b)
1 <--- add one
00100
+11101 <--- invert
=======
fill it in
11011
00100
+11101
=======
00010
result is positive so normalize
+1.0 * 2^1
2 - 4
10
100
+1.0 * 2^1
+1.00 * 2^2
make same exponent
+0.10 * 2^2
+1.00 * 2^2
do the math
a - b = a + (-b)
1
000010
+111011
========
fill it in
000111
000010
+111011
========
111110
result is negative so negate (0 - n)
000011 <--- add one
000000
+000001 <--- invert
=========
000010
normalize
-1.0 * 2^1

What's the reason why "text-float-text" guarantees 6 digits but "float-text-float" does 9?

I'm reading this, but I really can't get why text-float-text guarantees 6 digits while float-text-float guarantees 9 (considering single precision).
Converting text-float-text stores the correct precision into the float; the "rounded" version only appears when printing. So isn't it the printer's fault?
Code:
#include <bitset>
#include <cstdint>
#include <cstring>
#include <cstdlib>
#include <iostream>

// Read the bits of a float without an undefined pointer cast.
static std::uint32_t bits_of(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

int main()
{
    float decimalFloat = 8.589973e9f;
    char const *decimalString = "8.589973e9";
    float const floatFromDecimalString = std::strtof(decimalString, nullptr);
    std::cout << decimalString << std::endl
              << std::scientific << floatFromDecimalString << std::endl;
    std::cout << "text-float-text: 6 digits preserved, not 7" << std::endl << std::endl;
    std::cout << "but the value is correctly converted..." << std::endl;
    std::cout << std::bitset<32>(bits_of(decimalFloat)) << std::endl;
    std::cout << std::bitset<32>(bits_of(floatFromDecimalString)) << std::endl;
}
The binary is preserved. It is the same whether the float is declared directly or converted from the same decimal stored as a string:
01010000000000000000000000100110
Why do we need digits10? The number of preserved digits is max_digits10. If printing rounds "badly", well... that seems to be the printer's problem.
One should know that a float actually carries max_digits10 significant digits, not digits10 (even if only digits10 of them survive printing).
Sometimes some code showing counter-examples helps.
This is C code, yet the float characteristics are the same in C++ and C.
What's the reason why "text-float-text" guarantees 6 digits?
7 decimal digits do not round-trip. Consider text like "9999999e3". The value converts to a float, yet with an effective significand of only 24 binary digits, the next float is 1,024 away. As successive text values in this region are 1e3, or 1,000, apart, eventually two nearby text values convert to the same float.
6 decimal digits always work, as the step between successive text values is then always larger than the step between adjacent float values.
#include <stdio.h>

void text_to_float_test(void) {
    unsigned long ten_million = 10 * 1000 * 1000;
    float f1, f2;
    for (unsigned long i = ten_million; i > 0; i--) {
        char s1[40];
        sprintf(s1, "%lue3", i);
        sscanf(s1, "%f", &f1);
        char s2[40];
        sprintf(s2, "%lue3", i - 1);
        sscanf(s2, "%f", &f2);
        if (f1 == f2) {
            printf("\"%s\" and \"%s\" both convert to %.*e\n", s1, s2, 7 - 1, f1);
            return;
        }
    }
    puts("Done");
}
Output
"9999979e3" and "9999978e3" both convert to 9.999978e+09
but “float-text-float” does 9?
Between each pair of adjacent powers of 2, there are typically 2^23 different floats. The FP value 1.000000954e+01 and the next float, 1.000001049e+01, convert to the same text when only 8 significant decimal digits are used.
Deeper: between 8 and 16 there are 2^23 different floats, linearly distributed owing to the binary encoding of FP numbers. 1/8 of them, or 1,048,576, lie between 10 and 11. Using only 10.xxxxxx allows for just 1,000,000 different texts. More decimal digits are needed.
#include <math.h>
#include <stdio.h>
#include <string.h>

int float_to_text(float x0, float x1, int significant_digits) {
    char s0[100];
    char sn[100];
    while (x0 <= x1) {
        sprintf(s0, "%.*e", significant_digits - 1, x0);
        float xn = nextafterf(x0, x0 * 2); // next higher float
        sprintf(sn, "%.*e", significant_digits - 1, xn);
        if (strcmp(s0, sn) == 0) {
            printf("%2d significant_digits: %.12e and the next float %.12e both are \"%s\"\n",
                    significant_digits, x0, xn, s0);
            fflush(stdout);
            return 1;
        }
        x0 = xn;
    }
    return 0;
}

void float_to_text_test(float x0) {
    int significant_digits = 5;
    while (float_to_text(x0, x0 * 2, significant_digits)) {
        significant_digits++;
    }
    printf("%2d significant digits needed %.*e to %.*e\n", //
            significant_digits, significant_digits, x0, significant_digits, x0 * 2);
}

int main(void) {
    float_to_text_test(8.0);
}
Output
5 significant_digits: 8.000000000000e+00 and the next float 8.000000953674e+00 both are "8.0000e+00"
6 significant_digits: 8.000000000000e+00 and the next float 8.000000953674e+00 both are "8.00000e+00"
7 significant_digits: 8.000009536743e+00 and the next float 8.000010490417e+00 both are "8.000010e+00"
8 significant_digits: 1.000000953674e+01 and the next float 1.000001049042e+01 both are "1.0000010e+01"
9 significant digits needed 8.000000000e+00 to 1.600000000e+01
Decimal→Binary→Decimal
Consider the seven-digit decimal floating-point values 9,999,979·10^3 (9,999,979,000) and 9,999,978·10^3 (9,999,978,000). When you convert these to binary floating-point with 24-bit significands, you get 1001 0101 0000 0010 1110 0100·2^10 (9,999,978,496) in both cases, because that is the closest binary floating-point value to each of the numbers. (The next lower and higher binary floating-point numbers are 1001 0101 0000 0010 1110 0011·2^10 (9,999,977,472) and 1001 0101 0000 0010 1110 0101·2^10 (9,999,979,520).)
Therefore, 24-bit significands cannot distinguish all decimal floating-point numbers with seven-digit significands. We can do at most six digits.
Binary→Decimal→Binary
Consider the two 24-bit-significand binary floating-point values 1111 1111 1111 1111 1111 1101·2^3 (134,217,704) and 1111 1111 1111 1111 1111 1100·2^3 (134,217,696). If you convert these to decimal floating-point with eight-digit significands, you get 13,421,770·10^1 in both cases. Then you cannot tell them apart. So you need at least nine decimal digits.
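Both examples are easy to check in a few lines of C (assuming float is binary32):
#include <stdio.h>

int main(void) {
    /* Decimal -> binary: seven significant decimal digits collide. */
    float d1 = 9999979e3f;
    float d2 = 9999978e3f;
    printf("%d\n", d1 == d2);         /* 1: both are 9,999,978,496 */

    /* Binary -> decimal: eight significant digits cannot separate
       adjacent floats. */
    float b1 = 134217704.0f;
    float b2 = 134217696.0f;
    printf("%.7e\n%.7e\n", b1, b2);   /* both print 1.3421770e+08 */
    return 0;
}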
You can think of this as some "chunking" that is forced by where the digit positions lie. At the top of a decimal number, we need a bit big enough to exceed 5 in the first digit. But the nearest power of two does not necessarily start with 5 in that position; it might start with 6, or 7, or 8, or 9, so there is some wastage in it. At the bottom, we need a bit lower than 1 in the last digit. But the nearest power of two does not necessarily start with 9 in the next lower position. It might start with 8 or 7 or 6 or even 5. So again, there is some wastage. To go from binary to decimal to binary, you need enough decimal digits to fit around the wastage, so you need extra decimal digits. To go from decimal to binary to decimal, you have to keep the decimal digits few enough that they plus the wastage fit inside the binary, so you need fewer decimal digits.
What's the reason why "text-float-text" guarantees 6 digits but "float-text-float" does 9?
From a binary FP point of view, a leading decimal digit of 1 to 9 carries a different amount of information: 1 to 3+ bits.
The places where absolute precision changes are different for float (0.125, 0.5, 2.0, 16.0, etc.) and for decimal text ("0.001", "0.1", "10.0", "10000.0", etc.).
It is the interplay of those two that causes the wobbling precision.
To see this, let us use the pigeon hole principle.
With text having n significant decimal digits, it has the form of
(-1)^sign × 1_to_9.(n-1 decimal digits) × 10^exponent
C++ typically encodes float as binary32. Most values are in the form:
(-1)^sign × 1.(23-bit fraction) × 2^(exponent - offset)
Text-float-text
Consider a "worst-case" condition where text contains lots of information - its most significant digit is closer to 9 than 1.
In the range [1.0e9 ... 10.0e9), and using 7 significant decimal digits, text values are spaced 1000 apart.
In the select range [2^33 ... 2^34), or [8.589934592e9 ... 17.179869184e9), there are 2^23 different floats linearly spaced 1,024 apart.
9.999872000e9 and
9.999744000e9 can be exactly encoded as float and as 7-digit decimal text. The difference is
0.000128000e9, or 128,000.
Between them are 127 different 7-digit decimal text values and 124 different floats. If code tries to encode all 127 of those text values to float and back to the same text, it will succeed only 124 times.
Example: "9.999851e9" and "9.999850e9" both convert to float 9.999850496000e+09
Instead, if text values carry only 6 significant decimal digits, the round trip always works.
float-text-float
Consider a "worst-case" condition where text contains little information - its most significant digit is closer to 1 than 9.
In the range [8.0 ... 16.0), there are 2^23, or 8,388,608, different floats, linearly spaced.
In the range [10.0 ... 11.0), there are 1/8 × 2^23, or 1,048,576, different float values.
In the range [10.000000 ... 11.000000), using 8 significant decimal digits, there are 1,000,000 different text values.
If code tries to encode all 1,048,576 of those float values to text with only 8 decimal digits and then back to the same float, it will succeed only 1,000,000 times.
9 decimal digits are needed.
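Counting the pigeonholes directly is cheap. A sketch assuming binary32 float; nextafterf steps one float at a time:
#include <math.h>
#include <stdio.h>

int main(void) {
    unsigned long count = 0;
    for (float x = 10.0f; x < 11.0f; x = nextafterf(x, 12.0f))
        count++;
    printf("%lu floats in [10.0, 11.0)\n", count); /* 1048576 = 2^20 */
    return 0;
}
1,048,576 distinct floats cannot all survive a trip through only 1,000,000 possible 8-digit texts.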

IEEE floating-point number to exact base10 character string

Will printf("%.9e", value) always print the exact base-10 representation of value if value is an IEEE single-precision floating-point number (C/C++ float)?
Will the same hold for printf("%.17e", value) if value is an IEEE double-precision floating-point number (C/C++ double)?
If not, how can I?
It appears that printf("%.17f", value) and printf("%.17g", value) will not.
Will printf("%.9e", value) always print the exact base-10 representation?
No. Consider 0.5, 0.25, 0.125, 0.0625 .... Each value is one-half the preceding and needs another decimal place for each decremented power of 2.
float, often binary32, can represent values as small as about pow(2,-126), and sub-normals smaller still. It would take 126+ decimal places to represent those exactly. Even counting only significant digits, the number is 89+. For example, FLT_MIN on one machine is exactly
0.000000000000000000000000000000000000011754943508222875079687365372222456778186655567720875215087517062784172594547271728515625
FLT_TRUE_MIN, the smallest non-zero sub-normal, takes 151 characters to write exactly:
0.00000000000000000000000000000000000000000000140129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125
By comparison, FLT_MAX only takes 39 digits.
340282346638528859811704183484516925440
Exact decimal representations of a float are rarely needed. Printing to FLT_DECIMAL_DIG (typically 9) significant digits is sufficient to display each float uniquely. Many systems do not print exact decimal representations beyond a few dozen significant digits.
The vast majority of systems I have used print float/double exactly to at least DBL_DIG (typically 15+) significant digits. Most do so to at least DBL_DECIMAL_DIG (typically 17+) significant digits.
Printf width specifier to maintain precision of floating-point value gets into these issues.
printf("%.*e", FLT_DECIMAL_DIG - 1, value) will print a float to enough decimal places to scan it back and get the same value (a round trip).
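A round-trip sketch of that (FLT_DECIMAL_DIG needs C11's <float.h>; on older compilers substitute 9):
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    float original = 0.1f;
    char buf[64];
    sprintf(buf, "%.*e", FLT_DECIMAL_DIG - 1, original); /* 9 significant digits */
    float back = strtof(buf, NULL);
    printf("\"%s\" round-trips: %s\n", buf, back == original ? "yes" : "no");
    return 0;
}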
The IEEE-754 format for a 32-bit floating point number is explained in this Wikipedia article.
The following table shows the bit weights for each bit, given that the exponent is 0, meaning 1.0 <= N < 2.0. The last number in the table is the largest number less than 2.0.
From the table, you can see that you need to print at least 23 digits after the decimal point to get the exact decimal number from a 32-bit floating point number.
3f800000 1.0000000000000000000000000 (1)
3fc00000 1.5000000000000000000000000 (1 + 2^-1)
3fa00000 1.2500000000000000000000000 (1 + 2^-2)
3f900000 1.1250000000000000000000000 (1 + 2^-3)
3f880000 1.0625000000000000000000000 (1 + 2^-4)
3f840000 1.0312500000000000000000000 (1 + 2^-5)
3f820000 1.0156250000000000000000000 (1 + 2^-6)
3f810000 1.0078125000000000000000000 (1 + 2^-7)
3f808000 1.0039062500000000000000000 (1 + 2^-8)
3f804000 1.0019531250000000000000000 (1 + 2^-9)
3f802000 1.0009765625000000000000000 (1 + 2^-10)
3f801000 1.0004882812500000000000000 (1 + 2^-11)
3f800800 1.0002441406250000000000000 (1 + 2^-12)
3f800400 1.0001220703125000000000000 (1 + 2^-13)
3f800200 1.0000610351562500000000000 (1 + 2^-14)
3f800100 1.0000305175781250000000000 (1 + 2^-15)
3f800080 1.0000152587890625000000000 (1 + 2^-16)
3f800040 1.0000076293945312500000000 (1 + 2^-17)
3f800020 1.0000038146972656250000000 (1 + 2^-18)
3f800010 1.0000019073486328125000000 (1 + 2^-19)
3f800008 1.0000009536743164062500000 (1 + 2^-20)
3f800004 1.0000004768371582031250000 (1 + 2^-21)
3f800002 1.0000002384185791015625000 (1 + 2^-22)
3f800001 1.0000001192092895507812500 (1 + 2^-23)
3fffffff 1.9999998807907104492187500
One thing to note about this is that there are only 2^23 (about 8 million) floating point values between 1 and 2. However, there are 10^23 numbers with 23 digits after the decimal point, so very few decimal numbers have exact floating point representations.
As a simple example, the number 1.1 does not have an exact representation. The two 32-bit float values closest to 1.1 are
3f8ccccc 1.0999999046325683593750000
3f8ccccd 1.1000000238418579101562500
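You can reproduce those expansions with printf, assuming a C library (such as glibc) that prints exact decimal expansions at high precision:
#include <stdio.h>

int main(void) {
    float lo = 1.0999999046325683593750000f; /* 0x3f8ccccc */
    float hi = 1.1000000238418579101562500f; /* 0x3f8ccccd */
    printf("%.25f\n%.25f\n", lo, hi);
    printf("%d\n", 1.1f == hi);  /* 1: 1.1f rounds to the higher value */
    return 0;
}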

How does floating-point arithmetic work when one is added to a big number?

If we run this code:
#include <iostream>
int main ()
{
using namespace std;
float a = 2.34E+22f;
float b = a+1.0f;
cout<<"a="<<a<<endl;
cout<<"b-a"<<b-a<<endl;
return 0;
}
Then the result will be 0, because a float has only about 6 significant decimal digits, while the 1.0 would have to land in the 23rd digit of the big number. So how does the program realize that there is no room for the 1? What is the algorithm?
Step by step:
IEEE-754 32-bit binary floating-point format:
sign        1 bit
exponent    8 bits
significand 23 bits (plus an implicit leading 1)
I) float a = 23400000000.f;
Convert 23400000000.f to float:
23,400,000,000 = 101 0111 0010 1011 1111 1010 1010 0000 0000₂
= 1.0101110010101111111010101000000000₂ • 2^34.
But the significand can store only 23 bits after the point. So we must round:
1.01011100101011111110101 01000000000₂ • 2^34
≈ 1.01011100101011111110101₂ • 2^34
So, after:
float a = 23400000000.f;
a is equal to 23,399,999,488.
II) float b = a + 1;
a = 10101110010101111111010100000000000₂.
b = 10101110010101111111010100000000001₂
= 1.0101110010101111111010100000000001₂ • 2^34.
But, again, the significand can store only 23 binary digits after the point. So we must round:
1.01011100101011111110101 00000000001₂ • 2^34
≈ 1.01011100101011111110101₂ • 2^34
So, after:
float b = a + 1;
b is equal to 23,399,999,488.
III) float c = b - a;
10101110010101111111010100000000000₂ - 10101110010101111111010100000000000₂ = 0
This value can be stored in a float without rounding.
So, after:
float c = b - a;
c is equal to 0.
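The three steps are easy to confirm in code (assuming float is IEEE-754 binary32):
#include <stdio.h>

int main(void) {
    float a = 23400000000.f;  /* rounds to 23,399,999,488 */
    float b = a + 1.0f;       /* the +1 is rounded away */
    float c = b - a;
    printf("a=%.1f\nb=%.1f\nc=%.1f\n", a, b, c);
    return 0;
}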
The basic principle is that the two numbers are aligned so that the decimal point is in the same place. I'm using a 10 digit number to make it a little easier to read:
a = 1.234E+10f;
b = a+1.0f;
When calculating a + 1.0f, the decimal points need to be lined up:
1.234E+10f becomes 1234000000.0
1.0f       becomes          1.0
                   +
                 = 1234000001.0
But since it's a float, the 1 on the right is outside the representable precision, so the number stored will be 1.234000E+10; any digits beyond that are lost, because there simply are not enough digits.
[Note that if you do this with an optimizing compiler, it may still show 1.0 as the difference, because the floating-point unit can use a 64- or 80-bit internal representation and the calculation may be done without storing the intermediate result in a variable (a decent compiler can certainly achieve that here). With 2.34E+22f the 1.0 is guaranteed to be lost: it is too small to register even in a 64-bit double, and in an 80-bit long double as well.]
When adding two FP numbers, they're first converted to the same exponent. In decimal:
2.34000E+22 + 1.00000E+00 = 2.34000E+22 + 0.00000E+22. In this step, the 1.0 is lost to rounding.
Binary floating point works pretty much the same, except that the power of 10 is replaced by a power of 2 (2.34×10^22 is about 1.24×2^74).
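The gap between adjacent floats near 2.34E+22 makes the point concrete; a small sketch using nextafterf (assumes binary32 float):
#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 2.34e22f;
    printf("gap to next float = %e\n", nextafterf(a, INFINITY) - a);
                                             /* 2^51, about 2.25e+15 */
    printf("a + 1.0f == a: %d\n", a + 1.0f == a); /* 1 */
    return 0;
}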

representation of double and radix point

According to what I know about double (IEEE standard), there is one bit for the sign, 52 bits for the mantissa, a base, and some bits for the exponent.
The formula to get the double is: (-1)^s × c × b^q
Maybe I made some mistake but the idea is here.
I'm just wondering how we can know where to put the radix point with this formula.
If i take number, I get for instance:
m = 3
q = 4
s = 2
b = 2
(-1)^2 * 4 * 2^3 = 32
but I don't know where to put the radix point...
What is wrong here?
EDIT:
Maybe q is always negative ?
I guess a look at the Wikipedia article would've helped.
The thing is, there is a "hidden" '1.' in the IEEE formula.
Every IEEE 754 number has to be normalized, which means the encoded number has the form:
(-1)^sign × 1.(mantissa) × 2^(exponent)
Therefore your fields decode as 1.(mantissa) × 2^q: the radix point sits immediately after the implicit leading 1, so you have encoded 1.(something), not 32.
32 = 1 × 2^5, so mantissa = 1, exponent = 5, sign = 0. We need to add the bias 1023 to the exponent when encoding it, so below we have 1023 + 5 = 1028. We also need to drop the leading 1 when encoding the mantissa, so that 1.(whatever) becomes (whatever).
Hexadecimal representation of 32 as 64-bit double is 4040000000000000, or binary:
0100 0000 0100 0000 0000 ... and zeros all the way down
^                             sign (0)
 ^^^ ^^^^ ^^^^                exponent (coded 1028, interpreted as 1028 - 1023 = 5)
               ^^^^ ...       mantissa (coded 0, interpreted as 1.0)
To verify the result, use an online IEEE-754 converter: enter 32 in the first field, and click either the Rounded or the Not Rounded button (it doesn't matter which).
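Or unpack the fields in code (a sketch assuming double is IEEE-754 binary64):
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    double d = 32.0;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);

    unsigned sign = (unsigned)(bits >> 63);        /* 0 */
    unsigned exponent = (bits >> 52) & 0x7FFu;     /* 1028 */
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFull; /* 0 */

    printf("%016llx: sign=%u exponent=%u (unbiased %d) fraction=%llu\n",
           (unsigned long long)bits, sign, exponent,
           (int)exponent - 1023, (unsigned long long)fraction);
    /* 4040000000000000: sign=0 exponent=1028 (unbiased 5) fraction=0 */
    return 0;
}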