Convert MBF Double to IEEE - bit-manipulation

I found the topic below about converting MBF to IEEE:
Convert MBF Single and Double to IEEE
Can anyone explain the purpose of the lines of code marked below?
Dim sign As Byte = mbf(6) And ToByte(&H80) 'What is the reason AND (&H80)?
Dim exp As Int16 = mbf(7) - 128S - 1S + 1023S 'Why is 1152 (128+1+1023)?
ieee(7) = ieee(7) Or sign 'Why don't just save sign to ieee(7)?
ieee(7) = ieee(7) Or ToByte(exp >> 4 And &HFF) 'What is the reason to shift 4?
Public Shared Function MTID(ByVal src() As Byte, ByVal startIndex As Integer) As Double
    Dim mbf(7) As Byte
    Dim ieee(7) As Byte
    Array.Copy(src, startIndex, mbf, 0, 8)
    If mbf(7) <> 0 Then
        Dim sign As Byte = mbf(6) And ToByte(&H80)
        Dim exp As Int16 = mbf(7) - 128S - 1S + 1023S
        ieee(7) = ieee(7) Or sign
        ieee(7) = ieee(7) Or ToByte(exp >> 4 And &HFF)
        ieee(6) = ieee(6) Or ToByte(exp << 4 And &HFF)
        For i As Integer = 6 To 1 Step -1
            mbf(i) <<= 1
            mbf(i) = mbf(i) Or mbf(i - 1) >> 7
        Next
        mbf(0) <<= 1
        For i As Integer = 6 To 1 Step -1
            ieee(i) = ieee(i) Or mbf(i) >> 4
            ieee(i - 1) = ieee(i - 1) Or mbf(i) << 4
        Next
        ieee(0) = ieee(0) Or mbf(0) >> 4
    End If
    Return BitConverter.ToDouble(ieee, 0)
End Function

The IEEE754 double format is made up of a 1-bit sign, 11-bit exponent and 52-bit mantissa:
   7        6        5        4        3        2        1        0
seeeeeee eeeemmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm
Due to the vagaries of endianness, that most significant byte on the left is actually ieee(7), the least significant on the right is ieee(0) - this is the same for mbf() below.
The exponent gives you a value of 0 through 2047 (2^11 - 1), some of which are used to represent special values like +/-inf (infinity) and nan (not a number).
The mantissa bits represent, from left to right, 1/2, 1/4, 1/8 and so on. In order to get the number, you calculate n = (-1)^s x 2^(e-bias) x 1.m
Microsoft double binary format is:
   7        6        5        4        3        2        1        0
eeeeeeee smmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm
The code you see is simply transferring (and slightly changing) the values from MBF to IEEE754 double precision format.
To answer your specific questions:
Dim sign As Byte = mbf(6) And ToByte(&H80)
What is the reason for 'And &H80'?
Hex 80 (&H80) is the binary pattern 1000 0000.
When you AND a value with that, you get &H80 if that bit was set or 0 otherwise.
This basically just records what the sign of the number was and you can simply transfer it as-is from mbf(6) to ieee(7).
Dim exp As Int16 = mbf(7) - 128S - 1S + 1023S
Why 1152 (128+1+1023)?
Exponents in IEEE754 are biased exponents. In other words, with an eight-bit exponent field, the values stored may be 0 through 255 but the actual values represented by those may be -128 through 127 (disregarding the special values for now).
This allows you to have negative exponents for very small values and positive exponents for large values.
MBF exponents are also biased, but they're biased by 128 for both single and double types, whereas IEEE754 double precision exponents have their 0-point at 1023.
The reason for the extra -1 is because of the differences between MBF and IEEE754 regarding where the implicit 1 goes. IEEE754 puts it before the binary point, MBF after. That means the exponent must be adjusted by one.
ieee(7) = ieee(7) Or sign
Why don't we just save sign to ieee(7)?
That's a slight mystery since ieee(7) hasn't been explicitly set at that point. I can only assume that ieee() has been initialised to zero upon creation, otherwise you may get into trouble since just about every transfer operation here is done with an OR.
You're right that it makes more sense to just use ieee(7) = sign. The actual ORing, to combine in the exponent bits, is on the next line.
ieee(7) = ieee(7) Or ToByte(exp >> 4 And &HFF)
What is the reason for shifting by 4?
Because the IEEE754 exponent crosses two bytes and you want only part of that exponent in the most significant one. Seven bits of the exponent go into the most significant byte, the other four go into the next byte.
This is handled by the two lines:
ieee(7) = ieee(7) Or ToByte(exp >> 4 And &HFF) ' upper 7 bits '
ieee(6) = ieee(6) Or ToByte(exp << 4 And &HFF) ' lower 4 bits '
Given the 16-bit value 00000abcdefghijk, the two are calculated:
>> 4 and &hff : 0abcdefg (s will go at the left)
<< 4 and &hff : hijk0000 (m will go at the right)
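
For reference, here is the same header handling as a small C sketch; the function name, the fixed 8-byte little-endian arrays, and the omission of the mantissa transfer are assumptions made to mirror the VB code above, not a full conversion:

#include <stdint.h>

/* Sketch: build the two most significant IEEE754 bytes from an MBF double.
   Assumes both arrays are little-endian, as in the VB code above. */
void mbf_header_to_ieee(const uint8_t mbf[8], uint8_t ieee[8])
{
    uint8_t sign = mbf[6] & 0x80;            /* isolate the MBF sign bit         */
    int16_t exp  = mbf[7] - 128 - 1 + 1023;  /* rebias: MBF 128 -> IEEE754 1023,
                                                minus 1 for the implicit-1 move  */
    ieee[7] = sign | (uint8_t)((exp >> 4) & 0xFF); /* sign + upper 7 exponent bits */
    ieee[6] = (uint8_t)((exp << 4) & 0xFF);        /* lower 4 exponent bits        */
    /* the mantissa transfer (shift left 1, then right 4) is omitted here */
}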

Related

C++ Floating Point Addition (from scratch): Negative results cannot be computed

I am implementing a floating point addition program from scratch, following the methodology listed out in this PDF: https://www.cs.colostate.edu/~cs270/.Fall20/resources/FloatingPointExample.pdf
The main issue I am having is that addition works when the result is positive (e.g. -10 + 12, 3 + 5.125), but it does not work when the result is negative. This is because I do not understand how to implement the following step:
Step 5: Convert result from 2’s complement to signed magnitude
If the result is negative, convert the mantissa back to signed magnitude by inverting the bits and adding 1. The result is
positive in this example, so nothing needs to be done.
How do I determine if the result is negative without using floating point addition (I am not allowed to use any float or double adds)? Of course I could look at the signs and magnitudes of the two input floats and compare them, but that would defeat the purpose of this assignment.
If given only the following:
Sign bit, exponent, and mantissa of X
Sign bit, exponent, and mantissa of Y
Mantissa and exponent of Z
How do I determine whether Z = X + Y is negative just with the above data and not using any floating point addition?
The key insight is that many floating-point formats keep the sign and mantissa separate, so the mantissa is an unsigned integer. The sign and mantissa can be trivially combined to create a signed integer. You can then use signed integer arithmetic to add or subtract the two mantissas of your floating-point numbers.
If you are following the PDF you posted, you should have converted the numbers to 2's complement at Step 3. After the addition in Step 4, you have the result in 2's complement. (Result of adding the shifted numbers)
To check if the result is negative, you need to check the leftmost bit (the sign bit) in the resulting bit pattern. In 2's complement, this bit is 1 for negative numbers, and 0 for nonnegative numbers.
sign = signBit;
if (signBit) {
    result = ~result + 1;
}
If you are using unsigned integers to hold the bit pattern, you could make them of a fixed size, so that you are able to find the sign bit using shifts later.
uint64_t result;
...
signBit = (result >> 63) & 1;
At step 5, you’ve already added the mantissas. To determine whether the result is positive or negative, just check the sign bit of that sum.
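
Putting those pieces together, a minimal sketch of steps 4 and 5; the 64-bit width and the names are assumptions, not the assignment's required interface:

#include <stdint.h>

/* Add two mantissas that are already in 2's complement (step 4), then
   convert the sum back to sign-magnitude (step 5). */
uint64_t add_mantissas(uint64_t x2c, uint64_t y2c, int *signBit)
{
    uint64_t result = x2c + y2c;    /* 2's complement addition             */
    *signBit = (result >> 63) & 1;  /* leftmost bit: 1 means negative      */
    if (*signBit)
        result = ~result + 1;       /* invert and add 1: back to magnitude */
    return result;
}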
The only difference between grade school math and what we do with floating point is that we have twos complement (base 2 vs base 10 is not really relevant, it just makes life easier). So if you made it through grade school you know how all of this works.
In decimal in grade school you align the decimal points and then do the math. With floating point we shift the smaller number, discarding its mantissa (sorry, fraction) bits, to line it up with the larger number.
In grade school if doing subtraction you subtract the smaller number from the larger number once you resolve the identities
a - (-b) = a + b
-a + b = b - a
and so on so that you either have
n - m
or
n + m
And then you do the math. Apply the sign based on what you had to do to get a-b or a+b.
The beauty of twos complement is that a negation or negative is invert and add one, which feeds nicely into logic.
a - b = a + (-b) = a + (~b) + 1
so you do not re-arrange the operands but you might have to negate the second one.
Also, you do not have to remember the sign of the result; the result tells you its sign.
So align the points
put it in the form
a + b
a + (-b)
where a can be positive or negative, but b's sign and the operation may require you to negate b.
Do the addition.
If the result is negative, negate the result into a positive
Normalize
IEEE is only involved in the desire to have the 1.fraction be positive; other floating point formats allow a negative whole.fraction and do not negate, they simply normalize. The rest of it is just grade school math (plus twos complement).
Some examples
2 + 4
in binary the numbers are
+10
+100
which converted to a normalized form are
+1.0 * 2^1
+1.00 * 2^2
need same exponent (align the point)
+0.10 * 2^2
+1.00 * 2^2
both are positive so no change just do the addition
this is the base form, I put more sign extension out front than needed
to make the sign of the result much easier to see.
0
000010
+000100
=======
fill it in
000000
000010
+000100
========
000110
result is positive (msbit of result is zero) so normalize
+1.10 * 2^2
4+5
100
101
+1.00 * 2^2
+1.01 * 2^2
same exponent
both positive
0
000100
+000101
=======
001000
000100
+000101
=======
001001
result is positive so normalize
+1.001 * 2^3
4 - 2
100
10
+1.00 * 2^2
+1.0 * 2^1
need the same exponent
+1.00 * 2^2
+0.10 * 2^2
subtract a - b = a + (-b)
1 <--- add one
00100
+11101 <--- invert
=======
fill it in
11011
00100
+11101
=======
00010
result is positive so normalize
+1.0 * 2^1
2 - 4
10
100
+1.0 * 2^1
+1.00 * 2^2
make same exponent
+0.10 * 2^2
+1.00 * 2^2
do the math
a - b = a + (-b)
1
000010
+111011
========
fill it in
000111
000010
+111011
========
111110
result is negative so negate (0 - n)
000011 <--- add one
000000
+000001 <--- invert
=========
000010
normalize
-1.0 * 2^1
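
That last example (2 - 4) can be replayed in C with the same narrow width; the 6-bit mask is only there to mimic the field used above:

#include <stdio.h>

int main(void)
{
    unsigned a = 2, b = 4, mask = 0x3F;          /* 6-bit field, as above      */
    unsigned sum = (a + (~b & mask) + 1) & mask; /* a - b = a + (~b) + 1       */
    if (sum & 0x20) {                            /* msbit set: negative result */
        sum = (~sum + 1) & mask;                 /* negate into a positive     */
        printf("-%u\n", sum);                    /* prints -2                  */
    } else {
        printf("%u\n", sum);
    }
    return 0;
}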

Why do negative numbers in variables have a larger limit than a positive number (signed variables)?

As seen in the picture above, all of the variables have a negative limit that is one more than the positive limit. I was wondering how it is able to add that extra one. I know that the first digit in the variable is used to tell if it is negative (1) or if it is not (0). I also know that binary is based on the powers of 2. What I am confused about is how there is one extra when the positive itself can't go higher and the negative only has one digit changing. For example, a short can go up to 32,767 (01111111 11111111), or 16,383 + all of the decimal values of the binary numbers below it. Negative numbers are the same thing except with a one at the beginning, right? So how do the negative numbers have a larger limit? Thanks to anyone who answers!
The reason is a scheme called "2's complement" to represent signed integer.
You know that the most significant bit of a signed integer represents the sign. But what you don't know is that it also represents a value, a negative value.
Take a 4-bit 2's complement signed integer as an example:
   1     0     1     0
-2^3   2^2   2^1   2^0
This 4-bit integer is interpreted as:
1 * -2^3 + 0 * 2^2 + 1 * 2^1 + 0 * 2^0
= -8 + 0 + 2 + 0
= -6
With this scheme, the max of 4-bit 2's complement is 7.
   0     1     1     1
-2^3   2^2   2^1   2^0
And the min is -8.
   1     0     0     0
-2^3   2^2   2^1   2^0
Also, 0 is represented by 0000, 1 is 0001, and -1 is 1111. Comparing these three numbers, we can observe that zero has its "sign bit" positive, and there is no "negative zero" in the 2's complement scheme. In other words, half of the range consists only of negative numbers, but the other half of the range includes zero and the positive numbers.
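
That weighting is easy to check in C; this sketch decodes the 4-bit pattern 1010 from the example above:

#include <stdio.h>

int main(void)
{
    unsigned bits = 0xA;                 /* the pattern 1010                  */
    /* the top bit carries weight -2^3; the others are ordinary powers of 2 */
    int value = -8 * ((bits >> 3) & 1)
              +  4 * ((bits >> 2) & 1)
              +  2 * ((bits >> 1) & 1)
              +  1 * ( bits       & 1);
    printf("%d\n", value);               /* prints -6 */
    return 0;
}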
If integers are stored using two's complement then you get one extra negative value and a single zero. If they are stored using one's complement or signed magnitude you get two zeros and the same number of negative values as positive ones. Floating point numbers have their own storage scheme, and under IEEE formats they use an explicit sign bit.
I know that the first digit in the variable is used to tell if it is negative (1) or if is not (0).
The first binary digit (or bit), yes, assuming two's complement representation. Which basically answers your question. There are 32,768 numbers < 0 (-32,768 .. -1), and 32,768 numbers >= 0 (0 .. +32,767).
Also note that in binary the total possible representations (bit patterns) are an even number. You couldn't have the min and max values equal in absolute values, since you'd end up with an odd number of possible values (counting 0). Thus, you'd have to waste or declare illegal at least one bit pattern.

Computing a real number X from 32-bit binary number IEEE-754 single precision representation

I'm not sure if what I've done is the best way of going about the problem:
0010 0010 0001 1110 1100 1110 0000 0000
I split it up:
Sign : 0 (positive)
Exponent: 0100 0100 (in base 2) -> 2^2 + 2^6 = 68 -> excess 127: 68 - 127 = -59 (base 10)
Mantissa: (1).001 1110 1100 1110 0000 0000 -> decimal digits needed: d_10 = d_2 * log(2)/log(10) = 24 * log(2)/log(10) = 7.22 ~ 8 (teacher told us to round up always)
So the mantissa in base 10 is: 2^0 + 2^-3 + 2^-4 + 2^-5 + 2^-6 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + 2^-14 = 1.2406616 (base 10)
Therefore the real number is:
+1.2406616 * 2^(-59) = 2.1522048 * 10^-18
But is the 10^x representation good? How do I find the right number of sig figs? Would it be the same as the rule used above?
The representation is almost good. I'd say you need a total of 9 (you have 8) significant digits.
See Printf width specifier to maintain precision of floating-point value
The right number of significant digits depends on what right means.
If you want to print out to x significant decimal places, and read it back and be sure you have the same number x again, then for all IEEE-754 singles a total of 9 decimal digits is needed: 1 before and 8 after the '.' in scientific notation. You may get by with fewer digits for some numbers, but some numbers need as many as 9.
In C this is defined as FLT_DECIMAL_DIG.
Printing more than 9 does not hurt; it just does not convert back to a different IEEE-754 single precision number than if only 9 had been used.
OTOH, if you start with a textual decimal number with y significant digits, convert it to IEEE-754 single and then back to text, then the largest y you should count on always surviving the round trip is 6.
In C this is defined as FLT_DIG.
So at the end, I'd say d_10 = d_2 * log(2)/log(10) is almost right. But since powers of 2 (IEEE-754 single) and powers of 10 (x.xxxxxxxx * 10^expo) do not match (except at 1.0), the precision to use with text is FLT_DECIMAL_DIG:
"number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,
p log10 b if b is a power of 10
ceiling(1 + p log10 b) otherwise"
9 in the case of IEEE-754 single
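
As a concrete check of both answers, a small C sketch that reinterprets the question's bit pattern as a float and prints it with 9 significant digits (FLT_DECIMAL_DIG is from <float.h>, C11 and later):

#include <float.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    uint32_t bits = 0x221ECE00u; /* 0010 0010 0001 1110 1100 1110 0000 0000 */
    float x;
    memcpy(&x, &bits, sizeof x);              /* reinterpret the bit pattern */
    printf("%.*e\n", FLT_DECIMAL_DIG - 1, x); /* 9 significant digits:
                                                 about 2.15220484e-18 */
    return 0;
}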

How to add and subtract 16 bit floating point half precision numbers?

How do I add and subtract 16 bit floating point half precision numbers?
Say I need to add or subtract:
1 10000 0000000000
1 01111 1111100000
2’s complement form.
The OpenEXR library defines a half-precision floating point class. It's C++, but the code for casting between native IEEE754 float and half should be easy to adapt. see: Half/half.h as a start.
Assuming you are using a representation similar to that of IEEE single/double precision, just compute the sign = (-1)^S, the mantissa as 1.M if E != 0 and 0.M if E == 0, and the exponent = E - (2^(n-1) - 1), operate on these natural representations, and convert back to the 16-bit format.
sign1 = -1
mantissa1 = 1.0
exponent1 = 1
sign2 = -1
mantissa2 = 1.11111
exponent2 = 0
sum:
sign = -1
mantissa = 1.111111
exponent = 1
Representation: 1 10000 1111110000
Naturally, this assumes excess encoding of the exponent.
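
A sketch of that recipe in C, assuming the IEEE-754 binary16 layout (1 sign, 5 exponent, 10 mantissa bits, bias 15); the helper name is made up for illustration:

#include <math.h>
#include <stdio.h>
#include <stdint.h>

/* Decode a binary16 bit pattern into a float. Normals and subnormals
   only; inf/nan handling is omitted from this sketch. */
static float half_to_float(uint16_t h)
{
    int s = (h >> 15) & 1;                    /* sign bit           */
    int e = (h >> 10) & 0x1F;                 /* 5 exponent bits    */
    int m =  h        & 0x3FF;                /* 10 mantissa bits   */
    float mant = e ? 1.0f + m / 1024.0f       /* 1.M for normals    */
                   :        m / 1024.0f;      /* 0.M for subnormals */
    float v = ldexpf(mant, e ? e - 15 : -14); /* bias 15            */
    return s ? -v : v;
}

int main(void)
{
    uint16_t a = 0xC000;  /* 1 10000 0000000000 = -2.0     */
    uint16_t b = 0xBFE0;  /* 1 01111 1111100000 = -1.96875 */
    printf("%g\n", half_to_float(a) + half_to_float(b)); /* -3.96875 */
    return 0;
}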

representation of double and radix point

According to what I know about double (IEEE standard), there is one bit for the sign, 52 bits for the mantissa, a base, and some bits for the exponent.
the formula to get the double is : (−1)^s × c × b^q
Maybe I made some mistake but the idea is here.
I'm just wondering how we can know where to put the radix point with this formula.
If i take number, I get for instance:
m = 3
q = 4
s = 2
b = 2
(-1)^2 * 4 * 2^3 = 32
but I don't know where to put some radix point..
What is wrong here ?
EDIT:
Maybe q is always negative ?
I guess a look at Wikipedia would've helped.
The thing is, there is a "hidden" '1.' in the IEEE formula.
Every IEEE 754 number has to be normalized; this means that the encoded number is in the format:
(-1)^sign * 1.mantissa * 2^exponent
Therefore, you have encoded 1.32, not 32.
32 = 1 * 2^5, so mantissa = 1, exponent = 5, sign = 0. We need to add 1023 to the exponent when coding it, so below we have 1023 + 5 = 1028. Also, we need to drop the leading 1 when coding the mantissa, so that 1.(whatever) becomes (whatever).
Hexadecimal representation of 32 as 64-bit double is 4040000000000000, or binary:
0100 0000 0100 0000 0000 ... and zeros all the way down
seee eeee eeee mmmm mmmm
s = sign (0)
e = exponent (coded 1028, interpreted 5)
m = mantissa (coded 0, interpreted 1.0)
To verify the result, enter 32 in the first field of an online IEEE-754 converter and click either the Rounded or Not Rounded button (it doesn't matter which).
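
The same check can also be done in C by copying the double's bytes into an integer (a sketch; assumes a 64-bit IEEE-754 double):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    double d = 32.0;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                /* reinterpret the stored bytes */
    printf("%016llx\n", (unsigned long long)bits); /* prints 4040000000000000      */
    return 0;
}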