I am implementing a floating point addition program from scratch, following the methodology listed out in this PDF: https://www.cs.colostate.edu/~cs270/.Fall20/resources/FloatingPointExample.pdf
The main issue I am having is that addition works when the result is positive (e.x. -10 + 12, 3 + 5.125), but the addition does not work when the result is negative. This is because do not understand how to implement the following step:
Step 5: Convert result from 2’s complement to signed magnitude
If the result is negative, convert the mantissa back to signed magnitude by inverting the bits and adding 1. The result is
positive in this example, so nothing needs to be done.
How do I determine if the result is negative without using floating point addition (I am not allowed to use any floating or double adds)? Of course I can see if the current and the next floats are negative and see their cumulative quantities, but that would defeat the purposes of this assignment.
If given only the following:
Sign bit, exponent, and mantissa of X
Sign bit, exponent, and mantissa of Y
Mantissa and exponent of Z
How do I determine whether Z = X + Y is negative just with the above data and not using any floating point addition?
The key insight is that many floating-point formats keep the sign and mantissa separate, so the mantissa is an unsigned integer. The sign and mantissa can be trivially combined to create a signed integer. You can then use signed integer arithmetic to add or subtract the two mantissa's of your floating-point number.
If you are following the PDF you posted, you should have converted the numbers to 2's complement at Step 3. After the addition in Step 4, you have the result in 2's complement. (Result of adding the shifted numbers)
To check if the result is negative, you need to check the leftmost bit (the sign bit) in the resulting bit pattern. In 2's complement, this bit is 1 for negative numbers, and 0 for nonnegative numbers.
sign = signBit;
if (signBit) {
result = ~result + 1;
}
If you are using unsigned integers to hold the bit pattern, you could make them of a fixed size, so that you are able to find the sign bit using shifts later.
uint64_t result;
...
signBit = (result >> 63) & 1;
At step 5, you’ve already added the mantissas. To determine whether the result is positive or negative, just check the sign bit of that sum.
The only difference between grade school math and what we do with floating point is that we have twos complement (base 2 vs base 10 is not really relevant, just makes life easier). So if you made it through grade school you know how all of this works.
In decimal in grade school you align the decimal points and then do the math. With floating point we shift the smaller number and discard it's mantissa (sorry fraction) bits to line it up with the larger number.
In grade school if doing subtraction you subtract the smaller number from the larger number once you resolve the identities
a - (-b) = a + b
-a + b = b - a
and so on so that you either have
n - m
or
n + m
And then you do the math. Apply the sign based on what you had to do to get a-b or a+b.
The beauty of twos complement is that a negation or negative is invert and add one, which feeds nicely into logic.
a - b = a + (-b) = a + (~b) + 1
so you do not re-arrange the operands but you might have to negate the second one.
Also you do not have to remember the sign of the result the result tells you its
sign.
So align the points
put it in the form
a + b
a + (-b)
Where a can be positive or negative but b's sign and the operation may need to
negate b.
Do the addition.
If the result is negative, negate the result into a positive
Normalize
IEEE is only involved in the desire to have the 1.fraction be positive, other floating point formats allow for negative whole.fraction and do not negate, simply
normalize. The rest of it is just grade school math (plus twos complement)
Some examples
2 + 4
in binary the numbers are
+10
+100
which converted to a normalized form are
+1.0 * 2^1
+1.00 * 2^2
need same exponent (align the point)
+0.10 * 2^2
+1.00 * 2^2
both are positive so no change just do the addition
this is the base form, I put more sign extension out front than needed
to make the sign of the result much easier to see.
0
000010
+000100
=======
fill it in
000000
000010
+000100
========
000110
result is positive (msbit of result is zero) so normalize
+1.10 * 2^2
4+5
100
101
+1.00 2^2
+1.01 2^2
same exponent
both positive
0
000100
+000101
=======
001000
000100
+000101
=======
001001
result is positive so normalize
+1.001 * 2^3
4 - 2
100
10
+1.00 * 2^2
+1.0 * 2^1
need the same exponent
+1.00 * 2^2
+0.10 * 2^2
subtract a - b = a + (-b)
1 <--- add one
00100
+11101 <--- invert
=======
fill it in
11011
00100
+11101
=======
00010
result is positive so normalize
+1.0 * 2^1
2 - 4
10
100
+1.0 * 2^1
+1.00 * 2^2
make same exponent
+0.10 * 2^2
+1.00 * 2^2
do the math
a - b = a + (-b)
1
000010
+111011
========
fill it in
000111
000010
+111011
========
111110
result is negative so negate (0 - n)
000011 <--- add one
000000
+000001 <--- invert
=========
000010
normalize
-1.0 * 2^1
Related
As seen in the picture above all of the variables have a negative limit that is one more than the positive limit. I was how it is able to add that extra one. I know that the first digit in the variable is used to tell if it is negative (1) or if is not (0). I also know that binary is based on the powers of 2. What I am confused about is how there is one extra when the positive itself can't go higher and the negative only has one digit changing. For example, a short can go up to 32,767 (01111111 11111111) or 16,383 + all of the decimal values of the binary numbers below it. Negative numbers are the same thing except a one at the beginning, right? So how do the negative numbers have a larger limit? Thanks to anyone who answers!
The reason is a scheme called "2's complement" to represent signed integer.
You know that the most significant bit of a signed integer represent the sign. But what you don't know is, it also represent a value, a negative value.
Take a 4-bit 2's complement signed integer as an example:
1 0 1 0
-2^3 2^2 2^1 2^0
This 4-bit integer is interpreted as:
1 * -2^3 + 0 * 2^2 + 1 * 2^1 + 0 * 2^0
= -8 + 0 + 2 + 0
= -6
With this scheme, the max of 4-bit 2's complement is 7.
0 1 1 1
-2^3 2^2 2^1 2^0
And the min is -8.
1 0 0 0
-2^3 2^2 2^1 2^0
Also, 0 is represented by 0000, 1 is 0001, and -1 is 1111. Comparing these three numbers, we can observe that zero has its "sign bit" positive, and there is no "negative zero" in 2's complement scheme. In other words, half of the range only consists of negative number, but the other half of the range includes zero and positive numbers.
If integers are stored using two's complement then you get one extra negative value and a single zero. If they are stored using one's complement or signed magnitude you get two zeros and the same number of negative values as positive ones. Floating point numbers have their own storage scheme, and under IEEE formats use have an explicit sign bit.
I know that the first digit in the variable is used to tell if it is negative (1) or if is not (0).
The first binary digit (or bit), yes, assuming two's complement representation. Which basically answers your question. There are 32,768 numbers < 0 (-32,768 .. -1) , and 32,768 numbers >= 0 (0 .. +32,767).
Also note that in binary the total possible representations (bit patterns) are an even number. You couldn't have the min and max values equal in absolute values, since you'd end up with an odd number of possible values (counting 0). Thus, you'd have to waste or declare illegal at least one bit pattern.
I have some real data. For example +2 and -3. These data are represented in two's complement fixed point with 4 bit binary value where MSB represents the sign bit and number of fractional bit is zero.
So +2 = 0010
-3 = 1101
addition of this two numbers is (+2) + (-3)=-1
(0010)+(1101)=(1111)
But in case of subtraction (+2)-(-3) what should i do?
Is it needed to take the two's complement of 1101 (-3) again and add with 0010?
You can evaluate -(-3) in binary and than simply sums it with the other values.
With two's complement, evaluate the opposite of a number is pretty simple: just apply the NOT binary operation to every digits except for the less significant bit. The equation below uses the tilde to rapresent the NOT operation of a single bit and assumed to deal with integer rapresented by n bits (n = 4 in your example):
In your example (with an informal notation): -(-3) = -(1101) = 0011
I understand that floating point numbers can often include rounding errors.
When you take the floor or ceiling of a float (or double) in order to convert it to an integer, will the resultant value be exact or can the "floored" value still be an approximation?
Basically, is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999, which would convert to 2 when you try to cast that to an int?
Is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999?
The floor() function returns an floating point value that is an exact integer. So the premise of your question is wrong to begin with.
Now, floor(x) returns the nearest integral value that is not greater than x. It is always true that
floor(x) <= x
and that there exists no integer i, greater than floor(x), such that i <= x.
Looking at floor(3.14159265), this returns 3.0. There's no debate about that. Nothing more to say.
Where it gets interesting is if you write floor(x) where x is the result of an arithmetic expression. Floating point precision and rounding can mean that x falls on the wrong side of an integer. In other words, the true value of the expression that yields x is greater than some integer, i, but that x when evaluated using floating point arithmetic is less than i.
Small integers are representable exactly as floats, but big integers are not.
But, as others pointed out, big integers not representable by float will never be representable by a non-integer, so floor() will never return a non-integer value. Thus, the cast to (int), as long as it does not overflow, will be correct.
But how small is small? Copying shamelessly from this answer:
For float, it is 16,777,217 (224 + 1).
For double, it is 9,007,199,254,740,993 (253 + 1).
Note that the usual range of int (32-bits) is 231, so float is unable to represent all of them exactly. Use double if you need that.
Interestingly, floats can store a certain range of integers exactly, for example:
1 is stored as mantissa 1 (binary 1) * exponent 2^0
2 is stored as mantissa 1 (binary 1) * exponent 2^1
3 is stored as mantissa 1.5 (binary 1.1) * exponent 2^1
4 is stored as mantissa 1 * exponent 2^2
5 is stored as mantissa 1.25 (binary 1.01) * exponent 2^2
6 is stored as mantissa 1.5 (binary 1.1) * exponent 2^2
7 is stored as mantissa 1.75 (binary 1.11) * exponent 2^2
8 is stored as mantissa 1 (binary 1) * exponent 2^3
9 is stored as mantissa 1.125 (binary 1.001) * exponent 2^3
10 is stored as mantissa 1.25 (binary 1.01) * exponent 2^3
...
As you can see, the way exponents increase works in with the perfectly-stored fractional values the mantissa can represent.
You can get a good sense for this by putting number into this great online conversion site.
Once you cross a certain threshold, there's not enough digits in the mantissa to divide the span of the increased exponents without skipping first every odd integer value, then three out of every four, then 7 out of 8 etc.. For numbers over this threshold, the issue is not that they might be different from integer values by some tiny fractional amount, its that all the representable values are integers and not only can no fractional part be represented any more, but as above some of the integers can't be either.
You can observe this in the calculator by considering:
Binary Decimal
+-Exponent Mantissa
0 10010110 11111111111111111111111 16777215
0 10010111 00000000000000000000000 16777216
0 10010111 00000000000000000000001 16777218
See how at this stage, the smallest possible increment of the mantissa is actually "worth 2" in terms of the decimal value represented?
When you take the floor or ceiling of a float (or double) in order to convert it to an integer, will the resultant value be exact or can the "floored" value still be an approximation?
It's always exact. What floor is doing is effectively wiping out any '1's in the mantissa whose significance (their contribution to value) is fractional anyway.
Basically, is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999, which would convert to 2 when you try to cast that to an int?
No.
I'm not sure if what I've done is the best way of going about the problem:
0010 0010 0001 1110 1100 1110 0000 0000
I split it up:
Sign : 0 (positive)
Exponent: 0100 0100 (in base 2) -> 2^2 + 2^6 = 68 -> excess 127: 68 - 127 = -59 (base 10)
Mantissa: (1).001 1110 1100 1110 0000 0000 -> decimal numbers needed: d-10 = d-2 * log2 / log10 = 24 * log2 / log10 = 7.22 ~ 8 (teacher told us to round up always)
So the mantissa in base 10 is: 2^0 + 2^-3 + 2^-4 + 2^-5 + 2^-6 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + 2^-14 = 1.2406616 (base 10)
Therefore the real number is:
+1.2406616 * 2^(-59) = 2.1522048 * 10^-18
But is the 10^x representation good? How do I find the right number of sig figs? Would it be the same as the rule used above?
The representation is almost good. I'd say your need a total of 9 (you have 8) significant digits.
See Printf width specifier to maintain precision of floating-point value
The right number of significant digits depends on what is right means.
If you want to print out to x significant decimal places, and read it back and be sure you have the same number x again, then for all IEEE-754 single, a total of 9 decimal places is needed in. 1 before and 8 after the '.' in scientific notation. You may get by with less digits for some numbers, but some numbers need as many as 9.
In C this is defined as FLT_DECIMAL_DIG.
Printing more than 9 does not hurt, it just does not convert to a different IEEE-754 single precision number had only 9 been used.
OTOH if you start with a textual decimal number with y significant digits, convert it to IEEE-754 single and then back to text, then the most y digits you should count on always working is 6.
In C this is defined as FLT_DIG.
So at the end, I'd say d-10 = d-2 * log2 / log10 is almost right. But since powers of 2 (IEEE-754 single) and powers of 10 (x.xxxxxxxx * 10 ^ expo) to not match (expect at 1.0) the precision to use with text is FLT_DECIMAL_DIG:
"number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,
p log10 b if b is a power of 10
ceiling(1 + p log10 b) otherwise"
9 in the case of IEEE-754 single
According to what I know on double (IEEE standard) there is one bit for signus, 54 bits for mantissa, a base and some bits for exponent
the formula to get the double is : (−1)^s × c × b^q
Maybe I made some mistake but the idea is here.
I'm just wondering how we can know where to put the radix point with this formula.
If i take number, I get for instance:
m = 3
q = 4
s = 2
b = 2
(-1)^2 * 4 * 2^3 = 32
but I don't know where to put some radix point..
What is wrong here ?
EDIT:
Maybe q is always negative ?
I guess a look at the Wikipedia would've helped.
Thing is, that there is a "hidden" '1.' in the IEEE formula.
Every IEEE 754 number has to be normlized, this means that the encoded number is in the format:
(-1)^(sign) * '1.' (mantissa) * 2^(exponent)
Therefore, you have encoded 1.32, not 32.
32 = 1 * 2^5, so mantissa=1, exponent=5, sign=0. We need to add 1023 to exponent when coding the exponent, so below we have 1023+5=1028. Also we need to remove digit 1 when coding mantissa, so that 1.(whatever) becomes (whatever)
Hexadecimal representation of 32 as 64-bit double is 4040000000000000, or binary:
0100 0000 0100 0000 0000 ... and zeros all the way down
^======== start of mantissa (coded 0, interpreted 1.0)
^===========^---------- exponent (coded 1028, interpreted 5)
^----------------------- sign (0)
To verify the result visit this page, enter 32 in first field, and click either Rounded or Not Rounded button (doesn't matter which one).