According to what I know about double (the IEEE 754 standard), there is one bit for the sign, 52 bits for the mantissa, a base, and some bits for the exponent.
The formula to get the double is: (−1)^s × c × b^q
Maybe I made some mistake but the idea is here.
I'm just wondering how we can know where to put the radix point with this formula.
If I take a number, I get for instance:
c = 4
q = 3
s = 2
b = 2
(-1)^2 * 4 * 2^3 = 32
but I don't know where to put the radix point.
What is wrong here?
EDIT:
Maybe q is always negative?
I guess a look at Wikipedia would've helped.
The thing is, there is a "hidden" '1.' in the IEEE formula.
Every IEEE 754 number has to be normalized, which means that the encoded number is in the format:
(-1)^(sign) * '1.' (mantissa) * 2^(exponent)
Therefore, you have encoded 1.32, not 32.
32 = 1 * 2^5, so mantissa = 1, exponent = 5, sign = 0. We need to add 1023 to the exponent when encoding it, so below we have 1023 + 5 = 1028. We also need to remove the leading 1 when encoding the mantissa, so that 1.(whatever) becomes (whatever).
The hexadecimal representation of 32 as a 64-bit double is 4040000000000000, or in binary:
0100 0000 0100 0000 0000 ... and zeros all the way down
               ^^^^ ^^^^ ...  start of mantissa (coded 0, interpreted 1.0)
 ^^^ ^^^^ ^^^^                exponent (coded 1028, interpreted 5)
^                             sign (0)
To verify the result, visit this page, enter 32 in the first field, and click either the Rounded or Not Rounded button (it doesn't matter which).
I am implementing a floating point addition program from scratch, following the methodology listed out in this PDF: https://www.cs.colostate.edu/~cs270/.Fall20/resources/FloatingPointExample.pdf
The main issue I am having is that addition works when the result is positive (e.g. -10 + 12, 3 + 5.125), but it does not work when the result is negative. This is because I do not understand how to implement the following step:
Step 5: Convert result from 2’s complement to signed magnitude
If the result is negative, convert the mantissa back to signed magnitude by inverting the bits and adding 1. The result is positive in this example, so nothing needs to be done.
How do I determine if the result is negative without using floating-point addition (I am not allowed to use any float or double adds)? Of course I could check whether each operand is negative and compare their magnitudes, but that would defeat the purpose of this assignment.
If given only the following:
Sign bit, exponent, and mantissa of X
Sign bit, exponent, and mantissa of Y
Mantissa and exponent of Z
How do I determine whether Z = X + Y is negative just with the above data and not using any floating point addition?
The key insight is that many floating-point formats keep the sign and mantissa separate, so the mantissa is an unsigned integer. The sign and mantissa can be trivially combined to create a signed integer. You can then use signed integer arithmetic to add or subtract the two mantissas of your floating-point numbers.
If you are following the PDF you posted, you should have converted the numbers to 2's complement at Step 3. After the addition in Step 4, you have the result in 2's complement. (Result of adding the shifted numbers)
To check if the result is negative, you need to check the leftmost bit (the sign bit) in the resulting bit pattern. In 2's complement, this bit is 1 for negative numbers, and 0 for nonnegative numbers.
sign = signBit;
if (signBit) {
    result = ~result + 1;
}
If you are using unsigned integers to hold the bit pattern, you could make them of a fixed size, so that you are able to find the sign bit using shifts later.
uint64_t result;
...
signBit = (result >> 63) & 1;
At step 5, you’ve already added the mantissas. To determine whether the result is positive or negative, just check the sign bit of that sum.
The only difference between grade-school math and what we do with floating point is that we have two's complement (base 2 vs. base 10 is not really relevant, it just makes life easier). So if you made it through grade school you know how all of this works.
In decimal, in grade school, you align the decimal points and then do the math. With floating point we shift the smaller number, discarding its mantissa (sorry, fraction) bits, to line it up with the larger number.
In grade school, when doing subtraction, you subtract the smaller number from the larger number once you resolve the identities
a - (-b) = a + b
-a + b = b - a
and so on so that you either have
n - m
or
n + m
And then you do the math. Apply the sign based on what you had to do to get a - b or a + b.
The beauty of two's complement is that negation is invert-and-add-one, which feeds nicely into logic:
a - b = a + (-b) = a + (~b) + 1
so you do not rearrange the operands, but you might have to negate the second one.
Also, you do not have to remember the sign of the result; the result tells you its sign.
So align the points and put it in the form
a + b
a + (-b)
where a can be positive or negative, and whether b must be negated depends on b's sign and the operation.
Do the addition.
If the result is negative, negate the result into a positive.
Normalize.
IEEE is only involved in the desire to have the 1.fraction be positive; other floating-point formats allow a negative whole.fraction and do not negate, they simply normalize. The rest of it is just grade-school math (plus two's complement).
Some examples
2 + 4
in binary the numbers are
+10
+100
which converted to a normalized form are
+1.0 * 2^1
+1.00 * 2^2
need same exponent (align the point)
+0.10 * 2^2
+1.00 * 2^2
both are positive so no change just do the addition
this is the base form; I put more sign extension out front than needed, to make the sign of the result much easier to see.
0
000010
+000100
=======
fill it in
000000
000010
+000100
========
000110
result is positive (msbit of result is zero) so normalize
+1.10 * 2^2
4+5
100
101
+1.00 * 2^2
+1.01 * 2^2
same exponent
both positive
0
000100
+000101
=======
001000
000100
+000101
=======
001001
result is positive so normalize
+1.001 * 2^3
4 - 2
100
10
+1.00 * 2^2
+1.0 * 2^1
need the same exponent
+1.00 * 2^2
+0.10 * 2^2
subtract a - b = a + (-b)
1 <--- add one
00100
+11101 <--- invert
=======
fill it in
11011
00100
+11101
=======
00010
result is positive so normalize
+1.0 * 2^1
2 - 4
10
100
+1.0 * 2^1
+1.00 * 2^2
make same exponent
+0.10 * 2^2
+1.00 * 2^2
do the math
a - b = a + (-b)
1
000010
+111011
========
fill it in
000111
000010
+111011
========
111110
result is negative so negate (0 - n)
000011 <--- add one
000000
+000001 <--- invert
=========
000010
normalize
-1.0 * 2^1
I'd like to know the science behind the following: a 32-bit value is shifted left 32 times in a 64-bit type, then a division is performed. Somehow the precision is contained within the last 32 bits, and in order to retrieve the value as a floating-point number, I can multiply by 1 over the max value of an unsigned 32-bit int.
phase = ((uint64) 44100 << 32) / 48000;
(phase & 0xffffffff) * (1.0f / 4294967296.0f);// == 0.918749988
the same as
(float)44100/48000;// == 0.918749988
(...)
If you lose precision when dividing two integer numbers, you should remember the remainder.
The remainder in C++ can be taken by doing 44100 % 48000 in your case.
Actually these are constants, and it's completely clear that 44100/48000 == 0, so the remainder is all you have.
Well, the remainder will even be -- guess what -- 44100!
The float type (imposed by the explicit cast) has only about 7 significant decimal digits, so most values with more digits than that cannot be stored exactly (4294967296.0f itself happens to be exact, since it is 2^32). That's why this type isn't valuable when you need full precision.
The best way to get a value of a fixed-size integer type in which all bits are set, without miscounting the number of 'f's in 0xfffff, is to use the ~ operator on a 0 value: in your case, ~uint32_t(0).
Well, I should have said this in the beginning: 44100.0/48000 should give you the result you want. :P
This is the answer I was looking for:
Bit-shifting left provides that number of bits in which to store the precision value from a division.
Dividing the integer value represented by these bits by 2 to the power of the shift amount returns the precision value.
e.g.
0000 0001 * 2^8 = 1 0000 0000 = 256 (base 10)
1 0000 0000 / 2 = 1000 0000 = 128 (base 10)
128 / 2^8 = 0.5
I'm not sure if what I've done is the best way of going about the problem:
0010 0010 0001 1110 1100 1110 0000 0000
I split it up:
Sign : 0 (positive)
Exponent: 0100 0100 (in base 2) -> 2^2 + 2^6 = 68 -> excess 127: 68 - 127 = -59 (base 10)
Mantissa: (1).001 1110 1100 1110 0000 0000 -> decimal digits needed: d10 = d2 * log(2)/log(10) = 24 * log(2)/log(10) = 7.22 ~ 8 (teacher told us to round up always)
So the mantissa in base 10 is: 2^0 + 2^-3 + 2^-4 + 2^-5 + 2^-6 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + 2^-14 = 1.2406616 (base 10)
Therefore the real number is:
+1.2406616 * 2^(-59) = 2.1522048 * 10^-18
But is the 10^x representation good? How do I find the right number of sig figs? Would it be the same as the rule used above?
The representation is almost good. I'd say you need a total of 9 (you have 8) significant digits.
See Printf width specifier to maintain precision of floating-point value
The right number of significant digits depends on what "right" means.
If you want to print out to x significant decimal places, and read the text back and be sure you have the same number x again, then for all IEEE-754 singles a total of 9 significant decimal digits is needed: 1 before and 8 after the '.' in scientific notation. You may get by with fewer digits for some numbers, but some numbers need as many as 9.
In C this is defined as FLT_DECIMAL_DIG.
Printing more than 9 does not hurt; the extra digits just do not make the text convert to a different IEEE-754 single-precision number than if only 9 had been used.
OTOH, if you start with a textual decimal number with y significant digits, convert it to IEEE-754 single and then back to text, then the largest y you can count on always surviving the round trip is 6.
In C this is defined as FLT_DIG.
So in the end, I'd say d10 = d2 * log(2)/log(10) is almost right. But since powers of 2 (IEEE-754 single) and powers of 10 (x.xxxxxxxx * 10^expo) do not match (except at 1.0), the precision to use with text is FLT_DECIMAL_DIG:
"number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,
p log10 b if b is a power of 10
ceiling(1 + p log10 b) otherwise"
9 in the case of IEEE-754 single.
I'm interested in learning how to convert an integer value into IEEE single precision floating point format using bitwise operators only. However, I'm confused as to what can be done to know how many logical shifts left are needed when calculating for the exponent.
Given an int, say 15, we have:
Binary: 1111
-> 1.111 x 2^3 => After placing a decimal point after the first bit, we find that the 'e' value will be three.
E = Exp - Bias
Therefore, Exp = 130 = 10000010
And the significand will be: 111000000000000000000000
However, I knew that the 'e' value would be three because I was able to see that there are three bits after placing the decimal after the first bit. Is there a more generic way to code for this as a general case?
Again, this is for an int to float conversion, assuming that the integer is non-negative, non-zero, and is not larger than the max space allowed for the mantissa.
Also, could someone explain why rounding is needed for values greater than 23 bits?
Thanks in advance!
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare-bones, and attempts to produce an IEEE-754 single-precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: a single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, the exponent in bits 30 .. 23, and the significand in bits 22 .. 0; the diagram in the Wikipedia article linked below illustrates the layout.
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (aka 2^1)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 2^1 = 2.0.
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
// Only support 0 < significand < 1 << 24.
if (significand == 0 || significand >= 1 << 24)
return -1.0; // or abort(); or whatever you'd like here.
int shifts = 0;
// Align the leading 1 of the significand to the hidden-1
// position. Count the number of shifts required.
while ((significand & (1 << 23)) == 0)
{
significand <<= 1;
shifts++;
}
// The number 1.0 has an exponent of 0, and would need to be
// shifted left 23 times. The number 2.0, however, has an
// exponent of 1 and needs to be shifted left only 22 times.
// Therefore, the exponent should be (23 - shifts). IEEE-754
// format requires a bias of 127, though, so the exponent field
// is given by the following expression:
unsigned int exponent = 127 + 23 - shifts;
// Now merge significand and exponent. Be sure to strip away
// the hidden 1 in the significand.
unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);
// Reinterpret as a float and return. This is an evil hack.
return *reinterpret_cast< float* >( &merged );
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": you lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating-point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden-1 position. If a value were >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.
I can't seem to figure out how to convert 2 * 10^33 into IEEE 754 format.
I find the sign bit to be 0
I find the exponent to be 110 + bias (of 127) to be 0xED
But, the mantissa is just killing me.. I can't figure out why I keep getting 0 for this part.
You need the first 24 bits of 2*10^33. The first bit is always 1, and the remaining 23 bits form the last 23 bits of the IEEE-754 single-precision floating-point number.
Now, 2*10^33 has 111 binary digits, so it is too large to calculate exactly with most tools (calculators or programming languages). We can make things a little bit easier by noting that 2*10^33 = 2*(2*5)^33 = 2^34*5^33, so the first 24 bits of our number are the same as those of 5^33, which has only 77 bits.
We can further write:
5^33 = (2^7 - 3)^11
= 2^77 - 11*3*2^70 + 55*9*2^63 - 165*27*2^56 + 330*81*2^49
- 462*243*2^42 + 462*729*2^35 - 330*2187*2^28 + ...
= 2^53 * (2^24 - 33*2^17 + 495*2^10 - 4455*2^3 + 26730/2^4
- 112266/2^11 + 336798/2^18 - 721710/2^25 + ...)
= 2^53 * (16777216 - 4325376 + 506880 - 35640 + 1670.625
- 54.817... + 1.284... - 0.0215...)
= 2^53 * 12924697.071
= 2^53 * 110001010011011100011001b
where we rounded in the last step. So the stored part of the mantissa is 10001010011011100011001. Together with the information you already have, the result is:
0 11101101 10001010011011100011001
or in hex:
76C53719
If you want it done automatically, try this website. Type 2e33 into the top text box and hit the Rounded or Not Rounded buttons to get the answer.
If you type 2000000000000000000000000000000000 into my decimal/binary converter you will get
110001010011011100011001000100100011011001001100111000110000010101101100001010000000000000000000000000000000000
Rounded to 24 significant bits -- the number of significand bits in a float -- this is 110001010011011100011001 (the trailing 23 bits of this are the stored mantissa).