Exponent in IEEE 754 - ieee-754

Why is the exponent in a float biased by 127?
Well, the real question is: what is the advantage of this notation compared to 2's complement notation?

Since the exponent as stored is unsigned, it is possible to use integer instructions to compare floating point values. The entire floating point value can be treated as a sign-magnitude integer for purposes of comparison (not two's complement).

Just to correct some misinformation: the value is 2^n * 1.mantissa; the 1 in front of the fraction is implicit and is not stored.

Note that there is a slight difference in the representable range for the exponent, between biased and 2's complement. The IEEE standard supports exponents in the range of (-127 to +128), while if it was 2's complement, it would be (-128 to +127). I don't really know the reason why the standard chooses the bias form, but maybe the committee members thought it would be more useful to allow extremely large numbers, rather than extremely small numbers.

@Stephen Canon, in response to ysap's answer (sorry, this should have been a follow-up comment on my answer, but the original answer was entered as an unregistered user, so I cannot comment on it yet).
Stephen, obviously you are right: the exponent range I mentioned is incorrect, but the spirit of the answer still applies. If the exponent were 2's complement instead of biased, and the 0x00 and 0xFF values were still special values, then the biased exponents would allow for (2x) bigger numbers than the 2's complement exponents.

The exponent in a 32-bit float consists of 8 bits, but without a sign bit. So the stored range is effectively [0;255]. In order to represent numbers < 2^0, that range is shifted by 127, becoming [-127;128]. The two end values are reserved (all zeroes for zero and subnormals, all ones for infinity and NaN), so normal numbers use exponents from -126 to 127.
That way, very small numbers can be represented very precisely. With a [0;255] range, small numbers would have to be represented as 2^0 * 0.mantissa with lots of zeroes in the mantissa. With the shifted range, small numbers are more precise because they can be represented as 2^-126 * 1.mantissa (without the unnecessary leading zeroes in the mantissa). Hope you get the point.

Signed type representation in c++

In the book I am reading it says that:
The standard does not define how signed types are represented, but does specify that range should be evenly divided between positive and negative values. Hence, an 8-bit signed char is guaranteed to be able to hold values from -127 through 127; most modern machines use representations that allow values from -128 through 127.
I presume that the [-128;127] range arises from the method called "two's complement", in which the negative of a number A is ~A+1 (e.g. 0111 is 7, and 1001 is then -7). But I cannot wrap my head around why on some older(?) machines the range of values is [-127;127]. Can anyone clarify this?
Both one's complement and signed magnitude are representations that provide the range [-127,127] with an 8 bit number. Both have a different representation for +0 and -0. Both have been used by (mostly) early computer systems.
The signed magnitude representation is perhaps the simplest for humans to imagine and was probably used for the same reason as why people first created decimal computers, rather than binary.
I would imagine that the only reason why one's complement was ever used, was because two's complement hadn't yet been considered by the creators of early computers. Then later on, because of backwards compatibility. Although, this is just my conjecture, so take it with a grain of salt.
Further information: https://en.wikipedia.org/wiki/Signed_number_representations
As a slightly related factoid: In the IEEE floating point representation, the signed exponent uses excess-K representation and the fractional part is represented by signed magnitude.
It's not actually -127 to 127. But -127 to -0 and 0 to 127.
Earlier processors used two methods:
Signed magnitude: a negative number is formed by setting the most significant bit to 1. So 10000000 and 00000000 both represent 0.
One's complement: a negative number is formed by applying bitwise NOT to the positive number. This gives two representations of zero: 11111111 and 00000000.
Also, two's complement is nearly as old as the other two. https://www.linuxvoice.com/edsac-dennis-wheeler-and-the-cambridge-connection/

C++ I've just read that floats are inexact and do not store exact integer values. What does this mean?

I am thinking of this at a binary level.
Would a float of value 1 and an integer of value 1 not compile down to (omitting lots of zeros here)
0001
If they do both compile down to this, then where does this inexactness come in?
Resource I'm using is http://www.cprogramming.com/tutorial/lesson1.html
Thanks.
It's possible. Floating point numbers are represented in an exponential notation (a*2^n), where some bits represent a (the significand), and some bits represent n (the exponent).
You can't uniquely represent all the integers in the range of a floating point value, due to the so-called pigeonhole principle. For example, 32-bit floats go up to over 10^38, but on 32 bits you can only represent 2^32 values - that means some integers will have the same representation.
Now, what happens when you try to, for example, do the following:
x = 10^38 - (10^38 - 1)
You should get 1, but you probably won't, because 10^38 and 10^38-1 are so close to each other that the computer has to represent them the same way. So, your 1.0f will usually be 1, but if this 1 is a result of calculation, it might not be.
Here are some examples.
To be precise: Integers can be exactly represented as floats if their binary representation does not use more bits than the float format supplies for the mantissa plus an implicit one bit.
IEEE floats have a mantissa of 23 bits, add one implicit bit, and you can store any integer representable with 24 bits in a float (that's integers up to 16777216). Likewise, a double has 52 mantissa bits, so it can store integers up to 9007199254740992.
Beyond that point, the IEEE format omits first the odd numbers, then all numbers not divisible by 4, and so on. So, even 0xffffff00ul is exactly representable as a float, but 0xffffff01ul is not.
So, yes, you can represent integers as floats, and as long as they don't become larger than the 16e6 or 9e15 limits, you can even expect additions between integers in float format to be exact.
A float will store an int exactly if the int is less than a certain number, but if you have a large enough int, there won't be enough bits in the mantissa to store all the bits of the integer. The missing bits are then assumed to be zero. If the missing bits aren't zero, then your int won't be equal to your float.
Short answer: no, the floating point representation of integers is not that simple.
The representation used for the float type by virtually all current C and C++ implementations is called IEEE 754 single-precision (the language standards themselves do not strictly require it). It is probably more complicated than most people would like to delve into, but the link describes it thoroughly in case you're interested.
As for the representation of the integer 1: we can see how it's encoded in the 32-bit base-2 single-precision format defined by IEEE 754 here - 3f80 0000.
Suppose letters stand for a bit, 0/1. Then a floating point number looks (schematically) like:
smmmmee
where s is the sign +/- and the number is .mmmm x 10 ^ ee
Now if you have two immediately following numbers:
.mmm0 x 10 ^ ee
.mmm1 x 10 ^ ee
Then for a large exponent ee the difference might be more than 1.
And of course in base 2 a number like 1/5, 0.2, cannot be represented exactly. Summing fractions will increase the error.
(Note this is not the exact representation.)
would a float of value 1 and an integer of value 1 not compile down to (omitting lots of zeros here) 0001
No, a float of value 1 is stored as 0x3f800000 in single precision (the bytes appear as 00 00 80 3f in memory on a little-endian machine); the exact pattern depends on the precision.
What does this mean?
Some numbers cannot be precisely represented in binary form. 0.2 in binary form looks like 0.00110011001100110011..., which keeps repeating forever. No matter how many bits you use to store it, it will never be enough. That's because 0.2 is 1/5, and the denominator 5 is not a power of 2. The only way to represent it precisely is to store it as a ratio.
Floating point numbers have limited precision. Roughly speaking, they only store a certain number of digits after the first significant non-zero digit, and the rest is lost. That results in errors; for example, with single precision floats, 100000000000000001 and 100000000000000002 are rounded to the same number.
You might also want to read something like this.
Conclusion:
If you're writing financial software, do not use floats. Use bignums, via libraries like GMP.
Contrary to some modern dynamically typed programming languages such as JavaScript or Ruby that have a single basic numeric type, the C programming language has many. That is because C reflects the different way to represent different kinds of numbers within a processor register.
To investigate different representations you can use the union construct where the same data can be viewed as different types.
Define
union {
    float x;
    unsigned int v;
} u;
Assign u.x = 1.0f and call printf("0x%08x\n", u.v) to get the 32-bit representation of 1.0f as a floating point number. It should print 0x3f800000 and not 0x00000001 as one might expect. (This assumes that float and unsigned int are both 32 bits wide.)
As mentioned in earlier answers, this reflects the representation of a floating point number as a 32-bit value:
1.0f = 0x3F800000 = 0011.1111.1000.0000.0000.0000.0000.0000 =
0 0111.1111 000.0000.0000.0000.0000.0000 = 0 0x7F 0
Here the three parts are sign s=0, exponent e=127, and mantissa m=0, and the floating point value is computed as
value = (-1)^s * (1 + m * 2^-23) * 2^(e-127)
With this representation any integer from -16,777,216 to 16,777,216 can be represented exactly. That limit is 2^24, since the 23 stored mantissa bits plus the implicit leading 1 give 24 significant bits. This range is not sufficient for many applications, therefore the float type cannot replace the int type.
The range of exact representation of integers by the double type is wider, since the value occupies 64 bits, of which 52 are stored mantissa bits (53 significant bits with the implicit 1). It is exactly from
-9,007,199,254,740,992 to 9,007,199,254,740,992. Yet double requires twice as much memory.
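Another source of difficulty is the way fractional numbers are represented. Since most decimal fractions cannot be represented exactly (0.1f = 0x3dcccccd = 0.10000000149...), the use of floating point numbers breaks common algebraic identities. For example, adding 0.1f ten times in single precision gives 1.0000001f, so
0.1f + 0.1f + ... + 0.1f (ten terms) != 1.0f
even though the single rounded multiplication 0.1f * 10.0f happens to produce exactly 1.0f.
This can be confusing and lead to errors that are hard to detect. In general strict equality should not be used with floating point numbers.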
Another example of floating point arithmetic departing from algebraic correctness:
float x = 16777217.0f;  /* not representable; rounds to 16777216.0f */
float y = 16777215.0f;
x -= 1.0f;              /* x is now 16777215.0f */
y += 1.0f;              /* y is now 16777216.0f */
if (y > x) {printf("16777215.0 + 1.0 > 16777217.0 - 1.0\n");}
Yet another issue is the behaviour of the system when the limits of exact representation are exceeded. In integer arithmetic, when the result of an operation is outside the range of the type, this can be detected in several ways: an overflow bit in the processor flags register is set, and the result differs significantly from the expected one.
In floating point arithmetic as the example above shows, the loss of precision occurs silently.
Hope this helps to understand why one needs many basic numeric types in C.

Where's the 24th fraction bit on a single precision float? IEEE 754

I found myself today doing some bit manipulation and I decided to refresh my floating-point knowledge a little!
Things were going great until I saw this:
... 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits
I read it again and again but I still can't figure out where the 24th bit is. I noticed something about a binary point, so I assumed that it's a point in the middle between the mantissa and the exponent.
I'm not really sure, but I believe the author was talking about this bit:
        Binary point?
             |
s------e-----|-------------m----------
0 - 01111100 - 01000000000000000000000
             ^ this
The 24th bit is implicit due to normalization.
The significand is shifted left (and one subtracted from the exponent for each bit shift) until the leading bit of the significand is a 1.
Then, since the leading bit is a 1, only the other 23 bits are actually stored.
There is also the possibility of a denormal number. The exponent is stored in "biased" format, meaning that it's an unsigned number where the middle of the range is defined to mean 0. So, with 8 bits, it's stored as a number from 0..255, where 127 is interpreted to mean 0, 1 is interpreted to mean -126, and 254 is interpreted as 127; the stored values 0 and 255 are reserved (0 for zeroes and denormals, 255 for infinities and NaNs).
If, in the process of normalization, the stored exponent would be decremented to 0, then normalization stops at an actual exponent of -126, and the significand is stored as-is. In this case, the implicit bit from normalization is taken to be a 0 instead of a 1.
Most floating point hardware is designed to basically assume numbers will be normalized, so they assume that implicit bit is a 1. During the computation, they check for the possibility of a denormal number, and in that case they do roughly the equivalent of throwing an exception, and re-start the calculation with that taken into account. This is why computation with denormals often gets drastically slower than otherwise.
In case you wonder why it uses this strange format: IEEE floating point (like many others) is designed to ensure that if you treat its bit pattern as an integer of the same size, you can compare the patterns as sign-magnitude integers and they'll still sort into the correct order as floating point numbers (NaNs aside). For non-negative values this is the same as comparing them as signed, 2's complement integers; negative values sort in reverse under a 2's complement comparison. Since the sign of the number is in the most significant bit (where it is for a 2's complement integer), that's treated as the sign bit. The bits of the exponent are stored as the next most significant bits -- but if we used 2's complement for them, an exponent less than 0 would set the second most significant bit of the number, which would result in what looked like a big number as an integer. By using bias format, a smaller exponent leaves that bit clear, and a larger exponent sets it, so the order as an integer reflects the order as a floating point.
Normally (pardon the pun), the leading bit of a floating point number is always 1; thus, it doesn't need to be stored anywhere. The reason is that, if it weren't 1, that would mean you had chosen the wrong exponent to represent it; you could get more precision by shifting the mantissa bits left and using a smaller exponent.
The one exception is denormal/subnormal numbers, which are represented by all zero bits in the exponent field (the lowest possible exponent). In this case, there is no implicit leading 1 in the mantissa, and you have diminishing precision as the value approaches zero.
For normal floating point numbers, the number stored in the floating point variable is (ignoring sign) 1.mantissa * 2^(exponent - offset). The leading 1 is not stored in the variable.

Floating-point: "The leading 1 is 'implicit' in the significand." -- ...huh?

I'm learning about the representation of floating-point IEEE 754 numbers, and my textbook says:
To pack even more bits into the significand, IEEE 754 makes the leading 1-bit of normalized binary numbers implicit. Hence, the number is actually 24 bits long in single precision (implied 1 and 23-bit fraction), and 53 bits long in double precision (1 + 52).
I don't get what "implicit" means here... what's the difference between an explicit bit and an implicit bit? Don't all numbers have the bit, regardless of their sign?
Yes, all normalised numbers (other than the zeroes) have that bit set to one (a), so they make it implicit to prevent wasting space storing it.
In other words, they save that bit totally, and reuse it so that it can be used to increase the precision of your numbers.
Keep in mind that this is the first bit of the fraction, not the first bit of the binary pattern. The first bit of the binary pattern is the sign, followed by a few bits of exponent, followed by the fraction itself.
For example, a single precision number is (sign, exponent, fraction):
<1> <--8---> <---------23----------> <- bit widths
s eeeeeeee fffffffffffffffffffffff
If you look at the way the number is calculated, it's:
(-1)^sign x 1.fraction x 2^(exponent - bias)
So the fractional part used for calculating that value is 1.fffff...fff (in binary).
(a) There is actually a class of numbers (the denormalised ones and the zeroes) for which that property does not hold true. These numbers all have a biased exponent of zero but the vast majority of numbers follow the rule.
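As a worked example of the implicit bit, take the single precision pattern 0 01111100 01000000000000000000000 (chosen here for illustration). The sign is 0, the biased exponent is 01111100 in binary, which is 124, and the stored fraction is .01 in binary, which is 0.25. With the single precision bias of 127:

```latex
\begin{aligned}
\text{value} &= (-1)^{0} \times 1.01_2 \times 2^{124 - 127} \\
             &= 1 \times 1.25 \times 2^{-3} \\
             &= 0.15625
\end{aligned}
```

The leading 1 in 1.01 appears nowhere in the 32 stored bits; it is supplied by the rule for normalised numbers.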
Here is what they are saying. The first non-zero bit is always going to be 1. So there is no need for the binary representation to include that bit, since you know what it is. So they don't. They tell you where that first 1 is, and then they give the bits after it. So there is a 1 that is not explicitly in the binary representation, whose location is implicit from the fact that they told you where it was.
It may also be helpful to note that we are dealing in binary representations of a number. The reason that the first digit of a normalized binary number (that is, no leading zeroes) has to be 1 is that 1 is the only non-zero value available to us in this representation. So, the same would not be true for, say, base-three representations.

Questions about two's complement and IEEE 754 representations

How would I go about finding the value of the two-byte two's complement value 0xFF72?
Would I start by converting 0xFF72 to binary?
Reverse (invert) the bits.
Add 1 in binary notation. // lost here.
Write decimal.
I just don't know..
Also,
What about an 8-byte double with the bit pattern 0x7FF8000000000000? What is its value as a floating point number?
I would think that this was homework, but for the particular double that is listed. 0x7FF8000000000000 is a quiet NaN per the IEEE-754 spec, not a very interesting value to put on a homework assignment:
The sign bit is clear.
The exponent field is 0x7ff, the largest possible exponent, which means that the number is either an infinity or a NaN.
The significand field is 0x8000000000000. Since it isn't zero, the number is not an infinity, and must be a NaN. Since the leading bit is set, it is a quiet NaN, not a so-called "signaling NaN".
Step 3 just means add 1 to the value. It's really as simple as it sounds. :-)
Example with 0xFF72 (assumed 16 bits here):
First, invert it: 0x008D (each digit is simply 0xF minus the original value)
Then add 1: 0x008E
0x008E is 142 in decimal, and since the sign bit of 0xFF72 was set, the value is -142.
This sounds like homework and for openness you should tag it as such if it is.
As for interpreting an 8 byte (double) floating point number, take a look at this Wikipedia article.