Questions about two's complement and IEEE 754 representations - ieee-754

How would i go about finding the value of the two-byte two’s complement value 0xFF72 is"?
Would i start by converting 0xFF72 to binary?
reverse the bits.
add 1 in binary notation. // lost here.
write decimal.
I just dont know..
Also,
What about an 8 byte double that has the value: 0x7FF8000000000000. Its value as a floating point?

I would think that this was homework, but for the particular double that is listed. 0x7FF8000000000000 is a quiet NaN per the IEEE-754 spec, not a very interesting value to put on a homework assignment:
The sign bit is clear.
The exponent field is 0x7ff, the largest possible exponent, which means that the number is either an infinity or a NaN.
The significand field is 0x8000000000000. Since it isn't zero, the number is not an infinity, and must be a NaN. Since the leading bit is set, it is a quiet NaN, not a so-called "signaling NaN".

Step 3 just means add 1 to the value. It's really as simple as it sounds. :-)
Example with 0xFF72 (assumed 16-bits here):
First, invert it: 0x008D (each digit is simply 0xF minus the original value)
Then add 1: 0x008E

This sounds like homework and for openness you should tag it as such if it is.
As for interpreting an 8 byte (double) floating point number, take a look at this Wikipedia article.

Related

Value of the field std::numeric_limits<T>::digits for floating point types T in C++, GCC compiler [duplicate]

In this wiki article it shows 23 bits for precision, 8 for exponent, and 1 for sign
Where is the hidden 24th bit in float type that makes (23+1) for 7 significand digits?
Floating point numbers are usually normalized. Consider, for example, scientific notation as most of us learned it in school. You always scale the exponent so there's exactly one digit before the decimal point. For example, instead of 123.456, you write 1.23456x102.
Floating point on a computer is normally handled (almost1) the same way: numbers are normalized so there's exactly one digit before the binary point (binary point since most work in binary instead of decimal). There's one difference though: in the case of binary, that means the digit before the decimal point must be a 1. Since it's always a 1, there's no real need to store that bit. To save a bit of storage in each floating point number, that 1 bit is implicit instead of being stored.
As usual, there's just a bit more to the situation than that though. The main difference is denormalized numbers. Consider, for example, if you were doing scientific notation but you could only use exponents from -99 to +99. If you wanted to store a number like, say, 1.234*10-102, you wouldn't be able to do that directly, so it would probably just get rounded down to 0.
Denormalized numbers give you a way to deal with that. Using a denormalized number, you'd store that as 0.001234*10-99. Assuming (as is normally the case on a computer) that the number of digits for the mantissa and exponent are each limited, this loses some precision, but still avoids throwing away all the precision and just calling it 0.
1 Technically, there are differences, but they make no difference to the basic understanding involved.
http://en.wikipedia.org/wiki/Single_precision_floating-point_format#IEEE_754_single_precision_binary_floating-point_format:_binary32
The true significand includes 23
fraction bits to the right of the
binary point and an implicit leading
bit (to the left of the binary point)
with value 1 unless the exponent is
stored with all zeros
Explains it pretty well, it is by convention/design that last bit is not stored explicitly but rather stated by specification that it is there unless everything is 0'os.
As you write, the single-precision floating-point format has a sign bit, eight exponent bits, and 23 significand bits. Let s be the sign bit, e be the exponent bits, and f be the significand bits. Here is what various combinations of bits stand for:
If e and f are zero, the object is +0 or -0, according to whether s is 0 or 1.
If e is zero and f is not, the object is (-1)s * 21-127 * 0.f. "0.f" means to write 0, period, and the 23 bits of f, then interpret that as a binary numeral. E.g., 0.011000... is 3/8. These are the "subnormal" numbers.
If 0 < e < 255, the object is (-1)s * 2e-127 * 1.f. "1.f" is similar to "0.f" above, except you start with 1 instead of 0. This is the implicit bit. Most of the floating-point numbers are in this format; these are the "normal" numbers.
If e is 255 and f is zero, the object is +infinity or -infinity, according to whether s is 0 or 1.
If e is 255 and f is not zero, the object is a NaN (Not a Number). The meaning of the f field of a NaN is implementation dependent; it is not fully specified by the standard. Commonly, if the first bit is zero, it is a signaling NaN; otherwise it is a quiet NaN.

Where's the 24th fraction bit on a single precision float? IEEE 754

I found myself today doing some bit manipulation and I decided to refresh my floating-point knowledge a little!
Things were going great until I saw this:
... 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits
I read it again and again but I still can't figure out where the 24th bit is, I noticed something about a binary point so I assumed that it's a point in the middle between the mantissa and the exponent.
I'm not really sure but I believe he author was talking about this bit:
Binary point?
|
s------e-----|-------------m----------
0 - 01111100 - 01000000000000000000000
^ this
The 24th bit is implicit due to normalization.
The significand is shifted left (and one subtracted from the exponent for each bit shift) until the leading bit of the significand is a 1.
Then, since the leading bit is a 1, only the other 23 bits are actually stored.
There is also the possibility of a denormal number. The exponent is stored as a "bias" format signed number, meaning that it's an unsigned number where the middle of the range is defined to mean 01. So, with 8 bits, it's stored as a number from 0..255, but 0 is interpreted to mean -128, 128 is interpreted to mean 0, and 255 is interpreted as 127 (I may have a fencepost error there, but you get the idea).
If, in the process of normalization, this is decremented to 0 (meaning an actual exponent value of -128), then normalization stops, and the significand is stored as-is. In this case, the implicit bit from normalization it taken to be a 0 instead of a 1.
Most floating point hardware is designed to basically assume numbers will be normalized, so they assume that implicit bit is a 1. During the computation, they check for the possibility of a denormal number, and in that case they do roughly the equivalent of throwing an exception, and re-start the calculation with that taken into account. This is why computation with denormals often gets drastically slower than otherwise.
In case you wonder why it uses this strange format: IEEE floating point (like many others) is designed to ensure that if you treat its bit pattern as an integer of the same size, you can compare them as signed, 2's complement integers and they'll still sort into the correct order as floating point numbers. Since the sign of the number is in the most significant bit (where it is for a 2's complement integer) that's treated as the sign bit. The bits of the exponent are stored as the next most significant bits -- but if we used 2's complement for them, an exponent less than 0 would set the second most significant bit of the number, which would result in what looked like a big number as an integer. By using bias format, a smaller exponent leaves that bit clear, and a larger exponent sets it, so the order as an integer reflects the order as a floating point.
Normally (pardon the pun), the leading bit of a floating point number is always 1; thus, it doesn't need to be stored anywhere. The reason is that, if it weren't 1, that would mean you had chosen the wrong exponent to represent it; you could get more precision by shifting the mantissa bits left and using a smaller exponent.
The one exception is denormal/subnormal numbers, which are represented by all zero bits in the exponent field (the lowest possible exponent). In this case, there is no implicit leading 1 in the mantissa, and you have diminishing precision as the value approaches zero.
For normal floating point numbers, the number stored in the floating point variable is (ignoring sign) 1. mantissa * 2exponent-offset. The leading 1 is not stored in the variable.

Floating-point: "The leading 1 is 'implicit' in the significand." -- ...huh?

I'm learning about the representation of floating-point IEEE 754 numbers, and my textbook says:
To pack even more bits into the significand, IEEE 754 makes the leading 1-bit of normalized binary numbers implicit. Hence, the number is actually 24 bits long in single precision (implied 1 and 23-bit fraction), and 53 bits long in double precision (1 + 52).
I don't get what "implicit" means here... what's the difference between an explicit bit and an implicit bit? Don't all numbers have the bit, regardless of their sign?
Yes, all normalised numbers (other than the zeroes) have that bit set to one (a), so they make it implicit to prevent wasting space storing it.
In other words, they save that bit totally, and reuse it so that it can be used to increase the precision of your numbers.
Keep in mind that this is the first bit of the fraction, not the first bit of the binary pattern. The first bit of the binary pattern is the sign, followed by a few bits of exponent, followed by the fraction itself.
For example, a single precision number is (sign, exponent, fraction):
<1> <--8---> <---------23----------> <- bit widths
s eeeeeeee fffffffffffffffffffffff
If you look at the way the number is calculated, it's:
(-1)sign x 1.fraction x 2exponent-bias
So the fractional part used for calculating that value is 1.fffff...fff (in binary).
(a) There is actually a class of numbers (the denormalised ones and the zeroes) for which that property does not hold true. These numbers all have a biased exponent of zero but the vast majority of numbers follow the rule.
Here is what they are saying. The first non-zero bit is always going to be 1. So there is no need for the binary representation to include that bit, since you know what it is. So they don't. They tell you where that first 1 is, and then they give the bits after it. So there is a 1 that is not explicitly in the binary representation, whose location is implicit from the fact that they told you where it was.
It may also be helpful to note that we are dealing in binary representations of a number. The reason that the first digit of a normalized binary number (that is, no leading zeroes) has to be 1 is that 1 is the only non-zero value available to us in this representation. So, the same would not be true for, say, base-three representations.

Some questions about floating points

I'm wondering if a number is represented one way in a floating point representation, is it going to be represented in the same way in a representation that has a larger size.
That is, if a number has a particular representation as a float, will it have the same representation if that float is cast to a double and then still the same when cast to a long double.
I'm wondering because I'm writing a BigInteger implementation and any floating point number that is passed in I am sending to a function that accepts a long double to convert it. Which leads me to my next question. Obviously floating points do not always have exact representations, so in my BigInteger class what should I be attempting to represent when given a float. Is it reasonable to try and represent the same number as given by std::cout << std::fixed << someFloat; even if that is not the same as the number passed in. Is that the most accurate representation I will be able to get? If so, ...
What's the best way to extract that value (in base some power of 10), at the moment I'm just grabbing it as a string and passing it to my string constructor. This will work, but I can't help but feel theres a better way, but certainly taking the remainder when dividing by my base is not accurate with floats.
Finally, I wonder if there is a floating point equivalent of uintmax_t, that is a typename that will always be the largest floating point type on a system, or is there no point because long double will always be the largest (even if it 's the same as a double).
Thanks, T.
If by "same representation" you mean "exactly the same binary representation in memory except for padding", then no. Double-precision has more bits of both exponent and mantissa, and also has a different exponent bias. But I believe that any single-precision value is exactly representable in double-precision (except possibly denormalised values).
I'm not sure what you mean when you say "floating points do not always have exact representations". Certainly, not all decimal floating-point values have exact binary floating-point values (and vice versa), but I'm not sure that's a problem here. So long as your floating-point input has no fractional part, then a suitably large "BigInteger" format should be able to represent it exactly.
Conversion via a base-10 representation is not the way to go. In theory, all you need is a bit-array of length ~1024, initialise it all to zero, and then shift the mantissa bits in by the exponent value. But without knowing more about your implementation, there's not a lot more I can suggest!
double includes all values of float; long double includes all values of double. So you're not losing any value information by conversion to long double. However, you're losing information about the original type, which is relevant (see below).
In order to follow common C++ semantics, conversion of a floating point value to integer should truncate the value, not round.
The main problem is with large values that are not exact. You can use the frexp function to find the base 2 exponent of the floating point value. You can use std::numeric_limits<T>::digits to check if that's within the integer range that can be exactly represented.
My personal design choice would be an assert that the fp value is within the range that can be exactly represented, i.e. a restriction on the range of any actual argument.
To do that properly you need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument's type.
When you have an fp value that is within the allowed range, you can use floor and fmod to extract digits in any numeral system you want.
yes, going from IEEE float to double to extended you will see bits from the smaller format to the larger format, for example
single
S EEEEEEEE MMMMMMM.....
double
S EEEEEEEEEEEE MMMMM....
6.5 single
0 10000001 101000...
6.5 double
0 10000000001 101000...
13 single
0 10000010 101000...
13 double
0 10000000010 101000...
The mantissa you will left justify and then add zeros.
The exponent is right justified, sign extend the next to msbit then copy the msbit.
An exponent of -2 for example. take -2 subtract 1 which is -3. -3 in twos complement is 0xFD or 0b11111101 but the exponent bits in the format are 0b01111101, the msbit inverted. And for double a -2 exponent -2-1 = -3. or 0b1111...1101 and that becomes 0b0111...1101, the msbit inverted. (exponent bits = twos_complement(exponent-1) with the msbit inverted).
As we see above an exponent of 3 3-1 = 2 0b000...010 invert the upper bit 0b100...010
So yes you can take the bits from single precision and copy them to the proper locations in the double precision number. I dont have an extended float reference handy but pretty sure it works the same way.

Exponent in IEEE 754

Why exponent in float is displaced by 127?
Well, the real question is : What is the advantage of such notation in comparison to 2's complement notation?
Since the exponent as stored is unsigned, it is possible to use integer instructions to compare floating point values. the the entire floating point value can be treated as a signed magnitude integer value for purposes of comparison (not twos-compliment).
Just to correct some misinformation: it is 2^n * 1.mantissa, the 1 infront of the fraction is implicitly stored.
Note that there is a slight difference in the representable range for the exponent, between biased and 2's complement. The IEEE standard supports exponents in the range of (-127 to +128), while if it was 2's complement, it would be (-128 to +127). I don't really know the reason why the standard chooses the bias form, but maybe the committee members thought it would be more useful to allow extremely large numbers, rather than extremely small numbers.
#Stephen Canon, in response to ysap's answer (sorry, this should have been a follow up comment to my answer, but the original answer was entered as an unregistered user, so I cannot really comment it yet).
Stephen, obviously you are right, the exponent range I mentioned is incorrect, but the spirit of the answer still applies. Assuming that if it was 2's complement instead of biased value, and assuming that the 0x00 and 0xFF values would still be special values, then the biased exponents allow for (2x) bigger numbers than the 2's complement exponents.
The exponent in a 32-bit float consists of 8 bits, but without a sign bit. So the range is effectively [0;255]. In order to represent numbers < 2^0, that range is shifted by 127, becoming [-127;128].
That way, very small numbers can be represented very precisely. With a [0;255] range, small numbers would have to be represented as 2^0 * 0.mantissa with lots of zeroes in the mantissa. But with a [-127;128] range, small numbers are more precise because they can be represented as 2^-126 * 0.mantissa (with less unnecessary zeroes in the mantissa). Hope you get the point.