IEEE 754, division by zero

I know in standard IEEE 754 division by zero is allowed. I want to know how it's represented in binary.
For example, 0.25 in decimal is
0 01111101 00000000000000000000000
in binary. What about 5.0/0.0 or 0.0/0.0: do they have representations in binary, and are they the same?
Thanks.

When you divide a finite nonzero number by zero you'll get an infinity signed with the XOR of the signs of the two operands. So 5.0/0.0 is +inf, but 0.0/0.0 is an invalid operation and returns something called the QNaN indefinite.
Let’s say we are dividing negative one by zero. Because this results in a pre-computation exception, I think the key to understanding what happens is in the “response” verbiage Intel uses in section 4.9.1.2:
The masked response for the divide-by-zero exception is to set the ZE flag and return an infinity signed with the exclusive OR of the sign of the operands.
I hope I’m reading this right. Since the zero-divide mask (ZM) bit in the control word of the x87 FPU is a 1, the exception is masked: the ZE flag becomes set once the FPU detects the zero divisor, and the masked response is delivered. Now the processor knows to do something like this:
    1    sign of operand 1, our -1.0
xor 0    sign of operand 2, the zero
----------
    1    response
Now with that response bit I know whether I have a positive or negative infinity
-inf 1 11111111 00000000000000000000000
     | +------+ +---------------------+
     |    |                |
     v    v                v
  sign exponent         fraction
If I had a positive 1.0 instead and divided by zero:
    0    sign of operand 1
xor 0    sign of operand 2
-----------
    0
Now I have
inf 0 11111111 00000000000000000000000
As long as the numerator is positive and you're dividing by a positive zero you'll get the same positive infinity (a -0.0 divisor would flip the sign).
This is what I imagine happening when I run something like this:
#include <stdio.h>

int main(void) {
    /* Floating-point exceptions are masked by default in C, so the division
       below quietly produces an infinity instead of trapping. (The original
       SetExceptionMask(exAllArithmeticExceptions) call is Delphi's way of
       masking them.) */
    float a = -1;
    float b = a / 0.0f;   /* masked divide-by-zero -> -inf */
    printf("%f\n", b);
    return 0;
}
The result is -inf, which looks like this: 1 11111111 00000000000000000000000
QNaNs ("quiet not a number") are especially helpful for debugging. They are generated in a few different ways, but 0.0/0.0 will return something that looks like this:
qnan 0 11111111 10000000000000000000000
                +---------------------+
                           |
                           v
                        fraction
Software can manipulate the bits in the fraction of a QNaN for any purpose; usually this payload seems to be used for diagnostic purposes.
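To see these patterns for yourself, here is a minimal C sketch along the same lines (dump_bits is a hypothetical helper, not a library function) that bit-casts a float to an integer and prints the sign, exponent, and fraction fields:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: print a float's sign, exponent, and fraction fields. */
static void dump_bits(const char *label, float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);                         /* bit-cast without aliasing issues */
    printf("%-5s %u ", label, (unsigned)(u >> 31));   /* sign: 1 bit */
    for (int i = 30; i >= 23; i--)                    /* exponent: 8 bits */
        putchar('0' + (int)((u >> i) & 1));
    putchar(' ');
    for (int i = 22; i >= 0; i--)                     /* fraction: 23 bits */
        putchar('0' + (int)((u >> i) & 1));
    putchar('\n');
}

int main(void) {
    volatile float zero = 0.0f;        /* volatile keeps the divisions out of the
                                          compiler's constant folder */
    dump_bits("-inf", -1.0f / zero);   /* 1 11111111 00000000000000000000000 */
    dump_bits("+inf",  5.0f / zero);   /* 0 11111111 00000000000000000000000 */
    dump_bits("qnan",  zero / zero);   /* sign varies: x 11111111 1000...0   */
    return 0;
}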
To learn more I recommend watching parts 31(https://youtu.be/SsDoUirLkbY), and 33(https://youtu.be/3ZxXSUPSFaQ) of this Intel Manual reading.

Mark has corrected me that division by zero results in positive or negative infinity in IEEE 754-2008.
In the wikipedia article on the subject, we find the following:
sign = 0 for positive infinity, 1 for negative infinity.
biased exponent = all 1 bits.
fraction = all 0 bits.
Source: IEEE 754 Wikipedia article
I was wrong in thinking it would result in a NaN, upon which I elaborated below.
+Infinity:
0 11111111 00000000000000000000000
-Infinity:
1 11111111 00000000000000000000000
INCORRECT ORIGINAL RESPONSE BELOW
May still be of tangential interest, so leaving it in.
This results in, from my understanding, a NaN, or *not a number*.
The Wikipedia page on NaN has an encoding section, from which the following quote comes.
In IEEE 754 standard-conforming floating-point storage formats, NaNs are identified by specific, pre-defined bit patterns unique to NaNs. The sign bit does not matter. Binary format NaNs are represented with the exponential field filled with ones (like infinity values), and some non-zero number in the significand (to make them distinct from infinity values). The original IEEE 754 standard from 1985 (IEEE 754-1985) only described binary floating-point formats, and did not specify how the signaled/quiet state was to be tagged. In practice, the most significant bit of the significand determined whether a NaN is signalling or quiet. Two different implementations, with reversed meanings, resulted.
Source: NaN Encoding (Wikipedia)
The article also goes on to note that the 2008 revision, IEEE 754-2008, adds a suggested method for indicating whether a NaN is quiet or signaling.
NaN is identified by having the top five bits of the combination field after the sign bit set to ones. The sixth bit of the field is the 'is_quiet' flag. The standard follows the interpretation as an 'is_signaling' flag. I.e. the signaled/quiet bit is zero if the NaN is quiet, and non-zero if the NaN is signaling.
Basically, as before, the exponent is all ones, and the last bit indicates whether it is quiet or not.
My interpretation is that a NaN could be represented in a number of ways, including as follows:
0 11111111 00000000000000000000010

Related

Product of an array of infinity by an array of zeros gives me an array of NaN [duplicate]

IEEE 754 specifies the result of 1 / 0 as ∞ (Infinity).
However, IEEE 754 then specifies the result of 0 × ∞ as NaN.
This feels counter-intuitive: Why is 0 × ∞ not 0?
1. We can think of 1 / 0 = ∞ as the limit of 1 / z as z tends to zero.
2. We can think of 0 × ∞ = 0 as the limit of 0 × z as z tends to ∞.
Why does the IEEE standard follow intuition 1. but not 2.?
It is easier to understand the behavior of IEEE 754 floating point zeros and infinities if you do not think of them as being literally zero or infinite.
The floating point zeros not only represent the real number zero. They also represent all real numbers that would round to something smaller than the smallest subnormal. That is why zero is signed. Even tiny numbers do have a sign if they are not actually zero.
Similarly, each infinity also represents all numbers with the corresponding sign that would round to something with a magnitude that would not fit in the finite range.
NaN represents either "No real number result", for example sqrt(-1), or "Haven't a clue".
Something very big divided by something very small is very, very big, so `Infinity / 0 == Infinity`.
Something very big multiplied by something very small could be anything, depending on the actual magnitudes that we don't know. Since the result could be anything from very small through very big, NaN is the most reasonable answer.
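As a quick sanity check, here is a small C snippet (a sketch using INFINITY and isnan from math.h) illustrating both behaviors:

#include <math.h>    /* INFINITY, isnan */
#include <stdio.h>

int main(void) {
    volatile double zero = 0.0;           /* volatile: avoid compile-time folding */
    double big  = INFINITY / zero;        /* inf / 0  ->  inf */
    double what = zero * INFINITY;        /* 0 * inf  ->  NaN (invalid operation) */
    printf("%f %d\n", big, isnan(what));  /* prints: inf 1 */
    return 0;
}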
=================================================================
Although I think the above is the best way to understand practical floating point behavior, a similar issue arises in real number limits.
Suppose f(x) tends to infinity and g(x) tends to zero as x tends to infinity. It is easy to prove that f(x)/g(x) tends to infinity as x tends to infinity. On the other hand, it is not possible to prove anything about the limit of f(x)*g(x) without more information about the functions.
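For instance, take f(x) = x^2 and g(x) = 1/x: then f(x)*g(x) = x tends to infinity. But take f(x) = x and g(x) = 1/x^2: then f(x)*g(x) = 1/x tends to zero. In both cases f(x)/g(x) = x^3 tends to infinity.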

Why does NaN * 0.f != 0.0 with Xcode [duplicate]

My SSE-FPU generates the following NaNs:
When I do any basic two-operand operation like ADDSD, SUBSD, MULSD or DIVSD and one of the two operands is a NaN, the result has the sign of the NaN operand and the lower 51 bits of the mantissa of the result are loaded with the lower 51 bits of the mantissa of the NaN operand.
When both operands are NaN, the result is loaded with the sign of the destination register and the lower 51 bits of the result mantissa are loaded with the lower 51 bits of the destination register before the operation. So the associative law doesn't hold when doing multiplications on two NaN operands!
When I do a SQRTSD on a NaN-value, the result has the sign of the NaN-operand and the lower 51 bits of the result is loaded with the lower 51 bits of the operand.
When I do a multiplication of infinity with zero or infinity, I always get -NaN as a result (binary representation 0xFFF8000000000000u).
If any operand is a signalling NaN, the result becomes a quiet NaN if the exception is masked.
Is this behaviour determined anywhere in the IEEE-754-standard?
NaNs have a sign and a payload; together they are called the information contained in the NaN.
The whole point of NaNs is that they are "sticky" (maybe monadic is a better term?): once we have a NaN in an expression, the whole expression evaluates to NaN.
NaNs are also treated specially when evaluating predicates (like binary relations); for example, if a is NaN, then it is not equal to itself.
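A tiny C illustration of both properties (the "0x2a" string passed to nan() is just an illustrative payload hint; how it maps to payload bits is implementation-defined):

#include <math.h>    /* nan, isnan */
#include <stdio.h>

int main(void) {
    double n = nan("0x2a");           /* a quiet NaN, possibly carrying a payload */
    printf("%d\n", n == n);           /* 0: a NaN is never equal to itself */
    printf("%d\n", isnan(n + 1.0));   /* 1: the NaN sticks through arithmetic */
    return 0;
}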
Point 1
From the IEEE 754:
Propagation of the diagnostic information requires that information
contained in the NaNs be preserved through arithmetic operations and
floating-point format conversions.
Point 2
From the IEEE 754:
Every operation involving one or two input NaNs, none of them signaling,
shall signal no exception but, if a floating-point result is to be delivered,
shall deliver as its result a quiet NaN, which should be one of the input
NaNs.
No floating point operation has ever been associative.
I think you were looking for the term commutative though since associativity requires at least three operands involved.
Point 3
See point 4
Point 4
From IEEE 754:
The invalid operations are
1. Any operation on a signaling NaN (6.2)
2. Addition or subtraction – magnitude subtraction of infinities such as,
(+INFINITY) + (–INFINITY)
3. Multiplication – 0 × INFINITY
4. Division – 0/0 or INFINITY/INFINITY
5. Remainder – x REM y, where y is zero or x is infinite
6. Square root if the operand is less than zero
7. Conversion of a binary floating-point number to an integer or
decimal format when overflow, infinity, or NaN precludes a faithful
representation in that format and this cannot otherwise be signaled
8. Comparison by way of predicates involving < or >, without ?, when
the operands are unordered (5.7, Table 4)
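Under default (masked) exception handling, each of these invalid operations simply delivers a quiet NaN. A quick C sketch of a few of them:

#include <math.h>    /* INFINITY, sqrt */
#include <stdio.h>

int main(void) {
    volatile double zero = 0.0, inf = INFINITY;   /* volatile: avoid constant folding */
    printf("%f\n", inf + (-inf));   /* case 2: magnitude subtraction of infinities */
    printf("%f\n", zero * inf);     /* case 3: 0 x INFINITY */
    printf("%f\n", zero / zero);    /* case 4: 0/0 */
    printf("%f\n", sqrt(-1.0));     /* case 6: square root of a negative operand */
    return 0;
}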
Point 5
From IEEE 754:
Every operation involving a signaling NaN or invalid operation (7.1) shall, if
no trap occurs and if a floating-point result is to be delivered, deliver a quiet
NaN as its result.
Due to its relevance, the IEEE 754 standard can be found here.

Equivalence between two approaches to IEEE 754 single precision

I've been studying IEEE 754 for a while, and there's a thing that I do not manage to understand. According to my notes, in IEEE single precision you have 1 bit for the sign, 8 for the exponent and 23 for the mantissa, making a total of 32 bits. The exponent could be described as follows: the first bit gives the sign, and then the remaining 7 bits describe some number, which means that the biggest possible value for the exponent is 2^+127, and the lowest 2^-127. But according to Wikipedia (and other websites), the lowest possible value is -126, which you get if you consider the exponent as a number determined by e-127, where e is an integer between 1 and 254. Why can't e take the value 0, which would enable the exponent -127?
Look up 'subnormal' or denormalized numbers; they have a biased exponent value of 0.
A denormal number is represented with a biased exponent of all 0 bits, which represents an exponent of −126 in single precision (not −127).
Also, there are 24 logical bits in the mantissa, but the first is always 1 so it isn't actually stored.
Signed zeros are represented by exponent and mantissa with all bits zero, and the sign bit may be 0 (positive) or 1 (negative).
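A short C check of those edge encodings (nextafterf, from math.h, steps to the adjacent representable value, so stepping up from 0 lands on the smallest subnormal):

#include <float.h>   /* FLT_MIN: smallest normal float, 2^-126 */
#include <math.h>    /* nextafterf */
#include <stdio.h>

int main(void) {
    float smallest_normal    = FLT_MIN;                /* biased exponent 1 */
    float smallest_subnormal = nextafterf(0.0f, 1.0f); /* biased exponent 0: 2^-149 */
    printf("%e\n%e\n", smallest_normal, smallest_subnormal);
    /* prints roughly 1.175494e-38 and 1.401298e-45 */
    return 0;
}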

Value of the field std::numeric_limits<T>::digits for floating point types T in C++, GCC compiler [duplicate]

In this wiki article it shows 23 bits for precision, 8 for exponent, and 1 for sign
Where is the hidden 24th bit in float type that makes (23+1) for 7 significand digits?
Floating point numbers are usually normalized. Consider, for example, scientific notation as most of us learned it in school. You always scale the exponent so there's exactly one digit before the decimal point. For example, instead of 123.456, you write 1.23456x10^2.
Floating point on a computer is normally handled (almost[1]) the same way: numbers are normalized so there's exactly one digit before the binary point (binary point since most work in binary instead of decimal). There's one difference though: in the case of binary, that means the digit before the decimal point must be a 1. Since it's always a 1, there's no real need to store that bit. To save a bit of storage in each floating point number, that 1 bit is implicit instead of being stored.
As usual, there's just a bit more to the situation than that though. The main difference is denormalized numbers. Consider, for example, if you were doing scientific notation but you could only use exponents from -99 to +99. If you wanted to store a number like, say, 1.234*10^-102, you wouldn't be able to do that directly, so it would probably just get rounded down to 0.
Denormalized numbers give you a way to deal with that. Using a denormalized number, you'd store that as 0.001234*10^-99. Assuming (as is normally the case on a computer) that the number of digits for the mantissa and exponent are each limited, this loses some precision, but still avoids throwing away all the precision and just calling it 0.
[1] Technically, there are differences, but they make no difference to the basic understanding involved.
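One consequence of the 24-bit significand (23 stored plus the implicit 1) is easy to demonstrate in C: integers are exact in a float only up to 2^24.

#include <stdio.h>

int main(void) {
    float a = 16777216.0f;   /* 2^24 is exactly representable */
    float b = a + 1.0f;      /* 2^24 + 1 would need a 25th bit, so it rounds back */
    printf("%.1f %.1f %d\n", a, b, a == b);   /* 16777216.0 16777216.0 1 */
    return 0;
}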
http://en.wikipedia.org/wiki/Single_precision_floating-point_format#IEEE_754_single_precision_binary_floating-point_format:_binary32
The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1 unless the exponent is stored with all zeros
That explains it pretty well: it is by convention/design that the leading bit is not stored explicitly; the specification states that it is there unless the exponent field is all zeros.
As you write, the single-precision floating-point format has a sign bit, eight exponent bits, and 23 significand bits. Let s be the sign bit, e be the exponent bits, and f be the significand bits. Here is what various combinations of bits stand for:
If e and f are zero, the object is +0 or -0, according to whether s is 0 or 1.
If e is zero and f is not, the object is (-1)^s * 2^(1-127) * 0.f. "0.f" means to write 0, period, and the 23 bits of f, then interpret that as a binary numeral. E.g., 0.011000... is 3/8. These are the "subnormal" numbers.
If 0 < e < 255, the object is (-1)^s * 2^(e-127) * 1.f. "1.f" is similar to "0.f" above, except you start with 1 instead of 0. This is the implicit bit. Most of the floating-point numbers are in this format; these are the "normal" numbers.
If e is 255 and f is zero, the object is +infinity or -infinity, according to whether s is 0 or 1.
If e is 255 and f is not zero, the object is a NaN (Not a Number). The meaning of the f field of a NaN is implementation dependent; it is not fully specified by the standard. Commonly, if the first bit is zero, it is a signaling NaN; otherwise it is a quiet NaN.
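Those five rules translate almost directly into code. Here is a minimal C sketch (decode_binary32 is a hypothetical name; the sign and payload of NaNs are not reproduced) that decodes a raw binary32 bit pattern according to the list above:

#include <math.h>    /* ldexpf, INFINITY, NAN */
#include <stdio.h>
#include <stdint.h>

static float decode_binary32(uint32_t bits) {
    uint32_t s = bits >> 31;
    uint32_t e = (bits >> 23) & 0xFF;
    uint32_t f = bits & 0x7FFFFF;
    float sign = s ? -1.0f : 1.0f;

    if (e == 0)       /* zero or subnormal: (-1)^s * 2^(1-127) * 0.f */
        return sign * ldexpf((float)f / 0x800000, 1 - 127);
    if (e == 255)     /* infinity or NaN */
        return f == 0 ? sign * INFINITY : NAN;
    /* normal: (-1)^s * 2^(e-127) * 1.f */
    return sign * ldexpf(1.0f + (float)f / 0x800000, (int)e - 127);
}

int main(void) {
    printf("%f\n", decode_binary32(0x3E800000));  /* 0.25, the example from the top */
    printf("%f\n", decode_binary32(0xFF800000));  /* -inf */
    return 0;
}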

Where's the 24th fraction bit on a single precision float? IEEE 754

I found myself today doing some bit manipulation and I decided to refresh my floating-point knowledge a little!
Things were going great until I saw this:
... 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits
I read it again and again but I still can't figure out where the 24th bit is, I noticed something about a binary point so I assumed that it's a point in the middle between the mantissa and the exponent.
I'm not really sure but I believe the author was talking about this bit:
             Binary point?
             |
s------e-----|-------------m----------
0 - 01111100 - 01000000000000000000000
             ^ this
The 24th bit is implicit due to normalization.
The significand is shifted left (and one subtracted from the exponent for each bit shift) until the leading bit of the significand is a 1.
Then, since the leading bit is a 1, only the other 23 bits are actually stored.
There is also the possibility of a denormal number. The exponent is stored as a "bias" format signed number, meaning that it's an unsigned number where the middle of the range is defined to mean 0. So, with 8 bits, it's stored as a number from 0..255, but 0 is interpreted to mean -128, 128 is interpreted to mean 0, and 255 is interpreted as 127 (I may have a fencepost error there, but you get the idea).
If, in the process of normalization, this is decremented to 0 (meaning an actual exponent value of -128), then normalization stops, and the significand is stored as-is. In this case, the implicit bit from normalization is taken to be a 0 instead of a 1.
Most floating point hardware is designed to basically assume numbers will be normalized, so they assume that implicit bit is a 1. During the computation, they check for the possibility of a denormal number, and in that case they do roughly the equivalent of throwing an exception, and re-start the calculation with that taken into account. This is why computation with denormals often gets drastically slower than otherwise.
In case you wonder why it uses this strange format: IEEE floating point (like many others) is designed to ensure that if you treat its bit pattern as an integer of the same size, you can compare them as signed, 2's complement integers and they'll still sort into the correct order as floating point numbers. Since the sign of the number is in the most significant bit (where it is for a 2's complement integer) that's treated as the sign bit. The bits of the exponent are stored as the next most significant bits -- but if we used 2's complement for them, an exponent less than 0 would set the second most significant bit of the number, which would result in what looked like a big number as an integer. By using bias format, a smaller exponent leaves that bit clear, and a larger exponent sets it, so the order as an integer reflects the order as a floating point.
Normally (pardon the pun), the leading bit of a floating point number is always 1; thus, it doesn't need to be stored anywhere. The reason is that, if it weren't 1, that would mean you had chosen the wrong exponent to represent it; you could get more precision by shifting the mantissa bits left and using a smaller exponent.
The one exception is denormal/subnormal numbers, which are represented by all zero bits in the exponent field (the lowest possible exponent). In this case, there is no implicit leading 1 in the mantissa, and you have diminishing precision as the value approaches zero.
For normal floating point numbers, the number stored in the floating point variable is (ignoring sign) 1.mantissa * 2^(exponent-offset). The leading 1 is not stored in the variable.
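As a worked example, the pattern from the question's diagram, 0 01111100 01000000000000000000000, has exponent field 01111100 = 124 and fraction .01 (binary), so its value is (-1)^0 * 1.25 * 2^(124-127) = 1.25 / 8 = 0.15625.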