My SSE-FPU generates the following NaNs:
When I do any basic two-operand operation like ADDSD, SUBSD, MULSD or DIVSD and one of the two operands is a NaN, the result has the sign of the NaN operand and the lower 51 bits of the result's mantissa are loaded with the lower 51 bits of the NaN operand's mantissa.
When both operands are NaN, the result gets the sign of the destination register and the lower 51 bits of the result mantissa are loaded with the lower 51 bits of the destination register from before the operation. So the associative law doesn't hold when multiplying two NaN operands!
When I do a SQRTSD on a NaN value, the result has the sign of the NaN operand and the lower 51 bits of the result are loaded with the lower 51 bits of the operand.
When I multiply infinity by zero, I always get -NaN as a result (binary representation 0xFFF8000000000000).
If any operand is a signalling NaN, the result becomes a quiet NaN, provided the invalid-operation exception is masked.
Is this behaviour specified anywhere in the IEEE-754 standard?
NaNs have a sign and a payload; together these are called the information contained in the NaN.
The whole point of NaNs is that they are "sticky" (maybe "monadic" is a better term?): once we have a NaN in an expression, the whole expression evaluates to NaN.
Also, NaNs are treated specially when evaluating predicates (like binary relations): for example, if a is NaN, then it is not equal to itself.
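For example, here is a minimal sketch of both properties in C++ (std::nan("") just manufactures a quiet NaN to play with):

#include <cmath>
#include <iostream>

int main() {
    double a = std::nan("");                          // a quiet NaN
    std::cout << (a == a) << '\n';                    // 0: a NaN is not equal to itself
    std::cout << (a < 1.0) << '\n';                   // 0: ordered comparisons involving NaN are false
    std::cout << std::isnan(a * 0.0 + 1.0) << '\n';   // 1: the NaN "sticks" through the whole expression
}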
Point 1
From the IEEE 754:
Propagation of the diagnostic information requires that information
contained in the NaNs be preserved through arithmetic operations and
floating-point format conversions.
Point 2
From the IEEE 754:
Every operation involving one or two input NaNs, none of them signaling,
shall signal no exception but, if a floating-point result is to be delivered,
shall deliver as its result a quiet NaN, which should be one of the input
NaNs.
No floating point operation has ever been associative.
I think you were looking for the term commutative though since associativity requires at least three operands involved.
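As a sketch of points 1 and 2 on a typical x86-64 machine: a payload placed in the low mantissa bits survives arithmetic. (The payload value 0xABC here is just a made-up example; which payload an implementation propagates is implementation-defined.)

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    uint64_t bits = 0x7FF8000000000ABCULL;   // a quiet NaN carrying payload 0xABC
    double nan_with_payload;
    std::memcpy(&nan_with_payload, &bits, sizeof bits);

    double result = nan_with_payload * 2.0 + 1.0;   // the NaN propagates through both operations
    uint64_t out;
    std::memcpy(&out, &result, sizeof out);
    std::printf("%016llX\n", (unsigned long long)out);   // typically 7FF8000000000ABC
}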
Point 3
See point 4
Point 4
From IEEE 754:
The invalid operations are
1. Any operation on a signaling NaN (6.2)
2. Addition or subtraction – magnitude subtraction of infinities such as,
(+INFINITY) + (–INFINITY)
3. Multiplication – 0 × INFINITY
4. Division – 0/0 or INFINITY/INFINITY
5. Remainder – x REM y, where y is zero or x is infinite
6. Square root if the operand is less than zero
7. Conversion of a binary floating-point number to an integer or
decimal format when overflow, infinity, or NaN precludes a faithful
representation in that format and this cannot otherwise be signaled
8. Comparison by way of predicates involving < or >, without ?, when
the operands are unordered (5.7, Table 4)
Point 5
From IEEE 754:
Every operation involving a signaling NaN or invalid operation (7.1) shall, if
no trap occurs and if a floating-point result is to be delivered, deliver a quiet
NaN as its result.
Due to its relevance, the IEEE 754 standard can be found here.
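A small sketch of that quieting rule, assuming an x86-64 target with SSE math and optimizations disabled (so the compiler performs the addition at run time rather than folding it):

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <limits>

int main() {
    double snan = std::numeric_limits<double>::signaling_NaN();
    double quieted = snan + 0.0;   // invalid operation; masked, so a quiet NaN is delivered
    uint64_t in_bits, out_bits;
    std::memcpy(&in_bits, &snan, sizeof snan);
    std::memcpy(&out_bits, &quieted, sizeof quieted);
    std::printf("%016llX -> %016llX\n",   // on x86 the result differs only in the quiet bit (bit 51)
                (unsigned long long)in_bits, (unsigned long long)out_bits);
}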
Related
Consider the following C++ code:
#include <fenv.h>
#include <iostream>
using namespace std;
int main() {
    fesetround(FE_TONEAREST);
    double a = 0x1.efc7f0001p+376;
    double b = -0x1.0fdfdp+961;
    double c = a * b;
    cout << a << " " << b << " " << c << endl;
}
The output that I see is
2.98077e+113 -2.06992e+289 -inf
I do not understand why c is infinity. My understanding is that whatever the most negative finite floating-point value is, it should be closer to the actual value of a*b than -inf: that minimum finite value is, after all, finite, and any finite number is closer to another finite number than negative infinity is. Why is infinity output here?
This was run on 64-bit x86 and the assembly uses SSE instructions. It was compiled with -O0 and happens with both clang and gcc.
The result is the minimum finite floating-point value if the round-toward-zero mode is used. I conclude that the issue is rounding related.
Rounding is not the primary issue here. The infinite result is caused by overflow.
This answer follows the rules in IEEE 754-2019. The real-number-arithmetic product of 1.EFC7F0001₁₆ • 2^376 and −1.0FDFD₁₆ • 2^961 is around −1.07430C8649FEFE8₁₆ • 2^1338. In normal conditions, floating-point arithmetic produces a result “as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result…” (IEEE 754-2019 4.3). However, we do not have normal conditions. IEEE 754-2019 7.4 says:
The overflow exception shall be signaled if and only if the destination format’s largest finite number is exceeded in magnitude by what would have been the rounded floating-point result (see 4) were the exponent range unbounded…
In other words, if we rounded the result as if we could have any exponent (so we are just rounding the significand), the result would be −1.07430C8649FF₁₆ • 2^1338. But the magnitude of that exceeds the largest finite number that double can represent, ±1.FFFFFFFFFFFFF₁₆ • 2^1023. Therefore, an overflow exception is signaled, and, since you do not catch the exception, a default result is delivered:
… The default result shall be determined by the rounding-direction attribute and the sign of the intermediate result as follows:
a) roundTiesToEven and roundTiesToAway carry all overflows to ∞ with the sign of the intermediate result…
The behaviour you note (when using the FE_TONEAREST rounding mode) conforms to the IEEE-754 (or the equivalent ISO/IEC/IEEE 60559:2011) Standard. (Your examples imply that your platform uses IEEE-754 representation, as many – if not most – platforms do, these days.)
From this Wikipedia page, footnote #18, which cites IEEE-754 §4.3.1, dealing with the "rounding to nearest" modes:
In the following two rounding-direction attributes, an infinitely precise result with magnitude at least b^emax × (b − ½ × b^(1−p)) shall round to ∞ (infinity) with no change in sign.
The 'infinitely precise' result of your a * b calculation does, indeed, have a magnitude greater than the specified value, so the rounding is Standard-conformant.
I can't find a similar IEEE-754 citation for the "round-towards-zero" mode, but this GNU libc manual has this to say:
Round toward zero. All results are rounded to the largest
representable value whose magnitude is less than that of the result.
In other words, if the result is negative it is rounded up; if it is
positive, it is rounded down.
So, again, when using that mode, rounding to -DBL_MAX is appropriate/conformant.
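Here is a sketch demonstrating both behaviours, assuming x86-64 and an -O0 build as in the question (volatile keeps the multiplications at run time, where they observe the selected rounding mode):

#include <cfenv>
#include <cfloat>
#include <cstdio>

int main() {
    volatile double a = 0x1.efc7f0001p+376;
    volatile double b = -0x1.0fdfdp+961;

    std::fesetround(FE_TONEAREST);
    std::printf("%g\n", a * b);   // -inf: the overflow is carried to infinity

    std::fesetround(FE_TOWARDZERO);
    volatile double c = a * b;
    std::printf("%g %d\n", c, c == -DBL_MAX);   // -1.79769e+308 1: overflow lands on -DBL_MAX
}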
Unless an IEEE 754 value is NaN, ±0.0 or ±Infinity, is dividing it by itself guaranteed to result in exactly 1.0?
Similarly, is subtracting it from itself guaranteed to always result in ±0.0?
IEEE 754-2008 4.3 says:
… Except where stated otherwise, every operation shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result according to one of the attributes in this clause.
When an intermediate result is representable, all of the rounding attributes round it to itself; rounding changes a value only when it is not representable. Rules for subtraction and division are given in 5.4, and they do not state exceptions for the above.
Zero and one are representable per 3.3, which specifies the sets of numbers representable in any format conforming to the standard. In particular, zero may be represented by a significand with all zero digits, and one may be represented with a significand starting with 1 and followed by “.000…000” and an exponent of zero. The minimum and maximum exponents of a format are defined so that they always include zero (emin is 1−emax, so zero is between them, inclusive, unless emax is less than 1, in which case no numbers are representable, i.e. the format is empty and not an actual format).
Since rounding does not change representable values and zero and one are representable, dividing a finite non-zero value by itself always produces one, and subtracting a finite value from itself always produces zero.
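A minimal illustration (0.1 is not exactly representable in binary, but that doesn't matter; any finite, non-zero double behaves the same way):

#include <iostream>

int main() {
    double x = 0.1;                        // finite, non-zero, not exactly representable
    std::cout << (x / x == 1.0) << '\n';   // 1
    std::cout << (x - x == 0.0) << '\n';   // 1
}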
Dividing a value by itself and subtracting it from itself - yes, if the same value is used, IEEE 754 requires the closest, consistent result to be produced.
Due to the "approximate" nature of floating point, it's possible that two different sets of values return the same value.
Example:
#include <iostream>
int main() {
    std::cout.precision(100);
    double a = 0.5;
    double b = 0.5;
    double c = 0.49999999999999994;
    std::cout << a + b << std::endl; // outputs an "exact" 1.0
    std::cout << a + c << std::endl; // also outputs an "exact" 1.0
}
But is it also possible with subtraction? I mean: are there two sets of different values (keeping one value common between them) that return 0.0?
I.e. a - b == 0.0 and a - c == 0.0 for some sets a,b and a,c with b != c?
The IEEE-754 standard was deliberately designed so that subtracting two values produces zero if and only if the two values are equal, except that subtracting an infinity from itself produces NaN and/or an exception.
Unfortunately, C++ does not require conformance to IEEE-754, and many C++ implementations use some features of IEEE-754 but do not fully conform.
A not uncommon behavior is to “flush” subnormal results to zero. This is part of a hardware design to avoid the burden of handling subnormal results correctly. If this behavior is in effect, the subtraction of two very small but different numbers can yield zero. (The numbers would have to be near the bottom of the normal range, having some significand bits in the subnormal range.)
Sometimes systems with this behavior may offer a way of disabling it.
Another behavior to beware of is that C++ does not require floating-point operations to be carried out precisely as written. It allows “excess precision” to be used in intermediate operations and “contractions” of some expressions. For example, a*b - c*d may be computed by using one operation that multiplies a and b and then another that multiplies c and d and subtracts the result from the previously computed a*b. This latter operation acts as if c*d were computed with infinite precision rather than rounded to the nominal floating-point format. In this case, a*b - c*d may produce a non-zero result even though a*b == c*d evaluates to true.
Some C++ implementations offer ways to disable or limit such behavior.
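As a sketch of the contraction pitfall: std::fma computes its product exactly before adding, which imitates what a contracting compiler may emit for a*b - c*d. (The values here are arbitrary; any product with a rounding error shows the effect.)

#include <cmath>
#include <cstdio>

int main() {
    double a = 0.1, b = 0.1, c = 0.1, d = 0.1;
    double p = a * b;                         // rounded product
    double q = c * d;                         // the same rounded product, so p == q
    double contracted = std::fma(a, b, -q);   // a*b held exactly, then q subtracted
    std::printf("%d %.17g\n", p == q, contracted);   // prints 1 and a tiny non-zero residue
}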
The gradual underflow feature of the IEEE floating point standard prevents this. Gradual underflow is achieved by subnormal (denormal) numbers, which are spaced evenly (as opposed to logarithmically, like normal floating point) and located between the smallest negative and positive normal numbers, with the zeroes in the middle. As they are evenly spaced, the addition of two subnormal numbers of differing signedness (i.e. subtraction towards zero) is exact and therefore won't reproduce what you ask. The smallest subnormal is (much) less than the smallest distance between normal numbers, and therefore any subtraction between unequal normal numbers is going to be closer to a subnormal than to zero.
If you disable IEEE conformance using a special denormals-are-zero (DAZ) or flush-to-zero (FTZ) mode of the CPU, then indeed you could subtract two small, close numbers which would otherwise result in a subnormal number, which would be treated as zero due to the mode of the CPU. A working example (Linux):
#include <cmath>
#include <iostream>
#include <limits>
#include <xmmintrin.h> // for _MM_SET_FLUSH_ZERO_MODE

int main() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); // system specific
    double d = std::numeric_limits<double>::min(); // smallest normal
    double n = std::nextafter(d, 10.0); // second smallest normal
    double z = d - n; // a negative subnormal (flushed to zero)
    std::cout << (z == 0) << '\n' << (d == n);
}
This should print
1
0
First 1 indicates that result of subtraction is exactly zero, while the second 0 indicates that the operands are not equal.
Unfortunately the answer is dependent on your implementation and the way it is configured. C and C++ don't demand any specific floating point representation or behavior. Most implementations use the IEEE 754 representations, but they don't always precisely implement IEEE 754 arithmetic behaviour.
To understand the answer to this question we must first understand how floating point numbers work.
A naive floating point representation would have an exponent, a sign and a mantissa. Its value would be

(-1)^s × 2^(e − e0) × (m / 2^M)
Where:
s is the sign bit, with a value of 0 or 1.
e is the exponent field
e0 is the exponent bias. It essentially sets the overall range of the floating point number.
M is the number of mantissa bits.
m is the mantissa with a value between 0 and 2^M − 1
This is similar in concept to the scientific notation you were taught in school.
However this format has many different representations of the same number, nearly a whole bit's worth of encoding space is wasted. To fix this we can add an "implicit 1" to the mantissa.
(-1)^s × 2^(e − e0) × (1 + m / 2^M)
This format has exactly one representation of each number. However, there is a problem: it can't represent zero or numbers close to zero.
To fix this IEEE floating point reserves a couple of exponent values for special cases. An exponent value of zero is reserved for representing small numbers known as subnormals. The highest possible exponent value is reserved for NaNs and infinities (which I will ignore in this post since they aren't relevant here). So the definition now becomes.
(-1)^s × 2^(1 − e0) × (m / 2^M) when e = 0

(-1)^s × 2^(e − e0) × (1 + m / 2^M) when e > 0 and e < 2^E − 1
With this representation smaller numbers always have a step size that is less than or equal to that for larger ones. So provided the result of the subtraction is smaller in magnitude than both operands it can be represented exactly. In particular results close to but not exactly zero can be represented exactly.
This does not apply if the result is larger in magnitude than one or both of the operands, for example subtracting a small value from a large value or subtracting two values of opposite signs. In those cases the result may be imprecise but it clearly can't be zero.
Unfortunately FPU designers cut corners. Rather than including the logic to handle subnormal numbers quickly and correctly they either did not support (non-zero) subnormals at all or provided slow support for subnormals and then gave the user the option to turn it on and off. If support for proper subnormal calculations is not present or is disabled and the number is too small to represent in normalized form then it will be "flushed to zero".
So in the real world under some systems and configurations subtracting two different very-small floating point numbers can result in a zero answer.
Excluding funny numbers like NaN, I don't think it's possible.
Let's say a and b are normal finite IEEE 754 floats, and |a - b| is less than or equal to both |a| and |b| (otherwise it's clearly not zero).
That means the exponent is <= both a's and b's, and so the absolute precision is at least as high, which makes the subtraction exactly representable. That means that if a - b == 0, then it is exactly zero, so a == b.
I know in standard IEEE 754 division by zero is allowed. I want to know how it's represented in binary.
For example, 0.25 in decimal is
0 01111101 00000000000000000000000
in binary. What about 5.0/0.0 or 0.0/0.0 - do they have representations in binary, and are they the same?
Thanks.
When you divide a finite number by zero you'll get an infinity with the sign of the number you tried to divide. So 5.0/0.0 is +inf but 0.0/0.0 returns something called a QNaN indefinite.
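A quick way to see the actual bit patterns is to copy the results into an integer (a sketch for x86; volatile stops the compiler from folding the divisions, and the sign of the NaN shown is what x86 hardware happens to produce, since the standard only requires some quiet NaN):

#include <cstdint>
#include <cstdio>
#include <cstring>

static void dump(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof f);   // read the raw bits of the float
    std::printf("%08X\n", (unsigned)u);
}

int main() {
    volatile float zero = 0.0f;
    dump(5.0f / zero);   // 7F800000: +infinity
    dump(0.0f / zero);   // FFC00000: the QNaN indefinite on x86
}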
Let’s say we are dividing negative one by zero. Because this results in a pre-computation exception, I think the key to understanding what happens is in the “response” verbiage Intel uses in section 4.9.1.2:
The masked response for the divide-by-zero exception is to set the ZE flag and return an infinity signed with the exclusive OR of the sign of the operands.
I hope I’m reading this right. Since the zero-divide mask bit (found in the control word of the x87 FPU) is a 1, the ZE flag becomes set once the FPU detects the zero in the operand used for division. Now the processor knows to do something like this:
1 sign of operand 1, our -1.0
xor 0 sign of operand 2, the zero
----------
1 response
Now, with that response bit, I know whether I have a positive or negative infinity:
-inf 1 11111111 00000000000000000000000
-----+-+------+-+---------------------+
| | | | |
| +------+ +---------------------+
| | |
| v v
| exponent fraction
|
v
sign
If I had a positive 1.0 instead and divided by zero:
0 sign of operand 1
xor 0 sign of operand 2
-----------
0
Now I have
inf 0 11111111 00000000000000000000000
As long as the numerator is positive and you're dividing by zero you'll get the same positive infinity.
This is what I imagine happening when I run something like this:
#include <cstdio>

int main() {
    // SetExceptionMask(exAllArithmeticExceptions); // non-standard (Delphi-style); on most
    // platforms floating-point exceptions are masked by default, so this line can be omitted
    float a = -1;
    float b = a / 0.0f;
    printf("%f\n", b);
}
The result is -inf which looks like this 1 11111111 00000000000000000000000
QNaNs ("quiet not a number") are especially helpful for debugging and are generated through a few different ways but 0.0/0.0 will return something that looks like this:
qnan 0 11111111 10000000000000000000000
-----+-+------+-+---------------------+
| |
+---------------------+
|
v
fraction
Now software can manipulate the bits in the fraction of a QNaN for any purpose, usually this seems done for diagnostic purposes.
To learn more I recommend watching parts 31 (https://youtu.be/SsDoUirLkbY) and 33 (https://youtu.be/3ZxXSUPSFaQ) of this Intel Manual reading.
Mark has corrected me that division by zero results in positive or negative infinity in IEEE 754-2008.
In the wikipedia article on the subject, we find the following:
sign = 0 for positive infinity, 1 for negative infinity.
biased exponent = all 1 bits.
fraction = all 0 bits.
Source: IEEE 754 Wikipedia article
I was wrong in thinking it would result in a NaN, upon which I elaborated below.
+Infinity:
0 11111111 00000000000000000000000
-Infinity:
1 11111111 00000000000000000000000
INCORRECT ORIGINAL RESPONSE BELOW
May still be of tangential interest, so leaving it in.
This results in, from my understanding, a NaN, or *not a number*.
The Wikipedia page on NaN has an encoding section from which the following quote comes.
In IEEE 754 standard-conforming floating-point storage formats, NaNs are identified by specific, pre-defined bit patterns unique to NaNs. The sign bit does not matter. Binary format NaNs are represented with the exponential field filled with ones (like infinity values), and some non-zero number in the significand (to make them distinct from infinity values). The original IEEE 754 standard from 1985 (IEEE 754-1985) only described binary floating-point formats, and did not specify how the signaled/quiet state was to be tagged. In practice, the most significant bit of the significand determined whether a NaN is signalling or quiet. Two different implementations, with reversed meanings, resulted.
Source: NaN Encoding (Wikipedia)
The article also goes on to note that the 2008 revision, IEEE 754-2008, adds a suggested method for indicating whether a NaN is quiet or signaling.
NaN is identified by having the top five bits of the combination field after the sign bit set to ones. The sixth bit of the field is the 'is_quiet' flag. The standard follows the interpretation as an 'is_signaling' flag. I.e. the signaled/quiet bit is zero if the NaN is quiet, and non-zero if the NaN is signaling.
Basically, as before, the exponent is all ones, and the last bit indicates whether it is quiet or not.
My interpretation is that a NaN could be represented in a number of ways, including as follows:
0 11111111 00000000000000000000010
I am reading about positive and negative infinity in C++.
I read that integral types don't have an infinite value, i.e. std::numeric_limits<int>::infinity(); won't work, but std::numeric_limits<int>::max(); will work and will represent the maximum possible value that can be represented by the integral type.
So could the std::numeric_limits<int>::max(); of the integral type be taken as its positive infinite limit?
Or does the integral type only have a max value, with no true infinity?
Integers are always finite.
The closest you can get to what you're looking for is setting an integer to its maximum value, which for a 32-bit signed integer is only around 2 billion.
std::numeric_limits has a has_infinity member which you can use to check whether a type has an infinity representation; usually only floating point types such as float and double do.
Floating point numbers have a special bit pattern to indicate "the value is infinity", which is used when the result of some operation is defined as infinite.
Integer values have a set number of bits, and all bits are used to represent a numeric value. There is no "special bit pattern", it's just whatever the sum of the bit positions mean.
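A minimal check of both properties:

#include <iostream>
#include <limits>

int main() {
    std::cout << std::numeric_limits<int>::has_infinity << '\n';      // 0: no integer infinity
    std::cout << std::numeric_limits<double>::has_infinity << '\n';   // 1
    std::cout << std::numeric_limits<double>::infinity() << '\n';     // inf
    std::cout << std::numeric_limits<int>::max() << '\n';             // 2147483647 for a 32-bit int
}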
Edit: On page 315 of my hardcopy of AMD64 Architecture Programmer's Manual, it says
Infinity. Infinity is a positive or negative number, +∞ and −∞, in which the integer bit is 1, the biased exponent is maximum and the fraction is 0. The infinities are the maximum numbers that can be represented in floating-point format: negative infinity is less than any finite number and positive infinity is greater than any finite number (i.e. the affine sense).

An infinite result is produced when a non-zero, non-infinite number is divided by 0 or multiplied by infinity, or when infinity is added to infinity or to 0. Arithmetic on infinities is exact. For example, adding any floating-point number to +∞ gives a result of +∞.

Arithmetic comparisons work correctly on infinities. Exceptions occur only when the use of an infinity as a source operand constitutes an invalid operation.
(Any typing mistakes are mine)
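A short sketch of those rules in C++ (these results hold on any IEEE-754 platform):

#include <cfloat>
#include <iostream>
#include <limits>

int main() {
    double inf = std::numeric_limits<double>::infinity();
    volatile double zero = 0.0;               // volatile so the division happens at run time
    std::cout << (1.0e300 + inf) << '\n';     // inf: finite + infinity is exactly infinity
    std::cout << (inf > DBL_MAX) << '\n';     // 1: +inf is greater than any finite number
    std::cout << (5.0 / zero) << '\n';        // inf: non-zero finite divided by zero
}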