Floating point: Is it true that if a > b, then a - b > 0? - c++

In C++, for two doubles a and b, is it true that if a > b, then a - b > 0? Assume a and b are not NaN, just normal numbers.

No, not by the requirements of the C++ standard alone. C++ allows implementations to use floating-point types without support for subnormal numbers. In such a type, you could have a equal to 1.0001₂·2^m, where m is the minimum normal exponent, and b equal to 1.0000₂·2^m. Then a > b, but the exact value of a - b is not representable (it would be subnormal), so the computed result is zero.
This does not happen with IEEE-754 arithmetic, which supports subnormals. However, some C++ implementations, possibly not conforming to the C++ standard, use IEEE-754 formats but flush subnormal results to zero. So this will also produce a > b but a-b == 0.
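To make the boundary case concrete, here is a minimal sketch, assuming IEEE-754 doubles with subnormal support (under flush-to-zero, the subtraction would instead yield zero):
#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    double b = DBL_MIN;                 // smallest normal double
    double a = std::nextafter(b, 1.0);  // next representable value above it
    // The exact difference is the smallest subnormal double, so with
    // subnormal support a - b > 0 holds; under flush-to-zero it is 0.
    std::printf("a > b:     %d\n", a > b);
    std::printf("a - b > 0: %d\n", a - b > 0.0);
    std::printf("subnormal: %d\n", std::fpclassify(a - b) == FP_SUBNORMAL);
}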
Supposing an implementation does support subnormals and conforms to the C++ standard, another issue is that the standard allows implementations to use extra precision within floating-point expressions. If your a and b are actually expressions rather than single objects, it is possible that the extra precision could cause a > b to evaluate as true while a - b == 0 is also true. Implementations are permitted to use or not use such extra precision at will, so a > b could be true in some evaluations and not others, and similarly for a - b == 0. If a and b are single objects, this will not happen, because the standard requires implementations to “discard” the excess precision in assignments and casts.

Yes, this is true in IEEE-754 arithmetic, in all rounding modes, so long as your C++ compiler uses IEEE-754 semantics (whether via software or hardware). But see Eric Postpischil's answer for cases where particular implementations might diverge from IEEE-754; it is best to check your compiler's documentation.
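As a quick sketch of checking this across rounding modes (assuming IEEE-754 doubles and a <cfenv> implementation that honors fesetround; strictly, #pragma STDC FENV_ACCESS ON would be required, which not all compilers support):
#include <cfenv>
#include <cstdio>

int main() {
    const int modes[] = {FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO};
    volatile double a = 1.0000000000000002;  // 1.0 plus one ulp
    volatile double b = 1.0;
    for (int m : modes) {
        std::fesetround(m);
        // Both comparisons hold in every rounding mode.
        std::printf("mode %d: a > b = %d, a - b > 0 = %d\n", m, a > b, a - b > 0.0);
    }
}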

Related

Integer to float conversions with IEEE FP

What are the guarantees regarding conversions from integral to floating-point types in a C++ implementation supporting IEEE-754 FP arithmetic?
Specifically, is it always well-defined behaviour to convert any integral value to any floating-point type, possibly resulting in a value of ±Inf? Or are there situations in which this would result in undefined behaviour?
(Note, I am not asking about exact conversion, just if performing the conversion is always legal from the point of view of the language standard)
IEC 60559 (the current successor standard to IEEE 754) makes integer-to-float conversion well-defined in all cases, as discussed in Franck's answer, but it is the language standard that has the final word on the subject.
In the base standard, C++11 section 4.9 "Floating-integral conversions", paragraph 2, makes out-of-range integer-to-floating-point conversions undefined behavior. (Quotation is from document N3337, which is the closest approximation to the official 2011 C++ standard that is publicly available at no charge.)
A prvalue of an integer type or of an unscoped enumeration type can be converted to a prvalue of a floating point type. The result is exact if possible. If the value being converted is in the range of values that can be represented but the value cannot be represented exactly, it is an implementation-defined choice of either the next lower or higher representable value. [ Note: Loss of precision occurs if the integral value cannot be represented exactly as a value of the floating type. — end note ] If the value being converted is outside the range of values that can be represented, the behavior is undefined. If the source type is bool, the value false is converted to zero and the value true is converted to one.
Emphasis mine. The C standard says the same thing in different words (section 6.3.1.4 paragraph 2).
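For illustration, the "next lower or higher representable value" clause comes into play for 64-bit integers above 2^53; a small sketch, assuming IEEE-754 double:
#include <cstdint>
#include <cstdio>

int main() {
    // 2^53 + 1 has no exact double representation; the implementation
    // chooses the next lower or higher representable value.
    std::int64_t n = (std::int64_t{1} << 53) + 1;
    double d = static_cast<double>(n);
    std::printf("%lld -> %.1f\n", static_cast<long long>(n), d);
    // Typically prints: 9007199254740993 -> 9007199254740992.0
}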
The C++ standard does not discuss what it would mean for an implementation of C++ to supply IEC 60559-conformant floating-point arithmetic. However, the C standard (closest approximation to C11 available online at no charge is N1570) does discuss this in its Annex F, and C++ implementors do tend to turn to C for guidance when C++ leaves something unspecified. There is no explicit discussion of integer to floating point conversion in Annex F, but there is this sentence in F.1p1:
Since negative and positive infinity are representable in IEC 60559 formats, all real numbers lie within the range of representable values.
Putting that sentence together with 6.3.1.4p2 suggests to me that the C committee meant for integer-to-floating conversion to produce ±Inf when the integer's magnitude is outside the range of representable finite numbers. And that interpretation is consistent with the IEC 60559-specified behavior of conversions, so we can be reasonably confident that that's what an implementation of C that claimed to conform to Annex F would do.
However, applying any interpretation of the C standard to C++ is risky at best; C++ has not been defined as a superset of C for a very long time.
If your implementation of C++ predefines the macro __STDC_IEC_559__ and/or documents compliance with IEC 60559 in some way and you don't use the "be sloppy about floating-point math in the name of speed" compilation mode (which may be on by default), you can probably rely on out-of-range conversions to produce ±Inf. Otherwise, it's UB.
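There is no standard integer type wide enough to overflow float's range, but GCC and Clang provide the nonstandard unsigned __int128 extension, which is just wide enough; a hypothetical sketch:
#include <cstdio>

int main() {
    // 2^128 - 1 rounds up to 2^128, which exceeds FLT_MAX (about 3.4e38),
    // so this conversion overflows float's range.
    unsigned __int128 big = ~static_cast<unsigned __int128>(0);
    float f = static_cast<float>(big);  // UB per the C++ standard; +Inf
                                        // under Annex F / IEC 60559 semantics
    std::printf("%f\n", f);             // prints "inf" on typical IEEE-754 platforms
}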
Section 7.4 of the IEEE-754 (2008) standard makes this well-defined behavior. But that is a property of IEEE-754; C and C++ implementations are free to respect it or not (see zwol's answer).
The overflow exception shall be signaled if and only if the destination format’s largest finite number is exceeded in magnitude by what would have been the rounded floating-point result were the exponent range unbounded. The default result shall be determined by the rounding-direction attribute and the sign of the intermediate result as follows:
a) roundTiesToEven and roundTiesToAway carry all overflows to ∞ with the sign of the intermediate result.
b) roundTowardZero carries all overflows to the format’s largest finite number with the sign of the intermediate result.
c) roundTowardNegative carries positive overflows to the format’s largest finite number, and carries negative overflows to −∞.
d) roundTowardPositive carries negative overflows to the format’s most negative finite number, and carries positive overflows to +∞.
All four of these cases give a deterministic result for the conversion from an integral to a floating-point type.
All other cases (no overflow) are also well-defined with deterministic results given by the IEEE-754 standard.
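Points a) and b) can be seen with a double-to-float conversion that overflows (a sketch, assuming IEEE-754 and runtime conversions that honor the dynamic rounding mode; volatile keeps the conversions out of the constant folder):
#include <cfenv>
#include <cfloat>
#include <cstdio>

int main() {
    volatile double big = 1e39;  // exceeds FLT_MAX (about 3.4e38)
    std::fesetround(FE_TONEAREST);
    volatile float f1 = static_cast<float>(big);  // case a): overflows to +Inf
    std::fesetround(FE_TOWARDZERO);
    volatile float f2 = static_cast<float>(big);  // case b): clamps to FLT_MAX
    std::printf("%g %g (FLT_MAX = %g)\n", double(f1), double(f2), double(FLT_MAX));
}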

Is imprecision of a double value guaranteed to be consistent on the same machine?

We know that doubles lose precision of their value as you increase the number of decimal places. But if I use the same double value twice on the same machine, am I guaranteed to get the same imprecision? For example:
double d1 = 123.456;//actually becomes 123.45600001
double d2 = 123.456;//is guaranteed to become 123.45600001?
For simplicity's sake, let's stick with just C++.
No, you do not have this guarantee. C++ implementations are not bound by the IEEE standard, and can choose any binary representation they want.
While they are unlikely to just invent their own, they can fluctuate (even within the same vendor) in how they represent 'nonrepresentable' numbers: such a number can be rounded to either a slightly bigger or a slightly smaller representable value, and this choice can change (I believe even different floating-point math options for the same gcc can affect this).
In your case, d1 == d2 will return true, and it will almost always return true on any properly operating compiler/architecture. Even if the compiler/architecture aren't IEEE conforming, it's extremely unlikely that providing an identical symbol for 123.456 would return different values on multiple invocations.
But, if you had the following code:
double d1 = 123.456;
double d2 = get_123_456_from_network_service(service);
That guarantee would cease to be. 123.456 will be exactly the same on your computer all the time, but if a different computer attempts to use the same symbol, and isn't IEEE conforming (or if yours isn't), then there's a good chance the values would be different.
If your compiler conforms to the IEEE 754 (IEC 559) standard, they should be the same. You can check if it is IEEE compliant with std::numeric_limits<T>::is_iec559. Most implementations are IEEE compliant, but this is not a guarantee.
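A minimal compile-time version of that check:
#include <limits>

// Fails to compile on implementations whose double is not IEC 559 (IEEE 754).
static_assert(std::numeric_limits<double>::is_iec559,
              "double is not IEC 559 / IEEE 754 compliant");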

Is floating point addition commutative in C++?

For floating point values, is it guaranteed that a + b is the same(1) as b + a?
I believe this is guaranteed in IEEE754, however the C++ standard does not specify that IEEE754 must be used. The only relevant text seems to be from [expr.add]#3:
The result of the binary + operator is the sum of the operands.
The mathematical operation "sum" is commutative. However, the mathematical operation "sum" is also associative, whereas floating point addition is definitely not associative. So, it seems to me that we cannot conclude that the commutativity of "sum" in mathematics means that this quote specifies commutativity in C++.
Footnote 1:
"Same" as in bitwise identical, like memcmp rather than ==, to distinguish +0 from -0. IEEE754 treats +0.0 == -0.0 as true, but also has specific rules for signed zero. +0 + -0 and -0 + +0 both produce +0 in IEEE754, same for addition of opposite-sign values with equal magnitude. An == that followed IEEE semantics would hide non-commutativity of signed-zero if that was the criterion.
Also, a+b == b+a is false with IEEE754 math if either input is NaN.
memcmp will say whether two NaNs have the same bit-pattern (including payload), although we can consider NaN propagation rules separately from commutativity of valid math operations.
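A sketch of the bitwise criterion described above, assuming IEEE-754 doubles (bitwise_equal is an illustrative helper, not a standard function):
#include <cstdio>
#include <cstring>

// Compares bit patterns, so +0.0 and -0.0 differ (unlike operator==).
bool bitwise_equal(double x, double y) {
    return std::memcmp(&x, &y, sizeof x) == 0;
}

int main() {
    double a = +0.0, b = -0.0;
    std::printf("a == b:           %d\n", a == b);                      // 1
    std::printf("bitwise a, b:     %d\n", bitwise_equal(a, b));         // 0
    // In IEEE-754, +0 + -0 and -0 + +0 both produce +0 under
    // round-to-nearest, so even the bit patterns match here.
    std::printf("bitwise a+b, b+a: %d\n", bitwise_equal(a + b, b + a)); // 1
}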
It is not even required that a + b == a + b. One of the subexpressions may hold the result of the addition with more precision than the other one, for example when the use of multiple additions requires one of the subexpressions to be temporarily stored in memory, while the other subexpression can be kept in a register (with higher precision).
If a + b == a + b is not guaranteed, a + b == b + a cannot be guaranteed. If a + b does not have to return the same value each time, and the values are different, one of them necessarily will not be equal to one particular evaluation of b + a.
No, the C++ language generally wouldn't make such a requirement of the hardware. Only the associativity of operators is defined.
All kinds of crazy things do happen in floating-point arithmetic. Perhaps, on some machine, adding zero to a denormal number produces zero. It is conceivable that a machine could avoid updating memory in the case of adding a zero-valued register to a denormal in memory, and it is possible that a really dumb compiler would always put the LHS in memory and the RHS in a register.
Note, though, that a machine with non-commutative addition would need to specifically define how expressions map to instructions, if you're going to have any control over which operation you get. Does the left-hand side go into the first machine operand or the second?
Such an ABI specification, mentioning the construction of expressions and instructions in the same breath, would be quite pathological.
The C++ standard very specifically does not guarantee IEEE 754. The library does have some support for IEC 559 (which is basically just the IEC's version of the IEEE 754 standard), so you can check whether the underlying implementation uses IEEE 754/IEC 559 though (and when it does, you can depend on what it guarantees, of course).
For the most part, the C and C++ standards assume that such basic operations will be implemented however the underlying hardware works. For something as common as IEEE 754, they'll let you detect whether it's present, but still don't require it.

assertions on negations of relational operators for floating point numbers

Suppose I have two variables a and b, either both of type float or both of type double, which hold some values. Do the following assertions always hold? I mean, does the existence of numerical error alter the conclusions?
a > b is true if and only if a <= b is false
a < b is true if and only if a >= b is false
a >= b is necessarily true if a == b is true
a <= b is necessarily true if a == b is true
For the third and fourth, I mean for example, does "a == b is true" always give you "a >= b is true"?
EDIT:
Assume neither a nor b is NaN or Inf.
EDIT 2:
After reading the 1985 IEEE 754 standard, I found the following.
First of all, it says the following:
Comparisons are exact and never overflow nor underflow.
I understand this as: when doing a comparison, there is no consideration of numerical error; numbers are compared as-is. Since addition and subtraction such as a - b require additional effort to define what the numerical error is, I take the quote above to mean that a comparison like a > b is not done by judging whether a - b > 0 is true or not. Please let me know if I'm wrong.
Second, it listed the four canonical relations which are mutually exclusive.
Four mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself.
Then in Table 4, it defined the various kinds of operators such as ">" or ">=" in terms of the truth values under these four canonical relations. From the table we immediately have the following:
a >= b is true if and only if a < b is false
a <= b is true if and only if a > b is false
Both a >= b and a <= b are necessarily true if a == b is true
So the assertions in my questions can be concluded as true. However, I wasn't able to find anything in the standard defining whether symmetry holds. Put another way, from a > b I don't know whether b < a is true or not. Thus I also have no way to derive that a <= b is true from b < a being false. So I would be interested to know, in addition to the assertions in the OP, whether the following are always true or not:
a < b is true if and only if b > a is true
a <= b is true if and only if b >= a is true
etc.
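A sketch that checks both the original assertions and these symmetry properties for finite doubles (all of them hold under IEEE-754 comparison semantics):
#include <cassert>

void check_relations(double a, double b) {
    assert((a > b) == !(a <= b));
    assert((a < b) == !(a >= b));
    if (a == b) { assert(a >= b); assert(a <= b); }
    assert((a < b)  == (b > a));   // symmetry
    assert((a <= b) == (b >= a));
}

int main() {
    check_relations(1.0, 2.0);
    check_relations(2.0, 1.0);
    check_relations(1.0, 1.0);
    check_relations(0.0, -0.0);    // +0 and -0 compare equal
}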
EDIT 3:
Regarding denormal numbers as mentioned by @Mark Ransom, I read the Wikipedia page about them, and my current understanding is that the existence of denormal numbers does not alter the conclusions above. Put another way, if some hardware claims that it fully supports denormal numbers, it also needs to ensure that the definitions of the comparison operators satisfy the above standards.
EDIT 4:
I just read the 2008 revision of IEEE 754; it doesn't say anything about symmetry either, so I guess this is undefined.
(All the above discussion assumes no NaN or Inf in any of the operands).
If neither number is a NaN or infinity then your assertions hold, if you have an IEEE compatible system. If the standards don't mention it then I'm sure that is just because it was sufficiently obvious to not be worth mentioning. In particular, if "a < b" and "b > a" don't have the same value (even for NaNs and infinities) then we are in crazy town.
Denormals shouldn't affect the conclusions because if you are assuming IEEE compatibility then denormal support is a given.
The only risks I can think of are in cases involving the x87 FPU and its odd 80-bit long double format. Rounding from 80-bit long double to double or float is expensive, so it is sometimes omitted, and that can lead to an assert like this firing:
assert(sqrt(val) == sqrt(val));
It could fire because the result of the first sqrt() call may be written to memory, and therefore rounded to float or double. It would then be compared against the result of the second sqrt() call, which might not be rounded. However, failing to round the result of the second sqrt() call before doing the comparison is, strictly speaking, not IEEE compliant if sqrt() returns float or double.
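A common workaround in that scenario (a sketch, assuming the x87 behavior described above) is to force both results through memory so each is rounded to double before the comparison:
#include <cassert>
#include <cmath>

int main() {
    double val = 2.0;
    // Storing through volatile forces each result out of any 80-bit
    // x87 register and rounds it to double before comparing.
    volatile double r1 = std::sqrt(val);
    volatile double r2 = std::sqrt(val);
    assert(r1 == r2);
}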
Infinity should only be a problem if both numbers are infinity.

Division by zero: Undefined Behavior or Implementation Defined in C and/or C++?

Regarding division by zero, the standards say:
C99 6.5.5p5 - The result of the / operator is the quotient from the division of the first operand by the second; the result of the % operator is the remainder. In both operations, if the value of the second operand is zero, the behavior is undefined.
C++03 5.6.4 - The binary / operator yields the quotient, and the binary % operator yields the remainder from the division of the first expression by the second. If the second operand of / or % is zero the behavior is undefined.
If we were to take the above paragraphs at face value, the answer is clearly Undefined Behavior for both languages. However, if we look further down in the C99 standard, we see the following paragraph which appears to be contradictory (1):
C99 7.12p4 - The macro INFINITY expands to a constant expression of type float representing positive or unsigned infinity, if available;
Do the standards have some sort of golden rule where Undefined Behavior cannot be superseded by a (potentially) contradictory statement? Barring that, I don't think it's unreasonable to conclude that if your implementation defines the INFINITY macro, division by zero is defined to be such. However, if your implementation does not define such a macro, the behavior is Undefined.
I'm curious what the consensus is (if any) on this matter for each of the two languages. Would the answer change if we are talking about integer division int i = 1 / 0 versus floating point division float i = 1.0 / 0.0 ?
Note (1) The C++03 standard talks about the <cmath> library which includes the INFINITY macro.
I don't see any contradiction. Division by zero is undefined, period. There is no mention of "... unless INFINITY is defined" anywhere in the quoted text.
Note that nowhere in mathematics is it defined that 1 / 0 = infinity. One might interpret it that way, but that is a personal, "shortcut" style interpretation rather than a sound fact.
1 / 0 is not infinity; only lim(x→0+) 1/x = ∞.
This was not a math purist question, but a C/C++ question.
According to the IEEE 754 standard, which all modern C compilers / FPUs use, we have
3.0 / 0.0 = INF
0.0 / 0.0 = NaN
-3.0 / 0.0 = -INF
The FPU will have a status flag that you can set to generate an exception if so desired, but this is not the norm.
INF can be quite useful for avoiding branching when INF is a useful result. See the discussion here:
http://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
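A sketch of those results, plus reading the status flag via <cfenv> (assuming IEEE-754 doubles; the volatile divisor keeps the divisions out of the constant folder, since a literal 1.0/0.0 may be diagnosed at compile time):
#include <cfenv>
#include <cstdio>

int main() {
    volatile double zero = 0.0;
    std::feclearexcept(FE_ALL_EXCEPT);
    double pos = 3.0 / zero;   // +INF
    double nan = 0.0 / zero;   // NaN (raises FE_INVALID instead)
    double neg = -3.0 / zero;  // -INF
    std::printf("%f %f %f\n", pos, nan, neg);
    std::printf("FE_DIVBYZERO raised: %d\n", std::fetestexcept(FE_DIVBYZERO) != 0);
}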
Why would it?
That doesn't make sense mathematically; it's not as if 1/x is defined as ∞ in mathematics in general. Also, you would at least need two more cases: -1/x and 0/x can't also equal ∞.
See division by zero in general, and the section about computer arithmetic in particular.
Implementations which define __STDC_IEC_559__ are required to abide by the requirements given in Annex F, which in turn requires floating-point semantics consistent with IEC 60559. The Standard imposes no requirements on the behavior of floating-point division by zero on implementations which do not define __STDC_IEC_559__, but does for those which do define it. In cases where IEC 60559 specifies a behavior but the C Standard does not, a compiler which defines __STDC_IEC_559__ is required by the C Standard to behave as described in the IEC standard.
As defined by IEC 60559 (or the US standard IEEE-754), division of zero by zero yields NaN, division of a nonzero floating-point number by positive zero (or literal constant zero) yields an INF with the same sign as the dividend, and division of a nonzero floating-point number by negative zero yields an INF with the opposite sign.
I've only got the C99 draft. In §7.12/4 it says:
The macro INFINITY expands to a constant expression of type float representing positive or unsigned infinity, if available; else to a positive constant of type float that overflows at translation time.
Note that INFINITY can be defined in terms of floating-point overflow, not necessarily divide-by-zero.
For the INFINITY macro: there is an explicit encoding for +/- infinity in the IEEE 754 standard, namely all exponent bits set and all fraction bits cleared (if the exponent bits are all set and any fraction bit is also set, the value represents NaN).
With my compiler, (int) INFINITY == -2147483648 (note that this out-of-range float-to-int conversion is itself undefined behavior), so an expression such as int i = 1/0 would definitely produce wrong results if INFINITY were returned.
Bottom line: C99 (as per your quotes) does not say anything about INFINITY in the context of "implementation-defined" behavior. Secondly, what you quoted does not show any inconsistency in the meaning of "undefined behavior".
[Quoting Wikipedia's Undefined Behavior page] "In C and C++, implementation-defined behavior is also used, where the language standard does not specify the behavior, but the implementation must choose a behavior and needs to document and observe the rules it chose."
More precisely, the standard means "implementation-defined" only (I think) when it uses those exact words with respect to the statement made, since "implementation-defined" is a specific term of art in the standard. The quote of C99 7.12p4 doesn't mention "implementation-defined".
[From C99 std (late draft)] "undefined behavior: behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements"
Note there is "no requirement" imposed for undefined behavior!
[C99 ..] "implementation-defined behavior: unspecified behavior where each implementation documents how the choice is made"
[C99 ..] "unspecified behavior: use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance"
Documentation is a requirement for implementation-defined behavior.