Does IEEE 754 float division or subtraction by itself always result in the same value? - ieee-754

Unless an IEEE 754 value is NaN, ±0.0 or ±Infinity, is dividing it by itself guaranteed to result in exactly 1.0?
Similarly, is subtracting it from itself guaranteed to always result in ±0.0?

IEEE 754-2008 4.3 says:
… Except where stated otherwise, every operation shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result according to one of the attributes in this clause.
When an intermediate result is representable, all of the rounding attributes round it to itself; rounding changes a value only when it is not representable. Rules for subtraction and division are given in 5.4, and they do not state exceptions for the above.
Zero and one are representable per 3.3, which specifies the sets of numbers representable in any format conforming to the standard. In particular, zero may be represented by a significand with all zero digits, and one may be represented by a significand starting with 1 followed by “.000…000” and an exponent of zero. The minimum and maximum exponents of a format are defined so that they always include zero (emin is 1−emax, so zero is between them, inclusive, unless emax is less than 1, in which case no numbers are representable, i.e., the format is empty and not an actual format).
Since rounding does not change representable values and zero and one are representable, dividing a finite non-zero value by itself always produces one, and subtracting a finite value from itself always produces zero.
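A minimal sketch of both guarantees (assuming IEEE-754 binary64 doubles and the default rounding mode):

#include <cassert>
#include <cmath>

int main() {
    double x = 0.1;       // any finite, non-zero value
    assert(x / x == 1.0); // the exact quotient 1 is representable, so no rounding occurs
    assert(x - x == 0.0); // the exact difference 0 is representable
    assert(std::isnan(INFINITY - INFINITY)); // one of the excluded special cases
}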

Dividing a value by itself and subtracting it from itself: yes, if the same value is used, IEEE 754 requires the correctly rounded, consistent result (exactly 1.0 and ±0.0, respectively).

Related

Floating point multiplication results in infinity even though rounding to nearest

Consider the following C++ code:
#include <fenv.h>
#include <iostream>
using namespace std;
int main() {
    fesetround(FE_TONEAREST);
    double a = 0x1.efc7f0001p+376;
    double b = -0x1.0fdfdp+961;
    double c = a * b;
    cout << a << " " << b << " " << c << endl;
}
The output that I see is
2.98077e+113 -2.06992e+289 -inf
I do not understand why c is infinity. My understanding is that whatever the most negative non-infinity floating-point value is, it should be closer to the actual value of a*b than -inf, since that value is finite and any finite number is closer to any other finite number than negative infinity is. Why is infinity produced here?
This was run on 64-bit x86, and the assembly uses SSE instructions. It was compiled with -O0, and it happens with both clang and gcc.
The result is the most negative finite floating-point value if the round-toward-zero mode is used instead. I conclude from this that the issue is rounding related.
Rounding is not the primary issue here. The infinite result is caused by overflow.
This answer follows the rules in IEEE 754-2019. The real-number-arithmetic product of 1.EFC7F0001₁₆•2³⁷⁶ and −1.0FDFD₁₆•2⁹⁶¹ is around −1.07430C8649FEFE8₁₆•2¹³³⁸. In normal conditions, floating-point arithmetic produces a result “as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result…” (IEEE 754-2019 4.3). However, we do not have normal conditions. IEEE 754-2019 7.4 says:
The overflow exception shall be signaled if and only if the destination format’s largest finite number is exceeded in magnitude by what would have been the rounded floating-point result (see 4) were the exponent range unbounded…
In other words, if we rounded the result as if we could have any exponent (so we are just rounding the significand), the result would be −1.07430C8649FF0₁₆•2¹³³⁸. But the magnitude of that exceeds the largest finite number that double can represent, ±1.FFFFFFFFFFFFF₁₆•2¹⁰²³. Therefore, an overflow exception is signaled, and, since you do not catch the exception, a default result is delivered:
… The default result shall be determined by the rounding-direction attribute and the sign of the intermediate result as follows:
a) roundTiesToEven and roundTiesToAway carry all overflows to ∞ with the sign of the intermediate result…
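That signal can be observed from C++ with the standard floating-point environment facilities. This is only a sketch: strictly, the program should contain #pragma STDC FENV_ACCESS ON, which many compilers do not implement, and the volatile qualifiers are there only to discourage compile-time folding.

#include <cfenv>
#include <cstdio>

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double a = 0x1.efc7f0001p+376; // volatile keeps the multiply at runtime
    volatile double b = -0x1.0fdfdp+961;
    double c = a * b;                       // overflows: default result is -inf
    std::printf("%g overflow=%d\n", c, !!std::fetestexcept(FE_OVERFLOW));
}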
The behaviour you note (when using the FE_TONEAREST rounding mode) conforms to the IEEE-754 (or the equivalent ISO/IEC/IEEE 60559:2011) Standard. (Your examples imply that your platform uses IEEE-754 representation, as many – if not most – platforms do, these days.)
From this Wikipedia page, footnote #18, which cites IEEE-754 §4.3.1, dealing with the "rounding to nearest" modes:
In the following two rounding-direction attributes, an infinitely precise result with magnitude at least b^emax (b − ½b^(1−p)) shall round to ∞ (infinity) with no change in sign.
The 'infinitely precise' result of your a * b calculation does, indeed, have a magnitude greater than the specified value, so the rounding is Standard-conformant.
I can't find a similar IEEE-754 citation for the "round-towards-zero" mode, but this GNU libc manual has this to say:
Round toward zero. All results are rounded to the largest
representable value whose magnitude is less than that of the result.
In other words, if the result is negative it is rounded up; if it is
positive, it is rounded down.
So, again, when using that mode, rounding to -DBL_MAX is appropriate/conformant.
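For binary64, that threshold b^emax (b − ½b^(1−p)) with b = 2, p = 53, emax = 1023 works out to DBL_MAX plus half a ULP. A minimal sketch of the boundary (assuming IEEE-754 doubles, round-to-nearest-even, and C++17 hex-float literals):

#include <cfloat>
#include <cstdio>

int main() {
    // The ULP of DBL_MAX is 2^971, so the overflow threshold is DBL_MAX + 2^970.
    double just_under = DBL_MAX + 0x1.8p+969; // below the midpoint: rounds back to DBL_MAX
    double at_midpoint = DBL_MAX + 0x1p+970;  // exact midpoint: ties-to-even gives +inf
    std::printf("%a\n%a\n", just_under, at_midpoint); // 0x1.fffffffffffffp+1023, then inf
}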

Is there any definition how floating-point values evaluated at compile-time are rounded?

Is there any definition of how floating-point values evaluated at compile time are rounded in C or C++? E.g. when I have double d = 1.0 / 3.0;? I.e., what kind of rounding is done at compile time.
And is there a definition of the default rounding mode for a thread at runtime (C99's / C++11's fegetround() / fesetround())?
And is rounding to integer values also covered by those configuration parameters? I'm aware of nearbyint(), but that is specified to honor the rounding mode set by fesetround(). What I'm concerned about is direct casting to an integer.
In both specs (C17 and C++20), the compile-time rounding is implementation-defined.
In the C++ spec, this is specified in lex.fcon, which says
If the scaled value is not in the range of representable values for its type, the program is ill-formed. Otherwise, the value of a floating-point-literal is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.
The C spec has similar language (quote taken from N2176, C17 final draft):
the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner.
It also recommends that translation time conversion should match execution time conversion by library functions (like strtod) but it is not required. See the description of representable values.
Conversions of floating point values to integers are specified in both to truncate (round towards 0 by discarding the fractional part).
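A small sketch of the runtime side (assuming the platform supports the C floating-point environment and the FE_UPWARD mode): direct casting to an integer always truncates, while nearbyint() follows the dynamic mode set with fesetround().

#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
    std::fesetround(FE_UPWARD);              // change the dynamic rounding mode
    double x = 1.2;
    std::printf("%d\n", (int)x);             // 1: conversion to int truncates toward zero
    std::printf("%g\n", std::nearbyint(x));  // 2: nearbyint honors FE_UPWARD
}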

C++ Std::stof doesn't work for float less than FLT_MIN?

I'm having issues with the following line of code:
float temp = std::stof("-8.34416e-46");
My program aborts as soon as it reaches this line. It's clear that this float is less than FLT_MIN, but such floats are allowed to exist. (For example, float temp = -8.34416e-46; works fine.) Is the stof method supposed to only work for values between FLT_MIN and FLT_MAX?
If so, what would be a good alternative to get a string ("-8.34416e-46") into a float?
An Alternative
Convert to double with std::stod and then assign to float. Add bounds checking if desired.
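A sketch of that alternative (to_float is a hypothetical helper name, and the bounds check is left as a comment):

#include <string>

// Parse via double, whose wider exponent range avoids the underflow report,
// then let the conversion round to float.
float to_float(const std::string& s) {
    double d = std::stod(s);       // -8.34416e-46 is well within double's range
    // optional: bounds checking against FLT_MAX here, if out-of-range inputs matter
    return static_cast<float>(d);  // rounds to the nearest float, here a subnormal
}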
Converting “-8.34416e-46”
The C++ standard allows conversions of strings to float to report underflow if the result is in the subnormal range even though it is representable.
When rounding to the nearest representable value is used, −8.34416•10⁻⁴⁶ is within the range of float (in C++ implementations that use IEEE-754 binary32 for float, which is common), but it is in the subnormal range. The C++ standard says stof calls strtof and then defers to the C standard to define strtof. The C standard indicates that strtof may underflow, about which it says “The result underflows if the magnitude of the mathematical result is so small that the mathematical result cannot be represented, without extraordinary roundoff error, in an object of the specified type.” That is awkward phrasing, but it refers to the rounding errors that occur when subnormal values are encountered. (Subnormal values are subject to larger relative errors than normal values, so their rounding errors might be said to be extraordinary.)
Thus, a C++ implementation is allowed by the C++ standard to underflow for subnormal values even though they are representable.
The smallest positive magnitude in the binary32 format is 2⁻¹⁴⁹, about 1.4•10⁻⁴⁵. 8.34416•10⁻⁴⁶ is smaller than this, but it is greater than half of 2⁻¹⁴⁹. That means, between 0 and 2⁻¹⁴⁹, it is closer to the latter, so conversion with rounding to the nearest representable value will produce 2⁻¹⁴⁹ rather than zero. Unfortunately, your strtof implementation chooses to report underflow rather than completing a conversion to the nearest representable value.
Normal and subnormal values
For IEEE-754 32-bit binary floating-point, the normal range is from 2⁻¹²⁶ to 2¹²⁸−2¹⁰⁴. Within this range, every number is represented with a significand (the fraction portion of the floating-point representation) that has a leading 1 bit followed by 23 additional bits, and so the error that occurs when rounding any real number in this range to the nearest representable value is at most 2⁻²⁴ times the position value of the leading bit.
In addition to this normal range, there is a subnormal range from 2⁻¹⁴⁹ to 2⁻¹²⁶−2⁻¹⁴⁹. In this interval, the exponent part of the floating-point format has reached its smallest value and cannot be decreased any more. To represent smaller and smaller numbers in this interval, the significand is reduced below the normal minimum of 1: it starts with a 0 bit and is followed by 23 additional bits. In this interval, the error that occurs when rounding a real number to the nearest representable value may be larger than 2⁻²⁴ times the position value of the leading bit. Since the exponent cannot be decreased any further, numbers in this interval have increasing numbers of leading 0 bits as they get smaller and smaller, so the relative errors involved in using these numbers grow.
For whatever reasons, the C++ standard has said that implementations may report underflow in this interval. (The IEEE-754 standard defines underflow in complicated ways and also allows implementations some choices.)
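A small illustration of those ranges (assuming IEEE-754 binary32 float; FLT_TRUE_MIN requires C++17):

#include <cfloat>
#include <cstdio>

int main() {
    std::printf("%a\n", FLT_MIN);        // 0x1p-126, smallest positive normal
    std::printf("%a\n", FLT_TRUE_MIN);   // 0x1p-149, smallest positive subnormal
    // The literal is converted at compile time, typically to the nearest value,
    // which is the smallest subnormal rather than zero:
    std::printf("%a\n", -8.34416e-46f);  // -0x1p-149 on common implementations
}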

Is it possible in floating point to return 0.0 subtracting two different values?

Due to floating point's approximate nature, it's possible that two different sets of values produce the same result.
Example:
#include <iostream>
int main() {
    std::cout.precision(100);
    double a = 0.5;
    double b = 0.5;
    double c = 0.49999999999999994;
    std::cout << a + b << std::endl; // outputs "exact" 1.0
    std::cout << a + c << std::endl; // also outputs "exact" 1.0
}
But is it also possible with subtraction? I mean: are there two sets of different values (sharing one value between them) that return 0.0?
I.e., a - b == 0.0 and a - c == 0.0 for some a, b and a, c with b != c?
The IEEE-754 standard was deliberately designed so that subtracting two values produces zero if and only if the two values are equal, except that subtracting an infinity from itself produces NaN and/or an exception.
Unfortunately, C++ does not require conformance to IEEE-754, and many C++ implementations use some features of IEEE-754 but do not fully conform.
A not uncommon behavior is to “flush” subnormal results to zero. This is part of a hardware design to avoid the burden of handling subnormal results correctly. If this behavior is in effect, the subtraction of two very small but different numbers can yield zero. (The numbers would have to be near the bottom of the normal range, having some significand bits in the subnormal range.)
Sometimes systems with this behavior may offer a way of disabling it.
Another behavior to beware of is that C++ does not require floating-point operations to be carried out precisely as written. It allows “excess precision” to be used in intermediate operations and “contractions” of some expressions. For example, a*b - c*d may be computed by using one operation that multiplies a and b and then another that multiplies c and d and subtracts the result from the previously computed a*b. This latter operation acts as if c*d were computed with infinite precision rather than rounded to the nominal floating-point format. In this case, a*b - c*d may produce a non-zero result even though a*b == c*d evaluates to true.
Some C++ implementations offer ways to disable or limit such behavior.
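A sketch of that contraction effect, forced explicitly with std::fma so it reproduces even on compilers that never contract on their own (a contracting compiler may make the "plain" expression behave like the fused one):

#include <cmath>
#include <cstdio>

int main() {
    double a = 0.1, b = 0.1, c = 0.1, d = 0.1; // a*b == c*d evaluates to true
    double plain = a * b - c * d;              // 0 when both products are rounded the same way
    double fused = std::fma(a, b, -c * d);     // a*b kept exact inside the fma, so the
                                               // result is the product's rounding error
    std::printf("%g %g\n", plain, fused);      // 0 and a tiny nonzero value
}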
Gradual underflow feature of IEEE floating point standard prevents this. Gradual underflow is achieved by subnormal (denormal) numbers, which are spaced evenly (as opposed to logarithmically, like normal floating point) and located between the smallest negative and positive normal numbers with zeroes in the middle. As they are evenly spaced, the addition of two subnormal numbers of differing signedness (i.e. subtraction towards zero) is exact and therefore won't reproduce what you ask. The smallest subnormal is (much) less than the smallest distance between normal numbers, and therefore any subtraction between unequal normal numbers is going to be closer to a subnormal than zero.
If you disable IEEE conformance using a special denormals-are-zero (DAZ) or flush-to-zero (FTZ) mode of the CPU, then indeed you could subtract two small, close numbers which would otherwise result in a subnormal number, which would be treated as zero due to the mode of the CPU. A working example (Linux):
#include <xmmintrin.h> // _MM_SET_FLUSH_ZERO_MODE (x86 SSE)
#include <cmath>
#include <iostream>
#include <limits>
int main() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);    // system specific
    double d = std::numeric_limits<double>::min(); // smallest normal
    double n = std::nextafter(d, 10.0);            // second smallest normal
    double z = d - n;                              // a negative subnormal (flushed to zero)
    std::cout << (z == 0) << '\n' << (d == n);
}
This should print
1
0
The first 1 indicates that the result of the subtraction is exactly zero, while the second 0 indicates that the operands are not equal.
Unfortunately the answer is dependent on your implementation and the way it is configured. C and C++ don't demand any specific floating point representation or behavior. Most implementations use the IEEE 754 representations, but they don't always precisely implement IEEE 754 arithmetic behaviour.
To understand the answer to this question we must first understand how floating point numbers work.
A naive floating-point representation would have an exponent, a sign and a mantissa. Its value would be
(−1)^s × 2^(e − e₀) × (m/2^M)
Where:
s is the sign bit, with a value of 0 or 1.
e is the exponent field
e0 is the exponent bias. It essentially sets the overall range of the floating point number.
M is the number of mantissa bits.
m is the mantissa with a value between 0 and 2^M − 1
This is similar in concept to the scientific notation you were taught in school.
However, this format has many different representations of the same number; nearly a whole bit's worth of encoding space is wasted. To fix this we can add an "implicit 1" to the mantissa.
(−1)^s × 2^(e − e₀) × (1 + m/2^M)
This format has exactly one representation of each number. However, there is a problem: it can't represent zero or numbers close to zero.
To fix this IEEE floating point reserves a couple of exponent values for special cases. An exponent value of zero is reserved for representing small numbers known as subnormals. The highest possible exponent value is reserved for NaNs and infinities (which I will ignore in this post since they aren't relevant here). So the definition now becomes.
(−1)^s × 2^(1 − e₀) × (m/2^M) when e = 0
(−1)^s × 2^(e − e₀) × (1 + m/2^M) when 0 < e < 2^E − 1, where E is the number of exponent bits
With this representation smaller numbers always have a step size that is less than or equal to that for larger ones. So provided the result of the subtraction is smaller in magnitude than both operands it can be represented exactly. In particular results close to but not exactly zero can be represented exactly.
This does not apply if the result is larger in magnitude than one or both of the operands, for example subtracting a small value from a large value or subtracting two values of opposite signs. In those cases the result may be imprecise but it clearly can't be zero.
Unfortunately, FPU designers cut corners. Rather than including the logic to handle subnormal numbers quickly and correctly, they either did not support (non-zero) subnormals at all, or provided slow support for subnormals and gave the user the option to turn it on and off. If support for proper subnormal calculations is not present or is disabled and the number is too small to represent in normalized form, then it will be "flushed to zero".
So in the real world under some systems and configurations subtracting two different very-small floating point numbers can result in a zero answer.
Excluding funny numbers like NaN, I don't think it's possible.
Let's say a and b are normal finite IEEE 754 floats, and |a - b| is less than or equal to both |a| and |b| (otherwise it's clearly not zero).
That means the result's exponent is <= both a's and b's exponents, so the absolute precision of the result is at least as high as that of either operand, which makes the subtraction exactly representable. So if a - b == 0, the exact difference is zero, and therefore a == b.
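A sketch of that gradual-underflow behavior (assuming IEEE-754 doubles with subnormals enabled, i.e. no FTZ/DAZ):

#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    double a = DBL_MIN;                // 2^-1022, smallest positive normal
    double b = std::nextafter(a, 1.0); // one ULP above it
    std::printf("%a\n", b - a);        // exactly 2^-1074, a subnormal: tiny but not zero
}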

positive and negative infinity for integral types in c++

I am reading about positive and negative infinity in C++.
I read that integral types don't have an infinite value, i.e. std::numeric_limits<int>::infinity(); won't work, but std::numeric_limits<int>::max(); will work and will represent the maximum possible value that can be represented by the integral type.
So could the std::numeric_limits<int>::max(); of the integral type be taken as its positive infinite limit?
Or does the integral type have only the max value, with no true infinity?
Integers are always finite.
The closest you can get to what you're looking for is setting an integer to its maximum value, which for a typical 32-bit signed integer is only around 2 billion.
std::numeric_limits has a has_infinity member which you can use to check if the type you want has an infinite representation, which are usually only on floating point numbers such as float and double.
Floating point numbers have a special bit pattern to indicate "the value is infinity", which is used when the result of some operation is defined as infinite.
Integer values have a set number of bits, and all bits are used to represent a numeric value. There is no "special bit pattern", it's just whatever the sum of the bit positions mean.
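A sketch of both points (assuming C++20 for std::bit_cast and a 64-bit IEEE-754 double):

#include <bit>
#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
    std::printf("%d %d\n",
                std::numeric_limits<int>::has_infinity,      // 0: integers are always finite
                std::numeric_limits<double>::has_infinity);  // 1
    double inf = std::numeric_limits<double>::infinity();
    // The special bit pattern: sign 0, all-ones exponent, zero fraction.
    std::printf("%016llx\n", (unsigned long long)std::bit_cast<std::uint64_t>(inf)); // 7ff0000000000000
}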
Edit: On page 315 of my hardcopy of AMD64 Architecture Programmer's Manual, it says
Infinity. Infinity is a positive or negative number, +∞ and −∞, in which the integer bit is 1, the biased exponent is maximum, and the fraction is 0. The infinities are the maximum numbers that can be represented in floating-point format: negative infinity is less than any finite number and positive infinity is greater than any finite number (i.e., the affine sense).
An infinite result is produced when a non-zero, non-infinite number is divided by 0 or multiplied by infinity, or when infinity is added to infinity or to 0. Arithmetic on infinities is exact. For example, adding any floating-point number to +∞ gives a result of +∞.
Arithmetic comparisons work correctly on infinities. Exceptions occur only when the use of an infinity as a source operand constitutes an invalid operation.
(Any typing mistakes are mine)