Numeric Limits of Float Lowest and Max - c++

In what real case does c++ std::numeric_limits<float>::lowest() not equal the negative of std::numeric_limits<float>::max()?

std::numeric_limits<float>::lowest() will equal the negative of std::numeric_limits<float>::max() if the floating-point representation uses signed magnitude, as the max and lowest values will have identical significands (ignoring the sign bit) and exponents, both at their maximum values. The sign bit then determines only whether the number is positive or negative, not its magnitude. (Another consequence of this is that the underlying representation can represent both positive and negative zero.) IEEE 754 uses signed-magnitude representation, so this is what you are most likely to encounter in practice.
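A quick way to see this on a typical implementation (a minimal sketch; the printed results assume an IEEE 754 float, which std::numeric_limits<float>::is_iec559 reports):

    #include <cmath>
    #include <iostream>
    #include <limits>

    int main() {
        std::cout << std::boolalpha;
        // With a signed-magnitude (IEEE 754) float, lowest() is exactly -max():
        std::cout << (std::numeric_limits<float>::lowest()
                      == -std::numeric_limits<float>::max()) << '\n'; // true
        // is_iec559 reports whether float follows IEEE 754:
        std::cout << std::numeric_limits<float>::is_iec559 << '\n';   // true on most platforms
        // Negative zero is representable, distinguished only by its sign bit:
        std::cout << std::signbit(-0.0f) << '\n';                     // true
    }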
However, an alternative representation is to use a two's complement significand. A property of two's complement is that the most negative value has a greater magnitude than the most positive value, so for a floating-point representation using a two's complement significand, std::numeric_limits<float>::lowest() will not equal the negative of std::numeric_limits<float>::max(). An example of a real system that used a two's complement significand in its floating-point format is the IBM 1130, circa 1970.
Now, cppreference.com does state:
While it's not true for fundamental C++ floating-point types, a third-party floating-point type T may exist such that std::numeric_limits<T>::lowest() != -std::numeric_limits<T>::max().
But I'm not sure what the basis of that statement is, as the C++11 standard, paragraph 18.3.2.4, merely specifies that lowest() is:
A finite value x such that there is no other finite value y where y < x.
Meaningful for all specializations in which is_bounded != false.
It also notes in a footnote:
lowest() is necessary because not all floating-point representations have a smallest (most negative) value that is the negative of the largest (most positive) finite value.
So it would seem that a two's complement floating point representation could be used by a C++ implementation, and that would have std::numeric_limits<float>::lowest() not equal to the negative of std::numeric_limits<float>::max().

Does IEEE 754 float division or subtraction by itself always result in the same value?

Unless an IEEE 754 value is NaN, ±0.0, or ±Infinity, is dividing it by itself guaranteed to result in exactly 1.0?
Similarly, is subtracting it from itself guaranteed to always result in ±0.0?
IEEE 754-2008 4.3 says:
… Except where stated otherwise, every operation shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result according to one of the attributes in this clause.
When an intermediate result is representable, all of the rounding attributes round it to itself; rounding changes a value only when it is not representable. Rules for subtraction and division are given in 5.4, and they do not state exceptions to the above.
Zero and one are representable per 3.3, which specifies the sets of numbers representable in any format conforming to the standard. In particular, zero may be represented by a significand with all zero digits, and one may be represented by a significand starting with 1 followed by “.000…000” and an exponent of zero. The minimum and maximum exponents of a format are defined so that they always include zero (emin is 1−emax, so zero is between them, inclusive, unless emax is less than 1, in which case no numbers are representable, i.e. the format is empty and not an actual format).
Since rounding does not change representable values and zero and one are representable, dividing a finite non-zero value by itself always produces one, and subtracting a finite value from itself always produces zero.
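A minimal illustration of both facts, assuming IEEE 754 doubles with the default round-to-nearest mode:

    #include <iostream>

    int main() {
        std::cout << std::boolalpha;
        double x = 0.1;  // finite and non-zero (not exactly representable, which doesn't matter)
        // The infinitely precise quotient is exactly 1 and the difference exactly 0;
        // both are representable, so rounding returns them unchanged.
        std::cout << (x / x == 1.0) << '\n';  // true
        std::cout << (x - x == 0.0) << '\n';  // true
    }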
Dividing a value by itself, or subtracting it from itself, provided the same value is used for both operands: yes, IEEE 754 requires the correctly rounded, consistent result.

converting really large int to double, loss of precision on some computer

I'm currently learning inter-type data conversion in C++. I have been taught that
For a really large int, we can (for some computers) suffer a loss of
precision when converting to double.
But no reason was provided for the statement.
Could someone please provide an explanation and an example? Thanks
Let's say that the floating point number uses N bits of storage.
Now, let us assume that this float can precisely represent all integers that can be represented by an integer type of N bits. Since the N-bit integer requires all of its N bits to represent all of its values, the float would need all N of its bits for the same purpose.
A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, and we must conclude that the assumption that a float can precisely represent all integers of an equally sized integer type must be erroneous.
Since there must be non-representable integers in the range of an N-bit integer, it is possible that converting such an integer to an N-bit floating point type will lose precision, if the converted value happens to be one of the non-representable ones.
Now, since a floating point can represent a subset of rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point can represent all integers up to 2^53. This property is directly associated with the length of the mantissa.
Therefore it is not possible to lose precision when converting a 32-bit integer to a double on a system that conforms to IEEE-754.
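To see the 2^53 boundary concretely, here is a minimal sketch (assuming IEEE 754 64-bit doubles):

    #include <cstdint>
    #include <iostream>

    int main() {
        std::cout << std::boolalpha;
        // 2^53 is the first point at which consecutive integers are no longer
        // all representable: 2^53 + 1 rounds to 2^53 when converted to double.
        std::int64_t a = std::int64_t{1} << 53;  // 9007199254740992
        std::int64_t b = a + 1;                  // 9007199254740993
        std::cout << (static_cast<double>(a) == static_cast<double>(b)) << '\n';  // true
    }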
More technically, the floating point unit of the x86 architecture actually uses an 80-bit extended floating point format, which is designed to be able to represent all 64-bit integers precisely and can be accessed using the long double type.
This may happen if int is 64 bits and double is 64 bits as well. Floating point numbers are composed of a mantissa (which represents the digits) and an exponent. Since the mantissa of the double in that case has fewer bits than the int, the double can represent fewer digits, and a loss of precision happens.
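A short sketch of that 64-bit case (assuming a 64-bit IEEE 754 double; the rounded value shown is what a conforming implementation produces):

    #include <cstdint>
    #include <iostream>

    int main() {
        // A 64-bit integer larger than 2^53 may not survive a round trip
        // through a 64-bit double, because the 52-bit mantissa cannot hold it.
        std::int64_t big = 123456789012345678LL;  // about 1.2 * 10^17, well above 2^53
        double d = static_cast<double>(big);      // rounds to the nearest representable double
        std::cout << big << '\n';                           // 123456789012345678
        std::cout << static_cast<std::int64_t>(d) << '\n';  // 123456789012345680
    }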

positive and negative infinity for integral types in c++

I am reading about positive and negative infinity in C++.
I read that integral types don't have an infinity value, i.e. std::numeric_limits<int>::infinity(); won't work, but std::numeric_limits<int>::max(); will work and will represent the maximum possible value that can be represented by the integral type.
So could the std::numeric_limits<int>::max(); of the integral type be taken as its positive infinite limit?
Or does the integral type have only the max value, with no true infinity?
Integers are always finite.
The closest you can get to what you're looking for is setting an integer to its maximum value, which for a 32-bit signed integer is only around 2.1 billion.
std::numeric_limits has a has_infinity member which you can use to check whether the type you want has an infinite representation; these usually exist only for floating point types such as float and double.
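For example (a minimal sketch; the printed int value assumes a typical 32-bit int):

    #include <iostream>
    #include <limits>

    int main() {
        std::cout << std::boolalpha;
        std::cout << std::numeric_limits<int>::has_infinity << '\n';     // false
        std::cout << std::numeric_limits<double>::has_infinity << '\n';  // true
        // For int, the finite extremes are all you get:
        std::cout << std::numeric_limits<int>::max() << '\n';            // 2147483647
        // For double, infinity() returns a genuine infinite value:
        std::cout << std::numeric_limits<double>::infinity() << '\n';    // inf
    }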
Floating point numbers have a special bit pattern to indicate "the value is infinity", which is used when the result of some operation is defined as infinite.
Integer values have a set number of bits, and all bits are used to represent a numeric value. There is no "special bit pattern"; the value is just whatever the weighted sum of the bit positions means.
Edit: On page 315 of my hardcopy of AMD64 Architecture Programmer's Manual, it says
Infinity. Infinity is a positive or negative number, +∞ or −∞, in which the integer bit is 1, the biased exponent is maximum and the fraction is 0. The infinities are the maximum numbers that can be represented in floating point format: negative infinity is less than any finite number and positive infinity is greater than any finite number (i.e. the affine sense).
An infinite result is produced when a non-zero, non-infinite number is divided by 0 or multiplied by infinity, or when infinity is added to infinity or to 0. Arithmetic on infinities is exact. For example, adding any floating point number to +∞ gives a result of +∞.
Arithmetic comparisons work correctly on infinities. Exceptions occur only when the use of an infinity as a source operand constitutes an invalid operation.
(Any typing mistakes are mine)
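A small demonstration of those rules, assuming IEEE 754 (is_iec559) doubles; note that 1.0 / 0.0 is well defined under IEEE 754 semantics even though the C++ standard alone does not guarantee it:

    #include <iostream>
    #include <limits>

    int main() {
        std::cout << std::boolalpha;
        const double inf = std::numeric_limits<double>::infinity();
        // Dividing a non-zero finite number by zero yields infinity:
        std::cout << (1.0 / 0.0 == inf) << '\n';                          // true
        // Adding any finite number to infinity is exact:
        std::cout << (inf + 1.0e308 == inf) << '\n';                      // true
        // Comparisons work: infinity is greater than any finite number:
        std::cout << (inf > std::numeric_limits<double>::max()) << '\n';  // true
        // But inf - inf is an invalid operation and produces NaN:
        std::cout << (inf - inf) << '\n';                                 // nan
    }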

Exponent in IEEE 754

Why is the exponent in a float displaced (biased) by 127?
Well, the real question is : What is the advantage of such notation in comparison to 2's complement notation?
Since the exponent as stored is unsigned, it is possible to use integer instructions to compare floating point values: the entire floating point value can be treated as a sign-magnitude integer for purposes of comparison (not two's complement).
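As a sketch of that comparison trick (assuming an IEEE 754 binary32 float and C++20's std::bit_cast; NaNs are excluded, and the order_key helper here is purely illustrative):

    #include <bit>       // std::bit_cast (C++20)
    #include <cstdint>
    #include <iostream>
    #include <limits>

    // Map a float's sign-magnitude bit pattern to an integer whose natural
    // (two's complement) ordering matches the float ordering. Negative floats
    // have larger bit patterns for larger magnitudes, so they are reflected.
    std::int32_t order_key(float f) {
        const std::int32_t bits = std::bit_cast<std::int32_t>(f);
        return bits < 0 ? std::numeric_limits<std::int32_t>::min() - bits : bits;
    }

    int main() {
        std::cout << std::boolalpha;
        std::cout << (order_key(-2.0f) < order_key(-1.0f)) << '\n';  // true
        std::cout << (order_key(-1.0f) < order_key(0.5f)) << '\n';   // true
        std::cout << (order_key(0.5f) < order_key(2.0f)) << '\n';    // true
    }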
Just to correct some misinformation: it is 2^n * 1.mantissa; the 1 in front of the fraction is implicitly stored.
Note that there is a slight difference in the representable range for the exponent between biased and 2's complement. The IEEE standard supports exponents in the range (-127 to +128), while if it were 2's complement, it would be (-128 to +127). I don't really know why the standard chose the biased form, but maybe the committee members thought it would be more useful to allow extremely large numbers rather than extremely small numbers.
@Stephen Canon, in response to ysap's answer (sorry, this should have been a follow-up comment to my answer, but the original answer was entered as an unregistered user, so I cannot really comment on it yet):
Stephen, obviously you are right; the exponent range I mentioned is incorrect, but the spirit of the answer still applies. Assuming it were 2's complement instead of a biased value, and assuming that the 0x00 and 0xFF values would still be special values, the biased exponents allow for (2x) bigger numbers than the 2's complement exponents would.
The exponent in a 32-bit float consists of 8 bits, but without a sign bit. So the range is effectively [0;255]. In order to represent numbers < 2^0, that range is shifted by 127, becoming [-127;128].
That way, very small numbers can be represented very precisely. With a [0;255] range, small numbers would have to be represented as 2^0 * 0.mantissa, with lots of zeroes in the mantissa. But with a [-127;128] range, small numbers are more precise because they can be represented as 2^-126 * 0.mantissa (with fewer unnecessary zeroes in the mantissa). Hope you get the point.
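To see the bias in action, here is a minimal sketch that extracts the stored exponent field (assuming an IEEE 754 binary32 float and C++20's std::bit_cast):

    #include <bit>       // std::bit_cast (C++20)
    #include <cstdint>
    #include <iostream>

    int main() {
        const float f = 6.5f;  // 1.625 * 2^2
        const auto bits = std::bit_cast<std::uint32_t>(f);
        const int stored = (bits >> 23) & 0xFF;  // the 8-bit biased exponent field
        std::cout << "stored (biased) exponent: " << stored << '\n';    // 129
        std::cout << "actual exponent: " << (stored - 127) << '\n';     // 2
    }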