static_casting from a floating point to an integer simply strips the fractional part of the number. For example static_cast<int>(13.9999999) yields 13.
Not all integers are representable as floating point numbers. For example internally the closest float to 13,000,000 may be: 12999999.999999.
In this hypothetical case, I'd expect to get an unexpected result from:
const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));
My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know the range over which I can trust static_cast<long long>(ceil(foo))?
For example internally the closest float to 13,000,000 may be: 12999999.999999.
That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M·b^e, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000−x to be represented, where x is some positive value less than 1, e must be negative (because M·b^e for a non-negative e is an integer). If so, then M·b^0 is an integer larger than M·b^e, so it is at least 13,000,000, and so 13,000,000 can be represented as M′·b^0, where M′ is a positive integer no greater than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
Regarding your code:
auto test = 0LL;
const auto floater = 0.5F;
for(auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;
cout << test << endl;
When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
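You can watch the absorption happen without the loop. A minimal sketch (assuming IEEE-754 binary32 float):

#include <cmath>
#include <iostream>

int main()
{
    float f = 8388608.0F;                      // 2^23, exactly representable
    float g = f + 0.5F;                        // 8,388,608.5 is not representable; rounds to 8,388,608
    std::cout << (g == f) << '\n';             // prints 1: the 0.5 was absorbed by the addition
    std::cout << (std::ceil(g) == f) << '\n';  // prints 1: ceil has no fractional part left to round
}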
Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.
8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule “round to nearest, ties to even.” The two nearest representable values are 8,388,609 and 8,388,610. Since the value is equidistant from both, the tie was broken in favor of the one with the even significand, and the result was 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
On Visual Studio 2015 I got 8,388,609 which is a horrifying small safe range.
In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.
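A quick way to see the edge of that range (a sketch, again assuming IEEE-754 binary32):

#include <iostream>

int main()
{
    float a = 16777216.0F;          // 2^24: representable
    float b = 16777217.0F;          // 2^24 + 1: not representable, rounds to 2^24
    std::cout << (a == b) << '\n';  // prints 1
}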
Floating point numbers are represented by 3 integers, c·b^q, where:
c is the mantissa (so for the number: 12,999,999.999999 c would be 12,999,999,999,999)
q is the exponent (so for the number: 12,999,999.999999 q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)
From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000 using a c of 1,300,000,000,000 and a q of -5.
This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2 the q functions as a shift left or right of the mantissa.
Next let's talk about range. Obviously a 32-bit floating point cannot represent all the integers a 32-bit integer can, as the floating point must also represent a great many much larger and much smaller numbers. Since the exponent simply shifts the mantissa, a floating point number can always exactly represent every integer that can be represented by its mantissa. Given the traditional IEEE-754 binary base floating point numbers:
A 32-bit (float) has a 24-bit mantissa so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double depending upon implementation) has a 113-bit mantissa so it can represent all integers in the range [-10,384,593,717,069,655,257,060,992,658,440,191, 10,384,593,717,069,655,257,060,992,658,440,191]
C++ provides numeric_limits<T>::digits as a way of finding this number for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example a float's maximum mantissa can be found by:
(1LL << numeric_limits<float>::digits) - 1LL
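For instance, a minimal program built around that expression (only <iostream> and <limits> are needed):

#include <iostream>
#include <limits>

int main()
{
    using std::numeric_limits;
    std::cout << ((1LL << numeric_limits<float>::digits) - 1LL) << '\n';   // 16777215
    std::cout << ((1LL << numeric_limits<double>::digits) - 1LL) << '\n';  // 9007199254740991
}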
Having thoroughly explained the mantissa, let's revisit the exponent section to talk about how a floating point is actually stored. Take 13,000,000.0 that could be represented as:
c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10
And so on. For the traditional binary format IEEE-754 requires:
The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers
To explain this in the more familiar base-10 if our mantissa has 14 decimal places, the implementation would look like this:
c = 13,000,000,000,000 so the MSB will be used in the represented number
q = 7 This is a little confusing; it's because of the bias introduced here. Logically q = -6, but the bias is set so that when q = 0 only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 represents 1.3
b = 10 again the above rules are really only required for base-2, but I've shown them as they would apply to base-10 for the purpose of explanation
Translated back to base-2 this means that a q of numeric_limits<T>::digits - 1 or greater leaves only zeros after the decimal point. ceil only has an effect if there is a fractional part of the number.
A final point of explanation here is the range over which ceil will have an effect. After the exponent of a floating point is larger than numeric_limits<T>::digits, continuing to increase it only introduces trailing zeros to the resulting number; thus ceil can only have an effect when q is less than or equal to numeric_limits<T>::digits - 2LL. And since we know the MSB of c will be used in the number, this means that c must be smaller than (1LL << numeric_limits<T>::digits - 1LL) - 1LL. Thus for ceil to have an effect on the traditional binary IEEE-754 floating point:
A 32-bit (float) must be smaller than 8,388,607
A 64-bit (double) must be smaller than 4,503,599,627,370,495
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,095
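The float threshold is easy to probe empirically; a small sketch (assuming IEEE-754 binary32):

#include <cmath>
#include <iostream>

int main()
{
    float f = 8388607.5F;                               // largest float with a nonzero fractional part
    std::cout << (std::ceil(f) == 8388608.0F) << '\n';  // 1: ceil still has an effect here
    float g = std::nextafter(8388608.0F, 1.0e30F);      // one ulp above 2^23 is already 8388609.0
    std::cout << (std::ceil(g) == g) << '\n';           // 1: from 2^23 upward every float is an integer
}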
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such that 0 <= v < 1/x.
It seems fairly obvious that I could write double v = DBL_EPSILON;, but, if x is big enough, could it happen that v ends up being bigger than the real value 1/x, because 1/x is so small that it cannot be represented on my machine?
In other words, and more precisely, if I have a positive real number x and an object x1 of type double whose stored value represents x exactly, is it guaranteed that the value represented by DBL_EPSILON is less than the real number 1/x?
In case it is not guaranteed, how can I calculate the biggest value of type double that ensures that DBL_EPSILON is less than the real number 1/x?
I will assume double is IEEE 754 binary64.
If a floating-point number is representable in my machine, will its inverse be representable in my machine?
Not necessarily, for two reasons:
The inverse might not be a floating-point number.
For example, although 3 is a floating-point number, 1/3 is not.
The inverse might overflow.
For example, the inverse of 2^−1074 is 2^1074, which is not only larger than all finite floating-point numbers but more than halfway from the largest finite floating-point number, 0x1.fffffffffffffp+1023 = 2^1024 − 2^971, to what would be the next one after that, 2^1024, if the range of exponents were larger.
So the inverse of 2^−1074 is rounded to infinity.
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such that 0 <= v < 1/x.
The smallest such 𝑣 is always zero.
If you restrict it to be nonzero, it will always be the smallest subnormal floating-point number, 0x1p−1074, or roughly 4.9406564584124654 × 10^−324, irrespective of 𝑥 (unless 𝑥 is infinite).
But perhaps you want the largest such 𝑣 rather than the smallest such 𝑣.
The largest such 𝑣 is always either 1 ⊘ 𝑥 = fl(1/𝑥) (that is, the floating-point number nearest to 1/𝑥, which is what you get by writing 1/x in C), or the next floating-point number closer to zero (which you can get by writing nextafter(1/x, 0) in C): in the default rounding mode, the division operator always returns the nearest floating-point number to the true quotient, or one of the two nearest ones if there is a tie.
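If you want to compute that largest 𝑣 exactly, one approach (a sketch, assuming IEEE-754 binary64 and a correctly rounded fma) is to test which side of the true quotient fl(1/x) landed on:

#include <cmath>
#include <cstdio>

// Largest double v with v < 1/x, for finite x > 0 (illustrative, not hardened).
// Since x > 0:  q < 1/x  <=>  q*x < 1  <=>  fma(q, x, -1.0) < 0,
// and fma computes q*x - 1 with a single rounding, so its sign is exact.
double largest_below_inverse(double x)
{
    double q = 1.0 / x;               // nearest double to the real 1/x
    if (std::fma(q, x, -1.0) < 0.0)
        return q;                     // q is strictly below 1/x: keep it
    return std::nextafter(q, 0.0);    // q >= 1/x: step one ulp toward zero
}

int main()
{
    std::printf("%.17g\n", largest_below_inverse(3.0));  // just below 1/3
}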
You can also get the largest such 𝑣 by setting the rounding mode with fesetround(FE_DOWNWARD) or fesetround(FE_TOWARDZERO) and then just computing 1/x, although toolchain support for non-default rounding modes is spotty and mostly they serve to shake out bugs in ill-conditioned code rather than to give reliable rounding semantics.
It seems fairly obvious that I could write double v = DBL_EPSILON;, but, if x is big enough, could it happen that v ends up being bigger than the real value 1/x, because 1/x is so small that it cannot be represented on my machine?
1/x is never rounded to zero unless 𝑥 is infinite or you have nonstandard flush-to-zero semantics enabled (so results which would ordinarily be subnormal are instead rounded to zero, such as when 𝑥 is the largest finite floating-point number 0x1.fffffffffffffp+1023).
But flush-to-zero aside, there are many values of 𝑥 for which 1/𝑥 and fl(1/𝑥) = 1/x are smaller than DBL_EPSILON.
For example, if 𝑥 = 0x1p+1000 (that is, 2^1000 ≈ 1.0715086071862673 × 10^301), then 1/𝑥 = fl(1/𝑥) = 1/x = 0x1p−1000 (that is, 2^−1000 ≈ 9.332636185032189 × 10^−302) is far below DBL_EPSILON = 0x1p−52 (that is, 2^−52 ≈ 2.220446049250313 × 10^−16).
1/𝑥 in this case is a floating-point number, so the reciprocal is computed exactly in floating-point arithmetic; there is no rounding at all.
The largest floating-point number below 1/𝑥 in this case is 0x1.fffffffffffffp−1001, or 2^−1000 − 2^−1053.
DBL_EPSILON (2^−52) is not the smallest floating-point number (2^−1074), or even the smallest normal floating-point number (2^−1022).
Rather, DBL_EPSILON is the distance from 1 to the next larger floating-point number, 1 + 2^−52, sometimes written ulp(1) to indicate that it is the magnitude of the least significant digit, or unit in the last place, in the floating-point representation of 1.
In case it is not guaranteed, how can I calculate the biggest value of type double that ensures that DBL_EPSILON is less than the real number 1/x?
That would be 1/DBL_EPSILON - 1, or 2^52 − 1.
But what do you want this number for?
Why are you trying to use DBL_EPSILON here?
The inverse of positive infinity is, of course, smaller than any positive rational number. Beyond that, even the largest finite floating point number has a multiplicative inverse well above the smallest representable floating point number of equivalent width, thanks to denormal numbers.
If a floating-point number is representable in my machine, will its inverse be representable in my machine?
No. There is no specification that 1.0/DBL_MIN <= DBL_MAX and 1.0/DBL_MAX >= DBL_MIN both must be true. One is usually true. With sub-normals, 1.0/sub-normal is often > DBL_MAX.
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such that 0 <= v < 1/x.
Such a v exists, as v could be zero - unless, for some large x like DBL_MAX, 1.0/x is itself zero. That is a possibility. With sub-normals it is rarely the case, as 1.0/DBL_MAX is representable as a value greater than 0.
DBL_EPSILON has little to do with the above. OP's issues depend more on DBL_MAX, DBL_MIN and whether double supports sub-normals. Many FP encodings are roughly balanced, so that 1/DBL_MIN lands somewhere near DBL_MAX, yet C does not require that symmetry.
No. Floating point numbers are balanced around 1.0 to minimize the effect of calculating inverses, but this balance is not exact. The middle point for the exponent (the bias value 0x3fff...) gives the same number of powers of two above and below 1.0. But the all-ones exponent value is reserved for infinities and NaNs, while the value 0x0000... is reserved for denormals (also called subnormals). These values are not normalized (and some architectures don't even implement them), but where they are implemented they add as many extra powers of two as the width of the mantissa (though with lower precision) beyond the normalized ones, at the negative-exponent end. This means that there is a set of numbers, quite close to zero, for which computing the inverse always gives infinity.
For doubles you have 52 more powers of two, or around 15 more powers of ten. For floats, this is around 7 more powers of ten.
But this also means that if you calculate the inverse of a large number you'll always get a number different than zero.
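Both effects are easy to demonstrate (assuming IEEE-754 binary64 with subnormals enabled):

#include <cfloat>
#include <cstdio>

int main()
{
    std::printf("%g\n", 1.0 / DBL_MAX);        // ~5.6e-309: below DBL_MIN, but a subnormal, not zero
    std::printf("%g\n", 1.0 / DBL_MIN);        // ~4.5e+307: comfortably finite
    std::printf("%g\n", 1.0 / (DBL_MIN / 4));  // 2^1024 overflows: prints inf
}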
According to The C++ Programming Language - 4th, section 6.2.5:
There are three floating-point types: float (single-precision), double (double-precision), and long double (extended-precision)
Refer to: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1 unless the exponent is stored with all zeros. Thus only 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits (equivalent to log10(2^24) ≈ 7.225 decimal digits).
→ So the maximum precision of a floating-point number is 7 decimal digits in the binary32 interchange format (a computer number format that occupies 4 bytes (32 bits) in computer memory).
When I test on different compilers (like GCC, VC compiler)
→ It always outputs 6 as the value.
Take a look into float.h of each compiler
→ I found that 6 is fixed.
Question:
Do you know why there is a difference here (between the theoretical value, 7, and the actual value, 6)?
It sounds like "7" is more reasonable, because when I test using the code below the value is still valid, while with "8" digits it is invalid.
Why don't the compilers check the interchange format for giving decision about the numbers of digits represented in floating-point (instead of using a fixed value)?
Code:
#include <iostream>
#include <limits>
using namespace std;
int main()
{
    cout << numeric_limits<float>::digits10 << endl;
    float f = -9999999;
    cout.precision(10);
    cout << f << endl;
}
You're not reading the documentation.
std::numeric_limits<float>::digits10 is 6:
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow. For base-radix types, it is the value of digits (digits-1 for floating-point types) multiplied by log10(radix) and rounded down.
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
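You can reproduce that boundary case directly; a quick sketch:

#include <iomanip>
#include <iostream>

int main()
{
    float f = 8.589973e9F;                           // nearest float is actually 8589973504
    std::cout << std::setprecision(7) << f << '\n';  // prints 8.589974e+09: the 7th digit did not survive
}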
std::numeric_limits<float>::max_digits10 is 9:
The value of std::numeric_limits<T>::max_digits10 is the number of base-10 digits that are necessary to uniquely represent all distinct values of the type T, such as necessary for serialization/deserialization to text. This constant is meaningful for all floating-point types.
Unlike most mathematical operations, the conversion of a floating-point value to text and back is exact as long as at least max_digits10 were used (9 for float, 17 for double): it is guaranteed to produce the same floating-point value, even though the intermediate text representation is not exact. It may take over a hundred decimal digits to represent the precise value of a float in decimal notation.
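The difference between the two constants shows up with something as simple as 0.1F (a sketch):

#include <iomanip>
#include <iostream>

int main()
{
    float f = 0.1F;
    std::cout << std::setprecision(6) << f << '\n';  // 0.1: digits10 digits round-trip cleanly
    std::cout << std::setprecision(9) << f << '\n';  // 0.100000001: max_digits10 digits pin down the exact float
}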
std::numeric_limits<float>::digits10 equates to FLT_DIG, which is defined by the C standard:
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
    p × log10(b)           if b is a power of 10
    ⌊(p − 1) × log10(b)⌋   otherwise
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
The reason for the value 6 (and not 7) is due to rounding errors - not all floating point values with 7 decimal digits can be losslessly represented by a 32-bit float. Rounding errors are limited to 1 bit though, so the FLT_DIG value was calculated based on 23 bits (instead of the full 24):
23 * log10(2) = 6.92
which is rounded down to 6.
With the code below, I get the result "4.31 43099".
double f = atof("4.31");
long ff = f * 10000L;
std::cout << f << ' ' << ff << '\n';
If I change "double f" to "float f", I get the expected result "4.31 43100". I am not sure if changing "double" to "float" is a good solution. Is there a good solution to ensure I get "43100"?
You're not going to be able to eliminate the errors in floating point arithmetic (though with proper analysis you can calculate the error). For casual usage, one thing you can do to get more intuitive results is to replace the built-in float-to-integral conversion (which truncates) with normal rounding:
double f = atof("4.31");
long ff = std::round(f * 10000L);
std::cout << f << ' ' << ff << '\n';
This should output what you expect: 4.31 43100
Also there's no point in using 10000L, because no matter what kind of integral type you use, it still gets converted to f's floating point type for the multiplication. Just use std::round(f * 10000.0);
The problem is that floating point is inexact by nature when talking about decimal numbers. A decimal number can be rounded either up or down when converted to binary, depending on which value is closest.
In this case you just want to make sure that if the number was rounded down, it's rounded up instead. You do this by adding the smallest amount possible to the value, which is done with the nextafter function if you have C++11:
long ff = std::nextafter(f, 1.1*f) * 10000L;
If you don't have nextafter you can approximate it with numeric_limits.
long ff = (f * (1.0 + std::numeric_limits<double>::epsilon())) * 10000L;
I just saw your comment that you only use 4 decimal places, so this would be simpler but less robust:
long ff = (f * 1.0000001) * 10000L;
With standard C types - I doubt it.
There are many values that cannot be represented in those bits - they actually demand more space to be stored. So the floating-point processor just uses the closest possible value.
Floating-point numbers cannot store all the values you think they could - there is only a limited number of bits - you can't put more than 4 billion different values into 32 bits. And that's just the first restriction.
Floating point values (in C) are represented as: sign - one sign bit; power - bits which define the power of two for the number; significand - the bits that actually make up the number.
Your actual number is sign × significand × 2^(power − bias).
A double has 1 bit of sign, 11 bits of power (biased to be positive, but that is not the point) and 52 bits of significand;
That is a lot, but not enough to represent all values, especially those that cannot be expressed as a finite sum of powers of two, like binary 1010.101101(101). For example, it cannot represent precisely values like 1/3 = 0.333333(3). That's the second restriction.
Try these - a decent understanding of the advantages and disadvantages of floating point arithmetic can be very handy:
http://en.wikipedia.org/wiki/Floating_point and http://homepage.cs.uiowa.edu/~atkinson/m170.dir/overton.pdf
There have been some confused answers here! What is happening is this: 4.31 can't be exactly represented as either a single- or double-precision number. In both formats the nearest representable value is a little less than 4.31 (about 4.30999994 as a float, about 4.3099999999999996 as a double). When a floating-point value is assigned to an integer variable, it is rounded towards zero (not towards the nearest integer!).
So if f is double-precision, f * 10000L is a little less than 43100 and is truncated down to 43099. If f is single-precision, the multiplication itself is rounded, and the nearest representable result happens to be exactly 43100.0, which converts to 43100.
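You can check this on an IEEE-754 machine; a sketch (the single-precision product can differ if intermediate arithmetic is evaluated in a wider type):

#include <iomanip>
#include <iostream>

int main()
{
    std::cout << std::setprecision(17)
              << static_cast<double>(4.31F) << '\n'         // 4.309999942779541
              << 4.31 << '\n';                              // 4.3099999999999996
    std::cout << static_cast<long>(4.31F * 10000L) << '\n'  // 43100: the float product rounds to exactly 43100.0
              << static_cast<long>(4.31 * 10000L) << '\n';  // 43099: the double product truncates
}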
The comment by n.m. suggests f * 10000L + 0.5, which I think is the best solution.
I understand that floating point numbers can often include rounding errors.
When you take the floor or ceiling of a float (or double) in order to convert it to an integer, will the resultant value be exact or can the "floored" value still be an approximation?
Basically, is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999, which would convert to 2 when you try to cast that to an int?
Is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999?
The floor() function returns a floating-point value that is an exact integer. So the premise of your question is wrong to begin with.
Now, floor(x) returns the nearest integral value that is not greater than x. It is always true that
floor(x) <= x
and that there exists no integer i, greater than floor(x), such that i <= x.
Looking at floor(3.14159265), this returns 3.0. There's no debate about that. Nothing more to say.
Where it gets interesting is if you write floor(x) where x is the result of an arithmetic expression. Floating point precision and rounding can mean that x falls on the wrong side of an integer. In other words, the true value of the expression that yields x is greater than some integer, i, but that x when evaluated using floating point arithmetic is less than i.
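For example (a classic case with IEEE-754 doubles): the true value of 0.3/0.1 is 3, but neither operand is exactly representable and the rounded quotient lands just below 3:

#include <cmath>
#include <iomanip>
#include <iostream>

int main()
{
    double x = 0.3 / 0.1;
    std::cout << std::setprecision(17) << x << '\n';  // 2.9999999999999996
    std::cout << std::floor(x) << '\n';               // 2, although the "true" answer is 3
}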
Small integers are representable exactly as floats, but big integers are not.
But, as others pointed out, once integers are big enough that float can no longer represent them all, float can no longer represent any fractional parts either, so floor() will never return a non-integer value. Thus the cast to (int), as long as it does not overflow, will be correct.
But how small is small? Copying shamelessly from this answer:
For float, it is 16,777,217 (2^24 + 1).
For double, it is 9,007,199,254,740,993 (2^53 + 1).
Note that the usual range of int (32 bits) is 2^31, so float is unable to represent all of them exactly. Use double if you need that.
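To illustrate the int point (a sketch assuming a 32-bit int):

#include <climits>
#include <iostream>

int main()
{
    float f = static_cast<float>(INT_MAX);    // 2^31 - 1 needs 31 bits; float has a 24-bit significand
    double d = static_cast<double>(INT_MAX);  // double's 53-bit significand holds it exactly
    std::cout << std::fixed << f << '\n';     // 2147483648.000000: rounded up to 2^31
    std::cout << d << '\n';                   // 2147483647.000000
}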
Interestingly, floats can store a certain range of integers exactly, for example:
1 is stored as mantissa 1 (binary 1) * exponent 2^0
2 is stored as mantissa 1 (binary 1) * exponent 2^1
3 is stored as mantissa 1.5 (binary 1.1) * exponent 2^1
4 is stored as mantissa 1 * exponent 2^2
5 is stored as mantissa 1.25 (binary 1.01) * exponent 2^2
6 is stored as mantissa 1.5 (binary 1.1) * exponent 2^2
7 is stored as mantissa 1.75 (binary 1.11) * exponent 2^2
8 is stored as mantissa 1 (binary 1) * exponent 2^3
9 is stored as mantissa 1.125 (binary 1.001) * exponent 2^3
10 is stored as mantissa 1.25 (binary 1.01) * exponent 2^3
...
As you can see, the way the exponents increase works in tandem with the perfectly-stored fractional values the mantissa can represent.
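You can also produce this decomposition programmatically; a small sketch using std::frexp (which returns a significand in [0.5, 1), rescaled here to the [1, 2) convention of the list above):

#include <cmath>
#include <iostream>

int main()
{
    for (int i = 1; i <= 10; ++i)
    {
        int exp = 0;
        double mant = std::frexp(static_cast<double>(i), &exp);
        std::cout << i << " is stored as mantissa " << (mant * 2)
                  << " * exponent 2^" << (exp - 1) << '\n';
    }
}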
You can get a good sense for this by putting numbers into this great online conversion site.
Once you cross a certain threshold, there are not enough digits in the mantissa to divide the span of the increased exponents without skipping first every odd integer value, then three out of every four, then 7 out of 8, etc. For numbers over this threshold, the issue is not that they might differ from integer values by some tiny fractional amount; it's that all the representable values are integers. Not only can no fractional part be represented any more, but as above some of the integers can't be either.
You can observe this in the calculator by considering:
Sign Exponent  Mantissa                  Decimal
0    10010110  11111111111111111111111   16,777,215
0    10010111  00000000000000000000000   16,777,216
0    10010111  00000000000000000000001   16,777,218
See how at this stage, the smallest possible increment of the mantissa is actually "worth 2" in terms of the decimal value represented?
When you take the floor or ceiling of a float (or double) in order to convert it to an integer, will the resultant value be exact or can the "floored" value still be an approximation?
It's always exact. What floor is doing is effectively wiping out any '1's in the mantissa whose significance (their contribution to value) is fractional anyway.
Basically, is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999, which would convert to 2 when you try to cast that to an int?
No.
I am trying to convert an IEEE based floating point number to a MIL-STD 1750A floating point number.
I have attached the specification for both.
I understand how to decompose the floating point 12.375 in IEEE format as per the example on wikipedia.
However, I'm not sure if my interpretation of the MIL-STD is correct.
12.375 = (12)₁₀ + (0.375)₁₀ = (1100)₂ + (0.011)₂ = (1100.011)₂
(1100.011)₂ = 0.1100011 × 2^4 => Exponent, E = 4.
4 in normalised 2's complement is (100)₂ = Exponent
Therefore a MIL-STD 1750A 32 bit floating point number is:
S=0, F=11000110000000000000000, E=00000100
Is my above interpretation correct?
For -12.375, is it just the sign bit that swaps? i.e.:
S=1, F=11000110000000000000000, E=00000100
Or does something funky happen with the fraction part?
The diagram above is a bit misleading, I think. In IEEE format, to switch from positive to negative, you simply flip the sign bit; the remaining bits can be treated as an unsigned number and stay the same. In the MIL-STD format, the mantissa is a two's complement number, so while the first bit does indicate the sign, the remaining 23 bits do not remain the same.
What I get is
S=1, F=00111010000000000000000, E=00000100
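For reference, here is a sketch of packing a value into that 32-bit layout (24-bit two's-complement fraction followed by an 8-bit two's-complement exponent). It is illustrative only, not a validated 1750A encoder - for instance it ignores the renormalization needed when the fraction comes out as exactly -0.5:

#include <cmath>
#include <cstdint>
#include <cstdio>

uint32_t to_1750a(double v)
{
    if (v == 0.0) return 0;        // 1750A represents zero as an all-zero word
    int e = 0;
    double m = std::frexp(v, &e);  // v = m * 2^e with 0.5 <= |m| < 1
    int32_t frac = static_cast<int32_t>(std::lrint(m * 8388608.0));  // m * 2^23
    if (frac == 8388608) { frac /= 2; ++e; }  // rounding pushed |m| up to 1.0
    return (static_cast<uint32_t>(frac) << 8) | (static_cast<uint32_t>(e) & 0xFFu);
}

int main()
{
    std::printf("%08X\n", to_1750a(12.375));   // 63000004: S=0, F=1100011..., E=00000100
    std::printf("%08X\n", to_1750a(-12.375));  // 9D000004: S=1, F=0011101..., E=00000100
}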