Efficient way to compute the next higher integer after a float? [closed] - c++

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
Surprisingly I can't find an easy reference on this, I want to compute:
float x = /*...*/;
float next = nextint(x);
where next is strictly greater than x (ie if x is an integer, return the next higher integer). Ideally without branches.

You seem to want floor + 1:
float next = floorf(x) + 1; // or std::floor
Note that this gives you the mathematically next integer, rounded to the nearest representable value, which for large x may be x itself. In that case the result is not a strictly larger representable integer. You should consider whether this is what you intend.

There is a way to get the correct result even for large floating point numbers (where the next float may be more than 1 away).
float nextint(float x)
{
    // needs <cmath> and <limits>
    constexpr float MAX_VALUE = std::numeric_limits<float>::max();
    return std::ceil(std::nextafter(x, MAX_VALUE));
}
First, we move to the next representable floating point value (towards positive infinity). Then we round up to the nearest integer.
Proof of correctness:
We trivially satisfy the "strictly greater" criterion because nextafter strictly increases the number and ceil never lowers it.
We never advance by more than one representable integer (that is, we actually get the "next higher" one): Either nextafter(x) is already the next higher representable integer (in which case ceil leaves it unchanged), or it is a float between x and the next higher integer (in which case ceil takes us to the latter).

Related

How can we clearly know the precision in double or float in C/C++? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
Suppose we have a real number a which has infinite precision.
Now, we have floating type double or float in C/C++ and want to represent a using those types. Let's say "a_f" is the name of the variable for a.
I already understood how the values are represented, which consists of the following three parts: sign, fraction, and exponent.
Depending on what types are used, the number of bits assigned for fraction and exponent differ and that determines the "precision".
How is the precision defined in this sense?
Is that the upper bound of absolute difference between a and a_f (|a - a_f|), or is that anything else?
In the case of double, why is the "precision" bounded by 2^{-54}?
Thank you.
The precision of floating point types is normally defined in terms of the number of digits in the mantissa, which can be obtained using std::numeric_limits<T>::digits (where T is the floating point type of interest - float, double, etc).
The number of digits in the mantissa is defined in terms of the radix, obtained using std::numeric_limits<T>::radix.
Both the number of digits and radix of floating point types are implementation defined. I'm not aware of any real-world implementation that supports a floating point radix other than 2 (but the C++ standard doesn't require that).
If the radix is 2, std::numeric_limits<T>::digits is the number of bits (i.e. base-two digits), and that defines the precision of the floating point type. For IEEE754 double precision, that works out to 53 bits of precision (52 explicitly stored, plus one implicit leading bit) - but the C++ standard does not require an implementation to use IEEE floating point representations.
When storing a real value a in a floating point variable, the actual variable stored (what you're describing as a_f) is the nearest approximation that can be represented (assuming effects like overflow do not occur). The difference (or magnitude of the difference) between the two does not only depend on the mantissa - it also depends on the floating point exponent - so there is no fixed upper bound.
Practically (in very inaccurate terms) the possible difference between a value and its floating point approximation is related to the magnitude of the value. Floating point variables do not represent a uniformly distributed set of values between the minimum and maximum representable values - this is a trade-off of representation using a mantissa and exponent, which is necessary to be able to represent a larger range of values than a integral type of the same size.
The thing with floating point numbers is that their absolute accuracy gets worse the larger their magnitude. For example:
double x1 = 10;
double x2 = 20;
std::cout << std::boolalpha << (x1 == x2);
prints, as expected, false.
However, the following code:
#include <limits>
// the greatest finite number representable as double:
double x1 = std::numeric_limits<double>::max();
double x2 = x1 - 10;
std::cout << std::boolalpha << (x1 == x2);
prints, unexpectedly, true, since the numbers are so big that you can't meaningfully represent x1 - 10. It gets rounded to x1.
One may then ask where and what the bounds are. As we see the inconsistencies, we obviously need some tools to inspect them. <limits> and <cmath> are your friends.
std::nextafter:
std::nextafter takes two floats or two doubles. The first argument is our starting point, and the second one gives the direction in which we want to find the next representable value. For example, we can see that:
double x1 = 10;
double x2 = std::nextafter(x1, std::numeric_limits<double>::max());
std::cout << std::setprecision(std::numeric_limits<double>::digits) << x2;
x2 is slightly more than 10. On the other hand:
double x1 = std::numeric_limits<double>::max();
double x2 = std::nextafter(x1, std::numeric_limits<double>::lowest());
std::cout << std::setprecision(std::numeric_limits<double>::digits)
<< x1 << '\n' << x2;
Outputs on my machine:
1.79769313486231570814527423731704356798070567525845e+308
1.7976931348623155085612432838450624023434343715745934e+308
^ difference
The difference is only at the 16th significant decimal digit. Considering that this number is scaled by 10^308, you can see why subtracting 10 changed absolutely nothing.
It's tough to talk about specific values. One may estimate that doubles have about 15 significant decimal digits (counting digits before and after the decimal point together), and that is a decent estimate; however, if you want to be sure, use the convenient tools designed for this specific task.
For instance, number 123456789 may be represented as .12 * 10^9 or maybe .12345 * 10^9 or .1234567 * 10^9. None of these is an exact representation and some are better than the others. Which one you go with depends on how many bits you have for the fraction. More bits means more precision. The number of bits used to represent the fraction is called the "precision".

A small miscalculation in my coordinate system [duplicate]

This question already has answers here:
Why does floating-point arithmetic not give exact results when adding decimal fractions?
(31 answers)
Closed 3 years ago.
I am trying to store coordinate points in a double variable. It is very simple: I have x and y coordinates, and my double variable stores them as x.y. When I tried to convert this value back into separate coordinates, I ran into trouble.
I have tried the following code but still get the same error.
//First try
double temp=memory.pop();
int x=(int)temp;
int y=(int)((temp-(int)temp)*100);
//Second try
double temp=memory.pop();
int x=(int)temp;
int y=100.0f*temp-(((int)temp)*100.0f);
The temp variable holds the double value 5.14. After the calculations, x should be 5 and y should be 14. However, x becomes 5 but y becomes 13.
The issue is related to integer casting. Specifically, (int)temp truncates to the next smaller integer. Moreover, the value 5.14 can be (and probably is) not stored exactly, but as something a bit smaller or larger, as a matter of precision (since, as said in the comments, floating point numbers don't have infinite precision). So, without loss of generality, say it is actually 5.13999999999999: you can see that when you perform your operation, you will obtain 13 instead of 14 for the second part of the coordinate.

Difference between float and double [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
First code
double pi=3.14159,a=100.64;
cin>>a;
double sum=(a*a)*pi;
cout <<fixed<<setprecision(4)<<"Value is="<<sum<<endl;
return 0;
the value is =31819.3103
second code
float pi=3.14159,a=100.64;
float sum=(a*a)*pi;
cout <<fixed<<setprecision(4)<<"Value="<<sum<<endl;
return 0;
the value is =31819.3105
why the difference between two value ?
In both float and double (and all other floating-point types available in C++) the values are represented in floating-point form: to store x = m * 2^p, the values m and p are written to memory.
Obviously, not all real numbers can be represented in such form (especially given that the maximum length of m and p is limited). All the numbers that cannot be represented in such form are rounded to one of the nearest neighbours. Since both 3.14159 and 100.64 are infinite fractions in the binary system, both of them are rounded, and when you write a = 3.14159, a is really a bit different.
Subsequently, the result of calculating some expression on the rounded values is not precise and may vary with the precision of the type used; that's why you see the result you see.
Probably, the value obtained by using double is more precise as double on most architectures and compilers uses more digits of mantissa. To achieve even more precision, consider using long double.

How is ++ defined on a large floating point [duplicate]

This question already has an answer here:
maximum value in float
(1 answer)
Closed 7 years ago.
So I've been looking at IEEE754 floating point double. (My C++ compiler uses that type for a double).
Consider this snippet:
// 9007199254740992 is the 53rd power of 2.
// 590295810358705700000 is the 69th power of 2.
for (double f = 9007199254740992; f <= 590295810358705700000; ++f){
/* what is f?*/
}
Presumably f increments in even steps up to the 54th power of 2, due to rounding up?
Then after that, nothing happens due to rounding down?
Is that correct? Is it even well-defined?
++f is essentially the same as f = f + 1, ignoring the fact that ++f is an expression that yields a value.
Now, for floating point values, the issue of representability comes into play. It may be that f + 1 is not representable. In which case, f + 1 will evaluate to the nearest representable value to the true value of f + 1. In case there are two equally near candidates for nearest representable value, round to even is used.
This is covered in the Operations section of What Every Computer Scientist Should Know About Floating-Point Arithmetic:
The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even).
So, in your example, for sufficiently large values of f, you will find that f == f + 1.
Yes, this loop will never end, because of rounding. I hope the reason is clear to you (since you are familiar with https://en.wikipedia.org/wiki/IEEE_floating_point), but let me describe it in short for the impatient audience.
We can think of floating point as a special representation of a number, imposed by the compiler/FPU/standard. As a simple example, consider:
20000
2e4
0.2e5
All three forms represent the same number. The last two are "scientific" notation, but which is best? IEEE754 answers: the last one, because we can save space by omitting the leading 0 and just writing .2e5. This decimal analogy is very close to the binary representation, where there is space for a mantissa (.2) and an exponent (5).
Now let's do the same for 20000.00000000001:
0.2000000000000001e5
As we can see, the mantissa grows, and there is a limit at which the fixed-size storage would overflow. Instead of raising an exception, we sacrifice precision, which (just as an example) gives us 0.2e5 again.
For bigger numbers (as in the question) we lose precision too:
9007199254740992 is roughly 0.9e16, and when 1 is added, nothing happens.
So f = f + 1 creates an infinite loop.
Since f++ is the same as f = f + 1 (as pointed out in the comments, and as I tested myself), f == f + 1 (!!) holds for sufficiently large f, with the threshold depending on the platform. An explanation is here (for small numbers, but the principle is the same): http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BinMath/addFloat.html
Here's how to add floating point numbers. First, convert the two representations to scientific notation; thus, we explicitly represent the hidden 1. In order to add, we need the exponents of the two numbers to be the same. We do this by rewriting Y. This will result in Y being not normalized, but the value is equivalent to the normalized Y. Add x - y to Y's exponent. Shift the radix point of the mantissa (significand) of Y left by x - y to compensate for the change in exponent. Add the two mantissas of X and the adjusted Y together. If the sum in the previous step does not have a single bit of value 1 left of the radix point, adjust the radix point and exponent until it does. Convert back to the one-byte floating point representation.
In the process of converting the number to the same exponent, due to precision, 1 is rounded to 0, and hence f == f + 1.
According to IEEE754, after the sum the number is rounded to match the double format, and due to the rounding operation, f==f+1.
I don't know whether there are problems for which looping over large floating point values in increments of 1 is a meaningful solution, but people may be stumbling on this question looking for a workaround for their never-ending loop. Therefore, even though the question only asks how the addition is defined by the standard, I'll propose a workaround.
Indeed, for large values of f, f + 1 == f is true, and using f++ as the loop increment means the loop will never terminate.
Assume it's OK for f to be incremented by e, the smallest number of at least 1 for which the floating point representation satisfies f + e > f. In that case, the following workaround, in which the loop always terminates, could be OK:
// use a template, or overloads for different floating point types
// (needs <algorithm>, <cmath> and <limits>)
template<class T>
T add_s(T l, T r) {
    T result = l + r;
    T greater = std::max(l, r);
    if (result == greater)
        return std::nextafter(greater, std::numeric_limits<T>::max());
    return result;
}
// ...
for (double f = /*...*/; f < /*...*/; f = add_s(f, 1.0))
That said, adding tiny floats to huge floats will result in an uncontrollable accumulation of errors. If that's not OK for you, then you need arbitrary precision math, not floating point.

Increasing float value [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Floating point inaccuracy examples
I have the following line inside a WHILE loop, in C/C++:
while(...)
{
x = x + float(0.1); // x is a float type. Is the cast necessary?
}
x starts as 0. The thing is, after my first loop, x = 0.1. That's cool. After my second loop, x = 0.2. That's sweet. But, after my third loop, x = 0.3000001. That's not OK. I want it to have 0.3 as value, not 0.3000001. Can it be done? Am I looping wrongly?
Floating point does not work that way: there are infinitely many real numbers between any two real numbers, and only a finite number of bits. This means that in almost all cases the floating point representation is approximate. Read this link for more info.
It's not the loop, it's just how floats are represented in memory. You don't expect all real numbers to be directly representable in a limited number of bytes, right?
A 0.3 can't be exactly represented by a float. Try a double (not saying it will be exact, it probably won't be, but the error will be smaller).
This is a common misconception about floating point numbers. 0.3 may not be exactly representable in 32-bit or 64-bit binary floating point. Lots of numbers are not exactly representable. Your loop is working fine, ignoring the unnecessary syntax.
while (...)
{
x += 0.1f; /* this will do just fine in C++ and C */
}
If this doesn't make sense consider the fact that there are an infinite number of floating point numbers...with only a finite number of bits to describe them.
Either way, if you need exact results you need to use a decimal type of the proper precision. Good news though, unless you're doing calculations on money you likely do not need exact results (even if you think you do).
Code such as this:
for (int i = 0;… ; ++i)
{
float x = i / 10.f;
…
}
will result in the value of x in each iteration being the float value that is closest to i/10. It will usually not be exact, since the exact value of i/10 is usually not representable in float.
For double, change the definition to:
double x = i / 10.;
This will result in a finer x, so it will usually be even closer to i/10. However, it will still usually not be exactly i/10.
If you need exactly i/10, you should explain your requirements further.
No, the cast is not necessary in this case:
float x;
x = x + float(0.1);
You can simply write:
x += 0.1f;