Floating points in C++ (float and double) - c++

I know that we shouldn't use floating points in the loops. But could someone explain it to me what happens when we have a loop and we add a small number to a large number until we reach a certain value that allows the loop to terminate?
I guess it might cause potential errors. But apart from that?
What would it look like with a single-precision (float) and double-precision (double) floating-point numbers? I guess more rounding errors would appear in the double type. Could someone give me an example (the best in C ++) because I have no idea how to start with it...
I would be very grateful if you could provide me with a hint. Thanks!

In a C++ implementation using IEEE-754 arithmetic and the “single” (binary32) format for float, this code prints “count = 3”:
int count = 0;
for (float f = 0; f < .3f; f += .1f)
++count;
std::cout << "count = " << count << ".\n";
but this code prints “count = 4”:
int count = 0;
for (float f = 0; f < .33f; f += .11f)
++count;
std::cout << "count = " << count << ".\n";
In the first example, the source text .1f is converted to 0.100000001490116119384765625, which is the value representable in float that is closed to .1. The source text .3f is converted to 0.300000011920928955078125, the float value closest to .3. Adding this converted value for .1f to f produces 0.100000001490116119384765625, then 0.20000000298023223876953125, and then 0.300000011920928955078125, at which point f < .3f is false, and the loop stops.
In the second example, .11f is converted to 0.10999999940395355224609375, and .33f is converted to 0.3300000131130218505859375. In this case, adding the converted value of .11f to f produces 0.10999999940395355224609375, then 0.2199999988079071044921875, and then 0.329999983310699462890625. Note that, due to rounding, this result of adding .11f three times is 0.329999983310699462890625, which is less than .33f (0.3300000131130218505859375), so f < .33f is true, and the loop continues for another iteration.
This is similar to adding ⅓ in a two-digit decimal format with a loop bound of three-thirds (which is 1). If we had for (f = 0; f < 1; f += ⅓), the ⅓ in the source text would have to be converted to .33 (two-digit decimal). Then f would be stepped through .33, .66, and .99. The loop would not stop until it reached 1.32. The same rounding issues occur in binary floating-point arithmetic.
When the amount added in the loop is a small number relative to the large number, these rounding issues are greater. First, there will be more additions, so there will be more rounding errors, and they may accumulate. Second, since larger numbers require a larger exponent to scale them in the floating-point format, they have less absolute precision than smaller numbers. This means the roundings have to be come larger relative to the small number that is being added. So the rounding errors are larger in magnitude.
Then, even if the loop eventually terminates, the values of f in each iteration may be far from the desired values, due to the accumulated errors. If f is used for calculations inside the loop, the calculations might not be using the desired values and may produce incorrect results.

With increasing values the difference between 2 floating point values increases too. There is a point where i+1 results in the same value.
Consider this code:
#include <iostream>
int main()
{
float i = 0;
while (i != i + 1) i++;
std::cout << i << std::endl;
return 0;
}
while (i != i + 1) should be an endless loop, but for floating point variables, it is not.
The code above prints 1.67772e+07 on https://godbolt.org/z/7xf8n8
So, for (float f = 0; f < 2e7; f++) is an endless loop.
You can try it with double yourself, the value is bigger.

Related

pow() function gives an error [duplicate]

Recently i write a block of code:
const int sections = 10;
for(int t= 0; t < 5; t++){
int i = pow(sections, 5- t -1);
cout << i << endl;
}
And the result is wrong:
9999
1000
99
10
1
If i using just this code:
for(int t = 0; t < 5; t++){
cout << pow(sections,5-t-1) << endl;
}
The problem doesn't occur anymore:
10000
1000
100
10
1
Does anyone give me an explaination? thanks you very much!
Due to the representation of floating point values pow(10.0, 5) could be 9999.9999999 or something like this. When you assign that to an integer that got truncated.
EDIT: In case of cout << pow(10.0, 5); it looks like the output is rounded, but I don't have any supporting document right now confirming that.
EDIT 2: The comment made by BoBTFish and this question confirms that when pow(10.0, 5) is used directly in cout that is getting rounded.
When used with fractional exponents, pow(x,y) is commonly evaluated as exp(log(x)*y); such a formula would mathematically correct if evaluated with infinite precision, but may in practice result in rounding errors. As others have noted, a value of 9999.999999999 when cast to an integer will yield 9999. Some languages and libraries use such a formulation all the time when using an exponentiation operator with a floating-point exponent; others try to identify when the exponent is an integer and use iterated multiplication when appropriate. Looking up documentation for the pow function, it appears that it's supposed to work when x is negative and y has no fractional part (when x is negative and `y is even, the result should be pow(-x,y); when y is odd, the result should be -pow(-x,y). It would seem logical that when y has no fractional part a library which is going to go through the trouble of dealing with a negative x value should use iterated multiplication, but I don't know of any spec dictating that it must.
In any case, if you are trying to raise an integer to a power, it is almost certainly best to use integer maths for the computation or, if the integer to be raised is a constant or will always be small, simply use a lookup table (raising numbers from 0 to 15 by any power that would fit in a 64-bit integer would require only a 4,096-item table).
From Here
Looking at the pow() function: double pow (double base, double exponent); we know the parameters and return value are all double type. But the variable num, i and res are all int type in code above, when tranforming int to double or double to int, it may cause precision loss. For example (maybe not rigorous), the floating point unit (FPU) calculate pow(10, 4)=9999.99999999, then int(9999.9999999)=9999 by type transform in C++.
How to solve it?
Solution1
Change the code:
const int num = 10;
for(int i = 0; i < 5; ++i){
double res = pow(num, i);
cout << res << endl;
}
Solution2
Replace floating point unit (FPU) having higher calculation precision in double type. For example, we use SSE in Windows CPU. In Code::Block 13.12, we can do this steps to reach the goal: Setting -> Compiler setting -> GNU GCC Compile -> Other options, add
-mfpmath=sse -msse3
The picture is as follows:
(source: qiniudn.com)
Whats happens is the pow function returns a double so
when you do this
int i = pow(sections, 5- t -1);
the decimal .99999 cuts of and you get 9999.
while printing directly or comparing it with 10000 is not a problem because it is runded of in a sense.
If the code in your first example is the exact code you're running, then you have a buggy library. Regardless of whether you're picking up std::pow or C's pow which takes doubles, even if the double version is chosen, 10 is exactly representable as a double. As such the exponentiation is exactly representable as a double. No rounding or truncation or anything like that should occur.
With g++ 4.5 I couldn't reproduce your (strange) behavior even using -ffast-math and -O3.
Now what I suspect is happening is that sections is not being assigned the literal 10 directly but instead is being read or computed internally such that its value is something like 9.9999999999999, which when raised to the fourth power generates a number like 9999.9999999. This is then truncated to the integer 9999 which is displayed.
Depending on your needs you may want to round either the source number or the final number prior to assignment into an int. For example: int i = pow(sections, 5- t -1) + 0.5; // Add 0.5 and truncate to round to nearest.
There must be some broken pow function in the global namespace. Then std::pow is "automatically" used instead in your second example because of ADL.
Either that or t is actually a floating-point quantity in your first example, and you're running into rounding errors.
You're assigning the result to an int. That coerces it, truncating the number.
This should work fine:
for(int t= 0; t < 5; t++){
double i = pow(sections, 5- t -1);
cout << i << endl;
}
What happens is that your answer is actually 99.9999 and not exactly 100. This is because pow is double. So, you can fix this by using i = ceil(pow()).
Your code should be:
const int sections = 10;
for(int t= 0; t < 5; t++){
int i = ceil(pow(sections, 5- t -1));
cout << i << endl;
}

C++ Modulus returning wrong answer

Here is my code :
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
int n, i, num, m, k = 0;
cout << "Enter a number :\n";
cin >> num;
n = log10(num);
while (n > 0) {
i = pow(10, n);
m = num / i;
k = k + pow(m, 3);
num = num % i;
--n;
cout << m << endl;
cout << num << endl;
}
k = k + pow(num, 3);
return 0;
}
When I input 111 it gives me this
1
12
1
2
I am using codeblocks. I don't know what is wrong.
Whenever I use pow expecting an integer result, I add .5 so I use (int)(pow(10,m)+.5) instead of letting the compiler automatically convert pow(10,m) to an int.
I have read many places telling me others have done exhaustive tests of some of the situations in which I add that .5 and found zero cases where it makes a difference. But accurately identifying the conditions in which it isn't needed can be quite hard. Using it when it isn't needed does no real harm.
If it makes a difference, it is a difference you want. If it doesn't make a difference, it had a tiny cost.
In the posted code, I would adjust every call to pow that way, not just the one I used as an example.
There is no equally easy fix for your use of log10, but it may be subject to the same problem. Since you expect a non integer answer and want that non integer answer truncated down to an integer, adding .5 would be very wrong. So you may need to find some more complicated work around for the fundamental problem of working with floating point. I'm not certain, but assuming 32-bit integers, I think adding 1e-10 to the result of log10 before converting to int is both never enough to change log10(10^n-1) into log10(10^n) but always enough to correct the error that might have done the reverse.
pow does floating-point exponentiation.
Floating point functions and operations are inexact, you cannot ever rely on them to give you the exact value that they would appear to compute, unless you are an expert on the fine details of IEEE floating point representations and the guarantees given by your library functions.
(and furthermore, floating-point numbers might even be incapable of representing the integers you want exactly)
This is particularly problematic when you convert the result to an integer, because the result is truncated to zero: int x = 0.999999; sets x == 0, not x == 1. Even the tiniest error in the wrong direction completely spoils the result.
You could round to the nearest integer, but that has problems too; e.g. with sufficiently large numbers, your floating point numbers might not have enough precision to be near the result you want. Or if you do enough operations (or unstable operations) with the floating point numbers, the errors can accumulate to the point you get the wrong nearest integer.
If you want to do exact, integer arithmetic, then you should use functions that do so. e.g. write your own ipow function that computes integer exponentiation without any floating-point operations at all.

My for-loop does not start [duplicate]

This question already has answers here:
Compare double to zero using epsilon
(12 answers)
Closed 8 years ago.
I know there are loads of topics about this question, but none of those helped me. I am trying to find the root of a function by testing every number in a range of -10 to 10 with two decimal places. I know it maybe isn't the best way, but I am a beginner and just want to try this out. Somehow the loop does not work, as I am always getting -10 as an output.
Anyway, that is my code:
#include <iostream>
using namespace std;
double calc (double m,double n)
{
double x;
for (x=-10;x<10 && m*x+n==0; x+=0.01)
{
cout << x << endl;
}
return x;
}
int main()
{
double m, n, x;
cout << "......\n";
cin >> m; // gradient
cout << "........\n";
cin >> n; // y-intercept
x=calc(m,n); // using function to calculate
cout << ".......... " << x<< endl; //output solution
cout << "..............\n"; // Nothing of importance
return 0;
}
You are testing the conjunction of two conditions in your loop condition.
for (x=-10;x<10 && m*x+n==0; x+=0.01
For many inputs, the second condition will not be true, so the loop will terminate before the first iteration, causing a return value of -10.
What you want is probably closer to something closer to the following. We need to test whether the absolute value is smaller than some EPSILON for two reasons. One, double is not precise. Two, you are doing an approximate solution anyways, so you would not expect an exact answer unless you happened to get lucky.
#define EPSILON 1E-2
double calc (double m,double n)
{
double x;
for (x=-10;x<10; x+=0.001)
{
if (abs(m*x+n) < EPSILON) return x;
}
// return a value outside the range to indicate that we failed to find a
// solution within range.
return -20;
}
Update: At the request of the OP, I will be more specific about what problem EPSILON solves.
double is not precise. In a computer, floating point number are usually represented by a fixed number of bits, with the bit representation usually being specified by a standard such as IEE 754. Because the number of bits is fixed and finite, you cannot represent arbitrary precision numbers. Let us consider an example in base 10 for ease of understanding, although you should understand that computers experience a similar problem in base 2.
If m = 1/3, x = 3, and n = -1, we would expect that m*x + n == 0. However, because 1/3 is the repeated decimal 0.33333... and we can only represent a fixed number of them, the result of 3*0.33333 is actually 0.999999, which is not equal to 1. Therefore, m*x + n != 0, and our check will fail. Thus, instead of checking for equality with zero, we must check whether the result is sufficiently close to zero, by comparing its absolute value with a small number we call EPSILON. As one of the comments pointed out the correct value of EPSILON for this particular purpose is std::numeric_limits::epsilon, but the second issue requires a larger EPSILON.
You are are only doing an approximate solution anyways. Since you are checking the values of x at finitely small increments, there is a strong possibility that you will simply step over the root without ever landing on it exactly. Consider the equation 10000x + 1 = 0. The correct solution is -0.0001, but if you are taking steps of 0.001, you will never actually try the value x = -0.0001, so you could not possibly find the correct solution. For linear functions, we would expect that values of x close to -0.0001, such as x = 0, will get us reasonably close to the correct solution, so we use EPSILON as a fudge factor to work around the lack of precision in our method.
m*x+n==0 condition returns false, thus the loop doesn't start.
You should change it to m*x+n!=0

Why pow(10,5) = 9,999 in C++

Recently i write a block of code:
const int sections = 10;
for(int t= 0; t < 5; t++){
int i = pow(sections, 5- t -1);
cout << i << endl;
}
And the result is wrong:
9999
1000
99
10
1
If i using just this code:
for(int t = 0; t < 5; t++){
cout << pow(sections,5-t-1) << endl;
}
The problem doesn't occur anymore:
10000
1000
100
10
1
Does anyone give me an explaination? thanks you very much!
Due to the representation of floating point values pow(10.0, 5) could be 9999.9999999 or something like this. When you assign that to an integer that got truncated.
EDIT: In case of cout << pow(10.0, 5); it looks like the output is rounded, but I don't have any supporting document right now confirming that.
EDIT 2: The comment made by BoBTFish and this question confirms that when pow(10.0, 5) is used directly in cout that is getting rounded.
When used with fractional exponents, pow(x,y) is commonly evaluated as exp(log(x)*y); such a formula would mathematically correct if evaluated with infinite precision, but may in practice result in rounding errors. As others have noted, a value of 9999.999999999 when cast to an integer will yield 9999. Some languages and libraries use such a formulation all the time when using an exponentiation operator with a floating-point exponent; others try to identify when the exponent is an integer and use iterated multiplication when appropriate. Looking up documentation for the pow function, it appears that it's supposed to work when x is negative and y has no fractional part (when x is negative and `y is even, the result should be pow(-x,y); when y is odd, the result should be -pow(-x,y). It would seem logical that when y has no fractional part a library which is going to go through the trouble of dealing with a negative x value should use iterated multiplication, but I don't know of any spec dictating that it must.
In any case, if you are trying to raise an integer to a power, it is almost certainly best to use integer maths for the computation or, if the integer to be raised is a constant or will always be small, simply use a lookup table (raising numbers from 0 to 15 by any power that would fit in a 64-bit integer would require only a 4,096-item table).
From Here
Looking at the pow() function: double pow (double base, double exponent); we know the parameters and return value are all double type. But the variable num, i and res are all int type in code above, when tranforming int to double or double to int, it may cause precision loss. For example (maybe not rigorous), the floating point unit (FPU) calculate pow(10, 4)=9999.99999999, then int(9999.9999999)=9999 by type transform in C++.
How to solve it?
Solution1
Change the code:
const int num = 10;
for(int i = 0; i < 5; ++i){
double res = pow(num, i);
cout << res << endl;
}
Solution2
Replace floating point unit (FPU) having higher calculation precision in double type. For example, we use SSE in Windows CPU. In Code::Block 13.12, we can do this steps to reach the goal: Setting -> Compiler setting -> GNU GCC Compile -> Other options, add
-mfpmath=sse -msse3
The picture is as follows:
(source: qiniudn.com)
Whats happens is the pow function returns a double so
when you do this
int i = pow(sections, 5- t -1);
the decimal .99999 cuts of and you get 9999.
while printing directly or comparing it with 10000 is not a problem because it is runded of in a sense.
If the code in your first example is the exact code you're running, then you have a buggy library. Regardless of whether you're picking up std::pow or C's pow which takes doubles, even if the double version is chosen, 10 is exactly representable as a double. As such the exponentiation is exactly representable as a double. No rounding or truncation or anything like that should occur.
With g++ 4.5 I couldn't reproduce your (strange) behavior even using -ffast-math and -O3.
Now what I suspect is happening is that sections is not being assigned the literal 10 directly but instead is being read or computed internally such that its value is something like 9.9999999999999, which when raised to the fourth power generates a number like 9999.9999999. This is then truncated to the integer 9999 which is displayed.
Depending on your needs you may want to round either the source number or the final number prior to assignment into an int. For example: int i = pow(sections, 5- t -1) + 0.5; // Add 0.5 and truncate to round to nearest.
There must be some broken pow function in the global namespace. Then std::pow is "automatically" used instead in your second example because of ADL.
Either that or t is actually a floating-point quantity in your first example, and you're running into rounding errors.
You're assigning the result to an int. That coerces it, truncating the number.
This should work fine:
for(int t= 0; t < 5; t++){
double i = pow(sections, 5- t -1);
cout << i << endl;
}
What happens is that your answer is actually 99.9999 and not exactly 100. This is because pow is double. So, you can fix this by using i = ceil(pow()).
Your code should be:
const int sections = 10;
for(int t= 0; t < 5; t++){
int i = ceil(pow(sections, 5- t -1));
cout << i << endl;
}

Can float values add to a sum of zero? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Most effective way for float and double comparison
I have two values(floats) I am attempting to add together and average. The issue I have is that occasionally these values would add up to zero, thus not requiring them to be averaged.
The situation I am in specifically contains the values "-1" and "1", yet when added together I am given the value "-1.19209e-007" which is clearly not 0. Any information on this?
I'm sorry but this doesn't make sense to me.
Two floating point values, if they are exactly the same but with opposite sign, subtracted will produce always 0. This is how floating point operations works.
float a = 0.2f;
float b = -0.2f;
float f = (a - b) / 2;
printf("%f %d\n", f, f != 0); // will print out 0.0000 0
Will be always 0 also if the compiler doesn't optimize the code.
There is not any kind of rounding error to take in account if a and b have the same value but opposite sign! That is, if the higher bit of a is 0 and the higher bit of b is 1 and all other bits are the same, the result cannot be other than 0.
But if a and b are slightly different, of course, the result can be non-zero.
One possible solution to avoid this can be using a tolerance...
float f = (a + b) / 2;
if (abs(f) < 0.000001f)
f = 0;
We are using a simple tolerance to see if our value is near to zero.
A nice example code to show this is...
int main(int argc)
{
for (int i = -10000000; i <= 10000000 * argc; ++i)
{
if (i != 0)
{
float a = 3.14159265f / i;
float b = -a + (argc - 1);
float f = (a + b) / 2;
if (f != 0)
printf("%f %d\n", a, f);
}
}
printf("completed\n");
return 0;
}
I'm using "argc" here as a trick to force the compiler to not optimize out our code.
At least right off, this sounds like typical floating point imprecision.
The usual way to deal with it is to round your numbers to the correct number of significant digits. In this case, your average would be -1.19209e-08 (i.e., 0.00000001192). To (say) six or seven significant digits, that is zero.
Takes the sum of all your numbers, divide by your count. Round off your answer to something reasonable before you do prints, reports comparisons, or whatever you're doing.
again, do some searching on this but here is the basic explanation ...
the computer approximates floating point numbers by base 2 instead of base 10. this means that , for example, 0.2 (when converted to binary) is actually 0.001100110011 ... on forever. since the computer cannot add these on forever, it must approximate it.
because of these approximations, we lose "precision" of calculations. hence "single" and "double" precision floating point numbers. this is why you never test for a float to be actually 0. instead, you test whether is below some threshhold which you want to use as zero.