I already know how floating point numbers are stored in memory, and I understand why the expression 0.1 + 0.2 != 0.3 is true.
But I don't understand why 0.2f + 0.3f == 0.5f is true.
Here is my code:
cout << setprecision(64)
<< "0.3 = " << 0.3 << "\n"
<< "0.2 = " << 0.2 << "\n"
<< "0.2 + 0.3 = " << 0.2 + 0.3 << "\n"
<< "0.3f = " << 0.3f << "\n"
<< "0.2f = " << 0.2f << "\n"
<< "0.2f + 0.3f = " << 0.2f + 0.3f << "\n";
I get output:
0.3 = 0.299999999999999988897769753748434595763683319091796875
0.2 = 0.200000000000000011102230246251565404236316680908203125
0.2 + 0.3 = 0.5
0.3f = 0.300000011920928955078125
0.2f = 0.20000000298023223876953125
0.2f + 0.3f = 0.5
I agree that summing 0.3 + 0.2 as doubles gives 0.5, because 0.299999999999999988897769753748434595763683319091796875 + 0.200000000000000011102230246251565404236316680908203125 = 0.5.
But I still don't understand why the sum 0.2f + 0.3f is 0.5 too. I expected the result to be 0.50000001490116119384765625 (0.300000011920928955078125 + 0.20000000298023223876953125).
Could you please help me understand where I'm wrong?
The basic reason is that although .2f is a little above .2 and .3f is a little above .3, the sum of the excesses is less than halfway from .5 to the next representable float number.
First, let's take note of the scales used for these numbers. In the IEEE-754 binary32 format, the step between numbers in [1, 2) is 2^-23. Each representable number in this interval is an integer multiple of 2^-23.
.3 is in [¼, ½), where the step is 2^-25.
.2 is in [⅛, ¼), where the step is 2^-26.
The literal 0.2f is .2 converted to float. This produces 13,421,773•2^-26, which equals 0.20000000298023223876953125. For 0.3f, we get 10,066,330•2^-25, which is 0.300000011920928955078125.
Let's convert those to the scale used for numbers in [½, 1), where the step is 2^-24. 13,421,773•2^-26 becomes 3,355,443.25•2^-24, and 10,066,330•2^-25 becomes 5,033,165•2^-24. Adding those produces 8,388,608.25•2^-24. To get a representable result, we round that to the nearest integer multiple of 2^-24. The fraction is .25, so we round down, yielding 8,388,608•2^-24, which is .5. The next representable number, 8,388,609•2^-24, which is 0.500000059604644775390625, is further away.
Say
int64_t x = (1UL << 53);
cout << x << endl;
x += 1.0;
cout << x << endl;
The value of x stays the same, '9007199254740992'.
However, x += 1; increments x correctly.
Moreover, with 1UL << 52, adding 1.0 gives the correct result.
I think it could be floating point imprecision. Could someone give me more details about that?
The line x+= 1.0 is evaluated as
x = (int64_t)((double)x + (double)1.0);
The number 2^53 + 1 = 9007199254740993 can't be represented exactly as IEEE double, so it's rounded to 2^53 = 9007199254740992 (this depends on the current rounding mode, actually) which is then (losslessly) converted to an int64_t.
x+= 1.0;
The expression x + 1.0 is done with floating point arithmetic.
Assuming IEEE-754 is used, the double precision floating point type can represent every integer exactly only up to 2^53. Above that, consecutive doubles are 2 apart, so 2^53 + 1 is not representable.
I'm working on a Lisp interpreter and have implemented rational numbers. I thought they had the advantage over doubles of being able to represent numbers like 1/3 exactly. I did some calculations to compare the two, and the results surprised me:
with doubles
(* 3.0 (/ 1.0 3.0)) -> 1
(* 3.0 (/ 4.0 3.0)) -> 4
(* 81.0 (/ 1.0 81.0)) -> 1
with ratios:
(* 3 (/ 1 3)) -> 1
(* 3 (/ 4 3)) -> 4
(* 81 (/ 1 81)) -> 1
Why are the results of the floating point operations exact? There must be a loss of precision, since doubles cannot store an infinite number of digits. Or am I missing something?
I did a quick test with a small C application and got the same result.
#include <stdio.h>
int main()
{
double a1 = 1, b1 = 3;
double a2 = 1, b2 = 81;
printf("result : %f\n", a1 / b1 * b1);
printf("result : %f\n", a2 / b2 * b2);
return 0;
}
Output is:
result : 1.000000
result : 1.000000
Kind regards,
Martin
For the first case, the exact result of the multiply is half way between 1.0 and the largest double that is less than 1.0. Under IEEE 754 round-to-nearest rules, half way numbers are rounded to even, in this case to 1.0. In effect, the rounding of the result of the multiply undid the error introduced by rounding of the division result.
This Java program illustrates what is happening. The conversions to BigDecimal and the BigDecimal arithmetic operations are all exact:
import java.math.BigDecimal;
public class Test {
public static void main(String[] args) {
double a1 = 1, b1 = 3;
System.out.println("Final Result: " + ((a1 / b1) * b1));
BigDecimal divResult = new BigDecimal(a1 / b1);
System.out.println("Division Result: " + divResult);
BigDecimal multiplyResult = divResult.multiply(BigDecimal.valueOf(3));
System.out.println("Multiply Result: " + multiplyResult);
System.out.println("Error rounding up to 1.0: "
+ BigDecimal.valueOf(1).subtract(multiplyResult));
BigDecimal nextDown = new BigDecimal(Math.nextAfter(1.0, 0));
System.out.println("Next double down from 1.0: " + nextDown);
System.out.println("Error rounding down: "
+ multiplyResult.subtract(nextDown));
}
}
The output is:
Final Result: 1.0
Division Result: 0.333333333333333314829616256247390992939472198486328125
Multiply Result: 0.999999999999999944488848768742172978818416595458984375
Error rounding up to 1.0: 5.5511151231257827021181583404541015625E-17
Next double down from 1.0: 0.99999999999999988897769753748434595763683319091796875
Error rounding down: 5.5511151231257827021181583404541015625E-17
The output for the second, similar, case is:
Final Result: 1.0
Division Result: 0.012345679012345678327022824305458925664424896240234375
Multiply Result: 0.9999999999999999444888487687421729788184165954589843750
Error rounding up to 1.0: 5.55111512312578270211815834045410156250E-17
Next double down from 1.0: 0.99999999999999988897769753748434595763683319091796875
Error rounding down: 5.55111512312578270211815834045410156250E-17
This program illustrates a situation in which rounding error can accumulate:
import java.math.BigDecimal;
public class Test {
public static void main(String[] args) {
double tenth = 0.1;
double sum = 0;
for (int i = 0; i < 10; i++) {
sum += tenth;
}
System.out.println("Sum: " + new BigDecimal(sum));
System.out.println("Product: " + new BigDecimal(10.0 * tenth));
}
}
Output:
Sum: 0.99999999999999988897769753748434595763683319091796875
Product: 1
Multiplying by 10 rounds to 1.0. Doing the same multiplication by repeated addition does not get the exact answer.
In the example app below I calculate the floating point remainder from dividing 953 by 0.1, using std::fmod.
Since 953.0 / 0.1 == 9530, I expected std::fmod(953, 0.1) == 0.
Instead I'm getting 0.1. Why is this the case?
Note that with std::remainder I get the correct result.
That is:
std::fmod (953, 0.1) == 0.1 // unexpected
std::remainder(953, 0.1) == 0 // expected
Difference between the two functions:
According to cppreference.com
std::fmod calculates the following:
exactly the value x - n*y, where n is x/y with its fractional part truncated
std::remainder calculates the following:
exactly the value x - n*y, where n is the integral value nearest the exact value x/y
Given my inputs I would expect both functions to have the same output. Why is this not the case?
Exemplar app:
#include <iostream>
#include <cmath>
bool is_zero(double in)
{
return std::fabs(in) < 0.0000001;
}
int main()
{
double numerator = 953;
double denominator = 0.1;
double quotient = numerator / denominator;
double fmod = std::fmod (numerator, denominator);
double rem = std::remainder(numerator, denominator);
if (is_zero(fmod))
fmod = 0;
if (is_zero(rem))
rem = 0;
std::cout << "quotient: " << quotient << ", fmod: " << fmod << ", rem: " << rem << std::endl;
return 0;
}
Output:
quotient: 9530, fmod: 0.1, rem: 0
Because they are different functions.
std::remainder(x, y) calculates IEEE remainder which is x - (round(x/y)*y) where round is rounding half to even (so in particular round(1.0/2.0) == 0)
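std::fmod(x, y) calculates x - trunc(x/y)*y, where x/y is the exact quotient of the stored values. The double nearest to 0.1 is slightly larger than 0.1, so the exact quotient of 953 and that double is slightly smaller than 9530, and truncation gives 9529. The result is therefore 953 - 9529*0.1, which is just under 0.1 and prints as 0.1 at the default precision. (The floating point division 953.0 / 0.1 itself rounds to exactly 9530, which is why the quotient looks fine.)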
Welcome to floating point math. Here's what happens: One tenth cannot be represented exactly in binary, just as one third cannot be represented exactly in decimal. The stored value is slightly above 0.1, so the exact quotient is slightly below 9530. Truncating it produces the integer 9529 instead of 9530, and this leaves almost 0.1 left over.
I just checked the following in Python 2.7:
print 0.1 + 0.2
output :- 0.3
print 0.1 + 0.2 - 0.3
output :- 5.55111512313e-17
But I expected 0.0.
So how can I achieve this?
The problem here is that the float type cannot represent 0.1, 0.2 or 0.3 exactly. If you look at repr(0.1 + 0.2) you'll see that the float result is 0.30000000000000004; print shows 0.3 only because Python 2.7 rounds to 12 significant digits when printing.
So, 5.55111512313e-17 is the difference between that sum and the float closest to 0.3. If you try to cast the result to int, so:
int(0.2 + 0.1 - 0.3)
You'll see 0, and that's the right integer approximation.
You can get an exact 0.0 by using the Decimal class from the decimal module instead of float.
Try this:
from decimal import Decimal
Decimal("0.2") + Decimal("0.1") - Decimal("0.3")
And you'll see that the result is Decimal("0.0")
I have a slider that returns values from 0.0f to 1.0f.
I want to use this value and clamp it to MIN and MAX, but not exactly clamp.
Say min is 0.2f and max is 0.3f. When the slider would be at 0, I want 0.2f. When the slider is at 0.5f, I want 0.25f, and so on.
It's just so that the effect of the slider is not as strong.
Given MIN, MAX and sliderVal, how could I map the sliderVal into that range?
Thanks
slider_range = slider_max - slider_min;
range = range_max - range_min;
value = (double)(slider_pos - slider_min) / slider_range * range + range_min;
Assuming you want the slider to change linearly between 0.2f and 0.3f, the transformation from the interval [0.0, 1.0] to [0.2, 0.3] is trivial:
newVal = 0.2f + (sliderVal)*0.1f;
Looking at this from a mathematical perspective, you want the output to be linear with respect to the input, according to your description. Thus, the transfer function between the input and output values must be of the form:
y = mx + b
Consider the x value to be the input (the slider value), and the y value to be the output (the new, desired value). Thus, you have two points: (0.0, 0.2) and (1.0, 0.3). Substitute these points into the above equation:
0.2 = (0.0)m + b
0.3 = (1.0)m + b
You now have a system of linear equations which is trivial to solve:
0.2 = (0.0)m + b --> b = 0.2
0.3 = (1.0)m + b --> 0.3 = m + 0.2 --> m = 0.1
Thus, the transfer function is:
y = 0.1 * x + 0.2
Q.E.D.
We can generalize the above process. Instead of using points (0.0, 0.2) and (1.0, 0.3), use points (minSlider, minValue) and (maxSlider, maxValue).
minValue = (minSlider)m + b
maxValue = (maxSlider)m + b
Eliminate the variable b:
minValue = (minSlider)m + b
-maxValue = -(maxSlider)m - b
--> minValue-maxValue = (minSlider-maxSlider)m
m = (minValue-maxValue)/(minSlider-maxSlider)
Eliminate the variable m:
minValue*maxSlider = (minSlider*maxSlider)m + b*maxSlider
-maxValue*minSlider = -(minSlider*maxSlider)m - b*minSlider
--> minValue*maxSlider - maxValue*minSlider = b(maxSlider-minSlider)
b = (minValue*maxSlider - maxValue*minSlider)/(maxSlider-minSlider)
You can verify that these equations give you the exact same values for m and b. If we assume that the minimum slider value will always be 0.0:
m = (minValue-maxValue)/(minSlider-maxSlider)
b = (minValue*maxSlider - maxValue*minSlider)/(maxSlider-minSlider)
--> m = (maxValue-minValue)/(maxSlider)
b = minValue
In C++:
const double maxSlider = 1.0;
const double minValue = 0.2;
const double maxValue = 0.3;
double value = (maxValue-minValue)/(maxSlider)*getSliderPosition() + minValue;
Basically you have
0.0f -> MIN
1.0f -> MAX
and you want
clampedVal = sliderVal * ( MAX - MIN ) + MIN
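std::lerp (C++20) does this. It accepts three floating point values and linearly interpolates between the first and second, using the third as the interpolation parameter. It does not clamp: a parameter outside [0, 1] extrapolates.
Quoting from cppreference: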
#include <iostream>
#include <cmath>
int main()
{
float a=10.0f, b=20.0f;
std::cout << "a=" << a << ", " << "b=" << b << '\n'
<< "mid point=" << std::lerp(a,b,0.5f) << '\n'
<< std::boolalpha << (a == std::lerp(a,b,0.0f)) << ' '
<< std::boolalpha << (b == std::lerp(a,b,1.0f)) << '\n';
}
Output:
a=10, b=20
mid point=15
true true