Currently I have a function in an application that takes a float parameter and performs a simple multiplication and division on it. The main application handles its numerical data as ints, so the value is cast to float before it is passed to the function. Unfortunately, when I pass in 0.0, the function does not produce 1.0 (which is what the calculation should give) but 0.0. The program compiles and the calculation is correct as far as I'm aware, so why is it failing to produce the correct output?
Here is the code:
void CarPositionClass::centre(float inputPos)
{
if ((inputPos <= 0) && (inputPos >= -125))
{
membershipC = ((inputPos + 125)*(1 / 125));
}
}
It should also be noted that membershipC is a float variable that is a member of the CarPositionClass.
Change 1 / 125 to, say, 1.0 / 125. 1 / 125 uses integer division, so the result is 0.
Or change this expression
((inputPos + 125)*(1 / 125))
to
(inputPos + 125) / 125
Since inputPos is floating point, so is inputPos + 125, and then dividing a float by an integer is a float.
P.S. This is surely a duplicate question. I expect the C++ gurus to lower the dup hammer any second now. :)
The division between two integers results in an integer. At least one operand has to be a floating point type for it not to truncate the result:
membershipC = ((inputPos + 125)*(1.0 / 125)); // note 1.0 instead of 1
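To see the difference in isolation, here is a minimal sketch. The free functions centre_buggy and centre_fixed are hypothetical stand-ins for the member function in the question, not the asker's actual class.

```cpp
// Hypothetical stand-ins for CarPositionClass::centre, written as free
// functions so the two expressions can be compared directly.

// 1 / 125 has two integer operands, so it is evaluated as integer
// division and gives 0; the whole product is then 0 for every input.
float centre_buggy(float inputPos) {
    return (inputPos + 125) * (1 / 125);
}

// The sum is a float, so 125 is converted to float before the division.
float centre_fixed(float inputPos) {
    return (inputPos + 125) / 125;
}
```

With these definitions, centre_buggy(0.0f) returns 0 while centre_fixed(0.0f) returns 1.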
I am being paranoid that one of these functions may give an incorrect result like this:
std::floor(2000.0 / 1000.0) --> std::floor(1.999999999999) --> 1
or
std::ceil(18 / 3) --> std::ceil(6.000000000001) --> 7
Can something like this happen? If there is indeed a risk like this, I'm planning to use the functions below in order to work safely. But, is this really necessary?
constexpr long double EPSILON = 1e-10;
intmax_t GuaranteedFloor(const long double & Number)
{
if (Number > 0)
{
return static_cast<intmax_t>(std::floor(Number) + EPSILON);
}
else
{
return static_cast<intmax_t>(std::floor(Number) - EPSILON);
}
}
intmax_t GuaranteedCeil(const long double & Number)
{
if (Number > 0)
{
return static_cast<intmax_t>(std::ceil(Number) + EPSILON);
}
else
{
return static_cast<intmax_t>(std::ceil(Number) - EPSILON);
}
}
(Note: I'm assuming that the given 'long double' argument will fit in the 'intmax_t' return type.)
People often get the impression that floating point operations produce results with small, unpredictable, quasi-random errors. This impression is incorrect.
Floating point arithmetic computations are as exact as possible. 18/3 will always produce exactly 6. The result of 1/3 won't be exactly one third, but it will be the closest number to one third that is representable as a floating point number.
So the examples you showed are guaranteed to always work. As for your suggested "guaranteed floor/ceil", it's not a good idea. Certain sequences of operations can easily blow the error far above 1e-10, and certain other use cases will require 1e-10 to be correctly recognized (and ceil'ed) as nonzero.
As a rule of thumb, hardcoded epsilon values are bugs in your code.
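A small sketch of the two cases from the question, without any epsilon. It assumes IEEE 754 doubles, where a quotient that is exactly representable is returned exactly; the helper names floor_quotient and ceil_quotient are made up for illustration.

```cpp
#include <cmath>
#include <cstdint>

// Assumes IEEE 754 doubles: division is correctly rounded, so when the
// true quotient is representable (2000/1000 = 2, 18/3 = 6) it is returned
// exactly and floor/ceil see a whole number.
std::intmax_t floor_quotient(double a, double b) {
    return static_cast<std::intmax_t>(std::floor(a / b));
}

std::intmax_t ceil_quotient(double a, double b) {
    return static_cast<std::intmax_t>(std::ceil(a / b));
}
```

Under that assumption floor_quotient(2000.0, 1000.0) is 2 and ceil_quotient(18.0, 3.0) is 6.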
In the specific examples you're listing, I don't think those errors would ever occur.
std::floor(2000.0 /*Exactly Representable in 32-bit or 64-bit Floating Point Numbers*/ / 1000.0 /*Also exactly representable*/) --> std::floor(2.0 /*Exactly Representable*/) --> 2
std::ceil(18 / 3 /*both treated as ints, might not even compile if ceil isn't properly overloaded....?*/) --> 6
std::ceil(18.0 /*Exactly Representable*/ / 3.0 /*Exactly Representable*/) --> 6
Having said that, if you have math that depends on these functions behaving exactly correctly for floating point numbers, that may illuminate a design flaw you need to reconsider/reexamine.
As long as the floating-point values x and y exactly represent integers within the limits of the type you're using, there's no problem--x / y will always yield a floating-point value that exactly represents the integer result. Casting to int as you're doing will always work.
However, once the floating-point values go outside the integer-representable range for the type (Representing integers in doubles), epsilons don't help.
Consider this example. 16777217 is the smallest integer not exactly representable as a 32-bit float:
int ix=16777217, iy=97;
printf("%d / %d = %d", ix, iy, ix/iy);
// yields "16777217 / 97 = 172961" which is accurate
float x=ix, y=iy;
printf("%f / %f = %f", x, y, x/y);
// yields "16777216.000000 / 97.000000 = 172960.989691"
In this case, the error is negative; in other cases (try 16777219 / 1549), the error is positive.
While it's tempting to add an epsilon to make floor work, it won't extend the accuracy much. When the values differ by more orders of magnitude, the error becomes greater than 1 and integer accuracy can't be guaranteed. Specifically, when x/y exceeds the largest integer the type can represent exactly, the error can exceed 1.0, so the epsilon is no help.
If this is coming into play, you will have to consider changing your mathematical approach--order of operations, work with logarithms, etc.
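One way to check whether a particular integer is still safe in a float is a round trip through the type. This is only a sketch: the helper name is hypothetical and it assumes float is IEEE 754 binary32 with 24 significant bits.

```cpp
#include <cstdint>

// Returns true when n converts to float and back without changing.
// Assumes float is IEEE 754 binary32 (24 significant bits).
// The comparison is done in 64 bits so that a float rounded up to 2^31
// can still be converted back without overflow.
bool survives_float_round_trip(std::int32_t n) {
    float f = static_cast<float>(n);
    return static_cast<std::int64_t>(f) == n;
}
```

Under that assumption 16777216 and 16777218 survive the round trip, while 16777217 does not.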
Such results are likely to appear when working with doubles. You can use round or you can subtract 0.5 then use std::ceil function.
I might be missing something very basic here. But I don't know how to figure out that basic thing. When I set T to 10 and dt to 0.1, I should get the result 101 but I am getting the result as 100. Why is it so?
n_sim_steps = (int)(T/dt) + 1
Furthermore, if I execute this as a watch in eclipse, it returns 101, but in code it results in 100.
It should be
n_sim_steps = (int)(T/dt + 0.5) + 1
You are a victim of precision loss.
10 / 0.1 may come out as 99.999999999999 because of this loss, and casting that to int gives 99. Adding 0.5 before the cast makes sure the result is rounded.
You would be better off using the ceil function.
Function signature:
double ceil (double x);
For example, ceil(2.3) returns 3.
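As a sketch of the rounding idea, the helper below uses std::lround from <cmath> instead of the + 0.5 cast. The name sim_steps is hypothetical, and it assumes T/dt is meant to be a whole number that fits in a long.

```cpp
#include <cmath>

// Hypothetical helper: number of simulation steps for total time T and
// step size dt. The quotient is rounded to the nearest integer, so a
// result such as 99.9999... or 100.0000...1 both count as 100.
long sim_steps(double T, double dt) {
    return std::lround(T / dt) + 1;
}
```

With T = 10 and dt = 0.1 this gives 101, whether the quotient comes out slightly below or exactly at 100.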
I'm copying a script from matlab into a c++ function. However, I constantly get different result for the exp function. For example, following snippet:
std::complex<double> final_b = std::exp(std::complex<double>(0, 1 * pi));
should be equivalent to the MATLAB code
final_b = exp(1i * pi);
But it isn't. For MATLAB, I receive -1 + 0i (which is correct) and for c++, I get -1 + -2.068231e-013*i.
Now I thought at the beginning this is just a rounding error of sorts, but for the actual script I'm using, which has bigger complex exponentials, I get completely different numbers. What is the cause of this? How do I fix this?
Edit: I've manually tried calculating the exponential with Euler's formula
exp(x+iy) = exp(x) * (cos(y) + i*sin(y))
and get the same wonky results in c++
That is called floating point approximation (or imprecision):
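If you include the header cfloat there are some definitions. In particular, DBL_EPSILON is the difference between 1.0 and the next representable double, which is about 2.2e-16 for IEEE 754 doubles. Your -2.068231e-013 is larger than that, but still tiny next to the real part of -1; an error of that size suggests the pi constant in your code is only accurate to about 12 significant digits. Rather than testing for exact zero, compare against a tolerance that suits your calculation (the 1e-9 here is only an illustrative choice):
// The complete formula is std::abs(a - b), but since b is zero, I am omitting it
const double tolerance = 1e-9; // illustrative value; pick one that fits your data
if (std::abs(number.imag()) < tolerance) {
// The number is either zero or very close to zero
}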
For example, you can see the working code here: http://ideone.com/2OzNZm
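A minimal sketch of the tolerance idea applied to exp(i*pi). The helper names are made up, the tolerance is the caller's choice, and pi is taken from std::acos(-1.0) on the assumption that the implementation returns a full-precision double for it.

```cpp
#include <cmath>
#include <complex>

// Hypothetical helper: replaces a real or imaginary part whose magnitude
// is below tol with exact zero. tol is chosen by the caller; there is no
// universally correct value.
std::complex<double> snap_to_zero(std::complex<double> z, double tol) {
    double re = std::abs(z.real()) < tol ? 0.0 : z.real();
    double im = std::abs(z.imag()) < tol ? 0.0 : z.imag();
    return {re, im};
}

// exp(i*pi) with pi computed as acos(-1.0) rather than typed in by hand,
// so no digits are lost in the constant itself.
std::complex<double> exp_i_pi() {
    const double pi = std::acos(-1.0);
    return std::exp(std::complex<double>(0.0, pi));
}
```

Calling snap_to_zero(exp_i_pi(), 1e-12) should then give a value whose imaginary part is exactly zero and whose real part is within the tolerance of -1.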
Consider the following code snippet:
float f = 0.01 ;
printf("%f\n",f - 0.01);
if (f - 0.01 == 0)
{
printf("%f\n",f - 0.01);
}
When I run this code, for the second line I get the output -0.000000, and the if condition does not execute.
What is the reason for the -0.000000?
I remember from a digital logic class I took in college that this arises due to internal representation using one's complement. Please correct me if I'm wrong, and please suggest fixes and how to avoid this in the future.
I'm using clang to compile my code, if it matters.
You're running into two problems:
0.01 can't be represented exactly as a binary floating-point value
f has type float while 0.01 has type double
Your calculation requires a conversion from double to float and back which (apparently) isn't giving exactly the same value that it started with.
You might be able to fix this specific example by sticking to a single type (float or double) for all values; but you'll still have problems if you want to compare the results of more complicated calculations for exact equality.
0.01 is a double not a float (You probably have warnings about this when you compile your code.)
So, you're basically converting your "0.01" backwards and forwards between floats and doubles, which is what's causing your discrepancies.
So decide if you want to use floats (e.g 0.01f) or doubles, and stick with one version throughout.
However, as other answers have pointed out, you'll never get an "exact" value when doing floating point arithmetic - it just doesn't work that way.
For reference, both of these versions will give the answer you're expecting
float f = 0.01f ;
printf("%f\n",f - 0.01f);
if (f - 0.01f == 0)
{
printf("%f\n",f - 0.01f);
}
or
double f = 0.01 ;
printf("%f\n",f - 0.01);
if (f - 0.01 == 0)
{
printf("%f\n",f - 0.01);
}
both print
0.000000
0.000000
The reason is that 0.01 can't be represented exactly as a binary floating-point number. An analogy: 10/3 gives 3.333333333333333..., i.e. it can't be represented exactly in decimal. The case of 0.01 in binary is similar. Not every decimal number can be represented exactly in binary floating point.
You have two double 0.01 - one is converted to a float. Due to loss of precision double(float(0.01)) != double(0.01)
Even without the obvious loss of precision you might get into trouble using double(s) only. A compiler might keep one as an extended double in a register and fetch the other from memory (stored as double)
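The double-to-float round trip described in these answers can be checked directly. This is a sketch with hypothetical helper names, assuming IEEE 754 float and double.

```cpp
// Returns true when storing d in a float and widening it again changes
// the value, i.e. when d needs more precision than a float has.
bool changes_in_float(double d) {
    float narrowed = static_cast<float>(d);
    return static_cast<double>(narrowed) != d;
}

// The expression from the question, computed the same way: the literal
// 0.01 is a double, f holds it rounded to float, and the subtraction
// widens f back to double.
double float_minus_double_literal() {
    float f = 0.01;
    return f - 0.01;
}
```

Under those assumptions changes_in_float(0.01) is true, changes_in_float(0.5) is false, and float_minus_double_literal() is a very small negative number, which %f prints as -0.000000.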
I'm developing for a platform without a math library, so I need to build my own tools. My current way of getting the fraction is to convert the float to fixed point (multiply with (float)0xFFFF, cast to int), get only the lower part (mask with 0xFFFF) and convert it back to a float again.
However, the imprecision is killing me. I'm using my Frac() and InvFrac() functions to draw an anti-aliased line. Using modf I get a perfectly smooth line. With my own method pixels start jumping around due to precision loss.
This is my code:
const float fp_amount = (float)(0xFFFF);
const float fp_amount_inv = 1.f / fp_amount;
inline float Frac(float a_X)
{
return ((int)(a_X * fp_amount) & 0xFFFF) * fp_amount_inv;
}
inline float InvFrac(float a_X)
{
return (0xFFFF - (int)(a_X * fp_amount) & 0xFFFF) * fp_amount_inv;
}
Thanks in advance!
If I understand your question correctly, you just want the part after the decimal right? You don't need it actually in a fraction (integer numerator and denominator)?
So we have some number, say 3.14159 and we want to end up with just 0.14159. Assuming our number is stored in float f;, we can do this:
f = f-(long)f;
Which, if we insert our number, works like this:
0.14159 = 3.14159 - 3;
What this does is remove the whole-number portion of the float, leaving only the fractional portion. When you convert the float to a long, it drops the fractional portion. Then when you subtract that from your original float, you're left with only the fractional portion. We use a long here rather than an int because on some platforms long is wider (8 bytes versus 4). Note that a float (typically 4 bytes) covers a far larger range than any integer type, so this only works while the value fits in a long; converting an out-of-range float to an integer type is undefined behavior.
As I suspected, modf does not use any arithmetic per se -- it's all shifts and masks, take a look here. Can't you use the same ideas on your platform?
I would recommend taking a look at how modf is implemented on the systems you use today. Check out uClibc's version.
http://git.uclibc.org/uClibc/tree/libm/s_modf.c
(For legal reasons, it appears to be BSD licensed, but you'd obviously want to double check)
Some of the macros are defined here.
There's a bug in your constants. You're basically trying to do a left shift of the number by 16 bits, mask off everything but the lower bits, then right shift by 16 bits again. Shifting is the same as multiplying by a power of 2, but you're not using a power of 2 - you're using 0xFFFF, which is off by 1. Replacing this with 0x10000 will make the formula work as intended.
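Here is a sketch of the question's fixed-point approach with the scale changed to a power of two. The names fp_scale, fp_scale_inv and FracFixed are made up for this example, and it assumes a_X is non-negative and small enough that a_X * 65536 fits in an int.

```cpp
// Scale by 2^16 so that the multiplication really is a 16-bit shift.
const float fp_scale = 65536.0f;             // 0x10000
const float fp_scale_inv = 1.0f / fp_scale;  // exact: 65536 is a power of two

// Assumes a_X >= 0 and a_X * 65536 fits in int.
inline float FracFixed(float a_X)
{
    // Convert to 16.16 fixed point, keep the low 16 (fraction) bits,
    // then scale back to a float in [0, 1).
    return ((int)(a_X * fp_scale) & 0xFFFF) * fp_scale_inv;
}
```

With this scale, FracFixed(3.5f) gives exactly 0.5, whereas the 0xFFFF version from the question gives a value slightly below it.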
I'm not completely sure, but I think that what you are doing is wrong, since you are only considering the mantissa and forgetting the exponent completely.
You need to use the exponent to shift the value in the mantissa to find the actual integer part.
For a description of the storage mechanism of 32bit floats, take a look here.
Why go to floating point at all for your line drawing? You could just stick to your fixed point version and use an integer/fixed point based line drawing routine instead - Bresenham's comes to mind. While this version isn't anti-aliased, I know there are others that are.
Bresenham's line drawing
Seems like maybe you want this.
float f = something;
float fractionalPart = f - floor(f);
Your method is assuming that there are 16 bits in the fractional part (and as Mark Ransom notes, that means you should shift by 16 bits, i.e. multiply by 0x10000). That might not be true. The exponent is what determines how many bits there are in the fractional part.
To put this in a formula, your method works by calculating (x modf 1.0) as ((x << 16) mod 1<<16) >> 16, and it's that hardcoded 16 which should depend on the exponent - the exact replacement depends on your float format.
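An exponent-driven version can be sketched as follows. This is not the uClibc code, only an illustration of the same idea; the name frac_bits is made up, and it assumes float is IEEE 754 binary32 and that memcpy (or an equivalent byte copy) is available on the platform.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Fractional part of x (same sign as x), assuming IEEE 754 binary32:
// 1 sign bit, 8 exponent bits (bias 127), 23 stored mantissa bits.
float frac_bits(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);

    int exponent = static_cast<int>((bits >> 23) & 0xFF) - 127;
    if (exponent < 0) return x;       // |x| < 1: the whole value is fraction
    if (exponent >= 23) return 0.0f;  // no mantissa bits below the binary point
                                      // (this sketch also returns 0 for inf/NaN)

    // Mantissa bits that lie below the binary point for this exponent.
    std::uint32_t frac_mask = 0x007FFFFFu >> exponent;
    bits &= ~frac_mask;               // clear them, leaving the integer part

    float int_part;
    std::memcpy(&int_part, &bits, sizeof int_part);
    return x - int_part;              // exact: both share the same exponent range
}
```

The mask width follows the exponent, so the hardcoded 16 disappears: frac_bits(3.5f) is 0.5 and frac_bits(-1.75f) is -0.75.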
double frac(double val)
{
return val - trunc(val);
}
// frac(1.0) = 1.0 - 1.0 = 0.0 correct
// frac(-1.0) = -1.0 - -1.0 = 0.0 correct
// frac(1.4) = 1.4 - 1.0 = 0.4 correct
// frac(-1.4) = -1.4 - -1.0 = -0.4 correct
Simple, and it works for both negative and positive values.
One option is to use fmod(x, 1).
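For comparison, the library-based suggestions in these answers can be put side by side. The wrapper names are made up; the point of the sketch is that the variants disagree for negative inputs, so which one is right depends on what the caller needs.

```cpp
#include <cmath>

// Subtracting floor gives a non-negative result: -1.75 gives 0.25.
double frac_floor(double x) { return x - std::floor(x); }

// Subtracting trunc keeps the sign of x: -1.75 gives -0.75.
double frac_trunc(double x) { return x - std::trunc(x); }

// fmod(x, 1.0) also keeps the sign of x: -1.75 gives -0.75.
double frac_fmod(double x) { return std::fmod(x, 1.0); }

// modf returns the fractional part and stores the integer part.
double frac_modf(double x) {
    double int_part;
    return std::modf(x, &int_part);
}
```

For an anti-aliased line with non-negative coordinates all four agree, e.g. each returns 0.5 for 2.5.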