Consider the following functions:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <limits>
template <typename Type>
inline Type a(const Type dx, const Type a0, const Type z0, const Type b1)
{
return (std::sqrt(std::abs(2*b1-z0))*dx)+a0;
}
template <typename Type>
inline Type b(const Type dx, const Type a0, const Type z0, const Type a1)
{
return (std::pow((a1-a0)/dx, 2)+ z0)/2;
}
int main(int argc, char* argv[])
{
double dx = 1.E-6;
double a0 = 1;
double a1 = 2;
double z0 = -1.E7;
double b1 = -10;
std::cout<<std::scientific;
std::cout<<std::setprecision(std::numeric_limits<double>::digits10);
std::cout<<a1-a(dx, a0, z0, b(dx, a0, z0, a1))<<std::endl;
std::cout<<b1-b(dx, a0, z0, a(dx, a0, z0, b1))<<std::endl;
return 0;
}
On my machine, it returns:
0.000000000000000e+00
-1.806765794754028e-07
Instead of the expected (0, 0): there is a large rounding error in the second expression.
My question is: how can I reduce the rounding error of each function without changing the types? I need to keep these two function declarations (they come from a larger program), but the formulas can be rearranged.
Sadly, all of the floating point types are notorious for rounding error. They can't even store 0.1 without it (you can prove this using long division by hand: the binary equivalent is 0b0.0001100110011001100...). You might try some workarounds like expanding that pow to a hard-coded multiplication, but you'll ultimately need to code your program to anticipate and minimize the effects of rounding error. Here are a couple ideas:
Never compare floating point values for equality. Some alternative comparisons I have seen include abs(a-b) < delta, percent_difference(a, b) < delta, or even abs(a/b - 1) < delta, where delta is a "suitably small" value you have determined works for this specific test.
Avoid adding long arrays of numbers into an accumulator; the end of the array may be completely lost to rounding error as the accumulator grows large. In "Cuda by Example" by Jason Sanders and Edward Kandrot, the authors recommend recursively adding each pair of elements individually so that each step produces an array half the size of the previous step, until you get a one-element array.
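A serial sketch of that pairwise idea (the book describes a parallel CUDA reduction; this is the same principle in plain C++, and the helper name pairwise_sum is mine):
#include <cstddef>
#include <vector>

// Sum v[lo, hi) by splitting the range in half and adding the two partial
// sums. Intermediate sums stay of comparable magnitude, which limits the
// error that plain left-to-right accumulation builds up.
double pairwise_sum(const std::vector<double>& v, std::size_t lo, std::size_t hi)
{
    if (hi - lo == 1)
        return v[lo];
    std::size_t mid = lo + (hi - lo) / 2;
    return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid, hi);
}

double pairwise_sum(const std::vector<double>& v)
{
    return v.empty() ? 0.0 : pairwise_sum(v, 0, v.size());
}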
In a(), you lose precision when you add a0 (which is exactly 1) to the small and imprecise result of sqrt()*dx.
The function b() doesn't lose any precision using the supplied values.
When you call a() before b() as in the second output, you're doing mathematical operations on a number that's already imprecise, compounding the error.
Try to structure the mathematical operations so you do operations that are less likely to create floating point errors first and those more likely to create floating point errors last.
Or, inside your functions, make sure they are operating on "long double" values. For example, the following uses floating-point promotion to promote double to long double during the first mathematical operation (pay attention to operator precedence):
template <typename Type>
inline Type a(const Type dx, const Type a0, const Type z0, const Type b1)
{
return (std::sqrt(std::abs(2*static_cast<long double>(b1)-z0))*dx)+a0;
}
Related
Consider the following piece of code:
#include <iostream>
#include <cmath>
int main() {
int i = 23;
int j = 1;
int base = 10;
int k = 2;
i += j * pow(base, k);
std::cout << i << std::endl;
}
It outputs "122" instead of "123". Is it a bug in g++ 4.7.2 (MinGW, Windows XP)?
std::pow() works with floating point numbers, which do not have infinite precision, and the Standard Library implementation you are using probably implements pow() in a (poor) way that makes this lack of infinite precision relevant here.
However, you could easily define your own version that works with integers. In C++11, you can even make it constexpr (so that the result could be computed at compile-time when possible):
constexpr int int_pow(int b, int e)
{
return (e == 0) ? 1 : b * int_pow(b, e - 1);
}
Tail-recursive form (credits to Dan Nissenbaum):
constexpr int int_pow(int b, int e, int res = 1)
{
return (e == 0) ? res : int_pow(b, e - 1, b * res);
}
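Applied to the snippet from the question, usage could look like this (only the pow call is swapped for int_pow):
#include <iostream>

constexpr int int_pow(int b, int e, int res = 1)
{
    return (e == 0) ? res : int_pow(b, e - 1, b * res);
}

int main() {
    int i = 23;
    int j = 1;
    int base = 10;
    int k = 2;
    i += j * int_pow(base, k); // pure integer arithmetic, no floating point rounding
    std::cout << i << std::endl; // prints 123
}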
All the other answers so far miss or dance around the one and only problem in the question:
The pow in your C++ implementation is poor quality. It returns an inaccurate answer when there is no need to.
Get a better C++ implementation, or at least replace the math functions in it. The one pointed to by Pascal Cuoq is good.
Not with mine at least:
$ g++ --version | head -1
g++ (GCC) 4.7.2 20120921 (Red Hat 4.7.2-2)
$ ./a.out
123
IDEone is also running version 4.7.2 and gives 123.
Signatures of pow() from http://www.cplusplus.com/reference/cmath/pow/
double pow ( double base, double exponent );
long double pow ( long double base, long double exponent );
float pow ( float base, float exponent );
double pow ( double base, int exponent );
long double pow ( long double base, int exponent );
You should set double base = 10.0; and double i = 23.0.
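If I read that suggestion correctly, the adjusted snippet would be something like this (the default stream precision of six significant digits hides any tiny remaining error when printing):
#include <cmath>
#include <iostream>

int main() {
    double i = 23.0;
    double base = 10.0;
    int j = 1;
    int k = 2;
    i += j * std::pow(base, k); // picks a floating point overload of pow
    std::cout << i << std::endl; // prints 123 even if pow is slightly off
}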
If you simply write
#include <iostream>
#include <cmath>
int main() {
int i = 23;
int j = 1;
int base = 10;
int k = 2;
i += j * pow(base, k);
std::cout << i << std::endl;
}
what do you think pow is supposed to refer to? The C++ standard does not even guarantee that after including cmath you'll have a pow function at global scope.
Keep in mind that all the overloads are at least in the std namespace. There are pow functions that take an integer exponent and there are pow functions that take floating point exponents. It is quite possible that your C++ implementation only declares the C pow function at global scope. This function takes a floating point exponent. The thing is that this function is likely to suffer from approximation and rounding errors. For example, one possible way of implementing that function is:
double pow(double base, double power)
{
return exp(log(base)*power);
}
It's quite possible that pow(10.0, 2.0) yields something like 99.99999999992543453265 due to rounding and approximation errors. Combined with the fact that the floating point to integer conversion yields the number before the decimal point, this explains your result of 122, because 23+99=122.
Try using an overload of pow which takes an integer exponent and/or do some proper rounding from float to int. The overload taking an integer exponent might give you the exact result for 10 to the 2nd power.
Edit:
As you pointed out, trying to use the std::pow(double,int) overload also seems to yield a value slightly less than 100. I took the time to check the ISO standards and the libstdc++ implementation and saw that starting with C++11 the overloads taking integer exponents have been dropped as a result of resolving defect report 550. Enabling C++0x/C++11 support actually removes the overloads in the libstdc++ implementation, which could explain why you did not see any improvement.
Anyhow, it is probably a bad idea to rely on the accuracy of such a function, especially if a conversion to integer is involved. A slight error towards zero will obviously make a big difference if you expect a floating point value that is an integer (like 100) and then convert it to an int-type value. So my suggestion would be to write your own pow function that takes only integers, or to take special care with the double->int conversion using your own rounding function, so that a slight error towards zero does not change the result.
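A sketch of the "proper rounding" route, using std::lround so that a result like 99.999... becomes 100 instead of being truncated to 99 (this still assumes pow is within half a unit of the exact answer):
#include <cmath>
#include <iostream>

int main() {
    int i = 23;
    int j = 1;
    int base = 10;
    int k = 2;
    // std::lround rounds to the nearest integer instead of truncating.
    i += j * std::lround(std::pow(base, k));
    std::cout << i << std::endl; // prints 123
}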
Your problem is not a bug in gcc, that's absolutely certain. It may be a bug in the implementation of pow, but I think your problem is really simply the fact that you are using pow, which gives an imprecise floating point result (because it is implemented as something like exp(power * log(base)), and log(base) is never going to be absolutely accurate unless base is a power of e).
Problem: I'm looking for a way of rounding some float f to the closest int in general -- especially if f is large. Mathematically speaking, I'd like to compute the function
r(f) = argmin_{t in T} |f - t|,
where T denotes the set of ints representable by my machine. In case of ties (e.g. .5), r(f) can be defined arbitrarily.
Current Code: Below my current solution including two unsatisfying float examples (in main):
#include <cmath>
#include <iostream>
#include <limits>
template <class T>
T projection(T const min, T t, T const max) {
return std::max(std::min(t, max), min);
}
template <class Out, class In>
Out repr(In in) {
using Limits = std::numeric_limits<Out>;
auto next = [](Out val) {
auto const zero = static_cast<In>(0);
return std::nexttoward(static_cast<In>(val), zero);
};
return projection(next(Limits::lowest()), std::round(in), next(Limits::max()));
};
int main() {
std::cout
<< repr<int>(std::numeric_limits<float>::max()) << " "
<< repr<int>(static_cast<float>(std::numeric_limits<int>::max())) << "\n";
}
On my machine with 32bit ints this prints:
2147483520 2147483520
Short elaboration: For the upper bound, next computes the next smaller float that can be safely converted to int with static_cast (analogously for the lower bound). This is necessary, as the float examples in main demonstrate: without next, repr would invoke undefined behavior by casting a float of value (at least) std::numeric_limits<int>::max() + 1 to int, a value that int cannot represent.
The obvious downside of my repr is that it is incorrect in the mathematical sense: for large floats (e.g. std::numeric_limits<float>::max()) it doesn't return std::numeric_limits<int>::max().
Questions:
Is there an easier way to solve the problem (easier in the sense of less manual number crunching and more delegating to std functions)?
How can repr be made correct (in the mathematical sense) with fully defined behavior only (no undefined and no implementation defined behavior)?
So far I've been talking about int and float, but (as the templates already suggest) this should only be a start. What about the combinations
double and long or
double and long long?
I have a function that takes floats, I'm doing some computation with them, and I'd like to keep as much accuracy as possible in the returned result. I read that when you multiply two floats, you double the number of significant digits.
So when two floats get multiplied, for example float e, f; and I do double g = e * f, when do the bits get truncated?
In my example function below, do I need casting, and if yes, where? This is in a tight inner loop; if I put static_cast<double>(x) around each of the variables a, b, c, d where they are used, I get a 5-10% slowdown. But I suspect I don't need to cast each variable separately, only in some locations, if at all. Or does returning a double here not give me any gain anyway, so I might as well just return a float?
double func(float a, float b, float c, float d) {
return (a - b) * c + (a - c) * b;
}
When you multiply two floats without casting, the result is calculated with float precision (i.e. truncated) and then converted to double.
To calculate the result in double, you need to cast at least one operand to double first. Then the entire calculation will be done in double (and all float values will be converted). However, that will create the same slowdown. The slowdown is likely because converting a number from float to double is not entirely trivial (different bit sizes and ranges of exponent and mantissa).
If I were doing that and had control over the function definition, I'd pass all the arguments as double. (I generally use double everywhere; on modern computers the speed difference between calculating in float vs double is negligible, and the only concerns are memory throughput and cache performance when operating on large arrays of values.)
By the way, the case that matters for precision actually isn't the multiplication but the addition/subtraction - that is where precision can make a big difference. Consider adding/subtracting 1e+6 and 1e-3.
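A minimal sketch of the cast-one-operand approach, keeping the original float signature (just an illustration of the idea above):
// One cast per independent subexpression is enough: the remaining operands
// are converted to double by the usual arithmetic conversions.
double func(float a, float b, float c, float d) {
    return (static_cast<double>(a) - b) * c + (static_cast<double>(a) - c) * b;
}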
Meaning is more important than 5-10% slowdown. What I'd do:
double func_impl(double a, double b, double c, double d) {
return (a - b) * c + (a - c) * b;
}
double func(float a, float b, float c, float d) {
return func_impl(a, b, c, d);
}
I'd choose this even if it's a bit slower, because it clearly expresses the idea that you want double precision in your calculations and only need the floats at the interface, and it keeps the body of your function separate from the casting (the latter being done in one step).
I used to replace const with #define, but in the example below it prints false.
#include <iostream>
#define x 3e+38
using namespace std;
int main() {
float p = x;
if (p==x)
cout<<"true"<<endl;
else
cout<<"false"<<endl;
return 0;
}
But if I replace
#define x 3e+38
with
const float x = 3e+38;
it works perfectly. The question is why? (I know there are several topics discussing #define vs const, but I really didn't get this case; kindly enlighten me.)
In C++ floating-point literals are double precision by default. In the first example the number 3e+38 is first converted to float in the variable initialization and then back to double precision in the comparison. The conversions are not necessarily exact, so the numbers may differ. In the second example the numbers stay float the whole time. To fix it you can change p to double, write
#define x 3e+38f
(which defines a float literal), or change the comparison to
if (p == static_cast<float>(x))
which performs the same conversion as the variable initialization and then does the comparison in single precision.
Also, as commented, comparing floating point numbers with == is not usually a good idea, as rounding errors yield unexpected results; e.g., (x*y)*z might differ from x*(y*z).
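For instance, the second of those fixes as a complete program (only the f suffix added to the macro):
#include <iostream>
#define x 3e+38f // float literal, so both sides of the comparison are float
using namespace std;
int main() {
    float p = x;
    if (p == x)
        cout << "true" << endl; // now prints "true"
    else
        cout << "false" << endl;
    return 0;
}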
The literal 3e+38 has type double (unsuffixed floating-point literals always do), and its value cannot be stored exactly in a float.
The assignment
float p = x;
causes 3e+38 to lose precision, and hence its exact value, when stored in p.
That's why the comparison
if(p==x)
results in false, because p holds a different value than 3e+38.
If I have the following declaration:
float a = 3.0 ;
is that an error? I read in a book that 3.0 is a double value and that I have to specify it as float a = 3.0f. Is it so?
It is not an error to declare float a = 3.0: if you do, the compiler will convert the double literal 3.0 to a float for you.
However, you should use the float literal notation in certain scenarios.
For performance reasons:
Specifically, consider:
float foo(float x) { return x * 0.42; }
Here the compiler will emit a conversion (that you will pay at runtime) for each returned value. To avoid it you should declare:
float foo(float x) { return x * 0.42f; } // OK, no conversion required
To avoid bugs when comparing results:
e.g. the following comparison fails :
float x = 4.2;
if (x == 4.2)
std::cout << "oops"; // Not executed!
We can fix it with the float literal notation :
if (x == 4.2f)
std::cout << "ok !"; // Executed!
(Note: of course, this is not how you should compare float or double numbers for equality in general)
To call the correct overloaded function (for the same reason):
Example:
void foo(float f) { std::cout << "\nfloat"; }
void foo(double d) { std::cout << "\ndouble"; }
int main()
{
foo(42.0); // calls double overload
foo(42.0f); // calls float overload
return 0;
}
As noted by Cyber, in a type deduction context, it is necessary to help the compiler deduce a float:
In case of auto :
auto d = 3; // int
auto e = 3.0; // double
auto f = 3.0f; // float
And similarly, in case of template type deduction :
void foo(float f) { std::cout << "\nfloat"; }
void foo(double d) { std::cout << "\ndouble"; }
template<typename T>
void bar(T t)
{
foo(t);
}
int main()
{
bar(42.0); // Deduce double
bar(42.0f); // Deduce float
return 0;
}
The compiler will turn any of the following literals into floats, because you declared the variable as a float.
float a = 3; // converted to float
float b = 3.0; // converted to float
float c = 3.0f; // float
It would matter if you used auto (or other type-deducing methods), for example:
auto d = 3; // int
auto e = 3.0; // double
auto f = 3.0f; // float
Floating point literals without a suffix are of type double; this is covered in the draft C++ standard section 2.14.4 Floating literals:
[...]The type of a floating literal is double unless explicitly specified by a suffix.[...]
So is it an error to assign 3.0, a double literal, to a float?
float a = 3.0;
No, it is not; the value will be converted, which is covered in section 4.8 Floating point conversions:
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined.
We can read more details on the implications of this in GotW #67: double or nothing which says:
This means that a double constant can be implicitly (i.e., silently)
converted to a float constant, even if doing so loses precision (i.e.,
data). This was allowed to remain for C compatibility and usability
reasons, but it's worth keeping in mind when you do floating-point
work.
A quality compiler will warn you if you try to do something that's
undefined behavior, namely put a double quantity into a float that's
less than the minimum, or greater than the maximum, value that a float
is able to represent. A really good compiler will provide an optional
warning if you try to do something that may be defined but could lose
information, namely put a double quantity into a float that is between
the minimum and maximum values representable by a float, but which
can't be represented exactly as a float.
So there are caveats for the general case that you should be aware of.
From a practical perspective, in this case the results will most likely be the same even though technically there is a conversion; we can see this by trying out the following code on godbolt:
#include <iostream>
float func1()
{
return 3.0; // a double literal
}
float func2()
{
return 3.0f ; // a float literal
}
int main()
{
std::cout << func1() << ":" << func2() << std::endl ;
return 0;
}
and we see that the results for func1 and func2 are identical, using both clang and gcc:
func1():
movss xmm0, DWORD PTR .LC0[rip]
ret
func2():
movss xmm0, DWORD PTR .LC0[rip]
ret
As Pascal points out in this comment, you won't always be able to count on this. Using 0.1 and 0.1f respectively causes the generated assembly to differ, since the conversion must now be done explicitly. The following code:
float func1(float x )
{
return x*0.1; // a double literal
}
float func2(float x)
{
return x*0.1f ; // a float literal
}
results in the following assembly:
func1(float):
cvtss2sd %xmm0, %xmm0 # x, D.31147
mulsd .LC0(%rip), %xmm0 #, D.31147
cvtsd2ss %xmm0, %xmm0 # D.31147, D.31148
ret
func2(float):
mulss .LC2(%rip), %xmm0 #, D.31155
ret
Regardless of whether you can determine if the conversion will have a performance impact, using the correct type better documents your intention. Using an explicit conversion, for example static_cast, also helps clarify that the conversion was intended rather than accidental, which may otherwise signify a bug or potential bug.
Note
As supercat points out, multiplication by 0.1 and by 0.1f is not equivalent. I am just going to quote the comment because it was excellent and a summary probably would not do it justice:
For example, if f was equal to 100000224 (which is exactly
representable as a float), multiplying it by one tenth should yield a
result which rounds down to 10000022, but multiplying by 0.1f will
instead yield a result which erroneously rounds up to 10000023. If the
intention is to divide by ten, multiplication by double constant 0.1
will likely be faster than division by 10f, and more precise than
multiplication by 0.1f.
My original point was to demonstrate a false example given in another question, but this finely demonstrates that subtle issues can exist even in toy examples.
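A small test along the lines of that comment (assuming IEEE-754 float and double on the target) should make the difference visible:
#include <iostream>

int main() {
    float f = 100000224.0f;     // exactly representable as a float
    float via_double = f * 0.1; // multiply in double, then round to float once
    float via_float = f * 0.1f; // multiply entirely in float
    std::cout << std::fixed;
    std::cout << via_double << "\n"; // expected: 10000022.000000
    std::cout << via_float << "\n";  // expected: 10000023.000000
}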
It's not an error in the sense that the compiler will reject it, but it is an error in the sense that it may not be what you want.
As your book correctly states, 3.0 is a value of type double. There is an implicit conversion from double to float, so float a = 3.0; is a valid definition of a variable.
However, at least conceptually, this performs a needless conversion. Depending on the compiler, the conversion may be performed at compile time, or it may be saved for run time. A valid reason for saving it for run time is that floating-point conversions are difficult and may have unexpected side effects if the value cannot be represented exactly, and it's not always easy to verify whether the value can be represented exactly.
3.0f avoids that problem: although technically, the compiler is still allowed to calculate the constant at run time (it always is), here, there is absolutely no reason why any compiler might possibly do that.
While not an error per se, it is a little sloppy. You know you want a float, so initialize it with a float. Another programmer may come along and not be sure which part of the declaration is correct, the type or the initializer. Why not have them both be correct?
float Answer = 42.0f;
When you define a variable, it is initialized with the provided initializer. This may require converting the value of the initializer to the type of the variable that's being initialized. That's what's happening when you say float a = 3.0;: The value of the initializer is converted to float, and the result of the conversion becomes the initial value of a.
That's generally fine, but it doesn't hurt to write 3.0f to show that you're aware of what you're doing, and especially if you want to write auto a = 3.0f.
If you try out the following:
std::cout << sizeof(3.2f) <<":" << sizeof(3.2) << std::endl;
you will get output as:
4:8
which shows that 3.2f is taken as a float occupying 4 bytes, whereas 3.2 is interpreted as a double value occupying 8 bytes.
This should provide the answer that you are looking for.
The compiler deduces the best-fitting type from literals, or at least what it thinks is best-fitting. It would rather lose efficiency than precision, i.e. it uses a double instead of a float.
If in doubt, use brace-initializers to make it explicit:
auto d = double{3}; // make a double
auto f = float{3}; // make a float
auto i = int{3}; // make an int
The story gets more interesting if you initialize from another variable, where type-conversion rules apply: while it is legal to construct a double from a literal, it can't be brace-constructed from an int without a possible narrowing conversion:
auto xxx = double{i} // warning ! narrowing conversion of 'i' from 'int' to 'double'