Loss of precision? - c++

I have a function like:
double calc( double x );
Do I lose precision with any of these expressions:
double y1 = calc( 1.0 / 3 );
double y2 = calc( 1.0 / 3.0 );
Are these more accurate:
double y3 = calc( static_cast<double>(1) / 3 );
double y4 = calc( static_cast<double>(1) / static_cast<double>(3) );
EDIT
Sorry, I had the wrong numbers here.
What I meant was: is 1.0 interpreted as a float or a double, and is that always the case, or does it depend on compiler flags? If it is a float, then 1.0/3 would also be a float and only afterwards be converted to a double. If so, that would cause a loss of precision, wouldn't it?
EDIT 2
I have tested this with g++ and as it turns out, if the program is compiled with -fsingle-precision-constant flag, you do lose precision.
#include <iostream>
#include <limits>
#include <typeinfo>
long double calc( long double val)
{
    return val;
}
int main() {
    std::cout.precision(std::numeric_limits< long double >::max_digits10);
    std::cout << calc(1.0/3.0) << std::endl;
    std::cout << calc(static_cast<float>(1)/3) << std::endl;
    std::cout << calc(static_cast<double>(1)/3) << std::endl;
    std::cout << calc(static_cast<long double>(1)/3) << std::endl;
    std::cout << typeid(1.0).name() << std::endl;
    return 0;
}
The results are,
0.333333343267440795898
0.333333343267440795898
0.33333333333333331483
0.333333333333333333342
f
So, I decided to use static_cast< long double >(1) / 3 to be on the safe side.

None of the alternatives you show loses any precision [at least not if the compiler does what the standard says it should do]. The rule is that for any binary operator where one operand is a double, the other operand is automatically promoted to double [and in general, when the two operands have different types, the narrower one is promoted to the wider one].
In particular, integer values [that fit within the mantissa's number of bits] are always represented exactly.
[Obviously, we have no idea what calc does with your input - that may be the source of any and all kinds of errors, but I'm presuming you are actually asking if 3.0/8.0 will always be 0.375 in the cases you have suggested - of course 3/8 will result in zero, since that is integer division on both sides]
edit in response to the original question being edited:
If the code says 1. or 1.0 or 0.3 or .3, it is a double. If you write 0.5f, it is a float. As per the rules above, 1.0/3 will be the result of double(1.0)/double(3.0).
It is technically possible for a compiler to support only one floating point type, with 3 different ways of writing it - the C and C++ standards have no requirement for double to have more bits than float.
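For completeness, a small compile-time check (my own addition, using C++11 <type_traits>) confirms the literal types and the promoted type of the mixed expression:
#include <type_traits>
// 1.0 and 1.0/3 have type double; only an f/F suffix makes a literal float
// (absent non-standard flags such as -fsingle-precision-constant).
static_assert(std::is_same<decltype(1.0 / 3),   double>::value, "1.0/3 is a double expression");
static_assert(std::is_same<decltype(1.0 / 3.0), double>::value, "1.0/3.0 is a double expression");
static_assert(std::is_same<decltype(0.5f),      float >::value, "0.5f is a float literal");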

No. The constant expression for y1 is implicitly converted into a double. The constant expression for y2 is already a double.
What you are doing in y3 and y4 is defining a constant integer value and casting it to a double, when you could simply define a double-precision floating-point constant, as you have already done.

Related

Numerical accuracy of pow(a/b,x) vs pow(b/a,-x)

Is there a difference in accuracy between pow(a/b,x) and pow(b/a,-x)?
If there is, does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Edit: Let's assume x86_64 processor and gcc compiler.
Edit: I tried comparing using some random numbers. For example:
printf("%.20f",pow(8.72138221/1.761329479,-1.51231)) // 0.08898783049228660424
printf("%.20f",pow(1.761329479/8.72138221, 1.51231)) // 0.08898783049228659037
So, it looks like there is a difference (albeit minuscule in this case), but maybe someone who knows about the algorithm implementation could comment on what the maximum difference is, and under what conditions.
Here's one way to answer such questions, to see how floating-point behaves. This is not a 100% correct way to analyze such question, but it gives a general idea.
Let's generate random numbers. Calculate v0=pow(a/b, n) and v1=pow(b/a, -n) in float precision, and calculate ref=pow(a/b, n) in double precision, rounded to float. We use ref as the reference value (we assume that double has much more precision than float, so ref can be considered the best possible value; this is true for IEEE-754 most of the time). Then sum the differences v0-ref and v1-ref, where each difference is measured as "the number of floating-point numbers between v and ref".
Note that the results may depend on the range of a, b and n (and on the quality of the random generator; if it's really bad, it may give a biased result). Here, I've used a=[0..1], b=[0..1] and n=[-2..2]. Furthermore, this answer assumes that the float and double implementations of division and pow are of the same kind and have the same characteristics.
For my computer, the summed differences are: 2604828 2603684, it means that there is no significant precision difference between the two.
Here's the code (note, this code supposes IEEE-754 arithmetic):
#include <cmath>
#include <stdio.h>
#include <stdlib.h>   // rand, RAND_MAX
#include <string.h>
long long int diff(float a, float b) {
    unsigned int ai, bi;
    memcpy(&ai, &a, 4);
    memcpy(&bi, &b, 4);
    long long int diff = (long long int)ai - bi;
    if (diff<0) diff = -diff;
    return diff;
}
int main() {
    long long int e0 = 0;
    long long int e1 = 0;
    for (int i=0; i<10000000; i++) {
        float a = 1.0f*rand()/RAND_MAX;
        float b = 1.0f*rand()/RAND_MAX;
        float n = 4.0f*rand()/RAND_MAX - 2.0f;
        if (a==0||b==0) continue;
        float v0 = std::pow(a/b, n);
        float v1 = std::pow(b/a, -n);
        float ref = std::pow((double)a/b, n);
        e0 += diff(ref, v0);
        e1 += diff(ref, v1);
    }
    printf("%lld %lld\n", e0, e1);
}
... between pow(a/b,x) and pow(b/a,-x) ... does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Whichever division is more accurate.
Consider z = x^y = 2^(y * log2(x)).
Roughly: The error in y * log2(x) is magnified by the value of z to form the error in z. x^y is very sensitive to the error in x. The larger the |log2(x)|, the greater concern.
In OP's case, both pow(a/b,p) and pow(b/a,-p), in general, have the same y * log2(x) and same z and similar errors in z. It is a question of how x, y are formed:
a/b and b/a, in general, both have the same error of +/- 0.5*unit in the last place and so both approaches are of similar error.
Yet with select values of a/b vs. b/a, one quotient will be more exact and it is that approach with the lower pow() error.
pow(7777777/4,-p) can be expected to be more accurate than pow(4/7777777,p).
Lacking assurance about the error in the division, the general rule applies: no major difference.
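A quick experiment illustrating the 7777777/4 remark above (not a proof; it assumes IEEE-754 double and a long double that is wider than double, as on typical x86 toolchains):
#include <cmath>
#include <cstdio>
int main() {
    // 4 is a power of two, so a/4 is computed exactly, while 4/a is rounded.
    double a = 7777777.0, p = 1.51231;
    long double ref = std::pow((long double)a / 4.0L, -(long double)p);
    double v_exact   = std::pow(a / 4.0, -p);   // exact quotient
    double v_rounded = std::pow(4.0 / a,  p);   // rounded quotient
    std::printf("error with exact quotient:   %Lg\n", (long double)v_exact   - ref);
    std::printf("error with rounded quotient: %Lg\n", (long double)v_rounded - ref);
}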
In general, the form with the positive power is slightly better, although by so little it will likely have no practical effect. Specific cases could be distinguished. For example, if either a or b is a power of two, it ought to be used as the denominator, as the division then has no rounding error.
In this answer, I assume IEEE-754 binary floating-point with round-to-nearest-ties-to-even and that the values involved are in the normal range of the floating-point format.
Given a, b, and x with values a, b, and x, and an implementation of pow that computes the representable value nearest the ideal mathematical value (actual implementations are generally not this good), pow(a/b, x) computes (a/b•(1+e0))^x • (1+e1), where e0 is the rounding error that occurs in the division and e1 is the rounding error that occurs in the pow, and pow(b/a, -x) computes (b/a•(1+e2))^−x • (1+e3), where e2 and e3 are the rounding errors in this division and this pow, respectively.
Each of the errors e0…e3 lies in the interval [−u/2, u/2], where u is the unit of least precision (ULP) of 1 in the floating-point format. (The notation [p, q] is the interval containing all values from p to q, including p and q.) In case a result is near the edge of a binade (where the floating-point exponent changes and the significand is near 1), the lower bound may be −u/4. At this time, I will not analyze this case.
Rewriting, these are (a/b)^x • (1+e0)^x • (1+e1) and (a/b)^x • (1+e2)^−x • (1+e3). This reveals the primary difference is in (1+e0)^x versus (1+e2)^−x. The 1+e1 versus 1+e3 is also a difference, but this is just the final rounding. [I may consider further analysis of this later but omit it for now.]
Consider (1+e0)^x and (1+e2)^−x. The potential values of the first expression span [(1−u/2)^x, (1+u/2)^x], while the second spans [(1+u/2)^−x, (1−u/2)^−x]. When x > 0, the second interval is longer than the first:
The length of the first is (1+u/2)^x − (1−u/2)^x.
The length of the second is (1/(1−u/2))^x − (1/(1+u/2))^x.
Multiplying the latter by (1−u^2/2^2)^x produces ((1−u^2/2^2)/(1−u/2))^x − ((1−u^2/2^2)/(1+u/2))^x = (1+u/2)^x − (1−u/2)^x, which is the length of the first interval.
1−u^2/2^2 < 1, so (1−u^2/2^2)^x < 1 for positive x.
Since the first length equals the second length times a number less than one, the first interval is shorter.
Thus, the form in which the exponent is positive is better in the sense that it has a shorter interval of potential results.
Nonetheless, this difference is very slight. I would not be surprised if it were unobservable in practice. Also, one might be concerned with the probability distribution of errors rather than the range of potential errors. I suspect this would also favor positive exponents.
For evaluation of rounding errors like in your case, it might be useful to use some multi-precision library, such as Boost.Multiprecision. Then, you can compare results for various precisions, e.g., with the following program:
#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>
namespace mp = boost::multiprecision;
template <typename FLOAT>
void comp() {
    FLOAT a = 8.72138221;
    FLOAT b = 1.761329479;
    FLOAT c = 1.51231;
    FLOAT e = mp::pow(a / b, -c);
    FLOAT f = mp::pow(b / a, c);
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << e << std::endl;
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << f << std::endl;
}
int main() {
    std::cout << "Double: " << std::endl;
    comp<mp::cpp_bin_float_double>();
    std::cout << std::endl;
    std::cout << "Double extended: " << std::endl;
    comp<mp::cpp_bin_float_double_extended>();
    std::cout << std::endl;
    std::cout << "Quad: " << std::endl;
    comp<mp::cpp_bin_float_quad>();
    std::cout << std::endl;
    std::cout << "Dec-100: " << std::endl;
    comp<mp::cpp_dec_float_100>();
    std::cout << std::endl;
}
Its output reads, on my platform:
Double:
0.0889878304922865903670015086390776559711
0.0889878304922866181225771242679911665618
Double extended:
0.0889878304922865999079806265115166752366
0.0889878304922865999012043629334822725241
Quad:
0.0889878304922865999004910375213273866639
0.0889878304922865999004910375213273505527
Dec-100:
0.0889878304922865999004910375213273881004
0.0889878304922865999004910375213273881004
Live demo: https://wandbox.org/permlink/tAm4sBIoIuUy2lO6
For double, the first calculation was more accurate; however, I guess one cannot draw any generic conclusions here.
Also, note that your input numbers are not accurately representable with the IEEE 754 double precision floating-point type (none of them is). The question is whether you care about the accuracy of calculations with those exact numbers or with their closest representations.
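To see what the literals actually store, you can print more digits than a double carries (a small sketch of my own; the extra digits reveal the nearest representable binary64 values):
#include <cstdio>
int main() {
    std::printf("%.30g\n", 8.72138221);    // nearest double to 8.72138221
    std::printf("%.30g\n", 1.761329479);   // nearest double to 1.761329479
    std::printf("%.30g\n", 1.51231);       // nearest double to 1.51231
}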

fmod is inconsistent with division [duplicate]

When I use fmod(0.6, 0.2) in C++ it returns 0.2.
I know this is caused by floating-point accuracy, but it seems I have to get the remainder of two doubles at the moment.
Thanks very much for any solutions to this kind of problem.
The mathematical values 0.6 and 0.2 cannot be represented exactly in binary floating-point.
This demo program will show you what's going on:
#include <iostream>
#include <iomanip>
#include <cmath>
int main() {
    const double x = 0.6;
    const double y = 0.2;
    std::cout << std::setprecision(60)
              << "x = " << x << "\n"
              << "y = " << y << "\n"
              << "fmod(x, y) = " << fmod(x, y) << "\n";
}
The output on my system (and very likely on yours) is:
x = 0.59999999999999997779553950749686919152736663818359375
y = 0.200000000000000011102230246251565404236316680908203125
fmod(x, y) = 0.1999999999999999555910790149937383830547332763671875
The result returned by fmod() is correct given the arguments you passed it.
If you need some other result (I presume you were expecting 0.0), you'll have to do something different. There are a number of possibilities. You can check whether the result differs from 0.2 by some very small amount, or perhaps you can do the computation using integer arithmetic (if, for example, all the numbers you're dealing with are multiples of 0.1 or of 0.01).
I think your best bet here would be to use the remainder function instead of fmod. For the given example, it will return a very small number rather than 0.2. You can use this fact to assume a remainder of 0 by rounding to a certain level of precision.
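For instance, something along these lines (the tolerance here is my own arbitrary choice, not a universal constant):
#include <cmath>
#include <cstdio>
int main() {
    double r = std::remainder(0.6, 0.2);   // a tiny value near zero, not 0.2
    bool divisible = std::fabs(r) < 1e-9;  // treat "almost zero" as zero
    std::printf("remainder = %g, divisible = %d\n", r, divisible);
}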
You're right, the problem is a rounding error. Try this code:
#include <stdio.h>
#include <math.h>
int main()
{
    double d6 = 0.6;
    double d2 = 0.2;
    double d = fmod (d6, d2);
    printf ("%30.20e %30.20e %30.20e\n", d6, d2, d);
}
When I run it in gcc 4.4.7 the output is:
5.99999999999999977796e-01 2.00000000000000011102e-01 1.99999999999999955591e-01
As for how to "fix" it, I don't know enough about exactly what you are trying to do to know what to suggest. There will always be numbers that cannot be exactly represented in floating point, so this kind of behavior is unavoidable.
More information about the problem domain would be helpful to get better suggestions. For example, if the numbers you are working with always just have one digit after the decimal point, and are small enough, just multiply by 10, round to the nearest integer (or long or long long), and use the % operator rather than the fmod function. If you are trying to see whether the result of fmod is 0.0, simply accept a result that is close to 0.2 (in this case) as if it were close to 0.
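A rough sketch of the scaled-integer idea, assuming every value is a multiple of 0.1 and small enough to fit in long long:
#include <cstdio>
int main() {
    double x = 0.6, y = 0.2;
    // Scale to tenths and round to the nearest integer before using %.
    long long xi = (long long)(x * 10 + 0.5);
    long long yi = (long long)(y * 10 + 0.5);
    std::printf("%lld %% %lld = %lld\n", xi, yi, xi % yi);   // 6 % 2 = 0
}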

How to loop over exact representations of floating point numbers?

I am trying to loop exactly from one floating point number to the next. Say, I need to loop from std::numeric_limits<float>::epsilon() to 1, which are both exactly representable IEEE754 numbers. My code is:
using nld = std::numeric_limits<float>;
auto h = nld::epsilon();
for (; h < 1; h = std::nextafter(h, 1)) {
    std::cerr << "h: " << h << std::endl;
}
which loops indefinitely because h is exactly representable, so nextafter keeps returning it. I also know that adding machine epsilon to h in a loop will not cut it: floating point numbers are not equally spaced. How do I loop over the exact representations of IEEE754 numbers?
The not equally spaced problem presents itself here:
using nld = std::numeric_limits<float>;
auto h = nld::epsilon();
for (; h < 4; h += nld::epsilon()) {
    if (h == h + nld::epsilon()) {
        std::cerr << "h: " << h << std::endl;
    }
}
which keeps printing 2 for me
Per the comments:
The approach with nextafter is exactly what you should be doing. However, it has some complications that may lead to unexpected results.
Quoting cppreference std::nextafter:
float nextafter( float from, float to ); (1) (since C++11)
double nextafter( double from, double to ); (2) (since C++11)
long double nextafter( long double from, long double to ); (3) (since C++11)
Promoted nextafter( Arithmetic from, Arithmetic to ); (4) (since C++11)
...
4) A set of overloads or a function template for all combinations of arguments of arithmetic type not covered by (1-3). If any argument has integral type, it is cast to double. If any argument is long double, then the return type Promoted is also long double, otherwise the return type is always double.
Since your to is 1, of type int, you get overload version 4, with a return type of double. Now, it's entirely possible that given a float f, (float)nextafter((double)f, 1) is exactly equal to the original f: it's rather likely that the next representable number in type double cannot be represented in float, and that the conversion back to float rounds down.
The only overload that returns float is the one where to has type float. To use that overload, use 1.0f instead of 1.
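With that change, the loop steps through every representable float in the range. A sketch of my own; note that it visits on the order of 2·10^8 values between epsilon and 1, so it takes a while if you print each one:
#include <cmath>
#include <iostream>
#include <limits>
int main() {
    using nld = std::numeric_limits<float>;
    unsigned long long count = 0;
    // 1.0f selects the float overload, so nextafter really advances by one ULP.
    for (float h = nld::epsilon(); h < 1.0f; h = std::nextafter(h, 1.0f))
        ++count;
    std::cout << count << " representable floats between epsilon and 1\n";
}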
Treating them as integers will work for normal positive floats;
negative floats will step in the wrong direction, and denormals and zero may be a special case.
E.g., for positive normal floats:
float nextfloat(float in)
{
    union { float f; uint32_t i; } a;
    a.f = in;
    a.i++;
    return a.f;
}
This relies on the floats having the same endianness and size as the integers; here I pair float with uint32_t, but you could do the same for double and uint64_t... This could actually be a case of undefined behavior (type punning through a union), so testing its operation should probably be part of the build process.
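A variant of the same bit-increment idea using memcpy instead of a union, which sidesteps the type-punning concern in C++ (still limited to positive, finite floats and still assuming IEEE-754 binary32):
#include <cstdint>
#include <cstring>
float nextfloat_memcpy(float in)
{
    std::uint32_t i;
    std::memcpy(&i, &in, sizeof i);   // reinterpret the float's bits
    ++i;                              // next larger bit pattern == next float for positive values
    std::memcpy(&in, &i, sizeof in);
    return in;
}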

When a float variable goes out of the float limits, what happens?

I noticed two things:
std::numeric_limits<float>::max()+(a small number) gives: std::numeric_limits<float>::max().
std::numeric_limits<float>::max()+(a large number like: std::numeric_limits<float>::max()/3) gives inf.
Why this difference? Does 1 or 2 result in an overflow and thus lead to undefined behavior?
Edit: Code for testing this:
1.
float d = std::numeric_limits<float>::max();
float q = d + 100;
cout << "q: " << q << endl;
2.
float d = std::numeric_limits<float>::max();
float q = d + (d/3);
cout << "q: " << q << endl;
Formally, the behavior is undefined. On a machine with IEEE floating point, however, overflow after rounding will result in Inf. The precision is limited, though, and the result of FLT_MAX + 1 after rounding is FLT_MAX.
You can see the same effect with values well under FLT_MAX.
Try something like:
float f1 = 1e20; // less than FLT_MAX
float f2 = f1 + 1.0;
if ( f1 == f2 ) ...
The if will evaluate to true, at least with IEEE arithmetic. (There do exist, or at least have existed, machines where float has enough precision for the if to evaluate to false, but they aren't very common today.)
It depends on what you are doing. If the float "overflow" comes in an expression which is directly returned, i.e.
return std::numeric_limits<float>::max() + std::numeric_limits<float>::max();
the operation might not result in an overflow. I cite from the C standard [ISO/IEC 9899:2011]:
The return statement is not an assignment. The overlap restriction of
subclause 6.5.16.1 does not apply to the case of function return. The
representation of floating-point values may have wider range or
precision than implied by the type; a cast may be used to remove this
extra range and precision.
See here for more details.
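A minimal illustration of the cast mentioned in the quoted passage (a sketch of my own, not taken from the question):
#include <limits>
float add_max()
{
    float m = std::numeric_limits<float>::max();
    // The cast discards any extra range or precision kept in a wider
    // register, forcing the overflow (if any) to happen here rather than
    // somewhere after the return.
    return static_cast<float>(m + m);
}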

round() for float in C++

I need a simple floating point rounding function, thus:
double round(double);
round(0.1) = 0
round(-0.1) = 0
round(-0.9) = -1
I can find ceil() and floor() in the math.h - but not round().
Is it present in the standard C++ library under another name, or is it missing??
Editor's Note: The following answer provides a simplistic solution that contains several implementation flaws (see Shafik Yaghmour's answer for a full explanation). Note that C++11 includes std::round, std::lround, and std::llround as builtins already.
There's no round() in the C++98 standard library. You can write one yourself though. The following is an implementation of round-half-up:
double round(double d)
{
    return floor(d + 0.5);
}
The probable reason there is no round function in the C++98 standard library is that it can in fact be implemented in different ways. The above is one common way but there are others such as round-to-even, which is less biased and generally better if you're going to do a lot of rounding; it's a bit more complex to implement though.
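For comparison, here is a sketch of round-half-to-even built on floor (my own illustration, not part of the original answer; it assumes IEEE-754 doubles):
#include <cmath>
double round_half_even(double d)
{
    double f = std::floor(d);
    double diff = d - f;            // fractional part, in [0, 1)
    if (diff < 0.5) return f;
    if (diff > 0.5) return f + 1.0;
    // exact tie: choose the even neighbour
    return (std::fmod(f, 2.0) == 0.0) ? f : f + 1.0;
}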
The C++03 standard relies on the C90 standard for what the standard calls the Standard C Library which is covered in the draft C++03 standard (closest publicly available draft standard to C++03 is N1804) section 1.2 Normative references:
The library described in clause 7 of ISO/IEC 9899:1990 and clause 7 of
ISO/IEC 9899/Amd.1:1995 is hereinafter called the Standard C
Library.1)
If we go to the C documentation for round, lround, llround on cppreference we can see that round and related functions are part of C99 and thus won't be available in C++03 or prior.
In C++11 this changes since C++11 relies on the C99 draft standard for the C standard library and therefore provides std::round and, for integral return types, std::lround and std::llround:
#include <iostream>
#include <cmath>
int main()
{
    std::cout << std::round( 0.4 ) << " " << std::lround( 0.4 ) << " " << std::llround( 0.4 ) << std::endl ;
    std::cout << std::round( 0.5 ) << " " << std::lround( 0.5 ) << " " << std::llround( 0.5 ) << std::endl ;
    std::cout << std::round( 0.6 ) << " " << std::lround( 0.6 ) << " " << std::llround( 0.6 ) << std::endl ;
}
Another option also from C99 would be std::trunc which:
Computes nearest integer not greater in magnitude than arg.
#include <iostream>
#include <cmath>
int main()
{
    std::cout << std::trunc( 0.4 ) << std::endl ;
    std::cout << std::trunc( 0.9 ) << std::endl ;
    std::cout << std::trunc( 1.1 ) << std::endl ;
}
If you need to support non C++11 applications your best bet would be to use boost round, iround, lround, llround or boost trunc.
Rolling your own version of round is hard
Rolling your own is probably not worth the effort as Harder than it looks: rounding float to nearest integer, part 1, Rounding float to nearest integer, part 2 and Rounding float to nearest integer, part 3 explain:
For example, a common roll-your-own implementation using std::floor and adding 0.5 does not work for all inputs:
double myround(double d)
{
    return std::floor(d + 0.5);
}
One input this will fail for is 0.49999999999999994, (see it live).
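A quick way to see the failure locally (assuming IEEE-754 double and the default round-to-nearest mode):
#include <cmath>
#include <cstdio>
double myround(double d) { return std::floor(d + 0.5); }
int main() {
    double d = 0.49999999999999994;           // largest double below 0.5
    // d + 0.5 rounds up to exactly 1.0, so floor() yields 1 instead of 0.
    std::printf("myround(%.17g) = %g\n", d, myround(d));
}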
Another common implementation involves casting a floating point type to an integral type, which can invoke undefined behavior in the case where the integral part can not be represented in the destination type. We can see this from the draft C++ standard section 4.9 Floating-integral conversions which says (emphasis mine):
A prvalue of a floating point type can be converted to a prvalue of an
integer type. The conversion truncates; that is, the fractional part
is discarded. The behavior is undefined if the truncated value cannot
be represented in the destination type.[...]
For example:
float myround(float f)
{
    return static_cast<float>( static_cast<unsigned int>( f ) ) ;
}
Given std::numeric_limits<unsigned int>::max() is 4294967295 then the following call:
myround( 4294967296.5f )
will cause overflow, (see it live).
We can see how difficult this really is by looking at this answer to Concise way to implement round() in C?, which references newlib's version of the single-precision float round. It is a very long function for something which seems simple. It seems unlikely that anyone without intimate knowledge of floating point implementations could correctly implement this function:
float roundf(float x)
{
int signbit;
__uint32_t w;
/* Most significant word, least significant word. */
int exponent_less_127;
GET_FLOAT_WORD(w, x);
/* Extract sign bit. */
signbit = w & 0x80000000;
/* Extract exponent field. */
exponent_less_127 = (int)((w & 0x7f800000) >> 23) - 127;
if (exponent_less_127 < 23)
{
if (exponent_less_127 < 0)
{
w &= 0x80000000;
if (exponent_less_127 == -1)
/* Result is +1.0 or -1.0. */
w |= ((__uint32_t)127 << 23);
}
else
{
unsigned int exponent_mask = 0x007fffff >> exponent_less_127;
if ((w & exponent_mask) == 0)
/* x has an integral value. */
return x;
w += 0x00400000 >> exponent_less_127;
w &= ~exponent_mask;
}
}
else
{
if (exponent_less_127 == 128)
/* x is NaN or infinite. */
return x + x;
else
return x;
}
SET_FLOAT_WORD(x, w);
return x;
}
On the other hand if none of the other solutions are usable newlib could potentially be an option since it is a well tested implementation.
Boost offers a simple set of rounding functions.
#include <boost/math/special_functions/round.hpp>
double a = boost::math::round(1.5); // Yields 2.0
int b = boost::math::iround(1.5); // Yields 2 as an integer
For more information, see the Boost documentation.
Edit: Since C++11, there are std::round, std::lround, and std::llround.
It may be worth noting that if you wanted an integer result from the rounding you don't need to pass it through either ceil or floor. I.e.,
int round_int( double r ) {
    return (r > 0.0) ? (r + 0.5) : (r - 0.5);
}
It's available since C++11 in cmath (according to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf)
#include <cmath>
#include <iostream>
int main(int argc, char** argv) {
    std::cout << "round(0.5):\t" << round(0.5) << std::endl;
    std::cout << "round(-0.5):\t" << round(-0.5) << std::endl;
    std::cout << "round(1.4):\t" << round(1.4) << std::endl;
    std::cout << "round(-1.4):\t" << round(-1.4) << std::endl;
    std::cout << "round(1.6):\t" << round(1.6) << std::endl;
    std::cout << "round(-1.6):\t" << round(-1.6) << std::endl;
    return 0;
}
Output:
round(0.5): 1
round(-0.5): -1
round(1.4): 1
round(-1.4): -1
round(1.6): 2
round(-1.6): -2
It's usually implemented as floor(value + 0.5).
Edit: and it's probably not called round since there are at least three rounding algorithms I know of: round to zero, round to closest integer, and banker's rounding. You are asking for round to closest integer.
There are 2 problems we are looking at:
rounding conversions
type conversion.
Rounding conversions mean rounding a ± float/double to the nearest floor/ceil float/double.
Maybe your problem ends here.
But if you are expected to return an int/long, you need to perform a type conversion, and thus the "overflow" problem might hit your solution. So, do a check for errors in your function:
long round(double x) {
    assert(x >= LONG_MIN-0.5);
    assert(x <= LONG_MAX+0.5);
    if (x >= 0)
        return (long) (x+0.5);
    return (long) (x-0.5);
}
#define round(x) ((x) < LONG_MIN-0.5 || (x) > LONG_MAX+0.5 ?\
    error() : ((x)>=0?(long)((x)+0.5):(long)((x)-0.5)))
from : http://www.cs.tut.fi/~jkorpela/round.html
A certain type of rounding is also implemented in Boost:
#include <iostream>
#include <boost/numeric/conversion/converter.hpp>
template<typename T, typename S> T round2(const S& x) {
    typedef boost::numeric::conversion_traits<T, S> Traits;
    typedef boost::numeric::def_overflow_handler OverflowHandler;
    typedef boost::numeric::RoundEven<typename Traits::source_type> Rounder;
    typedef boost::numeric::converter<T, S, Traits, OverflowHandler, Rounder> Converter;
    return Converter::convert(x);
}
int main() {
    std::cout << round2<int, double>(0.1) << ' ' << round2<int, double>(-0.1) << ' ' << round2<int, double>(-0.9) << std::endl;
}
Note that this works only if you do a to-integer conversion.
You could round to n digits precision with:
double round( double x )
{
    const double sd = 1000; // for accuracy to 3 decimal places
    return int(x*sd + (x<0? -0.5 : 0.5))/sd;
}
These days it shouldn't be a problem to use a C++11 compiler which includes a C99/C++11 math library. But then the question becomes: which rounding function do you pick?
C99/C++11 round() is often not actually the rounding function you want. It uses a funky rounding mode that rounds away from 0 as a tie-break on half-way cases (+-xxx.5000). If you do specifically want that rounding mode, or you're targeting a C++ implementation where round() is faster than rint(), then use it (or emulate its behaviour with one of the other answers on this question which took it at face value and carefully reproduced that specific rounding behaviour.)
round()'s rounding is different from the IEEE754 default round to nearest mode with even as a tie-break. Nearest-even avoids statistical bias in the average magnitude of numbers, but does bias towards even numbers.
There are two math library rounding functions that use the current default rounding mode: std::nearbyint() and std::rint(), both added in C99/C++11, so they're available any time std::round() is. The only difference is that nearbyint never raises FE_INEXACT.
Prefer rint() for performance reasons: gcc and clang both inline it more easily, but gcc never inlines nearbyint() (even with -ffast-math)
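A tiny demonstration of the tie-breaking difference (my own example, assuming the default rounding mode of round-to-nearest-even and a C++11 <cmath>):
#include <cmath>
#include <cstdio>
int main() {
    // round() breaks ties away from zero; rint() uses the current rounding
    // mode, which defaults to round-to-nearest-even.
    std::printf("round(2.5) = %g   rint(2.5) = %g\n", std::round(2.5), std::rint(2.5)); // 3 vs 2
    std::printf("round(3.5) = %g   rint(3.5) = %g\n", std::round(3.5), std::rint(3.5)); // 4 vs 4
}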
gcc/clang for x86-64 and AArch64
I put some test functions on Matt Godbolt's Compiler Explorer, where you can see source + asm output (for multiple compilers). For more about reading compiler output, see this Q&A, and Matt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”,
In FP code, it's usually a big win to inline small functions. Especially on non-Windows, where the standard calling convention has no call-preserved registers, so the compiler can't keep any FP values in XMM registers across a call. So even if you don't really know asm, you can still easily see whether it's just a tail-call to the library function or whether it inlined to one or two math instructions. Anything that inlines to one or two instructions is better than a function call (for this particular task on x86 or ARM).
On x86, anything that inlines to SSE4.1 roundsd can auto-vectorize with SSE4.1 roundpd (or AVX vroundpd). (FP->integer conversions are also available in packed SIMD form, except for FP->64-bit integer which requires AVX512.)
std::nearbyint():
x86 clang: inlines to a single insn with -msse4.1.
x86 gcc: inlines to a single insn only with -msse4.1 -ffast-math, and only on gcc 5.4 and earlier. Later gcc never inlines it (maybe they didn't realize that one of the immediate bits can suppress the inexact exception? That's what clang uses, but older gcc uses the same immediate as for rint when it does inline it)
AArch64 gcc6.3: inlines to a single insn by default.
std::rint:
x86 clang: inlines to a single insn with -msse4.1
x86 gcc7: inlines to a single insn with -msse4.1. (Without SSE4.1, inlines to several instructions)
x86 gcc6.x and earlier: inlines to a single insn with -ffast-math -msse4.1.
AArch64 gcc: inlines to a single insn by default
std::round:
x86 clang: doesn't inline
x86 gcc: inlines to multiple instructions with -ffast-math -msse4.1, requiring two vector constants.
AArch64 gcc: inlines to a single instruction (HW support for this rounding mode as well as IEEE default and most others.)
std::floor / std::ceil / std::trunc
x86 clang: inlines to a single insn with -msse4.1
x86 gcc7.x: inlines to a single insn with -msse4.1
x86 gcc6.x and earlier: inlines to a single insn with -ffast-math -msse4.1
AArch64 gcc: inlines by default to a single instruction
Rounding to int / long / long long:
You have two options here: use lrint (like rint but returns long, or long long for llrint), or use an FP->FP rounding function and then convert to an integer type the normal way (with truncation). Some compilers optimize one way better than the other.
long l = lrint(x);
int i = (int)rint(x);
Note that int i = lrint(x) converts float or double -> long first, and then truncates the integer to int. This makes a difference for out-of-range integers: Undefined Behaviour in C++, but well-defined for the x86 FP -> int instructions (which the compiler will emit unless it sees the UB at compile time while doing constant propagation, then it's allowed to make code that breaks if it's ever executed).
On x86, an FP->integer conversion that overflows the integer produces INT_MIN or LLONG_MIN (a bit-pattern of 0x80000000 or the 64-bit equivalent, with just the sign-bit set). Intel calls this the "integer indefinite" value. (See the cvttsd2si manual entry, the SSE2 instruction that converts (with truncation) scalar double to signed integer. It's available with 32-bit or 64-bit integer destination (in 64-bit mode only). There's also a cvtsd2si (convert with current rounding mode), which is what we'd like the compiler to emit, but unfortunately gcc and clang won't do that without -ffast-math.
Also beware that FP to/from unsigned int / long is less efficient on x86 (without AVX512). Conversion to 32-bit unsigned on a 64-bit machine is pretty cheap; just convert to 64-bit signed and truncate. But otherwise it's significantly slower.
x86 clang with/without -ffast-math -msse4.1: (int/long)rint inlines to roundsd / cvttsd2si. (missed optimization to cvtsd2si). lrint doesn't inline at all.
x86 gcc6.x and earlier without -ffast-math: neither way inlines
x86 gcc7 without -ffast-math: (int/long)rint rounds and converts separately (with 2 total instructions if SSE4.1 is enabled, otherwise with a bunch of code inlined for rint without roundsd). lrint doesn't inline.
x86 gcc with -ffast-math: all ways inline to cvtsd2si (optimal), no need for SSE4.1.
AArch64 gcc6.3 without -ffast-math: (int/long)rint inlines to 2 instructions. lrint doesn't inline
AArch64 gcc6.3 with -ffast-math: (int/long)rint compiles to a call to lrint. lrint doesn't inline. This may be a missed optimization unless the two instructions we get without -ffast-math are very slow.
If you ultimately want to convert the double output of your round() function to an int, then the accepted solutions of this question will look something like:
int roundint(double r) {
    return (int)((r > 0.0) ? floor(r + 0.5) : ceil(r - 0.5));
}
This clocks in at around 8.88 ns on my machine when passed in uniformly random values.
The below is functionally equivalent, as far as I can tell, but clocks in at 2.48 ns on my machine, for a significant performance advantage:
int roundint (double r) {
    int tmp = static_cast<int> (r);
    tmp += (r-tmp>=.5) - (r-tmp<=-.5);
    return tmp;
}
Among the reasons for the better performance is the skipped branching.
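A quick sanity check of the branchless version (my own test, assuming the inputs fit comfortably in int):
#include <cassert>
int roundint_branchless(double r) {
    int tmp = static_cast<int>(r);
    tmp += (r - tmp >= .5) - (r - tmp <= -.5);
    return tmp;
}
int main() {
    assert(roundint_branchless( 2.4) ==  2);
    assert(roundint_branchless( 2.5) ==  3);   // ties away from zero
    assert(roundint_branchless(-2.5) == -3);
    assert(roundint_branchless(-2.4) == -2);
    return 0;
}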
Beware of floor(x+0.5). Here is what can happen for odd numbers in range [2^52,2^53]:
-bash-3.2$ cat >test-round.c <<END
#include <math.h>
#include <stdio.h>
int main() {
double x=5000000000000001.0;
double y=round(x);
double z=floor(x+0.5);
printf(" x =%f\n",x);
printf("round(x) =%f\n",y);
printf("floor(x+0.5)=%f\n",z);
return 0;
}
END
-bash-3.2$ gcc test-round.c
-bash-3.2$ ./a.out
x =5000000000000001.000000
round(x) =5000000000000001.000000
floor(x+0.5)=5000000000000002.000000
This is http://bugs.squeak.org/view.php?id=7134. Use a solution like the one of #konik.
My own robust version would be something like:
double round(double x)
{
    double truncated, roundedFraction;
    double fraction = modf(x, &truncated);
    modf(2.0*fraction, &roundedFraction);
    return truncated + roundedFraction;
}
Another reason to avoid floor(x+0.5) is given here.
There is no need to implement anything, so I'm not sure why so many answers involve defines, functions, or methods.
In C99
We have the following, and the header <tgmath.h> for type-generic macros.
#include <math.h>
double round (double x);
float roundf (float x);
long double roundl (long double x);
If you cannot compile this, you have probably left out the math library. A command similar to this works on every C compiler I have (several).
gcc -lm -std=c99 ...
In C++11
We have the following and additional overloads in #include <cmath> that rely on IEEE double precision floating point.
#include <math.h>
double round (double x);
float round (float x);
long double round (long double x);
double round (T x); // for any integral type T
There are equivalents in the std namespace too.
If you cannot compile this, you may be using C compilation instead of C++. The following basic command produces neither errors nor warnings with g++ 6.3.1, x86_64-w64-mingw32-g++ 6.3.0, clang-x86_64++ 3.8.0, and Visual C++ 2015 Community.
g++ -std=c++11 -Wall
With Ordinal Division
When dividing two ordinal numbers, where T is short, int, long, or another ordinal, the rounding expression is this.
T roundedQuotient = (2 * integerNumerator + integerDenominator)
    / (2 * integerDenominator);
Accuracy
There is no doubt that odd-looking inaccuracies appear in floating point operations, but these show up only when the numbers are displayed, and have little to do with rounding.
The source is not just the number of significant digits in the mantissa of the IEEE representation of a floating point number; it is related to our decimal thinking as humans.
Ten is the product of five and two, and 5 and 2 are relatively prime. Therefore not all decimal numbers can be represented exactly by the binary representations the IEEE floating point standards use.
This is not an issue with the rounding algorithms. It is mathematical reality that should be considered during the selection of types and the design of computations, data entry, and display of numbers. If an application displays the digits that show these decimal-binary conversion issues, then the application is visually expressing accuracy that does not exist in digital reality and should be changed.
Function double round(double) with the use of the modf function:
double round(double x)
{
    using namespace std;
    if ((numeric_limits<double>::max() - 0.5) <= x)
        return numeric_limits<double>::max();
    if ((-1*std::numeric_limits<double>::max() + 0.5) > x)
        return (-1*std::numeric_limits<double>::max());
    double intpart;
    double fractpart = modf(x, &intpart);
    if (fractpart >= 0.5)
        return (intpart + 1);
    else if (fractpart >= -0.5)
        return intpart;
    else
        return (intpart - 1);
}
To compile cleanly, the includes "math.h" and "limits" are necessary. The function works according to the following rounding schema:
round of 5.0 is 5.0
round of 3.8 is 4.0
round of 2.3 is 2.0
round of 1.5 is 2.0
round of 0.501 is 1.0
round of 0.5 is 1.0
round of 0.499 is 0.0
round of 0.01 is 0.0
round of 0.0 is 0.0
round of -0.01 is -0.0
round of -0.499 is -0.0
round of -0.5 is -0.0
round of -0.501 is -1.0
round of -1.5 is -1.0
round of -2.3 is -2.0
round of -3.8 is -4.0
round of -5.0 is -5.0
If you need to be able to compile code in environments that support the C++11 standard, but also need to be able to compile that same code in environments that don't support it, you could use a function macro to choose between std::round() and a custom function for each system. Just pass -DCPP11 or /DCPP11 to the C++11-compliant compiler (or use its built-in version macros), and make a header like this:
// File: rounding.h
#include <cmath>
#ifdef CPP11
#define ROUND(x) std::round(x)
#else /* CPP11 */
inline double myRound(double x) {
return (x >= 0.0 ? std::floor(x + 0.5) : std::ceil(x - 0.5));
}
#define ROUND(x) myRound(x)
#endif /* CPP11 */
For a quick example, see http://ideone.com/zal709 .
This approximates std::round() in environments that aren't C++11-compliant, including preservation of the sign bit for -0.0. It may cause a slight performance hit, however, and will likely have issues with rounding certain known "problem" floating-point values such as 0.49999999999999994 or similar values.
Alternatively, if you have access to a C++11-compliant compiler, you could just grab std::round() from its <cmath> header, and use it to make your own header that defines the function if it's not already defined. Note that this may not be an optimal solution, however, especially if you need to compile for multiple platforms.
Based on Kalaxy's response, the following is a templated solution that rounds any floating point number to the nearest integer type based on natural rounding. It also throws an error in debug mode if the value is out of range of the integer type, thereby serving roughly as a viable library function.
// round a floating point number to the nearest integer
template <typename Arg>
int Round(Arg arg)
{
#ifndef NDEBUG
    // check that the argument can be rounded given the return type:
    if (
        ((Arg)std::numeric_limits<int>::max() < arg + (Arg) 0.5) ||
        ((Arg)std::numeric_limits<int>::lowest() > arg - (Arg) 0.5)
       )
    {
        throw std::overflow_error("out of bounds");
    }
#endif
    return (arg > (Arg) 0.0) ? (int)(arg + (Arg) 0.5) : (int)(arg - (Arg) 0.5);
}
As pointed out in comments and other answers, the ISO C++ standard library did not add round() until ISO C++11, when this function was pulled in by reference to the ISO C99 standard math library.
For positive operands in [½, ub] round(x) == floor (x + 0.5), where ub is 2^23 for float when mapped to IEEE-754 (2008) binary32, and 2^52 for double when it is mapped to IEEE-754 (2008) binary64. The numbers 23 and 52 correspond to the number of stored mantissa bits in these two floating-point formats. For positive operands in [+0, ½) round(x) == 0, and for positive operands in (ub, +∞] round(x) == x. As the function is symmetric about the origin, negative arguments x can be handled according to round(-x) == -round(x).
This leads to the compact code below. It compiles into a reasonable number of machine instructions across various platforms. I observed the most compact code on GPUs, where my_roundf() requires about a dozen instructions. Depending on processor architecture and toolchain, this floating-point based approach could be either faster or slower than the integer-based implementation from newlib referenced in a different answer.
I tested my_roundf() exhaustively against the newlib roundf() implementation using Intel compiler version 13, with both /fp:strict and /fp:fast. I also checked that the newlib version matches the roundf() in the mathimf library of the Intel compiler. Exhaustive testing is not possible for double-precision round(), however the code is structurally identical to the single-precision implementation.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
float my_roundf (float x)
{
const float half = 0.5f;
const float one = 2 * half;
const float lbound = half;
const float ubound = 1L << 23;
float a, f, r, s, t;
s = (x < 0) ? (-one) : one;
a = x * s;
t = (a < lbound) ? x : s;
f = (a < lbound) ? 0 : floorf (a + half);
r = (a > ubound) ? x : (t * f);
return r;
}
double my_round (double x)
{
const double half = 0.5;
const double one = 2 * half;
const double lbound = half;
const double ubound = 1ULL << 52;
double a, f, r, s, t;
s = (x < 0) ? (-one) : one;
a = x * s;
t = (a < lbound) ? x : s;
f = (a < lbound) ? 0 : floor (a + half);
r = (a > ubound) ? x : (t * f);
return r;
}
uint32_t float_as_uint (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof(r));
return r;
}
float uint_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof(r));
return r;
}
float newlib_roundf (float x)
{
uint32_t w;
int exponent_less_127;
w = float_as_uint(x);
/* Extract exponent field. */
exponent_less_127 = (int)((w & 0x7f800000) >> 23) - 127;
if (exponent_less_127 < 23) {
if (exponent_less_127 < 0) {
/* Extract sign bit. */
w &= 0x80000000;
if (exponent_less_127 == -1) {
/* Result is +1.0 or -1.0. */
w |= ((uint32_t)127 << 23);
}
} else {
uint32_t exponent_mask = 0x007fffff >> exponent_less_127;
if ((w & exponent_mask) == 0) {
/* x has an integral value. */
return x;
}
w += 0x00400000 >> exponent_less_127;
w &= ~exponent_mask;
}
} else {
if (exponent_less_127 == 128) {
/* x is NaN or infinite so raise FE_INVALID by adding */
return x + x;
} else {
return x;
}
}
x = uint_as_float (w);
return x;
}
int main (void)
{
uint32_t argi, resi, refi;
float arg, res, ref;
argi = 0;
do {
arg = uint_as_float (argi);
ref = newlib_roundf (arg);
res = my_roundf (arg);
resi = float_as_uint (res);
refi = float_as_uint (ref);
if (resi != refi) { // check for identical bit pattern
printf ("!!!! arg=%08x res=%08x ref=%08x\n", argi, resi, refi);
return EXIT_FAILURE;
}
argi++;
} while (argi);
return EXIT_SUCCESS;
}
I use the following implementation of round in asm for x86 architecture and MS VS specific C++:
__forceinline int Round(const double v)
{
int r;
__asm
{
FLD v
FISTP r
FWAIT
};
return r;
}
UPD: to return double value
__forceinline double dround(const double v)
{
double r;
__asm
{
FLD v
FRNDINT
FSTP r
FWAIT
};
return r;
}
Output:
dround(0.1): 0.000000000000000
dround(-0.1): -0.000000000000000
dround(0.9): 1.000000000000000
dround(-0.9): -1.000000000000000
dround(1.1): 1.000000000000000
dround(-1.1): -1.000000000000000
dround(0.49999999999999994): 0.000000000000000
dround(-0.49999999999999994): -0.000000000000000
dround(0.5): 0.000000000000000
dround(-0.5): -0.000000000000000
Since C++ 11 simply:
#include <cmath>
std::round(1.1)
or to get int
static_cast<int>(std::round(1.1))
round_f for ARM with math
static inline float round_f(float value)
{
float rep;
asm volatile ("vrinta.f32 %0,%1" : "=t"(rep) : "t"(value));
return rep;
}
round_f for ARM without math
union f__raw {
struct {
uint32_t massa :23;
uint32_t order :8;
uint32_t sign :1;
};
int32_t i_raw;
float f_raw;
};
float round_f(float value)
{
union f__raw raw;
int32_t exx;
uint32_t ex_mask;
raw.f_raw = value;
exx = raw.order - 126;
if (exx < 0) {
raw.i_raw &= 0x80000000;
} else if (exx < 24) {
ex_mask = 0x00ffffff >> exx;
raw.i_raw += 0x00800000 >> exx;
if (exx == 0) ex_mask >>= 1;
raw.i_raw &= ~ex_mask;
};
return raw.f_raw;
};
The best way to round off a floating-point value to "n" decimal places, in O(1) time, is as follows:
We have to round off the value to 3 places, i.e. n=3. So,
float a=47.8732355;
printf("%.3f",a);
// Convert the float to a string
// We might use stringstream, but it looks like it truncates the float to only
//5 decimal points (maybe that's what you want anyway =P)
float MyFloat = 5.11133333311111333;
float NewConvertedFloat = 0.0;
string FirstString = " ";
string SecondString = " ";
stringstream ss (stringstream::in | stringstream::out);
ss << MyFloat;
FirstString = ss.str();
// Take out how ever many decimal places you want
// (this is a string it includes the point)
SecondString = FirstString.substr(0,5);
//whatever precision decimal place you want
// Convert it back to a float
stringstream(SecondString) >> NewConvertedFloat;
cout << NewConvertedFloat;
system("pause");
It might be an inefficient, dirty way of conversion but heck, it works, lol. And it's good because it applies to the actual float, not just affecting the output visually.
I did this:
#include <cmath>
using namespace std;
double roundh(double number, int place){
    /* place = decimal place. Passing 0 will round to a whole
       number; passing 1 will round to the tenths digit.
    */
    double scale = pow(10.0, place);   // '^' is XOR in C++, so use pow for powers of ten
    number *= scale;
    double istack = floor(number);
    double out = number - istack;      // fractional part
    if (out < 0.5){
        number = floor(number);        // floor/ceil return a value; it must be assigned
    } else {
        number = ceil(number);
    }
    return number / scale;
}
}