Float addition promoted to double? - c++

I had a small WTF moment this morning. Ths WTF can be summarized with this:
float x = 0.2f;
float y = 0.1f;
float z = x + y;
assert(z == x + y); //This assert is triggered! (Atleast with visual studio 2008)
The reason seems to be that the expression x + y is promoted to double and compared with the truncated version in z. (If i change z to double the assert isn't triggered).
I can see that for precision reasons it would make sense to perform all floating point arithmetics in double precision before converting the result to single precision. I found the following paragraph in the standard (which I guess I sort of already knew, but not in this context):
4.6.1.
"An rvalue of type float can be converted to an rvalue of type double. The value is unchanged"
My question is, is x + y guaranteed to be promoted to double or is at the compiler's discretion?
UPDATE: Since many people has claimed that one shouldn't use == for floating point, I just wanted to state that in the specific case I'm working with, an exact comparison is justified.
Floating point comparision is tricky, here's an interesting link on the subject which I think hasn't been mentioned.

You can't generally assume that == will work as expected for floating point types. Compare rounded values or use constructs like abs(a-b) < tolerance instead.
Promotion is entirely at the compiler's discretion (and will depend on target hardware, optimisation level, etc).
What's going on in this particular case is almost certainly that values are stored in FPU registers at a higher precision than in memory - in general, modern FPU hardware works with double or higher precision internally whatever precision the programmer asked for, with the compiler generating code to make the appropriate conversions when values are stored to memory; in an unoptimised build, the result of x+y is still in a register at the point the comparison is made but z will have been stored out to memory and fetched back, and thus truncated to float precision.

The Working draft for the next standard C++0x section 5 point 11 says
The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby
So at the compiler's discretion.

Using gcc 4.3.2, the assertion is not triggered, and indeed, the rvalue returned from x + y is a float, rather than a double.
So it's up to the compiler. This is why it's never wise to rely on exact equality between two floating point values.

The C++ FAQ lite has some further discussion on the topic:
Why is cos(x) != cos(y) even though x == y?

It is the problem since float number to binary conversion does not give accurate precision.
And within sizeof(float) bytes it cant accomodate precise value of float number and arithmetic operation may lead to approximation and hence equality fails.
See below e.g.
float x = 0.25f; //both fits within 4 bytes with precision
float y = 0.50f;
float z = x + y;
assert(z == x + y); // it would work fine and no assert

I would think it would be at the compiler's discretion, but you could always force it with a cast if that was your thinking?

Yet another reason to never directly compare floats.
if (fabs(result - expectedResult) < 0.00001)

Related

double and float comparison [duplicate]

This question already has answers here:
Comparing float and double
(3 answers)
Closed 7 years ago.
According to this post, when comparing a float and a double, the float should be treated as double.
The following program, does not seem to follow this statement. The behaviour looks quite unpredictable.
Here is my program:
void main(void)
{
double a = 1.1; // 1.5
float b = 1.1; // 1.5
printf("%X %X\n", a, b);
if ( a == b)
cout << "success " <<endl;
else
cout << "fail" <<endl;
}
When I run the following program, I get "fail" displayed.
However, when I change a and b to 1.5, it displays "success".
I have also printed the hex notations of the values. They are different in both the cases. My compiler is Visual Studio 2005
Can you explain this output ? Thanks.
float f = 1.1;
double d = 1.1;
if (f == d)
In this comparison, the value of f is promoted to type double. The problem you're seeing isn't in the comparison, but in the initialization. 1.1 can't be represented exactly as a floating-point value, so the values stored in f and d are the nearest value that can be represented. But float and double are different sizes, so have a different number of significant bits. When the value in f is promoted to double, there's no way to get back the extra bits that were lost when the value was stored, so you end up with all zeros in the extra bits. Those zero bits don't match the bits in d, so the comparison is false. And the reason the comparison succeeds with 1.5 is that 1.5 can be represented exactly as a float and as a double; it has a bunch of zeros in its low bits, so when the promotion adds zeros the result is the same as the double representation.
I found a decent explanation of the problem you are experiencing as well as some solutions.
See How dangerous is it to compare floating point values?
Just a side note, remember that some values can not be represented EXACTLY in IEEE 754 floating point representation. Your same example using a value of say 1.5 would compare as you expect because there is a perfect representation of 1.5 without any loss of data. However, 1.1 in 32-bit and 64-bit are in fact different values because the IEEE 754 standard can not perfectly represent 1.1.
See http://www.binaryconvert.com
double a = 1.1 --> 0x3FF199999999999A
Approximate representation = 1.10000000000000008881784197001
float b = 1.1 --> 0x3f8ccccd
Approximate representation = 1.10000002384185791015625
As you can see, the two values are different.
Also, unless you are working in some limited memory type environment, it's somewhat pointless to use floats. Just use doubles and save yourself the headaches.
If you are not clear on why some values can not be accurately represented, consult a tutorial on how to covert a decimal to floating point.
Here's one: http://class.ece.iastate.edu/arun/CprE281_F05/ieee754/ie5.html
I would regard code which directly performs a comparison between a float and a double without a typecast to be broken; even if the language spec says that the float will be implicitly converted, there are two different ways that the comparison might sensibly be performed, and neither is sufficiently dominant to really justify a "silent" default behavior (i.e. one which compiles without generating a warning). If one wants to perform a conversion by having both operands evaluated as double, I would suggest adding an explicit type cast to make one's intentions clear. In most cases other than tests to see whether a particular double->float conversion will be reversible without loss of precision, however, I suspect that comparison between float values is probably more appropriate.
Fundamentally, when comparing floating-point values X and Y of any sort, one should regard comparisons as indicating that X or Y is larger, or that the numbers are "indistinguishable". A comparison which shows X is larger should be taken to indicate that the number that Y is supposed to represent is probably smaller than X or close to X. A comparison that says the numbers are indistinguishable means exactly that. If one views things in such fashion, comparisons performed by casting to float may not be as "informative" as those done with double, but are less likely to yield results that are just plain wrong. By comparison, consider:
double x, y;
float f = x;
If one compares f and y, it's possible that what one is interested in is how y compares with the value of x rounded to a float, but it's more likely that what one really wants to know is whether, knowing the rounded value of x, whether one can say anything about the relationship between x and y. If x is 0.1 and y is 0.2, f will have enough information to say whether x is larger than y; if y is 0.100000001, it will not. In the latter case, if both operands are cast to double, the comparison will erroneously imply that x was larger; if they are both cast to float, the comparison will report them as indistinguishable. Note that comparison results when casting both operands to double may be erroneous not only when values are within a part per million; they may be off by hundreds of orders of magnitude, such as if x=1e40 and y=1e300. Compare f and y as float and they'll compare indistinguishable; compare them as double and the smaller value will erroneously compare larger.
The reason why the rounding error occurs with 1.1 and not with 1.5 is due to the number of bits required to accurately represent a number like 0.1 in floating point format. In fact an accurate representation is not possible.
See How To Represent 0.1 In Floating Point Arithmetic And Decimal for an example, particularly the answer by #paxdiablo.

Increasing float value [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Floating point inaccuracy examples
I have the following line inside a WHILE loop, in C/C++:
while(...)
{
x = x + float(0.1); // x is a float type. Is the cast necessary?
}
x starts as 0. The thing is, after my first loop, x = 0.1. That's cool. After my second loop, x = 0.2. That's sweet. But, after my third loop, x = 0.3000001. That's not OK. I want it to have 0.3 as value, not 0.3000001. Can it be done? Am I looping wrongly?
Floating point does not work that way there are infinitely many real numbers between any two real numbers and only a finite amount of bits this means that in almost all cases the floating point representation is approximate. Read this link for more info.
It's not the loop, it's just how floats are represented in memory. You don't expect all real numbers to be directly representible in a limited number of bytes, right?
A 0.3 can't be exactly represented by a float. Try a double (not saying it will work, it probably won't, but the offset will be lower).
This is a common misconception with floating point numbers. 0.3 may not be exactly representable with 32bit or 64bit binary floating point. Lots of numbers are not exactly representable. Your loop is working fine ignoring the unnecessary syntax.
while (...)
{
x += 0.1f; /* this will do just fine in C++ and C */
}
If this doesn't make sense consider the fact that there are an infinite number of floating point numbers...with only a finite number of bits to describe them.
Either way, if you need exact results you need to use a decimal type of the proper precision. Good news though, unless you're doing calculations on money you likely do not need exact results (even if you think you do).
Code such as this:
for (int i = 0;… ; ++i)
{
float x = i / 10.f;
…
}
will result in the value of x in each iteration being the float value that is closest to i/10. It will usually not be exact, since the exact value of i/10 is usually not representable in float.
For double, change the definition to:
double x = i / 10.;
This will result in a finer x, so it will usually be even closer to i/10. However, it will still usually not be exactly i/10.
If you need exactly i/10, you should explain your requirements further.
NO the cast is not necessary in this case
float x;
x = x + float(0.1);
You can simply write
x+= 0.1

double variables in c++ are showing equal even when they are not

I just wrote the following code in C++:
double variable1;
double variable2;
variable1=numeric_limits<double>::max()-50;
variable2=variable1;
variable1=variable1+5;
cout<<"\nVariable1==Variable2 ? "<<(variable1==variable2);
The answer to the cout statement comes out 1, even when variable2 and variable1 are not equal.Can someone help me with this? Why is this happening?
I knew the concept of imprecise floating point math but didn't think this would happen with comparing two doubles directly. Also I am getting the same resuklt when I replace variable1 with:
double variable1=(numeric_limits<double>::max()-10000000000000);
The comparison still shows them as equal. How much would I have to subtract to see them start differing?
The maximum value for a double is 1.7976931348623157E+308. Due to lack of precision, adding and removing small values such as 50 and 5 does not actually changes the values of the variable. Thus they stay the same.
There isn't enough precision in a double to differentiate between M and M-45 where M is the largest value that can be represented by a double.
Imagine you're counting atoms to the nearest million. "123,456 million atoms" plus 1 atom is still "123,456 million atoms" because there's no space in the "millions" counting system for the 1 extra atom to make any difference.
numeric_limits<double>::max()
is a huuuuuge number. But the greater the absolute value of a double, the smaller is its precision. Apparently in this case max-50and max-5 are indistinguishable from double's point of view.
You should read the floating point comparison guide. In short, here are some examples:
float a = 0.15 + 0.15
float b = 0.1 + 0.2
if(a == b) // can be false!
if(a >= b) // can also be false!
The comparison with an epsilon value is what most people do.
#define EPSILON 0.00000001
bool AreSame(double a, double b)
{
return fabs(a - b) < EPSILON;
}
In your case, that max value is REALLY big. Adding or subtracting 50 does nothing. Thus they look the same because of the size of the number. See #RichieHindle's answer.
Here are some additional resources for research.
See this blog post.
Also, there was a stack overflow question on this very topic (language agnostic).
From the C++03 standard:
3.9.1/ [...] The value representation of floating-point types is
implementation-defined
and
5/ [...] If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable values for
its type, the behavior is undefined, unless such an expression is a
constant expression (5.19), in which case the program is ill-formed.
and
18.2.1.2.4/ (about numeric_limits<T>::max()) Maximum finite value.
This implies that once you add something to std::numeric_limits<T>::max(), the behavior of the program is implementation defined if T is floating point, perfectly defined if T is an unsigned type, and undefined otherwise.
If you happen to have std::numeric_limits<T>::is_iec559 == true, in this case the behavior is defined by IEEE 754. I don't have it handy, so I cannot tell whether variable1 is finite or infinite in this case. It seems (according to some lecture notes on IEEE 754 on the internet) that it depends on the rounding mode..
Please read What Every Computer Scientist Should Know About Floating-Point Arithmetic.

Is floating-point == ever OK?

Just today I came across third-party software we're using and in their sample code there was something along these lines:
// Defined in somewhere.h
static const double BAR = 3.14;
// Code elsewhere.cpp
void foo(double d)
{
if (d == BAR)
...
}
I'm aware of the problem with floating-points and their representation, but it made me wonder if there are cases where float == float would be fine? I'm not asking for when it could work, but when it makes sense and works.
Also, what about a call like foo(BAR)? Will this always compare equal as they both use the same static const BAR?
Yes, you are guaranteed that whole numbers, including 0.0, compare with ==
Of course you have to be a little careful with how you got the whole number in the first place, assignment is safe but the result of any calculation is suspect
ps there are a set of real numbers that do have a perfect reproduction as a float (think of 1/2, 1/4 1/8 etc) but you probably don't know in advance that you have one of these.
Just to clarify. It is guaranteed by IEEE 754 that float representions of integers (whole numbers) within range, are exact.
float a=1.0;
float b=1.0;
a==b // true
But you have to be careful how you get the whole numbers
float a=1.0/3.0;
a*3.0 == 1.0 // not true !!
There are two ways to answer this question:
Are there cases where float == float gives the correct result?
Are there cases where float == float is acceptable coding?
The answer to (1) is: Yes, sometimes. But it's going to be fragile, which leads to the answer to (2): No. Don't do that. You're begging for bizarre bugs in the future.
As for a call of the form foo(BAR): In that particular case the comparison will return true, but when you are writing foo you don't know (and shouldn't depend on) how it is called. For example, calling foo(BAR) will be fine but foo(BAR * 2.0 / 2.0) (or even maybe foo(BAR * 1.0) depending on how much the compiler optimises things away) will break. You shouldn't be relying on the caller not performing any arithmetic!
Long story short, even though a == b will work in some cases you really shouldn't rely on it. Even if you can guarantee the calling semantics today maybe you won't be able to guarantee them next week so save yourself some pain and don't use ==.
To my mind, float == float is never* OK because it's pretty much unmaintainable.
*For small values of never.
The other answers explain quite well why using == for floating point numbers is dangerous. I just found one example that illustrates these dangers quite well, I believe.
On the x86 platform, you can get weird floating point results for some calculations, which are not due to rounding problems inherent to the calculations you perform. This simple C program will sometimes print "error":
#include <stdio.h>
void test(double x, double y)
{
const double y2 = x + 1.0;
if (y != y2)
printf("error\n");
}
void main()
{
const double x = .012;
const double y = x + 1.0;
test(x, y);
}
The program essentially just calculates
x = 0.012 + 1.0;
y = 0.012 + 1.0;
(only spread across two functions and with intermediate variables), but the comparison can still yield false!
The reason is that on the x86 platform, programs usually use the x87 FPU for floating point calculations. The x87 internally calculates with a higher precision than regular double, so double values need to be rounded when they are stored in memory. That means that a roundtrip x87 -> RAM -> x87 loses precision, and thus calculation results differ depending on whether intermediate results passed via RAM or whether they all stayed in FPU registers. This is of course a compiler decision, so the bug only manifests for certain compilers and optimization settings :-(.
For details see the GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
Rather scary...
Additional note:
Bugs of this kind will generally be quite tricky to debug, because the different values become the same once they hit RAM.
So if for example you extend the above program to actually print out the bit patterns of y and y2 right after comparing them, you will get the exact same value. To print the value, it has to be loaded into RAM to be passed to some print function like printf, and that will make the difference disappear...
I'll provide more-or-less real example of legitimate, meaningful and useful testing for float equality.
#include <stdio.h>
#include <math.h>
/* let's try to numerically solve a simple equation F(x)=0 */
double F(double x) {
return 2 * cos(x) - pow(1.2, x);
}
/* a well-known, simple & slow but extremely smart method to do this */
double bisection(double range_start, double range_end) {
double a = range_start;
double d = range_end - range_start;
int counter = 0;
while (a != a + d) // <-- WHOA!!
{
d /= 2.0;
if (F(a) * F(a + d) > 0) /* test for same sign */
a = a + d;
++counter;
}
printf("%d iterations done\n", counter);
return a;
}
int main() {
/* we must be sure that the root can be found in [0.0, 2.0] */
printf("F(0.0)=%.17f, F(2.0)=%.17f\n", F(0.0), F(2.0));
double x = bisection(0.0, 2.0);
printf("the root is near %.17f, F(%.17f)=%.17f\n", x, x, F(x));
}
I'd rather not explain the bisection method used itself, but emphasize on the stopping condition. It has exactly the discussed form: (a == a+d) where both sides are floats: a is our current approximation of the equation's root, and d is our current precision. Given the precondition of the algorithm — that there must be a root between range_start and range_end — we guarantee on every iteration that the root stays between a and a+d while d is halved every step, shrinking the bounds.
And then, after a number of iterations, d becomes so small that during addition with a it gets rounded to zero! That is, a+d turns out to be closer to a then to any other float; and so the FPU rounds it to the closest representable value: to a itself. Calculation on a hypothetical machine can illustrate; let it have 4-digit decimal mantissa and some large exponent range. Then what result should the machine give to 2.131e+02 + 7.000e-3? The exact answer is 213.107, but our machine can't represent such number; it has to round it. And 213.107 is much closer to 213.1 than to 213.2 — so the rounded result becomes 2.131e+02 — the little summand vanished, rounded up to zero. Exactly the same is guaranteed to happen at some iteration of our algorithm — and at that point we can't continue anymore. We have found the root to maximum possible precision.
Addendum
No you can't just use "some small number" in the stopping condition. For any choice of the number, some inputs will deem your choice too large, causing loss of precision, and there will be inputs which will deem your choiсe too small, causing excess iterations or even entering infinite loop. Imagine that our F can change — and suddenly the solutions can be both huge 1.0042e+50 and tiny 1.0098e-70. Detailed discussion follows.
Calculus has no notion of a "small number": for any real number, you can find infinitely many even smaller ones. The problem is, among those "even smaller" ones might be a root of our equation. Even worse, some equations will have distinct roots (e.g. 2.51e-8 and 1.38e-8) — both of which will get approximated by the same answer if our stopping condition looks like d < 1e-6. Whichever "small number" you choose, many roots which would've been found correctly to the maximum precision with a == a+d — will get spoiled by the "epsilon" being too large.
It's true however that floats' exponent has finite limited range, so one actually can find the smallest nonzero positive FP number; in IEEE 754 single precision, it's the 1e-45 denorm. But it's useless! while (d >= 1e-45) {…} will loop forever with single-precision (positive nonzero) d.
At the same time, any choice of the "small number" in d < eps stopping condition will be too small for many equations. Where the root has high enough exponent, the result of subtraction of two neighboring mantissas will easily exceed our "epsilon". For example, 7.00023e+8 - 7.00022e+8 = 0.00001e+8 = 1.00000e+3 = 1000 — meaning that the smallest possible difference between numbers with exponent +8 and 6-digit mantissa is... 1000! It will never fit into, say, 1e-4. For numbers with relatively high exponent we simply have not enough precision to ever see a difference of 1e-4. This means eps = 1e-4 will be too small!
My implementation above took this last problem into account; you can see that d is halved each step — instead of getting recalculated as difference of (possibly huge in exponent) a and b. For reals, it doesn't matter; for floats it does! The algorithm will get into infinite loops with (b-a) < eps on equations with huge enough roots. The previous paragraph shows why. d < eps won't get stuck, but even then — needless iterations will be performed during shrinking d way down below the precision of a — still showing the choice of eps as too small. But a == a+d will stop exactly at precision.
Thus as shown: any choice of eps in while (d < eps) {…} will be both too large and too small, if we allow F to vary.
... This kind of reasoning may seem overly theoretical and needlessly deep, but it's to illustrate again the trickiness of floats. One should be aware of their finite precision when writing arithmetic operators around.
Perfect for integral values even in floating point formats
But the short answer is: "No, don't use ==."
Ironically, the floating point format works "perfectly", i.e., with exact precision, when operating on integral values within the range of the format. This means that you if you stick with double values, you get perfectly good integers with a little more than 50 bits, giving you about +- 4,500,000,000,000,000, or 4.5 quadrillion.
In fact, this is how JavaScript works internally, and it's why JavaScript can do things like + and - on really big numbers, but can only << and >> on 32-bit ones.
Strictly speaking, you can exactly compare sums and products of numbers with precise representations. Those would be all the integers, plus fractions composed of 1 / 2n terms. So, a loop incrementing by n + 0.25, n + 0.50, or n + 0.75 would be fine, but not any of the other 96 decimal fractions with 2 digits.
So the answer is: while exact equality can in theory make sense in narrow cases, it is best avoided.
The only case where I ever use == (or !=) for floats is in the following:
if (x != x)
{
// Here x is guaranteed to be Not a Number
}
and I must admit I am guilty of using Not A Number as a magic floating point constant (using numeric_limits<double>::quiet_NaN() in C++).
There is no point in comparing floating point numbers for strict equality. Floating point numbers have been designed with predictable relative accuracy limits. You are responsible for knowing what precision to expect from them and your algorithms.
It's probably ok if you're never going to calculate the value before you compare it. If you are testing if a floating point number is exactly pi, or -1, or 1 and you know that's the limited values being passed in...
I also used it a few times when rewriting few algorithms to multithreaded versions. I used a test that compared results for single- and multithreaded version to be sure, that both of them give exactly the same result.
Let's say you have a function that scales an array of floats by a constant factor:
void scale(float factor, float *vector, int extent) {
int i;
for (i = 0; i < extent; ++i) {
vector[i] *= factor;
}
}
I'll assume that your floating point implementation can represent 1.0 and 0.0 exactly, and that 0.0 is represented by all 0 bits.
If factor is exactly 1.0 then this function is a no-op, and you can return without doing any work. If factor is exactly 0.0 then this can be implemented with a call to memset, which will likely be faster than performing the floating point multiplications individually.
The reference implementation of BLAS functions at netlib uses such techniques extensively.
In my opinion, comparing for equality (or some equivalence) is a requirement in most situations: standard C++ containers or algorithms with an implied equality comparison functor, like std::unordered_set for example, requires that this comparator be an equivalence relation (see C++ named requirements: UnorderedAssociativeContainer).
Unfortunately, comparing with an epsilon as in abs(a - b) < epsilon does not yield an equivalence relation since it loses transitivity. This is most probably undefined behavior, specifically two 'almost equal' floating point numbers could yield different hashes; this can put the unordered_set in an invalid state.
Personally, I would use == for floating points most of the time, unless any kind of FPU computation would be involved on any operands. With containers and container algorithms, where only read/writes are involved, == (or any equivalence relation) is the safest.
abs(a - b) < epsilon is more or less a convergence criteria similar to a limit. I find this relation useful if I need to verify that a mathematical identity holds between two computations (for example PV = nRT, or distance = time * speed).
In short, use == if and only if no floating point computation occur;
never use abs(a-b) < e as an equality predicate;
Yes. 1/x will be valid unless x==0. You don't need an imprecise test here. 1/0.00000001 is perfectly fine. I can't think of any other case - you can't even check tan(x) for x==PI/2
The other posts show where it is appropriate. I think using bit-exact compares to avoid needless calculation is also okay..
Example:
float someFunction (float argument)
{
// I really want bit-exact comparison here!
if (argument != lastargument)
{
lastargument = argument;
cachedValue = very_expensive_calculation (argument);
}
return cachedValue;
}
I would say that comparing floats for equality would be OK if a false-negative answer is acceptable.
Assume for example, that you have a program that prints out floating points values to the screen and that if the floating point value happens to be exactly equal to M_PI, then you would like it to print out "pi" instead. If the value happens to deviate a tiny bit from the exact double representation of M_PI, it will print out a double value instead, which is equally valid, but a little less readable to the user.
I have a drawing program that fundamentally uses a floating point for its coordinate system since the user is allowed to work at any granularity/zoom. The thing they are drawing contains lines that can be bent at points created by them. When they drag one point on top of another they're merged.
In order to do "proper" floating point comparison I'd have to come up with some range within which to consider the points the same. Since the user can zoom in to infinity and work within that range and since I couldn't get anyone to commit to some sort of range, we just use '==' to see if the points are the same. Occasionally there'll be an issue where points that are supposed to be exactly the same are off by .000000000001 or something (especially around 0,0) but usually it works just fine. It's supposed to be hard to merge points without the snap turned on anyway...or at least that's how the original version worked.
It throws of the testing group occasionally but that's their problem :p
So anyway, there's an example of a possibly reasonable time to use '=='. The thing to note is that the decision is less about technical accuracy than about client wishes (or lack thereof) and convenience. It's not something that needs to be all that accurate anyway. So what if two points won't merge when you expect them to? It's not the end of the world and won't effect 'calculations'.

Does casting to an int after std::floor guarantee the right result?

I'd like a floor function with the syntax
int floor(double x);
but std::floor returns a double. Is
static_cast <int> (std::floor(x));
guaranteed to give me the correct integer, or could I have an off-by-one problem? It seems to work, but I'd like to know for sure.
For bonus points, why the heck does std::floor return a double in the first place?
The range of double is way greater than the range of 32 or 64 bit integers, which is why std::floor returns a double. Casting to int should be fine so long as it's within the appropriate range - but be aware that a double can't represent all 64 bit integers exactly, so you may also end up with errors when you go beyond the point at which the accuracy of double is such that the difference between two consecutive doubles is greater than 1.
static_cast <int> (std::floor(x));
does pretty much what you want, yes. It gives you the nearest integer, rounded towards -infinity. At least as long as your input is in the range representable by ints.
I'm not sure what you mean by 'adding .5 and whatnot, but it won't have the same effect
And std::floor returns a double because that's the most general. Sometimes you might want to round off a float or double, but preserve the type. That is, round 1.3f to 1.0f, rather than to 1.
That'd be hard to do if std::floor returned an int. (or at least you'd have an extra unnecessary cast in there slowing things down).
If floor only performs the rounding itself, without changing the type, you can cast that to int if/when you need to.
Another reason is that the range of doubles is far greater than that of ints. It may not be possible to round all doubles to ints.
The C++ standard says (4.9.1):
"An rvalue of a floating point type can be converted to an rvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type".
So if you are converting a double to an int, the number is within the range of int and the required rounding-up is toward zero, then it is enough to simply cast the number to int:
(int)x;
If you want to deal with various numeric conditions and want to handle different types of conversions in a controlled way, then maybe you should look at the Boost.NumericConversion. This library allows to handle weird cases (like out-of-range, rounding, ranges, etc.)
Here is the example from the documentation:
#include <cassert>
#include <boost/numeric/conversion/converter.hpp>
int main() {
typedef boost::numeric::converter<int,double> Double2Int ;
int x = Double2Int::convert(2.0);
assert ( x == 2 );
int y = Double2Int()(3.14); // As a function object.
assert ( y == 3 ) ; // The default rounding is trunc.
try
{
double m = boost::numeric::bounds<double>::highest();
int z = Double2Int::convert(m); // By default throws positive_overflow()
}
catch ( boost::numeric::positive_overflow const& )
{
}
return 0;
}
Most of the standard math library uses doubles but provides float versions as well. std::floorf() is the single precision version of std::floor() if you'd prefer not to use doubles.
Edit: I've removed part of my previous answer. I had stated that the floor was redundant when casting to int, but I forgot that this is only true for positive floating point numbers.