Handling overflow when casting doubles to integers in C / C++

Today, I noticed that when I cast a double that is greater than the maximum possible integer to an integer, I get -2147483648. Similarly, when I cast a double that is less than the minimum possible integer, I also get -2147483648.
Is this behavior defined for all platforms?
What is the best way to detect this under/overflow? Is putting if statements for min and max int before the cast the best solution?

When casting floats to integers, overflow causes undefined behavior. From the C99 spec, section 6.3.1.4 Real floating and integer:
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
You have to check the range manually, but don't use code like:
// DON'T use code like this!
if (my_double > INT_MAX || my_double < INT_MIN)
    printf("Overflow!");
INT_MAX is an integer constant that may not have an exact floating-point representation. When comparing to a float, it may be rounded to the nearest higher or nearest lower representable floating-point value (this is implementation-defined). With 64-bit integers, for example, INT64_MAX is 2^63 - 1, which will typically be rounded to 2^63, so the check effectively becomes my_double > 2^63. This won't detect an overflow when my_double equals 2^63.
For example with gcc 4.9.1 on Linux, the following program
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    double d = pow(2, 63);
    int64_t i = INT64_MAX;
    printf("%f > %" PRId64 " is %s\n", d, i, d > i ? "true" : "false");
    return 0;
}
prints
9223372036854775808.000000 > 9223372036854775807 is false
It's hard to get this right if you don't know the limits and internal representation of the integer and double types beforehand. But if you convert from double to int64_t, for example, you can use floating point constants that are exact doubles (assuming two's complement and IEEE doubles):
if (!(my_double >= -9223372036854775808.0    // -2^63
      && my_double < 9223372036854775808.0)  // 2^63
   ) {
    // Handle overflow.
}
The construct !(A && B) also handles NaNs correctly. A portable, safe, but slightly inaccurate version for int is:
if (!(my_double > INT_MIN && my_double < INT_MAX)) {
    // Handle overflow.
}
This errs on the side of caution and will falsely reject values that equal INT_MIN or INT_MAX. But for most applications, this should be fine.
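For convenience, the portable check can be wrapped in a small helper. This is a minimal sketch using the conservative INT_MIN/INT_MAX bounds above; the function name and the int/out-parameter interface are just for illustration:
#include <limits.h>

/* Convert a double to int without invoking undefined behavior.
 * Returns 1 and stores the result in *out on success;
 * returns 0 for NaN or (conservatively) out-of-range input. */
int double_to_int_checked(double d, int *out)
{
    /* The negated form of the test also rejects NaN, because every
     * comparison involving NaN is false. */
    if (!(d > INT_MIN && d < INT_MAX))
        return 0;
    *out = (int)d; /* now known to be in range, so the cast is well defined */
    return 1;
}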

limits.h has constants for the maximum and minimum possible values of the integer data types; you can check your double variable before casting, like
if (my_double > nextafter(INT_MAX, 0) || my_double < nextafter(INT_MIN, 0))
    printf("Overflow!");
else
    my_int = (int)my_double;
EDIT: nextafter() avoids the rounding problem with INT_MAX mentioned by nwellnhof above.

To answer your question: the behaviour when you cast an out-of-range float is undefined or implementation-specific.
Speaking from experience: I've worked on a MIPS64 system that didn't implement this kind of cast at all. Instead of doing something deterministic, the CPU threw a CPU exception. The exception handler that was supposed to emulate the cast returned without writing anything to the result.
I ended up with random integers. Guess how long it took to trace a bug back to this cause. :-)
You'd better do the range check yourself if you aren't sure the number can't go out of the valid range.

A portable way for C++ is to use the SafeInt class:
http://www.codeplex.com/SafeInt
The implementation allows normal addition/subtraction/etc. on a C++ number type, including casts. It throws an exception whenever an overflow scenario is detected.
SafeInt<int> s1 = INT_MAX;
SafeInt<int> s2 = 42;
SafeInt<int> s3 = s1 + s2; // throws
I highly advise using this class in any place where overflow is an important scenario. It makes it very difficult to overflow silently. In cases where there is a recovery scenario for an overflow, simply catch the SafeIntException and recover as appropriate.
SafeInt now works on GCC as well as Visual Studio.
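A minimal sketch of the catch-and-recover pattern (assuming the SafeInt.hpp header from the library distribution is on the include path; SafeIntException is the exception type the library throws by default):
#include <climits>
#include <cstdio>
#include "SafeInt.hpp" // from the SafeInt distribution

int main() {
    try {
        SafeInt<int> s1 = INT_MAX;
        SafeInt<int> s2 = 42;
        SafeInt<int> s3 = s1 + s2; // overflow detected here; throws
        std::printf("%d\n", (int)s3);
    } catch (SafeIntException &) {
        // Recovery path: saturate, log, report an error to the caller, etc.
        std::printf("overflow detected, recovering\n");
    }
    return 0;
}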

What is the best way to detect this under/overflow?
Compare the truncated double against exact limits near INT_MIN and INT_MAX.
The trick is to convert limits based on INT_MIN and INT_MAX into exact double values. A double may not represent INT_MAX exactly, since the number of value bits in an int may exceed the floating-point precision.*1 In that case the conversion of INT_MAX to double suffers from rounding. The number just after INT_MAX, however, is a power of 2 and is certainly representable as a double: 2.0*(INT_MAX/2 + 1) generates the whole number one greater than INT_MAX.
The same applies to INT_MIN on non-2s-complement machines.
INT_MAX is always a power-of-2 minus 1.
INT_MIN is always:
-INT_MAX (not 2's complement), or
-INT_MAX - 1 (2's complement)
int double_to_int(double x) {
    x = trunc(x);
    if (x >= 2.0*(INT_MAX/2 + 1)) Handle_Overflow();
#if -INT_MAX == INT_MIN
    if (x <= 2.0*(INT_MIN/2 - 1)) Handle_Underflow();
#else
    // Fixed 2022
    // if (x < INT_MIN) Handle_Underflow();
    if (x - INT_MIN < -1.0) Handle_Underflow();
#endif
    return (int) x;
}
To detect NaN as well, and to avoid trunc():
#define DBL_INT_MAXP1 (2.0*(INT_MAX/2+1))
#define DBL_INT_MINM1 (2.0*(INT_MIN/2-1))

int double_to_int(double x) {
    if (x < DBL_INT_MAXP1) {
#if -INT_MAX == INT_MIN
        if (x > DBL_INT_MINM1) {
            return (int) x;
        }
#else
        if (ceil(x) >= INT_MIN) {
            return (int) x;
        }
#endif
        Handle_Underflow();
    } else if (x > 0) {
        Handle_Overflow();
    } else {
        Handle_NaN();
    }
}
[Edit 2022] Corner error corrected after 6 years.
double values in the range (INT_MIN - 1.0 ... INT_MIN) (non-inclusive end-points) convert well to int. Prior code failed those.
*1 This also applies to INT_MIN - 1 when int has more precision than double. Although that is rare, the issue readily applies to long long. Consider the difference between:
if (x < LLONG_MIN - 1.0) Handle_Underflow();  // Bad
if (x - LLONG_MIN < -1.0) Handle_Underflow(); // Good
With 2's complement, some_int_type_MIN is a (negative) power-of-2 and exactly converts to a double. Thus x - LLONG_MIN is exact in the range of concern while LLONG_MIN - 1.0 may suffer precision loss in the subtraction.

We ran into the same issue. For example:
double d = 9223372036854775807L;
int i = (int)d;
On Linux/Windows, i == -2147483648, but on AIX 5.3, i == 2147483647.
If the double is outside the range of int:
Linux/Windows always returns INT_MIN;
AIX returns INT_MAX if the double is positive, and INT_MIN if it is negative.

Another option is to use boost::numeric_cast which allows for arbitrary conversion between numerical types. It detects loss of range when a numeric type is converted, and throws an exception if the range cannot be preserved.
The Boost documentation also provides a small example which should give a quick overview of how this template can be used.
Of course, this isn't plain C anymore ;-)
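A quick sketch of how that might look (assuming Boost is available; out-of-range conversions throw boost::numeric::bad_numeric_cast or one of its subclasses):
#include <boost/numeric/conversion/cast.hpp>
#include <cstdio>

int main() {
    double d = 1e20; // far outside the range of int
    try {
        int i = boost::numeric_cast<int>(d); // range-checked conversion
        std::printf("converted to %d\n", i);
    } catch (const boost::numeric::bad_numeric_cast &e) {
        std::printf("conversion failed: %s\n", e.what());
    }
    return 0;
}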

I am not sure about this, but I think it may be possible to "turn on" floating-point exceptions for underflow/overflow. Take a look at Dealing with Floating-point Exceptions in MSVC7/8; that might give you an alternative to if/else checks.
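On other platforms, a related (but not standard-guaranteed) trick is to inspect the floating-point status flags after the conversion: the out-of-range conversion itself is undefined behavior per the standard, but on typical IEEE 754 hardware it raises the "invalid" exception, which can be tested with <cfenv>. A hedged sketch:
#include <cfenv>
#include <cstdio>

int main() {
    std::feclearexcept(FE_INVALID);  // clear any stale "invalid" flag
    double d = 1e20;                 // far outside the range of int
    volatile int i = (int)d;         // may raise FE_INVALID on IEEE hardware
    if (std::fetestexcept(FE_INVALID))
        std::printf("conversion of %g was out of range\n", d);
    else
        std::printf("converted to %d\n", i);
    return 0;
}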

I can't tell you for certain whether it is defined for all platforms, but that is pretty much what has happened on every platform I've used. Except, in my experience, it wraps around: if the value of the double is INT_MAX + 2, then the result of the cast ends up being INT_MIN + 2.
As for the best way to handle it, I'm really not sure. I've run up against the issue myself and have yet to find an elegant way to deal with it. I'm sure someone will respond who can help us both there.

Related

Float to Double adding many 0s at the end of the new double number during conversion

I'm facing a little problem in a personal project:
When I convert a float number to a double to make operations (+ - * /) easier, it adds a lot of extra digits behind the original float number.
For example: float number = -4.1112 -> double number = -4.1112000000000002
I convert the float to a double with the standard function std::stod().
This is a big problem for me because I'm checking for overflow in my project, and the check throws an exception because of this issue.
Here is the checkOverflow function that throws an exception:
{
    if (type == eOperandType::Int8) {
        if (value > std::numeric_limits<int8_t>::max() || value < std::numeric_limits<int8_t>::min())
            throw VMException("Overflow");
    } else if (type == eOperandType::Int16) {
        if (value > std::numeric_limits<int16_t>::max() || value < std::numeric_limits<int16_t>::min())
            throw VMException("Overflow");
    } else if (type == eOperandType::Int32) {
        if (value > std::numeric_limits<int32_t>::max() || value < std::numeric_limits<int32_t>::min())
            throw VMException("Overflow");
    } else if (type == eOperandType::Float) {
        if (value > std::numeric_limits<float>::max() || value < std::numeric_limits<float>::min())
            throw VMException("Overflow");
    } else if (type == eOperandType::Double) {
        if (value > std::numeric_limits<double>::max() || value < std::numeric_limits<double>::min())
            throw VMException("Overflow");
    }
}
The problem you are having is completely different.
All your checks are wrong. Think about it: if a variable is of type, say, int32_t, its value is necessarily between the minimum and maximum values that can be represented by an int32_t, by definition. Let's simplify: it's like having a single-digit number and testing that it is between 0 and 9 (if it is unsigned), or between -9 and +9 (if it is signed): how could such a test fail? Your checks should never raise an exception.
But, as you say, they do. How is that even possible? And anyway, why would it happen because of the long series of zeros that derives from representing -4.1112 as a floating-point number, turning it into -4.1112000000000002? That isn't an overflow! This is a strong hint that your problem is elsewhere.
The explanation is that std::numeric_limits<T>::min doesn't do what you think. As cppreference explains, for floating-point types it gives you the smallest positive value:
For floating-point types with denormalization, min returns the minimum positive normalized value. Note that this behavior may be unexpected, especially when compared to the behavior of min for integral types. To find the value that has no values less than it, use numeric_limits::lowest.
And the page about lowest also provides an example, comparing the output of min, lowest and max:
std::numeric_limits<T>::min():
    float:  1.17549e-38   or 0x1p-126
    double: 2.22507e-308  or 0x1p-1022
std::numeric_limits<T>::lowest():
    float:  -3.40282e+38  or -0x1.fffffep+127
    double: -1.79769e+308 or -0x1.fffffffffffffp+1023
std::numeric_limits<T>::max():
    float:  3.40282e+38   or 0x1.fffffep+127
    double: 1.79769e+308  or 0x1.fffffffffffffp+1023
And as you can see, min is positive. So the opposite of max is lowest.
So you are getting exceptions because your negative values are smaller than the smallest positive value. Or, in other words: because -4 is less than 0.0001. Which is correct. It's the test that is wrong!
You could fix that by using lowest... But then, what would your checks tell you? If they ever raised an exception, it would mean that the compiler and/or library that you are using are seriously broken. If that is what you are testing, ok. But honestly I think it will never happen, and you could just delete these tests, as they provide no real value.
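For reference, if you do keep such a range check for values that are about to be narrowed, the lower bound should use lowest() rather than min(); a sketch of the corrected Float branch from the question's function:
} else if (type == eOperandType::Float) {
    // lowest() is the most negative finite float; min() is the smallest
    // positive normalized float, which is not what we want here.
    if (value > std::numeric_limits<float>::max()
        || value < std::numeric_limits<float>::lowest())
        throw VMException("Overflow");
}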
That's life I'm afraid. A floating point type can only represent a sparse subset of the real numbers.
Assuming IEEE754, the nearest float to -4.1112 is -4.111199855804443359375
The nearest double to -4.1112 is -4.111200000000000187583282240666449069976806640625
If you need perfect decimal precision, use a decimal type; there's one in the Boost library distribution. Boost also has a numeric_cast function that does what your function is attempting to do, cleverly and with minimal run-time overhead.

Does casting `std::floor()` and `std::ceil()` to integer type always give the correct result?

I am being paranoid that one of these functions may give an incorrect result like this:
std::floor(2000.0 / 1000.0) --> std::floor(1.999999999999) --> 1
or
std::ceil(18 / 3) --> std::ceil(6.000000000001) --> 7
Can something like this happen? If there is indeed a risk like this, I'm planning to use the functions below in order to work safely. But, is this really necessary?
constexpr long double EPSILON = 1e-10;

intmax_t GuaranteedFloor(const long double & Number)
{
    if (Number > 0)
    {
        return static_cast<intmax_t>(std::floor(Number) + EPSILON);
    }
    else
    {
        return static_cast<intmax_t>(std::floor(Number) - EPSILON);
    }
}

intmax_t GuaranteedCeil(const long double & Number)
{
    if (Number > 0)
    {
        return static_cast<intmax_t>(std::ceil(Number) + EPSILON);
    }
    else
    {
        return static_cast<intmax_t>(std::ceil(Number) - EPSILON);
    }
}
(Note: I'm assuming that the given 'long double' argument will fit in the 'intmax_t' return type.)
People often get the impression that floating point operations produce results with small, unpredictable, quasi-random errors. This impression is incorrect.
Floating point arithmetic computations are as exact as possible. 18/3 will always produce exactly 6. The result of 1/3 won't be exactly one third, but it will be the closest number to one third that is representable as a floating point number.
So the examples you showed are guaranteed to always work. As for your suggested "guaranteed floor/ceil", it's not a good idea. Certain sequences of operations can easily blow the error far above 1e-10, and certain other use cases will require 1e-10 to be correctly recognized (and ceil'ed) as nonzero.
As a rule of thumb, hardcoded epsilon values are bugs in your code.
In the specific examples you're listing, I don't think those errors would ever occur.
std::floor(2000.0 /*Exactly Representable in 32-bit or 64-bit Floating Point Numbers*/ / 1000.0 /*Also exactly representable*/) --> std::floor(2.0 /*Exactly Representable*/) --> 2
std::ceil(18 / 3 /*both treated as ints, might not even compile if ceil isn't properly overloaded....?*/) --> 6
std::ceil(18.0 /*Exactly Representable*/ / 3.0 /*Exactly Representable*/) --> 6
Having said that, if you have math that depends on these functions behaving exactly correctly for floating point numbers, that may illuminate a design flaw you need to reconsider/reexamine.
As long as the floating-point values x and y exactly represent integers within the limits of the type you're using, there's no problem--x / y will always yield a floating-point value that exactly represents the integer result. Casting to int as you're doing will always work.
However, once the floating-point values go outside the integer-representable range for the type (Representing integers in doubles), epsilons don't help.
Consider this example. 16777217 is the smallest integer not exactly representable as a 32-bit float:
int ix = 16777217, iy = 97;
printf("%d / %d = %d", ix, iy, ix/iy);
// yields "16777217 / 97 = 172961" which is accurate

float x = ix, y = iy;
printf("%f / %f = %f", x, y, x/y);
// yields "16777216.000000 / 97.000000 = 172960.989691"
In this case, the error is negative; in other cases (try 16777219 / 1549), the error is positive.
While it's tempting to add an epsilon to make floor work, it won't extend the accuracy much. When the values differ by many orders of magnitude, the error becomes greater than 1 and integer accuracy can't be guaranteed. Specifically, once x/y exceeds the range in which integers are exactly representable (2^24 for float), the error can exceed 1.0, so an epsilon is no help.
If this is coming into play, you will have to consider changing your mathematical approach--order of operations, work with logarithms, etc.
Such results are likely to appear when working with doubles. You can use round, or you can subtract 0.5 and then use the std::ceil function.

How close to division by zero can I get?

I want to avoid dividing by zero so I have an if statement:
float number;
//........
if (number > 0.000000000000001)
    number = 1/number;
How small of a value can I safely use in place of 0.000000000000001?
Just use:
if (number > 0)
    number = 1/number;
Note the difference between > and >=. If number > 0, then it definitely is not 0.
If number can be negative you can also use:
if (number != 0)
    number = 1/number;
Note that, as others have mentioned in the comments, checking that number is not 0 will not prevent your result from being Inf or -Inf.
The number in the if condition depends on what you want to do with the result. In IEEE 754, which is used by (almost?) all C implementations, dividing by 0 is OK: you get positive or negative infinity.
If your goal is to avoid +/- Infinity, then the number in the if condition will depend upon the numerator. When the numerator is 1, you can use DBL_MIN or FLT_MIN from float.h.
If your goal is to avoid huge numbers after the division, you can do the division and then check if fabs(number) is bigger than certain value after the division, and then take whatever action as needed.
There is no single correct answer to your question.
You can simply check:
if (number > 0)
I can't understand why you need the lower limit.
For a numeric type T, std::numeric_limits gives you everything you need. For example, you could do this to make sure that anything above min_invertible has a finite reciprocal:
float max_float = std::numeric_limits<float>::max();
float min_float = std::numeric_limits<float>::min(); // or denorm_min()
float min_invertible = (max_float * min_float > 1.0f) ? min_float : 1.0f / max_float;
You can't decently check up front. DBL_MAX / 0.5 effectively is a division by zero; the result is the same infinity you'd get from any other division by (almost) zero.
There is a simple solution: just check the result. std::isinf(result) will tell you whether the result overflowed, and IEEE 754 tells you that division cannot produce infinity in other cases. (Well, except for INF/x, but that's not really producing infinity, merely preserving it.)
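A minimal sketch of the check-the-result approach (assuming IEEE 754 semantics, where an overflowing division yields infinity rather than trapping):
#include <cmath>
#include <cstdio>

int main() {
    float number = 1e-45f;           // a denormal; 1.0f / number overflows float
    float result = 1.0f / number;
    if (std::isinf(result))
        std::printf("division overflowed\n");
    else
        std::printf("result = %g\n", result);
    return 0;
}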
Your risk of producing an unhelpful result through overflow or underflow depends on both numerator and denominator.
A safety check which takes that into consideration is:
if (den == 0.0 || log2(num) - log2(den) >= log2(FLT_MAX))
    /* expect overflow */ ;
else
    return num / den;
but you might want to shave a small amount off log2(FLT_MAX) to leave wiggle-room for subsequent arithmetic and round-off.
You can do something similar with frexp, which would work for negative values as well:
int max;
int n, d;
frexp(FLT_MAX, &max);
frexp(num, &n);
frexp(den, &d);
if (den == 0.0 || n - d > max)
    /* might overflow */ ;
else
    return num / den;
This avoids the work of computing the logarithm, which might be more efficient if the compiler can find a suitable way of doing it, but it's not as accurate.
With IEEE 32-bit floats, the smallest possible value greater than 0 is 2^-149.
If you're using IEEE 64-bit, the smallest possible value is 2^-1074.
That said, (x > 0) is probably the better test.

Floating point equality and tolerances

Comparing two floating-point numbers with something like a_float == b_float is asking for trouble, since a_float / 3.0 * 3.0 might not be equal to a_float due to round-off error.
What one normally does is something like fabs(a_float - b_float) < tol.
How does one calculate tol?
Ideally the tolerance should be just larger than the value of one or two of the least significant figures. So if single-precision floating-point numbers are used, tol = 10E-6 should be about right. However, this does not work well for the general case where a_float might be very small or very large.
How does one calculate tol correctly for all general cases? I am interested in C or C++ cases specifically.
This blog post contains an example, a fairly foolproof implementation, and the detailed theory behind it:
http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
It is also part of a series, so you can always read more.
In short: use ULPs for most numbers and an epsilon for numbers near zero, but there are still caveats. If you want to be sure about your floating-point math, I recommend reading the whole series.
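For reference, the core of a ULP-style comparison looks roughly like this; it is a simplified sketch for finite floats (assuming IEEE 754 binary32 and that reinterpreting the bits via memcpy is acceptable), and the full treatment of zeros, infinities and NaNs is covered in the linked article:
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Distance in units in the last place between two finite floats of the same sign.
int32_t ulp_distance(float a, float b) {
    int32_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia); // bit patterns of adjacent same-sign floats
    std::memcpy(&ib, &b, sizeof ib); // differ by exactly 1 when viewed as integers
    return std::abs(ia - ib);
}

bool almost_equal_ulps(float a, float b, int32_t max_ulps) {
    if (std::signbit(a) != std::signbit(b))
        return a == b; // handles +0.0 vs -0.0; don't compare ULPs across signs
    return ulp_distance(a, b) <= max_ulps;
}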
As far as I know, one doesn't.
There is no general "right answer", since it can depend on the application's requirement for precision.
For instance, a 2D physics simulation working in screen-pixels might decide that 1/4 of a pixel is good enough, while a 3D CAD system used to design nuclear plant internals might not.
I can't see a way to programmatically decide this from the outside.
The C header file <float.h> gives you the constants FLT_EPSILON and DBL_EPSILON, which are the difference between 1.0 and the smallest number larger than 1.0 that a float/double can represent. You can scale that by the size of your numbers and the rounding error you wish to tolerate:
#include <float.h>
#include <math.h>

#ifndef DBL_TRUE_MIN
/* DBL_TRUE_MIN is a common non-standard extension for the minimum denormal value.
 * DBL_MIN is the minimum non-denormal value -- use that if TRUE_MIN is not defined. */
#define DBL_TRUE_MIN DBL_MIN
#endif

/* return the difference between |x| and the next larger representable double */
double dbl_epsilon(double x) {
    int exp;
    if (frexp(x, &exp) == 0.0)
        return DBL_TRUE_MIN;
    return ldexp(DBL_EPSILON, exp - 1);
}
Welcome to the world of traps, snares and loopholes. As mentioned elsewhere, a general purpose solution for floating point equality and tolerances does not exist. Given that, there are tools and axioms that a programmer may use in select cases.
fabs(a_float - b_float) < tol has the shortcoming the OP mentioned: it "does not work well for the general case where a_float might be very small or might be very large." fabs(a_float - ref_float) <= fabs(ref_float * tol) copes with varying ranges much better.
The OP's "single precision floating point number is use tol = 10E-6" is a bit worrisome, as C and C++ so easily promote float arithmetic to double, and then it's the "tolerance" of double, not float, that comes into play. Consider float f = 1.0; printf("%.20f\n", f/7.0); -- many new programmers do not realize that the 7.0 caused a double-precision calculation. I recommend using double throughout your code except where large amounts of data need float's smaller size.
C99 provides nextafter(), which can be useful in helping to gauge "tolerance". Using it, one can determine the next representable number. This helps with the OP's "... the full number of significant digits for the storage type minus one ... to allow for roundoff error": if ((nextafter(x, -INF) <= y) && (y <= nextafter(x, +INF))) ...
The kind of tol or "tolerance" used is often the crux of the matter. Most often (IMHO) a relative tolerance is important, e.g. "Are x and y within 0.0001%?" Sometimes an absolute tolerance is needed, e.g. "Are x and y within 0.0001?"
The value of the tolerance is often debatable, since the best value is often situation-dependent. Comparing within 0.01 may work for a financial application in dollars but not in yen. (Hint: be sure to use a coding style that allows easy updates.)
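As a concrete illustration of the relative form described above (a sketch only; the function name is made up, the tolerance value is application-dependent, and a reference value of exactly zero needs separate handling):
#include <math.h>

/* Relative comparison: is x within rel_tol of the reference value ref?
 * For example, rel_tol = 1e-6 corresponds to "within 0.0001%". */
int within_relative_tol(double x, double ref, double rel_tol)
{
    return fabs(x - ref) <= fabs(ref * rel_tol);
}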
Rounding error varies according to values used for operations.
Instead of a fixed tolerance, you can probably use a factor of epsilon like:
bool nearly_equal(double a, double b, int factor /* a factor of epsilon */)
{
    double min_a = a - (a - std::nextafter(a, std::numeric_limits<double>::lowest())) * factor;
    double max_a = a + (std::nextafter(a, std::numeric_limits<double>::max()) - a) * factor;
    return min_a <= b && max_a >= b;
}
Although the value of the tolerance depends on the situation, if you are looking for a precision comparison you could use the machine epsilon, std::numeric_limits<T>::epsilon() (from <limits>), as the tolerance. The function returns the difference between 1 and the smallest value greater than 1 that is representable for the data type.
http://msdn.microsoft.com/en-us/library/6x7575x3.aspx
The value of epsilon differs depending on whether you are comparing floats or doubles. For instance, on my computer, the epsilon for float is 1.1920929e-007 and the epsilon for double is 2.2204460492503131e-016.
For a relative comparison between x and y, multiply the epsilon by the maximum absolute value of x and y.
The result above could be multiplied by the ulps (units in the last place) which allows you to play with the precision.
#include <algorithm>
#include <cmath>
#include <limits>

template<class T> bool are_almost_equal(T x, T y, int ulp)
{
    return std::abs(x - y) <= std::numeric_limits<T>::epsilon() * std::max(std::abs(x), std::abs(y)) * ulp;
}
When I need to compare floats, I use code like this
bool same(double a, double b, double error) {
    double x;
    if (a == 0) {
        x = b;
    } else if (b == 0) {
        x = a;
    } else {
        x = (a - b) / a;
    }
    return fabs(x) < error;
}

Can I trust a real-to-int conversion of the result of ceil()?

Suppose I have some code such as:
float a, b = ...; // both positive
int s1 = ceil(sqrt(a/b));
int s2 = ceil(sqrt(a/b)) + 0.1;
Is it ever possible that s1 != s2? My concern is when a/b is a perfect square. For example, perhaps a=100.0 and b=4.0, then the output of ceil should be 5.00000 but what if instead it is 4.99999?
Similar question: is there a chance that 100.0/4.0 evaluates to say 5.00001 and then ceil will round it up to 6.00000?
I'd prefer to do this in integer math but the sqrt kinda screws that plan.
EDIT: suggestions on how to better implement this would be appreciated too! The a and b values are integer values, so actual code is more like: ceil(sqrt(float(a)/b))
EDIT: Based on levis501's answer, I think I will do this:
float a, b = ...; // both positive
int s = sqrt(a/b);
while (s*s*b < a) ++s;
Thank you all!
I don't think it's possible. Regardless of the value of sqrt(a/b), what it produces is some value N that we use as:
int s1 = ceil(N);
int s2 = ceil(N) + 0.1;
Since ceil always produces an integer value (albeit represented as a double), we will always have some value X, for which the first produces X.0 and the second X.1. Conversion to int will always truncate that .1, so both will result in X.
It might seem like there would be an exception if X was so large that X.1 overflowed the range of double. I don't see where this could be possible though. Except close to 0 (where overflow isn't a concern) the square root of a number will always be smaller than the input number. Therefore, before ceil(N)+0.1 could overflow, the a/b being used as an input in sqrt(a/b) would have to have overflowed already.
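A tiny illustration of that reasoning, with values chosen so every intermediate result is exactly representable (a correctly rounded sqrt gives exactly 5.0 here):
#include <cmath>
#include <cstdio>

int main() {
    double N = std::sqrt(100.0 / 4.0);          // exactly 5.0
    int s1 = std::ceil(N);                      // 5.0 -> 5
    int s2 = std::ceil(N) + 0.1;                // 5.1, truncated back to 5
    std::printf("s1 = %d, s2 = %d\n", s1, s2);  // prints "s1 = 5, s2 = 5"
    return 0;
}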
You may want to write an explicit function for your case. e.g.:
/* return the smallest positive integer whose square is at least x */
int isqrt(double x) {
    int y1 = ceil(sqrt(x));
    int y2 = y1 - 1;
    if ((y2 * y2) >= x) return y2;
    return y1;
}
This will handle the odd case where the square root of your ratio a/b comes out just above an exact integer because of the limits of double precision.
Equality of floating point numbers is indeed an issue, but IMHO not if we deal with integer numbers.
If you have the case of 100.0/4.0, it should evaluate to exactly 25.0, since 25.0 is exactly representable as a float, as opposed to e.g. 25.1.
Yes, it's entirely possible that s1 != s2. Why is that a problem, though?
It seems natural enough that s1 != (s1 + 0.1).
BTW, if you would prefer to have 5.00001 rounded to 5.00000 instead of 6.00000, use rint instead of ceil.
And to answer the actual question (in your comment) - you can use sqrt to get a starting point and then just find the correct square using integer arithmetic.
int min_dimension_greater_than(int items, int buckets)
{
    double target = double(items) / buckets;
    int min_square = ceil(target);
    int dim = floor(sqrt(target));
    int square = dim * dim;
    while (square < min_square) {
        dim += 1;
        square = dim * dim;
    }
    return dim;
}
And yes, this can be improved a lot, it's just a quick sketch.
s1 will always equal s2.
The C and C++ standards do not say much about the accuracy of math routines. Taken literally, it is impossible for the standard to be implemented, since the C standard says sqrt(x) returns the square root of x, but the square root of two cannot be exactly represented in floating point.
Implementing routines with good performance that always return a correctly rounded result (in round-to-nearest mode, this means the result is the representable floating-point number that is nearest to the exact result, with ties resolved in favor of a low zero bit) is a difficult research problem. Good math libraries target accuracy less than 1 ULP (so one of the two nearest representable numbers is returned), perhaps something slightly more than .5 ULP. (An ULP is the Unit of Least Precision, the value of the low bit given a particular value in the exponent field.) Some math libraries may be significantly worse than this. You would have to ask your vendor or check the documentation for more information.
So sqrt may be slightly off. If the exact square root is an integer (within the range in which integers are exactly representable in floating-point) and the library guarantees errors are less than 1 ULP, then the result of sqrt must be exactly correct, because any result other than the exact result is at least 1 ULP away.
Similarly, if the library guarantees errors are less than 1 ULP, then ceil must return the exact result, again because the exact result is representable and any other result would be at least 1 ULP away. Additionally, the nature of ceil is such that I would expect any reasonable math library to always return an integer, even if the rest of the library were not high quality.
As for overflow cases, if ceil(x) were beyond the range where all integers are exactly representable, then ceil(x)+.1 is closer to ceil(x) than it is to any other representable number, so the rounded result of adding .1 to ceil(x) should be ceil(x) in any system implementing the floating-point standard (IEEE 754). That is provided you are in the default rounding mode, which is round-to-nearest. It is possible to change the rounding mode to something like round-toward-infinity, which could cause ceil(x)+.1 to be an integer higher than ceil(x).