C++ - How to correctly cast double to int

I'm currently writing a program that needs to take the floor of a square root. Since the value I'm taking the square root of is positive, I just cast it to int. So say for the following example:
int i = 16;
int j = std::sqrt(i);
j should be 4.
I am wondering though, is it possible that sqrt returns 3.9999999991 instead of 4.000000001 or whatever, and the result of j is 3? Are there rules defining floating point behaviour? How can I properly convert this to an int?

Almost all widely available hardware uses IEEE 754 floating point numbers, although C++ does not require it.
Assuming IEEE 754 floating point numbers and a direct mapping of ::std::sqrt to the IEEE 754 floating point square root operation, you are assured of the following:
16 and 4 can both be represented exactly in floating point arithmetic - in fact, double precision floating point numbers can represent any 32 bit integer exactly
the square root returns the result that is closest to being exact
Therefore, your example will work fine.
In general, the problem you have hinted at might happen, but to solve it you have to ask a bigger question: Under which circumstances is a number that is close to integral, really integral?
This is actually harder than it might seem, since you want to compute the floor of the square root, and therefore simply rounding will not work for you. However, once you have answered this question for yourself, implementing a solution should be rather simple. For example:
int i = 16;
int j = std::sqrt(i);
if((j + 1) * (j + 1) == i) j += 1;
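If you want to be defensive about sqrt erring in either direction, a small helper along these lines works; this is a minimal sketch (not part of the code above), assuming only that std::sqrt lands within a few ULPs of the exact result:
#include <cmath>

// Floor of the square root of a non-negative n, tolerating a slightly
// high or slightly low result from std::sqrt.
int isqrt_floor(int n)
{
    int j = static_cast<int>(std::sqrt(static_cast<double>(n)));
    while (j > 0 && 1LL * j * j > n)        // sqrt came back slightly too high
        --j;
    while (1LL * (j + 1) * (j + 1) <= n)    // sqrt came back slightly too low
        ++j;
    return j;
}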

Related

Using scientific notation in for loops

I've recently come across some code which has a loop of the form
for (int i = 0; i < 1e7; i++){
}
I question the wisdom of doing this since 1e7 is a floating point type, and will cause i to be promoted when evaluating the stopping condition. Should this be cause for concern?
The elephant in the room here is that the range of an int could be as small as -32767 to +32767, and the behaviour on assigning a larger value than this to such an int is undefined.
But, as for your main point, indeed it should concern you, as it is a very bad habit: yes, 1e7 is a floating point literal of type double, and things can go wrong.
The fact that i will be converted to a floating point due to type promotion rules is somewhat moot: the real damage is done if there is unexpected truncation of the apparent integral literal. By way of a "proof by example", consider first the loop
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 18446744073709551615ULL; ){
std::cout << i << "\n";
}
This outputs every consecutive value of i in the range, as you'd expect. Note that std::numeric_limits<std::uint64_t>::max() is 18446744073709551615ULL, which is 1 less than the 64th power of 2. (Here I'm using a slide-like "operator" ++< which is useful when working with unsigned types. Many folk consider --> and ++< as obfuscating but in scientific programming they are common, particularly -->.)
Now on my machine, a double is an IEEE 754 64 bit floating point. (Such a scheme is particularly good at representing powers of 2 exactly - IEEE 754 doubles can represent powers of 2 up to 2^1023 exactly.) So 18,446,744,073,709,551,616 (the 64th power of 2) can be represented exactly as a double. The nearest representable number before that is 18,446,744,073,709,550,592 (which is 1024 less).
So now let's write the loop as
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 1.8446744073709551615e19; ){
std::cout << i << "\n";
}
On my machine that will only output one value of i: 18,446,744,073,709,550,592 (the number that we've already seen). This proves that 1.8446744073709551615e19 is a floating point type. If the compiler was allowed to treat the literal as an integral type then the output of the two loops would be equivalent.
It will work, assuming that your int is at least 32 bits.
However, if you really want to use exponential notation, it is better to define an integer constant outside the loop and use proper casting, like this:
const int MAX_INDEX = static_cast<int>(1.0e7);
...
for (int i = 0; i < MAX_INDEX; i++) {
...
}
Considering this, I'd say it is much better to write
const int MAX_INDEX = 10000000;
or if you can use C++14
const int MAX_INDEX = 10'000'000;
1e7 is a literal of type double, and usually double is 64-bit IEEE 754 format with a 52-bit mantissa. Roughly every tenth power of 2 corresponds to a third power of 10, so double should be able to represent integers up to at least 10^(5*3) = 10^15, exactly. And if int is 32-bit then int has roughly 10^(3*3) = 10^9 as max value (more precisely, 2^31 - 1 = 2 147 483 647, i.e. about twice the rough estimate).
So, in practice it's safe on current desktop systems and larger.
But C++ allows int to be just 16 bits, and on e.g. an embedded system with that small int, one would have Undefined Behavior.
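If you want to make that size assumption explicit rather than implicit, a compile-time check is one option; a minimal sketch, assuming C++11 or later and the spelled-out constant:
#include <limits>

// Refuse to compile on platforms where int cannot hold the intended bound.
static_assert(std::numeric_limits<int>::max() >= 10000000,
              "int is too small to hold the loop bound on this platform");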
If the intention is to loop for an exact integer number of iterations, for example when iterating over exactly all the elements of an array, then comparing against a floating point value is maybe not such a good idea, purely for accuracy reasons; since the implicit conversion of an integer to floating point truncates toward zero, there's no real danger of out-of-bounds access, it will just cut the loop short.
Now the question is: when do these effects actually kick in? Will your program experience them? The floating point representation usually used these days is IEEE 754. As long as the exponent is 0, a floating point value is essentially an integer. C double precision floats have 52 bits for the mantissa, which gives you integer precision up to 2^52, which is on the order of 1e15. Without specifying with a suffix f that you want a floating point literal to be interpreted as single precision, the literal will be double precision and the implicit conversion will target that as well. So as long as your loop end condition is less than 2^52 it will work reliably!
Now one question you have to think about on the x86 architecture is efficiency. The very first 80x87 FPUs came in a separate package, and later a separate chip, and as a result getting values into the FPU registers is a bit awkward at the x86 assembly level. Depending on what your intentions are, it might make a difference in runtime for a realtime application; but that's premature optimization.
TL;DR: Is it safe to do? Most certainly yes. Will it cause trouble? It could cause numerical problems. Could it invoke undefined behavior? It depends on how you use the loop end condition; but if i is used to index an array and for some reason the array length ended up in a floating point variable, then since the conversion always truncates toward zero it's not going to cause a logical problem. Is it a smart thing to do? Depends on the application.

how might someone divide a double or float by multiples of two?

I know with normal integers you can divide by bitshifting to the right. I'm wondering if there's an easy way to do the same with numbers that aren't perfect integers.
ldexp, ldexpf, and ldexpl do this for doubles, floats, and long doubles respectively. Alternatively, if you have a specific power of two in mind (say, 4), it's probably best to just divide the usual way:
whatever / 4
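For illustration, a minimal sketch of calling std::ldexp with a negative exponent to divide by a power of two (the values here are arbitrary):
#include <cmath>
#include <cassert>

int main()
{
    double x = 10.0;
    double quarter = std::ldexp(x, -2);   // scales by 2^-2, i.e. divides by 4
    assert(quarter == x / 4.0);           // exact: powers of two introduce no rounding
    return 0;
}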
There is a function ldexp (and siblings) that allows you to "multiply by powers of two" (including negative ones), but this is not the same optimisation as using shifts for integers. For powers of two, all double values are "perfect" for both X and 1/X (because if X is 2^n, then 1/X = 2^-n, both of which are fine to store as a floating point number in IEEE-754 or any other binary floating point format), so there won't be any odd rounding, which means the compiler should be able to replace the divide by a multiply operation - in my experiments, it indeed does.
Manipulating the exponent of floating point values directly is generally detrimental to performance compared to simply multiplying by 1/X.
The function ldexp is a few dozen instructions long in glibc, with several branches and a call in the code. It is highly unlikely you'll find any benefit from calling ldexp, and it will also confuse people who don't know that x = ldexp(x, -1); is the same as x /= 2.0;.
Just use
double x = 7;
x *= 2;
x /= 2;
assert(x == 7.0);
or
double y = 5;
y = 2*x + 0.5*y;
assert(y == 16.5);
or
double z = 2.5*x;
assert(z == 17.5);
Why? Because your computer can represent all powers of two as floating point values (as long as that power does not exceed the limit of the exponent, that is), and it will do so. Consequently, all of the calculations above are precise, there is no rounding error in the constants. All the assert()s are guaranteed to succeed.
Of course, you can achieve the same effect by doing bit manipulations, but current floating point hardware can do a multiplication within a nanosecond, and it handles all special cases correctly. If you do the bit fiddling, you will either waste time, or handle the special cases incorrectly. So don't even try.

Is it safe to use == on FP in this example

I stumbled onto this code here.
Generators doubleSquares(int value)
{
Generators result;
for (int i = 0; i <= std::sqrt(value / 2); i++) // 1
{
double j = std::sqrt(value - std::pow(i, 2)); // 2
if (std::floor(j) == j) // 3
result.push_back( { i, static_cast<int>(j) } ); // 4
}
return result;
}
Am I wrong to think that //3 is dangerous ?
This code is not guaranteed by the C++ standard to work as desired.
Some low-quality math libraries do not return correctly rounded values for pow, even when the inputs have integer values and the mathematical result can be exactly represented. sqrt may also return an inaccurate value, although this function is easier to implement and so less commonly suffers from defects.
Thus, it is not guaranteed that j is exactly an integer when you might expect it to be.
In a good-quality math library, pow and sqrt will always return correct results (zero error) when the mathematical result is exactly representable. If you have a good-quality C++ implementation, this code should work as desired, up to the limits of the integer and floating-point types used.
Improving the Code
This code has no reason to use pow; std::pow(i, 2) should be i*i. This results in exact arithmetic (up to the point of integer overflow) and completely avoids the question of whether pow is correct.
Eliminating pow leaves just sqrt. If we know the implementation returns correct values, we can accept the use of sqrt. If not, we can use this instead:
for (int i = 0; i*i <= value/2; ++i)
{
int j = std::round(std::sqrt(value - i*i));
if (j*j + i*i == value)
result.push_back( { i, j } );
}
This code only relies on sqrt to return a result accurate within .5, which even a low-quality sqrt implementation should provide for reasonable input values.
There are two different, but related, questions:
Is j an integer?
Is j likely to be the result of a double calculation whose exact result would be an integer?
The quoted code asks the first question. It is not correct for asking the second question. More context would be needed to be certain which question should be being asked.
If the second question should be being asked, you cannot depend only on floor. Consider a double that is greater than 2.99999999999 but less than 3. It could be the result of a calculation whose exact value would be 3. Its floor is 2, and it is greater than its floor by almost 1. You would need to compare for being close to the result of std::round instead.
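If that second interpretation is the one you need, the test might look like this sketch (the tolerance eps is a placeholder you would have to choose, which is exactly the hard part):
#include <cmath>

// Treat j as "integral" if it lies within eps of the nearest integer.
bool nearly_integral(double j, double eps)
{
    return std::fabs(j - std::round(j)) < eps;
}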
I would say it is dangerous. One should always test for "equality" of floating point numbers by comparing the difference between the two numbers with an acceptably small number, e.g.:
#include <cmath>
...
if (std::fabs(std::floor(j) - j) < eps) {
...
... where eps is a number that is acceptably small for one's purpose. This approach is essential unless one can guarantee that the operations return exact results, which may be true for some cases (e.g. IEEE-754-compliant systems) but the C++ standard does not require that this be true. See, for instance Cross-Platform Issues With Floating-Point Arithmetics in C++.

Increasing float value [duplicate]

Possible Duplicate:
Floating point inaccuracy examples
I have the following line inside a WHILE loop, in C/C++:
while(...)
{
x = x + float(0.1); // x is a float type. Is the cast necessary?
}
x starts as 0. The thing is, after my first loop, x = 0.1. That's cool. After my second loop, x = 0.2. That's sweet. But, after my third loop, x = 0.3000001. That's not OK. I want it to have 0.3 as value, not 0.3000001. Can it be done? Am I looping wrongly?
Floating point does not work that way. There are infinitely many real numbers between any two real numbers, and only a finite number of bits; this means that in almost all cases the floating point representation is approximate. Read this link for more info.
It's not the loop, it's just how floats are represented in memory. You don't expect all real numbers to be directly representable in a limited number of bytes, right?
A 0.3 can't be exactly represented by a float. Try a double (not saying it will work, it probably won't, but the offset will be lower).
This is a common misconception with floating point numbers. 0.3 may not be exactly representable with 32-bit or 64-bit binary floating point. Lots of numbers are not exactly representable. Your loop is working fine, ignoring the unnecessary syntax.
while (...)
{
x += 0.1f; /* this will do just fine in C++ and C */
}
If this doesn't make sense consider the fact that there are an infinite number of floating point numbers...with only a finite number of bits to describe them.
Either way, if you need exact results you need to use a decimal type of the proper precision. Good news though, unless you're doing calculations on money you likely do not need exact results (even if you think you do).
Code such as this:
for (int i = 0;… ; ++i)
{
float x = i / 10.f;
…
}
will result in the value of x in each iteration being the float value that is closest to i/10. It will usually not be exact, since the exact value of i/10 is usually not representable in float.
For double, change the definition to:
double x = i / 10.;
This will result in a finer x, so it will usually be even closer to i/10. However, it will still usually not be exactly i/10.
If you need exactly i/10, you should explain your requirements further.
No, the cast is not necessary in this case:
float x;
x = x + float(0.1);
You can simply write
x += 0.1;

Is floating-point == ever OK?

Just today I came across third-party software we're using and in their sample code there was something along these lines:
// Defined in somewhere.h
static const double BAR = 3.14;
// Code elsewhere.cpp
void foo(double d)
{
if (d == BAR)
...
}
I'm aware of the problem with floating-points and their representation, but it made me wonder if there are cases where float == float would be fine? I'm not asking for when it could work, but when it makes sense and works.
Also, what about a call like foo(BAR)? Will this always compare equal as they both use the same static const BAR?
Yes, you are guaranteed that whole numbers, including 0.0, compare exactly with ==.
Of course you have to be a little careful with how you got the whole number in the first place; assignment is safe, but the result of any calculation is suspect.
P.S. There is a set of real numbers that do have a perfect representation as a float (think of 1/2, 1/4, 1/8, etc.), but you probably don't know in advance that you have one of these.
Just to clarify: it is guaranteed by IEEE 754 that float representations of integers (whole numbers) within range are exact.
float a=1.0;
float b=1.0;
a==b // true
But you have to be careful how you get the whole numbers
float a=1.0/3.0;
a*3.0 == 1.0 // not true !!
There are two ways to answer this question:
Are there cases where float == float gives the correct result?
Are there cases where float == float is acceptable coding?
The answer to (1) is: Yes, sometimes. But it's going to be fragile, which leads to the answer to (2): No. Don't do that. You're begging for bizarre bugs in the future.
As for a call of the form foo(BAR): In that particular case the comparison will return true, but when you are writing foo you don't know (and shouldn't depend on) how it is called. For example, calling foo(BAR) will be fine but foo(BAR * 2.0 / 2.0) (or even maybe foo(BAR * 1.0) depending on how much the compiler optimises things away) will break. You shouldn't be relying on the caller not performing any arithmetic!
Long story short, even though a == b will work in some cases you really shouldn't rely on it. Even if you can guarantee the calling semantics today maybe you won't be able to guarantee them next week so save yourself some pain and don't use ==.
To my mind, float == float is never* OK because it's pretty much unmaintainable.
*For small values of never.
The other answers explain quite well why using == for floating point numbers is dangerous. I just found one example that illustrates these dangers quite well, I believe.
On the x86 platform, you can get weird floating point results for some calculations, which are not due to rounding problems inherent to the calculations you perform. This simple C program will sometimes print "error":
#include <stdio.h>
void test(double x, double y)
{
const double y2 = x + 1.0;
if (y != y2)
printf("error\n");
}
int main()
{
const double x = .012;
const double y = x + 1.0;
test(x, y);
}
The program essentially just calculates
x = 0.012 + 1.0;
y = 0.012 + 1.0;
(only spread across two functions and with intermediate variables), but the comparison can still yield false!
The reason is that on the x86 platform, programs usually use the x87 FPU for floating point calculations. The x87 internally calculates with a higher precision than regular double, so double values need to be rounded when they are stored in memory. That means that a roundtrip x87 -> RAM -> x87 loses precision, and thus calculation results differ depending on whether intermediate results passed via RAM or whether they all stayed in FPU registers. This is of course a compiler decision, so the bug only manifests for certain compilers and optimization settings :-(.
For details see the GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
Rather scary...
Additional note:
Bugs of this kind will generally be quite tricky to debug, because the different values become the same once they hit RAM.
So if for example you extend the above program to actually print out the bit patterns of y and y2 right after comparing them, you will get the exact same value. To print the value, it has to be loaded into RAM to be passed to some print function like printf, and that will make the difference disappear...
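For what it's worth, one common way to make the two values agree again is to force the intermediate result out of the FPU registers; a sketch in the same style as the program above (the usual caveats about compilers and optimization settings still apply):
#include <stdio.h>

/* Forcing y2 through memory makes it take the same x87 -> RAM -> x87 round
   trip as the argument y, so both sides are rounded to 64-bit doubles before
   the comparison. In practice this hides the discrepancy. */
void test(double x, double y)
{
    volatile double y2 = x + 1.0;
    if (y != y2)
        printf("error\n");
}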
I'll provide a more-or-less real example of legitimate, meaningful and useful testing for float equality.
#include <stdio.h>
#include <math.h>
/* let's try to numerically solve a simple equation F(x)=0 */
double F(double x) {
return 2 * cos(x) - pow(1.2, x);
}
/* a well-known, simple & slow but extremely smart method to do this */
double bisection(double range_start, double range_end) {
double a = range_start;
double d = range_end - range_start;
int counter = 0;
while (a != a + d) // <-- WHOA!!
{
d /= 2.0;
if (F(a) * F(a + d) > 0) /* test for same sign */
a = a + d;
++counter;
}
printf("%d iterations done\n", counter);
return a;
}
int main() {
/* we must be sure that the root can be found in [0.0, 2.0] */
printf("F(0.0)=%.17f, F(2.0)=%.17f\n", F(0.0), F(2.0));
double x = bisection(0.0, 2.0);
printf("the root is near %.17f, F(%.17f)=%.17f\n", x, x, F(x));
}
I'd rather not explain the bisection method itself, but rather emphasize the stopping condition. It has exactly the discussed form: (a == a+d), where both sides are floats: a is our current approximation of the equation's root, and d is our current precision. Given the precondition of the algorithm - that there must be a root between range_start and range_end - we guarantee on every iteration that the root stays between a and a+d, while d is halved every step, shrinking the bounds.
And then, after a number of iterations, d becomes so small that during addition with a it gets rounded to zero! That is, a+d turns out to be closer to a than to any other float; and so the FPU rounds it to the closest representable value: to a itself. A calculation on a hypothetical machine can illustrate this; let it have a 4-digit decimal mantissa and some large exponent range. Then what result should the machine give to 2.131e+02 + 7.000e-3? The exact answer is 213.107, but our machine can't represent such a number; it has to round it. And 213.107 is much closer to 213.1 than to 213.2 - so the rounded result becomes 2.131e+02 - the little summand vanished, rounded away to zero. Exactly the same is guaranteed to happen at some iteration of our algorithm - and at that point we can't continue anymore. We have found the root to the maximum possible precision.
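The same vanishing act is easy to reproduce in ordinary double precision; a tiny sketch (the constants are only chosen so that consecutive doubles are 2 apart):
#include <assert.h>

int main()
{
    double a = 1.0e16;    /* near 1e16, neighbouring doubles differ by 2 */
    double d = 0.5;
    assert(a + d == a);   /* the summand is rounded away, just like in the stop test */
    return 0;
}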
Addendum
No, you can't just use "some small number" in the stopping condition. For any choice of the number, some inputs will deem your choice too large, causing loss of precision, and there will be inputs which will deem your choice too small, causing excess iterations or even entering an infinite loop. Imagine that our F can change - and suddenly the solutions can be both huge, 1.0042e+50, and tiny, 1.0098e-70. Detailed discussion follows.
Calculus has no notion of a "small number": for any real number, you can find infinitely many even smaller ones. The problem is, among those "even smaller" ones might be a root of our equation. Even worse, some equations will have distinct roots (e.g. 2.51e-8 and 1.38e-8) — both of which will get approximated by the same answer if our stopping condition looks like d < 1e-6. Whichever "small number" you choose, many roots which would've been found correctly to the maximum precision with a == a+d — will get spoiled by the "epsilon" being too large.
It's true however that floats' exponent has finite limited range, so one actually can find the smallest nonzero positive FP number; in IEEE 754 single precision, it's the 1e-45 denorm. But it's useless! while (d >= 1e-45) {…} will loop forever with single-precision (positive nonzero) d.
At the same time, any choice of the "small number" in a d < eps stopping condition will be too small for many equations. Where the root has a high enough exponent, the result of subtraction of two neighboring mantissas will easily exceed our "epsilon". For example, 7.00023e+8 - 7.00022e+8 = 0.00001e+8 = 1.00000e+3 = 1000 - meaning that the smallest possible difference between numbers with exponent +8 and a 6-digit mantissa is... 1000! It will never fit under, say, 1e-4. For numbers with a relatively high exponent we simply do not have enough precision to ever see a difference of 1e-4. This means eps = 1e-4 will be too small!
My implementation above took this last problem into account; you can see that d is halved each step, instead of getting recalculated as the difference of (possibly huge in exponent) a and b. For reals, it doesn't matter; for floats it does! The algorithm will get into infinite loops with (b-a) < eps on equations with huge enough roots. The previous paragraph shows why. d < eps won't get stuck, but even then needless iterations will be performed while shrinking d way down below the precision of a - still showing the choice of eps as too small. But a == a+d will stop exactly at precision.
Thus as shown: any choice of eps in while (d < eps) {…} will be both too large and too small, if we allow F to vary.
... This kind of reasoning may seem overly theoretical and needlessly deep, but it illustrates again the trickiness of floats. One should be aware of their finite precision when writing arithmetic around them.
Perfect for integral values even in floating point formats
But the short answer is: "No, don't use ==."
Ironically, the floating point format works "perfectly", i.e., with exact precision, when operating on integral values within the range of the format. This means that if you stick with double values, you get perfectly good integers with a little more than 50 bits, giving you about +- 4,500,000,000,000,000, or 4.5 quadrillion.
In fact, this is how JavaScript works internally, and it's why JavaScript can do things like + and - on really big numbers, but can only << and >> on 32-bit ones.
Strictly speaking, you can exactly compare sums and products of numbers with precise representations. Those would be all the integers, plus fractions composed of 1/2^n terms. So, a loop incrementing by n + 0.25, n + 0.50, or n + 0.75 would be fine, but not any of the other 96 decimal fractions with 2 digits.
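A quick sketch of that distinction (the loop counts are arbitrary):
#include <assert.h>
#include <stdio.h>

int main()
{
    /* Increments built from 1/2^n terms (here 1/4) accumulate exactly... */
    double a = 0.0;
    for (int i = 0; i < 100; ++i) a += 0.25;
    assert(a == 25.0);            /* every partial sum is exactly representable */

    /* ...while most two-digit decimal fractions drift. */
    double b = 0.0;
    for (int i = 0; i < 100; ++i) b += 0.01;
    printf("%.17g\n", b);         /* very close to 1, but typically not exactly 1 */
    return 0;
}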
So the answer is: while exact equality can in theory make sense in narrow cases, it is best avoided.
The only case where I ever use == (or !=) for floats is in the following:
if (x != x)
{
// Here x is guaranteed to be Not a Number
}
and I must admit I am guilty of using Not A Number as a magic floating point constant (using numeric_limits<double>::quiet_NaN() in C++).
There is no point in comparing floating point numbers for strict equality. Floating point numbers have been designed with predictable relative accuracy limits. You are responsible for knowing what precision to expect from them and your algorithms.
It's probably ok if you're never going to calculate the value before you compare it. If you are testing if a floating point number is exactly pi, or -1, or 1 and you know that's the limited values being passed in...
I also used it a few times when rewriting a few algorithms into multithreaded versions. I used a test that compared the results of the single- and multithreaded versions to be sure that both of them give exactly the same result.
Let's say you have a function that scales an array of floats by a constant factor:
void scale(float factor, float *vector, int extent) {
int i;
for (i = 0; i < extent; ++i) {
vector[i] *= factor;
}
}
I'll assume that your floating point implementation can represent 1.0 and 0.0 exactly, and that 0.0 is represented by all 0 bits.
If factor is exactly 1.0 then this function is a no-op, and you can return without doing any work. If factor is exactly 0.0 then this can be implemented with a call to memset, which will likely be faster than performing the floating point multiplications individually.
The reference implementation of BLAS functions at netlib uses such techniques extensively.
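Spelled out, the shortcut version might look like this sketch; it simply mirrors the reasoning above (and the stated assumption that 0.0f is all-zero bits), not any particular BLAS source:
#include <string.h>

void scale(float factor, float *vector, int extent) {
    int i;
    if (factor == 1.0f)
        return;                                              /* multiplying by 1.0 changes nothing */
    if (factor == 0.0f) {
        memset(vector, 0, (size_t) extent * sizeof *vector); /* bulk zero instead of multiplies */
        return;
    }
    for (i = 0; i < extent; ++i) {
        vector[i] *= factor;
    }
}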
In my opinion, comparing for equality (or some equivalence) is a requirement in most situations: standard C++ containers or algorithms with an implied equality comparison functor, like std::unordered_set for example, require that this comparator be an equivalence relation (see C++ named requirements: UnorderedAssociativeContainer).
Unfortunately, comparing with an epsilon as in abs(a - b) < epsilon does not yield an equivalence relation, since it loses transitivity. This most probably leads to undefined behavior: specifically, two "almost equal" floating point numbers could yield different hashes, which can put the unordered_set in an invalid state.
Personally, I would use == for floating points most of the time, unless any kind of FPU computation would be involved on any operands. With containers and container algorithms, where only read/writes are involved, == (or any equivalence relation) is the safest.
abs(a - b) < epsilon is more or less a convergence criteria similar to a limit. I find this relation useful if I need to verify that a mathematical identity holds between two computations (for example PV = nRT, or distance = time * speed).
In short, use == if and only if no floating point computation occurs;
never use abs(a-b) < e as an equality predicate.
Yes. 1/x will be valid unless x==0. You don't need an imprecise test here. 1/0.00000001 is perfectly fine. I can't think of any other case - you can't even check tan(x) for x==PI/2
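As a sketch of that kind of guard (the fallback value is just a placeholder a real caller would choose):
double safe_reciprocal(double x)
{
    // Exact comparison is the right test here: only x == 0.0 (which also
    // matches -0.0) actually divides by zero.
    if (x == 0.0)
        return 0.0;
    return 1.0 / x;
}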
The other posts show where it is appropriate. I think using bit-exact compares to avoid needless calculation is also okay.
Example:
#include <cmath>   // for NAN

float someFunction (float argument)
{
    // Cache state, made static here for the sketch; NAN compares unequal to
    // everything, so the very first call always recomputes.
    static float lastargument = NAN;
    static float cachedValue = 0.0f;

    // I really want bit-exact comparison here!
    if (argument != lastargument)
    {
        lastargument = argument;
        cachedValue = very_expensive_calculation (argument);
    }
    return cachedValue;
}
I would say that comparing floats for equality would be OK if a false-negative answer is acceptable.
Assume for example, that you have a program that prints out floating points values to the screen and that if the floating point value happens to be exactly equal to M_PI, then you would like it to print out "pi" instead. If the value happens to deviate a tiny bit from the exact double representation of M_PI, it will print out a double value instead, which is equally valid, but a little less readable to the user.
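That idea, as a sketch (the availability of M_PI from <cmath> is an assumption; it is common but not guaranteed by the C++ standard):
#include <cmath>
#include <cstdio>

void print_value(double v)
{
    if (v == M_PI)
        std::puts("pi");               // exact hit: show the nicer label
    else
        std::printf("%.17g\n", v);     // otherwise the digits, equally valid
}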
I have a drawing program that fundamentally uses a floating point for its coordinate system since the user is allowed to work at any granularity/zoom. The thing they are drawing contains lines that can be bent at points created by them. When they drag one point on top of another they're merged.
In order to do "proper" floating point comparison I'd have to come up with some range within which to consider the points the same. Since the user can zoom in to infinity and work within that range and since I couldn't get anyone to commit to some sort of range, we just use '==' to see if the points are the same. Occasionally there'll be an issue where points that are supposed to be exactly the same are off by .000000000001 or something (especially around 0,0) but usually it works just fine. It's supposed to be hard to merge points without the snap turned on anyway...or at least that's how the original version worked.
It throws off the testing group occasionally but that's their problem :p
So anyway, there's an example of a possibly reasonable time to use '=='. The thing to note is that the decision is less about technical accuracy than about client wishes (or lack thereof) and convenience. It's not something that needs to be all that accurate anyway. So what if two points won't merge when you expect them to? It's not the end of the world and won't affect 'calculations'.