strange double to int conversion behavior in c++

strange double to int conversion behavior in c++ - c++

The following program shows the weird double to int conversion behavior I'm seeing in c++:
#include <stdlib.h>
#include <stdio.h>
int main() {
double d = 33222.221;
printf("d = %9.9g\n",d);
d *= 1000;
int i = (int)d;
printf("d = %9.9g | i = %d\n",d,i);
return 0;
}
When I compile and run the program, I see:
g++ test.cpp
./a.out
d = 33222.221
d = 33222221 | i = 33222220
Why is i not equal to 33222221?
The compiler version is GCC 4.3.0

Floating point representation is almost never precise (only in special cases). Every programmer should read this: What Every Computer Scientist Should Know About Floating-Point Arithmetic
In short - your number is probably 33222220.99999999999999999999999999999999999999999999999999999999999999998 (or something like that), which becomes 33222220 after truncation.

When you attach a debugger and inspect the values, you will see that the value of d is actually 33222220.999999996, which is correctly truncated to 33222220 when converted to integer.
There is a finite amount of numbers that can be stored in a double variable, and 33222221 is not one of them.

Due to floating point approximation, 33222.221 may actually be 33222.220999999999999. Multiplied by 1000 yields 33222220.999999999999. Casting to integer ignores all decimals (round down) for a final result of 33222220.

If you change the "9.9g" in your printf() calls to 17.17 to recover all possible digits of precision with a 64-bit IEEE 754 FP number, you get 33222220.999999996 for the double value. The int conversion then makes sense.

I don't want to repeat the explanations of the other comments.
So, here is just an advice to avoid problems like the one described:
Avoid floating point arithemtics in the first place whereever possible (especially when computation is involved).
If floating point arithmetics is really necessary, you must not compare numbers by operator== by all means! Use your own comparison function instead (or use one supplied by some library), which does something like an "is almost equal" comparison using some kind of epsilon compare (either absolute or relative to the number's magniture).
See for example the excellent article
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
by Bruce Dawson instead!
Stefan

Related

Is hardcoding least significant byte of a double a good rounding strategy?

I have a function doing some mathematical computation and returning a double. It ends up with different results under Windows and Android due to std::exp implementation beging different (Why do I get platform-specific result for std::exp?). The e-17 rounding difference gets propagated and in the end it's not just a rounding difference that I get (results can change 2.36 to 2.47 in the end). As I compare the result to some expected values, I want this function to return the same result on all platform.
So I need to round my result. The simpliest solution to do this is apparently (as far as I could find on the web) to do std::ceil(d*std::pow<double>(10,precision))/std::pow<double>(10,precision). However, I feel like this could still end up with different results depending on the platform (and moreover, it's hard to decide what precision should be).
I was wondering if hard-coding the least significant byte of the double could be a good rounding strategy.
This quick test seems to show that "yes":
#include <iostream>
#include <iomanip>
double roundByCast( double d )
{
double rounded = d;
unsigned char* temp = (unsigned char*) &rounded;
// changing least significant byte to be always the same
temp[0] = 128;
return rounded;
}
void showRoundInfo( double d, double rounded )
{
double diff = std::abs(d-rounded);
std::cout << "cast: " << d << " rounded to " << rounded << " (diff=" << diff << ")" << std::endl;
}
void roundIt( double d )
{
showRoundInfo( d, roundByCast(d) );
}
int main( int argc, char* argv[] )
{
roundIt( 7.87234042553191493141184764681 );
roundIt( 0.000000000000000000000184764681 );
roundIt( 78723404.2553191493141184764681 );
}
This outputs:
cast: 7.87234 rounded to 7.87234 (diff=2.66454e-14)
cast: 1.84765e-22 rounded to 1.84765e-22 (diff=9.87415e-37)
cast: 7.87234e+07 rounded to 7.87234e+07 (diff=4.47035e-07)
My question is:
Is unsigned char* temp = (unsigned char*) &rounded safe or is there an undefined behaviour here, and why?
If there is no UB (or if there is a better way to do this without UB), is such a round function safe and accurate for all input?
Note: I know floating point numbers are inaccurate. Please don't mark as duplicate of Is floating point math broken? or Why Are Floating Point Numbers Inaccurate?. I understand why results are different, I'm just looking for a way to make them be identical on all targetted platforms.
Edit, I may reformulate my question as people are asking why I have different values and why I want them to be the same.
Let's say you get a double from a computation that could end up with a different value due to platform specific implementations (like std::exp). If you want to fix those different double to end up having the exact same memory representation (1) on all platforms, and you want to loose the fewest precision as possible, then, is fixing the least significant byte a good approach? (because I feel that rounding to an arbitrary given precision is likely to loose more information than this trick).
(1) By "same representation", I mean that if you transform it to a std::bitset, you want to see the same bits sequence for all platform.

No, rounding is not a strategy for removing small errors, or guaranteeing agreement with calculations performed with errors.
For any slicing of the number line into ranges, you will successfully eliminate most slight deviations (by placing them in the same bucket and clamping to the same value), but you greatly increase the deviation if your original pair of values straddle a boundary.
In your particular case of hardcoding the least significant byte, the very near values
0x1.mmmmmmm100
and
0x1.mmmmmmm0ff
have a deviation of only one ULP... but after your rounding, they differ by 256 ULP. Oops!

Is unsigned char* temp = (unsigned char*) &rounded safe or is there an undefined behaviour here, and why?
It is well defined, as aliasing through unsigned char is allowed.
is such a round function safe and accurate for all input?
No. You cannot perfectly fix this problem with truncating/rounding. Consider, that one implementation gives 0x.....0ff, and the other 0x.....100. Setting the lsb to 0x00 will make the original 1 ulp difference to 256 ulps.
No rounding algorithm can fix this.
You have two options:
don't use floating point, use some other way (for example, fixed point)
embed a floating point library into your application, which only uses basic floating point arithmetic (+, -, *, /, sqrt), and don't use -ffast-math, or any equivalent option. This way, if you're on a IEEE-754 compatible platform, floating point results should be the same, as IEEE-754 mandates that basic operations should be calculated "perfectly". It means as if the operation calculated at infinite precision, and then rounded to the resulting representation.
Btw, if an input 1e-17 difference means a huge output difference, then your problem/algorithm is ill-conditioned, which generally should be avoided, as it usually doesn't give you meaningful results.

What you are doing is totally, totally misguided.
Your problem is not that you are getting different results (2.36 vs. 2.47). Your problem is that at least one of these results, and likely both, have massive errors. Your Windows and Android results are not just different, they are WRONG. (At least one of them, and you have no idea which one).
Find out why you get these massive errors and change your algorithms to not increase tiny rounding errors massively. Or you have a problem that is inherently chaotic, in which case the difference between results is actually very useful information.
What you are trying just makes the rounding errors 256 times bigger, and if two different results end in ....1ff and ....200 hexadecimal, then you change these to ....180 and ....280, so even the difference between slightly different numbers can grow by a factor 256.
And on a bigendian machine your code will just go kaboom!!!

Your function won't work because of aliasing.
double roundByCast( double d )
{
double rounded = d;
unsigned char* temp = (unsigned char*) &rounded;
// changing least significant byte to be always the same
temp[0] = 128;
return rounded;
}
Casting to unsigned char* for temp is allowed, because char* casts are the exception to the aliasing rules. That's necessary for functions like read, write, memcpy, etc, so that they can copy values to and from byte representations.
However, you aren't allowed to write to temp[0] and then assume that rounded changed. You must create a new double variable (on the stack is fine) and memcpy temp back to it.

Difference in behaviour of pow from math.h for same input [duplicate]

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main()
{
int n,i,ele;
n=5;
ele=pow(n,2);
printf("%d",ele);
return 0;
}
The output is 24.
I'm using GNU/GCC in Code::Blocks.
What is happening?
I know the pow function returns a double , but 25 fits an int type so why does this code print a 24 instead of a 25? If n=4; n=6; n=3; n=2; the code works, but with the five it doesn't.

Here is what may be happening here. You should be able to confirm this by looking at your compiler's implementation of the pow function:
Assuming you have the correct #include's, (all the previous answers and comments about this are correct -- don't take the #include files for granted), the prototype for the standard pow function is this:
double pow(double, double);
and you're calling pow like this:
pow(5,2);
The pow function goes through an algorithm (probably using logarithms), thus uses floating point functions and values to compute the power value.
The pow function does not go through a naive "multiply the value of x a total of n times", since it has to also compute pow using fractional exponents, and you can't compute fractional powers that way.
So more than likely, the computation of pow using the parameters 5 and 2 resulted in a slight rounding error. When you assigned to an int, you truncated the fractional value, thus yielding 24.
If you are using integers, you might as well write your own "intpow" or similar function that simply multiplies the value the requisite number of times. The benefits of this are:
You won't get into the situation where you may get subtle rounding errors using pow.
Your intpow function will more than likely run faster than an equivalent call to pow.

You want int result from a function meant for doubles.
You should perhaps use
ele=(int)(0.5 + pow(n,2));
/* ^ ^ */
/* casting and rounding */

Floating-point arithmetic is not exact.
Although small values can be added and subtracted exactly, the pow() function normally works by multiplying logarithms, so even if the inputs are both exact, the result is not. Assigning to int always truncates, so if the inexactness is negative, you'll get 24 rather than 25.
The moral of this story is to use integer operations on integers, and be suspicious of <math.h> functions when the actual arguments are to be promoted or truncated. It's unfortunate that GCC doesn't warn unless you add -Wfloat-conversion (it's not in -Wall -Wextra, probably because there are many cases where such conversion is anticipated and wanted).
For integer powers, it's always safer and faster to use multiplication (division if negative) rather than pow() - reserve the latter for where it's needed! Do be aware of the risk of overflow, though.

When you use pow with variables, its result is double. Assigning to an int truncates it.
So you can avoid this error by assigning result of pow to double or float variable.
So basically
It translates to exp(log(x) * y) which will produce a result that isn't precisely the same as x^y - just a near approximation as a floating point value,. So for example 5^2 will become 24.9999996 or 25.00002

Using scientific notation in for loops

I've recently come across some code which has a loop of the form
for (int i = 0; i < 1e7; i++){
}
I question the wisdom of doing this since 1e7 is a floating point type, and will cause i to be promoted when evaluating the stopping condition. Should this be of cause for concern?

The elephant in the room here is that the range of an int could be as small as -32767 to +32767, and the behaviour on assigning a larger value than this to such an int is undefined.
But, as for your main point, indeed it should concern you as it is a very bad habit. Things could go wrong as yes, 1e7 is a floating point double type.
The fact that i will be converted to a floating point due to type promotion rules is somewhat moot: the real damage is done if there is unexpected truncation of the apparent integral literal. By the way of a "proof by example", consider first the loop
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 18446744073709551615ULL; ){
std::cout << i << "\n";
}
This outputs every consecutive value of i in the range, as you'd expect. Note that std::numeric_limits<std::uint64_t>::max() is 18446744073709551615ULL, which is 1 less than the 64th power of 2. (Here I'm using a slide-like "operator" ++< which is useful when working with unsigned types. Many folk consider --> and ++< as obfuscating but in scientific programming they are common, particularly -->.)
Now on my machine, a double is an IEEE754 64 bit floating point. (Such as scheme is particularly good at representing powers of 2 exactly - IEEE754 can represent powers of 2 up to 1022 exactly.) So 18,446,744,073,709,551,616 (the 64th power of 2) can be represented exactly as a double. The nearest representable number before that is 18,446,744,073,709,550,592 (which is 1024 less).
So now let's write the loop as
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 1.8446744073709551615e19; ){
std::cout << i << "\n";
}
On my machine that will only output one value of i: 18,446,744,073,709,550,592 (the number that we've already seen). This proves that 1.8446744073709551615e19 is a floating point type. If the compiler was allowed to treat the literal as an integral type then the output of the two loops would be equivalent.

It will work, assuming that your int is at least 32 bits.
However, if you really want to use exponential notation, you should better define an integer constant outside the loop and use proper casting, like this:
const int MAX_INDEX = static_cast<int>(1.0e7);
...
for (int i = 0; i < MAX_INDEX; i++) {
...
}
Considering this, I'd say it is much better to write
const int MAX_INDEX = 10000000;
or if you can use C++14
const int MAX_INDEX = 10'000'000;

1e7 is a literal of type double, and usually double is 64-bit IEEE 754 format with a 52-bit mantissa. Roughly every tenth power of 2 corresponds to a third power of 10, so double should be able to represent integers up to at least 105*3 = 1015, exactly. And if int is 32-bit then int has roughly 103*3 = 109 as max value (asking Google search it says that "2**31 - 1" = 2 147 483 647, i.e. twice the rough estimate).
So, in practice it's safe on current desktop systems and larger.
But C++ allows int to be just 16 bits, and on e.g. an embedded system with that small int, one would have Undefined Behavior.

If the intention to loop for a exact integer number of iterations, for example if iterating over exactly all the elements in an array then comparing against a floating point value is maybe not such a good idea, solely for accuracy reasons; since the implicit cast of an integer to float will truncate integers toward zero there's no real danger of out-of-bounds access, it will just abort the loop short.
Now the question is: When do these effects actually kick in? Will your program experience them? The floating point representation usually used these days is IEEE 754. As long as the exponent is 0 a floating point value is essentially an integer. C double precision floats 52 bits for the mantissa, which gives you integer precision to a value of up to 2^52, which is in the order of about 1e15. Without specifying with a suffix f that you want a floating point literal to be interpreted single precision the literal will be double precision and the implicit conversion will target that as well. So as long as your loop end condition is less 2^52 it will work reliably!
Now one question you have to think about on the x86 architecture is efficiency. The very first 80x87 FPUs came in a different package, and later a different chip and as aresult getting values into the FPU registers is a bit awkward on the x86 assembly level. Depending on what your intentions are it might make the difference in runtime for a realtime application; but that's premature optimization.
TL;DR: Is it safe to to? Most certainly yes. Will it cause trouble? It could cause numerical problems. Could it invoke undefined behavior? Depends on how you use the loop end condition, but if i is used to index an array and for some reason the array length ended up in a floating point variable always truncating toward zero it's not going to cause a logical problem. Is it a smart thing to do? Depends on the application.

Why pow() return 999... in C++ [duplicate]

While running the following lines of code:
int i,a;
for(i=0;i<=4;i++)
{
a=pow(10,i);
printf("%d\t",a);
}
I was surprised to see the output, it comes out to be 1 10 99 1000 9999 instead of 1 10 100 1000 10000.
What could be the possible reason?
Note
If you think it's a floating point inaccuracy that in the above for loop when i = 2, the values stored in variable a is 99.
But if you write instead
a=pow(10,2);
now the value of a comes out to be 100. How is that possible?

You have set a to be an int. pow() generates a floating point number, that in SOME cases may be just a hair less than 100 or 10000 (as we see here.)
Then you stuff that into the integer, which TRUNCATES to an integer. So you lose that fractional part. Oops. If you really needed an integer result, round may be a better way to do that operation.
Be careful even there, as for large enough powers, the error may actually be large enough to still cause a failure, giving you something you don't expect. Remember that floating point numbers only carry so much precision.

The function pow() returns a double. You're assigning it to variable a, of type int. Doing that doesn't "round off" the floating point value, it truncates it. So pow() is returning something like 99.99999... for 10^2, and then you're just throwing away the .9999... part. Better to say a = round(pow(10, i)).

This is to do with floating point inaccuracy. Although you are passing in ints they are being implicitly converted to a floating point type since the pow function is only defined for floating point parameters.

Mathematically, the integer power of an integer is an integer.
In a good quality pow() routine this specific calculation should NOT produce any round-off errors. I ran your code on Eclipse/Microsoft C and got the following output:
1 10 100 1000 10000
This test does NOT indicate if Microsoft is using floats and rounding or if they are detecting the type of your numbers and choosing the appropriate method.
So, I ran the following code:
#include <stdio.h>
#include <math.h>
main ()
{
double i,a;
for(i=0.0; i <= 4.0 ;i++)
{
a=pow(10,i);
printf("%lf\t",a);
}
}
And got the following output:
1.000000 10.000000 100.000000 1000.000000 10000.000000

No one spelt out how to actually do it correctly - instead of pow function, just have a variable that tracks the current power:
int i, a, power;
for (i = 0, a = 1; i <= 4; i++, a *= 10) {
printf("%d\t",a);
}
This continuing multiplication by ten is guaranteed to give you the correct answer, and quite OK (and much better than pow, even if it were giving the correct results) for tasks like converting decimal strings into integers.

Rounding in C++ and round-tripping numbers

I have a class that internally represents some quantity in fixed point as 32-bit integer with somewhat arbitrary denominator (it is neither power of 2 nor power of 10).
For communicating with other applications the quantity is converted to plain old double on output and back on input. As code inside the class it looks like:
int32_t quantity;
double GetValue() { return double(quantity) / DENOMINATOR; }
void SetValue(double x) { quantity = x * DENOMINATOR; }
Now I need to ensure that if I output some value as double and read it back, I will always get the same value back. I.e. that
x.SetValue(x.GetValue());
will never change x.quantity (x is arbitrary instance of the class containing the above code).
The double representation has more digits of precision, so it should be possible. But it will almost certainly not be the case with the simplistic code above.
What rounding do I need to use and
How can I find the critical would-be corner cases to test that the rounding is indeed correct?

Any 32 bits will be represented exactly when you convert to a double, but when you divide then multiply by an arbitrary value you will get a similar value but not exactly the same. You should lose at most one bit per operations, which means your double will be almost the same, prior to casting back to an int.
However, since int casts are truncations, you will get the wrong result when very minor errors turn 2.000 into 1.999, thus what you need to do is a simple rounding task prior to casting back.
You can use std::lround() for this if you have C++11, else you can write you own rounding function.
You probably don't care about fairness much here, so the common int(doubleVal+0.5) will work for positives. If as seems likely, you have negatives, try this:
int round(double d) { return d<0?d-0.5:d+0.5; }

The problem you describe is the same problem which exists with converting between binary and decimal representation just with different bases. At least it exists if you want to have the double representation to be a good approximation of the original value (otherwise you could just multiply the 32 bit value you have with your fixed denominator and store the result in a double).
Assuming you want the double representation be a good approximation of your actual value the conversions are nontrivial! The conversion from your internal representation to double can be done using Dragon4 ("How to print floating point numbers accurately", Steele & White) or Grisu ("How to print floating point numbers quickly and accurately", Loitsch; I'm not sure if this algorithm is independent from the base, though). The reverse can be done using Bellerophon ("How to read floating point numbers accurately", Clinger). These algorithms aren't entirely trivial, though...

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

strange double to int conversion behavior in c++ - c++

When you attach a debugger and inspect the values, you will see that the value of d is actually 33222220.999999996, which is correctly truncated to 33222220 when converted to integer. There is a finite amount of numbers that can be stored in a double variable, and 33222221 is not one of them.

Due to floating point approximation, 33222.221 may actually be 33222.220999999999999. Multiplied by 1000 yields 33222220.999999999999. Casting to integer ignores all decimals (round down) for a final result of 33222220.

If you change the "9.9g" in your printf() calls to 17.17 to recover all possible digits of precision with a 64-bit IEEE 754 FP number, you get 33222220.999999996 for the double value. The int conversion then makes sense.

Related

Is hardcoding least significant byte of a double a good rounding strategy?

Difference in behaviour of pow from math.h for same input [duplicate]

Using scientific notation in for loops

Why pow() return 999... in C++ [duplicate]

Rounding in C++ and round-tripping numbers

Categories

Resources