Yesterday I faced a problem that I could not explain. I was writing a very simple function that converts a double number to long. It was a part of another program. I used the following code:
long converter(double x) {
return (long) x;
}
It worked perfectly until I entered 1.9 as the input and the result was 1. But 1.9 is closer to 2 than it is to 1. So, the result should be 2 rather than 1.
Why is this problem happening? Can you give me a solution that solves this irritating problem?
Casting it to long won't round it as you expect. It will just discard the fractional part and will give you integer part . Therefore , you get 1 not 2.
You can refer std::lround here .
Casting to long does not round a number correctly. To do that, you have to use std::lround(). Just use the following code:
long converter(double x) {
return std::lround(x);
}
This should get the job done. Also, remember to include <cmath> in your code. Otherwise, it might not work.
P.S. std::lround() may not be compatible with older compilers. So, remember to use the latest version of your compiler.
You can use like
long converter(double x){
if(x>0)
return (long)(x+0.5);
else
return (long)(x-0.5);
}
As casting discard fractional part you can not get what you expect. Adding '0.5' take the number to next long if fraction greater than '0.5'.
Type casting from an floating point type to integral type will (if it will fit) do so by just discarding the fractional part. That is 1.9 converts to 1 and -1.9 converts to -1.
The reason this was done is due to the way floating point values are stored (if there were an ambiguity here they would probably have chosen this conversion to be implementation defined to some degree). The number is contains of three parts: the sign, the mantissa and the exponent. These are binary, but the decimal equivalent would be +1.9e+0 where + is the sign, 1.9 is the mantissa and +0 is the exponent. The conversion of the mantissa+exponent to an integer becomes quite easy and fast if you just take the mantissa and shifts it to be an integer. After that you apply the sign. So the operation becomes to mask out the mantissa, and exponent and then doing a shift and last negate if the sign bit is set.
Had they chosen to actually do a proper rounding one would just have to do a little more job. The standard settled for the somewhat easier solution for defining the cast.
Related
I'm having trouble with rounding floats. I'm solving a task where you need to round your result to two decimal points. But I can't do it when the third decimal point is 5 because it's stored incorrectly.
For example: My result is equal to 1.005 and that should be rounded to 1.01. But C++ rounds it to 1.00 because the original float is stored as 1.0049999... and not 1.005.
I've already tried always adding a very small float to the result but there are some other test cases which are then rounded up but should be rounded down.
I know how floating-point works and that it is often not completely accurate. I'm just wondering whether anyone has found a way around this specific problem.
When you say "my result is equal to 1.005", you are assuming some count of true decimal digits. This can be 1.005 (three digits of fractional part), 1.0050 (four digits), 1.005000, and so on.
So, you should first round, using some usual rounding, to that count of digits. It is simpler to do this in integers: for example, with 6 fractional digits, it means some usual round(), rint(), etc. after multiplication by 1,000,000. With this step, you are getting exact decimal number. After this, you are able to make the required final rounding to what you need.
In your example, this will round 1,004,999.99... to 1,005,000. Then, divide by 10000 and round again.
(Notice that there are suggestions to make this rounding in yet specific way. The General Decimal Arithmetic specification and IBM arithmetic manuals suggest this rounding is done in the way that exact fractional part 0.5 shall be rounded away from zero unless least significant result bit becomes 0 or 5, in that case it is rounded toward zero. But, if you have no such rounding available, a general away-from-zero is also suitable.)
If you are implementing arithmetic for money accounting, it is reasonable to avoid floating point at all and use fixed-point arithmetic (emulated with integers, if needed). This is better because you the methods I've described for rounding are inevitably containing conversion to integers (and back), so, it's cheaper to use such integers directly. You will get inexact operation checking as well (by cost of explicit integer overflow).
If you can use a library like boost with its Multiprecision support.
Another option would be to use a long double, maybe that's precise enough for you.
At first sight, this seems trivial, but the usual (radix 2 <-> radix 10) FP<->ASCII conversions cannot always be done without introducing errors. Granted, these are small, but what options exist to make the conversions to and from ASCII perfect, that is, what are the possibilities of making the conversions, without introducing any error at all? I was thinking about base64 encoding, or bit-encoding (e.g. something like 11110101010...), both of these would preserve the radix.
EDIT: Since I can't answer myself, here's what I had in mind:
double d{.1};
auto const s(::std::to_string(*reinterpret_cast<::std::uint64_t*>(&d)));
::std::uint64_t n(::std::stoull(s));
auto const e(*reinterpret_cast<double*>(&n));
assert(d == e);
What do you mean exactly by "without introducing errors"? If it
is for the machine to reread later, 17 digits precision
guarantees round trip: the actual value in the text will not be
the exact value of the double, but it will be closer to the
original double value than to any other double value, so
reconversion to double will result in the initial value. If you
have access to C++11, you can also set the format to output the
value in hex:
std::cout.setf( std::ios_base::fixed | std::ios_base::scientific,
std::ios_base::floatfield );
In this case, the output should be exact, regardless of the
precision.
If it is for humans to read, and know the exact value, there is
nothing in the standard library which will guarantee this. In
theory, outputting 53 digits should suffice, but the neither the
C++ standard nor the IEEE standard require the implementation to
guard against rounding errors in the conversion routine at this
precision, and some implementations just append a sufficiently
large number of '0' after the 19th or 20th digit, rather than
waste runtime calculating incorrect values.
I think the question you are asking is how to round-trip a floating point double value via an ASCII (string) representation. I agree, for this purpose printing the number in fixed or floating point decimal notation is completely unsuitable.
If you don't care what the string looks like then the simple solution is to just treat the 8 byte double as two integers. Two hex integers will occupy 16 character positions. With practice you can even read one of these and estimate the value.
The same thing in Base-64 just reduces the number of character positions (to 11/12). The number formatted this way is quite unreadable.
There are other ways, but why bother? These should suffice.
I'm writing an algorithm, to round a floating number. The input will be a 64bit IEEE754 double type number, very close to X.5, where X is a integer less than 32. The first solution came into my mind is to use a bit mask, to mask off those least significant bits as they represent very small fractions of 2^-n.(Given the exponent is not large).
But the problem is should I do that? Is there any other ways to accomplish the same thing? I feel using bit operation on float point is very controversy. Thanks!
The langugage I'm using is C++ by the way.
Edit:
Thanks guys, for your comments. I appreciate! Let's say I have a float number, can be 1.4999999... or 21.50000012.... I want to round it to 1.5 or 21.5. My goal is to round any number to its nearest to X.5 form, since it can be stored in a IEEE754 float point number.
If your compiler guarantees that you are using IEEE 754 floating-point, I would recommend that you round according to the method delineated in this blog post: add, and then immediately subtract a large constant so as to send the value in the binade of floating-point numbers where the ULP is 0.5. You won't find any faster method, and it does not involve any bit manipulation.
The appropriate constant to round a number between 0 and 32 to the nearest halt-unit for IEEE 754 double-precision is 2251799813685248.0.
Summary: use x = x + 2251799813685248.0 - 2251799813685248.0;.
You can use any of the functions round(), floor(), ceil(), rint(), nearbyint(), and trunc(). All do rounding in different modes, and all are standard C99. The only thing you need to do is to link against the standard math library by specifying -lm as a compiler flag.
As to trying to achieve rounding by bit manipulations, I would stay away from that: a) it will be much slower than using the functions above (they generally use hardware facilities where possible), b) it is reinventing the wheel with a lot of potential for bugs, and c) the newer C standards don't like you doing bit manipulations on floating point types: they use the so called strict aliasing rules that disallow you to just cast a double* to an uint64_t*. You would either need to do your bit manipulation by casting to a unsigned char* and manipulating the IEEE number byte by byte, or you would have to use memcpy() to copy the bit representation from a double variable into an uint64_t and back again. A lot of hassle for something already available in the form of standardized functions and hardware support.
You want to round x to the nearest value of the form d.5. For a generan number you write:
round(x+0.5)-0.5
For a number close to d.5, less than 0.25 away, you can use Pascal's offering:
round(2*x)*0.5
If you're looking for a bit trick and are guaranteed to have doubles in the ranges you describe, then you could do something like this (inline as you see fit):
void RoundNearestHalf(double &d) {
unsigned const maskshift = ((*(unsigned __int64*)&d >> 52) - 1023);
unsigned __int64 const setmask = 0x0008000000000000 >> maskshift;
unsigned __int64 const clearmask = ~0x0007FFFFFFFFFFFF >> maskshift;
*(unsigned __int64*)&d |= setmask;
*(unsigned __int64*)&d &= clearmask;
}
maskshift is the unbiased exponent. For the input range, we know this will be non-negative and no more than 4 (the trick will work for higher values too, but no more than 51). We use this value to make a setmask which sets the 2^-1 (one-half) place in the mantissa, and clearmask which clears all bits in the mantissa of lower value than 2^-1. The result is d rounded to the nearest half.
Note that it would be worth profiling this against other implementations, perhaps using the standard library to determine whether or not its actually faster.
I can't speak about C++ for sure, but in C99 the use of IEEE 754 standard for floating point will be purely normative (not required). In C99 if the __STDC_IEC_559__ macro is set then it declares that IEC 559 (which is more or less IEEE 754) is used for floating point.
I think it should be pointed out that there are functions to handle many types of rounding for you.
The way I understand it is: when subtracting two double numbers with double precision in c++ they are first transformed to a significand starting with one times 2 to the power of the exponent. Then one can get an error if the subtracted numbers have the same exponent and many of the same digits in the significand, leading to loss of precision. To test this for my code I wrote the following safe addition function:
double Sadd(double d1, double d2, int& report, double prec) {
int exp1, exp2;
double man1=frexp(d1, &exp1), man2=frexp(d2, &exp2);
if(d1*d2<0) {
if(exp1==exp2) {
if(abs(man1+man2)<prec) {
cout << "Floating point error" << endl;
report=0;
}
}
}
return d1+d2;
}
However, testing this I notice something strange: it seems that the actual error (not whether the function reports error but the actual one resulting from the computation) seems to depend on the absolute values of the subtracted numbers and not just the number of equal digits in the significand...
For examples, using 1e-11 as the precision prec and subtracting the following numbers:
1) 9.8989898989898-9.8989898989897: The function reports error and I get the highly incorrect value 9.9475983006414e-14
2) 98989898989898-98989898989897: The function reports error but I get the correct value 1
Obviously I have misunderstood something. Any ideas?
If you subtract two floating-point values that are nearly equal, the result will mostly reflect noise in the low bits. Nearly equal here is more than just same exponent and almost the same digits. For example, 1.0001 and 1.0000 are nearly equal, and subtracting them could be caught by a test like this. But 1.0000 and 0.9999 differ by exactly the same amount, and would not be caught by a test like this.
Further, this is not a safe addition function. Rather, it's a post-hoc check for a design/coding error. If you're subtracting two values that are so close together that noise matters you've made a mistake. Fix the mistake. I'm not objecting to using something like this as a debugging aid, but please call it something that implies that that's what it is, rather than suggesting that there's something inherently dangerous about floating-point addition. Further, putting the check inside the addition function seems excessive: an assert that the two values won't cause problems, followed by a plain old floating-point addition, would probably be better. After all, most of the additions in your code won't lead to problems, and you'd better know where the problem spots are; put asserts in the problems spots.
+1 to Pete Becker's answer.
Note that the problem of degenerated result might also occur with exp1!=exp2
For example, if you subtract
1.0-0.99999999999999
So,
bool degenerated =
(epx1==exp2 && abs(d1+d2)<prec)
|| (epx1==exp2-1 && abs(d1+2*d2)<prec)
|| (epx1==exp2+1 && abs(2*d1+d2)<prec);
You can omit the check for d1*d2<0, or keep it to avoid the whole test otherwise...
If you want to also handle loss of precision with degenerated denormalized floats, that'll be a bit more involved (it's as if the significand had less bits).
It's quite easy to prove that for IEEE 754 floating-point arithmetic, if x/2 <= y <= 2x then calculating x - y is an exact operation and will give the exact result correctly without any rounding error.
And if the result of an addition or subtraction is a denormalised number, then the result is always exact.
In the tutorial I'm reading for OGRE3d here the programmer is constantly adding f at the end of any variable he initializes, like 200.00f or 0.00f so I decided to erase f and see if it compiles and it compiles just fine, what is the point of adding f at the end of the variable?
EDIT: So you're saying if I initialize a variable with 200.03 it won't initialize it as a floating point but if I were to do so with 200.03f it would? If not where does the f become useful then?
It's a way to specify that number has to be interpreted as a "float", not a "double" (which is the standard for C++ decimal numbers and uses up twice the memory).
This discussion could be of help:
http://www.cplusplus.com/forum/beginner/24483/
Quoted from http://msdn.microsoft.com/en-us/library/w9bk1wcy.aspx
A floating-point constant without an f, F, l, or L suffix has type
double. If the letter f or F is the suffix, the constant has type
float. If suffixed by the letter l or L, it has type long double. For
example:
200.00f is not a variable. It can't vary.
It's a compile-time constant, with float representation. The f signifies that it's a float.
By comparison, 200.00 would be interpreted as a double.
The C standard states that constant floats are doubles which promotes the operation to a double.
float a,b,c;
...
a = b+7.1; this is a double precision operation
...
a = b+7.1f; this is a single precision operation
...
c = 7.1; //double
a = b + c; //single all the way
The double precision requires more storage for the constant, plus a conversion from single to double for the variable operand, then a conversion from double to single to assign the result. With all the conversions going on if you are not in tune with how floating point works, rounding and such you might not get the result you were thinking you were going to get. The compiler may at some point in the path optimize some of this behavior out, making it either harder to understand the real problems and the fpu in the hardware might accept mixed mode operands, also hiding what is really going on.
It is not just a speed problem but also accuracy. There was a recent SO question, pretty much the same problem, why does this comparison work with one number and not another. Take the fraction 5/11ths for example 0.454545.... Lets say, hypothetically, you had base 10 fpu with single precision of 3 significant digits and a double of 6 significant digits.
float a = 0.45454545454;
...
if(a>0.4545454545) b=1;
...
well in our hypothetical system we can only store three digits into a, so a = .455 because we are using by default a round up rounding mode. but our comparision will be considered double because we didnt put the f at the end of the number. the double version is 0.454545. a is converted to a double which results in 0.455000, so:
if(0.455000>0.454545) b = 1;
0.455 is greater than 0.454545 so b would be a 1.
float a = 0.45454545454;
...
if(a>0.4545454545f) b=1;
...
so now the comparison is single precision so we are comparing 0.455 to 0.455 which is not greater, so b=1 does not happen.
When you write floating point constants that is base 10 decimal, the floating point numbers in the computer are base 2 and they dont always convert smoothly just like 5/11 would work just fine in base 11 but in base 10 you get an infinite repeating digit. 0.1 in decimal for example creates a repeating pattern in binary. Depending on where the mantissa cuts off the rounding can make that lsbit of the mantissa round up or not (also depends on the rounding mode you are using if the floating point format you are using even has rounding). Which of itself creates problems depending on how you use the variable as the comparison above shows.
For non-floating point the compiler usually saves you, but sometimes doesnt:
unsigned long long a;
...
a = ~3;
a = ~(3ULL);
...
Depending on the compiler and computer the two assignments can give you different results one MIGHT give you 0x00000000FFFFFFFC another MIGHT give 0xFFFFFFFFFFFFFFFC.
If you want something specific you should be quite clear when you tell the compiler what you want otherwise the compiler takes a guess and doesnt always make the guess that you wanted.
It means that the value is to be interpreted as a single-precision floating point variable (type float). Without the f-suffix, it is interpreted as a double-precision floationg point variable (type double).
This is usually done to shut up compiler warnings about possible loss of precision by assigning a double value to a float variable. When you didn't receive such a warning you maybe have switched off warnings in your compiler settings (which is bad!).
But it can also have subtile syntactical meaning. As you know C++ allows functions which have the same name but differ by the types of their parameters. In that case the f suffix can determine which function is called.