Is it safe to use Int(round(x))? - casting

Say you have a double value, and want to round it to an integer...
Many round() functions return a double instead of an integer:
C# - round(double) -> double
C++ - round(double) -> double
Darwin - round(double) -> double
Swift - double.rounded() -> double
Java - round(double) -> long
Ruby: float.round() -> int
(This is most likely because doubles have a much wider range of possible values.)
This "default" behavior probably explains why you'll commonly see the following recommended:
Int(round(myDouble))
(Here we assume that Int() removes everything after the decimal: 4.9 -> 4.)
So far so good, until you realize how subtle floating point really is. You might worry, for example, that 55 is actually stored as 54.9999999999999999.
Because of this, it sounds like the following might happen:
Int(round(55.4)) // we ask it to round 55.4, expecting 55
Int(54.9999999999999) // it rounded it to "55.0"
54 // the Int() function removed all remaining digits
We were expecting 55.4 rounded to be 55, but it ended up evaluating to 54.
Can something like the above really happen if we use Int(round(x))?
If so, what should we use instead of Int(round())?
Related: Many languages define floor(double) -> double. Is Int(floor(double)) safe?

Floating point models are constructed on these foundations:
a base b
a significand with a limited number of digits in that base (p, the precision)
an exponent e for shifting the floating point, also limited to a certain range
a sign
So floating point values are built like this: (-1)^signBit * significand * b^e
The significand can be represented in a normalized form x.xxxxxxxx, with one nonzero digit left of the floating point (except for zero, or possibly values near zero that lose precision and gradually underflow), and p-1 digits after it.
But by shifting the exponent appropriately (to e+1-p), it can just as well be viewed as an integer with p digits, xxxxxxxxx.0.
With a reasonable range for the exponent, we see that every integer up to b^p can be represented exactly by such a floating point model. With the limited precision, only the last digits in base b are lost, so an integer too large to fit in the significand still necessarily has a zero fraction part. Thus there is no reason for round to return anything but an integral value (with a zero fraction part).
The only unsafe part, as you noted, is that the Int range might be much smaller than the range of floating point values. Converting a large floating point value to Int could therefore result in an overflow exception or, worse, a silent overflow with undefined behavior...
The conversion to Int is thus not necessary for the sake of eliminating the fraction part. It must be for other purposes (like feeding another part of the program that only accepts an Int).
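To make the overflow check concrete, here is a minimal C++ sketch (the helper name round_to_int is ours): round first, verify that the exactly integral result fits in int, and only then convert.
#include <cmath>
#include <cstdio>
#include <limits>
// Round a double and convert to int only when the result fits.
// std::round always returns an exactly integral double, so the
// truncating cast below cannot change the value.
bool round_to_int(double x, int& out) {
    double r = std::round(x);
    if (r < static_cast<double>(std::numeric_limits<int>::min()) ||
        r > static_cast<double>(std::numeric_limits<int>::max()))
        return false;  // out of range: a bare cast would be undefined behavior
    out = static_cast<int>(r);
    return true;
}
int main() {
    int v;
    if (round_to_int(55.4, v))
        std::printf("%d\n", v);  // prints 55, never 54
}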

Related

Truncate and casting to int of ceiled double

When performing a std::ceil on a double value, the value will round up to a whole number. So 3.3 will become 4.0, which can then be cast or truncated to an int, 'chopping off' the part after the decimal point. So:
int foo = (int)std::ceil(3.3);
So at first glance this will store 4 in foo. However, a double is a floating point value, so it might conceivably be 4.000000001 or 3.999999999, and the latter would truncate to 3.
But in practice I've never seen this behaviour occur. Can I safely assume that any implementation will return 4? Or is it only IEEE-754 that guarantees this? Or have I just been lucky?
Rounding (or ceil-ing) a double will always, always, always be exact.
For floating point numbers below 2^(m+1), where m is the number of mantissa bits, all integers have exact representations, so the result can be exactly represented.
For floating point numbers above 2^(m+1)... they're already integers. It makes sense if you think about it: there aren't enough mantissa bits to stretch down to the right of the decimal point. So again, rounding/ceil-ing is exact.
ceil in C++ behaves like it does in C, where the standard says
The ceil functions compute the smallest integer value not less than x.
The result of ceil is always the floating-point representation of an integer; however, the result may overflow an integral type when truncated.
In your particular case, std::ceil(3.3) must be exactly 4.0, since that's the "smallest integer value not less than" 3.3.
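As a trivial runnable sketch of this guarantee:
#include <cmath>
#include <cstdio>
int main() {
    // std::ceil(3.3) is exactly 4.0, so the truncating cast yields exactly 4.
    int foo = static_cast<int>(std::ceil(3.3));
    std::printf("%d\n", foo);  // prints 4 on any conforming implementation
}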

C++: I've just read that floats are inexact and do not store exact integer values. What does this mean?

I am thinking of this at a binary level.
Would a float of value 1 and an integer of value 1 not compile down to (omitting lots of zeros here)
0001
If they both compile down to this, then where does this inexactness come in?
The resource I'm using is http://www.cprogramming.com/tutorial/lesson1.html
Thanks.
It's possible. Floating point numbers are represented in an exponential notation (a*2^n), where some bits represent a (the significand), and some bits represent n (the exponent).
You can't uniquely represent all the integers in the range of a floating point value, due to the so-called pigeonhole principle. For example, 32-bit floats go up to over 10^38, but on 32 bits you can only represent 2^32 values - that means some integers will have the same representation.
Now, what happens when you try to, for example, do the following:
x = 10^38 - (10^38 - 1)
You should get 1, but you probably won't, because 10^38 and 10^38-1 are so close to each other that the computer has to represent them the same way. So, your 1.0f will usually be 1, but if this 1 is a result of calculation, it might not be.
To be precise: Integers can be exactly represented as floats if their binary representation does not use more bits than the float format supplies for the mantissa plus an implicit one bit.
IEEE floats have a mantissa of 23 bits, add one implicit bit, and you can store any integer representable with 24 bits in a float (that's integers up to 16777216). Likewise, a double has 52 mantissa bits, so it can store integers up to 9007199254740992.
Beyond that point, the IEEE format omits first the odd numbers, then all numbers not divisible by 4, and so on. So, even 0xffffff00ul is exactly representable as a float, but 0xffffff01ul is not.
So, yes, you can represent integers as floats, and as long as they don't become larger than the 16e6 or 9e15 limits, you can even expect additions between integers in float format to be exact.
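A quick sketch demonstrating the 16777216 (2^24) boundary for float:
#include <cstdio>
int main() {
    float f = 16777216.0f;  // 2^24: up to here every integer is exact
    float g = f + 1.0f;     // 16777217 is not representable; rounds back to 2^24
    std::printf("%s\n", f == g ? "equal" : "different");  // prints "equal"
}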
A float will store an int exactly if the int is less than a certain number, but if you have a large enough int, there won't be enough bits in the mantissa to store all the bits of the integer. The missing bits are then assumed to be zero. If the missing bits aren't zero, then your int won't be equal to your float.
Short answer: no, the floating point representation of integers is not that simple.
The representation adopted for the float type by the C language standard is called IEEE 754 single-precision and is probably more complicated than most people would like to delve into, but the link describes it thoroughly in case you're interested.
As for the representation of the integer 1: it is encoded in the 32-bit base-2 single-precision format defined by IEEE 754 as 0x3F800000.
Suppose the letters stand for bits, 0/1. Then a floating point number looks (schematically) like:
smmmmee
where s is the sign +/- and the number is .mmmm x 10 ^ ee
Now if you have two immediately adjacent numbers:
.mmm0 x 10 ^ ee
.mmm1 x 10 ^ ee
then for a large exponent ee the difference between them might be more than 1.
And of course, in base 2 a number like 1/5 (0.2) cannot be represented exactly. Summing such fractions will increase the error.
(Note this is not the exact representation.)
would a float of value 1 and an integer of value 1 not compile down to (omitting lots of zeros here) 0001
No, the float will be stored as something like 0x3F800000 (see the other answers), quite unlike 0x00000001.
What does this mean?
Some numbers cannot be precisely represented in binary form. 0.2 in binary form looks like 0.00110011001100110011..., which keeps going (and repeating) forever. No matter how many bits you use to store it, it will never be enough. That's because the denominator 5 is not a power of 2. The only way to represent it precisely is to store it as a ratio.
Floating point values also have limited precision. Roughly speaking, they only store a certain number of digits after the first significant nonzero digit, and the rest are lost. That results in errors: for example, with single precision floats, 100000000000000001 and 100000000000000002 are rounded to the same number.
Conclusion:
If you're writing financial software, do not use floats. Use bignums, via libraries like GMP.
Contrary to some modern dynamically typed programming languages such as JavaScript or Ruby that have a single basic numeric type, the C programming language has many. That is because C reflects the different ways to represent different kinds of numbers within a processor register.
To investigate different representations you can use the union construct where the same data can be viewed as different types.
Define
union {
    float x;
    int v;
} u;
Assign u.x = 1.0f and printf("0x%08x\n", u.v) to get the 32-bit representation of 1.0f as a floating point number. It returns 0x3f800000, not the 0x00000001 one might expect.
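Here is a complete, runnable version of that experiment as a small sketch; in C++, memcpy is the well-defined way to perform the same reinterpretation (the union trick is the classic C idiom):
#include <cstdio>
#include <cstring>
int main() {
    float x = 1.0f;
    unsigned int v;
    std::memcpy(&v, &x, sizeof v);  // copy the float's bit pattern into an int
    std::printf("0x%08x\n", v);     // prints 0x3f800000
}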
As mentioned in earlier answers, this reflects the representation of a floating point number as a 32-bit value:
1.0f = 0x3F800000
     = 0011 1111 1000 0000 0000 0000 0000 0000
     = 0 | 01111111 | 00000000000000000000000 (sign | exponent | mantissa)
Here the three parts are sign s=0, exponent e=127, and mantissa m=0, and the floating point value is computed as
value = (-1)^s * (1 + m * 2^-23) * 2^(e-127)
With this representation any integer from -16,777,216 to 16,777,216 can be represented exactly. This is ±2^24, since there are 23 bits in the mantissa plus one implicit bit. This range is not sufficient for many applications, therefore the float type cannot replace the int type.
The range of exact integer representation for the double type is wider, since a double occupies 64 bits and 52 bits are reserved for the mantissa (53 effective bits with the implicit one). It runs exactly from -9,007,199,254,740,992 to 9,007,199,254,740,992 (±2^53). Yet double requires twice as much memory.
Another source of difficulty is the way fractional numbers are represented. Since decimal fractions cannot, in general, be represented exactly (0.1f = 0x3dcccccd = 0.10000000149...), the use of floating point numbers breaks common algebraic identities. For instance, summing 0.1f ten times does not give exactly 1.0f:
0.1f + 0.1f + ... + 0.1f != 1.0f    (ten terms)
This can be confusing and lead to errors that are hard to detect. In general, strict equality should not be used with floating point numbers.
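A small sketch showing the broken identity in action:
#include <cstdio>
int main() {
    float sum = 0.0f;
    for (int i = 0; i < 10; ++i)
        sum += 0.1f;                   // every addition rounds
    std::printf("%.9g\n", sum);        // prints 1.00000012, not 1
    std::printf("%d\n", sum == 1.0f);  // prints 0
}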
Another example of floating point arithmetic's departure from algebraic correctness:
float x = 16777217.0f;  // 16777217 is not representable; x becomes 16777216.0f
float y = 16777215.0f;
x -= 1.0f;              // x == 16777215.0f
y += 1.0f;              // y == 16777216.0f
if (y > x) {printf("16777215.0 + 1.0 > 16777217.0 - 1.0\n");}
Yet another issue is the system's behaviour when the limits of exact representation are exceeded. When the result of an integer arithmetic operation falls outside the range of the type, this can be detected in many ways: a special OVERFLOW bit in the processor's flags register is set, and the result is significantly different from the expected value.
In floating point arithmetic, as the example above shows, the loss of precision occurs silently.
Hope this helps to understand why one needs many basic numeric types in C.

How to convert float to double (both stored in IEEE-754 representation) without losing precision?

I mean, for example, I have the following number encoded in IEEE-754 single precision:
"0100 0001 1011 1110 1100 1100 1100 1100" (approximately 23.85 in decimal)
The binary number above is stored in a literal string.
The question is: how can I convert this string into the IEEE-754 double precision representation (somewhat like the following one, though the value is not the same), WITHOUT losing precision?
"0100 0000 0011 0111 1101 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010"
which is the same number encoded in IEEE-754 double precision.
I have tried using the following formula to convert the first string back to a decimal number, but it loses precision.
num in decimal = (sign) * (1 + frac * 2^(-23)) * 2^(exp - 127)
I'm using Qt C++ Framework on Windows platform.
EDIT: I must apologize; maybe I didn't express the question clearly.
What I mean is that I don't know the true value 23.85; I only have the first string, and I want to convert it to the double precision representation without precision loss.
Well: keep the sign bit, rewrite the exponent (subtract the old bias, add the new bias), and pad the mantissa with zeros on the right...
(As @Mark says, you have to treat some special cases separately, namely when the biased exponent is either zero or max.)
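A minimal sketch of that recipe for normal numbers (zero, subnormals, infinities and NaNs need the separate treatment mentioned above); the function name is ours:
#include <cstdint>
#include <cstdio>
// Widen an IEEE-754 single-precision bit pattern to double precision:
// keep the sign, rebias the exponent (-127, +1023), pad the mantissa.
uint64_t float_bits_to_double_bits(uint32_t f) {
    uint64_t sign     = (uint64_t)(f >> 31) << 63;
    uint64_t exponent = (uint64_t)(((f >> 23) & 0xFF) - 127 + 1023) << 52;
    uint64_t mantissa = (uint64_t)(f & 0x7FFFFF) << (52 - 23);
    return sign | exponent | mantissa;
}
int main() {
    uint32_t f = 0x41BECCCC;  // the bit string from the question
    // Prints 0x4037d99980000000: the float's exact value as a double
    // (mantissa padded with zeros), not the double closest to 23.85.
    std::printf("0x%016llx\n", (unsigned long long)float_bits_to_double_bits(f));
}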
IEEE-754 (and binary floating point in general) cannot represent periodic binary fractions with full precision, not even when they are, in fact, rational numbers with a relatively small integer numerator and denominator. Some languages provide a rational type that can do it (these tend to be the languages that also support unbounded precision integers).
As a consequence those two numbers you posted are NOT the same number.
They in fact are:
10111.11011001100110011000000000000000000000000000000000000000 ...
10111.11011001100110011001100110011001100110011001101000000000 ...
where ... represent an infinite sequence of 0s.
Stephen Canon in a comment above gives you the corresponding decimal values (did not check them, but I have no reason to doubt he got them right).
Therefore the conversion you want cannot be done, as the single precision number does not carry the information you would need (you have NO WAY to know whether the number is in fact periodic or merely looks periodic because there happens to be a repetition).
First of all, +1 for identifying the input in binary.
Second, that number does not represent 23.85, but slightly less. If you flip its last binary digit from 0 to 1, the number will still not accurately represent 23.85, but slightly more. Those differences cannot be adequately captured in a float, but they can be approximately captured in a double.
Third, what you think you are losing is called accuracy, not precision. The precision of the number always grows by conversion from single precision to double precision, while the accuracy can never improve by a conversion (your inaccurate number remains inaccurate, but the additional precision makes it more obvious).
I recommend converting to a float or rounding or adding a very small value just before displaying (or logging) the number, because visual appearance is what you really lost by increasing the precision.
Resist the temptation to round right after the cast and to use the rounded value in subsequent computation - this is especially risky in loops. While it might appear to correct the issue in the debugger, the accumulated additional inaccuracies could distort the end result even more.
It might be easiest to convert the string into an actual float, convert that to a double, and convert it back to a string.
Binary floating points cannot, in general, represent decimal fraction values exactly. The conversion from a decimal fractional value to a binary floating point (see "Bellerophon" in "How to Read Floating-Point Numbers Accurately" by William D. Clinger) and from a binary floating point back to a decimal value (see "Dragon4" in "How to Print Floating-Point Numbers Accurately" by Guy L. Steele Jr. and Jon L. White) yield the expected results because one converts a decimal number to the closest representable binary floating point and the other controls the error so as to know which decimal value it came from (both algorithms are improved upon and made more practical in David Gay's dtoa.c). The algorithms are the basis for restoring std::numeric_limits<T>::digits10 decimal digits (except, potentially, trailing zeros) from a floating point value stored in type T.
Unfortunately, expanding a float to a double wreaks havoc on the value: formatting the new number will in many cases not yield the decimal original, because the float padded with zeros differs from the closest double Bellerophon would create and, thus, from what Dragon4 expects. There are basically two approaches which work reasonably well, however:
As someone suggested, convert the float to a string and that string into a double (see the sketch after the next approach). This isn't particularly efficient but can be proven to produce the correct results (assuming a correct implementation of the not entirely trivial algorithms, of course).
Assuming your value is in a reasonable range, you can multiply it by a power of 10 such that the least significant decimal digit is non-zero, convert this number to an integer, this integer to a double, and finally divide the resulting double by the original power of 10. I don't have a proof that this yields the correct number, but for the range of values I'm interested in, and which I want to store accurately in a float, this works.
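Here is a sketch of the first approach, using C++17's std::to_chars, which produces the shortest decimal string that round-trips to the given float:
#include <charconv>
#include <cstdio>
#include <cstdlib>
int main() {
    float f = 23.85f;
    char buf[32];
    // Shortest decimal that parses back to exactly f; here "23.85".
    auto res = std::to_chars(buf, buf + sizeof buf - 1, f);
    *res.ptr = '\0';
    double d = std::strtod(buf, nullptr);  // closest double to that decimal
    std::printf("%s -> %.17g\n", buf, d);  // 23.85 -> 23.850000000000001
}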
One reasonable approach to avoid this issue entirely is to use decimal floating point values, as described for C++ in the Decimal TR, in the first place. Unfortunately, these are not yet part of the standard, but I have submitted a proposal to the C++ standardization committee to get this changed.

At what point do doubles begin to lose precision?

My application needs to perform some operations on some numbers: >, <, ==, !=, +, -, ++, etc. (but without division). Those numbers are sometimes integers, and more rarely floats.
If I internally use the "double" type (as defined by IEEE 754) even for integers, up to what point can I safely use them as if they were ints, without running into strange rounding errors (for example, n == 5 and n == 6 both being true because they round to the same number)?
Obviously the second operand of the various operations (+, -, etc.) is always an integer, and I know that with 0.000[..]01 I'd have trouble from the start.
As a bonus answer, the same question but for float.
The number of bits in an IEEE-754 double mantissa is 52, and there's an extra implied bit that is always 1. This means the maximum value that can be contained exactly is 2^53, or 9007199254740992.
A float mantissa is 23 bits, again with an implied bit. The maximum integer that can be exactly represented is 2^24, or 16777216.
If your intent is to hold integer values only, there's usually a 64-bit integer type that would be more appropriate than a double.
Edit: originally I had 2^53-1 and 2^24-1, but I realized there's no need to subtract 1 - an even number can take advantage of an implied 0 bit to the right of the mantissa.
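A short sketch demonstrating the 2^53 boundary for double:
#include <cstdio>
int main() {
    double big = 9007199254740992.0;   // 2^53
    std::printf("%.0f\n", big + 1.0);  // 9007199254740992: 2^53 + 1 rounds back down
    std::printf("%.0f\n", big + 2.0);  // 9007199254740994: even values are still exact
}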
C#: do be aware that the range of the decimal type is smaller than that of a double. That is, double can hold larger values, but it does so by losing precision. Or, as stated on MSDN:
The decimal keyword denotes a 128-bit data type. Compared to floating-point types, the decimal type has a greater precision and a smaller range, which makes it suitable for financial and monetary calculations. The approximate range and precision for the decimal type are shown in the following table.
The primary difference between decimal and double is that decimal uses base 10 while double uses base 2. That means decimal can store a decimal value such as 0.1 exactly, while double can only approximate it. A decimal is 128 bits, so it takes twice the space of a double to store, and calculations on decimal are also slower (measure!).
If you need even larger precision, then BigInteger can be used from .NET 4 (you will need to handle the decimal point yourself). Be aware that BigInteger is immutable, so every arithmetic operation on it creates a new instance - if the numbers are large, this might be crippling for performance.
I suggest you look into exactly how much precision you need. Perhaps your algorithm can work with normalized values that can be smaller? If performance is an issue, one of the built-in floating point types is likely to be faster.

Some questions about floating points

I'm wondering whether a number represented one way in a floating point representation is going to be represented the same way in a representation that has a larger size.
That is, if a number has a particular representation as a float, will it have the same representation if that float is cast to a double, and then still the same when cast to a long double?
I'm wondering because I'm writing a BigInteger implementation, and any floating point number that is passed in I am sending to a function that accepts a long double to convert it. Which leads me to my next question. Obviously floating points do not always have exact representations, so in my BigInteger class what should I be attempting to represent when given a float? Is it reasonable to try to represent the same number as given by std::cout << std::fixed << someFloat;, even if that is not the same as the number passed in? Is that the most accurate representation I will be able to get? If so, ...
What's the best way to extract that value (in base some power of 10)? At the moment I'm just grabbing it as a string and passing it to my string constructor. This will work, but I can't help feeling there's a better way; certainly taking the remainder when dividing by my base is not accurate with floats.
Finally, I wonder if there is a floating point equivalent of uintmax_t, that is, a typename that will always be the largest floating point type on a system, or is there no point because long double will always be the largest (even if it's the same as a double)?
Thanks, T.
If by "same representation" you mean "exactly the same binary representation in memory except for padding", then no. Double-precision has more bits of both exponent and mantissa, and also has a different exponent bias. But every single-precision value is exactly representable in double-precision (denormalised floats simply become normal doubles).
I'm not sure what you mean when you say "floating points do not always have exact representations". Certainly, not all decimal floating-point values have exact binary floating-point values (and vice versa), but I'm not sure that's a problem here. So long as your floating-point input has no fractional part, then a suitably large "BigInteger" format should be able to represent it exactly.
Conversion via a base-10 representation is not the way to go. In theory, all you need is a bit-array of length ~1024, initialise it all to zero, and then shift the mantissa bits in by the exponent value. But without knowing more about your implementation, there's not a lot more I can suggest!
double includes all values of float; long double includes all values of double. So you're not losing any value information by conversion to long double. However, you're losing information about the original type, which is relevant (see below).
In order to follow common C++ semantics, conversion of a floating point value to integer should truncate the value, not round.
The main problem is with large values that are not exact. You can use the frexp function to find the base 2 exponent of the floating point value. You can use std::numeric_limits<T>::digits to check if that's within the integer range that can be exactly represented.
My personal design choice would be an assert that the fp value is within the range that can be exactly represented, i.e. a restriction on the range of any actual argument.
To do that properly you need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument's type.
When you have an fp value that is within the allowed range, you can use floor and fmod to extract digits in any numeral system you want.
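A minimal sketch of that digit extraction (the function name is ours), assuming the value is non-negative, integral, and within the exactly representable range:
#include <cassert>
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>
// Extract base-10 digits, least significant first, from an integral double.
std::vector<int> digits(double x) {
    assert(x >= 0.0 && std::floor(x) == x);
    assert(x <= std::ldexp(1.0, std::numeric_limits<double>::digits));  // <= 2^53
    std::vector<int> out;
    do {
        out.push_back(static_cast<int>(std::fmod(x, 10.0)));  // exact remainder
        x = std::floor(x / 10.0);
    } while (x > 0.0);
    return out;
}
int main() {
    // Prints the digits of 2^53 in reverse: 2990474529917009
    for (int d : digits(9007199254740992.0))
        std::printf("%d", d);
    std::printf("\n");
}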
Yes, going from IEEE float to double to extended you can carry the bits from the smaller format over to the larger format. For example:
single
S EEEEEEEE MMMMMMM.....
double
S EEEEEEEEEEE MMMMM....
6.5 single
0 10000001 101000...
6.5 double
0 10000000001 101000...
13 single
0 10000010 101000...
13 double
0 10000000010 101000...
The mantissa is left-justified, then padded with zeros on the right.
The exponent is right-justified; the extra high bits are filled with the inverse of the msbit, and the msbit itself is copied over.
Take an exponent of -2, for example: -2 minus 1 is -3, and -3 in two's complement is 0xFD, or 0b11111101, but the exponent bits in the single format are 0b01111101, the msbit inverted. For double, a -2 exponent gives -2 - 1 = -3, or 0b1111...1101, which becomes 0b0111...1101, again the msbit inverted. (Exponent bits = twos_complement(exponent - 1) with the msbit inverted.)
Likewise, for an exponent of 3: 3 - 1 = 2, or 0b000...010; invert the upper bit to get 0b100...010.
So yes, you can take the bits from a single precision number and copy them to the proper locations in a double precision number. I don't have an extended float reference handy, but I'm pretty sure it works the same way.