Casting Double to Long is giving wrong value [duplicate] - c++

I'm currently learning inter-type data convertion in cpp. I have been taught that
For a really large int, we can (for some computers) suffer a loss of
precision when converting to double.
But no reason was provided for the statement.
Could someone please provide an explanation and an example? Thanks

Let's say that the floating point number uses N bits of storage.
Now, let us assume that this float can precisely represent all integers that can be represented by an integer type of N bits. Since the N bit integer requires all of its N bits to represent all of its values, so would be the requirement for this float.
A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, and we must conclude that the assumption that float can precisely represent all integers as equally sized integer type must be erroneous.
Since there must be non-representable integers in the range of a N bit integer, it is possible that converting such integer to a floating point of N bits will lose precision, if the converted value happens to be one of the non-representable ones.
Now, since a floating point can represent a subset of rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point can represent all integers up to 253. This property is directly associated with the length of the mantissa.
Therefore it is not possible to lose precision of a 32 bit integer when converting to a double on a system which conforms to IEEE-754.
More technically, the floating point unit of x86 architecture actually uses a 80-bit extended floating point format, which is designed to be able to represent precisely all of 64 bit integers and can be accessed using the long double type.

This may happen if int is 64 bit and double is 64 bit as well. Floating point numbers are composed of mantissa (represents the digits) and exponent. As mantissa for the double in such a case has less bits than the int, then double is able to represent less digits and a loss of precision happens.

Related

converting really large int to double, loss of precision on some computer

I'm currently learning inter-type data convertion in cpp. I have been taught that
For a really large int, we can (for some computers) suffer a loss of
precision when converting to double.
But no reason was provided for the statement.
Could someone please provide an explanation and an example? Thanks
Let's say that the floating point number uses N bits of storage.
Now, let us assume that this float can precisely represent all integers that can be represented by an integer type of N bits. Since the N bit integer requires all of its N bits to represent all of its values, so would be the requirement for this float.
A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, and we must conclude that the assumption that float can precisely represent all integers as equally sized integer type must be erroneous.
Since there must be non-representable integers in the range of a N bit integer, it is possible that converting such integer to a floating point of N bits will lose precision, if the converted value happens to be one of the non-representable ones.
Now, since a floating point can represent a subset of rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point can represent all integers up to 253. This property is directly associated with the length of the mantissa.
Therefore it is not possible to lose precision of a 32 bit integer when converting to a double on a system which conforms to IEEE-754.
More technically, the floating point unit of x86 architecture actually uses a 80-bit extended floating point format, which is designed to be able to represent precisely all of 64 bit integers and can be accessed using the long double type.
This may happen if int is 64 bit and double is 64 bit as well. Floating point numbers are composed of mantissa (represents the digits) and exponent. As mantissa for the double in such a case has less bits than the int, then double is able to represent less digits and a loss of precision happens.

C++ I've just read that floats are inexact and do not store exact integer values. What does this mean?

I am thinking of this at a binary level.
would a float of value 1 and an integer of value 1 not compile down to (omitting lots of zeros here)
0001
If they do both compile down to this then where does this inexactness come in.
Resource I'm using is http://www.cprogramming.com/tutorial/lesson1.html
Thanks.
It's possible. Floating point numbers are represented in an exponential notation (a*2^n), where some bits represent a (the significand), and some bits represent n (the exponent).
You can't uniquely represent all the integers in the range of a floating point value, due to the so-called pigeonhole principle. For example, 32-bit floats go up to over 10^38, but on 32 bits you can only represent 2^32 values - that means some integers will have the same representation.
Now, what happens when you try to, for example, do the following:
x = 10^38 - (10^38 - 1)
You should get 1, but you probably won't, because 10^38 and 10^38-1 are so close to each other that the computer has to represent them the same way. So, your 1.0f will usually be 1, but if this 1 is a result of calculation, it might not be.
Here are some examples.
To be precise: Integers can be exactly represented as floats if their binary representation does not use more bits than the float format supplies for the mantissa plus an implicit one bit.
IEEE floats have a mantissa of 23 bits, add one implicit bit, and you can store any integer representable with 24 bits in a float (that's integers up to 16777216). Likewise, a double has 52 mantissa bits, so it can store integers up to 9007199254740992.
Beyond that point, the IEEE format omits first the odd numbers, then all numbers not divisible by 4, and so on. So, even 0xffffff00ul is exactly representable as a float, but 0xffffff01ul is not.
So, yes, you can represent integers as floats, and as long as they don't become larger than the 16e6 or 9e15 limits, you can even expect additions between integers in float format to be exact.
A float will store an int exactly if the int is less than a certain number, but if you have a large enough int, there won't be enough bits in the mantissa to store all the bits of the integer. The missing bits are then assumed to be zero. If the missing bits aren't zero, then your int won't be equal to your float.
Short answer: no, the floating point representation of integers is not that simple.
The representation adopted for the float type by the C language standard is called IEEE 754 single-precision and is probably more complicated than most people would like to delve into, but the link describes it thoroughly in case you're interested.
As for the representation of the integer 1: we can see how it's encoded in the 32-bit base-2 single-precision format defined by IEEE 754 here - 3f80 0000.
Suppose letters stand for a bit, 0/1. Then a floating point number looks (schematically) like:
smmmmee
where s is the sign +/- and the number is .mmmm x 10 ^ ee
Now if you have two immediately following numbers:
.mmm0 x 10 ^ ee
.mmm1 x 10 ^ ee
Then for large exponent ee the difference might be more then 1.
And of course in base 2 a number like 1/5, 0.2, cannot represented exact. Summing fractions wil increase the error.
(Note this is not the exact representation.)
would a float of value 1 and an integer of value 1 not compile down to (omitting lots of zeros here) 0001
No, float will be stored like something similar to 0x00000803f, depending on precision.
What does this mean?
Some numbers cannot be precisely represented in binary form. O.2 in binary form will look like 0.00110011001100110011... which will keep going on(and repeating) forever. No matter how many bits you use to store it, it will be never enough. That's because 5 is not divisible by 2. The only way to precisely represent it is to use ratios to store it.
floating points have limited precision. Roughly speaking, they only store certain amount of digits after first significant non-zero digit, and the rest will be lost. That'll result in errors, for example, with single precision floats 100000000000000001 and 100000000000000002 are most likely rounded off to the same number.
You might also want to read something like this.
Conclusion:
If you're writing financial software, do not use floats. Use Bignums, using libraries like gmp
Contrary to some modern dynamically typed programming languages such as JavaScript or Ruby that have a single basic numeric type, the C programming language has many. That is because C reflects the different way to represent different kinds of numbers within a processor register.
To investigate different representations you can use the union construct where the same data can be viewed as different types.
Define
union {
float x;
int v;
} u;
Assign u.x = 1.0f and printf("0x%08x\n",u.v) to get the 32-bit representation of 1.0f as a floating point number. It should return 0x3f800000 and not 0x00000001 as one might expect.
As mentioned in earlier answers this reflects the representation of a floating number as a 32-bit value as `
1.0f = 0x3F800000 = 0011.1111.1000.0000.0000.0000.0000.0000 =
0 0111.1111 000.0000.0000.0000.0000.0000 = 0 0x7F 0
Here the three parts are sign s=0, exponent e=127, and mantissa m=0 and the floating point value is computed as
value = s * (1 + m * 2^-23) * 2^(e-127)
With this representation any integer number from -16,777,215 to 16,777,215 can be represented exactly. This is the value of (2^24 - 1) since there are only 23 bits in the mantissa. This range is not sufficient for many applications, therefore the float type cannot replace the int type.
The range of exact representation of integers by the double type is wider since the value occupies 64 bits and there are 53 bits reserved for the mantissa. It is exactly from
-9,007,199,254,740,991 to 9,007,199,254,740,991. Yet double requires twice as much memory.
Another source of difficulty is the way fractional numbers are represented. Since decimal fractions cannot be represented exactly (0.1f = 0x3dcccccd = 0.10000000149...) the use of floating point numbers breaks common algebraic identities.
0.1f * 10 != 1.0f
This can be confusing and lead to errors that are hard to detect. In general strict equality should not be used with floating point numbers.
Another example of floating point arithmetic depature from algebraic correctness:
float x = 16777217.0f;
float y = 16777215.0f;
x -= 1.0f;
y += 1.0f;
if (y > x) {printf("16777215.0 + 1.0 > 16777217.0 - 1.0\n");}
Yet another issue is the behaviour of the system when the limits of exact representation are broken. When in integer arithmetic the result of an arithmetic operation is greater than the range of the type, this can be detected in many ways: a special OVERFLOW bit in the processor flags register is flipped, and the result is significantly different from the expected.
In floating point arithmetic as the example above shows, the loss of precision occurs silently.
Hope this helps to understand why one needs many basic numeric types in C.

At what point do doubles begin to lose precision?

My application needs to perform some operations: >, <, ==, !=, +, -, ++, etc. (but without division) on some numbers. Those numbers are sometimes integer, and more rarely floats.
If I use internally the "double" type (as defined by IEEE 754) even for integers, up until what point can I be safe to use them as if they were ints, without running in strange rounding errors (for example, n == 5 && n == 6 are both true because they round to the same number)?
Obviously the second input of the various operations (+, -, etc.) is always an integer and I know that with 0.000[..]01 I'll have troubles since the start.
As a bonus answer, the same question but for float.
The number of bits in a IEEE-754 double mantissa is 52, and there's an extra implied bit that is always 1. This means the maximum value that can be contained exactly is 2^53, or 9007199254740992.
A float mantissa is 23 bits, again with an implied bit. The maximum integer that can be exactly represented is 2^24, or 16777216.
If your intent is to hold integer values only, there's usually a 64-bit integer type that would be more appropriate than a double.
Edit: originally I had 2^53-1 and 2^24-1, but I realized there's no need to subtract 1 - an even number can take advantage of an implied 0 bit to the right of the mantissa.
C# Refer to:
However, do be aware that the range of the decimal type is smaller than a double. That is double can hold a larger value, but it does so by losing precision. Or, as stated on MSDN:
The decimal keyword denotes a 128-bit
data type. Compared to floating-point
types, the decimal type has a greater
precision and a smaller range, which
makes it suitable for financial and
monetary calculations. The approximate
range and precision for the decimal
type are shown in the following table.
The primary difference between decimal and double is that decimal is fixed-point and double is floating point. That means that decimal stores an exact value, while double represents a value represented by a fraction, and is less precise. A decimalis 128 bits, so it takes the double space to store. Calculations on decimal is also slower (measure !).
If you need even larger precision, then BigInteger can be used from .NET 4. (You will need to handle decimal points yourself). Here you should be aware, that BigInteger is immutable, so any arithmetic operation on it will create a new instance - if numbers are large, this might be cribbling for performance.
I suggest you look into exactly how much precision you need. Perhaps your algorithm can work with normalized values, that can be smaller ? If performance is an issue, one of the built in floating point types are likely to be faster.

Some questions about floating points

I'm wondering if a number is represented one way in a floating point representation, is it going to be represented in the same way in a representation that has a larger size.
That is, if a number has a particular representation as a float, will it have the same representation if that float is cast to a double and then still the same when cast to a long double.
I'm wondering because I'm writing a BigInteger implementation and any floating point number that is passed in I am sending to a function that accepts a long double to convert it. Which leads me to my next question. Obviously floating points do not always have exact representations, so in my BigInteger class what should I be attempting to represent when given a float. Is it reasonable to try and represent the same number as given by std::cout << std::fixed << someFloat; even if that is not the same as the number passed in. Is that the most accurate representation I will be able to get? If so, ...
What's the best way to extract that value (in base some power of 10), at the moment I'm just grabbing it as a string and passing it to my string constructor. This will work, but I can't help but feel theres a better way, but certainly taking the remainder when dividing by my base is not accurate with floats.
Finally, I wonder if there is a floating point equivalent of uintmax_t, that is a typename that will always be the largest floating point type on a system, or is there no point because long double will always be the largest (even if it 's the same as a double).
Thanks, T.
If by "same representation" you mean "exactly the same binary representation in memory except for padding", then no. Double-precision has more bits of both exponent and mantissa, and also has a different exponent bias. But I believe that any single-precision value is exactly representable in double-precision (except possibly denormalised values).
I'm not sure what you mean when you say "floating points do not always have exact representations". Certainly, not all decimal floating-point values have exact binary floating-point values (and vice versa), but I'm not sure that's a problem here. So long as your floating-point input has no fractional part, then a suitably large "BigInteger" format should be able to represent it exactly.
Conversion via a base-10 representation is not the way to go. In theory, all you need is a bit-array of length ~1024, initialise it all to zero, and then shift the mantissa bits in by the exponent value. But without knowing more about your implementation, there's not a lot more I can suggest!
double includes all values of float; long double includes all values of double. So you're not losing any value information by conversion to long double. However, you're losing information about the original type, which is relevant (see below).
In order to follow common C++ semantics, conversion of a floating point value to integer should truncate the value, not round.
The main problem is with large values that are not exact. You can use the frexp function to find the base 2 exponent of the floating point value. You can use std::numeric_limits<T>::digits to check if that's within the integer range that can be exactly represented.
My personal design choice would be an assert that the fp value is within the range that can be exactly represented, i.e. a restriction on the range of any actual argument.
To do that properly you need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument's type.
When you have an fp value that is within the allowed range, you can use floor and fmod to extract digits in any numeral system you want.
yes, going from IEEE float to double to extended you will see bits from the smaller format to the larger format, for example
single
S EEEEEEEE MMMMMMM.....
double
S EEEEEEEEEEEE MMMMM....
6.5 single
0 10000001 101000...
6.5 double
0 10000000001 101000...
13 single
0 10000010 101000...
13 double
0 10000000010 101000...
The mantissa you will left justify and then add zeros.
The exponent is right justified, sign extend the next to msbit then copy the msbit.
An exponent of -2 for example. take -2 subtract 1 which is -3. -3 in twos complement is 0xFD or 0b11111101 but the exponent bits in the format are 0b01111101, the msbit inverted. And for double a -2 exponent -2-1 = -3. or 0b1111...1101 and that becomes 0b0111...1101, the msbit inverted. (exponent bits = twos_complement(exponent-1) with the msbit inverted).
As we see above an exponent of 3 3-1 = 2 0b000...010 invert the upper bit 0b100...010
So yes you can take the bits from single precision and copy them to the proper locations in the double precision number. I dont have an extended float reference handy but pretty sure it works the same way.

Long Integer and Float

If a Long Integer and a float both take 4 bytes to store in memory then why are their ranges different?
Integers are stored like this:
1 bit for the sign (+/-)
31 bits for the value.
Floats are stored differently, giving greater range at the expense of accuracy:
1 bit for the sign (+/-)
N bits for the mantissa S
M bits for the exponent E
Float is represented in the exponential form: (+/-)S*(base)^E
BTW, "long" isn't always 32 bits. See this article.
Different way to encode your numbers.
Long counts up from 1 to 2^(4*8).
Float uses only 23 of the 32 bits for the "counting". But it adds "range" with an exponent in the other bits. So you have bigger numbers, but they are less accurate (in the lower based parts):
1.2424 * 10^54 (mantisse * exponent) is certainly bigger than 2^32. But you can discern a long 2^31 from a long 2^31-1 whereas you can't discern a float 1.24 * 10^54 and a float 1.24 * 10^54 - 1: the 1 just is lost in this representation as float.
They are not always the same size. But even when they are, their ranges are different because they serve different purposes. One is for integers with no decimal places, and one is for decimals.
This can be explained in terms of why a floating point representation can represent a larger range of numbers than a fixed point representation. This text from the Wikipedia entry:
The advantage of floating-point
representation over fixed-point (and
integer) representation is that it can
support a much wider range of values.
For example, a fixed-point
representation that has seven decimal
digits, with the decimal point assumed
to be positioned after the fifth
digit, can represent the numbers
12345.67, 8765.43, 123.00, and so on, whereas a floating-point
representation (such as the IEEE 754
decimal32 format) with seven decimal
digits could in addition represent
1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The
floating-point format needs slightly
more storage (to encode the position
of the radix point), so when stored in
the same space, floating-point numbers
achieve their greater range at the
expense of precision.
Indeed a float takes 4 bytes (32bits), but since it's a float you have to store different things in these 32 bits:
1 bit is used for the sign (+/-)
8 bits are used for the exponent
23 bits are used for the significand (the significant digits)
You can see that the range of a float directly depends on the number of bits allocated to the significand, and the min/max values depend on the numbre of bits allocated for the exponent.
With the upper example:
8 bits for the exponent (= size of a char) gives an exponent range [-128,127]
--> max is about 127*log10(2) = 10^38
Regarding a long integer, you've got 1 bit used for the sign and then 31 bits to represent the integer value leading to a max of 2 147 483 647.
You can have a look at Wikipedia for more precise info:
Wikipedia - Floating point
Their ranges are different because they use different ways of representing numbers.
long (in c) is equivalent to long int. The size of this type varies between processors and compilers, but is often, as you say, 32 bits. At 32 bits, it can represent 232 different values. Since we often want to use negative numbers, computers normally represent integers using a format called "two's complement". This way, we can represent numbers from (-231) and up to (231-1). Counting the number 0, this adds up to 232 numbers.
float (in c) is usually a single presicion IEEE 754 formatted number. At 32 bits, this data type can also take 232 different bit patterns, but they are not used to directly represent whole numbers, like in the long. Instead, they represent a sign, and the mantisse and exponent of a normalized decimal number.
In general: when you have more range of values (float has up to 10^many), you have less precision.
This is what happens here. If you need integers, 32-bit long will give you more.
In a handwavey high level, floating point sacrefices integer precision to extend its range. This is done by combining a base value with a scaling factor. For large values, a float will not be able to precisely represent all integers but for small values it will represent better than integer precision.
No, the size of primitive data types in C is Implementation Defined.
This wiki entry clearly states: The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.