I have a COleCurrency object that represents a unit price, and a double value that represents a quantity. I need to calculate the total dollar amount to the nearest penny.
It looks like COleCurrency has built-in multiplication operators, but only for multiplication by a long value.
I could multiply COleCurrency.m_cur.int64 by the double, but that converts the double to __int64, so it wouldn't be accurate.
What is the best way to accurately multiply a COleCurrency by a double?
Finite binary floating point values form a proper subset of finite decimal values. While any given floating point value has an exact, finite representation in decimal, the opposite isn't true: not every decimal can be represented by a finite binary floating point value. A simple example is 0.1, which produces an infinite sequence of binary digits when converted to binary floating point.
The important point here is that if you are dealing with fractional values, using binary floating point values to represent them will in general introduce inaccuracies (with very few exceptions, such as 0.5). The only way to perform accurate multiplications with an integer value is to use an integer as the multiplicand.
Since you have opted to use a floating point value, the only thing you can do is limit the inaccuracies. The proposed solution:
__int64 x = currency.m_cur.int64 * (__int64)dbl;
suffers from the same "possible loss of data" issue the compiler warned about. Using an explicit cast merely silences the compiler; the effect is still the same: the floating point value gets truncated.
A better approach is to convert the 64-bit integer value to a double first. This produces an exact floating point representation of the integer value, provided that it is within the exactly representable range (about ±2^53, i.e. roughly ±9e15). You can then multiply by the other floating point value (which is subject to rounding errors), and finally round the result using, e.g., std::llround:
__int64 x = std::llround(static_cast<double>(currency.m_cur.int64) * dbl);
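Putting it together, here is a minimal sketch of the whole computation, assuming an MFC build. It relies on the fact that CURRENCY stores the amount scaled by 10,000 (four implied decimal places), so rounding to the nearest penny means rounding the scaled product to the nearest multiple of 100; the function name is mine, not part of MFC:

#include <afxdisp.h>  // COleCurrency (MFC)
#include <cmath>

// Multiply a unit price by a quantity and round to the nearest penny.
// CURRENCY holds the amount in 1/10,000 units, so one cent is 100 units.
COleCurrency MultiplyToNearestPenny(const COleCurrency& unitPrice, double quantity)
{
    double scaled = static_cast<double>(unitPrice.m_cur.int64) * quantity;
    __int64 cents = std::llround(scaled / 100.0);  // round to whole cents
    COleCurrency total;
    total.m_cur.int64 = cents * 100;               // back to 1/10,000 units
    return total;
}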
I'm currently learning type conversion in C++. I have been taught that
For a really large int, we can (for some computers) suffer a loss of
precision when converting to double.
But no reason was provided for the statement.
Could someone please provide an explanation and an example? Thanks
Let's say that the floating point number uses N bits of storage.
Now, let us assume that this float can precisely represent every integer that an N-bit integer type can represent. Since the N-bit integer needs all of its N bits to represent all of its values, the float would need all N of its bits for the same purpose.
A floating point number should also be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional part. This is a contradiction, and we must conclude that the assumption that a float can precisely represent all integers of an equally sized integer type is erroneous.
Since there must be non-representable integers in the range of an N-bit integer, converting such an integer to a floating point number of N bits may lose precision, if the converted value happens to be one of the non-representable ones.
Now, since a floating point can represent a subset of rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point can represent all integers up to 2^53. This property is directly associated with the length of the mantissa.
Therefore, it is not possible to lose precision when converting a 32-bit integer to a double on a system that conforms to IEEE-754.
More technically, the floating point unit of the x86 architecture actually uses an 80-bit extended floating point format, which is designed to represent precisely all 64-bit integers and can be accessed through the long double type.
This may happen if int is 64 bits and double is 64 bits as well. Floating point numbers are composed of a mantissa (which holds the digits) and an exponent. Since the mantissa of the double in such a case has fewer bits than the int, the double can represent fewer digits, and a loss of precision happens.
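A minimal demonstration of the loss both answers describe, assuming IEEE-754 doubles with a 53-bit significand:

#include <cstdint>
#include <iostream>

int main() {
    // 2^53 is the largest power of two below which a double can
    // represent every integer exactly.
    std::int64_t exact = 1LL << 53;  // 9007199254740992
    std::int64_t odd   = exact + 1;  // 9007199254740993, not representable

    // The odd value rounds to the nearest representable double (2^53),
    // so the round trip through double loses the low bit.
    std::cout << static_cast<std::int64_t>(static_cast<double>(exact)) << '\n'; // 9007199254740992
    std::cout << static_cast<std::int64_t>(static_cast<double>(odd))   << '\n'; // 9007199254740992
}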
I was writing a program where I had to round a double to the second decimal place. I noticed that printf("%.2f", (double) 12.555) prints 12.55, whereas printf("%.2f", (float) 12.555) prints 12.56. Can anyone explain why this happens?
12.555 is a number that is not representable precisely in binary floating point. It so happens that the closest value to 12.555 that is representable in double precision floating point on your system is slightly less than 12.555, while the closest value to 12.555 that is representable in single precision floating point is slightly more than 12.555.
Assuming the rounding mode used by the conversion is round to nearest (ties to even), which is the default in the IEEE 754 standard, the described output is to be expected.
Floats and doubles are stored internally using the IEEE 754 representation. The part that is relevant to your question is that both floats and doubles store the closest value to the decimal number that they can, given the limits of their representation. Roughly speaking, those limits come from converting the fractional part of the original number into a binary number with a finite number of bits.
It turns out, the closest float to 12.555 is actually 12.55500030517578125 while the closest double to 12.555 is 12.554999999999999715782905696. Notice how the double provides more accuracy, but the error is negative.
At this point, it's probably now obvious why the round function goes up for float but down for double - they're both rounding to the closest decimal number to the underlying representation.
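You can see the stored values directly by printing more digits. A short sketch, assuming a typical IEEE 754 system (the exact trailing digits may vary):

#include <cstdio>

int main() {
    // The float lands slightly above 12.555 and the double slightly
    // below, so rounding to two decimals goes in opposite directions.
    std::printf("%.20f\n", (double)12.555f); // ~12.55500030517578125000
    std::printf("%.20f\n", 12.555);          // ~12.55499999999999971578
    std::printf("%.2f\n", (double)12.555f);  // 12.56
    std::printf("%.2f\n", 12.555);           // 12.55
}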
When I run the following code, the output is exactly the number 2^500 in decimal.
(g++ 5.3.1 on ubuntu)
#include <iostream>
#include <cmath>
using namespace std;

int main() {
    cout.precision(0);
    cout << fixed << pow(2.0, 500.0);
    return 0;
}
I wonder how C++ converts this floating point number to its decimal string at such high precision.
I know that 2^500 can be represented exactly in IEEE 754 format. But I think taking mod 10 and dividing by 10 would cause precision loss on floating point numbers. What algorithm is used when the conversion proceeds?
Yes, there exists an exact double-precision floating-point representation for 2^500. You should not assume that pow(2.0,500.0) produces this value, though: there is no guarantee of accuracy for the pow function, and you may find SO questions that arose from pow(10.0, 2.0) not producing 100.0, even though the mathematical result was perfectly representable too.
But anyway, to answer your question, the conversion from the binary floating-point representation to decimal does not in general rely on floating-point operations, which indeed would be too inaccurate for the intended accuracy of the end result. In general, an accurate conversion requires big-integer arithmetic. In the case of 2^500, for instance, the naïve algorithm would be to repeatedly divide the big integer written in binary as 1000…<500 zeroes in total>… by ten.
There are some cases where floating-point arithmetic can be used, for instance taking advantage of the fact that powers of 10 up to 10^22 are represented exactly in IEEE 754 double-precision. But correctly rounded conversion between binary floating-point and decimal floating-point requires big-integer arithmetic in general, and this is particularly visible far away from 1.0.
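For illustration, here is a minimal sketch of that naïve algorithm: store 2^500 as little-endian 32-bit limbs and repeatedly long-divide by ten, collecting the remainders as decimal digits. This is not how real conversion routines are implemented; it just shows the idea:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // 2^500: a single bit set at position 500, little-endian 32-bit limbs.
    std::vector<std::uint32_t> limbs(500 / 32 + 1, 0);
    limbs[500 / 32] = 1u << (500 % 32);

    std::string digits;
    while (!(limbs.size() == 1 && limbs[0] == 0)) {
        std::uint64_t rem = 0;
        // Long division of the whole big integer by 10, MSB limb first.
        for (std::size_t i = limbs.size(); i-- > 0; ) {
            std::uint64_t cur = (rem << 32) | limbs[i];
            limbs[i] = static_cast<std::uint32_t>(cur / 10);
            rem = cur % 10;
        }
        digits.push_back(static_cast<char>('0' + rem));
        while (limbs.size() > 1 && limbs.back() == 0)
            limbs.pop_back();
    }
    // Remainders come out least significant digit first.
    std::cout << std::string(digits.rbegin(), digits.rend()) << '\n';
}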
I am trying to get the decimal part of a double, and this is my code:
double decimalvalue = 23423.1234 - 23423.0;
After the subtraction I expect decimalvalue to be 0.1234, but I get 0.12340000000040163. Please help me understand this behavior, and whether there is any workaround for it.
I suggest you have a look at
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Wikipedia: IEEE 754
A floating point type can represent only a finite number of values, but there are infinitely many real numbers in the range it covers.
Some of those numbers therefore cannot be represented exactly in any float/double style data type.
The typical way to handle your specific problem is to avoid a direct equality comparison and instead do an epsilon test: check whether the expected and computed values are within some small number (small compared to the values being subtracted), called epsilon, of each other.
Indirectly related is the concept of machine epsilon, which is worth a look for a complete understanding.
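A minimal sketch of such an epsilon test; the tolerance and the relative scaling used here are application-dependent assumptions, not a universal rule:

#include <cmath>
#include <cstdio>

// Treat two doubles as equal when their difference is small relative
// to the magnitudes involved.
bool nearlyEqual(double a, double b, double eps = 1e-9) {
    double scale = std::fmax(1.0, std::fmax(std::fabs(a), std::fabs(b)));
    return std::fabs(a - b) <= eps * scale;
}

int main() {
    double decimalvalue = 23423.1234 - 23423.0;
    std::printf("%d\n", nearlyEqual(decimalvalue, 0.1234)); // prints 1
}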
This is a rounding error. In base ten you cannot perfectly represent 1/3 in a given number of digits (say 15). In base 2 there are a lot more values you cannot represent, and 0.1234 happens to be one of them. The precision depends on the scale, but it's about 15 decimal digits for a double. I suggest taking a look at http://en.wikipedia.org/wiki/IEEE_floating_point for more details on floating point numbers.
If you are trying to build a base-10 system (like a calculator used by humans, for instance) and you need exact results, you should use BCD.
I'm wondering whether a number represented one way in a floating point format is going to be represented the same way in a format with a larger size.
That is, if a number has a particular representation as a float, will it have the same representation if that float is cast to a double, and then still the same when cast to a long double?
I'm wondering because I'm writing a BigInteger implementation, and any floating point number that is passed in is sent to a function that accepts a long double to convert it. Which leads me to my next question: obviously floating point numbers do not always have exact representations, so in my BigInteger class, what should I be attempting to represent when given a float? Is it reasonable to try to represent the same number as given by std::cout << std::fixed << someFloat;, even if that is not the same as the number passed in? Is that the most accurate representation I will be able to get? If so, ...
What's the best way to extract that value (in base some power of 10)? At the moment I'm just grabbing it as a string and passing it to my string constructor. This will work, but I can't help but feel there's a better way; certainly taking the remainder when dividing by my base is not accurate with floats.
Finally, I wonder if there is a floating point equivalent of uintmax_t, that is, a typename that will always be the largest floating point type on a system, or is there no point because long double will always be the largest (even if it's the same as double)?
Thanks, T.
If by "same representation" you mean "exactly the same binary representation in memory except for padding", then no. Double-precision has more bits of both exponent and mantissa, and also has a different exponent bias. But I believe that any single-precision value is exactly representable in double-precision (except possibly denormalised values).
I'm not sure what you mean when you say "floating point numbers do not always have exact representations". Certainly, not all decimal floating-point values have exact binary floating-point counterparts (and vice versa), but I'm not sure that's a problem here. So long as your floating-point input has no fractional part, a suitably large "BigInteger" format should be able to represent it exactly.
Conversion via a base-10 representation is not the way to go. In theory, all you need is a bit array of length ~1024: initialise it to zero, then shift the mantissa bits in by the exponent value. But without knowing more about your implementation, there's not a lot more I can suggest!
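A rough sketch of that idea, assuming a non-negative double with no fractional part: frexp exposes the exponent, ldexp turns the fraction into a 53-bit integer, and the result is stored as little-endian 64-bit limbs. All names here are mine, not from the question's code:

#include <cmath>
#include <cstdint>
#include <vector>

std::vector<std::uint64_t> toBigInt(double v) {
    int exp;
    double frac = std::frexp(v, &exp);  // v == frac * 2^exp, 0.5 <= frac < 1
    // Scale the fraction up to a 53-bit integer mantissa.
    std::uint64_t mant = static_cast<std::uint64_t>(std::ldexp(frac, 53));
    int shift = exp - 53;               // how far the mantissa moves left
    std::vector<std::uint64_t> words(exp / 64 + 1, 0);
    if (shift >= 0) {
        words[shift / 64] |= mant << (shift % 64);
        if (shift % 64 != 0 && shift % 64 + 53 > 64)  // straddles a word boundary
            words[shift / 64 + 1] |= mant >> (64 - shift % 64);
    } else {
        words[0] = mant >> -shift;      // small values: drop the low zero bits
    }
    return words;                        // little-endian 64-bit limbs
}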
double includes all values of float; long double includes all values of double. So you're not losing any value information by conversion to long double. However, you're losing information about the original type, which is relevant (see below).
In order to follow common C++ semantics, conversion of a floating point value to integer should truncate the value, not round.
The main problem is with large values that are not exact. You can use the frexp function to find the base 2 exponent of the floating point value. You can use std::numeric_limits<T>::digits to check if that's within the integer range that can be exactly represented.
My personal design choice would be an assert that the fp value is within the range that can be exactly represented, i.e. a restriction on the range of any actual argument.
To do that properly you need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument's type.
When you have an fp value that is within the allowed range, you can use floor and fmod to extract digits in any numeral system you want.
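As a sketch of that floor/fmod digit extraction, for a non-negative value already checked to be exactly representable (the helper name is mine):

#include <cmath>
#include <iostream>
#include <string>

// Extract the base-10 digits of an integer-valued double. fmod is exact
// by definition, and for values below 2^53 the rounding error of v/10
// cannot push floor across an integer boundary.
std::string digitsBase10(double v) {
    std::string out;
    do {
        int digit = static_cast<int>(std::fmod(v, 10.0)); // last decimal digit
        out.insert(out.begin(), static_cast<char>('0' + digit));
        v = std::floor(v / 10.0);
    } while (v > 0.0);
    return out;
}

int main() {
    std::cout << digitsBase10(9007199254740992.0) << '\n'; // 2^53
}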
Yes; going from IEEE float to double to extended, you can copy the bits from the smaller format into the proper places in the larger format. For example:
single
S EEEEEEEE MMMMMMM.....
double
S EEEEEEEEEEE MMMMM....
6.5 single
0 10000001 101000...
6.5 double
0 10000000001 101000...
13 single
0 10000010 101000...
13 double
0 10000000010 101000...
The mantissa is left-justified: copy the bits and pad with zeros on the right.
The exponent is right-justified: sign-extend the next-to-most-significant bit, then copy the most significant bit on top.
Take an exponent of -2, for example: -2 minus 1 is -3; -3 in two's complement is 0xFD, or 0b11111101, but the exponent bits in the format are 0b01111101, i.e. with the msbit inverted. For double, the same -2 exponent gives -2 - 1 = -3, or 0b1111...1101, which becomes 0b0111...1101, again the msbit inverted. (In short: exponent bits = twos_complement(exponent - 1) with the msbit inverted.)
As seen above, an exponent of 3 gives 3 - 1 = 2 = 0b000...010; inverting the upper bit yields 0b100...010.
So yes, you can take the bits from a single precision number and copy them to the proper locations in a double precision number. I don't have an extended-float reference handy, but I am pretty sure it works the same way.
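A small sketch that checks the widening rules against the compiler's own float-to-double conversion (normal numbers only; memcpy does the type-punning):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = 6.5f;
    std::uint32_t fb;
    std::memcpy(&fb, &f, sizeof fb);

    std::uint64_t sign = fb >> 31;
    std::uint64_t exp  = (fb >> 23) & 0xFF;  // biased by 127
    std::uint64_t mant = fb & 0x7FFFFF;      // 23 stored mantissa bits

    // Re-bias the exponent (127 -> 1023) and left-justify the mantissa
    // (23 -> 52 bits), exactly as described above.
    std::uint64_t db = (sign << 63) | ((exp - 127 + 1023) << 52) | (mant << 29);

    double d;
    std::memcpy(&d, &db, sizeof d);
    std::printf("%f %f %s\n", d, (double)f, d == (double)f ? "match" : "differ");
}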