I need to represent numbers using the following structure. The purpose of this structure is not to lose the precision.
struct PreciseNumber
{
long significand;
int exponent;
}
Using this structure actual double value can be represented as value = significand * 10e^exponent.
Now I need to write utility function which can covert double into PreciseNumber.
Can you please let me know how to extract the exponent and significand from the double?
The prelude is somewhat flawed.
Firstly, barring any restrictions on storage space, conversion from a double to a base 10 significand-exponent form won't alter the precision in any form. To understand that, consider the following: any binary terminating fraction (like the one that forms the mantissa on a typical IEEE-754 float) can be written as a sum of negative powers of two. Each negative power of two is a terminating fraction itself, and hence it follows that their sum must be terminating as well.
However, the converse isn't necessarily true. For instance, 0.3 base 10 is equivalent to the non-terminating 0.01 0011 0011 0011 ... in base 2. Fitting this into a fixed size mantissa would blow some precision out of it (which is why 0.3 is actually stored as something that translates back to 0.29999999999999999.)
By this, we may assume that any precision that is intended by storing the numbers in decimal significand-exponent form is either lost, or isn't simply gained at all.
Of course, you might think of the apparent loss of accuracy generated by storing a decimal number as a float as loss in precision, in which case the Decimal32 and Decimal64 floating point formats may be of some interest -- check out http://en.wikipedia.org/wiki/Decimal64_floating-point_format.
This is a very difficult problem. You might want to see how much code it takes to implement a double-to-string conversion (for printf, e.g.). You might steal the code from gnu's implementation of gcc.
You cannot convert an "imprecise" double into a "precise" decimal number, because the required "precision" simply isn't there to begin with (otherwise why would you even want to convert?).
This is what happens if you try something like it in Java:
BigDecimal x = new BigDecimal(0.1);
System.out.println(x);
The output of the program is:
0.1000000000000000055511151231257827021181583404541015625
Well you're at less precision than a typical double. Your significand is a long giving you a range from -2 billion to +2 billion which is more than 9 but fewer than 10 digits of precision.
Here's an untested starting point on what you'd want to do for some simple math on PreciseNumbers
PreciseNumber Multiply(PreciseNumber lhs, PreciseNumber rhs)
{
PreciseNumber ret;
ret.s=lhs.s;
ret.e=lhs.e;
ret.s*=rhs.s;
ret.e+=lhs.e;
return ret;
}
PreciseNumber Add(PreciseNumber lhs, PreciseNumber rhs)
{
PreciseNumber ret;
ret.s=lhs.s;
ret.e=lhs.e;
ret.s+=(rhs.s*pow(10,rhs.e-lhs.e));
}
I didn't take care of any renormalization, but in both cases there are places where you have to worry about over/under flows and loss of precision. Just because you're doing it yourself rather than letting the computer take care of it in a double, doesn't meat the same pitfalls aren't there. The only way to not lose precision is to keep track of all of the digits.
Here's a very rough algorithm. I'll try to fill in some details later.
Take the log10 of the number to get the exponent. Multiply the double by 10^x if positive, or divide by 10^-x if negative.
Start with a significand of zero. Repeat the following 15 times, since a double contains 15 digits of significance:
Multiply the previous significand by 10.
Take the integer portion of the double, add it to the significand, and subtract it from the double.
Subtract 1 from the exponent.
Multiply the double by 10.
When finished, take the remaining double value and use it for rounding: if it's >= 5, add one to the significand.
Related
Is there an algorithm in C++ that will allow me to, given a floating-point value V of type T (e.g. double or float), returns the closest value to V in a given direction (up or down) that can be represented exactly in less than or equal to a specified number of decimal places D ?
For example, given
T = double
V = 670000.08267799998
D = 6
For direction = towards +inf I would like the result to be 670000.082678, and for direction = towards -inf I would like the result to be 670000.082677
This is somewhat similar to std::nexttoward(), but with the restriction that the 'next' value needs to be exactly representable using at most D decimal places.
I've considered a naive solution involving separating out the fractional portion and scaling it by 10^D, truncating it, and scaling it again by 10^-D and tacking it back onto the whole number portion, but I don't believe that guarantees that the resulting value will be exactly representable in the underlying type.
I'm hopeful that there's a way to do this properly, but so far I've been unable to find one.
Edit: I think my original explanation didn't properly convey my requirements. At the suggestion of #patricia-shanahan I'll try to describing my higher-level goal and then reformulate the problem a little differently in that context.
At the highest level, the reason I need this routine is due to some business logic wherein I must take in a double value K and a percentage P, split it into two double components V1 and V2 where V1 ~= P percent of K and V1 + V2 ~= K. The catch is that V1 is used in further calculations before being sent to a 3rd party over a wire protocol that accepts floating-point values in string format with a max of D decimal places. Because the value sent to the 3rd party (in string format) needs to be reconcilable with the results of the calculations made using V1 (in double format) , I need to "adjust" V1 using some function F() so that it is as close as possible to being P percent of K while still being exactly representable in string format using at most D decimal places. V2 has none of the restrictions of V1, and can be calculated as V2 = K - F(V1) (it is understood and acceptable that this may result in V2 such that V1 + V2 is very close to but not exactly equal to K).
At the lower level, I'm looking to write that routine to 'adjust' V1 as something with the following signature:
double F(double V, unsigned int D, bool roundUpIfTrueElseDown);
where the output is computed by taking V and (if necessary, and in the direction specified by the bool param) rounding it to the Dth decimal place.
My expectation would be that when V is serialized out as follows
const auto maxD = std::numeric_limits<double>::digits10;
assert(D <= maxD); // D will be less than maxD... e.g. typically 1-6, definitely <= 13
std::cout << std::fixed
<< std::setprecision(maxD)
<< F(V, D, true);
then the output contains only zeros beyond the Dth decimal place.
It's important to note that, for performance reasons, I am looking for an implementation of F() that does not involve conversion back and forth between double and string format. Though the output may eventually be converted to a string format, in many cases the logic will early-out before this is necessary and I would like to avoid the overhead in that case.
This is a sketch of a program that does what is requested. It is presented mainly to find out whether that is really what is wanted. I wrote it in Java, because that language has some guarantees about floating point arithmetic on which I wanted to depend. I only use BigDecimal to get exact display of doubles, to show that the answers are exactly representable with no more than D digits after the decimal point.
Specifically, I depended on double behaving according to IEEE 754 64-bit binary arithmetic. That is likely, but not guaranteed by the standard, for C++. I also depended on Math.pow being exact for simple exact cases, on exactness of division by a power of two, and on being able to get exact output using BigDecimal.
I have not handled edge cases. The big missing piece is dealing with large magnitude numbers with large D. I am assuming that the bracketing binary fractions are exactly representable as doubles. If they have more than 53 significant bits that will not be the case. It also needs code to deal with infinities and NaNs. The assumption of exactness of division by a power of two is incorrect for subnormal numbers. If you need your code to handle them, you will have to put in corrections.
It is based on the concept that a number that is both exactly representable as a decimal with no more than D digits after the decimal point and is exactly representable as a binary fraction must be representable as a fraction with denominator 2 raised to the D power. If it needs a higher power of 2 in the denominator, it will need more than D digits after the decimal point in its decimal form. If it cannot be represented at all as a fraction with a power-of-two denominator, it cannot be represented exactly as a double.
Although I ran some other cases for illustration, the key output is:
670000.082678 to 6 digits Up: 670000.09375 Down: 670000.078125
Here is the program:
import java.math.BigDecimal;
public class Test {
public static void main(String args[]) {
testIt(2, 0.000001);
testIt(10, 0.000001);
testIt(6, 670000.08267799998);
}
private static void testIt(int d, double in) {
System.out.print(in + " to " + d + " digits");
System.out.print(" Up: " + new BigDecimal(roundUpExact(d, in)).toString());
System.out.println(" Down: "
+ new BigDecimal(roundDownExact(d, in)).toString());
}
public static double roundUpExact(int d, double in) {
double factor = Math.pow(2, d);
double roundee = factor * in;
roundee = Math.ceil(roundee);
return roundee / factor;
}
public static double roundDownExact(int d, double in) {
double factor = Math.pow(2, d);
double roundee = factor * in;
roundee = Math.floor(roundee);
return roundee / factor;
}
}
In general, decimal fractions are not precisely representable as binary fractions. There are some exceptions, like 0.5 (½) and 16.375 (16⅜), because all binary fractions are precisely representable as decimal fractions. (That's because 2 is a factor of 10, but 10 is not a factor of 2, or any power of two.) But if a number is not a multiple of some power of 2, its binary representation will be an infinitely-long cyclic sequence, like the representation of ⅓ in decimal (.333....).
The standard C library provides the macro DBL_DIG (normally 15); any decimal number with that many decimal digits of precision can be converted to a double (for example, with scanf) and then converted back to a decimal representation (for example, with printf). To go in the opposite direction without losing information -- start with a double, convert it to decimal and then convert it back -- you need 17 decimal digits (DBL_DECIMAL_DIG). (The values I quote are based on IEEE-754 64-bit doubles).
One way to provide something close to the question would be to consider a decimal number with no more than DBL_DIG digits of precision to be an "exact-but-not-really-exact" representation of a floating point number if that floating point number is the floating point number which comes closest to the value of the decimal number. One way to find that floating point number would be to use scanf or strtod to convert the decimal number to a floating point number, and then try the floating point numbers in the vicinity (using nextafter to explore) to find which ones convert to the same representation with DBL_DIG digits of precision.
If you trust the standard library implementation to not be too far off, you could convert your double to a decimal number using sprintf, increment the decimal string at the desired digit position (which is just a string operation), and then convert it back to a double with strtod.
Total re-write.
Based on OP's new requirement and using power-of-2 as suggested by #Patricia Shanahan, simple C solution:
double roundedV = ldexp(round(ldexp(V, D)),-D); // for nearest
double roundedV = ldexp(ceil (ldexp(V, D)),-D); // at or just greater
double roundedV = ldexp(floor(ldexp(V, D)),-D); // at or just less
The only thing added here beyond #Patricia Shanahan fine solution is C code to match OP's tag.
In C++ integers must be represented in binary, but floating point types can have a decimal representation.
If FLT_RADIX from <limits.h> is 10, or some multiple of 10, then your goal of exact representation of a decimal values is attainable.
Otherwise, in general, it's not attainable.
So, as a first step, try to find a C++ implementation where FLT_RADIX is 10.
I wouldn't worry about algorithm or efficiency thereof until the C++ implementation is installed and proved to be working on your system. But as a hint, your goal seems to be suspiciously similar to the operation known as “rounding”. I think, after obtaining my decimal floating point C++ implementation, I’d start by investigating techniques for rounding, e.g., googling that, maybe Wikipedia, …
I have a program in C++ where I divide two numbers, and I need to know if the answer is an integer or not. What I am using is:
if(fmod(answer,1) == 0)
I also tried this:
if(floor(answer)==answer)
The problem is that answer usually is a 5 digit number, but with many decimals. For example, answer can be: 58696.000000000000000025658 and the program considers that an integer.
Is there any way I can make this work?
I am dividing double a/double b= double answer
(sometimes there are more than 30 decimals)
Thanks!
EDIT:
a and b are numbers in the thousands (about 100,000) which are then raised to powers of 2 and 3, added together and divided (according to a complicated formula). So I am plugging in various a and b values and looking at the answer. I will only keep the a and b values that make the answer an integer. An example of what I got for one of the answers was: 218624 which my program above considered to be an integer, but it really was: 218624.00000000000000000056982 So I need a code that can distinguish integers with more than 20-30 decimals.
You can use std::modf in cmath.h:
double integral;
if(std::modf(answer, &integral) == 0.0)
The integral part of answer is stored in fraction and the return value of std::modf is the fractional part of answer with the same sign as answer.
The usual solution is to check if the number is within a very short distance of an integer, like this:
bool isInteger(double a){
double b=round(a),epsilon=1e-9; //some small range of error
return (a<=b+epsilon && a>=b-epsilon);
}
This is needed because floating point numbers have limited precision, and numbers that indeed are integers may not be represented perfectly. For example, the following would fail if we do a direct comparison:
double d=sqrt(2); //square root of 2
double answer=2.0/(d*d); //2 divided by 2
Here, answer actually holds the value 0.99999..., so we cannot compare that to an integer, and we cannot check if the fractional part is close to 0.
In general, since the floating point representation of a number can be either a bit smaller or a bit bigger than the actual number, it is not good to check if the fractional part is close to 0. It may be a number like 0.99999999 or 0.000001 (or even their negatives), these are all possible results of a precision loss. That's also why I'm checking both sides (+epsilon and -epsilon). You should adjust that epsilon variable to fit your needs.
Also, keep in mind that the precision of a double is close to 15 digits. You may also use a long double, which may give you some extra digits of precision (or not, it is up to the compiler), but even that only gets you around 18 digits. If you need more precision than that, you will need to use an external library, like GMP.
Floating point numbers are stored in memory using a very different bit format than integers. Because of this, comparing them for equality is not likely to work effectively. Instead, you need to test if the difference is smaller than some epsilon:
const double EPSILON = 0.00000000000000000001; // adjust for whatever precision is useful for you
double remainder = std::fmod(numer, denom);
if(std::fabs(0.0 - remainder) < EPSILON)
{
//...
}
Alternatively, if you want to include values that are close to integers (based on your desired precision), you can modify the if condition slightly (since the remainder returned by std::fmod will be in the range [0, 1)):
if (std::fabs(std::round(d) - d) < EPSILON)
{
// ...
}
You can see the test for this here.
Floating point numbers are generally somewhat precise to about 12-15 digits (as a double), but as they are stored as a mantissa (fraction) and a exponent, rational numbers (integers or common fractions) are not likely to be stored as such. For example,
double d = 2.0; // d might actually be 1.99999999999999995
Because of this, you need to compare the difference of what you expect to some very small number that encompasses the precision you desire (we will call this value, epsilon):
double d = 2.0;
bool test = std::fabs(2 - d) < epsilon; // will return true
So when you are trying to compare the remainder from std::fmod, you need to check it against the difference from 0.0 (not for actual equality to 0.0), which is what is done above.
Also, the std::fabs call prevents you from having to do 2 checks by asserting that the value will always be positive.
If you desire a precision that is greater than 15-18 decimal places, you cannot use double or long double; you will need to use a high precision floating point library.
For the following program:
#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
for (float a = 1.0; a < 10; a++)
cout << std::setprecision(30) << 1.0/a << endl;
return 0;
}
I recieve the following output:
1
0.5
0.333333333333333314829616256247
0.25
0.200000000000000011102230246252
0.166666666666666657414808128124
0.142857142857142849212692681249
0.125
0.111111111111111104943205418749
Which is definitely not right right for the lower place digits, particularly with respect to 1/3,1/5,1/7, and 1/9. things just start going wrong around 10^-16 I would expect to see out put more resembling:
1
0.5
0.333333333333333333333333333333
0.25
0.2
0.166666666666666666666666666666
0.142857142857142857142857142857
0.125
0.111111111111111111111111111111
Is this an inherit flaw in the float class? Is there a way to overcome this and have proper division? Is there a special datatype for doing precise decimal operations? Am I just doing something stupid or wrong in my example?
There are a lot of numbers that computers cannot represent, even if you use float or double-precision float. 1/3, or .3 repeating, is one of those numbers. So it just does the best it can, which is the result you get.
See http://floating-point-gui.de/, or google float precision, there's a ton of info out there (including many SO questions) on this subject.
To answer your questions -- yes, this is an inherent limitation in both the float class and the double class. Some mathematical programs (MathCAD, probably Mathematica) can do "symbolic" math, which allows calculation of the "correct" answers. In many cases, the round-off error can be managed, even over really complex computations, such that the top 6-8 decimal places are correct. However, the opposite is true as well -- naive computations can be constructed that return wildly incorrect answers.
For small problems like division of whole numbers, you'll get a decent number of decimal place accuracy (maybe 4-6 places). If you use double precision floats, that will go up to maybe 8. If you need more... well, I'd start questioning why you want that many decimal places.
First of all, since your code does 1.0/a, it gives you double (1.0 is a double value, 1.0f is float) as the rules of C++ (and C) always extends a smaller type to the larger one if the operands of an operation is different size (so, int + char makes the char into an int before adding the values, long + int will make the int long, etc, etc).
Second floating point values have a set number of bits for the "number". In float, that is 23 bits (+ 1 'hidden' bit), and in double it's 52 bits (+1). Yet get approximately 3 digits per bit (exactly: log2(10), if we use decimal number representation), so a 23 bit number gives approximately 7-8 digits, a 53 bit number approximately 16-17 digits. The remainder is just "noise" caused by the last few bits of the number not evening out when converting to a decimal number.
To have infinite precision, we would have to either store the value as a fraction, or have an infinite number of bits. And of course, we could have some other finite precision, such as 100 bits, but I'm sure you'd complain about that too, because it would just have another 15 or so digits before it "goes wrong".
Floats only have so much precision (23 bits worth to be precise). If you REALLY want to see "0.333333333333333333333333333333" output, you could create a custom "Fraction" class which stores the numerator and denominator separately. Then you could calculate the digit at any given point with complete accuracy.
I have a situation which requires a float to be represented in a single char. The range that this 'minifloat' needs to represent is 0 to 10e-7, so we can always assume that the number is +ve, and the exponent -ve in order to save space.
The representation that I have thought about going with is 3 bits of exponent, and 5 bits mantissa (with 1 implied bit), with the exponent being in base 10, i.e. x = man * 10^exp.
To convert from a float to my minifloat, I plan to use frexp, and use some maths to convert from base 2 to base 10.
Is this a sensible approach? Or are there better ways to achieve this?
Do you actually need the value to be floating point (i.e. to have roughly constant precision as the value scales)? What are you going to do with these values?
A much simpler (and more efficient) idea would be to interpret 8 bits as an unsigned fixed-point number with an implicit scale of 1e-7. I.e.:
float toFloat(uint8_t x) {
return x / 255.0e7;
}
uint8_t fromFloat(float x) {
if (x < 0) return 0;
if (x > 1e-7) return 255;
return 255.0e7 * x; // this truncates; add 0.5 to round instead
}
If it serves your purposes, it is reasonable to use such a format as a storage or transmission format, that is, for recording data in a small space. You should verify that the rounding errors from this format are not too large for your needs, that the range is suitable, et cetera.
This would not be a good format for calculation, because it would be slow on normal hardware.
I do not understand what base conversion you would be doing. If you have an IEEE-754 floating-point number in a float, then the job of converting to or from your 8-bit format is one of rounding the significand (the fraction) when going to the narrower format and of adjusting the exponent bias, plus handling special cases (denormals, overflow, NaNs). This would just involve binary arithmetic, not decimal.
As an aside, note that the proper term for the fraction portion of a floating-point number is “fraction” or “significand” (the term used in the IEEE-754 standard). A “mantissa” is the fractional portion of a logarithm.
An alternative is to use a static array of 256 float (or double) that you will choose on your own criteria.
Then the conversion unsigned char -> float/double is trivial...
The conversion float/double-> unsigned char is a bit more involved (find nearest float in the static array); it would cost about 8 comparisons with a naive binary search algorithm, but you may find better according to the way you choosed the values in the static array.
Of course, operations would be performed with native float/double.
5 mantissa bits give you 32 different situations from 1.00 to 9.00 with the minimum step size 0.25
1.00 1.25 1.50 1.75 2.00 .... 8.75 9.00
3 exponents can give you 8 different situations 10^0(which is 1) 10^-2 10^-3 10^-4 ....finally 10^-7
Your fraction part's error is 0.25. If your calculations can compensate this error, then you can use this.
The IEE754 (64 bits) floating point is supposed to correctly represent 15 significant digit although the internal representation has 17 ditigs. Is there a way to force the 16th and 17th digits to zero ??
Ref:
http://msdn.microsoft.com/en-us/library/system.double(VS.80).aspx :
.
.
Remember that a floating-point number can only approximate a decimal number, and that the precision of a floating-point number determines how accurately that number approximates a decimal number. By default, a Double value contains 15 decimal digits of precision, although a maximum of 17 digits is maintained internally. The precision of a floating-point number has several consequences:
.
.
Example nos:
d1 = 97842111437.390091
d2 = 97842111437.390076
d1 and d2 differ in 16th and 17th decimal places that are not supposed to be significant. Looking for ways to force them to zero. ie
d1 = 97842111437.390000
d2 = 97842111437.390000
No. Counter-example: the two closest floating-point numbers to a rational
1.11111111111118
(which has 15 decimal digits) are
1.1111111111111799942818834097124636173248291015625
1.1111111111111802163264883347437717020511627197265625
In other words, there is not floating-point number that starts with 1.1111111111111800.
This question is a little malformed. The hardware stores the numbers
in binary, not decimal. So in the general case you can't do precise
math in base 10. Some decimal numbers (0.1 is one of them!) do not
even have a non-repeating representation in binary. If you have
precision requirements like this, where you care about the number
being of known precision to exactly 15 decimal digits, you will need
to pick another representation for your numbers.
No, but I wonder if this is relevant to any of your issues (GCC specific):
GCC Documentation
-ffloat-store Do not store floating point variables in registers, and
inhibit other options that might
change whether a floating point value
is taken from a register or memory.
This option prevents undesirable
excess precision on machines such as
the 68000 where the floating registers
(of the 68881) keep more precision
than a double is supposed to have.
Similarly for the x86 architecture.
For most programs, the excess
precision does only good, but a few
programs rely on the precise
definition of IEEE floating point. Use
-ffloat-store for such programs, after modifying them to store all pertinent
intermediate computations into
variables.
You should be able to directly modify the bits in your number by creating a union with a field for the floating point number and an integral type of the same size. Then you can access the bits you want and set them however you want. Here is in example where I whack the sign bit; you can choose any field you want, of course.
#include <stdio.h>
union double_int {
double fp;
unsigned long long integer;
};
int main(int argc, const char *argv[])
{
double my_double = 1325.34634;
union double_int *my_union = (union double_int *)&my_double;
/* print original numbers */
printf("Float %f\n", my_double);
printf("Integer %llx\n", my_union->integer);
/* whack the sign bit to 1 */
my_union->integer |= 1ULL << 63;
/* print modified numbers */
printf("Negative float %f\n", my_double);
printf("Negative integer %llx\n", my_union->integer);
return 0;
}
Generally speaking, people only care about something like this ("I only want the first x digits") when displaying the number. That's relatively easy with stringstreams or sprintf.
If you're concerned about comparing numbers with ==; you really can't do that with floating point numbers. Instead you want to see if the numbers are close enough (say, within an epsilon() of each other).
Playing with the bits of the number directly isn't a great idea.