How to handle number with large exponent in c/c++? [duplicate] - c++

This question already has answers here:
Is there a C++ equivalent to Java's BigDecimal?
(9 answers)
Closed 7 years ago.
I found myself needing to compute the exponential of a large number, e.g. exp(709). Such a number is represented, in double precision, as 8.2184074615549724e+307.
It seems that numbers with exponents larger than that are simply turned into Inf, which creates problems in my code. I can only guess that this could be fixed by using more bits to represent the exponent, but I am not aware of a practical way to proceed.
Here is a code snippet:
double expon = exp(710); /* here I also tried long double, with no effect */
printf("%e\n", expon ); /*gives INF*/
double Wa = LambertW<0>( expon); /*gives error, as it can't handle inf*/
Is there a way to compute this?
This problem has been discussed in general, but I did not find a useful answer. Also, it seems that GCC has supported multiple-precision floating-point arithmetic since version 4.3. How does that help?
Edit: The suggested possible-duplicate questions turned out to be irrelevant, because I need huge numbers, not exact decimals. This is not a duplicate.

You should be able to perform your computation with adequate precision using long double arithmetic:
The maximum value of an 80-bit long double is about 1.18×10^4932, much larger than e^709.
In order for the computation to be performed in long double, you must use expl instead of exp:
long double expon = expl(710);
printf("%Le\n", expon);
The LambertW function will handle the long double argument if it is properly overloaded for this type; otherwise expon will be converted to double, produce inf, and the computation will fail as you mentioned.
I don't know which implementation of the Lambert W function you use. Darko Veberic's does not support long double arguments, but it might be feasible to extend it to long double, as it is available in source form: https://github.com/DarkoVeberic/LambertW . You might want to contact him directly.
Another approach is to consider that exp(709) is just too close to the maximum value of the double type, about 1.8×10^308. If you can rewrite your computation to use smaller exponents and a different formula, it might be done with regular double.
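If rewriting the formula is an option, one way to sidestep the overflow entirely is to note that for z = e^x, w = W(z) satisfies w + ln w = x, which can be solved directly from x without ever forming e^x. Below is a minimal sketch of that idea (lambertw_of_exp is my own hypothetical helper, not part of any LambertW library, and it is only meant for large x):

#include <cmath>
#include <cstdio>

// Approximate W0(exp(x)) for large x without evaluating exp(x).
// Start from the standard asymptotic form W(z) ~ ln z - ln ln z + ln ln z / ln z
// (with z = e^x, ln z = x), then refine by Newton's method on w + ln w - x = 0.
// Hypothetical helper for illustration; not valid for small x.
double lambertw_of_exp(double x)
{
    double lx = std::log(x);
    double w  = x - lx + lx / x;                       // asymptotic starting guess
    for (int i = 0; i < 5; ++i)                        // Newton refinement
        w -= (w + std::log(w) - x) / (1.0 + 1.0 / w);
    return w;
}

int main()
{
    std::printf("W(e^709)  ~ %.15g\n", lambertw_of_exp(709.0));
    std::printf("W(e^2000) ~ %.15g\n", lambertw_of_exp(2000.0));  // exp(2000) itself would overflow
}

Whether this is acceptable depends on how the result is used afterwards; if the surrounding formula also needs exp(x) itself, the long double route above is the more direct fix.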

Related

int64_t to double to int64_t again, loss of precision

I need to parse a given type (e.g. a long long integer) that is represented in scientific notation. Examples:
123456789012345678.3e-3
123456789012345678.3
I know the type of the given string, but I can't use strtoll since the number is given in scientific notation. What I do is convert it using strtod, do error checks with respect to int64_t, and cast it back to int64_t. ErrCheckInt and ErrCheckDouble do the error checks (overflow, underflow, etc.) for integral and floating types and cast the number back to whatever type it was.
double res = strtod(processedStr, &end);
return (std::is_floating_point<OUT_T>::value) ? ErrCheckFloat<double, OUT_T>(res, out) : ErrCheckInt<double, OUT_T>(res, out);
The problem is that when I parse an int64_t via double, I get a floating point number with the correct scientific notation. When I cast the number back to int64_t, I lose precision. The example number:
input: 123456789012345678.3
double_converted: 1.23456789012346E+17
cast_to_int64_t: 123456789012345680
expected: 123456789012345678
I know that the number is short enough to be represented correctly in double precision. I can use long double, but that won't solve the problem.
I could evaluate the string and remove/add digits according to the e notation at the end, but processing has to be very, very fast since the code will run on an embedded RTOS. I already do a lot of checks, and strtod takes its time as well.
I know the type of the given string, but I can't use strtoll since the number is given in scientific notation.
You can still use strtoll: call it once for the integer part, use the resulting end pointer to detect whether the number continues in .xxx or exxx form, and call it again to parse the exponent. Much simpler than going through floating point, in my opinion.
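A rough sketch of that approach is below. The function name is made up, it assumes non-negative input, and the overflow/rounding handling is deliberately minimal; a real version would keep the same range checks as the existing ErrCheck* helpers:

#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Parse input such as "123456789012345678.3e-3" into int64_t using only
// strtoll, never going through double.  Non-negative input assumed.
bool parse_sci_int64(const char* s, int64_t* out)
{
    char* end;
    errno = 0;
    int64_t mantissa = std::strtoll(s, &end, 10);    // integer part
    int64_t exp10 = 0;                               // net power of ten still to apply

    if (*end == '.') {                               // fold fraction digits into the mantissa
        for (++end; *end >= '0' && *end <= '9'; ++end) {
            mantissa = mantissa * 10 + (*end - '0');
            --exp10;                                 // each digit moves the point one place left
        }
    }
    if (*end == 'e' || *end == 'E')
        exp10 += std::strtoll(end + 1, &end, 10);    // explicit exponent, if present

    for (; exp10 > 0; --exp10)                       // scale up exactly
        mantissa *= 10;
    for (; exp10 < 0; ++exp10)                       // scale down; truncates any leftover
        mantissa /= 10;                              // fraction (round instead if preferred)

    *out = mantissa;
    return errno == 0;                               // ERANGE from strtoll is the only check here
}

For the second example input, "123456789012345678.3", this yields 123456789012345678 exactly, which is the value the double round trip loses.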
I know that the number is short enough to be represented correctly in double precision.
No, you don't know that since your example input is “123456789012345678”, which is not representable in IEEE 754 double-precision.
I can use long double, but that won't solve the problem.
Actually, if your compiler maps long double to the 80-bit extended-precision format with a 64-bit significand, it will solve the problem: all 64-bit integers are representable in that format. GCC and Clang make this historical 80-bit format available through long double on Linux, but on Windows it is so inconvenient as to be practically considered unavailable: you would need to change the FPU control word and restore it every time you call a library function, and write your own math functions to operate on 80-bit floating-point values, starting with strtold.
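To see both points concretely, round-trip the example value through each type (the long double result assumes a target such as x86 Linux where long double really is the 80-bit format):

#include <cstdint>
#include <cstdio>

int main()
{
    int64_t v = 123456789012345678LL;   // 18 digits: more than a 53-bit significand can hold exactly

    // Through double the value comes back changed (typically ...680).
    std::printf("double      round trip: %lld\n", (long long)(int64_t)(double)v);

    // Through an 80-bit long double (64-bit significand) it survives intact;
    // with MSVC, where long double is just double, it does not.
    std::printf("long double round trip: %lld\n", (long long)(int64_t)(long double)v);
}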

How would I find 100! accurately? [Programming challenge] [duplicate]

This question already has answers here:
Calculate the factorial of an arbitrarily large number, showing all the digits
(11 answers)
Closed 9 years ago.
I tried using double, but it gives me answers in scientific notation like 3.2e+12. I need the full, exact answer. How would I do that?
My code so far:
#include <iostream>
using namespace std;

int main()
{
    int n, x;
    double fact;
    cin >> n;
    while (n--)
    {
        fact = 1;
        cin >> x;
        for (; x > 1; x--)
            fact *= x;
        cout << fact << endl;
    }
}
First things first: floating point formats such as double and float introduce rounding error. If you want to avoid that error for large integers, use long or long long; however, these cannot represent values as large as double or long double can (note that support for long long and long double varies between compilers). You might want to look into bignum types such as bigint or bigdouble classes, though you will sacrifice some performance.
That said, this issue might also be one of output formatting: the number is large enough that it is printed in scientific notation; to change this you can use
cout<<std::fixed;
possible duplicate of How to make C++ cout not use scientific notation
double is a fixed-size type, typically 64 bits, with 53 bits of precision; so it can't accurately represent any integer with more than about 16 digits. The largest standard integer type typically has 64 bits, and can represent integers up to about 19 digits. 100! is much larger than that, so can't be represented accurately by any built-in type.
You'll need a large integer type, representing a number as an array of smaller numbers. There's no standard type; you could use a library like Boost.Multiprecision or GMP or, since this is a programming challenge, implement it yourself. To calculate factorials, you'll need to implement multiplication; the easiest way to do that is with the "long multiplication" algorithm you learnt in school.
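As a rough sketch of that schoolbook approach (one of many ways to organise it), the value can be held as a vector of decimal digits and multiplied in place:

#include <iostream>
#include <vector>

// 100! via "long multiplication": the value is kept as a vector of decimal
// digits, least significant first, and repeatedly multiplied by a small integer.
int main()
{
    std::vector<int> digits{1};                  // represents the value 1
    for (int x = 2; x <= 100; ++x) {
        int carry = 0;
        for (int& d : digits) {
            int prod = d * x + carry;
            d = prod % 10;
            carry = prod / 10;
        }
        while (carry) {                          // append whatever carry is left over
            digits.push_back(carry % 10);
            carry /= 10;
        }
    }
    for (auto it = digits.rbegin(); it != digits.rend(); ++it)
        std::cout << *it;
    std::cout << '\n';                           // all 158 digits of 100!
}

Storing one decimal digit per int wastes space, but it keeps the carry logic obvious; a faster version would keep several digits (or a 32-bit limb) per element.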
There's no built-in data type that can store a number as big as 100!.
You should write something like a BigInteger class to calculate 100!.
Usually, such big numbers are stored as strings.

How can I tell if a double precision floating point number can be safely stored as a single precision one? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Real numbers - how to determine whether float or double is required?
I'm trying to check if a conversion from double to float will result in loss of precision. Obviously, I can do the conversion and convert the float back into double and compare it to the original value. I'm curious as to whether there's a more direct way.
Converting to float and back is generally the most efficient solution; on most common architectures it will require only a couple instructions, with a latency of a couple cycles each. This also has the virtue of being both simple and correct.
On platforms that do not have hardware support for floating point, you can do the check more efficiently by taking the number apart and checking whether the exponent and significand fit into single precision; but that is a relatively uncommon corner case, and this approach is much more error-prone and not portable to platforms that use different FP formats.
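For the curious, here is a sketch of that decomposition approach, assuming IEEE 754 float and double; subnormal floats and non-finite inputs are glossed over for brevity:

#include <cfloat>
#include <cmath>

// True if the double fits in a float without loss: its exponent must lie in
// float's normal range and its significand must need at most FLT_MANT_DIG (24) bits.
bool fits_in_float(double d)
{
    if (d == 0.0)
        return true;
    int exp;
    double m = std::frexp(d, &exp);                  // d = m * 2^exp with 0.5 <= |m| < 1
    if (exp < FLT_MIN_EXP || exp > FLT_MAX_EXP)      // outside float's normal exponent range
        return false;
    double scaled = std::ldexp(m, FLT_MANT_DIG);     // shift the significand left by 24 bits
    return scaled == std::floor(scaled);             // exact iff nothing remains beyond bit 24
}

On hardware with an FPU, the plain round-trip comparison remains both shorter and usually faster.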
A floating point number has two parts, the mantissa and the exponent. A double has more bits assigned to both parts. Assigning a double to a float will drop mantissa bits, which gives you fewer digits of precision; that is to be expected. However, if the double's exponent doesn't fit in the float's exponent range, the conversion overflows to infinity (or underflows to zero) rather than producing a value close to the original.

Real numbers - how to determine whether float or double is required?

Given a real value, can we check if a float data type is enough to store the number, or a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?
For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose that you get this real number by specifying it in code or through user input; a way to check if a float or a double would be enough to store it without precision loss is to just count the number of significant bits and check that against the data range for float and double.
If the number is given as an expression (e.g. 1/7 or sqrt(2)), you will also want ways of detecting:
whether the number is rational and, if so, whether its decimal expansion is finite or repeating;
what happens when you have an irrational number.
Moreover, there are numbers, such as 0.9, that float / double cannot represent exactly (at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.
Lastly, see additional discussion on float vs. double.
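To make the 0.9 example concrete, printing it with more digits than the type actually stores exposes the approximation (the exact digits depend on the platform's float/double formats):

#include <cstdio>

int main()
{
    // 0.9 has no finite binary expansion, so both types store a nearby value.
    std::printf("%.25f\n", 0.9f);   // something like 0.8999999761581420898437500
    std::printf("%.25f\n", 0.9);    // something like 0.9000000000000000222044605
}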
Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.
Single precision assigns 23 bits of "mantissa," or binary digits after the radix point (the binary equivalent of the decimal point). Since the bit before the point is always one, this equates to a 24-bit significand. Dividing 24 by log2(10) ≈ 3.32, a float gets you about 7.2 decimal digits of precision.
Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).
The bits besides the mantissa are used for the exponent. The number of exponent bits determines the range of numbers allowed. Single precision goes up to ~10^±38, double up to ~10^±308.
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
A very detailed post that may or may not answer your question.
An entire series in floating point complexities!
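If you would rather query these figures for your own compiler than memorise them, std::numeric_limits reports them directly; a minimal sketch:

#include <iostream>
#include <limits>

int main()
{
    // digits10 and max_exponent10 correspond to the decimal-digit and 10^x
    // range figures discussed above; output varies by platform.
    std::cout << "float:       " << std::numeric_limits<float>::digits10
              << " digits, max ~1e" << std::numeric_limits<float>::max_exponent10 << '\n';
    std::cout << "double:      " << std::numeric_limits<double>::digits10
              << " digits, max ~1e" << std::numeric_limits<double>::max_exponent10 << '\n';
    std::cout << "long double: " << std::numeric_limits<long double>::digits10
              << " digits, max ~1e" << std::numeric_limits<long double>::max_exponent10 << '\n';
}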
Couldn't you simply store it in a float and a double variable and then compare the two? The comparison should implicitly convert the float back to a double - if there is no difference, the float is sufficient.
float f = value;
double d = value;
if ((double)f == d)
{
    // float is sufficient
}
You cannot represent arbitrary real numbers with float or double variables, only a subset of the rational numbers.
When you do floating point computation, your CPU floating point unit will decide the best approximation for you.
I might be wrong, but I thought that the float (4-byte) and double (8-byte) floating point representations were actually specified independently of computer architectures.

Long double does not print as the constant I initialized it with [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Floating point inaccuracy examples
I'm having a problem: when I compile the source, the variable shown isn't the same as the one I initialized it with; see:
#include <iostream>
using namespace std;

int main()
{
    long double mynum = 4.7;
    cout.setf(ios::fixed, ios::floatfield);
    cout.precision(20);
    cout << mynum << endl;
}
And then:
[fpointbin#fedora ~]$ ./a.out
4.70000000000000017764
How can I fix it? I want cout to show 4.700000...
Your variable is long double, but the literal 4.7 only has double precision. Since you're printing it as a long double, the implementation chooses to print it with enough significant digits to distinguish it from neighbouring long double values, even though those neighbouring values are not possible doubles.
The internal representation of doubles does not allow for an 'exact' representation of 4.7. The 'closest' is 4.70000000000000017764. In reality there is no need to look at a precision of 20 when you have 64 bit doubles. The maximum effective precision is about 15. Try using 12 or so,
cout.precision( 12 );
and you should get what you want to see.
Most platforms, including yours, can only represent exactly those floating point numbers that have a short, finite binary expansion, i.e. that are finite sums of powers of two. 4.7 is not such a number, so it cannot be represented precisely on your platform, and if you demand excessive precision (20 digits is too much, since your mantissa has 64 bits and log10(2^64) ≈ 19.27), then you will inevitably see small errors.
(However, as @Henning says, you are already losing precision when assigning from a (non-long) double; you should write your literal constant as a long double: 4.7L. Then you should only see an error around the 20th digit.)
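For illustration, here is the program from the question with only the literal changed, assuming an x86 system where long double is the 80-bit format:

#include <iostream>
using namespace std;

int main()
{
    long double mynum = 4.7L;               // long double literal: no detour through double
    cout.setf(ios::fixed, ios::floatfield);
    cout.precision(20);
    cout << mynum << endl;                  // e.g. 4.69999999999999999983, error only near the 20th digit
}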
floats and doubles are binary floating-point types, i.e. they store a mantissa and an exponent in base 2.
This means that any decimal number that cannot be represented exactly in the finite digits of the mantissa will be approximated; the problem you showed comes from this: 4.7 cannot be represented exactly in the mantissa of a double (the literal 4.7 is of type double, kudos to @Henning Makholm for spotting it), so the nearest approximation is used.
To better visualize the problem: in base 3, 2/3 has a finite representation (0.2), while in base 10 it is a periodic number (0.6666666...); if you have only a finite space for digits, you have to approximate it, e.g. as 0.66666667. Exactly the same thing happens here, with the source base being 10 and the "target" base being 2.
If there's a special need to avoid this kind of approximation (e.g. when dealing with decimal amounts of money), decimal types can be used, which store the mantissa and exponent in base 10 (C++ does not provide such a type of its own, but there are many decimal classes available on the Net); still, for "normal"/scientific calculations binary FP types are used, because they are much faster and more space-efficient.
Certain numbers cannot be represented exactly in base two. Apparently, 4.7 is one of them. What you're seeing is the closest representable number to 4.7.
There's nothing you can do about this, other than setting the precision to a lower number.