Better precision to solve a geometric progression - fortran

Letting a1 be the first term, r be the constant that each term is multiplied by to get the next term and n be the number of terms, the geometric progression is: ai = a1*r**(i-1), pn the product of the n terms and sn the sum of the n terms.
I have the formulas to calculate this, but Fortran 95 (Plato2) doesn't admit the precision I need. (For example: I can't get -1.234E+00567890 as result).
How can I "amply" the double precision to work with this "huge" numbers?

Such high numbers (your example -1.234E+00567890) are too large for any intrinsic numerical type supplied by Fortran standard. They are also larger than numbers used in physical and engineering applications. For example, my gfortran supports these types:
huge(1.0_real32) 3.40282347E+38
huge(1.0_real64) 1.7976931348623157E+308
huge(1.0_real128) 1.18973149535723176508575932662800702E+4932
As far as I know there is no Fortran compiler available with much larger intrinsic floating point types.
For specialized purposes, as yours, specialized libraries are needed. This site is not for software recommendations so I will not recommend any particular one. Have a look at a list of some of them at http://crd-legacy.lbl.gov/~dhbailey/mpdist/ and of course there are more around (the GNU Scientific Library will have something I am sure).

Related

How does printf extract digits from a floating point number?

How do functions such as printf extract digits from a floating point number? I understand how this could be done in principle. Given a number x, of which you want the first n digits, scale x by a power of 10 so that x is between pow(10, n) and pow(10, n-1). Then convert x into an integer, and take the digits of the integer.
I tried this, and it worked. Sort of. My answer was identical to the answer given by printf for the first 16 decimal digits, but tended to differ on the ones after that. How does printf do it?
The classic implementation is David Gay's dtoa. The exact details are somewhat arcane (see Why does "dtoa.c" contain so much code?), but in general it works by doing the base conversion using more precision beyond what you can get from a 32-bit, 64-bit, or even 80-bit floating point number. To do this, it uses so-called "bigints" or arbitrary-precision numbers, which can hold as many digits as you can fit in memory. Gay's code has been copied, with modifications, into countless other libraries including common implementations for the C standard library (so it might power your printf), Java, Python, PHP, JavaScript, etc.
(As a side note... not all of these copies of Gay's dtoa code were kept up to date, so because PHP used an old version of strtod it hung when parsing 2.2250738585072011e-308.)
In general, if you do things the "obvious" and simple way like multiplying by a power of 10 and then converting the integer, you will lose a small amount of precision and some of the results will be inaccurate... but maybe you will get the first 14 or 15 digits correct. Gay's implementation of dtoa() claims to get all the digits correct... but as a result, the code is quite difficult to follow. Skip to the bottom to see strtod itself, you can see that it starts with a "fast path" which just uses ordinary floating-point arithmetic, but then it detects if that result is incorrect and uses a more reliable algorithm using bigints which works in all cases (but is slower).
The implementation has the following citation, which you may find interesting:
* Inspired by "How to Print Floating-Point Numbers Accurately" by
* Guy L. Steele, Jr. and Jon L. White [Proc. ACM SIGPLAN '90, pp. 112-126].
The algorithm works by calculating a range of decimal numbers which produce the given binary number, and by using more digits, the range gets smaller and smaller until you either have an exact result or you can correctly round to the requested number of digits.
In particular, from sec 2.2 Algorithm,
The algorithm uses exact rational arithmetic to perform its computations so that there is no loss of accuracy. In order to generate digits, the algorithm scales the number so that it is of the form 0.d1d2..., where d1, d2, ..., are base-B digits. The first digit is computed by multiplying the scaled number by the output base, B, and taking the integer part. The remainder is used to compute the rest of the digits using the same approach.
The algorithm can then continue until it has the exact result (which is always possible, since floating-point numbers are base 2, and 2 is a factor of 10) or until it has as many digits as requested. The paper goes on to prove the algorithm's correctness.
Also note that not all implementations of printf are based on Gay's dtoa, this is just a particularly common implementation that's been copied a lot.
There are various ways to convert floating-point numbers to decimal numerals without error (either exactly or with rounding to a desired precision).
One method is to use arithmetic as taught in elementary school. C provides functions to work with floating-point numbers, such as frexp, which separates the fraction (also called the significand, often mistakenly called a mantissa) and the exponent. Given a floating-point number, you could create a large array to store decimal digits in and then compute the digits. Each bit in the fraction part of a floating-point number represents some power of two, as determined by the exponent in the floating-point number. So you can simply put a “1” in an array of digits and then use elementary school arithmetic to multiply or divide it the required number of times. You can do that for each bit and then add all the results, and the sum is the decimal numeral that equals the floating-point number.
Commercial printf implementations will use more sophisticated algorithms. Discussing them is beyond the scope of a Stack Overflow question-and-answer. The seminal paper on this is Correctly Rounded Binary-Decimal and Decimal-Binary Conversions by David M. Gay. (A copy appears to be available here, but that seems to be hosted by a third party; I am not sure how official or durable it is. A web search may turn up other sources.) A more recent paper with an algorithm for converting a binary floating-point number to decimal with the shortest number of digits needed to uniquely distinguish the value is Printing Floating-Point Numbers: An Always Correct Method by Marc Andrysco, Ranjit Jhala, and Sorin Lerner.
One key to how it is done is that printf will not just use the floating-point format and its operations to do the work. It will use some form of extended-precision arithmetic, either by working with parts of the floating-point number in an integer format with more bits, by separating the floating-point number into pieces and using multiple floating-point numbers to work with it, or by using a floating-point format with more precision.
Note that the first step in your question, multiple x by a power of ten, already has two rounding errors. First, not all powers of ten are exactly representable in binary floating-point, so just producing such a power of ten necessarily has some representation error. Then, multiplying x by another number often produces a mathematical result that is not exactly representable, so it must be rounded to the floating-point format.
Neither the C or C++ standard does not dictate a certain algorithm for such things. Therefore is impossible to answer how printf does this.
If you want to know an example of a printf implementation, you can have a look here: http://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/vfprintf.c and here: http://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/printf_fp.c

Rounding error in dgesv?

I'm using the dgesv and dgemm fortran subroutines in C++ to do some simple matrix multiplication and left division.
For random matrices A and B, I do:
A\(A\(A*B));
where * is defined using dgemm and \ using dgesv. Obviously, this expression should simplify to the identity matrix. I'm testing my answers against MATLAB and I'm getting more or less 1's on the diagonal but the other entries are very slightly off (the numbers are on the order of magnitude e-15, so they're close to 0 already).
I'm just wondering if this result is to be expected or not? Because if I do something like this:
C = A+B;
D = A*B;
D\(C\(C*C));
the result should come out to D\C. Basically, C(C*C) is very accurate (matches MATLAB perfectly), but the second I do D\C I get something that's off by e-1 or even e+00. I'm guessing that's not supposed to happen?
Your problem seems to be related to finite accuracy of floating point variables in C/C++. You can read more about it here. There are some techniques of minimizing that effect (some of them described in the wiki article) but there will always be some loss of accuracy after a few operations. You might want to use some third-party mathematical library that supports numbers of arbitrary precision (e.g. GMP). But still - as long as you stick to numerical approach accuracy of your calculations will always be tainted.

Floats vs rationals in arbitrary precision fractional arithmetic (C/C++)

Since there are two ways of implementing an AP fractional number, one is to emulate the storage and behavior of the double data type, only with more bytes, and the other is to use an existing integer APA implementation for representing a fractional number as a rational i.e. as a pair of integers, numerator and denominator, which of the two ways are more likely to deliver efficient arithmetic in terms of performance? (Memory usage is really of minor concern.)
I'm aware of the existing C/C++ libraries, some of which offer fractional APA with "floats" and other with rationals (none of them features fixed-point APA, however) and of course I could benchmark a library that relies on "float" implementation against one that makes use of rational implementation, but the results would largely depend on implementation details of those particular libraries I would have to choose randomly from the nearly ten available ones. So it's more theoretical pros and cons of the two approaches that I'm interested in (or three if take into consideration fixed-point APA).
The question is what you mean by arbitrary precision that you mention in the title. Does it mean "arbitrary, but pre-determined at compile-time and fixed at run-time"? Or does it mean "infinite, i.e. extendable at run-time to represent any rational number"?
In the former case (precision customizable at compile-time, but fixed afterwards) I'd say that one of the most efficient solutions would actually be fixed-point arithmetic (i.e. none of the two you mentioned).
Firstly, fixed-point arithmetic does not require any dedicated library for basic arithmetic operations. It is just a concept overlaid over integer arithmetic. This means that if you really need a lot of digits after the dot, you can take any big-integer library, multiply all your data, say, by 2^64 and you basically immediately get fixed-point arithmetic with 64 binary digits after the dot (at least as long as arithmetic operations are concerned, with some extra adjustments for multiplication and division). This is typically significantly more efficient than floating-point or rational representations.
Note also that in many practical applications multiplication operations are often accompanied by division operations (as in x = y * a / b) that "compensate" for each other, meaning that often it is unnecessary to perform any adjustments for such multiplications and divisions. This also contributes to efficiency of fixed-point arithmetic.
Secondly, fixed-point arithmetic provides uniform precision across the entire range. This is not true for either floating-point or rational representations, which in some applications could be a significant drawback for the latter two approaches (or a benefit, depending on what you need).
So, again, why are you considering floating-point and rational representations only. Is there something that prevents you from considering fixed-point representation?
Since no one else seemed to mention this, rationals and floats represent different sets of numbers. The value 1/3 can be represented precisely with a rational, but not a float. Even an arbitrary precision float would take infinitely many mantissa bits to represent a repeating decimal like 1/3. This is because a float is effectively like a rational but where the denominator is constrained to be a power of 2. An arbitrary precision rational can represent everything that an arbitrary precision float can and more, because the denominator can be any integer instead of just powers of 2. (That is, unless I've horribly misunderstood how arbitrary precision floats are implemented.)
This is in response to your prompt for theoretical pros and cons.
I know you didn't ask about memory usage, but here's a theoretical comparison in case anyone else is interested. Rationals, as mentioned above, specialize in numbers that can be represented simply in fractional notation, like 1/3 or 492113/203233, and floats specialize in numbers that are simple to represent in scientific notation with powers of 2, like 5*2^45 or 91537*2^203233. The amount of ascii typing needed to represent the numbers in their respective human-readable form is proportional to their memory usage.
Please correct me in the comments if I've gotten any of this wrong.
Either way, you'll need multiplication of arbitrary size integers. This will be the dominant factor in your performance since its complexity is worse than O(n*log(n)). Things like aligning operands, and adding or subtracting large integers is O(n), so we'll neglect those.
For simple addition and subtraction, you need no multiplications for floats* and 3 multiplications for rationals. Floats win hands down.
For multiplication, you need one multiplication for floats and 2 multiplications for rational numbers. Floats have the edge.
Division is a little bit more complex, and rationals might win out here, but it's by no means a certainty. I'd say it's a draw.
So overall, IMHO, the fact that addition is at least O(n*log(n)) for rationals and O(n) for floats clearly gives the win to a floating-point representation.
*It is possible that you might need one multiplication to perform addition if your exponent base and your digit base are different. Otherwise, if you use a power of 2 as your base, then aligning the operands takes a bit shift. If you don't use a power of two, then you may also have to do a multiplication by a single digit, which is also an O(n) operation.
You are effectively asking the question: "I need to participate in a race with my chosen animal. Should I choose a turtle or a snail ?".
The first proposal "emulating double" sounds like staggered precision: using an array of doubles of which the sum is the defined number. There is a paper from Douglas M. Priest "Algorithms for Arbitrary Precision Floating Point Arithmetic" which describes how to implement this arithmetic. I implemented this and my experience is very bad: The necessary overhead to make this run drops the performance 100-1000 times !
The other method of using fractionals has severe disadvantages, too: You need to implement gcd and kgv and unfortunately every prime in your numerator or denominator has a good chance to blow up your numbers and kill your performance.
So from my experience they are the worst choices one can made for performance.
I recommend the use of the MPFR library which is one of the fastest AP packages in C and C++.
Rational numbers don't give arbitrary precision, but rather the exact answer. They are, however, more expensive in terms of storage and certain operations with them become costly and some operations are not allowed at all, e.g. taking square roots, since they do not necessarily yield a rational answer.
Personally, I think in your case AP floats would be more appropriate.

c++ and matlab floating point computation

Could someone tell me if c++ and matlab use the same floating point computation implementations? Will I get the same values in C++ as I would in Matlab?
Currently I have these discrepancies from translating my Matlab code into C++:
Matlab: R = 1.0000000001623, I = -3.07178893432791e-010, C = -3.79693498864242e-011
C++: R = 1.00000000340128 I = -3.96890964537988e-009 Z = 2.66864907949582e-009
If not what is the difference and where can I find more about floating point computation implementations?
Thanks!
Although it's not clear what your numbers actually are, the relative difference of the first (and largest) numbers is about 1e-8, which is the relative tolerance of many double precision algorithms.
Floating point numbers are only an approximation of the real number system, and their finite size (64 bits for double precision) limits their precision. Because of this finite precision, operations that involve floating point numbers can incur round-off error, and are thus not strictly associative. What this means is that A+(B+C) != (A+B)+C. The difference between the two is usually small, depending on their relative sizes, but it's not always zero.
What this means is that you should expect small differences in the relative and absolute values when you compare an algorithm coded in Matlab to one in C++. The difference may be in the libraries (i.e., there's no guarantee that Matlab uses the system math library for routines like sqrt), or it may just be that your C++ and Matlab implementations order their operations differently.
The section on floating point comparison tests in Boost::Test discusses this a bit, and has some good references. In particular, you should probably read What Every Computer Scientist Should Know About Floating-Point Arithmetic and consider picking up a copy of Knuth's TAOCP Vol. II.
Matlab by default uses a double floating point precision, a C float uses a single floating point precision.
The representation for floating points is the same between the two, and is either processor or I believe a standard. But as mentioned, floating points are extremely fickle, you always have to allow for some tolerance. If you do a complex operation, such as the one below, you will frequently get a non-zero number, despite algebra telling you otherwise. The stuff under the hood between how operations are done with matlab and c will allow for some differences. Just make sure they are close.
((3*pi+2)*5-9)/2-7.5*pi-3

Cholesky factorization of semi-definite matrices in C++

I am to implement a function to do a Cholesky factorization of a semidefinite matrix in C++ and am wondering if there is an library/anything out there that is already optimized. It is to work as or similar to what is described in this:
http://www.sciencedirect.com/science/article/pii/S0096300310012713
This is an example for positive definite but it doesn't work for positive semi-definite: http://en.wikipedia.org/wiki/Cholesky_decomposition#The_Cholesky.E2.80.93Banachiewicz_and_Cholesky.E2.80.93Crout_algorithms
The program must be in C++, with no C/FORTRAN libraries, (think pointy hared boss giving instructions) which means ATLAS, LAPACK, ect. are out. I have looked through MTL + Boost but theirs only works for positive definite matrices. Are there any libraries that I haven't found, or even single functions that have been written?
The problem with Cholesky decomposition of semi definite matrices is that 1) it is not unique 2) Crout's algorithm fails.
The existence of a decomposition is usually proven non constructively, via a limiting argument (if M_n -> M, and M_n = U^t_n U_n, then ||U_n|| = ||M_n||^1/2 where ||.|| = Hilbert-Schmidt norm, and U_n is a bounded sequence. Extract a subsequence to find a limit U satisfying U^t U = M and U triangular.)
I have found that in the cases I was interested in, it was satisfactory to multiply the diagonal elements by 1 + epsilon, with epsilon small (take a few thousand times the machine epsilon) to give a perfectly acceptable decomposition.
Indeed, if M is positive semi definite, then for each epsilon > 0, M + epsilon I is definite positive.
As the scheme converges when epsilon goes to zero, you can contemplate computing the decomposition for multiple epsilons and perform a Richardson extrapolation.
As to the positive definite case, you could implement Crout's algorithm yourself (there is a sample code in Numerical Recipes), but I would highly recommend against writing it yourself, and advise using LAPACK instead.
This may involve having your boss pay for Intel MKL if he is concerned by potentially poor implementations of LAPACK. Most of the time I heard such a speech, the rationale was "but we can't control the code, we do want to write it yourself so that we can debug it in case of a problem". Dumb argument. LAPACK is 40 years old and thoroughly tested.
Requiring not to use LAPACK is as silly as requiring not to use the standard library for sine, cosine and logarithms.