How do I improve the accuracy of calculating eigenvalues using GNU GSL?

I am trying to switch from Mathematica to C on a Windows platform. My most difficult task is calculating eigenvalues and eigenvectors for real non-symmetric matrices up to about 50x50. I have succeeded in getting Microsoft Visual Studio to run the GNU GSL function gsl_eigen_nonsymmv. When I test the accuracy by using the results to reconstruct the original matrix (evects*Diagonal(evals)*Inverse(evects)), the accuracy is acceptable for smaller matrices (<20x20), but breaks down compared to Mathematica above 30x30.
I know the process is iterative, but I cannot figure out what the convergence criterion is. I assume the answer lies in the function "nonsymmv_get_right_eigenvectors", but the documentation does not even acknowledge the existence of this function. I was able to access and increase the variable "max_iterations" from the default of 1150 to 3000 with no change. I also blindly tried manipulating the following variables up or down by 5-10 orders of magnitude without ANY change in the results: smin, bignum, beta, xnorm.
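For reference, here is a minimal sketch of the call sequence (error handling omitted). The one accuracy-related knob I have found documented is the balancing option of gsl_eigen_nonsymmv_params, which is off by default and may help for badly scaled matrices; whether it helps here is an open question:

#include <gsl/gsl_eigen.h>
#include <gsl/gsl_matrix.h>

/* Sketch: eigenvalues/eigenvectors of an n-by-n real non-symmetric
   matrix stored row-major in a[].  Balancing is enabled explicitly;
   it is off by default and can improve accuracy when the matrix
   entries vary widely in magnitude. */
void eigen_nonsymm(double *a, size_t n,
                   gsl_vector_complex *eval, gsl_matrix_complex *evec)
{
    gsl_matrix_view m = gsl_matrix_view_array(a, n, n);
    gsl_eigen_nonsymmv_workspace *w = gsl_eigen_nonsymmv_alloc(n);
    gsl_eigen_nonsymmv_params(1, w);              /* 1 = balance first */
    gsl_eigen_nonsymmv(&m.matrix, eval, evec, w); /* note: overwrites m */
    gsl_eigen_nonsymmv_free(w);
}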
Any help or insights would be greatly appreciated. I am a scientist who programs only when needed. Thanks.

Related

C++ How do I solve a very large sparse system of linear equations

I am trying to solve a very large and sparse system of linear equations in C++. Currently, I am using BiCGSTAB from Eigen. It works fine for small matrices, but it is taking just too much time for matrices of the size I need, which is 40804x40804 (it could be even larger in the future).
I have a very long script, but I simply used the following format:
SparseMatrix<double> sj(40804,40804);
VectorXd c_(40804), sf(40804);
sj.reserve(VectorXi::Constant(40804,36)); // a very good estimate of the non-zeros per column
// ...fill in the actual entries of sj
sj.makeCompressed();
BiCGSTAB<SparseMatrix<double> > handler;
// ...update sj, only in the entries that were initialized previously
handler.analyzePattern(sj);
handler.factorize(sj);
c_.setZero();
c_ = handler.solve(sf);
This takes way too long! And yes, the solution does exist. The sparse solver in MATLAB seems to handle this very well, but I need it in C++ in order to connect to a server.
I would really appreciate it if you could help me!
You should consider using one of the advanced sparse direct solvers, for example CHOLMOD.
Sparse direct solvers are a fundamental tool in computational analysis, providing a very general method for obtaining high-quality results for almost any problem. CHOLMOD is a high-performance library for sparse Cholesky factorization.
I guarantee that this package will definitely help you. Moreover, CHOLMOD has supported GPU acceleration since 2012, starting with version 4.0.0. In SuiteSparse-4.3.1, performance has been further improved, providing speedups of 3x or greater vs. the CPU for the sparse factorization operation.
If your matrices are representations of graphs, you can also consider METIS in combination with CHOLMOD: you can partition the graph (domain decomposition) and then solve in parallel with CHOLMOD.
SuiteSparse is a powerful toolkit with support for sparse LU (KLU) and other direct solvers.
Here are the GitHub link, the User Guide, and the SuiteSparse home page.
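If you want to try CHOLMOD without leaving Eigen, here is a minimal sketch using Eigen's CholmodSupport bridge. It assumes SuiteSparse is installed and linked, and that the matrix is symmetric positive definite, which CHOLMOD requires; if yours is not, look at Eigen's SparseLU or the UmfPackLU bridge instead:

#include <stdexcept>
#include <Eigen/Sparse>
#include <Eigen/CholmodSupport>  // bridge to CHOLMOD; needs SuiteSparse
using namespace Eigen;

// Sketch: direct solve of sj * c_ = sf through CHOLMOD.
VectorXd solveWithCholmod(const SparseMatrix<double>& sj, const VectorXd& sf)
{
    CholmodSupernodalLLT<SparseMatrix<double> > solver;
    solver.compute(sj);  // analyze + factorize in one call
    if (solver.info() != Success)
        throw std::runtime_error("CHOLMOD factorization failed");
    return solver.solve(sf);
}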

High order Bessel function computation with large arguments

My work involves the computation of high-order Bessel functions at large argument values. Within MATLAB, this has been done without problems. However, in order to scale up the problem, I have turned to writing C++ code with MPI. Of course, the step that generates the Bessel function is done by invoking a library. To make the problem concrete, let me consider this very specific bug.
In MATLAB, suppose I wish to compute $J_{46341}(86840.0)$;
MATLAB gives me: besselj(46341,86840) = 0.001309896212292
However, a simple test example that calls
gsl_sf_bessel_Jn_e returns "ERROR: NaN",
and I have checked that at order 46340, both MATLAB and GSL return the same answer, 0.00292895, to acceptable accuracy. One more step up in order produces the NaN error in GSL, while MATLAB still returns an accurate numerical answer.
I did try to use recurrence relations to generate higher-order values, starting from a not-so-small order, say 20000 and up; however, this only delays the NaN error without completely solving the problem.
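For illustration, my recurrence attempt looked roughly like this sketch (the seed order is a placeholder; the forward three-term recurrence is only reliable in the oscillatory regime where the order stays below the argument):

#include <gsl/gsl_sf_bessel.h>

/* Sketch: forward recurrence J_{n+1}(x) = (2n/x) J_n(x) - J_{n-1}(x),
   seeded from two orders that GSL can still evaluate.  This is only
   acceptable while n < x; it loses accuracy as the order approaches
   the argument, which is why it merely delays the failure. */
double bessel_j_by_recurrence(int n_target, double x, int n_seed)
{
    double jm1 = gsl_sf_bessel_Jn(n_seed - 1, x);
    double j   = gsl_sf_bessel_Jn(n_seed, x);
    for (int n = n_seed; n < n_target; ++n) {
        double jp1 = (2.0 * n / x) * j - jm1;
        jm1 = j;
        j   = jp1;
    }
    return j;
}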
Switching my attention to other available software libraries, I tried NAG, but to my disappointment,
nag_bessel_j_alpha (s18ekc) has the constraint abs(nl) <= 101;
in other words, it can only compute up to order 101, which is clearly not enough for my study.
So, my question is fairly simple:
Is there a more reliable library approach to obtaining high-order Bessel function values for large x?
Asymptotically, the Bessel function approaches 0, so I can safely set values to zero once the tail approaches the underflow limit. However, the NaN problem seems to occur somewhere between the strongly oscillating region and the asymptotically decaying tail.
Problem solved. Thank you for the community's work; I am really amazed by your knowledge and contributions!
Please see here:
How to call Fortran routines from C++?
https://mathoverflow.net/questions/225121/computation-of-high-order-bessel-function-at-large-variable-value
MATLAB, R, Python, and JuliaLang/openspecfun all build upon the original Fortran source code by Dr. Donald E. Amos (Sandia National Laboratories); the cited papers are:
D. E. Amos, "A subroutine package for Bessel functions of a complex argument and nonnegative order", Sandia National Laboratory Report, SAND85-1018, May 1985.
D. E. Amos, "A portable package for Bessel functions of a complex argument and nonnegative order", ACM Trans. Math. Software, 1986.
This is now known as the Amos Algorithm 644, collected by the ACM:
http://dl.acm.org/citation.cfm?id=212078
http://dl.acm.org/citation.cfm?id=1268783
http://dl.acm.org/citation.cfm?id=98299
However, the source code hosted on Netlib is not bug-free and probably not up to date:
http://netlib.sandia.gov/master/index.html
http://netlib.sandia.gov/amos/
The version adopted by openspecfun, however, is solid:
https://github.com/JuliaLang/openspecfun
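For completeness, here is a minimal sketch of calling the Amos routine ZBESJ directly from C++. It assumes you have compiled the Fortran source (e.g. from openspecfun) and that your Fortran compiler uses the common trailing-underscore naming convention; adjust the binding for your toolchain:

#include <cstdio>
#include <vector>

/* Amos Algorithm 644 routine: computes n values J_{fnu+k}(z), k = 0..n-1,
   for complex z = zr + i*zi.  All arguments are passed by pointer,
   following the Fortran calling convention. */
extern "C" void zbesj_(double* zr, double* zi, double* fnu, int* kode,
                       int* n, double* cyr, double* cyi, int* nz, int* ierr);

int main()
{
    double zr = 86840.0, zi = 0.0;  // argument z = 86840 + 0i
    double fnu = 46341.0;           // starting order
    int kode = 1;                   // 1 = unscaled results
    int n = 1, nz = 0, ierr = 0;
    std::vector<double> cyr(n), cyi(n);

    zbesj_(&zr, &zi, &fnu, &kode, &n, cyr.data(), cyi.data(), &nz, &ierr);
    if (ierr == 0)
        std::printf("J_46341(86840) = %.15g\n", cyr[0]);
    else
        std::printf("ZBESJ error flag ierr = %d\n", ierr);
    return 0;
}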

C++ Armadillo not correctly solving a poorly conditioned matrix

I have a relatively simple question regarding the linear solver built into Armadillo. I am a relative newcomer to C++ but have experience coding in other languages. I am solving a fluid flow problem by successive linearization, using the Armadillo function solve(A,b) to get the solution at each iteration.
The issue I am running into is that my matrix is very ill-conditioned. The determinant is on the order of 10^-20 and the condition number is 75000. I know these are terrible conditions, but it's what I've got. Does anyone know if it is possible to specify the precision of my A matrix and of the solve function as something beyond double (long double perhaps)? I know there are double matrix classes in Armadillo, but I haven't found any documentation for higher levels of precision.
To approach this from another angle, I wrote some code in Mathematica, and LinearSolve worked very well: the program converged to the correct answer. My reasoning is that Mathematica variables have higher precision, which can handle the higher levels of rounding error.
If anyone has any insight on this, please let me know. I know there are other ways to approach a poorly conditioned matrix (like preconditioning and pivoting), but my work is more in the physics than in the actual numerical solution, so I'm trying to steer clear of that.
EDIT: I just limited the precision in the Mathematica version to 15 decimal places and the program still converges. This leads me to believe it is NOT a variable precision question but rather an issue with the method.
As you said, "your work is more in the physics": rather than trying to increase the precision, I would use the Moore-Penrose pseudo-inverse, which in Armadillo can be obtained via the function pinv. You should then experiment a bit with the tolerance parameter to set it to a reasonable level.
The geometric interpretation is as follows: bad condition numbers are due to the row/column vectors being nearly linearly dependent. In physics, such linear dependencies usually have an origin which at least needs to be interpreted. The pseudoinverse first projects the matrix onto a lower-dimensional space in which the vectors are "less linearly dependent", by dropping all singular vectors with singular values smaller than the tolerance parameter. The resulting matrix has a better condition number, so the standard inverse can be constructed with fewer problems.
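A minimal sketch of what I mean (the tolerance value is only illustrative and should be tuned to your problem):

#include <armadillo>
using namespace arma;

// Sketch: solve A*x = b via the Moore-Penrose pseudoinverse, dropping
// singular values below tol.  Here tol is chosen relative to the largest
// singular value; the right level depends on your physics.
vec solve_via_pinv(const mat& A, const vec& b)
{
    double tol = 1e-10 * norm(A, 2);  // norm(A,2) = largest singular value
    mat A_pinv = pinv(A, tol);        // SVD-based pseudoinverse
    return A_pinv * b;
}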

Difference between computation results of MATLAB code and C/C++ code with IPP

I need to increase the computation speed of my MATLAB code. For this purpose I rewrote my program in C using the Intel IPP library for vector operations. And here I ran into a problem:
after some step of the main computation loop, the MATLAB program and my C program take different paths through the algorithm. This happens because the computations are not exactly equal and my program accumulates error relative to the MATLAB results. For this reason, my program does not compute the correct gradient and the whole optimization algorithm does not work well. So I gained computation speed but lost accuracy: at the 100th step MATLAB computes an optimization error of 0.004 while the C program computes 0.05, and this is important in my task.
I checked which functions give me the error, and here is what I found: common operations (like ippsAdd_64f_A53, ippsSub_64f_A53, ippsMul_64f_A53, ippsDiv_64f_A53 and the usual C operators +, -, *, /) produce results equal to MATLAB's, with zero summed error, but the math.h hyperbolic functions give a summed error over an array of 75699 elements of about -3e-13 to -5e-13. The Intel functions ippsCosh_64f_A53 and others give a summed error of about -1e-14 to -5e-14.
Do you know of a library that computes high-precision hyperbolic and exponential functions? Or maybe there are compiler settings in Visual Studio 2012 which could help me?
All computations are done in the Ipp64f data type (double) in VS 2012 with Intel Parallel Studio XE 2013 installed.
P.S.: The summed error was computed in MATLAB. I saved the arrays from my C program to a level 4 MAT-file and imported them into MATLAB, where I summed the difference between the MATLAB array and the imported array, like sum(M_cosh - C_cosh);
Not an answer, more of an extended comment:
You write
I need to increase the computation speed of my MATLAB code
and ask
Do you know of a library that computes high-precision hyperbolic and exponential functions?
Yes, I know of several such libraries, but they implement floating-point numbers with more bits than are typically provided on current CPUs (mainly 32- and 64-bit) and they implement, in software, arithmetic on these numbers. For your purpose of increasing computation speed, such libraries are useless: their increased precision is explicitly bought at the cost of increased execution time. For many other users that's a reasonable trade-off.
I don't know of any widely-used or well-regarded libraries which implement precision-preserving algorithms on machine numbers. There isn't space here to go into any detail, but for an introduction to the problem you could do worse than start reading about Kahan's summation algorithm.
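To give the flavour, here is a minimal sketch of Kahan's compensated summation (note that aggressive optimizations such as fast-math may remove the correction, so compile it with standard floating-point semantics):

#include <cstddef>

// Kahan (compensated) summation: carries a running correction term so the
// accumulated rounding error stays O(eps) instead of growing with the
// number of terms.
double kahan_sum(const double* x, std::size_t n)
{
    double sum = 0.0;
    double c = 0.0;              // compensation for lost low-order bits
    for (std::size_t i = 0; i < n; ++i) {
        double y = x[i] - c;     // apply the correction to the next term
        double t = sum + y;      // low-order digits of y are lost here...
        c = (t - sum) - y;       // ...and recovered here
        sum = t;
    }
    return sum;
}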
The MathWorks are somewhat coy about revealing what algorithms MATLAB implements. However, most of the computational kernels of MATLAB are written in C (or C++, I believe) and compiled into libraries. Many of them are now multi-threaded too. If you are trying to write code to outperform MATLAB, you will have to write multi-threaded, high-performance numerical code.
It wouldn't surprise me at all to learn that the algorithms MATLAB implements do have precision-preserving capabilities. The MathWorks are, after all, trying to offer the market a tool which will solve a wide range of problems without the user having to consider low-level issues such as whether or not machine precision is good enough for a particular combination of problem and dataset.
Finally: it doesn't surprise me that your first attempts were unsuccessful, though beating MATLAB for speed is impressive. I look forward, sceptically, to being pleasantly surprised when you report success: a code of your own which outperforms MATLAB in time and produces satisfactory results.

Large Matrix Inversion

I am looking at taking the inverse of a large matrix, commonly of size 1000x1000, but sometimes exceeding 100000x100000 (which currently fails due to time and memory constraints). I know the normal sentiment is 'don't take the inverse, find some other way to do it', but that is not possible at the moment. The reason is that we use existing software that expects to receive the matrix inverse. (Note: I am looking into ways of changing this, but that will take a long time.)
At the moment we are using an LU decomposition method from Numerical Recipes, and I am currently testing the Eigen library. The Eigen library seems to be more stable and a bit faster, but I am still in the testing phase for accuracy. I have taken a quick look at other libraries such as ATLAS and LAPACK but have not done any substantial testing with these yet.
It seems the Eigen library does not use concurrent methods to compute the inverse (though it does for the LU factorization part of the inverse), and as far as I can tell ATLAS and LAPACK share this limitation. (I am currently testing the speed difference for Eigen with OpenMP and without.)
The first question is: can anyone explain how it is possible to optimize matrix inversion by parallelization? I found an article here that talks about parallel algorithms for matrix inversion, but I did not understand it; it seems the article talks about another method? I am also not sure whether ScaLAPACK or PETSc would be useful.
Second question: I read this article on using GPUs to increase performance, but I have never coded for GPUs, so I have no idea what it is trying to convey, though the charts at the bottom looked rather alarming. How is this even possible, and where do I start to go about implementing something like this, if it is true?
I also found this article, which I have not yet had time to read through and understand, but it seems promising, as memory is a current issue with our software.
Any information about these articles or the problems in general would be of great help. And again, I apologize if this question seems vague; I will try to expand more if necessary.
The first question is: can anyone explain how it is possible to optimize matrix inversion by parallelization?
I'd hazard a guess that this, and related topics in linear algebra, are among the most studied topics in parallel computing. If you're stuck looking for somewhere to start reading, well, good old Golub and Van Loan have a chapter on the topic. As to whether ScaLAPACK and PETSc are likely to be useful: certainly the former, probably the latter. Of course, they both depend on MPI, but that's kind of taken for granted in this field.
Second question ...
Use GPUs if you've got them and if you can afford to translate your code into the programming model supported by your GPUs. If you've never coded for GPUs and have access to a cluster of commodity CPUs, you'll get up to speed quicker by using the cluster than by wrestling with a novel technology.
As for the last article you refer to, it's now 10 years old in a field that changes very quickly (try finding a 10-year-old research paper on using GPUs for matrix inversion). I can't comment on its excellence or other attributes, but the problem sizes you mention seem to me to be well within the capabilities of modern clusters for in-core (to use an old term) computation. If your matrices are very big, are they also sparse?
Finally, I strongly support your apparent intention to use existing off-the-shelf code rather than trying to develop your own.
100000 x 100000 is 80 GB at double precision. You need a library that supports memory-mapped matrices on disk. I can't recommend a particular library, and I didn't find anything with quick Google searches. But code from Numerical Recipes certainly isn't going to be adequate.
Regarding the first question (how to parallelize computing the inverse):
I assume you are computing the inverse by doing an LU decomposition of your matrix and then using the decomposition to solve A*B = I, where A is your original matrix, B is the matrix you solve for, and I is the identity matrix. Then B is the inverse.
The last step is easy to parallelize. Divide your identity matrix along the columns: if you have p CPUs and your matrix is n-by-n, then every part has n/p columns and n rows. Let's call the parts I1, I2, etc. On every CPU, solve a system of the form A*B1 = I1; this gives you the parts B1, B2, etc., and you can combine them to form B, which is the inverse.
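Here is a sketch of that scheme using Eigen and OpenMP (illustrative only; a production code would use a threaded LAPACK or ScaLAPACK). The factorization is computed once and only read during the parallel solves, and each thread writes a disjoint block of columns of B:

#include <Eigen/Dense>
using namespace Eigen;

// Sketch: factor A once, then solve A*Bj = Ij for column blocks of the
// identity in parallel; the blocks are combined in-place into B.
MatrixXd invert_by_column_blocks(const MatrixXd& A, int nblocks)
{
    const int n = static_cast<int>(A.rows());
    PartialPivLU<MatrixXd> lu(A);  // one LU factorization, reused read-only
    MatrixXd B(n, n);

    #pragma omp parallel for
    for (int j = 0; j < nblocks; ++j) {
        int c0 = j * n / nblocks;             // first column of this block
        int nc = (j + 1) * n / nblocks - c0;  // number of columns in it
        MatrixXd Ij = MatrixXd::Zero(n, nc);  // the matching identity columns
        for (int k = 0; k < nc; ++k)
            Ij(c0 + k, k) = 1.0;
        B.middleCols(c0, nc) = lu.solve(Ij);  // triangular solves only
    }
    return B;
}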
An LU decomposition on a GPU can be ~10x faster than on a CPU. Although this is now changing, GPUs have traditionally been designed around single-precision arithmetic, so on older hardware single precision is generally much faster than double precision. Also, storage requirements and performance will be greatly affected by the structure of your matrices. A sparse 100,000 x 100,000 LU decomposition is a reasonable problem to solve and will not require much memory.
Unless you want to become a specialist and spend a lot of time tuning for hardware updates, I would strongly recommend using a commercial library. I would suggest CULA tools. They have both sparse and dense GPU libraries, and in fact their free library offers SGETRF, a single-precision (dense) LU decomposition routine. You'll have to pay for their double-precision libraries.
I know it's an old post, but really, OpenCL (download the version relevant to your graphics card) + OpenMP + vectorization (not in that order) is the way to go.
Anyhow, in my experience the cost of matrix work is really dominated by the overhead of copying double arrays into and out of the system, and by padding or initializing matrices with 0s before any computation begins, especially when I am building .xll add-ins for Excel.
If I were to reprioritize, from the top:
Try to vectorize the code. Visual Studio 2012 and Intel C++ have auto-vectorization (I'm not sure about MinGW or GCC, but I think there are compiler flags that make the compiler analyse your for loops and generate assembly that populates your processor's vector registers instead of the ordinary registers). I think Excel does this, because when I monitored Excel's threads while running its MINVERSE(), I noticed only one thread was used.
I don't know much assembly language, so I don't know how to vectorize manually (haven't had time to learn this yet, but I really want to!).
Parallelize with OpenMP (omp pragmas), MPI, or a threading library (parallel_for). Very simple, but here's the catch: I realise that if your matrix class is completely single-threaded in the first place, then parallelizing an operation like matrix multiply or inverse may be pointless, because the speedup gets eaten by initializing, copying to, or simply accessing the non-parallelized matrix class.
But where parallelization does help is this: if you're designing your own matrix class and you parallelize its constructor operations (padding with 0s etc.), then your LU-based computation of the inverse (solving A*X = I) will also be faster.
It's also mathematically straightforward to optimize your LU decomposition, and to optimize your forward/backward substitution for the special case of an identity right-hand side: don't waste time creating an identity matrix; exploit the fact that column j of I is 1 in row j and 0 everywhere else (see the sketch at the end of this answer).
Once it's been parallelized (on the outer layers), the element-by-element matrix operations can be mapped onto a GPU's hundreds of processors. Beat that! There is sample Monte Carlo code available on ATI's website using ATI's OpenCL; don't worry about porting the code to run on a GeForce card, since with OpenCL all you have to do is recompile there.
For 2 and 3, though, remember that overheads are incurred, so there's no point unless you're handling F*K*G HUGE matrices - but I see 100k x 100k? Wow...
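Here's a little sketch of what I meant about exploiting the identity right-hand side in the substitution step (unit lower-triangular L, dense storage; the names are mine, not from any library). Rows above j are skipped because they are exactly zero:

#include <vector>

// Sketch: solve L*y = e_j for unit lower-triangular L without ever forming
// the identity matrix.  Entries y[0..j-1] are exactly zero, so both loops
// can start at j.
std::vector<double> forward_subst_identity_col(
    const std::vector<std::vector<double> >& L, int j)
{
    const int n = static_cast<int>(L.size());
    std::vector<double> y(n, 0.0);
    y[j] = 1.0;                           // e_j first touches row j
    for (int i = j + 1; i < n; ++i) {     // rows above j stay zero
        double s = 0.0;
        for (int k = j; k < i; ++k)       // columns < j contribute nothing
            s += L[i][k] * y[k];
        y[i] = -s;                        // unit diagonal; e_j[i] = 0 for i > j
    }
    return y;
}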
Gene