C vs Fortran for BLAS 2 - fortran

I have an application in which I need to carry out a lot of Norms, Dot Products and most importantly, Matrix Vector multiplications.
matrix and vectors are huge. Matrix dimension is tending to be a 100000x100000
the loop structure is:
while(condition)
/* usually iterations=dimension of matrix, so around 1 million iterations are *at least* required (if not more) */
matrix-vector multiplication
3 dot prods
2 norms
I am currently using Intel Fortran with Intel MKL. Will rewriting my codes in Intel C with Intel MKL help any?
Has anyone ever carried out a benchmark of any kind (for DGEMV especially)?
Rewriting codes is a major pain but I would not mind rewriting iff I see a reason to.
EDIT: I misspoke: The matrix dimensions are 100000 not a million. Pretty serious error :|
And yes, the matrix is dense and it needs to be dense.
Moreover, it is not symmetric and not even positive definite.
My algorithm is a modified version of QMR.

The performance will be completely identical in either C or Fortran, as the actual implementation backing the library calls are the same, and essentially all of the time in your code is spent in those library calls.

Related

batch CUDA solution of sparse banded Ax=b for various b's

I have a sparse banded matrix A and I'd like to (direct) solve Ax=b. I have about 500 vectors b, so I'd like to solve for the corresponding 500 x's.
I'm brand new to CUDA, so I'm a little confused as to what options I have available.
cuSOLVER has a batch direct solver cuSolverSP for sparse A_i x_i = b_i using QR here. (I'd be fine with LU too since A is decently conditioned.) However, as far as I can tell, I can't exploit the fact that all my A_i's are the same.
Would an alternative option be to first determine a sparse LU (QR) factorization on the CPU or GPU then perform in parallel the backsubstitution (respectively, backsub and matrix mult) on the GPU? If cusolverSp< t >csrlsvlu() is for one b_i, is there a standard way to batch perform this operation for multiple b_i's?
Finally, since I don't have intuition for this, should I expect a speedup on a GPU for either of these options, given the necessary overhead? x has length ~10000-100000. Thanks.
I'm currently working on something similar myself. I decided to basically wrap the conjugate gradient and level-0 incomplete cholesky preconditioned conjugate gradient solvers utility samples that came with the CUDA SDK into a small class.
You can find them in your CUDA_HOME directory under the path:
samples/7_CUDALibraries/conjugateGradient and /Developer/NVIDIA/CUDA-samples/7_CUDALibraries/conjugateGradientPrecond
Basically, you would load the matrix into the device memory once (and for ICCG, compute the corresponding conditioner / matrix analysis) then call the solve kernel with different b vectors.
I don't know what you anticipate your matrix band structure to look like, but if it is symmetric and either diagonally dominant (off diagonal bands along each row and column are opposite sign of diagonal and their sum is less than the diagonal entry) or positive definite (no eigenvectors with an eigenvalue of 0.) then CG and ICCG should be useful. Alternately, the various multigrid algorithms are another option if you are willing to go through coding them up.
If your matrix is only positive semi-definite (e.g. has at least one eigenvector with an eigenvalue of zero) you can still ofteb get away with using CG or ICCG as long as you ensure that:
1) The right hand side (b vectors) are orthogonal to the null space (null space meaning eigenvectors with an eigenvalue of zero).
2) The solution you obtain is orthogonal to the null space.
It is interesting to note that if you do have a non-trivial null space, then different numeric solvers can give you different answers for the same exact system. The solutions will end up differing by a linear combination of the null space... That problem has caused me many many man hours of debugging and frustration before I finally caught on, so its good to be aware of it.
Lastly, if your matrix has a Circulant Band structure you might consider using a fast fourier transform (FFT) based solver. FFT based numerical solvers can often yield superior performance in cases where they are applicable.
is there a standard way to batch perform this operation for multiple b_i's?
One option is to use the batched refactorization module in CUDA's cuSOLVER, but I am not sure if it is standard.
Batched refactorization module in cuSOLVER provides an efficient method to solve batches of linear systems with fixed left-hand side sparse matrix (or matrices with fixed sparsity pattern but varying coefficients) and varying right-hand sides, based on LU decomposition. Only some partially completed code snippets can be found in the official documentation (as of CUDA 10.1) that are related to it. A complete example can be found here.
If you don't mind going with an open-source library, you could also check out CUSP:
CUSP Quick Start Page
It has a fairly decent suite of solvers, including a few preconditioned methods:
CUSP Preconditioner Examples
The smoothed aggregation preconditioner (a variant of algebraic multigrid) seems to work very well as long as your GPU has enough onboard memory for it.

Difference between computation results of MATLAB code and C(C++) with IPP code

I need to increase computation speed of MATLAB code. For this purpose I rewrite my program on C language with Intel IPP library for operations with vectors. And here I got a problem:
after some step main computation circle program in MATLAB and my C program go to different pathes of algorythm. It is happened because computations not absolutely equal and my program accumulate error in compare with MATLAB computations results. For this reason, my program doesn't compute correct gradient and the whole optimization algorythm doesn't count well. So I got a computation speed increase, but lost computation efficiency - when on 100th step MATLAB compute optimization error on 0.004, C program compute on 0.05 and this is important in my task.
I checked what function give me error, and what I found: common operations (like ippsAdd_64f_A53, ippsSub_64f_A53, ippsMul_f64_A53, ippsDiv_64f_A53 and usual C operations ,-,*,/) make equal to MATLAB results and sum error is zero, but math.h hyperbolic functions give a sum error on array with 75699 elements about -3..-5e-13. Intel functions ippsCosh_64f_A53 and others give a sum error about -1..-5e-14.
Do you know a library to compute high precision hyperbolic and exponent functions? Or maybe there are some compilator settings in Visual Studio 2012, which can help me?
All computations made in Ipp64f data type (double) in VS 2012 with installed Intel Parallel Studio XE 2013.
P.S.: Sum error was computed in MATLAB. I saved arrays from my C program to level 4 mat file and then imported in MATLAB where I summed difference between MATLAB array and imported array like sum(M_cosh - C_cosh);
Not an answer, more of an extended comment:
You write
I need to increase computation speed of MATLAB code
and ask
Do you know a library to compute high precision trigonometric and
exponent functions?
Yes, I know of several such libraries, but they implement floating-point numbers with more bits than are typically-provided on current CPUs (mainly 32- and 64-bit) and which implement, in software, arithmetic on these numbers. For your purpose of increasing computation speed, such libraries are useless, their increased precision is explicitly bought at the cost of increased execution time. For many other users that's a reasonable trade off.
I don't know of any widely-used or well-regarded libraries which implement precision-preserving algorithms on machine-numbers. There isn't space here to go into any detail, but for an introduction to the problem you could do worse than start reading about Kahan's summation algorithm.
The Mathworks are somewhat coy about revealing what algorithms Matlab implements. However most of the computational kernels of Matlab are written in C (or C++, I believe) and compiled into libraries. Many of them are now multi-threaded too. If you are trying to write code to outperform Matlab you will have to write multi-threaded, high-performance numerical code.
It wouldn't surprise me at all to learn that the algorithms that Matlab implements do have precision-preserving capabilities. The Mathworks are, after all, trying to offer the market a tool which will solve a wide range of problems without the user having to consider low-level issues such as whether or not machine-precision is good enough for a particular combination of problem and dataset.
Finally. It doesn't surprise me that your first attempts were unsuccessful, though beating Matlab for speed is impressive. And I look forward, sceptically, to being pleasantly surprised when you report success, a code of your own which outperforms Matlab in time and produces satisfactory results.

Large Matrix Inversion

I am looking at taking the inverse of a large matrix, common size of 1000 x 1000, but sometimes exceeds 100000 x 100000 (which is currently failing due to time and memory). I know that the normal sentiment is 'don't take the inverse, find some other way to do it', but that is not possible at the moment. The reason for this is due to the usage of software that is already made that expects to get the matrix inverse. (Note: I am looking into ways of changing this, but that will take a long time)
At the moment we are using an LU decomposition method from numerical recopies, and I am currently in the process of testing the eigen library. The eigen library seems to be more stable and a bit faster, but I am still in testing phase for accuracy. I have taken a quick look at other libraries such as ATLAS and LAPACK but have not done any substantial testing with these yet.
It seems as though the eigen library does not use concurrent methods to compute the inverse (though does for LU factorization part of the inverse) and as far as I can tell ATLAS and LAPACK are similar in this limitation. (I am currently testing the speed difference for eigen with openMP and without.)
First question is can anyone explain how it would be possible to optimize matrix inversion by parallelization. I found an article here that talks about matrix inversion parallel algorithms, but I did not understand. It seems this article talks about another method? I am also not sure if scaLAPACK or PETSc are useful?
Second question, I read this article of using the GPUs to increase performance, but I have never coded for GPUs and so have no idea what is trying to convey, but the charts at the bottom looked rather alarming. How is this even possible, and how where do I start to go about implementing something like this if it is to be true.
I also found this article, have yet had the time to read through it to understand, but it seems promising, as memory is a current issue with our software.
Any information about these articles or the problems in general would be of great help. And again I apologize if this question seems vague, I will try to expand more if necessary.
First question is can anyone explain how it would be possible to optimize matrix inversion by parallelization.
I'd hazard a guess that this, and related topics in linear algebra, is one of the most studied topics in parallel computing. If you're stuck looking for somewhere to start reading, well good old Golub and Van Loan have a chapter on the topic. As to whether Scalapack and Petsc are likely to be useful, certainly the former, probably the latter. Of course, they both depend on MPI but that's kind of taken for granted in this field.
Second question ...
Use GPUs if you've got them and you can afford to translate your code into the programming model supported by your GPUs. If you've never coded for GPUs and have access to a cluster of commodity-type CPUs you'll get up to speed quicker by using the cluster than by wrestling with a novel technology.
As for the last article you refer to, it's now 10 years old in a field that changes very quickly (try finding a 10-year old research paper on using GPUs for matrix inversion). I can't comment on its excellence or other attributes, but the problem sizes you mention seem to me to be well within the capabilities of modern clusters for in-core (to use an old term) computation. If your matrices are very big, are they also sparse ?
Finally, I strongly support your apparent intention to use existing off-the-shelf codes rather than to try to develop your own.
100000 x 100000 is 80GB at double precision. You need a library that supports memory-mapped matrices on disk. I can't recommend a particular library and I didn't find anything with quick Google searches. But code from Numerical Recipes certainly isn't going to be adequate.
Regarding the first question (how to parallellize computing the inverse):
I assume you are computing the inverse by doing an LU decomposition of your matrix and then using the decomposition to solve A*B = I where A is your original matrix, B is the matrix you solve for, and I is the identity matrix. Then B is the inverse.
The last step is easy to parallellize. Divide your identity matrix along the columns. If you have p CPUs and your matrix is n-by-n, then every part has n/p columns and n rows. Lets call the parts I1, I2, etc. On every CPU, solve a system of the form A*B1 = I1, this gives you the parts B1, B2, etc., and you can combine them to form B which is the inverse.
An LU decomp on a GPU can be ~10x faster than on a CPU. Although this is now changing, GPU's have traditionally been designed around single precision arithmetic, and so on older hardware single precision arithmetic is generally much faster than double precision arithmetic. Also, storage requirements and performance will be greatly impacted by the structure of your matrices. A sparse 100,000 x 100,000 matrix LU decomp is a reasonable problem to solve and will not require much memory.
Unless you want to become a specialist and spend a lot of time tuning for hardware updates, I would strongly recommend using a commercial library. I would suggest CULA tools. They have both sparse and dense GPU libraries and in fact their free library offers SGETRF - a single precision (dense) LU decomp routine. You'll have to pay for their double precision libraries.
I know it's old post - but really - OpenCL (you download the relevant one based on your graphics card) + OpenMP + Vectorization (not in that order) is the way to go.
Anyhow - for me my experience with matrix anything is really to do with overheads from copying double double arrays in and out the system and also to pad up or initialize matrices with 0s before any commencement of computation - especially when I am working with creating .xll for Excel usage.
If I were to reprioritize the top -
try to vectorize the code (Visual Studio 2012 and Intel C++ has autovectorization - I'm not sure about MinGW or GCC, but I think there are flags for the compiler to analyse your for loops to generate the right assembly codes to use instead of the normal registers to hold your data, to populate your processor's vector registers. I think Excel is doing that because when I monitored Excel's threads while running their MINVERSE(), I notice only 1 thread is used.
I don't know much assembly language - so I don't know how to vectorize manually... (haven't had time to go learn this yet but sooooo wanna do it!)
Parallelize with OpenMP (omp pragma) or MPI or pthreads library (parallel_for) - very simple - but... here's the catch - I realise that if your matrix class is completely single threaded in the first place - then parallelizing the operation like mat multiply or inverse is scrappable - cuz parallelizing will deteriorate the speed due to initializing or copying to or just accessing the non-parallelized matrix class.
But... where parallelization helps is - if you're designing your own matrix class and you parallelize its constructor operation (padding with 0s etc), then your computation of LU(A^-1) = I will also be faster.
It's also mathematically straightforward to also optimize your LU decomposition, and also optimizing ur forward backward substitution for the special case of identity. (I.e. don't waste time creating any identity matrix - analyse where your for (row = col) and evaluate to be a function with 1 and the rest with 0.)
Once it's been parallelized (on the outer layers) - the matrix operations requiring element by element can be mapped to be computed by GPU(SSSSSS) - hundreds of processors to compute elements - beat that!. There is now sample Monte Carlo code available on ATI's website - using ATI's OpenCL - don't worry about porting code to something that uses GeForce - all u gotta do is recompile there.
For 2 and 3 though - remember that overheads are incurred so no point unless you're handling F*K*G HUGE matrices - but I see 100k^2? wow...
Gene

Solving normal equation system in C++

I would like to solve the system of linear equations:
Ax = b
A is a n x m matrix (not square), b and x are both n x 1 vectors. Where A and b are known, n is from the order of 50-100 and m is about 2 (in other words, A could be maximum [100x2]).
I know the solution of x: $x = \inv(A^T A) A^T b$
I found few ways to solve it: uBLAS (Boost), Lapack, Eigen and etc. but i dont know how fast are the CPU computation time of 'x' using those packages. I also don't know if this numerically a fast why to solve 'x'
What is for my important is that the CPU computation time would be short as possible and good documentation since i am newbie.
After solving the normal equation Ax = b i would like to improve my approximation using regressive and maybe later applying Kalman Filter.
My question is which C++ library is the robuster and faster for the needs i describe above?
This is a least squares solution, because you have more unknowns than equations. If m is indeed equal to 2, that tells me that a simple linear least squares will be sufficient for you. The formulas can be written out in closed form. You don't need a library.
If m is in single digits, I'd still say that you can easily solve this using A(transpose)*A*X = A(transpose)*b. A simple LU decomposition to solve for the coefficients would be sufficient. It should be a much more straightforward problem than you're making it out to be.
uBlas is not optimized unless you use it with optimized BLAS bindings.
The following are optimized for multi-threading and SIMD:
Intel MKL. FORTRAN library with C interface. Not free but very good.
Eigen. True C++ library. Free and open source. Easy to use and good.
Atlas. FORTRAN and C. Free and open source. Not Windows friendly, but otherwise good.
Btw, I don't know exactly what are you doing, but as a rule normal equations are not a proper way to do linear regression. Unless your matrix is well conditioned, QR or SVD should be preferred.
If liscencing is not a problem, you might try the gnu scientific library
http://www.gnu.org/software/gsl/
It comes with a blas library that you can swap for an optimised library if you need to later (for example the intel, ATLAS, or ACML (AMD chip) library.
If you have access to MATLAB, I would recommend using its C libraries.
If you really need to specialize, you can approximate matrix inversion (to arbitrary precision) using the Skilling method. It uses order (N^2) operations only (rather than the order N^3 of usual matrix inversion - LU decomposition etc).
Its described in the thesis of Gibbs linked to here (around page 27):
http://www.inference.phy.cam.ac.uk/mng10/GP/thesis.ps.gz

Large matrix inversion methods

Hi I've been doing some research about matrix inversion (linear algebra) and I wanted to use C++ template programming for the algorithm , what i found out is that there are number of methods like: Gauss-Jordan Elimination or LU Decomposition and I found the function LU_factorize (c++ boost library)
I want to know if there are other methods , which one is better (advantages/disadvantages) , from a perspective of programmers or mathematicians ?
If there are no other faster methods is there already a (matrix) inversion function in the boost library ? , because i've searched alot and didn't find any.
As you mention, the standard approach is to perform a LU factorization and then solve for the identity. This can be implemented using the LAPACK library, for example, with dgetrf (factor) and dgetri (compute inverse). Most other linear algebra libraries have roughly equivalent functions.
There are some slower methods that degrade more gracefully when the matrix is singular or nearly singular, and are used for that reason. For example, the Moore-Penrose pseudoinverse is equal to the inverse if the matrix is invertible, and often useful even if the matrix is not invertible; it can be calculated using a Singular Value Decomposition.
I'd suggest you to take a look at Eigen source code.
Please Google or Wikipedia for the buzzwords below.
First, make sure you really want the inverse. Solving a system does not require inverting a matrix. Matrix inversion can be performed by solving n systems, with unit basis vectors as right hand sides. So I'll focus on solving systems, because it is usually what you want.
It depends on what "large" means. Methods based on decomposition must generally store the entire matrix. Once you have decomposed the matrix, you can solve for multiple right hand sides at once (and thus invert the matrix easily). I won't discuss here factorization methods, as you're likely to know them already.
Please note that when a matrix is large, its condition number is very likely to be close to zero, which means that the matrix is "numerically non-invertible". Remedy: Preconditionning. Check wikipedia for this. The article is well written.
If the matrix is large, you don't want to store it. If it has a lot of zeros, it is a sparse matrix. Either it has structure (eg. band diagonal, block matrix, ...), and you have specialized methods for solving systems involving such matrices, or it has not.
When you're faced with a sparse matrix with no obvious structure, or with a matrix you don't want to store, you must use iterative methods. They only involve matrix-vector multiplications, which don't require a particular form of storage: you can compute the coefficients when you need them, or store non-zero coefficients the way you want, etc.
The methods are:
For symmetric definite positive matrices: conjugate gradient method. In short, solving Ax = b amounts to minimize 1/2 x^T A x - x^T b.
Biconjugate gradient method for general matrices. Unstable though.
Minimum residual methods, or best, GMRES. Please check the wikipedia articles for details. You may want to experiment with the number of iterations before restarting the algorithm.
And finally, you can perform some sort of factorization with sparse matrices, with specially designed algorithms to minimize the number of non-zero elements to store.
depending on the how large the matrix actually is, you probably need to keep only a small subset of the columns in memory at any given time. This might require overriding the low-level write and read operations to the matrix elements, which i'm not sure if Eigen, an otherwise pretty decent library, will allow you to.
For These very narrow cases where the matrix is really big, There is StlXXL library designed for memory access to arrays that are mostly stored in disk
EDIT To be more precise, if you have a matrix that does not fix in the available RAM, the preferred approach is to do blockwise inversion. The matrix is split recursively until each matrix does fit in RAM (this is a tuning parameter of the algorithm of course). The tricky part here is to avoid starving the CPU of matrices to invert while they are pulled in and out of disk. This might require to investigate in appropiate parallel filesystems, since even with StlXXL, this is likely to be the main bottleneck. Although, let me repeat the mantra; Premature optimization is the root of all programming evil. This evil can only be banished with the cleansing ritual of Coding, Execute and Profile
You might want to use a C++ wrapper around LAPACK. The LAPACK is very mature code: well-tested, optimized, etc.
One such wrapper is the Intel Math Kernel Library.