Transpose in BLAS or do it myself first?

I'm putting together some scientific code in Fortran 77, and I am having a debate on what would be faster.
Basically, I have an MxN matrix, let's call it A. M is larger than N. Later on in the code, I need to multiply transpose(A) by a bunch of vectors.
My question is, would it be faster to take A, transpose it on my own and store that, or when I call BLAS, just give it the transpose flag?
Thanks!
-Patrick

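My gut feeling tells me to use the transpose flag.
Fortran stores matrices column by column, so with the flag each entry of transpose(A)*x is a dot product of one contiguous column of A with x. In that case you are doing lots of dot products with a stride of one.
In reality, it's very hard to tell without actually running the code.
Modern BLAS implementations employ cache-blocking techniques, which make simple analysis difficult at best.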
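To make the stride-one argument concrete, here is a minimal sketch in Python (chosen only for illustration, not the asker's Fortran). It assumes A is held column-major in a flat list, as Fortran would store it; the function name `atx_colmajor` is hypothetical. In BLAS itself the same effect comes from calling DGEMV with TRANS='T'.

```python
def atx_colmajor(a, m, n, x):
    """Return y = A^T x for an m-by-n matrix A stored column-major in the flat list a.

    Element A(i, j) lives at a[i + j*m], so column j is the contiguous
    slice a[j*m : (j+1)*m]. Each y[j] is the dot product of that slice
    with x, a stride-1 pass over memory. No transposed copy is built.
    """
    if len(a) != m * n or len(x) != m:
        raise ValueError("dimension mismatch")
    y = []
    for j in range(n):
        col = a[j * m:(j + 1) * m]  # contiguous column j of A
        y.append(sum(c * xi for c, xi in zip(col, x)))
    return y


# A = [[1, 2], [3, 4], [5, 6]] (3x2), stored column by column.
a = [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
y = atx_colmajor(a, 3, 2, [1.0, 1.0, 1.0])
```

Whether this beats multiplying by a stored transpose on a real machine still has to be measured, as the answer says.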

Related

What is the fastest way/library to calculate the rank of a matrix in C++

What library computes the rank of a matrix the fastest? Or, is there any code out in the open that does this fairly rapidly?
I am using Eigen3 and it seems to be slower than Python's numpy rank function. I just need this one function to be fast, absolutely nothing else matters. If you suggest a package everything but this is irrelevant, including ease of use.
The matrices I am looking at tend to be n by ( n choose 3) in size, the entries are 1 or 0....mostly 0's.
Thanks.
Edit 1: the rank is over R.

In general, BLAS/LAPACK functions are frighteningly fast. This link suggests using the GESVD or GESDD functions to compute singular values. The number of non-zero singular values will be the matrix's rank; in floating point, "non-zero" means larger than a small tolerance.
LAPACK is what numpy uses.
In short, you can use the same LAPACK library calls. It will be difficult to outperform BLAS/LAPACK functions, unless sparsity and special structure allow more efficient approaches. If that's true, you may want to check around for alternative libraries implementing sparse SVD solvers.
Note also there are multiple BLAS/LAPACK implementations.
Update
This post seems to argue that LU decomposition is unreliable for calculating rank. Better to do SVD. You may want to see how fast that Eigen call is before going through all the hassle of using BLAS/LAPACK (I've just never used Eigen).
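Counting singular values in floating point needs a cutoff. The sketch below assumes the singular values have already been computed (for example by GESDD) and only does the counting; the function name is hypothetical. The default tolerance, largest singular value times max(m, n) times machine epsilon, follows the default documented for NumPy's `matrix_rank`.

```python
import sys


def rank_from_singular_values(s, m, n, tol=None):
    """Count the singular values of an m-by-n matrix that exceed a tolerance.

    s   : iterable of singular values (non-negative floats)
    tol : cutoff; defaults to max(s) * max(m, n) * machine epsilon
    """
    s = list(s)
    if not s:
        return 0
    if tol is None:
        tol = max(s) * max(m, n) * sys.float_info.epsilon
    return sum(1 for v in s if v > tol)


# A 3x3 matrix whose third singular value is pure rounding noise.
r = rank_from_singular_values([3.0, 1.0, 1e-17], 3, 3)
```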

What is a fast simple solver for a large Laplacian matrix?

I need to solve some large (N~1e6) Laplacian matrices that arise in the study of resistor networks. The rest of the network analysis is being handled with boost graph and I would like to stay in C++ if possible. I know there are lots and lots of C++ matrix libraries, but none seems to be a clear leader in speed or usability. Also, the many questions on the subject, here and elsewhere, seem to rapidly devolve into laundry lists, which are of limited utility. In an attempt to help myself and others, I will try to keep the question concise and answerable:
What is the best library that can effectively handle the following requirements?
Matrix type: Symmetric Diagonally Dominant/Laplacian
Size: Very large (N~1e6), no dynamic resizing needed
Sparsity: Extreme (maximum 5 nonzero terms per row/column)
Operations needed: Solve for x in A*x=b and mat/vec multiply
Language: C++ (C ok)
Priority: Speed and simplicity to code. I would really rather avoid having to learn a whole new framework for this one problem or have to manually write too much helper code.
Extra love to answers with a minimal working example...

If you want to write your own solver, in terms of simplicity, it's hard to beat Gauss-Seidel iteration. The update step is one line, and with a red-black (multicolor) ordering of the unknowns it can be parallelized easily. Successive over-relaxation (SOR) is only slightly more complicated and, with a well-chosen relaxation parameter, converges much faster.
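As a rough illustration of how short the update is, here is a sketch in Python rather than C++. It assumes the sparse matrix is given as one dictionary of nonzeros per row and that the network has been grounded so the system is nonsingular; the function name, the storage layout, and the fixed sweep count are all assumptions made for the example.

```python
def gauss_seidel(rows, b, sweeps=200, omega=1.0):
    """Approximately solve A x = b by Gauss-Seidel (omega = 1) or SOR (1 < omega < 2).

    rows[i] is a dict {j: A[i][j]} holding only the nonzeros of row i.
    A fixed number of sweeps is used here; a real solver would stop
    on a residual test instead.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            off_diag = sum(v * x[j] for j, v in rows[i].items() if j != i)
            gs_value = (b[i] - off_diag) / rows[i][i]
            x[i] += omega * (gs_value - x[i])  # omega = 1 gives plain Gauss-Seidel
    return x


# Three-node resistor chain with both ends tied to ground.
rows = [{0: 2.0, 1: -1.0}, {0: -1.0, 1: 2.0, 2: -1.0}, {1: -1.0, 2: 2.0}]
b = [1.0, 0.0, 1.0]
x_gs = gauss_seidel(rows, b)
x_sor = gauss_seidel(rows, b, omega=1.2)
```

With at most five nonzeros per row, each sweep costs a small constant amount of work per unknown.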
Conjugate gradient is also straightforward to code, and should converge much faster than the other iterative methods. The important thing to note is that you don't need to form the full matrix A, just a routine that computes matrix-vector products A*v for a given vector v. Once that's working, you can improve the convergence rate again by adding a preconditioner like SSOR (Symmetric SOR).
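A minimal, unpreconditioned sketch of this idea in Python: the solver only ever calls a user-supplied `matvec` function and never sees the matrix. The operator `chain_laplacian_matvec` is a hypothetical stand-in for a real network (a 1-D chain with grounded ends), and the tolerance and iteration cap are arbitrary choices.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only v -> A*v."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A*x for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = sum(ri * ri for ri in r)
    if rs_old == 0.0:
        return x
    for _ in range(max_iter):
        ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def chain_laplacian_matvec(v):
    """(A v)_i = 2 v_i - v_{i-1} - v_{i+1}, with values outside the chain taken as 0."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out


expected = [1.0, 2.0, 3.0, 4.0, 5.0]
rhs = chain_laplacian_matvec(expected)
solution = conjugate_gradient(chain_laplacian_matvec, rhs)
```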
Probably the fastest solution method that's reasonable to write yourself is a Fourier-based solver, provided the network is a regular grid with uniform coefficients. It essentially involves taking an FFT of the right-hand side, multiplying each value by a function of its coordinate, and taking the inverse FFT. You can use an FFT library like FFTW, or roll your own.
A good reference for all of these is A First Course in the Numerical Analysis of Differential Equations by Arieh Iserles.

Eigen is quite nice to use and one of the fastest libraries I know:
http://eigen.tuxfamily.org/dox/group__TutorialSparse.html

There are a lot of related posts you could have a look at.
I would recommend C++ and Boost uBLAS, as used in "UMFPACK and BOOST's uBLAS Sparse Matrix".

Large Matrix Inversion

I am looking at taking the inverse of a large matrix, common size of 1000 x 1000, but sometimes exceeds 100000 x 100000 (which is currently failing due to time and memory). I know that the normal sentiment is 'don't take the inverse, find some other way to do it', but that is not possible at the moment. The reason for this is due to the usage of software that is already made that expects to get the matrix inverse. (Note: I am looking into ways of changing this, but that will take a long time)
At the moment we are using an LU decomposition method from Numerical Recipes, and I am currently in the process of testing the Eigen library. The Eigen library seems to be more stable and a bit faster, but I am still in the testing phase for accuracy. I have taken a quick look at other libraries such as ATLAS and LAPACK but have not done any substantial testing with these yet.
It seems as though the Eigen library does not use concurrent methods to compute the inverse (though it does for the LU factorization part of the inverse), and as far as I can tell ATLAS and LAPACK are similar in this limitation. (I am currently testing the speed difference for Eigen with and without OpenMP.)
First question: can anyone explain how it would be possible to optimize matrix inversion by parallelization? I found an article here that talks about parallel algorithms for matrix inversion, but I did not understand it. It seems this article talks about another method? I am also not sure whether ScaLAPACK or PETSc are useful.
Second question: I read this article about using GPUs to increase performance, but I have never coded for GPUs and so have no idea what it is trying to convey, but the charts at the bottom looked rather alarming. How is this even possible, and where do I start to go about implementing something like this, if it is true?
I also found this article. I have not yet had the time to read through it and understand it, but it seems promising, as memory is a current issue with our software.
Any information about these articles or the problems in general would be of great help. And again I apologize if this question seems vague, I will try to expand more if necessary.

First question is can anyone explain how it would be possible to optimize matrix inversion by parallelization.
I'd hazard a guess that this, and related topics in linear algebra, is one of the most studied topics in parallel computing. If you're stuck looking for somewhere to start reading, well, good old Golub and Van Loan have a chapter on the topic. As to whether ScaLAPACK and PETSc are likely to be useful, certainly the former, probably the latter. Of course, they both depend on MPI but that's kind of taken for granted in this field.
Second question ...
Use GPUs if you've got them and you can afford to translate your code into the programming model supported by your GPUs. If you've never coded for GPUs and have access to a cluster of commodity-type CPUs you'll get up to speed quicker by using the cluster than by wrestling with a novel technology.
As for the last article you refer to, it's now 10 years old in a field that changes very quickly (try finding a 10-year-old research paper on using GPUs for matrix inversion). I can't comment on its excellence or other attributes, but the problem sizes you mention seem to me to be well within the capabilities of modern clusters for in-core (to use an old term) computation. If your matrices are very big, are they also sparse?
Finally, I strongly support your apparent intention to use existing off-the-shelf codes rather than to try to develop your own.

100000 x 100000 is 80GB at double precision. You need a library that supports memory-mapped matrices on disk. I can't recommend a particular library and I didn't find anything with quick Google searches. But code from Numerical Recipes certainly isn't going to be adequate.
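The 80GB figure is just n * n * 8 bytes for a dense matrix of doubles. A small sketch of the arithmetic (the helper name is made up for illustration):

```python
def dense_matrix_bytes(n, bytes_per_entry=8):
    """Storage for a dense n-by-n matrix; 8 bytes per entry is double precision."""
    return n * n * bytes_per_entry


big = dense_matrix_bytes(100_000)    # the failing case from the question
small = dense_matrix_bytes(1_000)    # the common case from the question
big_gb = big / 10**9                 # decimal gigabytes
big_gib = big / 2**30                # binary gibibytes, roughly 74.5
```

The common 1000 x 1000 case needs only 8 MB, which is why it causes no trouble. An in-place LU factorization needs about the same storage as the matrix itself, plus the inverse if it is stored separately.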

Regarding the first question (how to parallelize computing the inverse):
I assume you are computing the inverse by doing an LU decomposition of your matrix and then using the decomposition to solve A*B = I where A is your original matrix, B is the matrix you solve for, and I is the identity matrix. Then B is the inverse.
The last step is easy to parallelize. Divide your identity matrix along the columns. If you have p CPUs and your matrix is n-by-n, then every part has n/p columns and n rows. Let's call the parts I1, I2, etc. On every CPU, solve a system of the form A*B1 = I1; this gives you the parts B1, B2, etc., and you can combine them to form B, which is the inverse.
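The scheme above can be sketched as follows, in plain Python for readability. This is an illustration under assumptions, not a tuned implementation: all function names are hypothetical, the matrix is a dense list of lists, and the column blocks are processed one after another here, whereas in a real code each block would be handed to a different CPU because the blocks do not depend on each other.

```python
def lu_factor(a):
    """LU with partial pivoting. Returns (lu, perm); row k of L*U is row perm[k] of a."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[piv][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[piv] = lu[piv], lu[k]
        perm[k], perm[piv] = perm[piv], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]                  # multiplier, stored in L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b using the factors from lu_factor."""
    n = len(lu)
    y = [b[p] for p in perm]                      # apply the row permutation
    for i in range(n):                            # forward substitution (unit L)
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):                  # back substitution (U)
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y


def inverse_columns(lu, perm, cols):
    """Solve A * B_part = I_part for the given columns of the identity."""
    n = len(lu)
    return [lu_solve(lu, perm, [1.0 if i == c else 0.0 for i in range(n)])
            for c in cols]


def inverse_by_blocks(a, parts=2):
    """Invert a by factoring once and solving for blocks of identity columns."""
    n = len(a)
    lu, perm = lu_factor(a)
    size = -(-n // parts)                         # ceil(n / parts) columns per block
    blocks = [range(s, min(s + size, n)) for s in range(0, n, size)]
    columns = []
    for block in blocks:                          # independent: one block per CPU
        columns.extend(inverse_columns(lu, perm, block))
    return [[columns[j][i] for j in range(n)] for i in range(n)]


inv = inverse_by_blocks([[4.0, 3.0], [6.0, 3.0]])
```

Only the factorization is shared between the workers; every block of columns reads the same LU factors and writes its own part of B.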

An LU decomp on a GPU can be ~10x faster than on a CPU. Although this is now changing, GPUs have traditionally been designed around single precision arithmetic, and so on older hardware single precision arithmetic is generally much faster than double precision arithmetic. Also, storage requirements and performance will be greatly impacted by the structure of your matrices. A sparse 100,000 x 100,000 matrix LU decomp is a reasonable problem to solve and will not require much memory.
Unless you want to become a specialist and spend a lot of time tuning for hardware updates, I would strongly recommend using a commercial library. I would suggest CULA tools. They have both sparse and dense GPU libraries and in fact their free library offers SGETRF - a single precision (dense) LU decomp routine. You'll have to pay for their double precision libraries.

I know it's an old post, but really: OpenCL (you download the relevant one based on your graphics card) + OpenMP + vectorization (not in that order) is the way to go.
Anyhow, my experience with matrix anything is really to do with the overheads from copying arrays of doubles in and out of the system, and also from padding or initializing matrices with 0s before any computation starts, especially when I am creating an .xll for Excel usage.
If I were to reprioritize the above:
1. Try to vectorize the code. Visual Studio 2012 and Intel C++ have autovectorization. I'm not sure about MinGW or GCC, but I think there are compiler flags that analyse your for loops and generate assembly that uses your processor's vector registers, instead of the normal registers, to hold your data. I think Excel is doing that, because when I monitored Excel's threads while running its MINVERSE(), I noticed only 1 thread was used.
I don't know much assembly language, so I don't know how to vectorize manually... (haven't had time to go learn this yet but sooooo wanna do it!)
2. Parallelize with OpenMP (omp pragma) or MPI or pthreads library (parallel_for). Very simple, but here's the catch: I realise that if your matrix class is completely single-threaded in the first place, then parallelizing an operation like matrix multiply or inverse is not worth it, because parallelizing will reduce the speed due to initializing, copying to, or just accessing the non-parallelized matrix class.
But where parallelization helps is if you're designing your own matrix class and you parallelize its constructor operation (padding with 0s etc.); then your computation of LU(A^-1) = I will also be faster.
It's also mathematically straightforward to optimize your LU decomposition, and to optimize your forward and backward substitution for the special case of the identity. (I.e., don't waste time creating an identity matrix; work out where row = col in your for loops and treat the right-hand side as a function that returns 1 there and 0 elsewhere.)
3. Once it's been parallelized (on the outer layers), the matrix operations that work element by element can be mapped onto the GPU(s): hundreds of processors to compute elements, beat that! There is now sample Monte Carlo code available on ATI's website using ATI's OpenCL. Don't worry about porting code to something that uses GeForce; all you have to do is recompile there.
For 2 and 3, though, remember that overheads are incurred, so there's no point unless you're handling F*K*G HUGE matrices. But I see 100k^2? Wow...
Gene

C vs Fortran for BLAS 2

I have an application in which I need to carry out a lot of norms, dot products and, most importantly, matrix-vector multiplications.
The matrix and vectors are huge. The matrix dimension tends to be 100000x100000.
The loop structure is:
while (condition)
    /* usually iterations=dimension of matrix, so around 1 million iterations are *at least* required (if not more) */
    matrix-vector multiplication
    3 dot products
    2 norms
I am currently using Intel Fortran with Intel MKL. Will rewriting my codes in Intel C with Intel MKL help any?
Has anyone ever carried out a benchmark of any kind (for DGEMV especially)?
Rewriting codes is a major pain but I would not mind rewriting iff I see a reason to.
EDIT: I misspoke: The matrix dimensions are 100000 not a million. Pretty serious error :|
And yes, the matrix is dense and it needs to be dense.
Moreover, it is not symmetric and not even positive definite.
My algorithm is a modified version of QMR.

The performance will be completely identical in either C or Fortran, as the actual implementation backing the library calls is the same, and essentially all of the time in your code is spent in those library calls.

Large matrix inversion methods

Hi, I've been doing some research about matrix inversion (linear algebra) and I want to use C++ template programming for the algorithm. What I found out is that there are a number of methods, like Gauss-Jordan elimination or LU decomposition, and I found the function lu_factorize (C++ Boost library).
I want to know if there are other methods, and which one is better (advantages/disadvantages) from the perspective of programmers or mathematicians.
If there are no other faster methods, is there already a (matrix) inversion function in the Boost library? I've searched a lot and didn't find any.

As you mention, the standard approach is to perform an LU factorization and then solve for the identity. This can be implemented using the LAPACK library, for example, with dgetrf (factor) and dgetri (compute inverse). Most other linear algebra libraries have roughly equivalent functions.
There are some slower methods that degrade more gracefully when the matrix is singular or nearly singular, and are used for that reason. For example, the Moore-Penrose pseudoinverse is equal to the inverse if the matrix is invertible, and often useful even if the matrix is not invertible; it can be calculated using a Singular Value Decomposition.

I'd suggest you take a look at the Eigen source code.

Please Google or Wikipedia for the buzzwords below.
First, make sure you really want the inverse. Solving a system does not require inverting a matrix. Matrix inversion can be performed by solving n systems, with unit basis vectors as right hand sides. So I'll focus on solving systems, because it is usually what you want.
It depends on what "large" means. Methods based on decomposition must generally store the entire matrix. Once you have decomposed the matrix, you can solve for multiple right hand sides at once (and thus invert the matrix easily). I won't discuss here factorization methods, as you're likely to know them already.
Please note that when a matrix is large, it is often ill-conditioned: its condition number is very large (equivalently, its reciprocal condition number is close to zero), which means that the matrix is "numerically non-invertible". Remedy: preconditioning. Check Wikipedia for this. The article is well written.
If the matrix is large, you don't want to store it. If it has a lot of zeros, it is a sparse matrix. Either it has structure (eg. band diagonal, block matrix, ...), and you have specialized methods for solving systems involving such matrices, or it has not.
When you're faced with a sparse matrix with no obvious structure, or with a matrix you don't want to store, you must use iterative methods. They only involve matrix-vector multiplications, which don't require a particular form of storage: you can compute the coefficients when you need them, or store non-zero coefficients the way you want, etc.
The methods are:
For symmetric positive definite matrices: conjugate gradient method. In short, solving Ax = b amounts to minimizing 1/2 x^T A x - x^T b.
Biconjugate gradient method for general matrices. It can be numerically unstable, though.
Minimum residual methods or, better, GMRES. Please check the Wikipedia articles for details. You may want to experiment with the number of iterations before restarting the algorithm.
And finally, you can perform some sort of factorization with sparse matrices, with specially designed algorithms to minimize the number of non-zero elements to store.

Depending on how large the matrix actually is, you probably need to keep only a small subset of the columns in memory at any given time. This might require overriding the low-level read and write operations on the matrix elements, which I'm not sure Eigen, an otherwise pretty decent library, will allow you to do.
For these very narrow cases where the matrix is really big, there is the STXXL library, designed for access to arrays that are mostly stored on disk.
EDIT: To be more precise, if you have a matrix that does not fit in the available RAM, the preferred approach is to do blockwise inversion. The matrix is split recursively until each block does fit in RAM (the block size is a tuning parameter of the algorithm, of course). The tricky part here is to avoid starving the CPU of matrices to invert while they are pulled in and out of disk. This might require looking into appropriate parallel filesystems, since even with STXXL, this is likely to be the main bottleneck. Although, let me repeat the mantra: premature optimization is the root of all programming evil. This evil can only be banished with the cleansing ritual of code, execute and profile.

You might want to use a C++ wrapper around LAPACK. LAPACK is very mature code: well-tested, optimized, etc.
One optimized LAPACK implementation that can be called from C++ is the Intel Math Kernel Library.
