I need to solve a large set of independent Ax=b linear problems.
This cannot be parallelized (or, more specifically, it falls within each processor's responsibility anyway).
The systems are small (say 10x10 at most) but dense (usually all terms are non-zero), and both the A matrices and the RHS vectors are completely different and independent.
What is the most efficient/practical way of solving a large set of small Ax=b problems using PETSc?
I.e., how costly would it be to reuse a single A matrix and a single b vector, modifying them for each system and solving repeatedly?
After considering all the options, I found that PETSc was not very efficient at handling this situation, due to the need to rebuild/repopulate the matrices all the time with different values, and it ended up being relatively memory-costly.
I ended up putting PETSc aside for this matter.
Since your A is small, why not try a direct method to solve it, e.g. the SuperLU MT version or the SuperLU serial version.
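For 10x10 dense blocks, plain LAPACK is arguably the simplest direct route (SuperLU is really aimed at sparse systems). Below is a minimal sketch of the loop-over-systems idea, assuming LAPACKE is available; the sizes, loop count and fill-in are placeholders:

    // Sketch: solve many independent small dense systems with LAPACK's dgesv.
    // Assumes LAPACKE is installed; link with -llapacke -llapack.
    #include <lapacke.h>
    #include <vector>

    // Solve A x = b in place for one n-by-n dense system.
    // On return, b holds the solution x and A holds its LU factors.
    bool solve_small_dense(int n, std::vector<double>& A, std::vector<double>& b) {
        std::vector<lapack_int> ipiv(n);
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, /*nrhs=*/1,
                                        A.data(), /*lda=*/n,
                                        ipiv.data(), b.data(), /*ldb=*/1);
        return info == 0;  // info > 0 means the matrix is singular
    }

    int main() {
        const int n = 10;                       // placeholder block size
        for (int k = 0; k < 100000; ++k) {      // placeholder number of systems
            std::vector<double> A(n * n), b(n);
            // ... fill A (row-major) and b for system k ...
            solve_small_dense(n, A, b);
            // ... use the solution now stored in b ...
        }
        return 0;
    }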
If you have all the matrices at once, you can assemble them into one large block-diagonal matrix, and this will be efficient (vectorized and such).
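If you do go that route, the PETSc assembly looks roughly like the sketch below. The block size, block count and per-block fill are placeholders (identity blocks just so the sketch runs); with -ksp_type preonly -pc_type lu the block-diagonal system is essentially a batched direct solve. Error checking is omitted:

    // Sketch: assemble all small systems into one block-diagonal PETSc matrix.
    #include <petscksp.h>

    int main(int argc, char **argv) {
        PetscInitialize(&argc, &argv, NULL, NULL);

        const PetscInt bs = 10, nblocks = 100000, N = bs * nblocks;
        Mat A;  Vec b, x;  KSP ksp;

        /* Each row has exactly bs nonzeros, so preallocation is exact. */
        MatCreateSeqAIJ(PETSC_COMM_SELF, N, N, bs, NULL, &A);
        VecCreateSeq(PETSC_COMM_SELF, N, &b);
        VecDuplicate(b, &x);

        for (PetscInt k = 0; k < nblocks; ++k) {
            PetscInt    rows[10];                /* bs = 10 */
            PetscScalar vals[100] = {0.0};       /* the k-th dense bs-by-bs block (row-major) */
            PetscScalar rhs[10]   = {0.0};       /* the k-th right-hand side */
            for (PetscInt i = 0; i < bs; ++i) rows[i] = k * bs + i;
            for (PetscInt i = 0; i < bs; ++i) vals[i * bs + i] = 1.0; /* placeholder fill */
            MatSetValues(A, bs, rows, bs, rows, vals, INSERT_VALUES);
            VecSetValues(b, bs, rows, rhs, INSERT_VALUES);
        }
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
        VecAssemblyBegin(b);  VecAssemblyEnd(b);

        KSPCreate(PETSC_COMM_SELF, &ksp);
        KSPSetOperators(ksp, A, A);
        KSPSetFromOptions(ksp);   /* e.g. -ksp_type preonly -pc_type lu */
        KSPSolve(ksp, b, x);

        KSPDestroy(&ksp);  MatDestroy(&A);  VecDestroy(&b);  VecDestroy(&x);
        PetscFinalize();
        return 0;
    }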
I would like to write a sparse matrix-dense vector (SpMV) multiplication, y = Ax, in C++.
The sparse matrix A is stored in CSR format. The usual sparsity of A is between 50 and 90%. The goal is to reach a better or similar time than that of dense matrix-dense vector (DMV) multiplication.
Please note that I have already viewed the following posts: Q1 Q2 Q3. However, I still am wondering about the following:
How does SpMV compare in terms of time to DMV? Since the sparsity is relatively high, I assume SpMV should be better given the reduction in the number of operations. Yes?
What should I take into account to make SpMV the same as or better than DMV in terms of time? Why do people say that DMV will perform better than SpMV? Does the storage representation make a difference?
Are there any recommended libraries that do SpMV in C++, for either CPU or GPU implementations?
This question is relevant to my other question here: (CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network)
To answer the edited question:
Unless the Matrix is very sparse (<10% nonzeros on CPU, probably <1% on GPU), you will likely not benefit from the sparsity. While the number of floating point operations is reduced, the amount of storage is at least double (column or row index + value), memory access is irregular (you have an indirection via the index for the right-hand side), it becomes far more difficult to vectorize (or to achieve coalescing on the GPU) and if you parallelize you have to deal with the fact that rows are of varying length and therefore a static schedule is likely to be suboptimal.
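To make the indirection concrete, this is what a textbook CSR SpMV kernel looks like in C++ (a sketch, not a tuned implementation): the gather of x through the column-index array is exactly the irregular access described above.

    // Textbook CSR sparse matrix - dense vector product y = A*x.
    // row_ptr has n_rows+1 entries; col_idx/values have one entry per nonzero.
    #include <vector>

    void spmv_csr(int n_rows,
                  const std::vector<int>&    row_ptr,
                  const std::vector<int>&    col_idx,
                  const std::vector<double>& values,
                  const std::vector<double>& x,
                  std::vector<double>&       y) {
        for (int i = 0; i < n_rows; ++i) {
            double sum = 0.0;
            // Rows have varying length, and x is gathered through col_idx:
            // this indirection is what defeats straightforward vectorization.
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += values[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }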
Beyond the points above, yes, the storage representation matters. For example a COO-matrix stores two indices and the value, while CSR/CSC only store one but require an additional offset array which makes them more complex to build on the fly. Especially on the GPU, storage formats matter if you want to at least achieve some coalescing. This paper looks into how storage formats affect performance on the GPU: https://onlinelibrary.wiley.com/doi/full/10.1111/cgf.13957
For something generic, try Eigen on the CPU or cuSPARSE on the GPU. There are plenty of other libraries that perform better for specific use cases, but this part of the question isn't clearly answerable.
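As a small illustration of the Eigen route (sizes and entries are placeholders), building the matrix from triplets and multiplying looks like this:

    // Sketch: SpMV with Eigen's SparseMatrix (RowMajor gives CSR storage).
    #include <Eigen/Sparse>
    #include <vector>

    int main() {
        const int n = 1000;                                   // placeholder size
        std::vector<Eigen::Triplet<double>> triplets;
        // ... push_back Triplet(row, col, value) for every nonzero ...
        for (int i = 0; i < n; ++i) triplets.emplace_back(i, i, 1.0); // placeholder fill

        Eigen::SparseMatrix<double, Eigen::RowMajor> A(n, n);
        A.setFromTriplets(triplets.begin(), triplets.end());

        Eigen::VectorXd x = Eigen::VectorXd::Random(n);
        Eigen::VectorXd y = A * x;   // sparse matrix * dense vector
        return 0;
    }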
Beyond the matrix format itself, even the ordering of entries in your matrix can have a massive impact on performance, which is why the Cuthill-McKee algorithm is often used to reduce matrix bandwidth (and thereby improve cache performance).
I have a problem which involves many matrix multiplications (classical and Kronecker products). I read that GPUs are suited for this task, and since speed is my main objective I was thinking about using CUDA with C++. However, I would have to learn CUDA first. So before I start wasting my time, I thought I should ask wiser people first. Can CUDA speed up my calculations? The matrices are generally quite small, around 20x50. Sometimes a third dimension is involved, so it becomes a 20x50x10 array. I can only multiply a couple of matrices at each step in time (10-100). But I need to do several million iterations one after another (a Monte Carlo simulation). Currently I am using Armadillo and MATLAB.
You would see some speed-up if your matrices were bigger; as it is, you will be facing data-bandwidth bottlenecks worse than the computation time itself.
Something worth considering is whether there are mathematical tricks that would allow you (based on your computations) to combine multiple instances into bigger matrices, then transfer and compute in one go. But usually this is quite difficult and not always doable.
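As one example of such a trick: if several of your products share the same right-hand factor, the left factors can be stacked so that a single larger multiplication replaces several small ones. Here is a sketch in Armadillo, since that is what you already use; whether your Kronecker/Monte Carlo structure allows this is something only you can judge:

    // Sketch: replace several small products A_i * B (same B) with one tall product.
    #include <armadillo>

    int main() {
        arma::mat A1(20, 50, arma::fill::randu);
        arma::mat A2(20, 50, arma::fill::randu);
        arma::mat B (50, 10, arma::fill::randu);

        // Instead of two 20x50 * 50x10 products...
        arma::mat C1 = A1 * B;
        arma::mat C2 = A2 * B;

        // ...stack the left factors and do one 40x50 * 50x10 product.
        arma::mat A = arma::join_cols(A1, A2);   // vertical concatenation
        arma::mat C = A * B;                     // rows 0-19 equal C1, rows 20-39 equal C2
        return 0;
    }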
I am working on a FEM project where I need a linear solution of Ku=f.
I am doing this with a LAPACK solver.
As you may know, the K matrix can sometimes be huge (30 GB).
It needs a lot of RAM to malloc such a matrix in the conventional way. I would like to know whether I can write the matrix to a file instead.
Can you please suggest how to feed such a matrix from the file into the LAPACK solver and get the output written to a file?
Thanks in advance.
Maharshi.
30 GB is not a large size for computing servers. You may want to upgrade your server.
With limited hardware, yes, you can put the matrix in a file and use the same LAPACK routines to solve the equation. The technique is called a memory-mapped file. It maps the contents of a file to a memory address range of the same size, without allocating the physical memory up front. When you read/write data from/to this address range, you are actually reading/writing the file.
https://en.wikipedia.org/wiki/Memory-mapped_file
On Linux you can use mmap() to achieve this.
http://man7.org/linux/man-pages/man2/mmap.2.html
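A rough sketch of the idea (the file name, matrix order and right-hand side are placeholders; the file must already contain K in the column-major layout LAPACK expects):

    // Sketch: map a huge column-major matrix file and hand it to LAPACK's dgesv.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <lapacke.h>
    #include <vector>

    int main() {
        const lapack_int n = 60000;                       // placeholder matrix order
        const size_t bytes = sizeof(double) * (size_t)n * (size_t)n;

        int fd = open("K.bin", O_RDWR);                   // placeholder file name
        if (fd < 0) return 1;
        // MAP_SHARED: dgesv overwrites K with its LU factors, written back to the file.
        double *K = (double*)mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (K == MAP_FAILED) return 1;

        std::vector<double> f(n, 0.0);                    // right-hand side fits in RAM
        std::vector<lapack_int> ipiv(n);
        // ... fill f ...

        // Solve K u = f; on return f holds the solution u.
        LAPACKE_dgesv(LAPACK_COL_MAJOR, n, 1, K, n, ipiv.data(), f.data(), n);

        // ... write f to an output file ...
        munmap(K, bytes);
        close(fd);
        return 0;
    }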
However, accessing that memory address range is only as fast as accessing the underlying disk file.
Depending on the support of the shape functions used in your FEM code, the matrix K is often sparse: most of its elements are zero. Hence, a format dedicated to sparse matrices, such as CSR, is much more efficient for storing the matrix. Unfortunately, LAPACK offers little support for such matrices, although it can handle banded ones.
Take a look at the Eigen library or the PETSc library. Both provide interfaces to efficient solvers dedicated to sparse matrices, for instance MUMPS or SuiteSparse.
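For illustration, a sparse direct solve of Ku = f with Eigen looks roughly like this (the triplet fill stands in for your FEM assembly):

    // Sketch: sparse direct solve of K u = f with Eigen's SparseLU.
    #include <Eigen/Sparse>
    #include <Eigen/SparseLU>
    #include <vector>

    int main() {
        const int n = 1000;                                 // placeholder number of DOFs
        std::vector<Eigen::Triplet<double>> entries;
        // ... push_back Triplet(i, j, value) for every nonzero from element assembly ...
        for (int i = 0; i < n; ++i) entries.emplace_back(i, i, 1.0); // placeholder fill

        Eigen::SparseMatrix<double> K(n, n);
        K.setFromTriplets(entries.begin(), entries.end());
        K.makeCompressed();

        Eigen::VectorXd f = Eigen::VectorXd::Zero(n);
        // ... fill f ...

        Eigen::SparseLU<Eigen::SparseMatrix<double>> solver;
        solver.compute(K);
        Eigen::VectorXd u = solver.solve(f);
        return 0;
    }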
I am looking to work with about 4000 fixed-size (3x3, 4x4) matrices, doing things such as matrix inversion and eigendecomposition.
It seems to me the best way to parallelize this would be to let each of the many GPU threads work on a single instance of the problem.
Is there a reasonable way to do this? I have read: http://www.culatools.com/blog/2011/12/09/batched-operations/ but as far as I can tell, it's always something that is "being worked on" with no solution in sight. Three years later, I hope there is a good solution.
So far, I have looked at:
Using Eigen in CUDA kernels: http://eigen.tuxfamily.org/dox-devel/TopicCUDA.html. But this is in its infancy: thus, it doesn't seem to work well and some things are not implemented. Moreover, I am not sure if it is optimized for CUDA at all. There is almost no documentation and the only example of code is a test file (eigen/test/cuda_basic.cu). When I tried using Eigen in CUDA kernels, simple things like declaring an Eigen::MatrixXf in a kernel did not survive compilation with nvcc V7.0.27 and Eigen 3.2.90 (mercurial).
Using the cuBLAS device API library to run BLAS routines within a kernel. It seems cuBLAS and its ilk are written to be parallelized even for small matrices, which seems overkill and likely slow for the 3x3 and 4x4 matrices I am interested in. Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels).
Batch processing kernels using CUDA streams. In Section 2.1.7 "Batching Kernels" of the cuBLAS documentation for the CUDA Toolkit v7.0, this is suggested. But "in practice it is not possible to have more than 16 concurrent kernels executing at the same time", and consequently it would be terrible for processing 4000 small matrices. From the CULA blog post linked above, I quote: "One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would be ill-performing for two reasons. First is that the number of threads per block would be far too low; [...] Second is that the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch code is as expensive (if not more expensive) as just performing the matrix on the CPU."
Implementing my own matrix multiplication and eigendecomposition in kernels. This is likely to be very slow, and may in addition be time consuming to implement.
At this point I am tempted to give up on doing this on the GPU at all. It is a pity, since I was hoping for real time performance for an algorithm that requires inverting 4000 3x3 matrices about 100 times every 0.1 seconds.
The cuBLAS functions getrfBatched and getriBatched are designed for batched inversion of small matrices. This should be quicker than either dynamic parallelism or streams (your second and third approaches). A batch solver that can do matrix inversions is also available in source-code form; you will need to log in as a registered developer at developer.nvidia.com to access this link.
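For concreteness, the host-side call sequence looks roughly like the following sketch (sizes are placeholders; error checking and host/device data copies are abbreviated):

    // Sketch: invert thousands of small matrices with cuBLAS batched LU.
    // Compile with nvcc and link against -lcublas.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        const int n = 3, batch = 4000;            // placeholder sizes

        cublasHandle_t handle;
        cublasCreate(&handle);

        double *dA, *dAinv;                       // all matrices in two contiguous slabs
        cudaMalloc(&dA,    sizeof(double) * n * n * batch);
        cudaMalloc(&dAinv, sizeof(double) * n * n * batch);
        // ... cudaMemcpy the 4000 column-major 3x3 matrices into dA ...

        // Device arrays of per-matrix pointers, as the batched API expects.
        std::vector<double*> hA(batch), hAinv(batch);
        for (int i = 0; i < batch; ++i) {
            hA[i]    = dA    + i * n * n;
            hAinv[i] = dAinv + i * n * n;
        }
        double **dAarr, **dAinvArr;
        cudaMalloc(&dAarr,    sizeof(double*) * batch);
        cudaMalloc(&dAinvArr, sizeof(double*) * batch);
        cudaMemcpy(dAarr,    hA.data(),    sizeof(double*) * batch, cudaMemcpyHostToDevice);
        cudaMemcpy(dAinvArr, hAinv.data(), sizeof(double*) * batch, cudaMemcpyHostToDevice);

        int *dPiv, *dInfo;
        cudaMalloc(&dPiv,  sizeof(int) * n * batch);
        cudaMalloc(&dInfo, sizeof(int) * batch);

        // LU-factorize all matrices, then form all inverses, in two launches.
        cublasDgetrfBatched(handle, n, dAarr, n, dPiv, dInfo, batch);
        cublasDgetriBatched(handle, n, dAarr, n, dPiv, dAinvArr, n, dInfo, batch);

        // ... cudaMemcpy dAinv back to the host ...
        cudaFree(dA); cudaFree(dAinv); cudaFree(dAarr); cudaFree(dAinvArr);
        cudaFree(dPiv); cudaFree(dInfo);
        cublasDestroy(handle);
        return 0;
    }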
"Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels)."
cuSOLVER provides some eigensolver functions. However, they are neither batched nor callable from device code, so beyond that you are left with streams as the only option.
In some code we need to compute the eigenvectors and eigenvalues for the generalized eigenvalue problem with symmetric real matrices (Ax = lambda Bx). This code uses DSPGVX from LAPACK. We wanted to speed it up on the GPU using a MAGMA function. We asked on this forum and got an answer pointing here:
http://icl.cs.utk.edu/magma/docs/zhegvx_8cpp.html
The size of our matrices (N) ranges from 100 to 50000 and beyond, related to the number of atoms in a molecule. We observe:
a) for N bigger than about 2500, MAGMA simply does not work; we get a segmentation fault;
b) MAGMA always runs slower than sequential LAPACK, around 10 times slower.
Is this behavior normal, and could we overcome it? Can anybody point to a reference where someone working on similar problems gets a decent speedup?
Thanks
In my experience you may be able to gain greater performance benefits by switching to a better eigensolver. The best solver that I know of is ARPACK. You will gain the most benefit if your matrices have some structure, for example if they are sparse. This solver is also most efficient if you only need to extract a small fraction of the total number of eigenpairs.
I would start off by trying this solver on your problems, running just on the CPU. You may find that this alone gives sufficient performance for your needs. If not, it is relatively easy to move ARPACK's computational core to the GPU. There are also parallel versions of ARPACK available.
Have you tried CULA (http://www.culatools.com/)? CULA is LAPACK ported to CUDA, so at least in theory it should have one of the best implementations of the generalized eigenvalue problem. I think the single-precision version is free, so you could give it a try.