I would like to perform a Principal Component Analysis on a dataset composed of approximately 40 000 samples, each sample displaying about 10 000 features.
Using Matlab princomp function takes ages ... What would be the fastest algorithm ? How long would it take on a i7 dual core / 4GB Ram ?
Thanks for your support
crosspost of this: https://scicomp.stackexchange.com/questions/1681/what-is-the-fastest-way-to-calculate-the-largest-eigenvalue-of-a-general-matrix/7487#7487
There has been some good research on this recently. The new approaches use "randomized algorithms" which only require a few reads of your matrix to get good accuracy on the largest eigenvalues. This is in contrast to power iterations which require several matrix-vector multiplications to reach high accuracy.
You can read more about the new research here:
http://math.berkeley.edu/~strain/273.F10/martinsson.tygert.rokhlin.randomized.decomposition.pdf
http://arxiv.org/abs/0909.4061
This code will do it for you:
http://cims.nyu.edu/~tygert/software.html
https://bitbucket.org/rcompton/pca_hgdp/raw/be45a1d9a7077b60219f7017af0130c7f43d7b52/pca.m
http://code.google.com/p/redsvd/
https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html
If your language of choice isn't in there you can roll your own randomized SVD pretty easily; it only requires a matrix vector multiplication followed by a call to an off-the-shelf SVD.
Related
I am trying to solve a very large and sparse system of linear equations in C++. Currently, I am using BiCGSTAB from eigen. It works fine for small matrix, but it is taking just too much time for matrix of the size I need, which is 40804x40804 (It could be even larger in the future).
I have a very long script, but I simply used the following format:
SparseMatrix<double> sj(40804,40804);
VectorXd c_(40804), sf(40804);
sj.reserve(VectorXi::Constant(40804,36)); //This is a very good estimate of how many non zeros in each column
//...Fill in actual number in sj
sj.makeCompressed();
BiCGSTAB<SparseMatrix<double> > handler;
//...Fill in sj, only in the entries that have been initialized previously
handler.analyzePattern(sj)
handler.factorize(sj);
c_.setZero();
c_=handler.solve(sf);
This takes way too long! And yes, the solution does exist. Sparse function in matlab seems to handle this very well, but I need it in C++ in order to connect to a server.
I would really appreciate it you could help me!
You should consider use of one of the advanced sparse direct solvers: CHOLMOD
Sparse direct solvers are a fundamental tool in computational analysis, providing a very general method for obtaining high-quality results to almost any problem. CHOLMOD is a high performance library for sparse Cholesky factorization.
I guarantee that this package definetly will help you. Moreover CHOLMOD has supported GPU acceleration since 2012 with version 4.0.0 . In SuiteSparse-4.3.1 performance has been further improved, providing speedups of 3x or greater vs. the CPU for the sparse factorization operation.
If your matrices are the representations of graphs you can also consider METIS with combination of CHOLMOD. Which means you will be able to do partition/domainDecomposition in graphs then parallel solve with CHOLMOD.
SuiteSparse is a powerfull tool with the support of linear(KLU) and direct solvers.
Here are the GitHub link, UserGuide and SuiteSparse's home page
I am doing machine learning with python (scikit-learn) using the same data but with different classifiers. When I use 500k of data, LR and SVM (linear kernel) take about the same time, SVM (with polynomial kernel) takes forever. But using 5 million data, it seems LR is faster than SVM (linear) by a lot, I wonder if this is what people normally find?
Faster is a bit of a weird question, in part because it is hard to compare apples to apples on this, and it depends on context. LR and SVM are very similar in the linear case. The TLDR for the linear case is that Logistic Regression and SVMs are both very fast and the speed difference shouldn't normally be too large, and both could be faster/slower in certain cases.
From a mathematical perspective, Logistic regression is strictly convex [its loss is also smoother] where SVMs are only convex, so that helps LR be "faster" from an optimization perspective, but that doesn't always translate to faster in terms of how long you wait.
Part of this is because, computationally, SVMs are simpler. Logistic Regression requires computing the exp function, which is a good bit more expensive than just the max function used in SVMs, but computing these doesn't make the majority of the work in most cases. SVMs also have hard zeros in the dual space, so a common optimization is to perform "shrinkage", where you assume (often correctly) that a data point's contribution to the solution won't change in the near future and stop visiting it / checking its optimality. The hard zero of the SVM loss and the C regularization term in the soft margin form allow for this, where LR has no hard zeros to exploit like that.
However, when you want something to be fast, you usually don't use an exact solver. In this case, the issues above mostly disappear, and both tend to learn just as quick as the other in this scenario.
In my own experience, I've found Dual Coordinate Descent based solvers to be the fastest for getting exact solutions to both, with Logistic Regression usually being faster in wall clock time than SVMs, but not always (and never by more than a 2x factor). However, if you try and compare different solver methods for LRs and SVMs you may get very different numbers on which is "faster", and those comparisons won't necessarily be fair. For example, the SMO solver for SVMs can be used in the linear case, but will be orders of magnitude slower because it is not exploiting the fact that you only care are Linear solutions.
I have a sparse banded matrix A and I'd like to (direct) solve Ax=b. I have about 500 vectors b, so I'd like to solve for the corresponding 500 x's.
I'm brand new to CUDA, so I'm a little confused as to what options I have available.
cuSOLVER has a batch direct solver cuSolverSP for sparse A_i x_i = b_i using QR here. (I'd be fine with LU too since A is decently conditioned.) However, as far as I can tell, I can't exploit the fact that all my A_i's are the same.
Would an alternative option be to first determine a sparse LU (QR) factorization on the CPU or GPU then perform in parallel the backsubstitution (respectively, backsub and matrix mult) on the GPU? If cusolverSp< t >csrlsvlu() is for one b_i, is there a standard way to batch perform this operation for multiple b_i's?
Finally, since I don't have intuition for this, should I expect a speedup on a GPU for either of these options, given the necessary overhead? x has length ~10000-100000. Thanks.
I'm currently working on something similar myself. I decided to basically wrap the conjugate gradient and level-0 incomplete cholesky preconditioned conjugate gradient solvers utility samples that came with the CUDA SDK into a small class.
You can find them in your CUDA_HOME directory under the path:
samples/7_CUDALibraries/conjugateGradient and /Developer/NVIDIA/CUDA-samples/7_CUDALibraries/conjugateGradientPrecond
Basically, you would load the matrix into the device memory once (and for ICCG, compute the corresponding conditioner / matrix analysis) then call the solve kernel with different b vectors.
I don't know what you anticipate your matrix band structure to look like, but if it is symmetric and either diagonally dominant (off diagonal bands along each row and column are opposite sign of diagonal and their sum is less than the diagonal entry) or positive definite (no eigenvectors with an eigenvalue of 0.) then CG and ICCG should be useful. Alternately, the various multigrid algorithms are another option if you are willing to go through coding them up.
If your matrix is only positive semi-definite (e.g. has at least one eigenvector with an eigenvalue of zero) you can still ofteb get away with using CG or ICCG as long as you ensure that:
1) The right hand side (b vectors) are orthogonal to the null space (null space meaning eigenvectors with an eigenvalue of zero).
2) The solution you obtain is orthogonal to the null space.
It is interesting to note that if you do have a non-trivial null space, then different numeric solvers can give you different answers for the same exact system. The solutions will end up differing by a linear combination of the null space... That problem has caused me many many man hours of debugging and frustration before I finally caught on, so its good to be aware of it.
Lastly, if your matrix has a Circulant Band structure you might consider using a fast fourier transform (FFT) based solver. FFT based numerical solvers can often yield superior performance in cases where they are applicable.
is there a standard way to batch perform this operation for multiple b_i's?
One option is to use the batched refactorization module in CUDA's cuSOLVER, but I am not sure if it is standard.
Batched refactorization module in cuSOLVER provides an efficient method to solve batches of linear systems with fixed left-hand side sparse matrix (or matrices with fixed sparsity pattern but varying coefficients) and varying right-hand sides, based on LU decomposition. Only some partially completed code snippets can be found in the official documentation (as of CUDA 10.1) that are related to it. A complete example can be found here.
If you don't mind going with an open-source library, you could also check out CUSP:
CUSP Quick Start Page
It has a fairly decent suite of solvers, including a few preconditioned methods:
CUSP Preconditioner Examples
The smoothed aggregation preconditioner (a variant of algebraic multigrid) seems to work very well as long as your GPU has enough onboard memory for it.
I am using finite differences for a square 100x100 domain (with neumann bcs on all sides) in c++ using Eigen's sparse matrix functionality, and built in solvers to compute x in Ax=b.
I have tried the following solvers, but am getting very different time results to what I expect from reading the documentation in http://eigen.tuxfamily.org/dox-devel/group__TopicSparseSystems.html which gives typical timescales for different solvers. In particular, the documentation suggests that conjugate gradients should be one of the fastest ways of solving this system, giving a timescale of 0.239 seconds for solving Poisson SPD with a size bigger than my system. By contrast this documentation suggests that SimplicialLLT should take roughly 3X as long.
When I run each of the solvers I am obtaining the following:
- Conjugate gradients: 25 seconds
- LLT: 0.35 seconds
I was wondering whether anyone could help me understand why there is two orders of magnitude between these two solvers, and in particular, why CG seems to be being beaten by LLT, in contrast to the literature? Also, if anyone else has an idea how I could substantially speed up the solvers by using different methods from other packages, then suggestions are welcome!
I am implementing the solvers by the following:
//Conjugate gradients
ConjugateGradient<SparseMatrix<double> > cg;
cg.compute(A);
MatrixXd vGDNF = cg.solve(b);
//SimplicialLLT
SimplicialLLT<SparseMatrix<double>> solver;
MatrixXd vGDNF = solver.compute(A).solve(b);
Here A is that of a finite difference Laplacian, and b is a vector of point sources of the field.
Thanks!
Ben
On this documentation, Poisson_SPD corresponds to a 3D problem which is much harder for direct methods as Cholesky because there are much more fill-in in the factors. As the same documentation page says in the first table, for 2D Laplacian/Poisson problems as yours, SimplicialLLT is the recommended solver.
After some studying, I created a small app that calculates DFTs (Discrete Fourier Transformations) from some input. It works well enough, but it is quite slow.
I read that FFTs (Fast Fourier Transformations) allow quicker calculations, but how are they different? And more importantly, how would I go about implementing them in C++?
If you don't need to manually implement the algorithm, you could take a look at the Fastest Fourier Transform in the West
Even thought it's developed in C, it officially works in C++ (from the FAQ)
Question 2.9. Can I call FFTW from
C++?
Most definitely. FFTW should compile
and/or link under any C++ compiler.
Moreover, it is likely that the C++
template class is
bit-compatible with FFTW's
complex-number format (see the FFTW
manual for more details).
FFT has n*log(n) compexity compared to DFT which has n^2.
There are lot of literature about that, and I strongly advise that you check that first, because such wide topic can not be full explaned here.
http://en.wikipedia.org/wiki/Fast_Fourier_transform (check external links )
If you need library I advise you to use existing one, for instance.
http://www.fftw.org/
This library has efficiently implementation of FFT and is also used in propariaretery software (MATLAB for instance)
Steven Smith's book The Scientist and Engineer's Guide to Digital Signal Processing , specifically Chapter 8 on the DFT and Chapter 12 on the FFT, does a much better job of explaining the two transforms that I ever could.
By the way, the whole book is available for free (link above) and it's a very good introduction to signal processing.
Regarding the C++ code request, I've only used the Fastest Fourier Transform in the West (already cited by superexsl) or DSP libraries such as those from TI or Analog Devices.
The results of a correctly implemented DFT are essentially identical to the results of a correctly implemented FFT (they differ only by rounding errors). As others have pointed out here, the major difference is that of performance. DFT has O(n^2) operations while the FFT has O(nlogn) operations.
The best, most readable publication I have ever found (the one I still refer to) is The Fast Fourier Transform and its Applications by E Oran Brigham. The first few chapters provide a very thorough overview of the continuous and discrete forms of the Fourier Transform. He then uses that to develop the fast version of the DFT based on the Cooley-Tukey Algorithm for the radix-2 (n is a power of 2) and mixed-radix cases (though the latter being somewhat more shallow treatise than the former).
The basic approach in the radix-2 algorithm to perform a linear time operation on the input X and to recursively split the result in half and perform a similar linear time operation on the two halves. The mixed radix case is similar, though you need to divide X into equal portions each time, so it helps if n doesn't have any large prime factors.
I've found this nice explanation with some algorithms described.
FastFourierTransform
About implementation,
first i'd make sure your implementation returns correct results (compare the output from matlab or octave - which have built in fourier transformates)
optimize when necessary, use profilers
don't use unnecesary for loops