Parallelizing numpy.linalg.matrix_power does not increase performance - python-2.7

I need to parallelize the function numpy.linalg.matrix_power, and I use the following code to test how much faster the parallel version can be:
import time as t
import multiprocessing as mp
import numpy as np

def aux_matrix_arg3(A):
    aaa=np.linalg.matrix_power(np.random.randn(199,199),100)
    return 1

N=10000
processes=4
chunksize=N/processes
poolWorkers=mp.Pool(processes=processes)
ti=t.time()
A=poolWorkers.map(aux_matrix_arg3,range(N),chunksize=chunksize)
print 't_iteration 3',t.time()-ti
I have tried with 1 and 4 processes on my laptop and got very similar timings:
4 processes: t_iteration 3 40.7985408306
1 process: t_iteration 3 40.6538720131
Any clue why I do not get any improvement with parallel processes?

The docs say:
For positive integers n, the power is computed by repeated matrix squarings and matrix multiplications. If n == 0, the identity matrix of the same shape as M is returned. If n < 0, the inverse is computed and then raised to the abs(n).
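To make the quoted description concrete, here is a minimal pure-Python sketch of exponentiation by repeated squaring. It is an illustration only, not NumPy's actual implementation; the helper names matmul and matrix_power_by_squaring are made up for this example, and negative powers are left out.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power_by_squaring(m, n):
    """Return m**n for a non-negative integer n using repeated squaring."""
    if n < 0:
        raise ValueError("negative powers need an inverse, omitted in this sketch")
    size = len(m)
    # Start from the identity matrix, which is also the result for n == 0.
    result = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            # The current binary digit of n is set: fold this power into the result.
            result = matmul(result, base)
        n >>= 1
        if n:
            # Square the base for the next binary digit.
            base = matmul(base, base)
    return result
```

For n = 100 this sketch performs 9 matrix products instead of 99, so almost all of the running time sits inside the individual matrix multiplications.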
If your system is set up correctly, BLAS will be used to parallelize the matrix multiplications, and LAPACK (maybe SuperLU, though probably only in the sparse case) will be used to solve the systems of linear equations needed to calculate the inverse. So with very high probability, your plain single-process code is already heavily optimized!
Beyond that, you should be careful, as naive parallelization copies all the data, which can hurt. (Normally one would use mmap arrays to share data instead of copying.)

Related

Reliability of the boost::numeric::ublas::matrix LU inversion method

My problem: inverting square matrices of doubles of size 30x30 (or thereabouts).
I started coding an LU decomposition method in C++, then I discovered the boost::numeric::ublas::matrix library. To spare myself rewriting everything, I used some functions of this library, namely lu_factorize() and then lu_substitute(), in that order, to retrieve the inverse.
I hand-checked the reliability of my inversion function (which, again, only uses the two aforementioned boost functions) by comparing the results on simple square matrices (size 3 or 4), and the results are satisfactory so far.
Now, taking a (30,30) matrix "A" and inverting it to get "A^-1", the product "A*A^-1" gives me a matrix with 1 on the diagonal and some very small numbers everywhere else. Here's a snippet:
1 -1.5e-16 -5.1e-20 2.4e-19
0 1 0 -5.4e-20
1.1e-16 1.1e-16 1 -1.4e-19
6.9e-17 0 -3.3e-17 1
I cannot tell whether these off-diagonal numbers are produced by the matrix library trying to approximate 0 or whether they come from the LU decomposition ...
My question: Have you ever run into this issue with boost ublas? Is this library still reliable, or is it outdated? Is there any way to access the source code of these algorithms?
Thanks in advance
Using C++11, gcc/8.2.0 and boost/1.76.0.

Best way of solving sparse linear systems in C++ - GPU Possible?

I am currently working on a project where we need to minimize
|Ax - b|^2.
In this case, A is a very sparse matrix and A'A has at most 5 nonzero elements in each row.
We are working with images and the dimension of A'A is NxN where N is the number of pixels. In this case N = 76800. We plan to go to RGB and then the dimension will be 3Nx3N.
In MATLAB, solving (A'A)\(A'b) takes about 0.15 s, using doubles.
I have now done some experimenting with Eigen's sparse solvers. I have tried:
SimplicialLLT
SimplicialLDLT
SparseQR
ConjugateGradient
and some different orderings. By far the best so far is
SimplicialLDLT
which takes about 0.35 - 0.5 s using AMDOrdering.
When I use ConjugateGradient, for example, it takes roughly 6 s, using 0 as the initialization.
The code for solving the problem is:
A_tot.makeCompressed();
// Create solver
Eigen::SimplicialLDLT<Eigen::SparseMatrix<float>, Eigen::Lower, Eigen::AMDOrdering<int> > solver;
// Eigen::ConjugateGradient<Eigen::SparseMatrix<float>, Eigen::Lower> cg;
solver.analyzePattern(A_tot);
t1 = omp_get_wtime();
solver.compute(A_tot);
if (solver.info() != Eigen::Success)
{
    std::cerr << "Decomposition Failed" << std::endl;
    getchar();
}
Eigen::VectorXf opt = solver.solve(b_tot);
t2 = omp_get_wtime();
std::cout << "Time for normal equations: " << t2 - t1 << std::endl;
This is the first time I have used sparse matrices and their solvers in C++. For this project speed is crucial, and getting below 0.1 s is the minimum requirement.
I would like some feedback on which strategy would be best here. For example, one is supposed to be able to use SuiteSparse and OpenMP in Eigen. What are your experiences with these types of problems? Is there a way of extracting the structure, for example? And should ConjugateGradient really be that slow?
Edit:
Thanks for some valuable comments! Tonight I have been reading a bit about cuSPARSE from Nvidia. It seems to be able to do factorization and even solve systems. In particular, it seems one can reuse the pattern and so forth. The question is how fast this could be and what the possible overhead is.
Given that the amount of data in my matrix A is the same as in an image, the memory copying should not be much of an issue. Some years ago I wrote software for real-time 3D reconstruction, where you copy data for each frame, and even a slow version still ran at over 50 Hz. So if the factorization is much faster, a speed-up seems possible? In particular, the rest of the project will be on the GPU, so if one can solve it there directly and keep the solution there, it is no drawback, I guess.
A lot has happened in the field of Cuda and I am not really up to date.
Here are two links I found: Benchmark?, MatlabGPU
Your matrix is extremely sparse and corresponds to a discretization on a 2D domain, so it is expected that SimplicialLDLT is the fastest here. Since the sparsity pattern is fixed, call analyzePattern once, and then call factorize instead of compute. This should save some milliseconds. Moreover, since you're working on a regular grid, you might also try to bypass the re-ordering step by using NaturalOrdering (not 100% sure, you have to benchmark it). If that's still not fast enough, you might search for a Cholesky solver tailored for skyline matrices (the Cholesky factorization is much simpler and thus faster in this case).

Halide FFT Implementation Bugs?

I'm attempting to run the Halide FFT implementation found here for benchmarking against FFTW. I'm able to run the implementation as is, but I've encountered some issues when digging a little deeper. The routine fails with errors for different values of H and W (the height and width of the random input image). For example, I get the following error with H=W=5:
Error at ./fft.cpp:603:
Cannot vectorize dimension n0 of function v_S1_R5$6 because the function is scheduled inline.
Aborted (core dumped)
I've been attempting to test on small image sizes (i.e. 5x5) to compare the results of the algorithms, but I can't get the algorithm to complete for any values less than 16, which even at that point makes checking the values a long task. The FFT also fails for values greater than 32, seemingly not working for all non-powers of 2.
Has anyone run into this issue before? Are there any other implementations of FFT in Halide that work for different sized images?
For reference, I'm running the code on RHEL7 using gcc 4.8.3.
I think there are a few issues going on. First, there looks to be a bug for very small FFTs that only use one pass. I think that's what you hit in your first case.
The second issue is that W and H need to be a multiple of the vector size of your target, not necessarily that W and H need to be a power of 2. For example, W = 48, H = 32 seems to work for me. There's a further complication that for real FFTs, one dimension gets internally cut in half (this is how efficient real FFTs are implemented), so if you are on an AVX machine, that dimension must be a multiple of 16 (2x the vector width of 8 floats).
If you want to run on really small FFTs, you could remove the vectorize scheduling directives, then it should work, at least for learning purposes.
However, I would point out that running 5x5 won't be very interesting, because it will be done in just one radix 5 pass, i.e. just a plain old DFT (this also appears to be broken, as you've found). 4x4 (factored into 2 radix 2 passes) will be the smallest interesting FFT. When debugging it, I often used 8x8 FFTs (radix 4, radix 2).
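For checking results at such small sizes, a direct O(n^2) DFT is enough as a reference. The following is a minimal pure-Python sketch, not part of Halide or FFTW; the function names dft and dft2 are made up for this example, and it assumes the common unnormalized forward transform with a negative exponent (the convention FFTW uses for its forward transform), which should be verified against whatever implementation is being compared.

```python
import cmath


def dft(x):
    """Direct 1D DFT: X[k] = sum_t x[t] * exp(-2*pi*i*k*t/n), unnormalized."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def dft2(image):
    """Direct 2D DFT of a list of rows: transform every row, then every column."""
    height = len(image)
    width = len(image[0])
    rows = [dft(list(row)) for row in image]
    # Transform each column of the row-transformed data.
    cols = [dft([rows[r][c] for r in range(height)]) for c in range(width)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [[cols[c][r] for c in range(width)] for r in range(height)]
```

Because it makes no assumption about the size, it works for 5x5 as well as 4x4 or 8x8 inputs, at the cost of being far too slow for real images.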

Multiple matrix-vector calls with CUBLAS

I currently have to perform 128 independent sequential matrix-vector CUBLAS operations. All the matrices and vectors are different. Each independent matrix is stored right after the next in memory and the vectors are likewise stored contiguously in memory (all in row-major form).
A bit more context:
The matrices are (2048 X 8) and the vector is length 2048. The outputs are all independent. Because I have super matrices, I have the following:
matrix[(2048*128)x8]
vector[(2048*128)x1]
output[(8*128)x1]
With cublasSgemv I'm doing a transpose on each mini matrix first and then adding (rather than replacing) the result into memory with:
cublasSgemv(*handle, CUBLAS_OP_T, Bdim, Adim, scale1, d_matrix + offset1, Bdim, d_vector + offset2, 1, scale2, out + offset3, 1);
I am making 128 such calls, which I would like to do in one.
The profiler shows significant performance degradation from making these multiple calls. What is the best way to do multiple matrix-vector operations? Is there a way to batch them together into one fast call?
Are streams the best way to go or is there some way to make a call with relevant offsets (to index into my array of matrices and vectors)? The only other efficient option seemed to be to use a cuSPARSE call and stick all the matrices on the diagonal.
NOTE: I'm not interested in getting the transposes or row/column major ordering in the gemv call correct for this particular question.
Updated
In fact, you have to pay special attention to the row/column-major ordering if you want to speed up your code in this case.
As shown in your revised question, you use row-major matrices. You then have a super-matrix A[(2048*128)x8] and a super-vector V[(2048*128)x1]. Here I assume that you want a col-major matrix output[8x128] (which can be seen as a super-vector [(8*128)x1]), where each column is the result of transpose( miniA[2048x8] ) * miniV[2048x1].
On the other hand, CUBLAS assumes that matrices are stored in column-major order, so some extra matrix transpose routines may be needed to change the ordering.
Since you need 128 independent [8x1] results, it should be possible to calculate the result in 4 CUDA API calls, which should be more efficient than your original 128 calls.
1. Row-major A[(2048*128)x8] can be seen as col-major AA[8x(2048*128)]
B[8x(2048*128)] = AA[8x(2048*128)] * diag( V[(2048*128)x1] ) by 1 dgmm()
2. C[(2048*128)x8] = transpose( B[8x(2048*128)] ) by 1 geam()
3. Col-major C[(2048*128)x8] can be seen as col-major CC[2048x(8*128)]
O[1x(8*128)] = ones[1x2048] * CC[2048x(8*128)] by 1 gemv()
4. Row vector O[1x(8*128)] can be seen as col-major matrix OO[128x8]
output[8x128] = transpose( OO[128x8] ) by 1 geam()
This col-major output[8x128] is what you want.
Since you need to add to rather than replace the existing values, you may need one more call to add the original values to output.
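The arithmetic behind the four steps can be checked on small sizes without a GPU. The sketch below uses plain Python lists in place of device arrays and makes no cuBLAS calls; the function names and the block_rows parameter are made up for this example. It compares the per-block transpose(miniA) * miniV with the "scale every row by its vector entry, then sum each block of rows" formulation that the dgmm, geam and gemv steps implement.

```python
def per_block_gemv(matrix_rows, vector, block_rows):
    """Reference: for each block k, compute transpose(miniA_k) * miniV_k."""
    cols = len(matrix_rows[0])
    out = []
    for lo in range(0, len(matrix_rows), block_rows):
        out.append([sum(matrix_rows[lo + i][j] * vector[lo + i]
                        for i in range(block_rows))
                    for j in range(cols)])
    return out  # out[k][j] corresponds to output[j, k] in the answer


def scale_then_block_sum(matrix_rows, vector, block_rows):
    """Same result via one global row scaling followed by block-wise sums."""
    # Step 1 (the dgmm step): scale row i of the super-matrix by vector[i].
    scaled = [[a * v for a in row] for row, v in zip(matrix_rows, vector)]
    # Steps 2-4 (geam, gemv with a ones vector, geam) amount to summing each
    # block of block_rows consecutive rows column by column.
    cols = len(scaled[0])
    out = []
    for lo in range(0, len(scaled), block_rows):
        block = scaled[lo:lo + block_rows]
        out.append([sum(row[j] for row in block) for j in range(cols)])
    return out
```

With block_rows = 2048, 8 columns and 128 blocks this matches the sizes in the question; the small sizes are only there to make the equivalence easy to verify by hand.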
I have done a very quick launch of the batchCUBLAS SDK example. I have considered 128 independent runs for matrices of size 2048x8 and 8x1. Here are the results on an NVIDIA GeForce GT 540M (compute capability 2.1) and on a Kepler K20c (compute capability 3.5).
For the NVIDIA GeForce GT 540M case, there is no significant improvement for the "streamed" and "batched" versions over the "non-streamed" cuBLAS execution.
For the NVIDIA Kepler K20c, I have obtained
sgemm 1.87 GFlops (non-streamed); 3.08 GFlops (streamed); 6.58 GFlops (batched);
dgemm 1.00 GFlops (non-streamed); 1.43 GFlops (streamed); 6.67 GFlops (batched);
The streamed and batched versions seem to improve noticeably on the non-streamed case for single precision.
Disclaimers
I'm not accounting for transposition, as you do;
The SDK example considers matrix-matrix multiplications, whereas you need matrix-vector multiplications; streaming is possible for gemv, but not batching.
I hope that those partial results could provide you with some useful information.

Compute rank of Matrix

I need to calculate the rank of a 4096x4096 sparse matrix, and I use C/C++ code.
I found some libraries (like Armadillo) that do it, but they're too slow (almost 5 minutes).
I've also tried two open-source alternatives to MATLAB (FreeMat and Octave), but both crashed when I tried to run a test script.
5 minutes isn't that long, but I must get the rank of something like a million matrices, so the faster the better.
Does anyone know a fast library for rank computation?
The Eigen library supports sparse matrices, try it out.
Computing the algebraic rank is O(n^3), where n is the matrix size, so it's inherently slow. You need, e.g., to perform pivoting, and this is slow and inaccurate if your matrix is not well conditioned (for n = 4096, a typical matrix is very ill conditioned).
Now, what is the rank? It is the dimension of the image. It is very difficult to compute when n is large, and it will be spoiled by any small numerical inaccuracy in the input. For n = 4096, unless you happen to have particularly well conditioned matrices, this will prevent you from doing anything useful with a pivoting algorithm.
The best way is in fact to fix a cutoff epsilon, compute the singular values s_1 > ... > s_n and take as the rank the lowest integer r such that sum(s_i^2, i > r) < epsilon^2 * sum(s_i^2).
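The cutoff rule can be written down in a few lines once the singular values are available. The sketch below is in plain Python and assumes the singular values have already been computed by some SVD routine; the function name numerical_rank is made up for this example.

```python
def numerical_rank(singular_values, epsilon):
    """Smallest r such that sum(s_i^2 for i > r) < epsilon^2 * sum(s_i^2)."""
    s = sorted(singular_values, reverse=True)
    total = sum(v * v for v in s)
    if total == 0.0:
        # All singular values are zero: the matrix is the zero matrix.
        return 0
    threshold = epsilon * epsilon * total
    # Walk from the smallest singular value upwards, accumulating the tail
    # energy, and stop as soon as dropping one more value would exceed it.
    r = len(s)
    tail = 0.0
    while r > 0 and tail + s[r - 1] ** 2 < threshold:
        tail += s[r - 1] ** 2
        r -= 1
    return r
```

For example, with singular values 3, 2, 1e-9, 1e-12 and epsilon = 1e-6, the two tiny values fall below the cutoff and the function returns 2.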
You thus need a sparse SVD routine, e.g. from there.
This may not be faster, but at the very least it will be correct.
You can request fewer singular values (only as many as you need) to speed things up. This is a tough problem, and with no info on the background and how you got these matrices, there is nothing more we can do.
Try the following code (the documentation is here).
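It is an example of calculating the rank of the matrix A with the Eigen library:
#include <Eigen/Dense>
using namespace Eigen;

MatrixXd A(2,2);
A << 1 , 0, 1, 0;          // both rows are (1, 0)
FullPivLU<MatrixXd> luA(A); // LU decomposition with full pivoting
int rank = luA.rank();      // rank is 1 for this A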