Many-vertex test problems for the simplex method - linear-programming

I'm looking for ways to generate test problems for simplex-method linear programming solvers
(A x <= b, x >= 0)
that have many vertices, so (I believe) would make difficult test problems.
There's quite a bit of theory that looks relevant, e.g.
How many vertices can a convex polytope have?
But I don't see how to turn this into code that generates A and b.
I don't need all the vertices, and Vertex enumeration
would explode memory anyway.
For example, a 1000 x 1000 assignment problem gives a sparse 2k x 1m A matrix
with 2m non-zeros.
GLPK simplex solves this in 34 seconds -- not much of a test case.
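For concreteness, here is a minimal sketch of how the assignment-problem constraints above could be generated in sparse coordinate (COO) form. The function name `assignment_lp` and the row/column numbering are assumptions made for illustration; the post does not show the generator that was actually used.

```python
def assignment_lp(n):
    """Constraint data for an n x n assignment problem in COO form.

    Variable x[i][j] is column i*n + j. Rows 0..n-1 are the row sums,
    rows n..2n-1 are the column sums (hypothetical numbering).
    """
    rows, cols, vals = [], [], []
    for i in range(n):
        for j in range(n):
            col = i * n + j
            rows += [i, n + j]      # x[i][j] appears in row-sum i and column-sum j
            cols += [col, col]
            vals += [1.0, 1.0]
    b = [1.0] * (2 * n)             # every row and column sums to 1
    return rows, cols, vals, b


rows, cols, vals, b = assignment_lp(3)
print(len(b), max(cols) + 1, len(vals))   # 6 9 18
```

With n = 1000 this gives the 2k rows, 1m columns and 2m non-zeros quoted above; the triplets can be handed to whatever sparse format the solver expects.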

Extended Latin squares give
LP matrices with 4n^3 rows (constraints), n^4 columns (variables), and 4 non-zeros in each column.
For example, n=16 -- 2^14 rows, 2^16 columns, 2^18 non-zeros --
runs for 10 hours
in the opensource GLPK simplex solver,
on my 2.7 GHz iMac.
(The
Klee-Minty cube,
which was once a difficult test case for simplex methods,
runs in < 1 second in GLPK simplex, with d=200.
Some intuition on why some LP problems are difficult would be welcome.)
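As a point of comparison, a Klee-Minty instance is easy to generate. The sketch below uses one common textbook formulation (maximize sum 2^(d-j) x_j subject to 2 * sum_{j<i} 2^(i-j) x_j + x_i <= 5^i, x >= 0); whether this is the exact variant that was run in GLPK is an assumption, and `klee_minty` is a hypothetical helper name.

```python
def klee_minty(d):
    """Return (c, A, b) for: maximize c.x subject to A x <= b, x >= 0."""
    c = [2 ** (d - 1 - j) for j in range(d)]
    b = [5 ** (i + 1) for i in range(d)]
    A = [[0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i):
            A[i][j] = 2 ** (i - j + 1)   # below-diagonal coefficients
        A[i][i] = 1
    return c, A, b


c, A, b = klee_minty(3)
print(A)   # [[1, 0, 0], [4, 1, 0], [8, 4, 1]]
print(b)   # [5, 25, 125]
```

Python integers are exact, but for d=200 the right-hand side reaches 5^200, far outside double precision, so the numbers a floating-point solver actually sees may differ from the exact problem.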

Related

Parallelizing numpy.linalg.matrix_power does not increase performance

I need to parallelize the function numpy.linalg.matrix_power, and I use the following code to test how much faster the parallel version can be:
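import multiprocessing as mp
import time as t
import numpy as np

def aux_matrix_arg3(A):
    # The argument is ignored; each call raises a fresh random 199x199 matrix to the 100th power.
    aaa = np.linalg.matrix_power(np.random.randn(199, 199), 100)
    return 1

N = 10000
processes = 4
chunksize = N // processes  # integer number of tasks per chunk
poolWorkers = mp.Pool(processes=processes)
ti = t.time()
A = poolWorkers.map(aux_matrix_arg3, range(N), chunksize=chunksize)
print 't_iteration 3', t.time() - ti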
I have tried with 1 and 4 processes on my laptop and got very similar performance:
4 processes: t_iteration 3 40.7985408306
1 process: t_iteration 3 40.6538720131
Any clue why I do not get any improvement with parallel processes?
The docs say:
For positive integers n, the power is computed by repeated matrix squarings and matrix multiplications. If n == 0, the identity matrix of the same shape as M is returned. If n < 0, the inverse is computed and then raised to the abs(n).
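The repeated-squaring scheme that the docs describe can be sketched in plain Python. This only illustrates the idea for n >= 0 and is not NumPy's actual code; `mat_mul` and `matrix_power_sketch` are hypothetical names.

```python
def mat_mul(X, Y):
    """Naive product of two matrices stored as lists of rows."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def matrix_power_sketch(M, n):
    """M**n for a square matrix M and integer n >= 0, by repeated squaring."""
    if n < 0:
        raise ValueError("negative powers (inverse first) are not handled here")
    size = len(M)
    result = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in M]
    while n > 0:
        if n & 1:                     # this binary digit of n is set
            result = mat_mul(result, base)
        base = mat_mul(base, base)    # square for the next binary digit
        n >>= 1
    return result


print(matrix_power_sketch([[1, 1], [1, 0]], 10))   # [[89, 55], [55, 34]]
```

The number of matrix products grows with log2(n) rather than n, so for the power 100 only around ten products are needed, each of which a tuned BLAS can already spread over several cores.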
If your system is set up correctly, BLAS will be used to parallelize the matrix multiplications, and LAPACK (maybe SuperLU, though probably only in the sparse case) will be used to solve the systems of linear equations needed to calculate the inverse. So, with very high probability, your naive code is already highly optimized!
Despite that, you should be careful, as naive parallelization copies all the data, which can hurt. (Normally one would use mmap arrays to share data instead of copying.)

Best way of solving sparse linear systems in C++ - GPU Possible?

I am currently working on a project where we need to minimize
|Ax - b|^2.
In this case, A is a very sparse matrix and A'A has at most 5 nonzero elements in each row.
We are working with images and the dimension of A'A is NxN where N is the number of pixels. In this case N = 76800. We plan to go to RGB and then the dimension will be 3Nx3N.
In matlab solving (A'A)\(A'b) takes about 0.15 s, using doubles.
I have now done some experimenting with Eigen's sparse solvers. I have tried:
SimplicialLLT
SimplicialLDLT
SparseQR
ConjugateGradient
and some different orderings. By far the best so far is
SimplicialLDLT
which takes about 0.35 - 0.5 s using AMDOrdering.
When I use ConjugateGradient, for example, it takes roughly 6 s, using 0 as the initialization.
The code for solving the problem is:
A_tot.makeCompressed();
// Create solver
Eigen::SimplicialLDLT<Eigen::SparseMatrix<float>, Eigen::Lower, Eigen::AMDOrdering<int> > solver;
// Eigen::ConjugateGradient<Eigen::SparseMatrix<float>, Eigen::Lower> cg;
solver.analyzePattern(A_tot);
t1 = omp_get_wtime();
solver.compute(A_tot);
if (solver.info() != Eigen::Success)
{
    std::cerr << "Decomposition Failed" << std::endl;
    getchar();
}
Eigen::VectorXf opt = solver.solve(b_tot);
t2 = omp_get_wtime();
std::cout << "Time for normal equations: " << t2 - t1 << std::endl;
This is the first time I have used sparse matrices and their solvers in C++. For this project speed is crucial, and getting below 0.1 s is a requirement.
I would like to get some feedback on which would be the best strategy here. For example, one is supposed to be able to use SuiteSparse and OpenMP in Eigen. What are your experiences with these types of problems? Is there a way of extracting the structure, for example? And should ConjugateGradient really be that slow?
Edit:
Thanks for some valuable comments! Tonight I have been reading a bit about cuSPARSE from Nvidia. It seems to be able to do factorisation and even solve systems. In particular, it seems one can reuse the sparsity pattern and so forth. The question is how fast this could be and what the possible overhead is.
Given that the amount of data in my matrix A is the same as in an image, the memory copying should not be much of an issue. Some years ago I wrote software for real-time 3D reconstruction, where you copy data for each frame, and even a slow version still runs at over 50 Hz. So if the factorization is much faster, is it a possible speed-up? In particular, the rest of the project will be on the GPU, so if one can solve it there directly and keep the solution there, there is no drawback, I guess.
A lot has happened in the field of CUDA and I am not really up to date.
Here are two links I found: Benchmark?, MatlabGPU
Your matrix is extremely sparse and corresponds to a discretization on a 2D domain, so it is expected that SimplicialLDLT is the fastest here. Since the sparsity pattern is fixed, call analyzePattern once, and then factorize instead of compute. This should save some milliseconds. Moreover, since you're working on a regular grid, you might also try to bypass the re-ordering step using NaturalOrdering (not 100% sure, you have to bench). If that's still not fast enough, you might search for a Cholesky solver tailored for skyline matrices (the Cholesky factorization is much simpler and thus faster in this case).

How can I get eigenvalues and eigenvectors fast and accurate?

I need to compute the eigenvalues and eigenvectors of a big matrix (about 1000*1000 or even more). Matlab works very fast but it does not guarantee accuracy. I need this to be pretty accurate (about 1e-06 error is ok) and within a reasonable time (an hour or two is ok).
My matrix is symmetric and pretty sparse. The exact values are: ones on the main diagonal, on the diagonal below it, and on the diagonal above it.
How can I do this? C++ is the most convenient for me.
MATLAB does not guarantee accuracy
I find this claim unreasonable. On what grounds do you say that you can find a (significantly) more accurate implementation than MATLAB's highly refined computational algorithms?
AND... using MATLAB's eig, the following is computed in less than half a second:
%// Generate the input matrix
X = ones(1000);
A = triu(X, -1) + tril(X, 1) - X;
%// Compute eigenvalues
v = eig(A);
It's fast alright!
I need this to be pretty accurate (about 1e-06 error is OK)
Remember that solving eigenvalues accurately is related to finding the roots of the characteristic polynomial. This specific 1000x1000 matrix is very ill-conditioned:
>> cond(A)
ans =
1.6551e+003
A general rule of thumb is that for a condition number of 10^k, you may lose up to k digits of accuracy (on top of what the numerical method itself loses to finite-precision arithmetic).
So in your case, I'd expect the results to be accurate up to an approximate error of 10^-3.
If you're not opposed to using a third party library, I've had great success using the Armadillo linear algebra libraries.
For the example below, arma is the namespace they like to use, vec is a vector, mat is a matrix.
arma::vec getEigenValues(arma::mat M) {
    return arma::eig_sym(M);
}
You can also serialize the data directly into MATLAB and vice versa.
Your system is tridiagonal and a (symmetric) Toeplitz matrix. I'd guess that Eigen and Matlab's eig have special cases to handle such matrices. There is a closed-form solution for the eigenvalues in this case (reference (PDF)). In Matlab for your matrix this is simply:
n = size(A,1);
k = (1:n).';
v = 1-2*cos(pi*k./(n+1));
This can be further optimized by noting that the eigenvalues are centered about 1 and thus only half of them need to be computed:
n = size(A,1);
if mod(n,2) == 0
    k = (1:n/2).';
    u = 2*cos(pi*k./(n+1));
    v = 1+[u;-u];
else
    k = (1:(n-1)/2).';
    u = 2*cos(pi*k./(n+1));
    v = 1+[u;0;-u];
end
I'm not sure how you're going to get faster and more accurate than that (other than performing a refinement step using the eigenvectors and optimization) with simple code. The above should be easy to translate to C++ (or use Matlab's codegen to generate C/C++ code that uses this or eig). However, your matrix is still ill-conditioned. Just remember that estimates of accuracy are worst case.
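The same closed form carries over to other languages almost line for line. Here is a Python sketch using only the standard library; it uses the general formula a + 2*b*cos(k*pi/(n+1)) for a tridiagonal Toeplitz matrix with diagonal a and off-diagonal b, which for a = b = 1 gives the same set of values as 1 - 2*cos(...) above, just in the opposite order. The helper names are assumptions, and `char_poly` is included only as a sanity check.

```python
import math


def toeplitz_tridiag_eigs(n, a=1.0, b=1.0):
    """Eigenvalues of the n x n tridiagonal Toeplitz matrix (a on the diagonal, b beside it)."""
    return sorted(a + 2.0 * b * math.cos(math.pi * k / (n + 1))
                  for k in range(1, n + 1))


def char_poly(n, lam, a=1.0, b=1.0):
    """det(A - lam*I) via the three-term recurrence for tridiagonal matrices."""
    prev, cur = 1.0, a - lam
    for _ in range(n - 1):
        prev, cur = cur, (a - lam) * cur - b * b * prev
    return cur


eigs = toeplitz_tridiag_eigs(5)
# Each closed-form value should be (numerically) a root of the characteristic polynomial.
print(max(abs(char_poly(5, lam)) for lam in eigs) < 1e-9)
```

The cost is O(n) cosine evaluations, so n = 1000 is instantaneous; the eigenvectors also have a closed form (sines of multiples of k*pi/(n+1)), which is not shown here.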

Multiple matrix-vector calls with CUBLAS

I currently have to perform 128 independent sequential matrix-vector CUBLAS operations. All the matrices and vectors are different. Each independent matrix is stored right after the next in memory and the vectors are likewise stored contiguously in memory (all in row-major form).
A bit more context:
The matrices are (2048 X 8) and the vector is length 2048. The outputs are all independent. Because I have super matrices, I have the following:
matrix[(2048*128)x8]
vector[(2048*128)x1]
output[(8*128)x1]
With cublasSgemv I'm doing a transpose on each mini matrix first and then adding (rather than replacing) the result into memory with:
cublasSgemv(*handle, CUBLAS_OP_T, Bdim, Adim, scale1, d_matrix + offset1, Bdim, d_vector + offset2, 1, scale2, out + offset3, 1);
I am making 128 such calls which I would like to do in one.
The profiler shows significant performance degradation from making these multiple calls. What is the best way to do multiple matrix-vector operations? Is there a way to batch them together into one fast call?
Are streams the best way to go, or is there some way to make a call with relevant offsets (to index into my array of matrices and vectors)? The only other efficient option seemed to be to use a cuSPARSE call and stick all the matrices on the diagonal.
NOTE: I'm not interested in getting the transposes or row/column major ordering in the gemv call correct for this particular question.
Updated
In fact, you have to pay special attention to the row/column-major ordering if you want to speed up your code in this case.
As shown in your revised question, you use row-major matrices, so you have a super-matrix A[(2048*128)x8] and a super-vector V[(2048*128)x1]. Here I assume that you want a col-major matrix output[8x128] (which can be seen as a super-vector [(8*128)x1]), where each column is the result of transpose( miniA[2048x8] ) * miniV[2048x1].
On the other hand, CUBLAS assumes that matrices are stored in column-major order, so some extra matrix transpose routines may be needed to change the ordering.
Since you need 128 independent [8x1] results, it should be possible to calculate them in 4 CUDA API calls, which should be more efficient than your original 128 calls.
1. Row-major A[(2048*128)x8] can be seen as col-major AA[8x(2048*128)]
   B[8x(2048*128)] = AA[8x(2048*128)] * diag( V[(2048*128)x1] ) by 1 dgmm()
2. C[(2048*128)x8] = transpose( B[8x(2048*128)] ) by 1 geam()
3. Col-major C[(2048*128)x8] can be seen as col-major CC[2048x(8*128)]
   O[1x(8*128)] = ones[1x2048] * CC[2048x(8*128)] by 1 gemv()
4. Row vector O[1x(8*128)] can be seen as col-major matrix OO[128x8]
   output[8x128] = transpose( OO[128x8] ) by 1 geam()
This col-major output[8x128] is what you want.
Since you need to add rather than replace, you may need one more call to add the original values to output.
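The arithmetic behind the four calls can be checked on the CPU with tiny sizes. The sketch below is plain Python, not CUBLAS: it only mirrors the two numerically relevant steps (scaling each row by its vector entry, as dgmm does, and summing within each block, as the ones-vector gemv does) and skips the two transposes, which only rearrange memory. The function name and the list-of-rows layout are assumptions made for illustration.

```python
def batched_gemv_t(A, V, rows_per_block):
    """For each stacked block, compute transpose(miniA) * miniV.

    A is a list of rows (the stacked super-matrix), V the stacked super-vector.
    """
    n = len(A[0])
    # dgmm analogue: scale every row of A by the matching entry of V.
    B = [[a * v for a in row] for row, v in zip(A, V)]
    # ones-vector gemv analogue: add up the scaled rows inside each block.
    out = []
    for start in range(0, len(B), rows_per_block):
        block = B[start:start + rows_per_block]
        out.append([sum(row[j] for row in block) for j in range(n)])
    return out


A = [[1, 2], [3, 4], [5, 6], [7, 8]]   # two blocks of shape 2 x 2
V = [1, 1, 2, 0]
print(batched_gemv_t(A, V, 2))          # [[4, 6], [10, 12]]
```

Each inner list is one mini result; in the real problem there would be 128 of them with 8 entries each, and the scheme trades 128 small launches for a few large ones at the price of an intermediate the same size as A.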
I have done a very quick launch of the batchCUBLAS SDK example. I have considered 128 independent runs for matrices of size 2048x8 and 8x1. Here are the results on an NVIDIA GeForce GT 540M (compute capability 2.1) and on a Kepler K20c (compute capability 3.5).
For the NVIDIA GeForce GT 540M case, there is no relevant improvement for the "streamed" and "batched" versions against the "non-streamed" cuBLAS execution.
For the NVIDIA Kepler K20c, I have obtained
sgemm 1.87 GFlops (non-streamed); 3.08 GFlops (streamed); 6.58 GFlops (batched);
dgemm 1.00 GFlops (non-streamed); 1.43 GFlops (streamed); 6.67 GFlops (batched);
The streamed and batched cases seem to improve noticeably on the non-streamed case for single precision.
Disclaimers
I'm not accounting for transposition, as you do;
The SDK example considers matrix-matrix multiplications, whereas you need matrix-vector multiplications; streaming is possible for gemv, but not batching.
I hope that those partial results could provide you with some useful information.

Compute rank of Matrix

I need to calculate the rank of a 4096x4096 sparse matrix, and I use C/C++ code.
I found some libraries (like Armadillo) that do it, but they're too slow (almost 5 minutes).
I've also tried two open-source alternatives to Matlab (Freemat and Octave), but both crashed when I tried to run a test script.
5 minutes isn't that much, but I must get the rank of something like a million matrices, so the faster the better.
Does anyone know a fast library for rank computation?
The Eigen library supports sparse matrices, try it out.
Computing the algebraic rank is O(n^3), where n is the matrix size, so it's inherently slow. You need eg. to perform pivoting, and this is slow and inaccurate if your matrix is not well conditioned (for n = 4096, a typical matrix is very ill conditioned).
Now, what is the rank ? It is the dimension of the image. It is very difficult to compute when n is large and it'll be spoiled by any small numerical inaccuracy of the input. For n = 4096, unless you happen to have particularly well conditioned matrices, this will prevent you from doing anything useful with a pivoting algorithm.
The best way is in fact to fix a cutoff epsilon, compute the singular values s_1 > ... > s_n and take as the rank the lowest integer r such that sum(s_i^2, i > r) < epsilon^2 * sum(s_i^2).
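Once the singular values are available, the cutoff rule is only a few lines. This sketch assumes they have already been computed by some SVD routine and are passed in as a plain list; `numerical_rank` is a hypothetical name, and the choice of epsilon is left to the caller.

```python
def numerical_rank(singular_values, eps):
    """Smallest r such that the energy in s_(r+1), ..., s_n is below eps^2 of the total."""
    s2 = sorted((s * s for s in singular_values), reverse=True)
    total = sum(s2)
    if total == 0.0:
        return 0
    threshold = eps * eps * total
    tail = 0.0
    r = len(s2)
    # Walk up from the smallest singular value, growing the discarded tail
    # for as long as it stays below the threshold.
    while r > 0 and tail + s2[r - 1] < threshold:
        tail += s2[r - 1]
        r -= 1
    return r


print(numerical_rank([3.0, 2.0, 1e-9, 0.0], 1e-6))   # 2
```

Accumulating the tail from the small end avoids subtracting nearly equal large numbers, which matters when the discarded singular values are many orders of magnitude below the leading ones.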
You thus need a sparse SVD routine, e.g. from there.
This may not be faster, but at the very least it will be correct.
You can request only as many singular values as you need to speed things up. This is a tough problem, and with no info on the background and how you got these matrices, there is nothing more we can do.
Try the following code (the documentation is here).
It is an example for calculating the rank of the matrix A with Eigen library:
MatrixXd A(2,2);
A << 1 , 0, 1, 0;
FullPivLU<MatrixXd> luA(A);
int rank = luA.rank();