I am using Eigen on a C++ program for solving linear equation for very small square matrix(4X4).
My test code is like
template<template <typename MatrixType> typename EigenSolver>
Vertor3d solve(){
//Solve Ax = b and A is a real symmetric matrix and positive semidefinite
... // Construct 4X4 square matrix A and 4X1 vector b
EigenSolver<Matrix4d> solver(A);
auto x = solver.solve(b);
... // Compute relative error for validating
}
I test some EigenSolver which include:
FullPixLU
PartialPivLU
HouseholderQR
ColPivHouseholderQR
ColPivHouseholderQR
CompleteOrthogonalDecomposition
LDLT
Direct Inverse
Direct Inverse is:
template<typename MatrixType>
struct InverseSolve
{
private:
MatrixType inv;
public:
InverseSolve(const MatrixType &matrix) :inv(matrix.inverse()) {
}
template<typename VectorType>
auto solve(const VectorType & b) {
return inv * b;
}
};
I found that the fast method is DirectInverse,Even If I linked Eigen with MKL , the result was not change.
This is the test result
FullPixLU : 477 ms
PartialPivLU : 468 ms
HouseholderQR : 849 ms
ColPivHouseholderQR : 766 ms
ColPivHouseholderQR : 857 ms
CompleteOrthogonalDecomposition : 832 ms
LDLT : 477 ms
Direct Inverse : 88 ms
which all use 1000000 matrices with random double from uniform distribution [0,100].I fristly construct upper-triangle and then copy to lower-triangle.
The only problem of DirectInverse is that its relative error slightly larger than other solver but acceptble.
Is there any faster or more felegant solution for my program?Is DirectInverse the fast solution for my program?
DirectInverse does not use the symmetric infomation so why is DirectInverse far faster than LDLT?
Despite what many people suggest of never explicitly computing an inverse when you only want to solve a linear system, for very small matrices this can actually be beneficial, since there are closed-form solutions using co-factors.
All other alternatives you tested will be slower, since they will do pivoting (which implies branching), even for small fixed-sized matrices. Also, most of them will result in more divisions and be not vectorizable as good, as the direct computation.
To increase the accuracy (this technique can actually be used independent of the solver if required), you can refine an initial solution by solving the system again with the residual:
Eigen::Vector4d solveDirect(const Eigen::Matrix4d& A, const Eigen::Vector4d& b)
{
Eigen::Matrix4d inv = A.inverse();
Eigen::Vector4d x = inv * b;
x += inv*(b-A*x);
return x;
}
I don't think Eigen directly provides a way to exploit the symmetry of A here (for the directly computed inverse). You can try hinting that by explicitly copying a selfadjoint view of A into a temporary and hope that the compiler is smart enough to find common sub-expressions:
Eigen::Matrix4d tmp = A.selfadjointView<Eigen::Upper>();
Eigen::Matrix4d inv = tmp.inverse();
To reduce some divisions, you can also compile with -freciprocal-math (on gcc or clang), this will slightly reduce accuracy of course.
If this is really performance critical, try implementing a hand-tuned inverse_4x4_symmetric method.
Exploiting the symmetry of inv * b will unlikely be beneficial for such small matrices.
Related
The following code works as expected:
matrix.cpp
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
// [[Rcpp::export]]
SEXP eigenMatTrans(Eigen::MatrixXd A){
Eigen::MatrixXd C = A.transpose();
return Rcpp::wrap(C);
}
// [[Rcpp::export]]
SEXP eigenMatMult(Eigen::MatrixXd A, Eigen::MatrixXd B){
Eigen::MatrixXd C = A * B;
return Rcpp::wrap(C);
}
// [[Rcpp::export]]
SEXP eigenMapMatMult(const Eigen::Map<Eigen::MatrixXd> A, Eigen::Map<Eigen::MatrixXd> B){
Eigen::MatrixXd C = A * B;
return Rcpp::wrap(C);
}
This is using the C++ eigen class for matrices, See https://eigen.tuxfamily.org/dox
In R, I can access those functions.
library(Rcpp);
Rcpp::sourceCpp('matrix.cpp');
A <- matrix(rnorm(10000), 100, 100);
B <- matrix(rnorm(10000), 100, 100);
library(microbenchmark);
microbenchmark(eigenMatTrans(A), t(A), A%*%B, eigenMatMult(A, B), eigenMapMatMult(A, B))
This shows that R performs pretty well on resorting (transpose). Multiplying has some advantages with eigen.
Using the Matrix library, I can convert a normal matrix to a sparse matrix.
Example from https://cmdlinetips.com/2019/05/introduction-to-sparse-matrices-in-r/
library(Matrix);
data<- rnorm(1e6)
zero_index <- sample(1e6)[1:9e5]
data[zero_index] <- 0
A = matrix(data, ncol=1000)
A.csr = as(A, "dgRMatrix");
B.csr = t(A.csr);
A.csc = as(A, "dgCMatrix");
B.csc = t(A.csc);
So if I wanted to multiply A.csr times B.csr using eigen, how to do that in C++? I do not want to have to convert types if I don't have to. It is a memory size thing.
The A.csr %*% B.csr is not-yet-implemented.
The A.csc %*% B.csc is working.
I would like to microbenchmark the different options, and see how matrix size will be most efficient. In the end, I will have a matrix that is about 1% sparse and have 5 million rows and cols ...
There's a reason that dgRMatrix crossproduct functions are not yet implemented, in fact, they should not be implemented because otherwise they would enable bad practice.
There are a few performance considerations when working with sparse matrices:
Accessing marginal views against the major marginal orientation is highly inefficient. For instance, a column iterator in a dgRMatrix and a row iterator in a dgCMatrix need to loop through almost all elements of the matrix to find the ones in just that column or row. See this Rcpp gallery post for additional enlightenment.
A matrix cross-product is simply a dot product between all combinations of columns. This means the penalty of using a column iterator in a dgRMatrix (vs. a column iterator in a dgCMatrix) is multiplied by the number of column combinations.
Cross-product functions in R are highly optimized, and are not (in my experience) significantly faster than Eigen, Armadillo, equivalent STL variants. They are parallelized, and the Matrix package takes wonderful advantage of these optimized algorithms. I have written C++ parallelized STL cross-product variants using Rcpp structures and I don't see any increase in performance.
If you're really going this route, check out my Rcpp gallery post on Sparse Matrix structures in Rcpp. This is to be preferred to Eigen and Armadillo Sparse Matrices if memory is a concern, as Eigen and Armadillo perform a deep copy rather than a reference to an R object already existing in memory.
At 1% density, the inefficiencies of row iterators will be greater than at say 5 or 10% density. I do most of my tests at 5% density and generally binary operations take 5-10x longer for row iterators than for column iterators.
There may be applications where row-major ordering shines (i.e. see the work by Dmitry Selivanov on CSR matrices and irlba svd), but this is absolutely not one of them, in fact, so much so you are better off doing in-place conversion to get to a CSC matrix.
tl;dr: column-wise cross-product in row-major matrices is the ultimatum of inefficiency.
I'm using Eigen extensively in a scientific application I've been developing for some time. Since I'm implementing a numerical method, a number below a certain threshold (e.g. 1e-15) is not a point of interest, and slowing down calculations, and increasing error rate.
Hence, I want to round off numbers below that threshold to 0. I can do it with a for loop, but hammering multiple relatively big matrices (2M cells and up per matrix) with a for-if loop is expensive and slowing me down since I need to do it multiple times.
Is there a more efficient way to do this with Eigen library?
In other words, I'm trying to eliminate numbers below a certain threshold in my calculation pipeline.
The shortest way to write what you want, is
void foo(Eigen::VectorXf& inout, float threshold)
{
inout = (threshold < inout.array().abs()).select(inout, 0.0f);
}
However, neither comparisons nor the select method get vectorized by Eigen (as of now).
If speed is essential, you need to either write some manual SIMD code, or write a custom functor which supports a packet method (this uses internal functionality of Eigen, so it is not guaranteed to be stable!):
template<typename Scalar> struct threshold_op {
Scalar threshold;
threshold_op(const Scalar& value) : threshold(value) {}
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const Scalar operator() (const Scalar& a) const{
return threshold < std::abs(a) ? a : Scalar(0); }
template<typename Packet>
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const Packet packetOp(const Packet& a) const {
using namespace Eigen::internal;
return pand(pcmp_lt(pset1<Packet>(threshold),pabs(a)), a);
}
};
namespace Eigen { namespace internal {
template<typename Scalar>
struct functor_traits<threshold_op<Scalar> >
{ enum {
Cost = 3*NumTraits<Scalar>::AddCost,
PacketAccess = packet_traits<Scalar>::HasAbs };
};
}}
This can then be passed to unaryExpr:
inout = inout.unaryExpr(threshold_op<float>(threshold));
Godbolt-Demo (should work with SSE/AVX/AVX512/NEON/...): https://godbolt.org/z/bslATI
It might actually be that the only reason for your slowdown are subnormal numbers. In that case, a simple
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
should do the trick (cf: Why does changing 0.1f to 0 slow down performance by 10x?)
Eigen has a method called UnaryExpr which applies a given function pointer to every coefficient in a matrix (it has sparse and array variants too).
Will test its performance and update this answer accordingly.
Context: I am using Eigen for Artificial Neural Network where the typical dimensions are around 1000 nodes per layer. So most of the operations are to multiplying matrix M of size ~(1000,1000) with a vector of size 1000 or a batch of B vectors, which are represented as matrices of size Bx1000.
After training a neural network, I am using pruning - which is a common compression technique which ends up with sparse matrix (density of non empty parameters between 10 and 50%).
Goal: I would like to use sparse matrix for compression purpose and secondarily for performance optimization but it is not the main goal
Issue:
I am comparing performance of sparse and dense matrix multiplication (only multiplication time is computed) for different batch sizes and I am observing the following (Using Eigen 3.2.8, MacBook Pro 64bits, without open_mp, and using standard g++):
when B=1 (Matrix x Vector) - sparse matrix operations with density 10% or 30% is more efficient than dense matrix operations - which seems expected result: far less operations are performed
for B=32:
the time needed for dense matrix operation is only ~10 times the time need for B=1 - which is cool - does it shows some vectorization effect?
the time needed for sparse matrix operation is 67 times the time needed for B=1 - which means that it is less efficient than processing the 32 vectors independently
MxN multiplication time (ms) for M sparse/dense, and N of size 1000xB
Same numbers but showing the time per vector in a batch of different size for sparse and dense matrix. We see clearly the decrease of time for dense matrix when batch size increase, and the augmentation for sparse matrix showing some wrong. Normalized with time for B=1
Code:
I am using the following types for sparse and dense matrices:
typedef SparseMatrix<float> spMatFloat;
typedef Matrix<float, Dynamic, Dynamic, RowMajor> deMatRowFloat;
the operation I am benchmarking is the following:
o.noalias()=m*in.transpose();
where o is a dense matrix (1000xB), m is either a dense matrix (1000x1000) or the corresponding sparse matrix obtained with m.sparseView(), and in is a dense matrix (Bx1000)
A full code is below (averaging time for 20 different random matrices, and running each multiplication 50 times) - time for B=32 and B=1 are below.
Any feedback/intuition is welcome!
batch 1 ratio 0.3 dense 0.32 sparse 0.29
batch 32 ratio 0.3 dense 2.75 sparse 15.01
#include <Eigen/Sparse>
#include <Eigen/Dense>
#include <stdlib.h>
#include <boost/timer/timer.hpp>
using namespace Eigen;
using namespace boost::timer;
typedef SparseMatrix<float> spMatFloat;
typedef Matrix<float, Dynamic, Dynamic, RowMajor> deMatRowFloat;
void bench_Sparse(const spMatFloat &m, const deMatRowFloat &in, deMatRowFloat &o) {
o.noalias()=m*in.transpose();
}
void bench_Dense(const deMatRowFloat &m, const deMatRowFloat &in, deMatRowFloat &o) {
o.noalias()=m*in.transpose();
}
int main(int argc, const char **argv) {
float ratio=0.3;
int iter=20;
int batch=32;
float t_dense=0;
float t_sparse=0;
deMatRowFloat d_o1(batch,1000);
deMatRowFloat d_o2(batch,1000);
for(int k=0; k<iter; k++) {
deMatRowFloat d_m=deMatRowFloat::Zero(1000,1000);
deMatRowFloat d_b=deMatRowFloat::Random(batch,1000);
for(int h=0;h<ratio*1000000;h++) {
int i=rand()%1000;
int j=rand()%1000;
d_m(i,j)=(rand()%1000)/500.-1;
}
spMatFloat s_m=d_m.sparseView();
{
cpu_timer timer;
for(int k=0;k<50;k++) bench_Dense(d_m,d_b,d_o1);
cpu_times const elapsed_times(timer.elapsed());
nanosecond_type const elapsed(elapsed_times.system+elapsed_times.user);
t_dense+=elapsed/1000000.;
}
{
cpu_timer timer;
for(int k=0;k<50;k++) bench_Sparse(s_m,d_b,d_o2);
cpu_times const elapsed_times(timer.elapsed());
nanosecond_type const elapsed(elapsed_times.system+elapsed_times.user);
t_sparse+=elapsed/1000000.;
}
}
std::cout<<"batch\t"<<batch<<"\tratio\t"<<ratio<<"\tdense\t"<<t_dense/50/iter<<"\tsparse\t"<<t_sparse/50/iter<<std::endl;
}
New Results after ggael suggestion: I tried the different possible combinations and found indeed huge differences of performance when changing M and B RowMajor/ColMajor.
To summarize I am interested in doing M*B where M is (1000,1000) and B is (1000,batch): I am interested in comparing performance of M sparse/dense and when batch is growing.
I tested 3 configurations:
M dense, B dense
M sparse, B dense
M sparse, B dense, but the multiplication of M*B is done manually column by column
results are as following - where the number is the ratio time per column for B=32/time for B=1 with matrix M with density 0.3:
The initial reported problem was the worse case (M ColMajor, B RowMajor). For (M RowMajor, B ColMajor), there is a 5 times speedup between B=32 and B=1 and performance of sparse matrix is almost equivalent to dense matrix.
In Eigen, for dense algebra, both matrix-vector and matrix-matrix products are highly optimized and take full advantage of vectorization. As you observed, matrix-matrix products exhibit a much higher efficiency. This is because matrix-matrix products can further optimized by increasing the ratio between the number of arithmetic operations and memory accesses, and by exploiting memory caches.
Then regarding sparse-dense products, there are two strategies:
Process the dense right hand side one column at once, and thus scan the sparse matrix multiple times. For this strategy, better use a column-major storage for the dense matrices (right-hand side and result). In Eigen 3.2, this strategy has be emulated by scanning the columns manually.
Scan the sparse matrix only once, and process the rows of the dense right hand side and results in the most nested loop. This is default strategy in Eigen 3.2. In this case, better use a row-major storage for the dense matrices (Matrix<float,Dynamic,32,RowMajor>).
Finally, in either case, you could try with both a row-major and column-major storage for the sparse matrix, and figure out which combination of strategy and storage order of the sparse matrix works best in your case.
I am trying to get Eigen3 to solve a linear system A * X = B with an in-place Cholesky decomposition. I cannot afford to have any temporaries of the size of A pushed on the stack, but I am free to destroy A in the process.
Unfortunately,
A.llt().solveInPlace(B);
is out of question, since A.llt() implicitly pushes a temporary matrix of the size of A on the stack. For the LLT case, I could get access to the necessary functionality like so:
// solve A * X = B in-place for positive-definite A
template <typename AType, typename BType>
void AllInPlaceSolve(AType& A, BType& B)
{
typedef Eigen::internal::LLT_Traits<AType, Eigen::Upper> TraitsType;
TraitsType::inplace_decomposition(A);
TraitsType::getL(A).solveInPlace(B);
TraitsType::getU(A).solveInPlace(B);
}
This works fine, but I am worried that:
My matrices A might be positive semidefinite only, in which case a LDLT decomposition is required
The LLT decomposition calculates sqrt() unnecessarily for the solution of the system
I could not find a way to hook in Eigen's LDLT functionality similarly to the code above, since the code is structured very differently.
So my question is: Is there a way to use Eigen3 for solving a linear system using LDLT decompositions using no more scratch space than for the diagonal matrix D?
One option is to allocate a LDLT solver only once, and call the compute method:
LDLT<MatType> ldlt(size);
// ...
ldlt.compute(A);
x = ldlt.solve(b);
If that's also not an option, you can const cast the matrix stored by the ldlt object:
LDLT<MatType> ldlt(MatType::Identity(size,size));
MatType& A = const_cast<MatType&>(ldlt.matrixLDLT());
plays with A, and then:
ldlt.compute(A);
x = ldlt.solve(b);
This is ugly, but this should work as long as MatType is column major.
I'm a new to Eigen and I'm working with sparse LU problem.
I found that if I create a vector b(n), Eigen could compute the x(n) for the Ax=b equation.
Questions:
How to display the L & U, which is the factorization result of the original matrix A?
How to insert non-zeros in Eigen? Right now I just test with some small sparse matrix so I insert non-zeros one by one, but if I have a large-scale matrix, how can I input the matrix in my program?
I realize that this question was asked a long time ago. Apparently, referring to Eigen documentation:
an expression of the matrix L, internally stored as supernodes The only operation available with this expression is the triangular solve
So there is no way to actually convert this to an actual sparse matrix to display it. Eigen::FullPivLU performs dense decomposition and is of no use to us here. Using it on a large sparse matrix, we would quickly run out of memory while trying to convert it to dense, and the time required to compute the factorization would increase several orders of magnitude.
An alternative solution is using the CSparse library from the Suite Sparse as:
extern "C" { // we are in C++ now, since you are using Eigen
#include <csparse/cs.h>
}
const cs *p_matrix = ...; // perhaps possible to use Eigen::internal::viewAsCholmod()
css *p_symbolic_decomposition;
csn *p_factor;
p_symbolic_decomposition = cs_sqr(2, p_matrix, 0); // 1 = ordering A + AT, 2 = ATA
p_factor = cs_lu(p_matrix, m_p_symbolic_decomposition, 1.0); // tol = 1.0 for ATA ordering, or use A + AT with a small tol if the matrix has amostly symmetric nonzero pattern and large enough entries on its diagonal
// calculate ordering, symbolic decomposition and numerical decomposition
cs *L = p_factor->L, *U = p_factor->U;
// there they are (perhaps can use Eigen::internal::viewAsEigen())
cs_sfree(p_symbolic_decomposition); cs_nfree(p_factor);
// clean up (deletes the L and U matrices)
Note that although this does not use expliit vectorization as some Eigen functions do, it is still fairly fast. CSparse is also very compact, it is just a single header and about thirty .c files with no external dependencies. Easy to incorporate in any C++ project. There is no need to actually include all of Suite Sparse.
If you'll use Eigen::FullPivLU::matrixLU() to the original matrix, you'll receive LU decomposition matrix. To display L and U separately, you can use method triangularView<mode>. In Eigen wiki you can find good example of it. Inserting nonzeros into matrices depends on numbers, which you wan't to put. Eigen has convenient syntax, so you can easily insert values in loop:
for(int i=0;i<size;i++)
{
for(int j=size;j>someNumber;j--)
{
matrix(i,j)=yourClass.getNextNumber();
}
}