Sparse x dense matrix multiplication performance less efficient than expected - C++

Context: I am using Eigen for an artificial neural network where the typical dimensions are around 1000 nodes per layer. So most of the operations involve multiplying a matrix M of size ~(1000,1000) by either a vector of size 1000 or a batch of B vectors, represented as a matrix of size Bx1000.
After training a neural network, I apply pruning, a common compression technique, which yields a sparse matrix (density of non-empty parameters between 10% and 50%).
Goal: I would like to use sparse matrices primarily for compression and secondarily for performance optimization, but the latter is not the main goal.
Issue:
I am comparing the performance of sparse and dense matrix multiplication (only the multiplication time is measured) for different batch sizes, and I observe the following (using Eigen 3.2.8, a 64-bit MacBook Pro, without OpenMP, and the standard g++):
when B=1 (matrix x vector), sparse matrix operations with density 10% or 30% are more efficient than dense matrix operations - which seems like the expected result: far fewer operations are performed
for B=32:
the time needed for the dense matrix operation is only ~10 times the time needed for B=1 - which is cool - does it show some vectorization effect?
the time needed for the sparse matrix operation is 67 times the time needed for B=1 - which means it is less efficient than processing the 32 vectors independently
[Figure: MxN multiplication time (ms) for M sparse/dense, with N of size 1000xB]
[Figure: the same numbers, shown as time per vector for different batch sizes, normalized by the time for B=1. The time per vector clearly decreases for the dense matrix as the batch size grows, and increases for the sparse matrix, which shows something is wrong.]
Code:
I am using the following types for sparse and dense matrices:
typedef SparseMatrix<float> spMatFloat;
typedef Matrix<float, Dynamic, Dynamic, RowMajor> deMatRowFloat;
the operation I am benchmarking is the following:
o.noalias()=m*in.transpose();
where o is a dense matrix (1000xB), m is either a dense matrix (1000x1000) or the corresponding sparse matrix obtained with m.sparseView(), and in is a dense matrix (Bx1000)
The full code is below (the time is averaged over 20 different random matrices, and each multiplication is run 50 times); times for B=32 and B=1 are shown below.
Any feedback/intuition is welcome!
batch 1 ratio 0.3 dense 0.32 sparse 0.29
batch 32 ratio 0.3 dense 2.75 sparse 15.01
#include <Eigen/Sparse>
#include <Eigen/Dense>
#include <iostream>
#include <stdlib.h>
#include <boost/timer/timer.hpp>

using namespace Eigen;
using namespace boost::timer;

typedef SparseMatrix<float> spMatFloat;
typedef Matrix<float, Dynamic, Dynamic, RowMajor> deMatRowFloat;

void bench_Sparse(const spMatFloat &m, const deMatRowFloat &in, deMatRowFloat &o) {
  o.noalias() = m*in.transpose();
}

void bench_Dense(const deMatRowFloat &m, const deMatRowFloat &in, deMatRowFloat &o) {
  o.noalias() = m*in.transpose();
}

int main(int argc, const char **argv) {
  float ratio = 0.3;
  int iter = 20;
  int batch = 32;
  float t_dense = 0;
  float t_sparse = 0;

  deMatRowFloat d_o1(batch, 1000);
  deMatRowFloat d_o2(batch, 1000);
  for (int k = 0; k < iter; k++) {
    // build a random dense 1000x1000 matrix with ~ratio density, and the corresponding sparse matrix
    deMatRowFloat d_m = deMatRowFloat::Zero(1000, 1000);
    deMatRowFloat d_b = deMatRowFloat::Random(batch, 1000);
    for (int h = 0; h < ratio*1000000; h++) {
      int i = rand()%1000;
      int j = rand()%1000;
      d_m(i, j) = (rand()%1000)/500. - 1;
    }
    spMatFloat s_m = d_m.sparseView();

    { // dense x dense
      cpu_timer timer;
      for (int n = 0; n < 50; n++) bench_Dense(d_m, d_b, d_o1);
      cpu_times const elapsed_times(timer.elapsed());
      nanosecond_type const elapsed(elapsed_times.system + elapsed_times.user);
      t_dense += elapsed/1000000.;
    }

    { // sparse x dense
      cpu_timer timer;
      for (int n = 0; n < 50; n++) bench_Sparse(s_m, d_b, d_o2);
      cpu_times const elapsed_times(timer.elapsed());
      nanosecond_type const elapsed(elapsed_times.system + elapsed_times.user);
      t_sparse += elapsed/1000000.;
    }
  }
  std::cout << "batch\t" << batch << "\tratio\t" << ratio
            << "\tdense\t" << t_dense/50/iter
            << "\tsparse\t" << t_sparse/50/iter << std::endl;
}
New results after ggael's suggestion: I tried the different possible combinations and indeed found huge performance differences when changing the RowMajor/ColMajor storage of M and B.
To summarize, I am interested in computing M*B where M is (1000,1000) and B is (1000,batch), and in comparing the performance of M sparse vs. dense as the batch size grows.
I tested 3 configurations:
M dense, B dense
M sparse, B dense
M sparse, B dense, but the multiplication M*B is done manually, column by column (see the sketch below)
The results are as follows, where each number is the ratio of the time per column for B=32 to the time for B=1, with matrix M of density 0.3:
The initially reported problem was the worst case (M ColMajor, B RowMajor). For (M RowMajor, B ColMajor), there is a 5x speedup between B=32 and B=1, and the performance of the sparse matrix is almost equivalent to that of the dense matrix.
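For reference, a minimal sketch of what configuration 3 looks like (the function name and the column-major dense typedef are mine, not from the benchmark code above):
typedef SparseMatrix<float> spMatFloat;                 // column-major sparse (default)
typedef Matrix<float, Dynamic, Dynamic> deMatColFloat;  // column-major dense (default)

// M is (1000,1000) sparse, B and O are (1000,batch) dense;
// M*B is computed one column of B at a time
void bench_Sparse_by_column(const spMatFloat &M, const deMatColFloat &B, deMatColFloat &O) {
  for (int j = 0; j < B.cols(); ++j)
    O.col(j).noalias() = M * B.col(j);
}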

In Eigen, for dense algebra, both matrix-vector and matrix-matrix products are highly optimized and take full advantage of vectorization. As you observed, matrix-matrix products exhibit much higher efficiency. This is because matrix-matrix products can be further optimized by increasing the ratio of arithmetic operations to memory accesses, and by exploiting memory caches.
Then regarding sparse-dense products, there are two strategies:
Process the dense right-hand side one column at a time, and thus scan the sparse matrix multiple times. For this strategy, it is better to use column-major storage for the dense matrices (right-hand side and result). In Eigen 3.2, this strategy has to be emulated by scanning the columns manually.
Scan the sparse matrix only once, and process the rows of the dense right-hand side and result in the innermost loop. This is the default strategy in Eigen 3.2. In this case, it is better to use row-major storage for the dense matrices (Matrix<float,Dynamic,32,RowMajor>).
Finally, in either case, you could try both a row-major and a column-major storage for the sparse matrix, and figure out which combination of strategy and sparse storage order works best in your case.
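To make the storage-order choices concrete, here is a small illustration (the typedef names are mine; pick the combination that matches the chosen strategy):
// Strategy 1: column-major dense right-hand side and result,
// processed one column at a time (as in the column-by-column sketch above)
typedef Matrix<float, Dynamic, Dynamic, ColMajor> DenseCM;
// Strategy 2 (Eigen 3.2 default): row-major dense right-hand side and result
typedef Matrix<float, Dynamic, 32, RowMajor> DenseRM32;
// The sparse matrix itself can be stored either way; try both
typedef SparseMatrix<float, ColMajor> SparseCM;
typedef SparseMatrix<float, RowMajor> SparseRM;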

Related

How to optimize matrix product of sparse and dense matrices in eigen when the result is selfadjoint

I am working with square matrices of type std::complex<double>; in particular, a sparse matrix S and a self-adjoint, dense matrix H. I would like to compute a product of the form S*H*S.adjoint() and add it to another dense, self-adjoint matrix J. A straightforward way to do this in Eigen would be:
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <complex>
Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> J, H;
Eigen::SparseMatrix<std::complex<double>> S;
// ...
// Set H, and J to be some self-adjoint matrices of the same size, and S also same
// size, but not necessarily self-adjoint.
// ...
J += S*H*S.adjoint();
But because H and J are self-adjoint and by the form of the product S*H*S.adjoint(), we know that J will remain self-adjoint after the operation. So there is really no need to compute the entire dense matrix result S*H*S.adjoint() and we could probably save some computation time by only computing the lower- or upper-triangular part of the result and adding that to the corresponding part of the matrix J. Eigen provides an API for this sort of optimization, but I'm not able to use it in this case. For example if instead of the sparse matrix S we had a dense matrix D, then doing
J += D*H*D.adjoint();
should be less efficient than
J.triangularView<Eigen::Lower>() = D*H*D.adjoint();
or
J.triangularView<Eigen::Lower>() = D*H.selfadjointView<Eigen::Lower>()*D.adjoint();
but the API doesn't seem to provide this level of optimization when computing the former product with a sparse matrix S instead of the dense matrix D. That is,
J.triangularView<Eigen::Lower>() = S*H*S.adjoint();
doesn't compile. So my question is: is there a way to tell Eigen to only compute the lower- (or upper-) triangular part of the matrix S*H*S.adjoint() and add it to the lower- (or upper-) triangular part of the self-adjoint matrix J to improve performance?
Perhaps even better would be an overload of a rank 1 update that looked something like
J.selfadjointView<Eigen::Lower>().rankUpdate(S,H);
Of course the current API doesn't support this form, and getting the desired result would require taking the square root of H, calling it G, and doing
J.selfadjointView<Eigen::Lower>().rankUpdate(S*G);
but although this should give the correct result, taking the square root is probably super expensive compared to the rest, so this would probably be slower.
The best performance I've found so far is
J.noalias() += S*H*S.adjoint();
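For illustration, one possible manual workaround (not a built-in Eigen API; the helper name is hypothetical) is to compute only the lower triangle column by column, exploiting the sparsity of S.adjoint() with an InnerIterator:
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <complex>

typedef std::complex<double> cplx;

// J += lower triangle of S*H*S.adjoint(), built one column at a time.
// Assumes all matrices are n x n, with H and J self-adjoint.
void addLowerTriangle(Eigen::MatrixXcd &J,
                      const Eigen::SparseMatrix<cplx> &S,
                      const Eigen::MatrixXcd &H) {
  const Eigen::Index n = J.rows();
  const Eigen::MatrixXcd T = S * H;                   // dense intermediate
  const Eigen::SparseMatrix<cplx> Sa = S.adjoint();   // explicit sparse copy, column-major
  for (Eigen::Index j = 0; j < n; ++j) {
    // column j of S*H*S.adjoint() equals T * Sa.col(j); keep only rows j..n-1
    for (Eigen::SparseMatrix<cplx>::InnerIterator it(Sa, j); it; ++it)
      J.col(j).tail(n - j).noalias() += T.col(it.row()).tail(n - j) * it.value();
  }
}
Whether this beats the plain J.noalias() += S*H*S.adjoint() depends on the sparsity of S, since the dense intermediate T is still n x n.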

SIMD matrix multiplication for a rectangular matrix

Is it possible to do a generic matrix multiplication for a rectangular matrix using SIMD instructions? So far, all the examples that I came across online are for square matrices (N x N) where N is known. I understand that SIMD instructions have nothing to do with matrix size and are more about data-level parallelism.
Is it a good idea, or even possible, to have a matrix multiplication using SIMD instructions for a matrix of size M x N, where M and N are set in the constructor of the class?
class MatrixMN {
public:
  MatrixMN(size_t rows, size_t cols) {..}
  MatrixMN operator*(const MatrixMN& m) const {
    // check for dimension match
    // USE SIMD INSTRUCTION TO PERFORM MATRIX MULTIPLICATION ??
  }
};
The matrix is double precision, and since we are using older hardware we only have access to __m128d, which results in loading two doubles at a time.
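For what it's worth, here is a rough, hedged sketch of a rectangular double-precision multiply using only SSE2 __m128d (the row-major layout, unaligned loads, and the function name are my assumptions, not from the question):
#include <emmintrin.h>   // SSE2: __m128d
#include <cstddef>

// Row-major C = A * B, where A is rows x inner and B is inner x cols.
void matmul_sse2(const double* A, const double* B, double* C,
                 std::size_t rows, std::size_t inner, std::size_t cols) {
  for (std::size_t i = 0; i < rows; ++i) {
    std::size_t j = 0;
    // process two output columns at a time with one __m128d accumulator
    for (; j + 1 < cols; j += 2) {
      __m128d acc = _mm_setzero_pd();
      for (std::size_t k = 0; k < inner; ++k) {
        __m128d a = _mm_set1_pd(A[i * inner + k]);    // broadcast A(i,k)
        __m128d b = _mm_loadu_pd(&B[k * cols + j]);   // B(k,j), B(k,j+1)
        acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
      }
      _mm_storeu_pd(&C[i * cols + j], acc);
    }
    // scalar tail for an odd number of columns
    for (; j < cols; ++j) {
      double s = 0.0;
      for (std::size_t k = 0; k < inner; ++k)
        s += A[i * inner + k] * B[k * cols + j];
      C[i * cols + j] = s;
    }
  }
}
The dimensions are plain runtime values, so the same kernel works for any M x N stored by the class; in practice a cache-blocked variant would be needed for large matrices.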

Using R and Rcpp, how to multiply two matrices that are in sparse Matrix::csr/csc format?

The following code works as expected:
matrix.cpp
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

// [[Rcpp::export]]
SEXP eigenMatTrans(Eigen::MatrixXd A){
  Eigen::MatrixXd C = A.transpose();
  return Rcpp::wrap(C);
}

// [[Rcpp::export]]
SEXP eigenMatMult(Eigen::MatrixXd A, Eigen::MatrixXd B){
  Eigen::MatrixXd C = A * B;
  return Rcpp::wrap(C);
}

// [[Rcpp::export]]
SEXP eigenMapMatMult(const Eigen::Map<Eigen::MatrixXd> A, Eigen::Map<Eigen::MatrixXd> B){
  Eigen::MatrixXd C = A * B;
  return Rcpp::wrap(C);
}
This uses the C++ Eigen classes for matrices; see https://eigen.tuxfamily.org/dox
In R, I can access those functions.
library(Rcpp);
Rcpp::sourceCpp('matrix.cpp');
A <- matrix(rnorm(10000), 100, 100);
B <- matrix(rnorm(10000), 100, 100);
library(microbenchmark);
microbenchmark(eigenMatTrans(A), t(A), A%*%B, eigenMatMult(A, B), eigenMapMatMult(A, B))
This shows that R performs pretty well on reordering (transpose). Multiplying has some advantages with Eigen.
Using the Matrix library, I can convert a normal matrix to a sparse matrix.
Example from https://cmdlinetips.com/2019/05/introduction-to-sparse-matrices-in-r/
library(Matrix);
data<- rnorm(1e6)
zero_index <- sample(1e6)[1:9e5]
data[zero_index] <- 0
A = matrix(data, ncol=1000)
A.csr = as(A, "dgRMatrix");
B.csr = t(A.csr);
A.csc = as(A, "dgCMatrix");
B.csc = t(A.csc);
So if I wanted to multiply A.csr by B.csr using Eigen, how would I do that in C++? I do not want to convert types if I don't have to; it is a memory-size concern.
The A.csr %*% B.csr is not-yet-implemented.
The A.csc %*% B.csc is working.
I would like to microbenchmark the different options and see which is most efficient at different matrix sizes. In the end, I will have a matrix with about 1% density and about 5 million rows and columns ...
There's a reason that dgRMatrix cross-product functions are not yet implemented; in fact, they should not be implemented, because they would enable bad practice.
There are a few performance considerations when working with sparse matrices:
Accessing marginal views against the major marginal orientation is highly inefficient. For instance, a column iterator in a dgRMatrix and a row iterator in a dgCMatrix need to loop through almost all elements of the matrix to find the ones in just that column or row. See this Rcpp gallery post for additional enlightenment.
A matrix cross-product is simply a dot product between all combinations of columns. This means the penalty of using a column iterator in a dgRMatrix (vs. a column iterator in a dgCMatrix) is multiplied by the number of column combinations.
Cross-product functions in R are highly optimized and, in my experience, Eigen, Armadillo, and equivalent STL variants are not significantly faster. They are parallelized, and the Matrix package takes wonderful advantage of these optimized algorithms. I have written C++ parallelized STL cross-product variants using Rcpp structures and I don't see any increase in performance.
If you're really going this route, check out my Rcpp gallery post on Sparse Matrix structures in Rcpp. This is to be preferred to Eigen and Armadillo Sparse Matrices if memory is a concern, as Eigen and Armadillo perform a deep copy rather than a reference to an R object already existing in memory.
At 1% density, the inefficiencies of row iterators will be greater than at say 5 or 10% density. I do most of my tests at 5% density and generally binary operations take 5-10x longer for row iterators than for column iterators.
There may be applications where row-major ordering shines (e.g. see the work by Dmitry Selivanov on CSR matrices and irlba svd), but this is absolutely not one of them; so much so that you are better off doing an in-place conversion to get to a CSC matrix.
tl;dr: column-wise cross-product on row-major matrices is the ultimate in inefficiency.
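For the CSC (dgCMatrix) route recommended above, a minimal RcppEigen sketch might look like this (the function name and the MappedSparseMatrix mapping are my assumptions, not part of the answer):
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

// Multiply two column-major sparse matrices (dgCMatrix) without densifying them.
// [[Rcpp::export]]
SEXP eigenSparseMatMult(const Eigen::MappedSparseMatrix<double> A,
                        const Eigen::MappedSparseMatrix<double> B) {
  Eigen::SparseMatrix<double> C = A * B;   // sparse * sparse -> sparse
  return Rcpp::wrap(C);                    // returned to R as a dgCMatrix
}
From R this would be called as eigenSparseMatMult(A.csc, B.csc); note that the result C is materialized by Eigen, so any zero-copy benefit applies only to the inputs.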

Sparse x dense matrix multiply unexpectedly slow with Armadillo

This is something I just came across. For some reason, multiplying a sparse by a dense matrix in Armadillo is much slower than multiplying a dense by a sparse matrix (i.e., reversing the order).
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat mult_sp_den_to_sp(arma::sp_mat& a, arma::mat& b)
{
  // sparse x dense -> sparse
  arma::sp_mat result(a * b);
  return result;
}

// [[Rcpp::export]]
arma::sp_mat mult_den_sp_to_sp(arma::mat& a, arma::sp_mat& b)
{
  // dense x sparse -> sparse
  arma::sp_mat result(a * b);
  return result;
}
I'm using the RcppArmadillo package to interface Arma with R; RcppArmadillo.h includes armadillo. Here are some timings in R, on a couple of reasonably large matrices:
set.seed(98765)
# 10000 x 10000 sparse matrices, 99% sparse
a <- rsparsematrix(1e4, 1e4, 0.01, rand.x=function(n) rpois(n, 1) + 1)
b <- rsparsematrix(1e4, 1e4, 0.01, rand.x=function(n) rpois(n, 1) + 1)
# dense copies
a_den <- as.matrix(a)
b_den <- as.matrix(b)
system.time(mult_sp_den_to_sp(a, b_den))
# user system elapsed
# 508.66 0.79 509.95
system.time(mult_den_sp_to_sp(a_den, b))
# user system elapsed
# 13.52 0.74 14.29
So the first multiply takes about 35 times longer than the second (all times are in seconds).
Interestingly, if I simply make a temporary sparse copy of the dense matrix, performance is much improved:
// [[Rcpp::export]]
arma::sp_mat mult_sp_den_to_sp2(arma::sp_mat& a, arma::mat& b)
{
  // sparse x dense -> sparse
  // copy dense to sparse, then multiply
  arma::sp_mat temp(b);
  arma::sp_mat result(a * temp);
  return result;
}
system.time(mult_sp_den_to_sp2(a, b_den))
# user system elapsed
# 5.45 0.41 5.86
Is this expected behaviour? I'm aware that with sparse matrices, the exact way in which you do things can have big impacts on the efficiency of your code, much more so than with dense. A 35x difference in speed seems rather large though.
Sparse and dense matrices are stored in a very different way.
Armadillo uses CMS (column-major storage) for dense matrices, and CSC (compressed sparse column) for sparse matrices. From Armadillo's documentation:
Mat, mat, cx_mat: classes for dense matrices, with elements stored in column-major ordering (i.e. column by column)
SpMat, sp_mat, sp_cx_mat: classes for sparse matrices, with elements stored in compressed sparse column (CSC) format
The first thing we have to understand is how much storage space each format requires:
Given the quantities element_size (4 bytes for single precision, 8 bytes for double precision), index_size (4 bytes if using 32-bit integers, or 8 bytes if using 64-bit integers), num_rows (the number of rows of the matrix), num_cols (the number of columns of the matrix), and num_nnz (the number of nonzero elements of the matrix), the following formulas give us the storage space for each format:
storage_cms = num_rows * num_cols * element_size
storage_csc = num_nnz * element_size + num_nnz * index_size + num_cols * index_size
For more details about storage formats see wikipedia, or netlib.
Assuming double precision and 32-bit indices, in your case that means:
storage_cms = 800MB
storage_csc = 12.04MB
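As a quick sanity check of those two figures (not part of the original answer), plugging the question's dimensions into the formulas above:
#include <cstdio>

int main() {
  const double element_size = 8;                       // double precision
  const double index_size   = 4;                       // 32-bit integer indices
  const double num_rows = 10000, num_cols = 10000;
  const double num_nnz  = 0.01 * num_rows * num_cols;  // 1% density -> 1e6 nonzeros

  const double storage_cms = num_rows * num_cols * element_size;
  const double storage_csc = num_nnz * element_size + num_nnz * index_size
                           + num_cols * index_size;

  std::printf("CMS: %.2f MB\n", storage_cms / 1e6);    // 800.00 MB
  std::printf("CSC: %.2f MB\n", storage_csc / 1e6);    // 12.04 MB
}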
So when you are multiplying a sparse x dense (or dense x sparse) pair, you are accessing ~812MB of memory, while you only access ~24MB of memory when multiplying sparse x sparse matrices.
Note that this doesn't include the memory where you write the results, and this can be a significant portion (up to ~800MB in both cases), but I am not very familiar with Armadillo and which algorithm it uses for matrix multiplication, so cannot exactly say how it stores the intermediate results.
Whatever the algorithm, it definitely needs to access both input matrices multiple times, which explains why converting a dense matrix to sparse (which requires only one access to the 800MB of dense matrix), and then doing a sparse x sparse product (which requires accessing 24MB of memory multiple times) is more efficient than dense x sparse and sparse x dense product.
There are also all sorts of cache effects here, which would require the knowledge of the exact implementation of the algorithm and the hardware (and a lot of time) to explain properly, but above is the general idea.
As for why is dense x sparse faster than sparse x dense, it is because of the CSC storage format for sparse matrices. As noted in scipy's documentation, CSC format is efficient for column slicing, and slow for row slicing. dense x sparse multiplication algorithms need column slicing of the sparse matrix, and sparse x dense need row slicing of the sparse matrix. Note that if armadillo used CSR instead of CSC, sparse x dense would be efficient, and dense x sparse wouldn't.
I am aware that this is not a complete answer of all the performance effects you are seeing, but should give you a general idea of what is happening. A proper analysis would require a lot more time and effort to do, and would have to include concrete implementations of the algorithms, and information about the hardware on which it is run.
This should be fixed in the upcoming Armadillo 8.500, which will be wrapped in RcppArmadillo 0.8.5 Real Soon Now. Specifically:
sparse matrix transpose is much faster
(sparse x dense) reimplemented as ((dense^T) x (sparse^T))^T, taking advantage of the relatively speedy (dense x sparse) code
When I tested it, the time taken dropped from ~500 seconds to about 18 seconds, which is comparable to the other timings.
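For older Armadillo versions, the same trick can be applied by hand; a rough sketch (the function name is mine, and it still pays for the slower pre-8.500 sparse transpose):
// [[Rcpp::export]]
arma::sp_mat mult_sp_den_via_transpose(const arma::sp_mat& a, const arma::mat& b)
{
  // sparse x dense computed as ((dense^T) x (sparse^T))^T,
  // routing the work through the faster dense x sparse path
  arma::mat bt = b.t();            // dense transpose
  arma::sp_mat at = a.t();         // sparse transpose
  arma::sp_mat tmp(bt * at);       // dense x sparse -> sparse
  return arma::sp_mat(tmp.t());    // transpose back
}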

Can Armadillo efficiently multiply sparse-by-sparse and sparse-by-dense matrices into a dense result?

I am using Armadillo for some linear algebra problems. It has SpMat<float> for sparse matrices and Mat<float> for dense matrices.
Suppose I have sparse matrices S_a and S_b, and a dense matrix D. I need to compute the products S_a*S_b and S_a*D; the results will be dense in both cases.
I can convert the sparse matrices into dense matrices and then multiply, but that will be inefficient (these matrices are very large). Is there a way to tell Armadillo to store the results into a dense matrix without performing an intermediate conversion step?
You can use the mat constructor which takes a sparse matrix and converts its data to a dense one:
arma::mat out1(S_a * S_b);
arma::mat out2(S_a * D);
Both multiplication operators for the sparse class (sparse-sparse and sparse-dense) will produce a sparse matrix object as output. (Whether or not it's really sparse will depend on the structure of the inputs.) This can be converted to a dense matrix using the dense matrix constructor with signature: arma::mat(arma::sp_mat).