I would like to know if there is a function or an optimized way to reshape sparse matrices in Eigen.
In the documentation there is no reshape method for such matrices, so I implemented a function myself, but I don't know whether it is optimized (I need it to be as fast as possible). Here is my approach:
Eigen::SparseMatrix<double> reshape_sp(const Eigen::SparseMatrix<double>& x,
                                       lint a, lint b) {
    Eigen::SparseMatrix<double> y(a, b);
    for (int k = 0; k < x.outerSize(); ++k) {
        for (Eigen::SparseMatrix<double>::InnerIterator it(x, k); it; ++it) {
            int pos = it.col() * x.rows() + it.row();
            int col = int(pos / a);
            int row = pos % a;
            y.insert(row, col) = it.value();
        }
    }
    y.makeCompressed();
    return y;
}
For performance, it is absolutely crucial that you call reserve on your matrix. I tested with a 100,000 x 100,000 matrix populated at 1% density. Your version (after fixing the 32-bit overflow in the pos computation) took 3 minutes; this fixed version takes a few seconds:
#include <Eigen/Sparse>
#include <cstdint> // std::int64_t

Eigen::SparseMatrix<double>
reshape(const Eigen::SparseMatrix<double>& orig,
        int rows, int cols)
{
    Eigen::SparseMatrix<double> rtrn(rows, cols);
    rtrn.reserve(orig.nonZeros()); // crucial: avoids repeated reallocation on insert
    using InnerIterator = Eigen::SparseMatrix<double>::InnerIterator;
    for (int k = 0; k < orig.outerSize(); ++k) {
        for (InnerIterator it(orig, k); it; ++it) {
            std::int64_t pos = std::int64_t(it.col()) * orig.rows() + it.row();
            int col = int(pos / rows);
            int row = int(pos % rows);
            rtrn.insert(row, col) = it.value();
        }
    }
    rtrn.makeCompressed();
    return rtrn;
}
An alternative is to work with triplets again. This is a bit slower but less likely to explode in your face the same way insert does. This is particularly helpful for more complex operations like transposing where you cannot guarantee that the insert appends at the end.
Eigen::SparseMatrix<double>
reshape(const Eigen::SparseMatrix<double>& orig,
        int rows, int cols)
{
    using InnerIterator = Eigen::SparseMatrix<double>::InnerIterator;
    using Triplet = Eigen::Triplet<double>;
    std::vector<Triplet> triplets;
    triplets.reserve(std::size_t(orig.nonZeros()));
    for (int k = 0; k < orig.outerSize(); ++k) {
        for (InnerIterator it(orig, k); it; ++it) {
            std::int64_t pos = std::int64_t(it.col()) * orig.rows() + it.row();
            int col = int(pos / rows);
            int row = int(pos % rows);
            triplets.emplace_back(row, col, it.value());
        }
    }
    Eigen::SparseMatrix<double> rtrn(rows, cols);
    rtrn.setFromTriplets(triplets.begin(), triplets.end());
    return rtrn;
}
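For reference, a minimal usage sketch of the function above (the 8x3 target shape and the sample entry are invented for illustration):

Eigen::SparseMatrix<double> x(4, 6);
x.insert(1, 2) = 3.0;                              // column-major position 2*4 + 1 = 9
x.makeCompressed();
Eigen::SparseMatrix<double> y = reshape(x, 8, 3);  // entry lands at (9 % 8, 9 / 8) = (1, 1)

Both variants preserve the column-major linear order of the entries, which is exactly what a dense reshape would do.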
Things I tested that did not work:
Using FXDiv to replace the division with a cheaper operation
Computing maximum distance from one index to the next within a single column to skip dividing if both values are in the same output column (may still be worth it for sparse matrices with suitable inner structure)
Parallelizing the loop with OpenMP, using a final std::sort(std::execution::par, ...) for the triplets.
Related
I'm implementing sparse matrix multiplication (element type std::complex) after converting the matrices to CSR (compressed sparse row) format, and I'm using OpenMP for this. What I noticed is that increasing the number of threads doesn't necessarily increase the performance; sometimes it's totally the opposite! Why is that the case, and what can I do to solve the issue?
typedef std::vector<std::vector<std::complex<int>>> matrix;

struct CSR {
    std::vector<std::complex<int>> values; // non-zero values
    std::vector<int> row_ptr;              // row pointers
    std::vector<int> cols_index;           // column indices
    int rows;                              // number of rows
    int cols;                              // number of columns
    int NNZ;                               // number of non-zero elements
};
const matrix multiply_omp(const CSR& A,
                          const CSR& B, const unsigned int num_threads = 4) {
    if (A.cols != B.rows)
        throw "Error";
    CSR B_t = sparse_transpose(B);
    omp_set_num_threads(num_threads);
    matrix result(A.rows, std::vector<std::complex<int>>(B.cols, 0));
    #pragma omp parallel
    {
        int i, j, k, l;
        #pragma omp for
        for (i = 0; i < A.rows; i++) {
            for (j = 0; j < B_t.rows; j++) {
                std::complex<int> sum(0, 0);
                for (k = A.row_ptr[i]; k < A.row_ptr[i + 1]; k++)
                    for (l = B_t.row_ptr[j]; l < B_t.row_ptr[j + 1]; l++)
                        if (A.cols_index[k] == B_t.cols_index[l]) {
                            sum += A.values[k] * B_t.values[l];
                            break;
                        }
                if (sum != std::complex<int>(0, 0)) {
                    result[i][j] += sum;
                }
            }
        }
    }
    return result;
}
You can try to improve the scaling of this algorithm, but I would use a better algorithm. You are allocating a dense matrix (wrongly, but that's beside the point) for the product of two sparse matrices. That's wasteful, since quite often the product of two sparse matrices will not be dense by a long shot.
Your algorithm also has the wrong time complexity. The way you search through the rows of B means that your complexity has an extra factor of roughly the average number of nonzeros per row. A better algorithm would assume that the indices in each row are sorted, and then keep a pointer for how far you got into that row, as sketched below.
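To illustrate the idea, here is a minimal sketch of one such merged dot product (my code, not the poster's; it assumes cols_index is sorted within each row):

// Dot product of row i of A with row j of B_t, walking both index lists once
// with two pointers instead of restarting the inner scan for every element.
std::complex<int> row_dot(const CSR& A, int i, const CSR& B_t, int j)
{
    std::complex<int> sum(0, 0);
    int k = A.row_ptr[i];
    int l = B_t.row_ptr[j];
    while (k < A.row_ptr[i + 1] && l < B_t.row_ptr[j + 1]) {
        if (A.cols_index[k] < B_t.cols_index[l])
            ++k;
        else if (A.cols_index[k] > B_t.cols_index[l])
            ++l;
        else
            sum += A.values[k++] * B_t.values[l++];
    }
    return sum;
}

This makes each dot product linear in the combined number of nonzeros of the two rows, rather than quadratic.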
Read the literature on "GraphBLAS" for references to efficient algorithms.
Eigen / C++ newbie here.
I have a massive (sparse) matrix to initialize and fill, and I have the vector of (row, col) indices and the vector of values that correspond to those. How do I quickly (efficiently) build a matrix out of the two?
At this time, I prepare the vector of Triplets and then use setFromTriplets, but making that vector of Triplets in the loop is far too inefficient.
I feel that there has to be a better way than a loop. Please help.
void ImagingObjects::InitSparseSigmaPriorTriplets(Eigen::VectorXd& signals_vec)
{
    Eigen::MatrixXd values = signals_vec.sum() * (*this->GetSigmaPValues());
    for (size_t i = 0; i < sparse_sigma_prior_triplets_vector.size(); ++i)
    {
        int index_row = (int) (*this->GetSigmaPIndices())(i, 0);
        int index_col = (int) (*this->GetSigmaPIndices())(i, 1);
        sparse_sigma_prior_triplets_vector[i] =
            Eigen::Triplet<double>(index_row, index_col, values(i));
    }
}
So basically I prepare these triplets, every time, for every data sample, by cranking the nested loop.
Then, inside the algorithm solver iterative loop, I have this code:
size_t array_it = 0;
for (size_t i = 0; i < sigma_prior_retain_bool_array.size(); ++i)
{
    if (sigma_prior_retain_bool_array[i])
    {
        triplets_sigma_prior[array_it] = *ImagingObjsPtr->GetSparseSigmaPriorTriplet(i);
        array_it++;
    }
}
auto end_it = triplets_sigma_prior.begin() + array_it;
Sigma_prior_sparse.setFromTriplets(triplets_sigma_prior.begin(), end_it);
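For comparison, a minimal sketch of going straight from the index and value arrays to the matrix (the function name, argument shapes, and types here are my assumptions, not the poster's API). With reserve called up front, the triplet loop itself is usually cheap and setFromTriplets dominates:

Eigen::SparseMatrix<double> from_indices(const Eigen::MatrixXi& indices, // k x 2: (row, col)
                                         const Eigen::VectorXd& values,  // k values
                                         int n_rows, int n_cols)
{
    std::vector<Eigen::Triplet<double>> triplets;
    triplets.reserve(std::size_t(values.size())); // avoids reallocations inside the loop
    for (Eigen::Index i = 0; i < values.size(); ++i)
        triplets.emplace_back(indices(i, 0), indices(i, 1), values(i));
    Eigen::SparseMatrix<double> result(n_rows, n_cols);
    result.setFromTriplets(triplets.begin(), triplets.end());
    return result;
}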
I have two sparse matrices in Eigen, and I would like to join them vertically into one. As an example, the target of the code would be:
SparseMatrix<double> matrix1;
matrix1.resize(10, 10);
SparseMatrix<double> matrix2;
matrix2.resize(5, 10);
SparseMatrix<double> MATRIX_JOIN;
MATRIX_JOIN.resize(15, 10);
MATRIX_JOIN << matrix1, matrix2;
I found some solutions on a forum; however, I wasn't able to implement them.
What's the proper way to join the matrices vertically?
Edit
My implementation:
SparseMatrix<double> L;
SparseMatrix<double> C;
// ... (Operations with the matrices)
SparseMatrix<double> EMATRIX;
EMATRIX.resize(L.rows() + C.rows(), L.cols());
EMATRIX.middleRows(0, L.rows()) = L;
EMATRIX.middleRows(L.rows(), C.rows()) = C;
I get a type error: according to the compiler, the right-hand side is an Eigen::Block and the left-hand side is an Eigen::SparseMatrix.
As far as I know, there is currently no built-in solution. You can be way more efficient than your solution by using the internal insertBack function:
SparseMatrix<double> M(L.rows() + C.rows(), L.cols());
M.reserve(L.nonZeros() + C.nonZeros());
for (Index c = 0; c < L.cols(); ++c)
{
    M.startVec(c); // Important: Must be called once for each column before inserting!
    // insertBack appends, so entries must arrive with strictly increasing rows:
    // all of L's rows in column c first, then C's rows shifted by L.rows().
    for (SparseMatrix<double>::InnerIterator itL(L, c); itL; ++itL)
        M.insertBack(itL.row(), c) = itL.value();
    for (SparseMatrix<double>::InnerIterator itC(C, c); itC; ++itC)
        M.insertBack(itC.row() + L.rows(), c) = itC.value();
}
M.finalize();
Based on #Javier's answer.
Fixed the number of columns in the output matrix (just cols instead of cols+cols)
Fixed the lower matrix's triplet indices (upper.rows() + it.row() instead of just it.row())
using sparse_matrix_type = Eigen::SparseMatrix<T>;
using triplet_type = Eigen::Triplet<T, size_t>;

static sparse_matrix_type sparse_vstack(sparse_matrix_type const& upper,
                                        sparse_matrix_type const& lower) {
    assert(upper.cols() == lower.cols() && "vstack with mismatching number of columns");
    std::vector<triplet_type> triplets;
    triplets.reserve(upper.nonZeros() + lower.nonZeros());
    for (int k = 0; k < upper.outerSize(); ++k) {
        for (sparse_matrix_type::InnerIterator it(upper, k); it; ++it) {
            triplets.emplace_back(it.row(), it.col(), it.value());
        }
    }
    for (int k = 0; k < lower.outerSize(); ++k) {
        for (sparse_matrix_type::InnerIterator it(lower, k); it; ++it) {
            triplets.emplace_back(upper.rows() + it.row(), it.col(), it.value());
        }
    }
    sparse_matrix_type result(lower.rows() + upper.rows(), upper.cols());
    result.setFromTriplets(triplets.begin(), triplets.end());
    return result;
}
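A quick usage sketch (my addition, assuming the aliases above were instantiated with T = double and the function is accessible in scope):

Eigen::SparseMatrix<double> upper(10, 10), lower(5, 10);
upper.insert(0, 0) = 1.0;
lower.insert(4, 9) = 2.0;   // ends up at row 10 + 4 = 14 of the result
Eigen::SparseMatrix<double> stacked = sparse_vstack(upper, lower); // 15 x 10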
Unfortunately I couldn't get #chtz's example to work with Eigen 3.3.4 due to a static assertion error THIS_SPARSE_BLOCK_SUBEXPRESSION_IS_READ_ONLY. It seems to be explicitly forbidden by Eigen (see https://eigen.tuxfamily.org/dox/SparseBlock_8h_source.html).
I ended up doing the following:
MATRIX_JOIN.resize(matrix1.rows() + matrix2.rows(), matrix1.cols() + matrix2.cols());
MATRIX_JOIN.setZero();
// Fill MATRIX_JOIN with triplets from the other matrices
std::vector<Triplet<double>> tripletList;
for (int k = 0; k < matrix1.outerSize(); ++k)
{
    for (SparseMatrix<double>::InnerIterator it(matrix1, k); it; ++it)
    {
        tripletList.push_back(Triplet<double>(it.row(), it.col(), it.value()));
    }
}
for (int k = 0; k < matrix2.outerSize(); ++k)
{
    for (SparseMatrix<double>::InnerIterator it(matrix2, k); it; ++it)
    {
        tripletList.push_back(Triplet<double>(it.row(), it.col(), it.value()));
    }
}
MATRIX_JOIN.setFromTriplets(tripletList.begin(), tripletList.end());
There can be a speedup by calling tripletList.reserve(X), with X being the expected number of triplets to insert.
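For instance, with the two input matrices above, a natural choice would be (sketch):

tripletList.reserve(std::size_t(matrix1.nonZeros() + matrix2.nonZeros()));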
I need to compute a vector-matrix product as efficiently as possible. Specifically, given a vector s and a matrix A, I need to compute s * A. I have a class Vector which wraps a std::vector, and a class Matrix which also wraps a std::vector (for efficiency).
The naive approach (the one that I am using at the moment) is to have something like
Vector<T> timesMatrix(Matrix<T>& matrix)
{
    Vector<T> result(matrix.columns());
    // constructor that does a resize on the underlying std::vector
    for (unsigned int i = 0; i < vector.size(); ++i)
    {
        for (unsigned int j = 0; j < matrix.columns(); ++j)
        {
            result[j] += (vector[i] * matrix.getElementAt(i, j));
            // getElementAt accesses the appropriate entry
            // of the underlying std::vector
        }
    }
    return result;
}
It works fine and takes nearly 12000 microseconds. Note that the vector s has 499 elements, while A is 499 x 15500.
The next step was trying to parallelize the computation: if I have N threads, then I can give each thread a part of the vector s and the "corresponding" rows of the matrix A. Each thread will compute a partial result Vector (one entry per column of A) and the final result will be their entry-wise sum.
First of all, in the class Matrix I added a method to extract some rows from a Matrix and build a smaller one:
Matrix<T> extractSomeRows(unsigned int start, unsigned int end)
{
    unsigned int rowsToExtract = end - start + 1;
    std::vector<T> tmp;
    tmp.reserve(rowsToExtract * numColumns);
    for (unsigned int i = start * numColumns; i < (end + 1) * numColumns; ++i)
    {
        tmp.push_back(matrix[i]);
    }
    return Matrix<T>(rowsToExtract, numColumns, tmp);
}
Then I defined a thread routine
void timesMatrixThreadRoutine
(Matrix<T>& matrix, unsigned int start, unsigned int end, Vector<T>& newRow)
{
    // newRow is supposed to contain the partial result
    // computed by a thread
    newRow.resize(matrix.columns());
    for (unsigned int i = start; i < end + 1; ++i)
    {
        for (unsigned int j = 0; j < matrix.columns(); ++j)
        {
            newRow[j] += vector[i] * matrix.getElementAt(i - start, j);
        }
    }
}
And finally I modified the code of the timesMatrix method that I showed above:
Vector<T> timesMatrix(Matrix<T>& matrix)
{
    static const unsigned int NUM_THREADS = 4;
    unsigned int matRows = matrix.rows();
    unsigned int matColumns = matrix.columns();
    unsigned int rowsEachThread = vector.size() / NUM_THREADS;
    std::thread threads[NUM_THREADS];
    Vector<T> tmp[NUM_THREADS];
    unsigned int start, end;
    // all but the last thread
    for (unsigned int i = 0; i < NUM_THREADS - 1; ++i)
    {
        start = i * rowsEachThread;
        end = (i + 1) * rowsEachThread - 1;
        threads[i] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
            matrix.extractSomeRows(start, end), start, end, std::ref(tmp[i]));
    }
    // last thread
    start = (NUM_THREADS - 1) * rowsEachThread;
    end = matRows - 1;
    threads[NUM_THREADS - 1] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
        matrix.extractSomeRows(start, end), start, end, std::ref(tmp[NUM_THREADS - 1]));
    for (unsigned int i = 0; i < NUM_THREADS; ++i)
    {
        threads[i].join();
    }
    Vector<T> result(matColumns);
    for (unsigned int i = 0; i < NUM_THREADS; ++i)
    {
        result = result + tmp[i]; // the operator+ is overloaded
    }
    return result;
}
It still works but now it takes nearly 30000 microseconds, which is almost three times as much as before.
Am I doing something wrong? Do you think there is a better approach?
EDIT - using a "lightweight" VirtualMatrix
Following Ilya Ovodov's suggestion, I defined a class VirtualMatrix that wraps a T* matrixData, which is initialized in the constructor as
VirtualMatrix(Matrix<T>& m)
{
    numRows = m.rows();
    numColumns = m.columns();
    matrixData = m.pointerToData();
    // pointerToData() returns underlyingVector.data();
}
Then there is a method to retrieve a specific entry of the matrix:
inline T getElementAt(unsigned int row, unsigned int column)
{
    return *(matrixData + row * numColumns + column);
}
Now the execution time is better (approximately 8000 microseconds) but maybe there are some improvements to be made. In particular the thread routine is now
void timesMatrixThreadRoutine
(VirtualMatrix<T>& matrix, unsigned int startRow, unsigned int endRow, Vector<T>& newRow)
{
    unsigned int matColumns = matrix.columns();
    newRow.resize(matColumns);
    for (unsigned int i = startRow; i < endRow + 1; ++i)
    {
        for (unsigned int j = 0; j < matColumns; ++j)
        {
            newRow[j] += (vector[i] * matrix.getElementAt(i, j));
        }
    }
}
and the really slow part is the one with the nested for loops. If I remove it, the result is obviously wrong, but it is "computed" in less than 500 microseconds. This is to say that passing the arguments now takes almost no time and the heavy part is really the computation.
In your opinion, is there any way to make it even faster?
Actually, you make a partial copy of the matrix for each thread in extractSomeRows. That takes a lot of time.
Redesign it so that "some rows" becomes a virtual matrix pointing at data located in the original matrix.
Use the architecture's vectorized instructions by making it explicit that you want to multiply in 4's, i.e. SSE2+ on x86-64 and possibly NEON on ARM.
C++ compilers can often unroll the loop into vectorized code if you explicitly make an operation happen on contiguous elements (see the sketch after the following link):
Simple and fast matrix-vector multiplication in C / C++
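As a concrete illustration (my sketch, not the poster's code): writing the inner kernel over contiguous elements, unrolled by four, gives the compiler an easy target for SSE/NEON. For brevity it assumes columns is a multiple of 4.

// result += s_i * row, one matrix row at a time; the unrolled body maps
// naturally onto 4-wide vector registers.
void axpy_row(double s_i, const double* row, double* result, unsigned columns)
{
    for (unsigned j = 0; j < columns; j += 4) {
        result[j]     += s_i * row[j];
        result[j + 1] += s_i * row[j + 1];
        result[j + 2] += s_i * row[j + 2];
        result[j + 3] += s_i * row[j + 3];
    }
}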
There is also the option of using libraries specifically made for matrix multiplication. For larger matrices, it may be more efficient to use special implementations based on the Fast Fourier Transform, alternate algorithms like Strassen's algorithm, etc. In fact, your best bet would be to use a C library like this, and then wrap it in an interface that looks similar to a C++ vector.
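For example, a sketch of wrapping a BLAS routine for this product (assumes a CBLAS implementation such as OpenBLAS is linked; s * A equals A^T * s, which is what CblasTrans requests here):

#include <cblas.h>
#include <vector>

// result = s * A, where s has n elements and A is n x m, stored row-major.
std::vector<double> times_matrix_blas(const std::vector<double>& s,
                                      const std::vector<double>& A,
                                      int n, int m)
{
    std::vector<double> result(m, 0.0);
    cblas_dgemv(CblasRowMajor, CblasTrans,
                n, m,           // dimensions of A
                1.0,            // alpha
                A.data(), m,    // A and its leading dimension
                s.data(), 1,    // x and its stride
                0.0,            // beta
                result.data(), 1);
    return result;
}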
I have a problem when calculating the product of two sparse matrices.
Here is the program:
void RandomWalk::calculateAx(const SpMat& x, const SpMat& adj_mat1, const SpMat& adj_mat2,
                             const double& lambda, SpMat& result)
{
    SpMat Y(adj_mat1.cols(), adj_mat2.rows());
    for (int k = 0; k < x.outerSize(); ++k)
    {
        for (SpMat::InnerIterator it(x, k); it; ++it)
        {
            div_t divresult;
            divresult = div(it.row(), adj_mat1.rows());
            Y.insert(divresult.quot, divresult.rem) = it.value(); // arguments swapped; see the solution below
        }
    }
    SpMat tmp;
    tmp = adj_mat1 * Y; // <-- error in this line
    tmp = tmp * SpMat(adj_mat2.transpose());
    result.resize(adj_mat1.rows() * adj_mat2.rows(), 1);
    result.setZero();
    for (int k = 0; k < tmp.outerSize(); ++k)
    {
        for (SpMat::InnerIterator it(tmp, k); it; ++it)
        {
            result.insert(it.col() * adj_mat1.rows() + it.row(), 0) = it.value();
        }
    }
    result = lambda * result;
    result = x - result;
}
x is a matrix of size (k, 1). adj_mat1 is a matrix of size n x n and adj_mat2 of size m x m; both are symmetric. First I have to reshape x into a matrix Y of size (n x m) by using the first n elements as the first column, the second n as the second column, and so on. After that, the matrix adj_mat1 * Y * adj_mat2^T has to be calculated. This result then has to be vectorized again by writing all the columns below each other into a vector.
I get a Segmentation fault at the multiplication of adj_mat1 with Y.
The problem only occurs if adj_mat1 and adj_mat2 are of different sizes.
If you need any more information just ask.
Thank you in advance.
Alex
Solution:
The problem was the insertion of the values. I had to swap quot and rem in the insert statement. Now it works:
Y.insert(divresult.rem, divresult.quot) = it.value();