Armadillo SpMat<int> extremely slow compared to Mat<int> - C++

I am trying to utilize sparse matrices in Armadillo, and am noticing a significant difference in access times with SpMat<int> compared to equivalent code using Mat<int>.
Description:
Below are two methods, which are identical in every respect except that Method_One uses regular matrices and Method_Two uses sparse matrices.
Both methods take the following arguments:
WS, DS: pointers to NN-element arrays
WW: 13,000 [max(WS)]
DD: 1,700 [max(DS)]
NN: 2,300,000
TT: 50
I am using Visual Studio 2017 for compiling the code into a .mexw64 executable which can be called from Matlab.
Code:
void Method_One(int WW, int DD, int TT, int NN, double* WS, double* DS)
{
    Mat<int> WP(WW, TT, fill::zeros); // (13000 x 50) matrix
    Mat<int> DP(DD, TT, fill::zeros); // (1700 x 50) matrix
    Col<int> ZZ(NN, fill::zeros);     // 2,300,000-element column vector
    for (int n = 0; n < NN; n++)
    {
        int w_n = (int) WS[n] - 1;
        int d_n = (int) DS[n] - 1;
        int t_n = rand() % TT;
        WP(w_n, t_n)++;
        DP(d_n, t_n)++;
        ZZ(n) = t_n + 1;
    }
    return;
}
void Method_Two(int WW, int DD, int TT, int NN, double* WS, double* DS)
{
    SpMat<int> WP(WW, TT); // (13000 x 50) matrix
    SpMat<int> DP(DD, TT); // (1700 x 50) matrix
    Col<int> ZZ(NN, fill::zeros); // 2,300,000-element column vector
    for (int n = 0; n < NN; n++)
    {
        int w_n = (int) WS[n] - 1;
        int d_n = (int) DS[n] - 1;
        int t_n = rand() % TT;
        WP(w_n, t_n)++;
        DP(d_n, t_n)++;
        ZZ(n) = t_n + 1;
    }
    return;
}
Timing:
I am timing both methods using Armadillo's wall_clock timer object. For example:
wall_clock timer;
timer.tic();
Method_One(WW, DD, TT, NN, WS, DS);
double t = timer.toc();
Results:
Timing elapsed for Method_One using Mat<int>: 0.091 sec
Timing elapsed for Method_Two using SpMat<int>: 30.227 sec (about 330 times slower)
Any insights into this are highly appreciated!
UPDATE:
This issue has been fixed in a newer version (8.100.1) of Armadillo.
Here are the new results:
Timing elapsed for Method_One using Mat<int>: 0.141 sec
Timing elapsed for Method_Two using SpMat<int>: 2.127 sec (15 times slower, which is acceptable!)
Thanks to Conrad and Ryan.

As hbrerkere already mentioned, the problem stems from the fact that the values of the matrix are stored in a packed format (CSC) that makes it time-consuming to:
Find the index of an already existing entry: depending on whether the column entries are sorted by their row index, you need either a linear or a binary search.
Insert a value that was previously zero: you need to find the insertion point for your new value and move all elements after it, leading to Ω(n) worst-case time for a single insertion!
Both of these operations are constant-time for dense matrices, which mostly explains the runtime difference.
My usual solution is to use a separate sparse matrix type for assembly (where you usually access an element multiple times), based on the coordinate format (storing triples (i, j, value)), which uses a map like std::map or std::unordered_map to find the triple index corresponding to a position (i, j) in the matrix.
Some similar approaches are also discussed in this question about matrix assembly.
Example from my most recent use:
class DynamicSparseMatrix {
    using Number = double;
    using Index = std::size_t;
    using Entry = std::pair<Index, Index>;

    std::vector<Number> values;
    std::vector<Index> rows;
    std::vector<Index> cols;
    std::map<Entry, Index> map; // unordered_map might be faster,
                                // but you need a suitable hash function
                                // like boost::hash<Entry> for this.
    Index num_rows;
    Index num_cols;
    ...

    Number& value(Index row, Index col) {
        // just to prevent misuse
        assert(row >= 0 && row < num_rows);
        assert(col >= 0 && col < num_cols);

        // Find the entry in the matrix
        Entry e{row, col};
        auto it = map.find(e);

        // If the entry hasn't previously been stored
        if (it == map.end()) {
            // Add a new entry by adding its value and coordinates
            // to the end of the storage vectors.
            it = map.insert(std::make_pair(e, values.size())).first;
            rows.push_back(row);
            cols.push_back(col);
            values.push_back(0);
        }

        // Return the value
        return values[(*it).second];
    }
    ...
};
After assembly you can store all the values from rows, cols, values (which actually represent the matrix in Coordinate format), possibly sort them and do a batch insertion into your Armadillo matrix.

Sparse matrices are stored in a compressed format (CSC). Every time a non-zero element is inserted into a sparse matrix, the entire internal representation has to be updated. This is time-consuming.
It's much faster to construct the sparse matrix using batch constructors.
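For illustration, a minimal sketch of what Method_Two's WP could look like with batch construction (the locations/values constructor is documented in Armadillo; the accumulation scheme around it is my own assumption):
// Record the coordinates of every increment first (duplicates allowed),
// then build the sparse matrix in a single batch.
umat locations(2, NN);                // row 0: row indices, row 1: column indices
Col<int> increments(NN, fill::ones);  // each event adds 1
for (int n = 0; n < NN; n++)
{
    locations(0, n) = (uword) WS[n] - 1; // w_n
    locations(1, n) = rand() % TT;       // t_n
}
// Passing 'true' asks Armadillo to sum values at duplicate locations
// (if your version supports the add_values flag), which reproduces
// the WP(w_n, t_n)++ accumulation of the loop version.
SpMat<int> WP(true, locations, increments, WW, TT);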

Related

FAISS with C++ indexing 512D vectors

I have a collection of 512D std::vector to store face embeddings. I create my index and perform training on a subset of the data.
int d = 512;
size_t nb = this->templates.size(); // 95000
size_t nt = 50000; // training data size
std::vector<float> training_set(nt * d);
faiss::IndexFlatIP coarse_quantizer(d);
int ncentroids = int(4 * sqrt(nb));
faiss::IndexIVFPQ index(&coarse_quantizer, d, ncentroids, 4, 8);
Each element of this->templates has an index value in [0] and the 512D vector in [1]. My question is about the training and indexing. I have this currently:
int v = 0;
for (auto const& element : this->templates)
{
    std::vector<double> enrollment_template = element.second;
    for (int i = 0; i < d; i++) {
        training_set[(v * d) + i] = (float)enrollment_template.at(i);
    }
    v++;
}
index.train(nt, training_set.data());
FAISS Index::train function:
virtual void train(idx_t n, const float *x)
Perform training on a representative set of vectors
Parameters:
n – nb of training vectors
x – training vectors, size n * d
Is that the proper way of adding the 512D vector data into Faiss for training? It seems to me that if I have 2 face embeddings that are 512D in size, the training_set would be like this:
training_set[0-511] - Face #1's 512D vectors
training_set[512-1023] - Face #2's 512D vectors
and since Faiss knows we are working with 512D vectors, it will intelligently parse them out of the array.
Here's a more efficient way to write it:
int v = 0;
for (auto const& element : this->templates)
{
    auto& enrollment_template = element.second; // not a copy
    if (v + d > training_set.size()) {
        break; // prevent overflow; "nt" is smaller than templates.size()
    }
    for (int i = 0; i < d; i++) {
        training_set[v] = enrollment_template[i]; // not at()
        v++;
    }
}
We avoid a copy with auto& enrollment_template, avoid extra branching with enrollment_template[i] (you know you won't be out of bounds), and simplify the address computation with training_set[v] by making v a count of elements rather than rows.
Further efficiency could be gained if templates can be changed to store floats rather than doubles; then you'd just be bitwise-copying 512 floats rather than converting doubles to floats.
Also, be sure to declare d as constexpr to give the compiler the best chance of optimizing the loop.
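As a minimal sketch of both suggestions (assuming templates can be changed to hold std::vector<float>; v is the element counter from the loop above):
constexpr int d = 512; // compile-time constant lets the compiler unroll and vectorize

int v = 0;
for (auto const& element : this->templates)
{
    const std::vector<float>& embedding = element.second; // already float, no conversion
    if (v + d > (int)training_set.size()) break;
    std::copy(embedding.begin(), embedding.end(),  // needs <algorithm>
              training_set.begin() + v);
    v += d;
}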

How to write Multiplicative Update Rules for Matrix Factorization when one doesn't have access to the whole matrix?

So we want to approximate the matrix A, with m rows and n columns, by the product of two matrices P and Q that have dimensions m×k and k×n respectively. Here is an implementation of the multiplicative update rule due to Lee, in C++ using the Eigen library.
void multiplicative_update()
{
    Q = Q.cwiseProduct((P.transpose()*matrix).cwiseQuotient(P.transpose()*P*Q));
    P = P.cwiseProduct((matrix*Q.transpose()).cwiseQuotient(P*Q*Q.transpose()));
}
where P, Q, and matrix (matrix = A) are member variables of the class mat_fac. Thus I train them using the following method:
void train_2() {
    double error_trial = 0;
    for (int count = 0; count < num_iterations; count++)
    {
        multiplicative_update();
        error_trial = (matrix - P*Q).squaredNorm();
        if (error_trial < 0.001)
        {
            break;
        }
    }
}
where num_iterations is also a member variable of the class mat_fac.
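Written out element-wise, the updates performed by multiplicative_update above are:

$$Q_{ab} \leftarrow Q_{ab}\,\frac{(P^T A)_{ab}}{(P^T P Q)_{ab}}, \qquad P_{ab} \leftarrow P_{ab}\,\frac{(A Q^T)_{ab}}{(P Q Q^T)_{ab}}.$$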
The problem is that I am working with very large matrices, and in particular I do not have access to the entire matrix. Given a triple (i, j, matrix[i][j]), I have access to the row vector P[i][:] and the column vector Q[:][j]. So my goal is to rewrite the multiplicative update rule in such a way that I update these two vectors every time I see a non-zero matrix value.
In code, I want to have something like this:
void multiplicative_update(int i, int j, double mat_value)
{
    Eigen::MatrixXd q_vect = get_vector(1, j); // get_vector returns Q[:][j] as a column vector
    Eigen::MatrixXd p_vect = get_vector(0, i); // get_vector returns P[i][:] as a column vector
    // Somehow compute coeff_AQ_t, coeff_PQQ_t, coeff_P_tA and coeff_P_tPQ.
    for (int a = 0; a < k; a++) {
        p_vect[a] = p_vect[a] * (coeff_AQ_t) / (coeff_PQQ_t);
        q_vect[a] = q_vect[a] * (coeff_P_tA) / (coeff_P_tPQ);
    }
}
Thus the problem boils down to computing the required coefficients given the two vectors. Is this a possible thing to do? If not, what more data do I need for the multiplicative update to work in this manner?

Getting values for specific frequencies in a short time fourier transform

I'm trying to use C++ to recreate the spectrogram function used by Matlab. The function uses a Short Time Fourier Transform (STFT). I found some C++ code here that performs an STFT. The code seems to work perfectly for all frequencies, but I only want a few. I found this post for a similar question with the following answer:
Just take the inner product of your data with a complex exponential at
the frequency of interest. If g is your data, then just substitute for
f the value of the frequency you want (e.g., 1, 3, 10, ...)
Having no background in mathematics, I can't figure out how to do this. The inner product part seems simple enough from the Wikipedia page but I have absolutely no idea what he means by (with regard to the formula for a DFT)
a complex exponential at frequency of interest
Could someone explain how I might be able to do this? My data structure after the STFT is a matrix filled with complex numbers. I just don't know how to extract my desired frequencies.
Relevant function, where window is a Hamming window; the vector of desired frequencies isn't yet an input because I don't know what to do with them:
Matrix<complex<double>> ShortTimeFourierTransform::Calculate(const vector<double> &signal,
                                                             const vector<double> &window, int windowSize, int hopSize)
{
    int signalLength = signal.size();
    int nOverlap = hopSize;
    int cols = (signal.size() - nOverlap) / (windowSize - nOverlap);
    Matrix<complex<double>> results(window.size(), cols);
    int chunkPosition = 0;
    int readIndex;
    // Should we stop reading in chunks?
    bool shouldStop = false;
    int numChunksCompleted = 0;
    int i;
    // Process each chunk of the signal
    while (chunkPosition < signalLength && !shouldStop)
    {
        // Copy the chunk into our buffer
        for (i = 0; i < windowSize; i++)
        {
            readIndex = chunkPosition + i;
            if (readIndex < signalLength)
            {
                // Note the windowing!
                data[i][0] = signal[readIndex] * window[i];
                data[i][1] = 0.0;
            }
            else
            {
                // we have read beyond the signal, so zero-pad it!
                data[i][0] = 0.0;
                data[i][1] = 0.0;
                shouldStop = true;
            }
        }
        // Perform the FFT on our chunk
        fftw_execute(plan_forward);
        // Copy the first (windowSize/2 + 1) data points into your spectrogram.
        // We do this because the FFT output is mirrored about the nyquist
        // frequency, so the second half of the data is redundant. This is how
        // Matlab's spectrogram routine works.
        for (i = 0; i < windowSize / 2 + 1; i++)
        {
            double real = fft_result[i][0];
            double imaginary = fft_result[i][1];
            results(i, numChunksCompleted) = complex<double>(real, imaginary);
        }
        chunkPosition += hopSize;
        numChunksCompleted++;
    }
    return results;
}
Look up the Goertzel algorithm or filter for example code that uses the computational equivalent of an inner product against a complex exponential to measure the presence or magnitude of a specific stationary sinusoidal frequency in a signal. Performance or resolution will depend on the length of the filter and your signal.
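In other words, the inner product in question is a single-bin DFT, X(f) = Σ_{n=0}^{N-1} g[n] e^{-2πi f n / N}, where g is the (windowed) data. As a hedged sketch (not from the original answer), a Goertzel evaluation of the magnitude at one frequency might look like this, where targetFreq and sampleRate are in Hz:
#define _USE_MATH_DEFINES // for M_PI on MSVC
#include <algorithm>
#include <cmath>
#include <vector>

// Magnitude of a single frequency component of `samples` via the Goertzel
// algorithm; computationally equivalent to the inner product of the data
// with a complex exponential at targetFreq.
double goertzelMagnitude(const std::vector<double>& samples,
                         double targetFreq, double sampleRate)
{
    const double omega = 2.0 * M_PI * targetFreq / sampleRate;
    const double coeff = 2.0 * std::cos(omega);
    double sPrev = 0.0, sPrev2 = 0.0;
    for (double x : samples) {            // one multiply-add per sample
        const double s = x + coeff * sPrev - sPrev2;
        sPrev2 = sPrev;
        sPrev = s;
    }
    // Recover the power of the target bin from the final two filter states.
    const double power = sPrev * sPrev + sPrev2 * sPrev2 - coeff * sPrev * sPrev2;
    return std::sqrt(std::max(power, 0.0));
}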

Cusparse illegal memory access unless I increase the sparsity of the sparse matrix

I am trying to speed up an existing piece of software that uses hand-tuned sparse multiplication of special CSC matrices that have exactly k nonzero elements per column. I decided to use cuSPARSE for the job, but unfortunately I find that the matrix multiplication takes over 7 seconds in some cases, which is much slower than the CPU version of the code. (The largest sparse matrix concerned is 19871x1000; the largest dense matrix concerned is 1000x150; nnz = 101000.)
When trying to reproduce the problem in a self-contained example, I always encounter an "illegal memory access" error when nnz != sparse_cols.
After some investigation it turns out that if I increase the size of the matrices 10-fold, the problem disappears. If I make the matrices small enough, I don't experience crashes. However, with large matrices the sparse matrix must not cross over some degree of density, otherwise the multiplication produces a bunch of illegal memory accesses.
Here is the code that exhibits the problem:
#include <cuda.h>
#include <cusparse.h>
#include <iostream>
#include <stdlib.h>

#define CALL_CUDA( err ) \
    { if (err != cudaSuccess) \
      { std::cout << "cuda Error " << cudaGetErrorString(err) << " in " << __FILE__ << " at line " << __LINE__ << "\n"; exit(EXIT_FAILURE); } \
    }
int main(){
    //cusparse status and handle
    cusparseStatus_t status;
    cusparseHandle_t handle = 0;
    status = cusparseCreate(&handle);
    if (status != CUSPARSE_STATUS_SUCCESS){
        std::cout << "Error creating handle: " << status << std::endl;
    }

    //Set matrix description
    cusparseMatDescr_t descr; //Describe the matrices
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    //Sparse matrix properties
    int sparse_rows = 19871;
    int sparse_cols = 1000;
    int nnz_new = 101000;
    //int nnz_new = 1000; //Works with that value

    //Dense matrix properties
    int bmat_rows = 1000;
    int bmat_cols = 150;

    //Generate a special type of sparse matrix that has exactly k nonzero elements in each column in CSC format
    float * amat_vals;
    CALL_CUDA(cudaMallocHost((void **)&amat_vals, nnz_new*sizeof(float)));
    int * amat_idx;
    CALL_CUDA(cudaMallocHost((void **)&amat_idx, nnz_new*sizeof(int)));
    int * crccolptr;
    CALL_CUDA(cudaMallocHost((void **)&crccolptr, (sparse_cols+1)*sizeof(int)));

    //Fill in values with random values
    for (int i = 0; i < nnz_new; i++){
        amat_vals[i] = (float)rand()/(float)RAND_MAX;
    }

    //Generate indexes for those rows
    for (int i = 0; i < nnz_new; i++){
        amat_idx[i] = rand() % (sparse_rows - 1);
    }

    //generate crccolptr
    int k = (int)(nnz_new/sparse_cols); //Number of elements per row
    for (int i = 0; i < sparse_cols; i++){
        crccolptr[i] = k*i;
    }
    crccolptr[sparse_cols] = nnz_new;

    //Generate bmat_array with random floats
    float * bmat_array;
    CALL_CUDA(cudaMallocHost((void **)&bmat_array, bmat_rows*bmat_cols*sizeof(float)));
    for (int i = 0; i < bmat_rows*bmat_cols; i++){
        bmat_array[i] = (float)rand()/(float)RAND_MAX;
    }

    //generate array for output
    float * outmatrix_test;
    CALL_CUDA(cudaMallocHost((void **)&outmatrix_test, sparse_rows*bmat_cols*sizeof(float)));

    //Allocate and copy device memory for sparse matrix
    float * cudavals;
    int * colIdx;
    int * colPtr;
    CALL_CUDA(cudaMalloc((void **)&colPtr, (sparse_cols + 1)*sizeof(int)));
    CALL_CUDA(cudaMemcpy(colPtr, crccolptr, (sparse_cols + 1)*sizeof(int), cudaMemcpyHostToDevice));
    CALL_CUDA(cudaMalloc((void **)&cudavals, nnz_new*sizeof(float)));
    CALL_CUDA(cudaMalloc((void **)&colIdx, nnz_new*sizeof(int)));
    CALL_CUDA(cudaMemcpy(cudavals, amat_vals, nnz_new*sizeof(float), cudaMemcpyHostToDevice));
    CALL_CUDA(cudaMemcpy(colIdx, amat_idx, nnz_new*sizeof(int), cudaMemcpyHostToDevice));

    //Allocate and copy device memory for dense matrix
    float * B_gpumatrix;
    CALL_CUDA(cudaMalloc((void **)&B_gpumatrix, bmat_rows*bmat_cols*sizeof(float)));
    CALL_CUDA(cudaMemcpy(B_gpumatrix, bmat_array, bmat_rows*bmat_cols*sizeof(float), cudaMemcpyHostToDevice));

    //Allocate output matrix
    float * outmatrix_gpu;
    CALL_CUDA(cudaMalloc((void **)&outmatrix_gpu, (sparse_rows*bmat_cols)*sizeof(float)));

    //sparse_cols is passed as sparse_rows, because we're multiplying a CSC matrix instead of a CSR so we need
    //to transpose it and invert the rows and columns.
    const float alpha = 1.0;
    const float beta = 0.0;
    /*
    float * outmatrix_gpu2;
    CALL_CUDA(cudaMalloc((void **)&outmatrix_gpu2, (sparse_rows*sparse_cols)*sizeof(float)));
    cusparseStatus_t mat_mul = cusparseScsc2dense(handle, sparse_rows, sparse_cols, descr, cudavals, colIdx, colPtr, outmatrix_gpu2, sparse_rows);
    float * outmatrix_test2;
    CALL_CUDA(cudaMallocHost((void **)&outmatrix_test2, sparse_rows*sparse_cols*sizeof(float)));
    CALL_CUDA(cudaMemcpy(outmatrix_test2, outmatrix_gpu2, (sparse_rows*sparse_cols)*sizeof(float), cudaMemcpyDeviceToHost));
    */
    cusparseStatus_t mat_mul = cusparseScsrmm(handle, //Cusparse handle
        CUSPARSE_OPERATION_TRANSPOSE, //Transposing the matrix
        sparse_cols,   //Number of sparse rows. Since we're using a CSC matrix it's the columns.
        bmat_cols,     //Number of columns of the dense matrix
        sparse_rows,   //Number of sparse cols. Since we're using a CSC matrix it's the rows.
        nnz_new,       //Non-zero elements
        &alpha,        //Pointer to alpha (1.0)
        descr,         //Description of the matrix
        cudavals,      //The values vector
        colPtr,        //The column pointer
        colIdx,        //The indexes of the sparse matrix
        B_gpumatrix,   //Dense matrix array
        bmat_rows,     //ldb - the rows of the dense matrix
        &beta,         //Pointer to beta (0.0)
        outmatrix_gpu, //Pointer to the output matrix
        sparse_rows);  //ldc - leading dimension of the output matrix
    if (mat_mul != CUSPARSE_STATUS_SUCCESS){
        std::cout << "MULTIPLICATION ERROR: " << mat_mul << std::endl;
    }
    cudaThreadSynchronize(); //Syncs before copy back to memory should not be necessary
    cudaDeviceSynchronize();

    //Copy matrix back to host
    CALL_CUDA(cudaMemcpy(outmatrix_test, outmatrix_gpu, (sparse_rows*bmat_cols)*sizeof(float), cudaMemcpyDeviceToHost));

    CALL_CUDA(cudaFree(outmatrix_gpu));
    CALL_CUDA(cudaFree(cudavals));
    CALL_CUDA(cudaFree(colPtr));
    CALL_CUDA(cudaFree(colIdx));
    CALL_CUDA(cudaFree(B_gpumatrix));
    CALL_CUDA(cudaFreeHost(crccolptr));
    CALL_CUDA(cudaFreeHost(amat_vals));
    CALL_CUDA(cudaFreeHost(amat_idx));
    CALL_CUDA(cudaFreeHost(bmat_array));
    CALL_CUDA(cudaFreeHost(outmatrix_test));
    return 1;
}
I believe I am generating a valid sparse matrix, because I can convert it to a dense one using the appropriate cuSPARSE function without triggering any invalid memory accesses.
When running the above code under cuda-memcheck you can see many illegal accesses from within cusparseScsrmm. Running without cuda-memcheck, you would see an error in the first CUDA operation after the matrix multiplication.
Any ideas what I am doing wrong? I hope that if I can solve this problem, I will be able to diagnose (or at least isolate) the painfully slow matrix multiplications in a self-contained example.
EDIT:
Using smaller matrices I don't experience the problem. A 50x200 sparse matrix works fine for nnz up to about 1000, but takes forever with nnz = 5000 (I killed it after half a minute). Increasing the matrix size to 200x500 performs instantaneously with nnz = 5000... Strange.
EDIT2:
The original number of nnz works if I increase the size of the matrices 10-fold.
This isn't sensible:
//Generate indexes for those rows
for (int i = 0; i < nnz_new; i++){
    amat_idx[i] = rand() % (sparse_rows - 1);
}
The CSR matrix format expects the values vector to be stored in left-to-right, top-to-bottom order. Therefore the column indices within each row must be in increasing order. You are generating column indices in random order, and in fact it's quite possible that you will generate two elements in the same row with the same column index. That is simply broken.
Your variable naming also suggests some confusion to me. CSR is compressed sparse row format, and it expects:
a vector of matrix values (=nnz in length)
a vector of column indices specifying which column each value belongs to (=nnz in length)
a vector of row pointers specifying the start of each row (=numrows +1 in length)
Since you are using the Scsrmm function, CSR format is required.
Variable names like crccolptr don't make sense to me in a CSR format.
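For reference, here is a small hand-worked example of those three arrays (my own illustration, not from the original answer):
// A 3 x 4 matrix and its CSR representation:
//     | 10  0 20  0 |
// A = |  0 30  0 40 |
//     | 50 60  0  0 |
float vals[]   = {10, 20, 30, 40, 50, 60}; // nnz values, left-to-right, top-to-bottom
int   colIdx[] = { 0,  2,  1,  3,  0,  1}; // column of each value; increasing within a row
int   rowPtr[] = { 0,  2,  4,  6};         // start of each row in vals; numrows + 1 entries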
As a simple proof-point, replace the random-index generation excerpted above with the following:
//Generate indexes for those rows
int my_idx = 0;
int j;
for (int i = 0; i < sparse_rows; i++){
    //amat_idx[i] = rand() % (sparse_rows - 1);
    for (j = 0; j < (nnz_new/sparse_rows); j++)
        amat_idx[my_idx++] = j;
}
while (my_idx < nnz_new) amat_idx[my_idx++] = j++;
And I believe the errors will go away, since the actual matrix now conforms to CSR format expectations.

How can I optimize this function which handles large C++ vectors?

According to Visual Studio's performance analyzer, the following function is consuming what seems to me to be an abnormally large amount of processor power, seeing as all it does is add between 1 and 3 numbers from several vectors and store the result in one of those vectors.
//Relevant class members:
//vector<double> cache (~80,000 elements);
//int inputSize;
//Notes:
//RealFFT::real is a typedef for POD double.
//RealFFT::RealSet is a wrapper class for a C-style array of RealFFT::real.
//This is because of the FFT library I'm using (FFTW).
//Its bracket operator is overloaded to return a const reference to the appropriate array element.
vector<RealFFT::real> Convolver::store(vector<RealFFT::RealSet>& data)
{
    int cr = inputSize; //'cache' read position
    int cw = 0;         //'cache' write position
    int di = 0;         //index within 'data' vector (ex. data[di])
    int bi = 0;         //index within 'data' element (ex. data[di][bi])
    int blockSize = irBlockSize();
    int dataSize = data.size();
    int cacheSize = cache.size();
    //Basically, this takes the existing values in 'cache', sums them with the
    //values in 'data' at the appropriate positions, and stores them back in
    //the cache at a new position.
    while (cw < cacheSize)
    {
        RealFFT::real n = 0; //accumulator must be floating-point; an int would truncate the doubles
        if (di < dataSize)
            n = data[di][bi];
        if (di > 0 && bi < inputSize)
            n += data[di - 1][blockSize + bi];
        if (++bi == blockSize)
        {
            di++;
            bi = 0;
        }
        if (cr < cacheSize)
            n += cache[cr++];
        cache[cw++] = n;
    }
    //Take the first 'inputSize' number of values and return them in a new vector.
    return Common::vecTake<RealFFT::real>(inputSize, cache, 0);
}
Granted, the vectors in question have sizes of around 80,000 items, but by comparison, a function which multiplies similar vectors of complex numbers (complex multiplication requires 4 real multiplications and 2 additions each) consumes about 1/3 the processor power.
Perhaps it has something to do with the fact that it has to jump around within the vectors rather than just accessing them linearly? I really have no idea though. Any thoughts on how this could be optimized?
Edit: I should mention I also tried writing the function to access each vector linearly, but this requires more total iterations and the performance was actually worse that way.
Turn on compiler optimization as appropriate. A guide for MSVC is here:
http://msdn.microsoft.com/en-us/library/k1ack8f1.aspx
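For example (an assumed command line; adjust to your own project and source files), a fully optimized build might look like:
cl /O2 /fp:fast /EHsc convolver.cpp
In the IDE, the equivalent is Project Properties > C/C++ > Optimization > Maximize Speed (/O2), applied to the Release configuration.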