I'm trying to implement a naive threaded matrix multiplication. I create one thread per linear combination, using a manually allocated result array, and each thread writes to its own position in that array. However, my code runs slower than the single-threaded version. Is it the use of memory that slows down the code?
I used heap allocation to avoid any memory copying, but could that be the problem?
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <thread>
#include <tuple>
#include <utility>
#include <vector>

// treat a std::pair<int, int> shape as (rows, columns)
#define rows first
#define columns second

void linear_combination(double const *arr_1, std::pair<int, int> sp_1,
                        double const *arr_2, std::pair<int, int> sp_2,
                        double *arr_3, std::pair<int, int> sp_3,
                        int base_row, int base_col) {
    double sum = 0;
    for (int i = 0; i < sp_1.columns; i++) {
        int idx_1 = base_row * sp_1.columns + i;
        int idx_2 = i * sp_2.columns + base_col;
        sum += arr_1[idx_1] * arr_2[idx_2];
    }
    int idx_3 = base_row * sp_3.columns + base_col;
    arr_3[idx_3] = sum;
}

auto matmul(double *m1, std::pair<int, int> sp_1, double *m2, std::pair<int, int> sp_2) {
    // "sp_n" stands for shape of the n-th matrix
    if (sp_1.second == sp_2.first) {
        auto *m3 = (double *) malloc(sp_1.first * sp_2.second * sizeof(double));
        std::pair sp_3 = {sp_1.first, sp_2.second};
        for (int k = 0; k < sp_3.rows; k++) {
            std::vector<std::thread> thread_list(sp_2.columns);
            for (int j = 0; j < sp_2.columns; j++) {
                // will automatically save the linear combination sum into m3
                thread_list[j] = std::thread(linear_combination,
                                             m1, sp_1,
                                             m2, sp_2,
                                             m3, sp_3,
                                             k, j);
            }
            // join threads and use the calculation
            std::for_each(thread_list.begin(), thread_list.end(), std::mem_fn(&std::thread::join));
        }
        return std::make_tuple(m3, sp_3);
    } else {
        puts("Size mismatch");
        printf("%d %d\n", sp_1.second, sp_2.first);
        // returning the address of a local here would dangle; signal failure instead
        return std::make_tuple(static_cast<double *>(nullptr), std::make_pair(0, 0));
    }
}
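For what it's worth, the slowdown is more likely thread-creation overhead than memory: each std::thread here lives only long enough to compute a single dot product, and creating and joining a thread typically costs far more than the arithmetic it performs. A rough sketch of the same computation with one thread per block of rows (my restructuring, not the original code, reusing linear_combination and the shape macros above) would look like this:

// Sketch: give each thread a contiguous block of result rows instead of
// one thread per output element, so thread startup cost is amortized.
void multiply_rows(double const *m1, std::pair<int, int> sp_1,
                   double const *m2, std::pair<int, int> sp_2,
                   double *m3, std::pair<int, int> sp_3,
                   int row_begin, int row_end) {
    for (int k = row_begin; k < row_end; k++)
        for (int j = 0; j < sp_3.columns; j++)
            linear_combination(m1, sp_1, m2, sp_2, m3, sp_3, k, j);
}

void matmul_blocked(double const *m1, std::pair<int, int> sp_1,
                    double const *m2, std::pair<int, int> sp_2,
                    double *m3, std::pair<int, int> sp_3) {
    unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    int rows_per_thread = (sp_3.rows + (int) n_threads - 1) / (int) n_threads;
    std::vector<std::thread> pool;
    for (int r = 0; r < sp_3.rows; r += rows_per_thread)
        pool.emplace_back(multiply_rows, m1, sp_1, m2, sp_2, m3, sp_3,
                          r, std::min(r + rows_per_thread, sp_3.rows));
    for (auto &t : pool) t.join();
}

With the thread count kept near the core count, each thread does many dot products per startup, which is usually where the single-threaded version wins against the one-thread-per-element design.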
I'm implementing sparse matrix multiplication (element type std::complex) after converting the matrices to CSR (compressed sparse row) format, and I'm using OpenMP for this. What I noticed is that increasing the number of threads doesn't necessarily increase performance; sometimes it's totally the opposite. Why is that the case, and what can I do to solve the issue?
typedef std::vector<std::vector<std::complex<int>>> matrix;

struct CSR {
    std::vector<std::complex<int>> values; // non-zero values
    std::vector<int> row_ptr;              // offsets into values where each row starts
    std::vector<int> cols_index;           // column index of each stored value
    int rows;                              // number of rows
    int cols;                              // number of columns
    int NNZ;                               // number of non-zero elements
};
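For readers unfamiliar with the format, here is a small illustrative example (mine, not from the original post) of how a 2x3 matrix maps onto this struct:

// Example: CSR layout of the 2x3 matrix
//   [ 1+0i   0    0+2i ]
//   [  0    5+0i   0   ]
CSR example;
example.values     = { {1, 0}, {0, 2}, {5, 0} }; // non-zeros in row-major order
example.cols_index = { 0, 2, 1 };                // column of each stored value
example.row_ptr    = { 0, 2, 3 };                // row i occupies [row_ptr[i], row_ptr[i+1])
example.rows = 2;
example.cols = 3;
example.NNZ  = 3;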
const matrix multiply_omp(const CSR& A, const CSR& B,
                          const unsigned int num_threads = 4) {
    if (A.cols != B.rows)
        throw "Error";
    CSR B_t = sparse_transpose(B);
    omp_set_num_threads(num_threads);
    matrix result(A.rows, std::vector<std::complex<int>>(B.cols, 0));
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < A.rows; i++) {
            for (int j = 0; j < B_t.rows; j++) {
                std::complex<int> sum(0, 0);
                // for each non-zero of row i of A, search row j of B_t
                // for a matching column index
                for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; k++)
                    for (int l = B_t.row_ptr[j]; l < B_t.row_ptr[j + 1]; l++)
                        if (A.cols_index[k] == B_t.cols_index[l]) {
                            sum += A.values[k] * B_t.values[l];
                            break;
                        }
                if (sum != std::complex<int>(0, 0)) {
                    result[i][j] += sum;
                }
            }
        }
    }
    return result;
}
You can try to improve the scaling of this algorithm, but I would use a better algorithm. You are allocating a dense matrix (wrongly, but that's beside the point) for the product of two sparse matrices. That's wasteful, since quite often the product of two sparse matrices will not be dense by a long shot.
Your algorithm also has the wrong time complexity. The way you search through the rows of B means that your complexity has an extra factor of roughly the average number of nonzeros per row. A better algorithm would assume that the indices in each row are sorted, and then keep a pointer for how far you got into that row.
Read the literature on GraphBLAS for references to efficient algorithms.
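As a minimal sketch of that pointer-chasing idea (my illustration, not code from the answer, assuming cols_index is sorted ascending within each row), the nested search collapses to a two-pointer merge:

// Sketch: merge-style dot product of row i of A with row j of B_t,
// replacing the O(nnz_row(A) * nnz_row(B_t)) nested search above.
// Requires sorted column indices within each row.
std::complex<int> sparse_row_dot(const CSR& A, int i, const CSR& B_t, int j) {
    std::complex<int> sum(0, 0);
    int k = A.row_ptr[i],   k_end = A.row_ptr[i + 1];
    int l = B_t.row_ptr[j], l_end = B_t.row_ptr[j + 1];
    while (k < k_end && l < l_end) {
        if (A.cols_index[k] < B_t.cols_index[l])
            ++k;                                    // advance whichever index lags
        else if (A.cols_index[k] > B_t.cols_index[l])
            ++l;
        else
            sum += A.values[k++] * B_t.values[l++]; // matching column: accumulate
    }
    return sum;
}

Each pair of rows is then scanned once, which removes the extra nonzeros-per-row factor from the complexity.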
I am working with Armadillo and there seems to be some weird memory management in my program.
I need to solve a matrix system recursively, and for this I call the following function in a for-loop:
void get_T(arma::cx_vec &T, arma::mat M0, arma::vec Q, int dim, int nmem) {
    int n = pow(2 * dim, nmem);
    arma::cx_mat M(n, n);
    arma::cx_mat P(n, n);
    arma::cx_mat InvP(n, n);
    arma::vec U(n);
    for (int i = 0; i < n; i++) {
        U(i) = 1;
        for (int j = 0; j < n; j++) {
            M(i, j) = std::complex<double>(0, 0);
            P(i, j) = std::complex<double>(0, 0);
            InvP(i, j) = std::complex<double>(0, 0);
        }
    }
    get_M(M, M0, Q, dim, nmem);
    P = arma::eye(n, n) - M;
    InvP = P.i();
    T = InvP * U;
}
I checked the overall RSS memory taken by the entire program, and it seems that the step involving P.i() increases the amount of memory used (which makes sense), but it does not free it when the program exits the get_T function. So the overall memory keeps increasing as the for-loop continues, which in the end leads to a huge amount of memory required. How can I fix this? I read that Armadillo cleans up memory every time it exits a function, but it does not seem to do it here.
Thanks for helping!
I need to compute a vector-matrix product as efficiently as possible. Specifically, given a vector s and a matrix A, I need to compute s * A. I have a class Vector which wraps a std::vector and a class Matrix which also wraps a std::vector (for efficiency).
The naive approach (the one that I am using at the moment) is to have something like
Vector<T> timesMatrix(Matrix<T>& matrix)
{
    Vector<T> result(matrix.columns());
    // constructor that does a resize on the underlying std::vector
    for (unsigned int i = 0; i < vector.size(); ++i)
    {
        for (unsigned int j = 0; j < matrix.columns(); ++j)
        {
            result[j] += (vector[i] * matrix.getElementAt(i, j));
            // getElementAt accesses the appropriate entry
            // of the underlying std::vector
        }
    }
    return result;
}
It works fine and takes nearly 12000 microseconds. Note that the vector s has 499 elements, while A is 499 x 15500.
The next step was trying to parallelize the computation: if I have N threads, then I can give each thread a part of the vector s and the "corresponding" rows of the matrix A. Each thread will compute a partial result Vector of size 15500 (one entry per column of A), and the final result will be their entry-wise sum.
First of all, in the class Matrix I added a method to extract some rows from a Matrix and build a smaller one:
Matrix<T> extractSomeRows(unsigned int start, unsigned int end)
{
    unsigned int rowsToExtract = end - start + 1;
    std::vector<T> tmp;
    tmp.reserve(rowsToExtract * numColumns);
    for (unsigned int i = start * numColumns; i < (end + 1) * numColumns; ++i)
    {
        tmp.push_back(matrix[i]);
    }
    return Matrix<T>(rowsToExtract, numColumns, tmp);
}
Then I defined a thread routine
void timesMatrixThreadRoutine
    (Matrix<T>& matrix, unsigned int start, unsigned int end, Vector<T>& newRow)
{
    // newRow is supposed to contain the partial result
    // computed by a thread
    newRow.resize(matrix.columns());
    for (unsigned int i = start; i < end + 1; ++i)
    {
        for (unsigned int j = 0; j < matrix.columns(); ++j)
        {
            newRow[j] += vector[i] * matrix.getElementAt(i - start, j);
        }
    }
}
And finally I modified the code of the timesMatrix method that I showed above:
Vector<T> timesMatrix(Matrix<T>& matrix)
{
    static const unsigned int NUM_THREADS = 4;
    unsigned int matRows = matrix.rows();
    unsigned int matColumns = matrix.columns();
    unsigned int rowsEachThread = vector.size() / NUM_THREADS;
    std::thread threads[NUM_THREADS];
    Vector<T> tmp[NUM_THREADS];
    unsigned int start, end;
    // all but the last thread
    for (unsigned int i = 0; i < NUM_THREADS - 1; ++i)
    {
        start = i * rowsEachThread;
        end = (i + 1) * rowsEachThread - 1;
        threads[i] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
            matrix.extractSomeRows(start, end), start, end, std::ref(tmp[i]));
    }
    // last thread
    start = (NUM_THREADS - 1) * rowsEachThread;
    end = matRows - 1;
    threads[NUM_THREADS - 1] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
        matrix.extractSomeRows(start, end), start, end, std::ref(tmp[NUM_THREADS - 1]));
    for (unsigned int i = 0; i < NUM_THREADS; ++i)
    {
        threads[i].join();
    }
    Vector<T> result(matColumns);
    for (unsigned int i = 0; i < NUM_THREADS; ++i)
    {
        result = result + tmp[i]; // the operator+ is overloaded
    }
    return result;
}
It still works but now it takes nearly 30000 microseconds, which is almost three times as much as before.
Am I doing something wrong? Do you think there is a better approach?
EDIT - using a "lightweight" VirtualMatrix
Following Ilya Ovodov's suggestion, I defined a class VirtualMatrix that wraps a T* matrixData, which is initialized in the constructor as
VirtualMatrix(Matrix<T>& m)
{
    numRows = m.rows();
    numColumns = m.columns();
    matrixData = m.pointerToData();
    // pointerToData() returns underlyingVector.data();
}
Then there is a method to retrieve a specific entry of the matrix:
inline T getElementAt(unsigned int row, unsigned int column)
{
    return *(matrixData + row * numColumns + column);
}
Now the execution time is better (approximately 8000 microseconds) but maybe there are some improvements to be made. In particular the thread routine is now
void timesMatrixThreadRoutine
    (VirtualMatrix<T>& matrix, unsigned int startRow, unsigned int endRow, Vector<T>& newRow)
{
    unsigned int matColumns = matrix.columns();
    newRow.resize(matColumns);
    for (unsigned int i = startRow; i < endRow + 1; ++i)
    {
        for (unsigned int j = 0; j < matColumns; ++j)
        {
            newRow[j] += (vector[i] * matrix.getElementAt(i, j));
        }
    }
}
and the really slow part is the one with the nested for loops. If I remove it, the result is obviously wrong, but it is "computed" in less than 500 microseconds. This is to say that passing the arguments now takes almost no time and the heavy part is really the computation.
According to you, is there any way to make it even faster?
Actually, you make a partial copy of the matrix for each thread in extractSomeRows, and that copying takes a lot of time.
Redesign it so that "some rows" becomes a virtual matrix pointing at data located in the original matrix.
Use vectorized SIMD instructions for your architecture by making it more explicit that you want to operate on several elements at a time, e.g. SSE2 and later on x86-64, and possibly NEON on ARM.
C++ compilers can often unroll a loop into vectorized code if you explicitly make the operation happen on contiguous elements:
Simple and fast matrix-vector multiplication in C / C++
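To make that concrete, here is a rough sketch (my example, not from the linked answer) of an inner loop that compilers typically auto-vectorize, because each iteration writes a distinct contiguous element of the result and reads contiguous elements of the matrix row:

// Sketch: for s * A with A row-major, fix row i of A and sweep the columns.
// Consecutive j touch consecutive memory in both result and rowData, which
// lets the compiler emit SIMD loads/stores (e.g. at -O3).
void accumulateRow(const double* rowData, double s_i,
                   double* result, unsigned int columns)
{
    for (unsigned int j = 0; j < columns; ++j)
        result[j] += s_i * rowData[j];
}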
There is also the option of using libraries specifically made for matrix multiplication. For larger matrices, it may be more efficient to use special implementations based on alternate algorithms like Strassen's algorithm. In fact, your best bet would be to use a C library like this, and then wrap it in an interface that looks similar to a C++ vector.
I'd like to parallelize the following code, especially these for loops, since they are the most expensive part:
for (i = 0; i < d1; i++)
    for (j = 0; j < d3; j++)
        for (k = 0; k < d2; k++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
It is the first time I have tried parallelizing code using OpenMP. I have tried several things, but I always end up with a worse runtime than with the serial version.
It would be great if you could tell me whether there is something wrong with the code or the pragmas...
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
//#include <stdint.h>
// ---------------------------------------------------------------------------
// allocate space for empty matrix A[row][col]
// access to matrix elements possible with:
// - A[row][col]
// - A[0][row*col]
float **alloc_mat(int row, int col)
{
    float **A1, *A2;
    A1 = (float **)calloc(row, sizeof(float *));     // pointers to rows
    A2 = (float *)calloc(row * col, sizeof(float));  // all matrix elements
    //#pragma omp parallel for
    for (int i = 0; i < row; i++)
        A1[i] = A2 + i * col;
    return A1;
}
// ---------------------------------------------------------------------------
// random initialisation of matrix with values [0..9]
void init_mat(float **A, int row, int col)
{
    //#pragma omp parallel for
    for (int i = 0; i < row * col; i++)
        A[0][i] = (float)(rand() % 10);
}
// ---------------------------------------------------------------------------
// DEBUG FUNCTION: printout of all matrix elements
void print_mat(float **A, int row, int col, char *tag)
{
    int i, j;
    printf("Matrix %s:\n", tag);
    for (i = 0; i < row; i++)
    {
        //#pragma omp parallel for
        for (j = 0; j < col; j++)
            printf("%6.1f ", A[i][j]);
        printf("\n");
    }
}
// ---------------------------------------------------------------------------
int main(int argc, char *argv[])
{
    float **A, **B, **C;  // matrices
    int d1, d2, d3;       // dimensions of matrices
    int i, j, k;          // loop variables
    double start, end;
    start = omp_get_wtime();

    /* print user instruction */
    if (argc != 4)
    {
        printf("Matrix multiplication: C = A x B\n");
        printf("Usage: %s <NumRowA> <NumColA> <NumColB>\n", argv[0]);
        return 0;
    }

    /* read user input */
    d1 = atoi(argv[1]); // rows of A and C
    d2 = atoi(argv[2]); // cols of A and rows of B
    d3 = atoi(argv[3]); // cols of B and C
    printf("Matrix sizes C[%d][%d] = A[%d][%d] x B[%d][%d]\n",
           d1, d3, d1, d2, d2, d3);

    /* prepare matrices */
    A = alloc_mat(d1, d2);
    init_mat(A, d1, d2);
    B = alloc_mat(d2, d3);
    init_mat(B, d2, d3);
    C = alloc_mat(d1, d3); // no initialisation of C,
                           // because it gets filled by matmult

    /* serial version of matmult */
    printf("Perform matrix multiplication...\n");
    //#pragma omp parallel
    //{
    #pragma omp parallel for collapse(3) schedule(guided)
    for (i = 0; i < d1; i++)
        for (j = 0; j < d3; j++)
            for (k = 0; k < d2; k++)
            {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
    //}
    end = omp_get_wtime();

    /* test output */
    print_mat(A, d1, d2, "A");
    print_mat(B, d2, d3, "B");
    print_mat(C, d1, d3, "C");

    printf("This task took %f seconds\n", end - start);
    printf("\nDone.\n");
    return 0;
}
As #genisage suggested in the comments, the matrices are likely small enough that the overhead of initializing the additional threads is greater than the time saved by computing the multiplication in parallel. I plotted the runtimes I obtained by running your code with and without OpenMP, using square matrices ranging from n=10 to n=1000: somewhere between n=50 and n=100 the parallel version becomes faster.
There are other issues to consider, however, when trying to write fast matrix multiplication, and they mostly have to do with using the cache effectively. First, you allocate your entire matrix contiguously (which is good), but you still go through two pointer indirections to access the data, which is unnecessary. Also, your matrices are stored in row-major format, which means you access the data in A and C contiguously, but not in B. Instead of explicitly storing B and multiplying a row of A with a column of B, you would get a faster multiplication by storing B transposed and multiplying a row of A elementwise with a row of B transposed.
This is an optimization focused only on A*B, however, and there may be other places in your code where storing B is better than B transposed; in that case, doing the matrix multiplication by blocking often leads to better cache utilization.
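A minimal sketch of that row-times-row idea (my illustration, reusing the variable names from the question; Bt is assumed to be allocated with alloc_mat(d3, d2) and filled with the transpose of B). Note also that the collapse(3) in the original folds the k loop into the parallel iteration space, so several threads can update the same C[i][j] concurrently; the sketch parallelizes the outer loop only and accumulates into a private sum:

// Sketch: Bt holds B transposed (Bt[j][k] == B[k][j]), so the inner loop
// reads both operands contiguously; one parallel loop, no shared writes.
#pragma omp parallel for
for (int i = 0; i < d1; i++)
    for (int j = 0; j < d3; j++)
    {
        float sum = 0.0f;
        for (int k = 0; k < d2; k++)
            sum += A[i][k] * Bt[j][k];
        C[i][j] = sum;
    }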
I have a data structure containing a vector of vectors, each of which holds about ~16000000 double values.
I now want to median-combine these vectors: from each original vector I take the value at position i, calculate the median of these values, and store it in the resulting vector at position i.
I already have the straightforward solution, but it is incredibly slow:
vector< vector<double> > vectors; // vectors contains the data vectors
vector<double> tmp;
vector<double> result;
vector<double> tmpmedian;
double pixels = 0.0;
double matrixcount = vectors.size();
tmp = vectors.at(0);
pixels = tmp.size();
for (int i = 0; i < pixels; i++) {
    for (int j = 0; j < matrixcount; j++) {
        tmp = vectors.at(j);
        tmpmedian.push_back(tmp.at(i));
    }
    result.push_back(medianOfVector(tmpmedian));
    tmpmedian.clear();
}
return result;
And medianOfVector looks like this:
double medianOfVector(vector<double>& vec) {
    double result = 0;
    if ((vec.size() % 2) != 0) {
        vector<double>::iterator i = vec.begin();
        vector<double>::size_type m = (vec.size() / 2);
        nth_element(i, i + m, vec.end());
        result = vec.at(m);
    } else {
        vector<double>::iterator i = vec.begin();
        vector<double>::size_type m = (vec.size() / 2) - 1;
        nth_element(i, i + m, vec.end());
        result = (vec.at(m) + vec.at(m + 1)) / 2;
    }
    return result;
}
Is there an algorithm or a way to do this faster? It takes nearly an eternity as it is.
Edit: Thank you for your replies. In case anyone is interested, here is the fixed version. It now takes about 9 seconds to median-combine three vectors with ~16000000 elements each; mean-combining takes around 3 seconds:
vector< vector<double> > vectors; // vectors contains the data vectors
vector<double> *tmp;
vector<double> result;
vector<double> tmpmedian;
tmp = &vectors.at(0);
int size = tmp->size();
int vectorsize = vectors.size();
for (int i = 0; i < size; i++) {
    for (int j = 0; j < vectorsize; j++) {
        tmp = &vectors.at(j);
        tmpmedian.push_back(tmp->at(i));
    }
    result.push_back(medianOfVector(tmpmedian));
    tmpmedian.clear();
}
return result;
And medianOfVector:
double medianOfVector(vector<double>& vec) {
    double result = 0;
    if ((vec.size() % 2) != 0) {
        vector<double>::iterator i = vec.begin();
        vector<double>::size_type m = (vec.size() / 2);
        nth_element(i, i + m, vec.end());
        result = vec.at(m);
    } else {
        vector<double>::iterator i = vec.begin();
        vector<double>::size_type m = (vec.size() - 1) / 2;
        nth_element(i, i + m, vec.end());
        double min = vec.at(m);
        double max = *min_element(i + m + 1, vec.end());
        result = (min + max) / 2;
    }
    return result;
}
A couple of points, both stemming from the fact that you've defined tmp as a vector instead of (for example) a reference.
vector<double> tmp;
tmp = vectors.at(0);
pixels = tmp.size();
Here you're copying the entirety of vectors[0] into tmp just to extract the size. You'll almost certainly gain some speed by avoiding the copy:
pixels = vectors.at(0).size();
Instead of copying the entire vector just to get its size, this just gets a reference to the first vector, and gets the size of that existing vector.
for (int i = 0; i < pixels; i++) {
    for (int j = 0; j < matrixcount; j++) {
        tmp = vectors.at(j);
        tmpmedian.push_back(tmp.at(i));
    }
Here you're again copying the entirety of vectors.at(j) into tmp. But (again) you don't really need a new copy of all the data--you're just retrieving a single item from that copy. You can retrieve the data you need directly from the original vector without copying the whole thing:
tmpmedian.push_back(vectors.at(j).at(i));
A possible next step would be to switch from using .at to operator[]:
tmpmedian.push_back(vectors[j][i]);
This is much more of a tradeoff though--it's not likely to gain nearly as much, and loses a bit of safety (range checking) in the process. To avoid losing safety, you could consider (for example) using range-based for loops instead of the counted for loops in your current code.
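For example, a hypothetical rewrite of the inner loop above with a range-based for (my sketch, not from the original code) removes the index on the outer container entirely:

// Each v is a reference to one data vector; no j index, no bounds to get wrong.
for (const vector<double>& v : vectors)
    tmpmedian.push_back(v[i]);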
Along rather different lines, you could instead change from using a vector<vector<double>> to using a small wrapper around a vector to give 2D addressing into a single vector. Using this with a suitable column-wise iterator, you could avoid creating tmpmedian as basically a copy of a column of the original 2D matrix--instead, you'd pass a column-wise iterator to medianOfVector, and just iterate through a column of the original data in-place.
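Here is a rough sketch of that last idea, assuming a flat row-major buffer; the class and all names are mine, not the answerer's. The only non-trivial part is giving std::nth_element a random-access iterator that jumps one row's worth of elements per step:

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Random-access iterator that walks one column of a row-major flat buffer
// by jumping `stride` (= number of columns) elements at a time.
class ColumnIter {
public:
    using iterator_category = std::random_access_iterator_tag;
    using value_type        = double;
    using difference_type   = std::ptrdiff_t;
    using pointer           = double*;
    using reference         = double&;

    ColumnIter() : p_(nullptr), stride_(1) {}
    ColumnIter(double* p, std::ptrdiff_t stride) : p_(p), stride_(stride) {}

    reference operator*() const { return *p_; }
    reference operator[](difference_type n) const { return p_[n * stride_]; }

    ColumnIter& operator++()    { p_ += stride_; return *this; }
    ColumnIter  operator++(int) { ColumnIter t = *this; ++*this; return t; }
    ColumnIter& operator--()    { p_ -= stride_; return *this; }
    ColumnIter  operator--(int) { ColumnIter t = *this; --*this; return t; }
    ColumnIter& operator+=(difference_type n) { p_ += n * stride_; return *this; }
    ColumnIter& operator-=(difference_type n) { p_ -= n * stride_; return *this; }

    friend ColumnIter operator+(ColumnIter it, difference_type n) { return it += n; }
    friend ColumnIter operator+(difference_type n, ColumnIter it) { return it += n; }
    friend ColumnIter operator-(ColumnIter it, difference_type n) { return it -= n; }
    friend difference_type operator-(ColumnIter a, ColumnIter b) { return (a.p_ - b.p_) / a.stride_; }

    friend bool operator==(ColumnIter a, ColumnIter b) { return a.p_ == b.p_; }
    friend bool operator!=(ColumnIter a, ColumnIter b) { return a.p_ != b.p_; }
    friend bool operator< (ColumnIter a, ColumnIter b) { return a.p_ <  b.p_; }
    friend bool operator> (ColumnIter a, ColumnIter b) { return a.p_ >  b.p_; }
    friend bool operator<=(ColumnIter a, ColumnIter b) { return a.p_ <= b.p_; }
    friend bool operator>=(ColumnIter a, ColumnIter b) { return a.p_ >= b.p_; }

private:
    double* p_;
    std::ptrdiff_t stride_;
};

// Median of column c of a rows-by-cols flat buffer, selected in place
// (nth_element reorders the column, just as it reorders tmpmedian above).
double columnMedian(std::vector<double>& data, std::size_t rows, std::size_t cols, std::size_t c)
{
    ColumnIter first(data.data() + c, (std::ptrdiff_t)cols);
    ColumnIter mid = first + (std::ptrdiff_t)(rows / 2);
    std::nth_element(first, mid, first + (std::ptrdiff_t)rows);
    return *mid;
}

For brevity this handles only the odd-count case; an even count would combine the two middle values the way medianOfVector above does. The payoff is that no per-column temporary is built at all: the selection happens directly on the original storage.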