openmp increasing number of threads increases the execution time


I'm implementing sparse matrix multiplication (with elements of type std::complex) after converting the matrices to CSR (compressed sparse row) format, and I'm using OpenMP for this. I noticed that increasing the number of threads doesn't necessarily improve performance; sometimes it does exactly the opposite. Why is that the case, and what can I do to solve the issue?
typedef std::vector < std::vector < std::complex < int >>> matrix;
struct CSR {
std::vector<std::complex<int>> values; //non-zero values
std::vector<int> row_ptr; //pointers of rows
std::vector<int> cols_index; //indices of columns
int rows; //number of rows
int cols; //number of columns
int NNZ; //number of non_zero elements
};
const matrix multiply_omp (const CSR& A,
const CSR& B,const unsigned int num_threds=4) {
if (A.cols != B.rows)
throw "Error";
CSR B_t = sparse_transpose(B);
omp_set_num_threads(num_threds);
matrix result(A.rows, std::vector < std::complex < int >>(B.cols, 0));
#pragma omp parallel
{
int i, j, k, l;
#pragma omp for
for (i = 0; i < A.rows; i++) {
for (j = 0; j < B_t.rows; j++) {
std::complex < int > sum(0, 0);
for (k = A.row_ptr[i]; k < A.row_ptr[i + 1]; k++)
for (l = B_t.row_ptr[j]; l < B_t.row_ptr[j + 1]; l++)
if (A.cols_index[k] == B_t.cols_index[l]) {
sum += A.values[k] * B_t.values[l];
break;
}
if (sum != std::complex < int >(0, 0)) {
result[i][j] += sum;
}
}
}
}
return result;
}

You can try to improve the scaling of this algorithm, but I would use a better algorithm. You are allocating a dense matrix (wrongly, but that's beside the point) for the product of two sparse matrices. That's wasteful, since quite often the product of two sparse matrices will not be dense by a long shot.
Your algorithm also has the wrong time complexity. The way you search through the rows of B means that your complexity has an extra factor of something like the average number of nonzeros per row. A better algorithm would assume that the indices in each row are sorted, and then keep a pointer for how far you got into that row.
Read the literature on "Graph Blas" for references to efficient algorithms.
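The "keep a pointer for how far you got into that row" idea can be sketched as a two-pointer merge over two sorted index lists. This is only a minimal sketch: it assumes the column indices within each row are strictly increasing, and the name sparse_dot and its parameters are illustrative, not part of the code in the question.

```cpp
#include <cstddef>
#include <vector>

// Dot product of two sparse rows, each given as (column index, value) arrays.
// Assumption: idx_a and idx_b are sorted in strictly increasing order.
template <typename T>
T sparse_dot(const std::vector<int>& idx_a, const std::vector<T>& val_a,
             const std::vector<int>& idx_b, const std::vector<T>& val_b) {
    T sum{};
    std::size_t p = 0;  // how far we got into row a
    std::size_t q = 0;  // how far we got into row b
    while (p < idx_a.size() && q < idx_b.size()) {
        if (idx_a[p] < idx_b[q]) {
            ++p;                          // column present only in a
        } else if (idx_b[q] < idx_a[p]) {
            ++q;                          // column present only in b
        } else {
            sum += val_a[p] * val_b[q];   // matching column: contributes
            ++p;
            ++q;
        }
    }
    return sum;
}
```

Each index is visited at most once, so one dot product costs on the order of nnz(a) + nnz(b) steps rather than nnz(a) * nnz(b) as in the nested search.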

Related

Accelerating a nested loop in C++

I have the following piece of C++ code. The scale of the problem is N and M. Running the code takes about two minutes on my machine (after compiling with g++ -O3). Is there any way to accelerate it further on the same machine? Any kind of option is on the table: choosing a better data structure, a library, the GPU, parallelism, etc.
void demo() {
int N = 1000000;
int M=3000;
vector<vector<int> > res(M);
for (int i =0; i <N;i++) {
for (int j=1; j < M; j++){
res[j].push_back(i);
}
}
}
int main() {
demo();
return 0;
}
Additional info: the second loop above, for (int j=1; j < M; j++), is a simplified version of the real problem. In fact, j could be in a different range for each i (of the outer loop), but the number of iterations is about 3000.

With the exact code as shown when writing this answer, you could create the inner vector once, with the specific size, and call iota to initialize it. Then just pass this vector along to the outer vector constructor to use it for each element.
Then you don't need any explicit loops at all, and instead use the (highly optimized, hopefully) standard library to do all the work for you.
Perhaps something like this:
void demo()
{
static int const N = 1000000;
static int const M = 3000;
std::vector<int> data(N);
std::iota(begin(data), end(data), 0);
std::vector<std::vector<int>> res(M, data);
}

Alternatively, you could initialize just one vector with those elements, and then create the other vectors by copying that block of memory using std::memcpy or std::copy.
Another optimization would be to allocate the memory in advance (e.g. array.reserve(3000)).
Also, if you're sure that all the inner vectors are identical, you could use a hack: create a single vector with 3000 elements, and store a reference to that same vector a million times in res instead of copying it.

On my machine, which has enough memory to avoid swapping, your original code took 86 seconds.
Adding reserve:
for (auto& v : res)
{
v.reserve(N);
}
made basically no difference (85 seconds but I only ran each version once).
Swapping the loop order:
for (int j = 1; j < M; j++) {
for (int i = 0; i < N; i++) {
res[j].push_back(i);
}
}
reduced the time to 10 seconds. This is likely due to a combination of allowing the compiler to use SIMD optimisations and improving cache locality by accessing memory in sequential order.
Creating one vector and copying it into the others:
for (int i = 0; i < N; i++) {
res[1].push_back(i);
}
for (int j = 2; j < M; j++) {
res[j] = res[1];
}
reduced the time to 4 seconds.
Using a single vector:
void demo() {
size_t N = 1000000;
size_t M = 3000;
vector<int> res(M*N);
size_t offset = N;
for (size_t i = 0; i < N; i++) {
res[offset++] = i;
}
for (size_t j = 2; j < M; j++) {
std::copy(res.begin() + N, res.begin() + N * 2, res.begin() + offset);
offset += N;
}
}
also took 4 seconds. There probably isn't much improvement because you have 3,000 vectors of 4 MB each; there would likely be more of a difference if N were smaller or M were larger.
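To repeat this kind of measurement yourself, something like the following could be used. It is a rough sketch, not the code that produced the timings above: the names fill_i_outer, fill_j_outer and time_ms are made up for illustration, and a single run with std::chrono::steady_clock is only a coarse measurement.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Table = std::vector<std::vector<int>>;

// Original order: the inner loop touches a different vector on every step.
Table fill_i_outer(int n, int m) {
    Table res(m);
    for (int i = 0; i < n; ++i)
        for (int j = 1; j < m; ++j)
            res[j].push_back(i);
    return res;
}

// Swapped order: each vector is filled front to back before moving on.
Table fill_j_outer(int n, int m) {
    Table res(m);
    for (int j = 1; j < m; ++j) {
        res[j].reserve(n);
        for (int i = 0; i < n; ++i)
            res[j].push_back(i);
    }
    return res;
}

// Wall-clock time of one call to f, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

A call such as time_ms([&] { r = fill_j_outer(N, M); }) with a Table r declared outside the lambda keeps the result alive, so the compiler cannot discard the work being timed.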

Multiplying Matrices with two for loops in C++ [duplicate]

I came up with this algorithm for matrix multiplication. I read somewhere that matrix multiplication has a time complexity of O(n^2).
But I think this algorithm is O(n^3).
I don't know how to calculate the time complexity of nested loops, so please correct me.
for i=1 to n
for j=1 to n
c[i][j]=0
for k=1 to n
c[i][j] = c[i][j]+a[i][k]*b[k][j]

Using linear algebra, there exist algorithms that achieve better complexity than the naive O(n^3). Strassen's algorithm achieves a complexity of O(n^2.807) by reducing the number of multiplications required for each 2x2 block product from 8 to 7.
Asymptotically faster algorithms exist as well, such as the Coppersmith-Winograd algorithm and its refinements, with a complexity of roughly O(n^2.37). Unless the matrix is huge, these algorithms do not result in a vast difference in computation time. In practice, it is easier and faster to use parallel algorithms for matrix multiplication.
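The "8 to 7" reduction can be illustrated on a plain 2x2 product of scalars. The sketch below uses the standard Strassen formulas; the type alias Mat2 and the function name strassen_2x2 are illustrative. In the real algorithm the entries are themselves sub-matrices and the seven products are computed recursively.

```cpp
#include <array>

using Mat2 = std::array<std::array<long long, 2>, 2>;

// 2x2 product using Strassen's seven multiplications instead of eight.
Mat2 strassen_2x2(const Mat2& a, const Mat2& b) {
    const long long m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    const long long m2 = (a[1][0] + a[1][1]) * b[0][0];
    const long long m3 = a[0][0] * (b[0][1] - b[1][1]);
    const long long m4 = a[1][1] * (b[1][0] - b[0][0]);
    const long long m5 = (a[0][0] + a[0][1]) * b[1][1];
    const long long m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    const long long m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

    Mat2 c{};
    c[0][0] = m1 + m4 - m5 + m7;
    c[0][1] = m3 + m5;
    c[1][0] = m2 + m4;
    c[1][1] = m1 - m2 + m3 + m6;
    return c;
}
```

Applying seven sub-products recursively at every level is what gives the exponent log2(7), which is about 2.807.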

The naive algorithm, which is what you've got once you correct it as noted in comments, is O(n^3).
There do exist algorithms that reduce this somewhat, but you're not likely to find an O(n^2) implementation. I believe the question of the most efficient implementation is still open.
See this Wikipedia article on Matrix Multiplication for more information.

The standard way of multiplying an m-by-n matrix by an n-by-p matrix has complexity O(mnp). If all of those are "n" to you, it's O(n^3), not O(n^2). EDIT: it will not be O(n^2) in the general case. But there are faster algorithms for particular types of matrices -- if you know more you may be able to do better.
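To make the O(mnp) count concrete, here is a small sketch of the standard triple loop that also counts the scalar multiplications it performs. The function name multiply_naive and the counter parameter mults are illustrative, and the sketch assumes every row of a has exactly as many entries as b has rows.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Multiplies an m-by-n matrix a by an n-by-p matrix b.
// mults is set to the number of scalar multiplications performed.
Matrix multiply_naive(const Matrix& a, const Matrix& b, std::size_t& mults) {
    const std::size_t m = a.size();
    const std::size_t n = b.size();
    const std::size_t p = (n > 0) ? b[0].size() : 0;
    Matrix c(m, std::vector<int>(p, 0));
    mults = 0;
    for (std::size_t i = 0; i < m; ++i)          // m iterations
        for (std::size_t j = 0; j < p; ++j)      // p iterations each
            for (std::size_t k = 0; k < n; ++k) {  // n iterations each
                c[i][j] += a[i][k] * b[k][j];
                ++mults;
            }
    return c;
}
```

For a 2-by-3 matrix times a 3-by-2 matrix the counter ends at 2 * 3 * 2 = 12, and with m = n = p it grows as n^3.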

The naive matrix multiplication uses three nested for loops, and each loop runs n times. With three nested loops the total becomes O(n^3).

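I recently had a matrix multiplication problem in a college assignment; this is how I solved it with only two explicit for loops. Note that the inner loop resets j for every column of B, so for square matrices the running time is still O(n^3), not O(n^2).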
import java.util.Scanner;
public class q10 {
public static int[][] multiplyMatrices(int[][] A, int[][] B) {
int ra = A.length; // rows in A
int ca = A[0].length; // columns in A
int rb = B.length; // rows in B
int cb = B[0].length; // columns in B
// if columns of A is not equal to rows of B, then the two matrices,
// cannot be multiplied.
if (ca != rb) {
System.out.println("Incorrect order, multiplication cannot be performed");
return A;
} else {
// AB is the product of A and B, and it will have rows
// equal to rows in A and columns equal to columns in B
int[][] AB = new int[ra][cb];
int k = 0; // column number of matrix B, while multiplying
int entry; // = Aij, value in ith row and at jth index
for (int i = 0; i < A.length; i++) {
entry = 0;
k = 0;
for (int j = 0; j < A[i].length; j++) {
// to evaluate a new Aij, clear the earlier entry
if (j == 0) {
entry = 0;
}
int currA = A[i][j]; // number selected in matrix A
int currB = B[j][k]; // number selected in matrix B
entry += currA * currB; // adding to the current entry
// if we are done with all the columns for this entry,
// reset the loop for next one.
if (j + 1 == ca) {
j = -1;
// put the evaluated value at its position
AB[i][k] = entry;
// increase the column number of matrix B as we are done with this one
k++;
}
// if this row is done break this loop,
// move to next row.
if (k == cb) {
j = A[i].length;
}
}
}
return AB;
}
}
@SuppressWarnings({ "resource" })
public static void main(String[] args) {
Scanner ip = new Scanner(System.in);
System.out.println("Input order of first matrix (r x c):");
int ra = ip.nextInt();
int ca = ip.nextInt();
System.out.println("Input order of second matrix (r x c):");
int rb = ip.nextInt();
int cb = ip.nextInt();
int[][] A = new int[ra][ca];
int[][] B = new int[rb][cb];
System.out.println("Enter values in first matrix:");
for (int i = 0; i < ra; i++) {
for (int j = 0; j < ca; j++) {
A[i][j] = ip.nextInt();
}
}
System.out.println("Enter values in second matrix:");
for (int i = 0; i < rb; i++) {
for (int j = 0; j < cb; j++) {
B[i][j] = ip.nextInt();
}
}
int[][] AB = multiplyMatrices(A, B);
System.out.println("The product of first and second matrix is:");
for (int i = 0; i < AB.length; i++) {
for (int j = 0; j < AB[i].length; j++) {
System.out.print(AB[i][j] + " ");
}
System.out.println();
}
}
}

two dimensional vector matrices addition

vector<vector<int>> AsumB(
int kolumny, vector<vector<int>> matrix1, vector<vector<int>> matrix2) {
vector<vector<int>>matrix(kolumny);
matrix = vector<vector<int>>(matrix1.size());
for (int i = 0; i < kolumny; ++i)
for (int j = 0; i <(static_cast<signed int>(matrix1.size())); ++i)
matrix[i][j] = matrix1[i][j] + matrix2[i][j];
return matrix;
}
Please tell me what I'm misunderstanding and help me solve this problem,
because this kind of code would work for a one-dimensional vector.

What about
vector<vector<int>> AsumB(vector<vector<int>> const & matrix1,
vector<vector<int>> const & matrix2) {
vector<vector<int>> matrix(matrix1);
for (std::size_t i = 0U; i < matrix.size(); ++i)
for (std::size_t j = 0U; j < matrix[i].size(); ++j)
matrix[i][j] += matrix2[i][j];
return matrix;
}
?

Unable to reproduce, and OP's reported compiler error doesn't look like it matches the code, so the problem is probably somewhere else.
However, there is a lot wrong here that could be causing all sorts of bad behaviour and should be addressed. I've taken the liberty of reformatting the code a bit to make explaining easier:
vector<vector<int>> AsumB(int kolumny,
vector<vector<int>> matrix1,
vector<vector<int>> matrix2)
matrix1 and matrix2 are passed by value. There is nothing wrong logically, but this means there is the potential for a lot of unnecessary copying unless the compiler is very sharp.
{
vector<vector<int>> matrix(kolumny);
Declares a vector of vectors with the outer vector sized to kolumny. There are no inner vectors allocated, so 2D operations are doomed.
matrix = vector<vector<int>>(matrix1.size());
Makes a temporary vector of vectors with the outer vector sized to match the outer vector of matrix1. This temporary vector is then assigned to the just-created matrix, replacing its current contents, and is then destroyed. matrix still has no inner vectors allocated, so 2D operations are still doomed.
for (int i = 0; i < kolumny; ++i)
for (int j = 0; i < (static_cast<signed int>(matrix1.size())); ++i)
i and j should never go negative (huge logic problem if they do), so use an unsigned type. Use the right unsigned type and the static_cast is meaningless.
In addition, the inner for loop increments and tests i, not j.
matrix[i][j] = matrix1[i][j] + matrix2[i][j];
I see nothing wrong here other than matrix having nothing for j to index. This will result in Undefined Behaviour as the accesses go out of bounds.
return matrix;
}
Cleaning this up so that it is logically sound:
vector<vector<int>> AsumB(const vector<vector<int>> & matrix1,
const vector<vector<int>> & matrix2)
We don't need the number of columns. The vector already knows all the sizes involved. A caveat, though: vector<vector<int>> allows different sizes of all of the inner vectors. Don't do this and you should be good.
Next, this function now takes its parameters by constant reference. With the reference there is no copying. With const the compiler knows the vectors will not be changed inside the function and can prevent errors and make a bunch of optimizations.
{
size_t row = matrix1.size();
size_t is an unsigned data type guaranteed to be large enough to index any representable object. It will be big enough, and you don't have to worry about pesky negative numbers. It also eliminates the need for any casting later.
if (!(row > 0 && row == matrix2.size()))
{
return vector<vector<int>>();
}
Here we make sure that everyone agrees on the number of rows involved and return an empty vector if they don't. You could also throw an exception. The exception may be a better solution, but I don't know the use case.
size_t column = matrix1[0].size();
if (!(column > 0 && column == matrix2[0].size()))
{
return vector<vector<int>>();
}
Does the same as above, but makes sure the number of columns makes sense.
vector<vector<int>> matrix(row, vector<int>(column));
Creates a local row by column matrix to store the result. Note the second parameter: vector<int>(column) tells the compiler that all row inner vectors will be initialized to a vector of size column.
for (size_t i = 0; i < row; ++i)
{
for (size_t j = 0; j < column; ++j)
{
Here we simplified the loops just a bit since we know all the sizes.
matrix[i][j] = matrix1[i][j] + matrix2[i][j];
}
}
return matrix;
The compiler has a number of tricks at its disposal to eliminate copying matrix on return. Look up Return Value Optimization with your preferred web search engine if you want to know more.
}
All together:
vector<vector<int>> AsumB(const vector<vector<int>> & matrix1,
const vector<vector<int>> & matrix2)
{
size_t row = matrix1.size();
if (!(row > 0 && row == matrix2.size()))
{
return vector<vector<int>>();
}
size_t column = matrix1[0].size();
if (!(column > 0 && column == matrix2[0].size()))
{
return vector<vector<int>>();
}
vector<vector<int>> matrix(row, vector<int>(column));
for (size_t i = 0; i < row; ++i)
{
for (size_t j = 0; j < column; ++j)
{
matrix[i][j] = matrix1[i][j] + matrix2[i][j];
}
}
return matrix;
}
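The exception-throwing alternative mentioned above could look roughly like this. It is a sketch under assumptions: the name AsumB_checked is made up, std::invalid_argument is just one reasonable choice of exception type, and unlike the version above it checks every row's length rather than only the first.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using std::vector;

// Element-wise sum that throws instead of returning an empty matrix
// when the two inputs do not have the same shape.
vector<vector<int>> AsumB_checked(const vector<vector<int>>& matrix1,
                                  const vector<vector<int>>& matrix2)
{
    if (matrix1.size() != matrix2.size())
        throw std::invalid_argument("AsumB_checked: row counts differ");

    vector<vector<int>> matrix(matrix1);  // start from a copy of matrix1
    for (std::size_t i = 0; i < matrix.size(); ++i)
    {
        if (matrix1[i].size() != matrix2[i].size())
            throw std::invalid_argument("AsumB_checked: row lengths differ");
        for (std::size_t j = 0; j < matrix[i].size(); ++j)
        {
            matrix[i][j] += matrix2[i][j];
        }
    }
    return matrix;
}
```

With this version the caller cannot confuse "the inputs were mismatched" with "the sum is legitimately empty", at the cost of having to catch the exception.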

C++ - Efficiently computing a vector-matrix product

I need to compute a product vector-matrix as efficiently as possible. Specifically, given a vector s and a matrix A, I need to compute s * A. I have a class Vector which wraps a std::vector and a class Matrix which also wraps a std::vector (for efficiency).
The naive approach (the one that I am using at the moment) is to have something like
Vector<T> timesMatrix(Matrix<T>& matrix)
{
Vector<unsigned int> result(matrix.columns());
// constructor that does a resize on the underlying std::vector
for(unsigned int i = 0 ; i < vector.size() ; ++i)
{
for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
{
result[j] += (vector[i] * matrix.getElementAt(i, j));
// getElementAt accesses the appropriate entry
// of the underlying std::vector
}
}
return result;
}
It works fine and takes nearly 12000 microseconds. Note that the vector s has 499 elements, while A is 499 x 15500.
The next step was trying to parallelize the computation: if I have N threads, then I can give each thread a part of the vector s and the "corresponding" rows of the matrix A. Each thread will compute a 15500-sized Vector (one entry per column of A) and the final result will be their entry-wise sum.
First of all, in the class Matrix I added a method to extract some rows from a Matrix and build a smaller one:
Matrix<T> extractSomeRows(unsigned int start, unsigned int end)
{
unsigned int rowsToExtract = end - start + 1;
std::vector<T> tmp;
tmp.reserve(rowsToExtract * numColumns);
for(unsigned int i = start * numColumns ; i < (end+1) * numColumns ; ++i)
{
tmp.push_back(matrix[i]);
}
return Matrix<T>(rowsToExtract, numColumns, tmp);
}
Then I defined a thread routine
void timesMatrixThreadRoutine
(Matrix<T>& matrix, unsigned int start, unsigned int end, Vector<T>& newRow)
{
// newRow is supposed to contain the partial result
// computed by a thread
newRow.resize(matrix.columns());
for(unsigned int i = start ; i < end + 1 ; ++i)
{
for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
{
newRow[j] += vector[i] * matrix.getElementAt(i - start, j);
}
}
}
And finally I modified the code of the timesMatrix method that I showed above:
Vector<T> timesMatrix(Matrix<T>& matrix)
{
static const unsigned int NUM_THREADS = 4;
unsigned int matRows = matrix.rows();
unsigned int matColumns = matrix.columns();
unsigned int rowsEachThread = vector.size()/NUM_THREADS;
std::thread threads[NUM_THREADS];
Vector<T> tmp[NUM_THREADS];
unsigned int start, end;
// all but the last thread
for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
{
start = i*rowsEachThread;
end = (i+1)*rowsEachThread - 1;
threads[i] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
matrix.extractSomeRows(start, end), start, end, std::ref(tmp[i]));
}
// last thread
start = (NUM_THREADS-1)*rowsEachThread;
end = matRows - 1;
threads[NUM_THREADS - 1] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
matrix.extractSomeRows(start, end), start, end, std::ref(tmp[NUM_THREADS-1]));
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
threads[i].join();
}
Vector<unsigned int> result(matColumns);
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
result = result + tmp[i]; // the operator+ is overloaded
}
return result;
}
It still works but now it takes nearly 30000 microseconds, which is almost three times as much as before.
Am I doing something wrong? Do you think there is a better approach?
EDIT - using a "lightweight" VirtualMatrix
Following Ilya Ovodov's suggestion, I defined a class VirtualMatrix that wraps a T* matrixData, which is initialized in the constructor as
VirtualMatrix(Matrix<T>& m)
{
numRows = m.rows();
numColumns = m.columns();
matrixData = m.pointerToData();
// pointerToData() returns underlyingVector.data();
}
Then there is a method to retrieve a specific entry of the matrix:
inline T getElementAt(unsigned int row, unsigned int column)
{
return *(matrixData + row*numColumns + column);
}
Now the execution time is better (approximately 8000 microseconds) but maybe there are some improvements to be made. In particular the thread routine is now
void timesMatrixThreadRoutine
(VirtualMatrix<T>& matrix, unsigned int startRow, unsigned int endRow, Vector<T>& newRow)
{
unsigned int matColumns = matrix.columns();
newRow.resize(matColumns);
for(unsigned int i = startRow ; i < endRow + 1 ; ++i)
{
for(unsigned int j = 0 ; j < matColumns ; ++j)
{
newRow[j] += (vector[i] * matrix.getElementAt(i, j));
}
}
}
and the really slow part is the one with the nested for loops. If I remove it, the result is obviously wrong but is "computed" in less than 500 microseconds. This to say that now passing the arguments takes almost no time and the heavy part is really the computation.
According to you, is there any way to make it even faster?

In extractSomeRows you actually make a partial copy of the matrix for each thread, and that takes a lot of time.
Redesign it so that "some rows" becomes a virtual matrix pointing at data located in the original matrix.
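A "virtual matrix" of this kind can be sketched as a small non-owning view. The sketch assumes the matrix is stored row-major in one contiguous std::vector; RowBlockView, make_view and times_block are illustrative names, not the classes from the question, and threading is left out to keep it short.

```cpp
#include <cstddef>
#include <vector>

// Non-owning view of a block of consecutive rows of a row-major matrix.
// Creating a view copies no matrix elements.
template <typename T>
struct RowBlockView {
    const T* data;          // first element of the first row in the block
    std::size_t num_rows;
    std::size_t num_cols;

    const T& at(std::size_t row, std::size_t col) const {
        return data[row * num_cols + col];
    }
};

// Builds a view of rows [first_row, first_row + num_rows) of storage.
template <typename T>
RowBlockView<T> make_view(const std::vector<T>& storage, std::size_t num_cols,
                          std::size_t first_row, std::size_t num_rows) {
    return {storage.data() + first_row * num_cols, num_rows, num_cols};
}

// Partial product: the entries s[first_row ...] times the rows in block.
// This is the work one thread would do for its share of the rows.
template <typename T>
void times_block(const std::vector<T>& s, std::size_t first_row,
                 const RowBlockView<T>& block, std::vector<T>& out) {
    out.assign(block.num_cols, T{});
    for (std::size_t i = 0; i < block.num_rows; ++i)
        for (std::size_t j = 0; j < block.num_cols; ++j)
            out[j] += s[first_row + i] * block.at(i, j);
}
```

The view is only valid while the original storage is alive and not reallocated, so the threads must be joined before the matrix is modified or destroyed.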

Use the vectorized instructions of your architecture by making it more explicit that you want to multiply four elements at a time, e.g. SSE2+ on x86-64 and possibly NEON on ARM.
C++ compilers can often unroll the loop into vectorized code if you explicitly make an operation happen on contiguous elements:
Simple and fast matrix-vector multiplication in C / C++
There is also the option of using libraries specifically made for matrix multiplication. For larger matrices, it may be more efficient to use special implementations based on the Fast Fourier Transform, alternate algorithms like Strassen's Algorithm, etc. In fact, your best bet would be to use a C library like this, and then wrap it in an interface that looks similar to a C++ vector.

sum of squares matrices

I want to write a function that, given two matrices, returns their sum. I think the problem is in how I initialize the Matrix 't'.
#include <iostream>
#include <vector>
using namespace std;
typedef vector< vector<int> > Matrix;
Matrix sum(const Matrix&a,const Matrix&b){
Matrix t;
for(int i=0;i<a.size();i++)
for(int j=0;j<a.size();j++)
t[i][j] = a[i][j] + b[i][j];
return t;
}

You'll need to initialize the rows and columns of t with something like:
Matrix t = vector< vector<int> >(row_count, vector<int>(col_count, 0));
That will make a row_count by col_count matrix filled with zeroes.
On a side note about performance: comparing to .size() in a for loop means that .size() is evaluated before each iteration. For std::vector this is a constant-time call that compilers usually inline, so the saving is small, but you can hoist it out of the loop like so:
for (int row = 0, row_ct = mat.size(); row < row_ct; ++row)

You don't have a rectangular data set in general: each a[i] is a vector of a possibly different length. Supposing you do in fact take care to have a rectangular grid, your for loop is still off; it should be like this:
for (int i = 0; i < a.size(); i++)
{
assert(a.size() <= b.size() && a.size() <= t.size());
for (int j = 0; j < a[i].size(); j++) // !!
{
assert(a[i].size() <= b[i].size() && a[i].size() <= t[i].size());
t[i][j] = a[i][j] + b[i][j];
}
}
I added some assertions to indicate which preconditions you have to satisfy.
To initialize a rectangular array, you can do something like this:
std::vector<std::vector<int>> v(n_rows, std::vector<int>(n_cols, 0));
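Putting the initialization and the corrected loop bounds together, the function from the question could be written along these lines. This is a sketch that assumes a and b have the same shape; it sizes each row of t from the matching row of a, so it also works when rows have different lengths.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<int>> Matrix;

// Element-wise sum of a and b. Precondition: same shape, row by row.
Matrix sum(const Matrix& a, const Matrix& b) {
    assert(a.size() == b.size());
    Matrix t(a.size());                    // one (still empty) row per row of a
    for (std::size_t i = 0; i < a.size(); ++i) {
        assert(a[i].size() == b[i].size());
        t[i].resize(a[i].size());          // allocate this row before indexing it
        for (std::size_t j = 0; j < a[i].size(); ++j) {
            t[i][j] = a[i][j] + b[i][j];
        }
    }
    return t;
}
```

The important difference from the original is that every t[i][j] is written only after t and t[i] have been given a size.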
