Matrix multiplication using multiple threads


So I am trying to compute (M by N matrix) times (N by 1 vector) operations with threads into a resulting vector. The question in my book says that I should think about how many threads to use, and I assume since the result matrix should be M by 1 then I should use M threads, one for each set of operations.
M is height, and N is width.
To create the threads I use
thread* myThreads = new thread[height];
Then I call the MatrixMultThreads function once per row. At the end I join all the threads.
for (int i = 0; i < height; i++)
{
myThreads[i] = thread(MatrixMultThreads, my2DArray, vector, height, width);
}
for (int i = 0; i < height; i++)
{
myThreads[i].join();
}
What I am having trouble figuring out is how should I sum up all the resulting values in the correct order. How would I tell each specific thread what to do?
I was thinking, maybe I should create a global variable step_i and set it to 0, then each time the function is called I can increment that variable. Then, since I can pass the width of the array, I go through each step_i and add arr[i][j] * vector[j].

"What I am having trouble figuring out is how should I sum up all the resulting values in the correct order."
They can be summed out-of-order, which is why this is a good problem to solve with multi-threading. If ordering matters to a specific problem, you can't improve it with multithreading (to be clear, if any sub-problem can be solved out-of-order then that sub-problem is a potential candidate for multithreading).
One solution to your problem is to set up a solution vector at the call site, then pass the corresponding element by reference (also the MatrixMultiply function needs to know which problem it's solving):
void MatrixMultiply(const Array2d& matrix,
const vector<int>& vec, int row, int& solution);
// ...
vector<int> result(height);
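// std::thread copies its arguments, so wrap references in std::ref / std::cref
for (int i = 0; i < height; i++)
{
threads[i] = thread(MatrixMultiply, std::cref(array2d), std::cref(array1d), i, std::ref(result[i]));
}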
Your 2D array should really provide info on its height and width without having to pass these values explicitly.
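For reference, here is a minimal self-contained sketch of the approach above. It assumes a hypothetical Array2d that is just a vector of rows, and the helper name MultiplyWithThreads is made up for the example. It is meant as an illustration, not a tuned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical row-major matrix type for this sketch: matrix[row][col].
using Array2d = std::vector<std::vector<int>>;

// Computes one entry of the product: the dot product of one row with vec.
void MatrixMultiply(const Array2d& matrix, const std::vector<int>& vec,
                    std::size_t row, int& solution)
{
    assert(matrix[row].size() == vec.size());
    int sum = 0;
    for (std::size_t j = 0; j < vec.size(); ++j)
        sum += matrix[row][j] * vec[j];
    solution = sum;  // each thread writes only its own element
}

// Launches one thread per row and waits for all of them to finish.
std::vector<int> MultiplyWithThreads(const Array2d& matrix,
                                     const std::vector<int>& vec)
{
    std::vector<int> result(matrix.size());
    std::vector<std::thread> threads;
    threads.reserve(matrix.size());
    for (std::size_t i = 0; i < matrix.size(); ++i)
        threads.emplace_back(MatrixMultiply, std::cref(matrix), std::cref(vec),
                             i, std::ref(result[i]));
    for (auto& t : threads)
        t.join();
    return result;
}
```

Because every thread writes to a different element of result, no mutex is needed, and the order in which the threads finish does not matter.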
BONUS INFO:
We could make this solution much more OOP in a way that you'll want to reuse for future problems (and some experienced programmers seem to miss this trick for using arrays):
The MatrixMultiply function is really similar to a dot-product function:
template <typename V1, typename V2>
auto DotProduct(const V1& vec1, const V2& vec2)
{
auto result = vec1[0] * vec2[0];
for (size_t i = 1; i < vec1.size(); ++i)
result += vec1[i] * vec2[i];
return result;
}
template <typename V1, typename V2, typename T>
auto DotProduct(const V1& vec1, const V2& vec2, T& result)
{
result = DotProduct(vec1, vec2);
}
(The above allows the vectors to be any objects that provide size() and [] with the expected behaviour.)
We can write a wrapper class around std::vector that can be used by our array class to handle all the indexing for us; like this:
template <typename T, typename A>
class SubVector
{
const typename std::vector<T,A>::iterator m_it;
const size_t m_size, m_interval_size;
public:
SubVector (std::vector<T,A>& v, size_t start, size_t sub_size, size_t i_size = 1)
: m_it(v.begin() + start), m_size(sub_size), m_interval_size(i_size)
{}
auto size () const
{
return m_size;
}
const T& operator [] (size_t i) const
{
return m_it[i*m_interval_size];
}
T& operator [] (size_t i)
{
return m_it[i*m_interval_size];
}
};
Then you could use this in some kind of Vectorise method in your array; like this:
template <typename T, typename A = std::allocator<T>>
class Array2D
{
std::vector<T,A> m_data;
size_t m_width, m_height;
public:
// your normal methods
// Non-const: SubVector stores a mutable iterator into m_data.
auto VectoriseRow(int r)
{
return SubVector<T,A>(m_data, r*m_width, m_width);
}
auto VectoriseColumn(int c)
{
return SubVector<T,A>(m_data, c, m_height, m_width);
}
};
(Note: We could add the Vectorise feature to std::array or boost::multi_array by just writing a wrapper around them, which makes our array class more generic and saves us from having to do all the work. boost actually has this sort of feature inbuilt with array_view.)
Now our call site can be like so:
vector<int> result(height);
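// DotProduct is an overloaded template, so it cannot be passed to std::thread by name;
// a lambda picks the overload and captures the operands by reference.
for (int i = 0; i < height; i++)
{
threads[i] = thread([&, i] { DotProduct(array2d.VectoriseRow(i), array1d, result[i]); });
}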
This might seem like a more verbose way of solving the original problem (because it is), but if you use multi-dimensional arrays in your coding you'll find you no longer have to write multi-array-specific functions, or handle ugly indices for sub-problems (even in 1D problems, like Mean of Means). When dealing with those sorts of problems, you'll invariably want to reuse something like the above code.

You can store the dot product of each row with the Nx1 vector in the corresponding element of an Mx1 result vector, so each thread computes the sum for its own row.
By the way, you would be much better off using OpenMP for such a problem. It automates most of the thread management according to the number of cores on your machine, whereas here you might spawn a lot of threads:
https://www.openmp.org/
http://www.bowdoin.edu/~ltoma/teaching/cs3225-GIS/fall17/Lectures/openmp.html
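If OpenMP is not an option, the "too many threads" concern can also be handled by hand. The sketch below is not OpenMP; it is a plain std::thread variant that caps the number of workers at std::thread::hardware_concurrency() and gives each worker a contiguous block of rows. The flat row-major storage and the function name MultiplyChunked are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Multiplies a row-major height x width matrix (flat storage) by a vector,
// splitting the rows into contiguous blocks, one block per worker thread.
std::vector<double> MultiplyChunked(const std::vector<double>& matrix,
                                    const std::vector<double>& vec,
                                    std::size_t height, std::size_t width)
{
    assert(matrix.size() == height * width && vec.size() == width);
    std::vector<double> result(height, 0.0);

    // hardware_concurrency() may return 0 when the value is unknown.
    std::size_t workers =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    workers = std::min(workers, std::max<std::size_t>(1, height));
    const std::size_t chunk = (height + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> threads;
    for (std::size_t begin = 0; begin < height; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, height);
        threads.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                double sum = 0.0;
                for (std::size_t j = 0; j < width; ++j)
                    sum += matrix[i * width + j] * vec[j];
                result[i] = sum;  // rows are disjoint, so no locking is needed
            }
        });
    }
    for (auto& t : threads)
        t.join();
    return result;
}
```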

Related

How can we speedup matrix multiplication where matrices are initialized using vectors of vectors (2D vector) in C++

I have written a function for matrix multiplication where matrices are defined using vectors of vectors containing double values.
vector<vector<double> > mat_mul(vector<vector<double> >A, vector<vector<double> >B){
vector<vector<double> >result(A.size(),vector<double>(B[0].size(),0));
if(A[0].size()==B.size()){
const int N=A[0].size();
for(int i=0;i<A.size();i++){
for(int j=0;j<B[0].size();j++){
for(int k=0;k<A[0].size();k++)
result[i][j]+=A[i][k]*B[k][j];
}
}
}
return result;
}
The code seems to work fine but is very slow for matrices as large as 400x400.
How can we speed this up? I have seen other answers, but they discuss matrix multiplication in general and not how to speed it up for a vector of vectors.
Any help is highly appreciated.

struct matrix:
vector<double>
{
using base = vector<double>;
using size_type=std::pair<std::size_t,std::size_t>;
void resize(size_type const& dims, double const v=0){
rows = dims.second;
base::resize(dims.first * rows, v);
};
size_type size() const { return { cols(), rows }; };
double& operator[](size_type const& idx) {
base& vec=*this;
return vec[idx.first + idx.second * cols()];
};
double operator[](size_type const& idx) const {
base const& vec=*this;
return vec[idx.first + idx.second * cols()];
};
private:
std::size_t cols() const { return base::size() / rows; };
std::size_t rows = 0;
};
///...
auto const size = std::tuple_cat(A.size(), std::make_tuple(B.size().second));
matrix result;
result.resize({get<0>(size), get<2>(size)});
// std::size_t indices: an int would be a narrowing conversion inside {i,k}
for(std::size_t i = 0; i < get<0>(size); ++i)
for(std::size_t j = 0; j < get<1>(size); ++j)
for(std::size_t k = 0; k < get<2>(size); ++k)
result[{i,k}] += A[{i,j}] * B[{j,k}];
I skipped lots of details, such as non-default constructors, which are needed if you want a pretty initialization syntax. Moreover, as a matrix, this type will need lots of arithmetic operators.
Another approach would be a type-erased 2D raw array, but that would require defining an assignment operator as well as copy and move constructors. So this std::vector based solution seems to be the simplest implementation.
Also, if the dimensions are fixed at compile-time, a template alias can do:
template<typename T, std::size_t cols, std::size_t rows>
using array_2d = std::array<std::array<T, rows>, cols>;
array_2d<double, icount, kcount> result{};

You are using the naive algorithm for matrix multiplication. It's extremely cache unfriendly, and you are hit by the full latency.
First, split up the operations so your inner loop repeatedly accesses data that fits into your cache.
Second, calculate four sums simultaneously to avoid penalties for latency.
Third, use FMA instructions (fused multiply-add), which calculate a product and a sum in the same time as a product.
Fourth, use vector registers.
Fifth, use multiple threads.
Or just use a package like LINPACK that does these optimisations for you.
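As a rough illustration of the first point only, here is a sketch that keeps the matrices in flat row-major vectors and reorders the loops to i-k-j, so that the inner loop walks contiguous memory in both B and the result. This is plain loop reordering, not full cache blocking, and the name MatMulFlat and the explicit dimension parameters are assumptions for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for row-major flat storage: A is n x m, B is m x p, C is n x p.
std::vector<double> MatMulFlat(const std::vector<double>& A,
                               const std::vector<double>& B,
                               std::size_t n, std::size_t m, std::size_t p)
{
    assert(A.size() == n * m && B.size() == m * p);
    std::vector<double> C(n * p, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < m; ++k) {
            const double a = A[i * m + k];  // reused across the whole inner loop
            for (std::size_t j = 0; j < p; ++j)
                C[i * p + j] += a * B[k * p + j];  // contiguous in both C and B
        }
    }
    return C;
}
```

With the original i-j-k order, the inner loop reads B[k][j] down a column, touching a different row (and, for a vector of vectors, a different allocation) on every step. Whether the reordering helps in a given case is worth measuring.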

Is there an efficient way to slice a C++ vector given a vector containing the indexes to be sliced

I am working to implement a code which was written in MATLAB into C++.
In MATLAB you can slice an Array with another array, like A(B), which results in a new array of the elements of A at the indexes specified by the values of the element in B.
I would like to do a similar thing in C++ using vectors. These vectors are of size 10000-40000 elements of type double.
I want to be able to slice these vectors using another vector of type int containing the indexes to be sliced.
For example, I have a vector v = <1.0, 3.0, 5.0, 2.0, 8.0> and a vector w = <0, 3, 2>. I want to slice v using w such that the outcome of the slice is a new vector (since the old vector must remain unchanged) x = <1.0, 2.0, 5.0>.
I came up with a function to do this:
template<typename T>
std::vector<T> slice(std::vector<T>& v, std::vector<int>& id) {
std::vector<T> tmp;
tmp.reserve(id.size());
for (auto& i : id) {
tmp.emplace_back(v[i]);
}
return tmp;
}
I was wondering if there was potentially a more efficient way to do such a task. Speed is the key here since this slice function will be in a for-loop which has approximately 300000 iterations. I heard the boost library might contain some valid solutions, but I have not had experience yet with it.
I used the chrono library to measure the time it takes to call this slice function, where the vector to be sliced was length 37520 and the vector containing the indexes was size 1550. For a single call of this function, the time elapsed = 0.0004284s. However, over ~300000 for-loop iterations, the total elapsed time was 134s.
Any advice would be much appreciated!

emplace_back has some overhead as it involves some internal accounting inside std::vector. Try this instead:
template<typename T>
std::vector<T> slice(const std::vector<T>& v, const std::vector<int>& id) {
std::vector<T> tmp;
tmp.resize (id.size ());
size_t n = 0;
for (auto i : id) {
tmp [n++] = v [i];
}
return tmp;
}
Also, I removed an unnecessary dereference in your inner loop.
Edit: I thought about this some more and, inspired by @jack's answer, I think that the inner loop (which is the one that counts) can be optimised further. The idea is to put everything used by the loop in local variables, which gives the compiler the best chance to optimise the code. So try this and see what timings you get. Make sure that you test a Release / optimised build:
template<typename T>
std::vector<T> slice(const std::vector<T>& v, const std::vector<int>& id) {
size_t id_size = id.size ();
std::vector<T> tmp (id_size);
T *tmp_data = tmp.data ();
const int *id_data = id.data ();
const T* v_data = v.data ();
for (size_t i = 0; i < id_size; ++i) {
tmp_data [i] = v_data [id_data [i]];
}
return tmp;
}
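Since the function is called about 300000 times, the allocation of the result vector on every call may also matter. The following is a hypothetical variant (the name slice_into is made up here) that writes into a caller-owned buffer so its capacity can be reused between calls. Whether this is actually faster depends on the allocator and the sizes involved, so treat it as something to measure rather than a guaranteed win.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical variant of slice: writes into a caller-owned buffer so that
// repeated calls can reuse its capacity instead of allocating each time.
template <typename T>
void slice_into(const std::vector<T>& v, const std::vector<int>& id,
                std::vector<T>& out)
{
    out.resize(id.size());  // no reallocation once capacity is large enough
    for (std::size_t n = 0; n < id.size(); ++n) {
        // Debug-only bounds check on each index.
        assert(id[n] >= 0 && static_cast<std::size_t>(id[n]) < v.size());
        out[n] = v[static_cast<std::size_t>(id[n])];
    }
}
```

The caller declares the output vector once, outside the hot loop, and passes it to every call. With the example from the question (v = <1.0, 3.0, 5.0, 2.0, 8.0>, w = <0, 3, 2>) the buffer ends up holding <1.0, 2.0, 5.0>.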

The performance seems a bit slow; are you building with compiler optimizations (e.g. g++ main.cpp -O3 or, if using an IDE, switching to release mode)? This alone sped up computation time around 10x.
If you are already using optimizations, switching to a basic indexed for loop (for (int i = 0; i < id.size(); i++)) sped up computation around 2-3x on my machine. Note that auto is resolved at compile time and has no run-time cost of its own, so any difference comes from how the compiler optimizes the loop; measure on your own setup.
template<typename T>
std::vector<T> slice(const std::vector<T>& v, const std::vector<int>& id){
    // @Jan Schultke's suggestion: size the result up front
    std::vector<T> tmp(id.size());
    for (size_t i = 0; i < id.size(); i++) {
        tmp[i] = v[id[i]];  // look up through the index vector, not v[i]
    }
    return tmp;
}

How to initialize double matrix with NaN element?

In my code I have a matrix of double like this:
double * * matrix=new double * [10];
for(int i=0;i<10;i++)
matrix[i]=new double[10];
I want to have NaN value in every cell of this matrix when I initialize it, is it possible to do automatically or the only solution is:
for(int i=0;i<10;i++)
for(int j=0;j<10;j++)
matrix[i][j]=nan("");
Is it possible to arrange that, when the matrix is constructed, it doesn't use the default constructor of double, which inserts 0.0 into every matrix[i][j], but inserts nan("") instead?

double doesn't have a default constructor, i.e. double values are uninitialized by default.
To avoid explicitly implementing the loops, you can use std::vector :
#include <cmath>   // nan
#include <vector>
...
std::vector<std::vector<double>> matrix(10, std::vector<double>(10, nan("")));
or:
#include <cmath>   // nan
#include <vector>
using namespace std;
...
vector<vector<double>> matrix(10, vector<double>(10, nan("")));
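A slightly fuller sketch of the same idea, using std::numeric_limits<double>::quiet_NaN() as an alternative to nan(""). The helper names MakeNaNMatrix and AllNaN are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Builds a rows x cols matrix with every element set to a quiet NaN.
std::vector<std::vector<double>> MakeNaNMatrix(std::size_t rows, std::size_t cols)
{
    const double nan_value = std::numeric_limits<double>::quiet_NaN();
    assert(std::isnan(nan_value));  // sanity check; holds on IEEE 754 platforms
    return std::vector<std::vector<double>>(rows,
                                            std::vector<double>(cols, nan_value));
}

// Returns true if every element is NaN. NaN never compares equal to itself,
// so std::isnan is used rather than operator==.
bool AllNaN(const std::vector<std::vector<double>>& m)
{
    for (const auto& row : m)
        for (double x : row)
            if (!std::isnan(x))
                return false;
    return true;
}
```

Keep in mind that checks such as matrix[i][j] == nan("") are always false, so any later "is this cell still unset?" test needs std::isnan.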

First, strongly avoid using raw pointers in C++ yourself - it's almost always a bad idea. If there's no container class that fits, use std::unique_ptr. So your code becomes:
auto matrix = std::make_unique<std::unique_ptr<double[]>[]>(10);
for(int i=0;i<10;i++) {
    matrix[i] = std::make_unique<double[]>(10);
}
This code is still not what you want. It's usually not a good idea to create your NxN matrix using N calls to new, or N constructions of a vector. Make a single allocation of NxN doubles, and then either wrap it in a class MyMatrix which supports a 2-parameter square-brace operator, i.e.
template <typename T>
class MyMatrix {
    // etc. etc
    // A multi-parameter operator[] requires C++23; before that, use operator() instead.
    T const& operator[](size_type i, size_type j) const { return data_[i*n + j]; }
    T& operator[](size_type i, size_type j) { return data_[i*n + j]; }
};
or (not-recommended) have the pointers point into the single-allocation region:
size_t n = 10;
auto matrix_data = std::make_unique<double []>(n * n);
auto matrix = std::make_unique<double* []>(n);
for(size_t i=0;i<n;i++) {
matrix[i] = matrix_data.get() + i * n;
}
In each of these cases you can later use std::fill to set all matrix values to NaN, outside of any loop.
The last example above can also be transformed into using vectors (which is probably a better idea than just the raw pointers if you're not using your own class):
size_t n = 10;
auto matrix_data = std::vector<double>(n * n);
auto matrix = std::vector<double*>(n);
for(auto& row : matrix) {
    auto row_index = &row - matrix.data(); // position of this row pointer within matrix
    row = &matrix_data[row_index * n];
}
Again, I don't recommend this - it's still a C-like way to enable a my_matrix[i][j] syntax, while using a wrapper class gets you my_matrix[i,j] without needing extra storage, with initialization to NaN or another value (in the constructor), and without following two pointers each time you access it.
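To make the wrapper-class option a bit more concrete, here is a minimal sketch. It assumes a pre-C++23 compiler and therefore uses operator()(i, j) instead of a two-parameter operator[]. The member names and the constructor signature are assumptions for the example, not a complete matrix class.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical minimal wrapper: one contiguous allocation, 2-D indexing.
template <typename T>
class MyMatrix {
public:
    // Every element is initialized to init (e.g. NaN) in the constructor.
    MyMatrix(std::size_t rows, std::size_t cols, T init)
        : rows_(rows), cols_(cols), data_(rows * cols, init) {}

    T& operator()(std::size_t i, std::size_t j) {
        assert(i < rows_ && j < cols_);
        return data_[i * cols_ + j];  // row-major layout
    }
    const T& operator()(std::size_t i, std::size_t j) const {
        assert(i < rows_ && j < cols_);
        return data_[i * cols_ + j];
    }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::vector<T> data_;
};
```

Constructing MyMatrix<double> m(10, 10, std::numeric_limits<double>::quiet_NaN()) then gives a 10x10 matrix of NaN with a single allocation and no explicit loop at the call site.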

If you want to use statically sized arrays, you would be better off using std::array. For easier use of multi-dimensional std::array you can use a template alias
template <class T, size_t ROW, size_t COL>
using Matrix = std::array<std::array<T, COL>, ROW>;
You can set the values in the matrix with std::array::fill, applied to each row, e.g.
Matrix<double, 3, 4> m = {};
for (auto& row : m) row.fill(42.0);
You can also create a compile-time constant matrix object initialized with a default value to skip the initialization at runtime with a simple constexpr function.
// Requires C++17 (constexpr non-const std::array::operator[]).
template<typename T, size_t R, size_t C>
constexpr auto makeArray(const T& x) {
    Matrix<T,R,C> m = {};
    for(size_t i=0; i != R; ++i) {
        for(size_t j=0; j != C; ++j) {
            m[i][j] = x; // copy the value; it is used R*C times, so it must not be moved from
        }
    }
    return m;
}
constexpr auto m = makeArray<double, 3,4>(23.42);
I am going to repeat the advice given to prefer C++ constructs over C constructs. They are more type-safe and IMHO almost always more convenient to use, e.g. passing std::array objects as parameters is no different from passing any other objects. If you are coming from a C background and have no further C++ experience, I would recommend reading some tutorial text that does not first introduce C, e.g. A Tour of C++.

Best practice to return a vector for nested loop in C++11

I would like to return a vector of pointers to vectors from a class member function.
The function will be called from a nested loop, requesting the vector (and processing its elements) millions of times, so unnecessary (re)allocations should be avoided.
Bjarne Stroustrup recommends returning collections by value, due to C++11 move semantics. However it seems to me that the second approach (doStuff2) is better in my case, since it supports vector reuse. Any suggestions?
template <typename T>
class A
{
typedef std::vector<T> TVec;
std::vector<TVec> m_items;
public:
size_t getIndex(size_t i, size_t j);
std::vector<TVec*> doStuff(float x, float y)
{
// calculate n, i0, i1, j0, j1 (by x and y)
// ...
std::vector<TVec*> vec;
vec.reserve(n);
for (size_t i = i0; i<i1; i++)
for (size_t j = j0; j<j1; j++)
vec.push_back(&m_items[getIndex(i, j)]);
return vec;
}
void doStuff2(float x, float y, std::vector<TVec*> &vec)
{
// calculate n, i0, i1, j0, j1 (by x and y)
// ...
vec.clear();
vec.reserve(n);
for (size_t i = i0; i<i1; i++)
for (size_t j = j0; j<j1; j++)
vec.push_back(&m_items[getIndex(i, j)]);
}
};

"However it seems to me that the second approach (doStuff2) is better in my case, since it supports vector reuse. Any suggestions?"
The second option (doStuff2) is better than the first, because it avoids reallocating the vector. That said, you should (probably) consider using a visitor pattern:
Your code (if I understood you correctly):
// "function will be called from a nested loop, requesting the vector
// (and processing its elements) millions of times, so unnecessary
// (re)allocations should be avoided."
void yourClientCode()
{
std::vector<TVec*> vec;
for(auto x: ???) for(auto y: ???) // nested loop (a million(?) times)
{
A::doStuff2(x, y, vec);
performClientComputation(vec);
}
}
Alternative code:
// "function will be called from a nested loop, requesting the vector
// (and processing its elements) millions of times, so unnecessary
// (re)allocations should be avoided."
void yourClientCode()
{
for(auto x: ???) for(auto y: ???) // nested loop
{
A::doStuff3(x, y, performClientComputation); // computation function should
// be injected as a visitor
}
}
This way, no vector is returned. Client code doesn't have to "get vector then apply computation", but "apply computation on elements satisfying whatever conditions" (see the Law of Demeter).
Having a vector (or not) becomes an internal implementation detail (as far as client code is concerned) and can be optimized later without altering client code at all.
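A minimal sketch of what such a doStuff3 could look like. To keep the example self-contained it takes the index range directly instead of computing it from x and y, and it assumes a simple row-major getIndex. Both are assumptions, as are the constructor and the at() helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
class A
{
public:
    using TVec = std::vector<T>;

    A(std::size_t rows, std::size_t cols)
        : m_rows(rows), m_cols(cols), m_items(rows * cols) {}

    // Assumed layout for this sketch: cell (i, j) lives at i * cols + j.
    std::size_t getIndex(std::size_t i, std::size_t j) const { return i * m_cols + j; }

    TVec& at(std::size_t i, std::size_t j) { return m_items[getIndex(i, j)]; }

    // Calls visit(cell) for every cell in [i0, i1) x [j0, j1).
    // No vector of pointers is built or returned.
    template <typename Visitor>
    void doStuff3(std::size_t i0, std::size_t i1,
                  std::size_t j0, std::size_t j1, Visitor&& visit)
    {
        assert(i0 <= i1 && i1 <= m_rows && j0 <= j1 && j1 <= m_cols);
        for (std::size_t i = i0; i < i1; i++)
            for (std::size_t j = j0; j < j1; j++)
                visit(m_items[getIndex(i, j)]);
    }

private:
    std::size_t m_rows, m_cols;
    std::vector<TVec> m_items;
};
```

Because the visitor is a template parameter, a lambda passed at the call site can usually be inlined, so this need not be slower than iterating over a returned vector.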

Signature for matrix-vector product function

I am relatively new to C++ and still confused how to pass and return arrays as arguments. I would like to write a simple matrix-vector-product c = A * b function, with a signature like
times(A, b, c, m, n)
where A is a two-dimensional array, b is the input array, c is the result array, and m and n are the dimensions of A. I want to specify array dimensions through m and n, not through A.
The body of the (parallel) function is
int i, j;
double sum;
#pragma omp parallel for default(none) private(i, j, sum) shared(m, n, A, b, c)
for (i = 0; i < m; ++i) {
sum = 0.0;
for (j = 0; j < n; j++) {
sum += A[i][j] * b[j];
}
c[i] = sum;
}
What is the correct signature for a function like this?
Now suppose I want to create the result array c in the function and return it. How can I do this?

So instead of a "you should rather" answer (which I will leave up, because you really should rather!), here is the "what you asked for" answer.
I would use std::vector to hold your array data (because they have O(1) move capabilities) rather than a std::array (which saves you an indirection, but costs more to move around). std::vector is the C++ "improvement" of a malloc'd (and realloc'd) buffer, while std::array is the C++ "improvement" of a char foo[27]; style buffer.
std::vector<double> times(std::vector<double> const& A, std::vector<double> const& b, size_t m, size_t n)
{
    std::vector<double> c;
    Assert(A.size() == m*n);
    c.resize(m); // one result element per row of A
    // .. your code goes in here.
    // Instead of A[x][y], do A[x*n+y] or A[y*m+x] depending on if you want row-major or
    // column-major order in memory.
    return c; // moved (or elided) out of this function in O(1); no std::move needed
}
You'll note I changed the signature slightly, so that it returns the std::vector instead of taking it as a parameter. I did this because I can, and it looks prettier!
If you really must pass c in to the function, pass it in as a std::vector<double>& -- a reference to a std::vector.
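Filled in, the function could look like the sketch below. It assumes row-major storage (A[i*n + j]) and uses the standard assert in place of the Assert placeholder. The loop is written serially; the OpenMP pragma from the question can be placed on the outer loop when compiling with OpenMP enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// c = A * b, where A is an m x n matrix stored row-major in a flat vector.
std::vector<double> times(const std::vector<double>& A,
                          const std::vector<double>& b,
                          std::size_t m, std::size_t n)
{
    assert(A.size() == m * n && b.size() == n);
    std::vector<double> c(m);
    for (std::size_t i = 0; i < m; ++i) {
        double sum = 0.0;  // local to the iteration, so rows are independent
        for (std::size_t j = 0; j < n; ++j)
            sum += A[i * n + j] * b[j];
        c[i] = sum;
    }
    return c;  // returned by value; moved or elided by the compiler
}
```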

This is the answer you should use... So a good way to solve this one involves creating a struct or class to wrap your array (well, buffer of data -- I'd use a std::vector). And instead of a signature like times(A, b, c, m, n), go with this kind of syntax:
Matrix<4,4> M;
ColumnMatrix<4> V;
ColumnMatrix<4> C = M*V;
where the width/height of M are in the <4,4> numbers.
A quick sketch of the Matrix class might be (somewhat incomplete -- no const access, for example)
template<size_t rows, size_t columns>
class Matrix
{
private:
std::vector<double> values;
public:
struct ColumnSlice
{
Matrix<rows,columns>* matrix;
size_t row_number;
double& operator[](size_t column) const
{
size_t index = row_number * columns + column;
Assert(matrix && index < matrix->values.size());
return matrix->values[index];
}
ColumnSlice( Matrix<rows,columns>* matrix_, size_t row_number_ ):
matrix(matrix_), row_number(row_number_)
{}
};
ColumnSlice operator[](size_t row)
{
Assert(row < rows); // note: zero based indexes
return ColumnSlice(this, row);
}
Matrix() {values.resize(rows*columns);}
template<size_t other_columns>
Matrix<rows, other_columns> operator*( Matrix<columns, other_columns> const& other ) const
{
Matrix<rows, other_columns> retval;
// TODO: matrix multiplication code goes here
return retval; // moved (or elided) automatically; std::move here would only inhibit copy elision
}
};
template<size_t rows>
using ColumnMatrix = Matrix< rows, 1 >;
template<size_t columns>
using RowMatrix = Matrix< 1, columns >;
The above uses C++11 features (alias templates) your compiler might not have, and can be done without these features.
The point of all of this? You can have math that both looks like math and does the right thing in C++, while being really darn efficient, and that is the "proper" C++ way to do it.
You can also program in a C-like way using some features of C++ (like std::vector to handle array memory management) if you are more used to it. But that is a different answer to this question. :)
(Note: code above has not been compiled, nor is it a complete Matrix implementation. There are template based Matrix implementations in the wild you can find, however.)

Normal vector-matrix multiplication is as follows:
friend Vector operator*(const Vector &v, const Matrix &m);
But if you want to pass the dimensions separately, it's as follows:
friend Vector mul(const Vector &v, const Matrix &m, int size_x, int size_y);
Since the Vector and Matrix would be 1d and 2d arrays, they would look like this:
struct Vector { float *array; };
struct Matrix { float *matrix; };
