I took inspiration from this link to write a multiplication routine for matrices whose dimensions are multiples of 4: SSE matrix-matrix multiplication
I came up with something similar, but I observed that if the j loop increases by 4, as in the suggested code, only 1 column out of every 4 is filled (which makes sense). If I increase j by 2 instead, only half of the columns are filled.
So logically the solution should be to increase j by 1, but when I make that change I get either a segfault if I use _mm_store_ps, or a "corrupted size vs. prev_size" error if I use _mm_storeu_ps, which makes me believe that the data is simply not aligned.
What should I align, and how, to avoid these errors and fill the resulting matrix?
Here is the code I have so far:
void mat_mult(Matrix A, Matrix B, Matrix C, int n) {
for(int i = 0; i < n; ++i) {
for(int j = 0; j < n; j+=1) {
__m128 vR = _mm_setzero_ps();
for(int k = 0; k < n; k++) {
__m128 vA = _mm_set1_ps(A(i,k));
__m128 vB = _mm_loadu_ps(&B(k,j));
vR = _mm_add_ss(vR,vA*vB);
}
_mm_storeu_ps(&C(i,j), vR);
}
}
}
I corrected your code and also added quite a lot of supplementary code to run tests and print outputs; in particular, I had to implement the Matrix class from scratch. The following code compiles under the C++11 standard.
The main corrections to your function are: handle separately the case when the number of B columns is not a multiple of 4 (this uneven tail should be processed by a separate loop); run the j loop in steps of 4, since a 128-bit SSE register holds 4 32-bit floats; and use _mm_mul_ps(vA, vB) instead of vA * vB (the operator form is a GCC/Clang extension and is not portable).
The main bug in your code is that you use _mm_add_ss() where you should use _mm_add_ps(), because you need to add not a single value but 4 of them, lane by lane. It is only because of _mm_add_ss() that you observed 1 out of 4 columns being filled (the other 3 were zeros).
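To make the difference between the two instructions concrete, here is a minimal sketch. The helper names add_packed and add_scalar are made up for this illustration and are not part of the code in this answer.

```cpp
#include <array>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: adds {1,2,3,4} and {10,20,30,40} with the packed
// instruction, so every lane is summed independently.
inline std::array<float, 4> add_packed() {
    __m128 a = _mm_setr_ps(1.f, 2.f, 3.f, 4.f);
    __m128 b = _mm_setr_ps(10.f, 20.f, 30.f, 40.f);
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), _mm_add_ps(a, b));
    return out;
}

// Hypothetical helper: same inputs with the scalar instruction. Only lane 0
// is summed; lanes 1..3 are copied unchanged from the first operand.
inline std::array<float, 4> add_scalar() {
    __m128 a = _mm_setr_ps(1.f, 2.f, 3.f, 4.f);
    __m128 b = _mm_setr_ps(10.f, 20.f, 30.f, 40.f);
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), _mm_add_ss(a, b));
    return out;
}
```

add_packed() yields {11, 22, 33, 44}, while add_scalar() yields {11, 2, 3, 4}. In your loop the first operand is the accumulator that starts at zero, so its upper three lanes stay zero forever.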
Alternatively, you can fix your code by using _mm_load_ss() instead of _mm_loadu_ps() and _mm_store_ss() instead of _mm_storeu_ps(). With only this fix your code will give the correct result, but it will be slow, no faster than a regular non-SSE solution. To actually gain speed you have to use the ..._ps() instructions everywhere and also handle the non-multiple-of-4 case correctly.
Your program segfaults because you don't handle the case where the number of B columns is not a multiple of 4: you store to memory outside the bounds of matrix C.
You also asked about alignment. I recommend not using the aligned store/load instructions such as _mm_store_ps()/_mm_load_ps(); use _mm_storeu_ps()/_mm_loadu_ps() instead. On modern x86 CPUs the unaligned instructions run at the same speed as the aligned ones when they are given the same aligned pointer, but the aligned instructions may fault if the pointer is not 16-byte aligned. So the unaligned variants are the safer choice: same speed on aligned data, and no faults. On old CPUs the aligned instructions used to be faster, but that is no longer the case. The one reason you may still want the aligned instructions is to fault intentionally, as a check that your program's pointers are always aligned.
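As a small illustration of the point about alignment, the sketch below checks a pointer's alignment and loads four floats with the unaligned instruction from both an aligned and a misaligned address. The helpers is_aligned and sum4_unaligned are names invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: true if p is a multiple of `alignment` bytes
// (alignment must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: sums 4 consecutive floats starting at p.
// _mm_loadu_ps accepts any address, so p need not be 16-byte aligned.
inline float sum4_unaligned(const float* p) {
    __m128 v = _mm_loadu_ps(p);
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, v);  // aligned store is fine here: lanes is alignas(16)
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

With an array declared as alignas(16) float buf[8], both sum4_unaligned(buf) and sum4_unaligned(buf + 1) work, even though buf + 1 is only 4-byte aligned. Passing buf + 1 to _mm_load_ps instead would be the kind of access that can fault.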
I also implemented a separate function with a slow reference matrix multiplication, in order to check the correctness of the fast (SSE) multiplication.
As pointed out in a comment by #АлексейНеудачин, my previous version of the Matrix class allocated unaligned memory for its array. I have now implemented a helper class, AlignmentAllocator, which ensures that Matrix allocates aligned memory; this allocator is used by the std::vector<> that stores the Matrix's underlying data.
The full code with all the corrections, tests and supplementary code is below. See also the console output after the code: I print the two matrices produced by the two multiplication functions so that they can be compared visually. All test cases are generated randomly. Scroll down a bit to see your fixed function mat_mult().
#include <cmath>
#include <iostream>
#include <vector>
#include <random>
#include <stdexcept>
#include <string>
#include <iomanip>
#include <cstdlib>
#include <new>
#include <malloc.h>
#include <immintrin.h>
using FloatT = float;
template <typename T, std::size_t N>
class AlignmentAllocator {
public:
typedef T value_type;
typedef std::size_t size_type;
typedef std::ptrdiff_t difference_type;
typedef T * pointer;
typedef const T * const_pointer;
typedef T & reference;
typedef const T & const_reference;
public:
inline AlignmentAllocator() throw() {}
template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
inline ~AlignmentAllocator() throw() {}
inline pointer address(reference r) { return &r; }
inline const_pointer address(const_reference r) const { return &r; }
inline pointer allocate(size_type n);
inline void deallocate(pointer p, size_type);
inline void construct(pointer p, const value_type & v) { new (p) value_type(v); }
inline void destroy(pointer p) { p->~value_type(); }
inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};
template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
#ifdef _MSC_VER
auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
#else
auto p = (pointer)aligned_alloc(N, n * sizeof(value_type));
#endif
if (!p)
throw std::bad_alloc();
return p;
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
#ifdef _MSC_VER
_aligned_free(p);
#else
std::free(p);
#endif
}
static size_t constexpr MatrixAlign = 64;
template <typename T, size_t Align = MatrixAlign>
using AlignedVector = std::vector<T, AlignmentAllocator<T, Align>>;
class Matrix {
public:
Matrix(size_t rows, size_t cols)
: rows_(rows), cols_(cols) {
cols_aligned_ = (sizeof(FloatT) * cols_ + MatrixAlign - 1)
/ MatrixAlign * MatrixAlign / sizeof(FloatT);
Clear();
if (size_t(m_.data()) % 64 != 0 ||
(cols_aligned_ * sizeof(FloatT)) % 64 != 0)
throw std::runtime_error("Matrix was allocated unaligned!");
}
Matrix & Clear() {
m_.clear();
m_.resize(rows_ * cols_aligned_);
return *this;
}
FloatT & operator() (size_t i, size_t j) {
if (i >= rows_ || j >= cols_)
throw std::runtime_error("Matrix index (" +
std::to_string(i) + ", " + std::to_string(j) + ") out of bounds (" +
std::to_string(rows_) + ", " + std::to_string(cols_) + ")!");
return m_[i * cols_aligned_ + j];
}
FloatT const & operator() (size_t i, size_t j) const {
return const_cast<Matrix &>(*this)(i, j);
}
size_t Rows() const { return rows_; }
size_t Cols() const { return cols_; }
bool Equal(Matrix const & b, int round = 7) const {
if (Rows() != b.Rows() || Cols() != b.Cols())
return false;
FloatT const eps = std::pow(FloatT(10), -round);
for (size_t i = 0; i < Rows(); ++i)
for (size_t j = 0; j < Cols(); ++j)
if (std::fabs((*this)(i, j) - b(i, j)) > eps)
return false;
return true;
}
private:
size_t rows_ = 0, cols_ = 0, cols_aligned_ = 0;
AlignedVector<FloatT> m_;
};
void mat_print(Matrix const & A, int round = 7, size_t width = 0) {
FloatT const pow10 = std::pow(FloatT(10), round);
for (size_t i = 0; i < A.Rows(); ++i) {
for (size_t j = 0; j < A.Cols(); ++j)
std::cout << std::setprecision(round) << std::fixed << std::setw(width)
<< std::right << (std::round(A(i, j) * pow10) / pow10) << " ";
std::cout << std::endl;
}
}
void mat_mult(Matrix const & A, Matrix const & B, Matrix & C) {
if (A.Cols() != B.Rows())
throw std::runtime_error("Number of A.Cols and B.Rows don't match!");
if (A.Rows() != C.Rows() || B.Cols() != C.Cols())
throw std::runtime_error("Wrong C rows, cols!");
for (size_t i = 0; i < A.Rows(); ++i)
for (size_t j = 0; j < B.Cols() - B.Cols() % 4; j += 4) {
auto sum = _mm_setzero_ps();
for (size_t k = 0; k < A.Cols(); ++k)
sum = _mm_add_ps(
sum,
_mm_mul_ps(
_mm_set1_ps(A(i, k)),
_mm_loadu_ps(&B(k, j))
)
);
_mm_storeu_ps(&C(i, j), sum);
}
if (B.Cols() % 4 == 0)
return;
for (size_t i = 0; i < A.Rows(); ++i)
for (size_t j = B.Cols() - B.Cols() % 4; j < B.Cols(); ++j) {
FloatT sum = 0;
for (size_t k = 0; k < A.Cols(); ++k)
sum += A(i, k) * B(k, j);
C(i, j) = sum;
}
}
void mat_mult_slow(Matrix const & A, Matrix const & B, Matrix & C) {
if (A.Cols() != B.Rows())
throw std::runtime_error("Number of A.Cols and B.Rows don't match!");
if (A.Rows() != C.Rows() || B.Cols() != C.Cols())
throw std::runtime_error("Wrong C rows, cols!");
for (size_t i = 0; i < A.Rows(); ++i)
for (size_t j = 0; j < B.Cols(); ++j) {
FloatT sum = 0;
for (size_t k = 0; k < A.Cols(); ++k)
sum += A(i, k) * B(k, j);
C(i, j) = sum;
}
}
void mat_fill_random(Matrix & A) {
std::mt19937_64 rng{std::random_device{}()};
std::uniform_real_distribution<FloatT> distr(-9.99, 9.99);
for (size_t i = 0; i < A.Rows(); ++i)
for (size_t j = 0; j < A.Cols(); ++j)
A(i, j) = distr(rng);
}
int main() {
try {
{
Matrix a(17, 23), b(23, 19), c(17, 19), d(c.Rows(), c.Cols());
mat_fill_random(a);
mat_fill_random(b);
mat_mult_slow(a, b, c);
mat_mult(a, b, d);
if (!c.Equal(d, 5))
throw std::runtime_error("Test failed, c != d.");
}
{
Matrix a(3, 7), b(7, 5), c(3, 5), d(c.Rows(), c.Cols());
mat_fill_random(a);
mat_fill_random(b);
mat_mult_slow(a, b, c);
mat_mult(a, b, d);
mat_print(c, 3, 8);
std::cout << std::endl;
mat_print(d, 3, 8);
}
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
-37.177 -114.438 36.094 -49.689 -139.857
22.113 -127.210 -94.434 -14.363 -6.336
71.878 94.234 33.372 32.573 73.310
-37.177 -114.438 36.094 -49.689 -139.857
22.113 -127.210 -94.434 -14.363 -6.336
71.878 94.234 33.372 32.573 73.310
I have a fixed-size 2D matrix of size W x H, where each element of the matrix is a std::vector. The data is stored in a vector of vectors with a linearized index. I'm trying to find a way to fill the output vector concurrently. Here is some code to show what I'm trying to do.
#include <cmath>
#include <chrono>
#include <iostream>
#include <mutex>
#include <vector>
#include <omp.h>
struct Vector2d
{
double x;
double y;
};
double generate(double range_min, double range_max)
{
double val = (double)rand() / RAND_MAX;
return range_min + val * (range_max - range_min);
}
int main(int argc, char** argv)
{
(void)argc;
(void)argv;
// generate input data
std::vector<Vector2d> points;
size_t num = 10000000;
size_t w = 100;
size_t h = 100;
for (size_t i = 0; i < num; ++i)
{
Vector2d point;
point.x = generate(0, w);
point.y = generate(0, h);
points.push_back(point);
}
// output
std::vector<std::vector<Vector2d> > output(num, std::vector<Vector2d>());
std::mutex mutex;
auto start = std::chrono::system_clock::now();
#pragma omp parallel for
for (size_t i = 0; i < num; ++i)
{
const Vector2d point = points[i];
size_t x = std::floor(point.x);
size_t y = std::floor(point.y);
size_t id = y * w + x;
mutex.lock();
output[id].push_back(point);
mutex.unlock();
}
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
return 0;
}
The problem is that the code is much slower with OpenMP enabled. I found some examples that fill a std::vector using a reduction, but I don't know how to adapt them to a vector of vectors. Any help is appreciated, thanks!
There are some things you could do to improve the performance:
I would preallocate the inner vectors holding the Vector2d objects, because every time you push_back a new Vector2d and the capacity of the std::vector is exceeded, it reallocates. So if you do not mind having already-initialized Vector2ds in your std::vector, I would simply use:
std::vector<std::vector<Vector2d> > output(num,
std::vector<Vector2d>(num, Vector2d(/*whatever goes in here*/)));
Then in your for loop, you could access the elements of the inner vector via operator[], which allows you to get rid of the lock, because each iteration i writes to its own slot:
#pragma omp parallel for
for (size_t i = 0; i < num; ++i)
{
const Vector2d point = points[i];
size_t x = std::floor(point.x);
size_t y = std::floor(point.y);
size_t id = y * w + x;
output[id][i] = point;
}
Though I'm not sure the approach above fits what you want to do (note that it allocates num elements for every inner vector, which is only feasible for small num). Otherwise you could reserve the storage for each std::vector<Vector2d>, which would leave you with your initial loop:
std::vector<std::vector<Vector2d> > output(num, std::vector<Vector2d>());
for (size_t i = 0; i < num; ++i) {
output[i].reserve(num);
}
#pragma omp parallel for
for (size_t i = 0; i < num; ++i)
{
const Vector2d point = points[i];
size_t x = std::floor(point.x);
size_t y = std::floor(point.y);
size_t id = y * w + x;
mutex.lock();
output[id].push_back(point);
mutex.unlock();
}
This gets rid of the vector reallocation, but you still have the mutex...
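One way to avoid locking on every push_back, sketched here under assumptions, is to let each thread fill its own private buckets and merge them into the shared output once per thread. The function name bin_points is invented for this example, Vector2d is redefined so that the block is self-contained, and the code assumes every point satisfies 0 <= x < w and 0 <= y < h. No OpenMP runtime functions are called, so if the code is compiled without OpenMP the pragmas are ignored and it simply runs serially.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vector2d { double x; double y; };

// Hypothetical helper: bins points into a w x h grid of cells.
// Assumes 0 <= x < w and 0 <= y < h for every point.
inline std::vector<std::vector<Vector2d>> bin_points(
        const std::vector<Vector2d>& points, std::size_t w, std::size_t h) {
    std::vector<std::vector<Vector2d>> output(w * h);
    #pragma omp parallel
    {
        // Private to each thread, so push_back needs no lock.
        std::vector<std::vector<Vector2d>> local(w * h);
        #pragma omp for nowait
        for (long long i = 0; i < static_cast<long long>(points.size()); ++i) {
            const Vector2d& p = points[static_cast<std::size_t>(i)];
            std::size_t x = static_cast<std::size_t>(std::floor(p.x));
            std::size_t y = static_cast<std::size_t>(std::floor(p.y));
            local[y * w + x].push_back(p);
        }
        // One merge per thread instead of one lock per point.
        #pragma omp critical
        {
            for (std::size_t id = 0; id < output.size(); ++id)
                output[id].insert(output[id].end(),
                                  local[id].begin(), local[id].end());
        }
    }
    return output;
}
```

The order of points inside a cell depends on which thread merges first, so this only fits if that order does not matter. Whether it is actually faster than the serial loop depends on the grid size and the thread count, and should be measured.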
I overloaded the arithmetic/assignment operators on std::vector in order to be able to do some basic linear algebra operations. However, I'm having some performance trouble when chaining those operations.
Here's the content of my main.h:
#include <vector>
#include <stdlib.h>
using namespace std;
typedef vector<float> vec;
inline vec& operator+=(vec& lhs, const vec& rhs) {
for (size_t i = 0; i < lhs.size(); ++i) {
lhs[i] += rhs[i];
}
return lhs;
}
inline vec operator*(float lhs, vec rhs) {
for (size_t i = 0; i < rhs.size(); ++i) {
rhs[i] *= lhs;
}
return rhs;
}
Content of main1.cpp:
#include "main.h"
// gcc 4.9.2 (-O3): 0m5.965s
int main(int, char**) {
float x = rand();
vec v1(1000);
vec v2(1000);
for (size_t i = 0; i < v1.size(); ++i) {
v1[i] = rand();
v2[i] = rand();
}
for (int i = 0; i < 10000000; ++i) {
v1 += x * v2;
// same as:
//vec y = x * v2;
//v1 += y;
}
return 0;
}
Content of main2.cpp:
#include "main.h"
// gcc 4.9.2 (-O3): 0m2.400s
int main(int, char**) {
// same stuff
for (int i = 0; i < 10000000; ++i) {
for (size_t j = 0; j < v1.size(); ++j) {
v1[j] += x * v2[j];
}
}
return 0;
}
The second program runs much faster than the first. I do understand why this is the case: instead of just one loop, the first program does two loops, and it allocates a temporary vector.
But this is the kind of thing I'd expect the compiler to see and optimize. Or am I doing something wrong?
I don't recall having this problem with linear algebra libraries (e.g. Armadillo). How do they tackle this problem? Does this involve some complicated template programming, or is there some simple way to help the compiler optimize this?
There were some super ugly template meta-programming solutions to that problem, known as expression templates. But then the standards committee introduced rvalue references and move semantics in C++11. Look those up and you will find many examples of solutions without the absurd levels of meta-programming.
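For the specific expression in the question, v1 += x * v2, the temporary and the second loop can also be avoided with one small proxy type that defers the multiplication. This is only a sketch of the lazy-evaluation idea, not how any particular library implements it; the struct name Scaled is invented here, and the operator* below replaces the one from main.h.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<float> vec;

// Hypothetical proxy: records "a * v" without computing anything yet.
// It holds a reference, so it must not outlive the vector it refers to.
struct Scaled {
    float a;
    const vec& v;
};

// Building the proxy is O(1): no loop and no allocation.
inline Scaled operator*(float a, const vec& v) { return Scaled{a, v}; }

// The single fused loop runs when the proxy is consumed.
inline vec& operator+=(vec& lhs, const Scaled& rhs) {
    for (std::size_t i = 0; i < lhs.size(); ++i)
        lhs[i] += rhs.a * rhs.v[i];
    return lhs;
}
```

With these two overloads, v1 += x * v2 compiles to the same single loop as the hand-written version in main2.cpp. The price is that x * v2 is no longer a vec, so every other use of the product needs its own overload or an explicit conversion.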
I am trying to rewrite some of my code to use threading, since I am trying to learn about it in C++. I have this method:
template <typename T>
Polynomial<T> Polynomial<T>::operator *(const Polynomial<T>& other) {
std::vector<std::future<std::vector<T>>> tempResults;
for (auto i = 0; i < Coefficients.size(); i++) {
tempResults.push_back(std::async(std::launch::async, [&other, this](int i) {
std::vector<T> result(i + other.GetCoefficientsSize() + 2);
std::fill(result.begin(), result.end(), 0);
for (auto j = 0; j < other.GetCoefficientsSize(); j++) {
result.at(j + i + 1) = other.GetCoefficient(j) * Coefficients.at(i);
}
return result;
}, Coefficients.at(i)));
}
std::mutex multiplyMutex;
std::vector<T> total(Coefficients.size() + other.Coefficients.size());
for (auto i = 0; i < tempResults.size(); i++) {
std::vector<T> result = tempResults.at(i).get();
std::lock_guard<std::mutex> lock(multiplyMutex);
for (T value : result) {
total.at(i) += value;
}
}
Polynomial<T> result(total);
return result;
}
After much trial and error, I got the code to compile. At runtime, however, I get an index-out-of-bounds exception at this line:
std::vector<T> result = tempResults.at(i).get();
Furthermore, step-by-step debugging shows me that some kind of exception handling in the file "future" is run at this line as well:
result.at(j + i + 1) = other.GetCoefficient(j) * Coefficients.at(i);
Can you identify what is happening?
I finally solved the problem myself. It turned out I had made a stupid mistake: I passed Coefficients.at(i), which is a coefficient of the polynomial, as the argument of the async task instead of the index i. Inside the task that value was then used as an index, which went out of bounds; the resulting exception was stored in the future and rethrown when get() was called on it. What I really needed was to pass i.
Once I had it working, I also found that the last for loop was faulty. The full working result can be seen below.
template <typename T>
Polynomial<T> Polynomial<T>::operator *(const Polynomial<T>& other) {
std::vector<std::future<std::vector<T>>> tempResults; //Results from async will be stored here
for (int i = 0; i < Coefficients.size(); i++) {
tempResults.push_back(std::async(std::launch::async, [&other, this](int i) { //Start threads for each Coefficient
std::vector<T> result(Coefficients.size() + other.GetCoefficientsSize()); //Size of result is the combined degree
std::fill(result.begin(), result.end(), 0);
for (auto j = 0; j < other.GetCoefficientsSize(); j++) {
result.at(i + j + 1) += Coefficients.at(i) * other.GetCoefficient(j);
}
return result;
}, i)); //save in tempResults at i
}
std::mutex multiplyMutex;
std::vector<T> total(Coefficients.size() + other.Coefficients.size());
for (auto i = 0; i < tempResults.size(); i++) { //Combine tempResults in total
std::vector<T> result = tempResults.at(i).get(); //Get result of async task
std::lock_guard<std::mutex> lock(multiplyMutex); //Disallow concurrent access to the final variable
for (auto j = 0; j < result.size(); j++) {
total.at(j) += result.at(j);
}
}
Polynomial<T> result(total);
return result;
}
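When debugging a threaded version like this, a plain sequential multiplication is handy as a reference to compare against. The sketch below works on bare coefficient vectors and assumes that index k holds the coefficient of x^k; the Polynomial class above may use a different layout (note its i + j + 1 offset), so the function name multiply_reference and the layout are assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sequential reference: multiplies two polynomials given as
// coefficient vectors, assuming index k holds the coefficient of x^k.
template <typename T>
std::vector<T> multiply_reference(const std::vector<T>& a,
                                  const std::vector<T>& b) {
    if (a.empty() || b.empty())
        return std::vector<T>();
    // The product of degrees (n-1) and (m-1) has n + m - 1 coefficients.
    std::vector<T> out(a.size() + b.size() - 1, T(0));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            out[i + j] += a[i] * b[j];
    return out;
}
```

For example, multiplying {1, 2} by {3, 4}, that is (1 + 2x)(3 + 4x), gives {3, 10, 8}.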
I've been trying to implement Dijkstra's algorithm in C++11 to work on matrices of arbitrary size. Specifically, I am interested in solving question 83 on Project Euler.
I always seem to run into a situation where every node neighboring the current node has already been visited, which, if I understand the algorithm correctly, should never happen.
I've tried poking around in a debugger, and I've re-read the code several times, but I have no idea where I am going wrong.
Here is what I have done so far:
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <vector>
#include <set>
#include <tuple>
#include <cstdint>
#include <cinttypes>
typedef std::tuple<size_t, size_t> Index;
std::ostream& operator<<(std::ostream& os, Index i)
{
os << "(" << std::get<0>(i) << ", " << std::get<1>(i) << ")";
return os;
}
template<typename T>
class Matrix
{
public:
Matrix(size_t i, size_t j):
n(i),
m(j),
xs(i * j)
{}
Matrix(size_t n, size_t m, const std::string& path):
n(n),
m(m),
xs(n * m)
{
std::ifstream mat_in {path};
char c;
for (size_t i = 0; i < n; ++i) {
for (size_t j = 0; j < m - 1; ++j) {
mat_in >> (*this)(i,j);
mat_in >> c;
}
mat_in >> (*this)(i,m - 1);
}
}
T& operator()(size_t i, size_t j)
{
return xs[n * i + j];
}
T& operator()(Index i)
{
return xs[n * std::get<0>(i) + std::get<1>(i)];
}
T operator()(Index i) const
{
return xs[n * std::get<0>(i) + std::get<1>(i)];
}
std::vector<Index> surrounding(Index ind) const
{
size_t i = std::get<0>(ind);
size_t j = std::get<1>(ind);
std::vector<Index> is;
if (i > 0)
is.push_back(Index(i - 1, j));
if (i < n - 1)
is.push_back(Index(i + 1, j));
if (j > 0)
is.push_back(Index(i, j - 1));
if (j < m - 1)
is.push_back(Index(i, j + 1));
return is;
}
size_t rows() const { return n; }
size_t cols() const { return m; }
private:
size_t n;
size_t m;
std::vector<T> xs;
};
/* Finds the minimum sum of the weights of the nodes along a path from 1,1 to n,m using Dijkstra's algorithm modified for matrices */
int64_t shortest_path(const Matrix<int>& m)
{
Index origin(0,0);
Index current { m.rows() - 1, m.cols() - 1 };
Matrix<int64_t> nodes(m.rows(), m.cols());
std::set<Index> in_path;
for (size_t i = 0; i < m.rows(); ++i)
for (size_t j = 0; j < m.cols(); ++j)
nodes(i,j) = INTMAX_MAX;
nodes(current) = m(current);
while (1) {
auto is = m.surrounding(current);
Index next = origin;
for (auto i : is) {
if (in_path.find(i) == in_path.end()) {
nodes(i) = std::min(nodes(i), nodes(current) + m(i));
if (nodes(i) < nodes(next))
next = i;
}
}
in_path.insert(current);
current = next;
if (current == origin)
return nodes(current);
}
}
int64_t at(const Matrix<int64_t>& m, const Index& i) { return m(i); }
int at(const Matrix<int>& m, const Index& i) { return m(i); }
int main()
{
Matrix<int> m(80,80,"mat.txt");
printf("%" PRIi64 "\n", shortest_path(m));
return 0;
}
You do not understand the algorithm correctly. There is nothing stopping you from running into dead ends. As long as there are other options you have not yet explored, just mark it as a dead end and move on.
BTW I agree with the commenters who say that you are overcomplicating the solution. It suffices to create a matrix of "cost to get to here" and keep a queue of points to explore paths from. Initialize the total cost matrix to a value meaning NOT_VISITED; -1 would work. For each point taken from the queue, look at its neighbors. If a neighbor either has not been visited, or you have just found a cheaper path to it, update the cost matrix and add that neighbor to the queue.
Keep going until the queue is empty. And then you have guaranteed lowest costs everywhere.
A* is a lot more efficient than this naive approach, but what I just described is more than efficient enough to solve the problem.
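The queue-based approach described above could look roughly like the sketch below. The function name min_path_sum and the use of a vector of vectors instead of the Matrix class from the question are choices made for this illustration; the code assumes a non-empty rectangular grid with non-negative weights.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper: minimum path sum from the top-left to the bottom-right
// cell, moving up, down, left or right. Assumes a non-empty rectangular grid
// of non-negative weights.
inline std::int64_t min_path_sum(const std::vector<std::vector<int>>& grid) {
    const std::size_t rows = grid.size(), cols = grid[0].size();
    const std::int64_t NOT_VISITED = -1;
    std::vector<std::vector<std::int64_t>> cost(
        rows, std::vector<std::int64_t>(cols, NOT_VISITED));
    std::queue<std::pair<std::size_t, std::size_t>> todo;
    cost[0][0] = grid[0][0];
    todo.push({0, 0});
    const int di[4] = {-1, 1, 0, 0};
    const int dj[4] = {0, 0, -1, 1};
    while (!todo.empty()) {
        const std::size_t i = todo.front().first, j = todo.front().second;
        todo.pop();
        for (int d = 0; d < 4; ++d) {
            // Skip neighbors that fall outside the grid.
            if ((di[d] < 0 && i == 0) || (dj[d] < 0 && j == 0)) continue;
            const std::size_t ni = i + di[d], nj = j + dj[d];
            if (ni >= rows || nj >= cols) continue;
            const std::int64_t candidate = cost[i][j] + grid[ni][nj];
            // Unvisited, or a cheaper path was just found: update and requeue.
            if (cost[ni][nj] == NOT_VISITED || candidate < cost[ni][nj]) {
                cost[ni][nj] = candidate;
                todo.push({ni, nj});
            }
        }
    }
    return cost[rows - 1][cols - 1];
}
```

On the 5 x 5 example matrix from the Project Euler problem statement this returns 2297. A cell can be queued several times, which is the inefficiency that Dijkstra's priority queue or A* avoids.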