I took inspiration from this link to write a multiplication routine for matrices whose dimensions are multiples of 4: SSE matrix-matrix multiplication
I came up with something similar, but I observed that if the j loop advances by 4, as in the suggested code, only 1 column in every 4 is filled (which makes sense). If I advance the loop by 2 instead, only half of the columns are filled.
So logically, the solution should be to advance the loop by 1, but when I make that change I get either a segfault if I use _mm_store_ps, or a "corrupted size vs. prev_size" error if I use _mm_storeu_ps, which makes me believe that the data is simply not aligned.
What should I align, and how, to avoid these errors and fill the resulting matrix?
Here is the code I have so far:
void mat_mult(Matrix A, Matrix B, Matrix C, int n) {
    for(int i = 0; i < n; ++i) {
        for(int j = 0; j < n; j+=1) {
            __m128 vR = _mm_setzero_ps();
            for(int k = 0; k < n; k++) {
                __m128 vA = _mm_set1_ps(A(i,k));
                __m128 vB = _mm_loadu_ps(&B(k,j));
                vR = _mm_add_ss(vR,vA*vB);
            }
            _mm_storeu_ps(&C(i,j), vR);
        }
    }
}
I corrected your code and added a fair amount of supporting code to run tests and print outputs, including a Matrix class that I had to implement from scratch. The following code compiles as C++11.
The main corrections to your function are: the case where the number of columns of B is not a multiple of 4 should be handled separately, with the uneven tail processed by its own loop; the j loop should run in steps of 4 (a 128-bit SSE register holds 4 32-bit floats); and you should use _mm_mul_ps(vA, vB) instead of vA * vB.
The main bug in your code is that you use _mm_add_ss() where you need _mm_add_ps(), because you have to add not a single value but 4 of them, lane by lane. _mm_add_ss() adds only the lowest lane, which is why you observed that only 1 out of every 4 columns was filled (the other 3 were zeros).
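To make the difference between the two intrinsics concrete, here is a minimal sketch. The helper names add_all_lanes and add_lowest_only are made up for this illustration and are not part of the code below; it assumes an x86 target with SSE.

```cpp
#include <array>
#include <cassert>
#include <immintrin.h>

// Hypothetical helper: _mm_add_ps sums all four lanes of a and b.
std::array<float, 4> add_all_lanes(const std::array<float, 4>& a,
                                   const std::array<float, 4>& b) {
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(),
                  _mm_add_ps(_mm_loadu_ps(a.data()), _mm_loadu_ps(b.data())));
    return out;
}

// Hypothetical helper: _mm_add_ss sums lane 0 only; lanes 1..3 are copied from a.
std::array<float, 4> add_lowest_only(const std::array<float, 4>& a,
                                     const std::array<float, 4>& b) {
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(),
                  _mm_add_ss(_mm_loadu_ps(a.data()), _mm_loadu_ps(b.data())));
    return out;
}
```

With a zeroed accumulator as a and the products as b, add_lowest_only leaves lanes 1 to 3 at zero, which matches the "1 column out of 4" symptom.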
Alternatively, you can make your code work by using _mm_load_ss() instead of _mm_loadu_ps() and _mm_store_ss() instead of _mm_storeu_ps(). With only this fix your code will give the correct result, but it will be slow, no faster than a regular non-SSE solution. To actually gain speed you have to use the ..._ps() instructions everywhere and correctly handle the case where the size is not a multiple of 4.
Because you don't handle the case where the number of columns of B is not a multiple of 4, your program segfaults: it stores past the bounds of matrix C.
You also asked about alignment. I recommend not using the aligned load/store intrinsics _mm_load_ps()/_mm_store_ps() and always using _mm_loadu_ps()/_mm_storeu_ps(). On modern CPUs the unaligned instructions run at the same speed as the aligned ones when the pointer happens to be aligned, whereas the aligned instructions fault on an unaligned pointer. So the unaligned versions are the safer choice: same speed on aligned data, and no fault. On older CPUs the aligned instructions were faster, but that is no longer the case. The one reason to still use the aligned instructions is to fault intentionally, as a check that your program's pointers are always aligned.
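As a rough illustration of the alignment point, the sketch below checks whether a pointer is 16-byte aligned and loads through _mm_loadu_ps, which accepts either case. The names is_aligned and sum4_unaligned are invented for this example; it assumes an x86 target with SSE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Hypothetical helper: true if p is a multiple of `alignment` bytes
// (16 is what the aligned 128-bit SSE loads and stores require).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: sums 4 consecutive floats starting at p.
// The load is unaligned, so p may have any alignment.
inline float sum4_unaligned(const float* p) {
    alignas(16) float tmp[4];
    // tmp is declared 16-byte aligned, so the aligned store is safe here.
    _mm_store_ps(tmp, _mm_loadu_ps(p));
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```

Calling sum4_unaligned on buf + 1 for a 16-byte aligned float buffer buf is fine, while replacing the load with _mm_load_ps would be expected to fault on that pointer.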
I also implemented a separate, slow reference multiplication function, used to check the correctness of the fast (SSE) multiplication.
As pointed out by #АлексейНеудачин, my previous version of the Matrix class allocated unaligned memory for its array. I have now added a helper class, AlignmentAllocator, which ensures that Matrix allocates aligned memory; this allocator is used by the std::vector<> that stores the Matrix's underlying data.
The full code with all the corrections, tests, and supporting code is below. The console output follows the code: I print the two matrices produced by the two multiplication functions so that they can be compared visually. All test cases are generated randomly. Scroll down a bit to see your fixed function mat_mult().
#include <cmath>
#include <iostream>
#include <vector>
#include <random>
#include <stdexcept>
#include <string>
#include <iomanip>
#include <cstdlib>
#include <malloc.h>
#include <immintrin.h>
using FloatT = float;
template <typename T, std::size_t N>
class AlignmentAllocator {
public:
    typedef T value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    typedef T * pointer;
    typedef const T * const_pointer;
    typedef T & reference;
    typedef const T & const_reference;
    inline AlignmentAllocator() throw() {}
    template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
    inline ~AlignmentAllocator() throw() {}
    inline pointer address(reference r) { return &r; }
    inline const_pointer address(const_reference r) const { return &r; }
    inline pointer allocate(size_type n);
    inline void deallocate(pointer p, size_type);
    inline void construct(pointer p, const value_type & v) { new (p) value_type(v); }
    inline void destroy(pointer p) { p->~value_type(); }
    inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
    template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
    bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
    bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};
template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
    #ifdef _MSC_VER
        auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
    #else
        auto p = (pointer)aligned_alloc(N, n * sizeof(value_type));
    #endif
    if (!p)
        throw std::bad_alloc();
    return p;
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
    // Memory must be released by the function matching the one that allocated it.
    #ifdef _MSC_VER
        _aligned_free(p);
    #else
        std::free(p);
    #endif
}
static size_t constexpr MatrixAlign = 64;
template <typename T, size_t Align = MatrixAlign>
using AlignedVector = std::vector<T, AlignmentAllocator<T, Align>>;
class Matrix {
public:
    Matrix(size_t rows, size_t cols)
        : rows_(rows), cols_(cols) {
        // Round each row up to a whole number of MatrixAlign-byte blocks.
        cols_aligned_ = (sizeof(FloatT) * cols_ + MatrixAlign - 1)
            / MatrixAlign * MatrixAlign / sizeof(FloatT);
        Clear();
        if (size_t(m_.data()) % 64 != 0 ||
            (cols_aligned_ * sizeof(FloatT)) % 64 != 0)
            throw std::runtime_error("Matrix was allocated unaligned!");
    }
    Matrix & Clear() {
        m_.clear();
        m_.resize(rows_ * cols_aligned_);
        return *this;
    }
    FloatT & operator() (size_t i, size_t j) {
        if (i >= rows_ || j >= cols_)
            throw std::runtime_error("Matrix index (" +
                std::to_string(i) + ", " + std::to_string(j) + ") out of bounds (" +
                std::to_string(rows_) + ", " + std::to_string(cols_) + ")!");
        return m_[i * cols_aligned_ + j];
    }
    FloatT const & operator() (size_t i, size_t j) const {
        return const_cast<Matrix &>(*this)(i, j);
    }
    size_t Rows() const { return rows_; }
    size_t Cols() const { return cols_; }
    bool Equal(Matrix const & b, int round = 7) const {
        if (Rows() != b.Rows() || Cols() != b.Cols())
            return false;
        FloatT const eps = std::pow(FloatT(10), -round);
        for (size_t i = 0; i < Rows(); ++i)
            for (size_t j = 0; j < Cols(); ++j)
                if (std::fabs((*this)(i, j) - b(i, j)) > eps)
                    return false;
        return true;
    }
private:
    size_t rows_ = 0, cols_ = 0, cols_aligned_ = 0;
    AlignedVector<FloatT> m_;
};
void mat_print(Matrix const & A, int round = 7, size_t width = 0) {
    FloatT const pow10 = std::pow(FloatT(10), round);
    for (size_t i = 0; i < A.Rows(); ++i) {
        for (size_t j = 0; j < A.Cols(); ++j)
            std::cout << std::setprecision(round) << std::fixed << std::setw(width)
                << std::right << (std::round(A(i, j) * pow10) / pow10) << " ";
        std::cout << std::endl;
    }
}
void mat_mult(Matrix const & A, Matrix const & B, Matrix & C) {
    if (A.Cols() != B.Rows())
        throw std::runtime_error("Number of A.Cols and B.Rows don't match!");
    if (A.Rows() != C.Rows() || B.Cols() != C.Cols())
        throw std::runtime_error("Wrong C rows, cols!");
    // Main part: 4 columns of C at a time.
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = 0; j < B.Cols() - B.Cols() % 4; j += 4) {
            auto sum = _mm_setzero_ps();
            for (size_t k = 0; k < A.Cols(); ++k)
                sum = _mm_add_ps(
                    sum,
                    _mm_mul_ps(
                        _mm_set1_ps(A(i, k)),
                        _mm_loadu_ps(&B(k, j))
                    )
                );
            _mm_storeu_ps(&C(i, j), sum);
        }
    if (B.Cols() % 4 == 0)
        return;
    // Tail: the remaining 1..3 columns, handled one float at a time.
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = B.Cols() - B.Cols() % 4; j < B.Cols(); ++j) {
            FloatT sum = 0;
            for (size_t k = 0; k < A.Cols(); ++k)
                sum += A(i, k) * B(k, j);
            C(i, j) = sum;
        }
}
void mat_mult_slow(Matrix const & A, Matrix const & B, Matrix & C) {
    if (A.Cols() != B.Rows())
        throw std::runtime_error("Number of A.Cols and B.Rows don't match!");
    if (A.Rows() != C.Rows() || B.Cols() != C.Cols())
        throw std::runtime_error("Wrong C rows, cols!");
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = 0; j < B.Cols(); ++j) {
            FloatT sum = 0;
            for (size_t k = 0; k < A.Cols(); ++k)
                sum += A(i, k) * B(k, j);
            C(i, j) = sum;
        }
}
void mat_fill_random(Matrix & A) {
    std::mt19937_64 rng{std::random_device{}()};
    std::uniform_real_distribution<FloatT> distr(-9.99, 9.99);
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = 0; j < A.Cols(); ++j)
            A(i, j) = distr(rng);
}
int main() {
    try {
        {
            Matrix a(17, 23), b(23, 19), c(17, 19), d(c.Rows(), c.Cols());
            mat_fill_random(a);
            mat_fill_random(b);
            mat_mult_slow(a, b, c);
            mat_mult(a, b, d);
            if (!c.Equal(d, 5))
                throw std::runtime_error("Test failed, c != d.");
        }
        {
            Matrix a(3, 7), b(7, 5), c(3, 5), d(c.Rows(), c.Cols());
            mat_fill_random(a);
            mat_fill_random(b);
            mat_mult_slow(a, b, c);
            mat_mult(a, b, d);
            mat_print(c, 3, 8);
            std::cout << std::endl;
            mat_print(d, 3, 8);
        }
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}
-37.177 -114.438 36.094 -49.689 -139.857
22.113 -127.210 -94.434 -14.363 -6.336
71.878 94.234 33.372 32.573 73.310
-37.177 -114.438 36.094 -49.689 -139.857
22.113 -127.210 -94.434 -14.363 -6.336
71.878 94.234 33.372 32.573 73.310
I am trying to add multi-threading to some C++ code. The target is the for loop inside the function below. The goal is to reduce the execution time of the program, which currently takes 3.83 seconds.
I tried adding #pragma omp parallel for reduction(+:sum) on the inner loop (before the j for-loop), but it was not enough: it took 1.98 seconds. The aim is to get the time down to 0.5 seconds.
I did some research on improving the speedup, and some people recommend the strip mining method for vectorization to get better results. However, I do not know how to implement it yet.
Does anyone know how to do it?
The code is:
void filter(const long n, const long m, float *data, const float threshold, std::vector &result_row_ind) {
    for (long i = 0; i < n; i++) {
        float sum = 0.0f;
        for (long j = 0; j < m; j++) {
            sum += data[i*m + j];
        }
        if (sum > threshold)
            result_row_ind.push_back(i);
    }
}
Thank you very much
When possible, you likely want to parallelize the outer loop. The simplest way to go about this in OpenMP is to do this:
#pragma omp parallel for
for (long i = 0; i < n; i++) {
    float sum = 0.0f;
    for (long j = 0; j < m; j++) {
        sum += data[i*m + j];
    }
    if (sum > threshold) {
        #pragma omp critical
        result_row_ind.push_back(i);
    }
}
This works, and is probably a great deal faster than parallelizing the inner loop (launching a parallel region is expensive), but it uses a critical section for locking to prevent races. The race could also be avoided with a user-defined reduction over vectors applied to that loop. If the number of threads is very large and the number of matching results is very small this might be slower, but otherwise it is likely notably faster. The following is not quite right, since the vector's element type wasn't listed in the question and is left as T, but it should be pretty close:
#pragma omp declare \
    reduction(CatVec: std::vector<T>: \
        omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for reduction(CatVec: result_row_ind)
for (long i = 0; i < n; i++) {
    float sum = 0.0f;
    for (long j = 0; j < m; j++) {
        sum += data[i*m + j];
    }
    if (sum > threshold) {
        result_row_ind.push_back(i);
    }
}
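For completeness, here is a self-contained sketch of the reduction approach. It assumes the element type is long (the question did not say), and the function name filter_rows is made up for this example. Built without OpenMP support the pragmas are ignored and the loop simply runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// User-defined reduction: concatenate the per-thread vectors.
// Assumption: the result holds long row indices.
#pragma omp declare reduction(CatVec : std::vector<long> : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

// Hypothetical variant of filter() that returns the matching row indices.
std::vector<long> filter_rows(long n, long m, const float* data, float threshold) {
    assert(n >= 0 && m >= 0);
    std::vector<long> result_row_ind;
    #pragma omp parallel for reduction(CatVec : result_row_ind)
    for (long i = 0; i < n; i++) {
        float sum = 0.0f;
        for (long j = 0; j < m; j++)
            sum += data[i * m + j];
        if (sum > threshold)
            result_row_ind.push_back(i);  // each thread appends to its private copy
    }
    // The order in which private copies are combined is not specified,
    // so sort to get a deterministic result.
    std::sort(result_row_ind.begin(), result_row_ind.end());
    return result_row_ind;
}
```

The final sort is only there to make the output reproducible; drop it if the order of the indices does not matter.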
If you have a C++ compiler with support for execution policies, you could try std::for_each with the execution policy std::execution::par to see if that helps. Example:
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#if __has_include(<execution>)
# include <execution>
#elif __has_include(<experimental/execution_policy>)
# include <experimental/execution_policy>
#endif

// iterator to use with std::for_each
class iterator {
    size_t val;
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = size_t;
    using difference_type = size_t;
    using pointer = size_t*;
    using reference = size_t&;
    iterator(size_t value=0) : val(value) {}
    inline iterator& operator++() { ++val; return *this; }
    inline bool operator!=(const iterator& rhs) const { return val != rhs.val; }
    inline reference operator*() { return val; }
};

std::vector<size_t> filter(const size_t rows, const size_t cols, const float* data, const float threshold) {
    std::vector<size_t> result_row_ind;
    std::vector<float> sums(rows);
    iterator begin(0);
    iterator end(rows);
    std::for_each(std::execution::par, begin, end, [&](const size_t& row) {
        const float* dataend = data + (row+1) * cols;
        float& sum = sums[row];
        for (const float* dataptr = data + row * cols; dataptr < dataend; ++dataptr) {
            sum += *dataptr;
        }
    });
    // pushing moved outside the threaded code to avoid using mutexes
    for (size_t row = 0; row < rows; ++row) {
        if (sums[row] > threshold)
            result_row_ind.push_back(row);
    }
    return result_row_ind;
}

int main() {
    constexpr size_t rows = 1<<15, cols = 1<<18;
    float* data = new float[rows*cols];
    // size_t index: rows*cols is 2^33, which does not fit in an int
    for (size_t i = 0; i < rows*cols; ++i) data[i] = (float)i / (float)100000000.;
    std::vector<size_t> res = filter(rows, cols, data, 10.);
    std::cout << res.size() << "\n";
    delete[] data;
}
I am trying to overload + but I get this error:
#Error1 Error: no [] operator overload for type main.Matrix
Besides, I also get errors when measuring the time.
import std.stdio;
import std.c.process;
import std.date;
class Matrix
{
    Matrix opBinary(string op)(Matrix another)
        if(op == "+")
    {
        if (row != another.row || col != another.col)
        {
            // error
            return (this);
        }
        Matrix temp = new Matrix(row, col);
        for (int i = 0; i < row; i++)
            for (int j = 0; j < col; j++)
                temp[i][j] = this[i][j] + another[i][j];
        return temp;
    }
}
m2[i][j] = this[i][j] + b[i][j];
You must define opIndex to use operations like this. E.g.:
double[] opIndex(size_t i1) { return d[i1]; }
double opIndex(size_t i1, size_t i2) { return d[i1][i2]; }
Or, inside that method, you might just access the double[][] directly:
m2.d[i][j] = this.d[i][j] + b.d[i][j];
std.date.d_time starttime = getCount();
Use StopWatch. E.g.:
StopWatch sw;
sw.start();
// operations...
sw.stop();
writefln("elapsed time = %s", sw.peek().msecs);
I'm trying to write a memory pool class and have to overload operator[], but there's a huge (2x) slowdown:
T(overloaded) = 76.4043 ns
T(not-ovld) = 28.6016 ns
Is this normal, or am I doing something wrong? Thanks for the help :)
Compiler: VC++ 2013
Optimization disabled/full: same result
template<class T>
class pool{
public:
    T *cell;
    size_t size = 0;
    pool(const size_t n ){
        size = n;
        cell = new T[size];
    }
    T& operator [](const size_t i) { return cell[i]; }
    T operator [](const size_t i)const { return cell[i]; }
};
template<class T>
T F( T x){
    return x/2 % 100;
}
#define test_count 10000000
int main()
{
    pool<unsigned int> P(test_count);
    unsigned int sum = 0;
    // test 1
    for (int i = 0; i < test_count; i++)
        P[i] = F(i);
    for (int i = 0; i < test_count; i++)
        sum = sum + P[i];
    cout << sum << endl;
    sum = 0;
    // test2
    for (int i = 0; i < test_count; i++)
        P.cell[i] = F(i);
    for (int i = 0; i < test_count; i++)
        sum = sum + P.cell[i];
    cout << sum << endl;
    int q;
    cin >> q;
    return 0;
}
The problem was the Debug build; in a Release build (with optimization and so on) everything works as it should. Hehe, a stupid mistake, but it taught me something :)
Conclusion: don't measure performance in debug mode ;)
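A rough way to reproduce this kind of comparison is sketched below, using std::chrono::steady_clock. Pool is a hypothetical stand-in for the class above (backed by std::vector for brevity), and time_ns, sum_overloaded, and sum_direct are names invented for this example. The numbers are only meaningful in an optimized build.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the pool class from the question.
template <class T>
class Pool {
public:
    explicit Pool(std::size_t n) : cell(n) {}
    T& operator[](std::size_t i) { return cell[i]; }
    const T& operator[](std::size_t i) const { return cell[i]; }
    std::vector<T> cell;  // public only so both access paths can be compared
};

// Runs f once and returns the elapsed wall-clock time in nanoseconds.
template <class F>
long long time_ns(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Sums the pool through the overloaded operator[].
unsigned sum_overloaded(const Pool<unsigned>& p) {
    unsigned sum = 0;
    for (std::size_t i = 0; i < p.cell.size(); ++i) sum += p[i];
    return sum;
}

// Sums the pool through a raw pointer to the storage.
unsigned sum_direct(const Pool<unsigned>& p) {
    unsigned sum = 0;
    const unsigned* raw = p.cell.data();
    for (std::size_t i = 0; i < p.cell.size(); ++i) sum += raw[i];
    return sum;
}
```

Wrapping each sum in time_ns, for example time_ns([&] { s = sum_overloaded(p); }), and printing or otherwise using s keeps the optimizer from discarding the loop entirely.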
One often reads that there is little performance difference between a dynamically allocated array and std::vector.
Here are two versions of my solution to Project Euler problem 10:
with std::vector:
const __int64 sum_of_primes_below_vectorversion(int max)
{
    auto primes = new_primes_vector(max);
    __int64 sum = 0;
    for (auto p : primes) {
        sum += p;
    }
    return sum;
}
const std::vector<int> new_primes_vector(__int32 max_prime)
{
    std::vector<bool> is_prime(max_prime, true);
    is_prime[0] = is_prime[1] = false;
    for (auto i = 2; i < max_prime; i++) {
        is_prime[i] = true;
    }
    for (auto i = 1; i < max_prime; i++) {
        if (is_prime[i]) {
            auto max_j = max_prime / i;
            for (auto j = i; j < max_j; j++) {
                is_prime[j * i] = false;
            }
        }
    }
    auto primes_count = 0;
    for (auto i = 0; i < max_prime; i++) {
        if (is_prime[i]) {
            primes_count++;
        }
    }
    std::vector<int> primes(primes_count, 0);
    for (auto i = 0; i < max_prime; i++) {
        if (is_prime[i]) {
            primes.push_back(i);
        }
    }
    return primes;
}
Note that I also tested a version that calls the default constructor of std::vector, without precomputing its final size.
Here is the array version:
const __int64 sum_of_primes_below_carrayversion(int max)
{
    auto p_length = (int*)malloc(sizeof(int));
    auto primes = new_primes_array(max, p_length);
    auto last_index = *p_length - 1;
    __int64 sum = 0;
    for (int i = 0; i < last_index; i++) {
        sum += primes[i];
    }
    return sum;
}
const __int32* new_primes_array(__int32 max_prime, int* p_primes_count)
{
    auto is_prime = (bool*)malloc(max_prime * sizeof(bool));
    is_prime[0] = false;
    is_prime[1] = false;
    for (auto i = 2; i < max_prime; i++) {
        is_prime[i] = true;
    }
    for (auto i = 1; i < max_prime; i++) {
        if (is_prime[i]) {
            auto max_j = max_prime / i;
            for (auto j = i; j < max_j; j++) {
                is_prime[j * i] = false;
            }
        }
    }
    auto primes_count = 0;
    for (auto i = 0; i < max_prime; i++) {
        if (is_prime[i]) {
            primes_count++;
        }
    }
    *p_primes_count = primes_count;
    int* primes = (int*)malloc(*p_primes_count * sizeof(__int32));
    int index_primes = 0;
    for (auto i = 0; i < max_prime; i++) {
        if (is_prime[i]) {
            primes[index_primes] = i;
            index_primes++;
        }
    }
    return primes;
}
This was compiled with the MSVS 2013 compiler, with the O2 optimization flag.
I don't really see where a big difference would come from, given move semantics (which allow returning the big vector by value without a copy).
Here are the results, with an input of 2E6:
C array version
avg= 0.0438
std= 0.00928224
vector version
avg= 0.0625
std= 0.0005
vector version (no realloc)
avg= 0.0687
std= 0.00781089
The statistics are on 10 trials.
I think the differences here are significant. Is there something in my code that should be improved?
edit: after correction of my code (and another improvement), here are my new results:
C array version
avg= 0.0344
std= 0.00631189
vector version
avg= 0.0343
std= 0.00611637
vector version (no realloc)
avg= 0.0469
std= 0.00997447
which confirms that std::vector carries no penalty compared to C arrays (and that one should avoid reallocating).
There shouldn't be a performance difference between vector and a dynamic array, since a vector is a dynamic array.
The performance difference in your code comes from the fact that you are actually doing different things between the vector and array version. For instance:
std::vector<int> primes(primes_count, 0);
for (auto i = 0; i < max_prime; i++) {
    if (is_prime[i]) {
        primes.push_back(i);
    }
}
return primes;
This creates a vector of size primes_count, all initialized to 0, and then pushes back a bunch of primes onto it. But it still starts with primes_count 0s! So that's wasted memory from both an initialization perspective and an iteration perspective. What you want to do is:
std::vector<int> primes;
// same push_back loop
return primes;
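A small sketch of the difference, with made-up function names: the sized constructor creates that many zero elements up front, so push_back appends after them, whereas reserve() only sets aside capacity and leaves the vector empty.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sized constructor followed by push_back: the zeros stay in front.
std::vector<int> collect_with_sized_ctor(const std::vector<int>& values) {
    std::vector<int> out(values.size(), 0);
    for (int v : values) out.push_back(v);
    return out;
}

// reserve() allocates capacity only; size stays 0 until push_back.
std::vector<int> collect_with_reserve(const std::vector<int>& values) {
    std::vector<int> out;
    out.reserve(values.size());
    for (int v : values) out.push_back(v);
    return out;
}
```

If the final count is known in advance, as primes_count is here, reserve() keeps the single allocation without the leading zeros.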
Along the same lines, this block:
std::vector<int> is_prime(max_prime, true);
is_prime[0] = is_prime[1] = false;
for (auto i = 2; i < max_prime; i++) {
    is_prime[i] = true;
}
You construct a vector of max_prime ints initialized to true... and then assign most of them to true again. You're doing the initialization twice here, whereas in the array implementation you only do it once. You should just remove this for loop.
I bet if you fix these two issues - which would make the two algorithms comparable - you'd get the same performance.
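Putting both fixes together, a vector version might look like the sketch below. This is not the asker's corrected code, which was not posted: the function names are made up, std::vector<char> is used for the flags as my own choice, and the sieve starts crossing out at i * i, which is the usual formulation rather than the loop bounds from the question.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns all primes strictly below max_prime (sieve of Eratosthenes).
std::vector<int> primes_below(int max_prime) {
    std::vector<int> primes;  // starts empty: no leading zeros
    if (max_prime < 2) return primes;
    // Single initialization: every entry starts as "prime", then 0 and 1 are cleared.
    std::vector<char> is_prime(max_prime, 1);
    is_prime[0] = is_prime[1] = 0;
    for (std::int64_t i = 2; i * i < max_prime; ++i) {
        if (!is_prime[i]) continue;
        for (std::int64_t j = i * i; j < max_prime; j += i) is_prime[j] = 0;
    }
    for (int i = 2; i < max_prime; ++i)
        if (is_prime[i]) primes.push_back(i);
    return primes;
}

// Sum of all primes strictly below max_prime, accumulated in 64 bits.
std::int64_t sum_of_primes_below(int max_prime) {
    std::int64_t sum = 0;
    for (int p : primes_below(max_prime)) sum += p;
    return sum;
}
```

For example, sum_of_primes_below(10) gives 2 + 3 + 5 + 7 = 17.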