I'm performing matrix multiplication with this simple algorithm. To be more flexible I used objects for the matricies which contain dynamicly created arrays.
Comparing this solution to my first one with static arrays it is 4 times slower. What can I do to speed up the data access? I don't want to change the algorithm.
matrix mult_std(matrix a, matrix b) {
matrix c(a.dim(), false, false);
for (int i = 0; i < a.dim(); i++)
for (int j = 0; j < a.dim(); j++) {
int sum = 0;
for (int k = 0; k < a.dim(); k++)
sum += a(i,k) * b(k,j);
c(i,j) = sum;
}
return c;
}
EDIT
I corrected my Question avove! I added the full source code below and tried some of your advices:
swapped k and j loop iterations -> performance improvement
declared dim() and operator()() as inline -> performance improvement
passing arguments by const reference -> performance loss! why? so I don't use it.
The performance is now nearly the same as it was in the old porgram. Maybe there should be a bit more improvement.
But I have another problem: I get a memory error in the function mult_strassen(...). Why?
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
OLD PROGRAM
main.c http://pastebin.com/qPgDWGpW
c99 main.c -o matrix -O3
NEW PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr
g++ main.cpp matrix.cpp -o matrix -O3.
EDIT
Here are some results. Comparison between standard algorithm (std), swapped order of j and k loop (swap) and blocked algortihm with block size 13 (block).
Speaking of speed-up, your function will be more cache-friendly if you swap the order of the k and j loop iterations:
matrix mult_std(matrix a, matrix b) {
matrix c(a.dim(), false, false);
for (int i = 0; i < a.dim(); i++)
for (int k = 0; k < a.dim(); k++)
for (int j = 0; j < a.dim(); j++) // swapped order
c(i,j) += a(i,k) * b(k,j);
return c;
}
That's because a k index on the inner-most loop will cause a cache miss in b on every iteration. With j as the inner-most index, both c and b are accessed contiguously, while a stays put.
Make sure that the members dim() and operator()() are declared inline, and that compiler optimization is turned on. Then play with options like -funroll-loops (on gcc).
How big is a.dim() anyway? If a row of the matrix doesn't fit in just a couple cache lines, you'd be better off with a block access pattern instead of a full row at-a-time.
You say you don't want to modify the algorithm, but what does that mean exactly?
Does unrolling the loop count as "modifying the algorithm"? What about using SSE/VMX whichever SIMD instructions are available on your CPU? What about employing some form of blocking to improve cache locality?
If you don't want to restructure your code at all, I doubt there's more you can do than the changes you've already made. Everything else becomes a trade-off of minor changes to the algorithm to achieve a performance boost.
Of course, you should still take a look at the asm generated by the compiler. That'll tell you much more about what can be done to speed up the code.
Use SIMD if you can. You absolutely have to use something like VMX registers if you do extensive vector math assuming you are using a platform that is capable of doing so, otherwise you will incur a huge performance hit.
Don't pass complex types like matrix by value - use a const reference.
Don't call a function in each iteration - cache dim() outside your loops.
Although compilers typically optimize this efficiently, it's often a good idea to have the caller provide a matrix reference for your function to fill out rather than returning a matrix by type. In some cases, this may result in an expensive copy operation.
Here is my implementation of the fast simple multiplication algorithm for square float matrices (2D arrays). It should be a little faster than chrisaycock code since it spares some increments.
static void fastMatrixMultiply(const int dim, float* dest, const float* srcA, const float* srcB)
{
memset( dest, 0x0, dim * dim * sizeof(float) );
for( int i = 0; i < dim; i++ ) {
for( int k = 0; k < dim; k++ )
{
const float* a = srcA + i * dim + k;
const float* b = srcB + k * dim;
float* c = dest + i * dim;
float* cMax = c + dim;
while( c < cMax )
{
*c++ += (*a) * (*b++);
}
}
}
}
Pass the parameters by const reference to start with:
matrix mult_std(matrix const& a, matrix const& b) {
To give you more details we need to know the details of the other methods used.
And to answer why the original method is 4 times faster we would need to see the original method.
The problem is undoubtedly yours as this problem has been solved a million times before.
Also when asking this type of question ALWAYS provide compilable source with appropriate inputs so we can actually build and run the code and see what is happening.
Without the code we are just guessing.
Edit
After fixing the main bug in the original C code (a buffer over-run)
I have update the code to run the test side by side in a fair comparison:
// INCLUDES -------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
// DEFINES -------------------------------------------------------------------
// The original problem was here. The MAXDIM was 500. But we were using arrays
// that had a size of 512 in each dimension. This caused a buffer overrun that
// the dim variable and caused it to be reset to 0. The result of this was causing
// the multiplication loop to fall out before it had finished (as the loop was
// controlled by this global variable.
//
// Everything now uses the MAXDIM variable directly.
// This of course gives the C code an advantage as the compiler can optimize the
// loop explicitly for the fixed size arrays and thus unroll loops more efficiently.
#define MAXDIM 512
#define RUNS 10
// MATRIX FUNCTIONS ----------------------------------------------------------
class matrix
{
public:
matrix(int dim)
: dim_(dim)
{
data_ = new int[dim_ * dim_];
}
inline int dim() const {
return dim_;
}
inline int& operator()(unsigned row, unsigned col) {
return data_[dim_*row + col];
}
inline int operator()(unsigned row, unsigned col) const {
return data_[dim_*row + col];
}
private:
int dim_;
int* data_;
};
// ---------------------------------------------------
void random_matrix(int (&matrix)[MAXDIM][MAXDIM]) {
for (int r = 0; r < MAXDIM; r++)
for (int c = 0; c < MAXDIM; c++)
matrix[r][c] = rand() % 100;
}
void random_matrix_class(matrix& matrix) {
for (int r = 0; r < matrix.dim(); r++)
for (int c = 0; c < matrix.dim(); c++)
matrix(r, c) = rand() % 100;
}
template<typename T, typename M>
float run(T f, M const& a, M const& b, M& c)
{
float time = 0;
for (int i = 0; i < RUNS; i++) {
struct timeval start, end;
gettimeofday(&start, NULL);
f(a,b,c);
gettimeofday(&end, NULL);
long s = start.tv_sec * 1000 + start.tv_usec / 1000;
long e = end.tv_sec * 1000 + end.tv_usec / 1000;
time += e - s;
}
return time / RUNS;
}
// SEQ MULTIPLICATION ----------------------------------------------------------
int* mult_seq(int const(&a)[MAXDIM][MAXDIM], int const(&b)[MAXDIM][MAXDIM], int (&z)[MAXDIM][MAXDIM]) {
for (int r = 0; r < MAXDIM; r++) {
for (int c = 0; c < MAXDIM; c++) {
z[r][c] = 0;
for (int i = 0; i < MAXDIM; i++)
z[r][c] += a[r][i] * b[i][c];
}
}
}
void mult_std(matrix const& a, matrix const& b, matrix& z) {
for (int r = 0; r < a.dim(); r++) {
for (int c = 0; c < a.dim(); c++) {
z(r,c) = 0;
for (int i = 0; i < a.dim(); i++)
z(r,c) += a(r,i) * b(i,c);
}
}
}
// MAIN ------------------------------------------------------------------------
using namespace std;
int main(int argc, char* argv[]) {
srand(time(NULL));
int matrix_a[MAXDIM][MAXDIM];
int matrix_b[MAXDIM][MAXDIM];
int matrix_c[MAXDIM][MAXDIM];
random_matrix(matrix_a);
random_matrix(matrix_b);
printf("%d ", MAXDIM);
printf("%f \n", run(mult_seq, matrix_a, matrix_b, matrix_c));
matrix a(MAXDIM);
matrix b(MAXDIM);
matrix c(MAXDIM);
random_matrix_class(a);
random_matrix_class(b);
printf("%d ", MAXDIM);
printf("%f \n", run(mult_std, a, b, c));
return 0;
}
The results now:
$ g++ t1.cpp
$ ./a.exe
512 1270.900000
512 3308.800000
$ g++ -O3 t1.cpp
$ ./a.exe
512 284.900000
512 622.000000
From this we see the C code is about twice as fast as the C++ code when fully optimized. I can not see the reason in the code.
I'm taking a wild guess here, but if you dynamically allocating the matrices makes such a huge difference, maybe the problem is fragmentation. Again, I've no idea how the underlying matrix is implemented.
Why don't you allocate the memory for the matrices by hand, ensuring it's contiguous, and build the pointer structure yourself?
Also, does the dim() method have any extra complexity? I would declare it inline, too.
Related
I have the following piece of C++ code. The scale of the problem is N and M. Running the code takes about two minutes on my machine. (after g++ -O3 compilation). Is there anyway to further accelerate it, on the same machine? Any kind of option, choosing a better data structure, library, GPU or parallelism, etc, is on the table.
void demo() {
int N = 1000000;
int M=3000;
vector<vector<int> > res(M);
for (int i =0; i <N;i++) {
for (int j=1; j < M; j++){
res[j].push_back(i);
}
}
}
int main() {
demo();
return 0;
}
An additional info: The second loop above for (int j=1; j < M; j++) is a simplified version of the real problem. In fact, j could be in a different range for each i (of the outer loop), but the number of iterations is about 3000.
With the exact code as shown when writing this answer, you could create the inner vector once, with the specific size, and call iota to initialize it. Then just pass this vector along to the outer vector constructor to use it for each element.
Then you don't need any explicit loops at all, and instead use the (highly optimized, hopefully) standard library to do all the work for you.
Perhaps something like this:
void demo()
{
static int const N = 1000000;
static int const M = 3000;
std::vector<int> data(N);
std::iota(begin(data), end(data), 0);
std::vector<std::vector<int>> res(M, data);
}
Alternatively you could try to initialize just one vector with that elements, and then create the other vectors just by copying that part of the memory using std::memcpy or std::copy.
Another optimization would be to allocate the memory in advance (e.g. array.reserve(3000)).
Also if you're sure that all the members of the vector are similar vectors, you could do a hack by just creating a single vector with 3000 elements, and in the other res just put the same reference of that 3000-element vector million times.
On my machine which has enough memory to avoid swapping your original code took 86 seconds.
Adding reserve:
for (auto& v : res)
{
v.reserve(N);
}
made basically no difference (85 seconds but I only ran each version once).
Swapping the loop order:
for (int j = 1; j < M; j++) {
for (int i = 0; i < N; i++) {
res[j].push_back(i);
}
}
reduced the time to 10 seconds, this is likely due to a combination of allowing the compiler to use SIMD optimisations and improving cache coherency by accessing memory in sequential order.
Creating one vector and copying it into the others:
for (int i = 0; i < N; i++) {
res[1].push_back(i);
}
for (int j = 2; j < M; j++) {
res[j] = res[1];
}
reduced the time to 4 seconds.
Using a single vector:
void demo() {
size_t N = 1000000;
size_t M = 3000;
vector<int> res(M*N);
size_t offset = N;
for (size_t i = 0; i < N; i++) {
res[offset++] = i;
}
for (size_t j = 2; j < M; j++) {
std::copy(res.begin() + N, res.begin() + N * 2, res.begin() + offset);
offset += N;
}
}
also took 4 seconds, there probably isn't much improvement because you have 3,000 4 MB vectors, there would likely be more difference if N was smaller or M was larger.
I am working on a code that emulates a matrix-vector product. In this code, we are explicitly calculating the vector A x, but we do not have access to the full matrix A, as dim is typically very large such that the full matrix A cannot be stored in memory.
By doing some timings, we have established that the following loop takes about 4 seconds to complete. string is a an instance of a custom class, basically a wrapper around an unsigned long, and matvec is an Eigen::VectorXd.
// Off-diagonal contributions
for (size_t I = 0; I < dim; I++) {
for (size_t p = 0; p < K; p++) {
if (string.isOccupied(p)) { // p in I
for (size_t q = 0; q < p; q++) {
if (!string.isOccupied(q)) {
string.annihilate(p);
string.create(q);
size_t J = string.address();
matvec(I) += value(p,q) * x(J);
matvec(J) += value(p,q) * x(I);
string.annihilate(q);
string.create(p);
}
} // q < p loop
}
} // p loop
string.nextPermutation();
}
In trying to determine the bottleneck of this loop, we tried a variety of things, such as optimizing the annihilate and create calls, but these have little effect. However, when we time the code without the Eigen::VectorXd value updating, i.e.
// Off-diagonal contributions
for (size_t I = 0; I < dim; I++) {
for (size_t p = 0; p < K; p++) {
if (string.isOccupied(p)) { // p in I
for (size_t q = 0; q < p; q++) {
if (!string.isOccupied(q)) {
string.annihilate(p);
string.create(q);
size_t J = string.address();
string.annihilate(q);
string.create(p);
}
} // q < p loop
}
} // p loop
string.nextPermutation();
}
the loop now only takes 0.2 seconds. From this, we have deduced that the bottleneck is actually adding the right values to the resulting vector. see the edit
Is there a faster way to do the matvec(I) += value(p,q) * x(J) calls, using the C++ Eigen library?
EDIT: I have done some better timings, inspired by #PeterT and #chtz's comments, and I find that the += statements take about 30% of all the time spent in the for loop, whereas the other parts all take about an equally proportioned size of the remaining time.
So, the question remains: how can the matvec(I) += value(p,q) * x(J) and matvec(J) += value(p,q) * x(I) be sped up?
I am asking if it is possible to improve considerably integer matrix multiplication with bitwise operations. The matrices are small, and the elements are small nonnegative integers (small means at most 20).
To keep us focused, let's be extremely specific, and say that I have two 3x3 matrices, with integer entries 0<=x<15.
The following naive C++ implementation executed a million times performs around 1s, measured with linux time.
#include <random>
int main() {
//Random number generator
std::random_device rd;
std::mt19937 eng(rd());
std::uniform_int_distribution<> distr(0, 15);
int A[3][3];
int B[3][3];
int C[3][3];
for (int trials = 0; trials <= 1000000; trials++) {
//Set up A[] and B[]
for (int i = 0; i < 3; ++i) {
for (int j = 0; j < 3; ++j) {
A[i][j] = distr(eng);
B[i][j] = distr(eng);
C[i][j] = 0;
}
}
//Compute C[]=A[]*B[]
for (int i = 0; i < 3; ++i) {
for (int j = 0; j < 3; ++j) {
for (int k = 0; k < 3; ++k) {
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
}
}
return 0;
}
Notes:
The matrices are not necessarily sparse.
Strassen-like comments does not help here.
Let's try not to use the circumstantial observation, that in this specific problem the matrices A[] and B[] can be encoded as a single 64 bit integer. Think of what would happen for just a bit larger matrices.
Computation is single-threaded.
Related: Binary matrix multiplication bit twiddling hack and What is the optimal algorithm for the game 2048?
The question you linked is about a matrix where every element is a single bit. For one-bit values a and b, a * b is exactly equivalent to a & b.
For adding 2-bit elements, it might be plausible (and faster than unpacking) to add basically from scratch, with XOR (carryless-add), then generate the carry with AND, shift, and mask off carry across element boundaries.
A 3rd bit would require detecting when adding the carry produces yet another carry. I don't think it would be a win to emulating even a 3 bit adder or multiplier, compared to using SIMD. Without SIMD (i.e. in pure C with uint64_t) it might make sense. For add, you might try using a normal add and then try to undo the carry between element boundaries, instead of building an adder yourself out of XOR/AND/shift operations.
packed vs. unpacked-to-bytes storage formats
If you have very many of these tiny matrices, storing them in memory in compressed form (e.g. packed 4bit elements) can help with cache footprint / memory bandwidth. 4bit elements are fairly easy to unpack to having each element in a separate byte element of a vector.
Otherwise, store them with one matrix element per byte. From there, you can easily unpack them to 16bit or 32bit per element if needed, depending on what element sizes the target SIMD instruction set provides. You might keep some matrices in local variables in unpacked format to reuse across multiplies, but pack them back into 4bits per element for storage in an array.
Compilers suck at this with uint8_t in scalar C code for x86. See comments on #Richard's answer: gcc and clang both like to use mul r8 for uint8_t, which forces them to move data into eax (the implicit input/output for a one-operand multiply), rather than using imul r32, r32 and ignoring the garbage that leaves outside the low 8 bits of the destination register.
The uint8_t version actually runs slower than the uint16_t version, even though it has half the cache footprint.
You're probably going to get best results from some kind of SIMD.
Intel SSSE3 has a vector byte multiply, but only with adding of adjacent elements. Using it would require unpacking your matrix into a vector with some zeros between rows or something, so you don't get data from one row mixed with data from another row. Fortunately, pshufb can zero elements as well as copy them around.
More likely to be useful is SSE2 PMADDWD, if you unpack to each matrix element in a separate 16bit vector element. So given a row in one vector, and a transposed-column in another vector, pmaddwd (_mm_madd_epi16) is one horizontal add away from giving you the dot-product result you need for C[i][j].
Instead of doing each of those adds separately, you can probably pack multiple pmaddwd results into a single vector so you can store C[i][0..2] in one go.
You may find that reducing the data size gives you a considerable performance improvement if you are performing this calculation over a large number of matrices:
#include <cstdint>
#include <cstdlib>
using T = std::uint_fast8_t;
void mpy(T A[3][3], T B[3][3], T C[3][3])
{
for (int i = 0; i < 3; ++i) {
for (int j = 0; j < 3; ++j) {
for (int k = 0; k < 3; ++k) {
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
}
}
The pentium can move and sign-extend an 8-bit value in one instruction. This means you're getting 4 times as many matricies per cache line.
UPDATE: curiosity piqued, I wrote a test:
#include <random>
#include <utility>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <typeinfo>
template<class T>
struct matrix
{
static constexpr std::size_t rows = 3;
static constexpr std::size_t cols = 3;
static constexpr std::size_t size() { return rows * cols; }
template<class Engine, class U>
matrix(Engine& engine, std::uniform_int_distribution<U>& dist)
: matrix(std::make_index_sequence<size()>(), engine, dist)
{}
template<class U>
matrix(std::initializer_list<U> li)
: matrix(std::make_index_sequence<size()>(), li)
{
}
matrix()
: _data { 0 }
{}
const T* operator[](std::size_t i) const {
return std::addressof(_data[i * cols]);
}
T* operator[](std::size_t i) {
return std::addressof(_data[i * cols]);
}
private:
template<std::size_t...Is, class U, class Engine>
matrix(std::index_sequence<Is...>, Engine& eng, std::uniform_int_distribution<U>& dist)
: _data { (void(Is), dist(eng))... }
{}
template<std::size_t...Is, class U>
matrix(std::index_sequence<Is...>, std::initializer_list<U> li)
: _data { ((Is < li.size()) ? *(li.begin() + Is) : 0)... }
{}
T _data[rows * cols];
};
template<class T>
matrix<T> operator*(const matrix<T>& A, const matrix<T>& B)
{
matrix<T> C;
for (int i = 0; i < 3; ++i) {
for (int j = 0; j < 3; ++j) {
for (int k = 0; k < 3; ++k) {
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
}
return C;
}
static constexpr std::size_t test_size = 1000000;
template<class T, class Engine>
void fill(std::vector<matrix<T>>& v, Engine& eng, std::uniform_int_distribution<T>& dist)
{
v.clear();
v.reserve(test_size);
generate_n(std::back_inserter(v), test_size,
[&] { return matrix<T>(eng, dist); });
}
template<class T>
void test(std::random_device& rd)
{
std::mt19937 eng(rd());
std::uniform_int_distribution<T> distr(0, 15);
std::vector<matrix<T>> As, Bs, Cs;
fill(As, eng, distr);
fill(Bs, eng, distr);
fill(Cs, eng, distr);
auto start = std::chrono::high_resolution_clock::now();
auto ia = As.cbegin();
auto ib = Bs.cbegin();
for (auto&m : Cs)
{
m = *ia++ * *ib++;
}
auto stop = std::chrono::high_resolution_clock::now();
auto diff = stop - start;
auto millis = std::chrono::duration_cast<std::chrono::microseconds>(diff).count();
std::cout << "for type " << typeid(T).name() << " time is " << millis << "us" << std::endl;
}
int main() {
//Random number generator
std::random_device rd;
test<std::uint64_t>(rd);
test<std::uint32_t>(rd);
test<std::uint16_t>(rd);
test<std::uint8_t>(rd);
}
example output (recent macbook pro, 64-bit, compiled with -O3)
for type y time is 32787us
for type j time is 15323us
for type t time is 14347us
for type h time is 31550us
summary:
on this platform, int32 and int16 proved to be as fast as each other. int64 and int8 were equally slow (the 8-bit result surprised me).
conclusion:
As ever, express intent to the compiler and let the optimiser do its thing. If the program is running too slowly in production, take measurements and optimise the worst-offenders.
I need to compute a product vector-matrix as efficiently as possible. Specifically, given a vector s and a matrix A, I need to compute s * A. I have a class Vector which wraps a std::vector and a class Matrix which also wraps a std::vector (for efficiency).
The naive approach (the one that I am using at the moment) is to have something like
Vector<T> timesMatrix(Matrix<T>& matrix)
{
Vector<unsigned int> result(matrix.columns());
// constructor that does a resize on the underlying std::vector
for(unsigned int i = 0 ; i < vector.size() ; ++i)
{
for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
{
result[j] += (vector[i] * matrix.getElementAt(i, j));
// getElementAt accesses the appropriate entry
// of the underlying std::vector
}
}
return result;
}
It works fine and takes nearly 12000 microseconds. Note that the vector s has 499 elements, while A is 499 x 15500.
The next step was trying to parallelize the computation: if I have N threads then I can give each thread a part of the vector s and the "corresponding" rows of the matrix A. Each thread will compute a 499-sized Vector and the final result will be their entry-wise sum.
First of all, in the class Matrix I added a method to extract some rows from a Matrix and build a smaller one:
Matrix<T> extractSomeRows(unsigned int start, unsigned int end)
{
unsigned int rowsToExtract = end - start + 1;
std::vector<T> tmp;
tmp.reserve(rowsToExtract * numColumns);
for(unsigned int i = start * numColumns ; i < (end+1) * numColumns ; ++i)
{
tmp.push_back(matrix[i]);
}
return Matrix<T>(rowsToExtract, numColumns, tmp);
}
Then I defined a thread routine
void timesMatrixThreadRoutine
(Matrix<T>& matrix, unsigned int start, unsigned int end, Vector<T>& newRow)
{
// newRow is supposed to contain the partial result
// computed by a thread
newRow.resize(matrix.columns());
for(unsigned int i = start ; i < end + 1 ; ++i)
{
for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
{
newRow[j] += vector[i] * matrix.getElementAt(i - start, j);
}
}
}
And finally I modified the code of the timesMatrix method that I showed above:
Vector<T> timesMatrix(Matrix<T>& matrix)
{
static const unsigned int NUM_THREADS = 4;
unsigned int matRows = matrix.rows();
unsigned int matColumns = matrix.columns();
unsigned int rowsEachThread = vector.size()/NUM_THREADS;
std::thread threads[NUM_THREADS];
Vector<T> tmp[NUM_THREADS];
unsigned int start, end;
// all but the last thread
for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
{
start = i*rowsEachThread;
end = (i+1)*rowsEachThread - 1;
threads[i] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
matrix.extractSomeRows(start, end), start, end, std::ref(tmp[i]));
}
// last thread
start = (NUM_THREADS-1)*rowsEachThread;
end = matRows - 1;
threads[NUM_THREADS - 1] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
matrix.extractSomeRows(start, end), start, end, std::ref(tmp[NUM_THREADS-1]));
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
threads[i].join();
}
Vector<unsigned int> result(matColumns);
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
result = result + tmp[i]; // the operator+ is overloaded
}
return result;
}
It still works but now it takes nearly 30000 microseconds, which is almost three times as much as before.
Am I doing something wrong? Do you think there is a better approach?
EDIT - using a "lightweight" VirtualMatrix
Following Ilya Ovodov's suggestion, I defined a class VirtualMatrix that wraps a T* matrixData, which is initialized in the constructor as
VirtualMatrix(Matrix<T>& m)
{
numRows = m.rows();
numColumns = m.columns();
matrixData = m.pointerToData();
// pointerToData() returns underlyingVector.data();
}
Then there is a method to retrieve a specific entry of the matrix:
inline T getElementAt(unsigned int row, unsigned int column)
{
return *(matrixData + row*numColumns + column);
}
Now the execution time is better (approximately 8000 microseconds) but maybe there are some improvements to be made. In particular the thread routine is now
void timesMatrixThreadRoutine
(VirtualMatrix<T>& matrix, unsigned int startRow, unsigned int endRow, Vector<T>& newRow)
{
unsigned int matColumns = matrix.columns();
newRow.resize(matColumns);
for(unsigned int i = startRow ; i < endRow + 1 ; ++i)
{
for(unsigned int j = 0 ; j < matColumns ; ++j)
{
newRow[j] += (vector[i] * matrix.getElementAt(i, j));
}
}
}
and the really slow part is the one with the nested for loops. If I remove it, the result is obviously wrong but is "computed" in less than 500 microseconds. This to say that now passing the arguments takes almost no time and the heavy part is really the computation.
According to you, is there any way to make it even faster?
Actually you make a partial copy of matrix for each thread in extractSomeRows. It takes a lot of time.
Redesign it so that "some rows" become virtual matrix pointing at data located in original matrix.
Use vectorized assembly instructions for an architecture by making it more explicit that you want to multiply in 4's, i.e. for the x86-64 SSE2+ and possibly ARM'S NEON.
C++ compilers can often unroll the loop into vectorized code if you explicitly make an operation happen in contingent elements:
Simple and fast matrix-vector multiplication in C / C++
There is also the option of using libraries specifically made for matrix multipication. For larger matrices, it may be more efficient to use special implementations based on the Fast Fourier Transform, alternate algorithms like Strassen's Algorithm, etc. In fact, your best bet would be to use a C library like this, and then wrap it in an interface that looks similar to a C++ vector.
I would like to optimize this simple loop:
unsigned int i;
while(j-- != 0){ //j is an unsigned int with a start value of about N = 36.000.000
float sub = 0;
i=1;
unsigned int c = j+s[1];
while(c < N) {
sub += d[i][j]*x[c];//d[][] and x[] are arrays of float
i++;
c = j+s[i];// s[] is an array of unsigned int with 6 entries.
}
x[j] -= sub; // only one memory-write per j
}
The loop has an execution time of about one second with a 4000 MHz AMD Bulldozer. I thought about SIMD and OpenMP (which I normally use to get more speed), but this loop is recursive.
Any suggestions?
think you may want to transpose the matrix d -- means store it in such a way that you can exchange the indices -- make i the outer index:
sub += d[j][i]*x[c];
instead of
sub += d[i][j]*x[c];
This should result in better cache performance.
I agree with transposing for better caching (but see my comments on that at the end), and there's more to do, so let's see what we can do with the full function...
Original function, for reference (with some tidying for my sanity):
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *x, float *b){
//We want to solve L D Lt x = b where D is a diagonal matrix described by Diagonals[0] and L is a unit lower triagular matrix described by the rest of the diagonals.
//Let D Lt x = y. Then, first solve L y = b.
float *y = new float[n];
float **d = IncompleteCholeskyFactorization->Diagonals;
unsigned int *s = IncompleteCholeskyFactorization->StartRows;
unsigned int M = IncompleteCholeskyFactorization->m;
unsigned int N = IncompleteCholeskyFactorization->n;
unsigned int i, j;
for(j = 0; j != N; j++){
float sub = 0;
for(i = 1; i != M; i++){
int c = (int)j - (int)s[i];
if(c < 0) break;
if(c==j) {
sub += d[i][c]*b[c];
} else {
sub += d[i][c]*y[c];
}
}
y[j] = b[j] - sub;
}
//Now, solve x from D Lt x = y -> Lt x = D^-1 y
// Took this one out of the while, so it can be parallelized now, which speeds up, because division is expensive
#pragma omp parallel for
for(j = 0; j < N; j++){
x[j] = y[j]/d[0][j];
}
while(j-- != 0){
float sub = 0;
for(i = 1; i != M; i++){
if(j + s[i] >= N) break;
sub += d[i][j]*x[j + s[i]];
}
x[j] -= sub;
}
delete[] y;
}
Because of the comment about parallel divide giving a speed boost (despite being only O(N)), I'm assuming the function itself gets called a lot. So why allocate memory? Just mark x as __restrict__ and change y to x everywhere (__restrict__ is a GCC extension, taken from C99. You might want to use a define for it. Maybe the library already has one).
Similarly, though I guess you can't change the signature, you can make the function take only a single parameter and modify it. b is never used when x or y have been set. That would also mean you can get rid of the branch in the first loop which runs ~N*M times. Use memcpy at the start if you must have 2 parameters.
And why is d an array of pointers? Must it be? This seems too deep in the original code, so I won't touch it, but if there's any possibility of flattening the stored array, it will be a speed boost even if you can't transpose it (multiply, add, dereference is faster than dereference, add, dereference).
So, new code:
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *__restrict__ x){
// comments removed so that suggestions are more visible. Don't remove them in the real code!
// these definitions got long. Feel free to remove const; it does nothing for the optimiser
const float *const __restrict__ *const __restrict__ d = IncompleteCholeskyFactorization->Diagonals;
const unsigned int *const __restrict__ s = IncompleteCholeskyFactorization->StartRows;
const unsigned int M = IncompleteCholeskyFactorization->m;
const unsigned int N = IncompleteCholeskyFactorization->n;
unsigned int i;
unsigned int j;
for(j = 0; j < N; j++){ // don't use != as an optimisation; compilers can do more with <
float sub = 0;
for(i = 1; i < M && j >= s[i]; i++){
const unsigned int c = j - s[i];
sub += d[i][c]*x[c];
}
x[j] -= sub;
}
// Consider using processor-specific optimisations for this
#pragma omp parallel for
for(j = 0; j < N; j++){
x[j] /= d[0][j];
}
for( j = N; (j --) > 0; ){ // changed for clarity
float sub = 0;
for(i = 1; i < M && j + s[i] < N; i++){
sub += d[i][j]*x[j + s[i]];
}
x[j] -= sub;
}
}
Well it's looking tidier, and the lack of memory allocation and reduced branching, if nothing else, is a boost. If you can change s to include an extra UINT_MAX value at the end, you can remove more branches (both the i<M checks, which again run ~N*M times).
Now we can't make any more loops parallel, and we can't combine loops. The boost now will be, as suggested in the other answer, to rearrange d. Except… the work required to rearrange d has exactly the same cache issues as the work to do the loop. And it would need memory allocated. Not good. The only options to optimise further are: change the structure of IncompleteCholeskyFactorization->Diagonals itself, which will probably mean a lot of changes, or find a different algorithm which works better with data in this order.
If you want to go further, your optimisations will need to impact quite a lot of the code (not a bad thing; unless there's a good reason for Diagonals being an array of pointers, it seems like it could do with a refactor).
I want to give an answer to my own question: The bad performance was caused by cache conflict misses due to the fact that (at least) Win7 aligns big memory blocks to the same boundary. In my case, for all buffers, the adresses had the same alignment (bufferadress % 4096 was same for all buffers), so they fall into the same cacheset of L1 cache. I changed memory allocation to align the buffers to different boundaries to avoid cache conflict misses and got a speedup of factor 2. Thanks for all the answers, especially the answers from Dave!