How to optimize the following common loop? - c++

I have this code:
#include <iostream>
#include <vector>
#include <ctime>
using namespace std;

void foo(int n, double* a, double* b, double* c, double* d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}

int main()
{
    int m = 1001001;
    vector<double> a(m), b(m), c(m), d(m), f(m);
    clock_t start = std::clock();
    for (int i = 0; i < 1000; ++i)
        foo(1000000, &a[0], &b[0], &c[0], &d[0], &d[1], &f[0], &f[1000]);
    double duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout << "Finished in " << duration << " seconds [CPU Clock] " << endl;
}
Can you give me a workable example with better performance? Any compiler is fine, such as the Intel C++ compiler or the Visual C++ compiler. Please also suggest a CPU that performs well on this kind of job.

The code in question is useless as a benchmark: it does lots of calculations on all-zero data (the vectors are value-initialised) and then ignores the results. Compilers are getting more and more clever at figuring out that kind of thing and removing all the code for it. So don't be surprised if code like this doesn't take any time at all.
In C, you would declare the pointers as const double* restrict, except a, which would be double* restrict, telling the compiler that the pointers don't alias and that all arrays except the first point to data that isn't going to be modified during the loop; this allows the compiler to vectorise. Unfortunately this isn't a standard C++ feature, as far as I know, though most compilers offer an extension.
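For illustration, here is a sketch using the __restrict extension that GCC, Clang and MSVC all accept in C++ (the exact spelling varies; __restrict__ also works on GCC/Clang). Beware: the question's call site actually aliases d/e and f/g, so making this promise there would be a lie and the optimiser could miscompile it:
void foo(int n, double* __restrict a,
         const double* __restrict b, const double* __restrict c,
         const double* __restrict d, const double* __restrict e,
         const double* __restrict f, const double* __restrict g)
{
    // the compiler may now assume no store to a[i] changes b..g
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}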
If this was your real problem, you would just swap the inner and outer loop, and remove loop invariants like this:
void foo(int iter, int n, double* a, double* b, double* c, double* d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; ++i) {
        double xa = a[i];
        double xb = b[i];
        double xr = c[i] * (d[i] + e[i] + f[i] + g[i]);
        for (int j = 0; j < iter; ++j)
            xa = xb * xa + xr;
        a[i] = xa;
    }
}
You'd probably process four elements in parallel to hide the multiply-add latency.
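A minimal sketch of that four-at-a-time idea, assuming n is a multiple of 4 (names are mine, not from the answer). The four recurrences are independent, so the CPU can overlap their multiply-adds instead of stalling on each result:
void foo4(int iter, int n, double* a, double* b, double* c,
          double* d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; i += 4) {
        double xa0 = a[i],   xa1 = a[i+1], xa2 = a[i+2], xa3 = a[i+3];
        double xb0 = b[i],   xb1 = b[i+1], xb2 = b[i+2], xb3 = b[i+3];
        double xr0 = c[i]   * (d[i]   + e[i]   + f[i]   + g[i]);
        double xr1 = c[i+1] * (d[i+1] + e[i+1] + f[i+1] + g[i+1]);
        double xr2 = c[i+2] * (d[i+2] + e[i+2] + f[i+2] + g[i+2]);
        double xr3 = c[i+3] * (d[i+3] + e[i+3] + f[i+3] + g[i+3]);
        for (int j = 0; j < iter; ++j) {
            xa0 = xb0 * xa0 + xr0;  // the four chains don't depend on
            xa1 = xb1 * xa1 + xr1;  // each other, so they can issue
            xa2 = xb2 * xa2 + xr2;  // back to back
            xa3 = xb3 * xa3 + xr3;
        }
        a[i] = xa0; a[i+1] = xa1; a[i+2] = xa2; a[i+3] = xa3;
    }
}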
But in a real-life situation you would observe that each call reads about 40 MB (five distinct arrays of a million doubles each), which is way beyond any cache, so you are limited by RAM speed. The usual solution is to split the work into smaller parts, for example 500 elements at a time, so that everything fits into L1 cache, and then perform the operation on the same data 1000 times.
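A rough sketch of that splitting idea (names and the BLOCK constant are mine; like the loop swap above, it assumes the arrays don't alias): run all the repetitions on one L1-sized chunk before moving to the next, so each chunk crosses the memory bus only once.
#include <algorithm>

void foo_chunked(int iter, int n, double* a, double* b, double* c,
                 double* d, double* e, double* f, double* g)
{
    const int BLOCK = 500;  // ~500 doubles per array stays within L1
    for (int base = 0; base < n; base += BLOCK) {
        const int end = std::min(base + BLOCK, n);
        for (int j = 0; j < iter; ++j)       // repeat while the chunk is hot
            for (int i = base; i < end; ++i)
                a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
    }
}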

On Apple Clang, I tried:
using __restrict__ on the arguments to convince the compiler that there was no aliasing.
Result: no change.
distributing the computation over 8 threads in foo().
Result: computation time increased from ~3 seconds to ~18 seconds!
using #pragma omp parallel for.
Result: the compiler ignored me and stayed with the original solution. ~3 seconds.
setting the command line option -march=native to let the CPU's full awesomeness shine.
Result: different assembler output (vectorisation applied), but the run time was still unchanged at ~3 s.
Initial conclusion:
This problem is bound by memory access, not by the CPU.

You could experiment with prefetching the vectors into cache lines and then operating on them in lumps of 8 (8 doubles will fit into every cache line).
Make sure that while you are operating on x[i] to x[i+7] you are prefetching x[i+8] to x[i+15].
This might not help as you are using additions and multiplications which are so fast that your RAM may not be able to keep up anyway.
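A sketch of what that could look like with GCC/Clang's __builtin_prefetch (names are mine; the remainder loop is omitted for brevity, and as noted above, this may not help at all on a memory-bound loop):
void foo_prefetch(int n, double* a, double* b, double* c,
                  double* d, double* e, double* f, double* g)
{
    for (int i = 0; i + 16 <= n; i += 8) {
        // fetch the next 64-byte line of each array while working on this one
        __builtin_prefetch(&a[i + 8]);
        __builtin_prefetch(&b[i + 8]);
        __builtin_prefetch(&c[i + 8]);
        __builtin_prefetch(&d[i + 8]);
        __builtin_prefetch(&e[i + 8]);
        __builtin_prefetch(&f[i + 8]);
        __builtin_prefetch(&g[i + 8]);
        for (int k = i; k < i + 8; ++k)
            a[k] = b[k] * a[k] + c[k] * (d[k] + e[k] + f[k] + g[k]);
    }
}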

I think you should use multithreading. Change foo to take fromIndex and toIndex instead of n, and distribute the vectors over threads.
void foo(int fromIndex, int toIndex, double* a, double* b, double* c, double* d, double* e, double* f, double* g)
{
    for (int i = fromIndex; i < toIndex; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}
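A sketch of the distribution itself, using std::thread (the chunking and thread count are my own choices; note that the experiments above found threading made this memory-bound loop slower, so measure before committing):
#include <thread>
#include <vector>

void foo_parallel(int n, double* a, double* b, double* c,
                  double* d, double* e, double* f, double* g,
                  int numThreads = 4)
{
    std::vector<std::thread> workers;
    const int chunk = n / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        int from = t * chunk;
        int to   = (t == numThreads - 1) ? n : from + chunk;  // last thread takes the remainder
        workers.emplace_back(foo, from, to, a, b, c, d, e, f, g);
    }
    for (auto& w : workers)
        w.join();
}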

Related

How to use multi-threading in C++ binomial pricing?

I'm new to multithreading in C++ and I am not sure how to apply it. Can anyone help?
I'm trying to make the BinomialTree function multithreaded. This is what I have tried so far:
thread th1(BinomialTree,S0, r, q, sigma, T, N);
th1.join();
But it doesn't work
int main() {
    double K = 100;
    double S0 = 100;
    double r = 0.03;
    double q = 0;
    double sigma = 0.3;
    double T = 1;
    const int N = 1000;
    shared_ptr<Payoff> callPayOff = make_shared<PayoffCall>(K, r, T);
    EuropeanOption europeanCall(T, callPayOff);
    BinomialTree tree(S0, r, q, sigma, T, N);
    double callPrice1 = tree.Price(europeanCall);
}
double BinomialTree::Price(const Option& theOption)
{
    if (!treeInitialized_) initializeTree();
    for (long j = 0; j <= N; ++j)
    {
        // threads[j] = thread([&counter](){
        tree_[N][j].second = theOption.ExpirationPayoff(tree_[N][j].first);
    }
    double disc = exp(-r * dt);
    for (long ir = N - 1; ir >= 0; --ir)
    {
        #pragma omp parallel for
        for (long j = 0; j <= ir; ++j)
        {
            double discountedExpectation = disc * 0.5 * (tree_[ir + 1][j].second + tree_[ir + 1][j + 1].second);
            // find the payoff at the node:
            tree_[ir][j].second = theOption.IntermediatePayoff(tree_[ir][j].first, discountedExpectation);
        }
        #pragma omp barrier
    }
    return tree_[0][0].second;
}
From
BinomialTree tree(S0, r, q, sigma, T, N);
double callPrice1 = tree.Price(europeanCall);
it looks like BinomialTree tree(...) is the definition of an object, and tree.Price is the actual function call. Your thread probably should be running the &BinomialTree::Price function.
That said,
thread th1(&BinomialTree::Price, &tree, std::cref(europeanCall));
th1.join();
will just cause the main thread to wait while th1 is running. For multi-threading, you want to do something useful with multiple threads at the same time. And in this simple example, there's of course nothing obvious that you can do. But in your real program, you might want to have a look.
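For instance, if the real program had several independent pricings to do, something like this sketch would overlap them. Here europeanPut is a hypothetical second option, and each thread gets its own tree because Price mutates the tree's internal state:
#include <future>

BinomialTree treeA(S0, r, q, sigma, T, N);
BinomialTree treeB(S0, r, q, sigma, T, N);
// launch both pricings concurrently; get() blocks until each is done
auto callFuture = std::async(std::launch::async,
                             [&] { return treeA.Price(europeanCall); });
auto putFuture  = std::async(std::launch::async,
                             [&] { return treeB.Price(europeanPut); });  // hypothetical option
double callPrice = callFuture.get();
double putPrice  = putFuture.get();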
Note: since you've left out the code of &BinomialTree::Price, we can't tell if you could use multiple threads inside that function.
If you have a divide-and-conquer algorithm like quicksort, you can break the data set into smaller data sets; each of those becomes an independent problem that can be processed in parallel and combined when done. This works well for trees that do not cross-link, but the binomial tree mentioned here does have cross-links (adjacent nodes share a child), so it may be more difficult. There is no silver bullet, though: you can only parallelize if the algorithm allows you to.

Optimize eigen recomposition (Matrix - Diagonal Matrix - Matrix) product C++ with BLAS and OpenMP

I wrote a C++ code to solve a linear system A.x = b, where A is a symmetric matrix, by first diagonalizing the matrix A = V.D.V^T with LAPACK(E) (because I need the eigenvalues later) and then solving x = A^-1.b = V.D^-1.V^T.b, where of course V is orthogonal.
Now I would like to optimize this last operation as much as possible, e.g. by using (C)BLAS routines and OpenMP.
Here is my naive implementation:
// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            for (int k = 0; k < N; k++)
            {
                X[i] += B[j] * V[i + k * N] * V[j + k * N] / D[k];
            }
        }
    }
}
All arrays are C-style arrays, where V is of size N^2, D is of size N, B is of size N and X is of size N (and initialized with zeros).
For now, this naive implementation is very slow and is the bottleneck of the code. Any hints and help would be very appreciated !
Thanks
EDIT
Thanks to Jérôme Richard's answer and comment, I further optimized his solution by calling BLAS and parallelizing the middle loop with OpenMP. On a 1000x1000 matrix, this solution is ~4 times faster than his proposition, which itself was ~1000 times faster than my naive implementation.
I found the #pragma omp parallel for simd clause to be faster than the other alternatives on two different machines with 4 and 20 cores respectively, for N=1000 and N=2000.
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    double* sum = new double[N]{0.};
    cblas_dgemv(CblasColMajor, CblasTrans, N, N, 1., V, N, B, 1, 0., sum, 1);
    #pragma omp parallel for simd
    for (int i = 0; i < N; ++i)
    {
        sum[i] /= D[i];
    }
    cblas_dgemv(CblasColMajor, CblasNoTrans, N, N, 1., V, N, sum, 1, 0., X, 1);
    delete[] sum;
}
This code is currently highly memory-bound, so the resulting program will probably scale poorly (as long as compiler optimizations are enabled). Indeed, on most common systems (e.g. a 1-socket non-NUMA processor) the RAM throughput is a resource shared between cores, and a scarce one. Moreover, the memory access pattern is inefficient, and the algorithmic complexity of the code can be improved.
To speed up the computation, the j and k loops can be swapped so that V is read contiguously. Moreover, for a fixed k, the factor V[i+k*N] and the division by D[k] are invariant in the inner loop. Since B[j] and V[j+k*N] do not depend on i either, the computation can be factorized: thanks to these precomputed sums, the resulting algorithm runs in O(n^2) rather than O(n^3)!
Finally, omp simd can be used to help compilers vectorize the code, making it even faster!
Note that _OPENMP seems useless here, since the #pragma should be ignored by compilers when OpenMP is disabled or not supported.
// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    std::vector<double> kSum(N);

    #pragma omp parallel for
    for (int k = 0; k < N; k++)
    {
        double sum = 0.0;  // must not be const: the reduction accumulates into it
        #pragma omp simd reduction(+:sum)
        for (int j = 0; j < N; j++)
        {
            sum += B[j] * V[j + k * N];
        }
        kSum[k] = sum / D[k];
    }

    // Loop tiling can be used to speed up this section even more.
    // The idea is to swap i-based and j-based loops and work on thread-private copies
    // of X and finally sum the thread-private versions into a global X.
    // The resulting code should work on contiguous data and can even be vectorized.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        double sum = X[i];
        for (int k = 0; k < N; k++)
        {
            sum += kSum[k] * V[i + k * N];
        }
        X[i] = sum;
    }
}
The new code should be several orders of magnitude faster than the original one (but still memory-bound). Note that results might differ slightly (as floating-point operations are not really associative), but I expect them to be more accurate.

I want to optimize this short loop

I would like to optimize this simple loop:
unsigned int i;
while (j-- != 0) {  // j is an unsigned int with a start value of about N = 36,000,000
    float sub = 0;
    i = 1;
    unsigned int c = j + s[1];
    while (c < N) {
        sub += d[i][j] * x[c];  // d[][] and x[] are arrays of float
        i++;
        c = j + s[i];           // s[] is an array of unsigned int with 6 entries
    }
    x[j] -= sub;  // only one memory write per j
}
The loop has an execution time of about one second with a 4000 MHz AMD Bulldozer. I thought about SIMD and OpenMP (which I normally use to get more speed), but this loop is recursive.
Any suggestions?
I think you may want to transpose the matrix d -- meaning, store it in such a way that you can exchange the indices -- making i the outer index:
sub += d[j][i]*x[c];
instead of
sub += d[i][j]*x[c];
This should result in better cache performance.
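A sketch of what that could look like (dT and the helper are my invention; note the answer below observes that this copy has the same cache behaviour as the loop itself, so it only pays off if d is reused across many solves):
float** transpose_diagonals(float** d, unsigned M, unsigned N)
{
    float*  block = new float[(size_t)N * M];  // one contiguous allocation
    float** dT    = new float*[N];
    for (unsigned j = 0; j < N; ++j)
        dT[j] = block + (size_t)j * M;
    for (unsigned i = 0; i < M; ++i)           // pay the strided reads once
        for (unsigned j = 0; j < N; ++j)
            dT[j][i] = d[i][j];
    return dT;
}
// the inner loop then reads contiguously: sub += dT[j][i] * x[c];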
I agree with transposing for better caching (but see my comments on that at the end), and there's more to do, so let's see what we can do with the full function...
Original function, for reference (with some tidying for my sanity):
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float* x, float* b) {
    // We want to solve L D Lt x = b where D is a diagonal matrix described by
    // Diagonals[0] and L is a unit lower triangular matrix described by the
    // rest of the diagonals.
    // Let D Lt x = y. Then, first solve L y = b.
    float* y = new float[n];
    float** d = IncompleteCholeskyFactorization->Diagonals;
    unsigned int* s = IncompleteCholeskyFactorization->StartRows;
    unsigned int M = IncompleteCholeskyFactorization->m;
    unsigned int N = IncompleteCholeskyFactorization->n;
    unsigned int i, j;
    for (j = 0; j != N; j++) {
        float sub = 0;
        for (i = 1; i != M; i++) {
            int c = (int)j - (int)s[i];
            if (c < 0) break;
            if (c == j) {
                sub += d[i][c] * b[c];
            } else {
                sub += d[i][c] * y[c];
            }
        }
        y[j] = b[j] - sub;
    }
    // Now, solve x from D Lt x = y -> Lt x = D^-1 y
    // Took this one out of the while, so it can be parallelized now,
    // which speeds things up, because division is expensive
    #pragma omp parallel for
    for (j = 0; j < N; j++) {
        x[j] = y[j] / d[0][j];
    }
    while (j-- != 0) {
        float sub = 0;
        for (i = 1; i != M; i++) {
            if (j + s[i] >= N) break;
            sub += d[i][j] * x[j + s[i]];
        }
        x[j] -= sub;
    }
    delete[] y;
}
Because of the comment about parallel divide giving a speed boost (despite being only O(N)), I'm assuming the function itself gets called a lot. So why allocate memory? Just mark x as __restrict__ and change y to x everywhere (__restrict__ is a GCC extension, taken from C99. You might want to use a define for it. Maybe the library already has one).
Similarly, though I guess you can't change the signature, you can make the function take only a single parameter and modify it. b is never used when x or y have been set. That would also mean you can get rid of the branch in the first loop which runs ~N*M times. Use memcpy at the start if you must have 2 parameters.
And why is d an array of pointers? Must it be? This seems too deep in the original code, so I won't touch it, but if there's any possibility of flattening the stored array, it will be a speed boost even if you can't transpose it (multiply, add, dereference is faster than dereference, add, dereference).
So, new code:
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float* __restrict__ x) {
    // comments removed so that suggestions are more visible. Don't remove them in the real code!
    // these definitions got long. Feel free to remove const; it does nothing for the optimiser
    const float* const __restrict__* const __restrict__ d = IncompleteCholeskyFactorization->Diagonals;
    const unsigned int* const __restrict__ s = IncompleteCholeskyFactorization->StartRows;
    const unsigned int M = IncompleteCholeskyFactorization->m;
    const unsigned int N = IncompleteCholeskyFactorization->n;
    unsigned int i;
    unsigned int j;
    for (j = 0; j < N; j++) {  // don't use != as an optimisation; compilers can do more with <
        float sub = 0;
        for (i = 1; i < M && j >= s[i]; i++) {
            const unsigned int c = j - s[i];
            sub += d[i][c] * x[c];
        }
        x[j] -= sub;
    }
    // Consider using processor-specific optimisations for this
    #pragma omp parallel for
    for (j = 0; j < N; j++) {
        x[j] /= d[0][j];
    }
    for (j = N; (j--) > 0; ) {  // changed for clarity
        float sub = 0;
        for (i = 1; i < M && j + s[i] < N; i++) {
            sub += d[i][j] * x[j + s[i]];
        }
        x[j] -= sub;
    }
}
Well it's looking tidier, and the lack of memory allocation and reduced branching, if nothing else, is a boost. If you can change s to include an extra UINT_MAX value at the end, you can remove more branches (both the i<M checks, which again run ~N*M times).
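A sketch of that sentinel trick, with one wrinkle: in the last loop, j + UINT_MAX wraps around under unsigned arithmetic, so a sentinel value of N is safer there (and it works for the first loop too, since j < N always holds):
// setup, done once where s is built; s needs room for M + 1 entries
s[M] = N;
// first loop: the i < M test disappears
for (i = 1; j >= s[i]; i++) {        // stops at i == M because j < N == s[M]
    const unsigned int c = j - s[i];
    sub += d[i][c] * x[c];
}
// last loop: j + s[M] == j + N >= N always, so it stops there too
for (i = 1; j + s[i] < N; i++) {
    sub += d[i][j] * x[j + s[i]];
}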
Now we can't make any more loops parallel, and we can't combine loops. The boost now will be, as suggested in the other answer, to rearrange d. Except… the work required to rearrange d has exactly the same cache issues as the work to do the loop. And it would need memory allocated. Not good. The only options to optimise further are: change the structure of IncompleteCholeskyFactorization->Diagonals itself, which will probably mean a lot of changes, or find a different algorithm which works better with data in this order.
If you want to go further, your optimisations will need to impact quite a lot of the code (not a bad thing; unless there's a good reason for Diagonals being an array of pointers, it seems like it could do with a refactor).
I want to give an answer to my own question: the bad performance was caused by cache conflict misses, due to the fact that (at least) Win7 aligns big memory blocks to the same boundary. In my case, all buffers had the same alignment (buffer address % 4096 was the same for all buffers), so they fell into the same cache set of the L1 cache. I changed the memory allocation to align the buffers to different boundaries to avoid cache conflict misses and got a speedup of a factor of 2. Thanks for all the answers, especially the answers from Dave!
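A sketch of the allocation trick described above (the scheme and names are mine): over-allocate each buffer and offset it by a different number of cache lines, so the buffers no longer map to the same L1 sets.
#include <cstdlib>

// bufferIndex: 0 for the first buffer, 1 for the second, and so on.
// Keep the raw pointer around; it is the one to pass to free().
float* alloc_staggered(size_t count, int bufferIndex, void** rawOut)
{
    const size_t CACHE_LINE = 64;
    char* raw = static_cast<char*>(malloc(count * sizeof(float) + 16 * CACHE_LINE));
    *rawOut = raw;
    // each buffer starts a different number of cache lines into its block
    return reinterpret_cast<float*>(raw + (bufferIndex % 16) * CACHE_LINE);
}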

Simple and fast matrix-vector multiplication in C / C++

I need frequent usage of matrix_vector_mult(), which multiplies a matrix with a vector; its implementation is below.
Question: Is there a simple way to make it significantly, at least twice, faster?
Remarks: 1) The size of the matrix is about 300x50 and it doesn't change during the run. 2) It must work on both Windows and Linux.
double vectors_dot_prod(const double* x, const double* y, int n)
{
    double res = 0.0;
    int i;
    for (i = 0; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}

void matrix_vector_mult(const double** mat, const double* vec, double* result, int rows, int cols)
{   // in matrix form: result = mat * vec;
    int i;
    for (i = 0; i < rows; i++)
    {
        result[i] = vectors_dot_prod(mat[i], vec, cols);
    }
}
This is something that, in theory, a good compiler should do by itself; however, I gave it a try on my system (g++ 4.6.3) and got about twice the speed on a 300x50 matrix by hand-unrolling four multiplications (about 18us per matrix instead of 34us per matrix):
double vectors_dot_prod2(const double* x, const double* y, int n)
{
    double res = 0.0;
    int i = 0;
    for (; i <= n - 4; i += 4)
    {
        res += (x[i] * y[i] +
                x[i + 1] * y[i + 1] +
                x[i + 2] * y[i + 2] +
                x[i + 3] * y[i + 3]);
    }
    for (; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}
I expect however the results of this level of micro-optimization to vary wildly between systems.
As Zhenya says, just use a good BLAS or matrix math library.
If for some reason you can't do that, see if your compiler can unroll and/or vectorize your loops; making sure rows and cols are both known constants at the call site may help, assuming the functions you posted are available for inlining.
If you still can't get the speedup you need, you're looking at manual unrolling and vectorizing using extensions or inline assembler.
If the size is constant and known in advance, pass it in as a preprocessor variable, which will permit the compiler to optimize more fully; one way to do that is sketched below.
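If the functions always run with the same 300x50 shape, one sketch of baking the sizes in (the names are mine; pass -DROWS=300 -DCOLS=50 on the command line if you prefer, and note this also assumes the matrix is stored contiguously rather than as an array of row pointers):
#ifndef ROWS
#define ROWS 300
#endif
#ifndef COLS
#define COLS 50
#endif

// With the trip counts fixed at compile time, the compiler can fully
// unroll and vectorise both loops.
void matrix_vector_mult_fixed(const double mat[ROWS][COLS], const double* vec, double* result)
{
    for (int i = 0; i < ROWS; i++)
    {
        double res = 0.0;
        for (int j = 0; j < COLS; j++)
            res += mat[i][j] * vec[j];
        result[i] = res;
    }
}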

How to speed up matrix multiplication in C++?

I'm performing matrix multiplication with this simple algorithm. To be more flexible, I used objects for the matrices which contain dynamically allocated arrays.
Comparing this solution to my first one with static arrays, it is 4 times slower. What can I do to speed up the data access? I don't want to change the algorithm.
matrix mult_std(matrix a, matrix b) {
    matrix c(a.dim(), false, false);
    for (int i = 0; i < a.dim(); i++)
        for (int j = 0; j < a.dim(); j++) {
            int sum = 0;
            for (int k = 0; k < a.dim(); k++)
                sum += a(i,k) * b(k,j);
            c(i,j) = sum;
        }
    return c;
}
EDIT
I corrected my question above! I added the full source code below and tried some of your advice:
swapped the k and j loop iterations -> performance improvement
declared dim() and operator()() as inline -> performance improvement
passing arguments by const reference -> performance loss! Why? So I don't use it.
The performance is now nearly the same as in the old program. Maybe there should be a bit more improvement.
But I have another problem: I get a memory error in the function mult_strassen(...). Why?
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
OLD PROGRAM
main.c http://pastebin.com/qPgDWGpW
c99 main.c -o matrix -O3
NEW PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr
g++ main.cpp matrix.cpp -o matrix -O3.
EDIT
Here are some results: a comparison between the standard algorithm (std), the swapped order of the j and k loops (swap), and the blocked algorithm with block size 13 (block).
Speaking of speed-up, your function will be more cache-friendly if you swap the order of the k and j loop iterations:
matrix mult_std(matrix a, matrix b) {
    matrix c(a.dim(), false, false);
    for (int i = 0; i < a.dim(); i++)
        for (int k = 0; k < a.dim(); k++)
            for (int j = 0; j < a.dim(); j++)  // swapped order
                c(i,j) += a(i,k) * b(k,j);
    return c;
}
That's because a k index on the inner-most loop will cause a cache miss in b on every iteration. With j as the inner-most index, both c and b are accessed contiguously, while a stays put.
Make sure that the members dim() and operator()() are declared inline, and that compiler optimization is turned on. Then play with options like -funroll-loops (on gcc).
How big is a.dim() anyway? If a row of the matrix doesn't fit in just a couple of cache lines, you'd be better off with a blocked access pattern instead of a full row at a time, as sketched below.
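A sketch of the blocked pattern (the 13 matches the block size mentioned in the question's edit; it assumes the matrix constructor zero-initialises c, as the swapped version above already assumes):
#include <algorithm>

matrix mult_block(matrix a, matrix b) {
    const int B = 13;  // block size: a tuning knob
    matrix c(a.dim(), false, false);
    const int n = a.dim();
    // iterate over B x B tiles so each tile of a, b and c stays cache-hot
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < std::min(ii + B, n); i++)
                    for (int k = kk; k < std::min(kk + B, n); k++)
                        for (int j = jj; j < std::min(jj + B, n); j++)
                            c(i,j) += a(i,k) * b(k,j);
    return c;
}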
You say you don't want to modify the algorithm, but what does that mean exactly?
Does unrolling the loop count as "modifying the algorithm"? What about using SSE/VMX whichever SIMD instructions are available on your CPU? What about employing some form of blocking to improve cache locality?
If you don't want to restructure your code at all, I doubt there's more you can do than the changes you've already made. Everything else becomes a trade-off of minor changes to the algorithm to achieve a performance boost.
Of course, you should still take a look at the asm generated by the compiler. That'll tell you much more about what can be done to speed up the code.
Use SIMD if you can. If you do extensive vector math, you absolutely have to use something like VMX registers, assuming you are on a platform capable of it; otherwise you will incur a huge performance hit.
Don't pass complex types like matrix by value - use a const reference.
Don't call a function in each iteration - cache dim() outside your loops.
Although compilers typically optimize this efficiently, it's often a good idea to have the caller provide a matrix reference for your function to fill out rather than returning a matrix by value. In some cases, returning by value may result in an expensive copy operation.
Here is my implementation of the fast simple multiplication algorithm for square float matrices (2D arrays). It should be a little faster than chrisaycock's code since it spares some increments.
#include <cstring>  // for memset

static void fastMatrixMultiply(const int dim, float* dest, const float* srcA, const float* srcB)
{
    memset(dest, 0x0, dim * dim * sizeof(float));
    for (int i = 0; i < dim; i++) {
        for (int k = 0; k < dim; k++)
        {
            const float* a = srcA + i * dim + k;
            const float* b = srcB + k * dim;
            float* c = dest + i * dim;
            float* cMax = c + dim;
            while (c < cMax)
            {
                *c++ += (*a) * (*b++);
            }
        }
    }
}
Pass the parameters by const reference to start with:
matrix mult_std(matrix const& a, matrix const& b) {
To give you more details we need to know the details of the other methods used.
And to answer why the original method is 4 times faster we would need to see the original method.
The problem is undoubtedly yours as this problem has been solved a million times before.
Also when asking this type of question ALWAYS provide compilable source with appropriate inputs so we can actually build and run the code and see what is happening.
Without the code we are just guessing.
Edit
After fixing the main bug in the original C code (a buffer overrun),
I have updated the code to run the tests side by side in a fair comparison:
// INCLUDES -------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

// DEFINES -------------------------------------------------------------------
// The original problem was here. MAXDIM was 500, but we were using arrays
// that had a size of 512 in each dimension. This caused a buffer overrun that
// overwrote the dim variable and reset it to 0, which made the multiplication
// loop fall out before it had finished (as the loop was controlled by this
// global variable).
//
// Everything now uses the MAXDIM variable directly.
// This of course gives the C code an advantage, as the compiler can optimize
// the loop explicitly for the fixed-size arrays and thus unroll loops more
// efficiently.
#define MAXDIM 512
#define RUNS 10

// MATRIX FUNCTIONS ----------------------------------------------------------
class matrix
{
    public:
    matrix(int dim)
        : dim_(dim)
    {
        data_ = new int[dim_ * dim_];
    }
    inline int dim() const {
        return dim_;
    }
    inline int& operator()(unsigned row, unsigned col) {
        return data_[dim_ * row + col];
    }
    inline int operator()(unsigned row, unsigned col) const {
        return data_[dim_ * row + col];
    }
    private:
    int dim_;
    int* data_;
};

// ---------------------------------------------------
void random_matrix(int (&matrix)[MAXDIM][MAXDIM]) {
    for (int r = 0; r < MAXDIM; r++)
        for (int c = 0; c < MAXDIM; c++)
            matrix[r][c] = rand() % 100;
}
void random_matrix_class(matrix& matrix) {
    for (int r = 0; r < matrix.dim(); r++)
        for (int c = 0; c < matrix.dim(); c++)
            matrix(r, c) = rand() % 100;
}

template<typename T, typename M>
float run(T f, M const& a, M const& b, M& c)
{
    float time = 0;
    for (int i = 0; i < RUNS; i++) {
        struct timeval start, end;
        gettimeofday(&start, NULL);
        f(a, b, c);
        gettimeofday(&end, NULL);
        long s = start.tv_sec * 1000 + start.tv_usec / 1000;
        long e = end.tv_sec * 1000 + end.tv_usec / 1000;
        time += e - s;
    }
    return time / RUNS;
}

// SEQ MULTIPLICATION ----------------------------------------------------------
void mult_seq(int const (&a)[MAXDIM][MAXDIM], int const (&b)[MAXDIM][MAXDIM], int (&z)[MAXDIM][MAXDIM]) {
    for (int r = 0; r < MAXDIM; r++) {
        for (int c = 0; c < MAXDIM; c++) {
            z[r][c] = 0;
            for (int i = 0; i < MAXDIM; i++)
                z[r][c] += a[r][i] * b[i][c];
        }
    }
}
void mult_std(matrix const& a, matrix const& b, matrix& z) {
    for (int r = 0; r < a.dim(); r++) {
        for (int c = 0; c < a.dim(); c++) {
            z(r, c) = 0;
            for (int i = 0; i < a.dim(); i++)
                z(r, c) += a(r, i) * b(i, c);
        }
    }
}

// MAIN ------------------------------------------------------------------------
using namespace std;
int main(int argc, char* argv[]) {
    srand(time(NULL));

    int matrix_a[MAXDIM][MAXDIM];
    int matrix_b[MAXDIM][MAXDIM];
    int matrix_c[MAXDIM][MAXDIM];
    random_matrix(matrix_a);
    random_matrix(matrix_b);
    printf("%d ", MAXDIM);
    printf("%f \n", run(mult_seq, matrix_a, matrix_b, matrix_c));

    matrix a(MAXDIM);
    matrix b(MAXDIM);
    matrix c(MAXDIM);
    random_matrix_class(a);
    random_matrix_class(b);
    printf("%d ", MAXDIM);
    printf("%f \n", run(mult_std, a, b, c));

    return 0;
}
The results now:
$ g++ t1.cpp
$ ./a.exe
512 1270.900000
512 3308.800000
$ g++ -O3 t1.cpp
$ ./a.exe
512 284.900000
512 622.000000
From this we see the C code is about twice as fast as the C++ code when fully optimized. I cannot see the reason for this in the code.
I'm taking a wild guess here, but if dynamically allocating the matrices makes such a huge difference, maybe the problem is memory fragmentation. Again, I have no idea how the underlying matrix is implemented.
Why don't you allocate the memory for the matrices by hand, ensuring it's contiguous, and build the pointer structure yourself?
Also, does the dim() method have any extra complexity? I would declare it inline, too.
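One sketch of that contiguous layout (names are mine): allocate the data as a single block and build the row-pointer array over it, so existing mat[i][j] code keeps working while the storage stays contiguous.
double** alloc_matrix(int rows, int cols)
{
    double*  block = new double[rows * cols];  // one contiguous allocation
    double** mat   = new double*[rows];
    for (int i = 0; i < rows; i++)
        mat[i] = block + i * cols;             // each row points into the block
    return mat;
}
// free with: delete[] mat[0]; delete[] mat;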